Course Syllabus
MSDS 422-57/59: Practical Machine Learning (Fall 2020)
Instructor: Lawrence Fulton, Ph.D.
E-mail: lawrence.fulton@northwestern.edu
Teaching Assistant: John Stanham
E-Mail: jstanham@northwestern.edu
Response Times: Responses to question-and-answer discussion forums and e-mail questions are provided within 48 hours. Telephone communication: By appointment.
Course Description:
The course introduces machine learning with business applications. It provides a survey of machine learning techniques, including traditional statistical methods, resampling techniques, model selection and regularization, tree-based methods, principal components analysis, cluster analysis, artificial neural networks, and deep learning. Students implement machine learning models with open-source software for data science. They explore data and learn from data, finding underlying patterns useful for data reduction, feature analysis, prediction, and classification. Prerequisites: MSDS 400-DL Math for Data Scientists, MSDS 401-DL Applied Statistics with R, and MSDS 402-DL Introduction to Data Science.
Learning Outcomes
Practical Machine Learning is a survey course with a long list of learning outcomes:
- Explain the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
- Distinguish between supervised and unsupervised learning methods.
- Distinguish between regression and classification problems.
- Explain bootstrap and cross-validation procedures.
- Explore and visualize data and perform basic statistical analysis.
- List alternative methods for evaluating classifiers.
- List alternative methods for evaluating regression.
- Demonstrate the application of traditional statistical methods for classification and regression.
- Demonstrate the application of trees and random forests for classification and regression.
- Demonstrate principal components for dimension reduction.
- Demonstrate principal components regression.
- Describe hierarchical and non-hierarchical clustering techniques.
- Describe how semi-supervised learning may be utilized in addressing classification and regression problems.
- Explain how measurement and feature engineering are relevant to modeling.
- Describe how artificial neural networks are constructed from logical connections of artificial neurons and activation functions.
- Demonstrate the use of artificial neural networks (including deep neural networks) in classification and regression.
- Describe how convolutional neural networks are constructed.
- Describe how recurrent neural networks are constructed.
- Distinguish between autoencoders and other forms of unsupervised learning.
- Describe applications of autoencoders.
- Explain how the results of machine learning can be useful to business managers.
- Transform data and research results into actionable insights.
Required Readings
Géron, A. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition. Sebastopol, Calif.: O'Reilly, 2019. [ISBN 9781492032649] Source code available at https://github.com/ageron/handson-ml2.
Hastie et al., Elements of Statistical Learning, Springer, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/
James et al., An Introduction to Statistical Learning with Applications in R, corrected 7th printing. Springer, 2017. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
Goodfellow et al., Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/
Reference Textbooks and Courses
Chollet, F. 2017. Deep Learning with Python. Manning Publications. [ISBN-13:978-1617294433] Code at https://github.com/fchollet/deep-learning-with-python-notebooks
Of special interest are LinkedIn Learning / Lynda courses in Python programming, JavaScript, web development, and Git/GitHub. https://www.northwestern.edu/hr/learning/development/online-learning/lynda.com.html
Assigned readings posted on Canvas, including timely news articles and academic research that you will read in order to complete some assignments and participate in discussion forums.
Study Teams (optional)
Student study teams may be utilized in this course as a means to foster a collaborative learning environment. Blue Jeans is available as a conferencing tool. Preliminary groups have been set up under the People/Study Teams tab of Canvas. Each student is encouraged to join a study team of no more than five students. It may make sense to join a team based on time zone (Eastern, Central, Mountain, or Pacific) and preferred personal computer operating system (Mac OS X or Windows).
Software and Technology Resources
Required software for this course is open source and freely available on the web. Python is the primary programming environment. Personal computer software is available for Mac OS X and Windows systems. Students will also have access to the Data Science Computing Cluster from the School of Professional Studies. Additional information about software solutions will be provided on the Canvas course site under Modules / Technical Resources.
Late Policy
Students who need to submit assignment work late must notify the instructor in writing at least 24 hours before the deadline. One grace day without point reduction is allowed for students who provide this notification. Late work beyond the grace day is subject to point reductions.
Assignments and Exams
This course has nine weekly graded assignments worth 50 points each and a proctored final exam worth 50 points with a one-hour time limit. Weekly discussion participation is worth 10 points per week (100 points in total).
| Graded Course Component | Point Value |
| --- | --- |
| Week 1. Exploring and Visualizing Data | 50 |
| Week 2. Evaluating Regression Models | 50 |
| Week 3. Evaluating Classification Models | 50 |
| Week 4. Random Forests | 50 |
| Week 5. Principal Components Analysis | 50 |
| Week 6. Neural Networks | 50 |
| Week 7. Deep Learning: Image Processing with a CNN | 50 |
| Week 8. Deep Learning: Language Modeling with an RNN | 50 |
| Week 9. Autoencoders | 50 |
| Week 10. Proctored Final Exam | 50 |
| Weeks 1-10. Discussion | 100 (10 points each week) |
| TOTAL | 600 Points |
Grading Scale
A: 90-100%
B: 80-89%
C: 70-79%
F: Below 70%
Proctored Final Exam
The proctored final exam is a multiple-choice exam focused on machine learning concepts. It includes little mathematics and no programming. There are twenty-five multiple-choice items worth two points each. Answers are entered online in Canvas while students are under the supervision of an Examity proctor. The exam has a one-hour time limit and is comprehensive, covering material from Weeks 1 through 9 of the course.
Academic Integrity Policy
Students are required to comply with University regulations regarding academic integrity. If you are in doubt about what constitutes academic dishonesty, speak with your instructor or graduate coordinator before the assignment is due and/or examine the University Web site. Academic dishonesty includes, but is not limited to, cheating on an exam, obtaining an unfair advantage, and engaging in plagiarism (e.g., using material from readings without citing or copying another student's paper). Failure to maintain academic integrity will result in a grade sanction, possibly as severe as failing and being required to retake the course, and could lead to a suspension or expulsion from the program. Further penalties may apply. For more information, visit The Office of the Provost's Academic Integrity page. Some assignments in SPS courses may be required to be submitted through Turnitin, a plagiarism detection and education tool. You can find an explanation of the tool here.
Accessibility Policy and Accommodations
Northwestern University and the School of Professional Studies strive to ensure that all online courses are accessible to all students. In collaboration with AccessibleNU, we work to provide accommodations to students with disabilities and other conditions so they may fully participate and engage in the learning environment. The majority of accommodations and services available to eligible students are coordinated by AccessibleNU. If you have an accommodation request, please contact your professor or AccessibleNU as soon as possible.
Discussion Board Participation
The purpose of the discussion boards is to allow students to freely exchange ideas. It is imperative to remain respectful of all viewpoints and positions and, when necessary, agree to respectfully disagree. While active participation is encouraged, cluttering a discussion board with inappropriate, irrelevant, or insignificant material will not earn points and may result in receiving less than full credit. The frequency and length of postings matter far less than their substance; keep your responses short and to the point.
The content of the postings is paramount. Remember to cite all sources—when relevant—in order to avoid plagiarism. Do not include graphics or images in postings, but feel free to include web addresses/links to public-domain resources.
Graded discussion items are designed so you must post your viewpoint before seeing the responses of others. When responding to others’ postings, avoid simple responses such as “I agree” or “I do not agree.” Instead, move the discussion forward, adding value by contributing new information. Explain, clarify, politely ask for details, provide details of your own, persuade, and enrich communications for a productive discussion experience. You are required to participate in the weekly discussion board forum. The due date and time for posting to each week’s discussion forum is SUNDAY at 11:55 p.m. (Central Time).
Additional Resources
Extensive technology and visualization resources are provided with the Canvas course site. These are found under Modules headings such as Technology Resources, Library / Information Resources, Course Resources, and Recordings and Links.
Syllabus over Canvas Course Site
This syllabus, in its Adobe Acrobat Portable Document Format (PDF) form, is the defining document for Practical Machine Learning. It defines course objectives, requirements, due dates, and grading standards. If there is ever a discrepancy between this syllabus and the Canvas course site, rely on this syllabus as the final word.
Week 1: Introduction to Machine Learning, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Describe the basic concepts of the learning problem and why/how machine learning methods are used to learn from data to find underlying patterns for prediction and decision-making.
- Explain the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
- Distinguish between supervised and unsupervised learning methods.
- Distinguish between regression and classification problems.
- Summarize the basic concepts of assessing model accuracy and the bias-variance trade-off.
- Describe how sampling and selection are relevant to machine learning.
- Explain bootstrap and cross-validation procedures.
- Explore and visualize data and perform basic statistical analysis.*
*acquired in previous course, reviewed here
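The bootstrap and cross-validation procedures listed above can be previewed in a few lines of scikit-learn. This is an illustrative sketch only, using the built-in iris data as a stand-in; the Week 1 assignment specifies its own data and requirements.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)

# Cross-validation: average accuracy over 5 held-out folds
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean().round(3))

# Bootstrap: resample the rows with replacement; roughly 63 percent of
# distinct observations appear in each bootstrap sample
X_boot, y_boot = resample(X, y, random_state=0)
print("bootstrap sample shape:", X_boot.shape)
```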
Textbook Readings
Géron, A. Chapters 1 and 2
James, Chapters 1 and 2
Hastie, Chapters 1 and 2
Kaggle Tutorial: EDA & Machine Learning, https://www.datacamp.com/community/tutorials/kaggle-machine-learning-eda
Introduction to Colab and Python: https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Exploring and Visualizing Data (50 points)
Due Sunday evening of the week assigned by 11:59 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 2: Supervised Learning for Regression, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Express the basic setting of a regression problem.*
- Describe the statistical model of multiple linear regression.*
- Demonstrate multiple linear regression.
- Demonstrate regularized regression methods.
- Evaluate regression methods within a cross-validation framework.
- List alternative methods for evaluating regression models.
- Explain how the results of regression may be useful to business managers (and others).
*acquired in previous course, reviewed here
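As a preview of regularized regression and cross-validated evaluation, the scikit-learn sketch below compares ordinary least squares with ridge (L2) and lasso (L1) penalties. The diabetes data and penalty strengths are illustrative assumptions, not assignment requirements.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Score each model by cross-validated R-squared on held-out folds
for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")
```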
Textbook Readings
Géron, Chapter 4
James, Chapter 3
Hastie, Chapter 3
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Evaluating Regression Models (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 3: Supervised Learning for Classification, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Express the basic setting of a classification problem.
- Describe the statistical model of logistic regression.
- Demonstrate the binary logistic regression algorithm.
- Demonstrate naïve Bayes classification.
- Evaluate classification methods within a cross-validation framework.
- List alternative methods for evaluating classifiers.
- Explain how the results of classification may be useful to business managers (and others).
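The evaluation ideas above, confusion matrices and ROC AUC computed on a held-out split, can be sketched with scikit-learn. The breast cancer data is an illustrative stand-in for whatever data the assignment specifies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)

# ROC AUC is computed from predicted probabilities, not hard labels
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("AUC:", round(auc, 3))
```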
Textbook Readings
Géron, Chapter 3
James, Chapter 4
Hastie, Chapter 4
Online Interactive Training
Naïve Bayes Classification under Modules / Recordings and Links
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Evaluating Classification Models (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 4: Trees and Random Forests, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Describe decision trees.
- Describe how impurity is measured for decision trees.
- List advantages and disadvantages of using decision trees.
- Demonstrate the use of decision trees for classification.
- Demonstrate the use of decision trees for regression.
- Compare prediction performance of trees with other methods of classification and regression.
- Describe how bagging and sampling with and without replacement are employed in random forests.
- Describe how boosting may be used to improve predictive performance.
- Demonstrate the use of random forests for classification.
- Demonstrate the use of random forests for regression.
- Compare prediction performance of random forests with other methods of classification and regression.
- Explain how results from decision trees and random forests may be useful to business managers (and others).
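A minimal random forest sketch in scikit-learn, illustrating bootstrap aggregation over many trees and impurity-based feature importances. The wine data and hyperparameters are illustrative assumptions, not assignment requirements.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# A forest of 200 trees; each tree is grown on a bootstrap sample of rows,
# with a random subset of features considered at every split
rf = RandomForestClassifier(n_estimators=200, random_state=0)
score = cross_val_score(rf, X, y, cv=5).mean()
print("CV accuracy:", round(score, 3))

rf.fit(X, y)
# Impurity-based importances are normalized to sum to 1 across features
print("importances sum:", rf.feature_importances_.sum().round(3))
```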
Textbook Readings
Géron, Chapters 6 and 7
James, Chapter 8
Hastie, Chapters 9 and 10
Online Interactive Training
Decision Tree Classification Algorithm under Modules / Recordings and Links
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Random Forests (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 5: Unsupervised Learning, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Perform a principal components analysis and determine how many principal components to retain for subsequent analyses.
- Explain how principal components are related to eigenvectors and eigenvalues.
- Demonstrate principal components for dimension reduction.
- Demonstrate principal components regression.
- Describe hierarchical and non-hierarchical clustering techniques.
- Demonstrate hierarchical clustering.
- Demonstrate K-means cluster analysis.
- Describe how semi-supervised learning may be utilized in addressing classification and regression problems.
- Explain how results from unsupervised learning may be useful to business managers (and others).
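The dimension-reduction and clustering steps above can be sketched with scikit-learn: standardize, project onto the leading principal components, then cluster the reduced data with K-means. The iris data and the choices of two components and three clusters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

# Project the 4 features onto the 2 components with the largest eigenvalues
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))

# Non-hierarchical (K-means) clustering of the reduced data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print("cluster sizes:", np.bincount(labels))
```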
Textbook Readings
Géron, Chapters 8 and 9
James, Chapters 6 and 10
Hastie, Chapters 13 and 14
Westfall et al. https://www.tandfonline.com/doi/abs/10.1080/00273171.2017.1340824?journalCode=hmbr20
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Principal Components Analysis (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 6: Neural Networks, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Describe how artificial neural networks are constructed from logical connections of artificial neurons.
- Describe alternative activation functions, including step, logit, hyperbolic tangent, and rectified linear units (ReLU).
- Explain the roles of training, validation, and test samples in the development and evaluation of artificial neural networks.
- Explain how measurement and feature engineering are relevant to modeling.
- Demonstrate the use of artificial neural networks in classification.
- Demonstrate the use of artificial neural networks in regression.
- Explain how the results of artificial neural networks can be useful to business managers (and others).
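A minimal feedforward network sketch using scikit-learn's MLPClassifier (Géron's chapters use Keras/TensorFlow; this scikit-learn stand-in keeps the sketch self-contained). The digits data and single 64-unit hidden layer are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)  # feature scaling helps gradient descent
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One hidden layer of 64 ReLU units; softmax output over the 10 digit classes
net = MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                    max_iter=300, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", round(net.score(X_test, y_test), 3))
```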
Textbook Readings
Géron, A. Chapters 10 through 13
Hastie, Chapter 11
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Neural Networks (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 7: Deep Learning for Computer Vision, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Describe how deep neural networks are constructed.
- Describe alternative activation functions, including variations on rectified linear units (ReLU).
- Describe advantages and disadvantages of deep neural networks.
- Describe how convolutional neural networks are constructed.
- Describe how artificial neural networks are employed in computer vision.
- Demonstrate the use of deep neural networks in visual recognition.
- Explain how the results of deep neural networks can be useful to business managers, with special reference to vision problems (and others).
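Deep learning frameworks implement convolution for you, but the core operation a convolutional layer performs can be shown in plain NumPy: slide a small kernel over an image and record the dot product at each position. The edge-detecting kernel and tiny image below are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D convolution (really cross-correlation, as in most
    deep learning libraries) of one channel with one kernel."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a tiny image with one vertical edge
image = np.zeros((5, 5))
image[:, 2:] = 1.0
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
fmap = conv2d(image, sobel_x)
print(fmap)  # strong responses where the window straddles the edge
```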
Textbook Readings
Géron, Chapter 14
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Deep Learning: Image Processing with a CNN (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 8: Recurrent Neural Networks for Natural Language Processing, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Describe how recurrent neural networks are constructed.
- Describe how deep neural networks (convolutional and recurrent) may be employed in natural language processing.
- Explain how the results of deep neural networks can be useful to business managers, with special reference to natural language processing (and others).
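The recurrence that defines a simple RNN cell, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b), can be sketched in plain NumPy. Real models (Géron, Chapters 15 and 16) learn these weights with backpropagation through time; the sizes and random weights below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single recurrent cell: the hidden state h is updated from the current
# input x_t and the previous state h_{t-1}, reusing the same weights each step
n_in, n_hidden = 3, 4
W_xh = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

def rnn_forward(sequence):
    h = np.zeros(n_hidden)
    for x_t in sequence:            # one update per token/time step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)
    return h                        # final state summarizes the sequence

sequence = rng.normal(size=(5, n_in))   # 5 time steps, 3 features each
h_final = rnn_forward(sequence)
print(h_final.shape)
```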
Textbook Readings
Géron, Chapters 15 and 16
Goodfellow et al., 2016. Deep Learning. Chapter 10.
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignment
Recurrent Neural Network (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 9: Neural Network Autoencoders, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
- Define what is meant by an autoencoder.
- Distinguish between autoencoders and other forms of unsupervised learning, such as principal components analysis.
- Identify applications of neural network autoencoders.
- Distinguish among autoencoder neural network architectures.
- Implement an autoencoder for vision.
- Implement a word2vec autoencoder for natural language processing.
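An undercomplete autoencoder can be sketched with scikit-learn's MLPRegressor by training a network to reproduce its own input through a narrow bottleneck (the course notebooks use Keras; this stand-in keeps the sketch self-contained). The digits data and 16-unit bottleneck are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

X, _ = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)   # pixel intensities scaled to [0, 1]

# Undercomplete autoencoder: 64 pixels are squeezed through a 16-unit
# bottleneck, and the network is trained to reconstruct its input
ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=400, random_state=0)
ae.fit(X, X)                          # target equals input

X_rec = ae.predict(X)
mse = np.mean((X - X_rec) ** 2)
print("reconstruction MSE:", round(mse, 4))
```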
Textbook Readings
Géron, Chapter 17
Goodfellow et al., 2016. Deep Learning. Chapter 14. https://www.deeplearningbook.org/contents/autoencoders.html
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignments
Autoencoders (50 points)
Due Sunday evening of the week assigned by 11:55 p.m. Central time
Sync Sessions
Weekly per Announcement
Week 10: Final Exams, Due Sunday, End of Week, 11:59PM CT
Learning Objectives
No new learning objectives. This week is for exams and a course-overview discussion.
Discussion Board (10 points)
Participation due Sunday evening 11:55 p.m. Central time
Assignments – Final Exams (continued availability)
Proctored Final Exam (50 points)
Last day/time available: last day of course
The final exam is proctored by Examity and must be taken in one continuous one-hour sitting.
Reference Materials
Machine Learning
Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H-T. 2012. Learning from Data: A Short Course. AMLbook.com. [ISBN-13: 978-1-60049-006-4]
Berk, R. A. 2008. Statistical Learning: A Regression Perspective. New York: Springer.
[ISBN-13: 978-0-387-77500-5]
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.
[ISBN-13: 978-0-387-31073-2]
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. New York: Chapman & Hall. [ISBN-10: 0-412-04841-8]
Buduma, N. 2017. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-491-92561-4]
Caudill, M. and Butler, C. 1990. Naturally Intelligent Systems. Cambridge, Mass.: MIT Press.
[ISBN-10: 0-262-03156-6]
Chollet, F. 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. Code available at
https://github.com/fchollet/deep-learning-with-python-notebooks.git
Conway, D. and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449303716]
Cristianini, N. and Shawe-Taylor, J. 2000. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge University Press. [ISBN-10: 0-521-78019-5]
Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification, second edition. New York: Wiley. [ISBN-10: 0-471-05669-3]
Friedman, J. H. 1991. Multivariate adaptive regression splines, with commentary, The Annals of Statistics, 19(1), 1–141.
Goodfellow, I., Bengio, Y., and Courville, A. 2016. Deep Learning. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0262035613]
Géron, A. 2019. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, Calif.: O'Reilly.
Harrington, P. 2012. Machine Learning in Action, Shelter Island, N.Y.: Manning
[ISBN-13: 978-1617290183]
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition. New York: Springer. [ISBN-13: 978-0-387-84857-0]
Hagan, M. T., Demuth, H. B., Beale, M. H., and De Jesus, O. 2014. Neural Network Design, second edition. Stillwater, Okla.: Martin Hagan. [ISBN-13: 978-0-9717321-1-7]
Hand, D. J. 1997. Construction and Assessment of Classification Rules. New York: Wiley.
[ISBN-10: 0-471-96583-9]
Haykin, S. 1999. Neural Networks: A Comprehensive Foundation, second edition. Upper Saddle River, N.J.: Prentice Hall. [ISBN-10: 0-13-273350-1]
Izenman, A. J. 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer. [ISBN-13: 978-0-387-78188-4] Available from Springer collection at: http://link.springer.com.turing.library.northwestern.edu/
Kuhn, M. and Johnson, K., 2013. Applied Predictive Modeling. New York: Springer.
[ISBN-13: 978-1461468486] Available from Springer collection at:
http://link.springer.com.turing.library.northwestern.edu/
Liu, B. 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. (2nd ed.). New York: Springer. [ISBN-13: 978-3-642-19459-7] [ISBN-13 for the electronic edition: 978-3-642-19460-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/ )
Manning, C. D. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4), 701–707. Presidential address to the Association for Computational Linguistics.
Maren, A., Harston, C., and Pap, R. 1990. Handbook of Neural Network Computing. Boston: Academic Press. [ISBN-10: 0-12-546090-2]
Marsland, S. 2015. Machine Learning: An Algorithmic Perspective. Boca Raton, Fla.: CRC Press. [ISBN-13: 978-1-4665-8328-3]
Miller, T. W. 2015. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-388655-9]
Miller, T. W. 2015. Modeling Techniques in Predictive Analytics with Python and R: A Guide for Data Science. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-389206-2]
Miller, T. W. 2015. Web and Network Data Science: Modeling Techniques in Predictive Analytics. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-388644-3]
Miller, T. W. 2016. Sports Analytics and Data Science: Winning the Game with Methods and Models. Old Tappan, N.J.: Pearson. [ISBN-13: 978-0-13-388643-6] Data sets and programs on GitHub: https://github.com/mtpa/
Montavon, G., Orr, G. B., and Müller, K-R. (eds.) 2012. Neural Networks: Tricks of the Trade, second ed. New York: Springer. [ISBN-13: 978-3-642-35289-8] Electronic book available from the Springer collection at http://link.springer.com.turing.library.northwestern.edu/
Müller, A. C. and Guido, S. 2017. Introduction to Machine Learning with Python: A Guide for Data Scientists. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449369415] Code examples at
https://github.com/amueller/introduction_to_ml_with_python
Murphy, K. P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0-262-01802-9]
Nielsen, M. 2017. Neural Networks and Deep Learning. Online textbook available at
http://neuralnetworksanddeeplearning.com/ Code repository at
https://github.com/mnielsen/neural-networks-and-deep-learning
Rajaraman, A. and Ullman, J. D. 2012. Mining Massive Datasets. Cambridge, U.K.: Cambridge University Press. [ISBN-13: 978-1-107-01535-7]
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Volume I: Foundations. Cambridge, Mass.: MIT Press. [ISBN-10: 0-262-18120-7]
Schapire, R. E. and Freund, Y. 2014. Boosting: Foundations and Algorithms. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0-262-52603-6]
Segaran, T. 2007. Programming Collective Intelligence. Sebastopol, Calif.: O’Reilly.
[ISBN-13: 978-0-596-52932-1]
Shukla, N. 2018. Machine Learning with TensorFlow. Shelter Island, N.Y.: Manning.
[ISBN-13: 978-1617293870]
Tan, P-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Boston: Pearson/Addison Wesley. [ISBN-13: 978-0-321-32136-7]
Witten, I. H., Frank, E., and Hall, M. A. 2011. Data Mining: Practical Machine Learning Tools and Techniques, third edition. Burlington, Mass: Morgan Kaufmann. [ISBN-13: 978-0-12-374856-0]
Traditional Statistics and Mathematics
Chatterjee, S. and Hadi, A. S. 2012. Regression Analysis by Example, fifth edition. New York, NY: Wiley. [ISBN-13: 978-0470905845] Data sets available from the second author at
http://www1.aucegypt.edu/faculty/hadi/RABE5/
Feldman, R. M. and Valdez-Flores, C. 2010. Applied Probability and Stochastic Processes, second edition. New York: Springer. [electronic version ISBN-13: 978-3-642-05158-6]
Available from Springer collection at:
http://link.springer.com.turing.library.northwestern.edu/
Gentle, J. E. 2007. Matrix Algebra: Theory, Computations, and Applications in Statistics. New York: Springer. [electronic version ISBN-13: 978-0-387-70873-7]
Available from Springer collection at:
http://link.springer.com.turing.library.northwestern.edu/
Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY: Wiley. [ISBN-10: 0-471-87876-6]
Manly, B. F. J. and Alberto, J. A. N., 2017. Multivariate Statistical Methods: A Primer, fourth edition. Boca Raton, Fla.: CRC Press. [ISBN-13: 978-1498728966] There are R programs and data sets available at http://www.manly-biostatistics.co.nz/downloads
McCullagh, P. and Nelder, J. A., 1989. Generalized Linear Models, second edition. London: Chapman & Hall. [ISBN-10: 0-412-31760-5]
Snedecor, G. W. and Cochran, W. G. 1967. Statistical Methods, sixth edition. Ames, Iowa: Iowa University Press. [ISBN-10: 0-8138-1560-6]
Weisberg, S., 2014. Applied Linear Regression, fourth edition. New York: Wiley. [ISBN-13: 978-1118386088]
Programming: Python and Linux
Barrett, D. J. 2016. Linux Pocket Guide (3rd ed.). Sebastopol, Calif.: O’Reilly.
[ISBN-13: 978-1491927571]
Beazley, D. M. 2009. Python Essential Reference, fourth edition. Boston: Addison-Wesley.
[ISBN-13: 978-0672329784]
Beazley, D. M. 2017. Python Programming Language Live Lessons. Old Tappan, N.J.: Pearson. Video available through Safari Books Online, Sebastopol, Calif.: O’Reilly.
Beazley, D. & Jones, B. K. 2013. Python Cookbook, third edition. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-34037-7]
Bird, S., Klein, E., and Loper, E. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-0-596-51649-9]
Chun, W. J. 2007. Core Python Programming (2nd ed.). Upper Saddle River, N.J.: Prentice Hall. [ISBN-13: 978-0-13-226993-3]
Gift, N. and Jones, J. M. 2008. Python for Unix and Linux System Administrators: Efficient Problem Solving with Python. Sebastopol, Calif.: O’Reilly. (Chapter 2: IPython, pages 21–69.) [ISBN-13: 978-0-596-51582-9]
Hellmann, D. 2011. The Python Standard Library by Example. Upper Saddle River, N.J.: Pearson/Addison-Wesley. [ISBN-13: 978-0-321-76734-9]
Hunt, A. and Thomas, D. 2000. The Pragmatic Programmer: From Journeyman to Master. Reading, Mass.: Addison-Wesley. [ISBN13: 978-0201616224]
Keyser, J. 2016. How to Program: Computer Science Concepts and Python Exercises. Chantilly, Va.: The Great Courses. Video course available on DVD, electronic, and streaming media.
Lubanovic, B. 2015. Introducing Python: Modern Computing in Simple Packages. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]
McKinney, W. 2013. Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31979-3]
Ramalho, L. 2015. Fluent Python: Clear, Concise, and Effective Programming. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-491-94600-8]
Rossant, C. 2014. Python Interactive Computing and Visualization Cookbook. Birmingham, U.K.: Packt Publishing Ltd. [ISBN-13: 978-1-78328-481-8]
Shotts, W. E. 2012. The Linux Command Line: A Complete Introduction. San Francisco: No Starch Press. [ISBN-13: 978-1-59327-389-7]
Solem, J. E. 2012. Programming Computer Vision with Python: Tools and Algorithms for Analyzing Images. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31654-9]
Sweigart, A. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. San Francisco: No Starch Press. [978-1-59327-599-0]
VanderPlas, J., 2017. Python Data Science Handbook: Essential Tools for Working with Data. Sebastopol, Calif.: O’Reilly [ISBN-13: 978-1491912058] Python code examples at
https://github.com/jakevdp/PythonDataScienceHandbook
Ward, B. 2015. How Linux Works: What Every Superuser Should Know (2nd ed.). San Francisco: No Starch Press. [ISBN-13: 978-1-59327-567-6]
Programming: R
Chambers, J. M. and Hastie, T. J., editors. 1992 Statistical Models in S. Pacific Grove, Calif.: Wadsworth & Brooks/Cole. [ISBN-10: 0-534-16764-0]
Chang, W. 2013. R Graphics Cookbook. Sebastopol, Calif.: O'Reilly. [ISBN-13: 9781449316952]
Davies, T. M. 2016. The Book of R: A First Course in Programming and Statistics. San Francisco: No Starch Press. [ISBN-13: 9781593276515]
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning with Applications in R. New York: Springer. [ISBN13: 978-1-4614-7137-0] Available as a free download at http://www-bcf.usc.edu/~gareth/ISL/ with exercise solutions at http://blog.princehonest.com/stat-learning/
Fox, J. & Weisberg, S. 2011. An R Companion to Applied Regression, second edition. Thousand Oaks, CA: Sage. [ISBN-13 978-1412975148]
Kabacoff, R. 2015. R in Action: Data Analysis and Graphics with R, second edition. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617291388] Code available at https://github.com/kabacoff/RiA2
Lander, J. P. 2014. R for Everyone: Advanced Analytics and Graphics. Upper Saddle River, N.J.: Pearson/Addison-Wesley. [ISBN-13: 978-0-321-88803-7]
Matloff, N. 2011. The Art of R Programming: A Tour of Statistical Software Design. San Francisco: No Starch Press. [ISBN-13: 978-1593273842]
Sarkar, D. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. [ISBN-13: 978-0-387-75968-2]
Venables, W. N. and Ripley, B. D. 2002. Modern Applied Statistics with S (fourth ed.). New York: Springer. [ISBN-10: 0-387-95457-0]
Wickham, H. 2015. Advanced R. Boca Raton, Fla.: CRC Press/Chapman & Hall. [ISBN-13: 978-1-4665-8696-3]
Data Visualization
Bertin, J. 1983. Semiology of Graphics: Diagrams, Networks, Maps. Madison, Wisc.: University of Wisconsin Press. [ESRI Press ISBN-13: 978-1589482616]
Cairo, A. 2013. The Functional Art: An Introduction to Information Graphics and Visualization. Berkeley, Calif.: Pearson/New Riders. [ISBN-13: 978-0321834737]
Cairo, A. 2016. The Truthful Art: Data, Charts, and Maps for Communication. Berkeley, Calif.: Pearson/New Riders. [ISBN-13: 978-0321934079]
Cleveland, W. S. 1993. Visualizing Data. Murray Hill, N.J.: AT&T Bell Laboratories. [ISBN-10: 0-9634884]
Cooper, A., Reimann, R., Cronin, D., and Noessel, C. 2014. About Face: The Essentials of Interaction Design (fourth ed.). New York: Wiley. [ISBN-13: 978-1118766576]
Dale, K. 2016. Data Visualization with Python & JavaScript: Scrape, Clean, Explore, and Transform Your Data. Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1491920510] Code available at https://github.com/Kyrand/dataviz-with-python-and-js
Few, S. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Oakland: Analytics Press. [ISBN-13: 978-0970601988]
Foote, S. 2015. Learning to Program. Upper Saddle River, N.J.: Addison-Wesley. [ISBN-13: 978-0-7897-5339-7]
Haverbeke, M. 2015. Eloquent JavaScript: A Modern Introduction to Programming (second ed.). San Francisco: No Starch Press. [ISBN-13: 978-1593275846] Available online at http://eloquentjavascript.net/ Code sandbox at http://eloquentjavascript.net/code/
Kirk, A. 2016. Data Visualization: A Handbook for Data Driven Design. Los Angeles: Sage. [ISBN-13: 978-1473912144] Website: http://book.visualisingdata.com/home
Knaflic, C. N. 2015. Storytelling with Data: A Data Visualization Guide for Business Professionals. New York: Wiley. [ISBN-13: 978-1119002253]
Krug, S. 2014. Don’t Make Me Think: A Common Sense Approach to Web Usability (third ed.). Upper Saddle River, N.J.: Pearson/New Riders. [ISBN-13: 978-0321965516]
Meeks, E. 2018. D3.js in Action: Data Visualization with JavaScript, second ed. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617294488] D3 v4 code at https://github.com/emeeks/d3_in_action_2
Murray, S. 2017. Interactive Data Visualization for the Web: An Introduction to Designing with D3 (second ed.). Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1491921289] Code available from GitHub at https://github.com/alignedleft/d3-book
Purewal, S. 2014. Learning Web App Development: Build Quickly with Proven JavaScript Techniques. Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1449370190]
Robbins, J. N. 2012. Learning Web Design: A Beginner’s Guide to HTML, CSS, JavaScript, and Web Graphics (fourth ed.). Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1449319274]
Tufte, E. R. 2001. The Visual Display of Quantitative Information (second ed.). Cheshire, Conn.: Graphics Press. [ISBN-13: 978-0961392147]
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, Mass.: Addison-Wesley. [ISBN-10: 0-201-07616-0]
Wilkinson, L. 2005. The Grammar of Graphics (second ed.). New York: Springer. [ISBN-13: 978-0387245447] Electronic edition available to Northwestern University students at http://link.springer.com.turing.library.northwestern.edu/
Web and Network Data Science
Barabási, A.-L. 2016. Network Science. Cambridge, UK: Cambridge University Press. [ISBN-13: 978-1107076266]
Büttcher, S., Clarke, C. L. A., and Cormack, G. V. 2010. Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0262026512]
Campbell, S. and Swigart, S. 2014. Going Beyond Google: Gathering Internet Intelligence (fifth edition). Oregon City, OR: Cascade Insights. (Out of print resource, available as electronic book on the Canvas course site.)
Ceri, S. et al. 2013. Web Information Retrieval. New York: Springer. [ISBN-13: 978-3-642-39313-6] [ISBN-13 electronic edition: 978-3-642-39314-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/)
Gheorghe, R., Hinman, M. L., and Russo, R. 2016. Elasticsearch in Action. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617291623]
Gormley, C. and Tong, Z. 2015. Elasticsearch: The Definitive Guide. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449358549]
Liu, B. 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. (2nd ed.). New York: Springer. [ISBN-13: 978-3-642-19459-7] [ISBN-13 for the electronic edition: 978-3-642-19460-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/ )
Liu, B. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. New York: Cambridge University Press. [ISBN-13: 978-1-107-01789-4]
Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press. [ISBN-13: 978-0521865715] Available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Mitchell, R. 2015. Web Scraping with Python: Collecting Data from the Modern Web. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1491910290] Code at https://github.com/REMitchell/python-scraping
Nolan, D. and Lang, D. T. 2014. XML and Web Technologies for Data Sciences with R. New York: Springer. [ISBN-13: 978-1-4614-7900-0]
Course Summary:
| Date | Details | Due |
|---|---|---|