Course Syllabus

Syllabus MSDS 422-57/59: Practical Machine Learning (Fall 2020)


Instructor: Lawrence Fulton, Ph.D. 

E-mail: lawrence.fulton@northwestern.edu

Teaching Assistant: John Stanham

E-mail: jstanham@northwestern.edu

Response Times: Responses to question-and-answer discussion forums and e-mail questions are provided within 48 hours. Telephone communication: by appointment.

 

Course Description: 

The course introduces machine learning with business applications. It provides a survey of machine learning techniques, including traditional statistical methods, resampling techniques, model selection and regularization, tree-based methods, principal components analysis, cluster analysis, artificial neural networks, and deep learning. Students implement machine learning models with open-source software for data science. They explore data and learn from data, finding underlying patterns useful for data reduction, feature analysis, prediction, and classification. Prerequisites: MSDS 400-DL Math for Data Scientists, MSDS 401-DL Applied Statistics with R, and MSDS 402-DL Introduction to Data Science.

 

Learning Outcomes

Practical Machine Learning is a survey course with a long list of learning outcomes:

  • Explain the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
  • Distinguish between supervised and unsupervised learning methods.
  • Distinguish between regression and classification problems.
  • Explain bootstrap and cross-validation procedures.
  • Explore and visualize data and perform basic statistical analysis.
  • List alternative methods for evaluating classifiers.
  • List alternative methods for evaluating regression.
  • Demonstrate the application of traditional statistical methods for classification and regression.
  • Demonstrate the application of trees and random forests for classification and regression.
  • Demonstrate principal components for dimension reduction.
  • Demonstrate principal components regression.
  • Describe hierarchical and non-hierarchical clustering techniques.
  • Describe how semi-supervised learning may be utilized in addressing classification and regression problems.
  • Explain how measurement and feature engineering are relevant to modeling.
  • Describe how artificial neural networks are constructed from logical connections of artificial neurons and activation functions.
  • Demonstrate the use of artificial neural networks (including deep neural networks) in classification and regression.
  • Describe how convolutional neural networks are constructed.
  • Describe how recurrent neural networks are constructed.
  • Distinguish between autoencoders and other forms of unsupervised learning.
  • Describe applications of autoencoders.
  • Explain how the results of machine learning can be useful to business managers.
  • Transform data and research results into actionable insights.

 

Required Readings

Géron, A. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edition. Sebastopol, Calif.: O'Reilly, 2019. [ISBN-13: 978-1492032649] Source code available at https://github.com/ageron/handson-ml2.

Hastie et al., The Elements of Statistical Learning. Springer, 2009. https://web.stanford.edu/~hastie/ElemStatLearn/

James et al., An Introduction to Statistical Learning with Applications in R, 7th printing. Springer, 2018. https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf

Goodfellow et al., Deep Learning.  MIT Press, 2016.  https://www.deeplearningbook.org/

Reference Textbooks and Courses

Chollet, F. 2017. Deep Learning with Python. Manning Publications. [ISBN-13:978-1617294433] Code at https://github.com/fchollet/deep-learning-with-python-notebooks

Of special interest are LinkedIn Learning / Lynda courses in Python programming, JavaScript, web development, and Git/GitHub. https://www.northwestern.edu/hr/learning/development/online-learning/lynda.com.html

Additional assigned readings are posted on Canvas, including timely news articles and academic research needed to complete some assignments and participate in discussion forums.

Study Teams (optional)

Student study teams may be utilized in this course as a means to foster a collaborative learning environment. Blue Jeans is available as a conferencing tool. Preliminary groups have been set up under the People/Study Teams tab of Canvas. Each student is encouraged to join a study team of no more than five students. It may make sense to join a team based on time zone (Eastern, Central, Mountain, or Pacific) and preferred personal computer operating system (Mac OS X or Windows).

 

Software and Technology Resources

Required software for this course is open source and freely available on the web. Python is the primary programming environment. Personal computer software is available for Mac OS X and Windows systems. Students will also have access to the Data Science Computing Cluster from the School of Professional Studies. Additional information about software solutions will be provided on the Canvas course site under Modules / Technical Resources.

 

Late Policy

Students must provide written notification of late assignment work at least 24 hours prior to the deadline. Students who provide this notification are allowed one grace day without reduction of points; only one grace day is allowed. All other late papers are subject to point reductions.

 

Assignments and Exams

This course has eight graded assignments worth 50 points each. There are also two final exams worth 50 points each: one non-proctored with no time limit, the other proctored with a one-hour time limit.

 

Graded Course Component                                 Point Value

Week 1. Exploring and Visualizing Data                  50
Week 2. Evaluating Regression Models                    50
Week 3. Evaluating Classification Models                50
Week 4. Random Forests                                  50
Week 5. Principal Components Analysis                   50
Week 6. Neural Networks                                 50
Week 7. Deep Learning: Image Processing with a CNN      50
Week 8. Deep Learning: Language Modeling with an RNN    50
Week 9. Autoencoders                                    50
Week 10. Proctored Final Exam                           50
Weeks 1-10. Discussion                                  100 (10 points each week)
TOTAL                                                   600 Points

 

Grading Scale

A:  90-100%

B:  80-89%

C:  70-79%

F:  Below 70%

Proctored Final Exam

The proctored final exam is a multiple-choice exam focused on machine learning concepts. It includes little mathematics and no programming. There will be twenty-five multiple-choice items worth two points each. Answers to the exam are entered online in Canvas while students are under the supervision of an Examity proctor. The proctored final exam has a one-hour time limit. Like the non-proctored final exam, the proctored final exam is comprehensive, covering material from weeks 1 through 9 of the course.

Academic Integrity Policy 

Students are required to comply with University regulations regarding academic integrity. If you are in doubt about what constitutes academic dishonesty, speak with your instructor or graduate coordinator before the assignment is due and/or examine the University Web site. Academic dishonesty includes, but is not limited to, cheating on an exam, obtaining an unfair advantage, and engaging in plagiarism (e.g., using material from readings without citing or copying another student's paper). Failure to maintain academic integrity will result in a grade sanction, possibly as severe as failing and being required to retake the course, and could lead to a suspension or expulsion from the program. Further penalties may apply. For more information, visit The Office of the Provost's Academic Integrity page. Some assignments in SPS courses may be required to be submitted through Turnitin, a plagiarism detection and education tool. You can find an explanation of the tool here.  

Accessibility Policy and Accommodations

Northwestern University and the School of Professional Studies strive to ensure that all online courses are accessible to all students. In collaboration with AccessibleNU, we work to provide accommodations to students with disabilities and other conditions so they may fully participate and engage in the learning environment. The majority of accommodations and services available to eligible students are coordinated by AccessibleNU. If you have an accommodation request, please contact your professor or AccessibleNU as soon as possible.

Discussion Board Participation

The purpose of the discussion boards is to allow students to freely exchange ideas. It is imperative to remain respectful of all viewpoints and positions and, when necessary, agree to respectfully disagree. While active and frequent participation is encouraged, cluttering a discussion board with inappropriate, irrelevant, or insignificant material will not earn points and may result in receiving less than full credit. Frequency and length of postings are unimportant. Try to keep your responses short and to the point.

The content of the postings is paramount.  Remember to cite all sources—when relevant—in order to avoid plagiarism. Do not include graphics or images in postings, but feel free to include web addresses/links to public-domain resources.

 

Graded discussion items are designed so you must post your viewpoint before seeing the responses of others. When responding to others’ postings, avoid simple responses such as “I agree” or “I do not agree.” Instead, move the discussion forward, adding value by contributing new information. Explain, clarify, politely ask for details, provide details of your own, persuade, and enrich communications for a productive discussion experience. You are required to participate in the weekly discussion board forum. The due date and time for posting to each week’s discussion forum is SUNDAY at 11:55 p.m. (Central Time).

Additional Resources

Extensive technology and visualization resources are provided with the Canvas course site. These are found under Modules headings such as Technology Resources, Library / Information Resources, Course Resources, and Recordings and Links.

Syllabus over Canvas Course Site

This syllabus in its Adobe Acrobat Portable Document Format (pdf) form is the defining document for Practical Machine Learning. This syllabus defines course objectives, requirements, due dates, and grading standards. If there is ever a discrepancy between this syllabus and the Canvas course site, rely on this syllabus as the final word.

 

 

 

Week 1: Introduction to Machine Learning, Due Sunday, End of Week, 11:59PM CT

Learning Objectives

  • Describe the basic concepts of the learning problem and why/how machine learning methods are used to learn from data to find underlying patterns for prediction and decision-making.
  • Explain the learning algorithm trade-offs, balancing performance within training data and robustness on unobserved test data.
  • Distinguish between supervised and unsupervised learning methods.
  • Distinguish between regression and classification problems.
  • Summarize the basic concepts of assessing model accuracy and the bias-variance trade-off.
  • Describe how sampling and selection are relevant to machine learning.
  • Explain bootstrap and cross-validation procedures.
  • Explore and visualize data and perform basic statistical analysis.*

 

*acquired in previous course, reviewed here
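The bootstrap and k-fold cross-validation procedures listed above can be previewed in a few lines of NumPy. The sketch below is illustrative only; it uses a made-up toy dataset of ten observations.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # toy "dataset" of 10 observations

# Bootstrap: draw n observations with replacement; repeating this many
# times lets us estimate the sampling variability of a statistic.
boot = rng.choice(data, size=data.size, replace=True)

# 5-fold cross-validation: split shuffled indices into 5 disjoint folds;
# each fold serves once as the held-out test set.
indices = rng.permutation(data.size)
folds = np.array_split(indices, 5)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    assert set(train_idx.tolist()).isdisjoint(test_idx.tolist())  # no leakage
```

In practice, scikit-learn utilities such as `KFold` and `cross_val_score` handle this bookkeeping.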

Textbook Readings

Géron, A. Chapters 1 and 2

James, Chapters 1 and 2

Hastie, Chapters 1 and 2

 

Kaggle Tutorial: EDA & Machine Learning
https://www.datacamp.com/community/tutorials/kaggle-machine-learning-eda

 

Introduction to Colab and Python: https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Exploring and Visualizing Data (50 points)

Due Sunday evening of the week assigned by 11:59 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

Week 2: Supervised Learning for Regression, Due Sunday, End of Week, 11:59PM CT

Learning Objectives

  • Express the basic setting of a regression problem.*
  • Describe the statistical model of multiple linear regression.*
  • Demonstrate multiple linear regression.
  • Demonstrate regularized regression methods.
  • Evaluate regression methods within a cross-validation framework.
  • List alternative methods for evaluating regression models.
  • Explain how the results of regression may be useful to business managers (and others).

 

*acquired in previous course, reviewed here
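The regression methods listed above can be sketched directly from the normal equations. The example below uses simulated data and an arbitrary regularization strength; it contrasts ordinary least squares with ridge regression and computes root mean squared error, one standard evaluation metric.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # simulated predictors
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# Ordinary least squares via the normal equations: (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression adds an L2 penalty: (X'X + alpha I) beta = X'y,
# shrinking coefficients toward zero (alpha chosen arbitrarily here)
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Root mean squared error on the training data
rmse = np.sqrt(np.mean((y - X @ beta_ols) ** 2))
```

In assignments, scikit-learn's `LinearRegression` and `Ridge` estimators are the practical route, combined with cross-validation rather than training-set error.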

Textbook Readings

Géron, Chapter 4

James, Chapter 3

Hastie, Chapter 3

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Evaluating Regression Models (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

 

Sync Sessions

Weekly per Announcement

 

 

Week 3: Supervised Learning for Classification, Due Sunday, End of Week, 11:59PM CT

 

Learning Objectives

  • Express the basic setting of a classification problem.
  • Describe the statistical model of logistic regression.
  • Demonstrate the binary logistic regression algorithm.
  • Demonstrate naïve Bayes classification.
  • Evaluate classification methods within a cross-validation framework.
  • List alternative methods for evaluating classifiers.
  • Explain how the results of classification may be useful to business managers (and others).
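Several of the classifier evaluation methods above reduce to simple arithmetic on a confusion matrix. A minimal sketch, with made-up counts:

```python
# Classifier evaluation metrics computed from a 2x2 confusion matrix.
# The counts below are made up for illustration.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall    = tp / (tp + fn)   # of actual positives, how many are found
f1        = 2 * precision * recall / (precision + recall)
```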

Textbook Readings

Géron, Chapter 3

James, Chapter 4

Hastie, Chapter 4

Online Interactive Training

Naïve Bayes Classification under Modules / Recordings and Links

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Evaluating Classification Models (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

Week 4: Trees and Random Forests, Due Sunday, End of Week, 11:59PM CT

 

Learning Objectives

 

  • Describe decision trees.
  • Describe how impurity is measured for decision trees.
  • List advantages and disadvantages of using decision trees.
  • Demonstrate the use of decision trees for classification.
  • Demonstrate the use of decision trees for regression.
  • Compare prediction performance of trees with other methods of classification and regression.
  • Describe how bagging and sampling with and without replacement are employed in random forests.
  • Describe how boosting may be used to improve predictive performance.
  • Demonstrate the use of random forests for classification.
  • Demonstrate the use of random forests for regression.
  • Compare prediction performance of random forests with other methods of classification and regression.
  • Explain how results from decision trees and random forests may be useful to business managers (and others).
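As a preview of the impurity objective above, here is a hand-rolled Gini function (illustrative only; scikit-learn's tree estimators compute this internally):

```python
# Gini impurity, a common split criterion for classification trees:
# G = 1 - sum_k p_k^2, where p_k is the proportion of class k at a node.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

pure  = gini(['a'] * 5)              # a pure node has zero impurity
mixed = gini(['a'] * 5 + ['b'] * 5)  # an even 50/50 split is maximally impure
```

A tree grower picks the split that most reduces the weighted impurity of the child nodes.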

Textbook Readings

 

Géron, Chapters 6 and 7

James, Chapter 8

Hastie, Chapters 9 and 10

Online Interactive Training

Decision Tree Classification Algorithm under Modules / Recordings and Links

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Random Forests (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

 

Week 5: Unsupervised Learning, Due Sunday, End of Week, 11:59PM CT

Learning Objectives

 

  • Perform a principal components analysis and determine how many principal components to retain for subsequent analyses.
  • Explain how principal components are related to eigenvectors and eigenvalues.
  • Demonstrate principal components for dimension reduction.
  • Demonstrate principal components regression.
  • Describe hierarchical and non-hierarchical clustering techniques.
  • Demonstrate hierarchical clustering.
  • Demonstrate K-means cluster analysis.
  • Describe how semi-supervised learning may be utilized in addressing classification and regression problems.
  • Explain how results from unsupervised learning may be useful to business managers (and others).
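The link between principal components and eigenvectors/eigenvalues noted above can be demonstrated with NumPy alone. The data below are simulated so that one direction dominates the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated data: 200 points whose variance is dominated by one axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# Principal components are eigenvectors of the sample covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of variance explained guides how many components to retain
explained = eigvals / eigvals.sum()

# Dimension reduction: project onto the first principal component
scores = Xc @ eigvecs[:, :1]
```

The retained `scores` could then feed a downstream model, which is the idea behind principal components regression.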

Textbook Readings

Géron, Chapters 8 and 9

James, Chapters 6 and 10

Hastie, Chapters 13 and 14

Westfall et al., https://www.tandfonline.com/doi/abs/10.1080/00273171.2017.1340824?journalCode=hmbr20

 

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Principal Components Analysis (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

Week 6: Neural Networks, Due Sunday, End of Week, 11:59PM CT

 

Learning Objectives

 

  • Describe how artificial neural networks are constructed from logical connections of artificial neurons.
  • Describe alternative activation functions, including step, logit, hyperbolic tangent, and rectified linear units (ReLU).
  • Explain the roles of training, validation, and test samples in the development and evaluation of artificial neural networks.
  • Explain how measurement and feature engineering are relevant to modeling.
  • Demonstrate the use of artificial neural networks in classification.
  • Demonstrate the use of artificial neural networks in regression.
  • Explain how the results of artificial neural networks can be useful to business managers (and others).
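The construction described above (neurons, activation functions, layered connections) can be previewed as a single forward pass. All weights below are arbitrary illustrative values, not trained parameters:

```python
import numpy as np

# Two common activation functions
def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# Forward pass through one hidden layer; weights here are arbitrary
# illustrative values, not trained parameters.
x  = np.array([1.0, 2.0])                    # input features
W1 = np.array([[0.5, -0.2], [0.1, 0.4]])     # input-to-hidden weights
b1 = np.array([0.0, 0.1])
hidden = relu(W1 @ x + b1)                   # hidden-layer activations
W2 = np.array([[0.3, -0.6]])                 # hidden-to-output weights
output = sigmoid(W2 @ hidden)                # probability-like output
```

Training amounts to adjusting `W1`, `b1`, and `W2` to reduce a loss measured on labeled examples.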

Textbook Readings

 

Géron, A. Chapters 10 through 13

Hastie, Chapter 11

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Neural Networks (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

 

Sync Sessions

Weekly per Announcement

 

Week 7: Deep Learning for Computer Vision, Due Sunday, End of Week, 11:59PM CT

Learning Objectives

  • Describe how deep neural networks are constructed.
  • Describe alternative activation functions, including variations on rectified linear units (ReLU).
  • Describe advantages and disadvantages of deep neural networks.
  • Describe how convolutional neural networks are constructed.
  • Describe how artificial neural networks are employed in computer vision.
  • Demonstrate the use of deep neural networks in visual recognition.
  • Explain how the results of deep neural networks can be useful to business managers, with special reference to vision problems (and others).
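The core operation of a convolutional layer is a small kernel slid across an image. The sketch below is a "valid" cross-correlation (which is what most deep learning libraries implement under the name convolution), applied to a toy image containing one vertical edge:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image,
    summing elementwise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])   # toy image with a vertical edge
edge = np.array([[-1., 1.]])          # kernel that responds to that edge
response = conv2d(image, edge)        # large only where the edge occurs
```

A CNN learns many such kernels per layer rather than fixing them by hand.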

Textbook Readings

Géron, Chapter 14

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central

 

Assignment

Deep Learning: Image Processing with a CNN (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

 

 

Week 8: Recurrent Neural Networks for Natural Language Processing, Due Sunday, End of Week, 11:59PM CT

 

Learning Objectives

 

  • Describe how recurrent neural networks are constructed.
  • Describe how deep neural networks (convolutional and recurrent) may be employed in natural language processing.
  • Explain how the results of deep neural networks can be useful to business managers, with special reference to natural language processing (and others).
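The defining feature of a recurrent cell is a hidden state carried across time steps. A minimal vanilla-RNN sketch with hand-set weights (illustrative only, not a trained model):

```python
import numpy as np

# One step of a vanilla recurrent cell: the new hidden state mixes the
# current input with the previous hidden state through shared weights.
def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

W_x = np.array([[0.5], [0.3]])     # input-to-hidden weights (arbitrary)
W_h = np.array([[0.1, 0.0],
                [0.0, 0.1]])       # hidden-to-hidden (recurrent) weights
b   = np.zeros(2)

# Process a short sequence, carrying the hidden state forward
h = np.zeros(2)
for x_t in [np.array([1.0]), np.array([0.5]), np.array([-1.0])]:
    h = rnn_step(x_t, h, W_x, W_h, b)
```

Because the same weights are reused at every step, the cell can process sequences of any length, which is what makes RNNs natural for language data.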

Textbook Readings

 

Géron, Chapters 15 and 16

Goodfellow et al., 2016.  Deep Learning.  Chapter 10. 

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

Assignment

Recurrent Neural Network (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

 

 

Week 9: Neural Network Autoencoders, Due Sunday, End of Week, 11:59PM CT

 

Learning Objectives

  • Define what is meant by an autoencoder.
  • Distinguish between autoencoders and other forms of unsupervised learning, such as principal component analysis.
  • Identify applications of neural network autoencoders.
  • Distinguish among autoencoder neural network architectures.
  • Implement an autoencoder for vision.
  • Implement a word2vec autoencoder for natural language processing.
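An autoencoder compresses inputs through a bottleneck and reconstructs them. The hand-built linear sketch below (weights chosen by hand, not trained, for toy data lying along a single direction) shows the encode/decode round trip:

```python
import numpy as np

# A linear autoencoder with a 1-unit bottleneck: encode 2-D inputs to a
# single code value, then decode back to 2-D. Weights are set by hand
# for data lying along the direction (1, 2); a real autoencoder would
# learn them by minimizing reconstruction error.
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
encode = lambda x: x @ direction            # 2-D input  -> 1-D code
decode = lambda z: np.outer(z, direction)   # 1-D code   -> 2-D reconstruction

X = np.array([[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0]])  # all on that line
X_hat = decode(encode(X))
reconstruction_error = np.mean((X - X_hat) ** 2)
```

With a linear encoder/decoder this recovers the principal-component subspace; nonlinear activations and deeper stacks give autoencoders their additional power.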

Textbook Readings

 

Géron, Chapter 17

Goodfellow et al., 2016.  Deep Learning.  Chapter 14.  https://www.deeplearningbook.org/contents/autoencoders.html

 

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

 

Assignments

Autoencoders (50 points)

Due Sunday evening of the week assigned by 11:55 p.m. Central time

Sync Sessions

Weekly per Announcement

 

 

 

 

 

Week 10: Final Exams, Due Sunday, End of Week, 11:59PM CT

Learning Objectives

No new learning objectives. This week is for exams and a course overview discussion.

 

Discussion Board (10 points)

Participation due Sunday evening 11:55 p.m. Central time

 

 

Assignments – Final Exams (continued availability)

Proctored Final Exam                  (50 points)

 

Last day/time available:  last day of course

 

The proctored final exam will be proctored by Examity and must be taken in one continuous one-hour sitting. 

 

 

 

Reference Materials

Machine Learning

Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H-T. 2012. Learning from Data: A Short Course. AMLbook.com. [ISBN-13: 978-1-60049-006-4]

 

Berk, R. A. 2008. Statistical Learning: A Regression Perspective. New York: Springer.

[ISBN-13: 978-0-387-77500-5]

 

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. New York: Springer.

[ISBN-13: 978-0-387-31073-2]

 

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. New York: Chapman & Hall. [ISBN-10: 0-412-04841-8]

 

Buduma, N. 2017. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-491-92561-4]

 

Caudill, M. and Butler, C. 1990. Naturally Intelligent Systems. Cambridge, Mass.: MIT Press.

[ISBN-10: 0-262-03156-6]

 

Chollet, F. 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. Code available at

https://github.com/fchollet/deep-learning-with-python-notebooks.git

 

Conway, D. and White, J. M. 2012. Machine Learning for Hackers. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449303716]

 

Cristianini, N. Shawe-Taylor, J. 2000. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge University Press. [ISBN-10: 0-521-78019-5]

 

Duda, R. O., Hart, P. E., and Stork, D. G. 2001. Pattern Classification, second edition. New York: Wiley. [ISBN-10: 0-471-05669-3]

 

Friedman, J. H. 1991. Multivariate adaptive regression splines, with commentary, The Annals of Statistics, 19(1), 1–141.

 

Goodfellow, I., Bengio, Y., and Courville, A. 2016. Deep Learning. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0262035613]

 

Géron, A. 2019. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, Calif.: O'Reilly.

 

Harrington, P. 2012. Machine Learning in Action, Shelter Island, N.Y.: Manning

[ISBN-13: 978-1617290183]

 

Hastie, T., Tibshirani, R, & Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition. New York: Springer. [ISBN-13: 978-0-387-84857-0]

 

Hagan, M. T., Demuth, H. B., Beale, M. H., and De Jesus, O. 2014. Neural Network Design, second edition. Stillwater, Okla.: Martin Hagan. [ISBN-13: 978-0-9717321-1-7]

 

Hand, D. J. 1997. Construction and Assessment of Classification Rules. New York: Wiley.

[ISBN-10: 0-471-96583-9]

 

Haykin, S. 1999. Neural Networks: A Comprehensive Foundation, second edition. Upper Saddle River, N.J.: Prentice Hall. [ISBN-10: 0-13-273350-1]

 

Izenman, A. J. 2008. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. New York: Springer. [ISBN-13: 978-0-387-78188-4]  Available from Springer collection at:  http://link.springer.com.turing.library.northwestern.edu/ 

 

Kuhn, M. and Johnson, K., 2013. Applied Predictive Modeling. New York: Springer.

[ISBN-13: 978-1461468486]  Available from Springer collection at:

 http://link.springer.com.turing.library.northwestern.edu/ 

 

Liu, B. 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. (2nd ed.). New York: Springer. [ISBN-13: 978-3-642-19459-7] [ISBN-13 for the electronic edition: 978-3-642-19460-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/ )

 

Manning, C. D. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4), 701–707. Presidential address to the Association for Computational Linguistics.

 

Maren, A., Harston, C., and Pap, R. 1990. Handbook of Neural Network Computing. Boston: Academic Press. [ISBN-10: 0-12-546090-2]

 

Marsland, S. 2015. Machine Learning: An Algorithmic Perspective. Boca Raton, Fla.: CRC Press. [ISBN-13: 978-1-4665-8328-3]

 

Miller, T. W. 2015. Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-388655-9]

 

Miller, T. W. 2015. Modeling Techniques in Predictive Analytics with Python and R: A Guide for Data Science. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-389206-2]

 

Miller, T. W. 2015. Web and Network Data Science: Modeling Techniques in Predictive Analytics. Upper Saddle River, N.J.: Pearson. [ISBN-13: 978-0-13-388644-3]

 

Miller, T. W. 2016. Sports Analytics and Data Science: Winning the Game with Methods and Models. Old Tappan, N.J.: Pearson. [ISBN-13: 978-0-13-388643-6]  Data sets and programs on GitHub: https://github.com/mtpa/ 

 

Montavon, G., Orr, G. B., and Müller, K-R. (eds.) 2012. Neural Networks: Tricks of the Trade, second ed. New York: Springer. [ISBN-13: 978-3-642-35289-8] Electronic book available from the Springer collection at http://link.springer.com.turing.library.northwestern.edu/

 

 

Müller, A. C. and Guido, S. 2017. Introduction to Machine Learning with Python: A Guide for Data Scientists. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449369415]  Code examples at

https://github.com/amueller/introduction_to_ml_with_python 

 

Murphy, K. P. 2012. Machine Learning: A Probabilistic Perspective. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0-262-01802-9]

 

Nielsen, M. 2017. Neural Networks and Deep Learning. Online textbook available at

http://neuralnetworksanddeeplearning.com/   Code repository at

https://github.com/mnielsen/neural-networks-and-deep-learning

 

Rajaraman, A. and Ullman, J. D. 2012. Mining Massive Datasets. Cambridge, U.K.: Cambridge University Press. [ISBN-13: 978-1-107-01535-7]

 

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Volume I: Foundations. Cambridge, Mass.: MIT Press. [ISBN-10: 0-262-18120-7]

 

Schapire, R. E. and Freund, Y. 2014. Boosting: Foundations and Algorithms. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0-262-52603-6]

 

Segaran, T. 2007. Programming Collective Intelligence. Sebastopol, Calif.: O’Reilly.

[ISBN-13: 978-0-596-52932-1]

 

Shukla, N. 2018. Machine Learning with TensorFlow. Shelter Island, N.Y.: Manning.

[ISBN-13: 978-1617293870]

 

Tan, P-N., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Boston: Pearson/Addison Wesley. [ISBN-13: 978-0-321-32136-7]

 

Witten, I. H., Frank, E., and Hall, M. A. 2011. Data Mining: Practical Machine Learning Tools and Techniques, third edition. Burlington, Mass: Morgan Kaufmann. [ISBN-13: 978-0-12-374856-0]    

Traditional Statistics and Mathematics

Chatterjee, S. and Hadi, A. S. 2012.  Regression Analysis by Example, fifth edition. New York, NY: Wiley. [ISBN-13: 978-0470905845]  Data sets available from the second author at

http://www1.aucegypt.edu/faculty/hadi/RABE5/

 

Feldman, R. M. and Valdez-Flores, C. 2010.  Applied Probability and Stochastic Processes, second edition. New York: Springer. [electronic version ISBN-13: 978-3-642-05158-6] 

Available from Springer collection at:

 http://link.springer.com.turing.library.northwestern.edu/

 

Gentle, J. E. 2007.  Matrix Algebra: Theory, Computations, and Applications in Statistics. New York: Springer. [electronic version ISBN-13: 978-0-387-70873-7] 

Available from Springer collection at:

 http://link.springer.com.turing.library.northwestern.edu/

 

Kaufman, L. and Rousseeuw, P. J. 1990.  Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY: Wiley. [ISBN-10: 0-471-87876-6] 

 

Manly, B. F. J. and Alberto, J. A. N., 2017.  Multivariate Statistical Methods: A Primer, fourth edition. Boca Raton, Fla.: CRC Press. [ISBN-13: 978-1498728966]  There are R programs and data sets available at http://www.manly-biostatistics.co.nz/downloads 

 

McCullagh, P. and Nelder, J. A., 1989.  Generalized Linear Models, second edition. London: Chapman & Hall. [ISBN-10: 0-412-31760-5] 

 

Snedecor, G. W. and Cochran, W. G. 1967.  Statistical Methods, sixth edition. Ames, Iowa: Iowa University Press. [ISBN-10: 0-8138-1560-6]   

 

Weisberg, S., 2014.  Applied Linear Regression, fourth edition. New York: Wiley. [ISBN-13: 978-1118386088]

 

 

Programming: Python and Linux

Barrett, D. J. 2016. Linux Pocket Guide (3rd ed.). Sebastopol, Calif.: O’Reilly.

[ISBN-13: 978-1491927571]

 

Beazley, D. M. 2009. Python Essential Reference, fourth edition. Boston: Addison-Wesley.

[ISBN-13: 978-0672329784] 

 

Beazley, D. M. 2017. Python Programming Language LiveLessons. Old Tappan, N.J.: Pearson. Video available through Safari Books Online, Sebastopol, Calif.: O’Reilly.

 

Beazley, D. & Jones, B. K. 2013. Python Cookbook, third edition. Sebastopol, Calif.: O’Reilly.                  [ISBN-13: 978-1-449-34037-7]

 

Bird, S., Klein, E., and Loper, E. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-0-596-51649-9]

            

Chun, W. J. 2007. Core Python Programming (2nd ed.). Upper Saddle River, N.J.: Prentice Hall.         [ISBN-13: 978-0-13-226993-3]

 

Gift, N. and Jones, J. M. 2008. Python for Unix and Linux System Administrators: Efficient Problem Solving with Python. Sebastopol, Calif.: O’Reilly. (Chapter 2: IPython, pages 21–69.) [ISBN-13: 978-0-596-51582-9]

 

Hellmann, D. 2011. The Python Standard Library by Example. Upper Saddle River, N.J.:   Pearson/Addison-Wesley. [ISBN-13: 978-0-321-76734-9]

 

Hunt, A. and Thomas, D. 2000. The Pragmatic Programmer: From Journeyman to Master. Reading, Mass.:  Addison-Wesley.  [ISBN13: 978-0201616224] 

 

Keyser, J. 2016. How to Program: Computer Science Concepts and Python Exercises. Chantilly, Va.:  The Great Courses.  Video course available on DVD, electronic, and streaming media. 

 

Lubanovic, B. 2015. Introducing Python: Modern Computing in Simple Packages.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-35936-2]

 

McKinney, W. 2013. Python for Data Analysis: Agile Tools for Real-World Data.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31979-3]

 

Ramalho, L. 2015. Fluent Python: Clear, Concise, and Effective Programming.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-491-94600-8]

 

Rossant, C. 2014. Python Interactive Computing and Visualization Cookbook.  Birmingham, U.K.: Packt Publishing Ltd. [ISBN-13: 978-1-78328-481-8]

 

Shotts, W. E. 2012. The Linux Command Line: A Complete Introduction. San Francisco: No Starch Press. [ISBN-13: 978-1-59327-389-7]

 

Solem, J. E. 2012. Programming Computer Vision with Python: Tools and Algorithms for Analyzing Images.  Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1-449-31654-9]

 

Sweigart, A. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners.  San Francisco: No Starch Press. [978-1-59327-599-0]

 

VanderPlas, J. 2017. Python Data Science Handbook: Essential Tools for Working with Data. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1491912058] Python code examples at https://github.com/jakevdp/PythonDataScienceHandbook

 

Ward, B. 2015. How Linux Works: What Every Superuser Should Know (2nd ed.). San Francisco: No Starch Press. [ISBN-13: 978-1-59327-567-6]

Programming: R

Chambers, J. M. and Hastie, T. J., editors. 1992. Statistical Models in S. Pacific Grove, Calif.: Wadsworth & Brooks/Cole. [ISBN-10: 0-534-16764-0]

 

Chang, W. 2013. R Graphics Cookbook. Sebastopol, Calif.: O'Reilly. [ISBN-13: 9781449316952]

 

Davies, T. M. 2016. The Book of R: A First Course in Programming and Statistics. San Francisco: No Starch Press. [ISBN-13: 9781593276515]

 

Fox, J. & Weisberg, S. 2011. An R Companion to Applied Regression, second edition. Thousand Oaks, CA: Sage. [ISBN-13: 978-1412975148]

 

James, G., Witten, D., Hastie, T., & Tibshirani, R. 2014. An Introduction to Statistical Learning with Applications in R. New York: Springer. [ISBN-13: 978-1-4614-7137-0] Available as a free download at http://www-bcf.usc.edu/~gareth/ISL/ with exercise solutions at http://blog.princehonest.com/stat-learning/

 

Kabacoff, R. 2015. R in Action: Data Analysis and Graphics with R, second edition. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617291388] Code available at https://github.com/kabacoff/RiA2

 

Lander, J. P. 2014. R for Everyone: Advanced Analytics and Graphics. Upper Saddle River, N.J.: Pearson/Addison-Wesley. [ISBN-13: 978-0-321-88803-7]

 

Matloff, N. 2011. The Art of R programming: A Tour of Statistical Software Design. San Francisco: No Starch Press.  [ISBN-13: 978-1593273842]

 

Sarkar, D. 2008. Lattice: Multivariate Data Visualization with R. New York: Springer. [ISBN-13: 978-0-387-75968-2]

 

Venables, W. N. & Ripley, B. D. 2002. Modern Applied Statistics with S, fourth edition. New York: Springer. [ISBN-10: 0-387-95457-0]

 

Wickham, H. 2015. Advanced R. Boca Raton, Fla.: CRC Press/Chapman & Hall. [ISBN-13: 978-1-4665-8696-3]

Data Visualization

Bertin, J. 1983. Semiology of Graphics: Diagrams, Networks, Maps. Madison, Wisc.: University of Wisconsin Press.  [ESRI Press ISBN-13: 978-1589482616]

 

Cairo, A. 2013. The Functional Art: An Introduction to Information Graphics and Visualization. Berkeley, Calif.: Pearson/New Riders.  [ISBN-13: 978-0321834737]

 

Cairo, A. 2015. The Truthful Art: Data, Charts and Maps for Communication. Berkeley, Calif.: Pearson/New Riders. [ISBN-13: 978-0321934079]

 

Cleveland, W. S. 1993. Visualizing Data. Murray Hill, N.J.: AT&T Bell Laboratories. [ISBN-10: 0-9634884]

 

Cooper, A., Reimann, R., Cronin, D., and Noessel, C. 2014. About Face: The Essentials of Interaction Design (fourth ed.). New York: Wiley. [ISBN-13: 978-1118766576]

 

Dale, K. 2016. Data Visualization with Python & JavaScript: Scrape, Clean, Explore, and Transform Your Data. Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1491920510] Code available at https://github.com/Kyrand/dataviz-with-python-and-js

 

Few, S. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Oakland: Analytics Press. [ISBN-13: 978-0970601988]

 

Foote, S. 2015. Learning to Program. Upper Saddle River, N.J.: Addison-Wesley. [ISBN-13: 978-0-7897-5339-7]

 

Haverbeke, M. 2015. Eloquent JavaScript: A Modern Introduction to Programming (second ed.). San Francisco: No Starch Press. [ISBN-13: 978-1593275846] Available online at http://eloquentjavascript.net/ Code sandbox at http://eloquentjavascript.net/code/

 

Kirk, A. 2016. Data Visualization: A Handbook for Data Driven Design. Los Angeles: Sage. [ISBN-13: 978-1473912144]  Website: http://book.visualisingdata.com/home 

 

Knaflic, C. N. 2015. Storytelling with Data: A Data Visualization Guide for Business Professionals. New York: Wiley. [ISBN-13: 978-1119002253]

 

Krug, S. 2014. Don’t Make Me Think: A Common Sense Approach to Web Usability (third ed.). Upper Saddle River, N.J.: Pearson/New Riders. [ISBN-13: 978-0321965516]

 

Meeks, E. 2018. D3.js in Action: Data Visualization with JavaScript, second ed. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617294488]  D3 v4 code at https://github.com/emeeks/d3_in_action_2

Murray, S. 2017. Interactive Data Visualization for the Web: An Introduction to Designing with D3 (second ed.). Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1491921289] Code available from GitHub at https://github.com/alignedleft/d3-book

 

Purewal, S. 2014. Learning Web App Development: Build Quickly with Proven JavaScript Techniques. Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1449370190]

 

Robbins, J. N. 2012. Learning Web Design: A Beginner’s Guide to HTML, CSS, JavaScript, and Web Graphics (fourth ed.). Sebastopol, Calif.: O'Reilly. [ISBN-13: 978-1449319274]

 

Tufte, E. R. 2001. The Visual Display of Quantitative Information (second ed.). Cheshire, Conn.: Graphics Press. [ISBN-13: 978-0961392147]

 

Tukey, J. W. 1977. Exploratory Data Analysis. Reading, Mass.: Addison-Wesley. [ISBN-10: 0-201-07616-0]

 

Wilkinson, L. 2005. The Grammar of Graphics (second ed.). New York: Springer. [ISBN-13: 978-0387245447] Electronic edition available to Northwestern University students at http://link.springer.com.turing.library.northwestern.edu/

 

 

Web and Network Data Science

Barabási, A.-L. 2016. Network Science. Cambridge, UK: Cambridge University Press. [ISBN-13: 978-1107076266]

 

Büttcher, S., Clarke, C. L. A., and Cormack, G. V. 2010. Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, Mass.: MIT Press. [ISBN-13: 978-0262026512]

 

Campbell, S. and Swigart, S. 2014. Going Beyond Google: Gathering Internet Intelligence (fifth edition). Oregon City, OR: Cascade Insights. (Out of print resource, available as electronic book on the Canvas course site.)

 

Ceri, S. et al. 2013. Web Information Retrieval. New York: Springer. [ISBN-13: 978-3-642-39313-6] [ISBN-13 electronic edition: 978-3-642-39314-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/)

 

Gheorghe, R., Hinman, M. L., and Russo, R. 2016. Elasticsearch in Action. Shelter Island, N.Y.: Manning. [ISBN-13: 978-1617291623]

 

Gormley, C. and Tong, Z. 2015. Elasticsearch: The Definitive Guide. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1449358549]

 

Liu, B. 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. (2nd ed.). New York: Springer. [ISBN-13: 978-3-642-19459-7] [ISBN-13 for the electronic edition: 978-3-642-19460-3] (For Northwestern University students and faculty, Springer books are available for free electronic download at http://link.springer.com.turing.library.northwestern.edu/ )

 

Liu, B. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. New York: Cambridge University Press. [ISBN-13: 978-1-107-01789-4]

 

Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press. [ISBN-13: 978-0521865715] Available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

 

Mitchell, R. 2015. Web Scraping with Python: Collecting Data from the Modern Web. Sebastopol, Calif.: O’Reilly. [ISBN-13: 978-1491910290]  Code at https://github.com/REMitchell/python-scraping

 

Nolan, D. and Lang, D. T. 2014. XML and Web Technologies for Data Sciences with R. New York: Springer. [ISBN-13: 978-1-4614-7900-0]

 

Course Summary: