Unlike most other libraries for this course, PyTorch is not included in the basic Anaconda installation. To use PyTorch, we suggest you choose one of two options.

1. Install PyTorch locally (for free). You can see the directions on the website: select the stable build, your operating system, Conda (for Anaconda), Python, and CPU to see install directions for your particular setup. (CUDA is used to support hardware acceleration with NVIDIA graphics cards and is not necessary for this course.)
2. Use PyTorch in a Jupyter notebook in the cloud (also for free). The easiest way to do this if you have a Google account is with a Google Colab notebook; PyTorch will already be available to you in this cloud environment.
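
Whichever option you choose, a quick check along the following lines can confirm that PyTorch is importable in your environment. This is only a sketch: it uses the standard library to look for the torch package (the name PyTorch installs under), so it runs even where PyTorch is missing. The helper name pytorch_status is made up for this example.

```python
import importlib.util


def pytorch_status():
    """Return a short message describing whether PyTorch can be imported."""
    # find_spec returns None when no package named "torch" is installed.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch not found: install it locally or use Google Colab."
    import torch  # imported only after we know it is available
    return f"PyTorch {torch.__version__} is installed."


status = pytorch_status()
print(status)
```

Run this in the same environment (Anaconda prompt, Jupyter notebook, or Colab cell) where you plan to do the coursework.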

You can find the official PyTorch documentation here. Of particular note are the PyTorch tutorials, including the PyTorch recipes, which serve as small examples of common tasks.

Book

The Deep Learning book is available free online and is authored by some of the leading experts in machine learning with deep artificial neural networks. It is very detailed and is intended purely for those who are interested in learning more about deep learning theory now or in the future; you do not need to read the book for this course.

Module 09: Databases and SQL

By Ruixin Zhang

On November 2, 2023

In Module

Prepare (due Mon 11/6)
1. Content below
2. Sakai quizzes
Peer Instructions – See on the class forum
Homework (due Sun 11/12) [LINK]
Worked Example [LINK]

Content

09.A – Relational Databases

Relational Database (24 min.)

09.B – SQL

SQL Querying (21 min.)
SQL with Python and Pandas (12 min.)

Optional Supplements

SQLite Command Line Interface: If you have a Mac/Linux machine, you should already be able to launch it by entering “sqlite3” in a terminal. If you have a Windows machine, you can download the command line interface from the Precompiled Binaries for Windows on the SQLite download page.
Python SQLite3 API Documentation
Pandas SQL Documentation
w3resource SQLite Tutorial
Database Schema Visualizer
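
As a small companion to the Python SQLite3 API documentation listed above, the following sketch creates an in-memory SQLite database, fills a table, and runs a parameterized query. The students table and its rows are made up purely for illustration.

```python
import sqlite3

# Use an in-memory database so nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table for illustration only.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)")
cur.executemany(
    "INSERT INTO students (name, year) VALUES (?, ?)",
    [("Ada", 2024), ("Grace", 2025), ("Alan", 2024)],
)
conn.commit()

# Parameterized query: the ? placeholder keeps values separate from the SQL text.
cur.execute("SELECT name FROM students WHERE year = ? ORDER BY name", (2024,))
rows = cur.fetchall()
print(rows)  # [('Ada',), ('Alan',)]

conn.close()
```

If pandas is installed, pandas.read_sql_query can run the same SELECT against an open connection and return the result as a DataFrame.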

Module 08: Prediction & Supervised Machine Learning

By Qianyu Yang

On October 19, 2023

In Module

Prepare (due Mon 10/30)
1. Content below
2. Canvas quizzes
Peer Instructions – See on the class forum
Homework (due Sun 11/5) [Link]
Worked Examples [Link]

Content (Slides in Box)

08.A – Predictive Modeling and Regression

Ordinary Linear Regression and Intro Scikit-Learn (21 min.)
Nonlinear Regression and Scikit-Learn Preprocessing (13 min.)
Binary Classification with Logistic Regression (22 min.)

Note: sklearn.metrics.plot_confusion_matrix, introduced on pp. 28-29 of the slides/video, is deprecated and has been removed from recent scikit-learn releases; use sklearn.metrics.ConfusionMatrixDisplay instead. To see the updated slides, switch to the “slides” panel when viewing the 08.A.III video in Panopto.

08.B – Machine Learning and Classification

Naïve Bayes and Text Classification (20 min.) – The video has a typo on slide 10; see the PDF of the slides in Box for the fix.
K-Nearest Neighbors and Training/Testing (31 min.)

Optional Supplements

Chapter 5, “Machine Learning,” from the Python Data Science Handbook provides a very nice treatment of many of the topics from the above videos and more. If you are new to machine learning, we highly recommend that you read sections 5.1 “What is Machine Learning” through 5.4 “Feature Engineering” after completing the videos. After that, you can optionally read any of the In-Depth sections about specific algorithms for prediction.

In addition, the scikit-learn documentation itself provides several resources for working with the library:

- Scikit-learn Getting Started and Scikit-learn tutorials provide some short introductory materials
- Scikit-learn examples has an extensive library of example applications with code
- Scikit-learn user guide explains the classes of models and features of the library
- Scikit-learn API reference contains the full API reference

Module 07: Statistical Inference

By Ruixin Zhang

On October 5, 2023

In Module

Prepare (due Mon 10/16)
1. Content below
2. Canvas quizzes
Peer Instructions – See on the class forum
Homework (due Sun 10/22) [Link]
Worked Example [Link]

Content

Note: the slides for this module have been updated. Please switch to the “slides” panel when viewing the video in Panopto. DO NOT stay on the “screen” panel, as the recorded screen shows the old slides (which contained typos and outdated information).

07.A – Confidence Intervals and Bootstrapping

Intro Confidence Intervals (17 min.)
Confidence Intervals in Python (17 min.)
Misconceptions about Confidence Intervals (short read)
OR
The 3rd paragraph (starting with “As a technical note…”) in this link

07.B – Hypothesis Testing

Intro Hypothesis Testing and Proportions (14 min.)
Hypothesis Testing Means and More (33 min.)

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For Module 7, the following optional readings may be particularly helpful supplements:

Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals, elaborating on 07.A.1.
Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 07.B.1.
Chapters 7.1, 7.3, and 7.5 cover material from 07.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means, respectively.
Chapter 6.3 discusses the chi-square test for categorical data introduced in 07.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.
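
As a companion to the confidence-interval and bootstrapping material in 07.A, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, written with only the Python standard library. The sample is simulated, and the function name bootstrap_ci and its parameters are assumptions made for this example; the videos may present the procedure differently.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up sample of 30 measurements.
sample = [random.gauss(10, 2) for _ in range(30)]


def bootstrap_ci(data, n_resamples=5000, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original data.
        resample = random.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    means.sort()
    # Take the middle `level` fraction of the sorted bootstrap means.
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

The interval describes uncertainty about the population mean given this one sample; it is not a range that contains 95% of the individual data points.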

Module 06: Combining Data

By Qianyu Yang

On September 28, 2023

In Module

Prepare (due Mon 10/9)
1. Content below
2. Canvas quizzes
Peer Instructions – See on the class forum
Homework (due Sun 10/15) [Link]
Worked Example [Link]

Content (Slides in the Box Folder)

06.A – Summarizing Data

Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
Read Section 3.9 Pivot Tables from Python Data Science Handbook.

06.B – Merging Data

Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section was deprecated and is no longer available in current versions of pandas; you can skip over the details on this parameter.
Read Section 3.7 Merge and Join from Python Data Science Handbook.
Record Linkage (8 min.)
Fuzzy Matching (21 min.)
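
For readers who still want the effect that join_axes used to provide in pd.concat, one possible replacement is to reindex the second frame to the first frame's columns before concatenating. The sketch below assumes pandas is installed; the frame names and values are made up for illustration.

```python
import pandas as pd

# Two made-up frames with partially overlapping columns.
df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]})
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]})

# Old (no longer works): pd.concat([df5, df6], join_axes=[df5.columns])
# One replacement: align df6 to df5's columns first, then concatenate.
result = pd.concat([df5, df6.reindex(columns=df5.columns)], ignore_index=True)
print(result)
```

The result keeps only columns A, B, and C; the rows that came from df6 have missing values in column A, and column D is dropped.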

Optional Supplements

Module 05: Probability

By Ruixin Zhang

On September 21, 2023

In Module

Prepare (due Mon 9/25)
1. Content below
2. Sakai quizzes
Video of the piece that got lost from Wednesday’s class
Peer Instructions – See on the class forum
Homework (due Sun 10/1) [Link]
Worked Examples [Link]

Content (Slides in the Box folder)

5.A – Foundations of Probability (52 min.)

Outcomes, Events, Probabilities (15 min.)
Joint and Conditional Probability (11 min.)
Marginalization and Bayes’ Theorem (15 min.)
Random Variables and Expectations (11 min.)

5.B – Distributions of Random Variables (46 min.)

Distributions, Means, Variance (19 min.)
Monte Carlo Simulation (15 min.)
Central Limit Theorem (12 min.)
1. Slide 26 in the video has a typo that is fixed in the pdf version of the slides on Box. In the video, it says the probability is <= 0.95, but it should say < 0.05.

Optional Supplements

Helpful YouTube videos to understand nuance with examples

Online Textbook and Documentation

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 5.B.1.
Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 5.B.2-3.

In addition, you can find documentation here for the two pseudorandom number-generating / sampling libraries in Python that we mentioned:

Python random – Base Python library
Numpy random – Numpy random sampling library
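
To connect these libraries to the Monte Carlo Simulation video, here is a minimal sketch that uses the base Python random module to estimate a probability by simulation. The dice example and the function name are assumptions chosen for illustration, not taken from the slides.

```python
import random

random.seed(42)  # fixed seed for reproducibility


def estimate_prob_sum_is_seven(n_trials=100_000):
    """Estimate P(sum of two fair dice == 7) by simulation."""
    hits = 0
    for _ in range(n_trials):
        roll = random.randint(1, 6) + random.randint(1, 6)
        if roll == 7:
            hits += 1
    return hits / n_trials


estimate = estimate_prob_sum_is_seven()
print(f"estimated: {estimate:.4f}, exact: {1/6:.4f}")
```

The same simulation can be vectorized with Numpy, for example by drawing all rolls at once with numpy.random.default_rng().integers(1, 7, size=(n_trials, 2)) and summing across each row.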

Module 03: Visualization

By Ruixin Zhang

On September 7, 2023

In Module

Prepare (due Mon 9/16)
1. Content below
2. Sakai quizzes
Class engagement – See on the class forum
Homework (due Sun 9/22) [Link]
Worked Examples [Link]

Content

03.A – Data Visualization and Design

Why Visualize? (11 min.)
Basic Plot Types (17 min.)
Dos and Don’ts (10 min.)

03.B – Visualization in Python

Intro to Python Visualization Landscape (7 min.)
Seaborn Introduction (17 min.)
Seaborn Examples (17 min.)

Optional Supplements

Module 02: Numpy & Pandas

By Qianyu Yang

On August 31, 2023

In Module

Prepare (due Mon 9/4)
1. Content below
2. Canvas quiz
Peer Instructions – See on the class forum
Homework (due Sun 9/10) [Link]
Worked Example [Link]

Content (Slides in the Box folder)

2.A – Numpy (1 hour)

Why Numpy (8 min.)
Numpy Array Basics (15 min.)
Numpy Universal Functions (20 min.)
Numpy Axis (14 min.)

2.B – Pandas (45 min.)

Why Pandas (7 min.)
Pandas Series (19 min.)
Pandas Dataframe (21 min.)

Optional Supplements

Numpy Beginner’s Tutorial
Chapter 2: Introduction to Numpy from Python Data Science Handbook
Numpy Documentation
10 Minutes to Pandas Tutorial
Pandas User Guide
Chapter 3: Data Manipulation with Pandas from Python Data Science Handbook (just the first three subsections)

Module 01: Python, Central tendency, & Jupyter Notebook

By Qianyu Yang

On August 29, 2023

In Module

Prepare (due Mon 8/28)
- Content below if you need a refresher on Python or central tendency
- Canvas quiz
- Install Anaconda (see the Resources page for more instructions)
Peer Instructions – See on the class forum
Homework (due Sun 9/3, 11:59 PM; late due Sun 9/10, no late tokens required) [Link]

Content (Slides in the Box folder)

1.A – Welcome to the class! (in-class on 8/30 or see recording)

1.B – Python3 (14 min.)

Python vs. Java (3 min.)
Data Types (2 min.)
Iteration, Functions, Classes (7 min.) – slide 19 has a typo; the PDF has been fixed
sorted() function documentation (2 min.)

1.C – Python for Data Science

Anaconda and Jupyter (10 min.)
Jupyter Notebook Demo (11 min.)

1.D – Central Tendency

A refresher/overview, if you need one, on the measures of central tendency: mean, median, and mode
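
As a quick illustration of the three measures, here is a short sketch using Python's built-in statistics module; the data values are made up.

```python
import statistics

# Made-up data set for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # arithmetic average: 30 / 6
median = statistics.median(values)  # middle value; average of 3 and 5 for even length
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

Here the mean (5) is pulled above the median (4) by the large value 10, while the mode (3) only reflects which value repeats most often.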
