Category Archives: Module

Module 10: Deep Learning

  1. Prepare (due M 4/4)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 4/10)
  4. There are no worked examples

Content

10 Deep Learning

  1. Neural Networks and Applications (16 min.)
  2. Forward Propagation (10 min.)
  3. Gradient Descent (14 min.)
  4. Back Propagation (11 min.)
  5. Convolutional Neural Network (15 min.)
  6. Introducing Pytorch (23 min.)
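
The sketch below is a minimal, self-contained illustration of the ideas in the Forward Propagation and Gradient Descent videos. It uses numpy with made-up sizes and numbers and is not taken from the course slides or homework.

    import numpy as np

    # Forward propagation through one hidden layer with a ReLU activation
    # (toy sizes and random weights chosen only for illustration).
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # input vector
    W1 = rng.normal(size=(3, 4))      # hidden-layer weights
    W2 = rng.normal(size=(1, 3))      # output-layer weights
    hidden = np.maximum(0, W1 @ x)    # ReLU(W1 x)
    output = W2 @ hidden              # network output

    # Gradient descent on a simple loss f(w) = (w - 3)^2, whose gradient
    # is 2 * (w - 3); the iterates should approach the minimizer w = 3.
    w = 0.0
    learning_rate = 0.1
    for step in range(100):
        grad = 2 * (w - 3)
        w = w - learning_rate * grad
    print(w)   # approximately 3.0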

Optional Supplements

The Deep Learning book is available free online and is authored by some of the leading experts in machine learning with deep artificial neural networks. It is detailed and in-depth, intended purely for those who want to learn more about deep learning theory now or in the future; you do not need to read the book for this course.

Unlike most other libraries for this course, Pytorch is not included in the basic Anaconda installation. To use Pytorch, we suggest one of two options:

  • Install Pytorch locally (for free). Follow the directions on the Pytorch website: select the stable build, your operating system, Conda (for Anaconda), Python, and CPU to see the install directions for your particular setup. (CUDA is used for hardware acceleration with NVIDIA graphics cards and is not necessary for this course.)
  • Use Pytorch in a Jupyter notebook in the cloud (also for free). The easiest way to do this if you have a Google account is with a Google Colab notebook; Pytorch will already be available to you in that cloud environment.

You can find the official Pytorch documentation here. Of particular note are the Pytorch tutorials, including the Pytorch recipes, which serve as small examples of common tasks.
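
Once Pytorch is installed with either option above, a quick sanity check like the sketch below can confirm that tensors and autograd work. The tensor shapes and values here are arbitrary examples, not part of the course materials.

    # Quick Pytorch sanity check: build a small tensor computation and let
    # autograd compute a gradient. Shapes and values are arbitrary.
    import torch

    print(torch.__version__)               # confirms the install
    x = torch.randn(2, 3)                  # random 2x3 input tensor
    w = torch.randn(3, 1, requires_grad=True)
    y = (x @ w).sum()                      # tiny "forward pass"
    y.backward()                           # autograd fills in w.grad
    print(w.grad.shape)                    # torch.Size([3, 1])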

Module 09: Prediction & Supervised Machine Learning

  1. Prepare (due M 3/21)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 3/27)
  4. Worked Examples

Content

9.A – Predictive Modeling and Regression

  1. Ordinary Linear Regression and Intro Scikit-Learn (21 min.)
  2. Nonlinear Regression and Scikit-Learn Preprocessing (13 min.)
  3. Binary Classification with Logistic Regression (22 min.)

9.B – Machine Learning and Classification

  1. Naïve Bayes and Text Classification (20 min.) – The video has a typo on slide 10; see the pdf of the slides in Box for the fix.
  2. K-Nearest Neighbors and Training/Testing (31 min.)
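
As a minimal illustration of the workflow in the videos above (fit a classifier, then evaluate it on held-out data), here is a sketch using scikit-learn's built-in breast cancer dataset. The dataset and settings are examples only, not those used in the homework.

    # Train/test split and logistic regression with scikit-learn.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=5000)   # larger max_iter avoids convergence warnings
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))          # accuracy on the held-out test set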

Optional Supplements

Chapter 5 Machine Learning from the Python Data Science Handbook provides a very nice treatment of many of the topics from the above videos and more. If you are new to machine learning, we highly recommend that you read sections 5.1 “What is Machine Learning” through 5.4 “Feature Engineering” after completing the videos. After that, you can optionally read any of the In-Depth sections about specific algorithms for prediction.

In addition, the scikit-learn documentation itself provides several resources for working with the library.

Module 08: Visualization

  1. Prepare (due M 3/14)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 3/20)
  4. Worked Examples

Content

8.A – Data Visualization and Design

  1. Why Visualize? (11 min.)
  2. Basic Plot Types (17 min.)
  3. Dos and Don’ts (10 min.)

8.B – Visualization in Python

  1. Intro to Python Visualization Landscape (7 min.)
  2. Seaborn Introduction (17 min.)
  3. Seaborn Examples (17 min.)
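
For a first look at the kind of plots covered in the Seaborn videos, a minimal sketch follows. It uses seaborn's built-in "tips" example dataset (downloaded the first time it is loaded) rather than any dataset from the course.

    # Minimal seaborn example: scatter plot with a categorical hue.
    import matplotlib.pyplot as plt
    import seaborn as sns

    tips = sns.load_dataset("tips")      # small built-in example DataFrame
    sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
    plt.title("Tip vs. total bill")
    plt.show()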

Optional Supplements

Module 07: Databases & SQL

  1. Prepare (due M 2/21)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 2/27)
  4. Worked Example

Content

7.A – Relational Database (24 min.)

7.B

  1. SQL Querying (21 min.)
  2. SQL with Python and Pandas (12 min.)
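
The sketch below shows, with a made-up table, the general pattern from the videos above: run a SQL query from Python and pull the result into a pandas DataFrame, here via the standard-library sqlite3 module and an in-memory database.

    # SQL from Python with sqlite3, reading the result into pandas.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect(":memory:")   # throwaway in-memory database
    conn.execute("CREATE TABLE students (name TEXT, grade REAL)")
    conn.executemany("INSERT INTO students VALUES (?, ?)",
                     [("Ada", 93.5), ("Grace", 88.0)])
    conn.commit()

    df = pd.read_sql_query(
        "SELECT name, grade FROM students WHERE grade > 90", conn)
    print(df)
    conn.close()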

Optional Supplements

Module 06: Combining Data

  1. Prepare (due M 2/14)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 2/20)

Content

6.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.

6.B – Merging Data

  1. Record Linkage (8 min.)
  2. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section has been deprecated from the Pandas library; you can skip over the details on this parameter.
  3. Read Section 3.7 Merge and Join from Python Data Science Handbook.
  4. Fuzzy Matching (21 min.)
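
As a small, made-up illustration of the grouping and merging readings above, the sketch below aggregates one DataFrame and joins it to another on a shared key.

    # Grouping and merging with pandas on made-up data.
    import pandas as pd

    orders = pd.DataFrame({"customer": ["a", "a", "b"], "amount": [10, 20, 5]})
    names = pd.DataFrame({"customer": ["a", "b"], "name": ["Ada", "Grace"]})

    totals = orders.groupby("customer", as_index=False)["amount"].sum()  # total per customer
    merged = pd.merge(totals, names, on="customer", how="left")          # join on the shared key
    print(merged)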

Optional Supplements

Module 05: Statistical Inference

  1. Prepare (due M 2/7)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 2/13)
  4. Worked Example

Content

5.A – Confidence Intervals and Bootstrapping

  1. Intro Confidence Intervals (17 min.)
  2. Confidence Intervals in Python (17 min.)

5.B – Hypothesis Testing

  1. Intro Hypothesis Testing and Proportions (14 min.)
  2. Hypothesis Testing Means and More (33 min.)

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

  • Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals, elaborating on 5.A.1.
  • Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 5.B.1.
  • Chapters 7.1, 7.3, and 7.5 cover material from 5.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means, respectively.
  • Chapter 6.3 discusses the chi-square test for categorical data introduced in 5.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.
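
As one concrete example of what scipy.stats provides, the sketch below runs a one-sample t-test on synthetic data; the numbers are made up and the choice of test is illustrative, not an assignment.

    # One-sample t-test with scipy.stats on synthetic data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.3, scale=1.0, size=50)   # synthetic measurements

    result = stats.ttest_1samp(sample, popmean=0.0)
    print(result.statistic, result.pvalue)   # a small p-value suggests the mean differs from 0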

Module 04: Data Wrangling

  1. Prepare (due M 1/31)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 2/6)
  4. Worked Example

Content (Slides in the Box folder)

4.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

4.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook
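
Here is a small made-up example of the vectorized string operations and regular expressions covered above.

    # Vectorized string cleaning and a regular expression in pandas.
    import pandas as pd

    s = pd.Series([" Alice ", "bob", "CAROL, PhD", None])

    cleaned = s.str.strip().str.title()                            # trim whitespace, normalize case
    has_title = s.str.contains(r",\s*PhD", regex=True, na=False)   # regex match, missing -> False
    print(cleaned)
    print(has_title)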

Optional Supplements

Module 03: Probability

  1. Prepare (due M 1/24)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 1/30)
  4. Worked Examples

Content (Slides in the Box folder)

3.A – Foundations of Probability (52 min.)

  1. Outcomes, Events, Probabilities (15 min.)
  2. Joint and Conditional Probability (11 min.)
  3. Marginalization and Bayes’ Theorem (15 min.)
  4. Random Variables and Expectations (11 min.)

3.B – Distributions of Random Variables (46 min.)

  1. Distributions, Means, Variance (19 min.)
  2. Monte Carlo Simulation (15 min.)
  3. Central Limit Theorem (12 min.)
    1. Slide 26 in the video has a typo that is fixed in the pdf version of the slides on Box. In the video, it says the probability is <= 0.95, but it should say < 0.05.

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

  • Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
  • Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 3.B.1.
  • Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 3.B.2-3.

In addition, you can find the documentation here for the two pseudorandom number generation and sampling libraries in Python that we mentioned.
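
As a small illustration of the Monte Carlo ideas above, here is a sketch using numpy's random generator (one common choice; the course may use a different library). It estimates the probability that two dice sum to 7, whose true value is 1/6.

    # Monte Carlo estimate of P(two dice sum to 7) with numpy.
    import numpy as np

    rng = np.random.default_rng(0)
    rolls = rng.integers(1, 7, size=(100_000, 2))   # 100,000 pairs of fair dice (values 1-6)
    estimate = np.mean(rolls.sum(axis=1) == 7)
    print(estimate)   # close to 1/6 ≈ 0.167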

Module 02: Numpy & Pandas

  1. Prepare (due M 1/17)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions
    1. DataFrame Indexing: Round 1, Round 2
    2. Series Adding: Round 1, Round 2
    3. hstack/vstack: Round 1, Round 2
    4. Slicing: Round 1, Round 2
  3. Homework (due Su 1/23)
  4. Worked Example

Content (Slides in the Box folder)

2.A – Numpy (1 hour)

  1. Why Numpy (8 min.)
  2. Numpy Array Basics (15 min.)
  3. Numpy Universal Functions (20 min.)
  4. Numpy Axis (14 min.)

2.B – Pandas (45 min.)

  1. Why Pandas (7 min.)
  2. Pandas Series (19 min.)
  3. Pandas Dataframe (21 min.)
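
Here is a minimal sketch of the numpy and pandas ideas listed above (universal functions, the axis argument, and labeled pandas objects); the arrays are made up for illustration.

    # Numpy universal functions and axis, plus a labeled pandas DataFrame.
    import numpy as np
    import pandas as pd

    a = np.array([[1, 2, 3], [4, 5, 6]])
    print(np.sqrt(a))        # universal function applied elementwise
    print(a.sum(axis=0))     # column sums -> [5 7 9]
    print(a.sum(axis=1))     # row sums -> [ 6 15]

    s = pd.Series([10, 20, 30], index=["a", "b", "c"])
    df = pd.DataFrame({"x": s, "y": s * 2})
    print(df.loc["b"])       # label-based row access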

Optional Supplements

Module 01: What is Data Science, Anaconda, Python, & Jupyter

  1. Prepare (due M 1/10)
    1. Content below
    2. Quiz is on Sakai
    3. Install Anaconda
  2. Peer Instructions (these will open when we use them)
    1. lambda with min/max: Round 1, Round 2
    2. Sorting: Round 1, Round 2
    3. Notebooks I: Round 1, Round 2
    4. Notebooks II: Round 1, Round 2
  3. Homework (due Su 1/16)

Content (Slides in the Box folder)

1.A – What is Data Science? (in-class on 1/7 or see recording)

1.B – Python3 (12 min.)

  1. Python vs. Java (3 min.)
  2. Data Types (2 min.)
  3. Iteration, Functions, Classes (7 min.)
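
For a quick refresher on the 1.B topics (and the lambda/sorting peer instructions above), here is a small generic Python 3 sketch; the names and numbers are made up and not from the course.

    # Iteration, a function, a small class, and lambda with max/sorted.
    class Course:
        def __init__(self, name, enrollment):
            self.name = name
            self.enrollment = enrollment

    def total_enrollment(courses):
        return sum(c.enrollment for c in courses)   # iterate with a generator expression

    courses = [Course("Data Science", 120), Course("Algorithms", 95)]
    print(total_enrollment(courses))                       # 215
    print(max(courses, key=lambda c: c.enrollment).name)   # Data Science
    print(sorted(c.name for c in courses))                 # ['Algorithms', 'Data Science']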

1.C – Python for Data Science

  1. Anaconda and Jupyter (10 min.)
  2. Jupyter Notebook Demo (11 min.)

Optional Supplements