Monthly Archives: January 2025

Module 09: Databases and SQL

  1. Prepare (due Mon 03/24)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 04/06) [LINK]
  4. Worked Example [LINK]

Content

09.A – Relational Database

  1. Relational Database (24 min.)

09.B – SQL Python and Pandas

  1. SQL Querying (21 min.)
  2. SQL with Python and Pandas (12 min.)

Optional Supplements

OhHai Help Instructions

Here is a guide on how to use OhHai to ask office hour questions. 

1. You can access the application here: https://uta.cs.duke.edu. After logging in via Shibboleth, you should see your default view, which will be the Answer Questions page:

If you are a student in more than one course, you can select which course to submit a question to by using the drop-down menu and clicking the ‘Select’ button:

2. You will not be able to submit a question to a section that is ‘Closed’. Once the desired section changes its status to ‘Open’, you can submit a question by clicking on the submit icon:

This will bring up the Submit Question form modal:

Fill out this form and submit your question. You are only allowed to submit one question at a time. Once submitted, you will see your question in the queue along with your position in the queue.

3. When your question is ready to be answered, a modal window will open:

You should hear a sound when it is your turn to be helped. 

NOTE: The sound might not occur in Chrome if you are using an Apple computer.

4. After your session with the TA has ended, your question should now appear in your ‘Recently Answered Questions’ section. 

There you can rate how well the TA answered your question:

Module 04: Data Wrangling

  1. Prepare (due Mon 2/3)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/9) [LINK]
  4. Worked Example [LINK]

Content (Slides in the Box folder)

04.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

04.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook

Optional Supplements

Module 08: Prediction & Supervised Machine Learning

  1. Prepare (due Mon 3/23)
    1. Content below
    2. Canvas quizzes
  2. Class Participation – See on the class forum
  3. Homework (due Sun 3/23, late due 3/27) [Link]
  4. Worked Examples [Link]

Content (Slides in Box)

08.A Predictive Modelling and Regression

  1. Ordinary Linear Regression and Intro Scikit-Learn (21 min.)
  2. Nonlinear Regression and Scikit-Learn Preprocessing (13 min.)
  3. Binary Classification with Logistic Regression (22 min.)

Note: sklearn.metrics.plot_confusion_matrix, introduced on pp. 28-29 of the slides/video, is deprecated (and was removed in scikit-learn 1.2); use sklearn.metrics.ConfusionMatrixDisplay instead. To see the updated slides, switch to the “slides” panel when viewing the 08.A.III video in Panopto.

08.B Machine Learning and Classification

  1. Naïve Bayes and Text Classification (20 min.) – The video has a typo on slide 10; see the pdf of the slides in Box for the fix.
  2. K-Nearest Neighbors and Training/Testing (31 min.)

Optional Supplements

Chapter 5 Machine Learning from the Python Data Science Handbook provides a very nice treatment of many of the topics from the above videos and more. If you are new to machine learning, we highly recommend that you read sections 5.1 “What is Machine Learning” through 5.4 “Feature Engineering” after completing the videos. After that, you can optionally read any of the In-Depth sections about specific algorithms for prediction.

In addition, the scikit-learn documentation itself provides several resources for working with the library:

Module 03: Visualization

  1. Prepare (due Mon 1/27)
    1. Content below
    2. Sakai quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/2) [Link]
  4. Worked Examples [Link]

Content

03.A – Data Visualization and Design

  1. Why Visualize? (11 min.)
  2. Kinds of Data (7 min.)
  3. Basic Plot Types (12 min.)
  4. Dos and Don’ts (10 min.)

03.B – Visualization in Python

  1. Intro to Python Visualization Landscape (7 min.)
  2. Seaborn Introduction (17 min.)
  3. Seaborn Examples (17 min.)

Optional Supplements

Module 05: Probability

  1. Prepare (due Mon 2/10)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/16, late due Thurs 2/20) [Link]
  4. Worked Examples [Link]

Content (Slides in the Box folder)

5.A – Foundations of Probability (52 min.)

  1. Outcomes, Events, Probabilities (15 min.)
  2. Joint and Conditional Probability (11 min.)
  3. Marginalization and Bayes’ Theorem (15 min.)
  4. Random Variables and Expectations (11 min.)

5.B – Distributions of Random Variables (46 min.)

  1. Distributions, Means, Variance (19 min.)
  2. Monte Carlo Simulation (15 min.)
  3. Central Limit Theorem (12 min.)
    1. Slide 26 in the video has a typo that is fixed in the pdf version of the slides on Box. In the video, it says the probability is <= 0.95, but it should say < 0.05.

Optional Supplements

Helpful YouTube videos to understand nuance with examples

In the slides Box folder you will find additional resources on understanding Chebyshev's and Markov's inequalities.

Online Textbook and Documentation

You can access OpenIntro Statistics, an excellent free online textbook co-authored by Duke faculty, here. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

  • Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
  • Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 5.B.1.
  • Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 5.B.2-3.

In addition, you can find documentation for the two pseudorandom number-generating / sampling libraries in Python that we mentioned here:

Module 01: Python & Jupyter Notebook

  1. Prepare (due Mon 1/13)
  2. Class engagement – See on the class forum
  3. Homework (due Sun 1/19, 11:59 PM) [Link]

Content (Slides in the Box folder)

1.A – Python3 (14 min.)

  1. Python vs. Java (3 min.)
  2. Data Types (2 min.)
  3. Iteration, Functions, Classes (7 min.) – Slide 19 has a typo; the pdf has been fixed.
  4. sorted() function documentation (2 min.)

1.B – Python for Data Science (21 min.)

  1. Anaconda and Jupyter (10 min.)
  2. Jupyter Notebook Demo (11 min.)

Optional Supplements