All posts by Dr Kristin Stephens-Martinez, Ph.D.

Project: Group Formation

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation survey no later than Monday, Sept 27th.

The form should only take a couple of minutes. If you already know who you want to work with you can indicate that in the form. In this case, communicate with your group first and only fill out the form once with everyone’s name/netid. It’s also fine if you don’t know who you want to work with, in which case you can fill out the form and we will match you to a group.

If it is helpful to start thinking about possible project ideas, below are some project ideas. But it is not required that you have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding AirBnBs impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with lots and lots of data about reviews, businesses, images, and cities – text data, rich json data, etc.
  • The University of California Irvine maintains a large UCI ML repository of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small simple example datasets to large and complicated datasets from specific scientific domains.
  • Data.gov has a huge compilation of data sets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes casted by its members.

Module 3B: Statistical Inference

  1. Prepare (soft due Th 9/16, hard due M 9/27)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due F 9/17, hard due M 9/27)
  3. Practice (due M 9/27)
  4. Perform (due M 10/11)

Content

3B.A – Confidence Intervals and Bootstrapping

  1. Intro Confidence Intervals (17 min.)
  2. Confidence Intervals in Python (17 min.)

3B.B – Hypothesis Testing

  1. Intro Hypothesis Testing and Proportions (14 min.)
  2. Hypothesis Testing Means and More (33 min.)

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For Module 3B, the following optional readings may be particularly helpful supplements:

  • Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals elaborating on 3B.A.1.
  • Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 3B.B.1.
  • Chapters 7.1, 7.3, and 7.5 cover material from 3B.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means respectively.
  • Chapter 6.3 discusses the chi-square test for categorical data introduced in 3B.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.

Module 3A: Data Wrangling

  1. Prepare (soft due Tu 9/14, hard due M 9/27)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 9/15, hard due M 9/27)
  3. Practice (due M 9/27)
  4. Perform (due M 10/11)

Content

3A.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

3A.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook

Optional Supplements

Module 2B: Probability

  1. Prepare (soft due Th 9/2, hard due M 9/13)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due F 9/3, hard due M 9/13)
  3. Practice (due M 9/13)
  4. Perform (due M 9/27)

Content

2B.A – Foundations of Probability (52 min.)

  1. Outcomes, Events, Probabilities (15 min.)
  2. Joint and Conditional Probability (11 min.)
  3. Marginalization and Bayes’ Theorem (15 min.)
  4. Random Variables and Expectations (11 min.)

2B.B – Distributions of Random Variables (46 min.)

  1. Distributions, Means, Variance (19 min.)
  2. Monte Carlo Simulation (15 min.)
  3. Central Limit Theorem (12 min.)

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

  • Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
  • Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 2B.B.1.
  • Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 2B.B.2-3.

In addition, you can find documentation for the two pseudorandom number generating / sampling libraries in python that we mentioned here:

Module 2A: Numpy & Pandas

  1. Prepare (soft due Tu 8/31, hard due M 9/13)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 9/1, hard due M 9/13)
  3. Practice (due M 9/13)
  4. Perform (due M 9/27)

Content

2A.A – Numpy (1 hour)

  1. Why Numpy (8 min.)
  2. Numpy Array Basics (15 min.)
  3. Numpy Universal Functions (20 min.)
  4. Numpy Axis (14 min.)

2A.B – Pandas (45 min.)

  1. Why Pandas (7 min.)
  2. Pandas Series (19 min.)
  3. Pandas Dataframe (21 min.)

Optional Supplements

Module 1: What is Data Science, Anaconda, Python, & Jupyter

  1. Prepare (soft due Th 8/26, hard due M 8/30)
    1. Content below
    2. See Sakai for quiz
    3. Install Anaconda
  2. Group Worksheet (soft due F 8/27, hard due M 8/30)
  3. Practice (due M 8/30) (Solution)
  4. No Perform

Content

1.A – What is Data Science? (in class or see recording)

1.B – Python3 (12 min.)

  1. Python vs. Java (3 min.)
  2. Data Types (2 min.)
  3. Iteration, Functions, Classes (7 min.)

1.C – Python for Data Science

  1. Anaconda and Jupyter (10 min.)
  2. Jupyter Notebook Demo (11 min.)

Optional Supplements