All posts by Dr Kristin Stephens-Martinez, Ph.D.

Project: Group Formation

September 18, 2021 | Project | Dr Kristin Stephens-Martinez, Ph.D.

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation survey no later than Monday, Sept 27th.

The form should take only a couple of minutes. If you already know who you want to work with, you can indicate that in the form. In this case, communicate with your group first and fill out the form only once, listing everyone’s name and NetID. It’s also fine if you don’t know who you want to work with; in that case, fill out the form and we will match you to a group.

If it helps to start thinking about possible projects, some ideas are listed below. You are not required to have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

Comparing Stock Market Losses between SARS and SARS-CoV-2
Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
Predicting North Carolina Election Outcomes
Relating Text Analysis of Corporate Reports and Stock Performance
Modeling Consumer Flight Behavior Based on Economic Indicators
Predicting COVID-19 Death Tolls from Google Search Trends
Sentiment Analysis of COVID-19 Tweets
Economic Status and Drug Overdose in North Carolina
Analyzing Gender and Tech Careers
Political Landscape According to Social Media
Forecasting Market Shocks and Performance using Article Headlines
Tracking Recidivism in US Prisons
Understanding Airbnb’s Impact on Evictions
Understanding Musical Tastes (Music Recommender System)
Human Impact on Climate since the Industrial Revolution
The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

Kaggle maintains several thousand public datasets on a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
The Yelp Dataset is provided by Yelp as a research challenge with a large amount of data about reviews, businesses, images, and cities – text data, rich JSON data, etc.
The University of California, Irvine maintains the UCI ML repository, a large collection of publicly contributed datasets aimed at machine learning tasks of all types. They range from small, simple example datasets to large and complicated datasets from specific scientific domains.
Data.gov has a huge compilation of datasets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.

Module 3B: Statistical Inference

September 10, 2021 | Module | Dr Kristin Stephens-Martinez, Ph.D.

Prepare (soft due Th 9/16, hard due M 9/27)
1. Content below
2. Sakai quizzes
Group Worksheet (soft due F 9/17, hard due M 9/27)
Practice (due M 9/27)
Perform (due M 10/11)

Content

3B.A – Confidence Intervals and Bootstrapping

Intro Confidence Intervals (17 min.)
Confidence Intervals in Python (17 min.)
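
As a rough illustration of the bootstrapping idea named in 3B.A, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It is not taken from the course videos: the function name bootstrap_ci, the sleep-hours data, and the choice of 2000 resamples are all assumptions made for the example, and it uses only the Python standard library.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Keep the middle `confidence` fraction of the resampled means.
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical data: hours of sleep reported by 12 students.
hours = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 9.0, 7.0, 6.5, 8.5, 7.0, 6.0]
low, high = bootstrap_ci(hours)
print(f"sample mean = {statistics.fmean(hours):.2f}")
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval is read directly off the sorted resample means, so no formula for the standard error is needed; that is the main appeal of the approach.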

3B.B – Hypothesis Testing

Intro Hypothesis Testing and Proportions (14 min.)
Hypothesis Testing Means and More (33 min.)

Optional Supplements

OpenIntro Statistics, an excellent free online textbook co-authored by Duke faculty, is available here. You can pay a suggested but adjustable price for a tablet-friendly PDF, or you can get the regular PDF for free. For Module 3B, the following optional readings may be particularly helpful supplements:

Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals elaborating on 3B.A.1.
Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 3B.B.1.
Chapters 7.1, 7.3, and 7.5 cover material from 3B.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means respectively.
Chapter 6.3 discusses the chi-square test for categorical data introduced in 3B.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.
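
To make the connection to scipy.stats concrete, the sketch below runs a two-sample t-test and a chi-square test of independence, two of the tests listed in the readings above. The section scores and the 2x2 table are hypothetical, and the choice of Welch's variant (equal_var=False) is an assumption for the example, not necessarily what the videos use.

```python
from scipy import stats

# Hypothetical quiz scores for two sections of a course.
section_a = [82, 75, 90, 68, 77, 85, 91, 73]
section_b = [70, 65, 80, 72, 60, 75, 68, 71]

# Welch's two-sample t-test for a difference of means
# (equal_var=False means equal variances are not assumed).
t_stat, p_value = stats.ttest_ind(section_a, section_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Chi-square test of independence on a hypothetical 2x2 table of counts.
table = [[30, 10], [20, 20]]
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {chi2_p:.4f}, dof = {dof}")
```

In both cases a small p-value is evidence against the null hypothesis (equal means, or independence of the two categorical variables).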

Module 3A: Data Wrangling

September 10, 2021 | Module | Dr Kristin Stephens-Martinez, Ph.D.

Prepare (soft due Tu 9/14, hard due M 9/27)
1. Content below
2. Sakai quizzes
Group Worksheet (soft due W 9/15, hard due M 9/27)
Practice (due M 9/27)
Perform (due M 10/11)

Content

3A.A – What is Wrangling

Data sources, formats, and importing (26 min.)
Common data cleaning problems (16 min.)
Read Section 3.4 Handling Missing Data from Python Data Science Handbook

3A.B – Wrangling Text

Python string operations (16 min.)
Introduction to regular expressions (18 min.)
Read Section 3.10 Vectorized String Operations from Python Data Science Handbook
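
The topics in 3A.B can be sketched with a small cleaning task that combines string methods and a regular expression. The raw rows, the field layout, and the helper name clean_row are hypothetical; the phone pattern is one simple assumption about the input, not a general-purpose validator.

```python
import re

# Hypothetical messy records, one "name, NetID, phone" string per row.
raw_rows = [
    "  Ada Lovelace , al123 , (919) 555-0100 ",
    "grace hopper,GH45,919.555.0101",
    "Alan Turing, at9, 919-555-0102",
]

# Ten-digit US-style phone number with optional punctuation between groups.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")

def clean_row(row):
    """Split one raw row on commas and normalize each field."""
    name, netid, phone = (field.strip() for field in row.split(","))
    match = PHONE_RE.fullmatch(phone)
    return {
        "name": name.title(),    # consistent capitalization
        "netid": netid.lower(),  # consistent case for matching
        "phone": "-".join(match.groups()) if match else None,
    }

cleaned = [clean_row(row) for row in raw_rows]
for record in cleaned:
    print(record)
```

In pandas, the .str accessor on a Series applies comparable string operations to a whole column at once, which is the subject of the handbook section above.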

Optional Supplements

Module 2B: Probability

August 22, 2021 | Module | Dr Kristin Stephens-Martinez, Ph.D.

Prepare (soft due Th 9/2, hard due M 9/13)
1. Content below
2. Sakai quizzes
Group Worksheet (soft due F 9/3, hard due M 9/13)
Practice (due M 9/13)
Perform (due M 9/27)

Content

2B.A – Foundations of Probability (52 min.)

Outcomes, Events, Probabilities (15 min.)
Joint and Conditional Probability (11 min.)
Marginalization and Bayes’ Theorem (15 min.)
Random Variables and Expectations (11 min.)
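
For a feel of how marginalization and Bayes’ theorem fit together, here is a small worked sketch. The function name posterior and the screening-test numbers are assumptions chosen for illustration only.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Marginalization: P(+) = P(+|C) P(C) + P(+|not C) P(not C)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(C|+) = P(+|C) P(C) / P(+)
    return sensitivity * prior / p_positive

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")  # prints 0.161
```

Even with a fairly accurate test, the posterior is only about 16% because the condition is rare, so most positive results come from the much larger group without it.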

2B.B – Distributions of Random Variables (46 min.)

Distributions, Means, Variance (19 min.)
Monte Carlo Simulation (15 min.)
Central Limit Theorem (12 min.)
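
The following sketch ties Monte Carlo simulation to the Central Limit Theorem by simulating averages of die rolls. The seed, the number of trials, and the helper name mean_of_rolls are arbitrary choices for the example, and it uses Python's random module rather than NumPy.

```python
import random
import statistics

rng = random.Random(216)  # fixed seed so the run is reproducible

def mean_of_rolls(n_rolls):
    """Average of n_rolls simulated fair six-sided die rolls."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n_rolls)])

# Monte Carlo simulation: repeat each experiment many times and
# summarize the distribution of the resulting sample means.
n_trials = 5000
center = {}
spread = {}
for n_rolls in (1, 10, 100):
    means = [mean_of_rolls(n_rolls) for _ in range(n_trials)]
    center[n_rolls] = statistics.fmean(means)
    spread[n_rolls] = statistics.stdev(means)
    print(f"n = {n_rolls:3d}: mean of means = {center[n_rolls]:.3f}, "
          f"std of means = {spread[n_rolls]:.3f}")
```

The sample means stay centered near the expected value 3.5 while their spread shrinks roughly like 1/sqrt(n), which is the behavior the Central Limit Theorem describes.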

Optional Supplements

OpenIntro Statistics, an excellent free online textbook co-authored by Duke faculty, is available here. You can pay a suggested but adjustable price for a tablet-friendly PDF, or you can get the regular PDF for free. For this module, the following optional readings may be particularly helpful supplements:

Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 2B.B.1.
Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 2B.B.2-3.

In addition, you can find documentation here for the two pseudorandom number generation and sampling libraries in Python that we mentioned:

Python random – Base Python library
Numpy random – Numpy random sampling library

Module 2A: Numpy & Pandas

August 22, 2021 | Module | Dr Kristin Stephens-Martinez, Ph.D.

Prepare (soft due Tu 8/31, hard due M 9/13)
1. Content below
2. Sakai quizzes
Group Worksheet (soft due W 9/1, hard due M 9/13)
Practice (due M 9/13)
Perform (due M 9/27)

Content

2A.A – Numpy (1 hour)

Why Numpy (8 min.)
Numpy Array Basics (15 min.)
Numpy Universal Functions (20 min.)
Numpy Axis (14 min.)
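
As a quick reminder of what universal functions and the axis argument do, here is a minimal sketch on a small 2-D array. The scores array is hypothetical data invented for the example.

```python
import numpy as np

# Hypothetical scores: 3 students (rows) by 4 quizzes (columns), out of 10.
scores = np.array([
    [8, 9, 7, 10],
    [6, 7, 9, 8],
    [10, 10, 9, 9],
])

# Arithmetic operators dispatch to universal functions, which work
# elementwise on the whole array with no explicit Python loop.
percent = scores / 10 * 100

# axis=0 collapses the rows: one mean per quiz (column).
quiz_means = scores.mean(axis=0)
# axis=1 collapses the columns: one mean per student (row).
student_means = scores.mean(axis=1)

print(percent)
print("per-quiz means:   ", quiz_means)
print("per-student means:", student_means)
```

A useful way to remember the convention: the axis you name is the one that disappears from the result's shape, so a (3, 4) array reduced along axis=0 has shape (4,).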

2A.B – Pandas (45 min.)

Why Pandas (7 min.)
Pandas Series (19 min.)
Pandas Dataframe (21 min.)

Optional Supplements

Numpy Beginner’s Tutorial
Chapter 2: Introduction to Numpy from Python Data Science Handbook
Numpy Documentation
10 Minutes to Pandas Tutorial
Pandas User Guide
Chapter 3: Data Manipulation with Pandas from Python Data Science Handbook (just the first three subsections)

Module 1: What is Data Science, Anaconda, Python, & Jupyter

August 22, 2021 | Module | Dr Kristin Stephens-Martinez, Ph.D.

Prepare (soft due Th 8/26, hard due M 8/30)
1. Content below
2. See Sakai for quiz
3. Install Anaconda
Group Worksheet (soft due F 8/27, hard due M 8/30)
Practice (due M 8/30) (Solution)
No Perform

Content

1.A – What is Data Science? (in class or see recording)

1.B – Python3 (12 min.)

1.C – Python for Data Science

Anaconda and Jupyter (10 min.)
Jupyter Notebook Demo (11 min.)

CompSci216, Fall 2021

Everything Data
