Monthly Archives: October 2021

Final Project: Proposal

Due: Sunday 11/14

This should be a document 1-2 pages in length that includes the following parts:

  1. What question(s) you plan to address
  2. The data set you will use
  3. A group plan

You will submit your proposal on Gradescope using the group submission feature, just like prior group submissions.

Part 1: Introduction and Research Questions

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society.

You should provide a brief justification of your research question(s) with respect to each of these three points.

Part 2: Data Sources

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Part 3: Group Plan

This should be similar to your Group Plan 2 and should address items 3 through 7 from that post. You may also include a team name (item 2).

Feedback and Grading Rubric

Proposals will be evaluated on the following criterion-based rubric. Proposals satisfying all criteria will receive full credit. Formative feedback (comments and suggestions) will also be provided for each proposal.

  1. Satisfies general directions (length, on-time pdf submission, group submission, etc.)
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are substantial
  5. Provides a reasonable justification that research questions are feasible
  6. Provides a reasonable justification that research questions are relevant
  7. Includes one or more specific datasets or reasonable discussion of how to locate data
  8. Provides reasonable justification that data sources are appropriate for research questions
  9. Has a group plan that addresses (1) how you will communicate, (2) when, (3) where, and (4) how you will work together, and (5) a proposal of what happens if a team member cannot finish their planned work.

Final Project: Group formation

Due: Tuesday 11/9

For the final project, you will get to choose your own group of 3-4 people. You do not need to find a group if you do not know who you want to work with or you do not yet have enough members in your group. We will assign or form groups if needed. The form asks for your project interests, and we will match you to a group that aligns as closely with your interests as possible.

When you are ready, you will need everyone’s names and netids. Fill out the group formation form. You may fill out this form with fewer people, which we will take as an indication that you want to be together, and we will add more members to your group. If you have 3 people in your group, we may add a 4th to ensure everyone is in a group by the end of the group formation period.

Only fill this form out once per group. If your group fills this out more than once, we will take the last entry. Make sure to confirm with your group who is responsible for filling out the form.

The form already records who is filling it out, so the questions ask only about the other members of the group.

Project Ideas

It may help to start looking for project ideas and data sources as you form your groups.

Example Ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in CS216 Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding AirBnB’s Impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with lots and lots of data about reviews, businesses, images, and cities – text data, rich json data, etc.
  • The University of California Irvine maintains a large UCI ML repository of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small simple example datasets to large and complicated datasets from specific scientific domains.
  • has a huge compilation of data sets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and provides links to all kinds of information about the US Congress and all votes cast by its members.

Exam 2

This post outlines what Exam 2 will be like. The format will be very similar to Exam 1. Anything that is different from Exam 1 ([DIFF]) or new ([NEW]) is marked as such.

Exam Logistics

  • [DIFF] The exam will cover up to and including Module 7. It will emphasize Modules 4-7, but will include Module 1-3 content as needed because this class’s material is cumulative.
  • The exam will be take-home. It is open book, open note, open internet, but closed to people.
    • This means you cannot communicate with another person while taking the exam. That includes asking for or receiving help from someone over the Internet (on Stack Overflow, for example).
  • [DIFF] Timeframe: It must be completed on Thursday 11/04 between 10:15 am (start of class) and 11:59 pm.
    • The exam will close at 11:59 pm regardless of when you started.
  • The exam has two parts: Multiple Choice and Jupyter Notebook.
    • You may take a break between each part.
    • Both parts are timed through Sakai.
  • The exam must be done individually. It is a violation of class policy if you collaborate in any way with another person (in or not in the class) on the exam. You can only talk to the teaching staff about the exam.
  • Protect the integrity of the exam and your exam submission.
    • Do not talk to anyone about the exam during the exam period.
    • Take your exam in a secure location where no one can bother you.
    • Take your exam in a place where you will not be distracted or tempted to talk to someone.
  • The exam has randomized elements in it so no one’s exam will be identical to another person’s.
  • If you have a question during the exam, ask it as a new private message on the class forum, or on Zoom if a teaching staff member is on call at that time.
    • We will do our best to always have someone checking the forum; however, we cannot promise that someone will answer your question instantly.
    • Prof. Stephens-Martinez will be in the class Zoom during class time and in her office hours Zoom during her office hours, which are immediately after Thursday’s class.
    • [DIFF] David has office hours Thursday 12:30-1:30 pm ET.
    • The exam is tested for readability, so the wording should be straightforward.
  • [DIFF] There is no mock exam.

Multiple Choice Questions (30 minutes)

  • You will have 30 minutes to complete this part.
  • It will be a Sakai Quiz (like the homework).
  • You can submit only once.
  • You will not see your score until after the testing period is over.

[DIFF] Jupyter Notebook (45 minutes)

  • [DIFF] You will have 45 minutes to complete this part.
  • You will get your Jupyter Notebook zip file inside a Sakai Quiz that is not the multiple-choice part.
  • You will submit it on Gradescope.
  • [NEW] We strongly recommend you submit to Gradescope multiple times, such as after each question.
  • You can rely on the Sakai Quiz timer to tell you how much time you have left.
  • We will use your logged start time in Sakai to track if you submitted on Gradescope on time.
  • You do not need to do anything with Sakai after you retrieve your zip file from the quiz.
  • During your testing period, you can submit as many times as you want to Gradescope. We will take your last submission.
  • The autograder will tell you if your values are the correct type, but not necessarily if they are the correct value. There are hidden tests. Your score will only be revealed after we have finished all grading, including the manual grading part.
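To make the difference between a type check and a value check concrete, here is a minimal sketch. It is not the actual autograder: the function names `check_type` and `check_value` and the sample numbers are made up for illustration only.

```python
def check_type(value, expected_type):
    """Visible-style check: is the answer the right kind of object?"""
    return isinstance(value, expected_type)


def check_value(value, expected, tol=1e-9):
    """Hidden-style check: is the answer numerically correct?"""
    return abs(value - expected) <= tol


# A hypothetical answer that has the right type but the wrong value.
answer = 0.25
type_ok = check_type(answer, float)    # passes: it is a float
value_ok = check_value(answer, 0.75)   # fails: the expected value differs

print("type check passed:", type_ok)
print("value check passed:", value_ok)
```

The point of the sketch is that passing the visible checks only shows that an answer has the right form, so it is still worth verifying each value yourself before you submit.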

How to Prepare

See Exam 1’s post.

Module 8: Normal Curve, Correlation, Regression, and Least Squares

This module is 1 class period longer than usual to accommodate Exam 2, which is on Thursday 11/4.

  1. Videos
  2. Textbook (supplemental)
  3. Homework (Due Sunday 10/24, late 10/25)
    1. Part 1: Normal Curve
    2. Part 2: Correlation
    3. Part 3: Regression
    4. Part 4: Least Squares
  4. Group Worksheet
  5. Lab 08 (Due Tuesday 11/2, late 11/7)


Part 1: Normal Curve

  1. Standard Units (14:08)
  2. SD and Bell Curves (8:52)
  3. Normal Distribution (8:41)
  4. Central Limit Theorem (19:28)

Part 2: Correlation

  1. Visualization (15:12)
  2. Calculation (19:54)
  3. Interpretation (11:21)

Part 3: Regression

  1. Prediction (11:53)
  2. Linear Regression (18:37)
  3. Regression to the Mean (6:33)
  4. Regression Equation (22:38)
  5. Interpreting the Slope (3:22)

Part 4: Least Squares

  1. Linear Regression Review (optional, 5:25)
  2. Discussion Question (optional, 5:09)
  3. Squared Error (9:55)
  4. Least Squares (6:15)
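As a rough sketch of how the topics in Parts 1 through 4 connect, the snippet below computes standard units, the correlation coefficient, the regression line, and the mean squared error using only the Python standard library. The data values and function names are made up for illustration, and the sketch assumes the population standard deviation; the course materials may use different tools or conventions.

```python
from statistics import mean, pstdev


def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def correlation(x, y):
    """r is the mean of the products of x and y in standard units."""
    return mean(a * b for a, b in zip(standard_units(x), standard_units(y)))


def regression_line(x, y):
    """Slope is r * SD(y) / SD(x); the line passes through the point of averages."""
    r = correlation(x, y)
    slope = r * pstdev(y) / pstdev(x)
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


def mean_squared_error(x, y, slope, intercept):
    """Average squared vertical distance between the points and a line."""
    return mean((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = regression_line(x, y)
best_mse = mean_squared_error(x, y, slope, intercept)
other_mse = mean_squared_error(x, y, slope + 0.1, intercept)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("MSE of regression line:", round(best_mse, 3))
print("MSE of a nearby line:  ", round(other_mse, 3))
```

The regression line has a smaller mean squared error than the nearby line, which is the least squares idea: among all straight lines, the regression line minimizes the mean squared error.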


Module 7: Causality, Confidence Intervals, Interpreting Confidence, and Center & Spread

  1. Videos
  2. Textbook (supplemental)
  3. Homework (Due Sunday 10/17, late 10/18)
    1. Part 1: Causality
    2. Part 2: Confidence Intervals
    3. Part 3: Interpreting Confidence
    4. Part 4: Center and Spread
  4. Group Worksheet
  5. Lab 07 (Due Friday 10/22)


Part 1: Causality

  1. Introduction (7:29)
  2. Hypotheses (5:57)
  3. Test Statistic (3:09)
  4. Performing a Test (8:44)

Part 2: Confidence Intervals

  1. Percentiles (4:57)
  2. Estimation (9:29)
  3. Estimate Variability (7:22)
  4. The Bootstrap (21:10)

Part 3: Interpreting Confidence

  1. Applying the Bootstrap (11:05)
  2. Confidence Interval Pitfalls (5:54)
  3. Confidence Interval Tests (1:57)

Part 4: Center and Spread

  1. Introduction (16:27)
  2. Average and Median (8:35)
  3. Standard Deviation (12:49)
  4. Chebyshev’s Bounds (19:12)


Project 2

The zip file will be in the class Box folder in the Project folder. You will submit this as a group on Gradescope. This covers up to module 6. It is due Friday 10/29, late to Sunday 10/31.

To work collaboratively, you can choose to use Google Colab. Put the file in your Google Drive and share it with your group. When you open the file, it will open in Google Colab. You should all be able to work on the notebook at the same time. However, editing the same cell at the same time may not work. You may notice that the data files are loaded over the internet rather than from local paths. This change makes working with Colab easier, because Colab does not hold onto data files between uses.
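As a minimal sketch of what loading data over the internet can look like, the snippet below reads CSV text into rows using only the Python standard library. The helper names, the sample data, and the URL in the final comment are hypothetical; the project notebook's own loading code may use a different library.

```python
import csv
import io
from urllib.request import urlopen


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_remote_csv(url):
    """Download a CSV over HTTP(S) and parse it; nothing is saved to disk."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    return read_csv_rows(text)


# Offline demonstration with made-up inline data.
sample = "name,score\nA,1\nB,2\n"
rows = read_csv_rows(sample)
print(len(rows), "rows; first row:", rows[0])

# In a Colab cell, the remote version would be called like this
# (hypothetical URL, replace it with the one in the notebook):
# rows = load_remote_csv("https://example.com/project2/data.csv")
```

Because the data is fetched each time the cell runs, every group member gets the same data without uploading files to the Colab session.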

Group Plan 2

It’s time for round 2 of groups!

Some notes from the group reflection:

  • 77% found the group contracts useful or maybe useful.
  • 85% said they would want to do it again or maybe do it again.
  • 44% didn’t think the group contracts needed to change.
  • Common themes for change included:
    • Remove/change the roles section
    • Add a section for what to do if something comes up and someone cannot contribute as originally planned
    • More flexibility in what goes in the contract
    • Have a better plan on what, when, and how the group will work together.

Therefore, I’m going to require a group plan but not provide a template. There is still the group contract template in the prior post if your group wants to use it as a starting point. It’s a plan rather than a contract. This change of framing is to refocus how you all will use the document.

The following needs to be in your plan:

  1. Names of all team members
  2. Optional: Team name
    1. For inspiration, there are many team name generators on the internet.
  3. How you will communicate
  4. When you will work together
  5. Where you will work together (including a potential meeting outside of class)
  6. How you will work together
  7. Proposal for what to do if something happens to a team member and they cannot finish their planned work.
    1. Contacting Prof. Stephens-Martinez can be part of this proposal