Monthly Archives: September 2021

Module 4: Combining Data

There is only one module for learning sprint 4. The rest of your time should be spent on your project.

  1. Prepare (soft due Tu 9/28, hard due M 10/11)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 9/29, hard due M 10/11)
  3. Practice (due M 10/11)
  4. Perform (due M 10/25)

Content

4.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.

4.B – Merging Data

  1. Record Linkage (8 min.)
  2. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the optional join_axes parameter mentioned in this section has since been deprecated in the pandas library, so you can skip over the details on this parameter.
  3. Read Section 3.7 Merge and Join from Python Data Science Handbook.
  4. Fuzzy Matching (21 min.)
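
If you are following along with the Section 3.6 reading in a recent version of pandas, a call that passes join_axes to pd.concat will raise an error. The following is a minimal sketch of two common substitutes, assuming a reasonably current pandas install. The frames df5 and df6 and their values are hypothetical toy data, not taken from the course materials.

```python
import pandas as pd

# Hypothetical toy frames with partially overlapping columns.
df5 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}, index=[1, 2])
df6 = pd.DataFrame({"B": [7, 8], "C": [9, 10], "D": [11, 12]}, index=[3, 4])

# Older pandas allowed: pd.concat([df5, df6], join_axes=[df5.columns])
# One substitute: concatenate first, then reindex to the columns you want.
kept = pd.concat([df5, df6]).reindex(columns=df5.columns)

# Another option: join="inner" keeps only the columns shared by both frames.
shared = pd.concat([df5, df6], join="inner")

print(kept)
print(shared)
```

Reindexing keeps column A even though df6 lacks it (the missing entries become NaN), while the inner join drops every column that is not present in both frames.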

Optional Supplements

Project: Proposal

Due: Monday 10/11

General Directions

The purpose of this document is to prepare your team for success in the course project. Your proposal should contain at least three parts, which we define below. In terms of length, it should be 1-2 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these three components, you should provide any additional context or information necessary to understand your vision for your project. You should convert your final document to a pdf and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIds in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

Part 1: Introduction and Research Questions

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society.

You should provide a brief justification of your research question(s) with respect to each of these three points.

While you are welcome to study whatever topic you like, the following have been popular themes in previous years: health and medicine, business and economics, sports analytics, social media analysis, politics and/or policy, gender and/or race. The Project Ideas section of the group formation post has many examples of topics.

Part 2: Data Sources

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Part 3: Collaboration Plan

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This is reflective of the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is ok if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the platform you will use to meet.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or other chat apps? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab, which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.

* In addition to a common repository for data, you may find it useful to explore Google Colab, which allows you to collaborate on Jupyter notebooks and execute them in the cloud (like a Google Doc for Jupyter notebooks).

Feedback and Grading Rubric

Proposals will be evaluated on the following criterion-based rubric. Proposals satisfying all criteria will receive full credit. Formative feedback (comments and suggestions) will also be provided for each proposal by a teaching assistant who will be assigned as a project group mentor.

  1. Satisfies general directions (length, on-time pdf submission, group submission, etc.)
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are substantial
  5. Provides a reasonable justification that research questions are feasible
  6. Provides a reasonable justification that research questions are relevant
  7. Includes one or more specific datasets or reasonable discussion of how to locate data
  8. Provides reasonable justification that data sources are appropriate for research questions
  9. Collaboration plan specifies how responsibilities will be divided and about how much time on average each group member should expect to spend per week
  10. Collaboration plan specifies when and how team will meet, at least weekly
  11. Collaboration plan specifies platform/technology for communication between meetings and provides a link to a folder/repository for sharing data, code, etc.

Project: Group Formation

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation survey no later than Monday, Sept 27th.

The form should only take a couple of minutes. If you already know who you want to work with, you can indicate that in the form. In this case, communicate with your group first and fill out the form only once with everyone’s name and NetId. It’s also fine if you don’t know who you want to work with, in which case you can fill out the form and we will match you to a group.

If it helps to start thinking about possible projects now, some ideas are listed below. You are not required to have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding AirBnB’s Impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with lots and lots of data about reviews, businesses, images, and cities – text data, rich json data, etc.
  • The University of California Irvine maintains the UCI ML repository, a large collection of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small simple example datasets to large and complicated datasets from specific scientific domains.
  • Data.gov has a huge compilation of datasets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.

Module 3B: Statistical Inference

  1. Prepare (soft due Th 9/16, hard due M 9/27)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due F 9/17, hard due M 9/27)
  3. Practice (due M 9/27)
  4. Perform (due M 10/11)

Content

3B.A – Confidence Intervals and Bootstrapping

  1. Intro Confidence Intervals (17 min.)
  2. Confidence Intervals in Python (17 min.)
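
As a rough illustration of the bootstrapping idea named in this section, here is a minimal sketch of a percentile bootstrap confidence interval for a sample mean, written with only the Python standard library. It is not necessarily the approach taken in the videos. The function name bootstrap_ci, its parameters, and the data values are all hypothetical choices made for this example.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical measurements, made up for illustration.
data = [12.1, 9.8, 11.4, 10.3, 13.0, 10.9, 9.5, 12.6, 11.1, 10.0]
low, high = bootstrap_ci(data)
print(f"sample mean = {statistics.mean(data):.2f}")
print(f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the resampled means, so no distributional formula is needed. A larger n_resamples gives more stable endpoints at the cost of run time.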

3B.B – Hypothesis Testing

  1. Intro Hypothesis Testing and Proportions (14 min.)
  2. Hypothesis Testing Means and More (33 min.)

Optional Supplements

You can access OpenIntro Statistics, an excellent free online textbook co-authored by Duke faculty, here. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For Module 3B, the following optional readings may be particularly helpful supplements:

  • Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals elaborating on 3B.A.1.
  • Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 3B.B.1.
  • Chapters 7.1, 7.3, and 7.5 cover material from 3B.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means, respectively.
  • Chapter 6.3 discusses the chi-square test for categorical data introduced in 3B.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.

Module 3A: Data Wrangling

  1. Prepare (soft due Tu 9/14, hard due M 9/27)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 9/15, hard due M 9/27)
  3. Practice (due M 9/27)
  4. Perform (due M 10/11)

Content

3A.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

3A.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook
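
To give a flavor of the two text-wrangling tools listed above, here is a small sketch that combines ordinary string methods with one regular expression from the standard re module. The records, the clean function, and the name/phone layout are hypothetical and invented for illustration only.

```python
import re

# Hypothetical messy "name, phone" records of the kind a scraped table might hold.
raw = [
    "  Alice SMITH , 919-555-0101 ",
    "bob jones,(919) 555 0102",
    "Carol Lee , 9195550103",
]

def clean(record):
    name, phone = record.split(",", 1)
    # String methods: collapse extra whitespace and normalize capitalization.
    name = " ".join(name.split()).title()
    # Regular expression: remove every non-digit character from the phone field.
    digits = re.sub(r"\D", "", phone)
    return name, digits

cleaned = [clean(r) for r in raw]
print(cleaned)
```

The same pattern carries over to pandas through the vectorized .str accessor covered in the Section 3.10 reading.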

Optional Supplements