Monthly Archives: February 2022

Module 08: Visualization

  1. Prepare (due M 3/14)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 3/20)
  4. Worked Examples

Content

8.A – Data Visualization and Design

  1. Why Visualize? (11 min.)
  2. Basic Plot Types (17 min.)
  3. Dos and Don’ts (10 min.)

8.B Visualization in Python

  1. Intro to Python Visualization Landscape (7 min.)
  2. Seaborn Introduction (17 min.)
  3. Seaborn Examples (17 min.)

Optional Supplements

Mini-Exam 2 and Mini-Exam 1 Retake

Mini-Exam 2 Logistics

  • Modules covered: 4, 5, and 6
    • Note: Even though it covers the same number of modules as Mini-Exam 1, that does not mean it assesses the same amount of knowledge. Module 01 was a preamble module to remind you of Python and to warm you up for the weekly cadence of the class.
  • Timeframe: It will open Thursday 3/3, 12:01 AM, and close Saturday 3/5, 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • The exam will be take-home. It is open book, open note, open internet, but closed to people.
    • This means you cannot communicate with a person about the exam, including asking for or receiving help from someone over the Internet (for example, on Stack Overflow).
  • Like Mini-Exam 1, it consists of 2 parts that each have a time limit of 2 hours. Both parts will have data sets and they will be different.
  • All other information is similar to Mini-Exam 1’s, such as getting the files, Gradescope, Sakai, asking for help, and grading policy.

Mini-Exam 1 Retake Logistics

  • Timeframe: It will open Wednesday 3/2, 12:01 AM, and close Saturday 3/5, 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • It assesses the same thing as Mini-Exam 1.
    • You may use things you have learned outside the modules this exam tests (such as .groupby() or .apply()), but every question can be answered using only the modules this exam tests.
  • The data sets and events will be different.
  • You do not need to do both parts. You may do only one part if you wish, but you must do ALL of the questions in that part. We will take the max score per part.
  • All other information is similar to Mini-Exam 1’s, such as getting the files, Gradescope, Sakai, asking for help, and grading policy.
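
The "max score per part" rule above can be sketched in a few lines of Python. This is only an illustration: the function name `final_part_scores` and all of the scores are hypothetical, and a part skipped on the retake simply keeps its original score.

```python
def final_part_scores(original, retake):
    """Keep the higher score for each part (hypothetical helper).

    A part missing from `retake` (because it was skipped) keeps its
    original score.
    """
    return {
        part: max(score, retake.get(part, score))
        for part, score in original.items()
    }


# Invented scores: this student retakes only Part 2.
original = {"Part 1": 38, "Part 2": 45}
retake = {"Part 2": 52}

final = final_part_scores(original, retake)
print(final)
```

Because the max is taken per part, a lower retake score can never reduce the recorded score for that part.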

Module 07: Databases & SQL

  1. Prepare (due M 2/21)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Su 2/27)
  4. Worked Example

Content

7.A – Relational Database (24 min.)

7.B

  1. SQL Querying (21 min.)
  2. SQL with Python and Pandas (12 min.)
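
As a warm-up for the videos in 7.B, here is a minimal sketch of running a SQL query from Python using the standard-library `sqlite3` module. The table name, column names, and rows are invented for illustration and are not taken from the course materials.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute(
    "CREATE TABLE enrollment (student TEXT, course TEXT, credits INTEGER)"
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [("ana", "DS101", 4), ("ana", "ST102", 4), ("ben", "DS101", 4)],
)

# Total credits per student, sorted by student name.
query = """
    SELECT student, SUM(credits) AS total_credits
    FROM enrollment
    GROUP BY student
    ORDER BY student
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

If pandas is installed, calling `pandas.read_sql_query(query, conn)` before closing the connection should return the same result as a DataFrame.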

Optional Supplements

Project: Group Formation

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation survey no later than Sunday, Feb 20th.

The form should only take a couple of minutes. If you already know who you want to work with, you can indicate that in the form. In this case, communicate with your group first and fill out the form only once, with everyone’s name and NetID. It’s also fine if you don’t know who you want to work with, in which case you can fill out the form and we will match you to a group.

If you would like to start thinking about possible projects, some ideas are listed below. You are not required to have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding Airbnb’s Impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with lots and lots of data about reviews, businesses, images, and cities – text data, rich JSON data, etc.
  • The University of California, Irvine maintains the UCI Machine Learning Repository, a large collection of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small, simple example datasets to large and complicated datasets from specific scientific domains.
  • Data.gov has a huge compilation of data sets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.
  • V7 maintains open data sets mainly focused on computer vision.

Project Proposal

Due: Sunday 3/6

General Directions

The purpose of this document is to prepare your team for success in the course project. Your proposal should contain at least three parts, which we define below. In terms of length, it should be 1-2 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these three parts, you should provide any additional context or information necessary to understand your vision for your project. You should convert your final document to a PDF and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

Part 1: Introduction and Research Questions

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society.

You should provide a brief justification of your research question(s) with respect to each of these three points.

While you are welcome to study whatever topic you like, the following have been popular themes in previous years: health and medicine, business and economics, sports analytics, social media analysis, politics and/or policy, gender and/or race. The Project Ideas section of the group formation post has many examples of topics.

Part 2: Data Sources

Your project should deal with real data. We provide pointers to some data sources in the Example Data Sources section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them, or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Part 3: Collaboration Plan

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This is reflective of the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is ok if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the platform you will use to meet.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or other chat apps? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.

* In addition to a common repository for data, you may find it useful to explore Google Colab or Deepnote, which allow you to collaborate on Jupyter notebooks and execute them in the cloud (like a Google Doc for Jupyter notebooks).

Feedback and Grading Rubric

Proposals will be evaluated on the following criterion-based rubric. Proposals satisfying all criteria will receive full credit. Formative feedback (comments and suggestions) will also be provided for each proposal by a teaching assistant who will be assigned as a project group mentor.

  1. Satisfies general directions (length, on-time pdf submission, group submission, etc.)
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are substantial
  5. Provides a reasonable justification that research questions are feasible
  6. Provides a reasonable justification that research questions are relevant
  7. Includes one or more specific datasets or reasonable discussion of how to locate data
  8. Provides reasonable justification that data sources are appropriate for research questions
  9. Collaboration plan specifies how responsibilities will be divided and about how much time on average each group member should expect to spend per week
  10. Collaboration plan specifies when and how team will meet, at least weekly
  11. Collaboration plan specifies platform/technology for communication between meetings and provides a link to a folder/repository for sharing data, code, etc.

Module 06: Combining Data

  1. Prepare (due M 2/14)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions — See on the class forum
  3. Homework (due Su 2/20)

Content

6.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.
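
As a quick preview of the two readings, the sketch below shows one grouped aggregation and one pivot table in pandas. The DataFrame and its column names are invented for illustration and do not come from the handbook.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "city": ["Durham", "Durham", "Raleigh", "Raleigh"],
        "year": [2020, 2021, 2020, 2021],
        "sales": [10, 14, 7, 9],
    }
)

# Aggregating and grouping: total sales per city.
totals = df.groupby("city")["sales"].sum()

# Pivot table: one row per city, one column per year.
table = df.pivot_table(
    values="sales", index="city", columns="year", aggfunc="sum"
)

print(totals)
print(table)
```

The pivot table holds the same numbers as a two-key groupby, arranged as a grid that is easier to scan.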

6.B – Merging Data

  1. Record Linkage (8 min.)
  2. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section has been deprecated in the pandas library, so you can skip over the details on this parameter.
  3. Read Section 3.7 Merge and Join from Python Data Science Handbook.
  4. Fuzzy Matching (21 min.)
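
Since the reading on concatenation still mentions `join_axes`, here is a small sketch of how the same effect can be obtained without it: concatenate first, then reindex the result to the columns you want. The two DataFrames are invented for illustration.

```python
import pandas as pd

# Hypothetical frames that share only column "B".
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# join="inner" keeps only the columns present in both frames.
shared = pd.concat([left, right], join="inner", ignore_index=True)

# Instead of join_axes=[left.columns], concatenate and then
# reindex the result to the columns of `left`.
like_left = pd.concat([left, right], ignore_index=True).reindex(
    columns=left.columns
)

print(shared)
print(like_left)
```

In `like_left`, rows that came from `right` have missing values in column "A", because `right` has no such column.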

Optional Supplements

Mini-Exam 1

This post outlines what Mini-Exam 1 will be like.

Exam Logistics

  • Modules covered: 1, 2, and 3
  • The exam will be take-home. It is open book, open note, open internet, but closed to people.
    • This means you cannot communicate with a person about the exam, including asking for or receiving help from someone over the Internet (for example, on Stack Overflow).
  • Timeframe: It will open Thursday 2/10, 12:01 AM, and close Saturday 2/12, 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • The exam consists of 2 parts.
    • Part 1 consists of only a Jupyter Notebook.
    • Part 2 consists of a Jupyter Notebook and a data set.
    • You will get the zip files inside a Sakai Quiz.
    • You will submit them on Gradescope.
    • You will have 2 hours for each exam part.
      • We do not expect you to need the entire 2 hours for each part. However, it is not uncommon to get lost in a data set, and we wanted to account for that.
    • You can rely on the Sakai Quiz timer to tell you how much time you have left.
    • We will use your logged start time in Sakai to track if you submitted on Gradescope on time.
      • If you submit after your allotted time, we will grade your last submission made within the allotted time. If you have no submission within your time limit, that part will be marked as zero (so you will need to rely on the retake for your exam).
      • We recommend you submit to Gradescope periodically (after each problem) so you are not scrambling at the end trying to open Gradescope.
    • You do not need to do anything with Sakai after you retrieve your zip file from the quiz.
    • During your testing period, you can submit as many times as you want to Gradescope. We will take the submission you mark as active, which is your last submission unless you change it using the history.
  • The exam must be done individually. It is a violation of class policy if you collaborate in any way with another person (in or not in the class) on the exam. You can only talk to the teaching staff about the exam.
  • Protect the integrity of the exam and your exam submission.
    • Do not talk to anyone about the exam during the exam period.
    • Take your exam in a secure location where no one can bother you.
    • Take your exam in a place where you will not be distracted or tempted to talk to someone.
  • If you have a question during the exam, ask it as a new private message on the class forum, or on Zoom if a teaching staff member is on call at that time.
    • We will do our best to always have someone checking the forum. However, we cannot promise that someone will answer your question instantly.
    • The exam is tested for readability, so the wording should be straightforward.
  • The Mini-Exam 1 Retake will take place during Mini-Exam 2. Your Mini-Exam 1 score will be the max of your score on this exam and your score on the retake.

Grading Scale and Points Allocation

Each section will be graded on a four-step rubric scale as follows.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material.
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

There are ~100 points possible and fewer than 10 questions. Points are allocated to the problems in proportion to the number of concepts each one tests. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit – 1
  • N = E_full_credit * 0.6
  • U = E_full_credit * 0.2
  • Blank = 0

This scheme ensures that earning an E or S on every problem results in an A, while a single U makes an A very unlikely. That is reasonable, since a U on a problem clearly shows a lack of mastery of the content for this exam.
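
The conversion above can be sketched as a short Python function. The function name `rubric_points`, the point values, and the marks below are hypothetical and chosen only to illustrate the arithmetic.

```python
def rubric_points(mark, full_credit):
    """Convert a rubric mark into points (hypothetical helper).

    `full_credit` is the number of points an E earns on the problem.
    """
    if mark == "E":
        return float(full_credit)
    if mark == "S":
        return full_credit - 1.0
    if mark == "N":
        return full_credit * 0.6
    if mark == "U":
        return full_credit * 0.2
    if mark == "Blank":
        return 0.0
    raise ValueError(f"unknown rubric mark: {mark!r}")


# Invented exam: four problems worth 100 points in total.
problems = [("E", 20), ("S", 20), ("S", 30), ("N", 30)]

total = sum(rubric_points(mark, points) for mark, points in problems)
possible = sum(points for _, points in problems)
print(f"{total:.1f} / {possible}")
```

Because an S costs exactly one point and there are fewer than 10 questions, an exam with an S on every problem loses at most 9 points in total.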