Month: February 2023

Module 06: Combining Data

  1. Prepare (due Mon 2/27)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Sun 3/5) [Link]
  4. Worked Example [Link]

Content

6.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.
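
As a quick preview of the two readings, here is a minimal sketch of grouping and pivoting in pandas. The `sales` table and its column names are invented for illustration and do not come from the handbook.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120, 90],
})

# Split-apply-combine: total and mean revenue for each region.
by_region = sales.groupby("region")["revenue"].agg(["sum", "mean"])

# Pivot table: regions as rows, quarters as columns, summed revenue in cells.
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="quarter", aggfunc="sum")

print(by_region)
print(pivot)
```

The pivot table is the same grouped sum, just grouped by two keys and laid out in two dimensions.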

6.B – Merging Data

  1. Record Linkage (8 min.)
  2. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section has been deprecated and is no longer available in current versions of pandas, so you can skip over the details on this parameter.
  3. Read Section 3.7 Merge and Join from Python Data Science Handbook
  4. Fuzzy Matching (21 min.)
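
The sketch below shows how concatenation, an exact merge, and a simple fuzzy match fit together. The tables are invented for illustration, and the `closest` helper is a hypothetical name; it uses `difflib` from the standard library, which may differ from the approach shown in the Fuzzy Matching video.

```python
import difflib
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
students_a = pd.DataFrame({"name": ["Ana Diaz", "Ben Lee"], "year": [1, 2]})
students_b = pd.DataFrame({"name": ["Cara Wu"], "year": [3]})
majors = pd.DataFrame({"name": ["Anna Diaz", "Ben Lee", "Cara Wu"],
                       "major": ["CS", "Stats", "Math"]})

# Concatenate: stack tables that share the same columns.
students = pd.concat([students_a, students_b], ignore_index=True)

# Exact merge: rows pair up only when the key strings are identical,
# so "Ana Diaz" finds no partner and gets a missing major.
exact = students.merge(majors, on="name", how="left")

def closest(name, candidates, cutoff=0.8):
    """Return the most similar candidate string, or None if none is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Fuzzy match: link each student name to its closest spelling in majors.
candidates = list(majors["name"])
students["match"] = [closest(n, candidates) for n in students["name"]]
fuzzy = students.merge(majors, left_on="match", right_on="name",
                       how="left", suffixes=("", "_majors"))

print(exact)
print(fuzzy)
```

The cutoff of 0.8 is an arbitrary choice for this toy data; in practice it trades missed links against false links and should be checked by hand on a sample.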

Optional Supplements

Module 05: Statistical Inference

  1. Prepare (due Mon 2/20)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Sun 2/26) [Link]
  4. Worked Example [Link]

Content

Note: the slides for this module have been updated. Please switch to the “slides” panel when viewing the video in Panopto. DO NOT stay on the “screen” panel, as the recorded screen shows the old slides (which contain typos and outdated information).

5.A – Confidence Intervals and Bootstrapping

  1. Intro Confidence Intervals (17 min.)
  2. Confidence Intervals in Python (17 min.)
  3. Misconceptions about Confidence Intervals (short read)
    OR
    The 3rd paragraph (starting with “As a technical note…”) in this link
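
To make the bootstrapping idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval using only the standard library. The function name `bootstrap_ci`, its defaults, and the sample values are assumptions made for illustration; the videos may use NumPy and a different variant of the method.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical measurements, invented for illustration only.
sample = [12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 12.3, 11.7, 10.2]
low, high = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the resampled means.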

5.B – Hypothesis Testing

  1. Intro Hypothesis Testing and Proportions (14 min.)
  2. Hypothesis Testing Means and More (33 min.)
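
As one illustration of testing a proportion, the sketch below runs a two-sided one-proportion z-test based on the normal approximation. The function name and the numbers (60 successes in 100 trials against a null proportion of 0.5) are hypothetical, and the videos may present a different test or use scipy.stats instead.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: true proportion == p0 (normal approximation)."""
    p_hat = successes / n
    # Standard error computed under the null hypothesis.
    se = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Probability of a result at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, null proportion 0.5.
z, p_value = one_proportion_z_test(successes=60, n=100, p0=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The normal approximation is only reasonable when both n * p0 and n * (1 - p0) are fairly large, which holds for these numbers.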

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For Module 5, the following optional readings may be particularly helpful supplements:

  • Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals elaborating on 5.A.1.
  • Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 5.B.1.
  • Chapters 7.1, 7.3, and 7.5 cover material from 5.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means respectively.
  • Chapter 6.3 discusses the chi-square test for categorical data introduced in 5.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.

Project: Proposal

Due: Sunday, March 5th

General Directions

The purpose of this document is to prepare your team for success in the course project. You should have feedback from your Initial Plan on the different research topics you have explored and are now introducing your chosen topic.  Your proposal should contain at least three parts, which we define below. In terms of length, it should be 1.5-3 pages (2 pages is typical) using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these three components, you should provide any additional context or information necessary to understand your vision for your project. You should convert your final document to a pdf and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIds in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

The proposal is out of 100 points. Meeting basic formatting requirements is worth 40 points and will be graded as follows:

  • E (Exemplary, 40pts) – Work that meets all requirements.
  • N (Not yet, 24pts) – Does not meet all requirements.
  • U (Unassessable, 8pts) –  Missing at least one section.

Part 1: Introduction and Research Questions (20 points)

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society. Additionally, we are looking for why your group believes this research project is worth your time in this course.

You should provide a brief justification of your research question(s) with respect to each of these three points. We recommend clearly marking this section by bolding the words substantial, feasible, and relevant when you provide your justification.

Remember to review the feedback you received from your Initial Plan and decide on a topic/research questions that meet the criteria above and spark interest in your group. This is a project that you will be working on for a significant portion of the semester. 

Grading

  • E (Exemplary, 20pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification for the research questions about whether they are substantial, feasible, and relevant. The justification is reasonable, clear, and appropriate for a CS216 project.
  • S (Satisfactory, 19pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification for the research questions about whether they are substantial, feasible, and relevant. However, the justification is clearly lacking in clarity or reasonableness for a CS216 project.
  • N (Not yet, 12pts) – Incomplete introduction where the research questions or justification are missing pieces, but at least some of it is present. Or the justification is clearly not reasonable.
  • U (Unassessable, 4pts) – Incomplete introduction where it is entirely missing the research questions or justification or does not demonstrate meaningful effort.

Part 2: Data Sources (20 points)

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Grading

  • E (Exemplary, 20pts) – Origins of data or methods to acquire data are properly specified, cited, and relevant to answering the research question(s). And if the data is not already available, the justification for why they expect they will have access to it soon is reasonable. (a.k.a. We are reasonably confident you’ll be able to get the data you need for your research questions.)
  • S (Satisfactory, 19pts) – Origins of data or methods to acquire data are properly specified and cited. However, the justification is not clear why the data is relevant to the proposed research question(s) OR the justification of why they expect they will have access to the data is not reasonable. (a.k.a. We are not entirely sure you’ll be able to get the data you need for your research questions.)
  • N (Not yet, 12pts) – Poorly specified data sources or methods to acquire data OR the justification for using that data set or the methods to acquire the data is lacking.
  • U (Unassessable, 4pts) – Data sources or methods to acquire data are missing or do not demonstrate meaningful effort.

Part 3: What Modules are You Using? (20 points)

Your project should utilize concepts from modules we have covered or will cover in this course to answer your research question(s). We will assume you will use the skills you have acquired from modules 1 (Python), 2 (Numpy/Pandas), and 3 (Probability). This section should state at least 3 more modules that you will utilize for your project. Each module should have a short description of how you will use the knowledge in this module and a justification for that use. In addition, include what concepts from the module you will use and at what stage of your project you plan to mostly use this module. Potential stages include, but are not limited to: data gathering, data cleaning, data investigation, data analysis, and final report.

  • Module 4: Data Wrangling
  • Module 5: Statistical Inference
  • Module 6: Combining Data
  • Module 7: Databases and SQL
  • Module 8: Visualization
  • Module 9: Prediction & Supervised Machine Learning

When the proposal is due, you may not yet have learned the material from some of the modules above. In this case, you should still list the modules that are applicable, with a description of the concepts you believe each module will cover that will be useful for answering your research question.

If you do not plan to use python, numpy, and pandas for your project, you must state this and explain why you are choosing not to. It is okay to use something else, like R, but keep in mind that the teaching staff may not have the skills to support you.

Grading

  • E (Exemplary, 20pts) – States at least 3 modules. For each module they provide a (1) short description of how they will use the module, (2) justification for using this module, (3) what concepts they will likely use, and (4) what stage they expect they will use it.
  • S (Satisfactory, 19pts) – States at least 3 modules, but there are some weaknesses somewhere, such as one module having 3 or more parts not well fleshed out, or one part being weak across all 3 modules.
  • N (Not yet, 12pts) – States at least 3 modules, but 3 or more parts are entirely missing or basically non-existent out of 12 = 4 parts X 3 modules.
  • U (Unassessable, 4pts) – Does not meet the Not Yet criteria, such as having fewer than 3 modules or missing more than 3 parts across all 12 = 4 parts X 3 modules.

Example:

Here is an example justification for Module 3, assuming the project is about creating a prediction model that classifies the data. Remember that this module is not on the list of modules that count as one of your 3, but you are welcome to include analysis using concepts from it. Note the bolding, which will help you ensure you are meeting all requirements and help your grader find them.

Module 3 Probability: We will use this module to calculate the accuracy of a baseline version of the model we will build. We will do this by considering the proportion of the label we are trying to predict, as well as taking into account some of the independent variables. Our justification is that we need a baseline accuracy to understand how good our model is. The concepts we will mainly use are the probability axioms and maybe some of Bayes or marginalization to calculate this baseline. We plan to use this module during the data analysis and final report stage.

Checklist Before You Submit:

  1. Does your proposal satisfy all general directions?
    1. 1.5-3 pages in length
    2. Standard margins (1 in.)
    3. Font size is 11-12 pt
    4. Line spacing is 1-1.5
    5. Final document is a pdf
  2. Do you have an Introduction and clearly stated Research Question(s)?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  3. Have you properly specified/cited one or more specific Data Sources or methods to acquire data and justified why they are relevant to the Research Questions?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  4. Did you state at least 3 Modules to be used, describe and justify how each will be used, and identify which concepts will be used at which stages of the project?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?

Exam 1

This post outlines what Exam 1 will be like.

There is a Sakai quiz called “Prepare 04.E: Exam 1 Logistics” that will count towards your Prepare 4. It is due Wednesday, 2/8.

Exam Logistics

  • Modules covered: 1, 2, and 3
  • Practice Exam (Link)
  • The exam consists of 2 parts.
    • Part 1 is in-person only.
    • Part 2 will be take-home. It is open book, open note, open internet, and closed to people.
  • Timeframe:
    • Part 1: During class Wednesday, 2/15.
    • Part 2: Open Thursday, 2/16, 12:01 AM, and close Saturday, 2/18, 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • There will be no class on Friday, 2/17.
  • The Exam Retake 1 will be during Exam 2. For each part, your Exam 1 score will be the maximum of your score on this exam and your score on the retake.
  • The exam must be done individually. It is a violation of class policy if you collaborate in any way with another person (in or not in the class) on the exam. You can only talk to the teaching staff about the exam.

Part 1

  • Is in-person only.
  • It will cover mainly probability.
  • It is a paper exam taken during class.
  • There will be multiple versions.
  • We will give you one reference sheet (TBA).
  • You may bring one piece of paper as a cheatsheet and can put things on the front and back.
  • You will not need a calculator. Instead, you will show your work and simply write the final numerical expression that would produce the final value. There will be no need to calculate the final value by hand.

Part 2

  • It will be take-home. It is open book, open note, open internet, and closed to people.
    • This means you cannot receive help on this exam from anyone. This includes (but is not limited to) communicating with a person while taking the exam, such as asking someone for help over the Internet (for example, on Stack Overflow).
  • It will cover mostly coding and some probability.
  • It is a Jupyter Notebook and a data set.
  • You will get the zip files inside a Sakai Quiz.
  • You will submit it on Gradescope.
    • During your testing period, you can submit as many times as you want to Gradescope. We will take the submission you mark as active, which is your last submission unless you change it using the history.
    • Gradescope will have tests, but they are sanity checks only. That means they are checking if the variable is the correct type and within the correct range. The vast majority of the points will be from hand grading. See the grading section below.
  • You will have 2 hours.
    • We do not expect you to need the entire 2 hours. However, it is not uncommon to get lost in a data set, and we wanted to account for that.
  • You can rely on the Sakai Quiz timer to tell you how much time you have left.
  • We will use your logged start time in Sakai to track if you submitted it to Gradescope on time.
    • If you submit after your allotted time, we will use the last submission within your allotted time. That includes marking it as zero if you do not submit within your time limit (so you will need to rely on the retake for your exam).
    • We recommend you submit to Gradescope periodically (after each problem) so you are not scrambling at the end trying to open Gradescope.
  • You do not need to do anything with Sakai after you retrieve your zip file from the quiz.
  • Protect the integrity of the exam and your exam submission.
    • Take your exam:
      • in a secure location where no one can see your screen or bother you.
      • in a place where you will not be distracted or tempted to talk to someone.
    • Only after grades have been published can you do the following. Doing any of these before grades are published will be considered a violation of the Duke Community Standard.
      • Discuss the exam.
      • Show your solutions to other students.
      • View other solutions.
  • If you have a question during the exam, ask it as a new private message on the class forum or in helper hours.
    • We cannot help you debug your code. If it appears as if the notebook or autograder is not working, but it turns out to be your own code that has a bug, you will be graded according to your submission.
    • We will do our best to always have someone checking the forum. However, we cannot make promises someone will instantly answer your question.
    • The exam is tested for readability, so the wording should be straightforward.

Grading Scale and Points Allocation

For the questions that do not have a clear correct or incorrect answer or where partial credit is warranted, the following rubric will be used.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material. And the code is clean and easy to read (see the practice exam for examples of what this means).
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

The number of points earned is distributed across the problems based on the number of learning goals they are testing. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit – 1
  • N = E_full_credit * 0.6
  • U = E_full_credit * 0.2
  • Blank = 0

Unit tests will earn you points up to, but not quite, the U level.
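
The conversion above can be sketched as a small function. The name `rubric_points` and the 20-point example are illustrative assumptions; the actual full-credit value differs from problem to problem.

```python
def rubric_points(level, full_credit):
    """Convert a rubric level to points for a problem worth full_credit points."""
    conversions = {
        "E": full_credit,          # full credit
        "S": full_credit - 1,      # one point below full credit
        "N": full_credit * 0.6,
        "U": full_credit * 0.2,
        "Blank": 0,
    }
    return conversions[level]

# Hypothetical problem worth 20 points.
for level in ["E", "S", "N", "U", "Blank"]:
    print(level, rubric_points(level, 20))
```

For a 20-point item this gives 20, 19, 12, 4, and 0, which matches the point values listed for each part of the project proposal rubric.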

Module 04: Data Wrangling

  1. Prepare (due Mon 2/6)
    1. Content below
    2. Sakai quizzes
  2. Peer Instructions – See on the class forum
  3. Homework (due Sun 2/12) [LINK]
  4. Worked Example [LINK]

Content (Slides in the Box folder)

4.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

4.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook
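
As a small taste of text wrangling, the sketch below combines basic string methods with a regular expression to clean some inconsistently formatted records. The records, the `phone_pattern` expression, and the `clean_record` helper are all invented for illustration and are not taken from the videos or the handbook.

```python
import re

# Hypothetical messy records, invented for illustration only.
raw = [
    "  Alice Smith, 919-555-0134 ",
    "BOB JONES,919.555.0187",
    "carol wu , (919) 555 0102",
]

# Three digits (optionally in parentheses), three digits, four digits,
# separated by any mix of spaces, dots, or hyphens.
phone_pattern = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")

def clean_record(line):
    """Return (name, phone) with normalized spacing, casing, and phone format."""
    name, _, rest = line.partition(",")
    name = " ".join(name.split()).title()  # collapse whitespace, fix casing
    match = phone_pattern.search(rest)
    phone = "-".join(match.groups()) if match else None
    return name, phone

cleaned = [clean_record(line) for line in raw]
print(cleaned)
```

The same pattern could be applied to a whole pandas column with the vectorized string methods covered in Section 3.10.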

Optional Supplements