Monthly Archives: February 2025

Module 07: Statistical Inference

  1. Prepare (due Mon 3/3)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 3/16) [Link]
  4. Worked Example [Link]

Content (Slides in the Box folder)

Note: the slides for this module have been updated. Please switch to the “slides” panel when viewing the video in Panopto. DO NOT stay on the “screen” panel, as the recorded screen shows the old slides (which contain typos and outdated information).

07.A – Confidence Intervals and Bootstrapping

  1. Intro Confidence Intervals (17 min.)
  2. Confidence Intervals in Python (17 min.)
  3. Misconceptions about Confidence Intervals (short read)
    OR
    The 3rd paragraph (starting with “As a technical note…”) in this link
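
As a rough illustration of the bootstrapping idea named in 07.A, the sketch below builds a 95% percentile bootstrap confidence interval for a mean. The sample values, the seed, and the number of resamples are made up for illustration and are not taken from the lectures, which may construct the interval differently.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical sample (illustrative values only).
sample = np.array([12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.9, 12.7, 11.0, 10.4])

n_resamples = 5000  # assumed number of bootstrap resamples

# Resample with replacement (same size as the original sample)
# and record the mean of each resample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_resamples)
])

# Percentile method: keep the middle 95% of the bootstrap means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean = {sample.mean():.2f}")
print(f"95% bootstrap CI = ({lower:.2f}, {upper:.2f})")
```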

07.B – Hypothesis Testing

  1. Intro Hypothesis Testing and Proportions (14 min.)
  2. Hypothesis Testing Means and More (33 min.)

Optional Supplements

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For Module 7, the following optional readings may be particularly helpful supplements:

  • Chapter 5.2 Confidence intervals for a proportion. This provides introductory material on confidence intervals, elaborating on 07.A.1.
  • Chapter 5.3 Hypothesis testing for a proportion. This elaborates on the introduction to hypothesis testing from 07.B.1.
  • Chapters 7.1, 7.3, and 7.5 cover material from 07.B.2 on using t-tests for a single mean, the difference of two means, and many pairwise means, respectively.
  • Chapter 6.3 discusses the chi-square test for categorical data introduced in 07.B.2.

In addition, here is the documentation for the scipy.stats library that implements most of the functionality described here as well as many other useful statistical functions.
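
For orientation, here is a minimal sketch of how a few of the tests mentioned above might be called through scipy.stats. The data values and the 2x2 table are hypothetical, and using Welch's version of the two-sample t-test is an assumption of this sketch, not necessarily the choice made in the videos.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two groups (illustrative values only).
group_a = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.7, 5.2])
group_b = np.array([4.4, 4.8, 4.6, 5.0, 4.3, 4.7, 4.9, 4.5])

# One-sample t-test: is the mean of group_a different from 5.0?
t_one, p_one = stats.ttest_1samp(group_a, popmean=5.0)

# Two-sample t-test (Welch's version, which does not assume equal variances).
t_two, p_two = stats.ttest_ind(group_a, group_b, equal_var=False)

# Chi-square test of independence on a hypothetical 2x2 table of counts.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

print(f"one-sample t: t = {t_one:.3f}, p = {p_one:.3f}")
print(f"two-sample t: t = {t_two:.3f}, p = {p_two:.3f}")
print(f"chi-square: chi2 = {chi2:.3f}, p = {p_chi2:.3f}, dof = {dof}")
```

Each call returns a test statistic and a p-value; chi2_contingency also returns the degrees of freedom and the expected counts under independence.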

Exam 01 Logistics: In-Person Exam

This post outlines the in-person part of Exam 1. See the Practicum 1 or Practicum 1 Update posts for details on the other parts.

  • Modules covered: 2 – 5
  • When: Wednesday 2/26, during regular class time
  • It is in-person only.
  • Bring a calculator.
  • It is a paper exam taken during class.
  • We will print and provide a reference sheet for you at the exam. You can preview it in the exam Box folder.
  • You may bring one piece of standard-sized paper as a cheatsheet and can put things on the front and back.
  • There will be multiple versions.
  • Code on the exam
    • It will have no code writing and focus more on thinking like a data scientist.
    • It will have code reading (so know what these functions do), in particular:
      • The results of calling the describe function on a data set.
      • The results of a seaborn function call: catplot, displot, or relplot.
    • You will not be tested on regular expressions on the paper exam.
    • The data set used for this exam is Seaborn’s taxis data set. We recommend familiarizing yourself with the columns’ meanings.
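
As a small reminder of what reading describe output involves, here is a sketch on a tiny hypothetical table. The rows are made up and only borrow a few column names from the taxis data set; the real data can be loaded with seaborn's load_dataset("taxis"), which downloads the file.

```python
import pandas as pd

# Hypothetical rows that only mimic a few numeric columns of the taxis data set.
trips = pd.DataFrame({
    "passengers": [1, 2, 1, 3],
    "distance": [1.0, 2.0, 3.0, 6.0],
    "fare": [6.0, 9.0, 12.0, 21.0],
})

# describe() summarizes each numeric column with eight rows:
# count, mean, std, min, 25%, 50% (the median), 75%, and max.
summary = trips.describe()
print(summary)
```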

Study Exams

  • Canvas Exam 1 Study Quiz
    • Worth 2 class engagement points
    • Includes randomized question pools drawn from all auto-gradable questions on past exams.
  • Study Exam in exam Box folder
    • You may see a question in here that is duplicated from the Canvas quiz. That is because part of it is not auto-gradable, and we wanted to ensure you saw what the question will look like on the actual exam.
    • Solutions for the exam in Box will be released on the Friday before the exam. This is to encourage everyone to try the study exams before looking at the solutions.

Grading Scale and Points Allocation

For the questions that do not have a clear correct or incorrect answer or where partial credit is warranted, the following rubric will be used.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material.
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

The number of points earned is distributed across the problems based on the number of learning goals they are testing. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit minus a small value, resulting in around E_full_credit * 0.9
  • N = E_full_credit * 0.6
  • U = E_full_credit * 0.2
  • Blank = 0
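
The conversion above can be sketched as a small lookup. The function name and the 10-point problem are hypothetical, and 0.9 is only the approximate S multiplier from the list; the actual S deduction may differ.

```python
# Multipliers from the list above; the S value is approximate ("around 0.9").
RUBRIC_MULTIPLIER = {"E": 1.0, "S": 0.9, "N": 0.6, "U": 0.2, "Blank": 0.0}


def rubric_points(level: str, full_credit: float) -> float:
    """Convert a rubric level into points for a problem worth full_credit."""
    return full_credit * RUBRIC_MULTIPLIER[level]


# Hypothetical problem worth 10 points.
for level in RUBRIC_MULTIPLIER:
    print(f"{level}: {rubric_points(level, 10):.1f} points")
```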

Exam 01 Logistics: Practicum

This post outlines the Practicum of Exam 1. See the in-person Exam 1 or Practicum 1 Update posts for details on the other parts.

  • Modules covered: 2 – 5
  • When: Friday 2/28 12:01am to Saturday 3/1 11:59pm
    • There is no class on Friday.
    • It should take around 2-3 hours to complete, but you can take as long as you want. It must be submitted before the deadline.
  • Study Practicum in exam Box folder
  • This can be done in a pair. See details below on the logistics, the definition of collaboration, and the consequences if collaboration happens without citation.
  • It is a take-home, open book, open note, open internet, and open LLM practicum.
    • Each question will have a variable you set to True or False to indicate if you used an LLM when answering this question.
  • It is closed to anyone other than you (and your partner if you have one). So, do not ask someone to do it for you or ask on sites like Stack Overflow.
  • It focuses on coding and interpreting the results of that code.
  • Consists of a Jupyter Notebook and a data set
    • Recommendation: Discuss in advance with your partner (if you have one) how you will create the final submission and who will submit it.
  • At the start of the practicum, a Canvas announcement will go out with a link to the Box folder containing all the files you need.
  • The act of submitting or being part of a submission means that you are upholding the Duke community standard that you contributed equally to this submission and only talked amongst yourselves when working on it.
  • Protect the integrity of the practicum and your submission.
    • Take your practicum:
      • In a secure location where only you (and your partner) can see your screen (and only your partner can talk to you).
      • In a place where you will not be distracted or tempted to talk to someone beyond your partner (if you have one).
    • You can do the following only after grades have been published for the Practicum Update. Doing any of these before grades are published will be considered a violation of the Duke Community Standard.
      • Discuss what you did on the practicum.
      • Show your solutions to other students.
      • View other solutions.
  • If you have a question during the practicum, ask it as a private new message on the class forum, in helper hours, or during class time when Prof. Stephens-Martinez will be in the helper hours Zoom room.
    • We cannot help you debug your code. If the notebook or autograder appears to be not working, but it turns out your code has a bug, you will be graded according to your submission.
    • We will do our best to always have someone checking the forum. However, we cannot promise that someone will instantly answer your question.
    • The practicum is tested for readability, so the wording should be straightforward.

Collaboration on the Practicum

  • Working in a pair means you collaborated on the Practicum.
    • Collaboration – 2 people have collaborated if one or both have given or received work/help on the Practicum. Notice these are “or’s.” That means if you share your Practicum with another person, even if that person did not give you anything in return, you both are now considered collaborators and should include each other in your notebook(s) as a partner.
    • This also means that if 2 people submit together and then 1 person shares that submission with a 3rd person, who then submits something too similar to have been done in isolation, all 3 are considered collaborators because it is impossible to detect who shared with whom. This collaboration is then considered a violation of the rules and, therefore, a violation of the Duke Community Standard.
  • The NetIds of all those who worked on the notebook must be listed in the notebook. There will be a 0-point test case with two variables for the NetIds of you and your partner. If you are solo, the notebook will state what to fill in for the other variable.
    • If you do not do this and we detect your notebooks as too similar to have been done in isolation, this is considered a violation of the Duke Community Standard.
  • You and your partner may submit notebooks separately or as a single submission. If you plan to submit identical files, submit as a single submission. Please help the graders be efficient.

Grading Scale and Points Allocation

This is the same as Exam 1’s in-person exam, with the following additions:

  1. For Exemplary – The code is clean and easy to read (see the study exam for examples of what this means).
  2. Unit tests in the autograder for the Practicum will earn you points up to, but not quite, the U level.
  3. How many fewer points an S is worth compared to an E depends on the practicum part. The practicum totals 100 points. The goal is that earning only S’s results in a low A. For example, if the Practicum has only 4 questions, an S would lose 2.5 points compared to an E, so getting all S’s yields 90%, a low A that still guarantees an A on the Practicum.
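
The arithmetic in the 4-question example can be checked with a few lines. Equal weighting of the questions is an assumption of this sketch; the actual point split depends on the practicum part.

```python
total_points = 100  # the practicum totals 100 points
n_questions = 4     # number of questions in the example above
s_penalty = 2.5     # points an S loses compared to an E in the example

# Assumption: every question is worth the same number of points.
points_per_question = total_points / n_questions

# Total earned if every question receives an S.
all_s_total = n_questions * (points_per_question - s_penalty)

print(f"each question: {points_per_question} points")
print(f"all S's: {all_s_total} of {total_points} points")
```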

Exam 01 Logistics: Practicum Update

This post outlines the Practicum Update part of Exam 1. See the in-person Exam 1 or Practicum 1 posts for details on the other parts.

  • Your group will have the option to update your Practicum after seeing the results of your Practicum grade. If you choose to submit an update, your grade for the Practicum will be as follows:
    • Practicum (original): 15%
    • Practicum Update: 85%
  • When: Thursday 3/6 – Saturday 3/8
    • This is during Module 7.
  • For the update, you will do the following:
    • Update your original notebook as needed.
    • Fill in the template diff cell at the top of the Practicum and list all of the changes you made from your original submission.
      • This is worth 0.5 points per question.
      • We may not grade your update properly if you do not do this.
  • We may grade outside of your changes because the Practicum aims to show your competency level in the material, not your competency + what the graders accidentally miss in the first grading.
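
For a sense of how the 15% / 85% weighting combines the two submissions, here is a minimal sketch. The function name and the two scores are hypothetical, and the 0.5 points per question for the diff cell are not modeled.

```python
def practicum_grade(original: float, update: float) -> float:
    """Combine the two scores with the 15% / 85% weights stated above."""
    return 0.15 * original + 0.85 * update


# Hypothetical scores: 70 on the original Practicum, 95 on the update.
combined = practicum_grade(70, 95)
print(f"combined Practicum grade: {combined:.2f}")
```

This weighting applies only if your group chooses to submit an update.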

Module 06: Combining Data

  1. Prepare (due Mon 2/17)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/23) [Link]
  4. Worked Example [Link]

Content (Slides in the Box Folder)

06.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.
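
As a quick preview of the two readings, the sketch below groups and pivots a small hypothetical table. The column names and values are invented for illustration and do not come from the handbook.

```python
import pandas as pd

# Hypothetical sales records (illustrative values only).
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "revenue": [100, 150, 200, 50, 300, 100],
})

# Grouping and aggregating: total and mean revenue per region.
by_region = sales.groupby("region")["revenue"].agg(["sum", "mean"])

# Pivot table: mean revenue for each region/product combination.
pivot = sales.pivot_table(values="revenue", index="region",
                          columns="product", aggfunc="mean")

print(by_region)
print(pivot)
```

The pivot table holds the same kind of result as grouping by both region and product, laid out with one key on the rows and the other on the columns.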

06.B – Merging Data

  1. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section has been deprecated in the pandas library, so you can skip over the details on this parameter.
  2. Read Section 3.7 Merge and Join from Python Data Science Handbook.
  3. Table Relationships (4 min.)
  4. Which Join to Use (4 min.)
  5. Record Linkage (8 min.)
  6. Fuzzy Matching (21 min.)
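
To make the join types and fuzzy matching a little more concrete, here is a minimal sketch. The tables and names are hypothetical, and using the standard library's difflib for fuzzy matching is an assumption of this sketch; the video may use a different tool.

```python
import difflib

import pandas as pd

# Hypothetical tables that share a "netid" key.
students = pd.DataFrame({"netid": ["ab1", "cd2", "ef3"],
                         "name": ["Ana", "Chen", "Eli"]})
scores = pd.DataFrame({"netid": ["ab1", "cd2", "gh4"],
                       "score": [90, 85, 70]})

# Inner join: only keys present in both tables.
inner = students.merge(scores, on="netid", how="inner")
# Left join: every student, with NaN where no score exists.
left = students.merge(scores, on="netid", how="left")
# Outer join: every key from either table.
outer = students.merge(scores, on="netid", how="outer")

# Fuzzy matching: find the known name closest to a messy entry.
known_names = ["Duke University", "Drake University", "UNC Chapel Hill"]
best = difflib.get_close_matches("Duke Univ", known_names, n=1, cutoff=0.6)

print(len(inner), len(left), len(outer))
print(best)
```

Here the inner join keeps 2 rows, the left join keeps 3, and the outer join keeps 4, which is one way to sanity-check which join a task calls for.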

Optional Supplements