All posts by Dr Kristin Stephens-Martinez, Ph.D.

Exam 01 Logistics: In-Person Exam

This post outlines the in-person part of Exam 1. See the Practicum 1 or Practicum 1 Update posts for details on the other parts.

  • Modules covered: 2 – 5
  • When: Wednesday 2/26, during regular class time
  • The exam is in-person only.
  • Bring a calculator.
  • It is a paper exam taken during class.
  • We will print and provide a reference sheet for you at the exam. You can preview it in the exam Box folder.
  • You may bring one piece of standard-sized paper as a cheatsheet and can put things on the front and back.
  • There will be multiple versions.
  • Code on the exam
    • It will have no code writing and focus more on thinking like a data scientist.
    • It will have code reading (so know what these functions do), in particular:
      • The results of calling the describe function on a data set.
      • The results of a seaborn function call: catplot, displot, or relplot.
    • You will not be tested on regular expressions on the paper exam.
    • The data set used for this exam is Seaborn’s taxis data set. We recommend familiarizing yourself with the columns’ meanings.
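
For practice with the code-reading items above, here is a minimal sketch of what calling describe on a data set returns. The rows are made-up values that borrow three column names from Seaborn's taxis data set (distance, fare, tip); they are not real data, and the actual exam questions may look different.

```python
import pandas as pd

# Hypothetical stand-in rows (NOT real taxi data). The real data set can be
# loaded with seaborn.load_dataset("taxis"), which needs an internet connection.
trips = pd.DataFrame(
    {
        "distance": [1.0, 2.0, 3.0, 4.0],
        "fare": [10.0, 20.0, 30.0, 40.0],
        "tip": [0.0, 2.0, 4.0, 6.0],
    }
)

# describe() summarizes each numeric column with these rows:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = trips.describe()
print(summary)

# Individual statistics can be read off by row label and column name.
mean_fare = summary.loc["mean", "fare"]
median_fare = summary.loc["50%", "fare"]
```

When reading a describe table, note that the 50% row is the median, which can differ from the mean row when a column is skewed.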

Study Exams

  • Canvas Exam 1 Study Quiz
    • Worth 2 class engagement points
    • Includes randomized question pools covering every auto-gradable question from all past exams.
  • Study Exam in exam Box folder
    • You may see a question here that duplicates one from the Canvas quiz. That is because part of it is not auto-gradable, and we wanted to ensure you saw what the question will look like on the actual exam.
    • Solutions for the exam in Box will be released on the Friday before the exam. This is to encourage everyone to try the study exams before looking at the solutions.

Grading Scale and Points Allocation

For the questions that do not have a clear correct or incorrect answer or where partial credit is warranted, the following rubric will be used.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material.
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

The number of points earned is distributed across the problems based on the number of learning goals they are testing. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit – some small value resulting in around E_full_credit*0.9
  • N = E_full_credit * 0.6
  • U = E_full_credit * 0.2
  • Blank = 0
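
As a rough illustration of the conversion above, the sketch below maps a rubric level to points for a single problem. The function and dictionary names are hypothetical, and the S multiplier of 0.9 is only the approximate value the list describes, not an exact rule.

```python
# Multipliers taken from the conversion list above. The S value is approximate:
# the post says S is full credit minus a small value, landing around 0.9.
RUBRIC_MULTIPLIERS = {"E": 1.0, "S": 0.9, "N": 0.6, "U": 0.2, "Blank": 0.0}


def rubric_points(level: str, full_credit: float) -> float:
    """Return the points earned for one problem at the given rubric level."""
    return full_credit * RUBRIC_MULTIPLIERS[level]


# Example: points earned at each level on a hypothetical 10-point problem.
for level in RUBRIC_MULTIPLIERS:
    print(level, round(rubric_points(level, 10), 2))
```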

Exam 01 Logistics: Practicum

This post outlines the Practicum of Exam 1. See the in-person Exam 1 or Practicum 1 Update posts for details on the other parts.

  • Modules covered: 2 – 5
  • When: Friday 2/28 12:01am to Saturday 3/1 11:59pm
    • There is no class on Friday.
    • It should take around 2-3 hours to complete, but you can take as long as you want. It must be submitted before the deadline.
  • Study Practicum in exam Box folder
  • This can be done in a pair. See details below on the logistics, the definition of collaboration, and the consequences if collaboration happens without citation.
  • It is a take-home, open book, open note, open internet, and open LLM practicum.
    • Each question will have a variable you set to True or False to indicate if you used an LLM when answering this question.
  • It is closed to anyone outside you (and your partner if you have one). So, do not ask someone to do it for you or ask on places like stackoverflow.
  • It focuses on coding and interpreting the results of that code.
  • Consists of a Jupyter Notebook and a data set
    • Recommendation: Discuss in advance with your partner (if you have one) how you will create the final submission and who will submit it.
  • At the start of the practicum, a Canvas announcement will go out with a link to the Box folder containing all the files you need.
  • The act of submitting or being part of a submission means that you are upholding the Duke community standard that you contributed equally to this submission and only talked amongst yourselves when working on it.
  • Protect the integrity of the practicum and your submission.
    • Take your practicum:
      • In a secure location where only you (and your partner) can see your screen (and only your partner can talk to you).
      • In a place where you will not be distracted or tempted to talk to someone beyond your partner (if you have one).
    • You can do the following only after grades have been published for the Practicum Update. Doing any of these before grades are published will be considered a violation of the Duke Community Standard.
      • Discuss what you did on the practicum.
      • Show your solutions to other students.
      • View other solutions.
  • If you have a question during the practicum, ask it as a private new message on the class forum, in helper hours, or during class time when Prof. Stephens-Martinez will be in the helper hours Zoom room.
    • We cannot help you debug your code. If the notebook or autograder appears not to be working, but it turns out your code has a bug, you will be graded according to your submission.
    • We will do our best to always have someone checking the forum. However, we cannot promise that someone will instantly answer your question.
    • The practicum is tested for readability, so the wording should be straightforward.

Collaboration on the Practicum

  • Working in a pair means you collaborated on the Practicum.
    • Collaboration – 2 people have collaborated if one or both have given or received work/help on the Practicum. Notice these are “or’s.” That means if you share your Practicum with another person, even if that person did not give you anything in return, you both are now considered collaborators and should include each other in your notebook(s) as a partner.
    • This also means that if 2 people submit together and then 1 person shares that submission with a 3rd person, who then submits something too similar to have been done in isolation, all 3 are considered collaborators because it is impossible to detect who shared with whom. This collaboration is then considered a violation of the rules and, therefore, a violation of the Duke Community Standard.
  • The NetIds of all those who worked on the notebook must be listed in the notebook. There will be a 0-point test case with two variables for the NetIds of you and your partner. If you are solo, the notebook will state what to fill in for the other variable.
    • If you do not do this and we detect your notebooks as too similar to have been done in isolation, this is considered a violation of the Duke Community Standard.
  • You and your partner may submit notebooks separately or as a single submission. If you plan to submit identical files, submit as a single submission. Please help the graders be efficient.

Grading Scale and Points Allocation

This is the same as Exam 1’s in-person exam, with the following additions:

  1. For Exemplary – The code is clean and easy to read (see the study exam for examples of what this means).
  2. Unit tests in the autograder for the Practicum will earn you points up to, but not quite, the U level.
  3. How many fewer points an S is worth compared to an E depends on the practicum part. The practicum totals 100 points. The goal is that earning only S’s results in a low A. So, for example, if the Practicum has only 4 questions, each S would lose 2.5 points compared to an E, which means getting all S’s is a low A (90%), but still guarantees an A on the Practicum.
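
The arithmetic in item 3 can be sketched as follows. This assumes the deduction is spread equally across questions, which is a simplification, since the post says the actual amount depends on the practicum part. The names are hypothetical.

```python
TOTAL_POINTS = 100  # the practicum total stated above
LOW_A = 90          # target total when every question earns an S


def s_deduction(num_questions: int) -> float:
    """Points one S loses relative to an E, assuming an equal deduction per question."""
    return (TOTAL_POINTS - LOW_A) / num_questions


# The 4-question example from the text.
per_question_loss = s_deduction(4)
all_s_total = TOTAL_POINTS - 4 * per_question_loss
print(per_question_loss, all_s_total)
```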

Exam 01 Logistics: Practicum Update

This post outlines the Practicum Update part of Exam 1. See the in-person Exam 1 or Practicum 1 posts for details on the other parts.

  • Your group will have the option to update your Practicum after seeing the results of your Practicum grade. If you choose to submit an update, your grade for the Practicum will be as follows:
    • Practicum (original): 15%
    • Practicum Update: 85%
  • When: Thursday 3/6 – Saturday 3/8
    • This is during Module 7.
  • For the update, you will do the following:
    • Update your original notebook as needed.
    • Fill in the template diff cell at the top of the Practicum and list all of the changes you made from your original submission.
      • This is worth 0.5 points per question.
      • We may not grade your update properly if you do not do this.
  • We may grade outside of your changes because the Practicum aims to show your competency level in the material, not your competency plus whatever the graders accidentally missed in the first grading.
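
The 15% / 85% weighting above can be sketched as a small helper. The function name is hypothetical, and the no-update branch is an assumption: the post only states the weighting for groups that choose to submit an update.

```python
from typing import Optional

ORIGINAL_WEIGHT = 0.15  # weight of the original Practicum, from the list above
UPDATE_WEIGHT = 0.85    # weight of the Practicum Update, from the list above


def practicum_grade(original: float, update: Optional[float] = None) -> float:
    """Combine the original and update scores (both out of 100)."""
    if update is None:
        # Assumption: with no update submitted, the original score stands.
        return original
    return ORIGINAL_WEIGHT * original + UPDATE_WEIGHT * update


# Example with made-up scores: 70 on the original, 90 on the update.
print(round(practicum_grade(70, 90), 2))
```

With those made-up scores, the combined grade is 0.15 × 70 + 0.85 × 90 = 87, so most of the final Practicum grade comes from the update.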

Module 04: Data Wrangling

  1. Prepare (due Mon 2/3)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/9) [LINK]
  4. Worked Example [LINK]

Content (Slides in the Box folder)

04.A – What is Wrangling

  1. Data sources, formats, and importing (26 min.)
  2. Common data cleaning problems (16 min.)
  3. Read Section 3.4 Handling Missing Data from Python Data Science Handbook

04.B – Wrangling Text

  1. Python string operations (16 min.)
  2. Introduction to regular expressions (18 min.)
  3. Read Section 3.10 Vectorized String Operations from Python Data Science Handbook

Optional Supplements

Module 03: Visualization

  1. Prepare (due Mon 1/27)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/2) [Link]
  4. Worked Examples [Link]

Content

03.A – Data Visualization and Design

  1. Why Visualize? (11 min.)
  2. Kinds of Data (7 min.)
  3. Basic Plot Types (12 min.)
  4. Dos and Don’ts (10 min.)

03.B – Visualization in Python

  1. Intro to Python Visualization Landscape (7 min.)
  2. Seaborn Introduction (17 min.)
  3. Seaborn Examples (17 min.)

Optional Supplements

Module 05: Probability

  1. Prepare (due Mon 2/10)
    1. Content below
    2. Canvas quizzes
  2. Class engagement – See on the class forum
  3. Homework (due Sun 2/16, late due Thurs 2/20) [Link]
  4. Worked Examples [Link]

Content (Slides in the Box folder)

5.A – Foundations of Probability (52 min.)

  1. Outcomes, Events, Probabilities (15 min.)
  2. Joint and Conditional Probability (11 min.)
  3. Marginalization and Bayes’ Theorem (15 min.)
  4. Random Variables and Expectations (11 min.)

5.B – Distributions of Random Variables (46 min.)

  1. Distributions, Means, Variance (19 min.)
  2. Monte Carlo Simulation (15 min.)
  3. Central Limit Theorem (12 min.)
    1. Slide 26 in the video has a typo that is fixed in the pdf version of the slides on Box. In the video, it says the probability is <= 0.95, but it should say < 0.05.

Optional Supplements

Helpful YouTube videos to understand nuance with examples

In the slides Box folder you will find additional resources on understanding Chebyshev and Markov

Online Textbook and Documentation

You can access an excellent free online textbook on OpenIntro Statistics here, co-authored by Duke faculty. You can pay a suggested but adjustable price for a tablet-friendly pdf, but you can also just get the regular pdf for free. For this module, the following optional readings may be particularly helpful supplements:

  • Chapter 3: Probability. This provides more information on many of the topics from the above videos in Foundations of Probability.
  • Chapter 4: Distributions of random variables. This provides much more information about particular classic distributions than is provided in 5.B.1.
  • Chapter 5.1: Point estimates and sampling variability. This provides more information on some of the topics from 5.B.2-3.

In addition, you can find documentation for the two pseudorandom number-generating / sampling libraries in Python that we mentioned here:

Module 01: Python & Jupyter Notebook

  1. Prepare (due Mon 1/13)
  2. Class engagement – See on the class forum
  3. Homework (due Sun 1/19, 11:59 PM) [Link]

Content (Slides in the Box folder)

1.A – Python3 (14 min.)

  1. Python vs. Java (3 min.)
  2. Data Types (2 min.)
  3. Iteration, Functions, Classes (7 min.) – slide 19 has a typo; the pdf has been fixed
  4. sorted() function documentation (2 min.)

1.B – Python for Data Science (21 min.)

  1. Anaconda and Jupyter (10 min.)
  2. Jupyter Notebook Demo (11 min.)

Optional Supplements

Formulating Your Research Question(s)

To come up with a good research question for the 216 Final Project, try testing your potential questions against the following criteria:

  1. Is your research question clear?
    1. Does your research question present the problem with enough context such that the person reading it doesn’t need to go on an internet goose chase to understand it?
  2. Is your research question focused?
    1. Is your question direct, and does it address specific points? Is it appropriate for the scope of the project (i.e., timeline, skills, etc.)?
    2. Can the question be answered thoroughly through this project?
  3. Is your research question concise?
    1. Make sure the wording of your question is to the point – don’t be verbose here, as you’re laying the groundwork for the rest of your project.
  4. Is your research question complex?
    1. Similar to above, can it be answered thoroughly within the limits of the paper but also not be too simple (i.e., yes or no, questions that can be answered through a simple linear regression)?
    2. Does your question require a sophisticated analysis that is potentially beyond the scope of this class?
  5. Is your research question arguable?
    1. Can you take a stance and make an argument in your answer? Can your question be tested?
  6. Is your research question analytical?
    1. Does your research question result in a description of the problem or an analysis of the problem?

Examples of GOOD research questions:

  1. How does the percentage of COVID deaths vary by race across counties in the US? How does the percentage of COVID deaths differ between Republican (red) and Democratic (blue) counties? How does the percentage of COVID deaths differ by race and political party at the county level?
    1. You don’t necessarily need to have multiple research questions (this depends on the depth of the analysis needed per question, how the questions are related, etc.), but these questions all fulfill each of the above criteria individually and work well together to meet the scope of the project.
  2. Is there a positive relationship between ______ and ______ in a country? Does a higher prevalence of _____ correlate with ________ in a country? Is there a negative relationship between  ________ and _________ in a country?
    1. While this research question has several sub-questions, each question is essentially a ‘yes’ or ‘no’ response. It’s not the strongest question, but if you opt for this type of question, we recommend using multiple methods from the course to answer the questions and compare your results. Alternatively, consider replacing the words “is there a positive relationship” with “what is the relationship and how does it vary,” or something that opens the question up to more than binary responses.
  3. What is the relationship between vaccine hesitancy and COVID-19 deaths? What is the prevalence of vaccine hesitancy among the general population? What is the impact of political beliefs on vaccine hesitancy? Do factors such as race have any relation to vaccine hesitancy?
    1. Again, multiple questions that mostly meet the above criteria — the last question could be answered with a ‘yes’ or ‘no,’ but you could easily replace it with “which factors have the strongest relationship with.” All other questions invite deeper analysis and the formulation of an argument.

 

Project: Initial Plan

Due: Saturday, February 8th

General Directions

The purpose of this document is to ensure that your group is choosing a substantial research project topic that is interesting and worthwhile. You will be working on the collaborative final project for a large portion (~14 weeks) of this course, and will use this deliverable to brainstorm project ideas and plan how your team will collaborate. In terms of length, it should be 1-2 pages (not including the appendix) using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). You should convert your final document to a pdf and upload it to Gradescope under the assignment “Initial Plan” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope to include all of your group members in a single submission.

The Initial Plan is out of 100 points. Meeting basic formatting requirements is worth 40 points and will be graded as follows:

  • E (Exemplary, 40pts) – Work that meets all requirements. NetIDs and names of all group members are included in the report.
  • S (Satisfactory, 38pts) – Work that meets most of the requirements. NetIDs or names of some group members are missing.
  • N (Not yet, 24pts) – Does not meet all requirements. NetIDs or names of all group members are missing.
  • U (Unassessable, 8pts) –  Missing at least one section.

Part 1: Brainstorming (40 points)

As you go through your brainstorm, consider reading over this guide to forming your research question. To brainstorm ideas for your research topic, you may use one of two options:

  1. Mind map of potential project ideas.
  2. Discussion with ChatGPT or another LLM of your choice.

 

For the mind map, you can use an online tool, Google drawing, whiteboard, post-it notes, etc. Just ensure you can put it in your report. To create your mind map, use the following steps:

  1. Put a central idea or main concept in the center, such as “data science research project” or something more specific that your group finds interesting.
  2. Branch out from the main with ideas that can cover a range from interesting topics to previous project ideas that caught your group’s attention.
  3. Branch off of those ideas to add more specific interests or personalized ways you would change a topic or project.
  4. Put your mind map (if it’s on something like a physical whiteboard take a picture) as an appendix in this submission.

 

For the discussion with an LLM, do the following:

  1. Tell the LLM you are brainstorming for data science projects, what your group’s interests are that could be potential sources of data, and that you need to find the data yourself.
  2. Ask it what ideas it has for your project.
  3. Tell it what ideas you liked, didn’t like, why a suggestion isn’t a good one, etc.
  4. Do at least 2-3 rounds of steps 2 and 3 with the LLM.
  5. Put your chat as an appendix in this submission.

 

After your brainstorm (regardless of whether you used the mind map or an LLM), reflect by answering the following questions:

  1. Why did you choose the method you used?
  2. What patterns do you see in what you find interesting?
  3. What research topics or questions did your group generate from this brainstorming? Which of these ideas can you see your group potentially pursuing?
  4. Do you feel like more brainstorming is needed before you find a topic?
  5. If you used
    1. The mind map: Did you find your brainstorming narrowing or diverging as you discussed ideas to write down?
    2. LLM: How satisfied were you with its answers? Why?

Whether you choose to create a mind map or use an LLM, use this exercise to brainstorm project ideas that your group collectively believes are interesting, relevant, and worthwhile to your time in this course.

Grading

  • E (Exemplary, 40pts) – Appendix has a mind map that branches out at least two levels from the center OR an LLM conversation. In addition, the report has a reflection that comprehensively answers all 5 questions.
  • S (Satisfactory, 38pts) – Appendix has a mind map that branches out at least two levels from the center OR an LLM conversation. In addition, the report has a reflection that mostly answers all 5 questions.
  • N (Not yet, 24pts) –  A brainstorm that does not entirely answer 1 or 2 of the questions. Reflection does not entirely answer at least 1 of the questions.
  • U (Unassessable, 8pts) – Work that does not entirely answer 3 or more of the questions above for either the brainstorm or the reflection.

Part 2: Collaboration Plan (20 points)

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This reflects the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines/group norms that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide on responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is okay if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the place/platform you will use to meet. We strongly recommend having a consistent time and not having ad-hoc times as needed.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or other chat apps? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you track who is doing what tasks and when those tasks will be done? This can be as simple as a Google doc with a checklist or as advanced as a Trello board. What is important is there is a clear repository of who is doing what, the status of that thing, and when it should be done.
  6. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab, which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.
  7. Is your group willing to publicly share your project, for example, as part of a portfolio of work? If yes, how will you share? How will you articulate authorship and who did what? When will you revisit this near the end of the semester to confirm you all still agree to what you write in this initial plan?

* In addition to a common repository for data, you may find it useful to explore Google Colab or Deepnote, which allow you to collaborate on Jupyter Notebooks and execute them in the cloud (like a Google doc for Jupyter notebooks).

Grading

  • E (Exemplary, 20pts) – Comprehensive plan that answers all 7 questions and includes a link to the group’s folder/repository.
  • S (Satisfactory, 18pts) – Comprehensive plan that mostly answers all 7 questions. The link to the group’s folder/repository could be missing.
  • N (Not yet, 12pts) – A plan that does not entirely answer 1 or 2 of the questions above. Link can be missing.
  • U (Unassessable, 4pts) – A plan that does not entirely answer 3 or more of the questions above.
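
To see how the three rubrics in this post add up to the 100-point total, here is a minimal sketch. The dictionary and function names are hypothetical; the point values are copied from the formatting, Part 1, and Part 2 rubrics above.

```python
# Points per rubric level for each graded component of the Initial Plan.
POINTS = {
    "formatting": {"E": 40, "S": 38, "N": 24, "U": 8},
    "brainstorming": {"E": 40, "S": 38, "N": 24, "U": 8},
    "collaboration": {"E": 20, "S": 18, "N": 12, "U": 4},
}


def initial_plan_total(levels: dict) -> int:
    """Sum the points for a mapping of component name to rubric level."""
    return sum(POINTS[part][level] for part, level in levels.items())


# Example: Exemplary formatting and brainstorming, Satisfactory collaboration plan.
example = {"formatting": "E", "brainstorming": "E", "collaboration": "S"}
print(initial_plan_total(example))
```

Earning an E on every component gives 100 points, and earning an S on every component gives 94.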

Project: Group Formation

Due: Friday, January 24th

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation quiz on Gradescope no later than Friday, January 24th.

The form should only take a couple of minutes. If you already know who you want to work with, you can indicate that in the form using the group submission feature in Gradescope. In this case, communicate with your group first and have one member fill out the form once with everyone added as group members. If you submit more than once, the active submission is considered valid. It’s also fine if you don’t know who you want to work with, in which case you should fill out the form solo, and we will match you to a group.

If you would like to start thinking about possible projects now, some ideas are listed below. You can also brainstorm using the strategies outlined in the Initial Plan post (out soon). However, you are not required to have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding Airbnb’s Impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Data.gov has a huge compilation of data sets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.
  • Duke University Library Digital Repository Research Data
  • ICPSR – An international consortium of more than 750 academic institutions and research organizations, Inter-university Consortium for Political and Social Research (ICPSR) provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.
  • The University of California Irvine maintains a large UCI ML repository of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small simple example datasets to large and complicated datasets from specific scientific domains.
  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with lots and lots of data about reviews, businesses, images, and cities – text data, rich json data, etc.