Author: Dr Kristin Stephens-Martinez, Ph.D.

Exam 3 Retake Logistics

  • Timeframe: It will open Monday, 5/1, at 12:01 AM and close Wednesday, 5/3, at 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • It assesses the same thing as Exam 3.
    • However, it will differ from the original and practice exams.
    • You may use things that you have learned that were not in the modules that this exam is testing, but you can answer it without knowing any modules beyond what this exam is testing.
  • The data sets will be different.
  • There is no regrade window due to time constraints.
  • All other information (getting the files, Gradescope, Sakai, asking for help, grading policy, etc.) is similar to Exam 3’s.
    • Reminder: You do not need to do both parts. You can do only one part if you wish. You must do ALL of the questions in that part, though. We will take the max score per part.
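
As a rough illustration of the "max score per part" rule, here is a minimal sketch. The part names and scores are hypothetical, and the function `best_scores` is an assumption made for this example; the teaching staff's actual grade computation may differ.

```python
def best_scores(original, retake):
    """Keep the higher score for each part.

    A part that was not retaken keeps its original score.
    """
    return {
        part: max(score, retake.get(part, score))
        for part, score in original.items()
    }


# Hypothetical scores: only Part 1 was retaken.
original = {"Part 1": 72, "Part 2": 85}
retake = {"Part 1": 90}

final = best_scores(original, retake)
print(final)  # {'Part 1': 90, 'Part 2': 85}
```

In this sketch, skipping a part on the retake cannot lower your score for that part.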

Exam 3 and Exam 2 Retake

Exam 3 Logistics

  • Modules covered: 7, 8, and 9
  • Practice Exam (Part1 – LINK, Part2 – LINK)
  • Timeframe: It will open Thursday, 4/20, at 12:01 AM, and close Saturday, 4/22, at 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • The exam will be take-home. It is open book, open note, open internet, but closed to people and AI tools (such as ChatGPT).
    • This means you cannot receive help on this exam from anyone. This includes (but is not limited to) communicating with a person while taking the exam, such as asking someone for help over the Internet (for example, on Stack Overflow).
    • In addition, you cannot give a question on the exam to an AI tool and ask it to generate an answer.
    • Your submission must represent your own work only and is your evidence that you have mastered the material.
  • Like prior exams, it consists of 2 parts. However, each part has a time limit of 2.5 hours. Both parts will have data sets, and they will be different.
    • Note that each part is 30 minutes longer than the prior exam parts.
  • Note about the ESNU grading:
    • There are ~100 points possible and fewer than 10 questions. The points are evenly distributed across the problems based on the number of concepts they are testing. The rubric-to-point conversion ensures that earning an E or S on all problems means an A, while a single U makes an A very unlikely. This is reasonable, since a U on a problem clearly shows a lack of mastery of at least some content for this exam.
  • All other information (getting the files, Gradescope, Sakai, asking for help, grading ESNU policy, etc.) is similar to Exam 1 Part 2.
    • Grading Clarification: A simple copy+paste and find+replace replacement from the practice exam is considered a Satisfactory answer. You must use your own words or elaborate beyond the practice exam text to show an Exemplary level of content mastery.

Exam 2 Retake Logistics

  • Timeframe: It will open Wednesday, 4/19, at 12:01 AM, and close Saturday, 4/22, at 11:59 PM.
    • Note it is open one day earlier since Exam 3 will take longer.
    • The exam will close at 11:59 pm regardless of when you started.
  • It assesses the same thing as Exam 2.
    • You may use things that you have learned that were not in the modules that this exam is testing, but you can answer it without knowing any modules beyond what this exam is testing.
  • The data sets will be different.
  • You do not need to do both parts. You can do only one part if you wish. You must do ALL of the questions in that part, though. We will take the max score per part.
  • All other information (getting the files, Gradescope, Sakai, asking for help, grading policy, etc.) is similar to Exam 1 Part 2.

Project Prototype

Due: Sunday, April 2nd

General Directions

The prototype deliverable is intended to demonstrate a proof of concept for your final project report. Large multi-week projects are challenging. This deliverable is intended to provide additional structure to ensure you are making progress and on a path toward success.

It consists of a written report detailed below, along with any accompanying data, code, or other supplementary resources that demonstrate your progress so far in the project. You can think of it as a rough draft for your final project. The report should stand on its own so that it makes sense to someone who has not read your proposal.

The report should contain at least five parts, which we define below. In terms of length, it should be 3-4 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). A typical submission is around 2-3 pages of text and 3-4 pages overall with tables and figures. You should convert your written report to a pdf and upload it to Gradescope under the assignment “Project Prototype” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope. You do not need to upload your accompanying data, code, or other supplemental resources demonstrating your work to Gradescope; instead, your report should contain instructions on how to access these resources (see parts 2 and 4 below for more details).

  • E (Exemplary, 30pts) – Work that meets all requirements.
  • S (Satisfactory, 29pts) – Work that meets all requirements but is over 4 pages.
  • N (Not yet, 18pts) – Does not meet all requirements.
  • U (Unassessable, 6pts) – Missing at least one section.

Part 1: Introduction and Research Questions (15 points)

Your prototype report should begin by reintroducing your topic and restating your research question(s) as in your proposal. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant. Briefly justify each of these points as in the project proposal. You can start with the text from your proposal, but you should update your introduction and research questions to reflect changes in or refinements of the project vision. Specifically, point out what has changed since the proposal. Your introduction should be sufficient to provide context for the rest of your report.

Grading

  • E (Exemplary, 15pts) – Comprehensive introduction with clearly labeled, updated research questions and a justification for the research questions about whether they are substantial, feasible, and relevant. Any changes are specifically mentioned or they note there are no changes.
  • S (Satisfactory, 14pts) – Comprehensive introduction with clearly labeled research questions and a justification for the research questions about whether they are substantial, feasible, and relevant. Changes and updates may not be specifically mentioned.
  • N (Not yet, 9pts) – Incomplete introduction where the research questions or justification are missing pieces, but at least some of it is present. Or the justification is clearly not reasonable.
  • U (Unassessable, 3pts) – Incomplete introduction where it is entirely missing the research questions or justification or does not demonstrate meaningful effort.

Part 2: Data Sources (15 points)

After your introduction and research questions, your prototype should discuss the data you have collected and are using to answer your research questions. Be specific: name the datasets you are using and where they were collected from / how they were prepared. Briefly justify why your data are appropriate and sufficient to address your research questions. As in the introduction, you can begin with the text from your proposal but be sure to update it to fit your evolving project.

Grading

  • E (Exemplary, 15pts) – Origins of data are properly specified, cited, and relevant to answering the research question(s). If any data wrangling, cleaning, or other data preparation was done, these processes are explained.
  • S (Satisfactory, 14pts) – Origins of data are properly specified and cited. However, the justification is not clear why the data is relevant to the proposed research question(s). If any data wrangling, cleaning, or other data preparation was done, these processes are explained.
  • N (Not yet, 9pts) – Poorly specified data sources or the justification for using that data set or the methods to acquire the data is lacking. No discussion of preparing the dataset.
  • U (Unassessable, 3pts) – Data sources or methods to acquire data are missing or do not demonstrate meaningful effort.

Part 3: What Modules are You Using? (15 points)

Your project should utilize concepts from modules we have covered or will cover in this course to answer your research question(s). We will assume you will use modules 1 (Python), 2 (Numpy/Pandas), and 3 (Probability). This section should state at least 3 more modules that you will utilize for your project. Each module should have a short description of how you will use the knowledge in this module and a justification for that use. In addition, include what concepts from the module you will use and at what stage of your project you plan to mostly use this module. Potential stages include, but are not limited to: data gathering, data cleaning, data investigation, data analysis, and final report.

  • Module 4: Data Wrangling
  • Module 5: Statistical Inference
  • Module 6: Combining Data
  • Module 7: Databases and SQL
  • Module 8: Visualization
  • Module 9: Prediction & Supervised Machine Learning

As in Parts 1 and 2, you can begin with the text from your proposal but be sure to update it to fit with your evolving project. You should add any additional modules you will be using and update the existing modules to be more specific to the different tasks and stages of your projects.

Grading

  • E (Exemplary, 15pts) – States at least 3 modules. For each module, they provide an updated (1) short description of how they will use the module, (2) justification for using this module, (3) what concepts they will likely use, and (4) what stage they expect they will use it. 
  • S (Satisfactory, 14pts) – States at least 3 modules, but there are some weaknesses somewhere, such as one module having 3 or more parts that are not well fleshed out, or one part being weak across all 3 modules.
  • N (Not yet, 9pts) – States at least 3 modules, but 3 or more parts are entirely missing or basically non-existent out of 12 = 4 parts X 3 modules.
  • U (Unassessable, 3pts) – Does not meet the Not Yet criteria, such as having fewer than 3 modules or missing more than 3 parts across all 12 = 4 parts X 3 modules.

Example:

Here is an example justification for Module 3, assuming the project is about creating a prediction model that classifies the data. Note the bolding, which will help you ensure you are meeting all requirements and help your grader find them.

Proposal

Module 3 Probability: We will use this module to calculate the accuracy of a baseline version of the model we will build. We will do this by considering the proportion of the label we are trying to predict, as well as taking into account some of the independent variables. Our justification is that we need a baseline accuracy to understand how good our model is. The concepts we will mainly use are the probability axioms and maybe some of Bayes or marginalization to calculate this baseline. We plan to use this module during the data analysis and final report stage.

Prototype

Module 3 Probability: We used this module to calculate the accuracy of a baseline version of a model we will build to predict the type of a Pokemon. We did this by considering the proportion of each Pokemon type in our data set and creating a baseline model that always predicts the most common Pokemon type in our data set. Our justification is that we need a baseline accuracy to understand how good our model is for predicting the type of a Pokemon based on other characteristics. The concepts we mainly used were the probability axioms and some of Bayes or marginalization to consider if there was a better baseline model we could use. We used this module during our data analysis and plan to use it in the final report stage.
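
To make the baseline idea in this example concrete, here is a minimal sketch of a majority-class baseline. The list of labels is made up for illustration and is not real project data; a real project would read the labels from its own data set.

```python
from collections import Counter

# Hypothetical labels; in a real project these come from your data set.
types = ["Water", "Fire", "Water", "Grass", "Water", "Fire"]

# Find the most common label and how often it appears.
counts = Counter(types)
majority_label, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority label is correct exactly as
# often as that label appears, so its accuracy is the label's proportion.
baseline_accuracy = majority_count / len(types)

print(majority_label, baseline_accuracy)  # Water 0.5
```

Any model you build should beat this number to be considered useful for the prediction task.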

Part 4: Preliminary Results and Methods (15 points)

The preliminary results section of your report should summarize the results obtained so far in the project. Where possible, results should be summarized using clearly labeled tables or figures and supplemented with a written explanation of the significance of the results with respect to the research questions outlined in the previous section. Please note that a screenshot of your dataset does not count as a table or figure and should not be included in your Prototype. Instead, if your primary progress is gathering and cleaning your data, provide a table with descriptive statistics about your data. Your results do not need to be final or conclusive for your entire project but should demonstrate substantial effort and progress and should provide concrete proof of concept or initial analysis with respect to your research questions.
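
As one possible way to produce a descriptive statistics table, the sketch below uses pandas on a small made-up data set. The column names and values are assumptions for illustration only; substitute your own cleaned data.

```python
import pandas as pd

# Hypothetical data; replace with your own cleaned data set.
df = pd.DataFrame({
    "height_m": [0.7, 1.1, 1.7, 0.6, None],
    "weight_kg": [6.9, 19.0, 90.5, 8.5, 29.5],
})

# Count, mean, standard deviation, min, quartiles, and max per numeric column.
summary = df.describe().round(2)

# Number of missing values per column, worth reporting with the summary.
missing = df.isna().sum()

print(summary)
print(missing)
```

A table like `summary`, given a clear label and a sentence or two of interpretation, is the kind of result this section asks for in place of a screenshot.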

Your results should be specific about exactly what data were used and how the results were generated. For example, if you scraped multiple web databases, merged them, and created a visualization, then you should explain how each step was conducted in enough detail that an informed reader could reasonably be expected to reproduce your results with time and effort. Just saying, “we cleaned the data and dealt with missing values,” is not sufficient detail, for example.
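
One way to keep track of the detail needed here is to record what each cleaning step did as you go. The following is only a sketch with made-up data and hypothetical step descriptions; your own steps will differ.

```python
import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value.
raw = pd.DataFrame({
    "name": ["a", "a", "b", "c"],
    "value": [1.0, 1.0, None, 3.0],
})

# Record (description, rows remaining) after each step for the write-up.
steps = [("loaded raw data", len(raw))]

deduped = raw.drop_duplicates()
steps.append(("dropped exact duplicate rows", len(deduped)))

cleaned = deduped.dropna(subset=["value"])
steps.append(("dropped rows with a missing 'value'", len(cleaned)))

for description, rows in steps:
    print(f"{description}: {rows} rows remain")
```

A log like this can be turned directly into sentences such as "we removed 1 duplicate row and 1 row with a missing value," which is more reproducible than a general statement about cleaning.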

Your report itself should include an explanation of your methods, but it should also contain instructions on how to access your full implementation (that is, your code, data, and any other supplemental resources like additional charts or tables). The simplest way to do so is to include a link to the box folder, GitLab repo, or whatever other platform your group is using to house your data and code.

Grading

  • E (Exemplary, 15pts) – Preliminary results are thoroughly discussed using labeled tables or figures followed by written descriptions. Specific explanation of how the results were generated and from what data. Link to code/data to create charts or visualizations is provided. 
  • S (Satisfactory, 14pts) – Preliminary results are thoroughly discussed using labeled tables or figures followed by written descriptions. Explanation of how the results were generated may lack some specification or it is somewhat unclear as to what data the results are from. Link provided.
  • N (Not yet, 9pts) – Preliminary results are discussed using tables with missing labels or lacking written descriptions. It is unclear how the results were generated and from what data.
  • U (Unassessable, 3pts) – Preliminary results are missing or do not demonstrate meaningful effort.

Part 5: Reflection and Next Steps (10 points)

In this part, you should address each of the following in its own subsection (if space is limited, how you create clear subsections is up to you):

  1. Successes/Mostly Complete – What has been successful in the project so far or what is essentially complete and ready for the final report?
  2. Challenges/Incomplete – What has been challenging in the project so far or what is incomplete in the prototype that needs to be finished for the final report?
  3. Collaboration plan reflection – How is the collaboration going? What is currently happening versus the original proposed plan? Is the group okay with what is happening? Does the group need to renegotiate what the plan should be? If yes, what is the new plan?
  4. Next Steps – What are your next steps? These should be concrete and specific actions that your group will take to address the challenges identified in order to complete a successful final project.

Grading

  • E (Exemplary, 10pts) – All four parts are present and the reflection is comprehensive on successes and challenges so far, a reflection on their collaboration plan, and a specific plan of action to address any concerns and future work.
  • S (Satisfactory, 9pts) – All four parts are present and the reflection is comprehensive on successes and challenges so far, but the collaboration plan is weak and there is only a loose plan of action to address any concerns and future work.
  • N (Not yet, 6pts) – A reflection/plan that does not entirely answer 1 or 2 of the questions above.
  • U (Unassessable, 2pts) – A reflection/plan that does not entirely answer 3 of the questions above.

Checklist Before You Submit:

  1. Does your prototype satisfy all general directions?
    1. 3-4 pages in length
    2. Standard margins (1 in.)
    3. Font size is 11-12 pt
    4. Line spacing is 1-1.5
    5. Final document is a pdf
  2. Do you have an Introduction and clearly stated Research Question(s)?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  3. Have you properly specified/cited one or more specific Data Sources and justified why they are relevant to the Research Questions?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  4. Did you state at least 3 Modules to be used and how, as well as a justification of which concepts will be used at specific stages of the project?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  5. Have you reported all of your Preliminary Results and Methods, including a specific explanation of how the results were generated?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  6. Have you written a comprehensive reflection?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?

Exam 2 and Exam 1 Retake

Exam 2 Logistics

  • Modules covered: 4, 5, and 6
    • Note: Even though it’s the same number of modules as Exam 1, that does not mean it is assessing the same amount of knowledge. Module 01 was a preamble module to refresh your Python and to serve as a warm-up for the weekly cadence of the class.
  • Practice Exam (Part1 – Link, Part2 – Link)
    • Note: This does not have a question about regular expressions, but the real exam will. The solution to one of the questions does use regular expressions, but the question does not require them.
  • Timeframe: It will open Thursday, 3/23, at 12:01 AM, and close Saturday, 3/25, at 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • The exam will be take-home. It is open book, open note, open internet, but closed to people.
    • This means you cannot receive help on this exam from anyone. This includes (but is not limited to) communicating with a person while taking the exam, such as asking someone for help over the Internet (for example, on Stack Overflow).
  • It consists of 2 parts that, like Exam 1 Part 2, each have a time limit of 2 hours. Both parts will have data sets, and they will be different.
  • All other information (getting the files, Gradescope, Sakai, asking for help, grading policy, etc.) is similar to Exam 1 Part 2.
    • Grading Clarification: A simple copy+paste and find+replace replacement from the practice exam is considered a Satisfactory answer. You must use your own words or elaborate beyond the practice exam text to show an Exemplary level of content mastery.

Exam 1 Retake Logistics

  • Timeframe:
    • Part 1: In-person during normal class time on Wednesday, 3/22.
      • Fill out the RSVP form by Sunday 3/19, so we know you are coming. Help us save trees and not waste paper.
      • If you change your mind last minute, we will print a few extras.
    • Part 2 (same as Exam 2): It will open Thursday, 3/23, at 12:01 AM, and close Saturday, 3/25, at 11:59 PM.
      • The exam will close at 11:59 pm regardless of when you started.
  • It assesses the same thing as Exam 1.
    • You may use things that you have learned that were not in the modules that this exam is testing (such as .groupby() or .apply()), but you can answer it without knowing any modules beyond what this exam is testing.
  • The data set and/or problems will be different.
  • You do not need to do both parts. You can do only one part if you wish. You must do ALL of the questions in that part, though. We will take the max score per part.
  • All other information (getting the files, Gradescope, Sakai, asking for help, grading policy, etc.) is similar to Exam 1’s.
    • Grading Clarification: A simple copy+paste and find+replace replacement from the practice exam is considered a Satisfactory answer. You must use your own words or elaborate beyond the practice exam text to show an Exemplary level of content mastery.
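
For reference, `.groupby()` and `.apply()` mentioned above are pandas methods from later modules. The sketch below shows what they do on a small made-up table; the column names and the threshold of 75 are assumptions for illustration, and nothing like this is required for the retake.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "section": ["A", "A", "B", "B"],
    "score": [80, 90, 70, 100],
})

# .groupby(): split rows by section, then take the mean score of each group.
mean_by_section = df.groupby("section")["score"].mean()

# .apply(): run a function on every value in a column.
passed = df["score"].apply(lambda s: s >= 75)

print(mean_by_section)
print(passed.tolist())  # [True, True, False, True]
```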

Project: Proposal

Due: Sunday, March 5th

General Directions

The purpose of this document is to prepare your team for success in the course project. You should have feedback from your Initial Plan on the different research topics you have explored and are now introducing your chosen topic. Your proposal should contain at least three parts, which we define below. In terms of length, it should be 1.5-3 pages (2 pages is typical) using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these three components, you should provide any additional context or information necessary to understand your vision for your project. You should convert your final document to a pdf and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

The proposal is out of 100 points. Meeting basic formatting requirements is worth 40 points and will be graded as follows:

  • E (Exemplary, 40pts) – Work that meets all requirements.
  • N (Not yet, 24pts) – Does not meet all requirements.
  • U (Unassessable, 8pts) – Missing at least one section.

Part 1: Introduction and Research Questions (20 points)

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society. Additionally, we are looking for why your group believes this research project is worthwhile to your time in this course. 

You should provide a brief justification of your research question(s) with respect to each of these three points. We recommend clearly marking this section by bolding the words substantial, feasible, and relevant when you provide your justification.

Remember to review the feedback you received from your Initial Plan and decide on a topic/research questions that meet the criteria above and spark interest in your group. This is a project that you will be working on for a significant portion of the semester. 

Grading

  • E (Exemplary, 20pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification for the research questions about whether they are substantial, feasible, and relevant. And the justification is reasonable and clear in relevance to a CS216 project.
  • S (Satisfactory, 19pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification for the research questions about whether they are substantial, feasible, and relevant. But the justification is lacking in clarity or reasonableness with respect to a CS216 project.
  • N (Not yet, 12pts) – Incomplete introduction where the research questions or justification are missing pieces, but at least some of it is present. Or the justification is clearly not reasonable.
  • U (Unassessable, 4pts) – Incomplete introduction where it is entirely missing the research questions or justification or does not demonstrate meaningful effort.

Part 2: Data Sources (20 points)

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Grading

  • E (Exemplary, 20pts) – Origins of data or methods to acquire data are properly specified, cited, and relevant to answering the research question(s). And if the data is not already available, the justification for why they expect they will have access to it soon is reasonable. (a.k.a. We are reasonably confident you’ll be able to get the data you need for your research questions.)
  • S (Satisfactory, 19pts) – Origins of data or methods to acquire data are properly specified and cited. However, the justification is not clear why the data is relevant to the proposed research question(s) OR the justification of why they expect they will have access to the data is not reasonable. (a.k.a. We are not entirely sure you’ll be able to get the data you need for your research questions.)
  • N (Not yet, 12pts) – Poorly specified data sources or methods to acquire data OR the justification for using that data set or the methods to acquire the data is lacking.
  • U (Unassessable, 4pts) – Data sources or methods to acquire data are missing or do not demonstrate meaningful effort.

Part 3: What Modules are You Using? (20 points)

Your project should utilize concepts from modules we have covered or will cover in this course to answer your research question(s). We will assume you will use the skills you have acquired from modules 1 (Python), 2 (Numpy/Pandas), and 3 (Probability). This section should state at least 3 more modules that you will utilize for your project. Each module should have a short description of how you will use the knowledge in this module and a justification for that use. In addition, include what concepts from the module you will use and at what stage of your project you plan to mostly use this module. Potential stages include, but are not limited to: data gathering, data cleaning, data investigation, data analysis, and final report.

  • Module 4: Data Wrangling
  • Module 5: Statistical Inference
  • Module 6: Combining Data
  • Module 7: Databases and SQL
  • Module 8: Visualization
  • Module 9: Prediction & Supervised Machine Learning

When the proposal is due, you may not yet have learned the material from some of the modules above. In this case, you should still list the modules that are applicable, with a description of the concepts you believe each module will cover that will be useful for answering your research question.

If you do not plan to use python, numpy, and pandas for your project, you must state this and explain why you are choosing not to. It is okay to use something else, like R, but keep in mind that the teaching staff may not have the skills to support you.

Grading

  • E (Exemplary, 20pts) – States at least 3 modules. For each module they provide a (1) short description of how they will use the module, (2) justification for using this module, (3) what concepts they will likely use, and (4) what stage they expect they will use it.
  • S (Satisfactory, 19pts) – States at least 3 modules, but there are some weaknesses somewhere, such as one module having 3 or more parts that are not well fleshed out, or one part being weak across all 3 modules.
  • N (Not yet, 12pts) – States at least 3 modules, but 3 or more parts are entirely missing or basically non-existent out of 12 = 4 parts X 3 modules.
  • U (Unassessable, 4pts) – Does not meet the Not Yet criteria, such as having fewer than 3 modules or missing more than 3 parts across all 12 = 4 parts X 3 modules.

Example:

Here is an example justification for Module 3, assuming the project is about creating a prediction model that classifies the data. Remember that this module is not on the list of modules that count as one of your 3, but you are welcome to include analysis using concepts from it. Note the bolding, which will help you ensure you are meeting all requirements and help your grader find them.

Module 3 Probability: We will use this module to calculate the accuracy of a baseline version of the model we will build. We will do this by considering the proportion of the label we are trying to predict, as well as taking into account some of the independent variables. Our justification is that we need a baseline accuracy to understand how good our model is. The concepts we will mainly use are the probability axioms and maybe some of Bayes or marginalization to calculate this baseline. We plan to use this module during the data analysis and final report stage.

Checklist Before You Submit:

  1. Does your proposal satisfy all general directions?
    1. 1.5-3 pages in length
    2. Standard margins (1 in.)
    3. Font size is 11-12 pt
    4. Line spacing is 1-1.5
    5. Final document is a pdf
  2. Do you have an Introduction and clearly stated Research Question(s)?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  3. Have you properly specified/cited one or more specific Data Sources or methods to acquire data and justified why they are relevant to the Research Questions?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  4. Did you state at least 3 Modules to be used and how, as well as a justification of which concepts will be used at specific stages of the project?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?

Exam 1

This post outlines what Exam 1 will be like.

There is a Sakai quiz called “Prepare 04.E: Exam 1 Logistics” that will count towards your Prepare 4. It is due Wednesday, 2/8.

Exam Logistics

  • Modules covered: 1, 2, and 3
  • Practice Exam (Link)
  • The exam consists of 2 parts.
    • Part 1 is in-person only.
    • Part 2 will be take-home. It is open book, open note, open internet, and closed to people.
  • Timeframe:
    • Part 1: During class Wednesday, 2/15.
    • Part 2: Open Thursday, 2/16, 12:01 AM, and close Saturday, 2/18, 11:59 PM.
    • The exam will close at 11:59 pm regardless of when you started.
  • There will be no class on Friday, 2/17.
  • The Exam 1 Retake will take place during Exam 2. Your Exam 1 Part X score will be the max of your scores on this exam and the retake for that part.
  • The exam must be done individually. It is a violation of class policy if you collaborate in any way with another person (in or not in the class) on the exam. You can only talk to the teaching staff about the exam.

Part 1

  • Is in-person only.
  • It will cover mainly probability.
  • It is a paper exam taken during class.
  • There will be multiple versions.
  • We will give you one reference sheet (TBA).
  • You may bring one piece of paper as a cheatsheet and can put things on the front and back.
  • You will not need a calculator. Instead, you will show your work and write the final numerical expression that would produce the answer. There will be no need to calculate the final value by hand.

Part 2

  • It will be take-home. It is open book, open note, open internet, but closed to people.
    • This means you cannot receive help on this exam from anyone. This includes (but is not limited to) communicating with a person while taking the exam, such as asking someone on the Internet (for example, on Stack Overflow) for help.
  • It will cover mostly coding and some probability.
  • It is a Jupyter Notebook and a data set.
  • You will get the zip files inside a Sakai Quiz.
  • You will submit it on Gradescope.
    • During your testing period, you can submit as many times as you want to Gradescope. We will take the submission you mark as active, which is your last submission unless you change it using the history.
    • Gradescope will have tests, but they are sanity checks only. They check only that each variable has the correct type and is within the correct range. The vast majority of the points will be from hand grading. See the grading section below.
  • You will have 2 hours.
    • We do not expect you to need the entire 2 hours. However, it is not uncommon to get lost in a data set, and we wanted to account for that.
  • You can rely on the Sakai Quiz timer to tell you how much time you have left.
  • We will use your logged start time in Sakai to track if you submitted it to Gradescope on time.
    • If you submit after your allotted time, we will use the last submission made within your allotted time. If you have no submission within your time limit, your score will be zero, and you will need to rely on the retake for your exam score.
    • We recommend you submit to Gradescope periodically (after each problem) so you are not scrambling at the end trying to open Gradescope.
  • You do not need to do anything with Sakai after you retrieve your zip file from the quiz.
  • Protect the integrity of the exam and your exam submission.
    • Take your exam:
      • in a secure location where no one can see your screen or bother you.
      • in a place where you will not be distracted or tempted to talk to someone.
    • Only after grades have been published can you do the following. Doing any of these before grades are published will be considered a violation of the Duke Community Standard.
      • Discuss the exam.
      • Show your solutions to other students.
      • View other solutions.
  • If you have a question during the exam, ask it as a new private message on the class forum or in helper hours.
    • We cannot help you debug your code. If it appears as if the notebook or autograder is not working, but it turns out to be your own code that has a bug, you will be graded according to your submission.
    • We will do our best to always have someone checking the forum. However, we cannot make promises someone will instantly answer your question.
    • The exam is tested for readability, so the wording should be straightforward.
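
The on-time rule described above (your logged Sakai start time plus the 2-hour limit) can be sketched as follows. This is only an illustration, not the teaching staff's actual process: the function name and timestamps are hypothetical, counting a submission made exactly at the deadline as on time is an assumption, and the sketch ignores both the "active submission" setting in Gradescope and the overall exam closing time.

```python
from datetime import datetime, timedelta

TIME_LIMIT = timedelta(hours=2)  # Part 2 time limit stated above


def graded_submission(start_time, submission_times, time_limit=TIME_LIMIT):
    """Return the latest submission time within the limit, or None if there is none.

    Assumption: a submission made exactly at the deadline counts as on time.
    """
    deadline = start_time + time_limit
    on_time = [t for t in submission_times if t <= deadline]
    return max(on_time) if on_time else None


# Hypothetical timestamps for illustration only.
start = datetime(2023, 2, 16, 19, 0)
submissions = [
    start + timedelta(minutes=45),
    start + timedelta(hours=1, minutes=50),
    start + timedelta(hours=2, minutes=5),  # after the 2-hour limit
]

print(graded_submission(start, submissions))
```

In this example the submission made 1 hour 50 minutes after the start would be the one used, which is why submitting periodically is a safer habit than submitting once at the end.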

Grading Scale and Points Allocation

For the questions that do not have a clear correct or incorrect answer or where partial credit is warranted, the following rubric will be used.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material. The code is also clean and easy to read (see the practice exam for examples of what this means).
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

The number of points earned is distributed across the problems based on the number of learning goals they are testing. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit – 1
  • N = E_full_credit * 0.6
  • U = E_full_credit * 0.2
  • Blank = 0

Unit tests will earn you points up to, but not quite, the U level.
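
To make the conversion above concrete, here is a minimal sketch that applies it to a single problem. The function name and the 10-point problem are hypothetical, and the sketch does not model how unit-test points factor in.

```python
def rubric_points(level, full_credit):
    """Convert a rubric level to points for a problem worth `full_credit` points."""
    conversions = {
        "E": full_credit,          # full credit
        "S": full_credit - 1,      # one point below full credit
        "N": full_credit * 0.6,    # 60% of full credit
        "U": full_credit * 0.2,    # 20% of full credit
        "Blank": 0,
    }
    return conversions[level]


# Hypothetical 10-point problem.
for level in ["E", "S", "N", "U", "Blank"]:
    print(level, rubric_points(level, 10))
```

For a 10-point problem this works out to 10, 9, 6, 2, and 0 points respectively.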

Project: Initial Plan

Due: Tuesday, February 21st

General Directions

The purpose of this document is to ensure that your group is choosing a substantial research project topic that is interesting and worthwhile. You will be working on the collaborative final project for a large portion of this course, and will use this deliverable to brainstorm project ideas and plan how your team will collaborate. In terms of length, it should be 1-2 pages (not including the appendix) using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). You should convert your final document to a PDF and upload it to Gradescope under the assignment “Initial Plan” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

The Initial Plan is out of 100 points. Meeting basic formatting requirements is worth 40 points and will be graded as follows:

  • E (Exemplary, 40pts) – Work that meets all requirements.
  • N (Not yet, 24pts) – Does not meet all requirements.
  • U (Unassessable, 8pts) – Missing at least one section.

Part 1: Brainstorming (40 points)

To brainstorm ideas for your research topic, you may use one of two options:

  1. Mind map of potential project ideas.
  2. Discussion with ChatGPT.

For the mind map, you can use an online tool, Google drawing, whiteboard, post-it notes, etc. Just ensure you can put it in your report. To create your mind map use the following steps:

  1. Put a central idea or main concept in the center, such as “data science research project” or something more specific that your group finds interesting.
  2. Branch out from the main idea with ideas that range from interesting topics to previous project ideas that caught your group’s attention.
  3. Branch off of those ideas to add more specific interests or personalized ways you would change a topic or project.
  4. Put your mind map as an appendix in this submission.

For the discussion with ChatGPT, do the following:

  1. Tell ChatGPT you are brainstorming for data science projects, what your group’s interests are that could be potential sources of data, and that you need to find the data yourself.
  2. Ask it what ideas it has for your project.
  3. Tell it which ideas you liked, which you didn’t like, why a suggestion isn’t a good one, etc.
  4. Do at least 2-3 rounds of steps 2 and 3 with ChatGPT.
  5. Put your chat as an appendix in this submission.

After your brainstorm, reflect by answering the following questions:

  1. Why did you choose the method you used?
  2. What patterns do you see in what you find interesting?
  3. What research topics or questions did your group generate from this brainstorm? Which of these ideas can you see your group potentially pursuing?
  4. Do you feel like more brainstorming is needed before you find a topic?
  5. If you used
    1. The mind map: Did you find your brainstorm narrowing or diverging as you discussed ideas to write down?
    2. ChatGPT: How satisfied were you with its answers? Why?

Whether you choose to create a mind map or use ChatGPT, use this exercise to brainstorm project ideas that your group collectively believes are interesting, relevant, and worth your time in this course.

Grading

  • E (Exemplary, 40pts) – Appendix has a mind map that branches out at least two levels from the center OR a ChatGPT conversation. In addition, has a reflection that answers all 5 questions.
  • S (Satisfactory, 39pts) – Appendix has a mind map that branches out at least two levels from the center OR a ChatGPT conversation. In addition, has a reflection that mostly answers all 5 questions.
  • N (Not yet, 24pts) – A brainstorm that does not entirely answer 1 or 2 of the questions. Reflection does not entirely answer at least 1 of the questions.
  • U (Unassessable, 8pts) – Work that does not entirely answer 3 or more of the questions above for either the brainstorm or the reflection.

Part 2: Collaboration Plan (20 points)

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This reflects the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide on responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is ok if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the place/platform you will use to meet.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or other chat apps? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.

* In addition to a common repository for data, you may find it useful to explore Google Colab or Deepnote, which allow you to collaborate on Jupyter notebooks and execute them in the cloud (like a Google Doc for Jupyter notebooks).

Grading

  • E (Exemplary, 20pts) – Comprehensive plan that answers all 5 questions and includes a link to their folder/repository.
  • S (Satisfactory, 19pts) – Comprehensive plan that mostly answers all 5 questions. The link to their folder/repository could be missing.
  • N (Not yet, 12pts) – A plan that does not entirely answer 1 or 2 of the questions above. Link can be missing.
  • U (Unassessable, 4pts) – A plan that does not entirely answer 3 or more of the questions above.

Project: Group Formation

Due: Friday, February 10th

In place of a final exam, this course has a collaborative final project where we ask you to bring your data science skills to bear on a research project of your own choosing. It is time to start forming groups (of 4-5 students) for the project. Fill out the group formation survey no later than Friday, February 10th.

The form should only take a couple of minutes. If you already know who you want to work with, you can indicate that in the form. In this case, communicate with your group first and have one member fill out the form once with everyone’s name/netid. If you submit more than once, the last submission is considered valid, but please try not to because it’s already hard to write code to process that survey data file. It’s also fine if you don’t know who you want to work with, in which case you can fill out the form, and we will match you to a group.

If you would like to start thinking about possible projects, some ideas are listed below. You can also brainstorm now using the strategies outlined in the Initial Plan post. However, you are not required to have a concrete project idea until the proposal.

Project ideas

Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:

  • Comparing Stock Market Losses between SARS and SARS-CoV-2
  • Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
  • Predicting North Carolina Election Outcomes
  • Relating Text Analysis of Corporate Reports and Stock Performance
  • Modeling Consumer Flight Behavior Based on Economic Indicators
  • Predicting COVID-19 Death Tolls from Google Search Trends
  • Sentiment Analysis of COVID-19 Tweets
  • Economic Status and Drug Overdose in North Carolina
  • Analyzing Gender and Tech Careers
  • Political Landscape According to Social Media
  • Forecasting Market Shocks and Performance using Article Headlines
  • Tracking Recidivism in US Prisons
  • Understanding Airbnb’s Impact on Evictions
  • Understanding Musical Tastes (Music Recommender System)
  • Human Impact on Climate since the Industrial Revolution
  • The Troll Toll: An Investigation into Troll Tweets

And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.

Example Data Sources

Below, we have some examples of datasets or where you might find data. You should work with data that is interesting to you and should feel free (strongly encouraged even) to look for sources yourself. These are listed just as possibilities and starting places.

  • Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
  • The Yelp Dataset is provided by Yelp as a research challenge with a large amount of data about reviews, businesses, images, and cities, including text data and rich JSON data.
  • The University of California Irvine maintains a large UCI ML repository of publicly contributed datasets aimed toward machine learning tasks of all types. They range from small simple example datasets to large and complicated datasets from specific scientific domains.
  • Data.gov has a huge compilation of data sets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, The Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.
  • Duke University Library Digital Repository Research Data
  • ICPSR – An international consortium of more than 750 academic institutions and research organizations, Inter-university Consortium for Political and Social Research (ICPSR) provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.

Module 01: Python, Central tendency, & Jupyter Notebook

  1. Prepare (due M 1/16)
    1. Content below
    2. Sakai quiz
    3. Install Anaconda (see the Resources page for more instructions)
  2. Peer Instructions – See on the class forum
  3. Homework (due 1/22, 11:59 PM) [Link]

Content (Slides in the Box folder)

1.A – What is Data Science? (in-class on 1/13 or see recording)

1.B – Python3 (14 min.)

  1. Python vs. Java (3 min.)
  2. Data Types (2 min.)
  3. Iteration, Functions, Classes (7 min.) – slide 19 has a typo; the PDF has been fixed
  4. sorted() function documentation (2 min.)

1.C – Python for Data Science

  1. Anaconda and Jupyter (10 min.)
  2. Jupyter Notebook Demo (11 min.)

Optional Supplements