All posts by Dr Kristin Stephens-Martinez, Ph.D.

Project: Presentation

Due: Sunday 12/12, 11:59 PM

General Directions

The project presentation is intended to provide a high-level overview of your project to an audience of your peers (that is, individuals who have a reasonable knowledge of data science but are not experts in your particular project topic). Presentation recordings will be made available to the entire class (through Sakai, so not available outside of the class). The presentation should demonstrate your ability to communicate the significance and interpret the findings of your research project. The presentation should stand on its own so that it makes sense to someone who has not read your proposal or prototype.

Your group should create a video recording of your presentation in which every group member speaks and in which you use a visual aid such as presentation slides. The easiest way to do this is to simply hold a Zoom call with all members of your project group, share your screen with your presentation slides, and record either locally or to the cloud (see Zoom recording help information). If this is not possible, you can also record portions individually and combine the recordings (though this will require additional editing work). In the end, we will ask for a URL to your complete recording, so you can either provide a share link to a Zoom cloud recording or you can record locally and then upload your recording to Duke Box, Warpwire, or any other cloud platform where we can access and view your recording directly online (we should not need to download to view the recording). Ensure that anyone with the link can view your recording.

In terms of length, the presentation should be between 8 and 12 minutes. You can have as many slides as are necessary, but a typical pace is 1-2 slides per minute, so 8-24 slides total would be reasonable. Your slides should prioritize well-labeled figures or visualizations and use text sparingly to emphasize important points. When you are finished, you will submit a PDF of your slides to Gradescope under the assignment “Project Presentation.” Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope. Your first slide should include the URL where we can view the recording of your presentation.

Part 0: Title Slide

The very first slide of your presentation should be a title slide containing at least the following information:

  • A title of your project / presentation
  • Names of all group members
  • URL to recording of your presentation

Part 1: Introduction and Research Questions

Your presentation should begin by introducing your topic generally and posing your research questions. Provide some explanation of the relevance or motivation of your research questions.

Part 2: Data Sources

Discuss the data you have collected and are using to answer your research questions. Be specific: name the datasets you are using, the information they contain, and where they were collected from / how they were prepared.

Part 3: Results

Describe your results. Where possible, provide well labeled and legible charts/figures in your slides to summarize results instead of verbose text. Interpret the results in the context of your research questions. It may not be possible to describe every individual result from your project in a brief amount of time. Focus on the most important and essential results for addressing your research questions.

Unlike in your final report, a short presentation generally cannot describe your methods in enough detail for an informed audience member to reproduce your results. Instead, you should focus on your results and their interpretation, and only discuss methods at a high level such as may be necessary to interpret the results.

Part 4: Limitations and Future Work

You should briefly discuss any important limitations or caveats to your results with respect to answering your research questions. For example, if you don’t have as much data as you would like or are unable to fairly evaluate the performance of a predictive model, explain and contextualize those limitations.

Finally, provide a brief discussion of future work. This could explain how future research might address the limitations you outline, or it could pose additional follow-up research questions based on your results so far. In short, explain how an informed audience member (such as a peer in the class) could improve on and extend your results.

Grading Rubric

Presentations will be evaluated on the following criterion-based rubric. Presentations satisfying all criteria will receive full credit.

  1. Submits a relevant document satisfying general requirements including a URL to a recording
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable discussion of the relevance or motivation for the research questions
  5. Includes a discussion of concrete/specific data sources
  6. Provides results in the form of analysis, tables, visualization, etc.
  7. Tables and figures are properly labeled and legible
  8. Results are discussed and interpreted in the context of the research questions
  9. Provides a reasonable discussion of any limitations to the results
  10. Provides a reasonable discussion of future work and how the results could be extended
  11. The final recording is polished and easy to follow.

Project: Final Report

Due: Sunday 12/12, 11:59 PM

General Directions

The final report is intended to provide a comprehensive account of your collaborative course project in data science. The report should demonstrate your ability to apply the data science skills you have learned to a real-world project in a holistic way from posing research questions and gathering data to analysis, visualization, interpretation, and communication. The report should stand on its own so that it makes sense to someone who has not read your proposal or prototype.

The report should contain at least the parts defined below. In terms of length, it should be about 5-7 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). A typical submission is around 3-4 pages of text and 5-7 pages overall with tables and figures. You should convert your written report to a PDF and upload it to Gradescope under the assignment “Project Final Report” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope. You do not need to upload your accompanying data, code, or other supplemental resources demonstrating your work to Gradescope; instead, your report should contain instructions on how to access these resources (see part 4 below for more details).

Part 1: Introduction and Research Questions

Your final report should begin by reintroducing your topic and restating your research question(s) as in your proposal. As before, your research question(s) should be (1) substantial, (2) feasible, and (3) relevant. In contrast to the prior reports the final report does not need to explicitly justify that the research questions are substantial and feasible in text; your results should demonstrate both of these points. You should still explicitly justify how your research questions are relevant. In other words, be sure to explain the motivation of your research questions.

You can start with the text from your prototype, but you should update your introduction and research questions to reflect changes in or refinements of the project vision. Your introduction should be sufficient to provide context for the rest of your report.

Part 2: Summary of Results

Provide a brief (one or two paragraphs) summary of your results. This summary of results should address your research questions. For example, if one of your research questions was “Did COVID-19 result in bankruptcy in North Carolina during 2020?” then a possible (and purely hypothetical) summary of results might be “We aggregate the public records disclosures of small businesses in North Carolina from January 2019 to December 2020 and find substantial evidence that COVID-19 did result in a moderate increase in bankruptcy during 2020. This increase is not geographically uniform and is concentrated during summer and fall 2020. We also examined the impact of federal stimulus but cannot provide an evaluation of its impact from the available data.”

Part 3: Data Sources

Discuss the data you have collected and are using to answer your research questions. Be specific: name the datasets you are using, the information they contain, and where they were collected from / how they were prepared. You can begin with the text from your prototype but be sure to update it to fit the vision for your final project.

Part 4: Results and Methods

This is likely to be the longest section of your paper at multiple pages. The results and methods section of your report should explain your detailed results and the methods used to obtain them. Where possible, results should be summarized using clearly labeled tables or figures and supplemented with written explanations of the significance of the results with respect to the research questions outlined previously.

Your description of your methods should be specific. For example, if you scraped multiple web databases, merged them, and created a visualization, then you should explain how each step was conducted in enough detail that an informed reader could reasonably be expected to reproduce your results with time and effort. Just saying “we cleaned the data and dealt with missing values” or “we built a predictive model” is not sufficient detail, for example.

Your report should also contain instructions on how to access your full implementation (that is, your code, data, and any other supplemental resources like additional charts or tables). The simplest way to do so is to include a link to the Box folder, GitLab repo (if you use GitHub and wish to keep the repo private, add Prof. Stephens-Martinez (username: ksteph) and your mentor to the repo), or whatever other platform your group is using to house your data and code.

Part 5: Limitations and Future Work

In this part, you should discuss any important limitations or caveats to your results with respect to answering your research questions. For example, if you don’t have as much data as you would like or are unable to fairly evaluate the performance of a predictive model, explain and contextualize those limitations.

Finally, provide a brief discussion of future work. This could explain how future research might address the limitations you outline, or it could pose additional follow-up research questions based on your results so far. In short, explain how an informed reader (such as a peer in the class) could improve on and extend your results.

Grading Rubric

Final reports will be evaluated on the following criterion-based rubric. Reports satisfying all criteria will receive full credit.

  1. Submits a relevant document satisfying general requirements
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are relevant
  5. Provides a brief summary of results
  6. Includes a discussion of concrete/specific data sources
  7. Provides results in the form of analysis, tables, visualization, etc.
  8. Final tables and visualizations are properly labeled and legible
  9. Results provide reasonable answers to research questions and interpretation is provided in the text. Some results may be negative or incomplete (with discussion) but should provide some concrete evidence toward answers to research questions.
  10. Results and methods demonstrate substantial effort and progress over the course of the project
  11. Methods used to obtain results are described in sufficient detail to understand and interpret results
  12. Methods used are generally appropriate and do not contain significant methodological errors
  13. Provides a link/reference to additional materials (e.g., code and data stored in Box or GitLab)
  14. Provides a reasonable discussion of any limitations to the results
  15. Provides a reasonable discussion of future work and how the results could be extended
  16. Final writeup is edited and polished. Can have one or two typos or grammatical errors, but the document is sufficiently edited as to not distract or confuse the reader.

Final Perform

Due: Monday 11/22

Box folder with the files for this perform

Introduction

The Final Perform will have you show all that you have learned in the class so far. This Perform consists of a skeleton notebook and a raw data set. You must process, clean, and analyze the raw data to learn something interesting. We encourage you to work in pairs so you can explore the data set more thoroughly, but it is not required.

The grading scale and points allocation differ from those of prior notebooks. Moreover, the last 3 (out of 100) points for this Perform are allocated towards a conclusion section and the overall cohesion of the notebook. These points focus on how well the sections are connected together and build towards a specific conclusion. Keep in mind that the syllabus states you only need 95% of the possible points to earn full credit. Therefore, if you do not want to demonstrate that level of mastery, you do not need to spend the extra time to work on this.

Working together

  1. You may work with up to one other person.
    1. We recommend that you do, but understand if you would prefer to work by yourself.
    2. If you want to find a partner, try posting on the class forum.
  2. You may share your data loading and cleaning code.
    1. This is code that converts the data files into DataFrames and converts the columns into a useful format.
    2. Just like in the real world, developers would be helping each other in figuring out how to get raw data into a needed format. You may do so for this Perform.
    3. So you should feel free to ask and answer such questions on the class forum.
    4. If you are not sure a question falls under this designation, ask it as a private question first.
  3. You may discuss the kind of analysis you are doing.
  4. You may NOT share your analysis code with anyone except your partner (if you have one).

Assessment Goals

The goals of this Perform are for you to demonstrate the following skills:

  1. Load and process raw data that is not necessarily in an easy-to-use format for your intended analysis.
  2. Visualize data such that a meaningful interpretation can be made.
  3. Wisely choose, explain the choice of, conduct, and interpret the results of a hypothesis test.
  4. Create a prediction model from an existing data set.
  5. Stretch goal: Use all of the above elements to create a cohesive explanation of one or more findings.

Grading Scale and Points Allocation

Each section will be graded on a four-step rubric scale as follows.

  • E (Exemplary) – Work that meets all requirements and displays full mastery of all learning goals and material.
  • S (Satisfactory) – Work that meets all requirements and displays at least partial mastery of all learning goals as well as full mastery of core learning goals.
  • N (Not yet) – Work that does not meet some requirements and/or displays developing or incomplete mastery of at least some learning goals and material.
  • U (Unassessable) – Work that is missing, does not demonstrate meaningful effort, or does not provide enough evidence to determine a level of mastery.

There are 100 points possible. The number of points earned depends on the notebook section. The rubric will be converted to points as follows:

  • E = full credit
  • S = E_full_credit – 1
  • N = E_full_credit / 2
  • U = E_full_credit / 5
  • Blank = 0
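
As a worked example of the conversion above, the sketch below applies it to one section's point value. The function name `rubric_points` is a hypothetical helper for illustration, not part of the course materials, and it does not round, since the text above does not say how fractional points are handled.

```python
def rubric_points(level, full_credit):
    """Convert a rubric level to points using the conversion listed above.

    `rubric_points` is a hypothetical helper for illustration only.
    """
    conversions = {
        "E": full_credit,        # E = full credit
        "S": full_credit - 1,    # S = E_full_credit - 1
        "N": full_credit / 2,    # N = E_full_credit / 2
        "U": full_credit / 5,    # U = E_full_credit / 5
        "Blank": 0,              # Blank = 0
    }
    return conversions[level]


# Points for a 19-point section at each rubric level.
for level in ["E", "S", "N", "U", "Blank"]:
    print(level, rubric_points(level, 19))
```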

Notebook Sections and Grading Expectations

Overall Grading Considerations

The entire notebook is expected to take into account the following:

  1. The code takes advantage of Pandas and NumPy libraries
    1. For loops are allowed
    2. Do not use a for loop to iterate over a DataFrame’s rows, unless the DataFrame is guaranteed to have fewer than 100 rows
  2. Accounts for the fact that there is a different number of ratings for each professor in the data set
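
To illustrate both considerations, here is a minimal sketch on a small made-up table. The column names `professor` and `rating`, the 1-5 rating scale, and the "at least 2 ratings" cutoff are all assumptions for illustration; the actual data set and a sensible threshold may differ.

```python
import pandas as pd

# Hypothetical ratings table; the real data set's columns may differ.
ratings = pd.DataFrame({
    "professor": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 5.0, 4.0],
})

# Vectorized column arithmetic instead of looping over rows.
ratings["rating_pct"] = ratings["rating"] / 5 * 100

# Per-professor mean alongside the number of ratings behind it,
# so professors with very few ratings can be treated with care.
summary = (
    ratings.groupby("professor")["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .reset_index()
)

# Example policy (an assumption): keep professors with at least 2 ratings.
reliable = summary[summary["n_ratings"] >= 2]
print(reliable)
```

Reporting the count next to each mean makes it clear when an average rests on only one or two ratings.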

Section: Data Loading and Cleaning (21 points)

This section should have all of your data loading and cleaning code where you load and create your DataFrame(s). It does not need to contain all of the data processing code if creating a new column or table in a later section makes more sense for explanation and cohesion.

  1. Loads data from all of the data files
  2. Shows at least the first 10 rows of all DataFrames created that are used later in the notebook
  3. Plus overall grading considerations

Section: Visualization (19 points, Module 5B)

This section should contain at least one visualization showing something informative about the data. The skills you learned for this section primarily came from Module 5B.

  1. Each visualization has:
    1. X-axis and Y-axis are labeled and have appropriate values
    2. Legend is provided if needed to interpret the visualization
    3. Use of color adds and does not detract from the visualization
    4. A title or caption describing what the visualization is showing
  2. Draws at least 1 visualization from at least 1 column of data
  3. Provides a short 1-4 sentence summary of key takeaways from the visualizations.
  4. Plus overall grading considerations
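
The checklist above can be sketched as follows. The data and column names are made up for illustration, and the sketch uses plain Matplotlib calls; axes-level Seaborn functions return a Matplotlib Axes, so the same labeling calls should apply there.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "difficulty": [1, 2, 3, 4, 5, 2, 3, 4],
    "rating": [4.8, 4.5, 3.9, 3.1, 2.4, 4.1, 3.5, 2.9],
    "department": ["CS", "CS", "CS", "CS", "CS", "Math", "Math", "Math"],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter series per department; color distinguishes the groups.
for dept, group in df.groupby("department"):
    ax.scatter(group["difficulty"], group["rating"], label=dept)

# Labeled axes, a legend to decode the colors, and a descriptive title.
ax.set_xlabel("Difficulty (1 = easiest, 5 = hardest)")
ax.set_ylabel("Average rating (1-5)")
ax.set_title("Rating versus difficulty by department")
ax.legend(title="Department")
```

A short written takeaway would then follow the figure in a Markdown cell.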

Section: Hypothesis Test (19 points, Module 3B)

This section should contain at least one hypothesis test about the data. The skills you learned for this section primarily came from Module 3B.

  1. H0 and H1 hypotheses are clearly labeled and stated
  2. The kind of test used is clearly stated
  3. Has a clear interpretation of the test’s result
  4. Plus overall grading considerations
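
As one possible shape for this section, the sketch below states the hypotheses, names the test, and interprets the result. It uses a two-sided permutation test on the difference in means, which is an assumption chosen for illustration; the test taught in Module 3B or best suited to your question may be different. The sample values and the 0.05 significance level are also made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples for illustration only.
group_a = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.3])
group_b = np.array([3.6, 3.9, 3.4, 3.7, 3.5, 3.8])

# H0: the two groups have the same mean rating.
# H1: the two groups have different mean ratings (two-sided).
# Test: permutation test on the difference in means.
observed = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 10_000
diffs = np.empty(n_permutations)
for i in range(n_permutations):
    # Shuffle the group labels by permuting the pooled values.
    shuffled = rng.permutation(pooled)
    diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# p-value: share of shuffles at least as extreme as the observed difference
# (the small tolerance guards against floating-point ties).
p_value = np.mean(np.abs(diffs) >= abs(observed) - 1e-12)

alpha = 0.05
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```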

Section: Prediction (19 points, Module 6)

This section should contain the creation and testing of at least one model. The skills you learned for this section primarily came from Module 6.

  1. The data and target for the model are clearly labeled
  2. Has a clear rationale for the data used in the model
  3. Properly splits and uses a train and test set
  4. Has a clear interpretation for the results of the model
  5. Plus overall grading considerations
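
A minimal sketch of the split-then-evaluate workflow is shown below. It uses scikit-learn with synthetic data and a linear regression model, all of which are assumptions for illustration; your own data, target, and model choice will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # data (features)
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, 200)    # target

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)          # fit on the training set only

# Evaluate on the held-out test set.
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_r2 = model.score(X_test, y_test)
print(f"test MSE = {test_mse:.3f}, test R^2 = {test_r2:.3f}")
```

Reporting metrics on the test set rather than the training set gives a fairer picture of how the model would do on unseen data.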

Section: Additional Analysis (19 points)

This section should contain one more analysis of your choosing. It can be like any of the other analysis sections, so another visualization, hypothesis test, or prediction analysis.

  1. Clearly states what the additional analysis is
  2. Provides a clear rationale for the analysis
  3. Has a clear interpretation for the results of the analysis
  4. Fulfills all of the requirements of the kind of analysis that it is
  5. Plus overall grading considerations

Section: Conclusion and Cohesion (3 points)

You only need this section if you are interested in earning these last points.

If you need to rearrange the sections to improve the cohesion of your notebook, you may do so.

These points can only be earned if at least two of the analysis sections earned an E and an S is earned for all of the other sections. These points focus on the overall cohesion of your sections and if the conclusion effectively summarizes the results across all of the sections.

  1. All five sections have a clear progression and build off of each other
  2. Each section references another as appropriate in building a cohesive explanation of the main results of the notebook
  3. The conclusion effectively summarizes the notebook (it should not just be a list of the results of each section)
  4. The conclusion provides a summary of the key takeaways from the analyses
  5. Plus overall grading considerations

Module 7: Deep Learning

  1. Prepare (soft due Tu 11/9, hard due M 11/15)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 11/10, hard due M 11/15)
    1. Part 1
    2. Part 2
    3. Part 3
    4. Part 4
  3. Practice (due M 11/22)
  4. Perform – There is no Perform for this module

Content

7 Deep Learning

  1. Neural Networks and Applications (16 min.)
  2. Forward Propagation (10 min.)
  3. Gradient Descent (14 min.)
  4. Back Propagation (11 min.)
  5. Convolutional Neural Network (15 min.)
  6. Introducing Pytorch (23 min.)

Optional Supplements

The deep learning book is available free online and is authored by some of the leading experts in machine learning with deep artificial neural networks. It is very detailed and in-depth and is purely for those who are interested in learning more about deep learning theory now or in the future; you do not need to read the book for this course.

Unlike most other libraries for this course, PyTorch is not included in the basic Anaconda installation. To use PyTorch, we suggest you choose one of two options.

  • Install PyTorch locally (for free). You can see the directions on the website: Select the stable build, your operating system, Conda (for Anaconda), Python, and CPU to see install directions for your particular setup. (CUDA is used to support hardware acceleration with NVIDIA graphics cards and is not necessary for this course.)
  • Use PyTorch in a Jupyter notebook in the cloud (also for free). The easiest way to do this if you have a Google account is with a Google Colab notebook; PyTorch will already be available to you in this cloud environment.

You can find the official PyTorch documentation here. Of particular note are the PyTorch tutorials, including PyTorch recipes, which serve as small examples of common tasks.

Module 6: Prediction & Supervised Machine Learning

  1. Prepare (soft due Tu 10/26, hard due M 11/1)
    1. Content below; if you are new to machine learning, some of the optional supplements are strongly recommended.
    2. Sakai quizzes
  2. Group Worksheet (soft due W 10/27, hard due M 11/1)
  3. Practice (due M 11/8)
  4. Perform (due M 11/22)

Content

6.A Predictive Modeling and Regression

  1. Ordinary Linear Regression and Intro Scikit-Learn (21 min.)
  2. Nonlinear Regression and Scikit-Learn Preprocessing (13 min.)
  3. Binary Classification with Logistic Regression (22 min.)

6.B Machine Learning and Classification

  1. Naïve Bayes and Text Classification (20 min.) – The video has a typo on slide 10; see the PDF of the slides in Box for the fix.
  2. K-Nearest Neighbors and Training/Testing (31 min.)

Optional Supplements

Chapter 5 Machine Learning from the Python Data Science Handbook provides a very nice treatment of many of the topics from the above videos and more. If you are new to machine learning, we highly recommend that you read sections 5.1 What is Machine Learning through 5.4 Feature Engineering after completing the videos. After that, you can optionally read any of the In-Depth sections about specific algorithms for prediction.

In addition, the scikit-learn documentation itself provides several resources for working with the library.

Project: Prototype

Due: Monday 11/08, no late period

General Directions

The prototype deliverable is intended to demonstrate a proof of concept for your final project report. Large multi-week projects are challenging, so this deliverable is intended to provide additional structure to ensure you are making progress and on a path towards success. It consists of a written report detailed below along with any accompanying data, code, or other supplementary resources that demonstrate your progress so far in the project. You can think of it as a rough draft for your final project. The report should stand on its own so that it makes sense to someone who has not read your proposal.

The report should contain at least the four parts defined below. In terms of length, it should be about 3-4 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). A typical submission is around 2-3 pages of text and 3-4 pages overall with tables and figures. You should convert your written report to a PDF and upload it to Gradescope under the assignment “Project Prototype” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope. You do not need to upload your accompanying data, code, or other supplemental resources demonstrating your work to Gradescope; instead, your report should contain instructions on how to access these resources (see part 3 below for more details).

Part 1: Introduction and Research Questions

Your prototype report should begin by reintroducing your topic and restating your research question(s) as in your proposal. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant. Briefly justify each of these points as in the project proposal. You can start with the text from your proposal, but you should update your introduction and research questions to reflect changes in or refinements of the project vision. Specifically, point out what has changed since the proposal. Your introduction should be sufficient to provide context for the rest of your report.

Part 2: Data Sources

After your introduction and research questions, your prototype should discuss the data you have collected and are using to answer your research questions. Be specific: name the datasets you are using and where they were collected from / how they were prepared. Briefly justify why your data are appropriate and sufficient to address your research questions. As in the introduction, you can begin with the text from your proposal but be sure to update it to fit with your evolving project.

Part 3: Preliminary Results and Methods

The preliminary results section of your report should summarize the results obtained so far in the project. Where possible, results should be summarized using clearly labeled tables or figures and supplemented with a written explanation of the significance of the results with respect to the research questions outlined in the previous section. Your results do not need to be final or conclusive for your entire project but should demonstrate substantial effort and progress and should provide concrete proof of concept or initial analysis with respect to your research questions.

Your results should be specific about exactly what data were used and how the results were generated. For example, if you scraped multiple web databases, merged them, and created a visualization, then you should explain how each step was conducted in enough detail that an informed reader could reasonably be expected to reproduce your results with time and effort. Just saying “we cleaned the data and dealt with missing values” is not sufficient detail, for example.

Your report itself should include an explanation of your methods, but it should also contain instructions on how to access your full implementation (that is, your code, data, and any other supplemental resources like additional charts or tables). The simplest way to do so is to include a link to the Box folder, GitLab repo, or whatever other platform your group is using to house your data and code.

Part 4: Reflection and Next Steps

In this part, you should begin by reflecting on the progress of your project so far. Address the following:

  1. What has been successful in the project so far or what is essentially complete and ready for the final report?
  2. What has been challenging in the project so far or what is incomplete in the prototype that needs to be finished for the final report?
  3. What are your next steps? These should be concrete and specific actions that your group will take to address the challenges identified in order to complete a successful final project.

Feedback and Grading Rubric

Prototypes will be evaluated on the following criterion-based rubric. Prototypes satisfying all criteria will receive full credit. Formative feedback (comments and suggestions) will also be provided for each prototype by your project group mentor.

  1. Submits a relevant document satisfying general requirements
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are substantial
  5. Provides a reasonable justification that research questions are feasible
  6. Provides a reasonable justification that research questions are relevant
  7. Explains how the topic/research questions have (or have not) changed since the proposal
  8. Includes a discussion of concrete/specific data sources
  9. Provides reasonable justification that data sources are appropriate for research questions
  10. Provides some specific preliminary results in the form of analysis, tables, visualization, etc.
  11. Results demonstrate substantial effort and progress toward addressing research questions; they do not have to be complete or exhaustive but must demonstrate effort and progress
  12. Methods used to obtain results are described in sufficient detail to understand and interpret results
  13. Provides a link/reference to additional materials (e.g., code and data stored in Box or GitLab)
  14. Reflects on successes / what is fairly complete in the project so far
  15. Reflects on challenges / what is incomplete in the project so far
  16. Discusses concrete/specific action items to complete the final project

Module 5B: Visualization

  1. Prepare (soft due Th 10/14, hard due M 10/18)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due F 10/15, hard due M 10/18)
  3. Practice (due M 10/25)
  4. Perform (due M 11/8)

Content

5B.A Data Visualization and Design

  1. Why Visualize? (11 min.)
  2. Basic Plot Types (17 min.)
  3. Dos and Don’ts (10 min.)

5B.B Visualization in Python

  1. Intro to Python Visualization Landscape (7 min.)
  2. Seaborn Introduction (17 min.)
  3. Seaborn Examples (17 min.)

Optional Supplements

Module 5A: Databases & SQL

  1. Prepare (soft due Tu 10/12, hard due M 10/18)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 10/13, hard due M 10/18)
  3. Practice (due M 10/25)
  4. Perform (due M 11/8)

Content

5A.A – Relational Database (24 min.)

5A.B

  1. SQL Querying (21 min.)
  2. SQL with Python and Pandas (12 min.)

Optional Supplements

Module 4: Combining Data

There is only 1 module for learning sprint 4. The rest of your time should be spent on your project.

  1. Prepare (soft due Tu 9/28, hard due M 10/11)
    1. Content below
    2. Sakai quizzes
  2. Group Worksheet (soft due W 9/29, hard due M 10/11)
  3. Practice (due M 10/11)
  4. Perform (due M 10/25)

Content

4.A – Summarizing Data

  1. Read Section 3.8 Aggregating and Grouping from Python Data Science Handbook.
  2. Read Section 3.9 Pivot Tables from Python Data Science Handbook.

4.B – Merging Data

  1. Record Linkage (8 min.)
  2. Read Section 3.6 Concat and Append from Python Data Science Handbook. Please note that the join_axes optional parameter mentioned in this section has been deprecated in the Pandas library, so you can skip over the details on this parameter.
  3. Read Section 3.7 Merge and Join from Python Data Science Handbook
  4. Fuzzy Matching (21 min.)
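
For readers wondering what to use in place of `join_axes`, here is a small sketch of `pd.concat` with made-up frames. It assumes a recent pandas release (1.0 or later), where, to the best of our knowledge, `join_axes` is no longer accepted; reindexing the result is one common substitute, not the only one.

```python
import pandas as pd

# Two small hypothetical frames that share only column "B".
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Default outer join keeps every column and fills gaps with NaN.
outer = pd.concat([df1, df2], ignore_index=True)

# join="inner" keeps only the columns common to both frames.
inner = pd.concat([df1, df2], join="inner", ignore_index=True)

# To keep exactly df1's columns (what join_axes was used for),
# reindex the result's columns after concatenating.
like_df1 = pd.concat([df1, df2], ignore_index=True).reindex(columns=df1.columns)

print(list(outer.columns))
print(list(inner.columns))
print(list(like_df1.columns))
```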

Optional Supplements

Project: Proposal

Due: Monday 10/11

General Directions

The purpose of this document is to prepare your team for success in the course project. Your proposal should contain at least three parts, which we define below. In terms of length, it should be 1-2 pages using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these three components, you should provide any additional context or information necessary to understand your vision for your project. You should convert your final document to a PDF and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIDs in your final document and use the group submission feature on Gradescope to include all of your group members on a single submission.

Part 1: Introduction and Research Questions

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society.

You should provide a brief justification of your research question(s) with respect to each of these three points.

While you are welcome to study whatever topic you like, the following have been popular themes in previous years: health and medicine, business and economics, sports analytics, social media analysis, politics and/or policy, gender and/or race. The Project Ideas section of the group formation post has many examples of topics.

Part 2: Data Sources

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and explain how you will access them, or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.

Part 3: Collaboration Plan

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This is reflective of the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is ok if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the platform you will use to meet.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or another chat app? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab, which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.

* In addition to a common repository for data, you may find it useful to explore Google Colab, which allows you to collaborate on Jupyter notebooks and execute them in the cloud (like a Google Doc for Jupyter notebooks).

Feedback and Grading Rubric

Proposals will be evaluated on the following criterion-based rubric. Proposals satisfying all criteria will receive full credit. Formative feedback (comments and suggestions) will also be provided for each proposal by a teaching assistant who will be assigned as a project group mentor.

  1. Satisfies general directions (length, on-time pdf submission, group submission, etc.)
  2. Includes a brief introduction to the topic of interest
  3. Poses one or more concrete research questions
  4. Provides a reasonable justification that research questions are substantial
  5. Provides a reasonable justification that research questions are feasible
  6. Provides a reasonable justification that research questions are relevant
  7. Includes one or more specific datasets or reasonable discussion of how to locate data
  8. Provides reasonable justification that data sources are appropriate for research questions
  9. Collaboration plan specifies how responsibilities will be divided and about how much time on average each group member should expect to spend per week
  10. Collaboration plan specifies when and how the team will meet, at least weekly
  11. Collaboration plan specifies platform/technology for communication between meetings and provides a link to a folder/repository for sharing data, code, etc.