Project Proposal

Due: Saturday, 10/15

General Directions

The purpose of this document is to prepare your team for success in the course project. Your proposal should contain at least four parts, which we define below. In terms of length, it should be 2-3 pages (2.5 pages is typical) using standard margins (1 in.), font (11-12 pt), and line spacing (1-1.5). In addition to these four components, you should provide any additional context or information necessary to understand your vision for your project. Convert your final document to a PDF and upload it to Gradescope under the assignment “Project Proposal” by the due date. Be sure to include your names and NetIDs in your final document, and use the group submission feature on Gradescope to include all of your group members on a single submission.

The proposal is out of 100 points. Meeting basic formatting requirements is worth 45 points and will be graded as follows:

  • E (Exemplary, 45pts) – Work that meets all requirements.
  • N (Not yet, 27pts) – Does not meet all requirements.
  • U (Unassessable, 9pts) – Missing at least one section.

Part 1: Introduction and Research Questions (15 points)

Your proposal should begin by introducing your topic in general and then defining one or more research questions. Research questions are the guiding questions you want to answer or problems you want to solve in your project. Your research question(s) should be (1) substantial, (2) feasible, and (3) relevant.

  1. Substantial research questions require more than a surface-level analysis (more than just computing basic summary statistics on readily available datasets, for example).
  2. Feasible research questions can actually be addressed by four or five team members over the course of approximately six weeks using data you can access.
  3. Relevant research questions address a subject of importance and interest within the scientific community or broader society.

You should provide a brief justification of your research question(s) with respect to each of these three points. We recommend clearly marking this section by bolding the words substantial, feasible, and relevant when you provide your justification.

While you are welcome to study whatever topic you like, the following have been popular themes in previous years: health and medicine, business and economics, sports analytics, social media analysis, politics and/or policy, gender and/or race. The Project Ideas section of the group formation post has many examples of topics.

Grading

  • E (Exemplary, 15pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification of why the research questions are substantial, feasible, and relevant, and the justification is reasonable and clear in the context of a CS216 project.
  • S (Satisfactory, 14pts) – Comprehensive introduction with clearly labeled research questions. It includes a justification of why the research questions are substantial, feasible, and relevant, but the justification clearly lacks clarity or reasonableness in the context of a CS216 project.
  • N (Not yet, 9pts) – Incomplete introduction in which pieces of the research questions or justification are missing, but at least some of it is present. Or the justification is clearly not reasonable.
  • U (Unassessable) – Incomplete introduction that is entirely missing the research questions or justification, or that does not demonstrate meaningful effort.

Part 2: Data Sources (15 points)

Your project should deal with real data. We provide pointers to some data sources in the Project Ideas section of the group formation post, but you are welcome and encouraged to look for your own data sources. After your introduction and research questions, your proposal should discuss the data you will use to answer your research questions. Be as specific as possible: name the datasets you will use and how you will access them or specify where you will look for the relevant datasets and why you expect to be successful in finding them. You should also briefly justify why the data you plan to obtain will be relevant and appropriate for addressing your research questions. Searching for data sources as you refine your research questions is likely to be the most time-consuming part of preparing your proposal and is crucial for a good start on your project, so do not put it off.
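One quick way to support the claim that a dataset is appropriate is to check that it actually contains the fields your research questions need. The following is a minimal sketch, not a required step: the CSV text, the column names, and the helper `summarize_columns` are all hypothetical and stand in for a file you have downloaded. It uses only the Python standard library; the same check could be done with pandas.

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded file (not real data).
SAMPLE_CSV = """player,team,points,minutes
A,X,12,30
B,X,,25
C,Y,20,
D,Y,8,18
"""

# Hypothetical columns that the research question would depend on.
REQUIRED_COLUMNS = ["player", "points", "minutes"]


def summarize_columns(csv_file, required):
    """Return (row count, required columns absent, empty-value counts per column)."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing_columns = [c for c in required if c not in fieldnames]
    empty_counts = {c: 0 for c in required if c in fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for column in empty_counts:
            # Treat absent or whitespace-only values as empty.
            if (row[column] or "").strip() == "":
                empty_counts[column] += 1
    return row_count, missing_columns, empty_counts


rows, missing, empties = summarize_columns(io.StringIO(SAMPLE_CSV), REQUIRED_COLUMNS)
print(f"rows: {rows}")
print(f"required columns not found: {missing}")
print(f"empty values per required column: {empties}")
```

Reporting numbers like these in your proposal (how many rows, which needed fields exist, how much is missing) is one concrete way to justify that the data is relevant and accessible.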

Grading

  • E (Exemplary) – Origins of data or methods to acquire data are properly specified, cited, and relevant to answering the research question(s). If the data is not already available, the justification for why the team expects to have access to it soon is reasonable. (In other words, we are reasonably confident you’ll be able to get the data you need for your research questions.)
  • S (Satisfactory) – Origins of data or methods to acquire data are properly specified and cited. However, it is not clear why the data is relevant to the proposed research question(s) OR the justification for why the team expects to have access to the data is not reasonable. (In other words, we are not entirely sure you’ll be able to get the data you need for your research questions.)
  • N (Not yet) – Poorly specified data sources or methods to acquire data OR the justification for using that data set or the methods to acquire the data is lacking.
  • U (Unassessable) – Data sources or methods to acquire data are missing or do not demonstrate meaningful effort.

Part 3: What Modules are You Using? (15 points)

Your project should use concepts from modules we have covered or will cover in this course to answer your research question(s). We will assume you will use modules 1 (Python) and 2 (NumPy/pandas). This section should state at least 3 more modules that you will use for your project. Each module should have a short description of how you will use the knowledge in this module and a justification for that use. In addition, state which concepts from the module you will use and at what stage of your project you plan to use this module the most. Potential stages include, but are not limited to: data gathering, data cleaning, data investigation, data analysis, and final report.

  • Module 3: Probability
  • Module 4: Data Wrangling
  • Module 5: Statistical Inference
  • Module 6: Combining Data
  • Module 7: Databases and SQL
  • Module 8: Visualization
  • Module 9: Prediction & Supervised Machine Learning

When the proposal is due, you may not yet have learned the material from some of the modules above. In this case, you should still list the modules that are applicable, with a description of the concepts you believe each module will cover that will be useful for answering your research question.

If you do not plan to use Python, NumPy, and pandas for your project, you must state this and explain why you are choosing not to. It is okay to use something else, like R, but keep in mind that the teaching staff may not have the skills to support you.

Grading

  • E (Exemplary) – States at least 3 modules. For each module, the proposal provides (1) a short description of how the team will use the module, (2) a justification for using this module, (3) the concepts they will likely use, and (4) the stage at which they expect to use it.
  • S (Satisfactory) – States at least 3 modules, but there are some weaknesses, such as one module having 3 or more parts not well fleshed out, or one part being weak across all 3 modules.
  • N (Not yet) – States at least 3 modules, but 3 or more parts are entirely missing or basically non-existent out of 12 = 4 parts X 3 modules.
  • U (Unassessable) – Does not meet the Not Yet criteria, such as having fewer than 3 modules or missing more than 3 parts across all 12 = 4 parts X 3 modules.

Example:

Here is an example justification for module 3, assuming the project is about building a prediction model that classifies the data. Note the bolding, which will help you ensure you are meeting all requirements and help your grader find them.

Module 3 Probability: We will use this module to calculate the accuracy of a baseline version of the model we will build. We will do this by considering the proportion of the label we are trying to predict, as well as taking into account some of the independent variables. Our justification is that we need a baseline accuracy to understand how good our model is. The concepts we will mainly use are the probability axioms and maybe some of Bayes or marginalization to calculate this baseline. We plan to use this module during the data analysis and final report stage.
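To make the baseline idea in the example above concrete, here is a minimal sketch of the simplest such baseline: the accuracy you would get by always predicting the most common label. The function name `majority_baseline_accuracy` and the label values are hypothetical and for illustration only. A real project would apply this to its own label column and might refine the baseline using some of the independent variables.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns a list with one (label, count) pair.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical labels for illustration only (not real data).
example_labels = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham"]
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A trained classifier is only interesting if its accuracy clearly exceeds this number, which is why a proposal might name the baseline as part of its justification.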

Part 4: Collaboration Plan (10 points)

This is a collaborative course project pursued by a team of students who bring different strengths and interests to the table. This reflects the reality that significant real-world projects in data science are almost always pursued by teams. For the collaboration to be successful, it helps to establish some guidelines that serve as a starting point. Your collaboration plan should address the following:

  1. How will you divide responsibilities? Will some students be responsible for certain portions of the project, or will you be more integrated and decide on responsibilities on a weekly basis?
  2. About how much time do you expect every group member to spend on the project each week, on average? It is ok if this number is higher toward the last couple of weeks of the semester.
  3. When and how will you meet? You should plan to meet at least once per week for at least 30 minutes to check in on one another’s progress, get help, and plan for what comes next. Identify a day of the week, a time, and the place/platform you will use to meet.
  4. What platform(s) will you use to communicate between meetings? Will you primarily use email, text, Slack, or other chat apps? If you want a more professional enterprise tool, Duke provides free access to Microsoft Teams.
  5. Where will you store data, code, writing, etc., so that all group members have easy access to shared materials?* Duke provides free access to Box and GitLab, which could serve these purposes, but you could also use external services like Google Drive or GitHub. Provide a link to the folder/repository in your proposal to demonstrate that it is created and ready.

* In addition to a common repository for data, you may find it useful to explore Google Colab or Deepnote, which allow you to collaborate on Jupyter notebooks and execute them in the cloud (like a Google Doc for Jupyter notebooks).

Grading

  • E (Exemplary) – Comprehensive plan that answers all 5 questions and includes a link to their folder/repository.
  • S (Satisfactory) – Comprehensive plan that mostly answers all 5 questions. The link to their folder/repository could be missing.
  • N (Not yet) – A plan that does not entirely answer 1 or 2 of the questions above. Link can be missing.
  • U (Unassessable) – A plan that does not entirely answer 3 or more of the questions above.

Checklist Before You Submit:

  1. Does your proposal satisfy all general directions?
    1. 2-3 pages in length
    2. Standard margins (1 in.)
    3. Font size is 11-12 pt
    4. Line spacing is 1-1.5
    5. Final document is a pdf
  2. Do you have an Introduction and clearly stated Research Question(s)?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  3. Have you properly specified/cited one or more specific Data Sources or methods to acquire data and justified why they are relevant to the Research Question(s)?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  4. Did you state at least 3 Modules and, for each one, describe how you will use it, justify that use, name the concepts you will use, and identify the project stage where you will use it?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
  5. Have you specified a Collaboration Plan with designated responsibilities, meeting times, and platforms for communication and project storage?
    1. Do you feel as if this part meets the requirements of E (Exemplary) or S (Satisfactory)?
