Due: Friday, September 8th
In place of a final exam, this course has a collaborative final project in which you bring your data science skills to bear on a research project of your own choosing. Each group needs 4-5 people. So that we can help groups without enough people, you must indicate who will be in your group by filling out the group formation survey on Gradescope no later than Friday, September 8th. Use the group submission feature on Gradescope to include all of your group members on a single submission.
The survey should only take a couple of minutes. If you do not have anyone to work with, or your group does not have enough people, we will assign you to a group or add more people to your group. Please fill out the form so that we know what you or your subgroup are interested in.
If you would like to start thinking about possible projects now, some ideas are listed below. You can also brainstorm using the strategies outlined in the Initial Plan post (TBA). You are not required to have a concrete project idea until the proposal.
Project ideas
Not sure how to get started? Looking for examples of what a data science project might look like? Here are some of the topics that students studied in Spring 2020:
- Comparing Stock Market Losses between SARS and SARS-CoV-2
- Recessions, Depressions, and Depression: Mental Health in Relation to Economic Factors
- Predicting North Carolina Election Outcomes
- Relating Text Analysis of Corporate Reports and Stock Performance
- Modeling Consumer Flight Behavior Based on Economic Indicators
- Predicting COVID-19 Death Tolls from Google Search Trends
- Sentiment Analysis of COVID-19 Tweets
- Economic Status and Drug Overdose in North Carolina
- Analyzing Gender and Tech Careers
- Political Landscape According to Social Media
- Forecasting Market Shocks and Performance using Article Headlines
- Tracking Recidivism in US Prisons
- Understanding AirBnB's Impact on Evictions
- Understanding Musical Tastes (Music Recommender System)
- Human Impact on Climate since the Industrial Revolution
- The Troll Toll: An Investigation into Troll Tweets
And here is an archive of summer Data+ projects from the last several years. In Data+, teams of about 4 undergraduate students collaborate over the summer on a data science project. You should be able to see final presentations and/or executive summary slides for most projects; feel free to browse for inspiration.
Example Data Sources
Below are some example datasets and places where you might find data. You should work with data that interests you, and you are strongly encouraged to look for sources yourself. These are listed only as possibilities and starting points.
- Kaggle maintains several thousand public datasets of interest in a variety of topics. Kaggle also hosts several prediction challenges; one idea for a machine learning project is to enter one of these competitions as a team.
- The Yelp Dataset is provided by Yelp as a research challenge and contains a large amount of data about reviews, businesses, images, and cities (text data, rich JSON data, etc.).
- The University of California, Irvine maintains the UCI ML Repository, a large collection of publicly contributed datasets aimed at machine learning tasks of all types. They range from small, simple example datasets to large and complicated datasets from specific scientific domains.
- Data.gov has a huge compilation of datasets produced by the US government. The US Census Bureau also publishes datasets from all of its survey work. Similarly, the Supreme Court Database tracks all cases decided by the US Supreme Court, and GovTrack.us provides links to all kinds of information about the US Congress and all votes cast by its members.
- Duke University Library Digital Repository Research Data
- ICPSR – The Inter-university Consortium for Political and Social Research (ICPSR), an international consortium of more than 750 academic institutions and research organizations, provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. It hosts 21 specialized collections of data in education, aging, criminal justice, substance abuse, terrorism, and other fields.