Project Info – Data Cleaning and Integration

Milestone 1: Project Warm-Up

In class, we will first randomly group students for 30 minutes of discussion. This grouping may or may not be the same as your project teams. Then, groups, project teams, or individuals take turns pitching their project ideas; everybody speaks—individuals who have not yet formed groups can talk about their interests/skill sets.

By the evening of the following day, submit on Piazza a short project proposal, including the list of your project team members. Only one person on the team needs to submit. It is okay if your proposal is still somewhat vague at this point, but try to be as specific as possible. For example, if you want to develop a new method/algorithm, at least decide on the general problem you are solving, and identify any relevant previous work. If you want to clean some data, at least narrow down the domain and list several specific data sets you wish to investigate.

Milestone 2: Midterm Progress Report

Since there are five project teams, each team will have 15 minutes (including questions) in class to present its midterm progress (with slides). In your presentation, discuss the progress you have made to date and your plan to finish the project. At this point, you should have spent enough time investigating your dataset/problem, identified the key technical challenges, and tried at least some preliminary approaches. If any remaining roadblocks could prevent successful completion of your project, now is the time to discuss them.

By the evening of the following day, submit on Piazza a short progress update. It can be the presentation slides you used if they contain enough details, or you may additionally submit a more detailed write-up. Your team will need to meet with the instructor as a group after the spring recess for feedback and follow-up discussion.

Final Demo/Presentation

With five projects, each team will have 30 minutes (including questions) during the final examination time to present the project to the class. Unlike proposal and progress presentations, these presentations should be self-contained, complete with motivation, approach, results, and related work. Think of them as full-length presentations of papers at academic conferences, but make sure you devote enough time to technical details.

By the evening of the same day, submit your 1) presentation slides, 2) project report, and 3) pointer to source code by email to the instructor. Unlike the progress report, you cannot simply use your presentation slides as the final report. Ideally, the final report should read like a research paper or a demonstration description that can be submitted to a conference/journal/workshop. At the bare minimum, the report should be self-contained, describing motivation, approach, results, and related work, with enough detail that your research/work can be reproduced later by others.

Data

Here is a list of datasets with varying degrees of quality issues. You may pick those that interest you the most and apply/extend the techniques learned in this course to clean/integrate them.

Duke Men’s Basketball game/player statistics, as discussed in class. Talk to Jun if you want to explore this direction more.
KDD Cup (http://www.kdd.org/kdd-cup) in 2013 had an entity resolution challenge (identifying the authors of scientific papers) for the Microsoft Academic Search database. Alternatively, you can use the DBLP computer science bibliography (http://dblp.uni-trier.de/).
Data.gov (http://www.data.gov/) has a huge compilation of data sets produced by the US government.
The Supreme Court Database (http://scdb.wustl.edu/data.php) tracks all cases decided by the US Supreme Court. Talk to Jun if you want to explore this direction more.
US government spending data (https://www.usaspending.gov/) has information about government contracts and awards.
Federal Election Commission (http://www.fec.gov/disclosure.shtml) has campaign finance data to download; their “disclosure portal” (http://www.fec.gov/pindex.shtml) also provides nice interfaces for exploring the data.
GovTrack.us (http://www.govtrack.us/developers) tracks all bills through the Congress and all votes cast by its members. The Washington Post has a nice website (http://projects.washingtonpost.com/congress/113/) for exploring this type of data. We have an ongoing project that uses this data (http://icheckuclaim.org/). In addition to improving data quality, there are also efforts on obtaining voter guides from special interest groups such as ACU and AFL-CIO and cross-referencing them. There are lots of data quality issues. Talk to Jun if you want to explore this direction more.
Each state legislature also maintains its own voting records. For example, you can find North Carolina’s here: http://www.ncleg.net/Legislation/voteHistory/voteHistory.html. Some states provide records in already structured formats, but for others, you may need to scrape their websites. Talk to Jun if you want to explore this direction more.
The Washington Post maintains a list of datasets (http://www.washingtonpost.com/wp-srv/metro/data/datapost.html) that have been used to generate investigative news pieces. Most of these datasets hide behind some interface and may need to be scraped, and they may have already been “cleaned.” Nonetheless, you can verify their quality, and/or work on integrating interesting combinations of data sets.
Stanford Journalism Program maintains a list of curated transportation-related datasets (http://www.datadrivenstanford.org/).
National Institute for Computer-Assisted Reporting maintains a list of datasets of public interest (http://www.ire.org/nicar/database-library/). Use this list for examples of what datasets are “interesting”—they are generally not available to the public, but there may be alternative ways to obtain them.
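
As a starting point for the author entity resolution task mentioned in the list above, here is a minimal sketch of name normalization followed by blocking, which groups records that might refer to the same person. The function names, the blocking key (last name plus first initial), and the sample names are all assumptions made for illustration; they are not part of any dataset or course-provided code. Blocking only produces candidate groups, so a real project would still need a pairwise matching step afterwards.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    return " ".join(letters_only.split())


def blocking_key(name):
    """Hypothetical blocking key: last token plus first initial.

    Names written as "Last, First" are reordered before tokenizing.
    """
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    tokens = normalize(name).split()
    if not tokens:
        return ""
    return f"{tokens[-1]}_{tokens[0][0]}"


def candidate_groups(names):
    """Group names by blocking key and keep only groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    # Made-up sample names, not taken from any real dataset.
    sample = ["Maria Lopez", "Lopez, Maria", "M. Lopez",
              "José García", "Jose Garcia", "Alice Smith"]
    for key, members in candidate_groups(sample).items():
        print(key, members)
```

A key this coarse will over-group common names (for example, two different people who share a last name and first initial), which is acceptable for blocking because the goal is only to shrink the set of pairs that a more careful comparison has to examine.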