Data Cleaning and Integration

What: CompSci590.01 (Spring 2017), Duke University
When: TuTh 3:05PM – 4:20PM
Where: LSRC A247
Instructor: Jun Yang
Office hours: MW 4:00PM – 4:50PM

Data is increasingly used to make decisions small and big: from showing ads on a web page, to optimizing business strategies and diagnosing and treating diseases, all the way to setting policies that affect millions of lives. However, real data is usually dirty and often comes from multiple sources. Hence, before data can be used for meaningful analysis, it needs to be cleaned and integrated. By some estimates, data scientists spend 50 to 80 percent of their time mired in “munging” data before it can be mined for useful nuggets.

This course covers recent research on data cleaning and integration. The problems will be approached from multiple angles: databases (e.g., rule-based cleaning, schema matching, and efficient algorithms for large data), human-computer interaction and collective intelligence (e.g., interactive and crowdsourcing systems), statistics/machine learning (e.g., leveraging stats/ML to improve cleaning/integration, as well as considering the effects on subsequent analysis), and privacy.

Class meetings involve presenting and discussing selected research papers. Students are required to submit short reviews for these papers, and each student must lead the presentation/discussion of at least one paper during the semester. There are no exams. Instead, there is a semester-long course project, to be completed individually or in groups of two. Projects can develop new approaches or better algorithms, and/or implement/adapt existing techniques for real application settings. The course grade will be based on a combination of the course project, written reviews, and class participation.

Prerequisites: The course is open to interested graduate and undergraduate students. Basic knowledge of algorithms and probability is assumed. Familiarity with databases and machine learning would help but is not required.
