For any paper below marked with an extra [review] link (which points to Piazza), you must submit a short “review” on Piazza by 8am on the morning of the lecture. This review should contain the following:

  • At least three important things that the paper says.
  • At least three interesting take-away points that you learned from the paper. They can be related to the paper’s fundamental contributions, or just things like a non-obvious pitfall, an uncanny insight, or a neat trick.
  • At least two things you didn’t like about the paper.
  • At least two directions in which one can improve the paper or extend the work.

There is no hard requirement on the length—a good review can be as brief as 400 words.

| Week | Date | Topic | Readings |
| 1 | 01/12 | Introduction [slides] | |
| 2 | 01/17 | Quantitative data cleaning primer [slides] | Hellerstein. "Quantitative Data Cleaning for Large Databases." Technical Report, UC Berkeley, 2008. [link] |
| | 01/19 | A case study of handling missing sensor data [slides] | Silberstein et al. "Making Sense of Suppressions and Failures in Sensor Data: A Bayesian Approach." VLDB 2007. [link] |
| 3 | 01/24 | ERACER: RDN on top of a Database System [slides] | Mayfield, Neville, and Prabhakar. "ERACER: A Database Approach for Statistical Inference and Data Cleaning." SIGMOD 2010. [link] [review] |
| | | | See also: Neville & Jensen. "Relational Dependency Networks." JMLR 2007. [link] |
| | 01/26 | Distortion as consequence of cleaning, presented by Junyang, Amir, and Yuhao [slides] | Dasu and Loh. "Statistical Distortion: Consequences of Data Cleaning." VLDB 2012. [link] [review] |
| 4 | 01/31 | Data profiling + UI, presented by Brett and Michael [slides] | Kandel et al. "Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment." AVI 2012. [link] [review] |
| | 02/02 | Qualitative data cleaning: data profiling [slides] | For reference: Ilyas & Chu. "Trends in Cleaning Relational Data: Consistency and Deduplication." FnTdb, 2015. [link] |
| 5 | 02/07 | Qualitative data cleaning: data repairing [slides] | Cong et al. "Improving Data Quality: Consistency and Accuracy." VLDB 2007. [link] |
| | | | For reference: Fan. "Data Quality: From Theory to Practice." SIGMOD Record 2015 [link] |
| | 02/09 | Project warm-up | Proposal submission due 02/10 [submit] |
| 6 | 02/14 | Qualitative data cleaning: handling heterogeneous rules, presented by Zhou and Sitong [slides] | Dallachiesa et al. "NADEEF: A Commodity Data Cleaning System." SIGMOD 2013. [link] [review] |
| | 02/16 | Cleaning by samples: aggregate queries [slides] | Wang et al. "A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data." SIGMOD 2014. [link] [review] |
| 7 | 02/21 | Cleaning by samples: learning models, presented by Safkat and Carolyn [slides] | Krishnan et al. "ActiveClean: Interactive Data Cleaning For Statistical Modeling." VLDB 2016. [link] [review] |
| | 02/23 | Cleaning by queries, presented by Usama and Yahui [slides] | Bergman et al. "Query-Oriented Data Cleaning with Oracles." SIGMOD 2015. [link] [review] |
| 8 | 02/28 | Combining quantitative and logical cleaning, presented by Michael and Safkat [slides] | Prokoshyna et al. "Combining Quantitative and Logical Data Cleaning." PVLDB 2015. [link] [review] |
| | 03/02 | Entity resolution: introduction [slides] | For reference: |
| | | | Doan, Halevy, Ives. Principles of Data Integration. Chapter 7. [link] |
| | | | Elsner & Schudy. "Bounding and Comparing Methods for Correlation Clustering Beyond ILP." ILP-NLP Workshop, 2009. [link] |
| 9 | 03/07 | Entity resolution: probabilistic approaches [slides] | For reference: |
| | | | Bhattacharya & Getoor. "A Latent Dirichlet Model for Unsupervised Entity Resolution." SDM 2006. [link] |
| | | | Singla & Domingos. "Entity Resolution with Markov Logic." ICDM 2006. [link] |
| | 03/09 | Project presentations | |
| 10 | 03/14 | Spring recess | |
| | 03/16 | Spring recess | |
| 11 | 03/21 | Entity resolution: a collective approach, presented by Brett and Srikar [slides] | Bhattacharya and Getoor. "Collective Entity Resolution in Relational Data." TKDD 2007. [link] [review] |
| | 03/23 | Entity resolution: Dedupalog, presented by Stavros and Yahui [slides] | Arasu, Re, and Suciu. "Large-Scale Deduplication with Constraints using Dedupalog." ICDE 2009. [link] [review] |
| 12 | 03/28 | Entity resolution: efficiency and scalability [slides] | McCallum, Nigam, and Ungar. "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching." KDD 2000. [link] |
| | | | Rastogi, Dalvi, and Garofalakis. "Large-Scale Collective Entity Matching." VLDB 2011. [link] |
| | 03/30 | Entity resolution: query-driven, presented by Amir and Stavros [slides] | Altwaijry, Kalashnikov, and Mehrotra. "Query-Driven Approach to Entity Resolution." PVLDB 2013. [link] |
| | | | Follow-up work: Altwaijry, Mehrotra, and Kalashnikov. "QuERy: A Framework for Integrating Entity Resolution with Query Processing." PVLDB 2015. [link] [review] |
| 13 | 04/04 | DeepDive, presented by Andrew and Usama [slides] | Shin et al. "Incremental Knowledge Base Construction Using DeepDive." VLDB 2015. [link] [review] |
| | 04/06 | Data Tamer, presented by Junyang and Zhou [slides] | Stonebraker et al. "Data Curation at Scale: The Data Tamer System." CIDR 2013. [link] [review] |
| 14 | 04/11 | Unknown unknowns, presented by Carolyn and Andrew [slides] | Chung et al. "Estimating the Impact of Unknown Unknowns on Aggregate Query Results." SIGMOD 2016. [link] [review] |
| | 04/13 | Cleaning and privacy, presented by guest lecturer Xi He [slides] | Krishnan et al. "PrivateClean: Data Cleaning and Differential Privacy." SIGMOD 2016. [link] [review] |
| 15 | 04/18 | Combining error detectors, presented by Sitong and Srikar [slides] | Abedjan et al. "Detecting Data Errors: Where are we and what needs to be done?" PVLDB 2016. [link] [review] |
| | 04/20 | Graduate reading period | |
| 16 | 04/25 | Graduate reading period | |
| | 04/27 | Graduate reading period | |
| 17 | 05/04 (9am-12pm) | Project presentations | |