Schedule & Readings – Data Cleaning and Integration

Any paper below marked with a [review] link (which points to Piazza) requires a short review, submitted on Piazza by 8am on the day of the lecture. The review should contain the following:

At least three important things that the paper says.
At least three interesting take-away points that you learned from the paper. They can be related to the paper’s fundamental contributions, or just things like a non-obvious pitfall, an uncanny insight, or a neat trick.
At least two things you didn’t like about the paper.
At least two directions in which one can improve the paper or extend the work.

There is no hard requirement on the length—a good review can be as brief as 400 words.

Week	Date	Topic	Readings
1	01/12	Introduction [slides]
2	01/17	Quantitative data cleaning primer [slides]	Hellerstein. "Quantitative Data Cleaning for Large Databases." Technical Report, UC Berkeley, 2008. [link]
	01/19	A case study of handling missing sensor data [slides]	Silberstein et al. "Making Sense of Suppressions and Failures in Sensor Data: A Bayesian Approach." VLDB 2007. [link]
3	01/24	ERACER: RDN on top of a Database System [slides]	Mayfield, Neville, and Prabhakar. "ERACER: A Database Approach for Statistical Inference and Data Cleaning." SIGMOD 2010. [link] [review] See also: Neville & Jensen. "Relational Dependency Networks." JMLR 2007. [link]
	01/26	Distortion as consequence of cleaning, presented by Junyang, Amir, and Yuhao [slides]	Dasu and Loh. "Statistical Distortion: Consequences of Data Cleaning." VLDB 2012. [link] [review]
4	01/31	Data profiling + UI, presented by Brett and Michael [slides]	Kandel et al. "Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment." AVI 2012. [link] [review]
	02/02	Qualitative data cleaning: data profiling [slides]	For reference: Ilyas & Chu. "Trends in Cleaning Relational Data: Consistency and Deduplication." FnTdb, 2015. [link]
5	02/07	Qualitative data cleaning: data repairing [slides]	Cong et al. "Improving Data Quality: Consistency and Accuracy." VLDB 2007. [link] For reference: Fan. "Data Quality: From Theory to Practice." SIGMOD Record 2015. [link]
	02/09	Project warm-up	Proposal submission due 02/10 [submit]
6	02/14	Qualitative data cleaning: handling heterogeneous rules, presented by Zhou and Sitong [slides]	Dallachiesa et al. "NADEEF: A Commodity Data Cleaning System." SIGMOD 2013. [link] [review]
	02/16	Cleaning by samples: aggregate queries [slides]	Wang et al. "A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data." SIGMOD 2014. [link] [review]
7	02/21	Cleaning by samples: learning models, presented by Safkat and Carolyn [slides]	Krishnan et al. "ActiveClean: Interactive Data Cleaning For Statistical Modeling." VLDB 2016. [link] [review]
	02/23	Cleaning by queries, presented by Usama and Yahui [slides]	Bergman et al. "Query-Oriented Data Cleaning with Oracles." SIGMOD 2015. [link] [review]
8	02/28	Combining quantitative and logical cleaning, presented by Michael and Safkat [slides]	Prokoshyna et al. "Combining Quantitative and Logical Data Cleaning." PVLDB 2015. [link] [review]
	03/02	Entity resolution: introduction [slides]	For reference: Doan, Halevy, Ives. Principles of Data Integration. Chapter 7. [link] Elsner & Schudy. "Bounding and Comparing Methods for Correlation Clustering Beyond ILP." ILP-NLP Workshop, 2009. [link]
9	03/07	Entity resolution: probabilistic approaches [slides]	For reference: Bhattacharya & Getoor. "A Latent Dirichlet Model for Unsupervised Entity Resolution." SDM 2006. [link] Singla & Domingos. "Entity Resolution with Markov Logic." ICDM 2006. [link]
	03/09	Project presentations
10	03/14	Spring recess
	03/16	Spring recess
11	03/21	Entity resolution: a collective approach, presented by Brett and Srikar [slides]	Bhattacharya and Getoor. "Collective Entity Resolution in Relational Data." TKDD 2007. [link] [review]
	03/23	Entity resolution: Dedupalog, presented by Stavros and Yahui [slides]	Arasu, Ré, and Suciu. "Large-Scale Deduplication with Constraints using Dedupalog." ICDE 2009. [link] [review]
12	03/28	Entity resolution: efficiency and scalability [slides]	McCallum, Nigam, and Ungar. "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching." KDD 2000. [link] Rastogi, Dalvi, and Garofalakis. "Large-Scale Collective Entity Matching." VLDB 2011. [link]
	03/30	Entity resolution: query-driven, presented by Amir and Stavros [slides]	Altwaijry, Kalashnikov, and Mehrotra. "Query-Driven Approach to Entity Resolution." PVLDB 2013. [link] Follow-up work: Altwaijry, Mehrotra, and Kalashnikov. "QuERy: A Framework for Integrating Entity Resolution with Query Processing." PVLDB 2015. [link] [review]
13	04/04	DeepDive, presented by Andrew and Usama [slides]	Shin et al. "Incremental Knowledge Base Construction Using DeepDive." VLDB 2015. [link] [review]
	04/06	Data Tamer, presented by Junyang and Zhou [slides]	Stonebraker et al. "Data Curation at Scale: The Data Tamer System." CIDR 2013. [link] [review]
14	04/11	Unknown unknowns, presented by Carolyn and Andrew [slides]	Chung et al. "Estimating the Impact of Unknown Unknowns on Aggregate Query Results." SIGMOD 2016. [link] [review]
	04/13	Cleaning and privacy, presented by guest lecturer Xi He [slides]	Krishnan et al. "PrivateClean: Data Cleaning and Differential Privacy." SIGMOD 2016. [link] [review]
15	04/18	Combining error detectors, presented by Sitong and Srikar [slides]	Abedjan et al. "Detecting Data Errors: Where are we and what needs to be done?" PVLDB 2016. [link] [review]
	04/20	Graduate reading period
16	04/25	Graduate reading period
	04/27	Graduate reading period
17	05/04 (9am-12pm)	Project presentations