Most federal agencies view disseminating data to the public for secondary analyses as a core mission, yet concerns over data confidentiality make it increasingly difficult to do so. As threats to data confidentiality grow, federal agencies planning to produce public use data may be forced to release heavily redacted files. Many confidentiality protection strategies, when applied at high intensity, severely reduce data quality. Even worse, analysts of secondary data have no way to determine how much their analyses have been compromised by the disclosure protection. The TCRN will advance methodologies and tools for disseminating heavily redacted datasets by
- developing theory and methodology for releasing multiply-imputed, synthetic datasets based on flexible, nonparametric Bayesian models built specifically for high-dimensional data with longitudinal and multi-level aspects;
- developing approaches for including survey weights in heavily redacted data that can improve statistical estimation without leading to confidentiality disclosures;
- developing the framework for computer systems that provide secondary analysts with feedback on the quality of inferences from heavily redacted data; and
- developing theory and methodology for creating synthetic contingency tables based on fusions of linear programming and Bayesian modeling.
The TCRN will use these methodologies to create files suitable for public use for the Annual Survey of Manufactures (ASM), for which no public use microdata currently exist, and to determine best practices for releasing high-quality, safe tabular magnitude (national economic) data.
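
As an illustration of how a secondary analyst might draw inferences from multiply-imputed, synthetic datasets, the sketch below applies the combining rule commonly used for partially synthetic data: the point estimate is the average of the per-dataset estimates, and the total variance is the average within-dataset variance plus the between-dataset variance divided by the number of datasets. The function name and the input numbers are hypothetical, and this is not necessarily the procedure the TCRN will adopt; fully synthetic data call for a different variance formula.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets (illustrative only).

    estimates: the point estimate computed on each synthetic dataset
    variances: the estimated variance of that estimate on each dataset
    Returns (combined point estimate, combined variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two datasets and one variance per estimate")

    q_bar = mean(estimates)      # average of the per-dataset point estimates
    b = variance(estimates)      # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)      # average within-dataset variance
    total_var = u_bar + b / m    # combining rule for partially synthetic data
    return q_bar, total_var


# Hypothetical inputs: three synthetic datasets, each yielding an estimate
# and its estimated variance.
q, t = combine_partially_synthetic([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(q, t)
```

A large between-dataset variance relative to the within-dataset variance signals that the synthesis itself contributes substantial uncertainty, which is one simple form of feedback on inference quality.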