Downloadable Software

The TCRN has developed several software packages that implement its methodological advances.  This page links to those packages.  We post only packages that are sufficiently developed to be applied by others.  We also post links to relevant computer code for selected papers on the publications page.  If you are interested in software routines for other TCRN methodologies, contact Jerry Reiter.

1.   Synthesis of county-to-county migration flows  download software
To maintain confidentiality, national statistical agencies traditionally do not include small counts in publicly released tabular data products.  They typically delete these small counts, or combine them with counts in adjacent table cells to preserve the totals at higher levels of aggregation.  In some cases these suppression procedures result in too much loss of information.  To increase data utility and make more data publicly available, we created methods and software to generate synthetic values for the small counts from a Bayesian hierarchical model.  The software generates synthetic data and computes several measures of disclosure risk.  The software was applied by the Census Bureau in synthesizing small county-to-county migration counts.  The zip file includes a document summarizing the model.
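
The idea of replacing small counts with model-based draws can be sketched in a few lines of Python. This is only an illustration, not the TCRN software or its model: it swaps in a simple conjugate gamma-Poisson model for the Bayesian hierarchical model, and the function names, threshold, and prior parameters are all hypothetical.

```python
import math
import random


def draw_poisson(rate, rng):
    """Draw one Poisson variate by Knuth's multiplication method (fine for small rates)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def synthesize_small_counts(counts, threshold, shape, rate, rng):
    """Replace every count below `threshold` with a posterior predictive draw.

    Assumes y ~ Poisson(lam) with a Gamma(shape, rate) prior on lam, so the
    posterior for lam given y is Gamma(shape + y, rate + 1). This is a
    stand-in for illustration, not the hierarchical model in the zip file.
    """
    synthetic = []
    for y in counts:
        if y >= threshold:
            synthetic.append(y)  # larger counts are released unchanged
            continue
        # random.gammavariate takes (shape, scale), so scale = 1 / posterior rate
        lam = rng.gammavariate(shape + y, 1.0 / (rate + 1.0))
        synthetic.append(draw_poisson(lam, rng))
    return synthetic


# Hypothetical migration counts for six county pairs.
flows = [0, 2, 57, 1, 130, 4]
released = synthesize_small_counts(
    flows, threshold=10, shape=1.0, rate=0.5, rng=random.Random(1)
)
```

Only the cells below the threshold are altered, so the larger counts keep their original values while the small cells carry synthetic ones instead of being suppressed.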

2.  Multiple imputation of missing data by Bayesian latent class models  download software
Many datasets comprise exclusively categorical variables that suffer from missing data.  When the number of variables is large, it can be challenging to specify models for use in multiple imputation (MI) of missing data.  One approach is to use Bayesian latent class models for MI.  In a series of papers, we showed that these models can capture complex dependencies and hence serve as effective MI engines.  This R software package implements MI via latent class models when the categorical data include structural zeros (i.e., some combinations have zero probability).  The package also includes an option for MI in categorical data without structural zeros.  The package is available on CRAN.
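
As a rough illustration of how a latent class model handles structural zeros, the Python sketch below draws records from a latent class model and rejects impossible combinations, which is equivalent to sampling from the model truncated to feasible cells. It is not the CRAN package's interface; the function names, the two example variables, and all probabilities are hypothetical.

```python
import random


def draw_index(probabilities, rng):
    """Draw an index with the given probabilities (assumed to sum to 1)."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def sample_record(class_probs, level_probs, is_structural_zero, rng):
    """Draw one record from a latent class model truncated to feasible cells.

    class_probs[k]        : probability of latent class k
    level_probs[k][j][c]  : probability that variable j takes level c in class k
    is_structural_zero(r) : True when record r is an impossible combination
    """
    while True:
        k = draw_index(class_probs, rng)
        # Variables are conditionally independent given the latent class.
        record = tuple(draw_index(probs, rng) for probs in level_probs[k])
        if not is_structural_zero(record):
            return record  # rejection step: impossible combinations are discarded


def child_and_married(record):
    """Hypothetical structural zero: age group 0 (child) with marital status 1 (married)."""
    age_group, marital_status = record
    return age_group == 0 and marital_status == 1


class_probs = [0.6, 0.4]
level_probs = [
    [[0.7, 0.3], [0.8, 0.2]],  # class 0: P(age group), P(marital status)
    [[0.1, 0.9], [0.3, 0.7]],  # class 1
]
rng = random.Random(2018)
draws = [
    sample_record(class_probs, level_probs, child_and_married, rng)
    for _ in range(200)
]
```

In an actual multiple imputation run the class and level probabilities would be posterior draws and only the missing entries of each record would be sampled; the sketch fixes the parameters to keep the truncation step visible.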

3.  Bayesian edit-imputation for continuous data  download software
Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation.  In a paper published in the Journal of the American Statistical Association, we developed an approach that fully integrates editing and imputation for continuous microdata under linear constraints. The approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error.  This R package implements a version of the model that uses mixtures of multivariate normal distributions for the underlying true values and uniform distributions for measurement errors.  The package is available on CRAN.
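
To make the two kinds of linear constraints concrete, here is a minimal Python sketch that checks a single record against balance edits (components sum to a total) and ratio edits (a ratio bounded by constants). It illustrates only the constraint check, not the Bayesian edit-imputation model; the function name, variable names, and bounds are hypothetical, and the ratio check assumes positive denominators.

```python
def violated_edits(record, balance_edits, ratio_edits, tolerance=1e-9):
    """Return descriptions of the linear edit constraints a record fails.

    balance_edits : list of (total_name, [component_names]);
                    the components must sum to the total
    ratio_edits   : list of (numerator, denominator, lower, upper); written in
                    linear form as lower * den <= num <= upper * den,
                    which assumes den > 0
    """
    failures = []
    for total, components in balance_edits:
        if abs(sum(record[c] for c in components) - record[total]) > tolerance:
            failures.append(f"{' + '.join(components)} != {total}")
    for num, den, lower, upper in ratio_edits:
        if not (lower * record[den] <= record[num] <= upper * record[den]):
            failures.append(f"{num}/{den} outside [{lower}, {upper}]")
    return failures


# Hypothetical establishment record whose components do not add up.
reported = {"wages": 400.0, "benefits": 100.0, "payroll": 450.0, "employees": 10.0}
balance_edits = [("payroll", ["wages", "benefits"])]
ratio_edits = [("payroll", "employees", 10.0, 100.0)]

failures = violated_edits(reported, balance_edits, ratio_edits)
```

A record that fails any edit is the kind of case the model treats as containing at least one variable in error; the model then infers which variables to replace rather than relying on a fixed rule.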

4.  Nonignorable missing data imputation for multivariate continuous data  download software
In a paper currently under review, we present an approach to inform decisions about nonresponse followup sampling.  The basic idea is (i) to create completed samples by imputing nonrespondents’ data under various assumptions about the nonresponse mechanisms, (ii) to take hypothetical samples of varying sizes from the completed samples, and (iii) to compute and compare measures of accuracy and cost for different proposed sample sizes. As part of the methodology, we present a new approach for generating imputations for multivariate continuous data with nonignorable unit nonresponse. We fit mixtures of multivariate normal distributions to the respondents’ data, and adjust the probabilities of the mixture components to generate nonrespondents’ distributions with desired features.  This R software implements the techniques in that paper.
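
The reweighting step can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the paper's procedure: it uses a univariate normal mixture instead of a multivariate one, treats the fitted respondent mixture as given, and uses hypothetical function names, weights, means, and multipliers.

```python
import random


def adjust_weights(weights, multipliers):
    """Scale each component probability by a multiplier, then renormalize."""
    scaled = [w * m for w, m in zip(weights, multipliers)]
    total = sum(scaled)
    return [s / total for s in scaled]


def mixture_mean(weights, means):
    """Mean of a mixture: the probability-weighted average of component means."""
    return sum(w * mu for w, mu in zip(weights, means))


def draw_value(weights, means, sds, rng):
    """Draw one value: pick a component, then sample from its normal kernel."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], sds[k])


# Hypothetical mixture fitted to respondents (e.g., income in thousands).
respondent_weights = [0.5, 0.3, 0.2]
means = [20.0, 50.0, 90.0]
sds = [5.0, 8.0, 12.0]

# Assumed nonresponse mechanism: nonrespondents are twice as likely as
# respondents to fall in the highest component.
nonrespondent_weights = adjust_weights(respondent_weights, [1.0, 1.0, 2.0])

rng = random.Random(7)
imputations = [
    draw_value(nonrespondent_weights, means, sds, rng) for _ in range(100)
]
```

Repeating the imputation under several multiplier choices gives completed samples under different nonresponse assumptions, which is the input to steps (ii) and (iii) described above.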

5.  Nonparametric Bayesian missing data imputation for multivariate mixed continuous and categorical data  download software
Many datasets include a mix of continuous and categorical variables with missing values.  In a paper published in the Journal of the American Statistical Association, we developed a joint model for such mixed data that can be used for multiple imputation.  The approach uses a nonparametric Bayesian mixture model as the imputation engine.  The mixture model comprises one set of mixture components with multivariate normal kernels for the continuous variables, and a separate set of mixture components with products of independent multinomial kernels for the categorical variables.  The model induces dependence between the continuous and categorical variables in two ways, namely (i) by allowing the means of the multivariate normal distributions to depend on the categorical variables, and (ii) by using a tensor factorization prior that links the two sets of membership components.  The package is available on CRAN.

6.  Synthetic categorical data for households  download software
This package fits a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables.  The software generates synthetic household level data for the variables in the decennial census using a subset of structural zeros defined by edit constraints from the American Community Survey.  In this version, alterations of the edit constraints must be done directly in the source code.  The package is available on CRAN.
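
The nested structure in (i) and (ii) can be sketched as a generative process in Python. This is a simplified illustration rather than the package's implementation: it fixes the parameters instead of estimating them, omits group-level variables and structural zeros, and uses hypothetical function names and probabilities.

```python
import random


def draw_index(probabilities, rng):
    """Draw an index with the given probabilities (assumed to sum to 1)."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def synthesize_household(size, group_probs, unit_probs, level_probs, rng):
    """Generate one synthetic household from a nested latent class model.

    group_probs[g]          : probability of group-level class g
    unit_probs[g][m]        : probability of unit-level class m within group class g
    level_probs[g][m][j][c] : probability that unit variable j takes level c
    """
    g = draw_index(group_probs, rng)  # one class for the whole household
    members = []
    for _ in range(size):
        m = draw_index(unit_probs[g], rng)  # unit class nested within g
        members.append(tuple(draw_index(p, rng) for p in level_probs[g][m]))
    return g, members


group_probs = [0.7, 0.3]
unit_probs = [[0.5, 0.5], [0.9, 0.1]]
level_probs = [
    [  # group class 0
        [[0.8, 0.2], [0.6, 0.4]],  # unit class 0: two binary variables
        [[0.3, 0.7], [0.5, 0.5]],  # unit class 1
    ],
    [  # group class 1
        [[0.1, 0.9], [0.2, 0.8]],
        [[0.6, 0.4], [0.9, 0.1]],
    ],
]

household_class, people = synthesize_household(
    3, group_probs, unit_probs, level_probs, random.Random(11)
)
```

Because every member shares the household's group-level class, members of the same household are dependent even though each person's variables are drawn independently given the two class labels.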

7.  Synthetic data programs for short course at JPSM
These files are used by Jerry Reiter and Joerg Drechsler during their short course for the JPSM.  Link to the CPS data.  Link to Exercises.pdf.  Link to cps_w_model_checks.  Link to synthpop_w_model checks.  Link to answers from exercises.  Link to calculate_risks.  Link to synthpop explanations.  Link to synthpop manual.  Link to synthpop vignettes.  Slides for the course.

TCRN no longer active

The NSF award that supported the TCRN ended on September 30, 2018.  This site is maintained for archival purposes.