Reproducibility check

Reproducibility Stat Check for Dunedin and E-Risk Papers

Renate Houts, July 2019

 

It is policy for Dunedin and E-Risk papers to undergo a reproducibility check prior to submission. When checking a paper, remember:

  • The reproducibility check is a collaborative process between peers
  • Keep feedback constructive and as positive as possible; it is entirely possible that a “failure to reproduce” is due to the checker’s coding error!
  • There is usually more than one appropriate way to carry out the analysis for a particular question. As the checker, it can be helpful to determine whether another, equally appropriate, method results in similar substantive conclusions. Nonetheless, if the analysis approach used by the authors is not blatantly incorrect (and “blatantly incorrect” is rare!), the reproducibility check is NOT the time to insist on a new analysis strategy.

Checking a paper for reproducibility involves the following broad steps:

  1. Getting a fresh data set from the data manager
  2. Reading the paper with “fresh eyes”
  3. Reproducing the analyses without using the original syntax for “guidance” (i.e., write new code; a brief sketch follows this list)
  4. Checking that the numbers reported in the paper match those obtained in #3
  5. Communicating results of the reproducibility check to the authors
  6. Reconciling any inconsistencies between the original and reproduction
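For step 3, the idea is to write fresh code directly from the paper’s Methods section in whatever package the checker prefers. A minimal sketch in Python, assuming a hypothetical fresh extract “dunedin_fresh.csv” and made-up variable names (the real file and variables come from the data manager and the paper itself):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dunedin_fresh.csv")  # hypothetical fresh extract

# Re-derive the analysis sample from the inclusion rules stated in the paper,
# rather than reusing the authors' original syntax.
analysis = df[(df["seen_at_phase45"] == 1) & df["outcome_age45"].notna()]
print(f"Reproduced analysis N = {len(analysis)}")  # compare with the N in the paper

# Re-run the focal model as described in the Methods (here, a simple OLS).
model = smf.ols("outcome_age45 ~ exposure_age26 + sex", data=analysis).fit()
print(model.summary())
```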

 

When checking a paper, review the following areas:

  • Determining the analysis sample:
    • Are “who’s in” and “who’s out” adequately explained?
    • Is the analysis sample N reproducible?
    • Is there an attrition analysis to explain if/how the analysis sample differs from the full sample?
  • Are new composite variables created following “Standard Operating Procedures”? (a sketch of this SOP follows the checklist)
    • Only making composites for SMs (Study Members) with >= 50% available data
    • Pro-rating so that SMs with more data do not receive higher scores just because they have more data
  • Do the descriptive Ns, %s, means, and SDs match those reported in the paper?
  • Are the analyses described in enough detail to allow reproducibility?
  • Are the reported analyses appropriate for answering the questions posed with the data used? Things to consider when making this judgement:
    • Transformations needed for non-normal data?
    • Continuous vs categorical data (e.g., OLS vs logistic regression)
    • Any nesting accounted for? (e.g., time within person; person within family; see the mixed-model sketch after this checklist)
    • Consistent Ns when comparing across models (e.g., differing covariates; mediation)
    • Consistent scaling for cross-time comparisons
  • Are results reported using standard effect sizes appropriate for the data type (e.g., r, OR, RR, Cohen’s d)?
  • Are confidence intervals reported?
  • Do results from the reproducibility check match those reported in the paper?
    • Check ALL numbers in the text, tables, and figures
    • Different statistical packages occasionally provide slightly different point estimates, confidence intervals, and p-values for the “same” analysis. In the reproducibility check, the goal is to ensure substantive consistency between the findings in the original and the reproduction (i.e., don’t get hung up on point-estimate differences that are “close but not exactly the same”). A simple comparison sketch follows the checklist.
  • Does the interpretation of findings make sense?
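The composite-variable SOP above lends itself to a small helper. A minimal sketch, assuming six hypothetical items item1–item6; the real item lists come from the codebook:

```python
import numpy as np
import pandas as pd

def prorated_composite(df, items):
    """Mean of available items scaled to the full item count (a pro-rated sum).

    Study members missing more than half of the items get a missing composite,
    so nobody scores higher simply because they answered more items.
    """
    answered = df[items].notna().sum(axis=1)
    prorated = df[items].mean(axis=1) * len(items)   # pro-rated sum
    return prorated.where(answered >= len(items) / 2, np.nan)

# Example with two hypothetical study members: one answered 4 of 6 items,
# the other only 2 of 6.
demo = pd.DataFrame({f"item{i}": [1, np.nan] for i in range(1, 7)})
demo.loc[0, ["item5", "item6"]] = np.nan           # SM 0: 4 of 6 items answered
demo.loc[1, ["item1", "item2"]] = [1, 1]           # SM 1: 2 of 6 items answered
demo["composite"] = prorated_composite(demo, [f"item{i}" for i in range(1, 7)])
print(demo["composite"])                            # 6.0 for SM 0, NaN for SM 1
```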

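For the nesting question, a random-intercept model is one common way to account for clustering. A minimal sketch using statsmodels, assuming hypothetical columns outcome, exposure, sex, and family_id (e.g., twins nested within families in E-Risk); the actual model belongs to the paper being checked:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("erisk_fresh.csv")  # hypothetical fresh extract

# Random intercept for family so twins are not treated as independent observations.
mixed = smf.mixedlm("outcome ~ exposure + sex", data=df, groups=df["family_id"]).fit()
print(mixed.summary())

# A naive OLS that ignores nesting typically understates standard errors when
# outcomes cluster within families; comparing the two models is informative.
naive = smf.ols("outcome ~ exposure + sex", data=df).fit()
print(naive.bse)
```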
 
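When comparing reproduced numbers with those in the paper, it can help to record both sets of values and flag only differences large enough to matter substantively. A minimal sketch, with made-up labels and values; the tolerance is purely a judgment call:

```python
# Hypothetical reported vs reproduced values, keyed by made-up labels.
reported   = {"Table2_b_exposure": 0.42,  "Table2_or_exposure": 1.52, "analysis_N": 937}
reproduced = {"Table2_b_exposure": 0.419, "Table2_or_exposure": 1.54, "analysis_N": 902}

TOLERANCE = 0.02  # relative difference we are willing to attribute to software defaults

for key, paper_value in reported.items():
    check_value = reproduced[key]
    rel_diff = abs(check_value - paper_value) / abs(paper_value)
    flag = "OK" if rel_diff <= TOLERANCE else "FLAG for discussion with authors"
    print(f"{key}: paper={paper_value}, check={check_value} -> {flag}")

# Here the differing analysis N gets flagged; the rounding-level differences
# in the point estimates do not.
```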

After the reproducibility check, it’s time to communicate results to the authors and reconcile any inconsistencies. Things to keep in mind during this stage of the check:

  • The reproducibility check is a collaborative process between peers
  • Keep feedback constructive and as positive as possible; it is entirely possible that a “failure to reproduce” is due to the checker’s coding error!
  • There is usually more than one appropriate way to carry out the analysis for a particular question. As the checker, it can be helpful to determine whether another, equally appropriate, method results in similar substantive conclusions. Nonetheless, if the analysis approach used by the authors is not blatantly incorrect (and “blatantly incorrect” is rare!), the reproducibility check is NOT the time to insist on a new analysis strategy.
  • Common sources of inconsistent results (a diagnostic sketch follows this list):
    • Authors using a preliminary (not the final) data set
    • Using different versions of a variable
    • Differing Ns
    • Differing treatment of missing values
    • Differing approaches to handling nesting
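Many of these sources can be diagnosed by comparing the data set the authors actually analyzed against the final extract from the data manager. A minimal sketch, assuming hypothetical file names, a shared ID column snum, and a variable of interest named exposure:

```python
import pandas as pd

authors = pd.read_csv("authors_dataset.csv")  # data set the authors analyzed
final = pd.read_csv("final_dataset.csv")      # final extract from the data manager

# Differing Ns: who appears in one file but not the other?
only_authors = set(authors["snum"]) - set(final["snum"])
only_final = set(final["snum"]) - set(authors["snum"])
print(f"In authors' file only: {len(only_authors)}; in final file only: {len(only_final)}")

# Different variable versions / missing-value treatment: compare the variable
# of interest for study members present in both files.
merged = authors.merge(final, on="snum", suffixes=("_authors", "_final"))
print(merged[["exposure_authors", "exposure_final"]].describe())
print("Missing (authors vs final):",
      merged["exposure_authors"].isna().sum(), merged["exposure_final"].isna().sum())
```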