Technical Reports

Data Quality Assessments

Empirical research has demonstrated that the data quality of the CLHLS is quite good with respect to age reporting, mortality, self-reported morbidity, and the reliability and validity of major measures.

1. Accuracy of age reporting
Based on comparisons of various indices of elderly age reporting and age distributions of centenarians in Sweden, Japan, England and Wales, Australia, Canada, China, the U.S.A., and Chile, our empirical study demonstrates that age reporting among the oldest-old interviewees (Han and six minority groups combined) in the 22 provinces in China where the CLHLS has been conducted is not as good as that in Sweden, Japan, and England and Wales, but is relatively close to that in Australia, more or less the same as that in Canada, slightly better than that in the U.S.A., and much better than that in Chile. As indicated by the higher density of centenarians, age exaggeration exists in the six ethnic minority groups in the 22 Han-dominated provinces, although we can neither rule out nor quantify the potential impacts of past mortality selection and better natural environmental conditions among these minority groups. We find that age exaggeration among minorities in the CLHLS is unlikely to cause substantial biases in demographic and statistical analyses using the CLHLS data, since minorities make up a rather small portion of the sample, 6.8 percent at baseline (Zeng and Gu 2008).

1. Zeng, Y., and D. Gu. (2008). Reliability of age reporting among the Chinese oldest-old in the CLHLS datasets. In Y. Zeng, D.L. Poston, D.A. Vlosky, and D. Gu (eds.), Healthy Longevity in China: Demographic, Socioeconomic, and Psychological Dimensions, pp. 61-79. Dordrecht, The Netherlands: Springer.

2. Mortality and morbidity assessments
Refer to Gu, D., and Dupre, M.E. (2008). Assessment of reliability of mortality and morbidity in the 1998-2002 CLHLS waves. In Y. Zeng, D.L. Poston, D.A. Vlosky, and D. Gu (eds.), Healthy Longevity in China: Demographic, Socioeconomic, and Psychological Dimensions, pp. 101-117. Dordrecht, The Netherlands: Springer.

3. Other dimensions of data assessments
Refer to Gu, D. (2008). General Data Assessment of the CLHLS. In Y. Zeng, D.L. Poston, D.A. Vlosky, and D. Gu (eds.), Healthy Longevity in China: Demographic, Socioeconomic, and Psychological Dimensions, pp. 39-59. Dordrecht, The Netherlands: Springer.

4. Data quality assessment for the 2005 wave


Please refer to “Weight method” for details on how the weight variable in the CLHLS was created. Because the CLHLS sample was specially designed around clusters of age, sex, and urban/rural residence (see Survey Design), the weight variable should be applied when users want to calculate the mean or distribution of a variable of interest so that it represents the whole elderly population in the sampled provinces. The weight variable may be omitted only when the researcher aims to describe the sample itself and not to compare a characteristic across groups. In other words, we strongly recommend that users apply the weight variable in cross-group comparisons.
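As a minimal sketch of why the weight matters for descriptive statistics, the following Python snippet compares an unweighted and a weighted share. The records, the indicator, and the weight values are hypothetical and are not taken from the CLHLS; the only assumption is that each respondent carries one non-negative sampling weight.

```python
def weighted_mean(values, weights):
    """Weighted mean; weights must be non-negative and sum to a positive number."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical records: (indicator of some characteristic, sampling weight).
# Oversampled groups carry small weights, undersampled groups large ones.
records = [(1, 0.5), (1, 0.5), (0, 0.5), (0, 4.0), (0, 4.0), (1, 4.0)]
indicator = [r[0] for r in records]
weights = [r[1] for r in records]

unweighted_share = sum(indicator) / len(indicator)   # describes the sample only
weighted_share = weighted_mean(indicator, weights)   # estimates the population share

print(f"unweighted: {unweighted_share:.3f}  weighted: {weighted_share:.3f}")
```

In this made-up example the two shares differ noticeably because the characteristic is more common among the oversampled records.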

In multivariate regressions, the weights need not be applied as long as age, sex, and urban/rural residence are controlled for. There are, however, pros and cons to using weights in regression when the weight is a function of the independent (explanatory) variables. Empirical research has shown that when the sampling weight is a function of the dependent variable, unweighted results show some bias and inconsistency. In such a case, we recommend applying the sampling weight in the analysis and using the White heteroskedasticity-consistent estimator for the standard errors (see Hendrikx 2002; Winship and Radbill 1994).
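The recommendation above can be illustrated with a small, self-contained Python sketch of weighted least squares with White (HC0) standard errors for a single predictor plus an intercept. This is an illustration under simplifying assumptions, not the CLHLS team's procedure: the function name, the variable names, and the data are hypothetical, and the sketch ignores any further design features such as clustering.

```python
import math


def weighted_ols_hc0(x, y, w):
    """Fit y = a + b*x by weighted least squares.

    Returns ((a, b), (se_a, se_b)) where the standard errors come from the
    White (HC0) sandwich estimator: inv(X'WX) * X'W diag(e^2) W X * inv(X'WX).
    """
    # "Bread": X'WX for a design matrix with columns [1, x], inverted by hand.
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = s_w * s_wxx - s_wx * s_wx
    if det == 0:
        raise ValueError("x has no variation; the slope is not identified")
    inv = [[s_wxx / det, -s_wx / det], [-s_wx / det, s_w / det]]

    # Coefficients: inv(X'WX) * X'Wy.
    s_wy = sum(wi * yi for wi, yi in zip(w, y))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    a = inv[0][0] * s_wy + inv[0][1] * s_wxy
    b = inv[1][0] * s_wy + inv[1][1] * s_wxy

    # "Meat": sum over observations of (w_i * e_i)^2 * [1, x_i][1, x_i]'.
    m00 = m01 = m11 = 0.0
    for xi, yi, wi in zip(x, y, w):
        g = (wi * (yi - a - b * xi)) ** 2
        m00 += g
        m01 += g * xi
        m11 += g * xi * xi
    meat = [[m00, m01], [m01, m11]]

    def matmul2(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    cov = matmul2(matmul2(inv, meat), inv)
    # max(..., 0.0) guards against tiny negative values from rounding.
    se = (math.sqrt(max(cov[0][0], 0.0)), math.sqrt(max(cov[1][1], 0.0)))
    return (a, b), se


# Hypothetical data: age, an outcome score, and a sampling weight.
age = [65, 72, 80, 88, 95, 101]
score = [0, 1, 1, 2, 4, 5]
weight = [4.0, 4.0, 1.0, 1.0, 0.5, 0.5]

weighted_fit = weighted_ols_hc0(age, score, weight)
unweighted_fit = weighted_ols_hc0(age, score, [1.0] * len(age))
print("weighted  :", weighted_fit)
print("unweighted:", unweighted_fit)
```

Calling the same function with unit weights gives the unweighted fit, so the two sets of coefficients can be placed side by side. In practice a statistical package with built-in robust standard errors would normally be used instead of hand-written matrix algebra.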

We recommend comparing the weighted and unweighted results as a pre-check of model specification. In our experience, the weighted and unweighted results are close to each other in most cases when the weight is not a function of the dependent variable. Please refer to the following articles and other Internet sources for details.

Useful references:

1. Hendrikx, J. (2002). The impact of weights on standard errors. Presented at the annual meeting of the Association for Survey Computing, April 17, Imperial College, London, UK. Available at (accessed on August 8, 2006).
2. Magee, L., Robb, A.L., and Burbidge, J.B. (1998). On the use of sampling weights when estimating regression models with survey data. Journal of Econometrics 84(2), 251-271.
3. Winship, C., and Radbill, L. (1994). Sampling weights and regression analysis. Sociological Methods & Research 23, 230-257.

Many websites also provide helpful guides on the use of weights in statistical analyses. For example, the UCLA Stat Computing Portal offers material on statistical modeling, including weighting issues in regression. The Department of Sociology at Ohio State University provides an answer to the question “I’m fitting a regression model to a survey that includes sampling weights. Should I use the weights in my analysis?”

Missing Values

For information on the distribution of missing values in the first three waves, their patterns, and how to deal with them in analysis, please refer to Section 3 and Section 4 of General Data Assessment of the CLHLS (Gu, 2008) and the references cited there.

1. Gu, D. (2008). General Data Assessment of the CLHLS. In Y. Zeng, D.L. Poston, D.A. Vlosky, and D. Gu (eds.), Healthy Longevity in China: Demographic, Socioeconomic, and Psychological Dimensions, pp. 39-59. Dordrecht, The Netherlands: Springer.

Pooling Multiwave Datasets

There are two options for pooling multiple waves of the CLHLS datasets.

(1) The first approach is to pool the panel data for those elders who were first interviewed in 1998. Based on this panel dataset (1998-2005), we can look at these elders’ changes from 1998 to 2005. You may also pool the panel data for those elders who were first interviewed in 2000 to see the changes from 2000 to 2005, or for those who were first interviewed in 2002 to see the changes from 2002 to 2005.

(2) The second approach is to add the respondents who were recruited after 1998 (i.e., newly interviewed in 2000 and 2002) to the 1998-2005 panel dataset. Using this type of pooled dataset, we can investigate changes over time for all CLHLS respondents from 1998 to 2005, although some of them have only two or three observations (i.e., interviewees newly recruited in 2002 or 2000). This expanded panel dataset may produce more robust results because it includes more observations. In this case, however, you need to be very cautious about the dynamic changes for the younger elders aged 65-79, since their events are derived from a 3-year period (compared to a 7-year period for the oldest-old aged 80+), especially in survival analysis.

Users may also generate other types of pooled datasets. The first type of pooled multiwave dataset has been prepared by the CLHLS research team and released as longitudinal (panel) datasets.
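The two pooling options can be sketched in Python as follows. The record layout (one dictionary per person-wave with hypothetical keys "id" and "wave") is an assumption made for illustration and does not reflect the actual CLHLS file structure or variable names. The sketch also assumes that each respondent's first interview is present in the input.

```python
def entry_waves(records):
    """Map each respondent id to the year of his or her first interview."""
    first = {}
    for r in records:
        first[r["id"]] = min(r["wave"], first.get(r["id"], r["wave"]))
    return first


def pool_cohort(records, cohort_year):
    """Option (1): person-wave records of respondents first interviewed in cohort_year."""
    first = entry_waves(records)
    return [r for r in records if first[r["id"]] == cohort_year]


def pool_expanded(records, cohort_years=(1998, 2000, 2002)):
    """Option (2): all entry cohorts together, each record tagged with its entry wave."""
    first = entry_waves(records)
    return [dict(r, entry_wave=first[r["id"]])
            for r in records if first[r["id"]] in cohort_years]


# Hypothetical person-wave records.
records = [
    {"id": 1, "wave": 1998}, {"id": 1, "wave": 2000},
    {"id": 1, "wave": 2002}, {"id": 1, "wave": 2005},
    {"id": 2, "wave": 2000}, {"id": 2, "wave": 2002},
    {"id": 3, "wave": 2002}, {"id": 3, "wave": 2005},
]

panel_1998 = pool_cohort(records, 1998)   # only respondent 1
expanded = pool_expanded(records)         # respondents 1, 2, and 3

print(len(panel_1998), len(expanded))
```

Tagging each record with its entry wave keeps the length of the observation window visible, which matters when respondents who entered in 2002 are analyzed alongside those followed since 1998.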

Other Issues

Users need to pay attention to the following issues when using the CLHLS datasets:

(1) Although a longitudinal survey should, by definition, keep its questionnaires essentially the same across waves to ensure comparability, we made a small number of revisions, deletions, and additions to improve data quality and to adjust the content in light of new research needs, budget constraints, and the length of interview time for the oldest-old. For example, following the Berlin Aging Study, we asked six disposition-related questions in the 1998 baseline survey, such as: “Some people stated that ‘I often feel lonely and isolated’; How similar are you to these people?” We found that a substantially higher proportion of the Chinese oldest-old, especially centenarians, were unable to answer these questions than other types of questions, because some illiterate oldest-old could not understand questions that required them to compare themselves to a typical person of a specified disposition. We therefore revised these questions in the 2000 and 2002 follow-up surveys. As a result, Cronbach’s alpha, which indicates how well the disposition-related variables measure the unidimensional latent construct of personality, improved substantially in 2000 (0.72) and 2002 (0.71) compared with 1998 (0.63). Another example is the question “do you grow vegetables or do other field works at present” (variable d10b in the 1998 dataset), which was changed to “do you do any personal outdoor activities at present?” (variable d11b since 2000). Since the 2002 wave, the question “do you participate in any organized social activities at present?” has been added to the leisure activities section.

(2) Age at change in marital status in a subsequent wave is an estimate rather than an observed value for those sampled elders whose marital status changed between two adjacent waves. This is because the CLHLS collected only marital status information at each wave. Age at change in marital status was therefore estimated as the mean of the ages at the two adjacent waves for those whose marital status changed.

(3) Information on siblings’ survival status, frequency of visits, residential distance, and age was collected only at the first interview and was not collected again in subsequent waves. The same for data on respondents’.

(4) The proportion of missing values is high for parents’ ages at death, so users need to pay particular attention to these variables. Please refer to the relevant technical reports above, and the references cited there, on how to deal with missing values.

(5) Some variables are not internally consistent across waves, e.g., number of teeth. Users need to be very cautious about these variables.
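For readers who wish to check the internal consistency of a set of items themselves, such as the disposition-related questions discussed in item (1), the standard Cronbach's alpha formula can be sketched in Python as below. The scores are hypothetical, and the sketch assumes complete cases with all items coded in the same direction. It is not the computation used to produce the figures quoted above.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    if len(rows) < 2:
        raise ValueError("need at least two respondents")
    k = len(rows[0])
    if k < 2 or any(len(row) != k for row in rows):
        raise ValueError("need at least two items and equal-length rows")
    item_variance_sum = sum(variance(column) for column in zip(*rows))
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical scores: five respondents, three items on a 1-5 scale.
scores = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 5, 5],
    [5, 4, 4],
]
alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Because question wording changed between waves, alpha is best computed separately for each wave and not on pooled data.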

Please refer to codebooks and technical documents.