Age data for older ages often show a “zigzag” pattern of alternating peaks and troughs in successive five-year age groups. The phenomenon is illustrated by the South African deaths data shown in Figure 1 (Mortality and causes of death in South Africa, 1997-2003: Findings from death notification, Appendix C, page 46).
Beyond age 40 the plotted points rise, then fall, then rise, and so on, with perfect consistency through age 85. The 45-49, 55-59, 65-69 and 75-79 age groups are peaks. The 40-44, 50-54, 60-64 and 70-74 age groups are troughs. Differential heaping on ages ending in “0” and “5” is a plausible explanation for the pattern.
When age distribution errors are due to age misreporting we may attempt to correct them by distributing persons whose age is incorrectly reported to the correct age group.
In one way or another, we first estimate p[i, j], the proportion of persons reported in the i-th age group whose true age is in the j-th group, and then estimate the true number of persons in the j-th age group as the sum over all i of p[i, j]N[i], where N[i] denotes the reported number of persons in the i-th age group.
Because redistributive smoothing transfers persons from one age group to another, the total number of persons in the smoothed age groups necessarily equals the number in the original distribution.
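The redistribution step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for Table 1: the function name redistribute, the three age groups, and the counts are all hypothetical, and the matrix p is simply assumed rather than estimated.

```python
def redistribute(reported, p):
    """Estimated true counts: true[j] = sum over i of p[i][j] * reported[i]."""
    groups = range(len(reported))
    return [sum(p[i][j] * reported[i] for i in groups) for j in groups]

# Hypothetical reported counts for three adjacent age groups (illustrative only).
reported = [100.0, 140.0, 110.0]

# p[i][j]: share of persons reported in group i whose true age is in group j.
# Every row sums to 1; here 10% of the middle group is assigned to each neighbor.
p = [
    [1.0, 0.0, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.0, 1.0],
]

true_counts = redistribute(reported, p)
print(true_counts)
print(sum(reported), sum(true_counts))
```

Because every row of p sums to one, the two totals printed on the last line agree, which is the conservation property noted above.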
For an early example of this approach see my "A technique for correcting age distributions for heaping on multiples of Five", Asian and Pacific Census Forum 5(3):12-14 (February 1979).
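Some assumption constraining possible values of the p[i, j] must be made, and the free parameters estimated from the data. For zigzag, a simple assumption is that persons incorrectly reported in peak age groups are evenly divided between the two adjacent age groups, that is,
    p[i, i-1] = p[i, i+1] = p[i]/2
for each i representing a peak age group, where p[i] denotes the proportion of persons reported in the i-th age group whose age is incorrectly reported. Persons reported in the other age groups are taken to be correctly reported.
It is notationally convenient to number age groups from the age group preceding the first peak age group to the age group following the last peak group. In Figure 1, for example, the 40-44 age group is indexed by i = 1, the peak age groups by i = 2, 4, 6, and 8, the age groups between the peaks by i = 3, 5, and 7, and the 80-84 age group by i = 9.
The smoothed numbers S[i] of persons are then given by
    S[i] = (1 - p[i])N[i] for the peak age groups (i = 2, 4, 6, 8),
    S[i] = N[i] + (p[i-1]N[i-1] + p[i+1]N[i+1])/2 for the age groups between the peaks (i = 3, 5, 7),
    S[1] = N[1] + p[2]N[2]/2 and S[9] = N[9] + p[8]N[8]/2.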
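The transfer from peak age groups to their neighbors may be sketched as follows. The sketch assumes that each peak group gives up a proportion p[i] of its reported number, half to each adjacent group. The function name smooth, the nine counts, and the trial proportions are hypothetical and are not the South African data.

```python
def smooth(reported, p_peak):
    """Move a share of each peak group, half to each adjacent group.

    reported: counts N[1], ..., N[n] held in a 0-based list, so the peak
              groups i = 2, 4, ... of the text sit at list positions 1, 3, ...
    p_peak:   one proportion per peak group, in order of age.
    """
    smoothed = list(reported)
    for k, share in enumerate(p_peak):
        peak = 2 * k + 1                  # 0-based position of the k-th peak
        moved = share * reported[peak]    # persons taken out of the peak group
        smoothed[peak] -= moved
        smoothed[peak - 1] += moved / 2
        smoothed[peak + 1] += moved / 2
    return smoothed

# Hypothetical zigzag counts for nine age groups (illustrative only).
reported = [100.0, 130.0, 110.0, 150.0, 125.0, 165.0, 135.0, 170.0, 130.0]

# Trial proportions, one per peak group; in practice these are estimated.
smoothed = smooth(reported, [0.10, 0.10, 0.10, 0.10])
print(smoothed)
print(sum(reported), sum(smoothed))
```

An age group lying between two peaks receives half of what each neighboring peak gives up, while the first and last groups receive from one peak only. The total is unchanged.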
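One approach to estimating the p[i] is to define a measure of the “roughness” of the smoothed distribution and then determine the p[i] that minimize this measure. Consider the difference R[i] between the number of persons in the i-th age group and the average of the numbers in the adjacent age groups,
    R[i] = S[i] - (S[i-1] + S[i+1])/2,
defined for every age group other than the first and the last.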
If the distribution displays zigzag, the R[i] will be relatively large. If the distribution is smooth, they will be relatively small. This suggests taking the sum of squares of these differences as the measure of roughness.
If there is substantial variability in the numbers in different age groups, as there often is (though not in Figure 1), this measure may give insufficient weight to small age groups. This may be avoided by minimizing instead the sum of squares of the relative differences R[i]/S[i].
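Both measures of roughness are easy to compute. The sketch below is illustrative only: the function names and the counts are hypothetical, and the relative version divides each difference by the number in its own age group, which is one plausible scaling assumed here rather than a confirmed detail of the procedure.

```python
def differences(counts):
    """R[i] = counts[i] - (counts[i-1] + counts[i+1]) / 2 for interior groups."""
    return [counts[i] - (counts[i - 1] + counts[i + 1]) / 2
            for i in range(1, len(counts) - 1)]

def roughness(counts, relative=False):
    """Sum of squared differences, optionally scaled by group size."""
    total = 0.0
    for i, r in enumerate(differences(counts), start=1):
        if relative:
            r = r / counts[i]   # assumed scaling: relative to the group itself
        total += r * r
    return total

# Hypothetical counts (illustrative only): one zigzag series, one linear series.
zigzag = [100.0, 130.0, 110.0, 150.0, 125.0]
linear = [100.0, 110.0, 120.0, 130.0, 140.0]

print(roughness(zigzag), roughness(linear))
print(roughness(zigzag, relative=True))
```

A series that rises linearly has every R[i] equal to zero, so its roughness is zero, while the zigzag series gives a large value.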
Table 1 shows the implementation of this procedure for the South African deaths data displayed in Figure 1.
Numerical minimization is readily effected in a computer spreadsheet (Solver in Excel or Gnumeric) or in the programming language R using optim().
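For readers without a spreadsheet or R at hand, the minimization can also be sketched in plain Python. This is not the Solver or optim() calculation behind Table 1. It swaps in a simple coordinate-wise ternary search, uses the unweighted sum of squares, restricts each p[i] to the interval from 0 to 1, and runs on hypothetical counts. The names smooth, roughness, minimize_1d, and fit are illustrative.

```python
def smooth(reported, p_peak):
    """Move a share of each peak group (list positions 1, 3, ...) to its neighbors."""
    smoothed = list(reported)
    for k, share in enumerate(p_peak):
        peak = 2 * k + 1
        moved = share * reported[peak]
        smoothed[peak] -= moved
        smoothed[peak - 1] += moved / 2
        smoothed[peak + 1] += moved / 2
    return smoothed

def roughness(counts):
    """Sum over interior groups of (count minus average of neighbors) squared."""
    return sum((counts[i] - (counts[i - 1] + counts[i + 1]) / 2) ** 2
               for i in range(1, len(counts) - 1))

def minimize_1d(f, lo=0.0, hi=1.0, iters=60):
    """Ternary search for the minimum of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2

def fit(reported, n_peaks, sweeps=200):
    """Coordinate descent: improve one p[k] at a time, holding the others fixed."""
    p = [0.0] * n_peaks
    for _ in range(sweeps):
        for k in range(n_peaks):
            def objective(x, k=k):
                trial = p[:k] + [x] + p[k + 1:]
                return roughness(smooth(reported, trial))
            p[k] = minimize_1d(objective)
    return p

# Hypothetical zigzag counts for nine age groups (illustrative only).
reported = [100.0, 130.0, 110.0, 150.0, 125.0, 165.0, 135.0, 170.0, 130.0]
p_hat = fit(reported, n_peaks=4)
smoothed = smooth(reported, p_hat)
print(p_hat)
print(roughness(reported), roughness(smoothed))
```

The unweighted measure is a quadratic function of the p[i], so one-at-a-time minimization is adequate here. A general-purpose optimizer is the more natural choice for the relative measure.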
Figure 2 shows the original and smoothed distributions obtained by this method. It is evident that the procedure is effective and that the smoothed distribution is almost certainly closer to the true distribution than the observed distribution.
The method has been applied to age distributions of total, male and female deaths for 1998–2009 and the results are in every case satisfactory.
It is strongly suggested, however, that results not be accepted without inspection of plots. Observe Tukey's injunction: Never fail to plot and look.