Spitalfields: Comparison with known age-at-death • baytaAAR

library(baytaAAR)

To show how the age estimations of baytaAAR can be compared with known age-at-death, we use the data set of Spitalfields (Buckberry and Chamberlain 2002). The individuals come from a crypt at Christ Church in Spitalfields, London and were identified in terms of age and sex by plates on their coffins. The following table gives an impression of the data:

data(spitalfields, package = "baytaAAR")
head(spitalfields)
#>   Age TO ST MI MA AP Sex
#> 1  16  1  1  1  1  2   F
#> 2  17  2  1  1  1  1   F
#> 3  19  1  2  1  1  1   F
#> 4  23  3  2  2  1  2   F
#> 5  27  1  1  3  1  1   F
#> 6  27  2  2  2  1  1   F

Buckberry and Chamberlain introduced a new scoring system of the pubic symphysis and thereby replaced a single score with multiple levels to several traits with fewer levels each.

For running the model, we use NIMBLE and the multivariate normal likelihood (mnorm). In a real-life setting, multicore should be set to TRUE to reduce computation time. Still, even on modern hardware, the model will run for up to several weeks until the usual quality criteria are met. However, already a shorter run demonstrates the principle (for the full run see the results in Müller-Scheeßel et al. 2026).

spitalfields_res <- bay.ta(
  framework = "NIMBLE",
  algorithm = "mnorm",
  multicore = F,
  method = spitalfields[,c(2:6)],
  minimum_age = 16,
  parameters = c("b", "a", "beta0", "beta", "thresh", "age.s", "Ustar"),
  thinSteps = 200,
  numSavedSteps = 500,
  seed = 331
)

For the sake of illustration, we skip the diagnostic checks here and move directly to the comparison with known ages-at-death. For this, baytaAAR provides several custom functions which essentially work as wrappers around other R functions for convenience.

As the name suggests, the function age.comp.summary() summarizes standard quality measures of the comparison of known and estimated ages-at-death. These fall into two groups:

Frequentist point estimates

Bias - Arithmetic mean of the difference between known and estimated age of all individuals in the sample.
corrPearson – Correlation between known and estimated age with
p_value – p-value of that correlation and
Residual_slope – Slope of the regression of the difference between known and estimated age.
Inaccuracy – Arithmetic mean of the absolute difference between known and estimated age.
RMSE – Root mean square error of known and estimated age.

Bayesian density estimates

TMNLP – Test mean log posterior, a local evaluation of the probability density at the point of known age (Stull et al. 2022).
CRPS – Continuous ranked probability score, a global evaluation of the probability density at the point of known age (Gneiting et al. 2007).

For nearly all measures, lower is better. The only exception is corrPearson where larger is better. Because the differences are squared with the RMSE, outliers influence this measure more strongly than this is the case for Inaccuracy, and therefore, the former tends to be higher than the latter. A similar relationship exists between CRPS and TMNLP. Overall, Müller-Scheeßel et al. found that the CRPS seems to be more informative than the TMNLP and, therefore, the CRPS is probably the most informative and single best measure to compare the results between different age estimation methods in terms of their quality to ‘predict’ age-at-death.

age.comp.summary() assumes that you choose one of the point estimates Mode, Median, Mean but in the code snippet below we demonstrate how all three measures can be computed and tabulated in one go. age.comp.summary() expects the raw output from bay.ta() as well as a vector of known age-at-death. The latter may contain NAs, these are subsequently simply ignored.

summary_list <- lapply(c("Mode", "Median", "Mean"), function(choice) {
  age.comp.summary(mcmc_list = spitalfields_res, 
                   known_age = spitalfields$Age,
                   mean_choice = choice)})
summary_mat <- do.call(rbind, summary_list)
rownames(summary_mat) <- c("Mode", "Median", "Mean")
summary_mat |> t() |> knitr::kable(digits = 2)

	Mode	Median	Mean
Bias	2.64	4.33	5.42
corrPearson	0.63	0.67	0.67
corr_p	0.00	0.00	0.00
Residual_slope	0.31	0.39	0.47
Inaccuracy	12.52	11.21	11.19
RMSE	15.88	14.28	14.06
TMNLP	4.95	4.95	4.95
CRPS	8.03	8.03	8.03

As the above table shows, it can make quite a difference which mean measures Mean (= arithmetic mean), Median or Mode is chosen for the frequentist point estimate. In this case, the differences are even exaggerated because there are not enough iterations for stable estimates. On the other hand, the Bayesian estimates are not influenced by the choice of the mean measure because they are based on the probability density of the posterior.

Age estimation methods usually also supply minimum and maximum values for their estimates, so a range within which the true age-at-death should fall. Ideally, this range should adapt to the credibility/confidence level chosen, and the percentage of the true cases, the so-called coverage should be close to the chosen level. To test coverage formally, in the literature the cumulative binomial test has been proposed (Konigsberg et al. 2008). The wrapper sequential.binom.test() calculates this test for several given credibility levels:

sequential.binom.test(spitalfields_res,
                      HDImass = c(seq(0.5, 0.9, 0.1), 0.95),
                      known_age = spitalfields$Age) |>
  knitr::kable(digits = 3)

coverage	n_in	perc	CI_low	CI_up	p_value
0.50	80	0.444	0.371	0.520	0.157
0.60	99	0.550	0.474	0.624	0.172
0.70	122	0.678	0.604	0.745	0.516
0.80	137	0.761	0.692	0.821	0.192
0.90	158	0.878	0.821	0.922	0.319
0.95	166	0.922	0.873	0.957	0.088

In the table above, for all levels the confidence intervals of the observed coverage include the expected coverage. Thus, this result suggests that the model is able to ‘predict’ age-at-death with the correct level of confidence.

Finally, age.comp.plot() shows the relationship between estimated and known ages-at-death graphically. Its four plots convey the following information:

Top left: Comparison of estimated highest density intervals with known ages (green = age within HDI, red = age outside HDI, individuals ordered according to known age-at-death).
Top right: Comparison of the density of known ages with a Gompertz function derived from the arithmetic mean of the estimated population parameters $\alpha$ and $\beta$ .
Bottom left: Scatter plot of known and estimated ages with regression line in blue. The dotted line marks perfect equivalence.
Bottom right: Slope of the regression line from the left bottom image (cf. goodness-of-fit measure Residual_slope from the function age.comp.summary()).

Again, it can make quite a difference which mean measure is chosen for the point estimate for the two lower plots. The left top plot is not affected because the highest density intervals derive from the density distribution of the age estimates. For the right top plot, the Gompertz parameters are always taken from the arithmetic mean because only then the deterministic relationship between $\alpha$ and $\beta$ holds. As the code below shows, age.comp.plot() expects the output from diagnostic.summary() so the credibility level chosen (HDImass) there also determines what is shown in the top left plot. age.comp.plot() presupposes, of course, that there is a vector with known ages.

diagnostic.summary(spitalfields_res, HDImass = 0.95) |>
  age.comp.plot(known_age = spitalfields$Age)

The top left plot visualizes the result of the bottom line of the last table: Only eleven estimates, marked in red, do not contain the true age-at-death at the default credibility level 0.95. The top right plot illustrates the good agreement of the estimated Gompertz parameters fed into a Gompertz function with the density of the true ages-at-death. The bottom left plot is related to the correlation expressed in corrPearson, of course. The bottom-right plot shows the same relationship, relative to the dotted line in the bottom-left plot which is equivalent to the zero-line on the bottom right. The regression line in blue on the bottom right is nothing else than the quality measure Residual_slope. This plot also visualizes the spread of the difference between true and estimated age-at-death.

References

Buckberry, J. L., and A. T. Chamberlain. 2002. “Age estimation from the auricular surface of the ilium: a revised method.” American Journal of Physical Anthropology (Department of Archaeology; Prehistory, University of Sheffield, Sheffield S1 4ET, UK. J.Buckberry@Sheffield.ac.uk) 119 (3): 231–39. https://doi.org/10.1002/ajpa.10130.

Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society Series B: Statistical Methodology 69 (2): 243–68.

Konigsberg, Lyle W., Nicholas P. Herrmann, Daniel J. Wescott, and Erin H. Kimmerle. 2008. “Estimation and Evidence in Forensic Anthropology: Age-at-Death.” Journal of Forensic Sciences 53 (3): 541–57.

Müller-Scheeßel, Nils, Christoph Rinne, and Katharina Fuchs. 2026. “A Fully Bayesian Approach to Adult Skeletal Age Estimation: Multivariate Latent Trait Modeling with Markov Chain Monte Carlo Sampling.” American Journal of Biological Anthropology 190 (2): e70289. https://doi.org/10.1002/ajpa.70289.

Stull, Kyra E., Elaine Y. Chu, Louise K. Corron, and Michael H. Price. 2022. “Subadult Age Estimation Using the Mixed Cumulative Probit and a Contemporary United States Population.” Forensic Sciences 2 (4): 741–79.