A Novel Method for Osteometric Reassociation Using Hamiltonian Markov Chain Monte Carlo (MCMC) Simulation.
In osteological analysis, commingled assemblages present a situation in which discrete sets of remains are not readily apparent. Commingled assemblages, such as ossuaries, are a fairly common situation in bioarchaeology (Curtin 2008; Herrmann & Devlin 2008; Ubelaker & Rife 2008; Willey 1990). With the increasing utilization of forensic anthropologists in arenas such as mass disaster (Hinkes 1989; Mundorff 2008, 2012; Sledzik & Rodriguez 2001), cremation litigation (Steadman et al. 2008), and human rights investigations (Primorac et al. 1996; Varas & Leiva 2012), resolution of commingling is becoming commonplace (Adams & Byrd 2008, 2014). Forensic analysis of commingled remains focuses on victim identification and reassociating remains into discrete individuals (Adams & Byrd 2006, 2008, 2014; Byrd & Adams 2003, 2009). This focus has led to an increase in research on methodology for resolving commingling (Adams & Byrd 2008, 2014).
Of the methods available for resolving commingling, osteometric reassociation, which uses statistical models to compare bone dimensions, is considered a reliable and relatively objective technique (Adams & Byrd 2006; Buikstra et al. 1984; Byrd 2008; Byrd & Adams 2003; Byrd & LeGarde 2014; Konigsberg & Frankenberg 2013; O'Brien & Storlie 2011; Rosing & Pischtschan 1995; Snow & Folk 1970). Traditional osteometric sorting logic is a decision-making, error-mitigation approach (Byrd 2008; Byrd & Adams 2003; Byrd & LeGarde 2014). This approach does not seek to reassociate elements per se; rather, the analyst tests the null hypothesis that the dimensions of two bones are similar enough to have derived from the same individual (Adams & Byrd 2006; Byrd 2008; Byrd & Adams 2003; Byrd & LeGarde 2014). Possible matches are eliminated if the calculated p-value falls below an analyst-defined threshold, or alpha level. Bones are reassociated if all other possible matches can be eliminated. This approach implies that, because of broad variation in intra-individual bone size, reassociation is achievable via osteometrics when the assemblage represents a closed population of a small number of different-sized individuals (Byrd 2008).
The logic of reassociation through elimination was first introduced by Byrd and Adams (2003). A regression model and associated 90% prediction interval, based on the natural logarithm of the summed measurements by element, was constructed. If the bone in question fell outside of the 90% prediction interval, the researcher concluded that the elements are too different in size to be from one individual. The form of decision making used by Byrd and Adams (2003) follows a Neyman-Pearson approach to hypothesis testing, where decisions concerning the null hypothesis are strictly based on whether a test statistic passes an a priori threshold value (Royall 2000). The researcher is making a dichotomous decision whether to reject or fail to reject the null hypothesis. Under this paradigm, there is no degree of belief in the null hypothesis--it is either rejected or it is not (Royall 1997). The explicit decision-making rationale and ease of interpretation of this approach to science has obvious strengths. The elements in question either derive from the same individual or they do not; there are only two possible outcomes (Byrd 2008).
Byrd (2008) provides a more nuanced statistical framework and presents specific osteometric reassociation models for paired, articulating, and other element comparisons. Again, possible matches are eliminated by comparing a p-value to an alpha level (ranging from 0.05 to 0.10, depending on the comparison type). Byrd (2008) also provides a means for aggregating multiple test results in more complex commingling situations and introduces the severity principle, which focuses on identifying and mitigating error in decision making (Mayo & Spanos 2010). Decisions concerning the null hypothesis are based on the output of a statistical test. A researcher feels confident in his or her decision concerning a hypothesis if the test has a high chance of detecting the falsity of the hypothesis (Mayo & Spanos 2010). Severity is used to incorporate the strength of evidence into the decision-making process.
This interpretative shift blends two forms of testing statistical hypotheses: Neyman-Pearson hypothesis testing and Fisherian significance testing (Lew 2013; Royall 1997). These approaches have different purposes: the former sets an a priori criterion (alpha level) for deciding between two competing hypotheses, while the latter attempts to interpret the strength of evidence against the null hypothesis. Most contemporary frequentists blend these two forms of hypothesis testing into a third formulation, sometimes referred to as rejection trials (Royall 1997). Rejection trials use an a priori alpha level as a decision-making criterion, similar to the Neyman-Pearson approach, but the researcher subjectively interprets the p-value as a measure of the strength of evidence against the null hypothesis (Royall 1997).
While this shift toward including additional information into the decision-making process increases subjectivity, it also increases rationality. The decision to reassociate a set of remains should be based on multiple lines of evidence, of which osteometric reassociation is just one (Byrd 2008). Incorporating multiple lines of evidence into a decision is a subjective process, based in part on the experience of the researcher. It matters if a p-value is 0.049 or 0.000001--the latter can be regarded as stronger evidence against the null hypothesis than the former.
This frequentist logic has obvious strengths. With a focus on the hypothetical frequency of a correct rejection, the results are highly reliable and easy to interpret, if any interpretation is needed. There is, however, an obvious downside to this approach; it does not directly address the primary question of interest, namely, which bones are from the same individual? The sole reliance on eliminating possible matches is peculiar compared to the predictive nature of most other forms of osteological analysis (e.g., age, sex, ancestry, stature estimation).
To a Bayesian, probability is the numeric representation of the "degree of belief" in a proposition or set of propositions (Stark & Freedman 2003). This usage is more in line with a layperson's understanding of probability than the frequentist view of probability as long-run frequencies of an event. A Bayesian understanding of probability has shown promise for resolving commingling (Konigsberg & Frankenberg 2013; McCormick 2016) and other aspects of forensic investigation (Brennaman et al. 2017; Jantz & Ousley 2005; Konigsberg & Frankenberg 2013).
One way to operationalize a Bayesian approach is to assign prior probabilities to each possible match, either through prior information or uninformed (equal) probabilities. Prior probabilities are multiplied by the likelihood of the data to obtain a posterior probability, which is interpreted as the relative probability of a correct match after incorporating model information (McCormick 2016; Byrd & LeGarde 2014; Konigsberg & Frankenberg 2013).
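The prior-times-likelihood calculation described above can be sketched in a few lines. This is a minimal illustration and not code from any of the cited studies; the function name and the likelihood values are hypothetical, and equal priors are assumed when none are supplied.

```python
def posterior_match_probabilities(likelihoods, priors=None):
    """Normalize prior x likelihood into posterior match probabilities."""
    n = len(likelihoods)
    if priors is None:
        # Uninformed case: every possible match gets the same prior.
        priors = [1.0 / n] * n
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("all possible matches have zero likelihood")
    return [u / total for u in unnormalized]


# Hypothetical likelihoods for three possible matches.
likelihoods = [0.02, 0.10, 0.08]
posterior = posterior_match_probabilities(likelihoods)
best_match = posterior.index(max(posterior))
```

With equal priors the posterior is simply the likelihood rescaled to sum to one, so the ranking of possible matches is driven entirely by the model.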
Prior probability distributions can be assigned to the parameters used in estimating the model, such as the slope, y-intercept, and error term in linear regression. These prior distributions are used along with the likelihood function of the data to explore parameter space (possible values of the parameter) and to arrive at a posterior distribution for that parameter (Kery 2010; S. M. Lynch 2007). Model parameters are explicitly treated as distributions, instead of the point estimates with surrounding uncertainty that are typical of frequentist modeling. The consequence of these different views of parameters is obvious in predictive modeling, such as linear regression. A frequentist regression model results in a single value for model parameters, including the dependent (y) variable. Some form of interval estimation (typically confidence and prediction intervals) is required to better understand the uncertainty in parameter point estimates. These intervals are not direct properties of the parameter and are not probabilistic statements that a parameter's true value lies within a specified boundary (Hoekstra et al. 2014; Mayo 1982). Rather, prior to observing the data, a 95% confidence interval means there is a 95% chance that the interval will contain the true parameter value (Hoekstra et al. 2014; Mayo 1982). After the data are observed, the true value is either within the interval or it is not. The interpretation of these intervals is based in a frequentist understanding of probability, leading to pervasive misunderstanding. The osteometric sorting model of Byrd and Adams (2003), where possible matches were rejected if they fell outside of the prescribed prediction interval, is an example of such a misunderstanding. The approach of Byrd and Adams (2003) is best viewed as reflecting a shortcoming of the frequentist approach to problems of prediction rather than a lack of statistical acumen.
While there are valid criticisms of Bayesian modeling, such as subjectivity of prior information and, by extension, posterior distributions, as well as directed sampling strategies (Gelman 2008), Bayesian modeling does not contain the interpretative pitfalls of a frequentist design. The ease of interpretation, handling of model parameters, and flexibility in model construction are major differences between frequentist and Bayesian modeling and are perhaps the main benefits of a Bayesian approach. Bayesian interpretation and modeling has yet to be applied to resolving commingling. The current study examines the utility of such an approach to osteometric reassociation.
The data consist of 24 standard limb measurements from a total of 833 individuals curated at the William M. Bass Donated Skeletal Collection at the University of Tennessee, Knoxville. Individuals in the current study are predominantly European American adults, ranging in age from 18 to 70 years at death (Fig. 1), the majority of whom are male (males=583, females=250). The number of individuals varies by comparison, as only those with complete measurements for the compared elements were used.
The measurements used in this study are from the Forensic Anthropology Data Bank (FDB). Some interobserver variability is expected, given the multiple contributors to the FDB. Bivariate plots comparing left- and right-side homologous measurements were used to identify and remove obvious outliers. The number of measurements varies by element (see Table 1). The number and quality of measurements should have an influence on reassociation. Elements with a large number of highly correlated variables should show the highest accuracy rates.
This study tests the accuracy of a Bayesian approach to osteometric reassociation by simulating small-scale (n = 10) closed-population commingled assemblages and predicting the best match using standard osteological measurements and Bayesian regression. This process is repeated 500 times for each comparison. Accuracy is defined as the correct classification rate, or the number of times the best match is the correct match divided by 500.
Following Byrd (2008), limb element comparisons are grouped into three types: paired, articulating, and other element comparisons (Table 2). By virtue of being antimeres, measurements are directly comparable between paired elements. For articulating and other comparison types, transformation of raw measurements is required to compare elements.
Partial Least Squares
Partial least squares (PLS) analysis is a class of techniques for data reduction and latent variable analysis (Boulestiex & Strimmer 2006; Chen & Hoo 2011; Haenlein & Kaplan 2004; Rosipal & Kramer 2006; Wegelin 2000). These techniques share a common method of extracting components--via ordinary least squares regression. PLS analysis is similar to principal component analysis (PCA) and canonical correlation analysis (CCA), which extract orthogonal (uncorrelated) score vectors that are weighted composites of the original data set (Rosipal & Kramer 2006). Typically, the goal with any type of predictive data reduction analysis is twofold: (1) to find linear combinations that well represent the original variables and (2) to find highly correlated linear combinations. Because PCA captures a maximum amount of variation from the original variables, it is an optimal solution to the first goal. In a predictive framework, where one block of variables is used to predict another block, PCA fails to achieve the second goal, because components between blocks of variables have no relationship. On the other hand, CCA optimally achieves the second goal by creating linear combinations of each block that are maximally correlated with one another. However, CCA fails at the first goal because these components are not designed to capture information or variance within a block and are based on the correlation matrix of raw variables, obscuring the biological meaning of components and making the interpretation of components difficult (Bookstein 1991; Wegelin 2000). Furthermore, CCA components are unstable in instances of multicollinearity, and solutions are not uniquely defined when the number of variables is large compared to the sample size (Wegelin 2000). Simply, PCA explains variation within a block of variables and CCA explains variation between two blocks of variables. 
While not optimal, PLS achieves both goals by finding linear combinations of variables through the covariance of raw variables that both capture variability and are highly correlated (Bookstein 1991; Wegelin 2000). Components of the X-block (independent variables) are orthogonal, are good representations of X, and are good at explaining Y (dependent variables). Components of the Y-block are orthogonal, are good representations of Y, and are highly correlated with the X-block components. Stated another way, PLS models create components that predict a set of dependent variables from a set of independent variables that have the best predictive power on the dependent variables (Chen & Hoo 2011). The package "plsdepot" (Sanchez 2016) was used in R (R Core Team 2015) to extract relevant PLS components.
Ten individuals were randomly removed from the total data set. These 10 individuals act as a simulated commingled assemblage. One element is chosen as the independent (x) variable, with the 10 possible matching elements acting as the dependent (y) variable. For example, if we are interested in reassociating a left femur with 10 possible right femora, then the left femur is predicting the right femur. In this situation, the left femur is the independent variable and the right femur is the dependent variable. A left femur is selected from the commingled assemblage and compared to the 10 possible right femur matches. These comparisons are made using the model described below, with the remaining sample (total sample excluding the commingled individuals) acting as training data.
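The hold-out step described above can be sketched as follows. This is an illustration only: the function name, the integer identifiers, and the seed are assumptions, and the study itself was carried out in R.

```python
import random


def split_assemblage(individual_ids, n_commingled=10, seed=None):
    """Randomly remove individuals to form a simulated commingled assemblage.

    Returns the commingled individuals and the remaining training sample.
    """
    rng = random.Random(seed)
    commingled = rng.sample(individual_ids, n_commingled)
    held_out = set(commingled)
    training = [i for i in individual_ids if i not in held_out]
    return commingled, training


# Hypothetical identifiers standing in for the 833 individuals.
ids = list(range(833))
commingled, training = split_assemblage(ids, seed=1)

# One element (e.g., a left femur) is taken from one commingled individual;
# the elements of all 10 commingled individuals are the possible matches.
target_individual = commingled[0]
possible_matches = list(commingled)
```

Because the commingled individuals are excluded from the training sample, the regression model never sees the elements it is later asked to reassociate.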
The model used for assessing each variable is a simple linear regression, which takes the form of:
$y_i = \alpha + \beta x_i + \epsilon_i$ (1)
where $y_i$ and $x_i$ are the ith case of the dependent and independent variables, respectively. The y-intercept is represented by $\alpha$ (alpha), and $\beta$ (beta) represents the slope, or the coefficient by which the dependent variable changes per unit change in the independent variable. The error term is $\epsilon_i$ (epsilon), with standard deviation $\sigma$ (sigma), and represents the stochastic part of the model that accounts for all other factors that influence the value of the dependent variable. The y-intercept and slope are the deterministic portions of the model.
Typically, the regression line is fit by finding the line that minimizes the sum of squared vertical distances between the data points and the line. Although point estimates for the y-intercept and slope are calculated, uncertainty is not incorporated into those estimates. Confidence and prediction intervals attempt to deal with this limitation but are often misinterpreted and misapplied. Linear regression of this type is associated with frequentist inference and does not provide an intuitive or easily interpretable way for comparing multiple possible values of $y_i$. Bayesians specify regression models in terms of probability distributions, eliminating these inferential limitations. Bayes' theorem is used to specify probability distributions, taking the form of:
$\rho(\theta \mid y, x) \propto \rho(y \mid \theta, x)\,\rho(\theta, x)$ (2)
In this un-normalized form, the posterior probability $\rho(\theta \mid y, x)$ of parameter, $\theta$, given data, y, and constant, x, is proportional (for fixed y and x) to the product of the likelihood function $\rho(y \mid \theta, x)$ and prior $\rho(\theta, x)$ (Stan Development Team 2016).
The Bayesian regression model used in this study assigns a normal distribution to the y-variable, with improper (uniform) prior distributions for regression parameters. Unbounded ($-\infty$ to $+\infty$) uniform priors are assigned to the alpha and beta regression parameters, with a positive uniform prior (0 to $+\infty$) assigned to sigma. These uniform priors are essentially non-informative, leading the posterior distribution of the regression parameters to be driven by the likelihood of the training data. While on its face this model may seem sub-optimal by assigning non-informative prior distributions to the regression model, on a practical level this model is needed because of its flexibility. Variable values will change based on the type of comparison and the variable values of the individuals in the training set. Thus, an abstracted regression model is needed to help ensure that predictions are realistic for all variables.
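Under these priors, the un-normalized log posterior of the regression parameters reduces to the normal log likelihood wherever sigma is positive. The sketch below illustrates that point; it is not the Stan program used in the study, and the function name and the toy measurements are hypothetical.

```python
import math


def log_posterior(alpha, beta, sigma, x, y):
    """Un-normalized log posterior for y_i ~ Normal(alpha + beta * x_i, sigma).

    Flat priors on alpha and beta and a flat prior on sigma > 0 contribute
    only a constant, so the result equals the normal log likelihood.
    """
    if sigma <= 0:
        return -math.inf  # outside the support of the prior on sigma
    total = 0.0
    for xi, yi in zip(x, y):
        resid = yi - (alpha + beta * xi)
        total += (-math.log(sigma)
                  - 0.5 * math.log(2.0 * math.pi)
                  - 0.5 * (resid / sigma) ** 2)
    return total


# Hypothetical training data generated roughly as y = 1 + 2x.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.1, 4.9, 7.2, 8.8]
good_fit = log_posterior(1.0, 2.0, 0.2, x_train, y_train)
poor_fit = log_posterior(0.0, 1.0, 0.2, x_train, y_train)
```

Parameter values close to the generating line receive a much higher log posterior than values far from it, which is the surface that a sampler explores.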
Markov Chain Monte Carlo
Bayesians view parameters as observed realizations of random variables drawn from a probability distribution. As such, parameters are modeled as distributions. Modeling parameters as distributions requires calculus, and calculus is difficult, even for computers. This difficulty and the associated computational modeling time are reduced through Markov chain Monte Carlo (MCMC) simulation. MCMC methods provide a means for exploring the parameter space utilizing equation 2. Given a model, a likelihood, and data, MCMC simulates draws from the posterior distribution using quasi-dependent sequences of random variables (Kery 2010; S. M. Lynch 2007). This process is repeated a large number of times to approximate the posterior distribution of the parameter, or parameter space.
Many algorithms are available for searching this parameter space. All of them require an initial burn-in or warmup period (Kery 2010; S. M. Lynch 2007; Stan Development Team 2016). This period is the initial sequence of random draws that are strongly influenced by initial starting values and are not representative of the posterior distribution of the parameter (S. M. Lynch 2007). The Markov chain is considered representative of the posterior parameter space once the chain has converged to equilibrium, or entered a high probability area of the stationary distribution of the parameter (Stan Development Team 2016).
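The role of the warmup period can be illustrated with a random-walk Metropolis sampler, one of the simplest MCMC algorithms. This is a swapped-in technique for illustration and not the Hamiltonian sampler used in this study. The toy problem (the posterior of a mean with known standard deviation and a flat prior), the data, the starting value, and the step size are all assumptions.

```python
import math
import random
import statistics


def log_posterior_mean(mu, data, sigma=1.0):
    """Un-normalized log posterior of a mean under a flat prior."""
    return -0.5 * sum(((y - mu) / sigma) ** 2 for y in data)


def metropolis(data, n_iter=4000, start=0.0, step=0.5, seed=42):
    """Random-walk Metropolis; the first half of the draws is discarded."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior_mean(current, data)
    draws = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior_mean(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    warmup = n_iter // 2
    return draws[warmup:]


# Hypothetical data; the chain deliberately starts far from the data mean.
data = [4.8, 5.1, 5.3, 4.9, 5.4]
kept = metropolis(data)
posterior_mean = statistics.fmean(kept)
```

The early draws track the path from the poor starting value toward the high-probability region, which is why they are discarded before the chain is summarized.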
The effectiveness of an MCMC algorithm is measured by its ability to quickly reach convergence and exhaustively explore the parameter space. Many algorithms are inefficient in these respects because they can rely heavily on initial starting values and incoherently search parameter space (Carpenter et al. 2017). Hamiltonian Monte Carlo sampling, however, is both coherent and efficient (Carpenter et al. 2017). This method is based on modeling the behavior of particles using the properties of physical system (Hamiltonian) dynamics (Carpenter et al. 2017; Neal 2011). This system state consists of the position of the particle, q, and the momentum of the particle, p (Neal 2011). The position and momentum of the particle are described by its potential and kinetic energy, respectively (Neal 2011). These energy forms are inversely related. As this particle moves across a surface, its potential and kinetic energy change with the slope of the surface.
Hamiltonian dynamics are extended to searching parameter space by interpreting the parameter, [theta], as the position of a fictional particle at a point in time, with a potential energy defined by the negative log of the probability density of [theta] and a stochastic momentum variable (Neal 2011; Stan Development Team 2016). Stated simply, Hamiltonian MCMC is an efficient and effective way of exploring parameter space, allowing for the explicit modeling of uncertainty in parameter estimates, including the dependent variable. Thus, instead of a point estimate for an expected bone value, Hamiltonian MCMC provides a distribution of values. These values are weighted by their relative simulated frequency. Convergence of the MCMC simulations is required for the simulated y-values to be a good predictive representation (Stan Development Team 2016). Visual inspection of autocorrelation and chain mixture plots as well as metrics, including r-hat and effective sample size values, are methods for assessing model convergence used in this study.
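A single Hamiltonian Monte Carlo transition can be sketched with the leapfrog integrator described by Neal (2011). The example is a one-dimensional illustration under stated assumptions, not Stan's implementation: the target is a standard normal density, and the step size and number of leapfrog steps are fixed by hand, whereas Stan tunes such quantities during warmup.

```python
import math
import random
import statistics


def potential(q):
    """Potential energy: negative log density of a standard normal target."""
    return 0.5 * q * q


def grad_potential(q):
    """Gradient of the potential energy with respect to position."""
    return q


def hmc_step(q, step_size, n_leapfrog, rng):
    """One HMC transition: draw a momentum, integrate, accept or reject."""
    p = rng.gauss(0.0, 1.0)  # stochastic momentum variable
    current_h = potential(q) + 0.5 * p * p
    q_new, p_new = q, p
    p_new -= 0.5 * step_size * grad_potential(q_new)  # initial half step
    for i in range(n_leapfrog):
        q_new += step_size * p_new  # full position step
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_potential(q_new)  # full momentum step
    p_new -= 0.5 * step_size * grad_potential(q_new)  # final half step
    proposed_h = potential(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for numerical error in the total energy.
    if rng.random() < math.exp(min(0.0, current_h - proposed_h)):
        return q_new
    return q


rng = random.Random(7)
q = 3.0  # hypothetical starting position
draws = []
for _ in range(2000):
    q = hmc_step(q, step_size=0.2, n_leapfrog=8, rng=rng)
    draws.append(q)
kept = draws[500:]  # discard warmup draws
sample_mean = statistics.fmean(kept)
sample_sd = statistics.stdev(kept)
```

Because each proposal follows the gradient of the log density for several steps, successive draws can land far apart while still being accepted at a high rate.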
The Hamiltonian MCMC sampler Stan, implemented with the package "rstan" (Stan Development Team 2016) in R (R Core Team 2015), was used to simulate y-values. Specifically, each variable was modeled using 1,000 iterations across four chains with three simulated y-values per iteration. Four chains of 1,000 iterations were chosen over one chain of 4,000 iterations for several reasons. Chains can be run in parallel, or simultaneously, increasing computational efficiency and reducing run time. Additionally, chains have random starting values. The convergence and proper mixing of each chain provide another check of correct model behavior. The package "shinystan" (Stan Development Team 2016) was used in R (R Core Team 2015) to periodically assess model diagnostics to confirm proper mixing and Markov chain convergence. The default in Stan is to treat the first half of iterations as the burn-in period (Stan Development Team 2016). Thus, for each variable, 6,000 y-values were simulated (4 chains × 500 retained iterations × 3 simulated y-values). Further treatment is required to normalize these values into a probability density function to assess the relative probabilities of each possible match.
Kernel Density Estimation
Kernel density estimation is a means of estimating a probability density function based on the frequency of sample values (Duong 2007). This family of techniques fits a continuous line to the shape of the data with a kernel and bandwidth. The kernel is a non-negative function centered on zero that integrates to one (Duong 2007). The bandwidth is a free parameter that determines the width of the data range on which the kernel function is fit. A small bandwidth for the data results in an under-smoothed density estimate, containing spurious data artifacts, and is essentially "connecting the dots" between data points. An overly wide bandwidth results in an over-smoothed density and obscures the underlying structure of these data. The bandwidth used in this study approaches an optimal solution for the density estimate by selecting a bandwidth that is the standard deviation of the kernel function (R Core Team 2015). The function density() in the package "stats" (R Core Team 2015) was used in R to fit a kernel density to the simulated y-values.
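A Gaussian kernel density estimate can be sketched as follows. This is a plain-Python illustration rather than the R density() call used in the study. The bandwidth is a simplified rule of thumb based only on the standard deviation, which is an assumption of this sketch, and the simulated y-values are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Simplified rule-of-thumb bandwidth (assumption, not R's exact default)."""
    return 0.9 * statistics.stdev(samples) * len(samples) ** (-0.2)


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(value):
        return norm * sum(
            math.exp(-0.5 * ((value - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical simulated y-values (mm) for one predictive variable.
simulated_y = [448.0, 451.5, 452.0, 455.0, 449.5, 453.0]
bandwidth = rule_of_thumb_bandwidth(simulated_y)
density = gaussian_kde(simulated_y, bandwidth)

# Riemann-sum check that the estimate integrates to approximately one.
grid = [430.0 + 0.1 * k for k in range(451)]
area = sum(density(g) for g in grid) * 0.1

near_center = density(451.5)
in_tail = density(440.0)
```

Evaluating the returned function at the measurement of each possible match gives the densities that feed the match probability calculations.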
Estimating Best Matches
The result of this analysis is a probability density function of y-values for a given x-value for each variable on which the values for the 10 possible matches can be evaluated. The function approx() in the package "stats" (R Core Team 2015) was used in R to evaluate densities for each possible match. These densities are used in two ways to estimate the best match: density weight and equal weight. In the first best match estimate, each possibility is weighted by its density estimate for each variable. This calculation takes the form of:
$Pr_i = \frac{\sum_{k=1}^{n} di_k}{\sum_{k=1}^{n} dtot_k}$ (3)
where $Pr_i$ is the match probability for the ith possible match, $di_n$ is the density estimate of the ith possible match for the nth predictive variable, and $dtot_n$ is the summed density estimate of all possible matches for the nth predictive variable. Calculating match probability in this way does not weigh each predictive variable equally. Predictive variables that have high correlations between x-values and y-values will result in tightly dispersed simulated y-values, because uncertainty in their predictions is low (Fig. 2). Conversely, predictive variables that have low correlations also have high uncertainty in y-value predictions, leading to widely dispersed y-values (Fig. 2). This relationship between predictive ability of a variable and the standard error of simulated y-values affects the resulting density estimates (Fig. 3). With this calculation of match probability, predictive variables with higher correlations will lead to higher density estimates and larger relative contributions to the overall match probability. However, these larger relative contributions may swamp the contribution of other, lower correlated variables, leading to spurious classifications if the best match from predictive variables with high correlations is not the correct match.
The second calculation of overall match probability weighs all predictive variables equally and takes the form of:
$Pr_i = \frac{(di_1/dtot_1) + (di_2/dtot_2) + \ldots + (di_n/dtot_n)}{n}$ (4)
where the notation is the same as formula 3. Here, densities are normalized into probabilities for each variable. The overall match probability is the sum of these probabilities divided by the number of variables. This way of calculating the best match artificially increases the relative importance of variables with low predictive ability. Each method for assessing the best match has possible strengths and weaknesses. Thus, each type is employed to empirically address which performs best.
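The two match probability calculations can be compared directly in a short sketch. The density values below are hypothetical and were chosen so that the two methods disagree on the best match; the function names are likewise illustrative.

```python
def density_weight(densities):
    """Equation 3: summed densities over the summed totals.

    densities[k][i] is the density of possible match i for variable k.
    """
    n_matches = len(densities[0])
    grand_total = sum(sum(row) for row in densities)
    return [
        sum(row[i] for row in densities) / grand_total
        for i in range(n_matches)
    ]


def equal_weight(densities):
    """Equation 4: mean of the per-variable normalized densities."""
    n_matches = len(densities[0])
    n_vars = len(densities)
    return [
        sum(row[i] / sum(row) for row in densities) / n_vars
        for i in range(n_matches)
    ]


# Hypothetical densities: a highly correlated variable (tight, tall density)
# followed by a weakly correlated variable (wide, flat density).
densities = [
    [0.30, 0.20, 0.00],
    [0.01, 0.03, 0.01],
]
pr_density = density_weight(densities)
pr_equal = equal_weight(densities)
best_density = pr_density.index(max(pr_density))
best_equal = pr_equal.index(max(pr_equal))
```

In this toy case the first variable dominates the density-weighted result and selects the first possible match, while the equal-weight result lets the second variable move the best match to the second possibility.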
Best match probabilities are a poor metric for recognizing model error. Similar to other methods that classify using Bayesian probability, one of the possible matches will be classified as the best match even when the actual match is not among the possible choices. Thus, it is useful to have another metric by which to assess possible matches. To this end, the 5% and 95% quantiles of the simulated range of y-values were identified. A possible match failed this test if it fell outside of these boundaries. Quantile tests can be interpreted as a two-tailed significance test with an alpha level of 0.10. These tests may be used to reject possible matches, similar to the traditional logic, or to aid in identifying model error. There is a major difference between traditional rejection-based logic, which arrives at a single p-value, and the quantile tests of this study (Adams & Byrd 2006; Byrd 2008; Byrd & Adams 2003; Byrd & LeGarde 2014; J. J. Lynch 2018; Warnke-Sommer et al. 2019). Quantile tests for possible matches were conducted for each variable, with the number of variables ranging from eight for the femur to three for many other comparisons (see Table 1). Quantile tests allow for the examination of this metric as a means of assessing model error and as a rejection criterion. Conducting a quantile test on each variable relates Type 1 error rates to the number of variables rather than directly to the possible match. Thus, comparisons with more variables increase the number of chances for Type 1 error (rejecting a possible match if any variable failed the quantile test). The equation for the expected Type 1 error rate for correct matches is:
$\text{Expected Error} = 1 - (1 - p_{fail})^{n}$ (5)
where $p_{fail}$ is the expected chance of failing a quantile test and n is the number of variables.
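The quantile test and the expected error calculation can be sketched as follows. This is an illustration only: the simulated values are hypothetical, and the quantile interpolation used by Python's statistics module may differ slightly from the R default.

```python
import statistics


def quantile_bounds(simulated_values):
    """Return the 5% and 95% quantiles of the simulated y-values."""
    cuts = statistics.quantiles(simulated_values, n=20)  # 5% increments
    return cuts[0], cuts[-1]


def passes_quantile_test(candidate_value, simulated_values):
    """A possible match passes if it lies within the 5%-95% range."""
    lower, upper = quantile_bounds(simulated_values)
    return lower <= candidate_value <= upper


def expected_type1_error(p_fail, n_variables):
    """Equation 5: chance that a correct match fails at least one test."""
    return 1.0 - (1.0 - p_fail) ** n_variables


# Hypothetical simulated y-values for one variable.
simulated = [float(v) for v in range(1, 101)]
inside = passes_quantile_test(50.0, simulated)
outside = passes_quantile_test(99.0, simulated)

# Expected error for comparisons with three and eight variables.
error_three = expected_type1_error(0.10, 3)
error_eight = expected_type1_error(0.10, 8)
```

With a 10% chance of failure per variable, the expected error grows from roughly 27% with three variables to roughly 57% with eight, which is why comparisons with more variables are expected to reject correct matches more often.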
Among all comparison types, the correct match was identified in 51.60% of the simulations (3,870/7,500). Correct classification varied by prediction and comparison types (Fig. 4). In all but two instances, equal-weight comparisons provided the best classification (Table 3 and Fig. 5). With an average difference of 12.12%, paired elements exhibit the largest difference between prediction types. Interestingly, femur/femur equal-weight comparisons only showed a 1.4% improvement. Articulating elements and other comparisons, at an average increase of 1.80% and 1.86%, respectively, showed a minimal difference between prediction types.
Differences in prediction type should be identifiable through the distribution of posterior probabilities. The distributions of density and equal-weight posterior probabilities are quite similar (Fig. 6). Moreover, the differences between median values for equal-weight and density predictions are negligible (0.08% for other elements, 0.13% for paired elements, and 0.23% for articulating elements). The strong similarities in prediction type distributions for paired elements are unexpected given the difference in accuracy. Given the better or similar classification and similar distributional properties, correct classification refers to equal-weight predictions unless otherwise specified.
Paired elements performed markedly better than the two other comparison types, almost doubling the accuracy of articulating elements. In contrast, the difference between articulating and other comparisons was less than 10% (see Table 2). Unlike prediction type, the distribution of posterior probabilities by comparison type shows distinct differences (Fig. 7). The relatively low accuracies of articulating and other element comparisons result in a slightly positively skewed, approximately normal distribution. Paired elements, on the other hand, show a bimodal distribution with a strong positive skew. The shape of these distributions is in line with expectations based on comparison type accuracies. The low accuracy of articulating and other comparisons is due to uncertainty. This randomness leads to posterior probabilities approximating a normal distribution over a large number of trials. The high accuracy of paired elements results in less uncertainty. This structure results in a model that not only predicts the best match well but is also good at identifying bad and not-so-bad matches, leading to a high density of values near zero, another peak near the median, and a long positive tail.
Besides identifying the best match, posterior probabilities rank all possible matches. This aspect is most useful when the analyst is trying to cull down possible matches, in open-population situations, or in non-paired comparisons, where correct classification rates are relatively low. Tables 4-6 provide the best match rank of the correct match. For paired element comparisons, the correct match is among the top three best matches in over 97% of the simulations. For articulating and other comparisons, the correct match is among the top five best matches for 85.85% and 81.74% of simulations, respectively.
For each variable, possible match values were compared against the 5% and 95% quantiles of the predicted match distribution, resulting in 300,000 quantile tests. There are interesting trends in the behavior of the quantile tests (Table 7). Correct best match variables failed less often than the expected 10%, with articulating and other comparison variables failing roughly an order of magnitude less. Incorrect best match variables failed more often than correct match variables--10.5% of the time for paired elements, but rarely for articulating and other comparison variables. As expected, variables for other possible matches failed quantile tests more often than best matches. Correct matches failed at least one quantile test (Type 1 error) less than the expected 39.69% for paired, 30.84% for articulating, and 28.37% for other comparisons. Again, Type 1 error for articulating and other comparisons was well below expected rates (see Table 7).
Figure 8 shows an example of typical model diagnostic plots. These plots indicate that the MCMC model is working well and that the parameter estimates are reliable. Density plots should approximate a normal distribution. Autocorrelation plots should resemble a decaying exponential curve in histogram form, where autocorrelation is initially high and quickly drops off. Chain mixture plots should show no discernible pattern, with each chain moving around parameter space without getting "stuck" in a particular area.
Numerical model diagnostics, including r-hat values and effective sample sizes, were also checked periodically. An r-hat value is an estimate of convergence that compares the variation between chains with the variation within each chain (Stan Development Team 2016). Chains have properly converged with r-hat values between 1.0 and 1.2; the closer to 1.0, the better the convergence. Rarely were r-hat values above 1.0, and in no case was an r-hat value above 1.2. Effective sample size is an estimate of the amount of independent information contained in the drawn samples; the closer the effective sample size is to the number of draws, the better the chain convergence. Rarely was the effective sample size below 75% of the total number of drawn samples. Most effective sample sizes were between 80% and 90% of the total number of draws, yet another confirmation that model chains are properly converging.
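For readers unfamiliar with the r-hat diagnostic, a minimal Python sketch of the classic Gelman-Rubin potential scale reduction factor is given below. It is a simplified illustration under stated assumptions, not the computation Stan performs (Stan reports a split-chain variant), and the function name and simulated chains are hypothetical.

```python
import numpy as np


def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for one parameter.

    chains: array of shape (n_chains, n_draws) of post-warmup draws.
    """
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    within = chains.var(axis=1, ddof=1).mean()            # W: mean within-chain variance
    between = n_draws * chains.mean(axis=1).var(ddof=1)   # B: scaled variance of chain means
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))


rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))         # four chains sampling the same target
stuck = mixed + np.arange(4)[:, None]      # four chains offset from one another
print(gelman_rubin_rhat(mixed), gelman_rubin_rhat(stuck))
```

Chains that explore the same distribution give a value very close to 1.0, whereas chains centered in different regions give a value well above the 1.2 threshold.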
The strength of a Bayesian approach to resolving commingling is its versatility. The posterior distribution of y-values allows for both predicting the correct match and rejecting possible matches. As with most practical applications in forensic anthropology, the analyst must have a clear question to address and a strong understanding of the strengths and weaknesses of the method employed. This study is a first step toward understanding those methodological aspects of a Bayesian approach to resolving commingling.
Equal-weight variable predictions perform better than density-weighted predictions. A more nuanced look at trends between prediction types, however, suggests underlying factors that may be affecting classification accuracy by type: the number, type, and predictive ability of measurements (Table 8). As expected, the more highly correlated variables a model uses, the better its accuracy. This trend may explain the almost nonexistent difference between prediction types for the femur and the large difference for other paired comparisons, like the radius. The correlations between left and right length measurements are the strongest for all elements, and are likely driving density-weighted predictions. For the radii, besides maximum length, there are only two moderately correlated measurements of the midshaft. Midshaft measurements are swamped by length in density-weighted comparisons, but are often able to adjust the best match to the correct match in equal-weight predictions. In femur comparisons, the other strongly correlated variables are able to adjust predictions when length-based predictions are wrong, leading to comparable accuracy between the two types. This trend also suggests that measurements that quantify different aspects of a bone increase model performance. A likely reason for the high correct classification rates of the femur is the novel information provided by femur measurements. Stated another way, if maximum length is in the model, the addition of bicondylar length is unlikely to appreciably improve performance, as these measurements are, at least statistically, essentially the same (r² = 0.995). Adding information on epicondylar breadth, femoral head size, and subtrochanteric dimensions is likely to show a marked increase in model performance at each step. Directly testing this assertion is an avenue for future research.
At an overall performance of just over 50%, it would appear that a Bayesian approach to osteometric reassociation is impractical in many situations. It is important to consider the difficulty of the question that these models are attempting to answer: What is the best match from among these ten possibilities? This question is an order of magnitude more difficult than the question typically asked in osteometric reassociation: Is this one possible match different enough that it can be reliably rejected as a possibility? In this light, an overall performance of just over 50%, against a chance rate of 10% for ten equally likely possibilities, does not seem so bad. The overall correct classification rate, however, is a misleading metric that obscures some important aspects of osteometric reassociation identified in this study.
Paired element comparisons are superior to articulating and other comparison types. Femora, for example, were correctly matched in 93% of simulations. Paired elements are developmentally and (to varying degrees) functionally integrated elements with directly comparable measurements. Composite variables are required to directly compare non-paired elements. While composite variables are orthogonal and there is a good degree of redundancy by treating each paired element measurement as independent, composite variables are likely obscuring important size and shape relationships that paired element models are able to exploit. This assertion is supported by the lower percentage of quantile rejections for articulating and other type comparisons, where the composite variables may artificially make elements more homogeneous. The use of composite variables should have the largest negative effect in osteometric models based on a rejection criterion. Issues identified in rejection-based osteometric reassociation models (LeGarde 2012; McCormick 2016; Vickers et al. 2015) have been mitigated in recent improvements to and expansions on the paired element model described by Byrd (2008) (J. J. Lynch et al. 2018; Warnke-Sommer et al. 2019). While these changes improve model performance, the underlying logic of the approach has remained the same since its description in Byrd and Adams (2003).
The behavior of quantile tests in this study suggests that deriving one composite p-value as a rejection criterion would be of little value outside of paired element comparisons. Indeed, recent research has focused exclusively on paired element comparisons (J. J. Lynch 2018; J. J. Lynch et al. 2018; Warnke-Sommer et al. 2019). Rejecting a possible match, and by extension, identifying model error for best matches, if any variable failed a quantile test appears to be a viable approach for articulating and other comparisons. This assertion does not apply to paired element comparisons. Over 41% of incorrect best matches and 96% of non-best possibilities fail at least one quantile test, which seems excellent for identifying model error and rejecting possible matches. However, over 25% of the correct best matches also fail at least one quantile test. Despite being below the expected Type 1 error rate (39.69%), it is an unacceptably high rate compared to an aggregate p-value. These results suggest the traditional rejection-based approach is optimal for paired elements.
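The any-variable rejection rule discussed above can be sketched in Python as follows. This is a hedged illustration rather than the study's implementation: it assumes the predicted match distribution is available as an array of simulated draws (one column per variable), the function names are hypothetical, and the values are synthetic.

```python
import numpy as np


def failed_quantile_tests(predicted_draws, candidate, lower=0.05, upper=0.95):
    """Flag each variable whose observed value lies outside the interval
    between the lower and upper quantiles of the predicted-match draws.

    predicted_draws: (n_draws, n_vars) simulated values for the match.
    candidate: (n_vars,) measurements of one possible match.
    """
    lo, hi = np.quantile(predicted_draws, [lower, upper], axis=0)
    candidate = np.asarray(candidate, dtype=float)
    return (candidate < lo) | (candidate > hi)


def reject_match(predicted_draws, candidate):
    """Any-variable rule: reject when at least one quantile test fails."""
    return bool(failed_quantile_tests(predicted_draws, candidate).any())


# Synthetic illustration (not study data): three variables, 4,000 draws.
rng = np.random.default_rng(1)
draws = rng.normal(loc=[450.0, 80.0, 47.0], scale=[8.0, 2.0, 1.5], size=(4000, 3))
print(reject_match(draws, [452.0, 79.5, 47.3]))  # all values near the predictions
print(reject_match(draws, [452.0, 79.5, 55.0]))  # third variable far outside
```

Because a single failed variable triggers rejection, the rule becomes stricter as the number of variables grows, which is in keeping with the higher rejection rates observed for paired elements.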
This study simulated 7,500 closed-population commingled assemblages and assessed the accuracy of predicting the correct match using a complete set of limb measurements. The correct match was identified in 3,870 of the simulations, for an overall correct classification rate of 51.60%. There are several factors to consider when interpreting the results of this study. The sample used to construct the Bayesian regression model is expected to be nearly identical to the average simulated commingled assemblage. While the simulated commingled assemblage was removed from the overall sample prior to the construction of the model, the ten random individuals were drawn from the same population of predominantly European American males. Drawing the commingled assemblage from the same population as the reference sample should have two main influences on these results. First, the reference sample used to create the regression model is highly appropriate for the commingled assemblage and represents a "best case" for predicting the best match. Second, the simulated commingled assemblages are, on average, quite homogeneous, making discriminating among possible matches difficult. The relative influence of these factors is beyond the scope of this study. Reference sample composition and homogeneity of the commingled assemblage are additional areas of future research. Other areas of future research include examining the effect of assemblage size, missing measurements, and different methods of quantifying skeletal elements on classification rates.
The reassociation model described above is firmly placed within a Bayesian framework, in both model construction and inference. A Bayesian understanding of probability is easily interpreted and is in line with practical applications of forensic anthropology, where deductive reasoning is required to make statements about a particular case based on a larger theory of knowledge. This is not to say the frequentist paradigm is without merit. In fact, this study has a major aspect that most frequentists would laud: the simulation of commingled assemblages to directly test model performance over the "long run." The frequency of correct matches over an extended series of trials is an inductive way to build the theoretical foundation on which deductive statements are made. Bayesian modeling is flexible, can be tailored to various types of data, and allows assumptions to be explicitly built into the model. Modeling parameters as distributions provides an intuitive way to directly compare possible matches. The posterior distribution of y-values can be interpreted in different ways, depending on the goal of the analysis. Although rejecting possible matches has been the purview of the traditional, frequentist approach, there is no reason to limit a Bayesian approach to just prediction. The beauty of the model presented here is that the analyst can have the "best of both worlds" through the ability to predict the best match and reject possible matches to create a short list of possibilities. Furthermore, Bayesian inference allows for incorporating additional lines of evidence into the calculation of posterior probabilities. Thus, in theory, other methods or information, such as the spatial relationship between elements recovered in the field, can be incorporated into an overall match probability.
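As a hypothetical illustration of combining lines of evidence, osteometric match probabilities over a closed set of candidates could be multiplied by likelihoods from a second source and renormalized. The Python sketch below assumes the two sources are independent; the numbers and the function name `combine_evidence` are invented for illustration and do not come from this study.

```python
import numpy as np


def combine_evidence(osteometric_probs, other_likelihoods):
    """Bayes' rule over a closed set of candidates: treat the osteometric
    probabilities as the prior, multiply by the likelihood of the additional
    evidence under each candidate, and renormalize to sum to one."""
    prior = np.asarray(osteometric_probs, dtype=float)
    likelihood = np.asarray(other_likelihoods, dtype=float)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()


# Invented values for four possible matches.
osteometric = [0.40, 0.35, 0.15, 0.10]
spatial = [0.2, 0.6, 0.1, 0.1]  # e.g., likelihood given recovery location
combined = combine_evidence(osteometric, spatial)
print(combined.round(3))
```

In this invented example the second candidate overtakes the first once the spatial evidence is included, which is the kind of reordering an overall match probability would allow.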
The author would like to thank Dr. John Byrd for his mentorship and encouragement to learn more about Bayesian statistics. Thanks are certainly due to Dr. Dawnie Steadman, Dr. Amy Mundorff, Dr. Benjamin Auerbach, Dr. Richard Jantz, and Dr. James Fordyce for their guidance and advice, and to the anonymous reviewers for their thoughtful comments.
Adams BJ, Byrd JE, eds. Commingled Human Remains: Methods in Recovery, Analysis, and Identification. San Diego: Academic Press; 2014.
Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008.
Adams BJ, Byrd JE. Resolution of small-scale commingling: A case report from the Vietnam War. Forensic Science International 2006;156(1):63-69.
Bookstein FL. Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge: Cambridge University Press; 1991.
Boulesteix A-L, Strimmer K. Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 2007;8(1):32-44.
Brennaman AL, Love KR, Bethard JD, Pokines JT. A Bayesian approach to age-at-death estimation from osteoarthritis of the shoulder in modern North Americans. Journal of Forensic Sciences 2017;62(3):573-584.
Buikstra JE, Gordon CC, St. Hoyme L. The case of the severed skull: Individuation in forensic anthropology. In: Rathbun TA, Buikstra JE, eds. Human Identification: Case Studies in Forensic Anthropology. Springfield: Charles C. Thomas; 1984:121-135.
Byrd JE. Models and methods for osteometric sorting. In: Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008:199-220.
Byrd JE, Adams BJ. Analysis of commingled human remains. In: Blau S, Ubelaker DH, eds. Handbook of Forensic Anthropology and Archaeology. Walnut Creek: Left Coast Press; 2009:174-185.
Byrd JE, Adams BJ. Osteometric sorting of commingled human remains. Journal of Forensic Sciences 2003;48(4):717-724.
Byrd JE, LeGarde CB. Osteometric sorting. In: Adams BJ, Byrd JE, eds. Commingled Human Remains: Methods in Recovery, Analysis, and Identification. San Diego: Academic Press; 2014:167-191.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A probabilistic programming language. Journal of Statistical Software 2017;76(1):1-32. doi: 10.18637/jss.v076.i01
Chen Y, Hoo KA. Application of partial least square regression in uncertainty study area. American Control Conference 2011; 1958-1962.
Curtin AJ. Putting together the pieces: Reconstructing mortuary practices from commingled ossuary cremains. In: Schmidt CW, Symes SA, eds. The Analysis of Burned Human Remains. San Diego: Academic Press; 2008:219-227.
Duong T. ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. Journal of Statistical Software 2007;21(7):1-16.
Gelman A. Objections to Bayesian statistics. Bayesian Analysis 2008;3:445-449.
Haenlein M, Kaplan AM. A beginner's guide to partial least squares analysis. Understanding Statistics 2004;3(4):283-297.
Herrmann NP, Devlin JB. Assessment of commingled human remains using a GIS-based approach. In: Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008:257-269.
Hinkes MJ. The role of forensic anthropology in mass disaster resolution. Aviation, Space, and Environmental Medicine 1989;60:A60-3.
Hoekstra R, Morey RD, Rouder JN, Wagenmakers EJ. Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review 2014;21(5):1157-1164.
Jantz RL, Ousley SD. FORDISC 3.0: Personal computer forensic discriminant functions. University of Tennessee, Knoxville; 2005.
Kery M. Introduction to WinBUGS for Ecologists: Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. New York: Academic Press; 2010.
Konigsberg LW, Frankenberg SR. Bayes in biological anthropology. American Journal of Physical Anthropology 2013;152(S57):153-184.
LeGarde CB. Asymmetry of the Humerus: The Influence of Handedness on the Deltoid Tuberosity and Possible Implications for Osteometric Sorting [master's thesis]. Missoula: University of Montana; 2012.
Lew MJ. To P or not to P: On the evidential nature of P-values and their place in scientific inference. arXiv:1311.0081 [stat]; 2013.
Lynch JJ. An analysis on the choice of alpha level in the osteometric pair-matching of the os coxa, scapula, and clavicle. Journal of Forensic Sciences 2018;63(3):793-797.
Lynch JJ, Byrd J, LeGarde CB. The power of exclusion using automated osteometric sorting: Pair-matching. Journal of Forensic Sciences 2018;63(2):371-380.
Lynch SM. Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York: Springer Science and Business Media; 2007.
Mayo DG. In defense of the Neyman-Pearson theory of confidence intervals. Philosophy of Science 1981;48(2):269-280.
Mayo DG, Spanos A. Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press; 2010.
McCormick KA. A Biologically Informed Structure to Accuracy in Osteometric Reassociation [PhD dissertation]. Knoxville: University of Tennessee; 2016.
Mundorff AZ. Anthropologist-directed triage: Three distinct mass fatality events involving fragmentation of human remains. In: Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008:123-144.
Mundorff AZ. Integrating forensic anthropology into disaster victim identification. Forensic Science, Medicine, and Pathology 2012;8:131-139.
Neal RM. MCMC using Hamiltonian dynamics. In: Brooks S, Gelman A, Jones GL, Meng X-L, eds. Handbook of Markov Chain Monte Carlo. Boca Raton, FL: CRC Press; 2011:113-162.
O'Brien MJ, Storlie CB. An alternative bilateral refitting model for zooarchaeological assemblages. Journal of Taphonomy 2011;9:245-268.
Primorac D, Andelinovic S, Definis-Gojanovic M, Drmic I, Rezic B, Baden MM, et al. Identification of war victims from mass graves in Croatia, Bosnia, and Herzegovina by the use of standard forensic methods and DNA typing. Journal of Forensic Sciences 1996;41(5):891-894.
R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. http://www.R-project.org/.
Rosing FW, Pischtschan E. Re-individualisation of commingled skeletal remains. In: Jacob B, Bonte W, eds. Advances in Forensic Sciences. Berlin: Verlag Dr. Koster; 1995:1-9.
Rosipal R, Kramer N. Overview and recent advances in partial least squares. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J, eds. Subspace, Latent Structure and Feature Selection. Berlin: Springer; 2006:34-51.
Royall R. On the probability of observing misleading statistical evidence. Journal of the American Statistical Association 2000;95(451):760-768.
Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: CRC Press; 1997. Monographs on Statistics and Applied Probability; vol. 71.
Sanchez G. plsdepot: Partial least squares (PLS) data analysis methods. R package version 0.1.17; 2016.
Sledzik PS, Rodriguez WC. Damnum fatale: The taphonomic fate of human remains in mass disasters. In: Haglund WD, Sorg MH, eds. Advances in Forensic Taphonomy: Method, Theory, and Archaeological Perspectives. Boca Raton, FL: CRC Press; 2001:322-330.
Snow CC, Folk ED. Statistical assessment of commingled skeletal remains. American Journal of Physical Anthropology 1970;32:423-427.
Stan Development Team. Stan modeling language: User's guide and reference manual, Version 2.17.0; 2016.
Stark PB, Freedman DA. What is the chance of an earthquake? NATO Science Series IV: Earth and Environmental Sciences 2003;32:201-213.
Steadman DW, Sperry K, Snow F, Fulginiti L, Craig E. Anthropological investigations of the Tri-State Crematorium incident. In: Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008:81-96.
Ubelaker DH, Rife JL. Approaches to commingling issues in archeological samples: A case study from Roman era tombs in Greece. In: Adams BJ, Byrd JE, eds. Recovery, Analysis, and Identification of Commingled Human Remains. Totowa, NJ: Humana Press; 2008:97-122.
Varas CG, Leiva MI. Managing commingled remains from mass graves: Considerations, implications and recommendations from a human rights case in Chile. Forensic Science International 2012;219:e19-e24.
Vickers S, Lubinski PM, DeLeon LH, Bowen JT. Proposed method for predicting pair matching of skeletal elements allows too many false rejections. Journal of Forensic Sciences 2015;60(1):102-106.
Warnke-Sommer JD, Lynch JJ, Pawaskar SS, Damann FE. Z-Transform method for pairwise osteometric pair-matching. Journal of Forensic Sciences 2019;64(1):23-33.
Wegelin JA. A survey of partial least squares (PLS) methods, with emphasis on the two-block case. Technical report, Department of Statistics, University of Washington, Seattle; 2000.
Willey PS. Prehistoric Warfare on the Great Plains: Skeletal Analysis of the Crow Creek Massacre Victims. New York: Garland Publications; 1990.
Kyle McCormick (a*)
(a) Defense POW/MIA Accounting Agency, DoD, Hawai'i, USA
(*) Correspondence to: Kyle McCormick, Defense POW/MIA Accounting Agency, DoD, 590 Moffett St., Joint Base Pearl Harbor--Hickam, AFB, Hawai'i 96853, USA
Received 19 June 2018; Revised 17 October 2018; Accepted 26 October 2018
TABLE 1--Forensic Data Bank Measurements by Element.

Element          Measurement Name                       FDB #
Humerus (n=5)    Maximum length                         v40
                 Epicondylar breadth                    v41
                 Maximum vertical head diameter         v42
                 Maximum diameter at midshaft           v43
                 Minimum diameter at midshaft           v44
Radius (n=3)     Maximum length                         v45
                 A/P diameter at midshaft               v46
                 Transverse diameter at midshaft        v47
Ulna (n=4)       Maximum length                         v48
                 Dorso-Volar diameter                   v49
                 Transverse diameter                    v50
                 Physiological length                   v51
Femur (n=8)      Maximum length                         v60
                 Bicondylar length                      v61
                 Epicondylar breadth                    v62
                 Maximum diameter of head               v63
                 A/P subtrochanteric diameter           v64
                 Transverse subtrochanteric diameter    v65
                 A/P diameter at midshaft               v66
                 Transverse diameter at midshaft        v67
Tibia (n=4)      Condylo-malleolar length               v69
                 Maximum proximal epiphyseal breadth    v70
                 Distal epiphyseal breadth              v71
                 Maximum diameter at nutrient foramen   v72

TABLE 2--Osteometric Comparisons by Type.

Comparison        Type
Femur/Femur       Paired
Humerus/Humerus   Paired
Radius/Radius     Paired
Tibia/Tibia       Paired
Ulna/Ulna         Paired
Femur/Tibia       Articulating
Humerus/Ulna      Articulating
Humerus/Radius    Articulating
Ulna/Radius       Articulating
Femur/Humerus     Other
Femur/Ulna        Other
Femur/Radius      Other
Tibia/Humerus     Other
Tibia/Ulna        Other
Tibia/Radius      Other

TABLE 3--Accuracy of Osteometric Comparison Types.

Comparison             Type           Equal Weight   Density Weight
Femur/Femur            Paired         93.00% (*)     91.60%
Humerus/Humerus        Paired         86.20%         74.00%
Tibia/Tibia            Paired         85.60%         75.80%
Radius/Radius          Paired         71.60%         54.40%
Ulna/Ulna              Paired         67.40%         47.40%
Paired Overall                        80.76%         68.64%
Femur/Tibia            Articulating   55.40%         51.80%
Ulna/Radius            Articulating   42.40%         43.20%
Humerus/Radius         Articulating   35.80%         34.00%
Humerus/Ulna           Articulating   34.80%         32.20%
Articulating Overall                  42.10%         40.30%
Femur/Humerus          Other          35.60%         34.40%
Femur/Ulna             Other          35.60%         32.60%
Tibia/Radius           Other          34.60%         31.40%
Tibia/Ulna             Other          32.80%         32.40%
Femur/Radius           Other          32.60%         29.00%
Tibia/Humerus          Other          30.60%         30.80%
Other Overall                         33.63%         31.77%
Overall                               51.60%         46.33%

(*) Bold indicates prediction types with the highest correct classification rates.

TABLE 4--Paired Element Comparison Correct Match Rank (n = 2,500 Simulations).

Rank   Femur   Humerus   Tibia   Radius   Ulna   Total   % Correct   Cumulative %
1      465     431       428     358      337    2,019   80.76       80.76
2      30      50        61      101      107    349     13.96       94.72
3      4       9         6       23       27     69      2.76        97.48
4      0       6         3       5        18     32      1.28        98.76
5      0       1         1       8        8      18      0.72        99.48
6      1       2         0       4        1      8       0.32        99.80
7      0       1         0       0        0      1       0.04        99.84
8      0       0         0       0        1      1       0.04        99.88
9      0       0         1       0        1      2       0.08        99.96
10     0       0         0       1        0      1       0.04        100.00

TABLE 5--Articulating Element Comparison Correct Match Rank (n = 2,000 Simulations).

Rank   F/T   U/R   H/R   H/U   Total   % Correct   Cumulative %
1      277   212   179   174   842     42.10       42.10
2      102   98    98    89    387     19.35       61.45
3      50    65    80    74    269     13.45       74.90
4      17    36    36    36    125     6.25        81.15
5      15    19    30    30    94      4.70        85.85
6      15    16    14    25    70      3.50        89.35
7      7     14    16    21    58      2.90        92.25
8      6     12    18    13    49      2.45        94.70
9      5     12    15    17    49      2.45        97.15
10     6     16    14    21    57      2.85        100.00

F = femur, H = humerus, T = tibia, R = radius, U = ulna.

TABLE 6--Other Element Comparison Correct Match Rank (n = 3,000 Simulations).

Rank   F/H   F/U   F/R   T/H   T/U   T/R   Total   % Correct   Cumulative %
1      178   178   163   153   164   173   1,009   33.63       33.63
2      109   87    71    96    94    99    556     18.53       52.16
3      73    53    63    84    71    48    392     13.07       65.23
4      35    42    72    69    48    54    320     10.67       75.90
5      22    36    40    36    39    27    200     6.67        82.57
6      19    21    10    23    18    19    110     3.37        85.94
7      23    24    17    21    23    21    129     4.30        90.24
8      20    16    22    7     20    21    106     3.53        93.77
9      11    13    17    7     13    19    80      2.67        96.44
10     10    30    25    4     10    19    98      3.27        100.00

F = femur, H = humerus, T = tibia, R = radius, U = ulna.

TABLE 7--Quantile Test Results for All Measurements and by Individual.

Type           Position              Tests     Fails     % Fails   Individuals   Ind. Fails   % Ind. Fails
Paired         Best/Correct          14,047    640       4.56      2,019         514          25.46
               Best/Incorrect        2,953     310       10.50     481           198          41.16
               Other Possibilities   108,000   66,349    61.43     22,500        21,686       96.38
Articulating   Best/Correct          4,666     56        1.20      842           48           5.70
               Best/Incorrect        6,334     107       1.69      1,158         97           8.38
               Other Possibilities   63,000    19,105    30.33     18,000        13,158       73.10
Other          Best/Correct          5,193     40        0.77      1,009         34           3.39
               Best/Incorrect        10,307    116       1.28      1,991         104          5.21
               Other Possibilities   85,500    23,639    27.65     27,000        17,452       64.64
Total                                300,000   110,362   36.79     75,000        53,291       71.05

TABLE 8--Descriptive Statistics of the Correlation between Variables.

Comparison         Type           # of Vars.   Avg. r   Max. r   Min. r
Femur/Femur        Paired         8            0.923    0.988    0.852
Tibia/Tibia        Paired         4            0.863    0.980    0.755
Humerus/Humerus    Paired         5            0.840    0.964    0.722
Radius/Radius      Paired         3            0.792    0.964    0.671
Ulna/Ulna          Paired         4            0.765    0.958    0.485
All Paired                        24           0.837
Femur/Tibia        Articulating   4            0.646    0.910    0.306
Humerus/Ulna       Articulating   3            0.593    0.778    0.344
Ulna/Radius        Articulating   3            0.532    0.898    0.096
Humerus/Radius     Articulating   4            0.505    0.838    0.079
All Articulating                  14           0.569
Tibia/Humerus      Other          3            0.597    0.823    0.346
Femur/Humerus      Other          4            0.574    0.866    0.307
Tibia/Ulna         Other          3            0.551    0.775    0.175
Tibia/Radius       Other          3            0.540    0.779    0.133
Femur/Ulna         Other          3            0.508    0.778    0.155
Femur/Radius       Other          3            0.501    0.809    0.114
All Others                        19           0.545
Title Annotation: RESEARCH ARTICLE
Date: Mar 22, 2019