**Springer Texts in Statistics**

Silvia Bozza Franco Taroni Alex Biedermann

# Bayes Factors for Forensic Decision Analyses with R

## **Springer Texts in Statistics**

#### **Series Editors**

G. Allen, Rice University, Department of Statistics, Houston, TX, USA

R. De Veaux, Department of Mathematics and Statistics, Williams College, Williamstown, MA, USA

R. Nugent, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA

*Springer Texts in Statistics (STS)* includes advanced textbooks intended for courses ranging from the 3rd and 4th undergraduate years to the 1st and 2nd graduate years; exercise sets are expected to be included. The series editors are currently Genevera I. Allen, Richard D. De Veaux, and Rebecca Nugent. Stephen Fienberg, George Casella, and Ingram Olkin were editors of the series for many years.

Silvia Bozza • Franco Taroni • Alex Biedermann


Silvia Bozza
Department of Economics, Ca' Foscari University of Venice, Venice, Italy
Faculty of Law, Criminal Justice and Public Administration, School of Criminal Justice, University of Lausanne, Lausanne-Dorigny, Switzerland

Franco Taroni
Faculty of Law, Criminal Justice and Public Administration, School of Criminal Justice, University of Lausanne, Lausanne-Dorigny, Switzerland

Alex Biedermann
Faculty of Law, Criminal Justice and Public Administration, School of Criminal Justice, University of Lausanne, Lausanne-Dorigny, Switzerland

Published with the support of the Swiss National Science Foundation (Grant no. 10BP12\_208532/1)

ISSN 1431-875X ISSN 2197-4136 (electronic)
Springer Texts in Statistics
ISBN 978-3-031-09838-3 ISBN 978-3-031-09839-0 (eBook)
https://doi.org/10.1007/978-3-031-09839-0

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

*To our families*

## **Preface**

The introduction of scientific evidence in legal proceedings raises a host of intricate questions and themes, ranging from the architecture of legal systems across contemporary jurisdictions and psychological aspects of judgment and decision-making, to principles and methods of logical reasoning and decision-making under uncertainty. Over decades of theoretical and practice-oriented research, scholars in fields such as law, statistics, history, philosophy of science, psychology, and forensic science have come to the understanding that the sound use of scientific findings in evidence and proof processes critically depends on the ability of forensic scientists to use formal methods of reasoning, so as to ensure a coherent approach to dealing with and communicating about uncertainty. The focal point of these developments is the recognition of probability as the reference method for measuring uncertainty.

It is thus hardly surprising that, in recent years, the intersection between law and forensic science has seen an increase in the number of reports, guidelines, and recommendations issued by eminent societies, review panels, and expert groups that insist on the importance of aligning the interpretation of scientific evidence by forensic scientists to a probabilistic measure of the value of evidence.<sup>1</sup> This measure is the likelihood ratio and has been widely described in peer-reviewed articles and textbooks.

What is less often recognized, however, is that the likelihood ratio is merely a particular instance of a more general concept, known as the *Bayes factor*. While the likelihood ratio is typically presented in the focused context of evidence-based discrimination between pairs of competing propositions, the Bayes factor is a method of choice for approaching a more comprehensive collection of problems commonly associated with the use of measurements and data in forensic science.

<sup>1</sup> Examples include documents issued by the Royal Statistical Society (Aitken et al., 2010), The Royal Society of Edinburgh (Nic Daéid et al., 2020), The UK Forensic Science Regulator (Tully, 2021), The European Network of Forensic Science Institutes (Willis et al., 2015), The Association of Forensic Science Providers (Association of Forensic Science Providers, 2009), and expert communities in particular sub-fields of forensic science, such as forensic genetics (e.g., Gill et al., 2018) or forensic voice comparison (Drygajlo et al., 2015; Morrison et al., 2021).

Examples include the comparison of probabilistic models, model selection, and decision-making regarding competing theories and model parameters. We believe that by becoming acquainted with Bayes factors across a range of different applications, forensic scientists can strengthen the use of probabilistic methods in their respective disciplines. Forensic scientists should also gain an understanding of the role of Bayes factors in coherent decision-making under uncertainty. The core idea of this book on Bayes factors, the first on this theme in forensic science, is to address these questions.

*Bayes Factors for Forensic Decision Analyses with R* is a new Bayesian modeling book that provides a self-contained account of essential elements of computational Bayesian statistics using R, a leading programming language and freely available software environment for statistical computing. This book features a well-rounded approach to three naturally interrelated topics. The first is probabilistic inference. As a core concept of Bayesian inferential statistics, Bayes factors are ideally suited to help forensic scientists think about the logical and balanced evaluation of the value of evidence. This is a necessary preliminary to coherent reporting on scientific evidence. Second, this book highlights the logical connection between probabilistic reasoning, using Bayes factors, and decision analysis under uncertainty. This perspective involves the decision-theoretic (re-)conceptualization of questions that, in classical statistics, are often framed as problems of hypothesis testing using a disparate set of concepts, such as p-values, that have a longstanding and well-documented history of misinterpretation by both scientists and recipients of expert information. Here, Bayes factors provide a sound and defensible alternative. The third theme that this book covers is operational relevance. Thus, throughout this book, all key concepts are systematically illustrated with hands-on examples and complete template code in R, including sensitivity analyses and explanations of how to interpret results in context. This usefully complements the theoretical and philosophical justifications for the coherent approach to inference and decision emphasized throughout this book.

Besides explaining the role of the Bayes factor as a guide to reasoning and as a preliminary to coherent decision analysis, the original contribution of this book is to work out the relevance of these topics with respect to two main forensic areas of application: investigation and evaluation. The first, investigation, refers to discriminating between general propositions of interest, i.e., when no named person (or object) is available for comparative examinations with a given trace, mark, or impression of unknown source. The second, evaluation, is concerned with assessing the meaning of evidence with respect to specific propositions of interest, e.g., whether given trace material, a mark, or an impression comes from a particular person (or object), rather than from an unknown person (or object). While investigation and evaluation pertain to distinct procedural phases with specific needs and constraints, they involve inferential and decisional tasks that have common conceptual underpinnings that can be formally captured, analyzed, and expressed in terms of Bayes factors, and embedded in a coherent framework for decision analysis.

This book neither contains recipes nor intends to prescribe what scientists should do. Instead, its aim is to provide forensic scientists with a sound analytical framework for inference and decision analysis that allows them to critically rethink their current approaches drawn from more traditional courses in probability and statistics. As prerequisites, readers should have a minimal background in probability and statistics including, ideally, notions from Bayesian statistics. With its balanced presentation of theoretical and philosophical background, together with practical illustrations, this concise book seeks to make an original contribution to the forensic science literature. It will be of equal interest to forensic practitioners and applied forensic statisticians, and can be used to support courses on Bayesian statistics for forensic scientists. Occasionally, we will refer to datasets and computational routines, available as online supplementary materials on the book's website at http://link.springer.com/.

This book presents materials developed through a longstanding collaboration between the authors. Their research was supported, on various occasions, by the *Swiss National Science Foundation*, the *Foundation for the University of Lausanne* (Fondation pour l'Université de Lausanne), the *Vaud Academic Society* (Société Académique Vaudoise), the *Department of Economics of Ca' Foscari University of Venice*, and the *School of Criminal Justice of the University of Lausanne*. The authors are deeply indebted to Colin Aitken and Daniel Ramos for their valuable advice, to Lorenzo Gaborini for sharing routines developed in his Ph.D. thesis, and to Luc Besson, Jacques Linden, Raymond Marquis, Valentin Scherz, and Matthieu Schmittbuhl for sharing data of forensic interest. Finally, students and fellow researchers at *Ca' Foscari University of Venice* and the *University of Lausanne* have provided the authors with exciting and encouraging environments without which much of the writing of this book would not have been possible.

Silvia Bozza, Venice, Italy
Franco Taroni, Lausanne-Dorigny, Switzerland
Alex Biedermann, Lausanne-Dorigny, Switzerland

August 2022



## **Chapter 1 Introduction to the Bayes Factor and Decision Analysis**

#### **1.1 Introduction**

The assessment of the value of scientific evidence involves subtle forensic, statistical, and computational aspects that can represent an obstacle in practical applications. The purpose of this book is to provide theory, examples, and elements of R code to illustrate a variety of topics pertaining to value of evidence assessments using Bayes factors in a decision-theoretic perspective.

The structure of this book is as follows. This chapter starts by presenting an overview of the role of statistics in forensic science, with an emphasis on the Bayesian perspective and the role of the Bayes factor for logical inference and decision. Next, the chapter addresses three general topics that forensic scientists commonly encounter: model choice, evaluation, and investigation. For each of these themes, Bayes factors will be developed and discussed using practical examples. Particular attention will be devoted to the distinction between feature- and score-based Bayes factors, typically used in evaluative settings. This chapter also provides theoretical background analysts might need during data analysis, including elements of forensic interpretation, computational methods, decision theory, prior elicitation, and sensitivity analysis.

Chapter 2 addresses the problem of discrimination between competing propositions regarding target features of a population of interest (i.e., parameters). Examples include applications involving counting processes and propositions referring to the proportion of items of forensic interest (e.g., items with illegal content) or an unknown quantity. Attention will be drawn to background elements that may affect counting processes or continuous measurements, as well as to a decisional approach to this problem.

Chapter 3 addresses the problem of evaluation of scientific evidence in the form of discrete, continuous, and continuous multivariate data. The latter may present a complex dependence structure that will be handled by means of multilevel models.

Chapter 4 focuses on the problem of investigation, using examples involving either univariate or multivariate data.

For each topic covered in the book, examples will be accompanied by R code, allowing readers to reproduce computations and adapt sample code to their own problems. The end of each chapter presents an outline of the principal R functions used throughout the respective chapters. While some functions can be easily reproduced, others are more elaborate, and copying their R code would be tedious. These functions, as well as datasets, are available as supplementary materials on the book's website (at http://link.springer.com/).

#### **1.2 Statistics in Forensic Science**

Forensic science uses scientific principles and technical methods to help with the use of evidence in legal proceedings of criminal, civil, or administrative nature. To assist members of the judiciary in their inquiries regarding the existence or past occurrence of events of legal interest, forensic scientists examine recovered traces, objects, and materials related to persons of interest. This may involve, for example, the analysis of the nature of body fluids and various other items such as textile fibers, glass and paint fragments, handwriting, digital device data, as well as the classification of such items and data into various categories.

More generally, forensic science takes a major interest in both investigative proceedings and evaluative processes at trial. This involves the examination of persons and objects, as well as the vestiges of actions. Forensic scientists also help with reconstructing past events. Thus, incomplete knowledge and, hence, uncertainty are key challenges that all participants in the legal process must deal with. The standard approach to cope with uncertainty is the structured collection and sound use of data. Typically, data result from the analysis and comparative examination of evidential material (i.e., biological traces, toxic substances, documents, crime scene findings, imaging data, etc.), followed by an assessment of the probative value of scientific results within the context of the event under investigation and in the light of the task-relevant information.

However, despite its potential to support legal evidence and proof processes, forensic science has also been found to be a contributing factor to miscarriages of justice (Cole, 2014). Furthermore, over the last decade, reviews by expert panels have exposed several areas of forensic science practice as insufficiently reliable (e.g., PCAST, 2016), and courts across many jurisdictions have insisted on the need to probe and demonstrate the empirical foundations of forensic science disciplines.

Scientists currently address these challenges by directing research not only toward more studies involving experiments under controlled conditions but also toward formal frameworks for value of evidence assessment that can cope with scientific evidence independent of its nature and type. Central to this development is a convergence to the Bayesian perspective, which is well suited to help forensic scientists assess the probative value of observations that, typically, do not arise under only one given hypothesis or proposition.<sup>1</sup> Bayesian thinking can cope with situations in which one holds varying degrees of belief about competing hypotheses and one considers that those hypotheses may differ in their capacity to account for one's observations and findings. As noted by Cornfield (1967, p. 34),

> Bayes' theorem is important because it provides an explication for this process of consistent choice between hypotheses on the basis of observations and for quantitative characterization of their respective uncertainties.

In forensic science, the *Bayes factor* (BF)—a central element in Bayesian analysis—has come to play an extremely important role. It represents a key statistic for assessing the value of scientific findings and is therefore widely covered in the forensic literature (e.g., Aitken et al., 2021; Buckleton et al., 2016). It allows scientists to assess case-related observations or measurements in the light of competing propositions presented by parties at trial. In essence, the Bayes factor provides a measure of the degree to which a scientific finding is capable of discriminating between the competing propositions of interest.

The choice of the Bayes factor to assess the value of outcomes of laboratory examinations and analyses results from the requirement to comply with several practical precepts of coherent thinking and decision-making. The desirable properties that the Bayes factor accounts for are balance, transparency, robustness, and logic. In addition, it is a flexible measure, acknowledged throughout forensic science, law, and statistics, because it can deal with any type of evidence (e.g., Evett, 1996; Jackson, 2000; Robertson & Vignaux, 1993; Robertson et al., 2016; Good, 1950; Kass & Raftery, 1995; Lindley, 1977; Taroni et al., 2010).

In forensic science, the Bayes factor is more commonly called the *likelihood ratio*, even though this may create confusion: the two terms represent distinct concepts, and the Bayes factor does not always simplify to a likelihood ratio. This will be explained later in Sect. 1.4. Generally, the use of the Bayes factor is now well established in both theory and practice, though some branches of forensic science are more advanced in Bayes factor analyses than others. A general overview is presented by the Royal Statistical Society's Section Committee on Statistics and Law (e.g., Aitken et al., 2010) in a series of practitioner guides for judges, forensic scientists, and expert witnesses.

While the Bayes factor represents a coherent metric for value of evidence

<sup>1</sup> The term hypothesis (or proposition) is interpreted here as an assertion or a statement that such and such is the case (e.g., an outcome or a state of nature of the kind "the questioned document has been printed with printer 1" or "the recovered item is from the same source as the control item") and also as a description of a decision. Propositions are, therefore, statements that are either true or false and that can be affirmed or denied. An important basis for much of the argument developed in this book is the assumption that personal degrees of belief can be assigned to propositions or hypotheses. Throughout this book, hypothesis and proposition are treated as synonyms.

assessment<sup>2</sup> in evaluative reporting<sup>3</sup> (i.e., when a person of interest is available for comparison purposes), it is important to mention that it can also be used in investigative contexts. A case is investigative when there is no person or object available for comparison, and examinations concentrate primarily on helping to draw inferences about general features (e.g., sex, right-/left-handedness, etc.) related to the source of a recovered stain, mark, or trace. More generally, the Bayes factor can be used for two main purposes in forensic science:

- *evaluation*, i.e., assessing the value of findings with respect to propositions that refer to a particular person (or object) available for comparative examinations, and
- *investigation*, i.e., helping to discriminate between general propositions about the source of a recovered stain, mark, or trace when no person (or object) is available for comparison.


To illustrate these concepts, imagine a case involving a questioned document and handwriting. In cases of anonymous letter-writing, it regularly occurs that, at least initially, no suspected writer is available. In such a case, there will be no possibility for jointly evaluating characteristics observed on a questioned document and features on reference (known or control) material from a person of interest, as would be the case in an evaluative context. However, this does not mean that measurements made only on the questioned document, without comparison to reference material, could not be informative for investigative purposes. For example, features extracted from the handwriting of unknown source may be evaluated with respect to more general propositions such as "the questioned document (e.g., a ransom note) has been written by a man (woman)" or "the questioned document has been written by a right- (left)-handed person." Helping to discriminate between such propositions contributes to reducing the pool of potential writers in an investigation.

As a metric to assess the value of findings in a forensic context, the Bayes factor allows practitioners to offer a quantitative expression that they can convey in a more general reasoning framework that conforms to the logic of Bayesian thinking. From the scientist's point of view, the contribution to inference is perfectly symmetric. That is, the findings may support either of the two competing propositions, with

<sup>2</sup> A list of necessary logical conditions to guarantee coherence is presented and discussed in Taroni et al. (2021a).

<sup>3</sup> On the difference between evaluative and other types of reporting, such as technical and intelligence reporting, see ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015) §1.1.

respect to the relevant alternative proposition. This strengthens the scientist's role as balanced expert in the legal process.

#### **1.3 Bayesian Thinking and the Value of Evidence**

Bayesian philosophy is named after Reverend Thomas Bayes and is based on an interpretation of probability as personal degree of belief (de Finetti, 1989). In Bayesian theory, all uncertainties in a problem must necessarily be described by probabilities. Probability is intended as one's conditional measure of uncertainty associated with the evidence, the available information, and all the underlying assumptions. In this book, we will use the term evidence in the general sense of a given piece of information or data. This includes, but is not restricted to, the idea of evidence used in legal proceedings. The term evidence is used here in a broad sense as a synonym for other terms such as "finding" or "outcome." According to Good (1988), evidence may be defined as data that makes one alter one's beliefs about how the world is working. The word finding, in turn, is used in this book to designate the result of a forensic examination or analysis. Findings are measurements in a quantitative form, discrete or continuous. Examples of discrete quantitative results are counts of glass fragments or gunshot residues. Examples of continuous results are measurements of physical quantities such as length, weight, refractive index, and summaries of complex comparisons in the form of similarity scores. For a formal definition of the term findings, see also the ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015).

Starting from prior probabilities, representing subjective degrees of belief about propositions of interest, the Bayesian paradigm allows one to rationally revise such beliefs and compute posterior probabilities, draw inferences about propositions, and make decisions (Sprenger, 2016). For example, when new information becomes available, it may be necessary to assess how this information ought to affect propositions regarding the involvement of a person of interest in particular alleged activities. Likewise, physicians need to structure their thought processes when performing medical diagnosis. In general, the question is how to update one's personal beliefs regarding uncertain events when one receives new information.

Suppose that the events $H_1, \ldots, H_n$ form a partition, and denote by $\Pr(H_i \mid I)$ the probability that is associated with $H_i$, $i = 1, \ldots, n$, given relevant background information $I$. This probability is called a *prior probability*. Furthermore, consider an event or quantity $E$, whose probability can be expressed by means of the *law of total probability* as

$$\Pr(E \mid I) = \sum_{j} \Pr(E \mid H_j, I) \Pr(H_j \mid I). \tag{1.1}$$
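As a small numerical check of Eq. (1.1), the sum can be computed directly in R. All probabilities below are purely illustrative and not taken from any case:

```r
# Purely illustrative probabilities, not from any real case.
# Prior probabilities Pr(H_j | I) for a partition H_1, H_2, H_3:
prior <- c(0.50, 0.30, 0.20)
# Conditional probabilities Pr(E | H_j, I):
lik <- c(0.90, 0.40, 0.05)

# Law of total probability, Eq. (1.1): weighted sum over the partition
pr_E <- sum(lik * prior)
pr_E  # 0.58
```

Note that the prior probabilities must sum to 1 for the partition, while the conditional probabilities of $E$ under the different hypotheses need not.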

The ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015, at p. 21) regards conditioning information as the essential ingredient of probability assignment, since all probabilities are conditional. In forensic evaluation, it is important not to focus on all possible information, but only on the information that is relevant to the forensic task at hand. Disciplined forensic reporting requires scientists to make clear their perception of the conditioning information at the time they conduct their evaluation. Conditioning information is sometimes known as the framework of circumstances (or background information). Much of the nonscientific information will not have a bearing on the value of scientific findings, but it is essential to recognize those aspects that do. Examples of relevant information may include the ethnic origin of the perpetrator (but not that of the suspect) and the nature of garments and surfaces involved in alleged transfer events. More generally, conditioning information may also include data and domain knowledge that the expert uses to assign probabilities. The conditioning on (task-) relevant information *I* is important because it clarifies that probability assignments are personal and depend on the knowledge of the person conducting the evaluation.

Bayes rule (or theorem) is a straightforward application of the conditionalization principle and the partition formula (1.1). It allows one to compute the so-called *posterior probability* $\Pr(H_i \mid E, I)$ as

$$\Pr(H_i \mid E, I) = \frac{\Pr(E \mid H_i, I) \Pr(H_i \mid I)}{\Pr(E \mid I)} = \frac{\Pr(E \mid H_i, I) \Pr(H_i \mid I)}{\sum_j \Pr(E \mid H_j, I) \Pr(H_j \mid I)},$$

which emphasizes that certain knowledge of $E$ modifies the probability of $H_i$.<sup>4</sup> Note that the qualifiers *prior* and *posterior* are relative only to the new finding $E$: the posterior probability will again become a prior probability when additional findings become available. Lindley (2000, p. 301) expressed this as follows: "Today's posterior is tomorrow's prior." Bayesian statistics is the sequential application of Bayes rule to all situations that involve observed and missing data, unknown quantities (e.g., events, propositions, population parameters), or unobserved data (e.g., future observations).
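This sequential character of Bayes rule can be sketched in a few lines of R. The prior probabilities and likelihoods below are invented for illustration, and `bayes_update` is a hypothetical helper, not a function from the book's supplementary materials:

```r
# Bayes rule for a discrete partition: posterior is proportional to
# likelihood times prior, renormalized to sum to 1.
bayes_update <- function(prior, lik) {
  lik * prior / sum(lik * prior)
}

# Invented prior beliefs about two hypotheses, and invented
# likelihoods for two successive findings E1 and E2:
prior  <- c(H1 = 0.5, H2 = 0.5)
lik_E1 <- c(0.8, 0.2)   # Pr(E1 | H_j)
lik_E2 <- c(0.6, 0.3)   # Pr(E2 | H_j)

post1 <- bayes_update(prior, lik_E1)  # posterior after E1
post2 <- bayes_update(post1, lik_E2)  # post1 acts as the new prior
round(post2, 3)  # H1: 0.889, H2: 0.111
```

Assuming the findings are conditionally independent given the hypotheses, updating with both findings at once, `bayes_update(prior, lik_E1 * lik_E2)`, yields the same posterior, which illustrates the coherence of sequential updating.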

Participants in the legal process are typically concerned with the problem of comparing competing propositions about a contested event. A typical example for trace evidence is "the recovered glass fragments come from the broken window" versus "the recovered glass fragments come from an unknown source." When measurements on various items (i.e., glass fragments) are available, it may be necessary to quantitatively evaluate these findings with respect to selected propositions of interest. According to Bayesian methodology developed by Jeffreys (1961), this involves the introduction of a statistical model to describe the probability of the available measurements according to different hypotheses (propositions or models). The posterior probability of each hypothesis is then computed via a direct application of Bayes theorem. Following Jeffreys' criterion for comparing hypotheses, a hypothesis is accepted or rejected on the basis of its posterior

<sup>4</sup> See Taroni et al. (2020) for a discussion on the generalization of Bayes rule (i.e., Jeffrey's conditionalization) when one is faced with uncertain evidence.

probability being greater or smaller than that of the alternative proposition. Note that the acceptance or rejection of a proposition is not meant as an assertion of its truth or falsity, only that its probability is greater or smaller than that of the respective alternative proposition (Press, 2003).

The primary element in Bayesian methodology for comparing propositions is the Bayes factor (BF for short). It provides a numerical representation of the impact of findings on propositions of interest. In other words, the Bayes factor quantifies the degree to which observed measurements discriminate between competing propositions. The Bayes factor is the ingredient by which the prior odds in favor of a proposition are multiplied by virtue of the knowledge of the findings (Good, 1958):

$$\text{Posterior odds} = \text{BF} \times \text{Prior odds}.$$

Broadly speaking, prior and posterior odds are the ratios of probabilities of the hypotheses of interest before and after acquiring new findings, respectively. The value of experimental outcomes is measured by how much *more* probable they make one hypothesis relative to the respective alternative hypothesis, compared to the situation *before* considering the experimental findings.
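In R, the odds form of Bayes theorem is a one-line computation. The prior odds and Bayes factor below are arbitrary values chosen for illustration only:

```r
# Arbitrary illustration values, not from any case:
prior_odds <- 0.25   # Pr(H1)/Pr(H2) before considering the findings
BF <- 20             # Bayes factor in favor of H1

posterior_odds <- BF * prior_odds
posterior_odds  # 5

# For mutually exclusive, exhaustive hypotheses, odds convert back
# to a posterior probability for H1:
posterior_odds / (1 + posterior_odds)  # 5/6, approximately 0.833
```

Here, findings that are 20 times more probable under $H_1$ than under $H_2$ turn prior odds of 1 to 4 against $H_1$ into posterior odds of 5 to 1 in its favor.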

A formal definition of the Bayes factor is given in Sect. 1.4, along with a discussion about its interpretation as measure of the value of the evidence. Practical examples in Sects. 1.5 and 1.6 and further developments in Chaps. 3 and 4 will illustrate the use of the Bayes factor for evaluative and investigative purposes.

#### **1.4 Bayes Factor for Model Choice**

Consider an unknown quantity $X$, referring to a quantity or measurement of interest such as the number of ecstasy pills in a sample drawn from a large seizure of pills, the elemental chemical composition of glass fragments, or a feature (e.g., the length) of a handwritten character. Furthermore, suppose that $f(x \mid \theta)$ is a suitable *probability model*<sup>5</sup> for $X$, where the unknown parameter<sup>6</sup> $\theta$ belongs to the parameter space $\Theta$. Suppose also that the parameter space consists of two non-overlapping sets $\Theta_1$ and $\Theta_2$ such that $\Theta = \Theta_1 \cup \Theta_2$. A question that may be of interest is whether the parameter $\theta$ belongs to $\Theta_1$ or to $\Theta_2$, that is, to compare the hypothesis

$$H_1: \theta \in \Theta_1,$$

against the alternative hypothesis

<sup>5</sup> A probability model is understood here as a characterization of the distribution of measurements.

<sup>6</sup> A parameter is taken here as a characteristic of the distribution of all members (e.g., individuals or objects) of a population of interest.

$$H_2: \theta \in \Theta_2.$$

Note that $H_1$ is usually called the null hypothesis. Under a classical (frequentist) approach, the distinction between null and alternative hypotheses is very important. Users must be aware that when performing significance testing, competing hypotheses are not equivalent and there is, in fact, an asymmetry associated with them. One collects data (or evidence) against the null hypothesis before it is rejected, but the acceptance of the null hypothesis is not an assertion about its truthfulness. It merely means that there is little evidence against it. As will be shown, under the Bayesian paradigm, this does not represent an issue.

A hypothesis *Hi* is called *simple* if there is only one possible value for *θ*, say *Θi* = {*θi*}. A hypothesis is called *composite* (see, e.g., Example 1.1) if there is more than one possible value.

Let *π*<sup>1</sup> = Pr*(H*1*)* = Pr*(θ* ∈ *Θ*1*)* and *π*<sup>2</sup> = Pr*(H*2*)* = Pr*(θ* ∈ *Θ*2*)* denote the prior probabilities for the competing composite hypotheses *H*<sup>1</sup> and *H*2. Note that, for the sake of simplicity, the letter *I* denoting background information is omitted here. The ratio of the prior probabilities *π*1*/π*<sup>2</sup> is called the *prior odds* of *H*<sup>1</sup> to *H*2. The prior odds indicate whether hypothesis *H*<sup>1</sup> is more or less probable than hypothesis *H*<sup>2</sup> (prior odds being greater or smaller than 1) or whether the hypotheses are (almost) equally probable, i.e., the prior odds are (close to) 1.<sup>7</sup> Suppose observational data *x* are available that do not provide conclusive evidence<sup>8</sup> about the propositions of interest but will allow one to update prior beliefs using Bayes theorem. Let us denote by *fHi(x)* the *marginal probability* of the data under proposition *Hi*, that is,

$$f\_{H\_i}(\mathbf{x}) = \int\_{\Theta\_i} f(\mathbf{x} \mid \theta) \pi\_{H\_i}(\theta) d\theta,\tag{1.2}$$

where *πHi(θ )* denotes the prior probability density of *θ* for *θ* ∈ *Θi*. The marginal probability is also called the *predictive probability*: the probability, assessed before the data become available, of observing the actual data. Kass and Raftery (1995) refer to it as the *marginal likelihood*: the probability of the observations averaged

<sup>7</sup> The ratio of the probabilities of two mutually exclusive and collectively exhaustive events is called *odds* in favor of the event whose probability is in the numerator of the ratio. Note that hypotheses are not necessarily exhaustive: the word odds is sometimes used loosely in reference to the ratio of the probabilities of mutually exclusive propositions whose probabilities do not add to 1 (Taroni et al., 2010).

<sup>8</sup> The problem of imperfect evidence is well illustrated by Robertson and Vignaux (1995, at p.12):

An ideal piece of evidence would be something that always occurs when what we are trying to prove is true and never occurs otherwise. If we are trying to demonstrate the truth of an hypothesis or assertion we would like to find as evidence something which always occurs when the hypothesis is true and never occurs when the hypothesis is not true. In real life, evidence this good is almost impossible to find.

across the prior distribution over the parameter space *Θ*. Note that the parameter space *Θ* can be either continuous or discrete. In the latter case, the integral in (1.2) must be replaced by a sum, and the marginal probability of the evidence (i.e., data *x*) becomes

$$f\_{H\_i}(\mathbf{x}) = \sum\_{\theta \in \Theta\_i} f(\mathbf{x} \mid \theta) \Pr(\theta \mid H\_i).$$
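For the continuous case, the integral in (1.2) can be approximated numerically. The following is a minimal R sketch with hypothetical values (binomial data and a uniform prior restricted to *Θi* = *(*0*.*5*,* 1]), chosen purely for illustration:

```r
# Hypothetical setting: x successes in n binomial trials, with a uniform
# prior density on (0, 1) restricted to Theta_i = (0.5, 1]
n <- 10; x <- 7
a <- 1; b <- 1                                  # Beta(1, 1), i.e., uniform prior
p_i <- pbeta(1, a, b) - pbeta(0.5, a, b)        # prior probability of Theta_i
prior_Hi <- function(theta) dbeta(theta, a, b) / p_i   # pi_{H_i}(theta) in Eq. (1.2)
# Marginal probability of the data under H_i, Eq. (1.2)
f_Hi <- integrate(function(theta) dbinom(x, n, theta) * prior_Hi(theta),
                  lower = 0.5, upper = 1)$value
f_Hi
```

The same construction applies to the discrete case by replacing `integrate` with a sum over the points of *Θi*.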

The Bayes factor for comparing *H*<sup>1</sup> and *H*<sup>2</sup> is defined as the ratio of the marginal probabilities *fHi(x)* under the competing hypotheses, that is,

$$\text{BF} = \frac{f\_{H\_1}(\mathbf{x})}{f\_{H\_2}(\mathbf{x})}.\tag{1.3}$$

Let *α*<sup>1</sup> = Pr*(H*<sup>1</sup> | *x)* = Pr*(θ* ∈ *Θ*<sup>1</sup> | *x)* and *α*<sup>2</sup> = Pr*(H*<sup>2</sup> | *x)* = Pr*(θ* ∈ *Θ*<sup>2</sup> | *x)* denote the posterior probabilities for the competing hypotheses. The ratio of the posterior probabilities *α*1*/α*<sup>2</sup> is called the *posterior odds* of *H*<sup>1</sup> to *H*2. Recalling the odds form of Bayes theorem, one can express the Bayes factor for comparing hypothesis *H*<sup>1</sup> against hypothesis *H*<sup>2</sup> as the factor by which the prior odds of *H*<sup>1</sup> to *H*<sup>2</sup> are multiplied in virtue of the knowledge of the data to obtain the posterior odds, that is,

$$
\alpha\_1/\alpha\_2 = \mathbf{BF} \times \pi\_1/\pi\_2.
$$

The Bayes factor measures the change produced by the new information (or, data) in the odds when going from the prior to the posterior distributions in favor of one proposition as opposed to a given alternative. For this reason, it is not uncommon to find the BF defined as the ratio of the posterior odds in favor of *H*<sup>1</sup> to the prior odds in favor of *H*1, that is,

$$\text{BF} = \frac{\alpha\_1/\alpha\_2}{\pi\_1/\pi\_2}.\tag{1.4}$$

One of the attractive features of using a Bayes factor to quantify the value of the acquired information is that it does not depend on prior probabilities of competing hypotheses. However, this bears potential for misunderstandings. The Bayes factor is sometimes interpreted as, for example, the odds provided by the data alone, for *H*<sup>1</sup> to *H*2: this is conceptually incorrect. Though cases may be found where the Bayes factor can be expressed as a ratio of likelihoods<sup>9</sup> and correctly be interpreted

<sup>9</sup> While probabilistic modeling provides the probability *f (x* | *θ )* of any hypothetical data *x* before any observation is made, conditional on *θ*, statistical methods allow one to draw conclusions about *θ* given the collected observations *x*. This difference in focus is expressed by the *likelihood function*, written *l(θ* | *x)*, where the probability distribution *f (x* | *θ )* is written as a function of *θ* conditional on the observations *x*, i.e., *f (x* | *θ )* = *l(θ* | *x)*.

as the "summary of the evidence provided by the data in favor of one scientific theory (. . . ) as opposed to another" (Kass & Raftery, 1995, at p. 777), this does not hold in general. The Bayes factor will generally depend on prior assumptions. It is necessary, thus, to clarify the meaning of "prior assumptions" because confusion may arise between, on the one hand, the notion of prior probability about model parameters (*θ* ∈ *Θi*) and, on the other hand, prior probabilities of propositions (*Hi*).

To clarify this distinction, consider the comparison of a simple hypothesis *H*<sup>1</sup> : *θ* = *θ*<sup>1</sup> against a simple alternative hypothesis *H*<sup>2</sup> : *θ* = *θ*2. The prior probabilities of these hypotheses are expressed as *π*<sup>1</sup> = Pr*(θ* = *θ*1*)* and *π*<sup>2</sup> = Pr*(θ* = *θ*2*)*. The posterior probabilities *αi* in the light of prior probabilities *πi* (*i* = 1*,* 2) and observed data *x* can be easily computed by means of a direct application of Bayes theorem:

$$\alpha\_i = \Pr(H\_i \mid \mathbf{x}) = \Pr(\theta = \theta\_i \mid \mathbf{x}) = \frac{f(\mathbf{x} \mid \theta\_i)\pi\_i}{\sum\_{j=1,2} f(\mathbf{x} \mid \theta\_j)\pi\_j}. \tag{1.5}$$

The ratio of the posterior probabilities *α*1*/α*<sup>2</sup> obtained from computing (1.5) for *i* = 1*,* 2 simplifies to the product of the likelihood ratio times the ratio of the prior probabilities, that is,

$$\frac{\alpha\_1}{\alpha\_2} = \frac{f(\mathbf{x} \mid \theta\_1)}{f(\mathbf{x} \mid \theta\_2)} \times \frac{\pi\_1}{\pi\_2}.$$

Recalling (1.4), it is readily seen that the Bayes factor in this simple case is the likelihood ratio of *H*<sup>1</sup> to *H*2,

$$\text{BF} = \frac{f(\mathbf{x} \mid \theta\_1)}{f(\mathbf{x} \mid \theta\_2)} \times \frac{\pi\_1}{\pi\_2} \times \frac{\pi\_2}{\pi\_1} = \frac{f(\mathbf{x} \mid \theta\_1)}{f(\mathbf{x} \mid \theta\_2)},\tag{1.6}$$

and it is correct then to interpret this as "the odds provided by the data alone for *H*<sup>1</sup> to *H*2."
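A short R sketch (hypothetical binomial data and simple hypotheses, chosen only for illustration) confirms that, in the simple-versus-simple case, the ratio of posterior odds to prior odds reduces to the likelihood ratio in (1.6):

```r
# Hypothetical simple hypotheses for binomial data:
# H1: theta = 0.7 versus H2: theta = 0.4, with x = 7 successes in n = 10 trials
n <- 10; x <- 7
theta1 <- 0.7; theta2 <- 0.4
BF <- dbinom(x, n, theta1) / dbinom(x, n, theta2)   # likelihood ratio, Eq. (1.6)
# Posterior probabilities via Bayes theorem, Eq. (1.5), with equal prior probabilities
pi1 <- 0.5; pi2 <- 0.5
alpha1 <- dbinom(x, n, theta1) * pi1 /
  (dbinom(x, n, theta1) * pi1 + dbinom(x, n, theta2) * pi2)
alpha2 <- 1 - alpha1
# The ratio of posterior odds to prior odds recovers the same value, Eq. (1.4)
(alpha1 / alpha2) / (pi1 / pi2)
```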

However, the comparison of simple versus simple hypotheses is a particular case among many others. Practitioners may face the more general situation where at least one of the hypotheses is composite, that is, the parameter of interest may take one of a range of different values (e.g., *Θi* = {*θ*1*,...,θk*}), or infinitely many, as is the case when *θ* is continuous. In the case of composite hypotheses, the prior probabilities *πi* for *i* = 1*,* 2 will take the following form:

$$\pi\_i = \Pr(\theta \in \Theta\_i) = \begin{cases} \sum\_{\theta \in \Theta\_i} \Pr(\theta) & \text{for } \theta \text{ discrete} \\\\ \int\_{\Theta\_i} \pi(\theta) d\theta & \text{for } \theta \text{ continuous}, \end{cases} \tag{1.7}$$

where *π(θ )* is the prior probability density for *θ* ∈ *Θ*. The posterior probabilities *αi* are therefore computed as


$$\alpha\_i = \Pr(\theta \in \Theta\_i \mid \mathbf{x}) = \begin{cases} \frac{\sum\_{\theta \in \Theta\_i} f(\mathbf{x}|\theta) \Pr(\theta)}{\sum\_{\theta \in \Theta} f(\mathbf{x}|\theta) \Pr(\theta)} & \text{for } \theta \text{ discrete} \\\\ \frac{\int\_{\Theta\_i} f(\mathbf{x}|\theta) \pi(\theta) d\theta}{\int\_{\Theta} f(\mathbf{x}|\theta) \pi(\theta) d\theta} & \text{for } \theta \text{ continuous}, \end{cases} \tag{1.8}$$

and the posterior odds will be

$$\frac{\alpha\_1}{\alpha\_2} = \begin{cases} \frac{\sum\_{\theta \in \Theta\_1} f(\mathbf{x}|\theta) \Pr(\theta)}{\sum\_{\theta \in \Theta\_2} f(\mathbf{x}|\theta) \Pr(\theta)} & \text{for } \theta \text{ discrete} \\\\ \frac{\int\_{\Theta\_1} f(\mathbf{x}|\theta) \pi(\theta) d\theta}{\int\_{\Theta\_2} f(\mathbf{x}|\theta) \pi(\theta) d\theta} & \text{for } \theta \text{ continuous}. \end{cases} \tag{1.9}$$

Following (1.4), the Bayes factor can be reconstructed as follows:

$$\text{BF} = \begin{cases} \frac{\sum\_{\theta \in \Theta\_1} f(\mathbf{x}|\theta) \Pr(\theta)}{\sum\_{\theta \in \Theta\_2} f(\mathbf{x}|\theta) \Pr(\theta)} \bigg/ \frac{\pi\_1}{\pi\_2} & \text{for } \theta \text{ discrete} \\\\ \frac{\int\_{\Theta\_1} f(\mathbf{x}|\theta) \pi(\theta) d\theta}{\int\_{\Theta\_2} f(\mathbf{x}|\theta) \pi(\theta) d\theta} \bigg/ \frac{\pi\_1}{\pi\_2} & \text{for } \theta \text{ continuous}, \end{cases} \tag{1.10}$$

where the *πi* are computed as in (1.7). It is seen that the Bayes factor can no longer be expressed as a likelihood ratio as in the case of comparing simple versus simple hypotheses. We will show this for the case where *θ* is continuous.

Start with the prior probability density *π(θ )* on *Θ*, and divide it by the probability *πi* of the hypothesis *Hi* to obtain the restriction of the prior probability density *π(θ )* on *Θi*, that is,

$$
\pi\_{H\_i}(\theta) = \frac{\pi(\theta)}{\pi\_i} \quad \text{for } \theta \in \Theta\_i.
$$

The probability density *πHi(θ )* simply describes how the prior probability spreads over the hypothesis *Hi*. The prior probability density *π(θ )* can thus be rewritten in the following form:

$$\pi(\theta) = \begin{cases} \pi\_1 \pi\_{H\_1}(\theta) \text{ for } \theta \in \Theta\_1, \\\\ \pi\_2 \pi\_{H\_2}(\theta) \text{ for } \theta \in \Theta\_2. \end{cases}$$

Therefore, the posterior odds in (1.9) for the continuous case can be rewritten as

$$\frac{\alpha\_1}{\alpha\_2} = \frac{\pi\_1 \int\_{\Theta\_1} f(\mathbf{x} \mid \theta) \pi\_{H\_1}(\theta) d\theta}{\pi\_2 \int\_{\Theta\_2} f(\mathbf{x} \mid \theta) \pi\_{H\_2}(\theta) d\theta}. \tag{1.11}$$

Recalling (1.4), the Bayes factor in (1.10) will take the form of integrated likelihoods under the hypotheses of interest, that is,


$$\text{BF} = \frac{\int\_{\Theta\_1} f(\mathbf{x} \mid \theta) \pi\_{H\_1}(\theta) d\theta}{\int\_{\Theta\_2} f(\mathbf{x} \mid \theta) \pi\_{H\_2}(\theta) d\theta}. \tag{1.12}$$

The reader can verify that the two expressions in (1.3) and (1.12) are equivalent. Prior evaluations enter the Bayes factor through the weights *πH*<sup>1</sup> *(θ )* and *πH*<sup>2</sup> *(θ )*. The Bayes factor depends on how the prior mass is spread over the two hypotheses (Berger, 1985). It is also worth noting that whenever hypotheses are unidirectional (e.g., when comparing *H*<sup>1</sup> : *θ* ≤ *θ*<sup>0</sup> against *H*<sup>2</sup> : *θ > θ*<sup>0</sup>), the choice of a prior probability density *π(θ )* over *Θ* = *Θ*<sup>1</sup> ∪ *Θ*<sup>2</sup> (with *Θ*<sup>1</sup> = [0*, θ*<sup>0</sup>] and *Θ*<sup>2</sup> = *(θ*<sup>0</sup>*,* 1]) is equivalent to the expression of a prior probability for the competing hypotheses. Conversely, whenever hypotheses are bidirectional (e.g., when comparing *H*<sup>1</sup> : *θ* = *θ*<sup>0</sup> against *H*<sup>2</sup> : *θ* ≠ *θ*<sup>0</sup>), one cannot choose a continuous prior probability density *π(θ )* over the entire parameter space *Θ*, as this would amount to assigning probability 0 to the hypothesis *H*<sup>1</sup> : *θ* = *θ*<sup>0</sup>. The prior probability distribution over *θ* must, in this case, be a mixture of a discrete component that assigns a positive mass *π*<sup>1</sup> = Pr*(θ* = *θ*<sup>0</sup>*)* to *H*<sup>1</sup> and a continuous component that spreads the remaining mass *π*<sup>2</sup> = 1 − *π*<sup>1</sup> over *Θ*<sup>2</sup> according to the probability density *πH*<sup>2</sup> *(θ )*. The posterior probability *α*<sup>1</sup> can then be computed as in (1.8), where *Θ*<sup>1</sup> = {*θ*<sup>0</sup>},

$$\alpha\_1 = \Pr(H\_1 \mid \mathbf{x}) = \frac{\pi\_1 f(\mathbf{x} \mid \theta\_0)}{\pi\_1 f(\mathbf{x} \mid \theta\_0) + \pi\_2 \int\_{\Theta\_2} f(\mathbf{x} \mid \theta) \pi\_{H\_2}(\theta) d\theta}. \tag{1.13}$$

Analogously, the posterior probability *α*<sup>2</sup> may be computed, and the Bayes factor is

$$\text{BF} = \frac{f(\mathbf{x} \mid \theta\_0)}{\int\_{\Theta\_2} f(\mathbf{x} \mid \theta) \pi\_{H\_2}(\theta) d\theta}. \tag{1.14}$$

It can be observed that the Bayes factor in (1.14) does not depend on the prior probabilities of the competing hypotheses, which can vary considerably among recipients of expert information. Any such recipient can, starting from their own prior probabilities, use the Bayes factor to obtain posterior probabilities in a straightforward manner. Consider, for the sake of illustration, the posterior probability of hypothesis *H*<sup>1</sup> in (1.13). A simple manipulation allows one to obtain

$$\alpha\_1 = \left[1 + \frac{\pi\_2}{\pi\_1} \frac{1}{\text{BF}}\right]^{-1} = \frac{\text{BF}}{\text{BF} + \pi\_2/\pi\_1}.$$

In summary, the Bayes factor measures the change in the odds in favor of one hypothesis, as compared to a given alternative hypothesis, when going from the prior to the posterior distribution. This means that a Bayes factor larger than 1 indicates that the data support hypothesis *H*<sup>1</sup> compared to *H*2. However, the Bayes factor does not indicate whether *H*<sup>1</sup> is more *probable* than the opposing hypothesis *H*2; it only indicates that *H*<sup>1</sup> has become more probable than it was *before observing* the data (Lavine & Schervish, 1999).
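The equivalence of the two definitions can be checked numerically. The following R sketch uses a hypothetical one-sided comparison (binomial likelihood, Beta(2, 2) prior; all values illustrative) and verifies that the posterior-odds form (1.10) and the integrated-likelihood form (1.12) give the same Bayes factor:

```r
# Hypothetical one-sided comparison H1: theta > 0.5 vs H2: theta <= 0.5,
# with a binomial likelihood and a Beta(2, 2) prior over the whole parameter space
n <- 10; x <- 7
a <- 2; b <- 2
lik <- function(theta) dbinom(x, n, theta)
pi1 <- 1 - pbeta(0.5, a, b)                 # prior probability of H1, Eq. (1.7)
pi2 <- pbeta(0.5, a, b)                     # prior probability of H2
num <- integrate(function(t) lik(t) * dbeta(t, a, b), 0.5, 1)$value
den <- integrate(function(t) lik(t) * dbeta(t, a, b), 0, 0.5)$value
BF_odds <- (num / den) / (pi1 / pi2)        # posterior odds / prior odds, Eq. (1.10)
BF_marg <- (num / pi1) / (den / pi2)        # ratio of marginals, Eqs. (1.3)/(1.12)
c(BF_odds, BF_marg)                         # the two expressions coincide
```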

*Example 1.1 (Alcohol Concentration in Blood)* A person is stopped on suspicion of driving under the influence of alcohol. Blood taken from that person is submitted to a forensic laboratory to investigate whether the quantity of alcohol in blood *θ* is greater than a legal threshold of, say, 0*.*5 g/kg. Thus, the hypotheses of interest can be defined as *H*<sup>1</sup> : *θ >* 0*.*5 versus *H*<sup>2</sup> : *θ* ≤ 0*.*5. Suppose that a prior probability density *π(θ )* is given for *θ* and that the prior probabilities of *H*<sup>1</sup> and *H*<sup>2</sup> in (1.7) are *π*<sup>1</sup> = 0*.*05 and *π*<sup>2</sup> = 0*.*95, corresponding to prior odds approximately equal to 0.0526. These values suggest that, based on the circumstances, and before considering the results of blood analyses, the hypothesis *H*<sup>1</sup> is believed to be much less probable than the alternative hypothesis. Suppose next that the posterior probabilities, after taking into account laboratory measurements, are computed as in (1.8). The results are *α*<sup>1</sup> = 0*.*24 and *α*<sup>2</sup> = 0*.*76, so the posterior odds are approximately equal to 0*.*3158. The ratio of the posterior odds to the prior odds leads to a BF equal to 6. This result represents limited evidence in support of the hypothesis that the alcohol level in blood is greater than the legal threshold, compared to the alternative hypothesis. Still, the posterior probability of hypothesis *H*<sup>1</sup> is low: the BF only renders the hypothesis *H*<sup>1</sup> slightly more probable than it was before observing the measurements made in the laboratory. This example will be further developed in Chap. 2.
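The arithmetic of Example 1.1 can be reproduced in a few lines of R:

```r
# Prior and posterior probabilities from Example 1.1
pi1 <- 0.05; pi2 <- 0.95         # prior probabilities of H1 and H2
alpha1 <- 0.24; alpha2 <- 0.76   # posterior probabilities after the analyses
prior_odds <- pi1 / pi2          # approx. 0.0526
post_odds  <- alpha1 / alpha2    # approx. 0.3158
BF <- post_odds / prior_odds     # Eq. (1.4): equals 6
BF
```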

#### **1.5 Bayes Factor in the Evaluative Setting**

Consider the general situation where evidentiary material is collected and control items from a person or object of interest are available for comparative purposes. The following measurements of a particular characteristic are available: measurements *y* on a questioned item (e.g., a glass fragment found on the clothing of a person of interest) and measurements *x* on a control item (e.g., fragments from a broken window). In this evaluative setting, so-called source level propositions<sup>10</sup> could be defined as follows:
- *H*<sup>1</sup>: The questioned item and the control item come from the same source.
- *H*<sup>2</sup>: The questioned item and the control item come from different sources.

<sup>10</sup> The notion of *source level* refers to a given level in a hierarchy of hypotheses. This view considers a classification (i.e., hierarchy) of propositions into three main categories or levels, called the source level, activity level, and crime level. See Cook et al. (1998) for a discussion. Note that source level propositions for the example of glass fragments are chosen here as a formative example and for illustrative purposes. As a type of transfer evidence, glass fragments should be evaluated using activity level propositions (Willis et al., 2015).


This setting is called evaluative because it involves the comparison between control and recovered items and the use of the results of this comparison for discriminating between the competing propositions. Models for comparison can either be *feature-based* or *score-based*. Feature-based models (Sect. 1.5.1) focus on the probability of measurements made directly on evidentiary and reference items. Conversely, score-based models (Sect. 1.5.2) focus on the probability of observing a pairwise similarity (or distance), i.e., score, between compared materials.

#### *1.5.1 Feature-Based Models*

If one assumes that *y* and *x* are realizations of random variables *Y* and *X* with a given probability distribution *f (*·*)*, the Bayes factor is

$$\text{BF} = \frac{f(\mathbf{y}, \mathbf{x} \mid H\_1, I)}{f(\mathbf{y}, \mathbf{x} \mid H\_2, I)},\tag{1.15}$$

where *I* represents the available background information. Application of the rules of conditional probability allows one to rewrite the Bayes factor as follows:

$$\text{BF} = \frac{f(\mathbf{y} \mid \mathbf{x}, H\_1, I)}{f(\mathbf{y} \mid \mathbf{x}, H\_2, I)} \times \frac{f(\mathbf{x} \mid H\_1, I)}{f(\mathbf{x} \mid H\_2, I)}.$$

This expression can be further simplified by considering the fact that (i) the distribution of measurements *x* on the control item does not depend on whether *H*<sup>1</sup> or *H*<sup>2</sup> is true (and hence *f (x* | *H*1*,I)* = *f (x* | *H*2*,I)* holds) and (ii) the distribution of the measurement *y* on the questioned item does not depend on the measurement *x* on the control item if *H*<sup>2</sup> is true,<sup>11</sup> so that *f (y* | *x,H*2*,I)* = *f (y* | *H*2*,I)*. The Bayes factor can therefore be written as

$$\text{BF} = \frac{f(\mathbf{y} \mid \mathbf{x}, H\_1, I)}{f(\mathbf{y} \mid H\_2, I)}. \tag{1.16}$$

<sup>11</sup> Note that this assumption of independence is not always valid, e.g., with DNA evidence (Balding & Nichols, 1994; Aitken et al., 2021). A further example is the case of questioned signatures. Under the proposition that a signature has been forged and therefore is not authentic, one should take into account that a forger will attempt to reproduce the features of a target signature. Thus, recovered and control measurements cannot be considered independent (Linden et al., 2021); see Sect. 3.4.3.

The numerator is the probability of observing the measurements *y* on the recovered item under the assumption that it comes from the known source, given the information *I* and knowledge of *x*, the features of the known source. The denominator is the probability of observing the measurements *y* on the recovered item, assuming that it comes from an unknown source, usually selected at random from a relevant population,<sup>12</sup> again given the relevant information *I*. Note that, for the sake of simplicity, the conditioning information *I* will be omitted in the arguments hereafter.

For many types of forensic evidence, it can be reasonable to assume a parametric model {*f (*· | *θ ), θ* ∈ *Θ*}. In this way, the probability distribution characterizing the available data is of a known form, with the only unknown element being the parameter *θ*, which may vary between sources. Consider, for example, the probability distribution *f (*· | *θ )* with unknown parameter *θ* = *θy* for the measurements *y* on the recovered item and the same probability distribution with unknown parameter *θ* = *θx* for the measurements *x* on the control item. In practice, the parameter *θ* is unknown, and a prior probability distribution *π(θ* | *Hi)*, representing personal beliefs about *θ* under each hypothesis *Hi*, is introduced. The marginal distribution *f (y* | *x,H*1*)* in the numerator of (1.16) may be rewritten as follows:

$$f(\mathbf{y} \mid \mathbf{x}, H\_{\mathbf{l}}) = \int f(\mathbf{y} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta} \mid \mathbf{x}, H\_{\mathbf{l}}) d\boldsymbol{\theta}$$

$$= \int f(\mathbf{y} \mid \boldsymbol{\theta}) f(\mathbf{x} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta} \mid H\_{\mathbf{l}}) d\boldsymbol{\theta} / f(\mathbf{x} \mid H\_{\mathbf{l}}), \qquad \text{(1.17)}$$

where the posterior density *π(θ* | *x,H*1*)* in the first line is rewritten in extended form using Bayes theorem. The distribution *f (y* | *x,H*1*)* is also called a *posterior predictive* distribution.<sup>13</sup>

The marginal distribution *f (y* | *H*2*)* in the denominator of (1.16) can be rewritten as follows:

$$f(\mathbf{y} \mid H\_2) = \int f(\mathbf{y} \mid \theta) \pi(\theta \mid H\_2) d\theta. \tag{1.18}$$

This is also called a predictive distribution.
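As a sanity check on (1.18), the following R sketch evaluates the predictive density by numerical integration for a hypothetical Normal model with known variance (anticipating Example 1.2; all numerical values illustrative) and compares it with the closed form given later in (1.22):

```r
# Hypothetical Normal observation y with known variance sigma^2,
# and a N(mu, tau^2) prior on the unknown mean theta under H2
y <- 1.2
sigma <- 0.5; mu <- 0; tau <- 1
# Predictive density f(y | H2) by numerical integration, Eq. (1.18)
f_y_H2 <- integrate(function(theta) dnorm(y, theta, sigma) * dnorm(theta, mu, tau),
                    lower = -Inf, upper = Inf)$value
# Closed form: N(y | mu, tau^2 + sigma^2)
f_closed <- dnorm(y, mu, sqrt(tau^2 + sigma^2))
c(f_y_H2, f_closed)   # the two values agree
```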

<sup>12</sup> Note that rules of conditional probability do not specify on which variable we should condition. Champod et al. (2004) suggest that we should condition on the item with greater information content. Therefore we usually condition on the control item (e.g., in the case of DNA, traces can be degraded or of small quantity, while a complete profile can usually be obtained for a person of interest). For further comments, see also Aitken et al. (2021, pp. 619–627).

<sup>13</sup> For a discussion of posterior predictive distributions in forensic science contexts, see, e.g., Biedermann et al. (2015).

*Example 1.2 (Toner on Printed Documents)* Suppose experimental findings are available in the form of measurements of magnetism of toner on printed documents of known origin *(x)* and questioned origin *(y)*, for which a Normal distribution is considered suitable. Thus, *X* ∼ N*(θx , σ*<sup>2</sup>*)* and *Y* ∼ N*(θy , σ*<sup>2</sup>*)*, where the variance *σ*<sup>2</sup> of both distributions is assumed known and equal (Biedermann et al., 2016a). A Normal distribution with mean *μ* and variance *τ* <sup>2</sup> is taken to model our prior uncertainty about the means *θx* and *θy* , that is, *θ* ∼ N*(μ, τ* <sup>2</sup>*)* for *θ* ∈ {*θx , θy* }. The integrals in (1.17) and (1.18) have an analytical solution, and the marginals can be obtained in closed form. See Aitken et al. (2021, pp. 815–817) for more details.

Here, *H*<sup>1</sup> and *H*<sup>2</sup> denote the propositions according to which the items of toner come from, respectively, the same and different printing machines. Consider, first, the numerator of the BF in (1.17), where the posterior density *π(θ* | *x,H*1*)* is still a Normal distribution, with mean *μx* and variance *τ* <sup>2</sup> *<sup>x</sup>* computed according to well-known updating rules (see, e.g., Lee, 2012),

$$
\mu\_x = \frac{\sigma^2}{\sigma^2 + \tau^2} \mu + \frac{\tau^2}{\sigma^2 + \tau^2} x \tag{1.19}
$$

and

$$
\tau\_x^2 = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}. \tag{1.20}
$$

The posterior mean, *μx* , is a weighted average of the prior mean *μ* and the observation *x*. The weights are given by the population variance *σ*<sup>2</sup> and the variance *τ* <sup>2</sup> of the prior probability distribution, respectively, such that the component (observation or prior mean) which has the smaller variance has the greater contribution to the posterior mean. This result can be generalized to consider the distribution of the mean of a set of *n* observations *x*1*,...,xn* from the same Normal distribution (see Sect. 2.3.1).

The marginal or posterior predictive distribution *f (y* | *x,H*1*)* is also a Normal distribution with mean equal to the posterior mean *μx* and variance equal to the sum of the posterior variance *τ* <sup>2</sup> *<sup>x</sup>* and the population variance *σ*2, that is,

$$(Y \mid x, H\_1) \sim \mathcal{N}(\mu\_x, \tau\_x^2 + \sigma^2). \tag{1.21}$$

The same arguments apply to the marginal or predictive distribution *f (y* | *H*2*)* in the denominator, which is a Normal distribution with mean equal to the prior mean *μ* and variance equal to the sum of the prior variance *τ* <sup>2</sup> and the population variance *σ*2, that is,


$$(Y \mid H\_2) \sim \mathcal{N}(\mu, \tau^2 + \sigma^2). \tag{1.22}$$

The Bayes factor can then be obtained as follows:

$$\text{BF} = \frac{\mathcal{N}(y \mid \mu\_x, \tau\_x^2 + \sigma^2)}{\mathcal{N}(y \mid \mu, \tau^2 + \sigma^2)} = \frac{(\tau\_x^2 + \sigma^2)^{-1/2} \exp\left(-\frac{1}{2}\frac{(y - \mu\_x)^2}{\tau\_x^2 + \sigma^2}\right)}{(\tau^2 + \sigma^2)^{-1/2} \exp\left(-\frac{1}{2}\frac{(y - \mu)^2}{\tau^2 + \sigma^2}\right)}.$$

Note that this can be easily extended to cases with multiple measurements *y* = *(y*1*,...,yn)* (see Sect. 3.3.1).
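A minimal R sketch of this computation, with all measurement and prior values chosen hypothetically for illustration:

```r
# Hypothetical measurements of toner magnetism (illustrative values only)
x <- 1.0; y <- 1.1           # control and questioned measurements
sigma <- 0.2                 # known within-source standard deviation
mu <- 0.8; tau <- 0.4        # prior mean and standard deviation for theta
# Posterior parameters under H1, Eqs. (1.19)-(1.20)
mu_x   <- sigma^2 / (sigma^2 + tau^2) * mu + tau^2 / (sigma^2 + tau^2) * x
tau_x2 <- sigma^2 * tau^2 / (sigma^2 + tau^2)
# BF as the ratio of the two predictive densities, Eqs. (1.21)-(1.22)
BF <- dnorm(y, mu_x, sqrt(tau_x2 + sigma^2)) / dnorm(y, mu, sqrt(tau^2 + sigma^2))
BF
```

With these values, *y* lies close to the posterior mean under *H*1, so the BF exceeds 1, supporting the same-source proposition.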

Note that the value of the measurements *y* and *x* may be expressed as a ratio of the marginal likelihoods in (1.17) and (1.18), that is,

$$\text{BF} = \frac{\int f(\mathbf{y} \mid \theta) f(\mathbf{x} \mid \theta) \pi(\theta \mid H\_1) d\theta}{f(\mathbf{x} \mid H\_1)} \times \frac{1}{f(\mathbf{y} \mid H\_2)}$$

$$= \frac{\int f(\mathbf{y} \mid \theta) f(\mathbf{x} \mid \theta) \pi(\theta \mid H\_1) d\theta}{\int f(\mathbf{x} \mid \theta) \pi(\theta \mid H\_2) d\theta \int f(\mathbf{y} \mid \theta) \pi(\theta \mid H\_2) d\theta},\tag{1.23}$$

as *f (x* | *H*1*)* = *f (x* | *H*2*)*. If the recovered item and the control item come from the same source (i.e., hypothesis *H*<sup>1</sup> holds), then *θy* = *θx* ; otherwise (i.e., hypothesis *H*<sup>2</sup> holds), *θy* ≠ *θx* . If *H*<sup>2</sup> is true and hence the examined items come from different sources, the measurements can be considered independent. Note, however, that this is not necessarily the case. There are instances where the assumption of independence among measurements on control and recovered material under *H*<sup>2</sup> does not hold, and the BF will not simplify as in (1.23). See Linden et al. (2021) for a discussion of this issue in the context of questioned signatures.

The expression of the Bayes factor in (1.23) involves prior assessments about the unknown parameter *θ*, in terms of *π(θ* | *Hi)*, as well as the likelihood function *f (*· | *θ )*. Thus, the Bayes factor cannot generally be regarded as a measure of the relative support to competing propositions provided by the data alone.

#### *1.5.2 Score-Based Models*

For some types of forensic evidence, the specification of a probability model for available data may be difficult. This is the case, for example, when the measurements are obtained using high-dimensional quantification techniques, e.g., for fingermarks or toolmarks (using complex sets of variables), in speaker recognition, or for traces such as glass, drugs or toxic substances that may be described by several chemical components. In such applications, a *feature-based* Bayes factor (Sect. 1.5.1) may not be feasible, and a *score-based* approach may represent a practicable (or even the only) available alternative. Broadly speaking, a score is a metric that summarizes the result of a forensic comparison of two items or traces, in terms of a single variable, representing a measure of similarity or difference (e.g., distance). Various distance measures can be used, such as *Euclidean* or *Manhattan* distance, see, e.g., Bolck et al. (2015).<sup>14</sup> One of the first proposals of score-based approaches in forensic science was presented in the context of forensic speaker recognition by Meuwly (2001).

Let *Δ(*·*)* denote the function which assesses the degree of similarity between feature vectors *x* and *y*. The *similarity score Δ(x, y)* represents the evidence for which a Bayes factor is to be computed. The introduction of a score function for quantifying the similarities/dissimilarities between compared items allows one to reduce the dimensionality of the problem, while retaining the discriminative information as much as possible. For a score given by a distance, for example, one will expect a value close to zero if the features *x* and *y* relate to items from the same source. Vice versa, if the features *x* and *y* relate to items from different sources, one will expect a larger score, provided that there are differences between members in a population. The score-based Bayes factor (sBF) is

$$\text{sBF} = \frac{\text{g}(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_1, I)}{\text{g}(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_2, I)},\tag{1.24}$$

where *g(*·*)* denotes the probability distribution associated with *Δ(X, Y )*. For the sake of simplicity, the conditioning information *I* will be omitted hereafter.

For the Bayes factor in (1.24), one cannot reproduce the simplified expression that was derived in (1.16) for the feature-based Bayes factor. The score-based Bayes factor must be computed as the ratio of two probability density functions evaluated at the evidence score *Δ(x, y)*, given the competing propositions *H*<sup>1</sup> and *H*2. Since these two distributions are not generally available by default, the forensic examiner will typically try to derive a sBF using sampling distributions based on many scores produced under each of the two competing propositions. One way to compute the density of the score *Δ(x, y)* in the numerator is to generate many scores for comparisons between the known features *x* and the features *y* of other items known to come from the potential source assumed under *H*1. The numerator can therefore be written as *ĝ(Δ(x, y)* | *x,H*1*)*, where *ĝ(*·*)* indicates that the distribution is constructed on the basis of relevant data (scores) produced for the case of interest.

<sup>14</sup> The score can also be interpreted as the inner product of two vectors (Neumann & Ausdemore, 2020).

In the denominator, it is assumed that the proposition *H*<sup>2</sup> is true, and *x* and *y* denote features of items that come from different sources. The challenge for the forensic examiner is that of selecting the most appropriate data for obtaining the distribution in the denominator. Note that there are different ways to address this question because, depending on the case at hand, it might be appropriate to condition on (i) the known source (i.e., pursuing a so-called *source-anchored* approach), (ii) the trace (i.e., *trace-anchored* approach), or (iii) none of these (i.e., *non-anchored* approach). This amounts to evaluating the score using the probability density distribution that is obtained by producing scores for comparisons between (i) the features *x* of the control item from the known source and features of items taken from randomly selected sources of the relevant population, (ii) the features *y* of the trace item and features of items taken from sources selected randomly in the relevant population, (iii) features of pairs of items taken from sources selected randomly in the relevant population (i.e., without using *x* and *y*). Formally, this amounts to defining the distribution in the denominator as follows:

$$\begin{array}{ll} \text{(i)} & \hat{\mathfrak{g}}(\varDelta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}, H\_{2}), \\ \text{(ii)} & \hat{\mathfrak{g}}(\varDelta(\mathbf{x}, \mathbf{y}) \mid \mathbf{y}, H\_{2}), \\ \text{(iii)} & \hat{\mathfrak{g}}(\varDelta(\mathbf{x}, \mathbf{y}) \mid H\_{2}). \end{array}$$

See, e.g., Hepler et al. (2012) for a discussion of this topic.

*Example 1.3 (Image Comparison)* Consider a hypothetical case where the face of an individual is captured by surveillance cameras during the commission of a crime. Available screenshots are compared with the reference image(s) of a person of interest. For image comparison purposes, the evidence to be considered is a score given by the distance between the feature vectors *x* of the known reference and the evidential recording *y* (see Jacquet and Champod (2020) for a review). Consider the following competing propositions. *H*1: The person of interest is the individual shown in the images of the surveillance camera, versus *H*2: An unknown person is depicted in the image of the surveillance camera. To help specify the probability distribution of the score in the numerator, one can take several pairs of images from the person of interest to serve as pairs of questioned and reference items. To inform the probability distribution for the score in the denominator, conditioning on the reference item *x* (i.e., the images depicting the person of interest) can be justified as it may contain information that is relevant to the case and may be helpful for generating scores (Jacquet & Champod, 2020; Hepler et al., 2012). The distribution in the denominator can thus be computed using a *source-anchored* approach as in (i). The sBF can therefore be obtained as


$$\text{sBF} = \frac{\hat{\mathfrak{g}}(\varDelta(x, y) \mid x, H\_1)}{\hat{\mathfrak{g}}(\varDelta(x, y) \mid x, H\_2)}.$$
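As a rough illustration, a source-anchored sBF of this kind can be approximated with kernel density estimates fitted to simulated scores. The following is a minimal sketch in R; the score values and their distributions are entirely hypothetical and only stand in for scores computed in an actual case.

```r
# Sketch of a source-anchored score-based Bayes factor (sBF).
# 'same_source_scores' stands for scores from comparisons anchored on x under H1,
# 'diff_source_scores' for scores anchored on x under H2; here both are simulated.
set.seed(123)
same_source_scores <- rnorm(1000, mean = 0.2, sd = 0.1)  # hypothetical H1 scores
diff_source_scores <- rnorm(1000, mean = 0.8, sd = 0.2)  # hypothetical H2 scores

observed_score <- 0.3   # the score Delta(x, y) computed for the case at hand

# Kernel density estimates of g(. | x, H1) and g(. | x, H2)
g1 <- approxfun(density(same_source_scores))
g2 <- approxfun(density(diff_source_scores))

sBF <- g1(observed_score) / g2(observed_score)
sBF
```

In practice, the choice of the kernel bandwidth and the amount of score data will affect the estimate, so the stability of the resulting sBF should always be checked.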

In other types of forensic cases, conditioning on *y* in the denominator, case (ii), may be more appropriate. This represents an asymmetric approach to defining the distribution in the numerator and in the denominator.

*Example 1.4 (Handwritten Documents)* Consider a case involving handwriting on a questioned document. Handwriting features *y* on the questioned document are compared to the handwriting features *x* of a person of interest. The similarities and differences between *x* and *y* are measured by a suitable metric (score). To inform the probability distribution of the scores in the numerator, one can take several draws of pairs of handwritten characters originating from the known source to serve as recovered and control items and obtain scores from the selected draws. Under *H*2, consideration of *x* is not relevant for the assessment. Note that here *H*2 is the proposition according to which the person of interest is not the source of the handwriting on the questioned document, but someone else from the relevant population. It would then seem reasonable to construct the distribution for the denominator by comparing the features *y* of the questioned document with features *x* from items of handwriting of persons randomly selected from the relevant population of potential writers. This amounts to a *trace-anchored* approach, as in situation (ii) defined above. In fact, for handwriting, approach (i) would amount to discarding relevant information related to the questioned document. The sBF can therefore be obtained as

$$\text{sBF} = \frac{\hat{\mathbf{g}}(\Delta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}, H\_1)}{\hat{\mathbf{g}}(\Delta(\mathbf{x}, \mathbf{y}) \mid \mathbf{y}, H\_2)}.$$

In yet other cases, the distribution in the denominator may be obtained by comparing pairs of items drawn randomly from the relevant population, without conditioning on either *x* or *y*. In such cases, the alternative proposition *H*<sup>2</sup> is that the two compared items come from different sources.

*Example 1.5 (Firearm Examination)* Consider a case in which a bullet is found at a crime scene and a person carrying a gun is arrested. The extent of agreement between marks left by firearms on bullets can be summarized by one or more scores. A simple example of a score is the number of consecutive matching striations. To inform the distribution in the numerator, the scientist fires multiple bullets using the seized firearm. To inform the distribution in the denominator, the scientist fires and compares many bullets known to come from different guns (i.e., of different relevant models). The distribution in the denominator can thus be computed using a *non-anchored* approach. The sBF can therefore be obtained as

$$\text{sBF} = \frac{\hat{\mathbf{g}}(\varDelta(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}, H\_1)}{\hat{\mathbf{g}}(\varDelta(\mathbf{x}, \mathbf{y}) \mid H\_2)}.$$

Note that this is a coarse approach in the sense that no consideration is given to general manufacturing features. Indeed, the amount and quality of striations on a bullet may depend on aspects such as the caliber and the composition (e.g., jacketed vs. non-jacketed bullets), hence conditioning on *y* may be considered.

Another example for a non-anchored approach, in the context of fingermark comparison, can be found in Leegwater et al. (2017). An example will be presented in Sect. 3.3.4.

Note that the above considerations refer to so-called *specific-source* cases. In such cases, recovered material is compared to material from a known source. However, there are also other situations where the competing propositions are as follows:

*H*1: The recovered and the control material originate from the *same* source.

*H*2: The recovered and the control material originate from *different* sources.

For such *common-source* propositions, the sampling distributions under the competing propositions can be learned, under *H*1, from many scores for known same-source pairs (with each pair drawn from a distinct source) and, under *H*2, from many scores for pairs known to come from different sources. The score-based BF in this case will account for the occurrence of the observed score under the competing propositions, but it does not account for the rarity of the characteristics of the trace.

While a score-based approach has the potential to reduce the dimensionality of the problem, the use of scores implies a loss of information, because the features *y* and *x* are replaced by a single score. Therefore, a trade-off must be found between the complexity of the original configuration of features and the performance of the score metric, the choice of which requires justification.

For a critical discussion of score-based evaluative metrics, see Neumann (2020) and Neumann and Ausdemore (2020). See also Bolck et al. (2015) for a discussion of feature- and score-based approaches for multivariate data.

#### **1.6 Bayes Factor in the Investigative Setting**

While the use of the Bayes factor for evaluative purposes is rather well established in both theory and practice, the focus on investigative settings still offers much room for original developments. In many forensic settings, especially in the early stages of an investigation, it may be that no potential source is available for comparison. In such situations, it will not be possible to compare characteristics observed on recovered and reference materials, as would be the case in an evaluative setting (Sect. 1.5). Nevertheless, one can derive valuable information from the recovered material alone. Consider, for example, two populations, denoted *p*1 and *p*2, and the following two propositions:

*H*1: The recovered item comes from population *p*1 (e.g., a population of females).

*H*2: The recovered item comes from population *p*2 (e.g., a population of males).

Denote by *y* the measurements on the recovered material, which is known to belong to one of the two populations specified by the competing propositions, though it is not known which one. For such a situation, the Bayes factor measures the change produced by the measurements *y* on the recovered item in the odds in favor of *H*1, as compared to *H*2, when going from the prior to the posterior distribution.

Assume that a parametric statistical model {*f(*· | *θ), θ* ∈ *Θ*} is suitable for the data at hand. The problem of discriminating between two populations can then be treated as a problem of comparing statistical hypotheses, assuming that the probability distribution for the measurements on the recovered material (under each hypothesis) is of a given form. Consider, first, the situation where the parameters characterizing the two populations are known, that is, *θ* = *θ*1 if the recovered item comes from population *p*1 and *θ* = *θ*2 if the recovered item comes from population *p*2. Formally, this amounts to specifying the probability distributions *f(y* | *θ*1*)* and *f(y* | *θ*2*)*, respectively. The posterior probability of the competing propositions can be computed as in (1.5), and the Bayes factor simplifies to a ratio of likelihoods as in (1.6).

*Example 1.6 (Fingermark Examination)* Consider a case involving a single fingermark of unknown source. The fingerprint examiner seeks to help answer the question of whether the mark comes from a man or a woman. Thus, for investigative purposes, the following two propositions are of interest:


*H*1: The fingermark comes from a man.

*H*2: The fingermark comes from a woman.

A type of data that can be acquired from fingermarks is ridge width, summarized in terms of the ridge count per surface in mm². See, for example, Appendix A of Champod et al. (2016) for a summary of different data collections. Consider ridge density, which was found to vary as a function of sex (i.e., women tend to have narrower ridges than men), but also between populations. Suppose that normality represents a reasonable assumption for ridge density, so that the probability distribution for the available measurements can be taken to be Normal, N(*θi*, *σi*<sup>2</sup>), with the unknown mean *θ* being equal to *θi* and the variance *σ*<sup>2</sup> being equal to *σi*<sup>2</sup> if *Hi* is true. Given *H*1, the measurements *y* thus have a probability distribution N(*θ*1, *σ*1<sup>2</sup>), and given *H*2, a probability distribution N(*θ*2, *σ*2<sup>2</sup>).

The posterior probability of the competing propositions can be computed as in (1.5), and the Bayes factor simplifies to a likelihood ratio as in (1.6), that is,

$$\mathbf{BF} = \frac{\mathbf{N}(\mathbf{y} \mid \theta\_1, \sigma\_1^2)}{\mathbf{N}(\mathbf{y} \mid \theta\_2, \sigma\_2^2)}.$$

Generally, however, the parameters, or some of them, characterizing the two distributions are unknown, and prior probability distributions are introduced to model this uncertainty. As a consequence, the Bayes factor will also depend on prior assumptions and will not simplify to a likelihood ratio. Consider the case where the parameters *θi* are continuous and take values in the parameter space *Θi*. A prior distribution *π(θi* | *pi)* must be specified, with *θi* ∈ *Θi* and *pi* representing the population of interest. A marginal distribution for each population *pi* can be computed as in (1.2),

$$f\_{H\_i}(\mathbf{y}) = \int\_{\Theta\_i} f(\mathbf{y} \mid \theta\_i) \pi(\theta\_i \mid p\_i) d\theta\_i \tag{1.25}$$

and the Bayes factor will take the form of a ratio of marginal likelihoods as in (1.3), that is,

$$\text{BF} = \frac{f\_{H\_1}(\mathbf{y})}{f\_{H\_2}(\mathbf{y})}.\tag{1.26}$$

*Example 1.7 (Fingermark Examination—Continued)* Recall Example 1.6, where a Normal probability distribution was assumed for the measured ridge density on a fingermark, with known variance *σi*<sup>2</sup>. A conjugate prior distribution may be introduced for the population mean *θi*, say *θi* ∼ N(*μi*, *τi*<sup>2</sup>). The marginal likelihoods are still Normal, with mean equal to the prior mean *μi* and variance equal to the sum of the prior variance *τi*<sup>2</sup> and the population variance *σi*<sup>2</sup>. The Bayes factor therefore is

$$\mathbf{BF} = \frac{\mathbf{N}(\mathbf{y} \mid \mu\_1, \tau\_1^2 + \sigma\_1^2)}{\mathbf{N}(\mathbf{y} \mid \mu\_2, \tau\_2^2 + \sigma\_2^2)}.$$
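As a quick numerical sketch, this Bayes factor can be evaluated in R with `dnorm()`; all numerical values below (the measurement, prior means, and variances) are hypothetical.

```r
# Bayes factor for Example 1.7: Normal likelihood with known variance and a
# conjugate Normal prior on the mean. Numerical values are illustrative only.
y <- 12.5                            # hypothetical ridge density measurement
mu1 <- 14; tau1 <- 1; sigma1 <- 1.5  # prior mean/sd and population sd under H1
mu2 <- 11; tau2 <- 1; sigma2 <- 1.2  # prior mean/sd and population sd under H2

# Marginal likelihoods: Normal with mean mu_i and variance tau_i^2 + sigma_i^2
BF <- dnorm(y, mu1, sqrt(tau1^2 + sigma1^2)) /
      dnorm(y, mu2, sqrt(tau2^2 + sigma2^2))
BF
```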

The same idea can be extended to the case where both the mean and the variance are unknown. This will be addressed in Sect. 4.3.2.

The Bayes factor thus depends on the prior *assumptions* about parameters characterizing each population. This must not be confused, as noted earlier, with prior probabilities for competing propositions. The latter will form the prior *odds* which will be multiplied by the Bayes factor to compute the posterior odds

$$\frac{\Pr(H\_1 \mid \mathbf{y})}{\Pr(H\_2 \mid \mathbf{y})} = \frac{f\_{H\_1}(\mathbf{y})}{f\_{H\_2}(\mathbf{y})} \times \frac{\Pr(H\_1)}{\Pr(H\_2)}.$$

The Bayesian approach for discriminating between two propositions regarding population membership can easily be generalized to the case of any number *k* (> 2) of competing, mutually exclusive propositions. Let *H*1*,...,Hk* denote the *k* propositions and denote by *y* the observation to be evaluated. The propositions of interest can be defined as follows:

*Hi*: The recovered item comes from population *pi* (*i* = 1*,...,k*).
*Example 1.8 (Questioned Documents)* Consider a case involving questioned documents where the issue of interest is which of three printing machines has been used to print a questioned document. Propositions of interest are:

*H*1: The questioned documents have been printed with printer 1.

*H*2: The questioned documents have been printed with printer 2.

*H*3: The questioned documents have been printed with printer 3.

After having specified a Bayesian statistical model for each proposition (i.e., a probability distribution for the available measurements and a prior distribution for the unknown parameters), the marginal likelihoods *fHi(y)*, *i* = 1, 2, 3, characterizing propositions *H*1, *H*2, and *H*3, can be obtained as in (1.25).

Occasionally, cases involve multiple propositions. Imagine a case involving DNA findings, such as bloodstains recovered at a crime scene, with the reported profile being compared against the profile of a person of interest. The defense argues that the bloodstain does not come from the person of interest but from either a relative (e.g., a brother) or an unknown person. A question that may arise in such a case is how to evaluate and report results, because the Bayes factor involves pairwise comparisons. One option is to report only the marginal likelihoods *fHi(y)*, even if they may not be easy to interpret. Alternatively, one may report a scaled version *f*∗*Hi(y)*, as suggested by Berger and Pericchi (2015), that is,

$$f\_{H\_i}^\*(\mathbf{y}) = \frac{f\_{H\_i}(\mathbf{y})}{\sum\_{j=1}^k f\_{H\_j}(\mathbf{y})}.\tag{1.27}$$

This expression will be much easier to interpret, because the scaled likelihoods *f*∗*Hi(y)* sum to 1. Generally, prior probabilities Pr*(Hi)* may vary between recipients of such reports, but the posterior probability can be easily computed as

$$\Pr(H\_i \mid \mathbf{y}) = \frac{\Pr(H\_i) f\_{H\_i}^\*(\mathbf{y})}{\sum\_{j=1}^k \Pr(H\_j) f\_{H\_j}^\*(\mathbf{y})}, \qquad \qquad i = 1, \dots, k,$$

followed, if required, by classification of the recovered material into the population with the highest posterior probability. Note that reporting the scaled version in (1.27) is equivalent to assuming equal prior probabilities for the competing propositions. In fact, if Pr*(Hi)* = 1*/k*, *i* = 1*,...,k*, then it can easily be shown that

$$\Pr(H\_i \mid \mathbf{y}) = \frac{f\_{H\_i}^\*(\mathbf{y})}{\sum\_{j=1}^k f\_{H\_j}^\*(\mathbf{y})} = f\_{H\_i}^\*(\mathbf{y}), \qquad i = 1, \dots, k,$$

as $\sum\_{j=1}^{k} f\_{H\_j}^\*(\mathbf{y}) = 1$.
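The scaled likelihoods in (1.27) and the resulting posterior probabilities take only a few lines of R; the marginal likelihood values and prior probabilities below are hypothetical.

```r
# Scaled marginal likelihoods (Eq. 1.27) and posterior probabilities for k = 3
# competing propositions. All numerical inputs are hypothetical.
f_y <- c(0.031, 0.012, 0.002)      # f_{H_i}(y), i = 1, 2, 3
f_star <- f_y / sum(f_y)           # scaled likelihoods, which sum to 1

prior <- c(0.5, 0.3, 0.2)          # Pr(H_i), may vary between recipients
posterior <- prior * f_star / sum(prior * f_star)

round(f_star, 3)
round(posterior, 3)

# With equal priors the posterior equals the scaled likelihood:
equal_prior_post <- (f_star / 3) / sum(f_star / 3)
all.equal(equal_prior_post, f_star)
```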

The analyst may also consider summarizing several propositions into one, in order to produce a comparison between two propositions regarding population membership; one of these propositions will be composite. Let *H̄*1 = *H*2 ∪ ··· ∪ *Hk*. Starting from *k* possible populations from which the recovered material may come, a pair of competing propositions of interest may thus be formulated as follows:

*H*1: The recovered item comes from population 1 (*p*1).

*H̄*1: The recovered item comes from one of the other populations (*p*2*,...,pk*).

The marginal likelihood *fH*1*(y)* characterizing proposition *H*1 is obtained as in (1.25), while the marginal likelihood under *H̄*1 is

$$f\_{\bar{H}\_1}(\mathbf{y}) = \sum\_{i=2}^{k} \Pr(p\_i) \int\_{\Theta\_i} f(\mathbf{y} \mid \theta\_i) \pi(\theta\_i \mid p\_i) d\theta\_i,$$

with $\sum\_{i=1}^{k} \Pr(p\_i) = 1$. The Bayes factor expressing the value of *y* for comparing *H*1 against *H̄*1 becomes

$$\text{BF} = \frac{f\_{H\_1}(\mathbf{y}) \sum\_{i=2}^{k} \Pr(p\_i)}{f\_{\bar{H}\_1}(\mathbf{y})}. \tag{1.28}$$

The posterior odds become

$$\frac{\Pr(H\_1 \mid \mathbf{y})}{\Pr(\bar{H}\_1 \mid \mathbf{y})} = \frac{f\_{H\_1}(\mathbf{y}) \Pr(p\_1)}{f\_{\bar{H}\_1}(\mathbf{y})},$$

(Aitken et al., 2021, p. 643).

*Example 1.9 (Questioned Documents—Continued)* Consider the following propositions:

*H*1: The questioned documents have been printed with printer 1.

*H̄*1: The questioned documents have been printed with printer 2 or with printer 3.

The marginal likelihood characterizing proposition *H*1 is

$$f\_{H\_1}(\mathbf{y}) = \int\_{\Theta\_1} f(\mathbf{y} \mid \theta\_1) \pi(\theta\_1 \mid p\_1) d\theta\_1.$$

The marginal likelihood characterizing proposition *H̄*1 is

$$f\_{\bar{H}\_1}(\mathbf{y}) = \Pr(p\_2) \int\_{\Theta\_2} f(\mathbf{y} \mid \theta\_2) \pi(\theta\_2 \mid p\_2) d\theta\_2 + \Pr(p\_3) \int\_{\Theta\_3} f(\mathbf{y} \mid \theta\_3) \pi(\theta\_3 \mid p\_3) d\theta\_3,$$

and the Bayes factor can be obtained as in (1.28).

**Table 1.2** Verbal scale for expressing evidential value, in terms of the Bayes factor, in support of the prosecution's proposition over the alternative (defense) proposition (Willis et al., 2015)
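As a sketch, the marginal likelihoods of Example 1.9 can be approximated numerically in R with `integrate()`, assuming (purely for illustration) a Normal likelihood and Normal priors with hypothetical hyperparameter values.

```r
# Sketch for Example 1.9: marginal likelihoods by numerical integration and the
# Bayes factor (1.28) for H1 versus the composite H1-bar. A Normal likelihood
# and Normal priors are assumed; all numerical values are hypothetical.
y <- 4.8                                # summary measurement on the document
marg <- function(mu0, tau0, sigma) {    # f_{H_i}(y) = integral of f(y|th) pi(th)
  integrate(function(th) dnorm(y, th, sigma) * dnorm(th, mu0, tau0),
            lower = -Inf, upper = Inf)$value
}
f1 <- marg(5.0, 0.5, 0.3)               # printer 1
f2 <- marg(4.0, 0.5, 0.3)               # printer 2
f3 <- marg(6.5, 0.5, 0.3)               # printer 3

pr <- c(0.6, 0.4)                       # Pr(p_2), Pr(p_3)
f_bar <- pr[1] * f2 + pr[2] * f3        # f_{H1-bar}(y)
BF <- f1 * sum(pr) / f_bar              # Eq. (1.28)
BF
```

For the Normal-Normal model each integral also has the closed form N(*y* | *μi*, *τi*² + *σ*²), which provides a convenient check on the numerical integration.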

#### **1.7 Bayes Factor Interpretation**

The Bayes factor is a coherent measure of the change in support that the findings provide for one hypothesis against a given alternative (Jeffrey, 1975). Table 1.1 shows a guide for expressing Bayes factors verbally, following Jeffreys (1961). A historical review is presented in Aitken and Taroni (2021).

The verbal equivalent must express a degree of support for one of the propositions relative to an alternative and is defined from ranges of Bayes factor values. Qualitative interpretations of the Bayes factor have also been proposed in the context of forensic science (Evett, 1987, 1990; Evett et al., 2000; Nordgaard et al., 2012; Willis et al., 2015). Table 1.2 summarizes an example of a scale given in the ENFSI Guideline for Evaluative Reporting in Forensic Science (Willis et al., 2015), inspired by the scale proposed by Nordgaard et al. (2012). Users of these scales must be aware that attaching verbal labels to ranges of Bayes factor values offers a broad descriptive statement about standards of evidence in scientific investigation, not a calibration of the Bayes factor (Kass, 1993). See, e.g., Ramos and Gonzalez-Rodriguez (2013), van Leeuwen and Brümmer (2013), and Aitken et al. (2021) for an account of calibration as a measure of performance of BF computation methods.

Moreover, it is important to note that the choice of a reported verbal equivalent is based on the magnitude of the Bayes factor and not the reverse. Marquis et al. (2016) present a discussion on how to implement a verbal scale in a forensic laboratory, considering benefits, pitfalls, and suggestions to avoid misunderstandings.

It is worth reiterating that the Bayes factor represents *a measure of change in support*, rather than *a measure of support*, though the two expressions may be perceived as equivalent. In fact, the Bayes factor can be shown to be a non-coherent measure of support: a small Bayes factor means that the data lower the probability of the hypothesis of interest relative to its value *prior* to considering the evidence, but it does not imply that the probability of this hypothesis is low. The Bayes factor measures the *change* produced in the odds, thus indicating whether the available findings have increased or decreased the odds in favor of one proposition compared to the alternative (Bernardo & Smith, 2000).

#### **1.8 Computational Aspects**

The computation of Bayes factors can be challenging, especially when the marginal likelihoods in the numerator and in the denominator (1.2) involve integrals that do not have an analytical solution. Several methods have been proposed to address this complication. See Kass and Raftery (1995) and Han and Carlin (2001) for a review.

Consider the following general expression for the marginal likelihood:

$$f(\mathbf{x}) = \int f(\mathbf{x} \mid \theta) \pi(\theta) d\theta. \tag{1.29}$$

If the likelihood *f(x* | *θ)* and the prior *π(θ)* are not conjugate, an analytical solution may not be available. Suppose, however, that it is possible to draw values from the prior distribution *π(*·*)*. The integral in (1.29) can then be approximated by Monte Carlo methods as

$$\hat{f}\_1(\mathbf{x}) = \frac{1}{N} \sum\_{i=1}^{N} f(\mathbf{x} \mid \theta^{(i)}),\tag{1.30}$$

where *θ*<sup>(*i*)</sup>, *i* = 1*,...,N*, are *N* independent draws from *π(*·*)*. This is the average of the likelihood over the sampled values. An example will be provided in Sect. 2.2.2 (Example 2.3).
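The Monte Carlo estimate (1.30) can be sketched in R for a Normal likelihood with a Normal prior, a case where the exact marginal likelihood is known and can serve as a check; all numerical values are illustrative.

```r
# Monte Carlo approximation (1.30) of the marginal likelihood f(x) for a
# Normal-Normal model, checked against the exact value. Values illustrative.
set.seed(1)
x <- 2.0; sigma <- 1          # a single observation, known sd
mu0 <- 0; tau0 <- 2           # prior: theta ~ N(mu0, tau0^2)

N <- 1e5
theta <- rnorm(N, mu0, tau0)               # draws from the prior
f_hat <- mean(dnorm(x, theta, sigma))      # average likelihood of sampled values

f_exact <- dnorm(x, mu0, sqrt(tau0^2 + sigma^2))  # exact marginal likelihood
c(f_hat = f_hat, f_exact = f_exact)
```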

This simulation process can be rather inefficient when the posterior distribution is concentrated relative to the prior, as most of the *θ*<sup>(*i*)</sup> will have a small likelihood, and the estimate *f̂*1*(x)* in (1.30) may be dominated by a few values with large likelihood. The precision of the Monte Carlo integration can be improved by importance sampling (Kass & Raftery, 1995). Moreover, statistical packages (e.g., in R) allow one to sample from a wide range of distributions.

Importance sampling as well as other Monte Carlo tools may help to overcome such difficulties as there is no need for the distribution *π(θ )* to be available in closed form. Consider any manageable density *π*∗*(θ )* from which it is feasible to sample. The integral in (1.29) can then be approximated by *importance sampling* as

$$\hat{f}\_2(\mathbf{x}) = \frac{\sum\_{i=1}^{N} w\_i f(\mathbf{x} \mid \theta^{(i)})}{\sum\_{i=1}^{N} w\_i},\tag{1.31}$$

where the *θ*<sup>(*i*)</sup> are independent draws from *π*∗*(θ)*, weighted by the importance weights *wi* = *π(θ*<sup>(*i*)</sup>*)/π*∗*(θ*<sup>(*i*)</sup>*)*. The function *π*∗*(θ)* is known as the *importance sampling function* (e.g., Geweke, 1989). An example will be provided in Sect. 2.2.2 (Example 2.3).
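A corresponding sketch of the importance sampling estimate (1.31), for the same illustrative Normal-Normal model, uses a Normal importance function centered near the region of high likelihood.

```r
# Importance sampling approximation (1.31) of the marginal likelihood f(x)
# for the same illustrative Normal-Normal model.
set.seed(2)
x <- 2.0; sigma <- 1; mu0 <- 0; tau0 <- 2
N <- 1e5

# importance function pi*(theta): Normal centered at the observation
theta <- rnorm(N, mean = x, sd = 1)
w <- dnorm(theta, mu0, tau0) / dnorm(theta, x, 1)  # w_i = pi(theta_i)/pi*(theta_i)

f_hat2 <- sum(w * dnorm(x, theta, sigma)) / sum(w)
f_hat2
```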

In the case where *π*∗*(θ )* is taken to be the posterior density *π(θ* | *x)* = *π(θ )f (x* | *θ )/f (x)*, the use of this expression in (1.31) yields the harmonic mean of the sampled likelihood values as an estimate for the marginal likelihood *f (x)*:

$$\hat{f}\_3(\mathbf{x}) = \left[ \frac{1}{N} \sum\_{i=1}^N \frac{1}{f(\mathbf{x} \mid \theta^{(i)})} \right]^{-1}.$$
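The harmonic mean estimator can be sketched for the same illustrative Normal-Normal model, where the posterior is available in closed form and can be sampled from directly; note that this estimator can be unstable (its variance may be infinite), so the sketch is for illustration only.

```r
# Harmonic mean estimator of f(x), using draws from the (here, known) posterior
# of the illustrative Normal-Normal model. Use with care: unstable in general.
set.seed(3)
x <- 2.0; sigma <- 1; mu0 <- 0; tau0 <- 2
post_var  <- 1 / (1 / tau0^2 + 1 / sigma^2)
post_mean <- post_var * (mu0 / tau0^2 + x / sigma^2)

N <- 1e5
theta <- rnorm(N, post_mean, sqrt(post_var))   # posterior draws
f_hat3 <- 1 / mean(1 / dnorm(x, theta, sigma))
f_hat3
```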

Note that, whatever method is used, the output of such a simulation procedure is an approximation that must be handled carefully. Notwithstanding, it is worth pointing out that while the *Monte Carlo estimate* is not exact, the Monte Carlo error (e.g., *f(x)* − *f̂*1*(x)*) can be very small if a sufficiently large number of draws is generated. A study of Monte Carlo errors for the quantification of the value of forensic evidence is provided by Ommen et al. (2017).

Many practical problems require more advanced techniques based on Markov chain Monte Carlo (MCMC) methods to overcome computational hurdles. The general idea behind these methods is to sample values *θ*<sup>(*i*)</sup> recursively from a transition distribution that depends on the previous draw *θ*<sup>(*i*−1)</sup>, in such a way that, at each step of the iteration process, the sampling distribution becomes closer (i.e., converges) to the target posterior distribution *π(θ* | *x)*. This means that, after many iterations, *θ*<sup>(*i*)</sup> is approximately distributed according to *π(θ* | *x)* and can be used like the output of a Monte Carlo simulation algorithm. To avoid the effect of the starting values, the first set of iterations is generally discarded (the so-called *burn-in* period), and the simulated values beyond the first *nb* iterations

$$\theta^{(n\_b+1)}, \dots, \theta^{(N)}$$

are taken as draws from the target posterior distribution. The Gibbs sampling algorithm is a well-known method to construct a chain with these features. Suppose that the parameter vector can be decomposed into several components, say *θ* = *(θ*1*,...,θp)*, and let *π(θj* | *θ*<sub>−*j*</sub><sup>(*i*−1)</sup>*)* denote the so-called full conditional distribution, that is, the conditional distribution of *θj* at step *i* given all the other components, *θ*−*j*, at the previous step *i* − 1:

$$
\theta\_{-j}^{(i-1)} = (\theta\_1^{(i-1)}, \dots, \theta\_{j-1}^{(i-1)}, \theta\_{j+1}^{(i-1)}, \dots, \theta\_p^{(i-1)}).
$$

For many problems, it is possible to sample easily from the full conditional distributions, as is the case when distributions are conjugate. The Gibbs sampling algorithm works as follows: start with an arbitrary value *θ*<sup>(0)</sup> = *(θ*1<sup>(0)</sup>*,...,θp*<sup>(0)</sup>*)* and, at each iteration, generate *θj*<sup>(*i*)</sup> according to its full conditional distribution given the current values *θ*<sub>−*j*</sub><sup>(*i*−1)</sup>. Examples will be given in Sects. 3.4.1.3 (Example 3.14) and 3.4.3 (Example 3.16).
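A minimal Gibbs sampler along these lines can be sketched for a Normal model with unknown mean and variance under semi-conjugate priors; the data and hyperparameter values below are simulated and purely illustrative.

```r
# Minimal Gibbs sampler: x_i ~ N(theta, sig2), with semi-conjugate priors
# theta ~ N(mu0, tau0^2) and sig2 ~ Inverse-Gamma(a, b). Values illustrative.
set.seed(4)
x <- rnorm(50, mean = 5, sd = 2)          # hypothetical data
n <- length(x)
mu0 <- 0; tau0 <- 10; a <- 2; b <- 2      # prior hyperparameters

N <- 5000; nb <- 1000                     # iterations and burn-in
theta <- numeric(N); sig2 <- numeric(N)
theta[1] <- mean(x); sig2[1] <- var(x)    # starting values

for (i in 2:N) {
  # full conditional of theta given sig2: Normal
  v <- 1 / (1 / tau0^2 + n / sig2[i - 1])
  m <- v * (mu0 / tau0^2 + sum(x) / sig2[i - 1])
  theta[i] <- rnorm(1, m, sqrt(v))
  # full conditional of sig2 given theta: Inverse-Gamma(a + n/2, b + SS/2)
  sig2[i] <- 1 / rgamma(1, a + n / 2, b + sum((x - theta[i])^2) / 2)
}

keep <- (nb + 1):N                        # discard the burn-in period
c(mean(theta[keep]), mean(sig2[keep]))
```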

Whenever it is not possible to decompose the joint distribution into manageable conditionals, one can implement an alternative approach, the Metropolis–Hastings (M–H) algorithm (e.g., Gelman et al., 2014). This algorithm can be summarized as follows. Start with an arbitrary value *θ*<sup>(0)</sup> = *(θ*1<sup>(0)</sup>*,...,θp*<sup>(0)</sup>*)* and generate *θj*<sup>(*i*)</sup> at each iteration as follows:

1. Generate a candidate value *θj*<sup>prop</sup> from a proposal distribution *q(θj*<sup>(*i*−1)</sup>*,* ·*)*.
2. Compute the acceptance probability

$$\alpha\left(\theta\_j^{(i-1)},\theta\_j^{\text{prop}}\right) = \min\left\{1, \frac{\pi\left(\theta\_j^{\text{prop}}\right)q\left(\theta\_j^{\text{prop}},\theta\_j^{(i-1)}\right)}{\pi\left(\theta\_j^{(i-1)}\right)q\left(\theta\_j^{(i-1)},\theta\_j^{\text{prop}}\right)}\right\}.\tag{1.32}$$

3. Accept the proposed value *θj*<sup>prop</sup> with probability *α(θj*<sup>(*i*−1)</sup>*, θj*<sup>prop</sup>*)*, setting *θj*<sup>(*i*)</sup> = *θj*<sup>prop</sup>; otherwise, reject the proposed value and set *θj*<sup>(*i*)</sup> = *θj*<sup>(*i*−1)</sup>.

Note that if the candidate-generating function is symmetric (e.g., a Normal probability density), the acceptance probability in (1.32) simplifies to

$$\alpha\left(\theta\_j^{(i-1)}, \theta\_j^{\text{prop}}\right) = \min\left\{1, \frac{\pi\left(\theta\_j^{\text{prop}}\right)}{\pi\left(\theta\_j^{(i-1)}\right)}\right\}.$$
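A minimal random-walk Metropolis sketch in R, using a symmetric Normal proposal so that the acceptance probability reduces to the ratio of (unnormalized) target densities, computed on the log scale for numerical stability; the data and prior below are illustrative.

```r
# Random-walk Metropolis targeting pi(theta | x), known up to a constant.
# Model: x_i ~ N(theta, 1) with a vague prior theta ~ N(0, 10^2). Illustrative.
set.seed(5)
x <- rnorm(30, mean = 1)                  # hypothetical data, known sd = 1
log_target <- function(th) {
  sum(dnorm(x, th, 1, log = TRUE)) + dnorm(th, 0, 10, log = TRUE)
}

N <- 10000
theta <- numeric(N); theta[1] <- 0
for (i in 2:N) {
  prop <- rnorm(1, theta[i - 1], 0.5)     # symmetric candidate-generating step
  log_alpha <- min(0, log_target(prop) - log_target(theta[i - 1]))
  theta[i] <- if (log(runif(1)) < log_alpha) prop else theta[i - 1]
}

mean(theta[-(1:2000)])                    # posterior mean after burn-in
```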

The performance of an MCMC algorithm can be monitored by inspecting graphs and computing diagnostic statistics. Such exploratory analysis is fundamental for assessing convergence to the posterior distribution. An example will be given in Sect. 2.2.2 (Example 2.6).


The output of the MCMC algorithm can be used to provide the marginal likelihood that is needed for the numerator and the denominator of the Bayes factor, as proposed by Chib (1995) for a Gibbs sampling algorithm and by Chib and Jeliazkov (2001) for an M–H algorithm. The key idea is to obtain the marginal likelihood *f (x)* by a direct application of Bayes theorem since it can be seen as the normalizing constant of the posterior density

$$f(\mathbf{x}) = \frac{f(\mathbf{x} \mid \theta^\*) \pi(\theta^\*)}{\pi(\theta^\* \mid \mathbf{x})},\tag{1.33}$$

where *θ*∗ is a parameter value with high posterior density. Note that (1.33) is valid for any parameter value *θ* ∈ *Θ*. The likelihood *f(x* | *θ)* and the prior density *π(θ)* can be evaluated directly at a given parameter point *θ*∗. The posterior density *π(θ* | *x)* is unavailable in closed form, but it can be approximated by using the output of the Gibbs sampler. Consequently, the marginal likelihood can be approximated as

$$\hat{f}(\mathbf{x}) = \frac{f(\mathbf{x} \mid \theta^\*) \pi(\theta^\*)}{\hat{\pi}(\theta^\* \mid \mathbf{x})}. \tag{1.34}$$

Examples will be given in Sects. 3.4.1.3 (Example 3.14) and 3.4.3 (Example 3.16).
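The identity (1.33) can be illustrated directly in R for a Normal model with known variance, where likelihood, prior, and posterior are all available in closed form; in Chib's method, the posterior ordinate in the denominator would instead be approximated from the Gibbs output. All numbers below are illustrative.

```r
# Direct illustration of identity (1.33) for a Normal model with known sd = 1
# and prior theta ~ N(mu0, tau0^2), where the posterior is exactly Normal.
set.seed(6)
x <- rnorm(20, mean = 3, sd = 1)          # hypothetical data
n <- length(x)
mu0 <- 0; tau0 <- 5

# exact posterior of theta (Normal-Normal conjugacy)
post_var  <- 1 / (1 / tau0^2 + n)
post_mean <- post_var * (mu0 / tau0^2 + sum(x))

# evaluate (1.33) at a high-posterior-density point ...
theta_star <- post_mean
f_x <- prod(dnorm(x, theta_star, 1)) * dnorm(theta_star, mu0, tau0) /
       dnorm(theta_star, post_mean, sqrt(post_var))

# ... and at another point: the identity yields the same value for any theta
theta_alt <- post_mean + 1
f_x_alt <- prod(dnorm(x, theta_alt, 1)) * dnorm(theta_alt, mu0, tau0) /
           dnorm(theta_alt, post_mean, sqrt(post_var))
c(f_x, f_x_alt)
```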

This short overview of computational tools is not intended to be exhaustive. There are instances, for example when dealing with high-dimensional distributions, where the simulation process is very slow, giving rise to inefficiencies in the behavior of the Gibbs sampler or the Metropolis algorithm. An alternative solution is given by the Hamiltonian Monte Carlo (HMC) method, where candidate values are not generated from a proposal distribution centered at the current position of the chain; rather, proposals change depending on the current position of the chain. This allows one to obtain more promising candidate values, avoiding a very slow exploration of the target distribution and therefore moving much more rapidly (Neal, 1996). As in any Metropolis algorithm, HMC proceeds by a series of iterations, though it requires more effort in terms of programming and tuning. The user can refer to the computer program Stan (sampling through adaptive neighborhoods) to apply the Hamiltonian Monte Carlo method directly. The reader can refer to Gelman et al. (2014) and the Stan Development Team (2021) for instructions and examples. A complete picture of basic and more advanced methods of Bayesian computation can be found, e.g., in Gelman et al. (2014), Marin and Robert (2014), and Robert and Casella (2010). The reader can also refer to Han and Carlin (2001) and to Friel and Pettitt (2008) for reviews of methods to compute BFs.

In all examples in this book dealing with the Gibbs sampler and the Metropolis–Hastings algorithm, we will directly program the computations in R. Other open-source programs exist, however, that can be used to build Markov chain Monte Carlo samplers, such as Stan or JAGS (Just Another Gibbs Sampler, https://mcmc-jags.sourceforge.io/). Both can interact with R (see the libraries RStan, rjags, and runjags). Further examples can be found in Albert (2009) and Kruschke (2015).

#### **1.9 Bayes Factor and Decision Analysis**

The Bayes factor provides a coherent and quantitative way of relating probabilities for states of nature, before information is obtained, to probabilities given information that has become available. A subsequent step, the choice between different hypotheses, represents a problem of decision-making (Lindley, 1985). For the purpose of illustration, consider the simple and regularly encountered case where only two hypotheses are of interest, say *H*1 and *H*2. The two hypotheses are a special case of a list of, more generally, *n* mutually exclusive and exhaustive uncertain events (also called *states of nature*) that together cover the entirety of nature. The decision space is the set of all possible actions, here decisions *d*1 and *d*2, where decision *di* can be formalized as the acceptance of hypothesis *Hi*. The decision problem can be expressed in terms of a decision matrix (see Table 1.3), with *Cij* denoting the consequence of deciding *di* when hypothesis *Hj* is true. Decision *di* is called "correct" if hypothesis *Hj* is true and *i* = *j*. Conversely, decision *di* is incorrect if hypothesis *Hj* is true and *i* ≠ *j*, i.e., if *H*¬*i* is true. When choosing between competing hypotheses, one takes into account preferences among decision consequences, in particular among adverse outcomes. This aspect is formalized by introducing a measure of the decision maker's relative desirability, or undesirability, of the various decision consequences. To measure the undesirability of consequences on a numerical scale, one can introduce a loss function L*(*·*)*, where L*(Cij)* denotes the loss assigned to the outcome of deciding *di* when hypothesis *Hj* is true.

If it can be agreed that a correct decision represents neither a loss nor a gain, the loss function for a two-action problem can be described by a two-way table that contains zeros for the losses L(*C<sub>ij</sub>*), *i* = *j*, and the value *l<sub>i</sub>* for L(*C<sub>ij</sub>*), *i* ≠ *j*. Such a "0 − *l<sub>i</sub>*" loss function is shown in Table 1.4, where *l<sub>i</sub>* = L(*d<sub>i</sub>*, *H*<sub>¬*i*</sub>) denotes the loss one incurs whenever decision *d<sub>i</sub>* is a wrong decision.

The relative (un-)desirability of available decisions can be expressed by their *expected loss* EL*(*·*)*, computed as


**Table 1.3** Decision matrix with *d*<sup>1</sup> and *d*<sup>2</sup> denoting the possible actions, *H*<sup>1</sup> and *H*<sup>2</sup> denoting the states of nature, and *Cij* denoting the consequence of deciding *di* when hypothesis *Hj* is true

**Table 1.4** The "0−*li*" loss function for a decision problem with *d*<sup>1</sup> and *d*<sup>2</sup> denoting the possible actions, *H*<sup>1</sup> and *H*<sup>2</sup> denoting the states of nature, and *li* denoting the loss associated with adverse decision consequences


$$\mathrm{EL}(d_i \mid \boldsymbol{x}) = \underbrace{\mathrm{L}(d_i, H_i)}_{0}\,\underbrace{\Pr(H_i \mid \boldsymbol{x})}_{\alpha_i} + \underbrace{\mathrm{L}(d_i, H_{\neg i})}_{l_i}\,\underbrace{\Pr(H_{\neg i} \mid \boldsymbol{x})}_{\alpha_{\neg i}} = l_i \alpha_{\neg i}\,,$$

where *x* denotes the observation or a series of measurements and *α*<sub>¬*i*</sub> denotes the (posterior) probability of the event *H*<sub>¬*i*</sub> given *x*. The formal Bayesian decision criterion is to accept hypothesis *H*<sub>1</sub> if the expected loss of the decision to accept *H*<sub>1</sub> is smaller than the expected loss of rejecting it, that is, if the (posterior) expected loss of decision *d*<sub>1</sub> is smaller than the (posterior) expected loss of decision *d*<sub>2</sub>:

$$\mathrm{EL}(d_1 \mid \boldsymbol{x}) < \mathrm{EL}(d_2 \mid \boldsymbol{x})$$

$$l\_{1}\alpha\_{2} < l\_{2}\alpha\_{1}. \tag{1.35}$$

When rearranging the terms in (1.35) to *α*<sub>1</sub>/*α*<sub>2</sub> > *l*<sub>1</sub>/*l*<sub>2</sub>, and dividing both sides by the prior odds *π*<sub>1</sub>/*π*<sub>2</sub>, the Bayes decision criterion states that accepting *H*<sub>1</sub> is the optimal decision whenever

$$\frac{\alpha\_1/\alpha\_2}{\pi\_1/\pi\_2} > \frac{l\_1/l\_2}{\pi\_1/\pi\_2} = c.$$

This is equivalent to accepting *H*<sup>1</sup> whenever the Bayes factor in favor of this proposition is larger than a constant *c* determined by the prior odds and the loss ratio. Given a set of observations, the Bayes factor is computed and, depending on whether or not it exceeds a given threshold, the decision maker chooses between the members in the list of states of nature (here *H*<sup>1</sup> and *H*2). Examples will be given in Chap. 3 in the context of inference of source (Sect. 3.3.3) and in Chap. 4 in the context of classification (Sects. 4.2.2 and 4.4.1.2). An extended review of elements of decision analysis in forensic science can be found in Taroni et al. (2021b).
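As a small numerical sketch of this criterion in R (all loss values, prior probabilities, and the Bayes factor below are purely illustrative assumptions, not values from the text):

```r
# Decide between H1 and H2 by comparing the BF to the threshold c
l1 <- 4      # loss for wrongly accepting H1 (illustrative)
l2 <- 1      # loss for wrongly accepting H2 (illustrative)
pi1 <- 0.5   # prior probability of H1 (illustrative)
pi2 <- 1 - pi1
bf <- 6      # Bayes factor in favor of H1 (illustrative)

c_threshold <- (l1 / l2) / (pi1 / pi2)   # c = loss ratio / prior odds
decision <- if (bf > c_threshold) "accept H1" else "accept H2"
decision
# [1] "accept H1"
```

With equal prior probabilities, the threshold reduces to the loss ratio *l*1*/l*2; here the BF of 6 exceeds *c* = 4, so accepting *H*1 is the optimal decision.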

This decision criterion is simple and intuitive, yet it poses challenges. For example, the requirement to choose a prior probability for the two hypotheses may be discomforting, because there is no ready-made recipe for this purpose. In principle, probabilities are personal, since they depend on one's knowledge (Lindley, 2014). They may change as information changes and may vary among individuals. For example, a given hypothesis may be considered almost certainly true by one individual, but far less probable by someone else. The fact that different individuals with different knowledge bases may specify different probabilities for the same event, provided that the assignments are accompanied by a justification, is not a problem in principle (Lindley, 2000). The only strict requirement to which probability assignments ought to conform is coherence (de Finetti, 2017). Coherence has the normative role of encouraging people to make careful assignments based on their personal knowledge. This can be operationally supported by the concept of scoring rules. See, for example, Biedermann et al. (2013, 2017a) for a discussion of scoring rules in the context of forensic science.

The same viewpoint applies to utility and loss functions, which may be difficult to specify. A "correct" utility (or loss) function does not exist, because preference structures are personal. Adverse decision consequences may be considered more or less undesirable, depending on the background, the context and the decision maker's objectives (e.g., Taroni et al., 2010). Moreover, the loss function does not need to have constant values, such as the "0 − *li*" loss function introduced above. More general loss functions treat the loss as a function of the severity of the consequences. Examples will be given in Chap. 2 regarding inference and decision about a proportion (Sect. 2.2.3) and about a mean (Sect. 2.3.3).

Note that, in the context here, the terms "personal" and "subjective" do not mean that the theory is arbitrary, unjustified or groundless (Biedermann et al., 2017b; Taroni et al., 2018). There are various devices for the sound elicitation of probabilities and the measurement of the value of decision consequences (Lindley, 1985). What matters in a situation in which a decision maker is asked to make a choice among alternative courses of action that have uncertain consequences is that the behavior is one that can be qualified as rational. This includes, in particular, a coherent specification of the loss function, reflecting personal preferences among consequences in terms of desirability or undesirability.

This formal decision-analytic approach provides decision criteria that (i) are based on clearly defined concepts, (ii) promote rational decision-making under uncertainty, and (iii) make a clear distinction between the evaluation of the strength of evidence (as given by the Bayes factor), which is the domain of the forensic scientist, and the specification of the threshold with which the Bayes factor is compared, i.e., the ratio between the loss ratio and the prior odds. The latter lies in the domain of the recipient of expert information, such as investigative authorities and members of the judiciary.

#### **1.10 Choice of the Prior Distribution**

Bayesian model builders may encounter various difficulties. One of them is the choice of the prior distribution. Bayes theorem does not specify how one ought to define the prior distribution. The chosen prior distribution should, however, suitably reflect one's prior beliefs. In this context, so-called vague or non-informative prior distributions may help to find a broad consensus. However, it is important to keep in mind that even a "non-informative" prior distribution effectively conveys a well-defined opinion, i.e., that probabilities spread uniformly over the parameter space (de Finetti, 1993a). In contrast to this, personal or so-called informative priors aim at encoding available prior knowledge. Whenever feasible, it is advantageous to choose a member of the class of conjugate distributions, that is, a family of prior distributions such that, for any prior in this family and a particular likelihood, the corresponding posterior distribution belongs to the same family. For example, the beta distribution and the binomial distribution are said to be conjugate in this sense. Several examples will be provided throughout this book.


**Table 1.5** Some common conjugate prior distribution families

Table 1.5 provides a list of some common families of conjugate distributions. A more extensive list can be found in Bernardo and Smith (2000). Despite such smooth technical options, eliciting a prior distribution may not be easy.

First, it may be that none of the standard parametric families mentioned above is suitable to describe one's prior degree of belief. There may be circumstances where multimodal priors may better reflect the available knowledge, and mixture priors would be more convenient (see e.g. Taroni et al., 2010). Another option is to specify prior beliefs over a selection of points and then interpolate between them (Bolstad & Curran, 2017). More generally, there may be cases where the choice of a conjugate prior is not appropriate as it does not properly reflect available knowledge. If this is the case, the application of Bayes theorem may lead to a posterior distribution that is analytically intractable. Such situations require the implementation of computational tools as described in Sect. 1.8.

Second, practitioners will immediately realize that even if the choice of a given standard parametric family may appear justifiable, they will still need to choose a member from the selected family. Stated otherwise, they will need to fix the hyperparameters of the prior distribution so that the resulting shape reasonably reflects their knowledge. Assume that practitioners are in a situation where, based on their experience in the field, they can summarize and translate their prior beliefs into a numerical value for the prior mean, say *m*, and a numerical value for the prior standard deviation, say *s*. They can then find the values of the parameters that specify a prior distribution reflecting the assessed prior location and prior dispersion, respectively. For example, suppose that the parameter of interest, *θ*, is a proportion and that a beta prior distribution is chosen to model prior uncertainty, i.e., *θ* ∼ Be*(α, β)*. The problem then is how to choose *α* and *β*. If one can specify a value *m* for the prior mean and a value *s* for the prior standard deviation, that is, the two values describing the location and the dispersion of the prior distribution, one can elicit the hyperparameters *α* and *β* by relating the assessed prior mean and prior variance to the moments of a beta distributed random variable, that is,


$$m = \frac{\alpha}{\alpha + \beta} \tag{1.36}$$

$$s^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}.\tag{1.37}$$

The hyperparameters of the beta prior can then be obtained by solving the two equations in (1.36) and (1.37) for *α* and *β*

$$\alpha = m \left[ \frac{m(1-m)}{s^2} - 1 \right] \tag{1.38}$$

$$\beta = (1 - m) \left[ \frac{m(1 - m)}{s^2} - 1 \right]. \tag{1.39}$$

It is advisable to inspect the prior distribution thus elicited. Producing a graphical representation can help examine whether the shape of the distribution reasonably reflects one's prior beliefs. Moreover, the so-called *equivalent sample size* *n<sub>e</sub>* should be calculated in order to examine the reasonableness of the amount of information that underlies the proposed prior; one should make sure that it is not unrealistically high (Bolstad & Curran, 2017). Stated otherwise, one should examine whether the information that is conveyed by the prior is equivalent, at least roughly, to the information that would be obtained by collecting a sample of size *n<sub>e</sub>*. For example, consider a random sample (*X*<sub>1</sub>, ..., *X*<sub>*n<sub>e</sub>*</sub>) of size *n<sub>e</sub>*, providing the same information as that conveyed by the prior. The sample mean $\bar{X} = \frac{1}{n_e}\sum_{i=1}^{n_e} X_i$ should have, at least roughly, the same location and the same dispersion as the prior.

For the beta-binomial case, the equivalent sample size *ne* can be obtained by relating the moments of the beta prior to the corresponding moments characterizing a random sample of size *ne* from a Bernoullian population with probability of success *θ*:

$$\frac{\alpha}{\alpha+\beta} = \theta \tag{1.40}$$

$$\frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2} = \frac{\theta(1-\theta)}{n_e}.\tag{1.41}$$

Solving for *n<sub>e</sub>*, one obtains *n<sub>e</sub>* = *α* + *β* + 1. If this value is felt to be unrealistically high, one should revise one's prior assessments, increase the dispersion, and recalculate the prior; otherwise, one would be specifying more information about the proportion *θ* than is actually available, relative to the amount of information provided by a sample of size *n<sub>e</sub>*.

*Example 1.10 (Elicitation of a Beta Prior)* Suppose that a prior distribution needs to be elicited for the proportion *θ* of non-counterfeit merchandise (e.g., medicines) in a target population. It is thought that the distribution is centered around 0.8 with a standard deviation equal to 0.1. Parameters *α* and *β* can be elicited as in (1.38) and (1.39)

```
> m=0.8
> s=0.1
> a=m*(m*(1-m)/s^2-1)
> b=(1-m)*(m*(1-m)/s^2-1)
> c(a,b)
[1] 12 3
```
Figure 1.1 shows the elicited beta prior Be*(*12*,* 3*)*.

```
> plot(function(x) dbeta(x,a,b),0,1,xlab=expression
+ (paste(theta)),ylab=expression(paste(pi)*
+ paste('(')*paste(theta)*paste(')')))
```
The equivalent sample size is 12+3+1=16. This is the size of the sample that should be available in terms of information that is equivalent to that conveyed by the elicited prior.

An objection to this procedure might be that while specifying a value for the location of the prior may be feasible, this may not necessarily be so for the dispersion. In many cases, the available prior knowledge takes the form of a realization *(x*1*,...,xn)* of a random sample of size *n* from a previous experiment. In this case, it is sufficient to solve (1.40) and (1.41) with respect to *α* and *β*, setting the equivalent sample size equal to this sample size *n*:

$$
\alpha = p(n-1),
\tag{1.42}
$$

$$
\beta = (1 - p)(n - 1),
\tag{1.43}
$$

where *θ* has been estimated by the sample proportion $\hat{\theta} = p = \sum_{i=1}^{n} x_i/n$. One can immediately verify that whenever the hyperparameters *α* and *β* are elicited as in (1.42) and (1.43), then *α* + *β* + 1 = *n*. The elicited parameters thus reflect the amount of information provided by a sample of size *n*.
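In R, this elicitation from a previous sample can be sketched as follows; the sample values below are hypothetical:

```r
# Hypothetical previous experiment: 15 positive items out of n = 50
n <- 50
p <- 15 / n               # sample proportion, estimate of theta
a <- p * (n - 1)          # alpha, Eq. (1.42)
b <- (1 - p) * (n - 1)    # beta,  Eq. (1.43)
c(a, b)                   # elicited hyperparameters: 14.7 and 34.3
a + b + 1                 # equivalent sample size, equal to n = 50
```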

Some further practical examples will be provided throughout the book. For an extended discussion of prior elicitation, the reader can refer to Garthwaite et al. (2005) and O'Hagan et al. (2006).

#### **1.11 Sensitivity Analysis**

In Sect. 1.4, it has been emphasized that the Bayes factor is not a measure of the relative support for the competing propositions provided by the data alone. The Bayes factor is influenced by the choice and the elicitation of the subjective prior densities (probabilities) for model parameters under propositions *H*<sup>1</sup> and *H*2. This reflects background knowledge that may be available to analysts. For this reason, prior elicitation of model parameters must not be confused with prior probabilities of the propositions of interest.

While the computation of the Bayes factor requires prior assessments about unknown quantities, a main objection to the choice of such prior distributions is that they may be hard to define, in particular when the available information is limited. Situations characterized by an abundance of relevant data that can be used to construct a prior distribution may be rare. Generally, the choice of a prior is the result of a subtle combination of relevant information, published data, and explainable personal knowledge of the expert. The specification of the prior must be taken seriously, because it can be shown that even when a large amount of evidence is available, the marginal likelihood is highly sensitive to the choice of the prior distribution, and so is the Bayes factor (Gelman et al., 2014). This is different for the posterior distribution that is dominated by the likelihood.

Sensitivity analyses allow one to explore how results may be affected by changes in the priors (e.g. Kass & Raftery, 1995; Kass, 1993; Liu & Aitkin, 2008). This, however, may turn out to be computationally intensive and time consuming. An alternative approach has been proposed by Sinharay and Stern (2002) for comparing nested models, though it can be extended to non-nested models. The general idea is to assess the sensitivity of the Bayes factor to the prior distribution for a given parameter *θ* by computing the Bayes factor for a vector of parameter values (or a grid of parameter values in the case of a two-dimensional vector parameter *θ*). The result is a graphical representation of the Bayes factor (i.e., a sensitivity curve) as a function of *θ*, say BF*<sup>θ</sup>* . In this way, one can get an idea about the Bayes factor one could obtain for different values of *θ*, and thus about the sensitivity of the Bayes factor to various prior distributions. These prior distributions have their mass concentrated on different apportionments of the parameter space. For one or two-dimensional problems, the inspection of a sensitivity curve represents a straightforward and effective approach to study the impact of varying parameter values on the BF under consideration. An example is given in Sect. 2.3.1 for the choice of the prior distribution about a Normal mean. A sensitivity analysis with respect to the prior probability assessments of competing propositions is provided in Sect. 3.2.3.
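A minimal R sketch of such a sensitivity curve, for a hypothetical binomial setting (12 positive items out of 40, with threshold *θ*0 = 0.2), computes the Bayes factor under a range of beta priors whose mean varies over a grid while the equivalent sample size is held fixed; all numerical choices here are illustrative assumptions:

```r
# Sensitivity of the BF for H1: theta > 0.2 vs H2: theta <= 0.2
# to the mean of a Be(alpha, beta) prior (illustrative values)
th0 <- 0.2; n <- 40; x <- 12
ne <- 5                                   # fixed equivalent sample size
prior_mean <- seq(0.1, 0.9, by = 0.1)
bf <- sapply(prior_mean, function(m) {
  a <- m * (ne - 1); b <- (1 - m) * (ne - 1)
  pi1 <- pbeta(th0, a, b, lower.tail = FALSE)          # prior P(H1)
  alpha1 <- pbeta(th0, a + x, b + n - x,
                  lower.tail = FALSE)                  # posterior P(H1)
  (alpha1 / (1 - alpha1)) / (pi1 / (1 - pi1))          # Bayes factor
})
plot(prior_mean, bf, type = "b", xlab = "prior mean", ylab = "BF")
```

Plotting `bf` against `prior_mean` gives the sensitivity curve; a nearly flat curve would indicate that the BF is robust to the choice of the prior mean.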

A further layer of sensitivity analyses relates to the choice of the utility/loss function. An example is presented in Sect. 2.2.3 for the choice of the loss function in the context of inference and decision about a population proportion. Section 4.4.1.2 gives an example for the investigation of the effect of different prior probabilities and loss values in the context of classification of skeletal remains.

A sensitivity analysis for Monte Carlo and Markov chain Monte Carlo procedures is presented in Sects. 2.2.2 and 3.4.1.3. In Sect. 4.3.3, a sensitivity analysis is developed for the choice of a smoothing parameter in a kernel density estimation.

#### **1.12 Using R**

R is a rich environment for data analysis and statistical computing. In its base package, it contains a large collection of functions for exploring, summarizing, and representing data graphically, handling many standard probability distributions, and more. R includes a simple programming language that users can extend with new functions. Some basic instructions on the use of R or of particular functions are available from the R Help menu, by using the command help.start(). The reader can refer to, for example, Verzani (2014) for a detailed introduction to the use of R for descriptive and inferential statistics, to Albert (2009) for an overview of elements of Bayesian computation with R, and to the R project home page (https://www.r-project.org) for more references. Datasets and routines used in the examples throughout this book are available on the website of this book (on http://link.springer.com/).

Generally, we will give results of R computations as produced directly by R. We do not make any recommendations as to the level of precision that scientists should use when reporting numerical results.


## **Chapter 2 Bayes Factor for Model Choice**

#### **2.1 Introduction**

The Bayes factor can assist forensic scientists in the evaluation of findings when recipients of expert information need help in discriminating between propositions concerning, for example, a parameter of interest. A typical example is the discrimination between competing propositions regarding the concentration of a controlled substance (e.g., drugs in blood) with respect to a given threshold. This chapter will approach one-sided hypothesis testing involving model parameters in the form of a proportion (Sect. 2.2) and a mean (Sect. 2.3). In both situations, additional factors, such as errors (Sects. 2.2.2 and 2.3.2), are considered. Aspects of decision-making are also considered (Sects. 2.2.3 and 2.3.3).

Throughout this chapter, the Bayes factor will be obtained as a ratio of marginal likelihoods following the ideas described in Sect. 1.4. The greater marginal likelihood will support the respective proposition against the other. This, along with other aspects, such as the decision maker's preferences among adverse consequences, has an impact on the decision-making process.

#### **2.2 Proportion**

A common problem in forensic practice is the investigation of the proportion of items or individuals that present a characteristic of interest, e.g., the proportion of seized pills containing a controlled substance or the proportion of counterfeit medicines in a given population. A consignment of items is considered a random sample from a super-population of items of the same type, and the parameter *θ* is the proportion of units in the super-population that present the target characteristic. Note that for consignments of small size (i.e., smaller than 50), a finite number of units will correspond to each positive value of *θ*; the parameter *θ* is then a nuisance parameter (i.e., one that is not of primary interest) that can be integrated out, leaving a probability distribution for the unknown number of items having the target characteristic. For consignments of large size (i.e., several thousands), *θ* can be treated as a continuous value in the interval *(*0*,* 1*)* (e.g., Aitken et al., 2021). As an example, consider the following pair of propositions:

*H*1: The proportion *θ* of items having the characteristic of interest is larger than *θ*0.

*H*2: The proportion *θ* of items having the characteristic of interest is smaller than or equal to *θ*0,

where *θ*<sub>0</sub> ∈ (0, 1) is a given threshold of legal interest.<sup>1</sup> Note that applications of this type of propositions are broad and include, for example, quality control of food (and other consumer products), the analysis of levels of contamination of laboratory equipment, and the extent of environmental pollution.

This section covers three main topics: (1) inference about an unknown proportion *θ* (Sect. 2.2.1), (2) inference about *θ* when background elements may affect the counting process (Sect. 2.2.2), and (3) decision regarding competing propositions about *θ* (Sect. 2.2.3).

#### *2.2.1 Inference About a Proportion*

Consider a case of inference about a population parameter based on a sample of size *n*. Aitken (1999) and Aitken et al. (2021) discuss how to choose a sample size. Suppose that among the *n* items, *x* shows a characteristic that is of interest from a legal point of view. The question then is how such an analytical result supports one or the other of the competing propositions regarding the proportion of items in the population that have the target characteristic.

Experiments of this kind can be regarded as Bernoulli trials (after the Swiss mathematician Jacob Bernoulli, 1654–1705), where trials are independent and give rise to one of the two mutually exclusive outcomes, conventionally labeled success and failure, with constant probability of success in each trial. The binomial distribution Bin*(n, θ )* is a statistical model for data that arise from a sequence of Bernoulli trials:

$$f(x \mid n, \theta) = \binom{n}{x} \theta^{x} (1 - \theta)^{n - x}, \qquad x = 0, 1, \dots, n.$$

In the Bayesian perspective, the most common prior distribution for the parameter of interest *θ* is the beta distribution Be*(α, β)*:

<sup>1</sup> See Biedermann et al. (2012, 2018) for a general discussion of thresholds of legal interest when data are continuous.

$$f(\theta \mid \alpha, \beta) = \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} / B(\alpha, \beta), \qquad 0 < \theta < 1;\ \alpha, \beta > 0,$$

with $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.

The beta prior distribution and the binomial likelihood are conjugate (see Sect. 1.10): after inspecting a sample, one can easily compute the posterior distribution, which is still beta, Be*(α*∗*, β*∗*)* with parameters updated according to well-known updating rules, *α*<sup>∗</sup> = *α*+*x*, *β*<sup>∗</sup> = *β* +*n*−*x* (e.g., Lee, 2012). The prior odds, the posterior odds, and the Bayes factor can be easily computed, as discussed in Sect. 1.4, by means of standard routines.

*Example 2.1 (Counterfeit Medicines)* Consider a case in which a large batch of medicines (say, *N >* 50) is seized, suspected to contain counterfeit items. The following propositions are of interest:

*H*1: The proportion *θ* of counterfeit medicines is greater than 0*.*2.

*H*2: The proportion *θ* of counterfeit medicines is not greater than 0*.*2.

Suppose that, initially, limited information is available so that a uniform prior distribution is chosen over the interval *(*0*,* 1*)*, that is, *θ* ∼ Be*(*1*,* 1*)*. Note that although a prior distribution Be*(*1*,* 1*)* is often called *uninformative*, it is in fact informative (see Sect. 1.10 and de Finetti (1993b)). It conveys the view that every value of *θ* in the interval *(*0*,* 1*)* is considered equally probable. The prior odds can then easily be obtained.

```
> th=0.2
> a=1
> b=1
> pi1=pbeta(th,a,b,lower.tail=F)
> prior_odds=pi1/(1-pi1)
> prior_odds
[1] 4
```
A uniform prior distribution clearly favors, a priori, hypothesis *H*1, that *θ* is greater than 0*.*2. Next, suppose that a sample of size 40 is analyzed and 12 out of 40 items are found to be positive (counterfeit). The posterior distribution follows immediately, and so do the posterior odds and the Bayes factor.

```
> n=40
> x=12
> astar=a+x
> bstar=b+n-x
> alpha1=pbeta(th,astar,bstar,lower.tail=F)
> post_odds=alpha1/(1-alpha1)
> post_odds
[1] 18.19594
```
The posterior probability of proposition *H*<sup>1</sup> is, therefore, approximately 18 times greater than the posterior probability of the alternative proposition *H*2. Thus, the Bayes factor can be obtained as

```
> BF=post_odds/prior_odds
> BF
[1] 4.548985
```
The Bayes factor indicates that the evidence is in favor of proposition *H*1 that the proportion of counterfeit medicines is greater than 0.2, rather than proposition *H*2 (i.e., *θ* ≤ 0*.*2). According to the verbal scale presented in Table 1.2, the BF weakly supports proposition *H*1 over *H*2.

To help specify the prior distribution, information in the form of data regarding similar consignments from cases with comparable circumstances may be used. Such data may suggest a distribution other than the uniform distribution used in the above example. An example of how to elicit a subjective prior distribution about a proportion is provided in Sect. 1.10. For a more extensive discussion about prior elicitation for a proportion, the reader can refer to O'Hagan et al. (2006). Forensically relevant applications of prior elicitation for *θ* are discussed in Aitken (1999). Note, however, that in certain practical applications, analytical results may be affected by further factors that cannot be dissociated from the observational process. An example of such a factor is considered in Sect. 2.2.2.

The analysis pursued above focused on the problem of inference about a proportion for a large batch. Consider now the case where the size *N* of the consignment is small (less than 50). Suppose a sample of size *n* is inspected and *x* items are found to present the target characteristic (e.g., yield a positive test result), so that *θ* ∼ Be*(α* + *x,β* + *n* − *x)*. Denote by *Y* the unknown number of positive items in the uninspected part of the consignment. This random variable still has a binomial distribution, *Y* ∼ Bin*(m, θ )*, where *m* = *N* − *n* represents the number of units that have not been inspected. The probability distribution for the unknown number of positive units can be obtained by integrating out parameter *θ*. This turns out to be a beta-binomial distribution Be-Bin*(n, m, x, α, β)*:

$$\Pr(Y = y \mid n, m, x, \alpha, \beta) = \binom{m}{y} \frac{\Gamma(n + \alpha + \beta)}{\Gamma(x + \alpha)\Gamma(n - x + \beta)} \frac{\Gamma(y + x + \alpha)\Gamma(n + m - x - y + \beta)}{\Gamma(n + m + \alpha + \beta)}, \qquad y = 0, 1, \dots, m \tag{2.1}$$

(Aitken, 1999).
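As a cross-check, (2.1) can be evaluated directly from its gamma-function form, working on the log scale with `lgamma` for numerical stability; this is a sketch, not the book's own code:

```r
# Beta-binomial pmf of Eq. (2.1), evaluated on the log scale
dbetabin <- function(y, n, m, x, a, b) {
  exp(lchoose(m, y) +
      lgamma(n + a + b) - lgamma(x + a) - lgamma(n - x + b) +
      lgamma(y + x + a) + lgamma(n + m - x - y + b) -
      lgamma(n + m + a + b))
}
# Values as in Example 2.2 below: n = 10 inspected, m = 30 uninspected,
# x = 2 positives, uniform Be(1, 1) prior
dbetabin(1, n = 10, m = 30, x = 2, a = 1, b = 1)
# [1] 0.03665943
```

The value agrees with the one returned by `dbbinom` from the extraDistr package in Example 2.2.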

*Example 2.2 (Counterfeit Medicines—Small Consignment)* Consider Example 2.1 and suppose now that the consignment is small, say *N* = 40. Suppose further that a sample of size *n* = 10 has been inspected and that 2 items are found to be counterfeit. Starting from a uniform prior distribution *θ* ∼ Be*(*1*,* 1*)*, the beta posterior distribution becomes *θ* ∼ Be*(*3*,* 9*)*.

```
> N=40
> n=10
> x=2
> a=1
> b=1
> astar=a+x
> bstar=b+n-x
```

The distribution of *Y* then is Be-Bin*(*10*,* 30*,* 2*,* 1*,* 1*)*. The probability of observing a given number of counterfeit items (e.g., *y* = 1) in the remainder of the consignment can be obtained using the function dbbinom that is available in the package extraDistr (Wolodzko, 2020).

```
> library(extraDistr)
> dbbinom(1,N-n,astar,bstar)
[1] 0.03665943
```
One can also use the function pbbinom, which computes the cumulative distribution function of the beta-binomial random variable in (2.1). For example, the probability of observing at most 2 counterfeit items can be obtained as

```
> pbbinom(2,N-n,astar,bstar)
[1] 0.109604
```

A Bayesian network for inference about a proportion of a small consignment has been developed in Biedermann et al. (2008). Posterior probabilities for *θ* can easily be calculated with such models.

#### *2.2.2 Background Elements Affecting Counting Processes*

In many real-world applications, counting processes performed in forensic laboratories cannot be considered error-free. Examinations may be affected by inefficiencies and perturbing factors. For example, it may be that items are lost or missed during counting or that background elements are present, i.e., objects observationally indistinguishable from the target objects. This section addresses inferential challenges due to such background elements.

Suppose that *x* is the number of recorded successes, i.e., the number of times that the target characteristic is detected. However, the number *x* may not correspond to the number *xs* of items actually showing the characteristic of interest but be affected by a certain number of background elements, *xb*, that are wrongly counted as successes. This complication may typically arise in applications where the items of interest are small particles. Consider, for example, the assessment of rice quality in a context of food quality control. Rice quality can be measured by means of several features, such as the percentage of cracked or immature grains. For example, there may be legal provisions regarding the maximum tolerated amount of cracked grains.<sup>2</sup> It might then be of interest to compare alternative propositions according to which the percentage of cracked grains is above or below a given regulatory threshold. A key question is how to conduct such a comparison when the counting process may be affected by background elements, e.g., oil seeds in the example here.

While the number of elements *actually* showing the target characteristic is modeled as the outcome of a binomial distribution, *Xs* ∼ Bin*(n, θ )*, the amount of background elements affecting the counting process, *xb*, can be modeled by a Poisson distribution, *Xb* ∼ Pn*(λ)*, where *λ* is the mean number of background elements (D'Agostini, 2004). The total number of *recorded* successes is therefore *X* = *Xs* + *Xb*. The graphical model (see e.g. Cowell et al., 1999) in Fig. 2.1 offers a schematic representation of the probabilistic relationship among the variables.
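The additive structure *X* = *Xs* + *Xb* can be illustrated by simulation; the sketch below uses arbitrary illustrative values for *n*, *θ* and *λ* and checks that the mean of the recorded counts is close to *nθ* + *λ*:

```r
# Simulate recorded counts X = Xs + Xb (illustrative parameter values)
set.seed(123)
n <- 1000; theta <- 0.03; lambda <- 1
xs <- rbinom(1e5, n, theta)   # items truly showing the target characteristic
xb <- rpois(1e5, lambda)      # background elements wrongly counted as successes
x <- xs + xb                  # recorded counts
mean(x)                       # close to n * theta + lambda = 31
```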

<sup>2</sup> For legislation in, e.g., Italy, see Gazzetta Ufficiale della Repubblica Italiana, 6, 09-01-2018, Decreto 20 settembre 2017.

#### 2.2 Proportion 47

It can be shown<sup>3</sup> that *X* has the following probability distribution:

$$f(\mathbf{x} \mid n, \theta, \lambda) = \sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x}-\mathbf{x}\_b} \theta^{\mathbf{x}-\mathbf{x}\_b} (1 - \theta)^{n - \mathbf{x} + \mathbf{x}\_b}\, \frac{\mathbf{e}^{-\lambda} \lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}.$$

Recall that prior uncertainty about *θ* can be modeled by a beta distribution Be*(α, β)*. The posterior distribution is then given by

$$f(\theta \mid n, \mathbf{x}, \lambda) = \frac{\sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x} - \mathbf{x}\_b} \theta^{\mathbf{x} - \mathbf{x}\_b} (1 - \theta)^{n - \mathbf{x} + \mathbf{x}\_b}\, \frac{\mathbf{e}^{-\lambda} \lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{f(\mathbf{x} \mid n, \lambda)\, B(\alpha, \beta)}, \quad (2.2)$$

where the normalizing constant *f (x* | *n, λ)* in the denominator is

$$f(\mathbf{x} \mid n, \lambda) = \int f(\mathbf{x} \mid n, \theta, \lambda) f(\theta) d\theta. \tag{2.3}$$

The posterior distribution (2.2) cannot be obtained in closed form as the integral characterizing the normalizing constant *f (x* | *n, λ)* is not analytically tractable. However, since it is possible to draw values from the beta distribution, the integral in (2.3) can be computed by Monte Carlo approximation as in (1.30), that is,

$$\hat{f}(\mathbf{x} \mid n, \lambda) = \frac{1}{N} \sum\_{i=1}^{N} f(\mathbf{x} \mid n, \theta^{(i)}, \lambda), \tag{2.4}$$

where *<sup>θ</sup>(i)* <sup>∼</sup> Be*(α, β)*.
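As a sketch of (2.4), the pmf of *X* can be coded as a finite convolution sum and averaged over prior draws of *θ*; the values of *n*, *x*, *λ* and the beta parameters below anticipate those used in Example 2.3:

```r
# pmf of X = Xs + Xb: finite convolution of Bin(n, theta) and Pn(lambda)
fmix <- function(theta, x, n, lambda) {
  xb <- 0:x
  sum(dbinom(x - xb, n, theta) * dpois(xb, lambda))
}
set.seed(11)
N <- 10000
th <- rbeta(N, 1, 45)   # theta^(i) ~ Be(alpha, beta)
fhat <- mean(sapply(th, fmix, x = 28, n = 1000, lambda = 0.001))
fhat                    # Monte Carlo estimate of f(x | n, lambda)
```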

*Example 2.3 (Rice Quality)* Consider a consignment of rice and suppose that it is of interest to assess whether the proportion of cracked grains is below a given level of tolerance. The following competing propositions may be of interest:

*H*1: The proportion *θ* of cracked grains is greater than 0.025.

*H*2: The proportion *θ* of cracked grains is smaller than or equal to 0.025.

In a sample of 1000 grains, a total of 28 cracked grains are observed.

<sup>3</sup> The method for finding the distribution of a sum of random variables is given, for example, in Casella and Berger (2002). It can be used to extend the model to the case of missing counts, an aspect that is not treated here.

```
> n=1000
> x=28
```
The beta prior distribution for *θ* needs to be elicited. Suppose that available knowledge indicates that it is implausible that the proportion of cracked grains is greater than 5%. An asymmetric prior distribution with prior mass concentrated on values lower than 0.05 can be elicited as follows: start with *α* = 1 and *β* = 1, then increment *β* by 1 until the shape of the beta distribution is such that Pr*(θ >* 0.05*)* is small, e.g., equal to 0.1.

```
> a=1
> b=1
> while(pbeta(0.05,a,b,lower.tail=F)>0.1){b=b+1}
> c(a,b,pbeta(0.05,a,b,lower.tail=F))
```
[1] 1.00000000 45.00000000 0.09944026

The parameters *α* and *β* can thus be set equal to 1 and 45, respectively. Figure 2.2 (left) can be obtained with

```
> plot(function(x) dbeta(x,a,b),0,0.1,xlab=expression
+ (theta),ylab=expression(paste(pi)*paste('(')*
+ paste(theta)*paste(')')))
```
The prior odds can now be computed in a straightforward manner.

```
> th0=0.025
> pi1=pbeta(th0,a,b,lower.tail=F)
> prior_odds=pi1/(1-pi1)
> prior_odds
```

```
[1] 0.4706802
```
This value, approximately 0.5, means that the probability of hypothesis *H*2 is, a priori, approximately 2 times greater than the probability of hypothesis *H*1.

Suppose that when inspecting a sample of 1000 rice grains, on average, 1 grain (e.g., an oil seed) is wrongly counted as cracked. The parameter *λ* can thus be taken to be equal to 0.001.

First, we write a function dbinpois that computes the product between a binomial likelihood Bin*(n, θ )* at *x* − *xb* and a Poisson likelihood Pn*(λ)* at *xb*.

```
> dbinpois=function(xb){
+ dbinom((x-xb),n,theta)*dpois(xb,lambda)}
```
The unnormalized posterior distribution in (2.2),

$$\sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x}-\mathbf{x}\_b} \theta^{\mathbf{x}-\mathbf{x}\_b} (1 - \theta)^{n - \mathbf{x} + \mathbf{x}\_b}\, \frac{\mathbf{e}^{-\lambda} \lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}\, \frac{\theta^{\alpha-1} (1 - \theta)^{\beta-1}}{B(\alpha, \beta)},$$

is computed as

```
> lambda=0.001
> xb=matrix(seq(0,x,1),nrow=1)
> incr=0.0001
> thetav=seq(0.0001,0.9999,incr)
> theta=thetav[1]
> s=sum(apply(xb,2,dbinpois))
> upost=dbeta(theta,a,b)*s
> for (i in 2:length(thetav)){
+ theta=thetav[i]
+ s=sum(apply(xb,2,dbinpois))
+ upost=c(upost,dbeta(theta,a,b)*s)
+ }
```

The normalizing constant *f (x* | *n, λ)* can be approximated as in (2.4)

```
> theta=rbeta(1,a,b)
> norm_const=sum(apply(xb,2,dbinpois))
> nn=10000
> for (i in 2:nn){
+ theta=rbeta(1,a,b)
+ s=sum(apply(xb,2,dbinpois))
+ norm_const=norm_const+s
+ }
> norm_const=norm_const/nn
```

and the approximated posterior density, represented in Fig. 2.2 (right), can be obtained as

```
> normpost=upost/norm_const
> plot(thetav,normpost,xlab=expression(paste(theta)),
+ ylab=expression(hat(f)*paste('(')*paste(theta)*
+ paste('|n,x,')*paste(lambda)*paste(')')),
+ xlim=c(0,0.1),type='l')
```

To calculate the BF, we need to obtain the posterior probabilities of the competing propositions *H*<sup>1</sup> and *H*2. Consider proposition *H*2. The (approximate) posterior probability of proposition *H*<sup>2</sup> can be obtained by Monte Carlo integration as

**Fig. 2.2** Left: beta prior distribution Be*(*1*,* 45*)* of the unknown proportion *θ* of cracked grains (Example 2.3). Right: approximated posterior distribution of *θ*, $\hat{f}(\theta \mid n, x, \lambda)$. The gray shaded area shows the posterior probability of the hypothesis *H*<sup>1</sup> (*θ* > 0.025)

$$
\hat{\alpha}\_2 = \frac{1}{\hat{f}(\mathbf{x} \mid n, \lambda)} \int\_0^{\theta\_0} f(\mathbf{x} \mid n, \theta, \lambda) f(\theta) d\theta
$$

$$
= \frac{\theta\_0}{\hat{f}(\mathbf{x} \mid n, \lambda)} \int\_0^{\theta\_0} f(\mathbf{x} \mid n, \theta, \lambda) f(\theta) \frac{1}{\theta\_0} d\theta
$$

$$
\approx \frac{\theta\_0}{\hat{f}(\mathbf{x} \mid n, \lambda)} \cdot \frac{1}{N} \sum\_{i=1}^N f(\mathbf{x} \mid n, \theta^i, \lambda) f(\theta^i),\tag{2.5}
$$

where *θ<sup>i</sup>* is sampled from a uniform distribution in the interval *(*0*, θ*0*)*, *θ<sup>i</sup>* ∼ Unif*(*0*, θ*0*)*, and the normalizing constant *f (x* | *n, λ)* is approximated as in (2.4). The (approximate) posterior probability of hypothesis *H*1 is 1 − *α*ˆ2. The (approximate) BF is then

$$
\widehat{\text{BF}} = \frac{\widehat{\alpha}\_1/\widehat{\alpha}\_2}{\pi\_1/\pi\_2}.\tag{2.6}
$$
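The device used in (2.5), rewriting an integral over *(*0*, θ*0*)* as *θ*0 times an expectation under Unif*(*0*, θ*0*)*, can be checked on a toy integral with a known value:

```r
# Check: integral of exp(-u) over (0, 0.5), exact value 1 - exp(-0.5)
set.seed(2)
th0 <- 0.5
u <- runif(1e5, 0, th0)       # u ~ Unif(0, th0)
est <- th0 * mean(exp(-u))    # th0 times the sample mean of the integrand
c(est, 1 - exp(-0.5))
```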

*Example 2.4 (Rice Quality—Continued)* Consider the scenario described in Example 2.3, and compute the (approximate) posterior probability of the hypothesis *H*2 that the proportion *θ* of cracked grains is smaller than or equal to 0.025, as in (2.5).

```
> m=10000
> theta=runif(m,0,th0)
> alpha2=mean(rowSums(apply(xb,2,dbinpois))
+ *dbeta(theta,a,b))*th0/norm_const
> alpha2
[1] 0.30753
```

The (approximate) posterior probability of hypothesis *H*1 then is *α*ˆ1 = 0.6925. This is highlighted by the gray shaded area in Fig. 2.2 (right). The posterior odds and the BF therefore are

```
> post_odds=(1-alpha2)/(alpha2)
> post_odds
[1] 2.251715
> BF=post_odds/prior_odds
> BF
[1] 4.783959
```
The Bayes factor indicates that the evidence favors hypothesis *H*1, i.e., *θ* > 0.025, over *H*2, i.e., *θ* ≤ 0.025. A BF of approximately 5 provides limited support for hypothesis *H*1. Note that the results obtained by the laboratory analyses clearly affect our beliefs about *θ*: the prior odds in favor of *H*1 (approximately 0.47) become posterior odds of approximately 2.25 in favor of *H*1.

#### **2.2.2.1 Sensitivity to Monte Carlo Approximation**

The Monte Carlo estimate of the Bayes factor obtained in (2.6) is subject to variability, which may be a source of concern. Figure 2.3 provides an illustration of BF variability. The figure shows 500 realizations of the BF approximation in (2.6).

```
> ns=500
> m=10000
> BFs=0
> dbinpois=function(xb){
+ dbinom((x-xb),n,theta)*dpois(xb,lambda)}
> for (j in 1:ns){
+ rthetav=rbeta(m,a,b)
+ norm_const=0
+ for (i in 1:m){
+ theta=rthetav[i]
+ s=sum(apply(xb,2,dbinpois))
+ norm_const=norm_const+s
+ }
+ norm_const=norm_const/m
+ theta=runif(m,0,th0)
+ alpha2=mean(rowSums(apply(xb,2,dbinpois))
+ *dbeta(theta,a,b))*th0/norm_const
+ post_odds=(1-alpha2)/alpha2
+ BFs=c(BFs,post_odds/prior_odds)
+ }
> BFs=BFs[-1]
> hist(BFs,main='',prob=T)
> curve(dnorm(x,mean(BFs),sd(BFs)),lwd=2,add=T)
```
**Fig. 2.3** Histogram of 500 realizations of the BF approximation in (2.6), where the posterior probability of hypothesis *H*<sup>2</sup> is obtained as in (2.5). The solid line represents the fitted Normal density

The purpose of the graphical representation in Fig. 2.3 is to illustrate that the repeated application of the procedure leads to a distribution of BFs. While the Monte Carlo estimate is not an exact value, the approximation error can be made arbitrarily small by generating a sufficiently large number of observations. For a large number of simulations, it can also be shown, by the Central Limit Theorem, that the error | *f*ˆ*(x)* − *f (x)* |, scaled by √*N*, is approximately normally distributed. This can be used to analyze the variability of the Monte Carlo estimate (see, e.g., Marin and Robert (2014)); accordingly, the histogram in Fig. 2.3 is roughly symmetric and bell-shaped.
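The Monte Carlo standard error mentioned above is easily computed from the simulated values themselves; a toy sketch for the integral of *x*<sup>2</sup> over (0, 1), whose exact value is 1/3:

```r
# Monte Carlo estimate of the integral of x^2 over (0, 1) (exact value 1/3)
# together with its Monte Carlo standard error
set.seed(5)
N <- 1e5
h <- runif(N)^2
est <- mean(h)
se <- sd(h) / sqrt(N)   # Monte Carlo standard error of the estimate
c(est, se)
```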

It is worth noting that other, more efficient ways than traditional Monte Carlo methods may be implemented to compute the integrals related to the posterior probabilities of the competing hypotheses. Importance sampling (see Sect. 1.8), for example, can improve the integral approximation. It can also be used when the target density is unnormalized. Consider again the posterior probability of hypothesis *H*2:

$$\alpha\_2 = \int\_0^{\theta\_0} \frac{f(\mathbf{x} \mid n, \theta, \lambda) f(\theta)}{f(\mathbf{x} \mid n, \lambda)} d\theta.$$

This can be rewritten as

$$\begin{aligned} \alpha\_2 &= \frac{1}{f(\mathbf{x} \mid n, \lambda)} \int\_0^1 h(\theta) f(\mathbf{x} \mid n, \theta, \lambda) f(\theta) \frac{g(\theta)}{g(\theta)} d\theta \\\\ &= \frac{1}{f(\mathbf{x} \mid n, \lambda)} \int\_0^1 h(\theta) w(\theta) g(\theta) d\theta, \end{aligned}$$

where

$$h(\theta) = \begin{cases} 1 \text{ if } 0 < \theta < \theta\_0 \\\\ 0 \text{ if } \theta\_0 \le \theta < 1, \end{cases}$$

*w(θ )* = *f (x* | *n, θ , λ)f (θ )/g(θ )* and *g(θ )* is the importance sampling function.

The posterior probability *α*<sup>2</sup> can be approximated as

$$\hat{\alpha}\_2 = \frac{\frac{1}{N} \sum\_{i=1}^{N} h(\theta^i) w(\theta^i)}{\frac{1}{N} \sum\_{i=1}^{N} w(\theta^i)},\tag{2.7}$$

where *<sup>θ</sup><sup>i</sup>* <sup>∼</sup> *g(θ )*.
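Before turning to the rice example, the self-normalized estimator (2.7) can be checked on a toy case with a known answer; the Be*(*2*,* 5*)* target and uniform proposal below are illustrative choices:

```r
# Self-normalized importance sampling: E[theta] under Be(2, 5),
# using a Unif(0, 1) proposal g (exact mean: 2/7)
set.seed(8)
N <- 1e5
th <- runif(N)                     # draws from the proposal g
w <- dbeta(th, 2, 5) / dunif(th)   # importance weights w(theta)
est <- sum(w * th) / sum(w)        # self-normalized estimate
c(est, 2 / 7)
```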

*Example 2.5 (Rice Quality—Continued)* A Be*(*20*,* 780*)* is chosen as importance sampling function *g(θ )*. It can readily be verified that it is centered at 0.025 and that its density rapidly collapses toward zero for values greater than 0.04. This avoids generating points for which the integrand is close to zero and which would contribute very little to the approximation. Next, sample 10000 values from this distribution.

```
> m=10000
> a1=20
> b1=780
> theta=rbeta(m,a1,b1)
```

The posterior probability *α*<sup>2</sup> of hypothesis *H*<sup>2</sup> can be obtained as in (2.7)

```
> fx=rep(0,m)
> fx[theta<th0]=1
> num=mean(rowSums(apply(xb,2,dbinpois))*
+ dbeta(theta,a,b)/dbeta(theta,a1,b1)*fx)
> den=mean(rowSums(apply(xb,2,dbinpois))*
+ dbeta(theta,a,b)/dbeta(theta,a1,b1))
> alpha2=num/den
> alpha2
[1] 0.3079344
> BF=((1-alpha2)/alpha2)/prior_odds
> BF
[1] 4.774886
```
Figure 2.4 provides an illustration of BF variability. Notice that while the BFs in Figs. 2.3 and 2.4 have roughly the same location, the importance sampling in (2.7) produced an increase in precision.

It is important to understand that the resulting distribution does *not* mean that there is a distribution *for a given* BF because the BF, by definition, is a single number. See, e.g., Taroni et al. (2016) and Biedermann et al. (2017a) for discussions of this topic among forensic statisticians and forensic scientists. The error resulting from the implementation of numerical techniques is an important source of information about which the scientist should be transparent. Following ideas presented in Tanner (1996), recently reconsidered by Ommen et al. (2017) in a forensic context, the numerical precision in the overall approximated value can be estimated by the associated Monte Carlo standard error.

**Fig. 2.4** Histogram of 500 realizations of the BF approximation in (2.6), where the posterior probability of hypothesis *H*<sup>2</sup> is obtained as in (2.7). The solid line represents the fitted Normal density

#### **2.2.2.2 Unknown Expected Value of the Number of Background Elements**

It is important to note that, contrary to what was developed in Example 2.3, the expected value *λ* of the number of background events is generally unknown. The uncertainty about *λ* can be modeled by means of a gamma distribution, *λ* ∼ Ga*(a, b)*. The marginal posterior distribution of parameter *θ*, written *f (θ* | *n, x)*, now takes a more complicated form as one needs to handle the joint posterior distribution that is proportional to

$$f(\theta,\lambda \mid n,\mathbf{x}) \propto \sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x}-\mathbf{x}\_b} \theta^{\mathbf{x}-\mathbf{x}\_b} (1-\theta)^{n-\mathbf{x}+\mathbf{x}\_b} \frac{\mathbf{e}^{-\lambda}\lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}\, \lambda^{a-1} \mathbf{e}^{-b\lambda}. \tag{2.8}$$

Following ideas described in Taroni et al. (2010), a two-block M–H algorithm (Sect. 1.8) can be implemented in order to draw a sample from the joint posterior distribution in (2.8). For each block, the candidate generating density is taken to be Normal with the mean equal to the current value of the parameter and the variance chosen so as to obtain a good acceptance rate (Gamerman & Lopes, 2006).
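As a minimal illustration of such a block update, the sketch below runs a random-walk Metropolis–Hastings sampler on the logit scale for a toy Be*(*2*,* 5*)* target; the proposal standard deviation 0.9 is an arbitrary tuning choice:

```r
# Random-walk M-H on the logit scale for a toy Be(2, 5) target
set.seed(31)
n.iter <- 20000
target <- function(theta) dbeta(theta, 2, 5)
jac <- function(psi) exp(psi) / (1 + exp(psi))^2  # Jacobian of theta = e^psi/(1+e^psi)
theta <- 0.5
draws <- numeric(n.iter)
for (i in 1:n.iter) {
  psi <- log(theta / (1 - theta))       # current value on the logit scale
  psi.prop <- rnorm(1, psi, 0.9)        # Normal random-walk proposal
  theta.prop <- exp(psi.prop) / (1 + exp(psi.prop))
  # acceptance ratio of the reparametrized target (includes the Jacobian)
  ratio <- (target(theta.prop) * jac(psi.prop)) / (target(theta) * jac(psi))
  if (runif(1) < ratio) theta <- theta.prop
  draws[i] <- theta
}
mean(draws[-(1:5000)])   # close to the exact mean 2/7 after burn-in
```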

Consider the parameter *θ* first. The full conditional density of *θ* is proportional to

$$f\_1(\theta \mid \lambda, n, \mathbf{x}) \propto \sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x}-\mathbf{x}\_b} \theta^{\mathbf{x}-\mathbf{x}\_b} (1-\theta)^{n-\mathbf{x}+\mathbf{x}\_b} \frac{\lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}.$$

Starting from the current value for *θ*, say *θ(i*−1*)* , a candidate value *θ* prop for *θ* can be obtained as

$$
\theta^{\text{prop}} = \frac{\mathbf{e}^{\psi^{\text{prop}}}}{1 + \mathbf{e}^{\psi^{\text{prop}}}}, \qquad \text{where} \quad \psi^{\text{prop}} \sim \text{N}\left(\psi^{(i-1)}, \tau\_1^2\right),
$$

and

$$\psi^{(i-1)} = \log \frac{\theta^{(i-1)}}{1-\theta^{(i-1)}}.$$

In this way, the proposed value *θ*<sup>prop</sup> will be defined in the interval *(*0*,* 1*)*. The candidate value *θ*<sup>prop</sup> is accepted with probability

$$\alpha(\psi^{(i-1)}, \psi^{\text{prop}}) = \min\left\{1, \frac{f(\psi^{\text{prop}} \mid \lambda^{(i-1)})}{f(\psi^{(i-1)} \mid \lambda^{(i-1)})}\right\},$$

where *f (ψ* | *λ)* is the reparametrized full conditional density of parameter *θ* and can be obtained as

$$f(\psi \mid \lambda) = \frac{\mathbf{e}^{\psi}}{(1 + \mathbf{e}^{\psi})^2}\, f\_1\left( \frac{\mathbf{e}^{\psi}}{1 + \mathbf{e}^{\psi}} \,\Big|\, \lambda, n, \mathbf{x} \right).$$

See, e.g., Casella and Berger (2002) for distributions of functions of random variables.
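The Jacobian entering *f (ψ* | *λ)* can be verified numerically at an arbitrary point, here *ψ* = 0.3:

```r
# Numerical check of the Jacobian d theta / d psi at psi = 0.3
inv_logit <- function(p) exp(p) / (1 + exp(p))
jacobian <- function(p) exp(p) / (1 + exp(p))^2
psi <- 0.3
eps <- 1e-6
numeric_deriv <- (inv_logit(psi + eps) - inv_logit(psi - eps)) / (2 * eps)
c(numeric_deriv, jacobian(psi))   # the two values agree
```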

If the candidate *θ* prop is accepted, it becomes the current value of the chain, i.e., *<sup>θ</sup>(i)* <sup>=</sup> *<sup>θ</sup>* prop; otherwise *<sup>θ</sup>(i)* <sup>=</sup> *<sup>θ</sup>(i*−1*)* .

The second block refers to parameter *λ*. The full conditional density of parameter *λ* is proportional to

$$f\_2(\lambda \mid \theta, n, \mathbf{x}) \propto \sum\_{\mathbf{x}\_b=0}^{\mathbf{x}} \binom{n}{\mathbf{x}-\mathbf{x}\_b} \theta^{\mathbf{x}-\mathbf{x}\_b} (1-\theta)^{n-\mathbf{x}+\mathbf{x}\_b} \frac{\mathbf{e}^{-\lambda}\lambda^{\mathbf{x}\_b}}{\mathbf{x}\_b!}\, \lambda^{a-1}\, \mathbf{e}^{-b\lambda}.$$

Starting from the current value for *λ*, say *λ(i*−1*)* , a candidate value *λ*prop for *λ* can be obtained as

$$
\lambda^{\text{prop}} = \mathbf{e}^{\phi^{\text{prop}}}, \qquad \text{where} \quad \phi^{\text{prop}} \sim \text{N}\left(\phi^{(i-1)}, \tau\_2^2\right),
$$

and *φ*<sup>(*i*−1)</sup> = log *λ*<sup>(*i*−1)</sup>. In this way, the proposed value *λ*<sup>prop</sup> will be defined in the interval *(*0*,*∞*)*. The candidate value *λ*<sup>prop</sup> is accepted with probability

$$\alpha(\phi^{(i-1)}, \phi^{\text{prop}}) = \min\left\{1, \frac{f(\phi^{\text{prop}} \mid \theta^{(i)})}{f(\phi^{(i-1)} \mid \theta^{(i)})}\right\},$$

where *f (φ* | *θ )* is the reparametrized full conditional density of parameter *λ* and can be obtained as

$$f(\phi \mid \theta) = \mathbf{e}^{\phi} f\_2(\mathbf{e}^{\phi} \mid \theta, n, x).$$

If the candidate *λ*prop is accepted, it becomes the current value of the chain, i.e., *<sup>λ</sup>(i)* <sup>=</sup> *<sup>λ</sup>*prop; otherwise *<sup>λ</sup>(i)* <sup>=</sup> *<sup>λ</sup>(i*−1*)* .

The two-block M–H algorithm can be summarized as follows:

*Initialization*: start with arbitrary values *θ*<sup>(0)</sup> and *λ*<sup>(0)</sup>.

*Iteration i*:

1. Given *θ*<sup>(*i*−1)</sup> and *λ*<sup>(*i*−1)</sup>,
   - Generate *θ*<sup>prop</sup> according to *f*1*(θ* | *λ*<sup>(*i*−1)</sup>*, n, x)*.
   - With probability *α(θ*<sup>(*i*−1)</sup>*, θ*<sup>prop</sup>*)* accept *θ*<sup>prop</sup> and set *θ*<sup>(*i*)</sup> = *θ*<sup>prop</sup>; otherwise reject *θ*<sup>prop</sup> and set *θ*<sup>(*i*)</sup> = *θ*<sup>(*i*−1)</sup>.
2. Given *θ*<sup>(*i*)</sup> and *λ*<sup>(*i*−1)</sup>,
   - Generate *λ*<sup>prop</sup> according to *f*2*(λ* | *θ*<sup>(*i*)</sup>*, n, x)*.
   - With probability *α(λ*<sup>(*i*−1)</sup>*, λ*<sup>prop</sup>*)* accept *λ*<sup>prop</sup> and set *λ*<sup>(*i*)</sup> = *λ*<sup>prop</sup>; otherwise reject *λ*<sup>prop</sup> and set *λ*<sup>(*i*)</sup> = *λ*<sup>(*i*−1)</sup>.

*Return* {*θ*<sup>(*n<sub>b</sub>*+1)</sup>*,...,θ*<sup>(*N*)</sup>} and {*λ*<sup>(*n<sub>b</sub>*+1)</sup>*,...,λ*<sup>(*N*)</sup>}, where *n<sub>b</sub>* is the burn-in period and *N* is the number of iterations.

*Example 2.6 (Rice Quality—Continued)* Consider again Example 2.3 where prior uncertainty about *θ* was modeled by a Be*(*1*,* 45*)* distribution, and the parameter *λ* was set equal to 0.001. For the purpose of the example here, a gamma distribution with parameters *a* = 2 and *b* = 1000 is used to model prior uncertainty about *λ*. The prior density Ga*(*2*,* 1000*)* is shown in Fig. 2.5. It can be observed that the prior mass is concentrated at very small values of *λ*.

```
> n=1000
> x=28
> ag=2
> bg=1000
> plot(function(x) dgamma(x,ag,bg),0,0.01,
+ xlab=expression(lambda),ylab=expression(paste('f(')*
+ paste(lambda)*paste(')')))
```
Let the starting values for *θ* and *λ* be *θ*<sup>(0)</sup> = 0.1 and *λ*<sup>(0)</sup> = 0.001, and the variances *τ*<sub>1</sub><sup>2</sup> and *τ*<sub>2</sub><sup>2</sup> of the proposal densities be set equal to 0.7 and 3, respectively.


Current values of the parameters *θ* and *λ* will be stored in vectors called thetav and lambdav, respectively.

```
> theta=0.1
> lambda=0.001
> tau=c(0.7,3)
> thetav=theta
> lambdav=lambda
```


Before running the algorithm, it is useful to introduce the following functions: mh1 computes the inverse logit transform and is used to obtain the candidate (or current) value *θ*<sup>prop</sup> (*θ*<sup>curr</sup>); mh2 computes the Jacobian term that enters the acceptance probability of the candidate value *θ*<sup>prop</sup>; dbinpois computes the product of a binomial likelihood Bin*(n, θ )* at *x* − *x<sub>b</sub>* and a Poisson likelihood Pn*(λ)* at *x<sub>b</sub>*.

```
> mh1=function(x){x/(1+x)}
> mh2=function(x){x/((1+x)^2)}
> dbinpois=function(xb){
+ dbinom((x-xb),n,theta)*dpois(xb,lambda)}
```
The MCMC algorithm is run over 15000 iterations, with a burn-in period of 5000 iterations.

```
> n.iter=15000
> acct=n.iter
> accl=n.iter
> burn.in=5000
> for (i in 1:n.iter){
+ psicurr=log(theta/(1-theta))
+ s=sum(apply(xb,2,dbinpois))
+ pipsicurr=mh2(exp(psicurr))*dbeta(theta,a,b)*s
+
+ # generate the candidate value of parameter theta
+
+ psiprop=rnorm(1,psicurr,tau[1])
+ theta=mh1(exp(psiprop))
+ s=sum(apply(xb,2,dbinpois))
+ pipsiprop=mh2(exp(psiprop))*dbeta(theta,a,b)*s
+
+ # acceptance/rejection of the candidate value
+ # (parameter theta)
+
+ if(runif(1)>pipsiprop/pipsicurr){
+ theta=mh1(exp(psicurr))
+ acct=acct-1}
+ thetav=c(thetav,theta)
+
+ # generate the candidate value of parameter lambda
+
+ phicurr=log(lambda)
+ s=sum(apply(xb,2,dbinpois))
+ piphicurr=exp(phicurr)*dgamma(lambda,ag,bg)*s
+ phiprop=rnorm(1,phicurr,tau[2])
+ lambda=exp(phiprop)
+ s=sum(apply(xb,2,dbinpois))
+ piphiprop=exp(phiprop)*dgamma(lambda,ag,bg)*s
+
+ # acceptance/rejection of the candidate value
+ # (parameter lambda)
+
+ if(runif(1)>piphiprop/piphicurr){
+ lambda=exp(phicurr)
+ accl=accl-1}
+ lambdav=c(lambdav,lambda)
+ }
> c(acct/n.iter,accl/n.iter)
[1] 0.3102000 0.2973333
```
These values represent the acceptance rates for *θ* and *λ*, respectively.

The output of the simulation run is shown in Fig. 2.6, representing the trace-plot, the autocorrelation plot (showing the correlation structure of the sequences), and the histogram of the simulated draws for *θ* (left column) and *λ* (right column). The simulated draws have an acceptance rate of approximately 31% for *θ* and 30% for *λ*. The trace-plots of simulated draws look like random noise and the autocorrelation decreases rapidly as the time lag at which it is calculated increases.

```
> par(mfrow=c(3,2))
> plot(thetav,type='l',xlab='Iterations',ylab=
+ expression(paste(theta)),main=expression(paste
+ (theta)))
> plot(lambdav,type='l',xlab='Iterations',ylab=
+ expression(paste(lambda)),main=expression(paste
+ (lambda)))
> acf(thetav[-c(1:burn.in)],type="correlation",ci=0,
+ main=expression(paste(theta)),ylab='')
> acf(lambdav[-c(1:burn.in)],type="correlation",ci=0,
+ main=expression(paste(lambda)),ylab='')
```

Note that the argument ci=0 in the function acf for computing and plotting the estimate of the autocorrelation function suppresses the plot of the confidence interval.
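The autocorrelation structure can also be summarized by an effective sample size. A common rough estimate, sketched below under the assumption that the default number of lags returned by acf is adequate, divides the chain length by one plus twice the sum of the autocorrelations; for roughly independent draws it is close to the chain length:

```r
# Rough effective sample size from the estimated autocorrelations
ess <- function(x) {
  rho <- acf(x, plot = FALSE)$acf[-1]   # sample autocorrelations, lags >= 1
  length(x) / (1 + 2 * sum(rho))
}
set.seed(17)
ess(rnorm(10000))   # i.i.d. draws: close to the chain length
```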

**Fig. 2.6** MCMC diagnostic with trace-plots of simulated draws of *θ* (top left) and *λ* (top right), autocorrelation plots over the last 10000 iterations (center) and histograms over the last 10000 iterations (bottom)

The simulated values *θ*<sup>(*n<sub>b</sub>*+1)</sup>*,...,θ*<sup>(*N*)</sup> can serve as draws from the marginal posterior distribution *f (θ* | *n, x)*. The posterior probability of hypothesis *H*<sup>1</sup> can then be approximated as

62 2 Bayes Factor for Model Choice

$$\widehat{\alpha}\_1 = \frac{1}{N - n\_b} \sum\_{i=n\_b+1}^{N} \mathbf{1}\{\theta^{(i)} > 0.025\}, \tag{2.9}$$

i.e., the proportion of retained draws exceeding the threshold,

and the BF can be obtained straightforwardly.

*Example 2.7 (Rice Quality—Continued)* Using a burn-in period of 5000 iterations, the average value of parameter *θ* over the last 10000 iterations can be computed as

```
> thetahat=mean(thetav[-c(1:burn.in)])
> thetahat
[1] 0.02788516
```
The posterior probability of hypothesis *H*<sup>1</sup> can be approximated as in (2.9):

```
> alpha1=sum(thetav[-c(1:burn.in)]>th0)/
+ (n.iter-burn.in)
> alpha1
[1] 0.71
> post_odds=alpha1/(1-alpha1)
> post_odds
[1] 2.448276
```
Recall that the prior odds have been quantified previously as approximately 0.47. The Bayes factor then is

```
> post_odds/prior_odds
[1] 5.201569
```
The uncertainty about the presence of background elements, modeled by *λ*, modifies the value of the BF from approximately 4.77 to 5.2. This change is small. The BF still provides only weak support for the hypothesis *H*1 that *θ* > 0.025, compared to *H*2.

#### *2.2.3 Decision for a Proportion*

The normative framework for decision-making introduced in Chap. 1 is well suited for addressing problems of statistical inference presented in this chapter. Consider again a pair of competing propositions as defined in Sect. 2.2 regarding the question of whether the proportion of items showing a target characteristic of interest is

greater (*H*1) or not greater (*H*2) than a given threshold *θ*0. From a decision-theoretic point of view, two courses of action are possible: *d*<sup>1</sup> and *d*2. Decision *d*<sup>1</sup> amounts to accepting the view that the proportion *θ* is greater than a given (legal) threshold, *θ*0. Decision *d*<sup>2</sup> amounts to accepting the view that *θ* is smaller than or equal to the threshold *θ*0. A possible loss function L*(*·*)* for such a two-action decision problem is

$$\mathcal{L}(d\_1,\theta) = \begin{cases} 0 & \text{if } \theta \in \Theta\_1, \\\\ l\_1(\theta\_0 - \theta) & \text{if } \theta \in \Theta\_2, \end{cases} \qquad \mathcal{L}(d\_2,\theta) = \begin{cases} 0 & \text{if } \theta \in \Theta\_2, \\\\ l\_2(\theta - \theta\_0) & \text{if } \theta \in \Theta\_1. \end{cases} \tag{2.10}$$

This is a linear loss function where the loss is proportional to the magnitude of the error (e.g., *θ*0 − *θ*). An example is shown in Fig. 2.7, where *θ*0 = 0.2 and the loss values *l*1 and *l*2 are equal to 1.

Given this loss function, the Bayesian posterior expected loss for *d*1, that is, for accepting *H*1 : *θ > θ*0, is

$$\mathrm{EL}(d\_1 \mid \mathbf{x}) = \int\_{\Theta\_2} l\_1 \theta\_0 f(\theta \mid \mathbf{x}) \mathrm{d}\theta - \int\_{\Theta\_2} l\_1 \theta f(\theta \mid \mathbf{x}) \mathrm{d}\theta,$$

where *f (θ* | *x)* = Be*(α*<sup>∗</sup> = *α* + *x,β*<sup>∗</sup> = *β* + *n* − *x)*. Similarly, the Bayesian posterior expected loss for *d*2, that is accepting *H*<sup>2</sup> : *θ* ≤ *θ*0, is

$$\mathrm{EL}(d\_2 \mid \mathbf{x}) = \int\_{\Theta\_1} l\_2 \theta f(\theta \mid \mathbf{x}) \mathrm{d}\theta - \int\_{\Theta\_1} l\_2 \theta\_0 f(\theta \mid \mathbf{x}) \mathrm{d}\theta.$$

After some algebra, it can be shown (Taroni et al., 2010) that

$$\text{EL}(d\_1 \mid \mathbf{x}) = l\_1 \theta\_0 \Pr(\theta < \theta\_0 \mid \alpha^\*, \beta^\*) - l\_1 \frac{\alpha + x}{\alpha + \beta + n} \Pr(\theta < \theta\_0 \mid \alpha^\* + 1, \beta^\*), \tag{2.11}$$

and

$$\text{EL}(d\_2 \mid \mathbf{x}) = l\_2 \frac{\alpha + x}{\alpha + \beta + n} \Pr(\theta > \theta\_0 \mid \alpha^\* + 1, \beta^\*) - l\_2 \theta\_0 \Pr(\theta > \theta\_0 \mid \alpha^\*, \beta^\*). \tag{2.12}$$

The decision criterion then is to decide *d*1 (*d*2) whenever EL*(d*1 | *x)* is smaller (greater) than EL*(d*2 | *x)*.

*Example 2.8 (Counterfeit Medicines—Continued)* Recall Example 2.1, where the competing propositions refer to the proportion of counterfeit medicines being either greater or not greater than a given limiting value, e.g., *θ*0 = 0.2. Consider a uniform prior Be*(*1*,* 1*)* for *θ* and the finding that 12 out of 40 items are positive. Consider a linear loss function as in (2.10), with *l*1 = 1 and *l*2 = 1. This is a symmetric loss, reflecting the idea that falsely deciding that the proportion is greater than the threshold is as undesirable, and hence as severely penalized, as falsely deciding that the proportion is smaller than the threshold. The expected losses of decisions *d*1 and *d*2 are computed as in (2.11) and (2.12).

```
> th0=0.2
> a=1
> b=1
> n=40
> x=12
> l1=1
> l2=1
> ax=(a+x)/(a+b+n)
> ELd1=l1*th0*pbeta(th0,a+x,b+n-x)-
+ l1*ax*pbeta(th0,a+x+1,b+n-x)
> ELd2=l2*ax*pbeta(th0,a+x+1,b+n-x,lower.tail=F)-
+ l2*th0*pbeta(th0,a+x,b+n-x,lower.tail=F)
> c(ELd1,ELd2)
[1] 0.001207984 0.110731793
```

The optimal decision thus is *d*1, since it minimizes the expected loss. Given prior beliefs, the observed data, and personal loss assignments, the optimal course of action is to decide in favor of proposition *H*<sup>1</sup> according to which the proportion of counterfeit medicines is greater than 0.2.

A decision maker may find a "0 − *l<sub>i</sub>*" loss function, as shown in Table 1.4, more appropriate. Consider again the case discussed in Sect. 2.2.1, where it was of interest to compare the hypotheses that the proportion of counterfeit medicines in a seizure was greater (*H*1) or not greater (*H*2) than a given threshold *θ*0. In such a context, the loss *l*1 (i.e., the loss incurred when deciding *d*1 while *H*2 is true) could amount to the net loss represented by the expenses incurred by issuing legal proceedings in a non-priority case (i.e., falsely considering *θ > θ*0). In turn, the loss *l*2 could amount to the monetary value of property that could have been confiscated by investigative authorities in a meritorious case. Following the results in Sect. 1.9, the decision criterion becomes

$$\text{decide } d\_1 \text{ if } \quad \frac{\alpha\_1}{\alpha\_2} > \frac{l\_1}{l\_2} \quad \text{or} \quad \text{BF} > \frac{l\_1/l\_2}{\pi\_1/\pi\_2}.$$

Decision *d*<sup>1</sup> is to be preferred to decision *d*<sup>2</sup> if and only if the posterior odds in favor of *H*<sup>1</sup> are greater than the ratio of the losses of adverse outcomes or, alternatively, if the BF is greater than the ratio between the loss ratio of adverse outcomes and the prior odds.

Decision makers may find it difficult to assign losses *l*<sup>1</sup> and *l*2. Note, however, that when adverse outcomes are considered equally undesirable, then the loss ratio simplifies to 1, and the decision criterion becomes to decide *d*<sup>1</sup> whenever the posterior odds are larger than 1, i.e., the posterior probability of hypothesis *H*<sup>1</sup> is greater than the posterior probability of hypothesis *H*2. In turn, when adverse consequences are not equally undesirable, a decision maker may consider how much more (less) undesirable one adverse outcome is compared to the other. This can be expressed as *l*<sup>1</sup> = *kl*2, i.e., by specifying how much worse deciding *d*<sup>1</sup> is when *θ* ≤ *θ*<sup>0</sup> is true, compared to deciding *d*<sup>2</sup> when *θ>θ*<sup>0</sup> is true (Biedermann et al., 2016b). A sensitivity analysis can be performed for different values of *k*.
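Such a sensitivity analysis can be sketched with the data of Example 2.8 (Be*(*1*,* 1*)* prior, 12 positives out of 40, *θ*0 = 0.2); the grid of *k* values below is arbitrary:

```r
# Sensitivity of the decision to k = l1/l2 (data from Example 2.8)
a <- 1; b <- 1; n <- 40; x <- 12; th0 <- 0.2
alpha1 <- pbeta(th0, a + x, b + n - x, lower.tail = FALSE)  # Pr(theta > th0 | x)
post_odds <- alpha1 / (1 - alpha1)
k <- c(0.5, 1, 2, 5, 10)      # arbitrary grid: how much worse l1 is than l2
decide_d1 <- post_odds > k    # decide d1 whenever the posterior odds exceed k
rbind(k, decide_d1)
```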

#### **2.3 Normal Mean**

Toxicology laboratories are frequently asked to quantify the amount of target substance (e.g., alcohol, illegal drugs, particular metabolites, etc.) in samples such as blood, urine, and hair in order to help assess whether an unknown target quantity *θ* (e.g., the level of alcohol in blood) exceeds a given value (e.g., a legal threshold). Competing propositions of interest may be specified as follows:

*H*1: The target quantity *θ* exceeds a given level *θ*0.

*H*2: The target quantity *θ* is equal to or smaller than a given level *θ*0.

This section considers three main topics: (1) inference about an unknown quantity *θ* (Sect. 2.3.1), (2) inference about *θ* in the presence of factors influencing the measurement process (Sect. 2.3.2), and (3) decision about competing propositions regarding *θ* (Sect. 2.3.3).

#### *2.3.1 Inference About a Normal Mean*

Consider the hypothetical case of a person, Mr. X, stopped by traffic police because of suspicion of driving under the influence of a given substance (e.g., alcohol or THC). A blood sample is taken and a series of analyses are performed by a forensic laboratory. The propositions of interest may be, for example, that "The quantity *θ* of target substance in Mr. X's blood exceeds the legal threshold *θ*0" (*H*1) versus the alternative proposition "The quantity *θ* of target substance in Mr. X's blood is smaller than or equal to the legal threshold *θ*0" (*H*2). A series of measurements *x* are obtained. It is often reasonable to assume that such measurements follow a Normal distribution N*(θ , σ*2*)*:

$$f(\mathbf{x} \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{x} - \theta)^2\right\},$$

where the mean *θ* is the unknown quantity of target substance. The variance *σ*<sup>2</sup> can be approximated from previous ad hoc calibrations (see discussion by Howson and Urbach (1996)). The most common prior distribution for the Normal mean *θ* is itself a Normal distribution N*(μ, τ* <sup>2</sup>*)*:

$$f(\theta \mid \mu, \tau^2) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left\{-\frac{1}{2\tau^2}(\theta - \mu)^2\right\},$$

where the hyperparameters *μ* and *τ* <sup>2</sup> are often called *prior mean* and *prior variance*, respectively.

The posterior distribution of the target quantity *θ* is still a Normal distribution, denoted N*(μx , τ* <sup>2</sup> *<sup>x</sup> )*, because the Normal prior and the Normal likelihood are conjugate. Generalizing the updating formulae (1.19) and (1.20) to the case where a vector of *n* measurements *(x*1*,...,xn)* is available leads to

$$
\mu\_x = \frac{\sigma^2/n}{\sigma^2/n + \tau^2}\mu + \frac{\tau^2}{\sigma^2/n + \tau^2}\bar{x}\tag{2.13}
$$


and

$$
\tau\_x^2 = \frac{\tau^2 \sigma^2 / n}{\sigma^2 / n + \tau^2},
\tag{2.14}
$$

where $\bar{x} = \sum\_{i=1}^{n} x\_i/n$.

The posterior mean *μx* and the posterior variance *τ* <sup>2</sup> *<sup>x</sup>* can be calculated by means of the function post\_distr.

```
> post_distr=function(sigma,n,barx,pm,pv){
+ # sigma: population variance sigma^2; n: sample size
+ # barx: sample mean; pm, pv: prior mean and prior variance
+ postm=(pm*sigma/n+barx*pv)/(sigma/n+pv)
+ postv=(pv*sigma/n)/(sigma/n+pv)
+ op=c(postm,postv)
+ return(op)}
```
The prior odds, the posterior odds, and the Bayes factor can be easily computed, as discussed in Sect. 1.4, by means of standard routines (see Example 2.9). The case where the population variance *σ*<sup>2</sup> is unknown and a prior distribution must be specified for both parameters *(θ , σ*2*)* will be addressed in Sect. 3.3.2.

*Example 2.9 (Alcohol Concentration in Blood)* A person is stopped by traffic police because of suspicion of driving under the influence of alcohol. Two measurements are obtained by the laboratory, 0.4866 g/kg and 0.5078 g/kg. The population variance *σ*<sup>2</sup> is known and is taken to be equal to 0*.*023<sup>2</sup>. Available information, e.g., the fact that the person has been stopped by traffic police while driving late in the night, exceeding the speed limit, etc., suggests a prior mean equal to 0.8 and a prior variance equal to 0*.*15<sup>2</sup>, say *θ* ∼ N*(μ* = 0*.*8*, τ* <sup>2</sup> = 0*.*15<sup>2</sup>*)*. This amounts to saying that, a priori, values for the alcohol level in blood lower than 0.35 and larger than 1.25 are considered extremely implausible (prior probabilities for values outside this range are on the order of 0.01).

The propositions of interest are the following:

*H*1: The quantity *θ* of alcohol in the person's blood exceeds the legal threshold *θ*<sup>0</sup> = 0*.*5 g/kg.

*H*2: The quantity *θ* of alcohol in the person's blood is smaller than or equal to the legal threshold *θ*<sup>0</sup> = 0*.*5 g/kg.


The prior odds can be easily computed as follows:

```
> th0=0.5
> pm=0.8
> pv=0.15^2
> pi1=pnorm(th0,pm,sqrt(pv),lower.tail=F)
> prior_odds=pi1/(1-pi1)
> prior_odds
```

```
[1] 42.95579
```
The probability of hypothesis *H*<sup>1</sup> is, a priori, approximately 43 times greater than the probability of the alternative hypothesis *H*2. Consider now the effect of the measurements made on the blood sample.

```
> x=c(0.4866,0.5078)
> s2=0.023^2
> postm=post_distr(s2,length(x),mean(x),pm,pv)[1]
> postm
[1] 0.5007182
> postv=post_distr(s2,length(x),mean(x),pm,pv)[2]
> postv
[1] 0.0002614268
```
The posterior distribution of the quantity of alcohol in blood *θ* is, therefore, N*(*0*.*5007*,* 0*.*00026*)*. The posterior odds are

```
> alpha1=pnorm(th0,postm,sqrt(postv),lower.tail=F)
> post_odds=alpha1/(1-alpha1)
> post_odds
```

```
[1] 1.073465
```
The ratio between posterior and prior odds gives the Bayes factor:

```
> BF=post_odds/prior_odds
> BF
```

```
[1] 0.02498999
```
The probability of obtaining the two measurements if Mr. X's alcohol level in blood does *not* exceed the legal threshold *θ*<sup>0</sup> = 0*.*5 is approximately 40 times greater than the probability of obtaining them if the blood alcohol level is greater than the legal threshold. The evidence thus provides moderate support for the hypothesis *H*2, compared to *H*1.

#### **2.3.1.1 Choosing the Parameters of the Normal Prior for the Mean**

If the experimenter has no reason to consider the distribution describing prior uncertainty about the unknown quantity *θ* to be asymmetric, then a choice may be made in the family of Normal distributions. When choosing a member from this family, the analyst will need to assign a value to the prior mean *μ* and a value to the prior standard deviation *τ* . To elicit a Normal prior, it is useful to recall that for a Normal distribution *<sup>θ</sup>* <sup>∼</sup> <sup>N</sup>*(μ, τ* <sup>2</sup>*)*, approximately 99*.*7% of values lie within 3 standard deviations of the mean, thus

$$\Pr\left\{\mu - 3\tau \le \theta \le \mu + 3\tau\right\} \approx 0.997.$$

Hence, if the practitioner can assign a measure of location *μ* and a pair of values that define the upper and lower bounds of an interval that covers a range of plausible values of the unknown quantity *θ*, then the standard deviation can be assigned as

$$
\tau = \frac{l\_{\text{up}} - \mu}{3},
\tag{2.15}
$$

where *l*up is the upper bound mentioned above. In Example 2.9, a prior location was fixed at *μ* = 0*.*8. Moreover, prior probabilities for values smaller than 0.35 and greater than 1.25 were extremely small (i.e., on the order of 0.01). The standard deviation was then elicited as in (2.15): *τ* = *(*1*.*25 − 0*.*8*)/*3 = 0*.*15.

It is worth inspecting the reasonableness of the elicited prior. This includes, as highlighted in Sect. 1.10, producing a graphical representation to see whether the amount of available information is suitably conveyed. Consider a random sample of size *ne* from a Normal population that provides an amount of information equivalent to that conveyed by the prior. The equivalent sample size *ne* can be found by matching the prior variance *τ* <sup>2</sup> to the dispersion of the sample mean, *σ*<sup>2</sup>*/ne*, and solving for *ne*. The smaller *ne*, the weaker the prior beliefs, and the more the posterior distribution will be influenced by even a modest amount of data. Vice versa, the larger *ne*, the stronger the prior beliefs, and the more the posterior distribution will be dominated by the prior; more data will then be necessary to make a substantial impact on prior beliefs.
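For the prior used in Example 2.9 this check is immediate; a minimal sketch:

```r
# Equivalent sample size n_e implied by matching tau^2 = sigma^2/n_e (Example 2.9)
s2 <- 0.023^2    # population variance sigma^2 (known)
pv <- 0.15^2     # prior variance tau^2
ne <- s2/pv      # n_e = sigma^2/tau^2
ne               # about 0.024: the prior carries far less information than one observation
```

The prior is thus very weak, which is why the posterior in Example 2.9 is dominated by the two measurements.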

Whenever the state of information is such as to consider all possible values of *θ* equally plausible, a locally uniform prior can be defined:

$$f(\theta) \propto \text{constant.}$$

In the latter case, the posterior distribution of *θ* is a Normal distribution centered at the sample mean *<sup>x</sup>*¯ with spread parameter equal to *<sup>σ</sup>*2*/n* (e.g., Bolstad & Curran, 2017).
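With the data of Example 2.9, the posterior parameters under a locally uniform prior reduce to the sample quantities; a minimal sketch:

```r
# Posterior under a locally uniform prior: N(xbar, sigma^2/n) (data from Example 2.9)
x <- c(0.4866, 0.5078)
s2 <- 0.023^2
postm_flat <- mean(x)        # posterior mean = sample mean
postv_flat <- s2/length(x)   # posterior variance = sigma^2/n
c(postm_flat, postv_flat)
```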

#### **2.3.1.2 Sensitivity to the Choice of the Prior Distribution**

As noted in Sect. 1.11, the marginal likelihood is highly sensitive to the choice of the prior distribution and so is the Bayes factor. Thus, it should be emphasized that the BF obtained in Example 2.9, the value 0.02, does not depend on the data alone. It also depends on the choice of the prior distribution on *θ*.

For the purpose of illustration, consider a sensitivity analysis for the hyperparameters that characterize the prior distribution for the unknown level of alcohol in blood. Let values of *μ* range from 0.4 to 1 and the prior variance *τ* <sup>2</sup> be fixed and equal to 0*.*0225.

```
> pm=seq(0.4,1,0.01)
> pv=0.0225
```
The prior odds, the posterior odds, and the BF can be calculated for all possible values of the prior mean *μ* (pm). Note that computing the posterior Normal distribution with the function post\_distr, using the 61 possible values for the prior mean *μ*, returns an output vector of length 62 whose first 61 elements represent the posterior means, while the last element represents the posterior variance.

```
> th0=0.5
> pi1=pnorm(th0,pm,sqrt(pv),lower.tail=F)
> prior_odds=pi1/(1-pi1)
> x=c(0.4866,0.5078)
> s2=0.023^2
> postm=
+ post_distr(s2,length(x),mean(x),pm,pv)[1:length(pm)]
> postv=post_distr(s2,length(x),mean(x),pm,pv)
+ [length(pm)+1]
> alpha1=pnorm(th0,postm,sqrt(postv),lower.tail=F)
> post_odds=alpha1/(1-alpha1)
> BF=post_odds/prior_odds
```
Figure 2.8 shows the prior probability *π*<sup>1</sup> of proposition *H*1, the posterior probability *α*1, and the BF in favor of proposition *H*<sup>1</sup> for values of the prior mean *μ* ranging from 0.4 to 1.

```
> plot(pm,BF,type='l',ylim=c(0,max(pi1,alpha1,BF)),
+ xlim=range(pm),xlab=expression(paste(mu)),ylab='')
```

```
> lines(pm,pi1,lty=4)
> lines(pm,alpha1,lty=2)
> leg=expression(paste('BF'),paste(pi)[1],paste(alpha)
+ [1])
> legend(0.85,1.92,leg,lty=c(1,4,2))
```
Note that the BF favors proposition *H*<sup>1</sup> (i.e., a BF greater than 1) over *H*<sup>2</sup> only for values of *μ* smaller than 0*.*47. Most importantly, one can observe the impact of the prior assessments (i.e., different choices of the prior mean *μ*) on the value of the BF. The higher the prior probability of proposition *H*1, the lower is the value of the measurements *x* = *(*0*.*4866*,* 0*.*5078*)* in terms of the BF in favor of *H*<sup>1</sup> over *H*2. Note, however, that for the largest values of *μ* considered, the BF represents strong support for *H*<sup>2</sup> over *H*1.

**Fig. 2.8** Sensitivity analysis of the prior probability *π*<sup>1</sup> (dot-dashed line), posterior probability *<sup>α</sup>*<sup>1</sup> (dashed line), and BF (solid line) for values of *<sup>μ</sup>* ranging from 0*.*4 to 1 and *<sup>τ</sup>* <sup>2</sup> <sup>=</sup> <sup>0</sup>*.*<sup>0225</sup> (Example 2.9). Note that for a BF of 1 (dotted line), the lines of the prior and posterior probabilities intersect

#### *2.3.2 Continuous Measurements Affected by Errors*

As noted in Sect. 2.2.2, a measurement process or observations may be affected by background noise. Consider a case in which it is of interest to assess the height of an individual based on video recordings made by a surveillance camera during a bank robbery. Propositions of interest may be as follows:

*H*1: The height of the individual is less than 180 cm.

*H*2: The height of the individual is equal to or greater than 180 cm.

Assume that the height measurements *x* of an individual are normally distributed, *<sup>X</sup>* <sup>∼</sup> <sup>N</sup>*(θ , σ*2*)*, where *<sup>θ</sup>* represents the true height of the individual and *<sup>σ</sup>*<sup>2</sup> represents the variance of the measurement device. Assume also that the variance *σ*<sup>2</sup> is inferred from previous ad hoc experiments. However, the measured height is, generally, affected by an error *ξ* , related to the circumstances under which the recording was made. Factors of interest here include the posture and movements of the person, the type of clothing (including headwear and shoes) and lighting conditions. Such circumstances represent a further source of variation *δ*2, unrelated to *σ*2. The measured height is therefore *<sup>X</sup>* <sup>∼</sup> <sup>N</sup>*(θ* <sup>+</sup> *ξ,σ*<sup>2</sup> <sup>+</sup> *<sup>δ</sup>*2*)*. A conjugate Normal prior distribution N*(μ, τ* <sup>2</sup>*)* is taken to model prior uncertainty about *θ*. The values of the parameters *ξ* and *δ*<sup>2</sup> are case-specific assignments. It can be shown that the posterior distribution of the true height *θ* is still Normal with mean

$$\mu\_x = \frac{\tau^2(\bar{x} - \xi) + \mu(\sigma^2 + \delta^2)/n}{\tau^2 + (\sigma^2 + \delta^2)/n} \tag{2.16}$$

and variance

$$
\tau\_x^2 = \frac{\tau^2(\sigma^2 + \delta^2)/n}{\tau^2 + (\sigma^2 + \delta^2)/n}. \tag{2.17}
$$

*Example 2.10 (Image Analysis)* Consider the hypothetical case introduced above and assume that, according to eyewitness testimony, the height of the perpetrator is approximately between 175 cm and 185 cm. This allows one to define a prior probability distribution for the height *θ* centered at 180 cm with variance equal to 2.79 cm<sup>2</sup>, i.e., *θ* ∼ N*(*180*,* 2*.*79*)*. The standard deviation can be quantified as in (2.15):
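The printed listing with the hyperparameter assignments appears to be missing here. A reconstruction consistent with the surrounding text (pm and pv as used by the subsequent code; the text's value 2.79 presumably results from rounding *τ* to 1.67) is:

```r
# Hyperparameters for Example 2.10 (reconstructed from the surrounding text)
pm <- 180               # prior mean: center of the eyewitness range (175, 185)
pv <- ((185-180)/3)^2   # tau elicited as in (2.15); tau^2 is about 2.78
pv
```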



Thus, the two hypotheses *H*<sup>1</sup> and *H*<sup>2</sup> introduced above are, a priori, equally probable (hence, the prior odds equal 1).

```
> th0=180
> pi1=pnorm(th0,pm,sqrt(pv))
> prior_odds=pi1/(1-pi1)
> prior_odds
[1] 1
```

The available recordings depict an individual appearing in *n* = 10 images. Height measurements yield the sample mean *x*¯ = 180*.*25. The variance of the measurement procedure is known and equal to *<sup>σ</sup>*<sup>2</sup> <sup>=</sup> <sup>0</sup>*.*12. The experimental setting is such that the values for the parameters of the Normal distribution of the error can be set to *<sup>ξ</sup>* <sup>=</sup> <sup>0</sup>*.*5 and *<sup>δ</sup>*<sup>2</sup> <sup>=</sup> 1.

```
> mx=180.25
> n=10
> s2=0.12
> xi=0.5
> d2=1
```
The posterior mean and the posterior variance of *θ* can be computed as in (2.16) and (2.17), respectively.

```
> postm=(pv*(mx-xi)+pm*(s2+d2)/n)/(pv+(s2+d2)/n)
> postm
[1] 179.7597
> postv=(pv*(s2+d2)/n)/(pv+(s2+d2)/n)
> postv
```

```
[1] 0.1076592
```
The gray shaded area in Fig. 2.9 shows the posterior probability of the hypothesis *H*1. The posterior odds and the Bayes factor can be obtained straightforwardly:

```
> alpha1=pnorm(th0,postm,sqrt(postv))
> post_odds=alpha1/(1-alpha1)
> post_odds
[1] 3.311039
> BF=post_odds/prior_odds
> BF
[1] 3.311039
```
#### *Example 2.10* (continued)

Given that the prior odds are 1, the BF is numerically equivalent to the posterior odds. This value represents support for the hypothesis *H*<sup>1</sup> (the height of the individual is lower than 180 cm) over *H*2. Specifically, the BF indicates that it is approximately 3 times more probable to obtain such height measurements if the height of the individual is less than 180 cm than if the height is equal to or greater than 180 cm.

#### *2.3.3 Decision for a Mean*

The previous sections focused on how to draw a probabilistic inference about a Normal mean, using the Bayes factor. Recall that the competing propositions were:

*H*1: The target quantity *θ* exceeds a given level *θ*0.

*H*2: The target quantity *θ* is equal to or smaller than a given level *θ*0.

A related question is how to *decide* about whether or not a quantity of interest is above a given (legal) threshold, i.e., accepting either *H*<sup>1</sup> or *H*2. In order to address this question, it is necessary to introduce a loss function to take into account the decision maker's preferences. Suppose a linear loss function is considered as in (2.18):

$$\mathcal{L}(d\_1, \theta) = \begin{cases} 0 & \text{if } \theta > \theta\_0, \\\\ l\_1(\theta\_0 - \theta) \text{ if } \theta \le \theta\_0. \end{cases} \quad \mathcal{L}(d\_2, \theta) = \begin{cases} 0 & \text{if } \theta \le \theta\_0, \\\\ l\_2(\theta - \theta\_0) \text{ if } \theta > \theta\_0. \end{cases} \tag{2.18}$$

The Bayesian posterior expected loss of decision *d*<sup>1</sup> can be computed as

$$\operatorname{EL}(d\_1 \mid \mathbf{x}) = l\_1 \int\_{\theta \le \theta\_0} (\theta\_0 - \theta) f(\theta \mid \mathbf{x}) d\theta$$

$$= l\_1 \tau\_x \left[ \phi(t) + t \int\_{-\infty}^{t} \phi(s) ds \right], \tag{2.19}$$

where *f (θ* <sup>|</sup> *x)* is a Normal posterior density with parameters *μx* and *τ* <sup>2</sup> *<sup>x</sup>* as in (2.13) and (2.14), *t* = *(θ*<sup>0</sup> − *μx )/τx* , while *φ(*·*)* denotes the probability density of a standardized Normal distribution (Bernardo & Smith, 2000).

In turn, the Bayesian posterior expected loss of decision *d*<sup>2</sup> can be computed as

$$\operatorname{EL}(d\_2 \mid \mathbf{x}) = l\_2 \int\_{\theta > \theta\_0} (\theta - \theta\_0) f(\theta \mid \mathbf{x}) d\theta$$

$$= l\_2 \tau\_x \left[ \phi(t) - t \int\_t^\infty \phi(s) ds \right]. \tag{2.20}$$

Again, the decision criterion amounts to deciding *d*<sup>1</sup> (*d*2) whenever EL*(d*<sup>1</sup> | *x)* is smaller (greater) than EL*(d*<sup>2</sup> | *x)*.

*Example 2.11 (Alcohol Concentration in Blood—Continued)* Recall Example 2.9 where the posterior distribution of the alcohol level *θ* was N*(*0*.*50072*,* 0*.*00026*)*, and the legal threshold was equal to 0.5.

```
> th0=0.5
> postm
[1] 0.5007182
> postv
[1] 0.0002614268
```

Consider a symmetric linear loss function as in (2.18) with *l*<sup>1</sup> = *l*<sup>2</sup> = 1. The Bayesian posterior expected losses in (2.19) and (2.20) can be obtained as

```
> l1=1
> l2=1
> t=(th0-postm)/sqrt(postv)
> eld1=l1*sqrt(postv)*(dnorm(t)+t*pnorm(t))
> eld2=l2*sqrt(postv)*(dnorm(t)-t*pnorm(t,lower.tail=F))
> round(c(eld1,eld2),6)
[1] 0.006098 0.006816
```

The optimal decision thus is to consider that the alcohol level is greater than the legal threshold, because decision *d*<sup>1</sup> has the smaller expected loss, though the difference between the two expected losses is, in the example here, small:

```
> round(abs(eld1-eld2),6)
[1] 0.000718
```

Note that this result crucially depends on the decision maker's value assessments (i.e., the chosen loss function).

When expected losses for rival decisions are similar, as is the case in Example 2.11, a sensitivity analysis should be performed as suggested, for example, in the legal literature (Edwards, 1988). The sensitivity analysis should evaluate the effect of changes in the prior parameters and the loss values. See also Sect. 2.3.1 for a sensitivity analysis of the BF for evaluating the impact of changes in hyperparameters characterizing the prior distribution for the unknown level of alcohol in blood.

It is also worth reflecting on the choice of the loss function. A symmetric loss function, as previously suggested, may not realistically reflect the decision maker's preferences. For example, a decision maker who is concerned about road safety may consider that falsely concluding that an individual's blood alcohol concentration is below the legal limit is a more serious error than falsely concluding that an individual's blood alcohol concentration is above the legal threshold. Therefore, *l*<sup>2</sup> may be taken to be larger than *l*1, reflecting the greater inconvenience associated with underestimating the alcohol concentration. For example, when *l*<sup>1</sup> = 1 and *l*<sup>2</sup> = 2, meaning that underestimating the alcohol level is considered twice as serious as overestimating it, the expected loss of decision *d*<sup>2</sup> will increase. One can verify that for any reasonable value of *l*<sup>2</sup> greater than *l*1, decision *d*<sup>1</sup> will be the one with the smaller expected loss.
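This can be checked numerically with the posterior parameters from Example 2.11, using the standardization *t* = *(θ*<sup>0</sup> − *μx )/τx* consistent with (2.19) and (2.20); a minimal sketch:

```r
# Expected losses under the asymmetric loss l1 = 1, l2 = 2 (Example 2.11 posterior)
postm <- 0.5007182; postv <- 0.0002614268; th0 <- 0.5
t <- (th0-postm)/sqrt(postv)
l1 <- 1; l2 <- 2
eld1 <- l1*sqrt(postv)*(dnorm(t)+t*pnorm(t))
eld2 <- l2*sqrt(postv)*(dnorm(t)-t*pnorm(t,lower.tail=FALSE))
eld1 < eld2   # TRUE: d1 has the smaller expected loss
```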

#### **2.4 Summary of R Functions**

The R functions outlined below have been used in this chapter.

#### *Functions Available in the Base Package*

apply: applies a function to the margins (either rows or columns) of a matrix

acf: computes and plots estimates of the autocorrelation function

d*<*name of distribution*>*, p*<*name of distribution*>*, r*<*name of distribution*>* (e.g., dbeta, pbeta, rbeta): calculates the density and the cumulative probability and generates random numbers for various parametric distributions

rowSums: forms row sums for numeric arrays (or data frames)

Further details can be found in the Help menu, help*.*start*()*.

#### *Functions Available in Other Packages*

dbbinom and pbbinom in package extraDistr: calculate the density and the cumulative probability for a beta-binomial distribution

#### *Functions Developed in This Chapter*

dbinpois: computes the product between a binomial likelihood Bin*(n, θ )* at *x* − *xb* and a Poisson likelihood Pn*(λ)* at *xb* where *x* represents the number of items counted as presenting a given target characteristic and *xb* represents the number of background elements affecting the counting process

*Usage*: dbinpois(xb)

*Arguments*: xb: a vector of integers ranging from 0 to *x*

*Output*: a vector of values, where each value represents the probability of the product between the binomial and the Poisson likelihood at a given value of the input argument xb

mh1: computes the function *x/(*1 + *x)*

*Usage*: mh1(x)

*Arguments*: x: a scalar value *x*

*Output*: the value of *x/(*1 + *x)*

mh2: computes the function *x/(*<sup>1</sup> <sup>+</sup> *x)*<sup>2</sup>

*Usage*: mh2(x)

*Arguments*: x: a scalar value *x*

*Output*: the value of *x/(*<sup>1</sup> <sup>+</sup> *x)*<sup>2</sup>

post\_distr: computes the posterior distribution N*(μx , τ* <sup>2</sup> *<sup>x</sup> )* of a Normal mean *θ*, with *<sup>X</sup>* <sup>∼</sup> <sup>N</sup>*(θ , σ*2*)* and *<sup>θ</sup>* <sup>∼</sup> <sup>N</sup>*(μ, τ* <sup>2</sup>*)*

*Usage*: post\_distr(sigma,n,barx,pm,pv)


Published with the support of the Swiss National Science Foundation (Grant no. 10BP12\_208532/1).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 3 Bayes Factor for Evaluative Purposes**

#### **3.1 Introduction**

Consider a case where material of known source (control material) and evidential material of unknown source (recovered or questioned material) are collected and analyzed. Interpretation of scientific evidence then amounts to assessing the probative value of the observations made during comparative examinations. The evidence is evaluated in terms of its effect on the odds in favor of a proposition *H*<sup>1</sup> put forward by the prosecution, compared to an alternative proposition *H*<sup>2</sup> advanced by the defense.

During comparative examinations, observations and measurements are made, leading to either discrete or continuous data. Forensic laboratories may also have equipment and methodologies that can lead to output in the form of multivariate data. Thus, scientific evidence is often described by more than one variable. For example, glass fragments from a crime scene can be compared with fragments collected on the clothing of a person of interest on the basis of several chemical components, as well as physical characteristics. It should be noted, however, that the assessment of a Bayes factor for multivariate data may be challenging. For example, data may not present enough regularity for standard parametric distributions to be used. Data may also present a complex dependence structure with several levels of variation. In addition, a feature-based approach might not always be feasible, and it may be necessary to derive a Bayes factor on the basis of scores.

This chapter is structured as follows. Sections 3.2 and 3.3 address the problem of evaluation of evidence for various types of discrete and continuous data, respectively. Section 3.4 presents an extension to continuous multivariate data.

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-09839-0\_3. The files can be accessed individually by clicking the DOI link in the accompanying figure caption or by scanning this link with the SN More Media App.

#### **3.2 Evidence Evaluation for Discrete Data**

This section deals with measurement results in the form of counts, using the binomial model (Sect. 3.2.1), the multinomial model (Sect. 3.2.2), and the Poisson model (Sect. 3.2.3).

#### *3.2.1 Binomial Model*

In many practical applications, data derive from realizations of experiments that may take one of two mutually exclusive outcomes. Examples include general features (so-called class characteristics) observed on questioned and known items or materials (e.g., fired bullets, fibers) when the question of interest is whether the compared materials come from the same source.

Consider a hypothetical case involving a questioned document for which results of analyses of black toner are available. On the questioned document, black bi-component toner is present. It is of the same type as that used by a given printing machine (known source). A question that may be of interest in such a case is how this analytical information should affect one's belief in the proposition according to which the questioned document has been printed using the device of interest (Biedermann et al., 2009, 2011a). The competing propositions can thus be defined as follows:

*H*<sup>1</sup> : The questioned document has been printed with the device of interest.

*H*<sup>2</sup> : The questioned document has been printed with an unknown device.

Let *T* denote the observed toner type, either single-component (*TS*) or bi-component (*TB*). Suppose that a database of the toner type (magnetism) of samples of black toner from *N* machines is available, *n* of which use a bi-component toner. Denote by *θ* the proportion of the population of printing devices equipped with bi-component toner. Available counts can be treated as realizations of Bernoulli trials (Sect. 2.2.1) with constant probability of success *θ*, Pr*(TB* | *θ )* = *θ*. Suppose a conjugate beta prior distribution Be*(α, β)* is used to model uncertainty about *θ*, where *α* and *β* can be elicited using the available background knowledge as in (1.42) and (1.43).

Denote by *Ey* the observations made on recovered material and by *Ex* the observations made on control material (i.e., documents printed with the device of interest). If the questioned document originates from the device of interest, the probability of the evidence becomes

$$\begin{aligned} \Pr(E\_y = T\_B, E\_x = T\_B \mid H\_1) &= \int\_{\Theta} \Pr(T\_B \mid \theta) \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta / \mathrm{B}(\alpha, \beta) \\\ &= \int\_{\Theta} \theta \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta / \mathrm{B}(\alpha, \beta). \end{aligned}$$

If the questioned document originates from an unknown device (i.e., two distinct devices have been used), the probability of the evidence becomes

$$\Pr(E\_y = T\_B, E\_x = T\_B \mid H\_2) = \int\_{\Theta} \theta^2 \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta / \mathrm{B}(\alpha, \beta).$$

The Bayes factor can be computed as

$$\begin{split} \text{BF} &= \frac{\int\_{\Theta} \theta \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta}{\int\_{\Theta} \theta^2 \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta} \\ &= \frac{\text{B}(\alpha + 1, \beta)}{\text{B}(\alpha + 2, \beta)} \\ &= \frac{\alpha + \beta + 1}{\alpha + 1}. \end{split} \tag{3.1}$$

*Example 3.1 (Questioned Documents)* Consider the case of a printed document of unknown origin. Analyses reveal that the toner present on the printed document is of type "bi-component." The printing device that is thought to have been used to print the questioned document is equipped with a bi-component toner. In an available database with a total of *N* = 100 samples of black toner, *n* = 23 are bi-component (see Table 3.1). Using this information, the parameters of the beta prior distribution about *θ* can be elicited as follows:

```
> n=23
```
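The listing above is truncated: the assignments of the parameters a and b are not shown. A hypothetical reconstruction, consistent with the BF value reported below, assumes the elicitation in (1.42) and (1.43) uses prior mean *n/N* and an equivalent sample size of *N* − 1:

```r
# Reconstructed elicitation (assumption): prior mean n/N, equivalent sample size N-1
n <- 23
N <- 100
a <- (N-1)*n/N       # 22.77, approximately 23
b <- (N-1)*(N-n)/N   # 76.23, approximately 76
c(a, b)
```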

This leads to a Be*(*23*,* 76*)*.

The Bayes factor in (3.1) can be computed straightforwardly as follows:

```
> BF=(a+b+1)/(a+1)
> BF
[1] 4.206984
```

The Bayes factor provides weak support for the proposition *H*<sup>1</sup> according to which the questioned document has been printed with the printing device of interest rather than with an unknown printing device (*H*2).

It is worth noting that there is an alternative development described in the forensic statistics literature that considers background information derived from a population database as part of the evidence (e.g., Ommen et al., 2016; Dawid, 2017). According to this line of reasoning, if proposition *H*<sup>1</sup> is true (numerator), there are *(n* + 1*)* counts of bi-component toners. That is, the questioned item and the known item are assumed to come from the same source, hence adding one count to the database. Conversely, if proposition *H*<sup>2</sup> is true (denominator), there are *(n*+2*)* counts of bi-component toner. Here, it is assumed that the questioned item and the known item come from different sources, hence adding two counts to the database. The Bayes factor can then be obtained as

$$\text{BF} = \frac{\int\_{\Theta} \theta^{n+1} (1 - \theta)^{N-n} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta}{\int\_{\Theta} \theta^{n+2} (1 - \theta)^{N-n} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} d\theta}$$

$$= \frac{\alpha + \beta + N + 1}{\alpha + n + 1}. \tag{3.2}$$

One can immediately verify that this corresponds to the BF in (3.1) with parameter *α* replaced by *α* + *n*, and parameter *β* replaced by *β* + *N* − *n*. However, it may be questioned whether the available database should be considered as evidence, rather than as conditioning information, because the database contains only general data unrelated to the case under investigation (Aitken et al., 2021).
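For illustration, (3.2) can be evaluated with the rounded prior parameters reported in Example 3.1 (a = 23, b = 76; the exact elicited values may differ slightly):

```r
# BF as in (3.2): database counts treated as part of the evidence (Example 3.1)
a <- 23; b <- 76   # beta prior parameters (rounded values from Example 3.1)
N <- 100; n <- 23  # database size and number of bi-component samples
BF2 <- (a+b+N+1)/(a+n+1)
BF2                # 200/47, about 4.26
```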

#### *3.2.2 Multinomial Model*

The analyses described in Sect. 3.2.1 can be extended to situations where experiments can lead to more than two mutually exclusive outcomes.

Consider again the case involving printed documents, introduced in Sect. 3.2.1. Laboratories often analyze resins of toner on printed documents by means of Fourier transform infrared spectroscopy (FTIR). The results can be classified into one of several (*k*) categories (Table 3.1). Suppose that the resin type (*R*) recovered on the questioned document belongs to category *j* , which is also found in the toner used by a given printing machine. The question of interest is similar to the one considered in Sect. 3.2.1, that is, how the available analytical information should affect one's belief in the proposition according to which a questioned document has been printed using a given device, called the potential source, rather than by some unknown printing device.

Denote by *θj* the proportion of the population that is of type (category) *Rj*, *j* = 1*,...,k*, with Pr*(Rj* | *θj)* = *θj*. Assume that observations of distinct categories can be treated as independent: the available counts *n*1*,...,nk* can then be treated as a realization from a multinomial distribution Mult*(n, θ*1*,...,θk)*

$$f(n\_1, \ldots, n\_k \mid \theta\_1, \ldots, \theta\_k) = \frac{N!}{n\_1! \cdots n\_k!}\, \theta\_1^{n\_1} \cdots \theta\_k^{n\_k}.$$

A conjugate Dirichlet prior probability distribution Dir*(α*1*,...,αk)* is considered for modeling uncertainty about the population proportions *θ*1*,...,θk*:

$$f(\theta\_1, \dots, \theta\_k \mid \alpha\_1, \dots, \alpha\_k) = \theta\_1^{\alpha\_1 - 1} \cdot \dots \cdot \theta\_k^{\alpha\_k - 1} / \mathbf{B}(\alpha),$$

with $\mathbf{B}(\alpha) = \prod\_{i=1}^{k} \Gamma(\alpha\_i) / \Gamma(\alpha)$ and $\alpha = \sum\_{i=1}^{k} \alpha\_i$.

Denote by *Ey* the observations made on the recovered material and by *Ex* the observations made on the control material (i.e., documents printed with the device of interest). If the questioned document originates from the device of interest, the probability of the findings *E* = *(Ey , Ex )* becomes

$$\Pr(E\_y = R\_j, E\_x = R\_j \mid H\_1) = \int\_{\Theta} \Pr(R\_j \mid \theta\_j)\, \theta\_1^{\alpha\_1 - 1} \cdots \theta\_j^{\alpha\_j - 1} \cdots \theta\_k^{\alpha\_k - 1}\, d\theta / \mathbf{B}(\alpha)$$

$$= \int\_{\Theta} \theta\_j\, \theta\_1^{\alpha\_1 - 1} \cdots \theta\_j^{\alpha\_j - 1} \cdots \theta\_k^{\alpha\_k - 1}\, d\theta / \mathbf{B}(\alpha).$$

If the questioned document originates from an unknown device (i.e., two distinct devices have been used), the probability of the findings *E* becomes

$$\Pr(E\_y = R\_j, E\_x = R\_j \mid H\_2) = \int\_{\Theta} \theta\_j^2\, \theta\_1^{\alpha\_1 - 1} \cdots \theta\_j^{\alpha\_j - 1} \cdots \theta\_k^{\alpha\_k - 1}\, d\theta / \mathbf{B}(\alpha).$$

The Bayes factor can be computed as

$$\text{BF} = \frac{\int\_{\Theta} \theta\_j \cdot \theta\_1^{\alpha\_1 - 1} \cdots \theta\_j^{\alpha\_j - 1} \cdots \theta\_k^{\alpha\_k - 1}\, d\theta}{\int\_{\Theta} \theta\_j^2 \cdot \theta\_1^{\alpha\_1 - 1} \cdots \theta\_j^{\alpha\_j - 1} \cdots \theta\_k^{\alpha\_k - 1}\, d\theta}$$

$$= \frac{\alpha + 1}{\alpha\_j + 1}. \tag{3.3}$$

*Example 3.2 (Questioned Documents—Continued)* Recall Example 3.1, involving questioned documents on which black toner is present. Suppose now that laboratory analyses focus on the toner's resin component. Suppose that the parameters of the Dirichlet prior probability distribution are elicited as

> a=c(15,4,3,2,2,2,2)

Suppose that the rather common resin group *Epoxy-A* (category *j* = 2 in Table 3.1) is observed on both the questioned and known documents. The Bayes factor in (3.3) can be computed straightforwardly as

```
> j=2
> BF=(sum(a)+1)/(a[j]+1)
> BF
```

[1] 6.2

The Bayes factor provides, again, weak support for the proposition *H*<sup>1</sup> according to which the questioned document has been printed with the printing device of interest, rather than with an unknown printing device (*H*2).
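As a sanity check (not part of the original example), the closed form in (3.3) can also be verified by Monte Carlo simulation, drawing from the Dirichlet prior by normalizing independent gamma variates:

```r
set.seed(123)
a=c(15,4,3,2,2,2,2)   # Dirichlet parameters as elicited above
j=2
M=100000
# Each row of theta is a draw from Dir(a): normalize independent gammas
g=matrix(rgamma(M*length(a),shape=rep(a,each=M)),nrow=M)
theta=g/rowSums(g)
# BF is the ratio of prior moments E[theta_j]/E[theta_j^2], cf. (3.3)
BF.mc=mean(theta[,j])/mean(theta[,j]^2)
BF.mc   # close to the exact value (sum(a)+1)/(a[j]+1) = 6.2
```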

Suppose that a database of the resin type of samples of black toner from *N* machines is available, *n*<sup>1</sup> (*n*2, *...* ) of which belong to category 1 (2, *...* ), as in Table 3.1. These data can be used to elicit the Dirichlet prior probability distribution. Following the methodology proposed by Zapata-Vazquez et al. (2014), the hyperparameters *α*1*,...,αk* can be assessed by starting from expert judgments (e.g., a vector of quantiles) about proportions of items belonging to each category. Tools for eliciting prior probability distributions from experts' opinions are also available in the R package SHELF. An example will be presented in Sect. 4.2.2.

#### *3.2.3 Poisson Model*

Some forensic science applications focus on the number of occurrences of particular events or observations that take place in given intervals of time or space. Practical examples are the number of gunshot residue (GSR) particles collected on the surface of the hands of individuals suspected to be involved in the discharge of a firearm (Cardinetti et al., 2006), or the number of corresponding matching striations in the comparative examination of marks left by firearms on fired bullets (Bunch, 2000).

Consider the following hypothetical case. A fired bullet is found at a crime scene, and a person of interest is apprehended, carrying a gun. The following propositions are of interest:

*H*<sup>1</sup> : The recovered bullet was fired with the seized gun.

*H*<sup>2</sup> : The recovered bullet was fired with an unknown gun.
The recovered bullet and bullets fired with the seized gun are compared. *Consecutive matching striations* (CMS) is a simple concept to quantify the extent of agreement between marks. The number of observed consecutively matching striations can be interpreted as a *score*. Let *Δ(x, y)* be the maximum CMS count for a given comparison. For the evaluation of a CMS count, data on comparisons made between pairs of bullets test-fired with the seized gun and between pairs of bullets test-fired with different guns are needed. The (score-based) Bayes factor therefore is

$$\text{sBF} = \frac{g(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_1)}{g(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_2)}.$$

A statistical model commonly used in the forensic science literature for the type of data encountered in the example here assumes that counts follow a Poisson distribution Pn*(λ)*

$$g(\Delta(\mathbf{x}, \mathbf{y}) \mid \lambda\_i) = \frac{e^{-\lambda\_i} \lambda\_i^{\Delta(\mathbf{x}, \mathbf{y})}}{\Delta(\mathbf{x}, \mathbf{y})!}, \qquad \Delta(\mathbf{x}, \mathbf{y}) = 0, 1, \ldots \; ; \; \lambda\_i \ge 0,$$

where parameter *λi*, *i* = 1*,* 2, represents the weighted average maximum CMS count.

Suppose that two datasets are compiled. The first relates to pairs of bullets fired with the seized gun, and the second to pairs of bullets fired with different guns. Such data can be used to inform the probability distribution *g(*·*)* at the score value *Δ(x, y)* as discussed in Sect. 1.5.2 and to compute the Bayes factor as

$$\text{sBF} = \frac{\hat{g}(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_1)}{\hat{g}(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_2)}.$$

Bunch (2000) describes a likelihood ratio procedure for inference about competing propositions. This account is based on a frequentist perspective because it uses the maximum likelihood estimates $\hat{\lambda}\_1$ and $\hat{\lambda}\_2$ for the parameters *λ*1 and *λ*2, calculated under the assumption that either proposition *H*<sup>1</sup> or proposition *H*<sup>2</sup> is true. Using these two estimates in the component Poisson likelihoods leads to the following likelihood ratio:

$$\text{LR} = \frac{e^{-\hat{\lambda}\_1}\hat{\lambda}\_1^{\Delta(\mathbf{x},\mathbf{y})}}{e^{-\hat{\lambda}\_2}\hat{\lambda}\_2^{\Delta(\mathbf{x},\mathbf{y})}}.$$

In Bayesian statistics, the most common prior distribution for *λi* is the gamma distribution Ga*(αi, βi)* with shape parameter *αi* and rate parameter *βi* (e.g., Bernardo and Smith, 2000):

$$f(\lambda\_i \mid \alpha\_i, \beta\_i) = \frac{\beta\_i^{\alpha\_i}}{\Gamma(\alpha\_i)} \lambda\_i^{\alpha\_i - 1} e^{-\beta\_i \lambda\_i}, \qquad \lambda\_i > 0 \; ; \; \alpha\_i, \beta\_i > 0.$$

Since the Poisson and gamma distributions are conjugate (Sect. 1.10), the posterior distribution of *λ* is still in the family of gamma distributions, with parameters *α* and *β* updated according to well-known updating rules (see, e.g., Lee, 2012). Given a realization *(z*1*,...,zn)* of a random sample from a Poisson distribution Pn*(λ)*, the posterior is a gamma distribution $\mathrm{Ga}(\alpha^{\ast}, \beta^{\ast})$ with $\alpha^{\ast} = \alpha + \sum\_{i=1}^{n} z\_i$ and $\beta^{\ast} = \beta + n$. Note that in the case considered here there is only one observation, *Δ(x, y)*; therefore, $\alpha^{\ast} = \alpha + \Delta(\mathbf{x}, \mathbf{y})$ and $\beta^{\ast} = \beta + 1$. See also Biedermann et al. (2011b) for further illustrations of the Poisson–gamma model in forensic science applications.
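As a brief sketch of these updating rules, assuming illustrative values for the prior parameters and a single observed count (the values below are not data from the text):

```r
# Conjugate Poisson-gamma update for a single observed count Delta(x,y);
# alpha, beta and delta are illustrative values only
alpha=7; beta=5
delta=4                  # a single observed CMS count
alpha.post=alpha+delta   # alpha* = alpha + Delta(x,y)
beta.post=beta+1         # beta*  = beta + 1
c(alpha.post,beta.post)  # posterior is Ga(11,6)
```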

The marginal distribution in the numerator and denominator of the Bayes factor is known in closed form here. It is a Poisson–gamma distribution:

$$g(\varDelta(\mathbf{x},\mathbf{y})|\alpha\_{l},\beta\_{l}) = \int\_{\lambda\_{l}} g(\varDelta(\mathbf{x},\mathbf{y})|\lambda\_{l}) f(\lambda\_{l}|\alpha\_{l},\beta\_{l}) d\lambda\_{l}$$

$$= \frac{1}{\varDelta(\mathbf{x},\mathbf{y})!} \frac{\beta\_{l}^{\alpha\_{l}}}{\varGamma(\alpha\_{l})} \frac{\Gamma(\alpha\_{l} + \varDelta(\mathbf{x},\mathbf{y}))}{(\beta\_{l} + 1)^{\alpha\_{l} + \varDelta(\mathbf{x},\mathbf{y})}}.\tag{3.4}$$

The score-based Bayes factor then becomes

$$\text{sBF} = \frac{\beta\_1^{\alpha\_1} \Gamma(\alpha\_2) \Gamma(\alpha\_1 + \Delta(\mathbf{x}, \mathbf{y})) (\beta\_2 + 1)^{\alpha\_2 + \Delta(\mathbf{x}, \mathbf{y})}}{\beta\_2^{\alpha\_2} \Gamma(\alpha\_1) \Gamma(\alpha\_2 + \Delta(\mathbf{x}, \mathbf{y})) (\beta\_1 + 1)^{\alpha\_1 + \Delta(\mathbf{x}, \mathbf{y})}}. \tag{3.5}$$

Another example of the use of the Poisson distribution for data in the form of independent counts can be found in Aitken and Gold (2013). These authors considered the number of occurrences of selected characteristics of speech recorded in a succession of time periods. In this application, a feature-based Bayes factor is used to assess findings with respect to the proposition according to which recorded and control speeches originate from the same source versus the alternative proposition that they originate from different sources.

*Example 3.3 (Firearm Examination)* Consider a case involving a questioned bullet. During comparison with a reference bullet, the examiner counts four CMS, i.e., *Δ(x, y)* = 4. Suppose that the assumptions made in Bunch (2000) are suitable for the case here so that for bullets fired from the same gun (proposition *H*<sup>1</sup> holds), the weighted average maximum CMS count is taken to be equal to 3*.*91. For bullets fired from different guns (proposition *H*<sup>2</sup> holds), the weighted average maximum CMS count is taken to be equal to 1*.*32. These values are used in the Poisson likelihoods under *H*<sup>1</sup> and *H*2, and the likelihood ratio can easily be obtained as

```
> s=4
> lambda1=3.91
> lambda2=1.32
> LR=dpois(s,lambda1)/dpois(s,lambda2)
> LR
```
[1] 5.775487

The evidence provides weak support in favor of the proposition according to which the recovered bullet passed through the barrel of the seized gun, rather than through the barrel of an unknown gun.

Consider now the Bayesian perspective. Suppose that the available knowledge allows one to set the hyperparameters of the gamma distribution equal to {*α*<sup>1</sup> = 125*, β*<sup>1</sup> = 32} for the numerator and to {*α*<sup>2</sup> = 7*, β*<sup>2</sup> = 5} for the denominator. This amounts to using a gamma prior distribution for *λ*<sup>1</sup> with mean equal to 3.91 and standard deviation equal to 0.35 and a gamma prior distribution for *λ*<sup>2</sup> with mean equal to 1.4 and standard deviation equal to 0.53. The two prior distributions are shown in Fig. 3.1.

```
> an=125
> bn=32
> ad=7
> bd=5
> plot(function(x) dgamma(x,an,bn),0,8,
+ xlab=expression(lambda),
+ ylab='Probability density')
> plot(function(x) dgamma(x,ad,bd),0,8,add=TRUE,
+ lty=2)
> leg=expression(paste('Ga(125,32)'),
+ paste('Ga(7,5)'))
> legend(4.85,1.15,leg,lty=c(1,2))
```
First, we write a short function poisg that computes the marginal distribution in (3.4), omitting the factor 1*/Δ(x, y)*!, which cancels in the ratio:

```
> poisg=function(a,b,x)
+ {(b^a)/gamma(a)*gamma(a+x)/((b+1)^(a+x))}
```
Next, the Bayes factor can be computed as follows:

```
> BF=poisg(an,bn,s)/poisg(ad,bd,s)
> BF
[1] 4.248019
```
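As an additional check (not part of the original example), the closed form in (3.4) can be compared with direct numerical integration of the Poisson likelihood against the gamma prior. Recall that poisg drops the constant 1*/Δ(x, y)*!, so the two quantities differ by exactly that factor:

```r
an=125; bn=32; s=4
poisg=function(a,b,x){(b^a)/gamma(a)*gamma(a+x)/((b+1)^(a+x))}
# Integrate the Poisson likelihood times the gamma prior density
num=integrate(function(l) dpois(s,l)*dgamma(l,an,bn),0,Inf)$value
# num equals poisg(an,bn,s)/factorial(s), since poisg drops the 1/s! term
c(num*factorial(s),poisg(an,bn,s))
```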
Note that the introduction of a prior probability distribution reflecting uncertainty about the population parameters *λ*<sup>1</sup> and *λ*<sup>2</sup> has slightly lowered the value of the evidence. The result still represents weak evidence in favor of the proposition that the recovered bullet was fired with the seized gun, rather than with an unknown gun.

Note that Example 3.3 involves a non-anchored approach in the numerator. The probability distribution of the score value is solely conditioned on the proposition of interest, that is, $\hat{g}(\Delta(\mathbf{x}, \mathbf{y}) \mid H\_1)$. As mentioned at the beginning of this section, and in Sect. 1.5.2, other anchoring approaches may be considered.

#### **3.2.3.1 Choosing the Parameters of the Gamma Prior**

An evaluator who, initially, would like to give the same weight to all possible values of *λ* may consider using a non-informative prior distribution, that is

$$f(\lambda\_i) \propto \lambda\_i^{-1/2}, \qquad \lambda\_i > 0 \text{ and } i = 1, 2.$$

The posterior probability distribution given the observations *(z*1*,...,zn)* will be a gamma distribution with shape parameter $\alpha^{\ast} = \sum\_{i=1}^{n} z\_i + 1/2$ and rate parameter $\beta^{\ast} = n$. Note that in the type of case considered here, there is only one observation; therefore, $\alpha^{\ast} = \Delta(\mathbf{x}, \mathbf{y}) + 1/2$ and $\beta^{\ast} = 1$.

However, the choice of a non-informative prior distribution may be questioned. Take, for instance, the case example discussed earlier in this section (Example 3.3). It is difficult to imagine that *no* suitable information is available to express prior uncertainty about the unknown weighted average maximum CMS count, and hence that the same non-informative prior distribution should apply under each proposition.

In Example 3.3, an informative prior distribution has been used. This raises the question of how to translate prior knowledge into a prior distribution. As illustrated in Sect. 1.10, one way to elicit prior parameters is to express prior beliefs in terms of a measure of location and a measure of dispersion and then equate these values with the prior moments of the distribution. In the case of a gamma distribution Ga*(α, β)*, this amounts to equating a value for the mean, *m*, with the prior mean *α/β*, and a value for the variance, *s*<sup>2</sup>, with the prior variance *α/β*<sup>2</sup>, that is,

$$m = \frac{\alpha}{\beta} \qquad ; \qquad s^2 = \frac{\alpha}{\beta^2}.$$

Solving for *α* and *β* gives

$$\alpha = \frac{m^2}{s^2} \tag{3.6}$$

$$
\beta = \frac{m}{s^2}.\tag{3.7}
$$

If the shape of the prior distribution resulting from the choice of *α* and *β* as in (3.6) and (3.7) does not reflect one's prior beliefs suitably, then one should adjust the numerical values of *m* and *s*. However, this may not be enough to ensure that the resulting prior distribution is reasonable. One should also inquire about whether the information that is conveyed by the prior is realistically attainable. Consider a random sample of size *ne*, providing the same amount of information as conveyed by the elicited prior. The sample mean should have, at least roughly, the same location and the same dispersion as the prior. The equivalent sample size *ne* can then be found by matching the moments of the gamma distribution to the corresponding moments characterizing a sample of size *ne* from a Poisson distributed random variable located at *λ*:

$$\frac{\alpha}{\beta} = \lambda \qquad ; \qquad \frac{\alpha}{\beta^2} = \frac{\lambda}{n\_e}.$$

If the mean *λ* is set equal to the prior mean *α/β*, the equivalent sample size *ne* is equal to *β*.

*Example 3.4 (Elicitation of a Gamma Prior)* In Example 3.3, a Ga*(*125*,* 32*)* was used for *λ*<sup>1</sup> (the weighted average maximum CMS count under proposition *H*1), and a Ga*(*7*,* 5*)* for *λ*<sup>2</sup> (the weighted average maximum CMS count under proposition *H*2). For the prior means of *λ*<sup>1</sup> and *λ*2, the values 3*.*91 and 1*.*4 were used following Bunch (2000). For the dispersion of the two distributions, the values 0.35 and 0.53 have been assigned to the standard deviation under propositions *H*<sup>1</sup> and *H*2, respectively. Parameters *(α*<sup>1</sup> = 125*, β*<sup>1</sup> = 32*)* and *(α*<sup>2</sup> = 7*, β*<sup>2</sup> = 5*)* have then been obtained as in (3.6) and (3.7). This amounts to an equivalent sample size equal to 32 for the prior density of *λ*1, and 5 for *λ*2.
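The moment-matching computations of Example 3.4 can be reproduced directly, recovering the hyperparameters via (3.6) and (3.7):

```r
# Elicited prior means and standard deviations (Example 3.4)
m1=3.91; s1=0.35
m2=1.4;  s2=0.53
# Shape and rate parameters as in (3.6) and (3.7)
alpha1=m1^2/s1^2; beta1=m1/s1^2
alpha2=m2^2/s2^2; beta2=m2/s2^2
round(c(alpha1,beta1,alpha2,beta2))   # 125 32 7 5
```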

#### **3.2.3.2 Sensitivity to Prior Probabilities of Competing Propositions**

It is important to emphasize that the analyses presented here make no direct probabilistic statement about the truth of the propositions put forward by opposing parties at trial. A Bayes factor of approximately 4.25, as obtained in Example 3.3, only means that the evidence is approximately 4 times more probable if proposition *H*<sup>1</sup> is true than if the alternative proposition *H*<sup>2</sup> is true. As noted earlier, this does not mean that proposition *H*<sup>1</sup> is more probable than *H*2. This depends on the prior probabilities of the competing propositions, which can vary considerably among recipients of expert information, and which are beyond the area of competence of scientists.

However, it may be of interest to show the impact of different prior probability assignments on the posterior probability of the competing propositions. To do so, recall that the posterior odds are given by the product of the prior odds and the Bayes factor

$$\frac{\Pr(H\_1 \mid \cdot)}{\Pr(H\_2 \mid \cdot)} = \text{BF} \times \frac{\Pr(H\_1)}{\Pr(H\_2)}.$$

Using this expression, one can then investigate how the posterior probability of proposition *H*1, denoted *α*1, varies for values of *π*1 = Pr*(H*1*)* ranging from 0*.*01 to 0*.*99, and for a Bayes factor equal to 4*.*25, as in Example 3.3.

```
> pi1=seq(0.01,0.99,0.01)
> prior_odds=pi1/(1-pi1)
> BF=4.25
> post_odds=prior_odds*BF
> alpha1=post_odds/(1+post_odds)
```

The solid line in Fig. 3.2 shows the value of *α*1, the posterior probability of the proposition *H*1, as a function of the prior probability, *π*1, for BF = 4*.*25. The plot also shows results for BF = 1 (dashed line) and for BF = 100 (dotted line).

```
> plot(pi1,alpha1,type='l',xlab=expression(pi[1]),
+ ylab=expression(alpha[1]))
> BF=1
> post_odds=prior_odds*BF
> alpha1=post_odds/(1+post_odds)
> lines(pi1,alpha1,lty=2)
> BF=100
> post_odds=prior_odds*BF
> alpha1=post_odds/(1+post_odds)
> lines(pi1,alpha1,lty=3)
```
More generally, it can be observed that the higher the value of the Bayes factor, the smaller the impact of the prior probabilities on posterior probabilities.
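For instance, for a recipient of expert information who considers the two propositions equally probable a priori (*π*<sup>1</sup> = 0*.*5), the posterior probabilities of *H*<sup>1</sup> corresponding to the three Bayes factors shown in Fig. 3.2 can be computed as follows:

```r
# Posterior probability of H1 for equal prior probabilities (pi1=0.5)
pi1=0.5
BF=c(1,4.25,100)
post_odds=BF*pi1/(1-pi1)
alpha1=post_odds/(1+post_odds)
round(alpha1,3)   # 0.500 0.810 0.990
```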

#### **3.3 Evidence Evaluation for Continuous Data**

The previous section considered the evaluation of scientific evidence as given by discrete data. However, for many types of evidence, measurements result in continuous data.

#### *3.3.1 Normal Model with Known Variance*

In some applications, the distribution of measurements exhibits enough regularity to be captured by standard parametric models, such as the Normal distribution. One example, introduced earlier in Sect. 1.5.1, is the analysis of magnetism of black toner on printed documents. Due to the wide distribution and availability of printing machines, forensic document examiners are commonly requested to examine documents produced by electrophotographic printing processes that use dry toner. A question that forensic scientists may be asked to help with is whether or not two or more documents were printed with the same laser printer. This task involves the comparison of analytical features of a questioned document with those of control documents. One such analytical feature is the magnetic flux of toner. It is thought to be largely influenced by individual settings of the printing device, so that detectable differences may be expected on documents printed at different instances using the same or different machines (Biedermann et al., 2016a).

Suspected page substitution is a commonly encountered problem in forensic document examination. Imagine a case involving a contract consisting of three pages where the allegation is that the second page has been substituted. It may be of interest, thus, to investigate the extent to which available measurements of magnetic flux can be informative in this case.

Consider the following pair of propositions:

*H*<sup>1</sup> : Page two has been printed by the device used for printing pages one and three (i.e., the three pages have been printed with the same device).

*H*<sup>2</sup> : Page two has been printed by a different device.

Denote by **y** = *(y*1*,...,yn)* the measurements of magnetic flux obtained for the questioned page. Measurements are assumed to be normally distributed with unknown mean *θ* and known variance *σ*2. The likelihood of the normal random sample *(y*1*,...,yn)* can therefore be expressed as

$$f(\mathbf{y} \mid \theta) = \prod\_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2} (y\_i - \theta)^2\right\}. \tag{3.8}$$

It can be shown (e.g., Bolstad and Curran, 2017) that the likelihood of a normal random sample is proportional to the likelihood of the sample mean $\bar{y} = \frac{1}{n} \sum\_{i=1}^{n} y\_i$. The sample mean is normally distributed with mean *θ* and variance *σ*<sup>2</sup>*/n*

$$f(\bar{y} \mid \theta) = (2\pi\sigma^2/n)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2/n}(\bar{y} - \theta)^2\right\}. \tag{3.9}$$

In other words, it is possible to reduce the problem to one where a single normal observation *y*¯ is available.

Next, denote the measurements on the uncontested pages by {**x***l*} = *(xlj , j* = 1*,...,n*; *l* = 1*,* 2*)*, where the subscript *l* refers to the page number and *j* to the *j* th measurement of magnetic flux obtained for page *l*. A normal distribution with mean *θ* and variance *σ*<sup>2</sup> is assumed for **x**, analogously to what has been assumed for **y**. A conjugate normal prior distribution is chosen for *θ*, say *θ* ∼ N*(μ, τ* <sup>2</sup>*)*. The Bayes factor can be computed as in (1.16):

$$\begin{split} \text{BF} &= \frac{f(\overline{\mathbf{y}} \mid \mathbf{x}\_{1}, \mathbf{x}\_{2}, H\_{1})}{f(\overline{\mathbf{y}} \mid H\_{2})} \\ &= \frac{\int f(\overline{\mathbf{y}} \mid \theta) f(\theta \mid \mathbf{x}\_{1}, \mathbf{x}\_{2}, H\_{1}) d\theta}{\int f(\overline{\mathbf{y}} \mid \theta) f(\theta \mid H\_{2}) d\theta}, \end{split} \tag{3.10}$$

where *f (θ* | **x**1*,* **x**2*, H*1*)* is the posterior distribution of *θ*, obtained by updating the prior distribution N*(μ, τ* <sup>2</sup>*)* using the measurements **x**<sup>1</sup> and **x**2. This is a normal distribution, $(\theta \mid \mathbf{x}\_1, \mathbf{x}\_2) \sim \mathrm{N}(\mu\_x, \tau\_x^2)$, with posterior mean $\mu\_x$ and posterior variance $\tau\_x^2$, computed according to the updating rules (2.13) and (2.14). Using the result (1.21), one can easily verify that the density in the numerator is still normal, with mean equal to the posterior mean $\mu\_x$ and variance equal to the sum of the posterior variance and the population variance divided by the sample size *n*, i.e., $\tau\_x^2 + \sigma^2/n$. In the same way, invoking (1.22), the density in the denominator is still normal, with mean equal to the prior mean *μ* and variance equal to the sum of the prior variance and the population variance divided by the sample size, i.e., $\tau^2 + \sigma^2/n$.

*Example 3.5 (Printed Documents)* Consider the case described above where a forensic document examiner measures the magnetic flux on two uncontested pages 1 and 3 (Biedermann et al., 2016a). The results are **x**<sup>1</sup> = *(*16*,* 15*,* 15*)* and **x**<sup>2</sup> = *(*16*,* 15*,* 16*)*. The measurements for the contested page 2 are **y** = *(*15*,* 16*,* 16*)*. Previous experiments allow one to assign the value 0*.*24 for the population standard deviation *σ*. Based on the available knowledge regarding the magnetic flux of toner on printed documents, the prior mean *μ* and the prior variance *τ* <sup>2</sup> for the unknown quantity of magnetic flux are set equal to 17.5 and 3*.*92<sup>2</sup>, respectively. This means that values of the magnetic flux smaller than 6 and greater than 29 are considered, a priori, to be extremely unlikely.

```
> mu=17.5
> tau2=3.92^2
> sigma2=0.24^2
> x=c(16,15,15,16,15,16)
> y=c(15,16,16)
> nx=length(x)
> ny=length(y)
```

The posterior distribution *f (θ* | **x**1*,* **x**2*)* can be obtained by a single application of Bayes' theorem to the full set of available measurements *(***x**1*,* **x**2*)*. The posterior parameters $\mu\_x$ and $\tau\_x^2$ can be calculated using the function post\_distr introduced in Sect. 2.3.1.

```
> mupost=post_distr(sigma2,nx,mean(x),mu,tau2)[1]
> mupost
[1] 15.50125
```

```
> tau2post=post_distr(sigma2,nx,mean(x),mu,tau2)[2]
> tau2post
[1] 0.009594006
```
The two marginal densities in the numerator and denominator of the BF in (3.10) can be calculated at the sample mean $\bar{y}$. The exact value of the Bayes factor is given by

```
> BF=dnorm(mean(y),mupost,sqrt(tau2post+sigma2/ny))/
+ dnorm(mean(y),mu,sqrt(tau2+sigma2/ny))
> BF
```

```
[1] 16.03199
```
This value represents moderate support for the proposition that the three pages were printed with the same device (*H*1), compared to the proposition that page two was printed with a different device, i.e., the proposition of page substitution (*H*2).

#### *3.3.2 Normal Model with Both Parameters Unknown*

So far, the variance of the distribution of the observations has been assumed to be known, though in many practical situations the mean and the variance are both unknown, and it is necessary to choose a prior distribution for the parameter vector *(θ , σ*2*)*. The Bayes factor can be computed as in (1.16):

$$\begin{split} \text{BF} &= \frac{f(\mathbf{y} \mid \mathbf{x}, H\_{\mathbf{l}})}{f(\mathbf{y} \mid H\_{2})} \\ &= \frac{\int f(\mathbf{y} \mid \boldsymbol{\theta}, \sigma^{2}) f(\boldsymbol{\theta}, \sigma^{2} \mid \mathbf{x}, H\_{\mathbf{l}}) \mathbf{d}(\boldsymbol{\theta}, \sigma^{2})}{\int f(\mathbf{y} \mid \boldsymbol{\theta}, \sigma^{2}) f(\boldsymbol{\theta}, \sigma^{2} \mid H\_{2}) \mathbf{d}(\boldsymbol{\theta}, \sigma^{2})}. \end{split} \tag{3.11}$$

Consider the case where a conjugate prior distribution for *(θ , σ*2*)* of the form

$$f(\theta, \sigma^2) = f(\theta \mid \sigma^2) f(\sigma^2) \tag{3.12}$$

is chosen. In this distribution, prior beliefs about the population mean *θ* are calibrated by the scale of measurements of the observations.<sup>1</sup> The conditional distribution *f (θ* | *σ*<sup>2</sup>*)* is taken to be normal, centered at *μ* with variance *σ*<sup>2</sup>*/n*0, i.e., $(\theta \mid \sigma^2) \sim \mathrm{N}(\mu, \sigma^2/n\_0)$. The parameter *n*<sup>0</sup> can be thought of as the prior sample size for the distribution of *θ*. As pointed out in Sect. 2.3.1, it formalizes the size of the sample from a normal population that provides an equivalent amount of information about *θ*. The distribution *f (σ*<sup>2</sup>*)* is taken to be an *S* times inverse chi-squared distribution with *k* degrees of freedom, $\sigma^2 \sim S \cdot \chi^{-2}(k)$. It can be shown that this is equivalent to an inverse gamma distribution with shape parameter *α* = *k/*2 and scale parameter *β* = *S/*2, $\sigma^2 \sim \mathrm{IG}(\alpha = k/2, \beta = S/2)$. Alternatively, prior uncertainty about dispersion can be formulated in terms of the precision $\lambda^2 = 1/\sigma^2$. The prior distribution of *λ*<sup>2</sup> then becomes a gamma distribution with shape parameter *α* = *k/*2 and rate parameter *β* = *S/*2, $\lambda^2 \sim \mathrm{Ga}(\alpha = k/2, \beta = S/2)$. For further discussion, see, e.g., Bernardo and Smith (2000), Bolstad and Curran (2017), and Robert (2001).

Consider now the posterior distribution of the unknown parameter vector *(θ , λ*2*)* once a vector of observations **x** = *(x*1*,...,xn)* becomes available. It takes the form of a normal–gamma distribution

$$f(\theta, \lambda^2 \mid \mathbf{x}, H\_{\mathbb{I}}) = \mathrm{NG}(\mu\_n, n', \alpha\_n, \beta\_n),$$

with

$$\mu\_n = \frac{n\bar{x} + n\_0\mu}{n + n\_0} \qquad ; \qquad n' = n + n\_0$$

$$\alpha\_n = \alpha + \frac{n}{2} \qquad ; \qquad \beta\_n = \beta + \frac{1}{2}\left[(n-1)s^2 + \frac{n\_0 n (\bar{x} - \mu)^2}{n\_0 + n}\right],$$

<sup>1</sup> Note that in (3.12) population parameters are not, a priori, independent. Whenever this condition is felt to be too restrictive (see, e.g., Robert (2001)), it is also possible to choose a prior distribution as the product of independent priors, *f (θ , σ*2*)* <sup>=</sup> *f (θ )f (σ*2*)*. In this case, the derivation of the posterior distribution can be more demanding.

and $s^2 = \frac{1}{n-1} \sum\_{i=1}^{n} (x\_i - \bar{x})^2$.

If uncertainty about the two unknown parameters is modeled by means of the conjugate prior distribution in (3.12), the integrations in (3.11) have an analytical solution and the BF can be obtained straightforwardly.

Denote by $\mathbf{y} = (y\_1, \ldots, y\_{n\_y})$ a vector of measurements made on questioned material and consider the sample mean $\bar{y} = \frac{1}{n\_y} \sum\_{i=1}^{n\_y} y\_i$. It can be shown that the marginal density $f(\bar{y} \mid \mathbf{x}, H\_1)$ in the numerator is a Student t distribution with 2*α* + *n* degrees of freedom, centered at *μn*, with spread parameter, denoted *sn*, equal to

$$s\_n = \frac{n\_y (n + n\_0)}{n + n\_0 + n\_y} \left(\alpha + \frac{n}{2}\right) \beta\_n^{-1}.$$

This can be denoted as $f\_1(\bar{y} \mid \mu\_n, s\_n, 2\alpha + n)$.

The marginal density $f(\bar{y} \mid H\_2)$ in the denominator is a Student t distribution with 2*α* = *k* degrees of freedom, centered at *μ*, with spread parameter (precision), denoted *sd*, equal to

$$s\_d = \frac{n\_0 n\_y}{n\_0 + n\_y} \alpha \beta^{-1}$$

(Bernardo and Smith, 2000). This can be denoted as $f\_2(\bar{y} \mid \mu, s\_d, 2\alpha)$.

The Bayes factor can then be computed as

$$\text{BF} = \frac{f\_1(\bar{y} \mid \mu\_n, s\_n, 2\alpha + n)}{f\_2(\bar{y} \mid \mu, s\_d, 2\alpha)}. \tag{3.13}$$
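Note that *f*<sup>1</sup> and *f*<sup>2</sup> are Student t densities indexed by a location and a precision-type spread parameter, so base R's dt cannot be applied directly. The following function, dst, is an illustrative sketch (not code from the text) of such a density:

```r
# Density of a Student t distribution with location mu, precision-type
# spread parameter s and df degrees of freedom (Bernardo and Smith, 2000);
# dst is an illustrative helper, not a function from the text
dst=function(y,mu,s,df){
  sqrt(s/(df*pi))*gamma((df+1)/2)/gamma(df/2)*
    (1+s*(y-mu)^2/df)^(-(df+1)/2)
}
# With mu=0 and s=1 this reduces to the standard t density dt()
c(dst(1.3,0,1,5),dt(1.3,5))
```

The Bayes factor in (3.13) can then be evaluated as dst(ybar,mun,sn,2\*a+n)/dst(ybar,mu,sd,2\*a), once the posterior quantities have been computed (variable names here are hypothetical).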

#### **Choosing the Parameters of the Normal Prior**

The use of a conjugate prior distribution for the mean and the variance of a normal distribution raises the question of how to choose the hyperparameters, as the resulting distribution should suitably reflect available prior knowledge. The prior distribution *f (θ* | *σ*<sup>2</sup>*)* requires one to choose a value for *μ*, the measure of location, and a value for *n*0. The ratio *n*0*/n* characterizes the precision of the prior distribution relative to the precision of the observations. The smaller this ratio, the less informative the prior distribution, and the closer the posterior distribution will be to that obtained using a non-informative prior distribution. In fact, when *n*0*/n* approaches zero, the limiting form of the marginal distribution of the population mean *θ* is $\mathrm{N}(\bar{x}, \sigma^2/n)$, which corresponds to the posterior distribution that would be obtained using a non-informative prior distribution (Robert, 2001). For more specific prior beliefs (i.e., concentrated on a limited range of values), a higher value of *n*<sup>0</sup> should be chosen.

Regarding the prior distribution of *σ*<sup>2</sup>, consider a number of degrees of freedom *k* = 20, so that the prior mass is distributed rather symmetrically. Suppose also that, based on knowledge available from previous experiments, values of *σ*<sup>2</sup> greater or smaller than 0.05 are considered equally plausible, so Pr*(σ*<sup>2</sup> *>* 0*.*05*)* = 0*.*5. The parameter *S* can be elicited by recalling that *σ*<sup>2</sup>*/S* ∼ *χ*<sup>−2</sup>*(k)* and, analogously, *S* · *λ*<sup>2</sup> ∼ *χ*<sup>2</sup>*(k)*, where *λ*<sup>2</sup> = 1*/σ*<sup>2</sup>, so

$$\Pr\left(\sigma^2 > 0.05\right) = \Pr\left(S \cdot \lambda^2 < S \cdot 20\right) = 0.5,$$

where *S* · 20 is the quantile of order 0.5 of a *χ*<sup>2</sup>-distributed random variable with *k* = 20 degrees of freedom.

```
> sigma2=0.05
> k=20
> p=0.5
> q=qchisq(p,k)
> q
[1] 19.33743
> S=q*sigma2
```
Parameter *S* is then equal to

$$S = 19.3374 \times 0.05 = 0.9669 \approx 1.$$

The elicited prior distribution for *σ*<sup>2</sup> is IG*(*20*/*2*,* 1*/*2*)* and is shown in Fig. 3.3.

*Example 3.6 (Printed Documents—Continued)* Consider again Example 3.5 where magnetic flux was measured on uncontested and questioned pages. The population variance *σ*<sup>2</sup> was assumed known and equal to 0*.*0576. Suppose now that a new measuring device is used and that the number of previous experiments (i.e., measurements) conducted with this device is limited. A conjugate prior distribution as in (3.12) is introduced to model prior uncertainty about *θ* and *σ*2.

The prior distribution for *θ* | *σ*<sup>2</sup> can be centered at *μ* = 17*.*5 as in Example 3.5, with *n*<sub>0</sub> = 0*.*004 reflecting a very weak prior belief relative to the precision of the observations: *θ* ∼ N*(*17*.*5*, σ*<sup>2</sup>*/*0*.*004*)*.

```
> mu=17.5
> n0=0.004
```
The prior distribution for *σ*<sup>2</sup> has been elicited above, with *k* = 20 degrees of freedom and *S* = 1, that is, *σ*<sup>2</sup> ∼ IG*(*20*/*2*,* 1*/*2*)*, as shown in Fig. 3.3.

```
> library(extraDistr)
> S=1
> k=20
> plot(function(x) dinvgamma(x,k/2,S/2),0,0.2,
+ xlab=expression(paste(sigma)^2),ylab='')
```
Note that the function dinvgamma is available in the package extraDistr (Wolodzko, 2020). Measurements are the same as in Example 3.5.

```
> x=c(16,15,15,16,15,16)
> y=c(15,16,16)
> n=length(x)
> ny=length(y)
```
Let us first consider the marginal density in the numerator of the Bayes factor in (3.13). It is a Student t distribution with 2*α* + *n* = *k* + *n* = 26 degrees of freedom, centered at *μ<sub>n</sub>* = 15*.*5013, with spread parameter *s<sub>n</sub>* = 20*.*6724.

```
> mun=(n*mean(x)+n0*mu)/(n+n0)
> mun
[1] 15.50133
> s2=sum((x-mean(x))^2)
> bn=S/2+(s2+n0*n*(mean(x)-mu)^2*(n0+n)^(-1))/2
> sn=ny*(n+n0)/(n+n0+ny)*(k+n)/2*bn^(-1)
> sn
[1] 20.6724
```
#### *Example 3.6* (continued)

The marginal density in the denominator of the Bayes factor in (3.13) is a Student t distribution with 2*α* = *k* = 20 degrees of freedom, centered at *μ* = 17*.*5, with spread parameter *s<sub>d</sub>* = 0*.*0799.

```
> sd=ny*n0/(n0+ny)*k/S
> sd
[1] 0.07989348
```
The density of a Student t distributed random variable with location and precision parameters can be calculated using the function dstp, available in the package LaplacesDemon (Hall et al., 2020). The Bayes factor can be obtained as

```
> library(LaplacesDemon)
> BF=dstp(mean(y),mun,sn,k+n)/dstp(mean(y),mu,sd,k)
> BF
[1] 13.88188
```

The Bayes factor represents moderate support for the proposition according to which page two has been printed by the same device as the one used for printing pages one and three, compared to the proposition according to which page two has been printed by a different device.

It is worth emphasizing that the BF is highly sensitive to the choice of the prior (see Sect. 1.11). A sensitivity analysis should therefore be conducted.
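A minimal sketch of such a sensitivity analysis is shown below (in Python rather than the book's R, purely for illustration). The Student t density with location and precision parameters is coded by hand, mirroring what LaplacesDemon's dstp computes, and the BF in (3.13) is recomputed over a range of values of *n*<sub>0</sub>, keeping the data and the remaining hyperparameters of Example 3.6 fixed:

```python
import math

def dstp(x, mu, tau, nu):
    # Student t density with location mu, precision tau and nu degrees of
    # freedom (hand-coded equivalent of LaplacesDemon::dstp).
    c = math.gamma((nu + 1) / 2) / math.gamma(nu / 2) * math.sqrt(tau / (nu * math.pi))
    return c * (1 + tau * (x - mu) ** 2 / nu) ** (-(nu + 1) / 2)

# Data and remaining hyperparameters from Example 3.6
x = [16, 15, 15, 16, 15, 16]
y = [15, 16, 16]
n, ny, mu, k, S = len(x), len(y), 17.5, 20, 1.0
xbar, ybar = sum(x) / n, sum(y) / ny
s2 = sum((xi - xbar) ** 2 for xi in x)

def bf(n0):
    # Bayes factor (3.13) as a function of the prior parameter n0
    mun = (n * xbar + n0 * mu) / (n + n0)
    bn = S / 2 + (s2 + n0 * n * (xbar - mu) ** 2 / (n0 + n)) / 2
    sn = ny * (n + n0) / (n + n0 + ny) * (k + n) / 2 / bn
    sd = ny * n0 / (n0 + ny) * k / S
    return dstp(ybar, mun, sn, k + n) / dstp(ybar, mu, sd, k)

for n0 in (0.001, 0.004, 0.04, 0.4):
    print(n0, round(bf(n0), 2))
```

For *n*<sub>0</sub> = 0*.*004 this reproduces the value 13.88 obtained above; other choices of *n*<sub>0</sub> can change the BF appreciably, illustrating its sensitivity to the prior.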

#### *3.3.3 Normal Model for Inference of Source*

Consider again a case as described in Sect. 3.3.1, involving the analysis of toner on printed documents. Magnetic flux was considered as a feature of interest because it is largely influenced by the settings of the printing device. Suppose now that more than one potential source (i.e., printing device) is available for examination. The issue of interest is which of two machines has been used to print a questioned document (e.g., a contested contract). The propositions of interest can be defined as follows:

*H*<sup>1</sup> : The questioned document has been printed with machine *A*.

*H*<sup>2</sup> : The questioned document has been printed with machine *B*.

The two potential sources, i.e., machines *A* and *B*, are used to print documents under controlled conditions. The measurements made on documents printed by the two devices are denoted {**x**<sub>*p*</sub>} = *(***x**<sub>*pi*</sub>*, p* = *A, B* and *i* = 1*,...,m)*, with **x**<sub>*pi*</sub> = *(x<sub>pi1</sub>,...,x<sub>pin</sub>)* denoting the vector of *n* measurements for each analyzed page, *i* = 1*,...,m*, from each printer *p* = *A, B*. Measurements are assumed to be normally distributed with unknown mean *θ<sub>p</sub>*, *p* = *A, B*, and variance *σ*<sup>2</sup>. The variance is assumed to be known and equal for the two devices. A conjugate normal prior distribution is taken for the unknown mean *θ<sub>p</sub>*, say *θ<sub>p</sub>* ∼ N*(μ<sub>p</sub>, τ*<sup>2</sup><sub>*p*</sub>*)*, *p* = *A, B*.

Measurements on the questioned document are denoted by **y** = *(***y**<sub>1</sub>*,...,* **y**<sub>*q*</sub>*)*, with **y**<sub>*j*</sub> = *(y<sub>j1</sub>,...,y<sub>jn</sub>)* denoting the vector of *n* measurements from each contested page *j* = 1*,...,q*. For cases in which *q >* 1, it is assumed that all pages have been printed with a single device. The distribution of measurements on the questioned document is also taken to be normal. The sample mean *y*¯ = (1*/nq)* ∑<sub>*j*=1</sub><sup>*q*</sup> ∑<sub>*k*=1</sub><sup>*n*</sup> *y<sub>jk</sub>* has a normal distribution with mean *θ<sub>p</sub>* and variance *σ*<sup>2</sup>*/nq*, *(Y*¯ | *θ<sub>p</sub>, σ*<sup>2</sup>*)* ∼ N*(θ<sub>p</sub>, σ*<sup>2</sup>*/nq)*.

The Bayes factor can be computed as

$$\text{BF} = \frac{\int f(\bar{\mathbf{y}} \mid \theta\_A) f(\theta\_A \mid \mathbf{x}\_A) d\theta\_A}{\int f(\bar{\mathbf{y}} \mid \theta\_B) f(\theta\_B \mid \mathbf{x}\_B) d\theta\_B}$$

$$= \frac{f(\bar{\mathbf{y}} \mid \mathbf{x}\_A, H\_1)}{f(\bar{\mathbf{y}} \mid \mathbf{x}\_B, H\_2)}. \tag{3.14}$$

The marginal probability density in the numerator can be obtained in closed form. It is a normal distribution with mean equal to the posterior mean *μ<sub>A,x</sub>* and variance equal to the sum of the posterior variance *τ*<sup>2</sup><sub>*A,x*</sub> and the variance *σ*<sup>2</sup>*/nq* (where *nq* is the total number of observations), that is, *f (y*¯ | **x**<sub>*A*</sub>*, H*<sub>1</sub>*)* = N*(μ<sub>A,x</sub>, τ*<sup>2</sup><sub>*A,x*</sub> + *σ*<sup>2</sup>*/nq)*. In the same way, one can obtain the marginal probability density in the denominator, *f (y*¯ | **x**<sub>*B*</sub>*, H*<sub>2</sub>*)* = N*(μ<sub>B,x</sub>, τ*<sup>2</sup><sub>*B,x*</sub> + *σ*<sup>2</sup>*/nq)*. As observed in Sect. 3.3.1, the numerator and the denominator of (3.14) can therefore be calculated as the densities of these two normal distributions, evaluated at the sample mean *y*¯ of the measurements on the questioned document.

*Example 3.7 (Printed Documents)* Consider a type of case and propositions as introduced above, and suppose that there is only one contested page, that is, *q* = 1. Measurements of the magnetic flux lead to the following results: **y** = *(*20*,* 20*,* 21*)* (i.e., *n* = 3 measurements are taken). Two pages are printed with each printing device. The results are as follows (Biedermann et al., 2016a):


#### *Example 3.7* (continued) The available data thus are

```
> xa=c(20,20,19,20,21,20)
> xb=c(21,20,21,21,22,21)
> y=c(20,20,21)
> n=length(y)
> na=length(xa)
> nb=length(xb)
```

The population standard deviation *σ* is taken to be equal to 0.24, as in Example 3.5. We also choose the same prior distribution as used in Example 3.5 to describe uncertainty about the magnetic flux of toner printed by the two printing devices. Thus, *μ<sub>A</sub>* = *μ<sub>B</sub>* = 17*.*5 and *τ*<sup>2</sup><sub>*A*</sub> = *τ*<sup>2</sup><sub>*B*</sub> = 3*.*922.


The posterior distributions *f (θ<sub>A</sub>* | **x**<sub>*A*</sub>*)* and *f (θ<sub>B</sub>* | **x**<sub>*B*</sub>*)* can be obtained by a single application of Bayes' theorem using the full set of available measurements for each printer. The posterior parameters *μ<sub>A,x</sub>*, *μ<sub>B,x</sub>*, *τ*<sup>2</sup><sub>*A,x*</sub>, and *τ*<sup>2</sup><sub>*B,x*</sub> can be calculated using the function post_distr:

```
> sigma2=0.24^2
> mu=17.5
> tau2=3.922
> muapost=post_distr(sigma2,na,mean(xa),mu,tau2)[1]
> tauapost=post_distr(sigma2,na,mean(xa),mu,tau2)[2]
> mubpost=post_distr(sigma2,nb,mean(xb),mu,tau2)[1]
> taubpost=post_distr(sigma2,nb,mean(xb),mu,tau2)[2]
```
The two marginal densities in the numerator and denominator of the BF in (3.14) can be calculated at the observed value *y*¯. The BF can thus be computed as the ratio of two marginal densities:

```
> BF=dnorm(mean(y),muapost,sqrt(sigma2/n+tauapost))/
+ dnorm(mean(y),mubpost,sqrt(sigma2/n+taubpost))
> BF
[1] 304.7886
```
This value represents moderately strong support for the proposition according to which the questioned page has been printed using device *A*, rather than using device *B*.

Consider a "0–*l<sub>i</sub>*" loss function as in Table 1.4. The optimal decision is to accept the view according to which the questioned page was printed by device *A* (as stated by proposition *H*<sub>1</sub>), rather than by device *B*, whenever


$$\text{BF} > \frac{l\_1/l\_2}{\pi\_1/\pi\_2}.$$

If the prior odds are even, and a symmetric loss function is felt to be appropriate, the Bayes decision is to accept the view according to which the questioned document has been printed with machine *A* (*B*) whenever the BF is greater (smaller) than 1.
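As a small numerical illustration (in Python rather than the book's R; the losses and prior probabilities below are invented for the example):

```python
# Decision threshold for the "0-l_i" loss function: accept H1 whenever
# BF > (l1/l2)/(pi1/pi2). Losses and priors here are hypothetical.
l1, l2 = 4.0, 1.0      # wrongly accepting H1 is taken to be 4 times worse
pi1, pi2 = 0.5, 0.5    # even prior odds
threshold = (l1 / l2) / (pi1 / pi2)
print(threshold)       # 4.0

BF = 304.7886          # Bayes factor obtained in Example 3.7
print(BF > threshold)  # True: the optimal decision is to accept H1
```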

When available information is limited, one may choose a non-informative prior distribution for *(θ , σ*2*)* that can be specified as

$$f(\theta, \sigma^2) = \frac{1}{\sigma^2}. \tag{3.15}$$

In this case, the marginal distribution in the numerator of the BF is proportional to a Student t distribution with *nA* − 1 degrees of freedom, centered at the sample mean *x*¯*<sup>A</sup>* with spread parameter *sn* equal to

$$s\_n = \frac{n\_A \, nq}{(n\_A + nq)\, s\_A^2},$$

where *s*<sup>2</sup><sub>*A*</sub> = *(n<sub>A</sub>* − 1*)*<sup>−1</sup> ∑<sub>*i*=1</sub><sup>*n<sub>A</sub>*</sup> *(x<sub>Ai</sub>* − *x*¯<sub>*A*</sub>*)*<sup>2</sup>, *n<sub>A</sub>* is the total number of observations from device *A*, and *nq* is the total number of measurements from the *q* contested pages (i.e., *n* measurements for each contested page). This can be denoted as *f*<sub>1</sub>*(y*¯ | *x*¯<sub>*A*</sub>*, s<sub>n</sub>, n<sub>A</sub>* − 1*)*.

Vice versa, the marginal distribution in the denominator of the BF is proportional to a Student t distribution with *nB* − 1 degrees of freedom, centered at the sample mean *x*¯*<sup>B</sup>* with spread parameter *sd* equal to

$$s\_d = \frac{n\_B \, nq}{(n\_B + nq)\, s\_B^2},$$

where *s*<sup>2</sup><sub>*B*</sub> = *(n<sub>B</sub>* − 1*)*<sup>−1</sup> ∑<sub>*i*=1</sub><sup>*n<sub>B</sub>*</sup> *(x<sub>Bi</sub>* − *x*¯<sub>*B*</sub>*)*<sup>2</sup> and *n<sub>B</sub>* is the total number of observations from device *B*. This can be denoted as *f*<sub>2</sub>*(y*¯ | *x*¯<sub>*B*</sub>*, s<sub>d</sub>, n<sub>B</sub>* − 1*)*.

The Bayes factor can then be obtained as

$$\text{BF} = \frac{f\_1(\bar{y} \mid \bar{x}\_A, s\_n, n\_A - 1)}{f\_2(\bar{y} \mid \bar{x}\_B, s\_d, n\_B - 1)}. \tag{3.16}$$

*Example 3.8 (Printed Documents—Continued)* In Example 3.7, a normal prior distribution has been used for *(θ , σ*2*)*. Consider now a non-informative prior distribution as in (3.15). In order to compute the Bayes factor, one must first obtain the spread parameters *sn* and *sd* under the competing propositions.

*Example 3.8* (continued)

```
> s2a=var(xa)
> sn=na*n/((na+n)*s2a)
> s2b=var(xb)
> sd=nb*n/((nb+n)*s2b)
```
Note that in this case the number of contested pages *q* is set equal to 1. The density of a Student t distributed random variable with location and precision parameters can be obtained using the function dstp available in the package LaplacesDemon (Hall et al., 2020). The Bayes factor can be obtained as follows:

```
> library(LaplacesDemon)
> BF=dstp(mean(y),mean(xa),sn,na-1)/
+ dstp(mean(y),mean(xb),sd,nb-1)
> BF
[1] 2.197
```
The Bayes factor represents weak support for the proposition according to which the questioned document has been printed with machine *A*, rather than with machine *B*.

#### **More Than Two Propositions**

Consider now the case where more than two devices are available. As in Sect. 1.6, the question is how to evaluate measurements made on questioned and known items (i.e., documents), as the BF involves pairwise comparisons. A scaled version of the marginal likelihood may be reported as in (1.27).

*Example 3.9 (Printed Documents, More Than Two Propositions)* Recall Example 3.7, and assume that a third printer, machine *C*, is available for comparative examinations. The propositions of interest are therefore:

*H*<sup>1</sup> : The questioned document has been printed with machine *A*.

*H*<sup>2</sup> : The questioned document has been printed with machine *B*.

*H*<sup>3</sup> : The questioned document has been printed with machine *C*.

Two pages are printed with the additional printing device *C*. All results, including those from machines *A* and *B*, are as follows:

*Example 3.9* (continued)


Let the prior distribution describing uncertainty about the magnetic flux characterizing machine *C* be the same as introduced previously, that is, *μ<sub>C</sub>* = 17*.*5 and *τ*<sup>2</sup><sub>*C*</sub> = 3*.*922. First, the posterior distribution *f (θ<sub>C</sub>* | **x**<sub>*C*</sub>*)* is calculated:

```
> xc=c(21,20,21,20,21,20)
> nc=length(xc)
> mucpost=post_distr(sigma2,nc,mean(xc),mu,tau2)[1]
> taucpost=post_distr(sigma2,nc,mean(xc),mu,tau2)[2]
```
Next, consider the marginal likelihoods of the sample mean that can be obtained as

```
> mla=dnorm(mean(y),muapost,sqrt(sigma2/n+tauapost))
> mlb=dnorm(mean(y),mubpost,sqrt(sigma2/n+taubpost))
> mlc=dnorm(mean(y),mucpost,sqrt(sigma2/n+taucpost))
```
The scaled version of the marginal likelihoods then is

```
> smla=mla/(mla+mlb+mlc)
> smlb=mlb/(mla+mlb+mlc)
> smlc=mlc/(mla+mlb+mlc)
> round(c(smla,smlb,smlc),5)
[1] 0.18593 0.00061 0.81346
```
Recall from Sect. 1.6 that this is equivalent to reporting the posterior probability of competing propositions with equal prior probabilities. Therefore, if Pr*(H*<sub>1</sub>*)* = Pr*(H*<sub>2</sub>*)* = Pr*(H*<sub>3</sub>*)* = 1*/*3, then proposition *H*<sub>3</sub> has received the greatest evidential support.

Alternatively, the analyst may also consider the possibility of aggregating propositions *H*<sup>1</sup> and *H*<sup>2</sup> and consider:


*Example 3.10 (Printed Documents, More Than Two Propositions— Continued)* When considering a single proposition *H*<sup>1</sup> compared to a composite proposition *H*¯<sup>1</sup> as defined above, the Bayes factor can be obtained as in (1.28), with Pr*(H*1*)* = 1*/*3 and Pr*(H*¯1*)* = 2*/*3.

```
> p=1/3
> mlc*(1-p)/(mla*p+mlb*p)
[1] 8.72179
```

#### *3.3.4 Score-Based Bayes Factor*

As mentioned previously in Sect. 1.5.2, it may not be possible to specify a probability model for some types of forensic evidence and data. An example was given in Sect. 3.2.3 for discrete data regarding consecutive matching striations, used to quantify the extent of agreement between marks on bullets.

Consider now a case where a saliva trace is collected at the crime scene. The salivary microbiome is analyzed, as well as that of material originating from a known source, Mr. X, with the aim of discriminating between the following competing propositions:

*H*<sup>1</sup> : The saliva trace comes from Mr. X.

*H*<sup>2</sup> : The saliva trace comes from the twin brother of Mr. X.


Note that the proposition *H*<sup>2</sup> represents an extreme case of relatedness. To investigate this type of case, consider the data collected by Scherz (2021). This longitudinal study involving 30 monozygotic twins has shown the potential of salivary microbiome profiles to discriminate between closely related individuals (Scherz et al., 2021). This may represent an alternative method when standard DNA profiling analyses yield no useful results.

In the study by Scherz (2021), four salivary samples were collected from each participant: the first at the beginning of the study, and the others after 1, 12, and 13 months. Given the complex composition of microbiota, a distance can be calculated to compare microbiome profiles. One possibility is the Jaccard distance, the complement of the proportion of amplicon sequence variants (ASVs) shared by the two compared samples among all distinct ASVs occurring in them. This measure has shown good discriminatory power. Other distances (e.g., Jensen–Shannon) can be calculated (Scherz, 2021).
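A minimal sketch of this distance (in Python rather than the book's R; the ASV labels below are invented):

```python
def jaccard_distance(a, b):
    # Jaccard distance between two samples, each described by the set of
    # ASVs detected in it: 1 minus the shared-over-distinct proportion.
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Hypothetical ASV sets for two saliva samples
s1 = {"ASV1", "ASV2", "ASV3", "ASV4"}
s2 = {"ASV2", "ASV3", "ASV4", "ASV5"}
print(jaccard_distance(s1, s2))   # 3 shared out of 5 distinct -> 0.4
```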

The intra-individual variability was studied by comparing all four samples of each individual. The intra-pair variability was evaluated by comparing pairs of samples from related individuals (here: monozygotic twins). The inter-individual variability was studied by comparing samples of unrelated individuals (Fig. 3.4).

Let *δ(y, x)* denote the distance between the analytical features of questioned material (i.e., a saliva trace of unknown origin) and control material (i.e., a saliva sample from Mr. X). A score-based Bayes factor (sBF) can be defined as follows:

$$\text{sBF} = \frac{\text{g}(\delta(\mathbf{x}, \mathbf{y}) \mid H\_1)}{\text{g}(\delta(\mathbf{x}, \mathbf{y}) \mid H\_2)}. \tag{3.17}$$

To obtain a value for this sBF, it is necessary to study the probability distribution of the calculated score under the competing propositions. However, the limited number of samples per individual, available for pairwise comparison, might make it difficult to assess the numerator, which is specific for a given person of interest. To address this problem, Davis et al. (2012) propose the use of a database of simulated samples to help with the construction of probability distributions for scores.

In the example studied here, a maximum of 6 intra-individual comparisons is available for each participant. A viable alternative is to perform a so-called common-source comparison,<sup>2</sup> and use the limited number of items from all participants, provided that one is willing to assume a generic probability distribution for all individuals in the numerator. In the same way, a generic probability distribution is used in the denominator in all cases where a twin is assumed as the alternative source of the saliva trace (Bozza et al., 2022).

Denote by {*z*<sup>1</sup><sub>*ij*</sub>*, i* = 1*,...,m*<sub>1</sub>*, j* = 1*,...,n*<sub>1</sub>} the intra-individual distances and by {*z*<sup>2</sup><sub>*ij*</sub>*, i* = 1*,...,m*<sub>2</sub>*, j* = 1*,...,n*<sub>2</sub>} the intra-pair distances, where *m*<sub>1</sub> (*m*<sub>2</sub>) is the number of distinct individuals (couples of twin brothers) and *n*<sub>1</sub> (*n*<sub>2</sub>) is the number of distances calculated for each individual (couple). A normal distribution is used in both the numerator and the denominator to model the *within-source* variation

<sup>2</sup> See Sect. 1.5.2 on the difference between specific-source and common-source propositions.

(i.e., the variation between distances characterizing materials originating from the same individual and from the same couple of twins, respectively), *Z*<sup>*p*</sup><sub>*ij*</sub> ∼ N*(θ<sub>p</sub>, σ*<sup>2</sup><sub>*p*</sub>*)*, where *p* = {1*,* 2}. Different distributions can be used to describe the between-source variation (i.e., the variation between distances characterizing materials originating from different individuals and from different couples of twins, respectively). Here, a normal distribution is retained, *θ<sub>p</sub>* ∼ N*(μ<sub>p</sub>, τ*<sup>2</sup><sub>*p*</sub>*)*. The mean between sources *μ<sub>p</sub>*, the within-source variance *σ*<sup>2</sup><sub>*p*</sub>, and the between-source variance *τ*<sup>2</sup><sub>*p*</sub> can be estimated from the background data:

$$
\hat{\mu}\_p = \bar{z}\_p = \frac{1}{m\_p n\_p} \sum\_{i=1}^{m\_p} \sum\_{j=1}^{n\_p} z\_{ij}^p \tag{3.18}
$$

$$\hat{\sigma}\_p^2 = \frac{1}{m\_p(n\_p - 1)} \sum\_{i=1}^{m\_p} \sum\_{j=1}^{n\_p} (z\_{ij}^p - \bar{z}\_i^p)^2 \tag{3.19}$$

$$
\hat{\tau}\_p^2 = \frac{1}{m\_p - 1} \sum\_{l=1}^{m\_p} (\bar{z}\_l^p - \bar{z}\_p)^2 - \frac{\hat{\sigma}\_p^2}{n\_p},
\tag{3.20}
$$

where

$$\bar{z}\_i^p = \frac{1}{n\_p} \sum\_{j=1}^{n\_p} z\_{ij}^p.$$
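These estimators can be sketched as follows (in Python rather than the book's R; the toy distances are invented purely to exercise the code):

```python
def estimates(z):
    # Plug-in estimates (3.18)-(3.20) from background distances z,
    # a list of m_p lists, each holding the n_p distances of one
    # individual (or couple of twins).
    m, n = len(z), len(z[0])
    zbar_i = [sum(row) / n for row in z]                  # source means
    mu_hat = sum(map(sum, z)) / (m * n)                   # (3.18)
    sigma2_hat = sum((zij - zbar_i[i]) ** 2
                     for i, row in enumerate(z)
                     for zij in row) / (m * (n - 1))      # (3.19)
    tau2_hat = (sum((zb - mu_hat) ** 2 for zb in zbar_i) / (m - 1)
                - sigma2_hat / n)                         # (3.20)
    return mu_hat, sigma2_hat, tau2_hat

# Invented toy data: two sources with two distances each
print(estimates([[1.0, 3.0], [5.0, 7.0]]))   # -> (4.0, 2.0, 7.0)
```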

> *Example 3.11 (Saliva Traces)* Consider a case where a saliva trace is recovered at a crime scene and a sample is taken from a person of interest for comparative purposes. The Jaccard distance between the microbiota composition of recovered and control sample is equal to 0.51.

```
> d=0.51
```

The propositions are *H*1, the compared items come from the same source, and *H*2, the compared items come from different sources (twins). Suppose that the estimated means between sources in (3.18) are 0.454 and 0.769; the estimated within-source variances in (3.19) are 0.0057 and 0.00067; the estimated between-source variances in (3.20) are 0.0028 and 0.0024 (Source of data: Scherz (2021)).


The Bayes factor can then be obtained straightforwardly as in (3.17).

*Example 3.11* (continued)

```
> mu1=0.454; sigma1=0.0057; tau1=0.0028
> mu2=0.769; sigma2=0.00067; tau2=0.0024
> BF=dnorm(d,mu1,sqrt(tau1+sigma1))/
+ dnorm(d,mu2,sqrt(tau2+sigma2))
> BF
[1] 27766.33
```

The Bayes factor provides very strong support for the proposition that the saliva traces originate from the same individual rather than from two different individuals (twins).

Note that a higher value of the BF is expected whenever the alternative proposition *H*<sup>2</sup> involves unrelated individuals. The inspection of Fig. 3.4 highlights that higher distances are recorded in this type of case.

The between-source variability can also be modeled by a kernel density distribution, as presented in Bozza et al. (2022). See also Sect. 3.4.1.2, where a detailed description of the kernel density approach is given for two-level multivariate data.

#### **3.4 Multivariate Data**

Forensic scientists encounter multivariate data in contexts where the examined objects and materials can be described by several variables. Examples are glass fragments that are searched and recovered on the clothing of a person of interest and on a crime scene, or seized materials supposed to contain illicit substances. Such materials may be analyzed and compared on the basis of their chemical compounds as well as their physical characteristics. Multivariate data also arise in other forensic science disciplines, such as handwriting examination. Handwritten characters can, in fact, be described by means of several variables, such as the width, the height, the surface, the orientation of the strokes, or by Fourier descriptors (Marquis et al., 2005). In addition, an emerging topic that forensic document examiners nowadays encounter is handwriting (e.g., signatures) on digital tablets. Such electronic devices provide several static (e.g., length of a signature) and dynamic features (e.g., speed) that can be used as variables to describe signatures (Linden et al., 2018). These developments have led to substantial databases that often present a complex dependence structure, a large number of variables, and multiple sources of variation.

#### *3.4.1 Two-Level Models*

Denote by *p* the number of characteristics (variables) observed on items of a particular evidential type. Suppose that continuous measurements of these variables are available on a random sample of *m* sources with *n* items from each source. For handwriting evidence, a source is a single writer, with *n* characters from each writer and *p* observed characteristics that pertain to the shape of handwritten characters. For glass evidence, a source is a window, with *n* replicate measurements from a glass fragment originating from each window and *p* observed characteristics given by concentrations in elemental composition. The background data can be denoted by **z***ij* = *(zij*1*,...,zijp)*, where *i* = 1*,...,m* denotes the number of sources (e.g., windows), *j* = 1*,...,n* denotes the number of items for each source (e.g., replicate measurements from a glass fragment), and *p* is the number of variables.

This data structure suggests a two-level hierarchy, accounting for two sources of variation: the variation between replicate measurements within the same source (the so-called within-source variation) and the variation between sources (the so-called between-source variation).

#### **3.4.1.1 Normal Distribution for the Between-Source Variability**

In some applications, data exhibit regularities that can reasonably be described using standard probabilistic models. For example, the within-source variability and the between-source variability may be modeled by a normal distribution. A Bayesian statistical model for the evaluation of trace evidence for two-level normally distributed multivariate data was proposed by Aitken and Lucy (2004) in the context of evaluating the elemental composition of glass fragments. To illustrate this model, denote the mean vector within source *i* by *θ<sub>i</sub>*. Denote by *W* the matrix of within-source variances and covariances. The distribution of *Z<sub>ij</sub>* for the within-source variation is taken to be normal, *Z<sub>ij</sub>* ∼ N*(θ<sub>i</sub>, W)*. For the between-source variation, the mean vector between sources is denoted by *μ*, and the matrix of between-source variances and covariances by *B*. The distribution of the *θ<sub>i</sub>* is taken to be normal, *θ<sub>i</sub>* ∼ N*(μ, B)*.

Measurements are available on items from an unknown source (recovered material) as well as measurements on items from a known source (control material). The examined items may or may not come from the same source. Competing propositions may be formulated as follows:


Denote the measurements on recovered and control items by, respectively, **y** = *(***y**1*,...,* **y***ny )* and **x** = *(***x**1*,...,* **x***nx )*, where **y***<sup>j</sup>* = *(yj*1*,...,yjp)*, **x***<sup>j</sup>* = *(xj*1*,...,xjp)*, *j* = 1*,...,ny(x)*. A Bayes factor can be derived as in (1.15):

$$\text{BF} = \frac{f(\mathbf{y}, \mathbf{x} \mid H\_1)}{f(\mathbf{y}, \mathbf{x} \mid H\_2)}. \tag{3.21}$$

The distribution of the measurements on the recovered and control materials is taken to be normal, with mean vectors *θ<sub>y</sub>* and *θ<sub>x</sub>*, and covariance matrices *W<sub>y</sub>* and *W<sub>x</sub>*. Thus,

$$(Y \mid \theta\_{\text{y}}, W\_{\text{y}}) \sim \mathcal{N}(\theta\_{\text{y}}, W\_{\text{y}}) \qquad ; \qquad (X \mid \theta\_{\text{x}}, W\_{\text{x}}) \sim \mathcal{N}(\theta\_{\text{x}}, W\_{\text{x}}) . \tag{3.22}$$

The Bayes factor is the ratio of two probability densities of the form *f (***y***,* **x** | *Hi)* = *fi(***y***,* **x** | *μ,W,B)*, *i* = 1*,* 2. The probability density in the numerator is given by

$$f\_{\mathbf{l}}(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}, \, W, B) = \int\_{\boldsymbol{\theta}} f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\mathbf{x} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) d\boldsymbol{\theta}, \quad (3.23)$$

where

$$f(\mathbf{y} \mid \boldsymbol{\theta}, W) = (2\pi)^{-pn\_y/2} |W|^{-n\_y/2} \exp\left[ -\frac{1}{2} \sum\_{j=1}^{n\_y} (\mathbf{y}\_{j} - \boldsymbol{\theta})' W^{-1} \left( \mathbf{y}\_{j} - \boldsymbol{\theta} \right) \right], \quad (3.24)$$

*f (***x** | *θ,W)* has the same probabilistic structure as *f (***y** | *θ,W)*, and

$$f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) = (2\pi)^{-p/2} |B|^{-1/2} \exp\left[ -\frac{1}{2} \left( \boldsymbol{\theta} - \boldsymbol{\mu} \right)' B^{-1} \left( \boldsymbol{\theta} - \boldsymbol{\mu} \right) \right]. \quad (3.25)$$

In the denominator, where **y** and **x** are taken to be independent, the probability density is given by

$$f\_2(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}, W, B) = f\_2(\mathbf{y} \mid \boldsymbol{\mu}, W, B) \times f\_2(\mathbf{x} \mid \boldsymbol{\mu}, W, B) \tag{3.26}$$

$$= \int\_{\boldsymbol{\theta}} f(\mathbf{y} \mid \boldsymbol{\theta}, \, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \, B) d\boldsymbol{\theta} \int\_{\boldsymbol{\theta}} f(\mathbf{x} \mid \boldsymbol{\theta}, \, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \, B) d\boldsymbol{\theta} \,.$$

This is equivalent to the algebraic expression of the Bayes factor in (1.23). In the numerator, under proposition *H*1, the source means *θ <sup>y</sup>* and *θ <sup>x</sup>* are assumed equal, say *θ <sup>y</sup>* = *θ <sup>x</sup>* = *θ*. In the denominator, under proposition *H*2, the source means *θ <sup>y</sup>* and *θ <sup>x</sup>* are assumed to be different.

The integrals in (3.23) and (3.26) have an analytical solution. A proof is given by Aitken and Lucy (2004). The numerator can be shown to be equal to

$$f(\mathbf{y}, \mathbf{x} \mid H\_1) = |\, 2\pi W \mid^{-(n\_\mathbf{y} + n\_\mathbf{x})/2} |\, 2\pi B \mid^{-1/2} |\, 2\pi \left[ (n\_\mathbf{y} + n\_\mathbf{x}) W^{-1} + B^{-1} \right]^{-1} \vert^{\frac{1}{2}}$$

$$\times \exp\left\{ -\frac{1}{2} \left[ F\_1 + F\_2 + \text{tr}\left( S\_\mathbf{y} W^{-1} \right) + \text{tr}\left( S\_\mathbf{x} W^{-1} \right) \right] \right\}, \qquad (3.27)$$

where:

$$\begin{split} &F\_{1} = (\bar{\mathbf{w}} - \boldsymbol{\mu})' \left(\frac{W}{n\_{\text{y}} + n\_{x}} + \boldsymbol{B}\right)^{-1} (\bar{\mathbf{w}} - \boldsymbol{\mu}), \\ &F\_{2} = (\bar{\mathbf{y}} - \bar{\mathbf{x}})' \left(\frac{W}{n\_{\text{y}}} + \frac{W}{n\_{x}}\right)^{-1} (\bar{\mathbf{y}} - \bar{\mathbf{x}}), \\ &\bar{\mathbf{w}} = \frac{1}{n\_{\text{y}} + n\_{x}} \left(\sum\_{j=1}^{n\_{\text{y}}} \mathbf{y}\_{j} + \sum\_{j=1}^{n\_{x}} \mathbf{x}\_{j}\right), \bar{\mathbf{y}} = \frac{1}{n\_{\text{y}}} \sum\_{j=1}^{n\_{y}} \mathbf{y}\_{j} \text{ and } \bar{\mathbf{x}} = \frac{1}{n\_{x}} \sum\_{j=1}^{n\_{x}} \mathbf{x}\_{j} \text{ , } \\ &S\_{\boldsymbol{\mathcal{Y}}} = \sum\_{j=1}^{n\_{\text{y}}} \left(\mathbf{y}\_{j} - \bar{\mathbf{y}}\right) \left(\mathbf{y}\_{j} - \bar{\mathbf{y}}\right)', S\_{\boldsymbol{x}} = \sum\_{j=1}^{n\_{x}} \left(\mathbf{x}\_{j} - \bar{\mathbf{x}}\right) \left(\mathbf{x}\_{j} - \bar{\mathbf{x}}\right)'. \end{split}$$

Consider the first factor in the denominator, *f*<sub>2</sub>*(***y** | *μ, W, B)*. It can be obtained as

$$f\_2(\mathbf{y} \mid \mu, W, B) = \left| \, 2\pi \, W \mid^{-n\_{\mathcal{Y}}/2} \right| \, 2\pi \, B \mid^{-1/2} \left| \, 2\pi (n\_{\mathcal{Y}} W^{-1} + B^{-1})^{-1} \right|^{1/2}$$

$$\times \exp\left\{ -\frac{1}{2} \left[ (\bar{\mathbf{y}} - \boldsymbol{\mu})' (n\_{\mathcal{Y}}^{-1} W + B)^{-1} (\bar{\mathbf{y}} - \boldsymbol{\mu}) + \text{tr}\left( S\_{\mathcal{Y}} W^{-1} \right) \right] \right\}. \tag{3.28}$$

The second factor *f*<sub>2</sub>*(***x** | *μ, W, B)* can be obtained analogously as

$$f\_2(\mathbf{x} \mid \boldsymbol{\mu}, W, B) = |\, 2\pi W \,|^{-n\_x/2} |\, 2\pi B \,|^{-1/2} |\, 2\pi (n\_x W^{-1} + B^{-1})^{-1} \,|^{1/2}$$

$$\times \exp \left\{ -\frac{1}{2} \left[ (\bar{\mathbf{x}} - \boldsymbol{\mu})' (n\_x^{-1} W + B)^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) + \text{tr} \left( S\_{x} W^{-1} \right) \right] \right\}. \tag{3.29}$$

The Bayes factor in (3.21) then is the ratio between (3.27) and the product between (3.28) and (3.29), respectively. After some manipulation, the BF can be obtained as the ratio between

$$\left|\, 2\pi \left[ (n\_y + n\_x) W^{-1} + B^{-1} \right]^{-1} \right|^{1/2} \exp\left\{ -\frac{1}{2} \left( F\_1 + F\_2 \right) \right\} \tag{3.30}$$

and

$$|\, 2\pi B \,|^{-1/2} |\, 2\pi (n\_y W^{-1} + B^{-1})^{-1} \,|^{1/2} |\, 2\pi (n\_x W^{-1} + B^{-1})^{-1} \,|^{1/2}$$

$$\times \exp\left\{-\frac{1}{2}\left(F\_{3}+F\_{4}\right)\right\},\tag{3.31}$$

where:

$$\begin{split} F\_{3} &= (\boldsymbol{\mu} - \boldsymbol{\mu}^{\*})' \left\{ \left( \frac{W}{n\_{y}} + B \right)^{-1} + \left( \frac{W}{n\_{x}} + B \right)^{-1} \right\} (\boldsymbol{\mu} - \boldsymbol{\mu}^{\*}), \\ F\_{4} &= (\bar{\mathbf{y}} - \bar{\mathbf{x}})' \left( \frac{W}{n\_{y}} + \frac{W}{n\_{x}} + 2B \right)^{-1} (\bar{\mathbf{y}} - \bar{\mathbf{x}}), \\ \boldsymbol{\mu}^{\*} &= \left\{ \left( \frac{W}{n\_{y}} + B \right)^{-1} + \left( \frac{W}{n\_{x}} + B \right)^{-1} \right\}^{-1} \left\{ \left( \frac{W}{n\_{y}} + B \right)^{-1} \bar{\mathbf{y}} + \left( \frac{W}{n\_{x}} + B \right)^{-1} \bar{\mathbf{x}} \right\}. \end{split}$$

The mean vector between sources *μ*, the within-source covariance matrix *W*, and the between-source covariance matrix *B* can be estimated using the available background data:

$$
\hat{\boldsymbol{\mu}} = \bar{\mathbf{z}} = \frac{1}{mn} \sum\_{i=1}^{m} \sum\_{j=1}^{n} \mathbf{z}\_{ij}, \tag{3.32}
$$

$$\hat{W} = \frac{1}{m(n-1)} \sum\_{i=1}^{m} \sum\_{j=1}^{n} (\mathbf{z}\_{ij} - \bar{\mathbf{z}}\_{i})(\mathbf{z}\_{ij} - \bar{\mathbf{z}}\_{i})', \tag{3.33}$$

$$\hat{B} = \frac{1}{m-1} \sum\_{i=1}^{m} (\bar{\mathbf{z}}\_{i} - \bar{\mathbf{z}})(\bar{\mathbf{z}}\_{i} - \bar{\mathbf{z}})' - \frac{\hat{W}}{n}, \tag{3.34}$$

where $\bar{\mathbf{z}}\_{i} = \frac{1}{n} \sum\_{j=1}^{n} \mathbf{z}\_{ij}$.
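The computation behind (3.32)–(3.34) can be sketched in a few lines of R for a balanced design. The following function is an illustrative sketch of ours (in the examples below, this role is played by the book's routine two.level.mv.WB):

```r
# Minimal sketch of the estimators (3.32)-(3.34) for a balanced design:
# 'z' is an (m*n) x p matrix of measurements, 'src' the source label of
# each row. Illustrative only; not the book's two.level.mv.WB routine.
estimate.mu.W.B <- function(z, src) {
  m <- length(unique(src))
  n <- nrow(z) / m                      # replicates per source
  zbar <- colMeans(z)                   # overall mean, Eq. (3.32)
  group.means <- apply(z, 2, function(v) tapply(v, src, mean))
  W <- matrix(0, ncol(z), ncol(z))
  for (i in unique(src)) {              # within-source scatter, Eq. (3.33)
    d <- sweep(z[src == i, , drop = FALSE], 2,
               group.means[as.character(i), ])
    W <- W + t(d) %*% d
  }
  W <- W / (m * (n - 1))
  d <- sweep(group.means, 2, zbar)      # between-source part, Eq. (3.34)
  B <- t(d) %*% d / (m - 1) - W / n
  list(mu = zbar, W = W, B = B)
}
```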

> *Example 3.12 (Glass Evidence)* Consider a case in which two glass fragments are recovered on the jacket of an individual who is suspected to be involved in a crime. Two glass fragments are collected at the crime scene for comparative purposes. The competing propositions are:
*H*1: The recovered and the control fragments originate from the same source.

*H*2: The recovered and the control fragments originate from different sources.

For each fragment, three variables are considered: the logarithmic transformation of the ratios *Ca/K*, *Ca/Si*, and *Ca/Fe* (Aitken and Lucy, 2004). Two replicate measurements are available for each fragment. Measurements on the two recovered fragments are

$$\mathbf{y}\_1 = \begin{pmatrix} 3.77379 \\ -0.89063 \\ 2.62038 \end{pmatrix}, \ \mathbf{y}\_2 = \begin{pmatrix} 3.93937 \\ -0.89343 \\ 2.63860 \end{pmatrix}.$$

Measurements on the two control fragments are

$$\mathbf{x}\_{1} = \begin{pmatrix} 3.84396 \\ -0.91010 \\ 2.65437 \end{pmatrix}, \ \mathbf{x}\_{2} = \begin{pmatrix} 3.72493 \\ -0.89811 \\ 2.61933 \end{pmatrix}.$$

Consider the database named glass-data.txt. This database is part of the supplementary material of Aitken and Lucy (2004) and contains *n* = 5 replicate measurements of the elemental concentration of glass fragments from several windows (*m* = 62). The variables of interest (i.e., the logarithmic transformation of the ratios *Ca/K*, *Ca/Si*, and *Ca/Fe*) are displayed in columns 6, 7, and 8, while the object (window) identifier is in column 9.


```
> population=read.table('glass-data.txt',header=TRUE)
> variables=6:8
> grouping.item=9
```
Measurements from the recovered fragments, **y** = *(***y**1*,* **y**2*)*, and measurements from the control fragments, **x** = *(***x**1*,* **x**2*)*, were selected from the available replicate measurements for the first group (window). The first two replicate measurements were selected to act as recovered data, while the last two replicate measurements were selected to act as control data

```
> item=1
> recovered=population[which(population[,grouping.
+ item]==item),][1:2,variables]
> recovered
   logCaK logCaSi logCaFe
1 3.77379 -0.89063 2.62038
2 3.93937 -0.89343 2.63860
> control=population[which(population[,grouping.
+ item]==item),][4:5,variables]
> control
   logCaK logCaSi logCaFe
4 3.72493 -0.89811 2.61933
5 3.66573 -0.89693 2.76393
```
Data concerning measurements from the first window were then excluded from the database

```
> pop.back <- population[-which(population[,grouping.
+ item]==item),]
```
The database named pop.back will serve as background data and can be used to estimate the model parameters *μ*, *W* and *B* as in (3.32), (3.33), and (3.34) by means of the function two.level.mv.WB contained in the routines file two\_level\_functions.r. This file is part of the supplementary materials available on the website of this book (on http://link.springer.com/) and can be run in the R console by inserting the command

```
> source('two_level_functions.r')
```
The mean vector between sources, the within-source covariance matrix, and the between-source covariance matrix can therefore be obtained as follows:

```
> WB <- two.level.mv.WB(pop.back,variables,
+ grouping.item)
> mu <- WB$all.means
> W <- WB$W
> B <- WB$B
> mu
      logCaK logCaSi logCaFe
[1,] 4.20495 -0.7425402 2.770238
> W
               logCaK logCaSi logCaFe
logCaK 1.688046e-02 2.792714e-05 2.783344e-04
logCaSi 2.792714e-05 6.545540e-05 8.362677e-06
logCaFe 2.783344e-04 8.362677e-06 1.294188e-03
> B
              logCaK logCaSi logCaFe
logCaK 0.71485025 0.099343866 -0.047824106
logCaSi 0.09934387 0.062724678 -0.007360187
logCaFe -0.04782411 -0.007360187 0.102438334
```

The Bayes factor can be calculated as the ratio between (3.30) and (3.31) using the function two.level.mvn.BF available in the routines file two\_level\_functions.r. This function is part of the supplementary materials available on the website of this book (on http://link.springer.com/). First, it is necessary to calculate the sample means **y**¯ and **x**¯ and to determine the sample sizes *ny* and *nx*

```
> ybar=as.vector(colMeans(recovered))
> xbar=as.vector(colMeans(control))
> ny=dim(recovered)[1]
> nx=dim(control)[1]
```
The Bayes factor can be obtained as

```
> BF=two.level.mvn.BF(W, B, mu, xbar, ybar, nx, ny)
> BF
[1] 157.6265
```

This Bayes factor represents moderately strong support for the proposition according to which the recovered and the control fragments originate from the same source, rather than from different sources. This is expected because the compared measurements refer to the same fragment.
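For completeness, the computation performed at this step can be sketched directly from (3.30) and (3.31). The function below is an illustrative reimplementation of ours, written from the formulas in the text; it is not the book's two.level.mvn.BF routine, although it should reproduce its value if the routine implements the same formulas:

```r
# Sketch of the BF for normal between-source variability, as the ratio
# of (3.30) to (3.31). Illustrative; not the book's two.level.mvn.BF.
bf.mvn.sketch <- function(W, B, mu, xbar, ybar, nx, ny) {
  wbar <- (ny * ybar + nx * xbar) / (ny + nx)
  F1 <- t(wbar - mu) %*% solve(W / (ny + nx) + B) %*% (wbar - mu)
  F2 <- t(ybar - xbar) %*% solve(W / ny + W / nx) %*% (ybar - xbar)
  P <- solve(W / ny + B); Q <- solve(W / nx + B)
  mustar <- solve(P + Q) %*% (P %*% ybar + Q %*% xbar)
  F3 <- t(mu - mustar) %*% (P + Q) %*% (mu - mustar)
  F4 <- t(ybar - xbar) %*% solve(W / ny + W / nx + 2 * B) %*% (ybar - xbar)
  num <- det(2 * pi * solve((ny + nx) * solve(W) + solve(B)))^0.5 *
    exp(-0.5 * (F1 + F2))                                  # Eq. (3.30)
  den <- det(2 * pi * B)^-0.5 *
    det(2 * pi * solve(ny * solve(W) + solve(B)))^0.5 *
    det(2 * pi * solve(nx * solve(W) + solve(B)))^0.5 *
    exp(-0.5 * (F3 + F4))                                  # Eq. (3.31)
  as.numeric(num / den)
}
```

Note that, by construction, the function returns values above 1 when the two sample means are close relative to the within-source variability, and values below 1 when they are far apart.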

#### **3.4.1.2 Non-normal Distribution for the Between-Source Variability**

The two-level random effect model presented in the previous section is based on the assumption of normality of the between-source variability. However, in many practical applications, observations or measurements do not exhibit (enough) regularity for standard parametric models to be used. For example, a multivariate normal distribution for the mean vector *θ* may be difficult to justify. It can be replaced by a kernel density estimate, which is sensitive to multimodality and skewness, and which may provide a better representation of the available data.

Starting from a database {**z***ij* = *(zij*1*,...,zijp)*; *i* = 1*,...,m* and *j* = 1*, . . . , n*}, the estimate of the probability density for the between-source variability can be obtained as follows:

$$f(\boldsymbol{\theta} \mid \bar{\mathbf{z}}\_1, \dots, \bar{\mathbf{z}}\_m, B, h) = \frac{1}{m} \sum\_{l=1}^m K(\boldsymbol{\theta} \mid \bar{\mathbf{z}}\_l, B, h), \tag{3.35}$$

where the kernel density function *K(θ* | **z**¯*l, B, h)* is taken to be a multivariate normal distribution centered at the group mean **z**¯*l*, with covariance matrix *h*²*B*. The smoothing parameter *h* can be estimated as

$$
\hat{h} = \left(\frac{4}{2p+1}\right)^{\frac{1}{p+4}} m^{-1/(p+4)}.\tag{3.36}
$$

See also Silverman (1986) and Scott (1992).

We first write a function hopt that computes the estimate of the smoothing parameter.

```
> hopt=function(p,m){
+ h=(4/(2*p+1))^(1/(p+4))*m^(-1/(p+4))
+ return(h)}
```
Thus, if the number *p* of variables is set equal to 4 and the number of sources *m* is set equal to 30, the smoothing parameter *h* can be estimated as in (3.36)

```
> p=4
> m=30
> hopt(p,m)
[1] 0.5906593
```
The BF can be obtained as in (3.21), where a multivariate normal distribution is used for the control and the recovered measurements as in (3.22), and a kernel distribution for the between-source variability, as in (3.35). The numerator and the denominator of the BF, *f*1*(***y***,* **x** | *μ,W,B)* and *f*2*(***y***,* **x** | *μ,W,B)*, can be obtained analytically (Aitken and Lucy, 2004). The BF is the ratio between

$$|B|^{1/2}\, m\, h^{p}\, \big|\, n\_{y} W^{-1} + n\_{x} W^{-1} + (h^{2} B)^{-1} \big|^{-1/2} \exp\left\{ -\frac{1}{2} F\_{2} \right\} \sum\_{l=1}^{m} \exp\left\{ -\frac{1}{2} F\_{l} \right\} \tag{3.37}$$

and

$$\begin{split} & \big|\, n\_{y} W^{-1} + (h^{2} B)^{-1} \big|^{-1/2} \sum\_{l=1}^{m} \exp\left\{ -\frac{1}{2} F\_{yl} \right\} \\ & \times \big|\, n\_{x} W^{-1} + (h^{2} B)^{-1} \big|^{-1/2} \sum\_{l=1}^{m} \exp\left\{ -\frac{1}{2} F\_{xl} \right\}, \end{split} \tag{3.38}$$

where:

$$\begin{split} F\_{l} &= (\mathbf{w}^{\*} - \bar{\mathbf{z}}\_{l})' \left\{ \left( n\_{y} W^{-1} + n\_{x} W^{-1} \right)^{-1} + h^{2} B \right\}^{-1} (\mathbf{w}^{\*} - \bar{\mathbf{z}}\_{l}), \\ \mathbf{w}^{\*} &= \left( n\_{y} W^{-1} + n\_{x} W^{-1} \right)^{-1} \left( n\_{y} W^{-1} \bar{\mathbf{y}} + n\_{x} W^{-1} \bar{\mathbf{x}} \right), \\ F\_{yl} &= (\bar{\mathbf{y}} - \bar{\mathbf{z}}\_{l})' \left( \frac{W}{n\_{y}} + h^{2} B \right)^{-1} (\bar{\mathbf{y}} - \bar{\mathbf{z}}\_{l}), \\ F\_{xl} &= (\bar{\mathbf{x}} - \bar{\mathbf{z}}\_{l})' \left( \frac{W}{n\_{x}} + h^{2} B \right)^{-1} (\bar{\mathbf{x}} - \bar{\mathbf{z}}\_{l}). \end{split}$$
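The ratio of (3.37) to (3.38) can likewise be sketched in R. The function below is an illustrative reimplementation of ours, written from the formulas in the text (it mirrors the argument order of the book's two.level.mvk.BF, used later, but it is not that routine):

```r
# Sketch of the kernel-based BF as the ratio of (3.37) to (3.38).
# 'group.means' is the m x p matrix of group means z-bar_l.
# Illustrative only; not the book's two.level.mvk.BF routine.
bf.mvk.sketch <- function(xbar, ybar, nx, ny, W, B, group.means, h) {
  p <- length(ybar); m <- nrow(group.means)
  Winv <- solve(W); hB <- h^2 * B
  V <- solve(ny * Winv + nx * Winv)
  wstar <- as.numeric(V %*% (ny * Winv %*% ybar + nx * Winv %*% xbar))
  F2 <- t(ybar - xbar) %*% solve(W / ny + W / nx) %*% (ybar - xbar)
  ker <- function(d, S) exp(-0.5 * t(d) %*% solve(S) %*% d)
  sums <- rowSums(apply(group.means, 1, function(zl) {
    c(ker(wstar - zl, V + hB),          # terms exp{-F_l / 2}
      ker(ybar - zl, W / ny + hB),      # terms exp{-F_yl / 2}
      ker(xbar - zl, W / nx + hB))      # terms exp{-F_xl / 2}
  }))
  num <- det(B)^0.5 * m * h^p *
    det(ny * Winv + nx * Winv + solve(hB))^-0.5 *
    exp(-0.5 * F2) * sums[1]                               # Eq. (3.37)
  den <- det(ny * Winv + solve(hB))^-0.5 * sums[2] *
    det(nx * Winv + solve(hB))^-0.5 * sums[3]              # Eq. (3.38)
  as.numeric(num / den)
}
```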

*Example 3.13 (Glass Evidence—Continued)* Consider the case examined in Example 3.12, and suppose a kernel distribution is used to model the between-source variability (Aitken and Lucy, 2004). Start from the same database, glass-data.txt, covering *n* replicate measurements of *p* variables for each of *m* = 62 different sources. The smoothing parameter can be estimated using the function hopt, for *p* = 3.

```
> p=3
> m=62
> h=hopt(p,m)
> h
[1] 0.5119462
```
First, the group means **z**¯*<sup>i</sup>* must be obtained. They are an output of the function two.level.mv.WB, previously used to estimate the model parameters.

```
> group.means=WB$group.means
```
Here we show only the first six rows of the *(m* × *p)* matrix, where each row contains a group mean $\bar{\mathbf{z}}\_{i} = \frac{1}{n} \sum\_{j=1}^{n} \mathbf{z}\_{ij}$.

```
> head(group.means)
```


The Bayes factor can then be calculated as the ratio between (3.37) and (3.38) using the function two.level.mvk.BF contained in the routines file two\_level\_functions.r. This function is part of the supplementary materials available on the website of this book (on http://link.springer.com/).

```
> source('two_level_functions.r')
> BF=two.level.mvk.BF(xbar,ybar,nx,ny,W,B,
+ group.means,h)
> BF
[1] 151.6001
```
The Bayes factor represents moderately strong support for the proposition according to which the recovered and the control fragments originate from the same source, rather than from different sources.

A detailed comparison and discussion of the performance of these two multivariate random effect models can be found in Aitken and Lucy (2004). An alternative approach to the kernel density estimation is presented by Franco-Pedroso et al. (2016), modeling the between-source distribution by means of a Gaussian mixture model.

Note that a third level of variability could be considered. In fact, one may wish to model separately the variability between replicate measurements from a given item originating from a given source (e.g., replicate measurements from a glass fragment originating from a given window) and the variability between different items originating from a given source (e.g., different glass fragments originating from the same window). This aspect will be tackled in Sect. 3.4.4 where *three-level* models will be introduced.

#### **3.4.1.3 Non-constant Within-Source Variability**

The two-level random effect models presented in Sects. 3.4.1.1 and 3.4.1.2 are characterized by the assumption of a constant within-source variability. In other words, it was assumed that every single source has the same intra-variability. While for some types of trace evidence this assumption is acceptable (e.g., for measurements of the elemental composition of glass fragments), a constant within-source variation may be more difficult to justify in other forensic domains. Consider, for example, the case of handwriting on questioned documents, where it is largely recognized that intra-variability may vary between writers (Marquis et al., 2006).

Suppose that a handwritten document of unknown source is available for comparative examinations. Handwritten items from a person who is suspected to be the writer are collected and analyzed. Multiple characters are analyzed on the questioned document and on the known writings of the person of interest. The following propositions are defined:

*H*1: The person of interest wrote the questioned document.

*H*2: An unknown person wrote the questioned document.

The distribution of the vector of means within group (source) *θi* is treated as explained in Sect. 3.4.1.1, i.e., *(θi* | *μ,B)* ∼ N*(μ,B)*. An inverse Wishart distribution is chosen to model the uncertainty about the within-group covariance matrix,

$$(W\_{i} \mid \Omega, \nu) \sim W^{-1}(\Omega, \nu), \tag{3.39}$$

where *Ω* is the scale matrix and *ν* is the number of degrees of freedom (Bozza et al., 2008). The scale matrix *Ω* is elicited in such a way that the prior mean of *Wi* is equal to the within-group covariance matrix estimated from the available background data as in (3.33), while *μ* is estimated as in (3.32) and the between-group covariance matrix is estimated as


$$\hat{B} = \frac{1}{m-1} \sum\_{i=1}^{m} n \left(\bar{\mathbf{z}}\_{i} - \bar{\mathbf{z}}\right) \left(\bar{\mathbf{z}}\_{i} - \bar{\mathbf{z}}\right)'.$$

A two-level multivariate random effect model with an inverse Wishart distribution, modeling the uncertainty about the within-source covariance matrix, has also been proposed by Ommen et al. (2017).

First, consider the numerator of the Bayes factor in (3.21). If proposition *H*1 holds, then *θy* = *θx* = *θ* and *Wy* = *Wx* = *W*, and the marginal likelihood is as follows:

$$\begin{split} f(\mathbf{y}, \mathbf{x} \mid H\_{1}) &= f\_{1}(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}, B, \Omega, \nu) \\ &= \int f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\mathbf{x} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) f(W \mid \Omega, \nu) \, d(\boldsymbol{\theta}, W), \end{split} \tag{3.40}$$

where *f (θ* | *μ,B)* is as in (3.25), and

$$f(W \mid \Omega, \nu) = \frac{c \, |\Omega|^{(\nu - p - 1)/2}}{|W|^{\nu/2}} \exp\left\{ -\frac{1}{2} \text{tr}(W^{-1} \Omega) \right\},$$

where *c* is the normalizing constant (e.g., Press, 2005).

If proposition *H*2 holds, then *θy* ≠ *θx* and *Wy* ≠ *Wx*, and the marginal likelihood takes the following form:

$$\begin{split} f(\mathbf{y}, \mathbf{x} \mid H\_{2}) &= f\_{2}(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}, B, \Omega, \nu) \\ &= \int f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta}, W \mid \boldsymbol{\mu}, B, \Omega, \nu) \, d(\boldsymbol{\theta}, W) \\ &\quad \times \int f(\mathbf{x} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta}, W \mid \boldsymbol{\mu}, B, \Omega, \nu) \, d(\boldsymbol{\theta}, W). \end{split} \tag{3.41}$$

The Bayes factor is the ratio between the marginal likelihoods in (3.40) and (3.41). However, these distributions are not available in closed form, as the integrals do not have an analytical solution. Several approaches are available to deal with this problem. Chib (1995) estimates the marginal likelihood *f (***y***,* **x** | *Hi)* by a direct application of Bayes' theorem, since the marginal likelihood can be seen as the normalizing constant of the posterior density *f (θ, W* | **y***,* **x***, Hi)*. The marginal likelihood can therefore be obtained as

$$f(\mathbf{y}, \mathbf{x} \mid H\_{i}) = \frac{f(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\theta}, W) \, f(\boldsymbol{\theta}, W \mid H\_{i})}{f(\boldsymbol{\theta}, W \mid \mathbf{y}, \mathbf{x}, H\_{i})}. \tag{3.42}$$

While the likelihood function *f (***y***,* **x** | *θ,W)* and the prior density *f (θ, W* | *Hi)* can be easily evaluated at any parameter point *(θ* ∗*, W*∗*)*, this is not the case for the posterior density *f (θ, W* | **y***,* **x***, Hi)*, which is not known in closed form. A Gibbs sampling algorithm (Sect. 1.8) can be applied to the set of complete conditional densities *f (θ* | *W,* **y***,* **x***, Hi)* and *f (W* | *θ,* **y***,* **x***, Hi)*, and the posterior density can then be approximated from the output of the Gibbs sampling algorithm as *f*ˆ*(θ, W* | **y***,* **x***, Hi)* (Chib, 1995; Bozza et al., 2008; Aitken et al., 2021).

The marginal likelihood in (3.42) can be estimated at a given parameter point *(θ* ∗*, W*∗*)* as

$$\hat{f}(\mathbf{y}, \mathbf{x} \mid H\_{i}) = \frac{f(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\theta}^{\*}, W^{\*}) \, f(\boldsymbol{\theta}^{\*}, W^{\*} \mid H\_{i})}{f(\boldsymbol{\theta}^{\*}, W^{\*} \mid \mathbf{y}, \mathbf{x}, H\_{i})}.$$

The Bayes factor is then calculated as

$$\text{BF} = \frac{\hat{f}(\mathbf{y}, \mathbf{x} \mid H\_1)}{\hat{f}(\mathbf{y}, \mathbf{x} \mid H\_2)}. \tag{3.43}$$

As mentioned in Sect. 1.8, many other approaches are available, and their efficiency should be studied and compared.
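Although the marginal likelihoods of interest here require MCMC, the identity (3.42) itself can be illustrated on a toy conjugate model, where the posterior is known in closed form and the identity holds exactly at any parameter point. The sketch below is ours and is not connected to the book's routines:

```r
# Chib's identity on a toy model: y_j ~ N(theta, sigma2) with sigma2
# known and theta ~ N(mu0, tau2). The posterior of theta is normal, so
# f(y) = likelihood x prior / posterior can be evaluated exactly.
chib.marginal <- function(y, sigma2, mu0, tau2, theta.star) {
  n <- length(y)
  tau2.n <- 1 / (n / sigma2 + 1 / tau2)            # posterior variance
  mu.n <- tau2.n * (sum(y) / sigma2 + mu0 / tau2)  # posterior mean
  lik <- prod(dnorm(y, theta.star, sqrt(sigma2)))
  prior <- dnorm(theta.star, mu0, sqrt(tau2))
  post <- dnorm(theta.star, mu.n, sqrt(tau2.n))
  lik * prior / post                               # estimate of f(y)
}
```

The result does not depend on the evaluation point theta.star; in the model of this section, the posterior ordinate is the only piece that must be approximated, via the Gibbs sampling output.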

*Example 3.14 (Handwriting Evidence)* Consider a hypothetical case involving a handwritten document. Handwritten items from a person of interest are available for comparative examinations. The propositions of interest are therefore:
*H*1: The person of interest wrote the questioned document.

*H*2: An unknown person wrote the questioned document.

Suppose that *n*<sup>1</sup> = 8 characters of type a are collected from the questioned document and that *n*<sup>2</sup> = 8 characters of the same type are extracted from a document originating from the person of interest, taken for comparative purposes. The contour shape of loops of handwritten characters can be described using a methodology based on Fourier analysis (Marquis et al., 2005, 2006). In brief, the contour shape of each handwritten character loop can be described by means of a set of variables representing the surface and a set of harmonics. Each harmonic corresponds to a specific contribution to the shape and is defined by an amplitude and a phase, the Fourier descriptors.

Consider the database named handwriting.txt available on the book's website. It contains data on *p* = 9 variables (i.e., the surface, the amplitude and the phase of the first four harmonics), measured on several characters of type a collected from *m* = 20 writers. The variables of interest are displayed in columns 2 to 10. Column 1 contains the item (writer) identifier

```
> population=read.table('handwriting.txt',
+ header=TRUE)
> names(population)=c('writer','A0','A1','B1','A2',
+ 'B2','A3','B3','A4','B4')
> variables=2:10
> grouping.item=1
```
In the current example, measurements **y** on the questioned document and measurements **x** on the control document were randomly selected from the available measurements on characters collected from a given writer (i.e., writer no. 1). Starting from a total number of, say, *n* available characters, 2 × *n*1 characters have been selected: the first *n*1 characters serve as recovered data, while the remaining serve as control data

```
> item=1
> base=population[which(population[,grouping.item]
+ ==item),]
> nr=dim(base)[1]
> n1=8
> recovered=as.matrix(base[1:n1,variables])
> control=as.matrix(base[(n1+1):(2*n1),variables])
```
Data concerning measurements from the selected writer were then excluded from the database

```
> pop.back=population[-which(population[,grouping.
+ item]==item),]
```
The database pop.back will serve as background data and can be used to estimate the model parameters as in Bozza et al. (2008) using the function two.level.mv.WB available in the file two\_level\_functions.r.

```
> source('two_level_functions.r')
> WB = two.level.mv.WB(pop.back,variables,
+ grouping.item,nc=TRUE)
> mu = t(WB$all.means)
> W = WB$W
> B = WB$B
```
The number of degrees of freedom *ν* of the inverse Wishart distribution is chosen so as to reduce the variability of this distribution, which is centered at the within-source covariance matrix estimated as in (3.33). Since the prior mean of the inverse Wishart distribution in this parameterization is *Ω/(ν* − 2*p* − 2*)*, setting *Ω* = *W(ν*ˆ − 2*p* − 2*)* centers the prior at the estimated within-source covariance matrix.

```
> p=9
> nu=40
> Omega=W*(nu-2*p-2)
```
The Gibbs sampling algorithm is run over 10000 iterations with a burn-in of 1000.

```
> n.iter=10000
> burn.in=1000
```

The Bayes factor in (3.43) can then be calculated using the function two.level.mvniw.BF that is part of the supplementary materials. Note also that this routine requires other routines that are available in the packages MCMCpack (Martin et al., 2021) and mvtnorm (Genz et al., 2020).

```
> BF=two.level.mvniw.BF(recovered,control,Omega,B,mu,
+ nu,p,n.iter,burn.in)
> BF
[1] 5543330
```

The Bayes factor represents extremely strong support for the proposition according to which the questioned and the recovered handwritten materials originate from the same source, rather than from different sources. A fully documented open-source package (Gaborini, 2019) has been developed by Gaborini (2021).

Note that it is important to critically examine large BF values, such as the one obtained above. For a discussion of extreme values, see Aitken et al. (2021), Hopwood et al. (2012), and Kaye (2009). Moreover, as underlined in Sect. 1.11, the marginal likelihood is highly sensitive to the prior assessments, and so is the BF. In particular, while the overall mean vector and the within- and between-source covariance matrices are estimated from the available background data, the number of degrees of freedom of the inverse Wishart distribution is chosen so as to reduce the dispersion of the prior. A sensitivity analysis may be performed to assess the sensitivity of the BF to different choices of the degrees of freedom *ν* in (3.39).

The BF may also be sensitive to the MCMC approximation. Figure 3.5 provides an illustration of BF variability. Results are based on 50 realizations of the BF approximation in (3.43).

```
> ns=50
> BFs=matrix(0,nrow=ns,ncol=1)
> for(i in 1:ns){
+ BFs[i]=two.level.mvniw.BF(recovered,control,Omega,B,
+ mu,nu,p,n.iter,burn.in)}
> hist(log(BFs),freq=F,main='',xlab='log(BF)')
```
**Fig. 3.5** Histogram of 50 realizations of the BF approximation in (3.43)

The models discussed here rely on the assumption of independence between sources, focusing on the inherent variability of features. In the case of questioned documents (Sect. 3.4.1.3), this amounts to assuming that the handwritten material has been produced without any intention of reproducing someone else's writing style. The possibility of forgery and/or disguise breaks the independence assumption made in the denominator. Section 3.4.3 will address this complication.

#### *3.4.2 Assessment of Method Performance*

The results of the procedures described in the previous sections may be sensitive to changes in the features of recovered and control materials, the available background information, as well as to choices made during probabilistic modeling and prior elicitation. A sensitivity analysis may be conducted in order to gain a better understanding of the properties of the chosen method. It is fundamental to gain an understanding of how well a method performs: if the recovered and control data originate from the same source, the BF is expected to be greater than 1. Vice versa, if the compared items come from different sources, a BF smaller than 1 is expected.

Several methods exist for assessing the performance of procedures for evidence evaluation. Commonly encountered measures in this context are rates of false negatives (i.e., cases in which the Bayes factor is smaller than 1, supporting hypothesis *H*2, when hypothesis *H*1 holds) and false positives (i.e., cases in which the Bayes factor is greater than 1, supporting hypothesis *H*1, when hypothesis *H*2 holds). The rate of false negatives is the number of same-source comparisons with a Bayes factor smaller than 1 divided by the total number of same-source comparisons. The false positive rate is the number of different-source comparisons with a Bayes factor greater than 1 divided by the total number of different-source comparisons. Given a database of cases (e.g., measurements on handwriting characters) for which the source is known, it is possible to study the behavior of the Bayes factor as the data pertaining to control and recovered items change.
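These two rates are straightforward to compute from vectors of Bayes factor values. A minimal sketch, with made-up BF values rather than results from the book:

```r
# False negative rate: proportion of same-source comparisons with BF < 1.
# False positive rate: proportion of different-source comparisons with BF > 1.
fn.rate <- function(bf.same) mean(bf.same < 1)
fp.rate <- function(bf.diff) mean(bf.diff > 1)

# Made-up illustration: 4 same-source and 4 different-source BF values.
fn.rate(c(170, 2.3, 0.8, 1e5))   # one of four below 1 -> 0.25
fp.rate(c(0.1, 0.02, 5, 0.3))    # one of four above 1 -> 0.25
```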

Consider again the questioned document case discussed in Sect. 3.4.1.3. There is variability in handwriting, and the reported Bayes factor is sensitive to variability of the shape of handwritten characters. This is not surprising, as no one writes the same word exactly the same way twice. Consider measurements of features of handwritten characters of a given writer taken from the available database. These measurements are organized into an *(n* × *p)* matrix, where *n* is the number of available handwritten characters and *p* represents the number of features (variables). Denote this matrix base. Suppose that, among the *n* characters, we select a certain number 2 × *n*1 *< n* of characters, forming a group. Repeating this a certain number of times leads to multiple groups. On each member (character) within a group, *p* variables are measured. Then we take pairs of groups (i.e., measurements on the group members), taken to represent recovered and control data. Then, the Bayes factor is calculated for each couple. Here, each couple represents a same-source comparison.

*Example 3.15 (Two-Level Model for Handwriting—Assessment of Model Performance)* Recall Example 3.14 where a total number of 16 characters have been randomly selected from the available characters collected from a given writer (writer no. 1), extracted from the database handwriting.txt. A Bayes factor equal to 5543330 was obtained. If different sets of characters are extracted, the Bayes factor will be influenced (also) by the within-writer variability.

Suppose now that, for the same writer, *ns* = 50 distinct groups of characters (each of size 16) are drawn and split into groups of size 8 to act as questioned and control data. The Bayes factor is calculated for each of the 50 groups. Clearly, since the sampled measurements originate from the same writer, we expect Bayes factors greater than 1.

```
> ns=50
> n=dim(base)[1]
> n1=8
> BFs=matrix(0,nrow=ns,ncol=1)
> for (i in 1:ns){
+ ind=sample(1:n,2*n1,replace=F)
+ recovered=as.matrix(base[ind[1:n1],variables])
+ control=as.matrix(base[ind[(n1+1):length(ind)],
+ variables])
+ BFs[i]=two.level.mvniw.BF(recovered,control,Omega,
+ B,mu,nu,p,n.iter,burn.in)
+ }
```
Figure 3.6 shows a histogram of the results for the *ns* = 50 groups of sampled characters. No false negatives have been observed. The range of the BF values obtained is given here below

```
> range(BFs)
[1] 1.709027e+02 1.438262e+29
```
There is also variability between writers, as no two writers write exactly alike. Consider now measurements of features of handwritten characters from a different writer, say writer no. 6, drawn from the same database. These measurements are stored in a matrix denoted base2.

```
> item2=6
> base2=population[which(population[,grouping.item]==
+ item2),]
> n2=dim(base2)[1]
```
We first estimate the population parameters from the background population where both selected writers have been eliminated.

```
> pop.back=population[-which(population[,grouping
+ .item]==item|population[,grouping.item]==item2),]
> WB = two.level.mv.WB(pop.back,variables,
+ grouping.item,nc=TRUE)
> mu = t(WB$all.means)
> W = WB$W
> B = WB$B
> Omega=W*(nu-2*p-2)
```
Next, for each of the two writers, take 50 groups of characters (from base and base2). Each group contains 8 members, on each of which *p* features are measured. Then, take a group from each writer and form a so-called known different-source pair, and do this multiple times. These draws are taken to represent recovered and control data. Then, the Bayes factor is calculated for each couple.

```
> ns=50
> n=dim(base)[1]
> nc=dim(base2)[1]
> n1=8
```

```
> BFs2=matrix(0,nrow=ns,ncol=1)
> for (i in 1:ns){
+ val.r=sample(1:n,n1)
+ recovered=as.matrix(base[val.r,variables])
+ val.c=sample(1:nc,n1)
+ control=as.matrix(base2[val.c,variables])
+ BFs2[i]=two.level.mvniw.BF(recovered,control,
+ Omega,B,mu,nu,p,n.iter,burn.in)
+ }
```
Figure 3.7 shows a histogram of the results. No false positives have been observed. The range of the BF values obtained is

```
> range(BFs2)
[1] 2.733273e-10 7.034354e-02
```

The variability of BF values for different samples is not surprising because of handwriting variability. However, this should not be understood as there being a Bayes factor distribution. See, e.g., Morrison (2016), Ommen et al. (2016), and Taroni et al. (2016) for a discussion of issues relating to the reporting of the precision of forensic likelihood ratios.

Over the past decade, several other approaches have been proposed in the forensic statistics literature for evaluating the performance of statistical procedures based on a likelihood ratio or a Bayes factor. These methods provide a rigorous approach to assessing and comparing the performance of evaluative methods prior to using them in casework and forensic reporting. See, in particular, Ramos and Gonzalez-Rodriguez (2013) and Ramos et al. (2021) for a methodology to measure the calibration of a set of likelihood ratio values and the concept of Empirical Cross-Entropy for representing performance, illustrated using examples from forensic speech analysis. These concepts are also discussed by Meuwly et al. (2017), who present a guideline for the validation of evaluative methods considering source level propositions. Zadora et al. (2014) present performance assessment for physicochemical data in the context of trace evidence (e.g., glass). For a recent review, see also Chapter 8 of Aitken et al. (2021).

#### *3.4.3 On the Assumption of Independence Under H***<sup>2</sup>**

The models presented in Sect. 3.4.1 are based on the assumption of independence between the questioned and known materials under hypothesis *H*2. This may be reasonable for certain types of evidence and cases, but less so for others. In fact, while a physical feature (e.g., the elemental composition of glass fragments) requires external intervention to be altered, a behavioral or biometric feature such as a signature can be modified intentionally.

Consider handwriting as an example. When evaluating results of comparative handwriting examinations, the case circumstances may be such that there is no issue of handwriting features being disguised or being the result of an attempt to imitate the handwriting of another person. The approach suggested in Sect. 3.4.1.3 may thus be applicable. In turn, in cases of alleged forgery of signatures, the (unknown) writer specifically intends to reproduce features of a target signature. The allegation, then, is that a signature is either simulated or disguised, rather than presenting a correspondence or similarity with a genuine signature by mere chance alone (Linden et al., 2021). In such cases, the Bayes factors previously developed in Sect. 3.4.1 cannot be used to approach the question of interest because the assumption of independence between sources in the denominator cannot be maintained. It follows that one must compute

$$\text{BF} = \frac{f(\mathbf{y} \mid \mathbf{x}, H\_1)}{f(\mathbf{y} \mid \mathbf{x}, H\_2)},\tag{3.44}$$

as *f (***y** | **x***, H*2*)*, following the above argument, does not simplify to *f (***y** | *H*2*)* (see also Sect. 1.5.1).

Consider the following competing propositions:

*H*<sup>1</sup> : The questioned signature is an authentic signature of the person of interest (POI).

*H*<sup>2</sup> : The questioned signature is the product of an attempt by an unknown person to imitate the POI's signature.

If proposition *H*<sup>2</sup> is true, the forensic document examiner has to deal with a signature written by someone who has knowledge of the POI's signature.

Consider the two-level model in Sect. 3.4.1.3, where the distribution of the measurements on the recovered and control data is taken to be normal, with mean vectors *θy* and *θx* and covariance matrices *Wy* and *Wx*:

$$(Y \mid \theta\_{\text{y}}, W\_{\text{y}}) \sim \mathcal{N}(\theta\_{\text{y}}, W\_{\text{y}}) \qquad ; \qquad (X \mid \theta\_{\text{x}}, W\_{\text{x}}) \sim \mathcal{N}(\theta\_{\text{x}}, W\_{\text{x}}) . \tag{3.45}$$

The probability densities at the numerator and denominator of the BF in (3.44) can be obtained as

$$f(\mathbf{y}, \mathbf{x} \mid H\_i) = f\_i(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}\_i, B\_i, \Omega\_i, \nu\_i) = \int f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta}, W \mid \mathbf{x}, \boldsymbol{\mu}\_i, B\_i, \Omega\_i, \nu\_i) \, d(\boldsymbol{\theta}, W), \tag{3.46}$$

where *(μi, Bi)* and *(Ωi, νi)* are the hyperparameters of the prior distributions under the competing propositions (i.e., a normal prior and an inverse Wishart prior distribution). The Bayes factor can thus be calculated as

$$\text{BF} = \frac{f\_1(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}\_1, B\_1, \Omega\_1, \nu\_1)}{f\_2(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\mu}\_2, B\_2, \Omega\_2, \nu\_2)}. \tag{3.47}$$

Two different background databases are needed to inform model parameters under the competing propositions: a database of genuine signatures (**z***ij* ) and a database of simulated signatures (**s***ij* ). Someone who imitates a signature needs to work outside their own writing habits and movement patterns. Thus, simulated signatures do not reflect the same movements and writing features as genuine signatures. The model parameter *μi* can be estimated as in (3.32), and *Bi* as explained in Sect. 3.4.1.3. The scale matrix *Ωi* can be chosen so as to center the prior distribution at the within-group covariance matrix *Wi* that can be estimated as in (3.33).

The probability densities in (3.46) are not available in closed form but can be estimated from the output of an MCMC algorithm following, for example, the ideas described in Sect. 3.4.1.3. A Gibbs sampling algorithm is implemented here. The routine is different from that developed in Sect. 3.4.1.3 because it calculates the BF in (3.47). In this formula, no assumption of independence is made at the denominator, and two different databases are used.
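As a rough, simplified illustration of the idea behind (3.46) — not the book's routine — the predictive density *f (***y** | **x***, Hi)* can be approximated by averaging the likelihood of **y** over posterior draws of *(θ, W)* given **x**. The object names theta.draws and W.draws below are hypothetical placeholders for Gibbs sampler output.

```r
# Monte Carlo estimate of f(y | x, H) as in (3.46): average the normal
# likelihood of y over posterior draws of (theta, W) given x.
# theta.draws: M x p matrix of posterior draws of the mean vector;
# W.draws: list of M posterior draws of the p x p covariance matrix.
# (Hypothetical names; the book's own routine is two.level.mvniw2.BF.)

# Multivariate normal density, written out to keep the sketch self-contained.
dmvn <- function(y, mu, W) {
  p <- length(y)
  d <- y - mu
  drop(exp(-0.5 * t(d) %*% solve(W) %*% d) / sqrt((2 * pi)^p * det(W)))
}

predictive.density <- function(y, theta.draws, W.draws) {
  M <- nrow(theta.draws)
  mean(sapply(1:M, function(m) dmvn(y, theta.draws[m, ], W.draws[[m]])))
}
```

The BF in (3.47) is then the ratio of two such estimates, one using draws conditioned on the genuine-signature model and one on the simulated-signature model.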

*Example 3.16 (Digitally Captured Signatures)* Consider a case involving a questioned signature on a contract signed on a digital tablet. The person of interest denies having signed the contract. Among the multiple features that are captured by the digital tablet, the average speed and writing time are considered here. See Linden et al. (2021) for a detailed description of the experimental conditions. Measurements on the questioned signature are **y** = *(*4639*,* 380*.*42*)*, while measurements on the control signature are **x** = *(*4460*,* 323*.*4787*)*. Note that the first value is the average speed and the second is the writing time.

```
> quest=c(4639,380.42)
> ref=c(4460,323.4787)
```

Model parameters under hypothesis *H*<sup>1</sup> (i.e., the mean vector *μ*1, the within-group covariance matrix *W*1, and the between-group covariance matrix *B*1) are estimated from an available database of genuine signatures (**z***ij* ) and are given below.

```
> mug=matrix(c(2754.767,511.284),ncol=1)
> Wg=matrix(c(95755.861,-4214.939,-4214.939,
+ 2857.975),byrow=T,nrow=2)
> Bg=matrix(c(3377136,30548.24,30548.24,20335.10),
+ byrow=T,nrow=2)
```
The scale matrix of the inverse Wishart distribution is then obtained as

```
> p=2
> nu=10
> Omegag=Wg*(nu-2*p-2)
```
In the same way, model parameters under hypothesis *H*<sup>2</sup> are estimated from an available database of simulated signatures (**s***ij* ) and are given below.

```
> Omegas=Ws*(nu-2*p-2)
```
A Gibbs sampling algorithm is run over 10000 iterations, with a burn-in of 1000.

```
> n.iter=10000
> burn.in=1000
```
The Bayes factor in (3.44) can then be calculated using the function two.level.mvniw2.BF (see supplementary materials).

```
> source('two_level_functions.r')
> BF=two.level.mvniw2.BF(quest,ref,Wg,Bg,mug,Ws,Bs,
+ mus,nu,p,n.iter,burn.in)
> BF
[1] 40846.87
```
The BF represents very strong support for the proposition according to which the questioned signature originates from the person of interest rather than from an unknown person who attempted to imitate the target signature.

#### *3.4.4 Three-Level Models*

So far, two-level models have been considered, taking into account the within-source and the between-source variability. However, it is not uncommon to encounter situations in which the hierarchical ordering shows an additional level of variability, e.g., in relation to measurement error.

Denote again by *p* the number of variables observed on items of a given evidential type. Suppose that continuous measurements of these variables are available on a random sample from *m* sources with *s* items for each source and *n* replicate measurements on each of the *N* = *ms* items. The background data can be denoted by **z***ikj* = *(zikj*1*,...,zikjp)* , where *i* = 1*,...,m* denotes the number of sources (e.g., windows, writers), *k* = 1*,...,s* denotes the number of items for each source (e.g., glass fragments, handwritten characters), and *j* = 1*,...,n* denotes the number of replicate measurements for each item.

A Bayesian statistical model for the evaluation of evidence for three-level normally distributed multivariate data was proposed by Aitken et al. (2006), focusing on the elemental composition of glass fragments. Denote the mean vector within item *k* in group *i* as *θik* and the covariance matrix of replicate measurements as *W*. For the variability of replicate measurements, the distribution of **Z***ikj* is taken to be normal, **Z***ikj* ∼ N*(θik,W)*.

Denote by *μi* the mean vector within group *i* and by *B* the within-group covariance matrix. The distribution of *θik* for the within-group variability is taken to be normal, *θik* ∼ N*(μi,B)*.

Denote by *φ* the mean vector between groups. Let *V* denote the between-group covariance matrix. For the between-group variability, the distribution of the *μi* is taken to be normal, *μi* ∼ N*(φ,V)*.

Consider the case described in Sect. 3.4.1, where measurements are available on *ny* items from an unknown origin as well as measurements on *nx* items from a known origin. These two groups of items may or may not come from the same source. Competing propositions may be formulated as follows:

*H*<sup>1</sup> : The recovered and the control items originate from the same source.

*H*<sup>2</sup> : The recovered and the control items originate from different sources.

There are *n*<sup>1</sup> replicate measurements available on each of the recovered *ny* items. Denote the measurement vector by **y**, where the vector components are denoted by **y***kj* (for *k* = 1*,...,ny* and *j* = 1*,...,n*1*)* and **y***kj* = *(ykj*1*,...,ykjp)* . For each of the *nx* control items, *n*<sup>2</sup> replicate measurements are available. Denote the measurement vector by **x**, where the vector components are denoted *(***x***kj , k* = 1*,...,nx* and *j* = 1*,...,n*2*)* and **x***kj* = *(xkj*1*,...,xkjp)* .

The Bayes factor is the ratio of two probability densities of the form *f (***y***,* **x** | *Hi)* = *fi(***y***,* **x** | *φ,W,B,V )*, *i* = 1*,* 2. The probability density in the numerator is given by

$$f\_1(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\phi}, W, B, V) = \int \int f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\mathbf{x} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) f(\boldsymbol{\mu} \mid \boldsymbol{\phi}, V) \, d\boldsymbol{\mu} \, d\boldsymbol{\theta}, \tag{3.48}$$

where all probability densities are multivariate normal.

In the denominator, the probability density is given by

$$\begin{split} f\_2(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\phi}, W, B, V) &= \int \int f(\mathbf{y} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) f(\boldsymbol{\mu} \mid \boldsymbol{\phi}, V) \, d\boldsymbol{\mu} \, d\boldsymbol{\theta} \\ &\times \int \int f(\mathbf{x} \mid \boldsymbol{\theta}, W) f(\boldsymbol{\theta} \mid \boldsymbol{\mu}, B) f(\boldsymbol{\mu} \mid \boldsymbol{\phi}, V) \, d\boldsymbol{\mu} \, d\boldsymbol{\theta}, \end{split} \tag{3.49}$$

where all probability densities are multivariate normal.

As shown by Aitken et al. (2006), the value of the evidence is the ratio of

$$|B + V|^{1/2} \left| (n\_y n\_1 + n\_x n\_2) W^{-1} + (B + V)^{-1} \right|^{-1/2} \exp\left\{ -\frac{1}{2}(F\_1 + F\_2) \right\} \tag{3.50}$$

to

$$\left| n\_y n\_1 W^{-1} + (B + V)^{-1} \right|^{-1/2} \left| n\_x n\_2 W^{-1} + (B + V)^{-1} \right|^{-1/2} \exp\left\{ -\frac{1}{2}(F\_3 + F\_4) \right\}, \tag{3.51}$$

where:

$$\begin{aligned} F\_1 &= (\bar{\mathbf{y}} - \bar{\mathbf{x}})' \left( \frac{n\_y n\_1 n\_x n\_2 \, W^{-1}}{n\_y n\_1 + n\_x n\_2} \right) (\bar{\mathbf{y}} - \bar{\mathbf{x}}), \\ F\_2 &= (\bar{\mathbf{w}} - \boldsymbol{\phi})' \left[ (n\_y n\_1 + n\_x n\_2)^{-1} W + B + V \right]^{-1} (\bar{\mathbf{w}} - \boldsymbol{\phi}), \\ F\_3 &= (\bar{\mathbf{y}} - \boldsymbol{\phi})' \left[ (n\_y n\_1)^{-1} W + B + V \right]^{-1} (\bar{\mathbf{y}} - \boldsymbol{\phi}), \\ F\_4 &= (\bar{\mathbf{x}} - \boldsymbol{\phi})' \left[ (n\_x n\_2)^{-1} W + B + V \right]^{-1} (\bar{\mathbf{x}} - \boldsymbol{\phi}), \end{aligned}$$

and $\bar{\mathbf{w}} = (n\_y n\_1 \bar{\mathbf{y}} + n\_x n\_2 \bar{\mathbf{x}})/(n\_y n\_1 + n\_x n\_2)$.
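Under these definitions, the ratio of (3.50) to (3.51) can be coded directly. The sketch below is a plain-R illustration, not the book's three.level.mvn.BF routine from the supplementary materials; argument names mirror those used later in Example 3.17.

```r
# Bayes factor for the three-level normal model: ratio of (3.50) to (3.51).
# bary, barx: mean vectors of recovered and control measurements;
# phi, W, B, V: overall mean and covariance components estimated from
# background data. (A sketch; the book provides three.level.mvn.BF.)
bf.three.level <- function(bary, barx, ny, nx, n1, n2, phi, W, B, V) {
  a <- ny * n1
  b <- nx * n2
  barw <- (a * bary + b * barx) / (a + b)
  Wi <- solve(W)
  BV <- B + V
  BVi <- solve(BV)
  F1 <- t(bary - barx) %*% (a * b / (a + b) * Wi) %*% (bary - barx)
  F2 <- t(barw - phi) %*% solve(W / (a + b) + BV) %*% (barw - phi)
  F3 <- t(bary - phi) %*% solve(W / a + BV) %*% (bary - phi)
  F4 <- t(barx - phi) %*% solve(W / b + BV) %*% (barx - phi)
  num <- sqrt(det(BV)) * det((a + b) * Wi + BVi)^(-1 / 2) *
    exp(-0.5 * (F1 + F2))
  den <- det(a * Wi + BVi)^(-1 / 2) * det(b * Wi + BVi)^(-1 / 2) *
    exp(-0.5 * (F3 + F4))
  drop(num / den)
}
```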

The overall mean *φ*, the measurement error covariance matrix *W*, the within-group covariance matrix *B*, and the between-group covariance matrix *V* can be estimated using the available background data:

$$\hat{\boldsymbol{\Phi}} = \frac{1}{m} \frac{1}{s} \frac{1}{n} \sum\_{i=1}^{m} \sum\_{k=1}^{s} \sum\_{j=1}^{n} \mathbf{z}\_{ikj},\tag{3.52}$$

$$\hat{W} = \frac{1}{ms(n-1)} \sum\_{i=1}^{m} \sum\_{k=1}^{s} \sum\_{j=1}^{n} (\mathbf{z}\_{ikj} - \bar{\mathbf{z}}\_{ik.})(\mathbf{z}\_{ikj} - \bar{\mathbf{z}}\_{ik.})', \tag{3.53}$$

$$\hat{B} = \frac{1}{m(s-1)} \sum\_{i=1}^{m} \sum\_{k=1}^{s} (\bar{\mathbf{z}}\_{ik.} - \bar{\mathbf{z}}\_{i..})(\bar{\mathbf{z}}\_{ik.} - \bar{\mathbf{z}}\_{i..})' - \frac{\hat{W}}{n}, \tag{3.54}$$

$$\hat{V} = \frac{1}{m-1} \sum\_{i=1}^{m} (\bar{\mathbf{z}}\_{i..} - \bar{\mathbf{z}}\_{...})(\bar{\mathbf{z}}\_{i..} - \bar{\mathbf{z}}\_{...})' - \frac{\hat{B}}{s} - \frac{\hat{W}}{sn}, \tag{3.55}$$

where $\bar{\mathbf{z}}\_{ik.} = \frac{1}{n} \sum\_{j=1}^{n} \mathbf{z}\_{ikj}$, $\bar{\mathbf{z}}\_{i..} = \frac{1}{s} \sum\_{k=1}^{s} \bar{\mathbf{z}}\_{ik.}$, and $\bar{\mathbf{z}}\_{...} = \frac{1}{m} \sum\_{i=1}^{m} \bar{\mathbf{z}}\_{i..}$.
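For a balanced design, the estimators (3.52)–(3.55) translate into a few lines of R. The sketch below assumes the background data have been arranged into an *m* × *s* × *n* × *p* array; the book's own implementation is the function three.level.mv.WBV used in Example 3.17.

```r
# Moment estimators (3.52)-(3.55) for the three-level normal model.
# z: m x s x n x p array (sources x items x replicates x variables).
# Returns phi (3.52), W (3.53), B (3.54), and V (3.55). A sketch assuming
# a balanced design; not the book's three.level.mv.WBV routine.
estimate.WBV <- function(z) {
  m <- dim(z)[1]; s <- dim(z)[2]; n <- dim(z)[3]; p <- dim(z)[4]
  zbar.ik <- apply(z, c(1, 2, 4), mean)    # item means, m x s x p
  zbar.i  <- apply(zbar.ik, c(1, 3), mean) # source means, m x p
  zbar    <- colMeans(zbar.i)              # overall mean (3.52)
  W <- B <- V <- matrix(0, p, p)
  for (i in 1:m) {
    for (k in 1:s) {
      for (j in 1:n) {
        d <- z[i, k, j, ] - zbar.ik[i, k, ]
        W <- W + tcrossprod(d)             # replicate scatter
      }
      e <- zbar.ik[i, k, ] - zbar.i[i, ]
      B <- B + tcrossprod(e)               # item scatter
    }
    g <- zbar.i[i, ] - zbar
    V <- V + tcrossprod(g)                 # source scatter
  }
  W <- W / (m * s * (n - 1))
  B <- B / (m * (s - 1)) - W / n
  V <- V / (m - 1) - B / s - W / (s * n)
  list(phi = zbar, W = W, B = B, V = V)
}
```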

*Example 3.17 (Glass Evidence—Continued)* Consider again the case described in Example 3.12 where two glass fragments are recovered on the jacket of an individual who is suspected to be involved in a crime. Two glass fragments are collected at the crime scene for comparative purposes. The competing propositions are:

*H*<sup>1</sup> : The recovered and the control fragments originate from the same source.

*H*<sup>2</sup> : The recovered and the control fragments originate from different sources.
A database named glass-database.txt is available as part of the supplementary material of Zadora et al. (2014). It contains measurements of the elemental concentration of glass fragments from several windows (*m* = 200). For each source, there are *s* = 12 fragments with *n* = 3 replicate measurements. For each fragment, five variables are considered: the logarithmic transformation of the ratios *Na/O*, *Mg/O*, *Al/O*, *Si/O*, *Ca/O*. The variables of interest are displayed in columns 3*,* 4*,* 5*,* 6, and 8, while the object (window) identifier is in column 1. The fragment identifier is in column 2.


Three replicate measurements are available for each fragment. Using the notation introduced above, *m* = 200, *s* = 12, and *n* = 3.


Measurements for the recovered fragments, **y**, and measurements for the control fragments, **x**, were selected from the available data for the first and second group (window) and the first two items (fragments) from these windows. Therefore, a BF smaller than 1 is expected.

```
> recovered.item=1
> recovered
  fragment  logNaO  logMgO  logAlO  logSiO
1        1 -0.6603 -1.4683 -1.4683 -0.1463
2        1 -0.6658 -1.4705 -1.4814 -0.1429
3        1 -0.6560 -1.4523 -1.4789 -0.1477
4        2 -0.6309 -1.4707 -1.5121 -0.1823
5        2 -0.6332 -1.4516 -1.4996 -0.1792
6        2 -0.6315 -1.4641 -1.4883 -0.1710
   logCaO
1 -1.1096
2 -1.1115
3 -1.1118
4 -1.1306
5 -1.1332
6 -1.1291
> control=base_c[which(base_c[,grouping.fragment]==1|
+ base_c[,grouping.fragment]==2),c(2,variables)]
> control
   fragment  logNaO  logMgO  logAlO  logSiO
13        1 -0.6231 -1.3641 -1.6540 -0.0964
14        1 -0.6122 -1.3589 -1.6622 -0.0886
15        1 -0.6108 -1.3742 -1.6935 -0.1205
16        2 -0.6135 -1.3686 -1.7202 -0.1381
17        2 -0.6205 -1.3844 -1.6831 -0.1273
18        2 -0.6204 -1.3692 -1.7269 -0.1199
    logCaO
13 -0.9993
14 -0.9836
15 -1.0524
16 -1.0830
17 -1.0721
18 -1.0392
```

Next, the means of measurements **y**¯ , **x**¯ , and **w**¯ are obtained.

```
> bary=colMeans(recovered[,-1])
> barx=colMeans(control[,-1])
> barw=colMeans(rbind(recovered,control)[,-1])
```


Data concerning measurements from the first two windows are then excluded from the database:

```
> pop.back <- population[-which(population[,
+ grouping.item]==1|population[,grouping.item]==2),]
```
The database named pop.back will serve as background data. It can be used to estimate the model parameters *φ*, *W*, *B*, and *V* as in (3.52), (3.53), (3.54) and (3.55) by means of the function three.level.mv.WBV contained in the routines file three\_level\_functions.r. This file is part of the supplementary materials available on the book's website and can be run in the R console with the command

```
> source('three_level_functions.r')
```
The overall mean, the measurement error covariance matrix, the within-source covariance matrix, and the between-source covariance matrix can be estimated as follows:

```
> WBV=three.level.mv.WBV(pop.back,variables,
+ grouping.item,grouping.fragment)
> psi=WBV$overall.means
```

The Bayes factor can be calculated as the ratio between (3.50) and (3.51) using the function three.level.mvn.BF available in the routines file three\_level\_functions.r. This function is part of the supplementary materials available on the book's website.

```
> BF=three.level.mvn.BF(bary,barx,barw,ny,nx,n1,n2,
+ psi,W,B,V)
> BF
[1] 0.000083299
```

The Bayes factor represents extremely strong support for the proposition according to which the recovered and the control fragments originate from different sources, rather than from the same source.

Note that the above development does not take into account the topic of variable selection. See Aitken et al. (2006) for a proposal for dimensionality reduction based on a probabilistic structure, determined by a graphical model obtained from a scaled inverse covariance matrix.

#### **3.5 Summary of R Functions**

The R functions outlined below have been used in this chapter.

#### **Functions Available in the Base Package**

colMeans: Forms column means for numeric arrays (or data frames)

d <name of distribution>, p <name of distribution> (e.g., dpois, pnorm): Calculate the density and the cumulative probability for many parametric distributions.

More details can be found in the Help menu, help.start().

#### **Functions Available in Other Packages**

dinvgamma in package extraDistr: calculates the density of an inverse gamma distribution.

dstp in package LaplacesDemon: calculates the density of a non-central Student t distribution.

#### **Functions Developed in the Chapter**

hopt: Calculates the estimate *h*ˆ of the smoothing parameter *h*. *Usage*: hopt(p,m). *Arguments*: p, the number of variables; m, the number of sources. *Output*: A scalar value.

poisg: Computes the density of a Poisson–gamma distribution Pg*(α, β,* 1*)* at *x*. *Usage*: poisg(a,b,x).

*Arguments*: a, the shape parameter *α*; b, the rate parameter *β*; x, a scalar value *x*. *Output*: A scalar value.

post\_distr: Computes the posterior distribution N*(μx, τ²x)* of a normal mean *θ*, with *X* ∼ N*(θ, σ²)* and *θ* ∼ N*(μ, τ²)*.

*Usage*: post\_distr(sigma,n,barx,pm,pv).
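For reference, the conjugate update behind post\_distr can be sketched as follows. The internals shown here are the standard normal–normal update and an assumption about the implementation, with sigma taken to be the known variance *σ*².

```r
# Sketch of the normal-normal conjugate update behind post_distr:
# X ~ N(theta, sigma) with sigma the known variance, prior theta ~ N(pm, pv),
# and n observations with sample mean barx. Returns c(posterior mean,
# posterior variance). (An assumption about the internals, not the book's code.)
post_distr_sketch <- function(sigma, n, barx, pm, pv) {
  post.var <- 1 / (1 / pv + n / sigma)
  post.mean <- post.var * (pm / pv + n * barx / sigma)
  c(post.mean, post.var)
}
```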


taining the column indices of the variables to be used; grouping.variable, a scalar specifying the variable that is to be used as the grouping factor. By default (nc = FALSE), the between-group covariance matrix is estimated as in Sect. 3.4.1.1. If nc = TRUE, the between-group covariance matrix is estimated as in Sect. 3.4.1.3.


*Output*: A scalar value.


*Output*: A scalar value.


measurements; ny, the number of recovered items; nx, the number of control items; n1, the number of replicate measurements on each of the recovered items; n2, the number of replicate measurements on each of the control items; psi, the overall mean vector; W, the replicate measurements covariance matrix; B, the within-group covariance matrix; V, the between-source covariance matrix. *Output*: A scalar value.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 Bayes Factor for Investigative Purposes**

#### **4.1 Introduction**

Forensic laboratories routinely face the problem of classifying items or individuals into one of several classes or populations on the basis of available data (e.g., measurements of one or more attributes), when no control material is available for comparison. As discussed in Sect. 1.6, forensic analyses can provide valuable information regarding the category membership of a particular item. For example, it may be of interest to classify banknotes seized from a person of interest as either banknotes from general circulation or banknotes related to drug trafficking (Wilson et al., 2014). The collected material is analyzed (e.g., the degree of contamination with cocaine is measured), and results are evaluated in terms of their effect on the odds in favor of a proposition *H*<sup>1</sup> according to which the recovered items originate from a given population (e.g., banknotes in general circulation), compared to an alternative proposition *H*<sup>2</sup> according to which the recovered items originate from another population (e.g., banknotes related to drug trafficking).

An assumption made throughout this chapter is that there is a finite number of populations to which an item of interest may belong. Each population will be characterized by a member from a family of probability distributions. Data can be either discrete or continuous, though for the latter it is easier to find examples and applications. There are many instances where the scientific evidence is described by several variables, and available measurements take the form of multivariate data. As mentioned in Sect. 3.1, data do not always present enough regularity so that standard parametric distributions could be used (e.g., the normal model). Moreover, data may present a complex dependence structure with several levels of variation.

**Supplementary Information** The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-09839-0\_4. The files can be accessed individually by clicking the DOI link in the accompanying figure caption or by scanning this link with the SN More Media App.

This chapter is structured as follows. Sections 4.2 and 4.3 address the problem of classification for various types of discrete and continuous data, respectively. Section 4.4 presents an extension to continuous multivariate data. Note that most of the examples developed in this chapter involve only two populations. An extension to more than two propositions is given in Sect. 4.2.2.

#### **4.2 Discrete Data**

This section deals with measurement results in the form of counts, using the binomial model (Sect. 4.2.1) and the multinomial model (Sect. 4.2.2).

#### *4.2.1 Binomial Model*

Imagine a case in which the issue is the quality of a consignment of Basmati rice. Basmati is a rice variety originating from the Indian subcontinent that has become valuable in international trade over recent decades. This has prompted the cultivation of high-yielding Basmati derivatives. Traditional and evolved (non-traditional) varieties, however, have distinct characteristics (e.g., Kamath et al., 2008), and distinguishing between varieties may be a relevant analytical task. Given a batch of Basmati rice of unknown type, the following pair of propositions may be of interest:

*H*<sup>1</sup> : The batch is of a traditional Basmati variety.

*H*<sup>2</sup> : The batch is of a non-traditional Basmati variety.
Denote by *θ*<sup>1</sup> and *θ*<sup>2</sup> the proportion of chalky grains in the two populations, respectively. Available counts can be treated as realizations of Bernoulli trials (Sect. 2.2.1) with constant probability of success *θ*<sup>1</sup> (*θ*2). Suppose a conjugate beta prior distribution Be*(αi, βi)* is used to model uncertainty about *θi*, where *αi* and *βi* can be elicited using the available background knowledge (as in Sect. 1.10).

Among several characteristics of interest, such as grain length, thickness, weight, etc., is the percentage of chalky grains, determined by counting the number of grains having a chalky area. A sample of size *n* is inspected, and a total number *y* of chalky grains is observed. This can be treated as a realization of a binomial distribution Bin*(n, θ )*.

The marginal distribution at the numerator and denominator can be computed as in (1.25):

$$f\_{H\_i}(\mathbf{y}) = \binom{n}{\mathbf{y}} \frac{\Gamma(\alpha\_i + \beta\_i)\Gamma(\alpha\_i + \mathbf{y})\Gamma(\beta\_i + n - \mathbf{y})}{\Gamma(\alpha\_i)\Gamma(\beta\_i)\Gamma(\alpha\_i + n + \beta\_i)}.$$

This is a beta-binomial distribution with parameters *n*, *αi*, and *βi*. The Bayes factor in favor of proposition *H*<sup>1</sup> can be computed as in (1.26) and becomes

$$\frac{f\_{H\_1}(\mathbf{y})}{f\_{H\_2}(\mathbf{y})} = \frac{\Gamma(\alpha\_1 + \beta\_1)\Gamma(\alpha\_1 + \mathbf{y})\Gamma(\beta\_1 + n - \mathbf{y})\Gamma(\alpha\_2)\Gamma(\beta\_2)\Gamma(\alpha\_2 + n + \beta\_2)}{\Gamma(\alpha\_2 + \beta\_2)\Gamma(\alpha\_2 + \mathbf{y})\Gamma(\beta\_2 + n - \mathbf{y})\Gamma(\alpha\_1)\Gamma(\beta\_1)\Gamma(\alpha\_1 + n + \beta\_1)}. \tag{4.1}$$
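Since the binomial coefficient cancels in the ratio, (4.1) can also be computed directly on the log scale with lgamma(), which avoids numerical overflow of the gamma functions for large *n*. The function below is a direct transcription of (4.1), equivalent to the dbbinom ratio used in Example 4.1.

```r
# Bayes factor (4.1) for the beta-binomial model, computed on the log scale.
# y: observed count; n: sample size; (a1, b1), (a2, b2): beta prior
# parameters under H1 and H2. The binomial coefficient cancels in the ratio.
bf.betabin <- function(y, n, a1, b1, a2, b2) {
  logmarg <- function(a, b) {
    lgamma(a + b) + lgamma(a + y) + lgamma(b + n - y) -
      lgamma(a) - lgamma(b) - lgamma(a + n + b)
  }
  exp(logmarg(a1, b1) - logmarg(a2, b2))
}
```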

*Example 4.1 (Basmati Rice)* Consider a case where 500 rice grains are examined and a total of 200 chalky grains are counted.

```
> n=500
> y=200
```
Suppose that the prior distribution for the proportion *θ*<sup>1</sup> of chalky grains in traditional varieties can be centered at 0.51 with a standard deviation equal to 0.19, while the proportion *θ*<sup>2</sup> of chalky grains in non-traditional varieties can be centered at 0.39 with a standard deviation equal to 0.31. The prior parameters *(αi, βi)* can be elicited as in (1.38) and (1.39).

```
> m1=0.51
> s1=0.19
> m2=0.39
> s2=0.31
```
We first write a function beta\_prior that computes the prior parameters *αi* and *βi* according to (1.38) and (1.39).

```
> beta_prior=function(m,v){
+ a=m*(m*(1-m)/v-1)
+ b=(1-m)*(m*(1-m)/v-1)
+ return(c(a,b))}
```
The hyperparameters of the two beta distributions, say *α*1*, β*1*, α*2, and *β*<sup>2</sup> can then be obtained straightforwardly as

```
> ab1=beta_prior(m1,s1^2)
> ab2=beta_prior(m2,s2^2)
```
The beta-binomial distribution can be calculated straightforwardly using the function dbbinom that is available in the package extraDistr (Wolodzko, 2020).

```
> library(extraDistr)
> BF=dbbinom(y,n,ab1[1],ab1[2])/dbbinom(y,n,ab2[1],
+ ab2[2])
> BF
[1] 2.009102
```
The Bayes factor provides weak support for the hypothesis that the rice type is traditional rather than non-traditional.

#### *4.2.2 Multinomial Model*

The physical and chemical analysis of gunshot residues (GSR) is a well-established field within forensic science. GSR are commonly analyzed to help with issues regarding the distance of firing and alleged activities of persons in incidents involving the use of firearms. A study by Brozek-Mucha and Jankowicz (2001) focused on the use of GSR for discriminating between a selected number of case types (i.e., particular combinations of weapon and ammunition). The authors conducted experiments using six categories, each consisting of a specific combination of weapon and ammunition, called categories A to F. Note that the aim here is not to infer a particular weapon and ammunition as the source of recovered GSR of unknown source. The purpose is only to provide assistance in discriminating between well-defined case types (i.e., categories).

Consider the following pair of competing propositions:


Denote by *θ*1*<sup>j</sup>* and *θ*2*<sup>j</sup>* the proportion of particles in given chemical classes, *j* = 1*,...,k*, characterizing categories D (i.e., category 1) and E (i.e., category 2). The number *n*1*,...,nk* of particles pertaining to distinct chemical classes 1*,...,k*, i.e., the chemical classes PbSbBa, PbSb, SbBa, Sb(Sn), Pb, and PbSnPb as specified in Brozek-Mucha and Jankowicz (2001), can be treated as a realization of a multinomial distribution *f (n*1*,...,nk* | *θi*1*,...,θik)*, *i* = 1*,* 2. A conjugate Dirichlet prior probability distribution *f (θi*1*,...,θik* | *αi*1*,...,αik)* can be considered for modeling uncertainty about the proportions *θij* , *i* = 1*,* 2 (Sect. 3.2.2).

The marginal distribution at the numerator and the denominator of the Bayes factor in (1.26) can be computed as in (1.25) and becomes

$$f\_{H\_i}(n\_1, \ldots, n\_k \mid \alpha\_{i1}, \ldots, \alpha\_{ik}) = \frac{\Gamma(\alpha\_i)\Gamma(n+1)}{\Gamma(n+\alpha\_i)} \prod\_{j=1}^{k} \frac{\Gamma(n\_j + \alpha\_{ij})}{\Gamma(\alpha\_{ij})\Gamma(n\_j + 1)},$$

where $\alpha\_i = \sum\_{j=1}^{k} \alpha\_{ij}$ and $n = \sum\_{j=1}^{k} n\_j$. This is a Dirichlet-multinomial distribution with parameters *n* and *αi*1*,...,αik*.
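The Dirichlet-multinomial marginal above is conveniently evaluated on the log scale. The short function below is a direct transcription of the formula, given here as an illustration rather than as a routine from the book's supplementary materials.

```r
# Log marginal likelihood of counts nvec under a multinomial sampling model
# with a Dirichlet(alpha) prior (Dirichlet-multinomial distribution).
ldirmult <- function(nvec, alpha) {
  n <- sum(nvec)
  a <- sum(alpha)
  lgamma(a) + lgamma(n + 1) - lgamma(n + a) +
    sum(lgamma(nvec + alpha) - lgamma(alpha) - lgamma(nvec + 1))
}
# The BF in favor of the first category is then
# exp(ldirmult(nvec, alpha1) - ldirmult(nvec, alpha2)).
```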

From a decision-theoretic point of view, the questioned items can be classified in category D (decision *d*1) whenever

$$\text{BF} \, > \frac{l\_1/l\_2}{\pi\_1/\pi\_2},\tag{4.2}$$

where *l*<sup>1</sup> (*l*2) represents the loss incurred when decision *d*<sup>1</sup> (*d*2) is erroneous, and a "0 − *li*" loss function is chosen (Sect. 1.9 and Table 1.4), while *π*1*/π*<sup>2</sup> is the prior odds in favor of *H*1.

It may be objected that the values for *l*<sup>1</sup> and *l*<sup>2</sup> are difficult to assess. However, what really matters is the ratio *k* of the actual values, with *l*<sup>1</sup> = *k* · *l*<sup>2</sup>. Note that, for *k* ≠ 1, this is an asymmetric loss function. In this way, starting from prior odds equal to 1, the criterion in (4.2) may be rewritten as follows:

$$\text{BF} > k.\tag{4.3}$$

Stated otherwise, whenever the competing hypotheses are considered equally probable a priori, the decision *d*<sup>1</sup> will be optimal if BF *> k*, that is, if wrongly deciding *d*<sup>1</sup> (i.e., when *H*<sup>2</sup> holds) is less than BF times worse than wrongly deciding *d*<sup>2</sup> (i.e., when *H*<sup>1</sup> holds). Clearly, the prior odds need not be equal to 1, and the criterion can be adapted accordingly.
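The decision criterion can be wrapped in a one-line helper; the function name and defaults here are illustrative.

```r
# Decision rule (4.2): decide d1 (classify under H1) when the BF exceeds
# the ratio of losses divided by the prior odds; with equal prior odds this
# reduces to the criterion BF > k of (4.3), where k = l1/l2.
decide <- function(BF, l1 = 1, l2 = 1, prior.odds = 1) {
  if (BF > (l1 / l2) / prior.odds) "d1" else "d2"
}
```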

#### **4.2.2.1 Choosing the Parameters of the Dirichlet Prior**

The problem of how to elicit a prior probability distribution about a proportion has been discussed in Sect. 1.10. In the type of case considered here, an analyst will face the problem of eliciting a prior opinion about a set of proportions, assuming that the subjective prior distribution is chosen from the family of Dirichlet distributions.

There are various options for the hyperparameters *αi*1*,...,αik*, characterizing the prior probability distribution on the proportions *θi*1*,...,θik*. One is the uniform prior probability distribution, with *αij* = 1, *j* = 1*,...,k*. Whenever further information is available in terms of the number of outcomes in the distinct categories, e.g., *xi*1*,...,xik*, the hyperparameters *αij* can be updated to *αij* + *xij* .

There are cases, however, where the analyst is able to specify a non-uniform prior probability distribution about the proportions. Following the methodology illustrated in Zapata-Vazquez et al. (2014), the prior probability distribution about a set of proportions *θi*1*,...,θik* can be elicited using tools available in the package SHELF (Oakley, 2008). The user is only asked to provide a lower (e.g., 0.25), a median, and an upper (e.g., 0.75) quantile for the marginal densities of proportions that follow a beta distribution. Details will follow in the next example. The reader can also refer to O'Hagan et al. (2006), where a practical example is provided.

*Example 4.2 (Gunshot Residue Particles)* Consider a case in which a given number of particles (266) have been collected and analyzed by a scientist. The particles have been collected from a target surface (e.g., a person's hands). The counts of gunshot residue particles are as follows:


The scientist is asked to help discriminate between the following two propositions:

*H*<sup>1</sup> : The gunshot residue particles originate from a case of category D.

*H*<sup>2</sup> : The gunshot residue particles originate from a case of category E.
One way to elicit the Dirichlet distribution in the case here is to use observed frequencies of particles in various chemical classes as reported in previous studies (e.g., Brozek-Mucha & Jankowicz, 2001). Suppose that the elicited expert judgments for the marginal proportions characterizing category D are as follows:


and those characterizing category E:


Consider, first, the elicitation of the Dirichlet distribution concerning the first population, Dir*(θ*11*,...,θ*1*<sup>k</sup>* | *α*11*,...,α*1*k)*. Starting from the given lower, median, and upper quartiles for each marginal proportion, the prior distribution can be elicited as follows.

```
> p=c(0.25,0.5,0.75)
> th1=c(5,5.25,5.5)/100
> th2=c(9,9.25,9.5)/100
```

The function fitdist, available in the package SHELF, allows one to fit a parametric distribution starting from the elicited probabilities. In the example here, the parameters of the elicited beta distribution for each proportion are of interest.

```
> library(SHELF)
> fit1=fitdist(vals = th1, probs = p, 0, 1)
> fit2=fitdist(vals = th2, probs = p, 0, 1)
> fit3=fitdist(vals = th3, probs = p, 0, 1)
> fit4=fitdist(vals = th4, probs = p, 0, 1)
> fit5=fitdist(vals = th5, probs = p, 0, 1)
> fit6=fitdist(vals = th6, probs = p, 0, 1)
```
The last six objects contain the parameters of the beta distribution that is fitted for each marginal proportion. For example, the parameters *α*<sub>1</sub> and *β*<sub>1</sub> of the elicited beta distribution of *θ*<sub>1</sub> (i.e., the proportion of gunshot residue particles in category PbSbBa) can be obtained as

```
> fit1$Beta
```
```
    shape1  shape2
1 190.1306 3427.17
```

Next, fit the Dirichlet distribution to the elicited marginals by means of the function fitDirichlet that is available in the same package.

```
> d.fit = fitDirichlet(fit1,fit2,fit3,fit4,fit5,fit6,
+ categories = c("PbSbBa","PbSb","SbBa","Sb(Sn)",
+ "Pb","PbSnPb"),n.fitted = "min")
Directly elicited beta marginal distributions:
         PbSbBa     PbSb     SbBa   Sb(Sn)
shape1 1.90e+02 5.65e+02 3.67e+01 168.0000
shape2 3.43e+03 5.54e+03 8.06e+03  79.3000
mean   5.26e-02 9.25e-02 4.53e-03   0.6800
sd     3.71e-03 3.71e-03 7.46e-04   0.0296
sum    3.62e+03 6.11e+03 8.10e+03 248.0000
             Pb   PbSnPb
shape1 5.65e+02 6.38e+02
shape2 5.54e+03 7.54e+03
mean   9.25e-02 7.80e-02
sd     3.71e-03 2.97e-03
sum    6.11e+03 8.18e+03
Sum of elicited marginal means: 1
Beta marginal distributions from Dirichlet fit:
         PbSbBa     PbSb     SbBa   Sb(Sn)
shape1  13.0000  22.9000 1.12e+00 168.0000
shape2 235.0000 225.0000 2.46e+02  79.3000
mean     0.0526   0.0925 4.53e-03   0.6800
sd       0.0142   0.0184 4.26e-03   0.0296
sum    248.0000 248.0000 2.48e+02 248.0000
             Pb   PbSnPb
shape1  22.9000   19.300
shape2 225.0000  228.000
mean     0.0925    0.078
sd       0.0184    0.017
sum    248.0000  248.000
```
The Dirichlet parameters *α*<sub>11</sub>, ..., *α*<sub>1*k*</sub> can be read off from the row shape1 and will be stored in a vector named a1.

```
> a1=c(13,22.9,1.12,168,22.9,19.3)
```
Parameter *n* of the Dirichlet prior is chosen as the minimum of the sums of the beta parameters across the elicited marginals (input n.fitted set equal to "min"). See Oakley (2008) for more details.
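To see the mechanics of this choice, a minimal sketch (using the marginal means and the value *n* = 248 read off the output above) shows that the fitted Dirichlet parameters are approximately the elicited marginal means rescaled by *n*:

```r
# Marginal means from the directly elicited beta distributions (output above)
means <- c(0.0526, 0.0925, 0.00453, 0.68, 0.0925, 0.078)

# n is taken as the minimum of shape1 + shape2 across the elicited
# marginals, here 248 (the Sb(Sn) marginal)
n <- 248

# The Dirichlet parameters are approximately the marginal means times n
alpha <- means * n
round(alpha, 1)
```

The result reproduces, up to rounding, the vector a1 used below.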

In the same way, the Dirichlet distribution concerning the second population, Dir(*θ*<sub>21</sub>, ..., *θ*<sub>2*k*</sub> | *α*<sub>21</sub>, ..., *α*<sub>2*k*</sub>), can be elicited.

```
> th1=c(2.35,2.55,2.75)/100
> th2=c(7,7.5,8)/100
> th3=c(0.13,0.15,0.17)/100
> th4=c(56,58,60)/100
> th5=c(24,26,28)/100
> th6=c(5.6,5.8,6)/100
> fit1=fitdist(vals = th1, probs = p, 0, 1)
> fit2=fitdist(vals = th2, probs = p, 0, 1)
> fit3=fitdist(vals = th3, probs = p, 0, 1)
> fit4=fitdist(vals = th4, probs = p, 0, 1)
> fit5=fitdist(vals = th5, probs = p, 0, 1)
> fit6=fitdist(vals = th6, probs = p, 0, 1)
> d.fit = fitDirichlet(fit1,fit2,fit3,fit4,fit5,fit6,
+ categories = c("PbSbBa","PbSb","SbBa","Sb(Sn)",
+ "Pb","PbSnPb"),n.fitted="min")
```
The Dirichlet parameters *α*<sub>21</sub>, ..., *α*<sub>2*k*</sub> can be read off analogously from the row shape1 (not shown here) and will be stored in a vector named a2.

> a2=c(5.59,16.4,0.331,127,57,12.7)

The counts of gunshot residue particles are

> n=c(18,36,2,150,38,22)

The density of a Dirichlet-multinomial distribution can be calculated using the function ddirmnom that is available in the package extraDistr (Wolodzko, 2020), and the Bayes factor can be obtained straightforwardly:

```
> library(extraDistr)
> BF=ddirmnom(n,sum(n),a1)/ddirmnom(n,sum(n),a2)
> BF
```
[1] 658.6326

The Bayes factor provides moderately strong support for the hypothesis that the gunshot residue particles originate from a Beretta pistol with Luger 9 mm ammunition rather than from a Margolin pistol with Sporting 5.6 mm ammunition.

Assume equal prior probabilities, *π*<sub>1</sub> = *π*<sub>2</sub>. If a "0 − *l<sub>i</sub>*" loss function is introduced, then decision *d*<sub>1</sub>, classifying the gunshot residue particles into category D, is to be preferred to the alternative decision *d*<sub>2</sub> unless wrongly deciding *d*<sub>1</sub> is felt to be more than 659 times worse than wrongly classifying the particles into category E.

Note that by choosing a "0 − 1" loss function, or a symmetric "0 − *l<sub>i</sub>*" loss function with *l*<sub>1</sub> = *l*<sub>2</sub>, a BF greater than 1 (or, more generally, greater than *π*<sub>2</sub>/*π*<sub>1</sub> for unequal prior probabilities) provides a criterion for addressing the classification problem. The aim here was to show that, when assuming equal prior probabilities for the hypotheses being compared, it is not sufficient for decision *d*<sub>2</sub> to be optimal that the loss function is asymmetric, assigning a greater loss to the adverse consequence of decision *d*<sub>1</sub> than to that of decision *d*<sub>2</sub>. Specifically, this loss must be roughly 659 times greater.
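This decision rule can be sketched as a small helper function; the loss values below are hypothetical, chosen only to illustrate where the decision flips:

```r
# Decide between d1 and d2 under a "0 - li" loss function:
# d1 is optimal whenever BF > (l1/l2) / (pi1/pi2)  (see Sect. 1.9)
decide <- function(BF, l1, l2, pi1 = 0.5, pi2 = 0.5) {
  if (BF > (l1 / l2) / (pi1 / pi2)) "d1" else "d2"
}

BF <- 658.63  # Bayes factor obtained above

decide(BF, l1 = 100, l2 = 1)   # loss ratio below the BF: d1 still optimal
decide(BF, l1 = 1000, l2 = 1)  # loss ratio above the BF: d2 becomes optimal
```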

#### **4.2.2.2 More than Two Populations**

Consider now the case where more than two weapons (and related ammunition) could be at the origin of the collected gunshot particles. Suppose that a third weapon is taken into consideration and that the competing propositions are specified as follows:


As discussed in Sect. 1.6, the expert may calculate the marginal likelihood *f*<sub>*Hi*</sub>(**y**) (i.e., a Dirichlet-multinomial distribution) for each proposition and report a scaled version as in (1.27), that is,

$$f\_{H\_i}^\*(\mathbf{y}) = \frac{f\_{H\_i}(\mathbf{y})}{\sum\_{j=1}^3 f\_{H\_j}(\mathbf{y})},$$

or the posterior probabilities

$$\Pr(H\_i \mid \mathbf{y}) = \frac{\Pr(H\_i) f\_{H\_i}^\*(\mathbf{y})}{\sum\_{j=1}^3 \Pr(H\_j) f\_{H\_j}^\*(\mathbf{y})}, \qquad \qquad i = 1, 2, 3.$$

Alternatively, the analyst may also consider summarizing propositions *H*<sub>2</sub> and *H*<sub>3</sub> into one as *H̄*<sub>1</sub> = *H*<sub>2</sub> ∪ *H*<sub>3</sub>. A pair of competing propositions may thus be formulated as follows:


The Bayes factor can be obtained as in (1.28), that is,

$$\text{BF} = \frac{f\_{H\_1}(\mathbf{y}) \sum\_{i=2}^{3} \Pr(p\_i)}{f\_{\bar{H}\_1}(\mathbf{y})},\tag{4.4}$$

where

$$f\_{\bar{H}\_1}(\mathbf{y}) = \sum\_{i=2}^{3} \Pr(p\_i) \int\_{\Theta\_i} f(\mathbf{y} \mid \theta\_i) \pi(\theta\_i \mid p\_i) d\theta\_i.$$

*Example 4.3 (Gunshot Residue Particles—Continued)* Recall Example 4.2, and suppose that the elicited expert judgments for the marginal proportions characterizing category F are as follows:


The Dirichlet distribution concerning this new combination of weapon/ammunition can be elicited as before:

```
> th1=c(6,6.15,6.30)/100
> th2=c(4.5,4.75,5)/100
> th3=c(3,3.25,3.5)/100
> th4=c(65,67,69)/100
> th5=c(14,14.5,15)/100
> th6=c(3,3.25,3.5)/100
> fit1=fitdist(vals = th1, probs = p, 0, 1)
> fit2=fitdist(vals = th2, probs = p, 0, 1)
> fit3=fitdist(vals = th3, probs = p, 0, 1)
> fit4=fitdist(vals = th4, probs = p, 0, 1)
> fit5=fitdist(vals = th5, probs = p, 0, 1)
> fit6=fitdist(vals = th6, probs = p, 0, 1)
> d.fit = fitDirichlet(fit1,fit2,fit3,fit4,fit5,fit6,
+ categories = c("PbSbBa","PbSb","SbBa","Sb(Sn)",
+ "Pb","PbSnPb"),n.fitted = "min")
```

The Dirichlet parameters *α*<sub>31</sub>, ..., *α*<sub>3*k*</sub> can be read off from the row shape1 (not shown here) and will be stored in a vector named a3.

> a3=c(15.7,12.1,8.29,170,36.9,8.29)

The scaled version of the marginal likelihoods can be easily obtained as

```
> fh1=ddirmnom(n,sum(n),a1)
> fh2=ddirmnom(n,sum(n),a2)
> fh3=ddirmnom(n,sum(n),a3)
> fh1scaled=fh1/(fh1+fh2+fh3)
> fh2scaled=fh2/(fh1+fh2+fh3)
> fh3scaled=fh3/(fh1+fh2+fh3)
> c(fh1scaled,fh2scaled,fh3scaled)
```
[1] 0.9980356379 0.0015153146 0.0004490475

Note that the scaled likelihoods *f*<sup>∗</sup><sub>*Hi*</sub>(**y**) are equivalent to the posterior probabilities Pr(*H<sub>i</sub>* | **y**) whenever the prior probabilities of the three propositions are equal.

Alternatively, suppose that propositions *H*<sub>2</sub> and *H*<sub>3</sub> are summarized as above, i.e., *H̄*<sub>1</sub> = *H*<sub>2</sub> ∪ *H*<sub>3</sub>, and that the prior probabilities of *H*<sub>1</sub> and *H̄*<sub>1</sub> are equal, so that Pr(*H*<sub>1</sub>) = 0.5 and Pr(*H*<sub>2</sub>) = Pr(*H*<sub>3</sub>) = 0.25.

> p2=0.25

> p3=0.25

The Bayes factor can then be obtained as

```
> fh1=ddirmnom(n,sum(n),a1)
> fh2=p2*ddirmnom(n,sum(n),a2)+p3*ddirmnom(n,sum(n),a3)
> BF=fh1*(p2+p3)/fh2
> BF
[1] 1016.142
```
#### **4.3 Continuous Data**

The previous section considered the evaluation of scientific evidence in the form of discrete data for investigative purposes. However, for many types of scientific evidence, measurements lead to continuous data. In this section, we discuss parametric and non-parametric models for continuous data.

#### *4.3.1 Normal Model and Known Variance*

Suppose that tablets of unknown source are seized, and the question is whether they belong to population *A* or population *B*, which differ in color dye concentration. The propositions of interest are as follows:

*H*<sub>1</sub>: The tablet comes from population *A*.
*H*<sub>2</sub>: The tablet comes from population *B*.

The measurement of color dye concentration leads to continuous data for which a normal distribution is considered appropriate, say *X<sub>A</sub>* ∼ N(*θ<sub>A</sub>*, *σ*<sup>2</sup><sub>*A*</sub>) for population *A* and *X<sub>B</sub>* ∼ N(*θ<sub>B</sub>*, *σ*<sup>2</sup><sub>*B*</sub>) for population *B*. Suppose that the variance of color dye concentration in the different populations is known. For the population means, a conjugate normal prior distribution is introduced, i.e., *θ<sub>A</sub>* ∼ N(*μ<sub>A</sub>*, *τ*<sup>2</sup><sub>*A*</sub>) and *θ<sub>B</sub>* ∼ N(*μ<sub>B</sub>*, *τ*<sup>2</sup><sub>*B*</sub>).

The analysis of a tablet of unknown origin yields the measurement *y*. The Bayes factor can be obtained as in (1.26), where the marginal likelihoods *f*<sub>*Hi*</sub>(*y*) are still normal, with mean equal to the prior mean *μ* and variance equal to the sum of the prior variance *τ*<sup>2</sup> and the population variance *σ*<sup>2</sup>: *f*<sub>*Hi*</sub>(*y*) = N(*μ*, *τ*<sup>2</sup> + *σ*<sup>2</sup>).

Whenever several measurements (*y*<sub>1</sub>, ..., *y<sub>n</sub>*) are available, it is sufficient to recall that the joint likelihood is proportional to the likelihood of the sample mean *ȳ*, which is normally distributed, *Ȳ* ∼ N(*θ*, *σ*<sup>2</sup>/*n*), so that the marginal likelihood evaluated at the sample mean *ȳ* becomes *f*<sub>*Hi*</sub>(*ȳ*) = N(*μ*, *τ*<sup>2</sup> + *σ*<sup>2</sup>/*n*).

*Example 4.4 (Color Dye Concentration in Ecstasy Tablets)* A tablet of unknown origin is analyzed, and the measured color dye concentration is 0.16 (measurements are in %). A prior probability distribution is elicited for the mean of population *A*, as *θ<sub>A</sub>* ∼ N(0.14, 0.003<sup>2</sup>), and for the mean of population *B*, as *θ<sub>B</sub>* ∼ N(0.3, 0.016<sup>2</sup>). The population variances *σ*<sup>2</sup><sub>*A*</sub> and *σ*<sup>2</sup><sub>*B*</sub> are assumed to be known and equal to 0.01<sup>2</sup> and 0.06<sup>2</sup>, respectively (Goldmann et al., 2004).
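The code below assumes the following assignments, consistent with the values stated above; the variable names (pma, pva, pmb, pvb, sigmaa, sigmab) are those used in the remainder of the example:

```r
# Measurement and elicited parameters (values as stated in the text)
y <- 0.16          # measured color dye concentration
pma <- 0.14        # prior mean, population A
pva <- 0.003^2     # prior variance, population A
pmb <- 0.3         # prior mean, population B
pvb <- 0.016^2     # prior variance, population B
sigmaa <- 0.01^2   # known population variance, A
sigmab <- 0.06^2   # known population variance, B
```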


The Bayes factor in (1.26) can be obtained straightforwardly as the ratio of two normal likelihoods evaluated for the available measurement of color dye concentration *y*.

```
> BF=dnorm(y,pma,sqrt(pva+sigmaa))/
+ dnorm(y,pmb,sqrt(pvb+sigmab))
> BF
```

```
[1] 12.05706
```
The Bayes factor provides moderate support for the proposition according to which the analyzed tablet comes from population *A*, rather than the proposition according to which the tablet comes from population *B*. Note again that this result does not mean that proposition *H*<sub>1</sub> is more probable than proposition *H*<sub>2</sub>. It solely means that the probability of observing the concentration *y* is roughly 12 times greater if the tablet originates from population *A* rather than from population *B*. The posterior odds might be in favor of proposition *H*<sub>2</sub> even in the presence of a Bayes factor greater than 1, if the prior probability of proposition *H*<sub>1</sub> is sufficiently small. In the case at hand, it can easily be verified that the prior probability of proposition *H*<sub>1</sub> needs to be smaller than 0.07 in order for the posterior odds to favor *H*<sub>2</sub>.

Suppose now that *n* = 5 tablets are available, and the color dye concentration measurements are *y* = *(*0*.*155*,* 0*.*160*,* 0*.*165*,* 0*.*161*,* 0*.*159*)*. The value of the evidence can then be computed for the sample mean

```
> y=c(0.155,0.160,0.165,0.161,0.159)
> n=length(y)
> num=dnorm(mean(y),pma,sqrt(pva+sigmaa/n))
> den=dnorm(mean(y),pmb,sqrt(pvb+sigmab/n))
> BF=num/den
> BF
```

```
[1] 134.628
```
The Bayes factor now provides moderately strong support for the proposition *H*1, compared to proposition *H*2. This is a direct effect of the increased number of measurements.

#### *4.3.2 Normal Model and Unknown Variance*

In some applications, both parameters are unknown, and a prior distribution for the population mean and the population variance must be introduced. A non-informative or a subjective prior distribution may be chosen, as mentioned previously in Sect. 3.3.2.

Consider a case where skeletal remains are analyzed, and the question is whether they belong to a man or a woman. The competing propositions are as follows:

*H*<sub>1</sub>: The skeletal remains belong to a woman.
*H*<sub>2</sub>: The skeletal remains belong to a man.

The study of Benazzi et al. (2009) found that the measurement of the sacral base is a useful indicator of sex.

Consider a normal probability distribution for the area of the sacral base, *X<sub>F</sub>* ∼ N(*θ<sub>F</sub>*, *σ*<sup>2</sup><sub>*F*</sub>) for the population of females and *X<sub>M</sub>* ∼ N(*θ<sub>M</sub>*, *σ*<sup>2</sup><sub>*M*</sub>) for the population of males. A conjugate prior probability distribution *f*(*θ<sub>i</sub>*, *σ*<sup>2</sup><sub>*i*</sub>) can be assumed for (*θ<sub>i</sub>*, *σ*<sup>2</sup><sub>*i*</sub>) as in (3.12), where (*θ<sub>i</sub>* | *σ*<sup>2</sup><sub>*i*</sub>) ∼ N(*μ<sub>i</sub>*, *σ*<sup>2</sup><sub>*i*</sub>/*n<sub>i</sub>*) and *σ*<sup>2</sup><sub>*i*</sub> ∼ *S<sub>i</sub>* · *χ*<sup>−2</sup>(*k<sub>i</sub>*), *i* = {*F*, *M*}. This amounts to an inverse gamma distribution with shape parameter *α<sub>i</sub>* = *k<sub>i</sub>*/2 and scale parameter *β<sub>i</sub>* = *S<sub>i</sub>*/2, *σ*<sup>2</sup><sub>*i*</sub> ∼ IG(*k<sub>i</sub>*/2, *S<sub>i</sub>*/2).

The marginal density needed to compute the BF, *f*<sub>*Hi*</sub>(·), is a Student t distribution with *k<sub>i</sub>* degrees of freedom, centered at *μ<sub>i</sub>*, with spread parameter, denoted here *sp<sub>i</sub>*, equal to

$$sp\_i = \frac{n\_i}{n\_i + 1} \alpha\_i \beta\_i^{-1}$$

(as noted previously in Sect. 3.3.2). Note that in this case there is one available measurement (*n<sub>y</sub>* = 1).

*Example 4.5 (Sex Discrimination for Skeletal Remains)* The sacral base of skeletal remains is measured and found to be 11.5 cm<sup>2</sup>. The prior probability distribution for (*θ<sub>i</sub>*, *σ*<sup>2</sup><sub>*i*</sub>), as illustrated in Sect. 3.3.2, is elicited based on the following population data:

| Population | *n* | Mean (cm²) | SD (cm²) |
|------------|-----|------------|----------|
| Females    | 38  | 10.35      | 1.42     |
| Males      | 35  | 14.09      | 1.52     |

The prior distributions for (*θ<sub>F</sub>* | *σ*<sup>2</sup><sub>*F*</sub>) and (*θ<sub>M</sub>* | *σ*<sup>2</sup><sub>*M*</sub>) can be centered at *μ<sub>F</sub>* = 10.35 and *μ<sub>M</sub>* = 14.09, respectively, with *n<sub>F</sub>* = 38 and *n<sub>M</sub>* = 35.

> muf=10.35

> nf=38
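The later code also uses the corresponding values for the male population; consistent with the text above, they can be assigned as:

```r
mum <- 14.09  # prior mean, male population
nm <- 35      # sample size, male population
```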


The prior distributions for *σ*<sup>2</sup><sub>*F*</sub> and *σ*<sup>2</sup><sub>*M*</sub> can be elicited using the parameter value *k* = 20 (as in Example 3.6) and choosing *S<sub>F</sub>* and *S<sub>M</sub>* such that

$$\Pr(\sigma\_F^2 > 1.42^2) = \Pr(\sigma\_M^2 > 1.52^2) = 0.5$$

> k=20

> sigmaf=1.42^2

> sigmam=1.52^2

> q=qchisq(0.5,k)

> Sf=q\*sigmaf

> Sm=q\*sigmam

> c(Sf,Sm)

[1] 38.99199 44.67720

The prior distributions for *σ*<sup>2</sup><sub>*F*</sub> and *σ*<sup>2</sup><sub>*M*</sub> are 39 · *χ*<sup>−2</sup>(20) and 45 · *χ*<sup>−2</sup>(20), respectively. The marginal density in the numerator of the Bayes factor is a Student t distribution with *k<sub>F</sub>* degrees of freedom, centered at *μ<sub>F</sub>* = 10.35, with spread parameter *sp<sub>F</sub>* = 0.5 (rounded to the second decimal).

> spf=nf/(nf+1)\*k/Sf

The marginal density in the denominator of the Bayes factor is a Student t distribution with *k<sub>M</sub>* degrees of freedom, centered at *μ<sub>M</sub>* = 14.09, with spread parameter *sp<sub>M</sub>* = 0.44 (rounded to the second decimal).

> spm=nm/(nm+1)\*k/Sm

Note that in this case *kF* = *kM* = *k*.

The density of a Student t distributed random variable with location and precision parameters can be calculated using the function dstp available in the package LaplacesDemon (Hall et al., 2020). The Bayes factor can be obtained as follows:

```
> library(LaplacesDemon)
> y=11.5
> BF=dstp(y,muf,spf,k)/dstp(y,mum,spm,k)
> BF
```

```
[1] 3.184994
```
This value provides weak support for the proposition according to which the skeletal remains belong to a woman rather than a man.
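If the package LaplacesDemon is not available, the same result can be reproduced with base R under the assumption that dstp uses the precision parameterization of the Student t distribution, i.e., a variable with location *μ*, precision *τ*, and *ν* degrees of freedom has density √*τ* · dt((*x* − *μ*)√*τ*, *ν*). A sketch under that assumption:

```r
# Student t density with location mu, precision tau, nu degrees of freedom
dstp_base <- function(x, mu, tau, nu) sqrt(tau) * dt((x - mu) * sqrt(tau), nu)

# Values from the example
k <- 20
Sf <- 38.99199; Sm <- 44.67720
spf <- 38/39 * k/Sf  # spread (precision) parameter, females
spm <- 35/36 * k/Sm  # spread (precision) parameter, males
y <- 11.5

BF <- dstp_base(y, 10.35, spf, k) / dstp_base(y, 14.09, spm, k)
BF
```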

#### *4.3.3 Non-Normal Model*

As pointed out in Sect. 3.4.1.2, certain types of observations lack sufficient regularity to apply standard parametric models.

Consider a case where banknotes are seized on an individual following an arrest. A question commonly asked in such a case is whether the seized banknotes come from a population of banknotes used in drug dealing activities. The following propositions may thus be formulated:

*H*<sub>1</sub>: The seized banknotes come from a population of banknotes used in drug dealing activities.
*H*<sub>2</sub>: The seized banknotes come from the population of banknotes in general circulation.

**Fig. 4.1** Drug intensity measured on banknotes of 200 euro in a population of banknotes from drug trafficking (left) and general circulation (right) (Besson, 2004)


Figure 4.1 shows histograms of drug intensities measured on banknotes from drug trafficking (left) and general circulation (right). It can immediately be observed that the distributions for the two populations are different, that the distribution related to banknotes involved in drug trafficking is not unimodal, and that the one for banknotes in general circulation is positively skewed (Besson, 2004).

Suppose a database is available, {**z**<sub>*l*</sub> = (*z*<sub>*l*1</sub>, ..., *z*<sub>*lm<sub>l</sub>*</sub>), *l* = 1, 2}. The probability distribution for population *p<sub>l</sub>*, *f<sub>l</sub>*(·), can be estimated by means of the kernel density estimate *f̂<sub>l</sub>*(·) as

$$\hat{f}\_l(y \mid z\_{l1}, \dots, z\_{lm\_l}) = \frac{1}{m\_l} \sum\_{i=1}^{m\_l} K(y \mid z\_{li}, h\_l), \tag{4.5}$$

where K(*y* | *z<sub>li</sub>*, *h<sub>l</sub>*) is taken to be a normal density centered at *z<sub>li</sub>* with variance equal to *h*<sup>2</sup><sub>*l*</sub>*s*<sup>2</sup><sub>*l*</sub>, where *s*<sup>2</sup><sub>*l*</sub> = ∑<sub>*i*</sub>(*z<sub>li</sub>* − *z̄<sub>l</sub>*)<sup>2</sup>/(*m<sub>l</sub>* − 1) and *z̄<sub>l</sub>* = ∑<sub>*i*</sub> *z<sub>li</sub>*/*m<sub>l</sub>*.

The estimate *f̂<sub>l</sub>*(*y*) of the probability density is obtained by summing the individual kernel densities over all observations in the database and then dividing by the number of observations.

Figure 4.2 shows the kernel density estimates *f̂*<sub>1</sub>(*y* | *z*<sub>11</sub>, ..., *z*<sub>1*m*<sub>1</sub></sub>) and *f̂*<sub>2</sub>(*y* | *z*<sub>21</sub>, ..., *z*<sub>2*m*<sub>2</sub></sub>) obtained using (4.5) with the smoothing parameter set equal to 0.15 for both populations. It can be observed that the kernel density estimates capture the multimodality and skewness of the data and thus provide a better representation of the two populations.

**Fig. 4.2** Drug intensity measured on banknotes of 200 euro in a population of banknotes from drug trafficking (left) and general circulation (right), and associated kernel density estimates with smoothing parameter *h* equal to 0*.*15

Starting from the available measurements *y* = *(y*1*,...,yn)* on a sample of size *n*, a Bayes factor can be obtained as

$$\text{BF} = \frac{f\_{H\_1}(\mathbf{y})}{f\_{H\_2}(\mathbf{y})} = \frac{\prod\_{i=1}^{n} \hat{f}\_1(y\_i \mid z\_{11}, \dots, z\_{1m\_1})}{\prod\_{i=1}^{n} \hat{f}\_2(y\_i \mid z\_{21}, \dots, z\_{2m\_2})}. \tag{4.6}$$

*Example 4.6 (Contaminated Banknotes)* Consider a case in which 8 banknotes are seized on a person of interest. Laboratory analyses of the banknotes reveal drug intensities [*du*] equal to *y* = *(*322*,* 158*,* 114*,* 125*,* 361*,* 801*,* 798*,* 135*)*. A database named banknotes.Rdata is available on the book's website. It contains sample data for drug intensities on banknotes from drug trafficking and general circulation (Fig. 4.1). Note that these are hypothetical data used for the sole purpose of illustration. The *(n*<sup>1</sup> × 1*)* vector of measurements on banknotes from drug trafficking is extracted and denoted pop1; analogously, the *(n*<sup>2</sup> × 1*)* vector of measurements on banknotes from general circulation is extracted and denoted pop2.



The smoothing parameters *h*<sub>1</sub> and *h*<sub>2</sub> are set equal to 0.15. The variances of drug concentration in each population, *s*<sup>2</sup><sub>1</sub> and *s*<sup>2</sup><sub>2</sub>, are estimated by the sample variances

```
> h1=0.15
> h2=0.15
> s1=var(pop1)
> s2=var(pop2)
```
The kernel density estimates in (4.5) for the numerator and the denominator are computed by means of the functions kn1 and kn2, respectively.

```
> n1=length(pop1)
> n2=length(pop2)
> sk1=h1*sqrt(s1)
> sk2=h2*sqrt(s2)
> kn1=function(x){sum(dnorm(x,pop1,sk1))/n1}
> kn2=function(x){sum(dnorm(x,pop2,sk2))/n2}
```
The estimated probability densities are represented in Fig. 4.2.

```
> x=matrix(seq(0,1100,1),nrow=1)
> f1h=apply(x,2,kn1)
> f2h=apply(x,2,kn2)
```

Consider now the vector of measurements *y*. The probability densities are estimated as in (4.5):

```
> y=matrix(c(322,158,114,125,361,801,798,135),nrow=1)
> f1=apply(y,2,kn1)
> f2=apply(y,2,kn2)
```
and the Bayes factor is obtained as in (4.6):

```
> BF=prod(f1)/prod(f2)
> BF
[1] 29.7187
```

The Bayes factor represents moderate support for the proposition according to which the seized banknotes have been used in illegal drug trafficking rather than the proposition according to which they are part of the general circulation.

#### **Sensitivity to the Choice of the Smoothing Parameter**

The sensitivity of the BF to the choice of the smoothing parameter may be a cause of concern, as different choices may be made. The smoothing parameter *h* determines the shape of the estimated probability density: if it is (too) large, the curve *f̂*(*y*) will be (very) smooth; if it is (too) small, the resulting curve will be more spiky. Figure 4.3 shows, for both populations, the density curves obtained with *h* = 0.1 (dotted line), *h* = 0.15 (solid line), *h* = 0.2 (dashed line), and *h* = 0.25 (dot-dashed line). The Bayes factor for the available measurements in Example 4.6 is then calculated for several choices of the smoothing parameter *h*.

```
> hsens=c(0.1,0.15,0.2,0.25)
> BFsens=rep(0,length(hsens))
> for (i in 1:length(hsens)){
+ sk1=hsens[i]*sqrt(s1)
+ sk2=hsens[i]*sqrt(s2)
+ f1=apply(y,2,kn1)
+ f2=apply(y,2,kn2)
+ BFsens[i]=prod(f1)/prod(f2)}
> round(BFsens,2)
[1] 1402.94 29.72 5.63 2.00
```
Note that the last two values correspond to the larger values of the smoothing parameter *h*, which yield very smooth density curves.
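Rather than fixing *h* by hand, a data-driven starting point can be obtained from the bandwidth selectors in base R, such as Silverman's rule of thumb (bw.nrd0). These return an absolute bandwidth, so the multiplier comparable to *h* above is the bandwidth divided by the sample standard deviation. A sketch with simulated data (the vector z is illustrative, not the banknote database):

```r
set.seed(123)
z <- c(rnorm(100, 200, 60), rnorm(100, 750, 80))  # bimodal toy data

bw <- bw.nrd0(z)       # Silverman's rule-of-thumb bandwidth
h_equiv <- bw / sd(z)  # multiplier comparable to h used above
c(bw, h_equiv)
```

Other selectors, such as bw.SJ, can be compared in the same way as part of a sensitivity analysis.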

#### **4.4 Multivariate Data**

As mentioned in Sect. 3.4, analysts frequently encounter multivariate data because the features of examined items and materials, such as handwritten or printed documents, glass fragments, or skeletal remains, can be described by more than one variable. Such data often present a complex dependence structure with a large number of variables and multiple levels of variation.

#### *4.4.1 Normal Multivariate Data*

The classification of skeletal remains on the basis of sexual dimorphism is a common problem in paleontology. Section 4.3.2 dealt with the question of how to quantify the evidential value of measurements of a given morphological trait (e.g.,

**Fig. 4.3** Sample data used in Example 4.6 regarding drug intensities on banknotes for a population of banknotes from drug trafficking (top) and in general circulation (bottom), and associated kernel density estimates with smoothing parameter *h* equal to 0*.*1 (dashed line), 0*.*15 (solid line), 0*.*2 (dotted line), and 0*.*25 (dot-dashed line)


the profile of the sacral base). A number of studies have documented sex differences in particular pelvic traits, such as the *obturator foramen*, that tend to be oval in males and triangular in females. The shape of these traits can be described quantitatively by Fourier descriptors following the image analysis procedure developed by Bierry et al. (2010). Each item can be described by means of several variables, i.e., the amplitude and the phase of the first three harmonics.

Suppose that observations are available from a *p*-dimensional multivariate normal distribution whose mean vector and variance–covariance matrix are *θ<sub>l</sub>* and *W<sub>l</sub>*, respectively, **Z**<sub>*li*</sub> ∼ N(*θ<sub>l</sub>*, *W<sub>l</sub>*), *l* = 1, 2 (where *l* = 1 stands for the population of females and *l* = 2 for the population of males). Suppose further that the prior distribution about (*θ<sub>l</sub>*, *W<sub>l</sub>*) is chosen in the conjugate family of normal-inverse Wishart distributions NIW(*Ω<sub>l</sub>*, *ν<sub>l</sub>*, *μ<sub>l</sub>*, *c<sub>l</sub>*):<sup>1</sup>

$$f(\boldsymbol{\theta}\_l,\boldsymbol{W}\_l) \propto |\boldsymbol{W}\_l|^{-(\nu\_l+p+2)/2} \exp\left\{-\frac{c\_l}{2}(\boldsymbol{\theta}\_l-\boldsymbol{\mu}\_l)^{\prime}\boldsymbol{W}\_l^{-1}(\boldsymbol{\theta}\_l-\boldsymbol{\mu}\_l) - \frac{1}{2}\text{tr}(\boldsymbol{W}\_l^{-1}\boldsymbol{\Omega}\_l)\right\},$$

where *μ<sub>l</sub>* is the center vector, *c<sub>l</sub>* the degrees of freedom associated with the center vector *μ<sub>l</sub>*, *Ω<sub>l</sub>* the dispersion matrix, and *ν<sub>l</sub>* the degrees of freedom associated with the dispersion matrix *Ω<sub>l</sub>* (O'Hagan & Kendall, 1994).

Consider now a case where skeletal remains are recovered, and the following propositions are of interest:

*H*1: The skeletal remains belong to a woman (i.e., a member of population *p*1). *H*2: The skeletal remains belong to a man (i.e., a member of population *p*2).

Denote by **y** = (*y*<sub>1</sub>, ..., *y<sub>p</sub>*) the measurements (i.e., Fourier descriptors) related to the item whose origin is unknown and that needs to be classified. The marginal distribution under the competing propositions *H*<sub>1</sub> and *H*<sub>2</sub>, *f*<sub>*Hl*</sub>(**y**) for *l* = 1, 2, can be obtained as

$$f(\mathbf{y} \mid \boldsymbol{\mu}\_l, c\_l, \boldsymbol{\Omega}\_l, \nu\_l) = \int\_{\boldsymbol{\theta}\_l, \boldsymbol{W}\_l} f(\mathbf{y} \mid \boldsymbol{\theta}\_l, \boldsymbol{W}\_l) f(\boldsymbol{\theta}\_l, \boldsymbol{W}\_l) \, d(\boldsymbol{\theta}\_l, \boldsymbol{W}\_l)$$

$$\propto \left\{ 1 + (\mathbf{y} - \boldsymbol{\mu}\_l)' \left[ \frac{c\_l + 1}{c\_l} \boldsymbol{\Omega}\_l \right]^{-1} (\mathbf{y} - \boldsymbol{\mu}\_l) \right\}^{-(\nu\_l + 1)/2} . \tag{4.7}$$

This is a *p*-dimensional Student t distribution with *δ<sub>l</sub>* = *ν<sub>l</sub>* + 1 − *p* degrees of freedom, location *μ<sub>l</sub>*, and scale matrix

$$
\Delta\_l = \frac{(c\_l + 1)\,\Omega\_l}{c\_l\,\delta\_l}.
$$

<sup>1</sup> Note that a conjugate prior distribution may not always be the best choice. A method for assessing a non-conjugate prior distribution in which the mean vector and the covariance matrix of the multivariate normal are, a priori, independent is provided by Garthwaite and Al-Awadhi (2001).

The Bayes factor can be obtained as

$$\text{BF} = \frac{f(\mathbf{y} \mid \boldsymbol{\mu}\_1, \boldsymbol{c}\_1, \boldsymbol{\Omega}\_1, \boldsymbol{\nu}\_1)}{f(\mathbf{y} \mid \boldsymbol{\mu}\_2, \boldsymbol{c}\_2, \boldsymbol{\Omega}\_2, \boldsymbol{\nu}\_2)}.$$

#### **4.4.1.1 Prior Distribution for the Unknown Mean and Variance**

Four parameters must be elicited. The elicitation of *μ<sub>l</sub>* is rather simple: since *μ<sub>l</sub>* represents the mean, the median, and the mode of the prior probability distribution, the analyst may assess any of these summaries (O'Hagan et al., 2006). A procedure for the elicitation of the degrees of freedom *c* and *ν* and the dispersion matrix *Ω* has been provided by Al-Awadhi and Garthwaite (1998).

Here, suppose a non-informative prior distribution is used:

$$f(\theta\_l, W\_l) \propto |W\_l|^{-(p+1)/2}.$$

A database is available, with *n*<sub>1</sub> measurements for the population of females (*p*<sub>1</sub>) and *n*<sub>2</sub> measurements for the population of males (*p*<sub>2</sub>). The corresponding posterior distributions (one for the numerator, one for the denominator) can be written as

$$(\boldsymbol{\theta}\_l \mid \mathbf{z}\_l, \boldsymbol{\Sigma}\_l) \sim \mathcal{N}(\bar{\mathbf{z}}\_l, \boldsymbol{\Sigma}\_l / n\_l) \tag{4.8}$$

$$(\boldsymbol{\Sigma}\_l \mid \mathbf{z}\_l) \sim \text{IW}(\boldsymbol{S}\_l, n\_l - 1),\tag{4.9}$$

where *S<sub>l</sub>* = ∑<sub>*i*=1</sub><sup>*n<sub>l</sub>*</sup> (**z**<sub>*li*</sub> − **z̄**<sub>*l*</sub>)(**z**<sub>*li*</sub> − **z̄**<sub>*l*</sub>)′ is the sum of squares about the sample mean, and **z̄**<sub>*l*</sub> = ∑<sub>*j*=1</sub><sup>*n<sub>l</sub>*</sup> **z**<sub>*lj*</sub>/*n<sub>l</sub>*.

The marginal likelihood *f*<sub>*Hl*</sub>(**y**) is, therefore, a *p*-dimensional Student t distribution with *n<sub>l</sub>* − *p* degrees of freedom, location vector **z̄**<sub>*l*</sub>, and scale matrix

$$F\_l = \frac{(n\_l + 1)S\_l}{n\_l(n\_l - p)},\tag{4.10}$$

so that (**y** | **z̄**<sub>*l*</sub>, *F<sub>l</sub>*, *n<sub>l</sub>* − *p*) ∼ *t*<sub>*n<sub>l</sub>*−*p*</sub>(**z̄**<sub>*l*</sub>, *F<sub>l</sub>*).

*Example 4.7 (Sex Discrimination for Skeletal Remains Using Multivariate Data)* Skeletal remains are recovered, and the obturator foramen area is measured. The measurements of the first three pairs of Fourier descriptors are as follows:


Suppose that two databases of dimensions (*n*<sub>1</sub> × *p*) = (51 × 6) and (*n*<sub>2</sub> × *p*) = (50 × 6) are available for the populations of women and men, respectively. These two databases can be used to obtain the summaries **z̄**<sub>1</sub>, **z̄**<sub>2</sub> (i.e., the location vectors) and *S*<sub>1</sub>, *S*<sub>2</sub> (i.e., the sums of squares about the sample means) that are needed to calculate the marginal probability densities of the available measurements under the competing propositions. They can be obtained straightforwardly as

```
> as.matrix(colMeans(population))
> cov(population)*(n-1)
```
where population is a database of dimension (*n* × *p*) containing the available data. Note that only the summaries **z̄**<sub>1</sub>, **z̄**<sub>2</sub>, *S*<sub>1</sub>, *S*<sub>2</sub>, as well as the vector of measurements **y**, are available in the database skeletal.Rdata and can be obtained as

```
> load('skeletal.Rdata')
> y
        A1      Phi1        A2      Phi2        A3      Phi3
 0.0830950 2.6527709 0.9323330 0.4530559 0.4137360 0.3174581
> cbind(m1,m2)
           [,1]       [,2]
A1   0.07500563 0.05078316
Phi1 2.60792515 3.37739963
A2   1.08366494 1.15684192
Phi2 0.17014670 0.08233948
A3   0.50490100 0.39364526
Phi3 0.34169629 0.39422141
> S1
```


The marginal density *f*<sub>*H*1</sub>(**y**) in the numerator of the Bayes factor is a *p*-dimensional Student t distribution with *n*<sub>1</sub> − *p* = 45 degrees of freedom, location m1 as above, and scale matrix

```
> n1=51
> p=6
> F1=S1*(n1+1)/(n1*(n1-p))
```

The marginal density *f*<sub>*H*2</sub>(**y**) in the denominator of the Bayes factor is a *p*-dimensional Student t distribution with *n*<sub>2</sub> − *p* = 44 degrees of freedom, location m2 as above, and scale matrix

```
> n2=50
> F2=S2*(n2+1)/(n2*(n2-p))
```
The density of a multivariate Student t distributed random variable can be calculated using the function dmvt available in the package LaplacesDemon (Hall et al., 2020).

```
> library(LaplacesDemon)
> num=dmvt(y,t(m1),F1,n1-p,log=FALSE)
> den=dmvt(y,t(m2),F2,n2-p,log=FALSE)
> num/den
```

```
[1] 1545.489
```
The Bayes factor represents strong support for the proposition according to which the skeletal remains originate from a woman (population *p*1) rather than from a man (population *p*2).

As discussed in Sect. 3.4.2, it is important to study the performance of the proposed model. This can be achieved by using the available databases to generate many test cases and computing relevant performance metrics.
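One way to do this (a sketch, not the procedure used by the authors) is to repeatedly treat one database entry as the recovered item, compute a Bayes factor against both populations, and record how often the item is assigned to its true source. For brevity, the sketch below uses a univariate normal model with plug-in estimates rather than the multivariate Student t marginals of Example 4.7, and the data are simulated for illustration only:

```r
set.seed(1)
# Illustrative databases for two populations
db1 <- rnorm(50, 10.3, 1.4)  # population p1
db2 <- rnorm(50, 14.1, 1.5)  # population p2

# BF for a single measurement y, using normal densities with
# plug-in estimates from the training data
bf <- function(y, z1, z2) {
  dnorm(y, mean(z1), sd(z1)) / dnorm(y, mean(z2), sd(z2))
}

# Leave-one-out test cases from p1: BF > 1 is a correct assignment
correct1 <- sapply(seq_along(db1),
                   function(i) bf(db1[i], db1[-i], db2) > 1)
# Test cases from p2: BF < 1 is a correct assignment
correct2 <- sapply(seq_along(db2),
                   function(i) bf(db2[i], db1, db2[-i]) < 1)

# Rates of correct assignment under equal priors and symmetric losses
c(mean(correct1), mean(correct2))
```

The same loop structure carries over to the multivariate model, replacing bf with the ratio of multivariate Student t densities.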

#### **4.4.1.2 Classification as a Decision**

The BF obtained in Example 4.7 supports proposition *H*<sup>1</sup> over *H*2. However, if a decision is to be made, one needs to take into account the prior uncertainty (in terms of probabilities) about the competing propositions and the undesirability (in terms of losses) of adverse outcomes (i.e., classification errors).

Let *π*<sup>1</sup> and *π*<sup>2</sup> denote the prior probabilities of propositions *H*<sup>1</sup> and *H*2. The posterior probabilities *α*<sup>1</sup> and *α*<sup>2</sup> can be easily calculated as

$$\alpha\_{l} = \frac{\pi\_{l} f(\mathbf{y} \mid \boldsymbol{\mu}\_{l}, \boldsymbol{c}\_{l}, \boldsymbol{\Omega}\_{l}, \boldsymbol{\nu}\_{l})}{\sum\_{j=1}^{2} \pi\_{j} f(\mathbf{y} \mid \boldsymbol{\mu}\_{j}, \boldsymbol{c}\_{j}, \boldsymbol{\Omega}\_{j}, \boldsymbol{\nu}\_{j})},$$

where the marginal densities *f (***y** | *μ*<sub>*l*</sub>*, c*<sub>*l*</sub>*, Ω*<sub>*l*</sub>*, ν*<sub>*l*</sub>*)*, *l* = 1*,* 2, are as in (4.7).
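For instance, reusing the Bayes factor reported in Example 4.7 and assuming, purely for illustration, equal prior probabilities, the posterior probabilities can be computed as follows. Dividing numerator and denominator by the marginal density under *H*2 turns the expression into a function of the BF alone.

```r
BF <- 1545.489           # num/den from Example 4.7
pi1 <- 0.5; pi2 <- 0.5   # assumed prior probabilities (an illustration)
alpha1 <- pi1 * BF / (pi1 * BF + pi2)   # posterior probability of H1
alpha2 <- 1 - alpha1
c(alpha1, alpha2)
```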

A criterion that can be used to classify the recovered item into one of the two populations has been outlined in Sect. 1.9. When using a "0 − *li*" loss function (Table 1.4), the Bayes decision criterion states that the decision *d*<sub>1</sub>, classifying the recovered item into the population of women (*p*<sub>1</sub>), is optimal whenever

$$\text{BF} > \frac{l\_1/l\_2}{\pi\_1/\pi\_2} = c.\tag{4.11}$$

*Example 4.8 (Sex Discrimination for Skeletal Remains Using Multivariate Data—Continued)* If the prior odds are 1, and a symmetric loss function is chosen (i.e., *l*<sup>1</sup> = *l*2), the criterion in (4.11) says that the decision *d*<sup>1</sup> is optimal whenever BF *>* 1.

Assuming equal prior probabilities may be unrealistic because, often, there is at least some information to help assert whether one proposition is more probable than the stated alternative proposition. Likewise, the decision maker's preferences among adverse outcomes may not properly be reflected by a symmetric loss function, though it should be noted that what actually matters is only the ratio of *l*<sup>1</sup> to *l*2.

To investigate the effect of alternative choices for the prior odds and the loss function, one can conduct a sensitivity analysis. Figure 4.4 shows an example for the threshold *c* in (4.11) as a function of increasing values of the prior probability *π*<sup>1</sup> and for different asymmetric loss functions, where *l*2, the loss associated with the adverse outcome of the decision *d*2, is fixed at 1, and *l*1, associated with the adverse outcome of the decision *d*1, is equal to 10, 50, and 100.

This analysis reveals that *d*<sup>1</sup> is not the optimal decision for very high values of *l*1, compared to *l*2, and for very small values of the prior probability *π*1.
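The computation behind such a sensitivity analysis is elementary; a sketch (the grids of values for *π*1 and *l*1 are chosen for illustration):

```r
# Threshold c in (4.11): decision d1 is optimal whenever BF > c
c.threshold <- function(pi1, l1, l2 = 1) (l1 / l2) / (pi1 / (1 - pi1))
pi1.grid <- seq(0.05, 0.95, by = 0.05)
thresholds <- sapply(c(10, 50, 100), function(l1) c.threshold(pi1.grid, l1))
# With l1 = 100 and pi1 = 0.05, c = 1900, which exceeds the BF of
# Example 4.7 (1545.489): d1 would then no longer be optimal
c.threshold(0.05, 100)
# [1] 1900
```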

#### *4.4.2 Two-Level Models*

A recurrent problem in forensic practice is to help distinguish between legal and illegal cannabis plants (Bozza et al., 2014). Cannabis seedlings can be discriminated, to some extent, on the basis of their chemical profiles using chemometric tools and a methodology as described in Broséus et al. (2010). This study focused on several target compounds, taking into account their presence in drug type (illegal) and fiber type (legal) Cannabis.

Suppose a dataset is available that consists of replicate measurements (*n*) made on illegal plants (population *p*<sub>1</sub>) and on fiber type plants (population *p*<sub>2</sub>). The sample sizes are *m*<sub>1</sub> and *m*<sub>2</sub> for populations *p*<sub>1</sub> and *p*<sub>2</sub>, respectively. Background data can be denoted by **z**<sub>*lij*</sub> = *(z*<sub>*lij*1</sub>*,...,z*<sub>*lijp*</sub>*)*, where *l* = 1*,* 2, *i* = 1*,...,m*<sub>*l*</sub>, *j* = 1*,...,n*, and *p* is the number of variables. Available data suggest that a statistical model with two levels of variation is suitable: variation between replicate measurements from the same source and variation between measurements from different sources.

#### **4.4.2.1 Normal Distribution for the Between-Source Variation**

Here we use the two-level random effect model described in Sect. 3.4.1.1. For the within-source variation, the distribution of *Zlij* is taken to be normal, *Zlij* ∼ N*(θli, Wl)*. For the between-source variation, denote the mean vector between sources by *μl*, and the matrix of between-source variances and covariances by *Bl*. The distribution of *θli* is taken to be normal, *θli* ∼ N*(μl, Bl)*.

Measurements are available on some seized material, denoted by **y** = *(***y**1*,...,* **y***n)*, where **y***<sup>j</sup>* = *(***y***j*1*,...,* **y***jp)*, *j* = 1*,...,n*. A laboratory is asked to help determine the plant's chemotype. The following propositions may be of interest:

*H*1: The seized plant is drug type Cannabis (population *p*1).

*H*2: The seized plant is fiber type Cannabis (population *p*2).

The probability distribution of the measurements on items from each population is taken to be normal, **Y** ∼ N*(θ*<sub>*l*</sub>*, W*<sub>*l*</sub>*)*, *l* = 1*,* 2. The marginal probability densities in the numerator and denominator have the form *f*<sub>*Hl*</sub>*(***y***)* = *f*<sub>*l*</sub>*(***y** | *μ*<sub>*l*</sub>*, W*<sub>*l*</sub>*, B*<sub>*l*</sub>*)*, *l* = 1*,* 2, and can be obtained as in (3.28)

$$f\_l(\mathbf{y} \mid \boldsymbol{\mu}\_l, W\_l, B\_l) = \mid 2\pi W\_l \mid^{-n/2} \mid 2\pi B\_l \mid^{-1/2} \mid 2\pi (n W\_l^{-1} + B\_l^{-1})^{-1} \mid^{1/2}$$

$$\times \exp\left\{ -\frac{1}{2} \left[ (\bar{\mathbf{y}} - \boldsymbol{\mu}\_l)' (n^{-1} W\_l + B\_l)^{-1} (\bar{\mathbf{y}} - \boldsymbol{\mu}\_l) + \text{tr}\left( S W\_l^{-1} \right) \right] \right\}, \qquad (4.12)$$

where $S = \sum\_{i=1}^{n} (\mathbf{y}\_i - \bar{\mathbf{y}})(\mathbf{y}\_i - \bar{\mathbf{y}})'$.

The Bayes factor can then be obtained as in (1.26) as a ratio between the two marginals

$$\begin{split} \text{BF} &= \frac{f\_{H\_1}(\mathbf{y})}{f\_{H\_2}(\mathbf{y})} = \frac{f\_1(\mathbf{y} \mid \boldsymbol{\mu}\_1, W\_1, B\_1)}{f\_2(\mathbf{y} \mid \boldsymbol{\mu}\_2, W\_2, B\_2)} \\ &= \left(\frac{\mid W\_1 \mid}{\mid W\_2 \mid}\right)^{-\frac{n}{2}} \left(\frac{\mid B\_1 \mid}{\mid B\_2 \mid}\right)^{-\frac{1}{2}} \left(\frac{\mid (n W\_1^{-1} + B\_1^{-1})^{-1} \mid}{\mid (n W\_2^{-1} + B\_2^{-1})^{-1} \mid}\right)^{\frac{1}{2}} \\ &\times \exp\left\{\sum\_{i=1}^{2} \frac{(-1)^{i}}{2}\left[\text{tr}(S W\_i^{-1}) + (\bar{\mathbf{y}} - \boldsymbol{\mu}\_i)' (n^{-1} W\_i + B\_i)^{-1} (\bar{\mathbf{y}} - \boldsymbol{\mu}\_i)\right]\right\}. \end{split} \tag{4.13}$$

The overall means *μ*<sup>1</sup> and *μ*2, the within-source covariance matrices *W*<sup>1</sup> and *W*2, and the between-source covariance matrices *B*<sup>1</sup> and *B*<sup>2</sup> can be estimated from the available background data using (3.32), (3.33), and (3.34).
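A function implementing (4.12) and (4.13) may be sketched as follows, working on the log scale for numerical stability (a sketch only; the book's own two.level.mvn.inv.BF may differ in its details):

```r
# Log of the marginal density (4.12)
log.marg <- function(y, mu, W, B) {
  n <- nrow(y); p <- ncol(y)
  ybar <- colMeans(y)
  S <- (n - 1) * cov(y)                  # sum of squares about the sample mean
  V <- solve(n * solve(W) + solve(B))    # (n W^-1 + B^-1)^-1
  as.numeric(
    -(n / 2) * determinant(2 * pi * W)$modulus -
      (1 / 2) * determinant(2 * pi * B)$modulus +
      (1 / 2) * determinant(2 * pi * V)$modulus -
      (1 / 2) * (t(ybar - mu) %*% solve(W / n + B) %*% (ybar - mu) +
                 sum(diag(S %*% solve(W)))))
}

# Bayes factor (4.13) as the ratio of the two marginals
two.level.mvn.BF <- function(y, W1, W2, B1, B2, mu1, mu2)
  exp(log.marg(y, mu1, W1, B1) - log.marg(y, mu2, W2, B2))
```

As a quick self-check, identical parameters under both propositions give a BF of 1.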

*Example 4.9 (Cannabis Seedlings)* A plant of unknown type is analyzed, and the chemical profile is extracted. Three replicate measurements are taken (*n* = 3) on three variables (*p* = 3): Cannabidiol (CBD), Δ9-Tetrahydrocannabinol (THC), and Cannabinol (CBN). Measurements on the item of unknown type are as follows:


```
> y=matrix(c(-1.304,0.231,0.6874,-1.2918,0.24,0.735,
+ -1.0710,0.3176,0.9113),nrow=3,byrow=T)
```

The mean vectors between sources *μ*, the within-source covariance matrices *W*, and the between-source covariance matrices *B* can be estimated from the available background data (Bozza et al., 2014).

The estimates of the overall means *μ*<sub>1</sub> and *μ*<sub>2</sub>, of the within-source covariance matrices *W*<sub>1</sub> and *W*<sub>2</sub>, and of the between-source covariance matrices *B*<sub>1</sub> and *B*<sub>2</sub> are available in the database plant.Rdata and can be obtained as

```
> load('plant.Rdata')
> mu1
            CBD       THC       CBN
[1,] -0.4566709 0.9728053 0.9196972
> mu2
           CBD        THC        CBN
[1,] 0.4097014 -0.7850832 -0.7592971
> W1
           CBD         THC         CBN
CBD 0.01995126 0.015787374 0.010380235
THC 0.01578737 0.015708590 0.005226694
CBN 0.01038024 0.005226694 0.094354823
> W2
              CBD          THC           CBN
CBD  0.0180694402 1.901708e-03 -3.699212e-04
THC  0.0019017082 5.685754e-04  7.930402e-05
CBN -0.0003699212 7.930402e-05  1.878924e-02
> B1
          CBD       THC       CBN
CBD 0.4154039 0.2135218 0.1470832
THC 0.2135218 0.4752159 0.3893965
CBN 0.1470832 0.3893965 0.4292913
> B2
           CBD        THC        CBN
CBD 1.10811258 0.05630523 0.01847022
THC 0.05630523 0.06703743 0.05462002
CBN 0.01847022 0.05462002 0.10964122
```

These estimates can be obtained using the function two.level.mv.WB introduced in Sect. 3.4.1.1

```
> two.level.mv.WB(population,variables,
+ grouping.object)
```

where population is a data frame containing the available data, variables indicates the columns containing the variables to be used, and grouping.object indicates the column identifying the item (source) number.
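Assuming that (3.32)–(3.34) are the usual method-of-moments estimators for a balanced two-level model (*m* sources, *n* replicates per source), the estimation step can be sketched as follows; the function name wb.estimates and its argument names are illustrative, not the book's code.

```r
wb.estimates <- function(z, item) {
  # z: (m*n x p) matrix of measurements; item: source label of each row
  group.means <- apply(z, 2, tapply, item, mean)        # (m x p) matrix
  m <- nrow(group.means); n <- nrow(z) / m              # balanced design
  Sw <- Reduce(`+`, lapply(split.data.frame(z, item),
                           function(zi) (nrow(zi) - 1) * cov(zi)))
  W <- Sw / (m * (n - 1))            # within-source covariance matrix
  B <- cov(group.means) - W / n      # between-source covariance matrix
  list(mu = colMeans(z), W = W, B = B, group.means = group.means)
}
```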


Given the available measurements, the Bayes factor can be calculated as in (4.13) using the function two.level.mvn.inv.BF.

```
> BF=two.level.mvn.inv.BF(y,W1,W2,B1,B2,mu1,mu2)
> BF
[1] 48739.7
```

The Bayes factor represents very strong support for the proposition according to which the seized plant is of drug type rather than fiber type.

#### **4.4.2.2 Non-normal Distribution for the Between-Source Variation**

As noted in Sect. 3.4.1.2, whenever the assumption of normality for the between-source variability is considered inappropriate, the normal distribution *f (θ*<sub>*li*</sub> | *μ*<sub>*l*</sub>*, B*<sub>*l*</sub>*)* = N*(μ*<sub>*l*</sub>*, B*<sub>*l*</sub>*)* previously proposed can be replaced by a kernel density estimate as in (3.35). The marginal densities *f*<sub>*Hl*</sub>*(***y***)* in the numerator and denominator of the Bayes factor become

$$f\_l(\bar{\mathbf{y}} \mid W\_l, B\_l, h\_l) = (2\pi)^{-p/2} \mid B\_l \mid^{-1/2} (m\_l h\_l^p)^{-1} \mid D\_l \mid^{-1/2} \mid D\_l^{-1} + (h\_l^2 B\_l)^{-1} \mid^{-1/2}$$

$$\times \sum\_{i=1}^{m\_l} \exp\left\{ -\frac{1}{2} (\bar{\mathbf{y}} - \bar{\mathbf{z}}\_{li})' (D\_l + h\_l^2 B\_l)^{-1} (\bar{\mathbf{y}} - \bar{\mathbf{z}}\_{li}) \right\}, \qquad (4.14)$$

where *D*<sub>*l*</sub> = *n*<sup>−1</sup>*W*<sub>*l*</sub>. Note that this is just the marginal density of the recovered data, that is, the first line in (3.38), with all multiplicative constants retained.

The Bayes factor is then given by the ratio of the marginal probability densities in (4.14) for *l* = 1*,* 2, that is,

$$\text{BF} = \frac{f\_1(\overline{\mathbf{y}} \mid W\_1, B\_1, h\_1)}{f\_2(\overline{\mathbf{y}} \mid W\_2, B\_2, h\_2)}. \tag{4.15}$$
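A function implementing (4.14) and (4.15) can be sketched as follows (the book's two.level.mvk.inv.BF may differ in its details). After collecting the constants, (4.14) is simply an equally weighted mixture of *m* normal densities centred at the group means, each with covariance matrix *D* + *h*²*B*:

```r
kernel.marg <- function(ybar, gmu, W, B, h, n) {
  # Marginal density (4.14): normal mixture over the m group means in gmu
  V <- W / n + h^2 * B          # D + h^2 B, with D = W / n
  Vi <- solve(V)
  p <- ncol(gmu); m <- nrow(gmu)
  k <- (2 * pi)^(-p / 2) * det(V)^(-1 / 2) / m
  k * sum(apply(gmu, 1, function(z)
    exp(-0.5 * t(ybar - z) %*% Vi %*% (ybar - z))))
}

# Bayes factor (4.15) as the ratio of the two kernel marginals
two.level.mvk.BF <- function(y, gmu1, gmu2, W1, W2, B1, B2, h1, h2) {
  ybar <- colMeans(y); n <- nrow(y)
  kernel.marg(ybar, gmu1, W1, B1, h1, n) /
    kernel.marg(ybar, gmu2, W2, B2, h2, n)
}
```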

*Example 4.10 (Cannabis Seedlings—Continued)* Consider again the case examined in Example 4.9, and suppose that a kernel distribution is used to model the between-source variability. First, the group means **z̄**<sub>*li*</sub> are needed. They are part of the output of the function two.level.mv.WB that can be used to estimate the model parameters.

```
> head(group.means.1)
          CBD        THC      CBN
1 -1.22249231  0.2629209 0.777929
2 -0.04734919  1.7607730 2.293862
3 -0.59036072  1.1574978 1.403290
4 -0.27733591  1.5211215 1.832527
5 -0.54204482  1.2387804 1.545526
6 -0.65989575 -0.9686288 1.831042
> head(group.means.2)
            CBD        THC       CBN
141 -0.12963445 -1.0232887 -0.896759
142 -0.16827410 -0.9934113 -0.896759
143 -0.61568550 -1.0464456 -0.896759
144  0.03267767 -0.9815586 -0.896759
145  0.12647601 -0.9349308 -0.896759
146 -0.51730995 -0.9909842 -0.896759
> m1=dim(group.means.1)[1]
> m2=dim(group.means.2)[1]
> c(m1,m2)
[1] 117 155
```
Here we show only the first six rows of the *(m*<sub>*l*</sub> × *p)* matrices, where each row represents the vector of means **z̄**<sub>*li*</sub> = (1/*n*) Σ<sup>*n*</sup><sub>*j*=1</sub> **z**<sub>*lij*</sub>, *l* = 1*,* 2. Note that the group means **z̄**<sub>1</sub> and **z̄**<sub>2</sub>, as well as all the estimated parameters (*μ*<sub>1</sub>, *μ*<sub>2</sub>, *W*<sub>1</sub>, *W*<sub>2</sub>, *B*<sub>1</sub>, and *B*<sub>2</sub>), are available in the database plant.Rdata.

The smoothing parameters *h*<sup>1</sup> and *h*<sup>2</sup> in the two populations can be estimated as in (3.36), using the function hopt:

```
> p=3
> h1=hopt(p,m1)
> h2=hopt(p,m2)
> c(h1,h2)
[1] 0.4675469 0.4491338
```
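The values of h1 and h2 shown above are reproduced by the usual multivariate kernel smoothing parameter *h*ˆ = *(*4*/(m(*2*p* + 1*)))*<sup>1*/(p*+4*)*</sup>, which is presumably what (3.36) specifies; a one-line sketch:

```r
# Smoothing parameter for a p-variate kernel estimate based on m sources
hopt <- function(p, m) (4 / (m * (2 * p + 1)))^(1 / (p + 4))
c(hopt(3, 117), hopt(3, 155))
# [1] 0.4675469 0.4491338
```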
Given the available measurements, the Bayes factor can be calculated as in (4.15) using the function two.level.mvk.inv.BF available in the supplementary materials on the book's website

```
> BF=two.level.mvk.inv.BF(y,group.means.1,
+ group.means.2,W1,W2,B1,B2,h1,h2)
> BF
[1] 7.42
```

The Bayes factor represents moderate support for the proposition according to which the seized plant is drug type Cannabis rather than fiber type Cannabis.

#### **4.4.2.3 Assessing Model Performance**

One way to investigate the performance of the two models described in Sects. 4.4.2.1 and 4.4.2.2, denoted here Model 1 and Model 2, is to calculate a Bayes factor for all available measurements on items from population 1 (drug type). One would expect to obtain BFs greater than 1 (see Table 4.1). Clearly, one should also consider BF computations for all measurements on items from population *p*<sup>2</sup> (fiber type). In the latter case, BFs smaller than 1 would be expected (see Table 4.2).



#### **4.5 Summary of R Functions**

The R functions outlined below have been used in this chapter.

#### **Functions Available in the Base Package**

apply: Applies a function to the margins (either rows or columns) of a matrix.

colMeans: Forms column means for numeric arrays (or data frames).


#### **Functions Available in Other Packages**

dmvt: Computes the density of a multivariate Student t distribution. Available in the package LaplacesDemon (Hall et al., 2020).

#### **Functions Developed in the Chapter**

beta\_prior: Calculates the hyperparameters *α* and *β* of a beta distribution Be*(α, β)* starting from the prior mean *m* and the prior variance *v*.

*Usage*: beta\_prior(m,v).

*Arguments*: m, the prior mean; v, the prior variance.

*Output*: A vector of values, the first is *α*, the second is *β*.

hopt: Calculates the estimates *h*ˆ of the smoothing parameter *h*.

*Usage*: hopt(p,m).

*Arguments*: p, the number of variables; m, the number of sources.

*Output*: A scalar value.

kn1: Computes the kernel density estimation (numerator).

*Usage*: kn1(x,pop1,sk1).

two.level.mvn.inv.BF: Computes the BF for investigative purposes from a two-level model where both the within-source and the between-source variability are assumed to be normally distributed.

*Usage*: two.level.mvn.inv.BF(y,W1,W2,B1,B2,mu1,mu2).

*Arguments*: y, a *(n* × *p)* matrix of measurements; W1 and W2, the within-source covariance matrices; B1 and B2, the between-source covariance matrices; mu1 and mu2, the overall means *μ*<sub>1</sub> and *μ*<sub>2</sub>; variables, a vector containing the column indices of the variables to be used.

*Output*: A scalar value.

two.level.mvk.inv.BF: Computes the BF for investigative purposes from a two-level model where the within-source variability is assumed to be normally distributed, while the between-source variability is modeled by a kernel density.

*Usage*: two.level.mvk.inv.BF(y,gmu1,gmu2,W1,W2,B1,B2,h1,h2).

*Arguments*: y, a *(n* × *p)* matrix of measurements; gmu1 and gmu2, the matrices of group means **z̄**<sub>1*i*</sub> and **z̄**<sub>2*i*</sub>; W1 and W2, the within-source covariance matrices; B1 and B2, the between-source covariance matrices; h1 and h2, the smoothing parameters.

*Output*: A scalar value.



## **Index**

#### **A**

Alcohol concentration, 13, 67, 70, 74, 75

#### **B**

Bayes factor, 3, 7, 9, 10
  approximation, 50
  computational aspects, 28–31
  for continuous data, 11
  for discrete data, 11
  for evaluation, 13–22
  feature-based, 14
  for investigation, 22–27
  for multiple propositions, 26, 150
  interpretation, 27–28
  score-based, 18, 86
  verbal scale, 27

Bayesian network, 45

Bayesian thinking, 5–7

Bayes' theorem, 3, 6, 10, 119

Belief, 3, 5

Bernoulli trials, 42, 80, 142

#### **C**

Cannabis plant type, 169, 171

Consecutive matching striations, 85, 86

Counterfeit medicines, 43, 45, 64

#### **D**

Decision
  for classification, 166
  consequence, 32
  criterion, 33, 64, 75, 144, 145, 166
  expected loss, 32, 63, 75
  loss function, 32
  matrix, 32
  for a mean, 74–76
  for a proportion, 62–65

Decision analysis, 32–34

Distribution
  beta, 35, 42, 80, 142
  beta-binomial, 44, 143
  binomial, 35, 42, 44, 46
  chi-squared, 97
  conjugate, 24, 34, 35, 43, 66, 80, 83, 86, 93, 94, 100, 142, 153, 155
  Dirichlet, 35, 83, 144
  Dirichlet-multinomial, 144
  gamma, 35, 55, 85, 88, 95
  inverse chi-squared, 95, 97
  inverse gamma, 35, 95, 97, 155
  inverse Wishart, 118, 128
  kernel, 115, 157, 171
    density estimation, 115, 157
    smoothing parameter, 115, 157, 159, 172
  marginal, 15, 23
  multinomial, 35, 83, 144
  multivariate normal, 110, 115, 116, 118, 128, 131, 160, 162, 168, 171
    non-informative prior, 163
  multivariate Student t, 162, 163
  normal, 16, 23, 35, 66, 100, 106, 153, 154, 157
    known variance, 92
    mean and variance unknown, 94
    non-informative prior, 102
    posterior mean, 16
    posterior variance, 16


Distribution (*cont.*)
  normal-gamma, 95
  normal-inverse Wishart, 162
  Poisson, 35, 46, 85
  Poisson-gamma, 86
  posterior predictive, 15
  predictive, 15
    normal, 16
  prior
    choice, 34–38
    elicitation, 44
    beta, 35, 37, 38, 48
    Dirichlet, 145
    equivalent sample size, 36, 69, 89
    gamma, 88
    non-informative, 88
    normal, 69
    normal-gamma, 96, 155
    normal-inverse Wishart, 163
    Student t, 96, 102, 155, 156
    uniform, 43, 69, 88, 145

Drugs on banknotes, 158

#### **E**

Ecstasy tablets, 153

Error
  continuous measurements, 72
  counting process, 46

Evaluation, 4
  for continuous data, 91–108
    normal (both parameters unknown), 94
    normal (known variance), 92
    score-based, 105
  for discrete data, 80–91
    binomial, 80
    multinomial, 82
    Poisson, 84
  for multivariate data, 108–135
    non-constant within-source variation, 118
    non-normal between-source variation, 115
    normal between-source variation, 109
    three-level, 130
    two-level, 109

Evidence, 5

#### **F**

Fingermarks, 22, 24

Firearms, 21, 84, 86

Food quality control, 46

Fourier descriptors, 120, 162

#### **G**

Gibbs sampling, 30, 120

Glass, 112, 117, 132

Gunshot residues, 144, 145, 151

#### **H**

Hamiltonian Monte Carlo, 31

Handwriting, 20, 120, 124

Hyperparameter, 35

Hypothesis, 3
  alternative, 8
  composite, 8
  null, 8
  simple, 8

#### **I**

Image analysis, 72

Image comparison, 19

Independence under the alternative proposition, 14, 127

Inference
  discrete propositions, 99
  mean, 66–68
  proportion, 42–45

Information
  background, 6
  task-relevant, 6

Investigation, 4
  with continuous data, 152–160
    non-normal, 156
    normal (both parameters unknown), 154
    normal (known variance), 152
  with discrete data, 142–149
    binomial, 142
    multinomial, 144
  with multivariate data, 160–173
    non-normal between-source variation, 171
    normal between-source variation, 168
    two-level, 168

#### **J**

Jaccard distance, 107

#### **L**

Likelihood
  function, 9
  marginal, 8, 17, 23, 28
    normal, 24
    scaled, 25
  ratio, 3, 10, 85, 86


Loss function, 32
  0 − 1, 149
  0 − *li*, 32, 65, 101, 144, 149, 166
  asymmetric, 145, 167
  linear, 63, 74
  symmetric, 64, 149, 167

#### **M**

Markov chain Monte Carlo, 29
  autocorrelation plot, 60
  Gibbs sampling algorithm, 30, 120
  Hamiltonian Monte Carlo, 31
  Metropolis–Hastings algorithm, 30
    two-block, 55
  trace-plot, 60

Maximum likelihood, 85

Measurements
  continuous, 5
  discrete, 5

Metropolis–Hastings algorithm, 30
  two-block, 55

Model
  comparison, 7–13
  feature-based, 14–17
    parametric, 15
  performance, 123, 173
  score-based, 14, 17–22
    non-anchored, 19, 21, 88
    source-anchored, 19
    trace-anchored, 19, 20
  statistical, 6
  three-level, 130–135
  two-level, 109–130, 168–173

Monte Carlo
  error, 29, 54
  estimate, 29, 47, 49
  Hamiltonian Monte Carlo, 31
  importance sampling, 29, 53
  method, 28

#### **O**

Odds
  posterior, 9, 11, 90
  prior, 8, 90

#### **P**

Parameter, 7
  space, 7
    continuous, 9
    discrete, 9

Population
  relevant, 15

Prior assumptions, 10

Probability, 5
  density, 8, 10
  law of total, 5
  marginal, 8
  model, 7
  posterior, 5
  predictive, 8
  prior, 5

Proposition, 3
  common-source, 21, 106
  hierarchy of, 13
  multiple, 25, 103
  specific-source, 21

#### **Q**

Questioned documents, 24, 26, 81, 84, 92, 93, 98, 100, 102, 103

#### **R**

R, 39
  functions, 76, 78, 136, 139, 173, 176

Reporting
  evaluative, 4

Rice quality, 47, 57, 62, 143

#### **S**

Saliva, 107

Score, 18, 85

Scoring rule, 33

Sensitivity analysis, 38–39
  loss function, 167
  Markov chain Monte Carlo, 122
  Monte Carlo, 51
  prior distribution, 70
  prior odds, 90
  smoothing parameter, 160

Sex discrimination, 155, 163, 167

Signatures, 129

Significance testing, 8

Stan, 31

#### **T**

Threshold, 42, 63, 74

Toner on printed documents, 16, 80