# **SPRINGER BRIEFS IN PROBABILITY AND MATHEMATICAL STATISTICS**

Victor M. Panaretos Yoav Zemel

# An Invitation to Statistics in Wasserstein Space

# **SpringerBriefs in Probability and Mathematical Statistics**

#### **Editor-in-Chief**

Gesine Reinert, University of Oxford, Oxford, UK
Mark Podolskij, University of Aarhus, Aarhus C, Denmark

#### **Series Editors**

Nina Gantert, Technische Universität München, Munich, Nordrhein-Westfalen, Germany
Tailen Hsing, University of Michigan, Ann Arbor, MI, USA
Richard Nickl, University of Cambridge, Cambridge, UK
Sandrine Péché, Université Paris Diderot, Paris, France
Yosef Rinott, Hebrew University of Jerusalem, Jerusalem, Israel
Almut E.D. Veraart, Imperial College London, London, UK
Mathieu Rosenbaum, Université Pierre et Marie Curie, Paris, France
Wei Biao Wu, University of Chicago, Chicago, IL, USA

SpringerBriefs present concise summaries of cutting-edge research and practical applications across a wide spectrum of fields. Featuring compact volumes of 50 to 125 pages, the series covers a range of content from professional to academic. Briefs are characterized by fast, global electronic dissemination, standard publishing contracts, standardized manuscript preparation and formatting guidelines, and expedited production schedules.

Typical topics might include:


Manuscripts presenting new results in a classical field, new field, or an emerging topic, or bridges between new results and already published works, are encouraged. This series is intended for mathematicians and other scientists with interest in probability and mathematical statistics. All volumes published in this series undergo a thorough refereeing process.

The SBPMS series is published under the auspices of the Bernoulli Society for Mathematical Statistics and Probability.

More information about this series at http://www.springer.com/series/14353

Victor M. Panaretos • Yoav Zemel

# An Invitation to Statistics in Wasserstein Space

Victor M. Panaretos Institute of Mathematics EPFL Lausanne, Switzerland

Yoav Zemel Statistical Laboratory University of Cambridge Cambridge, UK

ISSN 2365-4333 ISSN 2365-4341 (electronic) SpringerBriefs in Probability and Mathematical Statistics ISBN 978-3-030-38437-1 ISBN 978-3-030-38438-8 (eBook) https://doi.org/10.1007/978-3-030-38438-8

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*To our families*

## **Preface**

A Wasserstein distance is a metric between probability distributions μ and ν on a ground space *X*, induced by the problem of optimal mass transportation or simply *optimal transport*. It reflects the minimal effort that is required in order to reconfigure the mass of μ to produce the mass distribution of ν. The 'effort' corresponds to the total work needed to achieve this reconfiguration, where work equals the amount of mass at the origin times the distance to the prescribed destination of this mass. The distance between origin and destination can be raised to some power other than 1 when defining the notion of work, giving rise to correspondingly different Wasserstein distances. When viewing the space of probability measures on *X* as a metric space endowed with a Wasserstein distance, we speak of a *Wasserstein space*.
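To make the notion of 'work' concrete, here is a minimal numerical sketch (the routine and the data are illustrative, not part of the text): SciPy's `wasserstein_distance` computes the distance with exponent 1 between two empirical measures on the real line.

```python
# Illustrative sketch: 1-Wasserstein distance between two empirical
# measures on the line, each placing mass 1/3 on three points.
from scipy.stats import wasserstein_distance

mu_support = [0.0, 1.0, 3.0]  # arbitrary example data
nu_support = [5.0, 6.0, 8.0]

# The optimal rearrangement moves 0 -> 5, 1 -> 6, 3 -> 8, each carrying
# mass 1/3, so the total work is (|0-5| + |1-6| + |3-8|) / 3 = 5.
w1 = wasserstein_distance(mu_support, nu_support)
print(w1)  # 5.0
```

Raising each distance to a power p other than 1 before averaging would give the pth-power cost underlying the Wasserstein distance of order p.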

Mass transportation and the associated Wasserstein metrics/spaces are ubiquitous in mathematics, with a long history that has seen them catalyse core developments in analysis, optimisation, and probability. Beyond their intrinsic mathematical richness, they possess attractive features that make them a versatile tool for the statistician. They frequently appear in the development of statistical theory and inferential methodology, sometimes as a technical tool in asymptotic theory, due to the useful topology they induce and their easy majorisation; and other times as a methodological tool, for example, in structural modelling and goodness-of-fit testing. A more recent trend in statistics is to consider Wasserstein spaces themselves as a sample and/or parameter space and treat inference problems in such spaces. It is this more recent trend that is the topic of this book and is coming to be known as 'statistics in Wasserstein spaces' or 'statistical optimal transport'.

From the theoretical point of view, statistics in Wasserstein spaces represents an emerging topic in mathematical statistics, situated at the interface between functional data analysis (where the data are functions, seen as random elements of an infinite-dimensional Hilbert space) and non-Euclidean statistics (where the data satisfy non-linear constraints, thus lying on non-Euclidean manifolds). Wasserstein spaces provide the natural mathematical formalism to describe data collections that are best modelled as random measures on R<sup>*d*</sup> (e.g. images and point processes). Such random measures carry the infinite-dimensional traits of functional data, but are intrinsically non-linear due to positivity and integrability restrictions. Indeed, contrary to functional data, their dominating statistical variation arises through random (non-linear) deformations of an underlying template, rather than the additive (linear) perturbation of an underlying template. This shows optimal transport to be a canonical framework for dealing with problems involving the so-called *phase variation* (also known as registration, multi-reference alignment, or synchronisation problems). This connection is pursued in detail in this book and linked with the so-called problem of optimal multitransport (or optimal multicoupling).

In writing our monograph, we had two aims in mind:


The book focusses on the *theory* of statistics in Wasserstein spaces. It does not cover the associated computational/numerical aspects. This is partially due to space restrictions, but also due to the fact that a reference entirely dedicated to such issues can be found in the very recent monograph of Peyré and Cuturi [103]. Moreover, since this book is meant to be a rapid introduction for non-specialists, we have made no attempt to give a complete bibliography. We have added some bibliographic remarks at the end of each chapter, but these are in no way meant to be exhaustive. For those seeking reference works, Rachev [106] is an excellent overview of optimal transport up to 1985. Other recent reviews are Bogachev and Kolesnikov [26] and Panaretos and Zemel [101]. The latter review can be thought of as complementary to the present book and surveys some of the applications of optimal transport methods to statistics and probability theory.

<sup>1</sup> E.g. by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio and Gigli [10], Ambrosio et al. [12], and more recently by Santambrogio [119].

#### **Structure of the Book**

The material is organised into five chapters.


Each chapter comes with some bibliographic notes at the end, giving some background and suggesting further reading. The first two chapters can be used independently as a crash course in optimal transport for statisticians at the MSc or PhD level, depending on the audience's background. Proofs that were omitted from the main text due to space limitations have been organised into an online supplement, which can be accessed from the online version of each chapter published on the SpringerLink website: https://link.springer.com/book/10.1007/978-3-030-38438-8.

#### **Acknowledgements**

We wish to thank three anonymous reviewers for their thoughtful feedback. We are especially indebted to one of them, whose analytical insights were particularly useful. Any errors or omissions are, of course, our own responsibility. Victor M. Panaretos gratefully acknowledges support from a European Research Council Starting Grant. Yoav Zemel was supported by Swiss National Science Foundation Grant # 178220. Finally, we wish to thank Mark Podolskij and Donna Chernyk for their patience and encouragement.

Lausanne, Switzerland Victor M. Panaretos
Cambridge, UK Yoav Zemel

# **Contents**





# **Chapter 1 Optimal Transport**

In this preliminary chapter, we introduce the problem of optimal transport, which is the main concept behind Wasserstein spaces. General references on this topic are the books by Rachev and Rüschendorf [107], Villani [124, 125], Ambrosio et al. [12], Ambrosio and Gigli [10], and Santambrogio [119]. This chapter includes only a few proofs, given when they are simple, informative, or not easily found in one of the cited references.

#### **1.1 The Monge and the Kantorovich Problems**

In 1781, Monge [95] asked the following question: given a pile of sand and a pit of equal volume, how can one optimally transport the sand into the pit? In modern mathematical terms, the problem can be formulated as follows. There is a sand space *X*, a pit space *Y*, and a cost function *c* : *X* × *Y* → R that encapsulates how costly it is to move a unit of sand at *x* ∈ *X* to a location *y* ∈ *Y* in the pit. The sand distribution is represented by a measure μ on *X*, and the shape of the pit is described by a measure ν on *Y*. Our decision where to put each unit of sand can be thought of as a function *T* : *X* → *Y*, and it incurs a total transport cost of

$$C(T) = \int\_{\mathcal{X}} c(x, T(x)) \, \mathrm{d}\mu(x)\,.$$

Moreover, one cannot put all the sand at a single point *y* in the pit; it is not allowed to shrink or expand the sand. The map *T* must be mass-preserving: for any subset *B* ⊆


**Electronic Supplementary Material** The online version of this chapter (https://doi.org/10.1007/978-3-030-38438-8_1) contains supplementary material.

*Y* representing a region of the pit of volume ν(*B*), exactly that same volume of sand must go into *B*. The sand allocated to *B* is that lying in {*x* ∈ *X* : *T*(*x*) ∈ *B*} = *T*<sup>−1</sup>(*B*), so the mass preservation requirement is that μ(*T*<sup>−1</sup>(*B*)) = ν(*B*) for all *B* ⊆ *Y*. This condition will be denoted by *T*#μ = ν and in words: ν is the push-forward of μ under *T*, or *T* pushes μ forward to ν. To make the discussion mathematically rigorous, we must assume that *c* and *T* are measurable maps, and that μ(*T*<sup>−1</sup>(*B*)) = ν(*B*) for all measurable subsets *B* of *Y*. When the underlying measures are understood from the context, we call *T* a *transport map*. Specifying *B* = *Y*, we see that no such *T* can exist unless μ(*X*) = ν(*Y*); we shall assume that this quantity is finite, and by means of normalisation, that μ and ν are probability measures. In this setting, the Monge problem is to find the optimal transport map, that is, to solve
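For discrete measures, the push-forward can be computed mechanically; the following sketch (illustrative data, not from the text) computes *T*#μ for a uniform measure on four points.

```python
# Sketch: computing the push-forward T#mu of a discrete measure,
# using (T#mu)(B) = mu(T^{-1}(B)).
from collections import Counter

mass = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}  # mu uniform on {0, 1, 2, 3}

def T(x):
    return x % 2  # an illustrative measurable map

pushforward = Counter()
for x, m in mass.items():
    pushforward[T(x)] += m  # all mass at x is sent to T(x)

print(dict(pushforward))  # {0: 0.5, 1: 0.5}, i.e. T#mu is uniform on {0, 1}
```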

$$\inf\_{T:T\#\mu=\nu} C(T).$$

We assume throughout this book that *X* and *Y* are complete and separable metric spaces,<sup>1</sup> endowed with their *Borel* σ*-algebra*, which, we recall, is defined as the smallest σ-algebra containing the open sets. Measures defined on the Borel σ-algebra of *X* are called *Borel measures*. Thus, if μ is a Borel measure on *X*, then μ(*A*) is defined for any *A* that is open, or closed, or a countable union of closed sets, etc., and any continuous map on *X* is measurable. Similarly, we endow *Y* with its Borel σ-algebra. The product space *X* × *Y* is also complete and separable when endowed with its product topology; its Borel σ-algebra is generated by the product σ-algebra of those of *X* and *Y*; thus, any continuous cost function *c* : *X* × *Y* → R is measurable. It will henceforth always be assumed, without explicit further notice, that μ and ν are Borel measures on *X* and *Y*, respectively, and that the cost function is continuous and nonnegative.

It is quite natural to assume that the cost is an increasing function of the distance between *x* and *y*, such as a power function. More precisely, we assume that *Y* = *X* is a complete and separable metric space with metric *d*, and that

$$c(x, y) = d^p(x, y), \qquad p \ge 0, \quad x, y \in \mathcal{X}. \tag{1.1}$$

In particular, *c* is continuous, hence measurable, if *p* > 0. The limit case *p* = 0 yields the discontinuous function *c*(*x*, *y*) = **1**{*x* ≠ *y*}, which nevertheless remains measurable because the diagonal {(*x*, *x*) : *x* ∈ *X*} is measurable in *X* × *X*. Particular focus will be put on the quadratic case *p* = 2 (Sect. 1.6) and the linear case *p* = 1 (Sect. 1.8.2).

The problem introduced by Monge [95] is very difficult, mainly because the set of transport maps {*T* : *T*#μ = ν} is intractable. And, it may very well be empty: this will be the case if μ is a Dirac measure at some *x*<sub>0</sub> ∈ *X* (meaning that μ(*A*) = 1 if *x*<sub>0</sub> ∈ *A* and 0 otherwise) but ν is not. Indeed, in that case the set *B* = {*T*(*x*<sub>0</sub>)} satisfies μ(*T*<sup>−1</sup>(*B*)) = 1 > ν(*B*), so no such *T* can exist. This also shows that the problem is asymmetric in μ and ν: in the Dirac example, there always exists a map *T* such that *T*#ν = μ—the constant map *T*(*x*) = *x*<sub>0</sub> for all *x* is the unique such map. A less

<sup>1</sup> But see the bibliographical notes for some literature on more general spaces.

extreme situation occurs in the case of absolutely continuous measures. If μ and ν have densities *f* and *g* on R<sup>*d*</sup>, and *T* is continuously differentiable, then *T*#μ = ν if and only if for μ-almost all *x*

$$f(x) = g(T(x)) \, |\det \nabla T(x)|.$$

This is a highly non-linear equation in *T*, nowadays known as a particular case of a family of partial differential equations called *Monge–Ampère equations*. More than two centuries after the work of Monge, Caffarelli [32] cleverly used the theory of Monge–Ampère equations to show smoothness of transport maps (see Sect. 1.6.4).
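In one dimension the change-of-variables equation reads *f*(*x*) = *g*(*T*(*x*))|*T*′(*x*)|, and it can be checked numerically. The sketch below uses an illustrative choice (not from the text): μ uniform on (0,1) and *T*(*x*) = *x*², so that ν = *T*#μ has density *g*(*y*) = 1/(2√*y*) on (0,1).

```python
import numpy as np

x = np.linspace(0.01, 0.99, 99)           # grid inside (0, 1)

f = np.ones_like(x)                       # density of mu = Uniform(0, 1)
T = x ** 2                                # transport map T(x) = x^2
dT = 2.0 * x                              # |det grad T| reduces to |T'(x)|
g = lambda y: 1.0 / (2.0 * np.sqrt(y))    # density of nu = T#mu on (0, 1)

# Change of variables: f(x) = g(T(x)) |T'(x)| holds on the whole grid,
# since g(x^2) * 2x = (1 / (2x)) * 2x = 1 = f(x).
print(np.allclose(f, g(T) * np.abs(dT)))  # True
```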

As mentioned above, if μ = δ{*x*<sub>0</sub>} is a Dirac measure and ν is not, then no transport maps from μ to ν can exist, because all the mass at *x*<sub>0</sub> must be sent to the single point *T*(*x*<sub>0</sub>). In 1942, Kantorovich [77] proposed a relaxation of Monge's problem in which mass can be split. In other words, for each point *x* ∈ *X* one constructs a probability measure μ<sub>*x*</sub> that describes how the mass at *x* is split among different destinations. If μ<sub>*x*</sub> is a Dirac measure at some *y*, then all the mass at *x* is sent to *y*. The formal mathematical object to represent this idea is a probability measure π on the product space *X* × *Y* (which is *X*<sup>2</sup> in our particular setting). Here π(*A* × *B*) is the amount of sand transported from the subset *A* ⊆ *X* into the part of the pit represented by *B* ⊆ *Y*. The total mass sent from *A* is π(*A* × *Y*), and the total mass sent into *B* is π(*X* × *B*). Thus, π is mass-preserving if and only if

$$\begin{aligned} \pi(A \times \mathcal{Y}) &= \mu(A), \qquad A \subseteq \mathcal{X} \quad \text{Borel};\\ \pi(\mathcal{X} \times B) &= \nu(B), \qquad B \subseteq \mathcal{Y} \quad \text{Borel}.\end{aligned} \tag{1.2}$$

Probability measures satisfying (1.2) will be called *transference plans*, and the set of those will be denoted by Π(μ,ν). We also say that π is a *coupling* of μ and ν, and that μ and ν are the first and second *marginal distributions*, or simply *marginals*, of π. The total cost associated with π ∈ Π(μ,ν) is

$$C(\pi) = \int\_{\mathcal{X}\times\mathcal{Y}} c(x,y) \, \mathrm{d}\pi(x,y) \, .$$

In our setting of a complete separable metric space *X*, one can represent π as a collection of probability measures {π<sub>*x*</sub>}<sub>*x*∈*X*</sub> on *Y*, in the sense that for all measurable nonnegative *g*

$$\int\_{\mathcal{X}\times\mathcal{Y}} g(x, y) \, \mathrm{d}\pi(x, y) = \int\_{\mathcal{X}} \left[ \int\_{\mathcal{Y}} g(x, y) \, \mathrm{d}\pi\_x(y) \right] \mathrm{d}\mu(x).$$

The collection {π<sub>*x*</sub>} is that of the *conditional distributions*, and the iteration of integrals is called *disintegration*. For proofs of existence of conditional distributions, one can consult Dudley [47, Section 10.2] or Kallenberg [76, Chapter 5]. Conversely, the measure μ and the collection {π<sub>*x*</sub>} determine π uniquely, by choosing *g* to be indicator functions. An interpretation of these notions in terms of random variables will be given in Sect. 1.2.
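For a discrete coupling, disintegration amounts to normalising the rows of the matrix of π by the first marginal; the following sketch (arbitrary numbers, not from the text) checks that the iterated integral agrees with the joint one.

```python
import numpy as np

# An arbitrary coupling pi on a 3 x 3 grid: rows index x, columns index y.
pi = np.array([[0.2, 0.1, 0.0],
               [0.0, 0.3, 0.1],
               [0.1, 0.0, 0.2]])
mu = pi.sum(axis=1)                      # first marginal mu(x)
pi_cond = pi / mu[:, None]               # conditional distributions pi_x

g = np.arange(9.0).reshape(3, 3)         # a nonnegative test function g(x, y)

joint = (g * pi).sum()                   # integral of g against pi
iterated = (mu * (g * pi_cond).sum(axis=1)).sum()  # disintegrated version
print(np.isclose(joint, iterated))       # True
```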

The Kantorovich problem is to find the best transference plan, that is, to solve

$$\inf\_{\pi \in \Pi(\mu, \nu)} C(\pi).$$

The Kantorovich problem is a relaxation of the Monge problem, because to each transport map *T* one can associate a transference plan π = π<sub>*T*</sub> of the same total cost. To see this, choose the conditional distribution π<sub>*x*</sub> to be a Dirac at *T*(*x*). Disintegration then yields

$$C(\pi\_T) = \int\_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, \mathrm{d}\pi\_T(x, y) = \int\_{\mathcal{X}} \left[ \int\_{\mathcal{Y}} c(x, y) \, \mathrm{d}\pi\_x(y) \right] \mathrm{d}\mu(x) = \int\_{\mathcal{X}} c(x, T(x)) \, \mathrm{d}\mu(x) = C(T).$$

This choice of π satisfies (1.2) because π(*A* × *B*) = μ(*A* ∩ *T*<sup>−1</sup>(*B*)) and ν(*B*) = μ(*T*<sup>−1</sup>(*B*)) for all Borel *A* ⊆ *X* and *B* ⊆ *Y*.
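These identities are easy to verify in the discrete setting; the sketch below (illustrative points and permutation, not from the text) builds π<sub>*T*</sub> from a transport map on four uniform atoms and checks both the marginal constraints (1.2) and the equality of the two costs.

```python
import numpy as np

n = 4
x = np.array([0.0, 1.0, 2.0, 3.0])       # atoms of mu = nu (both uniform)
sigma = np.array([2, 0, 3, 1])           # T(x_i) = x_{sigma(i)}, illustrative

cost = (x[:, None] - x[None, :]) ** 2    # quadratic cost c(x_i, x_j)

pi = np.zeros((n, n))                    # pi_T: mass 1/n at each (x_i, T(x_i))
pi[np.arange(n), sigma] = 1.0 / n

print(pi.sum(axis=1), pi.sum(axis=0))    # both marginals are uniform: (1.2)
monge_cost = np.mean((x - x[sigma]) ** 2)          # C(T)
print(np.isclose((cost * pi).sum(), monge_cost))   # C(pi_T) = C(T): True
```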

Compared to the Monge problem, the relaxed problem has considerable advantages. Firstly, the set of transference plans is never empty: it always contains the product measure μ ⊗ ν defined by [μ ⊗ ν](*A* × *B*) = μ(*A*)ν(*B*). Secondly, both the objective function *C*(π) and the constraints (1.2) are linear in π, so the problem can be seen as infinite-dimensional linear programming. To be precise, we need to endow the space of measures with a linear structure, and this is done in the standard way: define the space *M*(*X*) of all finite signed Borel measures on *X*. This is a vector space with (μ<sub>1</sub> + αμ<sub>2</sub>)(*A*) = μ<sub>1</sub>(*A*) + αμ<sub>2</sub>(*A*) for α ∈ R, μ<sub>1</sub>, μ<sub>2</sub> ∈ *M*(*X*) and *A* ⊆ *X* Borel. The set of probability measures on *X* is denoted by *P*(*X*), and is a convex subset of *M*(*X*). The set Π(μ,ν) is then a convex subset of *P*(*X* × *Y*), and as *C*(π) is linear in π, the set of minimisers is a convex subset of Π(μ,ν). Thirdly, there is a natural symmetry between Π(μ,ν) and Π(ν,μ). If π belongs to the former and we define π̃(*B* × *A*) = π(*A* × *B*), then π̃ ∈ Π(ν,μ). If we set *c̃*(*y*, *x*) = *c*(*x*, *y*), then

$$C(\pi) = \int\_{\mathcal{X}\times\mathcal{Y}} c(x, y) \, \mathrm{d}\pi(x, y) = \int\_{\mathcal{Y}\times\mathcal{X}} \tilde{c}(y, x) \, \mathrm{d}\tilde{\pi}(y, x) = \tilde{C}(\tilde{\pi}).$$

In particular, when *X* = *Y* and *c* = *c*˜ is symmetric (as in (1.1)),

$$\inf\_{\pi \in \Pi(\mu, \nu)} C(\pi) = \inf\_{\tilde{\pi} \in \Pi(\nu, \mu)} \tilde{C}(\tilde{\pi}),$$

and π ∈ Π(μ,ν) is optimal if and only if its natural counterpart π˜ is optimal in Π(ν,μ). This symmetry will be fundamental in the definition of the Wasserstein distances in Chap. 2.

Perhaps most importantly, a minimiser for the Kantorovich problem exists under weak conditions. In order to show this, we first recall some definitions. Let *C<sub>b</sub>*(*X*) be the space of real-valued, continuous bounded functions on *X*. A sequence of probability measures {μ<sub>*n*</sub>} ⊆ *M*(*X*) is said to converge *weakly*<sup>2</sup> to μ ∈ *M*(*X*) if for all *f* ∈ *C<sub>b</sub>*(*X*), ∫*f* dμ<sub>*n*</sub> → ∫*f* dμ. To avoid confusion with other types of convergence, we will usually write μ<sub>*n*</sub> → μ weakly; in the rare cases where a symbol

<sup>2</sup> Weak convergence is sometimes called narrow convergence, weak\* convergence, or convergence in distribution.

is needed we shall use the notation μ<sub>*n*</sub> →<sup>*w*</sup> μ. Of course, if μ<sub>*n*</sub> → μ weakly and μ<sub>*n*</sub> ∈ *P*(*X*), then μ must be in *P*(*X*) too (this is seen by taking *f* ≡ 1 and by observing that ∫*f* dμ ≥ 0 if *f* ≥ 0).
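As a toy numerical illustration (not from the text): the measures μ<sub>*n*</sub> = δ{1/*n*} converge weakly to δ{0}, since ∫*f* dμ<sub>*n*</sub> = *f*(1/*n*) → *f*(0) for every bounded continuous *f*.

```python
import numpy as np

f = np.cos  # one bounded continuous test function

# Integrating f against mu_n = (Dirac at 1/n) just evaluates f at 1/n.
vals = [float(f(1.0 / n)) for n in (1, 10, 100, 1000)]
print(vals)                               # approaches f(0) = 1.0

print(abs(vals[-1] - f(0.0)) < 1e-5)      # True: f(1/1000) is close to f(0)
```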

A collection of probability measures 𝒦 is *tight* if for all ε > 0 there exists a compact set *K* such that inf<sub>μ∈𝒦</sub> μ(*K*) > 1 − ε. If 𝒦 is represented by a sequence {μ<sub>*n*</sub>}, then Prokhorov's theorem (Billingsley [24, Theorem 5.1]) states that a subsequence of {μ<sub>*n*</sub>} must converge weakly to some probability measure μ.

We are now ready to show that the Kantorovich problem admits a solution when *c* is continuous and nonnegative and *X* and *Y* are complete separable metric spaces. Let {π<sub>*n*</sub>} be a minimising sequence for *C*. Then, according to [24, Theorem 1.3], μ and ν must be tight. If *K*<sub>1</sub> and *K*<sub>2</sub> are compact with μ(*K*<sub>1</sub>), ν(*K*<sub>2</sub>) > 1 − ε, then *K*<sub>1</sub> × *K*<sub>2</sub> is compact and for all π ∈ Π(μ,ν), π(*K*<sub>1</sub> × *K*<sub>2</sub>) > 1 − 2ε. It follows that the entire collection Π(μ,ν) is tight, and by Prokhorov's theorem π<sub>*n*</sub> has a weak limit π after extraction of a subsequence. For any integer *K*, *c<sub>K</sub>*(*x*, *y*) = min(*c*(*x*, *y*), *K*) is a continuous bounded function, and

$$C(\pi\_n) = \int c(x, y) \, \mathrm{d}\pi\_n(x, y) \geq \int c\_K(x, y) \, \mathrm{d}\pi\_n(x, y) \to \int c\_K(x, y) \, \mathrm{d}\pi(x, y), \qquad n \to \infty.$$

By the monotone convergence theorem

$$\liminf\_{n \to \infty} C(\pi\_n) \ge \lim\_{K \to \infty} \int c\_K(x, y) \, \mathrm{d}\pi(x, y) = C(\pi) \qquad \text{if } \pi\_n \to \pi \text{ weakly.} \tag{1.3}$$

Since {π*<sup>n</sup>*} was chosen as a minimising sequence for *C*, π must be a minimiser, and existence is established.

As we have seen, the Kantorovich problem is a relaxation of the Monge problem, in the sense that

$$\inf\_{T:T\#\mu=\nu} C(T) = \inf\_{\pi\_T:T\#\mu=\nu} C(\pi\_T) \ge \inf\_{\pi \in \Pi(\mu,\nu)} C(\pi) = C(\pi^\*),$$

for some optimal π<sup>∗</sup>. If π<sup>∗</sup> = π<sub>*T*</sub> for some transport map *T*, then we say that the solution is induced from a transport map. This will happen in two different and important cases that are discussed in Sects. 1.3 and 1.6.1.

A remark about terminology is in order. Many authors talk about the *Monge– Kantorovich problem* or the *optimal transport(ation) problem*. More often than not, they refer to what we call here the Kantorovich problem. When one of the scenarios presented in Sects. 1.3 and 1.6.1 is considered, this does not result in ambiguity.

#### **1.2 Probabilistic Interpretation**

The preceding section was an analytic presentation of the Monge and the Kantorovich problems. It is illuminating, however, to also recast things in probabilistic terms, and this is the topic of this section.

A *random element* on a complete separable metric space (or any topological space) *X* is simply a measurable function *X* from some (generic) probability space (Ω, *F*, P) to *X* (with its Borel σ-algebra). The *probability law* (or *probability distribution*) of *X* is the probability measure μ<sub>*X*</sub> = *X*#P defined on the space *X*; this is the Borel measure satisfying μ<sub>*X*</sub>(*A*) = P(*X* ∈ *A*) for all Borel sets *A*.

Suppose that one is given two random elements *X* and *Y* taking values in *X* and *Y*, respectively, and a cost function *c* : *X* × *Y* → R. The Monge problem is to find a measurable function *T* such that *T*(*X*) has the same distribution as *Y*, and such that the expectation

$$C(T) = \int\_{\mathcal{X}} c(x, T(x)) \, \mathrm{d}\mu(x) = \int\_{\Omega} c[X(\omega), T(X(\omega))] \, \mathrm{d}\mathbb{P}(\omega) = \mathbb{E}\, c(X, T(X))$$

is minimised.

The Kantorovich problem is to find a joint distribution for the pair (*X*, *Y*) whose marginals are the original distributions of *X* and *Y*, respectively, and such that the probability law π = (*X*, *Y*)#P minimises the expectation

$$C(\pi) = \int\_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, \mathrm{d}\pi(x, y) = \int\_{\Omega} c[X(\omega), Y(\omega)] \, \mathrm{d}\mathbb{P}(\omega) = \mathbb{E}\_{\pi}\, c(X, Y).$$

Any such joint distribution is called a coupling of *X* and *Y*. Of course, (*X*, *T*(*X*)) is a coupling when *T*(*X*) has the same distribution as *Y*. The measures π<sub>*x*</sub> in the previous section are then interpreted as the conditional distribution of *Y* given *X* = *x*.

Consider now the important case where *X* = *Y* = R<sup>*d*</sup>, *c*(*x*, *y*) = ‖*x* − *y*‖<sup>2</sup>, and *X* and *Y* are square integrable random vectors (E‖*X*‖<sup>2</sup> + E‖*Y*‖<sup>2</sup> < ∞). Let *A* and *B* be the covariance matrices of *X* and *Y*, respectively, and notice that the covariance matrix of a coupling π must have the form

$$C = \begin{pmatrix} A & V \\ V^t & B \end{pmatrix}$$

for a *d* × *d* matrix *V*. The covariance matrix of the difference *X* − *Y* is

$$
\begin{pmatrix} I\_d & -I\_d \end{pmatrix} \begin{pmatrix} A & V \\ V^t & B \end{pmatrix} \begin{pmatrix} I\_d \\ -I\_d \end{pmatrix} = A + B - V^t - V,
$$

so that

$$\mathbb{E}\_{\pi} c(X, Y) = \mathbb{E}\_{\pi} \|X - Y\|^2 = \|\mathbb{E}X - \mathbb{E}Y\|^2 + \text{tr}[A + B - V^t - V].$$

Since only *V* depends on the coupling π, the problem is equivalent to that of maximising the trace of *V*, the cross-covariance matrix between *X* and *Y*. This must be done subject to the constraint that a coupling π with covariance matrix *C* exists; in particular, *C* has to be positive semidefinite.
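The covariance computation above can be checked numerically; in the sketch below (random but reproducible illustrative data) a positive semidefinite joint covariance is split into blocks and the identity Cov(*X* − *Y*) = *A* + *B* − *V*<sup>t</sup> − *V* is verified.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
M = rng.standard_normal((2 * d, 2 * d))
C = M @ M.T                              # a PSD joint covariance of (X, Y)
A, B, V = C[:d, :d], C[d:, d:], C[:d, d:]

D = np.hstack([np.eye(d), -np.eye(d)])   # X - Y = D @ (X, Y)
cov_diff = D @ C @ D.T                   # covariance of the difference

print(np.allclose(cov_diff, A + B - V.T - V))  # True
```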

#### **1.3 The Discrete Uniform Case**

There is a special case in which the Monge–Kantorovich problem reduces to a finite combinatorial problem. Although it may seem at first to be an oversimplification of the original problem, it is of importance in practice, because arbitrary measures can be approximated by discrete measures by means of the strong law of large numbers. Moreover, the discrete case is important in theory as well, as a motivating example for the Kantorovich duality (Sect. 1.4) and the property of cyclical monotonicity (Sect. 1.7).

Suppose that μ and ν are each uniform on *n* distinct points:

$$\mu = \frac{1}{n} \left( \delta \{ \mathbf{x}\_1 \} + \dots + \delta \{ \mathbf{x}\_n \} \right), \qquad \nu = \frac{1}{n} \left( \delta \{ \mathbf{y}\_1 \} + \dots + \delta \{ \mathbf{y}\_n \} \right).$$

The only relevant costs are *c<sub>ij</sub>* = *c*(*x<sub>i</sub>*, *y<sub>j</sub>*), the collection of which can be represented by an *n* × *n* matrix **C**. Transport maps *T* are associated with *permutations* in *S<sub>n</sub>*, the set of all bijective functions from {1,...,*n*} to itself: given σ ∈ *S<sub>n</sub>*, a transport map can be constructed by defining *T*(*x<sub>i</sub>*) = *y*<sub>σ(*i*)</sub>. If σ is not a permutation, then *T* will not be a transport map from μ to ν. Transference plans π are equivalent to *n* × *n* matrices *M* with coordinates *M<sub>ij</sub>* = π({(*x<sub>i</sub>*, *y<sub>j</sub>*)}); this is the amount of mass sent from *x<sub>i</sub>* to *y<sub>j</sub>*. In order for π to be a transference plan, it must be that ∑<sub>*j*</sub> *M<sub>ij</sub>* = 1/*n* for all *i* and ∑<sub>*i*</sub> *M<sub>ij</sub>* = 1/*n* for all *j*, and in addition *M* must be nonnegative. In other words, the matrix *M*′ = *nM* belongs to *B<sub>n</sub>*, the set of bistochastic matrices of order *n*, defined as *n* × *n* matrices *M*′ satisfying

$$\sum\_{j=1}^{n} M'\_{ij} = 1, \quad i = 1, \ldots, n; \qquad \sum\_{i=1}^{n} M'\_{ij} = 1, \quad j = 1, \ldots, n; \qquad M'\_{ij} \ge 0.$$

The Monge problem is the combinatorial optimisation problem over permutations

$$\inf\_{\sigma \in \mathcal{S}\_n} C(\sigma) = \frac{1}{n} \inf\_{\sigma \in \mathcal{S}\_n} \sum\_{i=1}^n c\_{i, \sigma(i)},$$

and the Kantorovich problem is the linear program

$$\inf\_{M:\, nM \in B\_n} \sum\_{i,j=1}^n c\_{ij} M\_{ij} = \inf\_{M \in B\_n/n} C(M).$$

If σ is a permutation, then one can define *M* = *M*(σ) by *M<sub>ij</sub>* = 1/*n* if *j* = σ(*i*) and 0 otherwise. Then *M* ∈ *B<sub>n</sub>*/*n* and *C*(*M*) = *C*(σ). Such an *M* (or, more precisely, *nM*) is called a *permutation matrix*.

The Kantorovich problem is a linear program with *n*<sup>2</sup> variables and 2*n* constraints. It must have a solution because *B<sub>n</sub>* (hence *B<sub>n</sub>*/*n*) is a compact (nonempty) set in R<sup>*n*²</sup> and the objective function is linear in the matrix elements, hence continuous. (This property is independent of the possibly infinite-dimensional spaces *X*

and *Y* in which the points lie.) The Monge problem also admits a solution because *S<sub>n</sub>* is a finite set. To see that the two problems are essentially the same, we need to introduce the following notion. If *B* is a convex set, then *x* ∈ *B* is an *extremal point* of *B* if it cannot be written as a convex combination *tz* + (1 − *t*)*y* for some *t* ∈ (0, 1) and distinct points *y*, *z* ∈ *B*. It is well known (Luenberger and Ye [89, Section 2.5]) that there exists an optimal solution that is extremal, so it becomes relevant to identify the extremal points of *B<sub>n</sub>*. It is fairly clear that each permutation matrix is extremal in *B<sub>n</sub>*; the less obvious converse is known as Birkhoff's theorem, a proof of which can be found, for instance, at the end of the introduction in Villani [124] or (in a different terminology) in Luenberger and Ye [89, Section 6.5]. Thus, we have:

**Proposition 1.3.1 (Solution of Discrete Problem)** *There exists* σ ∈ *S*<sub>*n*</sub> *such that M*(σ) *minimises C*(*M*) *over B*<sub>*n*</sub>/*n. Furthermore, if* {σ<sub>1</sub>,...,σ<sub>*k*</sub>} *is the set of optimal permutations, then the set of optimal matrices is the convex hull of* {*M*(σ<sub>1</sub>),..., *M*(σ<sub>*k*</sub>)}*. In particular, if* σ *is the unique optimal permutation, then M*(σ) *is the unique optimal matrix.*

Thus, in the discrete case, the Monge and the Kantorovich problems coincide. One can of course use the simplex method [89, Chapter 3] to solve the linear program, but there are *n*! vertices, and there is in principle no guarantee that the simplex method solves the problem efficiently. However, the constraint matrix has a very specific form (it contains only zeroes and ones, and is totally unimodular), so specialised algorithms for this problem exist. One of them is the Hungarian algorithm of Kuhn [85], or its variant by Munkres [96] that has a worst-case computational complexity of at most *O*(*n*<sup>4</sup>). Another alternative is the class of network flow algorithms described in [89, Chapter 6]. In particular, the algorithm of Edmonds and Karp [50] has a complexity of at most *O*(*n*<sup>3</sup> log *n*). This monograph does not focus on computational aspects of optimal transport. This is a fascinating and very active area of contemporary research, and readers are directed to Peyré and Cuturi [103].
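In practice, the discrete uniform problem is exactly the linear assignment problem. As a sketch (not the authors' code), small instances can be solved with `scipy.optimize.linear_sum_assignment`, which implements a Hungarian-type algorithm; the point configurations below are an illustrative assumption:

```python
# Discrete uniform optimal transport as an assignment problem: a sketch.
# linear_sum_assignment solves min_sigma sum_i C[i, sigma(i)] exactly.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6
x = rng.normal(size=(n, 2))   # support of mu, mass 1/n per point
y = rng.normal(size=(n, 2))   # support of nu, mass 1/n per point

# Cost matrix c_ij = |x_i - y_j|^2
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

rows, sigma = linear_sum_assignment(C)   # optimal permutation sigma
monge_cost = C[rows, sigma].sum() / n    # (1/n) * sum_i c_{i, sigma(i)}
```

For such small *n*, one can brute-force all *n*! permutations and check that the two costs agree, which is exactly the content of Proposition 1.3.1.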

**Remark 1.3.2** *The special case described here could have been more precisely called "the discrete uniform case on the same number of points", as "the discrete case" could refer to any two finitely supported measures* μ *and* ν*. In the Monge context, the setup discussed here is the most interesting case, see page 8 in the supplement for more details.*

#### **1.4 Kantorovich Duality**

The discrete case of Sect. 1.3 is an example of a linear program and thus enjoys a rich duality theory (Luenberger and Ye [89, Chapter 4]). The general Kantorovich problem is an infinite-dimensional linear program, and under mild assumptions admits similar duality.

#### *1.4.1 Duality in the Discrete Uniform Case*

We can represent any matrix *M* as a vector in R*n*<sup>2</sup> , say **M**, by enumeration of the elements row by row. If *nM* is bistochastic, i.e., *M* ∈ *Bn*/*n*, then the 2*n* constraints can be represented in a (2*n*)×*n*<sup>2</sup> matrix *<sup>A</sup>*. For instance, if *<sup>n</sup>* <sup>=</sup> 3, then

$$A = \begin{pmatrix} 1 & 1 & 1 & & & & & & \\ & & & 1 & 1 & 1 & & & \\ & & & & & & 1 & 1 & 1 \\ 1 & & & 1 & & & 1 & & \\ & 1 & & & 1 & & & 1 & \\ & & 1 & & & 1 & & & 1 \end{pmatrix} \in \mathbb{R}^{6 \times 9}.$$

For general *n*, the constraints read *A***M** = *n*<sup>−1</sup>(1,...,1) ∈ R<sup>2*n*</sup>, and *A* takes the form

$$A = \begin{pmatrix} \mathbf{1}\_n \\ & \mathbf{1}\_n \\ & & \ddots \\ & & & \mathbf{1}\_n \\ I\_n & I\_n & \dots & I\_n \end{pmatrix} \in \mathbb{R}^{2n \times n^2}, \qquad \mathbf{1}\_n = (1, \dots, 1) \in \mathbb{R}^n,$$

with *In* the *n*×*n* identity matrix. Thus, the problem can be written

$$\min\_{\mathbf{M}} \mathbf{C}^{t} \mathbf{M} \qquad \text{subject to} \qquad A\mathbf{M} = \frac{1}{n} (1, \dots, 1) \in \mathbb{R}^{2n}; \quad \mathbf{M} \ge 0.$$

The last constraint is to be interpreted coordinate-wise: all the elements of **M** must be nonnegative. The *dual problem* is constructed by introducing one variable for each row of *A*, transposing the constraint matrix, and interchanging the roles of the objective vector **C** and the constraint vector *b* = *n*<sup>−1</sup>(1,...,1). Call the new variables *p*<sub>1</sub>,..., *p*<sub>*n*</sub> and *q*<sub>1</sub>,...,*q*<sub>*n*</sub>, and notice that each column of *A* corresponds to exactly one *p*<sub>*i*</sub> and one *q*<sub>*j*</sub>, and that the *n*<sup>2</sup> columns exhaust all possibilities. Hence, the dual problem is

$$\max\_{p,q \in \mathbb{R}^n} b^t \binom{p}{q} = \frac{1}{n} \sum\_{i=1}^n p\_i + \frac{1}{n} \sum\_{j=1}^n q\_j \qquad \text{subject to} \quad p\_i + q\_j \le c\_{ij}, \quad i, j = 1, \dots, n. \tag{1.4}$$

In the context of duality, one uses the terminology *primal problem* for the original optimisation problem. *Weak duality* states that if **M** and (*p*,*q*) satisfy the respective constraints, then

$$b^{t}\binom{p}{q} = \sum\_{i} p\_{i}\frac{1}{n} + \sum\_{j} q\_{j}\frac{1}{n} = \sum\_{i,j} (p\_{i} + q\_{j})M\_{ij} \le \sum\_{i,j} \mathbf{C}\_{ij}M\_{ij} = \mathbf{C}^{t}\mathbf{M}.$$

In particular, if equality holds, then **M** is primal optimal and (*p*,*q*) is dual optimal. *Strong duality* is the nontrivial assertion that there exist feasible **M**<sup>∗</sup> and (*p*<sup>∗</sup>,*q*<sup>∗</sup>) satisfying $\mathbf{C}^t\mathbf{M}^* = b^t\binom{p^*}{q^*}$.
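Strong duality can be checked numerically on a small instance. The sketch below (using `scipy.optimize.linprog`; the variable names and the random cost matrix are illustrative assumptions) builds the constraint matrix *A* for *n* = 3 and verifies that the primal and dual optimal values coincide:

```python
# Primal/dual check of the discrete Kantorovich linear program: a sketch.
import numpy as np
from scipy.optimize import linprog

n = 3
rng = np.random.default_rng(1)
C = rng.random((n, n))

# A has one row per source constraint (sum_j M_ij = 1/n) and one row
# per target constraint (sum_i M_ij = 1/n); columns are indexed by (i, j).
A = np.zeros((2 * n, n * n))
for i in range(n):
    for j in range(n):
        A[i, i * n + j] = 1.0
        A[n + j, i * n + j] = 1.0
b = np.full(2 * n, 1.0 / n)

# Primal: minimise C . M over nonnegative M with the marginal constraints.
primal = linprog(C.ravel(), A_eq=A, b_eq=b, bounds=(0, None))

# Dual: maximise b^t (p, q) subject to p_i + q_j <= c_ij; linprog
# minimises, so we negate the objective.
dual = linprog(-b, A_ub=A.T, b_ub=C.ravel(), bounds=(None, None))
duality_gap = primal.fun - (-dual.fun)
```

The duality gap comes out as zero (up to solver tolerance), in line with (1.4).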

#### *1.4.2 Duality in the General Case*

The vectors **C** and **M** were obtained from the cost function *c* and the transference plan π as *C*<sub>*ij*</sub> = *c*(*x*<sub>*i*</sub>, *y*<sub>*j*</sub>) and *M*<sub>*ij*</sub> = π({(*x*<sub>*i*</sub>, *y*<sub>*j*</sub>)}). Similarly, we can view the vectors *p* and *q* as restrictions of functions ϕ : *X* → R and ψ : *Y* → R of the form *p*<sub>*i*</sub> = ϕ(*x*<sub>*i*</sub>) and *q*<sub>*j*</sub> = ψ(*y*<sub>*j*</sub>). The constraint vector *b* = *n*<sup>−1</sup>(**1**<sub>*n*</sub>,**1**<sub>*n*</sub>) can be written as *b*<sub>*i*</sub> = μ({*x*<sub>*i*</sub>}) and *b*<sub>*n*+*j*</sub> = ν({*y*<sub>*j*</sub>}). In this formulation, the constraint *p*<sub>*i*</sub> + *q*<sub>*j*</sub> ≤ *c*<sub>*ij*</sub> reads (ϕ,ψ) ∈ Φ<sub>*c*</sub> with

$$\Phi\_{c} = \left\{ (\varphi, \psi) \in L\_1(\mu) \times L\_1(\nu) : \varphi(x) + \psi(y) \le c(x, y) \text{ for all } x, y \right\},$$

and the dual problem (1.4) becomes

$$\sup\_{(\varphi, \psi) \in L\_1(\mu) \times L\_1(\nu)} \left[ \int\_{\mathcal{X}} \varphi(x) \, \mathrm{d}\mu(x) + \int\_{\mathcal{Y}} \psi(y) \, \mathrm{d}\nu(y) \right] \qquad \text{subject to} \quad (\varphi, \psi) \in \Phi\_{c}.$$

Simple measure theory shows that the set constraints (1.2) defining the transference plans set Π(μ,ν) are equivalent to functional constraints. For future reference, we state this formally as:

**Lemma 1.4.1 (Functional Constraints)** *Let* μ *and* ν *be probability measures. Then* π ∈ Π(μ,ν) *if and only if for all integrable functions* ϕ ∈ *L*<sub>1</sub>(μ) *and* ψ ∈ *L*<sub>1</sub>(ν)*,*

$$\int\_{\mathcal{X}\times\mathcal{Y}} [\varphi(x) + \psi(y)] \, \mathrm{d}\pi(x, y) = \int\_{\mathcal{X}} \varphi(x) \, \mathrm{d}\mu(x) + \int\_{\mathcal{Y}} \psi(y) \, \mathrm{d}\nu(y).$$

The proof follows from the fact that (1.2) yields the above equality when ϕ and ψ are indicator functions. One then uses linearity and approximations to deduce the result.

Weak duality follows immediately from Lemma 1.4.1. For if π ∈ Π(μ,ν) and (ϕ,ψ) ∈ Φ*<sup>c</sup>*, then

$$\int\_{\mathcal{X}} \varphi(x) \, \mathrm{d}\mu(x) + \int\_{\mathcal{Y}} \psi(y) \, \mathrm{d}\nu(y) = \int\_{\mathcal{X} \times \mathcal{Y}} [\varphi(x) + \psi(y)] \, \mathrm{d}\pi(x, y) \le C(\pi).$$

Strong duality can be stated in the following form:

**Theorem 1.4.2 (Kantorovich Duality)** *Let* μ *and* ν *be probability measures on complete separable metric spaces <sup>X</sup> and <sup>Y</sup> , respectively, and let c* : *<sup>X</sup>* <sup>×</sup>*<sup>Y</sup>* <sup>→</sup> <sup>R</sup><sup>+</sup> *be a measurable function. Then*

$$\inf\_{\pi \in \Pi(\mu, \nu)} \int\_{\mathcal{X} \times \mathcal{Y}} c \, \mathrm{d}\pi = \sup\_{(\varphi, \psi) \in \Phi\_c} \left[ \int\_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int\_{\mathcal{Y}} \psi \, \mathrm{d}\nu \right].$$

See the Bibliographical Notes for other versions of the duality.

When the cost function is continuous, or more generally, a countable supremum of continuous functions, the infimum is attained (see (1.3)). The existence of maximisers (ϕ,ψ) is more delicate and requires a finiteness condition, as formulated in Proposition 1.8.1 below.

The next sections are dedicated to more concrete examples that will be used through the rest of the book.

#### **1.5 The One-Dimensional Case**

When *X* = *Y* = R, the Monge–Kantorovich problem has a particularly simple structure, because the class of "nice" transport maps contains at most a single element. Identify μ,ν <sup>∈</sup> *<sup>P</sup>*(R) with their cumulative distribution functions *<sup>F</sup>* and *<sup>G</sup>* defined by

$$F(t) = \mu((-\infty, t]), \qquad G(t) = \nu((-\infty, t]), \qquad t \in \mathbb{R}.$$

Let the cost function be (momentarily) quadratic: *c*(*x*, *y*) = |*x* − *y*|<sup>2</sup>/2. Since for *x*<sub>1</sub> ≤ *x*<sub>2</sub> and *y*<sub>1</sub> ≤ *y*<sub>2</sub>

$$c(y\_2, x\_1) + c(y\_1, x\_2) - c(y\_1, x\_1) - c(y\_2, x\_2) = (x\_2 - x\_1)(y\_2 - y\_1) \ge 0,$$

it seems natural to expect the optimal transport map to be monotonically increasing. It turns out that, on the real line, there is at most one such transport map: if *T* is increasing and *T*#μ = ν, then for all *<sup>t</sup>* <sup>∈</sup> <sup>R</sup>

$$G(t) = \nu((-\infty, t]) = \mu((-\infty, T^{-1}(t)]) = F(T^{-1}(t)).$$

If *t* = *T*(*x*), then the above equation reduces to *T*(*x*) = *G*<sup>−1</sup>(*F*(*x*)). This formula determines *T* uniquely, and has an interesting probabilistic interpretation: it is well known that if *X* is a random variable with *continuous* distribution function *F*, then *F*(*X*) follows a uniform distribution on (0,1). Conversely, if *U* follows a uniform distribution, *G* is any distribution function, and

$$G^{-1}(u) = \inf G^{-1}([u, 1]) = \inf \{ x \in \mathbb{R} : G(x) \ge u \}, \qquad 0 < u < 1,$$

is the *quantile function* associated with *G*, then the random variable *G*<sup>−1</sup>(*U*) has distribution function *G*. We say that *G*<sup>−1</sup> is the *left-continuous inverse* of *G*. In terms of push-forward maps, we can write *F*#μ = Leb|<sub>[0,1]</sub> and *G*<sup>−1</sup>#Leb|<sub>[0,1]</sub> = ν, with Leb standing for Lebesgue measure, restricted here to the interval [0,1]. Consequently, if *F* is continuous and *G* is arbitrary, then *T* = *G*<sup>−1</sup> ∘ *F* satisfies *T*#μ = ν; we can view *T* as pushing μ forward to ν in two steps: first, μ is pushed forward to Leb|<sub>[0,1]</sub>, and then Leb|<sub>[0,1]</sub> is pushed forward to ν.
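The left-continuous inverse is easy to illustrate numerically. In the sketch below (the three-atom distribution and all names are our own illustrative choices, not from the text), inverse-CDF sampling pushes the uniform distribution forward to *G*:

```python
# Inverse-CDF sampling: G^{-1}(U) has distribution function G. A sketch
# for a discrete G supported on three atoms; names are illustrative.
import numpy as np

atoms = np.array([0.0, 1.0, 3.0])
probs = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(probs)               # values of G at the atoms

def G_inv(u):
    # left-continuous inverse: inf{x : G(x) >= u}
    return atoms[np.searchsorted(cdf, u, side="left")]

rng = np.random.default_rng(4)
u = rng.uniform(size=100_000)
samples = G_inv(u)                   # empirically distributed as G
```

Note that `G_inv` jumps at the values of the cumulative distribution function, exactly as the left-continuity in the definition prescribes.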

Using the change of variables formula, we see that the total cost of *T* is

$$C(T) = \int\_{\mathbb{R}} |G^{-1}(F(x)) - x|^2 \, \mathrm{d}\mu(x) = \int\_0^1 |G^{-1}(u) - F^{-1}(u)|^2 \, \mathrm{d}u.$$

If *F* is discontinuous, then *F*#μ is not Lebesgue measure, and *T* is not necessarily defined. But there will exist an optimal transference plan π ∈ Π(μ,ν) that is monotone in the following sense: there exists a set Γ ⊂ R<sup>2</sup> such that π(Γ) = 1 and whenever (*x*<sub>1</sub>, *y*<sub>1</sub>), (*x*<sub>2</sub>, *y*<sub>2</sub>) ∈ Γ,

$$|y\_2 - x\_1|^2 + |y\_1 - x\_2|^2 - |y\_1 - x\_1|^2 - |y\_2 - x\_2|^2 \ge 0.$$

Thus, mass at *x*<sub>1</sub> and *x*<sub>2</sub> can be split if need be, but in a monotone way. For example, suppose that μ puts mass 1/2 at each of *x*<sub>1</sub> = −1 and *x*<sub>2</sub> = 1, and that ν is uniform on [−1,1]. Then the optimal transference plan spreads the mass of *x*<sub>1</sub> uniformly on [−1,0], and the mass of *x*<sub>2</sub> uniformly on [0,1]. This is a particular case of the cyclical monotonicity that will be discussed in Sect. 1.7.

Elementary calculations show that the inequality

$$c(y\_2, x\_1) + c(y\_1, x\_2) - c(y\_1, x\_1) - c(y\_2, x\_2) \ge 0, \qquad x\_1 \le x\_2; \quad y\_1 \le y\_2,$$

holds more generally than for the quadratic cost *c*(*x*, *y*) = |*x*−*y*|<sup>2</sup>. Specifically, it suffices that *c*(*x*, *y*) = *h*(|*x*−*y*|) with *h* convex on R<sub>+</sub>.

Since any distribution can be approximated by continuous distributions, in view of the above discussion, the following result from Villani [124, Theorem 2.18] should not be too surprising.

**Theorem 1.5.1 (Optimal Transport in** R**)** *Let* μ,ν <sup>∈</sup> *<sup>P</sup>*(R) *with distribution functions F and G, respectively, and let the cost function be of the form c*(*x*, *y*) = *h*(|*x*−*y*|) *with h convex and nonnegative. Then*

$$\inf\_{\pi \in \Pi(\mu, \nu)} C(\pi) = \int\_0^1 h(G^{-1}(u) - F^{-1}(u)) \, \mathrm{d}u.$$

*If the infimum is finite and h is strictly convex, then the optimal transference plan is unique. Furthermore, if F is continuous, then the infimum is attained by the transport map T* = *G*<sup>−1</sup> ∘ *F.*

The prototypical choice for *h* is *h*(*z*) = |*z*|<sup>*p*</sup> with *p* > 1. This result allows in particular a direct evaluation of the Wasserstein distances for measures on the real line (see Chap. 2).
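For empirical measures on equally many, equally weighted points, Theorem 1.5.1 with *h*(*z*) = *z*<sup>2</sup> reduces the computation to sorting: the quantile coupling matches order statistics. A minimal sketch (the function name is our own):

```python
# 1-D quadratic optimal transport between empirical measures: the
# monotone (quantile) coupling matches i-th order statistics. A sketch.
import numpy as np

def w2_empirical(x, y):
    # squared cost of T = G^{-1} o F for two n-point empirical measures
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return float(np.mean((x - y) ** 2))
```

For instance, `w2_empirical([0, 1, 2], [10, 11, 12])` pairs each point with its order-statistic partner, and `w2_empirical([1, 0], [0, 1])` is zero because the two empirical measures coincide.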

Note that no regularity is needed in order that the optimal transference plan be unique, unlike in higher dimensions (compare Theorem 1.8.2). The structure of solutions in the concave case (0 < *p* < 1) is more complicated, see McCann [94].


When *p* = 1, the cost function is convex but not strictly so, and solutions will not be unique. However, the total cost in Theorem 1.5.1 admits another representation that is often more convenient.

**Proposition 1.5.2 (Quantiles and Distribution Functions)** *If F and G are distribution functions, then*

$$\int\_0^1 |G^{-1}(u) - F^{-1}(u)| \, \mathrm{d}u = \int\_{\mathbb{R}} |G(x) - F(x)| \, \mathrm{d}x.$$

The proof is a simple application of Fubini's theorem; see page 13 in the supplement.

**Corollary 1.5.3** *If c*(*x*, *y*) = |*x*−*y*|*, then under the conditions of Theorem 1.5.1*

$$\inf\_{\pi \in \Pi(\mu, \nu)} C(\pi) = \int\_{\mathbb{R}} |G(x) - F(x)| \, \mathrm{d}x.$$
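The two formulas of Proposition 1.5.2 can be checked against each other numerically. The sketch below (our own example; the grid integration is a crude approximation of the right-hand side) compares them on two three-point empirical measures:

```python
# Check that the quantile and distribution-function formulas for the
# total cost with c(x, y) = |x - y| agree on a discrete example. A sketch.
import numpy as np

x = np.array([0.0, 1.0, 5.0])
y = np.array([2.0, 2.5, 7.0])

# Quantile form: the integral over u becomes a mean over order statistics.
w1_quantile = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Distribution-function form: integrate |F - G| over a fine grid.
grid = np.linspace(-1.0, 9.0, 200001)
F = np.searchsorted(np.sort(x), grid, side="right") / x.size
G = np.searchsorted(np.sort(y), grid, side="right") / y.size
w1_cdf = np.sum(np.abs(F - G)[:-1] * np.diff(grid))
```

Both quantities come out as 11/6, up to the discretisation error of the grid.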

#### **1.6 Quadratic Cost**

This section is devoted to the specific cost function

$$c(x, y) = \frac{\|x - y\|^2}{2}, \qquad x, y \in \mathcal{X},$$

where *X* is a separable Hilbert space. This cost is popular in applications, and leads to a lucid and elegant theory. The factor of 1/2 does not affect the minimising coupling π and leads to cleaner expressions. (It does affect the optimal dual pair, but in an obvious way.)

#### *1.6.1 The Absolutely Continuous Case*

We begin with the Euclidean case, where *X* = *Y* = (R<sup>*d*</sup>, ‖·‖) is endowed with the Euclidean norm, and use the Kantorovich duality to obtain characterisations of optimal maps.

Since the dual objective function to be maximised

$$\int\_{\mathbb{R}^d} \varphi \, \mathrm{d}\mu + \int\_{\mathbb{R}^d} \psi \, \mathrm{d}\nu$$

is increasing in ϕ and ψ, one should seek functions that take values as large as possible subject to the constraint ϕ(*x*) + ψ(*y*) ≤ ‖*x*−*y*‖<sup>2</sup>/2. Suppose that an oracle tells us that some ϕ ∈ *L*<sub>1</sub>(μ) is a good candidate. Then the largest possible ψ satisfying (ϕ,ψ) ∈ Φ<sub>*c*</sub> is

$$\psi(y) = \inf\_{x \in \mathbb{R}^d} \left[ \frac{\|x - y\|^2}{2} - \varphi(x) \right] = \frac{\|y\|^2}{2} + \inf\_{x \in \mathbb{R}^d} \left[ \frac{\|x\|^2}{2} - \varphi(x) - \langle x, y \rangle \right].$$

In other words,

$$\tilde{\psi}(y) := \frac{\|y\|^2}{2} - \psi(y) = \sup\_{x \in \mathbb{R}^d} \left[ \langle x, y \rangle - \tilde{\varphi}(x) \right], \qquad \tilde{\varphi}(x) = \frac{\|x\|^2}{2} - \varphi(x).$$

As a supremum of affine functions (in *y*), $\tilde{\psi}$ enjoys some useful properties. We remind the reader that a function *f* : *X* → R ∪ {∞} is *convex* if *f*(*tx* + (1−*t*)*y*) ≤ *t f*(*x*) + (1−*t*)*f*(*y*) for all *x*, *y* ∈ *X* and *t* ∈ [0,1]. It is *lower semicontinuous* if for all *x* ∈ *X*, *f*(*x*) ≤ liminf<sub>*y*→*x*</sub> *f*(*y*). Affine functions are convex and lower semicontinuous, and it is straightforward from the definitions that both convexity and lower semicontinuity are preserved under the supremum operation. Thus, the function $\tilde{\psi}$ is convex and lower semicontinuous. In particular, it is Borel measurable due to the following characterisation: *f* is lower semicontinuous if and only if {*x* : *f*(*x*) ≤ α} is a closed set for all α ∈ R.

From the preceding subsection, we now know that optimal dual functions ϕ and ψ must take the form of the difference between ‖·‖<sup>2</sup>/2 and a convex function. Given the vast wealth of knowledge on convex functions (Rockafellar [113]), it will be convenient to work with $\tilde{\varphi}$ and $\tilde{\psi}$, and to assume that $\tilde{\psi} = (\tilde{\varphi})^*$, where

$$f^\*(\mathbf{y}) = \sup\_{\mathbf{x} \in \mathbb{R}^d} [\langle \mathbf{x}, \mathbf{y} \rangle - f(\mathbf{x})], \qquad \mathbf{y} \in \mathbb{R}^d$$

is the *Legendre transform* of *f* ([113, Chapter 26]; [124, Chapter 2]), and is of fundamental importance in convex analysis. Now by symmetry, one can also replace $\tilde{\varphi}$ by $(\tilde{\psi})^* = (\tilde{\varphi})^{**}$, so it is reasonable to expect that an optimal dual pair should take the form $(\|\cdot\|^2/2 - \tilde{\varphi},\ \|\cdot\|^2/2 - (\tilde{\varphi})^*)$, with $\tilde{\varphi}$ convex and lower semicontinuous.
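The Legendre transform can be approximated crudely over a grid, which is enough to illustrate it. In the sketch below (grid sizes and names are our own), *f*(*x*) = *x*<sup>2</sup>/2 is self-dual, and the grid computation reproduces this up to discretisation error:

```python
# Grid-based Legendre transform f*(y) = sup_x [x*y - f(x)] in one
# dimension; for f(x) = x^2/2 the supremum is attained at x = y,
# so f* should coincide with f. A sketch.
import numpy as np

xs = np.linspace(-10.0, 10.0, 4001)
f_vals = xs ** 2 / 2

def legendre(y):
    # supremum over the grid of x*y - f(x)
    return np.max(xs * y - f_vals)

ys = np.linspace(-3.0, 3.0, 13)
f_star = np.array([legendre(y) for y in ys])
# f_star is approximately ys**2 / 2
```

Since the supremum is taken over a finite grid, `f_star` slightly underestimates the true transform, but the error is of the order of the squared grid spacing.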

The alternative representation of the dual objective value as

$$\int\_{\mathbb{R}^d} \varphi \, \mathrm{d}\mu + \int\_{\mathbb{R}^d} \psi \, \mathrm{d}\nu = \frac{1}{2} \int\_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\mu(x) + \frac{1}{2} \int\_{\mathbb{R}^d} \|y\|^2 \, \mathrm{d}\nu(y) - \int\_{\mathbb{R}^d} \tilde{\varphi} \, \mathrm{d}\mu - \int\_{\mathbb{R}^d} \tilde{\psi} \, \mathrm{d}\nu$$

is valid under the integrability condition

$$\int\_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\mu(x) + \int\_{\mathbb{R}^d} \|y\|^2 \, \mathrm{d}\nu(y) < \infty$$

that μ and ν have finite second moments. This condition also guarantees that an optimal ϕ exists, as the conditions of Proposition 1.8.1 are satisfied. An alternative direct proof for the quadratic case can be found in Villani [124, Theorem 2.9].

Suppose that an optimal ϕ is found. What can we say about optimal transference plans π? According to the duality, a necessary and sufficient condition is that

$$\int\_{\mathbb{R}^d \times \mathbb{R}^d} \frac{\|x - y\|^2}{2} \, \mathrm{d}\pi(x, y) = \int\_{\mathbb{R}^d} \varphi \, \mathrm{d}\mu + \int\_{\mathbb{R}^d} \psi \, \mathrm{d}\nu,$$

where ψ = ‖·‖<sup>2</sup>/2 − (‖·‖<sup>2</sup>/2 − ϕ)<sup>∗</sup>. Equivalently (using Lemma 1.4.1),

$$\int\_{\mathbb{R}^d \times \mathbb{R}^d} [\tilde{\varphi}(x) + (\tilde{\varphi})^\*(y) - \langle x, y \rangle] \, \mathrm{d}\pi(x, y) = 0. \tag{1.5}$$

Since $\tilde{\varphi}(x) + (\tilde{\varphi})^*(y) \ge \langle x, y \rangle$ everywhere, the integrand is nonnegative. Hence, the integral vanishes if and only if π is concentrated on the set of (*x*, *y*) such that $\tilde{\varphi}(x) + (\tilde{\varphi})^*(y) = \langle x, y \rangle$. By definition of the Legendre transform as a supremum, this happens if and only if the supremum defining $(\tilde{\varphi})^*(y)$ is attained at *x*; equivalently,

$$\tilde{\varphi}(z) - \tilde{\varphi}(x) \ge \langle z - x, y \rangle, \qquad z \in \mathbb{R}^d.$$

This condition is precisely the definition of *y* being a *subgradient* of $\tilde{\varphi}$ at *x* [113, Chapter 23]. When $\tilde{\varphi}$ is differentiable at *x*, its unique subgradient is the gradient *y* = $\nabla\tilde{\varphi}(x)$ [113, Theorem 25.1]. If we are fortunate and $\tilde{\varphi}$ is differentiable everywhere, or even μ-almost everywhere, then the optimal transference plan π is unique, and in fact induced from the transport map $\nabla\tilde{\varphi}$. The problem, of course, is that $\tilde{\varphi}$ may fail to be differentiable μ-almost surely. This is remedied by assuming some regularity on the source measure μ, in order to make sure that *any* convex function is differentiable μ-almost surely, and is done via the following regularity result, which, roughly speaking, states that convex functions are differentiable almost surely. A stronger version is given in Rockafellar [113, Theorem 2.25], with an alternative proof in Alberti and Ambrosio [6, Chapter 2]. One could also combine the local Lipschitz property of convex functions [113, Chapter 10] with Rademacher's theorem (Villani [125, Theorem 10.8]).

**Theorem 1.6.1 (Differentiability of Convex Functions)** *Let f* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>∪ {∞} *be a convex function with domain* dom*<sup>f</sup>* <sup>=</sup> {*<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* : *<sup>f</sup>*(*x*) <sup>&</sup>lt; <sup>∞</sup>} *and let <sup>N</sup> be the set of points at which f is not differentiable. Then N* ∩dom*f has Lebesgue measure 0.*

Theorem 1.6.1 is usually stated for the interior of dom*f*, denoted int(dom*f*), rather than the closure. But, since *A* = dom*f* is convex, its boundary has Lebesgue measure zero. To see this, assume first that *A* is bounded. If int*A* is empty, then *A* lies in a lower-dimensional subspace [113, Theorem 2.4]. Otherwise, without loss of generality 0 ∈ int*A*, and then by convexity of *A*, the closure satisfies $\bar{A} \subseteq (1+\varepsilon)\,\mathrm{int}A$ for all ε > 0; hence Leb($\bar{A}$) ≤ (1+ε)<sup>*d*</sup> Leb(int*A*) for every ε, and letting ε → 0 shows that ∂*A* has Lebesgue measure zero. When *A* is unbounded, write it as ∪<sub>*n*</sub>*A* ∩ [−*n*,*n*]<sup>*d*</sup>.

Another issue that might arise is that optimal ϕ's might not exist. This is easily dealt with using Proposition 1.8.1. If we assume that μ and ν have finite second moments:

$$\int\_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\mu(x) < \infty \qquad \text{and} \qquad \int\_{\mathbb{R}^d} \|y\|^2 \, \mathrm{d}\nu(y) < \infty,$$

then any transference plan π ∈ Π(μ,ν) has a finite cost, as is seen by integrating the elementary inequality ‖*x*−*y*‖<sup>2</sup> ≤ 2‖*x*‖<sup>2</sup> + 2‖*y*‖<sup>2</sup> and using Lemma 1.4.1:

$$C(\pi) \le \int\_{\mathbb{R}^d \times \mathbb{R}^d} [\|x\|^2 + \|y\|^2] \, \mathrm{d}\pi(x, y) = \int\_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\mu(x) + \int\_{\mathbb{R}^d} \|y\|^2 \, \mathrm{d}\nu(y) < \infty.$$

With these tools, we can now prove a fundamental existence and uniqueness result for the Monge–Kantorovich problem. It has been proven independently by several authors, including Brenier [31], Cuesta-Albertos and Matrán [37], Knott and Smith [83], and Rachev and Rüschendorf [117].

**Theorem 1.6.2 (Quadratic Cost in Euclidean Spaces)** *Let* μ *and* ν *be probability measures on* R<sup>*d*</sup> *with finite second moments, and suppose that* μ *is absolutely continuous with respect to Lebesgue measure. Then the solution to the Kantorovich problem is unique, and is induced from a transport map T that equals* μ*-almost surely the gradient of a convex function* φ*. Furthermore, the pair* (‖*x*‖<sup>2</sup>/2 − φ(*x*), ‖*y*‖<sup>2</sup>/2 − φ<sup>∗</sup>(*y*)) *is optimal for the dual problem.*

*Proof.* To alleviate the notation, we write φ instead of $\tilde{\varphi}$. By Proposition 1.8.1, there exists an optimal dual pair (ϕ,ψ) such that φ(*x*) = ‖*x*‖<sup>2</sup>/2 − ϕ(*x*) is convex and lower semicontinuous, and by the discussion in Sect. 1.1, there exists an optimal π. Since φ is μ-integrable, it must be finite almost everywhere, i.e., μ(domφ) = 1. By Theorem 1.6.1, if we define *N* as the set of nondifferentiability points of φ, then Leb(*N* ∩ domφ) = 0; as μ is absolutely continuous, the same holds for μ. (Here Leb denotes Lebesgue measure.)

We conclude that μ(int(domφ) \ *N*) = 1. In other words, φ is differentiable μ-almost everywhere, and so for μ-almost any *x*, there exists a unique *y* such that φ(*x*) + φ<sup>∗</sup>(*y*) = ⟨*x*, *y*⟩, namely *y* = ∇φ(*x*). This shows that π is unique and induced from the transport map ∇φ. The gradient ∇φ is Borel measurable, since each of its coordinates can be written as $\limsup_{q \to 0,\, q \in \mathbb{Q}} q^{-1}(\varphi(x + qv) - \varphi(x))$ for *v* ranging over the canonical basis of R<sup>*d*</sup>, which is Borel measurable because the limit superior is taken over countably many functions (and φ is measurable because it is lower semicontinuous).

#### *1.6.2 Separable Hilbert Spaces*

The finite-dimensionality of R*<sup>d</sup>* in the previous subsection was only used in order to apply Theorem 1.6.1, so one could hope to extend the results to infinite-dimensional separable Hilbert spaces.

Although there is no obvious parallel of Lebesgue measure (i.e., a translation-invariant measure) on infinite-dimensional Banach spaces, one can still define absolute continuity via Gaussian measures. Indeed, μ ∈ *P*(R<sup>*d*</sup>) is absolutely continuous with respect to Lebesgue measure if and only if the following holds: if *N* ⊂ R<sup>*d*</sup> is such that ν(*N*) = 0 for any nondegenerate Gaussian measure ν, then μ(*N*) = 0. This definition can be extended to any separable Banach space *X* via projections, as follows. Let *X*<sup>∗</sup> be the (topological) dual of *X*, consisting of all real-valued, continuous linear functionals on *X*.

**Definition 1.6.3 (Gaussian Measures)** *A probability measure* μ ∈ *P*(*X*) *is a nondegenerate Gaussian measure if for any* ℓ ∈ *X*<sup>∗</sup> \ {0}*,* ℓ#μ ∈ *P*(R) *is a Gaussian measure with positive variance.*

**Definition 1.6.4 (Gaussian Null Sets and Absolutely Continuous Measures)** *A subset N* ⊂ *X is a Gaussian null set if whenever* ν *is a nondegenerate Gaussian measure,* ν(*N*) = 0*. A probability measure* μ ∈ *P*(*X*) *is absolutely continuous if* μ *vanishes on all Gaussian null sets.*

Clearly, if ν is a nondegenerate Gaussian measure, then it is absolutely continuous.

As explained in Ambrosio et al. [12, Section 6.2], a version of Rademacher's theorem holds in separable Hilbert spaces: a locally Lipschitz function is Gâteaux differentiable except on a Gaussian null set of *X*. Theorem 1.6.2 (and more generally, Theorem 1.8.2) extends to infinite dimensions; see [12, Theorem 6.2.10].

#### *1.6.3 The Gaussian Case*

Apart from the one-dimensional case of Sect. 1.5, there is another special case in which there is a unique *and* explicit solution to the Monge–Kantorovich problem.

Suppose that μ and ν are Gaussian measures on R*<sup>d</sup>* with zero means and nonsingular covariance matrices *A* and *B*. By Theorem 1.6.2, we know that there exists a unique optimal map *T* such that *T*#μ = ν. Since linear push-forwards of Gaussians are Gaussian, it seems natural to guess that *T* should be linear, and this is indeed the case.

Since *T* is a linear map that should be the gradient of a convex function φ, it must be that φ is quadratic, i.e., φ(*x*) − φ(0) = ⟨*x*, *Qx*⟩ for *x* ∈ R<sup>*d*</sup> and some matrix *Q*. The gradient of φ at *x* is (*Q* + *Q*<sup>*t*</sup>)*x* and the Hessian matrix is *Q* + *Q*<sup>*t*</sup>. Thus, *T* = *Q* + *Q*<sup>*t*</sup>, and since φ is convex, *T* must be positive semidefinite.

Viewing *T* as a matrix leads to the *Riccati equation TAT* = *B* (since *T* is symmetric). This is a quadratic equation in *T*, and so we wish to take square roots in a way that isolates *T*. This is done by multiplying the equation on both sides by *A*<sup>1/2</sup>:

$$[\mathbf{A}^{1/2}\mathbf{T}\mathbf{A}^{1/2}][\mathbf{A}^{1/2}\mathbf{T}\mathbf{A}^{1/2}] = \mathbf{A}^{1/2}\mathbf{T}\mathbf{A}\mathbf{T}\mathbf{A}^{1/2} = \mathbf{A}^{1/2}\mathbf{B}\mathbf{A}^{1/2} = [\mathbf{A}^{1/2}\mathbf{B}^{1/2}][\mathbf{B}^{1/2}\mathbf{A}^{1/2}].$$

All matrices in brackets are positive semidefinite. By taking square roots and multiplying on both sides by *A*<sup>−1/2</sup>, we finally find

$$T = A^{-1/2} [A^{1/2} \mathbf{B} \mathbf{A}^{1/2}]^{1/2} A^{-1/2}.$$

A straightforward calculation shows that *TAT* = *B* indeed, and *T* is positive definite, hence optimal. To calculate the transport cost *C*(*T*), observe that (*T* − *I*)#μ is a centred Gaussian measure with covariance matrix

$$TAT - TA - AT + A = A + B - A^{1/2} [A^{1/2} B A^{1/2}]^{1/2} A^{-1/2} - A^{-1/2} [A^{1/2} B A^{1/2}]^{1/2} A^{1/2}.$$

If *Y* ∼ *N*(0,*C*), then E‖*Y*‖<sup>2</sup> equals the trace of *C*, denoted tr*C*. Hence, by properties of the trace,


$$C(T) = \operatorname{tr}\left[A + B - 2(A^{1/2} B A^{1/2})^{1/2}\right]. \tag{1.6}$$

By continuity arguments, (1.6) is the total transport cost between any two Gaussian distributions with zero means, even if *A* is singular.

If *AB* = *BA*, the above formulae simplify to

$$T = B^{1/2} A^{-1/2}, \qquad C(T) = \operatorname{tr}\left[A + B - 2A^{1/2} B^{1/2}\right] = \|A^{1/2} - B^{1/2}\|\_F^2,$$

with ‖·‖<sub>*F*</sub> the Frobenius norm.

If the means of μ and ν are *m* and *n*, one simply needs to translate the measures. The optimal map and the total cost are then

$$Tx = n + A^{-1/2} [A^{1/2} B A^{1/2}]^{1/2} A^{-1/2} (x - m);$$

$$C(T) = \|n - m\|^2 + \operatorname{tr}[A + B - 2(A^{1/2} B A^{1/2})^{1/2}].$$
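The closed-form map and cost can be implemented directly. A sketch using `scipy.linalg.sqrtm` for matrix square roots (the function and variable names are our own, and `A` is assumed nonsingular):

```python
# Optimal transport between Gaussians N(m, A) and N(n, B): the linear
# map and the trace cost formula above. A sketch; A must be nonsingular.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot(m, A, n, B):
    Ah = np.real(sqrtm(A))              # A^{1/2} (real part guards
    Ahi = np.linalg.inv(Ah)             # against tiny imaginary noise)
    M = np.real(sqrtm(Ah @ B @ Ah))     # (A^{1/2} B A^{1/2})^{1/2}
    T0 = Ahi @ M @ Ahi                  # matrix of the optimal map
    cost = float(np.sum((n - m) ** 2) + np.trace(A + B - 2.0 * M))
    return T0, cost                     # the map is x -> n + T0 (x - m)
```

One can verify numerically that the returned matrix satisfies the Riccati equation *T*<sub>0</sub>*AT*<sub>0</sub> = *B*, so that the linear map indeed pushes *N*(*m*, *A*) forward to *N*(*n*, *B*).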

From this, we can deduce a lower bound on the total cost between *any* two measures in R*<sup>d</sup>* in terms of their second order structure. This is worth mentioning, because such lower bounds are not very common (the Monge–Kantorovich problem is defined by an infimum, and thus typically easier to bound from above).

**Proposition 1.6.5 (Lower Bound for Quadratic Cost)** *Let* μ,ν ∈ *P*(R<sup>*d*</sup>) *have means m and n and covariance matrices A and B, and let* π ∈ Π(μ,ν)*. Then*

$$C(\pi) \ge ||n - m||^2 + \text{tr}[A + B - 2(A^{1/2} \text{BA}^{1/2})^{1/2}].$$

*Proof.* It will be convenient here to use the probabilistic terminology of Sect. 1.2. Let *X* and *Y* be random variables with distributions μ and ν. Any coupling of *X* and *Y* has a covariance matrix of the form

$$C = \begin{pmatrix} A & V \\ V^t & B \end{pmatrix} \in \mathbb{R}^{2d \times 2d}$$

for some matrix *V* ∈ R<sup>*d*×*d*</sup>, constrained so that *C* is positive semidefinite. This gives the lower bound

$$\inf\_{\pi \in \Pi(\mu, \nu)} \mathbb{E}\_\pi \|X - Y\|^2 = \|m - n\|^2 + \inf\_{\pi \in \Pi(\mu, \nu)} \operatorname{tr}[A + B - 2V] \ge \|m - n\|^2 + \inf\_{V : C \ge 0} \operatorname{tr}[A + B - 2V].$$

As we know from the Gaussian case, the last infimum is given by (1.6).

#### *1.6.4 Regularity of the Transport Maps*

The optimal transport map *T* between Gaussian measures on R<sup>*d*</sup> is linear, so it is of course very smooth (analytic). The densities of Gaussian measures are analytic too, so that *T* inherits the regularity of μ and ν. Using the formula for *T*, one can show that a similar phenomenon takes place in the one-dimensional case. Though we do not have a formula for *T* at our disposal when μ and ν are general absolutely continuous measures on R<sup>*d*</sup>, *d* ≥ 2, it turns out that even in that case, *T* inherits the regularity of μ and ν if some convexity conditions are satisfied.


To guess what kind of results can be hoped for, let us first examine the case *d* = 1. Let *F* and *G* denote the distribution functions of μ and ν, respectively. Suppose that *G* is continuously differentiable and that *G*′ > 0 on some open interval (finite or not) *I* such that ν(*I*) = 1. Then the inverse function theorem says that *G*−<sup>1</sup> is also continuously differentiable. Recall that the *support* of a (Borel) probability measure μ (denoted suppμ) is the smallest closed set *K* such that μ(*K*) = 1. A simple application of the chain rule (see page 19 in the supplement) gives:

**Theorem 1.6.6 (Regularity in** R**)** *Let* μ,ν <sup>∈</sup> *<sup>P</sup>*(R) *possess distribution functions F and G of class Ck, k* <sup>≥</sup> <sup>1</sup>*. Suppose further that* suppν *is an interval I (possibly unbounded) and that G*′ > 0 *on the interior of I. Then the optimal map is of class C<sup>k</sup> as well. If F and G are merely continuous, then so is the optimal map.*

The assumption on the support of ν is important: if μ is Lebesgue measure on [0,1] and the support of ν is disconnected, then *T* cannot even be continuous, no matter how smooth ν is.
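In the case *d* = 1 the optimal map is explicit, *T* = *G*<sup>−1</sup> ∘ *F*, and can be probed numerically. A minimal sketch (the distributions and names are our own choices, assuming SciPy) pushing a standard normal forward to a standard exponential:

```python
import numpy as np
from scipy import stats

# d = 1: the optimal map from mu to nu is T = G^{-1} o F, where F and G are
# the distribution functions of mu and nu.  Here mu = N(0,1), nu = Exp(1).
def T(x):
    return stats.expon.ppf(stats.norm.cdf(x))

# Push-forward check: T(X) should be Exp(1)-distributed when X ~ N(0,1).
rng = np.random.default_rng(0)
y = T(rng.normal(size=100_000))
```

The sample mean of `y` should be close to 1, the mean of Exp(1); since here *F* is smooth and *G*′ > 0 on (0,∞), *T* is smooth on R, in line with Theorem 1.6.6.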

The argument above cannot be easily extended to measures on <sup>R</sup>*d*, *<sup>d</sup>* <sup>≥</sup> 2, because there is no explicit formula available for the optimal maps. As before, we cannot expect the optimal map to be continuous if the support of ν is disconnected. It turns out that the relevant condition on the support of ν is not connectedness, but rather convexity. This was shown by Caffarelli, who was able to prove ([32] and the references within) that the optimal maps have the same smoothness as the measures. To state the result, we recall the following notation for an open Ω <sup>⊆</sup> <sup>R</sup>*d*, *<sup>k</sup>* <sup>≥</sup> 0 and α ∈ (0,1]. We say that *<sup>f</sup>* <sup>∈</sup> *<sup>C</sup>k*,α (Ω) if all the partial derivatives of order *k* of *f* are locally α-Hölder on Ω. For example, if *k* = 1, this means that for any *x* ∈ Ω there exists a constant *L* and an open ball *B* containing *x* such that

$$\|\nabla f(z) - \nabla f(y)\| \le L \|y - z\|^\alpha, \qquad y, z \in B.$$

Note that *<sup>f</sup>* <sup>∈</sup> *<sup>C</sup>k*+<sup>1</sup> <sup>=</sup><sup>⇒</sup> *<sup>f</sup>* <sup>∈</sup> *<sup>C</sup>k*,β <sup>=</sup><sup>⇒</sup> *<sup>f</sup>* <sup>∈</sup> *<sup>C</sup>k*,α <sup>=</sup><sup>⇒</sup> *<sup>f</sup>* <sup>∈</sup> *<sup>C</sup>k*, for 0 <sup>≤</sup> α ≤ β ≤ 1, so α gives a "fractional" degree of smoothness for *f*. Moreover, *Ck*,<sup>0</sup> = *C<sup>k</sup>* and *Ck*,<sup>1</sup> is quite close to *Ck*+1, since Lipschitz functions are almost everywhere differentiable.

**Theorem 1.6.7 (Regularity of Transport Maps)** *Fix open sets* Ω1,Ω<sup>2</sup> <sup>⊆</sup> <sup>R</sup>*d, with* Ω<sup>2</sup> *convex, and absolutely continuous measures* μ,ν <sup>∈</sup> *<sup>P</sup>*(R*d*) *with finite second moments and bounded, strictly positive densities f*,*g, respectively, such that* μ(Ω<sup>1</sup>) = 1 = ν(Ω<sup>2</sup>)*. Let* φ *be such that* ∇φ#μ = ν*.*


*Then* φ *is of class C*<sup>1,α</sup>(Ω<sub>1</sub>) *for some* α ∈ (0,1)*. If in addition f*,*<sup>g</sup>* <sup>∈</sup> *<sup>C</sup>k*,α *, then* φ <sup>∈</sup> *<sup>C</sup>k*+2,α (Ω1)*.*

In other words, the optimal map *T* = ∇φ <sup>∈</sup> *<sup>C</sup>k*+1,α (Ω<sup>1</sup>) is one derivative smoother than the densities, so has the same smoothness as the measures μ,ν.

Theorem 1.6.7 will be used in two ways in this book. Firstly, it is used to derive criteria for a Karcher mean of a collection of measures to be the Fréchet mean of that collection (Theorem 3.1.15). Secondly, it allows one to obtain very smooth estimates for the transport maps. Indeed, any two measures μ and ν can be approximated by measures satisfying the second condition: one can approximate them by discrete measures using the law of large numbers and then employ a convolution with, e.g., a Gaussian measure (see, for instance, Theorem 2.2.7). It is not obvious that the transport maps between the approximations converge to the transport maps between the original measures, but we will see this to be true in the next section.

#### **1.7 Stability of Solutions Under Weak Convergence**

In this section, we discuss the behaviour of the solution to the Monge–Kantorovich problem when the measures μ and ν are replaced by approximations μ*<sup>n</sup>* and ν*<sup>n</sup>*. Since any measure can be approximated by discrete measures *or* by smooth measures, this allows us to benefit from both worlds. On the one hand, approximating μ and ν with discrete measures leads to the finite discrete problem of Sect. 1.3 that can be solved exactly. On the other hand, approximating μ and ν with Gaussian convolutions thereof leads to very smooth measures (at least on R*<sup>d</sup>*), and so the regularity results of the previous section imply that the respective optimal maps will also be smooth. Finally, in applications, one would almost always observe the measures of interest μ and ν with a certain amount of noise, and it is therefore of interest to control the error introduced by the noise. In image analysis, μ can represent an image that has undergone blurring, or some other perturbation (Amit et al. [13]). In other applications, the noise could be due to sampling variation, where instead of μ one observes the discrete measure $\mu_N = N^{-1}\sum_{i=1}^{N} \delta_{\{X_i\}}$ obtained from realisations *X*<sub>1</sub>,...,*X<sub>N</sub>* of random elements with distribution μ (see Chap. 4).

In Sect. 1.7.1, we will see that the optimal transference plan π depends continuously on μ and ν. With this result under one's belt, one can then deduce an analogous property for the optimal map *T* from μ to ν given some regularity of μ, as will be seen in Sect. 1.7.2.

We shall assume throughout this section that μ*n* → μ and ν*n* → ν weakly, which, we recall, means that $\int_{\mathcal{X}} f \, \mathrm{d}\mu_n \to \int_{\mathcal{X}} f \, \mathrm{d}\mu$ for all continuous bounded *<sup>f</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup>. The following equivalent definitions for weak convergence will be used not only in this section, but elsewhere as well.

**Lemma 1.7.1 (Portmanteau)** *Let X be a complete separable metric space and let* μ,μ*<sup>n</sup>* ∈ *P*(*X* )*. Then the following are equivalent:*


For a proof, see, for instance, Billingsley [24, Theorem 2.1]. The equivalence with the last condition can be found in Pollard [104, Section III.2].

#### *1.7.1 Stability of Transference Plans and Cyclical Monotonicity*

In this subsection, we state and sketch the proof of the fact that if μ*n* → μ and ν*n* → ν weakly, then the optimal transference plans π*n* ∈ Π(μ*n*,ν*<sup>n</sup>*) converge to an optimal π ∈ Π(μ,ν). The result, as stated in Villani [125, Theorem 5.20], is valid on complete separable metric spaces with general cost functions, and reads as follows.

**Theorem 1.7.2 (Weak Convergence and Optimal Plans)** *Let* μ*<sup>n</sup> and* ν*<sup>n</sup> converge weakly to* μ *and* ν*, respectively, in P*(*<sup>X</sup>* ) *and let c* : *<sup>X</sup>* <sup>2</sup> <sup>→</sup> <sup>R</sup><sup>+</sup> *be continuous. If* π*n* ∈ Π(μ*n*,ν*<sup>n</sup>*) *are optimal transference plans and*

$$\limsup_{n \to \infty} \int_{\mathcal{X}^2} c(x, y) \, \mathrm{d}\pi_n(x, y) < \infty,$$

*then* (π*<sup>n</sup>*) *is a tight sequence and each of its weak limits* π ∈ Π(μ,ν) *is optimal.*

One can even let *c* vary with *n* under some conditions.

Let *c*(*x*, *y*) = ‖*x*−*y*‖<sup>2</sup>/2. We prefer to keep the notation *<sup>c</sup>*(·,·) in order to stress the generality of the arguments. A key idea in the proof is the replacement of optimality by another property called *cyclical monotonicity*, which behaves nicely with respect to weak convergence. To motivate this property, we recall the discrete case of Sect. 1.3 where $\mu = N^{-1}\sum_{i=1}^{N} \delta_{\{x_i\}}$ and $\nu = N^{-1}\sum_{i=1}^{N} \delta_{\{y_i\}}$. There exists an optimal transference plan π induced from a permutation σ<sub>0</sub> ∈ *S<sub>N</sub>*. Since the ordering of {*x<sub>i</sub>*} and {*y<sub>i</sub>*} is irrelevant in the representations of μ and ν, we may assume without loss of generality that σ<sub>0</sub> is the identity permutation. Then, by definition of optimality,

$$\sum\_{i=1}^{N} c(\mathbf{x}\_i, \mathbf{y}\_i) \le \sum\_{i=1}^{N} c(\mathbf{x}\_i, \mathbf{y}\_{\sigma(i)}), \qquad \sigma \in \mathcal{S}\_N. \tag{1.7}$$

If σ is the identity except on a subset *i*<sub>1</sub>,...,*i<sub>n</sub>*, *n* ≤ *N*, then in particular

$$\sum\_{k=1}^{n} c(\mathbf{x}\_{i\_k}, \mathbf{y}\_{i\_k}) \le \sum\_{k=1}^{n} c(\mathbf{x}\_{i\_k}, \mathbf{y}\_{\sigma(i\_k)}), \qquad \sigma \in S\_n,$$

and if we choose σ(*i<sub>k</sub>*) = *i*<sub>*k*−1</sub> with *i*<sub>0</sub> = *i<sub>n</sub>*, this reads

$$\sum_{k=1}^{n} c(x_{i_k}, y_{i_k}) \le \sum_{k=1}^{n} c(x_{i_k}, y_{i_{k-1}}). \tag{1.8}$$

By decomposing a permutation σ ∈ *SN* into disjoint cycles, one can verify that (1.8) implies (1.7). This will be useful since, as it turns out, a variant of (1.8) holds for arbitrary measures μ and ν, for which there is no relevant finite *N* as in (1.7).
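This discrete picture can be checked directly: solve a small assignment problem, relabel the targets so that the identity permutation is optimal, and verify (1.8) over every subset and cyclic shift. A sketch under the quadratic cost (assuming SciPy's `linear_sum_assignment` as the discrete solver; all names are ours):

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
N = 6
x = rng.normal(size=(N, 2))
y = rng.normal(size=(N, 2))

def cost_matrix(x, y):
    # C[i, j] = ||x_i - y_j||^2 / 2
    return 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)

C = cost_matrix(x, y)
_, sigma0 = linear_sum_assignment(C)   # optimal permutation sigma_0
y = y[sigma0]                          # relabel so that sigma_0 is the identity
C = cost_matrix(x, y)

def cyclically_monotone(C, tol=1e-9):
    """Check (1.8): on every finite sequence of indices, keeping the diagonal
    pairing beats the cyclic shift y_{i_k} -> y_{i_{k-1}}."""
    n = C.shape[0]
    for k in range(2, n + 1):
        for seq in itertools.permutations(range(n), k):
            diag = sum(C[i, i] for i in seq)
            shift = sum(C[seq[j], seq[j - 1]] for j in range(k))
            if diag > shift + tol:
                return False
    return True
```

Here `cyclically_monotone(C)` returns `True`, since the identity permutation is optimal and every cyclic-shift inequality is a restriction of (1.7).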

**Definition 1.7.3 (Cyclically Monotone Sets and Measures)** *A set* Γ <sup>⊆</sup> *<sup>X</sup>* <sup>2</sup> *is cyclically monotone if for any n and any* (*x*1, *y*1),...,(*xn*, *yn*) ∈ Γ*,*

$$\sum\_{i=1}^{n} c(\mathbf{x}\_{i}, \mathbf{y}\_{i}) \le \sum\_{i=1}^{n} c(\mathbf{x}\_{i}, \mathbf{y}\_{i-1}), \qquad (\mathbf{y}\_{0} = \mathbf{y}\_{n}).\tag{1.9}$$

*A probability measure* π *on X* <sup>2</sup> *is cyclically monotone if there exists a cyclically monotone Borel set* Γ *such that* π(Γ) = 1*.*

The relevance of cyclical monotonicity becomes clear from the following observation. If μ and ν are discrete uniform measures on *N* points and σ is an optimal permutation for the Monge–Kantorovich problem, then the coupling $\pi = N^{-1}\sum_{i=1}^{N} \delta_{\{(x_i, y_{\sigma(i)})\}}$ is cyclically monotone. In fact, even if the optimal permutation is not unique, the set

$$\Gamma = \{ (\mathbf{x}\_i, \mathbf{y}\_{\sigma(i)}) : i = 1, \dots, N, \sigma \in S\_N \text{ optimal} \}$$

is cyclically monotone. Furthermore, π ∈ Π(μ,ν) is optimal if and only if it is cyclically monotone, if and only if π(Γ ) = 1. It is heuristically easy to see that cyclical monotonicity is a necessary condition for optimality:

**Proposition 1.7.4 (Optimal Plans Are Cyclically Monotone)** *Let* μ,ν ∈ *P*(*X* ) *and suppose that the cost function c is nonnegative and continuous. Assume that the optimal* π ∈ Π(μ,ν) *has a finite total cost. Then* suppπ *is cyclically monotone. In particular,* π *is cyclically monotone.*

The idea of the proof is that if for some (*x*1, *y*1),...,(*xn*, *yn*) in the support of π,

$$\sum\_{i=1}^{n} c(\mathbf{x}\_{i}, \mathbf{y}\_{i}) > \sum\_{i=1}^{n} c(\mathbf{x}\_{i}, \mathbf{y}\_{i-1}),$$

then by continuity of *c*, the same inequality holds on some balls of positive measure. One can then replace π by a measure having (*xi*, *yi*−1) rather than (*xi*, *yi*) in its support, and this measure will incur a lower cost than π. A rigorous proof can be found in Gangbo and McCann [59, Theorem 2.3].

Thus, optimal transference plans π solve infinitely many discrete Monge–Kantorovich problems emanating from their support. More precisely, for any finite collection (*x<sub>i</sub>*, *y<sub>i</sub>*) ∈ suppπ, *i* = 1,...,*N*, and any permutation σ ∈ *S<sub>N</sub>*, (1.7) is satisfied. Therefore, the identity permutation is optimal between the measures $N^{-1}\sum_{i=1}^{N}\delta_{\{x_i\}}$ and $N^{-1}\sum_{i=1}^{N}\delta_{\{y_i\}}$.

In the same spirit as Γ defined above for the discrete case, one can strengthen Proposition 1.7.4 and prove existence of a cyclically monotone set Γ that includes the support of *any* optimal transference plan π: take Γ = ∪supp(π) over all optimal π.

The converse of Proposition 1.7.4 also holds.

**Proposition 1.7.5 (Cyclically Monotone Plans Are Optimal)** *Let* μ,ν ∈ *P*(*X* )*, <sup>c</sup>* : *<sup>X</sup>* <sup>2</sup> <sup>→</sup> <sup>R</sup><sup>+</sup> *continuous and* π ∈ Π(μ,ν) *a cyclically monotone measure with C*(π) *finite. Then* π *is optimal in* Π(μ,ν)*.*

Let us sketch the proof in the quadratic case *c*(*x*, *y*) = ‖*x*−*y*‖<sup>2</sup>/2 and see how convexity comes into play. Straightforward algebra shows that (1.9) is equivalent, in the quadratic case, to

$$\sum_{i=1}^{n} \langle y_i, x_{i+1} - x_i \rangle \le 0, \qquad (x_{n+1} = x_1). \tag{1.10}$$
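The "straightforward algebra" can be spelled out. Expanding the squares in (1.9) and cancelling the terms that appear on both sides:

```latex
% Expanding c(x,y) = \|x - y\|^2/2 in (1.9):
\sum_{i=1}^{n} \tfrac{1}{2}\|x_i - y_i\|^2
  \le \sum_{i=1}^{n} \tfrac{1}{2}\|x_i - y_{i-1}\|^2
% The sums of \|x_i\|^2, and (cyclically, since y_0 = y_n) of \|y_i\|^2,
% coincide on both sides, leaving
\iff \sum_{i=1}^{n} \langle x_i, y_{i-1} \rangle
  \le \sum_{i=1}^{n} \langle x_i, y_i \rangle
% Reindex the left-hand side, \sum_i <x_i, y_{i-1}> = \sum_i <x_{i+1}, y_i>
% (with x_{n+1} = x_1), and move everything to one side:
\iff \sum_{i=1}^{n} \langle y_i, x_{i+1} - x_i \rangle \le 0 .
```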

Fix (*x*0, *y*0) ∈ Γ = suppπ and define φ: *<sup>X</sup>* <sup>→</sup> <sup>R</sup>∪ {∞} by

$$\phi(x) = \sup \bigl\{ \langle y_0, x_1 - x_0 \rangle + \cdots + \langle y_{m-1}, x_m - x_{m-1} \rangle + \langle y_m, x - x_m \rangle : m \in \mathbb{N},\ (x_i, y_i) \in \Gamma \bigr\}.$$

This function is defined as a supremum of affine functions, and is therefore convex and lower semicontinuous. Cyclical monotonicity of Γ implies that φ(*x*0) = 0, so φ is not identically infinite (it would have been so if Γ were not cyclically monotone). Straightforward computations show that Γ is included in the subdifferential of φ: *y* is a subgradient of φ at *x* whenever (*x*, *y*) ∈ Γ. Optimality of π then follows by weak duality, since π assigns full measure to the set of (*x*, *y*) such that φ(*x*) + φ<sup>∗</sup>(*y*) = ⟨*x*, *y*⟩; see (1.5) and the discussion around it.

The argument for more general costs follows similar lines and is sketched at the end of this subsection.

Given these intermediary results, it is now instructive to prove Theorem 1.7.2.

*Proof (Proof of Theorem 1.7.2).* Since μ*n* → μ weakly, it is a tight sequence, and similarly for ν*<sup>n</sup>*. Consequently, the entire set of plans ∪*<sup>n</sup>*Π(μ*n*,ν*<sup>n</sup>*) is tight too (see the discussion before deriving (1.3)). Therefore, up to a subsequence, (π*<sup>n</sup>*) has a weak limit π. We need to show that π is cyclically monotone and that *C*(π) is finite. The latter is easy, since *cM*(*x*, *y*) = min(*M*, *c*(*x*, *y*)) is continuous and bounded:

$$C(\pi) = \lim_{M \to \infty} \int_{\mathcal{X}^2} c_M \, \mathrm{d}\pi = \lim_{M \to \infty} \lim_{n \to \infty} \int_{\mathcal{X}^2} c_M \, \mathrm{d}\pi_n \le \liminf_{n \to \infty} \int_{\mathcal{X}^2} c \, \mathrm{d}\pi_n < \infty.$$

To show that π is cyclically monotone, fix (*x*1, *y*1),...,(*xN*, *yN*) ∈ suppπ. We show that there exist (*x*<sup>*n*</sup><sub>*k*</sub>, *y*<sup>*n*</sup><sub>*k*</sub>) ∈ suppπ*<sup>n</sup>* that converge to (*x<sub>k</sub>*, *y<sub>k</sub>*). Once this is established, we conclude from the cyclical monotonicity of suppπ*<sup>n</sup>* and the continuity of *c* that

$$\sum\_{k=1}^{N} c(\mathbf{x}\_k, \mathbf{y}\_k) = \lim\_{n \to \infty} \sum\_{k=1}^{N} c(\mathbf{x}\_k^n, \mathbf{y}\_k^n) \le \lim\_{n \to \infty} \sum\_{k=1}^{N} c(\mathbf{x}\_k^n, \mathbf{y}\_{k-1}^n) = \sum\_{k=1}^{N} c(\mathbf{x}\_k, \mathbf{y}\_{k-1}).$$

The existence proof for the sequence is standard. For ε > 0, let *B* = *B*<sub>ε</sub>(*x<sub>k</sub>*, *y<sub>k</sub>*) be an open ball around (*x<sub>k</sub>*, *y<sub>k</sub>*). Then π(*B*) > 0 and, by the portmanteau Lemma 1.7.1, π*<sup>n</sup>*(*B*) > 0 for sufficiently large *n*. It follows that there exist (*x*<sup>*n*</sup><sub>*k*</sub>, *y*<sup>*n*</sup><sub>*k*</sub>) ∈ *B* ∩ suppπ*<sup>n</sup>*. Let ε = 1/*m*; then for all *n* ≥ *N<sub>m</sub>* we can find (*x*<sup>*n*</sup><sub>*k*</sub>, *y*<sup>*n*</sup><sub>*k*</sub>) ∈ suppπ*<sup>n</sup>* of distance at most 2/*m* from (*x<sub>k</sub>*, *y<sub>k</sub>*). We can choose *N*<sub>*m*+1</sub> > *N<sub>m</sub>* without loss of generality in order to complete the proof.

A few remarks are in order. Firstly, quadratic cyclically monotone sets (with respect to ‖*x*−*y*‖<sup>2</sup>/2) are included in the subdifferential of convex functions. The converse is also true, as can be easily deduced from summing up the subgradient inequalities

$$
\phi(\mathbf{x}\_{i+1}) \ge \phi(\mathbf{x}\_i) + \left< \mathbf{y}\_i, \mathbf{x}\_{i+1} - \mathbf{x}\_i \right>, \qquad i = 1, \dots, N,
$$

where *yi* is a subgradient of φ at *xi*. For future reference, we state this characterisation as a theorem (which is valid in infinite dimensions too).

**Theorem 1.7.6 (Rockafellar [112])** *A nonempty* Γ <sup>⊆</sup> *<sup>X</sup>* <sup>2</sup> *is quadratic cyclically monotone if and only if it is included in the graph of the subdifferential of a lower semicontinuous convex function that is not identically infinite.*
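One direction of Theorem 1.7.6 is easy to test numerically: sample points, set *y<sub>i</sub>* = ∇φ(*x<sub>i</sub>*) for a smooth convex φ, and verify that every cycle satisfies (1.10). A sketch with φ = logsumexp (our own choice of convex function; all names are ours):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def grad_logsumexp(x):
    """Gradient of the convex function phi(x) = log(sum_j exp(x_j)),
    i.e. the softmax of x."""
    e = np.exp(x - x.max())
    return e / e.sum()

pts = rng.normal(size=(7, 3))
grads = np.array([grad_logsumexp(p) for p in pts])

def max_cycle_sum(pts, grads, k=4):
    """Largest value of sum_i <y_i, x_{i+1} - x_i> over cycles of length <= k;
    by (1.10), this should be <= 0 for subgradients of a convex function."""
    best = -np.inf
    for m in range(2, k + 1):
        for seq in itertools.permutations(range(len(pts)), m):
            s = sum(grads[i] @ (pts[j] - pts[i])
                    for i, j in zip(seq, seq[1:] + seq[:1]))
            best = max(best, s)
    return best
```

Summing the subgradient inequalities around any cycle telescopes to zero, so `max_cycle_sum(pts, grads)` is nonpositive.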

Secondly, we have not used the Kantorovich duality at all, merely its weak form. The machinery of cyclical monotonicity can be used in order to prove the duality Theorem 1.4.2. This is indeed the strategy of Villani [125, Chapter 5], who explains its advantage over Hahn–Banach-type duality proofs.

Lastly, the idea of the proof of Proposition 1.7.5 generalises to other costs in a natural way. Given a cyclically monotone (with respect to a cost function *c*) set Γ and a fixed pair (*x*0, *y*0) ∈ Γ, define (Rüschendorf [116])

$$\varphi(x) = \inf \bigl\{ c(x_1, y_0) - c(x_0, y_0) + \cdots + c(x_m, y_{m-1}) - c(x_{m-1}, y_{m-1}) + c(x, y_m) - c(x_m, y_m) : m \in \mathbb{N},\ (x_i, y_i) \in \Gamma \bigr\}.$$

Then under some conditions, (ϕ,ψ) is dual optimal for some ψ. As explained in Sect. 1.8, ψ can be chosen to be essentially ϕ*<sup>c</sup>* (as defined in that section).

#### *1.7.2 Stability of Transport Maps*

We now extend the weak convergence of π*<sup>n</sup>* to π of the previous subsection to convergence of optimal maps. Because of the applications we have in mind, we shall work exclusively in the Euclidean space *X* = R*<sup>d</sup>* with the quadratic cost function; our results can most likely be extended to more general situations.

In this setting, we know that optimal plans are supported on graphs of subdifferentials of convex functions. Suppose that π*<sup>n</sup>* is induced by *Tn* and π is induced by *T*. Then in some sense, the weak convergence of π*<sup>n</sup>* to π yields convergence of the graphs of *Tn* to the graph of *T*. Our goal is to strengthen this to uniform convergence of *Tn* to *T*. Roughly speaking, we show the following: there exists a set *A* with μ(*A*) = 1 and such that *Tn* converge uniformly to *T* on every compact subset of *A*. For the reader's convenience, we give a user-friendly version here; a more general statement is given in Proposition 1.7.11 below.

**Theorem 1.7.7 (Uniform Convergence of Optimal Maps)** *Let* μ*n*,μ *be absolutely continuous measures with finite second moments on an open convex set U* <sup>⊆</sup> <sup>R</sup>*<sup>d</sup> such that* μ*n* → μ *weakly, and let* ν*n* → ν *weakly with* ν*n*,ν <sup>∈</sup> *<sup>P</sup>*(R*d*) *with finite second moments. If Tn and T are continuous on U and C*(*Tn*) *is bounded uniformly in n, then*


$$\sup_{x \in \Omega} \|T_n(x) - T(x)\| \to 0, \qquad n \to \infty,$$

*for any compact* Ω ⊆ *U.*

Since *Tn* and *T* are only defined up to Lebesgue null sets, it will be more convenient to work directly with the subgradients. That is, we view *Tn* and *T* as *set-valued* functions that to each *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* assign a (possibly empty) subset of <sup>R</sup>*d*. In other words, *Tn* and *T* take values in the power set of R*<sup>d</sup>*, denoted by 2<sup>R*<sup>d</sup>*</sup>.

Let φ : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>∪ {∞} be convex, *<sup>y</sup>*<sup>1</sup> <sup>∈</sup> ∂ φ(*x*1) and *y*<sup>2</sup> ∈ ∂ φ(*x*2). Putting *n* = 2 in the definition of cyclical monotonicity (1.10) gives

$$
\langle \mathbf{y}\_2 - \mathbf{y}\_1, \mathbf{x}\_2 - \mathbf{x}\_1 \rangle \ge 0.
$$

This property (which is weaker than cyclical monotonicity) is important enough to have its own name. Following the notation of Alberti and Ambrosio [6], we call a set-valued function (or multifunction) *<sup>u</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>2</sup>R*<sup>d</sup> monotone* if whenever *yi* ∈ *u*(*xi*), *i* = 1,2,

$$
\langle \mathbf{y}\_2 - \mathbf{y}\_1, \mathbf{x}\_2 - \mathbf{x}\_1 \rangle \ge 0.
$$

If *d* = 1, this simply means that *u* is a nondecreasing (set-valued) function. For example, one can define *u*(*x*) = {0} for *x* ∈ [0,1), *u*(1) = [0,1], and *u*(*x*) = ∅ if *x* ∉ [0,1]. Next, *u* is said to be *maximally monotone* if no points can be added to its graph while preserving monotonicity:

$$\bigl\{ \langle y' - y, x' - x \rangle \ge 0 \quad \text{whenever } y \in u(x) \bigr\} \quad \implies \quad y' \in u(x').$$

It will be convenient to identify *u* with its graph; we will often write (*x*, *y*) ∈ *u* to mean *y* ∈ *u*(*x*). Note that *u*(*x*) can be empty, even when *u* is maximally monotone. The previous example for *u* is not maximally monotone, but it will be if we modify *u*(0) to be (−∞,0] and *u*(1) to be [0,∞).

Of course, if φ : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup> ∪ {∞} is convex, then *<sup>u</sup>* <sup>=</sup> ∂ φ is monotone. It follows from Theorem 1.7.6 that *u* is maximally cyclically monotone (no points can be added to its graph while preserving cyclical monotonicity). It can actually be shown that *u* is maximally monotone [6, Section 7]. In what follows, we will always work with subdifferentials of convex functions, so unless stated otherwise, *u* will always be assumed maximally monotone.

Maximally monotone functions enjoy the following very useful continuity property. It is proven in [6, Corollary 1.3] and will be used extensively below.

**Proposition 1.7.8 (Continuity at Singletons)** *Let x* <sup>∈</sup> <sup>R</sup>*<sup>d</sup> such that u*(*x*) = {*y*} *is a singleton. Then u is nonempty on some neighbourhood of x and it is continuous at x: if xn* → *x and yn* ∈ *u*(*xn*)*, then yn* → *y.*

Notice that this result implies that if a convex function φ is differentiable on some open set *<sup>E</sup>* <sup>⊆</sup> <sup>R</sup>*d*, then it is continuously differentiable there (Rockafellar [113, Corollary 25.5.1]).

If *<sup>f</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>∪ {∞} is any function, one can define its subgradient at *<sup>x</sup>* locally as

$$\partial f(x) = \{ y : f(z) \ge f(x) + \langle y, z - x \rangle + o(\|z - x\|) \} = \left\{ y : \liminf_{z \to x} \frac{f(z) - f(x) - \langle y, z - x \rangle}{\|z - x\|} \ge 0 \right\}.$$

(See the discussion after Theorem 1.8.2.) When *f* is convex, one can remove the *o*(‖*z*−*x*‖) term, and the inequality holds for all *z*, i.e., globally and not only locally. Since monotonicity is related to convexity, it should not be surprising that monotonicity is in some sense a local property. Suppose that *u*(*x*0) = {*y*0} is a singleton and that for some *<sup>y</sup>*<sup>∗</sup> <sup>∈</sup> <sup>R</sup>*d*,

$$
\langle \mathbf{y} - \mathbf{y}^\*, \mathbf{x} - \mathbf{x}\_0 \rangle \ge 0,
$$

for all *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* and *<sup>y</sup>* <sup>∈</sup> *<sup>u</sup>*(*x*). Then by maximality, *<sup>y</sup>*<sup>∗</sup> must equal *<sup>y</sup>*0. By "local property", we mean that the conclusion *y*<sup>∗</sup> = *y*<sup>0</sup> holds if the above inequality holds for *x* in a small neighbourhood of *x*<sup>0</sup> (an open set that includes *x*0). We will need a more general version of this result, replacing neighbourhoods by a weaker condition that can be related to Lebesgue points. The strengthening is somewhat technical; the reader can skip directly to Lemma 1.7.10 and assume that *G* is open without losing much intuition.

Let *B<sub>r</sub>*(*x*<sub>0</sub>) = {*x* : ‖*x*−*x*<sub>0</sub>‖ < *r*} for *<sup>r</sup>* <sup>≥</sup> 0 and *<sup>x</sup>*<sup>0</sup> <sup>∈</sup> <sup>R</sup>*d*. The interior of a set *<sup>G</sup>* <sup>⊆</sup> <sup>R</sup>*<sup>d</sup>* is denoted by int*G* and the closure by *G̅*. If *G* is measurable, then Leb*G* denotes the Lebesgue measure of *G*. Finally, conv*G* denotes the convex hull of *G*.

A point *x*<sup>0</sup> is a *Lebesgue point* (or of *Lebesgue density*) of a measurable set *<sup>G</sup>* <sup>⊆</sup> <sup>R</sup>*<sup>d</sup>* if for any ε > 0 there exists *t*<sub>ε</sub> > 0 such that

$$\frac{\mathrm{Leb}(B_t(x_0) \cap G)}{\mathrm{Leb}(B_t(x_0))} > 1 - \varepsilon, \qquad 0 < t < t_\varepsilon.$$

An illuminating example is the set {*y* ≤ √|*x*|} in <sup>R</sup><sup>2</sup> (see Fig. 1.1). Since the "slope" of the square root is infinite at the origin, *x*<sup>0</sup> = (0,0) is a Lebesgue point, but the fraction above is strictly smaller than one for all *t* > 0.

Fig. 1.1: The set *G* = {(*x*, *y*) : |*x*| ≤ 1, −0.2 ≤ *y* ≤ √|*x*|}

We denote the set of points of Lebesgue density of *G* by *G*<sup>den</sup>. Here are some facts about *G*<sup>den</sup>: clearly, int*G* ⊆ *G*<sup>den</sup> ⊆ *G̅*. Stein and Shakarchi [121, Chapter 3, Corollary 1.5] show that Leb(*G* \ *G*<sup>den</sup>) = 0 (and Leb(*G*<sup>den</sup> \ *G*) = 0, so *G*<sup>den</sup> is very close to *G*). By the Hahn–Banach theorem, *G*<sup>den</sup> ⊆ int(conv(*G*)): indeed, if *x* is not in int(conv*G*), then there is a separating hyperplane between *x* and conv*G* ⊇ *G*, so the fraction above is at most 1/2 for all *t* > 0.
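The Lebesgue-point behaviour of the set in Fig. 1.1 can be probed by Monte Carlo: estimate the fraction Leb(*B<sub>t</sub>*(0) ∩ *G*)/Leb(*B<sub>t</sub>*(0)) and watch it approach 1 as *t* ↓ 0. A sketch (taking *y* ≤ √|*x*| as the defining inequality; sample sizes and names are our own choices):

```python
import numpy as np

def density_fraction(t, n=400_000, seed=0):
    """Monte Carlo estimate of Leb(B_t(0) ∩ G) / Leb(B_t(0))
    for G = {(x, y) : y <= sqrt(|x|)} and the disc B_t(0) in R^2."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-t, t, size=(n, 2))
    pts = pts[np.sum(pts ** 2, axis=1) <= t * t]   # keep points in the disc
    x, y = pts[:, 0], pts[:, 1]
    return float(np.mean(y <= np.sqrt(np.abs(x))))
```

The excluded region {*y* > √|*x*|} has area of order *t*<sup>3</sup> inside *B<sub>t</sub>*(0), so the fraction behaves like 1 − *O*(*t*): it is close to 1 for small *t*, yet strictly below 1 for every *t* > 0.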

The "denseness" of Lebesgue points is materialised in the following result. It is given as an exercise in [121] when *d* = 1, and the proof can be found on page 27 in the supplement.

**Lemma 1.7.9 (Density Points and Distance)** *Let x*<sup>0</sup> *be a point of Lebesgue density of a measurable set G* <sup>⊆</sup> <sup>R</sup>*d. Then*

$$\delta(z) = \delta_G(z) = \inf_{x \in G} \|z - x\| = o(\|z - x_0\|), \qquad \text{as } z \to x_0.$$

Of course, this result holds for any *x*<sup>0</sup> ∈ *G* if the little *o* is replaced by big *O*, since δ is Lipschitz. When *x*<sup>0</sup> ∈ int*G*, this is trivial because δ vanishes on int*G*.

The important part here is the following corollary: for almost all *x* ∈ *G*, δ(*z*) = *o*(‖*z* − *x*‖) as *z* → *x*. This can be seen in other ways: since δ is Lipschitz, it is differentiable almost everywhere. If *x* ∈ *G* and δ is differentiable at *x*, then ∇δ(*x*) must be 0 (because δ is minimised there), and then δ(*z*) = *o*(‖*z* − *x*‖). We just showed that δ is differentiable with vanishing derivative at all Lebesgue points of *G*. The converse is not true: *G* = {±1/*n*}<sup>∞</sup><sub>*n*=1</sub> has no Lebesgue points, but δ(*y*) ≤ 4*y*<sup>2</sup> as *y* → 0.

The locality of monotone functions can now be stated as follows. It is proven on page 27 of the supplement.

**Lemma 1.7.10 (Local Monotonicity)** *Let x*<sup>0</sup> <sup>∈</sup> <sup>R</sup>*<sup>d</sup> such that u*(*x*0) = {*y*0} *and x*<sup>0</sup> *is a Lebesgue point of a set G satisfying*

$$
\langle y - y^*,\, x - x_0 \rangle \ge 0 \qquad \forall x \in G,\ \forall y \in u(x).
$$

*Then y*<sup>∗</sup> = *y*<sub>0</sub>*. In particular, the result is true if the inequality holds on G* = *O* \ *N with O* ≠ ∅ *open and N Lebesgue negligible.*

These continuity properties cannot be of much use unless *u*(*x*) is a singleton for reasonably many values of *x*. Fortunately, this is indeed the case: the set of points *x* such that *u*(*x*) contains more than one element has Lebesgue measure 0 (see Alberti and Ambrosio [6, Remark 2.3] for a stronger result). Another issue is that *u* may be empty-valued, and convexity comes into play here again. Let dom*u* = {*x* : *u*(*x*) ≠ ∅}. Then there exists a convex closed set *K* such that

$$\mathsf{int}K \subseteq \mathsf{dom}u \subseteq K.$$

[6, Corollary 1.3(2)]. Although dom*u* itself may fail to be convex, it is almost convex in the above sense. By convexity, *K* \ int*K* has Lebesgue measure 0 (see the discussion after Theorem 1.6.1) and so the set of points in *K* where *u* is not a singleton,

{*x* ∈ *K* : *u*(*x*) = ∅} ∪ {*x* ∈ *K* : *u*(*x*) contains more than one point},

has Lebesgue measure 0, and *u*(*x*) is empty for all *x* ∉ *K*. (It is in fact not difficult to show that if *x* ∈ ∂*K*, then *u*(*x*) cannot be a singleton, by the Hahn–Banach theorem.)

With this background on monotone functions at our disposal, we are now ready to state the stability result for the optimal maps. We assume the following.

**Assumptions 1** *Let* μ*n*,μ,ν*n*,ν <sup>∈</sup> *<sup>P</sup>*(R*d*) *with optimal couplings (with respect to quadratic cost)* π*n* ∈ Π(μ*n*,ν*n*)*,* π ∈ Π(μ,ν) *and convex potentials* φ*<sup>n</sup> and* φ*, respectively, such that*


$$\limsup\_{n \to \infty} \int\_{\mathcal{X}^2} \frac{1}{2} ||x - y||^2 \, \mathrm{d}\pi\_n(x, y) < \infty;$$

• *(*unique limit*) the optimal* π ∈ Π(μ,ν) *is unique.*

*We further denote the subgradients* ∂ φ*<sup>n</sup>* *and* ∂ φ *by u<sub>n</sub> and u, respectively.*

These assumptions imply that π has a finite total cost. This can be shown by the liminf argument in the proof of Theorem 1.7.2 but also from the uniqueness of π. As a corollary of the uniqueness of π, it follows that π*n* → π weakly; notice that this holds even if π*<sup>n</sup>* is not unique for any *n*. We will now translate this weak convergence to convergence of the maximal monotone maps *un* to *u*, in the following form.

**Proposition 1.7.11 (Uniform Convergence of Optimal Maps)** *Let Assumptions 1 hold true and denote E* = suppμ *and E*den *the set of its Lebesgue points. Let* Ω *be a compact subset of E*den *on which u is univalued (i.e., u*(*x*) *is a singleton for all x* ∈ Ω*). Then un converges to u uniformly on* Ω*: un*(*x*) *is nonempty for all x* ∈ Ω *and all n* > *N*Ω*, and*

$$\sup_{x \in \Omega} \sup_{y \in u_n(x)} \|y - u(x)\| \to 0, \qquad n \to \infty.$$

*In particular, if u is univalued throughout* int(*E*) *(so that* φ <sup>∈</sup> *C*<sup>1</sup> *there), then uniform convergence holds for any compact* Ω ⊂ int(*E*)*.*

The proof of Proposition 1.7.11, given on page 28 of the supplement, follows two separate steps:


**Corollary 1.7.12 (Pointwise Convergence** μ**-Almost Surely)** *If in addition* μ *is absolutely continuous, then un*(*x*) → *u*(*x*) μ*-almost surely.*

*Proof.* We first claim that *E* ⊆ dom*u*. Indeed, for any *x* ∈ *E* and any ε > 0, the ball *B* = *B*ε (*x*) has positive measure. Consequently, *u* cannot be empty on the entire ball, because otherwise μ(*B*) = π(*B*×R*d*) would be 0. Since dom*<sup>u</sup>* is almost convex (see the discussion before Assumptions 1), this implies that actually int(conv*E*) ⊆ dom*u*.

The rest is now easy: the set of points *x* ∈ *E* for which Ω = {*x*} fails to satisfy the conditions of Proposition 1.7.11 is included in

$$(E \setminus E^{\mathrm{den}}) \cup \{ x \in \mathrm{int}(\mathrm{conv}(E)) : u(x) \text{ contains more than one point} \},$$

which is μ-negligible because μ is absolutely continuous and both sets have Lebesgue measure 0.
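The convergence promised by Proposition 1.7.11 (and Theorem 1.7.7) is easy to visualise in *d* = 1, where optimal maps are quantile transforms. A sketch (our own construction, assuming SciPy) with μ uniform on (0,1), ν standard normal, and ν*<sub>n</sub>* an empirical approximation of ν; the optimal map from μ to ν*<sub>n</sub>* is then the empirical quantile function:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def empirical_map(p, sample):
    """Optimal map from U(0,1) to the empirical measure of `sample`:
    the empirical quantile function, evaluated at p."""
    return np.quantile(sample, p)

p = np.linspace(0.1, 0.9, 81)        # compact subset of (0, 1)
true_T = stats.norm.ppf(p)           # optimal map from U(0,1) to N(0,1)

# Uniform error on the compact set, for two sample sizes.
errors = [np.max(np.abs(empirical_map(p, rng.normal(size=n)) - true_T))
          for n in (100, 100_000)]
```

The uniform error on [0.1, 0.9] shrinks as the sample size grows, illustrating the locally uniform convergence of *T<sub>n</sub>* to *T*.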

#### **1.8 Complementary Slackness and More General Cost Functions**

It is well-known (Luenberger and Ye [89, Section 4.4]) that the solutions to the primal and dual problems are related to each other via *complementary slackness*. In other words, solution of one problem provides a lot of information about the solution of the other problem. Here, we show that this idea remains true for the Kantorovich primal and dual problems, extending the discussion in Sect. 1.6.1 to more general cost functions.

Let *X* and *Y* be complete separable metric spaces, μ ∈ *P*(*X* ), ν ∈ *P*(*Y* ), and *<sup>c</sup>* : *<sup>X</sup>* <sup>×</sup>*<sup>Y</sup>* <sup>→</sup> <sup>R</sup><sup>+</sup> be a measurable cost function.

If one finds functions (ϕ,ψ) ∈ Φ*<sup>c</sup>* and a transference plan π ∈ Π(μ,ν) having the same objective values, then by weak duality (ϕ,ψ) is optimal in Φ*<sup>c</sup>* and π is optimal in Π(μ,ν). Having the same objective values is equivalent to

$$\int_{\mathcal{X}\times\mathcal{Y}} [c(x, y) - \varphi(x) - \psi(y)] \, \mathrm{d}\pi(x, y) = 0,$$

which is in turn equivalent to

$$
\varphi(x) + \psi(y) = c(x, y), \qquad \pi\text{-almost surely.}
$$

It has already been established that there exists an optimal transference plan π∗. Assuming that *C*(π∗) < ∞ (otherwise all transference plans are optimal), a pair (ϕ,ψ) ∈ Φ*<sup>c</sup>* is optimal if and only if

$$
\varphi(x) + \psi(y) = c(x, y), \qquad \pi^*\text{-almost surely.}
$$

Conversely, if (ϕ0,ψ<sup>0</sup>) is an optimal pair, then π is optimal if and only if it is concentrated on the set

$$\{(x, y) : \varphi_0(x) + \psi_0(y) = c(x, y)\}.$$

In particular, if for a given $x$ there exists a unique $y$ such that $\varphi_0(x) + \psi_0(y) = c(x, y)$, then the mass at $x$ must be sent entirely to $y$ and not be split; if this is the case for $\mu$-almost all $x$, then this relation defines $y$ as a function of $x$, and the resulting optimal $\pi$ is in fact induced from a transport map. This idea provides a criterion for solvability of the Monge problem (Villani [125, Theorem 5.30]).
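The complementary slackness relations are easy to witness numerically in the discrete case. The following sketch (our illustration, not part of the text; it assumes SciPy is available and uses arbitrary data) solves a small Kantorovich problem as a linear program with `scipy.optimize.linprog`, reads off a dual pair $(\varphi,\psi)$ from the equality-constraint multipliers, and checks that $\varphi(x_i) + \psi(y_j) = c(x_i, y_j)$ on the support of the optimal plan:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete instance: mu on n atoms x_i, nu on m atoms y_j (hypothetical data).
rng = np.random.default_rng(0)
n, m = 5, 6
x, y = rng.normal(size=n), rng.normal(size=m)
mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
C = 0.5 * (x[:, None] - y[None, :]) ** 2        # quadratic cost c(x, y)

# Kantorovich LP: minimise <C, pi> over couplings pi with marginals (mu, nu).
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0            # row sums of pi equal mu
for j in range(m):
    A_eq[n + j, j::m] = 1.0                     # column sums of pi equal nu
b = np.concatenate([mu, nu])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
pi = res.x.reshape(n, m)

# The equality multipliers give a dual pair (phi, psi); sign conventions for
# LP duals vary, so we normalise via strong duality: <b, (phi, psi)> = cost.
dual = res.eqlin.marginals
if not np.isclose(b @ dual, res.fun):
    dual = -dual
phi, psi = dual[:n], dual[n:]

slack = C - phi[:, None] - psi[None, :]         # >= 0, and = 0 on supp(pi)
print(np.max(np.abs(slack[pi > 1e-9])))
```

Dual feasibility gives `slack >= 0` everywhere, while complementary slackness forces it to vanish wherever the optimal plan puts mass.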

#### *1.8.1 Unconstrained Dual Kantorovich Problem*

It turns out that the dual Kantorovich problem can be recast as an unconstrained optimisation problem over a single function $\varphi$. The new formulation is not only conceptually simpler than the original one, but also sheds light on the properties of the optimal dual variables. Since the dual objective function to be maximised,

$$
\int_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int_{\mathcal{Y}} \psi \, \mathrm{d}\nu,
$$

is increasing in ϕ and ψ, one should seek functions that take values as large as possible subject to the constraint ϕ(*x*) +ψ(*y*) ≤ *c*(*x*, *y*). Suppose that an oracle tells us that some ϕ ∈ *L*1(μ) is a good candidate. Then the largest possible ψ satisfying (ϕ,ψ) ∈ Φ*<sup>c</sup>* is defined as

$$
\psi(y) = \inf_{x \in \mathcal{X}} [c(x, y) - \varphi(x)] =: \varphi^c(y).
$$

A function taking this form is called *c-concave* [124, Chapter 2]; we say that $\psi$ is the *c-transform* of $\varphi$. It is not necessarily true that $\varphi^c$ is integrable or even measurable, but if we neglect this difficulty, then it is obvious that

$$\sup_{\psi \in L_1(\nu)\,:\,(\varphi,\psi) \in \Phi_c} \left[ \int_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int_{\mathcal{Y}} \psi \, \mathrm{d}\nu \right] = \int_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int_{\mathcal{Y}} \varphi^c \, \mathrm{d}\nu.$$

The dual problem can thus be formulated as the unconstrained problem

$$\sup_{\varphi \in L_1(\mu)} \left[ \int_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int_{\mathcal{Y}} \varphi^c \, \mathrm{d}\nu \right].$$

One can apply this $c$-transform again and replace $\varphi$ by

$$\varphi^{cc}(x) = (\varphi^c)^c(x) = \inf_{y \in \mathcal{Y}} [c(x, y) - \varphi^c(y)] \ge \varphi(x),$$

so that $\varphi^{cc}$ has a better objective value, yet still $(\varphi^{cc}, \varphi^c) \in \Phi_c$ (modulo measurability issues). An elementary calculation shows that in general $\varphi^{ccc} = \varphi^c$. Thus, for any function $\varphi_1$, the pair of functions $(\varphi, \psi) = (\varphi_1^{cc}, \varphi_1^{c})$ has a better objective value than $(\varphi_1, \psi_1)$, and satisfies $(\varphi, \psi) \in \Phi_c$. Moreover, $\varphi^c = \psi$ and $\psi^c = \varphi$; in words, $\varphi$ and $\psi$ are *c-conjugate*. An optimal dual pair $(\varphi, \psi)$ can be expected to be *c*-conjugate; this is indeed true almost surely:
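On a grid, the $c$-transform takes only a few lines to compute. The following sketch (our illustration; the grid, cost, and starting potential are arbitrary choices) verifies the two facts just used, $\varphi^{cc} \ge \varphi$ and $\varphi^{ccc} = \varphi^c$, for the quadratic cost:

```python
import numpy as np

# Grids for x and y, quadratic cost c(x, y) = (x - y)^2 / 2.
xs = np.linspace(-2.0, 2.0, 201)
ys = np.linspace(-2.0, 2.0, 201)
c = 0.5 * (xs[:, None] - ys[None, :]) ** 2       # c[i, j] = c(xs[i], ys[j])

def c_transform(phi, cost):
    """psi(y) = inf_x [c(x, y) - phi(x)], columnwise on the grid."""
    return np.min(cost - phi[:, None], axis=0)

phi = np.sin(3.0 * xs)                # an arbitrary starting potential
psi = c_transform(phi, c)             # phi^c, a function of y
phi_cc = c_transform(psi, c.T)        # (phi^c)^c, a function of x
psi_2 = c_transform(phi_cc, c)        # phi^ccc, which equals phi^c

print(np.min(phi_cc - phi))           # nonnegative: phi^cc >= phi pointwise
```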

**Proposition 1.8.1 (Existence of an Optimal Pair)** *Let* μ *and* ν *be probability measures on $\mathcal{X}$ and $\mathcal{Y}$ such that the independent coupling with respect to the nonnegative and lower semicontinuous cost function is finite: $\int_{\mathcal{X}\times\mathcal{Y}} c(x, y) \, \mathrm{d}\mu(x)\,\mathrm{d}\nu(y) < \infty$. Then there exists an optimal pair $(\varphi, \psi)$ for the dual Kantorovich problem. Furthermore, the pair can be chosen in a way that* μ*-almost surely,* $\varphi = \psi^c$ *and* ν*-almost surely,* $\psi = \varphi^c$*.*

Proposition 1.8.1 is established (under weaker conditions) by Ambrosio and Pratelli [11, Theorem 3.2]. It is clear from the discussion above that once existence of an optimal pair $(\varphi_1, \psi_1)$ is established, the pair $(\varphi, \psi) = (\varphi_1^{cc}, \varphi_1^{c})$ should be optimal. Combining Proposition 1.8.1 with the preceding subsection, we see that if $\varphi$ is optimal (for the unconstrained dual problem), then any optimal transference plan $\pi^*$ must be concentrated on the set

$$\{(x, y) : \varphi(x) + \varphi^c(y) = c(x, y)\}.$$

If for $\mu$-almost every $x$ this equation defines $y$ uniquely as a (measurable) function of $x$, then $\pi^*$ is induced by a transport map. Indeed, we have seen that this is the case for the quadratic cost $c(x, y) = \|x - y\|^2/2$ when $\mu$ is absolutely continuous. An extension to $p > 1$ (instead of $p = 2$) is sketched in Sect. 1.8.3.

We remark that at the level of generality of Proposition 1.8.1, the function $\varphi^c$ may fail to be Borel measurable; Ambrosio and Pratelli show that this pair can be modified up to null sets in order to be Borel measurable. If $c$ is continuous, however, then $\varphi^c$ is an infimum of a collection of continuous functions (in $y$). Hence $-\varphi^c$ is lower semicontinuous, which yields that $\varphi^c$ is measurable. When $c$ is *uniformly* continuous, measurability of $\varphi^c$ can be established in a more lucid way, as exemplified in the next subsection.

#### *1.8.2 The Kantorovich–Rubinstein Theorem*

Whether $\varphi^c(y)$ is tractable to evaluate depends on the structure of $c$. We have seen an example where $c$ was the quadratic Euclidean distance. Here, we shall consider another useful case, where $c$ is a metric. Assume that $\mathcal{X} = \mathcal{Y}$, denote their metric by $d$, and let $c(x, y) = d(x, y)$. If $\varphi = \psi^c$ is $c$-concave, then it is 1-Lipschitz. Indeed, by definition and the triangle inequality,

$$\varphi(z) = \inf_{y \in \mathcal{Y}} [d(z, y) - \psi(y)] \le \inf_{y \in \mathcal{Y}} [d(x, y) + d(x, z) - \psi(y)] = \varphi(x) + d(x, z).$$

Interchanging $x$ and $z$ yields $|\varphi(x) - \varphi(z)| \le d(x, z)$.<sup>3</sup>

Next, we claim that if $\varphi$ is 1-Lipschitz, then $\varphi^c(y) = -\varphi(y)$. Indeed, choosing $x = y$ in the infimum shows that $\varphi^c(y) \le d(y, y) - \varphi(y) = -\varphi(y)$. But the Lipschitz condition on $\varphi$ implies that for all $x$, $d(x, y) - \varphi(x) \ge -\varphi(y)$. In view of that, we can take $\varphi$ in the dual problem to be 1-Lipschitz and $\psi = -\varphi$, and the duality formula (Theorem 1.4.2) takes the form

$$\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X}^2} d(x, y) \, \mathrm{d}\pi(x, y) = \sup_{\|\varphi\|_{\mathrm{Lip}} \le 1} \left| \int_{\mathcal{X}} \varphi \, \mathrm{d}\mu - \int_{\mathcal{X}} \varphi \, \mathrm{d}\nu \right|, \qquad \|\varphi\|_{\mathrm{Lip}} = \sup_{x \ne y} \frac{|\varphi(x) - \varphi(y)|}{d(x, y)}. \tag{1.11}$$

<sup>3</sup> In general, $\psi^c$ inherits the modulus of continuity of $c$; see Santambrogio [119, page 11].

This is known as the *Kantorovich–Rubinstein theorem* [124, Theorem 1.14]. (We have been a bit sloppy because $\varphi$ may not be integrable. But if for some $x_0 \in \mathcal{X}$ the function $x \mapsto d(x, x_0)$ is in $L_1(\mu)$, then any Lipschitz function is $\mu$-integrable. Otherwise one needs to restrict the supremum to, e.g., bounded Lipschitz $\varphi$.)
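For empirical measures on the line, $W_1$ is readily computed, and any particular 1-Lipschitz $\varphi$ plugged into (1.11) yields a lower bound. A small sketch of ours (the samples are arbitrary) using `scipy.stats.wasserstein_distance`, which computes exactly this distance in one dimension:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=500)      # sample from mu
y = rng.normal(0.5, 2.0, size=500)      # sample from nu

w1 = wasserstein_distance(x, y)         # W_1 between the empirical measures

# In one dimension the optimal coupling sorts both samples:
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))

# The 1-Lipschitz function phi(x) = x gives a lower bound through (1.11):
lower = abs(np.mean(x) - np.mean(y))
print(w1, w1_sorted, lower)
```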

#### *1.8.3 Strictly Convex Cost Functions on Euclidean Spaces*

We now return to the Euclidean case $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$ and explore the structure of $c$-transforms. When $c$ is different from $\|x - y\|^2/2$, we can no longer "open up the square" and relate the Monge–Kantorovich problem to convexity. However, we can still apply the idea that $\varphi(x) + \varphi^c(y) = c(x, y)$ if and only if the infimum defining $\varphi^c(y)$ is attained at $x$. Indeed, recall that

$$
\varphi^c(y) = \inf_{x \in \mathcal{X}} [c(x, y) - \varphi(x)],
$$

so that $\varphi(x) + \varphi^c(y) = c(x, y)$ if and only if

$$
\varphi(z) - \varphi(x) \le c(z, y) - c(x, y), \qquad z \in \mathcal{X}.
$$

Notice the similarity to the subgradient inequality for convex functions, with the sign reversed. In analogy, we call the collection of $y$'s satisfying the above inequality the *c-superdifferential* of $\varphi$ at $x$, and we denote it by $\partial^c\varphi(x)$. Of course, if $c(x, y) = \|x - y\|^2/2$, then $y \in \partial^c\varphi(x)$ if and only if $y$ is a subgradient of $\|\cdot\|^2/2 - \varphi$ at $x$.

The following result generalises Theorem 1.6.2 to other powers *p* > 1 of the Euclidean norm. These cost functions define the Wasserstein distances of the next chapter.

**Theorem 1.8.2 (Strictly Convex Costs on** $\mathbb{R}^d$**)** *Let $c(x, y) = h(x - y)$ with $h(v) = \|v\|^p/p$ for some $p > 1$, and let* μ *and* ν *be probability measures on $\mathbb{R}^d$ with finite $p$-th moments such that* μ *is absolutely continuous with respect to Lebesgue measure. Then the solution to the Kantorovich problem with cost function $c$ is unique and induced from a transport map $T$. Furthermore, there exists an optimal pair $(\varphi, \varphi^c)$ of the dual problem, with $\varphi$ $c$-concave. The solutions are related by*

$$T(x) = x - \nabla\varphi(x)\,\|\nabla\varphi(x)\|^{1/(p-1)-1} \qquad (\mu\text{-almost surely}).$$

*Proof (Assuming* ν *has Compact Support).* The existence of the optimal pair $(\varphi, \varphi^c)$ with the desired properties follows from Proposition 1.8.1 (they are Borel measurable because $c$ is continuous). We shall now show that $\varphi$ has a unique $c$-supergradient $\mu$-almost surely.

**Step 1:** $\varphi$ **is** $c$**-superdifferentiable.** Let $\pi^*$ be an optimal coupling. By duality arguments, $\pi^*$ is concentrated on the set of $(x, y)$ such that $y \in \partial^c\varphi(x)$. Consequently, for $\mu$-almost any $x$, the $c$-superdifferential of $\varphi$ at $x$ is nonempty.

**Step 2:** $\varphi$ **is differentiable.** Here, we impose the additional condition that $\nu$ is compactly supported. Then $\varphi$ can be taken as a $c$-transform over the compact support of $\nu$. Since $h$ is locally Lipschitz (it is $C^1$ because $p > 1$), this implies that $\varphi$ is locally Lipschitz. Hence, it is differentiable Lebesgue-almost surely, and consequently $\mu$-almost surely.

**Step 3: Conclusion.** For $\mu$-almost every $x$ there exist $y \in \partial^c\varphi(x)$ and a gradient $u = \nabla\varphi(x)$. In particular, $u$ is a subgradient of $\varphi$:

$$
\varphi(z) - \varphi(x) \ge \langle u, z - x \rangle + o(\|z - x\|).
$$

Here and more generally, $o(\|z - x\|)$ denotes a function $r(z)$ (defined in a neighbourhood of $x$) such that $r(z)/\|z - x\| \to 0$ as $z \to x$. (If $\varphi$ were convex, then we could take $r \equiv 0$, so the definition for convex functions is equivalent, and then the inequality holds globally and not only locally.) But $y \in \partial^c\varphi(x)$ means that as $z \to x$,

$$h(z - y) - h(x - y) = c(z, y) - c(x, y) \ge \varphi(z) - \varphi(x) \ge \langle u, z - x \rangle + o(\|z - x\|).$$

In other words, $u$ is a subgradient of $h$ at $x - y$. But $h$ is differentiable with gradient $\nabla h(v) = v\|v\|^{p-2}$ (zero if $v = 0$). We obtain $\nabla\varphi(x) = u = \nabla h(x - y)$ and, since the gradient of $h$ is invertible, we conclude

$$y = T(x) := x - (\nabla h)^{-1}[\nabla\varphi(x)],$$

which defines $y$ as a (measurable) function of $x$.<sup>4</sup> Hence, the optimal transference plan $\pi$ is unique and induced from the transport map $T$.

The general result, without assuming compact support for $\nu$, can be found in Gangbo and McCann [59]. It holds for a larger class of functions $h$: those that are strictly convex on $\mathbb{R}^d$ (this yields that $\nabla h$ is invertible), have superlinear growth ($h(v)/\|v\| \to \infty$ as $\|v\| \to \infty$), and satisfy a technical geometric condition (which $\|v\|^p/p$ does when $p > 1$). Furthermore, if $h$ is sufficiently smooth, namely $h \in C^{1,1}$ locally (which holds if $p \ge 2$, but not if $p \in (1,2)$), then $\mu$ need not be absolutely continuous; it suffices that it give no positive measure to any set of Hausdorff dimension at most $d - 1$. When $d = 1$, this means that Theorem 1.8.2 is still valid as long as $\mu$ has no atoms ($\mu(\{x\}) = 0$ for all $x \in \mathbb{R}$), which is a weaker condition than $\mu$ being absolutely continuous.
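In one dimension and for strictly convex costs, the optimal coupling transports the sorted atoms of $\mu$ to the sorted atoms of $\nu$ in order. This can be confirmed on a discrete example (our sketch; the data and the choice $p = 3$ are arbitrary) by solving the assignment problem exactly:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n, p = 40, 3.0                                   # p > 1: strictly convex cost
x = np.sort(rng.normal(size=n))                  # atoms of mu (equal weights)
y = np.sort(rng.normal(1.0, 2.0, size=n))        # atoms of nu

C = np.abs(x[:, None] - y[None, :]) ** p / p     # c(x, y) = |x - y|^p / p
rows, cols = linear_sum_assignment(C)            # optimal discrete coupling
opt_cost = C[rows, cols].sum()
monotone_cost = np.trace(C)                      # sorted-to-sorted (monotone)
print(opt_cost, monotone_cost)                   # the two costs coincide
```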

It is also noteworthy that for strictly concave cost functions (e.g., $p \in (0,1)$), the situation is similar *when the supports of* μ *and* ν *are disjoint*. The reason is that $h$ may fail to be differentiable at $0$, but it only needs to be differentiated at $x - y$ with $x \in \operatorname{supp}\mu$ and $y \in \operatorname{supp}\nu$. If the supports are not disjoint, then one needs to leave all common mass in place until the supports become disjoint (Villani [124, Chapter 2]), and then the result of [59] applies. As a simple example, let $\mu$ be uniform on $[0,1]$ and $\nu$ be uniform on $[0,2]$. After leaving common mass in place, we are left with uniforms on $[0,1]$ and $[1,2]$ (each of total mass $1/2$) with essentially disjoint supports,

<sup>4</sup> Gradients of Borel functions are measurable, as the limit can be taken on a countable set. The inverse (∇*h*)−<sup>1</sup> equals the gradient of the Legendre transform *h*<sup>∗</sup> and is therefore Borel measurable.

for which the optimal transport map is the *decreasing* map $T(x) = 2 - x$. Thus, the unique optimal $\pi$ is not induced from a map, but rather from an equal-weight mixture of $T$ and the identity. Informally, each point $x$ in the support of $\mu$ needs to be split: half stays at $x$ and the other half is transported to $2 - x$. The optimal coupling from $\nu$ to $\mu$, in contrast, is unique and induced from the map $S(x) = x$ if $x \le 1$ and $S(x) = 2 - x$ if $x \ge 1$, which is neither increasing nor decreasing.
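The suboptimality of the monotone map under a concave cost is visible already in a discrete approximation of this example. The sketch below (ours; the discretisation level is arbitrary) places $n$ equal-mass atoms on each support and solves the assignment problem for the cost $|x - y|^{1/2}$:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Discretise mu = U[0, 1] and nu = U[0, 2] by n equal-mass atoms each.
n = 50
x = (np.arange(n) + 0.5) / n                     # atoms of mu in [0, 1]
y = 2.0 * (np.arange(n) + 0.5) / n               # atoms of nu in [0, 2]

C = np.sqrt(np.abs(x[:, None] - y[None, :]))     # concave cost |x - y|^(1/2)
rows, cols = linear_sum_assignment(C)
opt_cost = C[rows, cols].sum()
monotone_cost = np.trace(C)                      # increasing map x -> 2x
print(opt_cost, monotone_cost)
```

The optimal assignment beats the monotone one: it keeps atoms near themselves where the supports overlap and flips the rest, mirroring the mixture of the identity and $T(x) = 2 - x$ described above.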

#### **1.9 Bibliographical Notes**

Many authors, including Villani [124, Theorem 1.3]; [125, Theorem 5.10], give the duality Theorem 1.4.2 for lower semicontinuous cost functions. The version given here is a simplification of Beiglböck and Schachermayer [17, Theorem 1]. The duality holds for functions that take values in $[-\infty, \infty]$ provided that they are finite on a sufficiently large subset of $\mathcal{X} \times \mathcal{Y}$, but there are simple counterexamples if $c$ is infinite too often [17, Example 4.1]. For results outside the Polish space setup, see Kellerer [80] and Rachev and Rüschendorf [107, Chapter 4].

Theorem 1.5.1 for the one-dimensional case is taken from [124], where it is proven using the general duality theorem. For direct proofs and the history of this result, one may consult Rachev [106] or Rachev and Rüschendorf [107, Section 3.1]. The concave case is carefully treated by McCann [94].

The results in the Gaussian case were obtained independently by Olkin and Pukelsheim [98] and Givens and Shortt [65]. The proof given here is from Bhatia [20, Exercise 1.2.13]. An extension to separable Hilbert spaces can be found in Gelbrich [62] or Cuesta-Albertos et al. [39].

The regularity theory of Sect. 1.6.4 is very delicate. Caffarelli [32] showed the first part of Theorem 1.6.7; the proof can also be found in Figalli's book [52, Theorem 4.23]. Villani [124, Theorem 4.14] states the result without proof and refers to Alesker et al. [7] for a sketch of the second part of Theorem 1.6.7. Other regularity results exist; see Villani [125, Chapter 12], Santambrogio [119, Section 1.7.6], and Figalli [52].

Cuesta-Albertos et al. [40, Theorem 3.2] prove Theorem 1.7.2 for the quadratic case; the form given here is from Schachermayer and Teichmann [120, Theorem 3].

The definition of cyclical monotonicity depends on the cost function. It is typically referred to as *c*-cyclical monotonicity, with "cyclical monotonicity" reserved for the special case of quadratic cost. Since we focus on the quadratic case and for readability, we slightly deviate from the standard jargon. That cyclical monotonicity implies optimality (Proposition 1.7.5) was shown independently by Pratelli [105] (finite lower semicontinuous cost) and Schachermayer and Teichmann [120] (possibly infinite continuous cost). A joint generalisation is given by Beiglböck et al. [18].

Section 1.7.2 is taken from Zemel and Panaretos [134, Section 7.5]; a slightly weaker version was shown independently by Chernozhukov et al. [35]. Heinich and Lootgieter [68] establish almost sure pointwise convergence. If μ*<sup>n</sup>* = μ, then the optimal maps converge in μ-measure [125, Corollary 5.23] in a very general setup, but there are simple examples where this fails if μ*<sup>n</sup>* = μ [125, Remark 5.25]. In the quadratic case, further stability results of a weaker flavour (focussing on the convex potential φ, rather than its derivative, which is the optimal map) can be found in del Barrio and Loubes [42].

The idea of using the *c*-transform (Sect. 1.8) is from Rüschendorf [116].

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 The Wasserstein Space**

The Kantorovich problem described in the previous chapter gives rise to a metric structure, the *Wasserstein distance*, in the space of probability measures *P*(*X* ) on a space *X* . The resulting metric space, a subspace of *P*(*X* ), is commonly known as the *Wasserstein space W* (although, as Villani [125, pages 118–119] puts it, this terminology is "very questionable"; see also Bobkov and Ledoux [25, page 4]). In Chap. 4, we shall see that this metric is in a sense canonical when dealing with warpings, that is, deformations of the space *X* (for example, in Theorem 4.2.4). In this chapter, we give the fundamental properties of the Wasserstein space. After some basic definitions, we describe the topological properties of that space in Sect. 2.2. It is then explained in Sect. 2.3 how *W* can be endowed with a sort of infinite-dimensional Riemannian structure. Measurability issues are dealt with in the somewhat technical Sect. 2.4.

#### **2.1 Definition, Notation, and Basic Properties**

Let *X* be a separable Banach space. The *p-Wasserstein space* on *X* is defined as

$$\mathcal{W}_p(\mathcal{X}) = \left\{ \mu \in P(\mathcal{X}) : \int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu(x) < \infty \right\}, \qquad p \ge 1.$$

We will sometimes abbreviate and write simply *W<sup>p</sup>* instead of *Wp*(*X* ).

Recall that if $\mu, \nu \in P(\mathcal{X})$, then $\Pi(\mu,\nu)$ is defined to be the set of measures $\pi \in P(\mathcal{X}^2)$ having $\mu$ and $\nu$ as marginals in the sense of (1.2). The *p-Wasserstein distance* between $\mu$ and $\nu$ is defined as the minimal total transportation cost between

**Electronic Supplementary Material** The online version of this chapter (https://doi.org/10.1007/978-3-030-38438-8_2) contains supplementary material.

<sup>©</sup> The Author(s) 2020

V. M. Panaretos, Y. Zemel, *An Invitation to Statistics in Wasserstein Space*, SpringerBriefs in Probability and Mathematical Statistics, https://doi.org/10.1007/978-3-030-38438-8_2

$\mu$ and $\nu$ in the Kantorovich problem with respect to the cost function $c_p(x, y) = \|x - y\|^p$:

$$W\_p(\mu, \mathbf{v}) = \left(\inf\_{\pi \in \Pi(\mu, \mathbf{v})} C\_p(\pi)\right)^{1/p} = \left(\inf\_{\pi \in \Pi(\mu, \mathbf{v})} \int\_{\mathcal{X} \times \mathcal{X}} ||\mathbf{x}\_1 - \mathbf{x}\_2||^p \, \mathrm{d}\pi(\mathbf{x}\_1, \mathbf{x}\_2)\right)^{1/p}.$$

The Wasserstein distance between μ and ν is finite when both measures are in *Wp*(*X* ), because

$$||\boldsymbol{x}\_1 - \boldsymbol{x}\_2||^p \le 2^p ||\boldsymbol{x}\_1||^p + 2^p ||\boldsymbol{x}\_2||^p.$$

Thus, $W_p$ is finite on $[\mathcal{W}_p(\mathcal{X})]^2 = \mathcal{W}_p(\mathcal{X}) \times \mathcal{W}_p(\mathcal{X})$; it is nonnegative and symmetric, and it is easy to see that $W_p(\mu, \nu) = 0$ if and only if $\mu = \nu$. A proof that $W_p$ is a metric (satisfies the triangle inequality) on $\mathcal{W}_p$ can be found in Villani [124, Chapter 7].

The aforementioned setting is by no means the most general one can consider. Firstly, one can define $W_p$ and $\mathcal{W}_p$ for $0 < p < 1$ by removing the power $1/p$ from the infimum; the limit case $p = 0$ yields the total variation distance. Another limit case can be defined as $W_\infty(\mu, \nu) = \lim_{p \to \infty} W_p(\mu, \nu)$. Moreover, $W_p$ and $\mathcal{W}_p$ can be defined whenever $\mathcal{X}$ is a complete and separable metric space (or even only separable; see Clément and Desch [36]): one fixes some $x_0$ in $\mathcal{X}$ and replaces $\|x\|$ by $d(x, x_0)$. Although the topological properties below still hold at that level of generality (except when $p = 0$ or $p = \infty$), for the sake of simplifying the notation we restrict the discussion to Banach spaces. It will always be assumed without explicit mention that $1 \le p < \infty$.

The space $\mathcal{W}_p(\mathcal{X})$ is precisely the collection of measures $\mu$ such that $W_p(\mu, \delta_0) < \infty$, with $\delta_x$ denoting a Dirac measure at $x$. Of course, $W_p(\mu, \nu)$ can be finite even if $\mu, \nu \notin \mathcal{W}_p(\mathcal{X})$. But if $\mu \in \mathcal{W}_p(\mathcal{X})$ and $\nu \notin \mathcal{W}_p(\mathcal{X})$, then $W_p(\mu, \nu)$ is always infinite. This can be seen from the triangle inequality

$$
\infty = W_p(\nu, \delta_0) \le W_p(\mu, \delta_0) + W_p(\mu, \nu).
$$

In the sequel, we shall almost exclusively deal with measures in *Wp*(*X* ).

The Wasserstein spaces are ordered in the sense that if *q* ≥ *p*, then *Wq*(*X* ) ⊆ *Wp*(*X* ). This property extends to the distances in the form:

$$q \ge p \ge 1 \quad \implies \quad W\_q(\mu, \nu) \ge W\_p(\mu, \nu). \tag{2.1}$$

To see this, let π ∈ Π(μ,ν) be optimal with respect to *q*. Jensen's inequality for the convex function *<sup>z</sup>* → *<sup>z</sup>q*/*<sup>p</sup>* gives

$$\mathcal{W}\_q^q(\mu, \mathbf{v}) = \int\_{\mathcal{X}^2} ||\mathbf{x} - \mathbf{y}||^q \, \mathrm{d}\pi(\mathbf{x}, \mathbf{y}) \ge \left( \int\_{\mathcal{X}^2} ||\mathbf{x} - \mathbf{y}||^p \, \mathrm{d}\pi(\mathbf{x}, \mathbf{y}) \right)^{q/p} \ge \mathcal{W}\_p^q(\mu, \mathbf{v}).$$
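For empirical measures on the line with equal-size samples, sorting both samples yields the optimal coupling simultaneously for every $p \ge 1$, so the ordering (2.1) can be observed directly (a sketch of ours; the samples are arbitrary):

```python
import numpy as np

def w_p(x, y, p):
    """W_p between two equal-size empirical measures on R (sorted coupling)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = rng.exponential(size=400)

w1, w2, w4 = w_p(x, y, 1.0), w_p(x, y, 2.0), w_p(x, y, 4.0)
print(w1, w2, w4)                    # nondecreasing in p, as (2.1) predicts
```

The monotonicity here is exactly the power-mean inequality applied to the common optimal coupling.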

The converse of (2.1) fails to hold in general, since it is possible that $W_p$ be finite while $W_q$ is infinite. A converse can be established, however, if $\mu$ and $\nu$ are supported on a common bounded set $K$:

$$q \ge p \ge 1, \quad \mu(K) = \nu(K) = 1 \implies W\_q(\mu, \nu) \le W\_p^{p/q}(\mu, \nu) \left(\sup\_{\mathbf{x}, \mathbf{y} \in K} ||\mathbf{x} - \mathbf{y}||\right)^{1 - p/q}.\tag{2.2}$$

Indeed, if we denote the supremum by *dK* and let π be now optimal with respect to *p*, then π(*K* ×*K*) = 1 and

$$W\_q^q(\mu, \mathbf{v}) \le \int\_{K^2} \|\mathbf{x} - \mathbf{y}\|^q \operatorname{d}\mathfrak{π}(\mathbf{x}, \mathbf{y}) \le d\_K^{q-p} \int\_{K^2} \|\mathbf{x} - \mathbf{y}\|^p \operatorname{d}\mathfrak{π}(\mathbf{x}, \mathbf{y}) = d\_K^{q-p} W\_p^p(\mu, \mathbf{v}).$$

Another useful property of the Wasserstein distance is the upper bound

$$W_p(\mathbf{t}\#\mu, \mathbf{s}\#\mu) \le \left( \int_{\mathcal{X}} \|\mathbf{t}(x) - \mathbf{s}(x)\|^p \, \mathrm{d}\mu(x) \right)^{1/p} = \big\| \, \|\mathbf{t} - \mathbf{s}\| \, \big\|_{L_p(\mu)} \tag{2.3}$$

for any pair of measurable functions $\mathbf{t}, \mathbf{s} : \mathcal{X} \to \mathcal{X}$. Situations where this inequality holds as an equality with $\mathbf{t}$ and $\mathbf{s}$ optimal maps are related to *compatibility* of the measures $\mu$, $\nu = \mathbf{t}\#\mu$, and $\rho = \mathbf{s}\#\mu$ (see Sect. 2.3.2), and will be of conceptual importance in the context of Fréchet means (see Sect. 3.1).
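Inequality (2.3) is easy to witness empirically in one dimension: the coupling $(\mathbf{t}(X), \mathbf{s}(X))$ with $X \sim \mu$ is admissible for the pair $(\mathbf{t}\#\mu, \mathbf{s}\#\mu)$, so its cost dominates the optimal (sorted) one. A sketch of ours, with arbitrary illustrative maps $\mathbf{t} = \tanh$ and $\mathbf{s} = \sin$:

```python
import numpy as np

def w_p(x, y, p):
    """W_p between equal-size empirical measures on R (sorted coupling)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p)

rng = np.random.default_rng(4)
x = rng.normal(size=300)                         # sample from mu
t, s = np.tanh, np.sin                           # two illustrative maps
p = 2.0

lhs = w_p(t(x), s(x), p)                         # W_p(t#mu, s#mu)
rhs = np.mean(np.abs(t(x) - s(x)) ** p) ** (1.0 / p)   # ||t - s|| in L_p(mu)
print(lhs, rhs)                                  # lhs <= rhs, as in (2.3)
```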

We also recall the notation $B_R(x_0) = \{x : \|x - x_0\| < R\}$ and $\overline{B}_R(x_0) = \{x : \|x - x_0\| \le R\}$ for open and closed balls in $\mathcal{X}$.

#### **2.2 Topological Properties**

#### *2.2.1 Convergence, Compact Subsets*

The topology of a space is determined by the collection of its closed sets. Since *Wp*(*X* ) is a metric space, whether a set is closed or not depends on which sequences in *Wp*(*X* ) converge. The following characterisation from Villani [124, Theorem 7.12] will be very useful.

**Theorem 2.2.1 (Convergence in Wasserstein Space)** *Let* μ,μ*<sup>n</sup>* ∈ *Wp*(*X* )*. Then the following are equivalent:*


*1. $W_p(\mu_n, \mu) \to 0$;*

*2. $\mu_n \to \mu$ weakly and*

$$\sup_{n} \int_{\{x : \|x\| > R\}} \|x\|^p \, \mathrm{d}\mu_n(x) \to 0, \qquad R \to \infty;\tag{2.4}$$

*3. $\mu_n \to \mu$ weakly and $\int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu_n(x) \to \int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu(x)$;*

*4. for any $C > 0$ and any continuous $f : \mathcal{X} \to \mathbb{R}$ such that $|f(x)| \le C(1 + \|x\|^p)$ for all $x$,*

$$\int\_{\mathcal{X}} f(\mathbf{x}) \, \mathbf{d}\mu\_n(\mathbf{x}) \to \int\_{\mathcal{X}} f(\mathbf{x}) \, \mathbf{d}\mu(\mathbf{x}) \,.$$

*5. (Le Gouic and Loubes [87, Lemma 14])* μ*n* → μ *weakly and there exists* ν ∈ *Wp*(*X* ) *such that Wp*(μ*n*,ν) → *Wp*(μ,ν)*.*

Consequently, the Wasserstein topology is finer than the weak topology induced on *Wp*(*X* ) from *P*(*X* ). Indeed, let *A* ⊆ *Wp*(*X* ) be weakly closed. If μ*<sup>n</sup>* ∈ *A* converge to μ in *Wp*(*X* ), then μ*n* → μ weakly, so μ ∈ *A* . In other words, the Wasserstein topology has more closed sets than the induced weak topology. Moreover, each *Wp*(*X* ) is a weakly closed subset of *P*(*X* ) by the same arguments that lead to (1.3). In view of Theorem 2.2.1, a common strategy to establish Wasserstein convergence is to first show tightness and obtain weak convergence, hence a candidate limit, and then show that the stronger Wasserstein convergence actually holds. In some situations, the last part is automatic:

**Corollary 2.2.2** *Let $K \subset \mathcal{X}$ be a bounded set and suppose that $\mu_n(K) = 1$ for all $n \ge 1$. Then $W_p(\mu_n, \mu) \to 0$ if and only if $\mu_n \to \mu$ weakly.*

*Proof.* This is immediate from (2.4).

The fact that convergence in *W<sup>p</sup>* is stronger than weak convergence is exemplified in the following result. If μ*n* → μ and ν*n* → ν in *Wp*(*X* ), then it is obvious that *Wp*(μ*n*,ν*<sup>n</sup>*) → *Wp*(μ,ν). But if the convergence is only weak, then the Wasserstein distance is still lower semicontinuous:

$$\liminf\_{n \to \infty} W\_p(\mu\_n, \nu\_n) \ge W\_p(\mu, \nu). \tag{2.5}$$

This follows from Theorem 1.7.2 and (1.3).

Before giving some examples, it will be convenient to formulate Theorem 2.2.1 in probabilistic terms. Let $X, X_n$ be random elements of $\mathcal{X}$ with laws $\mu, \mu_n \in \mathcal{W}_p(\mathcal{X})$. Assume without loss of generality that $X, X_n$ are defined on the same probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and write $W_p(X_n, X)$ to denote $W_p(\mu_n, \mu)$. Then $W_p(X_n, X) \to 0$ if and only if $X_n \to X$ weakly and $\mathbb{E}\|X_n\|^p \to \mathbb{E}\|X\|^p$.

An early example of the use of the Wasserstein metric in statistics is due to Bickel and Freedman [21]. Let $X_n$ be independent and identically distributed random variables with mean zero and variance $1$, and let $Z$ be a standard normal random variable. Then $Z_n = \sum_{i=1}^n X_i/\sqrt{n}$ converges weakly to $Z$ by the central limit theorem. But $\mathbb{E}Z_n^2 = 1 = \mathbb{E}Z^2$, so $W_2(Z_n, Z) \to 0$. Let $Z_n^*$ be a bootstrapped version of $Z_n$ constructed by resampling the $X_n$'s. If $W_2(Z_n^*, Z_n) \to 0$, then $W_2(Z_n^*, Z) \to 0$, and in particular $Z_n^*$ has the same asymptotic distribution as $Z_n$.
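The CLT example can be simulated: drawing Monte Carlo replicates of $Z_n$ and measuring the empirical $W_2$ distance to a normal sample illustrates the convergence. A sketch of ours, with centred exponential summands (any mean-zero, variance-one distribution would do; the sample sizes are arbitrary):

```python
import numpy as np

def w2(x, y):
    """W_2 between equal-size empirical measures on R (sorted coupling)."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(5)
m = 4000                                   # Monte Carlo replicates of Z_n
z = rng.standard_normal(m)                 # sample from the limit Z

def sample_Zn(n):
    # X_i: centred exponentials (mean 0, variance 1), n summands per replicate
    X = rng.exponential(size=(m, n)) - 1.0
    return X.sum(axis=1) / np.sqrt(n)

d_small, d_large = w2(sample_Zn(2), z), w2(sample_Zn(200), z)
print(d_small, d_large)                    # the distance shrinks as n grows
```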

Another consequence of Theorem 2.2.1 is that (in the presence of weak convergence) convergence of moments automatically yields convergence of smaller moments (there are, however, more elementary ways to see this). In the previous example, for instance, one can also conclude that $\mathbb{E}|Z_n|^p \to \mathbb{E}|Z|^p$ for any $p \le 2$ by the last condition of the theorem. If in addition $\mathbb{E}X_1^4 < \infty$, then

$$\mathbb{E}Z_n^4 = 3 - \frac{3}{n} + \frac{\mathbb{E}X_1^4}{n} \to 3 = \mathbb{E}Z^4$$

(see Durrett [49, Theorem 2.3.5]) so *W*4(*Zn*,*Z*) → 0 and all moments up to order 4 converge.

Condition (2.4) is called *uniform integrability* of the function $x \mapsto \|x\|^p$ with respect to the collection $(\mu_n)$. Of course, it holds for a single measure $\mu \in \mathcal{W}_p(\mathcal{X})$ by the dominated convergence theorem. This condition allows us to characterise compact sets in the Wasserstein space. One should beware that when $\mathcal{X}$ is infinite-dimensional, (2.4) alone is not sufficient in order to conclude that $\mu_n$ has a convergent subsequence: take $\mu_n$ to be Dirac measures at $e_n$, with $(e_n)$ an orthonormal basis of a Hilbert space $\mathcal{X}$ (or any sequence with $\|e_n\| = 1$ that has no convergent subsequence, if $\mathcal{X}$ is a Banach space). The uniform integrability (2.4) must be accompanied by tightness, which is a consequence of (2.4) only when $\mathcal{X} = \mathbb{R}^d$.

**Proposition 2.2.3 (Compact Sets in** $\mathcal{W}_p$**)** *A weakly tight set $\mathcal{K} \subseteq \mathcal{W}_p$ is Wasserstein-tight (has a compact closure in $\mathcal{W}_p$) if and only if*

$$\sup_{\mu \in \mathcal{K}} \int_{\{x : \|x\| > R\}} \|x\|^p \, \mathrm{d}\mu(x) \to 0, \qquad R \to \infty.\tag{2.6}$$

*Moreover,* (2.6) *is equivalent to the existence of a monotonically divergent function* $g : \mathbb{R}_+ \to \mathbb{R}_+$ *such that*

$$\sup\_{\mu \in \mathcal{K}} \int\_{\mathcal{X}} ||x||^p g(||x||) \,\mathrm{d}\mu(x) < \infty.$$

The proof is on page 41 of the supplement.

**Remark 2.2.4** *For any sequence* $(\mu_n)$ *in* $W_p$ *(tight or not), there exists a monotonically divergent g with* $\int_{\mathcal{X}} \|x\|^p g(\|x\|) \, \mathrm{d}\mu_n(x) < \infty$ *for all n.*

**Corollary 2.2.5 (Measures with Common Support)** *Let* $K \subseteq \mathcal{X}$ *be a compact set. Then*

$$\mathcal{K} = W_p(K) = \{\mu \in P(\mathcal{X}) : \mu(K) = 1\} \subseteq W_p(\mathcal{X})$$

*is compact.*

*Proof.* This is immediate, since $\mathcal{K}$ is weakly tight and the supremum in (2.6) vanishes when $R$ is larger than the finite quantity $\sup_{x \in K} \|x\|$. Finally, $K$ is closed, so $\mathcal{K}$ is weakly closed, hence Wasserstein closed, by the portmanteau Lemma 1.7.1.

For future reference, we give another consequence of uniform integrability, called *uniform absolute continuity*:

$$\forall \varepsilon > 0 \; \exists \delta > 0 \; \forall n \; \forall A \subseteq \mathcal{X} \text{ Borel}: \qquad \mu_n(A) \le \delta \quad \implies \quad \int_A \|x\|^p \, \mathrm{d}\mu_n(x) < \varepsilon. \tag{2.7}$$

To show that (2.4) implies (2.7), let $\varepsilon > 0$, choose $R = R_\varepsilon > 0$ such that the supremum in (2.4) is smaller than $\varepsilon/2$, and set $\delta = \varepsilon/(2R^p)$. If $\mu_n(A) \le \delta$, then

$$\int_A \|x\|^p \, \mathrm{d}\mu_n(x) \le \int_{A \cap \bar{B}_R(0)} \|x\|^p \, \mathrm{d}\mu_n(x) + \int_{A \setminus \bar{B}_R(0)} \|x\|^p \, \mathrm{d}\mu_n(x) < \delta R^p + \varepsilon/2 \le \varepsilon.$$

#### *2.2.2 Dense Subsets and Completeness*

If we identify a measure $\mu \in W_p(\mathcal{X})$ with a random variable $X$ (having distribution $\mu$), then $X$ has a finite $p$-th moment in the sense that the real-valued random variable $\|X\|$ is in $L_p$. In view of that, it should not come as a surprise that $W_p(\mathcal{X})$ enjoys topological properties similar to those of $L_p$ spaces. In this subsection, we give some examples of useful dense subsets of $W_p(\mathcal{X})$ and then "show" that, like $\mathcal{X}$ itself, it is a complete separable metric space. In the next subsection, we describe some of the negative properties that $W_p(\mathcal{X})$ has, again in similarity with $L_p$ spaces.

We first show that *Wp*(*X* ) is separable. The core idea of the proof is the feasibility of approximating any measure with discrete measures as follows.

Let $\mu$ be a probability measure on $\mathcal{X}$, and let $X_1, X_2, \dots$ be a sequence of independent random elements in $\mathcal{X}$ with probability distribution $\mu$. Then the *empirical measure* $\mu_n$ is defined as the random measure $n^{-1}\sum_{i=1}^n \delta_{X_i}$. The law of large numbers shows that for any (measurable) bounded or nonnegative $f : \mathcal{X} \to \mathbb{R}$, almost surely

$$\int\_{\mathcal{X}} f(\mathbf{x}) \, \mathbf{d}\mu\_n(\mathbf{x}) = \frac{1}{n} \sum\_{i=1}^n f(X\_i) \to \mathbb{E}f(X\_1) = \int\_{\mathcal{X}} f(\mathbf{x}) \, \mathbf{d}\mu(\mathbf{x}).$$

In particular, when $f(x) = \|x\|^p$, we obtain convergence of moments of order $p$. Hence by Theorem 2.2.1, if $\mu \in W_p(\mathcal{X})$, then $\mu_n \to \mu$ in $W_p(\mathcal{X})$ if and only if $\mu_n \to \mu$ weakly. We know that integrals of bounded functions converge with probability one, but the null set where convergence fails may depend on the chosen function, and there are uncountably many such functions. When $\mathcal{X} = \mathbb{R}^d$, by the portmanteau Lemma 1.7.1 we can replace the collection $C_b(\mathcal{X})$ by indicator functions of rectangles of the form $(-\infty, a_1] \times \cdots \times (-\infty, a_d]$ for $a = (a_1, \dots, a_d) \in \mathbb{R}^d$. It turns out that the countable collection provided by rational vectors $a$ suffices (see the proof of Theorem 4.4.1, where this is done in a more complicated setting). For more general spaces $\mathcal{X}$, we need to find another countable collection $\{f_j\}$ such that convergence of the integrals of $f_j$ for all $j$ suffices for weak convergence. Such a collection exists, by using bounded Lipschitz functions (Dudley [47, Theorem 11.4.1]); an alternative construction can be found in Ambrosio et al. [12, Section 5.1]. Thus:

**Proposition 2.2.6 (Empirical Measures in** $W_p$**)** *For any* $\mu \in P(\mathcal{X})$ *and the corresponding sequence of empirical measures* $\mu_n$, $W_p(\mu_n, \mu) \to 0$ *almost surely if and only if* $\mu \in W_p(\mathcal{X})$*.*

Indeed, if $\mu \notin W_p(\mathcal{X})$, then $W_p(\mu_n, \mu)$ is infinite for all $n$, since $\mu_n$ is compactly supported, hence in $W_p(\mathcal{X})$.
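Proposition 2.2.6 can be probed numerically in one dimension, where $W_p$ is computable from empirical quantile functions. A minimal sketch (the exponential law, the grid size, and the sample sizes are illustrative assumptions):

```python
import numpy as np

def wp_grid(x, y, p=2, m=1024):
    # Approximate W_p between two samples via their empirical quantile
    # functions evaluated on a common grid of m levels (1-D formula).
    u = (np.arange(m) + 0.5) / m
    qx, qy = np.quantile(x, u), np.quantile(y, u)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
reference = rng.exponential(size=200_000)   # large sample standing in for mu
for n in (100, 1_000, 10_000):
    mu_n = rng.exponential(size=n)          # empirical measure of size n
    print(n, wp_grid(mu_n, reference))      # tends to 0 as n grows
```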

Proposition 2.2.6 is the basis for constructing dense subsets of the Wasserstein space.

**Theorem 2.2.7 (Dense Subsets of** $W_p$**)** *The following collections of measures are dense in* $W_p(\mathcal{X})$*:*


*In particular,* $W_p$ *is separable (the third set is countable as* $\mathcal{X}$ *is separable).*

This is a simple consequence of Proposition 2.2.6 and approximations, and the details are given on page 43 in the supplement.

**Proposition 2.2.8 (Completeness)** *The Wasserstein space* $W_p(\mathcal{X})$ *is complete.*

One may find two different proofs in Villani [125, Theorem 6.18] and Ambrosio et al. [12, Proposition 7.1.5]. On page 43 of the supplement, we sketch an alternative argument based on completeness of the weak topology.

#### *2.2.3 Negative Topological Properties*

In the previous subsection, we have shown that $W_p(\mathcal{X})$ is separable and complete, like $L_p$ spaces. Just like them, however, the Wasserstein space is neither locally compact nor $\sigma$-compact. For this reason, existence proofs of Fréchet means in $W_p(\mathcal{X})$ require tools that are more specific to this space, and do not rely upon local compactness (see Sect. 3.1).

**Proposition 2.2.9 (**$W_p$ **is Not Locally Compact)** *Let* $\mu \in W_p(\mathcal{X})$ *and let* $\varepsilon > 0$*. Then the Wasserstein ball*

$$\bar{B}_\varepsilon(\mu) = \{\nu \in W_p(\mathcal{X}) : W_p(\mu, \nu) \le \varepsilon\}$$

*is not compact.*

Ambrosio et al. [12, Remark 7.1.9] show this when μ is a Dirac measure, and we extend their argument on page 43 of the supplement.

From this, we deduce:

**Corollary 2.2.10** *The Wasserstein space* $W_p(\mathcal{X})$ *is not* $\sigma$*-compact.*

*Proof.* If *K* is a compact set in *Wp*(*X* ), then its interior is empty by Proposition 2.2.9. A countable union of compact sets has an empty interior (hence cannot equal the entire space *Wp*(*X* )) by the Baire property, which holds on the complete metric space *Wp*(*X* ) by the Baire category theorem (Dudley [47, Theorem 2.5.2]).

#### *2.2.4 Covering Numbers*

Let $\mathcal{K} \subset W_p(\mathcal{X})$ be compact and assume that $\mathcal{X} = \mathbb{R}^d$. Then for any $\varepsilon > 0$ the covering number

$$N(\varepsilon; \mathcal{K}) = \min\left\{n : \exists \mu_1, \dots, \mu_n \in W_p(\mathcal{X}) \text{ such that } \mathcal{K} \subseteq \bigcup_{i=1}^n \{\mu : W_p(\mu, \mu_i) < \varepsilon\}\right\}$$

is finite. These numbers appear in statistics in various ways, particularly in empirical processes (see, for instance, Wainwright [126, Chapter 5]), and the goal of this subsection is to give an upper bound for $N(\varepsilon; \mathcal{K})$. Invoking Proposition 2.2.3, introduce a continuous monotone divergent $f : [0, \infty) \to [0, \infty]$ such that

$$\sup_{\mu \in \mathcal{K}} \int_{\mathbb{R}^d} \|x\|^p f(\|x\|) \, \mathrm{d}\mu(x) \le 1.$$

The function $f$ provides a certain measure of how compact $\mathcal{K}$ is. If $\mathcal{K} = W_p(K)$ is the set of measures supported on a compact $K \subseteq \mathbb{R}^d$, then $f(L)$ can be taken infinite for $L$ large, and $L$ can be treated as a constant in the theorem. Otherwise, $L$ increases as $\varepsilon \downarrow 0$, at a speed that depends on $f$: the faster $f$ diverges, the slower $L$ grows with decreasing $\varepsilon$, and the better the bound becomes.

**Theorem 2.2.11** *Let* $\varepsilon > 0$ *and* $L = f^{-1}(1/\varepsilon^p)$*. If* $d\varepsilon \le L$*, then*

$$
\log N(\varepsilon) \le C\_1(d) \left(\frac{L}{\varepsilon}\right)^d \left[ (p+d)\log\frac{L}{\varepsilon} + C\_2(d,p) \right],
$$

*with* $C_1(d) = 3^d e\theta_d$, $C_2(d, p) = (p+d)\log 3 + (p+2)\log 2 + \log\theta_d$ *and* $\theta_d = d[5 + \log d + \log\log d]$*.*

Since $\varepsilon > 0$ is small and $L$ increases as $\varepsilon$ decreases, the restriction that $d\varepsilon \le L$ is typically not binding. We provide some examples before giving the proof.

Example 1: if all the measures are supported on the $d$-dimensional unit ball, then $L$ can be taken equal to one, independently of $\varepsilon$. We obtain

$$\tilde{N}(\varepsilon) := \frac{\log N(\varepsilon)}{\log 1/\varepsilon} \le (d+p)C\_1(d)\varepsilon^{-d} + \text{smaller order terms.}$$

Example 2: if all the measures in $\mathcal{K}$ have uniform exponential moments, then $f(L) = e^L$ and $\tilde{N}(\varepsilon)$ is a constant times $\varepsilon^{-d}[\log 1/\varepsilon]^d$. The exponent $p$ appears only in the constant.

Example 3: suppose that $\mathcal{K}$ is a Wasserstein ball of order $p + \delta$, that is, $f(L) = L^\delta$. Then $L \sim \varepsilon^{-p/\delta}$ and

$$\tilde{N}(\varepsilon) \le C\_1(d)(p+d)(1+p/\delta)\varepsilon^{-d[1+p/\delta]}$$

up to smaller order terms. Here (when $0 < \delta < \infty$) the behaviour of $\tilde{N}(\varepsilon)$ depends more strongly upon $p$: if $p' < p$, then we can replace $\delta$ by $\delta' = \delta + p - p' > \delta$, leading to a smaller magnitude of $\tilde{N}(\varepsilon)$.

Example 4: if $f(L)$ is only $\log L$, then $\tilde{N}$ behaves like $\varepsilon^{-(d+p)}\exp(d\varepsilon^{-p})$, so $p$ has a very dominant effect.
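Before the proof, it is worth noting that the bound of Theorem 2.2.11 is straightforward to evaluate numerically. A sketch (the parameter values are illustrative, and the function assumes $d \ge 2$ so that $\log\log d$ is defined):

```python
import math

def log_covering_bound(eps, d, p, L):
    # Upper bound on log N(eps) from Theorem 2.2.11, valid when d*eps <= L.
    if d * eps > L:
        raise ValueError("bound requires d*eps <= L")
    theta = d * (5 + math.log(d) + math.log(math.log(d)))   # theta_d (d >= 2)
    C1 = 3 ** d * math.e * theta
    C2 = (p + d) * math.log(3) + (p + 2) * math.log(2) + math.log(theta)
    return C1 * (L / eps) ** d * ((p + d) * math.log(L / eps) + C2)

# Example 1 setting: measures supported on the unit ball of R^2, so L = 1
for eps in (0.2, 0.1, 0.05):
    print(eps, log_covering_bound(eps, d=2, p=2, L=1.0))
```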

*Proof.* The proof is divided into four steps.

**Step 1: Compact support.** Let $P_L : \mathbb{R}^d \to \mathbb{R}^d$ be the projection onto $\bar{B}_L(0) = \{x \in \mathbb{R}^d : \|x\| \le L\}$ and let $\mu \in \mathcal{K}$. Then

$$\begin{aligned} W\_p^p(\mu, P\_L \# \mu) &\le \int\_{\mathbb{R}^d} ||x - P\_L(x)||^p \, \mathrm{d}\mu(x) = \int\_{||x|| > L} ||x - P\_L(x)||^p \, \mathrm{d}\mu(x) \\ &\le \int\_{||x|| > L} ||x||^p \, \mathrm{d}\mu(x) \le \frac{1}{f(L)} \int\_{||x|| > L} ||x||^p f(||x||) \, \mathrm{d}\mu(x) \le \frac{1}{f(L)}, \end{aligned}$$

and this vanishes as *L* → ∞.

**Step 2:** $n$**-Point measures.** Let $n = N(\varepsilon; \bar{B}_L(0))$ be the covering number of the Euclidean ball in $\mathbb{R}^d$. There exist points $x_1, \dots, x_n \in \mathbb{R}^d$ such that $\bar{B}_L(0) \subseteq \cup_i B_\varepsilon(x_i)$. If $\mu \in W_p(\bar{B}_L(0))$, there exists a measure $\mu_n$ supported on the $x_i$'s and such that

$$W\_p(\mu, \mu\_n) \le \varepsilon.$$

Indeed, let $C_1 = B_\varepsilon(x_1)$, $C_i = B_\varepsilon(x_i) \setminus \cup_{j<i} B_\varepsilon(x_j)$, and define $\mu_n(\{x_i\}) = \mu(C_i)$. The transport map defined by $\mathbf{t}(x) = x_i$ for $x \in C_i$ pushes $\mu$ forward to $\mu_n$ and

$$W\_p^p(\mu\_n, \mu) \le \sum\_{i=1}^n \int\_{C\_i} ||\mathbf{x} - \mathbf{x}\_i||^p \, \mathbf{d}\mu(\mathbf{x}) \le \sum\_{i=1}^n \varepsilon^p \mu(C\_i) = \varepsilon^p.$$

According to Rogers [114], we have the bound

$$n \le e\theta_d[L/\varepsilon]^d, \qquad \theta_d = d[5 + \log d + \log\log d],$$

whenever $\varepsilon \le L/d$.

**Step 3: Common weights.** If $\mu = \sum a_k \delta_{x_k}$ and $\nu = \sum b_k \delta_{x_k}$, then $W_p^p(\mu, \nu) \le \sum_k |a_k - b_k| \sup_{i,j} \|x_i - x_j\|^p$. Let

$$\mu_{n,\varepsilon,\delta} = \left\{\sum_{k=1}^n a_k \delta_{x_k} : a_k \in \{0, \delta, 2\delta, \dots, \lceil 1/\delta \rceil \delta\}; \ \sum a_k = 1\right\}.$$

This set contains fewer than $(2 + 1/\delta)^{n-1}$ elements, and any measure supported on $\{x_1, \dots, x_n\}$ can be approximated by a measure in $\mu_{n,\varepsilon,\delta}$ with error $2L(n\delta)^{1/p}$.
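For $\delta$ of the form $1/m$, the weight grid of Step 3 can be enumerated directly, giving a concrete check of the $(2+1/\delta)^{n-1}$ counting bound. A small illustrative sketch, not part of the proof:

```python
from itertools import product

def weight_vectors(n, m):
    # All vectors with entries in {0, 1/m, 2/m, ..., 1} summing to one
    # (the grid of Step 3 with delta = 1/m).
    return [tuple(k / m for k in ks)
            for ks in product(range(m + 1), repeat=n) if sum(ks) == m]

n, m = 3, 4                                  # three atoms, delta = 1/4
vs = weight_vectors(n, m)
print(len(vs))                               # number of grid measures
assert len(vs) <= (2 + m) ** (n - 1)         # the (2 + 1/delta)^(n-1) bound
```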

**Step 4: Conclusion.** Let $L = f^{-1}(1/\varepsilon^p)$, $n = N(\varepsilon; \bar{B}_L(0))$ and $\delta = [\varepsilon/(2L)]^p/n$. Combining the previous three steps, we obtain in the case $L \ge \varepsilon d$ that

$$N(3\varepsilon) \le (2 + 1/\delta)^{n-1} \le \left[2 + \left(\frac{L}{\varepsilon}\right)^{p+d} 2^p e\theta_d\right]^{e\theta_d[L/\varepsilon]^d} \le \left[\left(\frac{L}{\varepsilon}\right)^{p+d} 2^{p+2}\theta_d\right]^{e\theta_d[L/\varepsilon]^d},$$

because $L/\varepsilon \ge 1$ and $\theta_d \ge 5$. Conclude that

$$N(\varepsilon) \le \left[3^{p+d}\left(\frac{L}{\varepsilon}\right)^{p+d} 2^{p+2}\theta_d\right]^{3^d e\theta_d[L/\varepsilon]^d}.$$

#### **2.3 The Tangent Bundle**

Although the Wasserstein space $W_p(\mathcal{X})$ is non-linear in terms of measures, it *is* linear in terms of maps. Indeed, if $\mu \in W_p(\mathcal{X})$ and $T_1, T_2 : \mathcal{X} \to \mathcal{X}$ are such that $T_i \in L_p(\mu)$, then $(\alpha T_1 + \beta T_2)\#\mu \in W_p(\mathcal{X})$ for all $\alpha, \beta \in \mathbb{R}$. Later, in Sect. 2.4, we shall see that $W_p(\mathcal{X})$ is in fact homeomorphic to a subset of the space of such functions. The goal of this section is to exploit the linearity of the latter in order to define the tangent bundle of $W_p$. This will in particular be used for deriving differentiability properties of the Wasserstein distance in Sect. 3.1.6. However, the latter can be understood at a purely analytic level, and readers uncomfortable with differential geometry can access most of the rest of the monograph without reference to this section.

We assume here that $\mathcal{X}$ is a Hilbert space and that $p = 2$; the results extend to any $p > 1$. Absolutely continuous measures are assumed to be so with respect to Lebesgue measure if $\mathcal{X} = \mathbb{R}^d$, and otherwise refer to Definition 1.6.4.

#### *2.3.1 Geodesics, the Log Map and the Exponential Map in* $W_2(\mathcal{X})$

Let $\gamma \in W_2(\mathcal{X})$ be absolutely continuous and $\mu \in W_2(\mathcal{X})$ arbitrary. From Sect. 1.6.1, we know that there exists a unique solution to the Monge–Kantorovich problem, and that solution is given by a transport map that we denote by $\mathbf{t}_\gamma^\mu$. Recalling that $\mathbf{i} : \mathcal{X} \to \mathcal{X}$ is the identity map, we can define a curve

$$\gamma_t = \left[\mathbf{i} + t(\mathbf{t}_\gamma^\mu - \mathbf{i})\right]\#\gamma, \qquad t \in [0, 1].$$

This curve is known as McCann's [93] interpolant. As hinted in the introduction to this section, it is constructed via classical linear interpolation of the transport map $\mathbf{t}_\gamma^\mu$ and the identity. Clearly $\gamma_0 = \gamma$, $\gamma_1 = \mu$, and from (2.3),

$$\begin{aligned} W_2(\gamma_t, \gamma) &\le \sqrt{\int_{\mathcal{X}} \|t(\mathbf{t}_\gamma^\mu - \mathbf{i})\|^2 \, \mathrm{d}\gamma} = tW_2(\gamma, \mu); \\ W_2(\gamma_t, \mu) &\le \sqrt{\int_{\mathcal{X}} \|(1-t)(\mathbf{t}_\gamma^\mu - \mathbf{i})\|^2 \, \mathrm{d}\gamma} = (1-t)W_2(\gamma, \mu). \end{aligned}$$

It follows from the triangle inequality in *W*<sup>2</sup> that these inequalities must hold as equalities. Taking this one step further, we see that

$$W_2(\gamma_t, \gamma_s) = (t - s)W_2(\gamma, \mu), \qquad 0 \le s \le t \le 1.$$

In other words, McCann's interpolant is a *constant-speed geodesic* in $W_2(\mathcal{X})$.
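In one dimension, the constant-speed property is easy to verify numerically: with sorted samples standing in for $\gamma$ and $\mu$, the optimal coupling matches order statistics, and the atoms of McCann's interpolant are plain convex combinations. A sketch (the sample sizes and laws are illustrative choices):

```python
import numpy as np

def w2(x, y):
    # 1-D W_2 between empirical measures with equally many atoms
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
g = np.sort(rng.normal(size=1000))           # atoms of gamma
m = np.sort(rng.exponential(size=1000))      # atoms of mu

def interp(t):
    # McCann's interpolant [i + t(t_gamma^mu - i)] # gamma, on the atoms
    return (1 - t) * g + t * m

base = w2(g, m)
for s, t in [(0.0, 1.0), (0.25, 0.75), (0.1, 0.2)]:
    # constant speed: W_2(gamma_s, gamma_t) = (t - s) W_2(gamma, mu)
    assert abs(w2(interp(s), interp(t)) - (t - s) * base) < 1e-9
```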

In view of this, it seems reasonable to define the *tangent space* of $W_2(\mathcal{X})$ at $\mu$ as (Ambrosio et al. [12, Definition 8.5.1])

$$\mathrm{Tan}_\mu = \overline{\{t(\mathbf{t} - \mathbf{i}) : \mathbf{t} = \mathbf{t}_\mu^\nu \text{ for some } \nu \in W_2(\mathcal{X});\ t > 0\}}^{L_2(\mu)}.$$

It follows from the definition that $\mathrm{Tan}_\mu \subseteq L_2(\mu)$. (Strictly speaking, $\mathrm{Tan}_\mu$ is a subset of the space of functions $f : \mathcal{X} \to \mathcal{X}$ such that $\|f\| \in L_2(\mu)$, rather than $L_2(\mu)$ itself, as in Definition 2.4.3, but we will write $L_2$ for simplicity.)

Although not obvious from the definition, this is a linear space. The reason is that, in $\mathbb{R}^d$, Lipschitz functions are dense in $L_2(\mu)$, and for $\mathbf{t}$ Lipschitz, the negative of a tangent element

$$-t(\mathbf{t} - \mathbf{i}) = s(\mathbf{s} - \mathbf{i}), \qquad s > t ||\mathbf{t}||\_{\text{Lip}}, \qquad \mathbf{s} = \mathbf{i} + \frac{t}{s}(\mathbf{i} - \mathbf{t})$$

lies in the tangent space, since $\mathbf{s}$ can be seen to belong to the subgradient of a convex function by definition of $s$. This also shows that $\mathrm{Tan}_\mu$ can be seen to be the $L_2(\mu)$-closure of all gradients of $C_c^\infty$ functions. We refer to [12, Definition 8.4.1 and Theorem 8.5.1] for the proof and extensions to other values of $p > 1$ and to infinite dimensions, using cylindrical functions that depend on finitely many coordinates [12, Definition 5.1.11]. The alternative definition highlights that it is essentially the inner product in $\mathrm{Tan}_\mu$, but not the elements themselves, that depends on $\mu$.

The tangent space definition is valid for arbitrary measures in $W_2(\mathcal{X})$. The exponential map at $\gamma \in W_2(\mathcal{X})$ is the restriction to $\mathrm{Tan}_\gamma$ of the mapping that sends $\mathbf{r} \in L_2(\gamma)$ to $[\mathbf{r} + \mathbf{i}]\#\gamma \in W_2(\mathcal{X})$. More explicitly, $\exp_\gamma : \mathrm{Tan}_\gamma \to W_2$ takes the form

$$\exp_\gamma(t(\mathbf{t} - \mathbf{i})) = \exp_\gamma([t\mathbf{t} + (1-t)\mathbf{i}] - \mathbf{i}) = [t\mathbf{t} + (1-t)\mathbf{i}]\#\gamma \qquad (t \in \mathbb{R}).$$

Thus, when $\gamma$ is absolutely continuous, $\exp_\gamma$ is surjective, as can be seen from its right inverse, the log map

$$\log_\gamma : W_2 \to \mathrm{Tan}_\gamma, \qquad \log_\gamma(\mu) = \mathbf{t}_\gamma^\mu - \mathbf{i},$$

defined throughout $W_2$ (by virtue of Theorem 1.6.2). In symbols,

$$\exp_\gamma(\log_\gamma(\mu)) = \mu, \quad \mu \in W_2, \qquad \text{and} \qquad \log_\gamma(\exp_\gamma(t(\mathbf{t} - \mathbf{i}))) = t(\mathbf{t} - \mathbf{i}) \quad (t \in [0, 1]),$$

because convex combinations of optimal maps are optimal maps as well. In particular, McCann's interpolant $[\mathbf{i} + t(\mathbf{t}_\gamma^\mu - \mathbf{i})]\#\gamma$ is mapped bijectively to the line segment $t(\mathbf{t}_\gamma^\mu - \mathbf{i}) \in \mathrm{Tan}_\gamma$ through the log map.
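In one dimension with samples of equal size, the log and exp maps reduce to simple vector operations on matched order statistics: $\log_\gamma(\mu)$ evaluated at the $i$-th atom of $\gamma$ is the gap between the $i$-th order statistics, and pushing $\gamma$ through $\mathbf{i} + \mathbf{r}$ recovers $\mu$. A small illustrative sketch (the laws and sample sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.sort(rng.normal(size=1000))       # atoms of gamma (reference measure)
m = np.sort(rng.gamma(2.0, size=1000))   # atoms of mu

# log map on the atoms of gamma: r = t_gamma^mu - i, matching order statistics
r = m - g
# exp map: push gamma forward through i + r; this recovers the atoms of mu
assert np.allclose(g + r, m)
# the segment t*r, t in [0, 1], corresponds to McCann's interpolant
midpoint = g + 0.5 * r                   # atoms of gamma_{1/2}
```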

It is worth mentioning that McCann's interpolant can also be defined as

$$[tp\_2 + (1 - t)p\_1] \# \pi, \qquad p\_1(\mathbf{x}, \mathbf{y}) = \mathbf{x}, \quad p\_2(\mathbf{x}, \mathbf{y}) = \mathbf{y},$$

where *<sup>p</sup>*1, *<sup>p</sup>*<sup>2</sup> : *<sup>X</sup>* <sup>2</sup> <sup>→</sup> *<sup>X</sup>* are projections and π is any optimal transport plan between γ and μ. This is defined for arbitrary measures γ,μ ∈ *W*2, and reduces to the previous definition if γ is absolutely continuous. It is shown in Ambrosio et al. [12, Chapter 7] or Santambrogio [119, Proposition 5.32] that these are the only constant-speed geodesics in *W*2.

#### *2.3.2 Curvature and Compatibility of Measures*

Let $\gamma, \mu, \nu \in W_2(\mathcal{X})$ be absolutely continuous measures. Then by (2.3),

$$W_2^2(\mu, \nu) \le \int_{\mathcal{X}} \|\mathbf{t}_\gamma^\mu(x) - \mathbf{t}_\gamma^\nu(x)\|^2 \, \mathrm{d}\gamma(x) = \|\log_\gamma(\mu) - \log_\gamma(\nu)\|^2.$$

In other words, the distance between $\mu$ and $\nu$ is smaller in $W_2(\mathcal{X})$ than the distance between the corresponding vectors $\log_\gamma(\mu)$ and $\log_\gamma(\nu)$ in the tangent space $\mathrm{Tan}_\gamma$. In the terminology of differential geometry, this means that the Wasserstein space has *nonnegative sectional curvature* at any absolutely continuous $\gamma$.

It is instructive to see when equality holds. As $\mathbf{t}_\nu^\gamma = (\mathbf{t}_\gamma^\nu)^{-1}$, a change of variables gives

$$W_2^2(\mu, \nu) \le \int_{\mathcal{X}} \|\mathbf{t}_\gamma^\mu(\mathbf{t}_\nu^\gamma(x)) - x\|^2 \, \mathrm{d}\nu(x).$$

Since the map $\mathbf{t}_\gamma^\mu \circ \mathbf{t}_\nu^\gamma$ pushes forward $\nu$ to $\mu$, equality holds if and only if $\mathbf{t}_\gamma^\mu \circ \mathbf{t}_\nu^\gamma = \mathbf{t}_\nu^\mu$. This motivates the following definition.

**Definition 2.3.1 (Compatible Measures)** *A collection of absolutely continuous measures* $\mathcal{C} \subseteq W_2(\mathcal{X})$ *is* compatible *if for all* $\gamma, \mu, \nu \in \mathcal{C}$*, we have* $\mathbf{t}_\gamma^\mu \circ \mathbf{t}_\nu^\gamma = \mathbf{t}_\nu^\mu$ *(in* $L_2(\nu)$*).*

**Remark 2.3.2** *The absolute continuity is not necessary and was introduced for notational simplicity. A more general definition that applies to general measures is the following: every finite subcollection of C admits an optimal multicoupling whose relevant projections are simultaneously pairwise optimal; see the paragraph preceding Theorem 3.1.9.*

A collection of two (absolutely continuous) measures is always compatible. More interestingly, if $\mathcal{X} = \mathbb{R}$, then the entire collection of absolutely continuous (or even just continuous) measures is compatible. This is because of the simple geometry of convex functions in $\mathbb{R}$: gradients of convex functions are nondecreasing, and this property is stable under composition. In a more probabilistic way of thinking, one can always push forward $\mu$ to $\nu$ via the uniform distribution $\mathrm{Leb}|_{[0,1]}$ (see Sect. 1.5). Letting $F_\mu^{-1}$ and $F_\nu^{-1}$ denote the quantile functions, we have seen that

$$W_2(\mu, \nu) = \|F_\mu^{-1} - F_\nu^{-1}\|_{L_2(0,1)}.$$

(As a matter of fact, in this specific case, the equality holds for all $p \ge 1$ and not only for $p = 2$.) In other words, $\mu \mapsto F_\mu^{-1}$ is an *isometry* from $W_2(\mathbb{R})$ to the subset of $L_2(0,1)$ formed by (equivalence classes of) left-continuous nondecreasing functions on $(0,1)$. Since this is a convex subset of a Hilbert space, this property provides a very simple way to evaluate Fréchet means in $W_2(\mathbb{R})$ (see Sect. 3.1). If $\gamma = \mathrm{Leb}|_{[0,1]}$, then $F_\mu^{-1} = \mathbf{t}_\gamma^\mu$ for all $\mu$, so we can write the above equality as

$$W_2^2(\mu, \nu) = \|F_\mu^{-1} - F_\nu^{-1}\|_{L_2(0,1)}^2 = \|\log_\gamma(\mu) - \log_\gamma(\nu)\|^2,$$

so that if *X* = R, the Wasserstein space is essentially *flat* (has zero sectional curvature).
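The quantile isometry gives a direct numerical route to $W_2$ on $\mathbb{R}$. As a check, for Gaussians $N(m_1, \sigma_1^2)$ and $N(m_2, \sigma_2^2)$ the quantile formula evaluates to $(m_1 - m_2)^2 + (\sigma_1 - \sigma_2)^2$, since $F_\mu^{-1}(u) = m + \sigma\Phi^{-1}(u)$. A sketch using `scipy.stats.norm.ppf` (the grid size is an illustrative choice):

```python
import numpy as np
from scipy.stats import norm

u = (np.arange(20_000) + 0.5) / 20_000          # midpoint grid on (0, 1)

def w2_quantile(ppf1, ppf2):
    # ||F_mu^{-1} - F_nu^{-1}||_{L_2(0,1)}, approximated on the grid
    return np.sqrt(np.mean((ppf1(u) - ppf2(u)) ** 2))

numeric = w2_quantile(lambda t: norm.ppf(t, loc=0.0, scale=1.0),
                      lambda t: norm.ppf(t, loc=1.0, scale=2.0))
exact = np.sqrt((0.0 - 1.0) ** 2 + (1.0 - 2.0) ** 2)   # closed form
print(numeric, exact)
```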

The importance of compatibility can be seen as mimicking the simple one-dimensional case in terms of a Hilbert space embedding. Let $\mathcal{C} \subseteq W_2(\mathcal{X})$ be compatible and fix $\gamma \in \mathcal{C}$. Then for all $\mu, \nu \in \mathcal{C}$,

$$W_2^2(\mu, \nu) = \int_{\mathcal{X}} \|\mathbf{t}_\gamma^\mu(x) - \mathbf{t}_\gamma^\nu(x)\|^2 \, \mathrm{d}\gamma(x) = \|\log_\gamma(\mu) - \log_\gamma(\nu)\|_{L_2(\gamma)}^2.$$

Consequently, once again, $\mu \mapsto \mathbf{t}_\gamma^\mu$ is an isometric embedding of $\mathcal{C}$ into $L_2(\gamma)$. Generalising the one-dimensional case, we shall see that this allows for easy calculations of Fréchet means by means of averaging transport maps (Theorem 3.1.9).

*Example: Gaussian compatible measures*. The Gaussian case presented in Sect. 1.6.3 is helpful in shedding light on the structure imposed by the compatibility condition. Let $\gamma \in W_2(\mathbb{R}^d)$ be a standard Gaussian distribution with identity covariance matrix. Let $\Sigma_\mu$ denote the covariance matrix of a measure $\mu \in W_2(\mathbb{R}^d)$. When $\mu$ and $\nu$ are centred nondegenerate Gaussian measures,

$$\mathbf{t}_\gamma^\mu = \Sigma_\mu^{1/2}; \qquad \mathbf{t}_\gamma^\nu = \Sigma_\nu^{1/2}; \qquad \mathbf{t}_\mu^\nu = \Sigma_\mu^{-1/2}[\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2}]^{1/2}\Sigma_\mu^{-1/2},$$

so that $\gamma$, $\mu$, and $\nu$ are compatible if and only if

$$\mathbf{t}_\mu^\nu = \mathbf{t}_\gamma^\nu \circ \mathbf{t}_\mu^\gamma = \Sigma_\nu^{1/2}\Sigma_\mu^{-1/2}.$$

Since the matrix on the left-hand side must be symmetric, it must necessarily be that $\Sigma_\nu^{1/2}$ and $\Sigma_\mu^{-1/2}$ commute (if $A$ and $B$ are symmetric, then $AB$ is symmetric if and only if $AB = BA$), or equivalently, if and only if $\Sigma_\nu$ and $\Sigma_\mu$ commute. We see that a collection $\mathcal{C}$ of Gaussian measures on $\mathbb{R}^d$ that includes the standard Gaussian distribution is compatible if and only if all the covariance matrices of the measures in $\mathcal{C}$ are *simultaneously diagonalisable*. In other words, there exists an orthogonal matrix $U$ such that $D_\mu = U\Sigma_\mu U^t$ is diagonal for all $\mu \in \mathcal{C}$. In that case, formula (1.6)

$$W_2^2(\mu, \nu) = \mathrm{tr}[\Sigma_\mu + \Sigma_\nu - 2(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2})^{1/2}] = \mathrm{tr}[\Sigma_\mu + \Sigma_\nu - 2\Sigma_\mu^{1/2}\Sigma_\nu^{1/2}]$$

simplifies to

$$W_2^2(\mu, \nu) = \mathrm{tr}[D_\mu + D_\nu - 2D_\mu^{1/2}D_\nu^{1/2}] = \sum_{i=1}^d (\sqrt{\alpha_i} - \sqrt{\beta_i})^2, \qquad \alpha_i = [D_\mu]_{ii}, \ \beta_i = [D_\nu]_{ii},$$

and identifying the (nonnegative) number $a \in \mathbb{R}$ with the map $x \mapsto ax$ on $\mathbb{R}$, the optimal maps take the "orthogonal separable" form

$$\mathbf{t}_\mu^\nu = \Sigma_\nu^{1/2}\Sigma_\mu^{-1/2} = UD_\nu^{1/2}D_\mu^{-1/2}U^t = U \circ \left(\sqrt{\beta_1/\alpha_1}, \dots, \sqrt{\beta_d/\alpha_d}\right) \circ U^t.$$

In other words, up to an orthogonal change of coordinates, the optimal maps take the form of *d* nondecreasing real-valued functions. This is yet another crystallisation of the one-dimensional-like structure of compatible measures.
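The simultaneous-diagonalisation picture can be checked numerically against the general Gaussian formula (1.6). In the sketch below (the matrix size and eigenvalues are illustrative), two covariances built from a common orthogonal matrix $U$ give a squared distance equal to $\sum_i(\sqrt{\alpha_i} - \sqrt{\beta_i})^2$:

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric square root of a symmetric positive semidefinite matrix
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def w2sq_gaussian(S1, S2):
    # Formula (1.6): tr[S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}]
    r = psd_sqrt(S1)
    return np.trace(S1 + S2 - 2 * psd_sqrt(r @ S2 @ r))

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # common orthogonal matrix
alpha = np.array([1.0, 2.0, 3.0])              # eigenvalues of Sigma_mu
beta = np.array([4.0, 0.5, 1.0])               # eigenvalues of Sigma_nu
S_mu = U @ np.diag(alpha) @ U.T
S_nu = U @ np.diag(beta) @ U.T

simplified = np.sum((np.sqrt(alpha) - np.sqrt(beta)) ** 2)
assert abs(w2sq_gaussian(S_mu, S_nu) - simplified) < 1e-8
```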

With the intuition of the Gaussian case at our disposal, we can discuss a more general case. Suppose that the optimal maps are continuously differentiable. Then differentiating the equation $\mathbf{t}_\mu^\nu = \mathbf{t}_\gamma^\nu \circ \mathbf{t}_\mu^\gamma$ gives

$$\nabla\mathbf{t}_\mu^\nu(x) = \nabla\mathbf{t}_\gamma^\nu(\mathbf{t}_\mu^\gamma(x))\,\nabla\mathbf{t}_\mu^\gamma(x).$$

Since optimal maps are gradients of convex functions, their derivatives must be symmetric and positive semidefinite matrices. A product of such matrices stays symmetric if and only if they commute, so in this differentiable setting, compatibility is equivalent to commutativity of the matrices $\nabla\mathbf{t}_\gamma^\nu(\mathbf{t}_\mu^\gamma(x))$ and $\nabla\mathbf{t}_\mu^\gamma(x)$ for $\mu$-almost all $x$. In the Gaussian case, the optimal maps are linear functions, so $x$ does not appear in the matrices.

Here are some examples of compatible measures. It will be convenient to describe them using the optimal maps from a reference measure $\gamma \in W_2(\mathbb{R}^d)$. Define $\mathcal{C} = \{\mathbf{t}\#\gamma\}$ with $\mathbf{t}$ belonging to one of the following families. The first imposes a one-dimensional structure by varying only the behaviour of the norm of $x$, while the second allows for a separation of variables that splits the $d$-dimensional problem into $d$ one-dimensional ones.

*Radial transformations.* Consider the collection of functions $\mathbf{t} : \mathbb{R}^d \to \mathbb{R}^d$ of the form $\mathbf{t}(x) = xG(\|x\|)$ with $G : \mathbb{R}_+ \to \mathbb{R}$ differentiable. Then a straightforward calculation shows that

$$\nabla\mathbf{t}(x) = G(\|x\|)I + [G'(\|x\|)/\|x\|]\,xx^t.$$

Since both $I$ and $xx^t$ are positive semidefinite, the above matrix is so if both $G$ and $G'$ are nonnegative. If $\mathbf{s}(x) = xH(\|x\|)$ is a function of the same form, then $\mathbf{s}(\mathbf{t}(x)) = xG(\|x\|)H(\|x\|G(\|x\|))$, which belongs to the same family of functions (since $G$ is nonnegative). Clearly

$$\nabla\mathbf{s}(\mathbf{t}(x)) = H[\|x\|G(\|x\|)]I + [G(\|x\|)H'(\|x\|G(\|x\|))/\|x\|]\,xx^t$$

commutes with $\nabla\mathbf{t}(x)$, since both matrices are of the form $aI + bxx^t$ with $a, b$ scalars (that depend on $x$). In order to be able to change the base measure $\gamma$, we need to check that the inverses belong to the family. But if $y = \mathbf{t}(x)$, then $x = ay$ for some scalar $a$ that solves the equation

$$aG(a\|y\|) = 1.$$

Such an $a$ is guaranteed to be unique if $a \mapsto aG(a)$ is strictly increasing, and it will exist (for $y$ in the range of $\mathbf{t}$) if this map is continuous. As a matter of fact, since the eigenvalues of $\nabla\mathbf{t}(x)$ are $G(a)$ and

$$G(a) + G'(a)a = (aG(a))', \qquad a = \|x\|,$$

the condition that $a \mapsto aG(a)$ is strictly increasing is sufficient (this is weaker than $G$ itself being increasing). Finally, differentiability of $G$ is not required, so it is enough that $G$ be continuous and $aG(a)$ be strictly increasing.
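A quick numerical sanity check of the radial family, with the hypothetical choices $G(r) = 1 + r$ and $H(r) = e^r$ (both nonnegative, with $aG(a)$ strictly increasing): the gradient formula matches finite differences, and the two gradients commute, so their product is symmetric.

```python
import numpy as np

def radial(G, dG):
    # The map t(x) = x G(||x||) and its gradient G(r) I + [G'(r)/r] x x^t
    def t(x):
        return x * G(np.linalg.norm(x))
    def grad(x):
        r = np.linalg.norm(x)
        return G(r) * np.eye(len(x)) + (dG(r) / r) * np.outer(x, x)
    return t, grad

t, grad_t = radial(lambda r: 1 + r, lambda r: 1.0)          # G(r) = 1 + r
s, grad_s = radial(np.exp, np.exp)                          # H(r) = e^r

x = np.array([0.3, -1.2, 0.5])

# gradient formula agrees with a centred finite-difference Jacobian
eps = 1e-6
jac = np.column_stack([(t(x + eps * e) - t(x - eps * e)) / (2 * eps)
                       for e in np.eye(3)])
assert np.allclose(jac, grad_t(x), atol=1e-5)

# both gradients have the form aI + b x x^t, hence they commute:
# their product is symmetric, as compatibility requires
P = grad_s(t(x)) @ grad_t(x)
assert np.allclose(P, P.T)
```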

*Separable variables.* Consider the collection of functions $\mathbf{t} : \mathbb{R}^d \to \mathbb{R}^d$ of the form

$$\mathbf{t}(x_1, \dots, x_d) = (T_1(x_1), \dots, T_d(x_d)), \qquad T_i : \mathbb{R} \to \mathbb{R}, \tag{2.8}$$

with $T_i$ continuous and strictly increasing. This is a generalisation of the compatible Gaussian case discussed above, in which all the $T_i$'s were linear. Here, it is obvious that elements in this family are optimal maps and that the family is closed under inverses and composition, so that compatibility follows immediately.

This family is characterised by measures having a *common dependence structure*. More precisely, we say that $C : [0,1]^d \to [0,1]$ is a *copula* if $C$ is (the restriction of) a distribution function of a random vector having uniform margins. In other words, there is a random vector $V = (V_1, \dots, V_d)$ with $\mathbb{P}(V_i \le a) = a$ for all $a \in [0,1]$ and all $i = 1, \dots, d$, and

$$\mathbb{P}(V_1 \le v_1, \dots, V_d \le v_d) = C(v_1, \dots, v_d), \qquad v_i \in [0, 1].$$

Nelsen [97] provides an overview on copulae. To any $d$-dimensional probability measure $\mu$, one can assign a copula $C = C_\mu$ in terms of the distribution function $G$ of $\mu$ and its marginals $G_j$ as

$$G(a_1, \dots, a_d) = \mu\left((-\infty, a_1] \times \cdots \times (-\infty, a_d]\right) = C(G_1(a_1), \dots, G_d(a_d)).$$

If each $G_j$ is surjective on $(0,1)$, which is equivalent to it being continuous, then this equation defines $C$ uniquely on $(0,1)^d$, and consequently on $[0,1]^d$. If some marginal $G_j$ is not continuous, then uniqueness is lost, but $C$ still exists [97, Chapter 2]. The connection of copulae to compatibility becomes clear in the following lemma, proven on page 51 in the supplement.

**Lemma 2.3.3 (Compatibility and Copulae)** *The copulae associated with absolutely continuous measures* $\mu, \nu \in W_2(\mathbb{R}^d)$ *are equal if and only if* $\mathbf{t}_\mu^\nu$ *takes the separable form* (2.8)*.*
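One direction of Lemma 2.3.3 can be seen on simulated data: strictly increasing coordinate maps preserve coordinatewise ranks, so the empirical copula of $\mathbf{t}\#\mu$ coincides with that of $\mu$. A sketch with an illustrative correlated sample and hypothetical component maps:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
x = np.column_stack([z[:, 0], 0.8 * z[:, 0] + 0.6 * z[:, 1]])  # correlated mu

# a separable map of the form (2.8) with strictly increasing components
y = np.column_stack([np.exp(x[:, 0]), x[:, 1] ** 3 + 2.0 * x[:, 1]])

def ranks(a):
    # coordinatewise ranks: the empirical copula is a function of these
    return np.argsort(np.argsort(a, axis=0), axis=0)

# increasing maps leave the ranks, hence the copula, unchanged
assert np.array_equal(ranks(x), ranks(y))
```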

*Composition with linear functions.* If $\varphi : \mathbb{R}^d \to \mathbb{R}$ is convex with gradient $\mathbf{t}$ and $A$ is a $d \times d$ matrix, then the gradient of the convex function $x \mapsto \varphi(Ax)$ at $x$ is $\mathbf{t}_A = A^t\mathbf{t}(Ax)$. Suppose $\psi$ is another convex function with gradient $\mathbf{s}$ and that compatibility holds, i.e., $\nabla\mathbf{s}(\mathbf{t}(x))$ commutes with $\nabla\mathbf{t}(x)$ for all $x$. Then in order for

$$\nabla\mathbf{s}_A(\mathbf{t}_A(x)) = A^t\,\nabla\mathbf{s}(AA^t\,\mathbf{t}(Ax))\,A \qquad \text{and} \qquad \nabla\mathbf{t}_A(x) = A^t\,\nabla\mathbf{t}(Ax)\,A$$

to commute, it suffices that $AA^t = I$, i.e., that $A$ be orthogonal. Consequently, if $\{\mathbf{t}\#\mu\}_{\mathbf{t}\in\mathbf{T}}$ are compatible, then so are $\{\mathbf{t}_U\#\mu\}_{\mathbf{t}\in\mathbf{T}}$ for any orthogonal matrix $U$.

#### **2.4 Random Measures in Wasserstein Space**

Let $\mu$ be a fixed absolutely continuous probability measure in $W_2(\mathcal{X})$. If $\Lambda \in W_2(\mathcal{X})$ is another probability measure, then the transport map $\mathbf{t}_\mu^\Lambda$ and the convex potential are functions of $\Lambda$. If $\Lambda$ is now random, we would like to be able to make probability statements about them. To this end, it needs to be shown that $\mathbf{t}_\mu^\Lambda$ and the convex potential are *measurable* functions of $\Lambda$. The goal of this section is to develop a rigorous mathematical framework that justifies such probability statements. We show that all the relevant quantities are indeed measurable, and in particular establish a Fubini-type result in Proposition 2.4.9. This technical section may be skipped at first reading.

Here is an example of a measurability result (Villani [125, Corollary 5.22]). Recall that $P(\mathcal{X})$ is the space of Borel probability measures on $\mathcal{X}$, endowed with the topology of weak convergence that makes it a metric space. Let $\mathcal{X}$ be a complete separable metric space and $c : \mathcal{X}^2 \to \mathbb{R}_+$ a continuous cost function. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\Lambda, \kappa : \Omega \to P(\mathcal{X})$ be measurable maps. Then there exists a *measurable selection* of optimal transference plans: a measurable $\pi : \Omega \to P(\mathcal{X}^2)$ such that $\pi(\omega) \in \Pi(\Lambda(\omega), \kappa(\omega))$ is optimal for all $\omega \in \Omega$.

Although this result is very general, it only provides information about π. If π is induced from a map *T*, it is not obvious how to construct *T* from π in a measurable way; we will therefore follow a different path. In order to (almost) have a self-contained exposition, we work in a somewhat simplified setting that nevertheless suffices for the sequel. At least in the Euclidean case *X* = R*d*, more general measurability results in the flavour of this section can be found in Fontbona et al. [53]. On the other hand, we will not need to appeal to abstract measurable selection theorems as in [53, 125].

#### *2.4.1 Measurability of Measures and of Optimal Maps*

Let $\mathcal{X}$ be a separable Banach space. (Most of the results below hold for any complete separable metric space, but we will avoid this generality for brevity and simpler notation.) The Wasserstein space $W_p(\mathcal{X})$ is a metric space for any $p \ge 1$. We can thus define:

**Definition 2.4.1 (Random Measure)** *A random measure* $\Lambda$ *is any measurable map from a probability space* $(\Omega, \mathcal{F}, \mathbb{P})$ *to* $W_p(\mathcal{X})$*, the latter endowed with its Borel* $\sigma$*-algebra.*

In what follows, whenever we call something random, we mean that it is measurable as a map from some generic unspecified probability space.

**Lemma 2.4.2** *A map* $\Lambda$ *into* $W_p(\mathcal{X})$ *is measurable if and only if it is measurable with respect to the induced weak topology.*

Since both topologies are Polish, this follows from abstract measure-theoretic results (Fremlin [57, Paragraph 423F]). We give an elementary proof on page 53 of the supplement.

Optimal maps are functions from *X* to itself. In order to define random optimal maps, we need to define a topology and a σ-algebra on the space of such functions.

**Definition 2.4.3 (The Space** $L_p(\mu)$**)** *Let* $\mathcal{X}$ *be a Banach space and* $\mu$ *a Borel measure on* $\mathcal{X}$*. Then the space* $L_p(\mu)$ *is the space of measurable functions* $f : \mathcal{X} \to \mathcal{X}$ *such that*

$$\|f\|_{L_p(\mu)} = \left(\int_{\mathcal{X}} \|f(x)\|_{\mathcal{X}}^p \, \mathrm{d}\mu(x)\right)^{1/p} < \infty.$$

When $\mathcal{X}$ is separable, $L_p(\mu)$ is an example of a *Bochner space*, though we will not use this terminology.

It follows from the definition that $\|f\|_{L_p(\mu)}$ is the $L_p$ norm of the map $x \mapsto \|f(x)\|_{\mathcal{X}}$ from $\mathcal{X}$ to $\mathbb{R}$:

$$\|f\|_{L_p(\mu)} = \Big\| \, \|f(\cdot)\|_{\mathcal{X}} \, \Big\|_{L_p(\mu)}.$$

As usual, we identify functions that coincide almost everywhere. Clearly, $L_p(\mu)$ is a normed vector space. It enjoys another property shared by $L_p$ spaces—completeness:

**Theorem 2.4.4 (Riesz–Fischer)** *The space* $L_p(\mu)$ *is a Banach space.*

The proof, a simple variant of the classical one, is given on page 53 of the supplement.

Random maps lead naturally to random measures:

**Lemma 2.4.5 (Push-Forward with Random Maps)** *Let* $\mu \in W_p(\mathcal{X})$ *and let* $\mathbf{t}$ *be a random map in* $L_p(\mu)$*. Then* $\Lambda = \mathbf{t}\#\mu$ *is a continuous mapping from* $L_p(\mu)$ *to* $W_p(\mathcal{X})$*, hence a random measure.*

*Proof.* That $\Lambda$ takes values in $W_p$ follows from a change of variables:

$$\int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\Lambda(x) = \int_{\mathcal{X}} \|\mathbf{t}(x)\|^p \, \mathrm{d}\mu(x) = \|\mathbf{t}\|_{L_p(\mu)}^p < \infty.$$

Since $W_p(\mathbf{t}\#\mu, \mathbf{s}\#\mu) \le \big\| \|\mathbf{t} - \mathbf{s}\|_{\mathcal{X}} \big\|_{L_p(\mu)} = \|\mathbf{t} - \mathbf{s}\|_{L_p(\mu)}$ (see (2.3)), $\Lambda$ is a continuous (in fact, 1-Lipschitz) function of $\mathbf{t}$.
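
The 1-Lipschitz bound can be illustrated numerically in the simplest setting, an empirical measure on $\mathbb{R}$, where $W_2$ between equally weighted discrete measures is computed by sorting (quantile matching). A sketch (NumPy; the maps $\mathbf{t}$, $\mathbf{s}$ are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# mu: an empirical measure on n atoms; t, s: two maps R -> R.
n = 10_000
x = rng.standard_normal(n)
t = lambda z: np.tanh(z) + 0.5 * z
s = lambda z: z ** 3 / 10.0

# In one dimension, W_2 between two equally weighted discrete measures
# is obtained by matching sorted atoms (quantile coupling).
w2 = np.sqrt(np.mean((np.sort(t(x)) - np.sort(s(x))) ** 2))

# The coupling (t(X), s(X)) with X ~ mu is one admissible coupling of
# t#mu and s#mu, so W_2(t#mu, s#mu) <= ||t - s||_{L_2(mu)}.
l2 = np.sqrt(np.mean((t(x) - s(x)) ** 2))
assert w2 <= l2 + 1e-12
```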

Conversely, $\mathbf{t}$ is a continuous function of $\Lambda$:

**Lemma 2.4.6 (Measurability of Transport Maps)** *Let* $\Lambda$ *be a random measure in* $W_p(\mathcal{X})$ *and let* $\mu \in W_p(\mathcal{X})$ *be such that* $(\mathbf{i}, \mathbf{t}_\mu^\Lambda)\#\mu$ *is the unique optimal coupling of* $\mu$ *and* $\Lambda$*. Then* $\Lambda \mapsto \mathbf{t}_\mu^\Lambda$ *is a continuous mapping from* $W_p(\mathcal{X})$ *to* $L_p(\mu)$*, so* $\mathbf{t}_\mu^\Lambda$ *is a random element in* $L_p(\mu)$*. In particular, the result holds if* $\mathcal{X}$ *is a separable Hilbert space,* $p > 1$*, and* $\mu$ *is absolutely continuous.*

*Proof.* This result is more subtle than Lemma 2.4.5, since $\Lambda \mapsto \mathbf{t}_\mu^\Lambda$ is not necessarily Lipschitz. We give here a self-contained proof for the Euclidean case with quadratic cost and $\mu$ absolutely continuous. The general case builds on Villani [125, Corollary 5.23] and is given on page 54 of the supplement.

Suppose that $\Lambda_n \to \Lambda$ in $W_2(\mathbb{R}^d)$ and fix $\varepsilon > 0$. For any $S \subseteq \mathbb{R}^d$,

$$\|\mathbf{t}_\mu^{\Lambda_n} - \mathbf{t}_\mu^{\Lambda}\|_{L_2(\mu)}^2 = \int_S \|\mathbf{t}_\mu^{\Lambda_n} - \mathbf{t}_\mu^{\Lambda}\|^2 \, \mathrm{d}\mu + \int_{\mathbb{R}^d \setminus S} \|\mathbf{t}_\mu^{\Lambda_n} - \mathbf{t}_\mu^{\Lambda}\|^2 \, \mathrm{d}\mu.$$

Since $\|a - b\|^p \le 2^p \|a\|^p + 2^p \|b\|^p$, the last integral is no larger than

$$4\int_{\mathbb{R}^d \setminus S} \|\mathbf{t}_\mu^{\Lambda_n}\|^2 \, \mathrm{d}\mu + 4\int_{\mathbb{R}^d \setminus S} \|\mathbf{t}_\mu^{\Lambda}\|^2 \, \mathrm{d}\mu = 4\int_{(\mathbf{t}_\mu^{\Lambda_n})^{-1}(\mathbb{R}^d \setminus S)} \|x\|^2 \, \mathrm{d}\Lambda_n(x) + 4\int_{(\mathbf{t}_\mu^{\Lambda})^{-1}(\mathbb{R}^d \setminus S)} \|x\|^2 \, \mathrm{d}\Lambda(x).$$

Since $(\Lambda_n)$ and $\Lambda$ are tight in the Wasserstein space, they satisfy the absolute uniform continuity (2.7). Let $\delta = \delta_\varepsilon$ be as in (2.7), and notice that, by the measure-preserving property of the optimal maps, the last two integrals are taken over sets of measure $1 - \mu(S)$. Since $\mu$ is absolutely continuous, we can find a compact set $S$ of $\mu$-measure at least $1 - \delta$ on which Proposition 1.7.11 applies (see Corollary 1.7.12), yielding

$$\int_S \|\mathbf{t}_\mu^{\Lambda_n} - \mathbf{t}_\mu^{\Lambda}\|^2 \, \mathrm{d}\mu \le \sup_{x \in S} \|\mathbf{t}_\mu^{\Lambda_n}(x) - \mathbf{t}_\mu^{\Lambda}(x)\|^2 \to 0, \qquad n \to \infty,$$

so that

$$\limsup_{n \to \infty} \big\| \mathbf{t}_\mu^{\Lambda_n} - \mathbf{t}_\mu^{\Lambda} \big\|_{L_2(\mu)}^2 \le 8\varepsilon,$$

and this completes the proof upon letting $\varepsilon \to 0$.

In Proposition 5.3.7, we show, under some conditions, that $\mathbf{t}_\mu^\Lambda \in L_2(\mu)$ is also a continuous function of $\mu$.

#### *2.4.2 Random Optimal Maps and Fubini's Theorem*

From now on, we assume that *X* is a separable Hilbert space and that *p* = 2. The results can most likely be generalised to all *p* > 1 (see Ambrosio et al. [12, Section 10.2]), but we restrict to the quadratic case for simplicity.

Theorem 3.2.13 below requires the application of Fubini's theorem in the form

$$\mathbb{E} \int_{\mathcal{X}} \left\langle \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0 = \int_{\mathcal{X}} \mathbb{E} \left\langle \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0 = \int_{\mathcal{X}} \left\langle \mathbb{E}\, \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0.$$

In order for this to even make sense, we need a meaning for "expectation" in the space $L_2(\theta_0)$, which is a Banach space. There are several (nonequivalent) definitions of integrals in such spaces (Hildebrandt [69]); the one most convenient for our needs is the Bochner integral.

**Definition 2.4.7 (Bochner Integral)** *Let* $B$ *be a Banach space and let* $f : (\Omega, \mathcal{F}, \mathbb{P}) \to B$ *be a simple random element taking values in* $B$*:*

$$f(\omega) = \sum_{j=1}^{n} f_j \mathbf{1}\{\omega \in \Omega_j\}, \qquad \Omega_j \in \mathcal{F}, \qquad f_j \in B.$$

*Then the Bochner integral (or expectation) of f is defined by*

$$\mathbb{E}f = \sum_{j=1}^{n} \mathbb{P}(\Omega_j)\, f_j \in B.$$

*If* $f$ *is measurable and there exists a sequence* $f_n$ *of simple random elements such that* $\|f_n - f\| \to 0$ *almost surely and* $\mathbb{E}\|f_n - f\| \to 0$*, then the Bochner integral of* $f$ *is defined as the limit*

$$\mathbb{E}f = \lim\_{n \to \infty} \mathbb{E}f\_n.$$

The space of functions for which the Bochner integral is defined is the *Bochner space* $L_1(\Omega; B)$, but we will use neither this terminology nor the notation. It is not difficult to see that Bochner integrals are well-defined: the expectation depends neither on the representation of the simple functions nor on the approximating sequence, and the limit exists in $B$ (because $B$ is complete). More on Bochner integrals can be found in Hsing and Eubank [71, Section 2.6] or Dunford et al. [48, Chapter III.6]. A major difference from the real case is that there is no clear notion of "infinity" here: the Bochner integral is always an element of $B$, whereas expectations of real-valued random variables can be defined in $\mathbb{R} \cup \{\pm\infty\}$. It turns out that separability is quite important in this setting:
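
For a simple random element, the definition is nothing more than a finite weighted sum in $B$. A minimal sketch (NumPy; we represent elements of $B$, here functions on $[0,1]$, by their values on a grid, and the particular $f_j$ and probabilities are our own illustration):

```python
import numpy as np

# Elements of the Banach space B (here: functions on [0,1], represented
# by their values on a grid) and a simple random element f taking the
# value f_j on an event Omega_j of probability p_j.
grid = np.linspace(0.0, 1.0, 101)
f_vals = [np.sin(np.pi * grid), grid ** 2, np.ones_like(grid)]
p = np.array([0.5, 0.3, 0.2])
assert np.isclose(p.sum(), 1.0)

# Bochner expectation of a simple element: E f = sum_j P(Omega_j) f_j,
# itself an element of B (a function on the grid).
Ef = sum(pj * fj for pj, fj in zip(p, f_vals))
assert Ef.shape == grid.shape
```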

**Lemma 2.4.8 (Approximation of Separable Functions)** *Let* $f : \Omega \to B$ *be measurable. Then there exists a sequence of simple functions* $f_n$ *such that* $\|f_n(\omega) - f(\omega)\| \to 0$ *for almost all* $\omega$ *if and only if* $f(\Omega \setminus N)$ *is separable for some* $N \subseteq \Omega$ *of probability zero. In that case,* $f_n$ *can be chosen so that* $\|f_n(\omega)\| \le 2\|f(\omega)\|$ *for all* $\omega \in \Omega$*.*

A proof can be found in [48, Lemma III.6.9], or on page 55 of the supplement. Functions satisfying this approximation condition are sometimes called *strongly measurable* or *Bochner measurable*. In view of the lemma, we will call them *separably valued*, since this is the condition that will need to be checked in order to define their integrals.

Two remarks are in order. Firstly, if $B$ itself is separable, then $f(\Omega)$ is obviously separable. Secondly, the subset of $\Omega \setminus N$ on which $(f_n)$ does not converge to $f$ may fail to be measurable, but must have outer probability zero (it is included in a measurable set of measure zero) [48, Lemma III.6.9]. This can be remedied by assuming that the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ is complete. It will not, however, be necessary to do so, since this measurability issue does not alter the Bochner expectation of $f$.

**Proposition 2.4.9 (Fubini for Optimal Maps)** *Let* $\Lambda$ *be a random measure in* $W_2(\mathcal{X})$ *such that* $\mathbb{E} W_2(\delta_0, \Lambda) < \infty$*, and let* $\theta_0, \theta \in W_2(\mathcal{X})$ *be such that* $\mathbf{t}_{\theta_0}^\Lambda$ *and* $\mathbf{t}_{\theta_0}^\theta$ *exist (and are unique) with probability one (for example, if* $\theta_0$ *is absolutely continuous). Then*

$$\mathbb{E} \int_{\mathcal{X}} \left\langle \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0 = \int_{\mathcal{X}} \mathbb{E} \left\langle \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0 = \int_{\mathcal{X}} \left\langle \mathbb{E}\, \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \right\rangle \mathrm{d}\theta_0. \tag{2.9}$$

This holds by linearity when $\Lambda$ is a simple random measure. The general case follows by approximation: the Wasserstein space is separable, and so is the space of optimal maps by Lemma 2.4.6, so we may apply Lemma 2.4.8 and approximate $\mathbf{t}_{\theta_0}^\Lambda$ by simple maps for which the equality holds by linearity. On page 56 of the supplement, we show that these simple maps can be assumed optimal, and give the full details.
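
The simple case can be checked directly: when $\Lambda$ takes finitely many values, both sides of (2.9) reduce to the same finite sum by linearity. A numerical sketch on the real line (NumPy; $\theta_0$ is an empirical measure, maps are represented by their values at its atoms, and all particular choices are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

# theta_0: empirical measure on m support points; maps are represented
# by their values at those points, so integrals d(theta_0) are averages.
m = 50
pts = np.sort(rng.standard_normal(m))
identity = pts

# A simple random map t^Lambda taking K values with probabilities p,
# and a fixed map t^theta (all increasing, as optimal maps on R are).
p = np.array([0.2, 0.5, 0.3])
t_lambda = [pts + c for c in (0.5, -1.0, 2.0)]
t_theta = 2.0 * pts + 1.0

# Left side of (2.9): expectation of the integral; right side: integral
# against the Bochner expectation E t^Lambda. Equal by linearity.
lhs = sum(pj * np.mean((tl - identity) * (t_theta - identity))
          for pj, tl in zip(p, t_lambda))
Et = sum(pj * tl for pj, tl in zip(p, t_lambda))
rhs = np.mean((Et - identity) * (t_theta - identity))
assert np.isclose(lhs, rhs)
```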

#### **2.5 Bibliographical Notes**

Our proof of Theorem 2.2.11 borrows heavily from Bolley et al. [29]. A similar result was obtained by Kloeckner [81], who also provides a lower bound of a similar order.

The origins of Sect. 2.3 can be traced back to the seminal work of Jordan et al. [74], who interpret the Fokker–Planck equation as a gradient flow (where functionals defined on $W_2$ can be differentiated) with respect to the 2-Wasserstein metric. The Riemannian interpretation was (formally) introduced by Otto [99], and rigorously established by Ambrosio et al. [12] and others; see Villani [125, Chapter 15] for further bibliography and more details.

Compatible measures (Definition 2.3.1) were implicitly introduced by Boissard et al. [28] in the context of *admissible optimal maps*, where one considers families of gradients of convex functions $(T_i)$ such that $T_j^{-1} \circ T_i$ is the gradient of a convex function for any $i$ and $j$. For (any) fixed measure $\gamma \in \mathcal{C}$, compatibility of $\mathcal{C}$ is then equivalent to admissibility of the collection of maps $\{\mathbf{t}_\gamma^\mu\}_{\mu \in \mathcal{C}}$. The examples we gave are also taken from [28].

Lemma 2.3.3 is from Cuesta-Albertos et al. [38, Theorem 2.9] (see also Zemel and Panaretos [135]).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Fréchet Means in the Wasserstein Space** $W_2$

If $H$ is a Hilbert space (or a closed convex subspace thereof) and $x_1, \dots, x_N \in H$, then the empirical mean $\bar{x}_N = N^{-1} \sum x_i$ is the unique element of $H$ that minimises the sum of squared distances from the $x_i$'s.<sup>1</sup> That is, if we define

$$F(\theta) = \sum\_{i=1}^{N} \left\| \theta - \mathbf{x}\_i \right\|^2, \qquad \theta \in H,$$

then $\theta = \bar{x}_N$ is the unique minimiser of $F$. This is easily seen by "opening the squares" and writing

$$F(\theta) = F(\bar{x}_N) + N \|\theta - \bar{x}_N\|^2.$$
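
The identity is easy to verify numerically for points in $\mathbb{R}^d$ (a sketch using NumPy; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Points in a Hilbert space (here R^3) and the sum-of-squares functional.
X = rng.standard_normal((5, 3))
xbar = X.mean(axis=0)
F = lambda theta: np.sum(np.linalg.norm(theta - X, axis=1) ** 2)

# "Opening the squares": F(theta) = F(xbar) + N * ||theta - xbar||^2,
# so the empirical mean xbar is the unique minimiser of F.
theta = rng.standard_normal(3)
assert np.isclose(F(theta), F(xbar) + len(X) * np.sum((theta - xbar) ** 2))
```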

The concept of a Fréchet mean (Fréchet [55]) generalises the notion of mean to a more general metric space by replacing the usual "sum of squares" with a "sum of squared distances", giving rise to the so-called *Fréchet functional*. A closely related notion is that of a *Karcher mean* (Karcher [78]), a term that describes stationary points of the sum of squares functional, when the latter is differentiable (see Sect. 3.1.6). Population versions of Fréchet means, assuming the space is endowed with a probability law, can also be defined, replacing summation by expectation with respect to that law.

**Electronic Supplementary Material** The online version of this chapter (https://doi.org/10.1007/978-3-030-38438-8_3) contains supplementary material.

<sup>1</sup> It should be remarked that this is a Hilbertian property (or at least a property linked to an inner product), not merely a linear property. In other words, it does not extend to Banach spaces. As an example, let $H = \mathbb{R}^2$ with the $L^1$ norm and consider the vertices $(0,0)$, $(0,1)$, and $(1,0)$ of the unit simplex. The mean of these is $(1/3, 1/3)$, but for $(x, y)$ in the triangle,

$$F(x, y) = (x + y)^2 + (x + 1 - y)^2 + (1 - x + y)^2 = 2 + 2x^2 + 2y^2 + (x - y)^2$$

is minimised at $(0,0)$.

Fréchet means are perhaps the most basic object of statistical interest, and this chapter studies such means when the underlying space is the Wasserstein space $W_2$. In general, existence and uniqueness of a Fréchet mean can be subtle, but we will see that the nature of optimal transport allows for rather clean statements in the case of the Wasserstein space.

#### **3.1 Empirical Fréchet Means in** $W_2$

#### *3.1.1 The Fréchet Functional*

As foretold in the preceding paragraph, the definition of a Fréchet mean requires the definition of an appropriate sum-of-squares functional, the *Fréchet functional*:

**Definition 3.1.1 (Empirical Fréchet Functional and Mean)** *The Fréchet functional associated with measures* $\mu^1, \dots, \mu^N \in W_2(\mathcal{X})$ *is*

$$F: W_2(\mathcal{X}) \to \mathbb{R}, \qquad F(\gamma) = \frac{1}{2N} \sum_{i=1}^N W_2^2(\gamma, \mu^i), \qquad \gamma \in W_2(\mathcal{X}). \tag{3.1}$$

*A Fréchet mean of* $(\mu^1, \dots, \mu^N)$ *is a minimiser of* $F$ *in* $W_2(\mathcal{X})$ *(if it exists).*

In analysis, a Fréchet mean is often called a *barycentre*. We shall use the terminology of "Fréchet mean", which is arguably more popular in statistics.<sup>2</sup>

The factor $1/(2N)$ is irrelevant for the definition of the Fréchet mean. It is introduced in order to have simpler expressions for the derivatives (Theorems 3.1.14 and 3.2.13), and to be compatible with the population version $\mathbb{E} W_2^2(\gamma, \Lambda)/2$ (see (3.3)).

The first reference that deals with empirical Fréchet means in $W_2(\mathbb{R}^d)$ is the seminal paper of Agueh and Carlier [2]. They treat the more general weighted Fréchet functional

$$F(\gamma) = \frac{1}{2} \sum_{i=1}^{N} w_i W_2^2(\gamma, \mu^i), \qquad w_i \ge 0, \quad \sum_{i=1}^{N} w_i = 1,$$

but, for simplicity, we shall focus on the case of equal weights. (If all the $w_i$'s are rational, then the weighted functional can be encompassed in (3.1) by taking some of the $\mu^i$'s to be the same. The case of irrational $w_i$'s is then treated with continuity arguments. Moreover, (3.3) encapsulates (3.1), as well as the weighted version, when $\Lambda$ can take finitely many values.)

<sup>2</sup> Interestingly, Fréchet himself [56] considered the Wasserstein metric between probability measures on $\mathbb{R}$, and some refer to this as the *Fréchet distance* (e.g., Dowson and Landau [44]), which is another reason to use this terminology.

#### *3.1.2 Multimarginal Formulation, Existence, and Continuity*

In [60], Gangbo and Świȩch consider the following *multimarginal* Monge–Kantorovich problem. Let $\mu^1, \dots, \mu^N$ be $N$ measures in $W_2(\mathcal{X})$ and let $\Pi(\mu^1, \dots, \mu^N)$ be the set of probability measures on $\mathcal{X}^N$ having $\{\mu^i\}_{i=1}^N$ as marginals. The problem is to minimise

$$G(\pi) = \frac{1}{2N^2} \int_{\mathcal{X}^N} \sum_{i<j} \|x_i - x_j\|^2 \, \mathrm{d}\pi(x_1, \dots, x_N), \qquad \pi \in \Pi(\mu^1, \dots, \mu^N).$$

The factor $1/(2N^2)$ is of course irrelevant for the minimisation, and its purpose will be clarified shortly. If $N = 2$, we obtain the Kantorovich problem with quadratic cost. The probabilistic interpretation (as in Sect. 1.2) is that one is given random variables $X_1, \dots, X_N$ with marginal probability laws $\mu^1, \dots, \mu^N$ and seeks to construct a random vector $Y = (Y_1, \dots, Y_N)$ on $\mathcal{X}^N$ such that $X_i \stackrel{d}{=} Y_i$ and

$$\frac{1}{2N^2}\, \mathbb{E} \sum_{i<j} \|Y_i - Y_j\|^2 \le \frac{1}{2N^2}\, \mathbb{E} \sum_{i<j} \|Z_i - Z_j\|^2$$

for any other random vector $Z = (Z_1, \dots, Z_N)$ such that $X_i \stackrel{d}{=} Z_i$. Intuitively, we seek a random vector with prescribed marginals but maximally correlated entries.

We refer to elements of Π(μ<sup>1</sup>,...,μ*<sup>N</sup>*) (equivalently, joint laws of *X*1,...,*XN*) as *multicouplings* (of μ<sup>1</sup>,...,μ*<sup>N</sup>*). Just like in the Kantorovich problem, there always exists an optimal multicoupling π.

Let us now show how the multimarginal problem is equivalent to the problem of finding the Fréchet mean of $\mu^1, \dots, \mu^N$. The first thing to observe is that the objective function can be written as

$$G(\pi) = \int_{\mathcal{X}^N} \frac{1}{2N} \sum_{i=1}^N \|x_i - M(x)\|^2 \, \mathrm{d}\pi(x), \qquad M(x) = M(x_1, \dots, x_N) = \frac{1}{N} \sum_{i=1}^N x_i.$$
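
This rewriting rests on the elementary identity $\sum_{i=1}^N \|x_i - M(x)\|^2 = N^{-1}\sum_{i<j}\|x_i - x_j\|^2$, which also explains the normalising factor $1/(2N^2)$. A quick numerical check (NumPy; the points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# N points in R^d and their mean M(x).
N, d = 6, 2
x = rng.standard_normal((N, d))
M = x.mean(axis=0)

# (1/(2N^2)) * sum_{i<j} ||x_i - x_j||^2 == (1/(2N)) * sum_i ||x_i - M||^2.
pairwise = sum(np.sum((x[i] - x[j]) ** 2)
               for i in range(N) for j in range(i + 1, N))
assert np.isclose(pairwise / (2 * N ** 2), np.sum((x - M) ** 2) / (2 * N))
```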

The next result shows that the Fréchet mean and the multicoupling problems are essentially the same.

**Proposition 3.1.2 (Fréchet Means and Multicouplings)** *Let* $\mu^1, \dots, \mu^N \in W_2(\mathcal{X})$*. Then* $\mu$ *is a Fréchet mean of* $(\mu^1, \dots, \mu^N)$ *if and only if there exists an optimal multicoupling* $\pi \in W_2(\mathcal{X}^N)$ *of* $(\mu^1, \dots, \mu^N)$ *such that* $\mu = M\#\pi$*, and furthermore* $F(\mu) = G(\pi)$*.*

*Proof.* Let $\pi$ be an arbitrary multicoupling of $(\mu^1, \dots, \mu^N)$ and set $\mu = M\#\pi$. Then $(x \mapsto (x_i, M(x)))\#\pi$ is a coupling of $\mu^i$ and $\mu$, and therefore

$$\int_{\mathcal{X}^N} \|x_i - M(x)\|^2 \, \mathrm{d}\pi(x) \ge W_2^2(\mu, \mu^i).$$

Summation over $i$ gives $F(\mu) \le G(\pi)$, and so $\inf F \le \inf G$.

For the other inequality, let $\mu \in W_2(\mathcal{X})$ be arbitrary. For each $i$, let $\pi^i$ be an optimal coupling between $\mu$ and $\mu^i$. Invoking the gluing lemma (Ambrosio and Gigli [10, Lemma 2.1]), we may glue all the $\pi^i$'s using their common marginal $\mu$. This procedure constructs a measure $\eta$ on $\mathcal{X}^{N+1}$ with marginals $\mu^1, \dots, \mu^N, \mu$, and its relevant projection $\pi$ is then a multicoupling of $\mu^1, \dots, \mu^N$.

Since $\mathcal{X}$ is a Hilbert space, the minimiser of $y \mapsto \sum \|x_i - y\|^2$ is $y = M(x)$. Thus

$$F(\mu) = \frac{1}{2N} \int_{\mathcal{X}^{N+1}} \sum_{i=1}^N \|x_i - y\|^2 \, \mathrm{d}\eta(x, y) \ge \frac{1}{2N} \int_{\mathcal{X}^{N+1}} \sum_{i=1}^N \|x_i - M(x)\|^2 \, \mathrm{d}\eta(x, y) = G(\pi).$$

In particular, $\inf F \ge \inf G$, and combining this with the established converse inequality, we see that $\inf F = \inf G$. Observe also that the last displayed inequality holds as an equality if and only if $y = M(x)$ $\eta$-almost surely, in which case $\mu = M\#\pi$. Therefore, if $\mu$ does not equal $M\#\pi$, then $F(\mu) > G(\pi) \ge F(M\#\pi)$, and $\mu$ cannot be optimal. Finally, if $\pi$ is optimal, then

$$F(M\#\pi) \le G(\pi) = \inf G = \inf F,$$

establishing optimality of $\mu = M\#\pi$ and completing the proof.

Since optimal multicouplings exist, we deduce that so do Fréchet means.

**Corollary 3.1.3 (Fréchet Means and Moments)** *Any finite collection of measures* $\mu^1, \dots, \mu^N \in W_2(\mathcal{X})$ *admits a Fréchet mean* $\mu$*; for all* $p \ge 1$*,*

$$\int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu(x) \le \frac{1}{N} \sum_{i=1}^N \int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu^i(x),$$

*and when* $p > 1$*, equality holds if and only if* $\mu^1 = \cdots = \mu^N$*.*

*Proof.* Let $\pi$ be an optimal multicoupling of $\mu^1, \dots, \mu^N$ such that $\mu = M\#\pi$ (Proposition 3.1.2). Then

$$\int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu(x) = \int_{\mathcal{X}^N} \bigg\| \frac{1}{N} \sum_{i=1}^N x_i \bigg\|^p \, \mathrm{d}\pi(x) \le \frac{1}{N} \sum_{i=1}^N \int_{\mathcal{X}^N} \|x_i\|^p \, \mathrm{d}\pi(x) = \frac{1}{N} \sum_{i=1}^N \int_{\mathcal{X}} \|x\|^p \, \mathrm{d}\mu^i(x).$$

The statement about equality follows from the strict convexity of $x \mapsto \|x\|^p$ when $p > 1$.

A further corollary of Proposition 3.1.2 is a bound on the support:

**Corollary 3.1.4** *The support of any Fréchet mean is included in the set*

$$\frac{\text{supp}\mu^1 + \dots + \text{supp}\mu^N}{N} = \left\{ \frac{\mathbf{x}\_1 + \dots + \mathbf{x}\_N}{N} : \mathbf{x}\_i \in \text{supp}\mu^i \right\} \subseteq \text{conv}\left(\bigcup\_{i=1}^N \text{supp}\mu^i\right).$$

*In particular, if all the* $\mu^i$*'s are supported on a common convex set* $K$*, then so is any of their Fréchet means.*

The multimarginal formulation also yields a continuity property of the empirical Fréchet mean. Conditions for uniqueness will be given in the next subsection.

**Theorem 3.1.5 (Continuity of Fréchet Means)** *Suppose that* $W_2(\mu^i_k, \mu^i) \to 0$ *for* $i = 1, \dots, N$*, and let* $\mu_k$ *denote any Fréchet mean of* $(\mu^1_k, \dots, \mu^N_k)$*. Then* $(\mu_k)$ *stays in a compact set of* $W_2(\mathcal{X})$*, and any limit point is a Fréchet mean of* $(\mu^1, \dots, \mu^N)$*.*

In particular, if $\mu^1, \dots, \mu^N$ have a *unique* Fréchet mean $\mu$, then $\mu_k \to \mu$ in $W_2(\mathcal{X})$.

*Proof.* We sketch the steps of the proof here, with the full details given on page 63 of the supplement.

**Step 1:** tightness of (μ*<sup>k</sup>*). This is true because the collection of multicouplings is tight, and the mean function *M* is continuous.

**Step 2:** weak limits are limits in *W*2(*X* ). This holds because the mean function has linear growth.

**Step 3:** the limit is a Fréchet mean of $(\mu^1, \dots, \mu^N)$. From Corollary 3.1.3, it follows that $\mu_k$ must be sought on some fixed bounded set of $W_2(\mathcal{X})$. On such sets, the Fréchet functionals are uniformly Lipschitz, so their minimisers converge as well.

#### *3.1.3 Uniqueness and Regularity*

A general situation in which Fréchet means are unique is when the Fréchet functional is strictly convex. In the Wasserstein space, this requires some regularity, but weak convexity holds in general. Absolutely continuous measures on infinite-dimensional $\mathcal{X}$ are defined in Definition 1.6.4.

**Proposition 3.1.6 (Convexity of the Fréchet Functional)** *Let* $\Lambda, \gamma_i \in W_2(\mathcal{X})$ *and* $t \in [0,1]$*. Then*

$$W_2^2(t\gamma_1 + (1-t)\gamma_2, \Lambda) \le t\, W_2^2(\gamma_1, \Lambda) + (1-t)\, W_2^2(\gamma_2, \Lambda). \tag{3.2}$$

*When* $\Lambda$ *is absolutely continuous, the inequality is strict unless* $t \in \{0, 1\}$ *or* $\gamma_1 = \gamma_2$*.*

**Remark 3.1.7** *The Wasserstein distance is not convex along geodesics. That is, if we replace the linear interpolant* $t\gamma_1 + (1-t)\gamma_2$ *by McCann's interpolant, then* $t \mapsto W_2^2(\gamma_t, \Lambda)$ *is not necessarily convex (Ambrosio et al. [12, Example 9.1.5]).*

*Proof.* Let $\pi_i \in \Pi(\gamma_i, \Lambda)$ be optimal and notice that the linear interpolant $t\pi_1 + (1-t)\pi_2$ lies in $\Pi(t\gamma_1 + (1-t)\gamma_2, \Lambda)$, so that

$$W_2^2(t\gamma_1 + (1-t)\gamma_2, \Lambda) \le \int_{\mathcal{X}^2} \|x - y\|^2 \, \mathrm{d}[t\pi_1 + (1-t)\pi_2](x, y),$$

which is (3.2). When $\Lambda$ is absolutely continuous and $t \in (0,1)$, equality in (3.2) holds if and only if $\pi_t = t\pi_1 + (1-t)\pi_2 = (\mathbf{t}_\Lambda^{t\gamma_1 + (1-t)\gamma_2} \times \mathbf{i})\#\Lambda$. But $\pi_t$ is supported on the graphs of two functions, $\mathbf{t}_\Lambda^{\gamma_1}$ and $\mathbf{t}_\Lambda^{\gamma_2}$. Consequently, equality can hold only if these two maps are $\Lambda$-almost surely equal, or, equivalently, if $\gamma_1 = \gamma_2$.
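
Inequality (3.2) can be checked numerically for discrete measures on $\mathbb{R}$, where $W_2$ reduces to sorting when the atoms are equally weighted; with $t = 1/2$ the linear interpolant of two uniform empirical measures is itself uniform on the pooled atoms. A sketch (NumPy; the three samples are our own choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Three discrete measures on R, each uniform on n atoms. With t = 1/2,
# the mixture (gamma1 + gamma2)/2 is uniform on the 2n pooled atoms, and
# duplicating Lambda's atoms represents it uniformly on 2n atoms too.
n, t = 1000, 0.5
g1 = rng.standard_normal(n)
g2 = rng.standard_normal(n) * 2.0 + 1.0
lam = rng.standard_normal(n) - 1.0

def w2sq(a, b):
    # Squared W_2 between uniform discrete measures with equally many
    # atoms: match sorted atoms (the monotone/quantile coupling).
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

mix = np.concatenate([g1, g2])     # t*gamma1 + (1-t)*gamma2, t = 1/2
lam2 = np.repeat(lam, 2)           # same measure, 2n atoms
lhs = w2sq(mix, lam2)
rhs = t * w2sq(g1, lam) + (1 - t) * w2sq(g2, lam)
assert lhs <= rhs + 1e-12          # inequality (3.2)
```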

As a corollary, we deduce that the Fréchet mean is unique if one of the measures $\mu^i$ is absolutely continuous, and this extends to the population version (see Proposition 3.2.7).

We conclude this subsection by stating an important regularity property in the Euclidean case. See Agueh and Carlier [2, Proposition 5.1] for a proof.

**Proposition 3.1.8 (**$L^\infty$**-Regularity of Fréchet Means)** *Let* $\mu^1, \dots, \mu^N \in W_2(\mathbb{R}^d)$ *and suppose that* $\mu^1$ *is absolutely continuous with density bounded by* $M$*. Then the Fréchet mean of* $\{\mu^i\}$ *is absolutely continuous with density bounded by* $N^d M$*, and is consequently a Karcher mean.*

In Theorem 5.5.2, we extend Proposition 3.1.8 to the population level.

#### *3.1.4 The One-Dimensional and the Compatible Case*

When $\mathcal{X} = \mathbb{R}$, there is a simple expression for the Fréchet mean, because $W_2(\mathbb{R})$ can be imbedded in a Hilbert space. Indeed, recall that

$$W\_2(\mu, \nu) = ||F\_{\mu}^{-1} - F\_{\nu}^{-1}||\_{L\_2(0,1)}$$

(see Sect. 2.3.2 or 1.5). In view of that, $W_2(\mathbb{R})$ can be seen as the closed convex subset of $L_2(0,1)$ formed by equivalence classes of left-continuous nondecreasing functions on $(0,1)$: any quantile function is left-continuous and nondecreasing, and any such function $G$ can be seen to be the quantile function of the distribution function $F$ given by the *right-continuous inverse* of $G$,

$$F(x) = \inf\{t \in (0, 1) : G(t) > x\} = \sup\{t \in (0, 1) : G(t) \le x\}.$$

(See, for example, Bobkov and Ledoux [25, Appendix A].) Therefore, the Fréchet mean of $\mu^1, \dots, \mu^N \in W_2(\mathbb{R})$ is the measure $\mu$ having quantile function

$$F_\mu^{-1} = \frac{1}{N} \sum_{i=1}^{N} F_{\mu^i}^{-1}.$$

The Fréchet mean is thus unique. This is no longer true in higher dimension, unless some regularity is imposed on the measures (Proposition 3.2.7).
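
For Gaussian measures on $\mathbb{R}$ this recipe is explicit: since $F^{-1}_{N(m, s^2)}(p) = m + s\, z_p$, averaging quantile functions yields the quantile function of $N(\bar m, \bar s^2)$, i.e., the Fréchet mean of one-dimensional Gaussians averages the means and the standard deviations. A sketch using only the Python standard library:

```python
from statistics import NormalDist

# Three Gaussian measures on R.
mus = [NormalDist(0.0, 1.0), NormalDist(2.0, 0.5), NormalDist(-1.0, 2.0)]

# Their Fréchet mean in W_2(R): average the means and the standard
# deviations, since F^{-1}_{N(m,s^2)}(p) = m + s * z_p.
mean = NormalDist(sum(d.mean for d in mus) / 3,
                  sum(d.stdev for d in mus) / 3)

# The averaged quantile function coincides with that of `mean`.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    avg_q = sum(d.inv_cdf(p) for d in mus) / 3
    assert abs(avg_q - mean.inv_cdf(p)) < 1e-9
```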

Boissard et al. [28] noticed that compatibility of $\mu^1, \dots, \mu^N$ in the sense of Definition 2.3.1 allows for a simple solution to the Fréchet mean problem, as in the one-dimensional case. Recall from Proposition 3.1.2 that this problem is equivalent to the multimarginal problem. Returning to the original form of $G$, we obtain an easy lower bound for any $\pi \in \Pi(\mu^1, \dots, \mu^N)$:


$$G(\pi) = \frac{1}{2N^2} \int_{\mathcal{X}^N} \sum_{i<j} \|x_i - x_j\|^2 \, \mathrm{d}\pi(x) \ge \frac{1}{2N^2} \sum_{i<j} W_2^2(\mu^i, \mu^j),$$

because the $(i, j)$-th marginal of $\pi$ is a coupling of $\mu^i$ and $\mu^j$. Thus, if equality holds above for $\pi$, then $\pi$ is optimal, and $M\#\pi$ is the Fréchet mean by Proposition 3.1.2. This is indeed the case for $\pi = (\mathbf{i}, \mathbf{t}_{\mu^1}^{\mu^2}, \dots, \mathbf{t}_{\mu^1}^{\mu^N})\#\mu^1$, because compatibility gives:

$$\int_{\mathcal{X}^N} \|x_i - x_j\|^2 \, \mathrm{d}\pi(x_1, \dots, x_N) = \int_{\mathcal{X}} \big\| \mathbf{t}_{\mu^1}^{\mu^i} - \mathbf{t}_{\mu^1}^{\mu^j} \big\|^2 \, \mathrm{d}\mu^1 = \int_{\mathcal{X}} \big\| \mathbf{t}_{\mu^1}^{\mu^i} \circ \mathbf{t}_{\mu^j}^{\mu^1} - \mathbf{i} \big\|^2 \, \mathrm{d}\mu^j = W_2^2(\mu^i, \mu^j).$$

We may thus conclude, in a slightly more general form (γ was μ<sup>1</sup> above):

**Theorem 3.1.9 (Fréchet Mean of Compatible Measures)** *Suppose that* {γ, μ<sup>1</sup>,...,μ*<sup>N</sup>*} *are compatible measures. Then*

$$\left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{t}_{\gamma}^{\mu^i}\right]\#\gamma$$

*is the Fréchet mean of* (μ<sup>1</sup>,...,μ*<sup>N</sup>*)*.*

A population version is given in Theorem 5.5.3.
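Theorem 3.1.9 admits a simple numerical sketch on the real line, where any collection of measures is compatible. Taking γ = Unif(0,1), the optimal map $\mathbf{t}_{\gamma}^{\mu^i}$ is just the quantile function of μ<sup>*i*</sup>, so the Fréchet mean is the push-forward of γ under the averaged quantile functions. The exponential measures used below are hypothetical illustrations.

```python
import numpy as np

# Theorem 3.1.9 sketch in W_2(R): with gamma = Unif(0,1), the optimal
# map t_gamma^{mu^i} is the quantile function of mu^i, so the Fréchet
# mean is the push-forward of gamma under the average of these maps.
# (Hypothetical measures: mu^i = Exp(rate_i), with quantile function
# u -> -log(1 - u)/rate_i.)

rng = np.random.default_rng(1)
u = rng.uniform(size=200_000)                 # sample from gamma

rates = [1.0, 2.0, 4.0]
t_maps = [-np.log1p(-u) / r for r in rates]   # t_gamma^{mu^i}(u)
bary_sample = np.mean(t_maps, axis=0)         # sample from the Fréchet mean

# The barycentre's quantile function is the average of the individual
# quantile functions: at level 1/2 it equals mean_i(log 2 / rate_i).
q50 = np.quantile(bary_sample, 0.5)
expected = np.mean([np.log(2.0) / r for r in rates])
print(q50, expected)
```

The two printed numbers agree up to Monte Carlo error, since pushing γ forward through the averaged map produces exactly a sample whose quantile function is the average of the quantile functions.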

#### *3.1.5 The Agueh–Carlier Characterisation*

Agueh and Carlier [2] provide a useful sufficient condition for γ to be the Fréchet mean. When *X* = R*<sup>d</sup>*, this condition is also necessary [2, Proposition 3.8], hence characterising Fréchet means in R*<sup>d</sup>*. It will allow us to easily deduce some equivariance results for Fréchet means with respect to independence (Lemma 3.1.11) and rotations (Lemma 3.1.12). More importantly, it provides a sufficient condition under which a local minimum of *F* is a global minimum (Theorem 3.1.15), and the same idea can be used to relate the population Fréchet mean to the expected value of the optimal maps (Theorem 4.2.4). Recall that φ∗ denotes the Legendre transform of φ, as defined on page 14.

**Proposition 3.1.10 (Fréchet Means and Potentials)** *Let* μ<sup>1</sup>,...,μ*<sup>N</sup>* ∈ *W*2(*X*) *be absolutely continuous, let* γ ∈ *W*2(*X*)*, and denote by* $\phi_i^*$ *the convex potentials of* $\mathbf{t}_{\mu^i}^{\gamma}$*. If* $\phi_i = \phi_i^{**}$ *are such that*

$$\frac{1}{N} \sum_{i=1}^{N} \phi_i(x) \le \frac{1}{2} \|x\|^2,\qquad \forall x \in \mathcal{X},\qquad \text{with equality } \gamma\text{-almost surely},$$

*then* γ *is the unique Fréchet mean of* μ<sup>1</sup>,...,μ*<sup>N</sup>.*

*Proof.* Uniqueness follows from Proposition 3.2.7. If θ ∈ *W*2(*X*) is any measure, then the Kantorovich duality yields

$$\begin{split} W_2^2(\gamma, \mu^i) &= \int_{\mathcal{X}} \left( \frac{1}{2} \|x\|^2 - \phi_i(x) \right) \mathrm{d}\gamma(x) + \int_{\mathcal{X}} \left( \frac{1}{2} \|y\|^2 - \phi_i^*(y) \right) \mathrm{d}\mu^i(y);\\ W_2^2(\theta, \mu^i) &\ge \int_{\mathcal{X}} \left( \frac{1}{2} \|x\|^2 - \phi_i(x) \right) \mathrm{d}\theta(x) + \int_{\mathcal{X}} \left( \frac{1}{2} \|y\|^2 - \phi_i^*(y) \right) \mathrm{d}\mu^i(y). \end{split}$$

Summation over *i* gives the result: by the hypothesis, the first integrals contribute zero for γ and a nonnegative quantity for θ, so that *F*(γ) ≤ *F*(θ).

A population version of this result, based on similar calculations, is given in Theorem 4.2.4.

The next two results are formulated in R*<sup>d</sup>* because the converse of Proposition 3.1.10 is then known to be true. If one could extend [2, Proposition 3.8] to an arbitrary separable Hilbert space *X*, then the two lemmata below would hold with R*<sup>d</sup>* replaced by *X*. The simple proofs are given on page 66 of the supplement.

**Lemma 3.1.11 (Independent Fréchet Means)** *Let* μ<sup>1</sup>,...,μ*<sup>N</sup>* *and* ν<sup>1</sup>,...,ν*<sup>N</sup>* *be absolutely continuous measures in W*2(R*<sup>d</sup>*<sup>1</sup>) *and W*2(R*<sup>d</sup>*<sup>2</sup>) *with Fréchet means* μ *and* ν*, respectively. Then the independent coupling* μ ⊗ ν *is the Fréchet mean of* μ<sup>1</sup> ⊗ ν<sup>1</sup>,...,μ*<sup>N</sup>* ⊗ ν*<sup>N</sup>.*

By induction (or a straightforward modification of the proof), one can show that the Fréchet mean of (μ*<sup>i</sup>* ⊗ ν*<sup>i</sup>* ⊗ ρ*<sup>i</sup>*) is μ ⊗ ν ⊗ ρ, and so on.

**Lemma 3.1.12 (Rotated Fréchet Means)** *If* μ *is the Fréchet mean of the absolutely continuous measures* μ<sup>1</sup>,...,μ*<sup>N</sup>* *and U is orthogonal, then U*#μ *is the Fréchet mean of U*#μ<sup>1</sup>,...,*U*#μ*<sup>N</sup>.*
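Lemma 3.1.11 can be checked numerically: for product measures, the barycentre may be computed coordinate by coordinate with the one-dimensional quantile-averaging formula. The Gaussian product measures below are hypothetical illustrations, not part of the formal development.

```python
import numpy as np

# Lemma 3.1.11 sketch: the Fréchet mean of product measures is the
# product of the coordinatewise Fréchet means.  Each 1-D factor is
# represented by a sorted sample (empirical quantile function), and
# the coordinatewise barycentre is the average of sorted samples.

rng = np.random.default_rng(5)
n = 100_000
# Two product measures on R^2 with independent coordinates.
xs = [rng.normal(0, 1, n), rng.normal(2, 3, n)]    # mu^1, mu^2 on R
ys = [rng.normal(-1, 2, n), rng.normal(1, 2, n)]   # nu^1, nu^2 on R

bary_x = np.mean([np.sort(s) for s in xs], axis=0)  # Fréchet mean of the mu^i
bary_y = np.mean([np.sort(s) for s in ys], axis=0)  # Fréchet mean of the nu^i

# The product barycentre should be N(1, 2^2) x N(0, 2^2).
print(bary_x.mean(), bary_x.std(), bary_y.mean(), bary_y.std())
```

The printed moments approximate (1, 2) and (0, 2), matching the product of the one-dimensional Gaussian barycentres.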

#### *3.1.6 Differentiability of the Fréchet Functional and Karcher Means*

Since we seek to minimise the Fréchet functional *F*, it would be helpful if *F* were differentiable, because we could then find at least local minima by solving the equation *F*′ = 0. This observation of Karcher [78] leads to the notion of *Karcher mean*.

**Definition 3.1.13 (Karcher Mean)** *Let F be the Fréchet functional associated with some random measure* Λ *in W*2(*X*)*. Then* γ *is a Karcher mean for* Λ *if F is differentiable at* γ *and F*′(γ) = 0*.*

Of course, if γ is a Fréchet mean for the random measure Λ and *F* is differentiable at γ, then *F*′(γ) must vanish. In this subsection, we build upon the work of Ambrosio et al. [12] and determine the derivative of the Fréchet functional. This will not only allow for a simple characterisation of Karcher means in terms of the optimal maps $\mathbf{t}_{\gamma}^{\Lambda}$ (Proposition 3.2.14), but will also be the cornerstone of the construction of a steepest descent algorithm for the empirical calculation of Fréchet means. The differentiability holds at the population level too (Theorem 3.2.13).

It turns out that the tangent bundle structure described in Sect. 2.3 gives rise to a differentiable structure on the Wasserstein space. Fix μ<sup>0</sup> ∈ *W*2(*X*) and consider the function

$$F_0: \mathcal{W}_2(\mathcal{X}) \to \mathbb{R}, \qquad F_0(\gamma) = \frac{1}{2} W_2^2(\gamma, \mu^0).$$

Ambrosio et al. [12, Corollary 10.2.7] show that when γ is absolutely continuous,

$$\lim_{W_2(\nu,\gamma)\to 0} \frac{F_0(\nu) - F_0(\gamma) + \int_{\mathcal{X}} \left\langle \mathbf{t}_{\gamma}^{\mu^0}(x) - x,\; \mathbf{t}_{\gamma}^{\nu}(x) - x \right\rangle \mathrm{d}\gamma(x)}{W_2(\nu,\gamma)} = 0.$$

Parts of the proof of this result (the limit superior above is ≤ 0; the limit inferior is bounded below) are reproduced in Proposition 3.2.12. The integral above can be seen as the inner product

$$\left\langle \mathbf{t}_{\gamma}^{\mu^0} - \mathbf{i},\; \mathbf{t}_{\gamma}^{\nu} - \mathbf{i} \right\rangle,$$

in the space *L*2(γ), which includes the tangent space Tan<sub>γ</sub> as a closed subspace. In terms of this inner product and the log map, we can write

$$F_0(\nu) - F_0(\gamma) = - \left\langle \log_{\gamma}(\mu^0), \log_{\gamma}(\nu) \right\rangle + o(W_2(\nu, \gamma)), \qquad \nu \to \gamma \quad \text{in } \mathcal{W}_2,$$

so that *F*<sup>0</sup> is Fréchet-differentiable<sup>3</sup> at γ with derivative

$$F_0'(\gamma) = -\log_{\gamma}(\mu^0) = -\left(\mathbf{t}_{\gamma}^{\mu^0} - \mathbf{i}\right) \in \mathrm{Tan}_{\gamma}.$$

By linearity, one immediately obtains:

**Theorem 3.1.14 (Gradient of the Fréchet Functional)** *Fix a collection of measures* μ<sup>1</sup>,...,μ*<sup>N</sup>* ∈ *W*2(*X*)*. When* γ ∈ *W*2(*X*) *is absolutely continuous, the Fréchet functional*

$$F(\gamma) = \frac{1}{2N} \sum_{i=1}^{N} W_2^2(\gamma, \mu^i), \qquad \gamma \in \mathcal{W}_2(\mathcal{X}),$$

*is Fréchet-differentiable and*

$$F'(\gamma) = -\frac{1}{N} \sum_{i=1}^{N} \log_{\gamma}(\mu^i) = -\frac{1}{N} \sum_{i=1}^{N} \left(\mathbf{t}_{\gamma}^{\mu^i} - \mathbf{i}\right).$$

It follows from this that an absolutely continuous γ ∈ *W*2(*X*) is a Karcher mean if and only if the average of the optimal maps is the identity. If in addition one μ*<sup>i</sup>* is absolutely continuous with bounded density, then the Fréchet mean μ̄ is absolutely

<sup>3</sup> The notion of Fréchet derivative is also named after Maurice Fréchet, but is not directly related to Fréchet means.

continuous by Proposition 3.1.8, so it is a Karcher mean. The result extends to the population version; see Proposition 3.2.14.
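The Karcher-mean condition suggests a fixed-point sketch: push γ forward through the averaged optimal map and check that the averaged map becomes the identity. In the discretisation below (an assumption of this sketch, not the text's algorithm), every measure is a uniform measure on *n* atoms on the real line, where the optimal map matches order statistics and a single step already reaches the fixed point.

```python
import numpy as np

# Karcher-mean sketch: an absolutely continuous gamma is a Karcher mean
# iff (1/N) sum_i t_gamma^{mu^i} = id, gamma-a.e.  Discretising each
# measure as a uniform measure on n sorted atoms (an assumption of this
# sketch), the optimal map between two such measures matches order
# statistics, and the update  gamma <- [(1/N) sum_i t_gamma^{mu^i}] # gamma
# converges in one iteration in one dimension.

rng = np.random.default_rng(2)
n = 5000
mus = [np.sort(rng.gamma(shape=k, size=n)) for k in (1.0, 2.0, 3.0)]

gamma = np.sort(rng.normal(size=n))      # arbitrary starting point
averaged_map = np.mean(mus, axis=0)      # images of gamma's atoms under the averaged map
gamma = averaged_map                     # push-forward of gamma

# Karcher condition at the new gamma: since the averaged map again sends
# the k-th atom of gamma to the mean of the k-th order statistics, the
# residual (1/N) sum_i t_gamma^{mu^i} - id vanishes on the support.
residual = np.max(np.abs(np.mean(mus, axis=0) - gamma))
print(residual)
```

The residual is exactly zero: after one push-forward, the atoms of γ coincide with the averaged order statistics, which is the discrete analogue of the averaged optimal map being the identity.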

It may happen that a collection μ<sup>1</sup>,...,μ*<sup>N</sup>* of absolutely continuous measures has a Karcher mean that is not a Fréchet mean; see Álvarez-Esteban et al. [9, Example 3.1] for an example in R<sup>2</sup>. But a Karcher mean γ is "almost" a Fréchet mean in the following sense. By Proposition 3.2.14, $N^{-1}\sum \mathbf{t}_{\gamma}^{\mu^i}(x) = x$ for γ-almost all *x*. If, on the other hand, the equality holds *for all x* ∈ *X*, then γ is the Fréchet mean, by taking integrals and applying Proposition 3.1.10. One can hope that under regularity conditions, the γ-almost sure equality can be upgraded to equality everywhere. Indeed, this is the case:

**Theorem 3.1.15 (Optimality Criterion for Karcher Means)** *Let U* ⊆ R*<sup>d</sup>* *be an open convex set and let* μ<sup>1</sup>,...,μ*<sup>N</sup>* ∈ *W*2(R*<sup>d</sup>*) *be probability measures on U with bounded strictly positive densities g*<sub>1</sub>,...,*g<sub>N</sub>. Suppose that an absolutely continuous Karcher mean* γ *is supported on U with bounded strictly positive density f there. Then* γ *is the Fréchet mean of* μ<sup>1</sup>,...,μ*<sup>N</sup>* *if one of the following holds:*


*Proof.* The result exploits Caffarelli's regularity theory for Monge–Ampère equations in the form of Theorem 1.6.7. In the first case, there exist *C*<sup>1</sup> (in fact, *C*<sup>2,α</sup>) convex potentials ϕ*<sub>i</sub>* on R*<sup>d</sup>* with $\mathbf{t}_{\gamma}^{\mu^i} = \nabla\varphi_i$, so that $\mathbf{t}_{\gamma}^{\mu^i}(x)$ is defined for every *x* ∈ R*<sup>d</sup>*. The set $\{x \in \mathbb{R}^d : N^{-1}\sum \mathbf{t}_{\gamma}^{\mu^i}(x) \ne x\}$ is γ-negligible (and hence Lebesgue-negligible) and open by continuity. It is therefore empty, so *F*′(γ) = 0 everywhere, and γ is the Fréchet mean (see the discussion before the theorem).

In the second case, by the same argument, we have $N^{-1}\sum \mathbf{t}_{\gamma}^{\mu^i}(x) = x$ for all *x* ∈ *U*. Since *U* is convex, there must exist a constant *C* such that $\sum \varphi_i(x) = C + N\|x\|^2/2$ for all *x* ∈ *U*, and we may assume without loss of generality that *C* = 0. If one repeats the proof of Proposition 3.1.10, then *F*(γ) ≤ *F*(θ) for all θ ∈ *P*(*U*). By continuity considerations, the inequality holds for all θ ∈ *P*($\overline{U}$) (Theorem 2.2.7), and since $\overline{U}$ is closed and convex, γ is the Fréchet mean by Corollary 3.1.3.

#### **3.2 Population Fréchet Means**

In this section, we extend the notion of empirical Fréchet mean to the population level, where Λ is a random element in *W*2(*X*) (a measurable mapping from a probability space to *W*2(*X*)). This requires a different strategy, since it is not clear how to define the analogue of multicouplings at that level of abstraction. However, it is important to point out that when there is more structure in Λ, multicouplings can be defined as laws of stochastic processes; see Pass [102] for a detailed account of the problem in this case.

In analogy with (3.1), we define:

**Definition 3.2.1 (Population Fréchet Mean)** *Let* Λ *be a random measure in W*2(*X*)*. The Fréchet mean of* Λ *is the minimiser (if it exists and is unique) of the Fréchet functional*

$$F(\gamma) = \frac{1}{2} \mathbb{E} W_2^2(\gamma, \Lambda), \qquad \gamma \in \mathcal{W}_2(\mathcal{X}).\tag{3.3}$$

Since *W*<sup>2</sup> is continuous and nonnegative, the expectation is well-defined.

#### *3.2.1 Existence, Uniqueness, and Continuity*

Existence and uniqueness of Fréchet means on a general metric space *M* are rather delicate questions. Usually, existence proofs are easier: for example, since the Fréchet functional *F* is continuous on *M* (as we show below), one often invokes local compactness of *M* in order to establish the existence of a minimiser. Unfortunately, a different strategy is needed when *M* = *W*2(*X*), because the Wasserstein space is not locally compact (Proposition 2.2.9).

The first thing to notice is that *F* is indeed continuous (this is clear for the empirical version). This is a consequence of the triangle inequality and holds when *W*2(*X* ) is replaced by any metric space.

**Lemma 3.2.2 (Finiteness of the Fréchet Functional)** *If F is not identically infinite, then it is finite and locally Lipschitz everywhere on W*2(*X*)*.*

*Proof.* Assume that *F* is finite at γ. If θ is any other measure in *W*2(*X*), write

$$2F(\gamma) - 2F(\theta) = \mathbb{E}[W\_2(\gamma, \Lambda) - W\_2(\theta, \Lambda)][W\_2(\gamma, \Lambda) + W\_2(\theta, \Lambda)].$$

Since *<sup>x</sup>* <sup>≤</sup> <sup>1</sup>+*x*<sup>2</sup> for all *<sup>x</sup>*, the triangle inequality in *<sup>W</sup>*2(*<sup>X</sup>* ) yields

$$\begin{aligned} 2|F(\boldsymbol{\gamma}) - F(\boldsymbol{\theta})| &\leq W\_2(\boldsymbol{\gamma}, \boldsymbol{\theta})[2\mathbb{E}W\_2(\boldsymbol{\gamma}, \boldsymbol{\Lambda}) + W\_2(\boldsymbol{\theta}, \boldsymbol{\gamma})] \\ &\leq W\_2(\boldsymbol{\gamma}, \boldsymbol{\theta})[2\mathbb{E}W\_2^2(\boldsymbol{\gamma}, \boldsymbol{\Lambda}) + 2 + W\_2(\boldsymbol{\theta}, \boldsymbol{\gamma})]. \end{aligned}$$

Since *F*(γ) < ∞, this shows that *F* is finite everywhere and the right-hand side vanishes as θ → γ in *W*2(*X* ). Now that we know that *F* is continuous, the same upper bound shows that it is in fact locally Lipschitz.

Example: let (*a<sub>n</sub>*) be a sequence of positive numbers that sum up to one. Let *x<sub>n</sub>* = 1/*a<sub>n</sub>* and suppose that Λ equals $\delta_{x_n} \in W_2(\mathbb{R})$ with probability *a<sub>n</sub>*. Then

$$\mathbb{E}W\_2^2(\Lambda, \delta\_0) = \sum\_{n=1}^{\infty} a\_n x\_n^2 = \sum\_{n=1}^{\infty} 1/a\_n = \infty,$$

and by Lemma 3.2.2 *F* is identically infinite. Henceforth, we say that *F* is finite when the condition in Lemma 3.2.2 holds.
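The divergence can be made concrete by choosing, say, *a<sub>n</sub>* = 2<sup>−*n*</sup> (so *x<sub>n</sub>* = 2<sup>*n*</sup>): each term of the series contributes 2<sup>*n*</sup>, and the partial sums grow geometrically. A minimal numeric check:

```python
# Concrete instance of the example above: a_n = 2^{-n}, x_n = 1/a_n = 2^n.
# Each term a_n * x_n^2 equals 2^n, so the partial sums of
# E W_2^2(Lambda, delta_0) = sum_n a_n x_n^2 grow without bound.

partial = 0.0
for n in range(1, 31):
    a_n = 2.0 ** (-n)
    x_n = 1.0 / a_n
    partial += a_n * x_n ** 2      # adds exactly 2**n (dyadic arithmetic is exact)
print(partial)                     # sum_{n=1}^{30} 2^n = 2^31 - 2
```

Thirty terms already exceed two billion; the series has no finite limit, matching the conclusion that *F* is identically infinite.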

Using the lower semicontinuity (2.5), one can prove existence on R*<sup>d</sup>* rather easily. (The empirical means exist even in infinite dimensions by Corollary 3.1.3.)

**Proposition 3.2.3 (Existence of Fréchet Means)** *The Fréchet functional associated with any random measure* Λ *in W*2(R*<sup>d</sup>*) *admits a minimiser.*

*Proof.* The assertion is clear if *F* is identically infinite. Otherwise, let (γ*<sup>n</sup>*) be a minimising sequence. We wish to show that the sequence is tight. Define *L* = sup*<sup>n</sup> F*(γ*<sup>n</sup>*) <sup>&</sup>lt; <sup>∞</sup> and observe that since *<sup>x</sup>* <sup>≤</sup> <sup>1</sup>+*x*<sup>2</sup> for all *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>,

$$\mathbb{E}W\_2(\boldsymbol{\gamma}\_n, \boldsymbol{\Lambda}) \le 1 + \mathbb{E}W\_2^2(\boldsymbol{\gamma}\_n, \boldsymbol{\Lambda}) \le 2L + 1, \qquad n = 1, 2, \dots$$

By the triangle inequality

$$L' = \mathbb{E}W_2(\delta_0, \Lambda) \le W_2(\delta_0, \gamma_1) + \mathbb{E}W_2(\gamma_1, \Lambda) \le W_2(\delta_0, \gamma_1) + 2L + 1,$$

so that for all *n*

$$\left(\int_{\mathbb{R}^d} \|x\|^2 \, \mathrm{d}\gamma_n(x)\right)^{1/2} = W_2(\gamma_n, \delta_0) \le \mathbb{E}W_2(\gamma_n, \Lambda) + \mathbb{E}W_2(\Lambda, \delta_0) \le 2L + 1 + L' < \infty.$$

Since closed and bounded sets in R*<sup>d</sup>* are compact, it follows that (γ*<sup>n</sup>*) is a tight sequence. We may assume that γ*n* → γ weakly, then use (2.5) and Fatou's lemma to obtain

$$2F(\gamma) = \mathbb{E}W_2^2(\gamma, \Lambda) \le \mathbb{E}\liminf_{n \to \infty} W_2^2(\gamma_n, \Lambda) \le \liminf_{n \to \infty} \mathbb{E}W_2^2(\gamma_n, \Lambda) = 2\inf F.$$

Thus, γ is a minimiser of *F*, and existence is established.

When *X* is an infinite-dimensional Hilbert space, existence still holds under a compactness assumption. We first prove a result about the support of the Fréchet mean. At the empirical level, one can say more about the support (see Corollary 3.1.4).

**Proposition 3.2.4 (Support of Fréchet Mean)** *Let* Λ *be a random measure in W*2(*X*) *and let K* ⊆ *X be a convex closed set such that* P[Λ(*K*) = 1] = 1*. If* γ *minimises F, then* γ(*K*) = 1*.*

**Remark 3.2.5** *For any closed K* ⊆ *X and any* α ∈ [0,1]*, the set* {Λ ∈ *Wp*(*X* ) : Λ(*K*) ≥ α} *is closed in Wp*(*X* ) *because* {Λ ∈ *P*(*X* ):Λ(*K*) ≥ α} *is weakly closed by the portmanteau lemma (Lemma 1.7.1).*

The proof amounts to a simple projection argument; see page 70 in the supplement.

**Corollary 3.2.6** *If there exists a compact convex K satisfying the hypothesis of Proposition 3.2.4, then the Fréchet functional admits a minimiser supported on K.*

*Proof.* Proposition 3.2.4 allows us to restrict the domain of *F* to *W*2(*K*), the collection of probability measures supported on *K*. Since this set is compact in *W*2(*X* ) (Corollary 2.2.5), the result follows from continuity of *F*.

From the convexity (3.2), one obtains a simple criterion for uniqueness. See Definition 1.6.4 for absolute continuity in infinite dimensions.

**Proposition 3.2.7 (Uniqueness of Fréchet Means)** *Let* Λ *be a random measure in W*2(*X*) *with finite Fréchet functional. If* Λ *is absolutely continuous with positive (inner) probability, then the Fréchet mean of* Λ *is unique (if it exists).*

**Remark 3.2.8** *It is not obvious that the set of absolutely continuous measures is measurable in W*2(*X* )*. We assume that there exists a Borel set A* ⊂ *W*2(*X* ) *such that* P(Λ∈ *A*) > 0 *and all measures in A are absolutely continuous.*

*Proof.* By taking expectations in (3.2), one sees that *F* is convex on *W*2(*X* ) with respect to linear interpolants. From Proposition 3.1.6, we conclude that

$$\Lambda \text{ absolutely continuous} \implies \gamma \mapsto \tfrac{1}{2} W_2^2(\gamma, \Lambda) \text{ is strictly convex}.$$

As *F* was already shown to be weakly convex in any case, it follows that

$$\mathbb{P}(\Lambda \text{ absolutely continuous}) > 0 \implies F \text{ is strictly convex}.$$

Since strictly convex functionals have at most one minimiser, this completes the proof.

We state without proof an important consistency result (Le Gouic and Loubes [87, Theorem 3]). Since *W*2(*X*) is a complete and separable metric space, we can define the "second degree" Wasserstein space *W*2(*W*2(*X*)). The law of a random measure Λ is in *W*2(*W*2(*X*)) if and only if the corresponding Fréchet functional is finite.

**Theorem 3.2.9 (Consistency of Fréchet Means)** *Let* Λ*<sub>n</sub>*, Λ *be random measures in W*2(R*<sup>d</sup>*) *with finite Fréchet functionals and laws* P*<sub>n</sub>*, P ∈ *W*2(*W*2(R*<sup>d</sup>*))*. If* P*<sub>n</sub>* → P *in W*2(*W*2(R*<sup>d</sup>*))*, then any sequence* λ*<sub>n</sub>* *of Fréchet means of* Λ*<sub>n</sub>* *has a W*<sub>2</sub>*-limit point* λ*, which is a Fréchet mean of* Λ*.*

See the Bibliographical Notes for a more general formulation.

**Corollary 3.2.10 (Wasserstein Law of Large Numbers)** *Let* Λ *be a random measure in W*2(R*<sup>d</sup>*) *with finite Fréchet functional and let* Λ<sup>1</sup>,... *be a sample from* Λ*. Assume* λ *is the unique Fréchet mean of* Λ *(see Proposition 3.2.7). Then almost surely, the sequence of empirical Fréchet means of* Λ<sup>1</sup>,...,Λ*<sup>n</sup>* *converges to* λ*.*

*Proof.* Let P be the law of Λ and let P*<sub>n</sub>* be its empirical counterpart (a random element in *W*2(*W*2(R*<sup>d</sup>*))). As in the proof of Proposition 2.2.6 (with *X* replaced by the complete separable metric space *W*2(R*<sup>d</sup>*)), almost surely P*<sub>n</sub>* → P in *W*2(*W*2(R*<sup>d</sup>*)), and Theorem 3.2.9 applies.

Under a compactness assumption, one can give a direct proof for the law of large numbers as in Theorem 3.1.5. This is done on page 71 in the supplement.

#### *3.2.2 The One-Dimensional Case*

As a generalisation of the empirical version, we have:

**Theorem 3.2.11 (Fréchet Means in** *W*2(R)**)** *Let* Λ *be a random measure in W*2(R) *with finite Fréchet functional. Then the Fréchet mean of* Λ *is the unique measure* λ *with quantile function* $F^{-1}_{\lambda}(t) = \mathbb{E}F^{-1}_{\Lambda}(t)$*, t* ∈ (0,1)*.*

*Proof.* Since *L*2(0,1) is a Hilbert space, the random element $F^{-1}_{\Lambda} \in L^2(0,1)$ has a unique Fréchet mean *g* ∈ *L*2(0,1), defined by the relations $\langle g, f\rangle = \mathbb{E}\langle F^{-1}_{\Lambda}, f\rangle$ for all *f* ∈ *L*2(0,1). On page 72 of the supplement, we show that *g* can be identified with $F^{-1}_{\lambda}$.

Interestingly, no regularity is needed in order for the Fréchet mean to be unique. This is not the case in higher dimensions; see Proposition 3.2.7. If there is some regularity, then one can state Theorem 3.2.11 in terms of optimal maps, because $F^{-1}_{\Lambda}$ is the optimal map from Leb|<sub>[0,1]</sub> to Λ. If γ ∈ *W*2(R) is any absolutely continuous (or even just continuous) measure, then Theorem 3.2.11 can be stated as follows: the Fréchet mean of Λ is the measure $[\mathbb{E}\mathbf{t}_{\gamma}^{\Lambda}]\#\gamma$. A generalisation of this result to compatible measures (Definition 2.3.1) can be carried out in the same way, since compatible measures are imbedded in a Hilbert space, using Bochner integrals for the definition of the expected optimal maps (see Sect. 2.4).
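Theorem 3.2.11 can be probed by Monte Carlo: average the (empirical) quantile functions over many draws of Λ. The random-Gaussian model below is a hypothetical illustration; with Λ = *N*(*M*, *S*<sup>2</sup>), *M* ~ Unif(−1,1) and *S* ~ Unif(1,2), the population Fréchet mean is *N*(0, (3/2)<sup>2</sup>), since E*F*<sup>−1</sup><sub>Λ</sub>(*t*) = E[*M*] + E[*S*]*z<sub>t</sub>*.

```python
import numpy as np

# Monte Carlo sketch of Theorem 3.2.11: the population Fréchet mean has
# quantile function E F^{-1}_Lambda.  Hypothetical random measure:
# Lambda = N(M, S^2) with M ~ Unif(-1, 1), S ~ Unif(1, 2), so the
# Fréchet mean is N(E[M], E[S]^2) = N(0, (3/2)^2).  Each quantile
# function is approximated by a sorted i.i.d. sample.

rng = np.random.default_rng(3)
n, draws = 2000, 4000
acc = np.zeros(n)
for _ in range(draws):
    M, S = rng.uniform(-1, 1), rng.uniform(1, 2)
    acc += np.sort(rng.normal(M, S, n))   # empirical quantile function of Lambda
mean_quantile = acc / draws               # estimates E F^{-1}_Lambda

# The averaged quantile function should look like that of N(0, 1.5^2).
print(mean_quantile.mean(), mean_quantile.std())
```

The printed moments approximate 0 and 3/2, the mean and standard deviation of the population Fréchet mean, up to Monte Carlo and discretisation error.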

#### *3.2.3 Differentiability of the Population Fréchet Functional*

We now use the Fubini result (Proposition 2.4.9) in order to extend the differentiability of the Fréchet functional to the population version. This will follow immediately if we can interchange the expectation and the derivative in the form

$$F'(\boldsymbol{\gamma}) = \frac{1}{2} (\mathbb{E}W\_2^2)'(\boldsymbol{\gamma}, \boldsymbol{\Lambda}) = \mathbb{E}\left(\frac{1}{2}W\_2^2\right)'(\boldsymbol{\gamma}, \boldsymbol{\Lambda}) = -\mathbb{E}(\mathbf{t}\_{\boldsymbol{\gamma}}^{\boldsymbol{\Lambda}} - \mathbf{i}).$$

In order to do this, we will use dominated convergence in conjunction with uniform bounds on the slopes

$$u(\theta,\Lambda) = \frac{\frac{1}{2} W_2^2(\theta,\Lambda) - \frac{1}{2} W_2^2(\theta_0,\Lambda) + \int_{\mathcal{X}} \langle \mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i},\, \mathbf{t}_{\theta_0}^{\theta} - \mathbf{i} \rangle \,\mathrm{d}\theta_0}{W_2(\theta,\theta_0)}, \qquad u(\theta_0,\Lambda) = 0. \tag{3.4}$$

**Proposition 3.2.12 (Slope Bounds)** *Let* θ0*,* Λ*, and* θ *be probability measures with* θ<sup>0</sup> *absolutely continuous, and set* δ = *W*2(θ,θ<sup>0</sup>)*. Then*

$$\frac{1}{2}\delta - W_2(\theta_0, \Lambda) - \sqrt{2W_2^2(\theta_0, \delta_0) + 2W_2^2(\Lambda, \delta_0)} \le u(\theta, \Lambda) \le \frac{1}{2}\delta,$$

*where u is defined by* (3.4)*. If the measures are compatible in the sense of Definition 2.3.1, then u*(θ,Λ) = δ/2*.*

The proof is a slight variation of Ambrosio et al. [12, Theorem 10.2.2 and Proposition 10.2.6], and the details are given on page 72 of the supplement.

**Theorem 3.2.13 (Population Fréchet Gradient)** *Let* Λ *be a random measure with finite Fréchet functional F. Then F is Fréchet-differentiable at any absolutely continuous* θ<sub>0</sub> *in the Wasserstein space, and* $F'(\theta_0) = -(\mathbb{E}\mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i}) \in L^2(\theta_0)$*. More precisely,*

$$\frac{F(\boldsymbol{\theta}) - F(\boldsymbol{\theta}\_{0}) + \int\_{\mathcal{X}} \langle \mathbb{E} \mathbf{t}\_{\boldsymbol{\theta}\_{0}}^{\Lambda} - \mathbf{i}, \mathbf{t}\_{\boldsymbol{\theta}\_{0}}^{\boldsymbol{\theta}} - \mathbf{i} \rangle \, \mathrm{d}\boldsymbol{\theta}\_{0}}{W\_{2}(\boldsymbol{\theta}, \boldsymbol{\theta}\_{0})} \to 0, \qquad \boldsymbol{\theta} \to \boldsymbol{\theta}\_{0} \quad \text{in } \mathcal{W}\_{2}.$$

*Thus, the Fréchet derivative of F can be identified with the map* $-(\mathbb{E}\mathbf{t}_{\theta_0}^{\Lambda} - \mathbf{i})$ *in the tangent space at* θ<sub>0</sub>*, a subspace of L*<sup>2</sup>(θ<sub>0</sub>)*.*

*Proof.* Introduce the slopes *u*(θ,Λ) defined by (3.4). Then for all Λ, *u*(θ,Λ) → 0 as *W*2(θ,θ<sub>0</sub>) → 0, by the differentiability properties established above. Let us show that E*u*(θ,Λ) → 0 as well. By Proposition 3.2.12, *u* is bounded above by a constant that does not depend on Λ, and below by the negative of

$$\mathbb{E}W_2(\theta_0, \Lambda) + \mathbb{E}\sqrt{2W_2^2(\theta_0, \delta_0) + 2W_2^2(\Lambda, \delta_0)} \le \mathbb{E}W_2(\theta_0, \Lambda) + \sqrt{2}\,W_2(\theta_0, \delta_0) + \sqrt{2}\,\mathbb{E}W_2(\Lambda, \delta_0).$$

Both expectations are finite by the hypothesis on Λ, because the Fréchet functional is finite. The dominated convergence theorem yields

$$\mathbb{E}u(\theta,\Lambda) = \frac{F(\theta) - F(\theta\_0) + \mathbb{E}\int\_{\mathcal{X}} \langle \mathbf{t}\_{\theta\_0}^{\Lambda} - i, \mathbf{t}\_{\theta\_0}^{\theta} - i \rangle \,\mathrm{d}\theta\_0}{W\_2(\theta, \theta\_0)} \to 0, \qquad W\_2(\theta\_0, \theta) \to 0.$$

The measurability of the integral and the result then follow from Fubini's theorem (see Proposition 2.4.9).

**Proposition 3.2.14** *Let* Λ *be a random measure in W*2(*X*) *with finite Fréchet functional F, and let* γ *be absolutely continuous in W*2(*X*)*. Then* γ *is a Karcher mean of* Λ *if and only if* $\mathbb{E}\mathbf{t}_{\gamma}^{\Lambda} - \mathbf{i} = 0$ *in L*<sup>2</sup>(γ)*. Furthermore, if* γ *is a Fréchet mean of* Λ*, then it is also a Karcher mean.*

The characterisation of Karcher means follows immediately from Theorem 3.2.13. The other statement is that the derivative vanishes at the minimum, which is fairly obvious intuitively; see page 73 in the supplement.

#### **3.3 Bibliographical Notes**

Proposition 3.1.2 is essentially due to Agueh and Carlier [2, Proposition 4.2], who show it on R*<sup>d</sup>* (see also Zemel and Panaretos [134, Theorem 2]). An earlier result in a compact setting can be found in Carlier and Ekeland [33]. The formulation given here is from Masarotto et al. [91]. A more general version is provided by Le Gouic and Loubes [87, Theorem 8].

Lemmata 3.1.11 and 3.1.12 are from [135], but were known earlier (e.g., Bonneel et al. [30]).

Proposition 3.1.6 is a simplified version of Álvarez-Esteban et al. [8, Theorem 2.8] (see [8, Corollary 2.9]).

Propositions 3.2.3 and 3.2.7 are from Bigot and Klein [22], who also show the law of large numbers (Corollary 3.2.10) and deal with the one-dimensional setup (Theorem 3.2.11) in a compact setting. Section 2.4 appears to be new, but see the discussion in its beginning for other measurability results.

Barycentres can be defined for any *p* ≥ 1 as the measures minimising $\mu \mapsto \mathbb{E}W_p^p(\Lambda,\mu)$. (Strictly speaking, these are not Fréchet means unless *p* = 2.) Le Gouic and Loubes [87] show Proposition 3.2.3 and Theorem 3.2.9 in this more general setup, where R*<sup>d</sup>* can be replaced by any separable locally compact geodesic space.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Phase Variation and Fréchet Means**

Why is it relevant to construct the Fréchet mean of a collection of measures with respect to the Wasserstein metric? A simple answer is that this kind of average will often express a more natural notion of "typical" realisation of a random probability distribution than an arithmetic average.<sup>1</sup> Much more can be said, however, in that the Wasserstein–Fréchet mean and the closely related notion of an optimal multicoupling arise canonically as the appropriate framework for the formulation and solution of the problem of *separation of amplitude and phase variation of a point process*. It would almost seem that Wasserstein–Fréchet means were "made" for precisely this problem.

When analysing the (co)variation of a real-valued stochastic process {*Y*(*x*) : *x* ∈ *K*} over a convex compact domain *K*, it can be broadly said that one may distinguish two layers of variation:

• *Amplitude variation*. This is the "classical" variation that one would also encounter in multivariate analysis, and refers to the stochastic fluctuations around a mean level, usually encoded, at least up to second order, in the covariance kernel.

In short, this is variation "in the *y*-axis" (ordinate).

**Electronic Supplementary Material** The online version of this chapter (https://doi.org/10.1007/978-3-030-38438-8_4) contains supplementary material.

<sup>1</sup> For instance, the arithmetic average of two scalar Gaussians *N*(μ<sub>1</sub>,1) and *N*(μ<sub>2</sub>,1) will be their mixture with equal weights, but their Fréchet–Wasserstein average will be the Gaussian *N*(½μ<sub>1</sub> + ½μ<sub>2</sub>, 1) (see Lemma 4.2.1), which is arguably more representative from an intuitive point of view. In much the same way, the Fréchet–Wasserstein average of probability measures representing some type of object (e.g., normalised greyscale images of faces) will also be an object of the same type. This sort of phenomenon is well known in manifold statistics more generally, and is arguably one of the key motivations to account for the non-linear geometry of the sample space, rather than imbed it into a larger linear space and use the addition operation.

• *Phase variation*. This is a second layer of non-linear variation peculiar to continuous domain stochastic processes, and is rarely—if ever—encountered in multivariate analysis. It arises as the result of random changes (or deformations) in the time scale (or the spatial domain) of definition of the process. It can be conceptualised as a composition of the stochastic process with a random transformation (warp map) acting on its domain.

This is variation "in the *x*-axis" (abscissa).
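The contrast drawn in footnote 1, between the arithmetic and Wasserstein averages of two Gaussians, is easy to verify numerically; the parameters below are hypothetical.

```python
import numpy as np

# Footnote-1 illustration: for N(0,1) and N(4,1), the arithmetic average
# of the laws is a bimodal equal-weight mixture, while the Fréchet-
# Wasserstein average is N(2,1).  Comparing variances separates the two:
# the mixture inflates the variance to 1 + 2^2 = 5, whereas the
# Wasserstein average keeps it at 1.

rng = np.random.default_rng(4)
n = 200_000
s1, s2 = rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n)

mixture = np.where(rng.uniform(size=n) < 0.5, s1, s2)
wasserstein_avg = (np.sort(s1) + np.sort(s2)) / 2   # average of quantile functions

print(mixture.var(), wasserstein_avg.var())
```

The printed variances are close to 5 and 1 respectively: the mixture spreads mass over both modes, while the Wasserstein average simply translates a single Gaussian bump.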

The terminology on amplitude/phase variation is adapted from random trigonometric functions, which may vary in amplitude (oscillations in the range of the function) or phase (oscillations in the domain of the function). Failing to properly account for the superposition of these two forms of variation may entirely distort the findings of a statistical analysis of the random function (see Sect. 4.1.1). Consequently, it is an important problem to be able to separate the two, thus correctly accounting for the distinct contribution of each. The problem of separation is also known as that of *registration* (Ramsay and Li [108]), *synchronisation* (Wang and Gasser [129]), or *multireference alignment* (Bandeira et al. [16]), though in some cases these terms refer to a simpler problem where there is no amplitude variation at all.

Phase variation naturally arises in the study of random phenomena where there is no absolute notion of time or space, but every realisation of the phenomenon evolves according to a time scale that is intrinsic to the phenomenon itself, and (unfortunately) unobservable. Processes related to physiological measurements, such as *growth curves* and *neuronal signals*, are usual suspects. Growth curves can be modelled as *continuous random functions (functional data)*, whereas neuronal signals are better modelled as *discrete random measures (point processes)*. We first describe amplitude/phase variation in the former<sup>2</sup> case, as that is easier to appreciate, before moving on to the latter case, which is the main subject of this chapter.

#### **4.1 Amplitude and Phase Variation**

#### *4.1.1 The Functional Case*

Let $K$ denote the unit cube $[0,1]^d \subset \mathbb{R}^d$. A real random function $Y = (Y(\mathbf{x}) : \mathbf{x} \in K)$ can, broadly speaking, have two types of variation. The first, *amplitude variation*, results from $Y(\mathbf{x})$ being a random variable for every $\mathbf{x}$ and describes its fluctuations around the mean level $m(\mathbf{x}) = \mathbb{E}Y(\mathbf{x})$, usually encoded by the variance $\mathrm{var}\,Y(\mathbf{x})$. For this reason, it can be referred to as "variation in the *y*-axis". More

<sup>2</sup> As the functional case will only serve as a motivation, our treatment of this case will mostly be heuristic and superficial. Rigorous proofs and more precise details can be found in the books by Ferraty and Vieu [51], Horvath and Kokoszka [ ´ 70], or Hsing and Eubank [71]. The notion of amplitude and phase variation is discussed in the books by Ramsay and Silverman [109, 110] that are of a more applied flavour. One can also consult the review by Wang et al. [127], where amplitude and phase variation are discussed in Sect. 5.2.

generally, for any finite set $\mathbf{x}\_1,\dots,\mathbf{x}\_n$, the $n \times n$ covariance matrix with entries $\kappa(\mathbf{x}\_i, \mathbf{x}\_j) = \mathrm{cov}[Y(\mathbf{x}\_i), Y(\mathbf{x}\_j)]$ encapsulates (up to second order) the stochastic deviations of the random vector $(Y(\mathbf{x}\_1),\dots,Y(\mathbf{x}\_n))$ from its mean, in analogy with the multivariate case. Heuristically, one then views amplitude variation as the collection $\kappa(\mathbf{x}, \mathbf{y})$ for $\mathbf{x}, \mathbf{y} \in K$, in a sense we discuss next.

One typically views $Y$ as a random element in the separable Hilbert space $L\_2(K)$, assumed to have $\mathbb{E}\|Y\|^2 < \infty$ and continuous sample paths, so that in particular $Y(\mathbf{x})$ is a random variable for all $\mathbf{x} \in K$. Then the *mean function*

$$m(\mathbf{x}) = \mathbb{E}Y(\mathbf{x}), \quad \mathbf{x} \in K$$

and the *covariance kernel*

$$\kappa(\mathbf{x}, \mathbf{y}) = \text{cov}[Y(\mathbf{x}), Y(\mathbf{y})], \quad \mathbf{x}, \mathbf{y} \in K$$

are well-defined and finite; we shall assume that they are continuous, which is equivalent to *Y* being *mean-square continuous*:

$$\mathbb{E}\left[Y(\mathbf{y}) - Y(\mathbf{x})\right]^2 \to 0, \qquad \mathbf{y} \to \mathbf{x}.$$

The covariance kernel $\kappa$ gives rise to the *covariance operator* $\mathcal{R} : L\_2(K) \to L\_2(K)$, defined by

$$(\mathcal{R}f)(\mathbf{y}) = \int\_K \kappa(\mathbf{x}, \mathbf{y}) f(\mathbf{x}) \, \mathrm{d}\mathbf{x},$$

a self-adjoint positive semidefinite Hilbert–Schmidt operator on $L\_2(K)$. The justification for this terminology is the observation that when $m = 0$, for all bounded $f, g \in L\_2(K)$,

$$\mathbb{E}\left[\langle Y, f\rangle \langle Y, g\rangle\right] = \mathbb{E}\left[\int\_{K^2} Y(\mathbf{x})f(\mathbf{x})Y(\mathbf{y})g(\mathbf{y})\,\mathrm{d}(\mathbf{x},\mathbf{y})\right] = \int\_K g(\mathbf{y})(\mathcal{R}f)(\mathbf{y})\,\mathrm{d}\mathbf{y},$$

and so, without the restriction to *m* = 0,

$$\operatorname{cov}\left[\langle Y, f\rangle, \langle Y, g\rangle\right] = \int\_K g(\mathbf{y})(\mathcal{R}f)(\mathbf{y})\,\mathrm{d}\mathbf{y} = \langle g, \mathcal{R}f\rangle.$$

The covariance operator admits an eigendecomposition $(r\_k, \phi\_k)\_{k=1}^{\infty}$ such that $r\_k \ge 0$, $\mathcal{R}\phi\_k = r\_k\phi\_k$, and $(\phi\_k)$ is an orthonormal basis of $L\_2(K)$. One then has the celebrated *Karhunen–Loève expansion*

$$Y(\mathbf{x}) = m(\mathbf{x}) + \sum\_{k=1}^{\infty} \left\langle Y - m, \phi\_k \right\rangle \phi\_k(\mathbf{x}) = m(\mathbf{x}) + \sum\_{k=1}^{\infty} \xi\_k \phi\_k(\mathbf{x}).$$

A major feature of this expansion is the separation of the functional part from the stochastic part: the functions $\phi\_k(\mathbf{x})$ are deterministic; the random variables $\xi\_k$ are scalars. This separation actually holds for any orthonormal basis; the role of choosing the eigenbasis of $\mathcal{R}$ is to make the $\xi\_k$ *uncorrelated*:

$$\text{cov}(\xi\_k, \xi\_l) = \text{cov}\left[\langle Y, \phi\_k\rangle, \langle Y, \phi\_l\rangle\right] = \langle\phi\_l, \mathcal{R}\phi\_k\rangle = r\_k\langle\phi\_l, \phi\_k\rangle$$

vanishes when $k \ne l$ and equals $r\_k$ when $k = l$. For this reason, it is not surprising that using the eigenfunctions as the $\phi\_k$ yields the optimal representation of $Y$. Here, optimality is with respect to truncations: for any other basis $(\psi\_k)$ and any $M$,

$$\mathbb{E}\left\|Y - m - \sum\_{k=1}^{M}\left\langle Y-m,\psi\_{k}\right\rangle\psi\_{k}\right\|^{2} \geq \mathbb{E}\left\|Y - m - \sum\_{k=1}^{M}\left\langle Y-m,\phi\_{k}\right\rangle\phi\_{k}\right\|^{2},$$

so that (φ*<sup>k</sup>*) provides the best finite-dimensional approximation to *Y*. The approximation error on the right-hand side equals

$$\mathbb{E}\left\|\sum\_{k=M+1}^{\infty} \xi\_k \phi\_k \right\|^2 = \sum\_{k=M+1}^{\infty} r\_k$$

and depends on how quickly the eigenvalues of *R* decay.

One carries out inference for $m$ and $\kappa$ on the basis of a sample $Y\_1,\dots,Y\_n$ by

$$
\widehat{m}(\mathbf{x}) = \frac{1}{n} \sum\_{i=1}^{n} Y\_i(\mathbf{x}), \qquad \mathbf{x} \in K
$$

and

$$
\widehat{\kappa}(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum\_{i=1}^{n} Y\_i(\mathbf{x}) Y\_i(\mathbf{y}) - \widehat{m}(\mathbf{x}) \widehat{m}(\mathbf{y}),
$$

from which one proceeds to estimate *R* and its eigendecomposition.
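This estimation pipeline can be sketched on a grid. The simulated model below, a two-component Karhunen–Loève expansion with eigenvalues $r\_1 = 2$ and $r\_2 = 0.5$, is an assumption made purely for illustration; the point is that the eigenvalues of the discretised covariance operator recover the variances of the scores.

```python
import numpy as np

# A sketch of the estimation step on a grid, with simulated data (the model
# below, a two-component Karhunen-Loeve expansion, is an assumption made
# only for illustration): estimate the mean and covariance kernel, then
# recover the eigenvalues of the covariance operator.
rng = np.random.default_rng(0)
J, n = 100, 500
x = np.linspace(0, 1, J)
dx = x[1] - x[0]

m = np.sin(2 * np.pi * x)                       # true mean function
phi1 = np.sqrt(2) * np.cos(2 * np.pi * x)       # orthonormal in L2[0,1]
phi2 = np.sqrt(2) * np.cos(4 * np.pi * x)
xi = rng.normal(size=(n, 2)) * np.sqrt([2.0, 0.5])   # r_1 = 2, r_2 = 0.5
Y = m + xi[:, :1] * phi1 + xi[:, 1:] * phi2     # sample of n curves, (n, J)

m_hat = Y.mean(axis=0)                          # estimator of m
kappa_hat = (Y - m_hat).T @ (Y - m_hat) / n     # estimator of kappa(x_i, x_j)

# Discretised covariance operator (R f)(y) = int kappa(x, y) f(x) dx:
# its eigenvalues approximate r_1 >= r_2 >= 0 = r_3 = ...
r_hat = np.linalg.eigvalsh(kappa_hat * dx)[::-1]
print(r_hat[:3])
```

With a finer grid and a larger sample, `r_hat[:2]` concentrates around $(2, 0.5)$ while the remaining eigenvalues vanish, reflecting the fast decay discussed above.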

We have seen that amplitude variation in the sense described above is linear and is dealt with using linear operations. There is another, qualitatively different type of variation, *phase variation*, that is non-linear and does not have an obvious finite-dimensional analogue. It arises when, in addition to the randomness in the values $Y(\mathbf{x})$ themselves, an extra layer of stochasticity is present in the domain of definition. In mathematical terms, there is a random invertible *warp function* (sometimes called a *deformation* or *warping*) $T : K \to K$, and instead of $Y(\mathbf{x})$, one observes realisations from

$$\tilde{Y}(\mathbf{x}) = Y(T^{-1}(\mathbf{x})), \qquad \mathbf{x} \in K.$$

For this reason, phase variation can be viewed as "variation in the *x*-axis". When $d = 1$, the set $K$ is usually interpreted as a time interval, and then the model stipulates that each individual has its own time scale. Typically, the warp function is assumed to be a homeomorphism of $K$ independent of $Y$, and often some additional smoothness is imposed, say $T \in C^2$. One of the classical examples is growth curves of children, of which a dataset from the Berkeley growth study (Jones and Bayley [73]) is shown in Fig. 4.1. The curves are the derivatives of the height of a sample of ten girls as a function of time, from birth until age 18. One clearly notices the presence of the two types of variation in the figure. The initial velocity for all children is highest immediately or shortly after birth, and in most cases decreases sharply during the first 2 years. Then follows a period of acceleration for another year or so, and so on. Despite presenting qualitatively similar behaviour, the curves differ substantially not only in the magnitude of the peaks but also in their location. For instance, one red curve has a local minimum at the age of three, while a green one has a local maximum at almost that same time point. It is apparent that if one tries to estimate the mean function by averaging the curves at each time $x$, the shape of the resulting estimate would look very different from each of the curves. Thus, this pointwise averaging (known as the *cross-sectional mean*) fails to represent the typical behaviour. This phenomenon is seen more explicitly in the next example. The terminology of amplitude and phase comes from trigonometric functions, from

Fig. 4.1: Derivatives of growth curves of ten girls from the Berkeley dataset. The data and the code for the figure are from the R package fda (Ramsay et al. [111])

which we derive an artificial example that illustrates the difficulties of estimation in the presence of phase variation. Let *A* and *B* be symmetric random variables and consider the random function

$$\tilde{Y}(\mathbf{x}) = A \sin[8\pi(\mathbf{x} + \mathbf{B})].\tag{4.1}$$

(Strictly speaking, *x* → *x*+*B* is not from [0,1] to itself; for illustration purposes, we assume in this example that *K* = R.) The random variable *A* generates the amplitude variation, while *B* represents the phase variation. In Fig. 4.2, we plot four realisations and the resulting empirical means for the two extreme scenarios where *B* = 0 (no phase variation) or *A* = 1 (no amplitude variation). In the left panel of the figure, we see that the sample mean (in thick blue) lies between the observations and has a similar form, so can be viewed as the curve representing the typical realisation of the random curve. This is in contrast to the right panel, where the mean is qualitatively different from all curves in the sample: though periodicity is still present, the peaks and troughs have been flattened, and the sample mean is much more diffuse than any of the observations.

Fig. 4.2: Four realisations of (4.1) with means in thick blue. Left: amplitude variation (*B* = 0); right: phase variation (*A* = 1)

The phenomenon illustrated in Fig. 4.2 is hardly surprising, since as mentioned earlier amplitude variation is linear while phase variation is not, and taking sample means is a linear operation. Let us see in formulae how this phenomenon occurs. When *A* = 1 we have

$$\mathbb{E}\tilde{Y}(\mathbf{x}) = \sin(8\pi\mathbf{x})\mathbb{E}[\cos(8\pi B)] + \cos(8\pi\mathbf{x})\mathbb{E}[\sin(8\pi B)].$$

Since $B$ is symmetric, the second term vanishes, and unless $B$ is trivial, the expectation of the cosine is smaller than one in absolute value. Consequently, the expectation of $\tilde{Y}(\mathbf{x})$ is the original function $\sin(8\pi x)$ multiplied by a constant of magnitude strictly less than one, resulting in peaks of smaller magnitude.
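This damping can be seen numerically. In the sketch below, $B$ is taken uniform on $(-b, b)$, a symmetric choice made only for this illustration, for which $\mathbb{E}[\cos(8\pi B)] = \sin(8\pi b)/(8\pi b)$ in closed form.

```python
import numpy as np

# Damping of the cross-sectional mean in the toy model (4.1) with A = 1:
# here B ~ Uniform(-b, b), a symmetric choice made for this sketch, for
# which E[cos(8 pi B)] = sin(8 pi b) / (8 pi b).
rng = np.random.default_rng(1)
n, b = 20000, 0.05
x = np.linspace(0, 1, 200)
B = rng.uniform(-b, b, size=(n, 1))
Y_tilde = np.sin(8 * np.pi * (x + B))               # n phase-varying curves

cross_sectional_mean = Y_tilde.mean(axis=0)
damping = np.sin(8 * np.pi * b) / (8 * np.pi * b)   # E[cos(8 pi B)] ~ 0.757

# The peaks of the mean curve reach roughly 0.757 instead of 1.
print(cross_sectional_mean.max(), damping)
```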

In the general case, where $\tilde{Y}(\mathbf{x}) = Y(T^{-1}(\mathbf{x}))$ and $Y$ and $T$ are independent, we have

$$\mathbb{E}\tilde{Y}(\mathbf{x}) = \mathbb{E}[m(T^{-1}(\mathbf{x}))],$$

and

$$\text{cov}[\tilde{Y}(\mathbf{x}), \tilde{Y}(\mathbf{y})] = \mathbb{E}[\kappa(T^{-1}(\mathbf{x}), T^{-1}(\mathbf{y}))] + \text{cov}[m(T^{-1}(\mathbf{x})), m(T^{-1}(\mathbf{y}))].$$

From this, several conclusions can be drawn. Let $\tilde{m}(\mathbf{x}) = m(T^{-1}(\mathbf{x}))$ be the conditional mean function given $T$. Then the value of the mean function $\mathbb{E}\tilde{m}$ at $\mathbf{x}\_0$ is determined not by a single point, but rather by all the values of $m$ at the possible outcomes of $T^{-1}(\mathbf{x}\_0)$. In particular, if $\mathbf{x}\_0$ was a local maximum of $m$, then $\mathbb{E}[\tilde{m}(\mathbf{x}\_0)]$ will typically be strictly smaller than $m(\mathbf{x}\_0)$; the phase variation results in a smearing of $m$.

At this point, an important remark should be made. Whether or not phase variation is problematic depends on the specific application. If one is indeed interested in the mean and covariance functions of $\tilde{Y}$, then the standard empirical estimators will be consistent, since $\tilde{Y}$ is itself a random function. But if it is rather $m$, the mean of $Y$, that is of interest, then the confounding of the amplitude and phase variation will lead to inconsistency. This can also be seen from the formula

$$\tilde{Y}(\mathbf{x}) = m(T^{-1}(\mathbf{x})) + \sum\_{k=1}^{\infty} \xi\_k \phi\_k(T^{-1}(\mathbf{x})).$$

The above series is *not* the Karhunen–Loève expansion of $\tilde{Y}$; the simplest way to see this is the observation that $\phi\_k(T^{-1}(\mathbf{x}))$ includes both the functional component $\phi\_k$ and the random component $T^{-1}(\mathbf{x})$. The true Karhunen–Loève expansion of $\tilde{Y}$ will in general be qualitatively very different from that of $Y$, not only in terms of the mean function but also in terms of the covariance operator and, consequently, its eigenfunctions and eigenvalues. As illustrated in the trigonometric example, the typical situation is that the mean $\mathbb{E}\tilde{Y}$ is more diffuse than $m$, and the decay of the eigenvalues $\tilde{r}\_k$ of the covariance operator is slower than that of $r\_k$; as a result, one needs to truncate the sum at a high threshold in order to capture a substantial enough part of the variability. In the toy example (4.1), the Karhunen–Loève expansion has a single term besides the mean if $B = 0$, while having two terms if $A = 1$.
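The closing claim can be checked numerically: under pure amplitude variation the empirical covariance operator is essentially rank one, and under pure phase variation it is rank two. The uniform laws for $A$ and $B$ below are assumptions of this sketch.

```python
import numpy as np

# Numerical check: for the toy model (4.1), the empirical covariance
# operator has (essentially) rank one under pure amplitude variation
# (B = 0) and rank two under pure phase variation (A = 1).  The uniform
# laws for A and B are assumptions of this sketch.
rng = np.random.default_rng(2)
n = 5000
x = np.linspace(0, 1, 128)
dx = x[1] - x[0]

def kl_eigenvalues(Y):
    """Decreasing eigenvalues of the empirical covariance operator."""
    Yc = Y - Y.mean(axis=0)
    return np.linalg.eigvalsh((Yc.T @ Yc / n) * dx)[::-1]

A = rng.uniform(0.5, 1.5, size=(n, 1))          # amplitude only (B = 0)
B = rng.uniform(-0.05, 0.05, size=(n, 1))       # phase only (A = 1)

r_amp = kl_eigenvalues(A * np.sin(8 * np.pi * x))
r_phase = kl_eigenvalues(np.sin(8 * np.pi * (x + B)))
print(r_amp[:3], r_phase[:3])
```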

When one is indeed interested in the mean $m$ and the covariance $\kappa$, the random function $T$ pertaining to the phase variation is a nuisance parameter. Given a sample $\tilde{Y}\_i = Y\_i \circ T\_i^{-1}$, $i = 1,\dots,n$, there is no point in taking pointwise means of the $\tilde{Y}\_i$, because the curves are *misaligned*; $\tilde{Y}\_1(\mathbf{x}) = Y\_1(T\_1^{-1}(\mathbf{x}))$ should not be compared with $\tilde{Y}\_2(\mathbf{x})$, but rather with $Y\_2(T\_1^{-1}(\mathbf{x})) = \tilde{Y}\_2(T\_2(T\_1^{-1}(\mathbf{x})))$. To overcome this difficulty, one seeks estimators $\widehat{T}\_i$ such that

$$\widehat{Y}\_i(\mathbf{x}) = \tilde{Y}\_i(\widehat{T}\_i(\mathbf{x})) = Y\_i(T\_i^{-1}(\widehat{T}\_i(\mathbf{x})))$$

is approximately $Y\_i(\mathbf{x})$. In other words, one tries to align the curves in the sample so that they have a common time scale. Such a procedure is called *curve registration*. Once registration has been carried out, one proceeds with the analysis of the $\widehat{Y}\_i(\mathbf{x})$, assuming that only amplitude variation is now present: one estimates the mean $m$ by

$$
\widehat{m}(\mathbf{x}) = \frac{1}{n} \sum\_{i=1}^{n} \widehat{Y}\_i(\mathbf{x})
$$

and the covariance $\kappa$ by its analogous counterpart. Put differently, registering the curves amounts to *separating the two types of variation*. This step is crucial regardless of whether the warp functions are considered a nuisance or whether an analysis of the warp functions themselves is of interest in the particular application.

There is an obvious identifiability problem in the model $\tilde{Y} = Y \circ T^{-1}$. If $S$ is any (deterministic) invertible function, then the model with $(Y, T)$ is statistically indistinguishable from the model with $(Y \circ S, T \circ S)$. It is therefore often assumed that $\mathbb{E}T = \mathbf{i}$, the identity map, and in addition, in nearly all applications, that $T$ is monotonically increasing (if $d = 1$).

*Discretely observed data*. One cannot measure the height of a person at every single instant of her life. In other words, it is rare in practice that one has access to the entire curve. A far more common situation is that one observes the curves *discretely*, i.e., at a finite number of points. The conceptually simplest setting is that one has access to a grid $x\_1,\dots,x\_J \in K$, and the data come in the form

$$
\tilde{y}\_{ij} = \tilde{Y}\_i(x\_j),
$$

possibly with measurement error. The problem is to find, given the $\tilde{y}\_{ij}$, consistent estimators of $T\_i$ and of the original, aligned functions $Y\_i$.

In the bibliographic notes, we review some methods for carrying out this separation of amplitude and phase variation. It is fair to say that no single registration method arises as the canonical solution to the functional registration problem. Indeed, most need to make additional structural and/or smoothness assumptions on the warp maps, further to the basic identifiability conditions requiring that *T* be increasing and that E*T* equal the identity. We will eventually see that, in contrast, the case of point processes (viewed as discretely observed random measures) admits a canonical framework, without needing additional assumptions.
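As a deliberately simplistic illustration of such structural assumptions, the sketch below registers periodic curves under the hypothetical model $T\_i(x) = x + b\_i$ of pure circular shifts, estimating each shift by circular cross-correlation against the first curve; the recovered common time scale is that of the template, reflecting the identifiability discussion above. Real registration methods handle far more general warps.

```python
import numpy as np

# Minimal registration sketch under a strong structural assumption: each
# warp is a pure (circular) shift, T_i(x) = x + b_i, of a periodic curve
# on [0, 1].  Shifts are estimated by maximising the circular
# cross-correlation with the first curve, which fixes the common time
# scale (warps are only identifiable up to a common re-parametrisation).
rng = np.random.default_rng(5)
n, J = 10, 256
x = np.arange(J) / J
b = rng.uniform(-0.1, 0.1, size=n)                       # unobserved shifts
Y_tilde = np.sin(2 * np.pi * (x[None, :] + b[:, None]))  # misaligned curves

template = Y_tilde[0]
Y_hat = np.empty_like(Y_tilde)
for i in range(n):
    # circular cross-correlation over all candidate grid lags
    corr = [np.dot(np.roll(Y_tilde[i], k), template) for k in range(J)]
    k_star = int(np.argmax(corr))                        # best circular lag
    Y_hat[i] = np.roll(Y_tilde[i], k_star)               # aligned curve

# After alignment, all curves agree with the template up to grid error.
print(np.max(np.abs(Y_hat - template)))
```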

#### *4.1.2 The Point Process Case*

A point process is the mathematical object that represents the intuitive notion of a random collection of points in a space $\mathcal{X}$. It is formally defined as a measurable map $\Pi$ from a generic probability space into the space of (possibly infinite) Borel integer-valued measures on $\mathcal{X}$, in such a way that $\Pi(B)$ is a measurable real-valued random variable for all Borel subsets $B$ of $\mathcal{X}$. The quantity $\Pi(B)$ represents the random number of points observed in the set $B$. Among the plethora of books on point processes, let us mention Daley and Vere-Jones [41] and Karr [79]. Kallenberg [75] treats more general objects, *random measures*, of which point processes are a special case. We will assume for convenience that $\Pi$ is a measure on a compact subset $K \subset \mathbb{R}^d$.

Amplitude variation of Π can be understood in analogy with the functional case. One defines the mean measure

$$\lambda(A) = \mathbb{E}[\Pi(A)], \qquad A \subset K \text{ Borel}$$

and, provided that $\mathbb{E}[\Pi(K)^2] < \infty$, the covariance measure

$$\kappa(A,B) = \text{cov}[\varPi(A), \varPi(B)] = \mathbb{E}[\varPi(A)\varPi(B)] - \lambda(A)\lambda(B),$$

the latter being a finite signed Borel measure on $K$. Just like in the functional case, these two objects encapsulate the (second-order) amplitude variation<sup>3</sup> properties of the law of $\Pi$.

Given a sample Π1,...,Π*<sup>n</sup>* of independent point processes distributed as Π, the natural estimators

$$
\widehat{\lambda}(A) = \frac{1}{n} \sum\_{i=1}^{n} \Pi\_i(A); \qquad \widehat{\kappa}(A,B) = \frac{1}{n} \sum\_{i=1}^{n} \Pi\_i(A)\Pi\_i(B) - \widehat{\lambda}(A)\widehat{\lambda}(B),
$$

are consistent and the former asymptotically normal [79, Proposition 4.8].

Phase variation then pertains to a random warp function $T : K \to K$ (independent of $\Pi$) that deforms $\Pi$: if we denote the points of $\Pi$ by $x\_1,\dots,x\_K$ (with $K$ random), then instead of $(x\_i)$, one observes $T(x\_1),\dots,T(x\_K)$. In symbols, this means that the data arise as $\tilde{\Pi} = T\#\Pi$. We refer to $\Pi$ as the *original point process*, and $\tilde{\Pi}$ as the *warped point process*. An example of 30 warped and unwarped point processes is shown in Fig. 4.3. The point patterns in both panels present a qualitatively similar structure: there are two peaks of high concentration of points, while few points appear between these peaks. The difference between the two panels is in the position and concentration of those peaks. In the left panel, only amplitude variation is present, and the location/concentration of the peaks is the same across all observations. In contrast, phase variation results in shifting the peaks to different places for each of the observations, while also smearing or sharpening them. Clearly, estimating the mean measure of a subset $A$ by averaging the number of observed points in $A$ is not satisfactory when carried out with the warped data. As in the functional case, it will only be consistent for the measure $\tilde{\lambda}$ defined by

$$
\tilde{\lambda}(A) = \mathbb{E}[\lambda(T^{-1}(A))], \qquad A \subseteq K,
$$

and $\tilde{\lambda} = \mathbb{E}[T\#\lambda]$ misses most (or at least a significant part) of the bimodal structure of $\lambda$ and is far more diffuse.
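This smearing is easy to reproduce. All distributional choices in the sketch below are assumptions made for illustration: a bimodal Poisson-type process is pushed forward by $T(x) = x - V\sin(2\pi x)/(2\pi)$ with $V$ symmetric, so that $T$ is increasing, fixes the endpoints, and $\mathbb{E}T = \mathbf{i}$.

```python
import numpy as np

# Simulated warped point processes on K = [0, 1] (all distributional
# choices are assumptions of this sketch): a bimodal Poisson process is
# pushed forward by T(x) = x - V sin(2 pi x)/(2 pi), with V symmetric, so
# that T is increasing, fixes the endpoints, and E[T] = id.  Averaging
# counts of warped points estimates the smeared measure E[T#lambda].
rng = np.random.default_rng(3)
n_proc, mean_count = 2000, 200

def sample_process(rng):
    """One realisation of a point process with two sharp peaks."""
    k = rng.poisson(mean_count)
    centers = rng.choice([0.3, 0.7], size=k)
    return np.clip(centers + 0.02 * rng.normal(size=k), 0.0, 1.0)

def warp(x, v):
    return x - v * np.sin(2 * np.pi * x) / (2 * np.pi)

A = (0.28, 0.32)                                 # window around a peak
counts_orig, counts_warp = [], []
for _ in range(n_proc):
    pts = sample_process(rng)
    wpts = warp(pts, rng.uniform(-0.9, 0.9))
    counts_orig.append(np.sum((pts > A[0]) & (pts < A[1])))
    counts_warp.append(np.sum((wpts > A[0]) & (wpts < A[1])))

# The peak mass in A is spread out by the phase variation.
print(np.mean(counts_orig), np.mean(counts_warp))
```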

<sup>3</sup> If the cumulative count process $\Gamma(t) = \Pi[0,t)$ is mean-square continuous, then the use of the term "amplitude variation" can be seen to remain natural, as $\Gamma(t)$ will admit a Karhunen–Loève expansion, with all stochasticity being attributable to the random amplitudes in the expansion.

Since $\Pi$ and $T$ are independent, the conditional expectation of $\tilde{\Pi}$ given $T$ is

$$\mathbb{E}[\widetilde{\Pi}(A)\,|\,T] = \mathbb{E}[\Pi(T^{-1}(A))\,|\,T] = \lambda(T^{-1}(A)) = [T\#\lambda](A).$$

Consequently, we refer to $\Lambda = T\#\lambda$ as the *conditional mean measure*. The problem of separation of amplitude and phase variation can now be stated as follows: on the basis of a sample $\tilde{\Pi}\_1,\dots,\tilde{\Pi}\_n$, find estimators of $(T\_i)$ and $(\Pi\_i)$. Registering the point processes amounts to constructing estimators, *registration maps* $\widehat{T}\_i^{-1}$, such that the aligned point processes

$$\widehat{\Pi}\_i = \widehat{T}\_i^{-1}\#\tilde{\Pi}\_i = [\widehat{T}\_i^{-1} \circ T\_i]\#\Pi\_i$$

are close to the original point processes $\Pi\_i$.
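A one-dimensional sketch of what a registration map does to points: between two sets of $k$ points on the line, the unique increasing bijection pairs order statistics (this pairing is also the optimal $W\_2$ coupling on the line). The point sets below are arbitrary.

```python
import numpy as np

# 1-d sketch: the unique increasing bijection between two point sets of
# equal size pairs their order statistics (also the optimal W_2 coupling
# on the line).  The point sets are arbitrary illustrative values.
a = np.array([0.9, 0.1, 0.5])                   # points of Pi
b = np.array([0.2, 0.7, 0.95])                  # points of T#Pi

ia, ib = np.argsort(a), np.argsort(b)
pairs = list(zip(a[ia], b[ib]))                 # increasing map a -> b
print(pairs)
```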

Fig. 4.3: Unwarped (left) and warped Poisson point processes

**Remark 4.1.1 (Poisson Processes)** *A special but important case is that of a Poisson process. Gaussian processes probably yield the most elegant and rich theory in functional data analysis, and so do Poisson processes when it comes to point processes. We say that $\Pi$ is a* Poisson process *when the following two conditions hold: (1) for any disjoint collection $(A\_1,\dots,A\_n)$ of Borel sets, the random variables $\Pi(A\_1),\dots,\Pi(A\_n)$ are independent; and (2) for every Borel $A \subseteq K$, $\Pi(A)$ follows a Poisson distribution with mean $\lambda(A)$:*

$$\mathbb{P}(\boldsymbol{\Pi}(\boldsymbol{A}) = k) = e^{-\lambda(\boldsymbol{A})} \frac{[\boldsymbol{\lambda}(\boldsymbol{A})]^k}{k!}.$$

*Conditional upon $T$, the random variables $\tilde{\Pi}(A\_k) = \Pi(T^{-1}(A\_k))$, $k = 1,\dots,n$, are independent, as the sets $(T^{-1}(A\_k))$ are disjoint, and $\tilde{\Pi}(A)$ follows a Poisson distribution with mean $\lambda(T^{-1}(A)) = \Lambda(A)$. This is precisely the definition of a* Cox process: *conditional upon the* driving measure $\Lambda$, $\tilde{\Pi}$ *is a Poisson process with mean measure $\Lambda$. For this reason, it is also called a* doubly stochastic process; *in our context, the phase variation is associated with the stochasticity of $\Lambda$, while the amplitude variation is associated with the Poisson variation conditional upon $\Lambda$.*

As in the functional case, there are problems with identifiability: the model $(\Pi, T)$ cannot be distinguished from the model $(S\#\Pi, T \circ S^{-1})$ for any invertible $S : K \to K$. It is thus natural to assume that $\mathbb{E}T$ is the identity map<sup>4</sup> (otherwise set $S = \mathbb{E}T$, i.e., replace $\Pi$ by $[\mathbb{E}T]\#\Pi$ and $T$ by $T \circ [\mathbb{E}T]^{-1}$).

Constraining $T$ to have mean identity is nevertheless not sufficient for the model $\tilde{\Pi} = T\#\Pi$ to be identifiable. The reason is that, given the two point sets $\Pi$ and $\tilde{\Pi}$, there are many functions that push forward $\Pi$ to $\tilde{\Pi}$. This ambiguity can be dealt with by assuming some sort of *regularity* or *parsimony* for $T$. For example, when $K = [a,b]$ is a subset of the real line, requiring $T$ to be monotonically increasing guarantees its uniqueness. In multiple dimensions, there is no obvious analogue of increasing functions. One possible definition is the monotonicity described in Sect. 1.7.2:

$$\langle T(\mathbf{y}) - T(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \ge 0, \qquad \mathbf{x}, \mathbf{y} \in K.$$

This property is rather weak, in a sense we describe now. Let $K \subseteq \mathbb{R}^2$ and write $\mathbf{y} \ge \mathbf{x}$ if and only if $y\_i \ge x\_i$ for $i = 1, 2$. It is natural to expect the deformations to maintain this coordinate-wise order on $\mathbb{R}^2$:

$$
\mathbf{y} \ge \mathbf{x} \quad \implies \quad T(\mathbf{y}) \ge T(\mathbf{x}).
$$

If we require in addition that the ordering must be preserved for all quadrants: for *z* = *T*(*x*) and *w* = *T*(*y*)

$$\{y\_1 \ge x\_1,\ y\_2 \le x\_2\} \qquad \implies \qquad \{w\_1 \ge z\_1,\ w\_2 \le z\_2\},$$

then monotonicity is automatically satisfied. In that sense, it is arguably not very restrictive.

Monotonicity is weaker than cyclical monotonicity (see (1.10) with $y\_i = T(x\_i)$), which is itself equivalent to the property of being the subgradient of a convex function. But if extra smoothness is present and $T$ is the gradient of some function $\varphi : K \to \mathbb{R}$, then $\varphi$ must be convex and $T$ is then cyclically monotone. Consequently, we will make the following assumptions:

• *Unbiasedness*: $\mathbb{E}T = \mathbf{i}$, the identity map.

• *Regularity*: $T$ is the gradient of a convex function (when $d = 1$, $T$ is monotonically increasing).


In the functional case, at least on the real line, these two conditions are imposed on the warp functions in virtually all applications, often accompanied by additional assumptions about the smoothness of $T$, its structural properties, or its distance from the identity. In the next section, we show how these two conditions alone lead to the Wasserstein geometry and open the door to consistent, fully nonparametric separation of the amplitude and phase variation.

<sup>4</sup> This can be defined as a Bochner integral in the space of bounded measurable maps $T : K \to K$.

#### **4.2 Wasserstein Geometry and Phase Variation**

#### *4.2.1 Equivariance Properties of the Wasserstein Distance*

A first hint of the relevance of the Wasserstein metrics of $W\_p(\mathcal{X})$ to deformations of the space $\mathcal{X}$ is that for all $p \ge 1$ and all $x, y \in \mathcal{X}$,

$$W\_p(\delta\_\mathbf{x}, \delta\_\mathbf{y}) = ||\mathbf{x} - \mathbf{y}||,$$

where $\delta\_x$ is, as usual, the Dirac measure at $x \in \mathcal{X}$. This is in contrast to metrics such as the bounded Lipschitz distance (which metrises weak convergence) or the total variation distance on $P(\mathcal{X})$. Recall that these are defined by

$$\|\mu - \nu\|\_{\mathrm{BL}} = \sup\_{\|\varphi\|\_{\mathrm{BL}} \leq 1} \left| \int\_{\mathcal{X}} \varphi \, \mathrm{d}\mu - \int\_{\mathcal{X}} \varphi \, \mathrm{d}\nu \right|; \qquad \|\mu - \nu\|\_{\mathrm{TV}} = \sup\_{A} |\mu(A) - \nu(A)|,$$

so that

$$\|\delta\_{\mathbf{x}} - \delta\_{\mathbf{y}}\|\_{\mathrm{BL}} = \min(1, \|\mathbf{x} - \mathbf{y}\|); \qquad \|\delta\_{\mathbf{x}} - \delta\_{\mathbf{y}}\|\_{\mathrm{TV}} = \begin{cases} 1 & \mathbf{x} \neq \mathbf{y} \\ 0 & \mathbf{x} = \mathbf{y}. \end{cases}$$

In words, the total variation metric "does not see the geometry" of the space *X* . This is less so for the bounded Lipschitz distance that does take small distances into account but not large ones.
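A quick one-dimensional check of this contrast: $W\_1$ between Dirac masses recovers the ground distance $|x - y|$, whereas total variation equals 1 for any two distinct Diracs, however close or far apart. (SciPy's `wasserstein_distance` computes $W\_1$ between empirical measures on the line.)

```python
import numpy as np
from scipy.stats import wasserstein_distance

# W_1 between Dirac masses grows linearly with the ground distance,
# while the total variation distance is 1 for any distinct pair.
for d in [0.1, 1.0, 10.0]:
    print(d, wasserstein_distance([0.0], [d]))  # W_1(delta_0, delta_d) = d
```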

Another property (shared by BL and TV) is equivariance with respect to translations. It is more convenient to state it using the probabilistic formalism of Sect. 1.2. Let $X \sim \mu$ and $Y \sim \nu$ be random elements in $\mathcal{X}$, let $a$ be a fixed point in $\mathcal{X}$, and set $X' = X + a$ and $Y' = Y + a$. Joint couplings $Z' = (X', Y')$ are precisely those of the form $(a, a) + Z$ for a joint coupling $Z = (X, Y)$. Thus

$$W\_p(\mu \* \delta\_a, \nu \* \delta\_a) = W\_p(X + a, Y + a) = W\_p(X, Y) = W\_p(\mu, \nu),$$

where δ*<sup>a</sup>* is a Dirac measure at *a* and ∗ denotes convolution.
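This translation equivariance can be verified numerically for empirical measures on the line (the case $p = 1$, which SciPy computes directly; the two samples below are arbitrary choices for the sketch).

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Numerical check of translation equivariance for empirical measures on
# the line (p = 1; the two samples are arbitrary choices for the sketch).
rng = np.random.default_rng(4)
xs = rng.normal(0.0, 1.0, size=500)             # sample from mu
ys = rng.gamma(2.0, 1.0, size=500)              # sample from nu
a = 7.3

d0 = wasserstein_distance(xs, ys)
d1 = wasserstein_distance(xs + a, ys + a)       # both measures translated
print(d0, d1)                                   # equal up to rounding
```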

This carries over to Fréchet means in an obvious way.

**Lemma 4.2.1 (Fréchet Means and Translations)** *Let $\Lambda$ be a random measure in $W\_2(\mathcal{X})$ with finite Fréchet functional and $a \in \mathcal{X}$. Then $\gamma$ is a Fréchet mean of $\Lambda$ if and only if $\gamma \* \delta\_a$ is a Fréchet mean of $\Lambda \* \delta\_a$.*

The result holds for other values of $p$, in the formulation sketched in the bibliographic notes of Chap. 2. In the quadratic case, one has a simple extension to the case where only one measure is translated. Denote the first moment (mean) of $\mu \in W\_1(\mathcal{X})$ by

$$m : W\_1(\mathcal{X}) \to \mathcal{X}, \qquad m(\mu) = \int\_{\mathcal{X}} x \, \mathrm{d}\mu(x).$$

When $\mathcal{X}$ is infinite-dimensional, this can be defined as the unique element $m \in \mathcal{X}$ satisfying

$$
\langle m, \mathbf{y} \rangle = \int\_{\mathcal{X}} \langle \mathbf{x}, \mathbf{y} \rangle \, \mathbf{d}\mu(\mathbf{x}), \qquad \mathbf{y} \in \mathcal{X}.
$$

By an equivalence of couplings similar to above, we obtain

$$W\_2^2(\mu \* \delta\_a, \nu) = W\_2^2(\mu, \nu) + \left\|a - \left[m(\nu) - m(\mu)\right]\right\|^2 - \left\|m(\nu) - m(\mu)\right\|^2,$$

which is minimised at $a = m(\nu) - m(\mu)$. This leads to the following conclusion:

**Proposition 4.2.2 (First Moment of Fréchet Mean)** *Let $\Lambda$ be a random measure in $W\_2(\mathcal{X})$ with finite Fréchet functional and Fréchet mean $\gamma$. Then*

$$\int\_{\mathcal{X}} x \, \mathrm{d}\gamma(x) = \mathbb{E}\left[\int\_{\mathcal{X}} x \, \mathrm{d}\Lambda(x)\right].$$
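The translation formula can be checked numerically on the line, where $W\_2^2$ between empirical measures with equally many atoms is the mean squared difference of order statistics; the sketch below verifies $W\_2^2(\mu \* \delta\_a, \nu) = W\_2^2(\mu,\nu) + \|a - [m(\nu)-m(\mu)]\|^2 - \|m(\nu)-m(\mu)\|^2$ for arbitrary samples.

```python
import numpy as np

# Check of the translation formula on the line: for empirical measures
# with equally many atoms, W_2^2 is the mean squared difference of order
# statistics, and translating mu by a changes W_2^2 by
# ||a - [m(nu) - m(mu)]||^2 - ||m(nu) - m(mu)||^2.
rng = np.random.default_rng(7)
xs = np.sort(rng.normal(0.0, 1.0, size=400))    # atoms of mu, sorted
ys = np.sort(rng.gamma(2.0, 1.0, size=400))     # atoms of nu, sorted
a = 1.7

w2sq = lambda u, v: np.mean((u - v) ** 2)       # squared W_2 of sorted atoms
delta = ys.mean() - xs.mean()                   # m(nu) - m(mu)

lhs = w2sq(xs + a, ys)
rhs = w2sq(xs, ys) + (a - delta) ** 2 - delta ** 2
print(lhs, rhs)
```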

#### *4.2.2 Canonicity of Wasserstein Distance in Measuring Phase Variation*

The purpose of this subsection is to show that the standard functional data analysis assumptions on the warp function $T$, having mean identity and being increasing, are equivalent to purely geometric conditions on $T$ and the conditional mean measure $\Lambda = T\#\lambda$. Put differently, if one is willing to assume that $\mathbb{E}T = \mathbf{i}$ and that $T$ is increasing, then one is led *unequivocally* to the problem of estimation of Fréchet means in the Wasserstein space $W\_2(\mathcal{X})$. When $\mathcal{X} \neq \mathbb{R}$, "increasing" is interpreted as being the gradient of a convex function, as explained at the end of Sect. 4.1.2.

The total mass $\lambda(\mathcal{X})$ is invariant under the push-forward operation, and when it is finite, we may assume without loss of generality that it is equal to one, because all the relevant quantities scale with the total mass. Indeed, if $\lambda = \tau\mu$ with $\mu$ a probability measure and $\tau > 0$, then $T\#\lambda = \tau \times T\#\mu$, and the Wasserstein distance (defined as the infimal integrated cost over couplings) between $\tau\mu$ and $\tau\nu$ is $\tau W\_p(\mu, \nu)$ for probability measures $\mu, \nu$.

We begin with the one-dimensional case, where the explicit formulae allow for a more transparent argument, and for simplicity we will assume some regularity.

**Assumptions 2** *The domain $K \subset \mathbb{R}$ is a nonempty compact convex set (an interval), and the continuous and injective random map $T : K \to \mathbb{R}$ (a random element in $C\_b(K)$) satisfies the following two conditions:*

*(A1)* Unbiasedness: *$\mathbb{E}[T(x)] = x$ for all $x \in K$.*

*(A2)* Regularity: *$T$ is monotone increasing.*

The relevance of the Wasserstein geometry to phase variation becomes clear in the following proposition that shows that Assumptions 2 are equivalent to geometric assumptions on the Wasserstein space *W*2(R).

**Proposition 4.2.3 (Mean Identity Warp Functions and Fréchet Means in $W\_2(\mathbb{R})$)** *Let $\emptyset \neq K \subset \mathbb{R}$ be compact and convex and $T : K \to \mathbb{R}$ continuous. Then Assumptions 2 hold if and only if, for any $\lambda \in W\_2(\mathbb{R})$ supported on $K$ such that $\mathbb{E}[W\_2^2(T\#\lambda, \lambda)] < \infty$, the following two conditions are satisfied:*

*(B1)* Unbiasedness: *for any $\theta \in W\_2(\mathbb{R})$,*

$$\mathbb{E}[W\_2^2(T\#\lambda,\lambda)] \le \mathbb{E}[W\_2^2(T\#\lambda,\theta)].$$

*(B2)* Regularity: *if $Q : K \to \mathbb{R}$ is such that $T\#\lambda = Q\#\lambda$, then, with probability one,*

$$\int\_{K} \left| T(x) - x \right|^{2} \mathrm{d}\lambda(x) \le \int\_{K} \left| Q(x) - x \right|^{2} \mathrm{d}\lambda(x).$$

These assumptions have a clear interpretation: (B1) stipulates that $\lambda$ is a Fréchet mean of the random measure $\Lambda = T\#\lambda$, while (B2) states that $T$ must be the optimal map from $\lambda$ to $\Lambda$, that is, $T = \mathbf{t}\_{\lambda}^{\Lambda}$.

*Proof.* If $T$ satisfies (B2) then, as an optimal map, it must be nondecreasing $\lambda$-almost surely. Since $\lambda$ is arbitrary, $T$ must be nondecreasing on the entire domain $K$. Conversely, if $T$ is nondecreasing, then it is optimal for any $\lambda$. Hence (A2) and (B2) are equivalent.

Assuming (A2), we now show that (A1) and (B1) are equivalent. Condition (B1) is equivalent to the assertion that for all $\theta \in W\_2(\mathbb{R})$,

$$\mathbb{E}\|F\_{T\#\lambda}^{-1} - F\_{\lambda}^{-1}\|\_{L\_2(0,1)}^2 = \mathbb{E}[W\_2^2(T\#\lambda,\lambda)] \le \mathbb{E}[W\_2^2(T\#\lambda,\theta)] = \mathbb{E}\|F\_{T\#\lambda}^{-1} - F\_{\theta}^{-1}\|\_{L\_2(0,1)}^2,$$

which is in turn equivalent to $\mathbb{E}[F_{T\#\lambda}^{-1}] = \mathbb{E}[F_\Lambda^{-1}] = F_\lambda^{-1}$ (see Sect. 3.1.4). Condition (A2) and the assumptions on $T$ imply that $F_\Lambda(x) = F_\lambda(T^{-1}(x))$. Suppose that $F_\lambda$ is invertible (i.e., continuous and strictly increasing on $K$). Then $F_\Lambda^{-1}(u) = T(F_\lambda^{-1}(u))$. Thus (B1) is equivalent to $\mathbb{E}T(x) = x$ for all $x$ in the range of $F_\lambda^{-1}$, which is $K$. The assertion that (A1) implies (B1), even if $F_\lambda$ is not invertible, is proven in the next theorem (Theorem 4.2.4) in a more general context.
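As a numerical illustration (not part of the formal development), the identity $\mathbb{E}[F_\Lambda^{-1}] = F_\lambda^{-1}$ under mean-identity warps can be checked in a minimal sketch. The setup below is hypothetical: λ is Uniform(0,1), and the random warp is $T_a(x) = ax$ with $a \in \{0.5, 1.5\}$ equally likely, so that $\mathbb{E}T(x) = x$ and each $T_a$ is nondecreasing.

```python
import numpy as np

# Hypothetical setup: lambda = Uniform(0,1), whose quantile function is
# F_lambda^{-1}(u) = u.  Random warp T_a(x) = a*x with a in {0.5, 1.5}
# equally likely, so E[T(x)] = x (mean identity, (A1)) and each T_a is
# nondecreasing ((A2)).  Then F_{T#lambda}^{-1}(u) = T_a(F_lambda^{-1}(u)),
# and averaging the warped quantile functions recovers F_lambda^{-1}.
u = np.linspace(0.01, 0.99, 99)              # grid on (0,1)
q_lambda = u                                  # quantile function of Uniform(0,1)
warps = [lambda x: 0.5 * x, lambda x: 1.5 * x]
q_mean = np.mean([T(q_lambda) for T in warps], axis=0)
assert np.allclose(q_mean, q_lambda)          # E[F_Lambda^{-1}] = F_lambda^{-1}
```

The check is exact here because the warp family is linear; for general mean-identity warps the identity holds pointwise in $u$ by the same averaging argument.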

The situation in more than one dimension is similar, but the proof is less transparent. To avoid compactness assumptions, we introduce the following power growth condition (taken from Agueh and Carlier [2]) for continuous functions that grow like $\|\cdot\|^q$ ($q \ge 0$):

$$\mathcal{G}_q(\mathcal{X}) = (1 + \|\cdot\|^q)C_b(\mathcal{X}) = \left\{ f : \mathcal{X} \to \mathbb{R} \text{ continuous} : \sup_{x \in \mathcal{X}} \frac{|f(x)|}{1 + \|x\|^q} < \infty \right\}$$

with the norm $\|f\|_{\mathcal{G}_q} = \sup_x |f(x)|/(1 + \|x\|^q) = \|f/(1 + \|\cdot\|^q)\|_\infty$. The space $\mathcal{G}_q(\mathcal{X},\mathcal{X})$ is defined similarly, with $f$ taking values in $\mathcal{X}$ instead of $\mathbb{R}$, and the norm will be denoted in the same way. These are nonseparable Banach spaces.

**Theorem 4.2.4 (Mean Identity Warp Functions and Fréchet Means)** *Fix* $\lambda \in P(\mathcal{X})$ *and let* $\mathbf{t} \in \mathcal{G}_1(\mathcal{X},\mathcal{X})$ *be a (Bochner measurable) random optimal map with (Bochner) mean identity and such that* $\mathbb{E}\|\mathbf{t}\|_{\mathcal{G}_1} < \infty$*. Then* $\Lambda = \mathbf{t}\#\lambda$ *has Fréchet mean* λ*:*

$$\mathbb{E}[W_2^2(\lambda, \Lambda)] \le \mathbb{E}[W_2^2(\theta, \Lambda)] \qquad \forall \theta \in \mathcal{W}_2(\mathcal{X}).$$

The generalisation with respect to the one-dimensional result is threefold. Firstly, since our main interest is the implication (A1–A2) ⇒ (B1–B2), we need not assume $T$ to be injective. Secondly, the support of λ is not required to be compact. Lastly, the result holds in arbitrary dimension, including infinite-dimensional separable Hilbert spaces $\mathcal{X}$. In particular, if $\mathbf{t}$ is a linear map, then $\|\mathbf{t}\|_{\mathcal{G}_1}$ coincides with the operator norm of $\mathbf{t}$, so the assumption is that $\mathbf{t}$ be a bounded self-adjoint nonnegative operator with mean identity and finite expected operator norm.

*Proof.* Optimality of **t** ensures that it has a convex potential φ, and strong and weak duality give

$$\begin{split} W_2^2(\lambda, \Lambda) &= \int_{\mathcal{X}} \left( \frac{1}{2}\|x\|^2 - \phi(x) \right) \mathrm{d}\lambda(x) + \int_{\mathcal{X}} \left( \frac{1}{2}\|y\|^2 - \phi^*(y) \right) \mathrm{d}\Lambda(y);\\ W_2^2(\theta, \Lambda) &\ge \int_{\mathcal{X}} \left( \frac{1}{2}\|x\|^2 - \phi(x) \right) \mathrm{d}\theta(x) + \int_{\mathcal{X}} \left( \frac{1}{2}\|y\|^2 - \phi^*(y) \right) \mathrm{d}\Lambda(y). \end{split}$$

Formally taking expectations, using Fubini's theorem and the fact that $\mathbb{E}\phi = \|\cdot\|^2/2$ (since $\mathbb{E}\mathbf{t}$ is the identity), yields

$$\mathbb{E}[W_2^2(\theta,\Lambda)] \ge \int_{\mathcal{X}} \left(\frac{1}{2}\|x\|^2 - \mathbb{E}\phi(x)\right) \mathrm{d}\theta(x) + \mathbb{E}\left[\int_{\mathcal{X}} \left(\frac{1}{2}\|y\|^2 - \phi^*(y)\right) \mathrm{d}\Lambda(y)\right] = \mathbb{E}[W_2^2(\lambda,\Lambda)]$$

as required. The rigorous mathematical justification for this is given on page 88 in the supplement.

**Remark 4.2.5** *The "natural" space for* $\mathbf{t}$ *would be* $L_2(\lambda)$*, but without the continuity assumption, the result may fail (Álvarez-Esteban et al. [9, Example 3.1]). A simple argument shows that the growth condition imposed by the* $\mathcal{G}_1$ *assumption is minimal; see page 89 in the supplement or Galasso et al. [58].*

**Remark 4.2.6** *The same statement holds if* $\mathcal{X}$ *is replaced by a (Borel) convex subset K thereof. The integrals will then be taken on K, showing that* λ *minimises the Fréchet functional among measures supported on K and, by continuity, on its closure. By Proposition 3.2.4,* λ *is a Fréchet mean.*

#### **4.3 Estimation of Fréchet Means**

#### *4.3.1 Oracle Case*

In view of the canonicity of the Wasserstein geometry in Sect. 4.2.2, separation of amplitude and phase variation of the point processes $\Pi_i$ essentially requires computing Fréchet means in the 2-Wasserstein space. It is both conceptually important and technically convenient to introduce the case where an oracle reveals the conditional mean measures $\Lambda = T\#\lambda$ entirely. Thus, assuming that $\lambda \in W_2(\mathcal{X})$ is the unique Fréchet mean of a random measure Λ, the goal is to estimate the structural mean λ on the basis of independent and identically distributed realisations $\Lambda_1,\dots,\Lambda_n$ of Λ.

Given that λ is defined as the minimiser of the Fréchet functional

$$F(\gamma) = \frac{1}{2}\mathbb{E} W_2^2(\Lambda, \gamma), \qquad \gamma \in \mathcal{W}_2(\mathcal{X}),$$

it is natural to estimate λ by a minimiser, say $\lambda_n$, of the empirical Fréchet functional

$$F_n(\gamma) = \frac{1}{2n}\sum_{i=1}^n W_2^2(\Lambda_i, \gamma), \qquad \gamma \in \mathcal{W}_2(\mathcal{X}).$$

A minimiser $\lambda_n$ exists by Corollary 3.1.3. When $\mathcal{X} = \mathbb{R}$, $\lambda_n$ can be seen to be an *unbiased* estimator of λ in a generalised sense of Lehmann [88] (see Sect. 4.3.5).

The warp maps (and their inverses) can then be estimated as the optimal maps from λ*<sup>n</sup>* to each Λ*<sup>i</sup>* (see Sect. 4.3.4).
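Since $W_2^2$ on the real line equals the squared $L_2(0,1)$ distance between quantile functions (Sect. 3.1.4), the minimiser $\lambda_n$ has quantile function equal to the average of the sample quantile functions. The following sketch illustrates this in the oracle case; the sample of Gaussian measures with random location and scale is a hypothetical choice.

```python
import numpy as np
from scipy.stats import norm

# Sketch (one dimension): W_2^2(mu, nu) = ||F_mu^{-1} - F_nu^{-1}||^2_{L_2(0,1)},
# so the minimiser lambda_n of F_n is the measure whose quantile function is
# the pointwise average of the sample quantile functions.
rng = np.random.default_rng(0)
u = np.linspace(0.005, 0.995, 199)
# hypothetical oracle sample: Gaussian measures with random location/scale
qs = np.array([norm.ppf(u, loc=rng.normal(), scale=rng.uniform(0.5, 2))
               for _ in range(20)])
q_bar = qs.mean(axis=0)                       # quantile function of lambda_n

def W2sq(q1, q2):                             # discretised squared W_2 distance
    return np.mean((q1 - q2) ** 2)

F_n = lambda q: 0.5 * np.mean([W2sq(qi, q) for qi in qs])
# lambda_n attains a smaller empirical Frechet functional than any sample point
assert all(F_n(q_bar) <= F_n(qi) for qi in qs)
```

The inequality holds because averaging minimises the mean squared $L_2$ distance pointwise in $u$, exactly the mechanism behind the quantile formula of Sect. 4.3.3 below.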

#### *4.3.2 Discretely Observed Measures*

In practice, one does not have the fortune of fully observing the inherently infinite-dimensional objects $\Lambda_1,\dots,\Lambda_n$. A far more realistic scenario is that one only has access to a discrete version of $\Lambda_i$, say $\widetilde{\Lambda}_i$. The simplest situation is when $\widetilde{\Lambda}_i$ arises as an empirical measure of the form $\tau^{-1}\sum_{j=1}^{\tau} \delta\{Y_j\}$, where the $Y_j$ are independent with distribution $\Lambda_i$. More generally, $\widetilde{\Lambda}_i$ can be a normalised point process $\widetilde{\Pi}_i$ with mean measure $\tau\Lambda_i$, i.e.

$$
\widetilde{\Lambda}_i = \frac{1}{\widetilde{\Pi}_i(\mathcal{X})}\widetilde{\Pi}_i \quad \text{with} \quad \mathbb{E}[\widetilde{\Pi}_i(A)\,|\,\Lambda_i] = \tau\Lambda_i(A), \qquad A \subseteq \mathcal{X} \text{ Borel}.
$$

This encapsulates the case of an empirical measure when τ is an integer and $\widetilde{\Pi}_i$ is a *binomial point process*. The parameter τ is the expected number of observed points over the entire space $\mathcal{X}$; the larger τ is, the more information $\widetilde{\Pi}_i$ gives on $\Lambda_i$.
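A small numerical sketch of this regime (with a hypothetical one-dimensional choice $\Lambda_i$ = Uniform(0,1) and a binomial process of τ points) shows the empirical proxy approaching $\Lambda_i$ in $W_2$ as τ grows:

```python
import numpy as np

# Sketch of the discretely observed regime: Lambda_i = Uniform(0,1) and
# Pi_i a binomial process of tau points, so that tilde-Lambda_i is the
# empirical measure of tau draws.  The larger tau, the closer tilde-Lambda_i
# is to Lambda_i in W_2 (computed here via quantile functions).
rng = np.random.default_rng(3)

def W2sq_to_uniform(tau):
    y = np.sort(rng.uniform(size=tau))        # empirical quantile function
    u = (np.arange(tau) + 0.5) / tau          # quantile levels
    return np.mean((y - u) ** 2)              # discretised squared W_2

d_small, d_large = W2sq_to_uniform(50), W2sq_to_uniform(5000)
assert d_large < d_small                      # more points, better proxy
```

The squared distance decays at the parametric rate $O(1/\tau)$ in this one-dimensional example.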

If $\widetilde{\Lambda}_i$ is not an empirical measure, there is one difficulty in the above setting that needs to be addressed. Unless $\widetilde{\Pi}_i$ is binomial, there is a positive probability that $\widetilde{\Pi}_i(\mathcal{X}) = 0$ and no points pertaining to $\Lambda_i$ are observed. In the asymptotic setup below, conditions will be imposed to ensure that this probability becomes negligible as $n \to \infty$. For concreteness, we define $\widetilde{\Lambda}_i = \lambda^{(0)}$ in that case, for some fixed measure $\lambda^{(0)}$ that will be of minor importance. This can be a Dirac measure at 0, a certain fixed Gaussian measure, or (normalised) Lebesgue measure on some bounded set in case $\mathcal{X} = \mathbb{R}^d$. We can now replace the estimator $\lambda_n$ by $\widetilde{\lambda}_n$, defined as any minimiser of

$$
\widetilde{F}_n(\gamma) = \frac{1}{2n}\sum_{i=1}^n W_2^2(\widetilde{\Lambda}_i, \gamma), \qquad \gamma \in \mathcal{W}_2(\mathcal{X}),
$$

which exists by Corollary 3.1.3.

As a generalisation of the discrete case discussed in Sect. 1.3, the Fréchet mean of discrete measures can be computed exactly. Suppose that $N_i = \widetilde{\Pi}_i(\mathcal{X})$ is nonzero for all $i$. Then each $\widetilde{\Lambda}_i$ is a discrete measure supported on $N_i$ points. One can then recast the multimarginal formulation (see Sect. 3.1.2) as a finite linear program, solve it, and "average" the solution as in Proposition 3.1.2 in order to obtain $\widetilde{\lambda}_n$ (an alternative linear programming formulation for finding a Fréchet mean is given by Anderes et al. [14]). Thus, $\widetilde{\lambda}_n$ can be computed in finite time, even when $\mathcal{X}$ is infinite-dimensional.

Finally, a remark about measurability is in order. Point processes can be viewed as random elements in $M_+(\mathcal{X})$ endowed with the *vague topology* induced from convergence of integrals of continuous functions with compact support. If $\mu_n$ converge to μ vaguely, and $a_n$ are numbers that converge to $a$, then $a_n\mu_n \to a\mu$ vaguely. Thus, $\widetilde{\Lambda}_i$ is a continuous function of the pair $(\widetilde{\Pi}_i, \widetilde{\Pi}_i(\mathcal{X}))$ and can be viewed as a random measure with respect to the vague topology. The restriction of the vague topology to probability measures is equivalent to the weak topology,<sup>5</sup> and therefore vague, weak, and Wasserstein measurability are all equivalent.

#### *4.3.3 Smoothing*

Even when the computational complexity involved in calculating $\widetilde{\lambda}_n$ is tractable, there is another reason not to use it as an estimator for λ. If one has a priori knowledge that λ is smooth, it is often desirable to estimate it by a smooth measure. One way to achieve this would be to apply some smoothing technique to $\widetilde{\lambda}_n$ using, e.g., kernel density estimation. However, unless the number of observed points from each measure is the same, $N_1 = \dots = N_n = N$, $\widetilde{\lambda}_n$ will usually be concentrated on many points, essentially $N_1 + \dots + N_n$ of them. In other words, the Fréchet mean is concentrated on many more points than each of the measures $\widetilde{\Lambda}_i$, thus potentially hindering its usefulness as a mean, because it will not be a representative of the sample.

This is most easily seen when $\mathcal{X} = \mathbb{R}$, in which case each $\widetilde{\Lambda}_i$ is a discrete uniform measure on points $x_1^i < x_2^i < \dots < x_{N_i}^i$, where we assume for simplicity that the points are not repeated (this will happen with probability one if $\Lambda_i$ is diffuse). If we now set $G_i$ to be the distribution function of $\widetilde{\Lambda}_i$, then the quantile function $G_i^{-1}$ is piecewise constant on each interval $((k-1)/N_i, k/N_i]$ with jumps at

$$G_i^{-1}(k/N_i) = x_k^i, \qquad k = 1, 2, \dots, N_i.$$

<sup>5</sup> In finite-dimensional (or, more generally, locally compact metric) spaces. If $\mathcal{X}$ is an infinite-dimensional Hilbert space, the vague topology is trivial. This is stated and proved as Lemma 5 on page 27 in the supplement.

The Fréchet mean has quantile function $G^{-1}(u) = n^{-1}\sum G_i^{-1}(u)$ and will have jumps at every point of the form $k/N_i$ for $k \le N_i$ and $i = 1,\dots,n$. In the worst-case scenario, when no two of the $N_i$ have a common divisor, there will be

$$\left(\sum\_{i=1}^{n} N\_i - 1\right) + 1 = \left(\sum\_{i=1}^{n} N\_i\right) - n + 1$$

jumps for $G^{-1}$, which is the number of points on which the Fréchet mean will be supported. (All the $G_i^{-1}$'s have a jump at one, which thus needs to be counted once rather than $n$ times.)

By counting the number of redundancies in the constraints matrix of the linear program, one can show that this is in general an upper bound on the number of support points of the Fréchet mean.
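The jump count above can be verified directly in a short sketch: each discrete uniform measure on $N_i$ points has quantile jumps at $k/N_i$, and when the $N_i$ are pairwise coprime the union of these jump points has $(\sum N_i) - n + 1$ elements, since only the jump at one is shared. The sample sizes below are a hypothetical example.

```python
from fractions import Fraction

# Sketch of the worst-case support-point count: quantile jumps of the i-th
# measure sit at k/N_i, k = 1,...,N_i; the averaged quantile function jumps
# at the union of all these points.
N = [2, 3, 5]                      # pairwise coprime sizes (hypothetical)
jumps = {Fraction(k, Ni) for Ni in N for k in range(1, Ni + 1)}
assert len(jumps) == sum(N) - len(N) + 1    # (2+3+5) - 3 + 1 = 8 points
```

Exact rational arithmetic (`Fraction`) is used so that coincidences such as $1/2 = 2/4$ would be detected if the $N_i$ shared divisors.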

An alternative approach is to first smooth each observation $\widetilde{\Lambda}_i$ and then calculate the Fréchet mean. Since it is easy to bound the Wasserstein distances when dealing with convolutions, we will employ kernel density estimation, although other smoothing approaches could be used as well.

To simplify the exposition, we provide the technical details only when $\mathcal{X} = \mathbb{R}^d$, but a similar construction will work when the dimension of $\mathcal{X}$ is infinite. Let $\psi : \mathbb{R}^d \to (0,\infty)$ be a continuous, bounded, strictly positive isotropic density function with unit variance: $\psi(x) = \psi_1(\|x\|)$ with $\psi_1$ nonincreasing and

$$\int_{\mathbb{R}^d} \|x\|^2 \psi(x)\,\mathrm{d}x = 1 = \int_{\mathbb{R}^d} \psi(x)\,\mathrm{d}x.$$

(Besides the boundedness, all these properties can be relaxed, and if $\mathcal{X} = \mathbb{R}$, even boundedness is not necessary.) A classical example for ψ is the standard Gaussian density in $\mathbb{R}^d$. Define the rescaled version $\psi_\sigma(x) = \sigma^{-d}\psi(x/\sigma)$ for all $\sigma > 0$. We can then replace $\widetilde{\Lambda}_i$ by a smooth proxy $\widetilde{\Lambda}_i * \psi_\sigma$. If $\widetilde{\Lambda}_i$ is a sum of Dirac masses at $x_1,\dots,x_{N_i}$, then

$$
\widetilde{\Lambda}_i * \psi_\sigma \quad \text{has density} \quad g(x) = \frac{1}{N_i}\sum_{j=1}^{N_i} \psi_\sigma(x - x_j).
$$

If $N_i = 0$, one can either use $\lambda^{(0)}$ or $\lambda^{(0)} * \psi_\sigma$; this event will have negligible probability anyway.

For the purpose of approximating $\widetilde{\Lambda}_i$, this convolution is an acceptable estimator because, as was seen in the proof of Theorem 2.2.7,

$$W_2^2(\widetilde{\Lambda}_i, \widetilde{\Lambda}_i * \psi_\sigma) \le \sigma^2.$$

But the measure $\widetilde{\Lambda}_i * \psi_\sigma$ has a strictly positive density throughout $\mathbb{R}^d$. If we know that Λ is supported on a convex compact $K \subset \mathbb{R}^d$, it is desirable to construct an estimator that has the same support $K$. The first idea that comes to mind is to project $\widetilde{\Lambda}_i * \psi_\sigma$ to $K$ (see Proposition 3.2.4), as this will further decrease the Wasserstein distance, but the resulting measure will then have positive mass on the boundary of $K$ and will not be absolutely continuous. We will therefore use a different strategy: eliminate all the mass outside $K$ and redistribute it on $K$. The simplest way to do this is to restrict $\widetilde{\Lambda}_i * \psi_\sigma$ to $K$ and renormalise the restriction to be a probability measure. For technical reasons, it will be more convenient to bound the Wasserstein distance when the restriction and renormalisation are done separately for each point of $\widetilde{\Lambda}_i$. This yields the measure

$$\widehat{\Lambda}_i = \frac{1}{N_i}\sum_{j=1}^{N_i} \frac{\left.\left[\delta\{x_j\} * \psi_\sigma\right]\right|_K}{\left[\delta\{x_j\} * \psi_\sigma\right](K)}.\tag{4.2}$$

Lemma 4.4.2 below shows that $W_2^2(\widetilde{\Lambda}_i, \widehat{\Lambda}_i) \le C\sigma^2$ for some finite constant $C$. It is apparent that $\widehat{\Lambda}_i$ is a continuous function of $\widetilde{\Lambda}_i$ and σ, so $\widehat{\Lambda}_i$ is measurable; in any case, this is not particularly important because σ will vanish, so $\widehat{\Lambda}_i = \widetilde{\Lambda}_i$ asymptotically, and the latter is measurable.
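The convolution bound $W_2^2(\widetilde{\Lambda}_i, \widetilde{\Lambda}_i * \psi_\sigma) \le \sigma^2$ from Theorem 2.2.7 can be checked numerically in one dimension, where $W_2$ is computable through quantile functions. The atoms and bandwidth below are hypothetical choices; the kernel is the standard Gaussian.

```python
import numpy as np
from scipy.stats import norm

# Numerical sketch of the smoothing bound in one dimension: a discrete
# uniform measure on a few atoms is convolved with the N(0, sigma^2) kernel,
# and W_2 is computed through quantile functions on a grid.
atoms = np.array([0.0, 1.0, 3.0])             # hypothetical observed points
sigma = 0.2
u = np.linspace(1e-4, 1 - 1e-4, 20000)

# quantile function of the discrete measure (uniform weights)
q_disc = atoms[np.minimum((u * len(atoms)).astype(int), len(atoms) - 1)]

# quantile function of the smoothed measure, by inverting its CDF on a grid
g = np.linspace(atoms.min() - 5 * sigma, atoms.max() + 5 * sigma, 20000)
cdf = norm.cdf((g[:, None] - atoms) / sigma).mean(axis=1)
q_smooth = np.interp(u, cdf, g)

W2sq = np.mean((q_disc - q_smooth) ** 2)      # discretised L_2(0,1) norm
assert 0 < W2sq <= sigma ** 2 * 1.01          # the bound of Theorem 2.2.7
```

With well-separated atoms the distance is essentially $\sigma^2$ itself, the cost of moving each Gaussian bump back onto its atom, so the bound is sharp in this regime.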

Our final estimator $\widehat{\lambda}_n$ for λ is defined as the minimiser of

$$
\widehat{F}_n(\gamma) = \frac{1}{2n}\sum_{i=1}^n W_2^2(\widehat{\Lambda}_i, \gamma), \qquad \gamma \in \mathcal{W}_2(K).
$$

Since the measures $\widehat{\Lambda}_i$ are absolutely continuous, $\widehat{\lambda}_n$ is unique. We refer to $\widehat{\lambda}_n$ as the *regularised Fréchet–Wasserstein estimator*, where the regularisation comes from the smoothing and the possible restriction to $K$.

In the case $\mathcal{X} = \mathbb{R}$, $\widehat{\lambda}_n$ can be constructed via averaging of quantile functions. Let $\widehat{G}_i$ be the distribution function of $\widehat{\Lambda}_i$. Then $\widehat{\lambda}_n$ is the measure with quantile function

$$F_{\widehat{\lambda}_n}^{-1}(u) = \frac{1}{n}\sum_{i=1}^n \widehat{G}_i^{-1}(u), \qquad u \in (0,1),$$

and distribution function

$$F_{\widehat{\lambda}_n}(x) = [F_{\widehat{\lambda}_n}^{-1}]^{-1}(x).$$

By construction, the $\widehat{G}_i$ are continuous and strictly increasing, so the inverses are proper inverses, and one does not need to use the right-continuous inverse as in Sect. 3.1.4.

If $\mathcal{X} = \mathbb{R}^d$ and $d \ge 2$, then there is no explicit expression for $\widehat{\lambda}_n$, although it exists and is unique. In the next chapter, we present a steepest descent algorithm that approximately constructs $\widehat{\lambda}_n$ by taking advantage of the differentiability properties of the Fréchet functional $\widehat{F}_n$ in Sect. 3.1.6.

#### *4.3.4 Estimation of Warpings and Registration Maps*

Once estimators $\widehat{\Lambda}_i$, $i = 1,\dots,n$, and $\widehat{\lambda}_n$ are constructed, it is natural to estimate the map $T_i = \mathbf{t}_\lambda^{\Lambda_i}$ and its inverse $T_i^{-1} = \mathbf{t}_{\Lambda_i}^\lambda$ (when the $\Lambda_i$ are absolutely continuous; see the discussion after Assumptions 3 below) by the plug-in estimators

$$
\widehat{T_i} = \mathbf{t}_{\widehat{\lambda}_n}^{\widehat{\Lambda}_i}, \qquad \widehat{T_i^{-1}} = (\widehat{T_i})^{-1} = \mathbf{t}_{\widehat{\Lambda}_i}^{\widehat{\lambda}_n}.
$$

The latter, the registration maps, can then be used in order to register the points of $\widetilde{\Pi}_i$ via

$$\widehat{\Pi_i^{(n)}} = \widehat{T_i^{-1}} \# \widetilde{\Pi}_i^{(n)} = \left[\widehat{T_i^{-1}} \circ T_i\right] \# \Pi_i^{(n)}.$$

It is thus reasonable to expect that if $\widehat{T_i^{-1}}$ is a good estimator, then its composition with $T_i$ should be close to the identity, and $\widehat{\Pi}_i$ should be close to $\Pi_i$.
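In one dimension the plug-in construction is explicit: the optimal map from μ to ν is $F_\nu^{-1} \circ F_\mu$ (Sect. 3.1.4), so the warp estimator composes an estimated quantile function with a distribution function. The following sketch uses a hypothetical warp $T(x) = x^2$ on μ = Uniform(0,1) and estimates the quantile function of $\nu = T\#\mu$ from a sample, mimicking $\widehat{\Lambda}_i$:

```python
import numpy as np

# One-dimensional plug-in sketch: the optimal map from mu to nu is
# t_mu^nu = F_nu^{-1} o F_mu.  Here mu = Uniform(0,1) and nu = T#mu with
# the (nondecreasing, hypothetical) warp T(x) = x^2; the quantile function
# of nu is estimated from a sample.
rng = np.random.default_rng(4)
sample_nu = np.sort(rng.uniform(size=20000) ** 2)     # draws from nu = T#mu
x = np.linspace(0.05, 0.95, 19)
F_mu = x                                               # CDF of Uniform(0,1)
# empirical quantile of nu evaluated at F_mu(x): the plug-in map T-hat
u_grid = (np.arange(len(sample_nu)) + 0.5) / len(sample_nu)
T_hat = np.interp(F_mu, u_grid, sample_nu)
assert np.max(np.abs(T_hat - x ** 2)) < 0.02           # close to the true warp
```

The registration map $T^{-1}$ would be estimated the same way with the roles of μ and ν exchanged.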

#### *4.3.5 Unbiased Estimation When X = R*

In the same way that Fréchet means extend the notion of mean to non-Hilbertian spaces, they also extend the definition of unbiased estimators. Let $H$ be a separable Hilbert space (or a convex subset thereof) and suppose that $\widehat{\theta}$ is a random element in $H$ whose distribution $\mu_\theta$ depends on a parameter $\theta \in H$. Then $\widehat{\theta}$ is *unbiased* for θ if for all $\theta \in H$

$$\mathbb{E}_\theta\,\widehat{\theta} = \int_H x\,\mathrm{d}\mu_\theta(x) = \theta.$$

(We use the standard notation $\mathbb{E}_\theta\, g(\widehat{\theta}) = \int g(x)\,\mathrm{d}\mu_\theta(x)$ in the sequel.) This is equivalent to

$$\mathbb{E}_\theta \|\theta - \widehat{\theta}\|^2 \le \mathbb{E}_\theta \|\gamma - \widehat{\theta}\|^2, \qquad \forall \theta, \gamma \in H.$$

In view of that, one can define unbiased estimators of $\lambda \in \mathcal{W}_2$ as measurable functions $\delta = \delta(\Lambda_1,\dots,\Lambda_n)$ for which

$$\mathbb{E}_\lambda W_2^2(\lambda,\delta) \le \mathbb{E}_\lambda W_2^2(\gamma,\delta), \qquad \forall \lambda, \gamma \in \mathcal{W}_2.$$

This definition was introduced by Lehmann [88].

Unbiased estimators allow us to avoid the problem of over-registering (the so-called pinching effect; Kneip and Ramsay [82, Section 2.4]; Marron et al. [90, p. 476]). An extreme example of over-registration is if one "aligns" all the observed patterns into a single fixed point $x_0$. The registration will then seem "successful" in the sense of having no residual phase variation, but the estimation is clearly biased because the points are not registered to the correct reference measure. Thus, requiring the estimator to be unbiased is an alternative to penalising the registration maps.

Due to the Hilbert space embedding of $\mathcal{W}_2(\mathbb{R})$, it is possible to characterise unbiased estimators in terms of a simple condition on their quantile functions. As a corollary, $\lambda_n$, the Fréchet mean of $\{\Lambda_1,\dots,\Lambda_n\}$, is unbiased. Our regularised Fréchet–Wasserstein estimator $\widehat{\lambda}_n$ can then be interpreted as *approximately unbiased*, since it approximates the unobservable $\lambda_n$.

**Proposition 4.3.1 (Unbiased Estimators in $\mathcal{W}_2(\mathbb{R})$)** *Let* Λ *be a random measure in* $\mathcal{W}_2(\mathbb{R})$ *with finite Fréchet functional and let* λ *be the unique Fréchet mean of* Λ *(Theorem 3.2.11). An estimator* δ *constructed as a function of a sample* $(\Lambda_1,\dots,\Lambda_n)$ *is unbiased for* λ *if and only if the left-continuous representatives (in* $L_2(0,1)$*) satisfy* $\mathbb{E}[F_\delta^{-1}(x)] = F_\lambda^{-1}(x)$ *for all* $x \in (0,1)$*.*

*Proof.* The proof is straightforward from the definition: δ is unbiased if and only if for all λ and all γ,

$$\mathbb{E}_\lambda \left\| F_\lambda^{-1} - F_\delta^{-1} \right\|_{L_2}^2 \le \mathbb{E}_\lambda \left\| F_\gamma^{-1} - F_\delta^{-1} \right\|_{L_2}^2,$$

which is equivalent to $\mathbb{E}_\lambda[F_\delta^{-1}] = F_\lambda^{-1}$. In other words, these two functions must be equal almost everywhere on $(0,1)$, and their left-continuous representatives must be equal everywhere (the fact that $\mathbb{E}_\lambda[F_\delta^{-1}]$ has such a representative was established in Sect. 3.1.4).

To show that $\delta = \lambda_n$ is unbiased, we simply invoke Theorem 3.2.11 twice to see that

$$\mathbb{E}[F\_{\delta}^{-1}] = \mathbb{E}\left[\frac{1}{n}\sum\_{i=1}^{n}F\_{\Lambda\_{i}}^{-1}\right] = \mathbb{E}[F\_{\Lambda}^{-1}] = F\_{\lambda}^{-1},$$

which proves unbiasedness of δ.
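Proposition 4.3.1 can be illustrated by simulation in a hypothetical location model: $\Lambda = N(m,1)$ with $\mathbb{E}m = 0$, so the Fréchet mean is λ = $N(0,1)$, and the quantile function of $\lambda_n$ is the average of the sample quantile functions. Averaging over replications should then recover $F_\lambda^{-1} = \Phi^{-1}$.

```python
import numpy as np
from scipy.stats import norm

# Simulation sketch of unbiasedness: Lambda = N(m, 1) with random m,
# E m = 0, so the Frechet mean is lambda = N(0,1).  The quantile function
# of the sample Frechet mean lambda_n is the average of the sample quantile
# functions; averaging over replications should recover norm.ppf.
rng = np.random.default_rng(1)
u = np.linspace(0.05, 0.95, 19)
z = norm.ppf(u)                               # F_lambda^{-1} on the grid
reps = []
for _ in range(2000):
    m = rng.normal(size=5)                    # n = 5 random location shifts
    reps.append(np.mean(m) + z)               # quantile function of lambda_n
bias = np.mean(reps, axis=0) - z              # E[F^{-1}_{lambda_n}] - F^{-1}_lambda
assert np.max(np.abs(bias)) < 0.05            # Monte Carlo bias is negligible
```

In a location family the quantile functions shift rigidly, so the whole computation reduces to averaging the random locations, which makes the unbiasedness mechanism transparent.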

#### **4.4 Consistency**

In functional data analysis, one often assumes that the number of curves $n$ and the number of observed points per curve $m$ both diverge to infinity. An analogous framework for point processes would similarly require the number of point processes $n$ as well as the expected number of points τ per process to diverge. A technical complication arises, however, because the mean measures do not suffice to characterise the distribution of the processes. Indeed, if one is given a point process Π with mean measure λ (not necessarily a probability measure), and τ is an integer, there is no unique way to define a process $\Pi^{(\tau)}$ with mean measure τλ. One can define $\Pi^{(\tau)} = \tau\Pi$, so that every point of Π will be counted τ times. Such a construction, however, can never yield a consistent estimator of λ, even when $\tau \to \infty$.

Another way to generate a point process with mean measure τλ is to take a superposition of τ independent copies of Π. In symbols, this means

$$\Pi^{(\tau)} = \Pi_1 + \dots + \Pi_\tau,$$

with the $(\Pi_i)$ independent, each having the same distribution as Π. This superposition scheme makes it possible to use the law of large numbers. If τ is not an integer, then this construction is not well-defined, but it can be made so by assuming that the distribution of Π is *infinitely divisible*. The reader willing to assume that τ is always an integer can safely skip to Sect. 4.4.1; all the main ideas are developed first for integer values of τ and then extended to the general case.
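The superposition scheme and the law of large numbers it enables can be sketched numerically. The setup is hypothetical: Π is a Poisson process on $[0,1]$ with one expected point and mean measure λ = Uniform(0,1), so superposing τ copies yields a process with mean measure τλ whose normalised version approaches λ.

```python
import numpy as np

# Superposition sketch: Pi is a Poisson process with mean measure
# lambda = Uniform(0,1) (one expected point).  Superposing tau independent
# copies gives mean measure tau*lambda, and the normalised point pattern
# approximates lambda by the law of large numbers.
rng = np.random.default_rng(2)
tau = 500
points = np.concatenate([rng.uniform(size=rng.poisson(1)) for _ in range(tau)])
# expected number of points is tau
assert abs(len(points) - tau) < 5 * np.sqrt(tau)
# empirical CDF of the superposition approximates F_lambda(x) = x
grid = np.linspace(0, 1, 11)
ecdf = np.searchsorted(np.sort(points), grid) / len(points)
assert np.max(np.abs(ecdf - grid)) < 0.1
```

The τΠ construction mentioned above would fail this second check: repeating the same points τ times leaves the empirical distribution, and hence its random deviation from λ, unchanged.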

A point process Π is infinitely divisible if for every integer $m$ there exists a collection of $m$ independent and identically distributed $\Pi_i^{(1/m)}$ such that

$$
\Pi = \Pi\_1^{(1/m)} + \dots + \Pi\_m^{(1/m)} \qquad \text{in distribution.}
$$

If Π is infinitely divisible and $\tau = k/m$ is rational, then one can define $\Pi^{(\tau)}$ using $k$ independent copies of $\Pi^{(1/m)}$:

$$\Pi^{(\tau)} = \sum_{i=1}^{k} \Pi_i^{(1/m)}.$$

One then deals with irrational τ via duality and continuity arguments, as follows. Define the *Laplace functional* of Π by

$$L\_{\Pi}(f) = \mathbb{E}\left[e^{-\Pi f}\right] = \mathbb{E}\left[\exp\left(-\int\_{\mathcal{X}} f \, \mathrm{d}\varPi\right)\right] \in [0, 1], \qquad f: \mathcal{X} \to \mathbb{R}\_{+} \quad \text{Borel.}$$

The Laplace functional characterises the distribution of the point process, generalising the notion of Laplace transform of a random variable or vector (Karr [79, Theorem 1.12]). By definition, it translates convolutions into products. When Π = Π(1) is infinitely divisible, the Laplace functional *L*<sup>1</sup> of Π takes the form (Kallenberg [75, Chapter 6]; Karr [79, Theorem 1.43])

$$L_1(f) = \mathbb{E}\left[e^{-\Pi^{(1)} f}\right] = \exp\left[-\int_{M_+(\mathcal{X})} (1 - e^{-\mu f})\,\mathrm{d}\rho(\mu)\right] \quad \text{for some } \rho \in M_+(M_+(\mathcal{X})).$$

The Laplace functional of $\Pi^{(\tau)}$ is $L_\tau(f) = [L_1(f)]^\tau$ for any rational τ, which simply amounts to multiplying the measure ρ by the scalar τ. One can then do the same for an irrational τ, and the resulting Laplace functional determines the distribution of $\Pi^{(\tau)}$ for all $\tau > 0$.

#### *4.4.1 Consistent Estimation of Fréchet Means*

We are now ready to define our asymptotic setup. The following assumptions will be made. Notice that the Wasserstein geometry does not appear explicitly in these assumptions but is rather *derived* from them in view of Theorem 4.2.4. The compactness requirement can be relaxed under further moment conditions on λ and the point process Π; we focus on the compact case for simplicity and because in practice the point patterns will be observed on a bounded observation window.

**Assumptions 3** *Let* $K \subset \mathbb{R}^d$ *be a compact convex nonempty set,* λ *an absolutely continuous probability measure on K, and* $\tau_n$ *a sequence of positive numbers. Let* Π *be a point process on K with mean measure* λ*. Finally, define* $U = \mathrm{int}\,K$*.*


*The dependence of the estimators on n will sometimes be tacit. But* Λ*<sup>i</sup> does not depend on n.*

By virtue of Theorem 4.2.4, λ is a Fréchet mean of the random measure $\Lambda = T\#\lambda$. Uniqueness of this Fréchet mean will follow from Proposition 3.2.7 if we show that Λ is absolutely continuous with positive probability. This is indeed the case, since $T$ is injective and has a nonsingular Jacobian matrix; see Ambrosio et al. [12, Lemma 5.5.3]. The Jacobian assumption can be removed when $\mathcal{X} = \mathbb{R}$, because Fréchet means are always unique by Theorem 3.2.11.

Notice that there is no assumption about the dependence between rows. Assumptions 3 thus cover, in particular, two different scenarios:


Needless to say, Assumptions 3 also encompass binomial processes when τ*<sup>n</sup>* are integers, as well as Poisson processes or, more generally, Poisson cluster processes.

We now state and prove the consistency result for the estimators of the conditional mean measures Λ*<sup>i</sup>* and the structural mean measure λ.

**Theorem 4.4.1 (Consistency)** *If Assumptions 3 hold,* $\bar{\sigma}_n = n^{-1}\sum_{i=1}^n \sigma_i^{(n)} \to 0$ *almost surely and* $\tau_n \to \infty$ *as* $n \to \infty$*, then:*

*1. The estimators* $\widehat{\Lambda}_i$ *defined by* (4.2)*, constructed with bandwidth* $\sigma = \sigma_i^{(n)}$*, are Wasserstein-consistent for the conditional mean measures: for all* $i$ *such that* $\sigma_i^{(n)} \stackrel{p}{\to} 0$*,*

$$W_2\left(\widehat{\Lambda}_i, \Lambda_i\right) \stackrel{p}{\longrightarrow} 0, \qquad \text{as } n \to \infty;$$

*2. The regularised Fréchet–Wasserstein estimator of the structural mean measure (as described in Sect. 4.3) is strongly Wasserstein-consistent,*

$$W_2(\widehat{\lambda}_n, \lambda) \stackrel{a.s.}{\longrightarrow} 0, \qquad \text{as } n \to \infty.$$

*Convergence in 1. holds almost surely under the additional conditions that* $\sum_{n=1}^\infty \tau_n^{-2} < \infty$ *and* $\mathbb{E}\left[\Pi(\mathbb{R}^d)^4\right] < \infty$*. If* $\bar{\sigma}_n \to 0$ *only in probability, then convergence in 2. still holds in probability.*

Theorem 4.4.1 still holds without smoothing ($\sigma_i^{(n)} = 0$). In that case, $\widehat{\lambda}_n = \widetilde{\lambda}_n$ is possibly not unique, and the theorem should be interpreted in a set-valued sense (as in Proposition 1.7.8): almost surely, *any* choice of minimisers $\widetilde{\lambda}_n$ converges to λ as $n \to \infty$.

The preceding paragraph notwithstanding, we will usually assume that some smoothing *is* present, in which case $\widehat{\lambda}_n$ is unique and absolutely continuous by Proposition 3.1.8. The uniform Lipschitz bounds for the objective function show that if we restrict the relevant measures to be absolutely continuous, then $\widehat{\lambda}_n$ is a continuous function of $(\widehat{\Lambda}_1,\dots,\widehat{\Lambda}_n)$ and hence $\widehat{\lambda}_n : (\Omega,\mathcal{F},\mathbb{P}) \to \mathcal{W}_2(K)$ is measurable; this is again a minor issue because many arguments in the proof hold for each $\omega \in \Omega$ separately. Thus, even if $\widehat{\lambda}_n$ is not measurable, the proof shows that the convergence holds outer almost surely or in outer probability.

The first step in proving consistency is to show that the Wasserstein distance between the unsmoothed and the smoothed estimators of $\Lambda_i$ vanishes with the smoothing parameter. The exact rate of decay will be important later in establishing the rate of convergence of $\widehat{\lambda}_n$ to λ, and is determined next.

**Lemma 4.4.2 (Smoothing Error)** *There exists a finite constant* $C_{\psi,K}$*, depending only on* ψ *and on K, such that*

$$W\_2^2\left(\widehat{\Lambda}\_i, \widetilde{\Lambda}\_i\right) \le C\_{\Psi, K} \sigma^2 \quad \text{if } \sigma \le 1. \tag{4.3}$$

Since the smoothing parameter will anyway vanish, this restriction to small values of σ is not binding. The constant $C_{\psi,K}$ is explicit. When $\mathcal{X} = \mathbb{R}$, a more refined construction allows one to improve this constant in some situations; see Panaretos and Zemel [100, Lemma 1].

*Proof.* The idea is that (4.2) is a sum of measures with mass $1/N_i$ that can all be sent to the relevant point $x_j$; we refer to page 98 in the supplement for the precise details.

*Proof (Proof of Theorem 4.4.1).* The proof, detailed on page 97 of the supplement, follows these steps: firstly, one shows the convergence in probability of $\widehat{\Lambda}_i$ to $\Lambda_i$. This is basically a corollary of Karr [79, Proposition 4.8] and the smoothing bound from Lemma 4.4.2.

To prove claim (2) one considers the functionals, defined on *W*2(*K*):

$$\begin{aligned} F(\gamma) &= \frac{1}{2}\mathbb{E} W_2^2(\Lambda, \gamma); \\ F_n(\gamma) &= \frac{1}{2n}\sum_{i=1}^n W_2^2(\Lambda_i, \gamma); \\ \widetilde{F}_n(\gamma) &= \frac{1}{2n}\sum_{i=1}^n W_2^2(\widetilde{\Lambda}_i, \gamma), \qquad \widetilde{\Lambda}_i = \frac{\widetilde{\Pi}_i^{(n)}}{N_i^{(n)}} \qquad \text{or } \lambda^{(0)} \text{ if } N_i^{(n)} = 0; \\ \widehat{F}_n(\gamma) &= \frac{1}{2n}\sum_{i=1}^n W_2^2(\widehat{\Lambda}_i, \gamma), \qquad \widehat{\Lambda}_i = \lambda^{(0)} \text{ if } N_i^{(n)} = 0. \end{aligned}$$

Since $K$ is compact, they are all locally Lipschitz, so their differences can be controlled by the distances between $\Lambda_i$, $\widetilde{\Lambda}_i$, and $\widehat{\Lambda}_i$. The first distance vanishes since the intensity $\tau_n \to \infty$, and the second by the smoothing bound. Another compactness argument yields that $\widehat{F}_n \to F$ uniformly on $\mathcal{W}_2(K)$, and so the minimisers converge.

The almost sure convergence in (1) is proven as follows. Under the stronger conditions at the end of the theorem's statement, for any fixed $a = (a_1,\dots,a_d) \in \mathbb{R}^d$,

$$\mathbb{P}\left(\frac{\widetilde{\Pi}_i^{(n)}((-\infty,a])}{\tau_n} - \Lambda_i((-\infty,a]) \to 0\right) = 1$$

by the law of large numbers. This extends to all rational $a$'s, and then to all $a$ by approximation. The smoothing error is again controlled by Lemma 4.4.2.

#### *4.4.2 Consistency of Warp Functions and Inverses*

We next discuss the consistency of the estimators of the warp and registration functions, which are the key elements for aligning the observed point patterns $\tilde{\Pi}_i$. Recall that we have consistent estimators $\widehat{\Lambda}_i$ for $\Lambda_i$ and $\widehat{\lambda}_n$ for $\lambda$. Then $T_i = \mathbf{t}_{\lambda}^{\Lambda_i}$ is estimated by $\widehat{T}_i = \mathbf{t}_{\widehat{\lambda}_n}^{\widehat{\Lambda}_i}$, and $T_i^{-1}$ is estimated by $\widehat{T_i^{-1}} = \mathbf{t}_{\widehat{\Lambda}_i}^{\widehat{\lambda}_n}$. We make the following extra assumptions, which lead to more transparent statements (otherwise one needs to replace $K$ with the set of Lebesgue points of the supports of $\lambda$ and $\Lambda_i$).

**Assumptions 4 (Strictly Positive Measures)** *In addition to Assumptions 3 suppose that:*


As a consequence, $\mathrm{supp}\,\Lambda = \mathrm{supp}(T\#\lambda) = T(\mathrm{supp}\,\lambda) = K$ almost surely.

**Theorem 4.4.3 (Consistency of Optimal Maps)** *Let Assumptions 4 be satisfied in addition to the hypotheses of Theorem 4.4.1. Then for any $i$ such that $\sigma_i^{(n)} \stackrel{p}{\to} 0$ and any compact set $S \subseteq \mathrm{int}\,K$,*

$$\sup_{\mathbf{x}\in S} \|\widehat{T_i^{-1}}(\mathbf{x}) - T_i^{-1}(\mathbf{x})\| \stackrel{p}{\to} 0, \qquad \sup_{\mathbf{x}\in S} \|\widehat{T}_i(\mathbf{x}) - T_i(\mathbf{x})\| \stackrel{p}{\to} 0.$$

*Almost sure convergence can be obtained under the same provisions made at the end of the statement of Theorem 4.4.1.*

A few technical remarks are in order. First and foremost, it is not clear that the two suprema are measurable. Even though $T_i$ and $T_i^{-1}$ are random elements in $C_b(U,\mathbb{R}^d)$, their estimators are only defined in an $L^2$ sense. The proof of Theorem 4.4.3 is carried out $\omega$-wise: for any $\omega$ in the probability space such that Theorem 4.4.1 holds, the two suprema vanish as $n \to \infty$. In other words, the convergence holds in outer probability, or outer almost surely.

Secondly, assuming positive smoothing, the random measures $\widehat{\Lambda}_i$ are smooth with densities bounded below on $K$, so the maps $\widehat{T_i^{-1}}$ are defined on the whole of $U$ (possibly as set-valued functions on a $\Lambda_i$-null set). But the only known regularity result for $\widehat{\lambda}_n$ is an upper bound on its density (Proposition 3.1.8), so it is unclear what its support is, and consequently what the domain of definition of $\widehat{T}_i$ is.

Lastly, when the smoothing parameter $\sigma$ is zero, $\widehat{T}_i$ and $\widehat{T_i^{-1}}$ are not defined. Nevertheless, Theorem 4.4.3 still holds in the set-valued formulation of Proposition 1.7.11, of which it is a rather simple corollary:

*Proof (Proof of Theorem 4.4.3).* The proof amounts to setting the scene in order to apply Proposition 1.7.11 on the stability of optimal maps. We define

$$\mu_n = \widehat{\Lambda}_i; \qquad \nu_n = \widehat{\lambda}_n; \qquad \mu = \Lambda_i; \qquad \nu = \lambda; \qquad u_n = \widehat{T_i^{-1}}; \qquad u = T_i^{-1},$$

and verify the conditions of the proposition. The weak convergence of $\mu_n$ to $\mu$ and of $\nu_n$ to $\nu$ is the conclusion of Theorem 4.4.1; finiteness is apparent because $K$ is compact, and uniqueness follows from the assumed absolute continuity of $\Lambda_i$. Since, in addition, $T_i^{-1}$ is uniquely defined on $U = \mathrm{int}\,K$, which is an open convex set, the restrictions on $\Omega$ in Proposition 1.7.11 are redundant. Uniform convergence of $\widehat{T}_i$ to $T_i$ is proven in the same way.

**Corollary 4.4.4 (Consistency of Point Pattern Registration)** *For any $i$ such that $\sigma_i^{(n)} \stackrel{p}{\to} 0$,*

$$W_2\left( \frac{\widehat{\Pi}_i^{(n)}}{N_i^{(n)}}, \frac{\Pi_i^{(n)}}{N_i^{(n)}} \right) \stackrel{\mathbb{P}}{\to} 0.$$

The division by the number of observed points ensures that the resulting measures are probability measures; the relevant information is contained in the point patterns themselves, and is invariant under this normalisation.

*Proof.* The law of large numbers entails that $N_i^{(n)}/\tau_n \to 1$, so in particular $N_i^{(n)}$ is almost surely nonzero when $n$ is large. Since $\widehat{\Pi}_i^{(n)} = (\widehat{T_i^{-1}} \circ T_i)\#\Pi_i^{(n)}$, we have the upper bound

$$W_2^2\left(\frac{\widehat{\Pi}_i^{(n)}}{N_i^{(n)}},\frac{\Pi_i^{(n)}}{N_i^{(n)}}\right) \le \int_K \|\widehat{T_i^{-1}}(T_i(\mathbf{x})) - \mathbf{x}\|^2 \,\mathrm{d}\frac{\Pi_i^{(n)}}{N_i^{(n)}}.$$

Fix a compact $\Omega \subseteq \mathrm{int}\,K$ and split the integral over $\Omega$ and its complement. Then

$$\int_{K\setminus\Omega} \|\widehat{T_i^{-1}}(T_i(\mathbf{x})) - \mathbf{x}\|^2 \,\mathrm{d}\frac{\Pi_i^{(n)}}{N_i^{(n)}} \le d_K^2\, \frac{\Pi_i^{(n)}(K\setminus\Omega)}{\tau_n}\, \frac{\tau_n}{N_i^{(n)}} \stackrel{\mathrm{a.s.}}{\to} d_K^2\, \lambda(K\setminus\Omega),$$

by the law of large numbers, where $d_K$ is the diameter of $K$. By writing $\mathrm{int}\,K$ as a countable union of compact sets (and since $\lambda$ is absolutely continuous), this can be made arbitrarily small by choice of $\Omega$.

We can easily bound the integral over $\Omega$ by

$$\int_{\Omega} \|\widehat{T_i^{-1}}(T_i(\mathbf{x})) - \mathbf{x}\|^2 \,\mathrm{d}\frac{\Pi_i^{(n)}}{N_i^{(n)}} \le \sup_{\mathbf{x}\in\Omega} \|\widehat{T_i^{-1}}(T_i(\mathbf{x})) - \mathbf{x}\|^2 = \sup_{\mathbf{y}\in T_i(\Omega)} \|\widehat{T_i^{-1}}(\mathbf{y}) - T_i^{-1}(\mathbf{y})\|^2.$$

But $T_i(\Omega)$ is a compact subset of $U = \mathrm{int}\,K$, because $T_i \in C_b(U,U)$. The right-hand side therefore vanishes as $n \to \infty$ by Theorem 4.4.3, and this completes the proof.

Possible extensions pertaining to the boundary of *K* are discussed on page 33 of the supplement.

#### **4.5 Illustrative Examples**

In this section, we illustrate the estimation framework put forth in this chapter by considering an example of a structural mean $\lambda$ with a bimodal density on the real line. The unwarped point patterns $\Pi$ originate from Poisson processes with mean measure $\lambda$, and, consequently, the warped points $\tilde{\Pi}$ form Cox processes (see Sect. 4.1.2). Another scenario involving triangular densities can be found in Panaretos and Zemel [100].

#### *4.5.1 Explicit Classes of Warp Maps*

As a first step, we introduce a class of random warp maps satisfying Assumptions 2, that is, increasing maps that have as mean the identity function. The construction is a mixture version of similar maps considered by Wang and Gasser [128, 129].

For any integer $k$, define $\zeta_k : [0,1] \to [0,1]$ by

$$
\zeta_0(x) = x, \qquad \zeta_k(x) = x - \frac{\sin(\pi k x)}{|k|\pi}, \quad k \in \mathbb{Z} \setminus \{0\}. \tag{4.4}
$$

Clearly $\zeta_k(0) = 0$, $\zeta_k(1) = 1$, and $\zeta_k$ is smooth and strictly increasing for all $k$. Figure 4.4a plots $\zeta_k$ for $k = -3,\dots,3$. To make $\zeta_k$ a random function, we let $k$ be an integer-valued random variable. If the latter is symmetric, then we have

$$\mathbb{E}\left[\zeta_k(x)\right] = x, \qquad x \in [0,1].$$

By means of mixtures, we replace this discrete family by a continuous one: let $J > 1$ be an integer and let $V = (V_1,\dots,V_J)$ be a random vector following the flat Dirichlet distribution (uniform on the set of nonnegative vectors with $v_1 + \dots + v_J = 1$). Independently, take $k_j$ following the same distribution as $k$, and define

$$T(x) = \sum_{j=1}^{J} V_j\, \zeta_{k_j}(x). \tag{4.5}$$

Since each $V_j$ is positive, $T$ is increasing, and since $(V_j)$ sums to unity, $T$ has mean identity. Realisations of these warp functions are given in Fig. 4.4b and c for $J = 2$ and $J = 10$, respectively. The parameters $(k_j)$ were chosen as symmetrised Poisson random variables: each $k_j$ has the law of $XY$ with $X$ Poisson with mean 3 and $\mathbb{P}(Y = 1) = \mathbb{P}(Y = -1) = 1/2$, with $Y$ and $X$ independent. When $J = 10$ is large, the function $T$ deviates only mildly from the identity, since a law of large numbers begins to take effect. In contrast, $J = 2$ yields functions that are quite different from the identity. Thus, it can be said that the parameter $J$ controls the variance of the random warp function $T$.

Fig. 4.4: (**a**) The functions $\{\zeta_{-3},\dots,\zeta_3\}$; (**b**) realisations of $T$ defined by (4.5) with $J = 2$ and $k_j$ symmetrisations of Poisson random variables with mean 3; (**c**) realisations of $T$ defined by (4.5) with $J = 10$ and $k_j$ as in (**b**)
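The construction (4.4)–(4.5) is straightforward to simulate. The following sketch (the helper names and random seed are ours, not from the text) draws one random warp map and checks that it is a strictly increasing bijection of $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

def zeta(k, x):
    """Warp component (4.4): the identity for k = 0, a sinusoidal perturbation otherwise."""
    if k == 0:
        return np.asarray(x, dtype=float)
    return x - np.sin(np.pi * k * x) / (abs(k) * np.pi)

def random_warp(J, rng, mean=3):
    """Random warp map T of (4.5): a flat-Dirichlet mixture of zeta_{k_j},
    with symmetrised Poisson parameters k_j = X_j * Y_j."""
    V = rng.dirichlet(np.ones(J))                           # weights summing to 1
    k = rng.poisson(mean, size=J) * rng.choice([-1, 1], size=J)
    return lambda x: sum(v * zeta(kj, x) for v, kj in zip(V, k))

x = np.linspace(0.0, 1.0, 201)
T = random_warp(J=2, rng=rng)
y = T(x)
assert abs(y[0]) < 1e-12 and abs(y[-1] - 1.0) < 1e-9  # T(0) = 0, T(1) = 1
assert np.all(np.diff(y) > 0)                         # strictly increasing
```

Averaging many such draws recovers the identity, reflecting the unbiasedness property $\mathbb{E}\,T(x) = x$.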

#### *4.5.2 Bimodal Cox Processes*

Let the structural mean measure $\lambda$ be a mixture of a bimodal Gaussian distribution (restricted to $K = [-16,16]$) and a beta background on the interval $[-12,12]$, so that mass is added at the centre of $K$ but not near the boundary. In symbols, this is given as follows. Let $\varphi$ be the standard Gaussian density and let $\beta_{\alpha,\beta}$ denote the density of the beta distribution with parameters $\alpha$ and $\beta$. Then $\lambda$ is chosen as the measure with density

$$f(x) = \frac{1-\varepsilon}{2}\left[\varphi(x-8) + \varphi(x+8)\right] + \frac{\varepsilon}{24}\,\beta_{1.5,1.5}\left(\frac{x+12}{24}\right), \qquad x \in [-16,16], \tag{4.6}$$

where $\varepsilon \in [0,1]$ is the weight of the beta background. (We ignore the loss of a negligible amount of mass due to the restriction of the Gaussians to $[-16,16]$.) Plots of the density and distribution functions are given in Fig. 4.5.

Fig. 4.5: Density and distribution functions corresponding to (4.6) with $\varepsilon = 0$ and $\varepsilon = 0.15$

The main criterion for the quality of our regularised Fréchet–Wasserstein estimator will be its success in discerning the two modes at $\pm 8$; these will be smeared by the phase variation arising from the warp functions.

We next simulated 30 independent Poisson processes with mean measure $\lambda$, $\varepsilon = 0.1$, and total intensity (expected number of points) $\tau = 93$. In addition, we generated warp functions as in (4.5), but rescaled to $[-16,16]$; that is, having the same law as the functions

$$x \mapsto 32T\left(\frac{x+16}{32}\right) - 16$$

from $K$ to $K$. These cause rather violent phase variation, as can be seen from the plots of the densities and distribution functions of the conditional measures $\Lambda = T\#\lambda$ presented in Fig. 4.6a and b; the warped points themselves are displayed in Fig. 4.6c.
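The simulation just described can be sketched in a few lines. For brevity, the sketch below warps through a single component $\zeta_2$ of (4.4) rather than the full mixture (4.5), and it ignores the truncation of the Gaussians to $K$; the function names are ours, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, eps=0.1):
    """Density (4.6): bimodal Gaussian mixture plus a Beta(1.5, 1.5) background."""
    gauss = np.exp(-0.5 * (x - 8.0) ** 2) + np.exp(-0.5 * (x + 8.0) ** 2)
    gauss *= (1.0 - eps) / (2.0 * np.sqrt(2.0 * np.pi))
    u = (x + 12.0) / 24.0
    beta = (8.0 / np.pi) * np.sqrt(np.clip(u * (1.0 - u), 0.0, None))
    return gauss + (eps / 24.0) * beta

def poisson_process(tau, rng, fmax=0.25):
    """Poisson process on [-16, 16] with mean measure tau * f:
    a Poisson number of points, each drawn from f by rejection sampling."""
    n = rng.poisson(tau)
    pts = []
    while len(pts) < n:
        x = rng.uniform(-16.0, 16.0)
        if rng.uniform(0.0, fmax) < f(x):
            pts.append(x)
    return np.array(pts)

zeta2 = lambda u: u - np.sin(2.0 * np.pi * u) / (2.0 * np.pi)  # zeta_2 of (4.4)
pts = poisson_process(tau=93, rng=rng)                         # unwarped pattern
warped = 32.0 * zeta2((pts + 16.0) / 32.0) - 16.0              # rescaled warp of K
```

Repeating this with an independent random warp per pattern produces the 30 Cox processes of Fig. 4.6c.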

Using these warped point patterns, we construct the regularised Fréchet–Wasserstein estimator employing the procedure described in Sect. 4.3. Each observed pattern $\tilde{\Pi}_i$ was smoothed with a Gaussian kernel, with bandwidth chosen by unbiased cross-validation. We deviate slightly from the recipe presented in Sect. 4.3 by not restricting the resulting estimates to the interval $[-16,16]$, but this has no essential effect on the finite-sample performance. The regularised Fréchet–Wasserstein estimator $\widehat{\lambda}_n$ serves as the estimator of the structural mean $\lambda$ and is shown in Fig. 4.7a. It is contrasted with $\lambda$ at the level of distribution functions, as well as with the empirical arithmetic mean; the latter, the *naive estimator*, is calculated by ignoring the warping and simply averaging the (smoothed) empirical distribution functions linearly across the observations. We notice that $\widehat{\lambda}_n$ is rather successful at locating the two modes of $\lambda$, in contrast with the naive estimator, which is more diffuse. In fact, the latter's distribution function increases approximately linearly, suggesting a nearly constant density instead of the correct bimodal one.

Fig. 4.6: (**a**) 30 warped bimodal densities, with the density of $\lambda$ given by (4.6) in solid black; (**b**) their corresponding distribution functions, with that of $\lambda$ in solid black; (**c**) 30 Cox processes, constructed as warped versions of Poisson processes with mean intensity $93f$, using as warp functions the rescaling to $[-16,16]$ of (4.5)

Fig. 4.7: (**a**) Comparison between the regularised Fréchet–Wasserstein estimator, the empirical arithmetic mean, and the true distribution function, including residual curves centred at $y = 3/4$; (**b**) the estimated warp functions; (**c**) kernel estimates of the density function $f$ of the structural mean, based on the warped and registered point patterns

Estimators $\widehat{T}_i$ of the warp maps, depicted in Fig. 4.7b, and their inverses, are defined as the optimal maps between $\widehat{\lambda}_n$ and the estimated conditional mean measures, as explained in Sect. 4.3.4. We then register the point patterns by applying to them the inverse estimators $\widehat{T_i^{-1}}$ (Fig. 4.8). Figure 4.7c gives two kernel estimators of the density of $\lambda$, constructed from a superposition of all the warped points and of all the registered points, respectively. Notice that the estimator using the registered points is much more successful at discerning the two density peaks than the one using the warped points. This is not surprising after a brief look at Fig. 4.8, where the unwarped, warped, and registered points are displayed. Indeed, there is a very high concentration of registered points around the true locations of the peaks, $\pm 8$. This is not the case for the warped points, because the phase variation translates the centres of concentration of each individual observation. It is important to remark that the fluctuations in the density estimator in Fig. 4.7c are not related to the registration procedure, and could be reduced by a better choice of bandwidth. Indeed, our procedure does not attempt to estimate the density, but rather the distribution function.

Fig. 4.8: Bimodal Cox processes: (**a**) the observed warped point processes; (**b**) the unobserved original point processes; (**c**) the registered point processes

Figure 4.9 presents a superposition of the regularised Fréchet–Wasserstein estimators for 20 independent replications of the experiment, contrasted with a similar superposition for the naive estimator. The latter is clearly seen to be biased around the two peaks, while the regularised Fréchet–Wasserstein estimator seems approximately unbiased, despite presenting fluctuations. It always captures the bimodal nature of the density, as is seen from the two clear elbows in each realisation.

To illustrate the consistency of the regularised Fréchet–Wasserstein estimator $\widehat{\lambda}_n$ for $\lambda$, as established in Theorem 4.4.1, we let the number of processes $n$, as well as the expected number of observed points per process $\tau$, vary. Figures 4.10 and 4.11 show the sampling variation of $\widehat{\lambda}_n$ for different values of $n$ and $\tau$. We observe that as either of these increases, the realisations of $\widehat{\lambda}_n$ indeed approach $\lambda$. The figures suggest that, in this scenario, the amplitude variation plays a stronger role than the phase variation, as the effect of $\tau$ is more substantial.

Fig. 4.9: (**a**) Sampling variation of the regularised Fréchet–Wasserstein mean $\widehat{\lambda}_n$ and the true mean measure $\lambda$ for 20 independent replications of the experiment; (**b**) sampling variation of the arithmetic mean, and the true mean measure $\lambda$, for the same 20 replications; (**c**) superposition of (**a**) and (**b**). For ease of comparison, all three panels include residual curves centred at $y = 3/4$

#### *4.5.3 Effect of the Smoothing Parameter*

In order to work with measures of strictly positive density, the observed point patterns have been smoothed using a kernel function. This necessarily incurs an additional bias that depends on the bandwidth $\sigma_i$. The asymptotics (Theorem 4.4.1) guarantee the consistency of the estimators, in particular of the regularised Fréchet–Wasserstein estimator $\widehat{\lambda}_n$, provided that $\max_{1\le i\le n}\sigma_i \to 0$. In our simulations, we choose $\sigma_i$ in a data-driven way by employing unbiased cross-validation. To gauge the effect of the smoothing, we carry out the same estimation procedure but with $\sigma_i$ multiplied by a parameter $s$. Figure 4.12 presents the distribution function of $\widehat{\lambda}_n$ as a function of $s$. Interestingly, the curves are nearly identical as long as $s \le 1$, whereas when $s > 1$, the bias becomes more substantial.

These findings are reaffirmed in Fig. 4.13, which shows the registered point processes, again as a function of $s$. We see that only minor differences are present as $s$ varies from 0.1 to 1, for example, in the grey (8), black (17), and green (19) processes. When $s = 3$, the distortion becomes considerably more substantial. This phenomenon repeats itself across all combinations of $n$, $\tau$, and $s$ tested.

Fig. 4.10: Sampling variation of the regularised Fréchet–Wasserstein mean $\widehat{\lambda}_n$ and the true mean measure $\lambda$ for 20 independent replications of the experiment, with $\varepsilon = 0$ and $n = 30$. Left: $\tau = 43$; middle: $\tau = 93$; right: $\tau = 143$. For ease of comparison, all three panels include residual curves centred at $y = 3/4$

Fig. 4.11: Sampling variation of the regularised Fréchet–Wasserstein mean $\widehat{\lambda}_n$ and the true mean measure $\lambda$ for 20 independent replications of the experiment, with $\varepsilon = 0$ and $\tau = 93$. Left: $n = 30$; middle: $n = 50$; right: $n = 70$. For ease of comparison, all three panels include residual curves centred at $y = 3/4$.

#### **4.6 Convergence Rates and a Central Limit Theorem on the Real Line**

Since the conditional mean measures $\Lambda_i$ are discretely observed, the rate of convergence of our estimators will be affected by the rate at which the number of observed points per process, $N_i^{(n)}$, increases to infinity. The latter is controlled by the next lemma, which is valid on any complete separable metric space $\mathcal{X}$.

**Lemma 4.6.1 (Number of Points Grows Linearly)** *Let $N_i^{(n)} = \Pi_i^{(n)}(\mathcal{X})$ denote the total number of observed points. If $\tau_n/\log n \to \infty$, then there exists a constant $C_\Pi > 0$, depending only on the distribution of $\Pi$, such that almost surely*

$$\liminf_{n \to \infty} \frac{\min_{1 \le i \le n} N_i^{(n)}}{\tau_n} \ge C_\Pi.$$

Fig. 4.12: Regularised Fréchet–Wasserstein mean as a function of the smoothing parameter multiplier $s$, including residual curves. Here, $n = 30$ and $\tau = 143$

Fig. 4.13: Registered point processes as a function of the smoothing parameter multiplier $s$. Left: $s = 0.1$; middle: $s = 1$; right: $s = 3$. Here, $n = 30$ and $\tau = 43$

*In particular, there are no empty point processes, so the normalisation is well-defined. If $\Pi$ is a Poisson process, then we have the more precise result*

$$\lim_{n \to \infty} \frac{\min_{1 \le i \le n} N_i^{(n)}}{\tau_n} = 1 \qquad \text{almost surely.}$$

**Remark 4.6.2** *One can also show that the limit superior of the same quantity is bounded by a constant $C_\Pi'$. If $\tau_n/\log n$ is bounded below, then the same result holds, but with worse constants. If only $\tau_n \to \infty$, then the result holds for each $i$ separately, but in probability.*

The proof is a simple application of Chernoff bounds; see page 108 in the supplement.

With Lemma 4.6.1 under our belt, we can replace terms of the order $\min_i N_i^{(n)}$ by the more transparent order $\tau_n$. As in the consistency proof, the idea is to write

$$F - \widehat{F}_n = (F - F_n) + (F_n - \tilde{F}_n) + (\tilde{F}_n - \widehat{F}_n)$$

and control each term separately. The first term corresponds to the phase variation, and comes from the approximation of the theoretical expectation $F$ by a sample mean $F_n$. The second term is associated with the amplitude variation resulting from observing each $\Lambda_i$ discretely. The third term can be viewed as the bias incurred by the smoothing procedure. Accordingly, the rate at which $\widehat{\lambda}_n$ converges to $\lambda$ is a sum of three separate terms. We recall the standard $O_\mathbb{P}$ terminology: if $X_n$ and $Y_n$ are random variables, then $X_n = O_\mathbb{P}(Y_n)$ means that the sequence $(X_n/Y_n)$ is *bounded in probability*, which by definition is the condition

$$\forall \varepsilon > 0 \,\,\exists M : \quad \sup\_{n} \mathbb{P}\left( \left| \frac{X\_n}{Y\_n} \right| > M \right) < \varepsilon.$$

Instead of $X_n = O_\mathbb{P}(Y_n)$, we will sometimes write $Y_n \ge O_\mathbb{P}(X_n)$. The former notation emphasises that $X_n$ grows no faster than $Y_n$, while the latter stresses that $Y_n$ grows at least as fast as $X_n$ (which is of course the same assertion). Finally, $X_n = o_\mathbb{P}(Y_n)$ means that $X_n/Y_n \to 0$ in probability.

**Theorem 4.6.3 (Convergence Rates on $\mathbb{R}$)** *Suppose, in addition to Assumptions 3, that $d = 1$, $\tau_n/\log n \to \infty$, and that $\Pi$ is either a Poisson process or a binomial process. Then*

$$W_2(\widehat{\lambda}_n, \lambda) \le O_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right) + O_{\mathbb{P}}\left(\frac{1}{\sqrt[4]{\tau_n}}\right) + O_{\mathbb{P}}\left(\sigma_n\right), \qquad \sigma_n = \frac{1}{n} \sum_{i=1}^n \sigma_i^{(n)},$$

*where all the constants in the O*P *terms are explicit.*

**Remark 4.6.4** *Unlike classical density estimation, no assumptions on the rate of decay of $\sigma_n$ are required, because we only need to estimate the distribution function and not its derivative. If the smoothing parameter is chosen to be $\sigma_i^{(n)} = [N_i^{(n)}]^{-\alpha}$ for some $\alpha > 0$ and $\tau_n/\log n \to \infty$, then by Lemma 4.6.1, $\sigma_n \le \max_{1\le i\le n} \sigma_i^{(n)} = O_\mathbb{P}(\tau_n^{-\alpha})$. For example, if Rosenblatt's rule $\alpha = 1/5$ is employed, then the $O_\mathbb{P}(\sigma_n)$ term can be replaced by $O_\mathbb{P}(1/\sqrt[5]{\tau_n})$.*

One can think of the parameter $\tau$ as separating the *sparse* and *dense* regimes, as in classical functional data analysis (see also Wu et al. [132]). If $\tau$ is bounded, then the setting is *ultra sparse* and consistency cannot be achieved. A sparse regime can be defined as the case where $\tau_n \to \infty$, but more slowly than $\log n$. In that case, consistency is guaranteed, but some point patterns will be empty. The *dense* regime can be defined as $\tau_n \gg n^2$, in which case the amplitude variation is asymptotically negligible compared with the phase variation.

The exponent $-1/4$ of $\tau_n$ can be shown to be optimal without further assumptions, but it can be improved to $-1/2$ if $\mathbb{P}(f_\Lambda \ge \varepsilon \text{ on } K) = 1$ for some $\varepsilon > 0$, where $f_\Lambda$ is the density of $\Lambda$ (see Sect. 4.7). In terms of $T$, the condition is that $\mathbb{P}(T' \ge \varepsilon) = 1$ for some $\varepsilon$ and that $\lambda$ has a density bounded below. When this is the case, $\tau_n$ needs to be compared with $n$ rather than with $n^2$ in the next paragraph and the next theorem.

Theorem 4.6.3 provides conditions for the optimal parametric rate $\sqrt{n}$ to be achieved: this happens if we set $\sigma_n$ to be of order $O_\mathbb{P}(n^{-1/2})$ or less and if $\tau_n$ is of order $n^2$ or more. But if the last two terms in Theorem 4.6.3 are *negligible* with respect to $n^{-1/2}$, then a sort of *central limit theorem* holds for $\widehat{\lambda}_n$:

**Theorem 4.6.5 (Asymptotic Normality)** *In addition to the conditions of Theorem 4.6.3, assume that $\tau_n/n^2 \to \infty$, $\sigma_n = o_\mathbb{P}(n^{-1/2})$, and that $\lambda$ possesses an invertible distribution function $F_\lambda$ on $K$. Then*

$$
\sqrt{n}\left(\mathbf{t}_{\lambda}^{\widehat{\lambda}_n} - \mathbf{i}\right) \longrightarrow Z \quad \text{weakly in } L_2(\lambda),
$$

*for a zero-mean Gaussian process $Z$ with the same covariance operator as that of $T$ (the latter viewed as a random element in $L_2(\lambda)$), namely with covariance kernel*

$$\kappa(\mathbf{x}, \mathbf{y}) = \text{cov}\left\{ T(\mathbf{x}), T(\mathbf{y}) \right\}.$$

*If the density $f_\lambda$ exists and is (piecewise) continuous and bounded below on $K$, then the weak convergence also holds in $L_2(K)$.*

In view of Sect. 2.3, Theorem 4.6.5 can be interpreted as asymptotic normality of $\widehat{\lambda}_n$ in the *tangential* sense: $\sqrt{n}\,\log_\lambda(\widehat{\lambda}_n)$ converges to a Gaussian random element in the tangent space $\mathrm{Tan}_\lambda$, which is a subset of $L_2(\lambda)$. The additional smoothness conditions allow one to switch to the space $L_2(K)$, which does not depend on the unknown template measure $\lambda$.

See pages 109 and 110 in the supplement for detailed proofs of these theorems. Below we sketch the main ideas only.

*Proof (Proof of Theorem 4.6.3).* The quantile formula $W_2(\gamma,\theta) = \|F_\theta^{-1} - F_\gamma^{-1}\|_{L_2(0,1)}$ from Sect. 1.5 and the average quantile formula for the Fréchet mean (Sect. 3.1.4) show that the oracle empirical mean $F_{\lambda_n}^{-1}$ obeys a central limit theorem in $L_2(0,1)$. Since we work in the Hilbert space $L_2(0,1)$, Fréchet means are simple averages, so the errors in the Fréchet mean have the same rate as the errors in the Fréchet functionals. The smoothing term is easily controlled by Lemma 4.4.2.

Controlling the amplitude term is more difficult. Bounds can be given using the machinery sketched in Sect. 4.7, but we give a more elementary proof by reducing to the 1-Wasserstein case (using (2.2)), which can be handled more easily in terms of distribution functions (Corollary 1.5.3).
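On the real line, the quantile formula used in this proof also makes $W_2$ easy to evaluate numerically. A minimal sketch (the helper name and grid discretisation of the quantile integral are our own):

```python
import numpy as np

def w2_empirical(x, y, m=1000):
    """W_2 between the empirical measures of two samples on R, via the
    quantile formula W_2^2 = int_0^1 (F_x^{-1}(u) - F_y^{-1}(u))^2 du."""
    u = (np.arange(m) + 0.5) / m          # midpoint quantile levels
    qx = np.quantile(x, u)
    qy = np.quantile(y, u)
    return float(np.sqrt(np.mean((qx - qy) ** 2)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(1.0, 1.0, size=5000)       # a unit translation: true W_2 is 1
d = w2_empirical(a, b)                    # should be close to 1
```

Because quantile functions of a unit translation differ by the constant 1, the computed distance approximates the true value 1 up to sampling and discretisation error.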

*Proof (Proof of Theorem 4.6.5).* The hypotheses guarantee that the amplitude and smoothing errors are negligible and

$$
\sqrt{n}\left(F_{\widehat{\lambda}_n}^{-1} - F_{\lambda}^{-1}\right) \to GP \quad \text{weakly in } L_2(0,1),
$$

where $GP$ is the Gaussian process defined in the proof of Theorem 4.6.3. One then employs a composition with $F_\lambda$.

#### **4.7 Convergence of the Empirical Measure and Optimality**

One may find the term $O_\mathbb{P}(1/\sqrt[4]{\tau_n})$ in Theorem 4.6.3 somewhat surprising, and expect that it ought to be $O_\mathbb{P}(1/\sqrt{\tau_n})$. The goal of this section is to show why the rate $1/\sqrt[4]{\tau_n}$ is optimal without further assumptions, and to discuss conditions under which it can be improved to the optimal rate $1/\sqrt{\tau_n}$. For simplicity, we concentrate on the case $\tau_n = n$ and assume that the point process $\Pi$ is binomial, the Poisson case being easily obtained from this simplified one (using Lemma 4.6.1). We are thus led to study rates of convergence of empirical measures in the Wasserstein space. That is to say, for a fixed exponent $p \ge 1$ and a fixed measure $\mu \in W_p(\mathcal{X})$, we consider independent random variables $X_1,\dots$ with law $\mu$ and the *empirical measure* $\mu_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$. The first observation is that $\mathbb{E}W_p(\mu,\mu_n) \to 0$:

**Lemma 4.7.1** *Let $\mu \in P(\mathcal{X})$ be any measure. Then*

$$\mathbb{E}W_p(\mu,\mu_n) \begin{cases} = \infty & \mu \notin W_p(\mathcal{X}),\\ \to 0 & \mu \in W_p(\mathcal{X}). \end{cases}$$

*Proof.* This result has been established in an almost sure sense in Proposition 2.2.6. To extend this to convergence in expectation, observe that

$$W_p^p(\mu, \mu_n) \le \int_{\mathcal{X}^2} \|x - y\|^p \,\mathrm{d}(\mu \otimes \mu_n)(x,y) = \frac{1}{n} \sum_{i=1}^n \int_{\mathcal{X}} \|x - X_i\|^p \,\mathrm{d}\mu(x).$$

Thus, the random variable $0 \le Y_n = W_p^p(\mu,\mu_n)$ is bounded by the sample average $Z_n$ of the random variable $V = \int_{\mathcal{X}} \|x - X_1\|^p \,\mathrm{d}\mu(x)$, which has finite expectation. A version of the dominated convergence theorem (given on page 111 in the supplement) implies that $\mathbb{E}Y_n \to 0$. Now invoke Jensen's inequality.

**Remark 4.7.2** *The sequence $\mathbb{E}W_p(\mu,\mu_n)$ is not monotone, as the simple example $\mu = (\delta_0 + \delta_1)/2$ shows (see page 111 in the supplement).*

The next question is how quickly $\mathbb{E}W_p(\mu,\mu_n)$ vanishes when $\mu \in W_p(\mathcal{X})$. We shall begin with two simple general lower bounds, then discuss upper bounds in the one-dimensional case, put them in the context of Theorem 4.6.3, and finally briefly touch upon the $d$-dimensional case.

**Lemma 4.7.3 ($\sqrt{n}$ Lower Bound)** *Let $\mu \in P(\mathcal{X})$ be nondegenerate. Then there exists a constant $c(\mu) > 0$ such that for all $p \ge 1$ and all $n$,*

$$\mathbb{E}W\_p(\mu\_n, \mu) \ge \frac{c(\mu)}{\sqrt{n}}.$$

*Proof.* Let $X \sim \mu$ and let $a \ne b$ be two points in the support of $\mu$. Consider $f(x) = \min(1, \|x-a\|)$, a bounded 1-Lipschitz function such that $f(a) = 0 < f(b)$. Then

$$\sqrt{n}\,\mathbb{E}W_p(\mu_n,\mu) \ge \sqrt{n}\,\mathbb{E}W_1(\mu_n,\mu) \ge \mathbb{E}\left|n^{-1/2}\sum_{i=1}^n \left[f(X_i) - \mathbb{E}f(X)\right]\right| \to \sqrt{\frac{2\,\mathrm{var}\,f(X)}{\pi}} > 0$$

by the central limit theorem and the Kantorovich–Rubinstein theorem (1.11).

For discrete measures, the rates scale badly with *p*. More generally:

**Lemma 4.7.4 (Separated Support)** *Suppose that there exist Borel sets $A, B \subset \mathcal{X}$ such that $\mu(A \cup B) = 1$,*

$$\mu(A)\mu(B) > 0 \qquad \text{and} \qquad d\_{\text{min}} = \inf\_{\mathbf{x} \in A, \mathbf{y} \in B} ||\mathbf{x} - \mathbf{y}|| > 0.$$

*Then for any $p \ge 1$ there exists $c_p(\mu) > 0$ such that $\mathbb{E}W_p(\mu_n,\mu) \ge c_p(\mu)\, n^{-1/(2p)}$.*

Any nondegenerate finitely discrete measure μ satisfies this condition, and so do "non-pathological" countably discrete ones. (An example of a "pathological" measure is one assigning positive mass to any rational number.)

*Proof.* Let $k \sim B(n, q = \mu(A))$ denote the number of points from the sample $(X_1,\dots,X_n)$ that fall in $A$. Then a mass of $|k/n - q|$ must travel between $A$ and $B$, a distance of at least $d_{\min}$. Thus, $W_p^p(\mu_n,\mu) \ge d_{\min}^p\, |k/n - q|$, and the result follows from the central limit theorem for $k$; see page 112 in the supplement for the full details.
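The $n^{-1/(2p)}$ rate is easy to observe numerically for the two-point measure $\mu = (\delta_0 + \delta_1)/2$, for which the transport argument above gives exactly $W_p^p(\mu_n,\mu) = |k/n - 1/2|$. A Monte Carlo sketch (helper names and sample sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def wp_two_point(n, p, rng):
    """W_p(mu_n, mu) for mu = (delta_0 + delta_1)/2: a mass of |k/n - 1/2|
    must travel distance 1, so W_p^p(mu_n, mu) = |k/n - 1/2| exactly."""
    k = rng.binomial(n, 0.5)
    return abs(k / n - 0.5) ** (1.0 / p)

def mean_wp(n, p, reps=2000):
    """Monte Carlo estimate of E W_p(mu_n, mu)."""
    return float(np.mean([wp_two_point(n, p, rng) for _ in range(reps)]))

# Multiplying n by 16 should divide E W_p by roughly 16^{1/(2p)}:
# a factor of about 4 for p = 1, but only about 2 for p = 2.
ratios = {p: mean_wp(100, p) / mean_wp(1600, p) for p in (1, 2)}
```

The observed ratios match the predicted $16^{1/(2p)}$, illustrating how the rate deteriorates as $p$ grows.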

These lower bounds are valid on any separable metric space. On the real line, it is easy to obtain a sufficient condition for the optimal rate $n^{-1/2}$ to be attained for $W_1$: since $nF_n(t) \sim B(n, F(t))$, the variance of $F_n(t)$ is $F(t)(1-F(t))/n$, and we have (by Fubini's theorem and Jensen's inequality)

$$\mathbb{E}W\_1(\mu\_n, \mu) = \int\_{\mathbb{R}} \mathbb{E}|F\_n(t) - F(t)| \, \mathrm{d}t \le n^{-1/2} \int\_{\mathbb{R}} \sqrt{F(t)(1 - F(t))} \, \mathrm{d}t,$$

so that $W_1(\mu_n,\mu)$ is of the optimal order $n^{-1/2}$ if

$$J\_1(\mu) := \int\_{\mathbb{R}} \sqrt{F(t)(1 - F(t))} \,\mathrm{d}t < \infty.$$

Since the integrand is bounded by $1/2$, this is certainly satisfied if $\mu$ is compactly supported. The $J_1$ condition is essentially a moment condition: for any $\delta > 0$ and $X \sim \mu$, $\mathbb{E}|X|^{2+\delta} < \infty \implies J_1(\mu) < \infty \implies \mathbb{E}|X|^2 < \infty$. It turns out that this condition is necessary, and it has a more subtle counterpart for any $p \ge 1$. Let $f$ denote the density of the absolutely continuous part of $\mu$ (so $f \equiv 0$ if $\mu$ is discrete).
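For the uniform measure on $[0,1]$ one has $J_1(\mu) = \int_0^1 \sqrt{t(1-t)}\,\mathrm{d}t = \pi/8$, and the bound $\mathbb{E}W_1(\mu_n,\mu) \le J_1(\mu)/\sqrt{n}$ can be checked by simulation (our own grid discretisation of $W_1 = \int |F_n - F|$):

```python
import numpy as np

rng = np.random.default_rng(0)

def w1_to_uniform(x, m=5000):
    """W_1 between the empirical measure of the sample x and U(0,1),
    via W_1 = int_0^1 |F_n(t) - t| dt, evaluated on a fine grid."""
    t = (np.arange(m) + 0.5) / m
    Fn = np.searchsorted(np.sort(x), t, side="right") / len(x)
    return float(np.mean(np.abs(Fn - t)))

n = 400
bound = (np.pi / 8.0) / np.sqrt(n)        # J_1(U(0,1)) / sqrt(n)
est = np.mean([w1_to_uniform(rng.uniform(size=n)) for _ in range(200)])
```

On average, the simulated distance sits at roughly $\sqrt{2/\pi} \approx 0.8$ of the bound, which is the slack introduced by Jensen's inequality in the display above.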

**Theorem 4.7.5 (Rate of Convergence of Empirical Measures)** *Let p* ≥ 1 *and* μ ∈ *Wp*(R)*. The condition*

$$J_p(\mu) = \int_{\mathbb{R}} \frac{[F(t)(1 - F(t))]^{p/2}}{[f(t)]^{p-1}} \, \mathrm{d}t < \infty \qquad (0^0 = 1)$$

*is necessary and sufficient for $\mathbb{E}W_p(\mu_n,\mu) = O(n^{-1/2})$.*

See Bobkov and Ledoux [25, Theorem 5.10] for a proof of the $J_p$ condition, and their Theorems 5.1 and 5.3 for the values of the constants and a stronger result.

When p > 1, finiteness of J_p(μ) forces the support of μ to be connected; this is not needed when p = 1. Moreover, the J_p condition is satisfied when *f* is bounded below (in which case the support of μ must be compact). However, smoothness alone does not suffice, even for measures with positive density on a compact support. More precisely, we have:

**Proposition 4.7.6** *For any rate* ε_n → 0 *there exists a measure* μ *on* [−1, 1] *with positive* C^∞ *density there, and such that for all n*

$$\mathbb{E}W\_p(\mu\_n,\mu) \ge C(p,\mu)\, n^{-1/(2p)}\, \varepsilon\_n.$$

The rate n^{−1/(2p)} from Lemma 4.7.4 is the worst among compactly supported measures on ℝ. Indeed, by Jensen's inequality and (2.2), for any μ ∈ P([0, 1]),

$$\mathbb{E}W\_p(\mu\_n, \mu) \le \left[\mathbb{E}W\_p^p(\mu\_n, \mu)\right]^{1/p} \le \left[\mathbb{E}W\_1(\mu\_n, \mu)\right]^{1/p} \le n^{-1/(2p)}.$$

The proof of Proposition 4.7.6 proceeds by "smoothing" the construction in Lemma 4.7.4, and is given on page 113 in the supplement.

Let us now put this in the context of Theorem 4.6.3. In the binomial case, since each Π_i^{(n)} and each Λ_i are independent, we have

$$\mathbb{E}\left[W\_2(\Lambda\_i, \widetilde{\Lambda}\_i) \,\middle|\, \Lambda\_i\right] \le \sqrt{2J\_2(\Lambda\_i)}\,\frac{1}{\sqrt{\tau\_n}}.$$

(In the Poisson case, we need to condition on N_i^{(n)} and then estimate its inverse square root, as is done in the proof of Theorem 4.6.3.) Therefore, a sufficient condition for the rate 1/√τ_n to hold is that E√(J_2(Λ)) < ∞, and a necessary condition is that P(J_2(Λ) < ∞) = 1. These hold if there exists δ > 0 such that, with probability one, Λ has a density bounded below by δ. Since Λ = T#λ, this will happen provided that λ itself has a density bounded below and T has a derivative bounded below. Bigot et al. [23] show that the rate 1/√τ_n cannot be improved.

We conclude by proving a lower bound for absolutely continuous measures and stating, without proof, an upper bound.

**Proposition 4.7.7** *Let* μ ∈ W_1(ℝ^d) *have an absolutely continuous part with respect to Lebesgue measure, and let* ν_n *be any discrete measure supported on n points (or fewer). Then there exists a constant* C(μ) > 0 *such that*

$$W\_p(\mu, \nu\_n) \ge W\_1(\mu, \nu\_n) \ge C(\mu) n^{-1/d}.$$

*Proof.* Let *f* be the density of the absolutely continuous part μ_c, and observe that for some finite number *M*,

$$2\delta = \mu\_c(\{x : f(x) \le M\}) > 0.$$

Let x_1,...,x_n be the support points of ν_n and let ε > 0. Let μ_{c,M} be the restriction of μ_c to the set where the density is smaller than *M*. The union of the balls B_ε(x_i) has μ_{c,M}-measure of at most

$$M\sum\_{i=1}^{n}\mathrm{Leb}(B\_{\varepsilon}(x\_i)) = Mn\varepsilon^d \,\mathrm{Leb}\_d(B\_1(0)) = Mn\varepsilon^d C\_d = \delta,$$

if ε^d = δ(nMC_d)^{−1}. Thus, a mass 2δ − δ = δ must travel more than ε from ν_n to μ in order to cover μ_{c,M}. Hence

$$W\_1(\nu\_n, \mu) \ge \delta\varepsilon = \delta\,(\delta/(MC\_d))^{1/d}\, n^{-1/d}.$$

The lower bound holds because we need of the order of ε^{−d} balls of radius ε in order to cover a sufficiently large fraction of the mass of μ. The determining quantity for *upper* bounds on the empirical Wasserstein distance are the *covering numbers*

N(μ, ε, τ) = minimal number of balls of radius ε whose union has μ-measure ≥ 1 − τ.

Since μ is tight, these are finite for all ε, τ > 0, and they increase as ε and τ approach zero. To put the following bound in context, notice that if μ is compactly supported on ℝ^d, then N(μ, ε, 0) ≤ Kε^{−d}.
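The bound N(μ, ε, 0) ≤ Kε^{−d} can be made concrete by a grid construction: a cube of side 2ε/√d has half-diagonal ε, so it fits inside an ε-ball, and of the order of ε^{−d} such cubes tile [0, 1]^d. A minimal sketch of this counting argument (the function name is ours):

```python
import math

def covering_number_unit_cube(eps, d):
    """Upper bound on N(mu, eps, 0) for any mu supported in [0,1]^d:
    tile the cube by sub-cubes of side 2*eps/sqrt(d); each sub-cube fits
    inside a ball of radius eps centred at the sub-cube's centre."""
    side = 2 * eps / math.sqrt(d)
    return math.ceil(1 / side) ** d

# growth is of order eps^{-d}: halving eps multiplies the count by about 2^d
for d in (1, 2, 3):
    print(d, [covering_number_unit_cube(eps, d) for eps in (0.5, 0.25, 0.125)])
```

For instance, in d = 2 the bound grows as 4, 9, 36 along ε = 1/2, 1/4, 1/8, consistent with the ε^{−2} order.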

**Theorem 4.7.8** *If for some* d > 2p*,* N(μ, ε, ε^{dp/(d−2p)}) ≤ Kε^{−d}*, then* E W_p(μ_n, μ) ≤ C_p n^{−1/d}*.*

Comparing this with the lower bound in Lemma 4.7.4, we see that in the high-dimensional regime d > 2p, absolutely continuous measures have a worse rate than discrete ones. In the low-dimensional regime d < 2p, the situation is the opposite. We also obtain that for d > 2 and a compactly supported absolutely continuous μ ∈ W_1(ℝ^d), E W_1(μ_n, μ) ∼ n^{−1/d}.

#### **4.8 Bibliographical Notes**

Our exposition in this chapter closely follows the papers Panaretos and Zemel [100] and Zemel and Panaretos [134].

Books on functional data analysis include Ramsay and Silverman [109, 110], Ferraty and Vieu [51], Horváth and Kokoszka [70], and Hsing and Eubank [71], and a recent review is also available (Wang et al. [127]). The specific topic of amplitude and phase variation is discussed in [110, Chapter 7] and [127, Section 5.2]. The next paragraph gives some selective references.

One of the first functional registration techniques employed dynamic programming (Wang and Gasser [128]) and dates back to Sakoe and Chiba [118]. Landmark registration consists of identifying salient features for each curve, called *landmarks*, and aligning them (Gasser and Kneip [61]; Gervini and Gasser [63]). In pairwise synchronisation (Tang and Müller [122]), one aligns each pair of curves and then derives an estimator of the warp functions by linear averaging of the pairwise registration maps. Another class of methods involves a template curve, to which each observation is registered by minimising a discrepancy criterion; the template is then iteratively updated (Wang and Gasser [129]; Ramsay and Li [108]). James [72] defines a "feature function" for each curve and uses the moments of the feature function to guarantee identifiability. Elastic registration employs the Fisher–Rao metric, which is invariant to warpings, and calculates averages in the resulting quotient space (Tucker et al. [123]). Other techniques include semiparametric modelling (Rønn [115]; Gervini and Gasser [64]) and principal components registration (Kneip and Ramsay [82]). More details can be found in the review article by Marron et al. [90]. Wrobel et al. [131] have recently developed a registration method for functional data with a discrete flavour. It is also noteworthy that a version of the Wasserstein metric can be used in the functional case (Chakraborty and Panaretos [34]).

The literature on the point process case is scarcer; see the review by Wu and Srivastava [133].

A parametric version of Theorem 4.2.4 was first established by Bigot and Klein [22, Theorem 5.1] in ℝ^d, and extended to a compact nonparametric formulation in Zemel and Panaretos [134]. There is an infinite-dimensional linear version in Masarotto et al. [91]. The current level of generality appears to be new.

Theorem 4.4.1 is a stronger version of Panaretos and Zemel [100, Theorem 1], where it was assumed that τ_n must diverge to infinity faster than log n. An analogous construction under the Bayesian paradigm can be found in Galasso et al. [58]. Optimality of the rates of convergence in Theorem 4.6.3 is discussed in detail by Bigot et al. [23], where finiteness of the functional J_2 (see Sect. 4.7) is assumed and consequently O_P(τ_n^{−1/4}) is improved to O_P(τ_n^{−1/2}).

As far as we know, Theorem 4.6.5 (taken from [100]) is the first central limit theorem for Fréchet means in Wasserstein space. When the measures Λ_i are observed exactly (no amplitude variation: τ_n = ∞ and σ = 0), Kroshnin et al. [84] have recently proven a central limit theorem for random Gaussian measures in arbitrary dimension, extending a previous result of Agueh and Carlier [3]. It seems likely that in a fully nonparametric setting, the rates of convergence (compare Theorem 4.6.3) might be slower than √n; see Ahidar-Coutrix et al. [4].

The magnitude of the amplitude variation in Theorem 4.6.3 pertains to the rate of convergence of E W_p(μ_n, μ) to zero (Sect. 4.7). This is a topic of intense research, dating back to the seminal paper by Dudley [46], where a version of Theorem 4.7.8 with p = 1 is shown for the bounded Lipschitz metric. The lower bounds proven in this section were adapted from [46], Fournier and Guillin [54], and Weed and Bach [130].

The version of Theorem 4.7.8 given here can be found in [130] and extends Boissard and Le Gouic [27]. Both papers [27, 130] work in a general setting of complete separable metric spaces. An additional log n term appears in the limiting case d = 2p, as already noted (for p = 1) by [46] and in the classical work of Ajtai et al. [5] for μ uniform on [0, 1]^2. More general results are available in [54]. A longer (but far from complete) bibliography is given in the recent review by Panaretos and Zemel [101, Subsection 3.3.1], including works by Barthe, Dobrić, Talagrand, and coauthors on almost sure results and deviation bounds for the empirical Wasserstein distance.

The J_1 condition is due to del Barrio et al. [43], who showed it to be necessary and sufficient for the empirical process √n(F_n − F) to converge in distribution to B ∘ F, with B a Brownian bridge. The extension to 1 ≤ p ≤ ∞ (and a lot more) can be found in Bobkov and Ledoux [25], who employ order statistics and beta distributions to reduce to the uniform case. Alternatively, one may consult Mason [92], who uses weighted approximations to Brownian bridges.

An important aspect that was not covered here is statistical inference for the Wasserstein distance on the basis of the empirical measure. This is a challenging question, and results by del Barrio, Munk, and coauthors are available for one-dimensional, elliptical, or discrete measures, as explained in [101, Section 3].

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Construction of Fréchet Means and Multicouplings**

When the given measures μ^1,...,μ^N are supported on the real line, computing their Fréchet mean μ̄ is straightforward (Sect. 3.1.4). This is in contrast to the multivariate case, where, apart from the important yet special case of compatible measures, closed-form formulae are not available. This chapter presents an iterative procedure that provably approximates at least a Karcher mean under mild restrictions on the measures μ^1,...,μ^N. The algorithm is based on the differentiability properties of the Fréchet functional developed in Sect. 3.1.6 and can be interpreted as classical steepest descent in the Wasserstein space W_2(ℝ^d). It reduces the problem of finding the Fréchet mean to a succession of pairwise transport problems, involving only the Monge–Kantorovich problem between two measures. In the Gaussian case (or any location-scatter family), the latter can be solved explicitly, rendering the algorithm particularly appealing (see Sect. 5.4.1).

This chapter can be seen as complementary to Chap. 4. On the one hand, one can use the proposed algorithm to construct the regularised Fréchet–Wasserstein estimator λ̂_n that approximates a population version (see Sect. 4.3). On the other hand, it could be that the object of interest is the sample μ^1,...,μ^N itself, but that the latter is observed with some amount of noise. If one only has access to proxies μ̂^1,...,μ̂^N, then it is natural to use their Fréchet mean μ̄̂ as an estimator of μ̄. The proposed algorithm can then be used, in principle, in order to construct μ̄̂, and the consistency framework of Sect. 4.4 then allows one to conclude that if each μ̂^i is consistent, then so is μ̄̂.

After presenting the algorithm in Sect. 5.1, we make some connections to Procrustes analysis in Sect. 5.2. A convergence analysis of the algorithm is carried out in Sect. 5.3, after which examples are given in Sect. 5.4. An extension to infinitely many measures is sketched in Sect. 5.5.

V. M. Panaretos, Y. Zemel, *An Invitation to Statistics in Wasserstein Space*, SpringerBriefs in Probability and Mathematical Statistics, https://doi.org/10.1007/978-3-030-38438-8_5

**Electronic Supplementary Material** The online version of this chapter (https://doi.org/10.1007/978-3-030-38438-8_5) contains supplementary material.


#### **5.1 A Steepest Descent Algorithm for the Computation of Frechet Means ´**

Throughout this section, we assume that *N* is a fixed integer and consider a fixed collection

$$\mu^1,\dots,\mu^N \in \mathcal{W}\_2(\mathbb{R}^d) \quad \text{with } \mu^1 \text{ absolutely continuous with bounded density},\tag{5.1}$$

whose unique (Proposition 3.1.8) Fréchet mean μ̄ is sought. It has been established that if γ is absolutely continuous, then the associated Fréchet functional

$$F(\gamma) = \frac{1}{2N} \sum\_{i=1}^{N} W\_2^2(\mu^i, \gamma), \qquad \gamma \in \mathcal{W}\_2(\mathbb{R}^d),$$

has Fréchet derivative (Theorem 3.1.14)

$$F'(\gamma) = -\frac{1}{N} \sum\_{i=1}^{N} \log\_{\gamma}(\mu^i) = -\frac{1}{N} \sum\_{i=1}^{N} \left(\mathbf{t}\_{\gamma}^{\mu^i} - \mathbf{i}\right) \in \mathrm{Tan}\_{\gamma}\tag{5.2}$$

at γ. Let γ_j ∈ W_2(ℝ^d) be an absolutely continuous measure, representing our current estimate of the Fréchet mean at step *j*. Then it makes sense to introduce a step size τ_j > 0 and to follow the steepest descent of *F* given by the negative of the gradient:

$$\gamma\_{j+1} = \exp\_{\gamma\_j}\left(-\tau\_j F'(\gamma\_j)\right) = \left[\mathbf{i} + \frac{\tau\_j}{N}\sum\_{i=1}^N \log\_{\gamma\_j}(\mu^i)\right]\#\gamma\_j = \left[\mathbf{i} + \frac{\tau\_j}{N}\sum\_{i=1}^N \left(\mathbf{t}\_{\gamma\_j}^{\mu^i} - \mathbf{i}\right)\right]\#\gamma\_j.$$

In order to carry out a further descent step at γ_{j+1}, it needs to be verified that *F* is differentiable at γ_{j+1}, which amounts to showing that the latter stays absolutely continuous. This will happen for all but countably many values of the step size τ_j, and is guaranteed if the latter is contained in [0, 1]:

**Lemma 5.1.1 (Regularity of the Iterates)** *If* γ_0 *is absolutely continuous and* τ = τ_0 ∈ [0, 1]*, then* γ_1 = exp_{γ_0}(−τ_0 F′(γ_0)) *is also absolutely continuous.*

The idea is that a push-forward of γ_0 under a monotone map is absolutely continuous if and only if the monotonicity is strict, a property preserved by averaging. See page 118 in the supplement for the details.

Lemma 5.1.1 suggests that the step size should be restricted to [0, 1]. The next result shows that the optimal step size, achieving the maximal guaranteed reduction of the objective function (thus corresponding to an approximate line search), is exactly equal to 1. It does not rely on finite-dimensional arguments and holds when ℝ^d is replaced by a separable Hilbert space.

**Lemma 5.1.2 (Optimal Stepsize)** *If* γ_0 ∈ W_2(ℝ^d) *is absolutely continuous, then*

$$F(\gamma\_1) - F(\gamma\_0) \le -\|F'(\gamma\_0)\|^2 \left[\tau - \frac{\tau^2}{2}\right]$$

*and the bound on the right-hand side of the last display is minimised when* τ = 1*.*

*Proof.* Let S_i = **t**_{γ_0}^{μ^i} be the optimal map from γ_0 to μ^i, and set W_i = S_i − **i**. Then

$$2NF(\gamma\_0) = \sum\_{i=1}^{N} W\_2^2(\gamma\_0, \mu^i) = \sum\_{i=1}^{N} \int\_{\mathbb{R}^d} \|S\_i - \mathbf{i}\|^2 \, \mathrm{d}\gamma\_0 = \sum\_{i=1}^{N} \|W\_i\|\_{\mathcal{L}^2(\gamma\_0)}^2.\tag{5.3}$$

Both γ_1 and μ^i can be written as push-forwards of γ_0, and (2.3) gives the bound

$$W\_2^2(\gamma\_1, \mu^i) \le \int\_{\mathbb{R}^d} \left\|\left[(1-\tau)\mathbf{i} + \frac{\tau}{N}\sum\_{j=1}^N S\_j\right] - S\_i\right\|^2 \mathrm{d}\gamma\_0 = \left\|-W\_i + \frac{\tau}{N}\sum\_{j=1}^N W\_j\right\|\_{\mathcal{L}^2(\gamma\_0)}^2.$$

For brevity, we omit the subscript L^2(γ_0) from the norms and inner products. Developing the squares, summing over i = 1,...,N, and using (5.3) gives

$$\begin{split} 2NF(\gamma\_1) &\le \sum\_{i=1}^{N} \|W\_i\|^2 - 2\frac{\tau}{N}\sum\_{i,j=1}^{N}\left\langle W\_i, W\_j\right\rangle + N\tau^2\left\|\sum\_{j=1}^{N}\frac{1}{N}W\_j\right\|^2 \\ &= 2NF(\gamma\_0) - 2N\tau\left\|\sum\_{i=1}^{N}\frac{1}{N}W\_i\right\|^2 + N\tau^2\left\|\sum\_{i=1}^{N}\frac{1}{N}W\_i\right\|^2, \end{split}$$

and recalling that F′(γ_0) = −N^{−1}∑_{i=1}^N W_i yields

$$F(\gamma\_1) - F(\gamma\_0) \le \frac{\tau^2 - 2\tau}{2}\left\|\frac{1}{N}\sum\_{i=1}^N W\_i\right\|^2 = -\left\|F'(\gamma\_0)\right\|^2\left[\tau - \frac{\tau^2}{2}\right].$$

To conclude, observe that τ − τ^2/2 is maximised at τ = 1.
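The quadratic expansion in the last two displays is a pure Hilbert-space identity, and can be sanity-checked numerically by replacing the elements W_i of L^2(γ_0) with finite-dimensional stand-in vectors (the dimensions and seed below are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, tau = 5, 7, 0.6
W = rng.normal(size=(N, m))   # stand-ins for W_i = S_i - i in L^2(gamma_0)
Wbar = W.mean(axis=0)

# left-hand side: sum_i || -W_i + (tau/N) sum_j W_j ||^2
lhs = sum(np.sum((-W[i] + tau * Wbar) ** 2) for i in range(N))
# right-hand side: sum_i ||W_i||^2 - 2 N tau ||Wbar||^2 + N tau^2 ||Wbar||^2
rhs = np.sum(W ** 2) - 2 * N * tau * np.sum(Wbar ** 2) + N * tau ** 2 * np.sum(Wbar ** 2)
assert np.isclose(lhs, rhs)
# hence F(gamma_1) - F(gamma_0) <= (tau^2 - 2 tau)/2 * ||Wbar||^2 <= 0 on [0, 2]
```

The identity holds for any τ; the descent guarantee τ − τ^2/2 ≥ 0 singles out the range τ ∈ [0, 2], with the best constant at τ = 1.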

In light of Lemmata 5.1.1 and 5.1.2, we will always take τ_j = 1. The resulting iteration is summarised as Algorithm 1. A first step in the convergence analysis is that the sequence (F(γ_j)) is nonincreasing and that for any integer *k*,

$$\frac{1}{2}\sum\_{j=0}^{k}\|F'(\gamma\_j)\|^2 \le \sum\_{j=0}^{k}\left[F(\gamma\_j) - F(\gamma\_{j+1})\right] = F(\gamma\_0) - F(\gamma\_{k+1}) \le F(\gamma\_0).$$

As k → ∞, the infinite sum on the left-hand side converges, so ‖F′(γ_j)‖^2 must vanish as j → ∞.

**Remark 5.1.3** *The proof of Proposition 3.1.2 suggests a generalisation of Algorithm 1 to arbitrary measures in* W_2(ℝ^d)*, even if none is absolutely continuous. One can verify that Lemmata 5.1.2 and 5.3.5 (below) also hold in this setting, so it may be that the convergence results apply as well. The iteration no longer has the interpretation of steepest descent, however.*

#### **Algorithm 1** Steepest descent via Procrustes analysis


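Algorithm 1 amounts to iterating γ_{j+1} = [N^{−1}∑_i **t**_{γ_j}^{μ^i}]#γ_j (step size τ_j = 1) from an absolutely continuous starting point. In the Gaussian case anticipated in Sect. 5.4.1, the optimal maps are linear and explicit, so the iteration can be sketched in a few lines of numpy. This is a minimal illustration under that Gaussian assumption, not the book's implementation (function names are ours):

```python
import numpy as np

def sqrtm_psd(M):
    # square root of a symmetric positive semidefinite matrix via eigh
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def optimal_map(S, Sigma):
    # matrix A of the optimal map x -> Ax from N(0, S) to N(0, Sigma):
    # A = S^{-1/2} (S^{1/2} Sigma S^{1/2})^{1/2} S^{-1/2}
    R = sqrtm_psd(S)
    Rinv = np.linalg.inv(R)
    return Rinv @ sqrtm_psd(R @ Sigma @ R) @ Rinv

def frechet_mean_gaussian(Sigmas, n_iter=50):
    # Algorithm 1 with tau_j = 1: gamma_{j+1} = [N^{-1} sum_i t_i] # gamma_j,
    # which for centred Gaussians is the covariance update S -> M S M
    S = np.eye(Sigmas[0].shape[0])
    for _ in range(n_iter):
        M = sum(optimal_map(S, Sig) for Sig in Sigmas) / len(Sigmas)
        S = M @ S @ M
    return S

# commuting example: the barycenter covariance is (mean of the square roots)^2
S = frechet_mean_gaussian([np.diag([1.0, 4.0]), np.diag([9.0, 1.0])])
print(S)
```

For the two diagonal covariances above the maps commute, a single step already lands on the barycenter diag(4, 2.25), and the iteration then stays there, consistent with the descent interpretation.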
#### **5.2 Analogy with Procrustes Analysis**

Algorithm 1 is similar in spirit to another procedure, *generalised Procrustes analysis*, that is used in shape theory. Given a subset B ⊆ ℝ^d, most commonly a finite collection of labelled points called *landmarks*, an interesting question is how to mathematically define the *shape* of *B*. One way to reach such a definition is to disregard those properties of *B* that are deemed irrelevant for what one considers this shape should be; typically, these would include its location, its orientation, and/or its scale. Accordingly, the shape of *B* can be defined as the equivalence class consisting of all sets obtained as *gB*, where *g* belongs to a collection *G* of transformations of ℝ^d containing all combinations of rotations, translations, dilations, and/or reflections (Dryden and Mardia [45, Chapter 4]).

If B_1 and B_2 are two collections of *k* landmarks, one may define the distance between their shapes as the infimum of ‖B_1 − gB_2‖^2 over the group *G*. In other words, one seeks to *register* B_2 as close as possible to B_1 by using elements of the group *G*, with distance being measured as the sum of squared Euclidean distances between the transformed points of B_2 and those of B_1. In a sense, one can think about the shape problem and the Monge problem as dual to each other. In the former, one is given constraints on how to optimally carry out the registration of the points, with the cost being judged by how successful the registration procedure is. In the latter, one imposes that the registration be done *exactly*, and evaluates the cost by how much the space must be deformed in order to achieve this.

The optimal *g* and the resulting distance can be found in closed form by means of *ordinary Procrustes analysis* [45, Section 5.2]. Suppose now that we are given N > 2 collections of points, B_1,...,B_N, with the goal of minimising the sum of squares ∑ ‖g_iB_i − g_jB_j‖^2 over g_i ∈ *G*.¹ As in the case of Fréchet means in W_2(ℝ^d) (Sect. 3.1.2), there is a formulation in terms of sums of squares from the average N^{−1}∑ g_jB_j. Unfortunately, there is no explicit solution for this problem when d ≥ 3. Like Algorithm 1, generalised Procrustes analysis (Gower [66]; Dryden and Mardia [45, p. 90]) tackles this "multimatching" setting by iteratively solving the pairwise problem as follows. Choose one of the configurations as an initial estimate/template, then register every other configuration to the template, employing ordinary Procrustes analysis. The new template is then given by the linear average of the registered configurations, and the process is iterated subsequently.

Paralleling this framework, Algorithm 1 iterates the two steps of registration and linear averaging given the current template γ_j, but in a different manner:

- *registration*: each μ^i is matched to the template γ_j by computing the optimal transport map **t**_{γ_j}^{μ^i};
- *averaging*: the template is updated by pushing γ_j forward under the linear average N^{−1}∑_i **t**_{γ_j}^{μ^i} of these maps.


Notice that in the Procrustes sense, the maps that register each μ^i to the template γ_j are **t**_{μ^i}^{γ_j}, the inverses of **t**_{γ_j}^{μ^i}. We will not use the term "registration maps" in the sequel, to avoid possible confusion.

#### **5.3 Convergence of Algorithm 1**

In order to tackle the issue of convergence, we will use an approach that is specific to the nature of optimal transport. This is because the Hessian-type arguments that are used to prove similar convergence results for steepest descent on Riemannian manifolds (Afsari et al. [1]) or for Procrustes algorithms (Le [86]; Groisser [67]) do not apply here, since the Fréchet functional may very well fail to be twice differentiable.

In fact, even in Euclidean spaces, convergence of steepest descent usually requires a Lipschitz bound on the derivative of *F* (Bertsekas [19, Subsection 1.2.2]). Unfortunately, *F* is not known to be differentiable at discrete measures, and these constitute a dense set in W_2; consequently, this Lipschitz condition is very unlikely to hold. Still, the specific geometry of the Wasserstein space affords some advantages; for instance, we will place no restriction on the starting point of the iteration, except that it be absolutely continuous, and no assumption on how "spread out" the collection μ^1,...,μ^N is will be necessary, as in, for example, [1, 67, 86].

<sup>1</sup> One needs to add an additional constraint to prevent registering all the collections to the origin.

**Theorem 5.3.1 (Limit Points are Karcher Means)** *Let* μ^1,...,μ^N ∈ W_2(ℝ^d) *be probability measures and suppose that one of them is absolutely continuous with a bounded density. Then the sequence generated by Algorithm 1 stays in a compact set of the Wasserstein space* W_2(ℝ^d)*, and any limit point of the sequence is a Karcher mean of* (μ^1,...,μ^N)*.*

Since the Fréchet mean μ̄ is a Karcher mean (Proposition 3.1.8), we obtain immediately:

**Corollary 5.3.2 (Wasserstein Convergence of Steepest Descent)** *Under the conditions of Theorem 5.3.1, if F has a unique stationary point, then the sequence* {γ_j} *generated by Algorithm 1 converges to the Fréchet mean* μ̄ *of* {μ^1,...,μ^N} *in the Wasserstein metric,*

$$W\_2(\gamma\_j, \bar{\mu}) \longrightarrow 0, \qquad j \to \infty.$$

Alternatively, combining Theorem 5.3.1 with the optimality criterion of Theorem 3.1.15 shows that the algorithm converges to μ̄ when the appropriate assumptions on {μ^i} and on the Karcher mean μ = lim γ_j are satisfied. This allows one to conclude that Algorithm 1 converges to the unique Fréchet mean when the μ^i are Gaussian measures (see Theorem 5.4.1).

The proof of Theorem 5.3.1 is rather elaborate, since we need to use methods that are tailored to the Wasserstein space. Before giving the proof, we state two important consequences. The first is the uniform convergence of the optimal maps **t**_{γ_j}^{μ^i} to **t**_{μ̄}^{μ^i} on compacta. This convergence does not immediately follow from the Wasserstein convergence of γ_j to μ̄, and is also established for the inverses. Both the formulation and the proof of this result are similar to those of Theorem 4.4.3.

**Theorem 5.3.3 (Uniform Convergence of Optimal Maps)** *Under the conditions of Corollary 5.3.2, there exist sets* A, B_1,...,B_N ⊆ ℝ^d *such that* μ̄(A) = 1 = μ^1(B_1) = ··· = μ^N(B_N) *and*

$$\sup\_{\Omega\_1}\left\|\mathbf{t}\_{\gamma\_j}^{\mu^i} - \mathbf{t}\_{\bar{\mu}}^{\mu^i}\right\| \xrightarrow{j\to\infty} 0, \qquad \sup\_{\Omega\_2^i}\left\|\mathbf{t}\_{\mu^i}^{\gamma\_j} - \mathbf{t}\_{\mu^i}^{\bar{\mu}}\right\| \xrightarrow{j\to\infty} 0, \qquad i = 1,\dots,N,$$

*for any pair of compacta* Ω_1 ⊆ A*,* Ω_2^i ⊆ B_i*. If in addition all the measures* μ^1,...,μ^N *have the same support, then one can choose all the sets* B_i *to be the same.*

The other consequence is convergence of the optimal multicouplings.

**Corollary 5.3.4 (Convergence of Multicouplings)** *Under the conditions of Corollary 5.3.2, the sequence of multicouplings*

$$\left(\mathbf{t}\_{\gamma\_j}^{\mu^1}, \dots, \mathbf{t}\_{\gamma\_j}^{\mu^N}\right)\#\gamma\_j$$

*of* {μ^1,...,μ^N} *converges (in Wasserstein distance on* (ℝ^d)^N*) to the optimal multicoupling* (**t**_{μ̄}^{μ^1},...,**t**_{μ̄}^{μ^N})#μ̄*.*

The proofs of Theorem 5.3.3 and Corollary 5.3.4 are given at the end of the present section.

The proof of Theorem 5.3.1 is achieved by establishing the following facts:

1. the sequence (γ_j) stays in a compact subset of W_2(ℝ^d) (Lemma 5.3.5);
2. the densities of the iterates are uniformly bounded, so that every limit of a subsequence is absolutely continuous (Proposition 5.3.6);
3. the iteration map *A* is continuous on the relevant set of measures (Proposition 5.3.7).


Since it has already been established that ‖F′(γ_j)‖ → 0, these three facts indeed suffice.

**Lemma 5.3.5** *The sequence generated by Algorithm 1 stays in a compact subset of the Wasserstein space W*2(R*d*)*.*

*Proof.* For all j ≥ 1, γ_j takes the form M_N#π, where M_N(x_1,...,x_N) = N^{−1}∑_{i=1}^N x_i and π is a multicoupling of μ^1,...,μ^N. The compactness of this set has been established in Step 2 of the proof of Theorem 3.1.5; see page 63 in the supplement, where this is done in a more complicated setup.

A closer look at the proof reveals that a more general result holds true. Let *A* denote the steepest descent iteration, that is, A(γ_j) = γ_{j+1}. Then the image of *A*, {A(μ) : μ ∈ W_2(ℝ^d) absolutely continuous}, has a compact closure in W_2(ℝ^d). This remains true if ℝ^d is replaced by a separable Hilbert space.

In order to show that a weakly convergent sequence (γ_j) of absolutely continuous measures has an absolutely continuous limit γ, it suffices to show that the densities of the γ_j are uniformly bounded. Indeed, if *C* is such a bound, then for any open O ⊆ ℝ^d, liminf γ_k(O) ≤ C Leb(O), so γ(O) ≤ C Leb(O) by the portmanteau Lemma 1.7.1. It follows that γ is absolutely continuous with density bounded by *C*. We now show that such a *C* can be found that applies to all measures in the image of *A*, hence to all sequences resulting from iterations of Algorithm 1.

**Proposition 5.3.6 (Uniform Density Bound)** *For each* i = 1,...,N*, denote by* g^i *the density of* μ^i *(if it exists) and by* ‖g^i‖_∞ *its supremum, taken to be infinite if* g^i *does not exist (or if* g^i *is unbounded). Let* γ_0 *be any absolutely continuous probability measure. Then the density of* γ_1 = A(γ_0) *is bounded by the* 1/d*-th harmonic mean of the* ‖g^i‖_∞*,*

$$C\_{\mu} = \left[ \frac{1}{N} \sum\_{i=1}^{N} \frac{1}{||\mathbf{g}^{i}||\_{\infty}^{1/d}} \right]^{-d}.$$

The constant C_μ depends only on the measures (μ^1,...,μ^N), and is finite as long as at least one μ^i has a bounded density, since C_μ ≤ N^d ‖g^i‖_∞ for any *i*.

*Proof.* Let h_j denote the density of γ_j, j = 0, 1. By the change of variables formula, for γ_0-almost any *x*,

$$h\_1(\mathbf{t}\_{\gamma\_0}^{\gamma\_1}(x)) = \frac{h\_0(x)}{\det \nabla \mathbf{t}\_{\gamma\_0}^{\gamma\_1}(x)}; \qquad g^i(\mathbf{t}\_{\gamma\_0}^{\mu^i}(x)) = \frac{h\_0(x)}{\det \nabla \mathbf{t}\_{\gamma\_0}^{\mu^i}(x)}, \qquad \text{when } g^i \text{ exists.}$$

(Convex functions are twice differentiable Lebesgue-almost everywhere (Villani [125, Theorem 14.25]), hence these gradients are well-defined γ_0-almost surely.) We seek a lower bound on the determinant of ∇**t**_{γ_0}^{γ_1}(x), which by definition equals

$$N^{-d}\det\sum\_{i=1}^N \nabla\mathbf{t}\_{\gamma\_0}^{\mu^i}(x).$$

Such a bound is provided by the Brunn–Minkowski inequality (Stein and Shakarchi [121, Section 1.5]) for symmetric positive semidefinite matrices

$$\left[\det(A+B)\right]^{1/d} \ge \left[\det A\right]^{1/d} + \left[\det B\right]^{1/d},$$

which, applied inductively, yields

$$\left[\det \nabla \mathbf{t}\_{\gamma\_0}^{\gamma\_1}(x)\right]^{1/d} \ge \frac{1}{N}\sum\_{i=1}^{N}\left[\det \nabla \mathbf{t}\_{\gamma\_0}^{\mu^i}(x)\right]^{1/d}.$$

From this, we obtain an upper bound for *h*1:

$$\frac{1}{h\_1^{1/d}(\mathbf{t}\_{\gamma\_0}^{\gamma\_1}(x))} = \frac{\det^{1/d}\sum\_{i=1}^N \nabla \mathbf{t}\_{\gamma\_0}^{\mu^i}(x)}{N h\_0^{1/d}(x)} \ge \frac{1}{N}\sum\_{i=1}^N \frac{1}{[g^i(\mathbf{t}\_{\gamma\_0}^{\mu^i}(x))]^{1/d}} \ge \frac{1}{N}\sum\_{i=1}^N \frac{1}{\|g^i\|\_{\infty}^{1/d}} = C\_{\mu}^{-1/d}.$$

Let Σ be the set of points where this inequality holds; then γ_0(Σ) = 1. Hence

$$\gamma\_1\left(\mathbf{t}\_{\gamma\_0}^{\gamma\_1}(\Sigma)\right) = \gamma\_0\left[(\mathbf{t}\_{\gamma\_0}^{\gamma\_1})^{-1}(\mathbf{t}\_{\gamma\_0}^{\gamma\_1}(\Sigma))\right] \ge \gamma\_0(\Sigma) = 1.$$

Thus, γ_1-almost surely,

$$h\_1(y) \le C\_{\mu}.$$
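The bound can be sanity-checked in a Gaussian example, where suprema of densities are explicit: for μ^i = N(0, Σ_i) one has ‖g^i‖_∞ = ((2π)^d det Σ_i)^{−1/2}, and one descent step from γ_0 = N(0, I) can be computed by hand, since the optimal maps are x ↦ Σ_i^{1/2}x and their average pushes γ_0 to another Gaussian. The example matrices below are ours:

```python
import numpy as np

def sup_density_gauss(Sigma):
    # peak of the N(0, Sigma) density in R^d
    d = Sigma.shape[0]
    return 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

d = 2
Sigmas = [np.diag([1.0, 4.0]), np.diag([9.0, 1.0])]
# one step from gamma_0 = N(0, I): the optimal maps are diag(1,2) and
# diag(3,1), their average is diag(2, 1.5), so gamma_1 = N(0, diag(4, 2.25))
gamma1 = np.diag([4.0, 2.25])

# C_mu = [N^{-1} sum_i ||g^i||_inf^{-1/d}]^{-d}, as in the proposition
C_mu = np.mean([sup_density_gauss(S) ** (-1.0 / d) for S in Sigmas]) ** (-d)
assert sup_density_gauss(gamma1) <= C_mu
```

Here the peak density of γ_1 is 1/(6π) ≈ 0.053, while C_μ ≈ 0.064, so the uniform bound holds with room to spare.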

The third statement (continuity of *A*) is much more subtle to establish, and its rather lengthy proof is given next. In view of Proposition 5.3.6, the uniform bound on the densities is not a hindrance for the proof of convergence of Algorithm 1.

**Proposition 5.3.7** *Let* (γ_n) *be a sequence of absolutely continuous measures with uniformly bounded densities, suppose that* W_2(γ_n, γ) → 0*, and let*

$$\eta\_n = \left(\mathbf{t}\_{\gamma\_n}^{\mu^1}, \dots, \mathbf{t}\_{\gamma\_n}^{\mu^N}, \mathbf{i}\right)\#\gamma\_n, \qquad \eta = \left(\mathbf{t}\_{\gamma}^{\mu^1}, \dots, \mathbf{t}\_{\gamma}^{\mu^N}, \mathbf{i}\right)\#\gamma.$$

*Then* η*j* → η *in W*2([R*d*] *<sup>N</sup>*+1)*.*

*Proof.* As has been established in the discussion before Proposition 5.3.6, the limit γ must be absolutely continuous, so η is well-defined.

In view of Theorem 2.2.1, it suffices to show that if $h : (\mathbb{R}^d)^{N+1} \to \mathbb{R}$ is any continuous nonnegative function such that

$$|h(t_1, \ldots, t_N, \boldsymbol{y})| \le \frac{2}{N} \sum_{i=1}^N \|t_i\|^2 + 2\|\boldsymbol{y}\|^2,$$

then

$$\int_{\mathbb{R}^d} g_n \,\mathrm{d}\gamma_n = \int_{(\mathbb{R}^d)^{N+1}} h \,\mathrm{d}\eta_n \to \int_{(\mathbb{R}^d)^{N+1}} h \,\mathrm{d}\eta = \int_{\mathbb{R}^d} g \,\mathrm{d}\gamma, \qquad g_n(\boldsymbol{x}) = h\big(\mathbf{t}_{\gamma_n}^{\mu^1}(\boldsymbol{x}), \dots, \mathbf{t}_{\gamma_n}^{\mu^N}(\boldsymbol{x}), \boldsymbol{x}\big),$$

and *g* defined analogously. The proof, given in full detail on page 124 of the supplement, is sketched here.

**Step 1: Truncation.** Since the $\gamma_n$ converge in the Wasserstein space, they satisfy the uniform integrability (2.4) and absolute continuity (2.7) conditions by Theorem 2.2.1. Consequently, $g_{n,R} = \min(g_n, 4R)$ is uniformly close to $g_n$:

$$\sup_n \int \big[g_n(\boldsymbol{x}) - g_{n,R}(\boldsymbol{x})\big] \,\mathrm{d}\gamma_n(\boldsymbol{x}) \to 0, \qquad R \to \infty.$$

We may thus replace $g_n$ by the bounded version $g_{n,R}$.

**Step 2: Convergence of** $g_n$ **to** $g$**.** By Proposition 1.7.11, the optimal maps $\mathbf{t}_{\gamma_n}^{\mu^i}$ converge to $\mathbf{t}_{\gamma}^{\mu^i}$ and (since $h$ is continuous) $g_n \to g$ uniformly on "nice" sets $\Omega \subseteq E = \operatorname{supp}\gamma$. Write

$$
\int g_{n,R} \,\mathrm{d}\gamma_n - \int g_R \,\mathrm{d}\gamma = \int g_R \,\mathrm{d}(\gamma_n - \gamma) + \int_{\Omega} (g_{n,R} - g_R) \,\mathrm{d}\gamma_n + \int_{\mathbb{R}^d\setminus\Omega} (g_{n,R} - g_R) \,\mathrm{d}\gamma_n.
$$

**Step 3: Bounding the first two integrals.** The first integral vanishes as *n* → ∞, by the portmanteau Lemma 1.7.1, and the second by uniform convergence.

**Step 4: Bounding the third integral.** The integrand is bounded by $8R$, so it suffices to bound the measures of $\mathbb{R}^d \setminus \Omega$. This step is somewhat technical, and uses the uniform density bound on $(\gamma_n)$ and the portmanteau lemma.

**Corollary 5.3.8 (Continuity of** *A* **)** *If* $W_2(\gamma_n, \gamma) \to 0$ *and the* $\gamma_n$ *have uniformly bounded densities, then* $A(\gamma_n) \to A(\gamma)$*.*

*Proof.* Choose *h* in the proof of Proposition 5.3.7 to depend only on *y*.

*Proof (of Corollary 5.3.4).* Choose $h$ in the proof of Proposition 5.3.7 to depend only on $t_1, \ldots, t_N$.

*Proof (of Theorem 5.3.3).* Let $E = \operatorname{supp}\bar\mu$ and set $A_i = E_{\mathrm{den}} \cap \{\boldsymbol{x} : \mathbf{t}_{\bar\mu}^{\mu^i}(\boldsymbol{x}) \text{ is univalued}\}$. As $\bar\mu$ is absolutely continuous, $\bar\mu(A_i) = 1$, and the same is true for $A = \cap_{i=1}^N A_i$. The first assertion then follows from Proposition 1.7.11.

The second statement is proven similarly. Let $E^i = \operatorname{supp}\mu^i$ and notice that, by absolute continuity, the set $B_i = (E^i)_{\mathrm{den}} \cap \{\boldsymbol{x} : \mathbf{t}_{\mu^i}^{\bar\mu}(\boldsymbol{x}) \text{ is univalued}\}$ has measure 1 with respect to $\mu^i$. Apply Proposition 1.7.11. If in addition $E^1 = \cdots = E^N$, then $\mu^i(B) = 1$ for $B = \cap B_i$.

#### **5.4 Illustrative Examples**

As an illustration, we implement Algorithm 1 in several scenarios for which pairwise optimal maps can be calculated explicitly at every iteration, allowing for fast computation without error propagation. In each case, we give some theory first, describing how the optimal maps are calculated, and then implement Algorithm 1 on simulated examples.

#### *5.4.1 Gaussian Measures*

No example illustrates the use of Algorithm 1 better than the Gaussian case. This is so because optimal maps between centred nondegenerate Gaussian measures with covariances *A* and *B* have the explicit form (see Sect. 1.6.3)

$$\mathbf{t}_A^B(\boldsymbol{x}) = A^{-1/2} \left[A^{1/2} B A^{1/2}\right]^{1/2} A^{-1/2} \boldsymbol{x}, \qquad \boldsymbol{x} \in \mathbb{R}^d,$$

with the obvious slight abuse of notation. In contrast, the Fréchet mean of a collection of Gaussian measures (at least one of which is nonsingular) does not admit a closed-form formula and is only known to be a Gaussian measure whose covariance matrix Γ is the unique invertible root of the matrix equation

$$\Gamma = \frac{1}{N} \sum_{i=1}^{N} \left[\Gamma^{1/2} S_i \Gamma^{1/2}\right]^{1/2},\tag{5.4}$$

where $S_i$ is the covariance matrix of $\mu^i$.

Given the formula for $\mathbf{t}_A^B$, application of Algorithm 1 to Gaussian measures is straightforward. The next result shows that, in the Gaussian case, the iterates must converge to the unique Fréchet mean, and that (5.4) can be derived from the characterisation of Karcher means.

**Theorem 5.4.1 (Convergence in Gaussian Case)** *Let* $\mu^1, \ldots, \mu^N$ *be Gaussian measures with zero means and covariance matrices* $S_i$*, with* $S_1$ *nonsingular, and let the initial point* $\gamma_0$ *be* $N(0, \Gamma_0)$ *with* $\Gamma_0$ *nonsingular. Then the sequence of iterates generated by Algorithm 1 converges to the unique Fréchet mean of* $(\mu^1, \ldots, \mu^N)$*.*

*Proof.* Since the optimal maps are linear, so is their mean, and therefore $\gamma_k$ is a Gaussian measure for all $k$, say $N(0, \Gamma_k)$ with $\Gamma_k$ nonsingular. Any limit point γ of $(\gamma_k)$ is a Karcher mean by Theorem 5.3.1. If we knew that γ itself were Gaussian, then it must in fact be the Fréchet mean, because $N^{-1}\sum \mathbf{t}_{\gamma}^{\mu^i}$ equals the identity everywhere on $\mathbb{R}^d$ (see the discussion before Theorem 3.1.15).


Let us show that every limit point γ is indeed Gaussian. It suffices to prove that $(\Gamma_k)$ is a bounded sequence, because if $\Gamma_k \to \Gamma$, then $N(0, \Gamma_k) \to N(0, \Gamma)$ weakly, as can be seen from either Scheffé's theorem (the densities converge) or Lévy's continuity theorem (the characteristic functions converge).

To see that $(\Gamma_k)$ is bounded, observe first that for any centred (Gaussian or not) measure μ with covariance matrix $S$,

$$W_2^2(\mu, \delta_0) = \operatorname{tr}[S],$$

where $\delta_0$ is a Dirac mass at the origin. (This follows from the spectral decomposition of $S$.) Therefore

$$0 \le \operatorname{tr}[\Gamma_k] = W_2^2(\gamma_k, \delta_0)$$

is bounded uniformly, because $\{\gamma_k\}$ stays in a Wasserstein-compact set by Lemma 5.3.5. If we define $C = \sup_k \operatorname{tr}[\Gamma_k] < \infty$, then all the diagonal elements of $\Gamma_k$ are bounded uniformly by $C$. When $A$ is symmetric and positive semidefinite, $2|A_{ij}| \le A_{ii} + A_{jj}$. Consequently, all the entries of $\Gamma_k$ are bounded uniformly by $C$, which means that $(\Gamma_k)$ is a bounded sequence.

From the formula for the optimal maps, we see that if Γ is the covariance of the Fréchet mean, then

$$I = \frac{1}{N} \sum\_{i=1}^{N} \Gamma^{-1/2} \left[ \Gamma^{1/2} S\_i \Gamma^{1/2} \right]^{1/2} \Gamma^{-1/2}$$

and we recover the fixed point equation (5.4).

If the means are nonzero, then the optimal maps are affine and the same result applies; the Fréchet mean is still a Gaussian measure with covariance matrix Γ and mean equal to the average of the means of the $\mu^i$, $i = 1, \ldots, N$.
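To make the Gaussian iteration concrete, here is a minimal Python sketch (the function names `sqrtm_psd` and `gaussian_frechet_mean` are ours, not from the text). It iterates $\Gamma_{k+1} = T_k \Gamma_k T_k$, where $T_k$ is the average of the optimal maps $\mathbf{t}_{\Gamma_k}^{S_i}$; its fixed points satisfy (5.4):

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric positive semidefinite square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_frechet_mean(covs, Gamma0=None, tol=1e-12, max_iter=2000):
    """Algorithm 1 for centred Gaussians: iterate Gamma <- T Gamma T with
    T = N^{-1} sum_i Gamma^{-1/2} [Gamma^{1/2} S_i Gamma^{1/2}]^{1/2} Gamma^{-1/2},
    the average of the optimal map matrices t_{Gamma}^{S_i}."""
    d = covs[0].shape[0]
    Gamma = np.eye(d) if Gamma0 is None else Gamma0
    for _ in range(max_iter):
        R = sqrtm_psd(Gamma)
        Rinv = np.linalg.inv(R)
        T = sum(Rinv @ sqrtm_psd(R @ S @ R) @ Rinv for S in covs) / len(covs)
        Gamma_next = T @ Gamma @ T  # covariance of the pushforward by T
        if np.linalg.norm(Gamma_next - Gamma) < tol:
            return Gamma_next
        Gamma = Gamma_next
    return Gamma
```

At a fixed point $T = I$, which is exactly the statement that Γ solves the matrix equation (5.4).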

Figure 5.1 shows density plots of $N = 4$ centred Gaussian measures on $\mathbb{R}^2$ with covariances $S_i \sim \text{Wishart}(I_2, 2)$, and Fig. 5.2 shows the density of the resulting Fréchet mean. In this particular example, the algorithm needed 11 iterations starting from the identity matrix. The corresponding optimal maps are displayed in Fig. 5.3. It is apparent from the figure that these maps are linear, and upon closer reflection one can verify that their average is the identity. The four plots in the figure are remarkably different, in accordance with the measures themselves having widely varying condition numbers and orientations; $\mu^3$, and even more so $\mu^4$, are very concentrated, so the optimal maps "sweep" the mass towards zero. In contrast, the optimal maps to $\mu^1$ and $\mu^2$ spread the mass out away from the origin.

Fig. 5.1: Density plot of four Gaussian measures in $\mathbb{R}^2$.

Fig. 5.2: Density plot of the Fréchet mean of the measures in Fig. 5.1

Fig. 5.3: Gaussian example: vector fields depicting the optimal maps $\boldsymbol{x} \mapsto \mathbf{t}_{\bar\mu}^{\mu^i}(\boldsymbol{x})$ from the Fréchet mean $\bar\mu$ of Fig. 5.2 to the four measures $\{\mu^i\}$ of Fig. 5.1. The order corresponds to that of Fig. 5.1

#### *5.4.2 Compatible Measures*

We next discuss the behaviour of the algorithm when the measures are compatible. Recall that a collection $C \subseteq W_2(X)$ is *compatible* if for all $\gamma, \mu, \nu \in C$, $\mathbf{t}_{\mu}^{\nu} \circ \mathbf{t}_{\gamma}^{\mu} = \mathbf{t}_{\gamma}^{\nu}$ in $L^2(\gamma)$ (Definition 2.3.1). Boissard et al. [28] showed that when this condition holds, the Fréchet mean of $(\mu^1, \ldots, \mu^N)$ can be found by simple computations involving the *iterated barycentre*. We again denote by $\gamma_0$ the initial point of Algorithm 1, which can be any absolutely continuous measure.

**Lemma 5.4.2 (Compatibility and Convergence)** *If* $\{\gamma_0\} \cup \{\mu^i\}$ *is compatible, then Algorithm 1 converges to the Fréchet mean of* $(\mu^i)$ *after a single step.*

*Proof.* By definition, the next iterate is

$$\gamma_1 = \left[\frac{1}{N} \sum_{i=1}^{N} \mathbf{t}_{\gamma_0}^{\mu^i} \right] \# \gamma_0,$$

which is the Fréchet mean by Theorem 3.1.9.

In this case, Algorithm 1 requires the calculation of $N$ pairwise optimal maps, and this can be reduced to $N - 1$ if the initial point is chosen to be $\mu^1$. This is the same computational complexity as the calculation of the iterated barycentre proposed in [28].

When the measures have a common copula, finding the optimal maps reduces to finding the optimal maps between the one-dimensional marginals (see Lemma 2.3.3), and this can be done using quantile functions as described in Sect. 1.5. The marginal Fréchet means are then plugged into the common copula to yield the joint Fréchet mean. We next illustrate Algorithm 1 in three such scenarios.

#### **5.4.2.1 The One-Dimensional Case**

When the measures are supported on the real line, there is no need to use the algorithm, since the Fréchet mean admits a closed-form expression in terms of quantile functions (see Sect. 3.1.4). We nevertheless discuss this case briefly because we build upon this construction in subsequent examples. Given that $\mathbf{t}_{\mu}^{\nu} = F_{\nu}^{-1} \circ F_{\mu}$, we may apply Algorithm 1 starting from one of these measures (or any other measure). Figure 5.4 plots $N = 4$ univariate densities and the Fréchet mean yielded by the algorithm in two different scenarios. At the left, the densities were generated as

$$f^i(\mathbf{x}) = \frac{1}{2}\phi\left(\frac{\mathbf{x} - m\_1^i}{\sigma\_1^i}\right) + \frac{1}{2}\phi\left(\frac{\mathbf{x} - m\_2^i}{\sigma\_2^i}\right),\tag{5.5}$$

with φ the standard normal density, and the parameters generated independently as

$$m\_1^i \sim U[-13, -3], \quad m\_2^i \sim U[3, 13], \quad \sigma\_1^i, \sigma\_2^i \sim Gamma(4, 4).$$

At the right of Fig. 5.4, we used a mixture of a shifted gamma and a Gaussian:

$$f^i(x) = \frac{1}{2} \frac{(\beta^i)^3}{\Gamma(3)} (x - m_3^i)^2 e^{-\beta^i (x - m_3^i)} + \frac{1}{2} \phi(x - m_4^i),\tag{5.6}$$

with

$$
\beta^i \sim \text{Gamma}(4, 1), \quad m\_3^i \sim U[1, 4], \quad m\_4^i \sim U[-4, -1].
$$

The resulting Fréchet mean density for both settings is shown in thick light blue, and can be seen to capture the bimodal nature of the data. Even though the Fréchet mean of Gaussian mixtures is not itself a Gaussian mixture, it is approximately so, provided that the peaks are separated enough. Figure 5.5 shows the optimal maps pushing the Fréchet mean $\bar\mu$ to the measures $\mu^1, \ldots, \mu^N$ in each case. If one ignores the "middle part" of the $x$ axis, the maps appear (approximately) affine for small values of $x$ and for large values of $x$, indicating how the peaks are shifted. In the middle region, the maps need to "bridge the gap" between the different slopes and intercepts of these affine maps.
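The one-dimensional construction is simple to implement. The sketch below (our own illustrative code; the name `frechet_mean_1d` is not from the text) represents each measure by a sample and averages the empirical quantile functions, following $F^{-1}(q) = N^{-1}\sum_i F_i^{-1}(q)$:

```python
import numpy as np

def frechet_mean_1d(samples, qs=None):
    """W2 Fréchet mean of one-dimensional measures, each given by a sample.
    The mean's quantile function is the average of the individual empirical
    quantile functions: F^{-1}(q) = N^{-1} sum_i F_i^{-1}(q)."""
    if qs is None:
        qs = np.linspace(0.005, 0.995, 199)
    quantiles = np.array([np.quantile(s, qs) for s in samples])
    return qs, quantiles.mean(axis=0)  # grid and quantile function of the mean
```

For measures given by densities rather than samples, one would instead compute each $F_i^{-1}$ by numerical inversion of the corresponding CDF.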

Fig. 5.4: Densities of a bimodal Gaussian mixture (left) and a mixture of a Gaussian with a gamma (right), with the Fréchet mean density in light blue

Fig. 5.5: Optimal maps $\mathbf{t}_{\bar\mu}^{\mu^i}$ from the Fréchet mean $\bar\mu$ to the four measures $\{\mu^i\}$ in Fig. 5.4. The left plot corresponds to the bimodal Gaussian mixture, and the right plot to the Gaussian/gamma mixture

#### **5.4.2.2 Independence**

We next take measures $\mu^i$ on $\mathbb{R}^2$, having independent marginal densities $f_X^i$ as in (5.5) and $f_Y^i$ as in (5.6). Figure 5.6 shows the density plot of $N = 4$ such measures, constructed as the products of the measures from Fig. 5.4. One can discern the independence from the "parallel" structure of the figures: for every pair $(y_1, y_2)$, the ratio $g(x, y_1)/g(x, y_2)$ does not depend on $x$ (and vice versa, interchanging $x$ and $y$). Figure 5.7 plots the density of the resulting Fréchet mean. We observe that the Fréchet mean captures the four peaks and their locations. Furthermore, the parallel nature of the figure is preserved in the Fréchet mean. Indeed, by Lemma 3.1.11 the Fréchet mean is a product measure. The optimal maps, in Fig. 5.10, are the same as in the next example, and will be discussed there.

Fig. 5.6: Density plots of the four product measures of the measures in Fig. 5.4

Fig. 5.7: Density plot of the Fréchet mean of the measures in Fig. 5.6

#### **5.4.2.3 Common Copula**

Let $\mu^i$ be a measure on $\mathbb{R}^2$ with density

$$g^i(\mathbf{x}, \mathbf{y}) = c(F\_X^i(\mathbf{x}), F\_Y^i(\mathbf{y})) f\_X^i(\mathbf{x}) f\_Y^i(\mathbf{y}),$$

where $f_X^i$ and $f_Y^i$ are random densities on the real line with distribution functions $F_X^i$ and $F_Y^i$, and $c$ is a copula density. Figure 5.8 shows the density plot of $N = 4$ such measures, with $f_X^i$ generated as in (5.5), $f_Y^i$ as in (5.6), and $c$ the Frank($-8$) copula density, while Fig. 5.9 plots the density of the Fréchet mean obtained. (For ease of comparison, we use the same realisations of the densities that appear in Fig. 5.4.) The Fréchet mean can be seen to preserve the shape of the density, having four clearly distinguished peaks. Figure 5.10, depicting the resulting optimal maps, allows for a clearer interpretation: for instance, the leftmost plot (in black) shows more clearly that the map splits the mass around $x = -2$ over a much wider interval; conversely, a very large amount of mass is sent to $x \approx 2$. This rather extreme behaviour matches the peak of the density of $\mu^1$ located at $x = 2$.

Fig. 5.8: Density plots of four measures in $\mathbb{R}^2$ with Frank copula of parameter $-8$

Fig. 5.9: Density plot of the Fréchet mean of the measures in Fig. 5.8

Fig. 5.10: Frank copula example: vector fields of the optimal maps $\mathbf{t}_{\bar\mu}^{\mu^i}$ from the Fréchet mean $\bar\mu$ of Fig. 5.9 to the four measures $\{\mu^i\}$ of Fig. 5.8. The colours match those of Fig. 5.4

#### *5.4.3 Partially Gaussian Trivariate Measures*

We now apply Algorithm 1 in a situation that entangles two of the previous settings. Let $U$ be a fixed $3 \times 3$ real orthogonal matrix with columns $U_1, U_2, U_3$ and let $\mu^i$ have density

$$g^i(y_1, y_2, y_3) = g^i(\boldsymbol{y}) = f^i(U_3^\top \boldsymbol{y}) \frac{1}{2\pi \sqrt{\det \Sigma^i}} \exp\left[-\frac{(U_1^\top \boldsymbol{y},\, U_2^\top \boldsymbol{y})\, (\Sigma^i)^{-1} \binom{U_1^\top \boldsymbol{y}}{U_2^\top \boldsymbol{y}}}{2}\right],$$

with $f^i$ a bounded density on the real line and $\Sigma^i \in \mathbb{R}^{2\times 2}$ positive definite. We simulated $N = 4$ such densities with $f^i$ as in (5.5) and $\Sigma^i \sim \text{Wishart}(I_2, 2)$. We apply Algorithm 1 to this collection of measures and find their Fréchet mean (see the end of this subsection for precise details on how the optimal maps were calculated). Figure 5.11 shows level sets of the resulting densities at a specific level. The bimodal nature of $f^i$ implies that for most values of $a$, $\{x : f^i(x) = a\}$ has four elements. Hence, the level sets in the figures are unions of four separate parts, with each peak of $f^i$ contributing two parts that together form the boundary of an ellipsoid in $\mathbb{R}^3$ (see Fig. 5.12). The principal axes of these ellipsoids and their positions in $\mathbb{R}^3$ differ between the measures, but the Fréchet mean can be viewed as an average of those in some sense.

In terms of orientation (principal axes) of the ellipsoids, the Fréchet mean is most similar to $\mu^1$ and $\mu^2$, whose orientations are similar to one another.

Fig. 5.11: The set $\{\boldsymbol{v} \in \mathbb{R}^3 : g^i(\boldsymbol{v}) = 0.0003\}$ for $i = 1$ (black), the Fréchet mean (light blue), and $i = 2, 3, 4$ in red, green, and dark blue, respectively

Let us now see how the optimal maps are calculated. If $Y = (y_1, y_2, y_3) \sim \mu^i$, then the random vector $(x_1, x_2, x_3) = X = U^{-1}Y$ has joint density

$$f^i(x_3) \exp\left[-\frac{(x_1, x_2)\, (\Sigma^i)^{-1} \binom{x_1}{x_2}}{2}\right] \frac{1}{2\pi\sqrt{\det \Sigma^i}},$$

so the probability law of $X$ is $\rho^i \otimes \nu^i$, with $\rho^i$ centred Gaussian with covariance matrix $\Sigma^i$ and $\nu^i$ having density $f^i$ on $\mathbb{R}$. By Lemma 3.1.11, the Fréchet mean of $(U^{-1}\#\mu^i)$ is the product measure of that of $(\rho^i)$ and that of $(\nu^i)$; by Lemma 3.1.12, the Fréchet mean of $(\mu^i)$ is therefore

$$U\#\big(N(0, \Sigma) \otimes f\big), \qquad f = F', \quad F^{-1}(q) = \frac{1}{N}\sum_{i=1}^{N} F_i^{-1}(q), \quad F_i(x) = \int_{-\infty}^{x} f^i(s)\,\mathrm{d}s,$$

where Σ is the Fréchet–Wasserstein mean of $\Sigma^1, \ldots, \Sigma^N$.

Fig. 5.12: The set $\{\boldsymbol{v} \in \mathbb{R}^3 : g^i(\boldsymbol{v}) = 0.0003\}$ for $i = 3$ (left) and $i = 4$ (right), with each of the four different inverses of the bimodal density $f^i$ corresponding to a colour

Starting at an initial point $\gamma_0 = U\#(N(0, \Sigma_0) \otimes \nu_0)$, with $\nu_0$ having continuous distribution function $F_{\nu_0}$, the optimal maps are $U \circ \mathbf{t}_0^i \circ U^{-1} = \nabla(\varphi_0^i \circ U^{-1})$, with

$$\mathbf{t}_0^i(x_1, x_2, x_3) = \begin{pmatrix} \mathbf{t}_{\Sigma_0}^{\Sigma^i}(x_1, x_2) \\ F_i^{-1} \circ F_{\nu_0}(x_3) \end{pmatrix},$$

the gradients of the convex function

$$
\varphi_0^i(x_1, x_2, x_3) = \frac{1}{2}\,(x_1, x_2)\, \mathbf{t}_{\Sigma_0}^{\Sigma^i} \binom{x_1}{x_2} + \int_0^{x_3} F_i^{-1}\big(F_{\nu_0}(s)\big) \,\mathrm{d}s,
$$

where we identify $\mathbf{t}_{\gamma_0}^{\Sigma^i}$ with the positive definite matrix $(\Sigma^i)^{1/2}\big[(\Sigma^i)^{1/2}\Sigma_0(\Sigma^i)^{1/2}\big]^{-1/2}(\Sigma^i)^{1/2}$ that pushes forward $N(0, \Sigma_0)$ to $N(0, \Sigma^i)$. Due to the one-dimensionality, the algorithm finds the third component of the rotated measures after one step, but the convergence of the Gaussian component requires further iterations.
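The map just described can be assembled from a $2\times 2$ Gaussian block and a univariate quantile block. A minimal Python sketch (names are ours; `F0` and `Q1` stand for $F_{\nu_0}$ and $F_i^{-1}$, supplied by the user):

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric positive semidefinite square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_map(S0, S1):
    """Matrix of the optimal map pushing N(0, S0) forward to N(0, S1),
    S0^{-1/2} [S0^{1/2} S1 S0^{1/2}]^{1/2} S0^{-1/2}."""
    R = sqrtm_psd(S0)
    Rinv = np.linalg.inv(R)
    return Rinv @ sqrtm_psd(R @ S1 @ R) @ Rinv

def optimal_map(U, S0, S1, F0, Q1, y):
    """t(y) = U t0(U^T y): the Gaussian map acts on the first two rotated
    coordinates and Q1 o F0 on the third."""
    x = U.T @ y                        # rotate into the product coordinates
    out = np.empty(3)
    out[:2] = gaussian_map(S0, S1) @ x[:2]
    out[2] = Q1(F0(x[2]))
    return U @ out                     # rotate back
```

The Gaussian block is written in the equivalent form $\Sigma_0^{-1/2}[\Sigma_0^{1/2}\Sigma^i\Sigma_0^{1/2}]^{1/2}\Sigma_0^{-1/2}$, which agrees with the matrix identified in the text.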

#### **5.5 Population Version of Algorithm 1**

Let $\Lambda \in W_2(\mathbb{R}^d)$ be a random measure with finite Fréchet functional. The population version of (5.1) is

$$q = \mathbb{P}\big(\Lambda \text{ absolutely continuous with density bounded by } M\big) > 0 \quad \text{for some } M < \infty, \tag{5.7}$$

which we assume henceforth. This condition is satisfied if and only if

P(Λ absolutely continuous with bounded density) > 0.

These probabilities are well-defined because the set

$W_2(\mathbb{R}^d; M) = \{\mu \in W_2(\mathbb{R}^d) : \mu \text{ absolutely continuous with density bounded by } M\}$

is weakly closed (see the paragraph before Proposition 5.3.6), hence a Borel set of *W*2(R*d*).

In light of Theorem 3.2.13, we can define a population version of Algorithm 1 with the iteration function

$$A(\gamma) = \big[\mathbb{E}\,\mathbf{t}_{\gamma}^{\Lambda}\big] \# \gamma, \qquad \gamma \in W_2(\mathbb{R}^d) \text{ absolutely continuous}.$$

The (Bochner) expectation is well-defined in $L^2(\gamma)$ because the random map $\mathbf{t}_{\gamma}^{\Lambda}$ is measurable (Lemma 2.4.6). Since $L^2(\gamma)$ is a Hilbert space, the law of large numbers applies there, and results for the empirical version carry over to the population version by means of approximations. In particular:

**Lemma 5.5.1** *Any descent iterate* γ *has density bounded by* $q^{-d}M$*, where* $q$ *and* $M$ *are as in* (5.7)*.*

*Proof.* The result is true in the empirical case, by Proposition 5.3.6. The key point (observed by Pass [102, Subsection 3.3]) is that the number of measures does not appear in the bound $q^{-d}M$.

Let $\Lambda_1, \ldots$ be a sample from Λ and let $q_n$ be the proportion of measures in $(\Lambda_1, \ldots, \Lambda_n)$ that have density bounded by $M$. Then both $n^{-1}\sum_{i=1}^n \mathbf{t}_{\gamma}^{\Lambda_i} \to \mathbb{E}\mathbf{t}_{\gamma}^{\Lambda}$ and $q_n \to q$ almost surely by the law of large numbers. Pick any ω in the probability space for which this happens and notice that (invoking Lemma 2.4.5)

$$A(\gamma) = \left[\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \mathbf{t}_{\gamma}^{\Lambda_i}\right] \# \gamma = \lim_{n \to \infty} \left[\frac{1}{n} \sum_{i=1}^n \mathbf{t}_{\gamma}^{\Lambda_i}\right] \# \gamma.$$

Let $\lambda_n$ denote the measure in the last limit. By Proposition 5.3.6, its density is bounded by $q_n^{-d}M \to q^{-d}M$ almost surely, so for any $C > q^{-d}M$ and $n$ large, $\lambda_n$ has density bounded by $C$. By the portmanteau Lemma 1.7.1, so does $\lim \lambda_n = [\mathbb{E}\mathbf{t}_{\gamma}^{\Lambda}]\#\gamma$. Now let $C \downarrow q^{-d}M$.


Though it follows that every Karcher mean of Λ has a bounded density, we cannot yet conclude that the same bound holds for the Fréchet mean, because we need a priori knowledge that the latter is absolutely continuous. This again can be achieved by approximations:

**Theorem 5.5.2 (Bounded Density for Population Fréchet Mean)** *Let* $\Lambda \in W_2(\mathbb{R}^d)$ *be a random measure with finite Fréchet functional. If* Λ *has a bounded density with positive probability, then the Fréchet mean of* Λ *is absolutely continuous with a bounded density.*

*Proof.* Let $q$ and $M$ be as in (5.7), let $\Lambda_1, \ldots$ be a sample from Λ, and let $q_n$ be the proportion of $(\Lambda_i)_{i \le n}$ with density bounded by $M$. The empirical Fréchet mean $\lambda_n$ of the sample $(\Lambda_1, \ldots, \Lambda_n)$ has a density bounded by $q_n^{-d}M$. The Fréchet mean λ of Λ is unique by Proposition 3.2.7, and consequently $\lambda_n \to \lambda$ in $W_2(\mathbb{R}^d)$ by the law of large numbers (Corollary 3.2.10). For any $C > \limsup q_n^{-d}M$, the density of λ is bounded by $C$ by the portmanteau Lemma 1.7.1, and the limsup is $q^{-d}M$ almost surely. Thus, the density is bounded by $q^{-d}M$.

In the same way, one shows the population version of Theorem 3.1.9:

**Theorem 5.5.3 (Fréchet Mean of Compatible Measures)** *Let* $\Lambda : (\Omega, \mathcal{F}, \mathbb{P}) \to W_2(\mathbb{R}^d)$ *be a random measure with finite Fréchet functional, and suppose that with positive probability* Λ *is absolutely continuous and has bounded density. If the collection* $\{\gamma\} \cup \Lambda(\Omega)$ *is compatible and* γ *is absolutely continuous, then* $[\mathbb{E}\mathbf{t}_{\gamma}^{\Lambda}]\#\gamma$ *is the Fréchet mean of* Λ*.*

It is of course sufficient that {γ} ∪Λ(Ω \ *N* ) be compatible for some null set *N* ⊂ Ω.

#### **5.6 Bibliographical Notes**

The algorithm outlined in this chapter was suggested independently in this steepest descent form by Zemel and Panaretos [134] and in the form of a fixed point equation iteration by Álvarez-Esteban et al. [9]. These two papers provide different alternative proofs of Theorem 5.3.1. The exposition here is based on [134]. Although longer and more technical than the one in [9], the formalism in [134] is amenable to directly treating the optimal maps (Theorem 5.3.3) and the multicouplings (Corollary 5.3.4). On the flip side, it is noteworthy that the proof of the Gaussian case (Theorem 5.4.1) given in [9] is more explicit and quantitative; for instance, it shows the additional property that the traces of the matrix iterates are monotonically increasing.

Developing numerical schemes for computing Fréchet means in $W_2(\mathbb{R}^d)$ is a very active area of research, and readers are referred to the recent monograph of Peyré and Cuturi [103, Section 9.2] for a survey.

In recent work, Backhoff-Veraguas et al. [15] propose a stochastic steepest descent for finding Karcher means of a population Fréchet functional associated with a random measure Λ. At iterate $j$, one replaces $\gamma_j$ by

$$\left[t_j\, \mathbf{t}_{\gamma_j}^{\mu_j} + (1 - t_j)\, \mathbf{i}\right] \# \gamma_j, \qquad \mu_j \sim \Lambda.$$

The analogue of Theorem 5.3.1 holds under further conditions.
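In one dimension this stochastic iteration is easy to simulate, because pushing $\gamma_j$ forward by the monotone map $t_j \mathbf{t}_{\gamma_j}^{\mu_j} + (1-t_j)\mathbf{i}$ simply averages quantile functions: $Q_{j+1} = t_j Q_{\mu_j} + (1-t_j)Q_j$. The following sketch is our own illustrative construction (the choice of Λ, the step sizes, and the function name are ours, not from [15]): Λ draws $\mu_j = U[m, m+1]$ with $m \sim U[-1,1]$, whose population Fréchet mean is $U[0,1]$.

```python
import numpy as np

def stochastic_descent_1d(rng, qs, steps=4000):
    """Stochastic steepest descent on quantile functions. Here Lambda draws
    mu_j = U[m, m+1] with m ~ U[-1, 1]; pushing gamma_j forward by the
    monotone map t_j * t_{gamma_j}^{mu_j} + (1 - t_j) * id amounts to
    Q_{j+1} = t_j * Q_{mu_j} + (1 - t_j) * Q_j on quantile functions."""
    Q = rng.uniform(-1, 1) + qs          # initial point: one draw from Lambda
    for j in range(1, steps + 1):
        t_j = 1.0 / (j + 1)              # sum t_j = inf, sum t_j^2 < inf
        Q_mu = rng.uniform(-1, 1) + qs   # quantile function of mu_j = U[m, m+1]
        Q = t_j * Q_mu + (1 - t_j) * Q
    return Q
```

With these step sizes the iterate is exactly the running average of the sampled quantile functions, so it converges to the quantile function $Q(q) = q$ of the Fréchet mean $U[0,1]$.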

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **References**

