Karlsruher Schriften zur Anthropomatik

Band 55

Herausgeber: Prof. Dr.-Ing. habil. Jürgen Beyerer

Eine Übersicht aller bisher in dieser Schriftenreihe erschienenen Bände finden Sie am Ende des Buchs.

# **Probabilistic Parametric Curves for Sequence Modeling**

by Ronny Hug

Karlsruher Institut für Technologie
Institut für Anthropomatik und Robotik

Zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften von der KIT-Fakultät für Informatik des Karlsruher Instituts für Technologie (KIT) genehmigte Dissertation

von Ronny Hug, M.Sc.

Tag der mündlichen Prüfung: 16. Dezember 2021
Erster Gutachter: Prof. Dr.-Ing. Jürgen Beyerer
Zweiter Gutachter: Prof. Dr.-Ing. Marco Huber

#### **Impressum**

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2022 – Gedruckt auf FSC-zertifiziertem Papier

ISSN 1863-6489
ISBN 978-3-7315-1198-4
DOI 10.5445/KSP/1000146434

# **Abstract**

Representations of sequential data are commonly based on the assumption that observed sequences are realizations of an unknown underlying stochastic process. Usually, the determination of such a representation is construed as a learning problem and yields a sequence model. In this context, the model must be able to capture the multi-modal nature of the data, without blurring between single modes. For modeling the underlying stochastic process, commonly used neural network-based approaches either learn an implicit representation by using stochastic inputs or units, or learn to parameterize a probability distribution. As such, these models usually incorporate Monte Carlo or other approximation techniques in order to perform parameter estimation and probabilistic inference. This even holds true for regression-based approaches based on Mixture Density Networks, which still require Monte Carlo simulation for performing multi-modal inference. Thus, a research gap in fully regression-based approaches for parameter estimation and probabilistic inference emerges.

Towards this end, this thesis proposes a probabilistic extension to Bézier curves ($\mathcal{N}$-Curves), as a basis for effectively modeling continuous-time stochastic processes with a bounded index set. The proposed stochastic process model is denoted as the $\mathcal{N}$-Curve model and is based on Mixture Density Networks (MDN) and Bézier curves with Gaussian random variables as control points. Taking an MDN-based approach is in line with recent attempts to address the problem of quantifying uncertainty as a regression problem and yields a generic model, which is generally applicable as a basic model for probabilistic sequence modeling. Key advantages of the model include the ability to generate smooth multi-mode predictions in a single inference step, avoiding the need for Monte Carlo simulation. Further, being based on Bézier curves, the model can, in theory, be scaled up to high-dimensional sequence data by embedding the control points in a high-dimensional space. In order to approach theoretical limitations imposed by the restriction to a bounded index set, a conceptual extension to the $\mathcal{N}$-Curve model, capable of modeling infinite stochastic processes, is presented. Essential properties of the proposed approach and its extension are illustrated by several toy examples considering a sequence synthesis task.

With the original $\mathcal{N}$-Curve model being sufficient for most real-world applications, a thorough evaluation is conducted on different multi-step sequence prediction tasks for evaluating the capabilities of the model applied to real-world data. First, the model is evaluated against commonly used generic probabilistic sequence models on a human trajectory prediction task, proving the capabilities of the $\mathcal{N}$-Curve model, as it outperforms the other models in this comparison. A qualitative evaluation investigates the behavior of the model in a prediction context. Further, difficulties in assessing the performance of probabilistic sequence models in a multi-modal setting are discussed. In addition, the model is applied to a human motion prediction task, assessing the claimed scalability of the model to higher-dimensional data. In this task, the model outperforms commonly used simple and neural network-based baselines and performs on par with different state-of-the-art models on several occasions, proving its capabilities in this higher-dimensional example. Further, difficulties in covariance estimation and the smoothing property of the $\mathcal{N}$-Curve model are discussed.

# **Kurzfassung**

Repräsentationen sequenzieller Daten basieren in der Regel auf der Annahme, dass beobachtete Sequenzen Realisierungen eines unbekannten zugrundeliegenden stochastischen Prozesses sind. Die Bestimmung einer solchen Repräsentation wird üblicherweise als Lernproblem ausgelegt und ergibt ein Sequenzmodell. Das Modell muss in diesem Zusammenhang in der Lage sein, die multimodale Natur der Daten zu erfassen, ohne einzelne Modi zu vermischen. Zur Modellierung eines zugrundeliegenden stochastischen Prozesses lernen häufig verwendete, auf neuronalen Netzen basierende Ansätze entweder eine Wahrscheinlichkeitsverteilung zu parametrisieren oder eine implizite Repräsentation unter Verwendung stochastischer Eingaben oder Neuronen. Dabei integrieren diese Modelle in der Regel Monte Carlo Verfahren oder andere Näherungslösungen, um die Parameterschätzung und probabilistische Inferenz zu ermöglichen. Dies gilt sogar für regressionsbasierte Ansätze basierend auf Mixture Density Netzwerken, welche ebenso Monte Carlo Simulationen zur multi-modalen Inferenz benötigen. Daraus ergibt sich eine Forschungslücke für vollständig regressionsbasierte Ansätze zur Parameterschätzung und probabilistischen Inferenz.

Infolgedessen stellt die vorliegende Arbeit eine probabilistische Erweiterung für Bézierkurven ($\mathcal{N}$-Kurven) als Basis für die Modellierung zeitkontinuierlicher stochastischer Prozesse mit beschränkter Indexmenge vor. Das vorgestellte Modell, bezeichnet als $\mathcal{N}$-Kurven-Modell, basiert auf Mixture Density Netzwerken (MDN) und Bézierkurven, welche Kurvenkontrollpunkte als normalverteilt annehmen. Die Verwendung eines MDN-basierten Ansatzes steht im Einklang mit aktuellen Versuchen, Unsicherheitsschätzung als Regressionsproblem auszulegen, und ergibt ein generisches Modell, welches allgemein als Basismodell für die probabilistische Sequenzmodellierung einsetzbar ist. Ein wesentlicher Vorteil des Modells ist unter anderem die Möglichkeit, glatte, multi-modale Vorhersagen in einem einzigen Inferenzschritt zu generieren, ohne dabei Monte Carlo Simulationen zu benötigen. Durch die Verwendung von Bézierkurven als Basis kann das Modell außerdem theoretisch für beliebig hohe Datendimensionen verwendet werden, indem die Kontrollpunkte in einen hochdimensionalen Raum eingebettet werden. Um die durch den Fokus auf beschränkte Indexmengen existierenden theoretischen Einschränkungen aufzuheben, wird zusätzlich eine konzeptionelle Erweiterung für das $\mathcal{N}$-Kurven-Modell vorgestellt, mit der unendliche stochastische Prozesse modelliert werden können. Wesentliche Eigenschaften des vorgestellten Modells und dessen Erweiterung werden auf verschiedenen Beispielen zur Sequenzsynthese gezeigt.

Aufgrund der hinreichenden Anwendbarkeit des $\mathcal{N}$-Kurven-Modells auf die meisten Anwendungsfälle wird dessen Tauglichkeit umfangreich auf verschiedenen Mehrschrittprädiktionsaufgaben unter Verwendung realer Daten evaluiert. Zunächst wird das Modell gegen häufig verwendete probabilistische Sequenzmodelle im Kontext der Vorhersage von Fußgängertrajektorien evaluiert, wobei es sämtliche Vergleichsmodelle übertrifft. In einer qualitativen Auswertung wird das Verhalten des Modells in einem Vorhersagekontext untersucht. Außerdem werden Schwierigkeiten bei der Bewertung probabilistischer Sequenzmodelle in einem multimodalen Setting diskutiert. Darüber hinaus wird das Modell im Kontext der Vorhersage menschlicher Bewegungen angewendet, um die angestrebte Skalierbarkeit des Modells auf höherdimensionale Daten zu bewerten. Bei dieser Aufgabe übertrifft das Modell allgemein verwendete einfache und auf neuronalen Netzen basierende Grundmodelle und ist in verschiedenen Situationen auf Augenhöhe mit verschiedenen State-of-the-Art-Modellen, was die Einsetzbarkeit in diesem höherdimensionalen Beispiel zeigt. Des Weiteren werden Schwierigkeiten bei der Kovarianzschätzung und die Glättungseigenschaften des $\mathcal{N}$-Kurven-Modells diskutiert.

# **Acknowledgements**

Throughout the past few years I have received a great deal of support and assistance in my goal of pursuing my PhD.

First of all, I would like to thank my dissertation advisor Prof. Dr.-Ing. Jürgen Beyerer for his guidance and feedback, which were invaluable to complete this thesis. I am grateful to Prof. Dr.-Ing. Marco Huber, Prof. Dr. Wolfgang Karl, Prof. Dr.-Ing. Jörg Henkel, and Prof. Dr. Gregor Snelting for agreeing to be part of my examination committee. In particular, I would like to thank Prof. Dr.-Ing. Marco Huber for his interest in my work and for being in the committee as a second advisor.

I want to acknowledge my colleagues of the *Object Recognition Department* at the *Fraunhofer Institute of Optronics, System Technologies and Image Exploitation* for their constant assistance. Our conversations were essential for solving various challenges surrounding my thesis. In particular, I would like to thank Dr. Stefan Becker for his ongoing encouragement and our joint research providing a strong foundation for parts of this thesis. Special thanks go to Dr. Wolfgang Hübner and Dr. Michael Arens for providing excellent conditions, assistance and feedback to prepare this thesis.

Lastly, I want to thank my family and friends for their continuous support and encouragement throughout the years. In particular, I want to thank Melanie and Maximilian Hemgesberg for inspiring conversations about my work, and finally my wonderful wife Simone for her unconditional support and our little son Leon for being the little sunshine that he is.

# **Contents**




# **Notation**

This chapter introduces the notation and symbols which are used in this thesis.


## **General notation**

## **Probability Distributions**


## **Numbers, Indexing and Conventions**



## **Parametric Curves and Sequence Modeling**



# **1 Introduction**

Sequential data, i.e. temporally ordered information, arises in the context of many different applications, for example risk assessment in autonomous driving or data-driven behavior analysis. In general, it is possible to reduce the majority of such use-cases to more abstract inference tasks, like sequence prediction. With real-world data being subject to noise and detection or annotation errors, the use of a probabilistic sequence model is favorable, as such models also take uncertainty in the data into account.

**(a)** Trajectory prediction **(b)** Out-of-distribution detection

**Figure 1.1:** Exemplary sequence modeling tasks on different levels of abstraction: 2D trajectory prediction in a constrained setting and out-of-distribution detection built upon the probabilistic sequence model derived when training the prediction model. Both tasks can contribute to a superordinate risk assessment application. The prediction task (1.1a) is concerned with future trajectory prediction (red, green and blue distributions) given an observed trajectory (solid cyan). In such a structured environment, a sequence model is learned, which is capable of capturing statistically relevant paths through the given scene. As the sequence model provides a model for the underlying data distribution, out-of-distribution detection can be performed given a trajectory (1.1b). In this example, moving on the pathway is valid under the model, but moving onto the grass is highly unlikely. The validity under the model is color-coded from red (not valid) to blue (valid). Figure 1.1b is taken from [Har17].

The determination of such a probabilistic sequence model is commonly laid out as a learning problem, where the model parameters are estimated from given data samples. This formulation as a learning problem goes along with the current dominance of deep learning approaches in a range of different fields related to sequential data. However, as they work with uncertainties and associated probability distributions, most current deep learning-based approaches for probabilistic sequence modeling rely on the calculation of intractable probability density functions. Because of that, variational or sample-based approximations are generally required during training and inference in such models. Although there exist regression-based approaches, which try to avoid the need for such expensive approximations during training, they still require Monte Carlo methods for inference.

Following this, a common ground for current sequence modeling approaches can be observed in their need for Monte Carlo methods during inference. Thus, a research gap in regression-based approaches for multi-modal probabilistic inference emerges.

Towards this end, the primary goal of this thesis revolves around the formulation of a fully regression-based probabilistic sequence model. In addition, common drawbacks of existing models should be avoided, i.e.:


Following this, this thesis proposes a probabilistic extension for parametric curves for use in probabilistic sequence modeling and provides an implementation of the resulting model based on regression neural networks. The motivation for basing the approach on parametric curves is driven by the following expectations: First, modeling full curves enables *instant* multi-step inference without iteration and without the need for Monte Carlo methods. Further, generated sequences are constrained by the underlying parametric curves. This, in turn, is expected to help stabilize training. In addition, artifact generation during inference should be mitigated. Finally, modeling a stochastic process in terms of a probabilistic parametric curve yields a compact representation of said stochastic process.

## **1.1 Contributions**

In compliance with the aforementioned primary objectives, the main contributions provided in this thesis revolve around a novel probabilistic sequence model, built on a probabilistic extension to parametric curves. As such, the contributions can be ascribed to three categories: *theory*, *algorithms* and *evaluation*.

*Theory:* A probabilistic extension to Bézier curves and Bézier splines capable of modeling multi-modal stochastic processes is derived. In this extension, the Bézier curve's control points are assumed to be Gaussian, thus passing the stochasticity on to the curve points through linear combination and resulting in a model for a continuous-time stochastic process. Discrete-time stochastic processes can be represented by discretizing such a probabilistic curve. Multi-modality is achieved by combining multiple probabilistic curves into a mixture.

*Algorithms:* A learning- and regression-based approach for applying these probabilistic parametric curves in different sequence modeling tasks, specifically synthesis and prediction, is proposed. The approach is based on a Mixture Density Network, which outputs the parameters for (a mixture of) probabilistic parametric curves. This enables multi-step sequence generation without iteration or the need for Monte Carlo methods. Several toy examples assess different aspects and qualities of the approach.

*Evaluation:* An extensive evaluation of the proposed model is provided for the task of human trajectory prediction on real-world datasets. In addition, the common approach to evaluation in human trajectory prediction is examined with an attempt to provide insight into the suitability of the methodology for different task setups. Emphasis is put especially on commonly used performance measures. Finally, scalability of the approach is proven in a higher-dimensional scenario given by human motion prediction.

Additional contributions to the field of human trajectory prediction exceeding the topical scope of this thesis are given by:


## **1.2 Outline**

The thesis is structured as follows: Chapter 2 provides a brief overview of the most common probabilistic sequence models that most state-of-the-art deep learning models are built upon. This background chapter also serves the purpose of supporting the aforementioned claim of a revealed research gap. Chapter 3 provides the derivation of a probabilistic extension for Bézier curves and Bézier splines, including discussions on choices made for the approach and comparisons with related probabilistic sequence models. Closely connected to Chapter 3 is the proposed implementation of the probabilistic curve model given in Chapter 4. Besides implementation details, e.g. the structure of the model, several toy examples are provided, assessing different aspects and qualities of the model. Chapter 5 provides a real-world evaluation of the proposed model, using a low-dimensional and a higher-dimensional task, given by human trajectory prediction and human motion prediction. Finally, Chapters 6 and 7 conclude the thesis and point out potential future research directions.

# **2 Sequence Modeling**

In the context of machine learning, the task of *sequence modeling* is, in general, concerned with determining (stochastic) models able to represent, process and generate sequential data from a given data basis. When uncertainties about the data are taken into account, the sequence model aims to provide an either implicit or explicit representation of the underlying probability distribution.

To enable a more nuanced view on sequence modeling, this general task can be subdivided into three closely related sub-tasks, namely sequence *encoding*, *synthesis* and *prediction*. While sequence encoding is concerned with reducing a given sequence into a compact representation, e.g. a single vector, sequence synthesis and prediction aim at generating sequential data. Sequence synthesis, on the one hand, is concerned with the generation of sequences according to an underlying probability distribution, potentially conditioned on a specific input. On the other hand, sequence prediction combines both tasks by first requiring to encode a given input sequence (the *observation*) in order to generate a prediction for future data points of the observed sequence. As such, sequence prediction can be regarded as a variant of conditional sequence synthesis, where the synthesis model is conditioned on another sequence. Most applications in the context of sequence modeling can be ascribed to at least one of these three more general inference tasks. A schematic of each task is given in Figure 2.1.

**Figure 2.1:** Schematic of the sequence modeling sub-tasks sequence encoding, synthesis and prediction. As an example, a sequence of 2D points is considered. In sequence encoding, the sequence model takes in a given sequence and encodes it into a specific representation, e.g. a vector $\mathbf{v}_{\text{enc}}$. A sequence synthesis model optionally takes a specific input, e.g. a vector $\mathbf{v}_{\text{gen}}$, and generates a sequence. Sequence prediction combines the two, as the sequence prediction model needs to encode a sequence it is given (green), in order to synthesize a continuation of that sequence (blue).

With the prevalence of noise and uncertainties in real-world data, statistical sequence models are employed for tackling either of the sequence modeling tasks. For determining a statistical sequence model, it is assumed that each sequence $\{\mathbf{x}_t\}_{t \in T}$ in a specific dataset is a realization of an unknown stochastic process $\{X_t\}_{t \in T}$ with index set $T$ and random variables $X_t$ following some probability distribution. Typically, $T$ either corresponds to $\mathbb{N}_0$, $\mathbb{R}^+_0$ or some interval $[a, b]$, indicating a discrete-time, continuous-time or finite (continuous-time) stochastic process, respectively. Commonly, these statistical sequence models are either *probabilistic sequence models* or *stochastic process models*. While the latter are themselves variants of stochastic processes (e.g. *Gaussian processes* [Ras06]), probabilistic sequence models process and generate probability distributions, thus providing a model for the underlying stochastic process. Thereby, the probabilistic sequence model itself can be either probabilistic or even deterministic.

Following this, this thesis focuses on learning-based probabilistic sequence models for (conditional) sequence synthesis, including sequence prediction. A sequence model then generates a distribution over the sequence to be synthesized instead of a single (maximum likelihood) sample. The remainder of this section provides an overview of the most important probabilistic models in this context. Given the prevalence of deep learning-based models among current state-of-the-art approaches, the overview is limited to such models only. For an overview of machine learning models beyond deep learning, e.g. state space models, such as recursive Bayesian estimators [Sär13], or autoregressive models, like the autoregressive moving-average model [Box15], the reader may be referred to comprehensive surveys on the topic, e.g. [Rud20b]. Although this survey focuses on a prediction task, most mentioned models are more universally applicable.

## **2.1 Neural Sequence Processing**

To preface the overview, it is important to mention that deep learning-based probabilistic sequence models are commonly built around an underlying neural sequence model, which is in charge of processing sequences at hand. While feed-forward networks (e.g. the Multilayer Perceptron [Mur91]) can be used in a setting of fixed length sequences or when applying a sliding window approach, dedicated sequence models are usually preferred. Common choices for the underlying sequence model are *Recurrent Neural Networks* (abbrev.: RNN, [Rum86]) and their variants, *Temporal Convolutional Networks* (abbrev.: TCN, [Bai18]) and *Transformer Networks* (abbrev.: TF, [Vas17]).

### **2.1.1 Recurrent Neural Networks**

Recurrent Neural Networks are feed-forward networks with additional recurrent connections along the time axis, enabling them to iteratively process sequences and carry information about past inputs. As such, RNNs, and especially their *Long Short-Term Memory* (abbrev.: LSTM, [Hoc97]) and *Gated Recurrent Unit* (abbrev.: GRU, [Cho14]) variants, are widely used. While vanilla RNNs are prone to gradient-related problems during training, especially vanishing gradients [Pas13], the aforementioned variants incorporate gating mechanisms to cope with such problems. From an operational point of view, RNNs are usually built as either *1-to-1* or *sequence-to-sequence* (abbrev.: seq2seq, sometimes also denoted as encoder-decoder, [Sut14]) RNNs. On the one hand, a 1-to-1 RNN processes a given sequence one element at a time and generates an output at each time step. This approach is generally applicable. Opposed to that, seq2seq RNNs are more tailored towards conditional sequence synthesis, where a given sequence is encoded first using an encoder RNN. The resulting encoding is then decoded by another RNN – the decoder – in order to generate an output sequence. Overall, both variants yield comparable performance considering a range of sequence modeling tasks, with the GRU performing slightly better in many cases [Chu14][Joz15]. However, when the network is built as a sequence-to-sequence model, the LSTM outperforms the GRU variant [Bri17].

As a final note, due to RNNs employing an autoregressive structure, i.e. using their own output at time $t$ as input at time $t + 1$ during inference, techniques for managing the network input during training should be discussed. The most commonly used approach is given by *teacher forcing* [Goo16]. Teacher forcing is a technique for training recurrent neural networks that, at time $t$, uses the ground truth $\mathbf{x}_t$ as input rather than the model's output $\hat{\mathbf{y}}_{t-1}$ from the previous time step. As such, the actual network input signal is replaced with a *teacher* signal. This approach helps reaching convergence faster, at the cost of the network not learning to cope with its own imperfect output. A way to tackle this problem is to start the training process using teacher forcing and then slowly transition into an auto-conditioning scheme, where the actual network output is fed back in the subsequent time step [Ben15].
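To make the input handling concrete, the following minimal Python/NumPy sketch contrasts teacher forcing with auto-conditioning for a decoder unrolled over a toy sequence. The cell `rnn_step`, the function `run_decoder` and the annealing schedule are illustrative stand-ins only and not part of any of the referenced models.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x_t, h):
    """Stand-in for a recurrent cell: returns (output, new hidden state)."""
    h = np.tanh(0.5 * x_t + 0.5 * h)       # placeholder dynamics
    return h, h                             # output equals hidden state here

def run_decoder(targets, teacher_forcing_prob, h0):
    """Unroll the decoder, mixing teacher and model inputs per time step."""
    h, x_t, outputs = h0, targets[0], []
    for t in range(1, len(targets)):
        y_hat, h = rnn_step(x_t, h)
        outputs.append(y_hat)
        # teacher forcing: feed the ground truth; otherwise feed the own output
        use_teacher = rng.random() < teacher_forcing_prob
        x_t = targets[t] if use_teacher else y_hat
    return np.stack(outputs)

targets = rng.normal(size=(10, 2))          # toy target sequence
for epoch in range(5):
    p = max(0.0, 1.0 - 0.25 * epoch)        # anneal from pure teacher forcing ...
    outs = run_decoder(targets, p, h0=np.zeros(2))  # ... towards auto-conditioning
```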

### **2.1.2 Temporal Convolutional Networks**

Temporal Convolutional Networks are a special variant of *Convolutional Neural Networks* (abbrev.: CNN, [LeC95]) for sequential data, popularized by the *WaveNet* model in the context of audio synthesis [Oor16]. The model consists of dilated causal convolutions. While *dilated convolutions* [Hol90] are incorporated in order to capture long-range dependencies, *causal convolutions* [Oor16] ensure that the temporal order of a given sequence is taken into account. Advantages of the TCN over the RNN are its inherent parallelism on the one hand and more stable training on the other hand. As the TCN processes multiple time steps at once instead of sequentially, convolutions can be computed in parallel. The more stable training of the TCN can be attributed to more stable gradients. On the downside, the TCN is less flexible in processing sequences of variable length. Although it is possible to process variable-length sequences by sliding the convolutional kernels, the *memory* of the model is limited by the filter kernel's width and the dilation rate, whereas the RNN may, in theory, establish dependencies up to the first sequence element. Across a range of different sequence modeling tasks, the TCN is able to outperform LSTM and GRU models [Bai18] or at least performs similarly [Bec18].
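As an illustration of the core building block, the following minimal NumPy sketch implements a single dilated causal 1D convolution and stacks two layers; the function name, kernels and values are assumptions chosen for illustration only.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """1D causal convolution: output at step t only sees x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation                     # left-pad so no future values leak in
    x_padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        # taps at t, t-dilation, ..., t-(k-1)*dilation (in the padded signal)
        taps = x_padded[t : t + pad + 1 : dilation]
        y[t] = np.dot(kernel[::-1], taps)
    return y

x = np.arange(16, dtype=float)
h1 = causal_dilated_conv1d(x, np.array([0.5, 0.5]), dilation=1)
h2 = causal_dilated_conv1d(h1, np.array([0.5, 0.5]), dilation=2)   # stacked layer
# receptive field of the stack: 1 + (k-1)*1 + (k-1)*2 = 4 time steps
```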

### **2.1.3 Transformer Networks**

Transformer Networks originated from the field of natural language modeling as a replacement for the commonly used RNN-based sequence-to-sequence models. Since their emergence, Transformers have also gained traction in other application domains, most notably speech processing, where Transformers consistently outperform RNN-based models [Kar19, Wan21]. Compared to RNNs, which process sequences recursively, the Transformer dispenses with recurrence and always considers the entire input sequence. The most important concepts Transformers are built around are *positional encoding* and *attention*. While the positional encoding enriches input sequence elements with information about their position within the given sequence, the attention mechanism is in charge of determining which parts of the input sequence are of importance for the calculation of each element in the target sequence. Further, by using an attention mechanism, richer information is available for sequence generation compared to RNN-based sequence-to-sequence models, where the sequence decoder is only provided with an encoded representation of the input sequence. Besides that, Transformers are in general more stable during training, but also seem to be more prone to overfitting, which indicates problems with generalization [Zey19]. Additionally, in its original formulation, the Transformer model is restricted to fixed-length sequences. This restriction is tackled by the *Transformer-XL* extension [Dai19], which re-introduces a notion of recurrence and extends the positional encoding concept.
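As a brief illustration of the positional encoding concept, the following NumPy sketch computes the sinusoidal encoding proposed in [Vas17] and adds it to a toy input sequence; the function and variable names are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as introduced for the Transformer [Vas17]."""
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions: cosine
    return pe

x = np.random.randn(20, 64)                 # toy input sequence of 20 elements
x_enriched = x + sinusoidal_positional_encoding(20, 64)
```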

## **2.2 Probabilistic Sequence Models**

This section provides an overview of deep learning-based probabilistic sequence models commonly used as a basis for task-specific model adaptations. These models can roughly be put into three categories: *Bayesian*, *regression-based* and *transformative* approaches. For each category, the most relevant representatives are examined.

### **2.2.1 Bayesian Approaches**

In deep learning-based Bayesian approaches, a deterministic neural network is turned into a probabilistic model by treating all its parameters as random variables. A prominent example for this class of approaches is given by *Bayesian Neural Networks* (abbrev.: BNN, [Bis95]). In these models, inference and parameter estimation are built around Bayes' theorem. As such, the neural network outputs an arbitrary predictive probability distribution

$$p(\mathbf{y}|\mathbf{x}, \mathcal{D}) = \int_{\boldsymbol{\theta}} p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}') \, p(\boldsymbol{\theta}'|\mathcal{D}) \, d\boldsymbol{\theta}' \tag{2.1}$$

by propagating the units' output distributions through the network. For parameter estimation, the posterior distribution

$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int_{\boldsymbol{\theta}} p(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}') \, p(\boldsymbol{\theta}') \, d\boldsymbol{\theta}'} \propto p(\mathcal{D}_y|\mathcal{D}_x, \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \tag{2.2}$$

of the network parameters, given a set of data samples, needs to be determined. Here, $\boldsymbol{\theta}$ denotes the model parameters, $\mathcal{D}$ the training dataset split into input data $\mathcal{D}_x$ and target data $\mathcal{D}_y$, and $\mathbf{x}$ and $\mathbf{y}$ denote specific input and target vectors, respectively. Due to intractable probability distributions arising from non-linear transformations, usually either Monte Carlo methods [Nea92] or approximate inference is required for both training and inference. Common techniques used for approximate inference include variational inference (also known as *Bayesian Backpropagation*, [Blu15]), inference based on expectation propagation [Her15] and Monte Carlo Dropout [Gal16]. In order to extend BNNs for probabilistic sequence modeling, *Bayesian Recurrent Neural Networks* (abbrev.: BRNN, [For17]) were introduced. For the BRNN, the variational Bayesian Backpropagation scheme is adapted for Backpropagation Through Time [Wer90].

In summary, BNN-based approaches provide a fully probabilistic framework for sequence modeling. In addition to that, major advantages of such models are also given by their robustness to overfitting and the ability to provide information about model uncertainty. As a drawback, such models are difficult to train, due to the requirement of approximate inference making the training computationally more intensive and potentially less stable. Further, the need for approximate inference also yields a significant computational overhead when generating predictions. As a final note, considering the need for approximate inference, the *Bayesian Perceptron* [Hub20] is worth mentioning. The Bayesian Perceptron is a specific novel probabilistic formulation of the Perceptron [Ros58], which provides closed-form parameter propagation and estimation, thus eliminating the need for Monte Carlo methods and approximate inference. However, a recurrent extension for sequence modeling building on this approach is not yet available.
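To illustrate how such approximate inference typically looks in practice, the following NumPy sketch approximates the predictive distribution of Equation (2.1) via Monte Carlo Dropout [Gal16] for a toy two-layer network; the network, its weights and all names are illustrative stand-ins under the stated assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 4)), rng.normal(size=(1, 16))   # toy network weights

def stochastic_forward(x, drop_prob=0.2):
    """One forward pass with dropout kept active at test time (MC Dropout)."""
    h = np.maximum(0.0, W1 @ x)                 # ReLU hidden layer
    mask = rng.random(h.shape) > drop_prob      # randomly drop hidden units
    h = h * mask / (1.0 - drop_prob)
    return (W2 @ h).item()

def mc_dropout_predict(x, n_samples=100):
    """Monte Carlo approximation of the predictive distribution p(y | x, D)."""
    samples = np.array([stochastic_forward(x) for _ in range(n_samples)])
    return samples.mean(), samples.var()        # predictive mean and variance

mean, var = mc_dropout_predict(np.ones(4))
```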

Despite not being probabilistic sequence models according to the definition given earlier in this section, Gaussian process models are worth mentioning in the context of deep learning-based Bayesian approaches. This is due to their corresponding relationship, in that the function computed by a deep neural network is a function drawn from a Gaussian process [Lee18]. Conversely, a GP corresponds to a neural network with an infinite number of units in its hidden layer [Nea96, Wil97].

*Gaussian Processes* (GP) and Gaussian process regression [Ras06] provide a well-established model for probabilistic sequence modeling and especially prediction. Given a collection of sample points of a non-linear function $f(\cdot): \mathbb{R} \rightarrow \mathbb{R}$, a mean function $m(\cdot)$ and a covariance function $k(\cdot,\cdot)$ (*kernel*), the GP yields a multivariate Gaussian prior probability distribution over function space. The Gaussian distribution can be used to determine a conditional predictive distribution over the next element in a sequence given preceding observations. By embedding this 1-step prediction model into a sequential Monte Carlo simulation, multiple time steps can be predicted [Ell09]. *Deep Gaussian Processes* (abbrev.: DGP, [Dam13]) extend the GP framework in order to constitute non-Gaussian, and therefore more complex, models. A DGP is a hierarchy of multiple GPs using non-linear mappings between each layer of the hierarchy. However, the resulting probability densities are intractable and thus require an approximate solution, which can be achieved e.g. by variational approximation [Cam15]. A special case of the DGP, which implements an autoregressive structure comparable to that of an RNN, is given by *Recurrent Gaussian Processes* (abbrev.: RGP, [Mat16]). Here, the priors of latent variables in each hidden layer follow an autoregressive structure. Following this, a recurrent variational approximation scheme, which uses a state space model-based approach, is introduced for inference. Besides having a computationally intensive inference scheme, GP-based approaches grant good control over generated sequences by explicitly modeling the kernel functions, thus controlling the prior over functions representable by the model through a regularization over the entire value range. This gives an advantage over most competing neural network-based approaches that generate sequences in a mostly unconstrained fashion. It should be noted, however, that GP-based approaches are rarely used in most application domains currently dominated by deep learning-based models.

### **2.2.2 Regression-based Approaches**

One of the main areas of application for neural networks is given by regression tasks, due to their ability to learn arbitrary mappings from a given domain into a targeted co-domain. As such, neural networks can be used for probabilistic modeling when treating the task of uncertainty estimation as a regression problem. The neural network is then in charge of learning a mapping from a given set of samples onto the parameters of a probability distribution estimating the generating distribution. Following this, the negative data log likelihood is optimized during training:

$$\mathcal{L} = -\log p_{\boldsymbol{\theta}}(\mathbf{x}). \tag{2.3}$$

The most widely used regression-based neural network for probabilistic modeling is given by *Mixture Density Networks* (abbrev.: MDN, [Bis94, Bis06]), which map the output of their last layer onto the parameters of a mixture distribution. The prevalent choice for the mixture distribution is given by the Gaussian distribution, although Laplace distributions have also been used with Mixture Density Networks [Bra19]. For building a probabilistic sequence model using MDNs, a common choice is the *Recurrent Mixture Density Network* (abbrev.: R-MDN) model as proposed in [Gra13]. Here, an MDN is stacked on top of an LSTM network. The recurrent structure is then used for encoding the observed sequence as well as for generating predictions.
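To make the regression-based formulation concrete, the following NumPy sketch maps a network's last-layer output onto the parameters of a diagonal Gaussian mixture and evaluates the negative log-likelihood of Equation (2.3); the parameter layout, function names and dimensions are assumptions made purely for illustration and do not reproduce any specific implementation.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mdn_head(net_out, n_components, dim):
    """Map the last-layer output onto Gaussian mixture parameters.
    Assumed layout: [logits | means | log-std devs] with diagonal covariances."""
    K, d = n_components, dim
    pi = softmax(net_out[:K])                          # mixture weights, sum to 1
    mu = net_out[K:K + K * d].reshape(K, d)            # component means
    sigma = np.exp(net_out[K + K * d:].reshape(K, d))  # positive standard deviations
    return pi, mu, sigma

def mdn_nll(x, pi, mu, sigma):
    """Negative log-likelihood -log p(x) under the predicted mixture (cf. Eq. 2.3)."""
    log_comp = -0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
    return -np.log(np.sum(pi * np.exp(log_comp)) + 1e-12)

K, d = 3, 2
net_out = np.random.randn(K + 2 * K * d)               # stand-in for the last layer
pi, mu, sigma = mdn_head(net_out, K, d)
loss = mdn_nll(np.array([0.3, -0.1]), pi, mu, sigma)
```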

Compared to Bayesian approaches, such regression-based approaches use a deterministic model and are generally much simpler in terms of inference and computational cost, while still generating probabilistic output. On the downside, these approaches only give a point estimate for the parameters of a preset target probability distribution. This limits the modeling capabilities of the approach and also does not directly provide information about model uncertainty. Drawbacks specific to R-MDNs are, on the one hand, that generating multi-modal probabilistic predictions generally requires expensive Monte Carlo simulation [Hug18]. On the other hand, MDNs are prone to mode collapse [Mak19], where the model collapses into generating only slight variations of a single mode.

A more detailed introduction to MDNs is given in Section 4.1.

### **2.2.3 Transformative Approaches**

Transformative approaches *transform* samples of a simple probability distribution into a sample-based representation of a more complex probability distribution. As such, transformative approaches combine deterministic neural networks with stochastic inputs in order to define a generative model. The most important models in this category are given by *Variational Autoencoders* and *Generative Adversarial Networks*.

*Variational Autoencoders* (abbrev.: VAE, [Kin14]) are a class of deep generative models with latent variables. In latent variable models, the unknown generating distribution $p(\mathbf{x})$ is modeled in terms of latent variables $\mathbf{z} \sim p(\mathbf{z})$ with prior distribution $p(\mathbf{z})$. According to Bayes' theorem, $p(\mathbf{x})$ and $p(\mathbf{z})$ are linked by the *mappings* $p(\mathbf{x}|\mathbf{z})$ (*likelihood*) and $p(\mathbf{z}|\mathbf{x})$ (*posterior*), which are modeled explicitly. Following this, the generally intractable posterior needs to be approximated. For this approximation, the VAE follows a variational inference approach, approximating $p(\mathbf{z}|\mathbf{x})$ with the variational posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$. Putting each part of the latent variable model together, in VAEs $q_{\phi}(\mathbf{z}|\mathbf{x})$ and $p_{\theta}(\mathbf{x}|\mathbf{z})$ are defined in terms of deterministic neural networks, denoted as the *recognition model* and the *generative model*. The networks are arranged similar to autoencoders [Hin94], with the latent space as the bottleneck. As a consequence, the generative process of the VAE works by transforming a set of latent variable samples $\mathbf{z} \sim p(\mathbf{z})$ drawn from the prior distribution $p(\mathbf{z})$ using the generative model $p_{\theta}(\mathbf{x}|\mathbf{z})$. The prior distribution $p(\mathbf{z})$ is commonly defined as $\mathcal{N}(\mathbf{0},\mathbf{I})$. Training the VAE is made possible by using the variational lower bound (also known as the evidence lower bound)

$$\log p(\mathbf{x}) \ge -\text{KL}(q\_{\phi}(\mathbf{z}|\mathbf{x})||p\_{\theta}(\mathbf{z})) + \mathbb{E}\_{q\_{\phi}(\mathbf{z}|\mathbf{x})} \left[ \log p\_{\theta}(\mathbf{x}|\mathbf{z}) \right] \tag{2.4}$$

in conjunction with the *reparameterization trick* [Kin14], which enables joint gradient-based training of the entire network. It should be noted that *Normalizing Flows* [Rez15] can be used in a VAE in order to replace the learned approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$. Normalizing Flows are a chain of *invertible* mappings that can be used to transform samples of one probability distribution into another. In the context of VAEs, Normalizing Flows provide a framework for building a more flexible and complex variational approximation of the posterior $p(\mathbf{z}|\mathbf{x})$ through an iterative procedure [Kin16, Hua18].
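For illustration, the following NumPy sketch evaluates the negative variational lower bound of Equation (2.4) for a diagonal-Gaussian posterior and a toy decoder, using the reparameterization trick; the decoder, the unit-variance Gaussian reconstruction term and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo(x, enc_mu, enc_logvar, decode):
    """Negative evidence lower bound (cf. Eq. 2.4) for a diagonal-Gaussian posterior
    q(z|x) = N(enc_mu, diag(exp(enc_logvar))) and a standard normal prior p(z)."""
    # reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_logvar) * eps
    # closed-form KL( q(z|x) || N(0, I) ) for diagonal Gaussians
    kl = 0.5 * np.sum(np.exp(enc_logvar) + enc_mu ** 2 - 1.0 - enc_logvar)
    # reconstruction term: log p(x|z) assuming a unit-variance Gaussian decoder
    x_hat = decode(z)
    rec = -0.5 * np.sum((x - x_hat) ** 2 + np.log(2 * np.pi))
    return kl - rec                              # minimize: KL - E[log p(x|z)]

decode = lambda z: np.tanh(z)                    # stand-in generative model
loss = neg_elbo(x=np.array([0.2, -0.4]), enc_mu=np.zeros(2),
                enc_logvar=np.zeros(2), decode=decode)
```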

In order to extend the concept of VAEs for sequence modeling, two approaches have emerged: the *Seq2seq Conditional VAE* and the *Variational Recurrent Neural Network*. *Seq2seq Conditional VAEs* [Bow16] build on the concept of conditional VAEs (abbrev.: CVAE, [Soh15]), which employ a conditional generating distribution $p(\mathbf{x}|\mathbf{v})$ conditioned on some input $\mathbf{v}$. This results in the conditional latent prior $p(\mathbf{z}|\mathbf{v})$ and conditional mappings $p_{\theta}(\mathbf{x}|\mathbf{z},\mathbf{v})$ and $q_{\phi}(\mathbf{z}|\mathbf{x},\mathbf{v})$. Following this, for sequence modeling, each sample $\mathbf{x} \sim p_{\theta}(\mathbf{x}|\mathbf{z},\mathbf{v})$ represents a full sequence. Further, in the case of sequence prediction, a given observed sequence needs to be encoded into $\mathbf{v}$ in order to condition the CVAE's generative model on the given sequence. Hence, in seq2seq CVAEs, a CVAE is combined with a seq2seq RNN, where the RNN encoder is used to encode the observation into $\mathbf{v}$ and the RNN decoder implements $p_{\theta}(\mathbf{x}|\mathbf{z},\mathbf{v})$, generating the target sequence from $\mathbf{v}$ and $\mathbf{z}$. Following this, the seq2seq CVAE generates a distribution over sequences of a specific length. Opposed to the composite approach of seq2seq CVAEs, the *Variational Recurrent Neural Network* (abbrev.: VRNN, [Chu15]) explicitly models the dependencies between latent variables of subsequent time steps. The VRNN embeds an RNN into a CVAE, which is at time $t$ conditioned on the RNN's previous hidden state $\mathbf{h}_{t-1}$. For sequence synthesis, the VRNN then operates as a 1-to-1 model, generating a sequence of probability distributions rather than a probability distribution over sequences.

To summarize, VAE-based probabilistic sequence models provide modeling capabilities comparable to Bayesian approaches, while eliminating expensive approximations during inference due to the transformative approach. On the downside, because of imperfect reconstructions caused by the noise injected when generating samples, training results can become less consistent.

*Generative Adversarial Networks* (abbrev.: GAN, [Goo14]) are another type of generative model, learning an *implicit*¹ model of the unknown generating distribution $p(\mathbf{x})$. While the generative model component in GANs is very similar to that of VAEs, in that samples of a simple distribution are transformed into a sample-based representation of a more complex distribution using a deterministic neural network, the network structure and the approach to estimating the parameters of the generative model are vastly different. In order to bypass the need to solve or approximate an intractable posterior distribution, GAN training is framed as a supervised learning problem, using a combination of two neural networks: the generative model itself (denoted as *generator*) and a *discriminator*. Both models are jointly trained playing a zero-sum game, where the generator tries to generate samples from the unknown data distribution $p(\mathbf{x})$ which the discriminator is incapable of classifying as real or fake. As such, the generator and the discriminator play the two-player minimax game

$$\mathcal{L} = \min\_{G} \max\_{D} \underbrace{\left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})} \left[\log D(\mathbf{x})\right] + \mathbb{E}\_{\mathbf{z} \sim p(\mathbf{z})} \left[\log \left(1 - D(G(\mathbf{z}))\right)\right]\right)}\_{V(G, D)} \tag{2.5}$$

with value function $V(G, D)$. The generator's stochastic input distribution $p(\mathbf{z})$ is commonly defined as the multivariate standard Gaussian $\mathcal{N}(\mathbf{0},\mathbf{I})$.
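For illustration, the following NumPy sketch evaluates per-batch estimates of both players' objectives derived from the value function $V(G, D)$ in Equation (2.5), using toy stand-ins for the generator and the discriminator; all functions and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

G = lambda z: 2.0 * z + 1.0                  # stand-in generator G(z)
D = lambda x: sigmoid(0.5 * x)               # stand-in discriminator D(x) in (0, 1)

def gan_losses(x_real, z):
    """Per-batch estimates of the two players' objectives from V(G, D) (cf. Eq. 2.5)."""
    x_fake = G(z)
    # the discriminator ascends V(G, D): classify real as 1, generated as 0
    d_loss = -np.mean(np.log(D(x_real)) + np.log(1.0 - D(x_fake)))
    # the generator descends V(G, D) (original minimax formulation)
    g_loss = np.mean(np.log(1.0 - D(x_fake)))
    return d_loss, g_loss

x_real = rng.normal(loc=1.0, size=32)        # samples from the data distribution
z = rng.standard_normal(32)                  # z ~ N(0, I)
d_loss, g_loss = gan_losses(x_real, z)
```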

Similar to VAE-based approaches, the GAN can be extended for probabilistic sequence modeling by combining a conditional variant of the GAN [Mir14] with seq2seq RNNs [Yu17, Gup18]. Following this, the conditional generator $G(\mathbf{z},\mathbf{v})$ is combined with a seq2seq RNN and the conditional discriminator $D(\mathbf{x},\mathbf{v})$ is combined with an RNN sequence encoder. Given the similarities to the VAE in the generative model, GANs provide similar benefits without the need for variational inference during training. Despite this, GANs tend to be hard to train because of vanishing gradient problems and the GAN's proneness to mode collapse. While these problems are addressed by variations of the GAN building on the Wasserstein distance [Arj17, Gul17] or by incorporating multiple versions of the discriminator into the generator's loss function [Met17], balancing parameter updates between the generator and the discriminator still poses a challenging problem, as the discriminator converges faster than the generator on many occasions [Ham20].

¹ Implicit density models do not compute $p(\mathbf{x})$, but allow sampling from the underlying distribution using the model.

## **2.3 Placement of this Thesis**

Looking at the overview of commonly used (probabilistic) sequence models for handling sequential data under uncertainty, a research gap in regression-based approaches for multi-modal probabilistic inference is revealed. Following this, this thesis aims to provide a fully regression-based probabilistic sequence model with respect to both model training and inference using the model. The targeted placement of this thesis among other sequence models is given in Table 2.1.


**Table 2.1:** Targeted placement of this thesis among other (probabilistic) sequence models.

*a* In a multi-modal setting

# **3 Concept**

Throughout this chapter, a probabilistic sequence model for representing stochastic processes is formulated, which aims to avoid the necessity of Monte Carlo approaches. To achieve this, first and foremost a sequence model for fixed-length sequences will be introduced in Section 3.1, covering the most application-relevant case. This covers continuous-time as well as discrete-time stochastic processes with bounded index set in an unimodal and a multi-modal setting. The model is then extended for the representation of infinite stochastic processes in Section 3.2 by lifting some conceptual limitations present in the former variant of the model.

The general idea behind the proposed probabilistic sequence model is to circumvent Monte Carlo sampling. Therefore, the model needs to represent full sequences instead of iteratively building them. Following this, a probabilistic extension to a certain type of parametric curves, Bézier curves in this case, is derived, granting a suitable representation of sequential data in arbitrary dimensions. The probabilistic sequence model is then built on these probabilistic Bézier curves.

## **3.1 The $\mathcal{N}$-Curve Model**

This section proposes a Bézier curve defined by stochastic control points, capable of describing a continuous-time stochastic process $X = \{X_t\}_{t \in T}$ on a closed range with Gaussian random variables $X_t \sim \mathcal{N}(\mu_t, \Sigma_t)$ and index set $T = [0, 1]$. This concept is further extended for modeling random variables following a Gaussian mixture distribution.

Starting with plain Bézier curves, a Bézier curve (e.g. [Pra02, Far02]) of degree $N_{\text{deg}}$

$$B\_{\mathcal{B}}(t) = \sum\_{l=0}^{N\_{\text{deg}}} b\_{l, N\_{\text{deg}}}(t) \mathbf{p}\_l \tag{3.1}$$

is a blended curve constructed as a linear combination of the $N_{\text{deg}} + 1$ $d$-dimensional *control points* $\{\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_{N_{\text{deg}}}\}$ using the Bernstein basis polynomials [Lor13]

$$b\_{l,N}(t) = \binom{N}{l} (1-t)^{N-l} t^l \tag{3.2}$$

as blending functions. The Bernstein basis polynomials are non-negative and satisfy $\sum_{l} b_{l,N}(t) = 1$. Each curve point $\mathbf{x}_t = B_{\mathcal{B}}(t)$ is determined by the curve's positional parameter $t \in [0, 1]$, where $t = 0$ corresponds to $\mathbf{p}_0$ and $t = 1$ to $\mathbf{p}_{N_{\text{deg}}}$, respectively. The positional parameter can also be interpreted as a *time* parameter when looking at the curve points as a timely ordered sequence of points. An example of a 2-dimensional Bézier curve of degree $N_{\text{deg}} = 4$ for $t \le 0.88$ with corresponding Bernstein basis polynomials is depicted in Figures 3.1a and 3.1b, respectively.

**(a)** Exemplary 2-dimensional Bézier curve of degree $N_{\text{deg}} = 4$ for $t \le 0.88$.

**(b)** Bernstein basis polynomials $b_{l,4}(t)$ for 4th degree blending.

**Figure 3.1:** Illustrating the connection between the Bernstein basis polynomials and Bézier curve construction. The Bernstein polynomial values control the weighting of control points for calculating curve points. The colors of the control points in figure (a) are associated with the weight curve of the same color in figure (b). Weights of control points for each curve point are dependent on the positional parameter $t$. Figure (a) shows a curve constructed up to $t = 0.88$, the remainder is indicated as a dashed line. Corresponding weights for $t = 0.88$ are indicated by circular markers in figure (b).
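To make the construction concrete, the following NumPy sketch evaluates the Bernstein basis polynomials of Equation (3.2) and the resulting Bézier curve points of Equation (3.1) for a toy set of 2-dimensional control points; the function names and control point values are chosen for illustration only.

```python
import numpy as np
from math import comb

def bernstein(l, N, t):
    """Bernstein basis polynomial b_{l,N}(t) (cf. Eq. 3.2)."""
    return comb(N, l) * (1.0 - t) ** (N - l) * t ** l

def bezier_point(control_points, t):
    """Curve point as a Bernstein-weighted combination of the control points (cf. Eq. 3.1)."""
    P = np.asarray(control_points, dtype=float)
    N = len(P) - 1                                   # curve degree N_deg
    weights = np.array([bernstein(l, N, t) for l in range(N + 1)])
    return weights @ P                               # weights are non-negative and sum to 1

# 2-dimensional Bézier curve of degree 4, evaluated along t in [0, 1]
P = [(0, 0), (1, 2), (3, 3), (5, 1), (6, 0)]
curve = np.array([bezier_point(P, t) for t in np.linspace(0.0, 1.0, 50)])
```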

Considering the objective of modeling a stochastic process, the curve points along this parametric curve have to be stochastic. A schematic of such a probabilistic Bézier curve is illustrated in Figure 3.2. Here, Figure 3.2a illustrates a discrete 2-dimensional Bézier curve as the starting point. Figure 3.2b indicates uncertainty associated with each curve point as a shaded region around the curve. It has to be noted that this presentation of uncertainty is for illustration purposes only. Uncertainties of multiple time steps are overlaid while only considering uncertainty orthogonal to the actual curve. Thus, it does not reflect the real probability distribution when integrating over the positional parameter.

**(a)** Exemplary 2-dimensional Bézier curve of degree $N_{\text{deg}} = 4$.

**(b)** Schematic extension of a Bézier curve incorporating curve point uncertainty.

**Figure 3.2:** Illustration of the starting point ((a): discrete Bézier curve) and goal ((b): probabilistic Bézier curve) of this section for a Bézier curve of degree $N_{\text{deg}} = 4$. Uncertainty associated with each curve point is indicated by a shaded region around the curve, representing $\sigma$ and $2\sigma$ regions. Note: The presentation of uncertainty is for illustration purposes only and does not reflect a real probability distribution integrated over the curve's positional parameter.

In order to define a probabilistic extension for Bézier curves, such that generated curve points are stochastic and follow some probability distribution, it is necessary for the control points to be stochastic as well. This is due to every curve point being a linear combination of the curve's control points. Thus, an important question is given by the choice of a suitable probability distribution for the control points. A common choice is given by the Gaussian distribution, which is commonly used in machine learning and statistics due to its mathematical properties. On the one hand, the popularity of the Gaussian distribution can be explained through the *central limit theorem* [And10], which states that the sum of independent random variables converges towards a Gaussian distribution. Further, among all real-valued distributions with a given mean and variance, the Gaussian distribution is the *distribution of maximum entropy* [Con04]. On the other hand, the most notable mathematical property for defining a probabilistic Bézier curve is the fact that the linear combination of Gaussian random variables is again Gaussian.

Following this, for describing a stochastic process in terms of a parametric curve, each curve point should follow a Gaussian distribution. Thus, a *Gaussian Bézier curve*, denoted as $\mathcal{N}$-Curve, is proposed. The $\mathcal{N}$-Curve extends Equation (3.1) and defines the control points $\mathcal{P} = \{P_0, P_1, ..., P_{N_{\text{deg}}}\}$ to follow Gaussian distributions with $P_l \sim \mathcal{N}(\mu_l, \Sigma_l)\ \forall P_l \in \mathcal{P}$. The set of mean vectors is denoted as $M = \{\mu_0, \mu_1, ..., \mu_{N_{\text{deg}}}\}$ and the set of covariance matrices as $\Sigma = \{\Sigma_0, \Sigma_1, ..., \Sigma_{N_{\text{deg}}}\}$, respectively. Thus, the $\mathcal{N}$-Curve is defined by a tuple $\psi = (M, \Sigma)$. As curve points are defined through a linear combination of the control points, the stochasticity is inherited from the control points to the curve points $\{X_t\}_{t \in [0,1]}$. This is due to the fact that for $\mathbf{A}X + \mathbf{B}Y$ with $X \sim \mathcal{N}(\mu_x, \Sigma_x)$ and $Y \sim \mathcal{N}(\mu_y, \Sigma_y)$ it follows¹

$$\mathbf{A}X + \mathbf{B}Y \sim \mathcal{N}(\mathbf{A}\mu_{x} + \mathbf{B}\mu_{y}, \mathbf{A}\Sigma_{x}\mathbf{A}^{T} + \mathbf{B}\Sigma_{y}\mathbf{B}^{T}).$$

¹ Following the definition as provided in *The Matrix Cookbook* [Pet08].

Thus, the curve function

$$B\_{\mathcal{N}}(t,\psi) = (\mu^{\psi}(t), \Sigma^{\psi}(t))\tag{3.3}$$

with

$$\mu^{\psi}(t) = \sum\_{l=0}^{N\_{\text{deg}}} b\_{l, N\_{\text{deg}}}(t) \mu\_l \tag{3.4}$$

and

$$\Sigma^{\psi}(t) = \sum_{l=0}^{N_{\text{deg}}} \left( b_{l, N_{\text{deg}}}(t) \right)^2 \Sigma_l, \tag{3.5}$$

defines the parameters of a (multivariate) Gaussian probability distribution for each $t \in [0, 1]$. Each $d$-dimensional curve point $X_t$ then follows the respective Gaussian distribution $\mathcal{N}(\mu^{\psi}(t), \Sigma^{\psi}(t))$ at index $t$. The Gaussian probability density at index $t$ is given by

$$\begin{split} p\_t^{\psi}(\mathbf{x}) &= p\left(\mathbf{x}|\mu^{\psi}(t), \Sigma^{\psi}(t)\right) \\ &= \mathcal{N}\left(\mathbf{x}|\mu^{\psi}(t), \Sigma^{\psi}(t)\right) \\ &= \frac{1}{\left|2\pi\Sigma^{\psi}(t)\right|^{\frac{1}{2}}} \exp\left\{-\frac{1}{2}\left(\mathbf{x} - \mu^{\psi}(t)\right)^{\top}\left(\Sigma^{\psi}(t)\right)^{-1}\left(\mathbf{x} - \mu^{\psi}(t)\right)\right\}. \end{split} \tag{3.6}$$
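The computation of a Gaussian curve point according to Equations (3.4) to (3.6) can be summarized in a short NumPy sketch; the control point values and all function names are chosen for illustration only.

```python
import numpy as np
from math import comb

def bernstein(l, N, t):
    return comb(N, l) * (1.0 - t) ** (N - l) * t ** l

def n_curve_point(means, covs, t):
    """Parameters (mu^psi(t), Sigma^psi(t)) of a Gaussian curve point (cf. Eqs. 3.4, 3.5)."""
    means, covs = np.asarray(means, float), np.asarray(covs, float)
    N = len(means) - 1                                # curve degree N_deg
    w = np.array([bernstein(l, N, t) for l in range(N + 1)])
    mu = np.sum(w[:, None] * means, axis=0)           # Eq. 3.4: Bernstein-weighted means
    sigma = np.sum((w ** 2)[:, None, None] * covs, axis=0)  # Eq. 3.5: squared weights
    return mu, sigma

def gaussian_density(x, mu, sigma):
    """Gaussian density of a curve point at index t (cf. Eq. 3.6)."""
    diff = x - mu
    norm = np.sqrt(np.linalg.det(2 * np.pi * sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

# N-Curve with 3 Gaussian control points in 2 dimensions
means = [(0, 0), (2, 3), (4, 0)]
covs = [0.1 * np.eye(2), np.array([[0.4, 0.2], [0.2, 0.3]]), 0.1 * np.eye(2)]
mu_t, sigma_t = n_curve_point(means, covs, t=0.5)
p = gaussian_density(np.array([2.0, 1.5]), mu_t, sigma_t)
```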

An example of a Gaussian curve point $X_{0.5}$ constructed from Gaussian control points for $t = 0.5$ is depicted in Figure 3.3. The intermediate control point $P_1$ influences $X_{0.5}$ the most, which leads $X_{0.5}$ to adopt a skewed covariance ellipse. Due to the covariance matrices being interpolated, the other control points, $P_0$ and $P_2$, contribute to $X_{0.5}$ by making the covariance ellipse more spherical.

**Figure 3.3:** Example of a Gaussian curve point $X_{0.5}$ on an $\mathcal{N}$-Curve with 3 Gaussian control points for $t = 0.5$. The covariance matrix of $X_{0.5}$'s Gaussian distribution is a combination of the control point covariance matrices.

So far, a stochastic process model for a continuous index set $T = [0, 1]$ has been defined. In contrast to this, many real-world applications require discrete-time stochastic processes handling sequential data. For handling such use-cases, the $\mathcal{N}$-Curve model can be used to model Gaussian distributions at $N$ discrete points in time with $\mathcal{N}(\mu^{\psi}(t), \Sigma^{\psi}(t))$ using equally distributed values for $t$, yielding a discrete subset

$$T\_N = \left\{\frac{\upsilon}{N-1} | \upsilon \in \{0, \ldots, N-1\} \right\} = \{t\_1, \ldots, t\_N\} \tag{3.7}$$

of the index set $T$. Thus, each process index (curve parameter) $t_i \in T_N$ corresponds to its respective sequence index at time $i \in \{1, ..., N\}$. It has to be noted that using equidistant values for $t$ does not necessarily result in equidistant curve points. The distribution of the curve points along the curve depends on the positions of the control points. This is illustrated in Figure 3.4.
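A minimal sketch of the discretization in Equation (3.7) is given below; the (commented) reuse of `n_curve_point` from the previous sketch is an assumption made for illustration.

```python
import numpy as np

def discrete_index_set(N):
    """Equally distributed curve parameters T_N = {v / (N - 1) | v = 0, ..., N-1} (cf. Eq. 3.7)."""
    return np.array([v / (N - 1) for v in range(N)])

T_6 = discrete_index_set(6)     # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# one Gaussian random variable per discrete index (cf. Eq. 3.8), e.g. by reusing
# n_curve_point(...) from the previous sketch:
# gaussians = [n_curve_point(means, covs, t) for t in T_6]
```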


**(a)** Example for equidistant curve points ($\Delta_1 = \Delta_2 = \Delta_3 = \Delta_4 = \Delta_5 \approx 2.3$).

**Figure 3.4:** Illustration of the impact of control point positioning on the distribution of curve points along the curve. Figures (a) and (b) show two exemplary Bézier curves with one shifted control point and identical shape. Note that the shape is not impacted by shifting $\mathbf{p}_1$, as it lies on the straight line between $\mathbf{p}_0$ and $\mathbf{p}_2$. Curve points (black circular markers) are calculated using the same discrete index set $T_{N=6} = \{0, 0.2, 0.4, 0.6, 0.8, 1\}$ of equally distributed values for $t$.

Finally, the Gaussian random variable at time $t_i$ is given by

$$X_{t_i} \sim \mathcal{N}(B_{\mathcal{N}}(t_i, \psi)) = \mathcal{N}(\mu^{\psi}(t_i), \Sigma^{\psi}(t_i)) \tag{3.8}$$

with $X_{t_1} = P_0$ and $X_{t_N} = P_{N_{\text{deg}}}$ as exact start and end conditions. Figure 3.5 depicts a 2-dimensional example of an $\mathcal{N}$-Curve with 5 control points. The mean curve and control points with respective covariance ellipses are shown in Figure 3.5a. Gaussian random variables along the $\mathcal{N}$-Curve given different values for $t$ are illustrated in Figure 3.5b. The influence of the most dominant control point for each curve point is clearly visible in the covariances, adapting towards respective control point covariances. Note that the parametric curve interpolates the mean vectors of all Gaussian distributions through time.

**(b)** Gaussian random variables along the $\mathcal{N}$-Curve for $t \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$.

**Figure 3.5:** Example for modeling a finite discrete-time stochastic process using an $\mathcal{N}$-Curve. The stochastic process consists of random variables corresponding to a discrete subset of $T = [0,1]$.

As a final aspect to consider, the $\mathcal{N}$-Curve model can easily be extended for modeling multi-modal stochastic processes. While Gaussian probability distributions are a sufficient representation for unimodal sequence data, many real-world problems require a multi-modal representation. For this, a common approach is to use a Gaussian mixture probability distribution

$$\Xi\left(\{\pi\_k\}\_{k\in\{1,\dots,K\}}, \{\left(\mu\_k,\Sigma\_k\right)\}\_{k\in\{1,\dots,K\}}\right),\tag{3.9}$$

defined by $K$ weighted Gaussian components and probability density function

$$p(\mathbf{x}) = \sum\_{k=1}^{K} \pi\_k \mathcal{N}\left(\mathbf{x}|\boldsymbol{\mu}\_k, \boldsymbol{\Sigma}\_k\right), \text{with } \sum\_{k=1}^{K} \pi\_k = 1 \text{ and } \pi\_k \ge 0. \tag{3.10}$$

In the same way, the concept of $\mathcal{N}$-Curves can be extended to a mixture $\Psi = (\pi, \{\psi_1, ..., \psi_K\})$ of $K$ weighted $\mathcal{N}$-Curves with normalized weights $\pi = \{\pi_1, ..., \pi_K\}$. The stochastic curve points at index $t \in T$ then follow a Gaussian mixture distribution

$$X\_t \sim \Xi\left(\pi, \{B\_{\mathcal{N}}(t, \psi\_k)\}\_{k \in \{1, \ldots, K\}}\right). \tag{3.11}$$

Accordingly, the probability density at $t \in T$ induced by $\Psi$ is given by

$$p\_t^{\Psi}(\mathbf{x}) = \sum\_{k=1}^{K} \pi\_k \mathcal{N}(\mathbf{x}|\mu^{\Psi\_k}(t), \Sigma^{\Psi\_k}(t)),\tag{3.12}$$

with $\mu^{\Psi_k}(t)$ and $\Sigma^{\Psi_k}(t)$ given by the $k$'th $\mathcal{N}$-Curve, i.e. $(\mu^{\Psi_k}(t), \Sigma^{\Psi_k}(t)) = B_{\mathcal{N}}(t, \psi_k)$. Following this, each stochastic curve point can be multi-modal and each mode of the modeled stochastic process follows a separate $\mathcal{N}$-Curve. As such, the $\mathcal{N}$-Curve mixture provides the evolution of $X_t$ along multiple paths through time. An example for an $\mathcal{N}$-Curve mixture is depicted in Figure 3.6.

**Figure 3.6:** Example for a multi-modal stochastic curve point for $t = 0.5$ given by an $\mathcal{N}$-Curve mixture consisting of 2 $\mathcal{N}$-Curves $\psi_1$ and $\psi_2$. Both $\psi_1$ and $\psi_2$ are defined by 3 Gaussian control points. The curve point follows a 2-component Gaussian mixture distribution.
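Building on the `curve_point` sketch above, the mixture density of Equation (3.12) can be evaluated as a weighted sum of Gaussian densities; again, this is an illustrative sketch, not the thesis code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, t, weights, curves):
    """Density p_t^Psi(x) of an N-Curve mixture at curve parameter t (Eq. (3.12)).
    `curves` is a list of (means, covs) tuples, one per mixture component."""
    density = 0.0
    for pi_k, (means, covs) in zip(weights, curves):
        mu, sigma = curve_point(t, means, covs)   # blending from the sketch above
        density += pi_k * multivariate_normal.pdf(x, mean=mu, cov=sigma)
    return density
```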

### **3.1.1 Rationale behind choosing Bézier curves**

Among related parametric curves with alternative formulations (e.g. Pythagorean-hodograph curves [Far08]) or basis polynomials (e.g. Lagrange bases [War79, Jef88] or the power basis [Sto89]), Bézier curves are the most widely used type of blended curves in various fields, especially in computer-aided design (e.g. [Fit14]), animation (e.g. [Haa18, Izd20]) and path planning (e.g. [Jol09, Tha19]). Besides their popularity, Bézier curves offer some valuable properties for the $\mathcal{N}$-Curve model. First and foremost, Bézier curves are numerically stable, as well as easy to calculate, control and manipulate. Every control point contributes to every curve point, which makes curve construction more intuitive and predictable. An example for how the manipulation of a single control point impacts the entire curve is given in Figure 3.7.

**(a)** 2-dimensional Bézier curve of degree 4.

**(b)** Impact of relocating a single (intermediate) control point.

**Figure 3.7:** Illustration of global control, i.e. every control point affects every curve point, in Bézier curves. The initial curve is depicted in red and the modified curve in green.

In addition, Bézier curves provide a compact representation of the entire curve in terms of a set of control points. This, in turn, allows the description of a whole sequence of random variables using only a few stochastic control points. Further, Bézier curves can be scaled up to higher dimensions easily by increasing the dimensionality of the control points. Besides that, Bézier curves provide a commonly used building block for splines, which are segmented curves consisting of multiple parametric curves. The ability to combine Bézier curves into splines is relevant for a recurrent extension of the $\mathcal{N}$-Curve model as discussed in Section 3.2. Figure 3.8 provides basic examples for scalability and Bézier splines.

**(a)** Exemplary 3-dimensional Bézier curve of degree 4.

**(b)** Exemplary Bézier spline with 3 Bézier curve segments.

**Figure 3.8:** Basic examples for Bézier curve scalability (3.8a) and Bézier curves as a building block for Bézier splines consisting of Bézier curve segments (3.8b). Scalability is illustrated by increasing the control point dimension to 3, resulting in a 3-dimensional Bézier curve.

In the context of regression-based deep learning approaches to probabilistic sequence modeling, a model based on Bézier curves is expected to have a positive impact on the training and inference process. Due to the modeled mean sequence being constrained by an underlying parametric curve and the omission of an iterative generation approach, the generation of outliers within the sequence can be avoided. This, in turn, reduces the effect of error propagation present in iterative approaches in the presence of outliers.

### **3.1.2 A potential caveat: Non-linear Covariance Blending**

When combining control points into curve points, the control points are weighted using the Bernstein basis polynomials (see Equations (3.4) and (3.5)). While the control point mean vectors are linearly interpolated when calculating a curve point mean vector, non-linear weighting is introduced for the covariance matrices in Equation (3.5), due to the control points being Gaussian random variables. This, in turn, leads to an effect that prevents the $\mathcal{N}$-Curve model from maintaining a constant variance along the curve. Instead, the variance is scaled down for curve points with $0 < t < 1$, resulting in a *squeezing effect*.

This effect is easiest to see when taking an $\mathcal{N}$-Curve with 2 control points, resembling a straight line, as an example. Setting the variance of both control points to 1, the variance of intermediate points is parabolic because of the non-linear covariance weighting. This is illustrated in Figure 3.9.

**Figure 3.9:** Illustration of non-linear variance interpolation using a simple 1-dimensional $\mathcal{N}$-Curve with 2 control points. Both control points $P_0$ and $P_1$ have a variance of 1. The shaded region around the curve depicts 1, 2 and 3 times the variance for each curve point. It can be seen that the evolution of the variance along the curve is parabolic.

Recalling Equation (3.5), in $\Sigma^{\psi}(t) = \sum_{i=0}^{N_{\text{deg}}} (b_{i,N_{\text{deg}}}(t))^2 \Sigma_i$, the Bernstein coefficients need to be squared when blending the covariance matrices, due to

$$\begin{split} \text{cov}[aX] &= \mathbb{E}[(aX - \mathbb{E}[aX])(aX - \mathbb{E}[aX])^T] \\ &= a^2 \cdot \text{cov}[X]. \end{split} \tag{3.13}$$

Following this and $b_{i,N_{\text{deg}}}(t) < 1$ for $0 < t < 1$, the normalization property no longer holds, as $\sum_{i=0}^{N_{\text{deg}}} (b_{i,N_{\text{deg}}}(t))^2 < 1$. Obvious attempts to mitigate this effect involve the addition of intermediate control points or an adjustment of intermediate control point variances. First, adding intermediate control points with constant variance only amplifies the squeezing effect, due to the increasing number of weights being involved, leading to $\sum_{i=0}^{N'_{\text{deg}}} (b_{i,N'_{\text{deg}}}(t))^2 < \sum_{i=0}^{N_{\text{deg}}} (b_{i,N_{\text{deg}}}(t))^2$ for $N'_{\text{deg}} > N_{\text{deg}}$. Second, adjusting intermediate control point variances can only mitigate the squeezing effect for selected curve points, making it at least viable for discrete-time stochastic processes in theory. The effect of both approaches on the variance along the curve is depicted in Figure 3.10.

**(a)** Exemplary 1-dimensional $\mathcal{N}$-Curve using 3 intermediate control points with variance 1.

**(b)** Exemplary 1-dimensional $\mathcal{N}$-Curve using one intermediate control point with increased variance.

**Figure 3.10:** Illustration of different approaches trying to mitigate the squeezing effect in $\mathcal{N}$-Curves due to non-linear variance blending. In both subfigures, a simple 1-dimensional $\mathcal{N}$-Curve is depicted. Respective first and last control points have a variance of 1. The shaded region around the curve depicts 1, 2 and 3 times the variance for each curve point. It can be seen that, while adding multiple intermediate control points with the same variance amplifies the squeezing effect, increasing intermediate control point variances can help mitigate the effect.
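The following small sketch (illustrative only) makes the squeezing effect tangible by evaluating the sum of squared Bernstein weights, which drops below 1 for intermediate curve parameters and decreases further with increasing curve degree.

```python
from math import comb

def bernstein_weights(n, t):
    return [comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)]

# At t = 0.5 the weights still sum to 1, but their squares do not,
# and the squared sum shrinks as the degree (number of control points) grows.
for n in (1, 2, 4, 8):
    w = bernstein_weights(n, 0.5)
    print(n, sum(w), sum(x**2 for x in w))
# degree 1: squared sum = 0.5; degree 8: squared sum ~ 0.2, amplifying the effect
```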

Although, in theory, this effect seems like a major drawback of the $\mathcal{N}$-Curve model, especially in the continuous-time case, it is less relevant in practice, due to the prevalence of discrete sequence data. In addition, real-world data is commonly subject to noise, which makes the constant variance case discussed in this subsection less likely to appear. In order to provide more insight into this effect and its impact in the context of sequence modeling, it is discussed further in the context of experiments conducted on real-world data in Section 5.1.5.4.

### **3.1.3 The $\mathcal{N}$-Curve Model as a Generative Model**

For modeling a stochastic process, the $\mathcal{N}$-Curve model provides a Gaussian probability distribution for each point in time. At the same time, it provides a probability distribution over parametric curves. Following this, the $\mathcal{N}$-Curve model can be used as a generative model to either generate samples at specific points in time or to generate (continuous) realizations of the stochastic process itself. The latter can be achieved by sampling a set of Bézier curve control points from an $\mathcal{N}$-Curve $\psi$, or an $\mathcal{N}$-Curve mixture $\Psi$, respectively. In the case of $\mathcal{N}$-Curve mixtures, a specific $\mathcal{N}$-Curve to draw a sample from is randomly selected according to the weight distribution $\pi$ in a first step. A set of samples drawn from an $\mathcal{N}$-Curve and a mixture of $\mathcal{N}$-Curves is depicted in Figure 3.11.

**(a)** Samples for $X_t$ along an $\mathcal{N}$-Curve for $t \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$.

**(b)** Samples for $X_t$ along an $\mathcal{N}$-Curve mixture for $t \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$.

**(c)** Bézier curves sampled from an $\mathcal{N}$-Curve.

**(d)** Bézier curves sampled from an $\mathcal{N}$-Curve mixture.

**Figure 3.11:** Illustration of the $\mathcal{N}$-Curve model as a generative model for generating data for specific points in time along the curve ((a) and (b)) and for generating full Bézier curves according to the $\mathcal{N}$-Curve (mixture) control points ((c) and (d)).
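A minimal sketch of this sampling scheme, assuming the control point parameters are given as arrays: a full Bézier curve realization is obtained by sampling the control points, and in the mixture case a component is first drawn according to its weight (all names are hypothetical).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def sample_curve(means, covs, num_points=50):
    """Draw one Bézier curve realization from an N-Curve by sampling its control points."""
    ctrl = np.stack([rng.multivariate_normal(m, c) for m, c in zip(means, covs)])
    n = len(ctrl) - 1
    weights = np.array([[comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)]
                        for t in np.linspace(0, 1, num_points)])
    return weights @ ctrl                 # curve points of the sampled realization

def sample_from_mixture(pi, curves, num_points=50):
    """For an N-Curve mixture, first select a component by its weight, then sample."""
    k = rng.choice(len(pi), p=pi)
    return sample_curve(*curves[k], num_points=num_points)
```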

### **3.1.4 Connection to Gaussian Processes**

Generally speaking, Gaussian processes are a form of stochastic processes, where the joint distribution of all stochastic variables $\{X_t\}_{t \in T}$ is a multivariate Gaussian distribution. The joint distribution is obtained using an explicit mean function and covariance function, commonly referred to as the kernel of the Gaussian process (see also Section 2.2.1). Due to the joint distribution being Gaussian, each individual stochastic variable, either obtained through marginalization or conditioning, is again Gaussian. Following this, a fundamental similarity between the $\mathcal{N}$-Curve model and Gaussian processes can be observed, in that the $\mathcal{N}$-Curve model provides a model for a stochastic process $\{X_t\}_{t \in T}$ comprised of Gaussian random variables $X_t \sim \mathcal{N}(B_{\mathcal{N}}(t, \psi))$. Thus, the question arises whether the underlying $\mathcal{N}$-Curves are a special case of Gaussian processes using an implicit covariance function.

Following the definition of Gaussian processes [Mac03, Ras06], an $\mathcal{N}$-Curve would be classified as a Gaussian process if, for any finite subset $\{t_1, ..., t_N\}$ of $T$, the joint probability density $p(X_{t_1}, ..., X_{t_N})$ of corresponding random variables is Gaussian. This property is referred to as the *GP property* in the following and can be shown to hold true for $\mathcal{N}$-Curves, as these are, in fact, an alternative formulation for Gaussian processes with specific mean and covariance functions.

In order to prove that the GP property holds true for $\mathcal{N}$-Curves, first recall that an $\mathcal{N}$-Curve is defined in terms of a set of $N_{\text{deg}} + 1$ *independent* $d$-dimensional Gaussian control points $P_l \sim \mathcal{N}(\mu_l, \Sigma_l)$, which are defined as column vectors, i.e.

$$P\_l = \begin{pmatrix} P\_1^l \\ \vdots \\ P\_d^l \end{pmatrix}. \tag{3.14}$$

Using these control points, a sequence of Gaussian probability distributions along the corresponding $\mathcal{N}$-Curve has been defined (see Equation (3.3)). As an alternative to this approach, the control points can also be stacked into the $((N_{\text{deg}} + 1) \cdot d \times 1)$ control point random vector

$$\mathbf{P} = \begin{pmatrix} P\_0 \\ P\_1 \\ \vdots \\ P\_{N\_{\text{deg}}} \end{pmatrix},\tag{3.15}$$

consisting of independent Gaussian random variables, which is itself jointly Gaussian. Further, a $(N \cdot d \times (N_{\text{deg}} + 1) \cdot d)$ transformation matrix

$$\mathbf{C} = \begin{pmatrix} \mathbf{B}\_{0, N\_{\text{deg}}}(t\_1) & \dots & \mathbf{B}\_{N\_{\text{deg}}, N\_{\text{deg}}}(t\_1) \\ \vdots & \ddots & \vdots \\ \mathbf{B}\_{0, N\_{\text{deg}}}(t\_N) & \dots & \mathbf{B}\_{N\_{\text{deg}}, N\_{\text{deg}}}(t\_N) \end{pmatrix},\tag{3.16}$$

with

$$\mathbf{B}\_{l,N\_{\rm deg}}(t) = b\_{l,N\_{\rm deg}}(t)\mathbf{I}\_d,\tag{3.17}$$

where $\mathbf{I}_d$ is the $d$-dimensional identity matrix, can be derived using the Bernstein polynomials $b_{l,N_{\text{deg}}}(t_j)$ with $t_j \in T_N$ (see Equation (3.7) for $T_N$), in order to map the control point random vector **P** onto a random vector consisting of $d$-dimensional Gaussian curve points, i.e.

$$\mathbf{X} = \mathbf{C} \cdot \mathbf{P} = \begin{pmatrix} \mathbf{B}\_{0, N\_{\text{deg}}}(t\_1) \cdot P\_0 + \dots + \mathbf{B}\_{N\_{\text{deg}}, N\_{\text{deg}}}(t\_1) \cdot P\_{N\_{\text{deg}}} \\ \vdots \\ \mathbf{B}\_{0, N\_{\text{deg}}}(t\_N) \cdot P\_0 + \dots + \mathbf{B}\_{N\_{\text{deg}}, N\_{\text{deg}}}(t\_N) \cdot P\_{N\_{\text{deg}}} \end{pmatrix} = \begin{pmatrix} X\_1 \\ \vdots \\ X\_N \end{pmatrix} . \tag{3.18}$$

As **X** is obtained through a linear transformation of a Gaussian random vector, it is jointly Gaussian as well. As a consequence, the corresponding probability density function $p(\mathbf{X}) = p(X_1, ..., X_N)$ is a Gaussian probability density, thus the GP property holds.
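The construction of Equations (3.15) to (3.18) can be written down directly; the following sketch (illustrative, with hypothetical names) builds the block matrix $\mathbf{C}$ and returns the mean and covariance of the jointly Gaussian curve point vector $\mathbf{X}$.

```python
import numpy as np
from math import comb
from scipy.linalg import block_diag

def bernstein(i, n, t):
    return comb(n, i) * t**i * (1 - t)**(n - i)

def joint_curve_point_distribution(means, covs, ts):
    """Mean and covariance of X = C * P (cf. Eqs. (3.15)-(3.18)).
    means/covs define the independent Gaussian control points, ts is T_N."""
    d = len(means[0])
    n = len(means) - 1
    C = np.block([[bernstein(i, n, t) * np.eye(d) for i in range(n + 1)] for t in ts])
    mu_P = np.concatenate(means)        # stacked control point mean vector
    Sigma_P = block_diag(*covs)         # independence => block-diagonal covariance
    return C @ mu_P, C @ Sigma_P @ C.T  # jointly Gaussian mean and covariance of X
```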

Next, the mean function and Gaussian process kernel induced by a given $\mathcal{N}$-Curve will be derived. For simplicity, only the 1-dimensional case is regarded, which is also the common use case of Gaussian processes. Following this, the control points $P_k$ are defined by their mean value $\mu_k$ and variance $\sigma_k^2$. While the mean function is equal to that of the $\mathcal{N}$-Curve itself (see Equation (3.4)), the kernel $k_{\mathcal{N}}(t_i, t_j)$ for two curve points $X = \sum_{k=0}^{N_{\text{deg}}} b_{k,N_{\text{deg}}}(t_i) P_k$ and $Y = \sum_{k=0}^{N_{\text{deg}}} b_{k,N_{\text{deg}}}(t_j) P_k$ at indices $t_i$ and $t_j$ with $t_i, t_j \in [0,1]$, and respective mean values $\mu_X = \sum_{k=0}^{N_{\text{deg}}} b_{k,N_{\text{deg}}}(t_i) \mu_k$ and $\mu_Y = \sum_{k=0}^{N_{\text{deg}}} b_{k,N_{\text{deg}}}(t_j) \mu_k$, is defined as follows:

$$\begin{split} k\_{\mathcal{N}}(t\_{i},t\_{j}) &= \mathbb{E}[(X-\mu\_{X})(Y-\mu\_{Y})] \\ &= \mathbb{E}\left[\left(\sum\_{k=0}^{n}b\_{k,n}(t\_{i})P\_{k}-\mu\_{X}\right)\left(\sum\_{k=0}^{n}b\_{k,n}(t\_{j})P\_{k}-\mu\_{Y}\right)\right] \\ &= \mu\_{X}\mu\_{Y}+\mathbb{E}\left[\sum\_{k=0}^{n}\left(\sum\_{k'=0}^{n}b\_{k,n}(t\_{i})b\_{k',n}(t\_{j})P\_{k}P\_{k'}\right)\right] \\ &- \mathbb{E}\left[\sum\_{k=0}^{n}b\_{k,n}(t\_{i})\mu\_{Y}P\_{k}\right]-\mathbb{E}\left[\sum\_{k=0}^{n}b\_{k,n}(t\_{j})\mu\_{X}P\_{k}\right] \\ &= \mu\_{X}\mu\_{Y}+\mathbb{E}\left[\sum\_{k=0}^{n}b\_{k,n}(t\_{i})b\_{k,n}(t\_{j})P\_{k}^{2}\right] \\ &+\mathbb{E}\left[\sum\_{k=0}^{n}\left(\sum\_{k'=0,k'\neq k}^{n}b\_{k,n}(t\_{i})b\_{k',n}(t\_{j})P\_{k}P\_{k'}\right)\right] \\ &- \mu\_{Y}\sum\_{k=0}^{n}b\_{k,n}(t\_{i})\mu\_{k}-\mu\_{X}\sum\_{k=0}^{n}b\_{k,n}(t\_{j})\mu\_{k}. \end{split}$$

By applying $\mathbb{E}[P_k \cdot P_{k'}] = \mathbb{E}[P_k] \cdot \mathbb{E}[P_{k'}]$, which follows from the independence of the control points, and $\mathbb{E}[P_k^2] = \text{Var}[P_k] + (\mathbb{E}[P_k])^2$, the closed-form solution follows:

$$\begin{split} k\_{\mathcal{N}}(t\_i, t\_j) &= \mu\_X \mu\_Y + \sum\_{k=0}^n b\_{k,n}(t\_i) b\_{k,n}(t\_j) (\sigma\_k^2 + \mu\_k^2) \\ &+ \sum\_{k=0}^n \left( \sum\_{k'=0, k'\neq k}^n b\_{k,n}(t\_i) b\_{k',n}(t\_j) \mu\_k \mu\_{k'} \right) \\ &- \mu\_Y \sum\_{k=0}^n b\_{k,n}(t\_i) \mu\_k - \mu\_X \sum\_{k=0}^n b\_{k,n}(t\_j) \mu\_k. \end{split} \tag{3.19}$$

It can be noted that the diagonal elements of a covariance matrix obtained by $k_{\mathcal{N}}(t_i, t_j)$ correspond to the interpolated covariances of the given $\mathcal{N}$-Curve as defined in Equation (3.5).
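As a sanity check, the closed-form kernel of Equation (3.19) can be evaluated numerically; the sketch below (hypothetical helper names) implements it for 1-dimensional control points and reproduces the interpolated variance of Equation (3.5) on the diagonal, i.e. for $t_i = t_j$.

```python
import numpy as np
from math import comb

def bernstein(k, n, t):
    return comb(n, k) * t**k * (1 - t)**(n - k)

def ncurve_kernel(ti, tj, mus, sigmas):
    """Closed-form N-Curve kernel k_N(t_i, t_j) of Eq. (3.19) for 1-dimensional
    Gaussian control points with means `mus` and standard deviations `sigmas`."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    n = len(mus) - 1
    bi = np.array([bernstein(k, n, ti) for k in range(n + 1)])
    bj = np.array([bernstein(k, n, tj) for k in range(n + 1)])
    mu_x, mu_y = bi @ mus, bj @ mus
    cross = np.outer(bi * mus, bj * mus)                 # all (k, k') mean products
    value = mu_x * mu_y
    value += np.sum(bi * bj * (sigmas**2 + mus**2))      # k = k' terms
    value += cross.sum() - np.trace(cross)               # k != k' terms
    value -= mu_y * mu_x + mu_x * mu_y                   # the two subtracted sums
    return value

# diagonal entries equal the blended variance of Eq. (3.5)
assert np.isclose(ncurve_kernel(0.5, 0.5, [0, 1], [1, 1]), 0.5)
```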

Now, with the $\mathcal{N}$-Curve model covering a specific subset of Gaussian processes, the main commonalities and differences between both formulations will be discussed briefly. Although both Gaussian processes and the $\mathcal{N}$-Curve model target a distribution over functions, or rather parametric curves in the case of $\mathcal{N}$-Curves, there is a key difference worth mentioning, in that both approaches provide a different perspective on the task of distribution modeling. While Gaussian processes pursue a bottom-up approach, especially in Gaussian process regression [Ras06], $\mathcal{N}$-Curves provide a top-down approach. As such, in Gaussian process regression, the relation between Gaussian "curve points" is modeled explicitly using the covariance function. Then treating these curve points as part of a partitioned joint distribution ensures the GP property. In the $\mathcal{N}$-Curve model, the distribution over functions is achieved by modeling the curve-defining control points stochastically, which dictate the relation between curve points implicitly. Thereby, the GP property follows from the correlation between curve points, which emerges from geometric constraints given by the underlying Bézier curve, i.e. the curve points being linear transformations of the same set of stochastic control points.

In order to conclude this short section on the connection between $\mathcal{N}$-Curves and Gaussian processes, a few illustrations are given, which compare commonly used Gaussian process kernels with different $\mathcal{N}$-Curve kernels. After that, a simple toy example, depicting the calculation of the posterior distribution given a few observations of a target function, is provided. For these examples, only zero mean Gaussian processes are considered.

Figure 3.12 illustrates a *radial basis function* (abbrev.: RBF, [Gör19]) kernel

$$k\_{\sigma,l}^{\text{rbf}}(t,t') = \sigma^2 \exp\left(-\frac{||t-t'||^2}{2l^2}\right),\tag{3.20}$$

with $\sigma = 1$ and $l = 0.25$, a linear kernel [Gör19]

$$k^{\rm lin}\_{\sigma,\sigma\_b,c}(t,t') = \sigma\_b^2 + \sigma^2(t-c)(t'-c),\tag{3.21}$$

with $\sigma = \sigma_b = c = 0.5$, and two $\mathcal{N}$-Curve kernels $k_{\mathcal{N}_1}(t, t')$ and $k_{\mathcal{N}_2}(t, t')$. $\mathcal{N}_1$ consists of two unit Gaussian control points, i.e. $\mathcal{N}(0, 1)$, and $\mathcal{N}_2$ consists of 9 zero mean Gaussian control points with standard deviations $\sigma_0 = \sigma_8 = 1$, $\sigma_1 = \sigma_7 = 1.25$, $\sigma_2 = \sigma_6 = 1.5$, $\sigma_3 = \sigma_5 = 1.75$ and $\sigma_4 = 2$. The standard deviation increases towards the center of the control point set, in order to cope with non-linear blending (see also Section 3.1.2). For each kernel, the covariance matrix has been calculated for 20 equally spaced values for $t$ ranging from 0 to 1.

**(a)** RBF kernel $k^{\text{rbf}}_{\sigma=1,l=0.25}(t,t')$. **(b)** Linear kernel $k^{\text{lin}}_{\sigma=0.5,\sigma_b=0.5,c=0.5}(t,t')$.

**(c)** $\mathcal{N}$-Curve kernel $k_{\mathcal{N}_1}(t,t')$. **(d)** $\mathcal{N}$-Curve kernel $k_{\mathcal{N}_2}(t,t')$.

When comparing the covariance matrices in figures (b) and (c), it can be seen that the results from the kernel based on a linear $\mathcal{N}$-Curve with unit Gaussian control points look similar to those based on the given linear kernel. In fact, the covariance matrix calculated with $k_{\mathcal{N}_1}$ is equal to the covariance matrix calculated with $k^{\text{lin}}_{\sigma=0.5,\sigma_b=0.5,c=0.5}(t,t')$ when normalizing its values to $[0, 1]$. On the other hand, the covariance matrix obtained with $k_{\mathcal{N}_2}$ (figure (d)), which is derived from a more complex $\mathcal{N}$-Curve, tends to be more comparable to the covariance matrix calculated with $k^{\text{rbf}}_{\sigma=1,l=0.25}(t,t')$ (figure (a)).

In combination with a mean vector, **0** in this case, each covariance matrix defines a prior distribution for a Gaussian process. Following this, Figure 3.13 depicts sample functions drawn from each prior distribution, again showing the parallels between the kernels.

**Figure 3.13:** Samples drawn from prior distributions using different Gaussian process kernels. The $2\sigma$ region around the mean value is depicted as a red shaded area.

Finally, the Gaussian processes defined by the RBF kernel $k^{\text{rbf}}_{\sigma=1,l=0.25}(t,t')$ and the $\mathcal{N}$-Curve kernel $k_{\mathcal{N}_2}$ are used to approximate $f(t) = \sin(8t)$ on $[0,1]$ using 4 observed data points. Using these data points, the posterior distribution of each Gaussian process can be calculated, which ideally tends to fit the targeted function with an increasing number of observed data points. The posterior distributions for both Gaussian processes are depicted in Figure 3.14.

**Figure 3.14:** Posterior distributions of Gaussian processes obtained by using different kernels given 4 data points (circular markers) of a sine function (dashed line). The $2\sigma$ region around the mean value is depicted as a red shaded area.
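The posterior computation itself is standard Gaussian process regression; the sketch below reuses the hypothetical `ncurve_kernel` from above and uses 4 arbitrarily placed observations of $\sin(8t)$ (the exact observation positions used in the thesis are not reproduced here).

```python
import numpy as np

def gp_posterior(kernel, t_train, y_train, t_test, jitter=1e-8):
    """Zero mean GP posterior mean and covariance for a kernel function kernel(t, t')."""
    K = np.array([[kernel(a, b) for b in t_train] for a in t_train]) + jitter * np.eye(len(t_train))
    K_s = np.array([[kernel(a, b) for b in t_train] for a in t_test])
    K_ss = np.array([[kernel(a, b) for b in t_test] for a in t_test])
    mean = K_s @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, cov

# N-Curve kernel of N_2: 9 zero mean control points with increasing central std dev
sigmas = [1, 1.25, 1.5, 1.75, 2, 1.75, 1.5, 1.25, 1]
mus = np.zeros(9)
t_train = np.array([0.1, 0.4, 0.6, 0.9])       # hypothetical observation positions
y_train = np.sin(8 * t_train)
mean, cov = gp_posterior(lambda a, b: ncurve_kernel(a, b, mus, sigmas),
                         t_train, y_train, np.linspace(0, 1, 50))
```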

## **3.2 Modeling Infinite Stochastic Processes**

The $\mathcal{N}$-Curve model as presented in the previous Section 3.1 is viable for most real-world applications, which are generally concerned with sequences of fixed or at least bounded length. Thus, using Bézier curves as a basis, the representable curve complexity suffices for the requirements of given sequential data. However, apart from the practicality of the $\mathcal{N}$-Curve model, there exists a conceptual limitation for modeling continuous-time stochastic processes $\{X_t\}_{t \in T}$. This limitation is given by the bounded index set $T = [0, 1]$, which is imposed by the use of a Bézier curve basis. Because of this, infinite continuous-time stochastic processes, i.e. with $T = \mathbb{R}^+_0$, cannot be represented by the $\mathcal{N}$-Curve model. Further, infinite discrete-time stochastic processes $\{X_t\}_{t \in T_N}$, which model open-ended sequences and are realized using a discrete subset of $T$, are also affected by this limitation in a more subtle way. Although the number of control points of an $\mathcal{N}$-Curve is fixed, it is still possible to extract an infinite number of curve points with infinitesimal distance between subsequent curve points. However, as a sequence becomes longer, it generally also expands in space. Thus, a potentially more complex underlying Bézier curve, i.e. a curve of higher degree, is required for achieving an accurate approximation. While there is no theoretical limit to the number of control points defining a Bézier curve, the approximation quality may suffer from an increasing number of control points in practice. This is due to the increased number of concurring control points, each contributing to every curve point (*global control*).

In the context of parametric curves, a common approach to tackle increasing curve complexity in terms of length and shape is the use of segmented curves. Here, simpler curves of fixed degree are stitched together in order to form a more complex curve, granting *local control* over curve segments. Thus, the number of segments can be increased as required, without affecting the entire curve. In the context of Bézier curves, such a curve is then called a *composite Bézier curve* or *Bézier spline* [Reb21]. Following this, a Bézier spline of degree $N_{\text{deg}}$ is defined in terms of a sequence of Bézier curve segments $\{s_1, ..., s_M\}$, where each segment $s_i$ is defined by its own set of control points $\{P^i_0, ..., P^i_{N_{\text{deg}}}\}$. Further, at least $C^0$ continuity, i.e. $P^i_{N_{\text{deg}}} = P^{i+1}_0$, holds for subsequent Bézier curve segments.

If necessary, additional smoothness requirements, e.g. $C^1$ or $C^2$ continuity, can be added. Under $C^1$ continuity, subsequent curve segments have identical tangents at the control point joining both segments. $C^2$ continuous curves additionally have identical curvature at this point [Bar95]. Under $C^0$ continuity, Bézier splines grant local control, i.e. the curve can be altered on a per-segment basis without affecting other segments. This flexibility is restricted when enforcing $C^1$ or $C^2$ continuity, as neighboring control points of subsequent curve segments become dependent on one another. Geometrically, $C^1$ continuity can be enforced by making the second to last control point $P^i_{N_{\text{deg}}-1}$ of a curve segment, the joining control point and the second control point $P^{i+1}_1$ of the subsequent curve segment collinear. For $C^2$ continuity, these control points additionally have to have the same (euclidean) distance from the joining control point $P^i_{N_{\text{deg}}} = P^{i+1}_0$. Following this, local control can be granted to some degree by using more than 4 control points in a segment. Examples for $C^0$, $C^1$ and $C^2$ continuous segment intersections are given in Figure 3.15.

**Figure 3.15:** Examples for Bézier splines consisting of three segments with varying continuity constraints at segment intersections. Figure (a) depicts a $C^0$ continuous Bézier spline and (b) depicts a Bézier spline meeting $C^2$ continuity at the intersection of the first two segments and $C^1$ continuity at the intersection of the second and third segment.

### **3.2.1 The Meta-time $\mathcal{N}$-Curve Model**

Given the aforementioned conceptual limitations, the goal is to extend the $\mathcal{N}$-Curve model for infinite stochastic processes and open-ended sequences, i.e. a stochastic process with index¹ $t \in \mathbb{R}^+_0$. For this purpose, the concept of splines is incorporated into the model, thus combining $\mathcal{N}$-Curve segments into a more complex probabilistic spline. In order to model an indefinite number of curve segments, a notion of control point evolution is introduced, by defining the set of Gaussian control points as a function of time $t$. Due

¹ The index $t$ will again be interpreted as *time* for a more intuitive derivation of this extension.

to each function value then defining an entire curve segment covering multiple time steps, a (potentially) asynchronous timeline emerges. This timeline is denoted as *meta-time* with index $\tilde{t}$ in the following. Subsequent values of $\tilde{t}$ then yield subsequent $\mathcal{N}$-Curve segments on the probabilistic spline modeling the stochastic process. This probabilistic spline will be denoted as a *meta-time $\mathcal{N}$-Curve* in the following. With the control point function representing a sequence of Gaussian control point sets, the meta-time $\mathcal{N}$-Curve is defined as a sequence of connected $\mathcal{N}$-Curve segments $\{\{P^{\tilde{t}}_0, ..., P^{\tilde{t}}_{N_{\text{deg}}}\}\}_{\tilde{t}}$, indexed by meta-time $\tilde{t} \in \mathbb{N}_0$. For associating a point in time $t$ with the corresponding $\mathcal{N}$-Curve segment, the *meta-time mapping* $m: t \mapsto \tilde{t}$ onto the meta-timeline is introduced. In addition, with $t$ (and $\tilde{t}$) now exceeding the index range of an $\mathcal{N}$-Curve, another mapping $m_c: T \to [0, 1]$ onto the *curve-time* parameter is introduced in order to access the exact curve point on a curve segment. This mapping is denoted as the *curve-time mapping*. Note that $m_c$ is technically defined as $m_c: (t, \tilde{t}) \mapsto [0, 1]$, but as $\tilde{t}$ is derived from $t$ through $m$, the additional parameter can be omitted. Finally, in the context of meta-time $\mathcal{N}$-Curves, $C^0$ continuity is given by matching the mean vectors and covariance matrices at the intersection of subsequent $\mathcal{N}$-Curve segments. Aforementioned geometric restrictions for $C^1$ and $C^2$ continuity only apply to control point mean values. Figure 3.16 gives an illustration of the different timelines and the basic idea of the model extension.

**Figure 3.16:** Illustration of the proposed extension of the $\mathcal{N}$-Curve model, basing stochastic process modeling on probabilistic Bézier splines instead of single Bézier curves, thus allowing to model infinite stochastic processes. Interpreting the stochastic process index as time, multiple connected timelines emerge, namely meta-time $\tilde{t}$ and curve-time. Given a specific point in time $t$, the corresponding $\mathcal{N}$-Curve segment is determined by the mapping $m(t)$ and the specific point on the segment by $m_c(t)$. An exemplary resulting probabilistic spline is depicted at the bottom.

With the introduced timeline mappings, the original formulation of the $\mathcal{N}$-Curve model can be extended into the *meta-time $\mathcal{N}$-Curve model* as depicted in Table 3.1. Here, only the definition of the extended $\mathcal{N}$-Curve is provided, as the derivation of other formulas building on the curve definition, e.g. the curve point probability density function (Equation (3.6)), is not directly affected by these changes. Further, exemplary definitions are provided for both the meta-time and the curve-time mapping. Potential definitions for these mappings are discussed in Section 3.2.2.

**Table 3.1:** Overview of changes made to the $\mathcal{N}$-Curve model in order to derive the meta-time $\mathcal{N}$-Curve model extension. For completeness, examples for the meta-time and curve-time mappings are provided.


On a final note, the meta-time $\mathcal{N}$-Curve model can be used in the context of multi-modal stochastic processes by following the same approach as described for the original $\mathcal{N}$-Curve model, using a mixture of meta-time $\mathcal{N}$-Curves. In this case, each mode of a stochastic process representation follows a separate meta-time $\mathcal{N}$-Curve. The mixture weight distribution is defined on a per-segment basis, i.e. $\pi(\tilde{t}) = \{\pi^{\tilde{t}}_1, ..., \pi^{\tilde{t}}_K\}$ is given at meta-time $\tilde{t}$. Following this, a potential benefit of meta-time $\mathcal{N}$-Curves is given by the fact that it is possible to alter the curve weights in each meta-time step. This allows for more control over the number of required mixture components on a per-segment basis.

### **3.2.2 Mapping Functions**

In the context of the meta-time $\mathcal{N}$-Curve model, two mappings have been defined, namely the meta-time mapping and the curve-time mapping, connecting the introduced timelines. As there possibly exists a wide range of options for defining either mapping, this section provides exemplary mapping definitions, which are expected to be relevant for an actual implementation of the model in the context of different sequence modeling tasks.

*Meta-time mapping:* Moving along the meta-timeline yields a sequence of $\mathcal{N}$-Curve segments along a probabilistic spline. In this context, the meta-time mapping $m$ maps a given point in time $t$ onto the meta-timeline, i.e. a natural number $\tilde{t} \in \mathbb{N}_0$, in order to determine the corresponding $\mathcal{N}$-Curve segment. Following this, the goal is to define a consistent meta-time mapping from $T$ onto the set of natural numbers.

The first possible definition of this mapping is given by a *fixed interval mapping*

$$m\_a(t) = \left\lfloor \frac{t}{a} \right\rfloor. \tag{3.22}$$

In this case, the meta-time $\tilde{t}$ is advanced at a fixed rate as $t$ increases, thus traversing an infinite sequence of (distinct) $\mathcal{N}$-Curve segments. While this definition may result in premature segment changes, an interesting special case is given for $a = 1$, where the resulting meta-time $\mathcal{N}$-Curve resembles a probabilistic spline with segments connected at their endpoints. Besides resulting in a well-defined segmented probabilistic curve, this further allows a straightforward definition of $C^1$ and $C^2$ continuous segment intersections via control point placement.

In cases where periodic repetitions are expected in sequential data, another definition is given by a *periodic mapping*

$$m\_{a,b,p}(t) = \left\lfloor \frac{2 \cdot a}{\pi} \arcsin\left(\sin\left(\frac{2 \cdot \pi}{p} \cdot t\right)\right) + b \right\rfloor,\tag{3.23}$$

which can be based for example on a triangle wave. Here, $t$ is mapped onto the same sequence of $\mathcal{N}$-Curve segments periodically, thus repeating the same segment sequence over and over. The repetition frequency is controlled by $p$. An alternative to the periodic mapping is given by a *modulo reset mapping*

$$m\_{a,k}(t) = \left\lfloor \frac{t}{a} \right\rfloor \mod k \tag{3.24}$$

with $k \geq 1$, which repeats the same segment sequence after every $k$ meta-time steps. This mapping is basically built on a sawtooth wave. Similar to the fixed interval mapping, these two definitions can be parameterized to yield an endpoint-connected probabilistic spline.

*Curve-time mapping:* After determining the current $\mathcal{N}$-Curve segment at time $t$, the current position within this segment needs to be determined. For this, a curve-time mapping needs to be defined, mapping $t$ onto curve-time in $[0,1]$. As this mapping is highly dependent on the specific definition of the meta-time mapping $m$, an exemplary curve-time mapping compliant with the given variants of $m$, which result in an endpoint-connected spline, is provided. All these variants map a given time $t$ in a way that a segment intersection occurs whenever $t$ is a multiple of 1. Following this, the difference between the value of $t$ for the first and last segment point is exactly 1 and intermediate values are in $[0, 1]$. As such, the curve-time mapping can be defined as

$$m\_{\mathbf{c}}(t) = t - m(t). \tag{3.25}$$
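A minimal sketch of the fixed interval mapping, the modulo reset mapping and the corresponding curve-time mapping of Equations (3.22), (3.24) and (3.25), assuming the endpoint-connected case $a = 1$ (function names are illustrative):

```python
import math

def m_fixed(t, a=1.0):
    """Fixed interval meta-time mapping m_a(t) = floor(t / a) (Eq. (3.22))."""
    return math.floor(t / a)

def m_modulo(t, a=1.0, k=3):
    """Modulo reset meta-time mapping (Eq. (3.24)); loops over k segments."""
    return math.floor(t / a) % k

def curve_time(t):
    """Curve-time mapping m_c(t) = t - m(t) for the endpoint-connected case a = 1 (Eq. (3.25))."""
    return t - m_fixed(t, a=1.0)

# With a = 1, t = 2.4 lies on meta-time segment 2 at curve-time 0.4.
assert m_fixed(2.4) == 2 and abs(curve_time(2.4) - 0.4) < 1e-9
```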

*Learned mapping:* On a final note, it is also possible to learn both mappings in a constrained optimization setting. A potential benefit of this is an efficient re-use of a few base segments, especially when the mapping can be conditioned on additional domain-specific input. On the downside, depending on the constraints defined during optimization, spline properties regarding $C^1$ and $C^2$ continuity might be lost.

### **3.2.3 Modeling discrete-time stochastic processes**

As a last point, this section provides some insight into how the concept of meta-time $\mathcal{N}$-Curves can be applied for modeling discrete-time stochastic processes using different mapping variants. For this, only mapping variants resulting in an endpoint-connected probabilistic spline are considered, as these are expected to be the most relevant in an application context.

Recall that in the $\mathcal{N}$-Curve model, a discrete index set $T_N \subset T$ is extracted from the continuous index set $T = [0,1]$ using a predetermined sequence length $N$ (see also Section 3.1). Now, with $T = \mathbb{R}^+_0$ and sequences being unbounded in length, it is necessary to define the sequence length $N_{\text{seg}}$ covered by each $\mathcal{N}$-Curve segment along a meta-time $\mathcal{N}$-Curve. Following this, the difference between subsequent stochastic process indices $t_l \in T_N$ and $t_{l+1} \in T_N$ is dictated by $N_{\text{seg}}$ through $\Delta t = \frac{1}{N_{\text{seg}}-1}$. Using this formulation, $N = N_{\text{seg}}$ is a necessary condition in the original $\mathcal{N}$-Curve model. Opposed to that, in the meta-time $\mathcal{N}$-Curve model it is possible to have $N > N_{\text{seg}}$, which ultimately allows modeling open-ended sequences through generating a stream of $\mathcal{N}$-Curve segments. The discrete index set can now be re-formulated as

$$T\_N = \left\{ \sum\_{l=0}^{\upsilon} \Delta t | \upsilon \in \{0, \ldots, N-1\} \right\}.\tag{3.26}$$

The meta-time mapping $m(t_l)$ is now required to fill the additional role of determining how many $\mathcal{N}$-Curve segments are required to model a sequence of length $N$ given $N_{\text{seg}}$. In the case of the fixed interval mapping

$$m\_{a=1}(t\_l) = \lfloor t\_l \rfloor \,,$$

the $l$'th element of a sequence lies on the $m_{a=1}(t_l)$'th $\mathcal{N}$-Curve segment. Thus, the meta-time $\mathcal{N}$-Curve model needs to generate

$$m\_{a=1}(t\_N) = \frac{N}{N\_{\text{seg}}}$$

segments compliant with given continuity constraints. With $N \to \infty$, this results in an indefinite stream of distinct $\mathcal{N}$-Curve segments, connected at their endpoints. Looking at periodic mappings, e.g. the modulo reset mapping

$$m\_{a=1,k}(t\_l) = \lfloor t\_l \rfloor \mod k,$$

the meta-time $\mathcal{N}$-Curve model needs to generate a maximum of $k$ segments, which are referenced in a loop via $m_{a=1,k}$. The maximum number of segments is required when $N > (k - 1) \cdot N_{\text{seg}}$. For $N \to \infty$, this results in an indefinite repetition of the same $\mathcal{N}$-Curve segments. Finally, in both cases, specific random variables along the meta-time $\mathcal{N}$-Curve are determined according to the curve-time mapping $m_c$ and the corresponding $\mathcal{N}$-Curve segment.

## **3.3 Summary**

The main contributions of this chapter are twofold. First, a probabilistic extension to Bézier curves ($\mathcal{N}$-Curves) was introduced, which models sequences of Gaussian probability distributions along a parametric curve. Thereby, an $\mathcal{N}$-Curve is defined in terms of stochastic control points. Further, it has been shown that $\mathcal{N}$-Curves are a special case of Gaussian processes. Second, a model building on mixtures of $\mathcal{N}$-Curves was presented, which enables the modeling of multi-modal stochastic processes. Using the $\mathcal{N}$-Curve model and its meta-time variant, finite and infinite, as well as discrete-time and continuous-time stochastic processes can be modeled.

# **4 Proposed Implementation**

This chapter provides an implementation for the $\mathcal{N}$-Curve model (see Section 3.1), based on Mixture Density Networks (MDN). Therefore, Section 4.1 first extends the introduction of Mixture Density Networks as given in Section 2.2.2 and then proceeds to define $\mathcal{N}$-Curve Mixture Density Networks (abbrev.: $\mathcal{N}$-MDN). The definition of the $\mathcal{N}$-MDN is accompanied by several toy examples, exploring the capabilities of the $\mathcal{N}$-Curve model.

In addition to the $\mathcal{N}$-MDN, this chapter provides a proof of concept for the conceptual extension of the $\mathcal{N}$-Curve model, given by the meta-time $\mathcal{N}$-Curve model as described in Section 3.2. As such, a recurrent extension of the $\mathcal{N}$-MDN is introduced and briefly evaluated on multiple toy examples in Section 4.2.

## **4.1 $\mathcal{N}$-Curve Mixture Density Networks**

Defining a fully regression-based probabilistic sequence model is one of the main objectives pursued in this thesis. Following this, an MDN for learning the parameters of an $\mathcal{N}$-Curve mixture (see Section 3.1) from discrete sequence data is proposed. An MDN is a feed-forward neural network

$$\Phi(\mathbf{v}) = (\{\pi\_k, \mu\_k, \Sigma\_k\}\_{k \in \{1, \ldots, K\}} | \mathbf{v}), \tag{4.1}$$

that takes an input vector **v** and maps it onto the parameters of a $d$-dimensional, $K$-component Gaussian mixture distribution. In order to ensure that the MDN generates a valid set of mixture parameters, the partitioned network output

$$\mathfrak{H} = (\hat{\pi}\_1, \ldots, \hat{\pi}\_K, \hat{\mu}\_1, \ldots, \hat{\mu}\_K, \hat{\sigma}\_1, \ldots, \hat{\sigma}\_K, \hat{\rho}\_1, \ldots, \hat{\rho}\_K),$$

with

$$\forall k:\ \hat{\pi}\_k \in \mathbb{R},\ \hat{\mu}\_k \in \mathbb{R}^d,\ \hat{\sigma}\_k \in \mathbb{R}^d \text{ and } \hat{\rho}\_k \in \mathbb{R}^{\frac{d^2 - d}{2}},$$

is further transformed to meet parameter value requirements, i.e.

$$\pi\_k = \text{softmax}(\hat{\pi}\_1, \ldots, \hat{\pi}\_K)\_k,$$

such that $\sum_k \pi_k = 1$ and

$$\begin{aligned} \mu\_k &= \hat{\mu}\_k \\ \sigma\_{k,i} &= f\_{\sigma}(\hat{\sigma}\_{k,i}) > 0 \;\; \forall i \in \{1, ..., d\} \\ \rho\_{k,j} &= f\_{\rho}(\hat{\rho}\_{k,j}) \in [-1, 1] \;\; \forall j \in \left\{1, ..., \frac{d^2 - d}{2}\right\}. \end{aligned}$$

Note that the covariance matrices are calculated from the standard deviations and correlations in order to ensure positive definiteness. For the transformations $f_\sigma$ and $f_\rho$, there are several relevant options to consider. The original formulation [Bis94] employed

$$f\_{\sigma}(\mathbf{x}) = \exp(\mathbf{x}) \text{ and } f\_{\rho}(\mathbf{x}) = \tanh(\mathbf{x}) \tag{4.2}$$

to transform $\hat{\sigma}_k$ and $\hat{\rho}_k$ into respective value ranges. Both of these functions, however, can lead the MDN into numerical issues during training. In the case of $f_\sigma$, the exponential function yields unstable optimization results for large input values due to its exponential growth. To cope with this, a shifted version of the *Exponential Linear Unit* (abbrev.: *ELU*, [Cle15, Gui17])

$$f\_{\sigma}(\mathbf{x}) = \text{ELU}(1, \mathbf{x}) + 1 \tag{4.3}$$

with

$$\text{ELU}(a, x) = \begin{cases} a(e^{x} - 1) & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{4.4}$$

and the *softplus* function (also referred to as *SmoothReLU*, [Dug01, Glo11, Iso17])

$$f\_{\sigma}(\mathbf{x}) = \text{softplus}(\mathbf{x}) = \ln(1 + e^{\mathbf{x}}) \tag{4.5}$$

are commonly used in MDNs. These functions are similar to the exponential function for negative and small positive input values, but transition into a linear function for larger input values. From an optimization point of view, the softplus function may be preferred over the ELU, as the latter is non-continuous in its derivatives [Sch20b]. Regarding the correlations $\rho_{k,j}$, using the tanh function for $f_\rho$ can lead to vanishing gradients. Thus, the *softsign* function [Glo10, Iso17]

$$f\_{\rho}(\mathbf{x}) = \text{softsign}(\mathbf{x}) = \frac{\mathbf{x}}{1 + |\mathbf{x}|} \tag{4.6}$$

may be used instead, despite having more complex derivatives [Sza21]. A schematic for an MDN generating a 2-dimensional Gaussian mixture distribution is depicted in Figure 4.1.

**Figure 4.1:** Schematic of a Mixture Density Network generating a 2-dimensional Gaussian mixture distribution. The outputs of a (feed-forward) neural network are linearly transformed and mapped onto respective parameter value ranges in order to determine the parameters, $\{\pi_k, \mu_k, \Sigma_k\}_{k \in \{1,...,K\}}$, of a Gaussian mixture distribution. The covariance matrices are given in terms of the standard deviations $\sigma_{k,i}$ and the correlations $\rho_{k,j}$. For illustration purposes, the mean vector values $\mu_{k,1}$ and $\mu_{k,2}$, as well as the standard deviations $\sigma_{k,1}$ and $\sigma_{k,2}$ for each mixture component are not displayed separately.
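The described output transformation can be sketched as follows in PyTorch (the layout of the partitioned output vector and all names are assumptions made for illustration, using the softplus and softsign variants discussed above):

```python
import torch
import torch.nn.functional as F

def transform_mdn_output(h, K, d):
    """Map the raw, partitioned MDN output h onto valid Gaussian mixture parameters.
    Assumed (hypothetical) layout per batch element: [pi | mu | sigma | rho]."""
    n_rho = (d * d - d) // 2
    pi_raw, mu_raw, sigma_raw, rho_raw = torch.split(
        h, [K, K * d, K * d, K * n_rho], dim=-1)
    pi = F.softmax(pi_raw, dim=-1)                                # sum_k pi_k = 1, pi_k >= 0
    mu = mu_raw.reshape(*h.shape[:-1], K, d)                      # means are used as-is
    sigma = F.softplus(sigma_raw).reshape(*h.shape[:-1], K, d)    # sigma > 0
    rho = F.softsign(rho_raw).reshape(*h.shape[:-1], K, n_rho)    # rho in (-1, 1)
    return pi, mu, sigma, rho
```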

Following this, Mixture Density Networks can be adapted easily to output the parameters of a $K$-component $\mathcal{N}$-Curve mixture with $\mathcal{N}$-Curves of degree $N_{\text{deg}}$ by generating the parameters $(\mu_i^k, \Sigma_i^k)$ for all $K \cdot (N_{\text{deg}} + 1)$ stochastic control points and the respective curve weights $\{\pi_k\}_{k \in \{1,...,K\}}$. Advantages of using an MDN for learning the $\mathcal{N}$-Curve mixture parameters, rather than other algorithms (e.g. *Expectation Maximization* [Dem77]), are twofold. First, MDNs allow their output distribution to be conditioned on arbitrary inputs. Thus, the MDN provides an easy approach to learn and process conditional $\mathcal{N}$-Curve mixtures, allowing the model to be used in a conditional inference framework. Second, the MDN can be incorporated easily into (almost) any neural network architecture without the need to control the gradient flow. Besides that, there are two notable drawbacks of MDNs to consider, namely mode collapse due to overfitting and instabilities during training [Mak19]. In the context of MDNs, mode collapse commonly refers to a problem, where the MDN puts all weight on a single low-variance component, regardless of the number of available components. While instabilities during training can be mitigated by choosing appropriate functions for $f_\sigma$ and $f_\rho$, mode collapse is expected to be reduced by setting MDNs in the context of $\mathcal{N}$-Curves, as modeling parametric curves instead of single points is expected to yield more distinct modes.
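A minimal sketch of such an $\mathcal{N}$-MDN as a PyTorch module is given below; layer sizes are hypothetical and, for brevity, only diagonal covariances (standard deviations) are produced for the control points:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCurveMDN(nn.Module):
    """Feed-forward network mapping an input vector v onto K curve weights and
    K * (N_deg + 1) Gaussian control points (illustrative sketch only)."""
    def __init__(self, in_dim, K=2, n_deg=4, d=2, hidden=64):
        super().__init__()
        self.K, self.n_cp, self.d = K, n_deg + 1, d
        out_dim = K + K * self.n_cp * d * 2       # weights + means + std devs
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, v):
        h = self.net(v)
        pi_raw, mu_raw, sigma_raw = torch.split(
            h, [self.K, self.K * self.n_cp * self.d, self.K * self.n_cp * self.d], dim=-1)
        pi = F.softmax(pi_raw, dim=-1)                                    # curve weights
        mu = mu_raw.reshape(-1, self.K, self.n_cp, self.d)                # control point means
        sigma = F.softplus(sigma_raw).reshape(-1, self.K, self.n_cp, self.d)  # std devs > 0
        return pi, mu, sigma
```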

The most commonly used loss function for training an MDN is the *negative log-likelihood* [Bis94], which can also be adapted for training the $\mathcal{N}$-Curve Mixture Density Network from discrete sequence data. Let $\mathcal{S} = \{\mathcal{S}_1, ..., \mathcal{S}_M\}$ be a set of realizations of a stochastic process with $\mathcal{S}_j = \{\mathbf{x}_1^j, ..., \mathbf{x}_N^j\}$, where each $\mathbf{x}_l^j$ with $l \in \{1, ..., N\}$ is a sample value for the respective random variable $X_{t_l}$ at time $t_l \in T_N$ (see Section 3.1). In order to simplify the training procedure, Gaussian random variables along the $\mathcal{N}$-Curve are treated as if they were independent. Following this, the joint probability of the samples $\mathbf{x}_l^j$ in a sequence $\mathcal{S}_j$ along an $\mathcal{N}$-Curve factorizes into the unnormalized Gaussian density

$$p^{\psi}(\mathcal{S}\_{j}) = p^{\psi}(\mathbf{x}\_{1}^{j},...,\mathbf{x}\_{N}^{j}) = \prod\_{l=1}^{N} p\_{t\_{l}}^{\psi}(\mathbf{x}\_{l}^{j}).\tag{4.7}$$

This is exploited when defining the loss function. It should be noted that this simplification can be justified, as the correlation between these Gaussian random variables is enforced implicitly by the underlying $\mathcal{N}$-Curve and thus by the stochastic control points, which are estimated during training.

For a single sequence $\mathcal{S}_j$ and an $\mathcal{N}$-Curve $\psi = \Phi(\mathbf{v})$, the loss function is then defined by the negative (unnormalized) log-likelihood

$$\begin{split} \mathcal{L} &= -\log \ p^{\psi}(\mathbf{x}\_{1}^{j}, \dots, \mathbf{x}\_{N}^{j}) \\ &= -\log \left( \prod\_{l=1}^{N} p\_{t\_{l}}^{\psi}(\mathbf{x}\_{l}^{j}) \right) \\ &= -\sum\_{l=1}^{N} \log \ p\_{t\_{l}}^{\psi}(\mathbf{x}\_{l}^{j}) \\ &= -\sum\_{l=1}^{N} \log \ p(\mathbf{x}\_{l}^{j}|\mu^{\psi}(t\_{l}), \Sigma^{\psi}(t\_{l})) \end{split} \tag{4.8}$$

of the sequence $\mathcal{S}_j$ given an input vector **v**. Therefore, the loss for a set of sequences $\mathcal{S} = \{\mathcal{S}_1, ..., \mathcal{S}_M\}$ is defined as

$$\mathcal{L} = \sum\_{j=1}^{M} \left( -\sum\_{l=1}^{N} \log \ p(\mathbf{x}\_{l}^{j} | \boldsymbol{\mu}^{\Psi}(t\_{l}), \boldsymbol{\Sigma}^{\Psi}(t\_{l})) \right). \tag{4.9}$$

Equation (4.9) can easily be extended for $\mathcal{N}$-Curve mixtures. Given an $\mathcal{N}$-Curve mixture $\Psi = \Phi(\mathbf{v})$, the likelihood of a single training sequence $\mathcal{S}_j$ is now calculated as the weighted linear combination of the likelihoods of $\mathcal{S}_j$ for each $\psi_k$ (see Equation (4.7)):

$$p^{\Psi}(\mathcal{S}\_{j}) = \sum\_{k=1}^{K} \pi\_k p^{\psi\_{k}}(\mathcal{S}\_{j}).\tag{4.10}$$

Thus, the loss for a set of sequences can be defined as

$$\begin{split} \mathcal{L} &= \frac{1}{M} \sum\_{j=1}^{M} -\log \sum\_{k=1}^{K} \pi\_{k} p^{\psi\_{k}}(\mathcal{S}\_{j}) \\ &= \frac{1}{M} \sum\_{j=1}^{M} -\log \sum\_{k=1}^{K} \pi\_{k} \prod\_{l=1}^{N} p\_{t\_{l}}^{\psi\_{k}}(\mathbf{x}\_{l}^{j}) \\ &= \frac{1}{M} \sum\_{j=1}^{M} -\log \sum\_{k=1}^{K} \exp \left( \log \pi\_{k} + \sum\_{l=1}^{N} \log \left( p\_{t\_{l}}^{\psi\_{k}}(\mathbf{x}\_{l}^{j}) \right) \right). \end{split} \tag{4.11}$$

Then, the $\mathcal{N}$-Curve Mixture Density Network (abbrev.: $\mathcal{N}$-MDN) can be trained using a standard gradient descent policy. Most commonly, momentum-based gradient descent optimizers are employed. Popular choices include *Adam* [Kin15] and *RMSprop* [Rud16]. It should be noted that, from an optimization point of view, it is preferable to use the mean of the likelihoods when long sequences or many samples are processed, as the sum of negative log-likelihoods may result in large loss values and thus a less stable optimization. Further, it is recommended to output the mean vectors relative to the last element of the input sequence instead of their absolute values. That way, the $\mathcal{N}$-MDN learns a residual mapping, which has proven to be easier to optimize and to yield more accurate results [He16][Hug17]. Finally, the loss function $\mathcal{L}$ given in Equation (4.11) is arranged such that the *log-sum-exp trick* [Pre07] can be applied. This trick prevents arithmetic underflow by offsetting the values in the exponent, according to

$$\log \left\{ \sum\_{l} \exp \left\{ \mathbf{z}\_{l} \right\} \right\} = \log \left\{ \sum\_{l} \exp \left\{ \mathbf{z}\_{l} - \mathbf{z}\_{\max} \right\} \right\} + \mathbf{z}\_{\max}.$$
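A compact PyTorch sketch of the mixture loss in Equation (4.11), using `torch.logsumexp` for numerical stability and diagonal covariances for brevity (all shapes and names are assumptions):

```python
import torch

def ncurve_mixture_nll(x, pi, mu, sigma):
    """Negative log-likelihood of Eq. (4.11) for a batch of sequences.
    x: (M, N, d) observed sequences; pi: (K,) curve weights; mu, sigma: (K, N, d)
    mean and standard deviation of the Gaussian curve points at each t_l."""
    dist = torch.distributions.Normal(mu, sigma)               # (K, N, d)
    # log p_{t_l}^{psi_k}(x_l^j), summed over time steps and dimensions
    log_p = dist.log_prob(x.unsqueeze(1)).sum(dim=(-1, -2))    # (M, K)
    # log-sum-exp over components, offset by log pi_k
    log_mix = torch.logsumexp(torch.log(pi) + log_p, dim=-1)   # (M,)
    return -log_mix.mean()
```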

In this implementation of the $\mathcal{N}$-Curve (mixture) model, the input vector **v** allows the model to be used in either a conditional or a non-conditional inference setting. Examples for both settings include sequence prediction (conditional) and the estimation of the data generating distribution given some dataset (non-conditional). Regarding the conditional case, the stochastic process, and thus the $\mathcal{N}$-Curve mixture, depends on some input sequence. This sequence may be encoded into the input vector **v** using some encoder Enc(·). A common choice for sequence encoders are recurrent neural networks and variants, such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). In the non-conditional case, **v** can be set to some constant value. While, technically, this also gives a conditional $\mathcal{N}$-Curve mixture, there is no variation in **v**, resulting in a constant mixture.

Subsections 4.1.1 – 4.1.4 provide several toy examples using synthetically generated data to showcase different features and the functionality of the proposed implementation of the $\mathcal{N}$-Curve model, based on Mixture Density Networks. In order to remove as much complexity as possible, the input vector **v** is set to be constant, thus creating a non-conditional sequence synthesis setting. Following this, the $\mathcal{N}$-MDN learns to generate an $\mathcal{N}$-Curve mixture, which estimates the underlying stochastic process generating the provided data. For all of the toy examples, a *PyTorch* [Pas19] implementation of the $\mathcal{N}$-MDN is used. The model is trained using the Adam optimizer with the learning rate set to 0.01. All other parameters are left at PyTorch defaults. With low-dimensional sequence datasets being small in size, there is no need to perform batch optimization, as the entire dataset fits into memory. As such, the entire training dataset is processed with each iteration of training. During training, the model is assumed to have reached convergence when the training loss stagnates for 10 iterations.
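For the described non-conditional setting, a full-batch training loop could look roughly as follows; this is only a sketch, reusing the hypothetical `NCurveMDN` and `ncurve_mixture_nll` from above and a placeholder dataset instead of the actual toy data:

```python
import math
import torch

def bernstein_matrix(n, ts):
    """(N, n+1) matrix of Bernstein weights b_{i,n}(t_l) for the index set T_N."""
    return torch.tensor([[math.comb(n, i) * t**i * (1 - t)**(n - i)
                          for i in range(n + 1)] for t in ts])

model = NCurveMDN(in_dim=1, K=2, n_deg=4, d=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
v = torch.zeros(1, 1)                              # constant input vector
data = torch.randn(100, 11, 2)                     # placeholder for the training sequences
W = bernstein_matrix(4, torch.linspace(0, 1, 11).tolist())

for iteration in range(5000):
    pi, mu_cp, sigma_cp = model(v)                 # (1,K), (1,K,n+1,d), (1,K,n+1,d)
    mu_t = torch.einsum('ni,kid->knd', W, mu_cp[0])                      # blended means
    sigma_t = torch.einsum('ni,kid->knd', W**2, sigma_cp[0]**2).sqrt()   # squared weights
    loss = ncurve_mixture_nll(data, pi[0], mu_t, sigma_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```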

### **4.1.1 Estimating $\mathcal{N}$-Curve mixtures from noisy data**

For testing the $\mathcal{N}$-Curve Mixture Density Network's capability of learning the parameters of an $\mathcal{N}$-Curve mixture from noisy sequence data, a simple experiment is conducted. To enable a proper visualization of the results, this example uses 2-dimensional data. Following this, an experiment is set up as follows:

1. Define an arbitrary 2-component $\mathcal{N}$-Curve mixture $\Psi_{\text{gt}}$ with 5 Gaussian control points per component by defining the weights $\pi_k$, as well as the mean vectors $\mu_i^k$ and covariance matrices $\Sigma_i^k$ for the control points of each $\mathcal{N}$-Curve $\psi_k$. The weights are set to $\pi = \{0.75, 0.25\}$, resulting in a biased training dataset.


In this way, the stochastic control points defining the $\mathcal{N}$-Curve mixture have to be estimated indirectly through a set of sample sequences. The ground truth $\mathcal{N}$-Curve mixture $\Psi_{\text{gt}}$ and a sample sequence are depicted in Figure 4.2.

**Figure 4.2:** Ground truth $\mathcal{N}$-Curves $\psi_1$ and $\psi_2$ (red and green) starting at (0, 0) alongside a sampled Bézier curve (blue). Both (a) and (b) show the mean curve and the control points with covariance ellipses for both $\mathcal{N}$-Curves. In (a) the Bézier curve is illustrated as sampled from $\psi_1$, while (b) shows the discretized version of the sample curve with Gaussian noise applied.


The training dataset and the evolution of the parameters defining the estimated $\mathcal{N}$-Curve mixture $\Psi_{\text{pred}}$ over several training iterations are depicted in Figure 4.3. In order to make the quality of the estimation more comprehensible, Figures 4.3c – 4.3f show the deviation of the estimated from the actual parameters. The deviations are defined per control point as

$$\begin{aligned} \Delta \mu &= ||\mu\_{\text{pred},l}^k - \mu\_{\text{gt},l}^k||\_2 \\ \Delta \sigma &= \sigma\_{\text{pred},l}^k - \sigma\_{\text{gt},l}^k \\ \Delta \rho &= \rho\_{\text{pred},l}^k - \rho\_{\text{gt},l}^k. \end{aligned} \tag{4.12}$$

**Figure 4.3:** Training dataset (a) and estimated mixture parameters over the course of the training ((b) – (f)). (c): Euclidean distance between the estimated mean vectors and the ground truth. (d) – (f): Signed difference between estimated and ground truth standard deviations and correlations, revealing the occurrence of under- and over-estimation. In (c) – (f), values corresponding to the first and second mixture component are shown as a solid and dashed line, respectively. The depicted colored lines correspond to the control points $\{P_0, P_1, P_2, P_3, P_4\}$ of each $\mathcal{N}$-Curve in the mixture.

Looking at Figures 4.3b and 4.3c, it can be seen that the $\mathcal{N}$-MDN is well capable of estimating the actual component weights and control point mean vectors. On the other hand, the model seems to over-estimate the standard deviations slightly ($\Delta\sigma \approx 0.1$) and there appears to be a rather large discrepancy between the estimated and actual correlation values. A possible explanation for these discrepancies can be found when looking at the covariance matrix from a geometric point of view. In general, the covariance matrix not only controls the amount of dispersion in data drawn from a corresponding Gaussian distribution, but also the orientation of the principal axes of dispersion. As such, the covariance matrix can be interpreted as a linear transformation defined by a rotation matrix **R** and a diagonal scaling matrix **S**, such that $\Sigma = \mathbf{R}\mathbf{S}\mathbf{S}\mathbf{R}^{-1}$¹. In the 2-dimensional case, the data dispersion can be visualized by a (rotated) ellipse. The orientation of this ellipse is controlled by the covariance matrix off-diagonal element $\Sigma_{0,1} = \Sigma_{1,0} = \rho\,\sigma_1\sigma_2$. Following this, when either standard deviation is over-estimated, the error in ellipse orientation can be compensated by adapting the correlation $\rho$. Besides that, there likely exist multiple similar solutions for the covariance matrix defined by different correlation values generating a similar set of samples. As such, the $\mathcal{N}$-MDN only finds a locally optimal solution.

Finally, Figure 4.4 depicts $\Psi_{\text{pred}}$ at different stages during training, illustrating the process of estimating $\Psi_{\text{gt}}$. It can be seen that the positions of the control points are estimated well. Further, the orientation preservation assumption can be confirmed by looking at Figure 4.4d. The orientation of the estimated covariance ellipses is similar to that of the real ellipses, but the variances are slightly over-estimated.

¹ The transformation matrices can be obtained by an *Eigendecomposition*.

**Figure 4.4:** Ground truth $\mathcal{N}$-Curve mixture (red and green) and mixture estimated from noisy sequence data (blue and purple) after 0 (random initialization), 1000, 5000 and 10000 iterations of training.

### **4.1.2 Handling heteroscedastic data**

This example examines the capabilities of the $\mathcal{N}$-Curve model in handling heteroscedastic data, i.e. a stochastic process with varying variance between time steps. Using heteroscedastic data especially allows the examination of the modeling accuracy when varying the number of $\mathcal{N}$-Curve control points. In contrast to the previous toy example, this and further examples are only concerned with curve points along the estimated $\mathcal{N}$-Curve (mixture), representing the actual stochastic process.

In order to keep the experiment as simple as possible, the dimensionality of the data is set to 1. Thus, a dataset consisting of 1000 sample sequences is generated using a unimodal, time-discrete stochastic process $X = \{X_t \sim \mathcal{N}(\mu_t, \sigma_t)\}_{t \in T_{N=11}}$ with mean values moving along a parabolic curve and corresponding standard deviations changing between consecutive process indices $t_l$ and $t_{l+1}$. The stochastic process alongside sampled process realizations is given in Figure 4.5.

**Figure 4.5:** Illustration of the training samples drawn from a heteroscedastic stochastic process. (a): Ground truth discrete-time stochastic process $X = \{X_t\}_{t \in T_{N=11}}$. Standard deviations ($\sigma$, $2\sigma$ and $3\sigma$) along $X$ are illustrated as a shaded region around the mean curve. $3\sigma$ for each $X_t$ is indicated by a horizontal dashed line. (b): Sample sequences drawn from $X$.

Estimated $\mathcal{N}$-Curves with 5 and 15 control points generated by an $\mathcal{N}$-MDN are depicted in Figure 4.6. It can be seen that the $\mathcal{N}$-MDN learns a smooth mean curve and compensates variation in noise using the variance of the control points. With an increasing number of control points, an increasing number of variations in input noise can be compensated. This, however, comes at the cost of a less accurate mean curve, due to the increasing degree of the underlying polynomial curve. Note that the $\mathcal{N}$-Curves still model a time-continuous stochastic process, despite being given discrete data. Intermediate values are interpolated according to the Bernstein polynomials (see also Section 3.1).

**Figure 4.6:** Approximations of $X$ using $\mathcal{N}$-Curves with 5 (a) and 15 (b) control points as generated by an $\mathcal{N}$-MDN trained with noisy sequence data. Using more control points increases the accuracy in modeling the variance of the stochastic process at the cost of a less accurate mean curve. For reference, the training sequences (without connections between subsequent points) are illustrated using black cross markers.

### **4.1.3 Presence of superfluous mixture components**

When dealing with multi-modal sequence data, the actual number of modes is usually unknown. Following that, this toy example is concerned with the impact of superfluous $\mathcal{N}$-Curve mixture components on the $\mathcal{N}$-MDN training, as well as on the resulting mixture model. For this experiment, a bimodal discrete-time stochastic process $X = \{X_t\}_{t \in T_{N=10}}$ with constant variance is defined for generating a training dataset. Here, each random variable of the process follows a bimodal Gaussian mixture distribution and realizations of the process follow one of two paths with equal probability. The stochastic process alongside process realizations is depicted in Figure 4.7.

**Figure 4.7:** Illustration of the training samples drawn from a multi-modal stochastic process. (a): Ground truth discrete-time stochastic process $X = \{X_t\}_{t \in T_{N=10}}$. Standard deviations ($\sigma$, $2\sigma$ and $3\sigma$) along $X$ are illustrated as a shaded region around the mean curves. $3\sigma$ for each $X_t$ is indicated by a horizontal dashed line. (b): Sample sequences for both modes drawn from $X$.

A training dataset consisting of a set of 100 realizations of the aforementioned stochastic process is now used to estimate an $\mathcal{N}$-Curve mixture with $K = 6$ components, i.e. 4 superfluous components, with 5 control points each. Preferably, in the resulting model, the weights of all 4 unnecessary components are driven towards 0 and the remaining $\mathcal{N}$-Curves model the two modes of the stochastic process. The resulting $\mathcal{N}$-Curve mixture components after training the $\mathcal{N}$-MDN are depicted in Figure 4.8. The components are ordered in descending order of their associated mixture weight $\pi_k$.

**Figure 4.8:** Estimated $\mathcal{N}$-Curve mixture with $K = 6$ components generated by an $\mathcal{N}$-MDN. Two components ((a) and (b)) represent $X$ with accurate weighting. The weights of superfluous components are driven towards 0 during training. The shape of these components is thus not optimized further at some point (see figures (c) – (f)).

Looking at the estimated $\mathcal{N}$-Curve mixture, the $\mathcal{N}$-MDN behaves as expected. During training, it learns the weight distribution rather quickly, leading to superfluous components not being optimized further in their shape. This can be seen in Figure 4.9a. The remaining non-zero components (4.8a and 4.8b) accurately model $X$ with minor over-estimation of the variances. Stripping away the superfluous components, the resulting $\mathcal{N}$-Curve mixture is depicted in Figure 4.9b.

**Figure 4.9:** Evolution of the weight distribution during training (a) and the resulting approximation of $X$ after removing low-weight components (b). The coloring in (a) matches the components depicted in the previous figure (4.8). For reference, the training sequences (without connections between subsequent points) are illustrated using black cross markers in figure (b).

As a final note, purposefully choosing $K > K_{\text{real}}$ and relying on the optimization driving superfluous components towards 0 might lead to several similar non-zero components when processing more complex datasets. In this case, there are several possibilities to cope with this when required. The first, and most straightforward, approach is implementing a post-processing step, which collapses similar components into one by accumulating their weights and averaging the curves. With respect to the training phase, proper regularization could be employed, trying to enforce sparsity. The most commonly used sparsity-inducing regularization is given by the $\ell_1$ norm [Ng04] applied to the mixture weight distribution. Lastly, the determination of $K$ itself can be approached from a different perspective by trying to implement the idea of the Infinite Gaussian Mixture Model [Ras99]. The basic idea of the infinite GMM is applying a Bayesian modeling approach to model the mixture parameters. As such, the mixture weights are modeled using a Dirichlet prior distribution. While this approach removes the problem of choosing an appropriate value for $K$, it also introduces more complexity into the $\mathcal{N}$-MDN and its training. Additionally, estimating the parameters of an infinite GMM usually involves Monte Carlo methods or variational inference, making such an extension of the $\mathcal{N}$-MDN counteract the intuition of designing a regression-based probabilistic sequence model.

### **4.1.4 Comparison with SMC inference**

One design choice for the -Curve model is to move multi-step inference into the training phase, thus allowing several time steps to be predicted instantly. This is opposed to sequential Monte Carlo (SMC) approaches, which model the transition between subsequent time steps and perform iterative inference. In this experiment, the performance of the -MDN implementation of the -Curve model is compared to an exemplary SMC approach. For this comparison, an LSTM-MDN model embedded into a particle filtering cycle [Hug18], denoted as *ParticleLSTM* in the following, is used for generating an approximation of a discrete-time stochastic process. As the ParticleLSTM expects discrete inputs and outputs a Gaussian mixture probability distribution, a new set of samples, also called *particles*, needs to be drawn from the mixture distribution after each inference step. This is comparable to the resampling step of particle filters [Dou09] and serves two purposes. First and foremost, it keeps the number of particles constant, avoiding the exponential growth in the number of particles a brute-force approach would incur. Second, it enables the propagation of a sample-based representation of a probability distribution through time using an LSTM-MDN.
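For reference, the iterative inference cycle of such a particle-based approach can be sketched as follows. The callable `lstm_mdn` and its interface are hypothetical placeholders for an LSTM-MDN returning per-particle Gaussian mixture parameters; only the propagate-and-resample structure is the point here.

```python
import numpy as np

def particle_lstm_rollout(lstm_mdn, x0, n_steps, n_particles=1000, rng=None):
    """Iterative SMC-style inference with an LSTM-MDN (hypothetical interface).

    lstm_mdn(particles, state) is assumed to return per-particle Gaussian mixture
    parameters (weights pi, means mu, standard deviations sigma) for the next
    time step, plus the updated recurrent state.
    """
    rng = np.random.default_rng(rng)
    particles = np.repeat(x0[None, :], n_particles, axis=0)   # (P, d)
    state = None
    per_step_particles = []
    for _ in range(n_steps):
        pi, mu, sigma, state = lstm_mdn(particles, state)      # GMM per particle
        # Resampling step: keep the particle count constant by drawing one
        # successor per particle from its predicted mixture distribution.
        comp = np.array([rng.choice(len(p), p=p) for p in pi])
        particles = np.array([rng.normal(m[c], s[c])
                              for m, s, c in zip(mu, sigma, comp)])
        per_step_particles.append(particles.copy())
    return per_step_particles   # sample-based approximation per time step
```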

For comparing the performance of the -MDN and the ParticleLSTM in approximating a time-discrete stochastic process from noisy sequence data, a training dataset is sampled from . In order to keep this experiment clear and easier to evaluate, = { }∈=5 is defined to be an unimodal stochastic process with finite index set =5 = {0, 0.25, 0.5, 0.75, 1}. The stochastic process and a training dataset sampled from are depicted in Figure 4.10.

**Figure 4.10:** Illustration of the training dataset for comparing the -MDN with an exemplary SMC approach. (a): Ground truth discrete-time stochastic process = { }∈=5 alongside sample sequences. (b): Training data sampled from . Sequential connections are left out in this illustration in order to provide a cleaner illustration.

Using this training dataset, both models are trained until convergence is reached. In this experiment, the -MDN generates a 1-component -Curve mixture with 3 control points and the ParticleLSTM uses particles for generating its prediction. An approximation of , given **v** = **0** as a constant input, is then generated as follows: In case of the -MDN, a single pass through the network yields the parameters of an -Curve. Accessing this curve at ∈ {0, 0.25, 0.5, 0.75, 1} =∶ =5 gives the stochastic variables { }∈=5 approximating . For the ParticleLSTM, passing copies of **v** through the network generates a Gaussian mixture distribution Ξ<sup>1</sup> approximating <sup>1</sup> . Next, samples are taken from Ξ<sup>1</sup> and fed into the network again in order to approximate <sup>2</sup> . This process is repeated for retrieving <sup>3</sup> , <sup>4</sup> and <sup>5</sup> . The resulting approximations of are depicted in Figure 4.11.

**(a)** -MDN using = 1 component with 3 control points. **(b)** ParticleLSTM using = 1000 particles.

**Figure 4.11:** Resulting approximations of the stochastic process . Both approaches yield similar results with slight differences in the variances, especially for <sup>3</sup> and 4.

It can be seen that both approaches lead to similar results. The most notable differences can be observed in the variances, where the -MDN yields slightly less accurate results, which can be attributed to the use of a compact representation with only 3 control points. The difference in variances is most visible for <sup>3</sup> and <sup>4</sup> , where the non-linear weighting of control point covariances yields a minor under-estimation of the actual variance values (see also Section 3.1.2). On the other hand, the mean curve generated by the ParticleLSTM is less stable, which is most likely due to the uncontrolled and stochastic nature of the iterative approach.

While the comparison of the resulting approximations confirms the viability of the -Curve model as an alternative to SMC approaches, additional aspects should be considered. By design, the -Curve model moves the task of multi-step inference into the training phase, thus eliminating the need for Monte Carlo simulation during inference. Following this, Figure 4.12 depicts the differences in training time (4.12a), inference time (4.12b), memory usage (4.12c) and accuracy (4.12d) in order to reveal the impact of this design choice. As a measure for the approximation accuracy, the error in terms of the Euclidean distance between vectors combining mean and standard deviation values is averaged over all time steps, i.e.

$$E = \frac{1}{5} \sum\_{t=1}^{5} \left\| \begin{pmatrix} \hat{\mu}\_{t, \mathbf{x}} & \hat{\mu}\_{t, \mathbf{y}} & \hat{\sigma}\_{t, \mathbf{x}} & \hat{\sigma}\_{t, \mathbf{y}} \end{pmatrix}^{\top} - \begin{pmatrix} \mu\_{t, \mathbf{x}} & \mu\_{t, \mathbf{y}} & \sigma\_{t, \mathbf{x}} & \sigma\_{t, \mathbf{y}} \end{pmatrix}^{\top} \right\|\_{2}.\tag{4.13}$$

Here, the hat symbol denotes estimated values. In this formulation, the error function aggregates the squared differences of each factor. Although this approach ignores the actual semantics of the mean and variance values, it should provide a viable estimate for the accuracy of all approximated factors, due to their common minimum error value and comparable squared value ranges. Note that for this comparison, the reference values μ*t*,**x** , μ*t*,**y** , σ*t*,**x** and σ*t*,**y** are calculated from the training dataset, as these are likely to differ slightly from the actual ground truth values due to the training data being sampled randomly. Further, for the ParticleLSTM, the statistics are provided for an increasing number of particles used for inference. Training and inference of each configuration are performed 10 times in order to generate more reliable results.
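The error measure of Equation (4.13) can be computed directly from the estimated and reference statistics; the following minimal sketch assumes both are given as arrays of per-time-step mean and standard deviation values.

```python
import numpy as np

def approximation_error(mu_hat, sigma_hat, mu_ref, sigma_ref):
    """Average Euclidean distance between estimated and reference
    (mean, standard deviation) vectors over all time steps (cf. Eq. 4.13).

    All inputs have shape (T, 2) for a 2D process with T time steps.
    """
    est = np.concatenate([mu_hat, sigma_hat], axis=-1)   # (T, 4)
    ref = np.concatenate([mu_ref, sigma_ref], axis=-1)   # (T, 4)
    return np.linalg.norm(est - ref, axis=-1).mean()
```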

**(b)** Time required to infer an approximation of after training. Each line corresponds to one repetition of the inference procedure.

**Figure 4.12:** Comparison of the -MDN and the ParticleLSTM focusing on different aspects related to training and inference when increasing the number of particles used for inference. Inference statistics are provided for 10 repetitions in order to show the consistency of the SMC approach. In figures (b) – (d), the red line indicates the -MDN baseline the ParticleLSTM is compared to. In figures (a) and (d), the green diamond markers indicate the mean values. The error in (d) is given by the average deviation of the estimated mean and standard deviation values from a reference provided by the training data.

Looking at Figure 4.12, the impact on the depicted factors is as expected. Due to moving the multi-step inference into the training phase, more iterations are required to reach convergence when compared to the ParticleLSTM, which only needs to learn the transition between subsequent time steps (see 4.12a). On the other hand, multi-step inference is achieved using a single pass through the -MDN, resulting in faster inference. The time required to approximate using the ParticleLSTM scales linearly with the number of particles (see 4.12b). At the same time, the memory usage grows with an increasing number of particles. Looking at Figure 4.12c, memory usage even grows superlinearly, as the heavy computations are implemented to run on the GPU, where vectorization is required. This, in turn, leads to higher memory demand, especially for parallelized particle re-sampling. As expected, Figure 4.12d shows that the accuracy of the approximation increases with a higher number of particles, surpassing the accuracy of the -MDN approximation at some point. Ultimately, it depends on the specific use case whether faster but slightly less accurate or slower but more accurate inference is more important.

# **4.2 Proof of Concept: Recurrent -Curve Mixture Density Networks**

In order to provide a proof of concept for the conceptual extension of the -Curve model, i.e. the meta-time -Curve model (see Section 3.2), an approach capable of generating a steady stream of -Curve segments is required. Additionally, dependencies between subsequent segments along a generated probabilistic spline need to be taken into account, especially when C<sup>1</sup> or C<sup>2</sup> continuity is required. Following this, an autoregressive approach is well-suited for this implementation. With combinations of recurrent neural networks and Mixture Density Networks being a state-of-the-art sequence model (see also Section 2.2.2), an LSTM network will be combined with the -MDN (see Section 4.1) for this proof of concept. The resulting model is denoted as *recurrent -MDN*. The recurrent -MDN operates on the meta-timeline and targets the generation of an endpoint-connected probabilistic spline. This restricts the timeline mappings presented in Section 3.2.2 to special cases without overlapping curve segments. A schematic of the model architecture is given in Figure 4.13.

**Figure 4.13:** Schematic of the recurrent -MDN illustrating the architecture with a loop (left side) and unrolled over steps of meta-time (right side). The model outputs a -Curve segment at each meta-time step.

As illustrated, the same model input **v** is used in each meta-time step. The reason for this is that feeding generated -Curve segments back into the model would require a way of encoding an -Curve into **v**. Instead, by using a constant input, the model relies on its recurrent connection for evolving its output over time. As a technical detail, the model is designed to generate a stream of residuals, i.e. Gaussian control point mean vectors are always given as offsets to preceding control points. This has multiple advantages. First, as mentioned in Section 4.1, using residuals instead of absolute values is more stable during training and inference, as the target domain is more restricted. Second, by defining segment control points in terms of previous control points, it is easier to take geometric restrictions into account for enforcing C<sup>1</sup> or C<sup>2</sup> continuity as required. Regarding the loss function employed during training of the recurrent -MDN, the negative log-likelihood as defined in Equation (4.11) in Section 4.1 can be translated directly to this extension. This is due to the extraction of sequences from a meta-time -Curve working basically in the same way as for -Curves.
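A minimal sketch of the resulting autoregressive generation loop is given below. The callable `rnn_mdn_step` is a hypothetical stand-in for one pass through the recurrent -MDN returning residual control point means and covariances; the cumulative sum turns the residuals into absolute control point positions attached to the end of the previous segment.

```python
import numpy as np

def generate_spline_segments(rnn_mdn_step, v, n_segments, first_point):
    """Autoregressive generation of endpoint-connected curve segments (sketch).

    rnn_mdn_step(v, state) is assumed to return residual control point means
    (offsets, shape (n_cpts, d)), control point covariances and the updated
    recurrent state; v is the constant model input.
    """
    state, last_point, segments = None, first_point, []
    for _ in range(n_segments):
        offsets, covs, state = rnn_mdn_step(v, state)
        # Residual parameterization: each control point is an offset to its
        # predecessor; the first one attaches to the last point of the
        # previous segment, which enforces endpoint-connectedness by construction.
        control_points = last_point + np.cumsum(offsets, axis=0)
        segments.append((control_points, covs))
        last_point = control_points[-1]
    return segments
```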

In summary, using an autoregressive model provides a straightforward approach for implementing the meta-time -Curve model, with the capability of infinite sequence generation. On the downside, it should be mentioned that the calculation of the -Curve segment at a given meta-time step requires the calculation of all preceding segments. This is, however, also necessary independent of the specific approach when C<sup>1</sup> continuity is required, as it introduces dependencies between subsequent segments. Although this approach re-introduces a notion of iterative generation, the constraints imposed by the underlying parametric curve, which is a spline in this case, remain. Also, with every stochastic process mode now being modeled by a probabilistic spline, the need for Monte Carlo simulation is still avoided. Thus, the presented approach mostly complies with the objectives formulated in Chapter 1. The only exception is given by multi-step sequence generation beyond seg steps, which requires an iterative approach instead of being instantaneous.

### **4.2.1 Toy Examples**

This section provides a brief evaluation of the capabilities of the meta-time -Curve model through different toy examples. Similar to the toy examples given in Section 4.1, the examples in this section focus on non-conditional sequence synthesis. Thus, **v** = **0** is set as the constant input of the recurrent -MDN for each meta-time step. This reduces the information processed by the recurrent -MDN to the information passed over time through the recurrent connection. Further, -Curve segments will be defined by 5 Gaussian control points. Due to working with discrete-time ground truth stochastic processes (see also Section 3.2.3), each segment is defined to cover sub-sequences of length seg = 20.

Three scenarios are considered for comparing the meta-time -Curve model to the original -Curve model. The targeted discrete-time stochastic processes for each scenario are depicted in Figure 4.14. For training, a set of = 200 realizations is sampled from each stochastic process. The curve-time mapping defined in Section 4.14 is used for all examples.

**Figure 4.14:** Ground truth discrete-time stochastic processes with ∈ {,,} providing different scenarios for exploring the capabilities of the meta-time -Curve model. The index sets*a* are defined as = =58 for scenarios (a) and (b), and = =96 for scenario (c), thus covering sequences consisting of 58 and 96 elements, respectively. While is subject to varying variance between time steps, and have constant variance. For all figures, Gaussian random variables along the stochastic process' mean curve are given with corresponding 2 covariance ellipses.

*a* The discrete index set notation follows the definition provided in Section 3.2.3.

The first scenario covers the case of a stochastic process with a complex mean curve, in terms of length and shape. Following this, a parametric curve requires an increased number of control points to approximate the sequence properly. For this example, a fixed interval mapping with = 1 is used and 1 continuity is enforced. With the training dataset consisting of sequences of length = 58 and each segment of the meta-time -Curve covering seg = 20 elements, the recurrent -MDN will generate 3 segments. Thus, the segmented curve is defined by a total of 13 control points due to subsequent segments having one control point in common. Following this, the -Curve model in comparison is defined with an equal number of 13 control points. The estimated original and meta-time -Curves generated by a respective -MDN and recurrent -MDN are depicted in Figure 4.15.

**Figure 4.15:** Approximations of using an -Curve (a) and a meta-time -Curve (b) as generated by an -MDN and a recurrent -MDN trained with noisy sequence data. For the meta-time -Curve, a fixed interval mapping is used and C<sup>1</sup> continuity is enforced. In (b), -Curve segments are highlighted by using different colors. 2 covariance ellipses are provided for all Gaussian random variables along the mean curve.

While the meta-time -Curve approximates the mean curve perfectly, the estimated -Curve deviates slightly at around = 2 and = 4, averaging out a curved shape. A slight over-estimation of the variance at the beginning and end of the approximation can be observed for both models. Besides both models performing quite similarly in their generated approximation, the -MDN took 8 times more iterations to reach convergence compared to the recurrent -MDN. This observation can most likely be attributed to single -Curves of higher degree being harder to fit to given data due to the global control property.

The second scenario covers a stochastic process with a mean curve that includes sharp edges. In general, such curves cannot be represented by Bézier curves due to their smoothness property. Using a segmented curve, on the other hand, allows such edges by only targeting C<sup>0</sup> continuity. Additionally, by using segments of lower degree, the Gibbs phenomenon [Jer13] can be circumvented. Apart from the training dataset, the setup for this scenario is similar to the first scenario. The resulting approximations are depicted in Figure 4.16.

**Figure 4.16:** Approximations of using an -Curve (a) and a meta-time -Curve (b) as generated by an -MDN and a recurrent -MDN trained with noisy sequence data. For the meta-time -Curve, a fixed interval mapping is used. No smoothness constraints are applied. In (b), -Curve segments are highlighted by using different colors. 2 covariance ellipses are provided for all Gaussian random variables along the mean curve.

As expected, the estimated -Curve is unable to replicate the target mean curve exactly, but still provides a close approximation. As the -Curve is of higher degree, the Gibbs phenomenon is quite noticeable in this example. Besides minor fluctuations in the variances, the estimated meta-time -Curve is accurate with respect to its mean curve.

The third and final scenario regards a stochastic process whose mean curve follows a sine wave. Because of the periodicity, a modulo reset mapping with = 1 and = 2 will be used for the meta-time -Curve. Further, C<sup>1</sup> continuity is enforced. Note that the mapping parameters can only be assigned appropriate values due to knowledge about the structure of the targeted mean curve. As such, can be approximated with a meta-time -Curve, which repeats the same two segments as many times as required. Following this, an -Curve with 9 control points will be estimated for comparison. The results are depicted in Figure 4.17.

**Figure 4.17:** Approximations of using an -Curve (a) and a meta-time -Curve (b) as generated by an -MDN and a recurrent -MDN trained with noisy sequence data. For the meta-time -Curve, a modulo reset mapping is used and C<sup>1</sup> continuity is enforced. In (b), -Curve segments are highlighted by using different colors. 2 covariance ellipses are provided for all Gaussian random variables along the mean curve.

Looking at the estimated -Curve first, the approximation of is quite accurate apart from the beginning and ending portions. On the other hand, the meta-time -Curve alternates between the two learned segments in order to achieve a close approximation of . Further, in such scenarios, the curve could be continued indefinitely, as indicated in Figure 4.17b (purple curve).

### **4.2.2 Handling underdetermined areas**

As the meta-time -Curve model is based on segmented curves being calculated iteratively using an autoregressive approach, the existence of underdetermined areas is an aspect worth discussing. Such underdetermined areas are defined as segments within a meta-time -Curve which are not well estimated during training. The main causes for this are areas being sparsely covered by the training dataset or insufficient model capacity. Besides the model output within these segments being less stable and reliable in an application context, underdetermined areas also affect subsequent segments due to error propagation in the autoregressive model structure. As a general approach for coping with underdetermined areas, a fallback mechanism can be integrated into the model. In this case, the sequence model can rely on a basic, domain-specific sequence model, which covers regions of high model uncertainty. Besides requiring a potentially hand-crafted fallback model, such an approach also requires a way of measuring the model uncertainty. Following this, a brief overview of techniques for measuring model uncertainty is given. An evaluation of the practicality of the presented techniques for the meta-time -Curve model is left out, as it is beyond the scope of this thesis.

A prominent approach to measuring model uncertainty is to transform a network into a Bayesian neural network (see also Section 2.2.1), which can be implemented using Monte Carlo Dropout [Gal16, Gal17, Ken17]. In these variants of Bayesian neural networks, dropout [Sri14] is applied in conjunction with multiple passes through the network in order to generate a distribution over the network's output. This distribution can then be used to measure the model's uncertainty, which correlates with the variance of the generated distribution. A downside of employing such an approach is that a given sequence model needs to be transformed into a Bayesian neural network, thereby also inheriting its potentially unwanted properties and problems. An alternative to Bayesian neural networks is given by *Prior Networks* [Mal18]. While Bayesian neural networks implicitly model distributional uncertainty, Prior Networks provide an explicit model for model uncertainty. This is achieved by parameterizing a prior distribution over predictive distributions. Thus, the Prior Network approach also requires changes to a given model. Opposed to that, an ensemble approach [Lak17, Hua17] can be pursued, avoiding the need to change the model at hand. Here, an ensemble of the same model is trained. As the training process itself is usually stochastic, the resulting ensemble consists of several models generating slightly different outputs for the same input. As such, using the entire ensemble, a distribution similar to that of a Monte Carlo Dropout Bayesian neural network can be generated.
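The following sketch illustrates both ideas at the interface level: repeated stochastic forward passes (Monte Carlo Dropout) or an ensemble of independently trained models yield a set of outputs whose spread serves as an uncertainty estimate. The callables `stochastic_forward` and `models` are hypothetical placeholders for trained networks.

```python
import numpy as np

def mc_dropout_uncertainty(stochastic_forward, x, n_passes=50):
    """Monte Carlo Dropout style uncertainty estimate (sketch).

    stochastic_forward(x) is assumed to run a forward pass with dropout kept
    active, so repeated calls yield different outputs for the same input.
    The spread of the outputs is used as a proxy for model uncertainty.
    """
    outputs = np.stack([stochastic_forward(x) for _ in range(n_passes)])
    return outputs.mean(axis=0), outputs.var(axis=0)

def ensemble_uncertainty(models, x):
    """Same idea with an ensemble: the disagreement between independently
    trained models of identical architecture serves as the uncertainty measure."""
    outputs = np.stack([m(x) for m in models])
    return outputs.mean(axis=0), outputs.var(axis=0)
```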

## **4.3 Summary**

Overall, this chapter first provided a detailed introduction to Mixture Density Networks, focusing on their general structure and how their output is generated. Following this, -Curve Mixture Density Networks were defined as a regression-based implementation of the -Curve model, which enables multi-step inference requiring only a single forward pass through the network. Using synthetically generated data, several toy examples demonstrated the model's capability of learning stochastic control points from noisy sequence data and explored the model's behavior under different circumstances. Finally, a comparison with an SMC-based approach was performed, highlighting the advantages of -Curve Mixture Density Networks during inference in terms of memory usage and inference time.

Additionally, a proof of concept for an implementation of the meta-time - Curve model was presented. The presented model relies on an autoregressive structure in order to enable the representation of infinite stochastic processes. In comparison to -Curve Mixture Density Networks, toy examples on synthetically generated data indicate greater flexibility in terms of modeling capabilities at the cost of requiring a more complex neural network model, which is more expensive in terms of computation time during inference.

# **5 Evaluation**

This evaluation focuses on the -Curve model and the -MDN implementation as proposed in Sections 3.1 and 4.1. For this, the -Curve model is applied to two sequence prediction tasks. In both tasks, evaluated models need to represent a stochastic process describing obs + pred time steps, such that given obs observations of a process realization, the remaining pred steps can be inferred.

In the first part of the evaluation, the -Curve model is applied to the task of human trajectory prediction (Section 5.1). On the one hand, this task provides results that are easy to interpret and visualize. As such, it gives a good foundation for evaluating the general viability of the model. Further, although simple in terms of data dimensionality, the task provides a lot of complexity, as human trajectory prediction is a highly multi-modal problem. In this regard, human trajectory prediction is an appropriate task for evaluating the capabilities of the model.

Following the evaluation of the viability and capabilities of the -Curve model, its claimed capability of being scalable to arbitrary dimensions is assessed. For this, the model is applied to the task of human motion prediction. This task provides a high-dimensional example, being concerned with modeling sequences of human poses (Section 5.2).

It is worth mentioning that the meta-time -Curve model (Section 3.2), and thus the recurrent -MDN (Section 4.2), are excluded from this evaluation. There are two main reasons for this. First, the toy examples in Section 4.2 indicate that the meta-time -Curve model gains an edge over the original -Curve model only for very long sequences. However, common evaluations conducted on real-world data are usually restricted to rather short time horizons, i.e. short sequences. Second, the meta-time -Curve model is first and foremost a conceptual extension to the -Curve model, lifting some of its more domain-specific limitations.

For convenience, the following notation, extending on the notation used in previous chapters, is introduced for the scope of the evaluation. A dataset is denoted as = {<sup>1</sup> , ..., } and consists of sequences of fixed length = obs +pred with = {**x** 1 , ..., **x** }. Each dataset can further be divided into a training and test dataset, such that = train∪test, with train∩test = ∅. Finally, each sequence in a dataset is divided into an observed ,obs (obs time steps) and target (pred time steps) portion

$$\begin{split} \mathcal{X}\_{l} &= \mathcal{X}\_{l,\text{obs}} \cup \mathcal{Y}\_{l} \\ &= \{ \underbrace{\mathbf{x}\_{1}^{l}, \dots, \mathbf{x}\_{N\_{\text{obs}}}^{l}}\_{\text{observation}}, \underbrace{\mathbf{x}\_{N\_{\text{obs}}+1}^{l}, \dots \mathbf{x}\_{N\_{\text{obs}}+N\_{\text{pred}}}^{l}}\_{\text{target}} \}, \\ &= \{ \underbrace{\mathbf{x}\_{1}^{l}, \dots, \mathbf{x}\_{N\_{\text{obs}}}^{l}}\_{\text{observation}}, \underbrace{\mathbf{y}\_{1}^{l}, \dots, \mathbf{y}\_{N\_{\text{pred}}}^{l}}\_{\text{target}} \}, \end{split}$$

where the target portion is to be predicted.
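Assuming sequences are stored as arrays of shape (M, N, d), the split into observed and target portions amounts to slicing along the time axis, as in the following sketch; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def split_observation_target(sequences, n_obs, n_pred):
    """Split fixed-length sequences into observed and target portions.

    sequences : array of shape (M, n_obs + n_pred, d)
    returns   : (observations (M, n_obs, d), targets (M, n_pred, d))
    """
    sequences = np.asarray(sequences)
    assert sequences.shape[1] == n_obs + n_pred
    return sequences[:, :n_obs], sequences[:, n_obs:]
```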

## **5.1 Long-term Human Trajectory Prediction**

With the emergence of autonomous driving and advances in the field of automated video surveillance, the task of human trajectory prediction has gained a significant amount of research interest in recent years. A trajectory is defined as a sequence of locations in a regarded scene, with some velocity profile attached to it. Predictions are then performed on sequences consisting of subsequent 2D image coordinates or 3D world coordinates, generated by e.g. a detection-tracking pipeline. Generally speaking, human trajectory prediction can be subdivided into a number of more specific tasks, depending on the time horizon for prediction, the point of view of recording and camera motion. Each of these aspects impacts observations, and as such the respective prediction models, in a different way.

*Time Horizon:* In autonomous systems, the prediction task can be divided into short-term (0.5 up to 2 seconds) and long-term (up to 20 seconds) trajectory prediction [Rud20b]. While short-term predictions are mainly used for immediate decisions, such as collision avoidance, long-term prediction impacts the long-term behavior of an autonomous system, e.g. by influencing its path planning component. Considering short-term prediction, linear models combined with local collision avoidance approaches, e.g. the *social force model* [Hel95] or *Optimal Reciprocal Collision Avoidance* (abbrev.: ORCA, [Van11]), are generally well suited. In the context of human trajectories, ORCA yields more realistic motion patterns [Kot21]. In long-term trajectory prediction, the trajectory shape is greatly influenced by the surrounding static environment and interactions with other pedestrians. The extent of influence is highly dependent on the ground resolution and annotation rate of a given dataset [Hug21], as well as the pedestrian density.

*Point of view:* Most commonly, trajectory prediction datasets are recorded from a bird's eye view (*top view*), an elevated viewpoint with a tilted camera (*tilted view*) or from a camera positioned on the ground, e.g. mounted to a car (*frontal view*). While top view and surveillance datasets yield complete trajectories, occlusions occur frequently in frontal view datasets. As a consequence, prediction models need to be able to cope with missing inputs when working with frontal view datasets. In addition, constant velocity trajectories are distorted in frontal view datasets due to perspective distortion.

*Camera motion:* For top view and surveillance datasets, static cameras are a common choice. As such, recorded trajectory data complies with the static observed scene, potentially resulting in decision points, e.g. junctions, at specific locations. With frontal view datasets, identical trajectories change in shape with the ego-motion of the camera, when mounted to a car. In such cases, datasets are usually transformed into an ego-motion compensated reference frame (e.g. [Sch13]).

Following this, the evaluation presented here focuses on long-term trajectory prediction using bird's eye view data, which provides a suitable task for learning-based sequence prediction models. Given an observed trajectory obs = {**x**<sup>1</sup> , ..., **x**obs} consisting of obs observed positions of a person, the subsequent pred future positions need to be predicted.

### **5.1.1 Dataset Overview**

With the rising interest in the topic of human trajectory prediction, a number of datasets have emerged. These datasets are most often created from annotated videos recorded from a specific point of view. An overview of commonly used human trajectory datasets, categorized by the respective point of view, is given in Table 5.1.

**Table 5.1:** Non-exhaustive list of datasets appropriate for human trajectory prediction. Note that the list of top view datasets also includes datasets providing real-world positions instead of image coordinates. As data of persons moving on a flat plane is recorded, these datasets are similar to top view datasets, except trajectory points are given in a 3D reference frame with constant elevation.


In the context of long-term human trajectory prediction, top view and surveillance datasets are preferred due to the lack of occlusions and perspective distortions. In addition, data recorded from a static scene imposes a structure onto the dataset, which yields well-defined walking paths and decision points, exposing the multi-modal nature of human trajectory prediction.

Finally, the most commonly used datasets for long-term human trajectory prediction include the BIWI Walking Pedestrians (abbrev.: *biwi*), Crowds by Example (abbrev.: *crowds*) and the Stanford Drone (abbrev.: *sdd*) datasets. These datasets further consist of 2, 4 and 8 scenes, respectively. In the following, these scene datasets will be referred to as *dataset:scene*, e.g. *biwi:eth*. In the case of the Stanford Drone Dataset, multiple, partially overlapping¹, recordings of the same scene are provided. The specific recording will be indicated by a number added to the scene dataset abbreviation, e.g. *sdd:hyang00*.

## **5.1.2 State-of-the-art Human Trajectory Prediction Models**

Looking at state-of-the-art deep learning-based sequence prediction models for long-term human trajectory prediction, these models can be divided into aggregating and holistic models. Holistic models, on the one hand, model the entire observed scene including all pedestrians by using a spatio-temporal graph network, where each object in the scene is a unique node (e.g. [Moh20, Sal20]). Opposed to that, the more prevalent aggregating models have separate processing pipelines for each type of input, which are fused together at some point. For this class of models, a modular meta-architecture revolving around an underlying base sequence model can be defined, covering the main components of each model. Additional types of inputs, also referred to as *additional cues*, are discussed later in this section. A schematic of this meta-architecture is depicted in Figure 5.1.

¹ In the sense of the observed real-world scenery.

**Figure 5.1:** Schematic of a meta-architecture for aggregating trajectory prediction models. Such models contain at least some sequence model and optional building blocks for processing additional cues, such as social or environmental context.

Each aggregating prediction model consists of at least a base sequence model, which encodes input trajectories, the *observation*, and generates either single trajectories or probabilistic predictions. Taking a range of state-of-the-art deep learning-based prediction models into consideration, these models can be boiled down to a few base sequence models. An overview of commonly used base sequence models is depicted in Table 5.2. Note that due to the existence of many similar models, only representative examples for each base sequence model are featured. For a comprehensive overview of existing human trajectory prediction approaches, the reader is referred to recent surveys, e.g. [Rud20b]. Further, no distinction is made between variants of the same base model, as these most commonly only differ slightly. Finally, endpoint-conditioned prediction models (e.g. [Kit12, Man20]) are excluded from this overview, as the endpoint is assumed to be unknown in the context of this evaluation.

**Table 5.2:** Overview of commonly used base sequence models in human trajectory prediction alongside representative prediction models. The most frequently used sequence models are given by *Recurrent Mixture Density Networks* (abbrev.: R-MDN, [Gra13]), variants of *Generative Adversarial Networks* (abbrev.: GAN, [Goo14]) and *Variational Autoencoders* (abbrev.: VAE, [Kin14]) combined with a sequence-to-sequence model [Sut14], as well as *Transformers* [Vas17]. Transformers have only recently been studied in the context of human trajectory prediction. It should be noted that *Temporal Convolutional Networks* (abbrev.: TCN, [Oor16, Bai18]) are excluded from this overview, as these are rarely used and yield similar performance to LSTM networks [Bec18].


With reference to the introductory section on sequence modeling (see Chapter 2), each of the base models listed in Table 5.2 provides certain benefits for the task of human trajectory prediction. R-MDN and Transformer models on the one hand are purely regression-based and thus easier to train. Additionally, these models can be used to output an explicit probability distribution over future trajectories by parameterizing a Gaussian mixture model. This, however, comes at the cost of multi-modal predictions being more difficult to generate. When parameterizing a Gaussian mixture model, the model can, for example, be embedded into a particle filter cycle [Hug18]. Another approach construes the trajectory prediction problem as a classification task, where possible future predictions are covered by different classes [Giu21]. Opposed to that, VAE and GAN are probabilistic models providing an implicit model of the data distribution. Both models employ a generator network processing a stochastic input in addition to the encoded observation. As a consequence, these models provide a straightforward approach to generating multi-modal predictions by sampling.

In recent years, an increasing number of approaches emphasize the use of additional cues. The most common additional cues are given by the social context, i.e. neighboring pedestrians, and the environmental context, i.e. static scene elements. Established approaches for incorporating social context are commonly based on either grid-based pooling (e.g. [Ala16, Gup18]), graph attention (e.g. [Vem18, Kos19, Hua19]) or graph convolution (e.g. [Moh20]). Environmental context, on the other hand, is usually given by an encoding of some reference image or video frame generated by a Convolutional Neural Network (CNN).

As the -Curve model introduced in this thesis provides an alternative for the underlying base sequence model, such additional cues will not be considered in the following quantitative and qualitative evaluation. Consequently, the performance of the -Curve model is compared with the aforementioned base sequence models. It should be noted that when taking away the additional cue components, most state-of-the-art models collapse onto their underlying base sequence models, thus justifying a comparison on the level of these base sequence models.

### **5.1.3 Evaluation Setup**

In order to provide a comprehensive evaluation of the -Curve model in the context of long-term human trajectory prediction and in comparison with commonly used sequence models, the tasks of unimodal and multi-modal trajectory prediction are considered. Therefore, the current standard approach to evaluation in the literature is extended by using additional datasets and performance measures, as it does not cover the task of multi-modal trajectory prediction. Further, the evaluation will be performed on each selected dataset in isolation, due to the removal of additional cues for the sequence models. Without such additional cues, long-term prediction requires well-defined decision points tied to static locations in the observed scene, in order to capture relevant walking paths. This is especially true for non-goal-driven prediction approaches, as regarded in this evaluation. As a consequence, pooling together unrelated datasets into a common reference frame cannot be justified and thus datasets are evaluated in isolation.

#### **5.1.3.1 Selected Datasets**

Under these conditions and in compliance with the current standard evaluation approach, the *biwi:eth*, *biwi:hotel*, *crowds:zara01* and *crowds:zara02* datasets are selected. The *crowds:students* dataset is left out, as it focuses heavily on human-human interaction and as such does not provide well-defined walking paths or decision points. As these datasets provide rather simple scene geometry, additional scenes are taken from the Stanford Drone Dataset. In order to keep the evaluation concise, scene datasets with varying complexity [Ami20, Hug21] are considered. Thus, the *sdd:bookstore03* and *sdd:hyang00* datasets are included in the evaluation. For these datasets, only pedestrian trajectories are considered¹. Table 5.3 and Figures 5.2 and 5.3 depict samples from the datasets and relevant statistical details.

**Table 5.3:** Statistical details of the human trajectory datasets selected for evaluation. It should be noted that the number of trajectories can deviate from those given in the literature, as trajectories lying outside the image boundary after projection are dismissed. The trajectory length denotes the number of points defining a specific trajectory.


¹ The Stanford Drone dataset provides trajectory data for a multitude of different agent types, including for example pedestrians, bikers and skateboarders.

**(a)** biwi:eth

**(b)** biwi:hotel

**(c)** crowds:zara01

**Figure 5.2:** Overview of human trajectory datasets selected for evaluation. Sub-figures depict a reference image of the recorded scenery (left) and the overlayed dataset (right). Note: For illustration purposes, the image and data scale is aligned for all datasets, for the actual image resolutions see Table 5.3.

**(a)** crowds:zara02

**(b)** sdd:bookstore03

**(c)** sdd:hyang00

**Figure 5.3:** Overview of human trajectory datasets selected for evaluation. Sub-figures depict a reference image of the recorded scenery (left) and the overlayed dataset (right). Note: For illustration purposes, the image and data scale is aligned for all datasets, for the actual image resolutions see Table 5.3.

*Preprocessing:* At first, all datasets originally provided in world space coordinates are projected into image space using homographies. In the literature, the annotation frequency of the datasets is usually set to 2.5 annotations per second, which equals the annotation rate of the BIWI Walking Pedestrians dataset. Thus, the annotation frequency of all datasets included in the evaluation is adjusted accordingly. Further, the evaluation is conducted on trajectories of a fixed length = obs+pred (see also Section 5.1.3.5). Following this, all (sub-)trajectories of a given length are extracted from each respective dataset in order to provide training and test datasets. Trajectories shorter than the given length are not considered for evaluation. As a final preprocessing step, trajectories of non-moving or slow-moving persons are filtered out, as statistical models are worse at modeling trajectories of slow-moving persons, whose behavior is less predictable [Has19]. Thus, the dataset-dependent required minimum speed¹ is calculated heuristically for a given dataset containing all possible (sub-)trajectories of length :

$$\begin{aligned} \mathbf{s}\_{\text{min}} &= \frac{\max\_{l} m\_{\text{speed}}(l) - \min\_{l} m\_{\text{speed}}(l)}{M} \\ &\text{with} \\ m\_{\text{speed}}(l) = \frac{1}{N - 1} \sum\_{t=2}^{N} ||\mathbf{x}\_{t}^{l} - \mathbf{x}\_{t-1}^{l}||\_{2}. \end{aligned} \tag{5.1}$$

Here, m_speed(𝑙) denotes the average speed within the 𝑙'th trajectory of the dataset.

¹ The average Euclidean distance between subsequent trajectory points.
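The speed-based filtering described above can be sketched as follows, computing the per-trajectory average step length and the threshold as stated in Equation (5.1); the array shapes and the function name are illustrative assumptions.

```python
import numpy as np

def filter_slow_trajectories(trajectories):
    """Remove (near-)static trajectories using the heuristic threshold of Eq. (5.1).

    trajectories : array of shape (M, N, 2) with image-space positions.
    """
    # Average step length (speed proxy) per trajectory.
    speeds = np.linalg.norm(np.diff(trajectories, axis=1), axis=-1).mean(axis=1)
    m = len(trajectories)
    s_min = (speeds.max() - speeds.min()) / m
    return trajectories[speeds >= s_min]
```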

#### **5.1.3.2 Performance Measures**

In the standard evaluation approach, the designated performance measures are given by the *Average Displacement Error* (abbrev.: *ADE*) and the *Final Displacement Error* (abbrev.: *FDE*), defined as

$$\text{ADE} = \frac{1}{M \cdot N\_{\text{pred}}} \sum\_{l=1}^{M} \sum\_{t=1}^{N\_{\text{pred}}} ||\hat{\mathbf{y}}\_{t}^{l} - \mathbf{y}\_{t}^{l}||\_{2} \tag{5.2}$$

and

$$\text{FDE} = \frac{1}{M} \sum\_{l=1}^{M} ||\hat{\mathbf{y}}\_{N\_{\text{pred}}}^{l} - \mathbf{y}\_{N\_{\text{pred}}}^{l}||\_2 \tag{5.3}$$

for a given prediction horizon pred, a set = {<sup>1</sup> , ..., } of ground truth trajectories = {**y** 1 , ..., **y** pred } and corresponding predictions ̂ = { ̂ **y** 1 , ..., ̂ **y** pred } generated by a given prediction model. The ADE is then defined by the average L2 distance between the ground truth and a corresponding predicted trajectory, while the FDE is defined by the L2 distance between the final ground truth and predicted trajectory points after the prediction horizon. In the case of probabilistic sequence models, which generate a predictive distribution (**y**{1,...,pred} |**x**{1,...,obs} ), ̂ corresponds to a maximum likelihood prediction given the probabilistic output of the model.
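Both displacement errors can be computed in a few lines; the sketch below assumes predictions and ground truth futures are given as arrays of shape (M, N_pred, 2).

```python
import numpy as np

def ade_fde(predictions, ground_truth):
    """Average and Final Displacement Error (Eqs. 5.2 and 5.3).

    predictions, ground_truth : arrays of shape (M, N_pred, 2)
    """
    dists = np.linalg.norm(predictions - ground_truth, axis=-1)  # (M, N_pred)
    ade = dists.mean()           # average over all trajectories and time steps
    fde = dists[:, -1].mean()    # final time step only, averaged over trajectories
    return ade, fde
```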

As the ADE and FDE do not provide an adequate measure for assessing the quality of (multi-modal) probabilistic predictions, another performance measure is required for this case. Due to the actual ground truth probability distribution for each time step being unknown, a common choice is given by the *Negative (data) Log-Likelihood* (abbrev.: *NLL*, e.g. [Bha18, Iva19]), defined as

$$\text{NLL} = \frac{1}{M \cdot N\_{\text{pred}}} \sum\_{l=1}^{M} \sum\_{t=1}^{N\_{\text{pred}}} -\log p(\mathbf{y}\_t^l | \cdot). \tag{5.4}$$

Here, (**y** |⋅) denotes the predictive distribution for the 'th trajectory position as generated by the probabilistic sequence model. Note that the conditional part of this distribution is not given explicitly, as it varies between different models (see Section 5.1.3.5). It is worth mentioning that sometimes an oracle measure (e.g. [Lee17]) is used as a sample-based substitute for the NLL. This measure does, however, introduce another hyperparameter, which is why the NLL is preferred in the context of this evaluation.

#### **5.1.3.3 Baselines**

In order to provide reference values for comparison, a simple baseline is given for each performance measure. In the case of the ADE and FDE, a simple prediction model is given by a linear extrapolation calculated from a respective observed trajectory. Here, the relative offset = **x** obs − **x** obs−1 of the two most recent observations is projected pred steps into the future, as these positions are assumed to have the most impact on the future trajectory [Sch20a]. In the case of the NLL measure, a sample-based prediction can be generated for each future position by using a *shotgun* approach [Paj18]. In this approach, multiple future trajectories are generated by randomly altering the direction and scale of the relative offset before projection. The altered offset for each future trajectory is then given by **R** ⋅ ⋅ with ∼ (0, ), ∼ (1, ) and the matrix **R** describing a rotation by degrees. This yields a unimodal probabilistic prediction with a fixed variance for each predicted time step. In the following, = 15° and = 0.1 are used. An exemplary prediction using both approaches is depicted in Figure 5.4.

**(a)** Linear extrapolation **(b)** Shotgun

**Figure 5.4:** Exemplary predictions generated by a linear extrapolation model and a shotgun model. Predicted samples generated by the shotgun model around the mean linear prediction (green) are depicted in blue.
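A minimal sketch of the linear extrapolation combined with the shotgun sampling approach described above is given below; the Gaussian perturbation of rotation angle and scale follows the description in the text, while the default values, array shapes and function name are illustrative assumptions.

```python
import numpy as np

def shotgun_prediction(observation, n_pred, n_samples=300,
                       sigma_alpha_deg=15.0, sigma_s=0.1, rng=None):
    """Linear extrapolation plus 'shotgun' sampling around it (sketch).

    observation : (N_obs, 2) observed positions; the most recent offset is
    projected n_pred steps into the future after randomly rotating and
    rescaling it for each sample.
    """
    rng = np.random.default_rng(rng)
    delta = observation[-1] - observation[-2]          # most recent offset
    samples = np.empty((n_samples, n_pred, 2))
    for i in range(n_samples):
        alpha = np.deg2rad(rng.normal(0.0, sigma_alpha_deg))
        scale = rng.normal(1.0, sigma_s)
        rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                        [np.sin(alpha),  np.cos(alpha)]])
        step = scale * (rot @ delta)
        samples[i] = observation[-1] + np.outer(np.arange(1, n_pred + 1), step)
    return samples            # (n_samples, n_pred, 2) sample-based prediction
```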

In addition to these two baselines, a simple LSTM baseline is provided. There are two main reasons for this. On the one hand, the LSTM model is an integral component of multiple sequence models included in the evaluation. On the other hand, it is a widely used baseline next to the linear extrapolation approach.

#### **5.1.3.4 Implementation Details**

This section gives a brief overview of implementation details for the sequence models in comparison. The implementations are based on existing approaches that provide publicly available implementations. These implementations are adapted to use a common data pipeline. If necessary, components for processing additional cues, such as social context, are removed. The list of approaches the implementations are based on, alongside the adaptations made, is given in Table 5.4.


**Table 5.4:** Sequence model implementations adapted for use in this evaluation.

*a* https://github.com/agrimgupta92/sgan

*b* https://github.com/apratimbhattacharyya18/CGM\_BestOfMany

*c* https://github.com/FGiuliari/Trajectory-Transformer

The remainder of this section provides implementation details for the prediction models in comparison, including the -MDN. For each model, the respective loss function, training details and the type of output generated by the model are described. Additionally, a simplified structural illustration is given for each model. These illustrations also serve the purpose of highlighting relevant hyperparameters of each model. The values chosen for each hyperparameter and relevant general information are given at the end of this section.

*-MDN:* For the task of human trajectory prediction, the -MDN is set into a conditional setting. Thus, the MDN's input vector **v** (see Section 4.1) needs to hold information about the observed trajectory, in order to condition the MDN's output upon the observation. In accordance with a wide range of human trajectory prediction models, an LSTM network is used for encoding the observed trajectory. The conditional -MDN is illustrated in Figure 5.5.

**Figure 5.5:** Simplified illustration of the conditional -MDN. Relevant hyperparameters depicted in blue are given by the hidden state dimension enc of the LSTM encoder, as well as the number of -Curves curves output by the MDN. Each generated - Curve is defined by cpts Gaussian control points.

As -Curves model entire trajectories, an -Curve can be used to either only model the future trajectory, or to model the observed trajectory together with the future trajectory. Both options will be considered in the evaluation.

*R-MDN:* The SMC-based R-MDN variant used in this evaluation belongs to the group of 1-to-1 sequence models (see Section 2.1.1), processing one trajectory point at a time. As such, the model takes a discrete trajectory point as input and outputs the parameters of a Gaussian mixture distribution modeling the next trajectory point. In order to enable the model to generate a multi-modal prediction, multiple points are sampled from the output distribution and fed back into the model. To prevent exponential growth of the number of samples, subsequent output distributions are combined and re-sampled [Hug18]. A schematic of this model is given in Figure 5.6.

**Figure 5.6:** Simplified illustration of the SMC-based R-MDN. Relevant hyperparameters depicted in blue are given by the hidden state dimension lstm of the LSTM network and the number of mixture components comps generated by the MDN.

During training, the commonly used teacher forcing approach (see Section 2.1.1) is used, as the model generates its prediction sequentially. With the model generating a sequence of conditional mixture distributions, the optimization is based on the negative log-likelihood loss

$$\mathcal{L} = \sum\_{t=2}^{N\_{\text{obs}} + N\_{\text{pred}}} -\log p(\mathbf{x}\_t | \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_1), \tag{5.5}$$

for a given training sample trajectory = {**x**<sup>1</sup> , ..., **x**+pred }.

*GAN:* For applying a GAN in the context of human trajectory prediction, a sequence processing unit must be incorporated into the model. According to [Gup18], a sequence-to-sequence LSTM (see Sections 2.1.1 and 2.2.3) is built into the generator network and another LSTM encoder is built into the discriminator network. The GAN encodes the observed trajectory and then adds a random noise vector to the encoded representation in order to sequentially generate a prediction. By performing multiple passes through the decoder network using different noise vectors, a sample-based distribution of future trajectories is generated. A simplified illustration of the GAN is depicted in Figure 5.7.

**Figure 5.7:** Simplified illustration of the GAN. Relevant hyperparameters depicted in blue are given by the dimensionality of the noise vector noise, the hidden dimension of the generator's LSTM encoder enc and decoder dec, as well as the discriminator's LSTM encoder discr\_enc and feed forward network ff. The discriminator part (dashed boxes) is only used during training. The noise vector **z** is sampled from (**0**,**I**).

Opposed to the R-MDN, an auto-conditioning approach (see Section 2.1.1) is employed during training. For the loss calculation, samples { ̂ 1 , ..., ̂} with ̂ = { ̂ **y** 1 , ..., ̂ **y** pred } are generated. The loss function consists of a variety loss

$$\mathcal{L}\_{\text{variety}} = \min\_{l} \sum\_{t=1}^{N\_{\text{pred}}} \|\mathbf{\hat{y}}\_{t}^{l} - \mathbf{y}\_{t}\|\_{2} \tag{5.6}$$

combined with the GAN adversarial loss

$$\mathcal{L} = \mathbb{E}\_{\mathbf{y} \sim p\_{\text{data}}(\mathbf{y})} \left[ \log D(\mathbf{y}) \right] + \mathbb{E}\_{\mathbf{z} \sim p(\mathbf{z})} \left[ \log(1 - D(G(\mathbf{z}))) \right]. \tag{5.7}$$

D and G denote the discriminator and generator networks, respectively. The variety loss is intended to encourage the GAN to generate diverse future trajectory predictions for the same observed trajectory.
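The variety loss can be illustrated with the following sketch, which evaluates Equation (5.6) for a set of sampled future trajectories; in an actual training loop this would be computed on differentiable tensors rather than NumPy arrays.

```python
import numpy as np

def variety_loss(samples, ground_truth):
    """Best-of-k variety loss (Eq. 5.6): only the sample closest to the ground
    truth future trajectory contributes to the loss.

    samples      : (k, N_pred, 2) predicted future trajectories
    ground_truth : (N_pred, 2)
    """
    per_sample = np.linalg.norm(samples - ground_truth, axis=-1).sum(axis=-1)  # (k,)
    return per_sample.min()
```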

*VAE:* Similar to the GAN extension, a sequence-to-sequence LSTM is built into the VAE in order to enable sequence processing. Further, prediction generation works similar to the GAN model by adding a random vector to an encoded representation of an observed trajectory in order to generate multiple future trajectories. A schematic of the VAE is depicted in Figure 5.8.

**Figure 5.8:** Simplified illustration of the VAE. Relevant hyperparameters depicted in blue are given by the dimensionality of the latent space latent and of the hidden state of the LSTM encoder enc and decoder dec. In addition, the hidden state dimension lstm of the auxiliary LSTM encoder only active during training gives another hyperparameter. The random vector **z** is sampled from (**0**,**I**) during inference, while the parameters of the Gaussian distribution are determined by the auxiliary LSTM encoder during training.

During training, the LSTM decoder takes the encoded observation with the added random vector as input for every prediction step. In this way, neither teacher forcing nor auto-conditioning schemes are necessary. At the same time, the model relies entirely on the progressing internal LSTM state to generate an appropriate prediction. Similar to the GAN, samples are generated for the loss calculation. Here, the loss function consists of a variation of the VAE's standard ELBO (evidence lower bound) loss

$$\mathcal{L} = \max\_{l} \{ \log p(\mathcal{Y}|\mathbf{z}\_{l}, \mathcal{X}\_{\text{obs}}) \} - \log T - D\_{\text{KL}}(q(\mathbf{z}|\mathcal{X}) \,||\, p(\mathbf{z}|\mathcal{X}\_{\text{obs}})), \tag{5.8}$$

with **z** ∼ (**z**|). This *best of many samples* variant of the ELBO contains a variety loss component, comparable to that of the GAN implementation.

*Transformer:* Although the implementation chosen for this evaluation does not provide a probabilistic prediction model, it is considered in this comparison, as it provides a strong contender to the established LSTM networks built into many human trajectory prediction models. It is an attention-based sequence model, consisting of an encoder, which encodes the entire observed trajectory into a single vector, and a decoder, which sequentially generates one trajectory point at a time, given the encoding. A schematic of this model is depicted in Figure 5.9.

**Figure 5.9:** Simplified illustration of the Transformer. Relevant hyperparameters depicted in blue are given by the model dimension model, the number of attention heads heads and the dimension of the feed forward network ff. The encoder (top) and decoder (bottom) networks share the same hyperparameters. In the decoder network, ⟨0⟩ denotes a *start of sequence token* used as input for the initial prediction step.

Similar to the R-MDN, a teacher forcing approach is used during training. As the model generates a single future trajectory, the optimization can be based on the L2 loss function

$$\mathcal{L} = \sum\_{t=1}^{N\_{\text{pred}}} \|\mathbf{\hat{y}}\_t - \mathbf{y}\_t\|\_2. \tag{5.9}$$

*Hyperparameters:* Each model's hyperparameters are determined by running a grid search around the parameters provided by the respective authors, selecting those that are most consistent across all datasets. For hyperparameters yielding similar model results, the parameterization given by the respective authors is favored. Output-related parameters for the -MDN and R-MDN are defined separately in Section 5.1.4. A list of the chosen hyperparameters for each model is given in Table 5.5.

**Table 5.5:** Overview of chosen hyperparameters for each model in comparison.


*General Information:* All models are trained using a stochastic gradient descent policy and the ADAM optimizer [Kin15], using either mini-batches or the entire training dataset at once (whichever worked best for the respective model). For prediction, = 300 samples are used for the sample-based predictors R-MDN, VAE and GAN.

#### **5.1.3.5 Evaluation Methodology**

For achieving a reliable evaluation, a k-fold cross-validation is performed on each dataset in order to cope with unfavorable random training and test splits. In the following, k = 5 folds are used, as this gives a good trade-off between error bias and variance [Has09]. In compliance with the goal of measuring the raw single-dataset performance, all prediction models are re-trained for each fold. As is common practice, prediction models are tasked to predict pred = 12 steps (4.8 seconds) into the future, given an observation of obs = 8 steps (3.2 seconds).

For generating a maximum likelihood prediction, the output of each probabilistic prediction model in comparison needs to be processed in a different way. For the R-MDN, instead of propagating a set of particles, the mean vector of the highest weighted mixture component is fed back into the model in each time step. As the GAN and VAE models generate a set of sample trajectories, the mean position for each time step is used. Finally, for the -MDN, the mean curve of the -Curve with the highest mixture weight is used.

Looking at the NLL measure, which requires a probability density function generated by each prediction model, sample-based output is processed by applying a kernel density estimation [Sco18] using a Gaussian kernel in order to obtain probability density functions for each time step.
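This post-processing step can be sketched as follows, using SciPy's Gaussian kernel density estimator to turn per-time-step prediction samples into densities that are then evaluated at the ground truth positions (cf. Equation (5.4)); the function name and input shapes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_based_nll(samples_per_step, ground_truth, eps=1e-12):
    """NLL of the ground truth under a KDE fitted to sample-based predictions.

    samples_per_step : list of length N_pred with arrays of shape (S, 2)
    ground_truth     : (N_pred, 2) future positions of one test trajectory
    """
    nll = 0.0
    for samples, y in zip(samples_per_step, ground_truth):
        kde = gaussian_kde(samples.T)          # Gaussian kernel density estimate
        nll += -np.log(kde(y)[0] + eps)        # density evaluated at ground truth
    return nll / len(ground_truth)
```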

### **5.1.4 Quantitative Results**

For the quantitative evaluation, multiple output-related configurations are considered for the R-MDN and -MDN models, controlling the number of mixture components and the -MDN model's output mode (see Section 5.1.3.4). The configurations are depicted in Table 5.6.


**Table 5.6:** Output-related configurations for the R-MDN and -MDN models in the evaluation.

Tables 5.7 – 5.12 summarize the results of the quantitative evaluation, using a per-dataset 5-fold cross validation and the ADE, FDE and NLL performance measures. Accordingly, the respective averaged performance values with corresponding standard deviations over all 5 folds are depicted. It should be noted that the performance values are not comparable across datasets, due to different image and ground resolutions. In order to make values comparable, datasets would need to be projected into 3-dimensional world space. Additionally, a re-sampling of trajectory points could be necessary in order to match motion profiles.


**Table 5.7:** Quantitative results of all approaches on the *biwi:eth* dataset for a prediction time horizon of pred = 12 time steps (4.8 seconds). ADE and FDE errors are reported in pixels. Lower is better for all performance measures.

**Table 5.8:** Quantitative results of all approaches on the *biwi:hotel* dataset for a prediction time horizon of pred = 12 time steps (4.8 seconds). ADE and FDE errors are reported in pixels. Lower is better for all performance measures.




**Table 5.10:** Quantitative results of all approaches on the *crowds:zara02* dataset for a prediction time horizon of pred = 12 time steps (4.8 seconds). ADE and FDE errors are reported in pixels. Lower is better for all performance measures.



| Model | ADE | FDE | NLL |
|---|---|---|---|
| Transformer | 52.48 ± 14.88 | 97.00 ± 26.24 | – |
| R-MDN | 50.02 ± 13.19 | 93.60 ± 23.37 | 14.28 ± 2.55 |
| R-MDN | 27.58 ± 4.64 | 50.11 ± 8.56 | 23.47 ± 23.88 |
| VAE | 30.75 ± 1.04 | 55.93 ± 1.83 | 8.88 ± 0.09 |
| GAN | **19.10** ± 0.58 | 35.45 ± 0.96 | 33.18 ± 4.71 |
| -MDN | 25.24 ± 1.24 | 48.28 ± 2.80 | 9.13 ± 0.05 |
| -MDN | 25.33 ± 2.34 | 48.69 ± 5.28 | 9.06 ± 0.09 |
| -MDN | 22.31 ± 1.15 | 41.60 ± 1.58 | 8.89 ± 0.07 |
| -MDN | 19.43 ± 0.55 | **35.37** ± 1.10 | **8.46** ± 0.04 |

**Table 5.11:** Quantitative results of all approaches on the *sdd:bookstore03* dataset for a prediction time horizon of N<sub>pred</sub> = 12 time steps (4.8 seconds). ADE and FDE errors are reported in pixels. Lower is better for all performance measures.

**Table 5.12:** Quantitative results of all approaches on the *sdd:hyang00* dataset for a prediction time horizon of N<sub>pred</sub> = 12 time steps (4.8 seconds). ADE and FDE errors are reported in pixels. Lower is better for all performance measures.


*Model Comparison:* It can be seen that, in terms of the average and final displacement errors (ADE and FDE), the -MDN performs on par with the best performing model in comparison, i.e. the GAN model. At the same time, the -MDN outperforms every other model in terms of the NLL performance measure, with the VAE being the closest contender. It should be noted that both the GAN and the R-MDN model have a tendency to perform worse in terms of NLL, which can be attributed to these models' vulnerability to mode collapse (see also Section 4.1). This is discussed in more detail in Section 5.1.5. Among the -MDN variants, the models only modeling the future trajectory seem to outperform those also modeling the observed trajectory. Further, using multiple components appears to be beneficial for more complex datasets. This can be expected, as these datasets contain multiple decision points, leading to multiple distinct possibilities for future trajectories. Lastly, the Transformer model performs notably worse than the LSTM baseline, which indicates that the model is not optimal for the specific task at hand in its original form and thus may require some adaptations.

*Baselines:* As expected, the linear prediction model performs quite well in terms of the average and final displacement error. This is due to a substantial amount of (sub-)trajectories in each dataset, commonly around 50 to 60 percent [Hug21], representing constant linear motion. The shotgun baseline, in turn, is only outperformed by 2 out of 4 models, namely the VAE and the -MDN. This is due to the baseline's inability to model multiple modes, as required for more complex cases. Further, the variance of its prediction is not adapted to the actual location of the observation in the scene, resulting in under- and overestimation. More sophisticated prediction models not suffering from mode collapse are thus able to outperform this baseline.

*Summary:* In summary, all probabilistic prediction models perform similarly in terms of the presented performance measures, making the choice of model dependent on their respective properties. In this case, the -MDN may be favored over other models due to it being fully regression-based and thus more stable during training and inference, while at the same time being less computationally heavy during inference.

### **5.1.5 Qualitative Evaluation**

This section provides some insight into the probabilistic models' behavior and the evaluation methodology itself. To this end, the problem of mode collapse in R-MDN and GAN models is discussed first. After that, a qualitative comparison of the probabilistic models considered in this evaluation is provided. Then, the quantitative evaluation of probabilistic prediction models is discussed in more detail, focusing on how to measure probabilistic prediction quality. Finally, different characteristics of the -Curve model and its -MDN implementation as discussed in Sections 3.1 and 4.1 are further investigated in the application context using real-world data.

#### **5.1.5.1 R-MDN and GAN: Mode Collapse**

The quantitative evaluation revealed that in some cases the R-MDN and GAN models yield large NLL values. While this can be the case because the model generates bad predictions for certain inputs, it can oftentimes be attributed to both models being vulnerable to mode collapse (MDN: [Mak19], GAN: [Met17]), where the model outputs a narrow prediction due to only generating slight variations of the same sample. To give a visual example of the latter, Figure 5.10 depicts a well-spread prediction, next to a bad prediction and a prediction indicating a case of mode collapse. In this illustration, exemplary predictions generated by a GAN trained on the biwi:eth dataset are shown.

**Figure 5.10:** Different predictions (blue) generated by a GAN trained on the biwi:eth dataset, yielding a well-spread (a) and a bad (b) prediction, as well as a prediction indicating a case of mode collapse (c). The observed trajectory is depicted in red. The negative log-likelihood (NLL) is provided for each predicted distribution given the ground truth trajectory depicted in green.

Figure 5.10a gives an example of a spread-out prediction with noticeable bias, which also covers the actual future trajectory. A common failure case is then given in Figure 5.10b, where the model generates a prediction with increasing uncertainty, but misses the actual future trajectory. Finally, Figure 5.10c provides an example of a prediction indicating a case of mode collapse. In this example, all samples generated by the model are basically the same, with only minimal variation. While the failure case in Figure 5.10b yields a significantly increased negative log-likelihood, the low variance of the prediction depicted in Figure 5.10c increases the error even further.

#### **5.1.5.2 Comparison of Probabilistic Predictions**

For a qualitative comparison of the probabilistic prediction models, three examples are taken from the sdd:hyang00 dataset, as it provides a well-structured scenery, where pedestrians mainly stay on designated walking paths. These examples are depicted in Figure 5.11 and cover a range of situations with an increasing number of distinguishable possibilities for future trajectories.

**Figure 5.11:** Examples taken from the sdd:hyang00 dataset (observations depicted with markers), covering a straight prediction and multi-modal predictions at different decision points, i.e. junctions, on the pathway.

In the example depicted in Figure 5.11a, the prediction of a straight motion is to be expected, as there are no decision points after the observed portion of the trajectory. Example 5.11b provides an observed trajectory which ends prior to a decision point, where the observed person can either move straight or turn to the right. Although, looking at the data, turning to the right is statistically less likely, the observed trajectory shows a tendency of moving to the right, making both options possible. Lastly, the example given in Figure 5.11c allows for a potentially trimodal prediction. In this case, however, the number of modes in the prediction is highly dependent on the local neighborhood of the observed trajectory considered during model training, as it influences the target distribution. This is discussed in more detail in Section 5.1.5.3.

The predictions by each model for each example are depicted in Figures 5.12, 5.13 and 5.14. For the R-MDN and -MDN models, the R-MDN and -MDN variants are used as representatives. Predictions are illustrated as a heatmap calculated from the predicted samples of each time step t ∈ {1, ..., N<sub>pred</sub>}.
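A minimal sketch of how such a heatmap can be computed from per-time-step prediction samples (bin count and scene extent are illustrative):

```python
import numpy as np

def prediction_heatmap(samples_per_step, extent, bins=128):
    """Accumulate predicted samples of all time steps into a 2D heatmap.

    samples_per_step: list of (num_samples, 2) arrays, one per predicted time step
    extent:           ((x_min, x_max), (y_min, y_max)) scene extent in pixels
    """
    all_samples = np.concatenate(samples_per_step, axis=0)
    heatmap, _, _ = np.histogram2d(all_samples[:, 0], all_samples[:, 1],
                                   bins=bins, range=extent)
    return heatmap / heatmap.sum()  # normalize to a discrete probability map
```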

**Figure 5.12:** Predictions generated by the probabilistic prediction models in comparison for an example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual future trajectory is indicated as a dashed red line.

For the first example, all models, with the exception of the VAE, generate a unimodal prediction going straight, as expected. While the R-MDN and the GAN generate comparable results, the -MDN generates a higher-variance prediction. The VAE, on the other hand, wrongly predicts another possibility of moving downwards in addition to the straight prediction. This may be caused by the close proximity of the absolute positions to the junction, where moving down is another option. This, in turn, indicates that the model puts more weight on the observed positions in isolation, rather than on the surrounding context.


**Figure 5.13:** Predictions generated by the probabilistic prediction models in comparison for an example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual future trajectory is indicated as a dashed red line.

The second example shows another form of mode collapse in the predictions of the R-MDN and the GAN, where the statistically less relevant mode is suppressed and the prediction collapses onto a single mode. Again, the -MDN and VAE models generate predictions with a higher variance, thereby also covering the possibility of turning to the right. Still, it is visible from the heatmap that moving straight is the dominant option.


**Figure 5.14:** Predictions generated by the probabilistic prediction models in comparison for an example trajectory (red markers) taken from the sdd:hyang00 dataset. The actual future trajectory is indicated as a dashed red line.

Consistent with the two previous examples, the R-MDN and GAN generate a similar bimodal prediction for the final example, both ignoring the possibility of moving to the right. Ignoring this possibility could be attributed to the observed trajectory being close to the left side of the pathway, making a movement to the right less likely. Combined with the rather low-variance predictions of both models, hinting at a smaller surrounding area being considered for the conditional prediction, trajectories located closer to the right side of the pathway could not have had much influence on the models' output during training. Opposed to that, the VAE and -MDN models output a trimodal prediction, where the third mode is more defined in the -MDN's prediction. At the same time, the VAE seems to over-estimate the pedestrian's movement speed when going straight, while the -MDN rather under-estimates it slightly, when compared to the R-MDN and GAN predictions.

#### **5.1.5.3 Assessing the Quality of Probabilistic Predictions**

In the quantitative evaluation section, the negative log-likelihood (NLL) has been used as a measure of the quality of probabilistic predictions generated by the R-MDN, VAE, GAN and -MDN models. Although the NLL evaluates a predictive distribution generated for a given observation using only a single sample (the actual future trajectory), its application is justified under the assumption that similar observations result in similar predictive distributions, thus evaluating the entire distribution. At the same time, wrong or superfluous modes in the predictive distribution are not penalized. This is one of the reasons why models which generate distributions with higher variance are often scored better. This is also the case for the oracle measure, as it ignores all predictions that are not close to the ground truth [Paj18]. These difficulties in assessing the quality of probabilistic predictions might be a reason why the standard evaluation approach for trajectory prediction models leaves out such a measure, even though most state-of-the-art models are capable of generating probabilistic predictions. Apart from these difficulties, the NLL provides a reliable measure for probabilistic predictions, as it does not require the actual ground truth distribution to be known.

Nonetheless, it would be interesting to compare the probabilistic models under a more sophisticated measure, using an estimation of the conditional ground truth distribution. Thus, this section aims to provide a toy example on a real-world dataset for evaluating the R-MDN, VAE, GAN and -MDN using the Wasserstein distance [Kol17]

$$W\_p(P,Q) = \left(\inf\_{\gamma\in\Gamma(P,Q)} \int ||\mathbf{x}-\mathbf{y}||^p \,d\gamma(\mathbf{x},\mathbf{y})\right)^{\frac{1}{p}},\tag{5.10}$$

where P and Q are probability distributions and Γ(P, Q) is the set of all joint distributions γ(**x**, **y**) whose marginals are P and Q, respectively. The Wasserstein distance, originally formulated in the context of optimal transport [Kan39], is preferred over the KL-Divergence [Kul51] and metrics built upon it (e.g. the Jensen-Shannon distance [End03]), as it also takes the metric space into account. As such, it considers the work required to transport the probability mass from a given distribution to a target distribution. Because of this intuition, it is also known as the *Earth Mover's distance* in the one-dimensional case. For higher dimensions, there exists no closed-form solution for the Wasserstein distance. In this case, a commonly used approximation is given by the sliced Wasserstein distance [Bon15, Kol19].
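To make this approximation concrete, the following is a minimal sketch of a Monte Carlo estimate of a sliced Wasserstein distance between two sample sets, using random projection directions and the closed-form one-dimensional distance provided by `scipy.stats.wasserstein_distance` (names and the number of projections are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance  # closed-form 1D Wasserstein distance

def sliced_wasserstein(x, y, n_projections=100, seed=0):
    """Monte Carlo estimate of a sliced Wasserstein distance.

    x, y: (n_samples, d) arrays of samples from the two distributions.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # draw random directions uniformly from the unit sphere
    directions = rng.normal(size=(n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # project both sample sets onto each direction and average the 1D distances
    distances = [wasserstein_distance(x @ u, y @ u) for u in directions]
    return float(np.mean(distances))
```

Note that this sketch averages one-dimensional W₁ distances over random projections, whereas the evaluation later in this section relies on the 2-sliced Wasserstein distance as implemented in the *Python Optimal Transport* library.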

The following toy example focuses on the evaluation of the endpoint distribution N<sub>pred</sub> steps into the future, as generated by each probabilistic prediction model, using the sliced Wasserstein distance. As a first step, the conditional ground truth distribution needs to be determined for each trajectory in the test dataset. This can be achieved by searching the training dataset for trajectories which are similar to each test dataset trajectory in their respective observed portion. The conditional ground truth distribution can then be estimated by applying a Gaussian kernel density estimation, using the endpoints of similar training dataset trajectories. The steps required to determine the conditional ground truth distribution for an exemplary test dataset trajectory (Figure 5.15a) are depicted in Figure 5.15.

**(a)** Exemplary test trajectory **(b)** Determined search box

**(c)** Set of similar trajectories **(d)** Estimated endpoint distribution

**Figure 5.15:** Example for estimating a conditional ground truth probability distribution of a trajectory endpoint given a test trajectory's first 8 points as observation. The exemplary test trajectory in this figure starts at the black circular marker. The observed portion of the exemplary test trajectory ends prior to the junction, making a probabilistic prediction of its true endpoint potentially multi-modal.

For finding trajectories similar to a given test trajectory, an axis-aligned rectangular search region around the test trajectory's first observed point is determined. While the longitudinal expansion of the region is calculated to include the first 3 trajectory points, the lateral expansion considers the width of the walking path and is set by hand. This assumes that there is no bias in the conditional ground truth distribution if the observed trajectory is closer to either side of the walking path. The resulting search region is depicted in Figure 5.15b.

As a next step, all training dataset trajectories starting within this region are gathered. From this set of trajectories, only those complying with the general movement direction

$$\mathbf{m}^\* = \frac{1}{N\_{\rm obs}} \sum\_{t=2}^{N\_{\rm obs}} \left(\mathbf{x}\_t^\* - \mathbf{x}\_{t-1}^\*\right) \tag{5.11}$$

and speed

$$\mathbf{s}^\* = \frac{1}{N\_{\text{obs}}} \sum\_{t=2}^{N\_{\text{obs}}} ||\mathbf{x}\_t^\* - \mathbf{x}\_{t-1}^\*||\_2 \tag{5.12}$$

of the observed portion of the test trajectory are kept for the distribution estimation. Here, a movement direction deviation of 10° and a speed deviation of 25% are allowed. The resulting set of similar trajectories is depicted in Figure 5.15c. Finally, the conditional endpoint distribution estimated from the set of similar trajectories is illustrated in Figure 5.15d.
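A simplified sketch of this procedure, assuming illustrative parameter values and a symmetric search box spanned by the first three observed points plus a hand-set margin (the exact box construction and thresholds used in the evaluation may differ), could look as follows:

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_endpoint_distribution(test_traj, train_trajs, n_obs=8,
                                   lat_expansion=20.0, max_angle_deg=10.0,
                                   max_speed_dev=0.25):
    """Estimate the conditional ground truth endpoint distribution for a test trajectory.

    test_traj:   (n_obs + n_pred, 2) array
    train_trajs: list of (n_obs + n_pred, 2) arrays from the training dataset
    """
    obs = test_traj[:n_obs]
    direction = np.mean(np.diff(obs, axis=0), axis=0)              # average displacement, cf. Eq. (5.11)
    speed = np.mean(np.linalg.norm(np.diff(obs, axis=0), axis=1))  # average speed, cf. Eq. (5.12)

    # simplified axis-aligned search box around the first observed point
    span = np.abs(obs[2] - obs[0])
    lower = obs[0] - span - lat_expansion
    upper = obs[0] + span + lat_expansion

    endpoints = []
    for traj in train_trajs:
        if not np.all((traj[0] >= lower) & (traj[0] <= upper)):
            continue  # trajectory does not start inside the search region
        t_obs = traj[:n_obs]
        t_dir = np.mean(np.diff(t_obs, axis=0), axis=0)
        t_speed = np.mean(np.linalg.norm(np.diff(t_obs, axis=0), axis=1))
        cos_angle = np.dot(direction, t_dir) / (
            np.linalg.norm(direction) * np.linalg.norm(t_dir) + 1e-12)
        angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        if angle > max_angle_deg or abs(t_speed - speed) > max_speed_dev * speed:
            continue  # deviates too much in movement direction or speed
        endpoints.append(traj[-1])

    endpoints = np.asarray(endpoints)
    return gaussian_kde(endpoints.T), endpoints  # KDE over similar trajectories' endpoints
```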

It has to be noted that the resulting probability distribution is highly dependent on the considered local neighborhood, defined by the longitudinal and lateral expansions of the search region, as well as the direction and speed deviation parameters. At the same time, it is not quite clear how to choose these values properly. As such, the assumptions made for this example might be inaccurate. Further, it is even harder to define these parameters in less structured datasets, making such an evaluation non-viable for large-scale evaluations including several datasets. In addition to this aspect, another obstacle is the required number of trajectories similar to an observed trajectory in question. This is touched upon in more detail later in this section.

Aiming at a comparison of endpoint probability distributions, Figure 5.16 depicts sample-based predictions for the endpoint as generated by the sample-based prediction models.

**Figure 5.16:** Sample-based endpoint predictions as generated by the R-MDN, VAE and GAN models. Regions of high sample-density can be interpreted as modes in the actual probability distribution, whereas the sample-spread indicates the variance.

As described before, a probability density function is estimated from these sample-based predictions by applying a Gaussian kernel density estimation. The resulting probability densities, including the one generated by the -MDN, are depicted in Figure 5.17. In addition to the probability densities, the respective NLL scores given the actual endpoint and the Wasserstein distances given the estimated ground truth distribution (see Figure 5.15d) are provided.

**(a)** R-MDN (NLL: 124.65, Wasserstein: 116.66)

**(b)** VAE (NLL: 16.61, Wasserstein: 127.24)

**(c)** GAN (NLL: 200.31, Wasserstein: 144.21)

**(d)** -MDN (NLL: 11.46, Wasserstein: 91.10)

**Figure 5.17:** Predicted endpoint probability distributions as generated by the R-MDN, VAE, GAN and -MDN models. The NLL and Wasserstein distance values between each respective predicted distribution and an estimated ground truth distribution (see Figure 5.15d) are provided.

Following this example of how to calculate the Wasserstein distance for an exemplary trajectory, the same methodology is applied to the first fold test dataset of the sdd:hyang00 dataset. For the evaluation on the whole dataset, a few things have to be noted. First, in practice¹, the Wasserstein distance is calculated on a sample-based representation of the provided probability distributions. Thus, the actual sample-based representations are used for the R-MDN, VAE and GAN models, and an equal number of samples is drawn from the distribution generated by the -MDN. In contrast, the estimated ground truth distribution is not re-sampled in order to obtain a larger number of samples, so as not to distort the actual distribution. Further, test trajectories are only considered if there are at least 30 similar trajectories available according to the aforementioned method. According to [Sil86], at least 19 samples are required in order to calculate an accurate estimation of a bivariate Gaussian density using a kernel density estimation. Due to the ground truth distributions in this evaluation potentially being multi-modal, the number of samples should be increased. At the same time, increasing the number of required samples potentially reduces the number of available test trajectories, when there are not enough similar trajectories available. Requiring at least 30 similar trajectories reduces the size of the test dataset by approximately 45%, thus providing a trade-off between obtaining an accurate ground truth distribution and a reasonable test set size.

Following this, Table 5.13 depicts the mean Wasserstein distance calculated on the first fold test dataset of sdd:hyang00. For comparison, the NLL, as calculated for the quantitative evaluation (see Section 5.1.4), is provided. For completeness, the Wasserstein distance is also provided for the shotgun baseline.

¹ In the context of this thesis, the implementation provided by the *Python Optimal Transport* library [Fla21] is used, which computes a Monte Carlo approximation of the 2-sliced Wasserstein distance.

**Table 5.13:** Negative Log-Likelihood and Wasserstein distance calculated on the sdd:hyang00's first fold test dataset for each probabilistic prediction model in comparison. In case of the Wasserstein distance, the predicted endpoint distribution is compared with an estimation of the true endpoint distribution for each test trajectory. Lower is better for both measures.


Looking at the results, the ranking of the probabilistic models is, in parts, consistent with the NLL-based ranking. The shotgun baseline still performs well in this toy example, which is probably due to the presence of many cases where a unimodal prediction is sufficient. This also supports the shotgun approach's viability as a baseline for probabilistic trajectory prediction. Besides that, a major difference is the GAN performing notably better under the Wasserstein distance, which is likely to be attributed to the Wasserstein distance not penalizing low-variance predictions as strongly as the NLL does. Still, the -MDN outperforms the other models in terms of the quality of the probabilistic prediction. The results further indicate more stable behavior under the use of the Wasserstein distance.

In summary, this toy example supports the viability of the proposed -Curve approach in the context of human trajectory prediction. Further, it suggests that the NLL can pose a viable performance measure for probabilistic prediction, but it needs to be accompanied by a qualitative evaluation in order to assess the plausibility of the predictions in terms of their variance. Finally, in cases where the ground truth data distribution is available, e.g. when using synthetically generated datasets, the Wasserstein distance may be preferred over the NLL.

#### **5.1.5.4 -MDN: Additional Examples**

This section focuses on a few real-world examples, addressing different characteristics of the -Curve model and its -MDN implementation, namely the suppression of superfluous mixture components (see Section 4.1.3), the modeling of different speeds using multiple mixture components and the squeezing effect (see Section 3.1.2). In addition, the most common failure case occurring when using the -MDN is presented.

Starting off with superfluous component suppression, the prediction of a 3-component -MDN for an exemplary trajectory taken from the biwi:eth dataset is depicted in Figure 5.18. In this example, it can be seen that two components in the model's output have been suppressed by assigning them a weight of ≈ 0, leaving a single component for the prediction. This complies with the desired behavior in this situation, as all persons moving towards the university come together at the entrance. Additionally, all persons in the dataset move with the same speed on average, making a multi-modal prediction only necessary in situations with multiple distinct possible future trajectories.

**Figure 5.18:** Exemplary prediction of a 3-component -MDN, where 2 -Curves were suppressed in favor of a single -Curve responsible for the prediction.

On rare occasions, a low-weight, not well-optimized component appears in a generated prediction. This is the most common failure case when using the -MDN and is closely connected to the presence of superfluous components. An example taken from the crowds:zara02 dataset is given in Figure 5.19. In this example, the component depicted in green unexpectedly branches out and greatly reduces in speed. While both behaviors would be valid in the presence of other nearby pedestrians, as part of a collision avoidance maneuver, both actions combined are more likely to be an optimization artifact, where a mixture component receives no more support from training samples from some point onwards during training. Although the model is not exposed to specific multi-agent data, isolated trajectories still reflect such behavior.

**Figure 5.19:** Exemplary prediction of a 3-component -MDN revealing a common failure case, where the prediction includes a low-weighted, not well-optimized mixture component (green).

Besides using multiple mixture components in a prediction for modeling distinct future trajectories, these components can also be used to model similar future trajectories at different speeds. Figure 5.20 gives an example taken from the crowds:zara02 dataset, where the -MDN uses all of its 3 mixture components for modeling different speed versions of the same future trajectory in terms of its movement direction and curvature. As mentioned before, this dataset has a higher person density than, for example, the biwi datasets, which makes it more likely that pedestrians deviate from the average movement speed in order to prevent collisions with other pedestrians. This, in turn, likely causes the multi-modal prediction covering different speeds.

**Figure 5.20:** Exemplary prediction of a 3-component -MDN, where each -Curve in the mixture models another version of the same trajectory, using a different movement speed.

Finally, the open questions concluding Section 3.1.2 include whether the squeezing effect is relevant in real-world situations and whether the model is able to generate constant variance predictions when learned from data. Overall, the squeezing effect can be rated as not relevant when using real-world data, which is generally subject to noise. In this case, when predicting into the future, the variance usually increases with each time step. With respect to the constant variance case, Figure 5.21 provides an example taken from the sdd:hyang00 dataset, where the -MDN outputs a unimodal prediction which maintains an almost constant lateral variance. In this example, this is achieved by slowly morphing an almost circular covariance ellipse towards a covariance ellipse with increased longitudinal variance. Here, the increase of the longitudinal variance is a way of coping with uncertainties about the actual speed of the observed trajectory. Observing both the increase of the longitudinal variance over time and the near constant lateral variance shows the low relevance of the squeezing effect for real-world applications.

**Figure 5.21:** Exemplary prediction of a -MDN maintaining a near constant lateral variance while increasing longitudinal variance at the same time.

#### **5.1.5.5 Implicit Input Attention**

With the prediction generated by an -MDN being based on just the N<sub>obs</sub> observed positions, this section investigates the influence of each input on the generated prediction, with respect to its position within the observed sequence. Recall that an -MDN prediction is given in terms of a curve weight distribution, as well as a set of control point mean vectors and covariance matrices, which yield the predicted mean curve and the region of uncertainty around it. Following this, it is especially interesting to see if different parts of a given input sequence are considered for generating the curve weights, mean vectors and covariance matrices.

As there is no attention mechanism explicitly built into the -MDN architecture, the model's attention to different parts of a given sequence can be calculated using the gradient of each generated output with respect to the inputs. Using *PyTorch*, its *autograd* module can be used for this, which calculates the respective gradients by performing a backward pass through the network given the generated output. Figure 5.22 depicts the resulting gradient-based implicit attention maps for each dataset in the evaluation. For each dataset, gradient magnitudes are averaged for each output, i.e. the curve weights, mean vectors and covariance matrices. As in previous sections, the -MDN variant is considered.
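A minimal sketch of this gradient-based attention computation (the model interface returning weights, means and covariances is illustrative):

```python
import torch

def input_attention(model, observation):
    """Compute gradient magnitudes of each model output w.r.t. the observed positions.

    observation: tensor of shape (n_obs, 2) holding the observed trajectory.
    Returns per-time-step gradient magnitudes for each output head.
    """
    obs = observation.clone().requires_grad_(True)
    weights, means, covs = model(obs.unsqueeze(0))  # illustrative model interface

    attention = {}
    for name, output in [("weights", weights), ("means", means), ("covariances", covs)]:
        grads = torch.autograd.grad(output.sum(), obs, retain_graph=True)[0]
        attention[name] = grads.abs().sum(dim=-1)   # magnitude per observed time step
    return attention
```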

Figures 5.22a, 5.22b and 5.22c depict the input attention on a per-dataset basis for each of the model outputs separately. Figure 5.22a reveals that for generating the mean vectors, the most important input is given by the last observed position, with an additional, but weaker, contribution by the second to last element. This observation is in line with the findings given in [Sch20a]. Opposed to that, for determining the weights and covariance matrices, a mix of multiple observations spread across the entire observed sequence is considered. The choice of which observations to rely on varies between datasets. This is most likely due to random effects during training and the model being trained for each dataset individually. Especially in the case of the covariance matrices, it makes sense to incorporate multiple observations from a given sequence, as a noise estimation can be expected to be more accurate using more data samples. Figure 5.22c depicts the input attention for each model output averaged over all datasets and summarizes the aforementioned findings.


**Figure 5.22:** Heatmap visualization of the model's attention to each of the N<sub>obs</sub> = 8 observations when generating predictions in terms of curve weights, mean vectors and covariance matrices. Time steps are given relative to the last observation at t = 0. Inputs with no influence on the output are depicted in white and inputs with the most influence (per row) are given in dark blue. In figures (a) – (c), the datasets biwi:eth, biwi:hotel, crowds:zara01, crowds:zara02, sdd:bookstore03 and sdd:hyang00 are depicted along the y-axis (a – f).

To accompany the heatmap visualizations, Figure 5.23 depicts the influence of each observation on the respective mean vectors and covariance matrices for two exemplary input sequences. Both examples support the observation that, for generating the mean vectors, the most recent observations are the most important. Further, the covariance matrices are determined using several observations spread across the observed sequence.


**Figure 5.23:** Visualization of the influence of observations along a given input sequence for two exemplary sequences. More important observations for determining the mean vectors (left) and covariance matrices (right) are depicted with higher color intensity.

### **5.1.6 Summary**

In summary, this section gave a detailed overview of the human trajectory prediction task, commonly used datasets and state-of-the-art prediction models. The latter are most commonly variants of Recurrent Mixture Density Networks, Variational Autoencoders and Generative Adversarial Networks. This overview was followed by an extensive evaluation of the -Curve model in comparison to these commonly used models, using different performance measures and corresponding baselines. The performance measures include the average and final displacement error for measuring the performance of maximum likelihood predictions, as well as the negative log-likelihood for assessing the probabilistic prediction performance. In this evaluation, the -Curve model shows competitive results, outperforming most other generic probabilistic sequence models in the comparison.

## **5.2 Human Motion Prediction**

The primary goal of this section is the evaluation of the scalability of the -Curve model to higher-dimensional data. For this, the task of human motion prediction is considered. Note that in the literature, human trajectory prediction (see the previous Section 5.1) is sometimes confused with human motion prediction. To clarify, human trajectory prediction is concerned with human movement along a trajectory through an observed scene, based on observed 2- or 3-dimensional locations. Opposed to that, human motion prediction targets the motion of the human body when performing different actions and is based on sequences of human poses.

Thus, in human motion prediction, a prediction model is tasked to generate a sequence of human poses resembling some action performed by an observed subject. The generation is thereby conditioned on a given initial observation of the performed action. Each element in the sequences to process is given by a human pose. Such human poses are commonly represented as a set of 3D joint positions, which can be connected via a skeleton definition. The number of 3D joints describing a human pose varies between datasets. To give an example, in the Human3.6m dataset [Ion13], a human pose is described by 32 3D joints, yielding a 96-dimensional vector.

### **5.2.1 Datasets**

Looking at datasets which provide human pose sequences, the most commonly used ones include the *CMU mocap database*¹, the *Human3.6m dataset* (abbrev.: h3.6m, [Ion13]) and the *NTU RGB+D dataset* [Sha16]. Among these datasets, the h3.6m dataset is the most widely used for the task of human motion prediction. This is due to the existence of a standard evaluation protocol, which allows previous results of different approaches tackling the prediction task to be re-used. Some details on the h3.6m dataset are given in Table 5.14. Figure 5.24 depicts an example of a pose sequence for the *walking* action taken from the h3.6m dataset.


**Table 5.14:** Details of the Human3.6m dataset.

**Figure 5.24:** Exemplary human motion sequence taken from the Human3.6m dataset. The left arm and leg are depicted in blue. For illustration purposes, the sample rate is reduced to 12.5Hz.

¹ http://mocap.cs.cmu.edu/

### **5.2.2 Evaluation Protocol**

For enabling a repeatable and comparable evaluation, approaches presented in the literature commonly follow the standard evaluation protocol provided in [Fra15] and [Jai16]. According to this protocol, multiple data pre-processing steps are performed prior to training and evaluation. First, the pose representation as provided in the h3.6m dataset is converted into an exponential map representation of each joint using a specific pre-processing of global translation and rotation as specified in [Tay07]. Following the change in representation, the data is standardized by subtraction of the mean and division by the standard deviation along each dimension. Then, dimensions with constant values are dropped from the representation. The resulting pose representation then consists of 17 joints and a global translation component, yielding a 54-dimensional representation. Finally, the sequence sample rate is reduced to 25Hz.
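A minimal sketch of the standardization, constant-dimension removal and sample rate reduction steps (the conversion to the exponential map representation is omitted; array shapes and the 50 Hz source rate are assumptions):

```python
import numpy as np

def preprocess_sequences(sequences, orig_rate=50, target_rate=25):
    """Standardize pose sequences, drop constant dimensions and reduce the sample rate.

    sequences: list of (seq_len, dim) arrays in exponential map representation.
    """
    stacked = np.concatenate(sequences, axis=0)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0)
    keep = std > 1e-8                      # drop dimensions with constant values
    step = orig_rate // target_rate        # e.g. 50 Hz -> 25 Hz
    processed = []
    for seq in sequences:
        seq = (seq - mean) / np.where(keep, std, 1.0)  # standardize each dimension
        processed.append(seq[::step, keep])            # downsample and drop constants
    return processed, mean, std, keep
```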

Using the pre-processed data, training is performed on a subset of actions using subjects S1, S6, S7, S8, S9 and S11. The action subset includes *walking, eating, smoking, discussion, directions, greeting, phoning, posing, purchases, sitting, sittingdown, takingphoto, waiting, walkingdog, walkingtogether*. The test dataset then contains actions performed by subject S5, collecting 8 sub-sequences of specific actions using a fixed seed. The considered set of actions in the test dataset is restricted to the representative actions *walking*, *eating*, *smoking* and *discussion*. For prediction, a given model is tasked to predict up to N<sub>pred</sub> = 10 time steps (400 milliseconds) into the future, given an observation of N<sub>obs</sub> = 50 time steps (2 seconds) of a given action. The prediction performance is then measured in terms of the mean angle error¹

$$MAE = \frac{1}{M} \sum\_{l=1}^{M} ||\hat{\mathbf{y}}\_{t}^{l} - \mathbf{y}\_{t}^{l}||\_{2},\tag{5.13}$$

¹ Using an Euler angle representation, which can be calculated from the exponential map representation.

calculated after 80, 160, 320 and 400 milliseconds, using samples of the same action. With a sample rate of 25Hz, this corresponds to t = 2, t = 4, t = 8 and t = 10 time steps. This restriction to short-term prediction¹ is introduced because the stochasticity of human motion prevents a quantitative evaluation over longer time horizons [Fra15].
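A minimal sketch of computing Eq. (5.13) at the four evaluation horizons (assuming predictions and ground truth are already given as Euler angle vectors):

```python
import numpy as np

def mean_angle_error(pred, gt, horizons=(2, 4, 8, 10)):
    """Mean angle error at selected prediction horizons (cf. Eq. (5.13)).

    pred, gt: (M, n_pred, dim) arrays of predicted / ground truth Euler angle vectors.
    """
    errors = np.linalg.norm(pred - gt, axis=-1)             # Euclidean norm per sample and step
    return {t: errors[:, t - 1].mean() for t in horizons}   # average over the M samples
```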

### **5.2.3 Baselines and Comparison Models**

For comparison, there are several commonly used simple and neural network-based baselines. Common simple baselines are given by the *Zero-velocity* model [Mar17], which constantly predicts the last observation, and a running average over the last n observed poses, abbreviated as *Run. avg. n* (a minimal sketch of these two baselines is given below). Regarding neural network-based baselines, the most prevalent models include the *LSTM-3LR* [Fra15], the *ERD* [Fra15] and the *SRNN* [Jai16] models. While the LSTM-3LR is a three-layer LSTM network, the ERD and SRNN models are more tailored towards learning a meaningful representation of a given observation to base their prediction on. The *Encoder-Recurrent-Decoder* model (abbrev.: ERD) is a type of RNN that combines representation learning with learning temporal dynamics. To achieve this, the input to the RNN is encoded into a representation in which learning pose dynamics is easier. The *Structural RNN* (abbrev.: SRNN), on the other hand, aims to incorporate semantic knowledge about the data structure into the model architecture. Following the fact that a sequence of poses can be represented by a (manually designed) spatio-temporal graph, the SRNN provides an approach for transforming such a graph into a feedforward mixture of RNNs.
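The two simple baselines can be sketched as follows (the value of n for the running average is a placeholder):

```python
import numpy as np

def zero_velocity(observation, n_pred):
    """Zero-velocity baseline: constantly predict the last observed pose."""
    return np.tile(observation[-1], (n_pred, 1))

def running_average(observation, n_pred, n=4):
    """Running average baseline: predict the mean of the last n observed poses."""
    return np.tile(observation[-n:].mean(axis=0), (n_pred, 1))
```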

Beyond these common baselines, recent approaches to human motion prediction are commonly based on either Recurrent Neural Networks (e.g. [Gho17, Gop19]), (sequence-to-sequence) Generative Adversarial Networks (e.g. [Gui18, Kun19]) or Graph Neural Networks (abbrev.: GNN, e.g. [Mao19,

¹ Short-term prediction is defined as predicting less than 560ms into the future.

Li20]). The latter thereby consider the actual configuration of the joints according to the underlying skeleton. For the following quantitative evaluation, representatives for each base architecture are selected.

In the group of Recurrent Neural Networks, besides the baselines presented above, another interesting approach is given by the *QuaterNet* model [Pav18]. As opposed to the other approaches in this comparison, which regress joint rotations using the exponential map representation, the QuaterNet model uses a quaternion-based representation of joint rotations. This change in representation targets the issue of discontinuities, which can occur when using an exponential map representation. Further, joint position errors are considered in the training loss function, trying to incorporate the varying impact of joints on the pose.

Looking at the GAN-based approaches, the *Adversarial Geometry-aware Encoder-Decoder* (abbrev.: AGED, [Gui18]) and *Bidirectional 3D Human Motion Prediction GAN* (abbrev.: BiHMP-GAN, [Kun19]) models are considered in the quantitative evaluation. Both of these models rely on a seq2seq RNN which is embedded in an adversarial training approach. The BiHMP-GAN model, on the one hand, incorporates a pose embedding, comparable to the ERD, and uses a bidirectional RNN architecture [Sch97] in its discriminator network. On the other hand, the AGED model exploits the intrinsic geometric structure of 3D rotations during training of the generator, by using a geodesic distance between joint rotations. This is opposed to the common approach of using a Euclidean distance between predicted and ground truth joint angles.

Finally, among GNN-based approaches, the *Traj-GCN* [Mao19] and the *Adversarial GCN* (abbrev.: A-GCN, [Cui20]) models are included in the evaluation. Both models are based on *Graph Convolutional Networks* (abbrev.: GCN, [Kip17]) and thus encode spatial dependencies in human poses by treating a pose as a generic graph. The Traj-GCN model proposes to work in trajectory space instead of the traditionally used pose space, in order to encode temporal information. Further, graph connectivity is learned automatically during training. Similar to the second aspect, the A-GCN learns the connection strength between nodes in the graph. Following this, poses are represented as a dynamic graph, where natural connections between joint pairs are exploited explicitly. Beyond that, links between geometrically separated joints can be learned implicitly. Using an adversarial training approach, the A-GCN could also be put into the group of GAN-based approaches, thus blurring the line between the groups. An overview of the presented baseline and comparison models is depicted in Table 5.15.


**Table 5.15:** Overview of the baseline and comparison models considered in this evaluation.

### **5.2.4 -Curve Model Setup**

Following the common approach of processing pose sequences in an exponential map representation, training and prediction with the -MDN will be based on this representation. With a focus on scalability, the generic version of the model is used, as in the human trajectory prediction evaluation (Section 5.1). Therefore, model extensions tailoring the model towards the task of human motion prediction are disregarded. Further, the use of a more domain-specific loss function, i.e. the geodesic loss function proposed in [Gui18], is also not considered. This is due to the fact that it cannot be easily integrated into the log-likelihood loss function for learning the mean vectors and covariance matrices jointly.

For the evaluation, two variants of the -MDN generating unimodal predictions are employed. For the first variant, the hyperparameters (see also Figure 5.5) are set to an encoder size of 1024, a single curve and 4 control points. A 4-layer LSTM is used as the sequence encoder. Further, this -MDN is parameterized to generate diagonal covariance matrices only. This is common practice due to covariance estimation becoming more difficult in higher-dimensional data [Ha18, Raz20]. Mixture Density Networks are especially afflicted by this, where the estimation of higher-dimensional covariance matrices contributes to numerical instabilities [Rup17, Mak19]. Still, with the -MDN processing full pose representations, it can be expected that dependencies between dimensions are captured implicitly, regardless of the generated -Curve only providing diagonal covariance matrices. In order to provide more expressive covariance matrices, an additional variant of the -MDN is evaluated. This variant generates -Curves with sparse covariance matrices, which model inter-joint correlations and the correlations between the dimensions of the global translation. With each joint and the global translation being represented by a 3-dimensional (sub-)vector within the pose representation, the resulting covariance matrices consist of 18 3×3 block matrices. To prevent numerical instabilities, this variant is realized as an ensemble of -MDNs, where each network models the 3 dimensions of the global translation or a single joint, respectively. The outputs of each network in the ensemble are then combined into the targeted 54-dimensional -Curve. In this case, all joints are modeled independently of each other. Each -MDN in the ensemble is parameterized with an encoder size of 128, a single curve and 4 control points, using a 1-layer LSTM as sequence encoder.
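A minimal sketch of combining the per-joint ensemble outputs at a single curve point into a 54-dimensional mean with a sparse (block-diagonal) covariance matrix, assuming each ensemble member returns a 3-dimensional mean and a 3×3 covariance:

```python
import numpy as np
from scipy.linalg import block_diag

def combine_ensemble_outputs(means, covariances):
    """Combine per-joint predictions into a full pose prediction.

    means:       list of 18 arrays of shape (3,)   (global translation + 17 joints)
    covariances: list of 18 arrays of shape (3, 3)
    """
    full_mean = np.concatenate(means)      # (54,) pose mean
    full_cov = block_diag(*covariances)    # (54, 54) block-diagonal covariance matrix
    return full_mean, full_cov
```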

### **5.2.5 Quantitative Results**

This section provides the quantitative results of the -MDN variants and the comparison models on the test dataset according to the standard protocol. The results for the simple and neural baselines are taken from [Mar17]. For the comparison models, the results are gathered from their respective papers. Thereby, only the overall best performing model variant, if there are multiple, is considered. The joint angle errors are reported in Tables 5.16 and 5.17. It should be noted that the error standard deviation is commonly not reported in the literature, thus the standard deviation is left out for all models in comparison.

**Table 5.16:** Mean angle error (lower is better) for short-term human motion prediction on the Human3.6m dataset for the representative actions *walking* and *eating*. Commonly used simple and neural baselines are provided at the top and recent domain-specific models in the middle.


**Table 5.17:** Mean angle error (lower is better) for short-term human motion prediction on the Human3.6m dataset for the representative actions *smoking* and *discussion*. Commonly used simple and neural baselines are provided at the top and recent domainspecific models in the middle.


Looking at the results, the -MDN variants generally outperform the simple, yet strong, baselines in this task. The neural baselines, which are themselves more generic models similar to the -MDN, are outperformed by a large margin. As expected, being a generic model, the -MDN falls slightly behind in comparison with the domain-specific models.

Comparing both variants of the -MDN, the results are very similar. One variant performs slightly better on the *discussion* action, whereas the other performs slightly better on the remaining actions. Differences between the predictions generated by both variants are further detailed in the qualitative results in Section 5.2.6.

In summary, the -Curve model performs quite well on the given task, despite being a more generic probabilistic sequence model. As such, the model is not specifically built to capture the underlying tree-like structure of the data, nor does it employ a specialized loss function. An additional factor contributing to less accurate predictions may be the smoothing behavior of the model, which is examined in more detail in Section 5.2.6. Finally, the quantitative evaluation shows that the model scales well to modeling higher-dimensional data.

### **5.2.6 Qualitative Results**

For the qualitative evaluation, exemplary predictions generated by the -MDN variants are examined. Thereby, differences between both variants and some insight into the behavior of the model are provided. Following this, Figures 5.25 – 5.28 depict exemplary predictions for all four actions in the test dataset for both model variants.

**Figure 5.25:** Qualitative comparison of predictions generated by both variants of the -MDN on the *discussion* action. -MDN generates diagonal covariance matrices. -MDN generates sparse covariance matrices, which model inter-joint correlations. For illustration purposes, the sample rate is reduced to 12.5Hz. For each prediction, the last 2 observed poses are depicted together with a prediction of 320 milliseconds (4 time steps) into the future. The full ground truth sequence of poses is depicted in the first row. The left arm and leg are depicted in blue (ground truth) or purple (prediction), respectively. Regions of interest are highlighted.

Looking at the *discussion* action depicted in Figure 5.25, a noticeable difference between both -MDN variants can be observed looking at the movement of the left arm. While -MDN predicts a downward movement, -MDN generates a more accurate prediction. Apart from that, both variants generate the same wrong movement for the right arm, indicating that the actual motion deviates from the average motion considering similar cases.

**Figure 5.26:** Qualitative comparison of predictions generated by both variants of the -MDN on the *eating* action. -MDN generates diagonal covariance matrices. -MDN generates sparse covariance matrices, which model inter-joint correlations. For illustration purposes, the sample rate is reduced to 12.5Hz. For each prediction, the last 2 observed poses are depicted together with a prediction of 320 milliseconds (4 time steps) into the future. The full ground truth sequence of poses is depicted in the first row. The left arm and leg are depicted in blue (ground truth) or purple (prediction), respectively. Regions of interest are highlighted.

**Figure 5.27:** Qualitative comparison of predictions generated by both variants of the -MDN on the *smoking* action. -MDN generates diagonal covariance matrices. -MDN generates sparse covariance matrices, which model inter-joint correlations. For illustration purposes, the sample rate is reduced to 12.5Hz. For each prediction, the last 2 observed poses are depicted together with a prediction of 320 milliseconds (4 time steps) into the future. The full ground truth sequence of poses is depicted in the first row. The left arm and leg are depicted in blue (ground truth) or purple (prediction), respectively. Regions of interest are highlighted.

The actions *eating* (Figure 5.26) and *smoking* (Figure 5.27) both show, apart from a few joints, a static pose throughout the sequence. As such, only subtle movements can be observed looking at the left arm. With respect to the predictions generated by both -MDN variants, these movements are seemingly averaged out in some way and thus not captured by the model. This smoothing effect is more visible when looking at single dimensions of the pose representations, as depicted in Figure 5.31 towards the end of this section.

**Figure 5.28:** Qualitative comparison of predictions generated by both variants of the -MDN on the *walking* action. -MDN generates diagonal covariance matrices. -MDN generates sparse covariance matrices, which model inter-joint correlations. For illustration purposes, the sample rate is reduced to 12.5Hz. For each prediction, the last 2 observed poses are depicted together with a prediction of 320 milliseconds (4 time steps) into the future. The full ground truth sequence of poses is depicted in the first row. The left arm and leg are depicted in blue (ground truth) or purple (prediction), respectively.

The action which yields the most accurate prediction is the *walking* action depicted in Figure 5.28. This is most likely due to this action consisting of more obvious motion of the entire body. Further, the *walking* action is more periodic than, for example, the *discussion* action. As such, it is more predictable and thus easier to model using a statistical model. In the given example, the observed subject slowly turns to the right. This is correctly captured by both -MDN variants. Besides that, it can be seen that both variants capture the general trend in motion, but the predicted motion is not as nuanced and pronounced as the actual motion. This is, again, most likely due to the smoothing property of the model.

In order to gain more insight into the predictions generated by the -MDN variants, selected dimensions of the pose representation are depicted in the following. In this case, the mean prediction with corresponding standard deviation is provided. The standard deviation can be obtained via marginalization from the covariance matrix at each predicted time step.
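As a minimal sketch of how such mean curves and per-dimension standard deviations can be obtained from the generated control points, assuming stochastically independent Gaussian control points (as in the closed-form loss derivation of Section 4.1), so that covariances combine with squared Bernstein weights (names and shapes are illustrative):

```python
import numpy as np
from scipy.special import comb

def evaluate_curve(mu, sigma, n_steps):
    """Evaluate a probabilistic Bézier curve with independent Gaussian control points.

    mu:    (n_cpts, dim) control point means
    sigma: (n_cpts, dim, dim) control point covariance matrices
    Returns the mean curve and per-dimension standard deviations at n_steps curve points.
    """
    n = len(mu) - 1                                   # curve degree
    t = np.linspace(0.0, 1.0, n_steps)
    # Bernstein polynomial basis, shape (n_steps, n_cpts)
    basis = np.stack([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)], axis=1)
    mean_curve = basis @ mu                           # convex combination of control point means
    # under independence, covariances combine with squared Bernstein weights
    cov_curve = np.einsum('ti,ijk->tjk', basis**2, sigma)
    std_curve = np.sqrt(np.diagonal(cov_curve, axis1=1, axis2=2))  # marginal std per dimension
    return mean_curve, std_curve
```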

**Figure 5.29:** Visualization of selected pose representation dimensions for the purpose of illustrating differences and similarities between -MDN variants. The green curve depicts -MDN and the blue curve depicts -MDN. For both curves, the standard deviation around the mean is indicated by a shaded region. The ground truth is depicted in red. Time steps are depicted along the horizontal axis and the unit-less value of the selected dimension is given on the vertical axis. Both panels show a dimension of the representation of the left wrist.

As mentioned before, there is a noticeable difference in the predicted motion of the left arm for the *discussion* action when comparing both -MDN variants (see Figure 5.25). This can be seen looking at the third dimension of the representation of the left wrist (see Figure 5.29a). While the -MDN variant (blue) follows the ground truth, the -MDN variant (green) falsely predicts an almost constant value. With respect to the subtle arm movements in the *smoking* action (see Figure 5.27), it can be seen that both model variants predict almost constant values for the left wrist, whereas the ground truth slightly deviates from the constant prediction (see Figure 5.29b).

**(a)** *discussion*: First dimension of the global translation.

**(b)** *walking*: First dimension of the representation of the right knee.

**Figure 5.30:** Visualization of selected pose representation dimensions for the purpose of illustrating the capability of the -MDN to capture general trends in human motion. The green curve depicts -MDN and the blue curve depicts -MDN. For both curves, the standard deviation around the mean is indicated by a shaded region. The ground truth is depicted in red. Time steps are depicted along the horizontal axis and the unit-less value of the selected dimension is given on the vertical axis.

Although the -MDN is not quite well-suited for capturing subtle motions in a sequence of poses, it is well capable of capturing the general motion of an observed subject. This can be seen in Figure 5.30. Here, exemplary pose representation dimensions taken from the *discussion* and *walking* examples are illustrated. In both cases, the -MDN variants generate -Curves following the correct trend with respect to the ground truth.

Finally, the innate *smoothing* feature of the -Curve model is quite noticeable looking at the predictions generated by the -MDN variants. By generating a compact representation, the -Curve model generally averages out small variations in the data and thus primarily captures trends in the data. The model thereby copes with small variations by varying the variance of the control points accordingly. This smoothing effect is depicted in Figure 5.31.

**Figure 5.31:** Visualization of selected pose representation dimensions for the purpose of illustrating the smoothing property of the -Curve model. The green curve depicts -MDN and the blue curve depicts -MDN. For both curves, the standard deviation around the mean is indicated by a shaded region. The ground truth is depicted in red. Time steps are depicted along the horizontal axis and the unit-less value of the selected dimension is given on the vertical axis.

On a final note, the -MDN variant generally generates higher variances than the -MDN variant. This may be due to -MDN having to cope with larger variations in the data, as it processes full 54-dimensional pose representations, whereas networks within the -MDN ensemble only need to deal with 3-dimensional data.

### **5.2.7 Summary**

In this section, the scalability of the -Curve model in terms of data dimensionality was evaluated. For this, the task of human motion prediction, where sequences of high-dimensional pose representations have to be modeled, was considered. The results show that the -Curve model is well capable of representing higher-dimensional data by increasing the dimensionality of the stochastic control points accordingly. While the -Curve model outperforms common baselines on the task, it falls slightly behind in comparison with recent domain-specific models. However, this was expected, as the -Curve model is a generic model, while the domain-specific models incorporate additional information about the data, such as the arrangement of joints, by using graph networks.

# **6 Summary**

Throughout this thesis, an approach for modeling stochastic processes with bounded index sets, the -Curve model, based on a probabilistic extension of Bézier curves (-Curves), has been presented. Thereby, a stochastic process is defined by Gaussian mixture distributions, which evolve along a mixture of -Curves. By basing the -Curve model on Bézier curves, a compact representation of a stochastic process can be achieved. Together with its proposed implementation based on Mixture Density Networks, the model provides a fully regression-based approach to probabilistic sequence modeling, which does not rely on Monte Carlo techniques during inference, thus reaching the goals set for this thesis. By using parametric curves and optimizing in function space rather than in the high-dimensional space of sequence values, the proposed model is able to generate smooth continuous predictions in a single inference step. Thereby, learning a probability distribution over parametric curves is in line with Gaussian processes, for which the underlying -Curves provide a special case. Different properties of the model were examined by conducting several toy examples on synthetically generated data.

The model has been evaluated extensively on the task of human trajectory prediction, targeting the overall performance of the model in an application context, which proved the viability and capabilities of the model. Looking at the evaluation results, the -Curve model outperforms other generic probabilistic sequence models on different error measures capturing unimodal and multi-modal prediction performance. These models are commonly used as a basis for more sophisticated, domain-specific models. Further, difficulties in measuring multi-modal prediction performance were discussed. In the scope of this discussion, a small experiment was conducted, in which the application of the Wasserstein metric as a performance measure was proposed. In addition to this broader evaluation, the model's scalability to higher-dimensional data has been shown by applying it to a human motion prediction task. While the -Curve model outperformed common simple and neural network-based baselines, being a more generic model, it generated slightly less accurate predictions in comparison to recent domain-specific models. Beyond the scalability assessment, difficulties in covariance estimation in higher dimensions and the smoothing property of the -Curve model were discussed.

Finally, extending on the concept of -Curves, a conceptual extension to the model, which is capable of modeling infinite stochastic processes, has been presented. For this extension, denoted as the meta-time -Curve model, a proof of concept on synthetically generated data has been provided, showing the overall viability of the approach in specific cases.

# **7 Thoughts on Future Research**

This final chapter provides an overview of possible directions for future research building on the findings of this thesis.

## **7.1 Tackle Practical Limitations**

First of all, the practical limitations of the 𝒩-Curve model revealed throughout the evaluation could be tackled. The most relevant of these are covariance estimation in higher dimensions and the assumed stochastic independence of the 𝒩-Curve control points.

*Covariance Estimation:* As indicated in Section 5.2, estimating covariance matrices for high-dimensional data often leads to numerical instabilities during model training. This is mainly due to the increasing number of correlations that have to be estimated and the requirement that covariance matrices be positive definite. As a result, oftentimes only diagonal covariance matrices are employed. A first step towards tackling this problem was taken by targeting sparse covariance matrices, in which only a few dimensions are correlated. Beyond that, it might be interesting to investigate more advanced approaches to covariance estimation (e.g. [Zho11, Che19]) and their applicability to training Mixture Density Networks.
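
Independent of the approaches cited above, one standard way of enforcing positive definiteness is to let the network output an unconstrained Cholesky factor and assemble the covariance matrix from it. The following PyTorch sketch illustrates this; it is a minimal example and not the parameterization used in the thesis.

```python
import torch

def build_covariance(raw_diag, raw_offdiag):
    """Assemble a valid covariance matrix Sigma = L L^T from unconstrained
    network outputs by constructing a Cholesky factor L with positive diagonal.

    raw_diag:    (..., d) unconstrained outputs for the diagonal of L
    raw_offdiag: (..., d*(d-1)//2) unconstrained outputs for the strict lower triangle
    """
    d = raw_diag.shape[-1]
    L = torch.zeros(*raw_diag.shape[:-1], d, d)
    rows, cols = torch.tril_indices(d, d, offset=-1)
    L[..., rows, cols] = raw_offdiag                                    # strict lower triangle
    idx = torch.arange(d)
    L[..., idx, idx] = torch.nn.functional.softplus(raw_diag) + 1e-6   # positive diagonal
    return L @ L.transpose(-1, -2)

# Example: a batch of 8 covariance matrices for d = 3
sigma = build_covariance(torch.randn(8, 3), torch.randn(8, 3))
```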

*Stochastic Dependencies:* For deriving a closed-form loss function for the (recurrent) 𝒩-MDN implementation, independence of the 𝒩-Curve control points was assumed (see Section 4.1). This independence can be sub-optimal when using the 𝒩-Curve model as a generative model (see Section 3.1.3), as Bézier curves sampled from an 𝒩-Curve do not necessarily have a shape similar to the mean curve. This can be obstructive when 𝒩-Curves estimated from a dataset are to be used for enriching that dataset with additional, synthetically generated sequences similar to those already present. Following this, it could be interesting to examine how stochastic dependencies between control points can be incorporated into the model.
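
The effect can be reproduced in a few lines of code: sampling each control point independently and evaluating the resulting deterministic Bézier curve yields curves whose shapes may deviate noticeably from the mean curve. The following Python sketch makes this explicit; names are illustrative and not taken from the thesis code.

```python
import numpy as np
from scipy.special import comb

def sample_bezier_curves(means, covs, n_samples=10, n_points=50, seed=0):
    """Sample Bézier curves from an N-Curve by drawing each Gaussian control
    point independently and evaluating the resulting deterministic curve."""
    rng = np.random.default_rng(seed)
    n = len(means) - 1
    t = np.linspace(0.0, 1.0, n_points)
    # Bernstein basis matrix of shape (n_points, n+1)
    B = np.stack([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)], axis=1)
    curves = []
    for _ in range(n_samples):
        ctrl = np.stack([rng.multivariate_normal(m, c) for m, c in zip(means, covs)])
        curves.append(B @ ctrl)  # (n_points, d)
    return np.stack(curves)      # (n_samples, n_points, d)
```

Modeling the control points jointly, with non-zero cross-covariances, would constrain the sampled shapes towards the mean curve, but would also require adapting the closed-form loss derivation.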

## **7.2 Model Extensions**

Apart from these practical challenges, several model extensions could be approached, targeting different parts of the presented model.

*Interpolation of Arbitrary Distributions:* In its current formulation, the 𝒩-Curve model interpolates Gaussian control points in order to obtain a sequence of Gaussian curve points. Following this, the question arises whether it would be possible to interpolate control points following arbitrary probability distributions, in order to obtain a probabilistic curve whose curve points follow a combined, arbitrary probability distribution. To achieve this, the operation of combining multiple control points would need to be extended to a more abstract or general concept that allows the transformation of given probability distributions. Moving towards this goal, a possibly relevant approach might be given by Normalizing Flows (see Section 2.2.3), which can be used to transform simple probability distributions into more complex distributions by applying a chain of invertible mappings.
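
For illustration, the following PyTorch sketch shows the simplest possible flow building block, an invertible element-wise affine map together with its log-determinant, which is the quantity needed in the change-of-variables formula. Richer distributions are obtained by chaining several such (or more expressive) maps; the class and its name are purely illustrative.

```python
import torch

class AffineFlow(torch.nn.Module):
    """Invertible element-wise affine map x = exp(log_a) * z + b.
    Chaining several such maps (or more expressive invertible layers)
    yields a normalizing flow."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = torch.nn.Parameter(torch.zeros(dim))
        self.b = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        x = torch.exp(self.log_a) * z + self.b
        log_det = self.log_a.sum()  # log |det dx/dz|
        return x, log_det

    def inverse(self, x):
        return (x - self.b) * torch.exp(-self.log_a)

# Change of variables: log p_x(x) = log p_z(f_inv(x)) - log |det df/dz|
flow = AffineFlow(dim=2)
z = torch.randn(5, 2)
x, log_det = flow(z)
```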

*The Meta-time 𝒩-Curve Model:* In the scope of this thesis, the meta-time 𝒩-Curve model (see Section 3.2) was introduced as a conceptual extension of the 𝒩-Curve model, lifting some of the original model's limitations that are less relevant in practical applications. Although the extension was examined mainly on a conceptual level, the toy examples provided in Section 4.2 suggest its viability, especially for modeling long sequences or specifically structured sequential data. Following this, it would be interesting to further explore the capabilities of this model. Looking at the timeline mapping functions introduced in the model definition, it could be especially interesting to examine the possibilities granted by employing learned mapping functions.
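
As a purely hypothetical illustration of what a learned timeline mapping could look like (the concrete mapping functions defined in the thesis may differ), the following sketch maps an unbounded time value onto the bounded curve parameter range (0, 1) while guaranteeing monotonicity by construction.

```python
import torch

class LearnedTimelineMapping(torch.nn.Module):
    """Hypothetical learned, strictly monotonic mapping from an unbounded
    (meta-)time value t onto the bounded curve parameter range (0, 1)."""
    def __init__(self):
        super().__init__()
        self.raw_slope = torch.nn.Parameter(torch.zeros(1))  # slope > 0 after softplus
        self.shift = torch.nn.Parameter(torch.zeros(1))

    def forward(self, t):
        slope = torch.nn.functional.softplus(self.raw_slope)
        return torch.sigmoid(slope * (t - self.shift))  # strictly increasing in t
```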

*Alternative Formulation for Handling Multi-Modality:* Currently, the 𝒩-Curve model uses a mixture distribution approach for modeling multi-modal stochastic processes. A downside of such an approach is the potential blurring of modes when estimating the mixture parameters, as the loss is calculated in terms of a linear combination of all mixture components. Although mode collapse is mitigated by using Bézier curves as a basis, less well-defined modes can still result from using a mixture distribution approach. Thus, in order to achieve more clearly separated modes, more emphasis could be put on component selection by introducing a notion of attention [Vas17, Dai19] into the model. In this case, an attention mechanism could be used to decide which of the available 𝒩-Curves to select or combine for a given input.
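
A minimal sketch of what such a selection step could look like is given below: scaled dot-product scores between an encoding of the observed input and one learned key per 𝒩-Curve component are turned into selection weights via a temperature-controlled softmax. All names and the exact integration into the model are hypothetical.

```python
import torch

def component_attention(query, component_keys, temperature=1.0):
    """Attention weights over K mixture components.

    query:          (d,) encoding of the observed input sequence
    component_keys: (K, d) learned key vectors, one per component
    """
    scores = component_keys @ query / (query.shape[-1] ** 0.5)  # scaled dot products
    return torch.softmax(scores / temperature, dim=-1)          # (K,) selection weights

# Example: 3 components, 16-dimensional input encoding
weights = component_attention(torch.randn(16), torch.randn(3, 16))
```

Lowering the temperature pushes the weights towards a hard selection of a single component, which is one way of obtaining more clearly separated modes.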

*𝒩-Curve Gaussian Processes:* Finally, it could be interesting to elaborate more on the properties and potential advantages and disadvantages of the class of Gaussian process kernels induced by an 𝒩-Curve in comparison with other kernels. Additionally, it could be examined if, and to what extent, the 𝒩-Curve model and its implementation would benefit from incorporating concepts taken from Gaussian processes.
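
For a one-dimensional 𝒩-Curve with (assumed) independent Gaussian control points of variances sigma_i^2, the induced covariance function follows directly from the Bernstein-weighted combination of the control points. The following sketch evaluates it and could serve as a starting point for such a comparison; the code is illustrative and not taken from the thesis.

```python
import numpy as np
from scipy.special import comb

def n_curve_kernel(s, t, sigmas_sq):
    """Covariance cov(X(s), X(t)) of a 1-D N-Curve with independent Gaussian
    control points of variances sigmas_sq[i], for curve parameters s, t in [0, 1]."""
    n = len(sigmas_sq) - 1
    b_s = np.array([comb(n, i) * s**i * (1 - s)**(n - i) for i in range(n + 1)])
    b_t = np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])
    return float(np.sum(b_s * b_t * np.asarray(sigmas_sq)))

# Example: Gram matrix on a small grid for a cubic curve
grid = np.linspace(0, 1, 5)
K = np.array([[n_curve_kernel(s, t, [0.1, 0.2, 0.2, 0.1]) for t in grid] for s in grid])
```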

# **Bibliography**




[Bri17] BRITZ, Denny; GOLDIE, Anna; LUONG, Minh-Thang and LE, Quoc: "Massive exploration of neural machine translation architectures". In: *arXiv preprint arXiv:1703.03906* (2017) (cit. on p. 8).




# **Own publications**


# **Supervised student theses**


# **List of Figures**






# **List of Tables**



# **Acronyms**



## **Karlsruher Schriftenreihe zur Anthropomatik (ISSN 1863-6489)**




**Band 26** Philipp Woock: *Umgebungskartenschätzung aus Sidescan-Sonardaten für ein autonomes Unterwasserfahrzeug.* ISBN 978-3-7315-0541-9

The volumes are freely available as PDFs at www.ksp.kit.edu or can be ordered as printed copies.




Lehrstuhl für Interaktive Echtzeitsysteme Karlsruher Institut für Technologie

Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB Karlsruhe

This work proposes a probabilistic extension to Bézier curves as a basis for effectively modeling stochastic processes with a bounded index set. The proposed stochastic process model is based on Mixture Density Networks and Bézier curves with Gaussian random variables as control points. A key advantage of this model is its ability to generate multi-mode predictions in a single inference step, which avoids the need for Monte Carlo simulation, a frequent requirement for performing parameter estimation and probabilistic inference in commonly used approaches. Essential properties of the proposed model are illustrated by several toy examples. Further, the model is evaluated on different multi-step sequence prediction tasks in order to assess its capabilities on real-world data.

R. Hug

Probabilistic Parametric Curves for Sequence Modeling

Band 55

ISSN 1863-6489 ISBN 978-3-7315-1198-4