Karlsruher Beiträge zur Regelungs- und Steuerungstechnik, Band 10

# **Inverse Dynamic Game Methods for Identification of Cooperative System Behavior**

by Juan Jairo Inga Charaja

Karlsruher Institut für Technologie
Institut für Regelungs- und Steuerungssysteme

Dissertation approved by the KIT Department of Electrical Engineering and Information Technology of the Karlsruhe Institute of Technology (KIT) for the degree of Doktor-Ingenieur

by Juan Jairo Inga Charaja, M.Sc.

Date of the oral examination: 16 October 2020
First referee: Prof. Dr.-Ing. Sören Hohmann
Second referee: apl. Prof. Dr.-Ing. Daniel Görges

#### **Impressum**

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding the cover, pictures and graphs – is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2021 – Printed on FSC-certified paper

ISSN 2511-6312
ISBN 978-3-7315-1080-2
DOI 10.5445/KSP/1000128612

# Preface

This thesis is the result of my work as a research assistant at the Institute of Control Systems (IRS) of the Karlsruhe Institute of Technology (KIT). It would not exist without the support of many people. First and foremost, I would like to express my deepest gratitude to Prof. Dr.-Ing. Sören Hohmann, who gave me the opportunity to work on this project under his supervision. I especially treasure your trust and support, as well as all the inspiring discussions we had during these years. I would also like to kindly thank apl. Prof. Dr.-Ing. Daniel Görges for the assessment of this thesis and for his participation in the evaluation committee of my thesis defense. I very much enjoyed all of our conversations and appreciate your genuine interest in my work.

Many thanks go to all of the IRS staff for the great working atmosphere. In particular, I would like to thank Martin, with whom I shared an office for many years, always with good vibes. I also acknowledge Tim Molloy, with whom an outstanding research collaboration unfolded, the results of which are partly reflected in this thesis. Furthermore, I appreciate the support of the members of my research group at IRS, especially Esther, Florian, Philipp and Simon, all of whom proofread various parts of this thesis and gave me valuable feedback. I am also grateful to all bachelor and master students I supervised and who supported my work with their own thesis projects.

Without the support of further people, I would not be writing these lines. Special thanks therefore go to my teacher Dr. Karl Moesgen, who inspired and encouraged me in Lima, as well as to my godparents Anneliese and Toni and to my former host family Seidel, all of whom helped me to come to Germany and settle in well. Finally, I thank my former thesis supervisors Gunter and Michael, who showed me that I could knock on the "door of the doctorate".

I thank all my friends in Karlsruhe for contributing indirectly to this project through all the dances, conversations, laughter, parties, pick-up football games, beers, etc., and for always being a valuable connection to my origins and culture. My greatest thanks go to my family, especially to my parents Juan and Nadja and to my sister Leda, for their unconditional support and the sacrifices that allowed me to follow my own path, and for the love they have given me throughout my life. A very special thank you also goes to you, dear Lena, for your backing and for always providing beautiful moments away from mathematical formulas during these years, which gave me balance.

Heidelberg, October 2020

Flow in the living moment. — We are always in a process of becoming and nothing is fixed. Have no rigid system in you, and you'll be flexible to change with the ever changing. Open yourself and flow, my friend. Flow in the total openness of the living moment. If nothing within you stays rigid, outward things will disclose themselves.

Bruce Lee

# Contents





# List of Figures




# List of Tables



# Abbreviations and Symbols

# Abbreviations


# Symbols

### Latin Letters



#### Greek Letters



### Calligraphic and other symbols



#### Indices, exponents and operator names

# 1 Introduction

Automatic and intelligent machines have become ever-present in today's society. Originally developed for industrial environments to perform repetitive tasks on their own and out of human reach, the robots and automation systems of today interact closely with humans and with other robotic systems. Current trends in technological development entail an even closer interaction, for instance at a haptic level. This means that machines physically interact with a cooperation partner, e.g. a human, in order to assist them in completing various tasks more efficiently and safely. Such close interaction arises in cooperative industrial robots, robot-assisted surgery, assistance systems for vehicle control and various other human-machine cooperation settings. Automated robotic systems therefore increasingly need the ability to predict the behavior of the humans or previously unknown machines that may interact with them. This ability is crucial for the design of such cooperative systems and for exploiting the full potential of cooperation synergies. Hence, adequate modeling and identification methods are essential: mathematical models and suitable identification approaches can lead to a better general understanding of interacting agents and enable model-based control algorithms to be implemented in a technical device for adequate behavior during interaction with, e.g., a human partner.

The aforementioned situation demands a modeling framework which, on the one hand, serves as a mathematical approach for the design of the automatic controller and, on the other hand, allows the description of human behavior. Descriptive and biologically interpretable models of human behavior have been explored in the biological and neuroscience communities. In particular, human motor control has been conjectured to arise from minimum principles [NC61]. Several optimality principles have been proposed to explain the generation of a specific trajectory which serves as a command to lower-level biomechanical models (see [Eng01] for an extensive review). Given these optimality criteria, optimal control theory arises naturally as a model for movement planning and generation [Tod04] and has become a widely accepted approach in the neuroscience community. This led to further work which used this approach to model not only different kinds of human movement [MTL10, EHAAM16], but also the behavior of a human controlling a dynamic system [PCC<sup>+</sup> 15]. Optimal control theory itself is one of the most widely applied concepts in automatic control, with numerous applications in engineering. Using this concept, an automatic controller can be described by a particular cost function, as this leads to a control law which determines its behavior.

In the general case of humans and machines interacting and cooperating with each other, either in terms of self-positioning (e.g. avoiding collision) or through the control of a dynamic system (e.g. haptic shared control of a vehicle), as depicted in Figure 1.1, a potentially conflicting situation emerges. This is due to the fact that human and machine each strive for the optimization of their own individual criterion, thus potentially affecting each other negatively. Such conflicts in dynamic situations, as they arise in engineering problems, can be described by dynamic game theory, a framework which has been increasingly employed for applications in automatic control [Isa99, RBS16] as well as in economics [Doc00] and biology [MGP<sup>+</sup> 18]. In other words, the mathematical framework of dynamic game theory not only includes modeling the behavior of each partner by means of a criterion to be optimized, but also allows for the analysis of the result of their interaction. This result is typically described by an equilibrium solution, the computation of which has been the object of considerable effort (cf. [BO99]). In addition, first studies exist which demonstrate the potential of the so-called Nash equilibrium as a descriptive concept for biological systems, for instance, bird collision avoidance behavior [MGP<sup>+</sup> 18] as well as interacting humans in avoidance behavior [TW19] and in haptically coupled scenarios [BOW09, CS17, IFH19].

Figure 1.1: Different scenarios of interaction between several agents

However, calculating equilibrium solutions in dynamic games demands knowledge of the criteria each of the players optimizes, which in real scenarios are typically unknown. Indeed, intelligent automated systems will usually have incomplete information about other players. Moreover, in human-machine interaction, the objective function of the human partner is usually unknown. For instance, in highly automated driving scenarios, an autonomous car would not have knowledge of the objectives of other non-autonomous (human-controlled) vehicles. In these cases, if only measurement data is available, the objectives of the players have to be identified from a given outcome of the interaction, i.e. players' actions and system states corresponding to a game-theoretic equilibrium. To enable a widespread application of dynamic game theory, efficient data-based identification of the criteria each of the players optimized becomes essential. This identification problem is denoted as the inverse dynamic game, and its solution is the main research objective of this thesis.

# 1.1 Research Objective and Contributions

The main objective of this thesis is the development of methods for solving inverse dynamic game problems, allowing for an estimation of player objectives from observed interaction behavior. Contrary to the problem of determining equilibrium solutions from known objectives, which has been extensively studied, the inverse problem has scarcely been considered in previous work. Most treatments consider special cases, propose computationally heavy methods and give no further insight into the properties of the problem. Motivated by the aforementioned studies on human-human interaction, the focus of this thesis is on dynamic games where a Nash equilibrium arises and defines the observed behavior. In addition, computational efficiency of the methods is pursued in view of their utilization in real applications.

In a broad sense, the following contributions are made and presented in this thesis:


# 1.2 Outline

The remainder of this thesis is structured as follows.

In Chapter 2, related work and existing literature on the estimation of player objectives in optimal control and dynamic games are reviewed. The research gap is formalized in terms of concrete research questions which shall be answered in this thesis. Chapter 3 introduces the reader to the necessary mathematical fundamentals of dynamic game theory. In particular, existing results on the determination of equilibrium solutions are reviewed, which lay the foundation for the inverse dynamic game methods developed in this thesis.

The main theoretical contributions are given in Chapters 4 to 6. Chapter 4 gives a formal definition of inverse dynamic game problems and presents a control-theoretic approach for open-loop inverse dynamic games. Furthermore, sufficient conditions for the successful identification of unique parameters are presented. Inverse methods and analysis tools for the class of linear-quadratic (LQ) differential games are presented in Chapter 5; necessary and sufficient conditions for the identification of unique parameters are also given. In Chapter 6, a method based on inverse reinforcement learning is presented and shown to be adequate for solving inverse dynamic games with both open-loop and feedback information structures. The chapter also presents unbiasedness results for the estimation of player objectives with this approach.

The next chapters evaluate the novel methods in simulations and a real application. First, Chapter 7 gives simulation results to evaluate all presented methods. The properties of each class of methods are highlighted and a systematic comparison with a state-of-the-art method is conducted, in which the quality of the identification, the robustness to measurement noise and modeling errors as well as the computational complexity are evaluated. Chapter 8 shows an application of inverse dynamic games comprising the identification of human behavior in a haptic shared control task. Similar to Chapter 7, the experimental data is used to compare the methods with respect to their capability of describing observed human cooperative steering behavior.

Finally, Chapter 9 sums up all insights and results obtained in this thesis.

The structure of the thesis is summarized in Figure 1.2, where the main body is divided into two paths to stress the different principles which underlie the proposed inverse dynamic game methods.

Figure 1.2: Outline of the thesis

# 2 Related Work and Research Gap

In this chapter, related work concerning methods for the estimation of cost functions is reviewed and the concrete research gap is identified. The majority of related work is concerned with cost function identification in the single-player case, also known as inverse optimal control, both from a control-theoretic and from a computer science point of view. Therefore, this case and its origins are surveyed first to provide an adequate context before covering state-of-the-art methods in a game-theoretic setting. The chapter ends with a discussion of all explored literature, the statement of the research gap and corresponding research questions to be answered in this thesis.

# 2.1 The Inverse Problem of Optimal Control

The problem of characterizing and describing cost functions corresponding to known optimal solutions was first considered in an optimal control setting, where it is known as inverse optimal control. The study of inverse problems in optimal control started with Kalman's paper "When is a linear control system optimal?". The paper introduced conditions for a given linear control law to be optimal with respect to a quadratic performance index in the case of a single-input linear system and also showed that the inverse problem is ill-posed [Kal64]. Further progress was made by [Tha67] and [MA73], which stated similar conditions for control-affine systems and more general performance indices. These conditions serve to characterize control laws which are optimal, but are not computationally convenient for calculating a particular cost function. The computational aspect was addressed in [JK73], where formulas were given for calculating a particular set of cost function matrices based on the known system dynamics and control law. Generalized results were given by [Cas80], where the Hamilton-Jacobi-Bellman equation was proposed as a means to calculate all possible cost function parameters corresponding to a known feedback control law in a linear-quadratic optimal control problem. Similarly, [FN84] extended Kalman's results to the multivariable case, dropping the assumption of a stabilizing control law.

After these initial efforts, inverse optimal control as a means to determine cost functions receded into the background in favor of the development of control synthesis methods. The newly introduced objective of inverse optimal control consisted in the calculation of a control law which is optimal with respect to some cost function, a property which is desirable due to the resulting robust stability of the closed-loop system. An approach was developed in [Fuj87] for the linear-quadratic case. Later, [FK96, KT99] developed an approach for input-affine nonlinear systems. To this end, a link between optimal value functions and control Lyapunov functions was established using Sontag's control law [Son89].

Nori and Frezza were the first in the automatic control community to state a problem which consists of finding a cost function that explains measured trajectories [NF04], in contrast to the first theoretical works and the subsequent approaches focusing on control synthesis. Hence, the "inverse optimal control problem" underwent a shift towards a more application-oriented problem. Most subsequent approaches which can be found under the name "inverse optimal control" build upon this idea and define the problem as follows:

#### Definition 2.1 (Inverse Optimal Control Problem)

Let observed state trajectories x<sup>∗</sup>(t) of a known dynamic system and control trajectories u<sup>∗</sup>(t) of a controller be given. Determine the cost function J under which the observed trajectories are optimal.

Definition 2.1 assumes the optimality of the observed trajectories, thus intuitively representing the inverse problem to the classical optimal control problem<sup>1</sup> (illustrated in Figure 2.1). Nevertheless, this assumption is sometimes dropped (as e.g. in [NF04]) and therefore, the problem consists of estimating a cost function which best approximates a given set of trajectories.

Inverse Optimal Control

Figure 2.1: Graphical description of the inverse optimal control problem.

Inverse optimal control has been an object of research in the last decades, both from a theoretical and a practical point of view. The variety of methods for solving inverse optimal control problems can be classified into three main groups:

- direct approaches, which iteratively solve the forward problem (bilevel methods),
- inverse optimal control methods in the narrow sense, which exploit optimality conditions,
- inverse reinforcement learning (IRL) methods.

<sup>1</sup> In the course of this thesis, the latter problem shall also be referred to as the forward problem to stress the contrast to the introduced inverse problem.


It must be noted that this classification varies in the literature. Indeed, a variety of articles use the term "inverse optimal control" to denote the problem of estimating cost functions from measured data, similar to [NF04] and independently of the applied method. Nevertheless, this thesis proposes the above classification, which shall be delineated in the following. Almost all articles found in the literature present approaches which are based on the assumption of a particular structure of the cost function, e.g. a quadratic cost function. Therefore, the problem of identifying a cost function is reduced to determining parameters θ such that the observed state and control trajectories are optimal with respect to the resulting cost function J(θ).

The presented method classes are further described in the following.

#### 2.1.1 Direct Approaches

One of the most common ways to solve the inverse optimal control problem is a direct approach, where the cost function is determined iteratively. In each iteration, an optimal control problem is solved in order to generate the trajectories which are optimal with respect to the current cost function candidate. These trajectories are then compared to the observed ones. Based on this comparison, which usually includes the calculation of an error measure between trajectories, the cost function is updated such that the error is reduced. The overall aim of the method is to determine cost function parameters such that the error between both sets of trajectories is minimized. Since the solution of the optimal control problem in each iteration can be seen as a "lower" level of the main optimization problem, these methods are also known as bilevel methods [MTL10]. Figure 2.2 shows a schematic diagram of both levels of the direct approach: the upper level, where the cost function of the current iteration κ is updated such that a performance measure, e.g. the error between trajectories, is minimized, and the lower level, where an optimal control problem is solved to determine trajectories which are optimal with respect to the current cost function candidate.
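As a minimal sketch of this bilevel idea (with an invented scalar system and a single cost weight θ, not an example from the literature cited here), the lower level solves a finite-horizon LQR problem by a backward Riccati recursion, while the upper level searches for the weight minimizing the trajectory error:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Invented scalar example: x_{k+1} = a*x_k + b*u_k with stage cost
# theta*x_k^2 + u_k^2 (theta is the single unknown cost parameter).
a, b, N, x0 = 1.0, 0.5, 20, 1.0

def lqr_trajectory(theta):
    """Lower level (forward problem): finite-horizon LQR via a backward
    Riccati recursion, then roll out the optimal state trajectory."""
    P = theta  # terminal weight, chosen equal to theta for simplicity
    K = np.zeros(N)
    for k in reversed(range(N)):
        K[k] = a * b * P / (1.0 + b * b * P)
        P = theta + a * P * (a - b * K[k])
    x = np.empty(N + 1)
    x[0] = x0
    for k in range(N):
        x[k + 1] = (a - b * K[k]) * x[k]
    return x

theta_true = 3.0
x_obs = lqr_trajectory(theta_true)  # plays the role of observed data

def upper_level(theta):
    """Upper level: squared error between candidate and observed states."""
    return float(np.sum((lqr_trajectory(theta) - x_obs) ** 2))

res = minimize_scalar(upper_level, bounds=(0.1, 10.0), method="bounded")
print(round(res.x, 3))
```

Every evaluation of `upper_level` triggers a full forward solve, which is exactly the computational burden the bilevel literature discusses.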

The first algorithm of this kind was presented in [MTL10] and applied to human locomotion modeling. Further applications of this approach include driver steering behavior modeling [MFH17], reach-to-grasp human motion [EHAAM16] and human leg movements [BPC<sup>+</sup> 06]. The implementations of these methods usually differ in the techniques for solving the upper-level problem. For example, in [EHAAM16], the upper-level problem is solved by means of particle swarm optimization. In [BPC<sup>+</sup> 06], a static optimization version of the problem is posed and solved by nonlinear programming techniques. All methods require the repeated solution of optimal control (or static optimization) problems in the lower level and therefore potentially entail large computation times. Hence, the importance of efficient numerical techniques for the solution of the problems in both levels is often stressed in the literature (see e.g. [MTL10]). As a way of mitigating the computational effort, [ARARU<sup>+</sup> 11], [HSB12] and, very recently, [ZLH19] replace the lower-level problem by its corresponding optimality conditions. As a consequence of the high computation times, the methods are mostly suitable for offline applications only.

Figure 2.2: Direct bilevel approach for inverse optimal control: The upper level updates the cost function candidate such that an error measure is minimized. The lower level solves an optimal control problem.

#### 2.1.2 Inverse Optimal Control

This class of methods exploits results from optimal control theory and does not rely on the repeated solution of an optimal control problem. The methods are based on the assumption that the observed trajectories are optimal with respect to an (unknown) cost function. Under this assumption, optimality conditions are exploited in order to develop computational methods to find the parameters of the cost function which explains the observed data. The optimal parameters are determined by minimizing an objective function (usually called a residual function) which describes the extent to which the optimality conditions are violated.

The variety of methods arises from the different kinds of optimality conditions which have been applied. In the continuous-time case, these include the minimum principle of Pontryagin<sup>2</sup> and the resulting Hamilton differential equations [JAB13], the Euler-Lagrange equations [AB14] and the Hamilton-Jacobi-Bellman equation [PHL14]. If time is discretized, then Karush-Kuhn-Tucker (KKT) conditions [KWB11, PJJB12, PR15, PR17] or the discrete-time minimum principle [MTFP16] can be applied.

<sup>2</sup> This principle was originally posed in 1955 as a maximum principle given the aim of maximizing an objective function (cf. [Gam99]).
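A hedged sketch of the residual idea, using the discrete-time minimum principle on an invented scalar LQ problem (all numbers below are illustrative): eliminating the costate yields residual equations that are linear in the unknown weight, so a single least-squares solve recovers it without any forward optimal control solution.

```python
import numpy as np

# Hypothetical scalar discrete-time LQ problem: x_{k+1} = a*x_k + b*u_k,
# cost sum_k (q*x_k^2 + u_k^2) with terminal weight q. The discrete-time
# minimum principle gives the costate conditions
#   lambda_k = 2*q*x_k + a*lambda_{k+1},   0 = 2*u_k + b*lambda_{k+1}.
# Eliminating lambda yields, for k = 1, ..., N-1,
#   2*x_k*q = (2*a*u_k - 2*u_{k-1}) / b,
# a residual that is LINEAR in the unknown q.
a, b, q_true, N, x0 = 0.9, 0.4, 2.5, 30, 1.0

# Generate "observed" optimal data with a backward Riccati recursion.
P, K = q_true, np.zeros(N)
for k in reversed(range(N)):
    K[k] = a * b * P / (1.0 + b * b * P)
    P = q_true + a * P * (a - b * K[k])
x, u = np.empty(N + 1), np.empty(N)
x[0] = x0
for k in range(N):
    u[k] = -K[k] * x[k]
    x[k + 1] = a * x[k] + b * u[k]

# Least-squares solution of the linear residual equations.
A_ls = 2.0 * x[1:N]                            # coefficient of q, k = 1..N-1
b_ls = (2.0 * a * u[1:] - 2.0 * u[:N - 1]) / b
q_est = (A_ls @ b_ls) / (A_ls @ A_ls)
print(round(q_est, 6))  # recovers q_true = 2.5
```

In contrast to the bilevel scheme, the cost of this estimate is a single linear solve over the observed data.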

Some work focuses on the case where the dynamic system is linear and the cost function structure is quadratic, i.e. an inverse linear-quadratic optimal control problem. This formulation allows exploiting the constant linear feedback matrix which arises as the time horizon tends to infinity. If this matrix is known, then the cost function parameters can be estimated by solving a linear matrix inequality [Boy94, Section 10.6] or by stating an alternative objective function to be minimized with the algebraic Riccati equation as a constraint [PCC<sup>+</sup> 15, FMM<sup>+</sup> 18].
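A small sketch (with an invented two-state system and R = I assumed known) illustrates how the algebraic Riccati equation (ARE) serves as such a constraint, and at the same time how ill-posed the inverse problem is: the observed gain K only fixes B^T P = R K, i.e. part of the symmetric Riccati solution P, and every consistent completion of P yields a valid cost matrix Q.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Invented single-input system; R = I assumed known.
A = np.array([[0.0, 1.0], [0.0, -0.5]])
B = np.array([[0.0], [1.0]])
R = np.eye(1)

Q_true = np.diag([4.0, 1.0])
P_true = solve_continuous_are(A, B, Q_true, R)
K_obs = np.linalg.solve(R, B.T @ P_true)  # "observed" feedback gain

def q_from_free_entry(p11):
    """Cost matrix implied by the ARE for a symmetric P consistent with
    B^T P = R K_obs; the entry p11 is left free by the observation."""
    p12, p22 = (R @ K_obs).ravel()
    P = np.array([[p11, p12], [p12, p22]])
    return -(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P))

Q1 = q_from_free_entry(P_true[0, 0])        # reproduces Q_true
Q2 = q_from_free_entry(P_true[0, 0] + 1.0)  # a different, equally valid Q
K2 = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q2, R))
print(np.allclose(Q1, Q_true), np.allclose(K2, K_obs))
```

Both cost matrices reproduce the observed gain, so additional assumptions (e.g. a fixed parametrization of Q) are needed to single out a unique solution.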

#### 2.1.3 Inverse Reinforcement Learning

Finally, related problems have been tackled in the field of computer science, for which so-called inverse reinforcement learning (IRL) techniques have been developed. The IRL problem itself was first introduced by Russell and Ng [Rus98, NR00]. IRL mostly considers a discrete-time Markov Decision Process (MDP), which implies a finite and discrete set of possible control<sup>3</sup> values and states, and searches for a reward function instead of a cost function.<sup>4</sup> An example scenario (depicted in Figure 2.3) which can be modeled with an MDP is a grid world.<sup>5</sup> The inverse problem consists in finding the cost function when the agent's trajectory from the initial state to the final state, or the optimal strategy, is known. Furthermore, in IRL problems, the strategies and the dynamics of the system are potentially stochastic.

Figure 2.3: Grid world scenario in reinforcement learning, where the aim is to find an optimal policy which leads to the desired final state (E).
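The forward side of such a grid world can be sketched in a few lines (layout and cost values are invented here, in the spirit of Figure 2.3): value iteration computes the optimal cost-to-go, from which the optimal policy avoids the expensive "red" cells. An IRL method would face the converse task of recovering the cost map from demonstrated paths.

```python
import numpy as np

# Toy 4x4 grid world: deterministic moves, a goal cell with zero cost
# and two "red" high-cost cells (all values invented for illustration).
H, W, goal = 4, 4, (3, 3)
cost = np.ones((H, W))           # unit step cost everywhere ...
cost[1, 1] = cost[2, 2] = 10.0   # ... except two expensive cells
cost[goal] = 0.0
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((H, W))             # optimal cost-to-go
for _ in range(50):              # value iteration until the fixed point
    V_new = np.zeros_like(V)
    for r in range(H):
        for c in range(W):
            if (r, c) == goal:
                continue
            V_new[r, c] = cost[r, c] + min(
                V[r + dr, c + dc] for dr, dc in moves
                if 0 <= r + dr < H and 0 <= c + dc < W)
    V = V_new

# Greedy rollout of the optimal policy from the start cell (0, 0).
path, s = [(0, 0)], (0, 0)
while s != goal:
    r, c = s
    s = min(((r + dr, c + dc) for dr, dc in moves
             if 0 <= r + dr < H and 0 <= c + dc < W), key=lambda t: V[t])
    path.append(s)
print(len(path), (1, 1) not in path and (2, 2) not in path)
```

The rolled-out path reaches the goal in the minimum number of steps while skirting both high-cost cells, exactly the behavior an IRL demonstration would encode.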

There is a vast number of methods which tackle the IRL problem using different principles. Interestingly, many of the methods available in the literature under the name of IRL are based on a repeated calculation of the control and state sequences for the current reward function candidate, i.e. on the solution of the forward problem. The principle is therefore very similar to the aforementioned bilevel method; the methods presented in [AN04, RBZ06, NS07] are mentioned as examples. The Bayesian IRL method of [RA07] uses maximum a posteriori estimation of the cost function, which depends on sampling methods and thus demands the repeated estimation of optimal controls. A widespread IRL approach was proposed by Ziebart et al. [ZMBD08]. The idea consists in applying the principle of maximum entropy introduced by Jaynes [Jay57] in order to find a least-biased probability distribution which explains the observed trajectories.

<sup>3</sup> In the IRL literature, the controls are known as actions. In this thesis, both names are used as synonyms.

<sup>4</sup> Minimization of a cost function corresponds to a maximization of the reward function. The maximization problem can easily be cast as a minimization problem by multiplying the reward function by −1. Therefore, in the following, the term "cost function" will be used without loss of generality.

<sup>5</sup> A grid world is the most common test scenario for (inverse) reinforcement learning methods. It describes an agent searching for an optimal strategy which allows it to reach the final state with the least cost. In Figure 2.3, this implies avoiding the red blocks, which denote a high cost.

All of the aforementioned IRL methods consider an MDP as a basis and are therefore limited to discrete-valued and finite states and actions. For large (or even infinite) state and action spaces, these methods suffer from the curse of dimensionality and become highly complex and computationally heavy, especially when applied to approximate continuous-valued state and action spaces. Therefore, some effort has been made to develop IRL techniques for continuous-valued spaces, thereby tackling a problem very similar to that of the literature on cost function identification in a control-theoretic setting. Notably, these approaches show a strong similarity to the maximum entropy IRL method of [ZMBD08]. For example, [AB11] and [HFKB15] apply a maximum entropy distribution, yet solve the IRL problem using a bilevel structure. On the other hand, [KPRS13] and [LK12] propose a maximum entropy distribution which considers continuous-valued state and action spaces and does not rely on the repeated solution of optimal control problems.

# 2.2 Inverse Problems in Game Theory

After reviewing the literature on cost function identification in the single-player case, this section investigates the extent to which similar problems have been tackled in a game-theoretic scenario, i.e. the identification of cost functions from the observed interaction between several players.

Inverse problems in game theory have received growing attention in recent years, especially for static games. The term inverse game theory was introduced in [SC12] to denote the estimation of the actions and cost functions of the adversaries, i.e. the other players in the game, in order to obtain better results. Similar work is reviewed in the following.

### 2.2.1 Inverse Static Games

Even though the concept of inverse game theory initially consisted in estimating adversary cost functions from the point of view of a particular player, its meaning quickly became more general and thus gained a strong similarity to the previously introduced inverse optimal control problems. Kuleshov and Schrijvers [KS15] introduce their paper with the words: "given the observed behavior of players in a game, how can we infer the utilities<sup>6</sup> that led to this behavior?". They consider parametrizable Bayesian games where players have incomplete information about their opponents' cost functions. These are estimated using data from several realizations of static games. Similar conditions are needed in the approach of Konstantakopoulos et al. [KRJ<sup>+</sup> 18], which leverages necessary and sufficient optimality conditions of each player's cost function to estimate its parameters. In [BGP15], a method based on the solution of variational inequalities is presented to identify cost functions. An application of this work to the optimization of transportation networks is presented in [ZPCP17].

#### 2.2.2 Inverse Dynamic Games

Transferring the problem of Definition 2.1 to a multiplayer (N-player) case leads to the concept of inverse dynamic games. A general inverse dynamic game may be defined as follows:

#### Definition 2.2 (General Inverse Dynamic Game)

Let state trajectories x<sup>∗</sup>(t) of a known dynamic system and control trajectories u<sup>∗</sup><sub>i</sub>(t) of each player i, i ∈ {1, ..., N}, which correspond to a solution of a dynamic game, be given. Find the cost functions J<sub>i</sub>, one for each player i, which generated the trajectories.

In Definition 2.2, the trajectories are generated by several players acting in a dynamic game based on individual cost functions. In addition, the problem is also ill-posed, an evident fact given the ill-posedness of the single-player case. The problem of Definition 2.2 is described as "general" in the sense that the solution type is still unspecified and, contrary to the single-player case, different solution concepts exist which generally lead to different trajectories. If the game is non-cooperative, the solution may be a Nash or a Stackelberg equilibrium, depending on the order in which the players act. If the game is cooperative, then usually a Pareto-efficient solution is assumed [ER11]. The literature on dynamic game theory mostly focuses on the concept of Nash equilibria, which naturally arises when all players minimize their corresponding cost functions simultaneously. However, there exists a broad class of dynamic games for which the Stackelberg and the Nash solutions coincide.<sup>7</sup>
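The difference between these solution concepts can be made concrete with a toy static two-player game (the quadratic cost functions below are invented for illustration only): the Nash equilibrium follows from the players' simultaneous first-order conditions, while a Pareto-efficient solution minimizes the joint cost, and the two generally differ.

```python
import numpy as np

# Invented quadratic costs for a static two-player game:
#   J1(u1, u2) = u1^2 + u1*u2 - 2*u1
#   J2(u1, u2) = u2^2 + u1*u2 - 2*u2
J1 = lambda u1, u2: u1**2 + u1 * u2 - 2 * u1
J2 = lambda u1, u2: u2**2 + u1 * u2 - 2 * u2

# Nash equilibrium: both first-order conditions hold simultaneously,
#   dJ1/du1 = 2*u1 + u2 - 2 = 0   and   dJ2/du2 = u1 + 2*u2 - 2 = 0.
u_nash = np.linalg.solve(np.array([[2.0, 1.0], [1.0, 2.0]]),
                         np.array([2.0, 2.0]))

# Cooperative (Pareto-efficient) choice: minimizing J1 + J2 requires
# only u1 + u2 = 1; the symmetric solution is taken here.
u_pareto = np.array([0.5, 0.5])

# Both players incur lower cost at the Pareto point, yet each could
# profitably deviate from it, so only the Nash point is an equilibrium.
print(np.round(u_nash, 3), J1(*u_nash) > J1(*u_pareto))
```

An inverse method must therefore assume (or infer) which solution concept produced the observed data before the cost functions can be estimated.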

A literature search reveals that the problem of Definition 2.2 is greatly unexplored as mostly special cases can be found. In the automatic control community, an early work by Fujii and Khargonekar gives an approach to calculate solutions of an inverse linear-quadratic differential game [FK88] with a frequency-domain formulation. The results are similar to the oneplayer results developed by Kalman in [Kal64]. An inverse two-player zero-sum game has

<sup>6</sup> Utility is a term used especially in static game theory to denote a reward as in IRL methods.

<sup>7</sup> These concepts will be further explained later in Section 3.5.

been considered in [TMP16], where an approach which exploits necessary conditions for saddle-point solutions was presented.<sup>8</sup> In [Wan07], necessary and sufficient conditions for identification in linear-quadratic dynamic games are given. However, these are restricted to the case of a second-order dynamic system and a two-player case. For N-player inverse dynamic games with open-loop strategies, recent results were presented in [MFP17a, MFP17b], where Pontryagin's minimum principle is leveraged. In [MFP17b], a bilevel method analogous to the ones described in Section 2.1.1 was formulated. This is portrayed in Figure 2.4: the upper level, where the $N$ cost functions (denoted by $J_{1:N}$) are updated, and the lower level, where a dynamic game is solved to determine trajectories corresponding to the $N$ current cost function candidates.

Figure 2.4: Direct bilevel approach for inverse dynamic games: The upper level updates the cost function candidates such that an error measure is minimized. The lower level solves a dynamic game to determine Nash equilibrium trajectories.
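The bilevel structure of Figure 2.4 can be sketched in code. The following is a minimal illustrative example, not the method of [MFP17b]: a toy *static* quadratic game stands in for the lower-level dynamic game, and all names (`solve_game`, `upper_level`, the coupling weight 0.1) are assumptions made for illustration. The upper level updates the cost parameters θ until the equilibrium of the candidate game reproduces the observed actions.

```python
# Hypothetical sketch of the direct bilevel approach: the lower level solves
# a (here: static, quadratic) game, the upper level fits cost parameters.
import numpy as np
from scipy.optimize import minimize

def solve_game(theta):
    """Lower level: Nash equilibrium of J_i(u) = (u_i - theta_i)^2 + 0.1*u_1*u_2.
    Stationarity 2*(u_i - theta_i) + 0.1*u_{-i} = 0 gives a linear system."""
    A = np.array([[2.0, 0.1], [0.1, 2.0]])
    return np.linalg.solve(A, 2.0 * np.asarray(theta))

def upper_level(u_observed):
    """Upper level: update theta so the candidate game's equilibrium
    matches the observed equilibrium actions (error minimization)."""
    err = lambda theta: np.sum((solve_game(theta) - u_observed) ** 2)
    return minimize(err, x0=np.zeros(2), method="Nelder-Mead").x

true_theta = np.array([1.0, -0.5])
u_obs = solve_game(true_theta)   # "demonstrated" equilibrium actions
theta_hat = upper_level(u_obs)   # recovered cost parameters
print(np.round(theta_hat, 3))
```

Even in this two-dimensional toy problem, every upper-level iteration requires one full lower-level game solution, which foreshadows the computational burden discussed in Section 2.3.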

Dynamic game theory has been of considerable interest in economics, leading to some proposed methods for the solution of the inverse problem in this field. For example, [BBL07] presented an approach based on estimating the value of the cost function by means of a Monte Carlo method. The work of Arcidiacono et al. [ABBE16] offers a more efficient method based on least-squares estimation and likelihood functions. Both aforementioned methods have the main drawback that the game is limited to discrete-valued strategies and a finite number of possible states. A dynamic game with a linear-quadratic setting was considered in [CFG89], yet the players' cost function matrices were restricted to penalizing only their own controls and to having only diagonal entries.

As for IRL methods, some approaches which aim at extending these techniques to the multi-agent setting were proposed for cases in which all players behave cooperatively [HMRAD16, NKJ<sup>+</sup> 10, ŠKZK17]. On the other hand, IRL-based methods in a non-cooperative setting have been proposed in [LBC18, RGZH12]. However, similar to single-agent IRL, all of these methods are based on an MDP and hence are limited to discrete-valued and finite control and state

<sup>8</sup> Zero-sum games represent the case where one player strives to minimize a cost function while the second player seeks to maximize the same cost function.

spaces. The literature shows little available work which considers continuous-valued action and state spaces. Two exceptions are [PSS<sup>+</sup> 16], where a cooperative scenario was considered, and [MHLK17], where each agent has an individual cost function, although without explicitly relating the approach to game-theoretical concepts.

# 2.3 Discussion

As motivated in Chapter 1, the Nash equilibrium is a promising descriptive concept for the interaction between biological systems and hence potentially adequate for state-of-the-art applications in human-machine interaction. Therefore, this thesis focuses on the solution of inverse dynamic games where the trajectories correspond to a Nash equilibrium. In the following, the term inverse dynamic game will refer to this problem.

In order to solve inverse dynamic games, it may appear conceivable to apply a direct bilevel approach analogously to the single-player case (cf. Section 2.1.1). Nevertheless, the lower-level problem would consist in this case in determining the state trajectories and all players' control trajectories corresponding to the dynamic game of the current iteration. Consequently, the method implies the repeated solution of N coupled dynamic optimization problems for each set of cost function candidates. The first evaluation conducted in [MFP17b] presented a simple example where the inverse dynamic game involved the solution of 388 forward dynamic games. Especially for non-linear dynamic games, solving for Nash equilibria is in general computationally expensive and efficient numerical techniques are not available [HdlCIR19]<sup>9</sup>. Therefore, applying this approach entails a great risk of prohibitive computation times.

This motivates the need for more efficient methods for inverse dynamic games which do not rely on the repeated solution of a dynamic game. A fast identification of player cost functions allows for an immediate adaptation of automatic controllers based on potential new information, e.g. if the cooperating human changes their behavior. Nevertheless, until now, little effort has been spent on the development of alternative methods for the efficient solution of general N-player inverse dynamic games. Methods which stem from IRL are restricted to discrete-valued and finite states and controls. In addition, IRL methods in a multiplayer setting which consider continuous-valued states and controls are also almost unexplored and their theoretical foundation has not been developed. The situation is similar in the field of automatic control, where only special cases have been treated. Apart from the very early work of [CFG89] in an economics-specific scenario, successful attempts to solve general N-player inverse dynamic games have occurred only recently ([MFP17a, MFP17b]). This work encourages further effort in exploring alternative techniques for inverse dynamic games which avoid a direct bilevel approach.

<sup>9</sup> A recent study in [HdlCIR19] showed that a nonscalar two-player dynamic game with non-quadratic cost functions can take from 479.11 to 12854 seconds to solve, depending on the applied method.

Finally, almost all of the mentioned approaches, especially in dynamic games, concentrate on delivering a method which is able to estimate a cost function, but do not give further insight on when an estimation is possible. This no less important aspect of the properties of inverse dynamic game problems is almost unaddressed; there is little work on inverse problems in optimal control and dynamic games following the ideas of Kalman and the first theoretical studies (cf. Section 2.1). In addition, the ill-posedness of inverse dynamic games demands further attention. To date, much uncertainty exists concerning the properties of inverse dynamic games as these are still considerably unexplored.

# 2.4 Conclusion and Research Questions

As discussed in the previous section, the inverse problem of optimal control, i.e. a single-player inverse dynamic game, has been investigated from both a theoretical and a computational point of view. However, the problem of modeling and identifying the behavior of several players interacting with each other remains a greatly unexplored field, especially in the case of continuous-valued control and state spaces, which is important for many applications. The application of a direct bilevel approach to this problem is inappropriate given the potential complexity of solving for Nash equilibrium trajectories repeatedly. Therefore, the following questions need to be answered:


For this purpose, necessary fundamentals concerning dynamic game theory and the forward problem of determining Nash equilibria are introduced in Chapter 3 as a basis for the subsequent results. Afterwards, the posed questions are addressed in Chapters 4 and 5, where methods based on IOC—according to the classification in Section 2.1—are developed, and in Chapter 6, where an IRL-based method is introduced as a means to solve inverse dynamic games.

Furthermore, two questions which naturally arise after the development of techniques for solving inverse dynamic games are:


Probably due to the fact that IOC and IRL methods have been studied by different research communities, until now, almost no systematic comparison has been conducted on the performance of these different concepts.<sup>10</sup> Therefore, in Chapter 7, all methods (IOC-based, IRL-based and bilevel methods) are compared to each other using two different major classes of inverse dynamic game problems, where robustness to measurement noise and cost function modeling errors are also examined. Lastly, a first application example is presented in Chapter 8 to evaluate the performance of all methods with real experimental data.

<sup>10</sup> Two notable exceptions are given by [TZ11] and [JAB13]. The first compared bilevel and IOC-like methods in (single-player) inverse static optimization. The study demonstrated that the alternative method, which was based on optimality conditions, yielded comparable results to the bilevel method with considerably less computational effort. In [JAB13], a single-player inverse optimal control method based on Hamilton differential equations was compared in simulations with the bilevel method [MTL10] and the continuous-time counterparts of the methods presented in [AN04] and [RBZ06]. Their proposed method was shown to perform faster and with less trajectory and parameter error. Nevertheless, all simulated observed trajectories were noise-free.

# 3 Fundamentals of Dynamic Game Theory

This chapter gives an overview of fundamentals of dynamic game theory. After a short introduction to the general theory of games, non-cooperative dynamic and differential games are introduced. Furthermore, existing solution concepts for the forward problem are introduced and the available means for their calculation are shown. These principles provide a basis for the development of the inverse dynamic game methods proposed in subsequent chapters. The contents of this chapter are based on the books [BO99, Eng05, HKZ12, Tad13].

Game theory can be defined as the theory of mathematical models of decision making to describe situations with conflicts and cooperation between rational players. The conflicts arise from different interests or goals, leading to a strong interdependency of the players' individual decisions. The theory emerged from the work of von Neumann [VNM47] and blossomed with the introduction of game equilibria by Nash [Nas51]. Since then, it has been extensively studied such that analytical tools are available for understanding phenomena arising from the interaction between decision makers.

# 3.1 Introduction to Games

One of the most frequent ways of defining a game is as a normal-form game, described in the following definition.

#### Definition 3.1 (Game in Normal Form)

A normal-form game is defined by


A game involves N decision makers called players which select particular actions from a possible strategy set. These are chosen such that a specific goal, represented by their individual cost function, is accomplished. Definition 3.1 is very general and allows numerous kinds of games which arise from different properties of the possible actions, strategy sets and cost functions of the players.

If the players act in a self-interested way, i.e. they strive to minimize their own cost function regardless of possible negative effects for other players, then the game is called non-cooperative. If the players are able to make binding agreements and act jointly in order to obtain a fair result, then the game is regarded as cooperative. If the choice of actions is deterministic, the strategies are called pure strategies. The converse case is denoted as stochastic or mixed strategies. Moreover, games may be finite or infinite, depending on the strategy set $U_i$ of each player. If the set of possible strategies $U_i$ has a finite number of elements for all players, the game is said to be finite. Otherwise, if $U_i$ is infinite for at least one player, i.e. an infinite number of possible strategies is available to at least one player, the game is infinite.

An important classification of games is based on the number of times a player can choose an action. If the players act only once and independently of each other, the game is static. As soon as one player is allowed to act in several time stages based on new information resulting from other players' previous actions, then the game is dynamic. Therefore, in dynamic games, time plays an important role. The evolution of an infinite dynamic game is naturally described with a difference equation in a discrete-time formulation based on the stages or discrete time steps in which players take action. However, a continuous-time formulation is possible as well, which is also known in literature as a differential game.

The results of this thesis are based on non-cooperative infinite dynamic games in both discrete and continuous time. Since many results are analogous and comparable, the main aspects of infinite dynamic games will be shown and formalized in this chapter with a formulation in continuous time. Analogous definitions for the discrete-time case can be found in Appendix A.

# 3.2 Differential Games

The evolution of a differential game depends on the strategies of all players. It can be described by means of the time-dependent state trajectories of a dynamic system defined by differential equations.

#### Definition 3.2 (Dynamic System in State Space Representation)

A dynamic system is defined by ordinary differential equations and an initial condition given by

$$\dot{\mathbf{x}}(t) = f\left(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t\right) \tag{3.1a}$$

$$\mathbf{x}(0) = \mathbf{x}\_0,\tag{3.1b}$$

where $x(t) \in \mathbb{R}^n$ and $u_i(t) \in \mathbb{R}^{m_i}$, $i \in \mathcal{P}$, denote the system state vector and the control vector of player $i$ at time $t$, respectively. Furthermore, $f : \mathbb{R}^n \times \mathbb{R}^{m_1} \times \dots \times \mathbb{R}^{m_N} \times \mathbb{R}^+_0 \mapsto \mathbb{R}^n$ is a vector function which is continuous in $t \in [0,T]$ and globally Lipschitz in $x, u_1, \dots, u_N$.
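Given fixed control trajectories $u_i(t)$ of all players, the initial value problem (3.1) can be integrated numerically. The following is a minimal sketch with $N = 2$; the specific linear dynamics and the control signals are assumptions chosen only for illustration.

```python
# Sketch: simulating the coupled dynamics (3.1) for two players with given
# control trajectories u_1(t), u_2(t), using SciPy's ODE integrator.
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, u1, u2):
    # Example dynamics (globally Lipschitz in x): a stable linear system
    # driven additively by both players' continuous-in-t controls.
    A = np.array([[0.0, 1.0], [-1.0, -0.5]])
    return A @ x + np.array([0.0, u1(t) + u2(t)])

u1 = lambda t: np.sin(t)      # player 1's control trajectory
u2 = lambda t: -0.5 * t       # player 2's control trajectory
x0 = np.array([1.0, 0.0])     # initial condition (3.1b)
T = 5.0                       # duration of the game

sol = solve_ivp(f, (0.0, T), x0, args=(u1, u2), dense_output=True)
print(sol.y[:, -1])           # state x(T)
```

Lipschitz continuity of `f` in `x` is exactly what guarantees that this integration has a unique solution for every pair of continuous controls.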

The evolution of the differential game is regarded for a time interval $[0,T]$ which represents the duration of the game. The vector $x_0$ represents the initial state of the system. The final time $T$ could be $T \to \infty$ or a fixed value depending on the given problem. Lipschitz continuity of $f$ is required to ensure that the initial value problem (3.1) admits a unique solution for every $N$-tuple $(u_1(t), \dots, u_N(t))$ of continuous controls $u_i(t)$, $i \in \mathcal{P}$. Each player $i \in \mathcal{P}$ acts upon the system in Definition 3.2 by applying a corresponding input or control trajectory $u_i(t)$, $\forall t \in [0,T]$, which belongs to an action space $U_i$. Each player's control decision or strategy, denoted by $\gamma_i$, is based on the state information available to them, which is represented by a set-valued function $\eta_i(t)$.<sup>11</sup> The strategy is chosen from a set of available strategies $\Gamma_i$ and defines a particular control trajectory $u_i(t)$<sup>12</sup>, i.e.

$$\mathbf{u}\_i(t) = \gamma\_i(\eta\_i(t), t), \quad \gamma\_i \in \Gamma\_i. \tag{3.2}$$

The strategy, and consequently the control trajectories, are determined according to an individual cost function

$$J\_i = h\_i\left(\mathbf{x}(T), T\right) + \int\_0^T g\_i\left(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t\right) \,\mathrm{d}t,\tag{3.3}$$

where $h_i$ denotes costs which arise from the final state or final time and $g_i$ represents running costs which arise for $t \in [0,T]$. The aim of each player $i$ is to minimize the cost function (3.3) by applying appropriate controls $u_i(t)$. This objective is described by the dynamic optimization problem

<sup>11</sup> Different possibilities of player state information and corresponding strategies will be examined later in Sections 3.3 and 3.4.

<sup>12</sup> In the context of dynamic games, actions and strategies are distinct and related as in (3.2). In static games, on the contrary, they are identical and the terms are therefore not distinguished.

$$\begin{aligned} \min\_{\mathbf{u}\_i(t)} \quad & J\_i \left( \mathbf{x}(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}(t), t \right) \\ \text{w.r.t.} \quad & \dot{\mathbf{x}}(t) = f \left( \mathbf{x}(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}(t), t \right) \\ & \mathbf{x}(0) = \mathbf{x}\_0 \end{aligned} \tag{3.4}$$

where $\neg i$ is used as a shorthand notation for "all except $i$". Therefore, $u_{\neg i}(t)$ denotes the input trajectories of all players except player $i$.<sup>13</sup> As a result, differential games can be described as $N$ coupled dynamic optimization problems.
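Given sampled trajectories, a cost function of the form (3.3) can be evaluated numerically. The sketch below assumes quadratic terminal and running costs (an illustrative choice, not fixed by the text) and approximates the integral with the trapezoidal rule; all names are hypothetical.

```python
# Sketch: evaluating the cost (3.3) along sampled trajectories for
# assumed quadratic terms h_i = x^T Qf x and g_i = x^T Q x + u_i^T R u_i.
import numpy as np

def cost_J_i(t, x, u_i, Qf, Q, R):
    """t: (K,) time samples; x: (K, n) states; u_i: (K, m_i) controls."""
    # Running cost g_i evaluated at each sample (batched quadratic forms).
    g = np.einsum('ti,ij,tj->t', x, Q, x) + np.einsum('ti,ij,tj->t', u_i, R, u_i)
    # Trapezoidal approximation of the integral over [0, T].
    integral = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(t))
    return x[-1] @ Qf @ x[-1] + integral

t = np.linspace(0.0, 1.0, 101)
x = np.stack([np.cos(t), np.sin(t)], axis=1)   # sampled state trajectory
u = 0.1 * np.ones((101, 1))                    # player i's sampled control
J = cost_J_i(t, x, u, Qf=np.eye(2), Q=np.eye(2), R=np.eye(1))
print(round(J, 3))
```

For this particular trajectory, $x^\top Q x \equiv 1$ and $u_i^\top R u_i \equiv 0.01$, so the evaluation can be checked against the closed-form value $1 + 1 + 0.01 = 2.01$.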

To summarize, a definition of differential games which will be used throughout this thesis is given.

#### Definition 3.3 (Differential Game)

A differential game is defined by


# 3.3 Information Structures

A relevant characteristic of a differential game is the information available to all players at each time $t$. The information set is described by

$$\eta\_i(t) \in \mathcal{P}\_{\neg \emptyset} \left( \{ \mathbf{x}\_0, \mathbf{x}(s), \mathbf{x}(t) \} \right), \quad s \in [0, \chi\_{i,t}], \ \chi\_{i,t} \in [0, t], \tag{3.5}$$

<sup>13</sup> The importance of the uniqueness of the solution of (3.1) for every $N$-tuple $(u_1, \dots, u_N)$ becomes clear at this point. Non-uniqueness is clearly not allowed in a differential game since it would potentially lead to non-uniqueness in the value of the cost functions for a single $N$-tuple of control trajectories.

where $\mathcal{P}_{\neg\emptyset}(\cdot)$ denotes a power set which excludes the empty set and $\chi_{i,t}$ is non-decreasing in $t$. At a particular time $t \in [0,T]$, player $i$ has knowledge of current or past values of the state $x$. By means of (3.5), it is possible to describe a variety of information structures which are very common in dynamic game theory. Sometimes partial state information is assumed instead of the complete state information implied by (3.5) and considered in this thesis. The next definition lists the concrete information structures which shall be focused on in the following.

#### Definition 3.4 (Information Structure of the Players)

The information structure of player i is said to be


The open-loop information pattern describes the situation where all players decide at $t = 0$ the control trajectories $u_i(t)$ to be applied for $t \in [0,T]$ based solely on the initial system state value $x_0$. The control decision remains unchanged for the whole duration of the game, regardless of any possible disturbance on the states. Figure 3.1 shows a graphical representation of a differential game with an open-loop information structure for each player.

Figure 3.1: Differential game with an open-loop information structure.

In the case of a memoryless perfect state pattern, the players have information of the initial state $x_0$ and the current state $x(t)$. The inclusion of the initial state becomes necessary for solving differential games where some of the players have an OL information pattern and others have access to the states $x(t)$. In this thesis, the converse case—equal information patterns for all players—is considered such that a feedback information pattern can be used equivalently.<sup>14</sup> These last two information structures imply "closing the loop" in a control-theoretical sense. The resulting multiplayer control loop for a feedback information structure is depicted exemplarily in Figure 3.2.

Figure 3.2: Differential game with a feedback information structure.

The different information patterns lead to various kinds of strategies selected by the players, each of which leads to a particular solution of the differential game, i.e. resulting state and control trajectories.

# 3.4 Strategies

As mentioned previously, the strategy defines the controls of the players based on the information available to them. Therefore, for each information structure defined above, we obtain a different class of strategies. The next definitions specify the strategy classes corresponding to the open-loop and the feedback information patterns.

<sup>14</sup> The later defined Nash equilibrium solution (cf. Section 3.5.1) is identical under both MPS and FB information patterns since the equilibrium dependence on $x_0$ is given only for the initial time $t = 0$. Therefore, these information patterns can be considered as equivalent in this sense [BO99, p. 278]. For this reason, in the following only the OL and the FB information patterns shall be considered.

#### Definition 3.5 (Open-Loop Strategy)

An open-loop strategy $\gamma_i$ for player $i \in \mathcal{P}$ selects a control action according to

$$\mathbf{u}\_{i}(t) = \gamma\_{i}(\mathbf{x}\_{0}, t), \quad \forall \mathbf{x}\_{0} \in \mathbb{R}^{n}, \forall t \in [0, T], \tag{3.6}$$

where $\gamma_i$ is a continuous function in $t$ and defined for each possible initial state $x_0$. The set of all such possible strategies is denoted by $\Gamma_i^{OL}$.

#### Definition 3.6 (Feedback Strategy)


A feedback strategy $\gamma_i$ for player $i \in \mathcal{P}$ selects a control action according to

$$\mathbf{u}\_i(t) = \gamma\_i(\mathbf{x}(t), t), \quad \forall t \in [0, T], \tag{3.7}$$

where $\gamma_i$ is continuous in $t$ and globally Lipschitz in $x$. The set of all such possible strategies is denoted by $\Gamma_i^{FB}$.

An open-loop strategy describes the situation where all players decide at $t = 0$ the control trajectories $u_i(t)$ to be applied for $t \in [0,T]$ based solely on the initial state value $x_0$ of the dynamic system. The control decision remains unchanged for the whole duration of the game, regardless of any possible disturbance on the states. The feedback strategy implies that the players define their actions based on the current state $x(t)$. Therefore, each player commits to a particular reaction to the information concerning the state of the system.
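The practical difference between the two strategy classes can be made concrete with a toy scalar system (an assumed example, not from the text): for $\dot{x} = u$ with nominal behavior $x(t) = x_0 e^{-t}$, the open-loop strategy replays the precomputed control $u(t) = -x_0 e^{-t}$, while the feedback strategy applies $u = -x$. Absent disturbances the two produce the same trajectory; after an unforeseen state jump, only the feedback strategy reacts.

```python
# Illustrative sketch: open-loop vs. feedback strategies for x' = u,
# simulated with explicit Euler steps; a disturbance hits at mid-game.
import numpy as np

def simulate(strategy, x0=1.0, T=2.0, n=2000, disturb=0.5):
    dt = T / n
    x, t = x0, 0.0
    for k in range(n):
        if k == n // 2:
            x += disturb          # unforeseen state disturbance at t = T/2
        u = strategy(x, t)        # FB uses x(t); OL ignores it (cf. (3.6), (3.7))
        x += dt * u
        t += dt
    return x

ol = lambda x, t: -np.exp(-t)     # precomputed at t = 0 from x0 = 1
fb = lambda x, t: -x              # reacts to the current state
print(simulate(ol), simulate(fb))
```

The final state magnitude is smaller under the feedback strategy, which mirrors the robustness remark on the feedback Nash equilibrium in Section 3.5.1.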

These strategy types are the basis for the solution of differential games. In the following, different solution concepts are presented.

# 3.5 Solution Concepts in Differential Games

A differential game may have different outcomes depending on its properties. The main difference arises from the cooperative or non-cooperative nature of the interacting players. In a non-cooperative game, all players act strictly rationally in order to minimize their own cost function, regardless of the detriment this may cause to other players. In this kind of game, the most common solution concepts are described as game-theoretical equilibria: the so-called Nash equilibrium [Nas51] and the Stackelberg equilibrium [Sta52]. In turn, in cooperative differential games, players are able to cooperate and make agreements such that they can (potentially better) achieve their objectives. In this kind of game, Pareto efficient solutions [Par14] are mostly sought.

#### 3.5.1 Non-Cooperative Games

#### Nash Equilibrium

The Nash equilibrium is a solution concept in game theory which arises if (i) all players act simultaneously and optimally with respect to their own cost function and their beliefs about the other players' strategies and (ii) these beliefs are correct.<sup>15</sup> An alternative, equivalent definition is the following: for each player, given that all other players apply their optimal input strategies, there is no feasible input strategy which would yield lower costs than the current optimal one [Nas51]. In other words, it is not possible for any player to obtain a lower value of his cost function by solely altering his individual strategy. A formal definition is given in the following:

#### Definition 3.7 (Nash Equilibrium)

A Nash equilibrium is described by the $N$-tuple of strategies $\gamma^* := (\gamma_1^*, \dots, \gamma_N^*)$, with $\gamma_i^* \in \Gamma_i^{\diamond}$, $i \in \mathcal{P}$, $\diamond \in \{\text{OL}, \text{FB}\}$, which satisfies

$$J\_i\left(\gamma\_i^\*, \gamma\_{\neg i}^\*\right) \le J\_i\left(\gamma\_i, \gamma\_{\neg i}^\*\right), \quad \forall i \in \mathcal{P},$$


i.e. $\gamma_i^* = u_i^*(t)$, $t \in [0,T]$, is the optimal input strategy for each player $i$ considering optimal input strategies $\gamma_{\neg i}^*$ of all other players. The resulting tuple of control trajectories $u^* := (u_1^*(t), \dots, u_N^*(t))$ is called Nash equilibrium solution.
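Definition 3.7 can be verified numerically on a toy problem. The sketch below uses a static two-player quadratic game as a stand-in (both cost functions are assumptions for illustration): the candidate equilibrium is obtained from the joint stationarity conditions and then confirmed by checking that each player's best response to the other's equilibrium action is his own equilibrium action.

```python
# Numerical check of the Nash inequality on an assumed static quadratic game.
import numpy as np
from scipy.optimize import minimize_scalar

J1 = lambda u1, u2: (u1 - 1.0) ** 2 + 0.2 * u1 * u2
J2 = lambda u1, u2: (u2 + 0.5) ** 2 + 0.2 * u1 * u2

# Joint stationarity of both players: a linear system in (u1, u2).
A = np.array([[2.0, 0.2], [0.2, 2.0]])
u_star = np.linalg.solve(A, np.array([2.0, -1.0]))

# Best responses to the other player's equilibrium action:
br1 = minimize_scalar(lambda u1: J1(u1, u_star[1])).x
br2 = minimize_scalar(lambda u2: J2(u_star[0], u2)).x
print(np.allclose([br1, br2], u_star, atol=1e-6))
```

The check mirrors the inequality of Definition 3.7: no unilateral deviation lowers a player's own cost at the equilibrium.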

Definition 3.7 describes either an open-loop Nash equilibrium (OLNE) or a feedback Nash equilibrium (FNE), depending on the kind of strategy which is applied by each player, i.e. whether the strategy set $\Gamma_i$ is given by $\Gamma_i^{OL}$ or $\Gamma_i^{FB}$, respectively. The corresponding state trajectories $x^*(t)$ are determined by solving the initial value problem (3.1) using the control trajectories $(u_1^*(t), \dots, u_N^*(t))$. The OLNE has the property of being a weakly time-consistent solution. This means that the players do not have any incentive to deviate from their strategy during the game, i.e. at any time $t_1 \in [0,T]$. On the other hand, the FNE is strongly time-consistent<sup>16</sup>, which means that the strategy $\gamma_i^*$ is still an equilibrium strategy if it is applied from any time $t_1 \in [0,T]$ and starting from any arbitrarily chosen state $x(t_1)$ off the original equilibrium path (which is reachable from $x(0)$). This makes the feedback Nash equilibrium more robust towards any possible disturbances on the system state.

In a differential game, a Nash equilibrium may not exist; if one exists, it is not necessarily unique. Furthermore, a Nash equilibrium cannot be uniquely associated with a set of cost functions $J$. This fact is of particular importance for the inverse differential game problem and will be discussed in Section 4.1 of the next chapter.


<sup>15</sup> An example of this is a situation where all cost functions are made public to all players [OR94, p. 14].

<sup>16</sup> Also called subgame perfect, see e.g. [Eng05, Definition 8.2].

#### Stackelberg Solutions

Previously, it was assumed that the players select their strategies simultaneously. A scenario where the players select their strategies one after the other can lead to a different outcome of the game. Such a setting was first introduced by von Stackelberg in the context of a duopoly output game [Sta52]. In a general N-player situation, one of the players is selected as a leader such that he announces his selected control strategy. Afterwards, the next player uses this information to make a decision on his own strategy such that his cost function is minimized. This process continues until player $N$ chooses his strategy based on the announcements of the other $N-1$ players' strategies. Stackelberg solutions are mostly considered in economic applications, e.g. market models, and are typically defined in a two-player setting (cf. [CC72]).

#### Definition 3.8 (Stackelberg Strategy)

The strategy tuple $\gamma^s = (\gamma_1^s, \gamma_2^s)$ is called a Stackelberg strategy with player 1 as leader and player 2 as follower if for all $\gamma_1 \in \Gamma_1$

$$J\_1(\gamma\_1^s, \gamma\_2^s) \le J\_1(\gamma\_1, \gamma\_2^o(\gamma\_1)) \tag{3.8}$$

where $\gamma_2^o(\gamma_1) \in \Gamma_2$ denotes the optimal response of player 2 to a fixed strategy of player 1, i.e.

$$J\_2(\gamma\_1, \gamma\_2^o(\gamma\_1)) = \min\_{\gamma\_2} J\_2(\gamma\_1, \gamma\_2), \tag{3.9}$$

and $\gamma_2^s = \gamma_2^o(\gamma_1^s)$.

The Stackelberg strategy is an attractive strategy when the information pattern is biased or asymmetric. This means that player 1 does not know the cost function of player 2, but player 2 has knowledge of both cost functions. This is the case in a market model where there is a dominant company. The leader has an advantage in the sense that he can obtain better results, since he is aware that the rest of the players will act optimally based on whatever strategy he may apply.
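For a static quadratic toy game (an assumed example, not from the text), Definition 3.8 can be evaluated by backward induction: the follower's optimal response (3.9) is computed for a fixed leader strategy and substituted into the leader's cost (3.8), which the leader then minimizes.

```python
# Sketch: Stackelberg solution of an assumed two-player quadratic game
# via backward induction (follower best response, then leader optimization).
import numpy as np
from scipy.optimize import minimize_scalar

J1 = lambda g1, g2: (g1 - 2.0) ** 2 + g1 * g2        # leader's cost
J2 = lambda g1, g2: (g2 - 1.0) ** 2 + 0.5 * g1 * g2  # follower's cost

def follower_response(g1):
    # gamma_2^o(gamma_1) from (3.9): argmin over gamma_2 for fixed gamma_1
    return minimize_scalar(lambda g2: J2(g1, g2)).x

# Leader anticipates the follower's optimal response, cf. (3.8):
g1_s = minimize_scalar(lambda g1: J1(g1, follower_response(g1))).x
g2_s = follower_response(g1_s)
print(round(g1_s, 3), round(g2_s, 3))
```

Here the follower's response is $\gamma_2 = 1 - 0.25\gamma_1$, so the leader's reduced cost is $0.75\gamma_1^2 - 3\gamma_1 + 4$, minimized at $\gamma_1^s = 2$ with $\gamma_2^s = 0.5$, which the numerical solution reproduces.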

First derivations of Stackelberg solutions for dynamic games were given e.g. in [CC72, Med78]. For the (continuous-time) differential game case with $N$ players, [Rub06, Proposition 2.3] states that the Stackelberg solution coincides with the feedback Nash equilibrium solution—provided it exists—if and only if (i) the running costs $g_i$ depend solely on the state $x$ and each player's own controls $u_i$, i.e.

$$g\_i\left(\mathbf{x}(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}(t)\right) = g\_i(\mathbf{x}(t), \mathbf{u}\_i(t)), \tag{3.10}$$

and (ii) the dynamics of the state depend at most linearly on each player's controls, i.e. the system dynamics have the control-affine form

$$\dot{\mathbf{x}}(t) = \boldsymbol{f}\_{\mathbf{x}}(\mathbf{x}(t), t) + \sum\_{i=1}^{N} \mathbf{G}\_{i}(\mathbf{x}, t)\boldsymbol{u}\_{i}(t). \tag{3.11}$$

#### 3.5.2 Cooperative Games

Contrary to the non-cooperative case, a cooperative game includes players which not only seek the optimization of their own objectives but also consider the objectives of the other players in the selection of the control actions. Hence, it is assumed that they cooperate in order to achieve their objectives.<sup>17</sup> However, no side-payments take place, which means that cooperative behavior is not explicitly rewarded by introducing a cost-lowering term in the objective function. Consequently, depending on how the players decide to distribute their efforts, several possible minima exist for each particular player $i \in \mathcal{P}$.

In the field of cooperative games, the concept of dominating strategies plays an important role. A strategy tuple $\gamma^{(a)}$ dominates another strategy tuple $\gamma^{(b)}$ if the application of $\gamma^{(a)}$ leads to lower costs for all players compared to $\gamma^{(b)}$. Therefore, dominating strategies lead to a better result for all players. This line of thought motivates considering only solutions which cannot be improved upon for all players simultaneously and leads to the concept of Pareto efficient solutions.

#### Pareto Efficient Solutions

A Pareto efficient solution is a combination of strategies such that it is not possible to obtain a better result in terms of the own cost function of each player without affecting the result of other players negatively. This means that, while it may be possible for individual players to improve their own result by changing their own action unilaterally, this would lead to a worse result for at least one of the other players. A Pareto efficient solution is defined as follows [Eng05, Definition 6.1]:

<sup>17</sup> Nevertheless, coalitional games, where several groups of players may build coalitions to act non-cooperatively with respect to other ones, are excluded in this thesis. See the definitions given in [ER11].

#### Definition 3.9 (Pareto Efficient Solution of a Differential Game)

An $N$-tuple of strategies $\gamma^p = (\gamma_1^p, \dots, \gamma_N^p)$ is a Pareto efficient solution (PES) of a differential game if no other feasible tuple $\gamma = (\gamma_1, \dots, \gamma_N)$ exists for which

$$J\_j(\gamma) < J\_j(\gamma^{p}) \tag{3.12}$$

for at least one $j \in \mathcal{P}$ and

$$J\_i(\gamma) \le J\_i(\gamma^{p}), \quad \forall i \in \mathcal{P}, \ i \ne j. \tag{3.13}$$

Definition 3.9 states that a PES is a combination of strategies such that no player can obtain a lower value of his cost function by deviating from the strategy without affecting at least one other player negatively. Therefore, Pareto optima do not represent a stable solution of a non-cooperative game, since in such a game each player strives for the minimization of his own cost function. A non-cooperative player will deviate from the Pareto strategy if this implies a lower value of his cost function, regardless of the resulting drawback for other players.
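A standard way to compute Pareto efficient solutions of a game with convex costs is scalarization: minimizing a weighted sum of the players' cost functions for weights in $(0,1)$ yields Pareto efficient strategy tuples. The sketch below applies this to an assumed static two-player quadratic game (all functions are illustrative).

```python
# Sketch: tracing Pareto efficient solutions of an assumed two-player
# quadratic game via scalarization, min_u alpha*J1(u) + (1-alpha)*J2(u).
import numpy as np
from scipy.optimize import minimize

J1 = lambda u: (u[0] - 1.0) ** 2 + u[1] ** 2
J2 = lambda u: u[0] ** 2 + (u[1] - 1.0) ** 2

def pareto_point(alpha):
    w = lambda u: alpha * J1(u) + (1.0 - alpha) * J2(u)
    return minimize(w, x0=np.zeros(2)).x

front = [pareto_point(a) for a in np.linspace(0.1, 0.9, 5)]
costs = [(J1(u), J2(u)) for u in front]
print(np.round(costs, 3))   # trading off J1 against J2 along the front
```

Along the computed front, decreasing one player's cost strictly increases the other's, which is exactly the trade-off that (3.12) and (3.13) forbid improving upon.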

# 3.6 Calculation of Differential Game Solutions

This thesis focuses on the Nash equilibrium and on Pareto efficient solutions of differential games. Therefore, in the following, the relevant means for calculating these solutions are presented.

#### 3.6.1 Open-Loop Nash Equilibrium

The basis of the calculation of Nash equilibria is Definition 3.7. The inequality implies that the optimal strategy $\gamma_i^* \in \Gamma_i^{\mathrm{OL}}$ leads to a control trajectory $\mathbf{u}_i^*(t)$ which minimizes the cost function $J_i(\mathbf{u}_i(t), \mathbf{u}_{\neg i}^*(t))$ subject to the system dynamics

$$\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}^*(t), t), \tag{3.14}$$

i.e. the system dynamics with the optimal controls of the other players $j \in \mathcal{P}$, $j \ne i$. Therefore, we obtain an optimal control problem for player $i$ since $\mathbf{u}_{\neg i}^*(t)$ does not depend on $\mathbf{u}_i(t)$. Hence, the tools of classical optimal control can be applied. In particular, Pontryagin's minimum principle (see e.g. [Nai03, Chapter 6]) can be used to determine a set of differential equations which represent necessary conditions for Nash equilibria. As in optimal control, the analysis of differential games is based on the Hamiltonian function

$$H_i(\boldsymbol{\psi}_i(t), \mathbf{x}(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}(t), t) = g_i\left(\mathbf{x}(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}(t), t\right) + \boldsymbol{\psi}_i^\top(t)\, f\left(\mathbf{x}(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}(t), t\right) \tag{3.15}$$

for all $t \in [0,T]$ and all players $i \in \mathcal{P}$, where $\boldsymbol{\psi}_i : [0,T] \mapsto \mathbb{R}^n$ are so-called costate functions or Lagrangian multiplier functions. Given the case of an open-loop information structure and corresponding strategies as defined in Definition 3.5, the equilibrium is said to be an open-loop Nash equilibrium. The following theorem gives necessary conditions for such equilibria.

#### Theorem 3.1 (Necessary Conditions for Open-Loop Nash Equilibria)

For an $N$-player differential game of fixed duration $[0,T]$, let $f(\mathbf{x}, \mathbf{u}_1, \ldots, \mathbf{u}_N, t)$, $g_i(\mathbf{x}, \mathbf{u}_1, \ldots, \mathbf{u}_N, t)$ and $h_i(\mathbf{x}(T), T)$ be continuously differentiable with respect to $\mathbf{x}$ for all $t \in [0,T]$, $i \in \mathcal{P}$. Then, if $\boldsymbol{\gamma}^{\mathrm{OL}} = \left(\gamma_1^*(\mathbf{x}_0, t), \ldots, \gamma_N^*(\mathbf{x}_0, t)\right)$, where $\gamma_i^* \in \Gamma_i^{\mathrm{OL}}$ and $\gamma_i^*(\mathbf{x}_0, t) = \mathbf{u}_i^*(t)$, $i \in \mathcal{P}$, provides an open-loop Nash equilibrium (OLNE) solution with $\mathbf{x}^*(t)$ as the corresponding state trajectory, the trajectories of the $N$ costate functions $\boldsymbol{\psi}_i(t)$, $i \in \mathcal{P}$, satisfy the relations:

$$\dot{\mathbf{x}}^*(t) = f(\mathbf{x}^*(t), \mathbf{u}_1^*(t), \ldots, \mathbf{u}_N^*(t), t), \quad \mathbf{x}^*(0) = \mathbf{x}_0 \tag{3.16a}$$

$$\mathbf{u}_i^*(t) = \mathop{\arg\min}_{\mathbf{u}_i(t)} H_i\left(\boldsymbol{\psi}_i(t), \mathbf{x}^*(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}^*(t), t\right) \tag{3.16b}$$

$$\dot{\boldsymbol{\psi}}_i(t) = -\nabla_{\mathbf{x}} H_i\left(\boldsymbol{\psi}_i(t), \mathbf{x}^*(t), \mathbf{u}_i^*(t), \mathbf{u}_{\neg i}^*(t), t\right) \tag{3.16c}$$

$$\boldsymbol{\psi}_i(T) = \nabla_{\mathbf{x}} h_i(\mathbf{x}^*(T), T), \tag{3.16d}$$

where $\nabla_{\mathbf{x}}$ denotes the partial derivative with respect to the state variable $\mathbf{x}$.

#### Proof:

See the proof of Theorem 6.11 of [BO99].

The set of differential equations (3.16) has to be fulfilled for all open-loop Nash equilibria and is valid for the general case where $\mathbf{u}_i$ is constrained. In case the optimal controls lie strictly inside the set defining the constraints, or if we have unconstrained controls $\mathbf{u}_i \in \mathbb{R}^{m_i}$ as considered in Definition 3.2, the control equation (3.16b) leads to

$$\mathbf{0} = \nabla_{\mathbf{u}_i} H_i(\boldsymbol{\psi}_i(t), \mathbf{x}^*(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}^*(t), t), \tag{3.17}$$

where $\nabla_{\mathbf{u}_i}$ denotes the partial derivative with respect to $\mathbf{u}_i$. Therefore, with the application of Theorem 3.1 we obtain a set of coupled differential equations. Under some further assumptions, including the cost functions being decoupled with respect to each player's controls, i.e. (3.10) holds, and the system dynamics having the form (3.11), it is possible to formulate a two-point boundary value problem (TPBVP), generally consisting of $(N+1)n$ ODEs and $(N+1)n$ boundary conditions, which can potentially be solved using numerical methods, e.g. shooting techniques [AMR95, Chapter 4]. Further details are given in Section B.3 of the Appendix. Note that the minimum principle of Pontryagin, and therefore Theorem 3.1, represents only necessary conditions for Nash equilibria. It generates candidates for OLNE solutions, but there is no guarantee that these are indeed a Nash equilibrium. However, under further assumptions, the minimum principle becomes a sufficient condition for optimality. Following [Doc00, Theorem 3.2], it can be stated that if $H_i\left(\boldsymbol{\psi}_i(t), \mathbf{x}(t), \mathbf{u}_i(t), \mathbf{u}_{\neg i}(t), t\right)$ is convex and continuously differentiable in $\mathbf{x}$, and furthermore $h_i$ is convex, then the controls $\mathbf{u}_i^*(t)$ are optimal with respect to each corresponding optimization problem and hence describe an OLNE.

In the following, an example is given to illustrate the procedure of calculating an OLNE by means of Theorem 3.1.

#### Example 3.1:

We consider a scenario consisting of two players controlling a system given by

$$
\dot{\mathbf{x}}(t) = -\mathbf{x}(t) + u\_1(t) + u\_2(t). \tag{3.18}
$$

Each player acts based on the cost function

$$J\_i = \int\_0^\infty \frac{1}{2} \mathbf{x}^2(t) + \frac{1}{2} u\_i^2(t) \,\mathrm{d}t, \quad i \in \{1, 2\}. \tag{3.19}$$

In the following, $i$ and $j$ are used to denote any player from the set $\mathcal{P} = \{1, 2\}$ such that $i \ne j$. Furthermore, time dependencies are omitted for brevity.

To determine the OLNE, we first determine the Hamiltonian of each player:

$$H\_{\mathbf{i}} = \frac{1}{2}\mathbf{x}^2 + \frac{1}{2}u\_{\mathbf{i}}^2 + \psi\_{\mathbf{i}}\left(-\mathbf{x} + u\_{\mathbf{i}} + u\_{\mathbf{j}}\right), \quad \mathbf{i}, \mathbf{j} \in \{1, 2\}, \ \mathbf{i} \neq \mathbf{j}. \tag{3.20}$$

We now can utilize the necessary conditions for open-loop Nash equilibria given by Theorem 3.1. The control equation (3.16b) leads to

$$\frac{\partial H\_i}{\partial u\_i} = u\_i + \psi\_i = 0 \Leftrightarrow u\_i = -\psi\_i. \tag{3.21}$$

From (3.16c) we obtain the differential equation

$$\dot{\psi}_i = -\frac{\partial H_i}{\partial \mathbf{x}} = -\mathbf{x} + \psi_i. \tag{3.22}$$

Furthermore, the system dynamics equation (3.16a) given by

$$
\dot{\mathbf{x}} = -\mathbf{x} + \mathbf{u}\_1 + \mathbf{u}\_2 \tag{3.23}
$$

must hold as well.

By combining (3.21), (3.22) and (3.23) we obtain the linear system of differential equations

$$\begin{bmatrix} \dot{\mathbf{x}} \\ \dot{\psi}_1 \\ \dot{\psi}_2 \end{bmatrix} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \psi_1 \\ \psi_2 \end{bmatrix}. \tag{3.24}$$

Given that optimal control and differential game problems usually specify initial conditions for the state vector and terminal conditions for the costates $\psi_i$, this system of differential equations represents a TPBVP. In this case, it can be solved both analytically and numerically. The general analytical solution can be determined e.g. by the eigenvalue and eigenvector method (see e.g. [HS14, Section 5.3]) and results in

$$\mathbf{x}^*(t) = C_1(\sqrt{3} + 1)\exp\left(-\sqrt{3}t\right) + C_2(1 - \sqrt{3})\exp\left(\sqrt{3}t\right), \tag{3.25}$$

$$
\psi\_1^\*(t) = C\_1 \exp\left(-\sqrt{3}t\right) + C\_2 \exp\left(\sqrt{3}t\right) - C\_3 \exp\left(t\right),
\tag{3.26}
$$

$$\psi\_2^\*(t) = C\_1 \exp\left(-\sqrt{3}t\right) + C\_2 \exp\left(\sqrt{3}t\right) + C\_3 \exp\left(t\right),\tag{3.27}$$

where the constants $C_l$, $l \in \{1, 2, 3\}$, are determined by using the aforementioned boundary conditions for states and costates. The OLNE solution results directly from the costate functions (3.26) and (3.27) via (3.21). Finally, we note that in this example the conditions of Theorem 3.1 are both necessary and sufficient.
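The eigenstructure underlying the modes $\exp(\pm\sqrt{3}t)$ and $\exp(t)$ in (3.25)–(3.27) can be verified numerically. The following brief sketch (assuming numpy is available) computes the eigenvalues of the coefficient matrix of the coupled state/costate system obtained from (3.21)–(3.23):

```python
import numpy as np

# Coefficient matrix of the coupled state/costate system in Example 3.1:
# xdot = -x - psi_1 - psi_2, psidot_i = -x + psi_i (using u_i = -psi_i).
M = np.array([[-1.0, -1.0, -1.0],
              [-1.0,  1.0,  0.0],
              [-1.0,  0.0,  1.0]])

# Its eigenvalues determine the exponents of the general solution.
eigvals = np.sort(np.linalg.eigvals(M).real)
print(eigvals)  # approximately [-1.732, 1.0, 1.732], i.e. -sqrt(3), 1, sqrt(3)
```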

#### 3.6.2 Feedback Nash Equilibrium


Consider a differential game where the players apply a feedback strategy as in Definition 3.6. By applying the minimum principle, similar equations to the ones of Theorem 3.1 result. Nevertheless, instead of (3.16c), the equation

$$\dot{\boldsymbol{\psi}}_i(t) = -\nabla_{\mathbf{x}} H_i(\boldsymbol{\psi}_i(t), \mathbf{x}^*(t), \mathbf{u}_i^*(t), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}^*, t), t) \tag{3.28}$$

holds. The time dependency of the state in the strategies $\boldsymbol{\gamma}_{\neg i}^*$ is dropped here and in the following for brevity. In this new costate equation, the controls $\mathbf{u}_{\neg i}^*(t) = \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}^*, t)$ have an influence on the partial derivative in (3.16c) since, contrary to the open-loop case, they now depend on the current value of $\mathbf{x}(t)$. Even though these new equations define a closed-loop no-memory Nash equilibrium, they are not computationally convenient [SH69a]. Furthermore, there is in general an uncountable number of solutions to the resulting differential equations, one of which is the open-loop solution determined in (3.16) [BO99, p. 277].

In order to eliminate this so-called "informational non-uniqueness", the concept of feedback Nash equilibria is introduced. This refinement states that if an $N$-tuple of strategies $\boldsymbol{\gamma}^* = \left(\gamma_1^*, \ldots, \gamma_N^*\right)$ constitutes a FNE solution of a differential game with duration $[0,T]$, then its restriction to the time interval $[t,T]$, for any $t \in [0,T]$, describes a FNE solution for the same differential game defined on this shorter time interval $[t,T]$. A consequence of this requirement is the strong time consistency of FNE solutions (cf. Section 3.5.1). Furthermore, any FNE also fulfills the equations of Theorem 3.1 with the costate equation (3.28).

The core of the results concerning feedback Nash equilibria is given by N coupled Hamilton-Jacobi-Bellman (HJB) equations for which the value function, known from optimal control, is extended to the N-player case.

#### Definition 3.10 (Value Function)

Consider a player $i \in \mathcal{P}$. Let the optimal strategies of the other players $\boldsymbol{\gamma}_{\neg i}^*$ associated to an $N$-player non-cooperative differential game be given. The value function $V_i : \mathbb{R}^n \times [0,T] \mapsto \mathbb{R}$ of player $i$ is defined by

$$V_i(\mathbf{x}, t) = \min_{\{\gamma_i(\mathbf{x}, s),\ t \le s \le T\}} \int_t^T g_i\left(\bar{\mathbf{x}}_i(s), \gamma_i(\mathbf{x}, s), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}, s), s\right) \mathrm{d}s + h_i(\bar{\mathbf{x}}_i(T), T) \tag{3.29}$$

$$V_i(\mathbf{x}, t) = \int_t^T g_i\left(\mathbf{x}^*(s), \gamma_i^*(\mathbf{x}, s), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}, s), s\right) \mathrm{d}s + h_i(\mathbf{x}^*(T), T) \tag{3.30}$$

satisfying the boundary condition

$$V_i(\mathbf{x}, T) = h_i(\mathbf{x}, T), \tag{3.31}$$

and where

$$\dot{\bar{\mathbf{x}}}_i(s) = f(\bar{\mathbf{x}}_i(s), \gamma_i(\mathbf{x}, s), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}, s)), \quad \bar{\mathbf{x}}_i(t) = \mathbf{x}. \tag{3.32}$$

The value function $V_i$, $i \in \mathcal{P}$, represents the minimum cost-to-go from any initial state $\mathbf{x}$ and any initial time $t$ which is attainable by player $i$ when the optimal strategies of the other $N - 1$ players are fixed. With this definition, the following theorem can be stated.

#### Theorem 3.2 (Sufficient Conditions for Feedback Nash Equilibria)

For an $N$-player differential game of prescribed fixed duration $[0,T]$, an $N$-tuple of feedback strategies $\boldsymbol{\gamma}^{\mathrm{FB}} = \left(\gamma_1^*, \ldots, \gamma_N^*\right)$, where $\gamma_i^* \in \Gamma_i^{\mathrm{FB}}$ and $\gamma_i^*(\mathbf{x}, t) = \mathbf{u}_i^*(t)$, $i \in \mathcal{P}$, provides a feedback Nash equilibrium (FNE) solution if there exist continuously differentiable value functions $V_i$ according to Definition 3.10 which satisfy the partial differential equations

$$\begin{aligned} -\frac{\partial V_i(\mathbf{x},t)}{\partial t} &= \min_{\mathbf{u}_i} \left[ \nabla_{\mathbf{x}} V_i(\mathbf{x},t)\, \tilde{f}_i^*(\mathbf{x}(t), \mathbf{u}_i(t), t) + \tilde{g}_i^*(\mathbf{x}(t), \mathbf{u}_i(t), t) \right] \\ &= \nabla_{\mathbf{x}} V_i(\mathbf{x},t)\, \tilde{f}_i^*(\mathbf{x}(t), \gamma_i^*(\mathbf{x},t), t) + \tilde{g}_i^*(\mathbf{x}(t), \gamma_i^*(\mathbf{x},t), t), \\ V_i(\mathbf{x},T) &= h_i(\mathbf{x},T), \quad i \in \mathcal{P}, \end{aligned} \tag{3.33}$$

where

$$\begin{aligned} \tilde{f}_i^*(\mathbf{x}(t), \mathbf{u}_i(t), t) &= f(\mathbf{x}(t), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}, t), \mathbf{u}_i(t), t), \\ \tilde{g}_i^*(\mathbf{x}(t), \mathbf{u}_i(t), t) &= g_i(\mathbf{x}(t), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}, t), \mathbf{u}_i(t), t). \end{aligned} \tag{3.34}$$

The corresponding Nash equilibrium cost for player $i$ is $V_i(\mathbf{x}_0, 0)$.

#### Proof:

See the proof of Theorem 6.16 of [BO99].


The following example illustrates the use of Theorem 3.2 to determine a FNE solution of a differential game.

#### Example 3.2:

Consider the differential game with two players from Example 3.1, where they control a system with dynamics (3.18) and each of them chooses his actions such that his individual cost function (3.19) is minimized. However, contrary to the last example, each player applies a feedback strategy according to Definition 3.6. Again, function dependencies are omitted for brevity, unless a variable dependence demands special attention.

Given time-independent functions $g_i(\mathbf{x}, u_i, u_{\neg i})$ and system dynamics, as well as the infinite horizon ($T \to \infty$), the value function does not depend explicitly on time (cf. [HKZ12, Remark 7.5]), and therefore the HJB equation of each player results in

$$0 = \min\_{u\_i} \left( \frac{1}{2} \mathbf{x}^2 + \frac{1}{2} u\_i^2 + \frac{\partial V\_i}{\partial \mathbf{x}} \left[ -\mathbf{x} + u\_i + u\_j \right] \right), \quad i \in \{1, 2\}, \ i \neq j. \tag{3.35}$$

Minimizing the expression on the right-hand side leads to

$$u_i^* + \frac{\partial V_i}{\partial \mathbf{x}} = 0 \Leftrightarrow u_i^* = -\frac{\partial V_i}{\partial \mathbf{x}} =: \gamma_i^*(\mathbf{x}). \tag{3.36}$$

At this point, it is usually necessary to guess the structure of the value function $V_i$. Given the linear system dynamics and the quadratic cost function, we hypothesize a quadratic value function. Moreover, given the symmetric structure<sup>18</sup> of the game, we are interested in symmetrical equilibrium actions $u_i^* = u_j^*$ leading to identical value functions.

For any player $i \in \{1, 2\}$, we write the value function as

$$V\_i(\mathbf{x}) = \frac{A}{2}\mathbf{x}^2 + B\mathbf{x} + C \Leftrightarrow \frac{\partial V\_i}{\partial \mathbf{x}} = A\mathbf{x} + B \tag{3.37}$$

with A, B,C <sup>∈</sup> <sup>R</sup>. By using (3.36) and (3.37), the HJB equation (3.35) leads after some simplification to

$$0 = \left(-\frac{3}{2}A^2 - A + \frac{1}{2}\right)x^2 - \left(3AB + B\right)x - \frac{3}{2}B^2. \tag{3.38}$$

By comparing the coefficients on both sides of the equation we obtain $B = 0$ and two possible values $A_1 = -1$, $A_2 = 1/3$. Given the positive integrand in (3.19), the value function must be positive and therefore $A_1$ is discarded. With (3.36) and (3.37) we obtain the optimal feedback strategy

$$\gamma_i^*(\mathbf{x}) = -\frac{1}{3}\mathbf{x} \tag{3.39}$$

and the corresponding state trajectory

$$\mathbf{x}^*(t) = C \exp\left(-\frac{5}{3}t\right), \tag{3.40}$$

where $C \in \mathbb{R}$ is determined by using an initial state condition $\mathbf{x}(0) = x_0 \in \mathbb{R}$.
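The coefficient comparison in this example reduces to a scalar quadratic equation. As a brief numerical sketch (assuming numpy is available), the admissible root of (3.38) and the resulting closed-loop pole can be checked as follows:

```python
import numpy as np

# Matching the x^2 coefficients in (3.38): -(3/2)A^2 - A + 1/2 = 0,
# i.e. 3A^2 + 2A - 1 = 0, with roots A in {-1, 1/3}.
roots = np.roots([3.0, 2.0, -1.0])
A = roots.real.max()  # positive root: positive value function V_i = (A/2) x^2

# With u_i* = -A x for both players, the closed loop is xdot = -(1 + 2A) x.
pole = -(1.0 + 2.0 * A)
print(A, pole)  # 0.333... and -1.666..., i.e. A = 1/3 and the exponent -5/3 in (3.40)
```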

#### 3.6.3 Pareto Efficient Solutions

In general, a dynamic game has various Pareto efficient solutions. The set of all of these solutions is called the Pareto frontier. In the following, a theorem presenting necessary and sufficient conditions for Pareto efficient solutions is given.

<sup>18</sup> Here, the notion of symmetry of [Doc00, p. 106] is considered, meaning that all players (usually two) have the same cost function $J_i$ and control space $\mathcal{U}_i$. Furthermore, the system dynamics are symmetric with respect to the players in the sense that the equation is unaffected if e.g. $u_1$ is interchanged with $u_2$.

#### Theorem 3.3 (Necessary and Sufficient Conditions for Pareto Efficient Solutions)

Let $\tau_i > 0$, for all $i \in \mathcal{P}$, satisfy

$$\sum\_{i=1}^{N} \tau\_i = 1.\tag{3.41}$$

Now consider an $N$-player differential game. If $\boldsymbol{\gamma}^{\mathrm{p}} = \left(\gamma_1^{\mathrm{p}}, \ldots, \gamma_N^{\mathrm{p}}\right)$ is such that

$$\begin{aligned} \boldsymbol{\gamma}^{\mathrm{p}} = \underset{\boldsymbol{\gamma}}{\arg\min} \ & \sum_{i=1}^N \tau_i J_i(\boldsymbol{\gamma}) \\ \text{w.r.t.} \quad & \dot{\mathbf{x}} = f(\mathbf{x}(t), \mathbf{u}_1(t), \ldots, \mathbf{u}_N(t), t) \\ & \mathbf{x}(0) = \mathbf{x}_0, \end{aligned} \tag{3.42}$$

then $\boldsymbol{\gamma}^{\mathrm{p}}$ is a Pareto efficient solution (PES). Moreover, if the strategy spaces $\Gamma_i$ are convex and the $J_i$ are convex for all $i \in \mathcal{P}$, then for every Pareto efficient solution $\boldsymbol{\gamma}^{\mathrm{p}}$ there exist $\tau_i$ such that $\boldsymbol{\gamma}^{\mathrm{p}}$ solves the optimization problem (3.42).

#### Proof:

The theorem can be found in [Eng05, Theorem 6.4]. The sufficiency part is proved in [Eng05, Lemma 6.1], while the necessity part is proved in [Eng05, Lemma 6.3].

The formulation in Theorem 3.3 as a dynamic optimization problem allows the use of the minimum principle to solve for the PES. The solution can sometimes be given with the $\tau_i$ as degrees of freedom. Weighting parameters which fulfill (3.41) can also be chosen to find a particular PES, e.g. $\tau_i = 1/N$.

In the following, an example is given to illustrate the calculation of a PES.

#### Example 3.3:

Consider the differential game with two players from Example 3.1. In this example, we assume the players are able to build cooperative strategies such that their overall performance is increased.

We choose $\tau_1 = \tau$ and $\tau_2 = 1 - \tau$ and state the cost function

$$J_{\mathrm{p}} = \tau J_1 + (1 - \tau) J_2 \tag{3.43}$$

$$= \int_0^\infty \frac{1}{2}\mathbf{x}^2 + \frac{\tau}{2}u_1^2 + \frac{1-\tau}{2}u_2^2 \,\mathrm{d}t. \tag{3.44}$$

We now can utilize the minimum principle to determine the solution. The Hamiltonian which corresponds to <sup>J</sup>p is given by

$$H_{\mathrm{p}} = \frac{1}{2}\mathbf{x}^2 + \frac{\tau}{2}u_1^2 + \frac{1-\tau}{2}u_2^2 + \psi_{\mathrm{p}}\left(-\mathbf{x} + u_1 + u_2\right). \tag{3.45}$$

Since there is a coordination between both players, we consider the vector $\mathbf{u}_{\mathrm{p}} = \begin{bmatrix} u_1 & u_2 \end{bmatrix}^\top$ as the overall control vector. The control equation

$$\frac{\partial H\_p}{\partial \mathbf{u}\_p} = \begin{bmatrix} \tau u\_1 + \psi\_p \\ u\_2(1 - \tau) + \psi\_p \end{bmatrix} = \mathbf{0} \tag{3.46}$$

of the minimum principle leads to

$$u\_1 = -\frac{1}{\tau} \psi\_p \quad \text{and} \quad u\_2 = -\frac{1}{1-\tau} \psi\_p. \tag{3.47}$$

Furthermore, the canonical differential equation of the costates

$$\dot{\psi}_{\mathrm{p}} = -\frac{\partial H_{\mathrm{p}}}{\partial \mathbf{x}} = -\mathbf{x} + \psi_{\mathrm{p}} \tag{3.48}$$

and the system dynamics equation

$$
\dot{\mathbf{x}} = -\mathbf{x} + \mathbf{u}\_1 + \mathbf{u}\_2 \tag{3.49}
$$

must hold for the optimal solution.

Similar to Example 3.1, by inserting (3.47) into (3.49) and using (3.48), we obtain a system of differential equations

$$\begin{bmatrix} \dot{\mathbf{x}} \\ \dot{\psi}_{\mathrm{p}} \end{bmatrix} = \begin{bmatrix} -1 & -\dfrac{1}{\tau(1-\tau)} \\ -1 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \psi_{\mathrm{p}} \end{bmatrix} \tag{3.50}$$

which can be solved analytically using the eigenvalue and eigenvector method. The general solution is

$$\mathbf{x}^*(t) = C_1(\lambda + 1) \exp\left(-\lambda t\right) - C_2(\lambda - 1) \exp\left(\lambda t\right) \tag{3.51}$$

$$
\psi\_p^\*(t) = C\_1 \exp\left(-\lambda t\right) + C\_2 \exp\left(\lambda t\right),
\tag{3.52}
$$

with

$$
\lambda = \sqrt{1 + \frac{1}{\tau(1 - \tau)}} \tag{3.53}
$$

and where $C_l \in \mathbb{R}$, $l \in \{1, 2\}$, are determined using initial and terminal conditions.
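As a small numerical cross-check (assuming numpy is available), the exponent (3.53) can be compared against the eigenvalues of the coupled state/costate dynamics obtained from (3.47)–(3.49):

```python
import numpy as np

tau = 0.5
# Coupled state/costate system of Example 3.3:
# xdot = -x - (1/tau + 1/(1-tau)) psi_p = -x - psi_p / (tau (1 - tau)),
# psidot_p = -x + psi_p.
M = np.array([[-1.0, -1.0 / (tau * (1.0 - tau))],
              [-1.0,  1.0]])

lam = np.sqrt(1.0 + 1.0 / (tau * (1.0 - tau)))  # closed-form exponent (3.53)
eig = np.sort(np.linalg.eigvals(M).real)
print(eig, lam)  # eigenvalues +-lam, with lam = sqrt(5) for tau = 0.5
```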

#### 3.6.4 Comparison of Solution Concepts

In general, the OLNE and FNE are not equal since they are based on different assumptions concerning the available information to the players. Furthermore, while there are some cases where Nash equilibria and Pareto efficient solutions coincide, this is also generally not the case. In order to illustrate the difference between the solutions, the following example is presented.

#### Example 3.4:

Consider the same two-player differential game as in Examples 3.1, 3.2 and 3.3, in which the OLNE, the FNE and a PES were calculated, respectively. In this example, the exact trajectories which follow from $\tau = 0.5$ and the boundary conditions

$$\mathbf{x}(0) = 2, \quad \psi_1(T \to \infty) = 0, \quad \psi_2(T \to \infty) = 0 \quad \text{and} \quad \psi_{\mathrm{p}}(T \to \infty) = 0 \tag{3.54}$$

were determined analytically using MATLAB's dsolve. Figure 3.3 shows state and control trajectories of the differential game defined by (3.18) and (3.19). Only one control trajectory is shown for each solution concept since the symmetry of the game leads to equal controls for both players. While the OLNE and FNE are similar to each other, the PES differs considerably more.

Figure 3.3: Open-loop Nash equilibrium, feedback Nash equilibrium and Pareto efficient solution of an example two-player differential game: a) State trajectories, b) Control trajectories.

Finally, in order to show that a cooperative differential game with a PES leads to a better outcome than a non-cooperative setting, we calculate the value of the objective function for each solution concept:

$$J_{i,\mathrm{OLNE}}^* = 0.655 \tag{3.55}$$

$$J_{i,\mathrm{FNE}}^* = 0.667 \tag{3.56}$$

$$J_{i,\mathrm{PES}}^* = 0.618, \quad i \in \{1, 2\}. \tag{3.57}$$

Hence, $J_{i,\mathrm{FNE}}^* \ge J_{i,\mathrm{OLNE}}^* \ge J_{i,\mathrm{PES}}^*$ holds. The lower costs of the PES demonstrate the advantage of acting cooperatively in this example.
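The cost values (3.55)–(3.57) can be reproduced by direct numerical integration of the closed-form trajectories of Examples 3.1–3.3 with $\mathbf{x}(0) = 2$ and $\tau = 0.5$. The following sketch (assuming numpy and scipy are available) does this; the control amplitudes follow from $u_i = -\psi_i$ for the OLNE and $u_i = -\psi_{\mathrm{p}}/\tau_i$ for the PES, with the constants fixed by the boundary conditions (3.54):

```python
import numpy as np
from scipy.integrate import quad

x0 = 2.0
s3, s5 = np.sqrt(3.0), np.sqrt(5.0)

def cost(x, u):
    # J_i = int_0^inf 1/2 x(t)^2 + 1/2 u_i(t)^2 dt, cf. (3.19)
    value, _ = quad(lambda t: 0.5 * x(t) ** 2 + 0.5 * u(t) ** 2, 0.0, np.inf)
    return value

# OLNE (Example 3.1): x* = 2 exp(-sqrt(3) t), u_i* = -(sqrt(3)-1) exp(-sqrt(3) t)
J_olne = cost(lambda t: x0 * np.exp(-s3 * t),
              lambda t: -(s3 - 1.0) * np.exp(-s3 * t))
# FNE (Example 3.2): x* = 2 exp(-5t/3), u_i* = -x*/3
J_fne = cost(lambda t: x0 * np.exp(-5.0 * t / 3.0),
             lambda t: -x0 / 3.0 * np.exp(-5.0 * t / 3.0))
# PES (Example 3.3, tau = 0.5): x* = 2 exp(-sqrt(5) t), u_i* = -(sqrt(5)-1) exp(-sqrt(5) t)
J_pes = cost(lambda t: x0 * np.exp(-s5 * t),
             lambda t: -(s5 - 1.0) * np.exp(-s5 * t))

print(round(J_olne, 3), round(J_fne, 3), round(J_pes, 3))  # 0.655 0.667 0.618
```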

# 3.7 Tractable Differential Games

The solution of the coupled differential equations which arise from the necessary and sufficient conditions for Nash equilibria is in general not a trivial task, especially concerning the partial differential equations (HJB equations) which are needed to find an FNE. Indeed, finding Nash equilibria for general differential games is nontrivial and an object of current research. To find an FNE in nonlinear differential games, approximate or iterative solutions of the HJB equations are sought, and therefore reinforcement learning and adaptive dynamic programming techniques are attracting increasing interest [KKD14, ZZWZ16, KVML18].

There are particular kinds of differential games which are similar to the examples presented in the previous subsections in the sense that the calculation of Nash equilibria is considerably simplified. These are therefore called tractable differential games [HKZ12, Section 7.6]; linear-quadratic differential games are a prominent example. These kinds of differential games are treated e.g. in [DFJ85] and [Doc00, Chapter 7].

One of the structures considered in this thesis is the linear-quadratic differential game, an important and widespread class of differential games which has been used in several applications of automatic control, including driver assistance systems [FFH17], collision avoidance [MSA17], control of mobile robots [Gu08] and control of energy grids [ZMSFZ16]. Therefore, the following section presents the most important results known for this particular class of games.

# 3.8 Linear-Quadratic Differential Games

A linear-quadratic (LQ) differential game is a class of differential games in which the system the players simultaneously control has linear dynamics, i.e. the evolution of the states is governed by a system of linear differential equations. Furthermore, the players act based upon individual quadratic cost functions. This kind of game can therefore be seen as an extension of linear-quadratic optimal control to the $N$-player case. LQ differential games are considered a class of differential games which can be solved with reasonable effort: their particular structure allows the derivation of necessary and sufficient conditions for Nash equilibria which are computationally tractable.

#### Definition 3.11 (Linear-Quadratic Differential Game)

A linear-quadratic (LQ) differential game is defined by the same elements as Definition 3.3. The system dynamics are linear, i.e. are defined by

$$\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t) + \sum\_{i=1}^{N} \mathbf{B}\_{i} \boldsymbol{u}\_{i}(t),\tag{3.58}$$

where $\mathbf{x}(t) \in \mathbb{R}^n$, $\mathbf{u}_i(t) \in \mathbb{R}^{m_i}$, and $\mathbf{A}$ and $\mathbf{B}_i$, $i \in \mathcal{P}$, are the system and control matrices of appropriate dimensions, respectively, which form stabilizable matrix pairs $(\mathbf{A}, \mathbf{B}_i)$, $i \in \mathcal{P}$. Furthermore, the cost functions are quadratic, i.e.

$$J\_i = \frac{1}{2} \mathbf{x}^\top(T) \mathbf{Q}\_{i,T} \mathbf{x}(T) + \frac{1}{2} \int\_0^T \mathbf{x}^\top(t) \mathbf{Q}\_i \mathbf{x}(t) + \sum\_{j=1}^N \mathbf{u}\_j^\top(t) \mathbf{R}\_{ij} \mathbf{u}\_j(t) \,\mathrm{d}t,\tag{3.59}$$

where $\mathbf{Q}_{i,T}$, $\mathbf{Q}_i$, $\mathbf{R}_{ij}$ are symmetric matrices for all $i, j \in \mathcal{P}$ and $\mathbf{R}_{ii} \succ 0$.

The constraint of positive definiteness $\mathbf{R}_{ii} \succ 0$ is required in order to guarantee a meaningful minimization problem. Additional positive semidefiniteness constraints are sometimes introduced, e.g. $\mathbf{Q}_{i,T}, \mathbf{Q}_i \succeq 0$. These are often convenient for obtaining Nash equilibrium solutions but are not always strictly necessary, as will be discussed in the next subsection.<sup>19</sup> Furthermore, the stabilizable pairs $(\mathbf{A}, \mathbf{B}_i)$, $i \in \mathcal{P}$, imply that each player is able to stabilize the system on his own, a fact which is required for the following results on Nash equilibria in LQ differential games.

<sup>19</sup> A widespread case is given by a two-player differential game ($N = 2$) where the players act in a strictly adversarial way. This is represented by the cost function matrices $\mathbf{Q}_2 = -\mathbf{Q}_1$, $\mathbf{Q}_{2,T} = -\mathbf{Q}_{1,T}$, $\mathbf{R}_{12} = -\mathbf{R}_{22}$, $\mathbf{R}_{21} = -\mathbf{R}_{11}$ and is known as a zero-sum differential game [SH69b].

#### 3.8.1 Nash Equilibria in Open-Loop LQ Differential Games

#### Finite-Horizon

Consider a linear-quadratic differential game with finite horizon $T$. The calculation of open-loop Nash equilibria is based on the solution of coupled matrix Riccati differential equations (RDEs), which can be derived from Pontryagin's minimum principle. Applying Theorem 3.1 to LQ differential games leads to the following result.

#### Theorem 3.4 (Sufficient Conditions for OLNE solutions in Finite-Horizon LQ Differential Games)

Consider an $N$-player LQ differential game as in Definition 3.11 with the additional constraints $\mathbf{Q}_i, \mathbf{Q}_{i,T} \succeq 0$, $i \in \mathcal{P}$. Let there exist a set of matrix-valued functions $\mathbf{P}_i$, $i \in \mathcal{P}$, which satisfy the Riccati differential equations (RDEs)

$$\dot{\mathbf{P}}_i(t) = -\mathbf{P}_i(t)\mathbf{A} - \mathbf{A}^\top \mathbf{P}_i(t) + \sum_{j=1}^N \mathbf{P}_i(t)\mathbf{B}_j\mathbf{R}_{jj}^{-1}\mathbf{B}_j^\top\mathbf{P}_j(t) - \mathbf{Q}_i, \quad i \in \mathcal{P}, \tag{3.60}$$

with the transversality conditions

$$\mathbf{P}_i(T) = \mathbf{Q}_{i,T}, \quad i \in \mathcal{P}. \tag{3.61}$$

Then, the LQ differential game has a unique OLNE for every initial state $\mathbf{x}_0$. Moreover, the resulting $N$-tuple of equilibrium controls $\mathbf{u}^*$ is defined by the controls

$$\mathbf{u}_i^*(t) = \gamma_i^*(\mathbf{x}_0, t) = -\mathbf{R}_{ii}^{-1}\mathbf{B}_i^\top\mathbf{P}_i(t)\boldsymbol{\Phi}(t, 0)\mathbf{x}_0, \quad i \in \mathcal{P}. \tag{3.62}$$

Here, $\boldsymbol{\Phi}(t, 0)$ satisfies the differential equation

$$\dot{\boldsymbol{\Phi}}(t,0) = \left(\mathbf{A} - \sum_{j=1}^N \mathbf{S}_j\mathbf{P}_j(t)\right)\boldsymbol{\Phi}(t,0), \quad \boldsymbol{\Phi}(0,0) = \mathbf{I}, \tag{3.63}$$

where

$$\mathbf{S}\_{j} = \mathbf{B}\_{j}\mathbf{R}\_{jj}^{-1}\mathbf{B}\_{j}^{\top}, \quad j \in \mathcal{P}. \tag{3.64}$$

#### Proof:

See Section B.1 of the Appendix.

Theorem 3.4 gives an approach for calculating Nash equilibria by solving the RDEs (3.60) with the conditions (3.61). Nevertheless, cases exist where these do not have a solution, but the LQ differential game still has a solution [BO99, p. 314].

In case the system is not affected by any disturbance during the complete game duration, the controls can be formulated in the form of an optimal feedback law

$$\gamma_i^*(\mathbf{x}, t) = -\mathbf{R}_{ii}^{-1}\mathbf{B}_i^\top\mathbf{P}_i(t)\mathbf{x}(t), \quad i \in \mathcal{P}. \tag{3.65}$$
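For illustration, the coupled RDEs (3.60) can be integrated backward in time from the transversality condition (3.61). The following sketch (assuming scipy is available) does this for a scalar two-player game with $A = -1$, $B_i = 1$, $Q_i = 1$, $R_{ii} = 1$, i.e. the setting of Example 3.1, with an illustrative horizon $T = 5$; for growing $T$, $P_i(0)$ approaches the stationary value $(\sqrt{3}-1)/2$:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Scalar two-player LQ game matching Example 3.1 (illustrative horizon T = 5).
A, B, Q, R, T = -1.0, 1.0, 1.0, 1.0, 5.0
QT = 0.0       # terminal weight Q_{i,T}
S = B * B / R  # scalar version of S_j = B_j R_jj^{-1} B_j^T

def rde(t, P):
    # Coupled RDEs (3.60) for P = (P_1, P_2); both players are identical here.
    P1, P2 = P
    dP1 = -2.0 * A * P1 + P1 * S * P1 + P1 * S * P2 - Q
    dP2 = -2.0 * A * P2 + P2 * S * P2 + P2 * S * P1 - Q
    return [dP1, dP2]

# Integrate backward from the transversality condition P_i(T) = Q_{i,T} (3.61).
sol = solve_ivp(rde, [T, 0.0], [QT, QT], rtol=1e-8, atol=1e-10)
print(sol.y[:, -1])  # P_i(0), close to (sqrt(3) - 1)/2 ~ 0.366 for this horizon
```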

#### Infinite-Horizon

In the infinite-horizon case, i.e. $T \to \infty$, the matrices $\mathbf{P}_i$ are constant ($\dot{\mathbf{P}}_i = \mathbf{0}$), resulting in coupled algebraic Riccati equations (AREs) and leading to the following result.

#### Theorem 3.5 (Sufficient Conditions for OLNE solutions in Infinite-Horizon LQ Differential Games)

Consider an $N$-player LQ differential game as in Definition 3.11 with $T \to \infty$ and with the additional constraints $\mathbf{Q}_i \succ 0$ and $\mathbf{Q}_{i,T} = \mathbf{0}$, $i \in \mathcal{P}$. Then, the LQ differential game has an OLNE for every initial state $\mathbf{x}_0$ if a set of matrices $\mathbf{P}_i$, $i \in \mathcal{P}$, exists which satisfies the algebraic Riccati equations (AREs)

$$\mathbf{0} = -\mathbf{P}\_i \mathbf{A} - \mathbf{A}^\top \mathbf{P}\_i + \sum\_{j=1}^N \mathbf{P}\_i \mathbf{B}\_j \mathbf{R}\_{jj}^{-1} \mathbf{B}\_j^\top \mathbf{P}\_j - \mathbf{Q}\_i \quad , \ i \in \mathcal{P} \tag{3.66}$$

and additionally leads to a stable closed-loop system<sup>20</sup>

$$\mathbf{F} \coloneqq \mathbf{A} - \sum_{j=1}^N \mathbf{S}_j\mathbf{P}_j, \tag{3.67}$$

i.e. the eigenvalues of $\mathbf{F}$ have a negative real part. The resulting $N$-tuple of Nash equilibrium controls $\mathbf{u}^*$ is defined by (3.62), where $\mathbf{P}_i(t) = \mathbf{P}_i$, $i \in \mathcal{P}$.

#### Proof:

See the proof of [BO99, Theorem 6.22].

According to [BO99, p. 336], the existence of OLNEs in an infinite-horizon LQ differential game does not imply the existence of an OLNE in the finite-horizon version of the game. Moreover, a unique solution of the RDEs in a finite-horizon differential game may converge

<sup>20</sup> Note that the stabilizability of $(\mathbf{A}, [\mathbf{B}\_1 \cdots \mathbf{B}\_N])$ is necessary, a property which follows from the stabilizable pairs $(\mathbf{A}, \mathbf{B}\_i)$, $i \in \mathcal{P}$, according to Definition 3.11.

for T → ∞ to a solution of the coupled AREs, but these are not necessarily stabilizing solutions and therefore would not constitute an OLNE of the infinite-horizon differential game.
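To make Theorem 3.5 concrete, the following minimal sketch (plain Python; the symmetric scalar two-player game and all numerical values are assumptions chosen for illustration, not taken from this thesis) computes the symmetric solution of the coupled AREs (3.66) in closed form and verifies both the ARE residual and the stability of the closed-loop system (3.67).

```python
import math

# Assumed toy game: x' = a*x + u1 + u2, J_i = integral of q*x^2 + u_i^2,
# i.e. scalar A = a, B_i = 1, R_ii = 1, Q_i = q for both players.
a, q = 1.0, 4.0

# For the symmetric case p1 = p2 = p, the coupled AREs (3.66) reduce to
# 0 = -2*a*p + 2*p^2 - q, whose positive (stabilizing) root is:
p = (a + math.sqrt(a**2 + 2.0 * q)) / 2.0

# Residual of (3.66) for player i -- should vanish:
residual = -2.0 * a * p + p * (p + p) - q

# Closed-loop "matrix" (3.67), here a scalar: F = a - (S1*p1 + S2*p2), S_j = 1.
F = a - 2.0 * p

print(p, residual, F)  # p = 2.0, residual = 0.0, F = -3.0 (stable)
```

With $a = 1$ and $q = 4$ this gives $p = 2$ and $F = -3 < 0$, so the OLNE conditions of the theorem are met for this toy instance.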

#### 3.8.2 Nash Equilibrium in Feedback LQ Differential Games

#### Finite Horizon

Consider an LQ differential game with finite horizon $T$. Similar to the open-loop case, the calculation of feedback Nash equilibria is based on the solution of coupled RDEs, which can be derived from Theorem 3.2. We shall now restrict our attention to linear feedback strategies belonging to the set

$$\Gamma\_i^{\text{FB}} = \{ \boldsymbol{\gamma}\_i \mid \boldsymbol{\gamma}\_i(\mathbf{x}, t) = -\mathbf{K}\_i(t)\mathbf{x}(t) \}. \tag{3.68}$$

This allows the formulation of the following theorem.

#### Theorem 3.6 (Necessary and Sufficient Conditions for FNE solutions in Finite-Horizon LQ Differential Games)

Consider an $N$-player LQ differential game as in Definition 3.11. The LQ differential game has a linear FNE for every initial state $\mathbf{x}\_0$ if and only if a set of symmetric matrix-valued functions $\mathbf{P}\_i$, $i \in \mathcal{P}$, exists which satisfy the Riccati differential equations (RDEs)

$$\begin{aligned} \dot{\mathbf{P}}\_i(t) &= -\mathbf{Q}\_i - \mathbf{P}\_i(t)\mathbf{A} - \mathbf{A}^\top \mathbf{P}\_i(t) + \sum\_{j=1}^N \mathbf{P}\_i(t)\mathbf{S}\_j \mathbf{P}\_j(t) + \dots \\ &\dots + \sum\_{\substack{j=1 \\ j \neq i}}^N \mathbf{P}\_j(t)\mathbf{S}\_j \mathbf{P}\_i(t) - \sum\_{\substack{j=1 \\ j \neq i}}^N \mathbf{P}\_j(t)\mathbf{S}\_{ij} \mathbf{P}\_j(t), \end{aligned} \tag{3.69}$$

where

$$\begin{aligned} \mathbf{S}\_{j} &= \mathbf{B}\_{j} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_{j}^{\top}, & \quad j \in \mathcal{P}, \\ \mathbf{S}\_{ij} &= \mathbf{B}\_{j} \mathbf{R}\_{jj}^{-1} \mathbf{R}\_{ij} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_{j}^{\top}, & \quad i, j \in \mathcal{P}, \; i \neq j, \end{aligned} \tag{3.70}$$

and the transversality conditions

$$\mathbf{P}\_i(T) = \mathbf{Q}\_{i,T}, \quad i \in \mathcal{P}. \tag{3.71}$$

The resulting $N$-tuple of linear Nash equilibrium strategies $\boldsymbol{\gamma}^\*$ is unique and defined by

$$\boldsymbol{\gamma}\_i^\*(\mathbf{x}, t) = -\mathbf{R}\_{ii}^{-1} \mathbf{B}\_i^\top \mathbf{P}\_i(t)\mathbf{x}(t) =: -\mathbf{K}\_i(t)\mathbf{x}(t), \quad i \in \mathcal{P}. \tag{3.72}$$

#### Proof:

See the proof of [Eng05, Theorem 8.3].

Generally speaking, the FNE arising from the solution of the coupled RDEs is not necessarily the only one. Başar reported in [Bas74] the existence of equilibrium strategies which are nonlinear functions of the state in discrete-time linear-quadratic dynamic games. Similarly, in [TM90] the authors present a specific LQ differential game example for which a nonlinear FNE exists. Therefore, Theorem 3.6 may not apply if the strategy space is enlarged so as to include nonlinear strategies [Eng05, p. 365].
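The backward integration of the coupled RDEs (3.69) with the transversality conditions (3.71) can be sketched numerically. In the following minimal example (plain Python; the symmetric scalar two-player game with $a = 0$, $B\_i = R\_{ii} = 1$, $R\_{ij} = 0$ so that $\mathbf{S}\_{ij} = 0$, and $Q\_i = 3$, $Q\_{i,T} = 0$ is an assumed toy case), symmetry makes $P\_1(t) = P\_2(t) = p(t)$, and both RDEs collapse to the single scalar equation $\dot{p} = -q - 2ap + 3p^2$:

```python
# Assumed symmetric scalar two-player game; (3.69) collapses to
# p' = -q - 2*a*p + 3*p^2 with terminal condition p(T) = Q_i,T = 0.
a, q, T, h = 0.0, 3.0, 10.0, 1e-3

p = 0.0                      # transversality condition (3.71): p(T) = 0
steps = int(T / h)
for _ in range(steps):       # integrate backward from t = T to t = 0
    pdot = -q - 2.0 * a * p + 3.0 * p * p
    p -= h * pdot            # one backward Euler step

K0 = p                       # feedback gain (3.72) at t = 0: K_i(0) = R_ii^-1 B_i p(0)
print(p, K0)
```

For a long horizon, $p(0)$ approaches the positive root of $3p^2 - 2ap - q = 0$ (here $p = 1$), i.e. the steady-state gain of the infinite-horizon game discussed next.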

#### Infinite Horizon

As in the finite-horizon case, we restrict our attention to linear feedback strategies. Nevertheless, for infinite-horizon games, these are constant over time, i.e. they are defined by the set

$$\Pi\_i^{\text{FB}} = \{ \boldsymbol{\gamma}\_i \mid \boldsymbol{\gamma}\_i(\mathbf{x}, t) = -\mathbf{K}\_i \mathbf{x}(t) \}. \tag{3.73}$$

Furthermore, these strategies (or alternatively, control laws) $\mathbf{K} = (\mathbf{K}\_1, \dots, \mathbf{K}\_N)$ are assumed to belong to the set

$$\mathcal{F} = \left\{ (\mathbf{K}\_1, \dots, \mathbf{K}\_N) \mid \mathbf{F} \text{ is stable} \right\}, \tag{3.74}$$

which can be interpreted as the players striving to jointly stabilize the system.<sup>21</sup> A necessary and sufficient condition for the non-emptiness of $\mathcal{F}$ is the stabilizability of the matrix pair $(\mathbf{A}, [\mathbf{B}\_1 \cdots \mathbf{B}\_N])$ [EBS00]. With these conditions in mind, the following result is stated.

#### Theorem 3.7 (Necessary and Sufficient Conditions for FNE solutions in Infinite-Horizon LQ Differential Games)

Consider an $N$-player LQ differential game as in Definition 3.11 with $T \to \infty$. Let the matrices $\mathbf{P}\_i$, $i \in \mathcal{P}$, be symmetric solutions to the AREs

$$\mathbf{0} = -\mathbf{Q}\_i - \mathbf{P}\_i \mathbf{A} - \mathbf{A}^\top \mathbf{P}\_i + \sum\_{j=1}^N \mathbf{P}\_i \mathbf{S}\_j \mathbf{P}\_j + \sum\_{\substack{j=1 \\ j \neq i}}^N \mathbf{P}\_j \mathbf{S}\_j \mathbf{P}\_i - \sum\_{\substack{j=1 \\ j \neq i}}^N \mathbf{P}\_j \mathbf{S}\_{ij} \mathbf{P}\_j,\tag{3.75}$$

<sup>21</sup> According to [Eng05, p. 372], this corresponds to the supposition that both players have a first priority in stabilizing the system. Furthermore, for most games the equilibria without this stabilization constraint coincide with the ones corresponding to a game for which this constraint is included. Therefore, the stabilization constraint will not be active in most cases.

and additionally lead to a stable closed-loop system

$$\mathbf{F} = \mathbf{A} - \sum\_{j=1}^{N} \mathbf{S}\_j \mathbf{P}\_j,$$

where

$$\begin{aligned} \mathbf{S}\_{j} &= \mathbf{B}\_{j} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_{j}^{\top}, & \quad j \in \mathcal{P}, \\ \mathbf{S}\_{ij} &= \mathbf{B}\_{j} \mathbf{R}\_{jj}^{-1} \mathbf{R}\_{ij} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_{j}^{\top}, & \quad i, j \in \mathcal{P}, \; i \neq j. \end{aligned} \tag{3.76}$$

Then, there exists a linear FNE and the corresponding feedback strategies are defined by

$$\mathbf{u}\_i^\*(t) = \boldsymbol{\gamma}\_i^\*(\mathbf{x}, t) = -\mathbf{R}\_{ii}^{-1} \mathbf{B}\_i^\top \mathbf{P}\_i \mathbf{x}(t) = -\mathbf{K}\_i \mathbf{x}(t). \tag{3.77}$$

Conversely, if a linear FNE exists and is defined by (3.77), then there exists a set of stabilizing matrices $\mathbf{P}\_i$, $i \in \mathcal{P}$, which solve the AREs (3.75).

#### Proof:

Since $(\mathbf{A}, [\mathbf{B}\_1 \cdots \mathbf{B}\_N])$ is stabilizable, which follows from the fact that the single pairs $(\mathbf{A}, \mathbf{B}\_i)$, $i \in \mathcal{P}$, are stabilizable according to Definition 3.11, the rest of the proof is stated in [Eng05, Theorem 8.5].

Theorem 3.7 was formulated with some freedom, as the results of the infinite-horizon case are established with a definition of the feedback Nash equilibrium specific to infinite-horizon LQ games, which is based on the constant linear feedback strategies (3.73). Further details are given in Chapter 5, where the AREs are exploited to develop a method for inverse LQ dynamic games. In addition, it is worth noting that the solutions of the AREs (3.75), and therefore the FNE, are generally not unique [Eng05, p. 381].
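The non-uniqueness remark can be illustrated numerically. In the following hedged sketch (plain Python; the symmetric scalar two-player game with $a = 0$, $B\_i = R\_{ii} = 1$, $R\_{ij} = 0$, $Q\_i = 3$ is an assumed toy case), the coupled AREs (3.75) reduce to a quadratic equation with two real roots, of which only one yields a stable closed-loop system:

```python
import math

# Assumed symmetric scalar game: (3.75) with S_ij = 0 collapses to
# 3*p^2 - 2*a*p - q = 0, a quadratic with two real roots.
a, q = 0.0, 3.0

disc = math.sqrt(4.0 * a**2 + 12.0 * q)
roots = [(2.0 * a + disc) / 6.0, (2.0 * a - disc) / 6.0]   # here: 1.0 and -1.0

# Only roots with stable closed-loop eigenvalue F = a - 2*p qualify as FNE:
stabilizing = [p for p in roots if a - 2.0 * p < 0]
print(roots, stabilizing)   # both solve the ARE; only p = 1.0 is stabilizing
```

Both roots solve the ARE, so an additional stability check (membership of the induced gains in $\mathcal{F}$) is needed to single out the equilibrium solution.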

# 3.9 Summary

This chapter presented fundamentals of dynamic game theory needed for the understanding of the inverse dynamic game methods introduced in this thesis. The following chapters are all based on games with the basic properties presented in Definition 3.3 and mainly with the Nash equilibrium as a solution concept. Nevertheless, a possible application to dynamic games with Pareto-efficient solutions will additionally be mentioned. Inverse dynamic game problems depend on further characteristics of the game, e.g. the information structure and strategy types as well as the assumed class of dynamic systems and cost function structure. The following three chapters introduce different kinds of inverse dynamic games and corresponding methods for their solution.

# 4 Inverse Non-Cooperative Differential Games

This chapter presents results on the solution of inverse differential games.<sup>22</sup> As described in Chapter 2, the aim of an inverse differential game is to determine the cost functions the players minimized, i.e. the cost functions which gave rise to the observed state and control trajectories. In the following, this problem is first formulated formally. Afterwards, the main contributions presented in this chapter are the proposal of an efficient method for solving inverse open-loop differential games and the formulation of sufficient conditions for the uniqueness of the solution. Furthermore, the applicability of the method to inverse differential games with feedback strategies is demonstrated.<sup>23</sup>

# 4.1 Problem Formulation

The theoretical framework of non-cooperative differential games describes $N$ agents treated as entities controlling the system based on the minimization of their individual cost functions, as introduced in Chapter 3. The non-cooperative nature of the game means that no contracts or agreements between players are in place while they attempt to minimize their individual costs. Within the inverse problem of differential games, the result of the interaction between all players, i.e. the state and control trajectories, is assumed to be given. A further important characteristic of the inverse differential game is that the interaction led to a Nash equilibrium. Some work exists which investigates conditions under which Nash equilibria exist (see e.g. the results in [Luk71, Var70] and the discussions and references given in [BO99, Eng05]). However, these conditions are not general and not simple to formulate in terms of the system dynamics or the cost functions. Addressing the existence of Nash equilibria in general dynamic games is beyond the scope of this thesis and therefore, the following assumption will be made.

<sup>22</sup> In the remainder of this thesis, the term inverse differential game describes an inverse dynamic game in continuous time (cf. last paragraph of Section 3.1).

<sup>23</sup> The results of this chapter are based on the conference paper [RIK+17] and the author's contribution to the journal paper [MIF+20].

#### Assumption 4.1 (Nash Character of the Observed Trajectories)

The observed state trajectories $\tilde{\mathbf{x}}(t)$ and control trajectories $(\tilde{\mathbf{u}}\_1(t), \dots, \tilde{\mathbf{u}}\_N(t))$ of all players are Nash equilibrium trajectories $\mathbf{x}^\*(t)$ and $\mathbf{u}\_i^\*(t)$ generated by a non-cooperative differential game defined by a set of non-trivial cost functions $J^\* = \{J\_1^\*, \dots, J\_N^\*\}$ and a dynamic system according to Definition 3.2.

With this assumption, the inverse differential game problem is defined as follows.

#### Definition 4.1 (Inverse Differential Game Problem)

Let Assumption 4.1 hold such that state trajectories $\mathbf{x}^\*(t)$ and control trajectories $\mathbf{u}\_i^\*(t)$, $\forall i \in \mathcal{P}$, which correspond to a Nash equilibrium, are given. Find at least one set $J$ such that $J\_i$, $\forall i \in \mathcal{P}$, fulfill

$$\begin{aligned} \mathbf{u}\_{i}^{\*}(t) &= \operatorname\*{arg\,min}\_{\mathbf{u}\_{i}(t)} J\_{i}\left(\mathbf{x}^{\*}(t), \mathbf{u}\_{i}(t), \mathbf{u}\_{\neg i}^{\*}(t)\right) \\ &\quad\text{w.r.t.} \\ \dot{\mathbf{x}}(t) &= \boldsymbol{f}\left(\mathbf{x}(t), \mathbf{u}\_{1}(t), \dots, \mathbf{u}\_{N}(t), t\right) \\ \mathbf{x}(0) &= \mathbf{x}\_{0}. \end{aligned} \tag{4.1}$$

The formulation of the inverse differential game problem implies determining the cost functions $J\_i$, $i \in \mathcal{P}$, such that $\mathbf{u}\_i^\*(t)$ solves the optimal control problems (4.1) which follow from Definition 3.7. Definition 4.1 allows for several types of Nash equilibria which arise depending on the information structure of the game and the resulting strategy types. In particular, in this thesis open-loop and feedback Nash equilibria are considered. In addition, Definition 4.1 establishes the search for "at least one set" of cost functions as a consequence of the ill-posed nature of inverse problems in optimal control and dynamic games. This means that several sets of cost functions exist which are equivalent in the sense that all of them are able to explain the same state and control trajectories. The concept of equivalence of cost functions is formalized in Section B.2 of the Appendix.

The inverse differential game of Definition 4.1 is very general and represents a considerably complex task since there is an infinite number of possible cost functions, varying in structure and parametrization, which may potentially solve the inverse differential game problem. This issue is not unique to inverse dynamic or differential games, as it also arises in the inverse problem of optimal control (single-player case). Therefore, a parametrization of the cost functions needs to be introduced first. Two lines of research have been developed to achieve this.

• Approximation of non-linear cost function structures by means of Gaussian processes [LPK11, LHF14] or alternatively, artificial neural networks [WOP16].

• Setting the cost function structure as a linear combination of basis functions [MTL10, PJJB12, JAB13, AB14, MTFP16, PRBF18, JKL<sup>+</sup> 19].

The first approach utilizes parameterized kernel functions which determine the structure of the Gaussian process. In this way, non-linear rewards can be learned by maximizing the likelihood function of the Gaussian process regression output and the kernel parameters under known observations of the state and control values. Nevertheless, finding these parameters is a computationally complex task which has only been solved successfully in discretized state and control spaces (e.g. a grid world). On the other hand, the use of artificial neural networks usually demands large data sets and long computation times.

Therefore, the second approach is followed and presented in the following subsection.

# 4.2 Basis Functions Approach

In this approach, the cost functions are given a structure specified with basis functions which are defined as follows.

#### Definition 4.2 (Basis Functions Vector)

The vector $\boldsymbol{\phi}\_i \in \mathbb{R}^{M\_i}$ contains the non-trivial functions $\phi\_{i,(j)}(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t)$, $j \in \{1, \dots, M\_i\}$, which are called basis functions. Furthermore, the functions $\phi\_{i,(j)} : \mathbb{R}^n \times \mathbb{R}^{m\_1} \times \dots \times \mathbb{R}^{m\_N} \times [0,T] \mapsto \mathbb{R}$ are continuously differentiable in $\mathbf{x}$ and $\mathbf{u}\_1, \dots, \mathbf{u}\_N$ for all $j \in \{1, \dots, M\_i\}$.

The notation $a\_{i,(j)}$ is used here and in the remainder of this thesis to represent the $j$-th entry of any vector $\mathbf{a}$ which corresponds to player $i \in \mathcal{P}$.

Based on Definition 4.2, cost functions which consist of a linear combination of the basis functions are introduced, i.e.

$$J\_i(\boldsymbol{\phi}\_i, \boldsymbol{\theta}\_i) = \int\_0^T \boldsymbol{\theta}\_i^\top \boldsymbol{\phi}\_i(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t) \, \mathrm{d}t,\tag{4.2}$$

where $\boldsymbol{\theta}\_i \in \Theta\_i \subseteq \mathbb{R}^{M\_i}$ are time-invariant parameters. The introduction of basis functions may appear stringent, yet it allows a wide variety of possible cost function structures.<sup>24</sup>

<sup>24</sup> Although the considered cost functions (4.2) have a so-called Lagrangian structure, i.e. cost functions with only integral costs, the methods and results of this chapter are also applicable to games with player cost functions with a Bolza structure, i.e. of the form (3.3). To do so, the terminal costs $h\_i(\mathbf{x}(T), T)$ must be written as a linear combination of basis functions as well. Afterwards, the Bolza cost function can be transformed into a Lagrange cost function by means of the fundamental theorem of calculus (see e.g. [Nai03, Section 2.7.1]).


In order to define a well-posed inverse differential game problem with the newly introduced basis functions, the dynamics $\boldsymbol{f}$ and basis functions $\boldsymbol{\phi}\_i$ should be specified such that the observed states $\tilde{\mathbf{x}}(t)$ and controls $(\tilde{\mathbf{u}}\_1(t), \dots, \tilde{\mathbf{u}}\_N(t))$ constitute a Nash equilibrium solution to the dynamic game for some (possibly non-unique) cost function parameters $\boldsymbol{\theta}\_i \in \Theta\_i$. Addressing the selection of suitable dynamics and basis functions is beyond the scope of this thesis. Therefore, the following assumption is introduced:

#### Assumption 4.2 (Nash Character of the Trajectories w.r.t. a Differential Game with Basis Functions)

The observed states $\tilde{\mathbf{x}}(t)$ and controls $(\tilde{\mathbf{u}}\_1(t), \dots, \tilde{\mathbf{u}}\_N(t))$ constitute a Nash equilibrium solution to the differential game with system dynamics according to Definition 3.2, which are additionally continuously differentiable in $\mathbf{x}$ and $\mathbf{u}\_1, \dots, \mathbf{u}\_N$, and cost functions of the form (4.2) consisting of basis functions $\boldsymbol{\phi}\_i$ according to Definition 4.2 and the unknown cost function parameters $\boldsymbol{\theta}\_i = \boldsymbol{\theta}\_i^\* \in \Theta\_i$ for $i \in \mathcal{P}$.

Assumption 4.2 specifies Assumption 4.1 for the introduced cost function structure established with the basis functions of Definition 4.2. The assumption of continuous differentiability of the system dynamics f is standard and permits the consideration of Theorem 3.1 which shall be leveraged in the course of this chapter. With this introduced assumption, the inverse differential game problem regarded in this chapter is defined as follows.

#### Definition 4.3 (Inverse Differential Game with Basis Functions)

Let Assumption 4.2 be fulfilled such that state trajectories $\mathbf{x}^\*(t)$ and control trajectories $\mathbf{u}\_i^\*(t)$, $i \in \mathcal{P}$, which correspond to a Nash equilibrium, are given. Determine at least one tuple of parameters $\boldsymbol{\theta} := (\boldsymbol{\theta}\_1, \dots, \boldsymbol{\theta}\_N)$, with $\boldsymbol{\theta}\_i \in \Theta\_i$, $i \in \mathcal{P}$, such that

$$\begin{aligned} \mathbf{u}\_i^\*(t) &= \operatorname\*{arg\,min}\_{\mathbf{u}\_i(t)} J\_i\left(\boldsymbol{\phi}\_i(\mathbf{x}^\*(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}^\*(t), t), \boldsymbol{\theta}\_i\right) \\ &\quad\text{w.r.t.} \\ \dot{\mathbf{x}}(t) &= \boldsymbol{f}\left(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t\right) \\ \mathbf{x}(0) &= \mathbf{x}\_0 \end{aligned} \tag{4.3}$$

for all players $i \in \mathcal{P}$.

A consequence of the introduction of basis functions is the reduction of the general inverse differential game problem to a parameter identification problem. Despite this simplification, under Assumption 4.2, the inverse differential game problem will still have multiple solutions in general. One of the reasons is the following: if the trajectories $\mathbf{x}^\*(t)$ and $\mathbf{u}\_1^\*(t), \dots, \mathbf{u}\_N^\*(t)$ solve the dynamic optimization problems of Definition 4.3 with $\boldsymbol{\theta}\_i = \boldsymbol{\theta}\_i^\* \in \Theta\_i$, then the trajectories will also solve the dynamic optimization problems with $\boldsymbol{\theta}\_i = c\_i\boldsymbol{\theta}\_i^\*$ for all scaling factors $c\_i > 0$. Furthermore, the zero vectors $\boldsymbol{\theta}\_i = \mathbf{0}$ are trivial solutions to the inverse differential game problem. Therefore, without loss of generality, trivial solutions and ambiguous scaling shall be excluded by considering parameter sets of the form $\Theta\_i = \{\boldsymbol{\theta}\_i \in \mathbb{R}^{M\_i} \mid \theta\_{i,(1)} = 1\}$, where $\theta\_{i,(1)}$ denotes the first element of $\boldsymbol{\theta}\_i$. The choice of the fixed-element constraint $\theta\_{i,(1)} = 1$ is arbitrary and results analogous to those of this chapter will also hold with normalization constraints such as $\|\boldsymbol{\theta}\_i\| = 1$.<sup>25</sup>
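The scaling ambiguity motivating the normalized parameter sets $\Theta\_i$ can be illustrated with a scalar single-player LQR example (a hedged sketch in plain Python; the system and cost values are assumptions chosen for illustration): scaling all cost parameters by $c > 0$ leaves the optimal feedback gain, and hence the resulting trajectories, unchanged.

```python
import math

# Assumed scalar LQR: x' = a*x + b*u, cost parameters theta = (q, r).
def lqr_gain(q, r, a=1.0, b=1.0):
    # stabilizing solution of the scalar ARE 2*a*p - b^2*p^2/r + q = 0
    p = r * (a + math.sqrt(a**2 + b**2 * q / r)) / b**2
    return b * p / r          # optimal gain depends only on the ratio q/r

k1 = lqr_gain(1.0, 1.0)
k2 = lqr_gain(5.0, 5.0)       # same cost parameters scaled by c = 5
print(k1, k2)                 # identical gains: theta and c*theta are equivalent
```

Since $\theta$ and $c\theta$ generate the same closed-loop behavior, no identification method can distinguish them from data, which is exactly why a constraint such as $\theta\_{i,(1)} = 1$ is imposed.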

# 4.3 Inverse Open-Loop Differential Games

The inverse differential games of Definitions 4.1 and 4.3 imply finding cost functions such that the solutions of the $N$ optimal control problems correspond to the given controls $(\mathbf{u}\_1^\*(t), \dots, \mathbf{u}\_N^\*(t))$. Since for the particular optimal control problem of player $i$, the other players' controls $\mathbf{u}\_{\neg i}^\*(t)$ are available, we can proceed to analyze these individual optimal control problems. For the forward problem of finding open-loop Nash equilibrium trajectories, the tools of optimal control theory, in particular Pontryagin's minimum principle, are leveraged to obtain necessary conditions for open-loop Nash equilibria (cf. Section 3.6). Similarly, in this section, these conditions shall be exploited to find parameters $\boldsymbol{\theta}\_i$ which solve the inverse differential game problem of Definition 4.3 in the case of open-loop strategies.

#### 4.3.1 Residual-Based Approach

The main idea consists of exploiting the fact that the observed trajectories correspond by assumption to a Nash equilibrium, i.e. $\tilde{\mathbf{x}}(t) = \mathbf{x}^\*(t)$ and $\tilde{\mathbf{u}}\_i(t) = \mathbf{u}\_i^\*(t)$. These must fulfill the equations of Theorem 3.1 as these represent necessary conditions for Nash equilibria. Consider any player $i \in \mathcal{P}$. Besides the system dynamics equation, the costate equation

$$\dot{\boldsymbol{\psi}}\_i(t) = -\nabla\_{\mathbf{x}} H\_i\left(\boldsymbol{\psi}\_i(t), \mathbf{x}^\*(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}^\*(t), t\right) \tag{4.4a}$$

$$\boldsymbol{\psi}\_i(T) = \mathbf{0}, \tag{4.4b}$$

where (4.4b) follows from $h\_i(\mathbf{x}(T), T) = 0$ due to the Lagrangian structure of the cost function (4.2), and the control equation

$$\mathbf{u}\_{i}^{\*}(t) = \operatorname\*{arg\,min}\_{\mathbf{u}\_{i}(t)} H\_{i}\left(\boldsymbol{\psi}\_{i}(t), \mathbf{x}^{\*}(t), \mathbf{u}\_{i}(t), \mathbf{u}\_{\neg i}^{\*}(t), t\right) \tag{4.5}$$

<sup>25</sup> Both fixed-element (e.g. [MTFP16]) and normalization-constraint parameter sets (e.g. [ARARU+11]) are popular in the related literature on inverse optimal control.

must be fulfilled. Since we consider no constraints on the control variables <sup>u</sup>i(t), the control equation (4.5) results in the Hamiltonian gradient condition

$$\mathbf{0} = \nabla\_{\mathbf{u}\_i} H\_i \left( \boldsymbol{\psi}\_i(t), \mathbf{x}^\*(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}^\*(t), t \right). \tag{4.6}$$

With the Hamiltonian function of player i being given by

$$H\_i = \boldsymbol{\theta}\_i^\top \boldsymbol{\phi}\_i \left( \mathbf{x}(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}(t), t \right) + \boldsymbol{\psi}\_i^\top(t) \boldsymbol{f} \left( \mathbf{x}(t), \mathbf{u}\_i(t), \mathbf{u}\_{\neg i}(t), t \right) \tag{4.7}$$

as a result of the cost function structure (4.2), the following definition is introduced.

#### Definition 4.4 (Residuals)

The functions

$$r\_{\mathrm{C}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i, t) = \left\| \nabla\_{\mathbf{u}\_i} H\_i\left(\boldsymbol{\psi}\_i, \boldsymbol{\theta}\_i, t\right) \right\|^2 \Big|\_{\substack{\mathbf{u}\_i(t) = \mathbf{u}\_i^\*(t) \\ \mathbf{x}(t) = \mathbf{x}^\*(t)}} \tag{4.8}$$

and


$$r\_{\mathrm{L}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i, t) = \left\| \dot{\boldsymbol{\psi}}\_i(t) + \nabla\_{\mathbf{x}} H\_i\left(\boldsymbol{\psi}\_i, \boldsymbol{\theta}\_i, t\right) \right\|^2 \Big|\_{\substack{\mathbf{u}\_i(t) = \mathbf{u}\_i^\*(t) \\ \mathbf{x}(t) = \mathbf{x}^\*(t)}}, \tag{4.9}$$

where $\|\cdot\|$ denotes the Euclidean norm, are called residuals of the control equation and the costate equation, respectively.

The residuals of Definition 4.4 result from the insertion of the Hamiltonian (4.7) into (4.4a) and (4.6) and the subsequent insertion of the known optimal trajectories $\mathbf{x}^\*(t)$ and $\mathbf{u}\_i^\*(t)$, which results in a dependence on the costate functions $\boldsymbol{\psi}\_i$ and the parameters $\boldsymbol{\theta}\_i$ only. Note that $r\_{\mathrm{C}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i)$ and $r\_{\mathrm{L}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i)$ are both equal to zero for $\boldsymbol{\theta}\_i = \boldsymbol{\theta}\_i^\*$ and $\boldsymbol{\psi}\_i(t) = \boldsymbol{\psi}\_i^\*(t)$. Therefore, in light of this formulation, the idea of the proposed residual-based method consists of the computation of $\hat{\boldsymbol{\theta}}\_i \in \Theta\_i$ and costate functions $\hat{\boldsymbol{\psi}}\_i : [0,T] \mapsto \mathbb{R}^n$ for each player $i \in \mathcal{P}$ which solve the optimization problem

$$\begin{aligned} \min\_{\boldsymbol{\psi}\_i, \boldsymbol{\theta}\_i} & \quad \int\_0^T r\_{\mathrm{C}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i) + \rho\, r\_{\mathrm{L}}(\boldsymbol{\theta}\_i, \boldsymbol{\psi}\_i) \, \mathrm{d}t \\ \text{s.t.} & \quad \boldsymbol{\theta}\_i \in \Theta\_i, \end{aligned} \tag{4.10}$$

where $\rho > 0$ is a specifiable weighting factor. The intuition behind (4.10) is the following: $\hat{\boldsymbol{\psi}}\_i(t)$ and parameters $\hat{\boldsymbol{\theta}}\_i$ are sought such that the costate condition (4.4a) and the Hamiltonian gradient condition (4.6) hold for all $t \in [0,T]$.<sup>26</sup> Under Assumption 4.2, $\hat{\boldsymbol{\theta}}\_i = \boldsymbol{\theta}\_i^\*$ will be a (possibly non-unique) solution to (4.10).
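A minimal numerical illustration of the residual idea (plain Python; the single-player scalar case with $N = 1$, dynamics $\dot{x} = u$, and basis functions $\boldsymbol{\phi} = (x^2, u^2)$ is an assumed toy example, and the joint optimization over $\boldsymbol{\psi}\_i$ in (4.10) is simplified here by eliminating $\boldsymbol{\psi}\_i$ through the control equation rather than optimizing over it):

```python
import math

# Assumed toy case: x' = u, phi = (x^2, u^2), normalization theta_(1) = 1,
# true theta_(2) = 1. The optimal solution is u*(t) = -x*(t), x*(t) = exp(-t).
h = 1e-3
ts = [k * h for k in range(int(5.0 / h))]
xs = [math.exp(-t) for t in ts]          # observed x*(t)
us = [-x for x in xs]                    # observed u*(t)

def costate_residual(theta2):
    # control residual (4.8) is made exactly zero by psi = -2*theta2*u
    psi = [-2.0 * theta2 * u for u in us]
    total = 0.0
    for k in range(1, len(ts) - 1):
        psidot = (psi[k + 1] - psi[k - 1]) / (2.0 * h)   # central difference
        r = psidot + 2.0 * xs[k]         # costate residual (4.9): psi' + 2*theta1*x
        total += r * r * h
    return total

grid = [0.5 + 0.1 * k for k in range(11)]   # candidate theta_(2) values
theta2_hat = min(grid, key=costate_residual)
print(theta2_hat)                            # recovers the true value 1.0
```

The integrated costate residual vanishes (up to discretization error) only at the true parameter, which is the mechanism the quadratic-program reformulation below exploits in a computationally efficient way.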

The solution of (4.10) is based on its reformulation as a quadratic program. For that purpose, it shall first be rewritten as an LQ dynamic optimization problem. Let us define the matrix

$$\mathbf{L} := \begin{bmatrix} \mathbf{I}\_{M\_i} & \mathbf{0}\_{M\_i \times n} \end{bmatrix} \in \mathbb{R}^{M\_i \times (M\_i + n)}, \tag{4.11}$$

where $\mathbf{I}\_{M\_i}$ denotes a square identity matrix with dimensions $M\_i \times M\_i$. Similarly, $\mathbf{0}\_{M\_i \times n}$ denotes a zero matrix with dimensions $M\_i \times n$. Furthermore, we define the matrices $\mathbf{R} = \mathbf{I}\_n$, $\mathbf{B} := \begin{bmatrix} \mathbf{0}\_{n \times M\_i} & \mathbf{I}\_n \end{bmatrix}^\top$ and the time-variant matrices

$$\mathbf{N}\_{i}(t) \coloneqq \begin{bmatrix} \rho \nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t) & \rho \nabla\_{\mathbf{x}} \boldsymbol{f}(t) \end{bmatrix}^{\top} \tag{4.12}$$

$$\mathbf{Q}\_i(t) \coloneqq \begin{bmatrix} \sqrt{\rho} \nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t) & \sqrt{\rho} \nabla\_{\mathbf{x}} \boldsymbol{f}(t) \\ \nabla\_{\mathbf{u}\_i} \boldsymbol{\phi}\_i(t) & \nabla\_{\mathbf{u}\_i} \boldsymbol{f}(t) \end{bmatrix}^\top \begin{bmatrix} \sqrt{\rho} \nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t) & \sqrt{\rho} \nabla\_{\mathbf{x}} \boldsymbol{f}(t) \\ \nabla\_{\mathbf{u}\_i} \boldsymbol{\phi}\_i(t) & \nabla\_{\mathbf{u}\_i} \boldsymbol{f}(t) \end{bmatrix}, \tag{4.13}$$

where we use the shorthand $\nabla\_{\mathbf{x}} \boldsymbol{f}(t) \in \mathbb{R}^{n \times n}$ and $\nabla\_{\mathbf{u}\_i} \boldsymbol{f}(t) \in \mathbb{R}^{m\_i \times n}$ to denote the matrices of partial derivatives of $\boldsymbol{f}$ with respect to $\mathbf{x}(t)$ and $\mathbf{u}\_i(t)$, respectively,<sup>27</sup> evaluated along the observed trajectories $\mathbf{x}^\*(t)$ and $(\mathbf{u}\_1^\*(t), \dots, \mathbf{u}\_N^\*(t))$. Similarly, we use $\nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t) \in \mathbb{R}^{n \times M\_i}$ and $\nabla\_{\mathbf{u}\_i} \boldsymbol{\phi}\_i(t) \in \mathbb{R}^{m\_i \times M\_i}$ to denote the matrices of partial derivatives of $\boldsymbol{\phi}\_i$ evaluated along the same trajectories for $i \in \mathcal{P}$. The following lemma rewrites the problem (4.10) as an LQ dynamic optimization problem.

#### Lemma 4.1

Consider any player $i \in \mathcal{P}$. The optimization problem (4.10) over the costates $\boldsymbol{\psi}\_i$ and parameters $\boldsymbol{\theta}\_i$ is equivalent to the LQ dynamic optimization problem

$$\begin{aligned} \min\_{\mathbf{z}\_i, \boldsymbol{v}\_i} & \quad \int\_0^T \mathbf{z}\_i^\top(t) \mathbf{Q}\_i(t) \mathbf{z}\_i(t) + \boldsymbol{v}\_i^\top(t) \rho \mathbf{R} \boldsymbol{v}\_i(t) + 2 \mathbf{z}\_i^\top(t) \mathbf{N}\_i(t) \boldsymbol{v}\_i(t) \,\mathrm{d}t \\ \text{s.t.} & \quad \dot{\mathbf{z}}\_i(t) = \mathbf{B} \boldsymbol{v}\_i(t), \; t \in [0, T] \\ & \quad \mathbf{L} \mathbf{z}\_i(t) \in \Theta\_i, \; t \in [0, T] \end{aligned} \tag{4.14}$$

over the functions $\mathbf{z}\_i : [0,T] \mapsto \mathbb{R}^{M\_i+n}$ and $\boldsymbol{v}\_i : [0,T] \mapsto \mathbb{R}^n$ with the variable substitutions

$$\mathbf{z}\_{i}(t) = \begin{bmatrix} \boldsymbol{\theta}\_{i} \\ \boldsymbol{\psi}\_{i}(t) \end{bmatrix} \quad \text{and} \quad \boldsymbol{v}\_{i}(t) = \dot{\boldsymbol{\psi}}\_{i}(t). \tag{4.15}$$

<sup>26</sup> If $N = 1$ and $\rho = 1$, this method reduces to the single-player approach presented in [JAB13].

<sup>27</sup> The partial derivatives $\nabla\_{\mathbf{x}} \boldsymbol{f}(t)$ are defined here as the transposed Jacobian of $\boldsymbol{f}$, i.e. $\nabla\_{\mathbf{x}} \boldsymbol{f}(t) = \begin{bmatrix} \frac{\partial \boldsymbol{f}(t)}{\partial x\_1} & \cdots & \frac{\partial \boldsymbol{f}(t)}{\partial x\_n} \end{bmatrix}^\top$.

#### Proof:

We note that the integrand of the objective functional of (4.10) may be rewritten as

$$\begin{aligned} &\left\|\nabla\_{\mathbf{u}\_i} H\_i\left(t, \boldsymbol{\psi}\_i(t), \boldsymbol{\theta}\_i\right)\right\|^2 + \rho \left\|\dot{\boldsymbol{\psi}}\_i(t) + \nabla\_{\mathbf{x}} H\_i\left(t, \boldsymbol{\psi}\_i(t), \boldsymbol{\theta}\_i\right)\right\|^2 \\ &\quad= \left\| \begin{bmatrix} \sqrt{\rho}\dot{\boldsymbol{\psi}}\_i(t) + \sqrt{\rho}\nabla\_{\mathbf{x}} H\_i\left(t, \boldsymbol{\psi}\_i(t), \boldsymbol{\theta}\_i\right) \\ \nabla\_{\mathbf{u}\_i} H\_i\left(t, \boldsymbol{\psi}\_i(t), \boldsymbol{\theta}\_i\right) \end{bmatrix} \right\|^2 \\ &\quad= \left\| \begin{bmatrix} \sqrt{\rho}\dot{\boldsymbol{\psi}}\_i(t) + \sqrt{\rho}\nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t)\boldsymbol{\theta}\_i + \sqrt{\rho}\nabla\_{\mathbf{x}} \boldsymbol{f}(t)\boldsymbol{\psi}\_i(t) \\ \nabla\_{\mathbf{u}\_i} \boldsymbol{\phi}\_i(t)\boldsymbol{\theta}\_i + \nabla\_{\mathbf{u}\_i} \boldsymbol{f}(t)\boldsymbol{\psi}\_i(t) \end{bmatrix} \right\|^2 \\ &\quad= \left\| \begin{bmatrix} \sqrt{\rho}\nabla\_{\mathbf{x}} \boldsymbol{\phi}\_i(t) & \sqrt{\rho}\nabla\_{\mathbf{x}} \boldsymbol{f}(t) \\ \nabla\_{\mathbf{u}\_i} \boldsymbol{\phi}\_i(t) & \nabla\_{\mathbf{u}\_i} \boldsymbol{f}(t) \end{bmatrix} \begin{bmatrix} \boldsymbol{\theta}\_i \\ \boldsymbol{\psi}\_i(t) \end{bmatrix} + \begin{bmatrix} \sqrt{\rho}\mathbf{I}\_n \\ \mathbf{0}\_{m\_i \times n} \end{bmatrix} \dot{\boldsymbol{\psi}}\_i(t) \right\|^2 \\ &\quad= \mathbf{z}\_i^\top(t)\mathbf{Q}\_i(t)\mathbf{z}\_i(t) + \boldsymbol{v}\_i^\top(t)\rho\mathbf{R}\boldsymbol{v}\_i(t) + 2\mathbf{z}\_i^\top(t)\mathbf{N}\_i(t)\boldsymbol{v}\_i(t), \end{aligned}$$

where the second equality holds by recalling the definition of the player Hamiltonian (4.7), and the third and fourth equalities are obtained via matrix algebra by recalling the definitions of $\mathbf{Q}\_i(t)$, $\mathbf{R}$, and $\mathbf{N}\_i(t)$ together with the variable substitutions (4.15). We also note that the constraint $\boldsymbol{\theta}\_i \in \Theta\_i$ may be equivalently written as

$$\mathbf{L}\mathbf{z}\_i(t) = \boldsymbol{\theta}\_i \in \Theta\_i$$

and the (implicit) constraint in (4.10) that $\boldsymbol{\theta}\_i$ is time-invariant is equivalent to the constraint

$$\dot{\mathbf{z}}\_{i}(t) = \begin{bmatrix} \dot{\boldsymbol{\theta}}\_{i} \\ \dot{\boldsymbol{\psi}}\_{i}(t) \end{bmatrix} = \begin{bmatrix} \mathbf{0}\_{M\_{i}} \\ \dot{\boldsymbol{\psi}}\_{i}(t) \end{bmatrix} = \mathbf{B}\boldsymbol{v}\_{i}(t).$$

Minimization of the functional

$$\int\_{0}^{T} \mathbf{z}\_{i}^{\top}(t) \mathbf{Q}\_{i}(t) \mathbf{z}\_{i}(t) + \boldsymbol{v}\_{i}^{\top}(t) \rho \mathbf{R} \boldsymbol{v}\_{i}(t) + 2 \mathbf{z}\_{i}^{\top}(t) \mathbf{N}\_{i}(t) \boldsymbol{v}\_{i}(t) \, \mathrm{d}t$$

over $\mathbf{z}\_i : [0,T] \mapsto \mathbb{R}^{M\_i+n}$ and $\boldsymbol{v}\_i : [0,T] \mapsto \mathbb{R}^n$ subject to the constraints $\dot{\mathbf{z}}\_i(t) = \mathbf{B}\boldsymbol{v}\_i(t)$ and $\mathbf{L}\mathbf{z}\_i(t) \in \Theta\_i$ for all $t \in [0,T]$ is therefore equivalent to the minimization of the objective functional of (4.10) over $\boldsymbol{\psi}\_i(t)$ and $\boldsymbol{\theta}\_i$ subject to $\boldsymbol{\theta}\_i \in \Theta\_i$ with the substitutions

$$\mathbf{z}\_{i}(t) = \begin{bmatrix} \boldsymbol{\theta}\_{i} \\ \boldsymbol{\psi}\_{i}(t) \end{bmatrix} \quad \text{and} \quad \boldsymbol{v}\_{i}(t) = \dot{\boldsymbol{\psi}}\_{i}(t).$$

The lemma result follows and the proof is complete.
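The integrand identity at the heart of this proof can be checked numerically. The following hedged sketch (Python with NumPy; the block sizes and random gradient data are assumptions for illustration) builds $\mathbf{Q}\_i(t)$, $\mathbf{N}\_i(t)$ and $\mathbf{R}$ from random matrices and confirms that the quadratic form equals the squared norm of the stacked residual:

```python
import numpy as np

# Assumed sizes for illustration: n = 3 states, m = 2 controls, M = 4 basis functions.
rng = np.random.default_rng(0)
n, m, M, rho = 3, 2, 4, 0.7

dphi_x = rng.standard_normal((n, M))   # stands in for grad_x phi_i(t)
df_x   = rng.standard_normal((n, n))   # stands in for grad_x f(t)
dphi_u = rng.standard_normal((m, M))   # stands in for grad_{u_i} phi_i(t)
df_u   = rng.standard_normal((m, n))   # stands in for grad_{u_i} f(t)

# Stacked blocks from the proof of Lemma 4.1:
Mblk = np.block([[np.sqrt(rho) * dphi_x, np.sqrt(rho) * df_x],
                 [dphi_u,                df_u]])
D = np.vstack([np.sqrt(rho) * np.eye(n), np.zeros((m, n))])

Q = Mblk.T @ Mblk                      # Q_i(t) as in (4.13)
N = Mblk.T @ D                         # N_i(t) as in (4.12)
R = np.eye(n)

z = rng.standard_normal(M + n)         # z_i = (theta_i, psi_i)
v = rng.standard_normal(n)             # v_i = psi_i'

lhs = z @ Q @ z + v @ (rho * R) @ v + 2.0 * z @ N @ v
rhs = np.linalg.norm(Mblk @ z + D @ v) ** 2
print(abs(lhs - rhs))                  # agrees up to floating-point error
```

Since the identity holds for arbitrary $\mathbf{z}\_i$ and $\boldsymbol{v}\_i$, the two objective functionals coincide pointwise in time, which is exactly the equivalence the lemma states.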

Lemma 4.1 establishes that (4.10), at the core of the proposed method, can be rewritten as (4.14) with linear dynamic constraints, a quadratic objective functional, and (partial) constraints on the function $\mathbf{L}\mathbf{z}\_i(t)$. The following lemma shows that (4.14) can be solved as an LQ optimal control problem with an unknown initial state $\mathbf{z}\_i(0)$, resulting in a quadratic program.

#### Lemma 4.2 (Quadratic Program Formulation)

Consider any player $i \in \mathcal{P}$ and suppose that $\rho > 0$ is selected such that the matrix $\mathbf{Q}\_i(t) - \mathbf{N}\_i(t)\rho^{-1}\mathbf{R}^{-1}\mathbf{N}\_i^\top(t)$ is positive semidefinite for all $t \in [0,T]$. A pair of functions $\hat{\mathbf{z}}\_i : [0,T] \mapsto \mathbb{R}^{M\_i+n}$ and $\hat{\boldsymbol{v}}\_i : [0,T] \mapsto \mathbb{R}^n$ solves the dynamic optimization problem (4.14) if and only if the initial value $\hat{\mathbf{z}}\_i(0) = \hat{\boldsymbol{\alpha}}\_i \in \mathbb{R}^{M\_i+n}$ solves the quadratic program

$$\begin{aligned} \min\_{\boldsymbol{\alpha}\_i} & \quad \boldsymbol{\alpha}\_i^\top \mathbf{P}\_i(0)\boldsymbol{\alpha}\_i \\ \text{s.t.} & \quad \mathbf{L}\boldsymbol{\alpha}\_i \in \Theta\_i \end{aligned} \tag{4.16}$$

and the pair of functions satisfies the differential equation

$$
\dot{\hat{\mathbf{z}}}_i(t) = \mathbf{B}\hat{\mathbf{v}}_i(t) = \mathbf{B}\mathbf{K}_i(t)\hat{\mathbf{z}}_i(t) \tag{4.17}
$$

for all $t \in [0,T]$, where $\mathbf{K}_i(t) := -\rho^{-1}\left(\mathbf{B}^\top \mathbf{P}_i(t) + \mathbf{N}_i^\top(t)\right)$ and $\mathbf{P}_i : [0,T] \mapsto \mathbb{R}^{(M_i+n)\times(M_i+n)}$ is the unique symmetric positive semidefinite solution to the Riccati differential equation

$$\mathbf{0} = \dot{\mathbf{P}}\_i(t) - \boldsymbol{\rho}^{-1}(\mathbf{P}\_i(t)\mathbf{B} + \mathbf{N}\_i(t))(\mathbf{B}^\top \mathbf{P}\_i^\top(t) + \mathbf{N}\_i^\top(t)) + \mathbf{Q}\_i(t) \tag{4.18}$$

for $t \in [0,T]$ with terminal boundary condition $\mathbf{P}_i(T) = \mathbf{0}$.

#### Proof:

Consider any player $i \in \mathcal{P}$. We first note that, given a function $\mathbf{v}_i : [0,T] \mapsto \mathbb{R}^n$ together with an initial value $\mathbf{z}_i(0) = \boldsymbol{\alpha}_i \in \mathbb{R}^{M_i+n}$ with $\mathbf{L}\boldsymbol{\alpha}_i \in \Theta_i$, we may solve the differential equation $\dot{\mathbf{z}}_i(t) = \mathbf{B}\mathbf{v}_i(t)$ for the unique function $\mathbf{z}_i : [0,T] \mapsto \mathbb{R}^{M_i+n}$. The constraints in the dynamic optimization problem (4.14) from Lemma 4.1 therefore imply that the optimization in (4.14) may be carried out only over $\mathbf{z}_i(0)$ and $\mathbf{v}_i$. Namely, (4.14) is equivalent to the unknown-initial-state optimal control problem

$$\begin{aligned}
\min_{\boldsymbol{\alpha}_i}\ \min_{\mathbf{v}_i}\quad & \int_0^T \mathbf{z}_i^\top(t)\mathbf{Q}_i(t)\mathbf{z}_i(t) + \mathbf{v}_i^\top(t)\rho\mathbf{R}\mathbf{v}_i(t) + 2\mathbf{z}_i^\top(t)\mathbf{N}_i(t)\mathbf{v}_i(t)\,\mathrm{d}t \\
\text{s.t.}\quad & \dot{\mathbf{z}}_i(t) = \mathbf{B}\mathbf{v}_i(t),\ t \in [0,T] \\
& \mathbf{z}_i(0) = \boldsymbol{\alpha}_i \\
& \mathbf{L}\boldsymbol{\alpha}_i \in \Theta_i
\end{aligned}\tag{4.19}$$

For any $\boldsymbol{\alpha}_i \in \mathbb{R}^{M_i+n}$, the inner optimization problem over the function $\mathbf{v}_i$ in (4.19) is a standard LQ optimal control problem with cross-product terms.

Under the positive definiteness of $\mathbf{R} = \mathbf{I}_n$ as well as $\rho > 0$ and the positive semidefiniteness of $\mathbf{Q}_i(t) - \mathbf{N}_i(t)\rho^{-1}\mathbf{R}^{-1}\mathbf{N}_i^\top(t)$, Section 3.4 of [AM89] gives that for any $\mathbf{z}_i(0) = \boldsymbol{\alpha}_i \in \mathbb{R}^{M_i+n}$, the unique function solving the inner optimization problem over $\mathbf{v}_i$ in (4.19) is

$$
\hat{\mathbf{v}}_i(t) = \mathbf{K}_i(t)\hat{\mathbf{z}}_i(t) \tag{4.20}
$$

for all $t \in [0,T]$, where $\mathbf{K}_i(t) = -\rho^{-1}\left(\mathbf{B}^\top \mathbf{P}_i(t) + \mathbf{N}_i^\top(t)\right)$ and $\mathbf{P}_i : [0,T] \mapsto \mathbb{R}^{(M_i+n)\times(M_i+n)}$ is the unique symmetric positive semidefinite solution to the Riccati differential equation (4.18) with $\mathbf{P}_i(T) = \mathbf{0}$ (see also [Kuč73, Kal64]). Section 3.4 of [AM89] also gives that the minimum value of the inner optimization problem over $\mathbf{v}_i$ in (4.19) is

$$\boldsymbol{\alpha}\_{i}^{\top}\boldsymbol{P}\_{i}(0)\boldsymbol{\alpha}\_{i}\tag{4.21}$$

for any initial state $\mathbf{z}_i(0) = \boldsymbol{\alpha}_i$. The function $\hat{\mathbf{z}}_i$ solving the inner optimization of (4.19) satisfies $\dot{\hat{\mathbf{z}}}_i(t) = \mathbf{B}\mathbf{K}_i(t)\hat{\mathbf{z}}_i(t)$ for any initial state $\boldsymbol{\alpha}_i$. Consequently, the unknown-initial-state optimal control problem (4.19) simplifies to the quadratic program (4.16). It follows that the pair of functions $(\hat{\mathbf{z}}_i, \hat{\mathbf{v}}_i)$ solves (4.14) if the functions satisfy the differential equation (4.17) and $\hat{\mathbf{z}}_i(0) = \hat{\boldsymbol{\alpha}}_i$ solves (4.16).

In the following, the "only if" part of the lemma assertion is proved, i.e. that if the pair of functions $(\hat{\mathbf{z}}_i, \hat{\mathbf{v}}_i)$ solves (4.14), then it satisfies the differential equation (4.17) and $\hat{\mathbf{z}}_i(0) = \hat{\boldsymbol{\alpha}}_i$ solves the quadratic program (4.16). We first note that the function $\hat{\mathbf{v}}_i$ solving the inner optimization problem over $\mathbf{v}_i$ in (4.19) is unique and given by (4.20) for any given $\boldsymbol{\alpha}_i \in \mathbb{R}^{M_i+n}$. Thus, if the pair of functions $(\hat{\mathbf{z}}_i, \hat{\mathbf{v}}_i)$ solves (4.14), then it must satisfy the differential equation (4.17). Since the unique form of $\hat{\mathbf{v}}_i$ implies that (4.19) reduces to the quadratic program (4.16), $\hat{\mathbf{z}}_i(0) = \hat{\boldsymbol{\alpha}}_i$ must solve (4.16) if the pair of functions $(\hat{\mathbf{z}}_i, \hat{\mathbf{v}}_i)$ solves (4.14). The lemma result follows and the proof is complete.

Lemma 4.2 allows us to solve the quadratic program (4.16) for the initial value $\hat{\mathbf{z}}_i(0) = \hat{\boldsymbol{\alpha}}_i$ instead of solving (4.14) for the functions $\hat{\mathbf{z}}_i$ over the entire interval $t \in [0,T]$. Recalling Lemma 4.1 and the definition of the vectors $\mathbf{z}_i(0)$, we note that the initial value $\hat{\mathbf{z}}_i(0) = \hat{\boldsymbol{\alpha}}_i$ corresponds to the vector

$$
\hat{\boldsymbol{\alpha}}\_i = \begin{bmatrix} \hat{\boldsymbol{\theta}}\_i^\top & \hat{\boldsymbol{\psi}}\_i^\top(0) \end{bmatrix}^\top \tag{4.22}
$$

where $\hat{\boldsymbol{\theta}}_i$ and $\hat{\boldsymbol{\psi}}_i$ are solutions to the residual-based method (4.10). Together, Lemmas 4.1 and 4.2 therefore allow us to sidestep the difficult problem of directly solving and analyzing the original optimization problem (4.10) and instead solve the quadratic program (4.16) for the parameters $\hat{\boldsymbol{\theta}}_i = \mathbf{L}\hat{\boldsymbol{\alpha}}_i$.
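Numerically, the core computation behind Lemma 4.2 is the backward integration of the Riccati differential equation (4.18) from the terminal condition $\mathbf{P}_i(T) = \mathbf{0}$. The following sketch is illustrative only (the function name `solve_rde` is hypothetical) and assumes the time-varying matrices $\mathbf{Q}_i$ and $\mathbf{N}_i$ are supplied as callables:

```python
import numpy as np
from scipy.integrate import solve_ivp

def solve_rde(Q, N, B, rho, T, dim):
    """Integrate the Riccati differential equation (4.18) backward in time:
    Pdot(t) = rho^{-1} (P(t) B + N(t)) (B^T P(t) + N(t)^T) - Q(t),  P(T) = 0.
    Q and N are callables t -> ndarray; B is a constant (dim x n) matrix."""
    def rhs(t, p_flat):
        P = p_flat.reshape(dim, dim)
        M = P @ B + N(t)
        return ((M @ M.T) / rho - Q(t)).ravel()

    # integrate from t = T (terminal condition P(T) = 0) down to t = 0
    sol = solve_ivp(rhs, (T, 0.0), np.zeros(dim * dim),
                    rtol=1e-9, atol=1e-12, dense_output=True)
    return lambda t: sol.sol(t).reshape(dim, dim)
```

As a sanity check, in the scalar case with constant $Q = q$, $N = 0$, $B = 1$ and $\rho = 1$, the backward solution is $P(t) = \sqrt{q}\,\tanh\!\left(\sqrt{q}\,(T - t)\right)$.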

#### Remark 4.1:

The choice of $\rho = 1$ is always sufficient to ensure that the expression $\mathbf{Q}_i(t) - \mathbf{N}_i(t)\rho^{-1}\mathbf{R}^{-1}\mathbf{N}_i^\top(t)$ is positive semidefinite for all $t \in [0,T]$ since

$$\begin{aligned} \mathbf{Q}\_i(t) &- \mathbf{N}\_i(t) \mathbf{R}^{-1} \mathbf{N}\_i^\top(t) \\ &= \mathbf{Q}\_i(t) - \mathbf{N}\_i(t) \mathbf{N}\_i^\top(t) \\ &= \begin{bmatrix} \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i^\top(t) \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i(t) & \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i^\top(t) \nabla\_{\boldsymbol{u}\_i} f(t) \\ \nabla\_{\boldsymbol{u}\_i} \boldsymbol{f}^\top(t) \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i(t) & \nabla\_{\boldsymbol{u}\_i} \boldsymbol{f}^\top(t) \nabla\_{\boldsymbol{u}\_i} f(t) \end{bmatrix} \\ &= \begin{bmatrix} \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i(t) & \nabla\_{\boldsymbol{u}\_i} f(t) \end{bmatrix}^\top \begin{bmatrix} \nabla\_{\boldsymbol{u}\_i} \boldsymbol{\Phi}\_i(t) & \nabla\_{\boldsymbol{u}\_i} f(t) \end{bmatrix}. \end{aligned}$$

with the first equality holding due to the definition of $\mathbf{R}$, and the second and third equalities following by substituting the definitions of $\mathbf{Q}_i(t)$ and $\mathbf{N}_i(t)$. Other values of $\rho$ may result in $\mathbf{Q}_i(t) - \mathbf{N}_i(t)\rho^{-1}\mathbf{R}^{-1}\mathbf{N}_i^\top(t)$ not being positive semidefinite, which can lead to multiple solutions of (4.18).

In the following, the results of Lemmas 4.1 and 4.2 shall be used to establish novel explicit expressions for the parameters $\hat{\boldsymbol{\theta}}_i$ that solve the inverse differential game problem. Furthermore, sufficient conditions shall be presented under which the parameters $\hat{\boldsymbol{\theta}}_i$ are guaranteed to be unique and identical to the original parameters $\boldsymbol{\theta}_i^*$ up to a multiplying positive factor.

#### 4.3.2 Sufficient Conditions for the Uniqueness of the Solution

To present the main result on the solution of the residual-based method (4.10), consider the matrix $\mathbf{P}_i(0)$ of the optimization problem (4.16) and define

$$
\bar{P}\_i := \begin{bmatrix} P\_{i,(2,2)}(0) & \dots & P\_{i,(2,M\_i+n)}(0) \\ P\_{i,(3,2)}(0) & \dots & P\_{i,(3,M\_i+n)}(0) \\ \vdots & \ddots & \vdots \\ P\_{i,(M\_i+n,2)}(0) & \dots & P\_{i,(M\_i+n,M\_i+n)}(0) \end{bmatrix} \tag{4.23}$$

as the principal submatrix of $\mathbf{P}_i(0)$ formed by deleting the first row and column of $\mathbf{P}_i(0)$, and

$$\bar{\mathbf{p}}_i := \begin{bmatrix} P_{i,(2,1)}(0) & P_{i,(3,1)}(0) & \dots & P_{i,(M_i+n,1)}(0) \end{bmatrix}^\top \tag{4.24}$$


which denotes the first column of $\mathbf{P}_i(0)$ with its first element deleted. Furthermore, let

$$
\bar{\mathbf{P}}_i = \mathbf{U}_i \boldsymbol{\Sigma}_i^{\bar P} \mathbf{U}_i^\top
$$

be the singular value decomposition (SVD) of $\bar{\mathbf{P}}_i$, where $\boldsymbol{\Sigma}_i^{\bar P} \in \mathbb{R}^{(M_i+n-1)\times(M_i+n-1)}$ is a diagonal matrix, and

$$\mathbf{U}\_{i} = \begin{bmatrix} \mathbf{U}\_{i}^{11} & \mathbf{U}\_{i}^{12} \\ \mathbf{U}\_{i}^{21} & \mathbf{U}\_{i}^{22} \end{bmatrix} \in \mathbb{R}^{(M\_{i} + n - 1) \times (M\_{i} + n - 1)} \tag{4.25}$$

is a block matrix with submatrices $\mathbf{U}_i^{11} \in \mathbb{R}^{(M_i-1)\times r_i^{\bar P}}$, $\mathbf{U}_i^{12} \in \mathbb{R}^{(M_i-1)\times(M_i+n-1-r_i^{\bar P})}$, $\mathbf{U}_i^{21} \in \mathbb{R}^{n\times r_i^{\bar P}}$ and $\mathbf{U}_i^{22} \in \mathbb{R}^{n\times(M_i+n-1-r_i^{\bar P})}$. Finally, $\bar{\mathbf{P}}_i^+$ and $r_i^{\bar P}$ denote the pseudoinverse and the rank of the submatrix $\bar{\mathbf{P}}_i$, respectively. To present the main result, we recall the introduced parameter set

$$\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}\tag{4.26}$$

which excludes the trivial solution $\hat{\boldsymbol{\theta}}_i = \mathbf{0}$ as well as non-uniqueness due to scaling. As discussed in Section 4.2, there is no loss of generality in this parameter set since the ordering and scaling of the basis functions and cost function parameters are arbitrary.

#### Theorem 4.1 (General Solution of the Residual-Based Method)

Consider any player $i \in \mathcal{P}$, and let $\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$. All of the parameter vectors $\hat{\boldsymbol{\theta}}_i \in \Theta_i$ corresponding to all solutions $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ of the proposed method (4.10) are of the form

$$
\hat{\boldsymbol{\theta}}\_i = \mathbf{L}\hat{\boldsymbol{\alpha}}\_i \tag{4.27}
$$

where <sup>α</sup><sup>ˆ</sup> i <sup>=</sup> - <sup>1</sup> α¯ˆ <sup>⊤</sup> i ⊤ ∈ R <sup>M</sup>i+<sup>n</sup> are (potentially non-unique) solutions to the quadratic program (4.16) with α¯ˆ i <sup>∈</sup> <sup>R</sup> Mi+n−<sup>1</sup> given by

$$
\hat{\bar{\alpha}}\_i = -\bar{P}\_i^+ \bar{p}\_i + U\_i \begin{bmatrix} \mathbf{0}\_r \\ \mathbf{b} \end{bmatrix} \tag{4.28}
$$

where $\mathbf{0}_r \in \mathbb{R}^{r_i^{\bar P}}$ and $\mathbf{b} \in \mathbb{R}^{M_i+n-1-r_i^{\bar P}}$ is arbitrary. Furthermore, if either $\mathbf{U}_i^{12} = \mathbf{0}$ or $\bar{\mathbf{P}}_i$ has full rank, i.e. $r_i^{\bar P} = M_i + n - 1$, then all solutions $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ to the proposed method (4.10) correspond to the single unique parameter vector $\hat{\boldsymbol{\theta}}_i \in \Theta_i$ given by

$$
\hat{\boldsymbol{\theta}}\_{i} = \boldsymbol{L} \begin{bmatrix} 1 \\ -\bar{\boldsymbol{P}}\_{i}^{+} \bar{\boldsymbol{p}}\_{i} \end{bmatrix}. \tag{4.29}
$$

#### Proof:

Lemmas 4.1 and 4.2 together imply that all solutions to the original optimization problem (4.10) of the proposed residual-based method have parameter vectors given by $\hat{\boldsymbol{\theta}}_i = \mathbf{L}\hat{\boldsymbol{\alpha}}_i$, where $\hat{\boldsymbol{\alpha}}_i$ is a solution to the quadratic program (4.16). We thus proceed by analyzing (4.16).

For any $\boldsymbol{\alpha}_i \in \mathbb{R}^{M_i+n}$ with $\mathbf{L}\boldsymbol{\alpha}_i \in \Theta_i$, where $\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$, we have that $\boldsymbol{\alpha}_i = \begin{bmatrix}1 & \bar{\boldsymbol{\alpha}}_i^\top\end{bmatrix}^\top$ where $\bar{\boldsymbol{\alpha}}_i \in \mathbb{R}^{M_i+n-1}$, and so

$$\begin{aligned} \boldsymbol{\alpha}\_{\boldsymbol{i}}^{\top} \boldsymbol{P}\_{\boldsymbol{i}}(0) \boldsymbol{\alpha}\_{\boldsymbol{i}} &= \begin{bmatrix} 1 & \bar{\boldsymbol{\alpha}}\_{\boldsymbol{i}}^{\top} \end{bmatrix} \boldsymbol{P}\_{\boldsymbol{i}}(0) \begin{bmatrix} 1 \\ \bar{\boldsymbol{\alpha}}\_{\boldsymbol{i}} \end{bmatrix} \\ &= \boldsymbol{P}\_{\boldsymbol{i},(1,1)}(0) + \bar{\boldsymbol{\alpha}}\_{\boldsymbol{i}}^{\top} \bar{\boldsymbol{P}}\_{\boldsymbol{i}} \bar{\boldsymbol{\alpha}}\_{\boldsymbol{i}} + 2 \bar{\boldsymbol{\alpha}}\_{\boldsymbol{i}}^{\top} \bar{\boldsymbol{p}}\_{\boldsymbol{i}} \end{aligned}$$


where $P_{i,(1,1)}(0)$ is the first element of $\mathbf{P}_i(0)$. All solutions $\hat{\boldsymbol{\alpha}}_i$ of the constrained quadratic program (4.16) with $\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$ are therefore of the form $\hat{\boldsymbol{\alpha}}_i = \begin{bmatrix}1 & \hat{\bar{\boldsymbol{\alpha}}}_i^\top\end{bmatrix}^\top$, where $\hat{\bar{\boldsymbol{\alpha}}}_i \in \mathbb{R}^{M_i+n-1}$ are solutions to the unconstrained quadratic program

$$\min_{\bar{\boldsymbol{\alpha}}_i}\ \frac{1}{2}\bar{\boldsymbol{\alpha}}_i^\top \bar{\mathbf{P}}_i \bar{\boldsymbol{\alpha}}_i + \bar{\boldsymbol{\alpha}}_i^\top \bar{\mathbf{p}}_i.$$

We note that $\mathbf{P}_i(0)$ is symmetric positive semidefinite, which guarantees the existence of a solution of (4.16). Furthermore, this implies that $\bar{\mathbf{P}}_i$ is also symmetric positive semidefinite. With these conditions fulfilled, [Gal11, Proposition 15.2] gives that the equivalent unconstrained quadratic program is solved by any $\hat{\bar{\boldsymbol{\alpha}}}_i$ satisfying

$$
\hat{\bar{\boldsymbol{\alpha}}}\_i = -\bar{\boldsymbol{P}}\_i^+ \bar{\boldsymbol{p}}\_i + \boldsymbol{U}\_i \begin{bmatrix} \mathbf{0}\_r \\ \mathbf{b} \end{bmatrix},
$$

for any $\mathbf{b} \in \mathbb{R}^{M_i+n-1-r_i^{\bar P}}$. The first theorem assertion (4.27) follows.

Now, to prove the second theorem assertion, we note that if $\mathbf{U}_i^{12} = \mathbf{0}$, then

$$\begin{aligned} \hat{\bar{\boldsymbol{\alpha}}}_i &= -\bar{\mathbf{P}}_i^+ \bar{\mathbf{p}}_i + \begin{bmatrix} \mathbf{U}_i^{11} & \mathbf{0} \\ \mathbf{U}_i^{21} & \mathbf{U}_i^{22} \end{bmatrix} \begin{bmatrix} \mathbf{0}_r \\ \mathbf{b} \end{bmatrix} \\ &= -\bar{\mathbf{P}}_i^+ \bar{\mathbf{p}}_i + \begin{bmatrix} \mathbf{0}_{M_i-1} \\ \mathbf{U}_i^{22}\mathbf{b} \end{bmatrix} \end{aligned}$$

for any $\mathbf{b} \in \mathbb{R}^{M_i+n-1-r_i^{\bar P}}$, where $\mathbf{U}_i^{22}\mathbf{b} \in \mathbb{R}^n$. Clearly, if $r_i^{\bar P} = M_i + n - 1$, then we also have that

$$
\hat{\bar{\boldsymbol{\alpha}}}_i = -\bar{\mathbf{P}}_i^+ \bar{\mathbf{p}}_i.
$$

Thus, if either $\mathbf{U}_i^{12} = \mathbf{0}$ or $r_i^{\bar P} = M_i + n - 1$, then the first $M_i - 1$ components of $\hat{\bar{\boldsymbol{\alpha}}}_i$ are invariant with respect to the free vector $\mathbf{b} \in \mathbb{R}^{M_i+n-1-r_i^{\bar P}}$, and so all solutions $\hat{\boldsymbol{\alpha}}_i = \begin{bmatrix}1 & \hat{\bar{\boldsymbol{\alpha}}}_i^\top\end{bmatrix}^\top$ of the constrained quadratic program (4.16) satisfy

$$L\hat{\boldsymbol{\alpha}}\_{i} = L\begin{bmatrix} 1 \\ -\bar{\boldsymbol{P}}\_{i}^{+} \bar{\boldsymbol{p}}\_{i} \end{bmatrix}$$

due to the definition of $\mathbf{L}$ (cf. (4.11)). The second theorem assertion follows since $\hat{\boldsymbol{\theta}}_i = \mathbf{L}\hat{\boldsymbol{\alpha}}_i$, which completes the proof.

Theorem 4.1 establishes that the conditions $\mathbf{U}_i^{12} = \mathbf{0}$ and $r_i^{\bar P} = M_i + n - 1$ are both sufficient for ensuring the uniqueness of the player cost-functional parameters $\hat{\boldsymbol{\theta}}_i$ computed with the proposed method (4.10). These conditions will not hold when the inverse differential game problem is ill-posed, for example on short time horizons $T$, due to degenerate system dynamics, or when the trajectories are uninformative (e.g. when the trajectories $\mathbf{x}^*(t)$ and $(\mathbf{u}_1^*(t), \dots, \mathbf{u}_N^*(t))$ correspond to a dynamic equilibrium of the dynamics in the sense that $\dot{\mathbf{x}}(t) = \mathbf{0}$ for all $t \in [0,T]$). The conditions $\mathbf{U}_i^{12} = \mathbf{0}$ and $r_i^{\bar P} = M_i + n - 1$ may be interpreted as analogous to the persistence-of-excitation conditions known from parameter estimation and adaptive control.
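To make the role of the free vector $\mathbf{b}$ in (4.28) concrete, the following sketch (illustrative only; the matrix $\bar{P}$ below is synthetic, not taken from a game) computes the solution family of the reduced quadratic program $\min_{\bar{\boldsymbol{\alpha}}} \tfrac{1}{2}\bar{\boldsymbol{\alpha}}^\top\bar{\mathbf{P}}\bar{\boldsymbol{\alpha}} + \bar{\boldsymbol{\alpha}}^\top\bar{\mathbf{p}}$ and checks that every member of the family attains the same optimal value:

```python
import numpy as np

def qp_solution_family(P_bar, p_bar):
    """General solution (4.28) of min 0.5 a^T P_bar a + a^T p_bar for symmetric
    positive semidefinite P_bar: a = -P_bar^+ p_bar + U [0; b], where the last
    columns of U span the null space of P_bar."""
    U, s, _ = np.linalg.svd(P_bar)
    r = int(np.sum(s > 1e-10 * s.max()))      # numerical rank
    particular = -np.linalg.pinv(P_bar) @ p_bar
    null_basis = U[:, r:]                     # directions of non-uniqueness
    return particular, null_basis

# synthetic rank-deficient symmetric PSD matrix and compatible linear term
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))
P_bar = A.T @ A                               # 4x4, rank 2
p_bar = P_bar @ rng.standard_normal(4)        # lies in range(P_bar)

particular, null_basis = qp_solution_family(P_bar, p_bar)
obj = lambda a: 0.5 * a @ P_bar @ a + a @ p_bar

a0 = particular
a1 = particular + null_basis @ rng.standard_normal(null_basis.shape[1])
# adding null-space directions leaves the optimal value unchanged
```

The invariance mirrors the theorem: moving along the null space of $\bar{\mathbf{P}}_i$ changes the minimizer but not the attained minimum.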

The following corollary establishes that, if the ill-posedness of the inverse differential game problem is due only to an unknown scaling factor, then $\mathbf{U}_i^{12} = \mathbf{0}$ and $r_i^{\bar P} = M_i + n - 1$ become sufficient conditions for ensuring that the residual-based method (4.10) yields unique player cost-functional parameters which differ from the true parameters $\boldsymbol{\theta}_i^*$ only by an unknown scaling factor $c_i > 0$, provided that Assumption 4.2 holds.

#### Corollary 4.1 (Uniqueness up to a Scaling Factor)

Suppose that Assumption 4.2 holds. Consider any player $i \in \mathcal{P}$, and let $\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$. If either $\mathbf{U}_i^{12} = \mathbf{0}$ or $r_i^{\bar P} = M_i + n - 1$, and if there exists a $c_i > 0$ such that $c_i\boldsymbol{\theta}_i^* \in \Theta_i$, then

$$
\hat{\boldsymbol{\theta}}\_{i} = \boldsymbol{L} \begin{bmatrix} 1 \\ -\bar{\boldsymbol{P}}\_{i}^{+} \bar{\boldsymbol{p}}\_{i} \end{bmatrix} = \boldsymbol{c}\_{i} \boldsymbol{\theta}\_{i}^{\*} \tag{4.30}
$$

is the unique parameter vector corresponding to all optimal solutions $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ of the residual-based method (4.10).

#### Proof:

The necessary conditions for open-loop Nash equilibria of Theorem 3.1, i.e. (4.4) and (4.6), imply that $(\boldsymbol{\psi}_i, c_i\boldsymbol{\theta}_i^*)$ (with $\boldsymbol{\psi}_i$ solving (4.4) under $\boldsymbol{\psi}_i(T) = \mathbf{0}$ and $\boldsymbol{\theta}_i = c_i\boldsymbol{\theta}_i^*$) is always a solution to the proposed method (4.10) under Assumption 4.2. Since the conditions of the corollary give that $c_i\boldsymbol{\theta}_i^*$ lies in $\Theta_i$, and since the second assertion of Theorem 4.1 implies the uniqueness of the parameter vector $\hat{\boldsymbol{\theta}}_i \in \Theta_i$ corresponding to all optimal solutions of the residual-based method (4.10) if either $\mathbf{U}_i^{12} = \mathbf{0}$ or $r_i^{\bar P} = M_i + n - 1$, we must have $\hat{\boldsymbol{\theta}}_i = c_i\boldsymbol{\theta}_i^*$ whenever either condition holds. The corollary assertion follows.

In the following, the implications of each condition of Theorem 4.1 for the originally posed residual-based method (4.10) are analyzed.

#### Full-Rank Condition

In Corollary 4.1 and Theorem 4.1, if $r_i^{\bar P} = M_i + n - 1$ holds, then both the player cost-functional parameters $\hat{\boldsymbol{\theta}}_i$ and the costate functions $\hat{\boldsymbol{\psi}}_i$ solving (4.10) are unique. To see that a unique pair $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ solves (4.10) when $r_i^{\bar P} = M_i + n - 1$, we note that the first assertion of Theorem 4.1, specifically (4.28), implies that the vector $\hat{\boldsymbol{\alpha}}_i = \begin{bmatrix}1 & \hat{\bar{\boldsymbol{\alpha}}}_i^\top\end{bmatrix}^\top$ is the unique solution to the quadratic program (4.16), because the free vector $\mathbf{b}$ is zero-dimensional. Now, since Lemmas 4.1 and 4.2 imply that the vector $\hat{\boldsymbol{\alpha}}_i = \hat{\mathbf{z}}_i(0)$ corresponds to $\begin{bmatrix}\hat{\boldsymbol{\theta}}_i^\top & \hat{\boldsymbol{\psi}}_i^\top(0)\end{bmatrix}^\top$, and since Lemma 4.2 implies a unique function $\hat{\boldsymbol{\psi}}_i$ for each initial condition $\hat{\boldsymbol{\psi}}_i(0)$, the pair $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ is indeed the unique solution to (4.10) when $r_i^{\bar P} = M_i + n - 1$.

#### SVD Matrix Condition

The condition $\mathbf{U}_i^{12} = \mathbf{0}$ can hold even when $r_i^{\bar P} < M_i + n - 1$. If $\mathbf{U}_i^{12} = \mathbf{0}$ but $r_i^{\bar P} < M_i + n - 1$, then the second assertion of Theorem 4.1 implies that all pairs $(\hat{\boldsymbol{\psi}}_i, \hat{\boldsymbol{\theta}}_i)$ solving (4.10) share the unique parameter vector $\hat{\boldsymbol{\theta}}_i$ given by (4.29) but may not share a common costate function $\hat{\boldsymbol{\psi}}_i(t)$. The condition $\mathbf{U}_i^{12} = \mathbf{0}$ prevents the elements of $\hat{\boldsymbol{\alpha}}_i$ corresponding to $\hat{\boldsymbol{\theta}}_i$ (but not those corresponding to $\hat{\boldsymbol{\psi}}_i(0)$) from depending on the free vector $\mathbf{b}$ in (4.28).

#### 4.3.3 Algorithm and Example

In light of Theorem 4.1 and the role of the conditions $\mathbf{U}_i^{12} = \mathbf{0}$ and $r_i^{\bar P} = M_i + n - 1$, the residual-based method (4.10) can be implemented for each player $i \in \mathcal{P}$ with the following algorithm:

Algorithm 1 Residual-based method for player i in an inverse OL differential game.

Input: State and control trajectories $\mathbf{x}^*(t)$ and $(\mathbf{u}_1^*(t), \dots, \mathbf{u}_N^*(t))$, dynamics $\mathbf{f}$, basis functions $\boldsymbol{\phi}_i$, and parameter constraint set $\Theta_i = \{\boldsymbol{\theta}_i \in \mathbb{R}^{M_i} \mid \theta_{i,(1)} = 1\}$.

Output: Computed player $i$ cost-functional parameters $\hat{\boldsymbol{\theta}}_i$.

1: Compute the matrices $\mathbf{Q}_i(t)$ and $\mathbf{N}_i(t)$ along the observed trajectories.
2: Solve the Riccati differential equation (4.18) backward in time with $\mathbf{P}_i(T) = \mathbf{0}$.
3: Form the submatrix $\bar{\mathbf{P}}_i$ according to (4.23) and the vector $\bar{\mathbf{p}}_i$ according to (4.24).
4: Compute the pseudoinverse $\bar{\mathbf{P}}_i^+$.
5: Compute the rank $r_i^{\bar P}$ of $\bar{\mathbf{P}}_i$.
6: if $r_i^{\bar P} = M_i + n - 1$ then
7: return unique $\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i$ given by (4.29).
8: else
9: Compute $\mathbf{U}_i$ and $\mathbf{U}_i^{12}$ in (4.25) through the SVD of $\bar{\mathbf{P}}_i$.
10: if $\mathbf{U}_i^{12} = \mathbf{0}$ then
11: return unique $\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i$ given by (4.29).
12: else
13: return any $\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i$ from (4.27) with any $\mathbf{b} \in \mathbb{R}^{M_i+n-1-r_i^{\bar P}}$.
14: end if
15: end if
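Under the assumption that $\bar{\mathbf{P}}_i$ and $\bar{\mathbf{p}}_i$ have already been extracted from the Riccati solution, the decision logic of Algorithm 1 can be sketched as follows (illustrative only; `recover_parameters` is a hypothetical name, and $\mathbf{L}$ is taken as the selection of the first $M_i$ entries of $\boldsymbol{\alpha}_i$):

```python
import numpy as np

def recover_parameters(P_bar, p_bar, M_i, n, tol=1e-8):
    """Decision logic of Algorithm 1: return (theta, unique_flag), where theta
    contains the first M_i entries of alpha = [1, -P_bar^+ p_bar] (cf. (4.29))."""
    assert P_bar.shape == (M_i + n - 1, M_i + n - 1)
    U, s, _ = np.linalg.svd(P_bar)
    r = int(np.sum(s > tol * s.max()))             # rank of P_bar
    alpha_bar = -np.linalg.pinv(P_bar) @ p_bar
    theta = np.concatenate(([1.0], alpha_bar[:M_i - 1]))
    if r == M_i + n - 1:                           # full-rank condition
        return theta, True
    U12 = U[:M_i - 1, r:]                          # upper-right SVD block (4.25)
    if np.allclose(U12, 0.0, atol=tol):            # SVD matrix condition
        return theta, True
    return theta, False   # theta is only one member of the family (4.27)
```

The `unique_flag` distinguishes the cases of lines 7/11 (unique parameters) from line 13 (a solution family parametrized by the free vector $\mathbf{b}$).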

Hence, the core of the proposed residual-based method with Algorithm 1 is the solution of an RDE, and we thus avoid the need to solve nested differential game or optimal control problems. Furthermore, we are free to compute the cost function parameters of each player separately (rather than as part of the same optimization). Finally, the presented method gives conditions under which the computed parameters are unique in the parameter set $\Theta_i$. These conditions hold for $N$-player inverse differential games and are therefore valid for the special case of (single-player) inverse optimal control as well.

To conclude, an example illustrating the results of this section is presented.

#### Example 4.1:

Consider an optimal control problem, i.e. a one-player differential game, with system dynamics

$$
\dot{\mathbf{x}}(t) = u\_1(t) \tag{4.31}
$$

where $u_1(t) \in \mathbb{R}$ and with an initial state value $x_0 = 1$. Let the cost function be of the form (4.2) with $T = 3$ and the basis functions

$$\phi\_1\left(\mathbf{x}(t), u\_1(t), t\right) = \begin{bmatrix} u\_1^2(t) & \mathbf{x}^2(t) & \mathbf{x}(t)u\_1(t) \end{bmatrix}^\top \tag{4.32}$$

and cost function parameters

$$
\boldsymbol{\theta}\_1 = \boldsymbol{\theta}\_1^\* = \begin{bmatrix} 1 & 5 & 2 \end{bmatrix}^\top \,. \tag{4.33}
$$

The optimal control problem is solved for the optimal state and control trajectories in Figure 4.1 by applying the minimum principle and solving the coupled differential equations numerically. These trajectories are the unique solution to the problem since $\boldsymbol{\theta}_1^*$ satisfies the positive definiteness and positive semidefiniteness conditions of [AM89, Section 3.4]. To solve the inverse optimal control problem, Algorithm 1 is applied. The Riccati equation leads to the submatrix

$$
\overline{P}\_1 = \begin{bmatrix}
0.4614 & -0.6126 & -0.6126 \\
\end{bmatrix} \tag{4.34}
$$

which is rank deficient. Computing the SVD of $\bar{P}_1$ yields

$$\mathbf{U}\_{1} = \begin{bmatrix} -0.4113 & -0.9115 & 0.0000 \\ 0.6445 & -0.2909 & -0.7071 \\ 0.6445 & -0.2909 & 0.7071 \end{bmatrix} \tag{4.35}$$

and therefore $U_1^{12} = \begin{bmatrix}0 & -0.7071\end{bmatrix}^\top \neq \mathbf{0}$, which implies that there is no unique parameter vector $\boldsymbol{\theta}_1 \in \Theta_1$ solving the inverse optimal control problem. Thus, the general solution is given by (4.28). By inspecting this solution, we observe that the first entry of $\hat{\bar{\boldsymbol{\alpha}}}_1$, which corresponds to $\theta_{1,(2)}$, can be uniquely recovered (cf. the zero first entry of $U_1^{12}$). Nevertheless, the free parameter $b \in \mathbb{R}$ affects the parameter $\theta_{1,(3)}$, leading to the non-uniqueness. Using (4.28), the general solution for $\boldsymbol{\theta}_1$ can be formulated as

$$\boldsymbol{\theta}\_{1} = \begin{bmatrix} 1 \\ 5 \\ 4.467 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ -0.7071 \end{bmatrix} b, \quad b \in \mathbb{R}. \tag{4.36}$$

Indeed, by solving the optimal control problem again with (4.36) for any $b \in \mathbb{R}$, it is confirmed that the optimal trajectories $x^*(t)$ and $u_1^*(t)$ are unaffected by the choice of $b$.
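The block extraction behind this conclusion can be sketched with the numbers from (4.35), where $M_1 = 3$, $n = 1$ and $r_1^{\bar P} = 2$ (illustrative only):

```python
import numpy as np

# U_1 as printed in (4.35)
U1 = np.array([[-0.4113, -0.9115,  0.0000],
               [ 0.6445, -0.2909, -0.7071],
               [ 0.6445, -0.2909,  0.7071]])
M1, n, r = 3, 1, 2

U12 = U1[:M1 - 1, r:]   # upper-right block of the partition (4.25)
# U12 = [[0.0], [-0.7071]] is nonzero, so the parameters are not unique;
# its zero first entry shows that theta_{1,(2)} is nevertheless recovered.
```

Checking `np.allclose(U12, 0.0)` yields `False` here, which is exactly the branch of Algorithm 1 that returns the solution family (4.27).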

Figure 4.1: State and control trajectories solving the optimal control problem of Example 4.1

# 4.4 Inverse Feedback Differential Games

The inverse differential game problem assuming a feedback information structure consists in finding the cost function parameters of all players such that the observed trajectories correspond to a feedback Nash equilibrium.

As already noted in Section 3.6.2, the Nash solution of one player depends on the Nash controls of all other players. More specifically, the differential equation of the costate variables corresponding to player $i$ depends on the other controls since, due to the closed-loop information structure, these depend on the state variables. In other words, the control $\mathbf{u}_i(t)$ is determined by a feedback strategy in the form of a control law $\mathbf{u}_i(t) = \boldsymbol{\gamma}_i(\mathbf{x},t)$. As discussed in Section 3.6.2, the conditions presented in Theorem 3.1 now include the new costate equation

$$\dot{\boldsymbol{\psi}}_i(t) = -\nabla_{\mathbf{x}} H_i\left(\boldsymbol{\psi}_i(t), \mathbf{x}^*(t), \mathbf{u}_i^*(t), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x}^*, t), t\right)$$

in order to account for the dependency of the other players' strategies on the state variable.

#### 4.4.1 Residual-Based Approach


In order to apply the residual-based method, the following assumption is introduced.

#### Assumption 4.3 (Control Laws)

The Nash equilibrium control laws $\mathbf{u}_i^*(t) = \boldsymbol{\gamma}_i^*(\mathbf{x},t)$ are known for all players $i \in \mathcal{P}$.

Under Assumption 4.3, instead of (3.1), we have

$$\dot{\mathbf{x}}(t) = \mathbf{f}_i\left(\mathbf{x}(t), \boldsymbol{\gamma}_1^*(\mathbf{x},t), \dots, \mathbf{u}_i(t), \dots, \boldsymbol{\gamma}_N^*(\mathbf{x},t), t\right), \quad \mathbf{x}(0) = \mathbf{x}_0 \tag{4.37}$$

which represents the system dynamics from the point of view of each player $i \in \mathcal{P}$. Furthermore, Assumption 4.3 leads to the basis functions

$$\boldsymbol{\phi}_i\left(\mathbf{x}(t), \mathbf{u}_i(t), \boldsymbol{\gamma}_{\neg i}^*(\mathbf{x},t), t\right), \quad i \in \mathcal{P}. \tag{4.38}$$

Under Assumption 4.3 and the consequently introduced system dynamics (4.37) and basis functions (4.38), we obtain the Hamiltonian function of player i

$$H_i = \boldsymbol{\theta}_i^\top\boldsymbol{\phi}_i\left(\mathbf{x}, \mathbf{u}_i, \boldsymbol{\gamma}_{\neg i}^*, t\right) + \boldsymbol{\psi}_i^\top \mathbf{f}_i\left(\mathbf{x}, \mathbf{u}_i, \boldsymbol{\gamma}_{\neg i}^*, t\right), \tag{4.39}$$

where the implicit dependencies were omitted for brevity.

Residuals can be introduced analogously to Section 4.3 according to Definition 4.4. Thus, the inverse differential game with feedback strategies can be solved by applying the residual-based method (4.10). Using (4.37) and (4.38), we redefine the matrices

$$\mathbf{N}_i(t) := \begin{bmatrix} \rho\nabla_{\mathbf{x}}\boldsymbol{\phi}_i(t) & \rho\nabla_{\mathbf{x}}\mathbf{f}_i(t) \end{bmatrix}^\top \tag{4.40}$$

$$\mathcal{Q}\_{i}(t) \coloneqq \begin{bmatrix} \sqrt{\rho} \nabla\_{\mathbf{x}} \phi\_{i}(t) & \sqrt{\rho} \nabla\_{\mathbf{x}} f\_{i}(t) \\ \nabla\_{\mathbf{u}\_{i}} \phi\_{i}(t) & \nabla\_{\mathbf{u}\_{i}} f\_{i}(t) \end{bmatrix}^{\top} \begin{bmatrix} \sqrt{\rho} \nabla\_{\mathbf{x}} \phi\_{i}(t) & \sqrt{\rho} \nabla\_{\mathbf{x}} f\_{i}(t) \\ \nabla\_{\mathbf{u}\_{i}} \phi\_{i}(t) & \nabla\_{\mathbf{u}\_{i}} f\_{i}(t) \end{bmatrix} \tag{4.41}$$

and note that the differences with respect to the open-loop case arise from the influence of the new system dynamics $\mathbf{f}_i$ and basis functions on the partial derivatives. With these definitions, we can proceed analogously to the open-loop case, ultimately obtaining counterparts of Lemmas 4.1 and 4.2. Consequently, results analogous to Theorem 4.1 and Corollary 4.1 can be formulated. The formal theorem statements and proofs are omitted here.
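The structure of (4.40) and (4.41) can be verified numerically. The following sketch (illustrative only; randomly generated matrices stand in for the true partial derivatives at a fixed time) assembles $\mathbf{N}_i(t)$ and $\mathbf{Q}_i(t)$ and checks that $\mathbf{Q}_i - \rho^{-1}\mathbf{N}_i\mathbf{R}^{-1}\mathbf{N}_i^\top$ reduces to the Gram matrix of the control-gradient blocks, mirroring Remark 4.1:

```python
import numpy as np

rng = np.random.default_rng(1)
M_i, n, m_i, rho = 3, 2, 2, 0.5   # basis count, state dim, control dim, weight

# stand-ins for the stacked partial derivatives [grad phi_i, grad f_i] at a fixed t
grad_x = rng.standard_normal((n, M_i + n))    # state gradients
grad_u = rng.standard_normal((m_i, M_i + n))  # control gradients

N = rho * grad_x.T                                 # (4.40)
Q = rho * grad_x.T @ grad_x + grad_u.T @ grad_u    # (4.41)
R = np.eye(n)

# the x-gradient contribution cancels, leaving the Gram matrix of grad_u,
# which is positive semidefinite for any rho > 0
residual = Q - N @ np.linalg.inv(R) @ N.T / rho
```

The same cancellation underlies the open-loop case, which is why the choice $\rho = 1$ in Remark 4.1 always yields a positive semidefinite expression.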

#### 4.4.2 Example

The following example illustrates the application of the residual-based method for inverse feedback differential games.

#### Example 4.2:

Consider a two-player differential game with system dynamics

$$
\dot{\mathbf{x}}(t) = -\mathbf{x}(t) + \boldsymbol{u}\_1(t) + \boldsymbol{u}\_2(t) \tag{4.42}
$$

where $u_i(t) \in \mathbb{R}$, $i \in \mathcal{P}$, and with an initial state value $x_0 = 5$. Let the cost function be of the form (4.2) with $T = 6$ and the basis functions

$$\boldsymbol{\phi}_i\left(x(t), u_i(t), t\right) = \begin{bmatrix} u_i^2(t) & x^2(t) & u_j^2(t) \end{bmatrix}^\top, \quad i,j \in \{1,2\},\ i \neq j \tag{4.43}$$

and cost function parameters

$$
\boldsymbol{\theta}_1 = \boldsymbol{\theta}_1^* = \begin{bmatrix} 1 & 1 & 10 \end{bmatrix}^\top \tag{4.44}
$$

$$
\boldsymbol{\theta}\_2 = \boldsymbol{\theta}\_2^\* = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}^\top. \tag{4.45}
$$

These parameters are used to solve for the Nash equilibrium state and control trajectories depicted in Figure 4.2. Since a linear-quadratic differential game is at hand, this was done by solving the coupled Riccati equations (3.69), which also confirms the Nash character of the trajectories according to Theorem 3.6. The inverse differential game problem is illustrated by recovering the cost function parameters of player 1. The feedback strategies of each player have the form $u_i^*(t) = \gamma_i^*(x,t) = -k_i^*(t)x(t)$, leading to the system dynamics

$$
\dot{\mathbf{x}}(t) = -\mathbf{x}(t) + u\_1(t) - k\_2^\*(t)\mathbf{x}(t) \tag{4.46}
$$

and the basis functions

$$\boldsymbol{\phi}_1\left(x(t), u_1(t), t\right) = \begin{bmatrix} u_1^2(t) & x^2(t) & \left(-k_2^*(t)x(t)\right)^2 \end{bmatrix}^\top \tag{4.47}$$

according to (4.37) and (4.38). The Riccati equation of the residual-based method leads to the submatrix

$$
\overline{P}\_1 = \begin{bmatrix} 16.045 & 7.499 & -7.470 \\ 7.499 & 3.505 & -3.491 \\ -7.470 & -3.492 & 3.642 \end{bmatrix} \tag{4.48}
$$

which has full rank equal to $M_i + n - 1 = 3$. Therefore, with the results of Theorem 4.1 and Corollary 4.1, we obtain the unique solution

$$
\hat{\boldsymbol{\theta}}\_{1} = \begin{bmatrix} 1.000 & 1.000 & 10.000 \end{bmatrix}^{\top} = \boldsymbol{\theta}\_{1}^{\*}. \tag{4.49}
$$

Figure 4.2: State and control trajectories solving the differential game in Example 4.2

The presented example illustrates Theorem 4.1 for inverse feedback differential games, allowing the identification of cost function parameters if the control laws of all players are known, according to Assumption 4.3. Interestingly, in Example 4.2, the cost function parameters of player 1 could be exactly recovered, even though the basis functions were partially redundant due to the fact that the control of player 2 depends on the state variable. However, since $k_2^*(t)$ was exactly known for all $t \in [0, T]$, the proportion of its influence on the state variable could be distinguished by the method.

# 4.5 Method Limitations

Before concluding this chapter, possible limitations of the presented methods shall be discussed. A first issue emerges if only truncated trajectories are available, i.e. we only have access to the trajectories (and control laws, in the feedback case) for $t \in [0, \bar{T}]$ with $\bar{T} < T$. The method can still be applied, but the quality of the identification depends on the extent to which the available truncated trajectories represent the complete optimal trajectories.

A further issue arises if Assumption 4.2 does not hold. This assumption may be violated e.g. due to misspecified dynamics or basis functions, or imperfect trajectories.<sup>28</sup> The violation may be even more severe if the trajectories do not represent a Nash equilibrium at all, regardless of the basis functions or the system dynamics. In either case, solving (4.10) yields parameters $\hat{\theta}_i$ and functions $\hat{\psi}_i(t)$ such that (4.4a) and (4.6) hold approximately, with their priority assigned via the choice of $\rho$. Since the approach is based on conditions for Nash equilibria which are generally only necessary, it cannot always be guaranteed that the resulting parameters can be used for determining Nash equilibrium trajectories.

Lastly, the exact knowledge of the feedback strategies as implied by Assumption 4.3 is a rather strict assumption. Nevertheless, given that the state trajectory $x^*(t)$ and control trajectories $u_i^*(t)$, $i \in \mathcal{P}$, are available, it is possible to at least determine an approximation using parameter estimation techniques. This will be examined in the next chapter in the context of inverse linear-quadratic differential games.

# 4.6 Conclusion

In this chapter, an inverse differential game method based on necessary conditions for Nash equilibria was presented. The main idea consisted in the formulation of residuals which represent the violation of the open-loop Nash equilibrium conditions if the parameters (and costate functions) do not correspond to a Nash equilibrium under the observations of the state and control trajectories. The minimization of the residuals leads to a dynamic optimization problem for each player $i$, the minimizers of which are given by the sought cost function parameters of that specific player. The method is substantially based on the solution of a Riccati differential equation and a static quadratic program, thus avoiding the expensive computation of Nash equilibrium trajectories in each iteration and allowing for the statement of sufficient conditions for the unique solution of the cost function parameters in an inverse open-loop differential game.

Moreover, an approach to solve inverse differential games with feedback strategies was presented. It was shown that it is possible to formulate a residual-based method for the feedback case by assuming the knowledge of the control laws. In this way, the sufficient conditions for the solution uniqueness are still valid. Nevertheless, in general, the control's dependence on the states may lead to redundant basis functions which potentially make the exact estimation of the cost function parameters difficult due to the ambiguity of the solution of the residual-based method.

This chapter presented results for finite-horizon inverse differential games. The following chapter deals with inverse problems for the class of infinite-horizon linear-quadratic differential games and aims at gaining additional insight by exploiting the particular system and cost function structure.

<sup>28</sup> The latter two cases shall be examined in Chapter 7.

# 5 Inverse Non-Cooperative Linear-Quadratic Differential Games

This chapter is devoted to the solution of inverse problems in non-cooperative linear-quadratic differential games. This particular class of inverse differential games arises if the dynamic system all players are controlling is linear and a quadratic structure of the player cost functions is given. Furthermore, the considered planning horizon is infinite, leading to constant linear feedback strategies of the players. Linear system dynamics and quadratic cost functions are ubiquitous in control theory and therefore, the properties of this kind of inverse differential games are thoroughly investigated. The techniques employed in this chapter are similar to the ones applied in Chapter 4 in the sense that control-theoretical conditions for Nash equilibria are leveraged, i.e. an inverse optimal control approach is applied. The main contribution presented in this chapter consists of the formulation of explicit solution sets describing all possible solutions of an inverse LQ differential game with an infinite horizon. The dimensions of this solution set depend on the characteristics of the differential game, e.g. number of states, controls and players. Furthermore, necessary and sufficient conditions are given for the uniqueness (up to a positive factor) of the inverse differential game solutions. Finally, on a more practical side, a quadratic program is formulated which allows the efficient computation of one solution (belonging to the whole solution set) and the corresponding algorithm for implementation is presented. The chapter ends with an illustrative example of the method and a conclusion.<sup>29</sup>

# 5.1 Problem Definition

Consider a continuous-time N-player noncooperative differential game of linear-quadratic type according to Definition 3.11. Therefore, the continuous-time state process of the game is described by the initial value problem

$$\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t) + \sum\_{i=1}^{N} \mathbf{B}\_{i}\mathbf{u}\_{i}(t) \tag{5.1a}$$

$$\mathbf{x}(\mathbf{0}) = \mathbf{x}\_0 \tag{5.1b}$$

<sup>29</sup> The results of this chapter were partially previously published in the journal paper [IBM+19].

where it is further assumed that $(A, [B_1 \cdots B_N])$ is stabilizable. Following the explanations in Section 3.8.2, the results of this chapter shall be restricted to the consideration of constant linear feedback strategies, i.e. strategies $\gamma_i$ belonging to the set (3.73). Therefore, the control trajectories are given by

$$\mathbf{u}\_{i}(t) = -\mathbf{K}\_{i}\mathbf{x}(t), \quad \forall i \in \mathcal{P}, \tag{5.2}$$

with the control laws $K = (K_1, ..., K_N)$ (cf. (3.77)). In particular, these lead to a stable closed-loop system matrix (cf. (3.67))

$$F = A - \sum\_{j=1}^{N} \mathbf{B}\_{j} \mathbf{K}\_{j},\tag{5.3}$$

i.e. they belong to the set of stabilizing control law tuples defined in (3.74).
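As a quick numerical illustration of (5.3), the following sketch checks whether a given feedback tuple is stabilizing, i.e. whether it belongs to the set of stabilizing control law tuples. The system and gain matrices are illustrative choices, not values taken from the text:

```python
import numpy as np

A  = np.array([[0., 1.], [0., 0.]])
B1 = np.array([[0.], [1.]])
B2 = np.array([[0.], [1.]])
K1 = np.array([[0.5, 1.0]])
K2 = np.array([[0.5, 0.5]])

# closed-loop matrix (5.3); the pair (K1, K2) is stabilizing iff all
# eigenvalues of F have negative real parts
F = A - B1 @ K1 - B2 @ K2
stable = bool(np.all(np.linalg.eigvals(F).real < 0))
```

For the gains above, $F$ has eigenvalues with real part $-0.75$, so the tuple is stabilizing.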

In this chapter, a Lagrangian quadratic cost function

$$J\_i(\mathbf{x}\_0, \mathbf{K}, \mathbf{Q}\_i, \mathbf{R}\_{ij}) = \frac{1}{2} \int\_0^{\infty} \mathbf{x}^\top \mathbf{Q}\_i \mathbf{x} + \sum\_{j=1}^N \mathbf{u}\_j^\top \mathbf{R}\_{ij} \mathbf{u}\_j \, \mathrm{d}t,\tag{5.4}$$

is considered for each player $i \in \mathcal{P}$, where the same matrix assumptions as in Definition 3.11 are made, i.e. $Q_i$, $R_{ij}$ are symmetric for all $i, j \in \mathcal{P}$ and $R_{ii} \succ 0$ for all $i \in \mathcal{P}$.<sup>30</sup> By posing (5.4), a particular structure of the cost functions of all players is defined, similar to the basis function approach considered in Section 4.2. Indeed, a cost function of the form (5.4) can be equivalently represented as a cost function with basis functions as introduced in (4.2).<sup>31</sup> The cost function $J_i$ in (5.4) is written as a function of the $N$-tuple of feedback laws $K = (K_1, ..., K_N)$ and the initial state $x_0$ since together these generate the state and control trajectories $x(t)$ and $u_i(t)$ via (5.1) and (5.2). Finite cost function values are guaranteed by the restriction to strategies or feedback laws belonging to $\mathcal{F}$ as defined in (3.74).

In this chapter, feedback Nash equilibria are considered which are defined in the context of infinite-horizon LQ differential games as follows (cf. Definition 3.7).

<sup>30</sup> Note that no definiteness assumptions on $Q_i$, $i \in \mathcal{P}$, are made since the control laws are restricted to the stabilizing set $\mathcal{F}$ (cf. [EBS00]).

<sup>31</sup> This follows directly from e.g. $\frac{1}{2} x^\top Q_i x = \theta_i^\top \phi_i$ with $\theta_i = \operatorname{vec}(Q_i)$ and where $\phi_i$ has the elements $\phi_{i,(j)} = \frac{1}{2} x_l x_p$, $\forall l, p \in \{1, ..., n\}$.

#### Definition 5.1 (Feedback Nash Equilibrium [EBS00])

An $N$-tuple $K^* = (K_1^*, ..., K_N^*) \in \mathcal{F}$ is called a stationary linear feedback Nash equilibrium if

$$J\_i(\mathbf{x}\_0, \mathbf{K}^\*, \mathbf{Q}\_i, \mathbf{R}\_{ij}) \leq J\_i(\mathbf{x}\_0, \mathbf{K}\_{\neg i}^\*(\boldsymbol{\beta}), \mathbf{Q}\_i, \mathbf{R}\_{ij}) \tag{5.5}$$

holds for all $i \in \mathcal{P}$, all $x_0 \in \mathbb{R}^n$, and all $\beta$ such that $K_{\neg i}^*(\beta) \in \mathcal{F}$, where $K_{\neg i}^*(\beta) = (K_1^*, ..., K_{i-1}^*, \beta, K_{i+1}^*, ..., K_N^*)$.

The FNE is generally not unique (cf. Section 3.8.2), i.e. various tuples $K^*$ corresponding to a particular infinite-horizon LQ differential game may exist. However, in the following, one specific FNE denoted by $K^*$ shall be considered.

The following definition is introduced before formalizing the inverse LQ differential game problem.

#### Definition 5.2 (Canonical Parameter Set)

The canonical parameter set of the LQ differential game is the set $\Theta$ which contains all possible cost function parameters of (5.4), i.e. all possible matrices $Q_i$ and $R_{ij}$, $\forall i, j \in \mathcal{P}$, which lead to the Nash equilibrium given by $K^*$, i.e.

$$\Theta = \{\theta\_i \mid i \in \mathcal{P}, \; \mathbf{K}^\* = \mathbf{K}(\theta\_1, ..., \theta\_N) \text{ fulfills (5.5)}\},\tag{5.6}$$

where $\theta_i$ contains the elements of the matrices $Q_i$ and $R_{ij}$, $i, j \in \mathcal{P}$.

This definition follows directly from the ill-posedness characteristic of inverse differential games. It allows for describing a general set of solutions of the inverse differential game whose elements do not necessarily differ only by a constant factor. Furthermore, the following assumption is introduced.

#### Assumption 5.1

The Nash equilibrium feedback matrices $K^* \in \mathcal{F}$ are known.

With this assumption, which is similar to Assumption 4.1 made in the last chapter, the inverse infinite-horizon LQ differential game problem considered in this chapter is defined as follows.<sup>32</sup>

<sup>32</sup> In the remainder of this chapter, the considered inverse problem shall be referred to as inverse linear-quadratic differential game problem. The infinite-horizon property shall be omitted for the sake of brevity.

#### Definition 5.3 (Inverse Linear-Quadratic Differential Game Problem)

Consider an infinite-horizon LQ differential game consisting of system dynamics (5.1), where $A$, $B_i$, $\forall i \in \mathcal{P}$, are given, and unknown cost functions (5.4). Furthermore, let Assumption 5.1 hold such that Nash equilibrium feedback matrices $K^*$ are available. Determine the canonical parameter set $\Theta$ described in Definition 5.2.

While this problem definition is related to the problem in Definition 4.3, it is different in the sense that not only one single tuple of parameter vectors $\theta = (\theta_1, ..., \theta_N)$ is sought, but the complete set of (equivalent) possible tuples of parameter vectors which lead to a given Nash equilibrium. Furthermore, instead of a Nash equilibrium described by the trajectories $x^*(t)$ and $u_i^*(t)$, $i \in \mathcal{P}$, the availability of a Nash equilibrium described by a tuple of control laws $K^*$ is assumed.

#### Remark 5.1:

By solving the problem of Definition 5.3 we can also solve the related problem of finding $\Theta$ if, instead of $K^*$, trajectories $x^*(t)$ and $u_i^*(t)$, $i \in \mathcal{P}$, are given. This follows from the fact that $K_i^*$ can be estimated via (5.2). Indeed, such an estimation is commonly performed in single-player inverse LQ optimal control, e.g. in [PCC+15] and [FMM+18], where the proposed methods also rely on the availability of a control law. Further details on the estimation of $K^*$ are given in Section 5.4.2.
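The estimation of $K_i^*$ from observed trajectories via (5.2) mentioned in Remark 5.1 can be sketched as a least-squares fit; the data below are synthetic and the gain `K_true` is a hypothetical ground truth, not a value from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
K_true = np.array([[0.5, 1.2]])                      # hypothetical feedback gain

X = rng.normal(size=(2, 200))                        # sampled states x(t_k) as columns
U = -K_true @ X + 1e-3 * rng.normal(size=(1, 200))   # noisy controls, cf. (5.2)

# least-squares fit of u = -K x over all samples
W, *_ = np.linalg.lstsq(X.T, U.T, rcond=None)
K_hat = -W.T
```

With enough samples and small noise, `K_hat` recovers the gain up to the noise level.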

# 5.2 Solution Sets for Inverse Linear-Quadratic Differential Games

This section presents general solution sets for inverse LQ differential games such that the problem of Definition 5.3 is solved. Similar to Chapter 4, available results on the conditions for feedback Nash equilibria shall be exploited. In the case of an infinite-horizon LQ differential game, the conditions are available in the form of coupled algebraic Riccati equations (ARE).

#### 5.2.1 Coupled Algebraic Riccati Equations

The following theorem is introduced as a basis for the development of the results of this chapter.

#### Theorem 5.1 (Necessary and Sufficient Conditions for Feedback Nash Equilibria)

Let there exist an $N$-tuple of symmetric matrices $P_i$, $i \in \mathcal{P}$, satisfying the $N$ matrix algebraic Riccati equations (ARE)

$$\mathbf{P}\_i \mathbf{F} + \mathbf{F}^\top \mathbf{P}\_i + \sum\_{j \in \mathcal{P}} \mathbf{P}\_j \mathbf{B}\_j \mathbf{R}\_{jj}^{-1} \mathbf{R}\_{ij} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_j^\top \mathbf{P}\_j + \mathbf{Q}\_i = \mathbf{0} \tag{5.7}$$

such that $F$ is stabilized. Furthermore, let $K^*$ be defined as

$$\mathbf{K}\_{i}^{\*} = \mathbf{R}\_{ii}^{-1} \mathbf{B}\_{i}^{\top} \mathbf{P}\_{i}. \tag{5.8}$$

Then, $K^* = (K_1^*, ..., K_N^*)$ is a FNE as in Definition 5.1 and $J_i(x_0, K^*, Q_i, R_{ij}) = x_0^\top P_i x_0$. Conversely, if $K^*$ is a FNE, then the set of ARE (5.7) has a stabilizing solution.

#### Proof:

See the proof of [EBS00, Theorem 4].
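As a sanity check of Theorem 5.1, the single-player special case ($N = 1$) of (5.7) and (5.8) reduces to the closed-loop form of the standard ARE, which can be verified numerically; the matrices below are illustrative choices:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q = np.diag([1., 0.7])
R = np.array([[2.]])

P = solve_continuous_are(A, B, Q, R)      # stabilizing ARE solution
K = np.linalg.solve(R, B.T @ P)           # (5.8) for N = 1
F = A - B @ K                             # (5.3)

# (5.7) for N = 1: P F + F^T P + K^T R K + Q = 0
residual = P @ F + F.T @ P + K.T @ R @ K + Q
```

The residual vanishes up to numerical precision, confirming that the stabilizing ARE solution and (5.8) are consistent.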

#### Remark 5.2:

The ARE given in (5.7) are an alternative and equivalent formulation of the ARE given in (3.75). Both expressions are common in differential game theory.

Theorem 5.1 represents a necessary and sufficient condition for feedback Nash equilibria. Hence, if the feedback matrices $K^*$ are given, the cost function parameters in the matrices $R_{ij}$ and $Q_i$, $i, j \in \mathcal{P}$, must fulfill (5.7). This fact shall be leveraged in order to develop a method to solve the inverse LQ differential game. Inspired by [JAK89] and [AKFIJ12], where numerical techniques for continuous-time Riccati equations and results on the properties of Sylvester and Lyapunov type algebraic equations were introduced, respectively, Kronecker products shall be applied to derive a reformulation of (5.7) which serves as a basis for the subsequent results.

#### Reformulation of the Algebraic Riccati Equations

Before presenting the reformulation, let us define a Kronecker sum [Bre78] as

$$X \oplus Y = \left(X \otimes I\_q\right) + \left(I\_r \otimes Y\right), \tag{5.9}$$

for square matrices $X \in \mathbb{R}^{r \times r}$ and $Y \in \mathbb{R}^{q \times q}$, where $I_q$ denotes a $q$-dimensional identity matrix and $\otimes$ is the Kronecker product. In order to develop a reformulation of (5.7), we require the following result.

#### Lemma 5.1 (Inverse Existence)

Define $F_\oplus := F^\top \oplus F^\top$, where $F$ is calculated by means of (5.3) with any tuple of feedback matrices $K^* \in \mathcal{F}$ (cf. (3.74)). The inverse $F_\oplus^{-1}$ exists.

#### Proof:

$F_\oplus^{-1}$ exists if all eigenvalues $\lambda_l \in \sigma(F_\oplus)$, $l \in \{1, ..., n^2\}$, are different from zero. By using [Zha11, Theorem 4.8], we discern that $\lambda_l = \mu_j + \mu_k$, where $\mu_j, \mu_k \in \sigma(F)$, for $j, k \in \{1, ..., n\}$ such that $l$ is associated with a particular combination of $j$ and $k$, i.e. $j = \lceil l/n \rceil$ and $k = l - n(j - 1)$. Since only stabilizing feedback matrices belonging to the set $\mathcal{F}$ in (3.74) are considered, $F$ is a stable matrix and thus $\operatorname{Re}(\lambda_l) < 0$, $\forall l \in \{1, ..., n^2\}$. The lemma assertion follows.
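The eigenvalue argument of this proof can be checked numerically for an illustrative stable matrix $F$:

```python
import numpy as np

F = np.array([[0., 1.], [-1., -1.5]])    # an illustrative stable matrix
n = F.shape[0]

# F_plus = F^T (+) F^T, cf. (5.9)
F_plus = np.kron(F.T, np.eye(n)) + np.kron(np.eye(n), F.T)

mu = np.linalg.eigvals(F)
lam = np.linalg.eigvals(F_plus)
# the spectrum of the Kronecker sum consists of all pairwise sums mu_j + mu_k,
# which have negative real parts for stable F, so F_plus is invertible
pair_sums = np.array([mu[j] + mu[k] for j in range(n) for k in range(n)])
```

Rounding before sorting makes the comparison of the two (complex) spectra deterministic.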

Unless otherwise stated, the following calculations are with respect to a particular player i ∈ P. With the results of Lemma 5.1, the matrices

$$\mathbf{Z}\_{i} := (\mathbf{I}\_{n} \otimes \mathbf{B}\_{i}^{\top}) \mathbf{F}\_{\oplus}^{-1} \in \mathbb{R}^{n m\_i \times n^{2}} \tag{5.10}$$

and

$$\mathbf{K}\_{j}^{\otimes} := \mathbf{K}\_{j}^{\top} \otimes \mathbf{K}\_{j}^{\top} \in \mathbb{R}^{n^{2} \times m\_{j}^{2}} \tag{5.11}$$

are defined. Furthermore, for brevity, the Nash equilibrium feedback matrices $K_i^*$ are written as $K_i$ in (5.11) and in the following lemma.

#### Lemma 5.2 (Equivalent Formulation of the ARE)

Let the parameter vector $\bar{\theta}_i \in \mathbb{R}^L$ denote the vectorized matrices of the cost function (5.4), i.e.

$$\bar{\boldsymbol{\theta}}\_{i} = \begin{bmatrix} \text{vec}(\mathbf{Q}\_{i})^{\top} & \text{vec}(\mathbf{R}\_{i1})^{\top} & \cdots & \text{vec}(\mathbf{R}\_{ii})^{\top} & \cdots & \text{vec}(\mathbf{R}\_{iN})^{\top} \end{bmatrix}^{\top},\tag{5.12}$$

where $\text{vec}(X)$ represents a column vectorization of a matrix $X$, leading to $L = n^2 + \sum_{j=1}^{N} m_j^2$. Then, the matrices $Q_i$, $R_{ij}$, $i, j \in \mathcal{P}$, corresponding to $\bar{\theta}_i$ satisfy (5.7) if (and only if) $\bar{\theta}_i$ fulfills

$$
\bar{\mathbf{M}}\_i \bar{\theta}\_i = \mathbf{0} \tag{5.13}
$$

where $\bar{M}_i \in \mathbb{R}^{n m_i \times L}$ is given by

$$\bar{\mathbf{M}}\_{i} := \begin{bmatrix} \mathbf{Z}\_{i} & \mathbf{Z}\_{i}\mathbf{K}\_{1}^{\otimes} & \cdots & \mathbf{Z}\_{i}\mathbf{K}\_{i-1}^{\otimes} & \left(\mathbf{Z}\_{i}\mathbf{K}\_{i}^{\otimes} + \mathbf{K}\_{i}^{\top}\otimes\mathbf{I}\_{m\_i}\right) & \mathbf{Z}\_{i}\mathbf{K}\_{i+1}^{\otimes} & \cdots & \mathbf{Z}\_{i}\mathbf{K}\_{N}^{\otimes} \end{bmatrix}. \tag{5.14}$$

#### Proof:

We rewrite (5.7) as

$$\mathbf{0} = \text{vec}(\mathbf{P}\_i \mathbf{F}) + \text{vec}(\mathbf{F}^\top \mathbf{P}\_i) + \sum\_{j \in \mathcal{P}} \text{vec}(\mathbf{P}\_j \mathbf{B}\_j \mathbf{R}\_{jj}^{-1} \mathbf{R}\_{ij} \mathbf{R}\_{jj}^{-1} \mathbf{B}\_j^\top \mathbf{P}\_j) + \text{vec}(\mathbf{Q}\_i)$$

$$\mathbf{0} = \left[ \left( \mathbf{F}^\top \otimes \mathbf{I}\_n \right) + \left( \mathbf{I}\_n \otimes \mathbf{F}^\top \right) \right] \text{vec}(\mathbf{P}\_i) + \sum\_{j \in \mathcal{P}} \left( \mathbf{K}\_j^\top \otimes \mathbf{K}\_j^\top \right) \text{vec}(\mathbf{R}\_{ij}) + \text{vec}(\mathbf{Q}\_i)$$

and thus

$$\text{vec}(P\_i) = -F\_{\oplus}^{-1}\text{vec}(\mathbf{Q}\_i) - \sum\_{j \in \mathcal{P}} F\_{\oplus}^{-1} \mathbf{K}\_j^{\otimes} \text{vec}(\mathbf{R}\_{ij}).\tag{5.15}$$

The first equality follows from vectorizing (5.7), while for the second equality (5.8) was used and the following equivalence was applied:

$$\text{vec}(XYZ) = \left(Z^{\top} \otimes X\right)\text{vec}(Y). \tag{5.16}$$

This equivalence holds for any matrices X, Y and Z with suitable dimensions [Bre78]. The third equality (5.15) follows with the results of Lemma 5.1 and the definitions given in (5.11) and (5.9). Now we rewrite (5.8) as

$$\left(\mathbf{K}\_i^{\top} \otimes \mathbf{I}\_{m\_i}\right)\text{vec}(\mathbf{R}\_{ii}) = \left(\mathbf{I}\_n \otimes \mathbf{B}\_i^{\top}\right)\text{vec}(\mathbf{P}\_i) \tag{5.17}$$

using (5.16). Inserting (5.15) into (5.17) results in

$$\mathbf{Z}\_i \text{vec}(\mathbf{Q}\_i) + \left(\mathbf{K}\_i^{\top} \otimes \mathbf{I}\_{m\_i}\right) \text{vec}(\mathbf{R}\_{ii}) + \sum\_{j \in \mathcal{P}} \mathbf{Z}\_i \mathbf{K}\_j^{\otimes} \text{vec}(\mathbf{R}\_{ij}) = \mathbf{0} \tag{5.18}$$

and thus (5.13) follows immediately with (5.14) and (5.12).
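The vectorization identity (5.16), on which this proof relies, can be verified numerically for random matrices of compatible dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))
Y = rng.normal(size=(3, 4))
Z = rng.normal(size=(4, 2))

vec = lambda M: M.flatten(order="F")   # column-major (column-stacking) vec

# identity (5.16): vec(XYZ) = (Z^T (x) X) vec(Y)
lhs = vec(X @ Y @ Z)
rhs = np.kron(Z.T, X) @ vec(Y)
```

Note that the identity only holds for the column-major vectorization, which is why `order="F"` is used.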

The parameters $\bar{\theta}_i$ for which (5.13) holds are valid solutions of (5.7) for a given $K_i^*$. Note that the feedback matrices $K^* = (K_1^*, ..., K_N^*)$ completely characterize the Nash equilibrium trajectories $x^*(t)$ and $u_i^*(t)$, $i \in \mathcal{P}$. This follows from (5.1) fulfilling all conditions for admitting a unique solution for any $N$-tuple of continuous controls (5.2) [BO99]. Thus, the parameters $\bar{\theta}_i$ are associated with a Nash equilibrium represented by either the feedback matrices or the state and control trajectories.
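For the single-player special case ($N = 1$), the construction of the matrix from (5.14) and the property (5.13) can be verified end to end with a standard ARE solver. The system matrices below are illustrative choices:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

n, m = 2, 1
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q = np.diag([1., 2.])
R = np.array([[1.]])

# optimal single-player solution (standard LQR)
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)                  # (5.8) for N = 1
F = A - B @ K                                    # (5.3)

vec = lambda M: M.flatten(order="F")
F_plus = np.kron(F.T, np.eye(n)) + np.kron(np.eye(n), F.T)   # F^T (+) F^T
Z = np.kron(np.eye(n), B.T) @ np.linalg.inv(F_plus)          # (5.10)
K_kron = np.kron(K.T, K.T)                                   # (5.11)

# M_bar and theta_bar according to (5.14) and (5.12) for N = 1
M_bar = np.hstack([Z, Z @ K_kron + np.kron(K.T, np.eye(m))])
theta_bar = np.concatenate([vec(Q), vec(R)])
residual = M_bar @ theta_bar                     # ~ 0, confirming (5.13)
```

The same construction extends to $N > 1$ by stacking the remaining blocks of (5.14).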

#### 5.2.2 Canonical Parameter Set

The matrix Riccati equations (5.7) have multiple solutions which potentially represent different Nash equilibria [Wee01]. However, it is worth emphasizing that we are only interested in all parameter tuples $\bar{\theta}$ which represent a specific Nash equilibrium. Bearing this in mind, the following theorem gives the main result.

#### Theorem 5.2 (Canonical Parameter Set of Inverse LQ Differential Games)

Let an LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold such that Nash equilibrium control laws $K^*$ are given. Then, the canonical parameter set of the corresponding inverse LQ differential game is given by

$$\Theta = \bigcup\_{i \in \mathcal{P}} \ker(\bar{M}\_i),\tag{5.19}$$

with convex boundaries such that $R_{ii} \succ 0$, $\forall i \in \mathcal{P}$.

#### Proof:

By inspecting (5.13) from Lemma 5.2, we can recognize that all parameters which satisfy the ARE lie within the kernel of $\bar{M}_i$, which depends on $K^*$. Therefore, all possible cost function parameters of player $i$ which lead to the known Nash equilibrium are given by $\operatorname{span}(v_i^{(1)}, ..., v_i^{(d_i)})$, where $d_i$ denotes the dimension of the kernel of $\bar{M}_i$ with basis vectors $v_i^{(1)}, ..., v_i^{(d_i)}$. The set including the cost function parameters of all players corresponding to the Nash equilibrium represented by $K^*$ is thus given by (5.19).
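Computationally, the kernels appearing in (5.19) can be obtained with a numerical null-space routine. The matrix below is a hypothetical stand-in for some $\bar{M}_i$, chosen to have a two-dimensional kernel:

```python
import numpy as np
from scipy.linalg import null_space

# hypothetical 2x3 matrix playing the role of M_i in (5.13)/(5.21)
M_i = np.array([[1., -2., 1.],
                [2., -4., 2.]])    # rank 1, so its kernel is 2-dimensional
V = null_space(M_i)                # columns form an orthonormal kernel basis

# every linear combination of the basis vectors is a valid parameter vector
theta = V @ np.array([0.3, 0.7])
```

Any such `theta` satisfies the homogeneous system exactly, illustrating why the solution of the inverse problem is a set rather than a single parameter vector.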

Note that the results of Lemma 5.2 together with Theorem 5.2 allow for a simple proof of the well-known invariance of the Nash equilibrium in case any cost function parameter vector $\bar{\theta}_i$ is multiplied by a positive constant.

#### Corollary 5.1

The trajectories constituting a Nash equilibrium under $N$ cost functions $J_i(\bar{\theta}_i^*)$, $i \in \mathcal{P}$, of an infinite-horizon LQ differential game will constitute the same Nash equilibrium for $J_i(\bar{\theta}_i)$ with $\bar{\theta}_i = c_i \bar{\theta}_i^*$, $\forall c_i > 0$.

#### Proof:

This can easily be seen from $\bar{M}_i c_i \bar{\theta}_i^* = c_i \bar{M}_i \bar{\theta}_i^* = \mathbf{0}$, which affects neither $\ker(\bar{M}_i)$ nor $\Theta$.

The results of Lemma 5.2 as well as Theorem 5.2 are derived with respect to the parameter definition in (5.12), which considers the most general case where no assumptions on the structure of the cost function matrices, e.g. symmetry, were made. The characteristics of the differential game and, in particular, the properties of the cost function matrices affect the dimensions of $\ker(\bar{M}_i)$ and consequently of the canonical parameter set $\Theta$. Therefore, in the next section, some properties of inverse LQ differential games based on the possible structures of the cost function matrices are discussed.

# 5.3 Properties of Inverse Linear-Quadratic Differential Game Solution Sets

Cost function matrices in a quadratic cost function are typically assumed to be at least symmetric. Furthermore, in many applications, they are assumed to be diagonal. Since these matrix properties reduce the number of unknown parameters, inverse LQ differential games and their solution sets shall be analyzed considering all possible cases for the cost function matrices.

#### 5.3.1 Preliminaries

Let us define the variable $M_i \in \mathbb{N}^+$ to denote the number of (non-redundant) parameters of a player's cost function. The specific value of $M_i$ depends on whether the cost function matrices are symmetric or diagonal. We have

$$M\_i = \begin{cases} \frac{n^2 + n}{2} + \sum\_{j \in \mathcal{P}} \frac{m\_j^2 + m\_j}{2}, & \text{symmetric matrices} \\ n + \sum\_{j \in \mathcal{P}} m\_j, & \text{diagonal matrices} \\ L, & \text{else.} \end{cases} \tag{5.20}$$

Since $M_i \leq L$ holds, the analysis of inverse LQ differential games is based on the vectors $\theta_i \in \mathbb{R}^{M_i}$, which have a potentially reduced dimension compared to the parameter vector of Lemma 5.2. The matrix $\mathcal{M}_i \in \mathbb{R}^{n m_i \times M_i}$ is introduced accordingly as a possible modification of the matrix $\bar{M}_i$.
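The case distinction (5.20) can be sketched as a small helper function; the assertions check the counts against the two-player setting with $n = 2$ and $m_1 = m_2 = 1$ used in Example 5.1:

```python
def num_params(n, m, structure="full"):
    """Number of non-redundant cost function parameters M_i, cf. (5.20).

    n: state dimension; m: list with the control dimensions m_j of all players.
    """
    if structure == "symmetric":
        return (n**2 + n) // 2 + sum((mj**2 + mj) // 2 for mj in m)
    if structure == "diagonal":
        return n + sum(m)
    return n**2 + sum(mj**2 for mj in m)   # general case: M_i = L

assert num_params(2, [1, 1]) == 6              # L, general matrices
assert num_params(2, [1, 1], "symmetric") == 5
assert num_params(2, [1, 1], "diagonal") == 4
```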

#### Remark 5.3:

The vector $\theta_i \in \mathbb{R}^{M_i}$ and the modified matrix $\mathcal{M}_i \in \mathbb{R}^{n m_i \times M_i}$ comply with Lemma 5.2 in the sense that

$$\mathcal{M}\_i \theta\_i = \mathbf{0} \tag{5.21}$$

holds. Consequently, the results of Theorem 5.2 and, obviously, Corollary 5.1 hold for these introduced variables as well.

In the following, an example illustrating the introduced modifications is presented.

#### Example 5.1:

Consider a 2-player LQ differential game with $n = 2$, $m_1 = m_2 = 1$, where the cost functions are given by (5.4). By Lemma 5.2, we obtain $M_i = L = 6$, leading to the vector

$$
\bar{\boldsymbol{\theta}}\_{i} = \begin{bmatrix} Q\_{i,(1,1)} & Q\_{i,(2,1)} & Q\_{i,(1,2)} & Q\_{i,(2,2)} & R\_{i1} & R\_{i2} \end{bmatrix}^{\top},\tag{5.22}
$$

where $Q_{i,(r,c)}$ with $r, c \in \{1, 2\}$ denotes the element of $Q_i$ in the $r$-th row and $c$-th column. Furthermore, we have the matrix

$$\bar{\boldsymbol{M}}\_{i} = \begin{bmatrix} (\bar{\boldsymbol{m}}\_{i})\_{1} & (\bar{\boldsymbol{m}}\_{i})\_{2} & \cdots & (\bar{\boldsymbol{m}}\_{i})\_{6} \end{bmatrix}, \quad i \in \{1, 2\}, \tag{5.23}$$

where $(\bar{m}_i)_j$, $j \in \{1, 2, ..., L\}$, denotes the $j$-th column of $\bar{M}_i$.

#### Diagonal Matrices

In case of diagonal matrices, $Q_{i,(2,1)} = Q_{i,(1,2)} = 0$, $i \in \{1, 2\}$. Therefore, the reduced non-redundant parameter vector has the dimension $M_i = 4$, $i \in \{1, 2\}$, and is given by

$$\boldsymbol{\theta}\_{i} = \begin{bmatrix} Q\_{i,(1,1)} & Q\_{i,(2,2)} & R\_{i1} & R\_{i2} \end{bmatrix}^{\top}. \tag{5.24}$$

Thus, we set

$$\mathcal{M}\_i = \begin{bmatrix} (\bar{m}\_i)\_1 & (\bar{m}\_i)\_4 & (\bar{m}\_i)\_5 & (\bar{m}\_i)\_6 \end{bmatrix}, \quad i \in \{1, 2\} \tag{5.25}$$

such that (5.21) is fulfilled.

#### Symmetric Matrices

In case of symmetric matrices, $Q_{i,(2,1)} = Q_{i,(1,2)}$, $i \in \{1, 2\}$. This leads to a reduced non-redundant parameter vector with the dimension $M_i = 5$, $i \in \{1, 2\}$, given by

$$
\boldsymbol{\theta}\_{i} = \begin{bmatrix} Q\_{i,(1,1)} & Q\_{i,(1,2)} & Q\_{i,(2,2)} & R\_{i1} & R\_{i2} \end{bmatrix}^{\top}. \tag{5.26}
$$

Hence, we set

$$\mathcal{M}\_{i} = \begin{bmatrix} (\bar{m}\_{i})\_{1} & (\bar{m}\_{i})\_{2} + (\bar{m}\_{i})\_{3} & (\bar{m}\_{i})\_{4} & (\bar{m}\_{i})\_{5} & (\bar{m}\_{i})\_{6} \end{bmatrix}, \quad i \in \{1, 2\}, \tag{5.27}$$

such that (5.21) is fulfilled.

These modifications allow for the analysis of inverse LQ differential games and their solution sets in the case of symmetric or diagonal cost function matrices by means of the kernel of $\mathcal{M}_i$.
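The described column selection and merging can be sketched as follows, where `M_bar` is a hypothetical matrix with $L = 6$ columns ordered as in (5.22):

```python
import numpy as np

# hypothetical M_bar with L = 6 columns, ordered as in (5.22)
M_bar = np.arange(12, dtype=float).reshape(2, 6)
c = [M_bar[:, j] for j in range(6)]

# diagonal case, cf. (5.25): drop the columns of the off-diagonal entries
M_diag = np.column_stack([c[0], c[3], c[4], c[5]])

# symmetric case, cf. (5.27): merge the columns of the two identical
# off-diagonal entries Q_{i,(2,1)} = Q_{i,(1,2)}
M_sym = np.column_stack([c[0], c[1] + c[2], c[3], c[4], c[5]])
```

The resulting matrices have 4 and 5 columns, matching the parameter dimensions $M_i$ of (5.24) and (5.26).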

#### 5.3.2 Sufficient Condition for Solution Sets

In the following, the set of all parameters $\theta_i$ which lead to the same Nash equilibrium, provided all other parameters $\theta_{\neg i}$ are fixed, is denoted as the solution set of player $i \in \mathcal{P}$. This solution set is defined by the non-trivial solutions of (5.21). Therefore, one way to characterize these solutions is via the kernel of $\mathcal{M}_i$. Its dimension depends on the number of linearly independent equations generated by the $n m_i$ rows of $\mathcal{M}_i$ compared to the number of unknown parameters $M_i$. Since $\operatorname{rank}(\mathcal{M}_i) \leq \min(M_i, n m_i)$, the number of players, states and controls of each player as well as the assumed properties of the cost function matrices are important for evaluating the existence of inverse differential game solutions.

#### Proposition 5.1:

The solution set of player $i$ is at least one-dimensional if the number of rows of $\mathcal{M}_i$ is strictly less than the number of parameters in $\theta_i$, i.e. $\ker(\mathcal{M}_i) \neq \{\mathbf{0}\}$ if $n m_i < M_i$.

#### Proof:

The condition $n m_i < M_i$ implies $\operatorname{rank}(\mathcal{M}_i) < M_i$ and thus, by the rank-nullity theorem, $\dim(\ker(\mathcal{M}_i)) > 0$.

Proposition 5.1 gives a sufficient condition for the existence of vectors spanning the kernel of $\mathcal{M}_i$. The exact dimension of the kernel is determined by $\operatorname{rank}(\mathcal{M}_i)$. The following example illustrates the results of Theorem 5.2 and the solution set concept.

#### Example 5.2:

Consider an infinite-horizon LQ differential game where two players control a double-integrator system given by

$$
\dot{\mathbf{x}}(t) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \mathbf{x}(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u\_1(t) + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u\_2(t). \tag{5.28}
$$

The cost functions of the two players are given by (5.4) with $Q_1 = \operatorname{diag}(1, 2)$ and $Q_2 = \operatorname{diag}(1, 0.7)$ as well as $R_{11} = 1$, $R_{12} = R_{21} = 0$ and $R_{22} = 1$. The parameter vector of player $i$ is given by

$$
\boldsymbol{\theta}\_i = \begin{bmatrix} Q\_{i,(1,1)} & Q\_{i,(2,2)} & R\_{ii} \end{bmatrix}^{\top}, \quad i \in \{1, 2\}.
$$

The game is solved by calculating the solution of the finite-horizon version of the game, i.e. solving the corresponding RDEs (3.69), and extracting the converged value of $P_i$ afterwards. The resulting $K^*$ represents a Nash equilibrium since the calculated $P_i$ satisfies (5.7) for all players and the closed-loop stability of the system dynamics was confirmed (cf. Theorem 5.1). The calculated Nash equilibrium is $(K_1^*, K_2^*) = \left(\begin{bmatrix} 0.5773 & 1.2827 \end{bmatrix}, \begin{bmatrix} 0.5774 & 0.5882 \end{bmatrix}\right)$.

The kernels of the matrices $\mathcal{M}_i \in \mathbb{R}^{2 \times 3}$ are given by the span of the vectors

$$v\_1^{(1)} = \left[v\_{1,(j)}^{(1)}\right]\_{j=1,2,3} = \begin{bmatrix} 0.4083 & 0.8165 & 0.4083 \end{bmatrix}^{\top} \tag{5.29}$$

$$v\_2^{(1)} = \left[v\_{2,(j)}^{(1)}\right]\_{j=1,2,3} = \begin{bmatrix} 0.6337 & 0.4437 & 0.6337 \end{bmatrix}^{\top} \tag{5.30}$$

which result in the canonical parameter set

$$\Theta = \{\mu\_i \hat{\mathbf{Q}}\_i, \mu\_i \hat{\mathbf{R}}\_{ii}\}\_{i=1,2}, \quad \mu\_i \in \mathbb{R}^+,\tag{5.31}$$

which consists of the solution sets of players 1 and 2, and where $\hat{Q}_i = \operatorname{diag}(v_{i,(1)}^{(1)}, v_{i,(2)}^{(1)})$ and $\hat{R}_{ii} = v_{i,(3)}^{(1)}$. This means that the cost function parameters are unique up to a constant factor. In particular, $\mu_1 = 2.4494$ and $\mu_2 = 1.5779$ lead to the defined ground truth parameters.
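The reported scaling factors can be checked with a quick numerical sanity check (not part of the original derivation; variable names are illustrative):

```python
import numpy as np

# Kernel basis vectors from (5.29) and (5.30) and the scaling factors
# mu_1, mu_2 reported in Example 5.2.
v1 = np.array([0.4083, 0.8165, 0.4083])
v2 = np.array([0.6337, 0.4437, 0.6337])
mu1, mu2 = 2.4494, 1.5779

# Scaling each kernel vector recovers the ground-truth parameters
# theta_1 = [1, 2, 1] (Q_1 = diag(1, 2), R_11 = 1) and
# theta_2 = [1, 0.7, 1] (Q_2 = diag(1, 0.7), R_22 = 1).
theta1 = mu1 * v1
theta2 = mu2 * v2
```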

As mentioned in the introduction of this section, the number of unknown parameters depends on the structure of the cost function matrices, which in turn influences the possible dimensions of each player's solution set for the inverse LQ differential game. This aspect is examined in the following.

#### General Cost Function Matrices

In the case of arbitrary cost function matrices, $M_i = L = n^2 + \sum_{j \in \mathcal{P}} m_j^2$ holds. Since $n m_i \le 0.5(n^2 + m_i^2) < n^2 + \sum_{j \in \mathcal{P}} m_j^2$ for any choice of $n$, $m_j$, $\forall j \in \mathcal{P}$, and $N \in \mathbb{N}^+$, $\dim(\ker(\mathcal{M}_i)) > 0$ follows. The sufficient condition of Proposition 5.1 is fulfilled.

#### Symmetric Cost Function Matrices

If we assume symmetry of all cost function matrices, then $M_i = 0.5\left(n^2 + n + \sum_{j \in \mathcal{P}} (m_j^2 + m_j)\right)$. Since

$$n m\_i \le 0.5(n^2 + m\_i^2) < 0.5\left(n(n+1) + \sum\_{j \in \mathcal{P}} m\_j(m\_j + 1)\right) = M\_i$$

for any choice of $n$, $m_j$, $\forall j \in \mathcal{P}$, and $N \in \mathbb{N}^+$, $\dim(\ker(\mathcal{M}_i)) > 0$ holds. The sufficient condition of Proposition 5.1 is fulfilled and the solution set of player $i$ can be given in terms of the vectors $v_i$ which span the kernel of $\mathcal{M}_i$.

#### Diagonal Cost Function Matrices

Only in the case of diagonal matrices, where $M_i = n + \sum_{j \in \mathcal{P}} m_j$, do combinations of $n$, $m_i$, $N$ exist such that $n m_i \ge M_i$, thus potentially leading to an empty solution set. Here we note that if $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$, then the solution set of player $i$ is one-dimensional and a unique algebraic solution for player $i$'s parameters may be found by setting $\theta_{i,(j)} = 1$ for one particular $j \in \{1, \dots, M_i\}$ and proceeding analogously to [MZ18, Proposition 1], where the special case $N = 1$ is considered. This is possible e.g. if $n = 1$ and $m_1 = 1$ (besides $N = 1$).

Figure 5.1: Number of parameters and equations in the ILQDG problem depending on the number of states and controls in a one-player LQ differential game: (a) symmetric cost function matrices, (b) diagonal cost function matrices. The thick red line/dot denotes the cases where $n m_i = M_i - 1$.

The analysis of the sufficient condition for symmetric and diagonal cost function matrices is illustrated in Figure 5.1 for the case $N = 1$. The number of equations (rows of $\mathcal{M}_i$) and the number of parameters $M_i$ are shown as functions of the number of states $n$ and the number of controls $m_i$. In Figure 5.1(a), which depicts the case of symmetric cost function matrices, the number of parameters $M_i$ is always greater than the number of equations $n m_i$, such that the solution set of player 1 is at least one-dimensional. In Figure 5.1(b), which depicts the case of diagonal cost function matrices, we observe that there are combinations of $n$ and $m_i$ which lead to $n m_i \ge M_i$, thus not yet allowing any conclusion concerning the solution set. In turn, the situations where the kernel of $\mathcal{M}_i$ is guaranteed not to be empty in this scenario are represented by the thick red line. It denotes the cases where $n m_i = M_i - 1 < M_i$, which fulfill the sufficient condition of Proposition 5.1.

These 3D maps change if $N > 1$ and in the general case where each player penalizes the other players' controls, i.e. $R_{ij} \neq 0$ for $i \neq j$, $i \in \mathcal{P}$. The cases $N = 2$ and $N = 3$ are shown in Appendix C to further illustrate how the properties of $\mathcal{M}_i$ are affected by the number of players, states and controls.

#### Remark 5.4:

The previous analysis shows the implications of $n m_i < M_i$ as a sufficient condition for the existence of a solution set for player $i$ which is at least one-dimensional. The case $n m_i \ge M_i$ demands further attention, given that it potentially leads to an empty kernel of $\mathcal{M}_i$; this occurs if $\operatorname{rank}(\mathcal{M}_i) = M_i$. Nevertheless, this does not imply that a solution of the inverse differential game problem for player $i$ does not exist. Indeed, the existence of a Nash equilibrium described by $K^*$ implies the existence of at least one $N$-tuple $\theta = \theta^*$ which generated the equilibrium.

In light of Remark 5.4, the next section presents a formulation of inverse LQ differential games which makes it possible to find a solution of the inverse differential game problem regardless of the properties discussed above. In addition, it facilitates the derivation of further general results concerning the solution sets of each player.

# 5.4 Quadratic Programming Formulation for Inverse Linear-Quadratic Differential Games

The approach is based on the formulation of a residual function, analogous to Definition 4.4, which quantifies the extent to which the necessary and sufficient conditions for Nash equilibria are violated. Since the conditions are represented by the coupled AREs (5.7) and their reformulation (5.21), where the matrix $\mathcal{M}_i$ depends on the given matrices $A$, $B_i$, $i \in \mathcal{P}$, and $K^* = (K_1^*, \dots, K_N^*)$, the following residual is introduced.

#### Definition 5.4 (Residual)

Let a function $r_i : \mathbb{R}^{M_i} \mapsto \mathbb{R}^{n m_i}$, $i \in \mathcal{P}$, be defined as

$$r\_i(\theta\_i) = \mathcal{M}\_i \theta\_i. \tag{5.32}$$

The function $r_i$ is called the residual of the coupled AREs (5.7).

The residual quantifies the violation of the coupled AREs: it is nonzero if the parameters $\theta_i$ do not represent a Nash equilibrium for the given feedback control laws $K_i^*$ and system dynamics matrices $A$ and $B_i$, $i \in \mathcal{P}$. While it would be possible to pose an optimization problem such that $\|r_i\|$ is minimized, it is computationally more convenient to consider a quadratic residual function. The following lemma relates the quadratic residual function to the AREs.

#### Lemma 5.3

Let an LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold. The ARE (5.7) is fulfilled if and only if $\|\mathcal{M}_i \theta_i\|^2 = 0$.

#### Proof:

The proof is trivial given that the norm of a vector is zero if and only if the vector itself is a zero-vector.

In light of Lemma 5.3, the optimization problem

$$\begin{aligned} \min\_{\theta\_i} \quad &||r\_i(\theta\_i)||\_2^2 = \min\_{\theta\_i} \quad \frac{1}{2} \boldsymbol{\theta}\_i^\top \boldsymbol{H}\_i \boldsymbol{\theta}\_i, \\ \text{s.t.} \\ &\quad \boldsymbol{\theta}\_{i,(j)} > 0, \quad \forall j \in \{1, \ldots, M\_i\}, \\ &\quad \mathbf{R}\_{ii} > \mathbf{0} \end{aligned} \tag{5.33}$$

is posed, where $H_i = 2\,\mathcal{M}_i^\top \mathcal{M}_i \in \mathbb{R}^{M_i \times M_i}$. Analogously to the residual-based approach in Section 4.3.1, the aim of the optimization problem (5.33) is to minimize the quadratic residual in order to obtain parameters $\theta_i$ which fulfill the ARE.

#### Remark 5.5:

The constraints $\theta_{i,(j)} > 0$, $\forall j \in \{1, \dots, M_i\}$, in (5.33) are introduced in order to avoid trivial solutions. The literature on inverse optimal control and inverse games often introduces the constraint $\theta_{i,(j)} = 1$ for some $j \in \{1, \dots, M_i\}$ (see note 25 on page 51). Analogous results concerning the properties of (5.33) can easily be proved with this (additional) constraint. Also note that, in the case of diagonal cost function matrices, $\theta_{i,(j)} > 0$, $\forall j \in \{1, \dots, M_i\}$, ensures $R_{ii} \succ 0$.
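A minimal numerical sketch of (5.33), assuming the matrix $\mathcal{M}_i$ has already been computed from (5.21): the positivity constraints are implemented as bounds $\theta_{i,(j)} \ge \varepsilon$, and the scale is fixed via the (additional) normalization $\theta_{i,(1)} = 1$ mentioned in Remark 5.5. The function name and the normalization choice are illustrative, not prescribed by the method.

```python
import numpy as np
from scipy.optimize import minimize

def solve_inverse_lq_qp(M, eps=1e-6):
    """Minimize 0.5 * theta^T H theta with H = 2 M^T M (cf. (5.33)),
    subject to theta_(j) >= eps and the normalization theta_(1) = 1."""
    H = 2.0 * M.T @ M
    n_params = M.shape[1]
    res = minimize(
        lambda th: 0.5 * th @ H @ th,        # quadratic residual
        np.ones(n_params),
        jac=lambda th: H @ th,
        method="SLSQP",
        bounds=[(eps, None)] * n_params,
        constraints=[{"type": "eq", "fun": lambda th: th[0] - 1.0}],
        tol=1e-12,
    )
    return res.x

# Toy matrix whose kernel is spanned by the positive vector [1, 2, 3]:
M_toy = np.array([[2.0, -1.0, 0.0],
                  [0.0, 3.0, -2.0]])
theta_hat = solve_inverse_lq_qp(M_toy)       # ≈ [1, 2, 3]
```

With the normalization active, the QP returns the unique kernel element whose first entry equals one, i.e. one representative of the solution set.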

### 5.4.1 Necessary and Sufficient Conditions for One-Dimensional Solution Sets

In the following, the quadratic program (5.33) is leveraged to obtain insights into inverse LQ differential games. The properties of the quadratic program (5.33) differ considerably depending on whether $\operatorname{rank}(\mathcal{M}_i)$ is smaller than or equal to the number of parameters $M_i$. By considering the case $n m_i < M_i$, which leads to $\operatorname{rank}(\mathcal{M}_i) < M_i$, the following proposition can be stated.

#### Proposition 5.2:

Let an LQ differential game be given by (5.1) and (5.4) such that $n m_i < M_i$. Then the quadratic program (5.33) is convex and a solution is guaranteed to exist.

#### Proof:

Both the constraint set defined by $\theta_{i,(j)} > 0$, $\forall j \in \{1, \dots, M_i\}$, and the set defined by $R_{ii} \succ 0$ are convex, and therefore their intersection is also convex. Under the condition $n m_i < M_i$ we obtain $\operatorname{rank}(H_i) = \operatorname{rank}(\mathcal{M}_i^\top \mathcal{M}_i) \le \min(n m_i, M_i) = n m_i < M_i$, leading to a convex (since $\mathcal{M}_i^\top \mathcal{M}_i \succeq 0$) but not strictly convex objective function. Hence, the quadratic program is convex and therefore always has a solution.

The results of Proposition 5.2 are not surprising in the case where Assumption 5.1 holds, since this guarantees that at least one solution for the parameters $\theta_i$ of a particular player $i \in \mathcal{P}$ (and the solutions generated by multiplication with a positive constant) must exist. Note that solving the optimization problem (5.33) yields one of the solutions belonging to $\ker(\mathcal{M}_i)$ (cf. Proposition 5.1 and Theorem 5.2), but it does not give any information on the dimension of each player's solution set.

The following theorem is stated as the main result regarding the canonical parameter set of inverse LQ differential games.

#### Theorem 5.3 (Necessary and Sufficient Conditions for Uniqueness up to a Positive Factor)

Let an LQ differential game be given by (5.1) and (5.4). Furthermore, let Assumption 5.1 hold. The inverse LQ differential game has a canonical parameter set of the form

$$\Theta = \{c\_i \theta\_i; \ c\_i > 0, \ i \in \mathcal{P}\},\tag{5.34}$$

if and only if $n m_i \ge M_i - 1$ and additionally $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$.

#### Proof:

We first state that $n m_i \ge M_i - 1$ is a necessary condition for uniqueness up to a positive factor, since $n m_i < M_i - 1$ leads to a solution set of a dimension greater than 1 (cf. Proposition 5.1). By the results of Lemma 5.3, (5.7) is fulfilled if and only if $\|\mathcal{M}_i \theta_i\|^2 = 0$. We therefore proceed to analyze the quadratic program (5.33). Under the theorem condition $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$ we have $\dim(\ker(\mathcal{M}_i)) = 1$, which implies a one-dimensional solution set of the form (5.34) for each player $i \in \mathcal{P}$.

The case $\operatorname{rank}(\mathcal{M}_i) < M_i - 1$ leads to solution sets with a dimension greater than 1 and is therefore excluded. Hence, only the case $\operatorname{rank}(\mathcal{M}_i) = M_i$ remains, which we analyze using (5.33). If $\operatorname{rank}(\mathcal{M}_i) = M_i$, which is only possible if $n m_i \ge M_i$, then we obtain $H_i \succ 0$ and thus (5.33) is strictly convex. Strict convexity leads to a unique solution of (5.33) and therefore to a unique solution of the ARE (5.7). But the latter contradicts Corollary 5.1, from which we conclude that $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$ is also necessary.

Theorem 5.3 gives necessary and sufficient conditions for the solution set of each player $i$ to be one-dimensional, i.e. for each player's parameters $\theta_i$ to be unique up to a positive real factor $c_i$.

Summarizing the results of this subsection: if the canonical parameter set has the form (5.34), then a particular $\theta_i$ belonging to the corresponding solution set of each player $i$ can be computed by means of the quadratic program (5.33). If the conditions of Theorem 5.3 are not fulfilled, then, by the results of Proposition 5.2, (5.33) yields some solution from the canonical parameter set, with non-unique parameters for each player $i \in \mathcal{P}$.

#### 5.4.2 Identification of Feedback Matrices

The optimization problem (5.33) always yields a solution associated with a given Nash equilibrium represented by $K^*$. If only observed Nash equilibrium control and state trajectories are available, then it becomes necessary to estimate the control laws $K_i^*$. For the $N$-player inverse differential game at hand, a least-squares identification based on (5.2) is proposed. For this purpose, let us introduce a finite sequence of sampling times

$$\mathcal{T}\_{i} \coloneqq \{ t\_{k} \in [0, T] : 1 \le k \le K\_{i} \land 0 \le t\_{1} < \dots < t\_{K\_{i}} \le T \} \tag{5.35}$$

for each player $i \in \mathcal{P}$, where $[0, T]$ is the time interval for which $x^*(t)$ and $u_i^*(t)$ are available. Let the values of the state and control trajectories at $t_k$ be denoted by $x^{[k]}$ and $u_i^{[k]}$, respectively. Then, the feedback matrix can be estimated by means of

$$\hat{\mathbf{K}}\_i = \underset{\mathbf{K}\_i}{\text{arg min}} \sum\_{k=1}^{K\_i} ||\mathbf{K}\_i \mathbf{x}^{[k]} + \mathbf{u}\_i^{[k]}||^2,\tag{5.36}$$

where $\|\cdot\|$ denotes the Euclidean norm. Least-squares estimation theory states that the parameters (in this case the entries of $K_i$) can be recovered if persistence of excitation (PE) conditions are fulfilled [ÅW95, Section 2.4]. These conditions demand that the trajectories of $x$ and $u_i$ are sufficiently "informative" and, e.g., not identically zero. Furthermore, if the least-squares estimation is considered from a stochastic point of view, i.e.

$$
\mathbf{u}\_i^{[k]} = -\mathbf{K}\_i \mathbf{x}^{[k]} + \boldsymbol{\varepsilon}\_i,\tag{5.37}
$$

where $\varepsilon_i \in \mathbb{R}^{m_i}$ denotes a vector of zero-mean Gaussian white noise, then the estimation is bias-free if $\varepsilon_i(t)$ is independent of the state $x(t)$ [ÅW95, p. 47]. The conditions for a bias-free estimation are usually not met. For example, the state $x(t)$ depends on the controls $u_i(t)$ through the system dynamics and is therefore not independent of the additive Gaussian noise. Nevertheless, the LS estimation works well in practice, as shown later in Chapter 7.
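The least-squares estimate (5.36) can be sketched as follows (a sketch; variable and function names are illustrative). Stacking the sampled states and controls column-wise, $u_i^{[k]} \approx -K_i x^{[k]}$ becomes a linear regression that a standard least-squares routine solves directly:

```python
import numpy as np

def estimate_feedback(X, U):
    """LS estimate of K_i from samples of x and u_i (cf. (5.36)).
    X: (n, K) state samples x^[k] as columns; U: (m_i, K) control samples.
    Solves min_K sum_k ||K x^[k] + u_i^[k]||^2 via X^T K^T = -U^T."""
    K_hat, *_ = np.linalg.lstsq(X.T, -U.T, rcond=None)
    return K_hat.T

# Noise-free check: samples generated by u = -K x are exactly recovered
# (random states are persistently exciting with probability one).
rng = np.random.default_rng(0)
K_true = np.array([[1.0, 0.5],
                   [0.2, 2.0]])
X = rng.standard_normal((2, 50))
U = -K_true @ X
K_hat = estimate_feedback(X, U)
```

With noisy controls as in (5.37), the same routine yields the LS estimate whose bias properties are discussed above.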

#### 5.4.3 Algorithm and Example

The inverse LQ differential game method for determining a particular solution parameter vector <sup>θ</sup>i of player <sup>i</sup> based on (5.33) can be implemented with the following algorithm.

Algorithm 2 IOC-based method for player $i$ in an inverse feedback LQ differential game.

Input: State and control trajectories $x(t)$ and $(u_1(t), \dots, u_N(t))$, system matrix $A$, input matrices $B_i$, $\forall i \in \mathcal{P}$.

Output: Computed player $i$ cost function parameters $\theta_i$.

1: Estimate the feedback matrices $\hat{K}_j$, $\forall j \in \mathcal{P}$, from the trajectories by means of the least-squares problem (5.36).

2: Compute $\mathcal{M}_i$ from $A$, $B_j$ and $\hat{K}_j$, $j \in \mathcal{P}$, according to (5.21).

3: Solve the quadratic program (5.33) to obtain $\theta_i$.
Note that, similar to the methods presented in Chapter 4, Algorithm 2 may be used for cost function parameter identification of any player $i \in \mathcal{P}$ in an $N$-player infinite-horizon LQ differential game. Furthermore, the method may also be applied to the special case of a single player, i.e. an inverse LQ optimal control problem. The core of the presented approach is the quadratic program, which can be solved very efficiently with state-of-the-art methods, e.g. active-set and interior-point methods [NW06, Chapter 16].

This section ends with an example illustrating Theorem 5.3 and the use of Algorithm 2 for identifying cost function parameters in an inverse LQ differential game.

#### Example 5.3:

Consider an infinite-horizon LQ differential game where two players control a stabilizable linear system defined by the differential equation

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix} \mathbf{x}(t) + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \boldsymbol{u}\_1(t) + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \boldsymbol{u}\_2(t) \tag{5.38}$$

and select their feedback strategies according to a cost function of the form (5.4) with cost function matrices

$$\begin{aligned} \mathbf{Q}\_{1} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, & \mathbf{Q}\_{2} &= \begin{bmatrix} 1 & 0 \\ 0 & 10 \end{bmatrix}, \\\ \mathbf{R}\_{11} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, & \mathbf{R}\_{22} &= \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}, \\\ \mathbf{R}\_{12} &= \mathbf{0}, & \mathbf{R}\_{21} &= \mathbf{0}. \end{aligned} \tag{5.39}$$

The vectorization of the cost function matrices according to (5.12) leads to a parameter vector of dimension $M_i = 4$ given by

$$
\theta\_i = \begin{bmatrix} Q\_{i,(1,1)} & Q\_{i,(2,2)} & R\_{ii,(1,1)} & R\_{ii,(2,2)} \end{bmatrix}, \quad i \in \{1,2\},
$$

where $Q_{i,(j,j)}$ and $R_{ii,(j,j)}$ denote the $j$-th diagonal entry of $Q_i$ and $R_{ii}$, respectively. Analogously to the previous example, the infinite-horizon LQ differential game was solved by calculating the solution of the corresponding RDEs (3.69) and extracting the converged value of $P_i$. The resulting state and control trajectories $x^*(t)$ and $u_i^*(t)$ were confirmed to correspond to a stable system and hence to a Nash equilibrium.

In this example, the inverse method is given the resulting state and control trajectories $x^*(t)$ and $u_i^*(t)$ instead of the Nash equilibrium feedback matrices $K_i^*$. Following Algorithm 2, these trajectories were used to estimate the feedback matrices with the LS approach given in (5.36), where a set $\mathcal{T}_i$ with $T = 10$ and $K_i = 501$ was selected according to (5.35). The Nash equilibrium is estimated practically exactly, with deviations $\|\hat{K}_i - K_i^*\| < 10^{-14}$ for all $i \in \{1, 2\}$. With $K^* = (K_1^*, K_2^*)$, we obtain the matrices

$$\mathbf{M}\_1 = \begin{bmatrix} -0.436 & -0.026 & 0.466 & -0.004 \\ 0.100 & -0.027 & 0.053 & -0.126 \\ 0.100 & -0.027 & -0.078 & 0.006 \\ -0.032 & -0.153 & -0.020 & 0.204 \end{bmatrix} \tag{5.40}$$

and

$$\mathbf{M}\_2 = \begin{bmatrix} -0.436 & -0.026 & 0.530 & -0.365 \\ 0.100 & -0.027 & 0.144 & -0.114 \\ 0.100 & -0.027 & 0.264 & -0.353 \\ -0.032 & -0.153 & -0.048 & 1.655 \end{bmatrix} \text{.} \tag{5.41}$$

We find that $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$ holds for $i \in \{1, 2\}$, which indicates a one-dimensional solution set for each player $i$ according to Theorem 5.3. By solving the quadratic program (5.33) we obtain the parameters

$$\begin{aligned} \hat{\boldsymbol{\theta}}\_1 &= \begin{bmatrix} 1.000 & 1.000 & 1.000 & 1.000 \end{bmatrix} \\ \hat{\boldsymbol{\theta}}\_2 &= \begin{bmatrix} 0.602 & 6.024 & 1.204 & 0.602 \end{bmatrix} . \end{aligned} \tag{5.42}$$

The parameters $\theta_1^*$ were exactly identified, while for the second player the parameters are identified up to a multiplying constant. In particular, we have $\hat{\theta}_2 = 0.6024\,\theta_2^*$.
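The rank condition of Theorem 5.3 can be checked numerically from the printed matrices (5.40) and (5.41). Since the entries are rounded to three decimals, the exact rank deficiency shows up as one singular value close to zero (a verification sketch, not part of Algorithm 2):

```python
import numpy as np

# M_1 and M_2 as printed in (5.40) and (5.41), rounded to three decimals.
M1 = np.array([[-0.436, -0.026,  0.466, -0.004],
               [ 0.100, -0.027,  0.053, -0.126],
               [ 0.100, -0.027, -0.078,  0.006],
               [-0.032, -0.153, -0.020,  0.204]])
M2 = np.array([[-0.436, -0.026,  0.530, -0.365],
               [ 0.100, -0.027,  0.144, -0.114],
               [ 0.100, -0.027,  0.264, -0.353],
               [-0.032, -0.153, -0.048,  1.655]])

def kernel_direction(M):
    """Singular values of M and the right singular vector belonging to the
    smallest one, i.e. the (numerical) kernel direction of M."""
    _, s, Vt = np.linalg.svd(M)
    v = Vt[-1]
    return s, v * np.sign(v[np.argmax(np.abs(v))])   # fix the sign

s1, v1 = kernel_direction(M1)   # v1 proportional to theta_1 = [1, 1, 1, 1]
s2, v2 = kernel_direction(M2)   # v2 proportional to theta_2 = [1, 10, 2, 1]
```

The trailing singular value is orders of magnitude smaller than the others, confirming $\operatorname{rank}(\mathcal{M}_i) = M_i - 1$, and the associated singular vectors reproduce the parameter directions found in (5.42).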

# 5.5 Method Limitations

Prior to this chapter's conclusion, possible limitations of the method are discussed. A first issue arises, similar to the last chapter, if e.g. only noise-corrupted measurements of the state and control trajectories are available. Nevertheless, since the method relies on the feedback control laws and these are estimated by the LS method, it can be conjectured that the method has considerable robustness to noise in the trajectories. This case is further examined in Section 7.5. In addition, truncated trajectories do not represent a problem as long as they fulfill the PE condition mentioned in Section 5.4.2. Informative trajectories can fulfill this condition even with a small number of samples.

A further issue arises if an $i \in \mathcal{P}$ exists such that $K_i$ does not constitute a Nash equilibrium feedback law with respect to any set of cost function matrices $Q_i$, $R_{ii}$, and $R_{ij}$ of the assumed structure, e.g. symmetric. More generally, $K_i$ might not be a Nash equilibrium for any set of cost function matrices, regardless of their structure. This can occur e.g. if $K_i$ is identified from trajectories $x(t)$ and $u_i(t)$ which do not represent a Nash equilibrium. However, by the results of Proposition 5.2, the existence of a solution to the quadratic program (5.33) is guaranteed, independently of the Nash character of the control laws. Since the presented quadratic programming approach is based on the coupled AREs, which are necessary and sufficient conditions for feedback Nash equilibria, the identification yields parameters corresponding to the Nash equilibrium feedback law closest to the originally observed feedback law, where distance is measured in terms of the violation of the coupled AREs (cf. the discussion of the experimental results in Section 8.8). However, this distance may not be proportional to, or even correlate with, the error between observed and identified trajectories.

# 5.6 Conclusion

In this chapter, the inverse problem of infinite-horizon LQ differential games was considered, in which a feedback Nash equilibrium is given and cost function parameters are sought which explain this equilibrium. The parameters correspond to the elements of the matrices of the players' quadratic cost functions, and the Nash equilibrium is assumed to be given in the form of an $N$-tuple of player feedback matrices. The solution of the inverse LQ differential game was given in the form of an explicit set, the canonical parameter set, which describes all possible cost function parameter vectors or matrices leading to the same Nash equilibrium, and was achieved by a reformulation of the necessary and sufficient conditions for Nash equilibria. Importantly, sufficient conditions for the possibility of stating such explicit sets were given. In addition, these results were applied to formulate a quadratic program which allows an efficient computation of the cost function parameters. Moreover, the analysis of the resulting quadratic program allowed the statement of necessary and sufficient conditions for the uniqueness of the solution set of a particular player up to a multiplying positive constant. Finally, it was demonstrated that the feedback matrices of all players can be estimated from Nash equilibrium state and control trajectories using a least-squares approach. Consequently, all of the results developed in this chapter can be applied if, instead of the player feedback matrices, observations of Nash equilibrium state and control trajectories are available.

The results of this chapter represent solutions related to one of the questions Kalman stated: "What optimization problems lead to a constant, linear control law?" (Problem A in [Kal64]). This problem was recently considered in [MZ18] for single-player infinite-horizon problems; these results have been generalized for N-player differential games in this chapter.

# 6 Inverse Dynamic Games Based on Inverse Reinforcement Learning

This chapter presents inverse dynamic game solutions such that cost functions which explain observed behavior of several players can be found. The methods presented in this chapter are based on inverse reinforcement learning techniques and on a discrete-time formulation of the infinite dynamic game. Therefore, the methods in this chapter represent an alternative approach to the IOC-based methods of the previous two chapters. Nevertheless, there is a similarity to the results of these aforementioned chapters, namely the development of an inverse dynamic game method which does not rely on a repeated solution of the forward problem, i.e. the repeated computation of Nash equilibrium state and control trajectories. After a short introduction to the principle of maximum entropy, which represents the basis of the methods, the main contribution of this chapter is shown, namely the derivation of a probabilistic method for inverse dynamic games based on Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL). The cases where the players' behavior corresponds to an open-loop and a feedback Nash equilibrium are considered. In addition, results on the unbiasedness of the estimation of cost function parameters are presented. After providing further details which are important for the practical implementation of these methods, examples for the solution of inverse linear-quadratic dynamic games are given. The chapter ends with conclusions on all presented results.<sup>33</sup>

# 6.1 Introduction to the Probabilistic Approach and Maximum Entropy

In this thesis, the aim is the development of IRL methods for inverse dynamic games which allow for continuous-valued control and state spaces, such that comparable methods to the ones presented in Chapters 4 and 5 based on inverse optimal control can be obtained. The inverse dynamic game problem is regarded in this chapter from a probabilistic perspective which is introduced in the following.

<sup>33</sup> Preliminary versions of the results of this chapter have been published in the conference paper [KIR+17]. The chapter's contents are based on the article [IBKH20].

Figure 6.1: Example of a probability function for trajectories

For a simplified presentation, consider the outcome of a dynamic game as a single trajectory $\tilde{\xi}(t)$ which is assumed to stem from a probability function $P(\xi)$ defined over a finite and discrete set of (in this case five) possible trajectories $\xi(t)$. This scenario is illustrated in Figure 6.1, where the observed trajectory $\tilde{\xi}(t) = \xi^{(3)}$ is colored green. In this example, a probability value is assigned to each of the five possible trajectories. Transferring this line of thought to an inverse problem in dynamic games means that one or several trajectories $\xi$ are observed, but their probabilities are unknown. The choice of a probability function which explains these observed trajectories is not unique, even if some constraints are introduced. The problem becomes even more complex if the trajectories originate from a probability density function $p(\xi)$ instead of the aforementioned probability mass function $P(\xi)$, since this implies a potentially infinite number of possible trajectories. In order to resolve the ambiguity in this kind of problem, the principle of maximum entropy can be applied. It was introduced by Jaynes in [Jay57] as a means to infer probability distributions which are consistent with experimental data.<sup>34</sup> According to Jaynes, this method leads to the "least biased estimate possible on the given information". This is illustrated e.g. by the fact that the distribution which maximizes the entropy under the constraints of fixed and known expectation and variance is the Gaussian distribution. Similarly, the maximum entropy distribution without any constraints is the uniform distribution [CT06, Section 12.2].

<sup>34</sup> Jaynes' objective was to present a potential application of information theory results—obtained by Shannon ([Sha48])— to the field of statistical mechanics. The interested reader is also referred to [PGLD13] for a historical review.

This introduced probabilistic perspective of dynamic games constitutes the basis of the definition of the problem. Likewise, the principle of maximum entropy shall be leveraged for the development of inverse dynamic game solutions presented in the next sections.
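The five-trajectory example can be made concrete with a small sketch: among all distributions over the candidate trajectories that match an observed expected feature value, the maximum entropy distribution is the Gibbs distribution $P(\xi) \propto \exp(\lambda f(\xi))$, whose parameter $\lambda$ follows from a one-dimensional root search. The feature values and the target expectation below are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical feature values of the five candidate trajectories of Figure 6.1
# and an assumed empirical feature expectation from the observed trajectory.
f = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
f_target = 2.5

def gibbs(lam):
    """Maximum entropy distribution under an expectation constraint:
    P(xi) proportional to exp(lam * f(xi))."""
    w = np.exp(lam * f)
    return w / w.sum()

# Choose lambda such that the expected feature value matches the observation.
lam = brentq(lambda l: gibbs(l) @ f - f_target, -50.0, 50.0)
P = gibbs(lam)
```

Without the expectation constraint, $\lambda = 0$ and the distribution reduces to the uniform one, in line with [CT06, Section 12.2].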

# 6.2 Problem Definition

Consider an infinite dynamic game in discrete time<sup>35</sup>, where $N$ players simultaneously control a system with (potentially time-variant) dynamics of the form (see also Definition A.1)

$$\mathbf{x}^{(k+1)} = f\_D^{(k)}\left(\mathbf{x}^{(k)}, \mathbf{u}\_1^{(k)}, \dots, \mathbf{u}\_N^{(k)}\right) \tag{6.1a}$$

$$\mathbf{x}^{(1)} = \mathbf{x}\_1.\tag{6.1b}$$

The goal of each player $i \in \mathcal{P}$ is to minimize its individual cost function by applying a control strategy. The structure of the cost functions is assumed to be defined by a linear combination of $M_i \in \mathbb{N}$ known features<sup>36</sup>, i.e.

$$J\_i = -\sum\_{k=1}^{k\_E} \boldsymbol{\theta}\_i^\top \boldsymbol{\phi}\_i \left(\mathbf{x}^{(k)}, \mathbf{u}\_1^{(k)}, \dots, \mathbf{u}\_N^{(k)}\right),\tag{6.2}$$

where <sup>k</sup><sup>E</sup> <sup>∈</sup> <sup>N</sup>>0, <sup>ϕ</sup>i <sup>∈</sup> <sup>R</sup> <sup>M</sup><sup>i</sup> contains all features of player i defined analogously to Definition 4.2 and <sup>θ</sup>i <sup>∈</sup> <sup>R</sup> <sup>M</sup><sup>i</sup> represents the vector of player i's individual feature weights, i.e. the cost function parameters.

A main element of inverse problems in optimal control and dynamic games is the set of observed state and control trajectories. Generally speaking, a trajectory consists of a sequence of values according to the discrete-time formulation of the game. Therefore, the following definition is introduced.

<sup>35</sup> The discrete-time formulation is chosen following the line of a vast number of previous studies on single-player IRL (cf. Section 2.1.3). The results of this chapter are based on definitions analogous to the ones in Chapter 3. These discrete-time dynamic game definitions are given in Appendix A.

<sup>36</sup> In this chapter, the term features is used instead of basis functions in order to be consistent to IRL literature. Furthermore, in the following it is assumed that the feature functions in <sup>ϕ</sup><sup>i</sup> are independent of k. Their corresponding values are still stage-dependent through the values of the states and the controls. In addition, note that the cost function has been multiplied with a factor of −1. This is done in order to be congruent with IRL literature which assumes a reward function to be maximized instead of a cost function to be minimized.

#### Definition 6.1 (Stacked State and Control Values)

Let

$$\underline{\mathbf{x}} = \left[ \left( \mathbf{x}^{(1)} \right)^{\mathsf{T}} \quad \dots \quad \left( \mathbf{x}^{(k_E)} \right)^{\mathsf{T}} \right]^{\mathsf{T}} \in \mathbb{R}^{nk_E},\tag{6.3}$$

$$\underline{\mathbf{u}}\_{i} = \left[ \left( \mathbf{u}\_{i}^{(1)} \right)^{\mathsf{T}} \quad \dots \quad \left( \mathbf{u}\_{i}^{(k\_{E})} \right)^{\mathsf{T}} \right]^{\mathsf{T}} \in \mathbb{R}^{m\_{i}k\_{E}},\tag{6.4}$$

$\forall i \in \mathcal{P}$, be vectors containing all values of the system state $\mathbf{x}^{(k)}$ and the control values $\mathbf{u}_i^{(k)}$ of player $i \in \mathcal{P}$ for all time steps $k \in \mathcal{K}$, respectively.

Furthermore, the following notation is introduced for a set of trajectories in accordance with the system dynamics (6.1) which will facilitate a more compact representation of the results of this chapter.

#### Definition 6.2 (Trajectory Set)

A trajectory $\zeta := \left\{\underline{\mathbf{x}}, \underline{\mathbf{u}}_1, \dots, \underline{\mathbf{u}}_N\right\}$ is defined as a set containing the stacked values of the system state $\underline{\mathbf{x}}$ and the stacked control values $\underline{\mathbf{u}}_i$ of all players $i \in \mathcal{P}$, which is feasible with respect to the system dynamics given by (6.1).

The estimation of the cost function parameters $\boldsymbol{\theta}_i$ is based on an observed set of trajectories denoted by $\tilde{\zeta} := \left\{\underline{\tilde{\mathbf{x}}}, \underline{\tilde{\mathbf{u}}}_1, \dots, \underline{\tilde{\mathbf{u}}}_N\right\}$ which, following the probabilistic approach presented in Section 6.1, is assumed to be sampled from a probability density function $p\left(\zeta \mid \boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*\right)$ with unknown parameters $\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$.

A further key quantity in IRL problems is the feature count (cf. [AN04, RBZ06, ZMBD08] in the single-player case), which is introduced in the following.

#### Definition 6.3 (Feature Count)

The feature count $\boldsymbol{\mu}_i(\zeta) \in \mathbb{R}^{M_i}$ of a player $i \in \mathcal{P}$ along a trajectory $\zeta$ is defined as a vector containing the accumulated values of the features along that trajectory, i.e.

$$\mu\_i\left(\boldsymbol{\zeta}\right) = \sum\_{k=1}^{k\_E} \phi\_i\left(\mathbf{x}^{(k)}, \mathbf{u}\_1^{(k)}, \dots, \mathbf{u}\_N^{(k)}\right),\tag{6.5}$$

with $\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)} \in \zeta$, $\forall i \in \mathcal{P}$, $k \in \mathcal{K}$.
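To make Definition 6.3 concrete, the accumulation in (6.5) can be computed directly from sampled state and control sequences. The following sketch is illustrative only: the quadratic features and the two-player trajectory data are assumptions made for this example, not taken from the text.

```python
import numpy as np

def feature_count(phi, xs, us):
    """Accumulate the feature vector phi(x, u_1, ..., u_N)
    over all time steps k = 1, ..., k_E, cf. (6.5)."""
    return sum(phi(x, *u) for x, u in zip(xs, us))

# Illustrative quadratic features for a two-player game (assumed for the example)
phi_i = lambda x, u1, u2: np.array([x @ x, u1 @ u1, u2 @ u2])

xs = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]     # states x^(1), x^(2)
us = [(np.array([1.0]), np.array([0.0])),             # controls (u_1^(k), u_2^(k))
      (np.array([0.5]), np.array([0.5]))]

mu_i = feature_count(phi_i, xs, us)
```

The result `mu_i` is the vector of accumulated feature values of player $i$ along the sampled trajectory.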

Using the feature counts $\boldsymbol{\mu}_i(\zeta)$ and (6.2), the costs along a trajectory $\zeta$ for any player $i \in \mathcal{P}$ can be rewritten as

$$J_i\left(\zeta, \boldsymbol{\theta}_i\right) = -\boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i\left(\zeta\right). \tag{6.6}$$

In the following, and with some abuse of notation in favor of better readability, $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)$ represents the probability density of a trajectory $\zeta$ as a function of the parameters $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_N$ corresponding to the cost functions $J_i$, $\forall i \in \mathcal{P}$.

Having introduced these basic definitions, the inverse dynamic game problem considered in this chapter is defined as follows.

#### Definition 6.4 (Inverse Dynamic Game Based on IRL)

Find parameters $\hat{\boldsymbol{\theta}}_i$, $\forall i \in \mathcal{P}$, such that the expected costs of a trajectory sampled from the probability density $p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)$ resulting from the identified parameters correspond for each player $i \in \mathcal{P}$ to the expected costs of the observed trajectory sampled from the probability density $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$, i.e.

$$\mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)}\left\{J_i\left(\zeta, \boldsymbol{\theta}_i^*\right)\right\} \overset{!}{=} \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{J_i\left(\zeta, \boldsymbol{\theta}_i^*\right)\right\}, \quad \forall i \in \mathcal{P}.\tag{6.7}$$

The requirement (6.7) arises from the demand of obtaining for each player a cost function that results in an individual performance as good as the observed one, where the performance is measured with respect to each player's unknown true cost function $J_i\left(\zeta, \boldsymbol{\theta}_i^*\right)$.<sup>37</sup> Similar to the inverse differential game problem of Definition 4.3, Definition 6.4 implies that we are interested in finding one parameter vector $\boldsymbol{\theta}_i$ for each player $i \in \mathcal{P}$ such that (6.7) holds, i.e. the dynamic game with identified cost function parameters is able to explain the observed trajectories. This differs from the problem investigated in Section 5.2, where the complete solution set for each player $i \in \mathcal{P}$ is sought, since inverse problems in optimal control and dynamic games are naturally ill-posed.

# 6.3 Maximum Entropy Distribution of Trajectories in Dynamic Games

The principle of maximum entropy provides a means to resolve this ill-posedness such that parameters can be found which solve the problem given in Definition 6.4. In this section, we transfer the maximum entropy approach to inverse dynamic games with $N$ players. The aim is to find a probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)$ which represents the probability of trajectories $\zeta$ as a function of the parameters $\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_N$, considering (6.7) as the only constraint or a-priori knowledge. Finding an expression for $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)$ provides a useful intermediate result on the way towards the solution of inverse dynamic games with IRL.

<sup>37</sup> Similar objectives have been frequently defined in single-player IRL methods, see e.g. the seminal papers [NR00] and [AN04].

In order to state a relationship between the observed trajectories $\tilde{\zeta}$ and the probability distribution $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$ which generated them, the following assumption is made:

#### Assumption 6.1

The feature count of player $i$ along the trajectory $\tilde{\zeta}$ (denoted as $\tilde{\boldsymbol{\mu}}_i$ for all players $i \in \mathcal{P}$) represents the expectation of the feature count $\mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{\boldsymbol{\mu}_i(\zeta)\right\}$ based on the probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$ which results from the parameters $\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$, i.e.

$$\mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{\boldsymbol{\mu}_i\left(\zeta\right)\right\} = \tilde{\boldsymbol{\mu}}_i, \quad \forall i \in \mathcal{P}.\tag{6.8}$$


Assumption 6.1 means that each observation $\tilde{\zeta}_l$ is representative<sup>38</sup>. As no further information is available, the sample mean is used as an estimate for the expectation of the feature count. Furthermore, note that Assumption 6.1 implies that if $n_t \in \mathbb{N}_{>0}$ observed trajectories are given, i.e. a set of trajectories $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$, the expectation of the feature count of player $i$ is given by

$$\mathbb{E}\_{\mathbb{P}\big(\boldsymbol{\zeta}\big|\boldsymbol{\theta}^{\*}\_{1:N}\big)}\left\{\boldsymbol{\mu}\_{\boldsymbol{i}}\big(\boldsymbol{\zeta}\big)\right\}=\frac{1}{n\_{\boldsymbol{t}}}\sum\_{l=1}^{n\_{\boldsymbol{t}}}\boldsymbol{\mu}\_{\boldsymbol{i}}\left(\tilde{\boldsymbol{\zeta}}\_{l}\right),\tag{6.9}$$

where $\boldsymbol{\mu}_i\left(\tilde{\zeta}_l\right)$ denotes the feature count of the observed trajectory $\tilde{\zeta}_l$ with $l \in \{1, \dots, n_t\}$.
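Under Assumption 6.1, the expectation in (6.9) is estimated by the sample mean of the observed feature counts. A minimal sketch (the three observed feature-count vectors below are made-up values for illustration):

```python
import numpy as np

# Feature counts mu_i(zeta_l) of n_t = 3 observed trajectories (illustrative values)
observed_feature_counts = np.array([
    [1.4, 0.9],
    [1.6, 1.1],
    [1.5, 1.0],
])

# Sample mean as estimate of the expected feature count, cf. (6.9)
mu_tilde_i = observed_feature_counts.mean(axis=0)
```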

#### Lemma 6.1 (Path Feature Count Equivalence to Costs)

Let the expectation of the feature count be equal for both the probability density $p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)$ resulting from the identified parameters and the probability density $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$ with original parameters $\boldsymbol{\theta}_1^*, \dots, \boldsymbol{\theta}_N^*$, i.e.

$$\mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)}\left\{\boldsymbol{\mu}_i\left(\zeta\right)\right\} = \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{\boldsymbol{\mu}_i\left(\zeta\right)\right\}\tag{6.10}$$

for each player $i \in \mathcal{P}$. Then, for any parameters with $\left\|\boldsymbol{\theta}_i^*\right\|_2 < \infty$, (6.7) is fulfilled.

<sup>38</sup> A representative sample is a typical sample of a population [Mar91]. In this context, the population consists of all possible trajectories which can be generated from the assumed probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$.

#### Proof:

By rewriting (6.7), we can state the following relations:

$$0 \le \left| \mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)}\left\{ J_i\left( \zeta, \boldsymbol{\theta}_i^* \right) \right\} - \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{ J_i\left( \zeta, \boldsymbol{\theta}_i^* \right) \right\} \right| \tag{6.11}$$

$$= \left| \mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)}\left\{ \boldsymbol{\theta}_i^{*\top} \boldsymbol{\mu}_i(\zeta) \right\} - \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{ \boldsymbol{\theta}_i^{*\top} \boldsymbol{\mu}_i(\zeta) \right\} \right| \tag{6.12}$$

$$\le \left\| \boldsymbol{\theta}_i^* \right\|_2 \left\| \mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_{1:N}\right)}\left\{ \boldsymbol{\mu}_i\left(\zeta\right) \right\} - \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)}\left\{ \boldsymbol{\mu}_i\left(\zeta\right) \right\} \right\|_2, \tag{6.13}$$

where the last step follows from the Cauchy-Schwarz inequality.

Therefore, if (6.10) holds, the right-hand side of (6.13) equals zero and hence, together with the inequality in (6.11), (6.7) holds as well.

Lemma 6.1 represents the principle of matching feature expectations for all players. This principle was introduced in [AN04] for $N = 1$ and used as a basis for numerous single-player IRL methods.

Since the inverse dynamic game problem defined in Definition 6.4 demands the fulfillment of (6.7), the results of Lemma 6.1 together with Assumption 6.1 lead to the requirement

$$\mathbb{E}\_{\mathbf{p}(\boldsymbol{\zeta}|\boldsymbol{\theta}\_{1:N})}\left\{\boldsymbol{\mu}\_{\boldsymbol{i}}\left(\boldsymbol{\zeta}\right)\right\} = \tilde{\boldsymbol{\mu}}\_{\boldsymbol{i}},\tag{6.14}$$

for each player $i \in \mathcal{P}$. Moreover, for a density function,

$$\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \mathrm{d}\zeta = 1 \tag{6.15}$$

must apply. Since the conditions (6.14) and (6.15) do not lead to a unique solution for the probability density function, the principle of maximum entropy is applied. For a continuous random variable, the differential entropy corresponding to a probability density function is given by [CT06, Section 8.1]

$$h\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)\right) = -\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \ln\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)\right) \mathrm{d}\zeta.\tag{6.16}$$

In order to determine a probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)$ which only takes the information of (6.14) and (6.15) into consideration, the differential entropy (6.16) is maximized with the requirements (6.14) and (6.15) as optimization constraints. The density function which leads to maximum entropy in dynamic games is presented in the following lemma.

#### Lemma 6.2 (Maximum Entropy Probability Distribution in Inverse Dynamic Games)

The maximum entropy distribution under the constraints defined by (6.14) and (6.15) is given by

$$\begin{split} \operatorname{p} \left( \boldsymbol{\zeta} \mid \boldsymbol{\theta}\_{1:N} \right) &= \frac{\exp \left( \sum\_{i=1}^{N} \boldsymbol{\theta}\_{i}^{\top} \boldsymbol{\mu}\_{i} \left( \boldsymbol{\zeta} \right) \right)}{\int\_{\forall \boldsymbol{\zeta}} \exp \left( \sum\_{i=1}^{N} \boldsymbol{\theta}\_{i}^{\top} \boldsymbol{\mu}\_{i} \left( \boldsymbol{\zeta} \right) \right) d\boldsymbol{\zeta}} \\ &= \frac{\exp \left( \sum\_{i=1}^{N} -J\_{i} \left( \boldsymbol{\zeta}, \boldsymbol{\theta}\_{i} \right) \right)}{\int\_{\forall \boldsymbol{\zeta}} \exp \left( \sum\_{i=1}^{N} -J\_{i} \left( \boldsymbol{\zeta}, \boldsymbol{\theta}\_{i} \right) \right) d\boldsymbol{\zeta}}, \end{split} \tag{6.17}$$

where the alternative representation given in the last equation follows from (6.6).

#### Proof:

A calculus-based approach is followed as suggested in [CT06, Section 12.1]. To maximize the differential entropy (6.16) under the constraints given by (6.14) and (6.15), we introduce Lagrange multipliers $\psi \in \mathbb{R}$ and $\boldsymbol{\theta}_i \in \mathbb{R}^{M_i}$, $\forall i \in \mathcal{P}$, and set up the objective function

$$\begin{split} \Lambda\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right), \psi, \boldsymbol{\theta}_{1:N}\right) &= -\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \ln\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)\right) \mathrm{d}\zeta + \psi\left(\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \mathrm{d}\zeta - 1\right) \\ &\quad + \boldsymbol{\theta}_1^\top \left(\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \boldsymbol{\mu}_1\left(\zeta\right) \mathrm{d}\zeta - \tilde{\boldsymbol{\mu}}_1\right) + \dots \\ &\quad + \boldsymbol{\theta}_N^\top \left(\int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \boldsymbol{\mu}_N\left(\zeta\right) \mathrm{d}\zeta - \tilde{\boldsymbol{\mu}}_N\right). \end{split} \tag{6.18}$$

In this way, the expression

$$\begin{split} \frac{\partial \Lambda}{\partial p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)} &= -\int_{\forall\zeta} \ln\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)\right) \mathrm{d}\zeta - \int_{\forall\zeta} \frac{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)}{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)} \mathrm{d}\zeta \\ &\quad + \psi \int_{\forall\zeta} 1 \, \mathrm{d}\zeta + \boldsymbol{\theta}_1^\top \int_{\forall\zeta} \boldsymbol{\mu}_1\left(\zeta\right) \mathrm{d}\zeta + \dots + \boldsymbol{\theta}_N^\top \int_{\forall\zeta} \boldsymbol{\mu}_N\left(\zeta\right) \mathrm{d}\zeta \\ &= \int_{\forall\zeta} \left( -\ln\left(p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)\right) - 1 + \psi + \sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i\left(\zeta\right) \right) \mathrm{d}\zeta \overset{!}{=} 0 \end{split} \tag{6.19}$$

gives a necessary condition for the sought probability density function. By inspecting (6.19) we see that this condition is fulfilled if

$$-\ln\left(\mathbf{p}\left(\boldsymbol{\zeta}\,\middle|\,\theta\_{1:N}\right)\right) - 1 + \psi + \sum\_{i=1}^{N} \theta\_i^{\top} \boldsymbol{\mu}\_i\left(\boldsymbol{\zeta}\right) = 0. \tag{6.20}$$

By reformulating (6.20), we obtain the probability density function of a trajectory $\zeta$, i.e.

$$p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) = \exp\left(-1+\psi\right)\exp\left(\sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i\left(\zeta\right)\right). \tag{6.21}$$

Using (6.21), (6.15) is rewritten as

$$\begin{split} 1 &= \int_{\forall\zeta} p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \mathrm{d}\zeta = \exp\left(-1+\psi\right) \int_{\forall\zeta} \exp\left(\sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i\left(\zeta\right)\right) \mathrm{d}\zeta \\ &\Leftrightarrow \; \exp\left(-1+\psi\right) = \frac{1}{\int_{\forall\zeta} \exp\left(\sum_{i=1}^{N} \boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i\left(\zeta\right)\right) \mathrm{d}\zeta}. \end{split} \tag{6.22}$$

Inserting (6.22) in (6.21) leads to the probability density function (6.17). The entropy is maximized since

$$\frac{\partial^2 \Lambda}{\partial p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)^2} = -\int_{\forall\zeta} \frac{1}{p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)} \mathrm{d}\zeta < 0 \tag{6.23}$$

for all $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right) \neq 0$.

In order to obtain an estimate of the cost function parameters $\hat{\boldsymbol{\theta}}_i$, $i \in \mathcal{P}$, it may appear suitable to maximize the probability density function (6.17), analogously to similar single-player IRL methods [ZMBD08, LK12]. However, given the dependence of (6.17) on the cost function parameters of all players, it is not possible to solve for a particular $\boldsymbol{\theta}_i$. Nevertheless, if $\zeta$ corresponds to a Pareto efficient solution according to Definition 3.9, then (6.17) can be used to identify corresponding parameters $\hat{\boldsymbol{\theta}}_i$ which explain the observations. This approach is presented in Appendix D.
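If the continuum of feasible trajectories is replaced by a finite set of candidate trajectories (a discretization assumed here purely for illustration, not part of the derivation above), the maximum entropy density (6.17) reduces to a softmax over the players' summed weighted feature counts. A minimal sketch under this assumption, with made-up feature counts and parameters:

```python
import numpy as np

def max_ent_probs(theta_list, mu_list):
    """Discretized evaluation of (6.17): probability of each candidate
    trajectory, proportional to exp(sum_i theta_i^T mu_i(zeta))."""
    n_candidates = len(mu_list[0])
    # scores[c] = sum over players of theta_i^T mu_i(zeta_c)
    scores = np.array([
        sum(theta @ mu[c] for theta, mu in zip(theta_list, mu_list))
        for c in range(n_candidates)
    ])
    scores -= scores.max()            # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()    # normalization replaces the integral

# Two players, three candidate trajectories (illustrative numbers)
theta_1, theta_2 = np.array([-1.0, -0.5]), np.array([-0.8])
mu_1 = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
mu_2 = [np.array([1.0]), np.array([0.2]), np.array([1.5])]

p = max_ent_probs([theta_1, theta_2], [mu_1, mu_2])
```

In this discretized view, the candidate with the lowest summed costs receives the highest probability, mirroring the exponential weighting in (6.17).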

The following sections present approaches to identify cost function parameters which explain observed Nash equilibrium trajectories.

# 6.4 Open-Loop Case

In this section, we consider inverse dynamic games where each player applies an open-loop strategy (cf. Definition A.4) and an open-loop Nash equilibrium (OLNE) arises from their interaction.

A suitable probability density function $p(\zeta)$ is sought which allows for the estimation of cost function parameters.

#### 6.4.1 Probability Density Function

The non-cooperative character of the dynamic game implies that each player only considers his own cost function and strives for its minimization by means of the selected open-loop strategy. From Theorem A.1 we see that the open-loop Nash equilibrium involves the solution of a set of difference equations which includes derivatives of the system dynamics and the features (which constitute the Hamiltonian) with respect to the system state $\mathbf{x}^{(k)}$ and player $i$'s controls $\mathbf{u}_i^{(k)}$. The other players' open-loop controls do not depend on either of these quantities and therefore have no influence on player $i$'s actions.

Consequently, the following probability density function

$$\begin{split} p\left(\zeta \mid \boldsymbol{\theta}_i\right) &= \frac{\exp\left(-J_i\left(\zeta, \boldsymbol{\theta}_i\right)\right)}{\int_{\forall\zeta} \exp\left(-J_i\left(\zeta, \boldsymbol{\theta}_i\right)\right) \mathrm{d}\zeta} \\ &= \frac{\exp\left(\boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta)\right)}{\int_{\forall\zeta} \exp\left(\boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta)\right) \mathrm{d}\zeta} \end{split} \tag{6.24}$$

is defined, which represents the probability density of a particular trajectory from the point of view of player $i$. This density implies that the probability of a particular trajectory decreases exponentially with the costs generated by player $i$'s own individual cost function $J_i$ defined by player $i$'s cost function parameters $\boldsymbol{\theta}_i$. This simplifies the probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}\right)$ in such a way that $N$ probability density functions $p\left(\zeta \mid \boldsymbol{\theta}_i\right)$, each depending on one player's cost function parameters $\boldsymbol{\theta}_i$, are considered instead of one single probability density function which depends on all parameters.

Considering a possible total number of $n_t$ demonstrations, the following likelihood function is defined based on the introduced probability density function.

#### Definition 6.5 (Likelihood Function)

Let a set of $n_t$ trajectories denoted by $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ be given. Then the likelihood of the data given a parameter vector $\boldsymbol{\theta}_i$ is defined as

$$\mathcal{L}\left(\boldsymbol{\theta}_i \mid \mathcal{D}\right) = \prod_{l=1}^{n_t} p\left(\tilde{\zeta}_l \mid \boldsymbol{\theta}_i\right),\tag{6.25}$$

where $p\left(\tilde{\zeta}_l \mid \boldsymbol{\theta}_i\right)$ is obtained by evaluating (6.24) at $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$.

The likelihood describes the probability density of the trajectories for fixed parameters. Moreover, it is a function of $\boldsymbol{\theta}_i$. With this function, the foundation for a maximum likelihood estimation (MLE) of the cost function parameters is given. In order to show that maximizing the likelihood leads to an unbiased estimation of the cost function parameters, the following assumption adapts Assumption 6.1 (and (6.9)) to probability density functions depending only on the parameters $\boldsymbol{\theta}_i$ of one player $i \in \mathcal{P}$ as defined in (6.24).

#### Assumption 6.2 (Expectation and Mean Equivalence)

The mean of the feature count of the $n_t$ observed trajectories is equal to the expectation of the feature count of the trajectories resulting from the probability density function with original parameters $\boldsymbol{\theta}_i^*$, i.e.

$$\mathbb{E}\_{\mathbf{p}\big(\boldsymbol{\zeta}\big|\boldsymbol{\theta}\_{i}^{\*}\big)}\left\{\boldsymbol{\mu}\_{j}(\boldsymbol{\zeta})\right\}=\frac{1}{n\_{t}}\sum\_{l=1}^{n\_{t}}\boldsymbol{\mu}\_{j}\left(\tilde{\boldsymbol{\zeta}}\_{l}\right),\quad\forall\ i,j\in\mathcal{P}.\tag{6.26}$$

#### 6.4.2 Cost Function Estimation and Unbiasedness Results

Before presenting the unbiasedness of the MLE as the main result of this chapter for inverse non-cooperative dynamic games, an alternative definition of the cost functions is introduced which will be convenient for the proof of the main theorem.

#### Definition 6.6 (Extended Features, Feature Count and Parameter Vector)

Let $\bar{\boldsymbol{\phi}}$ denote an extended feature vector which includes all features $\phi_{i,(q)}$, $i \in \mathcal{P}$, $q \in \{1, \dots, M_i\}$ of all $N$ players such that $\bar{\phi}_{(r)} \neq \bar{\phi}_{(s)}$ for all $r, s \in \{1, \dots, \dim(\bar{\boldsymbol{\phi}})\}$ with $r \neq s$. In other words, the extended feature vector $\bar{\boldsymbol{\phi}}$ consists of the feature vectors $\boldsymbol{\phi}_i$ of all players such that no feature is included more than once and all features are linearly independent of each other. The extended feature count $\bar{\boldsymbol{\mu}}(\zeta)$ is defined analogously according to Definition 6.3. Furthermore, let the extended parameter vector $\bar{\boldsymbol{\theta}}_i$ be defined such that

$$J_i\left(\zeta, \boldsymbol{\theta}_i\right) = -\boldsymbol{\theta}_i^\top \boldsymbol{\mu}_i(\zeta) = -\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta), \quad i \in \mathcal{P}. \tag{6.27}$$

#### Remark 6.1:

For (6.27) to hold, $\bar{\boldsymbol{\theta}}_i$ has to include zeros in the positions corresponding to those elements of $\bar{\boldsymbol{\phi}}$ which represent features not previously contained in $\boldsymbol{\phi}_i$.

#### Remark 6.2:

Assumption 6.2 leads to

$$\mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_i^*\right)}\left\{\bar{\boldsymbol{\mu}}(\zeta)\right\} = \frac{1}{n_t}\sum_{l=1}^{n_t}\bar{\boldsymbol{\mu}}\left(\tilde{\zeta}_l\right), \quad \forall i \in \mathcal{P},\tag{6.28}$$

for the extended feature count $\bar{\boldsymbol{\mu}}(\zeta)$.

The following theorem presents the method for estimating cost function parameters from open-loop Nash equilibrium trajectories and states the unbiasedness of the estimation.

#### Theorem 6.1 (Unbiasedness of the Estimation)

Let a set of trajectories $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ for which Assumption 6.2 is fulfilled be given. Then, the MLE with respect to the observed trajectories, i.e.

$$\hat{\boldsymbol{\theta}}_i = \underset{\boldsymbol{\theta}_i}{\arg\max}\,\ln \mathcal{L}\left(\boldsymbol{\theta}_i \mid \mathcal{D}\right),\tag{6.29}$$

where $\mathcal{L}\left(\boldsymbol{\theta}_i \mid \mathcal{D}\right)$ is obtained by evaluating the likelihood function of Definition 6.5 at $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$, leads to parameters $\hat{\boldsymbol{\theta}}_i$ such that $p\left(\zeta \mid \hat{\boldsymbol{\theta}}_i\right)$ results in an expectation of the cost function values $J_j\left(\zeta, \boldsymbol{\theta}_j^*\right)$, $\forall j \in \mathcal{P}$, which is equal to the one corresponding to $p\left(\zeta \mid \boldsymbol{\theta}_i^*\right)$, i.e.

$$\mathbb{E}_{p\left(\zeta \mid \hat{\boldsymbol{\theta}}_i\right)}\left\{J_j\left(\zeta, \boldsymbol{\theta}_j^*\right)\right\} = \mathbb{E}_{p\left(\zeta \mid \boldsymbol{\theta}_i^*\right)}\left\{J_j\left(\zeta, \boldsymbol{\theta}_j^*\right)\right\},\tag{6.30}$$

holds for all $i, j \in \mathcal{P}$.

#### Proof:

Using the extended parameter vector $\bar{\boldsymbol{\theta}}_i$ (cf. Definition 6.6), (6.30) can be rewritten as

$$\mathbb{E}_{p\left(\zeta \mid \hat{\bar{\boldsymbol{\theta}}}_i\right)}\left\{J_j\left(\zeta, \bar{\boldsymbol{\theta}}_j^*\right)\right\} = \mathbb{E}_{p\left(\zeta \mid \bar{\boldsymbol{\theta}}_i^*\right)}\left\{J_j\left(\zeta, \bar{\boldsymbol{\theta}}_j^*\right)\right\}\tag{6.31}$$

for all $i, j \in \mathcal{P}$. Therefore, (6.31) shall be proved in the following.

The maximization of the log-likelihood function (6.29) implies

$$\mathbf{0} \overset{!}{=} \frac{\partial}{\partial \bar{\boldsymbol{\theta}}_i} \sum_{l=1}^{n_t} \ln \left( \frac{\exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l)\right)}{\int_{\forall\zeta} \exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \mathrm{d}\zeta} \right) \Bigg|_{\bar{\boldsymbol{\theta}}_i = \hat{\bar{\boldsymbol{\theta}}}_i} \tag{6.32}$$

$$= \sum_{l=1}^{n_t} \frac{\partial}{\partial \bar{\boldsymbol{\theta}}_i} \left( -\ln \int_{\forall\zeta} \exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \mathrm{d}\zeta + \bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) \right) \Bigg|_{\bar{\boldsymbol{\theta}}_i = \hat{\bar{\boldsymbol{\theta}}}_i} \tag{6.33}$$

$$= \sum_{l=1}^{n_t} \left( \frac{\int_{\forall\zeta} -\exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \bar{\boldsymbol{\mu}}(\zeta) \, \mathrm{d}\zeta}{\int_{\forall\zeta} \exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \mathrm{d}\zeta} + \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) \right) \Bigg|_{\bar{\boldsymbol{\theta}}_i = \hat{\bar{\boldsymbol{\theta}}}_i}. \tag{6.34}$$

Since the integral in the denominator of (6.34) is constant with respect to the integration variable of the numerator, (6.34) can be rewritten as

$$\mathbf{0} \overset{!}{=} \sum_{l=1}^{n_t} \left( \int_{\forall\zeta} \frac{-\exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \bar{\boldsymbol{\mu}}(\zeta)}{\int_{\forall\zeta} \exp\left(\bar{\boldsymbol{\theta}}_i^\top \bar{\boldsymbol{\mu}}(\zeta)\right) \mathrm{d}\zeta} \mathrm{d}\zeta + \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) \right) \Bigg|_{\bar{\boldsymbol{\theta}}_i = \hat{\bar{\boldsymbol{\theta}}}_i}. \tag{6.35}$$

Using the defined probability density function (6.24), we obtain

$$\begin{split} \mathbf{0} &\overset{!}{=} \sum_{l=1}^{n_t} \left( -\int_{\forall\zeta} p\left(\zeta \mid \bar{\boldsymbol{\theta}}_i\right) \bar{\boldsymbol{\mu}}(\zeta) \, \mathrm{d}\zeta + \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) \right) \Bigg|_{\bar{\boldsymbol{\theta}}_i = \hat{\bar{\boldsymbol{\theta}}}_i} \\ &= \sum_{l=1}^{n_t} \left( -\mathbb{E}_{p\left(\zeta \mid \hat{\bar{\boldsymbol{\theta}}}_i\right)} \left\{ \bar{\boldsymbol{\mu}}(\zeta) \right\} + \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) \right). \end{split} \tag{6.36}$$

By rewriting (6.36) and considering Assumption 6.2 and Remark 6.2,

$$\mathbb{E}_{p\left(\zeta \mid \hat{\bar{\boldsymbol{\theta}}}_i\right)} \left\{ \bar{\boldsymbol{\mu}}(\zeta) \right\} = \frac{1}{n_t} \sum_{l=1}^{n_t} \bar{\boldsymbol{\mu}}(\tilde{\zeta}_l) = \mathbb{E}_{p\left(\zeta \mid \bar{\boldsymbol{\theta}}_i^*\right)} \left\{ \bar{\boldsymbol{\mu}}(\zeta) \right\} \tag{6.37}$$

results. Therefore, the expectations of the feature count $\bar{\boldsymbol{\mu}}$ are equal for both probability density functions. By applying the results of Lemma 6.1 (which also hold for a probability density function $p\left(\zeta \mid \boldsymbol{\theta}_i\right)$) we conclude that (6.37) leads to (6.31), which is equivalent to (6.30).

The results of Theorem 6.1 guarantee (6.30), which at first glance differs from the requirement (6.7) posed in the inverse dynamic game problem in Definition 6.4. However, for inverse open-loop dynamic games, it was proposed to consider $N$ probability density functions $p\left(\zeta \mid \boldsymbol{\theta}_i^*\right)$ instead of a single one given by $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$. Therefore, instead of the equivalence of expected costs with respect to this initially assumed probability density function $p\left(\zeta \mid \boldsymbol{\theta}_{1:N}^*\right)$, we obtain the equivalence of expected costs for all players $j \in \mathcal{P}$ with respect to each of the $N$ probability density functions $p\left(\zeta \mid \boldsymbol{\theta}_i^*\right)$ as stated in (6.30). Consequently, the estimated parameters $\hat{\boldsymbol{\theta}}_i$ solve the inverse dynamic game problem for an open-loop information structure.

#### Remark 6.3:

Solving the optimization problem (6.29) demands the possibility of evaluating the likelihood function $\mathcal{L}\left(\boldsymbol{\theta}_i \mid \mathcal{D}\right)$ and therefore the probability density function (6.24) at the observed trajectories $\tilde{\zeta}_l$. The denominator in (6.24) includes an integral over all trajectories $\zeta$ which are feasible with respect to the system dynamics and an initial state. Calculating this integral is intractable given the continuous-valued state and control spaces. Therefore, approximations are usually applied. This will be tackled in Section 6.6.
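As one simple instance of such an approximation (not necessarily the one pursued later), the integral in the denominator of (6.24) can be replaced by a sum over a finite candidate set, after which the estimate (6.29) is computable by gradient ascent on the log-likelihood; in line with (6.36), the gradient is the difference between the observed mean feature count and the model's expected feature count. The candidate set and all numerical values below are illustrative assumptions.

```python
import numpy as np

def mle_open_loop(mu_candidates, mu_observed, lr=0.5, steps=5000):
    """Gradient ascent on the log-likelihood (6.25), with the partition
    integral of (6.24) approximated by a finite candidate set.
    Per (6.36), the gradient is: mean observed feature count - E_p{mu}."""
    theta = np.zeros(mu_candidates.shape[1])
    mu_bar = mu_observed.mean(axis=0)          # sample mean, cf. (6.9)
    for _ in range(steps):
        scores = mu_candidates @ theta
        p = np.exp(scores - scores.max())      # unnormalized density (6.24)
        p /= p.sum()
        theta += lr * (mu_bar - p @ mu_candidates)
    return theta

# Candidate feature counts and one observed feature count (illustrative values)
mu_candidates = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]])
mu_observed = np.array([[1.2, 1.4]])

theta_hat = mle_open_loop(mu_candidates, mu_observed)
```

Since the log-likelihood is concave in the parameters of this exponential-family form, the iteration converges to parameters whose expected feature count matches the observed one, which is exactly the matching condition (6.37).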

# 6.5 Feedback Case

In this section, solutions for inverse dynamic games with the feedback Nash equilibrium (FNE) as a solution concept are presented. For this purpose, the MPS and feedback information structures according to Definition A.3 are considered. The resulting strategies are given by<sup>39</sup>

$$\mathbf{u}_i^{(k)} = \boldsymbol{\gamma}_i^{(k)}\left(\mathbf{x}^{(k)}\right).\tag{6.38}$$

The following assumption is needed for the results of this section.


#### Assumption 6.3 (Control Laws)

The Nash equilibrium control laws $\boldsymbol{\gamma}_i^{(k)*}\left(\mathbf{x}^{(k)}\right)$, $k \in \mathcal{K}$, are known for all players $i \in \mathcal{P}$.

<sup>39</sup> According to [BO99, p. 278], the feedback Nash equilibrium solution under the MPS information pattern solely depends on $\mathbf{x}^{(k)}$ at the time step $k$. The dependency on $\mathbf{x}^{(1)}$ is given only for $k = 1$. Therefore, we have feedback strategies as in Definition A.5 for both MPS and FB information structures.

For the case of a finite-horizon dynamic game, i.e. $k_E \in \mathbb{N}$, Assumption 6.3 demands the knowledge of the exact (time-dependent) function $\boldsymbol{\gamma}_i^{(k)*}\left(\mathbf{x}^{(k)}\right)$. This case is analogous to Assumption 4.3 for inverse feedback differential games. In case of an infinite-horizon ($k_E \to \infty$) dynamic game, Assumption 6.3 implies that the time-independent functional relationship of $\boldsymbol{\gamma}_i^{(k)}$ to $\mathbf{x}^{(k)}$ is known.

#### Remark 6.4:


Assumption 6.3 is rather restrictive for general nonlinear feedback Nash equilibria. However, not only the estimation of the control law is non-trivial, but also the calculation of the equilibria themselves, which implies the solution of coupled partial differential equations (see Theorem 3.2) or coupled Bellman equations (see Theorem A.2). On the other hand, Assumption 6.3 is not restrictive for infinite-horizon linear-quadratic dynamic games, since the Nash equilibrium controls are given by

$$\gamma\_i^{(k)\*} \left(\mathbf{x}^{(k)}\right) = \mathbf{K}\_i^\* \mathbf{x}^{(k)},\tag{6.39}$$

with $\mathbf{K}_i^* \in \mathbb{R}^{m_i \times n}$ [Eng05, Section 8.3]. As mentioned in Section 5.4.2, the estimation of $\mathbf{K}_i^*$ can easily be performed by means of a least-squares approach.
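To make the least-squares idea concrete, the following is a minimal sketch of fitting a constant feedback gain from sampled state and control data. The function name and the column-wise data layout are illustrative choices, not taken from the thesis:

```python
import numpy as np

def estimate_feedback_gain(X, U):
    """Least-squares fit of a constant feedback law u = K x.

    X: (n, T) array of state samples, U: (m, T) array of control samples.
    Returns K of shape (m, n) minimizing ||U - K X||_F.
    """
    # Solve X^T K^T ~= U^T in the least-squares sense.
    K_T, *_ = np.linalg.lstsq(X.T, U.T, rcond=None)
    return K_T.T

# Synthetic check: controls generated by a known gain plus small noise.
rng = np.random.default_rng(0)
K_true = np.array([[1.5, -0.7]])
X = rng.normal(size=(2, 200))
U = K_true @ X + 1e-3 * rng.normal(size=(1, 200))
K_hat = estimate_feedback_gain(X, U)
```

With a sufficiently rich state trajectory, `K_hat` recovers the gain up to the noise level.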

If Assumption 6.3 holds, the control laws of the players $j \in \mathcal{P}$, $j \neq i$, can replace $\boldsymbol{u}_j^{(k)*}$ in (6.1), leading to system dynamics from player $i$'s perspective defined as

$$\begin{split} \mathbf{x}^{(k+1)} &= f^{(k)}\left(\mathbf{x}^{(k)}, \boldsymbol{u}\_i^{(k)}, \gamma\_{\neg i}^{(k)\*}\left(\mathbf{x}^{(k)}\right)\right) \\ &= f\_i^{(k)}\left(\mathbf{x}^{(k)}, \boldsymbol{u}\_i^{(k)}\right). \end{split} \tag{6.40}$$

In this way, it is possible for player $i$ to represent the system dynamics as a function of the system state $\mathbf{x}$ and his own control variable $\boldsymbol{u}_i$. The effects of the other players' controls are accounted for through the implied knowledge of their control laws and of the system state in every time step. Analogously, the features $\boldsymbol{\phi}_i$ of player $i$'s cost function can be rewritten as a function of the state $\mathbf{x}$ and the control variable $\boldsymbol{u}_i$, i.e.

$$\begin{split} \boldsymbol{\phi}\_{i} &= \boldsymbol{\phi}\_{i}\left(\mathbf{x}^{(k)}, \boldsymbol{u}\_{1}^{(k)}, \dots, \boldsymbol{u}\_{N}^{(k)}\right) \\ &= \boldsymbol{\phi}\_{i}\left(\mathbf{x}^{(k)}, \boldsymbol{u}\_{i}^{(k)}, \gamma\_{\neg i}^{(k)\*}\left(\mathbf{x}^{(k)}\right)\right) \\ &= \boldsymbol{\phi}\_{i}\left(\mathbf{x}^{(k)}, \boldsymbol{u}\_{i}^{(k)}\right), \end{split} \tag{6.41}$$

where the same vector $\boldsymbol{\phi}_i$ is used with some mathematical freedom in favor of a simplified presentation. Based on the system dynamics from player $i$'s perspective (6.40) and the rewritten features (6.41), the following theorem describes the method for an unbiased maximum likelihood estimation of cost function parameters in an inverse feedback Nash dynamic game.

#### Theorem 6.2 (Unbiasedness of the Estimation)

Let a set of trajectories $\mathcal{D} = \{\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_t}\}$ be given such that Assumption 6.2 is fulfilled. Furthermore, let Assumption 6.3 hold such that the feedback Nash control laws $\gamma_i^{(k)*}$ are known for all $i \in \mathcal{P}$. Then, the MLE with respect to the observed trajectories, i.e.

$$\hat{\boldsymbol{\theta}}\_{i} = \underset{\boldsymbol{\theta}\_{i}}{\text{arg}\,\text{max}}\,\ln \mathcal{L}\left\{\boldsymbol{\theta}\_{i} \,\middle|\, \mathcal{D}\right\},\tag{6.42}$$

where $\mathcal{L}\{\theta_i \mid \mathcal{D}\}$ is obtained by evaluating the likelihood function of Definition 6.5 at $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$, and with respect to the system dynamics (6.40), leads to parameters $\hat{\theta}_i$ such that

$$\mathbb{E}\_{\mathrm{p}\left(\zeta\mid\hat{\boldsymbol{\theta}}\_{i}\right)}\left\{J\_{j}\left(\zeta\mid\boldsymbol{\theta}\_{j}^{\*}\right)\right\}=\mathbb{E}\_{\mathrm{p}\left(\zeta\mid\boldsymbol{\theta}\_{i}^{\*}\right)}\left\{J\_{j}\left(\zeta\mid\boldsymbol{\theta}\_{j}^{\*}\right)\right\}\tag{6.43}$$

holds for all i, j ∈ P (cf. Theorem 6.1).

#### Proof:

The cost functions $J_i$, $i \in \mathcal{P}$, can be rewritten using the modified features (6.41). Afterwards, the theorem can be proved analogously to Theorem 6.1.

# 6.6 Practical Aspects

The results of the previous sections provide the theoretical foundation for applying MaxEnt IRL to inverse dynamic game problems. The core of the method is the MLE based on the probability density functions $\mathrm{p}(\zeta \mid \theta_i^*)$. This section focuses on the computation of the MLEs which yield cost function parameters $\hat{\theta}_i$ explaining observed outcomes of a dynamic game. The main practical challenge, and the main objective of this section, is the evaluation of the probability density function (6.24) and, with it, the likelihood function (6.25).

#### 6.6.1 Approximation of the Probability Density Function

The integral in the denominator of (6.24) is computationally intractable and therefore an approximation is necessary. This may be achieved by replacing the integral with a sum over several trajectory samples which are either generated from a previously defined probability distribution [KPRS13, MHB16] or determined in each iteration from a forward optimal control or dynamic game solution with the current cost function parameter candidates [AB11]. The choice of sampled trajectories has a great impact on the estimation of the cost function parameters (cf. [AB11]). In order to avoid the problem of choosing adequate samples, in this thesis the integral, and therewith the probability density functions, are approximated locally. The following procedure is inspired by the approach proposed in [LK12] for the single-player case. Nonetheless, some modifications are introduced and will be explained where suitable.

Consider any player $i \in \mathcal{P}$. Given an observed trajectory $\tilde{\zeta}_l$, $l \in \{1, 2, \dots, n_t\}$, and consequently the control trajectories $\underline{\tilde{u}}_{\neg i,l}$ of all other players, we can formulate the costs $J_i(\tilde{\zeta}_l, \theta_i)$ of player $i$ generated by $\tilde{\zeta}_l$ such that only variations of his own control trajectory $\underline{\tilde{u}}_{i,l}$ are taken into account, i.e. the costs are formulated as $J_i\left(\underline{u}_i, \underline{\tilde{u}}_{\neg i,l}, \theta_i\right)$. Local approximations of the observed trajectory $\tilde{\zeta}_l$ are considered which arise from the aforementioned variations of $\underline{u}_{i,l}$ while the other players' controls $\underline{\tilde{u}}_{\neg i,l}$ remain unchanged. Hence, we approximate the cost function $J_i\left(\underline{u}_i, \underline{\tilde{u}}_{\neg i,l}, \theta_i\right)$ by means of a second-order Taylor series expansion around the observed controls $\underline{\tilde{u}}_{i,l}$ corresponding to the trajectory $\tilde{\zeta}_l$. This results in

$$\begin{split} J\_{i}\left(\underline{\boldsymbol{u}}\_{i},\underline{\tilde{\boldsymbol{u}}}\_{\neg i,l},\boldsymbol{\theta}\_{i}\right) \approx\ & J\_{i}\left(\underline{\tilde{\boldsymbol{u}}}\_{i,l},\underline{\tilde{\boldsymbol{u}}}\_{\neg i,l},\boldsymbol{\theta}\_{i}\right) + \left(\underline{\boldsymbol{u}}\_{i} - \underline{\tilde{\boldsymbol{u}}}\_{i,l}\right)^{\top} \tilde{\boldsymbol{g}}\_{i,l}(\boldsymbol{\theta}\_{i}) \\ & + \frac{1}{2} \left(\underline{\boldsymbol{u}}\_{i} - \underline{\tilde{\boldsymbol{u}}}\_{i,l}\right)^{\top} \tilde{\boldsymbol{G}}\_{i,l}(\boldsymbol{\theta}\_{i}) \left(\underline{\boldsymbol{u}}\_{i} - \underline{\tilde{\boldsymbol{u}}}\_{i,l}\right), \end{split} \tag{6.44}$$

where $\tilde{g}_{i,l}(\theta_i) \in \mathbb{R}^{m_i k_\mathrm{E}}$ and $\tilde{G}_{i,l}(\theta_i) \in \mathbb{R}^{m_i k_\mathrm{E} \times m_i k_\mathrm{E}}$ denote the first and second derivative of $J_i$ with respect to $\underline{u}_i$, respectively, i.e.

$$\tilde{\boldsymbol{g}}\_{i,l}(\theta\_i) := \left. \frac{\mathrm{d}J\_i}{\mathrm{d}\underline{\boldsymbol{u}}\_{i}} \right|\_{\underline{\boldsymbol{u}}\_{i} = \underline{\tilde{\boldsymbol{u}}}\_{i,l}} \tag{6.45}$$

$$\tilde{\boldsymbol{G}}\_{i,l}(\theta\_i) := \left. \frac{\mathrm{d}^2 J\_i}{\mathrm{d} \underline{\boldsymbol{u}}\_i^2} \right|\_{\underline{\boldsymbol{u}}\_i = \underline{\tilde{\boldsymbol{u}}}\_{i,l}}. \tag{6.46}$$

In the following, $\tilde{g}_{i,l}(\theta_i)$ and $\tilde{G}_{i,l}(\theta_i)$ are written as $\tilde{g}_{i,l}$ and $\tilde{G}_{i,l}$, respectively, for brevity.

By reformulating (6.24) using the Taylor-series-based approximation (6.44) of the cost function, and considering that the observed trajectory $\tilde{\zeta}_l$ is (with fixed $\theta_i$) uniquely defined by the controls $\underline{\tilde{u}}_{i,l}$ for given $\underline{\tilde{u}}_{\neg i,l}$ and initial state $\mathbf{x}^{(1)}$, the probability density function can be evaluated at $\tilde{\zeta}_l$ using the relation

$$\begin{split} \mathrm{p}\left(\underline{\tilde{\boldsymbol{u}}}\_{i,l} \,\middle|\, \underline{\tilde{\boldsymbol{u}}}\_{\neg i,l}, \mathbf{x}^{(1)}, \boldsymbol{\theta}\_{i} \right) &= \frac{\mathrm{e}^{-J\_{i}\left(\underline{\tilde{\boldsymbol{u}}}\_{i,l},\, \underline{\tilde{\boldsymbol{u}}}\_{\neg i,l},\, \boldsymbol{\theta}\_{i}\right)}}{\int\_{-\infty}^{\infty} \mathrm{e}^{-J\_{i}\left(\underline{\boldsymbol{u}}\_{i},\, \underline{\tilde{\boldsymbol{u}}}\_{\neg i,l},\, \boldsymbol{\theta}\_{i}\right)} \, \mathrm{d}\underline{\boldsymbol{u}}\_{i}} \\ &\approx \mathrm{e}^{-\frac{1}{2} \tilde{\boldsymbol{g}}\_{i,l}^{\top} \tilde{\boldsymbol{G}}\_{i,l}^{-1} \tilde{\boldsymbol{g}}\_{i,l}} \left(\det \tilde{\boldsymbol{G}}\_{i,l}\right)^{\frac{1}{2}} (2\pi)^{-\frac{\dim\left(\underline{\tilde{\boldsymbol{u}}}\_{i,l}\right)}{2}}. \end{split} \tag{6.47}$$


This leads to the log-likelihood function

$$\ln \left( \mathcal{L} \{ \boldsymbol{\theta}\_{i} \mid \mathcal{D} \} \right) \approx \sum\_{l=1}^{n\_{t}} \left( -\frac{1}{2} \tilde{\boldsymbol{g}}\_{i,l}^{\top} \tilde{\boldsymbol{G}}\_{i,l}^{-1} \tilde{\boldsymbol{g}}\_{i,l} + \frac{1}{2} \ln \left( \det \tilde{\boldsymbol{G}}\_{i,l} \right) - \frac{1}{2} \dim \left( \underline{\boldsymbol{u}}\_{i,l} \right) \ln(2\pi) \right) \tag{6.48}$$

which can be used for the MLEs stated in Theorems 6.1 and 6.2. The detailed calculation steps are provided in Section B.5 of the Appendix.<sup>40</sup>
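Once the gradients and Hessians are available, the sum (6.48) can be evaluated directly. The following is a minimal numerical sketch (the function name and list-based interface are illustrative, not from the thesis):

```python
import numpy as np

def approx_log_likelihood(g_list, G_list):
    """Approximate log-likelihood (6.48) from per-trajectory derivatives.

    g_list[l]: gradient of J_i at the l-th observed control trajectory,
    G_list[l]: corresponding Hessian (assumed invertible with positive
    determinant for a well-posed problem).
    """
    ll = 0.0
    for g, G in zip(g_list, G_list):
        d = g.size
        _, logdet = np.linalg.slogdet(G)
        # -1/2 g^T G^{-1} g + 1/2 ln det(G) - d/2 ln(2*pi)
        ll += -0.5 * g @ np.linalg.solve(G, g) + 0.5 * logdet \
              - 0.5 * d * np.log(2.0 * np.pi)
    return ll
```

Using `solve` instead of explicitly inverting each Hessian is a standard numerical choice; the formula itself is exact for quadratic cost functions, as noted in the footnote.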

Therefore, in order to evaluate (6.24), the first derivative $\tilde{g}_{i,l}$ and the second derivative $\tilde{G}_{i,l}$ are needed. Their calculation is explained in the following.

#### 6.6.2 Evaluation of the Log-Likelihood Function

By applying the chain rule, the first and second derivatives of the cost function are given by

$$
\tilde{\boldsymbol{g}}\_{i,l} = \nabla\_{\underline{\boldsymbol{u}}\_{i}} J\_{i} + \left(\nabla\_{\underline{\boldsymbol{u}}\_{i}} \underline{\mathbf{x}}\right)^{\top} \nabla\_{\underline{\mathbf{x}}} J\_{i} \bigg|\_{\substack{\underline{\boldsymbol{u}}\_{i} = \underline{\tilde{\boldsymbol{u}}}\_{i,l} \\ \underline{\mathbf{x}} = \underline{\tilde{\mathbf{x}}}\_{l}}} \tag{6.49}
$$

$$\tilde{\boldsymbol{G}}\_{i,l} = \nabla\_{\underline{\boldsymbol{u}}\_i \underline{\boldsymbol{u}}\_i} J\_i + \left(\nabla\_{\underline{\boldsymbol{u}}\_i} \underline{\mathbf{x}}\right)^\top \nabla\_{\underline{\mathbf{x}}\,\underline{\mathbf{x}}} J\_i\, \nabla\_{\underline{\boldsymbol{u}}\_i} \underline{\mathbf{x}} + \nabla\_{\underline{\boldsymbol{u}}\_i \underline{\boldsymbol{u}}\_i} \underline{\mathbf{x}} \times\_1 \nabla\_{\underline{\mathbf{x}}} J\_i + 2\, \nabla\_{\underline{\boldsymbol{u}}\_i \underline{\mathbf{x}}} J\_i\, \nabla\_{\underline{\boldsymbol{u}}\_i} \underline{\mathbf{x}} \bigg|\_{\substack{\underline{\boldsymbol{u}}\_i = \underline{\tilde{\boldsymbol{u}}}\_{i,l} \\ \underline{\mathbf{x}} = \underline{\tilde{\mathbf{x}}}\_{l}}} \tag{6.50}$$

where $\nabla_{\underline{u}_i} J_i$ and $\nabla_{\underline{x}} J_i$ denote the partial derivatives of $J_i$ with respect to $\underline{u}_i$ and $\underline{x}$, respectively.<sup>41</sup> Likewise, $\nabla_{\underline{u}_i \underline{u}_i} J_i$, $\nabla_{\underline{x}\,\underline{x}} J_i$ and $\nabla_{\underline{u}_i \underline{x}} J_i$ represent second-order partial derivatives of $J_i$ with respect to $\underline{u}_i$ and $\underline{x}$. The partial derivative $\nabla_{\underline{u}_i} \underline{x}$ is defined analogously. The term $\nabla_{\underline{u}_i \underline{u}_i} \underline{x}$ is used with some abuse of notation to represent a third-order tensor such that $\times_1$ represents a 1-mode tensor multiplication [KB09, Section 2.5].<sup>42</sup>

In the following, we elaborate on the structure of the partial derivatives which form $\tilde{g}_{i,l}$ and $\tilde{G}_{i,l}$ as given in (6.49) and (6.50), taking the partial derivatives with respect to $\underline{x}$ as an example. With the assumed structure of the cost function (6.2), we obtain the first-order partial derivative

$$\nabla\_{\underline{\mathbf{x}}} J\_{i} = -\left[ \left( \nabla\_{\mathbf{x}} \boldsymbol{\phi}\_{i} \right) \boldsymbol{\theta}\_{i} \Big|\_{\mathbf{x}^{(k)} = \mathbf{x}^{(1)}} \quad \dots \quad \left( \nabla\_{\mathbf{x}} \boldsymbol{\phi}\_{i} \right) \boldsymbol{\theta}\_{i} \Big|\_{\mathbf{x}^{(k)} = \mathbf{x}^{(k\_\mathrm{E})}} \right]^{\top} \in \mathbb{R}^{nk\_\mathrm{E}},\tag{6.51}$$

where, unless otherwise specified, $\nabla_{\mathbf{x}} \boldsymbol{\phi}_i$ denotes the partial derivative of $\boldsymbol{\phi}_i$ with respect to $\mathbf{x}^{(k)}$. The second-order partial derivatives of the cost function $\nabla_{\underline{x}\,\underline{x}} J_i$, $\nabla_{\underline{u}_i \underline{u}_i} J_i$ and $\nabla_{\underline{u}_i \underline{x}} J_i$ are

<sup>40</sup> Note that (6.44) and (6.48) yield equalities in the case of quadratic cost functions.

<sup>41</sup> The last term in (6.50) was neglected in [LK12]. Nevertheless, it can only be neglected if there are no features which depend on both $\mathbf{x}$ and $\boldsymbol{u}_i$, i.e. $\phi_{i,(j)}(\mathbf{x}, \boldsymbol{u}_i)$ is equal to either $\phi_{i,(j)}(\mathbf{x})$ or $\phi_{i,(j)}(\boldsymbol{u}_i)$ for all $i \in \mathcal{P}$ and all $j \in \{1, \dots, M_i\}$.

<sup>42</sup> For the 1-mode tensor multiplication we obtain $\nabla_{\underline{u}_i \underline{u}_i} \underline{x} \times_1 \nabla_{\underline{x}} J_i = \left(\nabla_{\underline{x}} J_i\right)^\top \nabla_{\underline{u}_i \underline{u}_i} \underline{x} \in \mathbb{R}^{1 \times m_i k_\mathrm{E} \times m_i k_\mathrm{E}}$, which can be represented as a matrix of dimensions $m_i k_\mathrm{E} \times m_i k_\mathrm{E}$.


block diagonal matrices, since the costs at time step $k$ only depend on the states $\mathbf{x}^{(k)}$ and controls $\boldsymbol{u}^{(k)}$ at time step $k$. Therefore, we obtain

$$\nabla\_{\underline{\mathbf{x}}\,\underline{\mathbf{x}}} J\_{i} = \text{blkdiag}\left(-\sum\_{j=1}^{M\_{i}} \nabla\_{\mathbf{x}\mathbf{x}} \phi\_{i,(j)}\, \theta\_{i,(j)}\bigg|\_{\mathbf{x}^{(k)}=\mathbf{x}^{(1)}}, \quad \dots \quad , -\sum\_{j=1}^{M\_{i}} \nabla\_{\mathbf{x}\mathbf{x}} \phi\_{i,(j)}\, \theta\_{i,(j)}\bigg|\_{\mathbf{x}^{(k)}=\mathbf{x}^{(k\_{\mathrm{E}})}}\right), \tag{6.52}$$

where $\text{blkdiag}(\cdot)$ denotes a block diagonal matrix. In this case, there are $k_\mathrm{E}$ blocks of dimension $n \times n$. The other partial derivatives can be computed analogously to (6.51) and (6.52).
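The assembly of this block-diagonal structure can be sketched as follows, assuming a user-supplied function that returns the stacked feature Hessians at a given state (all names and the array layout are illustrative):

```python
import numpy as np

def state_hessian_blocks(x_traj, theta, hess_phi):
    """Assemble the block-diagonal matrix (6.52).

    x_traj: (k_E, n) state trajectory; theta: (M_i,) cost parameters;
    hess_phi(x): (M_i, n, n) array of feature Hessians at state x (assumed given).
    """
    # One n-by-n block per time step: -sum_j theta_j * Hessian(phi_j).
    blocks = [-np.einsum('j,jab->ab', theta, hess_phi(x)) for x in x_traj]
    n = blocks[0].shape[0]
    H = np.zeros((len(blocks) * n, len(blocks) * n))
    for k, B in enumerate(blocks):
        H[k * n:(k + 1) * n, k * n:(k + 1) * n] = B
    return H

# Example: one feature whose Hessian is the identity at every state.
hess_id = lambda x: np.eye(2)[None, :, :]
H = state_hessian_blocks(np.zeros((3, 2)), np.array([2.0]), hess_id)
```

Here `H` is block diagonal with three identical `2x2` blocks, mirroring the $k_\mathrm{E}$ blocks of dimension $n \times n$ described above.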

The partial derivative $\nabla_{\underline{u}_i} \underline{x}$ describes the sensitivity of $\underline{x}$ with respect to $\underline{u}_i$ for all time steps as a consequence of the system dynamics. Since present actions are not influenced by future actions, the matrix

$$\boldsymbol{D}\_{i} := \nabla\_{\underline{\boldsymbol{u}}\_{i}} \underline{\mathbf{x}} \Big|\_{\substack{\underline{\boldsymbol{u}}\_{i} = \underline{\tilde{\boldsymbol{u}}}\_{i,l} \\ \underline{\mathbf{x}} = \underline{\tilde{\mathbf{x}}}\_{l}}} \tag{6.53}$$

is defined, where $\boldsymbol{D}_i$ is a block upper triangular matrix. The blocks within $\boldsymbol{D}_i$ are given by

$$D\_{i}^{(k\_{2},k\_{1})} = \begin{cases} \nabla\_{\mathbf{u}\_{i}^{(k)}} \mathbf{x}^{(k+1)} \Big|\_{k=k\_{1}}, & \text{for } k\_{2} = k\_{1} + 1 \\\\ \left(\nabla\_{\mathbf{x}^{(k)}} \mathbf{x}^{(k+1)}\right) D\_{i}^{(k\_{2}-1,k\_{1})} \Big|\_{k=k\_{2}-1}, & \text{for } k\_{2} > k\_{1} + 1 \\ \mathbf{0}, & \text{else}. \end{cases} \tag{6.54}$$

The blocks $\boldsymbol{D}_i^{(k_2,k_1)}$, $k_1, k_2 \in \mathcal{K}$, have the dimension $n \times m_i$ and represent the influence of player $i$'s control at time step $k_1$ on the states at time step $k_2$. These partial derivatives can be interpreted as part of the numerical solution of the initial value problem which approximates the next state. The matrix $\boldsymbol{D}_i$ employs the partial derivatives with respect to $\underline{u}_i$ in each time step for the whole corresponding time interval between two time steps. Contrary to this approach, a modification of the matrix $\boldsymbol{D}_i$ is proposed here in order to improve the approximation. Inspired by the trapezoid method for solving initial value problems [Epp13, Section 6.5], the effect of $\boldsymbol{u}_i^{(k_1)}$ on $\mathbf{x}^{(k_2)}$ is approximated by means of

$$\begin{split} \tilde{\boldsymbol{D}}\_{i}^{(k\_{2},k\_{1})} &:= \frac{1}{2} \left( \nabla\_{\boldsymbol{u}\_i^{(k\_{1})}} \mathbf{x}^{(k\_{2})} + \nabla\_{\boldsymbol{u}\_i^{(k\_{1})}} \mathbf{x}^{(k\_{2}+1)} \right) \\ &= \frac{1}{2} \left( \boldsymbol{D}\_{i}^{(k\_{2},k\_{1})} + \boldsymbol{D}\_{i}^{(k\_{2}+1,k\_{1})} \right) . \end{split} \tag{6.55}$$

The modified matrix $\tilde{\boldsymbol{D}}_i$, which is built from the blocks $\tilde{\boldsymbol{D}}_i^{(k_2,k_1)}$ analogously to $\boldsymbol{D}_i$ with (6.54), takes into account the effect of the control value $\boldsymbol{u}_i^{(k_1)}$ over the interval from $\mathbf{x}^{(k_2)}$ to $\mathbf{x}^{(k_2+1)}$ and yields a better approximation of the system dynamics.<sup>43</sup>

<sup>43</sup> This modification was applied in experimental work presented in [IEFH18].
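The recursion (6.54) and the trapezoid modification (6.55) can be sketched for linear time-invariant dynamics $\mathbf{x}^{(k+1)} = \mathbf{A}\mathbf{x}^{(k)} + \mathbf{B}\boldsymbol{u}_i^{(k)}$; the dictionary-of-blocks layout is an implementation choice, not from the thesis:

```python
import numpy as np

def sensitivity_blocks(A, B, k_E):
    """Blocks D_i^(k2,k1) of (6.54) for LTI dynamics x^(k+1) = A x^(k) + B u^(k).

    Returns a dict {(k2, k1): block}; blocks are nonzero only for k2 > k1
    (1-indexed time steps).
    """
    D = {}
    for k1 in range(1, k_E + 1):
        for k2 in range(k1 + 1, k_E + 2):
            if k2 == k1 + 1:
                D[(k2, k1)] = B.copy()             # direct effect of u^(k1) on x^(k1+1)
            else:
                D[(k2, k1)] = A @ D[(k2 - 1, k1)]  # effect propagated through the dynamics
    return D

def trapezoid_blocks(D):
    """Modified blocks (6.55): average of the effects on x^(k2) and x^(k2+1)."""
    return {(k2, k1): 0.5 * (D[(k2, k1)] + D[(k2 + 1, k1)])
            for (k2, k1) in D if (k2 + 1, k1) in D}

# Double-integrator matrices from (6.61) as a small example.
A = np.array([[1.0, 0.02], [0.0, 1.0]])
B = np.array([[0.0002], [0.02]])
D = sensitivity_blocks(A, B, k_E=3)
Dt = trapezoid_blocks(D)
```

For the time-varying case, `A` and `B` would simply be indexed by the time step in the recursion.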

Contrary to the aforementioned partial derivatives, the term $\nabla_{\underline{u}_i \underline{u}_i} \underline{x}$ is a third-order tensor and does not exhibit a convenient structure for its computation. Therefore, following the recommendations in [LK12], this term is neglected in favor of more efficient calculations.<sup>44</sup>

### 6.6.3 Algorithms

The results presented in the previous sections are condensed in two algorithms for the solution of inverse dynamic games by means of MaxEnt IRL. The algorithms summarize the procedure for cost function identification in a dynamic game when an open-loop or a feedback information structure (and the corresponding Nash equilibria) lies at hand. The following Algorithm 3 corresponds to the open-loop case.


Algorithm 3 IRL Method in Open-Loop Dynamic Games for Player i.

Input: Observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi_i$.

Output: Computed player $i$ cost function parameters $\theta_i$.


The next Algorithm 4 gives the necessary steps for solving inverse dynamic games with a feedback information structure based on MaxEnt IRL.

Algorithm 4 IRL Method in Feedback Dynamic Games for Player i.

Input: Observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi_i$.

Output: Computed player $i$ cost function parameters $\theta_i$.


<sup>44</sup> Neglecting this term does not have any effect for most problems. For example, this term is always zero for the broad class of nonlinear control-affine systems (3.11).

#### Remark 6.5:

Step 1 and Step 2 of Algorithms 3 and 4, respectively, can also be calculated prior to the identification procedure since they are independent of the observed data.

#### Remark 6.6:

The methods shown in this chapter are formulated for a finite-horizon problem, i.e. $k_\mathrm{E} \in \mathbb{N}_{>0}$ in (6.2). However, all results can still be applied if the assumed underlying LQ dynamic game has an infinite horizon $k_\mathrm{E} \to \infty$. The presented method solely requires the availability of observed state trajectories $\underline{x} \in \mathbb{R}^{n K_i}$ and control trajectories $\underline{u}_i \in \mathbb{R}^{m_i K_i}$, where $K_i \ll \infty$ (cf. Definition 6.1). For adequate results, $[0, K_i]$ should be a sufficiently representative interval of the complete time span $[0, \infty)$.

# 6.7 Application to Inverse LQ Dynamic Games

This section presents an exemplary application of IRL for solving inverse LQ dynamic games in order to illustrate the procedures presented in Algorithms 3 and 4. In the following, both inverse open-loop dynamic games and inverse feedback dynamic games are examined.

#### 6.7.1 Open-Loop

Consider N-player LQ dynamic games according to Definition A.7. Therefore, each player applies his controls to a system described by the difference equation

$$\mathbf{x}^{(k+1)} = \mathbf{A}\_D^{(k)} \mathbf{x}^{(k)} + \sum\_{j=1}^{N} \mathbf{B}\_{D,j}^{(k)} \mathbf{u}\_j^{(k)}.\tag{6.56}$$

Furthermore, each player $i \in \mathcal{P}$ selects an open-loop strategy $\gamma_i^{(k)} = \boldsymbol{u}_i^{(k)}$ (cf. Definition A.4) based on a quadratic cost function of the form

$$J\_i = -\frac{1}{2} \sum\_{k=1}^{k\_E} \left( \left( \mathbf{x}^{(k)} \right)^\top \mathbf{Q}\_i\, \mathbf{x}^{(k)} + \left( \boldsymbol{u}\_i^{(k)} \right)^\top \mathbf{R}\_{ii}\, \boldsymbol{u}\_i^{(k)} \right), \tag{6.57}$$

where $\mathbf{Q}_i$ and $\mathbf{R}_{ii} \prec 0$ are symmetric matrices.<sup>45</sup> The cost function (6.57) does not include terms which penalize the controls $\boldsymbol{u}_j^{(k)}$, $j \neq i$, of the other players (cf. (A.16)). This is due to the fact that these controls do not have any influence on the solution of open-loop dynamic games and can therefore be neglected. This follows e.g. from the necessary conditions for Nash equilibria given in Theorem A.1.

<sup>45</sup> The negative sign is considered in this chapter according to (6.2); thus, $\mathbf{R}_{ii}$ is negative definite instead of positive definite to ensure a meaningful problem.

In order to apply the results of the previous sections to linear-quadratic dynamic games, it is necessary to reformulate quadratic objective functions such that the structure in (6.2) is obtained. Furthermore, the partial derivatives of the states with respect to the controls have a particular structure in the case of linear system dynamics. These aspects will be examined and presented in the following.

#### Features in LQ Open-Loop Dynamic Games

The features in the vector $\boldsymbol{\phi}_i$ which correspond to the $\frac{1}{2}(n^2 + n)$ non-redundant elements of the matrix $\mathbf{Q}_i$ are given by

$$\phi\_{i,rc}^{Q\_i} = -\frac{1}{2} x\_r^{(k)} x\_c^{(k)}, \quad c = 1, \dots, n, \; r = 1, \dots, c. \tag{6.58}$$

Similarly, for the $\frac{1}{2}(m_i^2 + m_i)$ parameters of the symmetric matrix $\mathbf{R}_{ii}$, we obtain the features

$$\phi\_{i,rc}^{R\_{ii}} = -\frac{1}{2} u\_{i,r}^{(k)} u\_{i,c}^{(k)}, \quad c = 1, \dots, m\_i, \; r = 1, \dots, c. \tag{6.59}$$

For $r = c$, the parameters which are multiplied with $\phi_{i,rc}^{Q_i}$ and $\phi_{i,rc}^{R_{ii}}$ correspond to the $r$-th diagonal entry of the matrices $\mathbf{Q}_i$ and $\mathbf{R}_{ii}$, respectively. For the case where $c \neq r$, these parameters correspond to two times the off-diagonal (symmetric) entries $Q_{i,rc} = Q_{i,cr}$ and $R_{ii,rc} = R_{ii,cr}$, respectively.
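The mapping (6.58) from a state vector to its feature vector can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def quadratic_state_features(x):
    """Features (6.58) at one time step: -1/2 x_r x_c for c = 1..n, r = 1..c."""
    n = x.size
    return np.array([-0.5 * x[r] * x[c] for c in range(n) for r in range(c + 1)])

# For n = 2 this yields the features in the index order (1,1), (1,2), (2,2);
# the features (6.59) for R_ii follow the same pattern with u_i in place of x.
phi = quadratic_state_features(np.array([1.0, -1.0]))  # -> [-0.5, 0.5, -0.5]
```

Summing these per-step features over the horizon and weighting them with $\theta_i$ reproduces the quadratic cost in the parameterization of (6.2).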

#### System Dynamics

The linear system dynamics lead to the relations

$$
\nabla\_{\mathbf{x}^{(k)}} \mathbf{x}^{(k+1)} = \mathbf{A}\_{D}^{(k)} \quad \text{and} \quad \nabla\_{\mathbf{u}\_{i}^{(k)}} \mathbf{x}^{(k+1)} = \mathbf{B}\_{D,i}^{(k)}.\tag{6.60}
$$

Then, $\tilde{\boldsymbol{D}}_i$ can be determined with (6.54) and (6.55).

The following example illustrates the solution of an inverse dynamic game with MaxEnt IRL to identify cost function parameters:

#### Example 6.1:

Consider a two-player discrete-time dynamic game with system dynamics (6.56) defined by the matrices

$$\mathbf{A}\_{D}^{(k)} = \begin{bmatrix} 1 & 0.02 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{B}\_{D,i}^{(k)} = \begin{bmatrix} 0.0002 \\ 0.02 \end{bmatrix}, \quad i \in \{1, 2\}, \; k \in \mathcal{K} \tag{6.61}$$

and the initial value $\mathbf{x}^{(1)} = \begin{bmatrix} 1 & -1 \end{bmatrix}^\top$. These matrices correspond to a continuous-time double-integrator system (cf. Example 5.2) sampled with $\Delta T = 0.02\,\mathrm{s}$. In addition, let the quadratic cost functions of the players be given by (6.57), where

$$\mathbf{Q}\_1 = -\begin{bmatrix} 4 & 1\\ 1 & 3 \end{bmatrix}, \quad \mathbf{R}\_{11} = -1, \quad \mathbf{Q}\_2 = -\begin{bmatrix} 10 & 1\\ 1 & 2 \end{bmatrix}, \quad \mathbf{R}\_{22} = -1. \tag{6.62}$$

Then, the features corresponding to the cost function of player i are given by

$$\begin{aligned} \phi\_{i,11}^{\mathcal{Q}\_i} &= -\frac{1}{2} \left( \mathbf{x}\_1^{(k)} \right)^2, \quad \phi\_{i,12}^{\mathcal{Q}\_i} = -\frac{1}{2} \mathbf{x}\_1^{(k)} \mathbf{x}\_2^{(k)},\\ \phi\_{i,22}^{\mathcal{Q}\_i} &= -\frac{1}{2} \left( \mathbf{x}\_2^{(k)} \right)^2, \quad \phi\_{i,11}^{\mathcal{R}\_{i1}} = -\frac{1}{2} \left( u\_i^{(k)} \right)^2. \end{aligned} \tag{6.63}$$

The cost function $J_i$ of player $i$ can be rewritten as

$$J\_i = -\sum\_{k=1}^{k\_E} \left[ \theta\_{i,(1)} \phi\_{i,11}^{Q\_i} + \theta\_{i,(2)} \phi\_{i,12}^{Q\_i} + \theta\_{i,(3)} \phi\_{i,22}^{Q\_i} + \theta\_{i,(4)} \phi\_{i,11}^{R\_{ii}} \right], \quad i \in \{1, 2\}, \tag{6.64}$$

with the cost function parameters

$$
\begin{aligned}
\boldsymbol{\theta}\_1 &= \boldsymbol{\theta}\_1^\* = \begin{bmatrix} \mathbf{4} & \mathbf{2} & \mathbf{3} & \mathbf{1} \end{bmatrix}^\top, \\
\boldsymbol{\theta}\_2 &= \boldsymbol{\theta}\_2^\* = \begin{bmatrix} \mathbf{10} & \mathbf{2} & \mathbf{2} & \mathbf{1} \end{bmatrix}^\top.
\end{aligned}
$$

Now we assume $k_\mathrm{E} = 250$ and use the coupled Riccati equations (3.60) to calculate the OLNE<sup>46</sup> and obtain the trajectory set $\zeta^*$. The state and control trajectories belonging to this set are corrupted by Gaussian white noise such that the resulting trajectories have a signal-to-noise ratio (SNR) of 30 dB. A total number of 30 realizations is generated, leading to $n_t = 30$ trajectories $\tilde{\zeta}_l$, $l \in \{1, \dots, n_t\}$. These are used to evaluate the log-likelihood function (6.48), for which we compute the necessary partial derivatives. The partial derivative $\nabla_{\underline{x}} J_i$ is given by (6.51), where

$$\left(\nabla\_{\mathbf{x}}\boldsymbol{\Phi}\_{i}\right)\boldsymbol{\theta}\_{i} = \begin{bmatrix} \boldsymbol{\theta}\_{i,(1)}\boldsymbol{\mathsf{x}}\_{1}^{(k)} + \frac{1}{2}\boldsymbol{\theta}\_{i,(2)}\boldsymbol{\mathsf{x}}\_{2}^{(k)}\\ \frac{1}{2}\boldsymbol{\theta}\_{i,(2)}\boldsymbol{\mathsf{x}}\_{1}^{(k)} + \boldsymbol{\theta}\_{i,(3)}\boldsymbol{\mathsf{x}}\_{2}^{(k)} \end{bmatrix}.\tag{6.65}$$

Similarly, $\nabla_{\underline{u}_i} J_i \in \mathbb{R}^{k_\mathrm{E}}$ is determined by using the partial derivative

$$\left(\nabla\_{\boldsymbol{u}\_i}\boldsymbol{\phi}\_i\right)\boldsymbol{\theta}\_i = \theta\_{i,(4)}\, u\_i^{(k)}.\tag{6.66}$$

For the second-order partial derivatives we obtain

$$\nabla\_{\underline{\mathbf{x}},\underline{\mathbf{x}}} J\_{\boldsymbol{i}} = \text{blkdiag}\left(-\begin{bmatrix} \theta\_{\boldsymbol{i},(1)} & \frac{1}{2}\theta\_{\boldsymbol{i},(2)}\\ \frac{1}{2}\theta\_{\boldsymbol{i},(2)} & \theta\_{\boldsymbol{i},(3)} \end{bmatrix}, \dots, -\begin{bmatrix} \theta\_{\boldsymbol{i},(1)} & \frac{1}{2}\theta\_{\boldsymbol{i},(2)}\\ \frac{1}{2}\theta\_{\boldsymbol{i},(2)} & \theta\_{\boldsymbol{i},(3)} \end{bmatrix}\right) \tag{6.67}$$

$$\nabla\_{\underline{\boldsymbol{u}}\_i \underline{\boldsymbol{u}}\_i} J\_i = \text{blkdiag}\left(-\theta\_{i,(4)}, \dots, -\theta\_{i,(4)}\right). \tag{6.68}$$

The MLE (6.29) is performed using a numerical optimization method, namely the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. After normalizing with respect to $\theta_{i,(4)}$ for better comparability, we obtain the estimated parameters

$$\begin{aligned} \hat{\boldsymbol{\theta}}\_{1} &= \begin{bmatrix} 3.88 & -2.22 & 2.98 & 1.00 \end{bmatrix}^{\top} \\ \hat{\boldsymbol{\theta}}\_{2} &= \begin{bmatrix} 10.19 & -1.69 & 2.12 & 1.00 \end{bmatrix}^{\top} . \end{aligned} \tag{6.69}$$
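The numerical maximization used here can be sketched with `scipy.optimize.minimize` and the BFGS method, minimizing the negative log-likelihood. The objective below is a toy stand-in with a known maximizer; in the example it would be the negative of (6.48):

```python
import numpy as np
from scipy.optimize import minimize

def mle_bfgs(neg_log_likelihood, theta0):
    """Maximize an (approximate) log-likelihood with BFGS.

    neg_log_likelihood: callable theta -> -ln L{theta | D} (assumed given),
    theta0: initial parameter guess.
    """
    res = minimize(neg_log_likelihood, theta0, method='BFGS')
    return res.x

# Toy stand-in objective whose minimizer is theta = [4, 2, 3, 1].
target = np.array([4.0, 2.0, 3.0, 1.0])
theta_hat = mle_bfgs(lambda th: 0.5 * np.sum((th - target) ** 2), np.ones(4))
```

BFGS only needs objective evaluations here; `minimize` approximates the gradient by finite differences when none is supplied.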

Consider now the feature count

$$\hat{\boldsymbol{\mu}} = \frac{1}{2} \sum\_{k=1}^{k\_E} \left[ (\mathbf{x}\_1^{(k)})^2 \quad \mathbf{x}\_1^{(k)} \mathbf{x}\_2^{(k)} \quad (\mathbf{x}\_2^{(k)})^2 \quad (\boldsymbol{u}\_1^{(k)})^2 \quad (\boldsymbol{u}\_2^{(k)})^2 \right]^\top. \tag{6.70}$$

The feature count of the trajectory $\hat{\zeta}$ generated by solving an LQ dynamic game with the estimated parameters (6.69) is given by

$$
\hat{\bar{\mu}} = \begin{bmatrix} 9.88 & -12.75 & 17.04 & 32.46 & 12.66 \end{bmatrix} . \tag{6.71}
$$

The mean feature count of observed trajectories is

$$
\tilde{\boldsymbol{\mu}} = \begin{bmatrix} 9.88 & -12.76 & 17.05 & 32.90 & 12.87 \end{bmatrix}, \tag{6.72}
$$

suggesting, in consideration of (6.10), that the estimated parameters $\hat{\theta}_i$ differ from the original parameters $\theta_i^*$ but lead to very similar costs. The original trajectory $\zeta^*$ and the estimated $\hat{\zeta}$ are depicted in Figure 6.2, showing that the identified parameters are able to explain the observed behavior.
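The feature counts (6.70) used for this comparison can be computed directly from a trajectory; a minimal sketch for the two-player example (the array layout is an assumption):

```python
import numpy as np

def feature_count(x_traj, u1_traj, u2_traj):
    """Feature count (6.70): accumulated quadratic features of one trajectory.

    x_traj: (k_E, 2) state trajectory; u1_traj, u2_traj: (k_E,) scalar controls.
    """
    x1, x2 = x_traj[:, 0], x_traj[:, 1]
    return 0.5 * np.array([
        np.sum(x1 ** 2),
        np.sum(x1 * x2),
        np.sum(x2 ** 2),
        np.sum(u1_traj ** 2),
        np.sum(u2_traj ** 2),
    ])

# Tiny two-step trajectory as a check of the accumulation.
mu = feature_count(np.array([[1.0, 1.0], [2.0, 0.0]]),
                   np.array([1.0, 1.0]), np.array([0.0, 2.0]))
```

Comparing `mu` for the observed and the re-simulated trajectories reproduces the check between (6.71) and (6.72).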

Figure 6.2: Observed trajectories and trajectories following from the estimated parameters of the LQ dynamic game in Example 6.1

#### 6.7.2 Feedback Case

Consider now an LQ dynamic game where the players choose feedback strategies (cf. Definition A.5) based on quadratic cost functions. Since we consider a feedback (or MPS) information pattern, the general quadratic cost functions are given by

$$J\_i = -\frac{1}{2} \sum\_{k=1}^{k\_E} \left( \mathbf{x}^{(k) \top} \mathbf{Q}\_i \mathbf{x}^{(k)} + \sum\_{j=1}^{N} \boldsymbol{u}\_j^{(k) \top} \mathbf{R}\_{ij} \boldsymbol{u}\_j^{(k)} \right), \quad i \in \mathcal{P}, \tag{6.73}$$

and the resulting feedback strategies are given by (6.39). This relation can be used to obtain system dynamics from the point of view of player i given by

$$\begin{split} \mathbf{x}^{(k+1)} &= \mathbf{A}\_{D}^{(k)} \mathbf{x}^{(k)} + \mathbf{B}\_{D,i}^{(k)} \mathbf{u}\_{i}^{(k)} - \sum\_{\begin{subarray}{c} j=1\\ j \neq i \end{subarray}}^{N} \mathbf{B}\_{D,j}^{(k)} \mathbf{K}\_{j}^{(k)} \mathbf{x}^{(k)} \\ &= \left( \mathbf{A}\_{D}^{(k)} - \sum\_{\begin{subarray}{c} j=1\\ j \neq i \end{subarray}}^{N} \mathbf{B}\_{D,j} \mathbf{K}\_{j}^{(k)} \right) \mathbf{x}^{(k)} + \mathbf{B}\_{D,i}^{(k)} \mathbf{u}\_{i}^{(k)} \\ &=: \bar{\mathbf{A}}\_{D,i}^{(k)} \mathbf{x}^{(k)} + \mathbf{B}\_{D,i}^{(k)} \mathbf{u}\_{i}^{(k)} .\end{split} \tag{6.74}$$

As described in Section 6.5, inverse feedback dynamic games can be solved by exploiting the knowledge of the strategies $\gamma_i^{(k)}$. For the case of LQ dynamic games this means that the feedback matrices $\mathbf{K}_i^{(k)*}$, $i \in \mathcal{P}$, $k \in \mathcal{K}$, are given.

#### Remark 6.7:


In the typical case that $\mathbf{K}_i^{(k)*}$, $i \in \mathcal{P}$, $k \in \mathcal{K}$, are not known, it is possible to assume an infinite horizon, i.e. $k_\mathrm{E} \to \infty$, and estimate a constant feedback law which approximates the relationship between the controls and the states (cf. Section 5.4.2).<sup>47</sup> In the case of an infinite-horizon inverse LQ dynamic game, the estimation can then be performed effectively by means of (5.36).

<sup>46</sup> The continuous-time equations were used since the considered time step $\Delta T = 0.02\,\mathrm{s}$ allows a quasi-continuous analysis instead of the use of discrete-time equations for determining Nash equilibria. The interested reader is referred to Section A.5 of the Appendix, where references on discrete-time Riccati equations are given.

<sup>47</sup> If the limit of the Riccati matrix $\mathbf{P}_i^{(k)}$ for $k = k_\mathrm{E} \to \infty$ exists, then it corresponds to an FNE for the infinite-horizon dynamic game. In general, other FNE solutions may also exist which are not necessarily related to the aforementioned solution [BO99, p. 290].

#### Features in LQ Feedback Dynamic Games

By using the known feedback control matrices $\mathbf{K}_i^{(k)*}$, the quadratic cost function (6.73) of player $i$ can be rewritten as

$$J\_i = -\frac{1}{2} \sum\_{k=1}^{k\_E} \left( \mathbf{x}^{(k) \top} \mathbf{Q}\_i \mathbf{x}^{(k)} + \sum\_{\substack{j=1 \\ j \neq i}}^N \mathbf{x}^{(k) \top} \mathbf{K}\_j^{(k) \top} \mathbf{R}\_{ij} \mathbf{K}\_j^{(k)} \mathbf{x}^{(k)} + \boldsymbol{u}\_i^{(k) \top} \mathbf{R}\_{ii} \boldsymbol{u}\_i^{(k)} \right). \tag{6.75}$$

The features corresponding to the entries of $\mathbf{Q}_i$ and $\mathbf{R}_{ii}$ are identical to the open-loop case (cf. (6.58) and (6.59)). In the feedback case, we additionally have the features corresponding to the entries of $\mathbf{R}_{ij}$, which are given by

$$\phi\_{i,rc}^{\mathbf{R}\_{ij}} = -\frac{1}{2} (\mathbf{K}\_j^{(k)\*} \mathbf{x}^{(k)})\_r (\mathbf{K}\_j^{(k)\*} \mathbf{x}^{(k)})\_c, \quad c = 1, \dots, m\_j, \; r = 1, \dots, c,\tag{6.76}$$

where $(K_j^{(k)*} x^{(k)})_r$ denotes the $r$-th entry of the vector $K_j^{(k)*} x^{(k)}$. Similar to the matrices $Q_i$ and $R_{ii}$, the main diagonal elements of $R_{ij}$ correspond to parameters which weight the features $\phi^{R_{ij}}_{i,rr}$, $r = 1, \dots, m_j$. For the case where $c \neq r$, these parameters correspond to two times the off-diagonal (symmetric) entries $R_{ij,rc} = R_{ij,cr}$, respectively.
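The evaluation of these features at one time step can be sketched as follows. The helper name and array layout are illustrative; the function returns all pairwise products $-\frac{1}{2}(K_j x)_r (K_j x)_c$ at once, of which only the entries with $r \le c$ are needed due to the symmetry of $R_{ij}$:

```python
import numpy as np

def features_Rij(K_j, x):
    """Features φ^{R_ij}_{i,rc} of (6.76) for one time step.

    K_j : (m_j, n) feedback matrix of player j
    x   : (n,) state sample x^(k)
    Returns an (m_j, m_j) array with entry (r, c) equal to
    -1/2 (K_j x)_r (K_j x)_c.
    """
    v = K_j @ x                   # the vector K_j^{(k)*} x^{(k)}
    return -0.5 * np.outer(v, v)  # all pairwise products at once
```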

#### System Dynamics

The linear system dynamics lead to the relations

$$
\nabla\_{\mathbf{x}^{(k)}} \mathbf{x}^{(k+1)} = \bar{\mathbf{A}}\_{D,i}^{(k)} \quad \text{and} \quad \nabla\_{\mathbf{u}\_i^{(k)}} \mathbf{x}^{(k+1)} = \mathbf{B}\_{D,i}^{(k)}.\tag{6.77}
$$

Then, $\tilde{D}_i$ can be computed with (6.54) and (6.55).

#### Example 6.2:

Consider a two-player discrete-time dynamic game with the system dynamics (6.61), the initial value $x^{(1)} = \begin{bmatrix} 1 & -1 \end{bmatrix}^\top$, and cost functions of the form (6.73) with the cost function matrices

$$\begin{aligned} \mathbf{Q}\_1 &= \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}, \quad \mathbf{R}\_{11} = 1, \quad \mathbf{R}\_{12} = 1, \\\ \mathbf{Q}\_2 &= \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}, \quad \mathbf{R}\_{22} = 1, \quad \mathbf{R}\_{21} = 0.3. \end{aligned} \tag{6.78}$$

The LQ dynamic game leads to feedback strategies

$$\mathbf{u}_i^{(k)*} = \gamma_i^{(k)*}(\mathbf{x}) = \begin{bmatrix} k_{i,(1)}^{(k)*} & k_{i,(2)}^{(k)*} \end{bmatrix} \begin{bmatrix} x_1^{(k)} \\ x_2^{(k)} \end{bmatrix}. \tag{6.79}$$

We assume that $K_i^{(k)*} = \begin{bmatrix} k_{i,(1)}^{(k)*} & k_{i,(2)}^{(k)*} \end{bmatrix}$ is not known and is approximated by a constant feedback law $\hat{K}_i$ to be identified, as mentioned in Remark 6.7. We obtain $\|\hat{K}_i x - u_i^*\| < 0.02$ for all $i \in \{1, 2\}$. The approximation of the time-variant control matrices $K_i^{(k)*}$ by means of the constant matrices $\hat{K}_i$ is shown in Figure 6.3.

Figure 6.3: Nash equilibrium feedback matrices $K_i^{(k)*}$ and their approximation by means of constant feedback matrices $\hat{K}_i$ in Example 6.2

The features corresponding to the cost function of player i are given by:

$$\begin{aligned} \phi_{i,11}^{\mathcal{Q}_i} &= -\frac{1}{2} \left( x_1^{(k)} \right)^2, \quad \phi_{i,22}^{\mathcal{Q}_i} = -\frac{1}{2} \left( x_2^{(k)} \right)^2, \\ \phi_{i,11}^{\mathcal{R}_{ii}} &= -\frac{1}{2} \left( u_i^{(k)} \right)^2, \quad \phi_{i,11}^{\mathcal{R}_{ij}} = -\frac{1}{2} \left( k_{j,(1)}^* x_1^{(k)} + k_{j,(2)}^* x_2^{(k)} \right)^2. \end{aligned} \tag{6.80}$$

The cost functions $J_i$, $i \in \{1, 2\}$, can be rewritten as

$$J\_i = -\sum\_{k=1}^{k\_E} \left[ \theta\_{i,(1)} \phi\_{i,11}^{\mathcal{Q}\_i} + \theta\_{i,(2)} \phi\_{i,22}^{\mathcal{Q}\_i} + \theta\_{i,(3)} \phi\_{i,11}^{\mathcal{R}\_{ii}} + \theta\_{i,(4)} \phi\_{i,11}^{\mathcal{R}\_{ij}} \right], \ i, j \in \{1, 2\}, \ i \neq j,\tag{6.81}$$

where the cost function parameters are given by

$$\begin{aligned} \boldsymbol{\theta}_1 &= \begin{bmatrix} 8 & 2 & 1 & 1 \end{bmatrix}^\top, \\ \boldsymbol{\theta}_2 &= \begin{bmatrix} 1 & 4 & 1 & 0.3 \end{bmatrix}^\top. \end{aligned}$$

The calculated FNE trajectory $\zeta^*$ is used to identify cost function parameters which explain it. However, this time the exact FNE trajectory $\zeta^*$ and one single demonstration, i.e. $n_t = 1$, are used. Using the MLE (6.42), which is determined again with the BFGS method, we obtain the cost function parameters

$$\begin{aligned} \hat{\boldsymbol{\theta}}\_{1} &= \begin{bmatrix} 7.67 & 0.148 & 1.00 & 2.26 \end{bmatrix}^{\top}, \\ \hat{\boldsymbol{\theta}}\_{2} &= \begin{bmatrix} -1.44 & 2.47 & 1.00 & 0.72 \end{bmatrix}^{\top}. \end{aligned} \tag{6.82}$$

Similar to the last example, we consider the extended feature count

$$
\bar{\mu} = \frac{1}{2} \sum_{k=1}^{k_E} \begin{bmatrix} \phi_{1,11}^{\mathcal{Q}_1} & \phi_{1,22}^{\mathcal{Q}_1} & \phi_{1,11}^{\mathcal{R}_{11}} & \phi_{1,11}^{\mathcal{R}_{12}} & \phi_{2,11}^{\mathcal{R}_{22}} & \phi_{2,11}^{\mathcal{R}_{21}} \end{bmatrix}^\top \tag{6.83}
$$

for both the observed trajectory $\zeta^*$ and the trajectory $\hat{\zeta}$ corresponding to the parameters (6.82), obtaining

$$
\bar{\mu} = \begin{bmatrix} 10.44 & 15.92 & 1.34 & 10.37 & 10.41 & 1.34 \end{bmatrix}^{\top} \tag{6.84}
$$

and

$$
\hat{\bar{\mu}} = \begin{bmatrix} 10.44 & 15.93 & 1.34 & 10.37 & 10.37 & 1.34 \end{bmatrix}^{\top}, \tag{6.85}
$$

indicating that the identified parameters indeed approximate the observed trajectory adequately (cf. Example 6.1).
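The feature count comparison carried out above can be sketched as follows. The helper names are hypothetical, and the per-step feature vectors are assumed to be precomputed as rows of an array:

```python
import numpy as np

def feature_count(features):
    """Feature count of a trajectory: the sum of the per-step feature
    vectors, as compared in (6.84) and (6.85).

    features : (kE, M) array; row k holds the feature vector at step k.
    """
    return features.sum(axis=0)

def feature_count_match(features_obs, features_hat, tol=0.05):
    """Check whether two trajectories yield (nearly) equal feature counts."""
    mu_obs = feature_count(features_obs)
    mu_hat = feature_count(features_hat)
    return np.max(np.abs(mu_obs - mu_hat)) < tol
```

Equal feature counts are the indication used above that the identified parameters explain the observed trajectory.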

# 6.8 Method Limitations

Some potential limitations of the presented methods shall be discussed before concluding this chapter. The introduced IRL-based inverse dynamic game methods can cope with truncated trajectories in $[0, K_i]$ with $K_i < K_E$ as long as these represent the complete trajectories adequately (cf. Remark 6.6). Small values of $K_i$ compared to $K_E$ may deteriorate the results, i.e. the results improve the closer $K_i$ is to $K_E$.

Noise-corrupted trajectories can also represent an issue since the approach indirectly attempts to equalize the feature count values of observed trajectories with the ones which would arise from the probability density function with identified parameters. On the other hand, equalizing feature count values may lead to a greater robustness in case the features, i.e. the basis functions, are not specified correctly. The effects of these issues on the identification results will be examined in Chapter 7.

Finally, a further possible detriment can arise if the available trajectories do not constitute a Nash equilibrium. The method is based on the probability density function (5.1), which includes the implicit assumption that each player's decision was not directly affected by the choice of the other players' controls; a sufficient condition for this is the availability of trajectories representing a Nash equilibrium. In addition, the method for feedback information structures leverages the availability of feedback control laws. If the control laws describe the functional relationship between states and controls, then the modified system dynamics still reflect the actions of the other players. Therefore, the IRL methods have the potential of being robust to at least mild deviations from the Nash equilibrium. Indeed, the basis of the presented results is Assumption 6.2, which does not demand that the observed trajectories are exactly equal to a deterministic result of the dynamic game with cost function parameters $\theta_i^*$. This allows for the estimation of cost function parameters $\hat{\theta}_i$ from trajectories which resemble Nash equilibrium trajectories but may deviate from exact optimality.

# 6.9 Conclusion

In this chapter, IRL was considered as a means to solve inverse problems in dynamic games. The principle of maximum entropy was applied to the dynamic game scenario and the obtained results were used to derive probability density functions to model the origin of observed dynamic game trajectories. Based on these, a maximum-likelihood estimation of the cost function parameters was proposed for the case when players apply Nash equilibrium strategies. Both open-loop and feedback strategies were regarded. In addition, the unbiasedness of this maximum-likelihood estimation was proved under typical IRL assumptions. The results of this chapter lay the theoretical foundation for the application of MaxEnt IRL for identifying cost function parameters of players in a dynamic game. Finally, solutions of inverse linear-quadratic dynamic games were shown to illustrate the presented methods and their applicability.

After this last chapter presenting theoretical results on inverse dynamic games and their solution, the following chapters present a comparison between different method classes in both simulations and a real application.

# 7 Simulations

In the previous chapters, inverse problems in dynamic game theory were introduced and two main classes of methods were proposed for their solution, namely the residual-based IOC method and an IRL-based approach. These classes of methods are different from a theoretical and conceptual point of view given their contrasting origins in automatic control and computer science. This chapter aims at presenting the capabilities of both classes of methods and comparing them by using different test scenarios in simulations. In this way, their strengths and weaknesses shall be examined. Moreover, the IOC and IRL methods are systematically compared with a Direct Bilevel (DB) approach which is based on the solution of a forward dynamic game in each iteration (see Section 2.1.1).

This chapter starts with a mathematical description of the DB approach used for comparison to the new inverse dynamic game methods. Afterwards, the considered scenarios are introduced before explaining the general evaluation procedure applied in this chapter, as well as the metrics used for comparison. Then, the simulation results are presented and discussed. These results include an evaluation of the methods' robustness to measurement noise and errors in the basis function vectors. After shortly analyzing the computation times of the methods, the chapter ends with conclusions based on the obtained insights.

# 7.1 Direct Bilevel Approach

The Direct Bilevel (DB) approach considered in this chapter is a direct extension of the method introduced in [MTL10] (see also Section 2.1.1), which was recently formulated in [MFP17a]. It aims to determine cost function parameters $\theta = (\theta_1, \dots, \theta_N)$ such that the corresponding Nash equilibrium trajectories approximate the observed state and control trajectories. For this objective, the following optimization problem can be formulated:

$$\min\_{\theta} J\_{\text{DB}} = \int\_{0}^{T} ||\mathbf{x}\_{\theta}(t) - \tilde{\mathbf{x}}(t)||^{2} + \sum\_{j=1}^{N} ||\mathbf{u}\_{\theta,j}(t) - \tilde{\mathbf{u}}\_{j}(t)||^{2} \,\mathrm{d}t,\tag{7.1}$$

where $x_\theta(t)$ and $u_{\theta,i}(t)$ denote Nash equilibrium trajectories resulting from cost functions with parameters $\theta$. The objective functional $J_\text{DB}$ provides a natural squared-error metric between candidate state and control trajectories and the observed Nash equilibrium state $\tilde{x}(t)$ and control trajectories $\tilde{u}_i(t)$. Note that if the observed trajectories correspond to a Nash equilibrium with cost function parameters $\theta^* \in \Theta$, then the optimization problem is solved by any $\theta$ which also belongs to the solution set $\Theta$ according to the equivalence of cost functions (cf. Section B.2 in the Appendix), which implies identical Nash equilibrium trajectories. Some details need to be considered for a practical implementation of this approach. These are given in Section B.6 in the Appendix.
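For sampled trajectories, the integral in (7.1) can be approximated by a Riemann sum, as sketched below. The function and argument names are illustrative, not from the thesis; the candidate trajectories would be recomputed by a forward dynamic game solver in each iteration of the outer optimization:

```python
import numpy as np

def J_DB(x_theta, u_theta, x_obs, u_obs, dt):
    """Discretized bilevel objective of (7.1).

    x_theta, x_obs : (K, n) candidate / observed state trajectories
    u_theta, u_obs : lists of (K, m_j) control trajectories, one per player
    dt : sampling time used to approximate the integral
    """
    err = np.sum((x_theta - x_obs) ** 2)       # squared state error
    for u_c, u_o in zip(u_theta, u_obs):       # squared control errors
        err += np.sum((u_c - u_o) ** 2)
    return err * dt                            # Riemann-sum approximation
```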

# 7.2 Simulation Scenarios

In this chapter, two main simulation scenarios are considered:


In the first scenario, the ball-on-beam system is chosen as the dynamic system. It is a well-known benchmark system in control engineering since it poses a challenging stabilization problem which is representative of the difficulties caused by nonlinearities [HSK92, BSLK97]. This scenario shall serve to show the solution of inverse dynamic games with open-loop strategies.

The second scenario consists of an LQ dynamic game with feedback strategies. Considering the class of LQ dynamic games allows for an analysis with the tools developed in Chapter 5. Furthermore, in order to increase the complexity of the LQ dynamic game, a generic dynamic game is considered where three players influence a system by means of two control variables each. This scenario is used for the examination of inverse feedback dynamic games.

For each scenario, one IOC-based method, one IRL method and a DB approach shall be compared. The performance comparison is first done with assumed perfect observations of the Nash trajectories. Nevertheless, an evaluation of the robustness of all methods to noise in the observations is also presented.

# 7.3 Evaluation Method

In the following, the evaluation method is presented. After describing the general steps constituting the whole evaluation process, the metrics used for the comparison are introduced.

#### 7.3.1 General Steps

The evaluation procedure used in this chapter is summarized in Figure 7.1 and shall be explained in the following. For the simulation environment, a cost function structure defined by a linear combination of basis functions according to (4.2) is assumed. Therefore, it is first necessary to define a basis function vector $\phi_i$ and a parameter vector $\theta_i^*$ for each player $i \in \mathcal{P}$. These cost functions are used to calculate the Nash equilibrium trajectories of the states $x^*(t)$ and the controls $u_i^*(t)$.<sup>48</sup> For the case where perfect observations are assumed, the observations $\tilde{x}(t)$ and $\tilde{u}_i(t)$ correspond to the calculated Nash equilibrium trajectories. Otherwise, Gaussian white noise $\epsilon^x$ and $\epsilon^{u_i}$ is added to the Nash equilibrium state and control trajectories, respectively, to form the observations. The generated observations $\tilde{x}(t)$ and $\tilde{u}_i(t)$ simulate measured dynamic game data resulting from the interaction between the players. Based on these observations, in the inverse dynamic game step, one of the inverse dynamic game methods is applied to obtain estimations of the cost function parameters $\hat{\theta}_i$ for all players $i \in \mathcal{P}$. At this point, the analysis of the identification results may be conducted based on the parameter deviation, i.e. the comparison of the estimated cost function parameters with the ground truth. Nevertheless, particularly for the robustness evaluation, it will be examined whether a potentially inexact identification of the cost function parameters has a considerable impact on the capability to approximate the observations. For these cases, identified trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$ are determined. This is done by calculating the Nash equilibrium again, yet this time based on the estimated parameters $\hat{\theta}_i$ of all players.
By comparing the identified trajectories with the ground truth trajectories, it is possible to evaluate whether the estimated parameters can describe the observed outcome of the dynamic game despite a potential deviation from the real parameters. We determine the trajectory deviation by calculating the metrics $\delta^x$, $\delta^u$, and $\Delta^\theta$, which are presented in the next section.

#### 7.3.2 Evaluation Metrics

As previously mentioned, the results of the inverse dynamic game methods are evaluated with respect to the quality of the cost function parameter identification. Furthermore, the approximation of the observed trajectories by means of the trajectories of the identified model are also assessed. For these two objectives, two different metrics are used which are introduced in the following.

<sup>48</sup> All simulated Nash equilibrium trajectories $x^*(t)$ and $u_i^*(t)$ are calculated using a continuous-time formulation of the dynamic game using the different theorems from Section 3.6, depending on the information structure and strategy types. The IRL-based methods, which were developed considering a discrete-time formulation, shall be given equivalent system dynamics corresponding to the selected time step $\Delta T$ as shown in Examples 6.1 and 6.2. Furthermore, one single trajectory set will be used, i.e. $n_t = 1$, for the IRL methods.

Figure 7.1: Evaluation procedure for simulation results

#### Cost Function Parameters

Since identification of the cost function parameters is only possible up to a scaling constant, the comparison is done after a normalization process. The ground truth parameters $\theta_i^*$ and the identified parameters $\hat{\theta}_i$ are normalized with respect to an arbitrary parameter. In this case and without loss of generality, the last entry of the vector $\theta_i$ is chosen. This is done for all players $i \in \mathcal{P}$. Therefore, for the ground truth normalized parameter vectors $\theta_{i,(\text{norm})}^*$ and the normalized estimated parameter vectors $\hat{\theta}_{i,(\text{norm})}$ of player $i$, we have

$$\{\theta^*_{i,(\text{norm})}\}_p = \frac{\{\theta^*_i\}_p}{\{\theta^*_i\}_{M_i}} \quad \text{and} \quad \{\hat{\theta}_{i,(\text{norm})}\}_p = \frac{\{\hat{\theta}_i\}_p}{\{\hat{\theta}_i\}_{M_i}}, \quad \forall p \in \{1, \dots, M_i\}, \tag{7.2}$$

where $\{\theta_i\}_p$ denotes the $p$-th entry of the parameter vector $\theta_i$.<sup>49</sup> The parameter $\{\theta_i\}_{M_i}$ is therefore the last entry of the vector $\theta_i$. By using the normalized parameters, the relative parameter error is defined as

$$\delta_p^{\theta} = \frac{\{\hat{\theta}_{i,(\text{norm})}\}_p}{\{\theta^*_{i,(\text{norm})}\}_p}, \quad \forall p \in \{1, \dots, M_i\}. \tag{7.3}$$

The comparison of the parameters is done by means of the absolute value of the relative error of the parameters

$$
\Delta_p^{\theta} = \left| 1 - \delta_p^{\theta} \right|, \quad \Delta_p^{\theta} \in [0, \infty). \tag{7.4}
$$

Therefore, the closer the absolute values of the relative error $\Delta_p^\theta$ are to zero, the stronger the similarity between identified and ground truth parameters. The mean and maximum values of $\Delta_p^\theta$ will be considered. These are denoted by $\Delta_{p,\text{mean}}^\theta$ and $\Delta_{p,\text{max}}^\theta$, respectively.
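The normalization (7.2) and error metric (7.4) can be sketched as follows for one player. The helper name is hypothetical; it returns the mean and maximum of $\Delta_p^\theta$:

```python
import numpy as np

def parameter_errors(theta_true, theta_hat):
    """Mean and maximum relative parameter errors Δ^θ_p of (7.2)-(7.4)."""
    t = np.asarray(theta_true, dtype=float)
    h = np.asarray(theta_hat, dtype=float)
    t = t / t[-1]                 # normalization (7.2): divide by last entry
    h = h / h[-1]
    delta = np.abs(1.0 - h / t)   # Δ^θ_p = |1 - δ^θ_p| of (7.3), (7.4)
    return delta.mean(), delta.max()
```

Since both vectors are normalized by their last entry, a common scaling of all parameters does not affect the error, consistent with the identifiability up to a scaling constant.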

#### Comparison of Trajectories

Before introducing the considered metrics for comparing trajectories, it is important to note that in the simulations, trajectories are available in the form of a series of $K_i$ data points described by the set

$$\mathcal{T}_i = \{ t_k \in [0, T] \mid 1 \le k \le K_i \}. \tag{7.5}$$

In the following, $K_i = K$ is set for all $i \in \mathcal{P}$ to ease the comparison between ground truth and estimated trajectories. The estimated trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$, $i \in \mathcal{P}$, are the ones which arise from the solution of the dynamic game with the estimated cost function parameters $\hat{\theta}_i$. The different state and control trajectories may differ in maximal amplitude, which hinders a direct comparison between them. In order to be able to compare the error measures of all trajectories, it is reasonable to normalize each of them with respect to its respective maximum value. Therefore, we consider the normalized sum of absolute trajectory errors (NSAE), which in the case of the state error is defined as

$$\delta^{x_j} = \frac{1}{\max_k \left| x_j^{*(k)} \right|} \sum_{k=1}^{K} \left| \tilde{x}_j^{(k)} - \hat{x}_j^{(k)} \right|, \quad j \in \{1, \dots, n\}, \tag{7.6}$$

<sup>49</sup> The notation $\{\theta_i\}_p$ is equivalent to the previously introduced $\theta_{i,(p)}$. These are used interchangeably in favor of better readability.

where $x_j^{(k)} = x_j(t_k)$ denotes the $k$-th data point of the state $x_j$. For systems with more than one state, the sum of NSAEs of the state trajectories

$$\delta^{x} = \sum_{j=1}^{n} \delta^{x_j} \tag{7.7}$$

is considered. Similarly, the NSAE of the controls of player i is defined as

$$\delta^{u_i} = \sum_{j=1}^{m_i} \frac{1}{\max_k \left| u_{i,(j)}^{*(k)} \right|} \sum_{k=1}^{K} \left| \tilde{u}_{i,(j)}^{(k)} - \hat{u}_{i,(j)}^{(k)} \right|. \tag{7.8}$$

The overall sum of NSAEs of the control trajectories is given by

$$
\delta^{\mathfrak{u}} = \sum\_{i=1}^{N} \delta^{\mathfrak{u}\_i}.\tag{7.9}
$$

In the following, the error measures (7.7) and (7.9) will be used for trajectory comparison.
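The NSAE computation for sampled trajectories can be sketched as follows. The helper names are illustrative; for simplicity, the reference signal is used both for the error and for the normalization, which coincides with (7.6) in the noise-free case where the observations equal the ground truth:

```python
import numpy as np

def nsae(ref, est):
    """Normalized sum of absolute errors of one signal, cf. (7.6).

    ref, est : (K,) sampled reference and estimated signals.
    """
    return np.sum(np.abs(ref - est)) / np.max(np.abs(ref))

def nsae_states(X_ref, X_est):
    """Sum of per-state NSAEs, cf. (7.7). X_ref, X_est : (K, n) arrays."""
    return sum(nsae(X_ref[:, j], X_est[:, j]) for j in range(X_ref.shape[1]))
```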

# 7.4 Inverse Open-Loop Dynamic Games

In this section, different classes of inverse dynamic game methods for identifying cost function parameters corresponding to an open-loop Nash equilibrium are evaluated and compared. The methods are

- the residual-based IOC method (Algorithm 1),
- the IRL-based method (Algorithm 3), and
- the direct bilevel approach of Section 7.1.

These are abbreviated and referred to as IOC, IRL and DB methods, respectively.

#### 7.4.1 Preliminaries

The considered system is a ball-on-beam system which was extended such that the system is controlled by two players simultaneously instead of one. The task is to balance a ball in the middle of the beam.

Figure 7.2: Ball-on-beam system

#### System Dynamics

The ball-on-beam system is shown schematically in Fig. 7.2. Here, $\alpha_x$ denotes the angle of the beam towards the horizontal. In addition, $(s_X, s_Y)$ and $(s_x, s_y)$ denote the positions of the ball in the earth-fixed and beam-fixed coordinate systems, respectively, both centered at the beam's center of rotation. Both players are allowed to interact with the system by applying a torque $u_i(t) = M_i(t)$, $i \in \{1, 2\}$, with respect to the beam's rotational axis. Let the system state be defined as

$$x(t) = \begin{bmatrix} s_x(t) & \dot{s}_x(t) & \alpha_x(t) & \dot{\alpha}_x(t) \end{bmatrix}^\top. \tag{7.10}$$

Then, the system dynamics are described by the nonlinear differential equation (cf. [BVBB14])

$$\dot{\mathbf{x}} = \begin{bmatrix} \mathbf{x}\_2 \\\\ \frac{m\_b r\_b^2 (\mathbf{x}\_1 \mathbf{x}\_4^2 - g\_e \sin(\mathbf{x}\_3))}{\Theta\_b + m\_b r\_b^2} \\\\ \mathbf{x}\_4 \\\\ \frac{-2m\_b \mathbf{x}\_1 \mathbf{x}\_2 \mathbf{x}\_4 - m\_b g\_e \mathbf{x}\_1 \cos(\mathbf{x}\_3) + \mathbf{u}\_1 + \mathbf{u}\_2}{m\_b \mathbf{x}\_1^2 + \Theta\_w} \end{bmatrix} \tag{7.11}$$

where the time dependence of the states and controls was dropped for better readability. The variable $g_e$ is the gravitational constant, $\Theta_w$ is the inertia of the beam, and $r_b$, $m_b$ and $\Theta_b$ are the radius, mass and inertia of the ball, respectively. The parameter values are given in Table 7.1.

Table 7.1: Parameters of the ball-on-beam system used for simulation


The inertia of the beam was calculated assuming an equally distributed mass $m_w = 1.3\,\mathrm{kg}$, a width $d_w = 0.01\,\mathrm{m}$ and a length $l_w = 2\,\mathrm{m}$.
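The right-hand side (7.11) can be implemented directly as sketched below. The function name is illustrative and the physical parameters are passed in explicitly; in the simulations, the values of Table 7.1 apply:

```python
import numpy as np

def ball_on_beam_rhs(x, u1, u2, m_b, r_b, Theta_b, Theta_w, g_e=9.81):
    """Right-hand side of the ball-on-beam dynamics (7.11).

    x = [s_x, ds_x/dt, alpha_x, dalpha_x/dt]; u1, u2 are the torques
    applied by the two players with respect to the beam's rotational axis.
    """
    x1, x2, x3, x4 = x
    # Ball acceleration along the beam
    dx2 = m_b * r_b**2 * (x1 * x4**2 - g_e * np.sin(x3)) / (Theta_b + m_b * r_b**2)
    # Angular acceleration of the beam driven by both players' torques
    dx4 = (-2 * m_b * x1 * x2 * x4 - m_b * g_e * x1 * np.cos(x3)
           + u1 + u2) / (m_b * x1**2 + Theta_w)
    return np.array([x2, dx2, x4, dx4])
```

At the origin with zero torques, the right-hand side vanishes, confirming that the balanced ball at the beam center is an equilibrium.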

#### Cost Functions and Data Generation

Each player acts based on an individual cost function of the form (4.2), where the basis function vector is given by

$$\boldsymbol{\Phi}\_{i} = \begin{bmatrix} \mathbf{x}\_{1}^{2} & \mathbf{x}\_{2}^{2} & \mathbf{x}\_{3}^{2} & \mathbf{x}\_{4}^{2} & \boldsymbol{u}\_{i}^{2} \end{bmatrix}^{\top}, \quad \forall i \in \{1, 2\}. \tag{7.12}$$

This feature vector describes both players' individual preferences to zero the ball's displacement from the center of the beam, its velocity, the beam's angle and angular velocity, respectively. Furthermore, it represents the desire to keep their individual torques small. In the following, units are neglected as all quantities are given in SI units. To model the players' behavior by means of cost functions, let the ground truth parameters be given by

$$\theta_1^* = \begin{bmatrix} 20 & 1 & 1 & 1 & 2 \end{bmatrix}^\top \quad \text{and} \quad \theta_2^* = \begin{bmatrix} 1 & 1 & 10 & 1 & 1 \end{bmatrix}^\top. \tag{7.13}$$

In this way, the first player focuses on bringing the ball to the center of the beam whereas the second player mainly focuses on bringing the beam to a horizontal position (see state definition in (7.10)).

For the equilibrium calculation step, the system dynamics and cost functions with ground truth parameters are used to solve for open-loop Nash equilibrium trajectories by applying Pontryagin's minimum principle and then solving the resulting two-point boundary value problem, where the initial state

$$x(0) = \begin{bmatrix} 0.5 & 0 & 0 & 0 \end{bmatrix}^\top \tag{7.14}$$

was used. The solution leads to trajectories $x^*(t)$ and $u_i^*(t)$ corresponding to the open-loop Nash equilibrium (OLNE). Further details on the calculation are given in Section B.4 of the Appendix. The equilibrium is illustrated in Figure 7.3, where the trajectories of the ball position and beam angle, i.e. of the states $x_1(t)$ and $x_3(t)$, are depicted. The applied torques of each player, i.e. the controls $u_1(t)$ and $u_2(t)$, are also shown. The different preferences of the players modeled by the cost function parameters in (7.13) can be recognized. Player 1 applies a positive torque such that the ball is moved towards the zero position, whereas player 2 counteracts this action since its focus is to regulate the beam angle towards zero.

#### 7.4.2 Noisefree Case

The inverse methods are first tested under the assumption that the observed trajectories correspond exactly to the OLNE trajectories generated by the ground truth cost function parameters $\theta_i^*$. This represents an ideal condition to analyze the extent up to which the real parameters $\theta_i$ can be obtained. The cost function parameter values are given with a precision of two decimal places. Higher precision is not needed since, as will be shown later, differences of a smaller order of magnitude barely have an effect on the corresponding trajectories. Nevertheless, the parameter errors $\Delta_p^\theta$ are calculated with the highest possible precision.

Figure 7.3: Open-loop Nash equilibrium trajectories of the ball-on-beam system

#### Inverse Optimal Control Based Method

The trajectories of the open-loop Nash equilibrium are used to determine the parameters $\theta_i$ of each player by means of Algorithm 1. The solution of the RDE appearing in the method was calculated by means of a numerical MATLAB solver (ode45).

The estimated parameters are<sup>50</sup>

$$\begin{aligned} \hat{\theta}\_1 &= \begin{bmatrix} 19.99 & 1.00 & 1.00 & 1.00 & 2.00 \end{bmatrix} \\ \hat{\theta}\_2 &= \begin{bmatrix} 1.01 & 1.00 & 10.00 & 1.00 & 1.00 \end{bmatrix} . \end{aligned} \tag{7.15}$$

which lead to a mean parameter error $\Delta_{p,\text{mean}}^\theta = 0.16\,\%$ and a maximum parameter error $\Delta_{p,\text{max}}^\theta = 0.76\,\%$. The NSAE of the states is $\delta^x = 0.0271$. The NSAE of the controls is $\delta^u = 0.025$.

<sup>50</sup> For the presented inverse open-loop dynamic game results, the parameter vectors $\hat{\theta}_i$, $\forall i \in \mathcal{P}$, were multiplied with a constant factor $c \in \mathbb{R}^+$ such that the last entry corresponds to the ground truth, i.e. $\hat{\theta}_{i,(5)} = \theta^*_{i,(5)}$, $\forall i \in \mathcal{P}$. This was done in favor of higher clearness in the comparison.

#### Inverse Reinforcement Learning Based Method

In order to solve the inverse dynamic game problem, Algorithm 3 was applied. The optimization problem corresponding to the MLE (6.29) was solved with the MATLAB solver fminunc, using a BFGS Quasi-Newton method. The estimated parameters are

$$\begin{aligned} \hat{\boldsymbol{\theta}}\_{1} &= \begin{bmatrix} 19.51 & 0.95 & 0.73 & 0.77 & 2.00 \end{bmatrix} \\ \hat{\boldsymbol{\theta}}\_{2} &= \begin{bmatrix} 1.04 & 1.01 & 9.99 & 1.02 & 1.00 \end{bmatrix} . \end{aligned} \tag{7.16}$$

We obtain a mean parameter error $\Delta_{p,\text{mean}}^\theta = 8.1\,\%$ and a maximum parameter error $\Delta_{p,\text{max}}^\theta = 27.0\,\%$. The NSAE of the states is $\delta^x = 0.664$. The NSAE of the controls is $\delta^u = 0.554$. The parameter error is bigger than the one generated by the IOC approach. The NSAE values are also higher than the ones corresponding to the IOC-based identification.

#### Direct Bilevel Approach

For this method, the optimization problem (7.1) was solved using the procedure in Section B.6 with an interior-point method of MATLAB's fmincon solver.

The estimated parameters are

$$\begin{aligned} \hat{\boldsymbol{\theta}}\_{1} &= \begin{bmatrix} 20.11 & 0.89 & 3.91 & 0.85 & 2.00 \end{bmatrix} \\ \hat{\boldsymbol{\theta}}\_{2} &= \begin{bmatrix} 1.14 & 1.01 & 10.13 & 1.09 & 1.00 \end{bmatrix} . \end{aligned} \tag{7.17}$$

The mean parameter error is $\Delta_{p,\text{mean}}^\theta = 42.9\,\%$ and the maximum parameter error is $\Delta_{p,\text{max}}^\theta = 290.9\,\%$. The NSAE of the states is $\delta^x = 1.4322$. The NSAE of the controls is $\delta^u = 0.122$. The parameter error is bigger than the one generated by both the IOC and IRL approaches.

#### Comparison

Table 7.2 summarizes the results of the parameter identification with all methods. In addition, the identified parameters $\hat{\theta}_i$ of all methods are used to generate OLNE trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$. Both the original and identified trajectories of the controls as well as the ball position and beam angle (states $x_1$ and $x_3$, respectively) are depicted in Figure 7.4. While the parameter errors of the identification with IRL and the DB approach are higher than the ones corresponding to the IOC method, they do not have a big impact on the trajectory approximation in this setup. The OLNE trajectories of all identified cost functions are practically identical to the original OLNE trajectories. The differences are imperceptible even though there is a slight difference in the estimation accuracy of the methods. This also confirms that the presented parameter precision of two decimal places is sufficient for an adequate comparison.


Table 7.2: Ground truth and cost function parameters of the nonlinear OL differential game identified with all methods using noiseless trajectories

Figure 7.4: Trajectories resulting from the nonlinear inverse dynamic game solutions with IOC, IRL and DB methods

#### 7.4.3 Robustness to Measurement Noise

In practice, measurements of the states and controls corresponding to a dynamic game may not be ideal. For example, the measurements may be affected by noise, which can be detrimental for the identification of cost function parameters. Therefore, the results of the inverse dynamic game methods should ideally be robust to measurement noise. In order to evaluate this property for the considered open-loop methods, Gaussian white noise is artificially added to the state and control trajectories. Hence, the new measurements which are used for identification of cost function parameters are given by

$$\tilde{x}_z(t) = x_z^*(t) + \epsilon_z^x, \qquad \forall z \in \{1, \dots, n\}, \tag{7.18}$$

$$\tilde{u}_{i,z}(t) = u_{i,z}^*(t) + \epsilon_{i,z}^u, \qquad \forall z \in \{1, \dots, m_i\}, \ \forall i \in \mathcal{P}. \tag{7.19}$$

The noise $\epsilon_z^x$ and $\epsilon_{i,z}^u$ was chosen in such a way that all signals have a particular signal-to-noise ratio (SNR). Different SNR levels from 20 dB to 40 dB were considered for trajectory generation. In order to examine the consistency of the results, 100 samples of Gaussian white noise are generated for each of the considered SNR levels such that we obtain the trajectories $\tilde{\zeta}_s$, $s \in \{1, \dots, 100\}$ (cf. Definition 6.2). Figure 7.5 shows examples of noise-corrupted Nash equilibrium trajectories with different SNR values. The generated noisy trajectories are used to identify cost function parameters with all methods. Therefore, for each of the methods, we obtain 100 sets of identified parameters $\hat{\theta}_s$, $s \in \{1, \dots, 100\}$. In turn, each of these is used to compute corresponding OLNE trajectories denoted by $\hat{\zeta}_s$, $s \in \{1, \dots, 100\}$. The mean over all 100 values of the identified parameters of each player, denoted by $\hat{\theta}_{i,\text{mean}}$, is computed for the following analysis. Moreover, the comparison of the estimated parameters and trajectories with the original ones is assessed with the mean of the NSAE (defined in (7.7), (7.8) and (7.9)) over all 100 trajectories. These are denoted by $\delta^x_\text{mean}$, $\delta^{u_i}_\text{mean}$ and $\delta^u_\text{mean}$, respectively. Similarly, the maximum and mean parameter errors over all 100 results, denoted by $\Delta^\theta_\text{max}$ and $\Delta^\theta_\text{mean}$, are considered (cf. (7.4)).
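Generating noise for a prescribed SNR can be sketched as follows. The helper name is illustrative; the noise standard deviation is derived from the empirical signal power $P = \overline{x^2}$ via $\text{SNR} = 10 \log_{10}(P / \sigma^2)$:

```python
import numpy as np

def add_noise_snr(signal, snr_db, rng=None):
    """Add Gaussian white noise such that the result has a given SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    power = np.mean(np.asarray(signal, dtype=float) ** 2)  # signal power
    sigma = np.sqrt(power / 10 ** (snr_db / 10))           # noise std dev
    return signal + rng.normal(0.0, sigma, size=np.shape(signal))
```

For instance, a signal of unit power corrupted at 20 dB receives noise with standard deviation 0.1.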

#### Inverse Optimal Control

The mean values of the identified cost function parameters are given in Table 7.3, where the noisefree case is listed for comparison and is denoted by an infinite SNR. The parameter error increases considerably in the presence of noise. Even with an SNR value of 30 dB, which implies a rather low noise magnitude, the parameters deviate significantly from the ground truth. In particular, from this SNR value on, the parameter $\hat{\theta}_{i,(3)}$ becomes negative, which implies a reward for deviations from zero instead of a penalty as originally stated. This trend is confirmed by the mean values of the parameter and trajectory errors which are summarized in Table 7.4. The table shows very high errors for SNR values of 30 dB or below.


Table 7.3: Mean values of the cost function parameters of the inverse nonlinear OL dynamic game which were identified with the IOC method

Figure 7.5: Noise-corrupted open-loop Nash equilibrium trajectories of the two-player dynamic game with the nonlinear ball-on-beam system


Table 7.4: Mean parameter errors and NSAE of trajectories obtained with the IOC method

#### Inverse Reinforcement Learning

The mean values of the identified cost function parameters are given in Table 7.5. The order of magnitude of the parameters is similar for all SNR values, but the results are also negatively affected by lower SNR values. For an SNR value of 20 dB, the parameter $\hat{\theta}_{1,(3)}$ of player 1 becomes slightly negative, leading to a reward of the deviation of $x_3$ from zero. The mean values of the errors listed in Table 7.6 are moderately low compared to the IOC results, especially the mean parameter error and the mean NSAE of the states.

| SNR in dB | $\hat{\theta}_{1,\text{mean}}^\top$ | $\hat{\theta}_{2,\text{mean}}^\top$ |
|---|---|---|
| 20 | (20.62, 1.19, -2.58, 1.79, 2.00) | (1.53, 1.16, 7.03, 1.58, 1.00) |
| 25 | (19.85, 0.97, 0.79, 0.99, 2.00) | (1.23, 1.06, 9.13, 1.20, 1.00) |
| 30 | (19.60, 0.94, 1.16, 0.81, 2.00) | (1.09, 1.02, 9.63, 1.08, 1.00) |
| 35 | (19.53, 0.93, 1.17, 0.76, 2.00) | (1.05, 1.01, 9.88, 1.04, 1.00) |
| 40 | (19.51, 0.92, 1.41, 1.72, 2.00) | (1.05, 1.01, 9.98, 1.03, 1.00) |
| ∞ | (19.51, 0.95, 0.73, 0.77, 2.00) | (1.04, 1.01, 9.99, 1.02, 1.00) |

Table 7.5: Mean values of the cost function parameters identified with the IRL method

Table 7.6: Parameter errors and NSAE obtained with the IRL method


#### Direct Bilevel Approach

The mean values of the identified cost function parameters are given in Table 7.7. The identified parameters are very similar for all SNR values and no clear SNR-dependent trend can be recognized. Almost all parameters are very close to the ground truth; only the parameter $\hat{\theta}_{1,(3)}$ of the first player could not be recovered exactly. The mean values of the errors listed in Table 7.8 show that the parameter and trajectory errors overall do increase with smaller SNR values. However, even for the lowest SNR value of 20 dB, the errors, especially the NSAE of the controls, are considerably low.

Table 7.7: Mean values of the identified cost function parameters obtained from noisy trajectories using the DB method

| SNR in dB | $\hat{\theta}_{1,\text{mean}}^\top$ | $\hat{\theta}_{2,\text{mean}}^\top$ |
|---|---|---|
| 20 | (20.12, 0.86, 3.16, 0.91, 2.00) | (1.06, 1.00, 9.97, 1.04, 1.00) |


Table 7.8: Mean parameter errors and NSAE obtained from noisy trajectories using the DB method


#### Comparison

The results of cost function identification with noisy measurements are now compared. The mean values of the parameter error corresponding to the SNR values of 20 dB to 40 dB are illustrated in Figure 7.6. In a similar way, Figure 7.7 contrasts the mean values of the NSAE of the states and controls.

Figure 7.6 shows that the IOC approach outperforms the IRL and DB methods in the case of perfect observations of the Nash equilibrium trajectories, but its parameter estimation becomes markedly worse as the SNR values become smaller. In contrast, both the IRL and DB methods yield similar results across all SNR values and prove to be less affected

Figure 7.6: Comparison of parameter errors of identification for all SNR values and all methods

(a) Sum of normalized absolute state errors

Figure 7.7: Comparison of trajectory errors of identification for all SNR values and all methods

by measurement noise. The DB method is slightly better than the IRL approach only in the 20 dB case. A similar trend is observed in Figure 7.7. Nevertheless, it is noticeable that the differences in the parameter estimates can lead to large differences in the mean NSAE. The superiority of the IRL and DB methods in terms of robustness to measurement noise is confirmed. However, it can be observed that the DB approach yields the lowest NSAE of the controls.

In order to obtain a better insight into the quality of the trajectory approximation, the mean values of the identified parameters with each method, i.e. the parameters in Tables 7.3, 7.5 and 7.7, are used to generate corresponding model state and control trajectories. Figure 7.8 shows an example for an SNR value of 30 dB. The IRL and DB methods yield very similar results. The IOC approach is able to explain the state trajectories adequately, but fails to reproduce the course of the control trajectories. For SNR values lower than 30 dB, the control trajectory approximation by the IRL method starts to deteriorate while the DB approach maintains its robustness. Plots of this comparison for all SNR values can be found in Section E.1.1 of the Appendix.

Figure 7.8: Observed trajectories and estimations based on mean identification results of all methods, SNR = 30 dB

#### 7.4.4 Robustness to a Basis Function Mismatch

Especially in practical applications, it cannot be assured that the observed trajectories constitute Nash equilibrium trajectories generated by the considered basis functions $g_i$. In order to give a first evaluation of the limits of the presented methods, a mismatch between the original ground truth (GT) basis functions and the ones used in the inverse dynamic game methods is considered in this section. The following analysis utilizes the noisefree trajectories generated by the parameters $\theta_1^*$ and $\theta_2^*$, as given in Section 7.4.1, for identification. However, for both the inverse dynamic game step and the subsequent forward solution to obtain estimated trajectories $\hat{x}(t)$, $\hat{u}_1(t)$ and $\hat{u}_2(t)$ (cf. Figure 7.1), four different basis function vectors shall be considered which differ from the original ones. These are given in Table 7.9.

The choice is motivated by the control task and the ground truth parametrization (cf. (7.13)). The basis functions $x_2^2$ and $x_4^2$, corresponding to the ball velocity and the beam angular velocity, are only weakly weighted by $\theta_{i,(2)}$ and $\theta_{i,(4)}$, respectively. Therefore, case I neglects these


Table 7.9: Considered cases in the basis function mismatch analysis of inverse open-loop dynamic games

basis functions to evaluate their significance for identification. Cases II and III disregard one additional basis function, either $x_1^2$ or $x_3^2$, corresponding to the ball position and the beam angle, respectively. Finally, case IV represents a situation where one of the basis functions is incorrectly specified.

Since the assumed basis functions differ from the ground truth, the parameters are not comparable. Therefore, only the NSAE of the trajectories shall be considered for the evaluation. The NSAE arising from identification with each method is given in Table 7.10 for each case. For case I, we observe a low NSAE of the states and a higher NSAE of the controls. Cases II and III lead to worse results in terms of the state trajectory approximation. Lastly, for case IV, only the IRL method yields low NSAE values for the states, whereas the DB method can only approximate the control trajectories adequately. The observed and estimated trajectories are exemplarily shown for cases I and IV in Figures 7.9 and 7.10. Additional plots describing the results of the other cases can be found in Section E.1.2 of the Appendix.


Table 7.10: NSAE in case of basis function mismatch

#### 7.4.5 Discussion of Inverse Open-Loop Dynamic Game Results

By comparing the results of the methods based on noisefree trajectories, it is recognizable that the method based on IOC offers the best results in terms of parameter accuracy. This also leads to a better performance in the approximation of the ground truth trajectories. Nevertheless, even though the IRL method and the DB approach exhibit a lower parameter approximation accuracy, both are still able to explain the Nash equilibrium trajectories. While there is a minor numerical difference between their trajectory approximation errors, it is so small as to be imperceptible in Figure 7.4.

Figure 7.9: Inverse open-loop dynamic game results for all methods, basis function mismatch case I.

The differences in the parameter identification results can be explained by the different characteristics of each of the methods. All methods are based on the solution of an optimization problem. In the case of the IOC approach, the parameters which exactly fulfill the conditions for Nash equilibria are sought. Since the observations are perfect, i.e. they correspond to an exact Nash equilibrium, the corresponding cost function parameters can be found with great precision. The IRL approach is based on the maximization of a likelihood function which indirectly considers the requirement of matching the cost function values of the Nash equilibrium trajectories. The slight deviations from the true parameters arise because a sufficient match of trajectories, which corresponds to a peak in the likelihood function, may not require a precise estimation of parameters. Finally, the DB approach similarly searches for parameters such that the deviation between the costs of observed and estimated trajectories is minimal. This likewise does not necessarily require an exact estimation of parameters.
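The bilevel principle behind the DB approach can be illustrated with a hypothetical one-player scalar LQR example (not the ball-on-beam game of this chapter): the outer level searches over the unknown cost parameter, the inner level solves the forward optimal control problem in closed form, and the objective is the deviation between observed and model trajectories. All numbers and names below are chosen for illustration only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Scalar plant dx/dt = a*x + b*u with cost integral of q*x^2 + r*u^2 (r fixed)
a, b, r, x0 = -1.0, 1.0, 1.0, 1.0
t = np.linspace(0.0, 5.0, 200)

def closed_loop_states(q: float) -> np.ndarray:
    # Inner level: the scalar algebraic Riccati equation has a closed-form solution
    p = r * (a + np.sqrt(a**2 + q * b**2 / r)) / b**2
    k = b * p / r                        # optimal feedback gain, u = -k*x
    return x0 * np.exp((a - b * k) * t)  # resulting state trajectory

q_true = 2.0
x_obs = closed_loop_states(q_true)       # noisefree "observed" trajectory

# Outer level: minimize the trajectory deviation over the unknown parameter q
res = minimize_scalar(lambda q: np.sum(np.abs(x_obs - closed_loop_states(q))),
                      bounds=(0.01, 10.0), method="bounded")
q_hat = res.x
```

With noisefree observations the outer minimization recovers the true parameter; with noisy observations the minimizer shifts, mirroring the robustness behavior discussed above.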

Figure 7.10: Inverse open-loop dynamic game results for all methods, basis function mismatch case IV.

Having discussed these differences in the noisefree case, it is possible to find similar explanations for the results of identification in the presence of measurement noise in the observed Nash equilibrium trajectories. In this case, we observe that the IRL method and the DB approach are more robust towards measurement noise. Even at SNR values of 20 dB and 25 dB, cost function parameters can be found which explain the observed trajectories. This can also be explained by the different principles each method is based on. The probabilistic formulation of the inverse dynamic game problem in the IRL-based method, with its indirect requirement of matching trajectory costs, leads to a higher robustness to noise. On the contrary, the IOC approach is strongly affected by measurement noise. The parameter deviations of the IOC approach lead especially to a poor approximation of the control trajectories, whereas the approximation of the state trajectories is not as strongly affected by the parameter deviations.<sup>51</sup>

Finally, the analysis of basis function mismatch indicates that all methods are mildly robust towards a small mismatch of the basis function vectors, especially regarding the state trajectory approximation. All methods yield greater errors if an originally relevant basis function (e.g. $x_1^2$ and $x_3^2$ in the example) is neglected. The results suggest that the task can, to some extent, still be described by the remaining basis functions with a corresponding adequate parameterization which compensates for the missing basis functions. However, this possibility may

<sup>51</sup> Similar results were reported in [MTFP16], where a one-player inverse optimal control problem was similarly solved by leveraging the minimum principle and where only the states were corrupted with noise in the evaluation.

depend on the real parametrization of the basis functions. This means that a missing basis function which was weighted by a high value of the corresponding parameter cannot be compensated for by other basis functions. In addition, a misspecified basis function as in case IV can affect the results of all methods considerably, especially for the IOC method. This is due to the fact that the basis function $x_2$ is not appropriate for the task at hand, which consists of regulating all states to zero. The other methods, IRL and DB, are less affected since they, either directly or indirectly, take the deviation between trajectories into consideration. This is further illustrated by Table 7.11, where the parameters identified by each method in case IV are listed. The table indicates that the IRL and DB methods correctly estimate the parameter $\theta_{i,(2)}$ (the one which corresponds to $x_2$) as a value close to zero such that trajectories similar to the observed ones can be obtained.


# 7.5 Inverse Feedback Dynamic Games

After comparing inverse dynamic game methods for identification in open-loop dynamic games, this section is devoted to an evaluation of inverse feedback dynamic games in a Nash equilibrium, i.e. games where the players applied linear feedback strategies which led to an FNE. Analogously to the last section, one method of each class is analyzed and compared in the following. In particular,


Again, these are abbreviated and referred to as the IOC, IRL and DB methods, respectively.

#### 7.5.1 Preliminaries

The following analysis is conducted by means of an infinite-horizon linear-quadratic dynamic game with the following system dynamics and cost functions.

#### System Dynamics

The system is described by the differential equation

$$\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t) + \sum\_{i=1}^{3} \mathbf{B}\_{i} \mathbf{u}\_{i}(t) \tag{7.20}$$

with

$$\mathbf{A} = \begin{bmatrix} -8 & -6 & 1 & 0 \\ 1 & 0 & 2 & 1 \\ 0 & -2 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \qquad \mathbf{B}\_1 = \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}, \qquad \mathbf{B}\_2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad \mathbf{B}\_3 = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}.$$

Therefore, in this case, each player $i$ has a control vector $\mathbf{u}_i \in \mathbb{R}^{m_i}$ with $m_i = 2$ to apply at each time $t$. The system $(\mathbf{A}, [\mathbf{B}_1 \cdots \mathbf{B}_N])$ is stabilizable and therefore, the existence of stabilizing linear feedback strategies of the form

$$\mathbf{u}\_{i}(t) = -\mathbf{K}\_{i}\mathbf{x}(t), \quad \forall \; i \in \mathcal{P}$$

is guaranteed [EBS00].

#### Cost Functions

Each player i ∈ P aims to minimize an individual quadratic performance index

$$J\_i = \frac{1}{2} \int\_0^\infty \mathbf{x}^\top(t) \mathbf{Q}\_i \mathbf{x}(t) + \mathbf{u}\_i^\top(t) \mathbf{R}\_{ii} \mathbf{u}\_i(t) \,\mathrm{d}t. \tag{7.21}$$
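For stabilizing linear feedback $u_i = -K_i x$, an infinite-horizon quadratic cost of this form can be evaluated without simulating the system: $J_i = \tfrac{1}{2} x_0^\top P_i x_0$, where $P_i$ solves the Lyapunov equation $A_{cl}^\top P_i + P_i A_{cl} + Q_i + K_i^\top R_{ii} K_i = 0$ with the closed-loop matrix $A_{cl}$. A small single-player sketch with hypothetical numbers (not the game of this section):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical single-player data with a stable closed loop A_cl = A - B K
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 1.0]])
Q = np.diag([1.0, 0.5])
R = np.array([[1.0]])
A_cl = A - B @ K                        # closed-loop dynamics matrix

# Solve A_cl^T P + P A_cl = -(Q + K^T R K) for P
P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))

x0 = np.array([1.0, 0.0])
J = 0.5 * x0 @ P @ x0                   # value of the infinite-horizon cost
```

The same construction applies per player in the game setting, with $A_{cl} = A - \sum_j B_j K_j$ shared by all players.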

The ground truth parameters of the cost functions were set to

$$\begin{aligned} \mathbf{Q}\_1^\* &= \text{diag}(1, 0.4, 2, 1), & \mathbf{R}\_{11}^\* &= \text{diag}(1, 1), \\ \mathbf{Q}\_2^\* &= \text{diag}(1, 0.6, 1, 2), & \mathbf{R}\_{22}^\* &= \text{diag}(1, 1), \\ \mathbf{Q}\_3^\* &= \text{diag}(1, 1, 0.5, 1), & \mathbf{R}\_{33}^\* &= \text{diag}(1, 2). \end{aligned} \tag{7.22}$$

Using the ground truth cost function parameters, the feedback Nash equilibrium trajectories $x^*(t)$ and $u^*(t)$ were calculated by means of the coupled matrix Riccati equations [Eng05, Theorem 8.5]. The theorem allows confirming the Nash character of the trajectories given the stability of the controlled system. The resulting Nash equilibrium feedback matrices are given by

$$\begin{aligned} \mathbf{K}\_1^\* &= \begin{bmatrix} 0.012 & 0.123 & 0.114 & 0.318\\ 0.066 & 0.028 & -0.006 & 0.012 \end{bmatrix},\\ \mathbf{K}\_2^\* &= \begin{bmatrix} 0.004 & -0.041 & 0.541 & 0.130\\ 0.018 & 0.197 & 0.130 & 0.644 \end{bmatrix},\\ \mathbf{K}\_3^\* &= \begin{bmatrix} 0.025 & 0.650 & 0.115 & 0.149\\ 0.020 & 0.132 & 0.384 & 0.301 \end{bmatrix}.\end{aligned} \tag{7.23}$$
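As a quick plausibility check of these numbers, one can form the closed-loop matrix $A - \sum_i B_i K_i^*$ from (7.20) and (7.23) and verify that all its eigenvalues have negative real parts, as the stated stability of the controlled system requires. A sketch (not part of the thesis code):

```python
import numpy as np

A = np.array([[-8.0, -6.0, 1.0,  0.0],
              [ 1.0,  0.0, 2.0,  1.0],
              [ 0.0, -2.0, 0.0,  1.0],
              [ 0.0,  1.0, 0.0, -1.0]])
B = [np.array([[0.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]),
     np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
     np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])]
K = [np.array([[0.012, 0.123, 0.114, 0.318], [0.066, 0.028, -0.006, 0.012]]),
     np.array([[0.004, -0.041, 0.541, 0.130], [0.018, 0.197, 0.130, 0.644]]),
     np.array([[0.025, 0.650, 0.115, 0.149], [0.020, 0.132, 0.384, 0.301]])]

# Closed-loop system matrix under the feedback strategies u_i = -K_i x
A_cl = A - sum(Bi @ Ki for Bi, Ki in zip(B, K))
stable = bool(np.all(np.linalg.eigvals(A_cl).real < 0.0))
```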

#### Properties of the Inverse LQ Dynamic Game

Before solving the inverse LQ dynamic game, the LQ character of the problem allows for its analysis by means of the results of Chapter 5. We first use the results of Lemma 5.2 to determine the matrices $\mathbf{M}_i \in \mathbb{R}^{8 \times 6}$ with (5.14) using the control matrices $\mathbf{K}_i^*$. Considering the rank of $\mathbf{M}_i$, we obtain $\operatorname{rank}(\mathbf{M}_i) = 6$ for all $i \in \mathcal{P}$. By the results of Theorem 5.3, the necessary and sufficient conditions for a unique solution of the inverse LQ dynamic game up to a multiplicative constant are fulfilled.

#### 7.5.2 Noisefree Case

The inverse dynamic game methods are first tested under ideal conditions, i.e. the observed trajectories are free of measurement noise and therefore correspond exactly to the FNE which arises out of the dynamic game consisting of the system dynamics (7.20) and cost functions (7.21) with ground truth parameters (7.22). Since both the IOC and IRL methods rely on the estimation of the Nash equilibrium feedback matrices, this is carried out for all players using the least-squares approach presented in Section 5.4.2 and the given trajectories $x^*(t)$, $u_i^*(t)$. The estimation yields very good results for $\hat{\mathbf{K}}_i$, as we obtain deviations $\|\hat{\mathbf{K}}_i - \mathbf{K}_i^*\| < 10^{-4}$, $i \in \{1, 2, 3\}$, from the original Nash feedback matrices.
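The idea of this least-squares fit can be sketched as follows, under the assumption $u_i(t) = -K_i x(t)$ and with synthetic sample data; this is an illustration of the principle of Section 5.4.2, not its exact implementation:

```python
import numpy as np

def estimate_feedback(X: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Least-squares fit of K in u = -K x from column-wise samples X (n x T), U (m x T)."""
    # min_K ||U + K X||_F  is equivalent to the linear system  X^T K^T = -U^T
    K_T, *_ = np.linalg.lstsq(X.T, -U.T, rcond=None)
    return K_T.T

# Synthetic check: samples generated by a known feedback matrix are recovered
rng = np.random.default_rng(0)
K_true = np.array([[0.012, 0.123, 0.114, 0.318],
                   [0.066, 0.028, -0.006, 0.012]])   # K_1^* from (7.23)
X = rng.normal(size=(4, 50))                          # state samples x(t_k)
U = -K_true @ X                                       # corresponding controls u(t_k)
K_hat = estimate_feedback(X, U)
```

With noisefree samples and persistently exciting states, the fit is exact up to numerical precision, matching the deviations below $10^{-4}$ reported above.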

#### Inverse Optimal Control


The inverse dynamic game is solved by determining the solution of the quadratic static optimization problem (5.33) using the estimated feedback matrices $\hat{\mathbf{K}}_i$. The parameters in (7.22) are identified exactly up to two decimal places and are therefore not explicitly given. The mean parameter error $\Delta^{\theta}_{p,\text{mean}}$ is 0.05 % and the maximum parameter error $\Delta^{\theta}_{p,\text{max}}$ is 0.26 %. The NSAE of the states is $\delta^x = 0.002$, while the NSAE of the controls is $\delta^u = 0.010$.

#### Inverse Reinforcement Learning

The IRL approach leads to identified cost function parameters which approximate the original ground truth parameters up to two decimal places. The mean parameter error is $\Delta^{\theta}_{p,\text{mean}} = 0.1\,\%$ and the maximum parameter error is $\Delta^{\theta}_{p,\text{max}} = 0.6\,\%$. The NSAE of the states is $\delta^x = 0.019$ and the NSAE of the controls is $\delta^u = 0.073$. All errors are slightly larger than the errors obtained with the IOC method.

#### Direct Bilevel Approach

The DB approach leads to a mean parameter error of $\Delta^{\theta}_{p,\text{mean}} = 0.43\,\%$ and a maximum parameter error of $\Delta^{\theta}_{p,\text{max}} = 3.85\,\%$. The NSAE of the states is $\delta^x = 0.028$ and the NSAE of the controls is $\delta^u = 0.151$. The DB approach thus yields greater errors than both the IOC and IRL methods.

#### Comparison

The following Tables 7.12 and 7.13 summarize the results of the parameter identification with all methods.<sup>52</sup>


Table 7.12: Ground truth and cost function matrices $\mathbf{Q}_i$ identified from noiseless trajectories with all methods

Table 7.13: Ground truth and cost function matrices $\mathbf{R}_{ii}$ identified from noiseless trajectories with all methods


Even though the metrics show that the DB method leads to the highest mean and maximum parameter errors as well as the highest NSAE, thus suggesting a superiority of IOC and IRL in the quality

<sup>52</sup> All results were normalized with respect to the parameter $R_{i,(11)}$ for a better comparison. Therefore, this parameter is not explicitly given in Table 7.13.

of the estimation, all errors are relatively small. The values in Tables 7.12 and 7.13 confirm that all methods lead to an excellent estimation of the cost function parameters. For the sake of completeness and in order to see potential differences in the approximation of the observed trajectories, we solve the LQ differential game with the estimated parameters and determine the corresponding FNE trajectories for all methods. The ground truth and model state trajectories are depicted in Figure 7.11. Likewise, the control trajectories are shown in Figure 7.12. All methods are able to perfectly approximate the observed trajectories.

Figure 7.11: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method

#### 7.5.3 Robustness to Measurement Noise

This section presents simulation results on the influence of the presence of noise in the observed trajectories on the results of the inverse dynamic game methods. Similar to the evaluation in Section 7.4.3 for the open-loop case, Gaussian white noise is added to the state trajectories and the control trajectories according to (7.18) and (7.19), respectively. Once more, the added noise is generated such that the corrupted trajectories have a particular SNR value. The considered SNR values range from 20 dB to 40 dB. 100 samples of Gaussian white noise were generated and therefore, the noisy trajectories $\tilde{\zeta}_s$, $s \in \{1, \ldots, 100\}$ (cf. Definition 6.2), are obtained for each of the different SNR values. Each one of these trajectories was used

Figure 7.12: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method

to identify cost function parameters. Therefore, we obtain for each method the parameter sets $\hat{\theta}_s$, $s \in \{1, \ldots, 100\}$. Each of the parameter sets can be used to determine corresponding FNE trajectories which are denoted by $\hat{\zeta}_s$, $s \in \{1, \ldots, 100\}$. Analogously to Section 7.4.3, the metrics $\delta^{x}_{\text{mean}}$, $\delta^{u_i}_{\text{mean}}$ and $\delta^{u}_{\text{mean}}$ for the trajectory comparison as well as $\Delta^{\theta}_{\text{max}}$ and $\Delta^{\theta}_{\text{mean}}$ for the parameter comparison are considered.

#### Inverse Optimal Control

The parameter and trajectory errors are given in Table 7.14. The errors increase moderately with lower values of the SNR. The worst-case mean parameter error is 18.2 %. It is noticeable that the NSAE of the control $u_1(t)$ is always larger than the NSAE of the other players' controls.

Table 7.14: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with IOC from noisy trajectories


#### Inverse Reinforcement Learning

The error measures for each SNR value are given in Table 7.15. In this case, it can again be observed that the NSAE of the control $u_1(t)$ is always larger than the NSAE of the other players' controls. The worst-case mean parameter error is 11.4 %.

Table 7.15: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with IRL from noisy trajectories


#### Direct Bilevel Approach

Table 7.16 gives the resulting NSAE and the parameter errors. The trend of less accurate estimations of the control $u_1(t)$ is visible in this case as well. The worst-case mean parameter error is 19.4 %.

Table 7.16: Parameter errors and NSAE between ground truth trajectories and trajectories obtained with the DB method from noisy trajectories


#### Comparison

Figure 7.13 shows a comparison of the mean NSAE obtained with each method and for all SNR values. It is noticeable that the IRL approach leads to the lowest NSAE of the states for SNR values from 30 dB to 40 dB. For an SNR of 25 dB, the IRL method and the DB approach obtain almost the same results. Finally, for highly corrupted trajectories with an SNR of 20 dB, the DB approach offers the best results, closely followed by the IRL method. For all SNR values, the IOC method leads to a higher state error than the other approaches. Similar results can be observed in the mean NSAE of the controls $\delta^{u_i}_{\text{mean}}$, $i \in \{1, 2, 3\}$. In this case, the IRL approach offers better results consistently across all SNR values. For low noise levels, i.e. for SNR values of 30 dB to 40 dB, the IOC method leads to better results than the DB approach.

Figure 7.13: Mean NSAE obtained with each method for all trajectory SNR values

Regarding the parameter errors, Figure 7.14 shows that the lowest mean parameter error is obtained by the IRL method for all SNR values. However, the maximum parameter error does not show a clear trend, but suggests that the IOC method yields more consistent results, as the other methods have greater maximum parameter errors. The IOC method and the DB approach have similar mean parameter errors. However, by inspecting the maximum parameter error, it can be discerned that the IOC approach does not show great differences as the SNR value changes. On the contrary, the maximum parameter error of the IRL method is always higher and varies considerably more, with the exception of the case of an SNR value of 40 dB. The DB method results do not allow a particular interpretation as no clear trend can be observed, except for the larger errors at lower SNR values which are common to all methods. Nevertheless, an outlier can be observed for an SNR of 35 dB, caused by an anomalously poor identification result.

Once more, for a better understanding of these results, the mean values of the identified parameters were used to determine mean estimated Nash equilibrium trajectories. These parameters are listed in the Appendix: Tables E.1 and E.2 correspond to the IOC method, Tables E.3 and E.4 to the IRL method, and Tables E.5 and E.6 show the results of the DB approach. The resulting estimated FNE trajectories are compared with the original noiseless trajectories $\zeta^*$. Figures 7.15 and 7.16 show this comparison for the FNE state and control trajectories, respectively, which were estimated from noisy observations with 20 dB. It is noticeable that, despite the low SNR, all methods lead to good approximations of the state and control trajectories.

(a) Maximum parameter error

(b) Mean parameter error

Figure 7.14: Parameter error of identification for all SNR values and all methods

Figure 7.15: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 20 dB.

In a detailed view of the results, there is a better agreement between the original trajectories and the estimated ones in the case of the state variables. Furthermore, we can observe that the DB method performs slightly better than the IRL and IOC methods. While these minor differences are visible in this case, they are even smaller for greater SNR values. The corresponding figures are given in Section E.2.1 of the Appendix.

Figure 7.16: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 20 dB.

#### 7.5.4 Robustness to a Basis Function Mismatch

This section presents an evaluation of the robustness of the inverse LQ dynamic game methods to a mismatch in the basis functions, similar to the analysis conducted in Section 7.4.4 for the open-loop case. The noisefree trajectories generated by the cost function matrices $\mathbf{Q}_i^*$ and $\mathbf{R}_{ii}^*$, $i \in \mathcal{P}$, as given in Section 7.5.1, are used for identification. For both the inverse dynamic game step and the subsequent forward solution to obtain estimated trajectories $\hat{x}(t)$ and $\hat{u}_i(t)$, $i \in \mathcal{P}$, it shall be assumed that certain elements of the matrix $\mathbf{Q}_i$ are neglected and therefore not identified. The considered cases are described in Table 7.17. They correspond to an increasing number of parameters of the diagonal matrix $\mathbf{Q}_i$ being neglected. Analogously to the open-loop case, only the NSAE of the trajectories shall be considered for the evaluation. The NSAE arising from identification with each method is given in Table 7.18. Similar error values can be observed for cases I to III for all methods, with the DB method presenting slightly lower values. In turn, case IV shows a very high error for all methods. The observed and estimated trajectories are exemplarily shown for case I in Figures 7.17 and 7.18. Additional plots describing the results of the other cases can be found in Section E.2.2 of the Appendix.


Table 7.17: Considered cases in the basis function mismatch analysis of inverse LQ feedback dynamic games

Table 7.18: NSAE in case of basis function mismatch in inverse LQ dynamic games


#### 7.5.5 Discussion of Inverse LQ Dynamic Game Results

The inverse LQ differential game was solved by means of an IOC based method, an IRL based method and the DB approach. All methods were shown to lead to good identification results both in terms of trajectory approximation and parameter estimation. The IOC method presented the highest parameter estimation precision in the case of noiseless trajectories.

The analysis with noise-corrupted trajectories demonstrated that the IRL-based method offers the best results across all SNR values. Only for the mean NSAE of the states is the DB method slightly better than IRL. The results indicate that the DB and IRL methods are more robust towards measurement noise than IOC. As for the parameter error, the mean parameter error confirms that the IRL method performed best for all SNR values. The higher robustness of the DB approach in low SNR regions compared to IOC can also be noticed. However, an interesting result of IOC is the lower variability in the maximum

Figure 7.17: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $\mathbf{Q}_{i,(4,4)} = 0$ for $i \in \{1, 2\}$ (case I).

parameter error. This suggests that even though the DB approach and IRL performed better in the mean, they are not guaranteed to always lead to better results.

Regarding the robustness to a basis function mismatch, the results of Table 7.18 show that the methods are fairly robust to a mismatch caused by the neglect of features. However, not including any basis function which penalizes the states (as in case IV) leads to major deviations of both states and controls with respect to the original trajectories. The original parameters describe a behavior which aims at regulating all states to zero; this has to be considered in the choice of the basis functions. Similarly to the analysis of the effects of measurement noise on the results, it can be discerned that the IRL and DB methods are slightly more robust than the IOC method in case of a basis function mismatch. Finally, it can also be noted that the control trajectory approximation is corrupted more than the state approximation, especially for the IOC and IRL methods. In general, the approximations of the controls are affected more, independent of whether the perturbation lies in the basis functions or the trajectories.

Figure 7.18: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $\mathbf{Q}_{i,(4,4)} = 0$ for $i \in \{1, 2\}$ (case I).

Analogously to the open-loop case, the results of this section can be explained by the different concepts behind each of the methods. IOC depends on the fact that the trajectories correspond to a feedback Nash equilibrium. The IRL method is based on the maximization of a likelihood function which indirectly includes the requirement of matching the costs of the observed trajectories and is therefore more robust towards mild violations of the Nash equilibrium assumption caused by measurement noise or by basis function errors. Finally, the objective function of the DB approach, which explicitly considers the deviation between trajectories, is responsible for its good results.

# 7.6 Computation Time

Before concluding on the observed results, the computation time of all approaches is briefly examined, as computational efficiency is an important issue for the application of these methods in an online estimation of cost function parameters. The computational effort is shown exemplarily for the case of noisy trajectories with SNR = 25 dB to give an impression of the computational demands of each method. Table 7.19 presents the computation times of the different methods for an identification in an open-loop and a feedback scenario.<sup>53</sup> The DB method yields the highest computation time, followed by the IRL and IOC methods. The DB method's computation time in the open-loop nonlinear case is approximately 26% higher than in the linear-quadratic feedback dynamic game. This can be explained by the fact that the former demands the repeated solution of a nonlinear dynamic game, which is generally harder to solve than a linear-quadratic one. The IOC method is the fastest since it relies on the solution of a conventional RDE or a quadratic program, both of which can usually be solved efficiently with numerical techniques. Finally, the IRL method stands in between: its conceptually abstract likelihood function and convergence properties are hard to analyze, but the fact that the method consists of a single static optimization problem makes it likely to be faster than the DB method.


Table 7.19: Computation times for inverse dynamic games

# 7.7 Conclusion

In this chapter, a systematic comparison between IOC, IRL and DB methods for solving inverse dynamic games was conducted. Both open-loop and feedback structures were considered. Moreover, the robustness of the approaches with respect to the presence of noise in the observed trajectories was examined. In addition to the quality of cost function parameter identification, the capability of the identified cost functions to describe observed data was also assessed.

In the open-loop case, the IOC method was shown to lead to the most accurate parameter estimates if the observed trajectories correspond to a Nash equilibrium. Nevertheless, if the observations are noise-corrupted, the IOC method's results deteriorate. The state trajectory approximation is still adequate, but the control trajectories deviate considerably from the ground truth. The IRL and DB methods showed a higher robustness to measurement noise and yielded similar results; only for the lowest considered SNR value did the DB method lead to slightly better approximations. In addition, all methods show a certain robustness to missing relevant basis functions as long as the remaining ones are meaningful and related to the control task at hand. If a non-adequate basis function is provided, only the IRL and DB methods are able to neglect it by setting its corresponding parameter to a value near zero.

<sup>53</sup> The used CPU was an Intel Xeon E5-2630 at 2.6 GHz with 32 GB of RAM.

As for the feedback case, a similar trend as in the open-loop case could be observed. Nevertheless, both the parameter error and the NSAE of the IOC and IRL methods are smaller in magnitude than in the open-loop case. One possible reason is that the linear system dynamics allow for a better identification, especially for the IRL method, which relies on a linearization of the dynamics (which is nevertheless time-variant, i.e. it is computed in every time step). However, this may be best explained by the LS estimation of the feedback matrices $K_i$ from the control and state trajectories. This estimation is, theoretically speaking, not bias-free: the noise has zero mean, but is applied to both the control and the state values. In spite of this fact, the estimation works well in practice, such that a relatively accurate functional relationship between the states and the controls is provided to the IOC and IRL methods. This is also reflected by the good results obtained by all methods in the analysis of the basis function mismatch.

To finish this chapter, the main findings are summarized as follows:


After this analysis of inverse dynamic game methods in a simulation environment, the following chapter presents a first application of inverse dynamic game methods with real experimental data.

# 8 Application to Shared Control Systems

This chapter presents an application example for inverse dynamic games. The aim of this chapter is to provide a first evaluation of the applicability of inverse dynamic games to identify cost functions in a real scenario. In the following, a shared control scenario between two humans is considered. Shared control stems from the field of human-machine cooperation. It usually describes a situation where humans and machines simultaneously control a dynamic system.<sup>54</sup> Therefore, it has led to a rising number of applications including robot-assisted rehabilitation in medicine as well as all kinds of technical assistance systems for vehicle control or for various types of technical devices such as construction machines, wheelchairs, etc. For the evaluations in this chapter, an experiment in which several pairs of subjects simultaneously control a steering system is employed. This scenario is modeled by means of a differential game such that cost functions describing the interaction of human pairs can be identified from measured data. The two method classes for inverse differential games presented in this thesis, IOC and IRL, shall be evaluated by means of this experiment. Furthermore, similar to Chapter 7, the results shall be compared to the results of applying the DB approach for identification.

# 8.1 Experimental Setup

The experimental setup can be seen as a simplified scenario of the lateral control of a vehicle. This section presents all details concerning the hardware setup and the implementation of the haptic feedback. In the following, this setup will be referred to as the cooperative steering system.<sup>55</sup>

The cooperative steering system consists of four main components: two active steering wheels, two monitors with visualization windows and a real-time processing unit of dSPACE. The steering wheels are equipped with an incremental encoder with 40000 increments per full rotation for measuring the steering angles at a sampling frequency of $f_\mathrm{s} = 100$ Hz. Furthermore, they are active due to integrated motors which can apply a torque on each of them. The

<sup>54</sup> The reader is referred to [ACM+18] for a formal definition of Shared Control and its multiple applications.

<sup>55</sup> The experiment described in this chapter has been also presented in the conference paper [IFH19], where the differential game model was shown to better explain cooperative steering behavior than an alternative state-ofthe-art model (presented in [IEFH18]).

maximum torque of the motors is 15.6 Nm. One component of the motor torque is calculated such that the steering wheel has the dynamics of a spring-damper system. Therefore, the dynamics of steering wheel $j \in \{1, 2\}$ are described by means of the equation

$$
\Theta_{\mathrm{SW},j} \, \ddot{\varphi}_j(t) = M_j(t) - d_j \dot{\varphi}_j(t) - c_j \varphi_j(t), \tag{8.1}
$$

with the spring constant $c_j$, damping constant $d_j$ and moment of inertia $\Theta_{\mathrm{SW},j}$, and where $\varphi_j(t)$ and $M_j(t)$ denote the steering wheel angle and the human input torque, respectively. The parameters of the steering wheels are given in Section F.1.1 of the Appendix.

In the experiment, the two steering wheels are haptically coupled. This virtual coupling is implemented in a real-time environment with the dSPACE processor unit, which is also used to establish the communication between all components. The haptic coupling is effectuated by calculating the torque $M_\mathrm{C}(t)$ required to reduce the angular difference between the two steering wheels to zero. This is achieved by emulating a virtual spring-damper element between both steering wheels with an automatic controller. Therefore, with the haptic coupling, a further torque influences the dynamics of each steering wheel, leading to the dynamics equation

$$
\Theta_{\mathrm{SW},j} \, \ddot{\varphi}_j(t) = M_j(t) - d_j \dot{\varphi}_j(t) - c_j \varphi_j(t) + M_\mathrm{C}(t). \tag{8.2}
$$

The implementation of the controller was done in MATLAB/Simulink 2010b. Further details on this controller can be found in Section F.1.2 of the Appendix.
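To make the dynamics (8.1) and (8.2) concrete, the following sketch simulates both coupled steering wheels under explicit Euler integration. This is only an illustration: the parameter values (`THETA`, `D`, `C`, `D_C`, `C_C`) are placeholders for the actual values given in Sections F.1.1 and F.1.2 of the Appendix, and the simple spring-damper coupling law stands in for the real-time controller implementation.

```python
import numpy as np

# Placeholder parameters (the actual values are given in Appendix F.1.1)
THETA = (0.04, 0.04)   # moments of inertia Theta_SW,j in kg m^2
D = (0.2, 0.2)         # damping constants d_j in Nm s/rad
C = (0.6, 0.6)         # spring constants c_j in Nm/rad
D_C, C_C = 0.5, 2.0    # hypothetical virtual coupling gains

def simulate(torques, dt=0.01):
    """Explicit-Euler simulation of the two coupled steering wheels (8.2).

    torques: array of shape (N, 2) with the human input torques M_j(t).
    Returns angles phi and angular velocities phidot, each of shape (N, 2).
    """
    n = len(torques)
    phi = np.zeros((n, 2))
    phidot = np.zeros((n, 2))
    for k in range(n - 1):
        # coupling torque of the virtual spring-damper between the wheels
        m_c = C_C * (phi[k, 1] - phi[k, 0]) + D_C * (phidot[k, 1] - phidot[k, 0])
        for j, sgn in ((0, 1.0), (1, -1.0)):
            acc = (torques[k, j] - D[j] * phidot[k, j]
                   - C[j] * phi[k, j] + sgn * m_c) / THETA[j]
            phidot[k + 1, j] = phidot[k, j] + dt * acc
            phi[k + 1, j] = phi[k, j] + dt * phidot[k, j]
    return phi, phidot
```

With equal parameters and equal input torques, both wheels move identically and the coupling torque remains zero, which is consistent with the ideal-coupling assumption made later in Section 8.2.2.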

A computer interacts with the real-time system and generates two separate visualization windows on two monitors in order to give visual feedback of the current steering wheel position to each participant. This visualization was implemented by means of OpenGL and includes a marker (green square) which moves horizontally in the window according to the value of the steering angle. The steering wheel value range which is mapped onto the screen is [−180°; 180°], where a positive angle corresponds to a counterclockwise rotation. A further element in the visualization window is the reference trajectory. The points which constitute the trajectory pass downwards through the window at a constant speed. A single point crosses the entire visualization window in 2 seconds. The vertical position of the marker is fixed at 75% of the window height. Figure 8.1 depicts all components of the experimental setup as well as an example of the visualization window and the black curtain (thick black line) which served to separate each subject's area.

# 8.2 Modeling

The experiment consists of a shared control task, in which pairs of participants control the cooperative steering system simultaneously. The aim of the subjects is to follow the reference trajectory shown on the monitor by means of their corresponding steering wheel. This

Figure 8.1: Hardware setup for the experiment

scenario is modeled by means of a differential game such that the observed data can be used to identify cost functions of each subject which explain their cooperative behavior. In the following, the differential game is formalized mathematically. Afterwards, the system dynamic equations and cost function structure are stated more precisely for the scenario at hand.

#### 8.2.1 Shared Control Modeling via Differential Games

Consider two human players controlling a dynamic system

$$
\dot{\mathbf{x}}(t) = \mathbf{A}\mathbf{x}(t) + \mathbf{B}\_1\mathbf{u}\_1(t) + \mathbf{B}\_2\mathbf{u}\_2(t) \tag{8.3}
$$

with $\mathbf{x}(0) = \mathbf{x}_0$, where $\mathbf{x}(t) \in \mathbb{R}^n$ represents the system states and $\mathbf{u}_i(t) \in \mathbb{R}^{m_i}$ denotes the control trajectory of player $i$. In addition, suppose a reference signal is given which is the output of the known linear reference model

$$
\dot{\mathbf{z}}(t) = \mathbf{H}\mathbf{z}(t). \tag{8.4}
$$

Given that the framework of feedback control is the most suitable for modeling human motor control [TJ02, Tod04], it is assumed that the human players select a feedback strategy $\gamma_i \in \Gamma_i^{\mathrm{FB}}$ according to Definition 3.6. Furthermore, the cost function structure

$$J_i = \int_0^\infty \mathbf{e}(t)^\top \mathbf{Q}_i \mathbf{e}(t) + \mathbf{u}_i(t)^\top \mathbf{R}_{ii} \mathbf{u}_i(t) \, \mathrm{d}t, \quad i \in \{1, 2\} \tag{8.5}$$

is assumed for each player, where $\mathbf{e}(t) = \mathbf{x}(t) - \mathbf{z}(t)$. In this way, the cost function models the objective of both humans to track a given reference, i.e. to minimize the error between the state and reference trajectories.

While the cost function (8.5) is quadratic, it is not a standard quadratic cost function since the matrix $\mathbf{Q}_i$ does not penalize the state variable $\mathbf{x}(t)$, but the state-reference deviation $\mathbf{e}(t)$. Therefore, the methods for inverse linear-quadratic dynamic games cannot be applied directly. Nevertheless, it is possible to introduce a new system state comprising both the states and the reference variables such that (8.5) is transformed into a standard quadratic cost function. This leads to extended system dynamics where the linearity property is maintained. In this way, we obtain a linear-quadratic differential game according to Definition 3.11. The details of these reformulations are presented in Section B.7 of the Appendix.
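The state extension can be sketched as follows for the case where the reference model has the same dimension as the state; this is the standard construction and only an illustration of the details given in Appendix B.7.

```python
import numpy as np

def extend_lq(A, B_list, H, Q_list):
    """Rewrite the tracking cost on e = x - z as a standard quadratic cost
    in the extended state xi = [x; z] (illustrating Appendix B.7).

    Assumes the reference z has the same dimension as the state x.
    """
    n, p = A.shape[0], H.shape[0]
    # extended dynamics: xi_dot = A_ext @ xi + sum_i B_ext_i @ u_i
    A_ext = np.block([[A, np.zeros((n, p))],
                      [np.zeros((p, n)), H]])
    B_ext = [np.vstack([B, np.zeros((p, B.shape[1]))]) for B in B_list]
    # e^T Q e = xi^T [[Q, -Q], [-Q, Q]] xi
    Q_ext = [np.block([[Q, -Q], [-Q, Q]]) for Q in Q_list]
    return A_ext, B_ext, Q_ext
```

Since the extended dynamics remain linear and the cost remains quadratic, the standard solution machinery for LQ differential games applies unchanged.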

#### 8.2.2 Cooperative Steering System Dynamics

To simplify the model of the cooperative steering system, an ideal coupling of the two steering wheels is assumed. This means that both steering wheels have the same angle $\varphi$ and angular velocity $\dot{\varphi}$. With this assumption, the dynamics of the system of coupled steering wheels are given by

$$\dot{\mathbf{x}}(t) = \begin{bmatrix} -\frac{d\_c}{\Theta\_{\text{sum}}} & -\frac{c\_c}{\Theta\_{\text{sum}}} \\ 1 & 0 \end{bmatrix} \mathbf{x}(t) + \begin{bmatrix} \frac{1}{\Theta\_{\text{sum}}} \\ 0 \end{bmatrix} u\_1(t) + \begin{bmatrix} \frac{1}{\Theta\_{\text{sum}}} \\ 0 \end{bmatrix} u\_2(t) \tag{8.6}$$

where $\mathbf{x}(t) = \begin{bmatrix} \dot{\varphi}(t) & \varphi(t) \end{bmatrix}^\top$ and $u_i(t) = M_i(t)$ is the steering torque of human $i$. The variable $\Theta_\mathrm{sum}$ denotes the sum of the moments of inertia of both steering wheels. All system parameters are given in Table 8.1.


Table 8.1: Cooperative steering system model parameters
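As a minimal sketch, the matrices of (8.6) can be assembled and checked for stability as follows; the numerical values are placeholders for the actual parameters of Table 8.1.

```python
import numpy as np

# Placeholder parameters (the actual values are given in Table 8.1)
theta_sum = 0.08   # sum of both moments of inertia in kg m^2
d_c = 0.4          # damping constant in Nm s/rad
c_c = 1.2          # spring constant in Nm/rad

# System matrices of (8.6) with state x = [phidot, phi]^T
A = np.array([[-d_c / theta_sum, -c_c / theta_sum],
              [1.0, 0.0]])
B1 = B2 = np.array([[1.0 / theta_sum], [0.0]])

# The uncontrolled spring-damper system is asymptotically stable
print(np.all(np.linalg.eigvals(A).real < 0))  # True
```

Both players act through identical input matrices, reflecting that either torque accelerates the same coupled wheel.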

#### 8.2.3 Cost Functions

The cost function structure is given by (8.5). Furthermore, diagonal matrices $\mathbf{Q}_i = \mathrm{diag}(q_i^{(1)}, q_i^{(2)})$ are assumed such that off-diagonal parameters are neglected. This is a common procedure in optimal control theory since off-diagonal matrix elements represent mixed terms in the cost function which are usually not interpretable [BH75]. The state reference is given by $\mathbf{z}(t) = \begin{bmatrix} \dot{\varphi}_\mathrm{ref}(t) & \varphi_\mathrm{ref}(t) \end{bmatrix}^\top$, representing the reference values for the steering angle velocity

and the steering angle, which is visible on the monitor. It is assumed that the participants do not aim to follow a particular reference trajectory of the steering velocity, since none was specified, neither visually nor verbally. In contrast, the reference trajectory of the steering angle $\varphi_\mathrm{ref}(t)$ corresponds to the one visible on the monitor and is the same for both participants.

# 8.3 Data Acquisition and Preparation

In order to apply inverse dynamic game methods, a set of state and control trajectories is needed. As mentioned in Section 8.1, a sensor for measuring the angle $\varphi_j(t)$ of each steering wheel is available. The steering angle velocity $\dot{\varphi}_j(t)$ and the acceleration $\ddot{\varphi}_j(t)$ are determined offline by numerical differentiation and a subsequent smoothing via cubic spline interpolation (MATLAB function csaps with parameter $p = 0.99995$). The steering torque of each human $u_i(t) = M_i(t)$ is then calculated by means of (8.2), i.e. the dynamics equation of each steering wheel. Due to the ideal coupling of the steering wheels, the steering wheel angle $\varphi(t)$ and angular velocity $\dot{\varphi}(t)$ of the cooperative steering system are set equal to the mean value of both steering wheel angles and velocities, respectively.
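The described processing chain can be sketched in Python as follows, with scipy's `UnivariateSpline` as a stand-in for MATLAB's `csaps` (their smoothing parameters are not equivalent) and the coupling torque of (8.2) omitted for brevity, so the torque is reconstructed from the single-wheel dynamics (8.1); all parameter values are hypothetical.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def prepare_data(t, phi, theta, d, c):
    """Recover velocity, acceleration and input torque from a measured
    steering angle phi(t).

    A smoothing cubic spline stands in for MATLAB's csaps; the coupling
    torque M_C of (8.2) is omitted for brevity, so the torque follows the
    single-wheel dynamics (8.1).
    """
    spl = UnivariateSpline(t, phi, k=3, s=len(t) * 1e-6)
    phidot = spl.derivative(1)(t)
    phiddot = spl.derivative(2)(t)
    # invert the spring-damper dynamics for the human input torque
    torque = theta * phiddot + d * phidot + c * spl(t)
    return phidot, phiddot, torque
```

Differentiating the fitted spline rather than the raw samples avoids the noise amplification of finite differences, mirroring the smoothing step used in the actual data preparation.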

# 8.4 Experimental Protocol

Fifty-two subjects (age 25 ± 2.27) participated in the experiment in pairs. They had no possibility to make eye contact and were told to refrain from speaking during the experiment. They were aware that they were completing the task with a partner. Each pair of subjects was told to track the reference trajectory as well as they could.

Each pair of subjects completed an approximately two minutes long run which consisted of


The first part P1 included splines and step functions as visible reference trajectories for the steering angle, whereas P2 consisted of step functions only. Step functions were used for evaluation since they represent goal-oriented or point-to-point movements, also known as reaching movements. Such movements are often considered in studies on human motor behavior, both from a neuroscience and biology perspective [FH91, Kal09, KM11] as well as from a control-theoretical perspective [ARARU<sup>+</sup>11, CS17]. The reference trajectory of P2 describes 4 point-to-point movements defined by the fixed positions (120°, 0°, −120°, 0°, 120°). Finally, P3, similarly to P1, included step functions as well as splines. The subjects were unaware of this subdivision and all related details.

# 8.5 Evaluation Procedure

As described in Section 8.2.1, the shared control scenario is modeled as a linear-quadratic differential game with feedback strategies. Therefore, the methods for inverse feedback dynamic games (the same as in Section 7.5) are applied for cost function identification. In the following, they are also referred to as the IOC, IRL and DB methods. All methods were given the same system dynamics and cost function structure. The data obtained from the middle part of the test run (P2) was used for estimating the cost function parameters of both participants with each of the aforementioned methods.

Contrary to the simulations presented in Chapter 7, no ground truth cost function parameters $\theta^* = (\theta_1^*, \theta_2^*)$ are available in a real application. Therefore, the only way to evaluate the identification results is by using the estimated cost functions to generate estimated trajectories $\hat{x}(t)$, $\hat{u}_1(t)$ and $\hat{u}_2(t)$ and compare them with the measured trajectories $\tilde{x}(t)$, $\tilde{u}_1(t)$ and $\tilde{u}_2(t)$. This comparison is done by means of the NSAE for states and controls introduced in Section 7.3.2. The 52 participants formed 26 pairs of subjects and therefore 26 data sets were available for analysis. Each of these 26 sets of trajectories leads to an estimation of the cost function parameters, yielding the parameters $\hat{\theta}^{(s)}$, $s \in \{1, \dots, 26\}$, for each of the methods IOC, IRL and DB. Afterwards, each set of identified parameter vectors $\hat{\theta}^{(s)}_{\mathrm{IOC}}$, $\hat{\theta}^{(s)}_{\mathrm{IRL}}$ and $\hat{\theta}^{(s)}_{\mathrm{DB}}$ is used to solve for the Nash equilibrium trajectories $\hat{x}^{(s)}(t)$, $\hat{u}_1^{(s)}(t)$ and $\hat{u}_2^{(s)}(t)$, $s \in \{1, \dots, 26\}$. This is done by applying the reformulations of Section B.7 to obtain a standard LQ differential game and using Theorem 3.7 afterwards. The Nash trajectories are compared to the observed trajectories $\tilde{x}(t)$, $\tilde{u}_1(t)$ and $\tilde{u}_2(t)$ by computing the corresponding NSAE as described in Section 7.3.2. Figure 8.2 summarizes the evaluation procedure applied in this chapter.
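As an illustration of the trajectory comparison, the following sketch implements a normalized sum of absolute errors. The exact NSAE definition is the one of Section 7.3.2; the normalization chosen here (summed absolute error divided by the summed magnitude of the observation) is an assumption for illustration only.

```python
import numpy as np

def nsae(observed, estimated):
    """Normalized sum of absolute errors between sampled trajectories.

    NOTE: the normalization (summed absolute error divided by the summed
    magnitude of the observation) is an assumption for illustration; the
    definition actually used is the one of Section 7.3.2.
    """
    observed = np.asarray(observed)
    err = np.sum(np.abs(observed - np.asarray(estimated)))
    return err / np.sum(np.abs(observed))
```

Such a relative measure makes errors comparable across data sets with different trajectory magnitudes.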

# 8.6 Results

The NSAE of states and controls was calculated for all data sets and all corresponding identification results. All values are given in Section F.2 of the Appendix. Due to the small data set, the median values $\delta^x_\mathrm{median}$ of the errors are considered instead of the mean values. The median values and the standard deviations $\delta^x_\mathrm{SD}$ of the errors for all used inverse dynamic game methods are given in Table 8.2. The statistical results are summarized and depicted in Figure 8.3.

Figure 8.2: Evaluation procedure for the identification in a real shared control scenario

Table 8.2: Mean value and standard deviation of NSAE obtained from identification with IOC, IRL and DB methods


The first noticeable characteristic of the results is the considerably higher magnitude of the errors compared to those seen in Chapter 7. In general, it can be discerned that the DB approach led to smaller mean values and variances of the errors than the IRL and IOC based approaches. The IRL method performed better than the IOC method in terms of the state trajectory approximation; nevertheless, the mean values of the NSAE of the controls are very similar. The range and standard deviation of the errors shown in Figure 8.3 are smaller for the DB method compared to the IOC and IRL based approaches. In order to test the statistical significance of these differences, a Wilcoxon signed rank test<sup>56</sup> was conducted on the data sets of $\delta^x$ and $\delta^u$. The test results confirmed that all differences are statistically significant with a significance level of $\alpha = 0.01$, with the exception of the control errors of the IOC and IRL methods, whose difference the signed rank test showed to be not statistically significant. Detailed results with p-values are provided in Section F.2.1 of the Appendix.

<sup>56</sup> A Wilcoxon signed rank test (see e.g. [SC88]) is a statistical test which, contrary to more widespread statistical tests such as Student's t-test, does not assume that the data follows a normal distribution. This assumption was avoided here due to the relatively small data population.
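Such a paired test can be reproduced with `scipy.stats.wilcoxon`; the error values below are randomly generated stand-ins for the experimental NSAE data, with one method constructed to be consistently worse.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical paired NSAE values of two methods on the same 26 data sets;
# the second method is constructed to be consistently worse.
err_a = rng.uniform(0.1, 0.5, size=26)
err_b = err_a + rng.uniform(0.05, 0.3, size=26)

# Paired non-parametric test: no normality assumption on the differences
stat, p = wilcoxon(err_a, err_b)
print(p < 0.01)  # True: significant at alpha = 0.01 for this synthetic data
```

Because the test operates on the signed ranks of the paired differences, it remains valid for the small, non-normal samples at hand.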

Figure 8.3: Statistical results of the cost function identification in the experiment

In order to further illustrate the identification results, the measured data and the estimated trajectories $\hat{x}^{(s)}(t)$ and $\hat{u}_i^{(s)}(t)$ for some representative subject pairs $s \in \{1, \dots, 26\}$ are shown in the following. Figure 8.4 shows the data and identification results of subject pair 1. This data set yielded the smallest error for all methods. It can be recognized that the states are approximated best by the DB approach, followed by the IRL method. The control trajectories cannot be exactly described by the dynamic game with the estimated parameters $\hat{\theta}$ of any method: only the qualitative course can be reproduced, and several changes in the torque cannot be accounted for.

The identification result in Figure 8.5 corresponds to subject pair 2. The DB and IRL methods yield the best results regarding the state trajectory approximation; nevertheless, the error is higher than in the results shown in Figure 8.4. In the case of the control trajectories, it is noticeable that the IRL approach fails to identify the control actions of the first subject, but estimates a higher control effort for the second subject, which leads to the same state trajectories as the DB approach. The estimation of a control trajectory as (nearly) a constant is an effect which was observed for some data sets, not only for the IRL method, but also for the IOC and DB methods. This effect can be seen e.g. in the results of subject pair 22 depicted in Figure 8.6. The DB approach is able to describe the control trajectories better, but the IOC and IRL methods approximate the state trajectories slightly better than the DB method for this data set.

# 8.7 Computation Time

Analogously to Chapter 7, the computation time required for the solution of inverse dynamic games is analyzed.<sup>57</sup> The mean of the computation times was calculated for each of the

<sup>57</sup> The used CPU was an Intel Core i7-6600U at 2.6 GHz with 12 GB of RAM.

Figure 8.4: Identification results of subject pair 1

method classes considered. The values are listed in Table 8.3. It can be observed that the results of Section 7.6 are replicated: the DB approach needs the most computation time, followed by the IRL and IOC methods. The IOC and IRL approaches need only 0.01% and 1.57% of the DB method's computation time, respectively.



Figure 8.5: Identification results of subject pair 2

# 8.8 Discussion

This section is devoted to a discussion of the results of the previous sections. The results are analyzed and the limitations of the methods and the experiment are reviewed.

Overall, it can be stated that the inverse feedback dynamic game method based on the DB approach performs better than its IRL and IOC based counterparts in terms of trajectory approximation. This is shown by the mean values of the errors $\delta^x_\mathrm{DB,mean} < \delta^x_\mathrm{IRL,mean} < \delta^x_\mathrm{IOC,mean}$ of both states and controls in Table 8.2. Furthermore, the standard deviations $\delta^x_\mathrm{SD}$ and $\delta^u_\mathrm{SD}$ are the smallest for the DB approach, indicating that this method led to more consistent results.

The better results of the DB approach can be similarly explained as in the simulation results of Chapter 7. The underlying optimization problem in the DB method directly minimizes the error between observed and estimated trajectories. In turn, the IRL method does this indirectly by means of an implicit requirement included in the likelihood function. In a very different approach, the IOC method aims to minimize the violation of Nash equilibrium conditions and does not consider the error between trajectories in the process.

In general terms, the methods appear to be able to describe the state trajectories better than the control trajectories. However, there were several data sets for which the state trajectories could not be explained adequately by the cost functions with identified parameters, regardless of the selected inverse dynamic game method. The question arises as to the possible reasons for this effect.

One potential source of error is an inexact modeling of the cooperative steering system. In particular, the assumption of an ideal coupling of the steering wheels may have been too strong for the used system, such that the description by means of (8.6) is not accurate enough. It is conceivable that this inaccuracy is higher the more dynamic the interaction is, i.e. when the partners act very differently and change the direction of the torque very often. Besides this fact, the subject pairs were observed to have partially disobeyed the instructions of the experiment. For example, in Figure 8.6, the time span between 1 s and 2 s shows that player 1 applied a torque contrary to the one which is needed to bring the steering angle towards the reference value. This behavior had to be compensated for by player 2. Such behavior contradicts the rationality implied by a model based on differential games and thus cannot be accounted for.

Overall, the results suggest that the players may not act exactly optimally and thus the interaction may sometimes not be exactly represented by a Nash equilibrium. If the trajectories do not represent a Nash equilibrium, potentially worse results of the IOC and IRL methods are obtained, given that they rely on the estimation of a Nash equilibrium control law from these trajectories. For example, the IOC method first calculates an estimate $\hat{K}_i$ of the linear control law which best describes the relation between measured controls and states; afterwards, cost function parameters are determined which correspond to the identified control matrix. However, these control matrices $\hat{K}_i$, which are optimal in a least-squares sense (cf. Section 5.4.2), do not necessarily correspond to a Nash equilibrium. Consequently, the cost functions with parameters $\hat{\theta}_i$ describe a Nash equilibrium which is the "closest" to $\hat{K}_i$ in the sense that the violation of the Riccati equations is minimal. To illustrate this, consider the value of the residual $\|\hat{M}_i \hat{\theta}_i\|$, where $\hat{M}_i$ is calculated by means of the $\hat{K} = (\hat{K}_1, \dots, \hat{K}_N)$ identified via the LS method (see (5.36)). This residual describes the extent to which the identified parameters $\hat{\theta}_i$, together with $\hat{K}_i$, violate the necessary and sufficient conditions for Nash equilibria. Therefore, it can be seen as a measure of the "non-Nash" character of the estimated $\hat{K}$.<sup>58</sup> Figure 8.7 shows that some of the identified $\hat{K}$ are approximately a Nash equilibrium, while others present less Nash character. In particular, the good results of Figure 8.4 can be associated with a low value of the residual. Nevertheless, it could be observed that the residual value does not allow foreseeing the quality of the trajectory approximation results.

<sup>58</sup> Note that $\|\hat{M}_i \hat{\theta}_i\| \neq 0$ is possible while $\|\hat{M}'_i \hat{\theta}_i\| \approx 0$, where $\hat{M}'_i$ is calculated with the $\hat{K}'$ which arises from the solution of the differential game corresponding to the identified parameters $\hat{\theta}$. The latter leads to a Nash equilibrium according to the necessary and sufficient conditions used for determining the trajectories $\hat{u}_i(t)$ and $\hat{x}(t)$.

Figure 8.7: Residual values of the identified control law and parameters for all subject pairs. Here, the outlier $\|\hat{M}_2 \hat{\theta}_2\| = 44.84$ for subject pair 22 is not depicted in favor of better visibility of the other values.

Another problem arises if the estimated $\hat{K}_i$ yields high values of the least-squares estimation objective $\|u_i + K_i x\|$ (cf. (5.36)), i.e. if the linear feedback is unable to reproduce the relationship between $u_i(t)$ and $x(t)$. A consequence would be a deterioration of the approximation capabilities of the inverse dynamic game methods based on IOC and IRL, since they rely on this feedback law estimation to include the influence of the other player's controls on the system dynamics.
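The least-squares feedback estimation referred to here can be sketched as follows, assuming sampled trajectories stacked row-wise; this mirrors the objective $\|u_i + K_i x\|$ of (5.36) but is only an illustrative implementation.

```python
import numpy as np

def estimate_feedback(X, U):
    """Least-squares estimate of K_i in u_i(t) ≈ -K_i x(t).

    X: (N, n) array of sampled states, U: (N, m_i) array of sampled
    controls of player i. Minimizes ||U + X @ K_i^T|| over K_i.
    """
    K_T, *_ = np.linalg.lstsq(X, -U, rcond=None)
    return K_T.T

def feedback_residual(X, U, K):
    """Value of the fit objective ||u_i + K_i x|| over all samples."""
    return np.linalg.norm(U + X @ K.T)
```

If this residual is large, the linear feedback law cannot reproduce the measured relationship between states and controls, which is exactly the failure mode discussed above.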

Finally, the mean computation times presented in Table 8.3 show that the IOC method would be the most appropriate in terms of a potential online application in which cost function parameters are constantly updated as new data points become available. The IRL approach may also serve such a purpose with a more efficient implementation. On the other hand, the computation time of the DB approach confirms that it is not suitable for an online application. Cost function parameters may change over time due to different effects, e.g. fatigue or even sudden events. Such alterations cannot be quickly detected by the DB method, but they can be by the alternative methods developed in this thesis.

# 8.9 Concluding Remarks

In this chapter, an application example for inverse dynamic games was presented. A cooperative steering experiment was conducted in which pairs of subjects interact haptically to cooperatively complete a control task. The results indicate that it is possible to describe cooperative system behavior by means of dynamic games, and that inverse dynamic game methods can be used to identify cost functions which explain the observed behavior.

The results showed the following insights:


The system used for the experiment and its dynamic model proved too inaccurate to draw reliable conclusions concerning the cooperative behavior of humans in haptic interaction. Nevertheless, the results of this experiment suggest that the assumption of a Nash equilibrium in haptic interaction may be reasonable in certain situations. In order to settle these questions, which are also of interest for other scientific communities, more studies and experiments have to be conducted. The methods presented in this thesis showed their potential for application to these purposes.

# 9 Conclusion

As technical systems become more intelligent, they are also required to be able to interact with other technical systems and humans. The theory of dynamic games provides a useful mathematical framework for describing the interaction between several players with possibly conflicting interests. A large body of work exists concerning the calculation of the outcome of the dynamic game from known objectives of all players. In contrast, the inverse problem of dynamic games, which consists in finding the cost functions each player minimized which led to the observed behavior, has received limited attention. This thesis contributes to this line of research by developing two different classes of methods for the solution of N-player inverse dynamic games with both open-loop and feedback structures, assuming that the interaction between players led to an open-loop or a feedback Nash equilibrium. Following the line of a large number of studies on the identification of cost functions in the single-player case, the structure of the cost functions is fixed by assuming a linear combination of basis functions, such that the problem is reduced to finding cost function parameters for each player. In addition, the results give substantial insight into the properties of inverse optimal control and inverse dynamic game problems.

The first method class proposed in this thesis is a residual-based IOC method which exploits necessary and sufficient conditions for Nash equilibria based on control-theoretical techniques. In the open-loop case, reformulations of these conditions allow posing the problem of identifying cost function parameters as an unconstrained quadratic program. Furthermore, sufficient conditions are given to test for the uniqueness of the cost function parameters up to a multiplying constant. For a feedback information structure, the same techniques can be used, but knowledge of the feedback law becomes necessary. Identifying the feedback law is feasible for the main class of dynamic games given by linear-quadratic (LQ) dynamic games with an infinite horizon. Therefore, the inverse problem of dynamic games was thoroughly analyzed for this particular class of games. By exploiting the necessary and sufficient conditions for Nash equilibria given by algebraic Riccati equations, explicit solution sets describing all possible cost function parameters which correspond to the same Nash equilibrium were established. Furthermore, a quadratic program was formulated to efficiently find a solution of the inverse dynamic game. An analysis of the properties of this quadratic program yields necessary and sufficient conditions for the uniqueness of inverse LQ dynamic game solutions.

The second proposed method class is an IRL approach, where a probability density function is stated as a likelihood function which depends on the cost function parameters of each player. The likelihood function, derived by means of the principle of maximum entropy, implicitly includes the requirement that the expected costs of the trajectories sampled from a density function with the estimated parameters correspond to the costs of the observed trajectories. The cost function parameters are determined via a maximum likelihood estimation. For this approach, it was proved that maximizing the likelihood function yields equal expected costs of trajectories generated by the probability density function with the ground truth parameters and by the one with the estimated parameters.

Having proposed two major classes of inverse dynamic game methods for each of the two information structures considered, i.e. open-loop and feedback, a systematic evaluation was conducted in which each method was tested using Nash equilibrium trajectories of a test system. Until now, such a study was missing in the literature, even for the single-player case. For inverse dynamic games with open-loop strategies, a two-player game with a nonlinear ball-on-beam dynamic system was considered. The evaluation in the case of a feedback Nash equilibrium was done using a three-player linear-quadratic dynamic game. Both cases included a comparison of the performance of the IOC and IRL based methods as well as a direct bilevel (DB) approach analogous to the widespread state-of-the-art single-player inverse optimal control method of Mombaur et al. [MTL10]. The main findings confirm previous evidence that bilevel methods generally require a high computational effort, since they demand the solution of several dynamic games, i.e. determining Nash equilibria from the current candidate cost function parameters. The IOC method outperformed IRL and the DB method in the case of perfect measurements. However, it was shown that the DB and IRL methods behave similarly to each other and are more robust towards measurement noise than IOC methods, since the results of the latter deteriorate with higher measurement noise. Nevertheless, if the measurement noise is low, IOC methods can be preferable to the DB approach, as it was observed that the IOC method needs only between 0.005% and 0.01% of the DB method's computational time. In addition, the inverse dynamic game methods which exploit the estimation of the feedback Nash equilibrium control laws were shown to be more robust towards measurement noise. As for potential errors in the basis functions, the IRL method offers the ability to detect irrelevant basis functions with less computational effort than the DB method, whereas the IOC methods show a higher dependency on meaningful basis functions.

Finally, an application example of cooperative system identification was presented, where the aim was to identify cost functions which explain the cooperative behavior of humans completing a control task together while interacting haptically in the process. The results confirmed the trends observed in the simulations, showing that the DB method is the most robust method, followed by the IRL and IOC methods. Nevertheless, some data sets could not be described properly by any of the methods. The results indicate that an accurate dynamic system model is of utmost importance for the use of these methods. With a model which better describes the dynamic system through which both humans interact, it is conceivable that the developed methods based on IRL and IOC yield a good performance with a reasonable computational time (of seconds or even milliseconds), thus allowing for their use in real applications where an online estimation of cost function parameters is of interest.

To summarize, this thesis makes a contribution to the theory of inverse problems in optimal control and dynamic game theory. The results not only provide new methods for solving this class of problems, but also shed new light on their properties. In particular, the novel necessary and sufficient conditions for unique solutions of inverse dynamic games, as well as the unbiasedness of the estimation in an IRL setting, are also valid for the single-player case. The methods open new possibilities for applications regarding the description of multi-agent or cooperative system behavior, e.g. for the identification of human behavior during the interaction with a machine or of biological systems in general, leading to the possibility of employing a learning-by-demonstration approach in a multi-agent setting.

# A Infinite Dynamic Games in Discrete Time

This section gives an overview of the relevant definitions and theorems for discrete-time dynamic games which are considered in Chapter 6 of this thesis. The definitions and theorems are analogous to the ones in continuous time. Therefore, each of them has a corresponding counterpart which can be found in Chapter 3. The following selection is based on the books [BO99, HKZ12].

# A.1 Basic Definitions

A discrete-time dynamic game involves N players taking actions at several discrete time steps. Since their possible actions are infinite in number, typical description forms such as payoff matrices or game trees are not applicable (see e.g. [BO99, Chapter 3]). Instead, the evolution of the decision process is described by means of a dynamic system in discrete time, which is defined as follows.

Definition A.1 (Dynamic System in Discrete-Time State Space Representation) A dynamic system is defined by a difference equation and an initial condition given by

$$\mathbf{x}^{(k+1)} = f\_D^{(k)}\left(\mathbf{x}^{(k)}, \mathbf{u}\_1^{(k)}, \dots, \mathbf{u}\_N^{(k)}\right) \tag{A.1a}$$

$$\mathbf{x}^{(1)} = \mathbf{x}\_1 \tag{A.1b}$$

where $\mathbf{x}^{(k)} \in \mathbb{R}^{n}$ and $\mathbf{u}_i^{(k)} \in \mathbb{R}^{m_i}$ denote the system state vector and the control vector of player $i$ at time step $k \in \{1, 2, \dots, k_E\} =: \mathcal{K}$, respectively.

Each player $i \in \mathcal{P}$ acts upon the system in Definition A.1 by applying a sequence of inputs or controls $\mathbf{u}_i^{(k)}$, $\forall k \in \mathcal{K}$, which belongs to a (here infinite) action space $\mathcal{U}_i$. Analogously to the continuous-time case, each player decides on a particular strategy $\gamma_i^{(k)}$ from the strategy space $\Gamma_i$. The control decision is based on the information available to the player, which is represented by a set-valued function $\eta_i^{(k)}$. This function is generally defined for each player $i \in \mathcal{P}$ and all time steps $k \in \mathcal{K}$ as a subset of

$$\mathcal{I}\_i = \left\{ \{ \mathbf{y}\_i^{(j)} \}, \{ \mathbf{u}\_i^{(j)} \} \right\}\_{\substack{i \in \mathcal{P} \\ j=1,...,k}}, \tag{A.2}$$

where $\mathbf{y}_i^{(k)} = \mathbf{h}_i^{(k)}(\mathbf{x}^{(k)})$ denotes the observed values of the state $\mathbf{x}^{(k)}$ according to a function $\mathbf{h}_i^{(k)}$. Consequently, the control value at step $k$ results from $\gamma_i(\eta_i^{(k)}) = \mathbf{u}_i^{(k)}$, $\gamma_i \in \Gamma_i$.

Each player selects its strategy according to an individual stage-additive cost function of the form

$$J_i = \sum_{k=1}^{k_E} g_{D,i}\left(\mathbf{x}^{(k)}, \mathbf{u}_1^{(k)}, \dots, \mathbf{u}_N^{(k)}\right). \tag{A.3}$$

To summarize, a definition of the discrete-time infinite dynamic game is given.

#### Definition A.2 (Non-Cooperative Discrete-Time Dynamic Game)

A non-cooperative discrete-time dynamic game is defined by


The elements and the definition strongly resemble those introduced in Chapter 3. In fact, in system-theoretical terms, if a time difference between each level of play (e.g. between $k$ and $k+1$) in a discrete-time dynamic game can be stated and this difference tends towards zero, the game may be considered an approximation of a corresponding continuous-time differential game (quasi-continuous analysis). Indeed, this fact was exploited in order to apply the IRL-based inverse dynamic game methods of Chapter 6 to continuous-time models, e.g. the physically interpretable model of the ball-on-beam system. Furthermore, this allows the comparison of all methods presented in this thesis.
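As an illustration of this quasi-continuous view, the following Python sketch realizes the discrete-time dynamics (A.1a) as a forward Euler discretization of a continuous-time two-player system; the double-integrator dynamics, sampling times, and constant controls are illustrative assumptions, not a system from this thesis.

```python
import numpy as np

# Hypothetical continuous-time two-player system x' = A x + B1 u1 + B2 u2
# (a double integrator; all numbers are illustrative, not from the thesis):
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[0.0], [0.5]])

def f_D(x, u1, u2, dt):
    """One step of the discrete-time dynamics (A.1a), obtained here by a
    forward Euler discretization with sampling time dt."""
    return x + dt * (A @ x + B1 @ u1 + B2 @ u2)

def simulate(dt, k_E, x1):
    """Play the constant controls u1 = 1, u2 = -1 over k_E steps."""
    x = x1
    for _ in range(k_E):
        x = f_D(x, np.array([1.0]), np.array([-1.0]), dt)
    return x

# As dt -> 0 with k_E * dt = 1 fixed, the discrete-time trajectory approaches
# the continuous-time solution x(1) = [1.25, 0.5] for x(0) = [1, 0].
x_fine = simulate(0.001, 1000, np.array([1.0, 0.0]))
assert np.allclose(x_fine, [1.25, 0.5], atol=1e-3)
```

Refining the sampling (smaller dt, proportionally more steps) brings the discrete-time trajectory arbitrarily close to the continuous-time flow, which is the sense in which the discrete-time game approximates the differential game.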

# A.2 Information Structures

In the following, a definition of the information structures analogous to the ones in Definition 3.4 is given.

#### Definition A.3 (Information Structure of the Players in Discrete-Time Dynamic Games)

The information structure of player $i$ is said to be

(i) an open-loop (OL) pattern if $\eta_i^{(k)} = \{\mathbf{x}_1\}$, $k \in \mathcal{K}$,

(ii) a closed-loop perfect state (CLPS) pattern if $\eta_i^{(k)} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(k)}\}$, $k \in \mathcal{K}$,

(iii) a feedback (FB) pattern if $\eta_i^{(k)} = \{\mathbf{x}^{(k)}\}$, $k \in \mathcal{K}$.

# A.3 Strategies

Similar to Section 3.4, the following definitions describe open-loop and feedback strategies in discrete-time dynamic games.

Definition A.4 (Open-Loop Strategy in Discrete-Time Dynamic Games) An open-loop strategy $\gamma_i^{(k)}$ for player $i \in \mathcal{P}$ selects a control action according to

$$\mathbf{u}_i^{(k)} = \gamma_i^{(k)}(\mathbf{x}_1), \quad \forall \mathbf{x}_1 \in \mathbb{R}^n, \; k \in \mathcal{K}. \tag{A.4}$$

The set of all such possible strategies is denoted by $\Gamma_i^{\mathrm{OL}}$.

Definition A.5 (Feedback Strategy in Discrete-Time Dynamic Games) A feedback strategy $\gamma_i^{(k)}$ for player $i \in \mathcal{P}$ selects a control action according to

$$\mathbf{u}_i^{(k)} = \gamma_i^{(k)}(\mathbf{x}^{(k)}), \quad k \in \mathcal{K}. \tag{A.5}$$

The set of all such possible strategies is denoted by $\Gamma_i^{\mathrm{FB}}$.

# A.4 Conditions for Nash Equilibria and Pareto Efficient Solutions in Discrete-Time Dynamic Games

The definitions of the solution concepts, i.e. Nash equilibria, Stackelberg solutions and Pareto efficient solutions, are identical to the ones given in Section 3.5. The only difference lies in the strategies $\gamma_i$, which are defined for discrete-time dynamic games by Definitions A.4 and A.5. Therefore, the definitions are not repeated here. Nevertheless, in the following, results analogous to Theorems 3.1 – 3.3 are given. These serve as a basis for the calculation of solutions of discrete-time dynamic games.

#### Nash Equilibrium

The following theorems are based on the discrete-time Hamiltonian function

$$\begin{split} H_i^{(k)}\left(\boldsymbol{\psi}_{D,i}^{(k+1)}, \mathbf{x}^{(k)}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)}\right) := g_{D,i}^{(k)}&\left(\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)}\right) + \\ &\boldsymbol{\psi}_{D,i}^{(k+1)\top} \boldsymbol{f}_D^{(k)}\left(\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)}\right), \quad k \in \mathcal{K}, \; i \in \mathcal{P}. \end{split} \tag{A.6}$$

Furthermore, the shorthand notations

$$\boldsymbol{f}_D^{(k)*} = \boldsymbol{f}_D^{(k)}\left(\mathbf{x}^{(k)*}, \mathbf{u}_1^{(k)*}, \dots, \mathbf{u}_N^{(k)*}\right) \tag{A.7}$$

$$g_{D,i}^{(k)*} = g_{D,i}^{(k)}\left(\mathbf{x}^{(k)*}, \mathbf{u}_1^{(k)*}, \dots, \mathbf{u}_N^{(k)*}\right) \tag{A.8}$$

are introduced.

The following theorem is the discrete-time counterpart of Theorem 3.1.

#### Theorem A.1 (Necessary Conditions for Open-Loop Nash Equilibria in Discrete-Time Dynamic Games)

For an N-player discrete-time infinite dynamic game, let $\boldsymbol{f}_D^{(k)}(\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)})$ be convex and $g_{D,i}(\mathbf{x}^{(k)}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)})$ be continuously differentiable on $\mathbb{R}^n$ for all $k \in \mathcal{K}$, $i \in \mathcal{P}$. Then, if $(\gamma_1^*(\mathbf{x}_1), \dots, \gamma_N^*(\mathbf{x}_1))$ with $\gamma_i^*(\mathbf{x}_1) = \mathbf{u}_i^*$ provides an open-loop Nash equilibrium solution with $\mathbf{x}^*$ as the corresponding state trajectory, there exists a finite sequence of costate functions $(\boldsymbol{\psi}_{D,i}^{(1)}, \dots, \boldsymbol{\psi}_{D,i}^{(k_E+1)})$, $i \in \mathcal{P}$, such that the following relations are satisfied:

$$\mathbf{x}^{(k+1)*} = \boldsymbol{f}_D^{(k)*}, \quad \mathbf{x}^{(1)*} = \mathbf{x}_1 \tag{A.9a}$$

$$\mathbf{u}_i^{(k)*} = \underset{\mathbf{u}_i^{(k)}}{\arg\min}\; H_i^{(k)}\left(\boldsymbol{\psi}_{D,i}^{(k+1)}, \mathbf{x}^{(k)*}, \mathbf{u}_i^{(k)}, \mathbf{u}_{\neg i}^{(k)*}\right) \tag{A.9b}$$

$$\boldsymbol{\psi}_{D,i}^{(k)} = \nabla_{\mathbf{x}^{(k)}} H_i^{(k)}\left(\boldsymbol{\psi}_{D,i}^{(k+1)}, \mathbf{x}^{(k)*}, \mathbf{u}_i^{(k)*}, \mathbf{u}_{\neg i}^{(k)*}\right) \tag{A.9c}$$

$$\boldsymbol{\psi}_{D,i}^{(k_E+1)} = \mathbf{0}, \tag{A.9d}$$

where $\nabla_{\mathbf{x}^{(k)}}$ denotes the partial derivative with respect to the state $\mathbf{x}^{(k)}$.

#### Proof:

See e.g. the proof of Theorem 6.1 of [BO99].

Before presenting the theorem which represents necessary and sufficient conditions for feedback Nash equilibria, the discrete-time value function is defined.

#### Definition A.6 (Value Function)

Consider a player $i \in \mathcal{P}$. Let the optimal strategies $\gamma_{\neg i}^*$ of the other players associated with an N-player non-cooperative discrete-time infinite dynamic game be given. The value function $V_i : \mathbb{R}^n \times \mathcal{K} \mapsto \mathbb{R}$ of player $i$ is defined by

$$V_i(\mathbf{x}, k) = \min_{\gamma_i^{(k)}, \dots, \gamma_i^{(k_E)}} \sum_{j=k}^{k_E} g_{D,i}\left(\mathbf{x}^{(j)}, \mathbf{u}_i^{(j)}, \mathbf{u}_{\neg i}^{(j)*}\right), \tag{A.10}$$

where $\mathbf{x}^{(k)} = \mathbf{x}$.

The following theorem is the discrete-time counterpart of Theorem 3.2.

#### Theorem A.2 (Necessary and Sufficient Conditions for Feedback Nash Equilibria in Discrete-Time Dynamic Games)

For an N-player discrete-time dynamic game, an N-tuple of feedback strategies $(\gamma_1^{(k)*}, \dots, \gamma_N^{(k)*})$ provides a feedback Nash equilibrium (FNE) solution if, and only if, there exist value functions $V_i$ according to Definition A.6 such that the following recursive relations are satisfied for all players $i \in \mathcal{P}$:

$$\begin{split} V_i(\mathbf{x}, k) &= \min_{\mathbf{u}_i^{(k)}} \left[ \tilde{g}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)}\right) + V_i\left( \tilde{\boldsymbol{f}}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)}\right), k+1 \right) \right] \\ &= \tilde{g}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)*}\right) + V_i\left( \tilde{\boldsymbol{f}}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)*}\right), k+1 \right); \quad V_i(\mathbf{x}, k_E+1) = 0, \end{split} \tag{A.11}$$

where

$$\begin{split} \tilde{\boldsymbol{f}}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)}\right) &= \boldsymbol{f}_D^{(k)}\left(\mathbf{x}, \boldsymbol{\gamma}_{\neg i}^{(k)*}(\mathbf{x}), \mathbf{u}_i^{(k)}\right), \\ \tilde{g}_{D,i}^{(k)*}\left(\mathbf{x}, \mathbf{u}_i^{(k)}\right) &= g_{D,i}^{(k)}\left(\mathbf{x}, \boldsymbol{\gamma}_{\neg i}^{(k)*}(\mathbf{x}), \mathbf{u}_i^{(k)}\right). \end{split} \tag{A.12}$$

The corresponding Nash equilibrium cost for player $i$ is $V_i(\mathbf{x}_1, 1)$.

#### Proof:

See the proof of Theorem 6.6 of [BO99].

Theorem A.2 gives not only sufficient conditions for FNE (cf. Theorem 3.2), but also necessary conditions. Its core consists of the N Bellman equations (A.11) which, analogously to the single-player case, follow from the principle of optimality stated by Bellman [Bel66].<sup>59</sup> For dynamic games, the Bellman equations imply that the N inequalities corresponding to the definition of the Nash equilibrium must hold for all possible local games (with $\gamma_i^{(k)} \in \Gamma_i^{\mathrm{FB}}$) defined at each possible initial point $\mathbf{x}^{(k)}$, $k \in \mathcal{K}$, thus leading to the strong time consistency property of the FNE.
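To make the backward recursion in (A.11) concrete, the following Python sketch computes a feedback Nash equilibrium for a scalar two-player LQ game; all numerical values (dynamics, weights, horizon) are illustrative assumptions, not a system from this thesis. At each backward step, the stationarity conditions of both players' Bellman right-hand sides are solved jointly for the feedback gains, and the quadratic value functions $V_i(x,k) = \tfrac{1}{2}P_i^{(k)}x^2$ are propagated:

```python
import numpy as np

# Scalar two-player discrete-time LQ game (illustrative toy numbers):
#   x^(k+1) = a x + b1 u1 + b2 u2,   J_i = 1/2 * sum_k (q_i x^2 + r_i u_i^2)
a, b1, b2 = 0.9, 0.5, 0.4
q1, q2, r1, r2 = 2.0, 1.0, 1.0, 2.0
kE = 20  # horizon

# Backward recursion of (A.11) for the LQ case: V_i(x,k) = 1/2 * P_i x^2,
# with zero terminal value functions.
P1 = P2 = 0.0
K1_seq, K2_seq = [], []
for _ in range(kE):
    # Stationarity of both Bellman right-hand sides in u_i, with u_i = -K_i x,
    # yields a coupled linear system for the gains (K1, K2):
    M = np.array([[r1 + b1**2 * P1, b1 * b2 * P1],
                  [b1 * b2 * P2,    r2 + b2**2 * P2]])
    rhs = np.array([b1 * P1 * a, b2 * P2 * a])
    K1, K2 = np.linalg.solve(M, rhs)
    acl = a - b1 * K1 - b2 * K2          # closed-loop dynamics
    P1 = q1 + r1 * K1**2 + P1 * acl**2   # value function update, player 1
    P2 = q2 + r2 * K2**2 + P2 * acl**2   # value function update, player 2
    K1_seq.append(K1)
    K2_seq.append(K2)
K1_seq.reverse()
K2_seq.reverse()

def cost1(K1s, K2s, x0=1.0):
    """Player 1's cost when both players play the given gain sequences."""
    x, J = x0, 0.0
    for K1, K2 in zip(K1s, K2s):
        u1, u2 = -K1 * x, -K2 * x
        J += 0.5 * (q1 * x**2 + r1 * u1**2)
        x = a * x + b1 * u1 + b2 * u2
    return J

# Nash property: a unilateral deviation of player 1 cannot decrease his cost.
J_nash = cost1(K1_seq, K2_seq)
J_dev = cost1([K + 0.05 for K in K1_seq], K2_seq)
assert J_dev >= J_nash - 1e-12
```

The deviation check mirrors the definition of the Nash equilibrium: against the fixed feedback strategy of player 2, the computed gain sequence of player 1 is optimal among all feedback strategies, so perturbing it cannot reduce player 1's cost.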

#### Pareto Efficient Solutions

The following theorem presents necessary and sufficient conditions for Pareto efficient solutions in discrete-time dynamic games. It constitutes the counterpart of Theorem 3.3.

#### Theorem A.3 (Necessary and Sufficient Conditions for Pareto Efficient Solutions in Discrete-Time Dynamic Games)

Let $\tau_i > 0$, for all $i \in \mathcal{P}$, satisfy

$$\sum_{i=1}^{N} \tau_i = 1. \tag{A.13}$$

Now consider an N-player discrete-time dynamic game. If $\gamma^P = \{\gamma_1^P, \dots, \gamma_N^P\}$ is such that

$$\gamma^P = \underset{\gamma}{\arg\min}\; \sum_{i=1}^{N} \tau_i J_i(\gamma) \tag{A.14a}$$

subject to

$$\mathbf{x}^{(k+1)} = \boldsymbol{f}_D^{(k)}\left(\mathbf{x}^{(k)}, \mathbf{u}_1^{(k)}, \dots, \mathbf{u}_N^{(k)}\right) \tag{A.14b}$$

$$\mathbf{x}^{(1)} = \mathbf{x}_1 \tag{A.14c}$$

then $\gamma^P$ is a Pareto efficient solution (PES). Moreover, if the strategy spaces $\Gamma_i$ are convex and the $J_i$ are convex in $\mathbf{u}_i^{(k)}$ for all $i \in \mathcal{P}$, $k \in \mathcal{K}$, then for every Pareto efficient $\gamma^P$ there exists a $\boldsymbol{\tau}$ such that $\gamma^P$ solves the optimization problem in (A.14).

#### Proof:

The theorem is stated analogously to Theorem 3.3. According to [LZ18], both the sufficiency part (first theorem assertion) and the necessity part, which are taken from the continuous-time result, remain valid in the discrete-time case.

<sup>59</sup> The principle of optimality as stated in [Bel66] reads: "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision". This result was used to derive the Bellman equation in single-player optimal control (see e.g. [Kir04, Chapter 3]).

Similar to Theorem 3.3, the optimization problem (A.14) allows the use of the discrete-time minimum principle to solve for the PES. Further results concerning the necessary and sufficient conditions, in terms of the minimum principle corresponding to the problem defined by (A.14), are presented in [LZ18].
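As a minimal numerical illustration of the weighted-sum construction in (A.14), the following Python sketch uses a static (one-step) example with hypothetical quadratic costs, so the constrained dynamics are trivial; the costs and weight grid are illustrative assumptions. Sweeping the weights $\tau$ traces out Pareto efficient solutions with a monotonic trade-off between the two players' costs:

```python
import numpy as np

# Hypothetical one-step costs: J1 = (u1 - 1)^2 + u2^2, J2 = u1^2 + (u2 - 1)^2.
# Minimizing tau1*J1 + tau2*J2 with tau1 + tau2 = 1 gives u = (tau1, tau2),
# hence J1 = 2*(1 - tau1)^2 and J2 = 2*tau1^2.
def weighted_min(tau1):
    tau2 = 1.0 - tau1
    # Stationarity of the weighted sum in (u1, u2):
    #   tau1*2*(u1 - 1) + tau2*2*u1 = 0   and   tau1*2*u2 + tau2*2*(u2 - 1) = 0
    u1 = tau1 / (tau1 + tau2)
    u2 = tau2 / (tau1 + tau2)
    J1 = (u1 - 1.0) ** 2 + u2 ** 2
    J2 = u1 ** 2 + (u2 - 1.0) ** 2
    return J1, J2

taus = np.linspace(0.1, 0.9, 9)
front = [weighted_min(t) for t in taus]
J1s, J2s = zip(*front)

# Along the Pareto front, decreasing one player's cost increases the other's:
assert all(J1s[k] > J1s[k + 1] for k in range(len(taus) - 1))
assert all(J2s[k] < J2s[k + 1] for k in range(len(taus) - 1))
```

Each weight vector yields one Pareto efficient point, and by the convexity assumptions of Theorem A.3 every Pareto efficient point of this example is reached by some weight vector.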

# A.5 Discrete-Time Linear-Quadratic Dynamic Games

Analogously to LQ differential games, discrete-time LQ dynamic games are defined as follows.

#### Definition A.7 (Linear-Quadratic Dynamic Game)

A linear-quadratic dynamic game is defined by the same elements as Definition A.2. The system dynamics are linear, i.e. they are defined by

$$\mathbf{x}^{(k+1)} = \mathbf{A}\_D \mathbf{x}^{(k)} + \sum\_{j=1}^{N} \mathbf{B}\_{D,j} \mathbf{u}\_j^{(k)} \tag{A.15}$$

where $\mathbf{x}^{(k)} \in \mathbb{R}^n$, $\mathbf{u}_j^{(k)} \in \mathbb{R}^{m_j}$. The cost functions are quadratic, i.e.

$$J\_i = \frac{1}{2} \sum\_{k=1}^{k\_E} \left( \mathbf{x}^{(k) \top} \mathbf{Q}\_i \mathbf{x}^{(k)} + \sum\_{j=1}^{N} \mathbf{u}\_j^{(k) \top} \mathbf{R}\_{ij} \mathbf{u}\_j^{(k)} \right). \tag{A.16}$$

where $\mathbf{Q}_i$, $\mathbf{R}_{ij}$ are symmetric for all $i, j \in \mathcal{P}$ and $\mathbf{R}_{ii} \succ \mathbf{0}$.

The positive semidefiniteness of $\mathbf{Q}_i$ and $\mathbf{R}_{ij}$, $i, j \in \mathcal{P}$, $i \neq j$, is sometimes required in order to state necessary and sufficient conditions for Nash equilibria in open-loop and feedback information structures by means of discrete-time coupled Riccati equations. These equations are derived from the discrete-time minimum principle, i.e. Theorem A.1, and the coupled Bellman equations, i.e. Theorem A.2, respectively. In this thesis, a quasi-continuous analysis was considered such that the trajectories of states and controls in LQ dynamic games were generated by the continuous-time RDEs. Therefore, the discrete-time Riccati equations are not explicitly given here. The reader is referred to the literature, e.g. [BO99, HKZ12].


# B Mathematical Supplements

In this section, further mathematical details are given which complement various sections of this thesis.

# B.1 Proof of Theorem 3.4

To the best of the author's knowledge, the precise formulation of Theorem 3.4 is not available in the literature. Similar results can be found in [BO99, Theorem 6.12]. However, a formulation similar to the results in [Eng05] was chosen in this thesis in favor of simplicity.

Proof:

[Eng05, Theorem 7.2] states that an OLNE exists if the coupled RDEs (3.60) with conditions (3.61) have a solution $\mathbf{P}_i$, $i \in \mathcal{P}$, and additionally a symmetric solution $\bar{\mathbf{P}}_i(t)$ of the non-coupled RDE

$$\dot{\bar{\mathbf{P}}}_i(t) = -\mathbf{A}^\top \bar{\mathbf{P}}_i(t) - \bar{\mathbf{P}}_i(t)\mathbf{A} + \bar{\mathbf{P}}_i(t)\mathbf{S}_i\bar{\mathbf{P}}_i(t) - \mathbf{Q}_i, \quad \bar{\mathbf{P}}_i(T) = \mathbf{Q}_{i,T} \tag{B.1}$$

exists on $[0, T]$ for all players $i \in \mathcal{P}$. Under the theorem conditions $\mathbf{Q}_i \succeq \mathbf{0}$ and $\mathbf{Q}_{i,T} \succeq \mathbf{0}$, $i \in \mathcal{P}$, results from the theory of differential equations can be leveraged to state that the solutions $\bar{\mathbf{P}}_i(t)$ of (B.1) are guaranteed to exist (cf. the proof of [BO99, Proposition 5.3]). The theorem assertion follows.

# B.2 Equivalence of Cost Functions

Inverse optimal control and inverse dynamic game problems have an inherent ill-posedness property. This section gives definitions of the equivalence of cost functions in an optimal control and a dynamic game scenario.

#### B.2.1 Optimal Control

In an optimal control problem, where optimal control trajectories $\mathbf{u}^*(t)$ which minimize a cost function $J$ are sought, more than one cost function exists which leads to the same optimal control $\mathbf{u}^*(t)$ and consequently, if the system dynamics are unchanged, to the same state trajectories $\mathbf{x}^*(t)$. Mathematically, this means that even if

$$J^{(1)}(\mathfrak{u}(t)) \neq J^{(2)}(\mathfrak{u}(t)),\tag{B.2}$$

it is still possible to obtain

$$\underset{\mathfrak{u}(t)}{\text{arg min }} J^{(1)}(\mathfrak{u}(t)) = \underset{\mathfrak{u}(t)}{\text{arg min }} J^{(2)}(\mathfrak{u}(t)). \tag{B.3}$$

For example, it is a well-known fact that (B.3) holds for $J^{(2)}(\mathbf{u}(t)) = cJ^{(1)}(\mathbf{u}(t))$, $c \in \mathbb{R}^+$. Nevertheless, according to [NF04], the ill-posedness of a general inverse LQ optimal control problem may transcend the ill-posedness due to a positive real constant. Therefore, it is conceivable that this property is also present in a general inverse (non-LQ) optimal control problem. To define when two cost functions are equivalent, we introduce the following definition.

Definition B.1 (Equivalence of Cost Functions in an Optimal Control Problem) Two cost functions J (1) and J (2) are equivalent if and only if

$$\mathcal{S}^{(1)} = \mathcal{S}^{(2)} \tag{B.4}$$

where $\mathcal{S}^{(j)}$, $j \in \{1, 2\}$, denotes the set of solutions for cost function $J^{(j)}$, i.e.

$$\mathcal{S}^{(j)} = \left\{ \mathfrak{u}(t) \mid \mathfrak{u}(t) = \operatorname\*{arg\,min}\_{\mathfrak{u}(t)} J^{(j)}(\mathfrak{u}(t)) \right\}. \tag{B.5}$$
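The scaling invariance noted above, i.e. that $J$ and $cJ$ with $c > 0$ share the same set of minimizers, is easy to verify numerically; the quadratic cost below is a hypothetical example:

```python
from scipy.optimize import minimize_scalar

# Hypothetical single-input cost and a positively scaled copy:
J1 = lambda u: (u - 3.0) ** 2 + 1.0
c = 5.0
J2 = lambda u: c * J1(u)  # J2 = c * J1 with c > 0

u1_star = minimize_scalar(J1).x
u2_star = minimize_scalar(J2).x

# Both cost functions yield the same minimizer (B.3), although J1 != J2 (B.2):
assert abs(u1_star - u2_star) < 1e-6
assert abs(u1_star - 3.0) < 1e-6
```

In the terminology of Definition B.1, $J^{(1)}$ and $cJ^{(1)}$ are therefore equivalent, since their solution sets $\mathcal{S}^{(1)}$ and $\mathcal{S}^{(2)}$ coincide.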

#### B.2.2 Differential Game

An N-player differential game can be considered a generalization of an optimal control problem. Consequently, the ill-posedness issues discussed in the last section are valid in this more general case as well. Analogously to Definition B.1, it is possible to define two equivalent cost functions of a specific player i in a differential game with N players.

Definition B.2 (Equivalence of Cost Functions in a Differential Game) Two cost functions $J_i^{(1)}$ and $J_i^{(2)}$ are equivalent if and only if

$$\mathcal{S}_i^{(1)} = \mathcal{S}_i^{(2)} \tag{B.6}$$

where $\mathcal{S}_i^{(j)}$, $j \in \{1, 2\}$, denotes the set of solutions of cost function $J_i^{(j)}$, i.e.

$$\mathcal{S}\_{i}^{(j)} = \left\{ \boldsymbol{\mathfrak{u}}\_{i}(t) \mid \boldsymbol{\mathfrak{u}}\_{i}(t) = \mathop{\arg\min}\_{\boldsymbol{\mathfrak{u}}\_{i}(t)} \boldsymbol{J}\_{i}^{(j)}(\boldsymbol{\mathfrak{u}}\_{i}(t), \boldsymbol{\mathfrak{u}}\_{-i}^{\*}(t)) \right\}. \tag{B.7}$$

This definition can be interpreted as follows. Let $J_{\neg i}$ represent the $N-1$ cost functions of all players except player $i$. If these cost functions are fixed, then according to Definition B.2, two cost functions for player $i$ are equivalent if and only if, together with $J_{\neg i}$, they lead to the same Nash equilibrium.

# B.3 Calculation of Open-Loop Nash Equilibria With the Minimum Principle

Section 3.6.1 presented Theorem 3.1 as necessary conditions for OLNE which consist of several coupled differential equations. Under certain restrictions, these can be used to state a two-point boundary value problem (TPBVP) to solve for Nash equilibrium state trajectories x ∗ (t). The following lemma represents a useful result for this purpose.

#### Lemma B.1

Consider an N-player differential game where the system dynamics are affine in the controls, i.e.

$$\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), \mathbf{u}\_1(t), \dots, \mathbf{u}\_N(t), t) = f\_{\mathbf{x}}(\mathbf{x}(t), t) + \sum\_{i=1}^{N} G\_i(\mathbf{x}, t)\mathbf{u}\_i(t) \tag{B.8}$$

and the running costs $g_i$ of the cost function $J_i$ in (3.3) are given by

$$g_i(\mathbf{x}(t), \mathbf{u}_1(t), \dots, \mathbf{u}_N(t)) = g_{i,1}(\mathbf{x}(t), \mathbf{u}_1(t)) + \dots + g_{i,N}(\mathbf{x}(t), \mathbf{u}_N(t)), \quad \forall i \in \mathcal{P}. \tag{B.9}$$

Furthermore, assume that the functions $\mathbf{u}_j \mapsto g_{i,j}(\mathbf{x}(t), \mathbf{u}_j(t))$ are strictly convex for all $i, j \in \mathcal{P}$ and that $g_{i,i}$ has superlinear growth, i.e.

$$\lim_{\|\mathbf{u}_i\| \to \infty} \frac{g_{i,i}(\mathbf{x}, \mathbf{u}_i)}{\|\mathbf{u}_i\|} = \infty. \tag{B.10}$$

Then, for every $(\mathbf{x}, t) \in \mathbb{R}^n \times [0, T]$ and every tuple $(\boldsymbol{\psi}_1, \dots, \boldsymbol{\psi}_N) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n$, the minimization problem

$$\boldsymbol{u}\_{i}^{\*}(t) = \mathop{\arg\min}\_{\boldsymbol{u}\_{i}} \{ \boldsymbol{\Psi}\_{i}^{\top} \boldsymbol{G}\_{i}(\mathbf{x}, t) \boldsymbol{u}\_{i}(t) + g\_{i, i}(\mathbf{x}(t), \boldsymbol{u}\_{i}(t)) \} \tag{B.11}$$

has a unique solution.

#### Proof:

The proof is analogous to the proof of Lemma 4.1 in [Bre11] for the two-player case.

The implications of Lemma B.1 are explained in the following. By using the algebraic equations defined by (3.17) and the result of Lemma B.1, $\mathbf{u}_i^*(t)$ can be written as the unique map

$$\mathbf{u}_i^*(t) = \boldsymbol{\eta}_i^*(\mathbf{x}(t), \boldsymbol{\psi}_i(t), t). \tag{B.12}$$

By inserting (B.12) in (3.16a) and (3.16c), we obtain a system of coupled non-linear differential equations consisting of

$$\dot{\mathbf{x}}(t) = \boldsymbol{f}\left(\mathbf{x}(t), \boldsymbol{\eta}_i^*(t), \boldsymbol{\eta}_{\neg i}^*(t), t\right) \tag{B.13}$$

$$\dot{\boldsymbol{\psi}}_i(t) = -\nabla_{\mathbf{x}} H_i\left(\boldsymbol{\psi}_i(t), \mathbf{x}(t), \boldsymbol{\eta}_i^*(t), \boldsymbol{\eta}_{\neg i}^*(t), t\right), \tag{B.14}$$

where $\boldsymbol{\eta}_i^*(t)$ and $\boldsymbol{\eta}_{\neg i}^*(t)$ are used as short notations for $\boldsymbol{\eta}_i^*(\mathbf{x}(t), \boldsymbol{\psi}_i(t), t)$ and $\boldsymbol{\eta}_{\neg i}^*(\mathbf{x}(t), \boldsymbol{\psi}_{\neg i}(t), t)$, respectively, and the boundary conditions

$$\mathbf{x}^\*(0) = \mathbf{x}\_0 \tag{\text{B.15a}}$$

$$
\Psi\_i(T) = \nabla\_\mathbf{x} h\_i(\mathbf{x}(T)). \tag{\text{B.15b}}
$$

The TPBVP arising from (B.13), the differential equations (B.14) for each i ∈ P and boundary conditions (B.15) can be solved using numerical methods, e.g. shooting methods or collocation methods [AMR95, Chapter 4]. The solution of this TPBVP describes an OLNE.
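As a minimal sketch of this procedure, the following Python example solves a TPBVP of the same structure with scipy's collocation-based `solve_bvp`, which plays the role of MATLAB's bvp4c. A single-player analog is chosen so the solution is available in closed form; the cost $\int_0^T (x^2 + u^2)\,\mathrm{d}t$ and horizon $T = 1$ are illustrative assumptions:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Single-player analog of the TPBVP (B.13)-(B.15), chosen so the solution is
# known in closed form: minimize int_0^T (x^2 + u^2) dt s.t. x' = u, x(0) = 1.
# The minimum principle gives u* = -psi/2, hence the TPBVP
#   x'   = -psi / 2,   x(0)   = 1,
#   psi' = -2 x,       psi(T) = 0   (no terminal cost).
T = 1.0

def odes(t, y):
    x, psi = y
    return np.vstack((-psi / 2.0, -2.0 * x))

def bc(y0, yT):
    # Residuals of the boundary conditions x(0) = 1 and psi(T) = 0.
    return np.array([y0[0] - 1.0, yT[1]])

t_mesh = np.linspace(0.0, T, 50)
sol = solve_bvp(odes, bc, t_mesh, np.zeros((2, t_mesh.size)))

# Closed-form optimum for comparison: x(t) = cosh(T - t) / cosh(T).
x_exact = np.cosh(T - sol.x) / np.cosh(T)
assert sol.success
assert np.max(np.abs(sol.y[0] - x_exact)) < 1e-3
```

For the actual OLNE computation, the state equation (B.13) and one costate equation (B.14) per player are stacked into one larger boundary value problem of the same form.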

# B.4 Open-Loop Nash Equilibrium of the Ball-on-Beam System

In this section, details on the computation of the OLNE for the differential game with the ball-on-beam system considered in Section 7.4 are provided. In the following, time dependencies are omitted for brevity. Furthermore, all equations with the index $i$ refer to player $i \in \{1, 2\}$. The ball-on-beam system dynamics are given by

$$\dot{\mathbf{x}} = \begin{bmatrix} x_2 \\ \frac{m_b r_b^2 (x_1 x_4^2 - g_e \sin(x_3))}{\Theta_b + m_b r_b^2} \\ x_4 \\ \frac{-2 m_b x_1 x_2 x_4 - m_b g_e x_1 \cos(x_3) + u_1 + u_2}{m_b x_1^2 + \Theta_w} \end{bmatrix} \tag{B.16}$$

and the cost functions are defined as

$$J_i = \int_0^T \boldsymbol{\theta}_i^\top \boldsymbol{\phi}_i \,\mathrm{d}t$$

with the parameter vector $\boldsymbol{\theta}_i \in \mathbb{R}^{5 \times 1}$ and the basis function vector

$$\boldsymbol{\phi}_i = \begin{bmatrix} x_1^2 & x_2^2 & x_3^2 & x_4^2 & u_i^2 \end{bmatrix}^\top. \tag{B.17}$$

The corresponding Hamiltonian is

$$H\_i = \boldsymbol{\theta}\_i^\top \boldsymbol{\phi}\_i + \boldsymbol{\Psi}\_i^\top f(\mathbf{x}, \boldsymbol{u}\_i, \boldsymbol{u}\_{\neg i}).\tag{\text{B.18}}$$

Using (3.17), we obtain for each player's controls

$$u\_i^\* = \eta\_i(\mathbf{x}, \boldsymbol{\Psi}\_i) = -\frac{\psi\_{i,4}}{2\theta\_{i,5}(m\_b\boldsymbol{x}\_1^2 + \Theta\_{\mathbf{w}})}.\tag{\text{B.19}}$$

Next, we apply (3.16c) to obtain

$$\dot{\boldsymbol{\psi}}_i = -\begin{bmatrix} 2\theta_{i,1} x_1 + \psi_{i,2} (\nabla_{\mathbf{x}} f)_{(2,1)} + \psi_{i,4} (\nabla_{\mathbf{x}} f)_{(4,1)} \\ 2\theta_{i,2} x_2 + \psi_{i,1} (\nabla_{\mathbf{x}} f)_{(1,2)} + \psi_{i,4} (\nabla_{\mathbf{x}} f)_{(4,2)} \\ 2\theta_{i,3} x_3 + \psi_{i,2} (\nabla_{\mathbf{x}} f)_{(2,3)} + \psi_{i,4} (\nabla_{\mathbf{x}} f)_{(4,3)} \\ 2\theta_{i,4} x_4 + \psi_{i,2} (\nabla_{\mathbf{x}} f)_{(2,4)} + \psi_{i,3} (\nabla_{\mathbf{x}} f)_{(3,4)} + \psi_{i,4} (\nabla_{\mathbf{x}} f)_{(4,4)} \end{bmatrix}, \tag{B.20}$$

where $(\nabla_{\mathbf{x}} f)_{(r,c)}$, $r, c \in \{1, \dots, 4\}$, denote the elements of the matrix of partial derivatives

$$\nabla\_{\mathbf{x}}f = \begin{bmatrix} 0 & 1 & 0 & 0\\ \frac{m\_b r\_b^2 \mathbf{x}\_4^2}{m\_b r\_b^2 + \Theta\_b} & 0 & \frac{-g\_e m\_b r\_b^2 \cos(\mathbf{x}\_3)}{m\_b r\_b^2 + \Theta\_b} & \frac{2m\_b r\_b^2 \mathbf{x}\_1 \mathbf{x}\_4}{m\_b r\_b^2 + \Theta\_b} \\ 0 & 0 & 0 & 1\\ D & -\frac{2m\_b \mathbf{x}\_1 \mathbf{x}\_4}{Z} & \frac{g\_e m\_b \mathbf{x}\_1 \sin(\mathbf{x}\_3)}{Z} & -\frac{2m\_b \mathbf{x}\_1 \mathbf{x}\_2}{Z} \end{bmatrix} \tag{\text{B.21}}$$

with

$$\begin{aligned} D &= \frac{-2m\_b \mathbf{x}\_2 \mathbf{x}\_4 - g\_e m\_b \cos(\mathbf{x}\_3)}{Z} - \frac{2m\_b \mathbf{x}\_1 (\boldsymbol{u}\_1^\* + \boldsymbol{u}\_2^\* - 2m\_b \mathbf{x}\_1 \mathbf{x}\_2 \mathbf{x}\_4 - g\_e m\_b \mathbf{x}\_1 \cos(\mathbf{x}\_3))}{Z^2} \\ Z &= m\_b \mathbf{x}\_1^2 + \Theta\_\mathbf{w}. \end{aligned}$$

Following the procedure described in Section B.3, we insert (B.19) in (B.16) and obtain the system dynamics

$$\dot{\mathbf{x}} = \begin{bmatrix} x_2 \\ \frac{m_b r_b^2 (x_1 x_4^2 - g_e \sin(x_3))}{\Theta_b + m_b r_b^2} \\ x_4 \\ f_4^\eta \end{bmatrix}, \tag{B.22}$$

where

$$f\_4^\eta = \frac{\left(-4\theta\_{1,(5)}\theta\_{2,(5)}\mathbf{x}\_2\mathbf{x}\_4 - 2\theta\_{1,(5)}\theta\_{2,(5)}g\_e\cos(\mathbf{x}\_3)\right)m\_b\mathbf{x}\_1Z - \psi\_{1,(4)}\theta\_{2,(5)} - \psi\_{2,(4)}\theta\_{1,(5)}}{2\theta\_{1,(5)}\theta\_{2,(5)}Z^2}.\tag{B.23}$$

Furthermore, we insert (B.19) in (B.20) and obtain the same costate differential equation, yet with

$$(\nabla\_{\mathbf{x}}f)\_{(4,1)} = D^{\eta} = \frac{-2m\_b \mathbf{x}\_2 \mathbf{x}\_4 - g\_e m\_b \cos(\mathbf{x}\_3)}{Z} - \frac{2m\_b \mathbf{x}\_1 f\_4^{\eta}}{Z}. \tag{\text{B.24}}$$

The system dynamics (B.22) and the differential equations of $\psi\_1$ and $\psi\_2$ defined by (B.20) and (B.24) constitute a TPBVP which can be solved numerically. In this thesis, the MATLAB function bvp4c is used, which applies a collocation method (see [SKR00]).
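The same collocation-based solution strategy can be sketched in Python with `scipy.integrate.solve_bvp`, the SciPy counterpart of bvp4c. The snippet below is a minimal illustration on a scalar LQ toy problem (state/costate pair with fixed initial state and free terminal state), not the ball-and-beam TPBVP itself:

```python
import numpy as np
from scipy.integrate import solve_bvp

# Toy TPBVP from a scalar LQ problem: x_dot = -psi (optimal control u* = -psi),
# psi_dot = -x, with boundary conditions x(0) = 1 and psi(T) = 0 (free terminal state).
T = 1.0

def odes(t, y):
    x, psi = y
    return np.vstack((-psi, -x))

def bc(ya, yb):
    # residuals of the two boundary conditions
    return np.array([ya[0] - 1.0, yb[1]])

t = np.linspace(0.0, T, 50)
y0 = np.zeros((2, t.size))          # initial guess for the collocation solver
sol = solve_bvp(odes, bc, t, y0)

print(sol.success)                  # solver converged
print(sol.y[0, 0], sol.y[1, -1])    # boundary values: x(0) close to 1, psi(T) close to 0
```

As with bvp4c, the quality of the initial mesh and guess determines convergence; for the nonlinear game dynamics a guess obtained from a simulation with nominal parameters is advisable.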

# B.5 Approximations for the Maximum Entropy Probability Density Function

This section presents the steps needed for the approximation result of the probability density function given in (6.47). For brevity, the subscript $i$ is omitted from all variables related to player $i$ in the following. Likewise, the following derivations are based on the assumption that a single demonstration ($n\_t = 1$) lies at hand, such that the subscript $l$ can also be omitted.

Inserting (6.44) in (6.24) results in

$$\begin{split} \mathrm{p}\left(\tilde{\zeta}\,\middle|\,\boldsymbol{\theta}\right) &= \mathrm{p}\left(\tilde{\underline{u}}\,\middle|\,\tilde{\underline{u}}\_{\neg i},\mathbf{x}^{(1)},\boldsymbol{\theta}\right) = \mathrm{e}^{-J(\tilde{\underline{u}})}\left[\int\_{-\infty}^{\infty}\mathrm{e}^{-J(\underline{u})}\,\mathrm{d}\underline{u}\right]^{-1}\\ &\approx \mathrm{e}^{-J(\tilde{\underline{u}})}\left[\int\_{-\infty}^{\infty}\mathrm{e}^{\left\{-J(\tilde{\underline{u}})-\boldsymbol{g}^{\top}\left(\underline{u}-\tilde{\underline{u}}\right)-\frac{1}{2}\left(\underline{u}-\tilde{\underline{u}}\right)^{\top}\boldsymbol{G}\left(\underline{u}-\tilde{\underline{u}}\right)\right\}}\,\mathrm{d}\underline{u}\right]^{-1}\\ &= \left[\int\_{-\infty}^{\infty}\mathrm{e}^{\left\{\frac{1}{2}\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{g}-\frac{1}{2}\left(\boldsymbol{G}\left(\underline{u}-\tilde{\underline{u}}\right)+\boldsymbol{g}\right)^{\top}\boldsymbol{G}^{-1}\left(\boldsymbol{G}\left(\underline{u}-\tilde{\underline{u}}\right)+\boldsymbol{g}\right)\right\}}\,\mathrm{d}\underline{u}\right]^{-1}. \end{split}\tag{B.25}$$

We note that the relation

$$\begin{split} & \left( \boldsymbol{G}\left( \underline{u} - \tilde{\underline{u}} \right) + \boldsymbol{g} \right)^{\top} \boldsymbol{G}^{-1} \left( \boldsymbol{G}\left( \underline{u} - \tilde{\underline{u}} \right) + \boldsymbol{g} \right) \\ & \qquad = \left( \boldsymbol{G}\underline{u} - \boldsymbol{G}\tilde{\underline{u}} + \boldsymbol{g} \right)^{\top} \boldsymbol{G}^{-1} \left( \boldsymbol{G}\underline{u} - \boldsymbol{G}\tilde{\underline{u}} + \boldsymbol{g} \right) \\ & \qquad = \left( \underline{u}^{\top} \boldsymbol{G}^{\top} \boldsymbol{G}^{-1} - \tilde{\underline{u}}^{\top} \boldsymbol{G}^{\top} \boldsymbol{G}^{-1} + \boldsymbol{g}^{\top} \boldsymbol{G}^{-1} \right) \boldsymbol{G} \left( \underline{u} - \tilde{\underline{u}} + \boldsymbol{G}^{-1} \boldsymbol{g} \right) \\ & \qquad = \left( \underline{u}^{\top} + \left( \boldsymbol{g}^{\top} \boldsymbol{G}^{-1} - \tilde{\underline{u}}^{\top} \right) \right) \boldsymbol{G} \left( \underline{u} + \boldsymbol{G}^{-1} \boldsymbol{g} - \tilde{\underline{u}} \right) \\ & \qquad = \left( \underline{u} + \left( \boldsymbol{G}^{-1} \boldsymbol{g} - \tilde{\underline{u}} \right) \right)^{\top} \boldsymbol{G} \left( \underline{u} + \left( \boldsymbol{G}^{-1} \boldsymbol{g} - \tilde{\underline{u}} \right) \right), \end{split}\tag{B.26}$$

holds due to the symmetry of the second derivative $\boldsymbol{G}$ of the cost function. By applying (B.26) in (B.25), the right-hand side results in

$$\left[\mathrm{e}^{\left\{-\frac{1}{2}\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{g}\right\}}\right] \left[\int\_{-\infty}^{\infty} \mathrm{e}^{\left\{-\frac{1}{2}\left(\underline{u}+\left(\boldsymbol{G}^{-1}\boldsymbol{g}-\tilde{\underline{u}}\right)\right)^{\top}\boldsymbol{G}\left(\underline{u}+\left(\boldsymbol{G}^{-1}\boldsymbol{g}-\tilde{\underline{u}}\right)\right)\right\}} \mathrm{d}\underline{u}\right]^{-1} . \tag{B.27}$$

Finally, since

$$\int\_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^{\dim(\boldsymbol{y})} |\boldsymbol{\Sigma}\_{Y}|}} \mathrm{e}^{\left\{-\frac{1}{2} (\boldsymbol{y} - \boldsymbol{\mu}\_{Y})^{\top} \boldsymbol{\Sigma}\_{Y}^{-1} (\boldsymbol{y} - \boldsymbol{\mu}\_{Y})\right\}} \mathrm{d}\boldsymbol{y} \stackrel{!}{=} 1\tag{B.28}$$

holds for a multidimensional Gaussian distribution with mean $\boldsymbol{\mu}\_Y$ and covariance matrix $\boldsymbol{\Sigma}\_Y$, we may rewrite (B.27) and obtain the approximated probability density function

$$\mathrm{p}\left(\tilde{\zeta}\,\middle|\,\boldsymbol{\theta}\right) \approx \mathrm{e}^{-\frac{1}{2}\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{g}}\, \det(\boldsymbol{G})^{\frac{1}{2}}\, (2\pi)^{-\frac{1}{2}\dim(\underline{u})},\tag{B.29}$$

where $\dim(\underline{u}) = m\,k\_E$ denotes the dimension of $\underline{u}$. From (B.29), the approximated log-likelihood function results in

$$\begin{split} \ln \mathcal{L} \left( \tilde{\zeta} \, \middle| \, \boldsymbol{\theta} \right) &= \ln \left( \mathrm{p} \left( \tilde{\underline{u}} \, \middle| \, \tilde{\underline{u}}\_{\neg i}, \mathbf{x}^{(1)}, \boldsymbol{\theta} \right) \right) \\ &\approx -\frac{1}{2} \boldsymbol{g}^{\top} \boldsymbol{G}^{-1} \boldsymbol{g} + \frac{1}{2} \ln \left( \det(\boldsymbol{G}) \right) - \frac{1}{2} \dim \left( \underline{u} \right) \ln \left( 2\pi \right) \, . \end{split} \tag{B.30}$$
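The approximation (B.30) can be checked numerically: for a purely quadratic cost the second-order expansion is exact, so the approximated log-likelihood must coincide with the one obtained by evaluating the Gaussian normalization integral in closed form. The following sketch uses arbitrary illustrative numbers, not quantities from the thesis:

```python
import numpy as np
from math import log, pi

# For a quadratic J(u) = 0.5 u^T A u + b^T u, the second-order expansion around
# any point u_tilde is exact, so (B.30) can be verified against a closed form.
A = np.array([[2.0, 0.3], [0.3, 1.5]])   # Hessian G (symmetric positive definite)
b = np.array([0.4, -0.2])
u_t = np.array([0.5, 1.0])               # observed control u_tilde

g = A @ u_t + b                          # gradient of J at u_tilde
G = A                                    # Hessian of J (constant here)

# Approximated log-likelihood, cf. (B.30)
ll_approx = (-0.5 * g @ np.linalg.solve(G, g)
             + 0.5 * log(np.linalg.det(G)) - 0.5 * len(u_t) * log(2 * pi))

# Exact evaluation: ln p = -J(u_t) - ln Z, where the normalization integral is
# Z = exp(0.5 b^T A^{-1} b) * sqrt((2 pi)^n / det A)  (Gaussian integral)
def J(u):
    return 0.5 * u @ A @ u + b @ u

log_Z = (0.5 * b @ np.linalg.solve(A, b)
         + 0.5 * (len(b) * log(2 * pi) - log(np.linalg.det(A))))
ll_exact = -J(u_t) - log_Z

print(abs(ll_approx - ll_exact))   # vanishes for quadratic J
```

For non-quadratic costs the two values differ, and (B.30) is a Laplace-type approximation whose accuracy depends on how well the quadratic expansion captures the cost around the observation.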

# B.6 Implementation of the Direct Bilevel Approach

The DB approach used for comparison in this thesis is based on the minimization of the cost functional (7.1), which depends on the current candidate trajectories $\underline{u}\_{\theta,j}(t)$ and $\underline{x}\_{\theta}(t)$. These trajectories must be Nash equilibrium trajectories under an arbitrary parametrization of the cost functions $\theta$. The solution of a forward dynamic game with the parameters $\theta$ is therefore nested inside the objective function in (7.1). Consequently, each objective function evaluation includes the solution of a forward dynamic game to determine an OLNE or an FNE, depending on the considered case. We note that the search for $\theta$ might lead to cost function parameter candidates for which a Nash equilibrium does not exist. Proving the existence of Nash equilibria is in general not trivial. For example, in the case of linear-quadratic differential games, the existence of Nash equilibria depends on the existence of a solution to the coupled Riccati differential equations, which has only been proved under strong assumptions. Furthermore, these proofs are of little use for a practical implementation, so the existence of Nash equilibria cannot be ensured by introducing optimization constraints. Nevertheless, probably inspired by the optimal control case (cf. assumptions in the results summarized in [Kuč73]), the literature on (linear-quadratic) dynamic games usually introduces constraints of the kind

$$\mathcal{C} = \left\{ \theta\_i \mid \theta\_{i,(j)} \ge 0, \,\forall i \in \mathcal{P},\ j \in \{1, \dots, M\_i\} \right\}. \tag{B.31}$$

This constraint set was implemented in the minimization of the objective function for the DB approach. The occurrence of successful calculations of Nash equilibria was indeed increased with this set. Nevertheless, it was not enough to completely avoid failure. Therefore, the objective function was augmented by a resetting procedure of the candidate trajectories (potentially leading to greater costs) which became active if the forward problem, i.e. the numerical solution of the corresponding RDEs or TPBVPs, did not converge.

The algorithm describing the cost functional to be evaluated in each iteration of the optimization problem is listed below.

Algorithm 5 Cost Functional for the Direct Bilevel Approach in Inverse Differential Games.

Input: Parameter candidates $\theta$, observed trajectory set $\mathcal{D}$, dynamics $f$, basis functions $\phi\_i$.
Output: Sum of squared errors $J\_{\text{DB}}$


Therefore, the DB method used for the simulation results of Chapter 7 consists of the minimization of the cost functional described by Algorithm 5 subject to the constraints (B.31).
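The nesting of the forward game inside the outer least-squares objective, including the resetting procedure on convergence failure, can be sketched as follows. The helper `solve_forward_game` and the toy forward "game" below are hypothetical stand-ins for the RDE/TPBVP solver, not the thesis implementation:

```python
import numpy as np

# Structural sketch of the DB cost functional: the forward dynamic game is
# nested inside the outer sum-of-squared-errors objective, and a penalty value
# is returned if the forward solver (RDEs / TPBVP) fails to converge.
PENALTY = 1e6

def db_cost(theta, observed_x, observed_u, solve_forward_game):
    try:
        x_theta, u_theta = solve_forward_game(theta)      # Nash equilibrium for theta
    except RuntimeError:                                  # forward game did not converge
        return PENALTY
    err = np.sum((observed_x - x_theta) ** 2)
    err += sum(np.sum((ou - uu) ** 2) for ou, uu in zip(observed_u, u_theta))
    return err

# Toy forward "game": scalar system whose equilibrium trajectory is exp(-theta*t)
t = np.linspace(0.0, 1.0, 20)
def toy_game(theta):
    if theta[0] <= 0:
        raise RuntimeError("no equilibrium for this parametrization")
    return np.exp(-theta[0] * t), [np.zeros_like(t)]

obs_x, obs_u = toy_game(np.array([2.0]))
print(db_cost(np.array([2.0]), obs_x, obs_u, toy_game))   # zero error at the true parameters
print(db_cost(np.array([-1.0]), obs_x, obs_u, toy_game))  # penalty: forward game fails
```

An outer optimizer (e.g. a derivative-free method, since the nested solve makes gradients unreliable) would then minimize `db_cost` over $\theta$ subject to (B.31).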

# B.7 Solutions of the LQ Tracking Problem in the Cooperative Steering Model

This section presents reformulations of the LQ tracking problem arising in Section 8.2.1 to a standard LQ problem which allows an easier solution of the differential game. First, the general approach is presented. It is based on the reformulation proposed for the single-player case in [ML14]. Afterwards, the reformulations specific to the problem of Section 8.2.1 are shown. In the remainder of this section, time dependencies of all variables will be omitted for better readability.

#### B.7.1 General Reformulation to a Standard LQ Problem

To begin the reformulation, the state variable $X = \begin{bmatrix} x^\top & z^\top \end{bmatrix}^\top$ is introduced, which combines the system states and the corresponding reference trajectories. With this new state, we define an extended system including the original system dynamics as well as the reference model dynamics:

$$\dot{X} = \tilde{A}X + \tilde{B}\_1 u\_1 + \tilde{B}\_2 u\_2 \quad\text{with}\quad \tilde{A} = \begin{bmatrix} A & \mathbf{0} \\ \mathbf{0} & H \end{bmatrix}, \quad \tilde{B}\_i = \begin{bmatrix} B\_i \\ \mathbf{0} \end{bmatrix}, \quad i \in \{1, 2\}. \tag{B.32}$$

Due to the infinite horizon, the cost function (8.5) can only be applied if H is Hurwitz. This is a considerable restriction, since application-relevant reference signals, e.g. sinusoidal and step functions, will not lead to a Hurwitz reference system matrix. In order to circumvent this problem, we introduce a discount factor β such that <sup>0</sup> < β < <sup>1</sup> in the cost function, thus avoiding infinite costs.

We note that the tracking error $e$ can be written as $e = TX$, where $T = \begin{bmatrix} I\_n & -I\_n \end{bmatrix}$ and $I\_n$ is the $n$-dimensional identity matrix. With this transformation matrix and the discount factor $\beta$, we rewrite (8.5) as

$$\begin{split} J\_i &= \int\_0^\infty \exp(-\beta t) \left( X^\top T^\top Q\_i T X + u\_i^\top R\_{ii} u\_i \right) \mathrm{d}t \\ &= \int\_0^\infty \exp(-\beta t) \left( X^\top \tilde{Q}\_i X + u\_i^\top R\_{ii} u\_i \right) \mathrm{d}t, \end{split} \tag{B.33}$$

where

$$
\tilde{\mathbf{Q}}\_i = \mathbf{T}^\top \mathbf{Q}\_i \mathbf{T} = \begin{bmatrix} \mathbf{Q}\_i & -\mathbf{Q}\_i \\ -\mathbf{Q}\_i & \mathbf{Q}\_i \end{bmatrix} \tag{\text{B.34}}
$$

According to Modares and Lewis [ML14], the optimal control problem consisting of the system dynamics (B.32) and the cost function (B.33) to be minimized for any $i \in \{1, 2\}$ can be reformulated as an optimal control problem without a discount factor $\beta$, but with the new system matrix $\tilde{A} - 0.5\beta I$ instead of $\tilde{A}$ in (B.32). This eases the calculation of the solution and the proof of its existence: Modares and Lewis [ML14] show that the solution exists if the matrix $\tilde{A} - 0.5\beta I$ is Hurwitz.
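The construction of the extended system (B.32), the transformed weighting matrix (B.34) and the discounted system matrix of Modares and Lewis can be sketched as follows; the numerical values of $A$, $B\_1$ and $Q\_1$ are arbitrary illustrations:

```python
import numpy as np

# Sketch of (B.32)-(B.34): build the extended tracking system and the
# discount-free equivalent with system matrix A_tilde - 0.5*beta*I.
def extend(A, H, B_list, Q_list, beta):
    n = A.shape[0]
    A_t = np.block([[A, np.zeros((n, n))], [np.zeros((n, n)), H]])
    B_t = [np.vstack([B, np.zeros((n, B.shape[1]))]) for B in B_list]
    T = np.hstack([np.eye(n), -np.eye(n)])          # tracking error e = T X
    Q_t = [T.T @ Q @ T for Q in Q_list]             # block structure [[Q,-Q],[-Q,Q]]
    A_disc = A_t - 0.5 * beta * np.eye(2 * n)       # cf. Modares and Lewis [ML14]
    return A_disc, B_t, Q_t

# Illustrative two-state system with a constant-reference model (H = 0)
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
H = np.zeros((2, 2))
B1 = np.array([[0.0], [1.0]])
Q1 = np.diag([1.0, 0.5])
A_disc, B_t, Q_t = extend(A, H, [B1, B1], [Q1, Q1], beta=0.01)
print(Q_t[0][:2, 2:])    # off-diagonal block equals -Q1, cf. (B.34)
```

The returned triple $(\tilde{A} - 0.5\beta I, \tilde{B}\_i, \tilde{Q}\_i)$ defines the standard LQ problem mentioned above.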

### B.7.2 Transformed System Dynamics and Cost Functions of the Cooperative Steering System

Given that we apply constant reference values, $H = 0$ holds for the reference system matrix in (8.4). Moreover, the velocity reference signal is zero, so we neglect this term before applying the aforementioned transformation. In this way, we obtain system dynamics of the form (B.32) with the extended state $X = \begin{bmatrix} \dot{\varphi} & \varphi & \varphi\_{\text{ref}} \end{bmatrix}^\top$. This leads to a transformed system (B.32) with

$$\tilde{A}' = \begin{bmatrix} -\frac{d\_c}{\Theta\_{\text{sum}}} & -\frac{c\_c}{\Theta\_{\text{sum}}} & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \tilde{B}\_1 = \tilde{B}\_2 = \begin{bmatrix} \frac{1}{\Theta\_{\text{sum}}} \\ 0 \\ 0 \end{bmatrix}. \tag{B.35}$$

Furthermore, the transformation matrix is given by

$$T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -1 \end{bmatrix}. \tag{B.36}$$

Since our steering wheel system is stabilizable and the reference system with $H = 0$ is marginally stable, any $\beta > 0$ suffices to make the extended system stabilizable and, consequently, to make the transformation applicable. We choose a small value of $\beta = 0.01$, leading to the modified cost function

$$J\_i = \int\_0^\infty \exp(-\beta t) \left( X^\top \tilde{Q}\_i X + R\_{ii} u\_i^2 \right) \mathrm{d}t,\tag{B.37}$$

where

$$
\tilde{\mathbf{Q}}\_i = \mathbf{T}^\top \mathbf{Q}\_i \mathbf{T} = \begin{bmatrix} q\_1 & 0 & 0 \\ 0 & q\_2 & -q\_2 \\ 0 & -q\_2 & q\_2 \end{bmatrix} . \tag{\text{B.38}}
$$

Finally, we obtain a standard LQ differential game consisting of the system dynamics matrices $(\tilde{A} - 0.5\beta I, \tilde{B}\_1, \tilde{B}\_2)$ and the cost functions

$$J\_i = \int\_0^\infty X^\top \tilde{Q}\_i X + R\_{ii} u\_i^2 \ \mathrm{d}t. \tag{B.39}$$

For the solution of the inverse LQ dynamic game, parameter constraints are introduced in the corresponding optimization problems (constituting the IOC, IRL and DB approaches) such that the structure of the cost function matrix in (B.38) is ensured.
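As a numerical illustration of (B.35)–(B.39), the following sketch builds the transformed matrices with hypothetical values for $\Theta\_{\text{sum}}$, $d\_c$, $c\_c$ and the weights, and solves a single-player algebraic Riccati equation as a stand-in for the coupled game Riccati equations:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical steering parameters (Theta_sum, d_c, c_c) and weights; the shift
# A_tilde - 0.5*beta*I makes the marginally stable reference state decay.
Theta_sum, d_c, c_c, beta = 0.05, 0.2, 0.5, 0.01
A_t = np.array([[-d_c / Theta_sum, -c_c / Theta_sum, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
B = np.array([[1.0 / Theta_sum], [0.0], [0.0]])
q1, q2, R = 1.0, 2.0, 0.1
Q_t = np.array([[q1, 0.0, 0.0],
                [0.0, q2, -q2],
                [0.0, -q2, q2]])      # structure of (B.38)

A_disc = A_t - 0.5 * beta * np.eye(3)
P = solve_continuous_are(A_disc, B, Q_t, np.array([[R]]))   # single-player CARE
K = np.linalg.solve(np.array([[R]]), B.T @ P)
print(np.max(np.linalg.eigvals(A_disc - B @ K).real))       # negative: closed loop stable
```

For the actual two-player game, the coupled Riccati equations replace the single CARE, but the same transformed matrices enter the computation.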

# C Supplementary Results on the Solution Sets for Inverse Linear-Quadratic Differential Games

The following results complement the results of Section 5.3 to illustrate how the properties of an inverse LQ differential game are altered depending on the number of states, controls and players.

All results are based on the general structure of a quadratic cost function given by

$$J\_i(\mathbf{x}\_0, K, Q\_i, R\_{ij}) = \frac{1}{2} \int\_0^\infty \mathbf{x}^\top Q\_i \mathbf{x} + \sum\_{j=1}^N \mathbf{u}\_j^\top R\_{ij} \mathbf{u}\_j \ \mathrm{d}t.$$

A two-player and a three-player inverse LQ differential game are considered exemplarily.

Figures C.1 and C.2 show a 3D map for analyzing the dimensions of the matrix $M\_i$ for inverse LQ differential games with $N = 2$ and $N = 3$, respectively, with symmetric and diagonal cost function matrices and different numbers of states $n$ and controls $m\_i$. These are analogous to Figure 5.1, which showed the case $N = 1$. The number of equations (rows of $M\_i$) and the number of parameters $M\_i$ (columns of $M\_i$) are shown as a function of the number of states $n$ and the number of controls $m\_i$.

In Figures C.1a and C.2a, the number of parameters $M\_i$ is always greater than the number of equations $n\,m\_i$, such that the solution set of player $i$ is at least one-dimensional. In Figures C.1b and C.2b, we observe that there are combinations of $n$ and $m\_i$ which lead to $n\,m\_i \ge M\_i$. The red line denotes the cases where $n\,m\_i = M\_i - 1 < M\_i$, for which the kernel is guaranteed to exist and is one-dimensional. Therefore, from this line to the left, the solution set of player $i$ can be expressed by $\ker(M\_i)$, while the area on the right side of the line represents the cases where solutions may be found by applying the results of Theorem 5.3.
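The counting behind these maps can be reproduced with a few lines of code. The convention assumed here is that $M\_i$ counts all free entries of the symmetric (or diagonal) matrices $Q\_i$ and $R\_{ij}$ for every player $j$, while the number of equations is $n\,m\_i$:

```python
# Sketch of the parameter/equation count behind Figures C.1 and C.2
# (assumed convention: all free entries of Q_i and of R_ij for all j).
def num_params(n, m, symmetric=True):
    """n states; m is the list of control dimensions of all N players."""
    tri = lambda k: k * (k + 1) // 2        # free entries of a k x k symmetric matrix
    if symmetric:
        return tri(n) + sum(tri(mj) for mj in m)
    return n + sum(m)                       # diagonal Q_i and R_ij

def num_equations(n, m, i):
    return n * m[i]                         # rows of M_i for player i

# Two-player example: n = 4 states, m_1 = m_2 = 2 controls each
n, m = 4, [2, 2]
print(num_params(n, m, symmetric=True), num_equations(n, m, 0))   # more parameters than equations
print(num_params(n, m, symmetric=False), num_equations(n, m, 0))  # counts coincide here
```

Scanning `num_params` and `num_equations` over a grid of $(n, m\_i)$ reproduces the surfaces and the boundary line $n\,m\_i = M\_i - 1$ of the figures.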

(a) Symmetrical cost function matrices; (b) Diagonal cost function matrices

Figure C.1: Number of parameters and equations depending on the number of states and controls in a two-player inverse LQ differential game. The red thick line denotes the case where $n\,m\_i = M\_i - 1$.

(a) Symmetrical cost function matrices

(b) Diagonal cost function matrices

Figure C.2: Number of parameters and equations depending on the number of states and controls in a three-player inverse LQ differential game. The red thick line denotes the case where $n\,m\_i = M\_i - 1$.

# D Inverse Cooperative Dynamic Games Based on Maximum Entropy Inverse Reinforcement Learning

In this chapter, the probability density function given by (6.17) is leveraged to develop a method for identifying cost function parameters from a solution of the dynamic game in the sense of Pareto. Similar to the results of Chapter 6, the unbiasedness of the estimation is proved.

# D.1 Preliminaries

In this appendix, Pareto efficient solutions are considered which can be described by a global cost function given by the sum of weighted player cost functions. Several global cost functions are possible depending on the selected weighting parameters to build the sum (cf. Section 3.6.3). One particular global cost function is given by the sum of uniformly weighted player cost functions defined as follows.

#### Definition D.1 (Global Cost Function as Uniformly Weighted Sum)

The uniformly weighted sum of all player cost functions is given by

$$J\_{\Sigma} = \sum\_{i=1}^{N} J\_i = \sum\_{i=1}^{N} -\theta\_i^{\top} \mu\_i =: -\theta\_{\Sigma}^{\top} \mu\_{\Sigma} \tag{D.1}$$

with

$$\boldsymbol{\theta}\_{\Sigma} = \begin{bmatrix} \boldsymbol{\theta}\_{1}^{\mathsf{T}} & \dots & \boldsymbol{\theta}\_{N}^{\mathsf{T}} \end{bmatrix}^{\mathsf{T}} \tag{D.2a}$$

and

$$
\boldsymbol{\mu}\_{\Sigma} = \begin{bmatrix} \boldsymbol{\mu}\_{1}^{\mathsf{T}} & \dots & \boldsymbol{\mu}\_{N}^{\mathsf{T}} \end{bmatrix}^{\mathsf{T}}.\tag{\mathsf{D}.2b}
$$

The following assumption is introduced in order to be able to obtain Pareto efficient solutions.

#### Assumption D.1 (Convexity of the Global Cost Function)

The cost functions $J\_i$ are convex for all $i \in \mathcal{P}$.

#### Remark D.1:

It can be noted that

$$\underset{\gamma}{\arg\min}\ J\_{\Sigma}(\gamma) = \underset{\gamma}{\arg\min}\ \sum\_{i=1}^{N} \frac{1}{N} J\_{i}(\gamma), \tag{D.3}$$

where $\gamma := \{\gamma\_1, \dots, \gamma\_N\}$, holds since multiplying any cost function $J\_\Sigma$ with a constant factor $c \in \mathbb{R}^+$ (here $1/N$) does not alter the solution of the optimization problem. Therefore, under Assumption D.1 and with the results of Theorem 3.3, the minimizer of $J\_\Sigma$ describes a Pareto efficient solution of a cooperative game.
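The scaling invariance stated in Remark D.1 can be illustrated with a one-line numerical check (the cost below is an arbitrary example):

```python
from scipy.optimize import minimize_scalar

# Tiny illustration of Remark D.1: scaling a cost by c > 0 (here 1/N)
# leaves the minimizer unchanged.
J = lambda y: (y - 3.0) ** 2 + 1.0
N = 4
y1 = minimize_scalar(J).x
y2 = minimize_scalar(lambda y: J(y) / N).x
print(abs(y1 - y2))   # both minimizers coincide
```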

# D.2 Identification Method and Unbiasedness of the Estimation

Sections 6.4 and 6.5 presented how to find cost function parameters which explain observed trajectories arising from a noncooperative game with OL and FB Nash equilibrium strategies. This was done by means of an MLE based on a probability density function. This section presents a similar procedure for finding parameters which explain trajectories corresponding to a cooperative game with equally weighted cost functions as in Definition D.1.

The inverse dynamic game method is based on the density (6.17), which naturally arises from the maximum entropy principle as described in Section 6.3. The first step consists in rewriting (6.17) using (D.1) and (D.2), leading to

$$\begin{split} \operatorname{p} \left( \left. \zeta \right| \theta\_{\Sigma} \right) &= \frac{\exp \left( \theta\_{\Sigma}^{\top} \mu\_{\Sigma} \left( \zeta \right) \right)}{\int\_{\zeta} \exp \left( \theta\_{\Sigma}^{\top} \mu\_{\Sigma} \left( \zeta \right) \right) \, \mathrm{d}\zeta} \\ &= \frac{\exp \left( -J\_{\Sigma} \left( \zeta \right) \right)}{\int\_{\zeta} \exp \left( -J\_{\Sigma} \left( \zeta \right) \right) \, \mathrm{d}\zeta} . \end{split} \tag{D.4}$$

This allows the definition of a likelihood function analogous to the one introduced in Definition 6.5. In this case, we denote the likelihood as

$$\mathcal{L}(\boldsymbol{\theta}\_{\Sigma} \mid \mathcal{D}) = \prod\_{l=1}^{n\_{\ell}} \mathbf{p}\left(\tilde{\zeta}\_{l} \mid \boldsymbol{\theta}\_{\Sigma}\right), \tag{D.5}$$

where $\mathrm{p}\left(\tilde{\zeta}\_l \,\middle|\, \theta\_\Sigma\right)$ is obtained by evaluating (D.4) at $\tilde{\zeta}\_l$, $l \in \{1, \dots, n\_t\}$.

The following theorem represents the main result concerning the identification of cost functions in an inverse cooperative dynamic game with Pareto efficient solutions.

Theorem D.1 (Unbiasedness of the Identification of Pareto Efficient Solutions) Let $n\_t$ trajectories $\mathcal{D} = \{\tilde{\zeta}\_1, \dots, \tilde{\zeta}\_{n\_t}\}$ fulfilling Assumption 6.1 be available. Then, the MLE with respect to the observed trajectories, i.e.

$$\hat{\boldsymbol{\theta}}\_{\Sigma} = \underset{\boldsymbol{\theta}\_{\Sigma}}{\arg\max}\ \ln \mathcal{L}\left(\boldsymbol{\theta}\_{\Sigma} \,\middle|\, \mathcal{D}\right), \tag{D.6}$$

where $\mathcal{L}\left(\theta\_\Sigma \,\middle|\, \mathcal{D}\right)$ is obtained by evaluating the likelihood function (D.5) at $\tilde{\zeta}\_l$, $l \in \{1, \dots, n\_t\}$, leads to parameters $\hat{\theta}\_\Sigma$ such that the resulting probability density function $\mathrm{p}\left(\zeta \,\middle|\, \hat{\theta}\_\Sigma\right)$ yields an expectation of the cost function values $J\_\Sigma\left(\zeta, \theta\_\Sigma^*\right)$ equal to the one corresponding to the probability density function with the original parameters $\mathrm{p}\left(\zeta \,\middle|\, \theta\_\Sigma^*\right)$, i.e.

$$\mathbb{E}\_{\mathbf{p}\Big(\boldsymbol{\zeta}\,|\theta\_{\Sigma}^\*\big)}\left\{J\_{\Sigma}\left(\boldsymbol{\zeta},\theta\_{\Sigma}^\*\right)\right\} = \mathbb{E}\_{\mathbf{p}\Big(\boldsymbol{\zeta}\,|\hat{\theta}\_{\Sigma}\big)}\left\{J\_{\Sigma}\left(\boldsymbol{\zeta},\theta\_{\Sigma}^\*\right)\right\}.\tag{D.7}$$

#### Proof:

The proof is analogous to the proof of Theorem 6.1.

The results of Theorem D.1 imply that the expectations of the global costs (under the original parameters) produced by trajectories generated by the probability density functions with original and estimated parameters are equal. Note that this result is weaker than the one required in (6.7), as it considers only the overall costs. Nevertheless, for a cooperative game, it is enough to describe observed trajectories completely.

#### Remark D.2:

Similar to the results of Chapter 6, solving the optimization problem (D.6) demands the possibility of evaluating the likelihood function $\mathcal{L}\left(\theta\_\Sigma \,\middle|\, \mathcal{D}\right)$, and therefore the probability density function (D.4), at the trajectories $\tilde{\zeta}\_l$, $l \in \{1, \dots, n\_t\}$. The denominator in (D.4) includes an integral over all trajectories $\zeta$ which are feasible with respect to the system dynamics and an initial state. An approach analogous to the one presented in Section 6.6 can be applied in this case.

#### Remark D.3:

The result $\hat{\theta}\_\Sigma$ of (D.6) contains the cost function parameters of all players in one single vector according to (D.2). Assuming that the number of features $M\_i$ is known for every player $i \in \mathcal{P}$, an individual parameter set $\hat{\theta}\_i$ can be determined from $\hat{\theta}\_\Sigma$ by means of (D.2a). This is done by using the relation

$$\hat{\boldsymbol{\theta}}\_{i} = \hat{\boldsymbol{\theta}}\_{\Sigma}(l\_{i}^{s} : l\_{i}^{e}) \tag{D.8}$$

with

$$l\_i^s = 1 + \sum\_{\alpha=1}^{i} M\_{\alpha-1} \quad \text{and} \quad l\_i^e = \sum\_{\alpha=1}^{i} M\_{\alpha}, \tag{D.9}$$

with $M\_0 = 0$ and where $\hat{\theta}\_\Sigma(l\_i^s : l\_i^e) \in \mathbb{R}^{l\_i^e - l\_i^s + 1}$ denotes the vector that contains the entries $l\_i^s$ to $l\_i^e$ of the vector $\hat{\theta}\_\Sigma$.
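The slicing (D.8)–(D.9) amounts to partitioning the stacked vector by cumulative feature counts, e.g.:

```python
import numpy as np

# Sketch of (D.8)-(D.9): slice the stacked estimate theta_Sigma back into the
# individual player parameter vectors, given the feature counts M_i.
def split_params(theta_sigma, M):
    out, start = [], 0
    for M_i in M:                  # start index is l_i^s - 1 = sum of previous M_alpha
        out.append(theta_sigma[start:start + M_i])
        start += M_i
    return out

theta_sigma = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
theta_1, theta_2 = split_params(theta_sigma, M=[2, 3])
print(theta_1.tolist(), theta_2.tolist())   # first two entries, then the remaining three
```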

The presented method is capable of identifying cost function parameters which explain trajectories corresponding to an optimal solution based on uniformly weighted player cost functions, which is one of the Pareto efficient solutions belonging to the Pareto frontier. Pareto efficient solutions can also be obtained by minimizing a weighted sum of the cost functions of all players which is not necessarily uniformly weighted (see Definition 3.9). In this scenario, the presented method would not be able to estimate the original parameters $\theta\_i^*$, but it would still determine parameters $\hat{\theta}\_i$ which are capable of describing the trajectories. A simulation example demonstrating the effectiveness of the presented inverse dynamic game method can be found in [IBKH20].

# E Supplementary Simulation Results

This chapter gives supplementary results of the simulative evaluation of the inverse dynamic game methods performed in Chapter 7.

# E.1 Inverse Nonlinear Open-Loop Dynamic Game

#### E.1.1 Robustness to Measurement Noise

Figures E.1 to E.4 show the trajectory estimation results for different SNR values of the observed trajectory used for the inverse dynamic game methods. The estimated trajectories are determined by solving the dynamic game with the parameters $\hat{\theta}\_i$, $i \in \mathcal{P}$, identified by each of the considered methods, i.e. the parameters from Tables 7.3, 7.5 and 7.7 are used. The noise-free case is presented in Figure 7.4 and the 30 dB results are shown in Figure 7.8.

Figure E.1: Observed trajectories and estimations based on mean identification results of all methods, SNR = 20 dB

Figure E.2: Observed trajectories and estimations based on mean identification results of all methods, SNR = 25 dB

Figure E.3: Observed trajectories and estimations based on mean identification results of all methods, SNR = 35 dB

Figure E.4: Observed trajectories and estimations based on mean identification results of all methods, SNR = 40 dB

#### E.1.2 Robustness to Basis-Function Mismatch

Figures E.5 and E.6 show the comparison of the trajectories which result from the dynamic games solved with the parameters $\hat{\theta}\_i$, $i \in \mathcal{P}$, identified by each of the considered methods. The identification is based on the observed trajectories generated in Section 7.4.1 and different basis functions (cases II and III) as given in Table 7.9.

# E.2 Inverse LQ Feedback Differential Game

#### E.2.1 Robustness to Measurement Noise

The following Tables E.1 to E.6 list the mean values of the identified parameters corresponding to the matrices $\hat{Q}\_i$ and $\hat{R}\_{ii}$, $i \in \mathcal{P}$, over all 100 identification procedures conducted in Section 7.5.3, where the robustness of the inverse dynamic game methods to measurement noise is evaluated.

The following Figures E.7 to E.10 show the comparison of the trajectories which result from the dynamic games solved with the mean of the parameters $\hat{\theta}\_i$, $i \in \mathcal{P}$, identified by each of the considered methods and based on the observed trajectories with different SNR values. The noise-free case is presented in Figures 7.11 and 7.12 and the 20 dB results are shown in Figures

Figure E.5: Inverse open-loop dynamic game results for all methods in the basis function mismatch case II.

Figure E.6: Observed trajectories and estimations based on identification results of all methods in the basis function mismatch case III.


Table E.1: Mean values of the cost function matrices $Q\_i$ identified with IOC

Table E.2: Mean values of the cost function matrices $R\_{ii}$ identified with IOC


Table E.3: Mean values of the cost function matrices $Q\_i$ identified with IRL


Table E.4: Mean values of the cost function matrices $R\_{ii}$ identified with IRL


7.15 and 7.16. As can be inferred from Figures E.9 and E.10, the trajectory comparisons for the 35 dB and 40 dB cases yield no visually recognizable improvement. These are not


Table E.5: Mean values of the cost function matrices $Q\_i$ identified with the DB method

Table E.6: Mean values of the cost function matrices $R\_{ii}$ identified with the DB method


explicitly shown here, as the results are practically identical to the noise-free case from Figures 7.11 and 7.12.

#### E.2.2 Robustness to Basis Function Mismatch

Figures E.11 to E.16 show the comparison of the observed trajectories with the ones which result from the dynamic games solved with the parameters $\hat{\theta}\_i$, $i \in \mathcal{P}$, identified by each of the considered methods. The identification is based on the observed trajectories generated in Section 7.4.1 and incomplete basis functions (cases II to IV) as given in Table 7.17.

Figure E.7: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 25 dB.

Figure E.8: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 25 dB.

Figure E.9: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 30 dB.

Figure E.10: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted using noise-corrupted trajectories with SNR = 30 dB.

Figure E.11: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{3, 4\}$ (case II).

Figure E.12: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{3, 4\}$ (case II).

Figure E.13: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{2, 3, 4\}$ (case III).

Figure E.14: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_{i,(j,j)} = 0$ for $i \in \{1, 2\}$ and $j \in \{2, 3, 4\}$ (case III).

Figure E.15: Ground truth and estimated state trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_i = 0$ for $i \in \{1, 2\}$ (case IV).

Figure E.16: Ground truth and estimated control trajectories of the inverse LQ feedback dynamic game with each method. The identification was conducted with the wrong assumption that $Q\_i = 0$ for $i \in \{1, 2\}$ (case IV).

# F Supplementary Results of the Application in Shared Control

This section gives further information on the results of Chapter 8.

# F.1 Further Details on the Experimental Setup

This section provides details on the parameters of the two steering wheels and on the developed control structure which realizes their virtual coupling.

#### F.1.1 Steering Wheel Parameters

The following Table F.1 lists the parameters of the two steering wheels belonging to the cooperative steering system.


#### F.1.2 Steering Wheel Coupling Control

The steering wheels were coupled using a control algorithm which emulates a spring-damper element between them. This kind of coupling was first presented in [LDFH14], where it was also used in a study for analyzing haptic interaction between humans.

Figure F.1 shows the control loop of the cooperative steering system. The controller calculates a torque $M(t)$ which is equally distributed on each steering wheel. The aim is to regulate the difference $e\_e(t) = e\_{\text{des}} - e\_{\text{meas}}(t)$ towards zero, where $e\_{\text{meas}}(t) = \varphi\_1(t) - \varphi\_2(t)$ is the difference of the measured steering angles and $e\_{\text{des}} = 0$ is the desired angle difference.

Figure F.1: Control structure of the cooperative steering system

The controller $R(s)$ is designed as a proportional-derivative (PD) controller with a low-pass filter positioned before the derivative part in order to suppress measurement noise. The structure of the PD controller is illustrated in Figure F.2. The controller behavior is defined by the transfer function

$$R(\mathbf{s}) = \frac{M(\mathbf{s})}{E(\mathbf{s})} = \frac{\mathbf{s}K\_D}{1 + T\_P \mathbf{s}} + K\_P,\tag{\text{F.1}}$$

where $M(s)$ and $E(s)$ denote the Laplace transforms of the torque $M(t)$ and the control error $e\_e(t)$, respectively. The variables $K\_P$ and $K\_D$ denote the coefficients of the proportional and the derivative terms, respectively, and $T\_P$ denotes the time constant of the first-order lag filter. The values of these parameters are given in Table F.2.
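A discrete-time sketch of the filtered PD law (F.1), obtained via a backward-Euler discretization of the derivative filter, is given below; the gains and sampling time are placeholders, not the values of Table F.2:

```python
# Backward-Euler discretization of R(s) = K_P + K_D*s/(1 + T_P*s);
# the filtered-derivative state obeys T_P * d_dot + d = K_D * e_dot.
K_P, K_D, T_P, dt = 5.0, 0.2, 0.01, 0.001   # placeholder gains and sample time

def make_pd():
    state = {"d": 0.0, "e_prev": 0.0}
    def step(e):
        # d_k * (T_P + dt) = T_P * d_{k-1} + K_D * (e_k - e_{k-1})
        d = (T_P * state["d"] + K_D * (e - state["e_prev"])) / (T_P + dt)
        state["d"], state["e_prev"] = d, e
        return K_P * e + d                  # torque M = proportional + filtered derivative
    return step

pd = make_pd()
torques = [pd(e) for e in (1.0, 1.0, 1.0)]  # step input: the derivative part decays
print(torques)                              # decreasing towards K_P * e
```

The filter time constant $T\_P$ trades off noise suppression against phase lag of the derivative action, which matters for the haptic transparency of the coupling.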

Figure F.2: PD controller used for the coupling of the steering wheels


Table F.2: PD controller parameters

# F.2 Supplementary Tables of the Shared Control Identification Results

The following Tables F.3 to F.5 list the trajectory estimation errors for all subject pairs obtained with the IOC, IRL and DB methods, respectively.


Table F.3: Cooperative steering experiment: Error between measured trajectories and trajectories obtained with the IOC method

Table F.4: Cooperative steering experiment: Error between measured trajectories and trajectories obtained with the IRL method


| Subject pair | $\delta^{x}$ | $\delta^{u_1}$ | $\delta^{u_2}$ |
|---|---|---|---|
| 1 | 44.224 | 57.090 | 55.853 |
| 2 | 101.985 | 61.651 | 77.223 |
| 3 | 73.849 | 67.249 | 53.884 |
| 4 | 96.832 | 102.442 | 69.903 |
| 5 | 77.725 | 66.479 | 67.224 |
| 6 | 85.592 | 60.670 | 73.956 |
| 7 | 116.169 | 105.058 | 78.649 |
| 8 | 92.483 | 70.193 | 70.751 |
| 9 | 91.977 | 62.021 | 78.398 |
| 10 | 68.201 | 76.293 | 64.553 |
| 11 | 89.672 | 80.611 | 120.724 |
| 12 | 94.118 | 81.256 | 87.326 |
| 13 | 58.457 | 76.280 | 38.762 |
| 14 | 85.817 | 93.916 | 83.197 |
| 15 | 83.445 | 85.230 | 59.801 |
| 16 | 107.501 | 58.896 | 86.753 |
| 17 | 99.877 | 70.717 | 70.081 |
| 18 | 79.049 | 75.170 | 82.617 |
| 19 | 73.561 | 54.234 | 89.718 |
| 20 | 87.378 | 59.884 | 85.506 |
| 21 | 74.819 | 53.756 | 83.710 |
| 22 | 143.719 | 76.626 | 57.984 |
| 23 | 95.018 | 107.129 | 73.908 |
| 24 | 100.738 | 86.762 | 63.648 |
| 25 | 95.582 | 114.574 | 71.243 |
| mean | 88.711 | 76.168 | 73.815 |
| median | 89.672 | 75.170 | 38.762 |
| SD | 19.372 | 17.459 | 15.723 |
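As a consistency check, the column statistics of the table above can be recomputed with the Python standard library; the sketch below does this for the $\delta^{x}$ column (it is not part of the original evaluation pipeline, and small deviations in the last digit may occur due to rounding in the tabulated values).

```python
from statistics import mean, median, stdev

# delta^x values of the 25 subject pairs from the table above
delta_x = [44.224, 101.985, 73.849, 96.832, 77.725, 85.592, 116.169,
           92.483, 91.977, 68.201, 89.672, 94.118, 58.457, 85.817,
           83.445, 107.501, 99.877, 79.049, 73.561, 87.378, 74.819,
           143.719, 95.018, 100.738, 95.582]

print(mean(delta_x))    # sample mean, approx. 88.711
print(median(delta_x))  # sample median
print(stdev(delta_x))   # sample standard deviation
```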

Table F.5: Cooperative steering experiment: Error between experimentally measured trajectories and trajectories obtained with the DB approach

#### F.2.1 Statistical Test Results

The following Tables F.6 and F.7 give the p-values of the right-tailed Wilcoxon signed-rank tests conducted on the data sets of the NSAE of the states and controls, respectively. In Table F.6, the null hypothesis is always rejected at a significance level of $\alpha = 0.01$. Due to the right-tailed test, the alternative hypothesis holds, which states that $\delta^{x}_{\mathrm{median,row}} - \delta^{x}_{\mathrm{median,column}} > 0$. The same holds for Table F.7, with the exception of the NSAE of the controls obtained by the IOC and IRL methods: here, the null hypothesis $H_0$ cannot be rejected and thus their difference is not statistically significant.
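For illustration, a right-tailed Wilcoxon signed-rank test of the kind used here can be sketched in pure Python with the normal approximation of the test statistic, which is reasonable for the 25 subject pairs of this study. This is a simplified sketch on synthetic data; the actual evaluation would typically rely on a statistics package (e.g. SciPy's `wilcoxon` or MATLAB's `signrank`), and the simple approximation below omits zero- and tie-corrections of the variance.

```python
from statistics import NormalDist


def wilcoxon_right_tailed(diffs):
    """Right-tailed Wilcoxon signed-rank test (normal approximation).

    diffs: paired differences, e.g. delta_row - delta_column per subject.
    Returns (W+, p) where W+ is the sum of ranks of positive differences
    and p is the right-tailed p-value under H0 (zero-median distribution).
    """
    d = [x for x in diffs if x != 0]          # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                               # average ranks over ties
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    # mean and standard deviation of W+ under H0
    mu = n * (n + 1) / 4
    sigma = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mu) / sigma
    return w_plus, 1 - NormalDist().cdf(z)


# synthetic example: all differences positive, so H0 should be rejected
w, p = wilcoxon_right_tailed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A small p-value leads to rejecting $H_0$ in favor of the alternative that the differences come from a distribution with a median greater than zero, mirroring the interpretation of Tables F.6 and F.7.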

Table F.6: p-values of the Wilcoxon signed-rank test with $H_0$: "$\delta^{x}_{\mathrm{median,row}} - \delta^{x}_{\mathrm{median,column}}$ comes from a distribution with zero median".


Table F.7: p-values of the Wilcoxon signed-rank test with $H_0$: "$\delta^{u}_{\mathrm{median,row}} - \delta^{u}_{\mathrm{median,column}}$ comes from a distribution with zero median" (\*\* denotes the failure of hypothesis rejection).


# References

# Public References













# Own Publications and Conference Contributions


# Workshop Contributions


# Supervised Theses


#### **Karlsruher Beiträge zur Regelungs- und Steuerungstechnik (ISSN 2511-6312) Institut für Regelungs- und Steuerungssysteme**


Band 09 Ludwig, Julian Automatisierte kooperative Transition einer Regelungsaufgabe zwischen Mensch und Maschine am Beispiel des hochautomatisierten Fahrens. ISBN 978-3-7315-1069-7

Band 10 Inga Charaja, Juan Jairo Inverse Dynamic Game Methods for Identification of Cooperative System Behavior. ISBN 978-3-7315-1080-2


Inverse Dynamic Game Methods for Identification of Cooperative System Behavior

The theory of dynamic games is an effective approach for modeling and analyzing interactions between decision makers or players in dynamic processes. However, in order to use this theory in real applications, the possibility of quick identification of player objectives based on previously observed actions and interaction results is crucial. This identification problem, which generalizes the inverse problem of optimal control to a multi-agent case, is called inverse dynamic game.

This work introduces a formalization of inverse open-loop and feedback dynamic games, including a specialization as a parameter identification problem and corresponding solution methods. The first method class exploits automatic control techniques. Given the inherent ill-posedness of the problem, sufficient conditions for unique solutions are developed. For the widespread class of linear-quadratic dynamic games, explicit solution sets characterizing all possible inverse solutions are stated. A second method class based on inverse reinforcement learning techniques from computer science is also proposed, allowing the problem to be solved with a probabilistic approach that leads to a maximum-likelihood estimation. The proposed methods are verified and compared to a state-of-the-art approach using both simulated and experimental data, confirming their high computational efficiency.

The results of this work provide a theoretical contribution to inverse optimal control and inverse dynamic games as well as practical methods for efficient player objective estimation, applicable e.g. to human behavior identification in human-machine cooperation and to multi-agent learning by demonstration.
