# **Self-Learning Longitudinal Control for On-Road Vehicles**

by Luca Puccetti

Karlsruher Beiträge zur Regelungs- und Steuerungstechnik
Karlsruher Institut für Technologie
Band 22

Karlsruher Institut für Technologie
Institut für Regelungs- und Steuerungssysteme

Dissertation accepted by the KIT Department of Electrical Engineering and Information Technology of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doktor der Ingenieurwissenschaften

by Luca Puccetti, M.Sc.

Date of oral examination: 17 February 2023
First reviewer: Prof. Dr.-Ing. Sören Hohmann
Second reviewer: Prof. Dr.-Ing. Jürgen Adamy

**Imprint**

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2023. Printed on 100 % recycled paper certified with the "Der Blaue Engel" ecolabel.

ISSN 2511-6312 ISBN 978-3-7315-1290-5 DOI 10.5445/KSP/1000156966

# **Abstract**

Advanced driver assistance systems are an important selling point for vehicles, but they require high development effort. For longitudinal control, a common foundation for driver assistance systems, tuning requires time and effort to balance accuracy and passenger comfort. Reinforcement learning is a promising approach for automating this, but until now has mostly been applied to simulated examples that provided ideal conditions and nearly infinite training time.

Among the major challenges for applying reinforcement learning to longitudinal control in a real vehicle are partially observed dynamics and tracking control. To be applicable in real-world settings, the learning process for the optimal controller needs to converge within minutes. On top of that, the target speed can change arbitrarily in this use case, which is challenging to solve with reinforcement learning.

This work proposes two computationally lightweight reinforcement learning algorithms that address these issues. First, a model-free algorithm is introduced. It is based on the actor-critic architecture and employs a special structure in the state-action value function approximator to handle the partially observed system. In addition, a feedforward speed tracking controller is learned using a projection and training data manipulation.

The second proposal within this work is a model-based algorithm built on policy search. It is accompanied by an automated inversion-based feedforward controller design method.

The proposed algorithms are compared across a series of scenarios in a real vehicle while learning online, i.e. while driving in closed loop. While the algorithms react slightly differently to choices of exploration noise, both learn robustly and quickly and are able to adapt to different operating points, e.g. speeds and gears, even when faced with disturbances during training. To the author's knowledge, this is the first application of reinforcement learning that successfully learns online in a real vehicle.

# **Zusammenfassung**

Advanced driver assistance systems are an important selling point for passenger cars, but they demand high development costs. In particular, the parameter tuning for longitudinal control, an important building block of driver assistance systems, requires considerable time and money to strike the right balance between passenger comfort and control performance. Reinforcement learning appears to be a promising approach for automating this. So far, however, this class of algorithms has predominantly been applied to simulated tasks that take place under ideal conditions and allow nearly unlimited training time.

Among the greatest challenges for applying reinforcement learning in a real vehicle are trajectory tracking control and incomplete state information due to only partially observed dynamics. Furthermore, an algorithm that is applied to real systems must arrive at a result within minutes. In addition, the control target can change arbitrarily at runtime, which poses an additional difficulty for reinforcement learning methods.

This work presents two algorithms that require little computational power and overcome these hurdles. On the one hand, a model-free reinforcement learning approach is proposed that is based on the actor-critic architecture and employs a special structure in the state-action value function in order to be applicable to partially observed systems. To learn a feedforward controller, an approach relying on a projection and training data manipulation is proposed.

On the other hand, a model-based algorithm based on policy search is proposed. It is complemented by an automated design method for an inversion-based feedforward controller.

The proposed algorithms are compared in a series of scenarios in which they learn online, i.e. while driving in closed loop, in a real vehicle. Although the algorithms react somewhat differently to different boundary conditions, both learn robustly and quickly and are able to adapt to different operating points, such as speeds and gears, even when disturbances act during training. To the best of the author's knowledge, this is the first successful application of a reinforcement learning algorithm that learns online in a real vehicle.

# **1 Introduction**

### **1.1 Motivation**

Do you still remember your first driving lesson? Back then, even basic tasks like keeping a constant speed seemed complicated. Around 2016, experts predicted that adolescents would soon be spared even these challenges: they expected driver's licences to become obsolete within a few years, since autonomous vehicles on public roads would become mainstream. While this excitement has cooled off considerably [135] and fully autonomous driving has entered the "trough of disillusionment" [45], advanced driver assistance systems are steadily entering mass markets. On the one hand, this is due to regulatory demands [163]; on the other hand, many new car buyers choose these systems as an option, especially in rapidly growing markets such as China [116]. Forecasts assume that 85% of cars built in 2025 will be equipped with advanced driver assistance systems [134]. This trend aims not only to increase driving comfort, but also to enhance safety [37].

In the long run, these developments aim at gradually making human supervision superfluous [3], even though this goal is still more than a decade away according to experts [135].

Advanced driver assistance systems – be it a simple one like advanced cruise control or more advanced ones – generally consist of a chain of effects that covers [160, Fig. 7.1]:

- sensing,
- perception,
- planning, and
- control.
These components become increasingly complex the more advanced a driver assistance system becomes. While early assistance systems limited themselves to speed control in free-flowing traffic, expanding the state of the art today requires operation in dense, multi-modal traffic and possibly less structured environments [163, p. 1576] without relying on a driver for supervision [44]. This requires highly accurate sensing, perception, planning and control, which stifled the optimism soon after car makers embarked on development efforts toward fully autonomous vehicles. In fact, experts assume that 45% of automotive software development cost will be spent on efforts towards this goal by 2030 [3]. However, recent studies concluded that customers are not willing to pay high premiums for assistance systems [116, 117], an effect that has been aggravated by the COVID-19 pandemic suffocating the world economy from 2020 on [116].

For cost-effective software development, modularity and code re-use are key. While this works for some of the more abstract components such as planning, these methods are not applicable to control. Control design entails devising a controller structure, which can generally be transferred between systems or vehicle models, and also tuning the controller. The tuned parameters are usually specific to a system and must therefore be adapted if a new vehicle model is designed or a vehicle configuration is added, and sometimes even during development if low-level controllers change their dynamics or vehicle parameters change.

The parameters are usually optimized manually to maximize a (subjective) rating scale [64, 65, 33, 75] combining a perception of accuracy, comfort and safety. Optimizing parameters in test vehicles manually is a time-consuming (and boring) process for engineers, which, if avoided, could help reduce development cost for modern driver assistance systems.

Controllers that tune their parameters according to the environment are known as adaptive controllers, see [88] for an overview. In this context, we are looking for adaptive control that optimizes a performance criterion, a task that has been researched under the name of reinforcement learning (RL). RL is a technique that learns optimal controllers through trial and error. Despite impressive displays of its potential in simulated tasks, it has not yet been widely applied to real systems, since it is known not to be sample-efficient, struggles with partially observed, noisy systems and is typically computationally heavy [79, 32].

This work addresses a part of this problem by enabling RL for speed tracking control. For this, it extends the capabilities of RL to enable online learning in a real vehicle, i.e. the gain is adjusted by RL while the controller is following a varying speed. Algorithms following two different RL paradigms are presented: model-based (MB) and model-free (MF). Contrary to previous algorithms, both run online on limited hardware, are robust towards disturbances and delays in the real vehicle, and are able to track arbitrarily time-varying setpoints. In addition to the feedback controller, two methods to incorporate a feedforward controller<sup>1</sup> are presented. In experiments on a real vehicle these RL algorithms are the first to ever successfully, automatically and repeatably tune controllers on a real vehicle within a few minutes, proving the fitness of the proposed solutions for day-to-day engineering practice. The results

<sup>1</sup> In combination, feedback and feedforward controller are known as a 2-degree-of-freedom controller, as feedforward dynamics can be designed independently of feedback dynamics, i.e. responses to setpoint changes and disturbances can be influenced separately.

show that learning is robust and simultaneously adapts to changing conditions such as different speeds or disturbance levels.

The rest of this chapter introduces the problem of speed tracking control in more detail, placing it in the context of other works and stating our contribution; see Fig. 1.1 for an overview. Chapter 2 introduces groundwork and notation for parameter optimization and RL for continuous control, upon which we build our proposed algorithms in Chapter 3. Experimental validation in a range of experiments in a real car is given in Chapter 4 and discussed in Chapter 5.

### **1.2 Problem Description**

In this section we introduce the technical environment of our controller, define the learning objective and the desired controller structure, and introduce the constraints of our problem.

#### **1.2.1 Plant**

Traditionally, control for road vehicles is divided into longitudinal and lateral control [160, p. 44]. In this work, we focus on the former. This section provides the context to the controller to be designed by giving an overview of the plant structure and controller interface, see Fig. 1.2 for a schematic overview. Additionally, a simplified model of the plant dynamics is derived that is later used to give an intuition of the effect of hyperparameters, i.e. parameters that influence the learning behavior and result, and illustrate some of the algorithm design choices.

This work designs a speed tracking controller that aims to follow a target speed *y*ˆ based on measurements of the current speed *y* by demanding an acceleration *u* from a low-level controller. Trajectory planning is outside the scope of this work, see e.g. [123, 53] for state-of-the-art approaches<sup>2</sup> . We assume a trajectory that can be previewed over a limited horizon but is otherwise arbitrary. The low-level controller tries to achieve steady-state accuracy of acceleration and thus compensate for steady-state resistances like road grade or aerodynamic drag using an integrator. It allocates controls to the actuator subsystems of longitudinal dynamics, i.e. the braking system and powertrain, and makes use of the drag torque of the engine. Commands to the powertrain are filtered depending on vehicle speed to prevent excessive wear on the variable valve timing assembly while allowing precise parking maneuvers. The two actuator systems differ in their lag time and dynamic response: the brake system reacts more promptly to torque demands than the powertrain. Both are affected

<sup>2</sup> We limit ourselves to speed tracking control to limit the complexity of the learning problem, although traditional control approaches often also include position, acceleration, and sometimes even additional derivatives of acceleration in their planning and control.

**Figure 1.1:** Overview of the remaining introduction chapter: first, Section 1.2 provides a more detailed description of the problem. It begins with an introduction of the plant, i.e. longitudinal vehicle dynamics with its respective interface. The controller structure closing the loop with the plant is presented next; it includes the parameters that we set out to learn automatically in this work. Here, learning entails optimizing the controller parameter set; therefore, the definition of optimality is covered in the following part. The problem definition section is completed by constraints on the learning and control task this work strives to solve. With the task defined, potential approaches are surveyed in Section 1.3. The section starts with classic (non-learning) controller design methods. Next, it glances over model-free RL methods, which are discussed in more depth later, and gives an intuition of their relation to model-based RL. Finally, the introductory chapter concludes with a summary and sketches the research gap this work aims to close. At the end of Chapter 2 the description of the research gap is revisited once the necessary concepts and notation have been introduced.

**Figure 1.2:** Overview of the plant structure. The goal of this work is to design the learning controller (marked in green) that is fed a measurement of the speed *y* and a target speed *y*ˆ, possibly with a preview. It outputs a commanded acceleration *u* that serves as an input to the low-level controller. The low-level controller tries to provide steady-state accuracy on acceleration level and allocates controls to the actuators. Longitudinal dynamics consists of resistances, inertia, hydraulic braking system, a powertrain containing a combustion engine, a hydraulic torque converter with lock-up clutch and an automatic transmission. Resistances include internal resistances, rolling resistance, aerodynamic drag and road grade. The sum of these elements forms the acceleration that is then integrated to vehicle speed. Speed is measured by wheel speed sensors and an evaluation logic.

by communication delays and are controlled in a purely feedforward fashion. Disturbances like road grade and driving resistances, consisting of aerodynamic drag, rolling resistance and internal resistances, as well as inertia additionally affect vehicle acceleration. Vehicle acceleration is integrated to vehicle speed, which is measured using wheel speed sensors and an evaluation logic: wheel speed sensors measure rotation increments of a wheel and are inherently discretized (see e.g. [163, p. 288]). The evaluation logic detects slipping wheels, chooses the most appropriate evaluation method for the operating conditions and returns a quantized, slightly delayed speed signal.

While this work does not assume that a valid model of the plant exists, most building blocks would require a complex model with unobserved internal states and possibly unknown parameters:

- the low-level controller with its integrator state and speed-dependent command filtering,
- the braking system and the powertrain with their differing lag times and communication delays,
- the driving resistances, i.e. road grade, aerodynamic drag, rolling resistance and internal resistances, and
- the wheel speed sensors with their evaluation logic.
The problem at hand therefore features a combination of unknown states and parameters for a complex dynamic system, which prevents the use of model-based observers. Learning controllers as investigated in this work could circumvent this issue.

Since the real plant poses challenges beyond a simulated environment, we will not limit ourselves to learning in simulation. Nonetheless, we provide a simplified linear model in the box below, which we will use to illustrate some characteristics of the proposed algorithms.

#### **Linear Model for Plant Dynamics**

The following presents a simplified model that disregards nonlinear effects, e.g. due to torque magnification at low speeds or asymmetry between braking system and powertrain. It serves as an example for development of the proposed approaches.

For this, the low-level controller is assumed to be ideal and able to compensate resistances. The remaining longitudinal dynamics, consisting of actuator dynamics and inertia, are modeled as a critically damped second-order system with time constant *T*acc = 0.1 s. A similar structure is chosen in [2, 123]. The acceleration is integrated to vehicle speed, leading to a (continuous-time) third-order system

$$
\dot{x} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & -\frac{1}{T_{\rm acc}^2} & -\frac{2}{T_{\rm acc}} \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ \frac{1}{T_{\rm acc}^2} \end{bmatrix} u \tag{1.1}
$$

with delay

$$\tilde{x}(T) = x(T - 80\,\mathrm{ms}).\tag{1.2}$$

*T* is the continuous time coordinate; the speed measurement is based on the delayed state $\tilde{x}$. The system is discretized*<sup>a</sup>* with a step size of ∆*t* = 20 ms. The measurement of speed is corrupted by noise *ξ* drawn from a normal distribution with standard deviation*<sup>b</sup>* *ν*<sup>y</sup> = 0.01. This yields a discrete-time system of order 7. With index *t* as the discrete time step we write

$$
\begin{aligned}
x_{t+1} &= A\, x_t + B\, u_t \\
y_t &= C\, x_t + \xi \quad \text{with} \\
A &= \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0.02 & 0.0002 \\
0 & 0 & 0 & 0 & 0 & 0.9825 & 0.0164 \\
0 & 0 & 0 & 0 & 0 & -1.6375 & 0.6550
\end{bmatrix}, \quad
B = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0.0001 \\ 0.0175 \\ 1.6375 \end{bmatrix}, \\
C &= \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}.
\end{aligned} \tag{1.3}
$$

In the examples given in the course of this thesis the matrices *A*, *B* and *C* are considered unknown in size and value. This model is used for experimental pre-studies in Chapter 3 and in Appendix A.3.

*<sup>a</sup>* The discrete-time state *x* has a different order than its continuous-time counterpart for ease of notation.

*<sup>b</sup>* The influence of the noise amplitude is investigated in Appendix A.3.5.
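For readers who want to experiment with the model, the box above translates directly into a few lines of code. The following is a minimal NumPy sketch of system (1.3); the helper function `step` and the example rollout are illustrative additions, not part of the thesis, and the quantization of the speed signal is neglected here just as in the model.

```python
import numpy as np

# State ordering as in (1.3): four delay states realizing the 80 ms output
# delay, followed by speed, acceleration and its derivative (20 ms steps).
A = np.array([
    [0, 1, 0, 0, 0, 0,       0     ],
    [0, 0, 1, 0, 0, 0,       0     ],
    [0, 0, 0, 1, 0, 0,       0     ],
    [0, 0, 0, 0, 1, 0,       0     ],
    [0, 0, 0, 0, 1, 0.02,    0.0002],
    [0, 0, 0, 0, 0, 0.9825,  0.0164],
    [0, 0, 0, 0, 0, -1.6375, 0.6550],
])
B = np.array([0, 0, 0, 0, 0.0001, 0.0175, 1.6375])
C = np.array([1, 0, 0, 0, 0, 0, 0])

rng = np.random.default_rng(seed=0)

def step(x, u, noise_std=0.01):
    """One 20 ms step: next state and noisy, delayed speed measurement."""
    x_next = A @ x + B * u
    y = C @ x + rng.normal(0.0, noise_std)
    return x_next, y

# Example: response to a constant commanded acceleration of 1 m/s^2.
x = np.zeros(7)
for _ in range(50):            # one second of simulated driving
    x, y = step(x, 1.0)
```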

The next section closes the loop between commanded acceleration and measured speed by introducing the controller.

#### **1.2.2 Controller Structure**

This work aims to automatically adapt the parameters of an existing output feedback controller<sup>3</sup>, which is introduced in this section. Its foundations were laid by Adiprasito [2, Sec. 4.3] and significantly reworked by Rathgeber [123, Sec. 5.3.2]<sup>4</sup>. The controller concept comprises an elaborate feedforward, measures to counter stationary disturbances, and position control in both longitudinal and lateral direction. In this work, we make use of the existing low-level controller to counter stationary acceleration disturbances, but otherwise limit our scope to speed control.

The controller to be learned is divided into feedforward and feedback parts<sup>5</sup>:

$$
u_t = u_{\mathrm{ff},t} + u_{\mathrm{fb},t}. \tag{1.4}
$$

For the sake of interpretability, both are chosen to be structures with tunable parameters that are well known in classic control theory. This work presents two options for the structure of the learned feedforward:

- a feedforward learned within the MF framework using a projection and training data manipulation (Section 3.1.2), and
- an inversion-based feedforward designed automatically from the learned model within the MB framework (Section 3.2.2).
The feedback controller is designed as an output feedback controller as in [2]:

$$
u_{\mathrm{fb},t} = \theta_{\mathrm{c}} \left( \hat{y}_t - y_t \right), \tag{1.5}
$$

with the (to-be-learned) controller gain *θ*c. We assume that the plant is stabilizable (see e.g. [85]) using output control. The optimal output feedback controller depends on the initial state (distribution) of the plant [90]. This is shown using the example system (1.3) in Appendix A.5.
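As a sketch, the complete control law (1.4)/(1.5) then amounts to a single line of code; the function name `control` and the feedforward argument `u_ff` are illustrative placeholders for the structures discussed above:

```python
def control(theta_c, y_hat, y, u_ff=0.0):
    """2-degree-of-freedom control law (1.4)/(1.5): feedforward u_ff plus
    proportional output feedback with the to-be-learned gain theta_c."""
    return u_ff + theta_c * (y_hat - y)
```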

Next, we turn our attention to the goal we want to achieve by tuning the parameter *θ*c.

<sup>3</sup> This choice enables us to compare the learned result with an expert's tuning in Chapter 4. Future work may entail searching for more advanced learning structures that e.g. integrate more of the low-level controllers.

<sup>4</sup> In production vehicles, the controller schedules different gains according to vehicle speed.

<sup>5</sup> During learning, we add exploration noise as a third component.

#### **1.2.3 Definition of Optimality**

In this section the optimization criterion for the learning controller is introduced. After a brief overview of the key optimization aspects, the criterion is presented in an RL-compatible form alongside its configuration possibilities.

The presented criterion applies specifically to the feedback part of the controller. The feedforward part of the controller does not have to be optimal, since optimization has already been accounted for in the planning of the trajectory. The sole criterion for the feedforward component is therefore accuracy. The following applies to the feedback component.

Most important to tracking a planned trajectory is accuracy, but this needs to be balanced with comfort in most scenarios<sup>6</sup> . Traditionally, this parameter tuning goal is encoded in a subjective rating scale. It aims to capture an experienced passenger's perception of comfort and safety [64, 33, 65, 75]. These two aspects are linked according to [38]. Controllers for driver assistance systems are therefore tuned to be frugal in their outputs, since brisk actions are not only perceived as uncomfortable but may seem erratic and suggest lacking safety of the system. Commonly, the magnitude of acceleration is deemed to have predominant influence on ride comfort in the literature, i.e. to maximize comfort, acceleration should stay close to 0 [14, p. 87]. The controller therefore needs to balance accuracy with acceleration magnitude.

For the feedback controller, the problem can be described along the lines of classic (discrete-time) optimal control. The goal is to maximize a discounted infinite sum of rewards

$$\mathbb{E}_{y,\hat{y}}\left(V^{\rho}(y,\hat{y})\right) = \mathbb{E}_{y,\hat{y}}\left(\mathbb{E}\left(\sum_{i=t}^{\infty}\gamma^{i-t} R\big(y_i, \hat{y}_i, \rho(\hat{y}_i, y_i)\big) \,\Big|\, y_t = y,\ \hat{y}_t = \hat{y}\right)\right) \tag{1.6}$$

with discount factor 0 ≤ *γ* ≤ 1 and *ρ* denoting the control law to be learned. The operator **E**(·) is the expectation operator; the vertical bar imposes conditions on the expression the expectation is taken over. *V* is the state value, which is formed by the sum of rewards incurred by following the policy/control law *ρ* from state *y* and target *y*ˆ onward, see Section 2.2.1 for a more detailed introduction. The reward function is

$$R(y_t, \hat{y}_t, u_t) = -C_{\mathrm{y}} (\hat{y}_t - y_t)^2 - C_{\mathrm{u}}\, u_t^2. \tag{1.7}$$

*C*<sup>y</sup> and *C*<sup>u</sup> are positive coefficients: the first weighs the deviation from the control target, i.e. it accounts for tracking accuracy; the second punishes the magnitude of the commanded acceleration. In the following, an aggressive controller is understood as a controller that counters control errors fiercely through its output, thus accounting less for any penalty on its output.

The optimization target can therefore be configured in the following ways:

- The ratio of *C*<sup>y</sup> to *C*<sup>u</sup> balances tracking accuracy against comfort: a larger *C*<sup>y</sup> yields a more aggressive controller, a larger *C*<sup>u</sup> a softer one.<sup>7</sup>
- The discount factor *γ* adjusts how far-sighted the optimization is.<sup>8</sup>

<sup>6</sup> In most cases, the goal is to keep passengers content, but in safety-critical situations this principle must be abandoned, e.g. when avoiding a suddenly appearing obstacle.
#### **Alternative Reward Function Including Jerk**

Commonly, the magnitude of acceleration is seen as the predominant contributor to the passenger's perceived comfort. In some works, however, jerk, i.e. the derivative of the acceleration, is suggested as an additional influence [14, 38]. While this work mostly relies on the reward function (1.7), an alternative reward function including a measure of jerk is

$$R_{\Delta u}(y_t, \hat{y}_t, u_t, u_{t-1}) = -C_{\mathrm{y}} (\hat{y}_t - y_t)^2 - C_{\mathrm{u}}\, u_t^2 - C_{\Delta\mathrm{u}} (u_t - u_{t-1})^2. \tag{1.8}$$

It includes an additional term accounting for the commanded jerk, weighted by the positive coefficient *C*∆u. Chapter 4 shows that the proposed approaches are capable of learning with this reward formulation.
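For illustration, both reward variants (1.7) and (1.8) can be captured in one small function; this is a sketch only, and the default coefficient values are arbitrary placeholders rather than tuned values from this work:

```python
def reward(y, y_hat, u, u_prev=None, c_y=1.0, c_u=0.1, c_du=0.05):
    """Reward (1.7); if u_prev is given, the jerk-penalized variant (1.8)."""
    r = -c_y * (y_hat - y) ** 2 - c_u * u ** 2
    if u_prev is not None:
        r -= c_du * (u - u_prev) ** 2   # penalty on commanded jerk
    return r
```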

The state value (1.6) shall be maximized over the states that are encountered while running the policy *ρ* with parameter *θ*c:

$$\theta_{\mathrm{c}}^{*} = \underset{\theta_{\mathrm{c}}}{\arg\max}\, \mathbb{E}_{\hat{y}, y} \left( V^{\rho}(\hat{y}, y) \right). \tag{1.9}$$

However, the to-be-designed learning process must work within several limitations, which are introduced in the next section.

#### **1.2.4 Constraints**

The approach to be designed needs to comply with several constraints to be applicable in engineering practice:

1. **Learning Time:** The learning process must complete within minutes; otherwise it offers no benefit over manual tuning and incurs high cost for vehicle operation and supervision.

<sup>7</sup> Extreme choices, e.g. *C*<sup>y</sup> >> *C*<sup>u</sup> can harm learning performance, see Appendix A.3.1.

<sup>8</sup> A discounted reward may not always define a sensible design objective, but has beneficial influence on value estimation and learning, see appendix A.3.1.


2. **Computational Resources:** The algorithm must run online on the limited hardware available in the vehicle.

3. **Partial Observability and Noise:** Only a quantized, delayed and noisy speed measurement is available; the internal states of the plant are not observed.

4. **Arbitrary Control Targets:** The target speed is known only over a limited preview horizon and can change arbitrarily.

With our task sketched out and the given constraints in mind, we take a look at what the state of the art has to offer in this context.

### **1.3 Learning Controllers for Longitudinal Control in the Literature**

Before describing the research gap in the next section, this work is put in the context of related literature. We begin with classic design methods for longitudinal control, which generally rely on a known model. Then we turn our attention to RL, a method for learning optimal controllers from experience. First, MF approaches are surveyed, then methods that learn a model and use it to update the controller.

#### **1.3.1 Classic Designs for Longitudinal Control**

Controllers that rely on a known model are the state of the art in production-ready vehicles. A variety of competing approaches exist:

• **PID Controllers** [10] are popular throughout many applications in control theory because of their intuitive, yet powerful design. The output consists of a sum of three terms: one is proportional (P) to the current control deviation, one integrates (I) past control deviations, and another one tries to predict future control deviations through the use of derivatives (D) of the control deviation. By tuning the gain of each term, its influence can be adjusted, up to the point of deactivating it entirely, creating e.g. a PD controller by deactivating the integrating component. This design has been adopted by [2, 23, 58, 122, 100, 132, 156]. A fractional-order controller is obtained if not only the control deviation itself is integrated but also non-integer powers of it. This can help overcome shortcomings of traditional PI(D) control like overshoot or resonance and is therefore adopted by [66, 149].


• **Robust Controllers** take into account the uncertainty and stability margins of the plant and pick the controller that optimizes a given performance metric. H-∞ control is an example of such a metric and is used in [34]. If uncertainties are large, the returned controller may be overly cautious and therefore exhibit low performance.

Given that traditional vehicle control is a mature field of research, this list likely does not do it justice. For the problem at hand, however, these approaches are not applicable, since no accurate model of the plant is available.

Adaptive controllers [86, 88] provide an alternative for approaching uncertainties in system models<sup>9</sup>. In contrast to robust control, the control law adapts to changing system characteristics at runtime, e.g. to account for parameter variation. This may help achieve superior performance. An important distinction within adaptive control is the design (or performance) goal: some approaches define a target behavior, e.g. using a reference model [96], while other approaches optimize a cost/reward functional. The latter class is commonly referred to as RL<sup>10</sup>. For the application targeted by this work an optimal behavior reference cannot be given a priori, since the idea is to maximize passenger comfort and controller precision. The goal is therefore formulated as a to-be-optimized functional, making RL methods the algorithm group of choice.

We now turn our attention to MF RL control design methods.

#### **1.3.2 Model-Free RL**

Among adaptive controllers, the discipline of RL covers approaches that strive to learn optimal controllers by trial and error.

As the name suggests, MF RL methods estimate a measure of their performance directly from returns calculated from experience gained by interacting with the plant. Based on this estimate, they improve their control policy. We give a brief overview of applications that made this class of methods popular, then we turn our attention to methods in control theory and applications to real-world systems. Finally, a few examples of applications in the automotive context will be given.

MF RL was recently used for impressive displays of potential. Among the most influential was AlphaZero [139], which clearly beat the earlier, but world-famous<sup>11</sup> RL algorithm AlphaGo [138] in the game of Go. A later iteration [137] was extended to support other games like chess. In [20] an artificial player managed to consistently

<sup>9</sup> Note that the 'adaptive' in adaptive control has a more general meaning than in adaptive cruise control: while the control law is adapted in the former, the latter switches the control target depending on the traffic situation.

<sup>10</sup> A popular index term in control literature is approximate dynamic programming, which addresses a subgroup of RL approaches.

<sup>11</sup> AlphaGo was the first algorithm to clearly defeat a human professional player.

win against a round of professional Texas Hold'em poker players. In the wake of this success, RL was applied to more complex video games like Quake III [71], Dota 2 [108] and StarCraft II [155], often at levels that compete with or exceed top human players. Beyond games, MF RL has been applied to recommendation systems at Facebook [46] and the Chinese Bing News platform [169].

Due to its relation to control theory, applications as controllers have been proposed early on: in 1993 the authors of [19] applied MF RL to a linear-quadratic regulator problem; later, [6] compared RL to a PI controller for a simulated heating coil. Several benchmark problems have been proposed since [59].

Most applications of RL are in simulated or virtual domains. Real-world tasks are considered hard for RL [79, 32] due to factors like sample efficiency, training time, delays, partially observed states, noise and others. Nonetheless, there are a few notable examples of successful RL applications on physical systems. Robotic grasping has been addressed in [9, 74, 51]. Four-legged robots have learned to walk using MF RL in [54, 56]. To the author's knowledge, no approach that learns to control the longitudinal dynamics of a real vehicle online has been presented. Longitudinal dynamics differs from hand-like manipulators and walking robots in its comparably slow and asymmetric behavior and its integrator property, along with very limited measurements and computational resources.

Applications in the vehicle domain are based on simulation or on recorded data, but appeared early on: in [42] the authors propose an algorithm to adapt parameters for active roll stabilization, and [67] introduces a method for tuning a PID engine idle speed controller using RL. A method for learning steering control from prerecorded data has been proposed in [124]. Since then, several applications of RL to vehicle control have been presented, see [87] for a survey. All of the presented methods for longitudinal dynamics control learn in simulation or on recorded data. An impressive example is given by [40], where a policy combining planning and control is used for navigating a parking lot after being trained in simulation for several hours.

#### **1.3.3 Model-Based RL**

MB RL at least partly replaces data from the plant of interest with a (learned) model thereof. Potential benefits over MF RL include sample efficiency and added safety. The risk is that the learned model may not capture the system dynamics well, in which case the learned control policy may fail to achieve the goal.

MB RL was pioneered by the Dyna architecture [144] and has since produced milestone examples like Google's datacenter cooling application [89]. Very generic algorithms like Dreamer [57] have proven that MB algorithms can bring benefits even in partially observed, high-dimensional and very diverse environments.

By learning a model and estimating its accuracy, ideas from robust control can be incorporated to guarantee safety, which is especially beneficial when working with real systems, e.g. as demonstrated using a quadrotor [15] or an inverted pendulum [16]. In [27] the MB approach was used for fast learning of a control law for a unicycle and cart-pole.

The survey [115] names several applications of MB RL to unmanned aerial, underwater and ground vehicles, mostly focused on navigation. Applications of MB RL to problems in vehicle control are rare. The authors of [165] propose a method to tune a PID speed controller using evolutionary parameter search in a simulated task. In [24], a policy search algorithm aided by simulated models of different complexity learns to drift a radio-controlled model car.

### **1.4 Summary and Research Gap**

This section briefly summarizes the problem statement and condenses it into three research questions that point to the contribution of this work.

Automated tuning of vehicle speed tracking controllers could save automotive control engineers time and effort. This work bases the controller on a given design to ensure interpretability of the gains and aims to tune it in accordance with an optimal-control performance criterion balancing ride comfort and tracking accuracy over a long horizon. The task is challenging for multiple reasons: the plant is nonlinear and its state only partially observed. The control target is known only over a limited horizon and can develop arbitrarily. Beyond that, practical application brings several limitations with it, such as limited computation power, time constraints, noise and stochasticity.

While learning algorithms have given impressive displays of their potential, speed tracking control for a real vehicle has not yet been solved, likely due to the asymmetric, comparably slow, partially observed dynamics as well as the arbitrarily evolving control target. Traditional algorithms, on the other hand, require a model for their design, which requires additional effort to validate and will have limited accuracy. This work therefore sets out to harness the potential of RL algorithms for speed tracking control in real road vehicles.

To solve the challenge, the capabilities of RL need to be extended. This task is divided into three main questions:

1. How can RL be enabled for tracking control and online learning in a real vehicle?

2. How do the MF and MB approaches compare in terms of learning speed and robustness?

3. How can a feedforward be learned in an RL framework?

First and foremost, this work strives to enable RL for tracking control and online learning in a real vehicle, thus answering question 1. Two RL algorithms, one MF and one MB, are proposed to address this question.

The MF algorithm is based on the actor-critic architecture with two modifications: first, this work presents a reconstructing structure for the state-action value function along with the required state augmentation pattern (see Section 3.1.1). Second, a method to modify experienced state transitions for training is presented that restores the Markov property for training with arbitrarily varying control targets (see Section 3.1.2).

The proposed MB algorithm (see Section 3.2) is based on model learning and policy search. A decaying maximum norm is employed to dampen variance during training.

Experiments in Chapter 4 show that both algorithms are capable of learning on constrained hardware under different conditions.

Both MF and MB methods contain sources of potential bias and variance that can yield suboptimal results or, in the worst case, amount to failure of the learning task. The question of sample efficiency is important when operating a real-world system where experiment duration is a major cost factor. This work therefore compares the two approaches to answer question 2. The experiments show that both learn fairly quickly and succeed in most use cases, while the MB algorithm exhibits slightly higher robustness towards the choice of exploration signal. This qualifies especially the MB learning method for day-to-day engineering practice.

In order to achieve state-of-the-art controller performance at following a trajectory, feedforward control is necessary. Two different ways are proposed to answer question 3: one approach extends the proposed training data manipulation to a framework that enables following arbitrary trajectories with a learned MF algorithm (Section 3.1.2). As a comparison method, a non-optimal inversion-based design for the MB approach is proposed in Section 3.2.2. An experimental comparison shows that the optimal design yields a slightly softer controller (see Chapter 4).

The next chapter introduces important concepts in the state of the art that lay the groundwork for our proposed MB and MF algorithms. After introducing important notation and concepts, it specifically describes how the challenges at hand affect RL algorithms. At the end of the chapter these challenges are revisited to describe the research gap in a technically precise way, which is the starting point for Chapter 3, in which the proposed enhancements of MB and MF RL algorithms are laid out. These algorithms are then put to the test in Chapter 4 in a series of experiments in real vehicles. Chapter 5 rounds this work off with a conclusion.

# **2 Prerequisites**

This chapter covers the essential concepts and notation upon which this work builds its contribution in the next chapter. It begins with parameter fitting and optimization, which are used for adapting the function approximators in RL, introduced afterwards: first, function approximation is applied to learning the state-action value function using temporal difference learning. Then policy search methods are explained, along with how they can make use of state-action value functions via the (deterministic) policy gradient. It is shown how learning a model can be incorporated into RL, e.g. to enhance data efficiency. Since RL relies on trial and error, the concept of exploration is introduced. The chapter ends with a summary of the challenges that remain open beyond the state of the art for the envisioned speed tracking control task.

### **2.1 Parameter Estimation**

RL relies on fitting the parameters of approximators for the control policy, value functions or models. Parameter optimization is therefore an important ingredient. In this section we briefly introduce a general optimization problem; a few hints on popular optimizers<sup>12</sup> are given in the appendix. The section ends with a generalized parametric function approximator in the form of an artificial neural network. The following section is based on [48, Sections 4.3 and 8.5] except where stated otherwise.

#### **2.1.1 Parameter Optimization Problem**

Here we have a brief look at an optimization problem with the respective parameter vectors and the loss function. We point out some aspects that are special to the problems encountered in this work such as stochasticity.

Optimization strives to find the value of a parameter that drives an expression to an extreme value. Within this work we consider only a subset of optimization problems, constituted by an expression *L* called the loss function or target function and a

<sup>12</sup> In this work we use the term *optimizer* as a synonym to *optimization algorithm*.

continuous parameter vector *θ* with no additional constraints. The goal is to find the parameter vector *θ* ∗ that minimizes the value of *L*:

$$\theta^{*} = \underset{\theta}{\arg\min}\, L(\theta, \{X\}). \tag{2.1}$$

This task is considered a supervised learning problem in artificial intelligence. The operator {·} marks a set of elements of the same kind, e.g. state vectors. The loss function *L* may have other inputs {*X*} in addition to the parameter vector *θ* and is typically intended to approximate a (large or infinite) sum over a per-element loss function *l*. An example problem could be the estimation of model parameters *θ* for a dynamic model with a continuous state vector included in *X* such that it reproduces the behavior of a real system. The loss function would in this case be the squared approximation error *l* of the model output with respect to the system output, averaged over the entire state space. Since computing such a value exactly is not feasible, either because it is mathematically intractable, empirically infeasible or prohibitive performance-wise, this performance metric is approximated using a randomly sampled finite set of datapoints {*X*}*<sup>b</sup>*. This finite set of samples is often referred to as a minibatch and is sampled from an available batch of data, which is generally finite as well. The number of samples per minibatch is called the batch size *b*. In the context of this work, loss functions have the form

$$L(\theta, \{X\}_b) = \frac{1}{b} \sum_{i=1}^{b} l(\theta, \{X\}_{(i)}),\tag{2.2}$$

where {*X*}(*i*) denotes the *i*-th element of the minibatch {*X*}*<sup>b</sup>*. This approximation comes at the cost of stochasticity, i.e. each call to the loss function yields a slightly different result due to the randomized sampling of the minibatch. Solution algorithms for stochastic optimization therefore need to be tolerant towards this additional challenge. Some optimization problems even involve a non-stationary loss function; in this case the optimization goal changes over time.
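A minimal NumPy sketch of the stochastic loss estimate (2.2); the function name and signature are illustrative assumptions, not notation from this work:

```python
import numpy as np

def minibatch_loss(theta, data, l, b=32, rng=np.random.default_rng()):
    """Estimate (2.2): average the per-element loss l over a minibatch
    of size b sampled at random from the available batch of data."""
    idx = rng.choice(len(data), size=b, replace=False)
    return sum(l(theta, data[i]) for i in idx) / b
```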

Optimization problems are generally solved by an iterative optimization algorithm, e.g. by taking steps in the opposite direction of the gradient $\nabla L = \frac{\mathrm{d}L}{\mathrm{d}\theta}$, possibly making use of the Jacobian

$$
J_L = \frac{1}{b} \begin{bmatrix} \frac{\mathrm{d}l(\theta, \{X\}_{(1)})}{\mathrm{d}\theta} & \cdots & \frac{\mathrm{d}l(\theta, \{X\}_{(b)})}{\mathrm{d}\theta} \end{bmatrix}^{\top}.
$$

If these are not available, in some cases they can be numerically approximated by finite differences. Fig. 2.1 shows a schematic of the iterative solution of a supervised learning problem consisting of a loss function and an optimizer.
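Continuing the sketch above, a plain gradient descent step with a central finite-difference approximation of the gradient could look as follows; the learning rate and step width are placeholder values:

```python
def gradient_step(theta, loss, lr=1e-2, eps=1e-5):
    """One descent step using a central finite-difference estimate of
    dL/dtheta, for the case where the analytic gradient is unavailable."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        grad[i] = (loss(theta + d) - loss(theta - d)) / (2 * eps)
    return theta - lr * grad
```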

In the following, we will omit the data {*X*} when referring to the loss function on most occasions and only explicitly include it when necessary. Instead, we will use the notation ·|, e.g. *L*|*θ<sup>t</sup>*, to express that the loss function *L* has been evaluated using the parameters *θt*.

Algorithms designed to solve these optimization problems are briefly presented in Appendix A.1.

**Figure 2.1:** Schematic of a supervised learning problem. At each time step *t* a loss function computes value *Lt* , gradient ∇*L<sup>t</sup>* and jacobian *JL*,*<sup>t</sup>* from a set of parameters *θ<sup>t</sup>* and a minibatch of additional inputs {*X*}*<sup>t</sup>* . The optimizer iteratively returns new estimates of the parameter vector *θt*+<sup>1</sup> from the loss function outputs and the former parameter vector *θ<sup>t</sup>* .

#### **2.1.2 Neural Network Concept**

Artificial neural networks are a powerful yet flexible concept that is employed for function approximation (regression) in the context of this work [128]. We introduce the general concept, the backpropagation mechanism for calculating derivatives and a few common types of layers, i.e. building blocks of a neural network.

Although more complex network patterns exist in the literature, we limit ourselves to the following structure: an artificial neural network *N* : *X<sup>N</sup>* ↦ *Y<sup>N</sup>* consists of one or more functions *f* : *X<sup>f</sup>* ↦ *Y<sup>f</sup>*, each with a set of parameters *w*, often called weights in the literature. *m* layers are daisy-chained to one another to form the network *N*:

$$N\left(X_N\right) = f_m\left(f_{m-1}\left(\dots\left(f_1\left(X_N\right)\right)\right)\right).\tag{2.3}$$

Each layer may consist of subfunctions that divide the inputs to the layer among each other, allowing arbitrary serial<sup>13</sup> and parallel combinations. The advantage of building complex functions as a chain of subfunctions is that they lend themselves to tasks that require abstraction while their gradient can be computed by iteratively applying the chain rule:

$$\frac{\mathrm{d}N}{\mathrm{d}X_N} = \frac{\mathrm{d}f_m}{\mathrm{d}X_m}\Big|_{X_m = f_{m-1}(X_{m-1})}\, \frac{\mathrm{d}f_{m-1}}{\mathrm{d}X_{m-1}}\Big|_{X_{m-1} = f_{m-2}(X_{m-2})} \dots \frac{\mathrm{d}f_1}{\mathrm{d}X_1}\Big|_{X_1 = X_N}.\tag{2.4}$$

Similarly, as any parameter in the function can be treated as an input, gradients with respect to the parameters, $\frac{\mathrm{d}N}{\mathrm{d}w_i}$ with $i = 1, \dots, m$, can be formulated. The backpropagation algorithm provides a computationally efficient way of calculating this gradient. The workflow for training a neural network $N$, i.e. optimizing its parameters $\theta = \begin{bmatrix} w_1 & \dots & w_{m-1} & w_m \end{bmatrix}$ to make it approximate a desired mapping $\hat{N} : X_{\hat{N}} \mapsto Y_{\hat{N}}$, requires a loss function, e.g. $L(\{X\}_b) = \frac{1}{b} \sum_{i=1}^{b} l(\{X\}_{(i)})$ with

<sup>13</sup> In addition to feedforward networks in the literature there are networks that base their output on their previous outputs, known as recurrent layers, e.g. [35], which are useful to describe processes over time. This, however, comes at the cost of a more complicated training procedure [61]. A neural network containing one or more recurrent layers is called recurrent neural network (RNN).

$l(X_L) = \frac{1}{2}\big(\hat{N}(X_L) - N(X_L)\big)^2$. Then, training consists of repeatedly sampling a minibatch $\{X\}_b$, computing the value and gradient (and possibly the Jacobian) of the loss function and performing an optimizer step until the loss value falls below a chosen threshold, see Fig. 2.1 for a schematic.
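The workflow of Fig. 2.1 can be sketched as a simple loop; `sample_minibatch`, `loss_and_grad` and `optimizer_step` are hypothetical stand-ins for the components described above, not functions defined in this work:

```python
def train(theta, loss_and_grad, optimizer_step, data, b=32, tol=1e-3,
          max_iter=10_000):
    """Training loop: sample a minibatch, evaluate loss and gradient,
    take an optimizer step, and stop once the loss is small enough."""
    for _ in range(max_iter):
        batch = sample_minibatch(data, b)       # hypothetical helper
        value, grad = loss_and_grad(theta, batch)
        if value < tol:
            break
        theta = optimizer_step(theta, grad)
    return theta
```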

The data used for training should cover the region that the function approximator is to be used in, and for stochastic optimization to work best, the samples used for training are assumed to be representative of the underlying distribution. Therefore, it is often assumed that the data fed to the training process is independently and identically distributed [128].

Different kinds of layers exist, e.g. convolutional or normalizing layers. The most popular kind consists of a nonlinear activation function *a* : *Z<sup>a</sup>* ↦ *Y<sup>a</sup>* that is applied to a biased and weighted sum of its inputs:

$$Y_f = a \left( w \begin{bmatrix} X_f \\ 1 \end{bmatrix} \right). \tag{2.5}$$

Popular activation functions *a* include the arc tangent, rectified linear units and, in some cases, even a purely linear, i.e. "pass-through", variant. Depending on the number of parameters and the computational complexity of the layers, training can become resource-intensive for large networks.
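A minimal NumPy sketch of one such layer and of the daisy-chaining in (2.3); the names and the choice of `np.tanh` as the default activation are illustrative assumptions:

```python
import numpy as np

def dense_layer(w, x, a=np.tanh):
    """One layer as in (2.5): activation a applied to a biased, weighted
    sum; w has shape (n_out, n_in + 1), its last column being the bias."""
    return a(w @ np.append(x, 1.0))

def network(weights, x):
    """Daisy-chained layers forming the network N of (2.3)."""
    for w in weights:
        x = dense_layer(w, x)
    return x
```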

Neural networks are used as a framework for estimating models, value functions and control policies, all of which can be parts of the RL algorithms presented in the following.

### **2.2 RL for Continuous Control**

This section aims to explain the actor-critic architecture, which is a popular setup for RL algorithms in continuous domains. Actor-critic combines an element estimating *how good* it is to choose a specific action in a specific state (a state-action value function) with an element that chooses said action depending on the state (a policy).

The following section is structured as follows:

- the RL problem and state(-action) values (Section 2.2.1),
- temporal difference learning of value functions (Section 2.2.2),
- policy search methods and the (deterministic) policy gradient,
- the incorporation of learned models, and
- exploration (Section 2.2.5).
We then round off with a summary of the deterministic actor-critic algorithm for continuous control that serves as a basis for this work and highlight the challenges we face when applying the algorithm to a real car. These challenges are addressed by the proposed algorithms in Chapter 3.

#### **2.2.1 The RL Problem and State(-Action)-Values**

Here we define the RL problem, consisting of the plant, a reward function and the RL agent, i.e. the learning controller. Special attention is paid to the aspects that set the problem at hand apart from the classic RL problem. The goal is to maximize the long-term reward, which is defined through the accumulation of rewards into a state value.

For this work, we assume a discrete equidistant time scale with step index *t*. According to [145, Section 3.1] the RL problem is divided into two parts (see Fig. 2.2):

- the environment, which comprises the plant and the reward function and returns the state vector *s*<sup>14</sup> and a reward, and
- the RL algorithm, which learns to choose actions from experienced tuples of states, actions and rewards<sup>15</sup>.
The environment is modeled as a Markov decision process that performs two mappings: it maps the state vector *s<sup>t</sup>* ∈ **R***<sup>n</sup>* and action<sup>16</sup> *u<sup>t</sup>* ∈ **R** to the following state *st*+<sup>1</sup> according to the unknown dynamics *G*, and to a reward *rt*+<sup>1</sup> ∈ **R** according to a reward function *R* [145, Section 3.1]. The environment therefore comprises the plant, accounting for the dynamics, and a reward function *R* used to compute the reward signal *r* from *s* and *u*. The reward function encodes the task goal, e.g. balancing deviation from a trajectory against control effort. The Markov decision process

<sup>14</sup> Since this section aims to introduce RL in general, we skip a detailed definition of the state vector *s* for now. The definition used for this work is given in equation (3.2) and expanded to include a target preview in (3.7).

<sup>15</sup> The tuples to learn from may be observed at a time step *k* ̸= *t*, see Section 2.2.4.

<sup>16</sup> The action space can have multiple dimensions, but we limit ourselves to scalar action spaces in this work.

**Figure 2.2:** Overview of the RL problem definition. The plant (i.e. a transfer function *G*) is acted upon by the action *u<sup>t</sup>* from the RL algorithm and returns a state vector *s<sup>t</sup>*. The reward function *R* computes the reward signal *r<sup>t</sup>* from action *u<sup>t</sup>* and state *s<sup>t</sup>*. Together, plant and reward function are considered a Markov decision problem with unknown dynamics, which is often referred to as the environment in the literature [145, Section 3.1]. In the literature, both components are assumed to be unknown. This work considers the reward function to be known and will therefore include it in the RL algorithm later. The RL algorithm that solves the RL problem accepts a reward signal *r<sup>t</sup>* and the state vector *s<sup>t</sup>*, from which it learns to optimally pick an action *u<sup>t</sup>*.

is memory-less, i.e. the state vector *s* must be fully observed, but the process may be stochastic [145, Section 3.1].

Within this work, this definition is altered slightly (also see Section 1.2 and Fig. 2.3):

- The reward function *R* is considered known and is therefore included in the RL algorithm.
- The plant is only partially observed: the plant state *x*, the plant output *y* and the state vector *s* presented to the RL algorithm are distinct quantities (see Fig. 2.3).
The goal for this work is to learn an optimal deterministic mapping *ρθ*<sup>c</sup> : *s<sup>t</sup>* ↦ *u<sup>t</sup>*, called policy or control law, such that it maximizes rewards "in the long run" [145, Section 3.3]. The policy *ρθ*<sup>c</sup> is parameterized by a vector *θ*c. In order to be able to improve the policy from experience, RL algorithms often resort to a random component (exploration noise) added to their policy to explore a larger and more diverse portion of the state-action space [145, Section 13.1]. The behavior policy *π* used during training may therefore be stochastic despite the learned policy *ρθ*<sup>c</sup> being

**Figure 2.3:** Venn diagram for the altered state definition: the plant state *x*, plant output *y* and the state vector *s* presented to the RL algorithm have separate symbols within this work. Vast parts of the RL literature treat problems with the Markov property, which allows treating these quantities as equal: *x* = *y* = *s*. Within this section we introduce the basic algorithm with this assumption in mind, but since we later intend to apply the algorithm to a partially observed plant (*x* and *y* can have different dimensions), we augment the plant output with additional data (*y* and *s* can have different dimensions). These vectors will therefore be different.

deterministic.<sup>17</sup> Throughout this work the behavior policy *πθ*<sup>c</sup> is understood as the result of adding a random component with zero expectation, commonly referred to as exploration noise, to the learned/learning policy *ρθ*<sup>c</sup>. See Section 2.2.5 for the kinds of exploration noise used in this work.
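As a sketch, such a behavior policy is simply the learned policy plus sampled noise; Gaussian noise is used here purely as an example, while Section 2.2.5 discusses the kinds of exploration noise actually used:

```python
import numpy as np

rng = np.random.default_rng()

def behavior_policy(s, rho, theta_c, noise_std=0.1):
    """Behavior policy pi: deterministic learned policy rho plus
    zero-mean exploration noise."""
    return rho(theta_c, s) + rng.normal(0.0, noise_std)
```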

The goal of learning an optimal policy can be formalized using the intermediate notions of trajectory *τ*, return *U* and state value *V*.


<sup>17</sup> Learning is possible without a random component if the plant is sufficiently stochastic [136]. In that case the behavior policy *π* can be equal to the learned policy *ρθ*<sup>c</sup> .

policy *πθ*<sup>c</sup>. The Markov assumption ensures that a mapping from the state vector *s* to the expectation over the returns exists, i.e. the most accurate estimate of the expected return from following the policy can be formed using only the state vector [145, Section 17.3].

The discount factor *γ* ∈ [0, 1] is part of the definition of the optimization goal, reducing the impact of rewards in the distant future: the resulting behavior is more impatient if *γ* is chosen close to 0 and more far-sighted if chosen close to 1. In our application rewards are valued independently of when they are incurred, but choosing *γ* lower than 1 can have beneficial effects on learning (see Section 4.1).

The learning goal can be formulated as maximization of expected state value using the policy parameter vector *θ*c [145, eq. (3.13)]:

$$H = \mathbb{E}_{s}\left( \hat{V}^{\pi}(s) \right) \tag{2.6}$$

$$\theta_{\mathrm{c}}^{*} = \underset{\theta_{\mathrm{c}}}{\arg\max}\, H \tag{2.7}$$

The distribution we take the expectation over arises from applying the control law *πθ*<sup>c</sup> to the plant *G*, i.e. the expectation is taken over the distribution of states experienced when applying the behavior policy *πθ*<sup>c</sup>, which may differ<sup>18</sup> to some extent from the learned policy *ρθ*<sup>c</sup>. The assumption that this distribution exists and is stationary is equivalent to the assumption that the combination of MDP and controller is ergodic, or that the controller is stable within the experienced part of the state space<sup>19</sup> [145, Section 10.3].

The state-action value $\hat{Q}^{\pi}(s, u)$ is a reformulation of the state value $\hat{V}^{\pi}$ that is helpful for deriving the controller optimization algorithm later. It is defined as the expectation of the accumulated discounted reward gained from taking (arbitrary) controller action $u_t$ in state $s_t$ and following policy $\pi_{\theta_{\mathrm{c}}}$ from the following state $s_{t+1}$ on [145, Section 3.5]:

$$\hat{Q}^{\pi}(s_t, u_t) = \mathbb{E}\left( R(s_t, u_t) + \gamma\, U^{\pi}(\tau_{t+1}) \right) \tag{2.8}$$

It can be shown that

$$\hat{V}^{\rho}(s) = \hat{Q}^{\rho}(s, \rho_{\theta_{\mathrm{c}}}(s)) \tag{2.9}$$

<sup>18</sup> Since learning hinges on the presence of exploration noise, it has to be included in the optimization goal. Usually, the actual goal is to optimize a deterministic policy *ρθ*<sup>c</sup> . It is therefore necessary to find a balance between proximity to *ρθ*<sup>c</sup> and exploration in the behavior policy *πθ*<sup>c</sup> .

<sup>19</sup> In [145, Section 10.3], the authors admit only combinations of policies and plants that result in stationary distributions of states that are independent of the initial state. This cannot bijectively be mapped to a stability definition in control theory, but concepts from control theory can be found that fit this description. For example, neutrally stable [10, Section 5.3], time-invariant combinations of plant and controller result in a temporally infinite, but repetitive trajectory in the state space. This guarantees the state to be within a finite portion of the state space (independently of the starting point, as long as it is within the set), thus assigning a finite and time-invariant probability density function to the respective subspace. Conversely, an unstable combination of plant and controller does not allow to confine the state to a finite area for an infinite amount of time, and thus prohibits the formulation of a time-invariant probability density function.

for a deterministic policy $\rho_{\theta_{\mathrm{c}}}$, or

$$\hat{V}^{\pi}(s) = \mathbb{E}_{\pi_{\theta_{\mathrm{c}}}(s)}\left( \hat{Q}^{\pi}(s, \pi_{\theta_{\mathrm{c}}}(s)) \right) \tag{2.10}$$

for a stochastic policy *π*. The difference between the state-action value function $\hat{Q}^{\pi}$ and the state value function $\hat{V}^{\pi}$ is known as the advantage function, as it encodes how much more or less value an action yields compared to the policy's choice in a given state. In the next subsection we introduce a popular method to learn the state-action value function from experienced state transitions.

#### **2.2.2 Temporal-Difference Learning in Continuous State and Action Spaces**

This section presents a loss function for learning state values, which are used to update the control law in the next section. We rely on [145] throughout this section.

In this section we try to train a function approximator to estimate the state-action value function, which will later help maximize the state value. As introduced in Section 2.1.2, training a function approximator $Q^{\rho}(s_t, u_t)$ requires target values, i.e. outputs $\hat{Q}^{\rho}(s_t, u_t)$ for given input tuples $\langle s_t, u_t \rangle$. While these can be obtained through straightforward Monte Carlo methods [145, Chapter 5], i.e. estimating state values by averaging utility from experienced rewards, methods based on bootstrapping [145, Chapter 6] have become more popular due to their superior performance. Following the bootstrapping principle means estimating the utility, state value or state-action value using a combination of experienced rewards and the to-be-trained function approximator, thus allowing learning from incomplete trajectories down to even single state transitions. Bootstrapping is possible thanks to the recursive nature of the utility (e.g. $U^{\rho}(\tau_t) = \sum_{i=0}^{m} \gamma^{i} R(s(\tau_t, i), u(\tau_t, i)) + \gamma^{m+1} U^{\rho}(\tau_{t+m+1})$) and the linearity of the expectation. State(-action) values therefore fulfill the (one-step<sup>20</sup>) Bellman equation [145, Section 3.5]

$$\hat{Q}^{\rho}(s_t, u_t) = \mathbb{E}_{s_{t+1},\, u_{t+1} = \rho(s_{t+1})}\left( r_{t+1} + \gamma\, \hat{Q}^{\rho}(s_{t+1}, u_{t+1}) \right) \tag{2.11}$$

The expectation on the right-hand side provides a convenient way to calculate a target value for a function approximator $Q^{\rho}_{\theta_{\mathrm{sav}}}$ (parameterized using a vector $\theta_{\mathrm{sav}}$)

<sup>20</sup> Multi-step formulations are possible by using more rewards from the experienced trajectory [145, Section 7.1], i.e. $r_{t+1}, r_{t+2}, \dots, r_{t+k-1}$ instead of only $r_{t+1}$. The value $\hat{Q}^{\rho}(s_{t+k}, u_{t+k})$ from the function approximator is therefore used at a more distant time step ($t+k$ instead of $t+1$), which has less influence if a discount factor *γ* < 1 is used. A possibly harmful bias of the function approximator (e.g. due to insufficient approximation power or simply because of unlucky initialization values) on the training target is therefore minimized. However, the experienced trajectory may be noisy and therefore not ideally reflect the behavior intended by the policy to be evaluated. Using more steps in this bootstrapping equation may therefore increase variance.

evaluated at $\langle s_t, u_t \rangle$ by recursion on its value<sup>21</sup> at the successor state-action combination $\langle s_{t+1}, u_{t+1} \rangle$. Depending on whether $u_{t+1}$ and subsequent actions stem from experience, i.e. the behavior policy *π*, or from the learning policy *ρ*, the state-action value function of the behavior policy *π* or of the learning policy *ρ* is approximated (disregarding differences in the distribution of experienced state transitions). The former is considered on-policy learning and the latter off-policy [145, Section 5.4]. Off-policy learning is known to diverge if the learned policy deviates significantly from the behavior policy used for gathering experience [145, Section 11.3]. This work uses off-policy learning and avoids this source of divergence by choosing *π* as a stochastic variant of *ρ*.
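To make the bootstrapping principle tangible, the following minimal sketch contrasts a Monte Carlo estimate of the return with a one-step bootstrapped target according to (2.11); all numerical values and the placeholder approximator `q` are hypothetical and only serve the illustration.

```python
import numpy as np

gamma = 0.95
rewards = np.array([-1.0, -0.5, -0.2, -0.1])      # r_{t+1} ... r_{t+4}, hypothetical

# Monte Carlo: sample the full discounted return from an experienced trajectory.
mc_return = sum(gamma**i * r for i, r in enumerate(rewards))

# Bootstrapping: use a single experienced reward and recur on the (to-be-trained)
# function approximator for the tail of the trajectory, as in (2.11).
def q(s, u):
    return -0.8                                    # placeholder approximator output

s_next, u_next = 0.4, 0.1                          # hypothetical successor pair
td_target = rewards[0] + gamma * q(s_next, u_next)
```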

To make use of relationship (2.11) in a real implementation, the expectation is estimated from single samples (i.e. the expectation operator is dropped). The approximation error $\delta_{\mathrm{TD}}$ resulting from an imperfect function approximator<sup>22</sup> $Q^{\rho}_{\theta_{\mathrm{sav}}}$ in this equation is known as the temporal difference (TD) error [145, Section 6.1]:

$$\delta_{\mathrm{TD}} = r_{t+1} - Q^{\rho}_{\theta_{\mathrm{sav}}}(s_t, u_t) + \gamma\, Q^{\rho}_{\theta_{\mathrm{sav}}}(s_{t+1}, \rho(s_{t+1})) \tag{2.12}$$

Training a function approximator to reduce a norm of the TD error, using samples distributed throughout the area of interest in the state-action space, can serve as a basis for taking a local step towards an improved policy [145, Chapters 9-11]. Algorithms proposed in this work minimize the squared per-element loss $l_{\mathrm{TD}}(\langle s_t, u_t, r_{t+1}, s_{t+1}\rangle) = \frac{1}{2}\delta_{\mathrm{TD}}^2$ through the loss

$$L_{\mathrm{TD}}(\{\langle s_t, u_t, r_{t+1}, s_{t+1}\rangle\}_{b_{\mathrm{crit}}}) = \frac{1}{b_{\mathrm{crit}}} \sum_{i=1}^{b_{\mathrm{crit}}} l_{\mathrm{TD}}\left(\{\langle s_t, u_t, r_{t+1}, s_{t+1}\rangle\}_{(i)}\right) \tag{2.13}$$

*b*crit is the batch size. This training resembles a supervised learning problem, except that the training target may move between optimization steps, see Fig. 2.4.
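As an illustration, the following is a minimal sketch of (2.12) and (2.13) for a state-action value function that is linear in its parameters over a fixed feature map; the feature map, the policy `rho` and the discount value are assumptions made for this example only, not the approximator proposed later in this work.

```python
import numpy as np

def phi(s, u):
    z = np.append(s, u)                      # stack state and action
    return np.concatenate(([1.0], z, z**2))  # simple quadratic features

def q(theta_sav, s, u):
    return phi(s, u) @ theta_sav             # Q linear in its parameters

def td_loss(theta_sav, batch, rho, gamma=0.95):
    """Mean squared TD loss (2.13) over a minibatch of experience tuples."""
    losses = []
    for s_t, u_t, r_next, s_next in batch:
        delta = (r_next - q(theta_sav, s_t, u_t)
                 + gamma * q(theta_sav, s_next, rho(s_next)))  # TD error (2.12)
        losses.append(0.5 * delta**2)                          # per-element loss
    return np.mean(losses)
```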

The conceptually simplest way of updating the policy once a state-action value function is available is to make the policy choose the action that maximizes the $Q^{\rho}$ value<sup>24</sup> in each state *s* [145, Chapter 8]. In environments with discrete action (and state) spaces, where the value function may even be represented as a table, this maximization amounts to a (hopefully simple) search. Since maximizing over the output of a (possibly large) neural network can be computationally too expensive, this approach, known as Q-Learning, is essentially limited to the so-called tabular case. While efforts have been taken to allow the transfer of

<sup>21</sup> Despite the dependence of *Q*ˆ *<sup>ρ</sup>* (*st*+1, *ut*+1) on the parameters of the function approximator, the value is typically treated as a constant when it is included in a loss function, i.e. ignored when computing the gradient [91]. In this work, we adhere to this practice with a few exceptions (mostly published in [119]), which we mark as 'double-sided gradient' as opposed to 'single-sided gradient' in appendix A.4.

<sup>22</sup> Approximation errors may be due to suboptimal parameters or structure.

<sup>23</sup> The subsequent action *u<sup>t</sup>* is calculated using the policy *ρ*, hence the schematic in Fig. 2.4 is off-policy.

<sup>24</sup> For this maximization to yield the optimal action, the state-action value function approximator *Q<sup>ρ</sup>* has to have a maximum in the location of the maximum of the actual state-action value function *Q*ˆ *<sup>ρ</sup>* , but may be otherwise imperfect.

**Figure 2.4:** Schematic of the off-policy temporal difference learning algorithm. TD learning can be implemented as a conventional optimization problem consisting of optimizer, memory and a loss function. However, it differs from the supervised learning problem in Fig. 2.1 in that the approximation target typically changes over time (e.g. because the policy *ρ* evolves). The inputs to the off-policy TD learning process are<sup>23</sup> the previous state *st*−1, the previous action *ut*−<sup>1</sup> and the current state *s<sup>t</sup>*. With the help of the reward function *R* and the policy *ρ*, the TD loss value, gradient and Jacobian can be obtained. For brevity, we use the notation *L*TD|*<sup>t</sup>*, ∇*L*TD|*<sup>t</sup>* and *JL*TD |*<sup>t</sup>* to note that the expressions *L*TD, ∇*L*TD and *JL*TD were evaluated using the inputs of time step *t*, i.e. *st*−1, *ut*−1, *r<sup>t</sup>*, *s<sup>t</sup>* and *u<sup>t</sup>*, along with the parameter vector *θ*sav,*<sup>t</sup>* and the policy *ρ* available at that time step. Along with the former iteration of the parameter vector *θ*sav,*<sup>t</sup>*, these are the input to the optimizer, which computes the next parameter vector iteration *θ*sav,*t*+1. In Section 2.2.3 we add target networks, an extension to TD learning that relies on separate copies of the parameters for policy and state(-action) value function for bootstrapping.

this idea to arbitrary neural networks by defining a final layer whose optimum can be trivially found [12], policy search and policy gradient methods have become far more popular. These methods are presented in the next section.

#### **2.2.3 Policy Search and Policy Gradient Methods**

The ultimate goal of RL is making optimal decisions. The mapping from an environment state to an (optimal) decision is the (optimal) policy. It can be derived from the (optimal) state-action value function as hinted at the end of the last section, but for continuous state and action spaces it is often more convenient to learn that mapping directly. Algorithms following this paradigm constitute the class of policy search algorithms [145, Chapter 13].

In the following we briefly outline a few ways to forego learning a state(-action) value function by sampling the returns from experience or using a learned model. Then, we turn our attention to policy gradients that improve the policy based on information from an estimated value function. This yields the basic actor-critic algorithm.

If the objective (2.7) can be sampled, e.g. by averaging over returns computed by applying a policy *ρ* with parameters *θ*c to an available or learned model, then naive approaches like genetic algorithms/hill climbing can be successful [145, Section 1.5], [165]. For a more goal-oriented optimization, a gradient from finite differences of sampled rewards could be used. However, the sampled learning objective is usually subject to high variance, which may result in slow learning. We follow this approach in our MB RL algorithm (see Section 3.2).

**Figure 2.5:** Schematic of the deterministic policy gradient algorithm assuming a learned state-action value function *Q* (Q-Function). The optimizer is provided with the state-action value *Q*(*s<sup>t</sup>*, *ut*) and the deterministic policy gradient (2.14), consisting of a product of the gradient of the state-action value function with respect to the action and the parameter gradient of the policy. These quantities are computed based on the current action *u<sup>t</sup>* and the current state *s<sup>t</sup>*. The optimizer returns an updated estimate of the policy parameters *θ*c,*t*+1.
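Returning to the sampled-return policy search mentioned above, the following is a minimal sketch of hill climbing on returns sampled from a (learned) model; the linear policy, the callables `model` and `reward`, and all hyperparameter values are hypothetical stand-ins, not the MB algorithm of Section 3.2.

```python
import numpy as np

def sampled_return(theta_c, model, reward, s0, gamma=0.95, horizon=100):
    """Sample a (truncated) return by rolling the policy out on a learned model."""
    s, ret = s0, 0.0
    for k in range(horizon):
        u = theta_c @ s                   # linear policy rho_theta(s)
        ret += gamma**k * reward(s, u)
        s = model(s, u)                   # one-step prediction of the model
    return ret

def hill_climb(theta_c, model, reward, s0, step=0.01, iters=200, seed=0):
    """Keep a random parameter perturbation only if the sampled return improves."""
    rng = np.random.default_rng(seed)
    best = sampled_return(theta_c, model, reward, s0)
    for _ in range(iters):
        candidate = theta_c + step * rng.standard_normal(theta_c.shape)
        value = sampled_return(candidate, model, reward, s0)
        if value > best:
            theta_c, best = candidate, value
    return theta_c
```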

Policy gradient methods instead do not require a model of the plant. They strive to provide an accurate estimate of the gradient of the learning objective in (2.7).

This work relies on the deterministic policy gradient [136], which can be interpreted as replacing $\hat{V}^{\rho}$ with $\hat{Q}^{\rho}$ according to (2.9) in the policy learning objective (2.7) and then applying the chain rule to compute the derivative

$$\frac{\mathrm{d}H}{\mathrm{d}\theta_{\mathrm{c}}} = \mathbb{E}\left( \left.\frac{\mathrm{d}\hat{Q}^{\rho}}{\mathrm{d}u}\right|_{s,\, u=\rho(s)} \left.\frac{\mathrm{d}\rho_{\theta_{\mathrm{c}}}}{\mathrm{d}\theta_{\mathrm{c}}}\right|_{s}\right) \tag{2.14}$$

This gradient can be sampled from experienced states *s<sup>k</sup>*, with *k* denoting time steps during the agent's learning phase. The combination of learning a state-action value function with a (deterministic) policy gradient for policy updates yields an actor-critic algorithm<sup>25</sup> depicted in Fig. 2.5.
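For illustration, a minimal sketch of sampling (2.14) for a scalar action and a linear policy follows; the callable `q` stands in for any learned Q-function, and the action derivative is approximated by finite differences here purely for brevity (an analytic gradient would normally be available).

```python
import numpy as np

def dpg_sample(theta_c, s, q, eps=1e-5):
    """Single-state sample of the deterministic policy gradient (2.14)."""
    u = theta_c @ s
    dq_du = (q(s, u + eps) - q(s, u - eps)) / (2 * eps)  # dQ/du at u = rho(s)
    drho_dtheta = s                                      # d(theta_c @ s)/d(theta_c)
    return dq_du * drho_dtheta

# Averaging over a minibatch of experienced states estimates the expectation,
# and a gradient-ascent step on H updates the policy parameters:
#   grad = np.mean([dpg_sample(theta_c, s, q) for s in states], axis=0)
#   theta_c = theta_c + alpha * grad
```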

Provided the value function is approximated with high fidelity, actor-critic can learn quickly due to its low-variance gradient estimates. Depending on the quality of the function approximator, bias may be introduced [136, 145]. The next section

<sup>25</sup> Methods that explicitly learn a policy are considered actor-only methods that e.g. directly optimize the policy on the average reward. Methods that only implicitly learn a policy like Q-Learning are referred to as critic-only. Actor-critic combines both elements.

**Figure 2.6:** Schematic of an RL algorithm using an experience storage. For a better overview, the algorithm is structured into three parts: the evaluation part contains the policy with exploration noise; the data preparation layer accepts the current state and action and feeds the experience tuple ⟨*st*−1, *ut*−1, *st*⟩ to the experience storage (reward omitted to avoid clutter). From the experience storage the algorithm randomly picks experience tuples for the training part. Commonly, a minibatch {⟨*sk*−<sup>1</sup>, *uk*−<sup>1</sup>, *s<sup>k</sup>*⟩} of more than one tuple is sampled at random from the experience storage, containing more diverse data than only the latest experience tuple. The training step combines TD learning (see Section 2.2.2 and Fig. 2.4) for a state-action value function, which in turn is the basis for the deterministic policy gradient (see Section 2.2.3 and Fig. 2.5). This work expands this architecture with elements that make it applicable to partially observed dynamics and enable learning for tracking control.

provides two common extensions that help with accurate state-action value function learning.

#### **2.2.4 Common Extensions: Experience Replay and Target Networks**

Common extensions to the basic actor-critic algorithm are experience replay and target networks. Both were presented in [91] and aim to make learning more robust by reducing variance and bias in the learning process.

Experience replay aims to provide the learning process of both state-action value function and policy with meaningful minibatches of experience.

In physical systems, states that occur in rapid succession are often similar. Using single samples sequentially to update the respective function approximators would therefore violate the assumption of independent samples in the training batch (see Section 2.1.2). Experience replay helps to break this temporal correlation by keeping a circular buffer of experienced state transitions to sample batches from [92]. By making the buffer large, the individual samples in the training batch are potentially temporally distant and therefore less similar. They may, however, have been collected using a policy that is outdated by the time they are included in the training batch. It has therefore been suggested to use an off-policy algorithm with experience replay. Another benefit of experience replay is that each experience tuple can be included in multiple training steps, thus increasing data efficiency. A schematic overview is given in Fig. 2.6. Appendix A.3 gives an intuition of the effect of batch size and experience storage size in a simulated example.
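A minimal sketch of such a circular experience storage follows; the capacity and batch size are illustrative values, not the hyperparameters used in this work.

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular experience storage; old transitions drop out automatically."""
    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, s_prev, u_prev, r, s):
        self.storage.append((s_prev, u_prev, r, s))

    def sample(self, batch_size=32):
        # random draws break the temporal correlation of consecutive samples
        return random.sample(list(self.storage), batch_size)
```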

Target networks aim to eliminate another source of divergent learning: since the state-action value function approximator is used to calculate the target value, the target may vary considerably between two updates in the same state due to evolving weights. Since this variance can cause the learning process to diverge, a target network is used instead of the original state-action value function approximator for bootstrapping. The target network weights $\tilde{\theta}_{\mathrm{sav},t}$ are updated using an exponential average of the parameters $\theta_{\mathrm{sav},t}$ of the state-action value function approximator:

$$\tilde{\theta}_{\mathrm{sav},t} = (1 - \eta_{\mathrm{crit}})\, \tilde{\theta}_{\mathrm{sav},t-1} + \eta_{\mathrm{crit}}\, \theta_{\mathrm{sav},t} \tag{2.15}$$

The parameter $\eta_{\mathrm{crit}} \in (0, 1]$ can be chosen small to favor a smooth but slow progression of the target network weights $\tilde{\theta}_{\mathrm{sav},t}$, or close to 1 to allow close tracking of the parameters $\theta_{\mathrm{sav},t}$ of the actual state-action value function. While this extension may slow learning, it may increase stability, as the authors of [91] claim<sup>26</sup>. In Appendix A.3 we provide an intuition on the effect of target networks on the learning process.
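In code, (2.15) amounts to a one-line soft update; the sketch below assumes the parameter vectors are stored as numpy arrays and uses an illustrative value for $\eta_{\mathrm{crit}}$.

```python
def update_target(theta_target, theta_sav, eta_crit=0.01):
    """Exponential-average target network update, cf. (2.15)."""
    return (1.0 - eta_crit) * theta_target + eta_crit * theta_sav
```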

#### **Outlook: Beyond Deterministic Actor-Critic**

This work focuses on deterministic actor-critic algorithms, but the literature offers several popular alternatives. We include them here only for reference, since most of these improvements were introduced to enhance learning from visual representations using large networks.

In TD3 [43] the authors try to extend the deterministic policy gradient approach beyond DDPG [91] by decreasing the policy update frequency, applying the ideas from [153] in the form of clipped double Q-learning and other measures.

In recent years, stochastic policies have gained more attention. One of the central developments adding stability followed the idea of natural gradients [73], a method to enhance learning speed, in the form of trust-region

<sup>26</sup> A target network has also been used to avoid overestimation of state-action values in Q-learning variants [152]. Target networks for the policy are common, too [91].

policy optimization (TRPO) [130], proximal policy optimization (PPO) [131] and actor-critic using Kronecker-factored trust region (ACKTR) [167]. Additionally, ideas presented for deterministic policy gradients were incorporated, e.g. experience replay [159] and soft actor-critic [55, 56].

### **2.2.5 Exploration**

Even with off-policy TD-learning, the actor-critic algorithm relies on an exploration policy *π* that to some extent resembles the current policy *ρ*, yet deviates from it to estimate the advantage function<sup>27</sup>. Additionally, it needs to excite the system in a way that creates a rich distribution of states to learn from, which can be challenging when learning stable controllers. We introduce several options for adding exploration noise and give a few pointers to other options for exploration, e.g. safe exploration.

Depending on the perspective, exploration noise helps the learning process in different ways:

- from the RL perspective, deviating from the currently learned policy is necessary to estimate the advantage, i.e. whether abandoning the current policy would be beneficial;
- from a system identification perspective, the noise excites the plant and breaks the correlation between measurements and inputs, creating a rich distribution of states to learn from.
RL was first introduced for discrete (action-)spaces, and exploration therefore typically amounted to nothing more than picking a random action every once in a while (*ϵ*-greedy) [145], possibly with schemes that reduce the random component with learning progress (e.g. in Boltzmann Exploration [158] or the more optimal, yet more complex Bayesian Exploration [143]), or assume very high returns for unvisited states (optimistic) [145]. For systems with a continuous action space, exploration noise is either inherent to a stochastic policy, defined as a distribution the action is sampled from [145] or as a random number added to the output of a deterministic policy [136]. Other choices include count-based exploration, i.e. preferring rarely visited state-action combinations [148], or curiosity-driven, i.e. preferring actions where a

<sup>27</sup> Similarity between the current policy *ρ* and the exploration policy *π* helps learning in configurations where the agent is not able to extrapolate experience between states. This may be due to the system behaving nonlinearly, but also due to the function approximators used in the agent. Yet, a slight deviation helps the agent understand if abandoning the currently learned policy can be beneficial, i.e. estimating the advantage.

simultaneously trained model returns erroneous predictions [112] which can be seen as concepts within intrinsic motivation, i.e. optimizing an additional, known reward function that rewards exploration [13].

For real-world systems, purely random signals can be an inapt choice: on the one hand they can cause high wear on actuators, on the other hand they may fail to excite the system due to their emphasis on high frequencies. It has therefore become common to incorporate some temporal correlation between samples of the exploration signal, e.g. sampling from an Ornstein-Uhlenbeck process [91]. This work follows an idea from [161] and samples the random component for exploration at a lower frequency than the evaluation frequency of the controller.
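A minimal sketch of this lower-frequency sampling follows: a random value is drawn and then held constant for a number of controller steps, which shifts the excitation toward lower frequencies; the hold length and noise magnitude are illustrative values.

```python
import numpy as np

def sample_and_hold_noise(n_steps, hold=10, sigma=0.1, seed=0):
    """Exploration noise drawn at a lower rate and held constant in between."""
    rng = np.random.default_rng(seed)
    draws = sigma * rng.standard_normal(-(-n_steps // hold))  # ceil division
    return np.repeat(draws, hold)[:n_steps]

# Behavior policy at each step: u_t = rho(s_t) + noise[t]
```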

Within this work we consider three options of adding noise to the deterministic policy:


While the three proposed approaches to exploration are applicable to the vehicle, it remains open whether they succeed at providing sufficient exploration and excitation as well as making the agent experience a meaningful distribution of states. In Section 4.4 we therefore run experiments with each variant.

### **2.2.6 Connection to Model-Based RL**

MB RL is often defined as substituting (at least a part of [72]) the experience from the environment with data generated by a (learned) model, but otherwise using the principles introduced above; see Fig. 2.7 for an overview of the integration of a model within this work. Some approaches deviate from that, e.g. by making use of specific model estimation methods to incorporate a notion of uncertainty in the learning process or by using more traditional control design methods. In this section we give a brief overview of model estimation methods and architectures used for MB RL.

**Figure 2.7:** Schematic of an RL algorithm using an experience storage and a learning model. A model is trained using experienced state transitions sampled from the experience storage. The trained model can either be used to generate additional experience tuples for learning a state(-action) value function or, as in this case, be directly used in the policy learning step, e.g. for sampling returns using simulations.

Model estimation is a well-developed field of research (see e.g. [93]). In the context of RL, many kinds of models have been proposed. Popular ones include linear models [26], Gaussian processes (e.g. PILCO [27]), nonlinear feedforward networks (e.g. MLAC [50]) and recurrent networks (e.g. Dreamer [57]); see [115] for a more extensive overview. Generally, it is assumed that the full state vector *x* of the plant is observed (i.e. *s* = *x*), which is often not the case with real systems [32]. Partially observed systems require simultaneous estimation of the state vector and the model parameters. For some applications, ignoring unobserved dynamics, i.e. learning a model using only the observed portion of the state vector, may yield a sufficiently accurate model [140]. The authors of [104] group approaches for reconstructing a complete state from partial state observations into windowing (concatenating past observations and actions in the state vector), belief states (maximizing the likelihood of observations), recurrency (implicitly forming a state in a trained recurrent neural network) and external memory (memory cells that can be explicitly accessed by a neural network).

MB algorithms are often claimed to be more sample-efficient, since a model can be learned from limited experience data and then used to generate a large amount of hypothetical experience [144] from which a policy can be learned at comparably low cost [115]. A policy may be learned by (MB) policy search, value function methods or actor-critic methods as introduced above. Some approaches additionally use characteristics of the estimator or the estimation process to guide the learning process, e.g. to account for robustness using the approximation error [15, 16, 25].

While MB algorithms have the potential to learn from less data, their success depends on the accuracy of the learned model, which can be a challenge for partially observed environments.

This work therefore follows two parallel approaches, one MB and one MF, which allows comparing them in the envisioned application.

# **2.3 Summary, Technical Description of the Research Gap**

This chapter introduced a deterministic actor-critic algorithm as groundwork for this thesis. It is briefly summarized here, and a synopsis of the challenges arising from the application to speed tracking control in a real vehicle is given.

From the point of view of the RL algorithm, the input is a state and the output is a commanded action; both are vectors of real-valued, discrete-time signals. The reward function is treated as part of the algorithm and defines the learning goal. The expectation of the discounted reward sum, i.e. the state value, is to be maximized in the long run over the distribution of visited states. Everything outside the algorithm is considered the environment and is assumed to exhibit fully observed dynamics.

In practice, policy search methods and actor-critic designs are the most popular choices for continuous control tasks. Policy search aims to directly optimize control parameters to enhance the controller's performance, e.g. by sampling returns (Monte Carlo methods) and applying hill-climbing methods. The actor-critic architecture adds a value function estimator to guide policy improvement with less variance.

The deterministic actor-critic framework used in this work harnesses the deterministic policy gradient to estimate the policy gradient from a Q-function learned by TD learning. Both the deterministic policy and the state-action value function are learned by applying LM optimization to function approximators trained on batches of experience stored in a ring buffer. MB RL algorithms strive to increase sample efficiency by learning a model of the environment first and then employing it to generate experience to learn from.

When it comes to real applications, however, some of the assumptions that RL algorithms rely on do not hold. In the case of speed tracking control, the challenges fall into three groups:

	- 1a) The plant dynamics violate the Markov assumption: the plant exhibits delays and only provides a speed measurement. This makes learning a model or a value function harder.
	- 1b) The control target is arbitrary: there is no global description of its dynamics. State values can therefore only be predicted with large variance.
	- 2) The computational resources available in the vehicle are limited, and learning has to converge within minutes<sup>28</sup> to provide a benefit over manual tuning.
	- 3a) All occurring optimization problems are noisy, e.g. due to quantization or measurement noise.
	- 3b) The vehicle itself exhibits some stochasticity, behaves nonlinearly at low speeds, and its reaction to control inputs is not symmetric at high frequencies due to the brake system reacting more promptly. Low-level algorithms filter some of the inputs to limit wear on the powertrain.

Since many of the displays of performance have been achieved in simulated environments that behave ideally, challenges from real-world systems have been neglected so far.

The next chapter proposes an MB and an MF candidate design to overcome the first two challenges. After that, Chapter 4 demonstrates that the proposed algorithms are capable of handling real-world conditions, and thus the third challenge.

<sup>28</sup> Since giving an exact time frame for manual tuning is not possible, this work aims to solve the task within minutes, which would guarantee a time benefit over manual tuning.

# **3 Proposed Approaches**

Based on the basic concepts of RL introduced in Chapter 2, this chapter proposes two algorithms for vehicle speed tracking control. The work in this chapter can be understood as enabling MB and MF RL for the vehicle application, such that they can be used in experiments in Chapter 4.

The first algorithm is based on the MF deterministic actor-critic architecture and employs a reconstructing state-action value function network to overcome challenges arising from partially observed states. This work proposes to locally approximate the control target to remove variance due to changes in the control target and thus enable tracking control.

The second algorithm learns a model from windowed past states and actions that is then used to sample returns for policy search. The estimated model additionally serves as a basis for an automated inversion-based feedforward design.

Both algorithms create little computational burden, e.g. by relying on few parameters for approximation.

### **3.1 MF RL Algorithm**

Here the actor-critic architecture introduced in Section 2.2 is expanded in two ways: Section 3.1.1 proposes a combination of augmenting the state vector and a special architecture for the function approximator in order to cope with the kind of partially observed systems arising from communication delays and slow actuators. Additionally, Section 3.1.2 proposes a local approximation of the control target along with a manipulation of the training data to enable learning with little variance while following arbitrary trajectories. Finally, Section 3.1.3 provides an overview of the complete algorithm.

### **3.1.1 Reconstructing State-Action Value Function Approximator for Partially Observed Plants**

This section proposes an architecture for the state-action value function approximator to cope with partially observed dynamics (see challenge 1a) while keeping a small computational footprint (see challenge 2). The idea is to reconstruct part of the unobserved state from past actions, since the unobserved internal states mainly belong to a comparably slow actuator. This work has been published in [119] and [118].

The section begins by surveying approaches to partially observed systems in RL, then introduces three variants of the proposed combination of reconstructing layer and state augmentation. A few insights from a simulated example are given to underline the effectiveness of the proposed structure. As a preliminary to the extensive vehicle experiments in Chapter 4, we evaluate the reconstruction variants in the car.

#### **Approaches for RL in Partially Observed Plants**

There are two ways to apply RL to partially observed environments: either the measurement is treated as if it was the full state, i.e. the unobserved part of the state vector is disregarded, or past observations and actions are used to reconstruct the missing information.

**Ignore Unobserved Dynamics** With only partial information on the system state available, predictions of the coming states are more uncertain. This uncertainty in predicted states carries over to rewards and thus to estimated returns. With the optimization goal being subject to variance, the optimization process is often slower and yields less precise results. For RL, this means possibly slow learning and sub-par performance in the converged state. Yet, ignoring the hidden portion of the state may work [140]; especially RL algorithms with large nonlinear function approximators seem to fare well in these scenarios [133]. These, however, are not feasible due to the performance restrictions in our application (see challenge 2).

If the policy relies exclusively on observed features (i.e. entries in the state vector), a value function approximator can be constructed that guarantees bias-free policy gradients according to compatible function approximation theory [146, 136]. However, it has been shown that these simplest forms of compatible function approximation yield the same level of variance as an algorithm that learns without a value function [147].

While disregarding unobserved dynamics may seem tempting from a theoretical point of view, the toll on learning speed and accuracy can be high, as an example in Section 3.1.1 shows. To avoid poor learning performance, missing state information can be reconstructed from past states and actions.

**Reconstruct Missing Information from Input and State History** Apart from estimators based on a known or learned model, reconstruction can be done using either a finite window of past observations and actions (also known as finite history, finite memory or windowing) as an input to a feedforward function approximator, or using a memory in the function approximator [113, 1]. In environments with comparably fast unobserved dynamics, e.g. blinking in Atari games, using several past measurements has proven effective [102, 79]. Augmenting the input space usually comes at the price of a larger function approximation network, i.e. higher computational complexity. Additionally, it raises the question of how to form the augmented state vector, i.e. how many past state measurements and/or actions to include. The authors of [61] point out that the algorithm's performance can suffer if important events fall beyond the limited horizon provided in the state vector and suggest using a recurrent neural network (RNN) (see Section 2.1.2). These come at the cost of a more difficult training procedure that inherently conflicts with experience replay, since RNNs require the inputs to be presented in chronological order (see also [60]). Despite these challenges, impressive results have been reached using recurrent structures, e.g. in [63, 162].

The proposed approach resembles the windowing method, yet the function approximator is kept lean by tailoring it to a class of plants. Three methods are investigated in the following: one uses the windowed input in an RNN-like way, the other two are feedforward networks. A preliminary experiment selects the most appropriate reconstruction method, a linear filter-like layer with a constraint on the weights.

#### **Proposed State-Action Value Function Approximator Structure**

This work proposes an architecture for the state-action value function approximator for a special class of partially observed systems. Two intuitions motivate its design: actuator dynamics can be reconstructed from the input history, and the optimal value function structure is known for the fully observed case. The design therefore combines a reconstructing filter whose output is fed to a subsequent function approximator. Since both the filter and the subsequent function layer are part of the state-action value function approximator, both learn using standard TD learning.

This section begins with the optimal state-action value function for the fully observed case, then introduces three candidates for the reconstructing filter and explains how to implement them.

For the linear-quadratic regulator (LQR) setting, i.e. a linear controller *ρθ*<sup>c</sup> for a linear, fully observed, plant with a quadratic reward function (cf. (1.7)), the true state-action value function is known to be quadratic [19]:

$$\hat{Q}^{\rho}_{W}(x, u) = \begin{bmatrix} x^{\top} & u^{\top} \end{bmatrix} W \begin{bmatrix} x \\ u \end{bmatrix} \tag{3.1}$$

with *W* a square matrix of appropriate size.

For the longitudinal speed control problem (see Fig. 1.2) the full state is not available. The dynamics are therefore split into two groups: an unobserved system containing actuators and driving resistances that accounts for the acceleration, and an integrator returning the speed, i.e. an observed system<sup>29</sup>, see Fig. 3.1.

**Figure 3.1:** Assumed structure of the learning problem. The plant consists of two blocks *G*unobserved and *G*observed in a series connection. The state vector *ζ* of the first block is not observed; the state vector *y* of the second block is the plant output.

The next step towards the state-action value function structure is to build a learning observer for the unobserved portion of the state, to be used in combination with the quadratic structure. The estimation can occur from two perspectives: from a causal perspective, the unobserved dynamics precede the observed part and could therefore be inferred from (past) actions; we call this view the integrator perspective. Another option would be to reconstruct state vector entries from the output history, which will be referred to as the differentiator perspective. Since the plant dynamics are slow compared to the sampling time, successive actions from the integrator point of view are unlikely to have vastly different impact, i.e. the influence of past actions on the current state estimate is expected to develop continuously over time. In the differentiator perspective, approximate derivatives, i.e. finite differences of the measured outputs, are weighted over time. Here the impact of the derivatives is assumed to evolve in a continuous way over time. In practice, these perspectives are difficult to separate, since the actions are computed based on output measurements<sup>30</sup>.

In the remainder of this section three reconstruction concepts following the integrator perspective<sup>31</sup> are proposed:

1. a FIR-filter-like fully connected layer over the window of past actions whose weight vector is constrained to norm 1 via weight normalization,
2. a FIR-like layer whose coefficients are interpolated from a reduced set of learnable meta-parameters using Gaussian kernels, and
3. a linear recurrent (Elman) layer that is fed the windowed actions in chronological order.

In the following, concept 1 is referred to as the 'normalized' layer, concept 2 as the 'gaussian' layer and concept 3 as the 'rnn' layer. All presented approaches assume the state

<sup>29</sup> Note that this division is strict if considered in continuous time, but not in discrete time: in continuous time the integrator is solely fed acceleration, but the discrete-time form uses small components from other unobserved states due to the matrix exponential/numeric integration. The error introduced is small if the time step size ∆*t* is small.

<sup>30</sup> Exploration noise plays a crucial role in breaking the correlation between measurements and inputs, see Section 2.2.5 and Section A.3.4 in the Appendix.

<sup>31</sup> The idea behind following the integrator perspective is that differentiation can potentially amplify noise.

vector to contain the current speed measurement $y_t$ and a window of $h_{\mathrm{u}}$ past actions $\begin{bmatrix} u_{t-h_{\mathrm{u}}} & \dots & u_{t-1} \end{bmatrix}^{\top}$:

$$\mathbf{s}\_{t} = \begin{bmatrix} y\_{t} & u\_{t-h\_{\mathbf{u}}} & \dots & u\_{t-1} \end{bmatrix}^{\top}. \tag{3.2}$$
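A minimal sketch of assembling (3.2) at runtime follows; the window length is an illustrative value, and a fixed-length deque keeps the $h_{\mathrm{u}}$ most recent actions.

```python
from collections import deque
import numpy as np

h_u = 20                                         # window length, illustrative
action_window = deque([0.0] * h_u, maxlen=h_u)   # u_{t-h_u} ... u_{t-1}

def augmented_state(y_t):
    """Assemble the augmented state vector (3.2)."""
    return np.array([y_t, *action_window])

# Each control step, after applying the action u_t:
#   action_window.append(u_t)
```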

Despite the plant likely having more than one unobserved state dimension, the proposed structure worked best when learning to reconstruct only a single (i.e. one-dimensional) state. This may be due to issues with learning a reconstruction using a large number of parameters or difficulties in discerning between different hidden states<sup>32</sup>.

Approach 1 can be implemented as a fully connected feedforward layer (see Section 2.1.2) with $h_{\mathrm{u}}$ inputs and one output. It adds only $h_{\mathrm{u}}$ parameters to the network. FIR filters (see e.g. [95] for an introduction) usually become more accurate with increasing length. For the system at hand this gives a good rule of thumb for tuning the filter length: it should be chosen according to the impulse response of the system. Increasing the filter length beyond a certain value yields little accuracy gain and makes learning more difficult due to the increasing dimension of the parameter space. An example is given in appendix A.3. In order to remove unnecessary degrees of freedom<sup>33</sup>, weight normalization [127] is used to fix the norm of the filter weight vector to 1. This does not affect the approximation power of the approximator, as the following layers can compensate for scaling errors.

The proposed FIR filter is designed with the integrator perspective in mind and is less apt to work with the differentiator perspective, since it cannot simultaneously compute all possible finite differences of its inputs and weight them. However, the proposed structure could approximate this by learning gains of alternating signs with magnitudes according to the influence of the respective finite difference. This would be equivalent to skipping every second finite difference. In our experience, the learned set of FIR parameters is usually a mixture of the (smoother) impulse response (see Fig. 3.5 in the next section for a simulated example) and an alternating series when applied to the real vehicle. Concept 2 is therefore designed to enforce smooth FIR coefficients by reducing the number of learnable coefficients from $h_{\mathrm{u}}$ to $n_{\mathrm{weights}}$ and interpolating between them using Gaussians<sup>34</sup>:

$$w_{i,\mathrm{smoothed}} = \sum_{j=1}^{n_{\mathrm{weights}}} w_{j,\mathrm{train}} \exp\left(-\frac{(j-i)^2}{\sigma}\right) \tag{3.3}$$

<sup>32</sup> Note that it is difficult to find a ground truth for the reconstructing layer if more than one system state is unobserved. Reconstructing only one replacement signal for unobserved states requires learning a blended signal. The ratio, however, depends on the distribution of the training data and aims to compensate for missing monomial multiplication in the subsequent layer.

<sup>33</sup> Another method for limiting unusable degrees of freedom in an approximator that uses an FIR filter for reconstruction is to fix a weight in the linear layer after it.

<sup>34</sup> This is motivated by the way radial basis functions are used to approximate smooth functions with a limited set of parameters.

**Figure 3.2:** Schematic of the proposed network architecture from [119]. The first layer passes the measurement *y<sup>t</sup>* (here a version with more than one measured quantity is shown), the reconstruction output<sup>32</sup> $\tilde{\zeta}$ and the current action *u<sup>t</sup>* to a second layer without learnable parameters that computes all monomials up to degree 2 of its inputs (qf). The third layer outputs a weighted sum of its inputs (fully connected, fc) that is learned to approximate the state-action value. While the reconstructing layer can be any of the presented variants, in this schematic reconstruction according to concept 1, in the form of a fully connected layer (fc), is used.

with *σ* being a hyperparameter we choose to be 1, and *i* = 1, ..., *h*u. Each training step only updates the meta parameters *w*train, but then the parameters *w*smoothed are calculated and used for evaluation. This is done by including the derivative of (3.3) in the backpropagation algorithm. Despite this solution, the meta-parameters *w*train may still be learned as a series with alternating signs, effectively only forming finite differences over a longer time interval.
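A minimal sketch of the interpolation (3.3) follows; how the $n_{\mathrm{weights}}$ meta-parameter positions map onto the $h_{\mathrm{u}}$ filter taps is an assumption made here (evenly spread across the window), since the document does not fix it in this excerpt.

```python
import numpy as np

def smooth_weights(w_train, h_u, sigma=1.0):
    """Interpolate h_u FIR coefficients from n_weights meta-parameters, cf. (3.3)."""
    positions = np.linspace(1, h_u, num=len(w_train))  # assumed meta-weight spacing
    taps = np.arange(1, h_u + 1)[:, None]              # filter tap indices i
    kernels = np.exp(-((positions[None, :] - taps) ** 2) / sigma)
    return kernels @ w_train                           # w_smoothed, length h_u
```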

Concept 3 also reduces the number of parameters by feeding the appended actions to a linear recurrent Elman network structure [35] in chronological order<sup>35</sup>.

The proposed estimators are combined with a quadratic function approximator. Fig. 3.2 gives a schematic overview for the implementation of the FIR filter approach 1 with the quadratic function approximator. The reconstructing layer can also be combined with other structures after it, e.g. fully connected networks.
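For illustration, a minimal sketch of a forward pass through the structure in Fig. 3.2 with concept 1 follows: a weight-normalized FIR layer reconstructs a scalar replacement state from the action window, all monomials up to degree 2 of the inputs are formed, and a final linear layer weights them. The parameter shapes (here a 10-element output weight vector) are implied by this toy setup, not taken from the thesis.

```python
import numpy as np
from itertools import combinations_with_replacement

def q_value(theta_fir, theta_fc, y_t, action_window, u_t):
    """Forward pass of a Fig.-3.2-style approximator (concept 1 reconstruction)."""
    w = theta_fir / np.linalg.norm(theta_fir)      # weight normalization
    zeta_tilde = w @ action_window                 # reconstructed replacement state
    z = [y_t, zeta_tilde, u_t]
    monomials = [1.0] + [np.prod(c) for d in (1, 2)
                         for c in combinations_with_replacement(z, d)]
    return np.array(monomials) @ theta_fc          # linear output layer (fc)
```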

<sup>35</sup> This interpretation of parts of the state vector as trajectories avoids the issues of RNNs with experience replay, but potential issues for training RNNs known as exploding/vanishing gradient persist (see e.g. [111, 48]).

#### **Simulation Results**

This section aims to support the intuition that the proposed reconstruction method is effective at reducing variance and can be used modularly in different approximator architectures. The proposed approximator structure (concept 1) is applied to the example system (1.3) to show that it reduces the variance in the learning process. This section also shows that other approximator structures are viable by using a fully connected network instead of the quadratic approximator, at the cost of slightly slower learning and reduced accuracy.

To highlight the effectiveness of the proposed approach, this section compares two configurations of the deterministic actor-critic architecture. One uses a quadratic function approximator for the state-action value function, treating the measurement as the full state, i.e. ignoring the unobserved dynamics; we mark this configuration as 'ignore'. The other configuration uses the proposed FIR layer for reconstruction; this configuration is marked as 'reconstruct'. The training occurs in episodes of 200 time steps each, with the plant initialized in random states<sup>36</sup> and the control target set to 0. Training is paused if either the buffer of past actions or the experience storage is not filled to capacity. Every 10 episodes a validation run is performed: training is paused and no exploration noise is applied while the controller tries to drive the system from a fixed initial point $x_0 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix}^{\top}$ to zero, in order to monitor the learning progress via the achieved state value $\tilde{V}(x_0)$ of this point<sup>37</sup>. This experiment was presented in [119] and used a fixed parameter in the final layer of the function approximator instead of weight normalization to eliminate the unused degree of freedom. The hyperparameters can be found in Table A.1, set A, in appendix A.4.

A comparison of TD errors during training is given in Fig. 3.3. It shows that the reconstructing state-action value function helps reducing the TD-error in the steady state.

The reduction in variance is visible in Fig. 3.4, which shows the approximated state value computed every 10th run. Learning is not only more precise, i.e. the performance exhibits less variance throughout the learning process, but also faster. Note that an output controller is not always optimal throughout the entire state space; it may therefore be misleading to judge performance from a single state value.

In order to demonstrate the flexibility of the proposed approach, the proposed combination of FIR-like reconstructing layer and quadratic function approximator is compared with a combination of the FIR-like normalized reconstructing layer and a shallow fully connected network.

<sup>36</sup> Episodic training is common in RL, since regular resets to random states help compensate for poor performance of the agent and help with exploring the state space [145, Section 5.2].

<sup>37</sup> A truncated version of the state value is used for computational feasibility. We truncate the infinite sum once the norm of the state falls below 10<sup>−4</sup>.

**Figure 3.3:** Mean (*µ*) and standard deviation interval (*µ* ± *σ*) of the mean squared TD error *L*TD while learning a state-action value function for a fixed policy, plotted over training episodes, from [119]. The experiment was performed 50 times. The proposed reconstruction method lowers the steady-state TD error. The experiment was conducted in an episodic setting, which causes the state augmentation buffer to be purged at regular intervals, halting learning. Since the buffer for state augmentation takes longer to fill than the buffer for the TD error, the interruptions of the learning are only visible for the reconstructing value function.

**Figure 3.4:** Mean *µ* and standard deviation interval *µ* ± *σ* of the approximated undiscounted and truncated state value *V*˜(*x*0) of a fixed initial state *x*<sup>0</sup> over training episodes, from [119], averaged over 50 runs. The sum was truncated once the norm of the state vector falls below 10<sup>−4</sup>. The reduced variance when reconstructing unobserved states translates into more accurately learned control parameters and thus steadier performance, especially when converged. Better performance in a single state does not necessarily translate to better overall performance in optimal output control.

**Figure 3.5:** Mean (*µ*) and standard deviation interval (*µ* ± *σ*) of the learned weights in the normalized FIR filter when paired with a shallow fully connected layer ('fc'), compared to a combination with a quadratic function approximator ('qf'), averaged over 50 runs, from [119]. Both roughly retrace the (inverted) impulse response of the unobserved dynamics apart from scaling effects32, but the coefficients learned in combination with the quadratic function approximator exhibit less variance. The pattern of the FIR coefficients suggests that a longer FIR filter would not yield benefits when learning with this environment, as the coefficients approach zero for actions close to the end of the window of past actions.

Fig. 3.5 shows the learned FIR coefficients at the end of training. Both approaches learn a similar distribution, with the quadratic form exhibiting less variance.

#### **Comparison of Recurrent Layer, Interpolated Weights and Fixed-Norm FIR in the Vehicle**

When applied to a real vehicle, the three proposed reconstruction methods exhibit different behavior. This comparison was published in [118]; see Chapter 4 for more details on the experiment setup and Table A.1, set C, for hyperparameters. Fig. 3.6 shows the progression of the controller parameter over training time. Both the Gaussian weights layer and the normalized FIR layer yield fast convergence and stay in a small region once converged. The RNN-based variant exhibits higher variance and takes longer to converge. For simplicity, this work therefore focuses on the normalized FIR reconstruction method.

For the MF approach, this rounds off the proposal for solving the challenge of a partially observed environment. Next, we turn our attention to tracking a time-varying target.

**Figure 3.6:** Actor weights for different reconstruction methods learned in a real car, from [118]. The controller parameter is updated only every 2 s, resulting in the step-line-like graph, and a max-norm constraint in the optimizer causes the linear slope on the way to the steady state. For both the gaussian weights (marked as 'gaussian') and the normalized FIR layer ('normalized') the controller gain stabilizes around 0.5. The variant using an RNN cell for the reconstruction converges later and exhibits higher variance.

### **3.1.2 Tracking Control and Preview Compression**

Learning a tracking controller (challenge 1b) using RL is difficult: arbitrarily changing target values make it impossible to accurately estimate the expected return if the target is not known over an infinite horizon. In engineering applications, this information is usually not available.

For the application at hand, the challenge can be decomposed into two sub-problems:

- the expected return depends on the future course of the target, which is only known on a finite preview horizon, and
- the target can change arbitrarily, so there is no time-invariant external system describing its dynamics that would restore the Markov assumption.
Including a preview helps to predict the state-action value, but bloats the input space of the function approximator38. This may enable learning with a moving target, but makes training costly and difficult. Since the target changes only slowly, the entries in the prediction horizon exhibit only minor differences. This makes it difficult to differentiate between the influences of individual entries on the overall value, which further slows the learning progress, i.e. requires small learning steps and large batches.

<sup>38</sup> Even for short preview lengths the number of points is large due to the short sampling interval. The naive approach to use a more coarse grid can be seen as a special case of the presented approach.

The proposed solution addresses these issues: it projects the trajectory preview onto a low-dimensional polynomial. This simultaneously reduces the state dimension and allows extrapolation to an infinite horizon<sup>39</sup>. The proposed projection is combined with a set of training data manipulation methods providing compliance with the Markov assumption, without limiting the proposed approach to trajectories generated by a time-invariant external system. Using off-policy learning, one variant of the proposed manipulation methods allows replacing the noise in the action space with noise on the target, simultaneously adding flexibility and a method for guiding the agent during learning. This subsection is mostly based on [81], where the proposed method is presented in a general way and applied using linear polynomials, i.e. an acceleration feedforward. Some parts stem from [119], which uses a constant polynomial, i.e. no feedforward controller, but highlights some aspects of exploration.

This section is structured as follows: First, this work is put in the context of the literature on tracking using RL. Then, the proposed projection and training data manipulation methods are presented. Using simulation, it is shown that the target can account for exploration instead of directly adding noise to the action. Finally, the simulated example (1.3) is used to show that the proposed training data manipulation significantly reduces variance during learning.

#### **Literature on Tracking using RL**

The following briefly reviews the literature on RL for tracking control; for a more in-depth discussion see [118, 83, 82]. Tracking control signifies following a varying control target, yet a feedback controller reacting solely to the deviation of the system state from the desired state (see e.g. [107, 157, 119]) often lags behind the desired trajectory.

If the trajectory is known at training time and does not vary in the use case, a learning algorithm can be trained to follow a specific trajectory [166], but has to be re-trained each time the trajectory changes.

A slight generalization is to assume the trajectory stems from an exogenous system with constant dynamics (see e.g. [78, 103, 80, 94, 77]). Since these dynamics are perceived as part of the environment by the learning algorithm, standard RL methods can be applied. However, re-training is necessary as soon as the trajectory, i.e. the related exogenous system changes. The proposed algorithm can be seen as an extension to this approach, allowing for arbitrary trajectories thanks to the training data manipulation.

More general methods follow the idea of universal value function approximators [129], which aim to return a state(-action) value conditioned on a target, allowing them to

<sup>39</sup> The infinite horizon is helpful for low-variance estimation, yet the discount factor limits the effective horizon.

learn tasks with different objectives. In their work, however, the target is assumed to be a time-invariant goal state. For time-varying targets, i.e. trajectories, a major challenge is their representation, especially since they are usually only partially known.

Incorporating preview can be done by explicitly including a finite window of length $h_{\mathrm{y}}$ of reference values $\hat{y}_t, \dots, \hat{y}_{t+h_{\mathrm{y}}}$ in the state vector. Since the Q-function for TD-learning is defined on an infinite horizon, an assumption has to be made on the continuation of the profile, e.g. assuming constant values beyond the provided horizon [83]. It may, however, be difficult to make a good choice of $h_{\mathrm{y}}$, since longer previews reduce bias but bloat the input space, i.e. increase the number of critic weights. In this work this is prevented by using a low-dimensional polynomial projection as proposed in [82].

Our proposed algorithm therefore combines feedforward, arbitrary trajectories and limited preview with low variance during learning.

#### **Proposed Method for Integrating Feedforward in the RL Algorithm**

The proposed approach consists of two parts:

1. a projection of the finite target preview onto a low-dimensional reference polynomial, and
2. a manipulation of the training data that replaces the performance measure (1.7) with a surrogate based on the assumption that the reference polynomial is followed indefinitely.
The theoretical and more general foundation for this method has been published in [82].

These components are reflected in the layout of this section: It begins by encoding a set of preview values for the target as polynomial coefficients, then substitutes the performance measure (1.7) with an approximate target that assumes the task is to follow the approximating polynomial forever<sup>40</sup>. This substitution is done through a manipulation of the training data. The section is rounded off with an overview of the complete algorithm.

Polynomials are a common method to compress or encode information<sup>41</sup>, often used in the form of orthogonal polynomials (or "Polynomial Chaos", see e.g. [164] for an introduction), as they allow continuous functions to be reflected flexibly and accurately with a small number of parameters.

<sup>40</sup> This assumption allows to formulate the value function as an infinite sum, which can be bootstrapped in TD learning.

<sup>41</sup> Encoding using a combination of basis functions weighted by parameters is popular throughout many domains, e.g. cosine projection in the popular JPEG image compression format (see e.g. [68]). Other approximation methods that allow for propagation similar to (3.8) are possible in our algorithm, too [82].

In the following, polynomials are used to reflect a preview over a finite horizon using polynomial coefficients. The preview $\hat{y}_{t+\tilde{t}_t}$ is assumed to be given at constant offsets relative to the current time. While these preview points do not have to be equidistant, for simplicity it is assumed here that the preview is sampled at the system sample rate. The offset<sup>42</sup> from the current time step $t$ can therefore be denoted using a discrete index $\tilde{t}_t$.

We employ a set of basis polynomials $\phi$ that allows recovering approximate target values $\tilde{\hat{y}}_t^{(\tilde{t}_t)}$, valid for time step $t + \tilde{t}_t$, from a set of coefficients $p_t = \begin{bmatrix} p_{0,t} & p_{1,t} & p_{2,t} & \dots & p_{n_\mathrm{p},t} \end{bmatrix}$ obtained at time step $t$:

$$
\tilde{\hat{y}}_t^{(\tilde{t}_t)} = p_t\, \phi(\tilde{t}_t) \tag{3.4}
$$

with e.g. $\phi(\tilde{t}_t) = \begin{bmatrix} 1 & \tilde{t}_t \Delta t & (\tilde{t}_t \Delta t)^2 & \dots & (\tilde{t}_t \Delta t)^{n_\mathrm{p}} \end{bmatrix}^\top$ and $p_{0,t}$ denoting the constant term in the approximating polynomial resulting from projecting the preview available at time step $t$, $p_{1,t}$ the coefficient of the linear term, and so on up to polynomial order $n_\mathrm{p}$.

The assumption of constant samples allows for efficient encoding using a time-invariant projection matrix<sup>43</sup>

$$P_{\text{project}} = \left( \begin{bmatrix} \phi(0) & \phi(1) & \dots & \phi(h_\mathrm{y}) \end{bmatrix}^{\top} \right)^{+} \tag{3.5}$$

with $(\cdot)^{+}$ being the operator for a pseudoinverse, e.g. by singular value decomposition. Then, the projected polynomial parameters can be retrieved using the vector-matrix product

$$p_t = \begin{bmatrix} \hat{y}_t & \hat{y}_{t+1} & \hat{y}_{t+2} & \dots & \hat{y}_{t+h_\mathrm{y}} \end{bmatrix} P_{\text{project}} \tag{3.6}$$

In the following, the polynomial that encodes the reference trajectory is called the reference polynomial.
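To make the projection concrete, the following is a minimal numpy sketch of (3.4)–(3.6). The polynomial order, horizon, sample time and preview values are illustrative, and the helper names are not from the thesis:

```python
import numpy as np

n_p = 2     # polynomial order (assumed value)
h_y = 10    # preview horizon: h_y + 1 preview points (assumed value)
dt = 0.1    # sample time Delta t (assumed value)

def phi(t_idx):
    """Monomial basis from (3.4), evaluated at the discrete offset t_idx."""
    return np.array([(t_idx * dt) ** i for i in range(n_p + 1)])

# Time-invariant projection (3.5): least-squares fit of the preview window.
Phi = np.stack([phi(i) for i in range(h_y + 1)])  # rows phi(0)^T .. phi(h_y)^T
P_project = np.linalg.pinv(Phi).T                 # maps preview row -> coefficient row

# Encode a preview window as polynomial coefficients p_t, cf. (3.6).
y_hat_preview = np.sin(0.3 * np.arange(h_y + 1))  # stand-in preview values
p_t = y_hat_preview @ P_project

# Recover an approximate target value for offset 3, cf. (3.4).
print(p_t @ phi(3), y_hat_preview[3])  # close if the preview is near-polynomial
```

Since the projection is a fixed linear map, it can be precomputed once and applied at every time step at negligible cost.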

In contrast to the limited-horizon preview, the reference polynomial is defined on an infinite horizon<sup>44</sup> in $\tilde{t}_t$. Its coefficients are appended to the state vector

$$s_t = \begin{bmatrix} y_t & u_{t-h_\mathrm{u}} & \dots & u_{t-1} & p_{0,t} & \dots & p_{n_\mathrm{p},t} \end{bmatrix}^\top . \tag{3.7}$$

See Section 3.1.1 for the augmentation with previous actions.

Now an optimization goal is defined on a surrogate state value that is based on the assumption that the reference polynomial is followed infinitely.

<sup>42</sup> Note that the local "time" coordinate frames $\tilde{t}_i$ may be arbitrarily rescaled according to numerical needs.

<sup>43</sup> Other methods for obtaining polynomial coefficients, e.g. projection on orthogonal polynomials, are viable, too.

<sup>44</sup> It is assumed that either the extrapolated values resemble the not yet known trajectory, reference values in the far future do not affect the currently optimal choices or that the problem is discounted enough to render extrapolation errors insignificant.

**Figure 3.7:** Qualitative example of a polynomial approximation of the target trajectory. The original control target is given by a finite set of preview points starting in the current time step $t$. By projecting onto a polynomial, a slight approximation error may be introduced and the domain is extended to infinity. The reference polynomial can be expressed in different time coordinate frames, e.g. $\tilde{t}_{t-1}$, $\tilde{t}_t$ or $\tilde{t}_{t+1}$, to ensure a coherent state vector definition for each time step.

Since the reference polynomial is defined relative to the current time step $t$, a transformation $T_\mathrm{shift}$ to subsequent time steps, e.g. $t+1$, is necessary in order to use the same reference polynomial in future state vectors: a shifted reference polynomial allows defining the state vector $s_{t+1}$ according to (3.7), i.e. with an index $\tilde{t}_{t+1}$ starting at $t+1$. This transformation ensures coherence of the state definition by transforming the reference polynomial to the coordinate frame of interest, e.g. $\tilde{t}_{t+1}$. An exemplary trajectory is shown in Fig. 3.7.

For equidistant steps, the polynomial parameters can be transformed using a constant linear operation

$$p_t^{(1)} = p_t\, T_{\text{shift}} \tag{3.8}$$

with $p_t^{(1)}$ indicating that the polynomial coefficients $p_t$ have been shifted from a time coordinate frame $\tilde{t}_t$ starting at $t$ to $\tilde{t}_{t+1}$ starting at $t+1$. This linear transformation is equivalent to applying the time-invariant dynamics of an exo-system to the respective part of the state vector and depends on the choice of basis polynomials. By equating the coefficients of $\sum_{i=0}^{n_\mathrm{p}} p_{i,t}^{(0)} \left((\tilde{t}+1)\Delta t\right)^i = \sum_{i=0}^{n_\mathrm{p}} p_{i,t}^{(1)} \left(\tilde{t}\Delta t\right)^i$, the transformation matrix<sup>45</sup>

$$T_{\rm shift} = \begin{bmatrix} \binom{n_{\rm P}}{0} \Delta t^0 & \binom{n_{\rm P}}{1} \Delta t^1 & \binom{n_{\rm P}}{2} \Delta t^2 & \cdots & \binom{n_{\rm P}}{n_{\rm P}} \Delta t^{n_{\rm P}} \\ 0 & \binom{n_{\rm P}-1}{0} \Delta t^0 & \binom{n_{\rm P}-1}{1} \Delta t^1 & \cdots & \binom{n_{\rm P}-1}{n_{\rm P}-1} \Delta t^{n_{\rm P}-1} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \binom{0}{0} \Delta t^0 \end{bmatrix} \tag{3.9}$$

can be derived.

**Figure 3.8:** Example for manipulation of the target according to methods $M_t$ and $M_{t+1}$ for a zero-order reference polynomial, adapted from [118]. The rationale behind both manipulations is to hide the target change $\hat{y}_{t+1} - \hat{y}_t$ from the learning algorithm. For learning, in the tuple covering time steps $t$ and $t+1$, either the target at time step $t$ can be changed to match the target at time step $t+1$ (manipulation $M_t$) or the target at time step $t+1$ can be changed to match the target at time step $t$ (manipulation $M_{t+1}$). For this example with a constant approximation of the target, both result in equal target values for time steps $t$ and $t+1$. Note that this manipulation is applied solely to the tuple $\langle s_t, u_t, s_{t+1} \rangle$ used for training the state-action value function. Controller evaluation is unaffected and data for other time steps is manipulated independently, such that the manipulated target and the target the policy is evaluated with do not drift apart.
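The shift (3.8) can also be obtained numerically by equating coefficients, as in the following sketch. Note that it uses the ascending coefficient ordering $p_0, \dots, p_{n_\mathrm{p}}$ from (3.4), whereas the printed matrix (3.9) lists the binomial terms for the reverse ordering; the underlying shift is identical up to this convention:

```python
import numpy as np
from math import comb

def shift_matrix(n_p, dt):
    """T_shift with p_shifted = p @ T_shift, found by equating coefficients of
    sum_i p_i ((t+1) dt)^i = sum_j q_j (t dt)^j (ascending ordering)."""
    T = np.zeros((n_p + 1, n_p + 1))
    for i in range(n_p + 1):        # source coefficient p_i
        for j in range(i + 1):      # target coefficient q_j
            T[i, j] = comb(i, j) * dt ** (i - j)
    return T

n_p, dt = 2, 0.1
T_shift = shift_matrix(n_p, dt)
p = np.array([1.0, -0.5, 0.2])      # example coefficients p_0, p_1, p_2

# The shifted polynomial at offset t equals the original at offset t + 1.
x = lambda k: np.array([(k * dt) ** i for i in range(n_p + 1)])
assert np.isclose((p @ T_shift) @ x(4), p @ x(5))
```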

In the real use case, however, the target will change arbitrarily, and the reference polynomials obtained from approximating the preview points at time steps $t$ and $t+1$ may not fulfill the relation (3.8), i.e. $p_t^{(1)} \neq p_{t+1}$. We suggest hiding these incoherences from the learning process by manipulating the data used for training the state-action value function<sup>46</sup>, i.e. learning the surrogate value function

$$\mathcal{V}^{\pi}(y_t, p_t) = \sum_{i=t}^{\infty} \gamma^{i-t}\, R\left(y_i,\ \tilde{\hat{y}}_t^{(i-t)},\ \pi\left(y_i,\ p_t^{(i-t)}\right)\right) \tag{3.10}$$

<sup>45</sup> Despite not being part of the publication, this form was developed jointly with Florian Köpf for [81].

or a correspondingly defined state-action value function. When evaluating the TD error, the algorithm uses the same reference polynomial for augmenting the state vector $s$ for both time steps $t$ and $t+1$, but shifts it to match the difference in time according to (3.8). Two approaches are possible here:

• manipulation $M_t$: the target at time step $t$ is changed to match the target at time step $t+1$, i.e. the reference polynomial obtained at $t+1$ is shifted back and replaces the one in $s_t$;
• manipulation $M_{t+1}$: the target at time step $t+1$ is changed to match the target at time step $t$, i.e. the reference polynomial obtained at $t$ is shifted forward according to (3.8) and replaces the one in $s_{t+1}$.

See Fig. 3.8 for an example and the sketch following this paragraph. We mark state vectors that contain a manipulated target as $\bar{s}$. Since the reward depends on the control target, these manipulations entail recalculating the reward. The manipulations are done solely and independently in the tuples used for training the state-action value function, e.g. before storing them in the experience storage. In our experience, the actual and the manipulated target are often only slightly different: enough to move exploration to the target in some cases (see Section 3.1.2), but still close enough to learn a meaningful state-action value function.
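As a minimal illustration of manipulation $M_{t+1}$, the following sketch overwrites the reference coefficients in $s_{t+1}$ with the shifted coefficients from $s_t$ and recomputes the reward. The state layout follows (3.7); the reward function is passed in as a placeholder and is not the one from the thesis:

```python
import numpy as np

def manipulate_M_tp1(s_t, u_t, s_tp1, n_p, T_shift, reward):
    """Return the training tuple with the target change hidden (method M_{t+1})."""
    s_tp1_bar = s_tp1.copy()
    p_t = s_t[-(n_p + 1):]                   # reference coefficients in s_t, cf. (3.7)
    s_tp1_bar[-(n_p + 1):] = p_t @ T_shift   # enforce the assumed exo-dynamics (3.8)
    y_tp1 = s_tp1_bar[0]                     # plant output is the first state entry
    y_hat_tp1 = s_tp1_bar[-(n_p + 1)]        # constant term = target value at t + 1
    r_bar = reward(y_tp1, y_hat_tp1, u_t)    # the reward must be recomputed
    return s_t, u_t, r_bar, s_tp1_bar        # tuple stored for critic training only
```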

The important goal of acting on target changes preemptively, i.e. before they affect the control error, can be achieved by learning a policy that uses the coefficients of the reference polynomial as its features. Note that no manipulation of the target is applied for the policy, so it accepts truly arbitrary targets as its input. In our case, we use a linear policy (1.5) to benefit from the linear-quadratic architecture of our state-action value function.

Next, a short example from [118] is presented to show the variance reduction capabilities of the proposed manipulation. For this, the system<sup>47</sup> (1.3) and a random target<sup>48</sup> are used. The target is given with one time step of preview, i.e. just the target for the next time step is provided<sup>49</sup>. Two training runs are performed to compare the algorithm's behavior with training data manipulation $M_{t+1}$ and without any training data manipulation.

<sup>46</sup> Value functions with encoded preview are still quadratic if the environment is fully observed and the controller is linear [82], therefore the intuition behind the value function approximator introduced in Section 3.1.1 still holds.

<sup>47</sup> These experiments were performed without adding the delay included in system (1.3), i.e. with a reduced system order.

<sup>48</sup> The target is chosen randomly over time, but in a coherent way, i.e. we stick to what we gave as a preview but choose randomly how to extend it in subsequent time steps.

<sup>49</sup> The projection on a constant polynomial is trivial in this case.

**Figure 3.9:** Mean squared TD loss over training steps for configurations using manipulation $M_{t+1}$ (marked as "on") vs. no manipulation (marked as "off") from [118] for multiple training runs. The agent training on the manipulated data exhibits TD errors several orders of magnitude lower than the agent training on the non-manipulated data.

The hyperparameters are given in Table A.1, set B, in the appendix. Fig. 3.9 shows the progression of the TD error during training. The proposed manipulation effectively reduces the TD error, suggesting that variance is lowered significantly during training. Lower variance enables faster and more precise controller learning, as Fig. 3.10 shows.

#### **Shifting Exploration between Target and Action**

Manipulating the target can be advantageous for RL, as we show in the following.

Using manipulation $M_t$ on the tuple $\langle s_t, u_t, s_{t+1} \rangle$ changes $s_t$ such that the action $u_t$ (calculated from $s_t$ prior to manipulation) would become off-policy even if no exploration noise had been applied. In theory, this may allow the agent to learn without additional exploration noise. In practice, however, it may be necessary to choose a target profile that forces the controller to excite the system, and to ensure that the controller is aggressive enough to translate the excitation in the target to the plant (see Section 2.2.5 for the purposes of exploration noise).

Experiments with several configurations of exploration noise are performed using the system<sup>47</sup> (1.3):

• no exploration noise on either controller output or target,
• exploration noise<sup>50</sup> added to the controller output only,
• exploration noise added to the target<sup>51</sup> only, and
• exploration noise on both controller output and target.

**Figure 3.10:** Mean $\mu$ and standard deviation interval $\mu \pm \sigma$ of controller gains $\theta_\mathrm{c}$ over training steps for a configuration using manipulation $M_{t+1}$ (marked as "on") vs. a configuration training on non-manipulated data (marked as "off") from [118] for multiple training runs. The agent training on manipulated data exhibits lower variance and converges faster. One or both solutions are biased.

This experiment uses the hyperparameters in Table A.1, set B, in the appendix, but varies the exploration as described. The evolution of the controller gain during training is diagrammed in Fig. 3.11. Applying exploration noise to both target and action ("ex. c,t") yields the least variance during training, while applying noise to neither can make the learning process fail, i.e. the controller gain diverge. Both variants that have noise on either the controller output or the target learn, but show higher variance in comparison and therefore learn slightly slower. It is therefore advisable to be careful when foregoing the use of exploration noise.

<sup>50</sup> We use a random number sampled every 100 time steps from a uniform distribution between −1 and 1 as exploration noise in this example.

<sup>51</sup> The target profile consists of an accumulation of a uniform random number sampled between −0.1 and 0.1 at every time step.

**Figure 3.11:** Mean $\mu$ and standard deviation interval $\mu \pm \sigma$ of controller gains during training with the example system (1.3) under different exploration noise configurations, from [118]. When no exploration noise is applied to either target or action (marked as "-"), the learning process may fail. With exploration noise on either target (marked as "ex. t") or controller (marked as "ex. c"), learning succeeds. Best results are obtained with noise on both (marked as "ex. c,t"), where learning occurs quickly and with minimal variance.

The proposed modifications could even be extended beyond what has been presented here: arbitrarily many modified variants of a state transition tuple could be used in the training data set, as long as the targets within each tuple are made to comply with each other. This is possible because the proposed algorithm learns off-policy and explicitly includes a notion of the control target in the state-action value function (cf. universal value functions [129]). By assuming different targets, the agent can learn different state-action values from a single state transition tuple if the reward function is available, thereby efficiently learning the assumed target dynamics (3.8). This idea has become popular through the concept of hindsight experience replay [8]<sup>52</sup>.

This is not only useful for generalizing between targets, but can also be used to virtually explore the target dynamics, i.e. to learn quickly without actually adding noise to the target. This is possible by adding noise to the target in the training data only. This virtual target noise can be applied directly on the reference polynomial coefficients before applying either of the manipulations $M_t$ or $M_{t+1}$ and helps explore the exogenous system dynamics (3.8). We mark state vectors with noise added to the reference polynomial coefficients as $\breve{s}$. Virtual target noise can lead to faster learning, but may skew the distribution of observed targets away from the intended use case.

<sup>52</sup> In [8] the authors give an example about missing a shot in a Hockey game: instead of simply discarding the actions taken prior to the miss as worthless, an agent could learn that if the goal had been in a different position, there would have been a reward.

### **3.1.3 Summary**

This section presented an algorithm based on the (deep) deterministic actor-critic architecture. The main contributions are a state-action value function that allows learning with the partially observed dynamics of the powertrain and a manipulation of the training data that allows learning a feedforward with an arbitrarily changing control target.

For this, the state vector is extended: it now consists of coefficients of a reference polynomial, measurements of the plant output and past actions. The state-action value function approximator uses a learned FIR-filter-like layer to reconstruct missing state information and combines polynomial coefficients, reconstructed state information, output measurement and the action in a quadratic function that yields a low-variance estimate of the state-action value. The proposed method employs a projection on polynomial parameters, optional virtual target noise and a training data manipulation to learn tracking control with little variance.

The proposed algorithm is partitioned into three parts (see Fig. 3.12 for a schematic):

• (policy) evaluation,
• data preparation, and
• training,

which can run at independent sample times. The proposed measures allow for robust learning with few parameters. Chapter 4 shows that they fulfill the requirements to be applied in a real vehicle.

The next section presents another candidate algorithm, which uses a learned model for policy search and is compared with the first in Chapter 4.

**Figure 3.12:** Schematic of the proposed MF RL architecture. In addition to the MF RL architecture from Fig. 2.6, we added three blocks: a state augmentation block augments the measurement $y_t$ with past actions and the coefficients of a reference polynomial obtained from $\hat{y}_t$. We omitted additional target preview points to avoid cluttering the overview. The target noise block adds a random component to the reference polynomial coefficients to accelerate learning of the target dynamics. The match target block ensures coherence between the targets in the state vectors $\breve{s}_t$ and $\breve{s}_{t-1}$ by applying manipulation method $M_t$ or $M_{t+1}$. Note that the manipulation of the target only affects data entered into the experience storage for training. The state vector that is fed to the policy is not manipulated, therefore allowing arbitrary targets $\hat{y}$ to be followed. The overall algorithm can be split into (policy) evaluation, data preparation and training, which can run at independent sample times.

**Figure 3.13:** Schematic of a model learning problem, which is a variant of the parameter estimation problem (cf. Fig. 2.1). Based on a batch of tuples $\{\langle s_{k-1}, u_{k-1}, s_k \rangle\}$ and the latest estimate of the model parameters $\theta_{\mathrm{m},t-1}$, the value $L|_t$, gradient $\nabla L|_t$ and Jacobian $J_L|_t$ are calculated. These are used in an optimizer to obtain an updated estimate of the model parameter vector $\theta_\mathrm{m}$. The process is repeated with newly sampled batches until convergence.

### **3.2 Model-Based RL Algorithm**

This section proposes a computationally efficient MB policy search algorithm.

MB algorithms have the potential for high data efficiency, as they use the data from the real plant to learn a model that in turn can be used to generalize the experience to not yet seen portions of the state space at very low cost (see Section 2.2.6).

Partially observed systems pose the added challenge of simultaneously estimating state and dynamics, which the proposed algorithm avoids by windowing and estimating an autoregressive model with exogenous input (ARX). Based on the estimated model, an optimal output feedback controller is searched for and an inversion-based feedforward controller is designed automatically. With the exception of the feedforward design, the proposed algorithm has been published in [120].

This section starts by introducing our approach to model identification, then briefly touches upon our feedback control optimization using policy search and finally proposes an algorithm for automatically designing a feedforward controller.

### **3.2.1 ARX Model Estimation for Partially Observed Plant**

Since the to-be-learned model serves as a basis for policy search, any error in the approximated dynamics may lead to a bias in the derived policy. It is therefore important to ensure sufficient approximation power to reflect the true system behavior – and, of course, the algorithm must be able to harness these degrees of freedom. On the other hand, too high a number of learnable parameters may not only cause the learning process to overfit, e.g. to noise, but also reduce learning speed. This work tries to find an optimal balance by optimizing a criterion that considers both accuracy and complexity.

The following briefly explains ARX models and the chosen estimation method.

The following model estimation method is taken from [93, p. 176]. It is based on an ARX model that provides a prediction $\mathring{y}_{t+1}$ of the next output $y_{t+1}$ from the $n$ current and past actions and outputs<sup>53</sup>:

$$\mathring{y}_{t+1} = \sum_{i=0}^{n-1} \left( \theta_{\mathrm{y},i}\, \mathring{y}_{t-i} + \theta_{\mathrm{u},i}\, u_{t-i} \right) \,. \tag{3.11}$$

For ease of notation, we define the model parameter vector $\theta_\mathrm{m} = \begin{bmatrix} \theta_{\mathrm{u},0} & \dots & \theta_{\mathrm{u},n-1} & \theta_{\mathrm{y},0} & \dots & \theta_{\mathrm{y},n-1} \end{bmatrix}$. To minimize the one-step prediction error, the per-element loss function

$$l(y_{t-n+1}, \dots, y_{t+1}, u_{t-n+1}, \dots, u_t, \theta_{\mathrm{m}}) = \left(y_{t+1} - \sum_{i=0}^{n-1} \left(\theta_{\mathrm{y},i}\, y_{t-i} + \theta_{\mathrm{u},i}\, u_{t-i}\right)\right)^2 \tag{3.12}$$

is defined and optimized over batches of $b_\mathrm{m} = 200$ samples drawn from the experience storage of size 5000. LM optimization is employed and a decreasing-over-time maxnorm constraint is enforced using norm clipping (see Section A.1). This aims at keeping the variance of the model parameters to a minimum, which would otherwise drive up the variance of the learned controller parameters. A scheme of the optimization process is shown in Fig. 3.13.
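For intuition, the following sketch fits the ARX parameters by ordinary least squares on recorded input/output data. The thesis instead optimizes (3.12) in batches with an LM optimizer and norm clipping, so this is a simplified stand-in:

```python
import numpy as np

def fit_arx(y, u, n):
    """Least-squares fit of theta_m = [theta_u,0..n-1, theta_y,0..n-1] for (3.11)."""
    rows, targets = [], []
    for t in range(n - 1, len(y) - 1):
        # regressors u_t, ..., u_{t-n+1} and y_t, ..., y_{t-n+1} (newest first)
        rows.append(np.concatenate([u[t - n + 1:t + 1][::-1],
                                    y[t - n + 1:t + 1][::-1]]))
        targets.append(y[t + 1])
    theta_m, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta_m

# Usage on stand-in data; in the thesis, samples come from the experience storage.
rng = np.random.default_rng(0)
u = rng.standard_normal(1000)
y = np.zeros(1000)
for t in range(999):                 # simple stable test system
    y[t + 1] = 0.9 * y[t] + 0.1 * u[t]
theta = fit_arx(y, u, n=3)           # recovers approx. [0.1, 0, 0, 0.9, 0, 0]
```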

Model accuracy and model complexity are balanced by choosing the model order $n$ using the minimum description length (MDL) criterion [142, 106]. In contrast to the equally popular Akaike Information Criterion (AIC) [4, 106], it yielded a consistent optimal model order close to $n = 20$ [120] across different dataset lengths.
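One common form of the MDL criterion scores each candidate order by the fit residual plus a complexity penalty growing with $\log N$. The sketch below (reusing `fit_arx` from above) is a hedged illustration; the exact formulation in [142, 106] may differ:

```python
import numpy as np

def mdl_score(y, u, n):
    """MDL(n) = N log V_n + d_n log N with V_n the mean squared one-step
    residual of the order-n ARX fit and d_n = 2n parameters (one common form)."""
    theta = fit_arx(y, u, n)
    res = []
    for t in range(n - 1, len(y) - 1):
        row = np.concatenate([u[t - n + 1:t + 1][::-1],
                              y[t - n + 1:t + 1][::-1]])
        res.append(y[t + 1] - row @ theta)
    N = len(res)
    return N * np.log(np.mean(np.square(res))) + 2 * n * np.log(N)

# e.g. n_opt = min(range(2, 40), key=lambda n: mdl_score(y, u, n))
```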

The resulting model is able to precisely predict the plant output even over extended periods as the simulation result in Fig. 3.14 suggests.

#### **3.2.2 Sampling-Based Controller Design and Inversion-Based Feedforward Design**

This section briefly states the proposed approach on sampling-based policy search. Then it describes the automated design process for the inversion-based feedforward.

The proposed algorithm updates the feedback controller through policy search using numeric gradients of values estimated with the model (3.11) introduced above. The state value is approximated by a truncated sum of rewards calculated from a simulation of model (3.11) with controller (1.5). The infinite sum (1.6) is cut off<sup>54</sup> once $\gamma^{i-t}\, R\left(y_i, \hat{y}_i, \pi(\hat{y}_i, y_i)\right) < 10^{-20}$ or the number of summands exceeds 500. The performance gradient for the policy is computed by finite differences and fed to an LM optimizer instance with a maxnorm constraint to update the controller weight. Gradients and state values are averaged over several initial states sampled from a ringbuffer of the last 5000 augmented state vectors gathered from the plant.

<sup>53</sup> We differentiate between the model output $\mathring{y}$ and the plant output $y$ for clarity. Later on, we will use plant output values $y$ in the place of model outputs $\mathring{y}$ to estimate model parameters.

<sup>54</sup> This truncation criterion is intended to balance loss of accuracy against computational burden: with a stable controller and a constant target, the expression abates over time while the computational effort does not. The contribution of the to-be-truncated portion of the sum tends to decrease with the number of summands taken into account and eventually falls below numerical accuracy.

**Figure 3.14:** Output of a learned model of order 20 and measured speed from [120]. At time 0 s the model was initialized with measured outputs and then simulated for 100 s using the inputs recorded in the car. The recording is of a controller trying to follow a target speed varied every 1.25 s by 1 km h<sup>−1</sup> between 19 km h<sup>−1</sup> and 21 km h<sup>−1</sup>. The model output ('Simulated') is close to the measured speed ('Measured').
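The truncated value estimate and finite-difference gradient can be sketched as follows. The `arx_predict` helper, the quadratic reward and the scalar gain are simplifications assumed for this sketch, not the thesis' exact components:

```python
import numpy as np

def arx_predict(theta_m, hist_y, hist_u, n):
    """One step of the estimated ARX model (3.11); histories hold >= n samples,
    newest last."""
    return (np.dot(theta_m[:n], hist_u[-1:-n - 1:-1])
            + np.dot(theta_m[n:], hist_y[-1:-n - 1:-1]))

def reward(y, y_hat, u):
    return -((y - y_hat) ** 2 + 0.1 * u ** 2)   # placeholder quadratic reward

def value_estimate(theta_c, theta_m, hist_y, hist_u, y_hat, n,
                   gamma=0.9, tol=1e-20, max_steps=500):
    """Discounted reward sum along a model rollout, truncated as in the text."""
    hist_y, hist_u, v = list(hist_y), list(hist_u), 0.0
    for k in range(max_steps):
        u = theta_c * (y_hat - hist_y[-1])      # scalar output feedback, cf. (1.5)
        hist_u.append(u)
        hist_y.append(arx_predict(theta_m, hist_y, hist_u, n))
        r = gamma ** k * reward(hist_y[-1], y_hat, u)
        if abs(r) < tol:                        # truncation criterion
            break
        v += r
    return v

def gradient_fd(theta_c, *args, eps=1e-4):
    """Finite-difference performance gradient w.r.t. the controller gain."""
    return (value_estimate(theta_c + eps, *args)
            - value_estimate(theta_c - eps, *args)) / (2 * eps)
```

In the thesis, such gradients are additionally averaged over several initial states drawn from the ringbuffer before the LM update is applied.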

To enable high-precision trajectory tracking, this work proposes to pair the learning feedback controller with an automatically designed feedforward controller and an accompanying reference filter according to the two-degrees-of-freedom controller scheme [84, 11]. The popular flatness-based feedforward design (see e.g. [52]) requires planning in the coordinate system of the system-dependent flat output (which is unknown prior to model estimation), while stable inversion [170] constrains trajectory planning. The proposed feedforward controller and filter are automatically designed by combining a predefined Butterworth low-pass filter with an approximate inverse of the learned model [21], allowing for flexible trajectory planning in the output space at the cost of possibly suboptimal tracking performance. See Fig. 3.15 for a schematic overview of the automated design process.

The following applies the z-transform to the ARX model, fixes estimation artifacts by correcting poles and filtering, and applies approximate inversion.

The first step is to transform the estimated model (3.11) to the z-domain to yield the transfer function $\tilde{G}(z)$ and to split the numerator into polynomials $N_\mathrm{mp}$ and $N_\mathrm{nmp}$ with stable and unstable zeros<sup>55</sup>, respectively, i.e. zeros within/on and outside of the unit circle:

$$\tilde{G}(z) = \frac{\sum_{i=0}^{n-1} \theta_{\mathrm{u},i}\, z^{-i-1}}{1 - \sum_{i=0}^{n-1} \theta_{\mathrm{y},i}\, z^{-i-1}} = \frac{N_{\mathrm{mp}}(z)\, N_{\mathrm{nmp}}(z)}{D(z)}. \tag{3.13}$$

**Figure 3.15:** Schematic of the automated MB feedforward (FF) design. The design procedure takes an estimated ARX model of the plant as its input. It splits the numerator polynomial into stable and unstable zeros and ensures precise matching of the integrator pole. Then, zero phase error tracking control (ZPETC) is applied and a step of delay is added before inverting the modified model. A Butterworth filter completes the design.

Since misestimation of the integrator can yield a feedforward with nonzero steady-state gain<sup>56</sup>, the pole (i.e. zero of $D(z)$) closest to 1 is moved to exactly 1, yielding $\tilde{D}(z)$. To obtain a stable inverse, $N_\mathrm{nmp}$ is replaced with $\tilde{N}_\mathrm{nmp}$. It is constructed according to the zero phase error tracking control (ZPETC) design (see e.g. [21]) using the zeros $a_j$ with $j = 1 \dots \deg(N_\mathrm{nmp})$ of $N_\mathrm{nmp}$:

$$\tilde{N}_{\text{nmp}} = \prod_{j=1}^{\deg(N_{\text{nmp}})} (-a_j z + 1) \,, \tag{3.14}$$

effectively projecting unstable zeros onto stable ones while preserving the stationary gain. $\tilde{N}_\mathrm{nmp}$ is delayed by one time step to ensure causality. This gives

$$\hat{G}(z) = \frac{z\, N_{\rm mp}(z)\, \tilde{N}_{\rm nmp}(z)}{\tilde{D}(z)}. \tag{3.15}$$

The feedforward design is completed by inverting $\hat{G}(z)$ and adding a Butterworth filter $B(z)$ of order 3 (see [110]) with cut-off frequency<sup>57</sup> 1 Hz:

$$F_{\rm ffw}(z) = B(z)\,\hat{G}^{-1}(z). \tag{3.16}$$

<sup>55</sup> Since zeros in the numerator turn into poles when inverting the transfer function, zeros outside the unit circle are considered unstable in this context.

<sup>56</sup> This means a feedforward controller would still command an acceleration even if the target speed is steady. Without a stabilizing feedback controller, this would cause the vehicle to continously accelerate or decelerate.

<sup>57</sup> Damping high frequencies in the feedforward design was introduced to limit wear on the variable valve timing assembly. Experts from BMW recommended to avoid frequencies above 1 Hz in the input to the drivetrain. In addition, damping high frequencies in the feedforward can help mitigate artifacts resulting from erroneous model estimation.

The accompanying reference filter results from cancelling out terms in

$$F_{\rm ref}(z) = F_{\rm ffw}(z)\,\tilde{G}(z). \tag{3.17}$$

During learning, the parameters of the feedforward $F_\mathrm{ffw}$ and the reference filter $F_\mathrm{ref}$ are adapted while they are running. With every parameter update, their internal state vectors become meaningless and are therefore reinitialized as if the current target $\hat{y}_t$ were their steady state, avoiding a settling phase during driving. Appendix A.2 provides more detailed insight into this matter.
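The design chain (3.13)–(3.16) can be sketched with numpy/scipy as follows. The helper name, the descending-powers polynomial convention and the handling of complex zeros are assumptions of this sketch rather than the thesis' implementation:

```python
import numpy as np
from scipy import signal

def design_feedforward(num, den, fs, fc=1.0, order=3):
    """ZPETC-style feedforward (3.13)-(3.16); num/den in descending powers of z."""
    # Move the pole closest to 1 exactly onto 1 (integrator correction, D -> D~).
    poles = np.roots(den)
    poles[np.argmin(np.abs(poles - 1.0))] = 1.0
    den_tilde = np.real(np.poly(poles)) * den[0]

    # Split zeros into stable (inside/on the unit circle) and unstable ones.
    zeros = np.roots(num)
    stable, unstable = zeros[np.abs(zeros) <= 1], zeros[np.abs(zeros) > 1]
    n_mp = np.real(np.poly(stable)) * num[0]

    # ZPETC (3.14): replace each unstable factor (z - a) by (-a z + 1); both
    # evaluate to (1 - a) at z = 1, so the stationary gain is preserved.
    n_nmp_tilde = np.array([1.0 + 0j])
    for a in unstable:
        n_nmp_tilde = np.convolve(n_nmp_tilde, [-a, 1.0])
    n_nmp_tilde = np.real(n_nmp_tilde)      # conjugate pairs yield a real product

    # G_hat (3.15): one extra z in the numerator, then invert and cascade the
    # discrete Butterworth low-pass (3.16).
    g_hat_num = np.convolve(np.append(n_mp, 0.0), n_nmp_tilde)  # multiply by z
    b, a = signal.butter(order, fc / (fs / 2))
    return np.convolve(b, den_tilde), np.convolve(a, g_hat_num)  # F_ffw num, den
```

The reference filter (3.17) then follows by cascading the returned feedforward with the estimated model and cancelling common terms.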

#### **3.2.3 Summary**

The proposed algorithm uses an estimated ARX model as a basis for designing a feedback controller using policy search and a feedforward controller using approximate inversion. Learning from batches, along with a (dynamic) maxnorm constraint on the optimizer, facilitates quick and noise-free learning purely from plant inputs and outputs, allowing learning on a partially observed plant. The automated feedforward design is implemented in a way that circumvents pitfalls of the model estimation.

The proposed MB algorithm fits the scheme presented in Fig. 2.7: controller evaluation and data preparation are performed within the sample time $\Delta t$, while the training, comprising model estimation and control design, can be prolonged to balance performance limits against learning speed. This allows the algorithm to run online on constrained hardware.

The next chapter compares the presented MB and MF algorithms in a real vehicle across several experiment setups.

# **4 Validation in the Car**

In this chapter, the presented MB and MF algorithms are applied to a real vehicle to demonstrate that they are capable of learning under real-world conditions and with constrained resources. It shows that they deliver consistent results robustly under varying circumstances and points out limitations. The results reveal that the learning process is fast enough for application in day-to-day engineering, as convergence is reached within minutes.

As a preparation step for the vehicle experiments, a simulation study presents systematic perturbations of important hyperparameters to visualize their influence and gives an intuition on why the chosen hyperparameters are deemed sensible. After that, a basic experiment setup for learning speed control in a real road vehicle is introduced. This basic experiment is conducted multiple times to showcase the robustness and repeatability of the learning process and its result. Then, factors of this experiment are varied to assess the influence of external conditions on the learning results as well as the influence of exploration noise on the learning process and the results. See Fig. 4.1 for an overview. Throughout the presented experiments, the results of the MB and MF algorithms are close, suggesting the learned results are bias-free or similarly biased. A part of the results has been published in [120].

### **4.1 Hyperparameter Choice**

This section gives an intuition on the effect of hyperparameters on the behavior of the MF RL algorithm presented in Section 3.1. It must not be misunderstood as a strict proof of the optimality of the parameters chosen in our vehicle experiments. This is mostly due to three reasons:

• The simulation study can only show that the algorithm works under these circumstances, but can not guarantee good performance in the real-world experiment.

• Hyperparameters may compensate each other's effect to some extent. This allows accommodating for side effects in some choices, but often makes it difficult to choose one set over another.

**Figure 4.1:** Overview of experiments. All experiments are variations of the baseline experiment in Section 4.2. Variations of the conditions outside the algorithm are marked in shades of blue: subsequently, the disturbance level is varied by performing an experiment on a track consisting of ramps and tight turns. Experiments at different speeds (and gears) are described in Section 4.3 (dark blue). Experiments that vary components of the algorithm are highlighted in shades of red: the exploration noise configuration is changed in Section 4.4 (dark red). The reward function is modified in Section 4.5.1 (red) and a learning feedforward controller is added in Section 4.5.2 (light red).

The fact that hyperparameters are problem-dependent and have a high impact on the algorithm's performance is known in the literature: RL algorithms may "require a high degree of human expertise" for tuning hyperparameters to fit new applications [168]. Ambiguous performance resulting from hyperparameter tuning has sparked a debate on the reliability of performance baselines for algorithm comparison [70]. Frameworks for tuning hyperparameters in RL have started to emerge in recent years, e.g. [22, 70]. These frameworks aim to find a hyperparameter set that optimizes performance in simulation in a black-box fashion. The following is therefore intended to make the choices in this work plausible and to serve as a guide to interested readers for their own hyperparameter tuning tasks.

Before presenting the results of our simulation study, this section introduces an understanding of bias, variance, speed and stability, which make up the (informal) tuning goal. Then, the simulation experiment setup is presented and an overview of the hyperparameters to be varied is given.

In contrast to many publications that limit themselves to pointing out the bias-variance trade-off, this section adds two further facets to the discussion: stability and speed.

In this context, **variance** is considered a measure of how much a learned parameter fluctuates, especially (but not only) once learning has converged. This fluctuation can be understood as how much a parameter differs across training runs, but also as oscillation or noise around the trajectory that leads the parameter towards its convergence value over training time. Generally, the goal is low variance when tuning hyperparameters. Variance can typically be traced back to the respective loss function (including the data it is evaluated on) and its gradients, as well as its interactions with the function approximator. The optimizer (see Section A.1) can mitigate or aggravate this effect depending on its type and configuration. In the experiments, multiple training runs are performed with each hyperparameter configuration. The output controller has only one parameter in these experiments, allowing mean and standard deviation over the training duration to be visualized meaningfully, which gives an intuition of the variance in policy learning. For the critic, this approach would not yield informative diagrams<sup>58</sup>, which is why this work resorts to plotting the TD loss (see Section 2.2.2) over training time. The TD error (2.12) is not a direct measure of variance for the critic<sup>59</sup>, but its value encodes how unexpected the training data is, or how well the state transitions in the training data batch match the critic's value estimate. For training the actor, the Deterministic Policy Gradient is used, which, if inaccurate, can cause variance in the policy parameters. The gradient's accuracy, in turn, relies on the quality of the state-action value function approximation. The TD error can therefore be seen as an indicator for the precision of the gradient used for training the policy. Albeit having a large error margin, it can thus be taken as a hint of how much the Deterministic Policy Gradient contributes to the variance in the actor parameters.

<sup>58</sup> Due to the high number of parameters in a very similar range, plots of the weights over training time result in overlapping graphs that are hard to read.

The following understands **bias** as the deviation of the learned parameter from the optimal/desired<sup>61</sup> parameter. Especially under the latter definition, bias cannot be measured directly, since the optimal/desired value may not be available, as it depends on the state distribution (see Section A.5). However, this definition allows considering the parameters defining the optimization goal (e.g. in the reward function) as hyperparameters affecting algorithm performance. For example, it might be a valid choice to deviate from the actual learning objective to enhance learning behavior. Judgement of bias from the presented experiments is therefore limited: hyperparameters may well affect the optimal gain (e.g. by changing the observed state distribution), but may also skew the learned policy directly. To some extent, this can be seen by comparing the performance of different hyperparameter sets, but it is difficult to factor in the effect of the state distribution. The effect of the state distribution is therefore included in this work's definition of bias, allowing a parameter's influence on bias to be inferred from the difference in learned controller gains between tested configurations. The goal is to limit the amount of bias through hyperparameter choice where possible.

This work's understanding of **stability** is how prone the algorithm is to (catastrophic) failure during learning, i.e. divergence of the learned controller parameters. While in most cases stability is adversely affected by variance, this section points out that some hyperparameters pose exceptions to this, thus confirming the need for this additional perspective. Stability is measured by the relative frequency of experiment rollouts with state vector norms that exceed a large threshold<sup>62</sup>. Stability is of utmost importance for real-world applications. If a hyperparameter threatens to introduce instability, it is tuned in a conservative fashion.

In the following, **learning speed** is the inverse of the time necessary for the learning process to converge.

<sup>59</sup> Changes of the TD error between learning steps can be due to noise in the parameters, too. However, there may also be other reasons, such as sampling of different batches.

<sup>60</sup> A vanishing TD error would only be achieved if the state-action value function had been learned to perfection and the environment is noise-free.

<sup>61</sup> This work treats some of the hyperparameters that define the optimum as tunable. Modifying them therefore leads to a deviation from the desired optimum, but may facilitate learning. The then-optimal learned parameter set may thus differ from the desired optimum which is considered bias according to the understanding in this work.

<sup>62</sup> Once a simulation crosses the threshold, it is terminated immediately. In the plots given in this section, the parameters learned in such a run are only included up to the point where the threshold was crossed.

This can be looked at from multiple angles: in real-world applications, the computation time for a learning step plays an important role, as does the waiting time for filling the experience storage or optimizer performance. Another aspect influencing this notion of speed is the definition of convergence, loosely understood as a steady state or plateau<sup>63</sup> in the policy parameter. Additionally, the number of necessary training steps or the computation time may depend on how far the starting point is from convergence, i.e. how far apart the initial parameter set and the parameter set at convergence are. Unfortunately, these concepts are not unambiguously and accurately measurable, which is why any verdict is based on experience and the comparison of example training runs. The comparison of learning speed is therefore limited to cases where the differences in how fast a plateau is reached are visible in the plot of the controller gain. For real-world applications speed is very important, since exploration causes wear on the plant and induces experiment cost.

The following simulation study aims to showcase the influence of important hyperparameters on the four goals: variance, bias, learning speed and stability. For this, it systematically varies the hyperparameters used in the vehicle experiments (Table A.1, set E; with the exception of the learning interval) and applies the resulting algorithm to the example system (1.3). For each configuration, 10 training runs are performed. The training process terminates after a simulated duration of 200 s, or if the norm of the simulated state vector exceeds a threshold of 10<sup>4</sup>, to catch severely unstable (i.e. rapidly diverging over a sustained period) controllers. The majority of the plots in the following are averaged over 10 training runs, and we mark the interval of one standard deviation around the average in each. Where averaged results are presented, training runs that were terminated prematurely are excluded from the averages.

See Fig. 4.2 for an overview of the parameters varied in the simulation study. Although the example system (1.3) mimics the real vehicle's behavior reasonably well in most cases, a few exceptions were observed. We therefore comment on the results with our experience from the trial runs in the real car in mind.

The complete results of our simulation study can be found in Appendix A.3. Here, we limit ourselves to giving the example of the discount factor and provide a tabular summary.

**Discount Factor** $\gamma$ While the present use case values present and future rewards equally (see Section 2.2.2), in RL it can be beneficial to set the discount to a value lower than 1. A discount close to 1 decreases stability<sup>64</sup> and increases variance (see the TD error<sup>65</sup> and actor gain in Fig. 4.3), but can be seen as decreasing bias (see the actor gain in Fig. 4.3). This is in line with other research that considers discount as a form of regularizer [5] and suggests progressively increasing it during learning [41]. In the vehicle experiments, the discount was chosen below 1 to accommodate other sources of variance, e.g. from the environment. From the graph, little difference in learning speed can be seen: the actor gains stabilize between 110 s and 130 s. The graph suggests a minor tendency of learning slowing down with increasing discount factor, but the increase in variance towards higher gains prevents such a conclusion. The chosen discount value of 0.9 leaves a margin to select other variance-affecting hyperparameters aggressively enough to allow for fast learning.

<sup>63</sup> While detecting a plateau is feasible with a criterion, this criterion would need to be adapted due to the different levels of variance across experiments and over training time. An exact comparison is therefore not possible.

<sup>64</sup> Stability was not affected in the simulated experiments, but occasionally diverging controller gains have been observed with a discount factor of $\gamma = 0.95$ and higher.

**Figure 4.2:** Overview of experiments in the simulation study. The complete results can be found in Appendix A.3.

**Figure 4.3:** TD error and actor gain for multiple discount factors $\gamma$, averaged over 10 training runs, with their respective standard deviation intervals. While a discount factor $\gamma = 1$ is considered ideal in this work, it would induce infeasible variance levels in the policy gradient, as can be seen from the TD error increasing with the discount factor. However, lower discount factors yield lower actor gains, which translate to less aggressive controllers. This can be seen as bias with respect to the intended controller. The discount factor $\gamma$ therefore epitomizes the bias-variance trade-off in RL. Training only begins after the first 100 s since state transitions from this period are used to fill the experience storage.

**Table 4.1:** Overview of the hyperparameter influence observed on the simulated environment. Each parameter was varied in at least three steps. An asterisk (\*) marks that a fourth value has been tested. The arrows mark how an increase over the given range of the respective hyperparameter affects stability, variance, bias and speed. Desired effects are marked in green, undesired ones in red. If no clear effect is visible, the respective cell is left grey without an arrow. For ambivalent effects, e.g. an increase at first and a decrease later, two arrows are drawn, and for weak effects the arrow is put in brackets.

Table 4.1 summarizes the intuitions on hyperparameter and noise influence on the learning process. Each row describes how an increase of the respective parameter affects stability, variance, bias and speed. The desired effects are coded in green, and undesired effects are marked in red.

<sup>65</sup> Discount has an impact on the TD error since it dampens the approximation error for future rewards in the Bellman equation.

**Figure 4.4:** Test vehicle based on a *BMW 740Li* [17, 18] used for online training. It is equipped with a *dSpace Autobox* that runs our proposed algorithm online, closed-loop and in real time. It accesses controllers for chassis/brake and powertrain via in-vehicle bus systems.

With the choice of hyperparameters in place<sup>66</sup>, we embark on our vehicle tests in the next section.

### **4.2 Baseline Experiment**

The vehicle tests begin with a standard experiment that shows the fitness of the presented approaches to this application and serves as a baseline to benchmark the effect of variations later. This section describes the vehicle setup, track and standard experiment before giving an overview of the results.

### **4.2.1 Setup**

The experiments are carried out using a test vehicle based on a *BMW 740Li* [141] (see Fig. 4.4).

The drivetrain consists of a 3.0 L inline six-cylinder petrol engine with a maximum power output of 240 kW and a maximum torque of 450 N m. It is equipped with a *TwinScroll* turbocharger, *Valvetronic* and *VANOS* for variable valve timing and lift. The vehicle is driven through an 8-speed automatic transmission with rear-wheel drive. In the presented experiments the vehicle is operated at the upper end of the permissible weight of 2445 kg, with co-driver, driver and an electronics rack installed in the luggage compartment (see Fig. 4.4).

<sup>66</sup> In Appendix A.4 several sets of hyperparameters are given. Despite their similarity, we use several sets in the following vehicle experiments since the experiments were carried out over several measurement campaigns. It is our goal to keep the experiments comparable and refer to the corresponding hyperparameter set for each experiment. Since this work aims at comparing two different algorithm architectures, these are tuned to work with comparable hyperparameter sets, but may only reach their full potential if this constraint is lifted.

It is equipped with a *dSpace Autobox* [29] containing a *dSpace DS1007 PPC Processor Board*<sup>67</sup> [30] that is connected to in-vehicle bus systems. Through these it has access to controllers for the drivetrain which are used to command torque requests from brakes and powertrain. These commands are executed using feedforward controllers. The vehicle speed is estimated based on wheel speed sensors and available through an in-vehicle bus signal in a quantized form with little noise and delay. Delays affecting data transferred over the in-vehicle bus systems are negligible. Programming, monitoring and evaluation is done via a second computer accessing the *dSpace Autobox* via Ethernet.

The *dSpace Autobox* replaces a factory-installed vehicle controller in the vehicle network and provides the interface to the proposed algorithms (see Fig. 1.2):

• it receives the estimated vehicle speed and further signals from the in-vehicle bus systems, and
• it sends torque requests to the low-level controllers of brakes and powertrain.

Both algorithms are split in two steps that are run in separate tasks on the *dSpace Autobox* (see Figs. 3.12 and 2.7):

• controller evaluation and data preparation, executed in a real-time task within each sample time $\Delta t$, and
• training, executed in a lower-priority task whose duration may exceed the sample time.
Except where stated otherwise, all our learning experiments are performed on a 1.4 km long, mostly level and straight road. The car is manually accelerated to be close to a target speed of 20 km h<sup>−1</sup> in second gear, then the controller and learning process are activated simultaneously<sup>68</sup>. Near the end of the road the controller is deactivated, the car is turned and again manually accelerated to a speed close to the target speed, then the controller and learning process are activated again.

<sup>67</sup> At the time of experiment preparation, the hardware used was among the most powerful real-time capable boards rated for in-vehicle application. Future work may harness the performance available through more recent hardware.

<sup>68</sup> Since both presented algorithms are off-policy and the idea of virtual targets has already been introduced, it seems a natural extension, left as future work, to gather data from manual operation of the vehicle and tune the controller in the background. This offline training setting poses the risk of driving the system into areas of the state space that have not been observed before, but this topic has been addressed in the literature recently, see e.g. [26].

**Figure 4.5:** Progression of the output feedback gain during four training runs in the default experiment setup: two with MB and two with MF learning algorithm. The initial slope of the learning curves is limited by the maxnorm constraint. Both configurations converge to similar results after a few minutes while keeping a certain amount of variance after convergence.

The learning process idles until the experience storage is filled to capacity. The experiments run until a total of 1500 s of combined time for filling the experience storage and learning has elapsed.

The baseline uses exploration noise added to the controller output that is sampled from a uniform distribution between $\nu_\mathrm{u} = 0.5$ m s<sup>−2</sup> and $-\nu_\mathrm{u}$ every $t_\mathrm{noise} = 1$ s. While the controller is active, the control target varies in 1 km h<sup>−1</sup> steps around a base speed<sup>69</sup> of $\bar{\hat{y}} = 20$ km h<sup>−1</sup> every 1.5 s in a regular pattern for additional excitation (see Fig. A.19). For a complete overview of hyperparameters in the baseline experiment, see Table A.1, set E, for the MF algorithm and Table A.2 for the MB algorithm.
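For illustration, the baseline excitation can be reproduced as follows; the sample time is an assumption, and the alternating square wave is a simple stand-in for the regular pattern of Fig. A.19:

```python
import numpy as np

dt = 0.02                                   # assumed sample time in s
t = np.arange(0.0, 30.0, dt)
rng = np.random.default_rng()

# Uniform exploration noise in [-0.5, 0.5] m/s^2, redrawn every t_noise = 1 s.
steps_per_noise = int(1.0 / dt)
noise = np.repeat(rng.uniform(-0.5, 0.5, size=len(t) // steps_per_noise + 1),
                  steps_per_noise)[:len(t)]

# Target: +/- 1 km/h square wave around 20 km/h, switching every 1.5 s.
target = 20.0 + np.where((t // 1.5).astype(int) % 2 == 0, 1.0, -1.0)
```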

#### **4.2.2 Result**

The results of two rollouts each with the MB and MF algorithms are shown in Fig. 4.5. All four runs exhibit an almost linear slope at the beginning of the experiment, where the optimizer is limited by the respective maxnorm constraint. The MB algorithm has a stricter constraint, which slows it down more than the MF algorithm.

<sup>69</sup> The speed was chosen to reflect average driving speeds in urban areas [7]. It has an additional benefit for safety and practicability. At low speeds a trained driver can swiftly return to a safe state even in case of extreme unexpected controller outputs. High speeds have the additional disadvantage of resulting in longer distance driven, requiring either a more extensive test track (with potentially more variance in the environment) or more frequent turns.

After around 120 s the MF algorithm reaches its steady state, with its controller gain oscillating around 0.5. The MB algorithm takes around 300 s to reach a steady state, again around a controller gain of 0.5. In their steady state, both MF and MB algorithms exhibit some variance, with gains<sup>70</sup> oscillating between 0.4 and 0.55.

These results show that both algorithms can repeatably learn under real-world conditions. With convergence reached in under 2 min and 5 min respectively, both the MF and the MB algorithm are relatively fast<sup>71</sup>. The learned gains are in the same area, suggesting that both have either the same bias or no bias.

These results suggest that tuning a speed tracking controller using both proposed RL approaches is feasible, since both converge to similar values. With hyperparameters chosen to be comparable, this is in line with our expectations. The differences in the trajectory of the controller parameter at the beginning of the experiment result from a stricter maxnorm constraint on the MB algorithm, which is necessary to ensure stability of the learning process. Still, a significant amount of variance remains even once learning reaches a steady state for both algorithms.

Both convergence speed and variance in steady state can be influenced by optimizer tuning, but are conflicting goals. A decaying maxnorm constraint can help alleviate this. The minimum maxnorm was intentionally kept at nonzero levels even after convergence to avoid premature stopping and prove that the algorithm converges on its own<sup>72</sup> .

The experiment was performed in a controlled environment, yet some factors still add variance: minimal differences in road grade, slight steering corrections and different initial conditions after activation of the controller all contribute. However, these influences are expected to be minor, since they should be averaged out by sampling learning data from a comparably large experience storage.

To gain an intuition on strategies to optimize rewards, experiments using different gains without exploration noise were performed. While these considerations are not sufficient to scientifically prove the optimality of the RL algorithm's choice, the interested reader can find them in Appendix A.6.

With the feasibility shown and the baseline established, the next experiments vary a few important factors in the experiment setup to examine the robustness of the proposed algorithms in different scenarios.

<sup>70</sup> The learning result matches the gain $\theta_\mathrm{c} = 0.5$ tuned by an expert engineer according to the subjective rating scale referred to in Section 1.2.3.

<sup>71</sup> Convergence speed is influenced by convergence speed of the algorithm and computation time. The former depends on algorithm design (see chapters 2, 3 and appendix A.1). The latter is a mixture of algorithm design decisions like neural network architecture (see sections 2.1.2 and 3.1.1) and batch size (see appendix A.3.2).

<sup>72</sup> See Appendix A.3 to get an intuition on how the maxnorm affects learning in a simulated example.

**Figure 4.6:** Training runs at target speeds of $\bar{\hat{y}} = 10$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in first gear, $\bar{\hat{y}} = 20$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in second gear, and $\bar{\hat{y}} = 30$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in third gear, for the MF and MB algorithms. At the lowest speed the learned gain is slightly lower, around 0.4, compared to the gain of 0.5 learned at $\bar{\hat{y}} = 20$ km h<sup>−1</sup> and $\bar{\hat{y}} = 30$ km h<sup>−1</sup>. The MB and MF algorithms converge to similar gains at each speed, respectively. The evolution of the gains is qualitatively similar, which suggests that the algorithm is capable of learning throughout the tested speed range.

### **4.3 Effect of Conditions on Learning and Result**

Vehicle dynamics is a composite of complex powertrain dynamics and resistances. This section shows that the controller is able to adapt to the different dynamics throughout the operational range of the vehicle by testing it at different speeds and in a scenario with a high disturbance level.

The set of experiments varying speed relies on mostly the same setup as the baseline experiment: test vehicle, track and algorithm are identical. Instead of using target speeds from the interval 20 km h<sup>−1</sup> ± 1 km h<sup>−1</sup>, the base speed is changed to 10 km h<sup>−1</sup> and 30 km h<sup>−1</sup>, respectively. This reflects common speeds in urban areas [7] and puts the learning algorithm to the challenge of near-standstill vehicle dynamics on the one hand, and tests its applicability at intermediate speeds on the other.

Fig. 4.6 shows the progression of the learned gain during training runs at different target speeds:

• At speeds around $\bar{\hat{y}} = 30$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in third gear, the learning results are similar to the results for $\bar{\hat{y}} = 20$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in second gear for both the MB and MF algorithms<sup>73</sup>.
• $\bar{\hat{y}} = 20$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in second gear is the baseline experiment presented above.
• Training at $\bar{\hat{y}} = 10$ km h<sup>−1</sup> with a variation of ±1 km h<sup>−1</sup> in first gear results in a slightly lower gain around 0.4 for both the MB and MF algorithm.

<sup>73</sup> This seems plausible since the powertrain is likely powerful enough to provide a similar acceleration behavior at both 20 km h<sup>−1</sup> and 30 km h<sup>−1</sup>. This may change at very high speeds, which were not tested in this work for safety reasons.

The evolution of the gains is similar to the learning process in our default configuration, suggesting that both algorithms are capable of learning at different speeds<sup>74</sup>. The lower gains in the low-speed experiment may be due to nonlinear adaptation of the low-level controller for parking applications or temporary disengagement of the torque converter<sup>75</sup> between engine and automatic gearbox.

The next experiment aims to demonstrate the algorithms' robustness towards external disturbances. It uses the same vehicle setup as the baseline experiment, but is performed on a track with slopes of up to 16 % and tight turns, without deactivating the controller. An aerial view of the track with the path driven during the experiment is given in Fig. 4.7. Due to the frequent changes in disturbance, the steady-state disturbance compensation in the low-level controller can only partly counteract them, requiring the to-be-learned controller to compensate.

Fig. 4.8 shows that both the MB and MF algorithm deviate from their behavior in the baseline experiment: the MF algorithm initially has a similar slope, but gradually


<sup>74</sup> The learned controllers could conveniently be combined in a gain-scheduling fashion to provide a single controller with a wide range of operation. This is left for future work.

<sup>75</sup> A torque converter is a hydraulic component used between a combustion engine and an automatic transmission. It allows the engine to idle while the car is stopped in gear, boosts engine torque at very low speeds and behaves similar to a (dampened) solid connection at higher speeds. In modern powertrains the torque converter is either replaced by automated clutches or equipped with a clutch that can bridge the torque converter to avoid losses at higher speeds [101, Section 6.3.4, pp. 111].

**Figure 4.8:** Comparison of the evolution of output gains during training under different disturbance conditions. The baseline experiment, performed on a straight, even road, is compared to learning while repeatedly driving over a steep hill. In the latter case neither learning nor controller were deactivated during turning, resulting in a scenario with heavy disturbance. Both the MF and MB controller learn a higher gain<sup>76</sup>.

climbs towards higher gains and is close to 0.7 at the end of the experiment<sup>76</sup>. The MB algorithm shows some variance from the start, and the plot of its gain shows more pronounced interruptions of the downward trend. After around 500 s the gain is around 0.7 and stays there until the experiment is terminated.

Since the optimal output controller gain depends on the scenario, this behavior is in line with our expectations.

This shows that the algorithms are capable of learning even in scenarios with frequent and strong disturbances, i.e. in less controlled environments.

### **4.4 Effect of Exploration Noise on Learning and Result**

The choice of exploration noise not only affects the learning result by determining which parts of the system dynamics are exposed through excitation, but may also influence the distribution of experienced states (see Section 2.2.5). Two variations of the baseline experiment are therefore conducted, in which the amplitude and the distribution of the exploration noise are varied, respectively. These experiments show that the MB and MF algorithms react differently to variations of exploration noise. While both

<sup>76</sup> The experiment had to be terminated earlier due to limited availability of the proving grounds.

**Figure 4.9:** Training runs with different amplitudes of exploration noise. The training runs with exploration noise amplitude $0.25\ \mathrm{m\,s^{-2}}$ tend toward slightly higher gains for the MF algorithm than the training runs with amplitudes $0.5\ \mathrm{m\,s^{-2}}$ and $0.75\ \mathrm{m\,s^{-2}}$. In these cases MB and MF converge to similar values around $\theta_c = 0.5$.

algorithms are susceptible to amplitude variations, the MF algorithm in particular is sensitive to the distribution.

The first variation of the baseline experiment changes the amplitude $\nu_u$ of the exploration signal added to the controller output. The baseline experiment uses $\nu_u = 0.5\ \mathrm{m\,s^{-2}}$, which is lowered to $\nu_u = 0.25\ \mathrm{m\,s^{-2}}$ and increased to $\nu_u = 0.75\ \mathrm{m\,s^{-2}}$, respectively. Values beyond this range were not deemed feasible<sup>77</sup>: large amplitudes can result in commands that cause the vehicle to stop or strongly increase the engine speed if the controller allows too high a deviation from the target speed. Additionally, the exploration noise amplitude controls how much sudden changes in acceleration are magnified, which results in wear on the engine. All other aspects of the experiment setup remain unchanged compared to the baseline experiment.

The results are shown in Fig. 4.9. In the presented scenario the MF algorithm learns gains around 0.7 when run with a maximum amplitude of $\nu_u = 0.25\ \mathrm{m\,s^{-2}}$. The MB algorithm returns a learned gain around 0.5 in this case. Increasing the exploration noise amplitude to $\nu_u = 0.75\ \mathrm{m\,s^{-2}}$ yields gains around 0.5 for both algorithms.

In comparison to the learned gain in the baseline experiment, learning with the lowered exploration noise amplitude yields a higher gain, while the gain learned using an increased exploration noise amplitude is similar to the one learned in the baseline experiment.

<sup>77</sup> More extreme choices are presented in a hyperparameter study based on simulation in appendix A.3.4.

**Figure 4.10:** Training runs with different kinds of exploration noise. While the MB algorithm converges around $\theta_c = 0.5$ with all tested variants of exploration noise, the MF algorithm is more sensitive to the choice of exploration noise type. The training run with rectangular noise reaches a lower convergence value around $\theta_c = 0.3$, while the run using parameter noise does not converge within the experiment duration.

The experiment shows that the presented MF algorithm is susceptible to changes in exploration noise. This may be due to the differences in policy search: the MB algorithm optimizes its controller in noise-free simulated rollouts of its learned model, so changes in the observed state distribution due to exploration are not reflected in that process, whereas the MF algorithm learns directly from the noise-affected observations.

In a second variation of the baseline experiment the exploration noise distribution is changed. Again, all other factors are kept as in the baseline experiment. In addition to the uniform distribution used in the baseline experiment, a rectangular signal and parameter noise were used as exploration signals. The sampling interval of the exploration noise was not varied to avoid damage to the variable valve timing assembly from high-frequency excitation. Appendix A.3.4 shows the effects of this variation in simulated experiments. See Section 2.2.5 for more information on exploration noise.
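For illustration, the two action-space noise variants could be generated as in the following sketch (parameter noise, in contrast, perturbs the policy parameters rather than the commanded acceleration and is therefore not shown). Function names and the interpretation of the rectangular signal as a random binary signal are our assumptions, not the exact signals used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def uniform_noise(n_steps: int, amp: float) -> np.ndarray:
    """Uniformly distributed action noise in [-amp, amp] (baseline variant)."""
    return rng.uniform(-amp, amp, size=n_steps)

def rectangular_noise(n_steps: int, amp: float) -> np.ndarray:
    """Rectangular action noise switching randomly between -amp and +amp."""
    return amp * rng.choice([-1.0, 1.0], size=n_steps)
```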

The presented MB algorithm seems to be more robust against different choices of exploration noise, with all configurations converging to matching gains as shown in Fig. 4.10. The MF algorithm exhibits large variance when paired with parameter noise, with its parameters oscillating over a large range of values (not depicted), and converges to slightly different values when combined with rectangular noise.

Exploration noise therefore seems to be a sensitive influence on the MF algorithm, less so on the MB algorithm. We expected the optimal gain to be influenced by the exploration noise, since the optimal output control gain depends on the initial state of the optimization. The optimization of the MB controller gain results from simulated rollouts starting from observed initial states. In contrast, the MF algorithm solely relies on observed data. The distributions of simulated and experienced states may differ, with the simulated data not corrupted by exploration noise. This may explain the difference in sensitivity towards exploration noise.

### **4.5 Extensions**

With a modification of the proposed algorithms it is possible to include the commanded jerk as an additional measure of comfort in the reward function. Learning then results in a slightly lower controller gain for the MB algorithm, while the result of the MF algorithm hardly differs from the baseline. The MF algorithm is further extended to learn an optimal feedforward using preview (introduced in Section 3.1.2), and the MB algorithm is enabled to learn an inversion-based feedforward (introduced in Section 3.2.2).

#### **4.5.1 Including Commanded Jerk in Reward**

As proposed in Section 1.2.3, a variant of the proposed algorithms includes the commanded jerk in the reward function to better reflect common measures of passenger comfort for longitudinal dynamics. With the reward function changed accordingly, both algorithms undergo an experiment with all other factors equal to the baseline experiment.

Fig. 4.11 shows that both algorithms learn a slightly lower gain compared to the baseline experiment. The MF training run yields a gain around 0.4, which is close to the result of the baseline experiment.

The difference in behavior may again be due to the different distributions the controllers are optimized with: the MB algorithm relies on simulated rollouts starting from observed initial states while the MF algorithm purely works on observed state vectors. Since exploration noise increases jerk in the recorded data, but is not present in simulated rollouts, the resulting controller gains may differ.

Generally, it seems plausible that the added penalty on commanded jerk shifts the balance between controller output and tolerated control error towards larger tolerated control errors, which are achieved by lower controller gains than in our baseline experiment.

**Figure 4.11:** Training including commanded jerk. Both algorithms converge to a lower gain than when trained without the jerk penalty, with the MF algorithm exhibiting little difference from the baseline experiment.

#### **4.5.2 Learning Feedforward**

The experiment presented in this section varies the target speed over a range of speeds. The feedforward extensions to the MB and MF algorithms are applied, respectively, and the experiment shows that the algorithms still converge while the learned feedforward enhances the ability of the controllers to follow the target profile.

As stated in Section 1.2.2, this work considers feedforward control as an additional component of the controller that enables better tracking of a changing reference value. See Sections 3.1.2 and 3.2.2 for the implementation of MF and MB feedforward algorithms, respectively.

In the experiments presented in this section the control target changes according to a repeating profile consisting of a sinusoid whose value is held constant at its extremes<sup>78</sup>, see Fig. 4.14. This maneuver<sup>79</sup> covers target speeds ranging between $10\ \mathrm{km\,h^{-1}}$ and $25\ \mathrm{km\,h^{-1}}$ and is therefore performed with automated gear shifts between first and second gear. The gear changes occur around $15\ \mathrm{km\,h^{-1}}$, and since they are neither directly influenced nor observed by the algorithms, they are perceived as a change in the plant dynamics.

<sup>78</sup> Note that the factory controller requires differentiable trajectories and includes a position control loop and therefore cannot be compared to the learned result.

<sup>79</sup> The influence of the target trajectory on the presented algorithms is an unexplored avenue of research open for future work. Driving cycles like the WLTP [151] provide scenarios closer to what a consumer vehicle experiences, and controllers learned on them would likely be optimal for these driving use cases.

**Figure 4.12:** Controller gains during training of MF with (marked as "ff") and without feedforward (marked as "fb"). In the experiments the MF algorithm uses a projection onto a reference polynomial of order 1 (marked as "PO1") for learning feedforward. In this case, since two coefficients (constant and linear) of the reference polynomial serve as inputs to the policy, two gains are learned, where the one weighting the linear term accounts for the feedforward component. The feedback-only case uses a projection onto a constant polynomial (equivalent to no projection at all) and is therefore marked as "PO0". The convergence values for the feedback gains are similar in both cases, albeit higher than in the default scenario with rectangular target variation.
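To illustrate the projection, the following sketch compresses a preview window of target speeds into the coefficients of a reference polynomial via a least-squares fit. The window length, sampling time and function name are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def project_preview(targets: np.ndarray, dt: float, order: int = 1) -> np.ndarray:
    """Project a preview of target speeds onto a reference polynomial,
    returning coefficients ordered as (constant, linear, ...)."""
    t = np.arange(len(targets)) * dt              # preview time axis
    return np.polyfit(t, targets, deg=order)[::-1]

# A ramp-like preview yields a non-zero linear coefficient, which the
# policy can weight as its feedforward input ("PO1" case):
preview = np.array([20.0, 20.5, 21.0, 21.5, 22.0])   # target speeds
print(project_preview(preview, dt=0.02))             # approx. [20., 25.]
```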

The hyperparameters for the MF algorithm in this experiment are given in Table A.1, set D, in the appendix<sup>80</sup>; the MB algorithm is used in its default configuration, but with the added inversion-based feedforward. See Table A.2 for hyperparameters. The vehicle setup and track are identical to the baseline experiment.

Fig. 4.12 shows that the MF algorithm learns higher feedback gains than with the default target speed variation (see Section 4.2), yet converges with similar variance independently of a simultaneously learned feedforward. The MB algorithm behaves similarly to the default case: the initial slope of the controller gain is equally limited by the maxnorm constraint. In the steady state it exhibits a similar amount of variance as in the feedback-only baseline experiment. The learned feedback gain matches the feedback gain of the MF algorithm, as Fig. 4.13 shows<sup>81</sup>.

The learned controllers are tested by tracking the sinusoid maneuver without exploration noise; track and vehicle are identical to the baseline experiment.

<sup>80</sup> This experiment is part of an earlier campaign published in [81], relying on a different set of hyperparameters.

<sup>81</sup> The feedforward gains are not depicted because of the large number of coefficients and different value range.

**Figure 4.13:** Feedback gains during training of the MB algorithm with and without feedforward. The learning progresses similarly to the default scenario, yet the learned gains converge at higher values around 0.7 in both cases.

**Figure 4.14:** Trained controllers with and without feedforward following the reference maneuver without exploration noise. Both approaches to feedforward learning outperform the feedback-only variants as they significantly reduce lag. The MF feedforward variant is more aggressive than the MB feedforward.

Both approaches yield an improvement in tracking performance, as Fig. 4.14 shows<sup>82</sup>. The maneuver is performed with less lag by both learned feedforward controllers. The MF controller allows the vehicle speed to lag behind the target more than the MB controller does. However, the latter makes the vehicle speed overshoot at the end of the acceleration phase. As the MF feedforward is designed to optimize the reward, it trades accuracy for reduced controller outputs. This is in contrast to the MB controller, which is designed to track the target as closely as possible.

Both algorithms provide viable options for learning feedforward control and increase tracking performance in our experiment. This work is among the first to explore tracking control using RL in a real-world application. The influence of the trajectory used during training on the learned feedforward controller could be an interesting field for future research.

<sup>82</sup> When training artificial neural networks, it is generally avoided to measure the performance of the trained network on the data it was trained on, since this would encourage overfitting to the training data, i.e. hamper generalization ability [48, Section 5.2]. This work adheres to this principle, too: while the planned trajectory was the same during learning and evaluation, the trajectory the algorithm learned from differed significantly between training and evaluation due to the addition of virtual trajectory noise (see Section 3.1.2) and exploration noise during training.

# **5 Discussion and Conclusion**

This chapter reviews the proposed algorithms from Chapter 3 and the experimental results from Chapter 4. We interpret them and use them to answer the research questions from Chapter 1, point out limitations and finish with a few suggestions for future work.

This work set out to answer three main research questions in Section 1.4, which we re-state here for convenience:


To address these, this work proposes two approaches to RL for speed tracking control in a real vehicle. These have to overcome specific challenges that result from the intended application (see Section 2.3): tracking control using RL, a partially observed plant, constrained computational resources, limited learning time and fitness for real-world application.

The proposed MB algorithm is based on a combination of a learned ARX model and policy search. An additional automated feedforward design is based on approximate inversion. The ARX model is estimated from vectors of current and past plant inputs and outputs, which works on the partially observed plant. With a limited number of parameters and a linear model structure, the computational effort for parameter estimation is limited. The computational load of the simulation-based controller design can be matched to the available resources via batch size and update frequency.
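As a rough sketch of the model estimation step, an ARX model can be fitted by ordinary least squares over stacked past inputs and outputs. The batch layout, order choices and function name below are illustrative assumptions rather than the implementation used in the vehicle.

```python
import numpy as np

def fit_arx(u: np.ndarray, y: np.ndarray, na: int, nb: int) -> np.ndarray:
    """Least-squares estimate of ARX parameters for
    y[k] = a_1*y[k-1] + ... + a_na*y[k-na] + b_1*u[k-1] + ... + b_nb*u[k-nb]."""
    n0 = max(na, nb)
    # Regressor rows: most recent outputs and inputs first.
    Phi = np.array([np.concatenate((y[k - na:k][::-1], u[k - nb:k][::-1]))
                    for k in range(n0, len(y))])
    theta, *_ = np.linalg.lstsq(Phi, y[n0:], rcond=None)
    return theta                                  # [a_1..a_na, b_1..b_nb]
```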

Additionally, this work proposes a MF algorithm based on the actor-critic architecture, employing a critic network architecture tailored to a class of partially observed plants that is common in control applications. The MF algorithm is extended with a target compression method that allows learning an optimal feedforward controller using common actor-critic architectures.

In contrast to prior work that was mostly based on simulation, this work successfully applies both algorithms to a real vehicle for online, closed-loop learning. In the presented experiments both algorithms have proven to perform under various conditions, including variations of speed and gear, strong disturbances, and different kinds of exploration noise and reward function. Both algorithms consistently converge across an array of experiments, learn a feedforward controller and yield similar results, except in the experiments varying the exploration and reward signal.

These extensive real-world tests demonstrate that both approaches are valid solutions for learning optimal speed tracking controllers. The results suggest that the proposed MB algorithm tends to be slightly more robust, e.g. to different kinds of exploration noise or discount levels, while the MF algorithm converged more quickly. This may, however, be due to our choice of hyperparameters.

When it comes to feedforward control, again, both approaches are viable. However, from an engineering perspective an ideal feedforward, i.e. one ideally inverting the plant, may be preferred over an optimal feedforward, i.e. one balancing the cost of deviation from the target against control effort, since the target trajectory is likely already optimized during planning.

By enabling online learning in a real vehicle, this work is a step forward for RL research and control theory. The proposed algorithms have proven to work well within the constraints of computational resources, time and real-world experiment conditions. With these additions to the state of the art, this work overcomes several limitations that prevented RL from being applied to tasks outside simulated examples. Algorithms of this capability can take the tedious controller tuning work off an engineer's shoulders.

# **A Appendix**

### **A.1 Optimizer**

Several algorithms have been presented to solve optimization problems. This section briefly reviews the most popular gradient-based algorithms and introduces the Levenberg-Marquardt algorithm that is used throughout this work. At the end of this section we present step size limitation, an extension that can be applied to most optimizers.

**Gradient Descent** Gradient descent is a basic iterative optimization algorithm that consists of taking steps $\Delta\theta_t$ in the opposite direction of the gradient $\mathrm{d}L/\mathrm{d}\theta$, often with a step size proportional to the gradient norm:

$$
\Delta\theta_t = -\alpha \nabla L \big|_{\theta_t}. \tag{A.1}
$$

Here, $t$ is a discrete index denoting the optimization step. The scalar $\alpha$ is commonly referred to as the learning rate. The subsequent parameter estimate $\theta_{t+1}$ consists of the old estimate $\theta_t$ and the gradient step $\Delta\theta_t$:

$$
\theta_{t+1} = \theta_t + \Delta\theta_t. \tag{A.2}
$$

This algorithm depends on the hyperparameter $\alpha$. Too small a choice may require many optimization steps and therefore lead to slow convergence, while too high a value can make the optimization process diverge from the optimum. The effect of the parameter depends on the optimization problem. Stochastic optimization may make the algorithm behave erratically if the gradient is affected by noise. In ideal conditions<sup>83</sup>, gradient descent can solve problems that are linear in parameters in a single step and is therefore considered a first-order optimization technique.
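Equations (A.1) and (A.2) translate directly into code; the ill-conditioned quadratic toy loss below is only an illustration of the step size sensitivity described above.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha: float, steps: int) -> np.ndarray:
    """Iterate theta <- theta - alpha * grad(theta), cf. (A.1) and (A.2)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta -= alpha * grad(theta)
    return theta

# Toy loss L(theta) = 0.5 * theta^T A theta with gradient A @ theta:
A = np.diag([1.0, 100.0])
print(gradient_descent(lambda th: A @ th, [1.0, 1.0], alpha=0.019, steps=500))
# For alpha > 2 / lambda_max(A) = 0.02 the second coordinate diverges.
```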

**Gradient Descent Based Algorithms** The downsides of gradient descent were addressed by two extensions to the algorithm:

• AdaGrad [31], [48, Section 8.5.1] and RMSProp [48, Section 8.5.2] try to choose an individual **learning rate** for each entry in the gradient vector, thereby alleviating slow learning while avoiding divergence as far as possible.

<sup>83</sup> Ideal conditions include optimal choice in step size, noise-free and non-stochastic optimization.

• **Momentum** [121], [48, Section 8.3.2] is a popular technique to enhance robustness against gradient noise and local optima by using a running average over past parameter steps.

Both techniques are combined in the popular adaptive moment estimation (ADAM) optimizer [76], [48, Section 8.5.3]. It counters the shortcomings of gradient descent optimization to some extent at low computational cost and is therefore considered state of the art in machine learning applications.

**Second Order Optimization** While ADAM may exhibit performance beyond regular first-order optimization, in some cases second-order optimization may still outperform it [48, Section 4.3.1]. Second-order optimization is based on Newton's method, which relies on a second-order Taylor series expansion of the loss function to compute the parameter update $\Delta\theta_t$.
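For reference, and in the notation of (A.1), the Newton update that the following methods approximate can be written as (our restatement, not numbered in the original):

$$
\Delta\theta_t = -\left[ \nabla^2 L \big|_{\theta_t} \right]^{-1} \nabla L \big|_{\theta_t}.
$$

For a quadratic loss this single step lands exactly on the stationary point, which is what makes second-order methods attractive whenever the loss is locally well approximated by a quadratic.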

The Gauß-Newton algorithm makes use of the fact that for least-squares loss functions the Hessian required for this approximation can be constructed from the Jacobian [39].

The Levenberg-Marquardt (LM) algorithm [105], [48, Section 8.6.1] adds regularization to the (Gauß-)Newton step, effectively scaling between a (Gauß-)Newton step and (small) gradient descent steps:

$$
\Delta\theta_t = -\left[ f_L^\top f_L + \lambda I \right]^{-1} \nabla L \big|_{\theta_t}. \tag{A.3}
$$

The regularization factor $\lambda$ is adapted according to the success of the last optimization step: if the loss increases, the step is discarded and $\lambda$ is increased (decreasing the step size and making the step closer to a small gradient descent step). If the loss decreases, the step is accepted and $\lambda$ is decreased, making the next step closer to a Gauß-Newton step.

The matrix inversion is expensive for large matrices, but may be feasible for applications with a comparably low number of variables to optimize.
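A compact sketch of this $\lambda$ adaptation, applied to the Rosenbrock function from Fig. A.1 posed as a least-squares problem; the halving/doubling schedule for $\lambda$ is one common choice, not necessarily the one used in this work.

```python
import numpy as np

def levenberg_marquardt(residual, jac, theta0, lam=1e-2, steps=100):
    """Minimize L = 0.5*||residual(theta)||^2 with the LM update (A.3).
    Accepted steps decrease lam (toward Gauss-Newton), rejected ones
    increase it (toward a small gradient descent step)."""
    theta = np.asarray(theta0, dtype=float)
    loss = 0.5 * residual(theta) @ residual(theta)
    for _ in range(steps):
        r, J = residual(theta), jac(theta)
        step = -np.linalg.solve(J.T @ J + lam * np.eye(theta.size), J.T @ r)
        cand = theta + step
        cand_loss = 0.5 * residual(cand) @ residual(cand)
        if cand_loss < loss:
            theta, loss, lam = cand, cand_loss, lam * 0.5   # accept
        else:
            lam *= 2.0                                      # reject and retry
    return theta

# Rosenbrock as least squares, cf. Fig. A.1: converges to about (1, 1).
res = lambda th: np.array([10.0 * (th[1] - th[0] ** 2), 1.0 - th[0]])
jac = lambda th: np.array([[-20.0 * th[0], 10.0], [-1.0, 0.0]])
print(levenberg_marquardt(res, jac, [-1.0, 1.0]))
```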

An example of the behavior of exemplary implementations of the presented optimizers is given in Fig. A.1, where they are applied to the difficult-to-optimize Rosenbrock function. The figure shows that the second-order optimizers converge to the proximity of the optimum in significantly fewer steps.

**Step Size Limitation** Erroneous gradients may lead to misguided optimizer steps in stochastic optimization<sup>84</sup>, introducing variance in the optimization process. In the worst case a single (possibly large) erroneous step may cause the entire optimization

<sup>84</sup> This may be an effect of unfortunate sampling of minibatches from noisy data.

**Figure A.1:** Example trajectories of optimization algorithms finding the minimum of the Rosenbrock function [125]. While gradient descent with a learning rate of 0.0005 is not divergent in this example, it takes 8225 steps to get within a radius of 0.05 around the optimum at $(1, 1)$. Newton's method and Levenberg-Marquardt both converge in fewer than 10 iterations.

process to diverge. A possible countermeasure is to limit the maximum step size an optimizer is allowed to take. While other methods exist, in this work we rely on norm clipping [111], i.e. we rescale the step $\Delta\theta_t$ to $\frac{g_{\max}}{|\Delta\theta_t|}\,\Delta\theta_t$ if its norm $|\Delta\theta_t|$ exceeds a threshold $g_{\max}$.
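Norm clipping amounts to a single rescaling operation; a minimal sketch, with the decay of the limit over training mentioned in appendix A.3.2 indicated in the comment as one possible schedule.

```python
import numpy as np

def clip_step(step: np.ndarray, g_max: float) -> np.ndarray:
    """Rescale an optimizer step to norm g_max if it exceeds the limit."""
    norm = np.linalg.norm(step)
    return step if norm <= g_max else (g_max / norm) * step

# The limit itself can be scheduled, e.g. starting loose and decaying
# towards a lower bound: g_max = max(g_min, g_start * decay**t).
```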

### **A.2 Motivation for Initializing the MB RL Feedforward Controller in its Steady State**

Section 3.2.2 states that the MB RL Feedforward Controller was initialized in its steady state. This section provides background to this decision by laying out the available choices and their impact on the system during learning.

During learning, the feedforward controller has to be reset after every update to its parameters, because its internal state may not be meaningful for the feedforward dynamics after an update. With updates occurring every 600 ms, the reaction of the feedforward after initialization frequently affects the vehicle. The choice of initialization method is therefore an important one.

For initializing any simulated dynamic system, the state can be chosen arbitrarily or filled with zeros. In both cases, and independently of the input fed to the system, its output generally cannot be predicted. In the context of learning longitudinal control this may amount to extreme controller commands that trigger safety thresholds and deactivate the controller, or cause drastic braking or acceleration. This behavior would lead to learning an ill-fitting model, which in turn results in a poor controller.

This work therefore chooses to initialize the feedforward controller in its steady state, in which the output of the system remains constant. With the learned model forced to have a pole at 1, the steady-state gain of the inverted system is 0, and thus the output in steady state is also 0. When following a constant target during learning, the feedforward does not contribute to the controller output, so periodic resets to its steady state following parameter updates have no impact on its (non-existent) output. If the trajectory to follow varies over time, the feedforward contribution is non-zero; periodically resetting its state, and thus its output, to zero can then cause oscillating behavior during learning. Since resets do not occur once learning is deactivated, these oscillations were tolerated during learning, where they aided exploration.
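For a linear feedforward filter in state-space form, the steady state for a constant input can be computed directly; a minimal sketch, assuming discrete-time matrices (A, B) with $I - A$ invertible (names are illustrative).

```python
import numpy as np

def steady_state(A: np.ndarray, B: np.ndarray, u_ss: np.ndarray) -> np.ndarray:
    """State x_ss satisfying x_ss = A @ x_ss + B @ u_ss, i.e. the state in
    which the filter's internal dynamics are at rest for a constant input."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - A, B @ u_ss)
```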

### **A.3 A Simulation Study on Hyperparameter Influence**

Here, the complete results from the simulation study introduced in Section 4.1 are presented.

This part begins by varying the hyperparameters defining the optimality criterion for the controller, then it shows the effect of parameters for the optimization in actor and critic before presenting a study on the effects of noise both for exploration and in the environment. A graphical overview of the presented experiments can be found in Fig. 4.2. A tabular overview of the influences of the individual parameters is given in Table 4.1.

#### **A.3.1 Hyperparameters in the Optimality Criterion**

For discount factor *γ*, see Section 4.1 and Fig. 4.3.

While low values for the parameter $C_u$ encourage the agent to learn more aggressive controllers, they slightly increase variance in the learning process (see the actor gain in Fig. A.2) and therefore make it less stable. At first glance it seems counterintuitive that low values for $C_u$ simultaneously increase variance in the actor gain and decrease TD error levels. However, lower TD errors are a direct consequence of lowering the impact of the action on state values, thus reducing the error potential from misadaptation of the critic. The actor gains exhibit higher variance regardless, since they are learned from the policy gradient, which is derived from the aforementioned lowered impact. Since the advantage component in the critic is reduced while the overall noise level remains equal, the resulting gradient is less accurate. The outcome can be

**Figure A.2:** Averages and standard deviation intervals for 10 training runs with different values of the reward parameter $C_u$. In classic control the parameter $C_u$ is used to tune the aggressiveness of the controller. While lower values for $C_u$ tend to lower the TD error, this has an adverse effect on policy learning: it increases variance and slows learning. Within this learning setup the parameter $C_u$ can therefore not be varied freely, and the engineer once more has to trade bias for variance if an aggressive controller is desired.

seen in the actor gain progression in Fig. A.2: at lower $C_u$ values learning is slower and affected by higher variance.

The parameter $C_u$ is considered a tuning factor for how aggressive the controller shall be in classic optimal control. In vehicle experiments we found values around 0.025 to be the lower feasible limit; values below this required excessive choices in other hyperparameters to achieve stable learning in the real vehicle.

#### **A.3.2 Hyperparameters for the Optimization Steps**

While LM provides quick learning, this comes at the cost of somewhat noisy behavior: in the presence of noise it tends to take big steps even in the proximity of the optimum. This introduces variance in the controller parameter estimation. If a restrictive maxnorm is applied in the optimizer for the actor, this effect can be dampened at the cost of learning speed (see the actor gain in Fig. A.3). On the contrary, if the limit does not constrain the actor optimizer, the actor may move faster than the critic can track it or cross the stability boundary, leading to failure of the learning process. This can be seen in our simulated experiment in Fig. A.3: with a very loose maxnorm limit for the actor, not only do TD errors become large, but several training runs fail. In the vehicle application a restrictive maxnorm in the actor is therefore mandatory to ensure stability. In order to benefit from quick learning, a slightly higher limit is used when training begins and is decayed during the learning process until it reaches a lower boundary.

**Figure A.3:** Average and standard deviation interval over 10 training runs for multiple values of the actor maxnorm constraint. 6 of the training runs with maxnorm 10 failed and were subsequently excluded from the plot. High actor maxnorm limits can cause instability because the critic cannot track the value changes caused by the evolving policy. This can be seen from the very high TD errors and failed runs with actor maxnorm 10, and from the high variance in the actor gain $\theta_c$. Lower values may slow learning. The actor maxnorm therefore trades off speed against variance and stability.

**Figure A.4:** Critic maxnorm variation with 10 experiments per maxnorm value. The plot shows the average and standard deviation interval for each maxnorm limit. With maxnorm 1 every experiment run was stable, with maxnorm 0.1 one run was unstable, and with maxnorm 0.01 all 10 runs were broken off. This simulated experiment shows that high critic maxnorm limits aid stability in learning.

**Figure A.5:** Average and standard deviation interval for the critic target update rate variation. Very low values of the target update rate make it harder for the critic to track the state-action value with high fidelity, resulting in higher variance in both actor gain and TD error and thus slowing learning. A target update rate of 1 results in the fastest convergence, but may introduce bias due to overestimation of state-action values (see subsection 2.2.4; not visible from the graph due to variance and slow learning).

In the context of critic optimization, the maxnorm constraint reverses its effect on learning stability: a liberal maxnorm limit generally increases stability, since it allows value changes resulting from a shift in the controller to be learned quickly. This is visible from the opposite perspective in the TD error in Fig. A.4: the lowest critic maxnorm first causes the TD error to explode, then the runs are terminated due to instability. In the vehicle experiments it was observed that learning without any limit would become unstable in some cases (not seen in the simulated experiments). Regarding noise, however, its effect fully parallels the actor: lower maxnorm constraints reduce variance, which generally enhances both learning speed and precision in the actor. In the experiments the critic optimizer was loosely constrained, yet not entirely left without a maxnorm limit, to avoid (rare) cases of instability due to large erroneous parameter changes in the critic. Over the course of the experiment the limit was slowly tightened to further reduce variance.

While target networks were originally introduced to aid stability in the learning process (see subsection 2.2.4), they can have a similar effect as slowing the critic optimizer: at very low target update rates the critic may fail to track the actor's learning progress. While [152] claim that target networks may help reduce bias in certain cases, no clear conclusion can be drawn from this experiment (see the actor gain in Fig. A.5), mainly because learning was slowed by high variance and therefore terminated before convergence (e.g. for $\eta_{\mathrm{crit}} = 0.005$) or because of too high variance in the steady state ($\eta_{\mathrm{crit}} = 0.025$). In the vehicle experiments an intermediate value of $\eta_{\mathrm{crit}} = 0.1$ was chosen so as not to slow learning down too much while retaining some of the stabilizing effect.

**Figure A.6:** Average and standard deviation for training runs with different batch sizes. With the smallest batch size $b = 15$, variance was increased both in the TD error and the actor gain. A batch size of $b = 150$ yielded lower TD errors than $b = 300$, but similar results for the actor gain. While the difference in TD error may be explained by overfitting to a small batch, it is considered a negligible risk.

Big batches help avoid local minima and yield better gradient estimates, which results in less variance in the learned policy. This can best be seen in the TD error plot in Fig. A.6: low batch sizes<sup>85</sup> yield low TD errors in some cases, but suffer from high variance, suggesting that the critic approximator overfits to the small provided batches but fails to match state transitions in other batches. Higher batch sizes reduce this effect, keeping a higher average TD error but with lower variance. Very small batch sizes may even compromise stability, while very large batch sizes bring little benefit, as is visible in the actor gains in Fig. A.6. Batch size can therefore be seen as a way to balance low variance against fast computation, i.e. quick learning (within the hardware limits).

However, higher batch sizes come at the cost of higher computation time<sup>86</sup> as Fig. A.7 shows. A batch size of *b* = 150 was chosen to balance precision in learning with low computational load.

#### **A.3.3 Variation of Algorithm Architecture Elements**

The number of steps for the TD backup (see Section 2.2.2) is often referred to as an example of a bias-variance trade-off (see e.g. [60]), i.e. more backup steps should reduce bias in the learned policy at the cost of higher variance. The simulation experiments show that a higher number of backup steps does significantly increase

<sup>85</sup> We used batches of identical size for training actor and critic.

<sup>86</sup> Note that the hardware used in the vehicle differs from the one used in this comparison. Additionally, this comparison relies on interpreter-based *Matlab* code, while compiled generated code was employed in the vehicle experiment. The comparison can therefore only indicate a trend.

**Figure A.7:** Average computation time for different batch sizes over 10 simulated training runs each. Logging and environment dynamics are included in the computation time measurement but can be considered negligible. The analysis was performed on a laptop containing an *Intel Core i5-8350* paired with 16 GB RAM using *Matlab R2018b* and *Windows 10* Build 17134. The setup differs both in hardware and software from the vehicle experiments and therefore reacts differently to the batch size, but the general trend towards higher computation times is valid in both cases. With faster computation times, the learning interval for the vehicle tests can be shortened, allowing faster convergence and therefore shorter experiment duration.

**Figure A.8:** Average and standard deviation for training runs with different numbers of backup steps. More backup steps significantly increase variance in the TD error (the plot for 1 step is barely visible at the lower end), slightly change the average learned controller gain, and add variance there, too.

**Figure A.9:** Experience storage size variation. The configuration with the smallest buffer size crossed the instability threshold once and tends in that direction again towards the end of the training run. Filling a large experience storage may take longer, but yields repeatably stable learning with little variance afterwards.

the TD error and slightly affects the learned gains (see Fig. A.8). We choose not to use more than one backup step in the Bellman equation, since the slight change in the learned gain is not worth increasing the variance by orders of magnitude.

The authors of [91] recommend using a large experience storage. The next experiment shows why: if the buffer is limited to the batch size, the agent may start to learn early on, but fail soon after. With a large experience storage, learning is repeatably stable. In the vehicle experiments, stable learning with a low amount of variance was achieved using the largest setting (5000 time steps). This helps increase robustness against short disturbances like bumps, ramps or lane changes. Experience storage size can therefore be used to increase stability and decrease variance at the expense of some waiting time before the learning process starts, i.e. effective learning speed.

In [119] we suggested choosing the length $h_u$ of the FIR filter according to the impulse response of the system of interest. Fig. A.10 supports this: while all configurations have similar TD errors, shorter filter lengths yield higher gains. Filters with length equal to or greater than 35 yield similar gains. Looking at the unit impulse response in Fig. A.11, it can be seen that the output is constant after around 35 time steps. Using no filter at all can increase variance up to the point of instability [119], but too short filters can yield biased results.

#### **A.3.4 Exploration Noise**

Fig. A.12 shows the influence of exploration noise amplitude. While higher amplitudes increase the TD error, they benefit policy learning by making it quicker and

**Figure A.10:** Mean and standard deviation interval for different FIR filter length settings. Short filters tend towards higher gains, but this effect vanishes towards higher filter lengths. There is no effect on the TD error.

**Figure A.11:** Response to a unit impulse for example system (1.3) with and without noise. Note that the time scale is in discrete time steps, not seconds. The initial response is flat due to delay. After roughly 35 time steps the output is roughly constant again.

**Figure A.12:** Mean and standard deviation of TD error and actor gain for training runs with different exploration noise amplitudes. Intermediate values for exploration noise perform best for TD-error values, TD error variance and actor gain variance. Very low values fail to excite the system, making it harder to learn without overfitting. The intermediate noise amplitude provides the agent with rich data to learn from, and further increasing the noise amplitude brings little benefit for learning: it increases the TD error, but only slightly increases learning performance in the actor gain. Very high amplitudes may not be feasible in real-world systems.

more precise. If the amplitude is too low, the exploration noise may fail to excite the system, thus yielding data that is not rich enough to meaningfully fit the state-action value function to. This causes high variance in the learning process, increasing the risk of unstable learning. Very high amplitudes on the other hand enhance the learning process only slightly and may not be applicable to real-world systems without risking to damage the plant.

Despite the objective of exciting the system with exploration noise, learning should generally occur around the envisioned operation area of the controller to be learned. Since exploration noise is only added during learning, it may skew the resulting policy and should therefore be used carefully.

Exploration noise amplitude needs to balance stability and variance against bias while staying within the limits of the system.

Holding a value of the exploration noise for more than one time step can be used to shift the frequency spectrum towards lower frequencies. Depending on the system, this may not only prevent it from being harmed, but also enhance excitation. Holding for no more than one time step results in higher variance in this experiment than holding it for 50 time steps, i.e. 1 s, as can be seen from the actor gain evolution in Fig. A.13. Longer hold phases deteriorate the learning performance again. The real-world vehicle has low-pass characteristics and may suffer from excessive wear if signals of too high frequency are used as input. We chose an intermediate value of 50 time steps for our experiments.

**Figure A.13:** Exploration noise sampling time variation plotted with averages and standard deviation intervals. The sampling time allows to scale the signal between white noise and a step-like signal by sampling the random component used for exploration at a lower frequency. A hold factor of 1 is equivalent to sampling the exploration noise at each controller time step of 20 ms, a hold factor of 200 keeps the random part constant for 200 time steps (equivalent to 4 s) before resampling it. High hold factor values therefore emphasize low frequencies. The optimal hold factor depends on the system characteristics. In this example a hold factor of 1 or 200 results in higher variance in the actor gain compared to a hold factor of 50 (equivalent to 1 s).
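The hold factor corresponds to zero-order-hold resampling of the random component; a minimal sketch, assuming uniform noise as in the baseline configuration (function name and interface are illustrative).

```python
import numpy as np

def held_noise(n_steps: int, amp: float, hold: int,
               rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample uniform noise once every `hold` controller steps and hold it
    in between, shifting the signal's spectrum towards lower frequencies."""
    n_samples = -(-n_steps // hold)               # ceiling division
    samples = rng.uniform(-amp, amp, size=n_samples)
    return np.repeat(samples, hold)[:n_steps]

# hold=1 resamples at every 20 ms controller step (white-noise-like);
# hold=50 keeps the value for 1 s, as chosen for the vehicle experiments.
```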

**Figure A.14:** Average and standard deviation for TD error and actor gain in virtual trajectory noise amplitude size variation. Adding a random component was proposed to enhance learning speed for feedforward components, but in this case only increases variance and bias.

**Figure A.15:** Effect of environment noise levels on TD error and actor gain. The higher the noise level in the environment, the higher the TD error and the variance in the actor gains become.

Adding additional noise to the trajectory was proposed in [81] in order to accelerate learning of feedforward components in the controller, but for this simple policy it only adds bias, as Fig. A.14 shows. The noise shifts the target in the training data to be more diverse for off-policy learning. This, however, makes the distribution of the training data differ from the actual use case, which may render the controller suboptimal for the envisioned scenario. When learning a policy without a feedforward component, virtual trajectory noise was therefore abandoned.

#### **A.3.5 Environment Dynamics and Noise**

The proposed algorithm is affected by environment noise, as Fig. A.15 shows<sup>87</sup>. The noise level is reflected in the variance of the learned gain and the TD error level. This experiment only shows that low levels of environment noise are desirable; in real-world systems the amount of noise generally cannot be influenced.

The next simulated example shows how different dynamics affect the learning process in Fig. A.16. The algorithm seems to learn more easily from comparably fast dynamics: with smaller time constants $T_{\mathrm{acc}}$, the learning process converges earlier. This suggests that slow systems are harder for an RL agent to learn from. An integrator would pose the limiting case here: it can be seen as having an infinite time constant and would thus be hardest to learn<sup>88</sup>.

<sup>87</sup> The noise level chosen for the linear system (1.3) was chosen to have an amplitude comparable to the measurement noise in the vehicle setup, but this was not based on a thorough analysis, e.g. by Fourier transformation.

<sup>88</sup> An RL agent generally cannot learn a policy if the policy it is currently training on is unstable in combination with the system, since the value function has infinite values, i.e. the critic would diverge. For brief instants an unstable policy can be tolerated, since the critic only slowly tracks the policy. For systems that cannot be stabilized with the policy structure, the learning process must diverge.

**Figure A.16:** Effect of the environment time constant on TD error and actor gain. The slower the system dynamics are, i.e. the harder it is to excite the system, the slower the learning process and the higher the variance.

### **A.4 Hyperparameters Used in Experiments**

In Table A.1 and its continuations we provide a tabular hyperparameter overview for each experiment.


<sup>89</sup> The derivative of the value used for bootstrapping was included. See footnote 21 on page 26 for more details.


**Table A.1:** Hyperparameters MF







**Table A.2:** Hyperparameters of the MB algorithm from [120].

**Figure A.17:** Histogram of the optimal output gain for 2000 randomly chosen initial states within the unit cube, obtained from truncated policy search. The distribution exhibits two distinct maxima and spreads over a wide range of values. The optimal gain strongly depends on the initial state.

### **A.5 A Simulation Study on Optimal Output Control Gains**

While for the fully observed linear case the optimal controller is valid throughout the entire state space, this is not the case for optimal output control. This section illustrates this by computing the optimal output feedback gain for a noise-free variant of the system (1.3) using (truncated) policy search (see subsections 2.2.3 and 3.2.2).

First, the optimal output feedback gain is computed for 2000 randomly distributed initial states within the unit cube. Fig. A.17 shows that the optimal gain is multimodal and varies over a wide range depending on the initial state.

Next, the initial state is sampled over grids in the coordinate planes. Fig. A.18 shows a strong dependence of the optimal output feedback gain on the position of the initial state in the state space.

This implies that even in the best case a learned gain can only be considered optimal in close proximity to the conditions it was learned in. For the controller to be learned within this work it can therefore be assumed that the optimal controller depends on the distribution of states seen during training, which is affected by external factors from the experiment, e.g. road slope, or from the target trajectory and exploration noise.

### **A.6 Validation of Example Gains in the Real Car**

This section aims to assess whether the controller chosen by the RL algorithm is plausible. To this end, the learning goal of the controller is approximated using data from test runs with controllers with a series of different gains. The results suggest that the

**Figure A.18:** Optimal output gain over initial state in the coordinate planes; delay initialized with zero. The optimal gain in the *x*(2)-*x*(3)-plane is zero throughout, but assumes different values along trajectories in the *x*(2)-*x*(4)-plane and *x*(3)-*x*(4)-plane.

controller may be close to optimal. However, this estimation relies on several approximations that are necessary for feasibility but limit precision. These estimations can therefore not be considered proof of optimality, but they provide insights from recorded trajectories that make the algorithm's choice plausible.

First, the approximation of the controller's learning goal (2.7) is introduced. To this end, (2.7) is expanded to

$$H = \int \hat{V}^{\pi}(s) f(s)\, \mathrm{d}s. \tag{A.4}$$

However, it is difficult to simultaneously obtain $\hat{V}^{\pi}(s)$ and $f(s)$ from experimental data:


An intermediate experiment is therefore conducted: the output controller is used without noise to follow the step-like trajectory<sup>90</sup>, balancing excitation from setpoint changes with settling periods in which the target remains constant. At least 10 setpoint change cycles are recorded for an array of control gains. The target and the resulting vehicle trajectories are given in Fig. A.19.

From this data, the controller values (A.4) are approximated for an array of controller gains. The continuous integral in (A.4) is replaced with a finite sum over weighted state values (a sketch of this estimate follows below). For this, multiple approximations are taken:

1. Instead of considering the entire multi-dimensional state space, only the control error is considered, which is divided into $n = 26$ bins. Due to the limited data from the experiments, considering the entire space $S$ would not yield a meaningful distribution.

<sup>90</sup> The same type of trajectory was used during our training experiments.
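A sketch of the resulting estimate, assuming per-sample state values $\hat{V}^{\pi}$ have already been computed from the recorded returns; the binning of the control error and all names are illustrative.

```python
import numpy as np

def approx_controller_value(errors: np.ndarray, values: np.ndarray,
                            n_bins: int = 26) -> float:
    """Approximate H = integral V(s) f(s) ds by a finite sum: bin the control
    error, then weight each bin's mean state value by its relative frequency."""
    edges = np.histogram_bin_edges(errors, bins=n_bins)
    idx = np.clip(np.digitize(errors, edges) - 1, 0, n_bins - 1)
    H = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            H += mask.mean() * values[mask].mean()   # f(bin) * V(bin)
    return H
```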

**Figure A.19:** Average $\mu$ and standard deviation interval $\mu \pm \sigma$ of speed measurements for different controller gains following the trajectory used for training. Lower gains $\theta_c$ tend to make the vehicle stay close to the average speed $\bar{\hat{y}}$, higher gains make the vehicle deviate from it to follow the variation $\Delta y$.

**Figure A.20:** Estimated truncated controller values from trajectories averaged over the experienced state distribution. According to this estimation the optimal controller gain is close to 0.5, suggesting that the learned controllers are close to optimal.


According to this estimation (see Fig. A.20), the optimal controller gain is close to 0.5, suggesting that the gain learned by both algorithms in the baseline experiment is close to optimal. However, we cannot take this result as proof due to the multitude of sources of inaccuracy.

Instead, a few strategies for optimizing the learning objective can be identified from the obtained trajectories, which help to understand the algorithm's choice.

The control error distribution of the validation experiment gives insight into the different strategies to balance when following this varying target using output control. Since the target changes very frequently, trying to follow it closely, i.e. using a high controller gain, may not be beneficial for the control error distribution (see Fig. A.21a). While high gains yield a distribution containing both high and low control errors, keeping the speed constant and using almost no control effort, i.e. a low controller gain, yields three distinct maxima in the control error distribution: since

<sup>91</sup> With a discount factor of $\gamma = 0.9$, the accumulated discount is small: $\gamma^{51} < 0.005$.

<sup>92</sup> A trajectory from which a valid state value is computed must not contain a setpoint change. Due to the repeating target trajectory, the portion of states for which a state value can be computed is limited.

the target assumes three different values but the speed stays mostly constant, the control error is either close to 0 or equal to the variation of the target speed. Note that this strategy works best if disturbances are low (cf. Section 4.3). Controllers in the mid range trade these effects off, seldom reaching a control error close to 0, but also avoiding very high control errors. This effect is also visible in the average state measurements in Fig. A.19. Fig. A.21 shows that exploration noise spreads the experienced control errors over a wider band, while changing the control target less frequently has the opposite effect.

It is therefore not possible to prove the optimality of the learned gain from this experiment, but it appears plausible that trading off control effort against control error favors an intermediate gain, in line with the algorithm's choice.

**(a)** Control error distribution without exploration noise on even road with target change every 1.5 s.

**(b)** Control error distribution without exploration noise on even road with target change every 10 s.

**(c)** Control error distribution with exploration noise on even road with target change every 1.5 s.

**(d)** Control error distribution without exploration noise and high disturbances but constant target.

**Figure A.21:** Control error distributions over gain for example use cases. The experienced control error is strongly influenced by the choice of target trajectory, the presence of disturbances and exploration noise.

# **Own Publications**


# **Bibliography**



Automating controller tuning tasks is an enticing prospect for carmakers offering advanced driver assistance systems. While Reinforcement Learning is a promising approach in simulation, it needs significant extension to work in challenging real-world scenarios.

This book not only presents algorithmic extensions to both model-free and model-based Reinforcement Learning, but also compares their learning behavior for longitudinal control across an array of real-world test cases.

Despite noise and partially observed dynamics, the proposed algorithms converge within minutes, provide feedforward control to track arbitrary trajectories and are computationally lightweight even during training. As an often-overlooked aspect, exploration noise is investigated as an important influence on learning performance and result.

The proposed additions to the state of the art enable Reinforcement Learning for engineering practice, relieving engineers of tedious manual tuning tasks.
