**Communications and Control Engineering**

Gianluigi Pillonetto · Tianshi Chen · Alessandro Chiuso · Giuseppe De Nicolao · Lennart Ljung

# Regularized System Identification

Learning Dynamic Models from Data

# **Communications and Control Engineering**

#### **Series Editors**

Alberto Isidori, Roma, Italy
Jan H. van Schuppen, Amsterdam, The Netherlands
Eduardo D. Sontag, Boston, USA
Miroslav Krstic, La Jolla, USA

**Communications and Control Engineering** is a high-level academic monograph series publishing research in control and systems theory, control engineering and communications. It has worldwide distribution to engineers, researchers, educators (several of the titles in this series find use as advanced textbooks although that is not their primary purpose), and libraries.

The series reflects the major technological and mathematical advances that have a great impact in the fields of communication and control. The range of areas to which control and systems theory is applied is broadening rapidly with particular growth being noticeable in the fields of finance and biologically inspired control. Books in this series generally pull together many related research threads in more mature areas of the subject than the highly specialised volumes of *Lecture Notes in Control and Information Sciences*. This series's mathematical and control-theoretic emphasis is complemented by *Advances in Industrial Control* which provides a much more applied, engineering-oriented outlook.

Indexed by SCOPUS and Engineering Index.

**Publishing Ethics:** Researchers should conduct their research from research proposal to publication in line with best practices and codes of conduct of relevant professional bodies and/or national and international regulatory bodies. For more details on individual ethics matters please see:

https://www.springer.com/gp/authors-editors/journal-author/journal-authorhelpdesk/publishing-ethics/14214

More information about this series at https://link.springer.com/bookseries/61


Gianluigi Pillonetto Department of Information Engineering University of Padova Padova, Italy

Alessandro Chiuso Department of Information Engineering University of Padova Padova, Italy

Lennart Ljung Department of Electrical Engineering Linköping University Linköping, Sweden

Tianshi Chen School of Data Science The Chinese University of Hong Kong Shenzhen, China

Giuseppe De Nicolao Electrical, Computer and Biomedical Engineering University of Pavia Pavia, Italy

ISSN 0178-5354 ISSN 2197-7119 (electronic) Communications and Control Engineering ISBN 978-3-030-95859-6 ISBN 978-3-030-95860-2 (eBook) https://doi.org/10.1007/978-3-030-95860-2

MATLAB is a registered trademark of The MathWorks, Inc. See https://www.mathworks.com/trademarks for a list of additional trademarks

Mathematics Subject Classification: 93B30

© The Editor(s) (if applicable) and The Author(s) 2022

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*To my grandmother Rina*

*— Gianluigi Pillonetto*

*To my family, my parents Naimei and Wei, my wife Yu, and my little girl Yuening*

*—Tianshi Chen*

*To my wife Mascia, my daughter Angela and my mentor Giorgio*

*—Alessandro Chiuso*

*To my wife Elena and my sons Pietro and Laura*

*—Giuseppe De Nicolao*

*To my wife Ann-Kristin and my sons Johan and Arvid*

*—Lennart Ljung*

# **Preface**

*System identification* is concerned with estimating models of dynamical systems based on observed input and output signals. The term was coined in 1953 by Lotfi Zadeh, but various approaches had of course been suggested before that. One can distinguish two major routes in the development of system identification: (1) A *statistical route* relying on parameter estimation techniques such as Maximum Likelihood and (2) a *realization route*, based on techniques to realize (linear) dynamical systems from input/output descriptions, such as impulse responses. The literature on this in the past 70 years is extensive and impressive.

Mathematically, system identification is an *inverse problem* and may suffer from numerical instability. The Russian researcher Tikhonov suggested in the 1940s a general way to constrain the solution of inverse problems, which he called *regularization*. A simple regularization method applied to linear regression became known as *ridge regression*. *Regularized system identification* was for a long time used as a term for ridge regression.

Around 2000, other ideas were put forward for achieving regularization. They had links to general function estimation with mathematical foundations in Reproducing Kernel Hilbert Spaces (RKHS) and kernel techniques. This resulted in intense research and extensive publications in the past 25 years. Regularized system identification has also become known as the *kernel approach* to identification.

It is the purpose of this book to give a comprehensive overview of this development. A flow diagram of the book's chapters is given in Fig. 1. It starts with the core of the regularization idea: To accept some bias in the estimates to achieve a smaller variance error and a better overall Mean Square Error (MSE) of the model. This is illustrated with the *Stein effect* discussed in Chap. 1.

Traditional system identification (the statistical route) is surveyed in Chap. 2. An archetypical model structure is the *linear regression*, and Chap. 3 explains how regularization is handled in such models, while the Bayesian interpretation of this is given in Chap. 4. The linear regression perspective is lifted to general linear models of dynamical systems in Chap. 5.

**Fig. 1** *Chapter Dependencies* The first two chapters are introductory. They review the bias-variance trade-off, discussing the James–Stein estimator, and the classical approach to linear system identification. Regularized kernel-based approaches to linear system identification in finite-dimensional spaces are developed in Chaps. 3–5. The reader can then skip directly to Chap. 9, where such techniques are illustrated via numerical experiments and real-world cases. A different route to the final chapter moves along Chaps. 6–8, where regularization in reproducing kernel Hilbert spaces is described. These parts of the book address the estimation of infinite-dimensional (discrete- or continuous-time) linear models and nonlinear system identification

With this, the basic techniques of practical regularization for linear models have been outlined and the readers may continue directly to Chap. 9 for numerical experiments and practical applications.

Chapters 6 and 7 lift the mathematical foundation of regularization with a treatment of how the techniques fit into the framework of RKHS, while Chap. 8 deals with applications to nonlinear models.

Sections marked with a dedicated symbol contain rather technical material which can be skipped without interrupting the reading. Proofs of some of the theorems contained in the book are gathered in the appendix present at the end of each chapter.

Gianluigi Pillonetto, Padova, Italy
Tianshi Chen, Shenzhen, China
Alessandro Chiuso, Padova, Italy
Giuseppe De Nicolao, Pavia, Italy
Lennart Ljung, Linköping, Sweden
July 2021

# **Acknowledgements**

Many researchers have worked with the authors of this book, as is clear from the list of references. Their support and ideas have been instrumental to our results and the contents of this book. We thank them all. We also acknowledge the financial support given to us by the Thousand Youth Talents Plan of China; the Natural Science Foundation of China (NSFC) under contract No. 61773329; the Shenzhen Science and Technology Innovation Council under contract No. Ji-20170189; the Chinese University of Hong Kong, Shenzhen, under contracts No. PF.01.000249 and No. 2014.0003.23; the Department of Information Engineering, University of Padova (Italy); the University of Pavia (Italy); Linköping University (Sweden); the long-time support of the Swedish Research Council (VR); and an advanced grant from the European Research Council (ERC).

# **Contents**










# **Abbreviations and Notation**

## **Notation**






## **Abbreviations**



# **Chapter 1 Bias**

**Abstract** Adopting a quadratic loss, the performance of an estimator can be measured in terms of its mean squared error, which decomposes into a variance and a bias component. This introductory chapter contains two linear regression examples which illustrate the importance of designing estimators able to balance these two components well. The first example deals with estimation of the means of independent Gaussians. We will review the classical least squares approach which, at first sight, could appear to be the most appropriate solution to the problem. Remarkably, we will instead see that this unbiased approach can be dominated by a particular biased estimator, the so-called James–Stein estimator. Within this book, this represents the first example of regularized least squares, an estimator which will play a key role in subsequent chapters. The second example deals with a classical system identification problem: impulse response estimation. A simple numerical experiment will show how the variance of least squares can be too large, leading to unacceptable system reconstructions. The use of an approach known as ridge regression will give first simple intuitions on the usefulness of regularization in the system identification scenario.

## **1.1 The Stein Effect**

Consider the following "basic" statistical problem. Starting from the realizations of $N$ independent Gaussian random variables $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, our aim is to reconstruct the means $\theta_i$, contained in the vector $\theta$ seen as a deterministic but unknown parameter vector.<sup>1</sup> The estimation performance will be measured in terms of mean squared error (MSE). In particular, let $\mathcal{E}$ and $\|\cdot\|$ denote expectation and Euclidean norm, respectively. Then, given an estimator $\hat{\theta}$ of an $N$-dimensional vector $\theta$ with $i$th

<sup>1</sup> In future chapters, $\theta^0$ will be used to denote the true value of the deterministic vector that has generated the data, distinguishing it from the vector which parametrizes the model. In this introductory chapter, $\theta$ is instead used in both cases to keep the notation as simple as possible.

<sup>©</sup> The Author(s) 2022

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_1

component $\theta_i$, one has

$$\begin{split} MSE_{\hat{\theta}} &= \mathcal{E} \|\hat{\theta} - \theta\|^{2} \\ &= \underbrace{\sum_{i=1}^{N} \mathcal{E}(\hat{\theta}_{i} - \mathcal{E}\hat{\theta}_{i})^{2}}_{\textit{Variance}} + \underbrace{\sum_{i=1}^{N} (\theta_{i} - \mathcal{E}\hat{\theta}_{i})^{2}}_{\textit{Bias}^2}, \end{split} \tag{1.1}$$

where in the last step we have decomposed the error into two components. The first is the *variance* of the estimator, while the difference between the mean and the true parameter value measures the *bias*. If the mean coincides with $\theta$, the estimator is said to be *unbiased*. The total error thus has two contributions: the variance and the (squared) bias.
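The decomposition (1.1) is an exact algebraic identity that can be checked numerically. The following sketch (our own illustrative example, with an arbitrary shrinkage estimator $\hat{\theta} = cY$ and arbitrary true means) estimates the variance and squared-bias terms by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of the MSE decomposition (1.1) for a simple
# shrinkage estimator theta_hat = c * Y (c = 0.8 is an arbitrary choice).
rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])    # true means (illustrative values)
sigma, c, runs = 1.0, 0.8, 200_000

Y = theta + sigma * rng.standard_normal((runs, theta.size))
est = c * Y                            # a biased estimator of theta

mse = np.mean(np.sum((est - theta) ** 2, axis=1))   # empirical E||est - theta||^2
mean_est = est.mean(axis=0)
variance = np.sum(est.var(axis=0))                  # sum of component variances
bias2 = np.sum((theta - mean_est) ** 2)             # squared bias

# The two quantities agree, matching (1.1).
print(mse, variance + bias2)
```

Here the variance term is close to $c^2 N\sigma^2$ and the squared bias to $(1-c)^2\|\theta\|^2$, so the trade-off between the two components is governed by the shrinkage factor $c$.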

Note that the mean estimation problem introduced above is a simple instance of linear Gaussian regression. In fact, letting $I_N$ be the $N \times N$ identity matrix, the measurement model is

$$Y = \theta + E, \quad E \sim \mathcal{N}(0, \sigma^2 I\_N), \tag{1.2}$$

where $Y$ is the $N$-dimensional (column) vector with $i$th component $y_i$. The most popular strategy to recover $\theta$ from data is least squares, which also corresponds to maximum likelihood in this Gaussian scenario. The solution minimizes

$$\|Y - \theta\|^2$$

and is then given by

$$
\hat{\theta}^{LS} = Y.
$$

Apparently, the obtained estimator is the most reasonable one. A first intuitive argument supporting it is that the random variables $\{y_j\}_{j \neq i}$ seem unable to carry any information on $\theta_i$, since all the noises $e_i$ are independent. Hence, the natural estimate of $\theta_i$ appears to be its noisy observation $y_i$. This estimator is also unbiased: for any $\theta$ we have

$$\mathcal{E}\left(\hat{\theta}^{LS}\right) = \mathcal{E}(Y) = \theta.$$

Hence, from (1.1) we see that the MSE coincides with its variance, which is constant over θ and given by

$$MSE\_{LS} = \mathcal{E} \left\| \hat{\theta}^{LS} - \theta \right\|^2 = N\sigma^2.$$

According to Markov's theorem, $\hat{\theta}^{LS}$ is also efficient. This means that its variance attains the Cramér–Rao limit: no unbiased estimator can be better than the least squares estimate, e.g., see [9, 17].

## *1.1.1 The James–Stein Estimator*

By introducing some bias in the inference process, it is easy to obtain estimators which strictly dominate least squares (in the MSE sense) over certain parameter regions. The most trivial example is the constant estimator $\hat{\theta} = a$. Its variance is null, so that its MSE reduces to the bias component $\|\theta - a\|^2$. Hence, even if the behaviour of $\hat{\theta}$ is unacceptable in most of the parameter space, this estimator outperforms least squares in the region

$$\{\theta \text{ s.t. } \|\theta - a\|^2 < N\sigma^2\}.$$

Note a feature common to least squares and the constant estimator: neither attempts to trade off bias and variance; each simply sets one of the two MSE components in (1.1) to zero. An alternative route is the design of estimators which try to balance bias and variance. Rather surprisingly, we will now see that this strategy can dominate $\hat{\theta}^{LS}$ over the entire parameter space.

The first criticisms of least squares were raised by Stein in the 1950s [23] and can be summarized as follows. A good mean estimator $\hat{\theta}$ should also lead to a good estimate of the Euclidean norm of $\theta$. Thus, one should have

$$\|\hat{\theta}\| \approx \|\theta\|.$$

But, if we consider the "natural" estimator $\hat{\theta}^{LS} = Y$, in view of the independence of the errors $e_i$, one obtains

$$\mathcal{E}\|Y\|^{2} = N\sigma^{2} + \|\theta\|^{2}.$$

This shows that the least squares estimator tends to overestimate the norm of $\theta$. It thus seems desirable to correct $\hat{\theta}^{LS}$ by shrinking the estimate towards the origin, e.g., adopting estimators of the form $\hat{\theta}^{LS}(1-r)$, where $r$ is a positive scalar. The most famous example is the James–Stein estimator [15], where $r$ is determined from data as follows:

$$r = \frac{(N-2)\sigma^2}{\|Y\|^2},$$

hence leading to

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y\|^2}Y.
$$

Note that, even if all the components of $Y$ are mutually independent, $\hat{\theta}^{JS}$ exploits all of them to estimate each $\theta_i$. The surprising outcome is that $\hat{\theta}^{JS}$ outperforms $\hat{\theta}^{LS}$ over the whole parameter space, as illustrated in the next theorem.

**Theorem 1.1** (James–Stein's MSE, based on [15]) *Consider N Gaussian and independent random variables* $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$. *Let also* $\hat{\theta}^{JS}$ *denote the James–Stein estimator of the means, i.e.,*

**Fig. 1.1** Estimation of the mean $\theta \in \mathbb{R}^{10}$ of a Gaussian with covariance equal to the identity matrix. The plot displays the mean squared error of least squares ($MSE_{LS}$) and of the James–Stein estimator ($MSE_{JS}$), including its bias-variance decomposition, as a function of $\theta_1$ with $\theta = [\theta_1\ 0\ \ldots\ 0]$

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y\|^2} Y.
$$

*Then, if* $N \geq 3$*, the MSE of* $\hat{\theta}^{JS}$ *satisfies*

$$MSE\_{JS} < N\sigma^2 \quad \forall \theta.$$

We say that an estimator dominates another if its MSE is never larger and is smaller for some $\theta$. In statistics, an estimator is then said to be *admissible* if no other estimator exists that dominates it in terms of MSE. The above theorem thus shows that the least squares estimator of the mean of a multivariate Gaussian is not admissible if the dimension exceeds two. The reason is that, even when the Gaussians are independent, the global MSE can be reduced uniformly by adding some bias to the estimate. This is also graphically illustrated in Fig. 1.1, where $MSE_{JS}$, along with its decomposition, is plotted as a function of the component $\theta_1$ of the ten-dimensional vector $\theta = [\theta_1\ 0\ \ldots\ 0]$ (the noise variance is equal to one). One can see that $MSE_{JS} < MSE_{LS}$ since the bias introduced by $\hat{\theta}^{JS}$ is compensated by a greater reduction in the variance of the estimate. Note, however, that James–Stein improves the overall MSE and not the individual errors affecting the $\theta_i$. This aspect can be important in applications where it is not desirable to trade a higher individual MSE for a smaller overall MSE.
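The dominance stated by the theorem can also be observed numerically. The following sketch (our own Monte Carlo setup; the value of $\theta_1$ and the seed are illustrative choices) compares the empirical MSE of least squares and James–Stein in the setting of Fig. 1.1:

```python
import numpy as np

# Monte Carlo comparison of least squares and James-Stein for N = 10,
# sigma = 1, theta = [theta1, 0, ..., 0], as in the setting of Fig. 1.1.
rng = np.random.default_rng(1)
N, sigma, runs = 10, 1.0, 100_000
theta = np.zeros(N)
theta[0] = 2.0                         # illustrative value of theta1

Y = theta + sigma * rng.standard_normal((runs, N))
norm2 = np.sum(Y ** 2, axis=1, keepdims=True)
theta_js = Y * (1.0 - (N - 2) * sigma ** 2 / norm2)   # James-Stein shrinkage

mse_ls = np.mean(np.sum((Y - theta) ** 2, axis=1))        # close to N*sigma^2 = 10
mse_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))
print(mse_ls, mse_js)   # the James-Stein MSE is uniformly smaller
```

Repeating the experiment for a grid of $\theta_1$ values would reproduce the curves of Fig. 1.1.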

It is easy to check that the James–Stein estimator admits the following interesting reformulation:

$$\begin{split} \hat{\theta}^{JS} &= \arg\min_{\theta} \|Y - \theta\|^2 + \gamma\, \|\theta\|^2 \\ &= \frac{Y}{1+\gamma}, \end{split} \tag{1.3}$$

where the positive scalar γ is determined from data as follows:

$$\gamma = \frac{(N-2)\sigma^2}{\|Y\|^2 - (N-2)\sigma^2}.\tag{1.4}$$

Equation (1.3) thus reveals that $\hat{\theta}^{JS}$ is a particular version of *regularized least squares*, an estimator which will play a central role in this book. In particular, the objective in (1.3) contains two contrasting terms. The first one, $\|Y - \theta\|^2$, is a quadratic loss which measures the adherence to the experimental data. The second one, $\|\theta\|^2$, is a regularizer which shrinks the estimate towards the origin by penalizing the energy of the solution. The role of the regularization parameter $\gamma$ is then to balance these two components via a simple scalar adjustment. Equation (1.4) shows that James–Stein's strategy is to set its value to the inverse of an estimate of the signal-to-noise ratio.
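The equivalence between the original shrinkage form of $\hat{\theta}^{JS}$ and the regularized least squares form (1.3)-(1.4) is a one-line algebraic identity; the following check uses an arbitrary data vector of our choice:

```python
import numpy as np

# Check that Y - (N-2)*sigma^2/||Y||^2 * Y coincides with the regularized
# least squares form Y/(1 + gamma), with gamma chosen as in (1.4).
rng = np.random.default_rng(2)
N, sigma = 10, 1.0
Y = rng.standard_normal(N) + 2.0       # one arbitrary data vector

r = (N - 2) * sigma ** 2 / np.sum(Y ** 2)
theta_js = Y * (1 - r)                 # original James-Stein form

gamma = (N - 2) * sigma ** 2 / (np.sum(Y ** 2) - (N - 2) * sigma ** 2)
theta_reg = Y / (1 + gamma)            # regularized least squares form (1.3)

assert np.allclose(theta_js, theta_reg)
```

Indeed, $1/(1+\gamma) = (\|Y\|^2 - (N-2)\sigma^2)/\|Y\|^2 = 1 - r$, so the two expressions agree exactly.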

## *1.1.2 Extensions of the James–Stein Estimator*

We have seen that the James–Stein estimator corrects each component of $\hat{\theta}^{LS}$ by shifting it towards the origin. This implies that the MSE improvement will be larger when the components of $\theta$ are close to zero. Actually, there is nothing special about the origin. If the true $\theta$ is expected to be close to $a \in \mathbb{R}^N$, one can modify the original $\hat{\theta}^{JS}$ as follows:

$$
\hat{\theta}^{JS} = Y - \frac{(N-2)\sigma^2}{\|Y-a\|^2} \left(Y - a\right).
$$

The result is an estimator which still dominates least squares, with the origin's role now played by *a*. The estimator thus concentrates the MSE improvement around *a*.

Now, let us consider a non-orthonormal scenario where Gaussian linear regression amounts to estimating $\theta$ from the $N$ measurements

$$y_i = d_i \theta_i + e_i, \quad e_i \sim \mathcal{N}(0, 1),$$

with all the noises $e_i$ mutually independent. The least squares (maximum likelihood) estimator is now

$$
\hat{\theta}\_i^{LS} = \frac{y\_i}{d\_i}, \quad i = 1, \dots, N,
$$

and its MSE is the sum of the variances of the $\hat{\theta}_i^{LS}$, i.e.,

$$MSE_{LS} = \sum_{i=1}^{N} \frac{1}{d_i^2}.$$

Note that the MSE can be large when just one of the $d_i$ is small. In this case, the problem is said to be *ill-conditioned*: even a moderate measurement error can lead to a large reconstruction error.

Also in this non-orthonormal scenario, it is possible to design estimators whose MSE is uniformly smaller than $MSE_{LS}$. The number of possible choices is huge, depending on which region of the parameter space one wants to concentrate the improvement in. There is, however, an important limitation shared by all Stein-type estimators: in general, they are not very effective against ill-conditioning. This is illustrated in the following example, which presents an estimator whose negative features are representative of the drawbacks of Stein-type estimation in non-orthogonal settings.

**Example 1.2** (*A generalization of James–Stein*) Consider the estimator $\hat{\theta}$ whose $i$th component is given by

$$\hat{\theta}_{i} = \left[1 - \frac{N-2}{S}\, d_{i}^{2}\right] \frac{y_{i}}{d_{i}}, \quad i = 1, \ldots, N, \tag{1.5}$$

where

$$S = \sum_{i=1}^{N} d_i^2\, y_i^2.$$

It is now shown that $\hat{\theta}$ is a generalization of James–Stein able to outperform least squares over the entire parameter space. In fact, defining

$$h_i(Y) = -\,d_i^2\, \frac{N-2}{S}\, y_i,$$

after simple computations we obtain

$$\begin{split} MSE_{\hat{\theta}} &= \sum_{i=1}^{N} \frac{1}{d_{i}^{2}} + \mathcal{E}\left[2\sum_{i=1}^{N} \frac{(y_{i} - d_{i}\theta_{i})\, h_{i}(Y)}{d_{i}^{2}} + \sum_{i=1}^{N} \frac{h_{i}^{2}(Y)}{d_{i}^{2}}\right] \\ &= \sum_{i=1}^{N} \frac{1}{d_{i}^{2}} + \mathcal{E}\left[2\sum_{i=1}^{N} \frac{1}{d_{i}^{2}} \frac{\partial h_{i}(Y)}{\partial y_{i}} + \sum_{i=1}^{N} \frac{h_{i}^{2}(Y)}{d_{i}^{2}}\right], \end{split}$$

where the last equality comes from Lemma 1.1 reported in Sect. 1.4. Since

$$\frac{\partial h_i(Y)}{\partial y_i} = -\,d_i^2\, \frac{N-2}{S} + d_i^4\, \frac{N-2}{S^2}\, 2 y_i^2,$$

one has

$$\begin{aligned} &\mathcal{E}\left[2\sum_{i=1}^{N}\frac{1}{d_i^2}\frac{\partial h_i(Y)}{\partial y_i} + \sum_{i=1}^{N}\frac{h_i^2(Y)}{d_i^2}\right] \\ &= \mathcal{E}\left[-\frac{2(N-2)N}{S} + \frac{2(N-2)}{S^2}\sum_{i=1}^{N} 2 d_i^2 y_i^2 + \frac{(N-2)^2}{S^2}\sum_{i=1}^{N} d_i^2 y_i^2\right] \\ &= -\,\mathcal{E}\,\frac{(N-2)^2}{S} < 0, \end{aligned}$$

which implies

$$MSE_{\hat{\theta}} < MSE_{\hat{\theta}^{LS}} \quad \forall\, \theta.$$

However, assume that the problem is ill-conditioned. Then, if one $d_i$ is small and the values of the $d_i$ are quite spread, we could well have $d_i^2/S \approx 0$. Hence, (1.5) essentially reduces to

$$\hat{\theta}_i = \left[1 - \frac{N-2}{S}\, d_i^2\right] \frac{y_i}{d_i} \approx \frac{y_i}{d_i},$$

which is the least squares estimate of $\theta_i$. This means that the signal components most influenced by the noise, i.e., those associated with small $d_i$, are not regularized. Thus, in the presence of ill-conditioning, $\hat{\theta}$ will likely return an estimate affected by large errors.
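This behaviour can be seen by inspecting the shrinkage factors $1 - (N-2)\,d_i^2/S$ of (1.5). The sketch below uses singular values $d_i$ of our own choosing, with one very small value creating the ill-conditioning:

```python
import numpy as np

# Shrinkage factors 1 - (N-2)*d_i^2/S of the generalized estimator (1.5)
# for a spread of d_i values: the smallest (most noise-sensitive)
# direction is barely shrunk, illustrating the drawback discussed above.
rng = np.random.default_rng(3)
d = np.array([10.0, 5.0, 2.0, 1.0, 0.01])   # ill-conditioned: the last d_i is tiny
N = d.size
theta = np.ones(N)                           # illustrative true parameter
y = d * theta + rng.standard_normal(N)       # y_i = d_i theta_i + e_i

S = np.sum(d ** 2 * y ** 2)
factors = 1 - (N - 2) * d ** 2 / S
print(factors)   # the factor for the smallest d_i is essentially 1 (no shrinkage)
```

The components with large $d_i$ (well determined by the data) receive the strongest shrinkage, while the noise-dominated component with $d_i = 0.01$ is left essentially untouched, exactly the opposite of what ill-conditioning would require.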

## **1.2 Ridge Regression**

Consider now one of the fundamental problems in system identification. The task is to estimate the impulse response $g^0$ of a discrete-time, linear and causal dynamic system, starting from noisy output data. The measurement model is

$$y(t) = \sum_{k=1}^{\infty} g_k^0\, u(t-k) + e(t), \quad t = 1, \ldots, N, \tag{1.6}$$

where $t$ denotes time, the sampling interval is one time unit for simplicity, the $g_k^0$ indicate the impulse response coefficients, $u(t)$ is the known system input, while $e(t)$ is the noise.

To determine the impulse response from input–output measurements, one of the main questions is how to parametrize the unknown $g^0$. The classical approach, which will also be reviewed in the next chapter, introduces a collection of impulse response models $g(\theta)$, each parametrized by a different vector $\theta$. In particular, here we will adopt an FIR model of order $m$, i.e., $g_k(\theta) = \theta_k$ for $k = 1, \ldots, m$ and zero elsewhere. This permits reformulating (1.6) as a linear regression: stacking all the elements $y(t)$ and $e(t)$ into the vectors $Y$ and $E$, we obtain the model

$$Y = \Phi \theta + E$$

with the regression matrix $\Phi \in \mathbb{R}^{N \times m}$ given by

$$\Phi = \begin{pmatrix} u(0) & u(-1) & \cdots & u(-m+1) \\ u(1) & u(0) & \cdots & u(-m+2) \\ \vdots & \vdots & & \vdots \\ u(N-1) & u(N-2) & \cdots & u(N-m) \end{pmatrix}.$$

We can now use least squares to estimate $\theta$. Assuming $\Phi^T \Phi$ to be of full rank, we obtain

$$\hat{\theta}^{LS} = \arg\min_{\theta} \|Y - \Phi\theta\|^2 \tag{1.7a}$$

$$= (\Phi^{T} \Phi)^{-1} \Phi^{T} Y. \tag{1.7b}$$

Note that the impulse response estimate is a function of the FIR order, which corresponds to the dimension $m$ of $\theta$. The choice of $m$ is a trade-off between bias (a large $m$ is needed to describe slowly decaying impulse responses without too much error) and variance (a large $m$ requires the estimation of many parameters, leading to a large variance). This can be illustrated with a numerical experiment. The unknown impulse response $g^0$ is defined by the following rational transfer function:

$$\frac{(z+1)^2}{z(z-0.8)(z-0.7)},$$

which, in practice, is equal to zero after less than 50 samples ($g^0$ is the red line in Fig. 1.3). We estimate the system from 1000 outputs corrupted by white Gaussian noise $e(t)$ with variance equal to the variance of the noiseless output divided by 50, see Fig. 1.2 (bottom panel). Data come from the system initially at rest and then fed at $t = 0$ with white noise low-pass filtered by $z/(z-0.99)$, see Fig. 1.2 (top panel). The reconstruction error is very large if we try to estimate $g^0$ with $m = 50$: linear models are easy to estimate, but the drawback is that a high-order FIR model may suffer from high variance. Hence, it is important to select a model order which balances bias and variance well. To do that, one needs to try different values of $m$ and then use some validation procedure to determine the "optimal" one. In this case, since the true $g^0$ is known, we can obtain the best value by selecting the $m \in \{1, \ldots, 50\}$ which minimizes the MSE. This is an example of an *oracle-based* procedure not implementable in practice: the optimal order is selected exploiting knowledge of the true system. We obtain $m = 18$, which corresponds to $MSE_{LS} = 70.7$ and leads to the impulse response estimate displayed in Fig. 1.3. Even if the data set size is large and the signal-to-noise ratio is good, the estimate is far from satisfactory. The reason is that the low-pass input has poor excitation and leads to an ill-conditioned problem. This means that the condition number of the regression matrix $\Phi$ is large, so that even a small output error can produce a large reconstruction error.
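The experiment can be reproduced in outline as follows. The sketch below is our own implementation (the seed and minor implementation details are our choices, so the resulting numbers will differ from those reported in the text): it simulates the system as a difference equation, builds the regression matrix and computes the least squares FIR estimate (1.7):

```python
import numpy as np

# Sketch of the numerical experiment: the system (z+1)^2/(z(z-0.8)(z-0.7))
# is simulated as a difference equation, the input is white noise filtered
# by z/(z-0.99), and an order-m FIR model is fitted by least squares (1.7).
rng = np.random.default_rng(4)
N, m = 1000, 50

def simulate(u):
    # y(t) = 1.5 y(t-1) - 0.56 y(t-2) + u(t-1) + 2 u(t-2) + u(t-3)
    y = np.zeros(len(u))
    for t in range(len(u)):
        if t >= 1: y[t] += 1.5 * y[t - 1] + u[t - 1]
        if t >= 2: y[t] += -0.56 * y[t - 2] + 2.0 * u[t - 2]
        if t >= 3: y[t] += u[t - 3]
    return y

# low-pass filtered input; system initially at rest (u(t) = 0 for t <= 0)
w = rng.standard_normal(N)
u = np.zeros(N)
for t in range(N):
    u[t] = (0.99 * u[t - 1] if t >= 1 else 0.0) + w[t]

y0 = simulate(u)
y = y0 + rng.standard_normal(N) * np.sqrt(np.var(y0) / 50)  # noise level as in the text

# regression (Toeplitz) matrix: column k holds u(t - k - 1)
Phi = np.zeros((N, m))
for k in range(m):
    Phi[k + 1:, k] = u[:N - k - 1]

theta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # FIR estimate of g0
```

Inspecting `np.linalg.cond(Phi)` confirms the ill-conditioning caused by the poorly exciting low-pass input.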

**Fig. 1.2** Input–output data

An alternative to the classical paradigm, where different model structures are introduced, is the following straightforward generalization of (1.3), known as *ridge regression* [13, 14]:

$$\hat{\theta}^{R} = \arg\min_{\theta} \|Y - \Phi\theta\|^2 + \gamma\, \|\theta\|^2 \tag{1.8a}$$

$$= (\Phi^T \Phi + \gamma I_m)^{-1} \Phi^T Y, \tag{1.8b}$$

where we set $m = 50$ to solve our problem. Letting $A = (\Phi^T \Phi + \gamma I_m)^{-1}\Phi^T$, it is easy to derive the MSE decomposition associated with $\hat{\theta}^R$:

$$MSE_R = \underbrace{\sigma^2\, \mathrm{Trace}(A A^T)}_{Variance} + \underbrace{\|\theta - A\Phi\theta\|^2}_{Bias^2}. \tag{1.9}$$

Figure 1.4 displays $MSE_R$ for the particular system identification problem at hand as a function of the regularization parameter. Note that $\gamma$ plays the role of the model order in the classical scenario but can be tuned in a continuous manner to reach a good bias-variance trade-off. It is also interesting to see its influence on the variance and bias components. The variance is a decreasing function of the regularization parameter. Hence, its maximum is reached for $\gamma = 0$, where $\hat{\theta}^R$ reduces to the least squares estimator $\hat{\theta}^{LS}$ given by (1.7) with $m = 50$. Instead, the bias increases with $\gamma$. In the limit, for $\gamma \to \infty$, the penalty $\|\theta\|^2$ is so overweighted that $\hat{\theta}^R$ becomes the constant estimator centred on the origin (it returns all null impulse response coefficients).

In Fig. 1.5, we finally display the ridge regularized estimate with $\gamma$ set to the value minimizing the error and leading to $MSE_R = 16.8$. It is evident that ridge regression provides a much better bias-variance trade-off than selecting the FIR order.
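A minimal sketch of ridge regression (1.8) together with the decomposition (1.9), applied to a small synthetic ill-conditioned problem of our own construction (not the system of the text; matrix, noise level and $\gamma$ are illustrative choices):

```python
import numpy as np

# Ridge regression (1.8) and its MSE decomposition (1.9) on a synthetic
# ill-conditioned problem: singular values spread over three decades.
rng = np.random.default_rng(5)
N, m, sigma, gamma = 40, 10, 1.0, 5.0

U, _ = np.linalg.qr(rng.standard_normal((N, m)))
Phi = U * np.logspace(0, -3, m)          # singular values from 1 down to 1e-3
theta = rng.standard_normal(m)
Y = Phi @ theta + sigma * rng.standard_normal(N)

A = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T)
theta_r = A @ Y                          # ridge estimate (1.8b)

variance = sigma ** 2 * np.trace(A @ A.T)            # variance term of (1.9)
bias2 = np.sum((theta - A @ Phi @ theta) ** 2)       # squared-bias term of (1.9)
mse_r = variance + bias2

# least squares MSE for comparison: sigma^2 Trace((Phi^T Phi)^{-1})
mse_ls = sigma ** 2 * np.trace(np.linalg.inv(Phi.T @ Phi))
print(mse_r, mse_ls)   # ridge is orders of magnitude smaller here
```

Sweeping `gamma` over a grid and plotting `variance`, `bias2` and `mse_r` would reproduce the qualitative behaviour of Fig. 1.4: variance decreasing and bias increasing in $\gamma$.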

## **1.3 Further Topics and Advanced Reading**

Stein's intuition on the development of an estimator able to dominate least squares in terms of global MSE can be found in [23], while the specific form of $\hat{\theta}^{JS}$ was obtained in [15]. Since then, a large variety of estimators outperforming least squares, also under different losses, have been designed. It has been proved that there exist estimators which dominate James–Stein, even if the MSE improvement is not large, as described in [12, 16, 25]. Extensions and applications can be found in [5, 6, 11, 22, 24, 26]. A James–Stein version of the Kalman filter is derived in [18]. For interesting discussions on the limitations of Stein-type estimators in facing ill-conditioning see [8], but also [19] for new results with better numerical stability properties. Other developments are reported in [7], where generalizations of Stein's lemma are also described.

The paper [10] describes connections between James–Stein estimation and the so-called empirical Bayes approaches which will be treated later in this book. The interplay between Stein-type estimators and the Bayes approach is also discussed in [2]. There, one can also find an estimator which dominates least squares, concentrating the MSE improvement in an ellipsoid in the parameter space that can be chosen by the user. This approach is deeply connected with robust Bayesian estimation concepts, e.g., see [1, 3].

The term ridge regression was popularized by the works [13, 14]. This approach, introduced to guard against ill-conditioning and numerical instability, is an example of Tikhonov regularization for ill-posed problems. Among the first classical works on regularization and inverse problems, it is worth citing [4, 20, 27–29]. A recent survey on the use of regularization for system identification can instead be found in [21]. The literature on this topic is huge, and other relevant works will be cited in the next chapters.

## **1.4 Appendix: Proof of Theorem 1.1**

To discuss the properties of the James–Stein estimator, first it is useful to introduce a result which is a simplified version of Lemma 3.2 reported in Chap. 3, known as Stein's lemma.

**Lemma 1.1** (Stein's lemma, based on [24]) *Consider N Gaussian and independent random variables* $y_i \sim \mathcal{N}(\theta_i, \sigma^2)$, $i = 1, \ldots, N$. *Let* $h : \mathbb{R}^N \to \mathbb{R}$ *be a differentiable function such that* $\mathcal{E}\left|\frac{\partial h(Y)}{\partial y_i}\right| < \infty$. *Then, it holds that*

$$
\mathcal{E}\left[(y_i - \theta_i)\, h(Y)\right] = \sigma^2\, \mathcal{E}\left[\frac{\partial h(Y)}{\partial y_i}\right].
$$

*Proof* Throughout the proof, we use $\mathcal{E}_{j\neq i}$ to denote the expectation conditional on $\{y_j\}_{j\neq i}$. Also, abusing notation, $h(x)$ with $x \in \mathbb{R}$ indicates the function $h$ with the $i$th argument set to $x$ while the other arguments are set to $y_j$, $j \neq i$.

Note that, in view of the independence assumptions, each $y_i$ conditional on $\{y_j\}_{j\neq i}$ is still Gaussian with mean $\theta_i$ and variance $\sigma^2$. Then, using integration by parts, one has

$$\begin{aligned} \mathcal{E}_{j\neq i}\left( \frac{\partial h(Y)}{\partial y_i} \right) &= \int_{-\infty}^{+\infty} \frac{\partial h(x)}{\partial x}\, \frac{\exp(-(x - \theta_i)^2/(2\sigma^2))}{\sqrt{2\pi}\,\sigma}\, dx \\ &= \left[ h(x)\, \frac{\exp(-(x - \theta_i)^2/(2\sigma^2))}{\sqrt{2\pi}\,\sigma} \right]_{-\infty}^{+\infty} + \int_{-\infty}^{+\infty} \frac{(x - \theta_i)}{\sigma^2}\, h(x)\, \frac{\exp(-(x - \theta_i)^2/(2\sigma^2))}{\sqrt{2\pi}\,\sigma}\, dx \\ &= \int_{-\infty}^{+\infty} \frac{(x - \theta_i)}{\sigma^2}\, h(x)\, \frac{\exp(-(x - \theta_i)^2/(2\sigma^2))}{\sqrt{2\pi}\,\sigma}\, dx \\ &= \frac{\mathcal{E}_{j\neq i}\left( (y_i - \theta_i)\, h(Y) \right)}{\sigma^2}. \end{aligned}$$

Note that the penultimate equality exploits the fact that $h(x) \exp(-(x - \theta_i)^2/(2\sigma^2))$ must vanish as $|x| \to \infty$, otherwise the assumption $\mathcal{E}\left|\frac{\partial h(Y)}{\partial y_i}\right| < \infty$ would not hold. Using the above result, we obtain

$$\begin{aligned} \mathcal{E}\left( (y_i - \theta_i)\, h(Y) \right) &= \mathcal{E}\left[ \mathcal{E}_{j \neq i}\left( (y_i - \theta_i)\, h(Y) \right) \right] \\ &= \sigma^2\, \mathcal{E}\left[ \mathcal{E}_{j \neq i}\left( \frac{\partial h(Y)}{\partial y_i} \right) \right] \\ &= \sigma^2\, \mathcal{E}\left[ \frac{\partial h(Y)}{\partial y_i} \right] \end{aligned}$$

and this completes the proof. □
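The identity in Stein's lemma can also be checked numerically. The sketch below (assuming NumPy is available) compares the two sides of the lemma by Monte Carlo for the particular function $h(Y) = y_1/\|Y\|^2$ used in the James–Stein analysis; the problem dimensions and the vector θ are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Stein's lemma, E[(y_i - theta_i) h(Y)] = sigma^2 E[dh/dy_i],
# for h(Y) = y_1 / ||Y||^2 (the function that appears in the James-Stein analysis).
rng = np.random.default_rng(0)
N, sigma = 5, 1.0
theta = np.array([1.0, -0.5, 2.0, 0.3, -1.2])   # arbitrary mean vector

Y = theta + sigma * rng.standard_normal((200_000, N))   # rows: independent draws
norm2 = np.sum(Y**2, axis=1)

h = Y[:, 0] / norm2                                     # h(Y) = y_1 / ||Y||^2
lhs = np.mean((Y[:, 0] - theta[0]) * h)                 # E[(y_1 - theta_1) h(Y)]
dh = 1.0 / norm2 - 2.0 * Y[:, 0]**2 / norm2**2          # dh/dy_1
rhs = sigma**2 * np.mean(dh)                            # sigma^2 E[dh/dy_1]

print(lhs, rhs)   # the two Monte Carlo averages nearly coincide
```

With 200,000 draws, the two averages agree to a few decimal places, as the lemma predicts.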


We now show that the MSE of the James–Stein estimator is uniformly smaller than the MSE of least squares. One has

$$\begin{aligned} MSE_{JS} &= \mathcal{E}\left( \|\theta - \hat{\theta}^{JS}(Y)\|^2 \right) \\ &= \mathcal{E}\left( \left\| \theta - Y + \frac{(N-2)\sigma^2}{\|Y\|^2}\, Y \right\|^2 \right) \\ &= \mathcal{E}\left( \|\theta - Y\|^2 + \frac{(N-2)^2 \sigma^4}{\|Y\|^4}\, \|Y\|^2 + 2(\theta - Y)^T Y\, \frac{(N-2)\sigma^2}{\|Y\|^2} \right) \\ &= N\sigma^2 + \mathcal{E}\left( \frac{(N-2)^2 \sigma^4}{\|Y\|^2} + 2(\theta - Y)^T Y\, \frac{(N-2)\sigma^2}{\|Y\|^2} \right). \end{aligned}$$

As for the last term inside the expectation, exploiting Stein's lemma with

$$h_i(Y) = \frac{y_i}{\|Y\|^2}, \qquad \frac{\partial h_i(Y)}{\partial y_i} = \frac{1}{\|Y\|^2} - 2\, \frac{y_i^2}{\|Y\|^4},$$

one has

$$\begin{aligned} \mathcal{E}\left( \frac{(\theta - Y)^T Y}{\|Y\|^2} \right) &= \mathcal{E}\left( \sum_{i=1}^{N} (\theta_i - y_i)\, h_i(Y) \right) \\ &= -\sigma^2\, \mathcal{E}\left( \sum_{i=1}^{N} \left( \frac{1}{\|Y\|^2} - 2\, \frac{y_i^2}{\|Y\|^4} \right) \right) \\ &= -\sigma^2\, \mathcal{E}\left( \frac{N-2}{\|Y\|^2} \right). \end{aligned}$$

Using this equality in the MSE expression, we finally obtain

$$\begin{aligned} MSE_{JS} &= N\sigma^2 + \mathcal{E}\left( \frac{(N-2)^2 \sigma^4}{\|Y\|^2} - 2\, \frac{(N-2)^2 \sigma^4}{\|Y\|^2} \right) \\ &= N\sigma^2 - (N-2)^2 \sigma^4\, \mathcal{E}\left( \frac{1}{\|Y\|^2} \right) < N\sigma^2. \end{aligned}$$
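The dominance result just proved is easy to observe in simulation. The following sketch (assuming NumPy; the dimension and the vector θ are arbitrary illustrative choices) draws many samples of Y and compares the empirical MSE of least squares, $\hat{\theta} = Y$, with that of the James–Stein estimator $\hat{\theta}^{JS} = \left(1 - \frac{(N-2)\sigma^2}{\|Y\|^2}\right) Y$.

```python
import numpy as np

# Monte Carlo comparison of least squares and James-Stein for Y ~ N(theta, sigma^2 I).
rng = np.random.default_rng(1)
N, sigma = 10, 1.0
theta = np.linspace(-1.0, 1.0, N)            # arbitrary true parameter vector

Y = theta + sigma * rng.standard_normal((100_000, N))
norm2 = np.sum(Y**2, axis=1, keepdims=True)
theta_js = (1.0 - (N - 2) * sigma**2 / norm2) * Y   # James-Stein shrinkage

mse_ls = np.mean(np.sum((Y - theta)**2, axis=1))         # close to N * sigma^2
mse_js = np.mean(np.sum((theta_js - theta)**2, axis=1))  # uniformly smaller

print(mse_ls, mse_js)
```

For this θ, the James–Stein MSE comes out roughly half of the least-squares value $N\sigma^2 = 10$, matching the bound $N\sigma^2 - (N-2)^2\sigma^4\,\mathcal{E}(1/\|Y\|^2)$.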

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Classical System Identification**

**Abstract** System identification as a field has been around since the 1950s, with roots in statistical theory. A substantial body of concepts, theory, algorithms and experience has been developed since then. Indeed, there is a very extensive literature on the subject, with many textbooks, like [5, 8, 12]. Some main points of this "classical" field are summarized in this chapter, just pointing to the basic structure of the problem area. The problem centres around four main pillars: (1) the observed data from the system, (2) a parametrized set of candidate models, "the model structure", (3) an estimation method that fits the model parameters to the observed data and (4) a validation process that helps in making decisions about the choice of model structure. The crucial choice is that of the model structure. The archetypical choice for linear models is the ARX model, a linear difference equation between the system's input and output signals. This is a universal approximator for linear systems: for sufficiently high orders of the equations, arbitrarily good descriptions of the system are obtained. For a "good" model, proper choices of structural parameters, like the equation orders, are required. An essential part of the classical theory deals with asymptotic quality measures, bias and variance, that aim at giving the best mean square error between the model and the true system. Some of this theory is reviewed in this chapter for estimation methods of the maximum likelihood character.

## **2.1 The State-of-the-Art Identification Setup**

System Identification is characterized by five basic concepts:

• *D*: the observed input–output data,
• *M*: the model structure, a parametrized set of candidate models,
• *I*: the identification method that fits the model to the data,
• *X*: the experiment design,
• *V*: the model validation process.


© The Author(s) 2022

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_2

See Fig. 2.1. It is typically an iterative process to navigate to a model that passes the validation test ("is not falsified"), involving revisions of the necessary choices. For several of the steps in this loop, helpful support tools have been developed. It is, however, neither quite possible nor desirable to fully automate the choices, since subjective perspectives related to the intended use of the model are very important.

## **2.2** *M***: Model Structures**

A model structure *M* is a set of parametrized models that describe the relations between the inputs *u* and outputs *y* of the system. The parameters are denoted by θ, so a particular model will be denoted by *M*(θ). The set of models then is

$$\mathcal{M} = \{ \mathcal{M}(\theta)\, |\, \theta \in D_{\mathcal{M}} \}. \tag{2.1}$$

The models may be expressed and formalized in many different ways. The most common model class is linear and time-invariant (LTI), but possible models include both nonlinear and time-varying cases, so a list of actually used concrete models would be both very long and diverse.

It is useful to take the general view that a model gives a rule to predict (one-step-ahead) the output at time *t*, i.e., *y*(*t*) (a *p*-dimensional column vector), based on observations of previous input–output data up to time *t* − 1, denoted by $Z^{t-1} = \{y(t-1), u(t-1), y(t-2), u(t-2), \ldots\}$. Here *u*(*t*) is the input at time *t*; we assume that the data are collected in discrete time and, for simplicity, enumerate the samples by *t*.

The predicted output will then be

$$
\hat{\mathbf{y}}(t|\theta) = \mathbf{g}(t, \theta, Z^{t-1}) \tag{2.2}
$$

for a certain function *g* of past data. This covers a very wide variety of model descriptions, sometimes in a somewhat abstract way. The descriptions become much more explicit when we specialize to linear models.

**A note on "inputs"** All measurable disturbances that affect *y* should be included among the inputs *u* to the system, even if they cannot be manipulated as control inputs. In some cases, the system may entirely lack measurable inputs, so the model (2.2) then just describes how future outputs can be predicted from past ones. Such models are called *time series*, and correspond to systems that are driven by unobservable disturbances. Most of the techniques described in this book apply also to such models.

**A note on disturbances** A complete model involves both a description of the input– output relations and a description of how various disturbance or noise sources affect the measurements. The noise description is essential both to understand the quality of the model predictions and the model uncertainty. Proper control design also requires a picture of the disturbances in the system.

## *2.2.1 Linear Time-Invariant Models*

For linear time-invariant (LTI) systems, a general model structure is given by the transfer function *G* from input *u* to output *y* and with an additive disturbance—or noise—*v*(*t*):

$$y(t) = G(q, \theta)\, u(t) + v(t). \tag{2.3a}$$

This model is in discrete time and *q* denotes the shift operator *qy*(*t*) = *y*(*t* + 1). The sampling interval is set to one time unit. The expansion of *G*(*q*,θ) in the inverse (backwards) shift operator gives the *impulse response* of the system:

$$G(q,\theta)\, u(t) = \sum_{k=1}^{\infty} g_k(\theta)\, q^{-k}\, u(t) = \sum_{k=1}^{\infty} g_k(\theta)\, u(t-k). \tag{2.3b}$$

The discrete-time Fourier transform (or the *z*-transform of the impulse response, evaluated at *z* = *e<sup>i</sup>*<sup>ω</sup>) gives the *frequency response* of the system:

$$G(e^{i\omega}, \theta) = \sum\_{k=1}^{\infty} g\_k(\theta) e^{-ik\omega}. \tag{2.3c}$$

The function *G* describes how an input sinusoid shifts phase and amplitude when it passes through the system.
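The sum (2.3c) can be evaluated numerically by truncation. As a small sketch (assuming NumPy; the first-order system below is a hypothetical example), take the impulse response $g_k = b\,a^{k-1}$, whose frequency response is known in closed form as $b\,e^{-i\omega}/(1 - a\,e^{-i\omega})$:

```python
import numpy as np

# Frequency response (2.3c) by truncating the impulse-response sum.
# Hypothetical first-order example: g_k = b * a**(k-1), i.e. G(q) = b q^-1 / (1 - a q^-1).
a, b = 0.7, 2.0
K = 200                                     # truncation length (|a| < 1, so the tail is negligible)
k = np.arange(1, K + 1)
g = b * a**(k - 1)                          # impulse response coefficients g_k

omega = np.linspace(0, np.pi, 64)
# G(e^{i w}) = sum_k g_k e^{-i k w}, evaluated at every frequency on the grid
G = (g[None, :] * np.exp(-1j * np.outer(omega, k))).sum(axis=1)

# closed-form check: b e^{-i w} / (1 - a e^{-i w})
G_exact = b * np.exp(-1j * omega) / (1 - a * np.exp(-1j * omega))
print(np.max(np.abs(G - G_exact)))          # truncation error, tiny for K = 200
```

The magnitude `np.abs(G)` and phase `np.angle(G)` then give the amplitude and phase shift of a sinusoid passing through the system.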

The additive noise term *v* can be described as white noise *e*(*t*), filtered through another transfer function *H*:

$$\nu(t) = H(q, \theta)e(t) \tag{2.3d}$$

$$
E\, e^2(t) = \sigma^2 \tag{2.3e}
$$

$$
E\, e(t)\, e^T(k) = 0 \quad \text{if } k \neq t \tag{2.3f}
$$

(*E* denotes mathematical expectation).

This noise characterization is quite versatile: with a suitable choice of *H*, it can describe a disturbance with an essentially arbitrary spectrum. It is useful to normalize (2.3d) by making *H* *monic*:

$$H(q, \theta) = 1 + h_1(\theta)\, q^{-1} + \cdots. \tag{2.3g}$$

To think in terms of the general model description (2.2), with the predictor as a unifying model concept, it is useful to rewrite (2.3), assuming *H* to be inversely stable [5, Sect. 3.2], as

$$\begin{aligned} H^{-1}(q,\theta)\, y(t) &= H^{-1}(q,\theta)\, G(q,\theta)\, u(t) + e(t) \\ y(t) &= [1 - H^{-1}(q,\theta)]\, y(t) + H^{-1}(q,\theta)\, G(q,\theta)\, u(t) + e(t) \\ y(t) &= G(q,\theta)\, u(t) + [1 - H^{-1}(q,\theta)]\, [y(t) - G(q,\theta)\, u(t)] + e(t). \end{aligned}$$

Note that the expansion of $H^{-1}$ starts with "1", so $1 - H^{-1}(q,\theta)$ starts with a term $\tilde{h}_1 q^{-1}$: there is a delay in *y*. That means that the right-hand side is known at time *t* − 1, except for the term *e*(*t*), which is unpredictable at time *t* − 1 and must be estimated with its mean 0. All this means that the predictor for (2.3) (the conditional mean of *y*(*t*) given past data) is

$$\hat{\mathbf{y}}(t|\theta) = G(q,\theta)u(t) + [1 - H^{-1}(q,\theta)][\mathbf{y}(t) - G(q,\theta)u(t)].\tag{2.4}$$

It is easy to interpret the first term as a simulation using the input *u*, adjusted with a prediction of the additive disturbance *v*(*t*) at time *t*, based on past values of *v*. The predictor is thus an easy reformulation of the basic transfer functions *G* and *H*. The question now is how to parametrize these.
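The predictor (2.4) is just a pair of filtering operations. The sketch below (assuming NumPy and SciPy's `lfilter`; the polynomials are a hypothetical minimal example with $G(q) = b_1 q^{-1}$ and monic $H(q) = 1 + c_1 q^{-1}$) applies it to simulated data; since the model matches the data generator exactly, the prediction errors recover the innovations $e(t)$.

```python
import numpy as np
from scipy.signal import lfilter

# One-step-ahead predictor (2.4) for a hypothetical model:
# G(q) = b1 q^-1 (FIR) and monic noise model H(q) = 1 + c1 q^-1.
rng = np.random.default_rng(2)
b1, c1, sigma = 0.5, 0.8, 0.1
Nsamp = 1000

u = rng.standard_normal(Nsamp)
e = sigma * rng.standard_normal(Nsamp)
Gu = lfilter([0.0, b1], [1.0], u)            # G(q) u(t)
y = Gu + lfilter([1.0, c1], [1.0], e)        # y = G u + H e

# (2.4): y_hat = G u + [1 - H^{-1}] (y - G u); since 1 - H^{-1} = (H - 1)/H,
# the disturbance prediction filters v = y - G u with numerator [0, c1], denominator H.
v = y - Gu
y_hat = Gu + lfilter([0.0, c1], [1.0, c1], v)

eps = y - y_hat                              # prediction errors recover e(t)
print(np.max(np.abs(eps - e)))
```

The filtering form $(H-1)/H$ is one standard way to realize $1 - H^{-1}$ with a stable recursion when $H$ is inversely stable.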

#### **2.2.1.1 The McMillan Degree**

Given just the impulse response sequence $g_k$, $k = 1, 2, \ldots$, one may consider different ways of representing the system in a more compact form, like rational transfer functions or state-space models, to be considered below. A quite useful concept is then *the McMillan degree*:


From a given impulse response sequence, *gk* (that could be *p* × *m* matrices that describe a system with *m* inputs and *p* outputs) form the *Hankel matrix*

$$H_k = \begin{bmatrix} g_1 & g_2 & g_3 & \cdots & g_k \\ g_2 & g_3 & g_4 & \cdots & g_{k+1} \\ \vdots & \vdots & \vdots & & \vdots \\ g_k & g_{k+1} & g_{k+2} & \cdots & g_{2k-1} \end{bmatrix}. \tag{2.5}$$

Then as *k* increases, the McMillan degree *n* of the impulse response is the maximal rank of *Hk* :

$$n = \max_{k} \text{rank}\, H_k. \tag{2.6}$$

This means that the impulse response can be generated from an *n*th-order state-space model, but not from any lower-order model.
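This rank condition is simple to probe numerically. The sketch below (assuming NumPy; the impulse response is a hypothetical second-order example, a sum of two exponential modes) builds Hankel matrices of growing size and watches the rank saturate at the McMillan degree:

```python
import numpy as np

# McMillan degree (2.6): maximal rank of the Hankel matrix (2.5) built from
# the impulse response. Hypothetical 2nd-order example: g_k = 0.9**k + (-0.5)**k.
k_max = 8
g = np.array([0.9**k + (-0.5)**k for k in range(1, 2 * k_max)])

def hankel_rank(g, k):
    # H[i, j] = g_{i+j+1}, so H[0, 0] = g_1 and H[k-1, k-1] = g_{2k-1}
    H = np.array([[g[i + j] for j in range(k)] for i in range(k)])
    return np.linalg.matrix_rank(H)

ranks = [hankel_rank(g, k) for k in range(1, k_max + 1)]
print(ranks)   # the rank saturates at the McMillan degree n = 2
```

Two exponential modes give rank 2 for every $k \ge 2$, so a second-order state-space model reproduces this impulse response exactly.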

#### **2.2.1.2 Black-Box Models**

A *black-box* model uses no physical insight or interpretation, but is just a general and flexible parameterization. It is natural to let *G* and *H* be rational in the shift operator:

$$G(q, \theta) = \frac{B(q)}{F(q)}; \quad H(q, \theta) = \frac{C(q)}{D(q)}\tag{2.7a}$$

$$B(q) = b_1 q^{-1} + b_2 q^{-2} + \cdots + b_{n_b} q^{-n_b} \tag{2.7b}$$

$$F(q) = 1 + f_1 q^{-1} + \cdots + f_{n_f} q^{-n_f}, \tag{2.7c}$$

with *C* and *D* *monic* like *F*, i.e., starting with a "1", and with the vector collecting all the coefficients

$$\theta = [b\_1, b\_2, \dots, f\_{nf}].\tag{2.7d}$$

Common black-box structures of this kind are FIR (finite impulse response model, *F* = *C* = *D* = 1), ARMAX (autoregressive moving average with exogenous input, *F* = *D*), and BJ (Box–Jenkins, all four polynomials different).

#### **A Very Common Case: The ARX Model**

A very common case is that *F* = *D* = *A* and *C* = 1 which gives the *ARX model* (autoregressive with exogenous input):

$$y(t) = A^{-1}(q) B(q)\, u(t) + A^{-1}(q)\, e(t) \quad \text{or} \tag{2.8a}$$

$$A(q)\, y(t) = B(q)\, u(t) + e(t) \quad \text{or} \tag{2.8b}$$

$$y(t) + a_1 y(t-1) + \dots + a_{n_a} y(t - n_a) \tag{2.8c}$$

$$= b_1 u(t-1) + \dots + b_{n_b} u(t - n_b) + e(t). \tag{2.8d}$$

This means that the expression for the predictor (2.4) becomes very simple:

$$
\hat{\mathbf{y}}(t|\theta) = \boldsymbol{\varphi}^T(t)\boldsymbol{\theta} \tag{2.9}
$$

$$\varphi^T(t) = \begin{bmatrix} -y(t-1) & -y(t-2) & \cdots & -y(t-n_a) & u(t-1) & \cdots & u(t-n_b) \end{bmatrix} \tag{2.10}$$

$$\theta^T = \begin{bmatrix} a_1 & a_2 & \cdots & a_{n_a} & b_1 & b_2 & \cdots & b_{n_b} \end{bmatrix}. \tag{2.11}$$

In statistics, such a model is known as a *linear regression*.
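Since (2.9) is linear in θ, the ARX parameters follow from ordinary least squares. A minimal sketch (assuming NumPy; the first-order data generator is a hypothetical example):

```python
import numpy as np

# ARX estimation as linear regression (2.9)-(2.11): build regressors
# phi(t) = [-y(t-1), u(t-1)]^T and solve least squares for theta = [a1, b1]^T.
# Hypothetical system: y(t) + a1 y(t-1) = b1 u(t-1) + e(t).
rng = np.random.default_rng(3)
a1, b1, sigma = -0.8, 1.0, 0.05
Nsamp = 2000

u = rng.standard_normal(Nsamp)
y = np.zeros(Nsamp)
for t in range(1, Nsamp):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + sigma * rng.standard_normal()

Phi = np.column_stack([-y[:-1], u[:-1]])    # rows phi(t)^T, t = 1, ..., N-1
theta_hat, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta_hat)   # close to [a1, b1] = [-0.8, 1.0]
```

The same construction extends to any orders $n_a$, $n_b$ by stacking more delayed outputs and inputs into each regressor row.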

We note that as $n_a$ and $n_b$ increase to infinity, the predictor (2.9) may approximate any linear model predictor (2.4). This points to a *very important general approximation property of ARX models*:

**Theorem 2.1** (based on [6]) *Suppose a true linear system is given by*

$$\mathbf{y}(t) = G\_0(q)\boldsymbol{u}(t) + H\_0(q)\boldsymbol{e}(t),\tag{2.12}$$

*where* $G_0(q)$ *and* $H_0^{-1}(q)$ *are stable filters,*

$$\begin{aligned} G_0(q) &= \sum_{k=1}^{\infty} g_k\, q^{-k} \\ H_0^{-1}(q) &= \sum_{k=1}^{\infty} \tilde{h}_k\, q^{-k} \\ d(n) &= \sum_{k=n}^{\infty} |g_k| + |\tilde{h}_k| \end{aligned}$$

*and e is a sequence of independent zero-mean random variables with bounded fourthorder moments.*

*Consider an ARX model (2.8) with orders* $n_a = n_b = n$, *estimated from N observations. Assume that the order depends on the number of data as n*(*N*) *and tends to infinity such that* $n(N)^5/N \to 0$. *Assume also that the system is such that* $d(n(N))\sqrt{N} \to 0$ *as* $N \to \infty$. *Then the ARX model estimates* $\hat{A}_{n(N)}(q)$ *and* $\hat{B}_{n(N)}(q)$ *of order n*(*N*) *obey*

$$\frac{\hat{B}_{n(N)}(q)}{\hat{A}_{n(N)}(q)} \to G_0(q), \quad \frac{1}{\hat{A}_{n(N)}(q)} \to H_0(q) \ \text{ as } N \to \infty. \tag{2.13}$$

Intuitively, the above result follows from the fact that the true predictor for the system

$$\hat{y}(t|\theta) = (1 - H_0^{-1})\, y(t) + H_0^{-1} G_0\, u(t) = \sum_{k=1}^{\infty} \left( \tilde{h}_k\, y(t-k) + \tilde{g}_k\, u(t-k) \right)$$

is stable. Hence, it can be truncated at any *n* with arbitrary accuracy, and the truncated sum is the predictor of an *n*th-order ARX model.

This is quite a useful result saying that ARX models can approximate any linear system, if the orders are sufficiently large. ARX models are easy to estimate. The estimates are calculated by linear least squares (LS) techniques, which are convex and numerically robust. Estimating a high-order ARX model, possibly followed by some model order reduction, could thus be an alternative to the numerically more demanding general PEM criterion minimization (2.22) introduced later on. This has been extensively used, e.g., by [14, 15]. The only drawback with high-order ARX models is that they may suffer from high variance.

#### **2.2.1.3 Grey-Box Models**

If some physical facts are known about the system, they can be incorporated in the model structure. Such a model, based on physical insight and with a built-in behaviour that mimics known physics, is known as a *Grey-Box Model*. For example, it could be that the motion equations of an airplane are known from Newton's laws, but certain parameters, like the aerodynamical derivatives, are unknown. Then it is natural to build a continuous-time state-space model from known physical equations:

$$\begin{aligned} \dot{x}(t) &= A(\theta)\, x(t) + B(\theta)\, u(t) \\ y(t) &= C(\theta)\, x(t) + D(\theta)\, u(t) + v(t). \end{aligned} \tag{2.14}$$

Here θ are simply some entries of the matrices *A*, *B*,*C*, *D*, corresponding to unknown physical parameters, while the other matrix entries signify known physical behaviour. This model can be sampled with well-known sampling formulas (obeying the input inter-sample properties, zero-order hold or first-order hold) to give

$$\begin{aligned} x(t+1) &= \mathcal{A}(\theta)\, x(t) + \mathcal{B}(\theta)\, u(t) \\ y(t) &= C(\theta)\, x(t) + D(\theta)\, u(t) + w(t). \end{aligned} \tag{2.15}$$

The model (2.15) has the transfer function from *u* to *y*

$$G(q, \theta) = C(\theta)\, [qI - \mathcal{A}(\theta)]^{-1} \mathcal{B}(\theta) + D(\theta), \tag{2.16}$$

so we have achieved a particular parameterization of the general linear model (2.3).
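For the zero-order-hold case, the sampled matrices are $\mathcal{A}(\theta) = e^{A(\theta)T}$ and $\mathcal{B}(\theta) = \left(\int_0^T e^{A(\theta)s}\, ds\right) B(\theta)$ for sampling interval $T$. A minimal sketch (assuming NumPy and SciPy; the matrices are a hypothetical second-order example) computes both at once with the standard block-matrix exponential trick:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # hypothetical known continuous-time dynamics
B = np.array([[0.0], [1.0]])
T = 0.1                                     # sampling interval

# ZOH sampling: A_d = exp(A T), B_d = (int_0^T exp(A s) ds) B,
# both read off one block-matrix exponential exp([[A, B], [0, 0]] T).
n, m = A.shape[0], B.shape[1]
M = np.zeros((n + m, n + m))
M[:n, :n], M[:n, n:] = A, B
Md = expm(M * T)
Ad, Bd = Md[:n, :n], Md[:n, n:]

print(Ad)
print(Bd)
```

For this example the continuous-time poles are −1 and −2, so the eigenvalues of `Ad` are $e^{-0.1}$ and $e^{-0.2}$, as the sampling formula requires.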

#### **2.2.1.4 Continuous-Time Models**

The general model description (2.2) describes how the predictions evolve in discrete time. In many cases, however, we are interested in continuous-time (CT) models, for example for physical interpretation and simulation. CT model estimation is nevertheless contained in the described framework, as the linear state-space model (2.14) illustrates.

## *2.2.2 Nonlinear Models*

A nonlinear model is a relation (2.2), where the function *g* is nonlinear in the input– output data *Z*. There is a rich variation in how to specify the function *g* more explicitly. A quite general way is the nonlinear state-space equation, which is a counterpart to (2.15):

$$\begin{aligned} x(t+1) &= f(x(t), u(t), v(t), \theta) \\ y(t) &= h(x(t), e(t), \theta), \end{aligned} \tag{2.17}$$

where *v* and *e* are white noises.

## **2.3** *I* **: Identification Methods—Criteria**

The goal of identification is to match the model to the data. Here the basic techniques for such matching will be discussed. Suppose we have collected a data record in the time domain

$$\mathcal{D}_T = \{ u(1), y(1), \ldots, u(N), y(N) \}, \tag{2.18}$$

which will be called in this book *identification set* or *training set*, with *N* being its size. A natural way to evaluate a model is to see how well it is able to predict the measured output since the model is in essence a predictor. It is thus quite natural to form the prediction errors for (2.2):

$$
\varepsilon(t,\theta) = \mathbf{y}(t) - \hat{\mathbf{y}}(t|\theta). \tag{2.19}
$$

The "size" of this error can be measured by some scalar norm:

$$\ell(\varepsilon(t,\theta))\tag{2.20}$$

and the performance of the predictor over the whole data record $\mathcal{D}_T$ is given by

$$V\_N(\theta) = \sum\_{t=1}^N \ell(\varepsilon(t, \theta)). \tag{2.21}$$

A natural parameter estimate is the value that minimizes this prediction fit:

$$\hat{\theta}_N = \arg\min_{\theta \in D_{\mathcal{M}}} V_N(\theta). \tag{2.22}$$

This is the *Prediction Error Method (PEM)* and it is applicable to general model structures. See, e.g., [5] or [7] for more details.
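The chain (2.19)–(2.22) can be sketched end to end for a simple model. The example below (assuming NumPy and SciPy's `minimize`; the first-order output-error model is a hypothetical choice, and Nelder–Mead is just one convenient local search) forms the prediction errors, sums the quadratic norm, and minimizes numerically:

```python
import numpy as np
from scipy.optimize import minimize

# PEM (2.19)-(2.22), quadratic norm, for a hypothetical first-order
# output-error model y(t) = G(q, theta) u(t) + e(t), G = b q^-1 / (1 + f q^-1).
# With H = 1 the predictor is the simulated output.
rng = np.random.default_rng(4)
b0, f0, sigma = 1.0, -0.7, 0.1
Nsamp = 500

u = rng.standard_normal(Nsamp)
x = np.zeros(Nsamp)
for t in range(1, Nsamp):
    x[t] = -f0 * x[t - 1] + b0 * u[t - 1]   # noise-free output of the true G
y = x + sigma * rng.standard_normal(Nsamp)

def V(theta):
    b, f = theta
    yhat = np.zeros(Nsamp)
    for t in range(1, Nsamp):
        yhat[t] = -f * yhat[t - 1] + b * u[t - 1]   # model predictor (2.2)
    eps = y - yhat                                   # prediction errors (2.19)
    return 0.5 * np.sum(eps ** 2)                    # quadratic criterion (2.21)

res = minimize(V, x0=[0.5, 0.0], method="Nelder-Mead")  # numerical search for (2.22)
print(res.x)   # close to (b0, f0) = (1.0, -0.7)
```

Unlike the ARX case, this criterion is not quadratic in θ (the parameter *f* enters through a recursion), which is why a numerical search is needed.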

The PEM approach can be embedded in a statistical setting. The ML methodology below offers a systematic framework to do so.

## *2.3.1 A Maximum Likelihood (ML) View*

If the system innovations *e* have a probability density function (pdf) $f(x)$, then the criterion function (2.21) with $\ell(x) = -\log f(x)$ is the negative logarithm of the *likelihood function*. See Lemma 5.1 in [5]. More specifically, let the system have *p* outputs, and let the innovations be Gaussian with zero mean and covariance matrix Λ, so that

$$y(t) = \hat{y}(t|\theta_0) + e(t), \quad e(t) \sim N(0, \Lambda) \tag{2.23}$$

for the θ<sup>0</sup> that generated the data. Then it follows that the negative logarithm of the likelihood function for estimating θ from *y* is

$$L_N(\theta) = \frac{1}{2}\left[ V_N(\theta) + N \log \det \Lambda + N p \log 2\pi \right], \tag{2.24}$$

where *VN* (θ ) is defined by (2.21), with

$$\ell(\varepsilon(t,\theta)) = \varepsilon^{\top}(t,\theta)\Lambda^{-1}\varepsilon(t,\theta). \tag{2.25}$$

That means that the maximum likelihood model estimate (MLE) for known Λ is obtained by minimizing *VN* (θ ). If Λ is not known, it can be included among the parameters and estimated, ([5], p. 218), which results in a criterion

$$D\_N(\theta) = \det \sum\_{t=1}^N \varepsilon(t, \theta) \varepsilon^T(t, \theta) \tag{2.26}$$

to be minimized.

A Bayesian interpretation of (2.22) as well as a regularized version will be given in Chap. 4.

## **2.4 Asymptotic Properties of the Estimated Models**

As we have seen in the first chapter, bias and variance play important roles in estimation problems. We will here give a short account of how these concepts are treated in classical system identification.

## *2.4.1 Bias and Variance*

The observations, certainly of the output from the system, are affected by noise and disturbances. That means that the estimated model parameters (2.22) also will be affected by disturbances. These disturbances are typically described as stochastic processes, which makes the estimate θˆ *<sup>N</sup>* a *random variable*. This has a certain probability density function, which could be complicated to compute. Often the analysis is restricted to its mean and variance only. The difference between the mean and a true description of the system measures the *bias* of the model. If the mean coincides with the true system, the estimate is said to be *unbiased*. As already pointed out in (1.1), the total error in a model thus has two contributions: the bias and the variance.

## *2.4.2 Properties of the PEM Estimate as N* **→ ∞**

Except in simple special cases it is quite difficult to compute the pdf of the estimate θˆ *<sup>N</sup>* . However, its *asymptotic properties* as *N* → ∞ are easier to establish. The basic results can be summarized as follows (see [5, Chaps. 8 and 9] for a more complete treatment):

• **Limit model:**

$$\hat{\theta}_N \to \theta^* = \arg\min_{\theta} \lim_{N \to \infty} \frac{1}{N} V_N(\theta) = \arg\min_{\theta} E\, \ell(\varepsilon(t, \theta)). \tag{2.27}$$

Here *E* denotes mathematical expectation. So the estimate will converge to the best possible model, in the sense that it gives the smallest average prediction error.

• **Asymptotic covariance matrix for scalar output models:**

In case the prediction errors $e(t) = \varepsilon(t, \theta^*)$ for the limit model are approximately white, the covariance matrix of the parameters is asymptotically given by

$$\mathrm{Cov}\hat{\theta}\_N \sim \frac{\kappa(\ell)}{N} \left[ \mathrm{Cov} \frac{d}{d\theta} \hat{\mathbf{y}}(t|\theta) \right]^{-1} \,. \tag{2.28}$$

That means that the covariance matrix of the parameter estimate is given by the inverse covariance matrix of the gradient of the predictor w.r.t. the parameters.


Here (prime denoting derivatives)

$$\kappa(\ell) = \frac{E\, [\ell'(e(t))]^2}{\left[ E\, \ell''(e(t)) \right]^2}. \tag{2.29}$$

Note that

$$
\kappa(\ell) = \sigma^2 = E\, e^2(t) \quad \text{if} \quad \ell(e) = e^2/2.
$$

If the model structure contains the true system, it can be shown that this covariance matrix is the smallest that can be achieved by any unbiased estimate, in case the norm ℓ is chosen as the negative logarithm of the pdf of *e*. That is, it fulfils the *Cramér–Rao inequality*, [2]. These results are valid for quite general model structures.

## • **Results for LTI models:**

Now, specialize to linear models (2.3) and assume that the true system is described by

$$\mathbf{y}(t) = G\_0(q)\boldsymbol{u}(t) + H\_0(q)\boldsymbol{e}(t),\tag{2.30}$$

which could be general transfer functions, possibly much more complicated than the model. Then

–

$$\theta^\* = \arg\min\_{\theta} \int\_{-\pi}^{\pi} |G(e^{i\omega}, \theta) - G\_0(e^{i\omega})|^2 \frac{\Phi\_\mathbf{u}(\omega)}{|H(e^{i\omega}, \theta)|^2} d\omega. \tag{2.31}$$

That is, the frequency function of the limiting model will approximate the true frequency function as well as possible in a frequency norm given by the input spectrum Φ*<sup>u</sup>* and the noise model.

– For a linear black-box model, the covariance of the estimated frequency function is

$$\text{Cov}\, G(e^{i\omega}, \hat{\theta}_N) \sim \frac{n}{N}\, \frac{\Phi_v(\omega)}{\Phi_u(\omega)} \ \text{ as } n, N \to \infty, \tag{2.32}$$

where *n* is the model order and $\Phi_v$ is the noise spectrum $\sigma^2 |H_0(e^{i\omega})|^2$. The variance of the estimated frequency function at a given frequency is thus, for a high-order model, proportional to the noise-to-signal ratio at that frequency. That is a natural and intuitive result.

## *2.4.3 Trade-Off Between Bias and Variance*

The quality of the model depends on the quality of the measured data and the flexibility of the chosen model structure (2.1). A more flexible model structure typically has smaller bias, since it is easier to come closer to the true system. At the same time, it will have a higher variance: with higher flexibility it is easier to be fooled by disturbances and this may lead to data overfitting. So the trade-off between bias and variance to reach a small total error is a choice of balanced flexibility of the model structure.

As the model gets more flexible, the fit to the estimation data in (2.22), given by *VN* (θˆ *<sup>N</sup>* ), will always improve. To account for the variance contribution, it is thus necessary to modify this fit to assess the total quality of the model. A much used technique for this is Akaike's criterion (AIC), [1],

$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} 2 L_N(\theta) + 2 \dim \theta, \tag{2.33}$$

where *L <sup>N</sup>* is the negative log likelihood function. The minimization also takes place over a family of model structures with different number of parameters (dim θ).

For Gaussian innovations *e* with unknown and estimated variance, the criterion AIC takes the form

$$\hat{\theta}_N = \arg\min_{\mathcal{M},\, \theta \in D_{\mathcal{M}}} \left[ \log \det \left[ \frac{1}{N} \sum_{t=1}^N \varepsilon(t, \theta)\, \varepsilon^T(t, \theta) \right] + 2\, \frac{m}{N} \right] \quad \text{AIC} \tag{2.34}$$

with *m* = dimθ and after normalization and omission of model-independent quantities.

There is also a small-sample version, described in [4] and known in the literature as corrected Akaike's criterion (AICc), defined by

$$\hat{\theta}\_N = \arg\min\_{\theta} \left[ \log \det \left[ \frac{1}{N} \sum\_{t=1}^N \varepsilon(t,\theta) \varepsilon^T(t,\theta) \right] + 2 \frac{m}{(N-m-1)} \right], \quad \text{AICc.} \tag{2.35}$$

Another variant places a larger penalty on the model flexibility:

$$\hat{\theta}\_N = \arg\min\_{\theta} \left[ \log \det \left[ \frac{1}{N} \sum\_{t=1}^N \varepsilon(t,\theta) \boldsymbol{\varepsilon}^T(t,\theta) \right] + \log(N) \frac{m}{N} \right], \text{ BIC, MDL.} \tag{2.36}$$

This is known as Bayesian information criterion (BIC) or Rissanen's Minimum Description Length (MDL) criterion, see, e.g., [10, 11] and [5, pp. 505–507].
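For a scalar output, the log det in (2.34) and (2.36) reduces to the log of the residual variance, so the criteria are cheap to evaluate over a family of ARX orders. The sketch below (assuming NumPy; the second-order data generator is a hypothetical example) scores orders $n_a = n_b = n$:

```python
import numpy as np

# AIC (2.34) and BIC (2.36) for scalar ARX models of orders n_a = n_b = n.
# Hypothetical 2nd-order data generator.
rng = np.random.default_rng(5)
Nsamp = 400
u = rng.standard_normal(Nsamp)
y = np.zeros(Nsamp)
for t in range(2, Nsamp):
    y[t] = 1.5 * y[t-1] - 0.7 * y[t-2] + u[t-1] + 0.5 * u[t-2] \
           + 0.1 * rng.standard_normal()

def arx_fit_score(n):
    # regressor phi(t) = [-y(t-1)...-y(t-n), u(t-1)...u(t-n)], cf. (2.10)
    Phi = np.array([np.r_[-y[t-n:t][::-1], u[t-n:t][::-1]]
                    for t in range(n, Nsamp)])
    Yv = y[n:]
    theta, *_ = np.linalg.lstsq(Phi, Yv, rcond=None)
    var = np.mean((Yv - Phi @ theta) ** 2)            # residual variance
    m, Neff = 2 * n, len(Yv)                          # m = dim theta
    aic = np.log(var) + 2 * m / Neff                  # scalar version of (2.34)
    bic = np.log(var) + np.log(Neff) * m / Neff       # (2.36)
    return aic, bic

orders = list(range(1, 7))
aics, bics = zip(*(arx_fit_score(n) for n in orders))
n_aic = orders[int(np.argmin(aics))]
n_bic = orders[int(np.argmin(bics))]
print(n_aic, n_bic)   # BIC typically recovers the true order n = 2
```

The larger penalty in BIC makes it less prone than AIC to picking an unnecessarily high order, consistent with the discussion above.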

Section 2.6 contains further aspects on the choice of model structure.

#### **2.5 X: Experiment Design**

Experiment design involves all questions that concern the collection of estimation data, such as selecting which signals to measure, which sampling rate to use, and also the design of the input including possible feedback configurations.

The theory of experiment design primarily relies upon analysis of how the asymptotic parameter covariance matrix (2.28) depends on the design variables: so the essence of experiment design can be symbolized as

$$\min\_{\boldsymbol{X}} \text{trace}\{\boldsymbol{C}[\boldsymbol{E}\boldsymbol{\psi}(t)\boldsymbol{\psi}^T(t)]^{-1}\},$$

where ψ is the gradient of the prediction w.r.t. the parameters and the matrix *C* is used to weight variables reflecting the intended use of the model.

For linear systems, the input design is often expressed as selecting the spectrum (frequency contents) of *u*.

*This leads to the following recipe: let the input's power be concentrated to frequency regions where a good model fit is essential, and where disturbances are dominating.*

The measurement setup, such as whether band-limited inputs are used to estimate continuous-time models and how the experiment equipment is instrumented with band-pass filters, e.g., see [8, Sects. 13.2–3], also belongs to the important experiment design questions.

## **2.6** *V* **: Model Validation**

Model validation is about obtaining a model that, at least for the time being, can be accepted. It amounts to examining and scrutinizing the model to check if it can be used for its purpose. These methods are of course problem dependent and contain several subjective elements. Therefore, no conclusive procedure for validation can be given. A few useful techniques will be listed here. Basically, it is a matter of trying to falsify a model under the conditions it will be used for, and also of gaining confidence in its ability to reproduce new data from the system.

## *2.6.1 Falsifying Models: Residual Analysis*

An estimated model is never a correct description of a true system. In that sense, a model cannot be "validated", i.e., proved to be correct. Instead it is instructive to try and *falsify* it, i.e., confront it with facts that may contradict its correctness. A good principle is to look for the *simplest unfalsified model*, see, e.g., [9].

*Residual analysis* is the leading technique for falsifying models: the residuals or one-step-ahead prediction errors $\hat{\varepsilon}(t) = \varepsilon(t, \hat{\theta}_N) = y(t) - \hat{y}(t|\hat{\theta}_N)$ should ideally not contain any traces of past inputs or past residuals. If they did, it would mean that the predictions are not ideal. So, it is natural to test the correlation functions

$$
\hat{r}\_{\hat{\varepsilon},u}(k) = \frac{1}{N} \sum\_{t=1}^{N} \hat{\varepsilon}(t+k)u(t) \tag{2.37}
$$

$$
\hat{r}\_{\hat{\varepsilon}}(k) = \frac{1}{N} \sum\_{t=1}^{N} \hat{\varepsilon}(t+k)\hat{\varepsilon}(t) \tag{2.38}
$$

and check that they are not larger than certain thresholds. Here *N* is the length of the data record and *k* typically ranges over a fraction of the interval [−*N*, *N*]. See, e.g., [5, Sect. 16.6] for more details.
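As a minimal numerical sketch of these tests (the helper below is our own, not part of the original text), the sample correlations (2.37)–(2.38) can be computed and compared with the asymptotic 99% threshold $2.58/\sqrt{N}$ that holds under the hypothesis of white residuals uncorrelated with the input:

```python
import numpy as np

def residual_tests(eps, u, max_lag=20, z=2.58):
    """Normalized versions of (2.37)-(2.38) on mean-removed signals,
    plus the asymptotic 99% threshold z/sqrt(N) for whiteness tests."""
    eps = np.asarray(eps, dtype=float)
    u = np.asarray(u, dtype=float)
    N = len(eps)
    eps_c = eps - eps.mean()
    u_c = u - u.mean()
    # r_hat_{eps,u}(k) = (1/N) sum_t eps(t+k) u(t), here for k = 0..max_lag
    r_eu = np.array([np.sum(eps_c[k:] * u_c[:N - k]) / N for k in range(max_lag + 1)])
    r_e = np.array([np.sum(eps_c[k:] * eps_c[:N - k]) / N for k in range(max_lag + 1)])
    rho_eu = r_eu / np.sqrt(r_e[0] * np.sum(u_c ** 2) / N)  # normalized cross-correlation
    rho_e = r_e / r_e[0]                                    # normalized autocorrelation
    return rho_eu, rho_e, z / np.sqrt(N)
```

Residuals are taken as falsifying evidence when the normalized correlations exceed the threshold for a substantial number of lags.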

## *2.6.2 Comparing Different Models*

When several models have been estimated, the question is how to choose the "best one". Models that employ more parameters naturally show a better fit to the data, and it is necessary to compensate for that. The model selection criteria AIC (2.34) and BIC (2.36) are examples of how such decisions can be taken. They can be extended to regular hypothesis tests where more complex models are accepted or rejected at various test levels, see, e.g., [5, Sect. 16.4].

Making comparisons in the frequency domain is a very useful complement for domain experts accustomed to thinking in terms of natural frequencies, natural damping, etc.

## *2.6.3 Cross-Validation*

Cross-validation (CV) is an important statistical concept that loosely means that the model performance is tested on a data set (*validation data*) other than the estimation data. There is an extensive literature on cross-validation, e.g., [13], and many ways to split the available data into estimation and validation parts have been suggested. The goal is to estimate the model's capability of predicting future data in correspondence with different choices of θ. Parameter selection is thus performed by optimizing the estimated prediction score. *Hold out validation* is the simplest form of CV: the available data are split in two parts, where one of them (*estimation set*) is used to estimate the model, and the other one (*validation set*) is used to assess the prediction capability. By ensuring independence of the model fit from the validation data, the estimate of the prediction performance is approximately unbiased. For models that do not require estimation of initial states, like FIR and ARX models, CV can be applied efficiently in more sophisticated ways by splitting the data into more portions, as described in [3].
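A toy sketch of hold out CV for polynomial order selection (the sinusoidal test function, noise level and order range are entirely our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)

# Split: every other sample for estimation, the rest for validation
xe, ye = x[::2], y[::2]
xv, yv = x[1::2], y[1::2]

scores = {}
for n in range(1, 9):                               # n = number of polynomial coefficients
    theta, *_ = np.linalg.lstsq(np.vander(xe, n, increasing=True), ye, rcond=None)
    resid = yv - np.vander(xv, n, increasing=True) @ theta
    scores[n] = np.sum(resid ** 2)                  # hold-out prediction score

best_n = min(scores, key=scores.get)                # order with the best score
```

The order minimizing the hold-out score is retained; note that the estimation and validation residuals are computed on disjoint samples.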

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Regularization of Linear Regression Models**

**Abstract** Linear regression models are widely used in statistics, machine learning and system identification. They make it possible to address many important problems, are easy to fit and enjoy simple analytical properties. The simplest method to fit linear regression models is least squares, whose systematic treatment is available in many textbooks, e.g., [35, Chap. 4], [12]. Linear regression models can also be fitted in different ways, and the class of methods that we will consider in this chapter is so-called regularized least squares. It is an extension of least squares which minimizes the sum of the square loss function and a regularization term. The latter can take various forms, leading to several variants which have been applied extensively in theory as well as in practical applications. In this chapter, we will focus on these methods and introduce their fundamentals. In the first part of the appendix to this chapter, we also report some basic results of linear algebra useful for the reading.

## **3.1 Linear Regression**

Regression theory is concerned with modelling relationships among variables. It is used for predicting one dependent variable based on the information provided by one or more independent variables. In linear regression, the relationship among variables is given by linear functions. To illustrate this, we start from the function estimation problem because it is intuitive and easy to understand.

The aim of function estimation is to reconstruct a function $g: \mathbb{R}^n \to \mathbb{R}$ with $n \in \mathbb{N}$ from a collection of *N* measured values of *g*(*x*) and *x* which we denote, respectively, by $y_i$ and $x_i$ for $i = 1, \ldots, N$. For generic values of *x*, the estimate $\hat{g}$ should give a good prediction $\hat{g}(x)$ of $g(x)$. The variables *x* and *g*(*x*) are often called the input and the output variable, or simply the input and the output, respectively. The collection of measured values of *x* and *g*(*x*), given by the couples $\{x_i, y_i\}$, is called the data set or also the training set. In practical applications, the measurement $y_i$ is often not precise and subject to some disturbance, i.e., for a given input $x_i$ there is often a discrepancy between $g(x_i)$ and its measured value $y_i$. To describe this phenomenon, it is natural to introduce a disturbance variable $e \in \mathbb{R}$ and assume that, for any given $x \in \mathbb{R}^n$, the measured value of $g(x)$ is

$$y = g(x) + e.\tag{3.1}$$

Hence, *y* is the measured output and $g(x)$ is the noise-free or true output. Accordingly, the training data $\{x_i, y_i\}_{i=1}^N$ are collected as follows:

$$\mathbf{y}\_{i} = \mathbf{g}(\mathbf{x}\_{i}) + e\_{i}, \quad i = 1, \ldots, N. \tag{3.2}$$

We are interested in linear regression models for estimation of *g*. For illustration, an example is now introduced.

**Example 3.1** (*Polynomial regression*) We consider $g: [0, 1] \to \mathbb{R}$ and assume that this function is smooth. Then, *g* can be well approximated by a polynomial of a certain order. In this case, a linear regression model for the function estimation problem takes the following form:

$$\mathbf{y}\_{i} = \theta\_{1} + \sum\_{k=2}^{n} \theta\_{k} \boldsymbol{x}\_{i}^{k-1} + \boldsymbol{e}\_{i}, \quad i = 1, \ldots, N,\tag{3.3}$$

where $\theta_k \in \mathbb{R}$ for $k = 1, \ldots, n$. Defining

$$\phi(x_i) = \begin{bmatrix} 1 & x_i & \dots & x_i^{n-1} \end{bmatrix}^T, \quad \theta = \begin{bmatrix} \theta_1 & \theta_2 & \dots & \theta_n \end{bmatrix}^T, \tag{3.4}$$

where, for a real-valued matrix *A*, the notation *A<sup>T</sup>* denotes its matrix transpose, we rewrite (3.3) as

$$\mathbf{y}\_{i} = \boldsymbol{\phi}(\mathbf{x}\_{i})^{T}\boldsymbol{\theta} + \boldsymbol{e}\_{i}, \quad i = 1, \ldots, N \tag{3.5}$$

obtaining a more compact expression. □
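In NumPy, the matrix whose rows are the regressors (3.4) can be built with `np.vander`; a small sketch (the helper name and sample inputs are our own):

```python
import numpy as np

def poly_features(x, n):
    """Stack the regressors phi(x_i)^T = [1, x_i, ..., x_i^{n-1}] row-wise."""
    return np.vander(np.asarray(x, dtype=float), n, increasing=True)

Phi = poly_features([0.0, 0.5, 1.0], 4)   # rows: [1, x_i, x_i^2, x_i^3]
```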

Although (3.5) is derived from Example 3.1, it is the general linear regression model studied in the theory of regression. For convenience, we remove the dependence of φ(*xi*) on *xi* and simply write φ(*xi*) as φ*<sup>i</sup>* , when the context is clear. In addition, all the vectors are column vectors. Then, model (3.5) becomes

$$\mathbf{y}\_{i} = \boldsymbol{\phi}\_{i}^{T}\boldsymbol{\theta} + \boldsymbol{e}\_{i}, \quad i = 1, \ldots, N, \quad \mathbf{y}\_{i} \in \mathbb{R}, \ \boldsymbol{\phi}\_{i} \in \mathbb{R}^{n}, \ \boldsymbol{\theta} \in \mathbb{R}^{n}, \ \boldsymbol{e}\_{i} \in \mathbb{R}. \tag{3.6}$$

In what follows, we will focus on (3.6) and introduce the linear regression problem, the methods of least squares and regularized least squares. We will call $y_i \in \mathbb{R}$ the measured output, $\phi_i \in \mathbb{R}^n$ the regressor, $\theta \in \mathbb{R}^n$ the model parameter, *n* the model order, and $e_i$ the measurement noise.

Before proceeding, it should be noted that the choice of the model order *n* is a critical problem in practical applications. The rule of thumb is to set *n* to a large enough value such that *g* can be represented by the proposed model structure. In system identification, this corresponds to introducing a model structure flexible enough to contain the true system. Consider, e.g., Example 3.1 again and assume that the function *g* is actually a polynomial of order 5. Clearly, if the dimension of θ does not satisfy *n* ≥ 6, then the term $x^5$ cannot be represented and some model bias will affect the estimation process. However, the order *n* should not be chosen larger than necessary, because this can increase the variance of the model estimate. This problem is actually the same as model complexity selection in classical system identification and is connected with the bias-variance trade-off illustrated in the first two chapters and also discussed in more detail shortly.

Also in light of the above discussion, we often assume that the model order *n* is either large enough for *g* to be adequately represented by the proposed model or even that a true model parameter that has generated the data exists, denoted by $\theta_0 \in \mathbb{R}^n$. Hence, we can formulate linear regression as the problem of obtaining an estimate $\hat{\theta}$ such that, given a new regressor $\phi \in \mathbb{R}^n$, the prediction $\phi^T\hat{\theta}$ is close to $\phi^T\theta_0$.

## **3.2 The Least Squares Method**

There are many methods to estimate θ in the linear regression model (3.6). In this section, we consider the least squares (LS) method.

## *3.2.1 Fundamentals of the Least Squares Method*

Given the data *yi*, φ*<sup>i</sup>* for *i* = 1,..., *N*, one way to estimate θ is to minimize the least squares (LS) criterion:

$$\hat{\boldsymbol{\theta}}^{\text{LS}} = \underset{\boldsymbol{\theta}}{\text{arg min }} \, l(\boldsymbol{\theta}), \qquad l(\boldsymbol{\theta}) = \sum\_{i=1}^{N} (\mathbf{y}\_i - \boldsymbol{\phi}\_i^T \boldsymbol{\theta})^2,\tag{3.7}$$

where $l(\theta)$ is the LS criterion and $\hat{\theta}^{\text{LS}}$ is the LS estimate of θ. Then, the predicted output $\hat{y}$ for the value of $\phi^T\theta_0$ with $\phi \in \mathbb{R}^n$ is obtained as

$$
\hat{\mathbf{y}} = \boldsymbol{\phi}^T \hat{\boldsymbol{\theta}}^{\text{LS}}.\tag{3.8}
$$

#### **3.2.1.1 Normal Equations and LS Estimate**

The LS estimate θˆLS given by (3.7) has a closed-form expression. To see this, note that the first- and second-order derivatives of *l*(θ ) with respect to θ are


$$\frac{\partial l(\theta)}{\partial \theta} = 2 \sum_{i=1}^{N} \phi_i \phi_i^T \theta - 2 \sum_{i=1}^{N} \phi_i y_i, \quad \frac{\partial^2 l(\theta)}{\partial \theta \partial \theta^T} = 2 \sum_{i=1}^{N} \phi_i \phi_i^T \succeq 0,\tag{3.9}$$

where $A \succeq 0$ means that *A* is a positive semidefinite matrix. Then all $\hat{\theta}^{\text{LS}}$ that satisfy

$$\left[\sum\_{i=1}^{N} \phi\_i \phi\_i^T\right] \hat{\theta}^{\text{LS}} = \sum\_{i=1}^{N} \phi\_i \mathbf{y}\_i \tag{3.10}$$

are global minima of $l(\theta)$. The set of Eqs. (3.10) is known as the normal equations. For the time being, we assume that $\sum_{i=1}^{N} \phi_i \phi_i^T$ is full rank.<sup>1</sup> Then

$$\hat{\theta}^{\text{LS}} = \left[\sum_{i=1}^{N} \phi_i \phi_i^T\right]^{-1} \sum_{i=1}^{N} \phi_i y_i. \tag{3.11}$$

#### **3.2.1.2 Matrix Formulation**

It is often convenient to rewrite the LS method in matrix form. To this goal, let

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad \Phi = \begin{bmatrix} \phi_1^T \\ \phi_2^T \\ \vdots \\ \phi_N^T \end{bmatrix}, \quad E = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}. \tag{3.12}$$

We can then rewrite (3.6) with the θ<sup>0</sup> that generated the data, the LS criterion (3.7), the normal Eqs. (3.10) and the LS estimate (3.11) in matrix form, respectively:

$$Y = \Phi \theta\_0 + E \tag{3.13}$$

$$\hat{\theta}^{\text{LS}} = \underset{\theta}{\text{arg min }} l(\theta), \qquad l(\theta) = \|Y - \Phi \theta\|\_2^2 \tag{3.14}$$

$$
\boldsymbol{\Phi}^{\mathsf{T}} \boldsymbol{\Phi} \boldsymbol{\hat{\theta}}^{\mathsf{LS}} = \boldsymbol{\Phi}^{\mathsf{T}} \boldsymbol{Y} \tag{3.15}
$$

$$
\hat{\theta}^{\text{LS}} = (\Phi^T \Phi)^{-1} \Phi^T Y,\tag{3.16}
$$

where $\|\cdot\|_2$ is the Euclidean norm, i.e., the 2-norm, and Φ is called the regression matrix.
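As a numerical sketch (the synthetic data below are our own), solving the normal equations (3.15) and using a QR/SVD-based solver give the same estimate on a well-conditioned problem; in practice `np.linalg.lstsq` is preferred over explicitly forming $\Phi^T\Phi$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 3
Phi = rng.standard_normal((N, n))                   # regression matrix
theta0 = np.array([1.0, -2.0, 0.5])                 # true parameter
Y = Phi @ theta0 + 0.01 * rng.standard_normal(N)    # data generated as in (3.13)

theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)  # normal equations (3.15)
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)  # numerically preferred route
```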

<sup>1</sup> Recall that the column rank (resp., the row rank) of a matrix is the dimension of the space spanned by the columns (resp., the rows) of the matrix. It is a fundamental result in linear algebra that the column rank and the row rank of a matrix are always equal and this number is called the rank of the matrix. A matrix is said to be full rank if its rank is equal to the lesser of the number of rows and columns and a matrix is said to be rank deficient otherwise.

## *3.2.2 Mean Squared Error and Model Order Selection*

#### **3.2.2.1 Bias, Variance, and Mean Squared Error of the LS Estimate**

We study the linear regression problem in a probabilistic framework, assuming that data are generated according to (3.13) and that

$$\text{the measurement noises } e_i, \; i = 1, \ldots, N, \text{ are i.i.d. with mean } 0 \text{ and variance } \sigma^2. \tag{3.17}$$

Due to this assumption, the LS estimator $\hat{\theta}^{\text{LS}}$, as well as any estimator of θ depending on the data, becomes a random variable. Then, it is interesting to study the statistical properties of $\hat{\theta}^{\text{LS}}$, such as the bias, variance and mean squared error (MSE).

All the expectations reported below are computed with respect to the noises *ei* with the regressors φ*<sup>i</sup>* assumed to be deterministic. Simple calculations lead to

$$\mathcal{E}(\hat{\theta}^{\text{LS}}) = \theta_0 \tag{3.18a}$$

$$\hat{\theta}^{\text{LS}}_{\text{bias}} = \mathcal{E}(\hat{\theta}^{\text{LS}}) - \theta_0 = 0 \tag{3.18b}$$

$$\text{Cov}(\hat{\theta}^{\text{LS}}, \hat{\theta}^{\text{LS}}) = \mathcal{E}[(\hat{\theta}^{\text{LS}} - \mathcal{E}(\hat{\theta}^{\text{LS}}))(\hat{\theta}^{\text{LS}} - \mathcal{E}(\hat{\theta}^{\text{LS}}))^T] = \sigma^2(\Phi^T\Phi)^{-1} \tag{3.18c}$$

$$\begin{split} \text{MSE}(\hat{\theta}^{\text{LS}}, \theta_0) &= \mathcal{E}[(\hat{\theta}^{\text{LS}} - \theta_0)(\hat{\theta}^{\text{LS}} - \theta_0)^T] \\ &= \text{Cov}(\hat{\theta}^{\text{LS}}, \hat{\theta}^{\text{LS}}) + \hat{\theta}^{\text{LS}}_{\text{bias}} (\hat{\theta}^{\text{LS}}_{\text{bias}})^T \\ &= \sigma^2 (\Phi^T \Phi)^{-1}, \end{split} \tag{3.18d}$$

where $\text{Cov}(\hat{\theta}^{\text{LS}}, \hat{\theta}^{\text{LS}})$ is the covariance matrix of $\hat{\theta}^{\text{LS}}$ and $\text{MSE}(\hat{\theta}^{\text{LS}}, \theta_0)$ is the MSE matrix of $\hat{\theta}^{\text{LS}}$, a function of the true model parameter $\theta_0$.
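The identities (3.18a) and (3.18c) can be checked by Monte Carlo simulation over repeated noise realizations with fixed regressors; a sketch with arbitrary synthetic values of our choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma = 200, 2, 0.5
Phi = rng.standard_normal((N, n))          # regressors, drawn once and kept fixed
theta0 = np.array([1.0, -1.0])
P = np.linalg.inv(Phi.T @ Phi)

ests = []
for _ in range(2000):                      # fresh noise realization each run
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    ests.append(P @ Phi.T @ Y)             # LS estimate (3.16)
ests = np.array(ests)

emp_mean = ests.mean(axis=0)               # approaches theta0, cf. (3.18a)
emp_cov = np.cov(ests.T)                   # approaches sigma^2 (Phi^T Phi)^{-1}, cf. (3.18c)
```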

#### **3.2.2.2 Model Order Selection**

The issue of model order selection is essentially the same as that of model complexity selection in the classical system identification scenario. Therefore, the techniques introduced in Sect. 2.4.3 can be used to choose the model order *n*, e.g., Akaike's information criterion (AIC) [1], the Bayesian information criterion (BIC) or minimum description length (MDL) approach [25, 39].

The quality of the LS estimate $\hat{\theta}^{\text{LS}}$ depends on the adopted model order *n*. In practical applications, model complexity is in general unknown and needs to be determined from data. As the model order *n* gets larger, the fit to the data $\|Y - \Phi\hat{\theta}^{\text{LS}}\|_2^2$ in (3.14) will become smaller, but the variances along the diagonal of the MSE matrix (3.18d) of $\hat{\theta}^{\text{LS}}$ will become larger at the same time. When assessing the quality of $\hat{\theta}^{\text{LS}}$, one way to account for the increasing variance is to introduce criteria that suitably modify the plain data fit. AIC and BIC are techniques following this idea and can be used for model order selection. More specifically, besides (3.17), further assuming that the errors are independent and Gaussian, i.e.,

$$e\_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \ldots, N \tag{3.19}$$

with known noise variance $\sigma^2$, we obtain

$$\text{AIC:} \qquad \hat{\theta}^{\text{LS}} = \operatorname\*{arg\,min}\_{\theta \in \mathbb{R}^n} \frac{1}{N} \|Y - \Phi \theta\|\_2^2 + 2\sigma^2 \frac{n}{N}, \tag{3.20}$$

$$\text{BIC or MDL:} \qquad \hat{\theta}^{\text{LS}} = \underset{\theta \in \mathbb{R}^n}{\text{arg min }} \frac{1}{N} \|Y - \Phi \theta\|^2\_2 + \log(N)\sigma^2 \frac{n}{N},\tag{3.21}$$

where the minimization also takes place over a family of model structures with different dimension *n* of θ.

Another way is to estimate the prediction capability of the model on some unseen data which are not used for model estimation. As briefly seen in Sect. 2.6.3, cross-validation (CV) exploits this idea and is among the most widely used techniques for model selection. Recall that hold out CV is the simplest form of CV, with data divided into two parts. One part is used to estimate the model with different model orders and the other part is used to assess the prediction capability of each model through the prediction score $\|Y_v - \Phi_v\hat{\theta}^{\text{LS}}\|_2^2$. Here, $Y_v, \Phi_v$ are the validation data which are different from those used to derive $\hat{\theta}^{\text{LS}}$. The model order giving the best prediction score will be chosen.

The noise variance $\sigma^2$ of the measurement noises $e_i$ plays an important role in statistical modelling, e.g., in the assessment of the variance of $\hat{\theta}^{\text{LS}}$ and in the model order selection using, e.g., AIC (3.20) or BIC (3.21). In practical applications, the noise variance $\sigma^2$ is in general unknown and needs to be estimated from the data *Y* and Φ. It can be estimated in different ways based on the maximum likelihood (ML) method or the statistical properties of $\hat{\theta}^{\text{LS}}$.

Under (3.17) and the Gaussian assumption (3.19), the ML estimate of $\sigma^2$, as given in [25, p. 506], is

$$
\hat{\sigma}^{2, \text{ML}} = \frac{1}{N} \| Y - \Phi \hat{\theta}^{\text{LS}} \|\_{2}^{2}. \tag{3.22}
$$

Using only assumption (3.17), an unbiased estimator of $\sigma^2$, as given in [25, p. 554], turns out to be

$$
\hat{\sigma}^2 = \frac{1}{N - n} \|Y - \Phi \hat{\theta}^{\text{LS}}\|\_2^2. \tag{3.23}
$$
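The two estimators (3.22) and (3.23) differ only in the normalization of the residual sum of squares; a small sketch (the helper name is our own):

```python
import numpy as np

def noise_var_estimates(Y, Phi):
    """ML estimate (3.22) and unbiased estimate (3.23) of sigma^2."""
    N, n = Phi.shape
    theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    rss = np.sum((Y - Phi @ theta_ls) ** 2)    # residual sum of squares
    return rss / N, rss / (N - n)
```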

AIC and BIC were reported, respectively, in (3.20) and (3.21) assuming known noise variance. When $\sigma^2$ is unknown, the use of the ML estimate (3.22) leads to the widely used AIC and BIC for Gaussian innovations, e.g., [25, pp. 506–507]:

**Fig. 3.1** Polynomial regression: the function *g*(*x*) (blue curve) and the data $\{x_i, y_i\}_{i=1}^{40}$ (red circles)

$$\text{AIC:} \qquad \hat{\theta}^{\text{LS}} = \underset{\theta \in \mathbb{R}^n}{\text{arg min }} \log\left(\frac{1}{N} \|Y - \Phi \theta\|_2^2\right) + 2\frac{n}{N}, \tag{3.24}$$

$$\text{BIC or MDL:} \qquad \hat{\theta}^{LS} = \underset{\theta \in \mathbb{R}^n}{\text{arg min }} \log\left(\frac{1}{N} \|Y - \Phi \theta\|\_2^2\right) + \log(N)\frac{n}{N}. \tag{3.25}$$
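A sketch of (3.24)–(3.25) applied to polynomial order selection (the data-generating cubic and noise level below are our own illustrative choices):

```python
import numpy as np

def aic_bic(Y, Phi_builder, orders):
    """AIC (3.24) and BIC (3.25) with the ML noise-variance estimate plugged in."""
    N = len(Y)
    aic, bic = {}, {}
    for n in orders:
        Phi = Phi_builder(n)
        theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        s2 = np.sum((Y - Phi @ theta) ** 2) / N   # ML estimate (3.22)
        aic[n] = np.log(s2) + 2 * n / N
        bic[n] = np.log(s2) + np.log(N) * n / N
    return aic, bic

# Hypothetical data from a cubic polynomial (n = 4 parameters) plus small noise
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 4.0 * x ** 3 + 0.01 * rng.standard_normal(200)
aic, bic = aic_bic(y, lambda n: np.vander(x, n, increasing=True), range(1, 11))
```

The order minimizing each criterion is selected; BIC penalizes complexity more heavily than AIC for $N \ge 8$.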

**Example 3.2** (*Polynomial regression using LS and discrete model order selection*) We apply the LS method and the model order selection techniques to polynomial regression as sketched in Example 3.1. Let the function *g* be

$$g(\mathbf{x}) = \sin^2(\mathbf{x})(1 - \mathbf{x}^2), \quad \mathbf{x} \in [0, 1]. \tag{3.26}$$

Then, we generate the data as follows:

$$y\_i = \sin^2(\mathbf{x}\_i)(1 - \mathbf{x}\_i^2) + e\_i, \quad i = 1, \ldots, 40,\tag{3.27}$$

where $x_1 = 0$, $x_{40} = 1$, the points $x_2, \ldots, x_{39}$ are evenly spaced between $x_1$ and $x_{40}$, and the noises $e_i$ are i.i.d. Gaussian distributed with zero mean and standard deviation 0.034. The function *g* and the generated data are shown in Fig. 3.1.

The function *g* is smooth and can be well approximated by polynomials. However, it is unclear which order should be chosen. Hence, we test the values *n* = 1, …, 15 and, for each order *n*, we form the regressors (3.4), the linear regression model (3.13) and derive the LS estimate $\hat{\theta}^{\text{LS}}$. As shown in Fig. 3.2, the data fit $\|Y - \Phi\hat{\theta}^{\text{LS}}\|_2^2$ keeps decreasing as the order *n* increases.

**Fig. 3.2** Polynomial regression: profile of the LS data fit as a function of the discrete model order *n*

For model order selection, we use AIC (3.24), BIC (3.25) and hold out CV with $x_i, y_i$, $i = 1, 3, \ldots, 39$ for estimation and $x_i, y_i$, $i = 2, 4, \ldots, 40$ for validation. Figure 3.3 plots the values of AIC (3.24), BIC (3.25) and the prediction score of hold out CV. The orders selected by AIC and BIC coincide and are equal to 3, while that selected by hold out CV is 7.

To evaluate the performance of models of different complexity, we compute the fit measure

$$\mathcal{J} = 100 \left( 1 - \left[ \frac{\sum\_{k=1}^{40} |\mathbf{g}(\mathbf{x}\_k) - \hat{\mathbf{g}}(\mathbf{x}\_k)|^2}{\sum\_{k=1}^{40} |\mathbf{g}(\mathbf{x}\_k) - \bar{\mathbf{g}}^0|^2} \right]^{1/2} \right), \quad \bar{\mathbf{g}}^0 = \frac{1}{40} \sum\_{k=1}^{40} \mathbf{g}(\mathbf{x}\_k). \tag{3.28}$$

Note that $\mathcal{J} = 100$ means a perfect agreement between *g*(*x*) and the corresponding estimate. The model fits for *n* = 1, …, 15 are shown in Fig. 3.4: the order *n* = 3 gives the best prediction. Figure 3.5 plots the estimates of *g*(*x*) for *n* = 3, 7, 15 over the $x_i$, $i = 1, \ldots, 40$. Overfitting occurs when *n* = 15, indicating that the corresponding model is too flexible and fooled by the noise. □
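The fit measure (3.28) can be coded directly; a small sketch (the function name is our own):

```python
import numpy as np

def fit_measure(g, g_hat):
    """Fit (3.28): 100 means perfect agreement; the value can be negative."""
    g = np.asarray(g, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    num = np.linalg.norm(g - g_hat)
    den = np.linalg.norm(g - g.mean())
    return 100.0 * (1.0 - num / den)
```

By construction, predicting the sample mean of *g* at every point yields a fit of exactly 0.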

**Fig. 3.3** Polynomial regression: model order selection with *<sup>n</sup>* = 1,..., <sup>15</sup> using LS. The blue curve, the red curve and the yellow curve show the values of AIC (3.24), BIC (3.25) and the prediction score of hold out CV, respectively

**Fig. 3.4** Polynomial regression: profile of the model fit (3.28) as a function of the order *n* using LS. The most accurate estimate is obtained with model order equal to 3 which corresponds to a second-order polynomial

**Fig. 3.5** Polynomial regression: true function (blue line) and LS estimates obtained using three different model orders given by *<sup>n</sup>* = 3, 7 and 15

## **3.3 Ill-Conditioning**

## *3.3.1 Ill-Conditioned Least Squares Problems*

When $\Phi \in \mathbb{R}^{N \times n}$ with $N \ge n$ is rank deficient, i.e., $\text{rank}(\Phi) < n$, or "close" to rank deficient, the corresponding LS problem is said to be ill-conditioned. Examples were already encountered in Sect. 1.1.2 to discuss some limitations of the James–Stein estimators and in Sect. 1.2 in the context of FIR models. There are different ways to handle ill-conditioned LS problems. Below, we show how to calculate $\hat{\theta}^{\text{LS}}$ more accurately by using the singular value decomposition (SVD).

#### **3.3.1.1 Singular Value Decomposition**

SVD is a fundamental matrix decomposition. Any matrix $\Phi \in \mathbb{R}^{N \times n}$, with $N \ge n$ to simplify the exposition, can be decomposed as follows:

$$
\Phi = U\Lambda V^T,\tag{3.29}
$$

where Λ is a rectangular diagonal matrix with nonnegative diagonal entries $\sigma_i$, $i = 1, \ldots, n$, and $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, i.e., such that $U^TU = UU^T = I_N$ and $V^TV = VV^T = I_n$. The factorization (3.29) is called the singular value decomposition of Φ and the $\sigma_i$ are called the singular values of Φ. Without loss of generality, they can be assumed to be ordered according to their magnitude:

$$
\sigma\_1 \ge \sigma\_2 \ge \dots \ge \sigma\_n \ge 0.
$$

Since $\Phi^T \Phi = V \Lambda^T \Lambda V^T = V D^2 V^T$, where *D* is a square diagonal matrix whose diagonal entries are the $\sigma_i$, it follows that

$$
\sigma\_i = \sqrt{\lambda\_i(\Phi^T \Phi)}, \quad i = 1, \ldots, n,\tag{3.30}
$$

where λ*i*(*A*) denotes the *i*th eigenvalue of the matrix *A*.

#### **3.3.1.2 Condition Number**

The condition number of a matrix is a measure of how "close" the matrix is to being rank deficient. When Φ is an invertible square matrix, it is denoted by cond(Φ) below and defined as

$$\text{cond}(\Phi) = \|\Phi^{-1}\| \|\Phi\|,\tag{3.31}$$

where $\|\cdot\|$ is a matrix norm, with the convention that $\text{cond}(\Phi) = \infty$ for singular Φ. For a generic $\Phi \in \mathbb{R}^{N \times n}$, with SVD in the form (3.29), its condition number with respect to the 2-norm $\|\cdot\|_2$ is defined as

$$\text{cond}(\Phi) = \frac{\sigma\_{\text{max}}}{\sigma\_{\text{min}}},\tag{3.32}$$

where $\sigma_{\max} = \sigma_1$ and $\sigma_{\min} = \sigma_n$ are the largest and smallest singular values of Φ, respectively. If we use the 2-norm $\|\cdot\|_2$ in (3.31), then (3.31) coincides with (3.32). Hereafter, the condition number of a matrix will be defined by (3.32).
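Relations (3.30) and (3.32) are easy to verify numerically; a sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.standard_normal((10, 4))

U, s, Vt = np.linalg.svd(Phi)                          # s: sigma_1 >= ... >= sigma_n
eigs = np.sort(np.linalg.eigvalsh(Phi.T @ Phi))[::-1]  # eigenvalues, descending

# (3.30): sigma_i = sqrt(lambda_i(Phi^T Phi));  (3.32): cond = sigma_max / sigma_min
cond = s[0] / s[-1]
```

NumPy's `np.linalg.cond` uses the same 2-norm definition by default.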

#### **3.3.1.3 Ill-Conditioned Matrix and LS Problem**

The condition number of a matrix is important since it can be used to measure the sensitivity of the LS estimate to perturbations in the data. To be specific, let $\Phi \in \mathbb{R}^{N \times n}$ be full rank and let $\delta Y$ denote a small componentwise perturbation in *Y*. The solution of the perturbed LS criterion becomes

$$\hat{\theta}_2^{\text{LS}} = \underset{\theta}{\text{arg min}} \|(Y + \delta Y) - \Phi \theta\|_2^2. \tag{3.33}$$

Then, it can be shown, e.g., [17, Chap. 5], [10, Chap. 3], that


$$\frac{\|\hat{\theta}_2^{\text{LS}} - \hat{\theta}^{\text{LS}}\|_2}{\|\hat{\theta}^{\text{LS}}\|_2} \le \text{cond}(\Phi)\varepsilon + O\left(\varepsilon^2\right), \quad \varepsilon = \frac{\|\delta Y\|_2}{\|Y\|_2}, \tag{3.34}$$

where $\hat{\theta}^{\text{LS}}$ is the unperturbed LS estimate (3.16).

So, the relative error bound depends on cond(Φ): the larger cond(Φ), the larger the relative error. One can thus say that the matrix Φ (and the LS problem) with a small condition number is well conditioned, while the matrix Φ (and the LS problem) with a large condition number is ill-conditioned. The condition number enters also more complex bounds on the relative error due to perturbations on the matrix Φ [10, 17].

**Example 3.3** (*Effect of ill-conditioning on LS*) Consider the linear regression model (3.13). Let

$$\Phi = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 + 10^{-8} & 1 - 10^{-8} \end{bmatrix}, \quad Y = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{3.35}$$

The two singular values of Φ are $\sigma_{\max} = 1$ and $\sigma_{\min} = 5 \times 10^{-9}$, implying that $\text{cond}(\Phi) = 2 \times 10^{8}$. Thus, Φ and the LS problem (3.14) are ill-conditioned.

Using the normal Eq. (3.15), we obtain the LS estimate $\hat{\theta}_1^{\text{LS}}$ in closed form:

$$\hat{\theta}_1^{\text{LS}} = (\Phi^T \Phi)^{-1} \Phi^T Y = \Phi^{-1} Y = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \tag{3.36}$$

Now, suppose that there is a small perturbation δ*Y* in *Y*

$$
\delta Y = \begin{bmatrix} 0.01\\ 0 \end{bmatrix}.\tag{3.37}
$$

Solving the normal Eq. (3.15) with *Y* replaced by *Y* + δ*Y* now gives

$$
\hat{\theta}_2^{\text{LS}} = \begin{bmatrix} 1.01 - 10^6 \\ 1.01 + 10^6 \end{bmatrix}. \tag{3.38}
$$

So, when the LS problem (3.14) is ill-conditioned, a small perturbation in *Y* can cause a significant change in the LS estimate derived by solving the normal Eq. (3.15) directly. □
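Example 3.3 can be reproduced numerically; in double precision the computed estimates agree with the exact values above to several digits:

```python
import numpy as np

Phi = 0.5 * np.array([[1.0, 1.0],
                      [1.0 + 1e-8, 1.0 - 1e-8]])
Y = np.array([1.0, 1.0])
dY = np.array([0.01, 0.0])

theta1 = np.linalg.solve(Phi, Y)        # unperturbed estimate, close to [1, 1]
theta2 = np.linalg.solve(Phi, Y + dY)   # perturbed estimate, close to (3.38)
```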

**Example 3.4** (*Polynomial regression: ill-conditioned LS problem*) We revisit the polynomial regression example, see (3.26) and (3.27), stressing the dependence of the condition number on the polynomial complexity. In particular, Fig. 3.6 shows that the ill-conditioning of the regression matrix Φ constructed according to (3.4) and (3.12) worsens as the dimension *n* increases. This further points out the importance of a careful selection of the discrete model order to control the estimator's variance when using LS. □

**Fig. 3.6** Polynomial regression: profile of the base 10 logarithm of the condition number of Φ as a function of the order *n*

#### **3.3.1.4 LS Estimate Exploiting the SVD of** *Φ*

In order to obtain more accurate LS estimates for ill-conditioned problems, one can use the SVD of Φ. Given $\Phi \in \mathbb{R}^{N \times n}$ with $N \ge n$, we consider two cases: the full-rank case, $\text{rank}(\Phi) = n$, and the rank-deficient case, $\text{rank}(\Phi) = m < n$.

For the rank-deficient case, the LS problem does not have a unique solution. To get a special solution, we have to impose extra conditions on the solutions of the LS problem.

Let the singular value decomposition of Φ be

$$\Phi = U\Lambda V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1 & V_2 \end{bmatrix}^T, \tag{3.39}$$

where $\Lambda_1 \in \mathbb{R}^{m \times m}$ is diagonal and positive definite while $U_1 \in \mathbb{R}^{N \times m}$ and $V_1 \in \mathbb{R}^{n \times m}$.

We now perform a change of coordinates in both the output and parameter space

$$\tilde{Y} = U^T Y = \begin{bmatrix} U_1^T Y \\ U_2^T Y \end{bmatrix} = \begin{bmatrix} \tilde{Y}_1 \\ \tilde{Y}_2 \end{bmatrix}, \qquad \tilde{\theta} = V^T \theta = \begin{bmatrix} V_1^T \theta \\ V_2^T \theta \end{bmatrix} = \begin{bmatrix} \tilde{\theta}_1 \\ \tilde{\theta}_2 \end{bmatrix}.$$

Note that both *Y*˜ <sup>1</sup> and θ˜ <sup>1</sup> are *m*-dimensional vectors. In the new coordinates, the residual vector is

$$U^T \left( Y - \Phi \theta \right) = \tilde{Y} - \Lambda \tilde{\theta} = \begin{bmatrix} \tilde{Y}_1 - \Lambda_1 \tilde{\theta}_1 \\ \tilde{Y}_2 \end{bmatrix}.$$

The LS criterion can be rewritten as

$$\|Y - \Phi\theta\|^2 = (Y - \Phi\theta)^T U U^T (Y - \Phi\theta) = \|\tilde{Y} - \Lambda\tilde{\theta}\|^2 = \|\tilde{Y}_1 - \Lambda_1\tilde{\theta}_1\|^2 + \|\tilde{Y}_2\|^2$$

and is minimized by

$$
\tilde{\theta}^{\text{LS}} = \begin{bmatrix} \tilde{\theta}_1^{\text{LS}} \\ \tilde{\theta}_2^{\text{LS}} \end{bmatrix} = \begin{bmatrix} \Lambda_1^{-1} \tilde{Y}_1 \\ \tilde{\theta}_2 \end{bmatrix}, \tag{3.40}
$$

where $\tilde{\theta}_2 \in \mathbb{R}^{n-m}$ is an arbitrary vector. To get the minimum norm solution, one can set $\tilde{\theta}_2 = 0$ which, turning back to the original coordinates, yields

$$
\hat{\theta}^{\text{LS}} = V \tilde{\theta}^{\text{LS}} = V\_1 \Lambda\_1^{-1} U\_1^T Y. \tag{3.41}
$$

Interestingly, for the rank-deficient case, the special solution (3.41) relates to the Moore–Penrose pseudoinverse of Φ, defined as

$$\Phi^+ = V\Lambda^+ U^T = \begin{bmatrix} V\_1 & V\_2 \end{bmatrix} \begin{bmatrix} \Lambda\_1^{-1} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U\_1 & U\_2 \end{bmatrix}^T = V\_1\Lambda\_1^{-1}U\_1^T.$$

So, given the diagonal matrix Λ, its pseudoinverse $\Lambda^+$ is obtained by replacing all the nonzero diagonal entries with their reciprocals and transposing the resulting matrix. When rank(Φ) = *n*, the pseudoinverse returns the usual (unique) LS solution

$$
\Phi^+ = \left(\Phi^T \Phi\right)^{-1} \Phi^T.
$$

It follows that the minimum norm solution among the general solutions of the LS problem (3.14) can always be written as

$$
\hat{\theta}^{\text{LS}} = \Phi^+ Y.
$$

For the rank-deficient case, due to roundoff errors, Φ may have some very small computed singular values other than the *m* singular values contained in $\Lambda\_1$ in (3.39). The situation is similar to the case where Φ is full rank but with a very large condition number. Note also that the rank of Φ needs to be known beforehand to compute the SVD of Φ. However, numerical determination of the rank of a matrix is nontrivial (and outside the scope of this book). Here, we just mention a simple way to deal with these issues by using the so-called truncated SVD.

Consider the SVD (3.39) and, without loss of generality, assume

$$\Lambda = \text{diag}(\sigma\_1, \sigma\_2, \dots, \sigma\_n) \text{ with } \sigma\_1 \ge \sigma\_2 \ge \dots \ge \sigma\_n \ge 0.$$

Now set $\hat{\sigma}\_i = \sigma\_i$ if $\sigma\_i > tol$ and $\hat{\sigma}\_i = 0$ otherwise. Then

$$
\hat{\Phi} = U\hat{\Lambda}V^T,\tag{3.42}
$$

where $\hat{\Lambda} \in \mathbb{R}^{N\times n}$ is diagonal with entries $\hat{\sigma}\_1, \hat{\sigma}\_2, \ldots, \hat{\sigma}\_n$, is called the truncated SVD of Φ. So, the truncated SVD (3.42) can be used to handle the case where Φ has full rank but large condition number: for a given *tol*, it suffices to replace Φ with $\hat{\Phi}$ and then to compute the LS estimate of θ by means of $\hat{\Phi}^+ Y$.
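To make the recipe concrete, the following Python sketch (using NumPy; the regression matrix, the tolerance and the data are illustrative assumptions, not those of the book's examples) builds the pseudoinverse of the truncated SVD and applies it to a nearly rank-deficient LS problem:

```python
import numpy as np

def truncated_svd_pinv(Phi, tol):
    # Pseudoinverse of the truncated SVD (3.42):
    # singular values not exceeding tol are replaced by zero.
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    s_inv = np.zeros_like(s)
    mask = s > tol
    s_inv[mask] = 1.0 / s[mask]
    return Vt.T @ (s_inv[:, None] * U.T)

# Illustrative regression matrix with two nearly collinear columns
Phi = np.array([[1.0, 1.0],
                [1.0, 1.0 + 1e-10],
                [1.0, 1.0 - 1e-10]])
Y = Phi @ np.array([1.0, 1.0])

# The tiny second singular value is discarded; the minimum norm
# solution on the retained subspace is returned.
theta_trunc = truncated_svd_pinv(Phi, tol=1e-7) @ Y
```

Without truncation (tol = 0) the reciprocal of the second singular value, of order $10^{10}$, would amplify any perturbation of *Y*.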

**Example 3.5** (*Truncated SVD*) We revisit Example 3.3 by making use of the truncated SVD of Φ. We take the user-supplied measure of uncertainty *tol* to be 1e-7. Then the LS estimate $\hat{\theta}\_3^{\text{LS}}$ computed by (3.41) with *Y* replaced by *Y* + δ*Y* becomes

$$
\hat{\theta}\_3^{LS} = \hat{\Phi}^+(Y + \delta Y) = \begin{bmatrix} 1.0050 \\ 1.0049 \end{bmatrix} . \tag{3.43}
$$

One can thus see that the estimate is now very close to $[1\ 1]^T$, which was the one obtained in the absence of the perturbation δ*Y*.

## *3.3.2 Ill-Conditioning in System Identification*

In Sect. 1.2 we have illustrated an ill-conditioned system identification problem. Below, we will see that the difficulty was due to the fact that low-pass filtered inputs may induce regression matrices with large cond(Φ).

Consider the FIR model of order *n*:

$$y(t) = \sum\_{k=1}^{n} g\_k u(t - k) + e(t), \quad t = 1, \ldots, N,\tag{3.44}$$

which can be written in the form (3.13) as follows:

$$Y = \Phi \theta\_0 + E,$$

$$Y = \begin{bmatrix} y(1) \\ y(2) \\ \vdots \\ y(N) \end{bmatrix}, \quad \Phi = \begin{bmatrix} u(0) & u(-1) & \cdots & u(1-n) \\ u(1) & u(0) & \cdots & u(2-n) \\ \vdots & \vdots & \ddots & \vdots \\ u(N-1) & u(N-2) & \cdots & u(N-n) \end{bmatrix},$$

$$\theta\_0 = \begin{bmatrix} g\_1 \\ g\_2 \\ \vdots \\ g\_n \end{bmatrix}, \quad E = \begin{bmatrix} e\_1 \\ e\_2 \\ \vdots \\ e\_N \end{bmatrix}. \tag{3.45}$$

Then we have

$$\boldsymbol{\Phi}^{T}\boldsymbol{\Phi} = \begin{bmatrix} \sum\_{t=0}^{N-1} u(t)^{2} & \sum\_{t=0}^{N-1} u(t)u(t-1) & \dots & \sum\_{t=0}^{N-1} u(t)u(t-n+1) \\ \sum\_{t=0}^{N-1} u(t)u(t-1) & \sum\_{t=-1}^{N-2} u(t)^{2} & \dots & \sum\_{t=-1}^{N-2} u(t)u(t-n+2) \\ \vdots & \vdots & \dots & \vdots \\ \sum\_{t=0}^{N-1} u(t)u(t-n+1) & \sum\_{t=-n+1}^{N-n} u(t)u(t+n-2) & \dots & \sum\_{t=-n+1}^{N-n} u(t)^{2} \end{bmatrix} . \tag{3.46}$$

Since $\text{cond}(\Phi^T\Phi) = (\text{cond}(\Phi))^2$, we study $\text{cond}(\Phi^T\Phi)$ in what follows. In addition, while so far we have assumed deterministic regressors, now we work in a more structured probabilistic framework where the system input is a stochastic process. This implies that Φ is a random matrix. In particular, *u*(*t*) is filtered white noise, with the filter assumed to be stable and given by

$$H(q) = \sum\_{k=0}^{\infty} h(k)q^{-k}.\tag{3.47a}$$

Hence,

$$u(t) = \sum\_{k=0}^{\infty} h(k)\nu(t-k) = H(q)\nu(t),\tag{3.47b}$$

where *v*(*t*) is zero-mean white noise of variance $\sigma^2$ with bounded fourth moments. It follows that *u*(*t*) is a zero-mean stationary stochastic process with covariance function $k\_u(t,s) = \mathbb{E}[u(t)u(s)] = R\_u(t-s)$, with $R\_u(\tau)$ defined as follows:

$$\begin{aligned} \mathbb{E}[u(t)u(t-\tau)] &= \sum\_{k=0}^{\infty} \sum\_{l=0}^{\infty} h(k)h(l)\,\mathbb{E}[v(t-k)v(t-\tau-l)] \\ &= \sigma^2\sum\_{k=0}^{\infty} h(k)h(k-\tau) \stackrel{\triangle}{=} R\_u(\tau). \end{aligned}$$

From the ergodic theory, e.g., [25, Theorem 3.4], it also follows that

$$\frac{1}{N} \sum\_{t=1}^{N} u(t)u(t-\tau) \to R\_u(\tau), \quad N \to \infty, \text{ a.s.} \tag{3.48}$$

From (3.46) and (3.48), one obtains the following almost sure convergence:

$$\frac{1}{N}\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\rightarrow\begin{bmatrix}R\_{\boldsymbol{u}}(0) & R\_{\boldsymbol{u}}(1) & \cdots & R\_{\boldsymbol{u}}(n-1) \\ R\_{\boldsymbol{u}}(1) & R\_{\boldsymbol{u}}(0) & \cdots & R\_{\boldsymbol{u}}(n-2) \\ \vdots & \vdots & \cdots & \vdots \\ R\_{\boldsymbol{u}}(n-1) & R\_{\boldsymbol{u}}(n-2) & \cdots & R\_{\boldsymbol{u}}(0) \end{bmatrix}, \quad N \to \infty, \text{ a.s.} \quad (3.49)$$

So, $\lim\_{N\to\infty}\frac{1}{N}\Phi^T\Phi$ is the covariance matrix of $[u(1)\ \ldots\ u(n)]^T$, whose condition number thus provides insight into the ill-conditioning affecting the system identification problem.

Since the covariance matrix is real and symmetric, its condition number is the ratio between the largest and the smallest of its eigenvalues. An important result of O. Toeplitz, e.g., [44], [20, Chap. 5], says that *as* $n \to \infty$, *the eigenvalues of the covariance matrix of the infinite-dimensional vector* $[u(1)\ u(2)\ \ldots]^T$ *coincide with the set of values assumed by the power spectrum of* $u(t)$, which is given by

$$\Psi\_u(\omega) = \sum\_{\tau = -\infty}^{+\infty} R\_u(\tau) e^{-i\omega \tau}.\tag{3.50}$$

Hence, considering also that $\Psi\_u(-\omega) = \Psi\_u(\omega)$, one has

$$\text{cond}\left(\lim\_{n\to\infty}\lim\_{N\to\infty}\frac{1}{N}\Phi^T\Phi\right) = \frac{\max\_{\omega\in[0,\pi]}\Psi\_u(\omega)}{\min\_{\omega\in[0,\pi]}\Psi\_u(\omega)}.\tag{3.51}$$

In addition, since *u*(*t*) is a filtered white noise (3.47) and *H*(*q*) is stable, one also has [see, e.g., [25, p. 37] for details]:

$$\Psi\_u(\omega) = \sigma^2 |H(e^{i\omega})|^2,\tag{3.52}$$

where $H(e^{i\omega})$ is the frequency function of the filter *H*(*q*), i.e.,

$$H(e^{i\omega}) = \sum\_{k=0}^{\infty} h(k)e^{-i\omega k}.\tag{3.53}$$

Finally, combining the results (3.49)–(3.53) yields


$$\text{cond}\left(\lim\_{n\to\infty}\lim\_{N\to\infty}\frac{1}{N}\Phi^T\Phi\right) = \frac{\max\_{\omega\in[0,\pi]}|H(e^{i\omega})|^2}{\min\_{\omega\in[0,\pi]}|H(e^{i\omega})|^2}.\tag{3.54}$$

When the maximum of $|H(e^{i\omega})|$ is significantly larger than the minimum of $|H(e^{i\omega})|$, the matrix $\lim\_{n\to\infty}\lim\_{N\to\infty}\frac{1}{N}\Phi^T\Phi$ can be very ill-conditioned. For instance, if we consider the stable filter

$$H(q) = \frac{1}{(1 - aq^{-1})^2}, \quad 0 \le a < 1,\tag{3.55}$$

then one has

$$\frac{\max\_{\omega \in [0, \pi]} |H(e^{i\omega})|^2}{\min\_{\omega \in [0, \pi]} |H(e^{i\omega})|^2} = \frac{(1+a)^4}{(1-a)^4}.\tag{3.56}$$

As *a* varies from 0.01 to 0.99, the input power becomes more and more concentrated at low frequencies and the ill-conditioning affecting the system identification problem worsens. In fact, the above quantity increases from about 1 to $1.6 \times 10^9$.
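This effect can be checked numerically. The sketch below (Python/NumPy; the values of *a*, *n* and *N* are arbitrary illustrative choices) filters white noise through the filter (3.55) by two cascaded first-order recursions, builds the FIR regression matrix (3.45), and compares the conditioning for a weakly and a strongly low-pass input:

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_limit(a):
    # Theoretical limit (3.56) for H(q) = 1/(1 - a q^{-1})^2
    return (1 + a) ** 4 / (1 - a) ** 4

def empirical_cond(a, n=10, N=20000):
    # u = H(q) v: two cascaded first-order autoregressive passes over white noise
    u = rng.standard_normal(N + n)
    for _ in range(2):
        for t in range(1, len(u)):
            u[t] += a * u[t - 1]
    # FIR regression matrix (3.45): row t contains u(t-1), ..., u(t-n)
    Phi = np.column_stack([u[n - 1 - k : N + n - 1 - k] for k in range(n)])
    return np.linalg.cond(Phi.T @ Phi / N)

c_mild = empirical_cond(0.1)   # input power spread over frequency
c_sharp = empirical_cond(0.9)  # input power concentrated at low frequency
```

For finite *n* the empirical condition number stays below the limit (3.54), but it already grows by orders of magnitude as *a* approaches 1.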

## **3.4 Regularized Least Squares with Quadratic Penalties**

One way to handle ill-conditioning is to use regularized least squares (ReLS). This method will play a special role in this book, controlling overfitting by encoding prior knowledge. First insights on these aspects are provided below.

ReLS adds a regularization term *J* (θ ) into the LS criterion (3.14), yielding the following problem:

$$\hat{\theta}^{\mathbb{R}} = \underset{\theta}{\text{arg min}} \left\| Y - \Phi \theta \right\|\_{2}^{2} + \gamma J(\theta), \tag{3.57}$$

where γ ≥ 0 is often called the regularization parameter. It balances the adherence to the data $\|Y - \Phi\theta\|\_2^2$ against the penalty *J*(θ). There are many choices for the regularization term, which can be connected with the prior knowledge on the true model parameter $\theta\_0$ that needs to be estimated.

In this section, we consider regularization terms *J*(θ) which are quadratic functions of θ. The resulting estimator will be denoted by ReLS-Q in this chapter. In particular, we let $J(\theta) = \theta^T P^{-1}\theta$ so that the ReLS criterion (3.57) becomes

$$\hat{\theta}^{\text{R}} = \underset{\theta}{\text{arg min}} \|Y - \Phi\theta\|\_2^2 + \gamma\theta^T P^{-1}\theta \tag{3.58a}$$

$$= (\Phi^T \Phi + \gamma P^{-1})^{-1} \Phi^T Y \tag{3.58b}$$

$$= P\Phi^T(\Phi P\Phi^T + \gamma I\_N)^{-1}Y\tag{3.58c}$$

$$= (P\Phi^T\Phi + \gamma I\_n)^{-1} P\Phi^T Y,\tag{3.58d}$$

where $P \in \mathbb{R}^{n\times n}$ is a positive semidefinite matrix, here assumed invertible, often called the regularization matrix, and $I\_n$ is the *n*-dimensional identity matrix.<sup>2</sup>
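As a quick sanity check of the algebra, the following sketch (Python/NumPy, with randomly generated quantities that are purely illustrative) verifies that the three closed forms (3.58b), (3.58c) and (3.58d) return the same estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, gamma = 30, 4, 0.5
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)  # a generic positive definite regularization matrix

Pinv = np.linalg.inv(P)
theta_b = np.linalg.solve(Phi.T @ Phi + gamma * Pinv, Phi.T @ Y)               # (3.58b)
theta_c = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(N), Y)  # (3.58c)
theta_d = np.linalg.solve(P @ Phi.T @ Phi + gamma * np.eye(n), P @ Phi.T @ Y)  # (3.58d)
```

In practice the forms differ in cost: (3.58b) and (3.58d) invert *n* × *n* matrices, whereas (3.58c) inverts an *N* × *N* matrix, which matters when *N* ≫ *n* or vice versa.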

**Remark 3.1** The regularization matrix *P* could be singular. In this case, (3.58a) is not well defined but, with a suitable arrangement, we can use the Moore–Penrose pseudoinverse $P^+$ instead of $P^{-1}$. In particular, let the SVD of *P* be

$$P = \begin{bmatrix} U\_1 & U\_2 \end{bmatrix} \begin{bmatrix} \Lambda\_P & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U\_1 & U\_2 \end{bmatrix}^T,$$

where $\Lambda\_P$ is a diagonal matrix with the positive singular values of *P* as diagonal elements and $U = [U\_1\ U\_2]$ is an orthogonal matrix, with $U\_1$ having the same number of columns as $\Lambda\_P$. Recall also that $P^+ = U\_1\Lambda\_P^{-1}U\_1^T$. In order to find how (3.58a) should be modified for singular *P*, let us consider

$$P\_{\varepsilon} = \begin{bmatrix} U\_1 & U\_2 \end{bmatrix} \begin{bmatrix} \Lambda\_P & 0 \\ 0 & \varepsilon I \end{bmatrix} \begin{bmatrix} U\_1 & U\_2 \end{bmatrix}^T, \quad \varepsilon > 0.$$

By replacing *P* with *P*<sup>ε</sup> in (3.58a), we obtain

$$\hat{\theta}^{\text{R}} = \underset{\theta}{\text{arg min}} \|Y - \Phi\theta\|\_2^2 + \gamma\theta^T U\_1 \Lambda\_P^{-1} U\_1^T \theta + \frac{\gamma}{\varepsilon} \theta^T U\_2 U\_2^T \theta. \tag{3.59}$$

If we let ε → 0, it follows that the parameter vector must satisfy $U\_2^T\theta = 0$. Therefore, we may conveniently associate to a singular *P* the modified regularization problem

$$\hat{\theta}^{\mathbb{R}} = \underset{\theta}{\text{arg min}} \|Y - \Phi\theta\|\_2^2 + \gamma\theta^T P^+ \theta \tag{3.60a}$$

$$\text{subj.to} \quad U\_2^T \theta = 0. \tag{3.60b}$$

If *P*−<sup>1</sup> is replaced by *P*+, it is easy to verify that (3.58c) or (3.58d) is still the optimal solution of (3.60). Instead, this does not hold for (3.58b). For convenience, we will use (3.58a) in the sequel and refer to (3.60) for its rigorous meaning.

## *3.4.1 Making an Ill-Conditioned LS Problem Well Conditioned*

ReLS-Q can make an ill-conditioned LS problem well conditioned. Consider ridge regression which, as discussed in Sect. 1.2, corresponds to setting $P = I\_n$, hence obtaining

<sup>2</sup> The step from (3.58c) to (3.58d) follows from the matrix equality $A(I\_j + BA)^{-1} = (I\_k + AB)^{-1}A$, which holds for every $A \in \mathbb{R}^{k\times j}$ and $B \in \mathbb{R}^{j\times k}$.

$$\hat{\theta}^{\text{R}} = \underset{\theta}{\text{arg min}} \|Y - \Phi\theta\|\_2^2 + \gamma\|\theta\|\_2^2 \tag{3.61a}$$

$$= (\Phi^T \Phi + \gamma I\_n)^{-1} \Phi^T Y. \tag{3.61b}$$

The parameter γ directly affects the condition number of $(\Phi^T\Phi + \gamma I\_n)$, whose inverse defines the regularized estimate. In fact, the positive definite square matrix $(\Phi^T\Phi + \gamma I\_n)$ has eigenvalues (coincident with its singular values) equal to $\sigma\_i^2 + \gamma$. Therefore,

$$\text{cond}(\Phi^T \Phi + \gamma I\_n) = \frac{\sigma\_1^2 + \gamma}{\sigma\_n^2 + \gamma}$$

which can be adjusted by tuning the regularization parameter γ . This means that regularization can make the LS problem well conditioned even when Φ is rank deficient: if the smallest singular value is null one has

$$\text{cond}(\Phi^T \Phi + \gamma I\_n) = \frac{\sigma\_1^2 + \gamma}{\gamma}.$$
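A small numerical illustration (Python/NumPy; the rank-deficient matrix is an arbitrary construction): even with an exactly duplicated column, a moderate γ yields a well-conditioned matrix to invert, and the condition number matches the closed-form ratio above.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.standard_normal((50, 3))
Phi[:, 2] = Phi[:, 0]  # exact rank deficiency: sigma_3 = 0

s = np.linalg.svd(Phi, compute_uv=False)

def cond_reg(gamma):
    # cond(Phi^T Phi + gamma I_n) = (sigma_1^2 + gamma) / (sigma_n^2 + gamma)
    return (s[0] ** 2 + gamma) / (s[-1] ** 2 + gamma)
```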

#### **3.4.1.1 Mean Squared Error**

Simple calculations of expectations with respect to the errors $e\_i$, with the regressors $\phi\_i$ assumed to be deterministic, lead to

$$\mathbb{E}(\hat{\theta}^{\text{R}}) = (\Phi^T\Phi + \gamma P^{-1})^{-1}\Phi^T\Phi\,\theta\_0 \tag{3.62a}$$

$$\hat{\theta}^{\text{R}}\_{\text{bias}} = \mathbb{E}(\hat{\theta}^{\text{R}}) - \theta\_0 = -(\Phi^T\Phi + \gamma P^{-1})^{-1}\gamma P^{-1}\theta\_0 \tag{3.62b}$$

$$\begin{aligned} \text{Cov}(\hat{\theta}^{\text{R}}, \hat{\theta}^{\text{R}}) &= \mathbb{E}[(\hat{\theta}^{\text{R}} - \mathbb{E}(\hat{\theta}^{\text{R}}))(\hat{\theta}^{\text{R}} - \mathbb{E}(\hat{\theta}^{\text{R}}))^T] \\ &= (\Phi^T\Phi + \gamma P^{-1})^{-1}\sigma^2\Phi^T\Phi(\Phi^T\Phi + \gamma P^{-1})^{-1} \end{aligned} \tag{3.62c}$$

$$\begin{aligned} \text{MSE}(\hat{\theta}^{\text{R}}, \theta\_0) &= \mathbb{E}(\hat{\theta}^{\text{R}} - \theta\_0)(\hat{\theta}^{\text{R}} - \theta\_0)^T \\ &= \text{Cov}(\hat{\theta}^{\text{R}}, \hat{\theta}^{\text{R}}) + \hat{\theta}^{\text{R}}\_{\text{bias}}(\hat{\theta}^{\text{R}}\_{\text{bias}})^T \\ &= (\Phi^T\Phi + \gamma P^{-1})^{-1}(\sigma^2\Phi^T\Phi + \gamma^2 P^{-1}\theta\_0\theta\_0^T P^{-1})(\Phi^T\Phi + \gamma P^{-1})^{-1}, \end{aligned} \tag{3.62d}$$

where $\text{Cov}(\hat{\theta}^{\text{R}}, \hat{\theta}^{\text{R}})$ is the covariance matrix of $\hat{\theta}^{\text{R}}$ and $\text{MSE}(\hat{\theta}^{\text{R}}, \theta\_0)$ is the MSE matrix of $\hat{\theta}^{\text{R}}$, a function of the true model parameter $\theta\_0$. Expression (3.62) clearly shows the influence of the regularization on the statistical properties of $\hat{\theta}^{\text{R}}$:


the regularization introduces a bias (3.62b) in exchange for a reduction of the variance (3.62c): as γ grows, the covariance of the estimator shrinks while the bias increases. If the increase in the bias is moderate, an MSE matrix "smaller" than that associated with LS can be obtained.
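The closed-form bias and covariance can be checked by Monte Carlo simulation. The sketch below (Python/NumPy; the dimensions, the true $\theta\_0$, the noise level and the ridge-type choice P = I are illustrative assumptions) compares (3.62b) and (3.62c) with empirical averages over independent noise realizations:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, gamma, sigma = 40, 3, 2.0, 0.5
Phi = rng.standard_normal((N, n))
theta0 = np.array([1.0, -0.5, 0.25])
P = np.eye(n)  # illustrative ridge-type regularization matrix

R = np.linalg.inv(Phi.T @ Phi + gamma * np.linalg.inv(P))
bias_th = -R @ (gamma * np.linalg.inv(P)) @ theta0   # (3.62b)
cov_th = R @ (sigma ** 2 * Phi.T @ Phi) @ R          # (3.62c)

# Monte Carlo over independent noise realizations
est = np.empty((20000, n))
for j in range(est.shape[0]):
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    est[j] = R @ Phi.T @ Y
bias_mc = est.mean(axis=0) - theta0
cov_mc = np.cov(est.T)
```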

## *3.4.2 Equivalent Degrees of Freedom*

For a given regularization matrix *P*, we have seen (also deriving the structure of the MSE) that the regularization parameter γ controls the influence of the regularization: as γ varies from 0 to ∞, the influence of the regularization $\theta^T P^{-1}\theta$ becomes stronger. In particular, when γ = 0 there is no regularization and $\hat{\theta}^{\text{R}}$ reduces to $\hat{\theta}^{\text{LS}}$. When γ = ∞ the regularization term $\gamma\theta^T P^{-1}\theta$ overwhelms the data fit $\|Y - \Phi\theta\|\_2^2$ and one has $\hat{\theta}^{\text{R}} = 0$.

Often, it is more convenient to exploit a normalized measure of the influence of the regularization instead of considering directly the value of γ . For this goal, we introduce the so-called influence or hat matrix:

$$H = \Phi P \Phi^T \left(\Phi P \Phi^T + \gamma I\_N\right)^{-1}. \tag{3.63}$$

Such a matrix is important since it connects the measured output *Y* with the predicted output $\hat{Y} = \Phi\hat{\theta}^{\text{R}}$, i.e., one has

$$
\hat{Y} = \Phi \hat{\theta}^{\mathbb{R}} = HY.\tag{3.64}
$$

It is also important since its trace is indeed a normalized measure of the influence of the regularization. To see this, let $A = \Phi P \Phi^T$ and consider its SVD

$$A = UDU^T,$$

where $UU^T = I$ and *D* is a diagonal matrix with nonnegative entries $d\_i^2$. Then,

$$H = UDU^T (UDU^T + \gamma UU^T)^{-1} = UD(D + \gamma I\_N)^{-1} U^T.$$

Since *U* is orthogonal, one has $\text{trace}(UMU^T) = \text{trace}(M)$, so that

$$\text{trace}(H) = \text{trace}(D(D + \gamma I\_N)^{-1}) = \sum\_{i=1}^n \frac{d\_i^2}{d\_i^2 + \gamma}.$$

The above equation implies that trace(*H*) is a monotonically decreasing function of γ. It attains its maximum at γ = 0 and its infimum as γ → ∞. In particular, for γ = 0 one has $\hat{\theta}^{\text{R}} = \hat{\theta}^{\text{LS}}$ and the hat matrix becomes $H = \Phi(\Phi^T\Phi)^{-1}\Phi^T$, implying that trace(*H*) = *n* if Φ is full rank. For γ → ∞ one instead has trace(*H*) → 0. Therefore, it holds that 0 < trace(*H*) ≤ *n*. Hence, since *n* is the dimension of θ, i.e., the number of parameters in the linear regression model, trace(*H*) can be seen as the

**Fig. 3.7** Polynomial regression: true function *g*(*x*) (blue line) and ridge regression estimates obtained with 16 different values of the regularization parameter

counterpart of the number of parameters to be estimated in the LS context. In other words, in the regularized framework trace(*H*) plays the role of the model order. It thus becomes natural to call it the equivalent degrees of freedom for the ReLS-Q estimate $\hat{\theta}^{\text{R}}$, e.g., [21, Sect. 7.6], [4, p. 559]:

$$\text{dof}(\hat{\theta}^{\mathbb{R}}) = \text{trace}(H). \tag{3.65}$$

The notation dof(γ) will also be used in the book in place of $\text{dof}(\hat{\theta}^{\text{R}})$ to stress the dependence of the equivalent degrees of freedom on the regularization parameter.
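Computing dof(γ) is straightforward. The following sketch (Python/NumPy; the dimensions and the choice P = I are illustrative assumptions) evaluates the trace of the hat matrix (3.63) and exhibits the monotone behaviour discussed above:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 30, 5
Phi = rng.standard_normal((N, n))
P = np.eye(n)  # illustrative regularization matrix

def dof(gamma):
    # Equivalent degrees of freedom (3.65): trace of the hat matrix (3.63)
    A = Phi @ P @ Phi.T
    return np.trace(A @ np.linalg.inv(A + gamma * np.eye(N)))
```

As γ → 0 (with Φ full rank) dof approaches *n* = 5, and as γ → ∞ it decays towards 0.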

**Example 3.6** (*Polynomial regression: ridge regression*) As shown in Fig. 3.6, the regression matrix Φ built in the polynomial regression Example (3.26) and (3.27) is ill-conditioned for large *n*. Here, we consider the case *n* = 16 (corresponding to a polynomial of order 15), which leads to $\text{cond}(\Phi) = 1.49 \times 10^{11}$. To illustrate how ridge regression (3.61) can counter the ill-conditioning, let $\gamma = \gamma\_i$, *i* = 1, ..., 16, with $\gamma\_1 = 0.01$, $\gamma\_{16} = 0.31$ and $\gamma\_2, \ldots, \gamma\_{15}$ evenly spaced between $\gamma\_1$ and $\gamma\_{16}$. For each $\gamma\_i$, we then compute the corresponding ridge regression estimate (3.61) and plot the 16 estimates $\hat{g}(x) = \phi(x)^T\hat{\theta}^{\text{R}}$ in Fig. 3.7. The fits (3.28) are shown in Fig. 3.8 as a function of γ. One can see that γ = 0.11 gives the best performance, with a fit around 89%. Interestingly, such a fit is larger than the best result obtained by LS through optimal tuning of the discrete model order, see Fig. 3.4. The base 10 logarithm of the condition number of $\Phi^T\Phi + \gamma I\_n$, as a function of γ, is displayed in Fig. 3.9. One can see that the matrix is much better conditioned now. Figure 3.10 plots the equivalent degrees of freedom of $\hat{\theta}^{\text{R}}$. Even if *n* = 16, the actual model complexity in terms of equivalent degrees of freedom is much smaller, around 4 for

**Fig. 3.8** Polynomial regression: profile of the ridge regression fit (3.28) as a function of γ . Large fit values are associated to estimates close to the true function

**Fig. 3.9** Polynomial regression: profile of the base 10 logarithm of the condition number of <sup>Φ</sup>*<sup>T</sup>* <sup>Φ</sup> <sup>+</sup> <sup>γ</sup> *In* as a function of <sup>γ</sup>


**Fig. 3.10** Polynomial regression: profile of the equivalent degrees of freedom (3.65) as a function of γ using ridge regression

the tested values of γ . Finally, the estimates of any component of θ obtained using the different values of γ are shown in Fig. 3.11.

#### **3.4.2.1 Regularization Design: The Optimal Regularizer**

A natural question is how to design a regularization matrix *P* and select γ to obtain a "good" model estimate. From a "classic" or "frequentist" point of view, rational choices are those that make the MSE matrix (3.62d) small in some sense, as discussed below. For our purposes, it is useful to rewrite the MSE matrix (3.62d) as follows:

$$\text{MSE}(\hat{\theta}^{\text{R}}, \theta\_0) = \sigma^2 \left(\frac{P\Phi^T\Phi}{\gamma} + I\_n\right)^{-1} \left(\frac{P\Phi^T\Phi P}{\gamma^2} + \frac{\theta\_0 \theta\_0^T}{\sigma^2}\right) \left(\frac{\Phi^T\Phi P}{\gamma} + I\_n\right)^{-1}. \tag{3.66}$$

Then, it is useful to first introduce the following lemma.

**Lemma 3.1** (based on [9]) *Consider the matrix*

$$M(\mathcal{Q}) = \left(\mathcal{Q}R + I\right)^{-1} (\mathcal{Q}R\mathcal{Q} + Z)(R\mathcal{Q} + I)^{-1},$$

*where Q*, *R and Z are positive semidefinite matrices. Then for all Q*

**Fig. 3.11** Polynomial regression: profile of the estimates of each component forming the ridge regression estimate (3.61). For each value *<sup>k</sup>* = 0,..., 15 on the *<sup>x</sup>*-axis the plot reports the estimates of the coefficient of the monomial *x<sup>k</sup>* obtained by using different values of the regularization parameter γ

$$M(Z) \le M(\mathcal{Q}),\tag{3.67}$$

*which means that M*(*Q*) − *M*(*Z*) *is positive semidefinite.*

The proof consists of straightforward calculations and can be found in Sect. 3.8.2.

Using (3.66) and Lemma 3.1, the question of which *P* and γ give the best MSE of $\hat{\theta}^{\text{R}}$ has a clear answer: the equation $\sigma^2 P = \gamma\theta\_0\theta\_0^T$ needs to be satisfied. Thus, the following result holds.

**Proposition 3.1** (Optimal regularization for a given $\theta\_0$, based on [9]) *Letting* $\gamma = \sigma^2$*, the regularization matrix*

$$P = \theta\_0 \theta\_0^T \tag{3.68}$$

*minimizes the MSE matrix (3.66) in the sense of (3.67).*

Note that the MSE matrix (3.66) is linear in $\theta\_0\theta\_0^T$. This means that if we compute $\hat{\theta}^{\text{R}}$ with the same *P* for a collection of true systems $\theta\_0$, the average MSE over that collection will be given by (3.66) with $\theta\_0\theta\_0^T$ replaced by its average over the collection. In particular, if $\theta\_0$ is a random vector with $\mathbb{E}(\theta\_0\theta\_0^T) = \Pi$, we obtain the following result.

**Proposition 3.2** (Optimal regularization for a random system $\theta\_0$, based on [9]) *Consider (3.62d) with* $\gamma = \sigma^2$*. Then, the best average (expected) MSE for a random true system* $\theta\_0$ *with* $\mathbb{E}(\theta\_0\theta\_0^T) = \Pi$ *is obtained by the regularization matrix* $P = \Pi$*.*

Propositions 3.1 and 3.2 thus give a somewhat preliminary answer to our design problem. Since the best regularization matrix $P = \theta\_0\theta\_0^T$ depends on the true system $\theta\_0$, such a formula cannot be used in practice. Nevertheless, it suggests choosing a regularization matrix which mimics the behaviour of $\theta\_0\theta\_0^T$. Using prior knowledge on the true system $\theta\_0$, this can be done by postulating a parametrized family of matrices $P(\eta)$ with $\eta \in \Gamma \subset \mathbb{R}^m$, where η is the so-called *hyperparameter* vector, Γ is the set where η can vary and *m* is the dimension of η. Thus, the choice of a parametrized regularization matrix is similar to model structure selection in system identification. The nature of the optimal regularizer also suggests setting

$$
\gamma = \sigma^2.\tag{3.69}
$$

However, the noise variance $\sigma^2$ is in general unknown and needs to be estimated from the data. One can adopt equations (3.22) or (3.23). Another option is to include $\sigma^2$ in η and then estimate it together with the other hyperparameters.
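The optimality claim can be probed numerically. The sketch below (Python/NumPy; the true $\theta\_0$, the dimensions and the noise level are illustrative assumptions) evaluates the MSE matrix in the form (3.66), which remains well defined for the singular choice $P = \theta\_0\theta\_0^T$, and checks that ridge regression is dominated in the matrix sense of (3.67):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n, sigma = 25, 4, 1.0
Phi = rng.standard_normal((N, n))
theta0 = np.array([2.0, -1.0, 0.5, 0.1])
gamma = sigma ** 2  # the choice (3.69)

def mse_matrix(P):
    # MSE matrix in the form (3.66); valid also for singular P
    L = np.linalg.inv(P @ Phi.T @ Phi / gamma + np.eye(n))
    M = P @ Phi.T @ Phi @ P / gamma ** 2 + np.outer(theta0, theta0) / sigma ** 2
    return sigma ** 2 * L @ M @ L.T

mse_opt = mse_matrix(np.outer(theta0, theta0))  # optimal regularizer (3.68)
mse_ridge = mse_matrix(np.eye(n))               # ridge, P = I_n
```

The difference `mse_ridge - mse_opt` should be positive semidefinite, as guaranteed by Lemma 3.1.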

## **3.5 Regularization Tuning for Quadratic Penalties**

## *3.5.1 Mean Squared Error and Expected Validation Error*

Now, assume that a parametrized family of regularization matrices *P*(η) has been defined. The vector η is in general unknown and has to be tuned using the available measurements. The ReLS-Q estimate $\hat{\theta}^{\text{R}}(\eta)$ in (3.58) depends on η, and the estimation strategy depends on the measure used to quantify its quality. We will consider the following two criteria:


#### **3.5.1.1 Minimizing the MSE**

Still adopting a "classic" or "frequentist" point of view, a rational choice of η is one that makes the MSE matrix (3.62d) small in some sense. For ease of estimation, a scalar measure is often exploited. In [25, Chap. 12], it is suggested to use a weighting matrix *Q* and $\text{trace}(\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0)Q)$ as a quality measure of $\hat{\theta}^{\text{R}}(\eta)$, where *Q* reflects the intended use of the model $\hat{\theta}^{\text{R}}(\eta)$. Then an estimate of η, say $\hat{\eta}$, is obtained as follows:


$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} \text{trace}(\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0)\,Q). \tag{3.70}$$

Note that (3.70) depends on the true system $\theta\_0$, which is unknown, and thus cannot be used. In practice, we need to first find a "good" estimate, say $\hat{\theta}$, of the true system $\theta\_0$ and then to replace $\theta\_0$ in (3.70) with $\hat{\theta}$. Then, hopefully, a "good" estimate is given by

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} \text{trace}(\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \hat{\theta})\,Q). \tag{3.71}$$

Different choices of $\hat{\theta}$ and *Q* lead to different estimators (3.71). Examples are obtained by setting $\hat{\theta}$ to the LS estimate or to the ridge regression estimate of $\theta\_0$, while the choice $Q = I\_n$ is often used. In any case, the major difficulty underlying the idea of "minimizing the MSE" for hyperparameter tuning lies in whether or not $\hat{\theta}$ is a "good" estimate of $\theta\_0$, which is actually our fundamental problem.

#### **3.5.1.2 Minimizing the EVE**

An alternative quality measure of θˆ<sup>R</sup>(η) is related to model prediction capability on independent validation data and is characterized by the expected validation error (EVE).

To define it, we need to introduce the training/estimation data and the validation data. The training data is used for estimating the model and is contained in the set *D*T. The validation data are used to assess model prediction capability and are in the set *D*V.

Now, let $\hat{\theta}^{\text{R}}(\eta)$ denote a general ReLS-Q estimate parametrized by the vector η and obtained using only the training data $D\_{\text{T}}$. Let $y\_{\text{v}} \in \mathbb{R}$, $\phi\_{\text{v}} \in \mathbb{R}^n$ be a validation sample pair. These objects could both be random, e.g., $y\_{\text{v}}$ can be affected by noise and the regressor could be defined by a stochastic system input. The validation error $\text{EVE}\_{D\_{\text{T}}}(\eta)$ is then given by

$$\text{EVE}\_{D\_{\text{T}}}(\eta) = \mathbb{E}[(y\_{\text{v}} - \phi\_{\text{v}}^T\hat{\theta}^{\text{R}}(\eta))^2 \mid D\_{\text{T}}]. \tag{3.72}$$

In the above equation, the expectation $\mathbb{E}$ is computed w.r.t. the joint distribution of $y\_{\text{v}}$ and $\phi\_{\text{v}}$ conditioned on the training data $D\_{\text{T}}$. If $\phi\_{\text{v}} \in \mathbb{R}^n$ is deterministic and, as usual, $y\_{\text{v}}$ is affected by a noise independent of those entering the training set, the mean is taken just w.r.t. such noise, with $D\_{\text{T}}$ influencing only $\hat{\theta}^{\text{R}}$. In any case, the result is a function of the training set. Now, we can see $D\_{\text{T}}$ as random and then the EVE is

$$\text{EVE}(\eta) \triangleq \mathbb{E}[\text{EVE}\_{D\_{\text{T}}}(\eta)],\tag{3.73}$$

where the expectation $\mathbb{E}$ is over the training set. Note that the final result is a function of the true $\theta\_0$, which determines the probability distributions of the training and validation data.

The EVE(η) measures the prediction capability of the model θˆ<sup>R</sup>(η) before seeing any training or validation data: the smaller the EVE(η), the better the expected model prediction capability. Therefore, it is natural to estimate η as follows:

$$
\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} \text{EVE}(\eta). \tag{3.74}
$$

However, as said, the above objective depends on the unknown vector $\theta\_0$, so that (3.74) cannot be used in practice. The problem is analogous to that encountered when trying to tune η by minimizing the MSE.

**Remark 3.2** Interestingly, the idea of "minimizing the MSE" and the idea of "minimizing the EVE" are connected. To see this, we assume for simplicity that the regressors $\phi\_i$, *i* = 1, ..., *N* in the training data and $\phi\_{\text{v}}$ in the validation data are deterministic. Then it can be shown that

$$\text{EVE}(\eta) = \mathbb{E}[(y\_{\text{v}} - \phi\_{\text{v}}^T\hat{\theta}^{\text{R}}(\eta))^2] = \sigma^2 + \phi\_{\text{v}}^T\,\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0)\,\phi\_{\text{v}},\tag{3.75}$$

where the expectation $\mathbb{E}$ is over everything that is random, and $\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0)$ is the MSE matrix of $\hat{\theta}^{\text{R}}(\eta)$ defined in (3.62d). Clearly, (3.75) shows that minimizing EVE(η) with respect to η is equivalent to minimizing $\text{trace}(\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0)Q)$ with respect to η when $Q = \phi\_{\text{v}}\phi\_{\text{v}}^T$.

To overcome the fact that the EVE depends on the unknown θ0, we could first find a "good" estimate of EVE(η) using the available data and then determine the hyperparameter vector by minimizing it. There are two ways to achieve this goal: by efficient sample reuse of the data and by considering the in-sample EVE instead. More details will be provided in the next two subsections.
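The identity (3.75) can be verified by simulation. The sketch below (Python/NumPy; the dimensions, the true $\theta\_0$, the noise level and the ridge-type choice P = I are illustrative assumptions) draws a fresh training set and an independent validation sample at each run and compares the empirical squared prediction error with the closed form:

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, sigma, gamma = 30, 3, 0.4, 1.0
Phi = rng.standard_normal((N, n))   # deterministic training regressors
phi_v = rng.standard_normal(n)      # deterministic validation regressor
theta0 = np.array([1.0, 0.5, -0.2])
P = np.eye(n)                       # illustrative regularization matrix

R = np.linalg.inv(Phi.T @ Phi + gamma * np.linalg.inv(P))
G = R @ Phi.T
bias = -R @ (gamma * np.linalg.inv(P)) @ theta0
mse = R @ (sigma ** 2 * Phi.T @ Phi) @ R + np.outer(bias, bias)  # (3.62d)
eve_th = sigma ** 2 + phi_v @ mse @ phi_v                        # (3.75)

# Monte Carlo: fresh training noise and independent validation noise each run
sq_errs = []
for _ in range(40000):
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    y_v = phi_v @ theta0 + sigma * rng.standard_normal()
    sq_errs.append((y_v - phi_v @ (G @ Y)) ** 2)
eve_mc = np.mean(sq_errs)
```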

## *3.5.2 Efficient Sample Reuse*

One approach to estimating EVE(η) through efficient sample reuse is cross-validation (CV) [41] and its variants, already mentioned in Sects. 2.6.3 and 3.2.2 when discussing model order selection.

#### **3.5.2.1 Hold Out Cross-Validation**

The simplest CV is the so-called hold out CV (HOCV), which is widely used to select the model order for the classical PEM/ML. The HOCV can also be used to estimate the hyperparameter η ∈ Γ for the ReLS-Q method.

The idea of hold out CV is to split the given data into two parts: the training data $D\_{\text{T}}$ and the validation data $D\_{\text{V}}$. The prediction capability is measured in terms of the validation error, and the model that gives the smallest validation error is selected. More specifically, the HOCV takes the following three steps: first, split the data into $D\_{\text{T}}$ and $D\_{\text{V}}$; second, for each η ∈ Γ, compute the ReLS-Q estimate $\hat{\theta}^{\text{R}}(\eta)$ using only the training data $D\_{\text{T}}$; third, compute the validation error


$$\text{CV}(\eta) = \sum\_{(y\_\mathrm{v}, \phi\_\mathrm{v}) \in \mathcal{D}\_\mathrm{V}} (y\_\mathrm{v} - \phi\_\mathrm{v}^T \hat{\theta}^{\mathrm{R}}(\eta))^2,$$

where the summation is over all pairs $(y\_\mathrm{v}, \phi\_\mathrm{v})$ in the validation data $\mathcal{D}\_\mathrm{V}$. Then, select the value of $\eta$ that minimizes $\text{CV}(\eta)$:

$$
\hat{\eta} = \operatorname\*{arg\,min}\_{\eta \in \Gamma} \text{CV}(\eta). \tag{3.76}
$$

It is also possible to swap the roles of the training and validation sets in order to perform a second validation step: the model is estimated on the previous validation set and the validation error is computed on the previous training set. The overall validation error is then obtained by averaging the two validation errors.
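The procedure above can be sketched in a few lines. The scalar prior parameterization $P(\eta) = \eta I$ used below is an illustrative assumption, not a choice made in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, sigma = 60, 8, 0.3
Phi = rng.standard_normal((N, n))
theta0 = np.exp(-0.5 * np.arange(n))     # illustrative "true" parameters
Y = Phi @ theta0 + sigma * rng.standard_normal(N)

def rels_q(Phi, Y, eta, sigma):
    """ReLS-Q estimate with the assumed diagonal prior P(eta) = eta * I."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + (sigma**2 / eta) * np.eye(n), Phi.T @ Y)

def hold_out_cv(Phi, Y, eta, sigma, split=0.5):
    """Hold-out CV score, including the swap-and-average variant."""
    m = int(split * len(Y))
    score = 0.0
    for tr, va in [(slice(0, m), slice(m, None)), (slice(m, None), slice(0, m))]:
        theta = rels_q(Phi[tr], Y[tr], eta, sigma)   # fit on one part...
        score += np.sum((Y[va] - Phi[va] @ theta) ** 2)  # ...validate on the other
    return score / 2

grid = np.logspace(-3, 2, 30)            # candidate hyperparameter values
scores = [hold_out_cv(Phi, Y, eta, sigma) for eta in grid]
eta_hat = grid[int(np.argmin(scores))]
print("selected eta:", eta_hat)
```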

#### **3.5.2.2** *k***-Fold Cross-Validation**

The HOCV with swapped sets is a special case of the more general *k*-fold CV with *k* = 2, e.g., [24]. If the data set is small, the HOCV may perform poorly: the training data may not be sufficiently rich to build good models, and a validation set of small size may give an overly uncertain validation error. In this case, the *k*-fold CV with *k* > 2 can be used.

The idea of *k*-fold CV is to first split the data into *k* parts of equal size. For every $\eta \in \Gamma$, the following procedure is repeated *k* times. At the *i*th run, with $i = 1, 2, \ldots, k$, the *i*th part is used as the validation data $\mathcal{D}\_{\mathrm{V},i}$ and the remaining $k-1$ parts as the training data; the ReLS-Q estimate $\hat{\theta}^{\mathrm{R}}(\eta)$ is computed on the training data and then one evaluates the validation error



$$\text{CV}\_{-i}(\eta) = \sum\_{(y\_\mathrm{v}, \phi\_\mathrm{v}) \in \mathcal{D}\_{\mathrm{V},i}} (y\_\mathrm{v} - \phi\_\mathrm{v}^T \hat{\theta}^{\mathrm{R}}(\eta))^2,$$

where the summation is over all pairs $(y\_\mathrm{v}, \phi\_\mathrm{v})$ in the validation data $\mathcal{D}\_{\mathrm{V},i}$.

Finally, the *k* validation errors $\text{CV}\_{-i}(\eta)$ so obtained are summed to obtain the following total validation error for $\eta$:

$$\text{CV}(\eta) = \sum\_{i=1}^{k} \text{CV}\_{-i}(\eta),$$

and the estimate of η is finally given by

$$
\hat{\eta} = \operatorname\*{arg\,min}\_{\eta \in \Gamma} \text{CV}(\eta). \tag{3.77}
$$
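The *k*-fold procedure can be sketched as follows, again under the illustrative assumption $P(\eta) = \eta I$ (not a choice made in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, sigma, k = 50, 6, 0.3, 5
Phi = rng.standard_normal((N, n))
theta0 = rng.standard_normal(n)          # illustrative "true" parameters
Y = Phi @ theta0 + sigma * rng.standard_normal(N)

def kfold_cv(Phi, Y, eta, sigma, k):
    """Total k-fold CV error CV(eta) = sum_i CV_{-i}(eta) of (3.77)."""
    N, n = Phi.shape
    folds = np.array_split(np.arange(N), k)
    total = 0.0
    for idx in folds:
        mask = np.ones(N, dtype=bool)
        mask[idx] = False                # train on the other k-1 parts
        theta = np.linalg.solve(
            Phi[mask].T @ Phi[mask] + (sigma**2 / eta) * np.eye(n),
            Phi[mask].T @ Y[mask])
        total += np.sum((Y[idx] - Phi[idx] @ theta) ** 2)
    return total

grid = np.logspace(-3, 2, 30)
eta_hat = min(grid, key=lambda eta: kfold_cv(Phi, Y, eta, sigma, k))
print("selected eta:", eta_hat)
```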

#### **3.5.2.3 Predicted Residual Error Sum of Squares and Variants**

The computation of the *k*-fold CV is often expensive; an exception is the leave-one-out CV (LOOCV), where each validation set includes only one validation pair. When the square loss function is used, the total validation error admits a closed-form expression and the LOOCV is also known as the predicted residual error sum of squares (PRESS), e.g., [2].

First, recall the linear regression model (3.13) and the corresponding data $y\_i \in \mathbb{R}$ and $\phi\_i \in \mathbb{R}^n$ for $i = 1, \ldots, N$. Then the ReLS-Q estimate is

$$\begin{split} \hat{\theta}^{\mathrm{R}} &= \operatorname\*{arg\,min}\_{\theta} \|Y - \Phi\theta\|\_2^{2} + \sigma^{2}\theta^{T}P^{-1}(\eta)\theta \\ &= \left(\Phi^{T}\Phi + \sigma^{2}P^{-1}(\eta)\right)^{-1}\Phi^{T}Y \\ &= \left(\sum\_{i=1}^{N} \phi\_{i}\phi\_{i}^{T} + \sigma^{2}P^{-1}(\eta)\right)^{-1}\sum\_{i=1}^{N} \phi\_{i}y\_{i}, \end{split} \tag{3.78}$$

where we have set $\gamma = \sigma^2$ following (3.69). For the *k*th measured output $y\_k$, the corresponding predicted output $\hat{y}\_k$ and residual $r\_k$ are, respectively,

$$\hat{y}\_{k} = \phi\_{k}^{T} \left( \sum\_{i=1}^{N} \phi\_{i} \phi\_{i}^{T} + \sigma^{2} P^{-1}(\eta) \right)^{-1} \sum\_{i=1}^{N} \phi\_{i} y\_{i},\tag{3.79a}$$

$$
r\_k = y\_k - \hat{y}\_k.\tag{3.79b}
$$

Then, PRESS selects the value of η ∈ Γ that minimizes the sum of squares of the validation errors. One can prove that this corresponds to the following problem:


$$\text{PRESS}: \ \hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} \sum\_{k=1}^{N} \frac{r\_k^2}{(1 - \phi\_k^T M^{-1} \phi\_k)^2},\tag{3.80}$$

where the $r\_k$ are defined by (3.79) while

$$M = \sum\_{i=1}^{N} \phi\_i \phi\_i^T + \sigma^2 P^{-1}(\eta). \tag{3.81}$$

The derivation of (3.80) can be found in Sect. 3.8.3. It is worth noting that the denominator in (3.80) is closely related to the diagonal entries of the hat matrix $H$ defined in (3.63). In fact,

$$
\phi\_k^T M^{-1} \phi\_k = H\_{kk}
$$

so that

$$\text{PRESS}: \quad \hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min}} \sum\_{k=1}^{N} \frac{r\_k^2}{(1 - H\_{kk})^2}.$$

Hence, interestingly, one can conclude that evaluating PRESS requires computing just the ReLS-Q estimate on the full data set, instead of solving *N* problems, one for each measurement left out of the training set.

One method closely related to PRESS is the so-called generalized cross-validation (GCV), e.g., [18]. GCV is obtained by replacing in (3.80) each factor $H\_{kk}$ by the average $\text{trace}(H)/N$:

$$\text{GCV:} \quad \hat{\eta} = \underset{\eta \in \Gamma}{\text{arg}\, \text{min}} \, \frac{1}{(1 - \text{trace}(H)/N)^2} \sum\_{k=1}^{N} r\_k^2. \tag{3.82}$$

Recalling (3.65), the term $\text{trace}(H)$ defines the degrees of freedom of $\hat{\theta}^{\mathrm{R}}$. Hence, the GCV criterion can be rewritten as follows:

$$\text{GCV:} \quad \hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min}} \frac{1}{(1 - \text{dof}(\hat{\theta}^{\mathrm{R}}(\eta))/N)^2} \sum\_{k=1}^N r\_k^2.$$
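The closed form (3.80) can be verified against a brute-force LOOCV that actually refits the model $N$ times; the sketch below also evaluates the GCV score (3.82). The fixed prior $P = \eta I$ with a given $\eta$ is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, sigma, eta = 30, 4, 0.4, 1.0
Phi = rng.standard_normal((N, n))
Y = Phi @ rng.standard_normal(n) + sigma * rng.standard_normal(N)
P = eta * np.eye(n)                      # assumed prior parameterization

Minv = np.linalg.inv(Phi.T @ Phi + sigma**2 * np.linalg.inv(P))
H = Phi @ Minv @ Phi.T                   # hat matrix (3.63)
r = Y - H @ Y                            # full-data residuals r_k

# PRESS via the closed form (3.80): a single fit on the full data set
press_closed = np.sum(r**2 / (1 - np.diag(H))**2)

# PRESS by brute force: N refits, each leaving one sample out
press_brute = 0.0
for k in range(N):
    keep = np.arange(N) != k
    Mk = np.linalg.inv(Phi[keep].T @ Phi[keep] + sigma**2 * np.linalg.inv(P))
    theta_k = Mk @ Phi[keep].T @ Y[keep]
    press_brute += (Y[k] - Phi[k] @ theta_k) ** 2

# GCV (3.82): each H_kk replaced by the average trace(H)/N
gcv = np.sum(r**2) / (1 - np.trace(H) / N) ** 2

print(press_closed, press_brute, gcv)    # the first two coincide
```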

## *3.5.3 Expected In-Sample Validation Error*

In the definition of the validation error $\text{EVE}\_{\mathcal{D}\_\mathrm{T}}$ (3.72), reported below for convenience,

$$\text{EVE}\_{\mathcal{D}\_\mathrm{T}}(\eta) = E[(y\_\mathrm{v} - \phi\_\mathrm{v}^{T}\hat{\theta}^{\mathrm{R}}(\eta))^{2} \mid \mathcal{D}\_\mathrm{T}],$$

we assumed that the conditional expectation $E$ is over the independent validation sample pair $y\_\mathrm{v} \in \mathbb{R}$, $\phi\_\mathrm{v} \in \mathbb{R}^n$, drawn randomly from their joint distribution. The computation of the validation error (3.72) becomes easier if the independent validation sample pairs $y\_\mathrm{v} \in \mathbb{R}$, $\phi\_\mathrm{v} \in \mathbb{R}^n$ are generated in a particular way.

For linear regression problems, it is convenient to assume that the same *deterministic* regressors $\phi\_i$, $i = 1, 2, \ldots, N$, are used for generating both the training data and the validation data. To be specific, still using $\theta\_0$ to denote the true parameter vector, we recall from (3.6) that the training output samples are

$$y\_i = \phi\_i^T \theta\_0 + e\_i, \quad i = 1, \ldots, N. \tag{3.83}$$

In this case, the training set is

$$\mathcal{D}\_\mathrm{T} = \{ (y\_i, \phi\_i) \mid y\_i \in \mathbb{R},\ \phi\_i \in \mathbb{R}^n \text{ satisfying (3.83)},\ i = 1, \dots, N \}. \tag{3.84}$$

Using the same regressors $\phi\_i$, consider a set of validation output samples $y\_{\mathrm{v},i}$ as follows:

$$\mathbf{y}\_{\mathbf{v},i} = \phi\_i^T \theta\_0 + e\_{\mathbf{v},i}, \quad i = 1, \ldots, N,\tag{3.85}$$

where $\theta\_0$ is the true parameter vector, with the noises $e\_i$ and $e\_{\mathrm{v},i}$ assumed independently and identically distributed. The validation error is now denoted by $\text{EVE}\_{\mathrm{in}\,\mathcal{D}\_\mathrm{T}}(\eta)$, computed as follows:

$$\text{EVE}\_{\mathrm{in}\,\mathcal{D}\_\mathrm{T}}(\eta) = \frac{1}{N} \sum\_{i=1}^{N} E[(y\_{\mathrm{v},i} - \phi\_{i}^{T}\hat{\theta}^{\mathrm{R}}(\eta))^{2} \mid \mathcal{D}\_\mathrm{T}],\tag{3.86}$$

and called the in-sample validation error [21, p. 228]. Note that, similarly to what was discussed after (3.72), the expectation $E$ in (3.86) is computed w.r.t. the joint distribution of the pairs $y\_{\mathrm{v},i}, \phi\_i$ conditioned on the training data $\mathcal{D}\_\mathrm{T}$. Thus, the result is a function of the training set. As done in (3.73), we can remove this dependence by computing the expected in-sample validation error as

$$\text{EVE}\_{\mathrm{in}}(\eta) = E[\text{EVE}\_{\mathrm{in}\,\mathcal{D}\_\mathrm{T}}(\eta)],\tag{3.87}$$

with expectation taken over the joint distribution of the training data. In what follows, we will see how to build an unbiased estimator of $\text{EVE}\_{\mathrm{in}}(\eta)$ using the training data (3.84), and how to exploit it for hyperparameter tuning.

#### **3.5.3.1 Expectation of the Sum of Squared Residuals, Optimism and Degrees of Freedom**

To estimate $\text{EVE}\_{\mathrm{in}}(\eta)$, consider the sum of squared residuals


$$\overline{\text{err}}(\eta)\_{\mathcal{D}\_\mathrm{T}} = \frac{1}{N} \sum\_{i=1}^{N} (y\_i - \phi\_i^T \hat{\theta}^{\mathrm{R}}(\eta))^2,\tag{3.88}$$

which is a function only of the training set. Its expectation w.r.t. the training data (3.84) is

$$\overline{\text{err}}(\eta) = \mathcal{E}\left(\frac{1}{N} \sum\_{i=1}^{N} (\mathbf{y}\_i - \boldsymbol{\phi}\_i^T \hat{\boldsymbol{\theta}}^{\mathbf{R}}(\eta))^2\right). \tag{3.89}$$

One expects $\text{EVE}\_{\mathrm{in}}(\eta)$ to be no smaller than $\overline{\text{err}}(\eta)$, because the latter quantity exploits the same data both to fit the model and to assess the error. This intuition is indeed true, as shown in the following theorem, whose proof is in Sect. 3.8.4.

**Theorem 3.7** *Consider the linear regression model (3.13) with the training data (3.84), the validation data (3.85) and the ReLS-Q estimate (3.58). Then it holds that*

$$
\overline{\text{err}}(\eta) \le \text{EVE}\_{\mathrm{in}}(\eta). \tag{3.90}
$$

Theorem 3.7 shows that the expectation of the sum of squares of the residuals is an overly optimistic estimator of the expected in-sample validation error $\text{EVE}\_{\mathrm{in}}(\eta)$. The difference between $\text{EVE}\_{\mathrm{in}}(\eta)$ and $\overline{\text{err}}(\eta)$ is called the optimism in statistics. In particular, one has, see, e.g., [21, p. 229]:

$$\text{EVE}\_{\text{in}}(\eta) = \overline{\text{err}}(\eta) + \text{optimism}(\eta), \tag{3.91}$$

where rewriting (3.83) as

$$Y = \Phi \theta\_0 + E,\tag{3.92}$$

and defining the output prediction as

$$
\hat{Y}(\eta) = \Phi \hat{\theta}^{\mathbb{R}}(\eta),
$$

it holds that

$$\text{optimism}(\eta) = 2\frac{1}{N}\text{trace}(\text{Cov}(Y, \hat{Y}(\eta))) \ge 0. \tag{3.93}$$

Combining arguments contained in the proof of Theorem 3.7 reported in the appendix to this chapter, see, in particular, (3.164), with the definition of equivalent degrees of freedom in (3.65), one obtains that

$$\text{trace}(\text{Cov}(Y, \hat{Y}(\eta))) = \sigma^2 \text{dof}(\hat{\theta}^{\mathbb{R}}(\eta)). \tag{3.94}$$

This thus reveals the deep connection between the optimism and the equivalent degrees of freedom.

#### **3.5.3.2 An Unbiased Estimator of the Expected In-Sample Validation Error**

Exploiting (3.94), we can now rewrite (3.91) as

$$\text{EVE}\_{\mathrm{in}}(\eta) = \overline{\text{err}}(\eta) + 2\sigma^2 \frac{\text{dof}(\hat{\theta}^{\mathrm{R}}(\eta))}{N}. \tag{3.95}$$

Interestingly, on the left-hand side of (3.95), $\text{EVE}\_{\mathrm{in}}(\eta)$ is, by definition (3.87), the mean of a random variable which depends on both the training data (3.84) and the validation data (3.85). Instead, on the right-hand side of (3.95), $\overline{\text{err}}(\eta)$ is the expectation of a random variable which depends only on the training data. Hence, an unbiased estimator $\widehat{\text{EVE}\_{\mathrm{in}}}(\eta)$ of $\text{EVE}\_{\mathrm{in}}(\eta)$ is obtained by just replacing $\overline{\text{err}}(\eta)$ with $\overline{\text{err}}(\eta)\_{\mathcal{D}\_\mathrm{T}}$ reported in (3.88). One thus obtains

$$\begin{split} \widehat{\mathrm{EVE}\_{\mathrm{in}}}(\eta) &= \overline{\mathrm{err}}(\eta)\_{\mathcal{D}\_\mathrm{T}} + 2\sigma^{2} \frac{\mathrm{dof}(\hat{\theta}^{\mathrm{R}}(\eta))}{N} \\ &= \frac{1}{N} \| Y - \Phi \hat{\theta}^{\mathrm{R}}(\eta) \|\_{2}^{2} + 2\sigma^{2} \frac{\mathrm{dof}(\hat{\theta}^{\mathrm{R}}(\eta))}{N}. \end{split} \tag{3.96}$$

So, after observing the training data (3.84), the hyperparameter η can be estimated as follows:

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min}} \frac{1}{N} \|Y - \Phi \hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2} + 2\sigma^{2} \frac{\text{dof}(\hat{\theta}^{\text{R}}(\eta))}{N}. \tag{3.97}$$

The hyperparameter estimation criterion (3.97) has different names in statistics: it is known as the $C\_p$ statistic, e.g., [27], and as Stein's unbiased risk estimator (SURE), e.g., [40].

Interestingly, as will be clear from the proof of Theorem 3.7, formula (3.97) still provides an unbiased prediction risk estimator if we replace $\Phi\theta\_0$ in (3.92) with a generic vector $\mu$ s.t. $Y = \mu + E$. Hence, one does not need to assume the existence of a true $\theta\_0$ and of a regression matrix describing the linear input–output relation. A variant of the expected in-sample validation error is also discussed in Sect. 3.8.5.
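Hyperparameter selection by the SURE/$C\_p$ criterion (3.97) can be sketched as follows, again under the illustrative assumption $P(\eta) = \eta I$ (the grid and all dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, sigma = 80, 10, 0.5
Phi = rng.standard_normal((N, n))
theta0 = np.exp(-0.4 * np.arange(n))     # illustrative "true" parameters
Y = Phi @ theta0 + sigma * rng.standard_normal(N)

def sure_score(eta):
    """SURE / C_p objective (3.97) with the assumed prior P(eta) = eta * I."""
    Minv = np.linalg.inv(Phi.T @ Phi + (sigma**2 / eta) * np.eye(n))
    H = Phi @ Minv @ Phi.T
    dof = np.trace(H)                    # equivalent degrees of freedom (3.65)
    rss = np.sum((Y - H @ Y) ** 2)       # residual sum of squares
    return rss / N + 2 * sigma**2 * dof / N

grid = np.logspace(-3, 2, 40)
eta_hat = min(grid, key=sure_score)
print("selected eta:", eta_hat)
```

Note the bias-variance trade-off made explicit by the score: increasing $\eta$ reduces the penalty term through larger regularization is false for dof, which *grows* with $\eta$, while the residual term shrinks, so the minimizer balances fit against equivalent degrees of freedom.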

#### **3.5.3.3 Excess Degrees of Freedom\***

In the previous subsection, we have discussed how to construct an unbiased estimator of the expected in-sample validation error, see (3.96), and how to use it for hyperparameter tuning, see (3.97). Irrespective of the particular method adopted for hyperparameter estimation, the estimate $\hat{\eta}$ of $\eta$ depends on the data $Y$, with the regression matrix $\Phi$ here assumed deterministic and known. We stress this by writing

$$
\hat{\eta} = \hat{\eta}(Y).
$$

Accordingly, the ReLS-Q estimate (3.58) with $\eta$ replaced by $\hat{\eta}(Y)$ becomes

$$
\hat{\theta}^{\mathbb{R}}(\hat{\eta}(Y)) = (\Phi^T \Phi + \sigma^2 P^{-1}(\hat{\eta}(Y)))^{-1} \Phi^T Y. \tag{3.98}
$$

Since $\hat{\eta}$ is a random vector, to design a truly unbiased estimator of the expected in-sample validation error of $\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))$ one should not use (3.96), since it assumes the hyperparameter $\eta$ constant.

In what follows, we will derive an unbiased estimator of the expected in-sample validation error of $\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))$. Such an estimator will thus also be able to account for the price of estimating model complexity (the degrees of freedom) from data. To this goal, we need the following version of Stein's Lemma [40], a simplified version of which was already introduced in Chap. 1.

**Lemma 3.2** (Stein's Lemma, adapted from [40]) *Consider the following additive measurement model:*

$$
x = \mu + \varepsilon, \qquad x, \mu, \varepsilon \in \mathbb{R}^p,
$$

*where* $\mu$ *is an unknown constant vector and* $\varepsilon \sim N(0, \Sigma)$*. Let* $\hat{\mu}(x)$ *be an estimator of* $\mu$ *based on the data* $x$ *such that* $\mathrm{Cov}(\hat{\mu}(x), x)$ *and* $E\left(\frac{\partial \hat{\mu}(x)}{\partial x}\right)$ *exist. Then*

$$\mathrm{Cov}(\hat{\mu}(x), x) = E\left(\frac{\partial \hat{\mu}(x)}{\partial x}\right) \Sigma.$$
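For a linear estimator $\hat{\mu}(x) = Sx$ the lemma reduces to $\mathrm{Cov}(\hat{\mu}(x), x) = S\Sigma$, since $\partial\hat{\mu}(x)/\partial x = S$. This special case is easy to verify by Monte Carlo; the matrices below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
mu = np.array([1.0, -1.0, 2.0])          # unknown constant vector
Sigma = np.diag([0.5, 1.0, 1.5])         # noise covariance (diagonal for simplicity)
S = 0.3 * rng.standard_normal((p, p))    # mu_hat(x) = S x, so d(mu_hat)/dx = S

M = 400_000                              # Monte Carlo samples of x = mu + eps
X = mu + rng.standard_normal((M, p)) @ np.sqrt(Sigma)
Mu_hat = X @ S.T

# Empirical Cov(mu_hat(x), x)
cov = (Mu_hat - Mu_hat.mean(0)).T @ (X - X.mean(0)) / M

print(np.round(cov, 3))
print(np.round(S @ Sigma, 3))            # matches cov up to Monte Carlo error
```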

Let

$$Y\_{\mathrm{v}} = \begin{bmatrix} y\_{\mathrm{v},1} \\ y\_{\mathrm{v},2} \\ \vdots \\ y\_{\mathrm{v},N} \end{bmatrix}, \quad E\_{\mathrm{v}} = \begin{bmatrix} e\_{\mathrm{v},1} \\ e\_{\mathrm{v},2} \\ \vdots \\ e\_{\mathrm{v},N} \end{bmatrix}, \tag{3.99}$$

so that (3.85) can be rewritten as

$$Y\_{\mathbf{v}} = \Phi \theta\_0 + E\_{\mathbf{v}}.\tag{3.100}$$

Now, let us consider the measurement model (3.92) and the validation data (3.100), assuming also that

$$E \sim N(0, \sigma^2 I\_N), \ E\_\mathbf{v} \sim N(0, \sigma^2 I\_N). \tag{3.101}$$

Then, using the correspondences

$$\begin{gathered} x = Y, \quad \mu = \Phi\theta\_0, \quad \hat{\mu}(x) = \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)), \quad \tilde{x} = Y\_{\mathrm{v}}, \quad \varepsilon = E, \quad \tilde{\varepsilon} = E\_{\mathrm{v}}, \quad \Sigma = \sigma^2 I\_N, \\ f(Y, \hat{\eta}) = \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)) = \Phi(\Phi^{T}\Phi + \sigma^2 P^{-1}(\hat{\eta}(Y)))^{-1}\Phi^{T}Y, \end{gathered}$$

together with (3.161) in the appendix to this chapter, one can prove that

$$\underbrace{E\Big[\frac{1}{N}E\big[\|Y\_{\mathrm{v}} - \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))\|\_2^2 \,\big|\, \mathcal{D}\_\mathrm{T}\big]\Big]}\_{\text{EVE}\_{\mathrm{in}}} - \underbrace{E\Big[\frac{1}{N}\|Y - \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))\|\_2^2\Big]}\_{\overline{\text{err}}(\eta)} = 2\frac{1}{N}\text{trace}(\text{Cov}(Y, \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)))).$$

Using Stein's Lemma, one has

$$\begin{split} \text{Cov}(Y, \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))) &= \sigma^{2} E\Big[\frac{df(Y, \hat{\eta})}{dY}\Big] \\ &= \sigma^{2} E\Big[\frac{\partial f(Y, \hat{\eta})}{\partial Y}\Big] + \sigma^{2} E\Big[\frac{\partial f(Y, \hat{\eta})}{\partial \hat{\eta}} \frac{\partial \hat{\eta}}{\partial Y}\Big] \\ &= \sigma^{2} E[\Phi(\Phi^{T}\Phi + \sigma^{2} P^{-1}(\hat{\eta}(Y)))^{-1}\Phi^{T}] + \sigma^{2} E\Big[\frac{\partial f(Y, \hat{\eta})}{\partial \hat{\eta}} \frac{\partial \hat{\eta}}{\partial Y}\Big]. \end{split}$$

Therefore, it holds that

$$\begin{split} \text{EVE}\_{\mathrm{in}} &= \overline{\text{err}}(\eta) + 2\sigma^{2}\frac{1}{N} E[\text{trace}(\Phi P(\hat{\eta}(Y))\Phi^{T}(\Phi P(\hat{\eta}(Y))\Phi^{T} + \sigma^{2} I\_N)^{-1})] \\ &\quad + 2\sigma^{2}\frac{1}{N}\text{trace}\Big(E\Big[\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}} \frac{\partial \hat{\eta}}{\partial Y}\Big]\Big) \\ &= \overline{\text{err}}(\eta) + 2\sigma^{2}\frac{\text{dof}(\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)))}{N} + 2\sigma^{2}\frac{1}{N}\text{trace}\Big(E\Big[\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}} \frac{\partial \hat{\eta}}{\partial Y}\Big]\Big). \end{split} \tag{3.102}$$

If $\hat{\eta} = \hat{\eta}(Y)$ were independent of $Y$, the above objective would coincide with the SURE score reported in (3.97). The difference is instead the presence of the term $2\sigma^2\frac{1}{N}\text{trace}\big(E\big[\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}}\frac{\partial \hat{\eta}}{\partial Y}\big]\big)$. It represents the extra optimism induced by the estimation of $\eta$ and is due to the randomness of the data $Y$ entering the hyperparameter estimator. The term $\text{trace}\big(E\big[\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}}\frac{\partial \hat{\eta}}{\partial Y}\big]\big)$ is called the excess degrees of freedom [33] and denoted by

$$\text{exdof}(\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))) = \text{trace}\Big(E\Big[\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}}\frac{\partial \hat{\eta}}{\partial Y}\Big]\Big).\tag{3.103}$$


From (3.102), we readily obtain an unbiased estimator of EVEin as follows:

$$\begin{split} \widehat{\mathrm{EVE}\_{\mathrm{in}}} &= \overline{\mathrm{err}}(\eta)\_{\mathcal{D}\_\mathrm{T}} + 2\sigma^{2}\frac{\mathrm{dof}(\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)))}{N} + 2\sigma^{2}\frac{\widehat{\mathrm{exdof}}(\hat{Y}(\hat{\eta}))}{N} \\ &= \frac{1}{N}\|Y - \Phi\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y))\|\_{2}^{2} + 2\sigma^{2}\frac{\mathrm{dof}(\hat{\theta}^{\mathrm{R}}(\hat{\eta}(Y)))}{N} \\ &\quad + 2\sigma^{2}\frac{1}{N}\mathrm{trace}\Big(\frac{\partial f(Y,\hat{\eta})}{\partial \hat{\eta}}\frac{\partial \hat{\eta}}{\partial Y}\Big), \end{split} \tag{3.104}$$

where $\widehat{\mathrm{exdof}}(\hat{Y}(\hat{\eta}))$ is an unbiased estimator of $\mathrm{exdof}(\hat{Y}(\hat{\eta}))$. As discussed in [33], (3.104) can be used to compare different regularized estimators also in terms of the different complexity of the hyperparameter tuning strategies that they adopt.

## **3.6 Regularized Least Squares with Other Types of Regularizers**

The general ReLS criterion assumes the following form

$$\hat{\theta}^{\mathbb{R}} = \underset{\theta}{\text{arg min}} \; \|Y - \Phi \theta\|\_2^2 + \gamma J(\theta).$$

The different choices of the regularization term $J(\theta)$ depend on the prior knowledge regarding $\theta\_0$. Having discussed the quadratic penalty, we will now consider two other important choices for $J(\theta)$: the $\ell\_1$-norm and the nuclear norm.

## *3.6.1* $\ell\_1$*-Norm Regularization*

ReLS with $\ell\_1$-norm regularization leads to

$$\hat{\theta}^{\mathrm{R}} = \underset{\theta}{\text{arg min}}\ \|Y - \Phi\theta\|\_2^2 + \gamma\|\theta\|\_1,\tag{3.105}$$

where $\|\theta\|\_1$ represents the $\ell\_1$-norm of $\theta$, i.e., $\|\theta\|\_1 = \sum\_{i=1}^n |\theta\_i|$ with $\theta\_i$ being the *i*th element of $\theta$. The problem (3.105) is also known as the least absolute shrinkage and selection operator (LASSO) [42] and is equivalently defined as follows:

$$\underset{\theta}{\text{arg min}} \left\| Y - \Phi \theta \right\|\_{2}^{2}, \text{ subject to } \left\| \theta \right\|\_{1} \leq \beta,\tag{3.106}$$

where $\beta \geq 0$ is a tuning parameter, connected with $\gamma$, that controls the sparsity of $\theta$.

#### **3.6.1.1 Computation of Sparse Solutions**

LASSO (3.105) has been widely used for finding sparse solutions. In signal processing, this problem has wide applications in compressive sensing, for finding sparse signal representations from redundant dictionaries. In machine learning and statistics, it has also been applied extensively for variable selection, where the aim is to select a subset of relevant variables to use in model construction.

Recall that a vector $\theta \in \mathbb{R}^n$ is said to be sparse if $\|\theta\|\_0 \ll n$, where $\|\theta\|\_0$ is the $\ell\_0$-norm of $\theta$, which counts the number of nonzero elements of $\theta$. For linear regression models, sparse estimation requires finding a sparse $\theta$ able to fit the data well, i.e., such that $\|Y - \Phi\theta\|\_2^2$ is small. More formally, the problem is defined as follows:

$$\min\_{\theta} \|\theta\|\_{0}, \text{ subject to } \|Y - \Phi\theta\|\_{2}^{2} \le \varepsilon,\tag{3.107}$$

where $Y \in \mathbb{R}^N$, $\theta \in \mathbb{R}^n$ with $n > N$, $\Phi \in \mathbb{R}^{N \times n}$ is assumed of full rank, i.e., $\text{rank}(\Phi) = N$, and $\varepsilon \geq 0$ is a tuning parameter that controls the data fit.

The problem (3.107) is known to be NP-hard, e.g., [31]. It is combinatorial, and finding its solution requires an exhaustive search. Hence, one needs approximate methods. The most popular technique relies on a convex relaxation of (3.107) obtained by replacing the $\ell\_0$-norm with the $\ell\_1$-norm:

$$\min\_{\theta} \|\theta\|\_1, \text{ subject to } \|Y - \Phi\theta\|\_2^2 \le \varepsilon. \tag{3.108}$$

By using the method of Lagrange multipliers, it can be shown that the convex relaxation (3.108) is equivalent to LASSO (3.105).

A natural question is whether or not the solution of LASSO (3.105) can be sparse. The answer is affirmative. For illustration, we first show this feature when the regression matrix Φ is orthogonal and assuming *N* = *n*.

#### **3.6.1.2 LASSO Using an Orthogonal Regression Matrix**

Let us consider (3.105) with an orthogonal regression matrix $\Phi$, i.e., $\Phi^T\Phi = \Phi\Phi^T = I\_n$ (so that $N = n$). Then (3.105) can be rearranged as follows:

$$\begin{split} \hat{\theta}^{\mathrm{R}} &= \underset{\theta}{\text{arg min}}\ \|(\Phi^{T}\Phi)^{-1}\Phi^{T}(Y - \Phi\theta)\|\_{2}^{2} + \gamma\|\theta\|\_{1} \\ &= \underset{\theta}{\text{arg min}}\ \|\hat{\theta}^{\mathrm{LS}} - \theta\|\_{2}^{2} + \gamma\|\theta\|\_{1} \\ &= \underset{\theta}{\text{arg min}} \sum\_{i=1}^{n} (\hat{\theta}\_{i}^{\mathrm{LS}} - \theta\_{i})^{2} + \gamma|\theta\_{i}|, \end{split} \tag{3.109}$$

where $\hat{\theta}\_i^{\mathrm{LS}}$ is the *i*th element of $\hat{\theta}^{\mathrm{LS}}$.

To derive the optimal solution θˆR, we first recall the definition of subderivative and subdifferential of a convex function *<sup>f</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup> with *<sup>X</sup>* being an open interval. The subderivative of a convex function *<sup>f</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup> at a point *<sup>x</sup>*<sup>0</sup> in the open interval *X* is a real number *a* such that

$$f(\mathbf{x}) - f(\mathbf{x}\_0) \ge a(\mathbf{x} - \mathbf{x}\_0)$$

for all *x* in *X*. It can be shown that there exist *b* and *c* with *b* ≤ *c* such that the set of subderivatives at *x*<sup>0</sup> for a convex function is a nonempty closed interval [*b*, *c*], where *b* and *c* are the one-sided limits defined as follows:

$$b = \lim\_{\mathbf{x} \to \mathbf{x}\_0^-} \frac{f(\mathbf{x}) - f(\mathbf{x}\_0)}{\mathbf{x} - \mathbf{x}\_0}, \quad c = \lim\_{\mathbf{x} \to \mathbf{x}\_0^+} \frac{f(\mathbf{x}) - f(\mathbf{x}\_0)}{\mathbf{x} - \mathbf{x}\_0}.$$

The closed interval [*b*, *c*] is called the subdifferential of *f* (*x*) at the point *x*0.

Then, considering (3.109), $\hat{\theta}^{\mathrm{R}}$ is an optimal solution if

$$-2(\hat{\theta}\_i^{\mathrm{LS}} - \hat{\theta}\_i^{\mathrm{R}}) + \gamma\,\partial|\hat{\theta}\_i^{\mathrm{R}}| = 0, \ i = 1, 2, \dots, n,\tag{3.110}$$

where $\hat{\theta}\_i^{\mathrm{R}}$ is the *i*th element of $\hat{\theta}^{\mathrm{R}}$ and $\partial|\hat{\theta}\_i^{\mathrm{R}}|$ represents the subdifferential of $|\theta\_i|$ evaluated at $\hat{\theta}\_i^{\mathrm{R}}$, which is equal to

$$\partial|\hat{\theta}\_i^{\mathrm{R}}| = \begin{cases} \{\text{sign}(\hat{\theta}\_i^{\mathrm{R}})\}, & \hat{\theta}\_i^{\mathrm{R}} \neq 0 \\ [-1, 1], & \hat{\theta}\_i^{\mathrm{R}} = 0 \end{cases}, \quad i = 1, 2, \dots, n. \tag{3.111}$$

Using (3.110) and (3.111), we obtain the following explicit solution of LASSO for orthogonal Φ:

$$\hat{\theta}\_{i}^{\mathrm{R}} = \text{sign}(\hat{\theta}\_{i}^{\mathrm{LS}}) \max \left\{ 0, |\hat{\theta}\_{i}^{\mathrm{LS}}| - \frac{\gamma}{2} \right\}, \ i = 1, 2, \ldots, n. \tag{3.112}$$

From (3.112) one can see that the solution of LASSO will be sparse if many elements of $\hat{\theta}^{\mathrm{LS}}$ have absolute value smaller than $\gamma/2$. So, $\gamma$ can be used to tune the sparsity of $\theta$. It can also be seen that the nonzero elements of the solution of LASSO are biased and that, compared with the LS solution, they are shrunk towards zero (translated towards zero by the constant amount $\gamma/2$).
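The soft-thresholding rule (3.112) is a one-liner to implement, and it can be checked against a brute-force scalar minimization of (3.109); the numbers below are arbitrary illustrative values:

```python
import numpy as np

def lasso_orthogonal(theta_ls, gamma):
    """Soft thresholding (3.112): the LASSO solution when Phi is orthogonal."""
    return np.sign(theta_ls) * np.maximum(0.0, np.abs(theta_ls) - gamma / 2)

theta_ls = np.array([1.5, -0.3, 0.05, -2.0])
gamma = 0.8
theta_r = lasso_orthogonal(theta_ls, gamma)
print(theta_r)   # small coefficients are set exactly to zero, large ones shrunk by gamma/2

# Sanity check: the closed form matches a fine grid search on each scalar problem
grid = np.linspace(-3, 3, 100_001)
for tls, tr in zip(theta_ls, theta_r):
    cost = (tls - grid) ** 2 + gamma * np.abs(grid)
    assert abs(grid[np.argmin(cost)] - tr) < 1e-3
```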

#### **3.6.1.3 LASSO Using a Generic Regression Matrix: Geometric Interpretation**

For a generic non-orthogonal $\Phi$, LASSO in general has no explicit solution. To understand why it can still induce sparse solutions, we can use the geometric interpretation of LASSO in the form (3.106) with $\theta \in \mathbb{R}^2$. In Fig. 3.12, one can see that for the first case coloured in blue (resp., the third case coloured in brown), if

**Fig. 3.12** Geometric interpretation of the solution of LASSO in the form (3.106) with non-orthogonal $\Phi$ and $\theta = [\theta\_1\ \theta\_2]^T \in \mathbb{R}^2$. First, the large grey square represents the constraint $\|\theta\|\_1 \leq \beta$. Then, three cases are considered, coloured in blue, red and brown, respectively. For each case, the tiny square represents the least squares estimate $\hat{\theta}^{\mathrm{LS}}$, the elliptical contours represent the level curves of $\|Y - \Phi\theta\|\_2^2$ centred at $\hat{\theta}^{\mathrm{LS}}$, and the cross represents the solution of LASSO (3.106). For the first case (blue), the cross falls at the top corner of the large grey square, implying that the $\theta\_1$-element of the solution of LASSO (3.106) is zero. For the second case (red), the cross and the tiny square coincide, implying that the least squares estimate $\hat{\theta}^{\mathrm{LS}}$ is also the solution of LASSO (3.106), whose two components are both nonzero. For the third case (brown), the cross falls at the right corner of the large grey square, implying that the $\theta\_2$-element of the solution of LASSO (3.106) is zero

the elliptical contour is rotated slightly about the axis perpendicular to the paper and through the blue (resp., brown) cross, the optimal solution of (3.106) will still have a zero $\theta\_1$-element (resp., $\theta\_2$-element). This explains why LASSO can often induce sparse solutions with a suitable choice of the regularization parameter.

Finally, since the cost function of LASSO (3.105) is a convex function of θ, many standard convex optimization software packages are available to obtain numerical solutions of LASSO very efficiently, such as YALMIP [26], CVX [19], CVXOPT [3], CVXPY [11].
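As an alternative to such general-purpose solvers, LASSO can also be solved with a few lines of proximal gradient (ISTA) iterations, which apply the soft-thresholding operator of (3.112) at each step. This is only an illustrative sketch (problem sizes, sparsity pattern and $\gamma$ are arbitrary), not the API of the packages just mentioned:

```python
import numpy as np

def lasso_ista(Phi, Y, gamma, iters=5000):
    """Solve LASSO (3.105) by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = 2 * Phi.T @ (Phi @ theta - Y)     # gradient of the quadratic term
        z = theta - grad / L
        # proximal step = soft thresholding with level gamma / L
        theta = np.sign(z) * np.maximum(0.0, np.abs(z) - gamma / L)
    return theta

rng = np.random.default_rng(6)
N, n = 40, 20
Phi = rng.standard_normal((N, n))
theta0 = np.zeros(n)
theta0[[2, 7, 11]] = [1.5, -2.0, 1.0]            # sparse "true" parameter
Y = Phi @ theta0 + 0.05 * rng.standard_normal(N)

theta_r = lasso_ista(Phi, Y, gamma=1.0)
print("nonzero entries:", np.flatnonzero(np.abs(theta_r) > 1e-6))
```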

**Example 3.7** (*Polynomial regression: LASSO*) We revisit the polynomial regression Examples (3.26) and (3.27) using LASSO (3.105). In particular, we set the model order to $n = 16$, with the regression matrix $\Phi$ built according to (3.4) and (3.12). Moreover, we let $\gamma = \gamma\_i$, $i = 1, \ldots, 16$ with $\gamma\_1 = 0.01$, $\gamma\_{16} = 0.31$ and $\gamma\_2, \ldots, \gamma\_{15}$ evenly spaced between $\gamma\_1$ and $\gamma\_{16}$. For each $\gamma = \gamma\_i$, we compute the corresponding solution of LASSO (3.105). In particular, the estimates $\hat{g}(x) = \phi(x)^T\hat{\theta}^{\mathrm{R}}$ for $x = x\_i$, with $i = 1, \ldots, 40$, are plotted in Fig. 3.13.

**Fig. 3.13** Polynomial regression: true function $g(x)$ (blue) and LASSO estimates (thin) for different values of the regularization parameter $\gamma$

The model fits (3.28) obtained for different γ are shown in Fig. 3.14. One can see that γ = 0.15 gives the best result.

Finally, the LASSO estimates of the components of $\theta$ obtained using the different values of $\gamma$ are shown in Fig. 3.15. It is evident that the LASSO estimate (3.105) is sparse. Comparing it with the ridge regression estimates reported in Fig. 3.11, one can conclude that LASSO may give a simpler model, i.e., one depending only on a limited number of components of $\theta$.

#### **3.6.1.4 Sparsity Inducing Regularizers Beyond the** $\ell\_1$**-Norm**

We have seen that the $\ell\_1$-norm plays a key role in sparse estimation. However, as shown in [34], there are many other sparsity inducing regularizers. Let $l$ be any concave and nondecreasing function on $[0, \infty)$, three examples being reported in the top panel of Fig. 3.16. Then, other penalties which promote sparsity assume the form $J(\theta) = \sum\_{i=1}^n l(\theta\_i^2)$ and are given by

$$l(\eta) = \eta^{\frac{p}{2}},\ p \in (0, 2) \quad \stackrel{\eta = \theta\_i^2}{\implies} \quad J(\theta) = \sum\_{i=1}^{n} |\theta\_i|^p, \tag{3.113}$$

$$l(\eta) = \log(|\eta|^{\frac{1}{2}} + \varepsilon),\ \varepsilon > 0 \quad \stackrel{\eta = \theta\_i^2}{\implies} \quad J(\theta) = \sum\_{i=1}^{n} \log(|\theta\_i| + \varepsilon).$$

**Fig. 3.14** Polynomial regression: profile of the model fit (3.28) obtained by LASSO as a function of the regularization parameter γ

**Fig. 3.15** Polynomial regression: profile of the estimates of each component forming the LASSO estimate (3.105). For each value $k = 0, \ldots, 15$ on the *x*-axis, the plot reports the estimates of the coefficient of the monomial $x^k$ obtained by using different values of the regularization parameter $\gamma$

**Fig. 3.16** The top panel shows profiles of $l(\theta\_i)$ given by $\theta\_i^{0.05}$, $\log(\theta\_i^{0.5} + 1)$ and $\theta\_i^{0.5}$ with $\theta\_i$ ranging over $[0, 1]$. The bottom panel displays profiles of sparsity inducing penalties $l(\theta\_i^2)$ given by $|\theta\_i|^{0.1}$, $\log(|\theta\_i| + 1)$ and $|\theta\_i|$ with $\theta\_i$ ranging over $[-1, 1]$

Some of them are displayed in the bottom panel of Fig. 3.16. The use of nonconvex penalties may increase the sparsity of the solution, but the drawback is that one must handle optimization problems possibly exposed to local minima.

#### **3.6.1.5 Presence of Outliers and Robust Regression**

In practical applications, it may happen that the measurement outputs *yi* so far described by the model

$$\mathbf{y}\_{i} = \boldsymbol{\phi}\_{i}^{T}\boldsymbol{\theta}\_{0} + \boldsymbol{e}\_{i}, \quad i = 1, \ldots, N$$

may be contaminated by outliers which represent unexpected noise model deviations. They can be due to the failure of some sensors or to mistakes in the setting of the experiment. In this case, data can actually be generated by the following system:

$$y\_i = \phi\_i^T \theta\_0 + e\_i + \nu\_{0,i}, \quad i = 1, \ldots, N,\tag{3.114}$$

where the $e\_i$ form a white noise with mean zero and variance $\sigma^2$, while the $v\_{0,i}$ represent the outliers, which are assumed to be zero most of the time. Hence, the vector

$$V\_0 = \begin{bmatrix} \nu\_{0,1} \ \nu\_{0,2} \ \dots \ \nu\_{0,N} \end{bmatrix}^T$$

is assumed to be sparse.

When data come from (3.114), straightforward application of the LS method may lead to a poor estimate $\hat{\theta}^{\text{LS}}$ of $\theta\_0$. For illustration, let us consider an extreme case by assuming $v\_{0,i} = 0$ for $i = 1, 2, \ldots, N - 1$, while the $|\phi\_i^T \theta\_0 + e\_i|$ for $i = 1, \ldots, N$ are all negligible compared to $|v\_{0,N}|$. LS leads to

$$\begin{split} \hat{\theta}^{\text{LS}} &= \operatorname\*{arg\,min}\_{\boldsymbol{\theta}} \sum\_{i=1}^{N} (\mathbf{y}\_{i} - \boldsymbol{\phi}\_{i}^{T}\boldsymbol{\theta})^{2} \\ &= \operatorname\*{arg\,min}\_{\boldsymbol{\theta}} \sum\_{i=1}^{N-1} (\boldsymbol{\phi}\_{i}^{T}\boldsymbol{\theta}\_{0} + \boldsymbol{e}\_{i} - \boldsymbol{\phi}\_{i}^{T}\boldsymbol{\theta})^{2} \\ &\quad + (\boldsymbol{\phi}\_{N}^{T}\boldsymbol{\theta}\_{0} + \boldsymbol{e}\_{N} + \boldsymbol{\nu}\_{0,N} - \boldsymbol{\phi}\_{N}^{T}\boldsymbol{\theta})^{2}. \end{split}$$

The first $N - 1$ terms in the above cost function are the same as those encountered in the absence of outliers, while the last term is different due to $v\_{0,N}$. The $|\phi\_i^T \theta\_0 + e\_i|$, $i = 1, \ldots, N$, are negligible compared to $|v\_{0,N}|$, a phenomenon further amplified by the quadratic criterion adopted here. To make the last term as small as possible, $\hat{\theta}^{\text{LS}}$ will mainly tend to fit only $v\_{0,N}$. Hence, the terms $\phi\_i^T \theta\_0 + e\_i$ which carry information on the true system will be little regarded. This will lead to a poor estimate of $\theta\_0$.

Many robust regression methods are available, hinging on loss functions less sensitive to outliers than the square loss. An example is Huber estimation

$$\hat{\theta}^{\text{Huber}} = \underset{\theta}{\text{arg min}} \sum\_{i=1}^{N} l^{\text{Huber}}(\mathbf{y}\_i - \phi\_i^T \theta) \tag{3.115}$$

where the Huber loss function *l* Huber is defined as follows:

$$l^{\text{Huber}}(\mathbf{x}) = \begin{cases} \mathbf{x}^2, & |\mathbf{x}| < \frac{\gamma}{2} \\\ \gamma |\mathbf{x}| - \frac{1}{4} \gamma^2, & |\mathbf{x}| \ge \frac{\gamma}{2} \end{cases}. \tag{3.116}$$

In (3.116), the parameter γ > 0 is a tuning parameter whose role will become clear shortly. The Huber loss function (3.116) is less sensitive to outliers because it grows linearly for $|x| \ge \gamma/2$. Note that a limit case of the Huber loss is the 1-norm, obtained (up to a scale factor) as γ tends to zero.
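A minimal sketch of the Huber loss (3.116), showing how mildly it penalizes an outlier-sized residual compared with the square loss (the value of γ and the residual are arbitrary choices for illustration):

```python
import numpy as np

def huber(x, gamma):
    # Huber loss (3.116): quadratic for |x| < gamma/2, linear beyond
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < gamma / 2,
                    x ** 2,
                    gamma * np.abs(x) - gamma ** 2 / 4)

gamma = 2.0
r = 100.0  # an outlier-sized residual
print(r ** 2)            # square loss: 10000.0
print(huber(r, gamma))   # Huber loss: 2*100 - 1 = 199.0
# The two branches match at |x| = gamma/2, so the loss is continuous:
print(huber(gamma / 2, gamma))  # 1.0 from either branch
```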

#### **3.6.1.6 An Equivalence Between 1-Norm Regularization and Huber Estimation**

Let

$$\begin{aligned} \tilde{\mathbf{y}}\_i &= \mathbf{y}\_i - \boldsymbol{\phi}\_i^T \boldsymbol{\theta}, \quad i = 1, 2, \dots, N, \\ \tilde{\mathbf{Y}} &= \begin{bmatrix} \tilde{\mathbf{y}}\_1 \ \tilde{\mathbf{y}}\_2 \ \dots \ \tilde{\mathbf{y}}\_N \end{bmatrix}^T. \end{aligned}$$

Consider the 1-norm regularization given by

$$\mathop{\rm arg\,min}\_{\theta, V\_0} \sum\_{i=1}^{N} (\mathbf{y}\_{i} - \phi\_{i}^{T}\theta - \nu\_{0,i})^{2} + \gamma |\nu\_{0,i}| \tag{3.117}$$

whose peculiarity is to require joint optimization w.r.t. the parameter vector θ and the outliers $v\_{0,i}$ contained in $V\_0$. Interestingly, (3.117) is actually equivalent to Huber estimation (3.115), i.e., they have the same optimal solution. To show this, one just needs to prove that

$$\sum\_{i=1}^{N} l^{\text{Huber}}(\mathbf{y}\_i - \phi\_i^T \theta) = \min\_{V\_0} \|\tilde{Y} - V\_0\|\_2^2 + \gamma \|V\_0\|\_1. \tag{3.118}$$

The right-hand side of (3.118) corresponds to LASSO (3.105) with an orthogonal regression matrix given by the identity. It thus follows from (3.112) that the components of the optimal solution $\hat{V}\_0^{\text{R}}$ admit the following closed-form expression:

$$\hat{\nu}\_{0,i}^{\text{R}} = \text{sign}(\tilde{\mathbf{y}}\_i) \max \left\{ 0, |\tilde{\mathbf{y}}\_i| - \frac{\gamma}{2} \right\}, \ i = 1, 2, \dots, N. \tag{3.119}$$

Now we replace $V\_0$ in the cost function of the right-hand side of (3.118) with $\hat{V}\_0^{\text{R}}$, and it is straightforward to check that the following identity holds:

$$\sum\_{i=1}^{N} l^{\text{Huber}}(\mathbf{y}\_i - \phi\_i^T \theta) = \|\tilde{Y} - \hat{V}\_0^{\text{R}}\|\_2^2 + \gamma \|\hat{V}\_0^{\text{R}}\|\_1. \tag{3.120}$$

Therefore, (3.117) is indeed equivalent to the Huber estimation (3.115).
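The identity (3.118) can be checked numerically for a fixed θ. The sketch below (hypothetical data) uses the soft-thresholding minimizer $\hat{\nu}\_{0,i} = \mathrm{sign}(\tilde{y}\_i)\max\{0, |\tilde{y}\_i| - \gamma/2\}$, where values below the threshold are set to zero:

```python
import numpy as np

def huber(x, gamma):
    # Huber loss (3.116)
    return np.where(np.abs(x) < gamma / 2, x**2, gamma * np.abs(x) - gamma**2 / 4)

rng = np.random.default_rng(0)
gamma = 1.5
y_tilde = 3.0 * rng.normal(size=50)  # residuals y_i - phi_i^T theta, fixed theta

# Closed-form minimizer: soft thresholding of each residual at gamma/2
v_hat = np.sign(y_tilde) * np.maximum(0.0, np.abs(y_tilde) - gamma / 2)

lhs = np.sum(huber(y_tilde, gamma))
rhs = np.sum((y_tilde - v_hat)**2) + gamma * np.sum(np.abs(v_hat))
print(np.isclose(lhs, rhs))  # True: the two costs coincide for any fixed theta
```

Since the two cost functions agree for every θ, minimizing either one over θ yields the same estimate, which is the content of the equivalence above.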

## *3.6.2 Nuclear Norm Regularization*

So far the output *Y* , the parameter θ and the noise *E* in (3.13) have been assumed to be vectors. In what follows, we allow them to be matrices and consider the following linear regression model:

$$Y = \Phi\theta\_0 + E, \quad Y \in \mathbb{R}^{N \times m}, \quad \Phi \in \mathbb{R}^{N \times n}, \ \theta\_0 \in \mathbb{R}^{n \times m}, \ E \in \mathbb{R}^{N \times m}. \tag{3.121}$$

The ReLS with nuclear norm regularization takes the following form:

$$\hat{\theta}^{\mathbb{R}} = \underset{\theta}{\text{arg min}} \left\| Y - \Phi \theta \right\|\_{F}^{2} + \gamma \left\| h(\theta) \right\|\_{\*},\tag{3.122}$$

where $\|\cdot\|\_F$ is the Frobenius norm of a matrix, $h(\theta)$ is a matrix that is affine in θ and $\|h(\theta)\|\_\*$ is the nuclear norm of the matrix $h(\theta)$; see also Sect. 3.8.1, the appendix to this chapter, for a brief review of matrix and vector norms.

#### **3.6.2.1 Nuclear Norm Regularization for Matrix Rank Minimization**

Matrix rank minimization problems (RMP) are a class of optimization problems that involve minimizing the rank of a matrix subject to convex constraints. They are often encountered in signal processing, image processing and statistics. For example, a typical statistical problem is to obtain a low-rank covariance matrix able to describe some available data and/or consistent with some prior assumptions. Formally, the RMP is defined as follows:

$$\text{RMP:} \quad \begin{array}{c} \min\limits\_{X} \text{rank}(X) \\\ \text{subj. to } X \in \mathfrak{C} \subset \mathbb{R}^{n \times m}, \end{array} \tag{3.123}$$

with *X* belonging to a convex set C while rank(*X*) describes the order (complexity) of the underlying model.

In general, the RMP (3.123) is NP-hard and thus there is need for approximate methods. Several heuristic methods have been proposed, such as the nuclear norm heuristic [14] and the log-det heuristic [15]. In particular, for a convex set $\mathfrak{C}$ the convex envelope of a function $f : \mathfrak{C} \to \mathbb{R}$ is defined as the largest convex function $g$ such that $g(x) \le f(x)$ for every $x \in \mathfrak{C}$, e.g., [22]. For a nonconvex $f$, solving

$$\min\_{\mathbf{x}\in\mathfrak{C}} f(\mathbf{x})\tag{3.124}$$

may be difficult. In this case, if it is possible to derive the convex envelope *g* of *f* , then

$$\min\_{\mathbf{x}\in\mathfrak{C}}\mathbf{g}(\mathbf{x})\tag{3.125}$$

turns out a convex approximation of (3.124) and, in particular, the minimum of (3.125) can represent a lower bound of that of (3.124). Moreover, if necessary, the minimizing argument of (3.125) can be chosen as the initial point for a more complicated nonconvex local search aiming to solve (3.124).

As shown in Theorem 1 of [13, Chap. 5], the convex envelope of the rank function $\text{rank}(X)$ with $X \in \mathfrak{C} = \{X \mid \|X\|\_2 \le 1, \ X \in \mathbb{R}^{n \times m}\}$ is the nuclear norm of $X$, i.e., $\|X\|\_\*$. As a result, the nuclear norm heuristic to solve the RMP (3.123) is obtained by replacing the rank of $X$ with the nuclear norm of $X$, i.e.,

$$\text{Nuclear norm heuristic: } \begin{array}{c} \min\_{X} \|X\|\_{\*} \\ \text{subj. to } X \in \mathfrak{C} \subset \mathbb{R}^{n \times m} . \end{array} \tag{3.126}$$

Without loss of generality, we assume that $X \in \mathfrak{C} = \{X \mid \|X\|\_2 \le M, \ X \in \mathbb{R}^{n \times m}\}$ for some $M > 0$. Then, from the definition of the convex envelope, for $X \in \mathfrak{C}$ we have

$$\left\| \frac{X}{M} \right\|\_{\*} \leq \text{rank}\left(\frac{X}{M}\right) \implies \frac{1}{M} \|X\|\_{\*} \leq \text{rank}(X).$$

In addition

$$\frac{1}{M} \| X^{\text{copt}} \|\_{\*} \le \text{rank}(X^{\text{opt}}) \le \text{rank}(X^{\text{copt}}),\tag{3.127}$$

where *X*opt and *X*copt denote the optimal solution of the RMP (3.123) and that of the nuclear norm heuristic (3.126), respectively. The inequalities in (3.127) thus provide an upper and lower bound for the optimal solution of the RMP (3.123).
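A quick numpy check of the lower bound $\frac{1}{M}\|X\|\_\* \le \text{rank}(X)$ on a random low-rank matrix (sizes and seed are arbitrary), taking $M$ equal to the spectral norm of $X$ so that $\|X\|\_2 \le M$ holds by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
# A random 30x20 matrix of rank 5 (product of two thin factors)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 20))

sv = np.linalg.svd(X, compute_uv=False)
nuclear = sv.sum()    # ||X||_*
spectral = sv[0]      # ||X||_2
rank = np.linalg.matrix_rank(X)

M = spectral  # ||X||_2 <= M by construction
# Convex-envelope bound: (1/M) ||X||_* <= rank(X)
print(nuclear / M <= rank + 1e-9)  # True
```

The bound holds with equality only when all nonzero singular values are equal; in general the gap measures how far the nuclear norm is from the rank on this particular matrix.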

As shown in [13, Chap. 5], the nuclear norm heuristic (3.126) can be equivalently formulated as a semidefinite program (SDP):

$$\begin{aligned} \min\_{X,Y,Z} & \text{trace}\, Y + \text{trace}\, Z\\ \text{subj.to} & \begin{bmatrix} Y & X\\ X^T & Z \end{bmatrix} \succeq 0, \ X \in \mathfrak{C}, \end{aligned} \tag{3.128}$$

where *<sup>Y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>n</sup>*×*<sup>n</sup>*, *<sup>Z</sup>* <sup>∈</sup> <sup>R</sup>*<sup>m</sup>*×*<sup>m</sup>* and both *<sup>Y</sup>* and *<sup>Z</sup>* are symmetric. The SDP problem (3.128) can be solved by interior point methods. For this purpose, some convex optimization software packages which can be used include YALMIP [26], CVX [19], CVXOPT [3] and CVXPY [11].
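Even without an SDP solver, the construction behind (3.128) can be sanity-checked with numpy: given the SVD $X = U \Sigma V^T$, the choices $Y = U \Sigma U^T$ and $Z = V \Sigma V^T$ are feasible for (3.128) and attain $\text{trace}\,Y + \text{trace}\,Z = 2\|X\|\_\*$ (with the objective as written, the optimal value is twice the nuclear norm; halving the objective recovers $\|X\|\_\*$ exactly). A sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = U @ np.diag(s) @ U.T     # 4x4 symmetric PSD
Z = Vt.T @ np.diag(s) @ Vt   # 3x3 symmetric PSD

# The block matrix [[Y, X], [X^T, Z]] equals [U; V] diag(s) [U; V]^T, hence PSD
B = np.block([[Y, X], [X.T, Z]])
eigvals = np.linalg.eigvalsh(B)
print(eigvals.min() >= -1e-10)                              # True: feasible point
print(np.isclose(np.trace(Y) + np.trace(Z), 2 * s.sum()))   # True: objective = 2 ||X||_*
```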

#### **3.6.2.2 Application in Covariance Matrix Estimation with Low-Rank Structure**

Now we go back to the linear regression model (3.121) and the ReLS with nuclear norm regularization (3.122). Consider the problem of covariance matrix estimation with low-rank structure, e.g., [38]. In particular, in (3.121), we take $N = m = n$, let $Y$ be a sample covariance matrix, $\Phi = I\_n$, and let $\theta\_0$ be a positive semidefinite matrix which has low-rank structure. Moreover, in (3.122), we take $h(\theta) = \theta$. We can then obtain a matrix estimate $\hat{\theta}^{\text{R}}$ with low-rank structure using ReLS with nuclear norm regularization as follows:

$$
\hat{\theta}^{\mathbb{R}} = \underset{\theta}{\text{arg min}} \left\| Y - \theta \right\|\_{F}^{2} + \gamma \left\| \theta \right\|\_{\*}, \tag{3.129}
$$

for a suitable choice of γ > 0. An example is reported below.

**Example 3.8** (*Covariance matrix estimation problem*) First, we construct a block-diagonal rank-deficient covariance matrix $\theta\_0$ that has 4 blocks denoted by $A\_i \in \mathbb{R}^{n\_i \times n\_i}$ with $n\_1 = 20$, $n\_2 = 10$, $n\_3 = 5$ and $n\_4 = 15$. Using *blkdiag* to represent a block-diagonal matrix, one thus has $\theta\_0 = \textit{blkdiag}(A\_1, A\_2, A\_3, A\_4)$. Each $A\_i$ is generated by summing up $v\_{i,j} v\_{i,j}^T$, $j = 1, \ldots, n\_i - 2$, where the $v\_{i,j}$ are $n\_i$-dimensional vectors with components independent and uniformly distributed on $[-1, 1]$. It follows that $\text{rank}(\theta\_0) = 42$ since the rank of each $i$th block is $n\_i - 2$. Then we draw 20000 samples $x\_i$ from the Gaussian distribution $\mathcal{N}(0, \theta\_0)$. The available measurements are $z\_i = x\_i + e\_i$ where the $e\_i$ are independent and distributed as $\mathcal{N}(0, 0.6)$. Using the $z\_i$ we calculate the sample covariance $Y$ as follows:

$$Y = \frac{1}{20000} \sum\_{i=1}^{20000} (z\_i - \overline{z})(z\_i - \overline{z})^T, \quad \overline{z} = \frac{1}{20000} \sum\_{i=1}^{20000} z\_i. \tag{3.130}$$

We solve the ReLS problem (3.129) with the data $Y$ defined above and γ in the set {0.1411, 0.1414, 0.1419, 0.1423, 0.1427}, obtaining different estimates $\hat{\theta}^{\text{R}}$ of the covariance matrix.

The top panel of Fig. 3.17 shows the base 10 logarithm of the 50 estimated singular values. Each profile is obtained with a different regularization parameter. Such results show that, seeing the tiny singular values as null, a suitable value of the regularization parameter, like γ = 0.1427, leads to $\text{rank}(\hat{\theta}^{\text{R}}) = 42$. Note in fact that the green curve, which is associated with such γ, has a jump towards zero when passing from 42 to 43 on the $x$-axis. The influence of the nuclear norm regularization is also visible in the bottom panel, which shows the profile of the relative error of $\hat{\theta}^{\text{R}}$ as a function of γ. When γ is small, e.g., γ = 0.1411, the influence is invisible: $\hat{\theta}^{\text{R}}$ is almost the same as the sample covariance $Y$ and $\text{rank}(\hat{\theta}^{\text{R}}) = 50$. When γ becomes larger, the regularization influence becomes more visible, making $\hat{\theta}^{\text{R}}$ closer to the true covariance $\theta\_0$.
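For the special case $\Phi = I\_n$, problem (3.129) admits a closed-form solution obtained by soft-thresholding the singular values of $Y$ at level γ/2 (a standard result on the proximal operator of the nuclear norm, by the same argument used for (3.119)). A sketch with hypothetical sizes, not the exact setup of Example 3.8:

```python
import numpy as np

def svt(Y, gamma):
    # Closed-form solution of min_theta ||Y - theta||_F^2 + gamma ||theta||_*:
    # soft-threshold the singular values of Y at gamma/2
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - gamma / 2, 0.0)
    return U @ np.diag(s_thr) @ Vt

rng = np.random.default_rng(3)
# Rank-3 "covariance" plus small symmetric noise (hypothetical sizes)
A = rng.normal(size=(10, 3))
Y = A @ A.T + 0.01 * rng.normal(size=(10, 10))
Y = (Y + Y.T) / 2

theta_hat = svt(Y, gamma=0.5)
# The noisy sample covariance is full rank; thresholding kills the small
# singular values, so the regularized estimate has lower rank.
print(np.linalg.matrix_rank(Y), np.linalg.matrix_rank(theta_hat))
```

This is why, in Example 3.8, increasing γ progressively zeroes the smallest singular values of the estimate until the true rank 42 is recovered.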

#### **3.6.2.3 Vector Case: 1-Norm Regularization**

The nuclear norm heuristic and inequalities (3.127) also justify the use of the 1-norm regularization (3.108) for the problem of finding sparse solutions (3.107).

For the vector case, i.e., $\theta \in \mathbb{R}^{n \times m}$ with $m = 1$, we can take $X$ and $\mathfrak{C}$ in the previous section to be $X = \theta$ and $\mathfrak{C} = \{\theta \in \mathbb{R}^n \mid \|Y - \Phi\theta\|\_2^2 \le \varepsilon\}$. Then it is easy to see that the 1-norm is the convex envelope of the 0-norm for $\|\theta\|\_\infty \le 1$, i.e.,

$$\|\theta\|\_1 \le \|\theta\|\_0, \text{ for } \|\theta\|\_\infty \le 1.$$

Then, the RMP (3.123) and the nuclear norm heuristic (3.126) become the problem of finding sparse solutions (3.107) and the 1-norm regularization (3.108), respectively. Similar to what is done to obtain (3.127), we assume that $\|\theta\|\_\infty \le M$ for some $M > 0$. If $\|\theta\|\_\infty \le M$, one has

$$\frac{1}{M} \| \boldsymbol{\theta}^{\text{copt}} \|\_{1} \le \| \boldsymbol{\theta}^{\text{opt}} \|\_{0} \le \| \boldsymbol{\theta}^{\text{copt}} \|\_{0},\tag{3.131}$$

where $\theta^{\text{opt}}$ and $\theta^{\text{copt}}$ denote the optimal solutions of the problem of finding sparse solutions (3.107) and of the 1-norm regularization (3.108), respectively. Similar to the matrix case, (3.131) provides an upper and lower bound for the optimal solution of the sparse estimation problem (3.107).

## **3.7 Further Topics and Advanced Reading**

The systematic treatment of regression theory is available in many textbooks, e.g., [12, 35]. Noise variance estimation is a critical issue in practical applications and has been discussed in detail in [48]. When the regression matrix is ill-conditioned, it is important to make sure that the least squares estimate is calculated in an accurate and efficient way, e.g., [10, 17]. Moreover, for the regularized least squares in quadratic form, the regularization matrix could also be ill-conditioned. In this case, extra care is required in the calculation of both the regularized least squares estimate and the hyperparameter estimates, e.g., [8]. For given data, the quality of a model depends on the control of its complexity, which can be described by different measures in different contexts, e.g., the model order and the equivalent degrees of freedom. A good exposition of model complexity and its selection can be found in [21]. It is worth mentioning that the degrees of freedom for LASSO have also been defined and discussed in [43, 51]. In practical applications, there are two key issues for the regularized least squares with quadratic regularization: the design of the regularization matrix and the estimation of the hyperparameter. While the latter issue has been discussed extensively in the literature, e.g., [21, 36, 46, 47], there are much fewer results on the former issue in the context of system identification, as discussed in [7]. The asymptotic properties of some widely used hyperparameter estimators, such as the maximum marginal likelihood estimator, Stein's unbiased risk estimator and generalized cross-validation, have been reported in [29, 30]. LASSO and its variants have been extremely popular in practical applications, as described in [16, 28, 32, 50]. The nuclear norm heuristic to solve matrix rank minimization problems has found wide use in practice, see, e.g., [5, 6, 14, 15, 37].
Beyond the Huber loss function [23], the square loss function can also be replaced by other convex functions, like the Vapnik loss function [45], as discussed later on in Chap. 6.

## **3.8 Appendix**

## *3.8.1 Fundamentals of Linear Algebra*

In this section, we review some fundamentals of linear algebra used in this chapter.

#### **3.8.1.1 QR Factorization and Singular Value Decomposition**

We begin by giving the definitions of the QR factorization and the SVD, which are very important decompositions used for many purposes other than solving LS problems.

For any $\Phi \in \mathbb{R}^{N \times n}$ with $N \ge n$, $\Phi$ can be decomposed as follows:

$$
\Phi = \mathcal{Q}R,\tag{3.132}
$$

where $Q \in \mathbb{R}^{N \times N}$ is orthogonal, i.e., $Q^T Q = Q Q^T = I\_N$, and $R \in \mathbb{R}^{N \times n}$ is upper triangular. Further assume that $\Phi$ has full rank. Then $\Phi$ can be decomposed as follows:

$$
\Phi = \mathcal{Q}\_1 \mathcal{R}\_1 \tag{3.133}
$$

where *Q*<sup>1</sup> = *Q*(:, 1 : *n*) and *R*<sup>1</sup> = *R*(1 : *n*, 1 : *n*) with *Q*(:, 1 : *n*) being the matrix consisting of the first *n* columns of *Q* and *R*(1 : *n*, 1 : *n*) being the matrix consisting of the first *n* rows and *n* columns of *R*. The factorizations (3.132) and (3.133) are called the full and thin QR factorization, respectively. In particular, when *R*<sup>1</sup> has positive diagonal entries, the thin QR factorization (3.133) is unique.
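The thin QR factorization (3.133) can be computed and verified with numpy (random $\Phi$, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(6, 3))  # N = 6 >= n = 3, full rank almost surely

# Thin QR factorization (3.133): Q1 has orthonormal columns, R1 upper triangular
Q1, R1 = np.linalg.qr(Phi, mode='reduced')

print(Q1.shape, R1.shape)                 # (6, 3) (3, 3)
print(np.allclose(Q1.T @ Q1, np.eye(3)))  # True: orthonormal columns
print(np.allclose(Q1 @ R1, Phi))          # True: Phi = Q1 R1
print(np.allclose(R1, np.triu(R1)))       # True: R1 is upper triangular
```

Note that numpy does not enforce a positive diagonal on `R1`, so the computed factorization may differ from the unique one mentioned above by sign flips of paired columns and rows.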

We start by providing the "economy size" definition of the SVD. For any $\Phi \in \mathbb{R}^{N \times n}$ with $N \ge n$, $\Phi$ can be decomposed as follows:

$$
\Phi = U\Lambda V^T,\tag{3.134}
$$

where $U \in \mathbb{R}^{N \times n}$ satisfies $U^T U = I\_n$, $\Lambda = \text{diag}(\sigma\_1, \sigma\_2, \ldots, \sigma\_n)$ with $\sigma\_1 \ge \sigma\_2 \ge \cdots \ge \sigma\_n \ge 0$, and $V \in \mathbb{R}^{n \times n}$ is orthogonal. The factorization (3.134) is called the singular value decomposition (SVD) of $\Phi$ and the $\sigma\_i$, $i = 1, \ldots, n$, are called the singular values of $\Phi$.

The SVD admits also the "full size" formulation, as given in (3.29). One has that (3.134) still holds but *U* is an orthogonal *N* × *N* matrix and Λ is a rectangular *N* × *n* diagonal matrix, while *V* is still an orthogonal *n* × *n* matrix. In this second formulation, *V* and *U* can be associated to orthonormal change of coordinates in the domain and codomain of Φ such that, in the new coordinates, the linear operator is diagonal.
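Both formulations of the SVD can be checked with numpy (arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(5, 3))

# "Economy size" SVD (3.134): U is 5x3 with orthonormal columns
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
print(U.shape, s.shape, Vt.shape)        # (5, 3) (3,) (3, 3)
print(np.allclose(U.T @ U, np.eye(3)))   # True: U^T U = I_n

# "Full size" SVD: U is 5x5 orthogonal, Lambda is rectangular 5x3 diagonal
Uf, sf, Vtf = np.linalg.svd(Phi, full_matrices=True)
Lam = np.zeros((5, 3))
np.fill_diagonal(Lam, sf)
print(np.allclose(Uf @ Lam @ Vtf, Phi))  # True: Phi = U Lambda V^T
```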

#### **3.8.1.2 Vector and Matrix Norms**

Important vector norms are the 1, 2 and ∞ norms. For a given vector $\theta \in \mathbb{R}^n$, they are denoted by $\|\theta\|\_1$, $\|\theta\|\_2$ and $\|\theta\|\_\infty$, respectively, and are defined as follows:

$$\|\theta\|\_1 = \sum\_{i=1}^{n} |\theta\_i|,\tag{3.135}$$


$$\|\theta\|\_2 = \sqrt{\sum\_{i=1}^n \theta\_i^2},\tag{3.136}$$

$$\|\theta\|\_{\infty} = \max\{ |\theta\_1|, |\theta\_2|, \dots, |\theta\_n| \},\tag{3.137}$$

where the <sup>2</sup> norm is also known as the Euclidean norm.

Important matrix norms are the nuclear norm, the Frobenius norm and the spectral norm. For a given matrix $\Phi \in \mathbb{R}^{N \times n}$ with $N \ge n$, these three matrix norms are denoted by $\|\Phi\|\_\*$, $\|\Phi\|\_F$ and $\|\Phi\|\_2$, respectively, and are defined as follows:

$$\|\Phi\|\_{\*} = \sum\_{i=1}^{n} \sigma\_i(\Phi),\tag{3.138}$$

$$\|\boldsymbol{\Phi}\|\_{\mathcal{F}} = \sqrt{\sum\_{i=1}^{N} \sum\_{j=1}^{n} \boldsymbol{\Phi}\_{i,j}^{2}} = \sqrt{\sum\_{i=1}^{n} \sigma\_{i}^{2}(\boldsymbol{\Phi})},\tag{3.139}$$

$$\|\Phi\|\_{2} = \sigma\_{\text{max}}(\Phi),\tag{3.140}$$

where $\sigma\_i(\Phi)$ represents the $i$th largest singular value of $\Phi$, $\sigma\_{\max}(\Phi) = \sigma\_1(\Phi)$ and $\Phi\_{i,j}$ is the $(i, j)$th element of $\Phi$.
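The three matrix norms (3.138)-(3.140) can all be computed from the singular values, as a short numpy check confirms (random matrix, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(6)
Phi = rng.normal(size=(7, 4))
s = np.linalg.svd(Phi, compute_uv=False)  # singular values, descending

nuclear = s.sum()                  # ||Phi||_*  (3.138)
frobenius = np.sqrt((s**2).sum())  # ||Phi||_F  (3.139)
spectral = s[0]                    # ||Phi||_2  (3.140)

print(np.isclose(nuclear, np.linalg.norm(Phi, 'nuc')))   # True
print(np.isclose(frobenius, np.linalg.norm(Phi, 'fro'))) # True
print(np.isclose(spectral, np.linalg.norm(Phi, 2)))      # True
# They are the 1, 2 and infinity norms of the singular value vector,
# so they are always ordered as:
print(spectral <= frobenius <= nuclear)                  # True
```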

Now, we report some properties of the vector and matrix norms. The $i$th largest singular value of $\Phi$ is equal to the square root of the $i$th largest eigenvalue of $\Phi^T \Phi$, or equivalently $\Phi \Phi^T$. If $\Phi$ is square and positive semidefinite, then the nuclear norm of $\Phi$ is equal to the trace of $\Phi$, i.e., $\|\Phi\|\_\* = \text{trace}(\Phi)$. For matrices $A, B \in \mathbb{R}^{N \times n}$, we can define the inner product on $\mathbb{R}^{N \times n} \times \mathbb{R}^{N \times n}$ as $\langle A, B \rangle = \text{trace}(A^T B) = \sum\_{i=1}^{N} \sum\_{j=1}^{n} A\_{i,j} B\_{i,j}$. So the Frobenius norm is the norm associated with this inner product. The spectral norm is defined as the induced 2-norm, i.e., for $\Phi \in \mathbb{R}^{N \times n}$,

$$\|\boldsymbol{\Phi}\|\_{2} = \underset{\boldsymbol{\theta}\neq\boldsymbol{0}}{\text{maximize }} \frac{\|\boldsymbol{\Phi}\boldsymbol{\theta}\|\_{2}}{\|\boldsymbol{\theta}\|\_{2}} = \underset{\|\boldsymbol{\theta}\|\_{2} = 1}{\text{maximize }} \|\boldsymbol{\Phi}\boldsymbol{\theta}\|\_{2}.\tag{3.141}$$

To show that (3.141) is equal to (3.140), note that $\max\_{\|\theta\|\_2 = 1} \|\Phi\theta\|\_2$ is equivalent to $\max\_{\|\theta\|\_2^2 = 1} \|\Phi\theta\|\_2^2$, which is further equivalent to

$$\max\_{\theta} \|\Phi\theta\|\_{2}^{2} + \lambda(1 - \|\theta\|\_{2}^{2}) = \max\_{\theta} \theta^{T}\Phi^{T}\Phi\theta + \lambda(1 - \theta^{T}\theta),\tag{3.142}$$

where λ is the Lagrange multiplier. Checking the optimality condition of (3.142) yields that the optimal solution will satisfy

$$
\Phi^T \Phi \theta - \lambda \theta = 0, \; \theta^T \theta = 1.
$$

The above equation implies that λ is an eigenvalue of Φ*<sup>T</sup>* Φ, and moreover,


$$
\theta^T \Phi^T \Phi \theta = \lambda \theta^T \theta = \lambda. \tag{3.143}
$$

As a result, we have

$$\max\_{\|\theta\|\_2=1} \|\Phi\theta\|\_2 = (\max\_{\|\theta\|\_2^2=1} \theta^T \Phi^T \Phi \theta)^{\frac{1}{2}} = (\max\_{\|\theta\|\_2^2=1} \lambda)^{\frac{1}{2}} = (\lambda\_{\max})^{\frac{1}{2}},$$

where $\lambda\_{\max}$ is the largest eigenvalue of $\Phi^T \Phi$, which is equal to $\sigma\_{\max}^2(\Phi)$. Thus (3.141) is indeed equal to (3.140).

The aforementioned three matrix norms, the nuclear norm, the Frobenius norm and the spectral norm, can be seen as natural extensions of the three vector norms, the 1, 2 and ∞ norms, respectively. In particular, if we construct an $n$-dimensional vector with the $n$ singular values of $\Phi$ as its elements, then the three matrix norms $\|\Phi\|\_\*$, $\|\Phi\|\_F$ and $\|\Phi\|\_2$ correspond to the 1, 2 and ∞ norms of the constructed vector, respectively. Moreover, for any given norm $\|\cdot\|$ on $\mathbb{R}^{N \times n}$, there exists a dual norm $\|\cdot\|\_d$ of $\|\cdot\|$ defined as

$$\|A\|\_{\mathbf{d}} = \sup \{ \text{trace}(\mathbf{A}^T \mathbf{B}) | \mathbf{B} \in \mathbb{R}^{N \times n}, \|\mathbf{B}\| \le 1 \}. \tag{3.144}$$

For the vector norms, the dual norm of the <sup>1</sup> norm is the <sup>∞</sup> norm and the dual norm of the <sup>2</sup> norm is the <sup>2</sup> norm. The properties for the vector norms extend to the matrix norms we have defined: the dual norm of the nuclear norm is the spectral norm, see, e.g., [37], and the dual norm of the Frobenius norm is itself.

#### **3.8.1.3 Matrix Inversion Lemma, Based on [49]**

The matrix inversion lemma is also known as Sherman–Morrison–Woodbury formula and refers to the following identity:

$$(A+UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1},\tag{3.145}$$

where $A$ and $C$ are invertible $n \times n$ and $m \times m$ matrices, respectively, while $U \in \mathbb{R}^{n \times m}$ and $V \in \mathbb{R}^{m \times n}$.
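A numerical check of the identity (3.145) on random well-conditioned matrices (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 5, 2
A = rng.normal(size=(n, n)) + 5 * np.eye(n)  # well-conditioned n x n
C = rng.normal(size=(m, m)) + 5 * np.eye(m)  # well-conditioned m x m
U = rng.normal(size=(n, m))
V = rng.normal(size=(m, n))

Ai = np.linalg.inv(A)
Ci = np.linalg.inv(C)
# Right-hand side of (3.145): only an m x m matrix is inverted beyond A and C,
# which is the computational point of the lemma when m << n
rhs = Ai - Ai @ U @ np.linalg.inv(Ci + V @ Ai @ U) @ V @ Ai
lhs = np.linalg.inv(A + U @ C @ V)
print(np.allclose(lhs, rhs))  # True
```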

## *3.8.2 Proof of Lemma 3.1*

Define $W = -(QR + I\_n)^{-1}$ and $W\_0 = -(ZR + I\_n)^{-1}$. Then (3.67) can be rewritten as

$$W(QRQ + Z)W^T \ge W\_0(ZRZ + Z)W\_0^T. \tag{3.146}$$

Note that


$$I\_n + W = -W\mathcal{Q}R, \qquad I\_n + W\_0 = -W\_0 Z R \tag{3.147}$$

thus (3.67) can be further rewritten as

$$\begin{aligned} &(I\_n + W)\boldsymbol{R}^{-1}(I\_n + W)^T + WZW^T \\ &\geq (I\_n + W\_0)\boldsymbol{R}^{-1}(I\_n + W\_0)^T + W\_0ZW\_0^T. \end{aligned} \tag{3.148}$$

In the following, we show that

$$(I\_n + W)\mathcal{R}^{-1}(I\_n + W)^T + WZW^T$$

$$-(I\_n + W\_0)\mathcal{R}^{-1}(I\_n + W\_0)^T - W\_0ZW\_0^T$$

$$= (W - W\_0)(\mathcal{R}^{-1} + Z)(W - W\_0)^T. \tag{3.149}$$

Simple calculation shows that (3.149) is equivalent to

$$\begin{aligned} \left( (I\_n + W\_0) R^{-1} W^T + W R^{-1} (I\_n + W\_0^T) \right) \\ - \left( I + W\_0 \right) R^{-1} W\_0^T - W\_0 R^{-1} (I\_n + W\_0^T) \\ = 2 W\_0 Z W\_0^T - W\_0 Z W^T - W Z W\_0^T. \end{aligned} \tag{3.150}$$

It follows from the second equation of (3.147) that

$$(I\_n + W\_0) \mathcal{R}^{-1} = -W\_0 Z. \tag{3.151}$$

Now inserting (3.151) into the left-hand side of (3.150) shows that (3.150) and thus (3.149) holds. Moreover, since (*W* − *W*0)(*R*−<sup>1</sup> + *Z*)(*W* − *W*0)*<sup>T</sup>* in (3.149) is positive semidefinite, Eq. (3.148) holds as well, which in turn implies (3.67) holds. This completes the proof.

## *3.8.3 Derivation of Predicted Residual Error Sum of Squares (PRESS)*

For the case when the *k*th measured output *yk* , *k* = 1,..., *N*, is not used, the corresponding ReLS-Q estimate becomes

$$\hat{\theta}\_{-k}^{\mathbb{R}} = \left(\sum\_{i=1, i \neq k}^{N} \phi\_i \phi\_i^T + \sigma^2 P^{-1}(\eta)\right)^{-1} \sum\_{i=1, i \neq k}^{N} \phi\_i \mathbf{y}\_i. \tag{3.152}$$

For the *k*th measured output *yk* , *k* = 1,..., *N*, the corresponding predicted output *y*ˆ−*<sup>k</sup>* and validation error *r*−*<sup>k</sup>* are

$$\hat{\mathbf{y}}\_{-k} = \phi\_k^T \left( \sum\_{i=1, i \neq k}^N \phi\_i \phi\_i^T + \sigma^2 P^{-1}(\eta) \right)^{-1} \sum\_{i=1, i \neq k}^N \phi\_i \mathbf{y}\_i,\tag{3.153a}$$

$$
\mathbf{r}\_{-k} = \mathbf{y}\_k - \hat{\mathbf{y}}\_{-k}.\tag{3.153b}
$$

With *M* defined in (3.81) and by Woodbury matrix identity, e.g., [10, 17], we have

$$\begin{aligned} \left(\sum\_{i=1, i \neq k}^{N} \phi\_i \phi\_i^T + \sigma^2 P^{-1}(\eta)\right)^{-1} &= (M - \phi\_k \phi\_k^T)^{-1} \\\ &= M^{-1} - \frac{M^{-1} \phi\_k \phi\_k^T M^{-1}}{-1 + \phi\_k^T M^{-1} \phi\_k}. \end{aligned} \tag{3.154}$$

Then we have

$$\begin{split} r\_{-k} &= y\_k - \phi\_k^T M^{-1} \sum\_{i=1, i \neq k}^N \phi\_i y\_i + \phi\_k^T \frac{M^{-1} \phi\_k \phi\_k^T M^{-1}}{-1 + \phi\_k^T M^{-1} \phi\_k} \sum\_{i=1, i \neq k}^N \phi\_i y\_i \\ &= r\_k + \phi\_k^T M^{-1} \phi\_k y\_k + \phi\_k^T \frac{M^{-1} \phi\_k \phi\_k^T M^{-1}}{-1 + \phi\_k^T M^{-1} \phi\_k} \sum\_{i=1, i \neq k}^N \phi\_i y\_i \\ &= r\_k + \phi\_k^T M^{-1} \phi\_k \left( y\_k + \frac{\phi\_k^T M^{-1}}{-1 + \phi\_k^T M^{-1} \phi\_k} \sum\_{i=1, i \neq k}^N \phi\_i y\_i \right) \\ &= r\_k + \frac{\phi\_k^T M^{-1} \phi\_k}{-1 + \phi\_k^T M^{-1} \phi\_k} \\ &\times \left( \frac{-y\_k + \phi\_k^T M^{-1} \phi\_k y\_k + \phi\_k^T M^{-1} \sum\_{i=1, i \neq k}^N \phi\_i y\_i}{-1 + \phi\_k^T M^{-1} \phi\_k} \right) \\ &= r\_k - \frac{\phi\_k^T M^{-1} \phi\_k}{-1 + \phi\_k^T M^{-1} \phi\_k} r\_k \\ &= r\_k \frac{1}{1 - \phi\_k^T M^{-1} \phi\_k}, \end{split} (3.155)$$

which shows that $r\_{-k}$ is actually obtained by scaling $r\_k$ with a factor $1/(1 - \phi\_k^T M^{-1} \phi\_k)$. Accordingly, we have the sum of squares of the validation errors

$$\sum\_{k=1}^{N} r\_{-k}^{2} = \sum\_{k=1}^{N} \frac{r\_{k}^{2}}{(1 - \phi\_{k}^{T} M^{-1} \phi\_{k})^{2}}.\tag{3.156}$$

Then the PRESS (3.80) is obtained by minimizing (3.156) with respect to η ∈ Γ .
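The leave-one-out shortcut (3.155) can be verified numerically. The sketch below (hypothetical data, with $P(\eta)$ taken as the identity for simplicity) compares the directly computed validation errors $r\_{-k}$ with the scaled residuals:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 30, 4
Phi = rng.normal(size=(N, n))
theta0 = rng.normal(size=n)
sigma2 = 0.1
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.normal(size=N)
P = np.eye(n)  # prior covariance P(eta); a hypothetical choice for this check

M = Phi.T @ Phi + sigma2 * np.linalg.inv(P)  # M as in (3.81)
theta_hat = np.linalg.solve(M, Phi.T @ Y)
r = Y - Phi @ theta_hat                       # residuals r_k on the full data

# Leave-one-out residuals r_{-k}: direct computation, one refit per sample ...
Minv = np.linalg.inv(M)
r_loo = np.empty(N)
for k in range(N):
    idx = np.arange(N) != k
    Mk = Phi[idx].T @ Phi[idx] + sigma2 * np.linalg.inv(P)
    th_k = np.linalg.solve(Mk, Phi[idx].T @ Y[idx])
    r_loo[k] = Y[k] - Phi[k] @ th_k

# ... versus the O(1)-per-sample scaling shortcut (3.155)
h = np.einsum('ij,jk,ik->i', Phi, Minv, Phi)  # phi_k^T M^{-1} phi_k
print(np.allclose(r_loo, r / (1 - h)))         # True
```

The shortcut replaces N refits with a single matrix inversion, which is what makes PRESS cheap to evaluate over a grid of hyperparameters.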

## *3.8.4 Proof of Theorem 3.7*

Using (3.92) and (3.100), it is easy to see that proving (3.90) is equivalent to showing that

$$\underbrace{\mathcal{E}\left[\frac{1}{N}\|Y-\Phi\hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2}\right]}\_{\overline{\text{err}}(\eta)} \leq \underbrace{\mathcal{E}\left[\frac{1}{N}\mathcal{E}[\|Y\_{\text{v}}-\Phi\hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2}\,|\,\mathcal{Q}\_{\text{T}}]\right]}\_{\text{EVE}\_{\text{in}}(\eta)}\tag{3.157}$$

and to prove the above inequality we need the following lemma.

**Lemma 3.3** *Consider the following additive measurement model:*

$$\mathbf{x} = \mu + \varepsilon, \quad \mathbf{x}, \mu, \varepsilon \in \mathbb{R}^{p}, \tag{3.158}$$

*where* μ *is an unknown constant vector and* ε *is a random variable with zero mean and covariance matrix* $\mathcal{E}(\varepsilon \varepsilon^T) = \Sigma$*. Let* $\hat{\mu}(\mathbf{x})$ *be an estimator of* μ *based on the data* $\mathbf{x}$ *and let* $\tilde{\mathbf{x}}$ *be new data generated from*

$$
\tilde{\mathfrak{x}} = \mu + \tilde{\mathfrak{s}}, \ \tilde{\mathfrak{x}} \in \mathbb{R}^p,\tag{3.159}
$$

*where* ε˜ *is a random variable uncorrelated with* ε *and has zero-mean and covariance matrix E* (ε˜ε˜*<sup>T</sup>* ) = Σ*. Then it holds that*

$$\mathcal{E}(\|\tilde{\mathbf{x}} - \hat{\mu}(\mathbf{x})\|\_2^2) = \mathcal{E}(\|\mu - \hat{\mu}(\mathbf{x})\|\_2^2) + \text{trace}(\Sigma) \tag{3.160}$$

$$\mathbf{x} = \boldsymbol{\mathcal{C}}(\left\|\mathbf{x} - \hat{\boldsymbol{\mu}}(\mathbf{x})\right\|\_{2}^{2}) + 2\,\mathrm{trace}(\mathrm{Cov}(\hat{\boldsymbol{\mu}}(\mathbf{x}), \mathbf{x})),\qquad(3.161)$$

*where the expectation is over both* ε *and* ε˜*.*

*Proof* Firstly, we consider (3.160). We have

$$\begin{split} &\mathcal{E}(\|\tilde{\mathbf{x}}-\hat{\mu}(\mathbf{x})\|\_{2}^{2}) = \mathcal{E}(\|\tilde{\mathbf{x}}-\mu+\mu-\hat{\mu}(\mathbf{x})\|\_{2}^{2}) \\\ &= \mathcal{E}(\|\mu-\hat{\mu}(\mathbf{x})\|\_{2}^{2}) + \mathcal{E}(\|\tilde{\mathbf{x}}-\mu\|\_{2}^{2}) + 2\mathcal{E}[(\tilde{\mathbf{x}}-\mu)^{T}(\mu-\hat{\mu}(\mathbf{x}))] \\\ &= \mathcal{E}(\|\mu-\hat{\mu}(\mathbf{x})\|\_{2}^{2}) + \text{trace}(\Sigma), \end{split}$$

which shows that (3.160) is true.

Secondly, we consider (3.161). Similarly, we have

$$\begin{split} \mathcal{E}(\|\tilde{\mathbf{x}} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) &= \mathcal{E}(\|\tilde{\mathbf{x}} - \mathbf{x} + \mathbf{x} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) \\ &= \mathcal{E}(\|\mathbf{x} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) + \mathcal{E}(\|\tilde{\mathbf{x}} - \mathbf{x}\|\_{2}^{2}) + 2\mathcal{E}[(\tilde{\mathbf{x}} - \mathbf{x})^{T}(\mathbf{x} - \hat{\mu}(\mathbf{x}))] \\ &= \mathcal{E}(\|\mathbf{x} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) + \mathcal{E}(\|\tilde{\mathbf{x}} - \mathbf{x}\|\_{2}^{2}) + 2\mathcal{E}[(\tilde{\mathbf{x}} - \mathbf{x})^{T}(\mathbf{x} + \mu - \hat{\mu}(\mathbf{x}))] \\ &= \mathcal{E}(\|\mathbf{x} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) + 2\operatorname{trace}(\Sigma) - 2\mathcal{E}[\varepsilon^{T}(\mathbf{x} + \mu - \hat{\mu}(\mathbf{x}))] \\ &= \mathcal{E}(\|\mathbf{x} - \hat{\mu}(\mathbf{x})\|\_{2}^{2}) + 2\mathcal{E}[\varepsilon^{T}\hat{\mu}(\mathbf{x})], \end{split}$$

which, since $\mathcal{E}[\varepsilon^{T}\hat{\mu}(\mathbf{x})] = \operatorname{trace}(\operatorname{Cov}(\hat{\mu}(\mathbf{x}), \mathbf{x}))$, implies that (3.161) is true. □

Now we prove (3.157) by applying Lemma 3.3. Let

$$\mathbf{x} = Y, \boldsymbol{\mu} = \Phi \theta\_0, \hat{\boldsymbol{\mu}}(\mathbf{x}) = \Phi \hat{\theta}^{\mathbb{R}}, \tilde{\mathbf{x}} = Y\_{\text{v}}, \boldsymbol{\varepsilon} = E, \tilde{\boldsymbol{\varepsilon}} = E\_{\text{test}}, \boldsymbol{\Sigma} = \sigma^2 I\_N,\tag{3.162}$$

and then it follows from (3.161) that

$$\underbrace{\mathcal{E}\left[\frac{1}{N}\|Y\_{\text{v}}-\Phi\hat{\theta}^{\text{R}}(\eta)\|\_2^2\right]}\_{\text{EVE}\_{\text{in}}(\eta)}-\underbrace{\mathcal{E}\left[\frac{1}{N}\|Y-\Phi\hat{\theta}^{\text{R}}(\eta)\|\_2^2\right]}\_{\text{EV}(\eta)} = \frac{2}{N}\operatorname{trace}(\operatorname{Cov}(Y,\Phi\hat{\theta}^{\text{R}}(\eta))).\tag{3.163}$$

Next we show that the right-hand side of (3.163) is nonnegative. For the ReLS-Q problem (3.58a) with the ReLS-Q estimate (3.58b), the predicted output *Y*ˆ(η) of *Y* is

$$\hat{Y}(\eta) = \Phi \hat{\theta}^{\mathbb{R}}(\eta) = \Phi P \Phi^T (\Phi P \Phi^T + \sigma^2 I\_N)^{-1} Y.$$

Then we have

$$\begin{split} \text{Cov}(Y, \Phi \hat{\theta}^R(\eta)) &= \text{Cov}(Y, \hat{Y}(\eta)) \\ &= \mathcal{E}(Y - \mathcal{E}(Y))(\hat{Y} - \mathcal{E}(\hat{Y}(\eta)))^T \\ &= \mathcal{E}(Y - \mathcal{E}(Y))(Y - \mathcal{E}(Y))^T \Phi P \Phi^T (\Phi P \Phi^T + \sigma^2 I\_N)^{-1} \\ &= \sigma^2 \Phi P \Phi^T (\Phi P \Phi^T + \sigma^2 I\_N)^{-1} = \sigma^2 H, \end{split} \tag{3.164}$$

where *H* is the hat matrix defined in (3.63). One has

$$\text{trace}(\text{Cov}(Y, \Phi \hat{\theta}^{\mathbb{R}}(\eta))) = \sigma^2 \,\text{trace}(H) \ge 0.$$

Therefore, the right-hand side of (3.163) is nonnegative and thus (3.90) holds true, completing the proof of Theorem 3.7. □
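The argument above can be checked numerically. The following Python sketch (with an arbitrary regressor matrix Φ, regularization matrix P and noise variance σ², all illustrative assumptions, not data from the book) forms the hat matrix and verifies that its trace, the equivalent degrees of freedom, lies in [0, N]:

```python
import numpy as np

# Sketch: H = Phi P Phi^T (Phi P Phi^T + sigma^2 I)^{-1} has nonnegative trace,
# so the gap EVE_in(eta) - EV(eta) in (3.163) is nonnegative.
rng = np.random.default_rng(0)
N, n = 30, 5
Phi = rng.standard_normal((N, n))
A = rng.standard_normal((n, n))
P = A @ A.T + np.eye(n)          # any positive definite regularization matrix
sigma2 = 0.5

G = Phi @ P @ Phi.T              # symmetric positive semidefinite
H = G @ np.linalg.inv(G + sigma2 * np.eye(N))   # hat matrix, cf. (3.63)
dof = np.trace(H)                # sum of eigenvalues lam/(lam + sigma2) in [0, 1)
print(dof)
```

Since the eigenvalues of H are λ/(λ + σ²) with λ ≥ 0, the trace is always between 0 and N.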

## *3.8.5 A Variant of the Expected In-Sample Validation Error and Its Unbiased Estimator*

It is possible to derive variants of the expected in-sample validation error and its unbiased estimator by modifying (3.92) and (3.100).

Assume that $\Phi$ is full rank, i.e., $\operatorname{rank}(\Phi) = n$. Then, multiplying both sides of (3.92) and (3.100) by $(\Phi^{T}\Phi)^{-1}\Phi^{T}$ yields

$$(\Phi^T \Phi)^{-1} \Phi^T Y = \theta\_0 + (\Phi^T \Phi)^{-1} \Phi^T E,\tag{3.165}$$

$$(\Phi^{T}\Phi)^{-1}\Phi^{T}Y\_{\text{v}}=\theta\_{0}+(\Phi^{T}\Phi)^{-1}\Phi^{T}E\_{\text{v}},\tag{3.166}$$

which will be our new "true system" and new "validation data", respectively.

In contrast to (3.162), we now take

$$\begin{aligned} \mathbf{x} &= (\Phi^{T}\Phi)^{-1}\Phi^{T}Y,\quad \mu = \theta\_{0},\quad \hat{\mu}(\mathbf{x}) = \hat{\theta}^{\text{R}}(\eta),\quad \tilde{\mathbf{x}} = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y\_{\text{v}}, \\ \varepsilon &= (\Phi^{T}\Phi)^{-1}\Phi^{T}E,\quad \tilde{\varepsilon} = (\Phi^{T}\Phi)^{-1}\Phi^{T}E\_{\text{v}},\quad \Sigma = \sigma^{2}(\Phi^{T}\Phi)^{-1}. \end{aligned} \tag{3.167}$$

Note that $\hat{\theta}^{\text{LS}} = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y$; it then follows from (3.160) and (3.161) that

$$\begin{split} \mathcal{E}(\|(\Phi^{T}\Phi)^{-1}\Phi^{T}Y\_{\text{v}}-\hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2}) &= \mathcal{E}(\|\hat{\theta}^{\text{R}}(\eta)-\theta\_{0}\|\_{2}^{2}) + \sigma^{2}\operatorname{trace}((\Phi^{T}\Phi)^{-1}) \\ &= \mathcal{E}(\|\hat{\theta}^{\text{LS}}-\hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2}) + 2\operatorname{trace}(\operatorname{Cov}(\hat{\theta}^{\text{R}}(\eta),\hat{\theta}^{\text{LS}})) .\end{split}$$

From the above two equations, we have

$$\begin{split} \mathcal{E}(\|\hat{\theta}^{\mathbb{R}}(\eta) - \theta\_0\|\_2^2) &= \mathcal{E}(\|\hat{\theta}^{\text{LS}} - \hat{\theta}^{\mathbb{R}}(\eta)\|\_2^2) \\ &+ 2\operatorname{trace}(\text{Cov}(\hat{\theta}^{\mathbb{R}}(\eta), \hat{\theta}^{\text{LS}})) - \sigma^2 \operatorname{trace}((\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1}) .\end{split}$$

Further note that

$$\begin{aligned} \hat{\theta}^{\text{R}}(\eta) &= (\Phi^T \Phi + \sigma^2 P^{-1}(\eta))^{-1} \Phi^T Y = (\Phi^T \Phi + \sigma^2 P^{-1}(\eta))^{-1} \Phi^T \Phi \hat{\theta}^{\text{LS}}, \\ \operatorname{Cov}(\hat{\theta}^{\text{LS}}, \hat{\theta}^{\text{LS}}) &= \sigma^2 (\Phi^T \Phi)^{-1}, \end{aligned}$$

then we have

$$\begin{split} \mathcal{E}(\|\hat{\theta}^{\text{R}}(\eta) - \theta\_{0}\|\_{2}^{2}) &= \mathcal{E}(\|\hat{\theta}^{\text{LS}} - \hat{\theta}^{\text{R}}(\eta)\|\_{2}^{2}) \\ &+ 2\sigma^{2}\operatorname{trace}((\Phi^{T}\Phi + \sigma^{2}P^{-1}(\eta))^{-1} - 0.5(\Phi^{T}\Phi)^{-1}). \end{split} \tag{3.168}$$

Note that $\mathcal{E}(\|\hat{\theta}^{\text{R}}(\eta) - \theta\_0\|\_2^2)$ is equal to $\operatorname{trace}(\text{MSE}(\hat{\theta}^{\text{R}}(\eta), \theta\_0))$; we denote it by $\text{mse}\_\eta$, and from (3.168) we readily obtain an unbiased estimator of $\text{mse}\_\eta$ as follows:

$$\widehat{\mathrm{mse}}\_{\eta} = \|\hat{\theta}^{\mathrm{LS}} - \hat{\theta}^{\mathrm{R}}(\eta)\|\_{2}^{2} + 2\sigma^{2}\operatorname{trace}((\Phi^{T}\Phi + \sigma^{2}P^{-1}(\eta))^{-1} - 0.5(\Phi^{T}\Phi)^{-1}).\tag{3.169}$$

Now, given the training data (3.84), the corresponding estimate $\widehat{\text{mse}}\_\eta$ of $\text{mse}\_\eta$ can be used to estimate the hyperparameter $\eta$: we take the value of $\eta \in \Gamma$ that minimizes (3.169), i.e.,

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min}} \left\| \hat{\theta}^{\text{LS}} - \hat{\theta}^{\text{R}}(\eta) \right\|\_{2}^{2} + 2\sigma^{2} \text{trace} ( (\boldsymbol{\Phi}^{T}\boldsymbol{\Phi} + \sigma^{2}\boldsymbol{P}^{-1}(\eta))^{-1} - 0.5(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi})^{-1}). \tag{3.170}$$


The criterion (3.170) is known as the SURE of the expected in-sample validation error for the true system (3.165) and the validation data (3.166), e.g., [33, 40].
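As an illustration, the SURE criterion (3.169) can be evaluated on a grid of hyperparameters and minimized as in (3.170). The Python sketch below assumes a one-parameter family $P(\eta) = \eta I_n$ and simulated data; all numbers are illustrative, not from the book:

```python
import numpy as np

# Sketch of SURE-based hyperparameter tuning, cf. (3.169)-(3.170),
# under the assumed family P(eta) = eta * I_n.
rng = np.random.default_rng(1)
N, n, sigma2 = 50, 4, 0.1
Phi = rng.standard_normal((N, n))
theta0 = np.array([1.0, -0.5, 0.3, 0.8])
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)

PhiTPhi = Phi.T @ Phi
theta_ls = np.linalg.solve(PhiTPhi, Phi.T @ Y)   # least squares estimate

def sure(eta):
    # ReLS-Q estimate with P(eta) = eta * I, so sigma^2 P^{-1}(eta) = (sigma2/eta) I
    M = np.linalg.inv(PhiTPhi + (sigma2 / eta) * np.eye(n))
    theta_r = M @ Phi.T @ Y
    penalty = 2 * sigma2 * np.trace(M - 0.5 * np.linalg.inv(PhiTPhi))
    return np.sum((theta_ls - theta_r) ** 2) + penalty

grid = np.logspace(-3, 2, 60)                     # candidate values of eta
eta_hat = grid[np.argmin([sure(e) for e in grid])]
print(eta_hat)
```

The minimizer over the grid plays the role of $\hat{\eta}$ in (3.170).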

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Bayesian Interpretation of Regularization**

**Abstract** In the previous chapter, it has been shown that the regularization approach is particularly useful when the information contained in the data is not sufficient to obtain a precise estimate of the unknown parameter vector and standard methods, such as least squares, yield poor solutions. The very fact that an estimate is regarded as poor suggests the existence of some form of prior knowledge on the degree of acceptability of candidate solutions. It is this knowledge that guides the choice of the regularization penalty that is added as a corrective term to the usual sum of squared residuals. In the previous chapters, this design process has been described in a deterministic setting where only the measurement noises are random. In this chapter, we will see that an alternative formalization of prior information is obtained if a subjective/Bayesian estimation paradigm is adopted. The major difference is that the parameters, rather than being regarded as deterministic, are now treated as a random vector. This stochastic setting permits the definition of powerful new tools both for prior selection, e.g., through the maximum entropy principle, and for regularization parameter tuning, e.g., through the empirical Bayes approach and its connection with the concept of equivalent degrees of freedom.

## **4.1 Preliminaries**

We have seen that the regularization approach can be used to effectively solve estimation problems that are otherwise ill-conditioned. In particular, a penalty is added as a corrective term to the usual sum of squared residuals. In this way, between two candidate solutions achieving the same squared loss, the regularizer is chosen so as to penalize the one that departs from our prior knowledge on some features of the unknown parameter vector.

It is worth noting that the regularization approach lies within a frequentist paradigm in which the observed data, affected by noise, are random variables, but the unknown parameter vector is deterministic in nature. For linear-in-parameter

© The Author(s) 2022

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_4

models, regularization yields an estimate that, though biased, may be preferable to the unbiased least squares estimate in view of the smaller variance. In particular, the tuning of the regularization parameter aims at an advantageous solution of the bias-variance dilemma. By trading an excessive variance for some bias, a smaller mean squared error may be achieved, as exemplified by the James–Stein estimator. An alternative formalization of prior information is obtained if a subjective/Bayesian estimation paradigm is adopted. The major difference is that the parameters, rather than being regarded as deterministic, are now treated as a random vector.

In order to introduce the Bayesian paradigm, it can be useful to start with a simple example in which the parameters do depend on the result of a random experiment. Consider a metabolism model for which the parameter vector θ can take only two possible values, θ*<sup>h</sup>* and θ*<sup>d</sup>* , associated with healthy and diabetic patients, respectively. The model specifies p(*Y* |θ ), where *Y* are observations collected from a randomly chosen patient with 90% probability of being healthy and 10% probability of being diabetic. In this simple case, model identification amounts to deciding between θ*<sup>h</sup>* and θ*<sup>d</sup>* . It is also clear that θ is a discrete random variable with p(θ = θ*h*) = 0.9 and p(θ = θ*<sup>d</sup>* ) = 0.1. These probabilities summarize the prior information about the unknown parameter, before any observation is collected. Once the data *Y* become available, the Bayes formula can be used to compute the posterior probability

$$\mathbf{p}(\theta\_h|Y) = \frac{\mathbf{p}(Y|\theta\_h)\mathbf{p}(\theta\_h)}{\mathbf{p}(Y)} = \frac{\mathbf{p}(Y|\theta\_h)\mathbf{p}(\theta\_h)}{\mathbf{p}(Y|\theta\_h)\mathbf{p}(\theta\_h) + \mathbf{p}(Y|\theta\_d)\mathbf{p}(\theta\_d)}.\tag{4.1}$$

Of course, p(θ*<sup>d</sup>* |*Y* ) = 1 − p(θ*h*|*Y* ). In particular, if the data *Y* are consistent with diabetes symptoms, it may well happen that p(θ*<sup>d</sup>* |*Y* ) > 0.5, in which case θ = θ*<sup>d</sup>* would be taken as the final estimate.
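This two-hypothesis update via (4.1) can be sketched in a few lines of Python. The Gaussian measurement likelihoods and all numbers below are illustrative assumptions (a glucose-like scalar reading), not taken from the book:

```python
import numpy as np

# Sketch of the posterior (4.1) for the healthy/diabetic example.
def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

prior_h, prior_d = 0.9, 0.1              # p(theta_h), p(theta_d) from the text
y = 180.0                                # an observation consistent with diabetes
lik_h = normal_pdf(y, 95.0, 15.0 ** 2)   # p(Y|theta_h), assumed measurement model
lik_d = normal_pdf(y, 170.0, 30.0 ** 2)  # p(Y|theta_d), assumed measurement model

post_h = lik_h * prior_h / (lik_h * prior_h + lik_d * prior_d)
post_d = 1.0 - post_h
print(post_h, post_d)
```

Despite the 90% prior in favour of the healthy model, the data dominate and $p(\theta_d|Y) > 0.5$, so $\theta_d$ would be selected.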

In the previous example, the prior probability distribution assigned to θ reflects a real experiment that is the random choice of a patient from a population where 90% of subjects are healthy, which implies a prejudice in favour of θ = θ*h*. In other words, the prior distribution ranks the candidate parameters according to the available a priori knowledge. If we look at the numerator of (4.1), we see that it combines a priori information with the data through the product of the *prior probability* p(θ*h*) and the *likelihood* p(*Y* |θ*h*). In the example, the population was a binary one (either healthy or diabetic), but we can imagine more complex populations allowing for several countable or even uncountable possible values of θ.

In the actual Bayesian paradigm a further step is made: the parameters θ are assigned a prior probability p(θ ), even if there does not exist an underlying experiment that draws the model from a population of possible models. According to the subjective definition of probability, p(θ = θ̄) represents the (subjective) degree of belief that θ is going to take the value θ̄. In particular, in analogy with the regularization penalty, it is possible to rank the possible values of θ, assigning a low probability to values whose occurrence is deemed unlikely. In our context, the intrinsically subjective nature of the prior probability, a controversial issue in the confrontation between the frequentist and Bayesian paradigms, mirrors the subjective choice of the regularization penalty: rather than expressing the preference for some solutions through the choice of a proper penalty, the preference is formulated by means of a prior distribution.

As shown in the following, many formulas and results can be indifferently derived adopting either the regularization or the Bayesian paradigm. However, the Bayesian approach has its pros. In particular, the tuning of the regularization parameter, rather than being addressed on an ad hoc basis, can be formulated as a statistical estimation problem. Moreover, the Bayesian paradigm offers a very natural way to assess uncertainty intervals, whereas the regularization paradigm has a harder time assessing the amount of bias in the estimate. Among the cons, one may mention the need for a deeper probabilistic background in order to gain a full comprehension of all aspects.

Throughout the chapter we will mainly focus on the linear Gaussian case, but the approach is more general and some hints at generalizations will be provided. In addition, we will use θ to denote the stochastic vector that has generated the data, in contrast with the deterministic θ<sup>0</sup> used in the classical setting discussed in the previous chapter.

## **4.2 Incorporating Prior Knowledge via Bayesian Estimation**

We consider the problem of estimating a parameter vector <sup>θ</sup> <sup>∈</sup> <sup>R</sup>*<sup>n</sup>*, based on the observation vector *<sup>Y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* . The two ingredients of Bayesian estimation are the prior distribution of θ, also known by short as *prior*, and the conditional distribution of *Y* given θ. As already observed, the basic assumption is that the parameter vector θ is not completely unknown, but rather some prior knowledge is available that is formulated in terms of *subjective probability*, specified as a probability density function:

$$\mathbf{p}(\theta) : \mathbb{R}^n \mapsto \mathbb{R}.$$

The density function p(θ ) is chosen by the user so as to assign a low probability to values whose occurrence is deemed unlikely. For instance, if θ is a scalar parameter whose value is believed to lie more or less around 30, hardly smaller than 20 and hardly larger than 40, this prior knowledge can be embedded in a Gaussian density with *E* θ = μθ = 30 and standard deviation σθ = 5:

$$
\theta \sim \mathcal{N}(30, 25).
$$

In fact, under this distribution, p (|θ − μθ | > 2σθ ) = p (|θ − 30| > 10) < 0.05. Although not impossible, it is considered unlikely that values of θ too distant from 30 are going to occur. A natural question is how and when our prior knowledge is sufficient to specify a distribution. This crucial issue calls for the notion and role of *hyperparameters*, see Sect. 4.2.4, and for the possible use of the *maximum entropy principle* as a way to obtain an entire probability distribution from partial knowledge relative to its moments, see Sect. 4.6.

The second ingredient is the conditional distribution of *Y* given θ that, when considered as a function of θ, is also known as *likelihood*:

$$L(\theta|Y) = \mathbf{p}(Y|\theta) = \frac{\mathbf{p}(Y,\theta)}{\mathbf{p}(\theta)},$$

where p(*Y*,θ) is the joint probability distribution of the random vectors *Y* and θ. The likelihood is usually obtained from some mathematical model of the data. Consider, for instance, the simple model

$$Y\_i = \theta \sqrt{i} + e\_i, \qquad i = 1, \ldots, N,$$

where *ei* ∼ *N* (0, σ<sup>2</sup>) are independent and identically distributed measurement errors, with known variance σ2. Conditional on θ, i.e., assuming that θ is known, *Yi* is Gaussian with

$$\mathcal{E}\left[Y\_{i}|\theta\right] = \theta\sqrt{i}, \qquad \text{Var}\left(Y\_{i}|\theta\right) = \sigma^{2}$$

so that, in view of independence, the likelihood is

$$L(\theta|Y) = \mathbf{p}(Y|\theta) = \prod\_{i=1}^{N} \mathbf{p}(Y\_i|\theta), \qquad \mathbf{p}(Y\_i|\theta) = \mathcal{N}(\theta\sqrt{i}, \sigma^2).$$
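Since this log-likelihood is quadratic in θ, its maximizer has the closed form $\hat{\theta} = (\sum_i \sqrt{i}\, Y_i)/(\sum_i i)$, i.e., least squares with regressor $\varphi_i = \sqrt{i}$. A small simulation sketch (all numerical values are assumed for illustration):

```python
import numpy as np

# Sketch for the model Y_i = theta*sqrt(i) + e_i with known noise variance:
# the maximum likelihood estimate is ordinary least squares on phi_i = sqrt(i).
rng = np.random.default_rng(2)
N, theta_true, sigma2 = 200, 2.0, 0.25
i = np.arange(1, N + 1)
phi = np.sqrt(i)
Y = theta_true * phi + np.sqrt(sigma2) * rng.standard_normal(N)

theta_hat = np.sum(phi * Y) / np.sum(i)   # closed-form maximizer

def loglik(theta):
    # Gaussian log-likelihood of the N independent observations
    r = Y - theta * phi
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(r ** 2) / sigma2

print(theta_hat)
```

Perturbing `theta_hat` in either direction can only decrease `loglik`, confirming it is the maximizer.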

When both the prior distribution p(θ ) and the likelihood p(*Y* |θ ) have been specified, the Bayes formula yields the *posterior distribution*

$$\mathbf{p}(\theta|Y) = \frac{\mathbf{p}(Y|\theta)\mathbf{p}(\theta)}{\mathbf{p}(Y)}.$$

We have seen that all our prior knowledge was embedded in the prior. In a similar way, all the knowledge obtained by the combination of prior information with the new information brought by the observations is now embedded in the posterior distribution p(θ|*Y* ), denoted by short as *posterior*.

Although all the relevant information is encapsulated within the posterior, a *point estimate* is often required for practical or communication purposes. The *Maximum A Posteriori (MAP) estimate* is the value that maximizes the posterior:

$$\theta^{\mathsf{MAP}} = \arg\max\_{\theta} \mathbf{p}(\theta|Y). \tag{4.2}$$

Its interpretation is simple, as it represents the most likely value, once the prior knowledge has been updated taking into account the observations. Alternatively, the mean squared error

$$\text{MSE}(\hat{\theta}) = \mathcal{E}\left[ \left\| \hat{\theta} - \theta \right\|^{2} | Y \right]$$

can be used as a criterion to select the point estimate θˆ. Above, *E* (·|*Y* ) denotes the expected value taken with respect to the posterior distribution p(θ|*Y* ). The following classical result from estimation theory (whose proof is in Sect. 4.13.1) then holds.

**Theorem 4.1** *The minimizer of the MSE*

$$\boldsymbol{\theta}^{\mathbf{B}} = \arg\min\_{\hat{\boldsymbol{\theta}}} \text{MSE}(\hat{\boldsymbol{\theta}})$$

*is known as* Bayes estimate *and can be shown to be equal to the conditional mean:*

$$
\theta^{\mathbb{B}} = \mathcal{E}[\theta|Y]\,.
$$

A third point estimate is the conditional median used especially in view of its statistical robustness when the posterior is obtained numerically via stochastic simulation algorithms, see Sect. 4.10.

When, in addition to a point estimate, an assessment of the uncertainty is needed, it can be derived from the posterior through the computation of a properly defined *credible region* $C\_{\gamma} \subseteq \mathbb{R}^{n}$ such that

$$\Pr(\theta \in C\_{\gamma} | Y) = \gamma. \tag{4.3}$$

For example, *C*<sup>γ</sup> could be taken as the smallest region such that (4.3) holds, a choice that goes under the name of highest posterior density region.

## *4.2.1 Multivariate Gaussian Variables*

In this subsection, some basic properties and definitions of multivariate Gaussian variables are recalled. This review is instrumental to the derivation of the Bayesian estimator when observations and parameters are jointly Gaussian, see Sect. 4.2.2. In turn, this will pave the way to the analysis of the linear model under additive Gaussian measurement errors, see Sect. 4.2.3.

A random vector *Z* = [*Z*<sup>1</sup> ... *Zm*] *<sup>T</sup>* is said to be distributed according to a nondegenerate *m*-variate Gaussian distribution if its joint probability density function is of the type

$$\mathbf{p}(z\_1, \dots, z\_m) = \frac{1}{\sqrt{(2\pi)^m \det V}} \exp\left(-\frac{1}{2}(z-\mu)^T V^{-1}(z-\mu)\right),\tag{4.4}$$

where *V* is a symmetric positive definite matrix and μ is some vector in R*<sup>m</sup>*.

It can be shown that

$$\mathcal{E}(Z) = \mu,\quad \text{Var}\,(Z) = V.$$

Then, the notation

$$Z \sim \mathcal{N}(\mu, V)$$

(already used before in the scalar case) indicates that *Z* is a multivariate Gaussian (Normal) random vector with mean μ and variance matrix *V*.

**Property 4.1** *If Z* <sup>∼</sup> *<sup>N</sup>* (μ, *<sup>V</sup>* ) *and Y* <sup>=</sup> *AZ, where A* <sup>∈</sup> <sup>R</sup>*<sup>n</sup>*×*<sup>m</sup>*, *<sup>n</sup>* <sup>≤</sup> *m, is a fullrank deterministic matrix, then*

$$Y \sim \mathcal{N}(A\mu, AVA^T).$$

In particular, it follows that the marginal distributions of the entries of *Z* are Gaussian:

$$Z\_i \sim \mathcal{N}(\mu\_i, V\_{ii}).$$

**Property 4.2** *Assuming Z* ∼ *N* (μ, *V* )*, let X* = [*Z*<sup>1</sup> ... *Zn*] *<sup>T</sup> , Y* = [*Zn*+<sup>1</sup> ... *Zm*] *T , where* 1 ≤ *n* < *m, and partition* μ *and V accordingly:*

$$
\mu = \begin{bmatrix} \mu\_X \\ \mu\_Y \end{bmatrix}, \quad V = \begin{bmatrix} V\_{XX} & V\_{XY} \\ V\_{YX} & V\_{YY} \end{bmatrix}.
$$

*Then,* p(*X*|*Y* = *y*) *is a multivariate Gaussian density function with*

$$\begin{aligned} \mathcal{E}(X|Y=\mathbf{y}) &= \mu\_X + V\_{XY} V\_{YY}^{-1} (\mathbf{y} - \mu\_Y) \\ \text{Var}(X|Y=\mathbf{y}) &= V\_{XX} - V\_{XY} V\_{YY}^{-1} V\_{YX} \end{aligned}$$

*and we can write*

$$(X|Y=\mathbf{y}) \sim \mathcal{N} \left( \mu\_X + V\_{XY} V\_{YY}^{-1} (\mathbf{y} - \mu\_Y), V\_{XX} - V\_{XY} V\_{YY}^{-1} V\_{YX} \right),$$

*where X*|*Y* = *y stands for the random vector X conditional on Y* = *y.*
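Property 4.2 is easy to verify numerically. In the Python sketch below, the joint mean and covariance are arbitrary illustrative choices; it computes the conditional mean and covariance of $X = (Z_1, Z_2)$ given $Y = Z_3$:

```python
import numpy as np

# Sketch of Property 4.2: conditioning in a partitioned Gaussian vector.
mu = np.array([1.0, 2.0, 0.0])           # mu_X = [1, 2], mu_Y = [0]
V = np.array([[2.0, 0.3, 0.5],
              [0.3, 1.0, 0.2],
              [0.5, 0.2, 1.5]])
n = 2                                    # X = (Z_1, Z_2), Y = Z_3
mu_X, mu_Y = mu[:n], mu[n:]
V_XX, V_XY = V[:n, :n], V[:n, n:]
V_YX, V_YY = V[n:, :n], V[n:, n:]

y = np.array([1.2])                      # observed value of Y
cond_mean = mu_X + V_XY @ np.linalg.solve(V_YY, y - mu_Y)
cond_var = V_XX - V_XY @ np.linalg.solve(V_YY, V_YX)
print(cond_mean, np.diag(cond_var))
```

Note how every diagonal entry of the conditional covariance is no larger than the corresponding prior variance: in the Gaussian case, conditioning can only reduce uncertainty.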

## *4.2.2 The Gaussian Case*

Let us consider the case in which the observation vector *<sup>Y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* and the unknown vector <sup>θ</sup> <sup>∈</sup> <sup>R</sup>*<sup>n</sup>* are jointly Gaussian:

$$
\begin{bmatrix} \theta \\ Y \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} \mu\_{\theta} \\ \mu\_{Y} \end{bmatrix}, \begin{bmatrix} \Sigma\_{\theta} & \Sigma\_{\theta Y} \\ \Sigma\_{Y\theta} & \Sigma\_{Y} \end{bmatrix} \right), \quad \Sigma\_{Y} > 0. \tag{4.5}
$$

The key idea behind Bayesian estimation is referring to the posterior distribution of θ given *Y* as representative of the state of knowledge about the unknown vector. It follows from Property 4.2 that such posterior is Gaussian as well:

$$
\theta | Y \sim \mathcal{N} \left( \mu\_{\theta} + \Sigma\_{\theta Y} \Sigma\_{Y}^{-1} (Y - \mu\_{Y}), \Sigma\_{\theta} - \Sigma\_{\theta Y} \Sigma\_{Y}^{-1} \Sigma\_{Y\theta} \right). \tag{4.6}
$$

In view of Gaussianity, θMAP coincides with the conditional expectation *E* (θ|*Y* ):

$$
\theta^{\rm B} = \theta^{\rm MAP} = \mathcal{E}(\theta|Y) = \mu\_{\theta} + \Sigma\_{\theta Y} \Sigma\_{Y}^{-1} (Y - \mu\_{Y}).\tag{4.7}
$$

The reliability of the estimate can be assessed by the posterior variance

$$\Sigma\_{\theta|Y} = \text{Var}(\theta|Y) = \Sigma\_{\theta} - \Sigma\_{\theta Y} \Sigma\_{Y}^{-1} \Sigma\_{Y\theta}$$

based on which the so-called credible intervals can be derived as explained below.

The posterior variance of θ*<sup>i</sup>* is the *i*-th diagonal entry of the posterior covariance matrix:

$$
\sigma\_{\theta\_i|Y}^2 = \left[ \Sigma\_{\theta|Y} \right]\_{ii}.
$$

Observing that θ*i*|*Y* ∼ *N* (θ<sup>B</sup> *<sup>i</sup>* , σ<sup>2</sup> <sup>θ</sup>*i*|*<sup>Y</sup>* ), it follows that

$$\Pr\left(\theta\_i^{\text{B}} - 1.96\sigma\_{\theta\_i|Y} \le \theta\_i \le \theta\_i^{\text{B}} + 1.96\sigma\_{\theta\_i|Y} \,|\, Y\right) = 0.95\tag{4.8}$$

so that $[\theta\_i^{\text{B}} - 1.96\sigma\_{\theta\_i|Y},\ \theta\_i^{\text{B}} + 1.96\sigma\_{\theta\_i|Y}]$ is the 95% credible interval for the parameter $\theta\_i$, given the observation vector *Y*. If two or more parameters are jointly considered, the notion of credible region can be obtained in a similar way. In the Gaussian case, such regions are suitable (hyper)ellipsoids centred in $\theta^{\text{B}}$.
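The computation of $\theta^{\rm B}$, the posterior covariance and the resulting 95% credible intervals can be sketched as follows; the joint moments below are arbitrary illustrative values, not from the book:

```python
import numpy as np

# Sketch: Bayes estimate (4.7), posterior covariance and 95% credible
# intervals for a 2-dimensional parameter from a 2-dimensional observation.
mu_theta = np.array([0.0, 0.0])
Sigma_theta = np.array([[1.0, 0.4], [0.4, 2.0]])
Sigma_thetaY = np.array([[0.8, 0.1], [0.2, 1.0]])   # Cov(theta, Y)
Sigma_Y = np.array([[1.5, 0.3], [0.3, 2.5]])
mu_Y = np.array([0.0, 0.0])
Y = np.array([1.0, -0.5])

theta_B = mu_theta + Sigma_thetaY @ np.linalg.solve(Sigma_Y, Y - mu_Y)
Sigma_post = Sigma_theta - Sigma_thetaY @ np.linalg.solve(Sigma_Y, Sigma_thetaY.T)
sd = np.sqrt(np.diag(Sigma_post))
lower, upper = theta_B - 1.96 * sd, theta_B + 1.96 * sd   # cf. (4.8)
print(list(zip(lower, upper)))
```

As expected, the posterior variances are no larger than the prior ones, so the credible intervals tighten after observing Y.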

## *4.2.3 The Linear Gaussian Model*

The Bayesian approach can be applied to the estimation of the standard linear model in matrix form

$$Y = \Phi\theta + E, \quad E \sim \mathcal{N}(0, \Sigma\_E), \quad \Sigma\_E > 0 \tag{4.9}$$

in which *<sup>Y</sup>* <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* and the parameter vector <sup>θ</sup> is no more regarded as a deterministic quantity, but as a random vector independent of *E*. In particular, we assume that some prior information is available which is embedded in a Gaussian prior distribution

$$
\theta \sim \mathcal{N}(\mu\_{\theta}, \Sigma\_{\theta}), \quad \Sigma\_{\theta} > 0.
$$

Since *Y* is the linear combination of the jointly Gaussian vectors θ and *E*, the vectors *Y* and θ are jointly Gaussian as well. Hereafter, positive definiteness of Σθ is assumed if not stated otherwise. The singular case, see Remark 4.1, amounts to assuming perfect knowledge of some linear combination of the unknown parameters or, equivalently, to constraining the estimated vector θ to belong to a prescribed subspace. The ability to incorporate this type of constraint is not unique to the Bayesian approach. In the context of the deterministic regularization, an example is given by the optimal regularization matrix $P = \theta\_0\theta\_0^T$, derived in Sect. 3.4.2.1.

In order to obtain the Bayes estimate according to (4.7), we need to compute μ*<sup>Y</sup>* = *E* (*Y* ), Σθ*<sup>Y</sup>* = Cov(θ , *Y* ), and Σ*<sup>Y</sup>* = Var(*Y* ):

$$\begin{aligned} \mu\_Y &= \mathcal{E}(Y) = \Phi\mu\_\theta \\ \text{Var}(Y) &= \text{Var}(\Phi\theta) + \text{Var}(E) = \Phi\Sigma\_\theta\Phi^T + \Sigma\_E \\ \text{Cov}(\theta, Y) &= \text{Cov}(\theta, \Phi\theta) + \text{Cov}(\theta, E) = \Sigma\_\theta\Phi^T. \end{aligned}$$

Then, we can apply (4.7) to obtain

$$\boldsymbol{\theta}^{\rm B} = \mu\_{\boldsymbol{\theta}} + \Sigma\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\rm T} (\boldsymbol{\Phi} \,\Sigma\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\rm T} + \Sigma\_{\boldsymbol{E}})^{-1} (\boldsymbol{Y} - \boldsymbol{\Phi} \,\mu\_{\boldsymbol{\theta}}) \tag{4.10}$$

$$\text{Var}(\boldsymbol{\theta}|\boldsymbol{Y}) = \boldsymbol{\Sigma}\_{\boldsymbol{\theta}} - \boldsymbol{\Sigma}\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\boldsymbol{T}} (\boldsymbol{\Phi}\boldsymbol{\Sigma}\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\boldsymbol{T}} + \boldsymbol{\Sigma}\_{\boldsymbol{E}})^{-1} \boldsymbol{\Phi}\boldsymbol{\Sigma}\_{\boldsymbol{\theta}}.\tag{4.11}$$

The proofs of the following two classical results are reported in Sects. 4.13.2 and 4.13.3.

**Theorem 4.2** (Orthogonality property)

$$\mathcal{E}\left[ (\theta^{\mathcal{B}} - \theta)Y^{T} \right] = 0. \tag{4.12}$$

The following lemma, whose proof is in Sect. 4.13.3, is useful in order to obtain an alternative expression that proves more convenient, especially when $n \ll N$.

**Lemma 4.1** *It holds that*

$$
\Sigma\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\boldsymbol{T}} (\boldsymbol{\Phi} \boldsymbol{\Sigma}\_{\boldsymbol{\theta}} \boldsymbol{\Phi}^{\boldsymbol{T}} + \boldsymbol{\Sigma}\_{\boldsymbol{E}})^{-1} = (\boldsymbol{\Phi}^{\boldsymbol{T}} \boldsymbol{\Sigma}\_{\boldsymbol{E}}^{-1} \boldsymbol{\Phi} + \boldsymbol{\Sigma}\_{\boldsymbol{\theta}}^{-1})^{-1} \boldsymbol{\Phi}^{\boldsymbol{T}} \boldsymbol{\Sigma}\_{\boldsymbol{E}}^{-1} .
$$

By applying the previous lemma, the alternative expression of the Bayes estimate is obtained

$$\boldsymbol{\theta}^{\rm B} = (\boldsymbol{\Phi}^{\rm T} \boldsymbol{\Sigma}\_{E}^{-1} \boldsymbol{\Phi} + \boldsymbol{\Sigma}\_{\boldsymbol{\theta}}^{-1})^{-1} (\boldsymbol{\Phi}^{\rm T} \boldsymbol{\Sigma}\_{E}^{-1} \boldsymbol{Y} + \boldsymbol{\Sigma}\_{\boldsymbol{\theta}}^{-1} \boldsymbol{\mu}\_{\boldsymbol{\theta}}) \tag{4.13}$$

$$\text{Var}(\theta|Y) = (\Phi^T \Sigma\_E^{-1} \Phi + \Sigma\_\theta^{-1})^{-1}. \tag{4.14}$$
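The equivalence of (4.10) and (4.13) is easy to check numerically: the first inverts an N × N matrix, the second an n × n one, yet the two estimates coincide. A sketch with illustrative matrices (all values assumed for the example):

```python
import numpy as np

# Sketch verifying that the two forms of the Bayes estimate coincide.
rng = np.random.default_rng(3)
N, n = 40, 3
Phi = rng.standard_normal((N, n))
Sigma_E = 0.2 * np.eye(N)
Sigma_theta = np.diag([1.0, 0.5, 2.0])
mu_theta = np.array([0.1, -0.2, 0.3])
Y = rng.standard_normal(N)

# (4.10): N x N inversion
big = Phi @ Sigma_theta @ Phi.T + Sigma_E
theta_B1 = mu_theta + Sigma_theta @ Phi.T @ np.linalg.solve(big, Y - Phi @ mu_theta)

# (4.13): n x n inversion
SEinv = np.linalg.inv(Sigma_E)
STinv = np.linalg.inv(Sigma_theta)
small = Phi.T @ SEinv @ Phi + STinv
theta_B2 = np.linalg.solve(small, Phi.T @ SEinv @ Y + STinv @ mu_theta)

print(np.max(np.abs(theta_B1 - theta_B2)))
```

For large N and small n, the second form is the one to use in practice.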

As already noted, the Bayes estimate coincides with θMAP, the maximum of the posterior density:

$$\mathbf{p}(\theta|Y) \propto \mathbf{p}(Y|\theta)\mathbf{p}(\theta).$$

Recall that, in view of the assumed linear model (4.9),

$$Y|\theta \sim \mathcal{N}(\Phi\theta, \Sigma\_E)$$

and note that


$$\log \mathbf{p}(\theta) = c\_1 - \frac{1}{2} (\theta - \mu\_\theta)^T \Sigma\_\theta^{-1} (\theta - \mu\_\theta) \tag{4.15}$$

$$\log \mathbf{p}(Y|\theta) = c\_2 - \frac{1}{2} (Y - \Phi\theta)^T \Sigma\_E^{-1} (Y - \Phi\theta),\tag{4.16}$$

where *c*<sup>1</sup> and *c*<sup>2</sup> are constants we are not concerned with. Therefore, the maximization of the posterior density can be written as

$$\begin{split} \theta^{\text{MAP}} &= \underset{\theta}{\text{arg}\,\text{max}} \,\log \mathbf{p}(Y|\theta) + \log \mathbf{p}(\theta) \\ &= \underset{\theta}{\text{arg}\,\text{min}} \,(Y - \Phi \theta)^{T} \Sigma\_{E}^{-1} (Y - \Phi \theta) + (\theta - \mu\_{\theta})^{T} \Sigma\_{\theta}^{-1} (\theta - \mu\_{\theta}) \end{split}$$

whose solution is easily shown to be given by (4.13). This shows that, under Gaussianity assumptions, the Bayes estimate of the linear model can be seen as a regularized least squares estimator with quadratic regularization term (ReLS-Q), see Sect. 3.4. In particular, if

$$
\Sigma\_E = \sigma^2 I\_N, \quad \mu\_\theta = 0,\tag{4.17}
$$

the Bayes and MAP estimators,

$$\theta^{\rm B} = \theta^{\rm MAP} = \arg\min\_{\theta} \|Y - \Phi\theta\|^2 + \theta^T P^{-1} \theta,\tag{4.18}$$

coincide with the ReLS estimator with regularization matrix *P* = Σθ /σ2. Under the further assumption Σθ = λ*In*, the MAP estimator coincides with a ridge regression estimator with γ = σ<sup>2</sup>/λ.
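This equivalence can be checked directly; in the sketch below (illustrative dimensions and values), the Bayes estimate under (4.17) with $\Sigma_\theta = \lambda I_n$ coincides with the ridge estimate with $\gamma = \sigma^2/\lambda$:

```python
import numpy as np

# Sketch: MAP/Bayes estimate vs. ridge regression under (4.17).
rng = np.random.default_rng(4)
N, n, sigma2, lam = 25, 4, 0.3, 2.0
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)

# Ridge regression with gamma = sigma^2 / lambda
gamma = sigma2 / lam
theta_ridge = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

# Bayes estimate (4.13) with mu_theta = 0, Sigma_E = sigma2*I, Sigma_theta = lam*I
theta_bayes = np.linalg.solve(Phi.T @ Phi / sigma2 + np.eye(n) / lam,
                              Phi.T @ Y / sigma2)

print(np.max(np.abs(theta_ridge - theta_bayes)))
```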

**Remark 4.1** When Σθ = *P*, where *P* = *P<sup>T</sup>* ≥ 0 is singular, one can still use (4.10) to obtain the Bayes estimate, while (4.13) and the quadratic problem (4.18) are no longer valid due to the nonexistence of $\Sigma\_\theta^{-1}$. Nevertheless, by replicating the derivation in Remark 3.1, it is still possible to interpret the Bayes estimate as the solution of a constrained quadratic problem. In particular, under (4.17), we have that

$$\boldsymbol{\theta}^{\mathbf{B}} = \arg\min\_{\theta} \left\lVert Y - \Phi \boldsymbol{\theta} \right\rVert\_2^2 + \boldsymbol{\theta}^T \boldsymbol{P}^+ \boldsymbol{\theta} \tag{4.19}$$

$$\text{subj.to}\quad U\_2^T \theta = 0,\tag{4.20}$$

where *U*<sup>2</sup> was defined in Remark 3.1, as part of the singular value decomposition of *P*. The result can be interpreted as follows. A singular variance matrix means that we have perfect knowledge on some linear combination of the parameter vector. In particular,

$$\begin{aligned} \text{Var}\left[U\_2^T \theta\right] &= U\_2^T \text{Var}\left(\theta\right) U\_2 \\ &= U\_2^T \begin{bmatrix} U\_1 & U\_2 \end{bmatrix} \begin{bmatrix} \Lambda\_P & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U\_1 & U\_2 \end{bmatrix}^T U\_2 = 0, \end{aligned}$$

where, with reference to the SVD of $P$, we have exploited the fact that $U\_2^T U\_1 = 0$. As a consequence,

$$\Pr(U\_2^T \theta = U\_2^T \mu\_\theta) = 1,$$

thus justifying the presence of the equality constraints in the quadratic problem (4.19)–(4.20), where $\mu\_\theta = 0$ is assumed. Recalling the orthogonality of $U\_1$ and $U\_2$, we have that $U\_2^T \theta = 0$ implies $\theta \in \text{Range}(U\_1) = \text{Range}(P)$. Therefore, the constrained quadratic problem (4.19)–(4.20) can also be equivalently reformulated as

$$\boldsymbol{\theta}^{\mathcal{B}} = \underset{\boldsymbol{\theta} \in \text{Range}(\boldsymbol{P})}{\text{arg}\min} \quad \left\|\boldsymbol{Y} - \boldsymbol{\Phi}\boldsymbol{\theta}\right\|^2 + \boldsymbol{\theta}^T \boldsymbol{P}^+ \boldsymbol{\theta}. \tag{4.21}$$

One can also verify that the solution of this problem can be written as

$$\theta^{\rm B} = P\Phi^{T}(\Phi P \Phi^{T} + \Sigma\_E)^{+}Y,$$

an expression which does not require invertibility of any matrix.
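
A minimal numerical sketch (with $\Sigma\_E = I\_N$ and an arbitrary rank-deficient $P$; all dimensions, data and names are illustrative assumptions) confirms that the pseudoinverse formula returns the solution of the constrained problem (4.19)–(4.21) and that the estimate lies in $\text{Range}(P)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, r = 5, 40, 3                                   # P has rank r < n

Phi = rng.standard_normal((N, n))
U1 = np.linalg.qr(rng.standard_normal((n, r)))[0]    # orthonormal basis of Range(P)
Lam = np.diag(rng.uniform(0.5, 2.0, r))
P = U1 @ Lam @ U1.T                                  # singular covariance / regularization matrix
Y = rng.standard_normal(N)

# Pseudoinverse formula (with Sigma_E = I_N): theta_B = P Phi^T (Phi P Phi^T + I)^+ Y
theta_B = P @ Phi.T @ np.linalg.pinv(Phi @ P @ Phi.T + np.eye(N)) @ Y

# Constrained problem (4.19)-(4.21): parametrize theta = U1 z, so that U2^T theta = 0,
# and minimize ||Y - Phi U1 z||^2 + z^T Lam^{-1} z
A = Phi @ U1
z = np.linalg.solve(A.T @ A + np.linalg.inv(Lam), A.T @ Y)
theta_c = U1 @ z

assert np.allclose(theta_B, theta_c)                 # same estimate
assert np.allclose(U1 @ (U1.T @ theta_B), theta_B)   # theta_B lies in Range(P)
```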

In conclusion, the Bayes estimate always exists and is unique. In any case, it can be written as (4.7) with $\Sigma\_Y^{-1}$ replaced by its pseudoinverse.

The Bayesian interpretation of deterministic regularization can be exploited to obtain a guideline for the selection of the regularization matrix. The simplest case is when some statistics, e.g., based on samples coming from past problems, are available for the parameter vector θ. Then, the Bayesian interpretation suggests selecting the covariance matrix of θ, divided by the error variance $\sigma^2$, as regularization matrix. If examples from the past are not available, one may rely on prior knowledge indicating that some entries of θ have smaller variance than others or that some correlation exists between the entries.

## *4.2.4 Hierarchical Bayes: Hyperparameters*

In cases where prior information on the parameters is not sufficient to specify a prior, it is common to resort to hierarchical Bayesian models. Instead of fixing the prior, a family of priors is considered, parametrized by one or more *hyperparameters*. As an example, consider the case in which prior knowledge can be formalized in terms of zero-mean, independent and identically distributed parameters whose absolute value is not too large. In the absence of more precise information on their size, we could adopt the following prior:

$$
\theta \sim \mathcal{N}(0, \lambda I\_N),
$$

where the scalar λ, called a hyperparameter, enters the game as a further unknown quantity. More generally, the prior distribution p(θ|α) may depend on a hyperparameter vector α. One may also want to consider a hyperparameter vector β entering the definition of the likelihood p(*Y*|θ, β). The most common example is when the measurement variance $\sigma^2$ is not known and is therefore treated as a hyperparameter. In the following, the vector of all hyperparameters will be denoted by

$$\eta = \begin{bmatrix} \alpha^{T} & \beta^{T} \end{bmatrix}^{T}.$$

For a given η, we will denote by $\theta^{\text{MAP}}(\eta)$ and $\theta^{\rm B}(\eta)$ the corresponding MAP and Bayes estimates:

$$\theta^{\text{MAP}}(\eta) = \arg\max\_{\theta} \mathbf{p}(\theta | Y, \eta) \tag{4.22}$$

$$\theta^{\rm B}(\eta) = \mathcal{E}(\theta|Y, \eta) = \int \theta \mathbf{p}(\theta|Y, \eta) d\theta, \tag{4.23}$$

where

$$\mathbf{p}(\theta|Y,\eta) = \frac{\mathbf{p}(Y|\theta,\beta)\mathbf{p}(\theta|\alpha)}{\int \mathbf{p}(Y|\theta,\beta)\mathbf{p}(\theta|\alpha)d\theta}.\tag{4.24}$$

## **4.3 Bayesian Interpretation of the James–Stein Estimator**

In this section, we show that the James–Stein estimator can be seen as a particular Bayesian estimator. As seen in Eq. (1.2), the measurement model is

$$Y = \theta + E, \quad E \sim \mathcal{N}(0, \sigma^2 I\_N). \tag{4.25}$$

In a Bayesian setting, the parameter vector is regarded as a random vector, whose distribution reflects our state of knowledge. In particular, we assume

$$
\theta \sim \mathcal{N}(0, \lambda I\_N), \tag{4.26}
$$

where λ plays the role of hyperparameter. It follows that θ and *Y* are zero-mean jointly Gaussian variables with

$$
\Sigma\_{\theta Y} = \mathcal{E}(\theta Y^T) = \mathcal{E}(\theta \theta^T) = \lambda I\_N, \quad \Sigma\_Y = \mathcal{E}(Y Y^T) = (\lambda + \sigma^2) I\_N. \tag{4.27}
$$

According to (4.7), the Bayes estimate is given by the conditional expectation

$$\mathcal{E}(\theta|Y) = \Sigma\_{\theta Y} \Sigma\_Y^{-1} Y = \frac{\lambda}{\lambda + \sigma^2} Y = \left(1 - r\_{\text{Bayes}}\right) Y,\tag{4.28}$$

where

$$r\_{\text{Bayes}} = \frac{\sigma^2}{\lambda + \sigma^2}.\tag{4.29}$$

It is apparent that the estimator (4.28) has the same structure as James–Stein's, with $r$ replaced by $r\_{\text{Bayes}}$.

Since $Y$ and θ are jointly Gaussian, $\mathcal{E}(\theta|Y) = \theta^{\text{MAP}}$, where

$$\theta^{\text{MAP}} = \text{arg}\min\_{\theta} \frac{\|Y - \theta\|^2}{\sigma^2} + \frac{\|\theta\|^2}{\lambda} = \text{arg}\min\_{\theta} \|Y - \theta\|^2 + \frac{\sigma^2}{\lambda} \|\theta\|^2$$

which highlights the fact that $\mathcal{E}(\theta|Y)$ is the solution of a regularized least squares problem, controlled by the regularization parameter $\sigma^2/\lambda$.

If the variances λ and $\sigma^2$ could be assigned on the basis of prior knowledge, the similarity would be only formal. Let us take a step forward, considering the case in which the variance $\sigma^2$ is given, while λ is estimated from the data. The basic idea is that the hyperparameter λ could be tuned based on the observed vector $Y$ and plugged into (4.29) to obtain an estimate of $r\_{\text{Bayes}}$. Alternatively, one may focus directly on finding a sensible estimate of $r\_{\text{Bayes}}$. In this respect, we are going to show that Stein's $r$ is an unbiased estimate of $r\_{\text{Bayes}}$ under the Gaussian model (4.25) and (4.26) [6]. For this purpose, we will exploit a property of the inverse chi-square variable.

**Definition 4.1** *(chi-square random variable)* The sum of the squares of *n* independent standard Gaussian random variables is a nonnegative-valued random variable known as a *chi-square variable* with *n* degrees of freedom:

$$
\chi\_n^2 = \sum\_{i=1}^n X\_i^2, \quad X\_i \sim \mathcal{N}(0, 1).
$$

Its mean and variance are

$$\mathcal{E}\left(\chi\_n^2\right) = n, \quad \text{Var}\left(\chi\_n^2\right) = 2n.$$

The inverse of a chi-square variable is called *inverse chi-square*. For *n* > 2, its mean is

$$\mathcal{E}\left[\frac{1}{\chi\_n^2}\right] = \frac{1}{n-2}.\tag{4.30}$$

Now, assume *N* > 2 and observe that

$$\frac{\|Y\|^2}{\lambda + \sigma^2} = \frac{\sum\_{i=1}^{N} Y\_i^2}{\lambda + \sigma^2} \sim \chi\_N^2.$$

Recalling that the expectation of the inverse chi-square is equal to 1/(*N* − 2), we have that

$$\mathcal{E}\left[\frac{\lambda+\sigma^2}{\|Y\|^2}\right] = \mathcal{E}\left[\frac{1}{\chi\_N^2}\right] = \frac{1}{N-2}.$$

Therefore,


$$\mathcal{E}(r) = \mathcal{E}\left[\frac{(N-2)\sigma^2}{\|Y\|^2}\right] = \frac{\sigma^2}{\lambda + \sigma^2} = r\_{\text{Bayes}}.$$

This means that James–Stein's shrinking coefficient $r$ can be seen as an unbiased estimator of the shrinking coefficient $r\_{\text{Bayes}}$ appearing in the formula of the posterior expectation.
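
The unbiasedness claim can be checked by a simple Monte Carlo simulation (a sketch with illustrative values of $N$, λ and $\sigma^2$; not code from the book):

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam, sigma2, draws = 10, 1.0, 1.0, 200_000

theta = np.sqrt(lam) * rng.standard_normal((draws, N))          # theta ~ N(0, lam I_N)
Y = theta + np.sqrt(sigma2) * rng.standard_normal((draws, N))   # Y = theta + E

r = (N - 2) * sigma2 / np.sum(Y**2, axis=1)   # Stein's shrinking coefficient, per draw
r_bayes = sigma2 / (lam + sigma2)             # = 0.5 for these values

assert abs(r.mean() - r_bayes) < 0.01         # E(r) = r_Bayes, up to Monte Carlo error
```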

The example is instructive in several respects. First, it shows that, under suitable probabilistic assumptions, the typical structure of regularized estimators can be justified through Bayesian arguments. The second point has to do with the tuning of the regularization parameters. In the empirical Bayes approach, see Sect. 4.4, there is a preliminary step in which a point estimate of the hyperparameters is obtained by standard estimation methods. Then, this point estimate is plugged into the expression of the Bayesian estimator. Although a full Bayesian approach would call for the joint estimation of parameters and hyperparameters, the two-step empirical Bayes approach not only combines simplicity and effectiveness but also provides a probabilistic underpinning to regularized identification methods.

## **4.4 Full and Empirical Bayes Approaches**

When the prior, and possibly the likelihood, include hyperparameters, Bayesian estimation becomes more complex and gives rise to alternative approaches. In principle, we want to obtain the posterior distribution

$$\mathbf{p}(\theta|Y) = \frac{\mathbf{p}(Y|\theta)\mathbf{p}(\theta)}{\mathbf{p}(Y)}.$$

However, if a hierarchical Bayesian model is adopted, we do not know p(θ ), but only p(θ|η). At the cost of assigning a prior p(η) also to the hyperparameters, the prior p(θ ) can be obtained by marginalization of the joint probability density:

$$\mathbf{p}(\theta) = \int \mathbf{p}(\theta, \eta) d\eta = \int \mathbf{p}(\theta|\eta)\mathbf{p}(\eta) d\eta.$$

In general, this integral has to be computed numerically, e.g., by Monte Carlo methods. This leads to *full Bayesian* methods that compute the desired p(θ|*Y* ) regarding both parameters and hyperparameters as random variables. Some remarks on these methods will be given in Sect. 4.10.

The justification for a simpler computational scheme stems from the following reformulation of the posterior:

$$\mathbf{p}(\theta|Y) = \int \mathbf{p}(\theta, \eta|Y) d\eta = \int \mathbf{p}(\theta|\eta, Y)\mathbf{p}(\eta|Y) d\eta. \tag{4.31}$$

Observe that

$$\mathbf{p}(\eta|Y) \propto \mathbf{p}(Y|\eta)\mathbf{p}(\eta),\tag{4.32}$$

where *L*(η|*Y* ) = p(*Y* |η) is the likelihood of the hyperparameter vector η. It is also called *marginal likelihood* because it is obtained from the marginalization with respect to θ of the joint density p(*Y*, θ|η):

$$L(\eta|Y) = \int \mathbf{p}(Y,\theta|\eta)d\theta = \int \mathbf{p}(Y|\theta,\eta)\mathbf{p}(\theta|\eta)d\theta. \tag{4.33}$$

If data are sufficiently informative, the marginal likelihood is likely to be unimodal and sharply peaked in a neighbourhood of the maximum likelihood estimate

$$
\eta^{\mathsf{ML}} = \arg\max\_{\eta} \mathbf{p}(Y|\eta).
$$

When this happens and p(η) is rather uninformative (as it should be), from (4.32) it follows that p(η|$Y$) is peaked as well. Then, as long as the properties of p(θ|η, $Y$) do not change rapidly with η near $\eta^{\text{ML}}$, the integral (4.31) can be approximated as

$$\mathbf{p}(\theta|Y) \simeq \mathbf{p}(\theta|\boldsymbol{\eta}^{\text{ML}}, Y) = \frac{\mathbf{p}(Y|\theta, \boldsymbol{\eta}^{\text{ML}})\mathbf{p}(\theta|\boldsymbol{\eta}^{\text{ML}})}{\mathbf{p}(Y|\boldsymbol{\eta}^{\text{ML}})}.$$

In practice, this suggests computing the posterior using the prior $\mathbf{p}^\*(\theta) = \mathbf{p}(\theta|\eta^{\text{ML}})$ associated with the maximum likelihood estimate of the hyperparameters. More generally, Empirical Bayes (EB) methods adopt a two-stage scheme. In the first step, a point estimate $\eta^\*$ is computed, which is then kept fixed in the second step, when the posterior of the parameters is obtained, based on the prior $\mathbf{p}^\*(\theta) = \mathbf{p}(\theta|\eta^\*)$.

Among the advantages of the approach, one may mention its simplicity, especially when there are few hyperparameters and the posterior $\mathbf{p}(\theta|Y, \eta^{\text{ML}})$ is easily obtained, as in the jointly Gaussian case. Moreover, the tuning of η admits an intuitive interpretation as the counterpart of model order selection in classic parametric estimation methods. The main drawback is that the EB method fails to propagate the uncertainty of the point estimate $\eta^\*$.

Under the linear Gaussian model (4.9), the integral (4.33) admits a closed-form solution. In fact, since

$$Y \sim \mathcal{N}(\Phi \mu\_{\theta}(\eta), \Sigma(\eta)), \quad \Sigma(\eta) = \Phi \Sigma\_{\theta}(\eta) \Phi^{T} + \Sigma\_{E}(\eta),$$

we have

$$\log L(\eta|Y) = -\frac{1}{2}\log\det(2\pi \Sigma) - \frac{1}{2}(Y - \Phi \mu\_{\theta})^T \Sigma^{-1} (Y - \Phi \mu\_{\theta}), \quad (4.34)$$

where in the right-hand side dependence on η has been omitted for simplicity.

Therefore, application of Empirical Bayes estimation to the linear model (4.9) consists of the following two steps:

**Step 1:**

$$\eta^\* = \eta^{\text{ML}} = \arg\max\_{\eta} L(\eta|Y).$$

**Step 2:** Let $\mu\_\theta = \mu\_\theta(\eta^\*)$, $\Sigma\_E = \Sigma\_E(\eta^\*)$, $\Sigma\_\theta = \Sigma\_\theta(\eta^\*)$ and compute the posterior expectation according to Sect. 4.2.3.
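
The two steps can be sketched as follows for the ridge-type case $\mu\_\theta = 0$, $\Sigma\_\theta = \lambda I\_n$, $\Sigma\_E = \sigma^2 I\_N$ with $\sigma^2$ known (the grid search, data sizes and seed are illustrative assumptions; in practice the marginal likelihood is usually maximized by a dedicated optimizer):

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, sigma2 = 6, 80, 0.1
Phi = rng.standard_normal((N, n))
theta = np.sqrt(0.8) * rng.standard_normal(n)          # data generated with prior variance 0.8
Y = Phi @ theta + np.sqrt(sigma2) * rng.standard_normal(N)

def neg_log_marglik(lam):
    # negative of (4.34) with mu_theta = 0 and Sigma(eta) = lam Phi Phi^T + sigma2 I_N
    Sigma = lam * Phi @ Phi.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return 0.5 * logdet + 0.5 * Y @ np.linalg.solve(Sigma, Y)

# Step 1: hyperparameter estimate by (grid) marginal-likelihood maximization
grid = np.logspace(-3, 2, 200)
lam_ml = grid[np.argmin([neg_log_marglik(lv) for lv in grid])]

# Step 2: posterior mean with the estimated prior plugged in
theta_eb = np.linalg.solve(Phi.T @ Phi + (sigma2 / lam_ml) * np.eye(n), Phi.T @ Y)

assert 0.05 < lam_ml < 10.0     # a plausible scale for the prior variance is recovered
assert np.linalg.norm(theta_eb - theta) < np.linalg.norm(theta)
```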

When the likelihood and the prior are such that the integral (4.33) cannot be computed explicitly, an approximation is needed. In particular, one can resort to the Laplace approximation, which is based on a second-order Taylor expansion of log p($Y$, θ|η) around $\theta^{\text{MAP}}(\eta)$ defined in (4.22), from which an integrable approximation of p($Y$, θ|η) appearing in (4.33) is obtained. Note, however, that the Laplace approximation has to be recalculated for each evaluation of $L(\eta|Y)$ occurring during the iterative computation of $\eta^{\text{ML}}$.

## **4.5 Improper Priors and the Bias Space**

Priors are most useful whenever the data alone are not sufficient to provide reliable parameter estimates but some a priori knowledge exists that can be exploited. It may happen that for some parameters the introduction of a prior is not possible or not desirable, because their estimation can be satisfactorily performed anyway, given the information in the data. This can be accounted for by assuming that such parameters have *improper priors*.

In order to deal with the case where $p$ parameters $\theta^P \in \mathbb{R}^p$ have a proper prior and the remaining $n-p$ parameters $\theta^I \in \mathbb{R}^{n-p}$ have an improper prior, consider the following model:

$$Y = \Phi \theta + E, \quad \Phi = \begin{bmatrix} \Omega & \Psi \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta^P \\ \theta^I \end{bmatrix} \tag{4.35}$$

$$
\theta \sim \mathcal{N}(0, \Sigma\_{\theta}), \quad E \sim \mathcal{N}(0, \sigma^2 I\_N) \tag{4.36}
$$

$$
\Sigma\_{\theta} = \begin{bmatrix} \Sigma & 0 \\ 0 & aI\_{n-p} \end{bmatrix}, \quad \Sigma > 0. \tag{4.37}
$$

The (asymptotically) improper prior for $\theta^I$ is obtained by letting $a \to \infty$ so that $\theta^I$ has infinite variance, i.e., its density is flat. This amounts to a complete lack of prior knowledge for the last $n-p$ entries of the parameter vector θ that, for simplicity, is assumed to be zero mean. The use of improper priors in a Bayesian setting has the same effect as the introduction of a *bias space* in a deterministic regularization setting. Within such a subspace, parameters are immune from regularization, a feature that can be useful to apply regularization only where needed without causing undesired distortions. The following theorem, whose proof is in Sect. 4.13.4, is analogous to a result obtained in [22] to provide a Bayesian interpretation of smoothing splines. It illustrates the asymptotic behaviour of posterior means and variances as $a$ goes to infinity.

**Theorem 4.3** (adapted from [22]) *If* rank(Φ) = *n and* rank(Ψ) = *n* − *p, then*

$$\begin{aligned} \lim\_{a \to \infty} \mathcal{E}(\theta^{I}|Y) &= (\Psi^{T}M^{-1}\Psi)^{-1}\Psi^{T}M^{-1}Y \\ \lim\_{a \to \infty} \mathcal{E}(\theta^{P}|Y) &= \Sigma\Omega^{T}M^{-1}\left(I\_{N} - \Psi(\Psi^{T}M^{-1}\Psi)^{-1}\Psi^{T}M^{-1}\right)Y \\ M &= \Omega\Sigma\Omega^{T} + \sigma^{2}I\_{N} \\ \lim\_{a \to \infty} \text{Var}\left(\theta|Y\right) &= \sigma^{2}\left(\Phi^{T}\Phi + \sigma^{2}\begin{bmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{bmatrix}\right)^{-1}. \end{aligned}$$
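
The limit expressions for the posterior means can be checked numerically by plugging a large but finite $a$ into the standard posterior-mean formula (an illustrative sketch; the dimensions, data and value of $a$ are assumptions, and the well-conditioned "information" form is used to avoid numerical issues for large $a$):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, q = 50, 3, 2                   # q = n - p parameters with improper prior
n = p + q
sigma2, a = 0.5, 1e10

Omega = rng.standard_normal((N, p))
Psi = rng.standard_normal((N, q))
Phi = np.hstack([Omega, Psi])
L = rng.standard_normal((p, p))
Sigma = L @ L.T + np.eye(p)          # proper prior covariance of theta^P
Y = rng.standard_normal(N)

# Posterior mean for large but finite a: (Phi^T Phi / sigma2 + Sigma_theta^{-1})^{-1} Phi^T Y / sigma2
Sigma_theta_inv = np.block([[np.linalg.inv(Sigma), np.zeros((p, q))],
                            [np.zeros((q, p)), np.eye(q) / a]])
theta_post = np.linalg.solve(Phi.T @ Phi / sigma2 + Sigma_theta_inv,
                             Phi.T @ Y / sigma2)

# Limit expressions of Theorem 4.3
M = Omega @ Sigma @ Omega.T + sigma2 * np.eye(N)
Minv = np.linalg.inv(M)
theta_I = np.linalg.solve(Psi.T @ Minv @ Psi, Psi.T @ Minv @ Y)
theta_P = Sigma @ Omega.T @ Minv @ (Y - Psi @ theta_I)

assert np.allclose(theta_post[:p], theta_P, atol=1e-6)
assert np.allclose(theta_post[p:], theta_I, atol=1e-6)
```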

An interesting benefit of improper priors is the possibility of reducing the number of hyperparameters by treating some of them as unknowns whose prior is improper. Letting the symbol $\mathbf{1}\_{n\times 1}$ denote a column vector of ones, assume, for example, that $\theta \sim \mathcal{N}(\mu\mathbf{1}\_{n\times 1}, \Sigma\_\theta)$, i.e., all the scalar entries of θ share the same prior mean μ. In most cases, very little is known about μ, which could therefore be regarded as a hyperparameter to be tuned by marginal likelihood maximization. It can then be treated as a deterministically known variable, according to the Empirical Bayes approach, see Sect. 4.4. By this choice, however, the hyperparameter is fixed to its point estimate and its uncertainty is not propagated, implying that the uncertainty of $\theta^{\rm B}$ will be underestimated if assessed by (4.14).

Alternatively, μ can be treated as a further random parameter. For this purpose, define $\tilde\theta = \theta - \mu\mathbf{1}\_{n\times 1}$ and consider the model

$$\begin{aligned} \bar{\theta} &= \begin{bmatrix} \tilde{\theta} \\ \mu \end{bmatrix}, & \Sigma\_{\bar{\theta}} &= \begin{bmatrix} \Sigma\_{\theta} & 0 \\ 0 & a \end{bmatrix} \\ Y &= \bar{\Phi}\bar{\theta} + E, & \bar{\Phi} &= \begin{bmatrix} \Phi & \Phi \mathbf{1}\_{n \times 1} \end{bmatrix} \\ \bar{\theta} &\sim \mathcal{N}(0, \Sigma\_{\bar{\theta}}), & E &\sim \mathcal{N}(0, \sigma^2 I\_N). \end{aligned}$$

This formulation decreases the number of hyperparameters without introducing prejudices (provided we let $a \to \infty$). More importantly, it is now possible to assess the joint uncertainty of the estimates of μ and $\tilde\theta$ through the posterior variance $\text{Var}(\bar\theta|Y)$.

## **4.6 Maximum Entropy Priors**

A major appeal of the Bayesian paradigm lies in its ability to provide a rational foundation to regularization: one starts from prior knowledge and then proceeds with its formalization in terms of a probabilistic prior, from which the regularization penalty is finally derived. However, there is a stumbling block in the way, because the available prior knowledge is often too vague to avoid arbitrariness in the choice of the prior distribution. As a matter of fact, the derivation of systematic approaches for the selection of prior distributions is a classic topic of Bayesian estimation theory. In this section, the approach based on entropy maximization is briefly reviewed.

The starting point is the observation that, even when prior information is absent or very limited, there are candidate distributions that are obviously preferable, due to symmetry arguments. Assume, for instance, that candidate values for a scalar parameter θ are known to belong to a finite set $\{\theta\_i, i = 1, \dots, m\}$ and no further information is available. Then, the only reasonable prior distribution is $\mathbf{p}(\theta = \theta\_i) = 1/m$. In fact, assigning unequal probabilities would create an unjustified asymmetry, given that our prior information does not make any distinction between the $m$ possible values of the parameter.

The case of a continuous-valued parameter θ taking values in a finite interval [*a*, *b*] can be addressed in a similar way. In this case, a reasonable prior distribution is the uniform one:

$$\mathbf{p}(\theta) = \begin{cases} \frac{1}{b-a}, & a \le \theta \le b \\ 0, & \text{elsewhere.} \end{cases}$$

In both examples, we might say the chosen distributions are those that reflect the maximum ignorance about the unknown parameter.

The next step is to formalize this notion of maximum ignorance in contexts where some partial information about θ is available. This can be done by means of the notion of entropy of a probability distribution. For a discrete distribution p(·) taking values $\mathbf{p}(\theta\_i)$ on a countable set $\{\theta\_i\}$, the entropy $H$ is defined as

$$H(\mathbf{p}) = -\sum\_{i} \mathbf{p}(\theta\_i) \log \mathbf{p}(\theta\_i).$$

Note that the minimum possible entropy $H(\mathbf{p}) = 0$ occurs when the probability is concentrated at a unique value $\bar\theta$. This is the case of a maximally informative distribution such that $\mathbf{p}(\theta = \bar\theta) = 1$. Conversely, if the set $\{\theta\_i\}$ has cardinality $m$, the maximum value $H(\mathbf{p}) = \log(m)$ is achieved by the uniform distribution $\mathbf{p}(\theta = \theta\_i) = 1/m$, $\forall i$. In other words, the larger the entropy, the less information is conveyed by the distribution.

For continuous-valued random variables, the notion of *differential entropy h*(p) is introduced:

$$h(\mathbf{p}) = -\int\_{D\_{\theta}} \mathbf{p}(\theta) \log \mathbf{p}(\theta) d\theta,$$

where $D\_\theta$ denotes the support of the distribution. Note that, among distributions with finite support, the maximum possible (differential) entropy is achieved by the uniform distribution.

The principle of *Maximum Entropy* (MaxEnt) states that the admissible distribution with the largest entropy is the one that best represents the current state of knowledge. The admissible distributions are those that satisfy a set of constraints, chosen so as to incorporate all the available prior knowledge. For instance, if the prior knowledge amounts to knowing that θ ∈ [*a*, *b*], the prior suggested by the MaxEnt principle is the uniform distribution. Other types of constraints are typically expressed as expectations of functions of the parameter θ. In particular, consider a random variable θ, subject to known values $\eta\_i$ of $m$ expectations

$$\mathcal{E}[\mathbf{g}\_i(\theta)] = \int \mathbf{g}\_i(\theta)\mathbf{p}(\theta)d\theta = \eta\_i, \quad i = 1, \dots, m. \tag{4.38}$$

Then, we have the following useful result.

**Theorem 4.4** (General form of maximum entropy distributions, based on [12]) *Among all the distributions satisfying (4.38), the maximum entropy one is of exponential type*

$$\mathbf{p}(\theta) = A \exp(-\lambda\_1 \mathbf{g}\_1(\theta) - \dots - \lambda\_m \mathbf{g}\_m(\theta)),\tag{4.39}$$

*where* $\lambda\_i$ *are m constants determined from (4.38) and A is such that*

$$A \int\_{-\infty}^{+\infty} \exp(-\lambda\_1 g\_1(\theta) - \dots - \lambda\_m g\_m(\theta)) d\theta = 1. \tag{4.40}$$

**Example 4.5** *(MaxEnt prior from information on expected absolute value)* Assume that prior knowledge is summarized by the expectation $\mathcal{E}|\theta| = \eta$. Then, the MaxEnt prior is the solution of the constrained optimization problem

$$\max\_{\mathbf{p}} h(\mathbf{p}) \quad \text{s.t.} \quad \mathcal{E}|\theta| = \eta.$$

Obviously, $m = 1$ and $\mathbf{g}\_1(\theta) = |\theta|$. In view of (4.39) and (4.40), p(θ) is a Laplace distribution:

$$\mathbf{p}(\theta) = 0.5\lambda e^{-\lambda|\theta|}.$$

The value of λ is found by imposing the constraint on the expectation:

$$\int\_{-\infty}^{+\infty} 0.5|\theta|\lambda e^{-\lambda|\theta|}d\theta = \eta.$$

Since the constraint on the expectation is satisfied for λ = 1/η, the following Laplace distribution is eventually obtained:

$$\mathbf{p}(\theta) = \frac{e^{-\frac{|\theta|}{\eta}}}{2\eta}.$$

Therefore, starting from very partial information, such as a guess on the expected absolute value of the parameter, it is possible to completely specify a prior distribution that: (i) is coherent with the prior knowledge and (ii) does not introduce undue assumptions, because it is the least informative one insofar as entropy is taken as a measure of informativeness. One could object that it is scarcely realistic to assume prior knowledge of the expected absolute value of θ. However, if we adopt the empirical Bayes framework, the objection is circumvented by the possibility of treating η as a hyperparameter that will be estimated from data.

Therefore, prior knowledge may just tell that the expectation of |θ| is finite, without specifying a value for this expectation. The MaxEnt principle then suggests the functional form of the prior, which incorporates a hyperparameter η whose tuning, e.g., by marginal likelihood maximization, see Sect. 4.4, will be the first step of the actual estimation algorithm. As will be seen in the following, this particular prior is associated with the Bayesian interpretation of the regularization penalty employed by the so-called Lasso estimator, already introduced in a deterministic regularization setting in Sect. 3.6.1.1.
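
The moment constraint and the maximum entropy property of the Laplace prior can be illustrated numerically (a sketch with an arbitrary value of η; the comparison with a single Gaussian matched to the same $\mathcal{E}|\theta|$ is only a spot check of the MaxEnt property, not a proof):

```python
import numpy as np

eta = 1.3
x = np.linspace(-60.0, 60.0, 1_200_001)   # fine grid; Laplace tails are negligible beyond
dx = x[1] - x[0]

p_lap = np.exp(-np.abs(x) / eta) / (2 * eta)      # MaxEnt prior under E|theta| = eta

# Check normalization and the moment constraint E|theta| = eta (Riemann sums)
assert abs(np.sum(p_lap) * dx - 1.0) < 1e-6
assert abs(np.sum(np.abs(x) * p_lap) * dx - eta) < 1e-6

# A Gaussian with the same E|theta| (so sigma = eta * sqrt(pi/2)) has *lower* entropy
sig = eta * np.sqrt(np.pi / 2)
p_gau = np.exp(-x**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))

def entropy(p):
    q = np.where(p > 0, p, 1.0)   # zero-density points contribute 0 to the integral
    return -np.sum(p * np.log(q)) * dx

assert entropy(p_lap) > entropy(p_gau)
```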

For our purposes, of particular interest are MaxEnt priors satisfying constraints on the second-order moments. In the scalar case, we have the following classical result, e.g., see [19].

**Proposition 4.1** (based on [12]) *Let* θ *be a zero-mean random variable with known variance* $\mathcal{E}\theta^2 = \lambda$*. Then, the MaxEnt distribution is normal:*

$$
\theta \sim \mathcal{N}(0, \lambda).
$$

Also in this case, the necessity of specifying λ is not an issue, because the unknown variance can be regarded as a hyperparameter and tuned by marginal likelihood maximization. In other words, if the only prior knowledge is that θ has a finite, yet unknown, variance, the MaxEnt principle suggests the use of a normal prior parametrized by its variance.

When θ is a vector, a multivariate prior might be derived according to the following proposition.

**Proposition 4.2** (based on [12]) *Let* θ *be a zero-mean n-dimensional random vector whose entries have known variances* $\mathcal{E}\theta\_i^2 = \lambda\_i$, $i = 1, \dots, n$*. Then, the MaxEnt distribution is a multivariate normal with diagonal covariance matrix:*

$$
\theta \sim \mathcal{N}(0, \Sigma\_{\theta}), \quad \Sigma\_{\theta} = \text{diag}\{\lambda\_i\}.
$$

The importance of this result is twofold. First, also in the multivariate case, the least informative distribution under second moment constraints is of normal type. Moreover, if the covariances are unknown, it is seen that the MaxEnt principle yields independent distributions.

A shortcoming of the maximum entropy approach is that the resulting distributions are not invariant with respect to reparametrizations of the unknown vector. As an example, we have already seen that the maximum entropy distribution of θ in a finite interval [1, 2] is uniform. On the other hand, if the reparametrization $\psi = 1/\theta$ is adopted and the MaxEnt approach is applied to ψ, the resulting prior will be a uniform distribution for ψ in [0.5, 1], which corresponds to

$$\mathbf{p}(\theta) = \begin{cases} \frac{2}{\theta^2}, & 1 \le \theta \le 2 \\ 0, & \text{elsewhere,} \end{cases}$$

which is obviously different from a uniform distribution. A possible way to limit arbitrariness is to specify that, before applying the MaxEnt principle, one should first identify the "object of interest". Indeed, choosing either θ or 1/θ as object of interest is going to yield different MaxEnt priors.
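
A quick Monte Carlo check of this non-invariance (sample sizes, seed and bin counts are illustrative choices): drawing ψ uniformly and mapping to θ = 1/ψ reproduces the density 2/θ², not the uniform one.

```python
import numpy as np

rng = np.random.default_rng(5)
psi = rng.uniform(0.5, 1.0, 1_000_000)   # MaxEnt (uniform) prior placed on psi = 1/theta
theta = 1.0 / psi                        # induced prior on theta, supported on [1, 2]

# Compare the empirical density of theta with 2/theta^2 and with the uniform density 1
hist, edges = np.histogram(theta, bins=50, range=(1.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

assert np.max(np.abs(hist - 2 / centers**2)) < 0.05   # matches 2/theta^2
assert np.max(np.abs(hist - 1.0)) > 0.5               # clearly not uniform on [1, 2]
```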

## **4.7 Model Approximation via Optimal Projection**

Approximate low-order models are commonly used even when there is awareness that the real data are generated by a more complex model. Motivations may range from their use for control design purposes to better interpretability of the phenomena under investigation. Unfortunately, under model misspecification, several nice properties enjoyed by standard estimators are no longer valid. In particular, a naive application of least squares may provide far from satisfactory results. In this section, it is shown that, within the Bayesian framework, the search for an optimal approximate model can be given a rigorous formulation that admits a projection-based solution.

We assume that the data $Y$ are distributed according to (4.9), which summarizes our state of knowledge. However, rather than resorting to Bayesian estimation of the vector θ, an approximate model, typically of low order, is searched for. For instance, if the $\theta\_i$ were the samples of an impulse response, one might be interested in approximating them by a parametric model:

$$\theta \cong \mathbf{g}(\zeta), \quad \mathbf{g}(\zeta) = \begin{bmatrix} \mathbf{g}\_1(\zeta) & \cdots & \mathbf{g}\_n(\zeta) \end{bmatrix}^T,$$

where $\zeta = \begin{bmatrix} \zeta\_1 & \cdots & \zeta\_q \end{bmatrix}^T$ is the unknown parameter vector. For example, in order to approximate the sequence $\theta\_i$ by means of a single exponential function, it suffices to let $q = 2$ and

$$
\mathbf{g}\_i(\zeta) = \zeta\_1 e^{\zeta\_2 i},
$$

where $\zeta\_1$ is the amplitude and $\zeta\_2$ is the rate coefficient of the exponential.

A very natural estimator is the least squares one:

$$\zeta^{LS} = \arg\min\_{\zeta} \|Y - \Phi \mathbf{g}(\zeta)\|^2.$$

Note that $\zeta^{LS}$ coincides with the maximum likelihood estimate if the following model is assumed:

$$Y = \Phi \mathbf{g}(\zeta) + E, \quad E \sim \mathcal{N}(0, \sigma^2 I\_N).$$

In the present context, however, no claim is made that reality conforms to our approximate model. It may well be that the true θ, being more complex than its parsimonious parametric model $\mathbf{g}(\zeta)$, is better represented by the model (4.9). Nevertheless, we are interested in finding the best approximation of θ within a set $\mathcal{P} = \{\mathbf{g}(\zeta)\,|\,\zeta \in \mathbb{R}^q\}$ of parametric approximations.

Under model (4.9), the optimal approximate model $\mathbf{g}^\*$ can be defined as the one that minimizes the mean squared error $\mathcal{E}\|\theta - \mathbf{g}\|^2$. For a generic model $\mathbf{g} = \mathbf{g}(\zeta)$, parametrized by the vector $\zeta \in \mathbb{R}^q$, $q \le n$, we have that

$$\mathbf{g}^\* = \mathbf{g}(\zeta^\*), \quad \zeta^\* := \arg\min\_{\zeta} \mathcal{E}\left[ \left\| \theta - \mathbf{g}(\zeta) \right\|^2 \middle| Y \right], \tag{4.41}$$

where the conditional expectation is taken with reference to the probability measure specified by (4.9). The following theorem, whose proof is in Sect. 4.13.5, was first derived in the context of linear system identification [20]. It shows that the optimal approximation is the projection of the Bayes estimate $\theta^{\rm B}$ onto the set $\mathcal{P}$.

**Theorem 4.6** (Optimal approximation, based on [20]) *Assume that (4.9) holds. Then,*

$$\zeta^\* = \arg\min\_{\zeta} \|\theta^{\rm B} - \mathbf{g}(\zeta)\|^2. \tag{4.42}$$

In view of the last theorem, the best approximation $\mathbf{g}(\zeta) \in \mathcal{P}$ can be computed by a two-step procedure. First, the Bayes estimate $\theta^{\rm B}$ is obtained; in the second step, the optimal $\mathbf{g}(\zeta^\*)$ is calculated as the solution of the least squares problem (4.42).
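
The two-step procedure can be sketched as follows for the single-exponential family $\mathbf{g}\_i(\zeta) = \zeta\_1 e^{\zeta\_2 i}$ (the "true" impulse response, the prior, the seed and the grid search over $\zeta\_2$ are illustrative assumptions; the amplitude $\zeta\_1$ is linear in the model, so it admits a closed-form least squares solution for each candidate rate):

```python
import numpy as np

rng = np.random.default_rng(6)
n, N, sigma2 = 20, 100, 0.05
i = np.arange(1, n + 1)
theta_true = 2.0 * 0.8**i + 0.3 * 0.4**i          # "complex" truth: two exponential modes

Phi = rng.standard_normal((N, n))
Y = Phi @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

# Step 1: Bayes estimate under a simple prior theta ~ N(0, lam I_n)
lam = 1.0
theta_B = np.linalg.solve(Phi.T @ Phi + (sigma2 / lam) * np.eye(n), Phi.T @ Y)

# Step 2: project theta_B onto the one-exponential family g_i = z1 * exp(z2 * i),
# via a grid over the rate z2 with closed-form amplitude z1 for each candidate
best = (np.inf, 0.0, 0.0)
for z2 in np.linspace(-2.0, -0.01, 400):
    basis = np.exp(z2 * i)
    z1 = basis @ theta_B / (basis @ basis)
    err = np.sum((theta_B - z1 * basis)**2)
    if err < best[0]:
        best = (err, z1, z2)
err, z1, z2 = best

assert np.exp(z2) > 0.5          # recovered rate is close to the dominant mode 0.8
assert abs(z1 - 2.0) < 1.0       # amplitude near the dominant amplitude 2.0
```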

An interesting question is whether the obtained approximation is still optimal if the goal is minimizing the error, not with respect to θ, but with respect to the noiseless output Φθ. In other words, the goal is finding the $\mathbf{g}^o$ that minimizes $\|\Phi\theta - \Phi \mathbf{g}^o\|^2$. This can be done by introducing a weighted norm in the cost function:

$$\mathbf{g}^o = \mathbf{g}(\zeta^o), \quad \zeta^o := \arg\min\_{\zeta} \mathcal{E}\left[\|\theta - \mathbf{g}(\zeta)\|\_{W}^2 \middle| Y \right], \tag{4.43}$$

where $\|x\|\_W^2$ stands for $x^T W x$. In particular, if $W = \Phi^T \Phi$, then

$$\|\theta - \operatorname{g}(\zeta)\|\_{W}^{2} = \|\Phi\theta - \Phi\operatorname{g}(\zeta)\|^{2}.$$

By extending the proof of Theorem 4.6 to the case of a weighted norm, the following projection result is obtained.

**Theorem 4.7** (Optimal weighted approximation, based on [20]) *Assume that (4.9) holds. Then,*

$$\zeta^o = \arg\min\_{\zeta} \|\theta^{\rm B} - \mathbf{g}(\zeta)\|\_{W}^2. \tag{4.44}$$

The consequence is that different approximations $\mathbf{g}^o$ are obtained depending on their prospective use. If the scope is just approximating θ, then $W = I\_n$, but, if the scope is predicting the outputs, then $W = \Phi^T \Phi$ and a different result is obtained.

## **4.8 Equivalent Degrees of Freedom**

In this section, the Bayesian estimation problem for the linear model is analysed by means of a diagonalization approach. The purpose is twofold: (i) the equivalent degrees of freedom of the Bayesian estimator are introduced together with their relationship with suitable weighted squared sums of residuals and squared sums of estimated parameters; (ii) it is shown that ηML, the ML estimate of the hyperparameter vector, satisfies meaningful conditions involving the degrees of freedom. Finally, the obtained results are applied to the tuning of the regularization parameter, defined as the ratio between scaling factors for the noise variance Σ*<sup>E</sup>* and the parameter variance Σθ . For the sake of simplicity, in this section, we assume μθ = 0.

Let us consider the case when the hyperparameters are just two scaling factors for the covariance matrices Σ*<sup>E</sup>* and Σθ , that is,

$$
\Sigma\_{\theta} = \lambda K, \quad \lambda > 0 \tag{4.45}
$$

$$
\Sigma\_E = \sigma^2 \Psi, \quad \sigma^2 > 0 \tag{4.46}
$$

$$\eta = \begin{bmatrix} \lambda \ \sigma^2 \end{bmatrix}^T,\tag{4.47}$$

where *K* and Ψ are known positive definite matrices. In such a case, it is immediate to see that the Bayes estimate

$$\boldsymbol{\theta}^{\rm B} = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Psi}^{-1}\boldsymbol{\Phi} + \frac{\sigma^{2}}{\lambda}\boldsymbol{K}^{-1}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\Psi}^{-1}\boldsymbol{Y}$$

depends only on the ratio γ = σ<sup>2</sup>/λ, which behaves as a deterministic regularization parameter. This means that only the ratio between the scaling factors is relevant to the computation of a point estimate, although both of them are needed to compute the posterior variance (4.14). When Ψ = *IN* and *K* = *In*, the above estimator provides a Bayesian interpretation to the classical ridge regression estimator. In particular, γ can be interpreted as a noise-to-signal ratio and its tuning reformulated as a statistical estimation problem.
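As a quick numerical check of this invariance, the sketch below (with the illustrative choices K = I_n, Ψ = I_N and synthetic data) computes θ<sup>B</sup> for two hyperparameter pairs sharing the same ratio γ = σ²/λ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 5
Phi = rng.standard_normal((N, n))
Y = rng.standard_normal(N)

def bayes_estimate(lam, sigma2):
    """theta_B = (Phi^T Phi + (sigma2/lam) I_n)^-1 Phi^T Y  (K = I_n, Psi = I_N)."""
    gamma = sigma2 / lam
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

# Only the ratio gamma = sigma2/lam matters for the point estimate:
t1 = bayes_estimate(lam=1.0, sigma2=0.1)
t2 = bayes_estimate(lam=10.0, sigma2=1.0)   # same gamma = 0.1
print(np.allclose(t1, t2))  # True
```

With K = I_n and Ψ = I_N this is exactly the classical ridge regression estimator with regularization parameter γ.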

Given a positive definite symmetric matrix *S*, let $S^{1/2} = (S^{1/2})^T$ denote its symmetric square root, i.e., $S^{1/2} S^{1/2} = S$. Now, consider the singular value decomposition

$$
\Psi^{-1/2} \Phi K^{1/2} = UDV^T,
$$

where *U* and *V* are square matrices such that $U^T U = I_N$ and $V^T V = I_n$, and $D \in \mathbb{R}^{N \times n}$ is a diagonal matrix with diagonal entries $\{d_i\}, \ i = 1, \ldots, n$, see (3.134). Moreover, define

$$\begin{aligned} \bar{Y} &= U^T \Psi^{-1/2} Y \\ \bar{E} &= U^T \Psi^{-1/2} E \\ \bar{\theta} &= V^T K^{-1/2} \theta . \end{aligned}$$

Observe that

$$\mathcal{E}\left(\bar{E}\,\bar{E}^T\right) = U^T\Psi^{-1/2}\,\mathcal{E}\left(E E^T\right)\Psi^{-1/2}U = \sigma^2 U^T U = \sigma^2 I_N.$$

Analogously, $\mathcal{E}\left(\bar{\theta}\bar{\theta}^T\right) = \lambda I_n$. Moreover,

$$\begin{aligned} \bar{Y} &= U^T \Psi^{-1/2} (\Phi \theta + E) = U^T \Psi^{-1/2} \Phi K^{1/2} V V^T K^{-1/2} \theta + \bar{E}, \\ &= U^T U D V^T V \bar{\theta} + \bar{E} = D \bar{\theta} + \bar{E}. \end{aligned}$$

In view of these properties, it follows that the original Bayesian estimation problem admits the following diagonal reformulation:

$$
\bar{Y} = D\bar{\theta} + \bar{E}, \quad \bar{E} \sim \mathcal{N}(0, \sigma^2 I\_N), \quad \bar{\theta} \sim \mathcal{N}(0, \lambda I\_n), \tag{4.48}
$$

where *E*¯ and θ¯ are independent of each other.
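The diagonal reformulation (4.48) can be verified numerically. The snippet below is a minimal check under the simplifying assumptions Ψ = I_N and K = I_n, so that the decomposition reduces to the plain SVD of Φ:

```python
import numpy as np

# Numerical check of the diagonal reformulation (4.48). For simplicity we
# take Psi = I_N and K = I_n, so Psi^{-1/2} Phi K^{1/2} is just Phi.
rng = np.random.default_rng(2)
N, n = 8, 3
Phi = rng.standard_normal((N, n))

# Full SVD: U is N x N, Vt is n x n, D in R^{N x n} with diagonal {d_i}
U, d, Vt = np.linalg.svd(Phi, full_matrices=True)
D = np.zeros((N, n))
D[:n, :n] = np.diag(d)

theta = rng.standard_normal(n)
E = rng.standard_normal(N)
Y = Phi @ theta + E

Ybar = U.T @ Y          # U^T Psi^{-1/2} Y
Ebar = U.T @ E
thetabar = Vt @ theta   # V^T K^{-1/2} theta

# The rotated quantities obey the diagonal model Ybar = D thetabar + Ebar
print(np.allclose(Ybar, D @ thetabar + Ebar))  # True
```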

In view of statistical independence, we have *N* independent scalar models:

$$\begin{aligned} \bar{\mathbf{y}}_i &= d_i \bar{\theta}_i + \bar{v}_i, \quad i = 1, \dots, n \\ \bar{\mathbf{y}}_i &= \bar{v}_i, \quad i = n+1, \dots, N, \end{aligned}$$

where $\bar{v}_i \sim \mathcal{N}(0, \sigma^2)$, $i = 1, \ldots, N$, and $\bar{\theta}_i \sim \mathcal{N}(0, \lambda)$, $i = 1, \ldots, n$.

By (4.11), it is straightforward to see that the Bayes estimates are

$$\bar{\theta}_i^{\rm B} = \frac{\lambda d_i \bar{\mathbf{y}}_i}{\sigma^2 + \lambda d_i^2} = \frac{d_i \bar{\mathbf{y}}_i}{\gamma + d_i^2}, \quad i = 1, \dots, n$$

or, in matrix form,

$$\bar{\theta}^{\rm B} = \left(D^{T}D + \gamma I_{n}\right)^{-1}D^{T}\bar{Y}.$$

Let the residuals be defined as $\bar{\varepsilon}_i = \bar{y}_i - \bar{d}_i \bar{\theta}_i^{\rm B}$, $i = 1, \ldots, N$, where

$$\bar{d}\_{i} = \begin{cases} d\_{i}, & 1 \le i \le n \\ 0, & n+1 \le i \le N \end{cases}.\tag{4.49}$$

Then, we have

$$\bar{\varepsilon}_{i} = \bar{\mathbf{y}}_{i} - \frac{\bar{d}_{i}^{2} \bar{\mathbf{y}}_{i}}{\gamma + \bar{d}_{i}^{2}} = \frac{\gamma \bar{\mathbf{y}}_{i}}{\gamma + \bar{d}_{i}^{2}} \tag{4.50}$$

$$\mathcal{E}\bar{\varepsilon}_{i}^{2} = \frac{\gamma^{2}\,\mathcal{E}\bar{\mathbf{y}}_{i}^{2}}{(\gamma+\bar{d}_{i}^{2})^{2}} = \frac{\gamma^{2}(\bar{d}_{i}^{2}\lambda+\sigma^{2})}{(\gamma+\bar{d}_{i}^{2})^{2}} = \frac{\sigma^{2}\gamma}{\gamma+\bar{d}_{i}^{2}} = \sigma^{2}\left(1-\frac{\bar{d}_{i}^{2}}{\gamma+\bar{d}_{i}^{2}}\right) \tag{4.51}$$

or, in matrix form,

$$\bar{\varepsilon} = \gamma \left(D D^T + \gamma I_N\right)^{-1} \bar{Y}, \quad \mathcal{E} \|\bar{\varepsilon}\|^2 = \sigma^2 \left( N - \mathrm{trace}\left(D \left(D^T D + \gamma I_n\right)^{-1} D^T\right) \right). \tag{4.52}$$

It is worth noting that the above relationships do not hold for a generic regularization parameter γ, but only when γ = σ<sup>2</sup>/λ. In the remainder of this section, we present some results that were first derived in the context of Bayesian deconvolution in [5]. The proof of the following proposition is in Sect. 4.13.6.

**Proposition 4.3** (based on [5]) *For a given hyperparameter vector* η*, let* WRSS *denote the following weighted squared sum of residuals:*

$$\text{WRSS} = (Y - \Phi\theta^{\text{B}})^T \Psi^{-1} (Y - \Phi\theta^{\text{B}}),$$

*where* θ<sup>B</sup> = *E* [θ|*Y*, η]*. Then,*

$$\mathcal{E}(\text{WRSS}) = \sigma^2(N - \mathrm{trace}(H(\gamma))),$$

*where*

$$H(\gamma) = \Phi(\Phi^T \Psi^{-1} \Phi + \gamma K^{-1})^{-1} \Phi^T \Psi^{-1}$$

*is the so-called* hat matrix*.*

As already noted, see (3.64), when Σ*<sup>E</sup>* = σ<sup>2</sup> *IN* , the predicted output *Y*ˆ = Φθ<sup>B</sup> and the measured output *Y* are related through the hat matrix:

$$
\hat{Y} = H(\gamma)Y.
$$

In order to better understand the link between the hat matrix and the degrees of freedom, just consider the standard linear model $Y = \Phi\theta + E$, $\theta \in \mathbb{R}^n$, and the corresponding LS estimate $\theta^{\rm LS} = (\Phi^T \Phi)^{-1}\Phi^T Y$. The predicted output is $\hat{Y} = H_{\rm LS} Y$, where $H_{\rm LS} = \Phi(\Phi^T \Phi)^{-1}\Phi^T$ enjoys the property $\mathrm{trace}(H_{\rm LS}) = n$.

It is this analogy that justifies the introduction of the equivalent degrees of freedom, which we already encountered in (3.65) as a function of the regularized estimate θ<sup>R</sup> in the deterministic context. The definition, here derived from the stochastic context, is reported below, stressing its dependence on the regularization parameter γ.

**Definition 4.2** *(equivalent degrees of freedom)* The quantity

$$\text{dof}(\gamma) = \mathrm{trace}(H(\gamma)), \quad 0 \le \text{dof}(\gamma) \le n \tag{4.53}$$

is called *equivalent degrees of freedom*.

In view of (4.52),

$$\text{dof}(\gamma) = \sum_{i=1}^{n} \frac{d_i^2}{d_i^2 + \gamma}$$

so that dof(γ) is a monotonically decreasing function of γ with 0 ≤ dof(γ) ≤ *n*. The equivalent degrees of freedom provide an easily understandable measure of the flexibility of the estimator: for instance, if they are approximately equal to three, the Bayesian estimator has a flexibility comparable to that of a model with three parameters. For linear-in-parameter models estimated by ordinary or weighted least squares, the degrees of freedom coincide with the rank of the regressor matrix and, therefore, can take only integer values. The equivalent degrees of freedom of the Bayesian estimator, conversely, are a nonnegative real number controlled by γ.
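The two expressions for dof(γ) can be checked numerically. A minimal sketch, under the illustrative assumptions Ψ = I_N, K = I_n and a synthetic Φ:

```python
import numpy as np

# dof(gamma) computed in two equivalent ways: as trace(H(gamma)) and via
# the singular values of Phi (Psi = I_N, K = I_n for simplicity).
rng = np.random.default_rng(3)
N, n = 40, 6
Phi = rng.standard_normal((N, n))
gamma = 0.5

# Hat matrix H(gamma) = Phi (Phi^T Phi + gamma I_n)^{-1} Phi^T
H = Phi @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T)
dof_trace = np.trace(H)

# Equivalent expression: dof(gamma) = sum_i d_i^2 / (d_i^2 + gamma)
d = np.linalg.svd(Phi, compute_uv=False)
dof_svd = np.sum(d**2 / (d**2 + gamma))

print(dof_trace, dof_svd)   # the two values coincide, with 0 <= dof <= n
```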

The next theorem establishes a connection between the degrees of freedom and the ML estimate

$$\eta^{\mathsf{ML}} = \left[\lambda^{\mathsf{ML}} \left(\sigma^2\right)^{\mathsf{ML}}\right]^{\mathsf{T}}$$

of the hyperparameter vector. Accordingly, we define

$$\gamma^{\rm ML} = \frac{\left(\sigma^2\right)^{\rm ML}}{\lambda^{\rm ML}}.$$

Moreover, we introduce the following weighted squared sum of estimated parameters:

$$\text{WPSS} = (\theta^{\rm B})^T K^{-1} \theta^{\rm B} = \left\| \bar{\theta}^{\rm B} \right\|^2 = \sum_{i=1}^{n} \frac{d_i^2 \bar{\mathbf{y}}_i^2}{(\gamma + d_i^2)^2}. \tag{4.54}$$

The proof of the following result is in Sect. 4.13.7.

**Theorem 4.8** (based on [5]) *Assume that model (4.9) holds, where* Σθ *and* Σ*<sup>E</sup> are as in (4.45)–(4.46). Then, the* ML *estimates of the hyperparameters satisfy the following necessary conditions:*

$$\text{WRSS} = \left(\sigma^2\right)^{\rm ML} \left(N - \text{dof}\left(\gamma^{\rm ML}\right)\right) \tag{4.55}$$

$$\text{WPSS} = \lambda^{\rm ML}\, \text{dof}\left(\gamma^{\rm ML}\right).\tag{4.56}$$

By taking the ratio between (4.55) and (4.56), the following proposition is obtained.

**Proposition 4.4** (based on [5]) *If* λ<sup>ML</sup> *and* (σ<sup>2</sup>)<sup>ML</sup> *are nonzero and finite, then*

$$\gamma^{\rm ML} = \frac{\text{dof}\left(\gamma^{\rm ML}\right)}{N - \text{dof}\left(\gamma^{\rm ML}\right)} \frac{\text{WRSS}}{\text{WPSS}}.\tag{4.57}$$

This last proposition can be used as a simple and practical tuning procedure, as it requires just a line search on the scalar γ. Of course, (4.57) relies on the necessary conditions of Theorem 4.8, so that one has to check whether the solution corresponds to a maximum of the likelihood function.
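One way to exploit (4.57) is as a fixed-point iteration: evaluate the right-hand side at the current γ and repeat until it stabilizes. The snippet below is an illustrative sketch on synthetic data with K = I_n and Ψ = I_N (so that WRSS and WPSS take the simple forms ‖Y − Φθ<sup>B</sup>‖² and ‖θ<sup>B</sup>‖²); convergence of this iteration is not guaranteed in general, and the solution should be checked against the likelihood afterwards:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 60, 8
Phi = rng.standard_normal((N, n))
theta_true = rng.standard_normal(n)
Y = Phi @ theta_true + 0.5 * rng.standard_normal(N)

d = np.linalg.svd(Phi, compute_uv=False)

def step(gamma):
    """Right-hand side of (4.57) evaluated at the current gamma (K = Psi = I)."""
    theta_B = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)
    wrss = np.sum((Y - Phi @ theta_B) ** 2)       # WRSS
    wpss = np.sum(theta_B ** 2)                   # WPSS
    dof = np.sum(d**2 / (d**2 + gamma))           # dof(gamma)
    return dof / (N - dof) * wrss / wpss

gamma = 1.0
for _ in range(200):
    gamma_new = step(gamma)
    if abs(gamma_new - gamma) < 1e-10:
        break
    gamma = gamma_new

print(gamma)   # a fixed point of (4.57)
```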

## **4.9 Bayesian Function Reconstruction**

In this section, the Bayesian estimation approach is illustrated through its application to the reconstruction of an unknown function from noisy samples. The observations will be generated by adding pseudorandom noise to a known function *g*(*x*), so that the performance of alternative estimators can be directly assessed by comparison with the ground truth. The selected *g*(*x*) is the same function (3.26) used in the previous chapter in order to illustrate polynomial regression:

$$g(x) = (\sin(x))^2(1 - x^2), \quad x \in [0, \ 1]. \tag{4.58}$$

Also the noise model is the same:

$$\mathbf{y}\_{i} = \mathbf{g}(\mathbf{x}\_{i}) + e\_{i}, \quad i = 1, \ldots, N. \tag{4.59}$$

We let *N* = 40, *x*<sup>1</sup> = 0, *x*<sup>40</sup> = 1, and *x*2,..., *x*<sup>39</sup> are evenly spaced points between *x*<sup>1</sup> and *x*40. Finally, *ei* , *i* = 1,..., 40, are i.i.d. Gaussian distributed with mean zero and standard deviation 0.034.

The problem of estimating θ*<sup>i</sup>* = *g*(*xi*), i.e., the samples of the unknown function, is a particular case of the linear Gaussian model (4.9) with Φ = *IN* , that is,

$$Y = \theta + E, \quad E \sim \mathcal{N}(0, \sigma^2 I\_N). \tag{4.60}$$

Since Φ is square, in this case, the number *n* of unknowns coincides with the number *N* of observations.

The noisy data and the true function are displayed in the top left panel of Fig. 4.1. It is assumed that the available prior knowledge regards the "regularity" of *g*(·) and the knowledge that *g*(0) = 0. A possible probabilistic translation of this qualitative knowledge is assuming that θ*<sup>i</sup>* is a so-called random walk:

$$
\theta\_i = \theta\_{i-1} + w\_i, \quad i = 1, \dots, N, \quad \theta\_0 = 0,
$$

where w*<sup>i</sup>* ∼ *N* (0, λ) are independent random variables. In fact, under the random walk model, the first difference

$$
\theta\_i - \theta\_{i-1} = w\_i
$$

has a finite variance, equal to λ. Hence, if we approximate the derivative of *g*(·) by the first difference <sup>θ</sup>*<sup>i</sup>* <sup>−</sup> <sup>θ</sup>*<sup>i</sup>*−1, the magnitude of this approximation is less than 1.96√<sup>λ</sup> with probability 0.95, which guarantees that the profile of the function cannot vary too quickly. Note that, due to the qualitative nature of the prior knowledge, the precise value of λ is unknown, so that it has to be treated as a hyperparameter. Conversely, it is assumed that the true value of σ<sup>2</sup> is known. Summarizing, we have

**Fig. 4.1** Function reconstruction example. Top left: noisy data and true function. Top right, bottom left and bottom right: Residual sum of squares, i.e., the sum of the squared differences between the function values and their estimates, degrees of freedom and marginal loglikelihood against the hyperparameter λ. The oracle denotes the value that minimizes RSS while ML indicates the maximizer of the marginal likelihood

$$\theta\_i = \sum\_{j=1}^i w\_j, \quad i = 1, \dots, N$$

or, in matrix form,

$$\theta = Fw, \quad F = \begin{bmatrix} 1 & 0 & 0 & \dots & 0 \\ 1 & 1 & 0 & \dots & 0 \\ 1 & 1 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \dots & 1 \end{bmatrix}, \quad w = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_N \end{bmatrix}.$$

Observing that Var(w) = λ*IN* , the prior variance of the parameter vector is

$$
\Sigma_{\theta} = \lambda F F^{T} = \lambda \begin{bmatrix} 1 & 1 & \dots & 1 \\ 1 & 2 & \dots & 2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 2 & \dots & N \end{bmatrix}.
$$

For a given λ, the Bayes estimate θ<sup>B</sup> is obtained according to (4.10) and can be written as

$$
\theta^{\mathsf{B}} = \Sigma\_{\theta} \left( \Sigma\_{\theta} + \sigma^2 I\_N \right)^{-1} Y.
$$

The corresponding equivalent degrees of freedom, obtained by (4.53), are now thought of as a (monotonically nondecreasing) function of λ, i.e.,

$$\text{dof}(\lambda) = \text{trace } H(\lambda), \quad H(\lambda) = \Sigma\_{\theta} \left(\Sigma\_{\theta} + \sigma^2 I\_N\right)^{-1}, \quad \Sigma\_{\theta} = \lambda F F^T.$$

In the bottom left panel of Fig. 4.1, the degrees of freedom are plotted against λ. For small values of λ they are close to zero and get closer to *N* = 40 as λ goes to infinity. It is a rather general feature that the function dof(λ) is better visualized on a semilog scale. In order to tune the regularization parameter λ, one can resort to the maximization of the marginal loglikelihood:

$$\begin{split} \lambda^{\rm ML} &= \arg\max_{\lambda} \left\{ -\frac{1}{2} \log\det(2\pi \Sigma) - \frac{1}{2} Y^T \Sigma^{-1} Y \right\}, \\ \Sigma &= \Sigma_{\theta} + \sigma^2 I_N = \lambda F F^T + \sigma^2 I_N. \end{split}$$

It turns out that λ<sup>ML</sup> = 4.92 × 10<sup>−4</sup>, the corresponding degrees of freedom being 12.17. For the sake of comparison, λ = 6.61 × 10<sup>−4</sup> is the best possible value, i.e., the one provided by an oracle that exploits the knowledge of the true function in order to minimize the sum of the squared reconstruction errors. This latter quantity is a function of λ and here denoted by RSS(λ). As seen in the top right panel of Fig. 4.1, marginal likelihood maximization achieves RSS = 9.80 × 10<sup>−2</sup>, not much worse than RSS = 9.71 × 10<sup>−2</sup> achieved by the oracle, whose associated degrees of freedom are 13.88. Therefore, in this specific case, the marginal likelihood criterion somewhat underestimates the complexity of the model.
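The whole example can be sketched in a few lines of code. The snippet below is a minimal reproduction under the stated assumptions (same *g*, *N* = 40, σ = 0.034, random walk prior), with λ tuned by a grid search on the marginal likelihood rather than a proper optimizer; since the pseudorandom noise differs from the book's, the resulting numbers will not match the reported λ<sup>ML</sup> and RSS exactly:

```python
import numpy as np

# Sketch of the Sect. 4.9 example: random-walk prior, lambda tuned by a
# marginal-likelihood grid search (sigma assumed known).
rng = np.random.default_rng(5)
N = 40
x = np.linspace(0.0, 1.0, N)
g = np.sin(x) ** 2 * (1 - x**2)                 # true function (4.58)
sigma = 0.034
Y = g + sigma * rng.standard_normal(N)

F = np.tril(np.ones((N, N)))                    # theta = F w (random walk)
FFt = F @ F.T                                   # (F F^T)_{ij} = min(i, j)

def neg_mll(lam):
    """Negative marginal loglikelihood of Y for a given lambda."""
    S = lam * FFt + sigma**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(2 * np.pi * S)
    return 0.5 * logdet + 0.5 * Y @ np.linalg.solve(S, Y)

lams = np.logspace(-6, -1, 60)
lam_ml = lams[np.argmin([neg_mll(l) for l in lams])]

Sigma_theta = lam_ml * FFt
theta_B = Sigma_theta @ np.linalg.solve(Sigma_theta + sigma**2 * np.eye(N), Y)
dof = np.trace(Sigma_theta @ np.linalg.inv(Sigma_theta + sigma**2 * np.eye(N)))
print(lam_ml, dof, np.sum((theta_B - g) ** 2))  # lambda^ML, dof, RSS
```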

In Fig. 4.2, the estimates obtained for six different values of λ are displayed. It is apparent that for λ = 1 × 10<sup>−6</sup> and λ = 1 × 10<sup>−5</sup> the estimated function is overregularized, while overfitting occurs for λ = 1 × 10<sup>−1</sup> and λ = 1 × 10<sup>−2</sup>. The two bottom panels display the oracle and ML estimates, the former exhibiting a slightly more regular profile.

Finally, observing that in our case Σθ*<sup>Y</sup>* = Σθ , we have

$$\Sigma\_{\theta|Y} = \text{Var}(\theta|Y) = \Sigma\_{\theta} - \Sigma\_{\theta}\Sigma\_{Y}^{-1}\Sigma\_{\theta}$$

**Fig. 4.2** Function reconstruction example. The panels display the Bayes estimates *g*ˆ(*x*) corresponding to six different values of the hyperparameter λ, including the one provided by the oracle and the maximum likelihood one

**Fig. 4.3** Function reconstruction example. True function and Empirical Bayes estimate *g*ˆ(*x*) based on λ*M L* together with its 95% Bayesian credible intervals

and we can compute the 95% Bayesian credible intervals according to (4.8). As can be seen from Fig. 4.3, the credible limits successfully capture the uncertainty, as demonstrated by the fact that the true function lies within the limits.

This simple example has shown that Bayesian estimation can be effectively employed to reconstruct an unknown function without the need to assume a specific parametric structure, e.g., polynomial or other. The key idea is the use of a smoothness prior, expressed through the assumed prior distribution of the first differences of the function. The associated variance λ is treated as a hyperparameter that can be tuned via marginal likelihood maximization. Altogether, this is a flexible Empirical Bayes scheme that can be employed as a general-purpose black-box estimator.

Of interest is also the fact that the considered function could have been the impulse response of a dynamical system. In this respect, the example highlights also the limits of the approach. A first issue, easily fixable, has to do with the insufficient smoothness of the estimate. As seen in Fig. 4.3, the true function is significantly smoother than its estimate. As a matter of fact, it is not difficult to increase the regularity of the Bayes estimate: for instance, it suffices to assume that the samples θ*<sup>i</sup>* = *g*(*xi*) are an integrated random walk:

$$\begin{aligned} \theta\_i &= \theta\_{i-1} + \xi\_i \\ \xi\_i &= \xi\_{i-1} + w\_i, \end{aligned}$$

where w*<sup>i</sup>* ∼ *N* (0, λ) are again independent and identically distributed. This prior distribution is going to yield smoother profiles. Rather interestingly, the obtained estimate can be seen as the discrete-time counterpart of cubic smoothing splines, a method widely used for the nonparametric reconstruction of unknown functions.
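The integrated random walk prior can be sketched directly from the matrix F introduced above: θ = F(Fw), so its covariance is λ(FF)(FF)<sup>T</sup>. The snippet below (illustrative, with λ omitted) also checks the defining property that the second differences of an integrated random walk recover the white increments:

```python
import numpy as np

# Sketch: random walk vs integrated random walk priors.
# theta = F w gives the random walk; theta = F (F w) the integrated one.
N = 40
F = np.tril(np.ones((N, N)))
F2 = F @ F

Sigma_rw = F @ F.T        # random walk covariance, entries min(i, j)
Sigma_irw = F2 @ F2.T     # integrated random walk covariance (smoother prior)

rng = np.random.default_rng(6)
w = rng.standard_normal(N)
theta_irw = F2 @ w

# Second differences of the integrated walk equal the white increments w,
# mirroring how first differences do for the plain random walk
d2 = np.diff(theta_irw, n=2)
print(np.allclose(d2, w[2:]))  # True
```

Replacing λFF<sup>T</sup> with λ(FF)(FF)<sup>T</sup> in the reconstruction code leaves everything else unchanged.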

A more serious issue regards the extrapolation properties of the estimate, which are in turn connected with the type of asymptotic decay shown by stable impulse responses. As can be seen from Fig. 4.3, oscillations and credible intervals do not tend to dampen as *x* increases. While it would be easy to compute the Bayes estimate also for values far beyond the observation window, the result would be disappointing. Indeed, coherently with the diffusive nature of random walks, the width of the credible band would diverge, which is unnecessarily conservative when a stable impulse response is reconstructed. It appears that the task of identifying impulse responses calls for prior distributions that are specifically suited to their features, especially the asymptotic ones. The development of these prior distributions, or equivalently the design of suitable regularization penalties, will be a central topic of the subsequent chapters.

## **4.10 Markov Chain Monte Carlo Estimation**

As already mentioned in Sect. 4.4, in the full Bayesian approach the estimate

$$\mathbf{p}(\theta|Y) = \int \mathbf{p}(\theta, \eta|Y) d\eta = \int \mathbf{p}(\theta|\eta, Y)\mathbf{p}(\eta|Y) d\eta$$

requires a marginalization with respect to the hyperparameter vector η. In general, this integral cannot be computed analytically. Nevertheless it can be computed numerically by means of Markov Chain Monte Carlo (MCMC) methods that generate pseudorandom samples drawn from the joint posterior density p(θ , η|*Y* ). The Gibbs sampling (GS) algorithm is the most straightforward and popular MCMC method. Its goal is to simulate a realization of a Markov chain, whose samples, though not independent of each other, form an ergodic process whose stationary distribution coincides with the desired posterior. Hence, provided that the burn-in phase is discarded, the posterior distribution is approximated by the histogram of the samples. In order to generate the samples, at each step a random extraction is made from a proposal distribution. In the Gibbs sampler, the proposal distribution is the so-called full conditional, that is, the probability of a given element of the parameter vector given the data and the current values of all other elements.

For the linear Gaussian model (4.9), a Gibbs sampler may be implemented by alternating draws from the two full conditionals: given an initial guess η<sup>(0)</sup>, at iteration *k* one draws θ<sup>(*k*)</sup> from p(θ|η<sup>(*k*−1)</sup>, *Y*) and then η<sup>(*k*)</sup> from p(η|θ<sup>(*k*)</sup>, *Y*).


This stochastic simulation algorithm generates a Markov chain whose stationary distribution coincides with p(θ , η|*Y* ). Therefore, though correlated, the generated samples {θ(*k*) , η(*k*) } can be used to estimate the (joint and marginal) posterior distributions and also the posterior expectations via the proper sample averages. For example,

$$\frac{1}{N} \sum_{k=1}^{N} \theta^{(k)} \simeq \mathcal{E}(\theta | Y).$$

The choice of the prior distributions p(θ|η) and p(η) has a critical influence on the efficiency of the scheme. The priors are called *conjugate* when, for each parameter, the prior and the full conditional belong to the same distribution family. This implies that the same random variable generators can be used throughout the simulation.

Consider model (4.9), where Σ*<sup>E</sup>* is known and Σθ = λ*K*, with λ unknown. Below, we describe a Gibbs sampling scheme for obtaining the posterior distributions of θ and η = λ. For θ, the prior is θ|λ ∼ *N* (0, λ*K*). A conjugate prior for λ is the inverse Gamma distribution:

$$\frac{1}{\lambda} \sim \Gamma(\mathbf{g}\_1, \mathbf{g}\_2), \quad \mathbf{g}\_1, \mathbf{g}\_2 > 0.$$

In other words, it is assumed that 1/λ is distributed as a Gamma random variable, so that

$$\mathrm{p}\left(\frac{1}{\lambda}\right) \propto \left(\frac{1}{\lambda}\right)^{g_1-1} e^{-\frac{g_2}{\lambda}}.$$

With this choice of the prior, the full conditional of 1/λ will be distributed as a suitable Gamma variable, ∀*k*. More precisely, it can be shown that, if

$$\mathbf{p}\left(\bar{\theta}|\lambda\right) \sim \mathcal{N}\left(\mathbf{0}, \lambda I\_N\right), \quad \mathbf{p}\left(\frac{1}{\lambda}\right) \sim \Gamma(\mathbf{g}\_1, \mathbf{g}\_2)$$

then

$$\mathbf{p}\left(\frac{1}{\lambda}\middle|\bar{\theta}\right) \sim \Gamma\left(\mathbf{g}\_1 + \frac{N}{2}, \mathbf{g}\_2 + \frac{\left\|\bar{\theta}\right\|^2}{2}\right). \tag{4.61}$$

Recall that the mean and variance of the Gamma random variable are *g*1/*g*<sup>2</sup> and *g*1/*g*<sup>2</sup> <sup>2</sup> , respectively. For the prior to be as uninformative as possible, we let *g*<sup>1</sup> and *g*<sup>2</sup> decrease to zero. Under these assumptions, the Gibbs sampler unfolds as follows:

1. Initialize λ and θ, e.g., using the empirical Bayes estimates

$$
\lambda^{(0)} = \lambda^{\rm ML}, \quad \theta^{(0)} = \theta^{\rm B} = \mathcal{E}(\theta|\lambda^{\rm ML}, Y),
$$

and let *k* = 0.

2. Draw a sample 1/λ(*k*+1) from the full conditional distribution

$$\mathbf{p}\left(\frac{1}{\lambda}\middle|\theta^{(k)},Y\right) = \mathbf{p}\left(\frac{1}{\lambda}\middle|\theta^{(k)}\right) = \Gamma\left(\frac{N}{2}, \frac{\theta^{(k)T}K^{-1}\theta^{(k)}}{2}\right). \tag{4.62}$$

3. Draw a sample θ(*k*+1) from the full conditional distribution

$$\mathrm{p}\left(\theta|\lambda^{(k+1)},Y\right) = \mathcal{N}\left(\mathcal{E}(\theta|\lambda^{(k+1)},Y),\ \mathrm{Var}(\theta|\lambda^{(k+1)},Y)\right)$$

whose mean and variance are obtained according to (4.10) or (4.13).

4. If *k* = *kmax* , end; else, let *k* = *k* + 1 and go to step 2.

Above, the expression of the full conditional (4.62) is a direct consequence of the conjugacy property (4.61), as can be seen by letting θ¯ = *K*<sup>−1/2</sup>θ<sup>(*k*)</sup>, where *K*<sup>−1/2</sup> is a symmetric matrix such that *K*<sup>−1/2</sup>*K*<sup>−1/2</sup> = *K*<sup>−1</sup>.
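The scheme above can be sketched in code. The following is an illustrative implementation, not the book's: it assumes *K* = *I_n*, a known σ², simulated data, a crude initialization of λ (the text suggests λ<sup>ML</sup>), and the uninformative limit *g*₁, *g*₂ → 0; the Gamma shape parameter is written as half the dimension of θ:

```python
import numpy as np

# Illustrative Gibbs sampler for model (4.9) with Sigma_E = sigma2 I_N known,
# Sigma_theta = lam K, K = I_n, and an uninformative inverse-Gamma prior on lam.
rng = np.random.default_rng(7)
N, n, sigma2 = 60, 10, 0.25
Phi = rng.standard_normal((N, n))
theta_true = rng.standard_normal(n)
Y = Phi @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

kmax, burn_in = 2000, 500
lam = 1.0                                  # crude initialization
theta = np.linalg.solve(Phi.T @ Phi + (sigma2 / lam) * np.eye(n), Phi.T @ Y)

samples_theta, samples_lam = [], []
for k in range(kmax):
    # Step 2: draw 1/lam from its Gamma full conditional, cf. (4.62)
    inv_lam = rng.gamma(shape=n / 2, scale=2.0 / (theta @ theta))
    lam = 1.0 / inv_lam
    # Step 3: draw theta from p(theta | lam, Y), Gaussian with moments (4.10)/(4.13)
    prec = Phi.T @ Phi / sigma2 + np.eye(n) / lam   # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ Phi.T @ Y / sigma2
    theta = rng.multivariate_normal(mean, cov)
    samples_theta.append(theta)
    samples_lam.append(lam)

# Sample average over the post burn-in draws approximates E(theta | Y)
theta_post = np.mean(samples_theta[burn_in:], axis=0)
print(np.linalg.norm(theta_post - theta_true))
```

The histogram of the post burn-in `samples_lam` likewise approximates the marginal posterior of λ.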

When there are other hyperparameters to tune, e.g., the noise variance σ<sup>2</sup>, the MCMC scheme can be properly extended. Provided that they exist, conjugate priors ensure efficient sampling from the proposal distributions that generate the random samples, although a variety of MCMC schemes are available that deal with nonconjugate priors at the cost of an increased computational effort.

The main advantage of MCMC methods is that they implement the full Bayesian framework that is only approximated by the empirical Bayes scheme. In particular, MCMC methods do not neglect the hyperparameter uncertainty which is correctly propagated to the parameter estimate. However, as already discussed in Sect. 4.4, if data are informative enough to ensure a precise estimate of the hyperparameters, the difference between MCMC and empirical Bayes estimates (and associated credible regions) may be of minor importance.

## **4.11 Model Selection Using Bayes Factors**

As discussed in Sect. 2.6.2, one fundamental issue is the selection of the "best" model inside a class of postulated structures. In the classical setting, this can be performed using criteria like AIC (2.34) and BIC (2.36) or adopting a cross-validation strategy. We will now see that the Bayesian approach provides a powerful alternative based on the concept of posterior model probability.

Let *M<sup>i</sup>* be a model structure parametrized by the vector *x<sup>i</sup>* . In the system identification scenario discussed in Chap. 2, the structures could be ARMAX models of different complexity. Hence, each *x<sup>i</sup>* would correspond to the θ*<sup>i</sup>* parametrizing (2.1) and containing the coefficients of rational transfer functions of different orders. If little knowledge on them were available, poorly informative prior densities could be assigned. Another example concerns the function estimation problem illustrated in Sect. 4.9. Here, *x<sup>i</sup>* could contain the samples θ*<sup>i</sup>* of the unknown function *g* modelled as a stochastic process. Then, the different structures could represent different covariances of *g* defined by a random walk or an integrated random walk. Each covariance would then depend on an unknown hyperparameter vector η*<sup>i</sup>* containing the variance of the random walk increments and possibly also of the measurement noise. So, in this case, one would have *x<sup>i</sup>* = [θ*<sup>i</sup>* η*<sup>i</sup>* ]. Here, η*<sup>i</sup>* is a random vector with flat priors typically assigned to the variances to include just nonnegativity information.

Now, suppose that we are given *m* competitive structures *M<sup>i</sup>* . An important conceptual step is to interpret these too as (discrete) random variables, each having probability Pr(*M<sup>i</sup>* ) before seeing the data *Y* . The selection of the best model then has a natural answer: one should select the structure having the largest posterior probability Pr(*M<sup>i</sup>* |*Y* ). Using Bayes rule, one has

$$\Pr(\mathcal{M}^i|Y) = \frac{\Pr(\mathcal{M}^i)\int \mathrm{p}(Y|\mathcal{M}^i, x^i)\,\mathrm{p}(x^i|\mathcal{M}^i)\, dx^i}{\mathrm{p}(Y)}.$$

A typical choice is to think of the structures as equiprobable, so that Pr(*M<sup>i</sup>* ) = 1/*m* for any *i*. Then, one can select the *M<sup>i</sup>* maximizing the so-called Bayesian evidence given by

$$\mathrm{p}(Y|\mathcal{M}^i) = \int \mathrm{p}(Y|\mathcal{M}^i, x^i)\,\mathrm{p}(x^i|\mathcal{M}^i)\, dx^i.$$

Note that this corresponds to the marginal likelihood where all the parameter uncertainty connected with the *i*-th structure has been integrated out. Given two structures *M*<sup>1</sup> and *M*2, the Bayes factor is also defined as follows:

$$B_{12} = \frac{\mathrm{p}(Y|\mathcal{M}^1)}{\mathrm{p}(Y|\mathcal{M}^2)}.$$

Hence, large values of *B*<sup>12</sup> indicate that data strongly support *M*<sup>1</sup> as opposed to *M*2.
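When the hyperparameters are fixed, the evidence of the linear Gaussian model is available in closed form, since *Y*|*M* ∼ *N*(0, Σθ + σ²*I_N*). The snippet below is an illustrative sketch comparing the random walk and integrated random walk priors of Sect. 4.9 on synthetic data; the common value of λ is an arbitrary assumption, so the resulting Bayes factor is only indicative:

```python
import numpy as np

# Illustrative Bayes factor between two Gaussian priors, with lambda and
# sigma fixed so that the evidence p(Y | M) = N(Y; 0, Sigma_theta + sigma^2 I)
# is available in closed form.
rng = np.random.default_rng(8)
N, sigma, lam = 40, 0.034, 5e-4
x = np.linspace(0, 1, N)
Y = np.sin(x) ** 2 * (1 - x**2) + sigma * rng.standard_normal(N)

F = np.tril(np.ones((N, N)))
priors = {"random walk": F @ F.T,
          "integrated random walk": (F @ F) @ (F @ F).T}

def log_evidence(C):
    S = lam * C + sigma**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(2 * np.pi * S)
    return -0.5 * logdet - 0.5 * Y @ np.linalg.solve(S, Y)

le = {name: log_evidence(C) for name, C in priors.items()}
logB12 = le["integrated random walk"] - le["random walk"]
print(le, logB12)   # logB12 > 0 would favour the smoother prior
```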

For the computation of the Bayesian evidence, the same numerical considerations reported at the end of Sect. 4.4 hold. In particular, when the evidence cannot be computed explicitly, approximations are needed, e.g., the Laplace approximation. The BIC criterion is also often adopted. In particular, in the function estimation problem one can integrate out θ and then evaluate the complexity of the model using the marginal likelihood optimized w.r.t. the hyperparameters η*<sup>i</sup>*, adding a term which penalizes the dimension of the hyperparameter vector. This will also be discussed later on in Sect. 7.2.1.1.

MCMC can also be used to compute the evidence, by simulating from posterior distributions and using the harmonic mean of the likelihood values, see Sect. 4.3 in [14]. A more powerful and complex approach employs MCMC techniques able to jump between models of different dimensions, known in the literature as reversible jump Markov chain Monte Carlo [10].

## **4.12 Further Topics and Advanced Reading**

There is an extensive literature debating the interpretation of probability as a quantification of personal belief, and it would be impossible to give a satisfactory account of all the contributions. The reader interested in the motivations and foundations of *subjective probability* may refer to [4, 16]. One of the merits of Bayesian probability is its efficacy in addressing ill-posed and ill-conditioned problems, including a wide class of statistical learning problems. The connection between deterministic regularization and Bayesian estimation has been pointed out by several authors in different contexts. Two examples related to spline approximation and neural networks are given by [8, 15].

The choice and tuning of the priors is undoubtedly the crux of any Bayesian approach. It is not a surprise that the tuning of hyperparameters via the *Empirical Bayes* approach emerged early as a practical and effective way to deploy Bayesian methods in real-world contexts, see [6] for its use in the study of the James–Stein estimator. Since the 1980s, thanks to the advent of Markov chain Monte Carlo methods, *full Bayesian* approaches have become a viable alternative, motivating reflections on the pros and cons of the two approaches, see, for instance, [17]. In particular, the connection between Stein's Unbiased Risk Estimator (SURE), equivalent degrees of freedom and the robustness of marginal likelihood hyperparameter tuning has been investigated by [1, 21]. The choice of the prior distributions is somewhat more controversial. In the present chapter, we presented the principles of the maximum entropy approach, mainly following [12], but other approaches have been advocated for finding non-informative priors. A requirement could be invariance with respect to changes of coordinates, enjoyed, for instance, by Jeffreys' prior [13].

It is not unusual to have parameters that should be left immune from regularization. In the Bayesian approach, this corresponds to the absence of prior information, usually expressed through an infinite variance prior. Although this case could be treated by assigning large variances to some parameters, it is numerically more robust to use the exact formulas. Their derivation by a limit argument follows [22].

The idea of deriving approximated parametric models by a suitable projection of the Bayes estimate conforms to Hjalmarsson's advice "always first model as well as possible" [11]. The projection result has been derived in [23] for Gaussian processes and subsequently extended to general distributions in [20].

The equivalent degrees of freedom of a regularized estimator have been studied in the context of smoothing by additive [2] and spline models [3, 9], while a discussion specialized to the case of Bayesian estimation can be found in [5, 17].

Starting from the seminal paper [7], the use of stochastic simulation for computing posterior distributions according to a full Bayesian paradigm has gained ever wider adoption, especially when there exist conjugate priors that allow efficient sampling schemes. In particular, this is possible for the linear model discussed in this chapter, whose MCMC estimation is discussed in [18].

## **4.13 Appendix**

## *4.13.1 Proof of Theorem 4.1*

For simplicity, the proof is given in the scalar parameter case. We have that

$$\begin{split} \frac{d\mathrm{MSE}(\hat{\theta})}{d\hat{\theta}} &= \frac{d}{d\hat{\theta}} \int\_{-\infty}^{+\infty} (\hat{\theta} - \theta)^2 \mathbf{p}(\theta | Y) d\theta \\ &= 2\hat{\theta} \int\_{-\infty}^{+\infty} \mathbf{p}(\theta | Y) d\theta - 2 \int\_{-\infty}^{+\infty} \theta \mathbf{p}(\theta | Y) d\theta \\ &= 2 \left( \hat{\theta} - \mathcal{E} \, [\theta | Y] \right). \end{split}$$

Moreover,

$$\frac{d^2 \text{MSE}(\hat{\theta})}{d\hat{\theta}^2} = 2 \int\_{-\infty}^{+\infty} \mathbf{p}(\theta | Y) d\theta = 2 > 0.$$

Therefore, $\theta^{\text{B}} = \mathcal{E}[\theta|Y]$ minimizes $\text{MSE}(\hat{\theta})$.

## *4.13.2 Proof of Theorem 4.2*

Let $X = \theta^{\text{B}} - \theta$ denote the estimation error. Recalling that $\mathcal{E}[Y - \Phi\mu\_\theta] = \mathcal{E}[E] = 0$, from (4.10) it follows that $\mathcal{E}X = 0$. Note also that $X$ and $Y$ are jointly Gaussian and

$$\text{Cov}(X, Y) = \mathcal{E}[X(Y - \mathcal{E}Y)^T] = \mathcal{E}[XY^T] - \mathcal{E}X\,\mathcal{E}Y^T = \mathcal{E}[XY^T].$$

Now, using (4.7), we have

$$\begin{aligned} \mathcal{E}[XY^{T}] &= \mathcal{E}\Big[(\theta^{\text{B}} - \theta)Y^{T}\Big] \\ &= \Sigma\_{\theta Y}\Sigma\_Y^{-1}\mathcal{E}\Big[(Y - \mu\_Y)Y^{T}\Big] - \mathcal{E}\Big[(\theta - \mu\_\theta)Y^{T}\Big] \\ &= \Sigma\_{\theta Y}\Sigma\_Y^{-1}\big(\mathcal{E}[YY^{T}] - \mu\_Y\mu\_Y^{T}\big) - \big(\mathcal{E}[\theta Y^{T}] - \mu\_\theta\mu\_Y^{T}\big) \\ &= \Sigma\_{\theta Y}\Sigma\_Y^{-1}\Sigma\_Y - \Sigma\_{\theta Y} = 0. \end{aligned}$$

## *4.13.3 Proof of Lemma 4.1*

By applying the matrix inversion lemma (3.145) and proceeding with simple matrix manipulations,

$$\begin{aligned} \Sigma\_\theta\Phi^T(\Sigma\_E + \Phi\Sigma\_\theta\Phi^T)^{-1} &= \Sigma\_\theta\Phi^T\left(\Sigma\_E^{-1} - \Sigma\_E^{-1}\Phi(\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1})^{-1}\Phi^T\Sigma\_E^{-1}\right) \\ &= \Sigma\_\theta\Phi^T\Sigma\_E^{-1} - \Sigma\_\theta\Phi^T\Sigma\_E^{-1}\Phi(\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1})^{-1}\Phi^T\Sigma\_E^{-1} \\ &= \Sigma\_\theta\left(I - \Phi^T\Sigma\_E^{-1}\Phi(\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1})^{-1}\right)\Phi^T\Sigma\_E^{-1} \\ &= \Sigma\_\theta\left(\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1} - \Phi^T\Sigma\_E^{-1}\Phi\right)(\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1})^{-1}\Phi^T\Sigma\_E^{-1} \\ &= (\Phi^T\Sigma\_E^{-1}\Phi + \Sigma\_\theta^{-1})^{-1}\Phi^T\Sigma\_E^{-1}. \end{aligned}$$
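The chain of equalities above can also be checked numerically. The following sketch (NumPy; the dimensions and the randomly generated positive definite covariances are arbitrary illustrative choices) verifies that the left- and right-hand sides coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 7  # parameter and data dimensions (arbitrary)

Phi = rng.standard_normal((N, n))
# Random symmetric positive definite covariances Sigma_theta and Sigma_E
A = rng.standard_normal((n, n)); Sigma_theta = A @ A.T + n * np.eye(n)
B = rng.standard_normal((N, N)); Sigma_E = B @ B.T + N * np.eye(N)

# Left-hand side: Sigma_theta Phi^T (Sigma_E + Phi Sigma_theta Phi^T)^{-1}
lhs = Sigma_theta @ Phi.T @ np.linalg.inv(Sigma_E + Phi @ Sigma_theta @ Phi.T)

# Right-hand side: (Phi^T Sigma_E^{-1} Phi + Sigma_theta^{-1})^{-1} Phi^T Sigma_E^{-1}
SEinv = np.linalg.inv(Sigma_E)
rhs = np.linalg.inv(Phi.T @ SEinv @ Phi + np.linalg.inv(Sigma_theta)) @ Phi.T @ SEinv

assert np.allclose(lhs, rhs)  # the two expressions agree to machine precision
```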

## *4.13.4 Proof of Theorem 4.3*

In view of (4.13), the conditional variance is

$$\text{Var}(\theta|Y) = \left(\frac{\Phi^T \Phi}{\sigma^2} + \Sigma\_{\theta}^{-1}\right)^{-1} = \sigma^2 \left(\Phi^T \Phi + \sigma^2 \begin{bmatrix} a^{-1} I\_{n-p} & 0\\ 0 & \Sigma^{-1} \end{bmatrix}\right)^{-1}.$$

In view of (4.7)

$$\mathcal{E}(\theta|Y) = \Sigma\_{\theta} \Phi^{T} (\Phi \Sigma\_{\theta} \Phi^{T} + \sigma^{2} I\_{n})^{-1} Y = \begin{bmatrix} \Sigma \mathcal{Q}^{T} \\ a \Psi^{T} \end{bmatrix} (a \Psi \Psi^{T} + M)^{-1} Y.$$

By replicating the steps in the proof of Lemma 4.1,

$$a\Psi^T(a\Psi\Psi^T + M)^{-1} = \left(\Psi^T M^{-1} \Psi + \frac{I\_{n-p}}{a}\right)^{-1} \Psi^T M^{-1}.$$

Moreover, by applying the matrix inversion lemma, see (3.145),

$$\begin{aligned} (a\Psi\Psi^T + M)^{-1} &= M^{-1} - M^{-1}\Psi \left(\Psi^T M^{-1} \Psi + \frac{I\_{n-p}}{a}\right)^{-1} \Psi^T M^{-1} \\ &= M^{-1} - M^{-1} \Psi (\Psi^T M^{-1} \Psi)^{-1} \left(I\_{n-p} + \frac{1}{a} (\Psi^T M^{-1} \Psi)^{-1}\right)^{-1} \Psi^T M^{-1} .\end{aligned}$$

Then, letting *a* → ∞ completes the proof. Observe that all the inverse matrices appearing in the proof exist thanks to the full-rank assumptions made on Φ and Ψ.

## *4.13.5 Proof of Theorem 4.6*

The expectation in (4.41) can be rewritten as


$$\begin{split} &\mathcal{E}\left[\left\|\theta-\theta^{\mathsf{B}}+\theta^{\mathsf{B}}-\mathsf{g}(\xi)\right\|^{2}\Big|Y\right] \\ &=\mathcal{E}\left[\left\|\theta-\theta^{\mathsf{B}}\right\|^{2}+2\left(\theta-\theta^{\mathsf{B}}\right)^{T}\left(\theta^{\mathsf{B}}-\mathsf{g}(\xi)\right)+\left\|\theta^{\mathsf{B}}-\mathsf{g}(\xi)\right\|^{2}\Big|Y\right] \\ &=\mathcal{E}\left[\left\|\theta-\theta^{\mathsf{B}}\right\|^{2}\Big|Y\right]+\mathcal{E}\left\|\theta^{\mathsf{B}}-\mathsf{g}(\xi)\right\|^{2}. \end{split}$$

The proof follows by observing that, in the last expression, the first term does not depend on $\mathsf{g}$. In the last passage, we have exploited the fact that, given $Y$, $\theta^{\mathsf{B}}$ is deterministic and equal to $\mathcal{E}(\theta|Y)$.

## *4.13.6 Proof of Proposition 4.3*

First observe that

$$\text{WRSS} = \left\| \bar{\varepsilon} \right\|^2 = \sum\_{i=1}^{N} \frac{\gamma^2 \bar{y}\_i^2}{(\gamma + \bar{d}\_i^2)^2}. \tag{4.63}$$

Hence, in view of (4.52)

$$\mathcal{E}\,\text{WRSS} = \sigma^2 \left( N - \text{trace} (D(D^T D + \gamma I\_N)^{-1} D^T) \right).$$

On the other hand, by simple matrix manipulations, it turns out that

$$U^T \Psi^{-1/2} H \Psi^{1/2} U = D(D^T D + \gamma I\_N)^{-1} D^T.$$

Finally, recalling that trace(*AB*) = trace(*B A*),

$$\text{trace}(U^T \Psi^{-1/2} H \Psi^{1/2} U) = \text{trace}(\Psi^{1/2} U U^T \Psi^{-1/2} H) = \text{trace}(H)$$

thus proving the thesis.

## *4.13.7 Proof of Theorem 4.8*

Without loss of generality, the proof refers to the diagonalized Bayesian estimation problem (4.48). Up to a positive scaling, the negative marginal log-likelihood function is

$$\sum\_{i=1}^{N} \log(\bar{d}\_i^2 \lambda + \sigma^2) + \sum\_{i=1}^{N} \frac{\bar{y}\_i^2}{\bar{d}\_i^2 \lambda + \sigma^2} + \kappa,$$

where κ denotes a constant we are not concerned with. By equating to zero the partial derivatives with respect to σ<sup>2</sup> and λ we obtain

$$\sum\_{i=1}^{N} \frac{1}{\bar{d}\_i^2 \lambda + \sigma^2} - \sum\_{i=1}^{N} \frac{\bar{y}\_i^2}{(\bar{d}\_i^2 \lambda + \sigma^2)^2} = 0$$

$$\sum\_{i=1}^{N} \frac{\bar{d}\_i^2}{\bar{d}\_i^2 \lambda + \sigma^2} - \sum\_{i=1}^{N} \frac{\bar{d}\_i^2 \bar{y}\_i^2}{(\bar{d}\_i^2 \lambda + \sigma^2)^2} = 0.$$

In view of (4.54) and (4.63),

$$\begin{aligned} \sigma^2 \left( N - \text{dof}(\gamma) \right) - \text{WRSS} &= 0, \\ \lambda\, \text{dof}(\gamma) - \text{WPSS} &= 0, \end{aligned}$$

which concludes the proof.
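As a numerical sanity check, the two stationarity conditions above can be compared against finite-difference derivatives of the diagonalized objective. The sketch below (NumPy; the values of $\bar{d}_i$, $\bar{y}_i$, λ and σ² are arbitrary illustrative choices) does exactly this:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
d = rng.uniform(0.5, 2.0, N)     # singular values \bar{d}_i of the diagonalized problem
y = rng.standard_normal(N)       # transformed data \bar{y}_i

def nll(lam, s2):
    """Diagonalized objective: sum log(d_i^2 lam + s2) + sum y_i^2/(d_i^2 lam + s2)."""
    v = d**2 * lam + s2
    return np.sum(np.log(v)) + np.sum(y**2 / v)

lam, s2 = 0.7, 1.3
v = d**2 * lam + s2

# Analytic partial derivatives (the two stationarity expressions)
d_s2 = np.sum(1.0 / v) - np.sum(y**2 / v**2)
d_lam = np.sum(d**2 / v) - np.sum(d**2 * y**2 / v**2)

# Central finite differences of the objective
eps = 1e-6
fd_s2 = (nll(lam, s2 + eps) - nll(lam, s2 - eps)) / (2 * eps)
fd_lam = (nll(lam + eps, s2) - nll(lam - eps, s2)) / (2 * eps)

assert abs(d_s2 - fd_s2) < 1e-4 and abs(d_lam - fd_lam) < 1e-4
```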

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Regularization for Linear System Identification**

**Abstract** Regularization has been intensively used in statistics and numerical analysis to stabilize the solution of ill-posed inverse problems. Its use in System Identification, instead, has been less systematic until very recently. This chapter provides an overview of the main motivations for using regularization in system identification from a "classical" (Mean Squared Error) statistical perspective, also discussing how structural properties of dynamical models such as stability can be controlled via regularization. A Bayesian perspective is also provided, and the language of maximum entropy priors is exploited to connect different forms of regularization with time-domain and frequency-domain properties of dynamical systems. Some numerical examples illustrate the role of hyperparameters in controlling model complexity, for instance, quantified by the notion of Degrees of Freedom. A brief outlook on more advanced topics such as the connection with (orthogonal) basis expansions, McMillan degree and Hankel norms is also provided. The chapter is concluded with a historical overview of the early developments of the use of regularization in System Identification.

## **5.1 Preliminaries**

As we have discussed in the preceding chapters, system identification can be framed as an inverse problem which aims at finding a dynamical model $\mathscr{M}$ from a set of measured input–output "training" data $\mathscr{D}\_T := \{u(t), y(t)\}\_{t=1,\dots,N}$. The field of inverse problems [5] has motivated the development of, and is pervaded by, regularization techniques; as such, it is evident that regularization could and should play a major role also in the system identification arena.

Nevertheless, we believe it is fair to say that regularization has not had a pervasive impact in system identification until very recently. To introduce its use in this field, we will refer to the linear models $\mathscr{M} = \{M(\theta)\,|\,\theta \in D\_{\mathscr{M}}\}$ introduced in Chap. 2, Eq. (2.1). Note that this notation not only includes classical parametric structures,

<sup>©</sup> The Author(s) 2022

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_5

such as ARX, ARMAX and Box–Jenkins models, but also so-called nonparametric ones where the "parameter" θ may be infinite dimensional, e.g., containing all the impulse response coefficients of the filters $W\_y(q)$ and $W\_u(q)$ which characterize the predictor

$$
\hat{\mathbf{y}}(t|\theta) = W\_{\mathbf{y}}(q)\mathbf{y}(t) + W\_{\mathbf{u}}(q)\boldsymbol{u}(t). \tag{5.1}
$$

The transfer functions $W\_y(q)$ and $W\_u(q)$ are related to the input–output model

$$\mathbf{y}(t) = G(q, \theta)\boldsymbol{u}(t) + H(q, \theta)\boldsymbol{e}(t)$$

by the relation

$$W\_y(q) := [1 - H^{-1}(q, \theta)] \quad W\_u(q) := H^{-1}(q, \theta)G(q, \theta), \tag{5.2}$$

see also (2.4).

For simplicity, here we consider the single-output case $y(t) \in \mathbb{R}$. In the prediction error framework described in Chap. 2, the model fit is typically measured by the negative log-likelihood

$$V\_N(\theta) = -2\log \mathbf{p}(\mathscr{D}\_T|\theta) = -2\sum\_{t=1}^N \log(\mathbf{p}(\mathbf{y}(t) - \hat{\mathbf{y}}(t|\theta))),$$

which in the Gaussian case is, up to constants, proportional to the sum of squared prediction errors

$$V\_N(\theta) \propto \sum\_{t=1}^N (\mathbf{y}(t) - \hat{\mathbf{y}}(t|\theta))^2.$$

As discussed in Chap. 3, regularization can be added to make the inverse problem of estimating the model $M(\theta)$ from data well-posed; therefore, regularized estimators $\hat{\theta}\_R$ of the form

$$\hat{\theta}\_R := \underset{\theta}{\text{arg min }} W\_N(\theta) = \underset{\theta}{\text{arg min }} V\_N(\theta) + J\_\gamma(\theta) \tag{5.3}$$

are considered. This framework has been extensively discussed in the previous chapter in the context of linear regression under the squared loss $V\_N(\theta) = \|Y - \Phi\theta\|\_2^2$, see, e.g., Eq. (3.57).

The function $J\_\gamma(\theta)$ is usually referred to as the *penalty function*, and possibly depends on some (hyper-)parameter γ. In the simplest case, $J\_\gamma(\theta)$ takes the multiplicative form

$$J\_\gamma(\theta) := \gamma J(\theta)$$

and γ acts as a scaling factor which controls the "amount" of regularization. The most famous example is the so-called ridge regression problem, in which a quadratic loss $V\_N(\theta)$ is used and $J(\theta) := \|\theta\|^2$, so that (see also (3.61a)):


$$\hat{\theta}^{\mathcal{R}} := \underset{\theta}{\text{arg min }} \|Y - \Phi \theta\|^2\_2 + \gamma \|\theta\|^2 = \left(\Phi^T \Phi + \gamma I\right)^{-1} \Phi^T Y.$$

However, ridge regression has not had a significant impact in the context of System Identification, i.e., when the vector θ contains the impulse response coefficients of a (linear) dynamical system. To understand why, it is important to discuss the choice of $J\_\gamma(\theta)$. We will see that it plays a fundamental role and strongly influences the properties of the estimator $\hat{\theta}\_R$. In particular, we will see how $J\_\gamma(\theta)$ should be designed to encode properties of dynamical systems such as BIBO stability, smoothness in the time domain and frequency domain, oscillatory behaviour and so on; this is a form of "inductive bias" well known and studied in the machine learning community, see, e.g., [61].
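As a concrete illustration, the closed-form ridge estimator above can be computed in a few lines. The sketch below (NumPy; the dimensions, true parameter and noise level are arbitrary illustrative choices) also shows the shrinkage effect of γ: the ridge solution has a smaller norm than the least squares one:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 50, 10
Phi = rng.standard_normal((N, n))        # regression matrix
theta0 = rng.standard_normal(n)          # "true" parameter (illustrative)
Y = Phi @ theta0 + 0.5 * rng.standard_normal(N)

def ridge(Phi, Y, gamma):
    """Closed-form ridge estimate (Phi^T Phi + gamma I)^{-1} Phi^T Y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

theta_ls = ridge(Phi, Y, 0.0)            # least squares (gamma = 0)
theta_r = ridge(Phi, Y, 10.0)            # ridge shrinks the estimate towards the origin
assert np.linalg.norm(theta_r) < np.linalg.norm(theta_ls)
```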

As argued in Chap. 4, regularization can be given a Bayesian interpretation. In fact, introducing a probabilistic prior on model parameters θ of the form

$$\mathbf{p}\_{\mathcal{V}}(\theta) \propto e^{-\frac{J\_{\mathcal{V}}(\theta)}{2}} \tag{5.4}$$

and the Likelihood function:

$$\mathbf{p}(\mathscr{D}\_T|\theta) \propto e^{-\frac{V\_N(\theta)}{2}} \tag{5.5}$$

the maximum a posteriori (MAP) estimator of θ (see (4.2)), becomes

$$\hat{\theta}^{\text{MAP}} := \arg \max\_{\theta} \mathbf{p}(\theta | \mathscr{D}\_T) \tag{5.6}$$

$$= \arg\max\_{\theta} \mathbf{p}(\mathscr{D}\_T|\theta)\,\mathbf{p}\_\gamma(\theta) \tag{5.7}$$

$$= \arg\max\_{\theta} \log \left[ \mathbf{p}(\mathscr{D}\_T|\theta)\,\mathbf{p}\_\gamma(\theta) \right] \tag{5.8}$$

$$= \arg\min\_{\theta} \ -\log\left[\mathbf{p}(\mathscr{D}\_T|\theta)\right] - \log\left[\mathbf{p}\_\gamma(\theta)\right] \tag{5.9}$$

$$= \arg\min\_{\theta} \ V\_N(\theta) + J\_\gamma(\theta) \tag{5.10}$$

$$= \hat{\theta}\_R. \tag{5.11}$$

In what follows, we will therefore use interchangeably the "regularization" framework, and thus think of $J\_\gamma(\theta)$ as a penalty function, or the "Bayesian" framework, and thus think of $\mathbf{p}\_\gamma(\theta)$ as a prior (with some caution in the infinite-dimensional case).

## **5.2 MSE and Regularization**

The final goal of modelling is to perform some task, e.g., prediction or control, on future unseen data. As such, the quality of the estimated model should be measured with that objective in mind. For simplicity, we will consider a prediction task, referring the reader to the literature discussed in Sect. 5.9 for extensions. To this purpose, in addition to the training data $\mathscr{D}\_T$, let us introduce the testing data:

$$\mathscr{D}\_{test} := \{ u\_{test}(t),\, y\_{test}(t) \}\_{t=1,\dots,N\_{test}}.$$

A model $\hat{\mathscr{M}} := M(\hat{\theta})$ estimated using the training data $\mathscr{D}\_T$ should then predict well the testing data $\mathscr{D}\_{test}$. In particular, let $\hat{y}(t|\hat{\theta})$ be the output prediction at instant $t$ constructed using the estimated model. Then, we can measure the performance of $\hat{\mathscr{M}}$ using the Mean Squared Error (MSE) of the output prediction, assuming that data are generated by some "true", yet unknown, parameter vector $\theta\_0$. This is defined as

$$MSE\_Y(\hat{\mathscr{M}}, \theta\_0) = \mathcal{E}\left(\frac{1}{N\_{test}}\sum\_{t=1}^{N\_{test}}\left(y\_{test}(t) - \hat{y}\_{test}(t|\hat{\theta})\right)^2\right) = \mathcal{E}\left(y\_{test}(t) - \hat{y}\_{test}(t|\hat{\theta})\right)^2, \tag{5.12}$$

where, for simplicity, we have assumed stationary statistics for the pairs $u\_{test}(t), y\_{test}(t)$ in the last passage. In this section, we will argue that using regularization in estimating $\hat{\theta}$ can indeed help in obtaining a small $MSE\_Y(\hat{\mathscr{M}}, \theta\_0)$. Let us first assume that data are generated by an unknown "true" linear time-invariant (LTI) causal model:

$$\mathbf{y}(t) = \sum\_{k=1}^{\infty} \mathbf{g}\_k \boldsymbol{u}(t - k) + \boldsymbol{e}(t),\tag{5.13}$$

where the "true" "parameter" θ<sup>0</sup> = [*g*1, *g*2, *g*3,..., *gn*,...] is an infinite sequence in 1, i.e.,

$$\sum\_{k=1}^{\infty} |\mathbf{g}\_k| < \infty.$$

We now consider the model class *M*(θ ) of Finite Impulse Response (FIR) Output Error (OE) models

$$\mathbf{y}(t) = \sum\_{k=1}^{n} \theta\_k \boldsymbol{u}(t-k) + \boldsymbol{e}(t),\tag{5.14}$$

where the parameter vector $\theta \in \mathbb{R}^n$ contains the coefficients of an $n$th-order finite impulse response model. Under the assumption that the input process is unit-variance white noise, independent of the measurement noise, and defining

$$\hat{g}\_k := \begin{cases} \hat{\theta}\_k & k = 1, \dots, n \\ 0 & \text{otherwise} \end{cases}$$

the MSE (5.12) has the expression


$$\begin{split} MSE\_Y(\hat{\mathscr{M}},\theta\_0) &= \mathcal{E}(y\_{test}(t) - \hat{y}\_{test}(t|\hat{\theta}))^2 \\ &= \mathcal{E}\left(\sum\_{k=1}^{\infty}(g\_k - \hat{g}\_k)u\_{test}(t-k) + e(t)\right)^2 \\ &= \underbrace{\sum\_{k=1}^{\infty}\mathcal{E}(g\_k - \hat{g}\_k)^2}\_{\mathcal{E}\|g - \hat{g}\|^2} + \sigma^2 \\ &= \underbrace{\sum\_{k=1}^{\infty}\mathcal{E}(\hat{g}\_k - \mathcal{E}[\hat{g}\_k])^2}\_{\text{Variance}} + \underbrace{\sum\_{k=1}^{\infty}(g\_k - \mathcal{E}[\hat{g}\_k])^2}\_{\text{Bias}^2} + \sigma^2 \\ &= \underbrace{\sum\_{k=1}^{n}\mathcal{E}(\hat{\theta}\_k - \mathcal{E}[\hat{\theta}\_k])^2}\_{\text{Variance}} + \underbrace{\sum\_{k=1}^{n}(g\_k - \mathcal{E}[\hat{\theta}\_k])^2}\_{\text{Bias}^2} + \sum\_{k=n+1}^{\infty} g\_k^2 + \sigma^2. \end{split} \tag{5.15}$$

This is nothing but the usual *bias-variance trade-off* discussed in Chap. 1: the model (θ in this case) has to be rich enough (i.e., *n* large) to capture the "true" data generating mechanism (low bias) but also simple enough (i.e., *n* small) to be estimated using the available data with low variability (low variance). The squared loss

$$\mathcal{E}\|g - \hat{g}\|^2 = \sum\_{k=1}^{\infty} \mathcal{E}(g\_k - \hat{g}\_k)^2$$

present on the right-hand side of (5.15), after the third equality, is called a *compound* loss on the (possibly infinite) vector θ [60, 63] and defines the MSE.

Considering compound losses of this type allows us to connect with the discussion made in Chap. 1 on Stein's effect. To simplify exposition, let us assume that the identification input is a discrete impulse *u*(*t*) = δ(*t*) so that we can think of *y*(*t*) as direct noisy measurements of all the (nonzero) impulse response coefficients

$$\mathbf{y}(t) = \mathbf{g}\_t + e(t) \quad \text{ } t = 1, \dots, n. \tag{5.16}$$

Defining $Y := [y(1), \dots, y(n)]^T$ and $E := [e(1), \dots, e(n)]^T$, the measurement model (5.16) can be written in vector form

$$Y = \theta + E, \qquad E \sim \mathcal{N}(0, \sigma^2 I\_n). \tag{5.17}$$

As we have seen in Chap. 1, the least squares estimator $\hat{\theta}^{LS}$ for model (5.17) is dominated (for $n > 2$) by the James–Stein estimator discussed in Sect. 1.1.1. As argued in Chap. 1, the James–Stein estimator (1.3) is a special case of a regularized estimator (5.3) where $J\_\gamma(\theta) = \gamma\|\theta\|^2$ and γ takes the data-dependent form (1.4)

$$\gamma = \frac{(n-2)\sigma^2}{\|Y\|^2 - (n-2)\sigma^2}.$$

Following this route, the James–Stein estimator favours "small" parameter values (the regularization term $J\_\gamma(\theta) = \gamma\|\theta\|^2$ penalises large θ) and therefore it is to be expected that the gap w.r.t. the least squares estimator is larger under these circumstances; this has been illustrated in Fig. 1.1.
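This equivalence is easy to verify numerically: solving the regularized problem with the data-dependent γ of (1.4) reproduces the James–Stein shrinkage of (1.3). A small sketch (NumPy; the true parameter and the unit noise variance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
sigma2 = 1.0
theta0 = np.full(n, 1.5)                 # arbitrary "true" parameter (illustrative)
Y = theta0 + rng.standard_normal(n)      # model (5.17): Y = theta + E

c = (n - 2) * sigma2
gamma = c / (Y @ Y - c)                  # data-dependent weight, cf. (1.4)

# Minimizer of ||Y - theta||^2 + gamma ||theta||^2 is Y / (1 + gamma)
theta_ridge = Y / (1 + gamma)
# James-Stein shrinkage, cf. (1.3)
theta_js = (1 - c / (Y @ Y)) * Y

assert np.allclose(theta_ridge, theta_js)  # the two estimators coincide
```

The identity is purely algebraic: $1/(1+\gamma) = 1 - (n-2)\sigma^2/\|Y\|^2$ for the γ above.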

As pointed out in Sect. 1.1.2, there is actually nothing special in having chosen the origin as a reference. In fact, the penalty term can be replaced with $J\_\gamma(\theta) = \gamma\|\theta - a\|^2$ for any $a \in \mathbb{R}^n$, yielding estimators which always dominate least squares provided γ is chosen as

$$\frac{(n-2)\sigma^2}{\|\mathbf{y} - a\|^2 - (n-2)\sigma^2}.$$

This teaches us that under certain circumstances it is possible to steer the estimators, using a suitable penalty functional, towards certain regions of the parameter space (or, more generally, model space); most importantly, this can be done without any loss (actually with a gain) for any possible occurrence of the "true" yet unknown system. However, the reader should bear in mind that this only holds for the compound loss (5.15) and should not be seen as a *panacea*. For instance, James–Stein estimators may provide only marginal improvements over Least Squares in situations where the signal-to-noise ratio is highly non-uniform over the parameter space, a situation often encountered in system identification when input signals are not white and poor excitation may be present, e.g., in certain frequency bands. This has been illustrated in Example 1.2.

Therefore, as a take-home message from Chap. 1 and the discussion above, we should remember that regularization has much to offer, yet its use in system identification is not straightforward. The main reasons are as follows:


The latter is one of the main goals of this book, i.e., to provide the reader with a thorough understanding of the role of regularization in estimating dynamical systems so as to optimally design regularization methods depending on the intended use of the model. In the remaining part of the chapter, we will first introduce the concept of "optimal" prior and derive its expression. We will then connect the structure of the optimal prior to the notion of BIBO stability for linear dynamical systems and also its link with smoothness in time and frequency domains. Connection with the Bayesian setting will also be provided. The chapter will be concluded with an historical overview of how the use of regularization in the context of estimation of dynamical systems has evolved, illustrating also the role played by time- and frequency-domain smoothness.

## **5.3 Optimal Regularization for FIR Models**

Let us consider the problem of estimating the impulse response $\{\theta\_k\}\_{k=1,\dots,n}$ of the FIR model (5.14) using data $\{y(t)\}\_{t=1,\dots,N}$. The FIR model can be compactly written as

$$Y = \Phi \theta + E,\tag{5.18}$$

where $Y := [y(1), \dots, y(N)]^T$, $E := [e(1), \dots, e(N)]^T$ and Φ contains the input samples, which are assumed to be available for all the time instants needed, so as to avoid issues related to initial conditions. We will still use $\theta\_0$ to denote the "true" value that has generated the data.

We now consider the class of regularized estimators

$$\hat{\theta}^{R} := \underset{\theta \in \mathbb{R}^n}{\text{arg min}} \; \|Y - \Phi \theta\|^2 + \sigma^2 \theta^T P^{-1} \theta$$

parametrized by the regularization matrix $P = P^T \succ 0$. As shown in Chap. 3, see Eq. (3.60), the generalized ridge regression estimator $\hat{\theta}^R$ can be extended also to the case where $P$ is singular, so that we can assume $P = P^T \succeq 0$. As a matter of fact, in the Bayesian framework introduced in Chap. 4, $\hat{\theta}^R$ can also be interpreted as the MAP estimator

$$
\hat{\theta}^{\text{MAP}} := \arg\max \quad p(\theta|Y).
$$

obtained under the assumption that the noise $E$ is Gaussian with zero mean and variance $\sigma^2 I$, and that θ is independent of $E$, zero-mean Gaussian with (possibly singular) variance $P = P^T \succeq 0$ (the singular case was described in (4.19)).

In this section, to emphasize the dependence of the estimator $\hat{\theta}^R$ on $P = P^T \succeq 0$, we will use the notation

$$
\hat{\theta}^P := \hat{\theta}^{R} = \hat{\theta}^{\text{MAP}}.
$$

Our objective now is to study the performance of the estimator $\hat{\theta}^P$, in terms of MSE, as a function of $P = P^T \succeq 0$, under the assumption that $Y$ has been generated by a "true model" of the form (5.18) with a deterministic and unknown parameter $\theta\_0$. Thus, the only sources of "randomness" are the noise vector $E$ and the system input, which is seen as a stochastic process (independent of $E$) in this section.

We consider a test experiment with a new input *utest*(*t*), independent of the input *u*(*t*) used for identification; for convenience of notation, we define the lagged test input vector

$$\phi\_{test}(t) := \left[ u\_{test}(t), \dots, u\_{test}(t-n+1) \right]^T$$

so that under (5.14) the test output is given by

$$y\_{test}(t) = \phi\_{test}^{\mathcal{T}}(t)\theta\_0 + e\_{test}(t).$$

Let us also define the covariance matrix

$$W\_u = \operatorname{Var} \left\{ \phi\_{test}(t) \right\} = \mathcal{E}\,\phi\_{test}(t) \phi\_{test}^{T}(t)$$

(note that stationary assumptions are present here, in fact *Wu* does not depend on time *t*) and the MSE matrix

$$M\_{\theta\_0}(P) := \mathcal{E}(\theta\_0 - \hat{\theta}^P)(\theta\_0 - \hat{\theta}^P)^T.$$

If we now consider the output mean squared error $MSE\_Y(\hat{\mathscr{M}}, \theta\_0)$ in (5.12) computed for the model $\hat{\mathscr{M}}$, we obtain

$$\begin{split} MSE\_Y(\hat{\mathscr{M}},\theta\_0) &= \mathcal{E}\left(y\_{test}(t) - \hat{y}\_{test}(t|\hat{\theta}^P)\right)^2 \\ &= \mathcal{E}\left[\phi\_{test}^T(t)\theta\_0 + e\_{test}(t) - \phi\_{test}^T(t)\hat{\theta}^P\right]^2 \\ &= \mathcal{E}\left[(\theta\_0 - \hat{\theta}^P)^T\phi\_{test}(t)\phi\_{test}^T(t)(\theta\_0 - \hat{\theta}^P)\right] + \sigma^2 \\ &= Tr\{\mathcal{E}(\theta\_0 - \hat{\theta}^P)(\theta\_0 - \hat{\theta}^P)^T\,\mathcal{E}\,\phi\_{test}(t)\phi\_{test}^T(t)\} + \sigma^2 \\ &= Tr\{M\_{\theta\_0}(P)W\_u\} + \sigma^2, \end{split} \tag{5.19}$$

where in the second-to-last equality we have used that the test inputs and noises are independent of the training inputs and noise in the identification data used for estimating $\hat{\theta}^P$.

A direct consequence of this fact is that, given two prior covariance matrices $P$ and $P^\*$, if $M\_{\theta\_0}(P) \succeq M\_{\theta\_0}(P^\*)$, then

$$MSE\_Y(\hat{\theta}^P, \theta\_0) \ge MSE\_Y(\hat{\theta}^{P^\*}, \theta\_0) \qquad \forall W\_u,$$

i.e., the estimator $\hat{\theta}^{P^\*}$ outperforms $\hat{\theta}^P$ in terms of output prediction for any possible choice of the test input covariance $W\_u$. Thus, if the modelling purpose is output prediction, it is of interest to minimize the matrix $M\_{\theta\_0}(P)$ w.r.t. all possible $P = P^T \succeq 0$, i.e., to find

$$P^\* := \underset{P = P^T \succeq 0}{\text{arg min}} \ M\_{\theta\_0}(P), \tag{5.20}$$

so that $\hat{\theta}^{P^\*}$ outperforms any other $\hat{\theta}^P$ in terms of output error (5.15) for any choice of the (test) input covariance $W\_u$. Under the assumption that the true model generating the data is an FIR model of length $n$ with impulse response

$$\mathbf{g}\_k = \begin{cases} \theta\_{0,k} & k \le n \\ 0 & k > n, \end{cases}$$

the solution *P*<sup>∗</sup> of the minimization problem in (5.20) has been derived in Proposition 3.1, and takes the form

$$P^\* = \theta\_0 \theta\_0^T,\tag{5.21}$$

where θ<sup>0</sup> is the "true" impulse response of the data-generating mechanism (5.14). An alternative proof of the optimal solution (5.21) to problem (5.20) can be found in Sect. 5.10.1. Since *P*<sup>∗</sup> depends on the unknown true system, this result is not of practical interest; however, if we think of the FIR model (5.14) as the approximation of a BIBO stable infinite impulse response model

$$\mathbf{y}(t) = \sum\_{k=1}^{\infty} \theta\_{0,k} \boldsymbol{u}(t - k) + e(t),\tag{5.22}$$

the impulse response $\theta\_0$ should have finite $\ell\_1$ norm $\|\theta\_0\|\_1$, i.e.,

$$\|\theta\_0\|\_1 := \sum\_{k=1}^{\infty} |\theta\_{0,k}| < \infty,\tag{5.23}$$

and therefore $\theta\_{0,k}$ should decay as a function of the index $k$. As a result, the entries $[P^\*]\_{ij} = \theta\_{0,i}\theta\_{0,j}$ of the optimal kernel decay as functions of the row and column indices $i$ and $j$. In Bayesian terms, it is thus expected that the elements $[P]\_{ij}$ of any "good" candidate prior variance should behave in the same way. As we will see later in this chapter, recent forms of regularization for system identification include a decay rate condition on the elements $[P]\_{ij}$, so as to guarantee that the estimated system is BIBO stable. Therefore, we will often refer to conditions on the decay rate of $P$ as "stability conditions". While condition (5.23) is obviously satisfied when θ is a finite-dimensional vector, this loose connection between the decay rate of the kernel and stability needs to be tightened. We will see in the next section that this can be properly formulated in a Bayesian framework.
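Although $P^* = \theta_0\theta_0^T$ is not computable in practice, it is instructive to compare it in simulation against a neutral choice such as $P = I$. The following Monte Carlo sketch (NumPy; the decaying impulse response, dimensions and noise level are illustrative assumptions) uses the form $\hat{\theta}^P = P\Phi^T(\Phi P\Phi^T + \sigma^2 I)^{-1}Y$, which is valid also for singular $P$ (cf. Lemma 4.1):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, sigma = 10, 40, 1.0
theta0 = 0.8 ** np.arange(1, n + 1)      # decaying "true" impulse response (assumed)

def reg_estimate(Phi, Y, P, sigma2):
    """Regularized estimate P Phi^T (Phi P Phi^T + sigma2 I)^{-1} Y; valid for singular P."""
    S = Phi @ P @ Phi.T + sigma2 * np.eye(len(Y))
    return P @ Phi.T @ np.linalg.solve(S, Y)

mse_opt, mse_ridge = 0.0, 0.0
reps = 300
for _ in range(reps):
    Phi = rng.standard_normal((N, n))    # white-noise input regressors
    Y = Phi @ theta0 + sigma * rng.standard_normal(N)
    th_opt = reg_estimate(Phi, Y, np.outer(theta0, theta0), sigma**2)  # P* = theta0 theta0^T
    th_rid = reg_estimate(Phi, Y, np.eye(n), sigma**2)                 # ridge-type P = I
    mse_opt += np.sum((th_opt - theta0) ** 2) / reps
    mse_ridge += np.sum((th_rid - theta0) ** 2) / reps

# The oracle kernel yields a (much) smaller average MSE than the neutral one
assert mse_opt < mse_ridge
```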

## **5.4 Bayesian Formulation and BIBO Stability**

In the previous section, we have considered only FIR models, which are reasonable approximations of any BIBO stable LTI system in most practical scenarios. However, it is of interest to formulate the estimation of BIBO stable LTI systems in full generality, without assuming the impulse response to have finite support. This entails working with infinite-dimensional impulse responses $\{\theta\_k\}\_{k\in\mathbb{N}}$. In this chapter, we first consider the Bayesian framework, while regularization in infinite-dimensional Hilbert spaces will be addressed in Chap. 6. To start with, we model the unknown impulse response $\{\theta\_k\}\_{k\in\mathbb{N}}$ as a stochastic process indexed by the time lag $k$; this is the straightforward extension to the infinite-dimensional case of (5.18), where θ was a finite-dimensional random vector. In this context, it is of interest to introduce the concept of "stable" priors:

**Definition 5.1** (*Stable priors*) A prior on $\{\theta\_k\}\_{k\in\mathbb{N}}$ is said to be stable if its realizations are almost surely sequences in $\ell\_1$, i.e.,

$$\sum\_{k=1}^{\infty} |\theta\_k| < \infty \qquad a.s.$$

In most of this book, mostly for computational reasons, we will also assume that $\{\theta\_k\}\_{k\in\mathbb{N}}$ is Gaussian, i.e., that any finite collection of random variables $\theta\_{i\_1}, \dots, \theta\_{i\_\ell}$, with $i\_k \in \mathbb{N}$ and $\ell \in \mathbb{N}$, is jointly Gaussian. This is formalized in the following assumption.

**Assumption 5.1** Under the Bayesian framework, we assume $\{\theta_k\}_{k\in\mathbb{N}}$ to be a Gaussian stochastic process with mean $\{m_k\}_{k\in\mathbb{N}}$ and covariance function $K(t,s)$, $t,s \in \mathbb{N}$.

It is an interesting fact that, under additional assumptions on the mean and covariance functions, the prior is stable according to Definition 5.1, as formalized in the following lemma whose proof is in Sect. 5.10.2.

**Lemma 5.1** *Under Assumption 5.1 and if the following additional conditions hold*

$$\sum\_{k=1}^{\infty} |m\_k| = M\_{\ell\_1} < \infty \qquad \sum\_{k=1}^{\infty} K(k,k)^{1/2} = K\_{\ell\_1} < \infty,\tag{5.24}$$

*then the prior is stable as per Definition 5.1, i.e.,*

$$\sum\_{k=1}^{\infty} |\theta\_k| < \infty \qquad a.s.$$

In most of this book, we will also assume that the a priori mean $m_t$ is identically zero, so that only the condition on the covariance $K(t,s)$ needs to be checked to ensure stability. We will now discuss different forms of prior covariances $K$ encountered in the literature.

## **5.5 Smoothness and Contractivity: Time- and Frequency-Domain Interpretations**

As seen in Sect. 5.3, the optimal regularizer should mimic the "true" impulse response, which is clearly unfeasible since the impulse response is unknown. However, as already discussed in Sect. 5.4, we can use the prior to encode qualitative behaviour of impulse responses of BIBO stable linear systems. In particular we have seen in Lemma 5.1 that a certain decay condition on the prior mean and covariance guarantees the description of only (almost surely) BIBO stable linear systems. The simplest example of such a prior model is the following.

**Example 5.2** (*Diagonal (DI) prior*) Assume the prior mean to be zero, $m_t = 0\ \forall t \in \mathbb{N}$, and the covariance function to be diagonal with exponentially decaying entries

$$K(t,s) = \lambda \alpha^t \delta(t-s) \qquad t,s \in \mathbb{N}, \quad \lambda > 0, \quad 0 \le \alpha < 1.$$

The parameters λ (scale factor) and α (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization, as described in Sect. 4.4. It is worth observing that the assumptions of Lemma 5.1 are satisfied, indeed

$$\sum\_{t \in \mathbb{N}} |m\_t| = 0 \qquad \sum\_{t \in \mathbb{N}} K(t, t)^{1/2} = \sum\_{t \in \mathbb{N}} \sqrt{\lambda} \alpha^{t/2} = \sqrt{\lambda} \frac{\sqrt{\alpha}}{1 - \sqrt{\alpha}} < \infty$$

and hence this is a *stable prior*.
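As a quick numerical check (not part of the book's treatment), the sketch below samples one realization from the DI prior and verifies the summability condition of Lemma 5.1; the values of $\lambda$, $\alpha$ and the truncation length are illustrative choices.

```python
import numpy as np

lam, alpha, n = 1.0, 0.8, 200           # illustrative values; truncation at n taps
rng = np.random.default_rng(0)

# DI covariance: K(t, s) = lam * alpha**t * delta(t - s), t = 1, ..., n
K_diag = lam * alpha ** np.arange(1, n + 1)
theta = rng.normal(0.0, np.sqrt(K_diag))    # one realization: independent entries

# Lemma 5.1: sum_t K(t, t)**0.5 converges to sqrt(lam)*sqrt(alpha)/(1 - sqrt(alpha))
partial_sum = np.sum(np.sqrt(K_diag))
limit = np.sqrt(lam) * np.sqrt(alpha) / (1.0 - np.sqrt(alpha))
l1_norm = np.sum(np.abs(theta))             # finite: the realization is in l1
print(partial_sum, limit, l1_norm)
```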

It is interesting to observe that a decay rate condition on the impulse response coefficients is equivalent to assuming a smoothness condition in the frequency domain. To see this, let us introduce the frequency response function

$$G(e^{j\omega}) := \sum_{k=1}^{\infty} \theta_k e^{-j\omega k}.$$

We can consider the $L_2$-norm of the first derivative $\frac{dG(e^{j\omega})}{d\omega}$:

$$\left\| \frac{dG(e^{j\omega})}{d\omega} \right\|^2 := \frac{1}{2\pi} \int\_0^{2\pi} \left| \frac{dG(e^{j\omega})}{d\omega} \right|^2 d\omega$$

which using Parseval's theorem can be expressed in time domain

$$\left\|\frac{dG(e^{j\omega})}{d\omega}\right\|^2 = \sum_{k=1}^{\infty} k^2 |\theta_k|^2. \tag{5.25}$$


**Fig. 5.1** Sample realizations from the diagonal kernel prior for α = 0.4 (top) and α = 0.8 (bottom). Impulse response is on the left, frequency response (magnitude only) on the right

Computing higher-order derivatives, and using again Parseval's theorem, the *L*2 norm of the *m*th-order derivative is given by

$$\left\| \frac{d^{(m)}G(e^{j\omega})}{d\omega^{(m)}} \right\|^2 = \sum\_{k=1}^{\infty} k^{2m} |\theta\_k|^2. \tag{5.26}$$

Hence, the condition that the {θ*<sup>k</sup>* } decay rapidly (and possibly exponentially as postulated by the Diagonal kernel) with *k*, implies a bound on the *L*<sup>2</sup> norm of the *m*th-order derivatives, i.e., smoothness in the frequency domain of the model.
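The identity (5.25) can be checked numerically. The sketch below evaluates $dG/d\omega$ on a uniform frequency grid (an exact quadrature for a trigonometric polynomial of finite degree) and compares the two sides; the impulse response $\theta_k = 0.5^k$ is an arbitrary illustrative choice.

```python
import numpy as np

k = np.arange(1, 21)
theta = 0.5 ** k                       # an exponentially decaying impulse response

# left side of (5.25): (1/2pi) * integral over [0, 2pi) of |dG/dw|^2,
# with dG/dw = sum_k (-j k) theta_k e^{-j w k}; the mean over a uniform
# frequency grid integrates this trigonometric polynomial exactly
w = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
dG = (-1j * (k * theta)[:, None] * np.exp(-1j * np.outer(k, w))).sum(axis=0)
lhs = np.mean(np.abs(dG) ** 2)

# right side of (5.25): sum_k k^2 |theta_k|^2
rhs = np.sum(k ** 2 * theta ** 2)
print(lhs, rhs)                        # the two sides agree
```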

As illustrated in Fig. 5.1, smoothness in the frequency domain decreases when α increases. However, under this prior, the impulse response coefficients are modelled as independent (yet not identically distributed) random variables. Thus no smoothness in the time domain is included, as for instance, is typically performed with priors based on random walk, which are the discrete-time counterpart of spline models as discussed in Sect. 4.9. A prior model that, in addition to stability, also includes a smoothness condition in the time domain, is the so-called TC-kernel:

**Example 5.3** (*Tuned-Correlated (TC) prior*) Assume the prior mean is zero, $m_t = 0\ \forall t \in \mathbb{N}$, and the covariance function takes the form

$$K(t,s) = \lambda \alpha^{\max(t,s)} \qquad t,s \in \mathbb{N}, \quad \lambda > 0, \quad 0 \le \alpha < 1.$$

As in the previous example, the parameters λ (scale factor) and α (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization. It is worth observing that also in this case the assumptions of Lemma 5.1 are satisfied, indeed

$$\sum\_{t \in \mathbb{N}} |m\_t| = 0 < \infty \qquad \sum\_{t \in \mathbb{N}} K(t, t)^{1/2} = \sum\_{t \in \mathbb{N}} \sqrt{\lambda} \alpha^{t/2} = \sqrt{\lambda} \frac{\sqrt{\alpha}}{1 - \sqrt{\alpha}} < \infty$$

**Fig. 5.2** Sample realizations from the Tuned-Correlated (TC) prior for α = 0.4 (top) and α = 0.8 (bottom). Impulse response is on the left, frequency response (magnitude only) on the right

$$
\mathbb{E}\,\theta_t\theta_s = \lambda\alpha^t \qquad \forall\, t \ge s.
$$

So, differently from the DI prior, the correlation between distinct coefficients is different from zero and decays exponentially to zero.

Figure 5.2 shows two typical realizations from the TC prior, both in time domain and frequency domain, for α = 0.4 (top) and α = 0.8 (bottom), while Fig. 5.3 shows 30 sample realizations from the DI (top) and TC (bottom) priors, respectively.
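A minimal sketch of how such realizations can be generated: build the TC covariance matrix, confirm it is a valid (positive semi-definite) covariance, and draw samples. The hyperparameter values are illustrative, not those used for the figures.

```python
import numpy as np

lam, alpha, n = 1.0, 0.8, 50
t = np.arange(1, n + 1)
K_tc = lam * alpha ** np.maximum.outer(t, t)    # TC kernel: lam * alpha**max(t, s)

# a valid covariance must be symmetric positive semi-definite
eig_min = np.linalg.eigvalsh(K_tc).min()
print(eig_min)                                   # non-negative up to roundoff

# draw a few realizations, in the spirit of Figs. 5.2-5.3
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(n), K_tc, size=30)
print(samples.shape)
```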

**Example 5.4** (*Importance of stable priors*) In order to illustrate the advantage of using stable priors, we now consider a simple example of identification of an output error model. In particular, we consider a system of the form

$$y(t) = \sum_{k=1}^{\infty} g_k u(t-k) + e(t),$$

where the measured input *u*(*t*) and the noise *e*(*t*) are realizations from white Gaussian noise with zero mean and unit variance. The impulse response is

$$\mathbf{g}\_k = \begin{cases} \left(\frac{k}{2}\right)^2 e^{-\frac{k}{4}} & k \ge 1\\ 0 & k < 1 \end{cases}.$$

For the purpose of identification, we assume the input is available at all time instances needed. For illustration purposes, the impulse response has been truncated at *k* = 50, since it is in practice zero for *k* > 50. We also assume that output measurements *y*(*t*) are available for *t* = 1,..., 35. The hyperparameters are all estimated using marginal likelihood maximization, see Sect. 4.4. The results are shown in Fig. 5.4. The reconstruction error is measured using the percentage root mean square (RMS) error:

$$\sqrt{\frac{\sum\_{k=1}^{\infty} (\mathbf{g}\_k - \hat{\mathbf{g}}\_k)^2}{\sum\_{k=1}^{\infty} \mathbf{g}\_k^2}} \times 100\%. \tag{5.27}$$

As illustrated in Fig. 5.4, it is apparent that the results obtained using the stable priors, see panels (b) and (c), outperform those returned by the spline (random walk) prior, see panel (a), which does not include the stability constraint. The best relative error is obtained by the TC prior ($\simeq$10%) and goes up to as much as 33% for the spline prior. It can also be observed that, while for the stable priors (b) and (c) the confidence intervals shrink as the time index $k$ grows, the same does not hold for the spline prior. The same behaviour was observed in Sect. 4.9, see Fig. 4.1.
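The experiment can be reproduced in spirit with the sketch below, which compares a ridge (non-stable) prior with a TC (stable) prior on the same system. Unlike in Example 5.4, the hyperparameters here are fixed to illustrative values rather than estimated by marginal likelihood, so the error figures will differ from those reported above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 50, 35
k = np.arange(1, n + 1)
g = (k / 2.0) ** 2 * np.exp(-k / 4.0)       # true impulse response, truncated at n

# input available at all needed time instants; u[i] is u(t) at time t = i - n + 1
u = rng.normal(size=N + n)
Phi = np.array([[u[t - j + n - 1] for j in range(1, n + 1)] for t in range(1, N + 1)])
y = Phi @ g + rng.normal(size=N)            # unit-variance output noise

def posterior_mean(P):
    # Bayesian estimate for y = Phi theta + e, theta ~ N(0, P), e ~ N(0, I)
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + np.eye(N), y)

lam, alpha = 100.0, 0.9                     # fixed for illustration, not ML-tuned
g_ridge = posterior_mean(lam * np.eye(n))                     # no stability built in
g_tc = posterior_mean(lam * alpha ** np.maximum.outer(k, k))  # stable TC prior

def rms_pct(g_hat):                         # percentage RMS error (5.27)
    return 100.0 * np.sqrt(np.sum((g - g_hat) ** 2) / np.sum(g ** 2))

print(rms_pct(g_ridge), rms_pct(g_tc))
```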

In the next section, a class of stable priors, which includes TC as a special case, will be derived following a first-principle maximum entropy framework.

## *5.5.1 Maximum Entropy Priors for Smoothness and Stability: From Splines to Dynamical Systems*

The class of Stable Spline priors introduced in [49] extends the smoothness ideas used in the spline models of Sect. 4.9 by embedding exponential decay conditions in the impulse response prior. They ultimately lead to estimated models which are BIBO stable with probability 1.

In this section, we will introduce a simple construction of these stable spline priors in discrete time. In particular, we will exploit a very natural axiomatic derivation in the maximum entropy framework introduced in Chap. 4. For the sake of illustration,

**Fig. 5.4** Panels **a**–**c**: impulse response reconstruction (blue) and true (red) with 95% Bayesian confidence intervals (dashed). Panel **d** is the relative RMS error (5.27) on impulse response reconstruction as a function of the scale factor λ. For DI and TC priors for each scale factor, the optimal decay rate α is estimated using marginal likelihood. The star denotes the performance obtained using the scale factor selected using marginal likelihood optimization. It is remarkable that the relative error achieved by maximizing the marginal likelihood is close to the minimum achievable by an oracle who would have access to the true impulse response and thus could minimize the relative RMS error

we will only consider the so-called stable spline prior of order one (also known as the TC prior, see Example 5.3) and its extension known as DC prior. Possible extensions will be discussed, but not developed in full detail.

The most natural construction, inspired by smoothing spline ideas, is based on the following two observations:

1. Stability: the variance of $\theta_k$ should decay "sufficiently fast" (see Lemma 5.1), possibly exponentially, with the lag $k$. Assuming a zero-mean process, this can be expressed using a condition on second-order moments of the form:

$$\mathbb{E}\left[\theta_k^2\right] = \lambda_S \alpha^k \qquad k = 1,\ldots,n, \quad 0 < \alpha < 1. \tag{5.28}$$

For reasons that will become clear later on, imposing equality (as done above) rather than inequality constraints is convenient.

2. Smoothness: the difference between adjacent coefficients should be constrained, e.g., as measured by the relative variance,

$$\frac{\mathbb{E}\left[\left(\theta_{k-1}-\theta_k\right)^2\right]}{\mathbb{E}\left[\theta_{k-1}^2\right]} = \lambda_R \qquad k = 2,\ldots,n. \tag{5.29}$$

Using the stability constraint and redefining the constant λ*R*, condition (5.29) can be rewritten as

$$\mathbb{E}\left[\left(\theta_{k-1}-\theta_k\right)^2\right] = \lambda_R \alpha^{k-1} \qquad k = 2,\ldots,n. \tag{5.30}$$

The following theorem (whose proof is reported in Sect. 5.10.3) derives the class of maximum entropy priors under the constraints (5.28) and (5.29). Next, in Corollary 5.1 (whose proof is in Sect. 5.10.4), we will see that for special choices of λ*<sup>S</sup>* and λ*<sup>R</sup>* the well-known TC and DC priors [10, 52] are obtained.

**Theorem 5.5** *Let* $\{\theta_k\}_{k=1,\ldots,n}$ *be a zero-mean, absolutely continuous random vector with density* $p_\theta(\theta)$*, that satisfies the following constraints (with* $0 < \alpha < 1$*):*

$$\mathbb{E}\left[\theta_k^2\right] = \lambda_S \alpha^k \quad k = 1,\ldots,n, \qquad \mathbb{E}\left[\left(\theta_{k-1}-\theta_k\right)^2\right] = \lambda_R \alpha^{k-1} \quad k = 2,\ldots,n, \tag{5.31}$$

*with* $\lambda_S \in \mathbb{R}$ *and* $\lambda_R \in \mathbb{R}$ *such that*

$$
\lambda\_S (1 - \sqrt{\alpha})^2 < \lambda\_R < \lambda\_S (1 + \sqrt{\alpha})^2. \tag{5.32}
$$

*Then, the solution* $p_{\theta,ME}(\theta)$ *of the maximum entropy problem*

$$p_{\theta,ME} := \underset{p_\theta(\cdot)\ \text{s.t.}\ (5.31)}{\arg\max}\ -\mathbb{E}\log(p_\theta(\theta)) \tag{5.33}$$

*has the following form:*

$$p_{\theta,ME}(\theta) = C e^{-\frac{1}{2}\theta^T \Sigma^{-1}\theta}, \tag{5.34}$$

*where the matrix* $\Sigma^{-1}$ *has the band structure:*

$$
\Sigma^{-1} = \begin{bmatrix}
\ast & \ast & 0 & \dots & \dots & 0 \\
\ast & \ast & \ast & 0 & \dots & 0 \\
0 & \ast & \ast & \ast & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \dots & 0 & \ast & \ast & \ast \\
0 & \dots & \dots & 0 & \ast & \ast
\end{bmatrix}.
$$

*The maximum entropy process admits the backward representation*

$$
\theta_{k-1} = a_B \theta_k + w_k, \quad w_k \sim \mathcal{N}(0, \sigma_k^2), \quad k \in \{2,\ldots,n\}
$$

*with*

$$a\_B = \frac{\lambda\_S (1 + \alpha) - \lambda\_R}{2\lambda\_S \alpha},\tag{5.35}$$

$$
\sigma\_k^2 = \lambda\_S \alpha^{k-1} (1 - a\_B^2 \alpha),
\tag{5.36}
$$

*and terminal condition*

$$
\mathbb{E}\,\theta_n^2 = \lambda_S \alpha^n. \tag{5.37}
$$

*Last, the autocovariance of* θ*<sup>k</sup> satisfies the relation:*

$$
\mathbb{E}\,\theta_k \theta_h = \lambda_S\, a_B^{|k-h|}\, \alpha^{\max\{k,h\}}. \tag{5.38}
$$
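A numerical sanity check of Theorem 5.5 (not from the book): build $\Sigma$ from the autocovariance (5.38), verify that the moment constraints (5.31) hold, and confirm the tridiagonal band structure of $\Sigma^{-1}$. The value of $\lambda_R$ is an arbitrary choice inside the interval (5.32).

```python
import numpy as np

lam_S, lam_R, alpha, n = 1.0, 1.0, 0.8, 12      # lam_R inside the interval (5.32)
assert lam_S * (1 - np.sqrt(alpha)) ** 2 < lam_R < lam_S * (1 + np.sqrt(alpha)) ** 2

a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)        # Eq. (5.35)
k = np.arange(1, n + 1)
# autocovariance (5.38): E[theta_k theta_h] = lam_S * a_B**|k-h| * alpha**max(k,h)
Sigma = lam_S * a_B ** np.abs(np.subtract.outer(k, k)) * alpha ** np.maximum.outer(k, k)

# constraints (5.31) are met by construction
var = np.diag(Sigma)                                     # E[theta_k^2]
diff_var = var[:-1] + var[1:] - 2 * np.diag(Sigma, 1)    # E[(theta_k - theta_{k+1})^2]
ok_S = np.allclose(var, lam_S * alpha ** k)
ok_R = np.allclose(diff_var, lam_R * alpha ** k[:-1])

# the inverse covariance is tridiagonal, as stated in Theorem 5.5
Sinv = np.linalg.inv(Sigma)
mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > 1
off_band_max = np.abs(Sinv[mask]).max()
print(ok_S, ok_R, off_band_max)
```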

**Corollary 5.1** *Under the conditions of Theorem 5.5 and defining*

$$\rho := a\_B \sqrt{\alpha} = \frac{\lambda\_S (1 + \alpha) - \lambda\_R}{2 \lambda\_S \sqrt{\alpha}},\tag{5.39}$$

*the maximum entropy model in Theorem 5.5 corresponds to the so-called DC-kernel [10], i.e.,*

$$
\mathbb{E}\,\theta_k\theta_h = \lambda_S\, \rho^{|k-h|}\, \alpha^{\frac{k+h}{2}}. \tag{5.40}
$$

*In particular, for* λ*<sup>R</sup>* = λ*S*(1 − α)*, this reduces to the so-called TC kernel [10] with*

$$
\mathbb{E}\,\theta_k\theta_h = \lambda_S\, \alpha^{\max\{k,h\}}, \tag{5.41}
$$

*while for* λ*<sup>R</sup>* = λ*S*(1 + α), *we obtain the covariance of the "diagonal" kernel*

$$\mathbb{E}\,\theta_k\theta_h = \begin{cases} \lambda_S\,\alpha^k & k = h \\ 0 & k \neq h \end{cases}. \tag{5.42}$$
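The special cases of Corollary 5.1 are easy to verify numerically; the sketch below checks that the DC kernel (5.40) collapses to the TC kernel (5.41) for $\lambda_R = \lambda_S(1-\alpha)$ and to the diagonal kernel (5.42) for $\lambda_R = \lambda_S(1+\alpha)$. The hyperparameter values are illustrative.

```python
import numpy as np

lam_S, alpha, n = 1.0, 0.8, 10
k = np.arange(1, n + 1)

def dc_kernel(lam_R):
    # DC kernel (5.40) with rho as defined in (5.39)
    rho = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * np.sqrt(alpha))
    return lam_S * rho ** np.abs(np.subtract.outer(k, k)) \
                 * alpha ** (np.add.outer(k, k) / 2.0)

K_tc = lam_S * alpha ** np.maximum.outer(k, k)          # TC kernel (5.41)
K_di = np.diag(lam_S * alpha ** k.astype(float))        # DI kernel (5.42)

is_tc = np.allclose(dc_kernel(lam_S * (1 - alpha)), K_tc)
is_di = np.allclose(dc_kernel(lam_S * (1 + alpha)), K_di)
print(is_tc, is_di)
```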

**Remark 5.1** In the maximum entropy kernel derived in Theorem 5.5, which includes DC, TC and DI as special cases as stressed in Corollary 5.1, the constant $\lambda_S$ plays only the role of a scale factor while $\alpha$ is a "decay rate". Therefore, by fixing $\lambda_S = 1$ and $\alpha = 0.8$ we can study the behaviour as the "regularity" constant $\lambda_R$ varies in the interval $\lambda_S(1-\sqrt{\alpha})^2 = \lambda_{R,min} \le \lambda_R \le \lambda_{R,max} = \lambda_S(1+\sqrt{\alpha})^2$. This is entirely equivalent to studying the behaviour of the kernel as a function of the ratio $\lambda_R/\lambda_S$. We thus consider a grid of 9 possible values $\lambda_{R,min} = \lambda_{R,1} < \lambda_{R,2} < \cdots < \lambda_{R,9} = \lambda_{R,max}$. Then, Fig. 5.5 plots 5 sample realizations for each of these values, with panel (i) corresponding to the value $\lambda_{R,i}$. In particular, $\lambda_{R,4} = \lambda_S(1-\alpha)$ corresponds to the TC kernel and $\lambda_{R,6} = \lambda_S(1+\alpha)$ induces the DI kernel. For each realization

**Fig. 5.5** Sample realizations (solid) and best (least squares) exponential fit as a function of the kernel parameters. In all figures $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies, from its minimum value $\lambda_{R,min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6), with $\lambda_R = 2.6$, to the DI kernel

from the prior (solid line) also its best single-exponential fit is shown in order to highlight the "overall" decay rate which can be thought of as an envelope of the curves. In panel (1), with λ*<sup>R</sup>* taking the smallest possible value, hence imposing the "maximum" amount of regularity, all realizations are pure exponentials. In panel (9), with λ*<sup>R</sup>* taking its maximum value, all realizations are pure damped oscillations. In fact, in both cases, it can be checked that the corresponding kernel is singular.

#### **Degrees of Freedom of the DC Kernels**

Theorem 5.5 provides a class of kernels $K_\eta$ parametrized by the hyperparameter vector $\eta := [\lambda_S, \lambda_R, \alpha]$. In Fig. 5.5, we have illustrated how realizations from the prior change as a function of the regularity parameter $\lambda_R$ having fixed $\lambda_S = 1$ (or, equivalently, as a function of the ratio $\lambda_R/\lambda_S$). As discussed in Chap. 4, choosing the prior is equivalent to describing the model class. In the linear system identification context, this then defines a penalty function on impulse responses. A way to measure the "size" of the model class is to use the concept of equivalent degrees of freedom, introduced in the Bayesian context in Sect. 4.8. Unfortunately, the degrees of freedom are defined in terms of the output predictor sensitivity, and they thus require specifying not only the model class but also the experimental conditions under which the model is estimated. Only in limiting cases (such as an improper prior on finitely and linearly parametrized model classes) do the degrees of freedom become independent of the experiment and coincide with the number of parameters. In this section, we thus consider the prototypical setup in Eq. (5.18):

$$Y = \Phi\theta\_0 + E \quad \quad Y \in \mathbb{R}^N, \quad N = 1000, \quad \theta\_0 \in \mathbb{R}^n. \tag{5.43}$$

We recall that the matrix $\Phi$ is a Hankel matrix built with the input samples $\{u(t)\}$, so that $\Phi\theta_0$ implements the convolution of $u$ with $\theta_0$. The input $\{u(t)\}$ is now assumed to be zero-mean, unit-variance white noise. We also assume the noise $\{e(t)\}$ is zero-mean, unit-variance white noise. We consider two scenarios in which the order of the system (length $n$ of $\theta_0$) is assumed to be either $n = 30$ or $n = 100$. Exploiting the derivation in Chap. 4 (see Definition 4.2 and Proposition 4.3), the degrees of freedom $\mathrm{dof}(\eta)$, as a function of the hyperparameter vector $\eta$, are given by

$$\text{dof}(\eta) = \text{trace}\left(\Phi(\Phi^T \Phi + K\_{\eta}^{-1})^{-1} \Phi^T\right).$$

Assuming also here that λ*<sup>S</sup>* = 1, we study how dof(η) varies as a function of λ*<sup>R</sup>* for three different values of α (0.6, 0.8, and 0.95). The behaviour is illustrated in Fig. 5.6 where it is apparent that the maximum is achieved for the DI kernel, and the minimum (a bit smaller than 1) is attained at the extremum points, where the kernel has rank exactly equal to 1. It is interesting to observe the intertwining between the value of α (that controls the decay rate) and the length of the FIR model *n*. As the coefficient vector θ<sup>0</sup> changes from length *n* = 30 (left) to *n* = 100 (right) the effective "size" of the model doesn't change much for α = 0.6 and α = 0.8, while it does increase when α = 0.95. This confirms the fact that the kernel, for α fixed, effectively controls the model complexity so that the estimator becomes insensitive to the chosen length, provided *n* is "big enough" w.r.t. α. In particular *n* = 15 would be sufficient for α = 0.6, *n* = 30 for α = 0.8 while for α = 0.95 the effective size is about *n* = 100.
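The degrees-of-freedom computation can be sketched as follows; the setup mirrors (5.43) with a white-noise input, but the seed and the evaluated values of $\lambda_R$ (the TC and DI special cases) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 1000, 30
lam_S, alpha = 1.0, 0.8
k = np.arange(1, n + 1)

# white-noise input and the convolution (regression) matrix of (5.43)
u = rng.normal(size=N + n)
Phi = np.array([[u[t - j + n - 1] for j in range(1, n + 1)] for t in range(1, N + 1)])

def dof(lam_R):
    rho = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * np.sqrt(alpha))
    K = lam_S * rho ** np.abs(np.subtract.outer(k, k)) \
              * alpha ** (np.add.outer(k, k) / 2.0)
    # dof = trace(Phi (Phi^T Phi + K^{-1})^{-1} Phi^T), using trace cyclicity
    return np.trace(np.linalg.solve(Phi.T @ Phi + np.linalg.inv(K), Phi.T @ Phi))

dof_tc = dof(lam_S * (1 - alpha))       # TC special case
dof_di = dof(lam_S * (1 + alpha))       # DI special case
print(dof_tc, dof_di)                   # both between 0 and n
```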

#### **Extension to Smoothness Conditions on Filtered Versions**

So far, we limited our attention to so-called "first-order" stable splines, which are derived by imposing conditions on first-order differences, leading to first-order, i.e., AR(1), realizations. Of course, these constructions can be generalized by replacing (5.31) with a higher-order constraint of the form

$$\mathbb{E}\,\|\theta_k\|^2 \le \lambda_S \alpha^k, \qquad \mathbb{E}\,\Big\|\theta_k - \sum_{i=1}^p a_i \theta_{k+i}\Big\|^2 \le \lambda_R \alpha^k. \tag{5.44}$$

While the first constraint is a "standard" stability condition, the second can be interpreted as a filtered frequency-domain smoothness condition. In fact, defining the filter $F(q) := 1 - \sum_{i=1}^p a_i q^i$, let us denote by $\theta_k^F$ the sequence obtained by filtering $\theta_k$ with $F(q)$. The condition

**Fig. 5.6** Effective degrees of freedom of the DC kernel as a function of λ*<sup>R</sup>* (λ*<sup>S</sup>* = 1) for model (5.43); *n* = 30 (left), *n* = 100 (right). From top to bottom: α = 0.6, α = 0.8 and α = 0.95

$$\mathbb{E}\,\|\theta_k^F\|^2 = \mathbb{E}\,\Big\|\theta_k - \sum_{i=1}^p a_i\theta_{k+i}\Big\|^2 \le \lambda_R \alpha^k$$

implies that $\theta_k^F$ should decay "fast" enough (in mean square) and thus

$$\mathbb{E}\sum_{k=0}^{\infty} k^{2m}\|\theta_k^F\|^2$$

should be small for any integer *m*. As a consequence, if

$$G(e^{j\omega}) := \sum_{k=1}^{\infty}\theta_k e^{-j\omega k},$$

using Parseval's theorem,

$$\mathbb{E}\int_0^{2\pi}\left\|F(e^{j\omega})\,\frac{d^{(m)}G(e^{j\omega})}{d\omega^{(m)}}\right\|^2 d\omega$$

should be small as well, implying that $\theta_k$ should concentrate most of its energy (variance) in frequency bands where the absolute value of the filter $F(e^{j\omega})$ is small.

We regard developments of this type as, in principle, a straightforward extension of the basic ideas used in this chapter to obtain DC kernels. In particular, the choice of the coefficients $a_i$ in (5.44) is a design issue, which can be guided by prior knowledge on the candidate models; its underlying principles and ideas are the same as those illustrated above. There are, however, additional complications due to the richer structure of the constraints, which might entail non-trivial issues in deriving an analytic expression of the kernel.

## **5.6 Regularization and Basis Expansion**

The $\ell_2$ (ridge regression) regularized estimators that have been discussed in this chapter can also be framed in the context of basis expansion, using the so-called Karhunen–Loève decomposition of the random process $\theta$. For the sake of exposition, we will now consider the finite-dimensional case, i.e., we will study FIR models of length $n$ of the form (5.14). The extension to the infinite-dimensional case will be discussed in the framework of Reproducing Kernel Hilbert Spaces illustrated in Chap. 6. Under this finite-dimensional assumption, we consider the covariance matrix $\mathbf{K} \in \mathbb{R}^{n\times n}$ whose entries satisfy $[\mathbf{K}]_{(t,s)} := K(t,s) = \mathrm{cov}(\theta_t, \theta_s)$. The matrix $\mathbf{K}$ can be written in terms of its spectral decomposition (Singular Value Decomposition) in the form:


$$\mathbf{K} = U S U^T = \sum\_{i=1}^{n} \xi\_i u\_i u\_i^T \quad u\_i \in \mathbb{R}^n \quad \|u\_i\| = 1 \quad u\_i \perp u\_j \quad \forall i \neq j,\qquad (5.45)$$

where

$$U := [u_1, \ldots, u_n], \qquad S := \mathrm{diag}\{\xi_1,\ldots,\xi_n\}.$$

The set of vectors $u_i \in \mathbb{R}^n$ provides an orthonormal basis of $\mathbb{R}^n$, so that any impulse response $\theta \in \mathbb{R}^n$ can be written using the orthonormal basis expansion

$$
\theta = \sum_{i=1}^{n} u_i \beta_i \qquad \beta_i := \langle \theta, u_i \rangle, \tag{5.46}
$$

where the coefficients $\beta_i = \langle\theta, u_i\rangle = u_i^T\theta$ are therefore zero-mean random variables with covariances

$$\mathbb{E}\,\beta_i\beta_j = \mathbb{E}\,u_i^T\theta\theta^T u_j = u_i^T \mathbf{K}\, u_j = \xi_i\,\delta_{ij}.$$

Clearly, the argument above can be reversed. Namely, starting from an orthonormal basis $u_i$, $i = 1,\ldots,n$, the random basis expansion

$$\theta = \sum_{i=1}^{n} u_i \beta_i, \qquad \beta_i \sim \mathcal{N}(0, \xi_i), \quad (\beta_1,\ldots,\beta_n)\ \text{independent} \tag{5.47}$$

induces a probability description of the candidate θ's which turns out to be zero mean and with covariance matrix as in (5.45). This interpretation provides a clear link between "standard" models described in terms of basis expansions, regularization and the Bayesian view.
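The correspondence between (5.45) and (5.47) can be illustrated by sampling: generate independent coefficients $\beta_i$ with variances $\xi_i$, form $\theta$ via the basis expansion, and check that the sample covariance approaches $\mathbf{K}$. The TC kernel is used here purely as an example.

```python
import numpy as np

lam, alpha, n = 1.0, 0.8, 30
t = np.arange(1, n + 1)
K = lam * alpha ** np.maximum.outer(t, t)      # TC covariance as a concrete example

xi, U = np.linalg.eigh(K)                      # K = U diag(xi) U^T
xi, U = xi[::-1], U[:, ::-1]                   # sort spectrum in decreasing order

# random basis expansion (5.47): theta = sum_i u_i beta_i, beta_i ~ N(0, xi_i)
rng = np.random.default_rng(4)
m = 20000
beta = rng.normal(0.0, np.sqrt(np.clip(xi, 0.0, None)), size=(m, n))
theta = beta @ U.T                             # m realizations of theta

# the sample covariance of theta approaches K as m grows
cov_err = np.max(np.abs(theta.T @ theta / m - K))
print(cov_err)
```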

**Remark 5.2** (*Low-Rank Kernel Approximation*) The spectral decomposition of the kernel (5.45) suggests also that, when some singular values ξ*<sup>i</sup>* are "very small", it can be easily approximated by a low-rank matrix

$$\mathbf{K} = \sum_{i=1}^{n}\xi_i u_i u_i^T \simeq \sum_{i=1}^{\hat{n}}\xi_i u_i u_i^T \qquad \hat{n} \le n.$$

This is equivalent to approximating with zero all singular values $\xi_i$ below a certain threshold. The threshold can be chosen by a standard SVD-truncation criterion, e.g., neglecting singular values below a certain fraction of the largest singular value $\xi_1$, i.e., those that satisfy

$$
\xi_i < \frac{\xi_1}{R}.
$$

In Fig. 5.7, the value *R* = 20 has been chosen to plot the most relevant eigenfunctions. Low-rank kernel approximation can also be exploited to reduce the computational burden in computing the solutions.
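A sketch of the truncation rule of Remark 5.2 (with illustrative hyperparameters): the spectral-norm error of the rank-$\hat{n}$ approximation equals the largest discarded singular value, which by construction is below $\xi_1/R$.

```python
import numpy as np

lam, alpha, n, R = 1.0, 0.8, 50, 20
t = np.arange(1, n + 1)
K = lam * alpha ** np.maximum.outer(t, t)      # TC covariance as an example

xi, U = np.linalg.eigh(K)
xi, U = xi[::-1], U[:, ::-1]                   # decreasing singular values

keep = xi > xi[0] / R                          # truncation rule of Remark 5.2
n_hat = int(np.sum(keep))
K_lowrank = (U[:, :n_hat] * xi[:n_hat]) @ U[:, :n_hat].T

err = np.linalg.norm(K - K_lowrank, 2)         # spectral-norm approximation error
print(n_hat, err, xi[0] / R)                   # err is the largest discarded xi
```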

**Fig. 5.7** First $\hat{n}$ eigenfunctions of the DC kernel. To enhance clarity, $\hat{n}$ is chosen for each combination of the parameters as $\hat{n} = \arg\max_i i$ s.t. $\sigma_i^2 > \sigma_1^2/20$ (see Remark 5.2). In all figures, $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies, from its minimum value $\lambda_{R,min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6), with $\lambda_R = 2.6$, to the DI kernel

Figure 5.7 shows the eigenfunctions of the DC kernel for different choices of the hyperparameters. As already studied in the previous section, the "complexity" of the kernel, measured e.g., by the degrees of freedom as illustrated in Fig. 5.6, varies as the hyperparameters change. In the context of basis expansions, this is clear from Fig. 5.8, where the singular values of the kernel, i.e., the variances of the basis expansion coefficients $\beta_i$ introduced in (5.47), vary as the hyperparameters change. For instance, when $\lambda_R = \lambda_{R,min}$, see panel (1), and $\lambda_R = \lambda_{R,max}$, see panel (9), the kernel has rank 1. Instead, the singular values decay more slowly for the DI kernel, see panel (6), which also has the largest number of degrees of freedom, see Fig. 5.6.

Even if this section is devoted to finite impulse response models (i.e., $n$ finite, and therefore BIBO stable systems), it still makes sense to discuss what happens to the coefficients $\theta_n$ when $n$ becomes "large" and its relation with BIBO stability. In Lemma 5.1, we have seen that a sufficient condition for a.s. BIBO stability of realizations from the Gaussian prior is that the diagonal elements of $K$ satisfy the summability condition

**Fig. 5.8** First 10 singular values of the DC kernel. In all figures, $\alpha = 0.8$ and $\lambda_S = 1$. The regularity parameter $\lambda_R$ varies, from its minimum value $\lambda_{R,min} = \lambda_S(1-\sqrt{\alpha})^2 \simeq 0.011$ in panel (1) to the maximum value $\lambda_{R,max} = \lambda_S(1+\sqrt{\alpha})^2 \simeq 3.589$ in panel (9). Panel (4), with $\lambda_R = 0.2$, corresponds to the TC kernel; panel (6), with $\lambda_R = 2.6$, to the DI kernel

$$\sum\_{t=1}^{\infty} K(t, t)^{1/2} < \infty$$

which requires a "sufficiently fast" decay rate of the diagonal *K*(*t*, *t*). A quite natural question concerns how the behaviour of *K*(*t*, *t*) reflects on the basis vectors *ui* . The following lemma, whose proof is in Sect. 5.10.5, gives the answer.

**Lemma 5.2** *The basis vectors $u_i$ introduced in* (5.45)*, whose $t$th elements are denoted by $u_{it}$, satisfy the inequality*

$$|u\_{it}| \le \frac{1}{\xi\_i} C [\mathbf{K}]\_{t,t}^{1/2}, \quad C := \sum\_{t=1}^n [\mathbf{K}]\_{t,t}. \tag{5.48}$$

*Condition* (5.48) *holds also in the infinite dimensional case, i.e., as n* → ∞*, provided K*(*t*,*s*) *admits the spectral decomposition*

$$K(t,s) = \sum\_{i=1}^{\infty} \xi\_i u\_{it} u\_{is},$$

*where the $u_i$ are orthonormal sequences in $\ell_2$ and the condition $\sum_{t=1}^{\infty} K(t,t) = C < \infty$ is satisfied.*

While this result is essentially trivial for $n$ finite, it becomes important when $n \to \infty$, since it provides a condition on the tail behaviour of the eigenvectors (eigenfunctions). For instance, if the diagonal entries (variances) $K(t,t)$ of the kernel decay exponentially fast as a function of $t$, so do the $u_{it}$. The decay of the eigenfunctions can be visually inspected in Fig. 5.7.

## **5.7 Hankel Nuclear Norm Regularization**

As discussed above, regularization can be used to enforce smoothness and stability of impulse responses. Yet this is just one way, and possibly not the most common in the field of dynamical systems, to control the "complexity" of model classes.

For instance, in the parametric approach to system identification, the complexity can be measured by the dimension of a minimal state-space realization of the unknown system. For ease of exposition, let us now only consider the single-input single-output output error case (i.e., $H(z) = 1$). In this case, the number of free parameters is $2n + 1$, where $n$ is the degree of the denominator of the transfer function $G(z,\theta)$; this also equals the dimension $n$ of a minimal state-space realization of $G(z,\theta)$, which is called the McMillan degree of $G(z,\theta)$, as seen in Sect. 2.2.1.1. To fix notation, let us introduce a minimal state-space realization of $G(z,\theta)$

$$\begin{aligned} x_{t+1} &= Ax_t + Bu_t \qquad x_t \in \mathbb{R}^n, \\ y_t &= Cx_t \end{aligned} \tag{5.49}$$

which is such that $G(z,\theta) = C(zI - A)^{-1}B$. If $\{g(k,\theta)\}_{k\in\mathbb{N}}$ is the impulse response sequence, parametrized by $\theta$, then one has $g(k,\theta) = CA^{k-1}B$ $\forall k > 0$.

It is well known from realization theory that the McMillan degree has a close connection with the so-called *Hankel* matrix formed with the impulse response coefficients, i.e.,

$$\mathcal{H}_{r,c}^{g}(\theta) := \begin{bmatrix} g(1,\theta) & g(2,\theta) & g(3,\theta) & \dots & g(c,\theta) \\ g(2,\theta) & g(3,\theta) & g(4,\theta) & \dots & g(c+1,\theta) \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ g(r,\theta) & g(r+1,\theta) & g(r+2,\theta) & \dots & g(r+c-1,\theta) \end{bmatrix} \tag{5.50}$$

with *r* block rows and *c* block columns. The following lemma holds.

**Lemma 5.3** (based on [65]) *The linear time-invariant system with impulse response $\{g(k,\theta)\}_{k\in\mathbb{N}}$ admits a minimal state-space realization of order n (i.e., has McMillan degree equal to n) if and only if, for some choice of r, c, the following holds:*

$$n = \operatorname{rank}\{\mathcal{H}_{r,c}(\theta)\} = \operatorname{rank}\{\mathcal{H}_{r+j,c+i}(\theta)\} \quad \forall\, i, j \in \mathbb{N}. \tag{5.51}$$

In practice, only a finite number of impulse response (Markov) parameters $g(k,\theta)$, $k = 1,\dots,p$, is available, and the problem of finding a state-space model of the form (5.49) such that $g(k,\theta) = CA^{k-1}B$ for all $k = 1,\dots,p$ is known as the *partial realization problem*.
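Lemma 5.3 is easy to probe numerically. The sketch below (our own illustrative NumPy code; the second-order example system is an assumption, not taken from the text) builds the Hankel matrix (5.50) from the Markov parameters of a minimal state-space model and checks that its rank recovers the McMillan degree.

```python
import numpy as np

# Example minimal second-order system (n = 2): x_{t+1} = A x_t + B u_t, y_t = C x_t.
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 0.0]])

# Markov parameters g(k) = C A^{k-1} B, k = 1, ..., r + c - 1.
r, c = 5, 5
g = [(C @ np.linalg.matrix_power(A, k - 1) @ B).item() for k in range(1, r + c)]

# Hankel matrix H[i, j] = g(i + j + 1), as in (5.50) (0-based indices here).
H = np.array([[g[i + j] for j in range(c)] for i in range(r)])

# By Lemma 5.3, rank(H) recovers the McMillan degree n = 2.
print(np.linalg.matrix_rank(H))  # 2
```

Increasing `r` and `c` leaves the rank unchanged, which is the stabilization property stated in (5.51).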

This shows that, indeed, a notion of "complexity" can be attached to the dimension $n$ of a minimal state-space realization (5.49); therefore the rank of the Hankel matrix $\mathcal{H}_{r,c}(\theta)$ can be considered as a candidate for performing regularization. This leads to the choice of a penalty given by

$$J_{\mathcal{H},\gamma}(\theta) := \gamma\, \operatorname{rank}\{\mathcal{H}_{r,c}(\theta)\} \tag{5.52}$$

for suitable values of the integers $c$, $r$. Unfortunately, similarly to what happens for the $\ell_0$ quasi-norm $\|x\|_0$ (defined as the number of non-zero entries in the vector $x$) discussed in Sect. 3.6.2.1, the rank functional is not convex; as a result, solving optimization problems involving penalties of the form (5.52) is problematic. The very same issue arises in a variety of rank-constrained optimization problems.

As seen in Chap. 3, to overcome this limitation, researchers, inspired by work on $\ell_1$ regularization, have suggested using the *nuclear norm* $\|A\|_*$ of a matrix $A \in \mathbb{R}^{m\times n}$, defined as

$$\|A\|_* := \operatorname{trace}\left(\sqrt{A^T A}\right) = \sum_i \sigma_i(A), \tag{5.53}$$

where σ*i*(*A*) denotes the *i*th singular value of the matrix *A*, as a surrogate for the rank of the matrix *A*. The nuclear norm is also known as Ky–Fan *n*-norm or trace norm. This choice is motivated by the following lemma.
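As a quick numerical sanity check of (5.53), the following sketch (our own NumPy code; the random matrix is purely illustrative) evaluates the nuclear norm in three equivalent ways: as the sum of singular values, as $\operatorname{trace}(\sqrt{A^TA})$, and via NumPy's built-in `'nuc'` matrix norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# Nuclear norm as the sum of singular values, as in (5.53).
nuc_svd = np.linalg.svd(A, compute_uv=False).sum()

# Equivalent form trace(sqrt(A^T A)) via the eigendecomposition of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)
nuc_trace = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

print(np.isclose(nuc_svd, nuc_trace))                  # True
print(np.isclose(nuc_svd, np.linalg.norm(A, 'nuc')))   # True
```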

**Lemma 5.4** (based on [20]) *Given a matrix $A \in \mathbb{R}^{m\times n}$, the nuclear norm of A is the convex envelope of the rank function on the set $\mathcal{A} := \{A \in \mathbb{R}^{m\times n} : \|A\| \le 1\}$.*

These considerations have led to a whole class of regularization methods which build upon the nuclear norm of the Hankel matrix

$$J_{\mathcal{H},\gamma}(\theta) := \gamma \|\mathcal{H}_{r,c}(\theta)\|_*$$

as a possible regularizer. Also several extensions have been considered, including weighted versions of the form

$$J_{\mathcal{H},\gamma}(\theta) := \gamma \| W_r\, \mathcal{H}_{r,c}(\theta)\, W_c \|_*,$$

where $W_c$ and $W_r$ are, respectively, "column" and "row" weightings. The latter can possibly be adapted iteratively, in the framework of iteratively reweighted methods such as those commonly used in conjunction with $\ell_1$ and/or $\ell_2$ reweighted schemes, see e.g., [72].

The Hankel norm regularizer can also be studied from a Bayesian perspective, considering the prior

$$p_{\mathcal{H},\gamma}(\theta) \propto \exp\left(-\gamma\, \|\mathcal{H}_{r,c}(\theta)\|_*\right) \propto \exp\left(-\gamma \sum_i \sigma_i(\mathcal{H}_{r,c}(\theta))\right). \tag{5.54}$$

To gain some intuition on the structure of this prior, let $g(k,\theta) = \theta_k$ and consider the following modified prior, which penalizes the nuclear norm of the squared Hankel matrix, i.e.,

$$\tilde{p}_{\mathcal{H},\gamma}(\theta) \propto \exp\left(-\gamma \|\mathcal{H}_{r,c}(\theta)\, \mathcal{H}_{r,c}(\theta)^T\|_*\right) \propto \exp\left(-\gamma \sum_i \sigma_i\!\left(\mathcal{H}_{r,c}(\theta)\, \mathcal{H}_{r,c}(\theta)^T\right)\right). \tag{5.55}$$

The reason for introducing $\tilde{p}$ is twofold. The first is that the prior (5.55) is equivalent to assuming that the entries $\theta_k$ of the impulse response are independent zero-mean Gaussians, as formalized in the following proposition.

**Proposition 5.1** (based on [53]) *Let $\tilde{p}_{\mathcal{H},\gamma}(\theta)$ be as in* (5.55) *and let $\theta \in \mathbb{R}^m \sim \tilde{p}_{\mathcal{H},\gamma}(\theta)$, where $\mathcal{H}_{p,p}(\theta)$ is its $p\times p$ Hankel matrix (with $m = 2p-1$). Then the $\theta_k$'s are zero mean, independent and Gaussian. In particular:*

$$\theta\_k \sim \begin{cases} \mathcal{N}\left(0, \frac{1}{2\gamma k}\right) & \text{if } 1 \leqslant k \leqslant \frac{m+1}{2} \\ \mathcal{N}\left(0, \frac{1}{2\gamma(m-k+1)}\right) & \text{if } \frac{m+1}{2} < k \leqslant m \end{cases} \tag{5.56}$$

As illustrated in Fig. 5.9, from (5.56) one sees that the variance of $\theta_k$ *is not* decaying with the lag $k$; hence the prior $\tilde{p}_{\mathcal{H},\gamma}(\theta)$ does not induce a BIBO-stable hypothesis space.
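The variance profile (5.56) can be inspected directly. The helper below is our own illustrative code (the function name and the choice $m = 79$, $\gamma = 1$ are assumptions): it evaluates (5.56) and shows that the variances are symmetric around the middle lag and return to their initial value at the last lag, i.e., they do not decay with $k$.

```python
import numpy as np

def hankel_sq_prior_vars(m: int, gamma: float) -> np.ndarray:
    """Variances of theta_k under the prior (5.56), k = 1, ..., m (m odd)."""
    k = np.arange(1, m + 1)
    half = (m + 1) // 2
    return np.where(k <= half,
                    1.0 / (2 * gamma * k),
                    1.0 / (2 * gamma * (m - k + 1)))

v = hankel_sq_prior_vars(79, gamma=1.0)
# Symmetric profile, minimal at the middle lag, back to 1/(2*gamma) at k = m:
print(v[0], v[39], v[-1])   # 0.5 0.0125 0.5
```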

Second, the prior $\tilde{p}_{\mathcal{H},\gamma}(\theta)$ can be used as a proposal distribution for an MCMC scheme, as introduced in Sect. 4.10, to sample from the Hankel prior $p_{\mathcal{H},\gamma}(\theta)$ in (5.54) with $g(k,\theta) = \theta_k$. Samples from $p_{\mathcal{H},\gamma}(\theta)$ can then be used to approximate the variances $\mathrm{Var}\{\theta_k\}$ and the correlations $\mathrm{Corr}\{\theta_k, \theta_h\}$. These are shown in Fig. 5.9. In particular, the solid line in the left panel shows $\mathrm{Var}\{\theta_k\}$ as a function of $k$, while the right panel shows $\mathrm{Corr}\{\theta_k, \theta_{k+h}\}$ as a function of $h$ for $k$ fixed to 50. It is clear that, even though under $p_{\mathcal{H},\gamma}(\theta)$ the $\theta_k$'s are not Gaussian, the variances resemble those of $\tilde{p}_{\mathcal{H},\gamma}(\theta)$ (left panel, dashed line), and their correlations resemble those of independent variables. For the sake of comparison, the left panel also plots the profiles of the impulse response coefficients' variances using the TC prior for two different decay rates (dashdot lines).

**Fig. 5.9 Prior induced by the Hankel nuclear norm**: the impulse response coefficients are contained in the vector $\theta \in \mathbb{R}^{79}$, modelled as a random vector with probability density function $p_{\mathcal{H},\gamma}(\theta) \propto \exp(-\|\mathcal{H}_{40,40}(\theta)\|_*)$. *Left*: variances of the impulse response coefficients $\theta_k$ reconstructed by MCMC (solid line) and approximated using the prior (5.56) (dashed line). The figure also displays the variances of $\theta_k$ when $\theta$ is a Gaussian random vector with stable spline (TC) covariance (5.41) for two different values of $\alpha$ (dashdot lines). All the profiles are rescaled so that they share the same initial value. *Right*: 40th row of the matrix containing the correlation coefficients returned by the MATLAB command `corrcoef(M)`, where each column of the $79\times 10^6$ matrix $M$ contains one MCMC realization of $\theta$ under the Hankel prior $p_{\mathcal{H},\gamma}(\theta)$. The adopted MCMC scheme was a random walk Metropolis with increments proportional to the variances (5.56) divided by a factor of 4

These observations suggest that, while the nuclear norm regularization (prior) accounts for system-theoretic notions of model complexity as defined by the McMillan degree, it fails to include decay rate and smoothness constraints. One would expect, therefore, that Hankel regularization alone may not give satisfactory results, as it is not able to properly bound the candidate set of models. It turns out that the maximum entropy framework discussed in Sect. 5.5.1 can be used to build prior distributions which account for stability and smoothness as well as "complexity". The following theorem (whose proof is given in Sect. 5.10.6) gives the structure of the MaxEnt prior under a simple "TC"-like condition on the stability-smoothness constraint.

**Theorem 5.6** *Let $\{\theta_k\}_{k=1,\dots,m}$ be a zero-mean, absolutely continuous random vector with density $p_\theta(\theta)$, which satisfies the following constraints:*

$$\begin{aligned} \mathcal{E}\left[\theta_m^2\right] &\le \sigma^2 \alpha^{m-1} \\ \mathcal{E}\left[(\theta_{k-1} - \theta_k)^2\right] &\le \sigma^2 \alpha^{k-2}(1-\alpha) \quad k = 2,\dots,m \\ \mathcal{E}\,\|\mathcal{H}_{r,c}(\theta)\|_* &\le h. \end{aligned} \tag{5.57}$$

*Then, the solution* pθ ,*MEH* (θ ) *of the maximum entropy problem*

$$p_{\theta,MEH} := \underset{p(\cdot)}{\arg\max}\; -\mathcal{E}\log(p_\theta(\theta)) \tag{5.58}$$

*has the following form:*

$$p_{\theta,MEH}(\theta) \propto e^{-\mu_H \|\mathcal{H}_{r,c}(\theta)\|_*} \left[\prod_{k=2}^{m} e^{-\frac{1}{2}\mu_{k-1}(\theta_{k-1}-\theta_k)^2}\right] e^{-\frac{1}{2}\mu_m \theta_m^2}, \tag{5.59}$$

*where the Lagrange multipliers* μ*<sup>H</sup>* , μ1,...,μ*<sup>m</sup> are determined so that the constraints* (5.57) *are satisfied.*<sup>1</sup>
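For concreteness, the unnormalized log-density of (5.59) can be evaluated as follows (an illustrative NumPy sketch; the function name, the multiplier values and the test vector are our own assumptions).

```python
import numpy as np

def log_prior_meh(theta: np.ndarray, mu_H: float, mu: np.ndarray,
                  r: int, c: int) -> float:
    """Unnormalized log of the MaxEnt prior (5.59).

    mu holds the multipliers mu_1, ..., mu_m: mu_1 ... mu_{m-1} weight the
    squared increments, mu_m weights theta_m^2."""
    m = theta.size
    # Hankel matrix of theta (here g(k, theta) = theta_k), as in (5.50).
    H = np.array([[theta[i + j] for j in range(c)] for i in range(r)])
    nuc = np.linalg.svd(H, compute_uv=False).sum()
    incr = np.sum(mu[:m - 1] * np.diff(theta) ** 2)
    return -mu_H * nuc - 0.5 * incr - 0.5 * mu[m - 1] * theta[m - 1] ** 2

theta = 0.8 ** np.arange(1, 8)                     # m = 7, a decaying response
val = log_prior_meh(theta, 1.0, np.ones(7), r=4, c=4)
print(val < 0)   # True: all three terms are penalties, so the log-density is negative
```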

The Hankel nuclear norm discussed in this chapter is only one possible way to favour "simple" models (in the sense of having small McMillan degree). Indeed, it is by no means trivial to use priors of the form (5.59), which involve nuclear norm terms, in conjunction with marginal likelihood optimization to estimate hyperparameters. Several variations are possible and, indeed, matricial reweighting schemes such as those used in [55] can be employed in a Bayesian context, leading to iteratively reweighted schemes reminiscent of $\ell_1/\ell_2$ reweighting [72].

## **5.8 Historical Overview**

The framework discussed in this chapter has a long history that can be traced back, by and large outside the control community, to the early 1970s. In this section, we review these developments and point to similarities and differences with the theory developed in this chapter.

## *5.8.1 The Distributed Lag Estimator: Prior Means and Smoothing*

To the best of our knowledge, Bayesian methods for estimating dynamical systems were first advocated in the early 1970s in the econometrics literature for FIR models of the form (5.14), which were referred to as *distributed lag* models. The length $n$ of the FIR model was actually left unspecified, and possibly allowed to go to infinity.

In particular, [40, 62] were the first to talk about (and apply) Bayesian methods for system identification, arguing that "rigid parametric" structures may be inadequate,

<sup>1</sup> Using the complementary slackness conditions it follows that a multiplier may be nonzero only if the corresponding inequality in (5.57) holds with an equality sign.

extending arguments which can be found in [66] for "static" linear regression models to the "dynamical" systems scenario. In the paper [40], having in mind that modes of linear time-invariant systems have an exponentially decaying behaviour of the type $\alpha^t$, it was suggested to describe the unknown impulse response $\theta$ with a process having an exponentially decaying prior mean

$$\{m\_t\}\_{t\in\mathbb{N}}\qquad m\_t := \lambda \alpha^t \quad |\alpha| < 1. \tag{5.60}$$

Other possible response patterns had been considered, such as the hump, composed of the response build-up, the maximum and its decay, see [40] for details and alternative patterns. The covariance function *K*(*t*,*s*) in [40] was taken so that the ratio

$$\frac{std(\theta\_t)}{m\_t}$$

remains constant over time $t$. This was called the "proportionality principle", and it can be achieved with the choice

$$K(t,s) = \text{cov}(\theta\_t, \theta\_s) := \nu w\_{ts} \alpha^{t+s-2} \quad |w\_{ts}| \le 1 \tag{5.61}$$

so that the normalized standard deviation

$$\frac{\mathrm{std}(\theta_t)}{m_t} = \frac{\sqrt{K(t,t)}}{m_t} = \frac{\sqrt{\nu w_{tt}\,\alpha^{2t-2}}}{\lambda \alpha^t} = \frac{\sqrt{\nu w_{tt}\,\alpha^{-2}}}{\lambda}$$

is indeed constant if $w_{tt}$ is so. This would imply that prior credible intervals have constant relative size w.r.t. their means, see p. 1065 of [40].

The choice (5.61) left the coefficients $w_{ts}$ unspecified and, indeed, in [40] it was emphasized that *"the selection of the values of the set of $w_{ij}$ still remains a relatively difficult task"*; one suggestion, inspired by work on smoothing [34], has been to take

$$w_{ij} = w^{|i-j|} \quad 0 < w < 1 \tag{5.62}$$

leading to

$$K\_{ij} = \nu \alpha^{i+j-2} w^{|i-j|},\tag{5.63}$$

which is exactly the DC kernel introduced in Corollary 5.1. It is also interesting to observe that [40] already suggested the use of marginal likelihood to choose the most suitable prior distribution in the class.
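The covariance (5.63) is straightforward to construct and check numerically. The sketch below (our own code; the parameter values are illustrative) builds the DC kernel and verifies that it is a valid covariance whose variances decay as $\alpha^{2t-2}$, consistent with the proportionality principle.

```python
import numpy as np

def dc_kernel(n: int, nu: float, alpha: float, w: float) -> np.ndarray:
    """DC kernel K_ij = nu * alpha^(i+j-2) * w^|i-j| of (5.63), i, j = 1..n."""
    i = np.arange(1, n + 1)
    return (nu * alpha ** (i[:, None] + i[None, :] - 2)
            * w ** np.abs(i[:, None] - i[None, :]))

K = dc_kernel(50, nu=1.0, alpha=0.9, w=0.5)
# A valid covariance: symmetric and positive semidefinite.
print(np.allclose(K, K.T))                         # True
print(np.linalg.eigvalsh(K).min() >= -1e-10)       # True
# Prior variance of theta_10 is alpha^(2*10-2) = 0.9^18.
print(np.isclose(K[9, 9], 0.9 ** 18))              # True
```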

Of course, postulating a prior mean $m$ introduces a remarkable prejudice into the estimation procedure and requires quite accurate knowledge of the expected $\theta$. The paper [62], inspired by "smoothing priors" arguments, suggested instead that the prior mean should be zero, and only smoothness conditions on the lags should be enforced; this leads to a zero-mean prior, i.e., $\lambda = 0$ in (5.60), with a $d$th-degree smoothing covariance. For instance, for $d = 2$, the prior model can be expressed in

terms of the second-order differences:

$$\beta := \underbrace{\begin{bmatrix} 1 & -2 & 1 & 0 & \dots & 0 \\ 0 & 1 & -2 & 1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & 1 & -2 & 1 \end{bmatrix}}_{:=S} \theta = S\theta,$$

postulating $\mathcal{E}\,\beta\beta^T = S\,\mathcal{E}\,\theta\theta^T S^T = I$.
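A minimal sketch of the second-order difference operator (our own construction, with $n = 6$ chosen for illustration): affine impulse responses have vanishing second differences and therefore receive no penalty, which is exactly the smoothness this prior encodes.

```python
import numpy as np

def second_difference(n: int) -> np.ndarray:
    """(n-2) x n second-order difference operator S with rows [1, -2, 1]."""
    S = np.zeros((n - 2, n))
    for k in range(n - 2):
        S[k, k:k + 3] = [1.0, -2.0, 1.0]
    return S

n = 6
S = second_difference(n)
theta = np.arange(1.0, n + 1.0)        # an affine (hence "smooth") sequence
# Second differences of an affine sequence vanish, so it incurs no penalty.
print(np.allclose(S @ theta, 0.0))     # True
```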

It is clear from Fig. 5.10 that this prior guarantees smoothness in the time domain (and therefore low-pass behaviour in the frequency domain) but offers no guarantee of stability.

## *5.8.2 Frequency-Domain Smoothing and Stability*

The "time-domain" smoothing discussed in the previous section was criticized by Akaike [1], who posed the question whether time-domain smoothness conditions would "be the most natural ones". Akaike suggested that smoothness should instead be enforced in the frequency domain, i.e., considering the frequency response

$$G(e^{j\omega}) := \sum_{k=1}^{n} \theta_k e^{-j\omega k}.$$

To this purpose, the $L_2$-norm of the first derivative $\frac{dG(e^{j\omega})}{d\omega}$ can be considered, and we have already seen in (5.25) that one obtains


$$\left\|\frac{dG(e^{j\omega})}{d\omega}\right\|^2 = \sum\_{k=1}^n k^2 |\theta\_k|^2. \tag{5.64}$$

Discouraging large $\left\|\frac{dG(e^{j\omega})}{d\omega}\right\|^2$ can thus be achieved by using the right-hand side of (5.64) as a penalty, which can be written in the form:

$$J_{\gamma}(\theta) := \theta^T K_\gamma^{-1} \theta,$$

where

$$K_\gamma := \frac{1}{\gamma}\operatorname{diag}\left\{1,\; \frac{1}{4},\; \frac{1}{9},\dots,\; \frac{1}{n^2}\right\}. \tag{5.65}$$

This is of course equivalent to assuming that the impulse response vector θ has a zero-mean normal prior with covariance *K*<sup>γ</sup> .

Unfortunately, in the limit $n \to \infty$, the covariance function (5.65) does not meet the (more stringent) sufficient conditions of Lemma 5.1; of course, rather straightforward extensions include setting penalties on higher-order derivatives, which would result in a faster decay rate of the diagonal elements of (5.65). This is a manifestation of the well-known link between regularity in the frequency domain and decay rate of the impulse response, already discussed in Sect. 5.5.
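The equivalence between the derivative penalty (5.64) and the quadratic form $\theta^T K_\gamma^{-1}\theta$ with $K_\gamma$ as in (5.65) can be checked numerically. The sketch below is our own code; the $1/(2\pi)$ normalization of the $L_2$ norm over one period is an assumption we make to match the right-hand side of (5.64), and $\gamma = 1$ is taken for simplicity.

```python
import numpy as np

n = 8
rng = np.random.default_rng(1)
theta = rng.standard_normal(n)
k = np.arange(1, n + 1)

# dG/dw for G(e^{jw}) = sum_k theta_k e^{-jwk} is -j sum_k k theta_k e^{-jwk}.
w = np.linspace(0.0, 2 * np.pi, 4096, endpoint=False)
dG = -1j * (k * theta) @ np.exp(-1j * np.outer(k, w))

# Squared L2 norm over one period (1/(2*pi) normalization assumed) vs (5.64):
# the mean over a uniform grid is exact for a trigonometric polynomial.
lhs = np.mean(np.abs(dG) ** 2)
rhs = np.sum(k ** 2 * theta ** 2)
print(np.isclose(lhs, rhs))                                # True

# The same quantity as the penalty theta^T K^{-1} theta, K as in (5.65), gamma = 1.
K = np.diag(1.0 / k ** 2)
print(np.isclose(theta @ np.linalg.inv(K) @ theta, rhs))   # True
```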

## *5.8.3 Exponential Stability and Stochastic Embedding*

More recently, Gaussian priors for dynamical systems have been considered in the control literature; in particular, a zero-mean Gaussian prior with diagonal and exponentially decaying covariance

$$\mathcal{E}\,\theta\theta^T = K_{\rho,\alpha} := \alpha\, \operatorname{diag}\left\{1,\; \rho,\; \rho^2, \dots,\; \rho^{n-1}\right\} \tag{5.66}$$

has been proposed in the so-called "stochastic embedding" framework [25, 26]. Let us now briefly introduce the problem: consider an Output Error model of the form

$$y(t) = \sum_{k=1}^{\infty} g_k(\theta)\, u(t-k) + e(t),$$

where $g_k(\theta)$, $\theta \in \mathbb{R}^n$, is a parametric description of the unknown impulse response $\{g_k\}_{k=1,\dots,\infty}$ in the model class $\mathcal{M}_n(\theta)$. Let $\hat\theta$ be some parametric estimator of $\theta$, e.g., the PEM estimator

$$\hat{\theta} = \underset{\theta}{\arg\min}\; \sum_{t=1}^{N} \left\| y(t) - G(z,\theta)u(t) \right\|^2. \tag{5.67}$$


Let now

$$\hat{G}(z) := G(z, \hat{\theta}) = \sum_{k=1}^{\infty} g_k(\hat{\theta})\, z^{-k}$$

be the corresponding estimator of the transfer function $G(z,\theta) = \sum_{k=1}^{\infty} g_k(\theta)\, z^{-k}$.

In the Model Error Modelling framework, it is assumed that the "true" transfer function $G(z)$ is only partially captured by the chosen model class $\mathcal{M}_n(\theta)$, so that

$$G(z) = G(z, \theta_0) + \tilde{G}(z), \quad G(z, \theta_0) \in \mathcal{M}_n(\theta) \tag{5.68}$$

and *G*˜(*z*) represents a model error. The purpose of Model Error Modelling is to obtain a statistical description of the model error, say

$$
\tilde{G}(z) := G(z) - \hat{G}(z)
$$

which may be used, for instance, to estimate the model order, e.g., the dimension *n* of the parameter vector θ. This can be achieved by minimizing an estimate of the MSE

$$\mathcal{E}\, \| G(z) - G(z, \hat{\theta}) \|^{2}$$

while accounting for the model error model *G*˜(*z*), see e.g., Eqs. (89)–(92) in [26].

The model error $\tilde{G}(z)$ is estimated in [26] starting from the least squares residuals $v_{\hat\theta}(t) := y(t) - G(z,\hat\theta)u(t)$ which, under assumption (5.68), are expected to be described by the model

$$v(t) = \tilde{G}(z)u(t) + e(t).$$

It is remarkable that [26] proposes to estimate the parameters $\alpha$ and $\rho$ that characterize the covariance (5.66) by resorting to marginal likelihood maximization

$$(\hat{\alpha}, \hat{\rho}) := \underset{\alpha, \rho}{\arg\max} \int p(V_{\hat\theta}\,|\,\tilde{g})\, p(\tilde{g}\,|\,\alpha, \rho)\, d\tilde{g}, \tag{5.69}$$

where $V_{\hat\theta} := [v_{\hat\theta}(1), \dots, v_{\hat\theta}(N)]$. It is also interesting to observe that the exponential decay of the covariance sequence (5.66) implies a smoothness condition on the frequency response function similar in spirit to that advocated in [1]. This is formalized in the following result, whose proof is in Sect. 5.10.7.
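Under a linear-in-parameters description of the residual model, the integral in (5.69) is a Gaussian marginal and can be evaluated in closed form. The sketch below is our own illustrative code, not the implementation of [26]: it assumes $V = \Phi \tilde{g} + E$ with $\tilde{g} \sim \mathcal{N}(0, K_{\rho,\alpha})$ and $E \sim \mathcal{N}(0, \sigma^2 I)$, so that $V \sim \mathcal{N}(0, \Phi K_{\rho,\alpha}\Phi^T + \sigma^2 I)$, and maximizes the log-marginal over a small grid; all names and values are assumptions.

```python
import numpy as np

def neg_log_marglik(V, Phi, alpha, rho, sigma2):
    """-log p(V | alpha, rho) (up to a constant) under V = Phi g + E,
    g ~ N(0, K_{rho,alpha}) with K as in (5.66), E ~ N(0, sigma2 I)."""
    n = Phi.shape[1]
    K = alpha * np.diag(rho ** np.arange(n))
    Sigma = Phi @ K @ Phi.T + sigma2 * np.eye(len(V))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (logdet + V @ np.linalg.solve(Sigma, V))

# Synthetic check: generate data from known hyperparameters.
rng = np.random.default_rng(2)
n, N, sigma2 = 20, 200, 0.01
true_alpha, true_rho = 1.0, 0.6
g = rng.standard_normal(n) * np.sqrt(true_alpha * true_rho ** np.arange(n))
Phi = rng.standard_normal((N, n))
V = Phi @ g + np.sqrt(sigma2) * rng.standard_normal(N)

grid = [(a, r) for a in (0.1, 1.0, 10.0) for r in (0.3, 0.6, 0.9)]
best = min(grid, key=lambda p: neg_log_marglik(V, Phi, p[0], p[1], sigma2))
print(best)
```

On such a grid, the selected pair is typically close to the generating hyperparameters, which is the behaviour Empirical Bayes relies on.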

**Lemma 5.5** *Let $\{g_{k,\alpha}\}_{k=0,\dots,\infty}$ be a zero-mean Gaussian process with covariance* (5.66) *and let*

$$G\_{\alpha}(e^{j\omega}) := \sum\_{k=0}^{\infty} g\_{k,\alpha} e^{-jk\omega} \quad \omega \in [0, 2\pi).$$

*be its Fourier transform. Then the Lipschitz-like condition*

$$\mathcal{E}\left[ \left\| G_{\alpha}(e^{j\omega_1}) - G_{\alpha}(e^{j\omega_2}) \right\|^2 \right] \le \frac{c}{1 - \alpha}\, (\omega_1 - \omega_2)^2 \qquad |\alpha| < 1 \tag{5.70}$$

*holds.*

## **5.9 Further Topics and Advanced Reading**

Section 1.3 already reported a list of topics and readings on inverse problems, Stein estimators and their link with the Empirical Bayes framework.

The use of regularization and Bayesian priors can probably be dated back to the paper [71], where smoothing ideas were advocated for a denoising problem in the field of Actuarial Science. See also the much later reference [34]. The later developments are essentially impossible to survey in this short section, and we refer the reader to [66] for an early overview of the use of Bayes priors in the context of linear regression; the interested reader may also consult [22, 31, 32, 42, 59], where generalized ridge regression has been proposed to stabilize ill-conditioned inverse problems.

To the best of our knowledge, [40, 62] have been the first to use these ideas in the context of dynamical systems, named "distributed-lag" models in these early references. This work was subsequently taken up by Akaike [1] and later by Kitagawa and Gersh in a series of papers, see e.g., [35, 36], which culminated in the well-known book [37]. The seminal papers by Leamer and Shiller have also been continued by the econometrics community, starting with the work by Doan, Litterman and Sims; see e.g., [18] for an overview and further references. This has led to the so-called "Minnesota prior", which has been discussed quite extensively in the econometrics literature; several variations and extensions can be found, see for instance [23, 41].

The econometrics literature has since then studied Bayesian procedures for system identification rather intensively, mostly under the acronym *Bayesian VARs*; the main driving motivation was that of handling high-dimensional time series (i.e., *p* large, called *cross-sectional dimension* in the econometrics literature) with possibly many explicative variables (*m* large), see for instance [2, 17, 23, 38].

The problem of tuning the regularization parameters (or equivalently the hyperparameters describing the prior in a Bayesian setting) has received relatively little attention in the econometrics literature: [40] already suggested the use of Empirical Bayes procedures, while [2, 18] propose tuning the hyperparameters using out-of-sample and in-sample errors, respectively. The paper [38] and the more recent work [23] again adopt an Empirical Bayes approach using the marginal likelihood; [23] claims the superiority of this approach w.r.t. previous "ad-hoc" techniques [2, 18].

Despite this long history, the use of Bayesian priors for system identification has gained popularity only in relatively recent times, e.g., see the survey [52]. We believe it is fair to say that the reason for this is that much more effort has recently been devoted to developing prior models tailored to estimating dynamical systems. In the remaining part of the book, these issues will be dealt with in some detail. The reader is referred to [10, 11, 49, 50, 55] for various classes of prior models and to [6, 7, 12, 55] for more details on Maximum Entropy derivations. Extensions include prior models to estimate sparse models for high-dimensional time series [14, 74] as well as classes of priors for nonlinear dynamical models [51], which will be thoroughly discussed in Chap. 8. In particular, the techniques described in this chapter can also be used to identify the so-called *dynamic networks*, which consist of a large set of interconnected dynamic systems. Modelling such complex physical systems is important in several fields of science and engineering, including biomedicine and neuroscience [27, 30, 46, 56]. Estimation is difficult since these networks are often large scale and the network topology is typically unknown [14, 44, 67]. One typically postulates the existence of many connections and then has to understand from data which are really active. Since in real physical systems often only a small fraction of links is actually active, the estimation process needs to exploit sparsity regularizers such as those introduced in Chap. 3 and their stochastic interpretations like the Bayesian Lasso [47]. In the context of *linear dynamic networks*, where modules are defined by impulse responses, many approaches have been recently designed, e.g., relying on local multi-input single-output (MISO) models [16, 19, 45].
Contributions based on variational Bayesian inference and/or nonparametric regularization, deeply connected with the techniques discussed in this book, are in [14, 33, 58, 73]. Methods to infer the full network dynamics using (structured) multiple-input multiple-output (MIMO) models can instead be found in [21, 69], with consistency of the estimates analyzed in [57]. A contribution based on the combination of the stable spline kernel and the so-called horseshoe sparsity prior [8, 54, 68] has been developed in [48]. See also [3, 24, 29, 70] for insights on identifiability issues and [28], where compressed sensing is exploited.

## **5.10 Appendix**

## *5.10.1 Optimal Kernel*

**Theorem 5.7** *The solution P*<sup>∗</sup> *of problem* (5.20) *is given by*

$$P^\* = \theta\_0 \theta\_0^T,\tag{5.71}$$

*where $\theta_0$ is the "true" impulse response of the data-generating mechanism* (5.14)*.*

*Proof* The proof will proceed as follows: let us denote by $\hat\theta^{P^*}$ the estimator obtained with $P^*$ as in (5.71). Consider the error

$$
\tilde{\theta}^P := \hat{\theta}^P - \theta\_0
$$

which can be written as

$$\begin{aligned} \tilde\theta^P &= \hat\theta^P - \theta_0 \\ &= \hat\theta^{P^*} - \theta_0 + \hat\theta^P - \hat\theta^{P^*} \\ &= \tilde\theta^{P^*} + \left( \hat\theta^P - \hat\theta^{P^*} \right). \end{aligned}$$

We shall show that the following orthogonality property holds:

$$\mathcal{E}\,\tilde\theta^{P^*}\left(\hat\theta^P - \hat\theta^{P^*}\right)^T = 0 \tag{5.72}$$

so that

$$\mathcal{E}\,\tilde\theta^P(\tilde\theta^P)^T = \mathcal{E}\,\tilde\theta^{P^*}(\tilde\theta^{P^*})^T + \mathcal{E}\left(\hat\theta^P - \hat\theta^{P^*}\right)\left(\hat\theta^P - \hat\theta^{P^*}\right)^T \tag{5.73}$$

and therefore:

$$M_\theta(P) - M_\theta(P^*) = \mathcal{E}\,\tilde\theta^P(\tilde\theta^P)^T - \mathcal{E}\,\tilde\theta^{P^*}(\tilde\theta^{P^*})^T = \mathcal{E}\left(\hat\theta^P - \hat\theta^{P^*}\right)\left(\hat\theta^P - \hat\theta^{P^*}\right)^T \succeq 0$$

which will prove the claim that $P^* = \theta_0\theta_0^T$ is the optimal solution to (5.20).

It now just remains to show that (5.72) holds. To do so, let us rewrite (4.7), assuming null $\mu_\theta$ and using the matrix inversion lemma as in (3.145):

$$\begin{aligned} \hat\theta^P &= \left(\sigma^2 I + P\Phi^T\Phi\right)^{-1} P\Phi^T Y \\ &= \left(\sigma^2 I + P\Phi^T\Phi\right)^{-1} P\Phi^T\left(\Phi\theta_0 + E\right) \\ &= \left(\sigma^2 I + P\Phi^T\Phi\right)^{-1}\left[\left(P\Phi^T\Phi + \sigma^2 I - \sigma^2 I\right)\theta_0 + P\Phi^T E\right] \\ &= \theta_0 - \left(\sigma^2 I + P\Phi^T\Phi\right)^{-1}\left[\sigma^2\theta_0 - P\Phi^T E\right]. \end{aligned}$$

Therefore, the error $\tilde\theta^P := \theta_0 - \hat\theta^P$ can be written in the form:

$$\tilde\theta^P = \underbrace{\left(\sigma^2 I + P\Phi^T\Phi\right)^{-1}}_{:=W_P}\left[\sigma^2\theta_0 - P\Phi^T E\right] = W_P\left[\sigma^2\theta_0 - P\Phi^T E\right]. \tag{5.74}$$

Now, using (5.74), we have:

$$\hat\theta^{P^*} - \hat\theta^P = \tilde\theta^P - \tilde\theta^{P^*} = \sigma^2\left(W_P - W_{P^*}\right)\theta_0 + \left(W_{P^*}P^* - W_P P\right)\Phi^T E.$$

Now, let us compute

$$\begin{aligned} \mathcal{E}\left(\hat\theta^{P^*} - \hat\theta^P\right)(\tilde\theta^{P^*})^T &= \sigma^4\left(W_P - W_{P^*}\right)\theta_0\theta_0^T W_{P^*}^T - \sigma^2\left(W_{P^*}P^* - W_P P\right)\Phi^T\Phi\, P^* W_{P^*}^T \\ &= \sigma^2\left[\sigma^2\left(W_P - W_{P^*}\right) - \left(W_{P^*}P^* - W_P P\right)\Phi^T\Phi\right]P^* W_{P^*}^T \end{aligned} \tag{5.75}$$

If we now use the identity

$$W\_P \left(\sigma^2 I + P \Phi^T \Phi\right) = I \qquad \Rightarrow \qquad \sigma^2 W\_P = I - W\_P P \Phi^T \Phi$$

we obtain

$$
\sigma^2 \left( W\_P - W\_{P^\*} \right) = \left( W\_{P^\*} P^\* - W\_P P \right) \Phi^T \Phi
$$

so that, using (5.75),

$$\mathcal{E}\left(\hat\theta^{P^*} - \hat\theta^P\right)(\tilde\theta^{P^*})^T = 0$$

which proves (5.72) and thus the theorem. $\square$
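Theorem 5.7 can also be checked numerically: the MSE matrix of the regularized estimator is available in closed form from (5.74), namely $\mathcal{E}\,\tilde\theta^P(\tilde\theta^P)^T = W_P(\sigma^4\theta_0\theta_0^T + \sigma^2 P\Phi^T\Phi P^T)W_P^T$, and the gap $M_\theta(P) - M_\theta(P^*)$ must be positive semidefinite. The sketch below (our own code; problem sizes and the alternative kernel are illustrative choices) performs this check.

```python
import numpy as np

def mse_matrix(P, Phi, theta0, sigma2):
    """Exact MSE matrix of the estimator with kernel P, from (5.74):
    theta_tilde^P = W_P [sigma2 theta0 - P Phi^T E], E ~ N(0, sigma2 I)."""
    n = len(theta0)
    W = np.linalg.inv(sigma2 * np.eye(n) + P @ Phi.T @ Phi)
    return W @ (sigma2 ** 2 * np.outer(theta0, theta0)
                + sigma2 * P @ Phi.T @ Phi @ P.T) @ W.T

rng = np.random.default_rng(3)
n, N, sigma2 = 5, 30, 0.5
theta0 = rng.standard_normal(n)
Phi = rng.standard_normal((N, n))

P_star = np.outer(theta0, theta0)        # optimal kernel (5.71)
P_other = P_star + 0.5 * np.eye(n)       # an arbitrary alternative kernel
gap = (mse_matrix(P_other, Phi, theta0, sigma2)
       - mse_matrix(P_star, Phi, theta0, sigma2))
# Theorem 5.7: the gap is positive semidefinite.
print(np.linalg.eigvalsh((gap + gap.T) / 2).min() >= -1e-8)   # True
```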

## *5.10.2 Proof of Lemma 5.1*

Consider the following upper bound on the probability that the $\ell_1$ norm of $\theta$ is larger than a given threshold $T_{\ell_1}$:

$$\mathbb{P}\left[\sum_{t=1}^{\infty} |\theta_t| \ge T_{\ell_1}\right] \le \frac{1}{T_{\ell_1}}\,\mathcal{E}\sum_{t=1}^{\infty} |\theta_t| = \frac{1}{T_{\ell_1}}\sum_{t=1}^{\infty} \mathcal{E}\,|\theta_t| \le \frac{1}{T_{\ell_1}}\sum_{t=1}^{\infty}\left(|m_t| + \sqrt{2/\pi}\, K(t,t)^{1/2}\right)$$

where we have used the equality $\mathcal{E}\,|X| = \sigma\sqrt{2/\pi}$ for $X \sim \mathcal{N}(0, \sigma^2)$. Using the hypothesis (5.24) we have that

$$\mathbb{P}\left[\sum\_{t=1}^{\infty} |\theta\_t| \ge T\_{\ell\_1}\right] \le \frac{M\_{\ell\_1} + K\_{\ell\_1}\sqrt{2/\pi}}{T\_{\ell\_1}}$$

and therefore

$$\mathbb{P}\left[\sum\_{t=1}^{\infty} |\theta\_t| < T\_{\ell\_1}\right] \ge 1 - \frac{M\_{\ell\_1} + K\_{\ell\_1}\sqrt{2/\pi}}{T\_{\ell\_1}}.$$

Taking the limit as *T*<sup>1</sup> → +∞ we have

$$\mathbb{P}\left[\sum\_{t=1}^{\infty} |\theta\_t| < +\infty\right] = 1$$

which concludes the proof.

## *5.10.3 Proof of Theorem 5.5*

The proof is based on the fact that the Maximum Entropy distribution $p(\theta)$ under constraints $\mathcal{E} f_k(\theta) = F_k$ and $\mathcal{E} g_k(\theta) = G_k$ has the "Gibbs" structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):

$$p(\theta) \propto e^{-\sum_i \left(\mu_i f_i(\theta) + \eta_i g_i(\theta)\right)}.$$

In our case, we have $f_k(\theta) = \theta_k^2$ and $g_k(\theta) = (\theta_{k-1} - \theta_k)^2$, and therefore the MaxEnt solution has the form

$$p\_{\boldsymbol{\theta},ME}(\boldsymbol{\theta}) = \boldsymbol{C}e^{-\frac{1}{2}\left(\mu\_1\boldsymbol{\theta}\_1^2 + \sum\_{k=2}^n \mu\_k \boldsymbol{\theta}\_k^2 + \gamma\_k \left(\boldsymbol{\theta}\_{k-1} - \boldsymbol{\theta}\_k\right)^2\right)}.\tag{5.76}$$

Using a well-known result in graphical models (see e.g., Lauritzen [39]), the variables $\theta_k$ and $\{\theta_{k+2},\dots,\theta_n\}$ are conditionally independent given $\theta_{k+1}$, because $\theta_{k+1}$ is the only neighbour of $\theta_k$ in the graph representing $p(\theta_k, \theta_{k+1},\dots,\theta_n)$ (or, equivalently, $\theta_{k+1}$ separates $\theta_k$ from $\theta_{k+2}, \theta_{k+3},\dots,\theta_n$).

In our case, this conditional independence implies that the best linear estimator $\hat\theta_{k-1}$ of $\theta_{k-1}$ given $\theta_k, \theta_{k+1},\dots,\theta_n$ depends only on $\theta_k$ (i.e., $\hat\theta_{k-1} = a_{B,k}\,\theta_k$), so that the vector $\theta$ admits the backward representation<sup>2</sup>:

$$
\theta\_{k-1} = a\_{B,k} \theta\_k + w\_k \tag{5.77}
$$

with $w_k := \theta_{k-1} - \hat\theta_{k-1} = \theta_{k-1} - a_{B,k}\,\theta_k$ zero mean and uncorrelated with $\theta_k, \theta_{k+1},\dots,\theta_n$. Let us define $\sigma_k^2 := \mathcal{E}\, w_k^2$. In order to express $a_{B,k}$ and $\sigma_k^2$ as functions of $\lambda_R, \lambda_S, \alpha$, we exploit the constraints (5.31) and the dynamical model (5.77). In particular, we have

$$\begin{array}{l} \lambda\_S \alpha^{k-1} = \mathcal{E}\,\theta\_{k-1}^2 \\ \quad = a\_{B,k}^2\, \mathcal{E}\,\theta\_k^2 + \sigma\_k^2 \\ \quad = a\_{B,k}^2 \lambda\_S \alpha^k + \sigma\_k^2 \end{array} \tag{5.78}$$

$$\begin{array}{l} \lambda\_R \alpha^{k-1} = \mathcal{E} (\theta\_{k-1} - \theta\_k)^2 \\ \quad = \mathcal{E} \left((a\_{B,k} - 1)\theta\_k + w\_k\right)^2 \\ \quad = (a\_{B,k} - 1)^2\, \mathcal{E}\,\theta\_k^2 + \mathcal{E}\, w\_k^2 \\ \quad = (a\_{B,k} - 1)^2 \lambda\_S \alpha^k + \sigma\_k^2 \end{array} \tag{5.79}$$

Subtracting (5.79) from (5.78) we obtain

$$(\lambda\_S - \lambda\_R)\alpha^{k-1} = a\_{B,k}^2 \lambda\_S \alpha^k - (a\_{B,k} - 1)^2 \lambda\_S \alpha^k = (2a\_{B,k} - 1)\lambda\_S \alpha^k$$

which implies that

$$a\_{B,k} = \frac{\lambda\_S (1+\alpha) - \lambda\_R}{2\lambda\_S \alpha} =: a\_B.$$

which is independent of *k* and thus denoted by $a\_B$, as in (5.35). From (5.79) we also have that

<sup>2</sup> We prefer here to work with backward representations since, as we will see, with this choice we will have $a\_{B,k} = a\_B$, independent of *k*. Forward representations are discussed in Sect. 5.10.8.

$$
\sigma\_k^2 = \lambda\_R \alpha^{k-1} - (a\_B - 1)^2 \lambda\_S \alpha^k = \left(\lambda\_R - (a\_B - 1)^2 \lambda\_S \alpha\right) \alpha^{k-1} = (1 - a\_B^2 \alpha) \lambda\_S \alpha^{k-1}
$$

where the last equality follows after a few manipulations and proves (5.36). Replacing

$$a\_B - 1 = \frac{\lambda\_S (1 - \alpha) - \lambda\_R}{2 \lambda\_S \alpha}$$

in the previous equation we have:

$$
\sigma\_k^2 = \left[\lambda\_R - \left(\frac{\lambda\_S(1-\alpha)-\lambda\_R}{2\lambda\_S\alpha}\right)^2 \lambda\_S\alpha\right] \alpha^{k-1}.
$$

Of course $\sigma\_k^2$, and thus the right-hand side, should be positive (for simplicity we exclude the singular case $\sigma\_k^2 = 0$):

$$
\lambda\_R - \left(\frac{\lambda\_S(1-\alpha)-\lambda\_R}{2\lambda\_S\alpha}\right)^2 \lambda\_S\alpha = \frac{4\lambda\_R\lambda\_S\alpha - (\lambda\_S(1-\alpha)-\lambda\_R)^2}{4\lambda\_S\alpha} > 0
$$

which in turn is equivalent to

$$4\lambda\_R \lambda\_S \alpha - (\lambda\_S(1-\alpha) - \lambda\_R)^2 > 0.$$

This happens if and only if

$$
\lambda\_R^2 - 2\lambda\_R \lambda\_S (1+\alpha) + (1-\alpha)^2 \lambda\_S^2 < 0.
$$

This is a second-degree polynomial in $\lambda\_R$ with two positive roots

$$
\lambda\_{R,i} = \lambda\_S(1+\alpha) \pm \sqrt{\lambda\_S^2(1+\alpha)^2 - \lambda\_S^2(1-\alpha)^2} = \lambda\_S(1+\alpha \pm 2\sqrt{\alpha}), \quad i = 1,2
$$

and therefore our problem is feasible if and only if

$$
\lambda\_{R,min} = \lambda\_{R,1} = \lambda\_S(1 + \alpha - 2\sqrt{\alpha}) < \lambda\_R < \lambda\_S(1 + \alpha + 2\sqrt{\alpha}) = \lambda\_{R,2} = \lambda\_{R,max}
$$

thus proving (5.32). Now it remains to prove that (5.76) takes the form (5.34). First let us observe that the exponent of (5.76) is a quadratic form in θ, and therefore (5.76) can be written in the form

$$p\_{\theta,ME}(\theta) = Ce^{-\frac{1}{2}\theta^T \Phi \theta}.$$

Last, since in (5.76) only products of the form $\theta\_k \theta\_h$ with $h \in \{k-1, k, k+1\}$ appear, the matrix $\Phi = \Phi^T$ has the following band structure:

$$
\Phi = \begin{bmatrix}
\* & \* & 0 & \dots & \dots & 0 \\
\* & \* & \* & 0 & \dots & 0 \\
0 & \* & \* & \* & 0 & \dots \\
\vdots & \dots & \ddots & \ddots & \ddots & \dots \\
0 & \dots & 0 & \* & \* & \* \\
0 & \dots & \dots & 0 & \* & \* \\
\end{bmatrix}.
$$

In addition, for $p\_{\theta,ME}(\theta)$ to be a density, $\Phi$ needs to be positive semidefinite (otherwise there would be directions in which the density grows indefinitely). Since $\theta$ admits the backward AR representation (5.77) with $\mathcal{E}\, w\_k^2 = \sigma\_k^2 > 0$, the covariance matrix $\Sigma = \mathcal{E}\, \theta\theta^T$ is positive definite and thus $\Phi = \Sigma^{-1}$. To compute the autocovariance function $\mathcal{E}\, \theta\_h\theta\_k$ we consider the following cases: if $k = h$ we have

$$
\mathcal{E}\,\theta\_h\theta\_h = \lambda\_S\alpha^h.
$$

If *k* > *h* we have

$$\mathcal{E}\,\theta\_h \theta\_k = a\_B\, \mathcal{E}\,\theta\_{h+1} \theta\_k$$

and iterating the relation we find

$$
\mathcal{E}\,\theta\_h \theta\_k = a\_B^{k-h}\, \mathcal{E}\,\theta\_k \theta\_k = \lambda\_S a\_B^{k-h} \alpha^k.
$$

Analogously, if *h* > *k* we have

$$
\mathcal{E}\,\theta\_h \theta\_k = a\_B^{h-k}\, \mathcal{E}\,\theta\_h \theta\_h = \lambda\_S a\_B^{h-k} \alpha^h.
$$

Combining the three cases we obtain

$$
\mathcal{E}\,\theta\_k \theta\_h = \lambda\_S a\_B^{|k-h|} \alpha^{\max\{k,h\}}
$$

proving (5.38).
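The covariance just derived is easy to check numerically. The following sketch (the parameter values $\lambda\_S = 1$, $\alpha = 0.8$, $\lambda\_R = 0.5$ are illustrative choices satisfying the feasibility condition (5.32), not values from the text) builds the matrix (5.38) and verifies its positive definiteness, the backward AR(1) coefficient (5.35) and the innovation variances (5.36):

```python
import numpy as np

# Illustrative parameter values (not from the text), chosen inside the
# feasibility interval (5.32)
lam_S, alpha, lam_R = 1.0, 0.8, 0.5
assert lam_S * (1 - np.sqrt(alpha))**2 < lam_R < lam_S * (1 + np.sqrt(alpha))**2

a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)   # Eq. (5.35)

n = 10
k = np.arange(1, n + 1)
# Covariance (5.38): E[theta_k theta_h] = lam_S * a_B^|k-h| * alpha^max(k,h)
Sigma = lam_S * a_B**np.abs(k[:, None] - k[None, :]) \
              * alpha**np.maximum(k[:, None], k[None, :])

# Positive definiteness of the max-ent covariance
assert np.all(np.linalg.eigvalsh(Sigma) > 0)

# Backward AR(1) coefficient: E[theta_{k-1} theta_k] / E[theta_k^2] = a_B
assert np.allclose(np.diag(Sigma, 1) / np.diag(Sigma)[1:], a_B)

# Innovation variances (5.36): here w_{k+1} enters theta_k, so
# sigma_{k+1}^2 = (1 - a_B^2 * alpha) * lam_S * alpha^k for k = 1..n-1
sigma2 = np.diag(Sigma)[:-1] - a_B**2 * np.diag(Sigma)[1:]
assert np.allclose(sigma2, (1 - a_B**2 * alpha) * lam_S * alpha**k[:-1])
```

The same checks pass for any $\lambda\_R$ strictly inside the interval (5.32).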

## *5.10.4 Proof of Corollary 5.1*

Using the definition (5.39) in Eq. (5.38) we obtain:

$$\mathcal{E}\,\theta\_k \theta\_h = \lambda\_S a\_B^{|k-h|} \alpha^{\max\{k,h\}} = \lambda\_S \frac{\rho^{|k-h|}}{\alpha^{\frac{|k-h|}{2}}} \alpha^{\max\{k,h\}} = \lambda\_S \rho^{|k-h|} \alpha^{\frac{k+h}{2}}.$$

In addition, if the matching condition $\lambda\_R = \lambda\_S(1 - \alpha)$ is satisfied, then from (5.35) $a\_B = 1$ and from (5.39) $\rho = \sqrt{\alpha}$; substituting in (5.40) we obtain

$$\mathcal{E}\,\theta\_k\theta\_h = \lambda\_S\rho^{|k-h|}\alpha^{\frac{k+h}{2}} = \lambda\_S\alpha^{\max\{k,h\}}$$

i.e., the covariance sequence of the well-known TC kernel.
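This reduction can be confirmed numerically; the sketch below uses arbitrary illustrative values of $\lambda\_S$ and $\alpha$, with $\lambda\_R$ tied to them by the matching condition:

```python
import numpy as np

# Illustrative values; the matching condition ties lam_R to lam_S and alpha
lam_S, alpha = 2.0, 0.7
lam_R = lam_S * (1 - alpha)

a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)   # Eq. (5.35)
assert np.isclose(a_B, 1.0)
rho = a_B * np.sqrt(alpha)                                   # Eq. (5.39)
assert np.isclose(rho, np.sqrt(alpha))

# General covariance (5.38) vs. the TC kernel lam_S * alpha^max(k,h)
k = np.arange(1, 8)
Sigma = lam_S * a_B**np.abs(k[:, None] - k[None, :]) \
              * alpha**np.maximum(k[:, None], k[None, :])
TC = lam_S * alpha**np.maximum(k[:, None], k[None, :])
assert np.allclose(Sigma, TC)
```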

## *5.10.5 Proof of Lemma 5.2*

The proof of this lemma is a simple application of the Schwarz inequality. In particular we have:

$$\begin{split} |u\_{it}| &= \frac{1}{\xi\_i} \Big|\sum\_{s=1}^{n} [\mathbf{K}]\_{t,s} u\_{is}\Big| \leq \frac{1}{\xi\_i}\sum\_{s=1}^{n} \sqrt{[\mathbf{K}]\_{t,t}} \sqrt{[\mathbf{K}]\_{s,s}}\, |u\_{is}| \\ &= \frac{1}{\xi\_i} \sqrt{[\mathbf{K}]\_{t,t}} \sum\_{s=1}^{n} \sqrt{[\mathbf{K}]\_{s,s}}\, |u\_{is}| \leq \underbrace{\frac{1}{\xi\_i} \sqrt{K(t,t)}}\_{<\infty}\; \underbrace{\sqrt{\sum\_{s=1}^{n} [\mathbf{K}]\_{s,s}}}\_{=\sqrt{C}<\infty}\; \underbrace{\sqrt{\sum\_{s=1}^{n} |u\_{is}|^{2}}}\_{=1}. \end{split}$$

where the last inequality follows from the Cauchy–Schwarz inequality, together with the fact that the eigenvector $u\_i$ has 2-norm equal to 1 for all *i*. The same condition clearly holds also in the infinite-dimensional case, i.e., as *n* → ∞, if *K*(*t*,*s*) admits the spectral decomposition

$$K(t,s) = \sum\_{i=1}^{\infty} \xi\_i u\_{it} u\_{is}$$

and the condition $\sum\_t K(t,t) = C < \infty$ holds. In particular, this latter condition holds true if the more stringent condition $\sum\_t K^{1/2}(t,t) < \infty$ in Lemma 5.1 is satisfied.
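The chain of bounds in this proof can be illustrated numerically on a sampled kernel; in the sketch below the Gaussian kernel on a uniform grid is an arbitrary choice for illustration:

```python
import numpy as np

# Sampled kernel matrix (Gaussian kernel on a grid; an illustrative choice)
t = np.linspace(0, 1, 30)
K = np.exp(-(t[:, None] - t[None, :])**2 / 0.1)

xi, U = np.linalg.eigh(K)        # columns of U: unit 2-norm eigenvectors
keep = xi > 1e-10                # discard numerically null eigenvalues
xi, U = xi[keep], U[:, keep]

# Starting point of the proof: u_it = (1/xi_i) * sum_s K[t,s] * u_is
assert np.allclose(K @ U, U * xi)

# Final bound: |u_it| <= (1/xi_i) * sqrt(K[t,t]) * sqrt(sum_s K[s,s])
bound = np.sqrt(np.diag(K))[:, None] * np.sqrt(np.diag(K).sum()) / xi[None, :]
assert np.all(np.abs(U) <= bound + 1e-12)
```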

## *5.10.6 Proof of Theorem 5.6*

The proof follows from the fact that the maximum entropy distribution p(*x*) under the constraints $\mathcal{E} f\_i(x) \le \gamma\_i$ has the "Gibbs" structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):

$$\mathbf{p}(\mathbf{x}) \propto e^{-\sum\_{i} \mu\_{i} f\_{i}(\mathbf{x})}.$$

## *5.10.7 Proof of Lemma 5.5*

Since $\{g\_{k,\alpha}\}\_{k=0,\dots,\infty}$ is zero mean, then clearly also $G\_\alpha(e^{j\omega})$ is so, i.e., $\mathcal{E}\, G\_\alpha(e^{j\omega}) = 0$. If we now consider the difference


$$G\_\alpha(e^{j\omega\_1}) - G\_\alpha(e^{j\omega\_2}) = \sum\_{k=0}^{\infty} g\_{k,\alpha} \left[ e^{-jk\omega\_1} - e^{-jk\omega\_2} \right],$$

taking the expected value of the squared norm, and using the fact that $\mathcal{E}\, g\_{k,\alpha} g\_{h,\alpha} = c\alpha^k \delta\_{k-h}$, we have

$$\mathcal{E}\left|G\_{\alpha}(e^{j\omega\_1}) - G\_{\alpha}(e^{j\omega\_2})\right|^2 = \sum\_{k=0}^{\infty} c\alpha^k \left|e^{-jk\omega\_1} - e^{-jk\omega\_2}\right|^2.$$

Now, using

$$\left|e^{-jk\omega\_1} - e^{-jk\omega\_2}\right|^2 = 2\left(1 - \cos\left(k(\omega\_1 - \omega\_2)\right)\right) \le k^2\left(\omega\_1 - \omega\_2\right)^2$$

and the convergence of the series $\sum\_{k} k^2\alpha^k$ (obtained by differentiating the geometric series $\sum\_k \alpha^k$), the thesis follows.
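The elementary trigonometric identity and bound used above are easy to verify numerically:

```python
import numpy as np

# |e^{-jk w1} - e^{-jk w2}|^2 = 2*(1 - cos(k*(w1 - w2))) <= k^2 * (w1 - w2)^2
rng = np.random.default_rng(0)
for _ in range(1000):
    w1, w2 = rng.uniform(-np.pi, np.pi, size=2)
    kk = int(rng.integers(0, 50))
    lhs = abs(np.exp(-1j * kk * w1) - np.exp(-1j * kk * w2))**2
    assert np.isclose(lhs, 2 * (1 - np.cos(kk * (w1 - w2))))
    assert lhs <= kk**2 * (w1 - w2)**2 + 1e-9
```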

## *5.10.8 Forward Representations of Stable-Splines Kernels*

A major drawback of the backward construction is that it is not straightforward to extend it to an infinite interval, i.e., to let *n* → ∞ in order to consider infinitely long impulse response models $\{\theta\_k\}\_{k\in\mathbb{N}}$. However, this difficulty can be circumvented by exploiting the "forward" representation of (5.77), which turns out to be again a time-varying AR(1) model.<sup>3</sup> Theorem 5.8 derives the forward AR(1) representation of the maximum entropy process found in Theorem 5.5.

**Theorem 5.8** *The maximum entropy solution to* (5.33) *found in Theorem 5.5 admits the forward AR(1) representation*

$$
\theta\_{k+1} = a\_F \theta\_k + w\_k \quad \quad k \ge 0 \tag{5.80}
$$

*with zero-mean initial condition such that* $\mathcal{E}\,\theta\_0^2 = \lambda\_S$*, and where*

$$a\_F = \rho \alpha^{1/2} = a\_B \alpha \tag{5.81}$$

*and wk is a sequence of zero mean variables, uncorrelated with the initial condition* θ<sup>0</sup> *and such that*

$$\mathcal{E}\, w\_{k}w\_{h} = \begin{cases} \sigma\_{F,k}^{2} & k = h\\ 0 & k \neq h \end{cases} \tag{5.82}$$

*with* $\sigma\_{F,k}^2 = \lambda\_S\alpha^{k+1}(1 - \rho^2)$*.*

<sup>3</sup> There are several ways to see this: perhaps the simplest is to recall that the inverse covariance matrix of an AR(1) process has a band (tridiagonal) structure, which implies that forward and backward models share the same conditional dependence structure.

*Proof* First of all let us observe that, if $\theta\_k$ admits an AR(1) forward representation of the form (5.80) (with $w\_k$ that satisfies (5.82)), $a\_F$ should satisfy the relation

$$a\_F = \mathcal{E}\,\theta\_{k+1} \theta\_k \left(\mathcal{E}\,\theta\_k^2\right)^{-1}.$$

Using the expression (5.38), we obtain:

$$\mathcal{E}\,\theta\_{k+1}\theta\_k \left(\mathcal{E}\,\theta\_k^2\right)^{-1} = \lambda\_S a\_B \alpha^{k+1} \left(\lambda\_S \alpha^k\right)^{-1} = a\_B \alpha$$

and recalling that $\rho = a\_B\alpha^{1/2}$ we also obtain

$$a\_F = a\_B \alpha = \rho \alpha^{1/2}.$$

In addition, denoting $\sigma\_{F,k}^2 := \mathcal{E}\, w\_k^2$,

$$\mathcal{E}\,\theta\_{k+1}^{2} = a\_F^2\, \mathcal{E}\,\theta\_k^2 + \sigma\_{F,k}^2$$

must hold. Therefore,

$$
\sigma\_{F,k}^2 = \mathcal{E}\,\theta\_{k+1}^2 - a\_F^2\, \mathcal{E}\,\theta\_k^2 = \lambda\_S \alpha^{k+1} - \rho^2\alpha\, \lambda\_S\alpha^k = \lambda\_S \alpha^{k+1} (1 - \rho^2).
$$

It is also straightforward to verify that, if $\theta\_k$ is generated by (5.80), then

$$\mathcal{E}\,\theta\_{k+\tau}\theta\_k = a\_F^{\tau}\, \mathcal{E}\,\theta\_k^2 = a\_F^{\tau}\lambda\_S\alpha^k = \lambda\_S a\_B^{\tau}\alpha^{k+\tau} \quad \tau > 0$$

which is exactly of the form

$$\mathcal{E}\,\theta\_h \theta\_k = \lambda\_S a\_B^{|h-k|} \alpha^{\max(k,h)}$$

provided *h* = *k* + τ , τ > 0. This concludes the proof. -
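The forward representation can be checked numerically: propagating the variance through (5.80)–(5.82) must reproduce $\mathcal{E}\,\theta\_k^2 = \lambda\_S\alpha^k$ and the cross-covariances (5.38). A sketch with illustrative parameter values (the same arbitrary $\lambda\_S$, $\alpha$, $\lambda\_R$ used above for illustration):

```python
import numpy as np

# Illustrative parameter values satisfying (5.32)
lam_S, alpha, lam_R = 1.0, 0.8, 0.5
a_B = (lam_S * (1 + alpha) - lam_R) / (2 * lam_S * alpha)   # Eq. (5.35)
rho = a_B * np.sqrt(alpha)                                   # Eq. (5.39)
a_F = rho * np.sqrt(alpha)                                   # Eq. (5.81): a_F = a_B * alpha

# Propagate E[theta_k^2] through (5.80) with the innovation variances (5.82)
n = 12
var = np.empty(n)
var[0] = lam_S                                   # E[theta_0^2] = lam_S
for kk in range(n - 1):
    sig2_F = lam_S * alpha**(kk + 1) * (1 - rho**2)
    var[kk + 1] = a_F**2 * var[kk] + sig2_F

idx = np.arange(n)
assert np.allclose(var, lam_S * alpha**idx)      # E[theta_k^2] = lam_S * alpha^k

# Cross-covariances: E[theta_{k+tau} theta_k] = a_F^tau * E[theta_k^2]
tau = 3
cross = a_F**tau * var[: n - tau]
assert np.allclose(cross, lam_S * a_B**tau * alpha**(idx[: n - tau] + tau))
```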

**References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Regularization in Reproducing Kernel Hilbert Spaces**

**Abstract** Methods for obtaining a function *g* in a relationship *y* = *g*(*x*) from observed samples of *y* and *x* are the building blocks for black-box estimation. The classical parametric approach discussed in the previous chapters uses a function model that depends on a finite-dimensional vector, like, e.g., a polynomial model. We have seen that an important issue is the model order choice. This chapter describes some regularization approaches which permit to reconcile flexibility of the model class with well-posedness of the solution exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, the function will be searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces as hypothesis spaces and related norms as regularizers. Such kernel-based approaches thus permit to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory.

## **6.1 Preliminaries**

Techniques for reconstructing a function *g* in a functional relationship *y* = *g*(*x*) from observed samples of *y* and *x* are the fundamental building blocks for black-box estimation. As already seen in Chap. 3 when treating linear regression, given a finite set of pairs (*xi*, *yi*) the aim is to determine a function *g* having a good prediction capability, i.e., for a new pair (*x*, *y*) we would like the prediction *g*(*x*) to be close to *y* (e.g., in the MSE sense).

The classical parametric approach discussed in Chap. 3 uses a model $g\_\theta$ that depends on a finite-dimensional vector θ. A very simple example is a polynomial model, treated in Example 3.1, given, e.g., by $g\_\theta(x) = \theta\_1 + \theta\_2 x + \theta\_3 x^2$, whose coefficients $\theta\_i$ can be estimated by fitting the data via least squares. In this parametric scenario, we have seen that an important issue is the model order choice. In fact, the least squares objective improves as the dimension of θ increases, eventually leading

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_6

to data interpolation. But overparametrized models, as a rule, perform poorly when used to predict future output data, even if benign overfitting may sometimes happen, as e.g., described in the context of deep networks [17, 55, 75]. Another drawback related to overparameterization is that the problem may become ill-posed in the sense of Hadamard, i.e., the solution may be non-unique, or ill-conditioned. This means that the estimate may be highly sensitive even to small perturbations of the outputs *yi* as, e.g., illustrated in Fig. 1.3 of Sect. 1.2.

This chapter describes some regularization approaches which permit to reconcile flexibility of the model class with well-posedness of the solution exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, *g* will be searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces (RKHSs) as hypothesis spaces and related norms as regularizers. Such norms generalize the quadratic penalties seen in Chap. 3. In this scenario, the estimator is completely defined by a positive definite kernel which has to encode the expected function properties, e.g., the smoothness level. Furthermore we will see that, even when the model class is infinite dimensional, the function estimate turns out to be a finite linear combination of basis functions computable from the kernel. The estimator also enjoys strong asymptotic properties, permitting (under reasonable assumptions on data generation) to achieve the optimal predictor as the data set size grows to infinity.

The kernel-based approaches described in the following sections thus permit to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory. In addition, RKHS theory paves the way to the development of other powerful techniques, e.g., for estimation of an infinite number of impulse response coefficients (IIR models estimation), for continuous-time linear system identification and also for nonlinear system identification.

The reader not familiar with functional analysis will find in the first part of the appendix of this chapter a brief overview of the basic results used in the next sections, like, e.g., the concept of linear and bounded functional, which is key to define a RKHS.

## **6.2 Reproducing Kernel Hilbert Spaces**

In what follows, we use *X* to indicate domains of functions. In machine learning, this set is often referred to as the *input space* with its generic element *x* ∈ *X* called *input location*. Sometimes, *X* is assumed to be a compact metric space, e.g., one can think of *X* as a closed and bounded set in the familiar space R*<sup>m</sup>* equipped with the Euclidean norm. In what follows, all the functions are real valued, so that *<sup>f</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup>.

**Reproducing kernel Hilbert spaces** We now introduce a class of Hilbert spaces *H* which play a fundamental role as hypothesis spaces for function estimation problems. Our goal is to estimate maps which permit to make predictions over the whole *X* . Thus, a basic requirement is to search for the predictor in a space containing functions which are well-defined pointwise for any *x* ∈ *X* . In particular, we assume that all the pointwise evaluators *g* → *g*(*x*) are linear and bounded over *H* . This means that ∀*x* ∈ *X* there exists *Cx* < ∞ such that

$$|g(x)| \le C\_{x} \|g\|\_{\mathcal{H}}, \quad \forall g \in \mathcal{H}.\tag{6.1}$$

The above condition is stronger than requiring $|g(x)| < \infty\ \forall x$ since $C\_x$ can depend on *x* but not on *g*. This property already leads to the function spaces of interest. The following definitions are taken from [13].

**Definition 6.1** *(RKHS, based on* [13]*)* A reproducing kernel Hilbert space (RKHS) over a non-empty set *<sup>X</sup>* is a Hilbert space of functions *<sup>g</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup> such that (6.1) holds.

As suggested by the name itself, RKHSs are related to the concept of positive definite kernel [13, 20], a particular function defined over *X* × *X* . In the literature it is also called positive semidefinite kernel, hence in what follows positive definite kernel and positive semidefinite kernel will define the same mathematical object. This is also specified in the next definition.

**Definition 6.2** *(Positive definite kernel, Mercer kernel and kernel section, based on* [13]*)* Let *<sup>X</sup>* denote a non-empty set. A symmetric function *<sup>K</sup>* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> is called *positive definite kernel* or *positive semidefinite kernel* if, for any finite natural number *p*, it holds

$$\sum\_{i=1}^{p} \sum\_{j=1}^{p} a\_i a\_j K(\mathbf{x}\_i, \mathbf{x}\_j) \ge 0, \quad \forall (\mathbf{x}\_k, a\_k) \in (\mathcal{X}, \mathbb{R}), \quad k = 1, \dots, p.$$

If strict inequality holds for any set of *p* distinct input locations $x\_k$ and coefficients $a\_k$ not all equal to zero, i.e.,

$$\sum\_{i=1}^{p} \sum\_{j=1}^{p} a\_i a\_j K(\mathbf{x}\_i, \mathbf{x}\_j) > \mathbf{0},$$

then the kernel is *strictly positive definite*.

If *X* is a metric space and the positive definite kernel is also continuous, then *K* is said to be a *Mercer kernel*.

Finally, given a kernel *K*, the *kernel section Kx* centred at *x* is the function *<sup>X</sup>* <sup>→</sup> <sup>R</sup> defined by

$$K\_x(y) = K(x, y) \quad \forall y \in \mathcal{X}.$$

Hence, in the sense given above, a positive definite kernel "contains" matrices which are all at least positive semidefinite.
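This containment is easy to check numerically; the sketch below (the Gaussian kernel and the input locations are arbitrary illustrative choices) verifies that the induced Gram matrix is positive semidefinite, both via its eigenvalues and via random quadratic forms:

```python
import numpy as np

# Gram matrix of the Gaussian kernel on a few distinct input locations
x = np.linspace(-3, 3, 6)
G = np.exp(-np.subtract.outer(x, x)**2)

# All eigenvalues are nonnegative (here strictly positive: distinct inputs)
assert np.all(np.linalg.eigvalsh(G) > 0)

# Equivalently, every quadratic form sum_ij a_i a_j K(x_i, x_j) is >= 0
rng = np.random.default_rng(1)
for _ in range(100):
    a = rng.standard_normal(x.size)
    assert a @ G @ a >= 0
```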

We are now in a position to state a fundamental theorem from [13] here specialized to Mercer kernels which lead to RKHSs containing continuous functions (the proof is reported in Sect. 6.9.2).

**Theorem 6.1** (RKHSs induced by Mercer kernels, based on [13]) *Let X be a compact metric space and let K* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> *be a Mercer kernel. Then, there exists a unique (up to isometries) Hilbert space H of functions, called RKHS associated to K , such that*

*1. all the kernel sections belong to H , i.e.,*

$$K\_x \in \mathcal{H} \quad \forall x \in \mathcal{X};\tag{6.2}$$

*2. the so-called* reproducing property *holds, i.e.,*

$$
\langle K\_x, g \rangle\_{\mathcal{H}} = g(x) \quad \forall (x, g) \in (\mathcal{X}, \mathcal{H}).\tag{6.3}
$$

*In addition, H is contained in the space C of continuous functions.*

**Remark 6.1** Note that the space *H* characterized in Theorem 6.1 is indeed a RKHS according to Definition 6.1. In fact, for any input location *x* the kernel section *Kx* belongs to the space and, according to the reproducing property, represents the evaluation functional at *x*. Then, Theorem 6.27 (Riesz representation theorem), reported in the appendix to this chapter, permits the conclusion that all the pointwise evaluators over *H* are linear and bounded.

While Theorem 6.1 establishes a link between Mercer kernels (which enjoy continuity properties) and RKHSs, it is possible also to state a one-to-one correspondence with the entire class of positive definite kernels (not necessarily continuous). In particular, the following result holds.

**Theorem 6.2** (Moore–Aronszajn, based on [13]) *LetX be any non-empty set. Then, to every RKHS H there corresponds a unique positive definite kernel K such that the reproducing property (6.3) holds. Conversely, given a positive definite kernel K , there exists a unique RKHS of real-valued functions defined over X where* (6.2) *and* (6.3) *hold.*

The proof can be quite easily obtained using Theorem 6.27 (Riesz representation theorem) and arguments similar to those contained in the proof of Theorem 6.1.

**Further notes and RKHSs examples** Thus, a RKHS *H* can be defined just by specifying a kernel *K*, also called the *reproducing kernel* of *H* . In particular, any RKHS is generated by the kernel sections. More specifically, let *S* = span({*Kx* }*<sup>x</sup>*∈*<sup>X</sup>* ) and define the following norm in *S*


$$\left\|f\right\|\_{\mathcal{H}}^{2} = \sum\_{i=1}^{p} \sum\_{j=1}^{p} c\_{i}c\_{j}K(x\_{i},x\_{j}),\tag{6.4}$$

where

$$f(\cdot) = \sum\_{i=1}^{p} c\_i K\_{x\_i}(\cdot).$$

Then, one has

*H* = *S* ∪ {all the limits w.r.t. $\|\cdot\|\_{\mathcal{H}}$ of Cauchy sequences contained in *S*}.

Summarizing, one has

$$\mathcal{H} = \overline{\operatorname{span}\left(\{K\_x\}\_{x \in \mathcal{X}}\right)},$$

where the closure is taken w.r.t. the norm induced by (6.4).
Assume for instance $K(x\_1, x\_2) = \exp\left(-\|x\_1 - x\_2\|^2\right)$, which is the so-called Gaussian kernel. Then, all the functions in the corresponding RKHS are sums, or limits of sums, of functions proportional to Gaussians. As further elucidated later on, this means that every function of *H* inherits properties such as smoothness and integrability of the kernel, e.g., we have seen in Theorem 6.1 that kernel continuity implies *H* ⊂ *C* . This fact has an important consequence on modelling: instead of specifying a whole set of basis functions, it suffices to choose a single positive definite kernel that encodes the desired properties of the function to be synthesized.

**Example 6.3** *(Norm in a two-dimensional RKHS)* We introduce a very simple RKHS to illustrate how the kernel *K* can be seen as a similarity function that establishes the norm (complexity) of a function by comparing function values at different input locations.

When *X* has finite cardinality *m*, the functions are evaluated just on a finite number of input locations. Hence, each function *f* is in one-to-one correspondence with the *m*-dimensional vector

$$\mathbf{f} = \begin{pmatrix} f(1) \\ f(2) \\ \vdots \\ f(m) \end{pmatrix}.$$

In addition, any kernel is in one-to-one correspondence with one symmetric positive semidefinite matrix $\mathbf{K} \in \mathbb{R}^{m\times m}$ with $(i,j)$-entry $\mathbf{K}\_{ij} = K(i,j)$. Finally, the kernel sections can be seen as the columns of **K**.

Assume, e.g., *m* = 2 with *X* = {1, 2}. Then, the functions can be seen as two-dimensional vectors and any kernel *K* is in one-to-one correspondence with one symmetric positive semidefinite matrix $\mathbf{K} \in \mathbb{R}^{2\times 2}$. The RKHS *H* associated to *K* is finite-dimensional, being spanned just by the two kernel sections *K*1(·) and *K*2(·) which can be seen as the two columns of **K**. Hence, the functions *f* in *H* are in one-to-one correspondence with the vectors

$$\mathbf{f} = \begin{pmatrix} f(1) \\ f(2) \end{pmatrix} = \mathbf{K}c, \quad c \in \mathbb{R}^2.$$

If **K** is full rank, *H* covers the whole R<sup>2</sup> and from (6.4) we have

$$\|f\|\_{\mathcal{H}}^2 = c^T \mathbf{K}c = \mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}.$$

For the sake of simplicity, assume also that **K**<sup>11</sup> = **K**<sup>22</sup> = 1 so that it must hold −1 < **K**<sup>12</sup> < 1. Then, considering, e.g., the function *f* (*i*) = *i*, one has

$$\begin{aligned} \|f\|\_{\mathcal{H}}^2 &= \begin{bmatrix} 1 & 2 \end{bmatrix} \mathbf{K}^{-1} \begin{bmatrix} 1 & 2 \end{bmatrix}^T\\ &= \frac{5 - 4\mathbf{K}\_{12}}{1 - \mathbf{K}\_{12}^2}, \quad -1 < \mathbf{K}\_{12} < 1. \end{aligned}$$

Figure 6.1 displays $\|f\|\_{\mathcal{H}}^2$ as a function of $\mathbf{K}\_{12}$. One can see that the norm diverges as $|\mathbf{K}\_{12}|$ approaches 1.

If, e.g., $\mathbf{K}\_{12} = 1$ the kernel function becomes constant over *X* × *X* . Hence, the two kernel sections *K*1(·) and *K*2(·) coincide, being constant with *K*1(*i*) = *K*2(*i*) = 1 for *i* = 1, 2. This means that $\mathbf{K}\_{12} = 1$ induces a space *H* containing only constant functions.<sup>1</sup> This explains why the norm (complexity) of *f* becomes large if $\mathbf{K}\_{12}$ is close to 1: the space becomes less and less "tolerant" of functions with *f* (1) ≠ *f* (2).

Letting now *f* (1) = 1 and *f* (2) = *a*, the joint effect of **K**<sup>12</sup> and *a* is explained by the formula

$$\begin{aligned} \|f\|\_{\mathcal{H}^\ell}^2 &= \begin{bmatrix} 1 \ a \end{bmatrix} \mathbf{K}^{-1} \begin{bmatrix} 1 \ a \end{bmatrix}^T\\ &= \frac{(a - \mathbf{K}\_{12})^2}{1 - \mathbf{K}\_{12}^2} + 1, \quad -1 < \mathbf{K}\_{12} < 1. \end{aligned}$$

Note that, thinking now of **K**<sup>12</sup> as fixed, the function with minimal RKHS norm (complexity) is obtained with *a* = **K**<sup>12</sup> and has a norm equal to one. -
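The computations of this example can be reproduced in a few lines of code:

```python
import numpy as np

# ||f||_H^2 = f^T K^{-1} f for f = (1, 2), vs. closed form (5 - 4*K12)/(1 - K12^2)
f = np.array([1.0, 2.0])
for K12 in np.linspace(-0.9, 0.9, 19):
    K = np.array([[1.0, K12], [K12, 1.0]])
    norm2 = f @ np.linalg.solve(K, f)
    assert np.isclose(norm2, (5 - 4 * K12) / (1 - K12**2))

# As K12 -> 1 only (nearly) constant functions keep a small norm:
# the norm of f(i) = i blows up
K = np.array([[1.0, 0.9999], [0.9999, 1.0]])
assert f @ np.linalg.solve(K, f) > 1000

# For fixed K12, the minimal-norm function with f(1) = 1 is (1, K12), norm 1
K12 = 0.5
K = np.array([[1.0, K12], [K12, 1.0]])
g = np.array([1.0, K12])
assert np.isclose(g @ np.linalg.solve(K, g), 1.0)
```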

**Example 6.4** *($L\_2^\mu$ and $\ell\_2$)* Let $\mathcal{X} = \mathbb{R}$ and consider the classical Lebesgue space of square summable functions with μ equal to the Lebesgue measure. Recall that this is a Hilbert space whose elements are equivalence classes of functions measurable

<sup>1</sup> One can then also easily check that the case **<sup>K</sup>**<sup>12</sup> = −1 instead induces a RKHS containing only functions satisfying *f* (1) = − *f* (2).

w.r.t. Lebesgue: any group of functions which differ only on a set of null measure (e.g., containing only a countable number of input locations) identifies the same vector. Hence, $L\_2^\mu$ cannot be a RKHS since pointwise evaluation is not even well defined.

Let instead *<sup>X</sup>* <sup>=</sup> <sup>N</sup> (the set of natural numbers) and define the identity kernel

$$K(i,j) = \delta\_{ij}, \ (i,j) \in \mathbb{N} \times \mathbb{N},\tag{6.5}$$

where $\delta\_{ij}$ is the Kronecker delta. Clearly, *K* is symmetric and positive definite according to Definition 6.2 (it can be associated with an identity matrix of infinite size). Hence, it induces a unique RKHS *H* that contains all the finite combinations of the kernel sections. In particular, any finite sum can be written as $f(\cdot) = \sum\_{i=1}^m f\_i K\_i(\cdot)$, where some of the $f\_i$ may be null, and corresponds to a sequence with a finite number of non-null components. To obtain the entire *H*, we need also to add all the limits of Cauchy sequences w.r.t. the norm (6.4) given by

$$\begin{aligned} \|f\|\_{\mathcal{H}}^2 &= \left\| \sum\_{i=1}^m f\_i K\_i(\cdot) \right\|\_{\mathcal{H}}^2 \\ &= \sum\_{i=1}^m \sum\_{j=1}^m f\_i f\_j K(i,j) = \sum\_{i=1}^m f\_i^2, \end{aligned}$$

which coincides with the classical Euclidean norm of $[f\_1 \dots f\_m]$. This allows us to conclude that the associated RKHS is the classical space $\ell\_2$ of square summable sequences.

As a final note, Definition 6.1 easily confirms that $\ell\_2$ is a RKHS. In fact, for every $f = [f\_1\ f\_2\ \dots] \in \ell\_2$ one has

$$|f\_i| \le \sqrt{\sum\_i f\_i^2} = \|f\|\_2 \quad \forall i$$

and, recalling (6.1), this shows that all the evaluation functionals *<sup>f</sup>* <sup>→</sup> *fi* with *<sup>i</sup>* <sup>∈</sup> <sup>N</sup> are bounded. -

**Example 6.5** *(Sobolev space and the first-order spline kernel)* While in the previous example we have seen that $L\_2^\mu$ is not a RKHS, consider now the space obtained by integrating the functions in this space. In particular, let *X* = [0, 1], set μ to the Lebesgue measure and consider

$$\mathcal{H} = \left\{ f \,\middle|\, f(x) = \int\_0^{x} h(y)\, dy \ \text{ with } h \in L\_2^{\mu} \right\}.$$

One thus has that any *f* in *H* satisfies *f*(0) = 0 and is absolutely continuous: its derivative $h = \dot f$ is defined almost everywhere and is Lebesgue integrable.

With the inner product given by

$$
\langle f, g \rangle\_{\mathcal{H}} = \langle \dot{f}, \dot{g} \rangle\_{L\_2^{\mu}},
$$

it is easy to see that *H* is a Hilbert space. In fact, $L\_2^\mu$ is Hilbert and we have established a one-to-one correspondence between functions in *H* and $L\_2^\mu$ which preserves the inner product. Such *H* is an example of Sobolev space [2] since the complexity of a function is measured by the energy of its derivative:

$$\|f\|\_{\mathcal{H}}^2 = \int\_0^1 \dot{f}^2(x)\, dx.$$

Now, given *x* ∈ [0, 1], let χ*<sup>x</sup>* (·) be the indicator function of the set [0, *x*]. Then, one has

$$\begin{aligned} |f(x)| &= \left| \int\_0^x \dot{f}(a)\, da \right| = |\langle \chi\_x, \dot{f} \rangle\_{L\_2^{\mu}}| \\ &\le \|\dot{f}\|\_{L\_2^{\mu}} = \|f\|\_{\mathcal{H}}, \end{aligned}$$

where we have used the Cauchy–Schwarz inequality together with $\|\chi\_x\|\_{L\_2^\mu} = \sqrt{x} \le 1$. Hence, *H* is also a RKHS since all the evaluation functionals are bounded. We now prove that its reproducing kernel is the so-called first-order (linear) *spline kernel* given by

$$K(\mathbf{x}, \mathbf{y}) = \min(\mathbf{x}, \mathbf{y}).\tag{6.6}$$

In fact, every kernel section belongs to *H*, being piecewise linear with $\dot{K}\_x = \chi\_x$. Furthermore, (6.6) satisfies the reproducing property since

**Fig. 6.2** Linear (top) and cubic (bottom) spline kernels with kernel sections $K\_{x\_i}(x)$ for $x\_i = 0.1, 0.2,\dots, 1$

$$\begin{aligned} \langle f, K\_x \rangle\_{\mathcal{H}} &= \langle \dot{f}, \chi\_x \rangle\_{L^{\mu}\_2} \\ &= \int\_0^x \dot{f}(y)\, dy = f(x). \end{aligned}$$

The linear spline kernel and some of its sections are displayed in the top panels of Fig. 6.2. -
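The reproducing property of the first-order spline kernel can also be verified numerically by discretizing the inner product; in the sketch below the test function $f(x) = \sin(2x)$ is an arbitrary choice satisfying $f(0) = 0$:

```python
import numpy as np

y = np.linspace(0, 1, 200001)
dy = y[1] - y[0]
f = lambda x: np.sin(2 * x)        # test function with f(0) = 0
df = lambda x: 2 * np.cos(2 * x)   # its derivative

for x in (0.2, 0.5, 0.9):
    dKx = (y < x).astype(float)        # K_x'(y) = chi_[0,x](y)
    inner = np.sum(df(y) * dKx) * dy   # <f, K_x>_H = int_0^1 f'(y) K_x'(y) dy
    assert abs(inner - f(x)) < 1e-4    # equals f(x): reproducing property
```

The quadrature error is of the order of the grid step, so the inner product matches *f*(*x*) to the stated tolerance.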

### *6.2.1 Reproducing Kernel Hilbert Spaces Induced by Operations on Kernels*

We report some classical results about RKHSs induced by operations on kernels which can be derived from [13]. The first theorem characterizes the RKHS induced by the sum or product of two kernels.

**Theorem 6.6** (RKHS induced by sum or product of two kernels, based on [13]) *Let K and G be two positive definite kernels over the same domain X* × *X , associated to the RKHSs H and G , respectively.*

*The sum K* + *G, where*

$$[K+G](\mathbf{x}, \mathbf{y}) = K(\mathbf{x}, \mathbf{y}) + G(\mathbf{x}, \mathbf{y}),$$

*is the reproducing kernel of the RKHS R containing functions*

$$f = h + g, \quad (h, g) \in \mathcal{H} \times \mathcal{G}$$

*with*

$$\|f\|\_{\mathcal{R}}^2 = \min\_{h \in \mathcal{H}, g \in \mathcal{G}} \|h\|\_{\mathcal{H}}^2 + \|g\|\_{\mathcal{G}}^2 \text{ s.t. } f = h + g.$$

*The product K G, where*

$$[KG](\mathbf{x}, \mathbf{y}) = K(\mathbf{x}, \mathbf{y})G(\mathbf{x}, \mathbf{y})$$

*is instead the reproducing kernel of the RKHS R containing functions*

$$f = hg, \quad (h, g) \in \mathcal{H} \times \mathcal{G}$$

*with*

$$\|f\|\_{\mathcal{R}}^2 = \min\_{h \in \mathcal{H}, g \in \mathcal{G}} \|h\|\_{\mathcal{H}}^2 \|g\|\_{\mathcal{G}}^2 \text{ s.t. } f = hg.$$
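A quick numerical sanity check of Theorem 6.6 is possible: the sum and the (elementwise) product of two positive definite kernels must produce positive semidefinite Gram matrices on any finite set of points. The sketch below uses the linear spline kernel and a Gaussian kernel as illustrative choices:

```python
import numpy as np

# Sketch: the sum and the Hadamard (elementwise) product of two positive
# definite kernels give positive semidefinite Gram matrices.
x = np.linspace(0.1, 1.0, 10)
K = np.minimum.outer(x, x)                       # linear spline kernel
G = np.exp(-(x[:, None] - x[None, :]) ** 2)      # Gaussian kernel (illustrative)

for M in (K + G, K * G):
    assert np.linalg.eigvalsh(M).min() >= -1e-10
```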

The second theorem instead provides the connection between two RKHSs, with the second one obtained from the first one by sampling its kernel.

**Theorem 6.7** (RKHS induced by kernel sampling, based on [13]) *Let $\mathcal{H}$ be the RKHS induced by the kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{Y} \subset \mathcal{X}$ and denote with $\mathcal{R}$ the RKHS of functions over $\mathcal{Y}$ induced by the restriction of the kernel K to $\mathcal{Y} \times \mathcal{Y}$. Then, the functions in $\mathcal{R}$ correspond to the functions in $\mathcal{H}$ sampled on $\mathcal{Y}$. One also has*

$$\|f\|\_{\mathcal{R}}^2 = \min\_{g \in \mathcal{H}} \|g\|\_{\mathcal{H}}^2 \text{ s.t. } g\_{\mathcal{Y}} = f, \tag{6.7}$$

*where $g\_{\mathcal{Y}}$ is g sampled on $\mathcal{Y}$.*

The following theorem lists some operations which permit building kernels (and hence RKHSs) from simple building blocks.

**Theorem 6.8** (Building kernels from kernels, based on [13]) *Let $K\_1$ and $K\_2$ be two positive definite kernels over $\mathcal{X} \times \mathcal{X}$ and $K\_3$ a positive definite kernel over $\mathbb{R}^m \times \mathbb{R}^m$. Let also P be an m × m symmetric positive semidefinite matrix and P(x) a polynomial with positive coefficients. Then, the following functions are positive definite kernels over $\mathcal{X} \times \mathcal{X}$:*


## **6.3 Spectral Representations of Reproducing Kernel Hilbert Spaces**

In the previous section we have seen that any RKHS is generated by its kernel sections. We now discuss another representation obtainable when the kernel can be diagonalized as follows

$$K(x, y) = \sum\_{i \in \mathcal{I}} \zeta\_i \rho\_i(x) \rho\_i(y), \quad \zeta\_i > 0 \,\,\forall i,\tag{6.8}$$

where the set *I* is countable. This will lead to new insights on the nature of the RKHSs, generalizing to the infinite-dimensional case the connection between regularization and basis expansion reported in Sect. 5.6.

A simple situation holds when the input space has finite cardinality, e.g., $\mathcal{X} = \{x\_1, \ldots, x\_m\}$. Under this assumption, any positive definite kernel is in one-to-one correspondence with the $m \times m$ matrix $\mathbf{K}$ whose $(i, j)$-entry is $K(x\_i, x\_j)$. The representation (6.8) then follows from the spectral theorem applied to $\mathbf{K}$. In fact, if $\zeta\_i$ and $v\_i$ are, respectively, the eigenvalues and the orthonormal (column) eigenvectors of $\mathbf{K}$, (6.8) can be written as

$$\mathbf{K} = \sum\_{i=1}^{m} \zeta\_i v\_i v\_i^T,$$

where the functions $\rho\_i(\cdot)$ have become the vectors $v\_i$. One generalization of this result is described below.
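For a finite input space, this spectral representation is just an eigendecomposition of the kernel matrix, as the short sketch below illustrates (the three input points are an arbitrary choice):

```python
import numpy as np

# Finite input space X = {x_1, ..., x_m}: the kernel is the m x m matrix K,
# and (6.8) becomes its spectral decomposition K = sum_i zeta_i v_i v_i^T.
x = np.array([0.2, 0.5, 0.9])
K = np.minimum.outer(x, x)          # spline kernel sampled on X

zeta, V = np.linalg.eigh(K)         # eigenvalues and orthonormal eigenvectors
K_rec = (V * zeta) @ V.T            # reassemble sum_i zeta_i v_i v_i^T

assert np.allclose(K_rec, K)
assert zeta.min() > 0               # K is strictly positive definite here
```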

Let $L\_K$ be the linear operator defined by the positive definite kernel $K$ as follows:

$$L\_K[f](\cdot) = \int\_X K(\cdot, \mathbf{x}) f(\mathbf{x}) d\mu(\mathbf{x}).\tag{6.9}$$

We also assume that $\mu$ is a $\sigma$-finite and nondegenerate Borel measure on $\mathcal{X}$. Essentially, this means that $\mathcal{X}$ is the countable union of measurable sets with finite measure and that $\mu$ "covers" $\mathcal{X}$ entirely. The reader can, e.g., consider $\mathcal{X} \subset \mathbb{R}^m$ and think of $\mu$ as the Lebesgue measure or any probability measure with $\mu(A) > 0$ for any nonempty open set $A \subset \mathcal{X}$. The next classical result goes under the name of the Mercer theorem, whose formulations trace back to [60].

**Theorem 6.9** (Mercer theorem, based on [60]) *Let $\mathcal{X}$ be a compact metric space equipped with a nondegenerate and $\sigma$-finite Borel measure $\mu$ and let K be a Mercer kernel on $\mathcal{X} \times \mathcal{X}$. Then, there exists a complete orthonormal basis of $\mathcal{L}\_2^{\mu}$ given by a countable number of continuous functions $\{\rho\_i\}\_{i \in \mathcal{I}}$ satisfying*

$$L\_K[\rho\_i] = \zeta\_i \rho\_i, \quad i \in \mathcal{I}, \quad \zeta\_1 \ge \zeta\_2 \ge \cdots \ge 0,\tag{6.10}$$

*with $\zeta\_i > 0\ \forall i$ if K is strictly positive and $\lim\_{i \to \infty} \zeta\_i = 0$ if the number of eigenvalues is infinite.*

*One also has*

$$K(x, y) = \sum\_{i \in \mathcal{I}} \zeta\_i \rho\_i(x) \rho\_i(y),\tag{6.11}$$

*where the convergence of the series is absolute and uniform on X* × *X .*

The following result characterizes a RKHS through the eigenfunctions of $L\_K$. The proof is reported in Sect. 6.9.3.

**Theorem 6.10** (RKHS defined by an orthonormal basis of $\mathcal{L}\_2^{\mu}$) *Under the same assumptions of Theorem 6.9, if the $\rho\_i$ and $\zeta\_i$ satisfy (6.10), with also $\zeta\_i > 0\ \forall i$, one has*

$$\mathcal{H} = \left\{ f \; \middle| \; f(x) = \sum\_{i \in \mathcal{I}} c\_i \rho\_i(x) \text{ s.t. } \sum\_{i \in \mathcal{I}} \frac{c\_i^2}{\zeta\_i} < \infty \right\}. \tag{6.12}$$

*In addition, if*

$$f = \sum\_{i \in \mathcal{I}} a\_i \rho\_i, \quad g = \sum\_{i \in \mathcal{I}} b\_i \rho\_i,$$

*one has*

$$\langle f, g \rangle\_{\mathcal{H}} = \sum\_{i \in \mathcal{I}} \frac{a\_i b\_i}{\zeta\_i},\tag{6.13}$$

*so that*

$$\|f\|\_{\mathcal{H}}^2 = \sum\_{i \in \mathcal{I}} \frac{a\_i^2}{\zeta\_i}. \tag{6.14}$$

*Hence, it also follows that $\{\sqrt{\zeta\_i}\rho\_i\}\_{i \in \mathcal{I}}$ is an orthonormal basis of $\mathcal{H}$.*

The representation (6.12) is not unique since the spectral maps, i.e., the functions that associate a kernel with a decomposition of the type (6.8), are not unique. They depend on the chosen measure μ even if they lead to the same RKHS.

Theorem 6.10 thus shows that any kernel admitting an expansion (6.11) coming from the Mercer theorem induces a separable RKHS, i.e., one having a countable basis given by the $\rho\_i$. Later on, Theorem 6.13 will show that this result holds under much milder assumptions. In fact, the representation (6.12) can be obtained starting from any diagonalized kernel (6.8) involving generic functions $\rho\_i$, e.g., not necessarily independent of each other. One can also remove the compactness hypothesis on the input space, e.g., letting $\mathcal{X}$ be the entire $\mathbb{R}^m$.

**Remark 6.2** *(Relationship between $\mathcal{H}$ and $\mathcal{L}\_2^{\mu}$)* Theorem 6.10 points out an interesting connection between $\mathcal{H}$ and $\mathcal{L}\_2^{\mu}$. Since the functions $\rho\_i$ form an orthonormal basis in $\mathcal{L}\_2^{\mu}$, one has

$$f \in \mathcal{L}\_2^{\mu} \iff f = \sum\_{i \in \mathcal{I}} c\_i \rho\_i \text{ with } \sum\_{i \in \mathcal{I}} c\_i^2 < \infty \tag{6.15}$$

while (6.12) shows that

$$f \in \mathcal{H} \iff f = \sum\_{i \in \mathcal{I}} c\_i \rho\_i \text{ with } \sum\_{i \in \mathcal{I}} \frac{c\_i^2}{\zeta\_i} < \infty. \tag{6.16}$$

If $\zeta\_i > 0\ \forall i$, one has the set inclusion $\mathcal{H} \subset \mathcal{L}\_2^{\mu}$, since the functions in the RKHS must satisfy a more stringent condition on the decay of the expansion coefficients (the $\zeta\_i$ decay to zero).

In addition, let $L\_K^{1/2}$ denote the operator defined as the square root of $L\_K$, i.e., for any $f \in \mathcal{L}\_2^{\mu}$ with $f = \sum\_{i \in \mathcal{I}} c\_i \rho\_i$, one has

$$L\_K^{1/2}[f] = \sum\_{i \in \mathcal{I}} \sqrt{\zeta\_i} c\_i \rho\_i. \tag{6.17}$$

This is a smoothing operator: the function $L\_K^{1/2}[f]$ is more regular than $f$ since the expansion coefficients $\sqrt{\zeta\_i} c\_i$ decrease to zero faster than the $c\_i$. In view of (6.15) and (6.16), we obtain

$$\mathcal{H} = \left\{ L\_K^{1/2}[f] \; \mid \; f \in \mathcal{L}\_2^{\mu} \right\},\tag{6.18}$$

which shows that the RKHS can be thought of as the output of the linear system $L\_K^{1/2}$ fed with the space $\mathcal{L}\_2^{\mu}$, i.e., $\mathcal{H} = L\_K^{1/2} \mathcal{L}\_2^{\mu}$.

**Example 6.11** *(Spline kernel expansion)* In Example 6.5, we have seen that the space of functions on the unit interval satisfying $f(0) = 0$ and $\int\_0^1 \dot{f}^2(x)dx < \infty$ is the RKHS associated to the first-order spline kernel $\min(x, y)$. We now derive a representation of the type (6.12) for this space, setting $\mu$ to the Lebesgue measure. For this purpose, consider the eigenvalue problem

$$\int\_0^1 \min(x, y)\rho(y)dy = \zeta \rho(x).$$

The above equation is equivalent to

$$\int\_0^x y\rho(y)dy + x \int\_x^1 \rho(y)dy = \zeta \rho(x),$$

which implies ρ(0) = 0. Taking the derivative w.r.t. *x* we also obtain

$$\int\_{x}^{1} \rho(y)dy = \zeta \dot{\rho}(x),$$

which implies $\dot{\rho}(1) = 0$. Differentiating again w.r.t. $x$ gives

$$-\rho(x) = \zeta \ddot{\rho}(x),$$

whose general solution is

$$\rho(x) = a\sin(x/\sqrt{\zeta}) + b\cos(x/\sqrt{\zeta}), \quad a, b \in \mathbb{R}.$$

The boundary conditions $\rho(0) = \dot{\rho}(1) = 0$ imply $b = 0$ and lead to the following possible eigenvalues:

$$\zeta\_i = \frac{1}{(i\pi - \pi/2)^2}, \quad i = 1, 2, \dots$$

The orthonormality condition also implies $a = \sqrt{2}$, so that we obtain

$$\rho\_i(x) = \sqrt{2}\sin\left(i\pi x - \frac{\pi x}{2}\right), \quad i = 1, 2, \dots$$

This provides the formulation (6.12) of the Sobolev space $\mathcal{H}$. Figure 6.3 plots three eigenfunctions (left panel) and the first 100 eigenvalues $\zeta\_i$ (right panel). It is evident that the larger $i$, the larger the high-frequency content of $\rho\_i$ and the RKHS norm of this basis function. In fact, a large value of $i$ corresponds to a small eigenvalue $\zeta\_i$ and one has $\|\rho\_i\|\_{\mathcal{H}}^2 = 1/\zeta\_i$.
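The expansion just derived can be verified numerically: truncating the Mercer series of $\min(x, y)$ after enough terms should reproduce the kernel, since the $\zeta\_i$ decay like $1/i^2$. A sketch (truncation length and test points are arbitrary):

```python
import numpy as np

# Sketch: the truncated Mercer expansion of min(x, y) from Example 6.11
# approaches min(x, y) as the number of terms grows.
def spline_series(x, y, n=2000):
    i = np.arange(1, n + 1)
    omega = i * np.pi - np.pi / 2
    zeta = 1.0 / omega ** 2                           # eigenvalues
    return np.sum(zeta * 2 * np.sin(omega * x) * np.sin(omega * y))

for x, y in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.2)]:
    assert abs(spline_series(x, y) - min(x, y)) < 1e-3
```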

**Example 6.12** *(Translation invariant kernels and Fourier expansion)* A translation invariant kernel depends only on the difference of its two arguments. Hence, there exists $h : \mathcal{X} \to \mathbb{R}$ such that $K(x, y) = h(x - y)$. Assume that $\mathcal{X} = [0, 2\pi]$ and that $h$ can be extended to a continuous, symmetric and periodic function over $\mathbb{R}$. Then, it can be expanded in terms of the following uniformly convergent Fourier series

$$h(\mathbf{x}) = \sum\_{i=0}^{\infty} \zeta\_i \cos(i\mathbf{x}),$$

**Fig. 6.3** Expansion of the first-order spline kernel $\min(x, y)$: eigenfunctions $\rho\_i$ for $i = 1, 2, 8$ (left panel) and eigenvalues $\zeta\_i$ (right)

where ζ<sup>0</sup> accounts for the constant component and we assume ζ*<sup>i</sup>* > 0 ∀*i*. We thus obtain the kernel expansion

$$K(x, \mathbf{y}) = \zeta\_0 + \sum\_{i=1}^{\infty} \zeta\_i \cos(i\mathbf{x})\cos(i\mathbf{y}) + \sum\_{i=1}^{\infty} \zeta\_i \sin(i\mathbf{x})\sin(i\mathbf{y}),$$

in terms of functions which are all orthogonal in $\mathcal{L}\_2^{\mu}$. Hence, these kernels induce RKHSs generated by the Fourier basis, with different inner products determined by the $\zeta\_i$.

#### *6.3.1 More General Spectral Representation*

Now, assume that the kernel $K$ is available in the form $K(x, y) = \sum\_{i \in \mathcal{I}} \zeta\_i \rho\_i(x)\rho\_i(y)$ with $\zeta\_i > 0\ \forall i$, but with functions $\rho\_i$ not necessarily orthonormal. More generally, we do not even require that they are independent, e.g., $\rho\_1$ could be a linear combination of $\rho\_2$ and $\rho\_3$. The following result shows that the RKHS associated to $K$ is still generated by the $\rho\_i$, but the relationship of the expansion coefficients with $\|\cdot\|\_{\mathcal{H}}$ is more involved than in the previous case.

**Theorem 6.13** (RKHS induced by a diagonalized kernel) *Let $\mathcal{H}$ be the RKHS induced by $K(x, y) = \sum\_{i \in \mathcal{I}} \zeta\_i \rho\_i(x)\rho\_i(y)$ with $\zeta\_i > 0\ \forall i$ and the set $\mathcal{I}$ countable. Then, $\mathcal{H}$ is separable and admits the representation*

$$\mathcal{H} = \left\{ f \; \middle| \; f(x) = \sum\_{i \in \mathcal{I}} c\_i \rho\_i(x) \text{ s.t. } \sum\_{i \in \mathcal{I}} \frac{c\_i^2}{\zeta\_i} < \infty \right\} \tag{6.19}$$

*and one has*

$$\|f\|\_{\mathcal{H}}^{2} = \min\_{\{c\_{i}\}} \sum\_{i \in \mathcal{I}} \frac{c\_{i}^{2}}{\zeta\_{i}} \text{ s.t. } f = \sum\_{i \in \mathcal{I}} c\_{i} \rho\_{i}. \tag{6.20}$$

The proof is reported in Sect. 6.9.4 while an application example is given below.

#### **Example 6.14** Let

$$K(\mathbf{x}, \mathbf{y}) = 2\sin^2(\mathbf{x})\sin^2(\mathbf{y}) + 2\cos^2(\mathbf{x})\cos^2(\mathbf{y}) + 1.$$

Using Theorem 6.13, we obtain that the RKHS $\mathcal{H}$ associated to $K$ is spanned by $\sin^2(x)$, $\cos^2(x)$ and the constant function. Now, let $f(x) = 1$ and consider the problem of computing $\|f\|\_{\mathcal{H}}^2$. To have a correspondence with (6.8) we can, e.g., fix the notation

$$\rho\_1(\mathbf{x}) = \sin^2(\mathbf{x}), \quad \rho\_2(\mathbf{x}) = \cos^2(\mathbf{x}), \quad \rho\_3(\mathbf{x}) = 1$$

and

ζ<sup>1</sup> = 2, ζ<sup>2</sup> = 2, ζ<sup>3</sup> = 1.

Since the functions $\rho\_i$ are not independent, many different representations for $f(x) = 1$ can be found. In particular, one has

$$1 = c\rho\_1(\mathbf{x}) + c\rho\_2(\mathbf{x}) + (1 - c)\rho\_3(\mathbf{x}) \quad \forall c \in \mathbb{R},$$

so that

$$\|f\|\_{\mathcal{H}}^2 = \min\_{c} \frac{c^2}{2} + \frac{c^2}{2} + (1 - c)^2 = \min\_{c} \ 2c^2 - 2c + 1 = \frac{1}{2}$$

with the minimum 1/2 obtained at $c = 1/2$. Hence, according to the norm of $\mathcal{H}$, the "minimum energy" representation of $f(x) = 1$ is $\frac{1}{2}(\rho\_1(x) + \rho\_2(x) + \rho\_3(x))$.
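The one-dimensional minimization in Example 6.14 is easy to confirm numerically, e.g., with a simple grid search:

```python
import numpy as np

# Sketch: grid check of the minimization in Example 6.14.
c = np.linspace(-2.0, 2.0, 400_001)
obj = c ** 2 / 2 + c ** 2 / 2 + (1 - c) ** 2   # = 2c^2 - 2c + 1

assert np.isclose(obj.min(), 0.5)              # minimum value 1/2 ...
assert np.isclose(c[np.argmin(obj)], 0.5)      # ... attained at c = 1/2
```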

## **6.4 Kernel-Based Regularized Estimation**

## *6.4.1 Regularization in Reproducing Kernel Hilbert Spaces and the Representer Theorem*

A powerful approach to reconstruct a function $g : \mathcal{X} \to \mathbb{R}$ from sparse data $\{x\_i, y\_i\}\_{i=1}^N$ consists of minimizing a suitable functional over a RKHS. An important generalization of the estimators based on quadratic penalties, denoted by ReLS-Q in Chap. 3, is defined by

$$\hat{g} = \underset{f \in \mathcal{H}}{\text{arg min }} \sum\_{i=1}^{N} \mathcal{V}\_{i}(y\_{i}, f(x\_{i})) + \gamma \|f\|\_{\mathcal{H}}^{2}. \tag{6.21}$$

In (6.21), the $\mathcal{V}\_i$ are loss functions measuring the distance between $y\_i$ and $f(x\_i)$. They take only nonnegative values and are assumed convex w.r.t. their second argument $f(x\_i)$. As an example, when the quadratic loss is adopted for any $i$, one obtains

$$\mathcal{V}\_i(\mathbf{y}\_i, f(\mathbf{x}\_i)) = (\mathbf{y}\_i - f(\mathbf{x}\_i))^2.$$

Then, the norm $\|\cdot\|\_{\mathcal{H}}$ defines the regularizer, e.g., given by the energy of the first-order derivative

$$\|f\|\_{\mathcal{H}}^2 = \int\_0^1 \dot{f}^2(x)dx,$$

which corresponds to the spline norm introduced in Example 6.5. Finally, the positive scalar $\gamma$ is the regularization parameter (already encountered in the previous chapters) which has to balance adherence to the experimental data and function regularity. Indeed, the idea underlying (6.21) is that the predictor $\hat{g}$ should be able to describe the data without being too complex according to the RKHS norm. In particular, the purpose of the regularizer is to restore the well-posedness of the problem, making the solution depend continuously on the data. It should also encode our available information on the unknown function, e.g., the expected smoothness level.

The importance of the RKHSs in the context of regularization methods stems from the following central result, whose first formulation can be found in [52]. It shows that the solutions of the class of variational problems (6.21) admit a finite-dimensional representation, independently of the dimension of $\mathcal{H}$. The proof of an extended version of this result can be found in Sect. 6.9.5.

**Theorem 6.15** (Representer theorem, adapted from [104]) *Let H be a RKHS. Then, all the solutions of (6.21) admit the following expression*

$$
\hat{\mathfrak{g}} = \sum\_{i=1}^{N} c\_i K\_{x\_i},
\tag{6.22}
$$

*where the ci are suitable scalar expansion coefficients.*

Thus, as in the traditional linear parametric approach, the optimal function is a linear combination of basis functions. However, a fundamental difference is that their number is now equal to the number of data pairs, and is thus not fixed a priori. In fact, the functions appearing in the expression of the minimizer $\hat{g}$ are just the kernel sections $K\_{x\_i}$ centred on the input data. The representer theorem also conveys the message that, using estimators of the form (6.21), it is not possible to recover arbitrarily complex functions from a finite amount of data. The solution is always confined to a subspace of dimension equal to the data set size.

Now, let $\mathbf{K} \in \mathbb{R}^{N \times N}$ be the positive semidefinite matrix (called *kernel matrix*, or Gram matrix) such that $\mathbf{K}\_{ij} = K(x\_i, x\_j)$. The $i$th row of $\mathbf{K}$ is denoted by $\mathbf{k}\_i$. Using this notation, if $g = \sum\_{i=1}^N c\_i K\_{x\_i}$, then

$$g(x\_i) = \mathbf{k}\_i c \quad \text{and} \quad \|g\|\_{\mathcal{H}}^2 = c^T \mathbf{K} c,\tag{6.23}$$

where $c = [c\_1, \ldots, c\_N]^T$ and the second equality derives from the reproducing property or, equivalently, from (6.4).

Using the representer theorem, we can plug the expression (6.22) of the optimal *g*ˆ into the objective (6.21). Then, exploiting (6.23), the variational problem (6.21) boils down to

$$\min\_{c \in \mathbb{R}^N} \sum\_{i=1}^N \mathcal{V}\_i(\mathbf{y}\_i, \mathbf{k}\_i c) + \gamma c^T \mathbf{K} c. \tag{6.24}$$

The regularization problem (6.21) has been thus reduced to a finite-dimensional optimization problem whose order $N$ does not depend on the dimension of the original space $\mathcal{H}$. In addition, since each loss function $\mathcal{V}\_i$ has been assumed convex, the objective (6.24) is convex overall. How to compute the expansion coefficients now depends on the specific choice of the $\mathcal{V}\_i$, as discussed in the next section.

**Remark 6.3** *(Kernel trick and implicit basis functions encoding)* Assume that the kernel admits the expansion $K(x, y) = \sum\_{i=1}^{\infty} \zeta\_i \rho\_i(x)\rho\_i(y)$, $\zeta\_i > 0$. Then, as discussed in Sect. 6.3, any function in $\mathcal{H}$ has the representation

$$f = \sum\_{i=1}^{\infty} a\_i \rho\_i \text{ with } \|f\|\_{\mathcal{H}}^2 = \sum\_{j=1}^{\infty} \frac{a\_j^2}{\zeta\_j}.$$

Problem (6.21) can then be rewritten using the infinite-dimensional vector $a = [a\_1 \ a\_2 \ \ldots]$ as unknown:

$$
\hat{a} = \arg\min\_{a} \sum\_{i=1}^{N} \mathcal{V}\_{i} \left( y\_{i}, \sum\_{j=1}^{\infty} a\_{j} \rho\_{j}(x\_{i}) \right) + \gamma \sum\_{j=1}^{\infty} \frac{a\_{j}^{2}}{\zeta\_{j}},
$$

and an equivalent representation of (6.22) becomes $\hat{g} = \sum\_{i=1}^{\infty} \hat{a}\_i \rho\_i$. In comparison with this reformulation, the use of the kernel and of the representer theorem brings modelling and computational advantages. In fact, through $K$ one needs neither to choose the number of basis functions to be used (the kernel can already include, implicitly, an infinite number of basis functions) nor to store any basis function in memory (the representer theorem reduces inference to solving a finite-dimensional optimization problem based on the kernel matrix $\mathbf{K}$). These features are related to what is called the *kernel trick* in the machine learning literature.
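For a kernel with finitely many basis functions, the equivalence between the two formulations can be seen directly: ridge regression on the weighted features gives the same predictions as the kernel form. The sketch below uses five illustrative sinusoidal basis functions and invented data:

```python
import numpy as np

# Sketch of the kernel trick for a finite expansion (all data below invented):
# primal ridge on weighted features vs dual solution with K = Phi Phi^T.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
y = rng.standard_normal(15)
gamma = 0.5

j = np.arange(1, 6)
zeta = 1.0 / j ** 2
Phi = np.sqrt(zeta) * np.sin(np.outer(x, j * np.pi))   # Phi_ij = sqrt(zeta_j) rho_j(x_i)

# primal: min ||y - Phi w||^2 + gamma ||w||^2   (w_j = a_j / sqrt(zeta_j))
w = np.linalg.solve(Phi.T @ Phi + gamma * np.eye(5), Phi.T @ y)

# dual: kernel matrix K = Phi Phi^T, coefficients c = (K + gamma I)^{-1} y
K = Phi @ Phi.T
c = np.linalg.solve(K + gamma * np.eye(15), y)

# identical predictions on the training inputs
assert np.allclose(Phi @ w, K @ c)
```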

## *6.4.2 Representer Theorem Using Linear and Bounded Functionals*

A more general version of the representer theorem, obtained in [52], follows by replacing $f(x\_i)$ with $L\_i[f]$, where $L\_i$ is linear and bounded. In the first part of the following result, $\mathcal{H}$ is just required to be Hilbert. In Sect. 6.9.5 we will see how Theorem 6.16 can be further generalized.

**Theorem 6.16** (Representer theorem with functionals *Li* , adapted from [104]) *Let H be a Hilbert space and consider the optimization problem*

$$\hat{g} = \operatorname\*{arg\,min}\_{f \in \mathcal{H}} \sum\_{i=1}^{N} \mathcal{V}\_{i}(y\_{i}, L\_{i}[f]) + \gamma \|f\|\_{\mathcal{H}}^{2},\tag{6.25}$$

*where each $L\_i : \mathcal{H} \to \mathbb{R}$ is linear and bounded. Then, all the solutions of (6.25) admit the following expression*

$$\hat{\mathfrak{g}} = \sum\_{i=1}^{N} c\_i \eta\_i,\tag{6.26}$$

*where the ci are suitable scalar expansion coefficients and each* η*<sup>i</sup>* ∈ *H is the representer of Li , i.e., for any i and f* ∈ *H :*

$$L\_i[f] = \langle f, \eta\_i \rangle\_{\mathcal{H}}.\tag{6.27}$$

*In particular, if H is a RKHS with kernel K , each basis function is given by*

$$\eta\_i(\mathbf{x}) = L\_i[K(\cdot, \mathbf{x})].\tag{6.28}$$

The existence of η*<sup>i</sup>* satisfying (6.27) is ensured by the Riesz representation theorem (Theorem 6.27). One can also prove that in a RKHS a linear functional *L* is linear and bounded if and only if the function *f* obtained by applying *L* to the kernel, i.e., *f* (*x*) = *L*[*K*(*x*, ·)] ∀*x*, belongs to the RKHS.

Note also that Theorem 6.15 is indeed a special case of the last result. In fact, let *H* be a RKHS and *Li*[ *f* ] = *f* (*xi*) ∀*i*. Then, each *Li* is linear and bounded and each η*<sup>i</sup>* becomes the kernel section *Kxi* according to the reproducing property.

**Example 6.17** *(Solution using the quadratic loss)* Let us adopt a quadratic loss in (6.25), i.e., $\mathcal{V}\_i(y\_i, L\_i[f]) = (y\_i - L\_i[f])^2$. This makes the objective strictly convex, so that a unique solution exists. To find it, plugging (6.26) into (6.25) and using also (6.28), the following quadratic problem is obtained

$$\left\|\boldsymbol{Y} - \boldsymbol{O}\boldsymbol{c}\right\|^2 + \gamma \boldsymbol{c}^T \boldsymbol{O} \boldsymbol{c} \tag{6.29}$$

where $Y = [y\_1, \ldots, y\_N]^T$, $\|\cdot\|$ is the Euclidean norm, while the $N \times N$ matrix $O$ has $(i, j)$-entry given by

$$O\_{ij} = \langle \eta\_i, \eta\_j \rangle\_{\mathcal{H}} = L\_i[L\_j[K]]. \tag{6.30}$$

The minimizer $\hat{c}$ of (6.29) is unique if $O$ is full rank. Otherwise, all the solutions lead to the same function estimate in view of the (already mentioned) strict convexity of (6.25). In particular, one can always use as optimal expansion coefficients the components of the vector

$$
\hat{c} = \left(O + \gamma I\_N\right)^{-1} Y. \tag{6.31}
$$

In Sect. 6.5.1 this result will be further discussed in the context of the so-called regularization networks, where one comes back to assuming $L\_i[f] = f(x\_i)$.

## **6.5 Regularization Networks and Support Vector Machines**

The choice of the loss $\mathcal{V}\_i$ in (6.21) yields regularization algorithms with different properties. We will illustrate four different cases below.

## *6.5.1 Regularization Networks*

Let us consider the quadratic loss function $\mathcal{V}\_i(y\_i, f(x\_i)) = r\_i^2$, with the residual $r\_i$ defined by $r\_i = y\_i - f(x\_i)$. Such a loss, also depicted in Fig. 6.4 (top left panel), leads to the problem

$$\hat{g} = \operatorname\*{arg\,min}\_{f \in \mathcal{H}} \sum\_{i=1}^{N} \left( y\_{i} - f(x\_{i}) \right)^{2} + \gamma \left\| f \right\|\_{\mathcal{H}}^{2},\tag{6.32}$$

which is a generalization of the regularized least squares problem encountered in the previous chapters. In particular, it extends the estimator (3.58a) based on the quadratic penalty, called ReLS-Q in Chap. 3. The estimator (6.32) is known in the literature as a *regularization network* [71] or also as *kernel ridge regression*. The strict convexity of the objective (6.32) ensures that the minimizer $\hat{g}$ not only exists but is also unique (this issue is further discussed in the remark at the end of this subsection).

To find the solution, we can follow the same arguments developed in Example 6.17, just specializing the result to the case *Li*[ *f* ] = *f* (*xi*). We will see that the matrix *O* has just to be replaced by the kernel matrix **K**.

As previously done, let $Y = [y\_1, \ldots, y\_N]^T$ and use $\|\cdot\|$ to indicate the Euclidean norm. Then, the corresponding regularization problem (6.24) becomes

**Fig. 6.4** Loss functions examples: quadratic (top left), Huber with $\delta = 1$ (top right), Vapnik with $\varepsilon = 0.5$ (bottom left) and hinge (bottom right). The first three losses are all functions of the residual $r = y - f(x)$, while the hinge loss depends on the margin $m = y f(x)$

$$\min\_{c \in \mathbb{R}^N} \left\| Y - \mathbf{K}c \right\|^2 + \gamma c^T \mathbf{K}c,\tag{6.33}$$

which is a finite-dimensional ReLS-Q. After simple calculations, one of the optimal solutions<sup>2</sup> is found to be

$$
\hat{c} = \left(\mathbf{K} + \gamma I\_N\right)^{-1} Y,\tag{6.34}
$$

where $I\_N$ is the $N \times N$ identity matrix. The estimate from the regularization network is thus available in closed form, given by $\hat{g} = \sum\_{i=1}^N \hat{c}\_i K\_{x\_i}$ with the optimal coefficient vector $\hat{c}$ solving a linear system of equations.
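The closed form (6.34) takes only a few lines to implement. Below is a minimal regularization-network sketch with the spline kernel $\min(x, y)$; the data, target function and value of $\gamma$ are invented for illustration:

```python
import numpy as np

# Minimal regularization-network (kernel ridge regression) sketch.
rng = np.random.default_rng(1)
x = np.linspace(0.05, 1.0, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)   # noisy samples
gamma = 0.1

K = np.minimum.outer(x, x)                               # kernel matrix
c_hat = np.linalg.solve(K + gamma * np.eye(x.size), y)   # (6.34)

def g_hat(t):
    """ghat(t) = sum_i chat_i K(x_i, t)."""
    return np.minimum.outer(np.atleast_1d(t), x) @ c_hat

# the regularized fit explains the data better than the constant predictor
assert np.mean((g_hat(x) - y) ** 2) < np.mean((y - y.mean()) ** 2)
```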

**Remark 6.4** *(Regularization network as projection)* An interpretation of the regularization network can also be given in terms of a projection. In particular, let $\mathcal{R}$

<sup>2</sup> Similarly to what was discussed in Example 6.17, if $\mathbf{K}$ is not full rank, the solution of (6.33) is not unique. In fact, the minimizers are given by the sum of (6.34) and any element of the null space of the kernel matrix. However, all of them lead to the same function estimate $\hat{g}$.

be the Hilbert space $\mathbb{R}^N \times \mathcal{H}$ (any element is a couple containing a vector $v$ and a function $f$) with norm defined, for any $v \in \mathbb{R}^N$ and $f \in \mathcal{H}$, by

$$\left\| \left( v, f \right) \right\|\_{\mathcal{R}}^2 = \left\| v \right\|^2 + \gamma \left\| f \right\|\_{\mathcal{H}}^2, \ \gamma > 0, \ \left\| \cdot \right\| = \text{Euclidean norm.} $$

Let also $\mathcal{S}$ be the (closed) subspace given by all the couples $(v, f)$ satisfying the constraint $v = [f(x\_1) \ldots f(x\_N)]$. Then, if $g = (Y, 0)$, where 0 here denotes the null function in $\mathcal{H}$, the projection of $g$ onto $\mathcal{S}$ is

$$\begin{aligned} g\_{\mathcal{S}} &= \operatorname\*{arg\,min}\_{h \in \mathcal{S}} \|g - h\|\_{\mathcal{R}}^2 \\ &= \operatorname\*{arg\,min}\_{([f(x\_1) \ldots f(x\_N)], f), \ f \in \mathcal{H}} \sum\_{i=1}^N (y\_i - f(x\_i))^2 + \gamma \|f\|\_{\mathcal{H}}^2. \end{aligned}$$

It is now immediate to conclude that $g\_{\mathcal{S}}$ corresponds to $([\hat{g}(x\_1) \ldots \hat{g}(x\_N)], \hat{g})$, where $\hat{g}$ is indeed the minimizer of (6.32), which must thus be unique in view of Theorem 6.25 (Projection theorem). Note that this interpretation can be extended to all the variational problems (6.21) whose losses are defined by a norm induced by an inner product in $\mathbb{R}^N$.

#### *6.5.2 Robust Regression via Huber Loss*

As described in Sect. 3.6.1, a shortcoming of the quadratic loss is its sensitivity to outliers, because the influence of large residuals $r\_i$ grows quadratically. In the presence of outliers, it is better to use a loss function that grows linearly. These issues have been widely studied in the field of robust statistics [51], where loss functions such as Huber's have been introduced. Recalling (3.115), one has

$$\mathcal{V}\_i(y\_i, f(x\_i)) = \begin{cases} \frac{r\_i^2}{2}, & |r\_i| \le \delta \\ \delta \left( |r\_i| - \frac{\delta}{2} \right), & |r\_i| > \delta \end{cases},$$

where we still have $r\_i = y\_i - f(x\_i)$. The Huber loss function with $\delta = 1$ is shown in Fig. 6.4 (top right panel). Notice that it grows linearly for large residuals and is thus robust to outliers. When $\delta \to +\infty$, one recovers the quadratic loss. On the other hand, we also have $\lim\_{\delta \to 0^+} \mathcal{V}\_i(y\_i, f(x\_i))/\delta = |r\_i|$, that is, the absolute value loss.
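The piecewise definition and its two limiting regimes can be checked with a short sketch (the test residuals are arbitrary):

```python
import numpy as np

# Sketch of the Huber loss and its limiting behaviours.
def huber(r, delta):
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    r ** 2 / 2,                        # quadratic near zero
                    delta * (np.abs(r) - delta / 2))   # linear for large residuals

assert np.isclose(huber(0.5, 1.0), 0.125)      # quadratic branch: 0.25 / 2
assert np.isclose(huber(3.0, 1.0), 2.5)        # linear branch: 1 * (3 - 0.5)
# delta -> 0+: huber(r, delta) / delta -> |r|
assert np.isclose(huber(3.0, 1e-8) / 1e-8, 3.0)
```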

#### *6.5.3 Support Vector Regression*

Sometimes, it is desirable to neglect prediction errors, as long as they are below a certain threshold. This can be achieved, e.g., using Vapnik's $\varepsilon$-insensitive loss given, for $r\_i = y\_i - f(x\_i)$, by

$$\mathcal{V}\_i(\mathbf{y}\_i, f(\mathbf{x}\_i)) = |r\_i|\_\varepsilon = \begin{cases} 0, & |r\_i| \le \varepsilon \\ |r\_i| - \varepsilon, & |r\_i| > \varepsilon \end{cases}.$$

The Vapnik loss with $\varepsilon = 0.5$ is shown in Fig. 6.4 (bottom left panel). Notice that it has a null plateau in the interval $[-\varepsilon, \varepsilon]$, so that any predictor closer than $\varepsilon$ to $y\_i$ is seen as a perfect interpolant. The loss then grows linearly, thus ensuring robustness. The regularization problem (6.21) associated with the $\varepsilon$-insensitive loss function turns out to be

$$\hat{g} = \operatorname\*{arg\,min}\_{f \in \mathcal{H}} \sum\_{i=1}^{N} |y\_{i} - f(x\_{i})|\_{\varepsilon} + \gamma \|f\|\_{\mathcal{H}}^{2},\tag{6.35}$$

and is called *Support Vector Regression* (SVR), see, e.g., [37]. The SVR solution, given by $\hat{g} = \sum\_{i=1}^N \hat{c}\_i K\_{x\_i}$ according to the representer theorem, is characterized by sparsity in $\hat{c}$, i.e., some components $\hat{c}\_i$ are set to zero. This feature is briefly discussed below.

In the SVR case, obtaining the optimal coefficient vector $\hat{c}$ by (6.24) is not trivial, since the loss $|\cdot|\_{\varepsilon}$ is not differentiable everywhere. This difficulty can be circumvented by replacing (6.24) with the following equivalent problem, obtained by considering two additional $N$-dimensional parameter vectors $\xi$ and $\xi^\*$:

$$\min\_{c, \xi, \xi^\*} \sum\_{i=1}^N (\xi\_i + \xi\_i^\*) + \gamma c^T \mathbf{K} c,\tag{6.36}$$

subject to the constraints

$$\begin{aligned} &y\_i - \mathbf{k}\_i c \le \varepsilon + \xi\_i, \quad i = 1, \dots, N, \\ &\mathbf{k}\_i c - y\_i \le \varepsilon + \xi\_i^\*, \quad i = 1, \dots, N, \\ &\xi\_i, \xi\_i^\* \ge 0, \qquad i = 1, \dots, N. \end{aligned}$$

To see that its minimizer contains the optimal solution $\hat{c}$ of (6.24), it suffices to notice that (6.36) assigns a linear penalty only when $|y\_i - \mathbf{k}\_i c| > \varepsilon$.

Problem (6.36) is quadratic with linear inequality constraints, hence it is solvable by standard optimization approaches like interior point methods [64, 108]. Working out the Karush–Kuhn–Tucker conditions, it is possible to show that the condition $|y\_i - \mathbf{k}\_i \hat{c}| < \varepsilon$ implies $\hat{c}\_i = 0$. The indexes $i$ for which $\hat{c}\_i \neq 0$ instead identify the set of input locations $x\_i$ called *support vectors*.
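As an illustration of (6.36), the sketch below solves the constrained problem on a small synthetic data set with a general-purpose solver (SciPy's SLSQP, a stand-in for the dedicated interior point methods cited above); the Gaussian kernel, the data and all numerical values are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(8)
N, eps, gamma = len(x), 0.1, 0.1

# Gaussian kernel matrix; k_i is the i-th row of K (kernel choice is illustrative)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)

def objective(z):
    # z stacks c (N entries), xi (N entries), xi* (N entries) as in (6.36)
    c, xi, xis = z[:N], z[N:2 * N], z[2 * N:]
    return np.sum(xi + xis) + gamma * c @ K @ c

constraints = [
    # y_i - k_i c <= eps + xi_i
    {"type": "ineq", "fun": lambda z: eps + z[N:2 * N] - (y - K @ z[:N])},
    # k_i c - y_i <= eps + xi*_i
    {"type": "ineq", "fun": lambda z: eps + z[2 * N:] - (K @ z[:N] - y)},
]
bounds = [(None, None)] * N + [(0, None)] * (2 * N)   # xi, xi* >= 0
res = minimize(objective, np.zeros(3 * N), method="SLSQP",
               bounds=bounds, constraints=constraints)
c_hat = res.x[:N]
```

At the optimum, the residuals satisfy the ε-insensitive constraints up to the slacks, as required by (6.36).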

#### *6.5.4 Support Vector Classification*

The three losses illustrated above were originally proposed for regression problems, where the output $y$ is real valued. When the outputs can assume only two values, e.g., 1 and −1, a classification problem arises. Here, the goal of the predictor is just to separate the two classes. This problem can be seen as a special case of regression. In particular, even if the output space is binary, consider prediction functions $f : \mathcal{X} \to \mathbb{R}$ and assume that the input $x\_i$ is associated to the class 1 if $f(x\_i) \geq 0$ and to the class −1 if $f(x\_i) < 0$. Let the margin on an example $(x\_i, y\_i)$ be $m\_i = y\_i f(x\_i)$. Then, we will see that the value of $m\_i$ is a measure of how well we are classifying the available data. One can thus try to maximize the margin while still searching for a function not too complex according to the RKHS norm. In particular, we can exploit (6.21) with a loss that depends on the margin, as described below.

The most natural classification loss is the 0 − 1 loss defined for any *i* by

$$\mathcal{V}\_i(\mathbf{y}\_i, f(\mathbf{x}\_i)) = \begin{cases} 0, \ m\_i > 0 \\ 1, \ m\_i \le 0 \end{cases}, \quad m\_i = \mathbf{y}\_i f(\mathbf{x}\_i),$$

and depicted in Fig. 6.4 (bottom right panel, dashed line). Adopting it, the first component of the objective in (6.21) returns the number of misclassifications. However, the 0 − 1 loss is not convex and leads to an optimization problem of combinatorial nature.

An alternative is the so-called hinge loss [98] defined by

$$\mathcal{V}\_i(\mathbf{y}\_i, f(\mathbf{x}\_i)) = |1 - \mathbf{y}\_i f(\mathbf{x}\_i)|\_+ = \begin{cases} 0, & m > 1 \\ 1 - m, & m \le 1 \end{cases}, \quad m = \mathbf{y}\_i f(\mathbf{x}\_i),$$

which thus provides a linear penalty when $m < 1$. Figure 6.4 (bottom right panel, solid line) illustrates that it is a convex upper bound on the 0 − 1 loss. The problem associated with the hinge loss turns out to be

$$\hat{g} = \operatorname\*{arg\,min}\_{f \in \mathcal{H}} \sum\_{i=1}^{N} |1 - y\_i f(\mathbf{x}\_i)|\_+ + \gamma \left\| f \right\|\_{\mathcal{H}}^2,\tag{6.37}$$

and is called *support vector classification* (SVC).

Like in the SVR case, obtaining the optimal coefficient vector by (6.37) is not trivial since the hinge loss is not differentiable. But one can still resort to an equivalent problem, now obtained considering just an additional parameter vector ξ:

$$\min\_{c, \xi} \sum\_{i=1}^{N} \xi\_i + \gamma c^T \mathbf{K} c,\tag{6.38}$$

subject to the constraints

$$\begin{aligned} y\_i(\mathbf{k}\_i c) &\geq 1 - \xi\_i, & i = 1, \dots, N, \\ \xi\_i &\geq 0, & i = 1, \dots, N. \end{aligned}$$

As in the SVR case, the optimal solution $\hat{c}$ is sparse and the indexes $i$ for which $\hat{c}\_i \neq 0$ define the *support vectors* $x\_i$.
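A short numerical check of the convex upper bound property mentioned above (function names are ours, not from the text):

```python
import numpy as np

def zero_one(m):
    # 0-1 loss as a function of the margin m = y f(x)
    return np.where(m > 0, 0.0, 1.0)

def hinge(m):
    # hinge loss |1 - m|_+
    return np.maximum(0.0, 1.0 - m)

m = np.linspace(-3, 3, 601)   # grid of margin values
```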

## **6.6 Kernel Examples**

The reproducing kernel characterizes the hypothesis space $\mathcal{H}$. Together with the loss function, it also completely defines the key estimator (6.21), which exploits the RKHS norm as regularizer. The choice of $K$ thus has a crucial impact on the ability to predict future output data. Some important RKHSs are discussed below.

## *6.6.1 Linear Kernels, Regularized Linear Regression and System Identification*

We now show that the regularization network (6.32) generalizes the ReLS-Q problem introduced in Chap. 3 which adopts quadratic penalties. The link is provided by the concept of *linear kernel*.

We start assuming that the input space is $\mathcal{X} = \mathbb{R}^m$. Hence, any input location $x$ corresponds to an $m$-dimensional (column) vector. If $P \in \mathbb{R}^{m \times m}$ denotes a symmetric and positive semidefinite matrix, a linear kernel is defined as follows

$$K(\mathbf{y}, \mathbf{x}) = \mathbf{y}^T P \mathbf{x}, \quad (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^m \times \mathbb{R}^m.$$

All the kernel sections are linear functions. Hence, their span defines a finite-dimensional (closed) subspace of linear functions that, in view of Theorem 6.1 (and subsequent discussion), coincides with the whole $\mathcal{H}$. Hence, the RKHS induced by the linear kernel is simply a space of linear functions and, for any $g \in \mathcal{H}$, there exists $a \in \mathbb{R}^m$ such that

$$\mathbf{g}(\mathbf{x}) = a^T P \mathbf{x} = K\_a(\mathbf{x}).$$

If *P* is full rank, letting θ := *Pa*, we also have

$$\begin{aligned} \|g\|\_{\mathcal{H}}^{2} &= \|K\_{a}\|\_{\mathcal{H}}^{2} = \langle K\_{a}, K\_{a}\rangle\_{\mathcal{H}} \\ &= K(a, a) = a^{T} P a \\ &= \theta^{T} P^{-1} \theta. \end{aligned}$$

Now, let us use such $\mathcal{H}$ in the regularization network (6.32). Without using the representer theorem, we can plug the representation $g(x) = \theta^T x$ into the regularization problem to obtain $\hat{g}(x) = \hat{\theta}^T x$ where

$$\hat{\theta} = \operatorname\*{arg\,min}\_{\theta \in \mathbb{R}^m} \|Y - \Phi \theta\|^2 + \gamma \theta^T P^{-1} \theta,\tag{6.39}$$

with the $i$th row of the regression matrix $\Phi$ equal to $x\_i^T$. One can see that (6.39) coincides with ReLS-Q, with the regularization matrix $P$ defining the linear kernel $K$ and, in turn, the penalty term $\theta^T P^{-1}\theta$.
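This equivalence can be verified numerically: the sketch below computes $\hat\theta$ both from the closed form of (6.39) and through the kernel route $\mathbf{K} = \Phi P \Phi^T$, $\hat{c} = (\mathbf{K} + \gamma I\_N)^{-1} Y$, $\hat\theta = P\Phi^T\hat{c}$; the data and dimensions are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, gamma = 20, 4, 0.5
Phi = rng.standard_normal((N, m))          # rows are the input locations x_i^T
theta_true = np.array([1.0, -0.5, 0.25, 0.0])
Y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

A = rng.standard_normal((m, m))
P = A @ A.T + np.eye(m)                    # symmetric positive definite kernel matrix

# ReLS-Q route: closed-form solution of (6.39)
theta_direct = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P),
                               Phi.T @ Y)

# Kernel route: regularization network with the linear kernel K(y, x) = y^T P x
K = Phi @ P @ Phi.T
c_hat = np.linalg.solve(K + gamma * np.eye(N), Y)
theta_kernel = P @ Phi.T @ c_hat
```

The two routes coincide by the matrix inversion lemma: $(\Phi^T\Phi + \gamma P^{-1})^{-1}\Phi^T = P\Phi^T(\Phi P \Phi^T + \gamma I\_N)^{-1}$.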

We now derive the connection with linear system identification in discrete time. The data set consists of the output measurements $\{y\_i\}\_{i=1}^N$, collected at the time instants $\{t\_i\}\_{i=1}^N$, and of the system input $u$. We can form each input location using past input values as follows

$$\mathbf{x}\_{i} = \begin{bmatrix} u\_{t\_i-1} \ u\_{t\_i-2} \ \dots \ u\_{t\_i-m} \end{bmatrix}^{T},\tag{6.40}$$

where $m$ is the FIR order and an input delay of one unit has been assumed. Then, if $Y$ collects the noisy outputs, $\hat{\theta}$ becomes the impulse response estimate. This establishes a correspondence between regularized FIR estimation and regularization in the RKHS induced by linear kernels.
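The correspondence can be illustrated with a short simulation: a decaying FIR system is estimated via (6.39) using a diagonal prior $P$ encoding exponential decay (a simplified stand-in for the stability-enforcing kernels discussed in the next chapter); all numerical values are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 15, 200                       # FIR order and number of outputs
u = rng.standard_normal(N + m)       # system input samples u_0, ..., u_{N+m-1}
g_true = 0.8 ** np.arange(1, m + 1)  # true (decaying) impulse response

# Input locations (6.40): the row for output time t is [u_{t-1}, ..., u_{t-m}]
Phi = np.array([[u[t - k] for k in range(1, m + 1)] for t in range(m, m + N)])
Y = Phi @ g_true + 0.1 * rng.standard_normal(N)

# Diagonal prior encoding exponential decay (an illustrative choice of P)
P = np.diag(0.9 ** np.arange(1, m + 1))
gamma = 1.0
theta_hat = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
```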

#### **6.6.1.1 Infinite-Dimensional Extensions**

In place of $\mathcal{X} = \mathbb{R}^m$, let now $\mathcal{X} \subset \mathbb{R}^\infty$, i.e., the input space contains sequences. We can interpret any input location as an infinite-dimensional column vector and use the ordinary notation of algebra to handle infinite-dimensional objects. For instance, if $x, y \in \mathcal{X}$ then $x^T y = \langle x, y\rangle\_2$, where $\langle \cdot, \cdot \rangle\_2$ is the inner product in $\ell\_2$. Assume we are given a symmetric and infinite-dimensional matrix $P$ such that the linear kernel

$$K(\mathbf{y}, \mathbf{x}) = \mathbf{y}^T P \mathbf{x}$$

is well defined over a subset of $\mathbb{R}^\infty \times \mathbb{R}^\infty$. For example, if $P$ is absolutely summable, i.e., $\sum\_{ij} |P\_{ij}| < \infty$, the kernel is well defined for any input location $x \in \mathcal{X}$ with $\mathcal{X} = \ell\_\infty$. The kernel section centred on $x$ is the infinite-dimensional column vector $Px$. Following arguments similar to those seen in the finite-dimensional case, one can conclude that the RKHS associated to such $K$ contains linear functions of the form $g(x) = a^T P x$ with $a \in \mathcal{X}$. Roughly speaking, the regularization network (6.32) relying on such a hypothesis space is the limit of Problem (6.39) for $m \to \infty$. To compute the solution, in this case it is necessary to resort to the representer theorem (6.22). One obtains

$$\hat{\mathfrak{g}}(\mathbf{x}) = \sum\_{i=1}^{N} \hat{c}\_i K\_{x\_i}(\mathbf{x}) = \hat{\theta}^T \mathbf{x}$$

where $\hat{c}$ is defined by (6.34) and

$$\hat{\theta} := \sum\_{i=1}^{N} \hat{c}\_i P \mathbf{x}\_i.$$

The link with linear system identification follows the same reasoning previously developed, but $x\_i$ now contains an infinite number of past input values, i.e.,

$$\mathbf{x}\_{i} = \begin{bmatrix} u\_{t\_i-1} \ u\_{t\_i-2} \ u\_{t\_i-3} \ \dots \end{bmatrix}^{T}.$$

With this correspondence, the regularization network now implements regularized IIR estimation and $\hat{\theta}$ contains the impulse response coefficient estimates. In fact, note that the nature of $x\_i$ makes the value $\hat{g}(x\_i)$ the convolution between the system input $u$ and $\hat{\theta}$ evaluated at $t\_i$ (with one unit input delay).

In a more sophisticated scenario, in place of sequences, the input space $\mathcal{X}$ could contain functions. For instance, $\mathcal{X} \subset \mathcal{P}\_c$ where $\mathcal{P}\_c$ is the space of piecewise continuous functions on $\mathbb{R}\_+$. Thus, each input location corresponds to a function $x : \mathbb{R}\_+ \to \mathbb{R}$. Given a suitable symmetric function $P : \mathbb{R}\_+ \times \mathbb{R}\_+ \to \mathbb{R}$, a linear kernel is now defined by

$$K(\mathbf{y}, \mathbf{x}) = \int\_{\mathbb{R}^+ \times \mathbb{R}^+} \mathbf{y}(t) P(t, \tau) \mathbf{x}(\tau) dt d\tau.$$

The corresponding RKHS thus contains linear functionals: any *f* ∈ *H* maps *x* (which is a function) into R. The solution of the regularization network (6.32) equipped with such hypothesis space is

$$\hat{\mathfrak{g}}(\mathfrak{x}) = \sum\_{i=1}^{N} \hat{c}\_i K\_{x\_i}(\mathfrak{x}) = \int\_{\mathbb{R}^+} \hat{\theta}(\tau) \mathfrak{x}(\tau) d\tau,$$

where $\hat{c}$ is still defined by (6.34) and

$$\hat{\theta}(\tau) := \sum\_{i=1}^{N} \hat{c}\_i \int\_{\mathbb{R}^+} P(\tau, t) \mathbf{x}\_i(t) dt.$$

The connection with linear system identification is obtained by defining

$$x\_i(t) = u(t\_i - t), \quad t \ge 0$$

(if the input $u(t)$ is continuous for $t \ge 0$ and causal, the functions $x\_i(t)$ are piecewise continuous, making the assumption $\mathcal{X} \subset \mathcal{P}\_c$ necessary). In this way, each $g \in \mathcal{H}$ represents a different linear system. Furthermore, the regularization network (6.32) implements regularized system identification in continuous time and $\hat{\theta}$ is the continuous-time impulse response estimate. The class of kernels which include the BIBO stability constraint will be discussed in the next chapter.

## *6.6.2 Kernels Given by a Finite Number of Basis Functions*

Assume we are given an input space $\mathcal{X}$ and $m$ independent functions $\rho\_i : \mathcal{X} \to \mathbb{R}$. Then, we define

$$K(\mathbf{x}, \mathbf{y}) = \sum\_{i=1}^{m} \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y}).$$

It is easy to verify that $K$ is a positive definite kernel. Recalling Theorem 6.13, the associated RKHS coincides with the $m$-dimensional space spanned by the basis functions $\rho\_i$. Each function in $\mathcal{H}$ has the representation $g(x) = \sum\_{i=1}^m \theta\_i \rho\_i(x)$ and, in view of (6.20) and the independence of the basis functions, one has

$$\|g\|\_{\mathcal{H}}^2 = \sum\_{i=1}^m \theta\_i^2.$$

Consider now the regularization network (6.32) equipped with such a hypothesis space. The solution can be computed without using the representer theorem by plugging into (6.32) the expression of $g$ as a function of $\theta$. Letting $\Phi \in \mathbb{R}^{N \times m}$ with $\Phi\_{ij} = \rho\_j(x\_i)$, we obtain $\hat{g} = \sum\_{i=1}^m \hat{\theta}\_i \rho\_i$ with

$$\hat{\theta} = \arg\min\_{\theta \in \mathbb{R}^m} \|Y - \Phi \theta\|^2 + \gamma \|\theta\|^2. \tag{6.41}$$

The solution (6.41) coincides with the ridge regression estimate introduced in Sect. 1.2.

#### *6.6.3 Feature Map and Feature Space*

Let *F* be a space endowed with an inner product, and assume that a representation of the form

$$K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle\_{\mathcal{F}}, \qquad \phi: \mathcal{X} \to \mathcal{F},\tag{6.42}$$

is available. Then, it follows immediately that *K* is a positive definite kernel. In this context, φ is called a *feature map*, and *F* the *feature space*. For instance, to have the connection with the kernel discussed in the previous subsection, we can think of φ as a vector containing *m* functions. It is defined for any *x* by

$$\phi(\mathbf{x}) = \begin{pmatrix} \rho\_1(\mathbf{x})\\ \rho\_2(\mathbf{x})\\ \vdots\\ \rho\_m(\mathbf{x}) \end{pmatrix}$$

so that $\mathcal{F} = \mathbb{R}^m$ with the Euclidean inner product. Then, we obtain

$$K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle\_2 = \phi^T(\mathbf{x})\phi(\mathbf{y}) = \sum\_{i=1}^m \rho\_i(\mathbf{x})\rho\_i(\mathbf{y}).$$

Now, given any positive definite kernel $K$, Theorem 6.2 (Moore–Aronszajn theorem) implies the existence of at least one feature map, namely, the RKHS map $\phi\_{\mathcal{H}} : \mathcal{X} \to \mathcal{H}$ such that

$$
\phi\_{\mathcal{H}}(\mathbf{x}) = K\_{\mathbf{x}},
$$

where the representation (6.42) follows immediately from the reproducing property. These arguments show that $K$ is a positive definite kernel iff there exists at least one Hilbert space $\mathcal{F}$ and a map $\phi : \mathcal{X} \to \mathcal{F}$ such that $K(x, y) = \langle \phi(x), \phi(y)\rangle\_{\mathcal{F}}$.

Feature maps and feature spaces are not unique since, by introducing any linear isometry *I* : *H* → *F*, one can obtain a representation in a different space:

$$K(\mathbf{x}, \mathbf{y}) = \langle \phi\_{\mathcal{H}}(\mathbf{x}), \phi\_{\mathcal{H}}(\mathbf{y}) \rangle\_{\mathcal{H}} = \langle I \circ \phi\_{\mathcal{H}}(\mathbf{x}), I \circ \phi\_{\mathcal{H}}(\mathbf{y}) \rangle\_{\mathcal{F}}.$$

Now, assume that the kernel admits the decomposition (6.8), i.e.,

$$K(\mathbf{x}, \mathbf{y}) = \sum\_{i=1}^{\infty} \zeta\_i \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y})$$

with $\zeta\_i > 0 \ \forall i$. Then, a *spectral feature map* of $K$ is

$$
\phi\_{\mu} : \mathcal{X} \to \ell\_2
$$

with

$$\phi\_{\mu}(\mathbf{x}) = \{\sqrt{\zeta\_i}\rho\_i(\mathbf{x})\}\_{i=1}^{\infty}, \quad \mathbf{x} \in \mathcal{X}\,.$$

In fact, we have

$$\langle \phi\_{\mu}(\mathbf{x}), \phi\_{\mu}(\mathbf{y}) \rangle\_{2} = \sum\_{i=1}^{\infty} \zeta\_{i} \rho\_{i}(\mathbf{x}) \rho\_{i}(\mathbf{y}) = K(\mathbf{x}, \mathbf{y}).$$

It is also worth pointing out the role of the feature map within the estimation scenario. In many applications, linear functions are not powerful enough models. Kernels define more expressive spaces by (implicitly) mapping the data into a high-dimensional feature space where linear machines can be applied. Then, the use of the estimator (6.21) does not require knowledge of any feature map associated with $K$: the representer theorem shows that the only information needed to compute the estimate is the kernel matrix, as also discussed in Remark 6.3.

## *6.6.4 Polynomial Kernels*

Another example of kernel is the (inhomogeneous) polynomial kernel [70]. For $x, y \in \mathbb{R}^m$, it is defined by

$$K(\mathbf{x}, \mathbf{y}) = (\langle \mathbf{x}, \mathbf{y} \rangle\_2 + c)^p, \quad p \in \mathbb{N}, \quad c \ge 0,$$

with $\langle \cdot, \cdot \rangle\_2$ denoting the classical Euclidean inner product. As an example, assume $c = 1$ and $m = p = 2$ with $x = [x\_a \ x\_b]$ and $y = [y\_a \ y\_b]$. Then, one obtains the kernel expansion

$$K(\mathbf{x}, \mathbf{y}) = 1 + \mathbf{x}\_a^2 \mathbf{y}\_a^2 + \mathbf{x}\_b^2 \mathbf{y}\_b^2 + 2\mathbf{x}\_a \mathbf{x}\_b \mathbf{y}\_a \mathbf{y}\_b + 2\mathbf{x}\_a \mathbf{y}\_a + 2\mathbf{x}\_b \mathbf{y}\_b,$$

of the type (6.8) with the $\rho\_i(x\_a, x\_b)$ given by all the monomials of degree up to 2, i.e., the 6 functions

$$1, \ x\_a^2, \ x\_b^2, \ x\_a x\_b, \ x\_a, \ x\_b.$$

More generally, if $c > 0$, the polynomial kernel induces a $\binom{m+p}{p}$-dimensional RKHS spanned by all possible monomials up to the $p$th degree. The number of basis functions is thus finite but grows combinatorially with $m$ and $p$. This simple example is in some sense opposite to that described in Sect. 6.6.2. It shows how a kernel can be used to implicitly define a rich class of basis functions.
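The expansion above corresponds to the explicit feature map sketched below, where the $\sqrt{2}$ weights absorb the coefficients of the cross terms (one valid choice of feature map among many); the check confirms $K(x, y) = \phi(x)^T \phi(y)$ numerically.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, p=2):
    # inhomogeneous polynomial kernel (<x, y> + c)^p
    return (x @ y + c) ** p

def phi(x):
    # explicit feature map for c = 1, m = p = 2: monomials up to degree 2
    xa, xb = x
    s = np.sqrt(2.0)
    return np.array([1.0, xa ** 2, xb ** 2, s * xa * xb, s * xa, s * xb])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])
```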

## *6.6.5 Translation Invariant and Radial Basis Kernels*

A kernel is said to be translation invariant if there exists $h : \mathcal{X} \to \mathbb{R}$ such that $K(x, y) = h(x - y)$. This class has already been encountered in Example 6.12, where its relationship with the Fourier basis (in the case of a one-dimensional input space) is illustrated. A general characterization is given below, see also [80].

**Theorem 6.18** (Bochner, based on [23]) *A positive definite kernel $K$ over $\mathcal{X} = \mathbb{R}^d$ is continuous and of the form $K(x, y) = h(x - y)$ if and only if there exists a probability measure $\mu$ and a positive scalar $\eta$ such that:*

$$K(\mathbf{x}, \mathbf{y}) = \eta \int\_{\mathcal{X}} \cos \left( \langle \mathbf{z}, \mathbf{x} - \mathbf{y} \rangle\_2 \right) d\mu(\mathbf{z}).$$

Translation invariant kernels also include the class of radial basis function (RBF) kernels of the form $K(x, y) = h(\|x - y\|)$, where $\|\cdot\|$ is the Euclidean norm [85]. A notable example is the so-called *Gaussian kernel*:

$$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{\rho}\right), \quad \rho > 0,\tag{6.43}$$

where ρ denotes the kernel width. This kernel is often used to model functions expected to be somewhat regular. Note however that ρ has an important role in tuning the smoothness level. A low value makes the kernel close to diagonal, so that a low norm can be assigned also to rapidly changing functions. On the other hand, as ρ grows, only functions close to constant are given a low penalty. This is the same phenomenon illustrated in Fig. 6.1.

Another widely adopted kernel, which induces spaces of functions less regular than the Gaussian one, is the *Laplacian kernel* which uses the Euclidean norm in place of the squared Euclidean norm:

$$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|}{\rho}\right), \quad \rho > 0. \tag{6.44}$$

Differently from the kernels described in the first part of Sect. 6.6.1, as well as in Sects. 6.6.2 and 6.6.4, the RKHS associated with any non-constant RBF kernel is infinite dimensional (it cannot be spanned by a finite number of basis functions). The associated RKHS can be shown to be dense in the space of all continuous functions defined on a compact subset $\mathcal{X} \subset \mathbb{R}^m$. This means that every continuous function can be represented in this space with the desired accuracy, as measured by the sup-norm $\sup\_{x \in \mathcal{X}} |f(x)|$. This property is called *universality*. It does not imply that the RKHS induced by a universal kernel includes every continuous function. For instance, the Gaussian kernel is universal, but it has been proved that the induced RKHS does not contain any polynomial, not even the constant function [69].

## *6.6.6 Spline Kernels*

To simplify the exposition, let $\mathcal{X} = [0, 1]$ and let $g^{(j)}$ denote the $j$th derivative of $g$, with $g^{(0)} := g$. Intuitively, in many circumstances an effective regularizer is obtained by penalizing the energy of the $p$th derivative of $g$, i.e., employing

$$\int\_0^1 \left(\mathfrak{g}^{(p)}(x)\right)^2 dx.$$

An interesting question is whether this penalty term can be cast in the RKHS theory. For $p = 1$, a positive answer has been given in Example 6.5. Actually, the answer is positive for any integer $p$. In fact, consider the Sobolev space of functions $g$ whose first $p - 1$ derivatives are absolutely continuous and satisfy $g^{(j)}(0) = 0$ for $j = 0, \dots, p - 1$. The same arguments developed in Example 6.5 for $p = 1$ can be easily generalized to prove that this is an RKHS $\mathcal{H}$ with norm

$$\left\|g\right\|\_{\mathcal{H}}^2 = \int\_0^1 \left(g^{(p)}(x)\right)^2 dx.$$

The corresponding kernel is the *p*th-order spline kernel

$$K(\mathbf{x}, \mathbf{y}) = \int\_0^1 G\_p(\mathbf{x}, u) G\_p(\mathbf{y}, u) du,\tag{6.45}$$

where $G\_p$ is the so-called Green's function given by

$$G\_p(\mathbf{x}, u) = \frac{(\mathbf{x} - u)\_+^{p-1}}{(p-1)!}, \qquad (u)\_+ = \begin{cases} u \text{ if } u \ge 0 \\ 0 \text{ otherwise} \end{cases}.\tag{6.46}$$

Note that the Laplace transform of $G\_p(\cdot, 0)$ is $1/s^p$. Hence, the Green's function is connected with the impulse response of a $p$-fold integrator. When $p = 1$, we recover the linear spline kernel of Example 6.5:

$$K(\mathbf{x}, \mathbf{y}) = \min\{\mathbf{x}, \mathbf{y}\} \tag{6.47}$$

whereas *p* = 2 leads to the popular cubic spline kernel [104]:

$$K(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}\mathbf{y}\min\{\mathbf{x}, \mathbf{y}\}}{2} - \frac{(\min\{\mathbf{x}, \mathbf{y}\})^3}{6}.\tag{6.48}$$

The linear and the cubic spline kernel are displayed in Fig. 6.2.
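The closed form (6.48) can be checked against the defining integral (6.45): for $p = 2$ the Green's function is $G\_2(x, u) = (x - u)\_+$, and a trapezoidal quadrature of (6.45) matches (6.48) to high accuracy (function names are ours).

```python
import numpy as np

def cubic_spline_kernel(x, y):
    """Closed form (6.48)."""
    mn = min(x, y)
    return x * y * mn / 2 - mn ** 3 / 6

def kernel_from_green(x, y, n=200_001):
    """Numerical version of (6.45) with G_2(x, u) = (x - u)_+ (trapezoid rule)."""
    u = np.linspace(0.0, 1.0, n)
    g = np.maximum(x - u, 0.0) * np.maximum(y - u, 0.0)
    du = u[1] - u[0]
    return (g[0] / 2 + g[1:-1].sum() + g[-1] / 2) * du
```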

We can use the spline hypothesis space in the regularization problem (6.21). Then, from the representer theorem one obtains that the estimate $\hat{g}$ is a $p$th-order smoothing spline with derivatives continuous exactly up to order $2p - 2$ (the order's choice is thus related to the expected function smoothness). This can be seen also from the kernel sections plotted in Fig. 6.2 for $p$ equal to 1 and 2. For $p = 2$, the (finite) sum of kernel sections provides the well-known cubic smoothing splines, i.e., piecewise third-order polynomials.

Spline functions enjoy many numerical properties originally studied in the interpolation scenario. In particular, piecewise polynomials circumvent Runge's phenomenon (large oscillations affecting the reconstructed function) which, e.g., arises when high-order polynomials are employed [81]. Fit convergence rates are discussed, e.g., in [3, 14].

## *6.6.7 The Bias Space and the Spline Estimator*

**Bias space** As discussed in Sect. 4.5, in a Bayesian setting, in some cases it can be useful to enrich $\mathcal{H}$ with a low-dimensional parametric part, known in the literature as the *bias space*. The bias space typically consists of linear combinations of functions $\{\phi\_k\}\_{k=1}^m$. For instance, if the unknown function exhibits a linear trend, one may let $m = 2$ and $\phi\_1(x) = 1$, $\phi\_2(x) = x$. Then, one can assume that $g$ is the sum of two functions, one in $\mathcal{H}$ and the other in the bias space. In this way, the function space becomes $\mathcal{H} + \operatorname{span}\{\phi\_1, \dots, \phi\_m\}$. Using a quadratic loss, the regularization problem is given by

$$(\hat{f}, \hat{\theta}) = \operatorname\*{arg\,min}\_{\substack{f \in \mathcal{H},\\ \theta \in \mathbb{R}^m}} \sum\_{i=1}^N \left( y\_i - f(\mathbf{x}\_i) - \sum\_{k=1}^m \theta\_k \phi\_k(\mathbf{x}\_i) \right)^2 + \gamma \|f\|\_{\mathcal{H}}^2,\tag{6.49}$$

and the overall function estimate turns out to be $\hat{g} = \hat{f} + \sum\_{k=1}^m \hat{\theta}\_k \phi\_k$. Note that the expansion coefficients in $\theta$ are not subject to any penalty term, but a low value for $m$ avoids overfitting. The solution can be computed exploiting an extended version of the representer theorem. In particular, it holds that

$$\hat{\mathbf{g}} = \sum\_{i=1}^{N} \hat{c}\_{i} K\_{x\_{i}} + \sum\_{k=1}^{m} \hat{\theta}\_{k} \phi\_{k},\tag{6.50}$$

where, assuming that $\Phi \in \mathbb{R}^{N \times m}$ is full column rank and $\Phi\_{ij} = \phi\_j(x\_i)$,

$$\hat{\theta} = \left(\Phi^T A^{-1} \Phi\right)^{-1} \Phi^T A^{-1} Y \tag{6.51a}$$

$$
\hat{c} = A^{-1} \left( Y - \Phi \hat{\theta} \right) \tag{6.51b}
$$

$$A = \mathbf{K} + \gamma I\_N.\tag{6.51c}$$

**Remark 6.5** *(Extended version of the representer theorem)* The correctness of formulas (6.51a–6.51c) can be easily verified as follows. Fix $\theta$ to the optimizer $\hat{\theta}$ in the objective on the rhs of (6.49). Then, we can use the representer theorem with $Y$ replaced by $Y - \Phi\hat{\theta}$ to obtain $\hat{f} = \sum\_{i=1}^N \hat{c}\_i K\_{x\_i}$ with

$$
\hat{c} = A^{-1} \left( Y - \Phi \hat{\theta} \right),
$$

with *A* indeed given by (6.51c). This proves (6.51b). Using the definition of *A* this also implies

$$Y - \mathbf{K}\hat{c} = \Phi\hat{\theta} + \gamma\hat{c}.$$

Now, if we fix $f$ to $\hat{f}$, the optimizer $\hat{\theta}$ is just the least squares estimate of $\theta$ with $Y$ replaced by $Y - \mathbf{K}\hat{c}$. Hence, we obtain

$$
\hat{\theta} = \left(\boldsymbol{\Phi}^T \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^T (\boldsymbol{Y} - \mathbf{K} \hat{c}).
$$

Using $Y - \mathbf{K}\hat{c} = \Phi\hat{\theta} + \gamma\hat{c}$ in the expression for $\hat{\theta}$, we obtain $\left(\Phi^T \Phi\right)^{-1} \Phi^T \hat{c} = 0$. Multiplying the lhs and rhs of (6.51b) by $\left(\Phi^T \Phi\right)^{-1} \Phi^T$ and using this last equality, (6.51a) is finally obtained.
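The orthogonality property $\Phi^T\hat{c} = 0$ just derived, together with formulas (6.51a–6.51c), can be verified numerically; the linear spline kernel (6.47) and the bias functions $1, x$ below are illustrative choices, as are all numerical values.

```python
import numpy as np

rng = np.random.default_rng(3)
N, gamma = 30, 0.5
x = np.sort(rng.uniform(0.01, 1, N))
Y = np.sin(2 * np.pi * x) + 0.5 + x + 0.1 * rng.standard_normal(N)

K = np.minimum.outer(x, x)                  # linear spline kernel (6.47)
Phi = np.column_stack([np.ones(N), x])      # bias space: phi_1(x) = 1, phi_2(x) = x

A = K + gamma * np.eye(N)                                       # (6.51c)
Ainv_Phi = np.linalg.solve(A, Phi)
theta_hat = np.linalg.solve(Phi.T @ Ainv_Phi, Ainv_Phi.T @ Y)   # (6.51a)
c_hat = np.linalg.solve(A, Y - Phi @ theta_hat)                 # (6.51b)
```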

**The spline estimator** The bias space is useful, e.g., when spline kernels are adopted. In fact, the spline space of order $p$ contains only functions satisfying the constraints $g^{(j)}(0) = 0$ for $j = 0, \dots, p - 1$. Then, to cope with nonzero initial conditions, one can enrich such RKHS with polynomials up to order $p - 1$. The enriched space is $\mathcal{H} \oplus \operatorname{span}\{1, x, \dots, x^{p-1}\}$, with $\oplus$ denoting a direct sum, and enjoys the universality property mentioned at the end of Sect. 6.6.5. The resulting spline estimator becomes a notable example of (6.49): it solves

$$\min\_{\substack{f \in \mathcal{H},\\ \theta \in \mathbb{R}^{p}}} \sum\_{i=1}^{N} \left(y\_{i} - f(\mathbf{x}\_{i}) - \sum\_{k=1}^{p} \theta\_{k} \mathbf{x}\_{i}^{k-1}\right)^{2} + \gamma \int\_{0}^{1} \left(f^{(p)}(\mathbf{x})\right)^{2} d\mathbf{x},\tag{6.52}$$

whose explicit solution is given by (6.50) setting $\phi\_k(x) = x^{k-1}$ and $\Phi\_{ij} = x\_i^{j-1}$.

We consider a simple numerical example to illustrate the estimator (6.52) and the impact of different choices of γ on its performance. The task is the reconstruction of the function $g(x) = e^{\sin(10x)}$, with $x \in [0, 1]$, from 100 direct samples corrupted by

**Fig. 6.5** Cubic spline estimator (6.52) with three different values of the regularization parameter: truth (red thick line), noisy data (◦) and estimate (black solid line)

white Gaussian noise with standard deviation 0.3. The estimates coming from (6.52) with $p = 2$ and three different values of γ are displayed in the three panels of Fig. 6.5. The cubic spline estimate plotted in the top left panel is affected by oversmoothing: the overly large value of γ overweights the norm of $f$ in the objective (6.52), introducing a large bias. Hence, the model is too rigid, unable to describe the data. The top right panel displays the opposite situation, obtained by adopting an overly low value for γ, which overweights the loss function in (6.52). This leads to a high-variance estimator: the model is overly flexible and overfits the measurements. Finally, the estimate in the bottom panel of Fig. 6.5 is obtained using the regularization parameter optimal in the MSE sense. The good trade-off between bias and variance leads to an estimate close to the truth. As already pointed out in the previous chapters, the choice of γ can thus be interpreted as the counterpart of model order selection in the classical parametric paradigm.
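A minimal sketch of the overfitting mechanism just described (cubic spline kernel, no bias space for brevity; seed and numerical values are our own): the training residual $\|Y - \mathbf{K}\hat{c}\|$ shrinks monotonically as γ decreases, which is exactly the direction of increasing variance.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = np.sort(rng.uniform(0, 1, N))
Y = np.exp(np.sin(10 * x)) + 0.3 * rng.standard_normal(N)   # noisy samples of g

mn = np.minimum.outer(x, x)
K = np.outer(x, x) * mn / 2 - mn ** 3 / 6                   # cubic spline kernel (6.48)

def train_residual(gamma):
    # regularization network: c = (K + gamma I)^{-1} Y, fitted values K c
    c = np.linalg.solve(K + gamma * np.eye(N), Y)
    return np.linalg.norm(Y - K @ c)
```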

## **6.7 Asymptotic Properties**

## *6.7.1 The Regression Function/Optimal Predictor*

In what follows, we use μ to indicate a probability measure on the input space $\mathcal{X}$. For simplicity, we assume that it admits a probability density function (pdf) denoted by $p\_x$. The input locations $x\_i$ are now seen as random quantities and $p\_x$ models the stochastic mechanism through which they are drawn from $\mathcal{X}$. For instance, in the system identification scenario treated in Sect. 6.6.1, each input location contains system input values, e.g., see (6.40). If we assume that the input is a stationary stochastic process, all the $x\_i$ indeed follow the same pdf $p\_x$.

Let also $\mathcal{Y}$ indicate the output space. Then, $p\_{yx}$ denotes the joint pdf on $\mathcal{X} \times \mathcal{Y}$, which factorizes into $p\_{y|x}(y|x) p\_x(x)$. Here, $p\_{y|x}$ is the pdf of the output $y$ conditional on a particular realization $x$.

Let us now introduce some important quantities that are functions of $\mathcal{X}$, $\mathcal{Y}$ and $p\_{yx}$. Given a function $f$, the least squares error associated with $f$ is defined by

$$\operatorname{Err}(f) = \mathcal{E}(\mathbf{y} - f(\mathbf{x}))^2 = \int\_{\mathcal{X} \times \mathcal{Y}} \left(\mathbf{y} - f(\mathbf{x})\right)^2 \mathbf{p}\_{\mathbf{y}\mathbf{x}}(\mathbf{y}, \mathbf{x}) d\mathbf{x}d\mathbf{y}.\tag{6.53}$$

The following result, also discussed in [33], characterizes the minimizer of Err( *f* ) and has connections with Theorem 4.1.

**Theorem 6.19** (The regression function, based on [33]) *We have*

$$f\_{\rho} = \operatorname\*{arg\,min}\_{f} \operatorname{Err}(f),$$

*where f*<sup>ρ</sup> *is the so-called* regression function *defined by*


$$f\_{\rho}(\mathbf{x}) = \int\_{\mathcal{Y}} \mathbf{y} \mathbf{p}\_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) d\mathbf{y}, \quad \mathbf{x} \in \mathcal{X}. \tag{6.54}$$

One can see that the regression function does not depend on the marginal density p*<sup>x</sup>* but only on the conditional p*<sup>y</sup>*|*<sup>x</sup>* . For any given *x*, it corresponds to the posterior mean (Bayes estimate) of the output *y* conditional on *x*. The proof of this fact is easily obtained by first using the following decomposition

$$\begin{split} \operatorname{Err}(f) &= \int\_{\mathcal{X}\times\mathcal{Y}} \left( y - f\_{\rho}(\mathbf{x}) + f\_{\rho}(\mathbf{x}) - f(\mathbf{x}) \right)^{2} \mathbf{p}\_{yx}(y, \mathbf{x}) d\mathbf{x} dy \\ &= \mathcal{E}(f\_{\rho}(\mathbf{x}) - f(\mathbf{x}))^{2} + \mathcal{E}(y - f\_{\rho}(\mathbf{x}))^{2} \\ &\quad + 2 \int\_{\mathcal{X}} \left( f\_{\rho}(\mathbf{x}) - f(\mathbf{x}) \right) \underbrace{\left( \int\_{\mathcal{Y}} \left( y - f\_{\rho}(\mathbf{x}) \right) \mathbf{p}\_{y|\mathbf{x}}(y|\mathbf{x}) dy \right)}\_{0} \mathbf{p}\_{\mathbf{x}}(\mathbf{x}) d\mathbf{x} \\ &= \mathcal{E}(f\_{\rho}(\mathbf{x}) - f(\mathbf{x}))^{2} + \mathcal{E}(y - f\_{\rho}(\mathbf{x}))^{2}, \end{split}$$

and then noticing that the last term does not depend on $f$, so that $\operatorname{Err}(f)$ is minimized by $f = f_\rho$.
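The vanishing of the cross term can also be checked numerically. A minimal Monte Carlo sketch, assuming a hypothetical mechanism $y = \sin(x) + \text{noise}$ (so that $f_\rho(x) = \sin(x)$) and an arbitrary competing predictor $f$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical data-generating mechanism: x uniform, y = sin(x) + Gaussian noise,
# so the regression function is f_rho(x) = sin(x).
x = rng.uniform(-np.pi, np.pi, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)

f_rho = np.sin          # posterior mean of y given x
f = lambda t: 0.5 * t   # an arbitrary competing predictor

err_f = np.mean((y - f(x)) ** 2)        # Err(f)
bias = np.mean((f_rho(x) - f(x)) ** 2)  # E(f_rho(x) - f(x))^2
noise = np.mean((y - f_rho(x)) ** 2)    # E(y - f_rho(x))^2

# The cross term vanishes, so Err(f) matches bias + noise up to Monte Carlo error.
print(err_f, bias + noise)
```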

Theorem 6.19 shows that $f_\rho$ is the best output predictor in the sense that it minimizes the expected quadratic loss (MSE) on a new output drawn from $p_{yx}$. Now, we will consider a scenario where $p_{y|x}$ (and possibly also $p_x$) is unknown and only $N$ samples $\{x_i, y_i\}_{i=1}^N$ from $p_{yx}$ are available. We will study the asymptotic properties ($N$ growing to infinity) of the regularized approaches previously described. The regularization network case is treated in the following subsection.

## *6.7.2 Regularization Networks: Statistical Consistency*

Consider the following regularization network

$$\hat{\mathbf{g}}\_N = \operatorname\*{arg\,min}\_{f \in \mathcal{H}'} \frac{\sum\_{i=1}^N (\mathbf{y}\_i - f(\mathbf{x}\_i))^2}{N} + \gamma \|f\|\_{\mathcal{H}'}^2,\tag{6.55}$$

which coincides with (6.32) except for the introduction of the scale factor $1/N$ in the quadratic loss. We have also stressed the dependence of the estimate on the data set size $N$. Our goal is to assess whether $\hat{g}_N$ converges to $f_\rho$ as $N \to \infty$ in the norm $\|\cdot\|_{\mathcal{L}_2^{\mu}}$ defined by the pdf $p_x$ as follows

$$\|f\|_{\mathcal{L}_2^{\mu}}^2 = \int_{\mathcal{X}} f^2(x)\, p_x(x)\, dx.$$
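For the quadratic loss, the representer theorem reduces (6.55) to a finite-dimensional problem: $\hat{g}_N(x) = \sum_i \hat{c}_i K(x, x_i)$ with $\hat{c} = (\mathbf{K} + N\gamma I)^{-1} y$, where $\mathbf{K}$ is the kernel matrix. A minimal numerical sketch (Gaussian kernel; data-generating mechanism and hyperparameters are illustrative) that also estimates the $\mathcal{L}_2^\mu$ error by Monte Carlo:

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    # Gaussian (RBF) kernel matrix between location sets A and B
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(1)
N, gamma = 300, 1e-3
x = rng.uniform(-3, 3, N)                 # input locations drawn from p_x
y = np.sin(x) + rng.normal(0.0, 0.2, N)   # noisy outputs, f_rho(x) = sin(x)

K = gauss_kernel(x, x)
c = np.linalg.solve(K + N * gamma * np.eye(N), y)  # representer coefficients

def g_hat(t):
    return gauss_kernel(np.atleast_1d(t), x) @ c

# Monte Carlo estimate of ||g_hat - f_rho||^2 in L2(mu), mu uniform on [-3, 3]
t = rng.uniform(-3, 3, 20_000)
l2_err = np.mean((g_hat(t) - np.sin(t)) ** 2)
print(l2_err)
```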

First, details on the data generation process are provided.

**Data generation assumptions** The probability measure $\mu$ on $\mathcal{X}$ is assumed to be Borel nondegenerate. As already recalled, this means that realizations from $p_x$ can cover $\mathcal{X}$ entirely, without holes. This happens, e.g., when $p_x(x) > 0$ $\forall x \in \mathcal{X}$. The stochastic processes $x_i$ and $y_i$ are jointly stationary, with joint pdf $p_{yx}$.

The study is not limited to the i.i.d. case. This is important, e.g., in system identification where, as visible in (6.40), input locations contain past input values shifted in time, hence introducing correlation among the $x_i$. Let $a, b$ indicate two integers with $a \le b$. Then, $\mathcal{M}_a^b$ denotes the $\sigma$-algebra generated by $(x_a, y_a), \ldots, (x_b, y_b)$. The process $(x, y)$ is said to satisfy a strong mixing condition if there exists a sequence of real numbers $\psi_i$ such that, $\forall k, i \ge 1$, one has

$$|P(A \cap B) - P(A)P(B)| \le \psi_i \quad \forall A \in \mathcal{M}_1^k, \ B \in \mathcal{M}_{k+i}^\infty$$

with

$$\lim\_{i \to \infty} \psi\_i = 0.$$

Intuitively, if *a*, *b* represent different time instants, this means that the random variables tend to become independent as their temporal distance increases.

**Assumption 6.20** *(Data generation and strong mixing condition)* The probability measure $\mu$ on the input space (having pdf $p_x$) is nondegenerate. In addition, the random variables $x_i$ and $y_i$ form two jointly stationary stochastic processes with finite moments up to the third order, and satisfy a strong mixing condition. Finally, denoting by $\psi_i$ the mixing coefficients, one has

$$\sum\_{i=1}^{\infty} \left( |\psi\_i|^{1/3} \right) < \infty.$$

#### **Consistency Result**

The following theorem, whose proof is in Sect. 6.9.6, illustrates the convergence in probability of (6.55) to the best output predictor.

**Theorem 6.21** (Statistical consistency of the regularization networks) *Let* $\mathcal{H}$ *be a RKHS of functions* $f : \mathcal{X} \to \mathbb{R}$ *induced by the Mercer kernel* $K$*, with* $\mathcal{X}$ *a compact metric space. Assume that* $f_\rho \in \mathcal{H}$ *and that Assumption 6.20 holds. In addition, let*

$$
\gamma \propto \frac{1}{N^{\alpha}},\tag{6.56}
$$

*where* $\alpha$ *is any scalar in* $(0, \frac{1}{2})$*. Then, as* $N$ *goes to infinity, one has*

$$\|\hat{g}_N - f_\rho\|_{\mathcal{L}_2^{\mu}} \longrightarrow_p 0,\tag{6.57}$$

*where* $\longrightarrow_p$ *denotes convergence in probability.*

The meaning of (6.56) is the following. The regularizer $\|\cdot\|_{\mathcal{H}}^2$ in (6.55) restores the well-posedness of the problem by introducing some bias in the estimation process. Intuitively, to have consistency, the amount of regularization should decay to zero as $N$ goes to $\infty$, but not too rapidly, in order to keep the variance term under control. This can be obtained by making the regularization parameter $\gamma$ go to zero at the rate suggested by (6.56).
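The rate (6.56) can be illustrated numerically: fitting the regularization network with $\gamma = N^{-\alpha}$ on data sets of increasing size should drive the $\mathcal{L}_2^\mu$ error toward zero. A sketch with a Gaussian kernel and a hypothetical data-generating mechanism (all choices illustrative):

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

def l2_error(N, alpha=0.4, seed=0):
    # Fit the regularization network with gamma = N**(-alpha) and return a
    # Monte Carlo estimate of ||g_hat_N - f_rho||^2 in L2(mu), mu uniform.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, N)
    y = np.sin(x) + rng.normal(0.0, 0.3, N)   # f_rho(x) = sin(x)
    gamma = N ** (-alpha)
    c = np.linalg.solve(gauss_kernel(x, x) + N * gamma * np.eye(N), y)
    t = rng.uniform(-3, 3, 5000)
    g_hat = gauss_kernel(t, x) @ c
    return np.mean((g_hat - np.sin(t)) ** 2)

errors = [l2_error(N) for N in (25, 100, 400, 1600)]
print(errors)   # expected to shrink as N grows
```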

## *6.7.3 Connection with Statistical Learning Theory*

We now discuss the class of estimators (6.21) within the framework of statistical learning theory.

**Learning problem** Let us consider the problem of *learning from examples* as defined in statistical learning. The starting point is that described in Sect. 6.7.1. There is an unknown probabilistic relationship between the variables $x$ and $y$ described by the joint pdf $p_{yx}$ on $\mathcal{X} \times \mathcal{Y}$. We are given examples $\{x_i, y_i\}_{i=1}^N$ of this relationship, called *training data*, which are independently drawn from $p_{yx}$. The aim of the learning process is to obtain an estimator $\hat{g}_N$ (a map from the training set to a space of functions) able to predict the output $y$ given any $x \in \mathcal{X}$.

**Generalization and consistency** In the statistical learning scenario, the two fundamental properties of an estimator are *generalization* and *consistency*. To introduce them, we first introduce a loss function $V(y, f(x))$, called *risk functional*. Then, the mean error associated with a function $f$ is the *expected risk* given by

$$I(f) = \int\_{\mathcal{X}\times\mathcal{Y}} \mathcal{V}(\mathbf{y}, f(\mathbf{x})) \mathbf{p}\_{\text{yx}}(\mathbf{y}, \mathbf{x}) d\mathbf{x} d\mathbf{y}.\tag{6.58}$$

Note that, in the quadratic loss case, the expected risk coincides with the error already introduced in (6.53). Given a function *f* , the *empirical risk* is instead defined by

$$I\_N(f) = \frac{1}{N} \sum\_{i=1}^N \mathcal{V}(\mathbf{y}\_i, f(\mathbf{x}\_i)). \tag{6.59}$$
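For a fixed $f$, $I_N(f)$ is a sample mean, so under i.i.d. sampling it converges to $I(f)$ by the law of large numbers. A small check with the quadratic loss and an assumed joint distribution for which $I(f)$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)

def I_N(f, x, y):
    # empirical risk (6.59) with quadratic loss V(y, f(x)) = (y - f(x))^2
    return np.mean((y - f(x)) ** 2)

# Assumed joint mechanism: x ~ N(0,1), y = 2x + e with e ~ N(0, 0.5^2).
# For f(x) = x, the expected risk is I(f) = E(x + e)^2 = 1 + 0.25 = 1.25.
f = lambda t: t
for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    y = 2 * x + rng.normal(0.0, 0.5, N)
    print(N, I_N(f, x, y))   # approaches 1.25 as N grows
```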

Then, we introduce a class of functions forming the hypothesis space $\mathcal{F}$ where the predictor is searched for. The ideal predictor, also called the *target function*, is given by<sup>3</sup>

$$f\_0 = \operatorname\*{arg\,min}\_{f \in \mathcal{F}} I(f). \tag{6.60}$$

<sup>3</sup> Here, and also when introducing empirical risk minimization (ERM), we assume that all the introduced minimizers exist. If this does not hold true, all the concepts remain valid by resorting to the concept of almost minimizers and almost ERM, with $I(f_0) := \inf_{f \in \mathcal{F}} I(f)$.

In general, even when a quadratic loss is chosen, $f_0$ does not coincide with the regression function $f_\rho$ introduced in (6.54), since $\mathcal{F}$ may not contain $f_\rho$.

The concepts of generalization and consistency trace back to [97, 99–101]. Below, recall that $\hat{g}_N$ is stochastic since it is a function of the training set, which contains the random variables $\{x_i, y_i\}_{i=1}^N$.

**Definition 6.3** *(Generalization and consistency, based on* [102]*)* The estimator $\hat{g}_N$ (uniformly) generalizes if $\forall \varepsilon > 0$:

$$\lim_{N \to \infty} \sup_{p_{yx}} \mathbb{P} \left\{ |I_N(\hat{g}_N) - I(\hat{g}_N)| > \varepsilon \right\} = 0. \tag{6.61}$$

The estimator is instead (universally) consistent if ∀ε > 0:

$$\lim_{N \to \infty} \sup_{p_{yx}} \mathbb{P}\left\{ I(\hat{g}_N) > I(f_0) + \varepsilon \right\} = 0. \tag{6.62}$$

From (6.61), one can see that generalization implies that the performance on the training set (the empirical error) must converge to the "true" performance on future outputs (the expected error). The presence of the $\sup_{p_{yx}}$ indicates that this property must hold uniformly w.r.t. all the possible stochastic mechanisms which generate the data. Consistency, as defined in (6.62), instead requires the expected error of $\hat{g}_N$ to converge to the expected error achieved by the best predictor in $\mathcal{F}$. Note that the reconstruction of $f_0$ is not required: the goal is that $\hat{g}_N$ be able to mimic the prediction performance of $f_0$ asymptotically. Key issues in statistical learning theory are the understanding of the conditions on $\hat{g}_N$, the function class $\mathcal{F}$ and the loss $V$ which ensure such properties.

#### **Empirical Risk Minimization**

The most natural technique to determine $f_0$ from data is the *empirical risk minimization* (ERM) approach, where the empirical risk is optimized:

$$\hat{\mathbf{g}}\_N = \underset{f \in \mathcal{F}}{\text{arg min }} I\_N(f) = \underset{f \in \mathcal{F}}{\text{arg min }} \frac{1}{N} \sum\_{i=1}^N \mathcal{V}(\mathbf{y}\_i, f(\mathbf{x}\_i)). \tag{6.63}$$

The study of ERM has provided a full characterization of the necessary and sufficient conditions for its generalization and consistency. To introduce them, we first need to provide further details on the data generation assumptions.

#### **Assumption 6.22** *(Data generation assumptions)* It holds that


Note that, if the first four points hold true, in practice any loss function of interest, such as quadratic, Huber or Vapnik, satisfies the last requirement.

We now introduce the concept of $V_\gamma$-dimension [5]. It is a complexity measure which extends the concept of Vapnik–Chervonenkis (VC) dimension, originally introduced for indicator functions.

**Definition 6.4** *(*$V_\gamma$*-dimension, based on* [5]*)* Let Assumption 6.22 hold. The $V_\gamma$-dimension of $V$ in $\mathcal{F}$, i.e., of the set $V(y, f(x)), \ f \in \mathcal{F}$, is defined as the maximum number $h$ of vectors $(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated in all $2^h$ possible ways using the rules

> Class 1: if $V(y_i, f(x_i)) \ge s + \gamma$; Class 0: if $V(y_i, f(x_i)) \le s - \gamma$

for $f \in \mathcal{F}$ and some $s \ge 0$. If, for any $h$, it is possible to find $h$ pairs $(x_1, y_1), \ldots, (x_h, y_h)$ that can be separated in all the $2^h$ possible ways, the $V_\gamma$-dimension of $V$ in $\mathcal{F}$ is infinite.

So, the $V_\gamma$-dimension is infinite if, for any data set size, one can always find a function $f$ and a set of points which can be separated by $f$ in any possible way. Note that the margin required to distinguish the classes increases with $\gamma$. This means that the $V_\gamma$-dimension is a monotonically decreasing function of $\gamma$.

The following definition deals with the uniform, distribution-free convergence of empirical means to expectations for classes of real-valued functions. It is related to the so-called *uniform laws of large numbers*.

**Definition 6.5** *(Uniform Glivenko–Cantelli class, based on* [5]*)* Let $\mathcal{G}$ denote a space of functions $\mathcal{Z} \to R$, where $R$ is a bounded real set, and let $p_z$ denote a generic pdf on $\mathcal{Z}$. Then, $\mathcal{G}$ is said to be a uniform Glivenko–Cantelli (uGC) class<sup>4</sup> if

$$\forall \varepsilon > 0 \quad \lim_{N \to \infty} \sup_{p_z} \mathbb{P} \left\{ \sup_{g \in \mathcal{G}} \left| \frac{1}{N} \sum_{i=1}^N g(z_i) - \int_{\mathcal{Z}} g(z)\, p_z(z)\, dz \right| > \varepsilon \right\} = 0.$$

It turns out that, under the ERM framework, generalization and consistency are equivalent concepts. Moreover, the finiteness of the $V_\gamma$-dimension coincides with the uGC property of the class of adopted losses and turns out to be the necessary and sufficient condition for generalization and consistency [5]. This is formalized below.

**Theorem 6.23** (ERM and $V_\gamma$-dimension, based on [5]) *Let Assumption 6.22 hold. The following facts are then equivalent:*

• *ERM (uniformly) generalizes.*

<sup>4</sup> Sometimes, the class defined by Definition 6.5 in terms of convergence in probability is called weak uGC, while almost sure convergence leads to a strong uGC class. However, it can be proved that, if Assumption 6.22 holds true and the function class is the composition of the losses with $\mathcal{F}$, the two concepts become equivalent.



In the last point regarding the uGC class, one can follow Definition 6.5 using the correspondences $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, $z = (x, y)$, $p_z = p_{yx}$ and $R = [A, B]$.

#### **Connection with Regularization in RKHS**

The connection between statistical learning theory and the class of kernel-based estimators (6.21) is obtained using as function space $\mathcal{F}$ a ball $\mathcal{B}_r$ in a RKHS $\mathcal{H}$, i.e.,

$$\mathcal{F} = \mathcal{B}_r := \left\{ f \in \mathcal{H} \ \big| \ \|f\|_{\mathcal{H}} \le r \right\}. \tag{6.64}$$

The ERM method (6.63) becomes

$$\hat{g}_N = \operatorname*{arg\,min}_{f} \frac{1}{N} \sum_{i=1}^N \mathcal{V}(y_i, f(x_i)) \quad \text{s.t.} \ \|f\|_{\mathcal{H}} \le r,\tag{6.65}$$

which is an inequality-constrained optimization problem. Exploiting Lagrangian theory, we can find a positive scalar $\gamma$, function of $r$ and of the data set size $N$, which makes (6.65) equivalent to

$$\hat{g}_N = \operatorname*{arg\,min}_{f \in \mathcal{H}} \ \frac{1}{N} \sum_{i=1}^N \mathcal{V}(y_i, f(x_i)) + \gamma \left( \|f\|_{\mathcal{H}}^2 - r^2 \right),$$

which, apart from constants, coincides with (6.21). The question now is whether (6.65) is consistent in the sense of statistical learning theory. The answer is positive. In fact, under Assumption 6.22, it can be proved that the class obtained by composing the loss $V$ with $\mathcal{F}$ is uGC if $\mathcal{F}$ is uGC. In addition, one sufficient (but not necessary) condition for $\mathcal{F}$ to be uGC is that $\mathcal{F}$ be a compact set in the space of continuous functions. The following important result then holds.

**Theorem 6.24** (Generalization and consistency of the kernel-based approaches, based on [33, 65]) *Let* $\mathcal{H}$ *be any RKHS induced by a Mercer kernel containing functions* $f : \mathcal{X} \to \mathbb{R}$*, with* $\mathcal{X}$ *a compact metric space. Then, for any* $r$*, the ball* $\mathcal{B}_r$ *is compact in the space of continuous functions equipped with the sup-norm. It follows that* $\mathcal{B}_r$ *is uGC and, if Assumption 6.22 holds, the regularized estimator (6.65) generalizes and is consistent.*

Theorem 6.24 thus shows that kernel-based approaches make it possible to exploit flexible infinite-dimensional models with the guarantee that the best prediction performance (achievable inside the chosen class) will be asymptotically reached.
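The equivalence between the ball-constrained problem (6.65) and its penalized counterpart can also be verified numerically in the quadratic-loss case: the norm $\|\hat{g}_N\|_{\mathcal{H}}$ of the penalized solution decreases monotonically with $\gamma$, so the multiplier matching a given radius $r$ can be located by bisection. A sketch (Gaussian kernel, hypothetical data; the RKHS norm comes from $\|f\|_{\mathcal{H}}^2 = c^T \mathbf{K} c$):

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(-3, 3, N)
y = np.sin(x) + rng.normal(0.0, 0.2, N)
K = gauss_kernel(x, x)

def fit_norm(gamma):
    # penalized solution and its RKHS norm ||f||_H = sqrt(c' K c)
    c = np.linalg.solve(K + N * gamma * np.eye(N), y)
    return np.sqrt(c @ K @ c)

r = 0.8 * fit_norm(1e-6)   # a radius that makes the constraint active

# bisection on gamma (log scale): fit_norm is decreasing in gamma
lo, hi = 1e-8, 1e2
for _ in range(100):
    mid = np.sqrt(lo * hi)
    lo, hi = (mid, hi) if fit_norm(mid) > r else (lo, mid)
print(mid, fit_norm(mid))  # gamma whose penalized solution has norm ~ r
```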

## **6.8 Further Topics and Advanced Reading**

Basic functional analysis principles can be found, e.g., in [59, 79, 112]. The concept of RKHS was developed in 1950 in the seminal works [13, 20]. Classical books on the subject are [6, 82, 84]. RKHSs have been introduced within the machine learning community in [46, 47] leading, in conjunction with Tikhonov regularization theory [21, 96], to the development of many powerful kernel-based algorithms [42, 86].

Extensions of the theory to vector-valued RKHSs are described in [62]. This is connected to the so-called multi-task learning problem [18, 29], which deals with the simultaneous reconstruction of several functions. Here, the key point is that measurements taken on one function (task) may be informative about the other ones; see [16, 40, 68, 95] for illustrations of the advantages of this approach. Multi-task learning will be illustrated in Chap. 9, also using a numerical example based on real pharmacokinetics data.

The Mercer theorem dates back to [60], which also discusses the connection with integral equations; see also the book [50]. Extensions of the theorem to non-compact domains are discussed in [94]. The first version of the representer theorem appears in [52]. It has since been the subject of many generalizations, which can be found in [11, 36, 83, 103, 110]. Recent works have also extended the classical formulation to the context of vector-valued functions (multi-task learning and collaborative filtering), matrix regularization problems (with penalty given by spectral functions of matrices) and matricizations of tensors, see, e.g., [1, 7, 12, 54, 87]. These different types of representer theorems are cast in a general framework in [10].

The term regularization network traces back to [71] where it is illustrated that a particular regularized scheme is equal to a radial basis function network. Support vector regression and classification were introduced in [24, 31, 37, 98], see also the classical book [102]. Robust statistics are described in [51].

The term "kernel trick" was used in [83] while interpretation of kernels as inner products in a feature space was first described in [4]. Sobolev spaces are illustrated, e.g., in [2] while classical works on smoothing splines are [32, 104]. The important spline interpolation properties are described in [3, 14, 22].

Polynomial kernels were used for the first time in [70] while an application to Wiener system identification can be found in [44], as also discussed later on in Chap. 8 devoted to nonlinear system identification. An explicit (spectral) characterization of the RKHS induced by the Gaussian kernel can be found in [91, 92], while the more general case of radial basis kernels is treated in [85]. The concept of universal kernel is discussed, e.g., in [61, 90].

The strong mixing condition is discussed, e.g., in [107] and [34].

The convergence proof for the regularization network relies upon the integral operator approach described in [88] in an i.i.d. setting and its extension to the dependent case developed in [66] in the Wiener system identification context. For other works on statistical consistency and learning rates of regularized least squares in RKHS see, e.g., [48, 93, 105, 109, 111].

Statistical learning theory and the concepts of generalization and consistency, in connection with the uniform law of large numbers, date back to the works of Vapnik and Chervonenkis [97, 99–101]. Other related works on convergence of empirical processes are [38, 39, 73]. The concept of $V_\gamma$-dimension and its equivalence with the Glivenko–Cantelli class is proved in [5], see also [41] for links with RKHS. Relationships between the concept of stability of estimates (continuous dependence on the data) and generalization/consistency can be found in [63, 72], see also [26] for previous work on this subject. Numerical computation of the regularized estimate (6.21) is discussed in the literature studying the relationship between machine learning and convex optimization [19, 25, 77]. In the regularization network case (quadratic loss), if the data set size $N$ is large, plain application of a solver with computational cost $O(N^3)$ can be highly inefficient. Then, one can use approximate representations of the kernel function [15, 53], based, e.g., on the Nyström method or greedy strategies [89, 106, 113]. One can also exploit the Mercer theorem by just using an $m$th-order approximation of $K$ given by $\sum_{i=1}^m \zeta_i \rho_i(x) \rho_i(y)$. The solution obtained with this kernel may provide accurate approximations also when $m \ll N$, see [28, 43, 67, 114, 115]. Training of kernel machines can also be accelerated by using randomized low-dimensional feature spaces [74], see also [78] for insights on learning rates.
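The low-rank strategies just mentioned can be emulated on a finite sample by truncating the eigendecomposition of the kernel matrix to its top $m$ eigenpairs, in the spirit of Nyström-type approximations. A sketch (Gaussian kernel; data and hyperparameters are illustrative) comparing the full and rank-$m$ regularization network fits:

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(4)
N, m, gamma = 400, 20, 1e-3
x = rng.uniform(-3, 3, N)
y = np.sin(x) + rng.normal(0.0, 0.2, N)

K = gauss_kernel(x, x)
w, U = np.linalg.eigh(K)                   # eigendecomposition of K
Km = (U[:, -m:] * w[-m:]) @ U[:, -m:].T    # rank-m truncation (top m eigenpairs)

c_full = np.linalg.solve(K + N * gamma * np.eye(N), y)
c_low = np.linalg.solve(Km + N * gamma * np.eye(N), y)

# relative gap between the fitted values of the two regularization networks
rel_gap = np.linalg.norm(K @ c_full - Km @ c_low) / np.linalg.norm(K @ c_full)
print(rel_gap)
```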

In the case of a generic convex loss (different from the quadratic), one problem is that the objective is not differentiable everywhere. In this circumstance, the powerful interior point (IP) methods [64, 108] can be employed, which apply damped Newton iterations to a relaxed version of the Karush–Kuhn–Tucker (KKT) equations for the objective [27]. A statistical and computational framework that allows their broad application to the problem (6.21) for a wide class of piecewise linear-quadratic losses can be found in [8, 9]. In practice, IP methods exhibit a relatively fast convergence behaviour. However, as in the quadratic case, a difficulty can arise if $N$ is very large: it may not be possible to store the entire kernel matrix in memory, and this fact can hinder the application of second-order optimization techniques such as the (damped) Newton method. A way to circumvent this problem is given by the so-called decomposition methods, where a subset of the coefficients $c_i$, called the working set, is selected, and the associated low-dimensional sub-problem is solved. In this way, only the corresponding entries of the kernel matrix need to be loaded into memory, e.g., see [30, 56–58]. An extreme case of decomposition method is coordinate descent, where the working set contains only one coefficient [35, 45, 49].
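Coordinate descent for the quadratic-loss case can be sketched as follows: the objective $\frac{1}{N}\|y - \mathbf{K}c\|^2 + \gamma c^T \mathbf{K} c$ is minimized exactly w.r.t. one coefficient at a time, so each update touches a single column of the kernel matrix (data and hyperparameters are illustrative):

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(5)
N, gamma = 200, 1e-2
x = rng.uniform(-3, 3, N)
y = np.sin(x) + rng.normal(0.0, 0.2, N)
K = gauss_kernel(x, x)

c = np.zeros(N)
u = np.zeros(N)          # running value of K @ c, updated incrementally
for _ in range(50):      # cyclic sweeps over the coefficients
    for i in range(N):
        k = K[:, i]      # only one kernel column is touched per update
        # exact minimization of (1/N)||y - Kc||^2 + gamma c'Kc w.r.t. c_i
        delta = ((k @ (y - u)) / N - gamma * u[i]) / ((k @ k) / N + gamma * K[i, i])
        c[i] += delta
        u += delta * k

c_direct = np.linalg.solve(K + N * gamma * np.eye(N), y)
print(np.max(np.abs(c - c_direct)))  # gap w.r.t. the direct solver
```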

## **6.9 Appendix**

## *6.9.1 Fundamentals of Functional Analysis*

We gather some basic functional analysis definitions and results.

#### **Vector Spaces**

We will assume that the reader is familiar with the concept of a real vector space $V$ (the field is given by the real numbers). Here, we just recall that this is a set whose elements are called vectors. The space is closed w.r.t. two operations, called addition and scalar multiplication, which satisfy the usual algebraic properties. This means that any finite linear combination of vectors still falls in $V$. When the vector space contains functions $g : \mathcal{X} \to \mathbb{R}$, for any $f, g \in V$ and $\alpha \in \mathbb{R}$ the two operations are defined as follows:

$$f + g = h \ \text{ where } \ h(x) = f(x) + g(x) \ \ \forall x \in \mathcal{X}$$

and

$$\alpha f = h \ \text{ where } \ h(x) = \alpha f(x) \ \ \forall x \in \mathcal{X}.$$

#### **Inner Products and Norms**

An inner product on $V$ is a function

$$
\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}
$$

which is

1. linear in the first argument

$$
\langle \alpha v + \beta y, z \rangle = \alpha \langle v, z \rangle + \beta \langle y, z \rangle, \quad v, y, z \in V, \ \ \alpha, \beta \in \mathbb{R};
$$

2. symmetric (and so also linear in the second argument)

$$
\langle v, y \rangle = \langle y, v \rangle;
$$

3. positive, in the sense that

$$
\langle v, v \rangle \geq 0 \quad \forall v
$$

with

$$
\langle v, v \rangle = 0 \iff v = 0,
$$

where in the r.h.s. 0 denotes the null vector.

Recall also that a norm on *V* is the nonnegative function

$$\parallel \cdot \parallel : V \to \mathbb{R}^+$$

which satisfies

1. absolute homogeneity

$$\|\alpha v\| = |\alpha| \|v\|, \quad v \in V \quad \alpha \in \mathbb{R};$$


2. the triangle inequality

$$\|v + y\| \le \|v\| + \|y\|;$$

3. null vector condition

$$\|v\| = 0 \iff v = 0.$$

The norm induced by the inner product $\langle \cdot, \cdot \rangle$ is given by

$$\|v\|^2 = \langle v, v \rangle,$$

and it is easy to check that this function indeed satisfies all the three norm axioms listed above. One also has the Cauchy–Schwarz inequality

$$|\langle v, x \rangle| \le \|v\| \, \|x\|.$$

Finally, recall that both $\langle \cdot, x \rangle$ with $x \in V$ and $\|\cdot\|$ are examples of continuous functionals $V \to \mathbb{R}$, i.e., if $\lim_{j \to \infty} \|v - v_j\| = 0$, then

$$\lim_{j \to \infty} \|v_j\| = \|v\|, \quad \lim_{j \to \infty} \langle v_j, x \rangle = \langle v, x \rangle \ \ \forall x \in V.$$

#### **Hilbert and Banach Spaces**

A Hilbert space $\mathcal{H}$ is a vector space equipped with an inner product $\langle \cdot, \cdot \rangle$ which is complete w.r.t. the norm $\|\cdot\|$ induced by such inner product. This means that, given any Cauchy sequence, i.e., a sequence of vectors $\{g_j\}_{j=1}^\infty$ such that

$$\lim\_{m,n \to \infty} \|\mathbf{g}\_m - \mathbf{g}\_n\| = 0,$$

there exists *g* ∈ *H* such that

$$\lim\_{j \to \infty} \|\mathbf{g} - \mathbf{g}\_j\| = 0.$$

In other words, every Cauchy sequence is convergent. Examples of Hilbert spaces used in this book are

• the classical Euclidean space $\mathbb{R}^m$ of vectors $a = [a_1 \ \ldots \ a_m]$ equipped with the classical Euclidean inner product

$$\langle a, b \rangle\_2 = \sum\_{i=1}^{m} a\_i b\_i$$

sometimes denoted just by $\langle \cdot, \cdot \rangle$ in the book;

• the space $\ell_2$ of square-summable real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that

$$\sum\_{i=1}^{\infty} a\_i^2 < \infty,$$

equipped with the inner product

$$\langle a, b \rangle\_2 = \sum\_{i=1}^{\infty} a\_i b\_i;$$

• the classical Lebesgue space $\mathcal{L}_2$ (where the measure $\mu$ is here omitted to simplify notation) of functions $g : \mathcal{X} \to \mathbb{R}$ which are square integrable w.r.t. the measure $\mu$, i.e., such that

$$\int\_{\mathcal{X}} \mathrm{g}^2(\mathbf{x}) d\mu(\mathbf{x}) < \infty,$$

equipped with the inner product still denoted by $\langle \cdot, \cdot \rangle_2$ but now given by

$$
\langle g, f \rangle_2 = \int_{\mathcal{X}} g(x) f(x)\, d\mu(x).
$$

The spaces reported above are also instances of metric spaces where, for every pair of vectors $f, g$, there is a notion of distance defined by $\|f - g\|$. Other metric spaces are the Banach spaces: normed vector spaces complete w.r.t. the metric induced by their norm. Hence, every Hilbert space is a Banach space, but the converse is not true: this happens when $\|\cdot\|$ does not derive from an inner product. Examples of Banach spaces (whose norm does not derive from an inner product) are

• the space $\ell_1$ of absolutely summable real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that

$$\sum\_{i=1}^{\infty} |a\_i| < \infty,$$

equipped with the norm

$$\|a\|\_1 = \sum\_{i=1}^{\infty} |a\_i|;$$

• the Lebesgue space $\mathcal{L}_1$ of functions $g : \mathcal{X} \to \mathbb{R}$ absolutely integrable w.r.t. the measure $\mu$, i.e., such that

$$\int\_{\mathcal{X}} |g(x)|d\mu(x) < \infty,$$

equipped with the norm

$$\|g\|_1 = \int_{\mathcal{X}} |g(x)|\, d\mu(x);$$


• the space $\ell_\infty$ of bounded real sequences $a = [a_1 \ a_2 \ \ldots]$, i.e., such that

$$\sup\_i |a\_i| < \infty,$$

equipped with the norm

$$\|a\|\_{\infty} = \sup\_{i} |a\_i|;$$

• the space $C$ of continuous functions $g : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a compact set, typically in $\mathbb{R}^m$, equipped with the sup-norm (also called uniform norm)

$$\|\mathbf{g}\|\_{\infty} = \max\_{\boldsymbol{x} \in \mathcal{X}} |\mathbf{g}(\boldsymbol{x})|;$$

• the Lebesgue space $\mathcal{L}_\infty$ of functions $g : \mathcal{X} \to \mathbb{R}$ which are essentially bounded w.r.t. the measure $\mu$, i.e., for any $g$ there exists $M$ such that

$$|g(x)| \le M \ \text{ almost everywhere in } \mathcal{X} \text{ w.r.t. the measure } \mu,$$

equipped with the norm

$$\|g\|_\infty = \inf \left\{ M \ \big| \ |g(x)| \le M \text{ almost everywhere in } \mathcal{X} \text{ w.r.t. the measure } \mu \right\}.$$

An infinite-dimensional Hilbert (or Banach) space is said to be separable if it admits a countable basis $\{\rho_j\}_{j=1}^\infty$, i.e., for any $g$ in the space we can find scalars $c_j$ such that

$$\lim_{n \to \infty} \left\| g - \sum_{j=1}^{n} c_j \rho_j \right\| = 0.$$

When such vectors $\{\rho_j\}$ also satisfy the conditions

$$\|\rho_j\| = 1 \ \forall j, \quad \langle \rho_j, \rho_i \rangle = 0 \ \ j \neq i,$$

then the basis is said to be orthonormal.

#### **Subspaces, Projections and Compact Sets**

A subset $\mathcal{S}$ of the vector space $V$ is said to be a subspace if $\mathcal{S}$ is itself a vector space with the same addition and multiplication operations defined in $V$. The symbol

$$\text{span}(\{\rho\_j\}\_{j \in A})$$

denotes the subspace containing all the finite linear combinations of vectors taken from the (possibly uncountable) family $\{\rho_j\}_{j \in A}$.

Given a subspace (or simply a set) $\mathcal{S}$ contained in a Hilbert (or Banach) space, we define

$$\bar{\mathcal{S}} = \mathcal{S} \cup \{\text{all the limits of Cauchy sequences built using vectors in } \mathcal{S}\}.$$

Then, $\mathcal{S}$ is said to be closed if

$$\mathcal{S} = \bar{\mathcal{S}}.$$

The orthogonal complement of a subspace $\mathcal{S}$ of a Hilbert space is denoted by $\mathcal{S}^\perp$ and defined by

$$\mathcal{S}^\perp = \{ \mathbf{g} \mid \langle \mathbf{g}, f \rangle = 0 \,\,\forall f \in \mathcal{S} \}\,\,.$$

It is easy to prove that $\mathcal{S}^\perp$ is always a closed subspace.

The following fundamental theorem holds.

**Theorem 6.25** (Projection theorem) *Let* $\mathcal{S}$ *be a closed subspace of a Hilbert space* $\mathcal{H}$ *with norm* $\|\cdot\|_{\mathcal{H}}$*. Then, one has*

• *any* $g \in \mathcal{H}$ *has a unique decomposition*

$$\mathbf{g} = \mathbf{g}\_{\mathcal{S}} + \mathbf{g}\_{\mathcal{S}^\perp}, \quad \mathbf{g}\_{\mathcal{S}} \in \mathcal{S}, \ \mathbf{g}\_{\mathcal{S}^\perp} \in \mathcal{S}^\perp;$$

• $g_{\mathcal{S}}$ *is the projection of* $g$ *onto* $\mathcal{S}$*, i.e.,*

$$g_{\mathcal{S}} = \operatorname*{arg\,min}_{f \in \mathcal{S}} \|g - f\|_{\mathcal{H}};$$

• *it holds that*

$$\|g\|_{\mathcal{H}}^2 = \|g_{\mathcal{S}}\|_{\mathcal{H}}^2 + \|g_{\mathcal{S}^\perp}\|_{\mathcal{H}}^2.$$
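In the Euclidean case, the projection onto the span of the columns of a matrix can be computed by least squares, and the orthogonal decomposition and Pythagorean identity can be verified directly; a small numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(8, 3))   # S = column span of A, a closed subspace of R^8
g = rng.normal(size=8)

# g_S: projection of g onto S via least squares (arg min over S of ||g - f||)
coef, *_ = np.linalg.lstsq(A, g, rcond=None)
g_S = A @ coef
g_perp = g - g_S              # component in the orthogonal complement

print(np.abs(A.T @ g_perp).max())             # ~0: g_perp is orthogonal to S
print(g @ g - (g_S @ g_S + g_perp @ g_perp))  # ~0: Pythagorean identity
```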

A set $A$ contained in a Hilbert (or Banach) space with norm $\|\cdot\|$ is said to be *compact* if, given any sequence $\{g_j\}$ of vectors all contained in $A$, it is possible to extract a subsequence $\{g_{k_j}\}$ convergent in $A$, i.e., there exists $g \in A$ such that

$$\lim\_{j \to \infty} \|\mathbf{g} - \mathbf{g}\_{k\_j}\| = 0.$$

When the space is finite-dimensional, a set is compact iff it is closed and bounded.

#### **Linear and Bounded Functionals**

Given a Hilbert space $\mathcal{H}$ with norm $\|\cdot\|_{\mathcal{H}}$, a functional $L : \mathcal{H} \to \mathbb{R}$ is said to be bounded (or, equivalently, continuous) if there exists a positive scalar $C$ such that

$$|L[g]| \le C \|g\|_{\mathcal{H}} \quad \forall g \in \mathcal{H}. \tag{6.66}$$

The following classical theorem holds.

**Theorem 6.26** (Closed graph theorem) *Let* $\mathcal{H}$ *be a Hilbert (or Banach) space. Then* $L : \mathcal{H} \to \mathbb{R}$ *is linear and bounded if and only if the graph of* $L$*, i.e.,*

$$Gr(L) = \{ (f, L[f]) \text{ with } f \in \mathcal{H} \},$$

*is a closed set in the product space* $\mathcal{H} \times \mathbb{R}$*. This means that if* $\{f_i\}_{i=1}^{+\infty}$ *is a sequence converging to* $f \in \mathcal{H}$ *and* $\{L[f_i]\}_{i=1}^{+\infty}$ *converges to* $y \in \mathbb{R}$*, then* $L[f] = y$*.*

This other fundamental theorem asserts that every linear and bounded functional over *H* is in one-to-one correspondence with a vector in *H* .

**Theorem 6.27** (Riesz representation theorem, based on [76]) *Let* $\mathcal{H}$ *be a Hilbert space and let* $L : \mathcal{H} \to \mathbb{R}$*. Then* $L$ *is linear and bounded if and only if there is a unique* $f \in \mathcal{H}$ *such that*

$$L[\mathbf{g}] = \langle \mathbf{g}, f \rangle\_{\mathcal{H}}, \quad \forall \mathbf{g} \in \mathcal{H}. \tag{6.67}$$

## *6.9.2 Proof of Theorem 6.1*

First, we derive two lemmas which are instrumental to the main proof.

#### **Lemma 6.1** *Let*

$$\mathcal{S} = \operatorname{span}(\{K\_x\}\_{x \in \mathcal{X}}).$$

*If there exists a Hilbert space $\mathcal{H}$ satisfying conditions (6.2) and (6.3), then $\mathcal{H}$ is the closure of $\mathcal{S}$, i.e., $\mathcal{H} = \bar{\mathcal{S}}$.*

*Proof* It follows from condition (6.2) that $\bar{\mathcal{S}}$ is a closed subspace contained in $\mathcal{H}$. Theorem 6.25 (Projection theorem) then ensures that any function $f \in \mathcal{H}$ can be written as

$$f = f_{\bar{\mathcal{S}}} + f_{\bar{\mathcal{S}}^\perp}, \quad f_{\bar{\mathcal{S}}} \in \bar{\mathcal{S}}, \ f_{\bar{\mathcal{S}}^\perp} \in \bar{\mathcal{S}}^\perp.$$

As for the component $f_{\bar{\mathcal{S}}^\perp}$, using condition (6.3) (reproducing property) we obtain

$$f_{\bar{\mathcal{S}}^\perp}(x) = \langle f_{\bar{\mathcal{S}}^\perp}, K_x \rangle_{\mathcal{H}} = 0, \ \forall x.$$

In fact, every kernel section belongs to $\mathcal{S}$ and is thus orthogonal to every function in $\bar{\mathcal{S}}^\perp$. Hence, $f_{\bar{\mathcal{S}}^\perp}$ is the null vector and this concludes the proof. $\square$

**Lemma 6.2** *Let $\mathcal{S} = \operatorname{span}(\{K_x\}_{x \in \mathcal{X}})$ and define*

$$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^m \sum_{j=1}^m c_i c_j K(x_i, x_j), \tag{6.68}$$

*where f is a generic element in S, hence admitting representation*


$$f(\cdot) = \sum\_{i=1}^{m} c\_i K\_{x\_i}(\cdot).$$

*Then, $\|\cdot\|_{\mathcal{H}}$ is a well-defined norm on $\mathcal{S}$.*

*Proof* The reader can easily check that absolute homogeneity and the triangle inequality are satisfied by $\|\cdot\|_{\mathcal{H}}$. We only need to prove the null vector condition, i.e., that for every $f \in \mathcal{S}$ one has

$$\|f\|_{\mathcal{H}} = 0 \iff f = 0.$$

Now, assume that $\|f\|_{\mathcal{H}} = 0$ where $f(\cdot) = \sum_{i=1}^m c_i K_{x_i}(\cdot)$. While the coefficients $\{c_i\}_{i=1}^m$ and the input locations $\{x_i\}_{i=1}^m$ are fixed and define $f$, let also $c_{m+1}$ and $x_{m+1}$ be an arbitrary scalar and input location, respectively. Define $\mathbf{K} \in \mathbb{R}^{m \times m}$ and $\mathbf{K}_+ \in \mathbb{R}^{(m+1) \times (m+1)}$ as the two matrices with $(i,j)$-entry given by $K(x_i, x_j)$. Let also $c = [c_1 \ \dots \ c_m]^T$ and $c_+ = [c_1 \ \dots \ c_m \ c_{m+1}]^T$. Note that $\mathbf{K}c$ is the vector which contains the values of $f$ at the input locations $\{x_i\}_{i=1}^m$.

Since *K* is positive definite, it holds that

$$c_+^T \mathbf{K}_+ c_+ \ge 0 \quad \forall \ (c_{m+1}, x_{m+1}) \in \mathbb{R} \times \mathcal{X}.$$

In addition, since by assumption

$$\left\|f\right\|\_{\mathcal{H}}^2 = c^T \mathbf{K}c = 0,$$

it follows that the components of the vector $\mathbf{K}c$, which are the values of $f$ at $\{x_i\}_{i=1}^m$, are all null. We now show that $f(x) = 0$ holds everywhere, i.e., also at a generic input location $x_{m+1} \in \mathcal{X}$. In fact, after simple calculations, one obtains

$$\begin{split} c\_+^T \mathbf{K}\_+ c\_+ &= c^T \mathbf{K} c + 2 \left[ \sum\_{i=1}^m c\_i \, K(\mathbf{x}\_i, \mathbf{x}\_{m+1}) \right] c\_{m+1} + K(\mathbf{x}\_{m+1}, \mathbf{x}\_{m+1}) c\_{m+1}^2 \\ &= 2 \left[ \sum\_{i=1}^m c\_i \, K(\mathbf{x}\_i, \mathbf{x}\_{m+1}) \right] c\_{m+1} + K(\mathbf{x}\_{m+1}, \mathbf{x}\_{m+1}) c\_{m+1}^2 \\ &= 2f(\mathbf{x}\_{m+1}) c\_{m+1} + K(\mathbf{x}\_{m+1}, \mathbf{x}\_{m+1}) c\_{m+1}^2. \end{split}$$

Now, assume that $f(x_{m+1}) > 0$. Since the last term on the r.h.s. is an infinitesimal of order two w.r.t. $c_{m+1}$, we can find a negative value of $c_{m+1}$ sufficiently close to zero such that $c_+^T \mathbf{K}_+ c_+ < 0$, which contradicts the fact that $K$ is positive definite. If $f(x_{m+1}) < 0$, we can instead find a positive value of $c_{m+1}$ sufficiently close to zero such that $c_+^T \mathbf{K}_+ c_+ < 0$, which is still a contradiction. Hence, we must have $f(x_{m+1}) = 0$. Since $x_{m+1}$ was arbitrary, we conclude that $f$ must be the null function. $\square$
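The quadratic form at the heart of this proof can be checked numerically. The following sketch (assuming NumPy and using a Gaussian kernel purely as an example of a positive-definite kernel) verifies that $\|f\|_{\mathcal{H}}^2 = c^T \mathbf{K} c \ge 0$ and that $\mathbf{K}c$ collects the values of $f$ at the input locations:

```python
import numpy as np

# A minimal sketch: Gaussian kernel as an example positive-definite kernel.
def K(x, y, w=1.0):
    return np.exp(-(x - y) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=6)          # input locations x_1..x_m
c = rng.normal(size=6)                  # expansion coefficients c_1..c_m
Kmat = K(x[:, None], x[None, :])        # kernel matrix with entries K(x_i, x_j)

# f(.) = sum_i c_i K_{x_i}(.) evaluated at the input locations equals K c
f_at_x = Kmat @ c

# the squared norm (6.68) is the quadratic form c^T K c, never negative
norm2 = c @ Kmat @ c
assert norm2 >= 0
# equivalently, it is the sum over i of c_i * f(x_i)
assert np.isclose(norm2, c @ f_at_x)
```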



We now prove Theorem 6.1. Let $\mathcal{S} = \operatorname{span}(\{K_x\}_{x \in \mathcal{X}})$ and, for any $f, g \in \mathcal{S}$ having representations

$$f(\cdot) = \sum\_{i=1}^{m} c\_i K\_{x\_i}(\cdot), \quad g(\cdot) = \sum\_{i=1}^{p} d\_i K\_{y\_i}(\cdot)$$

define

$$\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^m \sum_{j=1}^p c_i d_j K(x_i, y_j).$$

By Lemma 6.2, it is immediate to check that $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is a well-defined inner product on $\mathcal{S}$. We now show that the desired Hilbert space is $\mathcal{H} = \bar{\mathcal{S}}$, where $\bar{\mathcal{S}}$ is the completion of $\mathcal{S}$ w.r.t. the norm induced by $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.

Condition (6.2) is trivially satisfied since, by construction, all the kernel sections belong to *H* .

As for condition (6.3), we start by checking that it holds over $\mathcal{S}$. Introducing the couple of functions in $\mathcal{S}$ given by

$$f(\cdot) = \sum\_{i=1}^{m} c\_i K\_{x\_i}(\cdot), \quad \mathbf{g}(\cdot) = K\_x(\cdot),$$

we have

$$\langle f, K_x \rangle_{\mathcal{H}} = \langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^m c_i K(x_i, x) = f(x),$$

showing that the reproducing property holds in $\mathcal{S}$. Let us now consider the completion of $\mathcal{S}$. To this aim, let $\{f_j\}$ be a Cauchy sequence with $f_j \in \mathcal{S}$ $\forall j$. We have

$$\begin{aligned} |f_i(x) - f_j(x)| &= |\langle f_i - f_j, K_x \rangle_{\mathcal{H}}| \\ &\le \|f_i - f_j\|_{\mathcal{H}} \|K_x\|_{\mathcal{H}}, \end{aligned}$$

where we have used first the reproducing property (since it holds in *S*) and then the Cauchy–Schwarz inequality. We have

$$\|K_x\|_{\mathcal{H}} = \sqrt{\langle K_x, K_x \rangle_{\mathcal{H}}} = \sqrt{K(x, x)} \le q < +\infty,$$

where the scalar $q$, independent of $x$, exists because the kernel $K$ is continuous over the compact set $\mathcal{X} \times \mathcal{X}$. Combining the last two inequalities leads to

$$|f_i(x) - f_j(x)| \le \sup_{x \in \mathcal{X}} |f_i(x) - f_j(x)| \le q \|f_i - f_j\|_{\mathcal{H}}, \tag{6.69}$$

which shows that convergence in $\mathcal{H}$ implies uniform convergence. In other words, if $f_j \to f$ in $\mathcal{H}$ w.r.t. $\|\cdot\|_{\mathcal{H}}$, then $f_j \to f$ also in the space $\mathscr{C}$ of continuous functions w.r.t. the sup-norm $\|\cdot\|_\infty$. Since $\mathcal{S} \subset \mathscr{C}$ and $\mathscr{C}$ is Banach, all the functions in the completion of $\mathcal{S}$ are continuous, i.e., $\mathcal{H} \subset \mathscr{C}$. Furthermore, if $f_j \to f$ in $\mathcal{H}$, one has that for any $x \in \mathcal{X}$

$$\lim_{j \to \infty} \langle f_j, K_x \rangle_{\mathcal{H}} = \langle f, K_x \rangle_{\mathcal{H}},$$

by the continuity of the inner product. But we also have

$$\lim_{j \to \infty} \langle f_j, K_x \rangle_{\mathcal{H}} = \lim_{j \to \infty} f_j(x) = f(x),$$

since $f_j \in \mathcal{S}$ $\forall j$, the reproducing property holds in $\mathcal{S}$, and convergence in $\mathcal{H}$ implies uniform (and, hence, pointwise) convergence. This shows that $\langle f, K_x \rangle_{\mathcal{H}} = f(x)$ $\forall f \in \mathcal{H}$, i.e., the reproducing property holds over the whole space $\mathcal{H}$.
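The uniform-convergence bound (6.69) can also be probed numerically. A minimal sketch, assuming NumPy and a Gaussian kernel (for which $\sqrt{K(x,x)} = 1$, so $q = 1$), checks that $\sup_x |f(x)| \le q \, \|f\|_{\mathcal{H}}$ for a function in the span of kernel sections:

```python
import numpy as np

# Sketch of (6.69): for f in the span of kernel sections,
# sup_x |f(x)| <= q * ||f||_H with q = sup_x sqrt(K(x, x)).
def K(x, y, w=0.5):
    return np.exp(-(x - y) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(1)
xi = rng.uniform(-1, 1, 5)                   # centers of the kernel sections
c = rng.normal(size=5)                       # expansion coefficients
norm_f = np.sqrt(c @ K(xi[:, None], xi[None, :]) @ c)

grid = np.linspace(-1, 1, 2001)
f_grid = K(grid[:, None], xi[None, :]) @ c   # f evaluated on a fine grid
q = 1.0                                      # sqrt(K(x, x)) = 1 for this kernel
assert np.max(np.abs(f_grid)) <= q * norm_f + 1e-12
```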

The last point is the uniqueness of $\mathcal{H}$. For the sake of contradiction, assume that there exists another Hilbert space $\mathcal{G}$ which satisfies conditions (6.2) and (6.3). By Lemma 6.1, we must have $\mathcal{G} = \bar{\mathcal{S}}$, where the completion of $\mathcal{S}$ is w.r.t. the norm $\|\cdot\|_{\mathcal{G}}$ deriving from the inner product $\langle \cdot, \cdot \rangle_{\mathcal{G}}$. Condition (6.3) holds both in $\mathcal{H}$ and in $\mathcal{G}$, so that we have

$$
\langle K_x, K_s \rangle_{\mathcal{H}} = K(x, s) = \langle K_x, K_s \rangle_{\mathcal{G}}, \ \forall (x, s) \in \mathcal{X} \times \mathcal{X}.
$$

Since the functions in $\mathcal{S}$ are finite linear combinations of kernel sections, by the linearity of the inner product the above equality allows us to conclude that

$$
\langle f, g \rangle_{\mathcal{H}} = \langle f, g \rangle_{\mathcal{G}}, \ \forall (f, g) \in \mathcal{S} \times \mathcal{S}.
$$

This equality, together with the uniqueness of limits, implies that the completion of $\mathcal{S}$ w.r.t. $\|\cdot\|_{\mathcal{H}}$ coincides with the completion w.r.t. $\|\cdot\|_{\mathcal{G}}$. Hence, $\mathcal{H}$ and $\mathcal{G}$ are the same Hilbert space and this completes the proof. $\square$

## *6.9.3 Proof of Theorem 6.10*

It is not difficult to see that (6.12) with the inner product (6.13) is a Hilbert space. In addition, using the Mercer theorem, in particular the expansion (6.11), from (6.13) one has

$$\begin{aligned} \|K_x\|_{\mathcal{H}}^2 &= \Big\|\sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(\cdot)\Big\|_{\mathcal{H}}^2 \\ &= \sum_{i \in \mathcal{I}} \frac{\zeta_i^2 \rho_i^2(x)}{\zeta_i} = K(x, x) < \infty, \end{aligned}$$

and, for any $f = \sum_{i \in \mathcal{I}} a_i \rho_i$, it also holds that

$$\begin{aligned} \langle K_x, f \rangle_{\mathcal{H}} &= \Big\langle \sum_{i \in \mathcal{I}} \zeta_i \rho_i(x) \rho_i(\cdot), \sum_{i \in \mathcal{I}} a_i \rho_i(\cdot) \Big\rangle_{\mathcal{H}} \\ &= \sum_{i \in \mathcal{I}} \frac{\zeta_i \rho_i(x) a_i}{\zeta_i} = f(x). \end{aligned}$$

This shows that every kernel section belongs to $\mathcal{H}$ and that the reproducing property holds. Theorem 6.1 then ensures that $\mathcal{H}$ is indeed the RKHS associated with $K$.
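The two computations above are purely algebraic and can be mirrored in finite dimension. The sketch below (assuming NumPy; the basis functions and eigenvalues are illustrative choices, not taken from the book) builds a kernel from an explicit expansion $K(x,y) = \sum_i \zeta_i \rho_i(x)\rho_i(y)$ and checks the reproducing property under the inner product of (6.13), $\langle u, v \rangle_{\mathcal{H}} = \sum_i u_i v_i / \zeta_i$ on expansion coefficients:

```python
import numpy as np

# Sketch of Theorem 6.10 in finite dimension: kernel built from an explicit
# expansion with hypothetical eigenvalues zeta_i and basis functions rho_i.
zeta = np.array([1.0, 0.5, 0.25])
rho = [lambda x: np.ones_like(x),
       lambda x: np.sqrt(2) * np.cos(np.pi * x),
       lambda x: np.sqrt(2) * np.cos(2 * np.pi * x)]

def coeffs_of_Kx(x):
    # expansion coefficients of the kernel section K_x: zeta_i * rho_i(x)
    return np.array([z * r(np.array(x)) for z, r in zip(zeta, rho)]).ravel()

a = np.array([0.3, -1.2, 0.7])           # f = sum_i a_i rho_i

def inner(u, v):
    # the inner product (6.13): <u, v>_H = sum_i u_i v_i / zeta_i
    return np.sum(u * v / zeta)

for x in [0.0, 0.17, 0.5]:
    f_x = sum(ai * r(np.array(x)) for ai, r in zip(a, rho))
    # reproducing property: <K_x, f>_H = f(x)
    assert np.isclose(inner(coeffs_of_Kx(x), a), f_x)
```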

## *6.9.4 Proof of Theorem 6.13*

First, let $\mathcal{H}$ be the RKHS induced by $K(x, y) = \zeta \rho(x)\rho(y)$. Any RKHS is spanned by its kernel sections, hence in this case $\mathcal{H}$ is the one-dimensional subspace generated by $\rho$. By the reproducing property it holds that

$$\|K_x\|_{\mathcal{H}}^2 = K(x, x) = \zeta \rho^2(x).$$

In addition, one has

$$\|K_x\|_{\mathcal{H}}^2 = \|\zeta \rho(x)\rho\|_{\mathcal{H}}^2 = \zeta^2 \rho^2(x) \|\rho\|_{\mathcal{H}}^2,$$

so that

$$\|\rho\|_{\mathcal{H}}^2 = \frac{1}{\zeta}.$$

Now, consider the kernel of interest $K(x, y) = \sum_{i=1}^\infty \zeta_i \rho_i(x)\rho_i(y)$ associated with $\mathcal{H}$. Define $K_j(x, y) = \zeta_j \rho_j(x)\rho_j(y)$, with $\|\cdot\|_{\mathcal{H}_j}$ denoting the norm induced by $K_j$. From the discussion above it holds that

$$\|\rho_j\|_{\mathcal{H}_j}^2 = \frac{1}{\zeta_j}. \tag{6.70}$$

Think of $K(x, y) = \sum_{i=1}^\infty \zeta_i \rho_i(x)\rho_i(y)$ as the sum of $K_j(x, y)$ and $K_{-j}(x, y) = \sum_{k \ne j} \zeta_k \rho_k(x)\rho_k(y)$. Then, using Theorem 6.6 and (6.70), one has

$$\|\rho_j\|_{\mathcal{H}}^2 = \min_{c_j, h} \frac{c_j^2}{\zeta_j} + \|h\|_{\mathcal{H}_{-j}}^2 \ \text{ s.t. } \ \rho_j = c_j \rho_j + h, \ c_j \in \mathbb{R}, \ h \in \mathcal{H}_{-j},$$

where $\mathcal{H}_{-j}$ is the RKHS induced by $K_{-j}$. Evaluating the objective at $(c_j = 1, h = 0)$, one obtains


$$\|\rho_j\|_{\mathcal{H}}^2 \le \frac{1}{\zeta_j},$$

and this shows that $\rho_j \in \mathcal{H}$ $\forall j$.

Now we prove that the functions $\rho_j$ generate the whole RKHS $\mathcal{H}$ induced by $K$. Using Theorem 6.25 (Projection theorem), it follows that for any $f \in \mathcal{H}$ we have

$$f = g + h \ \text{ with } \ g \in G, \ h \in G^\perp,$$

where *G* indicates the closure in *H* of the subspace generated by all the ρ*<sup>k</sup>* . Using the reproducing property, one obtains

$$\begin{aligned} h(x) &= \langle h, K_x \rangle_{\mathcal{H}} \\ &= \Big\langle h(\cdot), \sum_{k=1}^\infty \zeta_k \rho_k(x) \rho_k(\cdot) \Big\rangle_{\mathcal{H}} \\ &= \sum_{k=1}^\infty \zeta_k \rho_k(x) \langle h(\cdot), \rho_k(\cdot) \rangle_{\mathcal{H}} = 0 \quad \forall x, \end{aligned}$$

where the last equality exploits the relation *h* ⊥ ρ*<sup>k</sup>* ∀*k*. This completes the first part of the proof.

As for the RKHS norm characterization, first let $\mathcal{H}_j^\infty$ be the RKHS induced by the kernel $\sum_{k=j}^\infty K_k$, with $h_j$ denoting a generic element of $\mathcal{H}_j^\infty$. Then, given $f \in \mathcal{H}$, using Theorem 6.6 in an iterative fashion we obtain

$$\begin{split} \|f\|_{\mathcal{H}}^2 &= \min_{c_1, h_2} \frac{c_1^2}{\zeta_1} + \|h_2\|_{\mathcal{H}_2^\infty}^2 \ \text{ s.t. } \ f = c_1 \rho_1 + h_2 \\ &= \min_{c_1, c_2, h_3} \frac{c_1^2}{\zeta_1} + \frac{c_2^2}{\zeta_2} + \|h_3\|_{\mathcal{H}_3^\infty}^2 \ \text{ s.t. } \ f = c_1 \rho_1 + c_2 \rho_2 + h_3 \\ &\ \ \vdots \\ &= \min_{c_1, \dots, c_{n-1}, h_n} \sum_{k=1}^{n-1} \frac{c_k^2}{\zeta_k} + \|h_n\|_{\mathcal{H}_n^\infty}^2 \ \text{ s.t. } \ f = \sum_{i=1}^{n-1} c_i \rho_i + h_n. \end{split}$$

In particular, every equality above is obtained by thinking of the kernel $\sum_{k=j}^\infty K_k$ as the sum of $K_j$ and $\sum_{k=j+1}^\infty K_k$. Then, $h_j$ can be decomposed into two parts, i.e., $h_j = c_j \rho_j + h_{j+1}$, with $\|\rho_j\|_{\mathcal{H}_j}^2 = 1/\zeta_j$ where, as before, $\|\cdot\|_{\mathcal{H}_j}$ denotes the norm in the one-dimensional RKHS induced by $K_j$. Now, let $\hat{c}_1, \dots, \hat{c}_{n-1}, \hat{h}_n$ be the minimizers of the last objective (the minimizer can be assumed unique without loss of generality, just to simplify the exposition) and note that $\|\hat{h}_n\|_{\mathcal{H}_n^\infty}$ must go to zero as $n \to \infty$. It then follows that the norm $\|f\|_{\mathcal{H}}^2$ is indeed $\min_{\{c_k\}} \sum_{k=1}^\infty \frac{c_k^2}{\zeta_k}$ with the $\{c_k\}$ subject to the constraint $\lim_{n \to \infty} \|f - \sum_{k=1}^n c_k \rho_k\|_{\mathcal{H}} = 0$.
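In finite dimension, this norm characterization has a concrete matrix counterpart: if $\mathbf{K} = V \operatorname{diag}(\zeta) V^T$ with orthonormal columns playing the role of the $\rho_k$, then for $f = Va$ the characterization $\sum_k a_k^2/\zeta_k$ coincides with the quadratic form $f^T \mathbf{K}^{-1} f$. A minimal sketch assuming NumPy (the matrices are synthetic illustrations):

```python
import numpy as np

# Finite-dimensional analogue of the norm characterization: for
# K = V diag(zeta) V^T and f = V a, sum_k a_k^2/zeta_k equals f^T K^{-1} f.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
V, _ = np.linalg.qr(A)                 # orthonormal "eigenfunctions" rho_k
zeta = np.array([2.0, 1.0, 0.5, 0.1])  # positive kernel eigenvalues
Kmat = V @ np.diag(zeta) @ V.T

a = rng.normal(size=4)                 # expansion coefficients of f
f = V @ a
norm2_expansion = np.sum(a ** 2 / zeta)
norm2_quadratic = f @ np.linalg.solve(Kmat, f)
assert np.isclose(norm2_expansion, norm2_quadratic)
```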

## *6.9.5 Proofs of Theorems 6.15 and 6.16*

We prove the following more general result that embraces as special cases Theorems 6.15 and 6.16.

**Theorem 6.28** *Let H be a Hilbert space. Consider the optimization problem*

$$\min_{f \in \mathcal{H}} \Phi(L_1[f], \dots, L_N[f], \|f\|_{\mathcal{H}}) \tag{6.71}$$

*and assume that each functional $L_i : \mathcal{H} \to \mathbb{R}$ is linear and bounded, and that $\Phi$ is strictly increasing w.r.t. its last argument.*


*Then, all the solutions of (6.71) admit the following expression*

$$
\hat{\mathbf{g}} = \sum\_{i=1}^{N} c\_i \eta\_i,\tag{6.72}
$$

*where the $c_i$ are suitable scalar expansion coefficients and each $\eta_i \in \mathcal{H}$ is the representer of $L_i$, i.e.,*

$$L\_i[f] = \langle f, \eta\_i \rangle\_{\mathcal{H}}, \quad \forall f \in \mathcal{H}, \ i = 1, \ldots, N.$$

*In particular, if H is a RKHS with kernel K , each basis function is given by*

$$
\eta\_i(\mathbf{x}) = L\_i[K(\cdot, \mathbf{x})].
$$

To prove the above result, let $\hat{g}$ be a solution of (6.71) and denote by $\mathcal{S}$ the (closed) subspace spanned by the $N$ representers $\eta_i$ of the functionals $L_i$, i.e.,

$$S = \text{span}\{\eta\_1, \dots, \eta\_N\}.$$

Exploiting Theorem 6.25 (Projection theorem), we can write

$$
\hat{\mathbf{g}} = \hat{\mathbf{g}}\_{\mathcal{S}} + \hat{\mathbf{g}}\_{\mathcal{S}^\perp}, \quad \hat{\mathbf{g}}\_{\mathcal{S}} \in \mathcal{S}, \ \hat{\mathbf{g}}\_{\mathcal{S}^\perp} \in \mathcal{S}^\perp.
$$

For the sake of contradiction, assume that $\hat{g}_{\mathcal{S}^\perp}$ is different from the null function. Then, we have

$$\begin{aligned} \Phi(L_1[\hat{g}], \dots, L_N[\hat{g}], \|\hat{g}\|_{\mathcal{H}}) &= \Phi(\langle \eta_1, \hat{g} \rangle_{\mathcal{H}}, \dots, \langle \eta_N, \hat{g} \rangle_{\mathcal{H}}, \|\hat{g}\|_{\mathcal{H}}) \\ &= \Phi\Big(\langle \eta_1, \hat{g}_{\mathcal{S}} + \hat{g}_{\mathcal{S}^\perp} \rangle_{\mathcal{H}}, \dots, \langle \eta_N, \hat{g}_{\mathcal{S}} + \hat{g}_{\mathcal{S}^\perp} \rangle_{\mathcal{H}}, \sqrt{\|\hat{g}_{\mathcal{S}}\|_{\mathcal{H}}^2 + \|\hat{g}_{\mathcal{S}^\perp}\|_{\mathcal{H}}^2}\Big) \\ &= \Phi\Big(\langle \eta_1, \hat{g}_{\mathcal{S}} \rangle_{\mathcal{H}}, \dots, \langle \eta_N, \hat{g}_{\mathcal{S}} \rangle_{\mathcal{H}}, \sqrt{\|\hat{g}_{\mathcal{S}}\|_{\mathcal{H}}^2 + \|\hat{g}_{\mathcal{S}^\perp}\|_{\mathcal{H}}^2}\Big) \\ &> \Phi(\langle \eta_1, \hat{g}_{\mathcal{S}} \rangle_{\mathcal{H}}, \dots, \langle \eta_N, \hat{g}_{\mathcal{S}} \rangle_{\mathcal{H}}, \|\hat{g}_{\mathcal{S}}\|_{\mathcal{H}}), \end{aligned}$$

where the last equality exploits the fact that each $\eta_i$ is orthogonal to all the functions in $\mathcal{S}^\perp$, while the final inequality exploits the assumption that $\Phi$ is strictly increasing w.r.t. its last argument. This contradicts the optimality of $\hat{g}$ and implies that $\hat{g}_{\mathcal{S}^\perp}$ must be the null function, hence concluding the first part of the proof.

Finally, to prove the expression of the basis functions $\eta_i$ note that, if $\mathcal{H}$ is a RKHS, one has

$$
\eta_i(x) = \langle \eta_i, K_x \rangle_{\mathcal{H}} = L_i[K(\cdot, x)],
$$

where the first equality comes from the reproducing property, while the second derives from the fact that $\eta_i$ is the representer of $L_i$. $\square$
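For the regularized least squares special case ($L_i[f] = f(x_i)$, quadratic loss), the theorem gives the familiar expansion $\hat{g} = \sum_i c_i K_{x_i}$ with $c = (\mathbf{K} + \gamma I)^{-1} y$. The following sketch, assuming NumPy and a Gaussian kernel as an illustrative choice, computes this solution and verifies that nothing is gained by leaving the span of the $N$ representers: the gradient of the objective w.r.t. coefficients of extra kernel sections vanishes at the representer solution.

```python
import numpy as np

# Sketch of the representer theorem for regularized least squares:
# the minimizer of sum_i (y_i - f(x_i))^2 + gamma ||f||_H^2 over the RKHS
# is g = sum_i c_i K_{x_i} with c = (K + gamma I)^{-1} y.
def K(x, y, w=0.7):
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * w ** 2))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 8)
y = np.sin(3 * x) + 0.1 * rng.normal(size=8)
gamma = 0.1
Kmat = K(x, x)
c = np.linalg.solve(Kmat + gamma * np.eye(8), y)

# at extra centers z, the gradient of the objective (parametrized over a
# larger span of kernel sections) vanishes at the representer solution,
# so leaving span{K_{x_1}, ..., K_{x_N}} cannot lower the objective
z = np.linspace(-1, 1, 5)
f_at_x = Kmat @ c
grad_extra = K(x, z).T @ (f_at_x - y) + gamma * (K(z, x) @ c)
assert np.allclose(grad_extra, 0)
```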

## *6.9.6 Proof of Theorem 6.21*

#### **Preliminary Lemmas**

The first lemma, whose proof can be found in [34], states a bound on the correlation between two random variables assuming values in a Hilbert space.

**Lemma 6.3** (based on [34]) *Let $a$ and $b$ be zero-mean random variables, measurable with respect to the $\sigma$-algebras $\mathcal{M}_1$ and $\mathcal{M}_2$, and with values in the Hilbert space $\mathcal{H}$ having inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Then, it holds that*

$$|\mathcal{E}[\langle a, b \rangle_{\mathcal{H}}]| \le 15 \sqrt[3]{\psi(\mathcal{M}_1, \mathcal{M}_2)\, \mathcal{E}\|a\|_{\mathcal{H}}^3\, \mathcal{E}\|b\|_{\mathcal{H}}^3}, \tag{6.73}$$

*where all the expectations above are assumed to exist and*

$$\psi(\mathcal{M}_1, \mathcal{M}_2) = \sup_{A \in \mathcal{M}_1, B \in \mathcal{M}_2} |P(A \cap B) - P(A)P(B)|.$$

As for the second lemma, first it is useful to introduce the following integral operator:

$$L_K[f](\cdot) = \int_{\mathcal{X}} K(\cdot, x) f(x) p_x(x) dx.$$

Since the assumptions underlying Theorem 6.9 (Mercer theorem) hold true, there exists a complete orthonormal basis of $\mathscr{L}_2^\mu$, denoted by $\{\rho_i\}_{i \in \mathcal{I}}$, which satisfies

$$L_K[\rho_i] = \zeta_i \rho_i, \quad i \in \mathcal{I}, \quad \zeta_1 \ge \zeta_2 \ge \dotsb$$

To simplify the exposition, hereafter we assume $\zeta_i > 0$ $\forall i$. Then, for $r > 0$, we define the operators $L_K^r$ and $L_K^{-r}$, acting on $f = \sum_{i \in \mathcal{I}} c_i \rho_i$, as follows:

$$L_K^r[f] = \sum_{i \in \mathcal{I}} \zeta_i^r c_i \rho_i \tag{6.74}$$

$$L_K^{-r}[f] = \sum_{i \in \mathcal{I}} \frac{c_i}{\zeta_i^r} \rho_i. \tag{6.75}$$

The function $L_K^{-r}[f]$ is less regular than $f$ since its expansion coefficients go to zero more slowly. Instead, $L_K^r$ is a smoothing operator since $\zeta_i^r c_i$ goes to zero faster than $c_i$ as $i$ goes to infinity. When $r = 1/2$ we recover the operator $L_K^{1/2}$ already defined in (6.17), which satisfies $\mathcal{H} = L_K^{1/2} \mathscr{L}_2^\mu$. The following lemma holds.

**Lemma 6.4** *If $L_K^{-r} f_\rho \in \mathscr{L}_2^\mu$ for some $0 < r \le 1$, letting*

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \|f - f_\rho\|_{\mathscr{L}_2^\mu}^2 + \gamma \|f\|_{\mathcal{H}}^2, \tag{6.76}$$

*one has*

$$\|\hat{f} - f_\rho\|_{\mathscr{L}_2^\mu} \le \gamma^r \|L_K^{-r} f_\rho\|_{\mathscr{L}_2^\mu}. \tag{6.77}$$

*Proof* By assumption, there exists $g \in \mathscr{L}_2^\mu$, say $g = \sum_{i \in \mathcal{I}} d_i \rho_i$, such that $f_\rho = L_K^r g$, so that $f_\rho = \sum_{i \in \mathcal{I}} \zeta_i^r d_i \rho_i$. Now, we characterize the solution $\hat{f}$ of (6.76) using $f = \sum_{i \in \mathcal{I}} c_i \rho_i$ and optimizing w.r.t. the $c_i$. The objective becomes

$$\sum_{i \in \mathcal{I}} (c_i - \zeta_i^r d_i)^2 + \gamma \sum_{i \in \mathcal{I}} \frac{c_i^2}{\zeta_i},$$

and setting the partial derivatives w.r.t. each *ci* to zero, we obtain

$$
\hat{f} = \sum_{i \in \mathcal{I}} \hat{c}_i \rho_i, \quad \hat{c}_i = \frac{\zeta_i^{r+1} d_i}{\zeta_i + \gamma}. \tag{6.78}
$$

This implies

$$\hat{f} - f_\rho = -\sum_{i \in \mathcal{I}} \frac{\gamma}{\zeta_i + \gamma} \zeta_i^r d_i \rho_i.$$

If 0 < *r* ≤ 1, it follows that

$$\begin{split} \|\hat{f} - f_\rho\|_{\mathscr{L}_2^\mu} &= \left\{ \sum_{i \in \mathcal{I}} \left( \frac{\gamma}{\zeta_i + \gamma} \zeta_i^r d_i \right)^2 \right\}^{1/2} \\ &= \gamma^r \left\{ \sum_{i \in \mathcal{I}} \left( \frac{\gamma}{\zeta_i + \gamma} \right)^{2(1-r)} \left( \frac{\zeta_i}{\zeta_i + \gamma} \right)^{2r} d_i^2 \right\}^{1/2} \\ &\le \gamma^r \left\{ \sum_{i \in \mathcal{I}} d_i^2 \right\}^{1/2} = \gamma^r \|g\|_{\mathscr{L}_2^\mu} = \gamma^r \|L_K^{-r} f_\rho\|_{\mathscr{L}_2^\mu}, \end{split}$$

where the inequality uses $\gamma/(\zeta_i + \gamma) \le 1$ and $\zeta_i/(\zeta_i + \gamma) \le 1$, and this proves (6.77). $\square$
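Both the per-coefficient minimizer (6.78) and the bound (6.77) can be verified numerically. A minimal sketch with NumPy (the eigenvalue decay and the coefficients of $g$ are synthetic choices for illustration):

```python
import numpy as np

# Numeric check of Lemma 6.4 with synthetic eigenvalues: the coefficients
# (6.78) zero the gradient of the per-index objective, and the resulting
# error obeys the bound (6.77).
rng = np.random.default_rng(6)
zeta = 1.0 / np.arange(1, 201) ** 2      # decaying kernel eigenvalues
d = rng.normal(size=200)                 # coefficients of g, f_rho = L_K^r g
r, gamma = 0.5, 1e-2

# (6.78): c_hat minimizes (c - zeta^r d)^2 + gamma c^2 / zeta index by index
c_hat = zeta ** (r + 1) * d / (zeta + gamma)
grad = 2 * (c_hat - zeta ** r * d) + 2 * gamma * c_hat / zeta
assert np.allclose(grad, 0)

# (6.77): ||f_hat - f_rho||_{L2} <= gamma^r ||L_K^{-r} f_rho||_{L2} = gamma^r ||g||
err = np.sqrt(np.sum((gamma / (zeta + gamma) * zeta ** r * d) ** 2))
assert err <= gamma ** r * np.sqrt(np.sum(d ** 2)) + 1e-12
```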

In the proof of the third lemma reported below, the notation $S_x : \mathcal{H} \to \mathbb{R}^N$ indicates the sampling operator defined by $S_x f = [f(x_1) \ \dots \ f(x_N)]^T$. In addition, $S_x^T$ denotes its adjoint, i.e., for any $c \in \mathbb{R}^N$, it satisfies

$$\langle f, S_x^T c \rangle_{\mathcal{H}} = \langle S_x f, c \rangle = \sum_{i=1}^N c_i f(x_i) = \Big\langle f, \sum_{i=1}^N c_i K_{x_i} \Big\rangle_{\mathcal{H}},$$

where ·, · is the Euclidean inner product. Hence, one has

$$S\_x^T c = \sum\_{i=1}^N c\_i K\_{x\_i} \quad \forall \ c \in \mathbb{R}^N.$$

**Lemma 6.5** *Define*

$$\eta\_i(\cdot) = \left[\mathbf{y}\_i - \hat{f}(\mathbf{x}\_i)\right] K(\mathbf{x}\_i, \cdot) \tag{6.79}$$

*with $\hat{f}$ defined by (6.76). Then, if $\hat{g}_N$ is given by (6.55), one has*

$$\|\hat{g}_N - \hat{f}\|_{\mathcal{H}} \le \frac{1}{\gamma} \left\| \frac{1}{N} \sum_{i=1}^N (\eta_i - \mathcal{E}[\eta_i]) \right\|_{\mathcal{H}}.$$

*Proof* First, it is useful to derive two equalities involving $\hat{f}$ and $\hat{g}_N$. The first one is

$$
\gamma \hat{f} = L_K[f_\rho - \hat{f}] = \mathcal{E}[\eta_i]. \tag{6.80}
$$


The last equality in (6.80) follows from the definitions of $L_K$ and $\eta_i$. The first equality can be obtained using the representation $f_\rho = \sum_{i \in \mathcal{I}} d_i \rho_i$ and then following the same passages contained in the first part of the previous lemma's proof to obtain

$$\hat{f} = \sum_{i \in \mathcal{I}} \frac{\zeta_i}{\zeta_i + \gamma} d_i \rho_i, \quad f_\rho - \hat{f} = \sum_{i \in \mathcal{I}} \frac{\gamma}{\zeta_i + \gamma} d_i \rho_i.$$

The second result consists of the following alternative expression for *g*ˆ*<sup>N</sup>* :

$$\hat{\mathbf{g}}\_N = \left(\frac{S\_x^T S\_x}{N} + \gamma I\right)^{-1} \frac{S\_x^T}{N} Y,\tag{6.81}$$

where $I$ denotes the identity operator. To prove it, we will use the equality $\hat{g}_N = S_x^T (\mathbf{K} + N\gamma I_N)^{-1} Y$, which derives from the representer theorem, and also the fact that, for any vector $c \in \mathbb{R}^N$, it holds that $S_x S_x^T c = \mathbf{K} c$, with $\mathbf{K}$ the kernel matrix built using $[x_1 \ \dots \ x_N]$. Then, we have

$$\begin{aligned} \left(\frac{S\_x^T S\_x}{N} + \gamma I\right) \hat{\mathbf{g}}\_N &= \frac{S\_x^T}{N} \left(\mathbf{K} \left(\mathbf{K} + N\gamma I\_N\right)^{-1} + N\gamma \left(\mathbf{K} + N\gamma I\_N\right)^{-1}\right) Y\\ &= \frac{S\_x^T}{N} Y. \end{aligned}$$
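The equivalence between the two expressions for $\hat{g}_N$ is an instance of the matrix push-through identity, and it can be checked in finite dimension where $S_x$ becomes an $N \times d$ feature matrix $\Phi$ and $\mathbf{K} = \Phi\Phi^T$. A minimal sketch assuming NumPy (the matrices are synthetic):

```python
import numpy as np

# Finite-dimensional check of (6.81): with a d-dimensional feature map,
# S_x becomes the N x d matrix Phi, K = Phi Phi^T, and
# (Phi^T Phi / N + gamma I_d)^{-1} Phi^T Y / N = Phi^T (K + N gamma I_N)^{-1} Y.
rng = np.random.default_rng(4)
N, dim, gamma = 20, 5, 0.3
Phi = rng.normal(size=(N, dim))
Y = rng.normal(size=N)

lhs = np.linalg.solve(Phi.T @ Phi / N + gamma * np.eye(dim), Phi.T @ Y / N)
rhs = Phi.T @ np.linalg.solve(Phi @ Phi.T + N * gamma * np.eye(N), Y)
assert np.allclose(lhs, rhs)
```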

Now, it is also useful to bound the inverse of the operator $\frac{S_x^T S_x}{N} + \gamma I$. Assume that $v \in \mathcal{H}$ and let $u$ satisfy

$$\left(\frac{S_x^T S_x}{N} + \gamma I\right) u = v.$$

We take inner products on both sides with $u$ and use the equality $\langle S_x^T S_x u, u \rangle_{\mathcal{H}} = \langle S_x u, S_x u \rangle$ to obtain

$$\frac{1}{N} \langle S_x u, S_x u \rangle + \gamma \|u\|_{\mathcal{H}}^2 = \langle v, u \rangle_{\mathcal{H}} \le \|v\|_{\mathcal{H}} \|u\|_{\mathcal{H}}.$$

One has

$$\lambda_x := \inf_{f \in \mathcal{H}} \frac{\|S_x f\|}{\sqrt{N} \|f\|_{\mathcal{H}}} \implies \left(\lambda_x^2 + \gamma\right) \|u\|_{\mathcal{H}}^2 \le \|v\|_{\mathcal{H}} \|u\|_{\mathcal{H}}.$$

Thus, we have shown that

$$\left(\frac{S_x^T S_x}{N} + \gamma I\right) u = v \implies \|u\|_{\mathcal{H}} \le \frac{\|v\|_{\mathcal{H}}}{\lambda_x^2 + \gamma} \le \frac{1}{\gamma} \|v\|_{\mathcal{H}}, \quad \forall v \in \mathcal{H}. \tag{6.82}$$
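The bound (6.82) simply says that the operator $\frac{S_x^T S_x}{N} + \gamma I$ has spectrum bounded below by $\gamma$, so its inverse has norm at most $1/\gamma$. A finite-dimensional sketch assuming NumPy:

```python
import numpy as np

# Finite-dimensional check of (6.82): Phi^T Phi / N + gamma I >= gamma I,
# hence solving (Phi^T Phi / N + gamma I) u = v gives ||u|| <= ||v|| / gamma.
rng = np.random.default_rng(5)
N, dim, gamma = 30, 6, 0.4
Phi = rng.normal(size=(N, dim))
M = Phi.T @ Phi / N + gamma * np.eye(dim)

for _ in range(100):
    v = rng.normal(size=dim)
    u = np.linalg.solve(M, v)
    assert np.linalg.norm(u) <= np.linalg.norm(v) / gamma + 1e-12
```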

Now, it follows from (6.81) that

$$
\hat{g}_N - \hat{f} = \left(\frac{S_x^T S_x}{N} + \gamma I\right)^{-1} \left(\frac{S_x^T Y}{N} - \frac{S_x^T S_x \hat{f}}{N} - \gamma \hat{f}\right).
$$

Exploiting the equalities

$$\frac{S_x^T Y}{N} - \frac{S_x^T S_x \hat{f}}{N} = \frac{1}{N} \sum_{i=1}^N \eta_i, \quad \gamma \hat{f} = \mathcal{E}[\eta_i],$$

which derive from (6.79) and (6.80), respectively, we obtain

$$
\hat{g}_N - \hat{f} = \left(\frac{S_x^T S_x}{N} + \gamma I\right)^{-1} \frac{1}{N} \sum_{i=1}^N (\eta_i - \mathcal{E}[\eta_i]).
$$

The use of (6.82) then completes the proof. $\square$

#### **Proof of Statistical Consistency**

Let ˆ*f* be defined by (6.76), i.e.,

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \|f - f_\rho\|_{\mathscr{L}_2^\mu}^2 + \gamma \|f\|_{\mathcal{H}}^2.$$

Then, consider the following error decomposition

$$\|\hat{g}_N - f_\rho\|_{\mathscr{L}_2^\mu} \le \|\hat{f} - f_\rho\|_{\mathscr{L}_2^\mu} + \|\hat{g}_N - \hat{f}\|_{\mathscr{L}_2^\mu}. \tag{6.83}$$

The first term $\|\hat{f} - f_\rho\|_{\mathscr{L}_2^\mu}$ on the r.h.s. is not stochastic. The assumption $f_\rho \in \mathcal{H}$ ensures that $\|L_K^{-r} f_\rho\|_{\mathscr{L}_2^\mu} < \infty$ for $0 \le r \le 1/2$. It thus follows from Lemma 6.4 that, at least for $0 < r \le 1/2$, it holds that

$$\|\hat{f} - f_\rho\|_{\mathscr{L}_2^\mu} \le \gamma^r \|L_K^{-r} f_\rho\|_{\mathscr{L}_2^\mu} < \infty. \tag{6.84}$$

Now, consider the second term $\|\hat{g}_N - \hat{f}\|_{\mathscr{L}_2^\mu}$. Since the input space (the function domain) is compact, and recalling also (6.69), there exists a constant $A$ such that

$$\|\hat{g}_N - \hat{f}\|_{\mathscr{L}_2^\mu} \le A \|\hat{g}_N - \hat{f}\|_{\mathcal{H}}. \tag{6.85}$$

To obtain a bound for the r.h.s. involving the RKHS norm, consider the stochastic function

$$
\eta\_i(\cdot) = \left[ \mathbf{y}\_i - \hat{f}(\mathbf{x}\_i) \right] K(\mathbf{x}\_i, \cdot),
$$

already introduced in (6.79). Using the reproducing property, one has

$$\|\eta_i\|_{\mathcal{H}}^2 = \big[y_i - \hat{f}(x_i)\big]^2 K(x_i, x_i). \tag{6.86}$$

The function $\hat{f}$ belongs to $\mathcal{H}$ and is thus continuous on the compact set $\mathcal{X}$. In addition, the kernel $K$ is continuous on the compact set $\mathcal{X} \times \mathcal{X}$, and the process $\{x_i, y_i\}$ has finite moments up to the third order by assumption. Hence, there exists a constant $B$, independent of $i$, such that

$$\mathcal{E}\big[\|\eta_i\|_{\mathcal{H}}^k\big] \le B, \quad k = 1, 2, 3. \tag{6.87}$$

We can now come back to $\|\hat{g}_N - \hat{f}\|_{\mathcal{H}}$. From Lemma 6.5, $\forall \gamma > 0$ it holds that

$$\|\hat{g}_N - \hat{f}\|_{\mathcal{H}} \le \frac{1}{\gamma} \left\| \frac{1}{N} \sum_{i=1}^N (\eta_i - \mathcal{E}[\eta_i]) \right\|_{\mathcal{H}}. \tag{6.88}$$

Now, using first Jensen's inequality and then (6.87), (6.88), Assumption 6.20 and (6.73) in Lemma 6.3 (with $a$ and $b$ replaced by $\eta_i - \mathcal{E}[\eta_i]$ and $\eta_j - \mathcal{E}[\eta_j]$), one obtains constants $C$ and $D$ such that

$$\begin{split} \left( \mathcal{E} \left\| \frac{1}{N} \sum_{i=1}^N (\eta_i - \mathcal{E}[\eta_i]) \right\|_{\mathcal{H}} \right)^2 &\le \mathcal{E}\left[ \left\| \frac{1}{N} \sum_{i=1}^N (\eta_i - \mathcal{E}[\eta_i]) \right\|_{\mathcal{H}}^2 \right] \\ &\le \frac{15}{N^2} \sum_{i=1}^N \sum_{j=1}^N \sqrt[3]{\psi_{|i-j|}} \left( \mathcal{E} \|\eta - \mathcal{E}[\eta]\|_{\mathcal{H}}^3 \right)^{\frac{2}{3}} \\ &\le \frac{C}{N} \left( \mathcal{E} \big( \|\eta\|_{\mathcal{H}} + \|\mathcal{E}[\eta]\|_{\mathcal{H}} \big)^3 \right)^{\frac{2}{3}} \le \frac{D}{N}, \end{split}$$

where η replaces η*<sup>i</sup>* or η*<sup>j</sup>* when the expectation is independent of *i* and *j*. This latter result, combined with (6.85) and (6.88), leads to the existence of a constant *E* such that

$$\mathcal{E}\|\hat{g}_N - \hat{f}\|_{\mathscr{L}_2^\mu} \le A\, \mathcal{E}\|\hat{g}_N - \hat{f}\|_{\mathcal{H}} \le \frac{E}{\gamma \sqrt{N}} \tag{6.89}$$

that, combined with (6.83) and (6.84), implies that for any 0 < *r* ≤ 1/2

$$\mathcal{E}\|\hat{g}_N - f_\rho\|_{\mathscr{L}_2^\mu} \le \gamma^r \|L_K^{-r} f_\rho\|_{\mathscr{L}_2^\mu} + \frac{E}{\gamma \sqrt{N}}. \tag{6.90}$$

Hence, when $\gamma$ is chosen according to (6.56), $\mathcal{E}\|\hat{g}_N - f_\rho\|_{\mathscr{L}_2^\mu}$ converges to zero as $N$ grows to $\infty$. Using the Markov inequality, (6.57) is finally obtained. $\square$

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Regularization in Reproducing Kernel Hilbert Spaces for Linear System Identification**

**Abstract** In the previous parts of the book, we have studied how to handle linear system identification by using regularized least squares (ReLS) with finite-dimensional structures given, e.g., by finite impulse response (FIR) models. In this chapter, we cast this approach in the RKHS framework developed in the previous chapter. We show that ReLS with quadratic penalties can be reformulated as a function estimation problem in the finite-dimensional RKHS induced by the regularization matrix. This leads to a new paradigm for linear system identification that also provides new insights and regularization tools to handle infinite-dimensional problems, involving, e.g., IIR and continuous-time models. For this entire class of problems, we will see that the representer theorem ensures that the regularized impulse response is a finite linear combination of basis functions given by the convolution between the system input and the kernel sections. We then consider the issue of kernel estimation and introduce several tuning methods that have close connections with those related to the regularization matrix discussed in Chap. 3. Finally, we introduce the notion of stable kernels, which induce RKHSs containing only absolutely summable impulse responses, and study minimax properties of regularized impulse response estimation.

## **7.1 Regularized Linear System Identification in Reproducing Kernel Hilbert Spaces**

## *7.1.1 Discrete-Time Case*

We will consider linear discrete-time systems in the form of the so-called output error (OE) models. Data are generated according to the relationship

$$\mathbf{y}(t) = G^0(q)\boldsymbol{u}(t) + \boldsymbol{e}(t), \quad t = 1, \ldots, N,\tag{7.1}$$

where *<sup>y</sup>*(*t*), *<sup>u</sup>*(*t*) and *<sup>e</sup>*(*t*) <sup>∈</sup> <sup>R</sup> are the system output, the known system input and the noise at time instant *<sup>t</sup>* <sup>∈</sup> <sup>N</sup>, respectively. In addition, *<sup>G</sup>*0(*q*) is the "true" system that has to be identified from the input–output samples with *q* being the time shift operator, i.e., *qu*(*t*) = *u*(*t* + 1). Here, and also in all the remaining parts of the chapter, we assume that *e* is white noise (all its components are mutually uncorrelated).

In Chap. 2, we have seen that there exist different ways to parametrize *G*0(*q*). In what follows, we will start our discussions exploiting the simplest impulse response descriptions given by FIR models and then we will consider more general infinitedimensional models also in continuous time. We will see that there is a common way to estimate them through regularization in the RKHS framework and the representer theorem.

#### **7.1.1.1 FIR Case**

The FIR case corresponds to

$$\begin{split} y(t) &= G(q, \theta)u(t) + e(t) \\ &= \sum\_{k=1}^{m} g\_{k} u(t - k) + e(t), \quad \theta = \left[ g\_{1}, \dots, g\_{m} \right]^{T}, \end{split} \tag{7.2}$$

where *m* is the FIR order, *g*1, ..., *gm* are the FIR coefficients and θ is the unknown vector that collects them. Model (7.2) can be rewritten in vector form as follows:

$$Y = \Phi \theta + E,\tag{7.3}$$

where

$$Y = \begin{bmatrix} \mathbf{y}(1) \ \dots \ \mathbf{y}(N) \end{bmatrix}^T, \quad E = \begin{bmatrix} e(1) \ \dots \ e(N) \end{bmatrix}^T$$

and

$$\Phi = [\varphi(1) \; \cdots \; \varphi(N)]^T$$

with

$$\varphi^{T}(t) = [u(t-1) \; \cdots \; u(t-m)].$$

Instead of describing FIR model estimation directly in the regularized RKHS framework, let us first recall the ReLS method with quadratic penalty term introduced in Chap. 3. It gives the estimate of θ by solving the following problem:

$$\hat{\theta} = \operatorname\*{arg\,min}\_{\theta} \sum\_{t=1}^{N} \Big(y(t) - \sum\_{k=1}^{m} g\_k u(t-k)\Big)^2 + \gamma\, \theta^T P^{-1} \theta \tag{7.4a}$$

$$= \operatorname\*{arg\,min}\_{\theta} \left\| Y - \Phi\theta \right\|^2 + \gamma\, \theta^T P^{-1} \theta \tag{7.4b}$$

$$= (\Phi^T \Phi + \gamma P^{-1})^{-1} \Phi^T Y \tag{7.4c}$$

$$= P\Phi^T(\Phi P\Phi^T + \gamma I\_N)^{-1} Y,\tag{7.4d}$$

where the regularization matrix *<sup>P</sup>* <sup>∈</sup> <sup>R</sup>*<sup>m</sup>*×*<sup>m</sup>* is positive semidefinite, assumed invertible for simplicity. The regularization parameter γ is a positive scalar that, as already seen, has to balance adherence to experimental data and strength of regularization.
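The equivalence between the parameter-space form (7.4c) and the data-space form (7.4d) follows from the matrix inversion lemma and can be checked numerically. The sketch below uses synthetic data and a TC-type regularization matrix; all dimensions and values are illustrative, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, gamma = 50, 10, 0.1

alpha = 0.8                                        # TC-type regularization matrix (illustrative)
P = np.array([[alpha ** max(i, j) for j in range(1, m + 1)] for i in range(1, m + 1)])
Phi = rng.standard_normal((N, m))                  # synthetic regressor matrix
Y = rng.standard_normal(N)                         # synthetic output vector

# (7.4c): solve in the m-dimensional parameter space
theta_c = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)
# (7.4d): solve in the N-dimensional data space
theta_d = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(N), Y)

assert np.allclose(theta_c, theta_d)
```

The second form is convenient when *N* < *m* (or when only the kernel, not its inverse, is available), since it only requires solving an *N* × *N* linear system.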

Now we show that (7.4) can be reformulated as a function estimation problem with regularization in the RKHS framework. For this aim, we will see that the key is to use the *m* × *m* matrix *P* to define the kernel over the domain {1, 2,..., *m*} × {1, <sup>2</sup>,..., *<sup>m</sup>*}. This in turn will define a RKHS of functions *<sup>g</sup>* : {1, <sup>2</sup>,..., *<sup>m</sup>*} → <sup>R</sup>. Such functions are connected with the components *gi* of the *m*-dimensional vector θ by the relation *g*(*i*) = *gi* . So, the functional view is obtained replacing the vector θ with the function that maps *i* into the *i*th component of θ.

Let us define a positive semidefinite kernel *<sup>K</sup>* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> as follows:

$$K(i,j) = P\_{ij}, \quad i, j \in \mathcal{X} = \{1, 2, \ldots, m\}, \tag{7.5}$$

where *P<sub>ij</sub>* is the (*i*, *j*)th entry of the regularization matrix *P*. It is obvious that *K* is positive semidefinite because *P* is positive semidefinite. Its kernel sections will be denoted by *Ki* with *i* = 1,..., *m* and are the columns of *P* seen as functions mapping *X* into R.

Now, using the Moore–Aronszajn Theorem, illustrated in Theorem6.2, the kernel *K* reported in (7.5) defines a unique RKHS *H* such that $\langle K\_i, g \rangle\_{\mathcal{H}} = g(i)$, ∀(*i*, *g*) ∈ (*X*, *H*). This is the function space where we will search for the estimate of the FIR coefficients. According to the discussion following Theorem6.2, since there are just *m* kernel sections *Ki* associated to the *m* columns of *P*, for any impulse response candidate *g* ∈ *H*, there exist *m* scalars *aj* such that

$$g(i) = \sum\_{j=1}^{m} a\_j K(i, j) = P(i,:)\,a \tag{7.6}$$

where *P*(*i*, :) is the *i*th row of *P*. Since *g*(*i*) is the *i*th component of θ, one has

$$
\theta = Pa.
$$

By the reproducing property, we also have


$$\begin{aligned} \|g\|\_{\mathcal{H}}^2 &= \Big\langle \sum\_{j=1}^m a\_j K\_j, \sum\_{l=1}^m a\_l K\_l \Big\rangle\_{\mathcal{H}} = \sum\_{j=1}^m \sum\_{l=1}^m a\_j a\_l K(j,l) \\ &= \sum\_{j=1}^m \sum\_{l=1}^m a\_j a\_l P\_{jl} = a^T P a \end{aligned}$$

and this, since θ = *Pa*, implies

$$\left\|g\right\|\_{\mathcal{H}}^2 = \theta^T P^{-1} \theta.$$

As a result, the ReLS method (7.4) can be reformulated as follows:

$$\hat{g} = \operatorname\*{arg\,min}\_{g \in \mathcal{H}} \sum\_{t=1}^{N} \left( y(t) - \sum\_{k=1}^{m} g(k) u(t-k) \right)^{2} + \gamma \left\| g \right\|\_{\mathcal{H}}^{2} \tag{7.7}$$

which is a regularized function estimation problem in the RKHS *H* .

In view of the equivalence between (7.4) and (7.7), the FIR function estimate *g*ˆ has the closed-form expression given by (7.4d). The correspondence is established by *g*ˆ(*i*) = θˆ*<sub>i</sub>*. We will show later that such closed-form expression can be derived/interpreted by exploiting the representer theorem.

**Remark 7.1** Besides (7.7), there is also an alternative way to reformulate the ReLS method (7.4) as a function estimation problem with regularization in the RKHS framework. This has been sketched in the discussions on linear kernels in Sect. 6.6.1. The difference lies in the choice of the function to be estimated and the choice of the corresponding kernel. In particular, in this chapter, we have obtained (7.7) choosing the function and the corresponding kernel to be the FIR *g* and (7.5), respectively. In contrast, in Sect. 6.6.1, the RKHS is defined by the kernel

$$K(x, y) = x^T P y, \quad x, y \in \mathcal{X} = \mathbb{R}^m \tag{7.8}$$

and contains the linear functions *x <sup>T</sup>* θ, where the input locations *x* encapsulate *m* past input values. So, using (7.8), the corresponding RKHS does not contain impulse responses but functions that represent directly linear systems mapping regressors (built with input values) into outputs.

#### **7.1.1.2 IIR Case**

The infinite impulse response (IIR) case corresponds to

$$\mathbf{y}(t) = G(q, \theta)u(t) + e(t) = \sum\_{k=1}^{\infty} \mathbf{g}\_k u(t - k) + e(t), \quad t = 1, \dots, N \tag{7.9}$$

where θ = [*g*1,..., *g*∞] *<sup>T</sup>*. So, the model order *m* is set to ∞ and we have to handle infinite-dimensional objects. To face the intrinsic ill-posedness of the estimation problem, one could think of introducing an infinite-dimensional regularization matrix *P*. But the penalty θ *<sup>T</sup> P*−1θ, adopted in (7.4) for the FIR case, would turn out to be undefined. So, the RKHS setting is needed to define regularized IIR estimates. The first step is to choose a positive semidefinite kernel *<sup>K</sup>* : <sup>N</sup> <sup>×</sup> <sup>N</sup> <sup>→</sup> <sup>R</sup>. Then, let *<sup>H</sup>* be the RKHS associated with *K* and *g* ∈ *H* be the IIR function with *g*(*k*) = *gk* for *<sup>k</sup>* <sup>∈</sup> <sup>N</sup>. Finally, the estimate is given by

$$\hat{g} = \operatorname\*{arg\,min}\_{g \in \mathcal{H}} \sum\_{t=1}^{N} \Big(y(t) - \sum\_{k=1}^{\infty} g(k) u(t-k)\Big)^2 + \gamma \|g\|\_{\mathcal{H}}^2. \tag{7.10}$$

One may wonder whether it is possible to obtain a closed-form expression of the IIR estimate *g*ˆ as in the FIR case. The answer is positive and given by the following representer theorem. It derives from Theorem6.16 reported in the previous chapter applied to the case of quadratic loss functions, as discussed in Example 6.17, which allows one to recover the expansion coefficients of the estimate just by solving a linear system of equations, see (6.29) and (6.31). Before stating the result formally, it is useful to point out the following two facts:


**Theorem 7.1** (Representer theorem for discrete-time linear system identification, based on [73, 90])*. Consider the function estimation problem* (7.10)*. Assume that <sup>H</sup> is the RKHS induced by a positive semidefinite kernel K* : <sup>N</sup> <sup>×</sup> <sup>N</sup> <sup>→</sup> <sup>R</sup> *and that, for t* = 1,..., *N, the functions* η*<sup>t</sup> defined by*

$$\eta\_t(i) = \sum\_{k=1}^{\infty} K(i,k)u(t-k), \quad i \in \mathbb{N} \tag{7.11}$$

*are all well defined in H . Then, the solution of* (7.10) *is*

$$\hat{\mathbf{g}}(i) = \sum\_{t=1}^{N} \hat{c}\_{t} \eta\_{t}(i), \quad i \in \mathbb{N}, \tag{7.12}$$

*where c*ˆ*<sup>t</sup> is the tth entry of the vector*


$$\hat{c} = (O + \gamma I\_N)^{-1} Y \tag{7.13}$$

*with Y* = [*y*(1), . . . *y*(*N*)] *<sup>T</sup> and with the* (*t*,*s*)*th entry of O given by*

$$O\_{ts} = \sum\_{i=1}^{\infty} \sum\_{k=1}^{\infty} K(i,k)u(t-k)u(s-i), \quad t,s = 1,\ldots,N. \tag{7.14}$$

Theorem7.1 discloses an important feature of regularized impulse response estimation in RKHS. The function estimate *g*ˆ has a finite dimensional representation that does not depend on the dimension of the RKHS *H* induced by the kernel but only on the data set size *N*.

**Example 7.2** (*Stable spline kernel for IIR estimation*) To estimate high-order FIR models, in the previous chapters, we have introduced some regularization matrices related to the DC, TC and stable spline kernels, see (5.40) and (5.41). Consider now the TC kernel, also called first-order stable spline, with support extended to <sup>N</sup> <sup>×</sup> <sup>N</sup>, i.e.,

$$K(i,j) = \alpha^{\max(i,j)}, \quad 0 < \alpha < 1, \quad i, j \in \mathbb{N}.\tag{7.15}$$

This kernel induces a RKHS that contains IIR models and can be conveniently adopted in the estimator (7.10). An interesting question is to derive the structure of the induced regularizer $\|g\|\_{\mathcal{H}}^2$. One could connect *K* with the matrix *P* entering (7.4a), but its inverse is undefined since now *P* is infinite dimensional. To derive the stable spline norm, it is instead necessary to resort to functional analysis arguments. In particular, in Sect. 7.7.1, it is proved that

$$\|g\|\_{\mathcal{H}}^2 = \sum\_{t=1}^\infty \frac{(g\_{t+1} - g\_t)^2}{(1 - \alpha)\alpha^t},\tag{7.16}$$

an expression that well reveals how the kernel (7.15) includes information on smooth exponential decay. When used in (7.10), the resulting IIR estimate balances the data fit (sum of squared residuals) and the energy of the impulse response increments weighted by coefficients that increase exponentially with time *t* and thus enforce stability.
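Formula (7.16) can be probed numerically against the reproducing-property norm: for *g* in the span of finitely many kernel sections, ‖*g*‖² equals *a<sup>T</sup>Ka* with *K* the corresponding Gram matrix, and the series (7.16), truncated far enough, should return the same value. A sketch with arbitrary α and random coefficients (illustrative values, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, p, T = 0.6, 5, 400

# g in the span of the first p kernel sections of the TC kernel (7.15)
a = rng.standard_normal(p)
K = np.array([[alpha ** max(i, j) for j in range(1, p + 1)] for i in range(1, p + 1)])
norm_rkhs = a @ K @ a                # ||g||^2 = a^T K a by the reproducing property

# evaluate g on 1..T and sum the series (7.16), truncated at T
g = np.array([sum(a[j] * alpha ** max(t, j + 1) for j in range(p)) for t in range(1, T + 1)])
series = sum((g[t] - g[t - 1]) ** 2 / ((1 - alpha) * alpha ** t) for t in range(1, T))

assert abs(norm_rkhs - series) < 1e-8 * max(1.0, norm_rkhs)
```

Truncation at *T* = 400 is safe here because the summands decay like α*<sup>t</sup>*: *g* decays like α*<sup>t</sup>* while the weights grow only like α*<sup>−t</sup>*.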

Let us now consider a simple application of the representer theorem. Assume that the system input is a causal step of unit amplitude, i.e., *u*(*t*) = 1 for *t* ≥ 0 and *u*(*t*) = 0 otherwise. The functions (7.11) are given by

$$\eta\_t(i) = \sum\_{k=1}^{\infty} K(i,k)u(t-k), \quad i \in \mathbb{N}.$$

For instance, the first three basis functions are

$$\begin{aligned} \eta\_1(i) &= \sum\_{k=1}^{\infty} K(i,k)u(1-k) = \alpha^{\max(i,1)}\\ \eta\_2(i) &= \sum\_{k=1}^{\infty} K(i,k)u(2-k) = \alpha^{\max(i,1)} + \alpha^{\max(i,2)}\\ \eta\_3(i) &= \sum\_{k=1}^{\infty} K(i,k)u(3-k) = \alpha^{\max(i,1)} + \alpha^{\max(i,2)} + \alpha^{\max(i,3)} \end{aligned}$$

and, in general, one has

$$\eta\_t(i) = \sum\_{k=1}^{t} \alpha^{\max(i,k)}.$$

Hence, any η*<sup>t</sup>* is a well-defined function in the RKHS induced by *K*, being the sum of the first *t* kernel sections. Then, according to Theorem7.1, we conclude that the IIR estimate returned by (7.10) is spanned by the functions $\{\eta\_t\}\_{t=1}^{N}$ with coefficients then computable from (7.13).
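As a sanity check, the recipe of Theorem 7.1 can be implemented by truncating the infinite sums in (7.11) and (7.14) at a large index *M*; the resulting estimate must then coincide with the regularized FIR solution (7.4c) of order *M* built from the sampled kernel. The sketch below (illustrative values, not code from the book) uses the TC kernel and the causal step input of the example:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, gamma, N, M = 0.7, 0.05, 40, 25     # M truncates the infinite sums

# TC kernel (7.15) on {1,...,M} and a causal unit-step input
K = np.array([[alpha ** max(i, j) for j in range(1, M + 1)] for i in range(1, M + 1)])
u = lambda t: 1.0 if t >= 0 else 0.0

# synthetic data from a "true" first-order system g0(k) = 0.5^k
g0 = 0.5 ** np.arange(1, M + 1)
y = np.array([sum(g0[k - 1] * u(t - k) for k in range(1, M + 1)) for t in range(1, N + 1)])
y += 0.01 * rng.standard_normal(N)

# basis functions (7.11) and output kernel matrix (7.14), truncated at M
eta = np.array([[sum(K[i - 1, k - 1] * u(t - k) for k in range(1, M + 1))
                 for i in range(1, M + 1)] for t in range(1, N + 1)])      # row t-1 is eta_t
O = np.array([[sum(eta[t - 1, i - 1] * u(s - i) for i in range(1, M + 1))
               for s in range(1, N + 1)] for t in range(1, N + 1)])

c_hat = np.linalg.solve(O + gamma * np.eye(N), y)                          # (7.13)
g_hat = eta.T @ c_hat                                                      # (7.12)

# consistency: same estimate as the regularized FIR solution (7.4c) of order M
Phi = np.array([[u(t - k) for k in range(1, M + 1)] for t in range(1, N + 1)])
g_fir = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(K), Phi.T @ y)
assert np.allclose(g_hat, g_fir)
```

The agreement reflects the identity *O* = Φ*K*Φ*<sup>T</sup>* discussed below for the FIR case.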

Although Theorem7.1 is stated for the IIR case (7.10), the same result also holds for the FIR case (7.7). The only difference is that the series in (7.11) and (7.14) have to be replaced by finite sums up to the FIR order *m*. Then, interestingly, one can interpret the regularized FIR estimate (7.4d) in a different way exploiting the representer theorem perspective. In particular, one finds *O* = Φ*P*Φ*<sup>T</sup>*, while the basis functions $\{\eta\_t\}\_{t=1}^{N}$ are in one-to-one correspondence with the *N* columns of *P*Φ*<sup>T</sup>*, each of dimension *m*.

## *7.1.2 Continuous-Time Case*

Now, we consider linear continuous-time systems still focusing on the output error (OE) model structure. The system outputs are collected over *N* time instants *ti* . Hence, the measurements model is

$$\mathbf{y}(t\_i) = \int\_0^\infty \mathbf{g}^0(\tau)\boldsymbol{\mu}(t\_i - \tau)d\tau + \boldsymbol{e}(t\_i), \quad i = 1, \ldots, N,\tag{7.17}$$

where *y*(*t*), *u*(*t*) and *e*(*t*) are the system output, the known input and the noise at time instant *<sup>t</sup>* <sup>∈</sup> <sup>R</sup>+, respectively, while *<sup>g</sup>*<sup>0</sup>(*t*), *<sup>t</sup>* <sup>∈</sup> <sup>R</sup><sup>+</sup> is the "true" system impulse response.

Similarly to what was done in the previous section, we will study how to determine, from a finite set of input–output data, a regularized estimate of the impulse response *g*<sup>0</sup> in the RKHS framework. The first step is to choose a positive semidefinite kernel *K* : <sup>R</sup><sup>+</sup> <sup>×</sup> <sup>R</sup><sup>+</sup> <sup>→</sup> <sup>R</sup>. It induces the RKHS *<sup>H</sup>* containing the impulse response candidates *g* ∈ *H*. Then, the linear model can be estimated by solving the following function estimation problem:


$$\hat{g} = \operatorname\*{arg\,min}\_{g \in \mathcal{H}} \sum\_{i=1}^{N} \left( y(t\_{i}) - \int\_{0}^{\infty} g(\tau) u(t\_{i} - \tau)\, d\tau \right)^{2} + \gamma \left\| g \right\|\_{\mathcal{H}}^{2}.\tag{7.18}$$

The closed-form expression of the impulse response estimate *g*ˆ is given by the following representer theorem that again derives from Theorem6.16 and the same discussion reported before Theorem7.1. Note just that now any functional *Li* entering Theorem6.16 is applied to continuous-time impulse responses *g* in the RKHS *H* . Hence, it represents the continuous-time convolution with the input, i.e., *Li* maps *g* ∈ *H* into the system output evaluated at the time instant *ti* .

**Theorem 7.3** (Representer theorem for continuous-time linear system identification, based on [73, 90]) *Consider the function estimation problem* (7.18)*. Assume that <sup>H</sup> is the RKHS induced by a positive semidefinite kernel K* : <sup>R</sup><sup>+</sup> <sup>×</sup> <sup>R</sup><sup>+</sup> <sup>→</sup> <sup>R</sup> *and that, for i* = 1,..., *N, the functions* η*<sup>i</sup> defined by*

$$\eta\_i(\mathbf{s}) = \int\_0^\infty K(\mathbf{s}, \tau) u(t\_i - \tau) d\tau, \quad \mathbf{s} \in \mathbb{R}^+ \tag{7.19}$$

*are all well defined in H . Then, the solution of* (7.18) *is*

$$\hat{\mathbf{g}}(\mathbf{s}) = \sum\_{i=1}^{N} \hat{c}\_{i} \eta\_{i}(\mathbf{s}), \quad \mathbf{s} \in \mathbb{R}^{+} \tag{7.20}$$

*where c*ˆ*<sup>i</sup> is the ith entry of the vector*

$$\hat{c} = (O + \gamma I\_N)^{-1} Y \tag{7.21}$$

*with Y* = [*y*(*t*1), . . . *y*(*tN* )] *<sup>T</sup> and the* (*i*, *j*)*th entry of O given by*

$$O\_{ij} = \int\_0^\infty \int\_0^\infty K(\tau, s) u(t\_i - s) u(t\_j - \tau) ds d\tau, \quad i, j = 1, \ldots, N. \tag{7.22}$$

**Example 7.4** (*Stable spline kernel for continuous-time system identification*) In Example 6.5, we introduced the first-order spline kernel min(*x*, *y*) on [0, 1]×[0, 1]. It describes a RKHS of continuous functions *f* on the unit interval that satisfy *f* (0) = 0 whose squared norm is the energy of the first-order derivative, i.e.,

$$\int\_0^1 \left(\dot{f}(x)\right)^2 dx. \tag{7.23}$$

To describe stable impulse responses *g*, we instead need a kernel defined over the positive real axis <sup>R</sup><sup>+</sup> that induces the constraint *<sup>g</sup>*(+∞) = 0. A simple way to obtain this is to exploit the composition of the spline kernel with an *exponential* change of coordinates mapping <sup>R</sup><sup>+</sup> into [0, <sup>1</sup>]. The resulting kernel is called (continuous-time)

**Fig. 7.1** First-order (top left) and second-order (bottom left) stable spline kernel with some kernel sections (right panels) obtained with β = 0.5 and centred on 0, 0.5, 1,..., 10 (bottom)

first-order stable spline kernel. It is given by

$$K(s, t) = \min(e^{-\beta s}, e^{-\beta t}) = e^{-\beta \max(s, t)}, \quad s, t \in \mathbb{R}^+, \tag{7.24}$$

where β > 0 regulates the change of coordinates and, hence, the decay rate of the impulse responses. So, β can be seen as a kernel parameter related to the dominant pole of the system.

It is interesting to note the similarity between the kernel (7.15) and the first-order stable spline kernel (7.24). By letting α = exp(−β), the sampled version of the first-order stable spline kernel (7.24) corresponds exactly to the TC kernel (7.15). The top panels of Fig. 7.1 plot (7.24) and also some kernel sections: they are all continuous and exponentially decaying to zero. Such kernel inherits also the universality property of the splines. In fact, its kernel sections can approximate any continuous impulse response on all the compact subsets of R+.

The relationship with splines permits also to easily achieve one spectral decomposition of (7.24). In particular, in Example 6.11, we obtained the following expansion of the spline kernel:

256 7 Regularization in Reproducing Kernel Hilbert Spaces …

$$\min(\mathbf{x}, \mathbf{y}) = \sum\_{i=1}^{+\infty} \xi\_i \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y})$$

with

$$
\rho\_i(\mathbf{x}) = \sqrt{2}\sin\left(i\pi\mathbf{x} - \frac{\pi\mathbf{x}}{2}\right), \quad \xi\_i = \frac{1}{(i\pi - \pi/2)^2},
$$

where all the ρ*<sup>i</sup>* are mutually orthogonal on [0, 1] w.r.t. the Lebesgue measure. In view of the simple connection between spline and stable spline kernels given by exponential time transformations, one easily obtains that the first-order stable spline kernel can be diagonalized as follows:

$$e^{-\beta \max(s, t)} = \sum\_{i=1}^{\infty} \xi\_i \phi\_i(s) \phi\_i(t) \tag{7.25}$$

with

$$\phi\_i(t) = \rho\_i(e^{-\beta t}), \quad \xi\_i = \frac{1}{(i\pi - \pi/2)^2},\tag{7.26}$$

where the φ*<sup>i</sup>* are now orthogonal on [0, +∞) w.r.t. the measure μ of density β*e*<sup>−β*t*</sup>. In Fig. 6.3, we reported the eigenfunctions ρ*<sup>i</sup>* with *i* = 1, 2, 8 and the eigenvalues ζ*<sup>i</sup>* for the first-order spline kernel (6.47). For comparison, we now show in Fig. 7.2 the corresponding eigenfunctions φ*<sup>i</sup>* of the first-order stable spline kernel (7.24) with β = 1 and also the ζ*<sup>i</sup>*. While the eigenvalues are the same, differently from the ρ*<sup>i</sup>*, the eigenfunctions φ*<sup>i</sup>* now decay exponentially to zero.
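These properties can be probed numerically; in the sketch below (truncation level, test points and quadrature grid are arbitrary, illustrative choices), the truncated expansion (7.25) is compared with *e*<sup>−β max(*s*,*t*)</sup>, and the substitution *x* = *e*<sup>−β*t*</sup> reduces the weighted orthonormality of the φ*<sub>i</sub>* to that of the ρ*<sub>i</sub>* on [0, 1]:

```python
import numpy as np

beta, n = 1.0, 2000

xi = 1.0 / ((np.arange(1, n + 1) * np.pi - np.pi / 2) ** 2)     # eigenvalues
rho = lambda i, x: np.sqrt(2) * np.sin((i - 0.5) * np.pi * x)   # spline eigenfunctions
phi = lambda i, t: rho(i, np.exp(-beta * t))                    # stable spline eigenfunctions (7.26)

# truncated expansion (7.25) evaluated at a pair of test points
s, t = 0.7, 1.9
approx = sum(xi[i - 1] * phi(i, s) * phi(i, t) for i in range(1, n + 1))
assert abs(approx - np.exp(-beta * max(s, t))) < 1e-3

# orthonormality of the phi_i w.r.t. the density beta*exp(-beta*t):
# the substitution x = exp(-beta*t) reduces it to int_0^1 rho_i(x) rho_j(x) dx
x = (np.arange(200000) + 0.5) / 200000                          # midpoint rule on [0, 1]
assert abs(np.mean(rho(2, x) * rho(5, x))) < 1e-6
assert abs(np.mean(rho(3, x) ** 2) - 1.0) < 1e-6
```

The loose tolerance in the first assertion reflects the slow 1/*i*² decay of the eigenvalues, which makes the pointwise convergence of the expansion correspondingly slow.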

Having obtained one spectral decomposition of (7.24), we can now exploit Theorem6.10 to obtain the following representation of the RKHS induced by the first-order stable spline kernel:

**Fig. 7.2** Expansion of the continuous-time first-order stable spline kernel *e*−<sup>β</sup> max(*x*,*y*) with β = 1: eigenfunctions φ*i*(*x*) for *i* = 1, 2, 8 (left panel) and eigenvalues ζ*<sup>i</sup>* (right)


$$\mathcal{H} = \left\{ g \mid g(t) = \sum\_{i=1}^{\infty} c\_i \phi\_i(t), \ t \ge 0, \ \sum\_{i=1}^{\infty} \frac{c\_i^2}{\xi\_i} < \infty \right\},\tag{7.27}$$

and the squared norm of *g* turns out to be

$$\left\|g\right\|\_{\mathcal{H}}^2 = \sum\_{i=1}^\infty \frac{c\_i^2}{\xi\_i}.\tag{7.28}$$

Now we will exploit the above results to obtain a more useful expression for $\|g\|\_{\mathcal{H}}^2$. The deep connection between spline and stable spline kernels implies that the two induced spaces are isometrically isomorphic, i.e., there is a one-to-one correspondence that preserves inner products. In fact, we can associate to any stable spline function *g*(*t*) in *H* the spline function *f* (*t*) in the space induced by (6.47) such that *g*(*t*) = *f* (*e*<sup>−β*t*</sup>). So, $g(t) = \sum\_{i=1}^{\infty} c\_i \phi\_i(t)$ implies $f(t) = \sum\_{i=1}^{\infty} c\_i \rho\_i(t)$ and the two functions have indeed the same norm $\sum\_{i=1}^{\infty} c\_i^2/\xi\_i$. Now, using (7.23) and (7.28), we obtain

$$\left\|\mathbf{g}\right\|\_{\mathcal{H}}^2 = \int\_0^1 \left(\dot{f}(t)\right)^2 dt = \int\_0^{+\infty} \left(\dot{\mathbf{g}}(t)\right)^2 \frac{e^{\beta t}}{\beta} dt. \tag{7.29}$$

This expression gives insights into the nature of the stable spline space. Compared to the classical Sobolev space induced by the first-order spline kernel, the norm penalizes the energy of the first-order derivative of *g* with a weight proportional to *e*<sup>β*t*</sup>. Such a norm thus enforces all the functions in *H* to be continuous impulse responses decaying to zero at least exponentially. Note also that (7.29) is really the continuous-time counterpart of the norm (7.16) associated to the discrete-time stable spline kernel.
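The change of coordinates behind (7.29) can be verified on a test function; in the sketch below, *f*(*x*) = sin(π*x*/2) is an arbitrary choice with *f*(0) = 0, and both sides of the identity are computed by quadrature:

```python
import numpy as np

beta = 1.0
f = lambda x: np.sin(np.pi * x / 2)                 # test function in the spline space, f(0) = 0
df = lambda x: (np.pi / 2) * np.cos(np.pi * x / 2)
g = lambda t: f(np.exp(-beta * t))                  # stable spline counterpart g(t) = f(e^{-beta t})
dg = lambda t: -beta * np.exp(-beta * t) * df(np.exp(-beta * t))

# left-hand side of (7.29): energy of f' on [0, 1] (midpoint rule)
x = (np.arange(100000) + 0.5) / 100000
lhs = np.mean(df(x) ** 2)

# right-hand side: exponentially weighted energy of g' on [0, T], T large
T, n = 40.0, 400000
t = (np.arange(n) + 0.5) * T / n
rhs = np.sum(dg(t) ** 2 * np.exp(beta * t) / beta) * T / n

assert abs(lhs - rhs) < 1e-4 * lhs
```

The truncation at *T* = 40 is harmless since the weighted integrand decays like *e*<sup>−β*t*</sup>.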

Let us see now how to generalize the kernel (7.24). In Sect. 6.6.6 of the previous chapter, we have introduced the general class of spline kernels. Here, we started our discussion using the first-order (linear) spline kernel min(*x*, *y*) but we have seen that higher-order models can be useful to reconstruct smoother functions, an important example being the second-order (cubic) spline kernel (6.48). Applying exponential time transformations to the splines, the class of the so-called *stable spline kernels* is obtained. For instance, from (6.48), one obtains the second-order stable spline kernel

$$\frac{e^{-\beta(s+t+\max(s,t))}}{2} - \frac{e^{-3\beta\max(s,t)}}{6}.\tag{7.30}$$

The bottom panels of Fig. 7.1 plot (7.30) and also some kernel sections: they exponentially decay to zero and are more regular than those associated to (7.24).

## *7.1.3 More General Use of the Representer Theorem for Linear System Identification*

Theorems 7.1 and 7.3 are special cases of the more general representer theorem involving function estimation from sparse and noisy data. It was reported as Theorem6.16 in the previous chapter. Let us briefly recall it. Its starting point was the optimization problem

$$\hat{g} = \operatorname\*{arg\,min}\_{g \in \mathcal{H}} \sum\_{i=1}^{N} \mathcal{V}\_{i}(y\_{i}, L\_{i}[g]) + \gamma \|g\|\_{\mathcal{H}}^{2},\tag{7.31}$$

where *V<sup>i</sup>* is a loss function, e.g., the quadratic loss adopted in this chapter, and each functional *Li* : *<sup>H</sup>* <sup>→</sup> <sup>R</sup> is linear and bounded. Then, all the solutions of (7.31) are given by

$$\hat{\mathbf{g}} = \sum\_{i=1}^{N} c\_i \eta\_i,\tag{7.32}$$

where each η*<sup>i</sup>* ∈ *H* is the representer of *Li* given by

$$
\eta\_i(t) = L\_i[K(\cdot, t)].\tag{7.33}
$$

How to compute the expansion coefficients *ci* will then depend on the nature of the *V<sup>i</sup>* , as described in Sect. 6.5.

The estimator (7.31) can be exploited for linear system identification thinking of *g* as an impulse response, using e.g., a stable spline kernel to define *H* . The linear functional *Li* is then defined by a convolution and returns the system noiseless outputs at instant *ti* . In particular, in discrete-time one has

$$L\_i[g] = \sum\_{k=1}^{\infty} g(k) u(t\_i - k), \quad t\_i = 1, \dots, N \tag{7.34}$$

while in continuous time, it holds that

$$L\_i[g] = \int\_0^\infty g(\tau) u(t\_i - \tau)\, d\tau. \tag{7.35}$$

When quadratic losses are used, (7.31) becomes the regularization network described in Sect. 6.5.1, whose expansion coefficients are available in closed form. One has $\hat{c} = (O + \gamma I\_N)^{-1}Y$ with the (*t*,*s*)th entry of the matrix *O* given by $O\_{ts} = L\_s[L\_t[K]]$, as given by (7.14) in discrete time and by (7.22) in continuous time. The use of losses *V<sup>i</sup>* different from quadratic then opens the way also to the definition of many new algorithms for impulse response estimation. For example, the use of Vapnik's ε-insensitive loss described in Sect. 6.5.3 leads to support vector regression

for linear system identification. Beyond promoting sparsity in the coefficients *ci*, it also makes the estimator robust against outliers since penalties on large residuals grow linearly. Outliers can be tackled also by adopting the ℓ<sub>1</sub> or Huber loss, see Sect. 6.5.2. A general system identification framework that includes all the convex piecewise linear quadratic losses and penalties is, e.g., described in [2].
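As an illustration of such a robust variant (a minimal sketch, not an algorithm from the book), the Huber loss can be combined with the quadratic penalty of (7.4) and solved by iteratively reweighted least squares; all problem sizes, the TC-type matrix and the injected outliers are synthetic, illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, gamma, delta = 80, 8, 0.1, 0.1

alpha = 0.8
P = np.array([[alpha ** max(i, j) for j in range(1, m + 1)] for i in range(1, m + 1)])
Phi = rng.standard_normal((N, m))                  # synthetic regressors
theta0 = 0.5 ** np.arange(1, m + 1)                # "true" FIR coefficients
Y = Phi @ theta0 + 0.02 * rng.standard_normal(N)
Y[::10] += 5.0                                     # inject a few large outliers

def huber_rels(Phi, Y, P, gamma, delta, iters=50):
    """Huber-loss ReLS solved by iteratively reweighted least squares."""
    theta = np.zeros(Phi.shape[1])
    Pinv = np.linalg.inv(P)
    for _ in range(iters):
        r = Y - Phi @ theta
        # Huber weights: quadratic loss below delta, linear above
        w = np.minimum(1.0, delta / np.maximum(np.abs(r), 1e-12))
        A = Phi.T @ (w[:, None] * Phi) + gamma * Pinv
        theta = np.linalg.solve(A, Phi.T @ (w * Y))
    return theta

theta_huber = huber_rels(Phi, Y, P, gamma, delta)
theta_ls = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)

# the robust fit should be far less perturbed by the outliers
assert np.linalg.norm(theta_huber - theta0) < np.linalg.norm(theta_ls - theta0)
```

Each reweighted step solves a weighted version of (7.4c), so the scheme reuses the quadratic machinery of this chapter while downweighting large residuals.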

Interestingly, the estimator (7.31) can be conveniently adopted for linear system identification also giving *g* a different meaning from an impulse response. For instance, in system identification there are important IIR models that use Laguerre functions, see, e.g., [91, 92], whose *z*-transform is

$$\frac{\sqrt{1-\alpha^2}}{z-\alpha} \left(\frac{1-\alpha z}{z-\alpha}\right)^{j-1}, \quad j = 1, 2, \dots$$

They form an orthonormal basis in ℓ<sub>2</sub> and some of them are displayed in Fig. 7.3.

Another option is given by the Kautz basis functions that allow also to include information on the presence of system resonances [46]. Using φ*<sup>i</sup>* to denote such basis functions, the impulse response model can be written as

$$f(t) = \sum\_{i=1}^{\infty} g\_i \phi\_i(t).$$

A problem is how to determine the coefficients *gi* from data. Classical approaches use truncated expansions $f = \sum\_{i=1}^{d} g\_i \phi\_i$, with model order *d* estimated using, e.g., Akaike's criterion, as discussed in Sect. 2.4.3, and then determine the *gi* by least squares. An interesting alternative is to let *d* = +∞ and to think that the *gi* define the function *g* such that *g*(*i*) = *gi*. One can then estimate the coefficients through (7.31) adopting a kernel, like TC and stable spline, that includes information on the expansion coefficients' decay to zero. Working in discrete time, the functionals *Li* entering (7.31) are in this case defined by

$$L\_i[g] = \sum\_{j=1}^{\infty} g\_j \sum\_{k=1}^{\infty} \phi\_j(k) u(t\_i - k),$$

while in continuous time, one has

$$L\_i[g] = \sum\_{j=1}^{\infty} g\_j \int\_0^{\infty} \phi\_j(\tau) u(t\_i - \tau)\, d\tau.$$

## *7.1.4 Connection with Bayesian Estimation of Gaussian Processes*

Similarly to what was discussed in the finite-dimensional setting in Sect. 4.9, also the more general regularization in RKHS can be given a probabilistic interpretation in terms of Bayesian estimation. In this paradigm, the different loss functions correspond to alternative statistical models for the observation noise, while the kernel represents the covariance of the unknown random signal, assumed independent of the noise. In particular, when the loss is quadratic, all the involved distributions are Gaussian.

We now discuss the connection under the linear system identification perspective where the "true" impulse response *g*<sup>0</sup> is seen as the random signal to estimate. Consider the measurements model

$$\mathbf{y}(t\_i) = L\_i \mathbf{[g}^0] + e(t\_i), \quad i = 1, \ldots, N,\tag{7.36}$$

where *Li* is a linear functional of the true impulse response *g*<sup>0</sup> defined by convolution with the system input evaluated at *ti* . One has

$$L\_i[g^0] = \sum\_{k=1}^{\infty} g^0(k) u(t\_i - k)$$

in discrete time and

$$L\_i[g^0] = \int\_0^\infty g^0(\tau) u(t\_i - \tau)\, d\tau$$

in continuous time. So, the impulse response estimators discussed in this chapter can be compactly written as

$$\hat{g} = \operatorname\*{arg\,min}\_{g \in \mathcal{H}} \sum\_{i=1}^{N} (y(t\_i) - L\_i[g])^2 + \gamma \left\| g \right\|\_{\mathcal{H}}^2,\tag{7.37}$$

where the RKHS *<sup>H</sup>* contains functions *<sup>g</sup>* : *<sup>X</sup>* <sup>→</sup> <sup>R</sup> with *<sup>X</sup>* <sup>=</sup> <sup>N</sup> in discrete time and *X* = R<sup>+</sup> in continuous time.

The following result (whose simple proof is in Sect. 7.7.2) shows that, under Gaussian assumptions on the impulse response and the noise, (7.37) provides the minimum variance estimate of *g*<sup>0</sup> given the measurements *Y* = [*y*(*t*1), . . . , *y*(*tN* )] *T* .

**Proposition 7.1** *Let the following assumptions hold:*

• *the impulse response g*<sup>0</sup> *is a zero-mean Gaussian process on X . Its covariance function is defined by*

$$
\mathbb{E}\left[g^0(t)g^0(s)\right] = \lambda K(t,s),
$$

*where* λ *is a positive scalar and K is a kernel;*

• *the e*(*t*) *are mutually independent zero-mean Gaussian random variables with variance* σ <sup>2</sup>*. Moreover, they are independent of g*<sup>0</sup>*.*

*Let H be the RKHS induced by K , set* γ = σ <sup>2</sup>/λ *and define*

$$\hat{g} = \arg\min_{g\in\mathcal{H}} \left( \sum_{i=1}^{N} (y(t_i) - L_i[g])^2 + \gamma \left\| g \right\|_{\mathcal{H}}^2 \right).$$

*Then, g*ˆ *is the minimum variance estimator of g*<sup>0</sup> *given Y, i.e.,*

$$\mathbb{E}[g^{0}(t)\,|\,Y] = \hat{g}(t) \quad \forall t \in X.$$
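Proposition 7.1 can be checked numerically in a small discrete-time setting. The sketch below (all numerical values are illustrative) uses an impulsive input, so that *L<sub>i</sub>*[*g*] = *g*(*t<sub>i</sub>*), takes the first-order stable spline kernel (7.15) as covariance, and compares the posterior mean with the regularized estimate computed with γ = σ²/λ:

```python
import numpy as np

# Sanity check of Proposition 7.1 in a finite discrete-time setting with
# impulsive input, so that L_i[g] = g(t_i) (all values are illustrative).
rng = np.random.default_rng(0)
n, lam, sigma2, alpha = 30, 2.0, 0.05**2, 0.85

t = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(t, t)                  # stable spline kernel (7.15)
g0 = rng.multivariate_normal(np.zeros(n), lam * K)   # g0 zero mean, covariance lam*K
Y = g0 + np.sqrt(sigma2) * rng.standard_normal(n)    # y(t_i) = g0(t_i) + e(t_i)

# Minimum variance (posterior mean) estimate: E[g0|Y] = lam*K (lam*K + sigma2*I)^{-1} Y
g_bayes = lam * K @ np.linalg.solve(lam * K + sigma2 * np.eye(n), Y)

# Regularized estimate with gamma = sigma2/lam: ghat = K (K + gamma*I)^{-1} Y
gamma = sigma2 / lam
g_hat = K @ np.linalg.solve(K + gamma * np.eye(n), Y)

assert np.allclose(g_bayes, g_hat)                   # the two estimates coincide
```

The two expressions coincide on the sampled grid, as the proposition predicts, since λ*K*(λ*K* + σ²*I*)⁻¹ = *K*(*K* + (σ²/λ)*I*)⁻¹.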

**Remark 7.2** The connection between regularization in RKHS and estimation of Gaussian processes was first pointed out in [51] in the context of spline regression with quadratic losses, see also [41, 83, 90]. The connection also holds for a wide class of losses *V<sub>i</sub>* different from the quadratic one. For instance, in this statistical framework, using the absolute value loss corresponds to Laplacian noise assumptions. The statistical interpretation of the ε-insensitive loss in terms of Gaussians with mean and variance given by suitable random variables can be found in [79], see also [40, 67]. For all these noise models, and many others, it can be shown that the RKHS estimate *g*ˆ includes all the possible finite-dimensional maximum a posteriori estimates of *g*<sup>0</sup>, see [3] for details.

**Remark 7.3** The relation between RKHSs and Gaussian stochastic processes, or more general Gaussian random fields, is stated by Proposition 7.1 in terms of minimum variance estimators. In particular, since the representer theorem ensures that such an estimator is the sum of a finite number of basis functions belonging to *H*, it turns out that *g*ˆ belongs to the RKHS induced by the covariance of *g*<sup>0</sup> with probability one. Now, one may also wonder what happens a priori, before seeing the data. In other words, the question is whether realizations of a zero-mean Gaussian process of covariance *K* fall in the RKHS induced by *K*. If the kernel *K* is associated with an infinite-dimensional *H*, the answer is negative with probability one, as graphically

**Fig. 7.4** The largest space contains all the realizations of a zero-mean Gaussian process of covariance *K*. The smallest space is the RKHS *H* induced by *K*, assumed here infinite dimensional. The probability that realizations of *f* fall in the RKHS is zero. Instead, when the assumptions underlying the representer theorem hold, the realizations of the minimum variance estimator *E* [ *f* |*Y* ] are contained in *H* with probability one

illustrated in Fig. 7.4. While deep discussions can be found in [9, 34, 59, 68], here we give just a hint on this fact. Assume that the kernel admits the decomposition

$$K(s,t) = \sum\_{i=1}^{M} \zeta\_i \phi\_i(s)\phi\_i(t)$$

inducing an *M*-dimensional RKHS *H*. Let the deterministic functions φ*<sub>i</sub>* be linearly independent. Then, we know from Theorem 6.13 that, if $f(t) = \sum_{i=1}^{M} a_i \phi_i(t)$, then

$$\left\|f\right\|_{\mathcal{H}}^2 = \sum_{i=1}^M \frac{a_i^2}{\zeta_i}.$$

Now, think of *K* as a covariance and let *ai* be zero-mean Gaussian and independent random variables of variance ζ*<sup>i</sup>* , i.e.,

$$a_i \sim \mathcal{N}(0, \zeta_i).$$

Then, the so-called Karhunen–Loève expansion of the Gaussian random field *f* ∼ *N* (0, *K*), also discussed in Sect. 5.6 to connect regularization and basis expansion in finite dimension, is given by

$$f(t) = \sum\_{i=1}^{M} a\_i \phi\_i(t)$$

with *M* possibly infinite and convergence in quadratic mean. The RKHS norm of *f* is now a random variable and, since the *a<sub>i</sub>* are mutually independent with $\mathbb{E}[a_i^2] = \zeta_i$, one has

$$\mathbb{E}\left\|f\right\|_{\mathcal{H}}^2 = \mathbb{E} \sum_{i=1}^M \frac{a_i^2}{\zeta_i} = \sum_{i=1}^M \frac{\mathbb{E}[a_i^2]}{\zeta_i} = M.$$

So, if the RKHS is infinite dimensional, one has *M* = ∞ and the expected (squared) RKHS norm of the process *f* diverges to infinity.
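This computation is easy to reproduce numerically. The snippet below, a sketch with illustrative eigenvalues ζ*<sub>i</sub>* and a finite *M*, draws Monte Carlo samples of the truncated Karhunen–Loève expansion and checks that the sample mean of the squared RKHS norm is close to *M*:

```python
import numpy as np

# Monte Carlo check that E||f||_H^2 = M when f = sum_i a_i phi_i with
# independent a_i ~ N(0, zeta_i) (M and the zeta_i are illustrative).
rng = np.random.default_rng(1)
M = 8
zeta = 0.5 ** np.arange(1, M + 1)      # kernel eigenvalues zeta_i

a = rng.standard_normal((200_000, M)) * np.sqrt(zeta)   # samples of a_i ~ N(0, zeta_i)
sq_norm = (a**2 / zeta).sum(axis=1)    # ||f||_H^2 = sum_i a_i^2 / zeta_i

print(sq_norm.mean())                  # close to M = 8, independently of the zeta_i
```

Since each *a<sub>i</sub>*²/ζ*<sub>i</sub>* is a unit-mean chi-square variable, the average does not depend on the eigenvalues, which is exactly why the expected norm diverges when *M* = ∞.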

## *7.1.5 A Numerical Example*

Our goal now is to illustrate the influence of the choice of the kernel on the quality of the impulse response estimate, using also the Bayesian interpretation of regularization. The example is a simple linear discrete-time system in the form of (7.1). Using the *z*-transform notation, the measurement model is

$$y(t) = \frac{1}{z(z - 0.85)}u(t) + e(t), \quad t = 1, \ldots, 20. \tag{7.38}$$

The system's impulse response is reported in Fig. 7.5. The disturbances *e*(*t*) are independent Gaussian random variables with mean zero and variance 0.05<sup>2</sup>. For ease of visualization, we let the input *u*(*t*) be an impulsive signal, i.e., *u*(0) = 1 and *u*(*t*) = 0 elsewhere. Thus, the impulse response has to be estimated from 20 direct and noisy impulse response measurements.

We consider a Monte Carlo simulation of 200 runs. At each run, the outputs are obtained by generating mutually independent measurement noises. One data set is shown in Fig. 7.5. For each of the 200 data sets, we use the regularized IIR estimator (7.10). Regarding *K* : N × N → R, we compare the performance of three kernels: the Gaussian (6.43), the cubic spline (6.48) and the stable spline (7.15), defined, respectively, by

$$\exp\left(-\frac{(i-j)^2}{\rho}\right), \quad \frac{ij\min\{i,j\}}{2} - \frac{(\min\{i,j\})^3}{6}, \quad \alpha^{\max(i,j)}.$$

Recall that the Gaussian and the cubic spline kernel are the most used in machine learning to include information on smoothness. The cubic spline estimator could also be complemented with a bias space given, e.g., by a linear function, as described in Sect. 6.6.7. However, one would obtain results very similar to those described in what follows.
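For reference, the three Gram matrices are simple to build. The sketch below uses illustrative values of ρ and α (in the experiment they are tuned by the oracle) and checks that all three are valid positive semidefinite kernels on the grid:

```python
import numpy as np

# Gram matrices of the three kernels on t = 1..n (rho and alpha illustrative).
n, rho, alpha = 20, 10.0, 0.85
i = np.arange(1, n + 1)
I, J = np.meshgrid(i, i, indexing="ij")
m = np.minimum(I, J)

K_gauss = np.exp(-((I - J) ** 2) / rho)             # Gaussian kernel (6.43)
K_cubic = I * J * m / 2 - m**3 / 6                  # cubic spline kernel (6.48)
K_stable = alpha ** np.maximum(I, J).astype(float)  # stable spline kernel (7.15)

# All three are symmetric and positive semidefinite (up to roundoff)
for K in (K_gauss, K_cubic, K_stable):
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() > -1e-6
```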

To adopt the estimator (7.10), we need to find a suitable value for the regularization parameter γ and also for the unknown kernel parameters, i.e., the kernel width ρ in the Gaussian kernel and the stability parameter α in the stable spline kernel. As already done, e.g., in Sect. 1.2 for ridge regression, an oracle-based procedure is adopted to optimally balance bias and variance. The unknown parameters are obtained by maximizing the measure of fit defined as follows:

**Fig. 7.5** The true impulse response (thick line) and one out of the 200 data sets (◦)

$$100\left(1-\left[\frac{\sum\_{k=1}^{50}|\mathbf{g}\_k^0-\hat{\mathbf{g}}(k)|^2}{\sum\_{k=1}^{50}|\mathbf{g}\_k^0-\bar{\mathbf{g}}^0|^2}\right]^{\frac{1}{2}}\right),\ \bar{\mathbf{g}}^0=\frac{1}{50}\sum\_{k=1}^{50}\mathbf{g}\_k^0,\tag{7.39}$$

where computation is restricted only to the first 50 samples where, in practice, the impulse response is different from zero. This tuning procedure is ideal since it exploits the true function *g*<sup>0</sup>. It is useful here since it excludes the uncertainty brought by the kernel tuning procedure and will fully reveal the influence of the kernel choice on the quality of the impulse response estimate.
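As a sketch, the fit measure (7.39) takes a few lines to code; here the true impulse response of (7.38) is *g*<sup>0</sup>(*k*) = 0.85<sup>*k*−2</sup> for *k* ≥ 2 and zero otherwise (the function name is illustrative):

```python
import numpy as np

def fit_percent(g0, ghat):
    """Measure of fit (7.39) over the available samples of g0."""
    g0, ghat = np.asarray(g0, float), np.asarray(ghat, float)
    return 100.0 * (1.0 - np.linalg.norm(g0 - ghat)
                    / np.linalg.norm(g0 - g0.mean()))

# True impulse response of (7.38) restricted to the first 50 samples
k = np.arange(1, 51)
g0 = np.where(k >= 2, 0.85 ** (k - 2.0), 0.0)

print(fit_percent(g0, g0))               # a perfect estimate scores 100
print(fit_percent(g0, np.zeros(50)))     # worse estimates score lower
```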

The impulse response estimates obtained by the cubic spline, the Gaussian and the stable spline kernel are reported in Fig. 7.6. When the cubic spline kernel (6.48) is chosen, the impulse response estimates diverge as time progresses. This result can also be given a Bayesian interpretation where (6.48) becomes the covariance of the stochastic process *g*<sup>0</sup>. Specifically, the cubic spline kernel models the impulse response as a double integration of white noise. So, impulse response coefficients are correlated, but the prior variance increases in time. For stable systems, variability is instead expected to decay to zero as *t* progresses. When the Gaussian kernel (6.43) is chosen, the quality of the impulse response estimates improves considerably, but many of them exhibit oscillations and the variance of the impulse response estimator is still large. Bayesian arguments here show that the Gaussian kernel models *g*<sup>0</sup> as a stationary stochastic process. Smoothness information is encoded, but not the fact that one expects the prior variance to decay to zero. Finally, the impulse response estimates returned by the stable spline kernel (7.15) are all very close to the truth. These outcomes are similar to those described, e.g., in Example 5.4 in Sect. 5.5. In particular, even if this example is rather simple, it shows clearly that a straightforward application of standard kernels from the machine learning and smoothing splines literature may give unsatisfactory results. Inclusion of dynamic systems features in the regularizer,


like smooth exponential decay, greatly enhances the quality of the impulse response estimates.

## **7.2 Kernel Tuning**

As we have seen in the previous parts of the book, the kernels depend on some unknown parameters, the so-called hyperparameters. They can, e.g., include scale factors, the kernel width of the Gaussian kernel or the impulse response's decay rate in the TC and stable spline kernels. In real-world applications, the oracle-based procedure used in the previous section cannot be adopted. The kernels need instead to be tuned from data. This procedure is referred to as hyperparameter estimation and is the counterpart of model order selection in the classical paradigm of system identification. It determines model complexity within the new paradigm where system identification is seen as regularized function estimation in RKHSs. This calibration step will thus have a major impact on the model's performance, e.g., in terms of predictive capability on new data. Due to the connection with the ReLS methods in quadratic form, the tuning methods introduced in Chaps. 3 and 4 can be easily applied also in the RKHS framework. In particular, let *K*(η) denote a kernel, where η is the hyperparameter vector belonging to the set Γ. Such a vector could also include other parameters not present in the kernel, e.g., the noise variance σ<sup>2</sup>. Some calibration methods to estimate η from data are reported below.

## *7.2.1 Marginal Likelihood Maximization*

The first approach we describe is marginal likelihood maximization (MLM), also called the empirical Bayes method in Sect. 4.4. MLM relies on the Bayesian interpretation of function estimation in RKHS discussed in Sect. 7.1.4. Under the same assumptions stated in Proposition 7.1, η can be estimated by maximum likelihood

$$
\hat{\eta} = \arg\max\_{\eta \in \Gamma} \mathbf{p}(Y|\eta),
\tag{7.40}
$$

with p(*Y* |η) obtained by integrating out *g*<sup>0</sup> from the joint density p(*Y* |*g*<sup>0</sup>)p(*g*<sup>0</sup>|η), i.e.,

$$\mathbf{p}(Y|\eta) = \int \mathbf{p}(Y|\mathbf{g}^0)\mathbf{p}(\mathbf{g}^0|\eta)d\mathbf{g}^0. \tag{7.41}$$

The probability density p(*Y* |η) is the marginal likelihood and, hence, (7.40) is called the MLM method.


Computation of (7.41) is especially simple in our case since our measurement model is linear and Gaussian. In fact, in the Bayesian interpretation of regularized linear system identification in RKHS, the impulse response *g*<sup>0</sup> is a zero-mean Gaussian process with covariance λ*K*, where λ is a positive scale factor. The impulse response is also assumed independent of the noises *e*(*t*), which are white and Gaussian of variance σ<sup>2</sup>. Recall also the definition of the matrix *O*, now possibly a function of η, reported in (7.14) for the discrete-time case, i.e., when *X* = N, and in (7.22) for the continuous-time case, i.e., when *X* = R<sup>+</sup>. The matrix λ*O*(η) plays an important role in the MLM method since it corresponds to the covariance matrix of the noise-free output vector [*L*<sub>1</sub>[*g*<sup>0</sup>], ..., *L<sub>N</sub>*[*g*<sup>0</sup>]]<sup>*T*</sup> and is thus often called the output kernel matrix. Then, as also discussed in Sect. 7.7.2, the vector *Y* turns out to be Gaussian with zero mean, i.e.,

$$Y \sim \mathcal{N}(0, Z(\eta)),$$

where the covariance matrix *Z*(η) is given by

$$Z(\eta) = \lambda O(\eta) + \sigma^2 I_N$$

with *IN* the *N* × *N* identity matrix. Here, the vector η could, e.g., contain both λ and σ <sup>2</sup>. One then obtains that the empirical Bayes estimate of η in (7.40) becomes

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} Y^T Z(\eta)^{-1} Y + \log \det(Z(\eta)), \tag{7.42}$$

where the objective is proportional to the minus log of the marginal likelihood.

As discussed in Chap. 4, the MLM method embodies Occam's razor principle, i.e., unnecessarily complex models are automatically penalized, see, e.g., [83]. In particular, the Occam's factor arises thanks to the marginalization and it manifests itself in the term log det(*Z*(η)) in (7.42). A simple example can be obtained by thinking of the behaviour of the objective for different values of the kernel scale factor λ. When λ increases, the model becomes more complex since, under a stochastic viewpoint, the prior variance of the impulse response *g*<sup>0</sup> increases. In fact, the term *Y*<sup>*T*</sup>*Z*(η)<sup>−1</sup>*Y*, related to the data fit, decreases since the inverse of *Z*(η) tends to the null matrix (the model has infinite variance and can describe any kind of data). But the Occam's factor increases since det(*Z*(η)) grows to infinity. In this way, ηˆ will balance data fit and model complexity.
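The objective in (7.42) is cheap to evaluate through a Cholesky factorization of *Z*(η). The sketch below uses toy data, with the output kernel matrix *O* taken as a first-order stable spline Gram matrix (an impulsive-input assumption, values illustrative); it also verifies the Cholesky-based evaluation against a direct computation:

```python
import numpy as np

# Minus log marginal likelihood (7.42), up to constants, evaluated via Cholesky
# (toy setting: impulsive input, so O is a stable spline Gram matrix).
rng = np.random.default_rng(2)
n, sigma2, alpha = 30, 0.05**2, 0.85
t = np.arange(1, n + 1)
O = alpha ** np.maximum.outer(t, t)                # output kernel matrix
Y = rng.multivariate_normal(np.zeros(n), 1.0 * O + sigma2 * np.eye(n))

def objective(lam):
    Z = lam * O + sigma2 * np.eye(n)               # Z(eta) = lam*O + sigma2*I_N
    L = np.linalg.cholesky(Z)
    w = np.linalg.solve(L, Y)
    return w @ w + 2.0 * np.log(np.diag(L)).sum()  # Y'Z^{-1}Y + log det Z

# Direct check of the Cholesky-based evaluation at lam = 1
direct = Y @ np.linalg.solve(O + sigma2 * np.eye(n), Y) \
         + np.linalg.slogdet(O + sigma2 * np.eye(n))[1]
assert np.isclose(objective(1.0), direct)

lam_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_hat = min(lam_grid, key=objective)             # empirical Bayes estimate of lam
```

The first term of `objective` plays the role of the data fit *Y*<sup>*T*</sup>*Z*(η)<sup>−1</sup>*Y*, the second is the Occam's factor log det(*Z*(η)) discussed above.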

#### **7.2.1.1 Numerical Example**

To illustrate the effectiveness of MLM, we revisit the example reported in Sect. 1.2. The problem is to reconstruct the impulse response reported in Fig. 7.7 (red line) from the 1000 input–output data displayed in Fig. 1.2. The system input is low-pass and this makes estimation hard due to ill-conditioning.


We will adopt three kernels. Using δ to denote the Kronecker delta, the value *K*(*i*, *j*) is defined, respectively, by

$$
\delta\_{ij}, \quad \alpha^{\max(i,j)}, \quad \frac{\alpha^{i+j+\max(i,j)}}{2} - \frac{\alpha^{3\max(i,j)}}{6}.
$$

The first choice corresponds to ridge regression with the regularizer given by the sum of squared impulse response coefficients. The other two are the first- and second-order stable spline kernel reported in (7.15) and in (7.30), respectively. More specifically, the last kernel corresponds to the discrete-time version of (7.30) with α = *e*−<sup>β</sup> .

In Fig. 1.5, we reported the ridge regularized estimate with γ chosen by an oracle to maximize the fit. To ease comparison with other approaches, such a figure is also reproduced in the top panel of Fig. 7.7. The reconstruction is not satisfactory since the regularizer does not include information on smoothness and decay. In fact, the Bayesian interpretation reveals that ridge regression describes the impulse response as a realization of white noise, a poor model for stable dynamic systems. This also explains the presence of oscillations in the reconstructed profile.

The middle and bottom panels report the estimates obtained by the stable spline kernels with the noise variance and the hyperparameters γ, α tuned from data through MLM. Even if no oracle is used, the quality of the impulse response reconstruction greatly increases. This is also confirmed by a Monte Carlo study where 200 data sets are obtained using the same kind of input but generating new independent noise realizations. MATLAB boxplots of the 200 fits for all three estimators are in Fig. 7.8. Here, the median is given by the central mark while the box edges are the 25th and 75th percentiles. The whiskers extend to the most extreme fits not seen as outliers. Finally, the outliers are plotted individually. Average fits are 73.7% for ridge, 83.9% for first-order and 90.2% for second-order stable spline.

In this example, one can see that it is preferable to use the second-order stable spline kernel. This is easily explained by the fact that the true impulse response is quite regular so that increasing our expected smoothness improves the performance.

Interestingly, the selection between different kernels, like first- and second-order stable spline, can also be performed automatically by MLM, thus addressing the problem of model comparison described in Sect. 2.6.2. In fact, let *s* denote an additional hyperparameter that may assume only the values 0 or 1. Then, we can consider the combined kernel

$$s\alpha^{\max(i,j)} + (1-s)\left(\frac{\alpha^{i+j+\max(i,j)}}{2} - \frac{\alpha^{3\max(i,j)}}{6}\right)$$

and optimize the hyperparameters *s*, α and γ by MLM. Clearly, the role of *s* is to select one of the two kernels, e.g., if the estimate *s*ˆ is 0, then the impulse response estimate will be given by a second-order stable spline. Applying this procedure to our problem, one finds that the second-order stable spline kernel is selected 177 times out of the 200 Monte Carlo runs. The obtained fits are shown in Fig. 7.9; their mean is 88.8%.

**Remark 7.4** Kernel choice via MLM also has connections with selection through the concept of Bayesian model probability discussed in Sect. 4.11, see also [50]. In fact, assume we are given different competitive kernels (covariances) *K<sup>i</sup>* and, for the moment, assume also that all the hyperparameter vectors η*<sup>i</sup>* are known. We can then interpret each kernel as a different model. We can also assign a priori probabilities that data have been generated by the *i*th covariance *K<sup>i</sup>*, hence thinking of any model as a random variable itself. If all the kernels are given the same probability, the marginal likelihood computed using *K<sup>i</sup>* becomes proportional to the posterior probability of the *i*th model. This permits exploiting the marginal likelihood to select the "best" kernel-based estimate among those generated by the *K<sup>i</sup>*. When hyperparameters are unknown, the marginal likelihoods can be evaluated with each η*<sup>i</sup>* set to its estimate ηˆ*<sup>i</sup>*. In this case, care is needed since maximized likelihoods define model posterior probabilities that do not account for hyperparameter uncertainty. For example, if the dimensions of the η*<sup>i</sup>* change with *i*, the risk is to select a kernel that has many parameters and overfits. This problem can be mitigated, e.g., by adopting the criteria described in Sect. 2.4.3. Using BIC, we compute

$$
\hat{i} = \arg\min_i \; -2\log p(Y|\hat{\eta}^i) + (\dim \eta^i)\log N,
$$

where *N* is the number of available output measurements and dim η*<sup>i</sup>* is the number of hyperparameters contained in the *i*th model. Note that, when using stable spline kernels as in the above example, the BIC penalty is irrelevant since the first- and the second-order stable spline estimator contain the same number of unknown hyperparameters.

## *7.2.2 Stein's Unbiased Risk Estimator*

The second method is Stein's unbiased risk estimator (SURE), introduced in Sect. 3.5.3.2. The idea of SURE is to minimize an unbiased estimator of the risk, which is the expected in-sample validation error of the model estimate. In what follows, *g*<sup>0</sup> is no longer stochastic as in the previous subsection but corresponds to a deterministic impulse response. Identification data are given by

$$\mathbf{y}(t\_i) = L\_i[\mathbf{g}^0] + e(t\_i), \quad i = 1, \ldots, N,$$

where the *e*(*t<sub>i</sub>*) are independent, with zero mean and known variance σ<sup>2</sup>, and each *L<sub>i</sub>* is the linear functional defined by convolution with the system input evaluated at *t<sub>i</sub>*. One thus has $L_i[g^0] = \sum_{k=1}^{\infty} g^0(k)u(t_i - k)$ in discrete time, where the *t<sub>i</sub>* assume integer values, and $L_i[g^0] = \int_0^{\infty} g^0(\tau)u(t_i - \tau)d\tau$ in continuous time. The *N* independent validation output samples *y<sub>v</sub>*(*t<sub>i</sub>*) are then defined by using the same input that generates the identification data but an independent copy of the noises, i.e.,

$$y_{v}(t_{i}) = L_{i}[g^{0}] + e_{v}(t_{i}), \quad i = 1, \ldots, N. \tag{7.43}$$

So, all the 2*N* random variables *ev*(*ti*) and *e*(*ti*) are mutually independent, with zero mean and noise variance σ <sup>2</sup>. Consider the impulse response estimator

$$\hat{g} = \arg\min_{g\in\mathcal{H}} \left( \sum_{i=1}^{N} (y(t_i) - L_i[g])^2 + \gamma \|g\|_{\mathcal{H}}^2 \right)$$

as a function of the hyperparameter vector η. The predictions of the *y<sub>v</sub>*(*t<sub>i</sub>*) are then given by *L<sub>i</sub>*[*g*ˆ] and also depend on η. The expected in-sample validation error of the model estimate *g*ˆ is then given by the mean prediction error

$$\text{EVE}_{\text{in}}(\eta) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}(y_{v}(t_{i}) - L_{i}[\hat{g}])^{2},\tag{7.44}$$

where the expectation $\mathbb{E}$ is over the random noises *e<sub>v</sub>*(*t<sub>i</sub>*) and *e*(*t<sub>i</sub>*). Note that the result depends not only on η but also on the unknown (deterministic) impulse response *g*<sup>0</sup>. So, we cannot compute the prediction error. However, it is possible to derive an unbiased estimate of it. To this end, let *Y*ˆ(η) be the (column) vector with components *L<sub>i</sub>*[*g*ˆ]. The output kernel matrix *O*(η), already introduced to describe marginal likelihood maximization, then gives the connection between the vector *Y* containing the measured outputs *y*(*t<sub>i</sub>*) and the predictions. In fact, using the representer theorem to obtain *g*ˆ, and hence the *L<sub>i</sub>*[*g*ˆ], one obtains

$$
\hat{Y}(\eta) = O(\eta)(O(\eta) + \gamma I_N)^{-1}Y. \tag{7.45}
$$

Following the same line of discussion developed in Sect. 3.5.3.2 to obtain (3.96), we can derive the following unbiased estimator of (7.44):

$$\widehat{\rm EVE}\_{\rm in}(\eta) = \frac{1}{N} \|Y - \hat{Y}(\eta)\|^2 + 2\sigma^2 \frac{\text{dof}(\eta)}{N},\tag{7.46}$$

where dof(η) is the degrees of freedom of *Y*ˆ(η) given by

$$\text{dof}(\eta) = \text{trace}(O(\eta)(O(\eta) + \gamma I_N)^{-1})\tag{7.47}$$

which varies from *N* to 0 as γ increases from 0 to ∞.

Note that (7.46) is a function only of the *N* output measurements *y*(*t<sub>i</sub>*). We can thus estimate the hyperparameter η by minimizing the unbiased estimator $\widehat{\rm EVE}_{\rm in}(\eta)$ of EVE<sub>in</sub>(η) to achieve

$$\hat{\eta} = \operatorname\*{arg\,min}\_{\eta \in \Gamma} \frac{1}{N} \|Y - \hat{Y}(\eta)\|^2 + 2\sigma^2 \frac{\text{dof}(\eta)}{N}. \tag{7.48}$$

The above formula has the same form as the AIC criterion (2.33) computed assuming Gaussian noise of known variance σ<sup>2</sup>, except that the dimension *m* of the model parameter θ is now replaced by the degrees of freedom dof(η).
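Formulas (7.45)–(7.48) can be sketched in a toy direct-measurement setting (the output kernel matrix is taken as a stable spline Gram matrix; all names and values are illustrative):

```python
import numpy as np

# Degrees of freedom (7.47) and the SURE criterion (7.46)/(7.48) as functions
# of gamma, in a toy setting where O is a stable spline Gram matrix.
rng = np.random.default_rng(3)
n, sigma2, alpha = 30, 0.05**2, 0.85
t = np.arange(1, n + 1)
O = alpha ** np.maximum.outer(t, t)                    # output kernel matrix
g0 = np.where(t >= 2, 0.85 ** (t - 2.0), 0.0)          # deterministic true response
Y = g0 + np.sqrt(sigma2) * rng.standard_normal(n)

def dof(gamma):
    return np.trace(O @ np.linalg.inv(O + gamma * np.eye(n)))

def sure(gamma):
    Yhat = O @ np.linalg.solve(O + gamma * np.eye(n), Y)   # predictions (7.45)
    return np.sum((Y - Yhat) ** 2) / n + 2 * sigma2 * dof(gamma) / n

# dof(gamma) decreases from N towards 0 as gamma grows from 0 to infinity
assert dof(1e-9) > n - 0.5 and dof(1e9) < 0.5
gamma_hat = min([1e-4, 1e-3, 1e-2, 1e-1, 1.0], key=sure)   # SURE-based tuning (7.48)
```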

## *7.2.3 Generalized Cross-Validation*

The third approach is the generalized cross-validation (GCV) method. As discussed in Sects. 2.6.3 and 3.5.2.3, cross-validation (CV) is a classical way to estimate the expected validation error by efficient reuse of the data, and GCV is closely related to the *N*-fold CV with quadratic losses. To describe it in the RKHS framework, let *g*ˆ*<sup>k</sup>* be the solution of the following function estimation problem:


$$\hat{g}^{k} = \operatorname*{arg\,min}_{g \in \mathcal{H}} \sum_{i=1, i \neq k}^{N} (y(t_i) - L_i[g])^2 + \gamma \left\| g \right\|_{\mathcal{H}}^2. \tag{7.49}$$

So, *g*ˆ*<sup>k</sup>* is the function estimate when the *k*th datum *y*(*t<sub>k</sub>*) is left out. As also described, e.g., in [90, Chap. 4], the following relation between the prediction error of *g*ˆ and the prediction error of *g*ˆ*<sup>k</sup>* holds:

$$y(t_k) - L_k[\hat{g}^k] = \frac{y(t_k) - L_k[\hat{g}]}{1 - H_{kk}(\eta)},\tag{7.50}$$

where *Hkk* (η) is the (*k*, *k*)th element of the influence matrix

$$H(\eta) = O(\eta)(O(\eta) + \gamma I_N)^{-1}.$$

Therefore, the validation error of the *N*-fold CV with quadratic loss function is

$$\sum_{k=1}^{N} \left( y(t_k) - L_k[\hat{g}^k] \right)^2 = \sum_{k=1}^{N} \left( \frac{y(t_k) - L_k[\hat{g}]}{1 - H_{kk}(\eta)} \right)^2. \tag{7.51}$$

Minimizing the above equation as a criterion to estimate the hyperparameter η leads to the predicted residual sums of squares (PRESS) method

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min}} \sum\_{k=1}^{N} \left( \frac{\mathbf{y}(t\_k) - L\_k[\hat{\mathbf{g}}]}{1 - H\_{kk}(\eta)} \right)^2. \tag{7.52}$$

The above criterion coincides with that derived in (3.80) working in the finite-dimensional setting.

GCV is a variant of (7.52) obtained by replacing each *Hkk* (η), *k* = 1,..., *N*, in (7.52) with their average. One obtains

$$\hat{\eta} = \underset{\eta \in \Gamma}{\arg\min} \sum\_{k=1}^{N} \left( \frac{\mathbf{y}(t\_k) - L\_k[\hat{\mathbf{g}}]}{1 - \text{trace}(H(\eta))/N} \right)^2. \tag{7.53}$$

In view of (7.45), one has

$$
\hat{Y}(\eta) = H(\eta)Y.
$$

and, from (7.47), one can see that trace(*H*(η)) corresponds to the degrees of freedom dof(η), i.e.,

$$\text{trace}(H(\eta)) = \text{dof}(\eta).$$

So, the GCV (7.53) can be rewritten as follows:


$$\hat{\eta} = \operatorname*{arg\,min}_{\eta \in \Gamma} \frac{\|Y - \hat{Y}(\eta)\|^2}{(1 - \operatorname{dof}(\eta)/N)^2}. \tag{7.54}$$

This corresponds to the criterion (3.82) obtained in the finite-dimensional setting. Unlike SURE, a practical advantage of PRESS and GCV is that they do not require knowledge (or preliminary estimation) of the noise variance σ<sup>2</sup>.
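The leave-one-out identity (7.50) underlying PRESS and GCV is exact for this class of estimators and can be verified numerically. The sketch below uses a toy direct-measurement setting (kernel, data and values are illustrative): for each *k*, the model is actually refit without datum *k* and its prediction error is compared with the right-hand side of (7.50):

```python
import numpy as np

# Numerical check of the leave-one-out identity (7.50) in a toy
# direct-measurement setting (kernel and data are illustrative).
rng = np.random.default_rng(4)
n, gamma, alpha = 15, 0.1, 0.85
t = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(t, t)
y = np.where(t >= 2, 0.85 ** (t - 2.0), 0.0) + 0.05 * rng.standard_normal(n)

H = K @ np.linalg.inv(K + gamma * np.eye(n))       # influence matrix H(eta)
yhat = H @ y                                       # L_k[ghat] for each k

for k in range(n):
    idx = np.delete(np.arange(n), k)               # leave datum k out
    c = np.linalg.solve(K[np.ix_(idx, idx)] + gamma * np.eye(n - 1), y[idx])
    loo_pred = K[k, idx] @ c                       # L_k[ghat^k], refit without k
    assert np.isclose(y[k] - loo_pred, (y[k] - yhat[k]) / (1 - H[k, k]))
```

This is why the *N*-fold CV score (7.51) can be computed from a single fit, without ever refitting the model *N* times.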

## **7.3 Theory of Stable Reproducing Kernel Hilbert Spaces**

In the numerical experiments reported in this chapter, we have seen that regularized IIR models based, e.g., on TC and stable spline kernels provide much better estimates of stable linear dynamic systems than other popular machine learning choices like the Gaussian kernel. The key was the inclusion in the identification process of information on the decay rate of the impulse response. This motivates the study of the class of so-called stable kernels, which enforce the stability constraint on the induced RKHS.

## *7.3.1 Kernel Stability: Necessary and Sufficient Conditions*

The necessary and sufficient condition for a linear system to be bounded-input–bounded-output (BIBO) stable is that its impulse response satisfies *g* ∈ ℓ<sub>1</sub> in the discrete-time case and *g* ∈ *L*<sub>1</sub> in the continuous-time case. Here, ℓ<sub>1</sub> is the space of absolutely summable sequences, while *L*<sub>1</sub> contains the absolutely integrable functions on R<sup>+</sup> (equipped with the classical Lebesgue measure), i.e.,

$$\sum_{k=1}^{\infty} |g_k| < \infty \ \ \forall g \in \ell_1 \quad \text{and} \quad \int_0^{\infty} |g(x)| dx < \infty \ \ \forall g \in L_1. \tag{7.55}$$

Therefore, for regularized identification of stable systems, the impulse response should be searched within an RKHS that is a subspace of ℓ<sub>1</sub> in discrete time and a subspace of *L*<sub>1</sub> in continuous time. This naturally leads to the following definition of stable kernels.

**Definition 7.1** (*Stable kernel, based on* [32, 73]) Let *K* : *X* × *X* → R be a positive semidefinite kernel and *H* be the RKHS induced by *K*. Then, *K* is said to be stable if

$$\mathcal{H} \subset \ell_1 \ \text{ for } X = \mathbb{N} \quad \text{or} \quad \mathcal{H} \subset L_1 \ \text{ for } X = \mathbb{R}_+.$$
If a kernel *K* is not stable, it is also said to be unstable. Accordingly, the RKHS *H* is said to be stable or unstable if *K* is stable or unstable.

Given a kernel, the question is now how to assess its stability. For this purpose, a direct use of the above definition is often challenging since it can be difficult to understand which functions belong to the associated RKHS. Stability conditions stated directly on *K* would instead be desirable. One first observation is that, since *H* contains all the kernel sections according to Theorem 6.2, all of them must be stable. In discrete time, this means *K*(*i*, ·) ∈ ℓ<sub>1</sub> for all *i*. However, this condition is necessary but not sufficient for stability, a fact which is not so surprising since we have seen in Sect. 6.2 that *H* contains also all the Cauchy limits of linear combinations of kernel sections. For instance, in Example 6.4, we have seen that the identity kernel *K*(*i*, *j*) = δ*<sub>ij</sub>*, connected with ridge regression but here defined over all of N × N, induces ℓ<sub>2</sub>. Such a space is not contained in ℓ<sub>1</sub>. So, the identity kernel is not stable, even though each kernel section is stable since it contains only one non-null element.

The following fundamental result can be found in a more general form in [16] and gives the desired characterization of kernel stability. Maybe not surprisingly, we will see that the key test spaces are ℓ<sub>∞</sub>, which contains the bounded sequences in discrete time, and *L*<sub>∞</sub>, which contains the essentially bounded functions in continuous time. The proof is reported in Sect. 7.7.3.

**Theorem 7.5** (Necessary and sufficient condition for kernel stability, based on [16, 32, 73]) *Let K* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> *be a positive semidefinite kernel with <sup>X</sup>* <sup>=</sup> <sup>N</sup> *or X* = R+*. Then,*

• *one has*

$$\mathcal{H} \subset \ell_1 \iff \sum_{s=1}^{\infty} \left| \sum_{t=1}^{\infty} K(s, t) l_t \right| < \infty, \; \forall \; l \in \ell_\infty \tag{7.56}$$

*for the discrete-time case where X* = N*;*

• *one has*

$$\mathcal{H} \subset L_1 \iff \int_0^\infty \left| \int_0^\infty K(s,t)l(t)dt \right| ds < \infty,\ \forall \ l \in L_\infty \tag{7.57}$$

*for the continuous-time case where X* = R+*.*

Figure 7.10 illustrates the meaning of Theorem 7.5 by resorting to a simple system theory argument. In particular, a kernel can be seen as an acausal linear time-varying system. In discrete time, it induces the following input–output relationship

$$y\_i = \sum\_{j=1}^{\infty} K\_i(j) u\_j, \quad i = 1, 2, \dots, \tag{7.58}$$

where *K<sub>i</sub>*(*j*) = *K*(*i*, *j*), while *u<sub>i</sub>* and *y<sub>i</sub>* denote the system input and output at instant *i*. Then, the RKHS induced by *K* is stable iff system (7.58) maps every bounded input $\{u_i\}_{i=1}^{\infty}$ into a summable output $\{y_i\}_{i=1}^{\infty}$. Abusing notation, we can also see *K* as an infinite-dimensional matrix with (*i*, *j*)-entry given by *K<sub>i</sub>*(*j*), with *u* and *y* infinite-dimensional column vectors. Then, using ordinary algebra notation to

**Fig. 7.10** System theoretic interpretation of RKHS stability. The kernel *K* is associated with an acausal linear system. In discrete time, the input–output relationship is given by $y_i = \sum_{j=1}^{\infty} K_i(j)u_j$. Then, *K* is stable iff every bounded input *u* is mapped into a summable output *y*

handle these objects, the input–output relationship becomes *y* = *K u* and the stability condition is

$$
\mathcal{H} \subset \ell_1 \iff Ku \in \ell_1 \ \ \forall u \in \ell_\infty.
$$

In Theorem 7.5, it is immediate to see that including the constraint −1 ≤ *l<sub>t</sub>* ≤ 1 ∀*t* on the test functions does not have any influence on the stability test. With this constraint, one has

$$\left|\sum\_{t=1}^{\infty} K(\mathbf{s}, t) l\_t\right| \le \sum\_{t=1}^{\infty} |K(\mathbf{s}, t)| \quad \text{and} \quad \left|\int\_0^{\infty} K(\mathbf{s}, t) l(t) dt\right| \le \int\_0^{\infty} |K(\mathbf{s}, t)| dt.$$

The following result is then an immediate corollary of Theorem7.5 obtained exploiting the above inequalities. It states that absolute summability is a sufficient condition for a kernel to be stable.

**Corollary 7.1** (based on [16, 32, 73]) *Let K* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> *be a positive semidefinite kernel with X* = N *or X* = R+*. Then,*

• *one has*

$$\mathcal{H}^{\ell} \subset \ell\_1 \Longleftrightarrow \sum\_{s=1}^{\infty} \sum\_{t=1}^{\infty} |K(s, t)| < \infty \tag{7.59}$$

*for the discrete-time case where X* = N*;*

• *one has*

$$\mathcal{H}^{\theta} \subset \mathcal{E}\_1 \Longleftarrow \int\_0^\infty \int\_0^\infty |K(s, t)| dt ds < \infty \tag{7.60}$$

*for the continuous-time case where X* = R+*.*

Finally, consider the class of nonnegative-valued kernels *K* <sup>+</sup> , i.e., satisfying *K*(*s*, *t*) ≥ 0 ∀*s*, *t*. If a kernel is stable, using as test function *l*(*t*) = 1 ∀*t*, one must have

$$\left| \sum\_{t=1}^{\infty} K^+(s, t) l\_t \right| = \sum\_{t=1}^{\infty} K^+(s, t) < \infty$$

in discrete time, and

$$\left| \int\_0^\infty K^+(s, t) l(t) dt \right| = \int\_0^\infty K^+(s, t) dt < \infty$$

in continuous time. So, for nonnegative-valued kernels, stability implies (absolute) summability of the kernel. But, since we have seen in Corollary 7.1 that absolute summability implies stability, the following result holds.

**Corollary 7.2** (based on [16, 32, 73]) *Let K <sup>+</sup>* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> *be a positive semidefinite and nonnegative-valued kernel with X* = N *or X* = R+*. Then,*

• *one has*

$$\mathcal{H}^{\ell} \subset \ell\_1 \iff \sum\_{s=1}^{\infty} \sum\_{t=1}^{\infty} K^{+}(s, t) < \infty \tag{7.61}$$

*for the discrete-time case where X* = N*;*

• *one has*

$$\mathcal{A}\mathcal{A}^{\varrho} \subset \mathcal{A}\_1 \iff \int\_0^\infty \int\_0^\infty K^+(s, t) dt ds < \infty \tag{7.62}$$

*for the continuous-time case where X* = R+*.*

As an example, we can now show that the Gaussian kernel (6.43) defined e.g., over <sup>R</sup><sup>+</sup> <sup>×</sup> <sup>R</sup><sup>+</sup> is not stable. In fact, it is nonnegative valued and one has

$$\int\_0^\infty \int\_0^\infty \exp\left(-(s-t)^2/\rho\right) ds dt = \infty \,\,\forall \rho \dots$$

The same holds for the spline kernels (6.45) extended to <sup>R</sup><sup>+</sup> <sup>×</sup> <sup>R</sup><sup>+</sup> and also for translation invariant kernels introduced in Example 6.12, as e.g., proved in [32] using the Schoenberg representation theorem. Hence, all of these models are not suited for stable impulse response estimation.

**Remark 7.5** Any unstable kernel can be made stable simply by truncation. More specifically, let *<sup>K</sup>* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> be an unstable kernel with *<sup>X</sup>* <sup>=</sup> <sup>N</sup> or *<sup>X</sup>* <sup>=</sup> <sup>R</sup>+. Then by setting *K*(*s*, *t*) = 0 for *s*, *t* > *T* for any given *T* ∈ *X* , a stable kernel is obtained. Care should be however taken when a FIR model is obtained through this operation. In fact, consider e.g., the use of cubic spline or Gaussian kernel in the estimation problem depicted in Fig. 7.6 setting *T* equal to 20 or 50. Also after truncation, such models would not give good performance: the undue oscillations affecting the estimates in the top and middle panel of Fig. 7.6 would still be present. The reason is that these two kernels do not encode the information that the variability of the impulse response decreases as time progresses, as also already discussed using the Bayesian interpretation of regularization.

#### *7.3.2 Inclusions of Reproducing Kernel Hilbert Spaces in More General Lebesque Spaces -*

We now discuss the conditions for a RKHS to be contained in the spaces *L* <sup>μ</sup> *<sup>p</sup>* equipped with a generic measure μ. The following analysis will then include both the space *L*<sup>1</sup> (considered before with the Lebesque measure) and <sup>1</sup> as special cases obtained with *p* = 1. First, we need the following definition.

**Definition 7.2** (*based on* [16]) Let <sup>1</sup> <sup>≤</sup> *<sup>p</sup>* ≤ ∞ and *<sup>q</sup>* <sup>=</sup> *<sup>p</sup> <sup>p</sup>*−<sup>1</sup> with the convention *<sup>p</sup> <sup>p</sup>*−<sup>1</sup> <sup>=</sup> <sup>∞</sup> if *<sup>p</sup>* = 1 and *<sup>p</sup> <sup>p</sup>*−<sup>1</sup> = 1 if *<sup>p</sup>* <sup>=</sup> <sup>∞</sup>. Moreover, let *<sup>K</sup>* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> be a positive semidefinite kernel. Then, the kernel *K* is said to be *q*-bounded if


The following theorem then gives the necessary and sufficient condition for the *q*-boundedness of a kernel and is a special case of Proposition 4.2 in [16].

**Theorem 7.6** (based on [16]) *Let K* : *<sup>X</sup>* <sup>×</sup> *<sup>X</sup>* <sup>→</sup> <sup>R</sup> *be a positive semidefinite kernel with H the induced RKHS. Then, H is a subspace of L* <sup>μ</sup> *<sup>p</sup> if and only if K is q-bounded, i.e.,*

$$\mathcal{H}^{\ell} \subset \mathcal{E}\_p^{\mu} \iff K \text{ is } q\text{-bounded}.$$

Theorem7.6 permits thus to see if a RKHS is contained in *L* <sup>μ</sup> *<sup>p</sup>* by checking the properties of the kernel. Interestingly, setting *p* = 1, that implies *q* = ∞, and μ e.g., to the Lebesque measure one can see that the concept of stable and ∞-bounded kernel are equivalent. Theorem7.5 is then a special case of Theorem7.6.

#### **7.4 Further Insights into Stable Reproducing Kernel Hilbert Spaces** *-*

In this section, we provide some additional insights into the structure of the stable kernels and associated RKHSs. The analysis is focused on the discrete-time case where the kernel *K* can be seen as an infinite-dimensional matrix with the (*i*, *j*) entries denoted by *Ki j* . Thus, the function domain is the set of natural numbers N and the RKHS contains discrete-time impulse responses of causal systems.

As discussed after (7.58) to comment Fig. 7.10, the kernel *K* can be also associated with an acausal linear time-varying system, often called kernel operator in the literature. It maps the infinite-dimensional input (sequence) *u* into the infinitedimensional output *K u* whose *i*th component is ∞ *<sup>j</sup>*=1 *Ki ju <sup>j</sup>* . Two important kernel operators will be considered. The first one maps <sup>∞</sup> into <sup>1</sup> and is key for kernel stability as pointed out in Theorem7.5. The second one maps <sup>2</sup> into <sup>2</sup> itself and will be important to discuss spectral decompositions of stable kernels.

## *7.4.1 Inclusions Between Notable Kernel Classes*

To state some relationships between stable kernels and other fundamental classes, we start introducing some sets of RKHSs. Define


$$\sum\_{ij} |K\_{ij}| < +\infty;$$

• the set *S f t* of RKHSs induced by finite-trace kernels, i.e., satisfying

$$\sum\_{i} K\_{ii} < +\infty;$$

• the set *S*<sup>2</sup> associated to squared summable kernels, i.e., satisfying

$$\sum\_{ij} \, K\_{ij}^2 < +\infty.$$

One has then the following result from [8] (see Sect. 7.7.4 for some details on its proof).

**Theorem 7.7** (based on [8]) *It holds that*

$$
\mathcal{J}\_1 \subset \mathcal{J}\_s \subset \mathcal{J}\_{ft} \subset \mathcal{J}\_2. \tag{7.63}
$$

Figure 7.11 gives a graphical description of Theorem7.7 in terms of inclusions of kernels classes. Its meaning is further discussed below.

In Corollary 7.1, we have seen that absolute summability is a sufficient condition for kernel stability. The result *S*<sup>1</sup> ⊂ *S<sup>s</sup>* shows also that such inclusion is strict. Hence, one cannot conclude that a kernel is unstable from the sole failure of absolute summability.

The fact that *S<sup>s</sup>* ⊂ *S f t* means that the set of finite-trace kernels contains the stable class. This inclusion is strict, hence the trace analysis can be used only to show that a given RKHS is not contained in 1. There are however interesting consequences of this fact. Consider all the RKHSs induced by translation invariant kernels

$$
\mathcal{K}\_{ij} = h(i - j),
$$

where *h* satisfies the positive semidefinite constraints. The trace of these kernels is *<sup>i</sup> Kii* = *<sup>i</sup> h*(0) and it always diverges unless *h* is the null function. So, all the translation invariant kernels are unstable (as already mentioned after Corollary 7.2). Other instability results become also immediately available. For instance, all the kernels with diagonal elements satisfying *Kii* ∝ *i*−<sup>δ</sup> are unstable if δ ≤ 1.

**Fig. 7.11** Inclusion properties of some important kernel classes

Finally, the strict inclusion *S f t* ⊂ *S*<sup>2</sup> shows that the finite-trace test is more powerful than a check of kernel squared summability.

## *7.4.2 Spectral Decomposition of Stable Kernels*

As discussed in Sect. 6.6.3 and in Remark 6.3, kernels can define spaces rich of functions by (implicitly) mapping the space of the regressors into high-dimensional feature spaces where linear estimators can be used. This allows to reduce nonlinear algorithms even without knowing explicitly the feature map, i.e., without the exact knowledge of which functions are encoded in the kernel. In particular, in Sect. 6.3, we have seen that if the kernel admits the spectral representation

$$K(\mathbf{x}, \mathbf{y}) = \sum\_{i=1}^{\infty} \xi\_i \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y}),\tag{7.64}$$

then the ρ*i*(*x*) are the basis functions that span the RKHS induced by *K*. For instance, the basis functions ρ1(*x*) = 1, ρ2(*x*) = *x*, ρ3(*x*) = *x*<sup>2</sup>,... describe polynomial models which are, e.g., included up to a certain degree in the polynomial kernel discussed in Sect. 6.6.4. Now, we will see that stable kernels always admit an expansion of the type (7.64) with the ρ*<sup>i</sup>* forming a basis of 2. The number of ζ*<sup>i</sup>* different from zero then corresponds to the dimension of the induced RKHS.

Formally, it is now necessary to consider the operator induced by a stable kernel *K* as a map from <sup>2</sup> into <sup>2</sup> itself. Again, it is useful to see *K* as an infinite-dimensional matrix so that we can think of *K v* as the result of the kernel operator applied to *v* ∈ 2. An operator is said to be compact if it maps any bounded sequence {*vi*} into a sequence {*K vi*} from which a convergent subsequence can be extracted [85, 95]. From Theorem7.7, we know that any stable kernel *K* is finite trace and, hence, squared summable. This fact ensures the compactness of the kernel operator, as discussed in [8] and stated below.

#### **Theorem 7.8** (based on [8]) *Any operator induced by a stable kernel is self-adjoint, positive semidefinite and compact as a map from* <sup>2</sup> *into* <sup>2</sup> *itself.*

This result allows us to exploit the spectral theorem [35] to obtain an expansion of *K*. Now, recall that spectral decompositions were discussed in Sect. 6.3 where the Mercer's theorem was also reported.Mercer's theorem derivations exploit the spectral theorem and, as, e.g., in Theorem6.9, they typically assume that the kernel domain is compact, see also [86] for discussions and extensions. Indeed, first formulations consider continuous kernels on compact domains (proving also uniform convergence of the expansion). However, the spectral theorem does not require the domain to be compact and, when applied to discrete-time kernels on<sup>N</sup> <sup>×</sup> <sup>N</sup>, it guarantees pointwise convergence. It thus becomes the natural generalization of the decomposition of a symmetric matrix in terms of eigenvalues and eigenvectors, initially discussed in the finite-dimensional setting in Sect. 5.6 to link regularization and basis expansion. This is summarized in the following proposition that holds in virtue of Theorem7.8.

**Proposition 7.2** (Representation of stable kernels, based on [8]) *Assume that the kernel K is stable. Then, there always exists an orthonormal basis of* <sup>2</sup> *composed by eigenvectors* {ρ*i*} *of K with corresponding eigenvalues* {ζ*i*}*, i.e.,*

$$K\rho\_i = \zeta\_i \rho\_i, \quad i = 1, 2, \dots$$

*In addition, the kernel admits the following expansion:*

$$K\_{\rm xy} = \sum\_{i=1}^{+\infty} \xi\_i \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y}),\tag{7.65}$$

*with x*, *<sup>y</sup>* <sup>∈</sup> <sup>N</sup>*.*

While in the next subsection, we will use the above theorem to discuss the representation of stable RKHSs, some numerical considerations regarding (7.65) are now in order. Under an algorithmic viewpoint, many efficient machine learning procedures use truncated Mercer expansions to approximate the kernel, see [42, 52, 75, 93, 96] for discussions on their optimality in a stochastic framework. Applications for system identification can be found in [15] where it is shown that a relatively small number of eigenfunctions (w.r.t. the data set size) can well approximate impulse responses regularized estimates. These works trace back to the so-called Nyström method where an integral equation is replaced by finite-dimensional approximations [5, 6]. However, obtaining the Mercer expansion (7.65) in closed form is often hard. Fortunately, the <sup>2</sup> basis and related eigenvalues of a stable RKHS can be numerically recovered (with arbitrary precision w.r.t. the <sup>2</sup> norm) through a sequence of SVDs applied to truncated kernels [8]. Formally, let *K*(*d*) denote the *d* × *d* positive

**Fig. 7.12** Expansion of the first-order discrete-time stable spline kernel *Kx y* = αmax(*x*,*y*) with α = 0.99: eigenfunctions ρ*i*(*x*) orthogonal in <sup>2</sup> for *i* = 1, 2, 8 (left panel, samples are linearly interpolated) and eigenvalues ζ*i* (right)

semidefinite matrix obtained by retaining only the first *d* rows and columns of *K*. Let also ρ(*d*) *<sup>i</sup>* and <sup>ζ</sup> (*d*) *<sup>i</sup>* be, respectively, the eigenvectors of *K*(*d*) , seen as elements of <sup>2</sup> with a tail of zeros, and the eigenvalues returned by the SVD of *K*(*d*) . Assume, for simplicity, single multiplicity of each ζ*<sup>i</sup>* . Then, for any *i*, as *d* grows to ∞ one has

$$\begin{aligned} \mathfrak{t}\_i^{(d)} &\rightarrow \mathfrak{t}\_i \\ \|\rho\_i^{(d)} - \rho\_i\|\_2 &\rightarrow 0, \end{aligned} \tag{7.66a}$$

$$\text{where } \| \cdot \|\_{2} \text{ is the } \ell\_2 \text{ norm.}$$

In Fig. 7.12, we show some eigenvectors (left panel) and the first 100 eigenvalues (right) of the stable spline kernel *Kx y* = αmax (*x*,*y*) with α = 0.99. Results are obtained applying SVDs to truncated kernels of different sizes and monitoring convergence of eigenvectors and eigenvalues. The final outcome was obtained with *d* = 2000.

## *7.4.3 Mercer Representations of Stable Reproducing Kernel Hilbert Spaces and of Regularized Estimators*

Now we exploit the representations of the RKHSs induced by a diagonalized kernel as discussed in Theorems 6.10 and 6.13 (where compactness of the input space is not even required). In view of Proposition 7.2, assuming for simplicity all the ζ*<sup>i</sup>* different from zero, one obtains that the RKHS associated to a stable *K* always admits the representation

$$\mathcal{A}^{\theta} = \left\{ \mathbf{g} = \sum\_{i=1}^{\infty} a\_i \rho\_i \text{ s.t. } \sum\_{i=1}^{\infty} \frac{a\_i^2}{\zeta\_i} < +\infty \right\},\tag{7.67}$$

where the ρ*<sup>i</sup>* are the eigenvectors of *K* forming an orthonormal basis of 2. <sup>1</sup> If *g* = ∞ *<sup>i</sup>*=1 *ai*ρ*<sup>i</sup>* , one also has

$$\left\|\mathbf{g}\right\|\_{\mathcal{H}}^2 = \sum\_{i=1}^{\infty} \frac{a\_i^2}{\xi\_i}.\tag{7.68}$$

The fact that any stable RKHS is generated by an <sup>2</sup> basis gives also a clear connection with the important impulse response estimators which adopt orthonormal functions, e.g., the Laguerre functions illustrated in Fig. 7.3 [46, 91, 92]. A classical approach used in the literature is to introduce the model *g* = *<sup>i</sup> ai*ρ*<sup>i</sup>* and then to use linear least squares to determine the expansion coefficients *ai* . In particular, let *Lt*[*g*] be the system output, i.e., the convolution between the known input and *g* evaluated at the time instant *t*. Then, the impulse response estimate is

$$\hat{\mathbf{g}} = \sum\_{i=1}^{d} \hat{a}\_i \rho\_i \tag{7.69a}$$

$$\{\hat{a}\_{i}\}\_{i=1}^{d} = \operatorname\*{arg\,min}\_{\{a\_{i}\}\_{i=1}^{d}} \sum\_{t=1}^{N} \left( \mathbf{y}(t) - L\_{t} \left[ \sum\_{i=1}^{d} a\_{i} \rho\_{i} \right] \right)^{2},\tag{7.69b}$$

where *d* determines model complexity and is typically selected using AIC or crossvalidation (CV) as discussed in Chap. 2.

In view of (7.67) and (7.68), the regularized estimator (7.10), equipped with a stable RKHS, is equivalent to

$$\hat{f} = \sum\_{i=1}^{\infty} \hat{a}\_i \rho\_i \tag{7.70a}$$

$$\{\hat{a}\_{i}\}\_{i=1}^{\infty} = \operatorname\*{arg\,min}\_{\{a\_{i}\}\_{i=1}^{\infty}} \sum\_{t=1}^{N} \left( \mathbf{y}(t) - L\_{t} \left[ \sum\_{i=1}^{\infty} a\_{i} \rho\_{i} \right] \right)^{2} + \gamma \sum\_{i=1}^{\infty} \frac{a\_{i}^{2}}{\xi\_{i}}.\tag{7.70b}$$

<sup>1</sup> In (7.67), we have assumed that all the kernel eigenvalues are strictly positive so that *H* is infinite dimensional. If some ζ*i* is null, *H* is spanned only by the eigenvectors associated to those non-null. If only a finite number of ζ*<sup>i</sup>* is different from zero, *K* is finite rank and *H* is finite dimensional. A notable case is that of the RKHSs induced by truncated kernels, i.e., such that there exists *d* such that *Kii* = 0 ∀*i* > *d*. As we have seen, this kind of kernels induce finite-dimensional RKHSs containing FIR systems of order *d*.

This result is connected with the kernel trick discussed in Remark 6.3 and shows that regularized least squares in a stable (infinite-dimensional) RKHS always model impulse responses using an <sup>2</sup> orthonormal basis, as in the classical works on linear system identification. But the key difference between (7.69) and (7.70) is that complexity is no more controlled by the model order because *d* is set to ∞. Complexity instead depends on the regularization parameter γ (and possibly also on other kernel parameters) that balances the data fit and the penalty term. This latter induces stability by using the kernel eigenvalues ζ*<sup>i</sup>* to constrain the decay rate to zero of the expansion coefficients.

## *7.4.4 Necessary and Sufficient Stability Condition Using Kernel Eigenvectors and Eigenvalues*

We have seen that a fruitful way to design a regularized estimator for linear system identification is to introduce a kernel by specifying its entries *Ki j* . This modelling technique translates our expected features of an impulse response into kernel properties, e.g., smooth exponential decay as described by stable spline, TC and DC kernels. This route exploits the kernel trick, i.e., the basis functions implicit encoding. In some circumstances, it could be useful to build a kernel starting from the design of eigenfunctions ρ*<sup>i</sup>* and eigenvalues ζ*<sup>i</sup>* . A notable example is given by the (already cited) Laguerre or Kautz functions that belong to the more general class of Takenaka–Malmquist orthogonal basis functions [46]. They can be useful to describe oscillatory behavior or presence of fast/slow poles.

Since any stable kernel can be associated with an <sup>2</sup> basis, the following fundamental problem then arises. Given an orthonormal basis{ρ*i*} of 2, for example, of the Takenaka–Malmquist type, which are the conditions on the eigenvalues ζ*<sup>i</sup>* ensuring stability of *Kx y* = +∞ *<sup>i</sup>*=1 ζ*i*ρ*i*(*x*)ρ*i*(*y*)? The answer is in the following result derived from [8] that reports the necessary and sufficient condition (the proof is given in Sect. 7.7.5).

**Theorem 7.9** (RKHS stability using Mercer expansions, based on [8]) *Let H be the RKHS induced by K with*

$$K\_{xy} = \sum\_{i=1}^{+\infty} \xi\_i \rho\_i(\mathbf{x}) \rho\_i(\mathbf{y}),$$

*where the* {ρ*i*} *form an orthonormal basis of* 2*. Let also*

$$\partial \ell\_{\infty} = \left\{ \mu \in \ell\_{\infty} \, : \, |\mu(i)| = 1, \,\,\forall i \ge 1 \right\}.$$

*Then, one has*

$$\mathcal{H}^{\rho} \subset \ell\_1 \iff \sup\_{u \in \mathcal{W}\_\infty} \sum\_i \xi\_i \langle \rho\_i, u \rangle\_2^2 < +\infty,\tag{7.71}$$

*where* ·, ·<sup>2</sup> *is the inner product in* 2*.*

Thus, clearly, there is no stability if one function ρ*<sup>i</sup>* associated to ζ*<sup>i</sup>* > 0 doesn't belong to 1. In fact, one can choose *u* containing the signs of the components of ρ*<sup>i</sup>* and this leads to ρ*i*, *u*<sup>2</sup> = +∞. Nothing is instead required for the eigenvectors associated to ζ*<sup>i</sup>* = 0. Theorem7.9 permits also to derive the following sufficient stability condition.

**Corollary 7.3** (based on [8]) *Let H be the RKHS induced by the kernel Kx y* = +∞ *<sup>i</sup>*=1 ζ*i*ρ*i*(*x*)ρ*i*(*y*) *with* {ρ*i*} *an orthonormal basis of* 2*. Then, it holds that*

$$\mathcal{H}^{\rho} \subset \ell\_1 \Longleftarrow \sum\_i \xi\_i \|\rho\_i\|\_1^2 < +\infty. \tag{7.72}$$

*Furthermore, such condition also implies kernel absolute summability and, hence, it is not necessary for RKHS stability.*

It is easy to exploit the stability condition (7.72) to design models of stable impulse responses starting from an <sup>2</sup> basis. Let us reconsider, e.g., Laguerre or Kautz basis functions {ρ*i*} to build the impulse response model

$$\mathbf{g} = \sum\_{i=1}^{\infty} a\_i \rho\_i.$$

To exploit (7.70), one has to define stability constraints on the expansion coefficients *ai* . This corresponds to define ζ*<sup>i</sup>* in such a way that the regularizer

$$\sum\_{i=1}^{\infty} \frac{a\_i^2}{\zeta\_i}$$

enforces absolute summability of *g*. Laguerre and Kautz models belong to the Takenaka–Malmquist class of functions ρ*<sup>i</sup>* that all satisfy

$$\|\rho\_i\|\_1 \le Mi,$$

with *M* a constant independent of *i* [46]. Then, Corollary 7.3 ensures that the choice

$$
\zeta\_i \propto i^{-\nu}, \quad \nu > 2,
$$

includes the stability contraint for the entire Takenaka–Malmquist class.

Let us now consider the class of orthonormal basis functions ρ*<sup>i</sup>* all contained in a ball of 1. Then, the necessary and sufficient stability condition assumes a form especially simple as the following result shows.

**Corollary 7.4** (based on [8]) *Let H be the RKHS induced by the kernel Kx y* = +∞ *<sup>i</sup>*=1 ζ*i*ρ*i*(*x*)ρ*i*(*y*) *with* {ρ*i*} *an orthonormal basis of* <sup>2</sup> *and* ρ*i*<sup>1</sup> ≤ *M* < +∞ *if*

**Fig. 7.13** Inclusion properties of some important kernel classes in terms of Mercer expansions. This representation is the dual of that reported in Fig. 7.11 and defines kernel sets through properties of the kernel eigenvectors ρ*<sup>i</sup>* , forming an orthonormal basis in 2, and of the corresponding kernel eigenvalues ζ*i* . The condition *<sup>i</sup>* ζ*<sup>i</sup>* ρ*<sup>i</sup>* <sup>2</sup> <sup>1</sup> < ∞is the most restrictive since it implies kernel absolute summability. The necessary and sufficient condition for stability is sup*u*∈*<sup>U</sup>* <sup>∞</sup> *<sup>i</sup>* ζ*i*ρ*i*, *u* 2 <sup>2</sup> < ∞. Finally, *<sup>i</sup>* ζ*<sup>i</sup>* < ∞ and *<sup>i</sup>* ζ <sup>2</sup> *<sup>i</sup>* < ∞ are exactly the conditions for a kernel to be finite trace and squared summable, respectively

ζ*<sup>i</sup>* > 0*, with M not dependent on i. Then, one has*

$$
\mathcal{H}^{\ell} \subset \ell\_1 \iff \sum\_i \xi\_i < +\infty. \tag{7.73}
$$

Finally, Fig. 7.13 illustrates graphically all the stability results here obtained starting from Mercer expansions.

#### **7.5 Minimax Properties of the Stable Spline Estimator** *-*

In this section, we will derive non-asymptotic upper bounds on the MSE of the regularized IIR estimator (7.10) valid for all the exponentially stable discrete-time systems whose poles belong to the complex circle of radius ρ. Obtained bounds can be evaluated before any data is observed. This kind of results give insight into the so-called sample complexity, i.e., the number of measurements needed to achieve a certain accuracy on impulse response reconstruction. This is an attractive feature even if, since the bounds need to hold for all the models falling in a particular class, often they are quite loose for the particular dynamic system at hand. However, they have a considerable theoretical value since permit also to assess the quality of (7.10) through nonparametric minimax concepts. Such setting considers the worst-case inside an infinite-dimensional class and has been widely studied in nonparametric regression and density estimation [88]. In particular, obtained bounds will lead to conditions which ensure the optimality in order, i.e., the best convergence rate of (7.10) in the minimax sense. We will derive them by considering system inputs given by white noises and using the TC/stable spline kernel (7.15) as regularizer. The important dependence between the convergence rate of (7.10) to the true impulse response, the stability kernel parameter α and the stability radius ρ will be elucidated.

## *7.5.1 Data Generator and Minimax Optimality*

As in the previous part of the chapter, we use *g*<sup>0</sup> to denote the impulse response of a discrete-time linear system. The measurements are generated as follows:

$$\mathbf{y}(t) = \sum\_{k=1}^{\infty} \mathbf{g}^0(k)\boldsymbol{u}\_{t-k} + \boldsymbol{e}(t),\tag{7.74}$$

where *g*<sup>0</sup>(*k*) are the impulse response coefficients. We will always assume *g*<sup>0</sup> as a deterministic and exponentially stable impulse response, while the input *u* and the noise *e* are stochastic as specified below.

**Assumption 7.10** The impulse response *g*<sup>0</sup> belongs to the following set:

$$\mathcal{AP}(\varrho, L) = \left\{ \mathbf{g} : |\mathbf{g}(k)| \le L\varrho^k \right\}, \quad 0 \le \rho < 1. \tag{7.75}$$

The system input and the noise are discrete-time stochastic processes. One has that {*u*(*t*)}*t*∈<sup>Z</sup> are independent and identically distributed (i.i.d.) zero-mean random variables with

$$\mathcal{E}\left[\mu(t)^2\right] = \sigma\_u^2, \quad |\mu(t)| \le C\_u < \infty. \tag{7.76}$$

Finally, {*e*(*t*)}*<sup>t</sup>*∈<sup>Z</sup> are independent random variables, independent of {*u*(*t*)}*<sup>t</sup>*∈<sup>Z</sup>, with

$$\mathcal{E}[e(t)] = 0, \quad \mathcal{E}[e(t)^2] \le \sigma^2. \tag{7.77}$$

The available measurements are

$$\mathcal{O}\_T = \{\mu(1), \dots, \mu(N), \mathbf{y}(1), \dots, \mathbf{y}(N)\},\tag{7.78}$$

where *N* is the data set size.

The quality of an impulse response estimator *g*ˆ function of *D<sup>T</sup>* will be measured by computing the estimation error *E g*<sup>0</sup> − ˆ*g*2, where ·<sup>2</sup> is the norm in the space <sup>2</sup> of squared summable sequences. Note that the expectation is taken w.r.t. the randomness of the system input and the measurement noise. The worst-case error over the family *S* of exponentially stable systems reported in (7.75) will be also considered. In particular, the uniform 2-risk of *g*ˆ is

288 7 Regularization in Reproducing Kernel Hilbert Spaces …

$$\sup\_{\mathbf{g}\in\mathcal{F}}\mathcal{S}\|\mathbf{g}-\hat{\mathbf{g}}\|\_{2}.$$

An estimator *g*<sup>∗</sup> is then said to be *minimax* if the following equality holds for any data set size *N*:

$$\sup\_{\mathbf{g}\in\mathcal{Y}'}\mathcal{S}^{\parallel}\|\mathbf{g}-\mathbf{g}^\*\|\_{2} = \inf\_{\hat{\mathbf{g}}}\sup\_{\mathbf{g}\in\mathcal{Y}'}\mathcal{S}^{\parallel}\|\mathbf{g}-\hat{\mathbf{g}}\|\_{2},$$

meaning that *g*<sup>∗</sup> minimizes the error w.r.t. the worst-case scenario. Building such kind of estimator is in general really difficult. For this reason, it is often convenient to consider just the asymptotic behaviour introducing the concept of optimality in order. Specifically, an estimator *g*¯ is *optimal in order* if

$$\sup\_{\mathfrak{g}\in\mathcal{G}'} \mathcal{\delta}^{\mathbb{P}} \|\mathfrak{g} - \overline{\mathfrak{g}}\|\_{2} \leq \mathcal{C}\_{N} \sup\_{\mathfrak{g}\in\mathcal{G}'} \mathcal{\delta}^{\mathbb{P}} \|\mathfrak{g} - \mathfrak{g}^\*\|\_{2}$$

with *CN* is function of the data set size and satisfies sup*<sup>N</sup> CN* < ∞ and *g*<sup>∗</sup> is minimax. In our linear system identification setting, optimality in order thus ensures that, as *N* grows to infinity, the convergence rate of *g*¯ to the true impulse response *g*<sup>0</sup> cannot be improved by any other system identification procedure in the minimax sense.

## *7.5.2 Stable Spline Estimator*

As anticipated, our study is focused on the following regularized estimator:

$$\hat{\mathbf{g}} = \operatorname\*{arg\,min}\_{\mathbf{g} \in \mathcal{M}^{\ell}} \sum\_{t=1}^{N} \left( \mathbf{y}(t) - \sum\_{k=1}^{\infty} \mathbf{g}(k)\boldsymbol{u}(t-k) \right)^{2} + \boldsymbol{\chi} \left\| \mathbf{g} \right\|\_{\mathcal{H}^{\ell}}^{2},\tag{7.79}$$

equipped with the stable spline kernel

$$K(i,j) = \alpha^{\max(i,j)}, \quad 0 < \alpha < 1, \quad (i,j) \in \mathbb{N}.\tag{7.80}$$

For future developments, it is important to control complexity of (7.79) not only by using the hyperparameters γ and α but also through the dimension *d* of the following subspace:

$$\mathcal{H}\_d^\ell = \{ \mathbf{g} \in \mathcal{H}^\ell \text{ s.t. } \mathbf{g}(d+1) = \mathbf{g}(d+2) = \dots = 0 \}$$

over which optimization of the objective in (7.79) is performed. In particular, we will consider the estimator

$$\hat{\mathbf{g}}^d = \arg\min\_{\mathbf{g}\in\mathcal{M}\_d^\ell} \sum\_{t=1}^N \left( \mathbf{y}(t) - \sum\_{k=1}^d \mathbf{g}(k)\boldsymbol{u}(t-k) \right)^2 + \boldsymbol{\gamma} \left\| \mathbf{g} \right\|\_{\mathcal{M}^\ell}^2,\tag{7.81}$$

and will study how *N* and the choice of γ , α, *d* influence the estimation error and, hence, the convergence rate. This will lead to complexity control rules that are a hybrid of those seen in the classical and in the regularized framework. To obtain this, first, we rewrite (7.81) in terms of regularized FIR estimation by exploiting the structure of the stable spline norm (7.16) which shows that

$$\log \in \mathcal{H}\_d^\ell \Longrightarrow \|\lg\|\_{\mathcal{H}^\ell}^2 = \left(\sum\_{t=1}^{d-1} \frac{(\mathbf{g}(t+1) - \mathbf{g}(t))^2}{(1-\alpha)\alpha^t}\right) + \frac{\mathbf{g}^2(d)}{(1-\alpha)\alpha^d}.\tag{7.82}$$

Let us define the matrix

$$R = \frac{1}{a - a^2} \begin{bmatrix} 1 & -1 & 0 & 0 & \cdots & 0 \\ -1 & 1 + \frac{1}{a} & -\frac{1}{a} & 0 & \cdots & 0 \\ 0 & -\frac{1}{a} & \frac{1}{a} + \frac{1}{a^2} & -\frac{1}{a^2} & \cdots & 0 \\ & 0 & 0 & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \cdots & -\frac{1}{a^{d-2}} & \frac{1}{a^{d-2}} + \frac{1}{a^{d-1}} \end{bmatrix} \tag{7.83}$$

and the regressors

$$\varphi\_d(t) = \begin{pmatrix} u(t-1) \\ \vdots \\ u(t-d) \end{pmatrix}. \tag{7.84}$$

Now, one can easily see that the first *d* components of *g*ˆ*<sup>d</sup>* in (7.81) are contained in the vector

$$\arg\min\_{\boldsymbol{\theta}} \sum\_{t=1}^{N} \left( \mathbf{y}(t) - \boldsymbol{\varphi}\_d(t)^T \boldsymbol{\theta} \right)^2 + \boldsymbol{\mathcal{y}} \boldsymbol{\theta}^T \boldsymbol{R} \boldsymbol{\theta}. \tag{7.85}$$

Hence, we obtain

$$\hat{\mathbf{g}}^d = (\hat{\mathbf{g}}(1), \dots, \hat{\mathbf{g}}(d), 0, 0, \dots) \tag{7.86}$$

where

$$
\begin{pmatrix} \hat{\mathbf{g}}(1) \\ \vdots \\ \hat{\mathbf{g}}(d) \end{pmatrix} = \left( \frac{1}{N} \sum\_{t=1}^{N} \varphi\_d(t) \varphi\_d^T(t) + \frac{\mathcal{V}}{N} \mathcal{R} \right)^{-1} \frac{1}{N} \sum\_{t=1}^{N} \varphi\_d(t) \mathbf{y}(t). \tag{7.87}
$$

In real applications, one cannot measure the inputs at all the time instants and our data set *D<sup>T</sup>* in (7.78) could contain only the inputs *u*(1), . . . , *u*(*N*). So, differently from what postulated in the above equations, in practice the regressors are never perfectly known. One solution is just to replace with zeros the unknown input values {*u*(*t*)}*<sup>t</sup>*<<sup>1</sup> entering (7.84). Also under this model misspecification, all the results introduced in the next sections still hold.

## *7.5.3 Bounds on the Estimation Error and Minimax Properties*

The following theorem will report non-asymptotic bounds that illustrate the dependence of $\mathcal{E}\|g^0 - \hat{g}^d\|_2$ on three key variables: the data set size $N$, the FIR order $d$ and the kernel decay rate $\alpha$.


In addition, it gives conditions on α which ensure optimality in order, provided that some conditions on the stability radius ρ entering (7.75) and on the FIR order *d* (function of the data set size *N*) are fulfilled. Below, the notation $O(1)$ indicates an absolute constant, independent of $N$. Furthermore, given $x \in \mathbb{R}$, we use $\lfloor x \rfloor$ to indicate the largest integer not larger than $x$. The following result then holds.

**Theorem 7.11** (based on [74]) *Let the FIR order d be defined by the following function of the data set size N :*

$$d^\* = \left\lfloor \frac{\ln(N(1-\alpha)\sigma\_u^2) - \ln(8\gamma)}{\ln(1/\alpha)} \right\rfloor,\tag{7.88}$$

*with N large enough to guarantee $d^* \geq 1$. Then, under Assumption 7.10, the estimator (7.81) satisfies*

$$\mathcal{E}\|g^0 - \hat{g}^{d^\*}\|\_{2} \leq O(1)\left[\frac{L\rho^{d^\*+1}}{1-\rho}\left(\sqrt{\frac{d^\*}{N}}+1\right)+\frac{\sigma}{\sigma\_u}\sqrt{\frac{d^\*}{N}}+\frac{4L\gamma}{1-\alpha}\frac{h\_{d^\*}}{N}\right], \tag{7.89}$$

*where*

$$h\_{d^\*} = \begin{cases} \frac{\sqrt{d^\*}}{\rho} & \text{if } \alpha = \rho \\ \frac{\rho}{\sqrt{\alpha^2 - \rho^2}} & \text{if } \alpha > \rho \\ \frac{\rho}{\sqrt{\rho^2 - \alpha^2}} \left(\frac{\rho}{\alpha}\right)^{d^\*} & \text{if } \alpha < \rho \end{cases}. \tag{7.90}$$

*Furthermore, if the measurement noise is Gaussian and $\sqrt{\alpha} \geq \rho$, the stable spline estimator (7.81) is optimal in order.*

To illustrate the meaning of Theorem 7.11, it is first useful to recall a result obtained in [43] that relies on Fano's inequality. It shows that, if a dynamic system is fed with white input and the measurement noise is Gaussian, the expected $\ell_2$ error of any impulse response estimator cannot decay to zero faster than $\sqrt{\ln N / N}$ in a minimax sense.

**Theorem 7.12** (based on [43]) *Let Assumption 7.10 hold and assume also that the measurement noise is Gaussian. Then, if $\hat{g}$ is any impulse response estimator built with $D^T$, for N sufficiently large one has*

$$\sup\_{g \in \mathcal{Y}(\rho, L)} \mathcal{E} \|\hat{g} - g\|\_{2} \geq O(1) \sqrt{\frac{\ln N}{N}}.\tag{7.91}$$

To illustrate the convergence rate of the stable spline estimator, first note that the FIR dimension $d^*$ in (7.88) scales logarithmically with $N$. Apart from irrelevant constants, one in fact has

$$d^\* \sim \frac{\ln N}{\ln(1/\alpha)}.\tag{7.92}$$

We now consider the three terms on the r.h.s. of (7.89) with $d = d^*$. Since

$$\sqrt{\frac{d^\*}{N}} \sim \sqrt{\frac{\ln N}{N}} \quad \text{and} \quad \rho^{d^\*} \sim N^{-\frac{\ln \rho}{\ln \alpha}},\tag{7.93}$$

the first two terms decay to zero at least as $\sqrt{\ln N / N}$. Regarding the third one, one has

$$\frac{h\_{d^\*}}{N} \sim \begin{cases} \frac{\sqrt{\ln N}}{N} & \text{if } \alpha = \rho \\ \frac{1}{N} & \text{if } \alpha > \rho \\ N^{-\frac{\ln \rho}{\ln \alpha}} & \text{if } \alpha < \rho \end{cases} \tag{7.94}$$

and this shows that the optimal convergence rate is obtained if α ≥ ρ, while the case α < ρ can be critical. In particular, combining (7.89) with (7.93) and (7.94), one sees that the estimation error decays as $\sqrt{\ln N / N}$ provided that

$$
\frac{\ln \rho}{\ln \alpha} \ge 0.5 \iff \sqrt{\alpha} \ge \rho.
$$

This indeed corresponds to what was stated in the final part of Theorem 7.11: under Gaussian noise the stable spline estimator is optimal in order if $\sqrt{\alpha}$ is an upper bound on the stability radius ρ.

Relationships (7.93) and (7.94) also clarify what happens when the kernel includes a too fast exponential decay rate, i.e., when $\sqrt{\alpha} < \rho$. In this case, the error goes to

**Fig. 7.14** Convergence rate $\ln \rho / \ln \alpha$ of the stable spline estimator as a function of $\sqrt{\alpha}$ for $\sqrt{\alpha} < \rho$, with $\rho$ in the set $\{0.7, 0.8, 0.9, 0.95, 0.99\}$. When $\sqrt{\alpha} < \rho$ the estimation error converges to zero as $N^{-\ln \rho / \ln \alpha}$. Instead, if $\sqrt{\alpha} \geq \rho$ the error decays as $\sqrt{\ln N / N}$, making the stable spline estimator optimal in order when the measurement noise is Gaussian

zero as $N^{-\ln \rho / \ln \alpha}$, getting worse as $\sqrt{\alpha}$ drifts away from $\rho$. This phenomenon has a simple explanation. A too small α forces the impulse response estimate to decay to zero even when the true impulse response coefficients are still significantly different from zero. This corresponds to a strong bias: a wrong amount of regularization is introduced in the estimation process, hence compromising the convergence rate. This is graphically illustrated in Fig. 7.14, which plots the convergence rate $\ln \rho / \ln \alpha$ as a function of $\sqrt{\alpha}$ for five different values of $\rho$.
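The dependence of the rate on α and ρ described by (7.93) and (7.94) can be checked in a few lines; the value of ρ and the grid of $\sqrt{\alpha}$ values below are only illustrative.

```python
import math

def rate_exponent(rho, alpha):
    # Exponent r such that the error decays as N**(-r) when sqrt(alpha) < rho,
    # following (7.93)-(7.94)
    return math.log(rho) / math.log(alpha)

rho = 0.9  # illustrative stability radius
for sqrt_alpha in (0.5, 0.7, 0.8, 0.95):
    alpha = sqrt_alpha ** 2
    r = rate_exponent(rho, alpha)
    regime = ("optimal, ~ sqrt(ln N / N)" if sqrt_alpha >= rho
              else f"degraded, ~ N^(-{r:.3f})")
    print(f"sqrt(alpha) = {sqrt_alpha:.2f}: ln(rho)/ln(alpha) = {r:.3f} -> {regime}")
```

The exponent equals 0.5 exactly at the boundary $\sqrt{\alpha} = \rho$ and shrinks as $\sqrt{\alpha}$ decreases, mirroring the degradation visible in Fig. 7.14.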

The analysis thus shows that α plays a fundamental role in controlling impulse response complexity and, hence, in establishing the properties of the regularized estimator. This is not surprising, also in view of the deep connection between the decay rate and the degrees of freedom of the model. This was illustrated in Fig. 5.6 of Sect. 5.5.1 using the class of DC kernels, which includes TC as a special case.

## **7.6 Further Topics and Advanced Reading**

The idea of handling linear system identification with regularization methods in the RKHS framework first appeared in [72]. As already mentioned, the representer theorems introduced in this chapter are special cases of the one involving linear and bounded functionals reported in the previous chapter, see Theorem 6.16. More general versions of representer theorems with, e.g., more general loss functions and/or regularization terms can be found in, e.g., [33]. Similarly to the spline smoothing problem studied in Sect. 6.6.7, it could be useful to enrich the regularized impulse response estimators described here with a parametric component. Of course, the corresponding regularized estimator still has a closed-form finite-dimensional representation that depends on both the number of data *N* and the number of added parametric components, e.g., see [72, 90].

The stable spline kernel [72] and the diagonal correlated kernel [19] are the first two kernels introduced in the linear system identification literature. The stability of a kernel (or, equivalently, of an RKHS) first appeared in [32, 73]. The stability of a kernel is equivalent to its ∞-boundedness, a special case of the more general *q*-boundedness with 1 < *q* ≤ ∞ studied in [16]. The proof in [16] of the sufficiency and necessity of the *q*-boundedness of a kernel is quite involved and abstract. Theorem 7.5 is also discussed in [24], see also [76] where the stability analysis exploits the output kernel. The optimal kernel that minimizes the mean squared error was studied in [19, 73]. As already discussed, unfortunately, the optimal kernel cannot be applied in practice because it depends on the true impulse response to be estimated. Nevertheless, it offers a guideline to design kernels for linear system identification and more general function estimation problems. Motivated by these findings, many stable kernels have been introduced over the years, e.g., [17, 21, 77, 80, 97]. In particular, [17] proposed linear multiple kernels to handle systems with complicated dynamics, e.g., with distinct time constants and distinct resonant frequencies, and [77] further extended this idea and proposed "integral" versions of the stable spline kernels. To design kernels embedding more general prior knowledge, e.g., overdamped/underdamped dynamics, common structure, etc., it is natural to divide the prior knowledge into different types and then develop systematic ways to design kernels accordingly, see [21, 80, 97]. In particular, the approaches proposed in [21] are based on machine learning and system theory perspectives, those in [80] rely on the maximum entropy principle, and the method proposed in [97] uses harmonic analysis.

Along with kernel design, many efforts have also been devoted to "kernel analysis". In particular, many kernels can be given maximum entropy interpretations, including the stable spline kernel, the diagonal correlated kernel and the more general simulation-induced kernels [14, 21, 23]. This can help to understand the prior knowledge embedded in the model. Many kernels have the Markov property, e.g., [83]. Examples are the diagonal correlated kernel and some carefully designed simulation-induced kernels [21]. Exploiting this property can lead to efficient implementations. As we have seen, the spectral decomposition of a kernel is often not available in closed form, even if it can be recovered numerically, but exceptions include the stable spline and the diagonal correlated kernel [20, 22, 72].

The hyperparameter tuning problem has been studied for a long time in the context of function estimation from noisy observations, e.g., [83, 90]. The marginal likelihood maximization method relies on the connection with the Bayesian estimation of Gaussian processes, which was first studied in [51] for spline regression, see also [41, 83, 90]. More discussions on its relation to Bayesian evidence and Occam's razor principle can be found, e.g., in [27, 60]. Stein's unbiased risk estimation method is also known as the $C_p$ statistic [61]. The generalized cross-validation method was first proposed in [28] and found to be rotation invariant in [44]. The problem can also be tackled using full Bayes approaches relying on stochastic simulation techniques, e.g., Markov chain Monte Carlo [1, 39].

In the context of linear system identification, some theoretical results on the hyperparameter estimation problem have been derived. In particular, it was shown in [4] that the marginal likelihood maximization method is consistent for diagonal kernels in terms of the mean square error and asymptotically minimizes a weighted mean square error for nondiagonal kernels. In [78], the robustness of marginal likelihood maximization is analysed with the help of the excess degrees of freedom. It is further shown in [63, 64, 66] that Stein's unbiased risk estimation as well as many cross-validation methods are asymptotically optimal in the sense of the mean square error. In [4, 17, 94], the optimal hyperparameter of the marginal likelihood maximization is shown to be sparse. By exploiting this property it is possible to handle various structure detection problems in system identification, like sparse dynamic network identification [17, 26]. Full Bayes approaches can be found, e.g., in [69].

As also recalled in the previous chapter, a straightforward implementation of the regularization method in the RKHS framework has computational complexity $O(N^3)$ and is thus prohibitive when $N$ is large. Many efficient approximation methods have been proposed in machine learning, e.g., [53, 81, 82]. In the context of linear system identification, another practical issue must be noted in the implementation: the ill-conditioning possibly arising from the use of stable kernels, which is unavoidable due to the nature of stability. Hence, extra care has to be taken when developing efficient implementations. Some approximation methods have been proposed to reduce the computational complexity and avoid numerical problems. The first one is to truncate the IIR at a suitable finite order *n*. Then, the computational complexity becomes $O(n^3)$ and one can also use the approach proposed in [18] relying on some fundamental algebraic techniques and reliable matrix factorizations. The second one is to truncate the infinite expansion of a kernel at a finite order *l*. Then, the computational complexity becomes $O(l^3)$, see [15]. See also [36] for an efficient kernel-based regularization implementation using the Alternating Direction Method of Multipliers (ADMM). Another practical issue is the difficulty caused by local minima. For kernels with a small number of hyperparameters, e.g., the stable spline kernel and the diagonal correlated kernel, this difficulty can be handled by using different starting points or grid methods. For systems with complicated dynamics, it is suggested to apply linear multiple kernels [17], since the corresponding marginal likelihood maximization is a difference-of-convex programming problem and a stationary point can be found efficiently using sequential convex optimization techniques, e.g., [48, 87].

In this chapter, we only considered single-input single-output linear systems with white measurement noise. For multiple-input single-output linear systems, it is natural to use multi-input impulse response models and then assume that the overall system has a block diagonal kernel [73]. The regularization method can also be extended to handle linear systems with colored noise, e.g., ARMAX models. One can exploit the fact that such systems can be approximated arbitrarily well by finite-order ARX models [57]. The problem thus becomes a special case of multiple-input single-output systems where the regressors also contain past outputs [71]. This will also be illustrated in Chap. 9.

In practice, the data could be contaminated by outliers due to failures in the measurement or transmission equipment, e.g., [56, Chap. 15]. In the presence of outliers, robust statistics suggests modelling the noise with heavy-tailed distributions instead of the commonly used Gaussian, e.g., [49]. For regularization methods in the RKHS framework, the key difficulty is that the hyperparameter estimation criteria and the regularized estimate may not have closed-form expressions. Several methods have been proposed to overcome this difficulty. In particular, an expectation maximization (EM) method was proposed in [10] and further improved in [55] by exploiting a variational expectation method.

Input design is an important issue in classical system identification and many results have been obtained, e.g., [38, 45, 47, 56]. For regularized system identification in the RKHS framework, some results have been reported recently. The first result was given in [37], where the mutual information between the output and the impulse response was chosen as the input design criterion. Unfortunately, obtaining the optimal input involves the solution of a nonconvex optimization problem. Differently from [37], [65] adopts scalar measures of the Bayesian mean square error as input design criteria, proposing a two-step procedure to find the globally optimal input through convex optimization.

Concerning the construction of uncertainty regions around the dynamic system estimates, approaches are available which return bounds that, beyond being nonasymptotic, are also exact, i.e., with the desired inclusion probability. This requires some assumptions on data generation, like the introduction of prior distributions on the impulse response. An important example, already widely discussed in this book, is the use of a Bayesian framework that interprets regularization as Gaussian regression [83]. The posterior density becomes available in closed form and Bayes intervals can be easily obtained. Another approach to compute bounds for linear regression is the sign-perturbed sums (SPS) technique [30]. Following a randomization principle, it builds guaranteed uncertainty regions for deterministic parametric models in a quasi-distribution-free setup [11, 12]. Recently, there have been notable extensions to the class of models that SPS can handle. The first line of thought still sees the unknown parameters as deterministic but introduces regularization, see [29, 70, 89] and also [31], which is a first attempt to move beyond the strictly parametric nature of SPS. A second line of thought allows for the exploitation of some form of prior knowledge at a more fundamental probabilistic level [13, 70].

Finally, the interested reader is referred to the survey [73] for more references; see also [25, 58].

## **7.7 Appendix**

## *7.7.1 Derivation of the First-Order Stable Spline Norm*

We will exploit a representation of the RKHS induced by the first-order discrete-time stable spline kernel given by a linear transformation of the space $\ell_2$ of squared-summable sequences. This has some connections with the relationship between squared-summable function spaces and RKHSs discussed in Remark 6.2, even if no spectral decomposition of the kernel will be needed below.

Let $\mathcal{H}$ be the RKHS induced by the stable spline kernel (7.15) with elements denoted by $g = \{g_t\}_{t=1}^{+\infty}$. We will see that any $g \in \mathcal{H}$ can be written as

$$g\_t = \sum\_{j=1}^{\infty} \psi\_{tj} w\_j, \qquad w \in \ell\_{2}, \tag{7.95}$$

where the scalars $\{\psi_{tj}\}$ define the linear operator mapping $\ell_2$ into $\mathcal{H}$. By adopting the notation of ordinary algebra to handle infinite-dimensional objects, one can see $g$ as an infinite-dimensional column vector. In addition, (7.95) can be rewritten as $g = \Psi w$, where $\Psi$ is an infinite-dimensional matrix with $(t,j)$-entry given by $\psi_{tj}$. We will now obtain the expression of $\Psi$. Let

$$\Lambda = \text{diag}\{\lambda\_1, \lambda\_2, \lambda\_3, \dots\}, \qquad \lambda\_t = \alpha^t - \alpha^{t+1}$$

$$A\_\ell = [\nu^1 \ \nu^2 \ \nu^3 \ \cdots], \qquad \nu^t = \sum\_{j=1}^t e^j,$$

where $e^j$ is the infinite-dimensional column vector with all null elements except the $j$th entry, which is equal to one. Let also

$$\Psi = A\_\ell \Lambda^{1/2}.\tag{7.96}$$

The inverse $\Psi^{-1}$ of $\Psi$ acts as follows: given a sequence $g$, it maps $g$ into

$$\Psi^{-1}g = \begin{bmatrix} \frac{1}{\sqrt{\alpha-\alpha^2}}(g\_1 - g\_2) \\ \frac{1}{\sqrt{\alpha^2 - \alpha^3}}(g\_2 - g\_3) \\ \vdots \end{bmatrix} . \tag{7.97}$$

Then, given Ψ in (7.96), we will show that the space

$$\mathcal{H} = \left\{ \Psi w \, \middle| \, w \in \ell\_2 \right\},\tag{7.98}$$

with inner product given by

$$
\langle f, \mathbf{g} \rangle\_{\mathcal{H}} = \langle \Psi^{-1} f, \Psi^{-1} \mathbf{g} \rangle\_2,\tag{7.99}
$$

is the RKHS induced by the stable spline kernel. First, it is easy to see that the null space of $\Psi$ contains only the null vector. Then, since $\ell_2$ is a Hilbert space, one obtains that $\mathcal{H}$ is a Hilbert space. We can now exploit Theorem 6.2, i.e., the Moore–Aronszajn theorem, to prove that it is also the desired RKHS. To obtain this, the two conditions described below have to be checked.

The first condition says that any kernel section must belong to the space $\mathcal{H}$ in (7.98). Thanks to the algebraic view, we can see the stable spline kernel $K$ as an infinite-dimensional matrix. Hence, the kernel sections are the infinite-dimensional columns of $K$ and, in particular, we use $K_t$ to indicate the $t$th column. Now, one has to assess that $\|K_t\|_{\mathcal{H}}^2 < \infty \ \forall t$. Note that

$$\Psi^{-1}K\_{t} = \begin{bmatrix} 0\\ \vdots\\ 0\\ \sqrt{\alpha^{t} - \alpha^{t+1}}\\ \sqrt{\alpha^{t+1} - \alpha^{t+2}}\\ \vdots \end{bmatrix} \leftarrow t\text{th row.}\tag{7.100}$$

Then, we have

$$\begin{aligned} \langle K\_t, K\_t \rangle\_{\mathcal{H}} &= \langle \Psi^{-1} K\_t, \Psi^{-1} K\_t \rangle\_2 \\ &= \sum\_{j=t}^{+\infty} (\alpha^j - \alpha^{j+1}) = \alpha^t < \infty, \end{aligned}$$

and the first condition is so satisfied.

The second condition is the reproducing property, i.e., one has to assess that

$$\langle K\_t, \mathbf{g} \rangle\_{\mathcal{H}} = \mathbf{g}\_t \quad \forall \mathbf{g} \in \mathcal{H}, \ \forall t.$$

This holds true since

$$\begin{aligned} \langle K\_t, \mathbf{g} \rangle\_{\mathcal{H}} &= \langle \Psi^{-1} K\_t, \Psi^{-1} \mathbf{g} \rangle\_2 \\ &= \sum\_{j=t}^{+\infty} (\mathbf{g}\_j - \mathbf{g}\_{j+1}) = \mathbf{g}\_t, \end{aligned}$$

showing that the second condition is also satisfied.

Using (7.99), one has

298 7 Regularization in Reproducing Kernel Hilbert Spaces …

$$\|\mathbf{g}\|\_{\mathcal{H}}^2 = \langle \Psi^{-1}\mathbf{g}, \Psi^{-1}\mathbf{g} \rangle\_2 = \sum\_{t=1}^{\infty} \frac{(\mathbf{g}\_{t+1} - \mathbf{g}\_t)^2}{(1 - \alpha)\alpha^t}$$

and this confirms the norm's structure reported in (7.16).
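The construction above can be verified numerically on a truncated version of $\Psi$; the decay rate, the truncation order $n$ and the test sequence below are arbitrary illustrative choices, and the checks are approximate only because of the discarded tail.

```python
import numpy as np

alpha, n = 0.8, 60          # kernel decay rate and truncation order (illustrative)
t = np.arange(1, n + 1)

Lam = np.diag(alpha ** t - alpha ** (t + 1))   # Lambda with lambda_t = alpha^t - alpha^(t+1)
A = (t[:, None] <= t[None, :]).astype(float)   # A_l: column t equals nu^t = e^1 + ... + e^t
Psi = A @ np.sqrt(Lam)                         # Psi = A_l Lambda^(1/2), cf. (7.96)

# The transformation reproduces the stable spline kernel: Psi Psi^T = K, K(s,t) = alpha^max(s,t)
K = alpha ** np.maximum(t[:, None], t[None, :])
print(np.abs(Psi @ Psi.T - K).max())           # ~ alpha^(n+1), vanishing as n grows

# Check of the norm: <Psi^{-1} g, Psi^{-1} g>_2 matches the stable spline norm structure
g = 0.6 ** t                                   # test sequence decaying faster than sqrt(alpha)
w = np.linalg.solve(Psi, g)                    # w = Psi^{-1} g, cf. (7.97)
norm2_formula = sum((g[k + 1] - g[k]) ** 2 / ((1 - alpha) * alpha ** (k + 1))
                    for k in range(n - 1)) + g[-1] ** 2 / ((1 - alpha) * alpha ** n)
print(w @ w, norm2_formula)                    # the two values agree
```

The maximum deviation between $\Psi\Psi^T$ and $K$ equals the truncation tail $\alpha^{n+1}$, so in the limit $n \to \infty$ the transformation (7.96) generates exactly the RKHS induced by the kernel.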

## *7.7.2 Proof of Proposition 7.1*

We will exploit the results on estimation of Gaussian vectors reported in Sect. 4.2.2.

Let Cov[*u*, *v*] denote the covariance matrix of two random vectors *u* and *v*, i.e.,

$$\text{Cov}[u, v] := \mathcal{E}[(u - \mathcal{E}[u])(v - \mathcal{E}[v])^T].$$

First, we consider the distribution of $Y$. Note that $L_i[g^0]$ is a linear functional of the stochastic process $g^0$. Hence, since linear transformations of normal processes preserve Gaussianity, the noise-free output $[L_1[g^0], \ldots, L_N[g^0]]^T$ is a multivariate zero-mean Gaussian random vector. Furthermore, since

$$\text{Cov}(L\_i[\mathbf{g}^0], L\_j[\mathbf{g}^0]) = \lambda L\_i[L\_j[K]],$$

the covariance matrix of $[L_1[g^0], \ldots, L_N[g^0]]^T$, apart from the scale factor $\lambda$, is indeed defined by the output kernel matrix $O$ reported in (7.14) for the discrete-time case, i.e., when $X = \mathbb{N}$, and in (7.22) for the continuous-time case, i.e., when $X = \mathbb{R}^+$. Now, recall that the $e(t)$, $t = 1, \ldots, N$, are assumed to be mutually independent and Gaussian with mean zero and variance $\sigma^2$. Moreover, they are also assumed independent of $g^0$. One then obtains that $g^0$ and $Y$ are jointly Gaussian, with the mean and covariance matrix of $Y$ given by

$$\mathcal{E}[Y] = 0, \quad \text{Cov}(Y, Y) = \lambda O + \sigma^2 I\_N.$$

As for the covariance of $g^0(x)$ and $Y$, the independence assumptions imply that

$$\text{Cov}(\mathcal{g}^0(\mathbf{x}), Y) = \lambda [L\_1[K\_x], \dots, L\_N[K\_x]].$$

Then, using also the correspondence $\gamma = \sigma^2/\lambda$, we have

$$\begin{aligned} \mathcal{E}\left[g^0(\mathbf{x})|Y\right] &= \lambda [L\_1[K\_x] \dots L\_N[K\_x]] \left(\lambda O + \sigma^2 I\_N\right)^{-1} Y \\ &= \left[L\_1[K\_x] \dots L\_N[K\_x]\right] \left(O + \gamma I\_N\right)^{-1} Y \\ &= \sum\_{t=1}^N \hat{c}\_t L\_t[K\_x], \end{aligned}$$

where $\hat{c}_t$ is the $t$th entry of the vector $\hat{c}$ defined in (7.13) for the continuous-time case or in (7.21) for the discrete-time case. This completes the proof.
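The chain of identities above can be verified numerically. In the sketch below, the sampling functionals $L_i[g] = g(t_i)$ are one admissible choice of linear functionals, and the kernel, the scale factor $\lambda$, the noise level and the observation grid are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam, sigma = 0.8, 2.0, 0.1      # kernel decay, scale factor lambda, noise std (illustrative)
gamma = sigma ** 2 / lam               # correspondence gamma = sigma^2 / lambda

m = 80
t = np.arange(1, m + 1)
K = alpha ** np.maximum(t[:, None], t[None, :])   # stable spline kernel on a grid

# Sampling functionals L_i[g] = g(t_i): O(i,j) = L_i[L_j[K]] = K(t_i, t_j)
idx = np.arange(0, m, 4)               # observed time instants (illustrative)
O = K[np.ix_(idx, idx)]                # output kernel matrix
L_Kx = K[:, idx]                       # [L_1[K_x], ..., L_N[K_x]] for every x on the grid

# One realization of g0 ~ N(0, lam*K), observed with Gaussian noise of variance sigma^2
g0 = np.linalg.cholesky(lam * K + 1e-10 * np.eye(m)) @ rng.standard_normal(m)
Y = g0[idx] + sigma * rng.standard_normal(idx.size)

# Posterior mean, Bayesian form: lam * L[K_x] (lam*O + sigma^2 I)^{-1} Y ...
post_bayes = lam * L_Kx @ np.linalg.solve(lam * O + sigma ** 2 * np.eye(idx.size), Y)
# ... equals the regularized form sum_t c_hat_t L_t[K_x] with c_hat = (O + gamma I)^{-1} Y
c_hat = np.linalg.solve(O + gamma * np.eye(idx.size), Y)
post_reg = L_Kx @ c_hat
print(np.abs(post_bayes - post_reg).max())        # ~ machine precision
```

The two expressions coincide thanks to the matrix identity $\lambda(\lambda O + \sigma^2 I_N)^{-1} = (O + \gamma I_N)^{-1}$ exploited in the proof.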

## *7.7.3 Proof of Theorem 7.5*

We only consider the proof for the discrete-time case (7.56). The continuous-time case (7.57) can be proved in a similar way. To prove (7.56), we first need a lemma.

**Lemma 7.1** *Consider the linear operator $L_K$ defined by*

$$L\_K[l](\cdot) = \sum\_{t=1}^{\infty} K(\cdot, t) l\_t,\tag{7.101}$$

*where $K: \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ is a positive semidefinite kernel. Assume that $L_K$ satisfies the following property: for any $l \in \ell_\infty$, one has $L_K[l] \in \ell_1$. Then, $L_K$ is a continuous (bounded) linear operator, i.e., there exists a scalar $b > 0$, independent of $l$, such that*

$$\|L\_K[l]\|\_1 \le b \|l\|\_\infty, \ \forall l \in \ell\_\infty. \tag{7.102}$$

*Proof* First, we show that for any $s \in \mathbb{N}$, the kernel section $K_s(\cdot)$ belongs to $\ell_1$. To show this, for any $s \in \mathbb{N}$, we can define a sequence $l \in \ell_\infty$ in the following way:

$$l\_t = \begin{cases} 1 & \text{if } K(\mathbf{s}, t) \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$

Then plugging this $l$ into (7.101) yields $L_K[l](s) = \sum_{t=1}^{\infty} |K(s,t)|$. Since $L_K[l] \in \ell_1$ for every $l \in \ell_\infty$, we obtain

$$\sum\_{t=1}^{\infty} |K(s, t)| < \infty, \quad \forall s \in \mathbb{N}. \tag{7.103}$$

Now, for any $l, a \in \ell_\infty$, it holds that

$$|L\_K[l](\mathbf{s}) - L\_K[a](\mathbf{s})| = \left| \sum\_{t=1}^{\infty} K(\mathbf{s}, t)(l\_t - a\_t) \right| \le \|l - a\|\_{\infty} \sum\_{t=1}^{\infty} |K(\mathbf{s}, t)|,\tag{7.104}$$

where both $\|l - a\|_\infty$ and $\sum_{t=1}^{\infty} |K(s,t)|$ are finite for any $s \in \mathbb{N}$, since $l, a \in \ell_\infty$ and in view of (7.103). Following (7.104), the remaining proof is a simple application of the closed graph theorem, see Theorem 6.26. In fact, let $l \to a$ in $\ell_\infty$ and $L_K[l] \to g$ in $\ell_1$. Then (7.104) shows that $L_K[l](s) \to L_K[a](s)$ for every $s \in \mathbb{N}$, implying that $g_s = L_K[a](s)$ for every $s \in \mathbb{N}$. As a result, the graph $(l, L_K[l])$ is closed and thus $L_K$ is continuous (bounded) by the closed graph theorem. $\square$

Now let us consider (7.56) in Theorem 7.5. We first prove the sufficiency part, i.e.,

$$\sum\_{s=1}^{\infty} \left| \sum\_{t=1}^{\infty} K(s, t) l\_t \right| < \infty, \quad \forall l \in \ell\_{\infty} \implies \mathcal{H} \subset \ell\_1.$$

We start by introducing some definitions. For any $f \in \mathcal{H}$, we let $l \in \ell_\infty$ be a sequence defined by the signs of $f$, i.e.,

$$l\_t = \begin{cases} 1 & \text{if } f\_t \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

and let also $l^n$ be a sequence defined by

$$l\_t^n = \begin{cases} l\_t & \text{for } t = 1, \dots, n \\ 0 & \text{otherwise.} \end{cases}$$

Then we have

$$\sum\_{t=1}^{n} |f\_t| = \sum\_{t=1}^{\infty} f\_t l\_t^n = \sum\_{t=1}^{\infty} \langle f(\cdot), l\_t^n K\_t(\cdot) \rangle\_{\mathcal{H}},$$

where the last identity is due to the reproducing property of *K*. Moreover, by the Cauchy–Schwarz inequality, we have

$$\sum\_{t=1}^{n} |f\_t| \le \|f\|\_{\mathcal{H}} \left\| \sum\_{t=1}^{\infty} l\_t^n K\_t(\cdot) \right\|\_{\mathcal{H}} \,. \tag{7.105}$$

Now we show that $\left\| \sum_{t=1}^{\infty} l_t^n K_t(\cdot) \right\|_{\mathcal{H}}$ is finite. First, we note that

$$\begin{aligned} \left\| \sum\_{t=1}^{\infty} l\_t^n K\_t(\cdot) \right\|\_{\mathcal{H}}^2 &= \left\langle \sum\_{s=1}^{\infty} l\_s^n K\_s(\cdot), \sum\_{t=1}^{\infty} l\_t^n K\_t(\cdot) \right\rangle\_{\mathcal{H}} \\ &= \sum\_{s=1}^{\infty} \left( \sum\_{t=1}^{\infty} l\_t^n K(s,t) \right) l\_s^n \\ &\leq \sum\_{s=1}^{\infty} \left| \sum\_{t=1}^{\infty} l\_t^n K(s,t) \right| \| l^n \|\_{\infty}, \end{aligned}$$

and then from the linear operator $L_K$ defined in (7.101) and its boundedness property (7.102) proved in Lemma 7.1, we obtain


$$\left\| \sum\_{t=1}^{\infty} l\_t^n K\_t(\cdot) \right\|\_{\mathcal{H}}^2 \le \| L\_K[l^n] \|\_1 \| l^n \|\_{\infty} \le b \| l^n \|\_{\infty}^2 = b,$$

where we have used the fact that $\|l^n\|_\infty = 1$ for any $n \in \mathbb{N}$. Noting the above equation and (7.105) yields

$$\sum\_{t=1}^{n} |f\_t| \le \|f\|\_{\mathcal{H}} \sqrt{b}, \quad \forall n \in \mathbb{N}.$$

Since $f \in \mathcal{H}$, the norm $\|f\|_{\mathcal{H}}$ is finite and $\sum_{t=1}^{n} |f_t|$ is bounded above for any $n \in \mathbb{N}$. Further note that the partial sums $\sum_{t=1}^{n} |f_t|$ form an increasing sequence which is bounded above; therefore, by the monotone convergence theorem, the limit $\lim_{n\to\infty} \sum_{t=1}^{n} |f_t|$ exists, is denoted by $\sum_{t=1}^{\infty} |f_t|$, and shows that $f \in \ell_1$. Since $f$ was chosen arbitrarily, this implies $\mathcal{H} \subset \ell_1$ and thus completes the proof of the sufficiency part.

Now, we prove the necessity part, i.e.,

$$\mathcal{H} \subset \ell\_1 \implies \sum\_{s=1}^{\infty} \left| \sum\_{t=1}^{\infty} l\_t K(s, t) \right| < \infty \quad \forall l \in \ell\_{\infty}.$$

Again, we start by introducing some definitions. For any $f \in \mathcal{H}$ and $l \in \ell_\infty$, we define a new sequence $lf$ by letting $[lf]_t = l_t f_t$, $\forall t \in \mathbb{N}$, where $[lf]_t$ is the $t$th entry of the sequence $lf$. Then we have $lf \in \ell_1$, because $l \in \ell_\infty$ and $f \in \ell_1$ due to $\mathcal{H} \subset \ell_1$. Moreover, we define $g^n(\cdot) = \sum_{t=1}^{n} l_t K_t(\cdot)$ with $n \in \mathbb{N}$. Now we show that the sequence of functions $g^n(\cdot)$, $n \in \mathbb{N}$, is a *weak Cauchy sequence* in $\mathcal{H}$. To show this, we take without loss of generality $m \le n$ with $m \in \mathbb{N}$, and then we have

$$\mathbf{g}^{n}(\cdot) - \mathbf{g}^{m}(\cdot) = \sum\_{t=m+1}^{n} l\_{t} K\_{t}(\cdot). \tag{7.106}$$

Moreover, we have

$$\langle \mathbf{g}^{n}(\cdot) - \mathbf{g}^{m}(\cdot), f(\cdot) \rangle\_{\mathcal{H}} = \left\langle \sum\_{t=m+1}^{n} l\_t K\_t(\cdot), f(\cdot) \right\rangle\_{\mathcal{H}} = \sum\_{t=m+1}^{n} l\_t f\_t, \ \forall f \in \mathcal{H}.$$

Since $lf \in \ell_1$, i.e., $\sum_{t=1}^{\infty} |l_t f_t| < \infty$, the Cauchy criterion ensures that

$$\lim\_{m, n \to \infty} \sum\_{t=m+1}^n |l\_t f\_t| = 0,\tag{7.107}$$

which implies


$$\lim\_{m,n \to \infty} \sum\_{t=m+1}^{n} l\_t f\_t = 0.$$

Noting the above equation and (7.106) yields that the sequence of functions $g^n(\cdot) = \sum_{t=1}^{n} l_t K_t(\cdot)$, $n \in \mathbb{N}$, is a weak Cauchy sequence. Recall that every Hilbert space, beyond being complete, is also *weakly sequentially complete*, because every Hilbert space is reflexive; see Definition 2.5.23 along with Corollaries 2.8.10 and 2.8.11 in [62]. Hence, the sequence $g^n(\cdot)$, $n \in \mathbb{N}$, is also a *weakly convergent sequence*, i.e., there exists an $h \in \mathcal{H}$ such that

$$\lim\_{n \to \infty} \langle \mathbf{g}^n(\cdot), f(\cdot) \rangle\_{\mathcal{H}} = \langle h(\cdot), f(\cdot) \rangle\_{\mathcal{H}}, \ \forall f \in \mathcal{H}.$$

Now, we take $f(\cdot) = K_s(\cdot)$ in the above equation. Using the reproducing property of $K$, the left-hand side becomes

$$\lim\_{n \to \infty} \langle \mathbf{g}^n(\cdot), K\_s(\cdot) \rangle\_{\mathcal{H}} = \sum\_{t=1}^\infty l\_t K(s, t),$$

while the right-hand side becomes

$$\langle h(\cdot), K\_s(\cdot) \rangle\_{\mathcal{H}} = h(s).$$

This implies that

$$\sum\_{t=1}^{\infty} l\_t K(\mathbf{s}, t) = h(\mathbf{s}) \quad \forall \mathbf{s} \in \mathbb{N}.$$

Finally, note that $h \in \mathcal{H} \subset \ell_1$, therefore

$$\sum\_{s=1}^{\infty} \left| \sum\_{t=1}^{\infty} l\_t K(s, t) \right| < \infty, \ \forall l \in \ell\_{\infty},$$

which also completes the necessity part and, hence, concludes the proof.
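The condition of Theorem 7.5 can be probed numerically on truncated sums. Since the first-order stable spline kernel is nonnegative, the maximizing sign sequence is $l \equiv 1$, and for $\alpha = 1/2$ the double sum equals $\sum_m (2m-1)(1/2)^m = 3$, because the number of pairs with $\max(s,t) = m$ is $2m-1$. The comparison kernel $K(s,t) = \min(s,t)$ and the truncation orders are illustrative choices, not taken from the text.

```python
import numpy as np

def worst_case_sum(kernel, n):
    # Truncated version of sum_s |sum_t K(s,t) l_t| with l_t = 1,
    # the maximizing sign sequence when the kernel is nonnegative
    s = np.arange(1, n + 1, dtype=float)
    K = kernel(s[:, None], s[None, :])
    return np.abs(K.sum(axis=1)).sum()

tc = lambda s, t: 0.5 ** np.maximum(s, t)   # stable spline kernel with alpha = 1/2
mn = lambda s, t: np.minimum(s, t)          # comparison kernel K(s,t) = min(s,t), not stable

# For the stable spline kernel the sum converges: sum_m (2m-1) 0.5^m = 3
print(worst_case_sum(tc, 50), worst_case_sum(tc, 200))   # both ~ 3
# For min(s,t) the truncated sums diverge as n grows
print(worst_case_sum(mn, 50), worst_case_sum(mn, 200))
```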

## *7.7.4 Proof of Theorem 7.7*

First, it is useful to set up some notation. Let *r* be an integer or *r* = ∞. Then, we define the set *U<sup>r</sup>* as follows:

$$\mathcal{W}\_r := \{ \mathbf{x} \in \mathbb{R}^r : \mathbf{x}(i) = \pm 1, \forall \, i = 1, \ldots, r \}. \tag{7.108}$$

Let $p$ be another integer associated with the odd number $m = 2p + 1$ and with $n = 2^m$. We also use $x_i \in \mathcal{U}_m$, with $i = 1, 2, \ldots, n$, to indicate the distinct vectors containing exactly $m$ elements $\pm 1$ (their ordering is irrelevant). Then, for any $n = 2^3, 2^5, 2^7, \ldots$, the $n \times m$ matrix $V^{(n)}$ is given by

$$V^{(n)} = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}^T \tag{7.109}$$

and its rows contain all the possible permutations of ±1. We now discuss the inclusions stated in the theorem.

The inclusion $\mathcal{S}_1 \subseteq \mathcal{S}_s$ derives from Corollary 7.1, where we have seen that absolute summability is a sufficient condition for kernel stability. The proof of the strict inclusion $\mathcal{S}_1 \subset \mathcal{S}_s$ is not trivial and is reported in [7], where one can find a particular kernel, function of the matrices $V^{(n)}$ in (7.109), that is stable but not absolutely summable.

For what concerns the inclusion $\mathcal{S}_s \subset \mathcal{S}_{ft}$, let $M_m$ denote a positive semidefinite matrix of size $m \times m$. Consider also the linear operator $M_m : \mathbb{R}^m \to \mathbb{R}^m$ with domain and codomain equipped, respectively, with the $\ell_\infty$ and the $\ell_1$ norms. Its operator norm is then given by

$$\|M_m\|_{\infty,1} := \max_{\|u\|_\infty = 1} \|M_m u\|_1 = \max_{x \in \mathcal{U}_m} \|M_m x\|_1, \tag{7.110}$$

where the last equality follows from the so-called Bauer's maximum principle for convex functions. First, we prove that

$$\text{trace}(M\_m) \le \|M\_m\|\_{\infty, 1} \le n \text{ trace}(M\_m). \tag{7.111}$$

For this aim, since $V^{(n)T}$ contains all the vectors in $\mathcal{U}_m$ as columns, the problem is equivalent to evaluating

$$M\_m V^{(n)T}$$

and finding the column with maximum $\ell_1$ norm. The $\ell_1$ norm of each column can be obtained as the scalar product of the column with a suitable $x \in \mathcal{U}_m$ containing the signs of the column entries. Hence, the $n^2$ entries of

$$V^{(n)}M\_m V^{(n)T}$$

surely contain these $n$ $\ell_1$ norms. Furthermore, the maximum $\ell_1$ norm which needs to be found is the maximum of all these $n^2$ entries, since $x_1^T c \le x_2^T c$, $\forall x_1 \in \mathcal{U}_m$, if $x_2 = \text{sign}(c)$, where the function sign returns, for each entry of $c$, the value 1 if such entry is larger than zero and $-1$ otherwise. Also, since $V^{(n)} M_m V^{(n)T}$ is positive semidefinite, the maximum is found along its diagonal, i.e.,

$$\|M_m\|_{\infty,1} = \max_{i=1,\ldots,n} \left[ V^{(n)} M_m V^{(n)T} \right]_{ii}.$$

We now note that the trace of $V^{(n)} M_m V^{(n)T}$ satisfies

$$\text{trace}[V^{(n)}M\_mV^{(n)T}] \ge \|M\_m\|\_{\infty,1} \ge \frac{1}{n} \text{trace}[V^{(n)}M\_mV^{(n)T}].$$

Finally,

$$\begin{aligned} \text{trace}[V^{(n)}M\_mV^{(n)T}] &= \text{trace}[M\_mV^{(n)T}V^{(n)}]\\ &= \text{trace}[M\_m(nI\_m)] = n\,\,\text{trace}[M\_m] \end{aligned} $$

and this proves (7.111).
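The two bounds in (7.111) can be checked numerically for small $m$ by enumerating all sign vectors in $\mathcal{U}_m$; the following is a minimal sketch (assuming NumPy; the matrix size and seed are arbitrary choices):

```python
import itertools
import numpy as np

def norm_inf_1(M):
    """Operator norm of M from (R^m, inf-norm) to (R^m, 1-norm).

    By Bauer's maximum principle the convex map u -> ||M u||_1 attains
    its maximum over the unit inf-ball at a sign vector in U_m, so it
    suffices to enumerate all 2^m sign vectors.
    """
    m = M.shape[0]
    return max(np.abs(M @ np.array(x)).sum()
               for x in itertools.product([-1.0, 1.0], repeat=m))

# random positive semidefinite M_m with odd m = 2p + 1
rng = np.random.default_rng(0)
m = 5
A = rng.standard_normal((m, m))
M = A @ A.T  # psd by construction

n = 2 ** m
val = norm_inf_1(M)
# trace(M) <= ||M||_{inf,1} <= n * trace(M), as stated in (7.111)
assert np.trace(M) <= val <= n * np.trace(M)
print(np.trace(M), val)
```

The enumeration mirrors the role of $V^{(n)}$ in the proof: the maximizer is one of the $2^m$ rows of $V^{(n)}$.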

Now, think of $M_k$ as the $k \times k$ submatrix of the stable kernel represented by the infinite-dimensional matrix $K$. We also use $L_K$ to denote the associated kernel operator mapping $\ell_\infty$ into $\ell_1$. So, it holds that

$$\|M_k\|_{\infty,1} \le \|L_K\|_{\infty,1} < +\infty, \quad \forall k = 1, 2, \ldots,$$

where $\|L_K\|_{\infty,1}$ indicates the operator norm of $L_K$, i.e.,

$$\|L_K\|_{\infty,1} = \max_{x \in \mathcal{U}_\infty} \|Kx\|_1.$$

Using (7.111), we obtain

$$\text{trace}[M_k] \le \|L_K\|_{\infty,1}, \quad \forall k = 1, 2, \ldots$$

and, since $\text{trace}[M_k]$ is a monotone non-decreasing sequence upper bounded by $\|L_K\|_{\infty,1} < +\infty$, one also has

$$\sum\_{i} |K\_{ii}| \le \|L\_K\|\_{\infty, 1} < +\infty.$$

This shows that the trace of any stable kernel is finite. Such inclusion is strict, as the following example shows. Let $v$ be a vector such that $v \in \ell_2$ and $v \notin \ell_1$. Consider the kernel

$$K = vv^T.$$

One has $\text{trace}(K) = \|v\|_2^2 < +\infty$. If $w = \text{sign}(v) \in \ell_\infty$, one has $Kw = v \|v\|_1$ and this implies $\|Kw\|_1 = \|v\|_1^2 = \infty$. So, the kernel $K$ has finite trace but is unstable.
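The counterexample can be illustrated numerically by truncating the choice $v(k) = 1/k$, which belongs to $\ell_2$ but not to $\ell_1$; a sketch assuming NumPy, with the truncation lengths chosen arbitrarily:

```python
import numpy as np

def truncated_quantities(n):
    """Truncate v(k) = 1/k, k = 1..n, with v in l2 but not in l1."""
    v = 1.0 / np.arange(1.0, n + 1.0)
    K = np.outer(v, v)           # kernel K = v v^T
    w = np.sign(v)               # bounded input w = sign(v)
    return np.trace(K), np.abs(K @ w).sum()

tr_small, out_small = truncated_quantities(100)
tr_big, out_big = truncated_quantities(10000)

# the trace sum(1/k^2) converges (to pi^2/6) ...
assert abs(tr_big - np.pi ** 2 / 6) < 1e-3
# ... while ||K w||_1 = (sum 1/k)^2 keeps growing without bound
assert out_big > 3 * out_small
```

As the truncation length grows, the trace stabilizes while the output $\ell_1$ norm diverges like the squared harmonic sum.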

The inclusion $\mathcal{S}_{ft} \subset \mathcal{S}_2$ relies on the important relation between nuclear and Hilbert–Schmidt (HS) operators, e.g., see [35, 54, 84]. In particular, let $K$ be a kernel, seen as an infinite-dimensional matrix, and let $L_K$ be the induced kernel operator as a map from $\ell_2$ into $\ell_2$ itself. Given any orthonormal basis $\{v_i\}$ in $\ell_2$, the nuclear norm of $L_K$ is


$$\sum\_{i=1}^{\infty} \langle \mathbf{v}\_i, K \mathbf{v}\_i \rangle\_2,\tag{7.112}$$

and is independent of the chosen basis. Then, $L_K$ is said to be nuclear if (7.112) is finite. Its (squared) Hilbert–Schmidt (HS) norm is instead

$$\sum\_{i=1}^{\infty} \|K\nu\_i\|\_2^2 \tag{7.113}$$

and is also independent of the chosen basis. Then, $L_K$ is said to be HS if (7.113) is finite. It is also known that any nuclear operator is HS and can be written as the composition of two HS operators.

For our purposes, we now exploit the fact that any finite-trace kernel induces a nuclear operator, as shown in [8]. So, one also has that (7.113) is finite and, choosing as $\{v_i\}$ the canonical basis $\{e_i\}$ of $\ell_2$, one obtains

$$\sum\_{i=1}^{\infty} \|Ke\_i\|\_2^2 = \sum\_{ij} K\_{ij}^2 < \infty. \tag{7.114}$$

Such inclusion is also strict as illustrated via the example

$$K = \text{diag}\{1, 1/2, 1/3, \dots, 1/k, \dots\}.$$

Indeed, such a kernel belongs to $\mathcal{S}_2$ since $\sum_k 1/k^2 < \infty$, while its trace $\sum_k 1/k$ diverges. Finally, $\mathcal{S}_2$ is contained in the set of all the positive semidefinite infinite-dimensional matrices. Furthermore, the inclusion is strict: this can be seen by just considering the example $K = vv^T$, where $v$ is the infinite-dimensional column vector with all components equal to 1.

## *7.7.5 Proof of Theorem 7.9*

The notation $L_K$ is still used to denote the operator induced by the kernel $K$ and mapping $\ell_\infty$ into $\ell_1$. Its operator norm is $\|L_K\|_{\infty,1}$, while $(\zeta_i, \rho_i)$ are its eigenvalues and eigenvectors, the latter orthogonal in $\ell_2$. From Theorem 7.5 and Lemma 7.1, one has

$$
\mathcal{H} \subset \ell_1 \iff \|L_K\|_{\infty, 1} < +\infty. \tag{7.115}
$$

Since the function

$$f(u) := \|Ku\|_1 = \sum_{i} \left| \sum_{h} K_{ih} u(h) \right|$$

is convex, Bauer's maximum principle ensures that


$$\|L_K\|_{\infty,1} = \sup_{u \in \partial\ell_\infty} f(u) = \sup_{u \in \partial\ell_\infty} \sum_{i} \left| \sum_{h} K_{ih} u(h) \right|, \tag{7.116}$$

where

$$\partial \ell\_{\infty} = \left\{ u \in \ell\_{\infty} \, : \, |u(i)| = 1, \,\,\forall i \ge 1 \right\}.$$

Using the notation of ordinary algebra to deal with infinite-dimensional matrices, we can write $K = UDU^T$, where $D$ is diagonal and contains the eigenvalues $\zeta_i$ of $K$, while the columns of $U$ contain the corresponding eigenvectors $\rho_i$. One has

$$y = Ux, \quad x = DU^T u$$

and, hence,

$$\begin{aligned} x &= \left[ \zeta_1 \langle \rho_1, u \rangle_2 \;\;\; \zeta_2 \langle \rho_2, u \rangle_2 \;\; \ldots \right]^T \\ y &= \zeta_1 \langle \rho_1, u \rangle_2 \, \rho_1 + \zeta_2 \langle \rho_2, u \rangle_2 \, \rho_2 + \ldots \end{aligned}$$

Letting *s*(*u*) = sign(*y*), we obtain

$$h(u) := \|y\|_1 = \sum_{h} \zeta_h \langle \rho_h, u \rangle_2 \langle \rho_h, s(u) \rangle_2.$$

Using (7.116), also noticing that *f* (*u*) = *h*(*u*), this implies

$$\begin{aligned} \|L_K\|_{\infty,1} &= \sup_{u \in \partial\ell_\infty} \sum_h \zeta_h \langle \rho_h, u \rangle_2 \langle \rho_h, s(u) \rangle_2 \\ &= \sup_{u \in \partial\ell_\infty} h(u). \end{aligned}$$

Now, define

$$g(u) := \sum_h \zeta_h \langle \rho_h, u \rangle_2^2, \qquad A := \sup_{u \in \partial\ell_\infty} \sum_h \zeta_h \langle \rho_h, u \rangle_2^2 = \sup_{u \in \partial\ell_\infty} g(u).$$

Exploiting the definition of *s*(*u*), one has

$$h(u) \ge g(u) \implies \|L_K\|_{\infty, 1} \ge A.$$

On the other hand,


$$\begin{aligned} h(u) &= \sum_{h} \zeta_{h} \langle \rho_{h}, u \rangle_{2} \langle \rho_{h}, s(u) \rangle_{2} \\ &= \sum_{h} \left( \sqrt{\zeta_{h}} \langle \rho_{h}, u \rangle_{2} \right) \left( \sqrt{\zeta_{h}} \langle \rho_{h}, s(u) \rangle_{2} \right) \\ &\leq \sqrt{\sum_{h} \zeta_{h} \langle \rho_{h}, u \rangle_{2}^{2}} \sqrt{\sum_{h} \zeta_{h} \langle \rho_{h}, s(u) \rangle_{2}^{2}} \\ &= \sqrt{g(u)} \sqrt{g(s(u))}, \end{aligned}$$

which, since both $u$ and $s(u)$ belong to $\partial\ell_\infty$ and hence $g(u) \le A$ and $g(s(u)) \le A$, implies

$$\|L\_K\|\_{\infty,1} \le A.$$

So, one has

$$\|L_K\|_{\infty,1} = \sup_{u \in \partial\ell_\infty} \sum_h \zeta_h \langle \rho_h, u \rangle_2^2,$$

and this concludes the proof in view of (7.115).

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Regularization for Nonlinear System Identification**

**Abstract** In this chapter we review some basic ideas for nonlinear system identification. This is a complex area with a vast and rich literature. One reason for the richness is that very many parameterizations of the unknown system have been suggested, each with various proposed estimation methods. We will first describe in some detail nonparametric techniques based on Reproducing Kernel Hilbert Space theory and Gaussian regression. The focus will be on the use of regularized least squares, first equipped with the Gaussian or polynomial kernel. Then, we will describe a new kernel able to account for some features of nonlinear dynamic systems, including fading memory concepts. Regularized Volterra models will also be discussed. We will then provide a brief overview of neural and deep networks, hybrid system identification, block-oriented models like Wiener and Hammerstein, and parametric and nonparametric variable selection methods.

## **8.1 Nonlinear System Identification**

In Sect. 2.2, Eq. (2.2), a model of a dynamical system was defined as a predictor function *g* that maps past input–output data

$$Z^{t-1} = \{ y(t-1), u(t-1), y(t-2), u(t-2), \ldots \}$$

to the next output

$$
\hat{y}(t|\theta) = g(t, \theta, Z^{t-1}),
\tag{8.1}
$$

where $\theta$ is a parameter vector that indexes the model. The predictor could possibly also be a nonparametric map belonging to some function class. If $g$ is a nonlinear function of $Z^{t-1}$, the model is nonlinear, and the task of inferring it from all the available measurements contained in the training set $\mathcal{D}_T$ is the task of *Nonlinear System Identification*. This is a complex area with a vast and rich literature. One reason for the richness is that very many parameterizations of $g$ have been suggested, each with various proposed estimation methods, e.g., see the survey [36]. The different parameterizations allow various degrees of prior knowledge about the system to be accounted for, which gives *grey-box models* with different shades of grey: see the section *The Palette of Nonlinear Models* in [36].

© The Author(s) 2022, G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2_8

A typical element of nonlinear models is that somewhere in the structure there can be a *static nonlinearity* present, ζ (*t*) = *h*(η(*t*)). Dealing with static nonlinearities is therefore an essential feature in nonlinear identification. See the sidebar "Static Nonlinearities" in [36], and Sect. 8.5.2 for some brief remarks.

If no prior physical knowledge is available, we have a *black-box model*. Then we need to employ parameterizations for $g$ that are very flexible and can describe any reasonable function with arbitrary accuracy. A typical choice is *neural networks* or *deep nets*; see Sect. 8.5.1 for some comments. Alternatively, one can define $g$ nonparametrically as belonging to a certain (possibly infinite-dimensional) function class. This leads to *kernel methods*, like *regularization networks* and *Gaussian process inference*, treated in the next section.

Both for grey- and black-box models, nonlinear identification is characterized by considerable structural uncertainty. This typically leads to parametric models with many parameters, and regularization will be a natural and useful tool to handle that. This chapter will discuss typical uses of regularization for various tasks in nonlinear system identification.

## **8.2 Kernel-Based Nonlinear System Identification**

Consider the measurements model

$$\mathbf{y}(t\_i) = f^0(\mathbf{x}\_i) + e(t\_i), \quad i = 1, \ldots, N,\tag{8.2}$$

where $y(t_i)$ is the system output at instant $t_i$, corrupted by the noise $e(t_i)$, and $f^0$ is the unknown function to reconstruct. The link with nonlinear system identification is obtained by assuming that $x_i$ contains past input and/or output values, i.e.,

$$x_i = [u_{t_i-1} \; u_{t_i-2} \; \ldots \; u_{t_i-m_u} \; y_{t_i-1} \; y_{t_i-2} \; \ldots \; y_{t_i-m_y}]. \tag{8.3}$$

In this way, the function $f^0$ represents a dynamic system. For the sake of simplicity, let $m = m_u = m_y$, where $m$ will be called the system memory in what follows. Then, if $m < \infty$, a nonlinear ARX (NARX) model is obtained. A nonlinear FIR (NFIR) model is instead obtained when $x_i$ contains only past inputs, i.e.,

$$x_i = [u_{t_i-1} \; u_{t_i-2} \; \ldots \; u_{t_i-m}]. \tag{8.4}$$

Now, with these correspondences, we can assume that our nonlinear predictor belongs to a function class $\mathcal{H}$ given by an RKHS. Then, given the $N$ pairs $\{x_i, y(t_i)\}$, the regularization network


$$\hat{f} = \underset{f \in \mathcal{H}}{\arg\min}\, \sum_{i=1}^{N} (y(t_i) - f(x_i))^2 + \gamma \left\| f \right\|_{\mathcal{H}}^2 \tag{8.5}$$

implements regularized NARX, with $f : \mathbb{R}^{2m} \to \mathbb{R}$, or NFIR, with $f : \mathbb{R}^m \to \mathbb{R}$.

To obtain the estimate $\hat{f}$ we can now exploit Theorem 6.15, i.e., the representer theorem. Since we focus on quadratic loss functions, the results in Sect. 6.5.1 ensure that our system estimate $\hat{f}$ not only exists and is unique but is also available in closed form. In particular, let $Y = [y(t_1), \ldots, y(t_N)]^T$ and let $\mathbf{K} \in \mathbb{R}^{N \times N}$ be the kernel matrix such that $\mathbf{K}_{ij} = K(x_i, x_j)$. The nonlinear system estimate is then the sum of the $N$ kernel sections centred on the $x_i$, i.e.,

$$
\hat{f} = \sum\_{i=1}^{N} \hat{c}\_i K\_{x\_i} \tag{8.6}
$$

with coefficients $\hat{c}_i$ contained in the vector

$$
\hat{c} = (\mathbf{K} + \gamma I_N)^{-1} Y,
\tag{8.7}
$$

with $I_N$ the $N \times N$ identity matrix.
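As an illustration, the closed-form estimate (8.6)–(8.7) takes only a few lines of code. The sketch below assumes NumPy; the data-generating system, the lags and all hyperparameter values are invented for illustration and are not taken from the book:

```python
import numpy as np

def gauss_kernel(X, A, rho):
    """Gaussian kernel K(x, a) = exp(-||x - a||^2 / rho), row-wise."""
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho)

def fit(X, Y, gamma, rho):
    """Coefficients c_hat = (K + gamma I_N)^(-1) Y of (8.7)."""
    K = gauss_kernel(X, X, rho)
    return np.linalg.solve(K + gamma * np.eye(len(Y)), Y)

def predict(Xnew, X, c_hat, rho):
    """Expansion (8.6): f_hat(x) = sum_i c_hat_i K(x, x_i)."""
    return gauss_kernel(Xnew, X, rho) @ c_hat

# hypothetical NFIR data with memory m = 2 (toy system, not from the book)
rng = np.random.default_rng(1)
u = rng.standard_normal(202)
X = np.column_stack([u[1:-1], u[:-2]])      # regressors built from lagged inputs
f0 = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2   # unknown nonlinear map
Y = f0 + 0.1 * rng.standard_normal(len(f0))

c_hat = fit(X, Y, gamma=0.1, rho=10.0)
Yhat = predict(X, X, c_hat, rho=10.0)
print(np.mean((Yhat - f0) ** 2))
```

Note that only an $N \times N$ linear system is solved, regardless of the dimension of the regressors.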

For future developments, in the remaining part of this section it is useful to cast the connection between regularization in RKHS and Bayesian estimation in this nonlinear setting. Some strategies for hyperparameter tuning will also be recalled.

## *8.2.1 Connection with Bayesian Estimation of Gaussian Random Fields*

First, we recall an important result obtained in the linear setting in Sect. 7.1.4. The starting point was the measurements model

$$\mathbf{y}(t\_i) = L\_i[\mathbf{g}^0] + e(t\_i), \quad i = 1, \ldots, N,$$

with $g^0$ denoting the system impulse response and $L_i[g^0]$ representing the convolution between $g^0$ and the input, evaluated at $t_i$. Proposition 7.1 stated that, if $\mathcal{H}$ is the RKHS induced by a kernel $K$, then

$$\hat{g} = \underset{g \in \mathcal{H}}{\arg\min}\, \sum_{i=1}^{N} \left( y(t_i) - L_i[g] \right)^2 + \gamma \left\| g \right\|_{\mathcal{H}}^2$$

is the minimum variance impulse response estimator when the noise $e$ is white and Gaussian while $g^0$ is a zero-mean Gaussian process (independent of $e$) of covariance proportional to $K$, i.e.,

$$
\mathbb{E}\big(g^0(t) g^0(s)\big) \propto K(t, s).
$$

So, the choice of $K$ ensures that the probability is concentrated on our expected impulse responses. For instance, in previous chapters we have seen that the TC/stable spline class describes time courses that are smooth and exponentially decaying, with a level established by some hyperparameters. A very simple approach to understand the prior ideas introduced in the model is to simulate some curves that will thus represent some of our candidate impulse responses. As an example, some realizations from the discrete-time TC kernel (7.15), given by $K(i, j) = \alpha^{\max(i, j)}$ with $\alpha = 0.9$, are reported in the left panels of Fig. 8.1.

Consider the nonlinear scenario with measurements model given by (8.2) and input locations containing past inputs and outputs. The fundamental difference w.r.t. the linear setting is that the unknown function $f^0$ now directly represents the nonlinear input–output relationship. The connection with Bayesian estimation is obtained by thinking of $f^0$ as a nonlinear stochastic surface, in particular a zero-mean Gaussian random field. This is a generalization of a stochastic process over general domains: for any set of input locations $\{x_i^*\}_{i=1}^p$, the vector $[f^0(x_1^*) \ \ldots \ f^0(x_p^*)]$ is jointly Gaussian. In particular, the covariance of such a vector is assumed to be proportional to the kernel matrix $\mathbf{K}$ whose $(i, j)$-entry is $\mathbf{K}_{ij} = K(x_i^*, x_j^*)$. This corresponds to saying that $f^0$ is a zero-mean Gaussian random field with covariance $\lambda K$, with $\lambda$ a positive scalar, independent of the white Gaussian noises $e(t_i)$ of variance $\sigma^2$. Then,

$$\hat{f} = \underset{f \in \mathcal{H}}{\arg\min}\, \sum_{i=1}^{N} (y(t_i) - f(x_i))^2 + \gamma \left\| f \right\|_{\mathcal{H}}^2, \quad \gamma = \frac{\sigma^2}{\lambda}$$

turns out to be the minimum variance estimator of the nonlinear system $f^0$. In this stochastic scenario, our model assumptions can be better understood by simulating some nonlinear surfaces from the prior. They will represent some of our candidate nonlinear systems. As an example, some realizations from the Gaussian kernel (6.43), given by $K(x, a) = \exp(-\|x - a\|^2/\rho)$ with $\rho = 1000$, are reported in the right panels of Fig. 8.1. It is apparent that such a covariance includes just information on the smoothness of the input–output map, i.e., the fact that similar inputs should produce similar system outputs.
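Realizations like those in Fig. 8.1 can be obtained by sampling a zero-mean Gaussian vector whose covariance is the kernel evaluated on a grid; a sketch assuming NumPy, where the grid sizes and the jitter added for numerical stability are arbitrary choices:

```python
import numpy as np

def sample_gp(K, rng, jitter=1e-6):
    """One zero-mean Gaussian sample with covariance K (Cholesky + jitter)."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))

rng = np.random.default_rng(0)

# TC kernel K(i, j) = 0.9^max(i, j): a candidate impulse response
n = 50
idx = np.arange(1, n + 1)
g = sample_gp(0.9 ** np.maximum.outer(idx, idx), rng)

# Gaussian kernel exp(-||x - y||^2 / 1000) over a grid of m = 2
# regressors: a candidate nonlinear NFIR surface
grid = np.linspace(-10.0, 10.0, 15)
X = np.array([[a, b] for a in grid for b in grid])
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
f = sample_gp(np.exp(-d2 / 1000.0), rng)

print(g.shape, f.shape)
```

Plotting `g` against the time index, and `f` over the grid, reproduces the qualitative behaviour of the two panels: decaying curves on the left, smooth surfaces on the right.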

## *8.2.2 Kernel Tuning*

As already discussed, e.g., in Sect. 3.5, even when the structure of a kernel is assigned, the estimator (8.5) typically contains unknown parameters that have to be determined from data. For example, if the Gaussian kernel $\exp(-\|x - a\|^2/\rho)$ is adopted, the unknown hyperparameter vector $\eta$ will contain the regularization parameter $\gamma$, the kernel width $\rho$ and possibly also the system memory $m$. We now briefly discuss estimation of $\eta$, just pointing out some natural connections with the techniques illustrated in Sect. 7.2 in the linear scenario.

**Fig. 8.1** *Left panels*: realizations of a zero-mean Gaussian process modelling discrete-time impulse response candidates, drawn by using the TC kernel $0.9^{\max(i,j)}$ as covariance. *Right panels*: realizations of a zero-mean Gaussian surface (random field) representing nonlinear system candidates, in particular NFIR models with memory $m = 2$ in (8.4), drawn by using the Gaussian kernel $\exp(-\|x - y\|^2/1000)$ as covariance

An important observation is that, when a quadratic loss is adopted, even in the nonlinear setting the estimator (8.5) leads to predictors linear in the data $Y$. In addition, since we assume data generated according to (8.2), direct noisy measurements of $f$ are available. Hence, the output kernel matrix $O$ used in Sect. 7.2 just reduces to the kernel matrix $\mathbf{K}$ computed over the $x_i$ where data are collected. In fact, from (8.6) and (8.7) one can see that the predictions $\hat{y}_i$, i.e., the estimates of the $f^0(x_i)$, are the components of $\mathbf{K}\hat{c}$. So, they are collected in the vector

$$
\hat{Y}(\eta) = \mathbf{K}(\eta)(\mathbf{K}(\eta) + \gamma I_N)^{-1} Y. \tag{8.8}
$$

Now, consider techniques like SURE and GCV that see $f^0$ as a deterministic function, so that the randomness in $Y$ derives only from the output noise. Exploiting the same line of discussion reported in Sects. 3.5.2 and 3.5.3 (see also Sect. 7.2), from (8.8) we see that the influence matrix is given by $\mathbf{K}(\eta)(\mathbf{K}(\eta) + \gamma I_N)^{-1}$. Hence, the degrees of freedom are

$$\text{dof}(\eta) = \text{trace}\big(\mathbf{K}(\eta)(\mathbf{K}(\eta) + \gamma I_N)^{-1}\big). \tag{8.9}$$

Then, the SURE estimate of η is obtained by minimizing the following unbiased estimator of the prediction risk

$$\hat{\eta} = \underset{\eta \in \Gamma}{\text{arg min }} \frac{1}{N} \|Y - \hat{Y}(\eta)\|^2 + 2\sigma^2 \frac{\text{dof}(\eta)}{N} \tag{8.10}$$

while the GCV estimate is

$$\hat{\eta} = \operatorname\*{arg\,min}\_{\eta \in \Gamma} \frac{\|Y - \hat{Y}(\eta)\|^2}{(1 - \operatorname{dof}(\eta)/N)^2},\tag{8.11}$$

where we have used Γ to denote the optimization domain.
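The quantities (8.9)–(8.11) are straightforward to compute once the kernel matrix is formed; a minimal sketch assuming NumPy, on toy data invented for illustration:

```python
import numpy as np

def dof(K, gamma):
    """Degrees of freedom (8.9): trace of the influence matrix."""
    N = K.shape[0]
    return np.trace(K @ np.linalg.inv(K + gamma * np.eye(N)))

def sure_score(Y, K, gamma, sigma2):
    """Unbiased prediction-risk estimate minimized in (8.10)."""
    N = len(Y)
    Yhat = K @ np.linalg.solve(K + gamma * np.eye(N), Y)
    return np.sum((Y - Yhat) ** 2) / N + 2 * sigma2 * dof(K, gamma) / N

def gcv_score(Y, K, gamma):
    """GCV objective minimized in (8.11)."""
    N = len(Y)
    Yhat = K @ np.linalg.solve(K + gamma * np.eye(N), Y)
    return np.sum((Y - Yhat) ** 2) / (1 - dof(K, gamma) / N) ** 2

# toy kernel matrix and data, invented for illustration
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 5.0)
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

# a larger gamma shrinks the equivalent number of parameters
assert dof(K, 10.0) < dof(K, 0.1)
print(sure_score(Y, K, 0.1, sigma2=0.01), gcv_score(Y, K, 0.1))
```

In practice either score would be evaluated over a grid (or optimized numerically) jointly in all the components of $\eta$, including those entering $\mathbf{K}(\eta)$.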

If we instead consider the Bayesian framework discussed in the previous subsection, we see $f^0$ as a zero-mean Gaussian random field of covariance $\lambda K$, with $\lambda$ a positive scale factor, independent of the white Gaussian noise of variance $\sigma^2$. Since $y(t_i) = f^0(x_i) + e(t_i)$, following the same reasoning developed in the finite-dimensional context in Sect. 4.4, one obtains that the vector $Y$ is zero-mean Gaussian, i.e.,

$$Y \sim \mathcal{N}(0, Z(\eta))$$

with covariance matrix

$$Z(\eta) = \lambda \mathbf{K}(\eta) + \sigma^2 I\_N.$$

Above, the vector $\eta$ could, e.g., contain $\lambda$, $\sigma^2$, $m$ and also other parameters entering $K$. Then, we easily obtain that its marginal likelihood estimate is

$$\hat{\eta} = \arg\min\_{\eta \in \Gamma} \ Y^T Z(\eta)^{-1} Y + \log \det(Z(\eta)). \tag{8.12}$$
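The objective in (8.12) can be evaluated stably through a Cholesky factorization of $Z(\eta)$ and then minimized, e.g., over a grid; a sketch assuming NumPy, with toy data and grids chosen arbitrarily:

```python
import numpy as np

def neg_log_marglik(Y, K, lam, sigma2):
    """Objective of (8.12) with Z = lam*K + sigma2*I, via Cholesky."""
    Z = lam * K + sigma2 * np.eye(len(Y))
    L = np.linalg.cholesky(Z)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    # Y^T Z^{-1} Y + log det(Z)
    return float(Y @ alpha) + 2.0 * np.sum(np.log(np.diag(L)))

# toy data drawn from the model itself (lam = 1, sigma2 = 0.05)
rng = np.random.default_rng(3)
N = 60
x = np.linspace(0.0, 1.0, N)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
Lc = np.linalg.cholesky(K + 1e-8 * np.eye(N))
Y = Lc @ rng.standard_normal(N) + np.sqrt(0.05) * rng.standard_normal(N)

# crude grid search over eta = (lam, sigma2)
grid = [0.01, 0.05, 0.1, 0.5, 1.0]
best = min(((l, s) for l in grid for s in grid),
           key=lambda p: neg_log_marglik(Y, K, p[0], p[1]))
print(best)
```

The Cholesky route avoids forming $Z(\eta)^{-1}$ explicitly and yields the log-determinant for free from the diagonal of the factor.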

## **8.3 Kernels for Nonlinear System Identification**

In the previous section we have cast the kernel-based estimator (8.5) in the framework of nonlinear system identification. We have also provided its Bayesian interpretation and recalled how to estimate the hyperparameter vector $\eta$ when the parametric form of $K$ is assigned. But the crucial question is now the regularization design. This is a fundamental issue, initially discussed in Sect. 3.4.2, which in this setting consists of choosing a kernel structure suited to model nonlinear dynamic systems. Two interesting options come from the machine learning literature. The first one is the (already mentioned) Gaussian kernel

$$K(x, a) = e^{-\frac{\|x - a\|^2}{\rho}}$$

that can describe input–output relationships just known to be smooth. We have also seen in Sect. 6.6.5 that this model is infinite dimensional, i.e., its induced RKHS cannot be spanned by a finite number of basis functions. It is also universal, being dense in the space of all continuous functions defined on any compact subset of the regressors' domain. These appear attractive features when little information on the system dynamics is available.

A second alternative is the polynomial kernel

$$K(x, a) = (\langle x, a \rangle_2 + 1)^p, \quad p \in \mathbb{N}, \tag{8.13}$$

where $\langle \cdot, \cdot \rangle_2$ is the classical Euclidean inner product. In the NFIR case, with the input locations $x_i \in \mathbb{R}^m$ given in (8.4), such a kernel has a fundamental connection with the Volterra representations of nonlinear systems, see, e.g., [35]. In fact, we know from Sect. 6.6.4 that the induced RKHS is not universal but has dimension $\binom{m+p}{p}$ and contains all possible monomials up to the $p$th degree. Hence, the polynomial kernel implicitly encodes truncated discrete Volterra series of the desired order. It avoids the curse of dimensionality since the possibly large number of coefficients need not be computed explicitly, thanks to the monomials' encoding. In fact, from (8.7) one can see that the estimation complexity, even if cubic in the number $N$ of output data, turns out to be linear in the system memory $m$ and independent of the degree $p$ of nonlinearity.
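Both the basis-function counts and the implicit monomial encoding can be verified directly; a sketch assuming NumPy, where the $m = p = 2$ feature map is written out by hand from the multinomial expansion:

```python
from math import comb, sqrt

import numpy as np

def poly_kernel(x, a, p):
    """Polynomial kernel (8.13): (<x, a>_2 + 1)^p, an O(m) computation."""
    return (np.dot(x, a) + 1.0) ** p

# dimension of the RKHS induced by (8.13): all monomials up to degree p
print(comb(6 + 3, 3))    # e.g., m = 6 and p = 3 give 84 monomials
print(comb(80 + 3, 3))   # m = 80 gives a far larger implicit basis

# kernel trick check for m = 2, p = 2: (x1*a1 + x2*a2 + 1)^2 equals
# <phi(x), phi(a)> with suitably weighted monomial features
def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, sqrt(2) * v[0] * v[1],
                     sqrt(2) * v[0], sqrt(2) * v[1], 1.0])

rng = np.random.default_rng(0)
x, a = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose(poly_kernel(x, a, 2), phi(x) @ phi(a))
```

Evaluating the kernel thus touches only $m$ products per pair of regressors, while the equivalent explicit regression would carry $\binom{m+p}{p}$ coefficients.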

## *8.3.1 A Numerical Example*

We will consider a numerical example where the Gaussian and the polynomial kernel are used to estimate a nonlinear dynamic system from input–output data.

Consider the NFIR

$$\begin{aligned} f^{0}(x_t) = &\left(\sum_{i=1}^{80} g_{i}^{0} u_{t-i}\right) - u_{t-2}u_{t-3} - 0.25u_{t-4}^{2} + 0.25u_{t-1}u_{t-2} \\ &+ 0.75u_{t-3}^{3} + 0.5\left(u_{t-1}^{2} + u_{t-1}u_{t-3} + u_{t-2}u_{t-4}\right) \end{aligned} \tag{8.14}$$

with nonlinearities taken from [40], while the coefficients $g_i^0$ are reported in Fig. 8.2. The inputs are independent Gaussian random variables of variance 4. The measurements model is that reported in (8.2), with the noise $e$ white and Gaussian of variance 4 and independent of $u$. Such a system is strongly nonlinear: the contribution of the linear part (defined by the $g_i^0$) to the output variance is around 12% of the overall variance.

We generate 2000 input–output pairs and display them in Fig. 8.3. The first 1000 pairs $\{u_k, y_k\}_{k=1}^{1000}$ are the identification data, while the other 1000, $\{u_k, y_k\}_{k=1001}^{2000}$, are the test set. They are used to assess the performance of an estimator in terms of the prediction fit

$$100\left(1-\left[\frac{\sum\_{k=1001}^{2000}|\mathbf{y}\_k-\hat{\mathbf{y}}\_k|^2}{\sum\_{k=1001}^{2000}|\mathbf{y}\_k-\bar{\mathbf{y}}|^2}\right]^{\frac{1}{2}}\right),\quad\bar{\mathbf{y}}=\frac{1}{1000}\sum\_{k=1001}^{2000}y\_k,\tag{8.15}$$

where the $\hat{y}_k$ are the predictions returned by a certain estimator assuming null initial conditions, i.e., computed by using only $\{u_k\}_{k=1001}^{2000}$ and setting to zero the inputs falling outside the test set.
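The fit measure (8.15) is easy to implement; a minimal sketch assuming NumPy:

```python
import numpy as np

def prediction_fit(y, yhat):
    """Percentage prediction fit (8.15) computed on the test set."""
    num = np.sum(np.abs(y - yhat) ** 2)
    den = np.sum(np.abs(y - np.mean(y)) ** 2)
    return 100.0 * (1.0 - np.sqrt(num / den))

y = np.array([1.0, 2.0, 3.0, 4.0])
print(prediction_fit(y, y))                     # perfect predictor
print(prediction_fit(y, np.full(4, y.mean())))  # trivial mean predictor
```

A perfect predictor scores 100, while predicting the test-set mean scores 0, so the measure quantifies the improvement over the trivial constant predictor.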

First, consider the estimator (8.5) equipped with either the Gaussian or the polynomial kernel with input locations

$$x_i = [u_{t_i-1} \; u_{t_i-2} \; \ldots \; u_{t_i-m}],$$

**Fig. 8.3** Input and output data generated by the nonlinear system (8.14). The first 1000 couples (black line) are used as identification data while the other 1000 (red) are the test set used to assess the prediction performance of a model

where the system memory $m$ is seen as a hyperparameter to be estimated from data. Specifically, when using the Gaussian kernel

$$K(x,a) = e^{\frac{-\|x-a\|^2}{\rho}}$$

the estimator depends on the unknown hyperparameter vector

$$
\eta = [m \;\; \gamma \;\; \rho],
$$

where *m* is the system memory, γ is the regularization parameter and ρ is the kernel width. Instead, when using the polynomial kernel

$$K(x, a) = (\langle x, a \rangle_2 + 1)^p, \quad p \in \mathbb{N},$$

we have

$$
\eta = [m \;\; \gamma \;\; p],
$$

where, in place of $\rho$, the third unknown hyperparameter is the polynomial order $p$. In both cases, we estimate $\eta$ by using an oracle. In particular, for a given $\eta$, the estimator (8.5) determines $\hat{f}$ by using only the identification data, but the oracle has access to the test set and selects the hyperparameter vector that maximizes the prediction fit (8.15). Note that this calibration is computationally quite expensive. In fact, one has to introduce a grid to account for the discrete nature of the system memory $m$. The polynomial kernel also requires the introduction of another grid for the polynomial order $p$.

**Fig. 8.4** Test set data (red line), extracted from the last 1000 outputs visible in the right panel of Fig. 8.3, and predictions returned by (8.5) equipped with the Gaussian kernel (top panel, black) and the polynomial kernel (bottom panel, black). The estimators use the first 1000 input–output pairs in Fig. 8.3 as training data, with hyperparameter vector $\eta$ tuned by an oracle that maximizes the test set fit

Figure 8.4 reports some test set data (red line) extracted from the last 1000 outputs displayed in the right panel of Fig. 8.3. When adopting the Gaussian kernel, the oracle chooses $m = 4$. When using the polynomial kernel, it selects $m = 6$ and sets the polynomial order to $p = 3$. The top panel of Fig. 8.4 shows the predictions returned by the oracle-based Gaussian kernel (black line). The prediction fit is not so large, equal to 69.6%. The bottom panel instead plots results from the oracle-based polynomial kernel (black line). The prediction capability increases to 73.5% but does not appear so satisfactory. Figure 8.5 also reports the MATLAB boxplots of 100 prediction fits returned by the two kernel-based estimators after a Monte Carlo study. At each of the 100 runs, new realizations of inputs and noises define a new identification and test set. One can see that, on average, the polynomial kernel performs a bit better than the Gaussian kernel, but its mean prediction fit is around 72%.

## *8.3.2 Limitations of the Gaussian and Polynomial Kernel*

From (8.14) one can see that the NFIR order is $m = 80$, while the oracle sets $m = 4$ and $m = 6$ when using, respectively, the Gaussian and the polynomial kernel. This introduces a bias in the estimation process that is clearly visible in the predictions reported in Fig. 8.4. Let us try to understand the reasons for this phenomenon.

**Polynomial kernel** First, consider the polynomial kernel. The oracle chooses the correct polynomial order $p = 3$ to account for the highest-order term $0.75u_{t-3}^3$ present in the system. Such a choice however already defines a complex model, since it includes all the monomials up to order 3. In particular, with $m = 6$ and $p = 3$ the number of adopted basis functions is

$$
\binom{m+p}{p} = \binom{6+3}{3} = 84,
$$

that is quite large considering that 1000 outputs are available. If *m* is increased to 7, one would implicitly use

$$
\binom{m+p}{p} = \binom{7+3}{3} = 120
$$

basis functions. In general, values of *m* larger than 6 are not acceptable for the oracle: even a careful tuning of the regularization parameter γ does not permit good control of the estimator's variance. This is illustrated in Fig. 8.6, which displays the best prediction test set fit obtainable by the oracle as a function of the system memory *m*. The maximum is indeed obtained with *m* = 6.

**Fig. 8.6** Prediction fits by the oracle-based estimator equipped with the polynomial kernel as a function of system memory *m*. The optimal model dimension is achieved for *m* = 6

Instead, the value *m* = 80 leads to a very small fit, around 25%, because it introduces an overly complex model with

$$
\binom{m+p}{p} = \binom{80+3}{3} = 91881
$$

monomials.
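These counts are easy to reproduce; a minimal sketch in Python (the helper name is ours) uses the closed-form binomial count of monomials of total degree at most *p* in *m* variables:

```python
from math import comb

def n_monomials(m, p):
    # number of monomials of total degree <= p in m variables,
    # i.e. the number of basis functions implied by the polynomial kernel
    return comb(m + p, p)

print(n_monomials(6, 3))    # 84
print(n_monomials(7, 3))    # 120
print(n_monomials(80, 3))   # 91881
```

The cubic growth in *m* makes clear why the oracle cannot afford large memories with this kernel.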

Another reason why the polynomial kernel cannot control the model variance well is the way it (implicitly) regularizes the monomial coefficients. We describe this point through a simple example. A quadratic polynomial kernel is considered, but similar considerations still hold for larger degrees. Let *p* = 2, $x = [u_{t-1} \ \ldots \ u_{t-m}]$ and $a = [u_{\tau-1} \ \ldots \ u_{\tau-m}]$. Exploiting the multinomial theorem, one obtains

$$\begin{aligned} K(x,a) &= \Big(\sum\_{i=1}^m u\_{t-i}u\_{\tau-i} + 1\Big)^2 \\ &= \sum\_{i=1}^m u\_{t-i}^2 u\_{\tau-i}^2 + 2\sum\_{i=2}^m \sum\_{j=1}^{i-1} (u\_{t-i}u\_{t-j})(u\_{\tau-i}u\_{\tau-j}) + 2\sum\_{i=1}^m u\_{t-i}u\_{\tau-i} + 1. \end{aligned}$$

This defines the following diagonalized version of the quadratic polynomial kernel

$$K(x, a) = \sum\_{i} \zeta\_i \rho\_i(x) \rho\_i(a),$$

where the ρ*i*(*x*) are all the monomials up to degree 2 contained in the following vector

**Fig. 8.7** Some realizations from a zero-mean stochastic process with covariance given by a Gaussian kernel (left panel) and by a Gaussian plus linear kernel (right)

$$\left\{u\_{t-m}^2, \ldots, u\_{t-1}^2,\; u\_{t-m}u\_{t-m+1}, \ldots, u\_{t-m}u\_{t-1},\; u\_{t-m+1}u\_{t-m+2}, \ldots, u\_{t-m+1}u\_{t-1}, \ldots, u\_{t-2}u\_{t-1},\; u\_{t-m}, \ldots, u\_{t-1},\; 1\right\},$$

with the corresponding $\zeta_i$ given by

$$\left\{\underbrace{1, \ldots, 1}\_{\text{squares}},\; \underbrace{2, \ldots, 2}\_{\text{cross terms}},\; \underbrace{2, \ldots, 2}\_{\text{linear terms}},\; 1\right\}.$$

According to the RKHS theory described in Sect. 6.3, for any *f* in the RKHS *H* induced by such kernel one has

$$f(a) = \sum\_{i} c\_{i} \rho\_{i}(a), \quad \left\| f \right\|\_{\mathcal{H}}^{2} = \sum\_{i} \frac{c\_{i}^{2}}{\zeta\_{i}},$$

where all the eigenvalues $\zeta_i$ assume value 1 or 2 (most of them are equal to 2). Hence, one can see that the regularizer $\|f\|_{\mathcal{H}}^2$ does not incorporate any fading memory concept typical of dynamic systems. In fact, the two coefficients of the monomials $\{u_{t-m}^2, u_{t-1}^2\}$, or those of the couple $\{u_{t-m}u_{t-m+1}, u_{t-2}u_{t-1}\}$, are assigned the same penalty. But, similarly to the linear case, one should instead expect that inputs $u_{t-i}$ have less influence on $y_t$ as the positive lag *i* increases.
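The kernel trick underlying this expansion can be verified numerically. The sketch below (pure Python; function names and the test points are ours) checks that the explicit feature map, with the $\zeta_i$ weights entering as square roots, reproduces the quadratic kernel exactly:

```python
import math

def quad_poly_kernel(x, a):
    # inhomogeneous quadratic polynomial kernel (x^T a + 1)^2
    s = sum(xi * ai for xi, ai in zip(x, a))
    return (s + 1.0) ** 2

def feature_map(x):
    # explicit features from the multinomial expansion: squares (zeta = 1),
    # cross-products and linear terms (zeta = 2), constant (zeta = 1);
    # the weights enter as sqrt so that phi(x).phi(a) reproduces the kernel
    m = len(x)
    phi = [x[i] ** 2 for i in range(m)]                  # zeta_i = 1
    phi += [math.sqrt(2.0) * x[i] * x[j]
            for i in range(m) for j in range(i)]         # zeta_i = 2
    phi += [math.sqrt(2.0) * xi for xi in x]             # zeta_i = 2
    phi.append(1.0)                                      # zeta_i = 1
    return phi

x, a = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
implicit = quad_poly_kernel(x, a)
explicit = sum(px * pa for px, pa in zip(feature_map(x), feature_map(a)))
assert abs(implicit - explicit) < 1e-9
```

Note that the feature map has the $\binom{m+2}{2}$ components counted above, while the kernel evaluation costs only O(m): this is exactly the saving the implicit encoding provides.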

**Gaussian kernel** As in the case of the polynomial model, one of the limitations of the Gaussian kernel $K(x, a) = \exp(-\|x - a\|^2/\rho)$ in modelling nonlinear systems is that it does not include any fading memory concept. Hence, the inputs $\{u_{t-1}, u_{t-2}, \ldots, u_{t-m}\}$ included in the input location are all expected to have the same influence on $y_t$. This can be appreciated also through the Bayesian interpretation of regularization, e.g., by inspecting the system realizations generated by the Gaussian kernel reported in the right panels of Fig. 8.1.

**Fig. 8.8** True function (red line), noisy data and regularized estimate returned by (8.5) using a Gaussian kernel $K(u, a) = \exp(-(u-a)^2/500)$ (left panel, black) and a Gaussian plus linear kernel $K(u, a) = \exp(-(u-a)^2/500) + 10ua$ (right, black). The regularization parameter γ is estimated from data via marginal likelihood optimization (8.12)

Still adopting a stochastic viewpoint, another drawback is that the covariance $\exp(-\|x - a\|^2/\rho)$ describes stationary processes; this implies that the variance of $f^0(x)$ does not depend on the input location. This is now illustrated in the one-dimensional case where $x \in \mathbb{R}$ and the kernel models a static nonlinear system $f^0(u)$, i.e., the (noiseless) output *y* depends only on a single input value *u*. The left panel of Fig. 8.7 plots some realizations from $\exp(-(u-a)^2/500)$. They can be poor nonlinear system candidates, since a nonlinear system like that reported in (8.14) often also contains a linear component. For this reason it can be useful to enrich the model with a linear kernel. Its effect can be appreciated by looking at the realizations plotted in the right panel of Fig. 8.7, which are now drawn using $\exp(-(u-a)^2/500) + ua/400$ as covariance.

The fact that the predictive capability of a nonlinear model can improve considerably by adding a linear component can also be understood by considering Theorem 6.15 (representer theorem). Using only a Gaussian kernel, the estimate $\hat{f}$ of the nonlinear system returned by (8.5) is the sum of *N* Gaussian functions centred on the $x_i$. Hence, in the regions where no data are available, the function $\hat{f}$ just decays to zero, and this can lead to poor predictions when, e.g., a linear component is present in the system. This phenomenon is illustrated in the left panel of Fig. 8.8. In this case, the prediction performance can be greatly enhanced by adding a linear kernel, whose results are visible in the right panel of the same figure.
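This decay-to-zero behaviour can be checked directly on the two covariances used for Fig. 8.7 (the data locations and query point below are hypothetical, chosen only for illustration):

```python
import math

def k_gauss(u, a, rho=500.0):
    # Gaussian kernel section centred at a
    return math.exp(-(u - a) ** 2 / rho)

def k_gauss_plus_lin(u, a, rho=500.0):
    # Gaussian plus linear kernel, as in the right panel of Fig. 8.7
    return k_gauss(u, a, rho) + u * a / 400.0

centers = [-50.0, 0.0, 50.0]   # hypothetical input locations x_i
far = 500.0                    # query point far outside the data range

# every Gaussian section is essentially zero at `far`, so the
# representer-theorem estimate sum_i c_i K(far, x_i) decays to zero
assert max(k_gauss(far, a) for a in centers) < 1e-100

# the linear term u*a/400 does not vanish, so a linear trend
# can still be extrapolated outside the data region
assert abs(k_gauss_plus_lin(far, 50.0) - 62.5) < 1e-6
```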

## *8.3.3 Nonlinear Stable Spline Kernel*

We will build a kernel *K* for nonlinear system identification, namely the nonlinear stable spline kernel, by exploiting what has been learnt from the previous example. To simplify exposition, we consider the NFIR case but all the ideas here developed can be immediately extended to NARX models, as discussed at the end of this section.

First, it is useful to define *K* as the sum of a linear and a nonlinear kernel, i.e.,

$$\mathscr{K}(\mathbf{x}\_i, \mathbf{x}\_j) = \lambda\_L \mathbf{x}\_i^T P \mathbf{x}\_j + \lambda\_{NL} K(\mathbf{x}\_i, \mathbf{x}\_j), \tag{8.16}$$

where the input locations are here seen as column vectors, i.e.,

$$\mathbf{x}\_{i} = \begin{bmatrix} u\_{t\_{i}-1} & u\_{t\_{i}-2} & \ldots & u\_{t\_{i}-m} \end{bmatrix}^{T},$$

$P \in \mathbb{R}^{m \times m}$ is a symmetric positive semidefinite matrix that models the impulse response of the system's linear part, while *K* describes the nonlinear dynamics. Note that the two scale factors $\lambda_L$ and $\lambda_{NL}$ are unknown hyperparameters that balance the contributions of the linear and nonlinear part to the output.

As for *P*, such a matrix can be defined by resorting to the class of stable kernels developed in the previous chapters. In particular, using the TC/stable spline kernel, the (*a*, *b*)-entry of *P* is

$$P\_{ab} = \alpha\_L^{\max(a,b)}, \quad 0 \le \alpha\_L < 1, \quad a = 1, \dots, m, \ b = 1, \dots, m,\tag{8.17}$$

where α*<sup>L</sup>* determines the decay rate of the impulse response governing the linear dynamics.

As for *K*, we define it by modifying the classical Gaussian kernel to include fading memory concepts. Following the same ideas underlying the TC kernel, we encode the information that $u_{t-i}$ is expected to have less influence on $y_t$ as *i* increases by defining

$$K(\mathbf{x}\_i, \mathbf{x}\_j) = \exp\left(-\sum\_{k=1}^m \alpha\_{NL}^{k-1} \frac{(u\_{t\_i-k} - u\_{t\_j-k})^2}{\rho}\right), \quad 0 < \alpha\_{NL} \le 1. \tag{8.18}$$

The additional hyperparameter $\alpha_{NL}$ encodes the information that past inputs' influence decays exponentially to zero. To understand how this kernel models the nonlinear surface, and how different values of $\alpha_{NL}$ can describe different system features, we can use the Bayesian interpretation of regularization. In particular, consider an example with *m* = 2, so that the components of $x_i$ are $u_{t_i-1}$ and $u_{t_i-2}$, and let the system $f^0$ be a zero-mean Gaussian random field with covariance given by (8.18) with ρ = 1000. If $\alpha_{NL} = 1$ we recover the Gaussian kernel. Hence, before seeing any data, $u_{t_i-1}$ and $u_{t_i-2}$ are expected to have the same influence on the system output. This can be appreciated by drawing some realizations from such a random field, e.g., see the top panel of Fig. 8.9 (or the right panels of Fig. 8.1).

With $\alpha_{NL}$ very close to zero, the output depends mainly on $u_{t_i-1}$, i.e.,

$$K(\mathbf{x}\_i, \mathbf{x}\_j) \approx \exp\left(-\frac{(u\_{t\_i-1} - u\_{t\_j-1})^2}{\rho}\right).$$

This can be appreciated by looking at the realization in the middle panel of Fig. 8.9, obtained with $\alpha_{NL} = 0.001$. One can see that, for fixed $u_{t_i-1}$, changes in $u_{t_i-2}$ do not produce appreciable variations in the function value. If the value of $\alpha_{NL}$ is now increased, the input value $u_{t_i-2}$ starts playing a role. This is visible in the bottom panel, where the realization is now generated using $\alpha_{NL} = 0.1$.

The nonlinear stable spline kernel also enjoys a computational advantage. Using classical machine learning kernels, like the Gaussian or polynomial ones, the choice of the dimension *m* of the input space is a delicate issue. It requires discrete tuning, as encountered in classical linear system identification to estimate, e.g., the FIR or ARX order, and this can be computationally expensive. In the case of the polynomial kernel, another discrete parameter is the polynomial order *p*, which requires an additional grid. By introducing stability/fading memory hyperparameters, one can instead set *m* to a large value, increasing the flexibility of the estimator. Then, estimation of $\alpha_L$ and $\alpha_{NL}$ from data permits controlling the "effective" dimension of the regressor space in a continuous manner. In light of the continuous nature of the optimization domain, one needs to solve only one optimization problem, involving, e.g., SURE (8.10), GCV (8.11) or Empirical Bayes (8.12).
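A direct evaluation of (8.16)–(8.18) is straightforward. The sketch below (pure Python; function names and the hyperparameter values are ours) also confirms that a small $\alpha_{NL}$ makes the kernel depend essentially only on the most recent input:

```python
import math

def tc_matrix(m, alpha_L):
    # TC/stable spline matrix (8.17): P[a][b] = alpha_L ** max(a, b), lags 1..m
    return [[alpha_L ** max(a, b) for b in range(1, m + 1)]
            for a in range(1, m + 1)]

def nss_kernel(x_i, x_j, lam_L, lam_NL, alpha_L, alpha_NL, rho):
    # nonlinear stable spline kernel (8.16):
    # linear TC part (8.17) plus fading-memory Gaussian part (8.18)
    m = len(x_i)
    P = tc_matrix(m, alpha_L)
    linear = sum(x_i[a] * P[a][b] * x_j[b]
                 for a in range(m) for b in range(m))
    fading = sum(alpha_NL ** k * (x_i[k] - x_j[k]) ** 2 for k in range(m))
    return lam_L * linear + lam_NL * math.exp(-fading / rho)

# with alpha_NL near zero only the most recent input matters:
# x_i and x_j agree on u_{t-1} but differ wildly on u_{t-2}
k = nss_kernel([1.0, 5.0], [1.0, -5.0], lam_L=0.0, lam_NL=1.0,
               alpha_L=0.5, alpha_NL=1e-9, rho=1.0)
assert abs(k - 1.0) < 1e-6
```

Both decay rates are continuous hyperparameters, which is exactly what allows $m$ to be fixed to a large value and the effective memory to be tuned by marginal likelihood.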

Finally, as already mentioned, the extension to NARX models is very simple. Let $x_i = [a_i^T \ b_i^T]^T$ with

$$a\_i = \begin{bmatrix} u\_{t\_i-1} & u\_{t\_i-2} & \ldots & u\_{t\_i-m} \end{bmatrix}^T, \quad b\_i = \begin{bmatrix} y\_{t\_i-1} & y\_{t\_i-2} & \ldots & y\_{t\_i-m} \end{bmatrix}^T.$$

Then, the kernel (8.16) can be modified as follows

$$\mathscr{K}(\mathbf{x}\_i, \mathbf{x}\_j) = \lambda\_a a\_i^T P\_a a\_j + \lambda\_b b\_i^T P\_b b\_j + \lambda\_c K\_c(a\_i, a\_j) K\_d(b\_i, b\_j) \tag{8.19}$$

with the matrices $P_a$ and $P_b$ defined by the TC kernel (8.17), with possibly different decay rates $\alpha_L$, and the nonlinear kernels $K_c$ and $K_d$ defined by (8.18), with possibly different decay rates $\alpha_{NL}$. A possible variation is

$$\mathscr{K}(\mathbf{x}\_{i},\mathbf{x}\_{j}) = \lambda\_{a}a\_{i}^{T}P\_{a}a\_{j} + \lambda\_{b}b\_{i}^{T}P\_{b}b\_{j} + \lambda\_{c}K\_{c}(a\_{i},a\_{j}) + \lambda\_{d}K\_{d}(b\_{i},b\_{j}), \quad (8.20)$$

where the nonlinear dynamics are no longer a product, as in (8.19), but rather a sum of nonlinear functions depending on either past inputs or past outputs. In fact, recall from Theorem 6.6 that sums and products of kernels induce well-defined RKHSs containing, respectively, sums and products of functions belonging to the spaces associated with the single kernels.

**Fig. 8.9** Realizations from a zero-mean Gaussian random field having covariance $\exp\left(-\frac{1}{1000}\left[(u_{t_i-1} - u_{t_j-1})^2 + \alpha_{NL}(u_{t_i-2} - u_{t_j-2})^2\right]\right)$ for three different values of $\alpha_{NL}$


## *8.3.4 Numerical Example Revisited: Use of the Nonlinear Stable Spline Kernel*

Let us now reconsider the numerical example where the nonlinear system (8.14) is used to generate the identification and test data reported in Fig. 8.3. Now, we use the estimator (8.5) equipped with the nonlinear stable spline kernel (8.16). The system memory is set to *m* = 100. Hence, we let $\alpha_L$ and $\alpha_{NL}$ determine from data which past inputs mostly influence the output through the linear and nonlinear system part, respectively. In particular, the hyperparameter vector $\eta = [\lambda_L \ \lambda_{NL} \ \alpha_L \ \alpha_{NL} \ \rho]$ is estimated via marginal likelihood maximization using the 1000 input–output training data.

Figure 8.10 shows the same test set data (red line) reported in Fig. 8.4 and extracted from the last 1000 outputs visible in the right panel of Fig. 8.3. The predictions (black line) returned by the nonlinear stable spline kernel are now very close to the truth. The prediction fit is around 90%. Comparing these results with those in Fig. 8.4, one can see that the prediction performance is much better than that of the Gaussian and polynomial kernels. Recall also that these two estimators tune complexity by using an oracle that is not implementable in practice. Figure 8.11 also plots the MATLAB boxplots of 100 prediction fits returned after a Monte Carlo study of 100 runs by these two oracle-based estimators, already present in Fig. 8.5, and by nonlinear stable spline. One can see that the use of a regularizer accounting for dynamic system features largely improves the prediction fits.

## **8.4 Explicit Regularization of Volterra Models**

In what follows, we use C(*k*, *m*) to indicate the number of ways one can form the nonnegative integer *k* as the sum of *m* nonnegative integers. This is the same problem as distributing *k* objects to *m* groups (some groups may get zero objects). By combinatorial theory we have

$$\mathbf{C}(k,m) = \binom{k+m-1}{m-1}.\tag{8.21}$$
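The closed form (8.21) can be checked against a brute-force enumeration of ordered nonnegative decompositions; a small sketch (pure Python, the function name is ours):

```python
from itertools import product
from math import comb

def C_enum(k, m):
    # brute-force count of ordered m-tuples of nonnegative integers summing to k
    return sum(1 for t in product(range(k + 1), repeat=m) if sum(t) == k)

# agrees with the closed form (8.21) on a small grid
for k in range(6):
    for m in range(1, 5):
        assert C_enum(k, m) == comb(k + m - 1, m - 1)
```

For instance, C(3, 2) = 4 counts the decompositions 0+3, 1+2, 2+1 and 3+0.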

We adopt the model description (8.2) and seek a simple representation for the model *f* (*x*). For notational simplicity, assume that *f* is scalar valued with past inputs only, i.e., (*my* = 0, *mu* = *m*), with input location *x* given by (8.4). A straightforward idea is to mimic a polynomial Taylor expansion

$$f(\mathbf{x}) = \sum\_{k=1}^{p} g\_k \mathbf{x}^k. \tag{8.22}$$

This innocent-looking function expansion is in fact a bit more complex than it looks. The *k*th power of the *m*-row vector *x* is to be interpreted as a C(*k*, *m*)-dimensional column vector, with each element being a monomial of the *m* components *x*(*i*) of *x* whose exponents sum to *k*:

$$\alpha\_r^{(k)} = \mathbf{x}(1)^{\beta(k,1)} \mathbf{x}(2)^{\beta(k,2)} \cdots \mathbf{x}(m)^{\beta(k,m)} \tag{8.23a}$$

$$\beta(k, \ell) \text{ nonnegative, such that } \sum\_{\ell=1}^{m} \beta(k, \ell) = k \tag{8.23b}$$

$$r = 1, 2, \ldots, \mathbb{C}(k, m). \tag{8.23c}$$

In (8.22) *gk* is to be interpreted as a row vector with C(*k*, *m*) elements

$$\mathbf{g}\_k = [\mathbf{g}\_k^{(1)}, \dots, \mathbf{g}\_k^{\mathbf{C}(k,m)}].\tag{8.23d}$$

The response *f* (*x*) is thus made of $d(p, m) = \sum_{k=1}^{p} \mathrm{C}(k, m)$ contributions ("impulse responses") from each of the nonlinear combinations of past inputs

$$\alpha\_{r}^{(k)} = u\_{t-1}^{\beta(k,1)} u\_{t-2}^{\beta(k,2)} \cdots u\_{t-m}^{\beta(k,m)} \tag{8.23e}$$

$$r = 1, \ldots, \mathbb{C}(k, m), \quad k = 1, \ldots, p. \tag{8.23f}$$

This expansion of the model (8.22) is the *Volterra model* discussed, e.g., by [7, 35]. It has *d*(*p*, *m*) parameters. The reader may recognize this as an explicit treatment of the polynomial kernel (8.13) which does not exploit any implicit encoding of basis functions and, hence, does not exploit the kernel trick described in Remark 6.3. This also has some connections with the explicit regularization approaches for linear system identification discussed in Sect. 7.4.4 using, e.g., Laguerre functions.

So, this model has memory length *m* and polynomial order *p*. As *p* → ∞ it follows that *f* (*x*) in (8.22), possibly with the addition of a constant function, can approximate any ("reasonable") function arbitrarily well. This universal approximation property is of course very valuable for black box models and created considerable interest in Volterra models. However, it is easy to see that the number *d*(*p*, *m*) of parameters *gk* increases very rapidly with *m* and *p*, and that high-order polynomials in the observed signals may create numerically ill-conditioned calculations. Hence, Volterra models have not been used much in practical identification problems, except for small values of *m* and *p*.

A remedy for the large number of parameters and ill-conditioned numerics is clearly to use regularization. In [4] it is discussed how to regularize the Volterra model to make it a practical tool. In short, the idea is the following, illustrated for a small example with *p* = 2.

We write the model adding also a scalar $g_0$, which accounts for a constant component in the output, so that one has

$$\mathbf{y}(t) = \mathbf{g}\_0 + \mathbf{g}\_1^T \boldsymbol{\varphi}(t) + \boldsymbol{\varphi}^T(t) G\_2 \boldsymbol{\varphi}(t) \tag{8.24a}$$

$$\boldsymbol{\varphi}^{T}(t) = [u(t-1), u(t-2), \ldots, u(t-m)] \tag{8.24b}$$

$$\mathbf{g}\_1 = \theta\_1 \quad m-\text{dimensional column vector}\tag{8.24c}$$

$$G\_2 \,\, m \times m \,\,\text{symmetric matrix},\tag{8.24d}$$

where the matrix $G_2$ is formed from the second-order coefficients $g_2^{(1)}, g_2^{(2)}, g_2^{(3)}, \ldots$ in the expansion (8.22)–(8.23e).
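For *p* = 2 the regression vector collects the constant, the *m* lagged inputs and the *m*(*m* + 1)/2 distinct products; a sketch of its construction (the function name is ours):

```python
def volterra_regressor(u_past):
    # u_past = [u(t-1), ..., u(t-m)]; regression vector matching (8.24a):
    # constant term g0, linear part theta_1 (m entries) and the
    # upper triangle of the symmetric G_2 (m(m+1)/2 entries)
    m = len(u_past)
    phi = [1.0]
    phi += list(u_past)
    phi += [u_past[i] * u_past[j]
            for i in range(m) for j in range(i, m)]
    return phi

# for m = 3 the parameter count is 1 + 3 + 6 = 10
assert len(volterra_regressor([0.5, -1.0, 2.0])) == 10
```

Only the upper triangle of $G_2$ is kept because $u(t-i)u(t-j) = u(t-j)u(t-i)$, which is what makes $\theta_2$ an $m(m+1)/2$-dimensional vector.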

The regularized estimate can now be formed via the criterion

$$\hat{\boldsymbol{\theta}}^{\mathbb{R}} = \underset{\boldsymbol{\theta}}{\text{arg min}} \; \|\boldsymbol{Y} - \boldsymbol{\Phi}\_N^T \boldsymbol{\theta}\|^2 + \boldsymbol{\theta}^T \boldsymbol{D} \boldsymbol{\theta} \tag{8.25}$$

with

$$\boldsymbol{\theta} = [\mathbf{g}\_0, \boldsymbol{\theta}\_1^T, \boldsymbol{\theta}\_2^T]^T \tag{8.26}$$

and $\theta_2$ is an *m*(*m* + 1)/2-dimensional column vector made up from $G_2$, while *Y* is the vector of observed outputs *y*(*t*), *t* = 1, ..., *N*. The regression matrix $\Phi_N$ is formed from the components of $\varphi(t)$ in the obvious way. It is natural to decompose the regularization matrix accordingly:

$$D = \begin{bmatrix} d\_0 & 0 & 0 \\ 0 & D\_1 & 0 \\ 0 & 0 & D\_2 \end{bmatrix} \tag{8.27}$$

and treat the regularization of the constant term (*d*0), the linear term (*D*1) and the quadratic term (*D*2) in (8.24a) separately. As discussed in Chap. 5, a natural choice of regularization matrices is to let them reflect prior information about the corresponding parameters. This means that *d*<sup>0</sup> can be taken as any suitable scalar. The vector θ<sup>1</sup> for the first-order term describes a regular linear impulse response, and its prior can be taken as, e.g., the DC kernel reported in (5.40), i.e.,

$$P\_1(i,j) = c \cdot e^{-\alpha|i-j|} e^{-\beta\frac{(i+j)}{2}}.\tag{8.28}$$

For the second-order term θ<sup>2</sup> it is natural to treat the second-order nonlinear term in the Volterra expansion as a two-dimensional surface, described by two time indices τ<sup>1</sup> and τ<sup>2</sup>, so that the parameter at (τ1, τ2) is the contribution to the Volterra sum from *u*(*t* − τ1) · *u*(*t* − τ2). This is illustrated in Fig. 8.12. The prior value of this contribution can be formed as the product of two kernels built up from responses in a coordinate system (*U*, *V*) obtained after an orthonormal coordinate transformation, corresponding to a rotation

**Fig. 8.12** Regularization surface for the second-order term in a Regularized Volterra expansion

of 45◦ of the original τ1, τ2-plane:

$$P\_2(i,j) = c\_2 P\_U(i,j) P\_V(i,j) \tag{8.29}$$

$$P\_U(i,j) = e^{-\alpha\_U \left| |U\_i| - |U\_j| \right|} e^{-\beta\_U \frac{|U\_i| + |U\_j|}{2}} \tag{8.30}$$

$$P\_V(i,j) = e^{-\alpha\_V \left| |V\_i| - |V\_j| \right|} e^{-\beta\_V \frac{|V\_i| + |V\_j|}{2}}, \tag{8.31}$$

where $U_i$ and $V_i$ refer to the coordinates in the new system. The corresponding prior distribution is depicted in Fig. 8.12. As desired, it is smooth and decays to zero in all directions. The coordinate change is useful to make the surface smooth across the critical border lines.
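The construction can be sketched as follows (pure Python; the function names and hyperparameter values are illustrative, not those used in [36]):

```python
import math

def rotate45(t1, t2):
    # orthonormal 45-degree rotation of the (tau1, tau2) plane
    s = math.sqrt(0.5)
    return s * (t1 + t2), s * (t1 - t2)

def dc_1d(u, v, alpha, beta):
    # DC-type kernel along one rotated coordinate, cf. (8.30)-(8.31)
    return (math.exp(-alpha * abs(abs(u) - abs(v)))
            * math.exp(-beta * (abs(u) + abs(v)) / 2.0))

def P2(i, j, c2=1.0, aU=0.1, bU=0.1, aV=0.1, bV=0.1):
    # prior covariance between two points of the second-order Volterra
    # surface; i and j are (tau1, tau2) lag pairs
    Ui, Vi = rotate45(*i)
    Uj, Vj = rotate45(*j)
    return c2 * dc_1d(Ui, Uj, aU, bU) * dc_1d(Vi, Vj, aV, bV)

# symmetric in its arguments, and decaying as the lags grow
assert abs(P2((1, 2), (3, 1)) - P2((3, 1), (1, 2))) < 1e-12
assert P2((20, 20), (20, 20)) < P2((1, 1), (1, 1))
```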

This regularization was deployed in [36], section "Example 5(a) Black-Box Volterra Model of the Brain". Quite useful results were obtained with a regularized model with 594 parameters, thanks to the regularization. An extension of the regularized Volterra models, based on a similar idea, is treated in [41], which also provides an EM algorithm to estimate the hyperparameters in the regularization matrices. Another development, where the ideas developed in [4] are coupled with the implicit encoding of kernels, can be found in [8].

## **8.5 Other Examples of Regularization in Nonlinear System Identification**

## *8.5.1 Neural Networks and Deep Learning Models*

There are many other universal approximators for nonlinear systems *f* (*x*) than those based on kernels or on the explicit Volterra model (8.22). The most common ones are various neural network models (NNMs), see, e.g., [12, 23]. They use simple nonlinearities connected in more or less complex networks. The parameters are weights in the connections as well as characterizations of the nonlinearities. Like Volterra models, they are capable of approximating any reasonable system arbitrarily well given sufficiently many parameters. This means that an NNM typically has many parameters. In simple applications there could be hundreds of parameters, but some applications, especially so-called deep models, could have tens of thousands of parameters [18], see also [9, 11, 13, 43] for deep NARX and state-space models. Even if benign overfitting has sometimes been observed for overparametrized models [3, 19, 30], in general regularization is a very important tool also for estimating such models. Hence, many tricks are typically included in the estimation/minimization schemes.

**$\ell_2$, $\ell_1$ penalties** They include the traditional weighted $\ell_2$ and $\ell_1$ norm penalties that we discuss in this book, see, e.g., Sect. 3.6. For example, all estimation algorithms in the *System Identification Toolbox* [22] are equipped with optional weighted $\ell_2$ regularization, also when NNMs are estimated.

**Early termination** It is common to monitor not only the fit to estimation data in the minimization process, but also how well the current model fits a *validation data set*. Then the minimization is terminated when the fit to validation data no longer improves, even when the estimation criterion value keeps improving. This *early termination* technique is in fact equivalent to traditional regularization, as shown in [38].

**Dropout or Dilution** A special technique common in (deep) learning with NNMs is to curb the flexibility of the model by ignoring (dropping) randomly chosen nodes in the network. This is of course a way to prevent the model from overfitting and provides regularization of the estimation just as the other methods in this book, but by a quite different technique. See, e.g., [17, 28] for more details.

## *8.5.2 Static Nonlinearities and Gaussian Process (GP)*

A basic problem in nonlinear system identification is to handle estimation of a static nonlinear function *h*(η) from known observations

$$\{\zeta(t), \eta(t), t = 1, \dots, N\}, \qquad \zeta(t) = h(\eta(t)) + noise.$$

A general way to do this is to apply Gaussian process (GP) estimation [29], see also Sects. 4.9 and 8.2.1. Then *h*(η) is seen as a Gaussian stochastic process with a prior mean (often zero) and a certain prior covariance function *K*(η1, η2). The arguments can range over both a discrete and a continuous domain. After a number of observations *z* = {ζ(*t*), η(*t*), *t* = 1, ..., *N*}, the posterior distribution *p*(*h*(η)|*z*) can be determined for any η. This is, in short, how the function *h* can be estimated. As seen in Sect. 8.2.1, it corresponds to a kernel method with the kernel determined by the prior covariance function *K*(η1, η2).

## *8.5.3 Block-Oriented Models*

A very common family of nonlinear dynamic models is obtained by networks of linear dynamic models *G*(*q*) and nonlinear static functions *h*(*x*), see Fig. 8.13. The simplest and most common ones are the *Hammerstein model y*(*t*) = *G*(*q*)*h*(*u*(*t*)), obtained by passing the input through a static nonlinearity before it enters the linear system, and the *Wiener model z*(*t*) = *G*(*q*)*u*(*t*), *y*(*t*) = *h*(*z*(*t*)), where the output of a linear system is subsequently passed through the nonlinearity. The important contribution [5] has shown that any nonlinear system with fading memory can be approximated by a Wiener model. See also, e.g., [37] for a survey and [42] for a general approach to Hammerstein–Wiener identification allowing coloured noise sources both before and after the nonlinearities (which may be non-invertible).

Traditionally, the nonlinearities have been parametrized, e.g., as piecewise constant or piecewise linear functions, as polynomials or as neural nets. Recently it has become more common to work with nonparametric nonlinearities, typically modelled by the GP approach, and the whole estimation is then treated in a Bayesian setting. For example, in [21] the linear part of a Wiener model is parametrized by state-space matrices *A*, *B* in observer canonical form with suitable priors, and the output nonlinearity *h*(*z*) is a Gaussian process with prior mean *z* ("linear output") and a large and "smooth" prior covariance. To obtain the posterior densities, a particle Gibbs sampler (PMCMC, Particle Markov Chain Monte Carlo) is employed.

In [32] the same approach is used to model the output nonlinearity, but the linear part is written as an impulse response, with a prior of the same type as discussed in Sect. 5.5.1. The whole problem can then be written as

$$\mathbf{y} = \boldsymbol{\varphi}(\boldsymbol{\Phi}\mathbf{g}),\tag{8.32}$$

where *y* is the observed output, ϕ is the output static nonlinearity, *g* is the impulse response of the linear system and Φ is the Toeplitz matrix formed from the input. The problem is then to determine the posterior densities p(ϕ|*y*) and p(*g*|*y*) by Bayesian calculations. In [31] a similar technique is used for estimating Hammerstein models.

## *8.5.4 Hybrid Models*

A common class of nonlinear models is given by *hybrid models* [15, 39]. They change their properties depending on some *regime variable p*(*t*) (which may be formed from the inputs and outputs themselves) [16]. Think of a collection of linear models that describe the system behaviour in different parts of the operating space and automatically shift as the operating point changes. Building a hybrid model involves two steps: (1) find the collection of relevant different models and (2) determine the areas where each model is operative. This is considered quite a difficult problem, and approaches from different areas in control theory have been tested. Here we will comment upon a few ideas that relate to regularized identification.

A basic problem is to decide when a change in system behaviour occurs. This relates to *change detection* and *signal segmentation*. A regularization-based method to segment ARX models was suggested in [25]. The standard way to estimate ARX models can be described as in Chap. 2:

$$\min\_{\theta} \sum\_{t=1}^{N} \|\mathbf{y}(t) - \boldsymbol{\varphi}^T(t)\theta\|^2. \tag{8.33}$$

This gives the average linear model behaviour over the time record *t* ∈ [1 ..., *N*]. To follow momentary changes over time, we could estimate *N* models by

$$\min\_{\boldsymbol{\theta}(t), t=1,\ldots,N} \sum\_{t=1}^{N} \left\| \mathbf{y}(t) - \boldsymbol{\varphi}^T(t)\boldsymbol{\theta}(t) \right\|^2. \tag{8.34}$$

This would give a perfect fit with a pretty useless collection of models. To be more selective in accepting new models, we can add an $\ell_1$ regularization term, discussed in Sect. 3.6, obtaining:

$$\min\_{\boldsymbol{\theta}(t), t=1,\ldots,N} \sum\_{t=1}^{N} \left\| \mathbf{y}(t) - \boldsymbol{\varphi}^T(t)\boldsymbol{\theta}(t) \right\|^2 + \gamma \sum\_{t=2}^{N} \left\| \boldsymbol{\theta}(t) - \boldsymbol{\theta}(t-1) \right\|\_1. \tag{8.35}$$

One could also use $\ell_p$ norms with *p* > 1 as regularizers, but it is crucial that the penalty is a sum of norms and not a sum of squared norms. Then, adopting a suitable value for the regularization parameter γ, the penalty encourages the terms in the second sum to be exactly zero and not just small. This forces the number of different models from (8.35) to be small and thus flags only when essential changes have taken place.
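The mechanics of the penalty in (8.35) can be sketched by evaluating the objective directly (pure Python; data and names are ours): a piecewise-constant parameter trajectory is charged only γ times the $\ell_1$ size of each jump, and a constant one is not charged at all:

```python
def segmentation_objective(thetas, y, phi, gamma):
    # criterion (8.35): per-time-step ARX fit plus an l1 penalty on
    # successive parameter changes, favouring piecewise-constant theta(t)
    fit = sum((yt - sum(p * w for p, w in zip(ph, th))) ** 2
              for yt, ph, th in zip(y, phi, thetas))
    changes = sum(sum(abs(a - b) for a, b in zip(t1, t0))
                  for t0, t1 in zip(thetas, thetas[1:]))
    return fit + gamma * changes

y = [1.0, 2.0, 1.5, 3.0]
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# a constant parameter trajectory incurs no penalty at all ...
const = [[1.0, 2.0]] * 4
assert segmentation_objective(const, y, phi, 10.0) == \
       segmentation_objective(const, y, phi, 0.0)

# ... while a single jump is charged gamma times its l1 size
onejump = [[1.0, 2.0]] * 2 + [[1.5, 2.0]] * 2
diff = segmentation_objective(onejump, y, phi, 10.0) \
     - segmentation_objective(onejump, y, phi, 0.0)
assert abs(diff - 10.0 * 0.5) < 1e-12
```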

This idea is taken further in [24] to build hybrid models of PWA (piecewise affine) character. The starting point is again (8.34), but now the number of models is reduced by looking at all the raw models:

$$\min\_{\boldsymbol{\theta}(t), t=1,\ldots,N} \sum\_{t=1}^{N} \left\| \mathbf{y}(t) - \boldsymbol{\varphi}^T(t)\boldsymbol{\theta}(t) \right\|^2 + \gamma \sum\_{t=1}^{N} \sum\_{s=1}^{N} K(p(t), p(s)) \left\| \boldsymbol{\theta}(t) - \boldsymbol{\theta}(s) \right\|. \tag{8.36}$$

Here *K*(*p*1, *p*2) is a weighting based on the respective regime variables *p*. This gives a number of, say, *d* submodels, and they can then be associated with values of the regime variable by a classification step.

These ideas of segmentation, building a collection of *d* submodels and associating them with particular values of time, are taken to a further degree of sophistication in [27]. The idea there is to build the *hybrid stable spline* (HSS) algorithm, based on a joint use of the TC (stable spline) kernel, see Sect. 5.5.1, for a family of ARX models like (8.34). The classification of the models is built into the algorithm by letting the classification parameters be part of the hyperparameters. An MCMC scheme is employed to handle the nonconvex and combinatorial difficulties of the maximum likelihood criterion.

## *8.5.5 Sparsity and Variable Selection*

In all estimation problems it is essential to find the regressors *xk* (*t*), *k* = 1, ..., *d*, which are best suited for predicting the goal variable *y*(*t*). The variables *xk* can be formed from the observations of the system in many different ways. It is generally desired to find a small collection of regressors, and statistics offers many tools for this: hypothesis testing, projection pursuit [14], manifold learning/dimensionality reduction [10, 26, 34], ANOVA, see, e.g., [20] for applications to nonlinear system identification.

The problem of variable (regressor) selection can be formulated as follows. Given a model with $n$ candidate regressors $\tilde{x}_k(t)$

$$\mathbf{y}(t) = f(\tilde{\mathbf{x}}\_1(t), \dots, \tilde{\mathbf{x}}\_n(t)) + e(t) \tag{8.37}$$

find a subselection or combination of regressors $x_1(t), \ldots, x_d(t)$ that gives the best model of the system. Note that the NARX model (8.3) is a special case of (8.37) with $x_k(t) = [y(t-k), u(t-k)]$. In principle one could try out different subsets of regressors and check how good the produced models are (in cross-validation). In most cases that would mean overwhelmingly many tests.

Instead, the 1-norm regularization discussed in Sect. 3.6.1, leading to the LASSO in (3.105), is a very powerful tool for variable selection and sparsity. In what follows each $\tilde{x}_i(t)$ is scalar and is the $i$th component of the $n$-dimensional vector $x(t)$. Then, for the linearly parametrized model

$$\mathbf{y}(t) = \beta\_1 \tilde{\mathbf{x}}\_1(t) + \dots + \beta\_n \tilde{\mathbf{x}}\_n(t) + e(t), \tag{8.38}$$

the best regressors are found by the criterion

$$\min\_{\mathbf{B}} \sum\_{t=1}^{N} \|\mathbf{y}(t) - \Phi(t)\mathbf{B}\|^2 + \gamma \|\mathbf{B}\|\_1, \tag{8.39}$$

where

$$\mathbf{B} = \begin{bmatrix} \beta\_1, \beta\_2, \dots, \beta\_n \end{bmatrix}^T \tag{8.40}$$

$$\Phi(t) = [\tilde{\mathbf{x}}\_1(t), \tilde{\mathbf{x}}\_2(t), \dots, \tilde{\mathbf{x}}\_n(t)].\tag{8.41}$$
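The criterion (8.39) can be handed to any LASSO solver. As a minimal sketch (not the book's MATLAB tooling), the following numpy-only proximal-gradient (ISTA) routine solves it; the synthetic data, the value of $\gamma$ and the iteration count are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 (elementwise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(Phi, y, gamma, n_iter=5000):
    # Minimize sum_t ||y - Phi B||^2 + gamma * ||B||_1 by proximal gradient.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2   # Lipschitz constant of the gradient
    B = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ B - y)
        B = soft_threshold(B - grad / L, gamma / L)
    return B

# Toy data: only regressors 0 and 2 enter the true model.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 5))
y = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 2] + 0.1 * rng.standard_normal(200)
B_hat = lasso_ista(Phi, y, gamma=20.0)
print(np.nonzero(np.abs(B_hat) > 1e-3)[0])   # indices of the selected regressors
```

The penalty drives the inactive coefficients exactly to zero, which is the variable-selection effect discussed above.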

This idea of using 1-norm regularization was extended to the general model (8.37) in [2]. It is based on estimating the partial derivatives $\beta_k = \partial f / \partial \tilde{x}_k$ in (8.37) analogously to (8.39). In particular, the Taylor expansion of $f(x(t))$ around $x^0$ is

$$f(\mathbf{x}(t)) = f(\mathbf{x}^0) + (\mathbf{x}(t) - \mathbf{x}^0)^T \frac{\partial f}{\partial \tilde{\mathbf{x}}} + \mathcal{O}(\|\mathbf{x}(t) - \mathbf{x}^0\|^2). \tag{8.42}$$

The partial derivative is evaluated at $x^0$ and is a column vector of dimension $n$ whose $k$th row is the derivative w.r.t. $\tilde{x}_k$. As anticipated, denote this by $\beta_k$. These parameters can be estimated by least squares with

$$\min\_{a, \mathbf{B}} \sum\_{t=1}^{N} \|\mathbf{y}(t) - a - (\mathbf{x}(t) - \mathbf{x}^{0})^{T} \mathbf{B}\|^{2} \, K(\mathbf{x}(t) - \mathbf{x}^{0}) + \gamma \|\mathbf{B}\|\_1,\tag{8.43}$$

where $a$ corresponds to $f(x^0)$, $\mathbf{B}$ is the vector of partial derivatives $\beta_k$ and $K$ is a kernel that focuses the sum on points $x(t)$ in the vicinity of $x^0$. The $\ell_1$-norm regularization term is added just as in (8.39) to promote zero estimates of the gradients. This focuses the selection on regressors $\tilde{x}_k$ that are important for the model.
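A rough sketch of the weighted criterion (8.43): the kernel weights can be absorbed into the rows of the regression problem (weighted least squares), while the $\ell_1$ penalty on the gradient $\mathbf{B}$ is handled by a proximal step, leaving the level $a$ unpenalized. The Gaussian kernel, the bandwidth and the test function below are assumptions made for the example, not taken from [2]:

```python
import numpy as np

def local_sparse_gradient(X, y, x0, bandwidth, gamma, n_iter=4000):
    # Estimate a ~ f(x0) and a sparse gradient B via the weighted l1 criterion.
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    sw = np.sqrt(w)
    # Absorb the kernel weights K(x(t) - x0) into the data.
    A = sw[:, None] * np.hstack([np.ones((len(y), 1)), X - x0])
    b = sw * y
    L = 2.0 * np.linalg.norm(A, 2) ** 2
    z = np.zeros(X.shape[1] + 1)             # z = [a, B]
    for _ in range(n_iter):
        z = z - 2.0 * A.T @ (A @ z - b) / L  # gradient step
        z[1:] = np.sign(z[1:]) * np.maximum(np.abs(z[1:]) - gamma / L, 0.0)
    return z[0], z[1:]

# f depends on x_1 (nonzero gradient at 0) and x_3 (zero gradient at 0).
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(400, 4))
y = np.sin(2.0 * X[:, 0]) + X[:, 2] ** 2 + 0.05 * rng.standard_normal(400)
a_hat, B_hat = local_sparse_gradient(X, y, x0=np.zeros(4), bandwidth=0.5, gamma=1.0)
```

Only the first component of `B_hat` comes out clearly nonzero, flagging $\tilde{x}_1$ as locally relevant at $x^0 = 0$.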

With so-called iterative reweighting [6], the regularization term can be refined to

$$\gamma \sum\_{k=1}^{n} w\_k |\beta\_k|,\tag{8.44}$$

where $w_k = 1/|\hat{\beta}_k|$ are based on the estimates from (8.43). This refinement is suggested for inclusion in the algorithm of [2].
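In code, the refinement (8.44) amounts to a second, weighted pass of the same solver; a hedged numpy sketch on synthetic data (the cap on the weights avoids division by zero when a first-pass estimate is exactly zero):

```python
import numpy as np

def weighted_lasso_ista(Phi, y, weights, gamma, n_iter=5000):
    # Minimize ||y - Phi B||^2 + gamma * sum_k weights[k] * |B_k| by ISTA.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    B = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        B = B - 2.0 * Phi.T @ (Phi @ B - y) / L
        B = np.sign(B) * np.maximum(np.abs(B) - gamma * weights / L, 0.0)
    return B

rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 5))
y = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 2] + 0.1 * rng.standard_normal(200)

B0 = weighted_lasso_ista(Phi, y, np.ones(5), gamma=20.0)  # first pass: plain LASSO
w = 1.0 / np.maximum(np.abs(B0), 1e-6)                    # w_k = 1/|B0_k|, capped
B1 = weighted_lasso_ista(Phi, y, w, gamma=20.0)           # reweighted pass
```

The large weights keep the inactive coefficients at zero, while the small weights on the active ones reduce the shrinkage bias of the first pass.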

Note that this test depends on the chosen point $x^0$. It would be a big task to investigate "many" such points. In [1] it is instead suggested to estimate the expected values $E\left[x_i \frac{\partial f}{\partial \tilde{x}_i}\right]$ and $E\left[\frac{\partial f}{\partial \tilde{x}_i}\right]$. This is done using the pdfs of the $\tilde{x}_i$, given by $p_i(u)$, together with $\frac{d p_i(u)}{d u}$, which can be estimated by simple density estimation (involving only a scalar random variable).

A comprehensive study of sparsity and regularization is made in [33]. It works with a more complex model definition, allowing $f : \mathbb{R}^n \to \mathbb{R}$ to be defined over several Hilbert spaces. The bottom line is still based on 1-norm regularization of partial derivatives, and the final learning algorithm is given by minimization of the functional

$$\frac{1}{N} \sum\_{t=1}^{N} (\mathbf{y}(t) - f(\mathbf{x}(t)))^2 + \gamma \left( 2 \sum\_{i=1}^{n} \left\| \frac{\partial f}{\partial \tilde{\mathbf{x}}\_i} \right\|\_N + \nu \left\| f \right\|\_{\mathcal{H}}^2 \right). \tag{8.45}$$

Here, $\mathcal{H}$ can be an RKHS and the penalty on each partial derivative is given by

$$\left\| \frac{\partial f}{\partial \tilde{\mathbf{x}}\_i} \right\|\_N = \sqrt{\frac{1}{N} \sum\_{t=1}^N \left( \frac{\partial f(\mathbf{x}(t))}{\partial \tilde{\mathbf{x}}\_i} \right)^2},$$

$\gamma$ is the regularization parameter and $\nu$ is a small positive number that ensures stability and a strongly convex regularizer.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 9 Numerical Experiments and Real World Cases**

**Abstract** This chapter collects some numerical experiments that test the performance of kernel-based approaches for discrete-time linear system identification. Using Monte Carlo simulations, we will compare the performance of kernel-based methods with the classical PEM approaches described in Chap. 2. Simulated and real data are included, concerning a robotic arm, a hairdryer and a problem of temperature prediction. We conclude the chapter by introducing so-called multi-task learning, where several functions (tasks) are estimated simultaneously. This problem is significant when the tasks are related to each other, so that measurements taken on one function are informative about the others. A problem involving real pharmacokinetics data, related to so-called population approaches, is then illustrated. Results will often be illustrated using MATLAB boxplots. As already mentioned in Sect. 7.2 when commenting on Fig. 7.8, the median is given by the central mark while the box edges are the 25th and 75th percentiles. The whiskers extend to the most extreme fits not seen as outliers; the outliers are plotted individually.

## **9.1 Identification of Discrete-Time Output Error Models**

In this section, we will consider two numerical experiments with data generated according to the discrete-time output error (OE) model

$$\mathbf{y}(t) = G\_0(q)u(t) + e(t),$$

where $G_0$ is a rational transfer function while $e$ is white Gaussian noise independent of the known input $u$. Using simulated data, we will compare the performance of the classical PEM approach, as described in Chap. 2, with some of the regularized techniques illustrated in this book. In particular, we will adopt regularized high-order FIR models, with the impulse response coefficients contained in the $m$-dimensional (column) vector $\theta$ and the output data in the (column) vector $Y = [y(1) \ \ldots \ y(N)]^T$. So, letting the regression matrix $\Phi \in \mathbb{R}^{N \times m}$ be

<sup>©</sup> The Author(s) 2022

G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2\_9


$$
\Phi = \begin{pmatrix}
u(0) & u(-1) & \cdots & u(-m+1) \\
u(1) & u(0) & \cdots & u(-m+2) \\
\vdots & \vdots & & \vdots \\
u(N-1) & u(N-2) & \cdots & u(N-m)
\end{pmatrix},
$$
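In code, $\Phi$ is a Toeplitz matrix and can be formed directly; a small sketch using scipy (assuming the system is at rest, i.e., $u(t) = 0$ for $t < 0$):

```python
import numpy as np
from scipy.linalg import toeplitz

def fir_regression_matrix(u, N, m):
    # N x m FIR regression matrix; row t holds u(t-1), ..., u(t-m), with u(t) = 0 for t < 0.
    u = np.asarray(u, dtype=float)
    first_col = u[:N]            # u(0), u(1), ..., u(N-1)
    first_row = np.zeros(m)      # u(0), u(-1), ..., u(-m+1) = [u(0), 0, ..., 0]
    first_row[0] = u[0]
    return toeplitz(first_col, first_row)

u = np.arange(1.0, 6.0)          # u(0), ..., u(4) = 1, 2, 3, 4, 5
Phi = fir_regression_matrix(u, N=5, m=3)
print(Phi)                       # rows: [1 0 0], [2 1 0], [3 2 1], [4 3 2], [5 4 3]
```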

our estimator is

$$\hat{\theta} = \underset{\theta}{\text{arg min}}\ \|Y - \Phi \theta\|^2 + \gamma\, \theta^T P^{-1} \theta \tag{9.1a}$$

$$= P\Phi^T(\Phi P\Phi^T + \gamma I\_N)^{-1}Y \tag{9.1b}$$

$$= \left( P\Phi^T \Phi + \gamma I\_m \right)^{-1} P \Phi^T Y. \tag{9.1c}$$
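The forms (9.1b) and (9.1c) are algebraically equivalent; (9.1b) inverts an $N \times N$ matrix while (9.1c) inverts an $m \times m$ one, so the cheaper form depends on whether $N$ or $m$ is larger. A quick numerical check on synthetic data (the TC-type matrix $P$ below is just an illustrative placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, gamma = 50, 20, 1.0
Phi = rng.standard_normal((N, m))
Y = rng.standard_normal(N)

# A generic positive-definite regularization matrix P (here a TC-type kernel).
k = np.arange(1, m + 1)
P = 0.8 ** np.maximum.outer(k, k)

theta_b = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(N), Y)  # (9.1b)
theta_c = np.linalg.solve(P @ Phi.T @ Phi + gamma * np.eye(m), P @ Phi.T @ Y)  # (9.1c)
print(np.allclose(theta_b, theta_c))   # True: the two forms coincide
```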

We have already seen in (5.40), (5.41) and (7.30), using MaxEnt arguments and spline theory, that choices for the regularization matrix $P$ can be the first- or second-order stable spline kernel, denoted by TC and SS, respectively, or the DC kernel. They are recalled below, specifying also the hyperparameter vector $\eta$:

$$\begin{aligned} \text{TC} \quad & P\_{kj}(\eta) = \lambda \alpha^{\text{max}(k,j)}; \\ & \lambda \ge 0, \ 0 \le \alpha < 1, \ \eta = [\lambda, \alpha], \end{aligned} \tag{9.2}$$

$$\text{SS}\quad P\_{kj}(\eta) = \lambda \left(\frac{\alpha^{k+j+\max(k,j)}}{2} - \frac{\alpha^{3\max(k,j)}}{6}\right)$$

$$\lambda \ge 0, \ 0 \le \alpha < 1, \ \eta = [\lambda, \alpha], \tag{9.3}$$

$$\begin{aligned} \text{DC} \quad P\_{kj}(\eta) &= \lambda \alpha^{(k+j)/2} \rho^{|j-k|}; \\ \lambda &\ge 0, \ 0 \le \alpha < 1, \, |\rho| \le 1, \ \eta = [\lambda, \alpha, \rho]. \end{aligned} \tag{9.4}$$
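For reference, the three regularization matrices (9.2)–(9.4) are straightforward to build; a numpy sketch (note that TC is recovered from DC by setting $\rho = \sqrt{\alpha}$):

```python
import numpy as np

def tc_kernel(m, lam, alpha):
    # TC kernel (9.2): P_kj = lam * alpha^max(k, j).
    k = np.arange(1, m + 1)
    return lam * alpha ** np.maximum.outer(k, k)

def ss_kernel(m, lam, alpha):
    # SS (second-order stable spline) kernel (9.3).
    k = np.arange(1, m + 1)
    K, J = np.meshgrid(k, k, indexing="ij")
    mx = np.maximum(K, J)
    return lam * (alpha ** (K + J + mx) / 2.0 - alpha ** (3 * mx) / 6.0)

def dc_kernel(m, lam, alpha, rho):
    # DC kernel (9.4): P_kj = lam * alpha^((k+j)/2) * rho^|k-j|.
    k = np.arange(1, m + 1)
    K, J = np.meshgrid(k, k, indexing="ij")
    return lam * alpha ** ((K + J) / 2.0) * rho ** np.abs(K - J)

P = tc_kernel(5, lam=1.0, alpha=0.5)
print(P[0, 0], P[0, 4])   # 0.5 and 0.5**5 = 0.03125
```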

## *9.1.1 Monte Carlo Studies with a Fixed Output Error Model*

In this example the true impulse response is fixed to that reported in Fig. 8.2, obtained by random generation of a rational transfer function of order 10. It has to be estimated from 500 input–output couples (collected with system initially at rest). The input is white noise filtered by the rational transfer function 1/(*z* − *p*) where *p* will vary over the unit interval during the experiment. Note that *p* establishes the difficulty of our system identification problem. Values close to zero make the input similar to white noise and the output data informative over a wide range of frequencies. Instead, values of *p* close to 1 increase the low-pass nature of the input and, hence, the ill-conditioning. The measurement noise is white and Gaussian with variance equal to that of the noiseless output divided by 50. Two estimators will be adopted:


• *Oe+Or*. Classical PEM approach (2.22) equipped with an oracle. In particular, our candidate models are rational transfer functions where the two polynomials have equal order, which can vary between 1 and 30. For any model order, estimation is performed through nonlinear least squares by solving (2.22) with the loss in (2.21) set to the quadratic function. The method is implemented in oe.m of the MATLAB System Identification Toolbox. Then, the oracle chooses the estimate which maximizes the fit

$$100\left(1-\left[\frac{\sum\_{k=1}^{100}|\mathbf{g}\_k^0-\hat{\mathbf{g}}(k)|^2}{\sum\_{k=1}^{100}|\mathbf{g}\_k^0-\bar{\mathbf{g}}^0|^2}\right]^{\frac{1}{2}}\right),\ \bar{\mathbf{g}}^0=\frac{1}{100}\sum\_{k=1}^{100}\mathbf{g}\_k^0,\tag{9.5}$$

where $g_k^0$ are the true impulse response coefficients while $\hat{g}(k)$ denote their estimates. The estimator is given the information that the system initial conditions are null.
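The fit measure (9.5) is simple to reproduce; a small sketch with an illustrative exponentially decaying impulse response:

```python
import numpy as np

def impulse_response_fit(g0, g_hat):
    # Percentage fit (9.5): 100 * (1 - relative l2 error w.r.t. the demeaned truth).
    g0 = np.asarray(g0, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    g0_bar = g0.mean()
    return 100.0 * (1.0 - np.sqrt(np.sum((g0 - g_hat) ** 2) / np.sum((g0 - g0_bar) ** 2)))

g0 = 0.9 ** np.arange(100)                       # toy impulse response
print(impulse_response_fit(g0, g0))              # perfect estimate: 100.0
print(impulse_response_fit(g0, np.zeros(100)))   # a trivial estimate scores below zero
```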

• *TC+ML*. This is the regularized estimator (9.1), equipped with the kernel *TC*. The number of estimated impulse response coefficients is *m* = 100 and the regression matrix is built with *u*(*t*) = 0 if *t* < 0. At every run, the noise variance is estimated by fitting via least squares a low-bias model for the impulse response. Then, the two kernel hyperparameters are obtained via marginal likelihood optimization, see (7.42). The method is implemented in impulseest.m of the MATLAB System Identification Toolbox.
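The marginal likelihood tuning used by *TC+ML* can be sketched as follows: under the Gaussian assumptions, $Y \sim N(0, \Phi P(\eta)\Phi^T + \sigma^2 I_N)$, and $\eta$ is chosen by minimizing the negative log marginal likelihood. The crude grid search below stands in for the continuous optimization of (7.42), and all data are synthetic:

```python
import numpy as np

def neg_log_marglik(eta, Phi, Y, sigma2):
    # -log p(Y) (up to a constant) for Y ~ N(0, Phi P(eta) Phi^T + sigma2 * I), TC kernel.
    lam, alpha = eta
    m = Phi.shape[1]
    k = np.arange(1, m + 1)
    P = lam * alpha ** np.maximum.outer(k, k)
    Sigma = Phi @ P @ Phi.T + sigma2 * np.eye(len(Y))
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (logdet + Y @ np.linalg.solve(Sigma, Y))

# Simulated FIR data with an exponentially decaying true impulse response.
rng = np.random.default_rng(0)
N, m = 200, 30
theta_true = 0.7 ** np.arange(1, m + 1)
Phi = rng.standard_normal((N, m))
Y = Phi @ theta_true + 0.1 * rng.standard_normal(N)

# Crude grid search over the two TC hyperparameters eta = (lambda, alpha).
grid = [(lam, alpha) for lam in (0.1, 1.0, 10.0) for alpha in (0.3, 0.5, 0.7, 0.9)]
best = min(grid, key=lambda eta: neg_log_marglik(eta, Phi, Y, sigma2=0.01))
print(best)
```

The marginal likelihood favours hyperparameters whose implied prior variance $\lambda\alpha^k$ matches the decay of the true coefficients, which is the complexity-control mechanism discussed above.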

We consider 4 Monte Carlo studies of 300 runs defined by different values of $p$ in the set $\{0, 0.9, 0.95, 0.99\}$. As already mentioned, $p = 0$ corresponds to white noise input while $p = 0.99$ leads to a highly ill-conditioned problem (output data provide little information at high frequencies). Figure 9.1 reports the boxplots of the fits returned by *Oe+Or* and *TC+ML* for the four different values of $p$. Even if PEM exploits an oracle to tune complexity, its performance is (slightly) better than that of *TC+ML* only when the input is white noise, see also Table 9.1. When $p$ increases, the ill-conditioning affecting the problem increases and *TC+ML* outperforms *Oe+Or* even if no oracle is used for hyperparameter tuning. This also points out the effectiveness of marginal likelihood optimization in controlling complexity.

This case study shows that continuous tuning of hyperparameters may be a more versatile and powerful approach than classical estimation of discrete model orders. A problem related to PEM here could also be the presence of local minima of the objective. This is much less critical when adopting kernel-based regularization. In fact, *TC+ML* regulates complexity through only two hyperparameters while *Oe+Or* has to optimize many more parameters (a function of the postulated model order).

**Fig. 9.1** *Experiment with a fixed OE-model* Boxplots of 300 impulse response fits returned by PEM with an oracle to tune discrete model order and by TC with continuous hyperparameters estimated by marginal likelihood optimization. Results are function of the level of ill-conditioning affecting the problem which increases with *p* (the input is white Gaussian noise for *p* = 0 while the other values define low-pass inputs)

**Table 9.1** *Experiment with a fixed OE-model* Average fit, as a function of *p*, after 300 Monte Carlo runs. The value *p* = 0 corresponds to white noise input and the level of ill-conditioning then increases as *p* increases


## *9.1.2 Monte Carlo Studies with Different Output Error Models*

Now we consider two Monte Carlo studies of 1000 runs regarding identification of several discrete-time output error models. The outputs are still given by

$$\mathbf{y}(t) = G\_0(q)u(t) + e(t)$$

with $e$ white Gaussian noise independent of $u$, but the rational transfer function $G_0$ changes at any run. In fact, a 30th-order single-input single-output continuous-time system is first randomly generated by the MATLAB command rss.m. It is then sampled at 3 times its bandwidth and used if its poles fall within the circle of the complex plane centred at the origin with radius 0.99.

With the system at rest, 1000 input–output pairs are generated as follows. At any run, the system input is unit variance white Gaussian noise filtered by a second-order rational transfer function generated by the same procedure adopted to obtain $G_0$. The outputs are corrupted by additive white Gaussian noise with an SNR (the ratio between the variance of the noiseless output and of the noise) randomly chosen in [1, 20] at any run. In the first experiment, the data set

$$\mathcal{D}\_T = \{ u(1), \mathbf{y}(1), \dots, u(N), \mathbf{y}(N) \}$$

contains the first 200 input–output couples, i.e., *N* = 200, while in the second experiment all the 1000 couples are used, i.e., *N* = 1000.

Starting from null initial conditions, at any run we also generate two different kinds of test sets

$$\mathcal{D}\_{test} = \{ u^{new}(1), \mathbf{y}^{new}(1), \dots, u^{new}(M), \mathbf{y}^{new}(M) \}, \quad M = 1000.$$

The first test set is especially challenging since the noiseless outputs are generated using unit variance white Gaussian noise as input. In the second test set the input instead has the same statistics as the one entering the identification data, which makes its prediction easier.

The performance of a model characterized by $\hat{\theta}$, returning $\hat{y}^{new}(t|\hat{\theta})$ as output prediction at instant $t$, is

$$\mathcal{F}(\hat{\theta}) = 100 \left( 1 - \sqrt{ \frac{\sum\_{t=1}^{M} \left( \mathbf{y}^{new}(t) - \hat{\mathbf{y}}^{new}(t|\hat{\theta}) \right)^{2}}{\sum\_{t=1}^{M} \left( \mathbf{y}^{new}(t) - \bar{\mathbf{y}}^{new} \right)^{2}} } \right), \quad M = 1000, \tag{9.6}$$

where $\bar{y}^{new}$ is the average output in $\mathcal{D}_{test}$ and the predictions $\hat{y}^{new}(t|\hat{\theta})$ are computed assuming zero initial conditions (otherwise high-order models could have the advantage of calibrating the initial conditions to fit $\mathcal{D}_{test}$). The prediction fit (9.6) can be obtained by the MATLAB command predict(model,data,k,'ini','z'), where model and data denote structures containing the estimated model and the test set $\mathcal{D}_{test}$, respectively.

In what follows, we will use also estimators equipped with an oracle which evaluates the fit (9.6) for the test set of interest. Different rational models with orders between 1 and 30 are tried and the oracle selects the orders that give the best fit. We are now in a position to introduce the following 6 estimators:


#### **9.1.2.1 Results**

The MATLAB boxplots in Fig. 9.2 contain the 1000 fit measures returned by the estimators during the first experiment with *N* = 200 (left panels) and the second experiment with *N* = 1000 (right panels). Table 9.2 reports the average fit values.

In the top panels of Fig. 9.2 one can see the fits on the first test set. Recall that *Oe+Or1* has access to such data to optimize the prediction capability. Interestingly, despite this advantage, the performance of all three regularized approaches is close to that of the oracle, while that of *Oe+CV* is not as satisfactory. This is also visible in the first two rows of Table 9.2.

The bottom panels of Fig. 9.2 show results relative to the second test set which is used by *Oe+Or2* to maximize the prediction fit. Since training and test data are

**Table 9.2** *Identification of discrete-time OE-models* Average fit, after 1000 Monte Carlo runs, as a function of the test set type and the identification data set size (*N* = 200 or *N* = 1000). Results in the first and last column come from oracle-based estimators which cannot be implemented in practice


more similar, the prediction capability of *Oe+CV* improves significantly but the regularized estimators still outperform the classical approach, see also the last two rows of Table 9.2.

**Fig. 9.2** *Identification of discrete-time OE-models Top* Boxplots of the 1000 prediction fits on future outputs with test input given by white noise. The size of the identification data set is 200 (top left) or 1000 (top right). *Bottom* Differently from the results in the top panels, the input statistics in the estimation and test data sets are the same. In each of the four panels, the first and last boxplots contain results from the estimators *Oe+Or1* and *Oe+Or2*, which cannot be implemented in practice

**Fig. 9.3** *Robot arm* A portion of the input–output data for the robot arm: the input is the driving couple (bottom) and the output is the tip of the robot arm (top)

## *9.1.3 Real Data: A Robot Arm*

Consider now the vibrating flexible robot arm described in [27], where two feedforward controller design methods were compared on trajectory tracking problems. The input of the robot arm is the driving couple and the output is the acceleration at the tip of the robot arm. The input–output data contain 40960 data points. They are collected at a sampling frequency of 500 Hz for 10 periods with each period containing 4096 data points. A portion of the data is shown in Fig. 9.3. The identification problem of the robot arm was studied in [23, Sect. 11.4.4] with frequency domain methods.

We will build models by both the classical prediction error method and the kernel method with the DC kernel. Since the true system is unknown, to compare the performance of different impulse response estimates we divide the data into two parts: the training and the test set, given by the first 6000 input–output couples and the remaining ones, respectively. Then, we measure how well the models, built with the estimation data, predict the test outputs.

For the prediction error method, we estimate $n$th-order state-space models without disturbance model and with zero initial conditions for $n = 1,\ldots, 36$. This method is available in MATLAB's System Identification Toolbox [13] as the command pem(data,n,'dist','no','ini','z'). The prediction fits computed using (9.6) are shown as a function of $n$ in Fig. 9.4. An oracle that has access to the test set would select the order $n = 18$, hence obtaining a prediction fit equal to 79.75%.

**Fig. 9.4** *Robot arm* The solid red line is the fit for the prediction error method for different model orders $n = 1,\ldots, 36$. The dash-dot blue line is the prediction fit on the test set for the regularized method with the DC kernel

For the kernel method with the DC kernel, we estimate a FIR model of high order 3000 with hyperparameters tuned by optimizing the marginal likelihood. When forming the regression matrix, the unknown input data are set to zero. The prediction fit (9.6) is 83.07% and is shown as a horizontal line in Fig. 9.4. The kernel method with the DC kernel is available in MATLAB's System Identification Toolbox [14] as the command impulseest(data,3000,0,opt) where, in the option opt, we set opt.RegulKernel='dc'; opt.Advanced.AROrder=0.

The Bode magnitude plot of the models estimated by PEM and the DC kernel is shown in Fig. 9.5. The empirical frequency function estimate obtained using the command etfe in MATLAB's System Identification Toolbox [14] is also displayed.

The measured output and the predicted output over a portion of the test set are shown in Fig. 9.6. If one is concerned that a FIR model of order 3000 is quite large, one could reduce such a high-order model by projecting it onto a low-order state-space model. Exploiting model order reduction techniques, the fit of a state-space model of order $n = 25$ is 79.8%, still better than the best state-space description that can be obtained by PEM.

**Fig. 9.5** *Robot arm* Bode magnitude plot of the estimated models obtained by the empirical frequency function estimate ETFE (black), the regularized method with DC kernel and hyperparameters estimate by marginal likelihood optimization (blue), and the prediction error method with order *n* = 18 (red)

**Fig. 9.6** *Robot arm* A portion of the test set (grey) and the predictions returned by the regularized method with DC kernel (blue) and the prediction error method (red)

## *9.1.4 Real Data: A Hairdryer*

The second application is a real laboratory device whose function is similar to that of a hairdryer: air is fanned through a tube and then heated by a mesh of resistor wires, as described in [13, Sect. 17.3]. The input to the hairdryer is the voltage over the mesh of resistor wires while the output is the air temperature measured by a thermocouple. The input–output data contain 1000 data points collected at a sampling frequency of 12.5 Hz for 80 s. A portion of the data is shown in the top panel of Fig. 9.7. Since the input and output values move around 5 and 4.9, respectively, we detrend the measurements so that they move around 0. The estimation and test set data are then given by the first and the last 500 input–output couples, respectively.

As in the case of the robot arm, we build models by the classical prediction error method with an oracle, which maximizes the prediction fit, and the regularized approach with the DC kernel, with hyperparameters tuned by marginal likelihood optimization. For the prediction error method, we estimate $n$th-order state-space models without disturbance model for $n = 1,\ldots, 36$ and with zero initial conditions. The fits, as a function of $n$, are shown in Fig. 9.8. The best result is obtained for order $n = 5$ and turns out to be 88.38%. For the kernel method with the DC kernel, we estimate a FIR model of order 70. When forming the regression matrix, we set the unknown input data to zero. The prediction fit (9.6) is close to that achieved by PEM+Oracle, being equal to 88.15%. It is shown as a dash-dot blue line in Fig. 9.8. The test set and the predicted outputs returned by the two methods are shown in Fig. 9.9. One can see that the regularized approach has a prediction capability very close to that of PEM+Oracle.

## **9.2 Identification of ARMAX Models**

In this section we consider the identification of linear systems

$$\mathbf{y}(t) = \left\{ \sum\_{i=1}^{p} G\_{0i}(q)u\_{i}(t) \right\} + H\_{0}(q)e(t). \tag{9.7}$$

Differently from the previous cases, besides the presence of multiple observable inputs $u_i$, the noise model is also unknown. In fact, $e(t)$ is white Gaussian noise filtered by a system $H_0(q)$ that has to be estimated from data.

First, it is useful to cast the identification of the general model (9.7) in a regularized context. Without loss of generality, to simplify the exposition, let $p = 1$ with the single observable input denoted by $u$. Exploiting (2.4), given the general linear model (9.7), we can write any predictor in terms of two infinite impulse responses, from $y$ and $u$, respectively. When using ARX models, we have seen in (2.8) that such infinite responses specialize to finite ones. One has

**Fig. 9.7** *Hairdryer* A portion of the input–output data for the hairdryer. The input is the voltage over the mesh of resistor wires (bottom panel) and the output is the air temperature measured by a thermocouple (top panel)

**Fig. 9.8** *Hairdryer* The solid line is the fit for the prediction error method for different model order *n* = 1,..., 36. The dash-dot line is the fit for the ReLS method with the DC kernel

**Fig. 9.9** *Hairdryer* The measured output and the predicted output over the test data: the measured output (grey), the ReLS method with DC kernel (blue) and the prediction error method (red)

$$\begin{split} \mathbf{y}(t) &= -a\_1 \mathbf{y}(t-1) - \dots - a\_{n\_a} \mathbf{y}(t - n\_a) + b\_1 u(t - 1) + \dots \\ &+ b\_{n\_b} u(t - n\_b) + e(t) = \boldsymbol{\varphi}\_y^T(t) \boldsymbol{\theta}\_a + \boldsymbol{\varphi}\_u^T(t) \boldsymbol{\theta}\_b + e(t), \end{split} \tag{9.8}$$

where $\theta_a = [a_1 \ \ldots \ a_{n_a}]^T$, $\theta_b = [b_1 \ \ldots \ b_{n_b}]^T$ and $\varphi_y(t)$, $\varphi_u(t)$ are made up from $y$ and $u$ in an obvious way. Thus, the ARX model is a linear regression model, to which the same ideas of regularization can be applied. This point is important since we have seen in Theorem 2.1 that ARX expressions become arbitrarily good approximators of general linear systems as the orders $n_a$, $n_b$ tend to infinity. However, as discussed in Chap. 2, high-order ARX models can suffer from large variance. A solution is to set $n_a = n_b = n$ to a large value and then introduce regularization matrices for the two impulse responses, from $y$ and from $u$. The $P$-matrix in (9.1) can be partitioned along with $\theta_a$, $\theta_b$:

$$P(\eta\_1, \eta\_2) = \begin{bmatrix} P^a(\eta\_1) & 0 \\ 0 & P^b(\eta\_2) \end{bmatrix} \tag{9.9}$$

with $P^a(\eta_1)$, $P^b(\eta_2)$ defined, e.g., by any of (9.2)–(9.4). Letting $\theta = [\theta_a^T \ \theta_b^T]^T$ and building the regression matrix using $[\varphi_y^T(t) \ \varphi_u^T(t)]$ as rows, the estimator (9.1) now becomes a regularized high-order ARX model. The MATLAB code for estimating this model using, e.g., the DC kernel would be

```matlab
ao = arxRegulOptions('RegularizationKernel','DC');
[Lambda,R] = arxRegul(data,[na nb nk],ao);
aropt = arxOptions;
aropt.Regularization.Lambda = Lambda;
aropt.Regularization.R = R;
m = arx(data,[na nb nk],aropt);
```

We can also easily extend this construction to multiple inputs. For a generic $p$, one needs to estimate $p + 1$ impulse responses, with the matrix (9.9) now containing $p + 1$ blocks. If there are multiple outputs, one approach is to consider each output channel as a separate linear regression as in (9.8). The difference is that the other outputs also need to be appended, as done with the inputs.
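The partitioned matrix (9.9) is just a block-diagonal assembly of single-response kernels; a short sketch with two TC blocks (the hyperparameter values are arbitrary; with $p$ inputs there would be $p + 1$ blocks):

```python
import numpy as np
from scipy.linalg import block_diag

def tc_kernel(n, lam, alpha):
    # TC kernel (9.2): P_kj = lam * alpha^max(k, j).
    k = np.arange(1, n + 1)
    return lam * alpha ** np.maximum.outer(k, k)

n = 50
P_a = tc_kernel(n, lam=1.0, alpha=0.9)    # regularizes theta_a (response from y)
P_b = tc_kernel(n, lam=10.0, alpha=0.8)   # regularizes theta_b (response from u)
P = block_diag(P_a, P_b)                  # the partitioned matrix of (9.9)
print(P.shape)                            # (100, 100)
```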

## *9.2.1 Monte Carlo Experiment*

One challenging Monte Carlo study of 1000 runs is now considered. Data are generated at any run by an ARMAX model of order 30 having *p* observable inputs, i.e.,

$$\mathbf{y}(t) = \left\{ \sum\_{i=1}^{p} \frac{B\_i(q)}{A(q)} u\_i(t) \right\} + \frac{C(q)}{A(q)} e(t),$$

with $p$ drawn from a random variable uniformly distributed on $\{2, 3, 4, 5\}$. Note that the system contains $p + 1$ rational transfer functions. They depend on the polynomials $A$, $B_i$ and $C$, which are randomly generated at any run by the MATLAB function drmodel.m. This function is first called to obtain the common denominator $A$ and the first numerator $B_1$. The other $p$ calls are used to obtain the numerators of the remaining rational transfer functions. The system so generated is accepted if the modulus of its poles is not larger than 0.95. In addition, letting $G_i(q) = B_i(q)/A(q)$ and $H(q) = C(q)/A(q)$, the signal-to-noise ratio has to satisfy

$$1 \le \frac{\sum\_{i=1}^p \|G\_i\|\_2^2}{\|H\|\_2^2} \le 20$$

where $\|G_i\|_2$, $\|H\|_2$ are the $\ell_2$ norms of the system impulse responses.

After a transient to mitigate the effect of initial conditions, at any run 300 input–output couples are collected to form the identification data set $\mathcal{D}_T$ and another 1000 to define the test set $\mathcal{D}_{test}$. In both cases, the input is white Gaussian noise of unit variance.

Differently from the output error models, in the ARMAX case the performance measure adopted to compare different estimated models depends on the prediction horizon $k$. More specifically, let $\hat{y}_k^{new}(t|\hat{\theta})$ be the $k$-step-ahead predictor associated with an estimated model characterized by $\hat{\theta}$. For any $t$, this function predicts the test output $y^{new}(t)$ $k$ steps ahead by using the values of the test input $u^{new}$ up to time $t - 1$ and of the test output $y^{new}$ up to $t - k$. The prediction difficulty in general increases as $k$ gets larger. The special case $k = 1$ corresponds to the one-step-ahead predictor given by (2.4); see, e.g., [13, Sect. 3.2] for the expressions of the generic $k$-step-ahead impulse responses.

As done in (9.6), we use $\bar{y}^{new}$ to denote the mean of the outputs in $\mathcal{D}_{test}$, but now the prediction fit depends on $k$, being given by

$$\mathcal{F}\_k(\hat{\theta}) = 100 \left( 1 - \sqrt{ \frac{\sum\_{t=1}^{M} \left( \mathbf{y}^{new}(t) - \hat{\mathbf{y}}\_k^{new}(t|\hat{\theta}) \right)^2}{\sum\_{t=1}^{M} \left( \mathbf{y}^{new}(t) - \bar{\mathbf{y}}^{new} \right)^2} } \right), \quad M = 1000. \tag{9.10}$$

In this case, we say that an estimator is equipped with an oracle if it can use the test set to maximize $\sum_{k=1}^{20} \mathcal{F}_k$ by tuning the complexity of the model estimated using the identification data. The following estimators are then introduced:


All the system input delays are assumed known, and their values are provided to all the estimators described above.

The average of the fits $\mathcal{F}_k$ given by (9.10), as a function of the prediction horizon $k$, is reported in Fig. 9.10. Since PEM equipped with Akaike-like criteria returns very small average fits, results achieved by this kind of procedure are not displayed. The MATLAB boxplots of the 1000 values of $\mathcal{F}_1$ and $\mathcal{F}_{20}$ returned by all the estimators are visible in Fig. 9.11. The average fit of *SS+ML* is quite close to that of *PEM+Oracle*, which is in turn outperformed by *TC+ML* and *DC+ML*. This is remarkable also considering that such kernel-based approaches can be used in real applications, while *PEM+Oracle* relies on an ideal tuning which exploits the test set. Results returned by PEM equipped with CV are instead unsatisfactory.

The results outline the importance of regularization, especially in experiments with relatively small data sets. In this case, only 300 input–output measurements are available from quite complex systems of order 30. The classical PEM approach equipped with any model order-selection rule cannot predict the test set better than the oracle. However, the latter can tune complexity only by exploring a finite set of given models. Kernel-based approaches can instead balance bias and variance through continuous tuning of the regularization parameters. In this way, better performing trade-offs may be reached.

## *9.2.2 Real Data: Temperature Prediction*

Now we consider thermodynamic modelling of buildings using some real data taken from [22]. Eight sensors are placed in two rooms of a small two-floor residential building of about 80 m² and 200 m³. They are located only on one floor (approximately 40 m²). More specifically, temperatures are collected through a wireless sensor network made of 8 *Tmote-Sky* nodes produced by Moteiv Inc. The building was inhabited during the measurement period, which consisted of 8 days, and samples were taken every 5 min. A thermostat controlled the heating system, with the reference temperature manually set every day depending upon occupancy and other needs. This makes available a total of 8 temperature profiles, displayed in Fig. 9.12. One can see the high level of collinearity of the signals, which makes the problem ill-conditioned and complicates the identification process.

We consider only multiple-input single-output (MISO) models. The temperature from the first node is seen as the output ($y_i$) and the other 7 temperatures as the inputs

**Fig. 9.11** *Identification of ARMAX models*. Boxplots of the 1000 values of $\mathcal{F}_1$ (top panel) and $\mathcal{F}_{20}$ (bottom panel). Recall that *PEM+Oracle* uses additional information, having access to the test set to perform model order selection

($u_i^j$, $j = 1, \ldots, 7$). The data are divided into 2 parts: those collected at time instants $1, \ldots, 1200$ form the identification set, while those at instants $1201, \ldots, 2500$ are used for test purposes. With a 5-min sampling time, 1200 instants correspond to almost 100 h, a rather small time interval. Hence, we assume a "stationary" environment and normalize the data to zero mean and unit variance before performing identification. The quality of the $k$-step-ahead prediction on the test data is measured by (9.10).
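The zero-mean, unit-variance normalization can be sketched as follows. One common convention, assumed here (the text does not specify it), is to compute the statistics on the identification portion only, so that the test data never influence the preprocessing; the helper name `normalize_split` and its arguments are ours:

```python
import numpy as np

def normalize_split(signals, n_id=1200):
    """For each signal, compute mean/std on the first n_id samples
    (the identification portion), normalize the whole record with
    those statistics, and return (identification, test) pairs."""
    out = []
    for s in signals:
        s = np.asarray(s, dtype=float)
        mu, sd = s[:n_id].mean(), s[:n_id].std()
        z = (s - mu) / sd
        out.append((z[:n_id], z[n_id:]))
    return out
```

After this step the identification part of each signal has exactly zero mean and unit variance, while the test part is scaled consistently.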

Identification has been performed using ARMAX models with an oracle which has access to the test set. This estimator, called PEM+Or, maximizes $\sum_{k=1}^{48} \mathcal{F}_k$, which accounts for the prediction capability up to 4 h ahead. The other estimator is regularized ARX equipped with the TC kernel, with a different scale factor $\lambda$ assigned to each unknown one-step-ahead predictor impulse response and a common decay rate $\alpha$. The length of each impulse response is set to 50 and the hyperparameters are estimated via marginal likelihood maximization using only the identification data. This estimator is denoted by TC+ML. Results are reported in Fig. 9.13 (top panel): the performance of PEM+Or and TC+ML is quite similar. Sample trajectories of one-hour-ahead test data prediction returned by TC+ML are also reported in Fig. 9.13 (bottom panel).

#### **9.3 Multi-task Learning and Population Approaches**

In the previous chapters we have studied the problem of reconstructing a real-valued function from discrete and noisy samples. An extension is the so-called multi-task learning problem in which several functions (tasks) are simultaneously estimated. This problem is significant if the tasks are related to each other so that measurements

taken on a function are informative with respect to the other ones. An example is given by a network of linear systems whose impulse responses share some common features. Here, a relevant problem is the study of anomaly detection in homogeneous populations of dynamic systems [5, 6, 10]. Normally, all of them are supposed to have the same (possibly unknown) nominal dynamics. However, there can be a subset of systems that have anomalies (deviations from the mean), and the goal is to detect them from the data collected in the population. Important applications of multi-task learning arise also in biomedicine when multiple experiments are performed in subjects from a population [9]. Similar patterns are observed in individual responses, so that measurements collected in one subject can also help reconstruct the responses of other individuals. In pharmacokinetics (PK) and pharmacodynamics (PD) the joint analysis of several individual curves is often exploited and called population analysis [24]. One class of adopted models is parametric, e.g., compartmental ones [7]. The problem can be solved using, e.g., the NONMEM software, which traces back to the seventies [3, 25], or more sophisticated approaches like Bayesian MCMC algorithms [15, 28]. More recently, machine learning/nonparametric approaches have been proposed for the population analysis of PK/PD data [16, 19, 20].

In the machine learning literature, the term multi-task learning was originally introduced in [4]. The performance improvement achievable by using a multi-task approach instead of a single-task one, which learns the functions separately, has then been pointed out in [1, 26]; see also [2] for a Bayesian treatment. Next, [8] proposed a regularized kernel method hinging upon the theory of vector-valued reproducing kernel Hilbert spaces [18]. Developments and applications of multi-task learning can then be found, e.g., in [11, 12, 17, 21, 29, 30].

## *9.3.1 Kernel-Based Multi-task Learning*

We will now see that multi-task learning can be cast within the RKHS setting developed in the previous chapters by defining a particular kernel. To simplify the exposition, let us assume that there is a common input space $X$ for all the tasks and consider a set of $k$ functions $\mathbf{f}_i : X \rightarrow \mathbb{R}$. Assume also that the following $n_i$ input–output data are available for each task $i$

$$(\mathbf{x}\_{1i}, \mathbf{y}\_{1i}), (\mathbf{x}\_{2i}, \mathbf{y}\_{2i}), \dots, (\mathbf{x}\_{n\_i i}, \mathbf{y}\_{n\_i i}).\tag{9.11}$$

Our goal is to jointly estimate all the unknown functions **f***<sup>i</sup>* starting from these examples. For this aim, first a kernel can be introduced to include our knowledge on the single functions (like smoothness) and also on their relationships. This can be done by defining an enlarged input space

$$\mathcal{X} = X \times \{1, 2, \ldots, k\}.$$

Hence, a generic element of $\mathcal{X}$ is the couple $(x, i)$ where $x \in X$ while $i \in \{1, \ldots, k\}$. The index $i$ thus specifies that the input location belongs to the part of the function domain connected with the $i$th function. The information regarding all the tasks can now be specified by the kernel $K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$, which induces an RKHS of functions $\mathbf{f} : \mathcal{X} \rightarrow \mathbb{R}$. In fact, we are just exploiting RKHS theory on function domains that include both continuous and discrete components. Note that, in practice, any function $\mathbf{f}$ embeds $k$ functions $\mathbf{f}_i$.

Regularization in RKHS then allows us to reconstruct the tasks from the data (9.11) by computing

$$\hat{\mathbf{f}} = \underset{\mathbf{f} \in \mathcal{H}}{\text{arg min}} \sum\_{i=1}^{k} \sum\_{l=1}^{n\_i} \mathcal{V}\_{li}(\mathbf{y}\_{li}, \mathbf{f}\_i(\mathbf{x}\_{li})) + \gamma \left\| \mathbf{f} \right\|\_{\mathcal{H}}^2. \tag{9.12}$$

Under general conditions on the losses $\mathcal{V}_{li}$, we can then apply the representer theorem, i.e., Theorem 6.15, to obtain the following expression for the minimizer:

$$\hat{\mathbf{f}}\_{j}(\mathbf{x}) = \sum\_{i=1}^{k} \sum\_{l=1}^{n\_{i}} c\_{li} K\left( (\mathbf{x}, j), (\mathbf{x}\_{li}, i) \right), \quad \mathbf{x} \in X, \; j = 1, \ldots, k \tag{9.13}$$

where $\{c_{li}\}$ are suitable scalars. Adopting the quadratic losses with weights $\{\sigma_{li}^2\}$, i.e.,

$$\mathcal{V}\_{li}(a,b) = \frac{(a-b)^2}{\sigma\_{li}^2}.$$

for any $a, b \in \mathbb{R}$, a regularization network is obtained and the expansion coefficients $\{c_{li}\}$ solve the following linear system of equations

$$\sum\_{i=1}^{k} \sum\_{l=1}^{n\_i} \left[ K((\mathbf{x}\_{li}, i), (\mathbf{x}\_{jq}, q)) + \gamma \sigma\_{jq}^2 \delta\_{lj} \delta\_{iq} \right] c\_{li} = \mathbf{y}\_{jq}, \tag{9.14}$$

where $q = 1, \ldots, k$, $j = 1, \ldots, n_q$, and $\delta_{ij}$ is the Kronecker delta.
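Stacking all the measurements, (9.14) is nothing but the linear system $(\mathbf{K} + \gamma \Sigma)\,c = y$, where $\mathbf{K}$ is the kernel matrix over all pairs $(x_{li}, i)$ and $\Sigma$ collects the noise weights on its diagonal. A minimal sketch, assuming the kernel matrix has already been built (the function name is ours):

```python
import numpy as np

def multitask_coefficients(K, sigma2, y, gamma=1.0):
    """Expansion coefficients c of (9.13), obtained by solving (9.14)
    in matrix form: (K + gamma * diag(sigma2)) c = y, with K the kernel
    matrix over all input pairs (x_li, i) stacked task by task."""
    K = np.asarray(K, dtype=float)
    A = K + gamma * np.diag(np.asarray(sigma2, dtype=float))
    return np.linalg.solve(A, np.asarray(y, dtype=float))
```

A linear solve is preferred over forming the inverse explicitly, which matters when the stacked data set is large or ill-conditioned.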

**Connection with Bayesian estimation** Exploiting the same arguments developed in Sect. 8.2.1, the following relationship between (9.13), (9.14) and Bayesian estimation of Gaussian random fields is obtained. Let the measurement model be

$$\mathbf{y}\_{ji} = \mathbf{f}\_i(\mathbf{x}\_{ji}) + e\_{ji} \tag{9.15}$$

where $\{e_{ji}\}$ are independent Gaussian noises with variances $\{\sigma_{ji}^2\}$. Define

$$\mathbf{y}\_i = \begin{bmatrix} \mathbf{y}\_{1i} & \ldots & \mathbf{y}\_{n\_i i} \end{bmatrix}^T, \qquad \mathbf{y}^k = \begin{bmatrix} \mathbf{y}\_1^T & \ldots & \mathbf{y}\_k^T \end{bmatrix}^T.$$

Assume also that {**f***i*} are zero-mean Gaussian random fields, independent of the noises, with covariances

$$\text{Cov}\left(\mathbf{f}\_i(\mathbf{x}), \mathbf{f}\_q(\mathbf{s})\right) = K\left((\mathbf{x}, i), (\mathbf{s}, q)\right), \qquad \mathbf{x}, \mathbf{s} \in X,$$

where $i = 1, \ldots, k$ and $q = 1, \ldots, k$. Then one obtains that, for $j = 1, \ldots, k$, the minimum variance estimate of $\mathbf{f}_j$ conditional on $y^k$ is defined by (9.13), (9.14) with $\gamma = 1$. Furthermore, the posterior variance of $\mathbf{f}_j(x)$ is

$$\operatorname{Var}\left[\mathbf{f}\_{j}(\mathbf{x})|\mathbf{y}^{k}\right] = \operatorname{Var}\left[\mathbf{f}\_{j}(\mathbf{x})\right] - \operatorname{Cov}\left(\mathbf{f}\_{j}(\mathbf{x}), \mathbf{y}^{k}\right) \left(\operatorname{Var}\left[\mathbf{y}^{k}\right]\right)^{-1} \operatorname{Cov}\left(\mathbf{f}\_{j}(\mathbf{x}), \mathbf{y}^{k}\right)^{T}. \tag{9.16}$$

In the above formula, in view of the independence assumptions, one has

$$\text{Var}\left[\mathbf{y}^{k}\right] = \begin{pmatrix} V\_{11} & V\_{12} & \ldots & V\_{1k} \\ V\_{21} & V\_{22} & \ldots & V\_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ V\_{k1} & V\_{k2} & \ldots & V\_{kk} \end{pmatrix} + \begin{pmatrix} \Sigma\_{1} & 0 & \ldots & 0 \\ 0 & \Sigma\_{2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \Sigma\_{k} \end{pmatrix},$$

where each block $V_{iq}$ belongs to $\mathbb{R}^{n_i \times n_q}$ and its $(l, j)$-entry is given by

$$V\_{iq}(l,j) = K((x\_{li}, i), (x\_{jq}, q)),$$

while $\Sigma_i = \text{diag}\{\sigma_{1i}^2, \ldots, \sigma_{n_i i}^2\}$. In addition,

$$\begin{split} \text{Cov}\left(\mathbf{f}\_{j}(\mathbf{x}), \mathbf{y}^{k}\right) &= \text{Cov}\left(\mathbf{f}\_{j}(\mathbf{x}), [\mathbf{f}\_{1}(\mathbf{x}\_{11}) \ldots \mathbf{f}\_{1}(\mathbf{x}\_{n\_1 1}) \ldots \mathbf{f}\_{k}(\mathbf{x}\_{1k}) \ldots \mathbf{f}\_{k}(\mathbf{x}\_{n\_k k})] \right) \\ &= [K((\mathbf{x}, j), (\mathbf{x}\_{11}, 1)) \ldots K((\mathbf{x}, j), (\mathbf{x}\_{n\_1 1}, 1)) \\ &\qquad \ldots K((\mathbf{x}, j), (\mathbf{x}\_{1k}, k)) \ldots K((\mathbf{x}, j), (\mathbf{x}\_{n\_k k}, k))]. \end{split}$$
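With these blocks in place, evaluating the posterior variance (9.16) is a standard Gaussian conditioning computation. A minimal sketch (the function name is ours), using a linear solve rather than an explicit matrix inverse:

```python
import numpy as np

def posterior_variance(kxx, kxY, V, Sigma):
    """Posterior variance (9.16): kxx is the prior variance Var[f_j(x)],
    kxY the cross-covariance row Cov(f_j(x), y^k), V the kernel block
    matrix and Sigma the block-diagonal noise covariance, so that
    Var[y^k] = V + Sigma."""
    kxY = np.atleast_1d(np.asarray(kxY, dtype=float))
    S = np.asarray(V, dtype=float) + np.asarray(Sigma, dtype=float)
    return float(kxx - kxY @ np.linalg.solve(S, kxY))
```

In the scalar case with unit prior variance, one noiseless-looking measurement of variance 1 halves the prior uncertainty, as expected from the formula.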

**Example of multi-task kernel: average plus shift** A simple yet useful class of multi-task kernels is obtained by defining *K* as follows:

$$K((\mathbf{x}\_1, p), (\mathbf{x}\_2, q)) = \bar{\lambda}^2 \overline{K}(\mathbf{x}\_1, \mathbf{x}\_2) + \delta\_{pq} \tilde{\lambda}^2 \widetilde{K}\_p(\mathbf{x}\_1, \mathbf{x}\_2) \tag{9.17}$$

where $\bar{\lambda}^2$ and $\tilde{\lambda}^2$ are two scale factors that typically need to be estimated from the data. Such a kernel describes each function as the sum of an average function $\bar{\mathbf{f}}$, hereafter named *average task*, and an *individual shift* $\tilde{\mathbf{f}}_j$ specific to each task. Indeed, if $\bar{\lambda} = 0$ all the functions would be learnt independently of each other. Instead, when $\tilde{\lambda} = 0$ all the tasks are actually the same. The Bayesian interpretation of multi-task learning discussed above also facilitates the understanding of this model. In fact, once the kernel is seen as a covariance, it is easy to see that, for any $i$ and $x \in X$, each task decomposes into

$$\mathbf{f}\_i(\mathbf{x}) = \bar{\mathbf{f}}(\mathbf{x}) + \tilde{\mathbf{f}}\_i(\mathbf{x})$$

where $\bar{\mathbf{f}}$ and $\{\tilde{\mathbf{f}}_i\}$ are zero-mean independent Gaussian random fields.
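The "average plus shift" kernel (9.17) is straightforward to evaluate once the base kernels are available. A minimal sketch (names ours), with the average and shift kernels passed in as callables:

```python
def multitask_kernel(x1, p, x2, q, lam_bar2, lam_til2, Kbar, Ktil):
    """'Average plus shift' kernel (9.17): a shared term weighted by
    lam_bar2 couples all tasks, while a task-specific term weighted by
    lam_til2 is active only when both inputs belong to the same task
    (the Kronecker delta delta_pq)."""
    shared = lam_bar2 * Kbar(x1, x2)
    shift = lam_til2 * Ktil(x1, x2) if p == q else 0.0
    return shared + shift
```

Note how inputs from different tasks are correlated only through the shared term, which is exactly what lets data from one task inform the others.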

## *9.3.2 Numerical Example: Real Pharmacokinetic Data*

Multi-task learning is now illustrated by considering a data set connected with xenobiotics administration in 27 human subjects [20]. Such administration can be seen as the input to a continuous-time linear dynamic system whose (measurable) output is the drug profile in plasma. In each subject, 8 measurements were collected at 0.5, 1, 1.5, 2, 4, 8, 12, 24 h after a bolus, an input which can be seen as a Dirac delta. Hence, one has to deal with a particular continuous-time system identification problem where noisy and direct samples of the impulse response are available.

**Fig. 9.14** *Multi-task learning* Xenobiotics concentration data after a bolus in 27 human subjects: average curve (thick) and individual curves

In this experiment, the noises are known to be Gaussian and heteroscedastic, i.e., their variances are not constant, being given by $\sigma_{ij}^2 = (0.1 y_{ij})^2$. The 27 experimental concentration profiles are displayed in Fig. 9.14, together with the average profile. In light of the number of subjects, such an average curve is a reasonable estimate of the average task $\bar{\mathbf{f}}$.

The whole data set consists of 216 pairs $(x_{ij}, y_{ij})$, for $i = 1, \ldots, 8$ and $j = 1, \ldots, 27$, and is split into an identification (training) set and a test set. Regarding training, a sparse sampling schedule is considered: only 3 measurements per subject are randomly chosen among the 8 available data. We adopt the multi-task estimator (9.12) to reconstruct all the continuous-time profiles. In view of the Gaussian and heteroscedastic nature of the noise, the losses are defined by

$$\mathcal{V}\_{ij}(a,b) = \frac{(a-b)^2}{\sigma\_{ij}^2}.$$

Regarding the function model, since humans are expected to give similar responses to the drug, quite close to an average function, the kernel (9.17) is adopted. In addition, it is known that in these experiments there is a greater variability for small values of $t$, followed by an asymptotic decay to zero. This motivates the use of a stable kernel to model both the average and the shifts. The model suggested in [20] is a cubic spline kernel under the time transformation

**Fig. 9.15** *Multi-task learning* Single task (left) and multi-task (right) estimates of some curves (thick line) with 95% confidence intervals (dashed lines) using only three data (circles) for each of the 27 subjects. The other five "unobserved" data (asterisks) are also plotted. Dotted line indicates the estimates obtained by using the full sampling grid


$$h(t) = \frac{3}{t+3}$$

which defines (9.17) through the correspondences

$$
\overline{K}(t,\tau) = \widetilde{K}\_p(t,\tau) = \frac{h(t)h(\tau)\min\{h(t),h(\tau)\}}{2} - \frac{(\min\{h(t),h(\tau)\})^3}{6}.
$$
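This time-transformed cubic spline kernel can be sketched as follows; the function names are ours. Since $h$ maps $[0, \infty)$ into $(0, 1]$ and decays to zero as $t$ grows, the kernel sections vanish at infinity, which is the mechanism behind stability:

```python
def h(t):
    """Time transformation h(t) = 3/(t+3), mapping [0, inf) to (0, 1]."""
    return 3.0 / (t + 3.0)

def stable_spline_cubic(t, tau):
    """Cubic spline kernel evaluated at the transformed times h(t),
    h(tau): h(t)h(tau)min(h(t),h(tau))/2 - min(h(t),h(tau))^3/6."""
    a, b = h(t), h(tau)
    m = min(a, b)
    return a * b * m / 2.0 - m ** 3 / 6.0
```

On the diagonal $t = \tau$ the kernel reduces to $h(t)^3/3$, so its value decreases monotonically from $1/3$ at $t = 0$ toward zero.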

One can check that this model induces a stable RKHS by using Corollary 7.2. In fact, the kernels are nonnegative-valued and the integral of a generic kernel section is

$$\int\_0^{+\infty} \left( \frac{h(t)h(\tau)\min\{h(t), h(\tau)\}}{2} - \frac{\left(\min\{h(t), h(\tau)\}\right)^3}{6} \right) d\tau$$

$$= \frac{1}{2(t+3)^3} \left( (27t+81)\log(\frac{t+3}{3}) + 13.5t + 67.5 \right)$$

and this result clearly implies

$$\int\_0^{+\infty} \int\_0^{+\infty} \left( \frac{h(t)h(\tau)\min\{h(t),h(\tau)\}}{2} - \frac{\left(\min\{h(t),h(\tau)\}\right)^3}{6} \right) d\tau dt < \infty.$$

The initial plasma concentration is known to be zero. Hence, a zero-variance virtual measurement at $t = 0$ was added for all tasks. The hyperparameters $\bar{\lambda}^2$ and $\tilde{\lambda}^2$ were then estimated via marginal likelihood maximization by exploiting the Bayesian interpretation of multi-task learning discussed above.

**Fig. 9.16** *Multi-task learning* Boxplots of the prediction errors (RMSE) obtained by the single-task approach and by the multi-task approach

The left and right panels of Fig. 9.15 report the results obtained by the single- and the multi-task approach, respectively, in 5 subjects. One can see the data and the estimated curves with their 95% confidence intervals obtained using the posterior variance (9.16). Each panel also shows the estimates obtained by employing the full sampling grid. It is apparent that the multi-task estimates are closer to these reference profiles. A good predictive capability with respect to the other five "unobserved" data is also visible. To better quantify this aspect, let $I_j^f$ and $I_j^r$ denote the full and reduced sampling grids in the $j$th subject. Let also $I_j = I_j^f \setminus I_j^r$, whose cardinality is 5. Then, for each subject, we also define the prediction error as

$$RMSE\_j^{MT} = \sqrt{\frac{\sum\_{i \in I\_j} \left( \mathbf{y}\_{ij} - \hat{\mathbf{f}}\_j(\mathbf{x}\_{ij}) \right)^2}{5}}$$

with the single-task $RMSE_j^{ST}$ defined in a similar way. Figure 9.16 then reports the boxplots of the 27 $RMSE$ values returned by the single- and multi-task estimates. The improvement in the prediction performance due to the kernel-based population approach is evident.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Index**

#### **A**

Absolute homogeneity, 224 Absolute Shrinkage And Selection Operator (LASSO), 69 Actuarial science, 168 Admissible estimator, 4, 140 Akaike's information theoretic criterion (AIC), 28, 38 Approximation property of ARX models, 22 ARMAX model, 21, 353 ARX model, 21, 353 Asymptotic covariance matrix of parameter estimates, 26

#### **B**

Basis expansion, 155 Bauer's maximum principle, 303, 305 Bayes estimate, 102 Bayes estimate (conditional mean), 99 Bayes estimate with improper prior, 110 Bayes factor, 128 Bayes formula, 96 Bayesian confidence intervals, 124 Bayesian function reconstruction, 120 Bayesian information criterion (BIC), 28, 38 Bayesian interpretation of multi-task learning, 363 Bayesian interpretation of regularization, 97, 267 Bayesian interpretation of ridge regression, 269 Bayesian interpretation of the James–Stein estimator, 105 Bayesian model probability, 270

Bayesian optimal model approximation/reduction, 115 Bayesian VARs, 168 Bias, 26, 37, 137 Bias and variance, 26 Bias and variance decomposition, 2 Bias space, 109, 212 Bias-variance trade-off, 8, 28, 139 BIBO stability, 274 BIC, 28, 38, 270 Black-box model, 21, 314 Block-oriented models, 335 Bochner theorem, 210 Box–Jenkins (BJ) model, 21 Boxplot, 269

#### **C**

Cauchy sequence, 185, 225 Cauchy–Schwarz inequality, 225 Change detection, 337 Chi-square random variable, 106 Choice of the loss, 200 Classification problem, 204 Closed-form expression of IIR estimate, 251 Closed graph theorem, 228 Closed set, 228 Closed subspace, 228 Column and row rank, 36 Compactness of the operator induced by a stable kernel, 281 Compact set, 228 Competitive kernels, 270 Complete orthonormal basis, 192 Complete space, 225

© The Editor(s) (if applicable) and The Author(s) 2022 G. Pillonetto et al., *Regularized System Identification*, Communications and Control Engineering, https://doi.org/10.1007/978-3-030-95860-2

Compound loss, 139 Compressive sensing, 70 Conditional mean, 99 Conditional median, 99 Condition number, 8, 43 Confidence ellipsoid, 101 Conjugate priors, 126 Consistency, 219 Continuous functional, 225 Continuous-time model, 24 Convergence rate of the stable spline estimator, 291 Convex envelope, 78 Convex envelope of the rank function, 79 Convex relaxation, 70 Corrected Akaike's criterion (AICc), 28 Covariance matrix estimation with low-rank structure, 80 Covariance of estimated frequency function, 27 Cramér–Rao inequality, 27 Credible region, 99 Cross-Validation (CV), 30, 38 Cubic spline estimator, 263 Cubic spline kernel, 212, 263 Curse of dimensionality, 319 CVX software, 79

#### **D**

Damped Newton iterations, 223 DC kernel, 151, 252, 344 Deep networks, 314, 334 Degrees of freedom, 272, 318 Diagonalized Bayesian estimation problem, 117 Diagonalized kernel, 191 Differential entropy, 111 Dimensionality reduction, 338 Discrete-time Fourier transform, 19 Distributed lag estimator, 163 Disturbances and noise sources, 19 Dropout or dilution, 335

#### **E**

Early termination, 335 Empirical Bayes approach, 107, 108, 124, 168, 266 Empirical risk, 218 Empirical Risk Minimization (ERM), 219 Entropy maximization, 111 Equivalent degrees of freedom, 54, 118, 152 Equivalent degrees of freedom and maximum likelihood estimate, 119 Equivalent degrees of freedom for the DC kernel, 153 Ergodic process, 125 Estimation data, 59 Estimation set, 30, 59 Euclidean inner product, 225 Euclidean norm, 36 Excess degrees of freedom, 68 Expected in-sample validation error, 64 Expected Validation Error (EVE), 58, 59 Experimental conditions, 17 Experiment design, 29

#### **F**

Fading memory concepts, 325 Fano's inequality, 290 Feature map, 280 Feature map and feature space, 208 Filtered frequency domain smoothness condition, 153 Finite Impulse Response (FIR) model, 7, 21, 248 Finite-trace kernels, 279 First-order spline kernel, 188, 212 First-order Stable Spline, 153 Frequency response, 19, 49 Frequentist paradigm, 95 Frobenius norm, 84 Full Bayesian approach, 107, 125 Full conditional distribution, 127 Full rank, 45 Full rank matrix, 36 Function estimation, 33

#### **G**

Gaussian kernel, 185, 210, 263, 319 Gaussian process realizations and RKHS, 261 Gaussian random field, 316, 363 Gaussian regression, 260, 314, 335 Generalization, 219 Generalization and consistency, 218 Generalization and consistency of kernel-based approaches, 221 Generalized Cross Validation (GCV), 63, 272, 273, 318 Generalized ridge regression, 168 Gibbs sampler, 125 Gibbs sampling, 125 Gibbs structure, 175


Green's function, 212 Grey box models, 23, 314

#### **H**

Hairdryer experiment, 353 Hammerstein model, 335 Hammerstein–Wiener identification, 336 Hankel matrix, 153, 159 Hankel nuclear norm, 160 Hankel prior, 161 Hat matrix, 53, 118, 273 Heteroscedastic noise, 365 Hilbert and Banach spaces, 225 Hilbert–Schmidt (HS) norm, 305 Hilbert–Schmidt (HS) operator, 304 Hinge loss, 204 Hold out cross-validation, 30, 40, 61 Huber estimation, 76 Huber loss, 76, 202 Hybrid models, 336 Hybrid stable spline, 338 Hyperparameter, 58, 104, 127, 136, 145, 266 Hyperparameter tuning, 58, 345 Hypothesis test, 30

#### **I**

Identification data, 17, 24, 34, 64, 135, 320, 347–349, 356 Identification method, 17 Identification set, 17, 24, 34, 64, 135, 320, 347–349, 356 Identity kernel, 187 Ill-conditioned matrix, 44, 47 Ill-conditioning, 8, 42 Ill-conditioning and system input, 344 Ill-conditioning in system identification, 47 Ill-posedness, 182 Improper priors, 109 Impulse response, 7 Inclusions between notable kernel classes, 279 Infinite Impulse Response (IIR) model, 247, 250 Influence matrix, 53, 273 Inner product, 224 Inner products and norms, 224 Innovations, 25 Input design, 29, 295 Input location, 182 Input space, 182 Instability of spline kernel, 277 Instability of the Gaussian kernel, 277

Instability of translation invariant kernels, 277, 279 Integral equation, 222 Integral operator approach, 222 Integrated random walk, 125 Interior point methods, 79, 223 Inverse chi-square random variable, 106 Inverse Gamma random variable, 126 Inverse problems, 135 Iteratively reweighted methods, 161 Iterative reweighting, 339

#### **J**

James–Stein estimator, 3, 139 James–Stein's MSE, 3 Jointly Gaussian vectors, 100

#### **K**

Karhunen-Loève decomposition, 155 Karush–Kuhn–Tucker (KKT) equations, 223 Kernel as similarity function, 185 Kernel-based multi-task learning, 362 Kernel-based population approach, 368 Kernel defined by a finite number of basis functions, 208 Kernel eigenvalues and eigenvectors, 191 Kernel hyperparameters tuning, 266 Kernel implicit encoding, 185, 198, 209 Kernel matrix, 198, 318 Kernel operator *LK* , 191, 278 Kernel ridge regression, 200 Kernel section, 183 Kernel selection using marginal likelihood, 270 Kernels for NARX models, 328 Kernels for nonlinear system identification, 319 Kernels generated by Fourier basis, 195 Kernel stability: necessary and sufficient condition, 275 Kernel stability: sufficient condition, 276 Kernel stability: system theoretic interpretation, 276 Kernel trick, 198, 209, 284 Kernel tuning, 266, 316 Kernel width, 263 K-fold cross-validation, 61 Ky-Fan norm, 160

## **L**

L1 norm, 69, 85, 339 L2 norm, 85 L<sup>∞</sup> norm, 85 L1 regularization, 69 Lagrange multiplier, 70 Lagrangian theory, 70, 221 Laguerre and Kautz basis functions, 259, 285 Laplace approximation, 109 LASSO, 69 LASSO for non-orthogonal regression, 71 LASSO for orthogonal regression, 70 Learning from examples, 218 Least squares, 2, 8, 33, 35 Leave-one-out cross-validation, 62 Likelihood function, 98 Linear and bounded functionals, 183, 228 Linear Gaussian model, 97, 101 Linear kernel, 205 Linear plus nonlinear kernel, 327 Linear regression, 22, 33 Linear spline kernel, 212 Linear state-space model, 24 Linear time-invariant (LTI) system, 18 Loss function and noise model, 260 Loss functions, 197 Low rank kernel approximation, 156

#### **M**

Manifold learning, 338 Margin, 204 Marginal likelihood, 108, 266, 345 Marginal likelihood estimate, 124, 145, 266, 318, 345 Markov chain, 125 Markov Chain Monte Carlo (MCMC), 125 Markov parameters, 160 Markov's theorem, 2 Matrix inversion lemma, 85 Matrix norm, 83 Maximum A Posteriori (MAP) estimate, 98 Maximum entropy principle, 98, 111 Maximum entropy priors, 111, 148 Maximum entropy stable priors, 150 Maximum Likelihood (ML), 25, 38 Maximum Likelihood Estimate (MLE), 25, 38 McMillan degree, 20 MDL, 28, 38 Mean Squared Error (MSE), 1, 37, 139 Mercer kernel, 183 Mercer representation of a stable RKHS, 283 Mercer theorem, 192 Metric space, 226 Minimal state space realization, 159 Minimax estimators, 286, 288 Minimum Description Length (MDL) criterion, 28, 38 Minimum norm solution, 46 Minnesota prior, 168 Model and predictor, 18, 313 Model approximation/reduction, 114 Model bias, 35 Model error modeling, 167 Model order, 10, 34 Model order selection, 28, 37 Model posterior probability, 270 Model prediction capability, 60 Model quality, 28 Model structure, 17, 18 Model validation, 29 Monomials' encoding, 319 Monte Carlo study with different ARMAX models, 356 Monte Carlo study with different OE models, 347 Moore–Aronszajn theorem, 184, 209 Moore–Penrose pseudoinverse, 46, 51 MSE decomposition, 26, 52 MSE matrix, 37, 52 Multi-task kernel: average plus shift, 364 Multi-task learning, 222, 362 Multivariate Gaussian, 99

#### **N**

NARX, 314 Necessary and sufficient conditions for generalization and consistency, 219 Networks of linear dynamic models, 335 Neural networks, 314, 334 NFIR, 314 Noise spectrum, 27 Nondegenerate Borel measure, 191 Nonlinear model, 24 Nonlinear random surface, 316 Nonlinear stable spline kernel, 326 Nonlinear state-space model, 24 Nonlinear system Identification, 313 Norm, 224 Normal equations, 36 Norm induced by inner product, 225 Nuclear norm, 84, 304 Nuclear norm heuristic, 79 Nuclear norm regularization, 78


Nuclear operator, 304 Null vector condition, 225 Numerical expansion of a stable kernel, 282 Numerical expansion of the first-order stable spline kernel, 281 Nyström method, 281

#### **O**

Occam's factor, 267 Occam's razor principle, 267 One-step-ahead predictor, 20, 313 One-step-ahead predictor for ARX models, 22 Optimal regularization matrix, 57, 102 Optimality in order, 288 Optimism and equivalent degrees of freedom, 65 Oracle and test set, 321, 322, 348, 358 Oracle-based estimation procedure, 8, 263 Oracle-based procedure, 345 Orthogonality, 228 Orthogonality property of Bayes estimate, 102 Orthonormal basis expansion, 156 Orthonormal basis in ℓ<sup>2</sup>, 284 Outliers, 76 Output Error (OE) model, 148, 166, 247 Output Error (OE) model in continuous time, 253 Output kernel matrix, 267, 272, 318 Overfitting, 28, 40, 50, 182

#### **P**

Parametrized regularization matrix, 58 Parseval's theorem, 145, 155 Partial realization problem, 160 Particle Gibbs samples, 336 Particle Markov chain Monte Carlo, 336 PEM asymptotic properties, 26 Penalty function, 136 Piecewise affine models, 337 Pointwise evaluator, 183 Polynomial kernel, 210, 319 Polynomial regression, 34, 39, 120 Population approaches, 362 Positive definite kernel, 183 Posterior distribution, 98 Posterior variance, 101 Power spectrum, 49 Predicted Residual Error Sum of Squares (PRESS), 62, 273 Prediction error, 24

Prediction Error Method (PEM), 23, 25 Prediction fit, 25, 357 Prediction risk, 318 PRESS derivation, 86 Prior distribution, 96, 97, 137 Projection pursuit, 338 Projection theorem, 228 Proportionality principle, 164 Pseudoinverse, 46

#### **Q**

Q-boundedness of a kernel, 278 QR factorization, 83 Quadratic loss, 200 Quadratic polynomial kernel, 324

#### **R**

Radial basis kernels, 210 Random walk, 120 Rank-deficient matrix, 45 Real pharmacokinetic data example, 364 Regression function/optimal predictor, 215 Regression matrix, 36 Regularization design, 56 Regularization in quadratic form, 50, 51 Regularization in RKHS, 196 Regularization in RKHS and Bayesian estimation, 260 Regularization matrix, 51 Regularization network, 200 Regularization network as projection, 201 Regularization network for linear system identification, 258 Regularization network representation using stable kernels, 283 Regularization network using Laguerre or Kautz basis functions, 259 Regularization parameter, 5, 50, 96, 106, 116, 197, 214, 263 Regularization parameter and condition number, 52 Regularization term, 50 Regularized ARX, 355 Regularized estimate in RKHS as Bayes estimate, 261, 315 Regularized FIR, 248, 289 Regularized IIR, 251, 263 Regularized impulse response estimation in RKHS, 252 Regularized Least Squares (ReLS), 5, 33, 50, 106 Regularized NARX, 315

Regularized NFIR, 315 Regularized Volterra models, 331 Regularizer, 5 Relationship between *H* and *L*<sub>2</sub><sup>μ</sup>, 193 Representation of stable kernels, 281 Representer theorem, 197 Representer theorem for continuous-time linear system identification, 254 Representer theorem for discrete-time linear system identification, 251 Representer theorem for nonlinear system identification, 315 Representer theorem with bias space, 213 Representer theorem with linear and bounded functionals, 199 Reproducing kernel, 184 Reproducing Kernel Hilbert Space (RKHS), 183 Reproducing property, 184 Residual analysis, 29 Ridge regression, 10, 51, 136, 208, 269 Ridge regression as Bayesian estimation, 116 Riesz representation theorem, 184, 199, 229 Risk functional, 218 RKHS induced by a diagonalized kernel, 195 RKHS induced by kernel operations, 190 RKHS induced by kernel sampling, 190 RKHS induced by kernel sums or products, 190 RKHS induced by Mercer kernel, 184 RKHS induced by the Stable Spline kernel, 257 RKHS map, 209 RKHS inclusions in general Lebesgue spaces, 278 RKHS stability using Mercer expansions: necessary and sufficient conditions, 284 Robot arm experiment, 350 Robust regression, 76, 202 Robust statistics, 202

#### **S**

Sample complexity, 286
Sample complexity and minimax properties of the stable spline estimator, 290
Second-order spline kernel, 212
Semidefinite programming, 79
Separable space, 227
Sherman–Morrison–Woodbury formula, 85
Shift operator, 19, 248
Signal segmentation, 337
Simplest unfalsified model, 29
Singular Value Decomposition (SVD), 42
Singular values, 43
Smoothness in the frequency domain, 146, 165
Sobolev norm, 188, 254
Sobolev space, 188, 194, 222
Space ℓ<sup>1</sup> of absolutely summable sequences, 226
Space ℓ<sup>2</sup> of square summable real sequences, 225
Space ℓ<sup>∞</sup> of bounded sequences, 227
Space *C* of continuous functions, 184, 227
Space *L*<sup>1</sup> of absolutely integrable functions, 226
Space *L*<sup>2</sup> of square summable functions, 186, 226
Space *L*<sup>∞</sup> of essentially bounded functions, 227
Sparse estimation, 70, 205
Sparsity and variable selection, 338
Sparsity inducing regularization, 73
Spectral decomposition, 155
Spectral decomposition of stable kernels, 281
Spectral decomposition of Stable Spline kernel, 255
Spectral feature map, 209
Spectral map, 192
Spectral norm, 84
Spectral representation of RKHS, 191
Spectral theorem, 281
Spline and Stable Spline kernel, 257
Spline estimator, 214
Spline kernel expansion, 193
Spline kernels, 211
Spline norm, 197
Square summable kernel, 279
Stable diagonal kernel, 145
Stable kernels, 247, 367
Stable prior, 145
Stable RKHS, 274, 279
Stable Spline (SS) kernel, 148, 252, 255, 263, 269
Stable Spline kernel of higher order, 257
Stable Spline norm, 252, 257, 296
Static nonlinearity, 314
Stationary distribution, 125
Statistical consistency of regularization networks, 217
Statistical learning theory, 218
Stein's effect, 3, 139
Stein's estimation in non-orthogonal setting, 6
Stein's lemma, 12, 67
Stein's Unbiased Risk Estimator (SURE), 66, 271, 318
Stochastic embedding, 166
Strictly positive definite kernel, 183
Strong mixing condition, 217
Subderivative, 71
Subdifferential, 71
Subjective/Bayesian estimation paradigm, 95, 96
Subjective probability, 97
Subselection of regressors, 338
Subspace, 227
Summable kernel, 279
Sup-norm, 227
Support vector classification, 204
Support vector regression, 202
Support vector regression for linear system identification, 259

#### **T**

Takenaka–Malmquist orthogonal basis functions, 284
Taylor expansion and Volterra models, 331
TC, SS and DC kernel, 344
Temperature prediction experiment, 358
Test set, 138, 320, 330, 347, 350, 353, 356, 365
Tikhonov regularization, 11, 222
Time series, 19
Trace norm, 160
Training data, 17, 24, 33, 34, 64, 135, 218, 320, 347, 349, 356
Training set, 17, 24, 33, 34, 59, 64, 135, 218, 320, 347, 349, 356
Transfer function, 19, 136
Translation invariant kernel, 194, 210
Triangle inequality, 225
Truncated Mercer expansions, 281
Truncated SVD, 46
Tuned Correlated (TC) kernel, 146, 252

## **U**

Unbiased estimate, 26
Unbiased estimation of EVE, 66, 272
Unbiased estimator, 2
Uniform Glivenko–Cantelli class, 220
Uniform norm, 227
Universality, 211
Universal kernel, 211, 222

#### **V**

V<sub>γ</sub>-dimension, 220
Validation data, 38
Validation process, 17
Validation set, 30, 59
Vapnik–Chervonenkis (VC) dimension, 220
Vapnik's ε-insensitive loss, 203
Variable selection, 70
Variance, 37
Vector norm, 83
Vector space, 224
Vector-valued RKHSs, 222
Volterra model, 319
Volterra series, 319

#### **W**

Well-conditioned matrix, 44
Wiener model, 335

#### **Y**

YALMIP software, 79

#### **Z**

0-1 loss, 160, 204
Z-transform, 19