**Intelligent Control and Learning Systems 3**

Jing Wang Jinglin Zhou Xiaolu Chen

# Data-Driven Fault Detection and Reasoning for Industrial Monitoring

# **Intelligent Control and Learning Systems**

Volume 3

#### **Series Editor**

Dong Shen, School of Mathematics, Renmin University of China, Beijing, China

The Springer book series Intelligent Control and Learning Systems addresses the emerging advances in intelligent control and learning systems from both mathematical theory and engineering application perspectives. It is a series of monographs and contributed volumes focusing on the in-depth exploration of learning theory in control such as iterative learning, machine learning, deep learning, and others sharing the learning concept, and their corresponding intelligent system frameworks in engineering applications. This series is featured by the comprehensive understanding and practical application of learning mechanisms. This book series involves applications in industrial engineering, control engineering, and material engineering, etc.

The Intelligent Control and Learning System book series promotes the exchange of emerging theory and technology of intelligent control and learning systems between academia and industry. It aims to provide a timely reflection of the advances in intelligent control and learning systems. This book series is distinguished by the combination of the system theory and emerging topics such as machine learning, artificial intelligence, and big data. As a collection, this book series provides valuable resources to a wide audience in academia, the engineering research community, industry and anyone else looking to expand their knowledge in intelligent control and learning systems.

More information about this series at https://link.springer.com/bookseries/16445

# Data-Driven Fault Detection and Reasoning for Industrial Monitoring

Jing Wang Department of Automation, School of Electrical and Control Engineering North China University of Technology Beijing, China

Xiaolu Chen College of Engineering Peking University Beijing, China

Jinglin Zhou College of Information Science and Technology Beijing University of Chemical Technology Beijing, China

ISSN 2662-5458 ISSN 2662-5466 (electronic) Intelligent Control and Learning Systems ISBN 978-981-16-8043-4 ISBN 978-981-16-8044-1 (eBook) https://doi.org/10.1007/978-981-16-8044-1

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

# **Preface**

After decades of development, automation and intelligence have increased significantly in the process industry, and key technologies continue to achieve breakthroughs. In the era of the "New Industrial Revolution", it is of great significance to use modern information technology to promote intelligent manufacturing with the goals of safe, efficient, and green production. Clearly, safety has always been the lifeline of intelligent and optimized manufacturing in the process industries.

With the increasing requirements for production safety and quality improvement, process monitoring and fault diagnosis have gained great attention in academic research and in industrial applications. The widespread use of sensor networks and distributed control systems has facilitated access to a wealth of process data. How to effectively use the data generated during the production process, together with process mechanism knowledge, for process monitoring and fault diagnosis is a topic worth exploring for large and complex process industrial systems. Fruitful academic results have been produced recently and widely applied in actual production processes.

The authors of this book have devoted themselves to the theoretical and applied research work on data-driven industrial process monitoring and fault diagnosis methods for many years. They are deeply concerned with the flourishing development of data-driven fault diagnosis techniques. This book focuses on both multivariate statistical process monitoring (MSPM) and Bayesian inference diagnosis. It introduces the basic multivariate statistical modeling methods, as well as the authors' latest achievements around the practical industrial needs, including multi-transition process monitoring, fault classification and identification, quality-related fault detection, and fault root tracing.

The main contributions given in this book are as follows:

(1) Soft-transition-based high-precision monitoring for multi-stage batch processes: Most batch processes have several operation stages with different process characteristics. In addition, their data present obvious three-dimensional features with strong nonlinearity and time variability, so it is difficult to apply multivariate statistical methods directly to the monitoring of batch processes. This book proposes a soft-transition-based fault detection method. First, a two-step stage division method based on Support Vector Data Description (SVDD) is given; then a dynamic soft-transition model of the transition stages is constructed; finally, monitoring in the original measurement space is given for each stage. Compared with traditional methods, the advantages of the proposed method are reflected in the following techniques: improved soft-transition process design, statistic decomposition, and fusion indicator monitoring. Together they greatly increase the accuracy of batch process fault detection.

(2) Fault classification and identification for batch processes with variable production cycles: Batch processes are inevitably subject to changes in initial conditions and the external environment, which can cause changes in production cycles. However, current monitoring methods for batch processes generally require equal production cycles and a complete production trajectory. Therefore, variable cycles and the estimation of unknown values in the complete trajectory become the bottleneck for improving diagnostic performance. This book gives a fault diagnosis method for batch processes based on kernel Fisher envelope analysis. It builds envelope surface models for the normal condition and all known fault conditions, respectively. An online fault diagnosis strategy is then proposed based on these surface models. Further, the fusion of kernel Fisher envelope analysis and PCA is proposed for fault diagnosis of batch processes. It effectively solves the fault classification and identification of unequal-length batch production processes.

(3) Quality-related fault detection with fusion of global and local features: The key to manufacturing is to guarantee the final product quality, yet quality information is difficult or extremely costly to acquire in real time. Therefore, it is of great practical value to monitor the process variables that have an impact on the final quality output in order to further enable quality-related fault detection and diagnosis. This book proposes an idea of quality-related projection with the fusion of global and local features to obtain the correlation between quality variables and process variables. It is well known that the partial least squares projection algorithm looks for global structural change information along the direction of process covariance maximization. The locality preserving projection, or manifold learning approach, can maintain the local neighborhood structure exactly and achieve nonlinear mapping by linear approximation. The proposed fusion approach constructs potential geometric structures containing both global and local information, and extracts meaningful low-dimensional structural information to represent the relationship between high-dimensional process variables and quality data. Thus, it effectively achieves the detection of quality-related faults for strongly nonlinear and strongly dynamic processes.

(4) Bayesian fault diagnosis and root tracing combined with process mechanism: Due to the complex interrelationships among system components, the same fault source may manifest differently in different process variables. The traditional contribution graph in multivariate statistical monitoring is inefficient for fault root tracing. This book proposes an inference model for expressing uncertain knowledge, named the probabilistic causal graph model, based on probability theory and graph theory. It intuitively and accurately reveals the qualitative and quantitative relationships between process variables. A framework for fault diagnosis and root tracing based on the proposed model is then given. Different modeling and inference techniques are given for discrete and continuous systems, respectively, so the inference can perform real-time dynamic analysis of discrete alarm states or continuous process variables. The forward inference predicts univariate and multivariate alarms or fault events, while the backward inference implements accurate fault root tracing and localization.

The book consists of 14 chapters divided into four parts:

Part I, Chaps. 1–4, is devoted to the mathematical background. Chapter 1 gives basic knowledge about process monitoring measures, common detection indicators, and their control limits. Chapters 2–3 focus on the basic multivariate statistical methods, including principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA), canonical variable analysis (CVA), and Fisher discriminant analysis (FDA). To help readers learn the above theoretical methods, Chap. 4 gives a detailed introduction to the Tennessee Eastman (TE) continuous chemical simulation platform and the penicillin semi-batch reaction simulation platform. Readers can collect appropriate process data and conduct corresponding simulation experiments on these simulation platforms.

Part II, Chaps. 5–8, is organized around main contributions 1 and 2 of this book. Various improved fault detection and identification methods are given for batch processes. Chapters 5–6 address contribution 1, aiming at high-precision process monitoring of multi-stage processes, based on the Support Vector Data Description (SVDD) soft-transition process and a fusion index design based on statistic decomposition. Chapters 7–8 address contribution 2, aiming at fault identification for complex batch processes with unequal cycles, based on kernel Fisher envelope surface analysis and locally linear embedded Fisher discriminant analysis, respectively.

Part III, Chaps. 9–12, is organized around main contribution 3 of this book. To improve the statistical model between process variables and quality variables with nonlinear correlation, two different strategies are considered. First, under the idea of global and local feature fusion, the manifold structure is considered to extract the nonlinear correlations between them effectively. A unified framework of spatial optimization projection is constructed based on the effective fusion of two types of performance indices: global covariance maximization and local adjacency structure minimization. A variety of different performance combinations are given in Chaps. 9–11: QGLPLS, LPPLS, and LLEPLS, respectively. The other strategy is to treat the nonlinearity as uncertainty; a robust L1-PLS is then proposed in Chap. 12. It enhances the robustness of the PLS method based on latent structure regression with the L1 norm. The effectiveness and applicability of the above combination methods are discussed.

Part IV, Chaps. 13–14, is organized around main contribution 4 of the book. The known industrial process flow structure is integrated with industrial data analytics, and the qualitative causal relationships among process variables are established by multivariate causal analysis methods. The quantitative causal dependencies among process variables are characterized by conditional probability density estimation under this network structure. Thus, a Bayesian causal probability graph model of complex systems is realized for process variable failure prediction and reverse tracing. The specific implementations of Bayesian inference in discrete alarm variable analysis and continuous process variable analysis, respectively, are given in this book.

Fault detection and diagnosis (FDD) is one of the core topics in modern complex industrial processes. It attracts the attention of scientists and engineers from various fields such as control, mechanics, mathematics, engineering, and automation. This book gives an in-depth study of various data-driven analysis methods and their applications in process monitoring, especially for data modeling, fault detection, fault classification, fault identification, and fault reasoning. Oriented toward industrial big data analytics and industrial artificial intelligence, this book integrates multivariate statistical analysis, Bayesian inference, machine learning, and other intelligent analysis methods. This book attempts to establish a basic framework of complex industrial process monitoring suitable for various types of industrial data processing, and gives a variety of fault detection and diagnosis theories, methods, algorithms, and applications. It provides data-driven fault diagnosis techniques of interest to advanced undergraduate and graduate students and researchers working on automation and industrial safety. It also provides various applications of engineering modeling, data analysis, and processing methods for related practitioners and engineers.

Beijing, China August 2021

Jing Wang jwang@ncut.edu.cn

Jinglin Zhou jinglinzhou@mail.buct.edu.cn

Xiaolu Chen chenxiaolu@pku.edu.cn

**Acknowledgements** The authors thank the National Natural Science Foundation of China (Grants No. 61573050, 61973023, 62073023, and 61473025) for funding the research over the past 8 years. The authors also thank Li Liu, Huatong Wei, Bin Zhong, Ruixuan Wang, and Shunli Zhang, graduate students from the College of Information Science and Technology, Beijing University of Chemical Technology, for their hard work in system design and programming.

# **Contents**







# **Abbreviations**




# **Chapter 1 Background**

# **1.1 Introduction**

Fault detection and diagnosis (FDD) technology is a scientific field that emerged in the middle of the twentieth century with the rapid development of science and data technology. It manifests itself as the accurate sensing of abnormalities in the manufacturing process, or the health monitoring of equipment, sites, or machinery at a specific operating site. FDD includes abnormality monitoring, abnormal cause identification, and root cause location. Through qualitative and quantitative analysis of field processes and historical data, operators and managers can detect alarms that affect product quality or cause major industrial accidents. This helps cut off failure paths and repair abnormalities in a timely manner.

# *1.1.1 Process Monitoring Method*

In general, the FDD technique is divided into several parts: fault detection, fault isolation, fault identification, and fault diagnosis (Hwang et al. 2010; Zhou and Hu 2009). Fault detection determines whether a fault has appeared. Once a fault (or error) has been successfully detected, damage assessment needs to be performed, i.e., fault isolation (Yang et al. 2006). Fault isolation lies in determining the type, location, magnitude, and time of the fault (i.e., the observed out-of-threshold variables). It should be noted that fault isolation does not mean isolating specific components of a system to stop errors from propagating; in this sense, fault identification may be the better term, as it also covers determining how the fault changes over time. Isolation and identification are commonly used in the FDD process without strict distinction. Fault diagnosis determines the cause of the observed out-of-threshold variables; in this book it is therefore called fault root tracing. During fault tracing, efforts are made to locate the source of the fault and find the root cause.

<sup>©</sup> The Author(s) 2022

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_1

**Fig. 1.1** Classification of fault diagnosis methods

FDD involves control theory, probability statistics, signal processing, machine learning, and many other research areas. Many effective methods have been developed, and they are usually classified into three categories, knowledge-based, analytical, and data-driven (Chiang et al. 2001). Figure 1.1 shows the classification of fault diagnosis methods.

#### **(1) Analytical Method**

The analytical model of an engineering system is obtained from mathematical and physical mechanisms. Analytical model-based methods monitor the process in real time according to mathematical models often constructed from first principles and physical characteristics. Most analytical measures involve state estimation (Wang et al. 2020), parameter estimation (Yu 1997), parity space (Ding 2013), and analytical redundancy (Suzuki et al. 1999). The analytical method appears relatively simple and is usually applied to systems with a relatively small number of inputs, outputs, and states. It is impractical for modern complex systems, since it is not easy to establish an accurate mathematical model in the presence of characteristics such as nonlinearity, strong coupling, uncertainty, and ultra-high-dimensional inputs and outputs.

#### **(2) Knowledge-Based Method**

Knowledge-based fault diagnosis does not require an accurate mathematical model. Its basic idea is to use expert knowledge or qualitative relationships to develop fault detection rules. Common approaches mainly include fault tree diagnosis (Hang et al. 2006), expert system diagnosis (Gath and Kulkarn 2014), directed graphs, fuzzy logic (Miranda and Felipe 2015), etc. The application of knowledge-based models strongly relies on complete empirical process knowledge. Once information about the diagnosed object is known from expert experience and historical data, a variety of rules for appropriate reasoning are constructed. However, the accumulation of process experience and knowledge is time-consuming and even difficult. Therefore, this method is not universal and can only be applied to engineering systems with which people are familiar.

#### **(3) Data-Driven Method**

Data-driven methods arose with modern information technology. In fact, they involve a variety of disciplines and techniques, including statistics, mathematical analysis, and signal processing. Generally speaking, industrial data in the field are collected and stored by intelligent sensors. Data analysis can mine the hidden information contained in the data, establish the data model between input and output, help the operator monitor the system status in real time, and achieve the purpose of fault diagnosis. Data-driven fault diagnosis methods can be divided into three categories: signal processing-based, statistical analysis-based, and artificial intelligence-based (Zhou et al. 2011; Bersimis et al. 2007). Their commonality is that high-dimensional variables are projected into a low-dimensional space while the key features of the system are extracted. The data-driven method does not require an accurate model and is therefore more universal.

Both analytical techniques and data-driven methods have their own merits, but also certain limitations. Therefore, a fusion-driven approach combining mechanistic knowledge and data can compensate for the shortcomings of a single technique. This book explores fault detection, fault isolation/identification, and fault root tracing problems mainly with multivariate statistical analysis as the mathematical foundation.

## *1.1.2 Statistical Process Monitoring*

Fault detection and diagnosis based on multivariate statistical analysis has developed rapidly, and a large number of results have emerged recently. This class of method, based on historical data, uses multivariate projection to decompose the sample space into a low-dimensional principal component subspace and a residual subspace. The corresponding statistics are then constructed to monitor the observation variables. Thus, this method is also called the latent variable projection method.

#### **(1) Fault Detection**

The common multivariate statistical fault detection methods include principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA), canonical variable analysis (CVA), and their extensions. Among them, PCA and PLS, as the most basic techniques, are usually used for monitoring processes with Gaussian distributions. These methods usually use Hotelling's T<sup>2</sup> and the Squared Prediction Error (SPE) statistic to detect variation in process information.

It is worth noting that these techniques extract process features by maximizing the variance or covariance of process variables. They only utilize the information of first-order statistics (mathematical expectation) and second-order statistics (variance and covariance) while ignoring higher order statistics (higher order moments and higher order cumulants). In practice, however, few processes actually follow the Gaussian distribution. The traditional PCA and PLS are unable to extract effective features from non-Gaussian processes because they omit the higher order statistics, which reduces the monitoring efficiency.

Numerous practical production conditions, such as strong nonlinearity, strong dynamics, and non-Gaussian distribution, make it difficult to directly apply the basic multivariate monitoring methods. To solve these practical problems, various extended multivariate statistical monitoring methods have flourished. For example, to deal with the process dynamics, dynamic PCA and dynamic PLS methods have been developed, which take into account the autocorrelation and cross-correlation among variables (Li and Gang 2006). To deal with the non-Gaussian distribution, independent component analysis (ICA) methods have also been developed (Yoo et al. 2004). To deal with the process nonlinearity, some extended kernel methods such as kernel PCA (KPCA), kernel PLS (KPLS), and kernel ICA (KICA) have emerged (Cheng et al. 2011; Zhang and Chi 2011; Zhang 2009).
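The dynamic extensions mentioned above work by augmenting each sample with its time-lagged predecessors before applying the static method. A minimal sketch of this lagged-matrix construction (the helper `lagged_matrix` and the lag length are illustrative, not taken from a specific library):

```python
import numpy as np

def lagged_matrix(X, lags):
    """Augment each sample with its `lags` previous samples (dynamic PCA idea).

    X : (n, m) data matrix; returns an (n - lags, m * (lags + 1)) matrix
    whose rows are [x_t, x_{t-1}, ..., x_{t-lags}].
    """
    n, m = X.shape
    return np.hstack([X[lags - i : n - i] for i in range(lags + 1)])

X = np.arange(12.0).reshape(6, 2)   # 6 samples, 2 variables
Xd = lagged_matrix(X, lags=2)
print(Xd.shape)                     # (4, 6): static PCA on Xd captures dynamics
```

Static PCA applied to `Xd` then models auto- and cross-correlations across the lag window.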

#### **(2) Fault Isolation or Identification**

A common approach for separating faults is the contribution plot. It is an unsupervised approach that uses only the process data to find faulty variables and does not require other prior knowledge. Successful separation based on the contribution plot relies on the following properties: (1) each variable has the same mean contribution under normal operation and (2) the faulty variables have very large contribution values under fault conditions, compared with other normal variables. Alcala and Qin summarized the common contribution plot techniques, such as complete decomposition contributions (CDC), partial decomposition contributions (PDC), and reconstruction-based contributions (RBC) (Alcala and Qin 2009, 2011).

However, the contribution plot usually suffers from the smearing effect, a situation in which non-faulty variables show larger contribution values while the contribution values of the faulty variables are smaller. Westerhuis et al. pointed out that one variable may affect other variables during the execution of PCA, thus creating a smearing effect (Westerhuis et al. 2000). Kerkhof et al. analyzed the smearing effect in three types of contribution indices, CDC, PDC, and RBC, respectively (Kerkhof et al. 2013). They showed, from the perspective of mathematical decomposition, that the smearing effect is caused by the compression and expansion operations on variables, so it cannot be avoided during the transformation of data from the measurement space to the latent variable space. In order to mitigate the smearing effect, several new contribution indices have been given based on dynamically calculating the average value of the current and previous residuals (Wang et al. 2017).
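As a concrete illustration of contribution-based isolation, the sketch below computes per-variable contributions to the SPE as the squared entries of the residual vector. The synthetic data, the one-component PCA model, and the injected fault are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(200)   # variables 0 and 1 correlated
Xc = X - X.mean(axis=0)

# One-component PCA model of normal operation.
S = Xc.T @ Xc / (len(Xc) - 1)
w, V = np.linalg.eigh(S)
P = V[:, [np.argmax(w)]]                 # loading of the leading PC

x = np.array([0.0, 0.0, 3.0, 0.0])       # fault injected on variable index 2
x_res = x - P @ (P.T @ x)                # residual part of the sample
spe_contrib = x_res ** 2                 # per-variable SPE contributions
print(int(np.argmax(spe_contrib)))       # the faulty variable dominates
```

Here the faulty variable is nearly orthogonal to the retained PC, so its contribution stands out; with stronger correlations the smearing effect discussed above would spread contribution onto healthy variables.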

If the historical data collected have been previously categorized into separate classes where each class pertains to a particular fault, fault isolation or identification can be transformed into a pattern classification problem. Statistical methods, such as Fisher's discriminant analysis (FDA) (Chiang et al. 2000), have also been successfully applied in industrial practice to solve this problem. FDA assigns the data into two or more classes via three steps: feature extraction, discriminant analysis, and maximum selection. If the historical data have not been previously categorized, unsupervised cluster analysis may classify the data into separate classes accordingly (Jain et al. 2000), e.g., the K-Means algorithm. More recently, neural network and machine learning techniques imported from statistical analysis theory have been receiving increasing attention, such as the support vector data description (SVDD) covered in this book.
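The two-class FDA idea can be sketched in a few lines: find the direction that maximizes between-class separation relative to within-class scatter, then classify a new sample by projecting onto that direction. The synthetic data and the simple midpoint threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
normal = rng.standard_normal((100, 2))                     # labeled normal data
fault  = rng.standard_normal((100, 2)) + np.array([4.0, 4.0])  # labeled fault data

m0, m1 = normal.mean(axis=0), fault.mean(axis=0)
Sw = np.cov(normal.T) + np.cov(fault.T)        # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)               # Fisher discriminant direction

threshold = w @ (m0 + m1) / 2                  # midpoint decision rule
x_new = np.array([3.8, 4.2])
is_fault = bool(w @ x_new > threshold)
print(is_fault)                                # True: classified as the fault class
```

With more than two classes, the same idea generalizes to a generalized eigenvalue problem between the between-class and within-class scatter matrices.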

#### **(3) Fault Diagnosis or Root Tracing**

Fault root tracing based on the Bayesian network (BN) is a typical diagnostic method that combines mechanism knowledge and process data. The BN, also known as a probabilistic network or causal network, is a typical probabilistic graphical model. Since the end of the last century, it has gradually become a research hotspot due to its superior theoretical properties in describing and reasoning about uncertain knowledge. The BN was first proposed in 1988 by Pearl, a professor at the University of California, to solve the problem of uncertain information in artificial intelligence. A BN represents the relationships between causal variables in the form of a directed acyclic graph. In the fault diagnosis process of an industrial system, the observed variables are used as nodes containing all the information about the equipment, control quantities, and faults in the system. The causal connection between variables is quantitatively described as a directed edge with a conditional probability distribution function (Cai et al. 2017). The fault diagnosis procedure with BNs consists of BN structure modeling, BN parameter modeling, BN forward inference, and BN inverse tracing.
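The forward and inverse inference steps can be illustrated on the smallest possible network, a single Fault → Alarm edge; all probability values below are made up for illustration:

```python
# Two-node Bayesian network: Fault -> Alarm, with an illustrative CPT.
p_fault = 0.01                               # prior P(F = 1)
p_alarm_given = {True: 0.95, False: 0.05}    # P(A = 1 | F)

# Forward inference: predict the alarm probability by marginalizing over F.
p_alarm = (p_alarm_given[True] * p_fault
           + p_alarm_given[False] * (1 - p_fault))

# Inverse tracing: once the alarm fires, update the fault belief via Bayes' rule.
p_fault_given_alarm = p_alarm_given[True] * p_fault / p_alarm
print(round(p_alarm, 4), round(p_fault_given_alarm, 4))
```

Even this toy case shows the asymmetry the text describes: a reliable alarm (95% detection) still implies a modest posterior fault probability when the prior is small, which is why quantitative inverse inference matters for root tracing.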

In addition to probabilistic graphical models such as the BN, other causal graphical models have also developed vigorously. These approaches aim at determining the causal relationships among the operating units of a system based on hypothesis testing (Zhang and Hyvärinen 2008; Shimizu et al. 2006). A generative model (linear or nonlinear) is built to explain the data generation process, i.e., the causality. The direction of causality is then tested under certain assumptions. The most typical one is the linear non-Gaussian acyclic model (LiNGAM) and its improved versions (Shimizu et al. 2006, 2011). It has the advantage of determining the causal structure of variables without pre-specifying their causal order. All these results serve as a driving force for the development of probabilistic graphical models and play an increasingly important role in the field of fault diagnosis.

## **1.2 Fault Detection Index**

The effectiveness of data-driven measures often depends on the characterization of process data changes. Generally, there are two types of changes in process data: common-cause and special-cause. Common-cause changes are entirely caused by random noise, while special-cause changes refer to all data changes that are not caused by common causes, such as impulse disturbances. Common process control strategies may be able to remove most of the special-cause data changes, but they cannot remove the common-cause changes inherent in the process data. As process data changes are inevitable, statistical theory plays an important role in most process monitoring programs.

By defining faults as abnormal process conditions, it is easy to see that the application of statistical theory to process monitoring actually relies on a reasonable assumption: unless the system fails, the data change characteristics remain almost unchanged. This means that the characteristics of data fluctuations, such as mean and variance, are repeatable for the same operating conditions, although the actual values of the data may not be very predictable. The repeatability of statistical attributes allows thresholds for certain measures to be determined automatically, effectively defining out-of-control conditions. This is an important step toward automating the process monitoring program. Statistical process monitoring (SPM) relies on the use of normal process data to build the process model. Here, we discuss the main elements of SPM, i.e., the fault detection indices.
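The automatic threshold determination mentioned above can be sketched as taking an empirical quantile of a monitoring statistic computed over normal operating data. The chi-square draw below merely stands in for any in-control statistic; the 99% level and the new sample value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
normal_stats = rng.chisquare(df=2, size=5000)   # statistic values under normal operation

# Empirical 99% control limit derived from the normal operating data.
limit = np.quantile(normal_stats, 0.99)

new_stat = 15.0                                 # statistic of a new online sample
print(new_stat > limit)                         # exceeding the limit raises an alarm
```

In Sect. 1.2 this idea reappears with distribution-based (rather than purely empirical) control limits for the T<sup>2</sup> and SPE indices.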

In multivariate process monitoring, the variability in the residual subspace (RS) is typically represented by the squared sum of the residuals, namely the Q statistic or the squared prediction error (SPE). The variability in the principal component subspace (PCS) is typically represented by Hotelling's T<sup>2</sup> statistic. Owing to the complementary nature of the two indices, combined indices have also been proposed for fault detection and diagnosis. Another statistic that measures the variability in the RS is Hawkins' statistic (Hawkins 1974). The global Mahalanobis distance can also be used as a combined measure of variability in the PCS and RS. Individual tests of PCs can also be conducted (Hawkins 1974), but they are often not preferred in practice, since one has to monitor many statistics. In this section, we summarize several fault detection indices and provide a unified representation.

# *1.2.1 T<sup>2</sup> Statistic*

Consider the sampled data with *m* observation variables *x* = [*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x*<sub>m</sub>] and *n* observations for each variable. The data are stacked into a matrix *X* ∈ *R*<sup>n×m</sup>, given by

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}. \tag{1.1}$$

First, the matrix *X* is scaled to zero mean; the sample covariance matrix is then equal to

$$\mathbf{S} = \frac{1}{n-1} \mathbf{X}^{\mathsf{T}} \mathbf{X}.\tag{1.2}$$

An eigenvalue decomposition of the matrix *S* is performed:


$$S = \bar{P}\bar{\Lambda}\bar{P}^{\mathrm{T}} = \left[P \ \tilde{P}\right] \operatorname{diag}\{\Lambda, \tilde{\Lambda}\} \left[P \ \tilde{P}\right]^{\mathrm{T}}. \tag{1.3}$$

This reveals the correlation structure of the covariance matrix *S*, where *P̄* is orthogonal (*P̄* *P̄*<sup>T</sup> = *I*, in which *I* is the identity matrix) (Qin 2003) and

$$\begin{aligned} \Lambda &= \frac{1}{n-1} T^{\mathrm{T}} T = \operatorname{diag}\{\lambda\_1, \lambda\_2, \ldots, \lambda\_k\} \\ \tilde{\Lambda} &= \frac{1}{n-1} \tilde{T}^{\mathrm{T}} \tilde{T} = \operatorname{diag}\{\lambda\_{k+1}, \lambda\_{k+2}, \ldots, \lambda\_m\} \\ \lambda\_1 &\ge \lambda\_2 \ge \cdots \ge \lambda\_m, \quad \sum\_{i=1}^{k} \lambda\_i > \sum\_{j=k+1}^{m} \lambda\_j \\ \lambda\_i &= \frac{1}{n-1} t\_i^{\mathrm{T}} t\_i \approx \operatorname{var}(t\_i) \end{aligned}$$

when *n* is very large. The score vector *t<sup>i</sup>* is the *i*-th column of *T̄* = [*T*, *T̃*]. The PCS is *Sp* = span{*P*} and the RS is *Sr* = span{*P̃*}. Therefore, the matrix *X* is decomposed into a score matrix *T̄* and a loading matrix *P̄* = [*P*, *P̃*], that is

$$X = \bar{T}\bar{P}^{\mathrm{T}} = \hat{X} + \tilde{X} = TP^{\mathrm{T}} + \tilde{T}\tilde{P}^{\mathrm{T}} = XPP^{\mathrm{T}} + X\left(I - PP^{\mathrm{T}}\right). \tag{1.4}$$

The sample vector *x* can be projected on the PCS and RS, respectively:

$$
\mathbf{x} = \hat{\mathbf{x}} + \tilde{\mathbf{x}} \tag{1.5}
$$

$$
\hat{\mathbf{x}} = \mathbf{P} \mathbf{P}^{\mathrm{T}} \mathbf{x} \tag{1.6}
$$

$$
\tilde{\boldsymbol{x}} = \tilde{\boldsymbol{P}} \tilde{\boldsymbol{P}}^{\mathsf{T}} \boldsymbol{x} = \left(\boldsymbol{I} - \boldsymbol{P}\boldsymbol{P}^{\mathsf{T}}\right) \boldsymbol{x}.\tag{1.7}
$$

Assuming *S* is invertible, define

$$z = \Lambda^{-\frac{1}{2}} P^{\mathrm{T}} x. \tag{1.8}$$

The Hotelling's T<sup>2</sup> statistic is then given by (Chiang et al. 2001)

$$\mathrm{T}^2 = z^{\mathrm{T}} z = x^{\mathrm{T}} P \Lambda^{-1} P^{\mathrm{T}} x. \tag{1.9}$$

The observation vector *x* is projected into a set of uncorrelated variables *y* by *y* = *P*<sup>T</sup>*x*. The rotation matrix *P*, obtained directly from the covariance matrix of *x*, guarantees that the elements of *y* are uncorrelated. *Λ*<sup>−1/2</sup> then scales the elements of *y* to produce the set of unit-variance variables *z*. The conversion of the covariance matrix is demonstrated graphically in Fig. 1.2 for a two-dimensional observation space (*m* = 2) (Chiang et al. 2001).

**Fig. 1.2** A graphical illustration of the covariance conversion for the T<sup>2</sup> statistic

The T<sup>2</sup> statistic is a scaled squared 2-norm of the deviation of an observation vector *x* from its mean. An appropriate scalar threshold, determined from an appropriate probability distribution at a given significance level α, is used to monitor the variability of the data in the entire *m*-dimensional observation space. In general, it is assumed that the observation vector *x* follows a multivariate normal distribution and that the mean and covariance matrix of the normal operating data are known exactly. Then the T<sup>2</sup> statistic follows a χ<sup>2</sup> distribution with *m* degrees of freedom (Chiang et al. 2001),

$$\mathbf{T}\_{\alpha}^{2} = \chi\_{\alpha}^{2}(m). \tag{1.10}$$

The set T<sup>2</sup> ≤ T<sup>2</sup><sub>α</sub> is an elliptical confidence region in the observation space, as illustrated in Fig. 1.3 for two process variables. The threshold (1.10) is applied to monitor for unusual changes. An observation vector projected within the confidence region indicates that the process is in control, whereas a projection outside it indicates that a fault has occurred (Chiang et al. 2001).

When the actual covariance matrix for the normal status is not known but instead estimated from the sample covariance matrix (1.2), the threshold for fault detection


is given by

$$T\_{\alpha}^{2} = \frac{m(n-1)(n+1)}{n(n-m)} F\_{\alpha}(m, n-m), \tag{1.11}$$

where *F*<sub>α</sub>(*m*, *n* − *m*) is the upper 100α% critical point of the F-distribution with *m* and *n* − *m* degrees of freedom (Chiang et al. 2001). For the same significance level α, the upper in-control limit in (1.11) is larger (more conservative) than that in (1.10). The two limits approach each other as the number of observations increases (*n* → ∞) (Tracy et al. 1992).
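As a minimal numerical sketch of the above (simulated data only; SciPy is assumed available for the χ<sup>2</sup> and F quantiles), the T<sup>2</sup> statistic (1.9) and the two limits (1.10) and (1.11) can be computed as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 500, 4
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))  # correlated normal data

X = X - X.mean(axis=0)                    # scale to zero mean
S = X.T @ X / (n - 1)                     # sample covariance (1.2)
lam, P = np.linalg.eigh(S)                # eigendecomposition (1.3)
lam, P = lam[::-1], P[:, ::-1]            # eigenvalues sorted descending

x = X[0]                                  # one sample vector
T2 = x @ P @ np.diag(1.0 / lam) @ P.T @ x  # T^2 (1.9), all m PCs retained here

alpha = 0.05
lim_chi2 = stats.chi2.ppf(1 - alpha, df=m)                                   # (1.10)
lim_F = m * (n - 1) * (n + 1) / (n * (n - m)) * stats.f.ppf(1 - alpha, m, n - m)  # (1.11)
print(T2, lim_chi2, lim_F)                # the F-based limit is slightly larger
```

As the text notes, for these values of *n* and *m* the F-based limit exceeds the χ<sup>2</sup> limit only slightly, and the gap shrinks as *n* grows.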

## *1.2.2 Squared Prediction Error*

The SPE index measures the projection of the sample vector on the residual subspace:

$$\text{SPE} := \|\tilde{\mathbf{x}}\|^2 = \|(I - \mathbf{P}\mathbf{P}^\mathrm{T})\mathbf{x}\|^2. \tag{1.12}$$

The process is considered normal if

$$\text{SPE} \le \delta\_{\alpha}^{2}, \tag{1.13}$$

where δ<sup>2</sup><sub>α</sub> denotes the upper control limit of SPE at significance level α. Jackson and Mudholkar gave an expression for δ<sup>2</sup><sub>α</sub> (Jackson and Mudholkar 1979):

$$\delta\_{\alpha}^2 = \theta\_1 \left( \frac{z\_{\alpha} \sqrt{2\theta\_2 h\_0^2}}{\theta\_1} + 1 + \frac{\theta\_2 h\_0 (h\_0 - 1)}{\theta\_1^2} \right)^{1/h\_0}, \tag{1.14}$$

where

$$\theta\_i = \sum\_{j=k+1}^{m} \lambda\_j^i, \qquad i = 1, 2, 3,\tag{1.15}$$

$$h\_0 = 1 - \frac{2\theta\_1 \theta\_3}{3\theta\_2^2},\tag{1.16}$$

where *k* is the number of retained principal components and *z*<sub>α</sub> is the (1 − α) quantile of the standard normal distribution. Note that the above result is obtained under the following conditions.

• The sample vector *x* follows a multivariate normal distribution.


When a fault occurs, the faulty sample vector *x* consists of the normal part superimposed with the faulty part. The fault causes the SPE to exceed the threshold δ<sup>2</sup><sub>α</sub>, which allows the fault to be detected.

Nomikos and MacGregor (1995) used the results in Box (1954) to derive an alternative upper control limit for SPE,

$$
\delta\_\alpha^2 = \mathbf{g} \chi\_{h;\alpha}^2 \tag{1.17}
$$

where

$$\mathbf{g} = \theta\_2/\theta\_1, \qquad h = \theta\_1^2/\theta\_2. \tag{1.18}$$

The relationship between the SPE thresholds (1.14) and (1.17) is as follows (Nomikos and MacGregor 1995):

$$
\delta\_{\alpha}^2 \cong gh \left( 1 - \frac{2}{9h} + z\_{\alpha} \sqrt{\frac{2}{9h}} \right)^3.
$$
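The two thresholds can be compared numerically. A minimal sketch (simulated data, NumPy plus the standard library normal quantile; no statistics package assumed) that evaluates the Jackson–Mudholkar limit (1.14) and the g·χ<sup>2</sup> limit (1.17) through the closed form above:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n, m, k = 500, 6, 2                      # samples, variables, retained PCs
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)
lam = np.sort(np.linalg.eigvalsh(S))[::-1]    # eigenvalues, descending

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha)           # z_alpha
th1, th2, th3 = (np.sum(lam[k:] ** i) for i in (1, 2, 3))   # (1.15)
h0 = 1 - 2 * th1 * th3 / (3 * th2 ** 2)                     # (1.16)
delta2_jm = th1 * (z * np.sqrt(2 * th2 * h0 ** 2) / th1
                   + 1 + th2 * h0 * (h0 - 1) / th1 ** 2) ** (1 / h0)  # (1.14)

g, h = th2 / th1, th1 ** 2 / th2                            # (1.18)
delta2_box = g * h * (1 - 2 / (9 * h) + z * np.sqrt(2 / (9 * h))) ** 3  # (1.17) via the form above
print(delta2_jm, delta2_box)             # the two limits are close for normal data
```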

## *1.2.3 Mahalanobis Distance*

Define the following Mahalanobis distance which forms the global Hotelling's T<sup>2</sup> test:

$$D = x^{\mathrm{T}} S^{-1} x \sim \frac{m(n^2 - 1)}{n(n - m)} F\_{m, n-m}, \tag{1.19}$$

where *S* is the sample covariance matrix of *X*. When *S* is singular with rank(*S*) = *r* < *m*, Mardia discusses the use of the pseudo-inverse of *S*, which yields the Mahalanobis distance for the reduced-rank covariance matrix (Brereton 2015):

$$D\_r = x^{\mathrm{T}} S^{+} x \sim \frac{r(n^{2} - 1)}{n(n - r)} F\_{r, n-r}, \tag{1.20}$$

where *S*<sup>+</sup> is the Moore–Penrose pseudo-inverse. It is straightforward to show that the global Mahalanobis distance is the sum of T<sup>2</sup> in the PCS and Hawkins' statistic (Hawkins 1974) T<sup>2</sup><sub>H</sub> = *x*<sup>T</sup> *P̃* *Λ̃*<sup>−1</sup> *P̃*<sup>T</sup> *x* in the RS:

$$\mathbf{D} = \mathbf{T}^2 + \mathbf{T}\_H^2. \tag{1.21}$$

#### 1.2 Fault Detection Index 11

When the number of observations *n* is quite large, the global Mahalanobis distance approximately obeys the χ<sup>2</sup> distribution with *m* degrees of freedom:

$$\mathbf{D} \sim \chi\_m^2. \tag{1.22}$$

Similarly, the reduced-rank Mahalanobis distance follows:

$$\mathbf{D}\_{\mathbf{r}} \sim \chi\_{\mathbf{r}}^2. \tag{1.23}$$

Therefore, faults can be detected using the correspondingly defined control limits for D and D*r*.
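The identity (1.21) can be verified numerically; a small NumPy sketch on simulated data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 300, 5, 2
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)
lam, Pf = np.linalg.eigh(S)
lam, Pf = lam[::-1], Pf[:, ::-1]
P, P_res = Pf[:, :k], Pf[:, k:]          # PCS and RS loadings

x = X[0]
D = x @ np.linalg.inv(S) @ x                               # global Mahalanobis distance
T2 = x @ P @ np.diag(1 / lam[:k]) @ P.T @ x                # T^2 in the PCS
T2_H = x @ P_res @ np.diag(1 / lam[k:]) @ P_res.T @ x      # Hawkins' statistic in the RS
print(np.isclose(D, T2 + T2_H))          # (1.21) holds exactly
```

The decomposition is exact because the eigenvectors of *S* split *S*<sup>−1</sup> into its PCS and RS parts.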

## *1.2.4 Combined Indices*

In practice, better monitoring performance can be achieved in some cases by using a combined index instead of two separate indices to monitor the process. Yue and Qin (2001) proposed a combined index for fault detection that combines SPE and T<sup>2</sup> as follows:

$$\varphi = \frac{\text{SPE}(x)}{\delta\_{\alpha}^2} + \frac{\mathrm{T}^2(x)}{\chi\_{l;\alpha}^2} = x^{\mathrm{T}} \Phi x, \tag{1.24}$$

where

$$\Phi = \frac{P\Lambda^{-1}P^{\mathrm{T}}}{\chi\_{l;\alpha}^{2}} + \frac{I - PP^{\mathrm{T}}}{\delta\_{\alpha}^{2}} = \frac{P\Lambda^{-1}P^{\mathrm{T}}}{\chi\_{l;\alpha}^{2}} + \frac{\tilde{P}\tilde{P}^{\mathrm{T}}}{\delta\_{\alpha}^{2}}. \tag{1.25}$$

Notice that *Φ* is symmetric and positive definite. To use this index for fault detection, the upper control limit of φ is derived from the results of Box (1954), which provide an approximate distribution with the same first two moments as the exact distribution. Using this approximate distribution, the statistic φ is distributed as

$$
\varphi = x^{\mathrm{T}} \Phi x \sim g \chi\_h^2, \tag{1.26}
$$

where the coefficient

$$g = \frac{\operatorname{tr}\left[(S\Phi)^{2}\right]}{\operatorname{tr}(S\Phi)} \tag{1.27}$$

and the degrees of freedom of the χ<sup>2</sup><sub>h</sub> distribution is

$$h = \frac{\left[\operatorname{tr}(S\Phi)\right]^2}{\operatorname{tr}\left[(S\Phi)^{2}\right]}, \tag{1.28}$$

in which,

$$\operatorname{tr}(\mathbf{S}\Phi) = \frac{l}{\chi^2\_{l;\alpha}} + \frac{\sum\_{i=l+1}^m \lambda\_i}{\delta\_\alpha^2} \tag{1.29}$$

$$\operatorname{tr}\left[(S\Phi)^{2}\right] = \frac{l}{\chi^4\_{l;\alpha}} + \frac{\sum\_{i=l+1}^m \lambda\_i^2}{\delta^4\_{\alpha}}. \tag{1.30}$$

After computing *g* and *h*, for a given significance level α, an upper control limit for φ can be obtained. A fault is detected by φ if

$$
\varphi > g \chi^2\_{h;\alpha}. \tag{1.31}
$$

It is worth noting that Raich and Cinar suggest another combined statistic (Raich and Cinar 1996),

$$\varphi = c \frac{\text{SPE}(x)}{\delta\_{\alpha}^2} + (1 - c) \frac{\mathrm{T}^2(x)}{\chi\_{l;\alpha}^2}, \tag{1.32}$$

where *c* ∈ (0, 1) is a constant. They further give the rule that a value of the statistic less than 1 is considered normal. However, this may lead to wrong results: even if the above statistic is less than 1, it is possible that SPE(*x*) > δ<sup>2</sup><sub>α</sub> or T<sup>2</sup>(*x*) > χ<sup>2</sup><sub>*l*;α</sub> (Qin 2003).
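A numerical sketch of the combined index (1.24) and its limit (1.31) on simulated data; the χ<sup>2</sup> quantiles use the Wilson–Hilferty approximation via the standard library normal quantile, an assumption standing in for a statistics package:

```python
import numpy as np
from statistics import NormalDist

def chi2_ppf(q, df):
    # Wilson-Hilferty approximation of the chi-square quantile (works for non-integer df)
    z = NormalDist().inv_cdf(q)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

rng = np.random.default_rng(3)
n, m, l = 400, 5, 2                      # l retained principal components
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)
S = X.T @ X / (n - 1)
lam, Pf = np.linalg.eigh(S)
lam, Pf = lam[::-1], Pf[:, ::-1]
P = Pf[:, :l]

alpha = 0.05
chi2_l = chi2_ppf(1 - alpha, l)                               # T^2 limit
gs = np.sum(lam[l:] ** 2) / np.sum(lam[l:])
hs = np.sum(lam[l:]) ** 2 / np.sum(lam[l:] ** 2)
delta2 = gs * chi2_ppf(1 - alpha, hs)                         # SPE limit (1.17)

Phi = P @ np.diag(1 / lam[:l]) @ P.T / chi2_l \
    + (np.eye(m) - P @ P.T) / delta2                          # (1.25)

tr1 = l / chi2_l + np.sum(lam[l:]) / delta2                   # (1.29)
tr2 = l / chi2_l ** 2 + np.sum(lam[l:] ** 2) / delta2 ** 2    # (1.30)
g, h = tr2 / tr1, tr1 ** 2 / tr2                              # (1.27)-(1.28)
limit = g * chi2_ppf(1 - alpha, h)                            # control limit for phi

phi = X[0] @ Phi @ X[0]                                       # (1.24)
print(phi > limit)                       # True would signal a fault (1.31)
```

The closed forms (1.29)–(1.30) can be checked against a direct trace of *SΦ*, which they match exactly.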

## *1.2.5 Control Limits in Non-Gaussian Distribution*

Nonlinear characteristics are a hotspot of current process monitoring research. Many nonlinear methods, such as kernel principal component analysis, neural networks, and manifold learning, are widely used for component extraction in process monitoring. The components extracted by such methods may not follow a Gaussian distribution. In this case, the control limits of the T<sup>2</sup> and Q families of statistics are calculated from their probability density functions (PDFs), which can be estimated by the nonparametric kernel density estimation (KDE) method. KDE applies to the T<sup>2</sup> and Q statistics because they are univariate, although the processes represented by these statistics are multivariate. Therefore, the control limits for the monitoring statistics (T<sup>2</sup> and SPE) are calculated from their respective PDF estimates, given by

$$\begin{cases} \int\_{-\infty}^{\text{Th}\_{\mathrm{T}^2,\alpha}} g(\mathrm{T}^2)\, d\mathrm{T}^2 = \alpha \\ \int\_{-\infty}^{\text{Th}\_{\text{SPE},\alpha}} g(\text{SPE})\, d\text{SPE} = \alpha, \end{cases} \tag{1.33}$$

where

$$g(z) = \frac{1}{lh} \sum\_{j=1}^{l} K\left( \frac{z - z\_j}{h} \right),$$

where K denotes a kernel function, *h* denotes the bandwidth or smoothing parameter, and *l* is the number of samples *z<sub>j</sub>* of the statistic. Finally, the fault detection logic for the PCS and RS is as follows:

$$\begin{aligned} \mathrm{T}^2 &> \text{Th}\_{\mathrm{T}^2, \alpha} \text{ or SPE} > \text{Th}\_{\text{SPE}, \alpha}, &&\text{Faults} \\ \mathrm{T}^2 &\le \text{Th}\_{\mathrm{T}^2, \alpha} \text{ and SPE} \le \text{Th}\_{\text{SPE}, \alpha}, &&\text{Fault-free}. \end{aligned} \tag{1.34}$$
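A NumPy-only sketch of a KDE-based control limit in the spirit of (1.33): estimate the PDF of simulated T<sup>2</sup> values with a Gaussian kernel (the bandwidth from Silverman's rule of thumb, an assumption not specified in the text) and take the point where the estimated CDF reaches the confidence level (α in (1.33), taken here as 0.95):

```python
import numpy as np

def kde_limit(z, conf=0.95, grid=2000):
    l = len(z)
    h = 1.06 * z.std() * l ** (-1 / 5)        # Silverman's rule-of-thumb bandwidth
    zs = np.linspace(z.min() - 3 * h, z.max() + 3 * h, grid)
    # Gaussian-kernel estimate of the density g(z) on the grid
    pdf = np.exp(-0.5 * ((zs[:, None] - z[None, :]) / h) ** 2).sum(axis=1)
    pdf /= l * h * np.sqrt(2 * np.pi)
    cdf = np.cumsum(pdf) * (zs[1] - zs[0])    # numerical integral of the PDF
    return zs[np.searchsorted(cdf, conf)]     # threshold Th

rng = np.random.default_rng(4)
t2_samples = rng.chisquare(df=3, size=2000)   # stand-in for observed T^2 values
th = kde_limit(t2_samples)
print(th)    # near the chi-square(3) 95% quantile, about 7.8
```

Because KDE makes no distributional assumption, the same routine applies unchanged to non-Gaussian statistics from kernel PCA or other nonlinear methods.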

## **References**


Ding SX (2013) Model-based fault diagnosis techniques. Springer, London


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Multivariate Statistics in Single Observation Space**

The observation data collected from continuous industrial processes usually fall into two main categories: process data and quality data, and the corresponding industrial data analysis mainly addresses these two types of data with multivariate statistical techniques. Process data are collected by a distributed control system (DCS) in real time with frequent sampling (the basic sampling period is usually 1 s). For example, there are five typical variables in the process industries: temperature, pressure, flow rate, liquid level, and composition. Among them, temperature, pressure, flow rate, and liquid level are process variables. However, it is generally difficult to acquire real-time quality measurements due to the limitations of quality sensors. Usually, the quality data are obtained by taking samples for laboratory tests, and their sampling frequency is much lower than that of the process data. For example, product composition, viscosity, molecular weight distribution, and other quality-related parameters need to be obtained through various analytical instruments in the laboratory, such as composition analyzers, gel permeation chromatography (GPC), or mass spectrometry.

Process data and quality data belong to two different observation spaces, so the corresponding statistical analysis methods are divided into two categories: single observation space and multiple observation spaces. This book introduces the basic multivariate statistical techniques from this perspective of observation spaces. This chapter focuses on the analysis methods in a single observation space, including the PCA and FDA methods. The core of these methods lies in spatial projection oriented to different needs, such as sample dispersion or multi-class sample separation. This projection extracts the necessary and effective features while achieving dimensionality reduction. The next chapter focuses on the multivariate statistical analysis methods between two observation spaces, specifically PLS, CCA, and CVA. These methods aim at maximizing the correlation of variables in different observation spaces, and thereby achieve feature extraction and dimensionality reduction.

© The Author(s) 2022

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_2

## **2.1 Principal Component Analysis**

As the modern industrial production system is becoming larger and more complex, the stored historical data not only has high dimensionality but also has strong coupling and correlation between the process variables. This also makes it impractical to monitor so many process variables at the same time. Therefore, we need to find a reasonable method to minimize the loss of information contained in the original variables while reducing the dimension of monitoring variables. If a small number of independent variables can be used to accurately reflect the operating status of the system, the operators can monitor these few variables to achieve the purpose of controlling the entire production process.

Principal component analysis (PCA) is one of the most widely used multivariate statistical algorithms (Pan et al. 2008). It is mainly used to monitor process data with high dimensionality and strong linear correlation. It decomposes high-dimensional process variables into a few independent principal components and then establishes a model. The extracted features constitute the principal component subspace (PCS) of the PCA projection, and this space contains most of the variation in the system. The remaining features constitute the residual subspace, which mainly contains the noise and interference in the monitoring process and a small amount of system variation information (Wiesel and Hero 2009). By integrating variables, PCA can overcome the overlapping information caused by multiple correlations and simultaneously achieve dimensionality reduction of high-dimensional data. It also highlights the main features, removing the noise and some unimportant features from the PCS.

## *2.1.1 Mathematical Principle of PCA*

Suppose a data matrix *X* ∈ *R*<sup>*n*×*m*</sup>, where *m* is the number of variables and *n* is the number of observations for each variable. The matrix *X* can be decomposed into the sum of the outer products of *k* vectors (Wang et al. 2016; Gao 2013):

$$X = t\_1 \mathbf{p}\_1^\mathrm{T} + t\_2 \mathbf{p}\_2^\mathrm{T} + \dots + t\_k \mathbf{p}\_k^\mathrm{T},\tag{2.1}$$

where *t<sup>i</sup>* is the score vector, also called the principal component of the matrix *X*, and *p<sup>i</sup>* is the feature vector corresponding to the principal component, also called the load vector. Then (2.1) can also be written in matrix form:

$$X = \boldsymbol{T}\boldsymbol{P}^{\mathsf{T}}.\tag{2.2}$$

Among them, *T* = [*t*1, *t*2,..., *t<sup>k</sup>*] is called the score matrix and *P* = [*p*1, *p*2,..., *p<sup>k</sup>*] is called the load matrix. The score vectors are orthogonal to each other,

$$t\_i^\mathrm{T} t\_j = 0, i \neq j. \tag{2.3}$$

The following relationships exist between load vectors:

$$\begin{cases} \mathbf{p}\_i^\mathsf{T} \mathbf{p}\_j = 0, i \neq j \\ \mathbf{p}\_i^\mathsf{T} \mathbf{p}\_j = 1, i = j \end{cases} \tag{2.4}$$

It is shown that the load vectors are also orthogonal to each other and the length of each load vector is 1.

Right-multiplying both sides of (2.2) by the load vector *p<sup>i</sup>* and combining with (2.4), we get

$$t\_i = X \, p\_i. \tag{2.5}$$

Equation (2.5) shows that each score vector *t<sup>i</sup>* is the projection of the original data *X* in the direction of the corresponding load vector *p<sup>i</sup>*. The length of the score vector *t<sup>i</sup>* reflects the coverage of the original data *X* in the direction of *p<sup>i</sup>*: the longer the length of *t<sup>i</sup>*, the greater the coverage or range of variation of the data matrix *X* in the direction of *p<sup>i</sup>* (Han 2012). The score vectors *t<sup>i</sup>* are arranged as follows:

$$\|t\_1\| > \|t\_2\| > \|t\_3\| > \cdots > \|t\_k\|. \tag{2.6}$$

The load vector *p*1 represents the direction in which the data matrix *X* varies most, and the load vector *p*2 is orthogonal to *p*1 and represents the second largest direction of variation of *X*. Similarly, the load vector *p<sup>k</sup>* represents the direction in which *X* varies least. When most of the variance is contained in the first *r* load vectors, the variance contained in the remaining *m* − *r* load vectors is almost zero and can be omitted. The data matrix *X* is then decomposed into the following form:

$$X = t\_1 p\_1^{\mathrm{T}} + t\_2 p\_2^{\mathrm{T}} + \cdots + t\_r p\_r^{\mathrm{T}} + E = \hat{X} + E = TP^{\mathrm{T}} + E, \tag{2.7}$$

where *X*ˆ is the principal component matrix and *E* is the residual matrix, whose main information is caused by measurement noise. PCA divides the original data space into the principal component subspace (PCS) and the residual subspace (RS). These two subspaces are orthogonal and complementary to each other. The principal component subspace mainly reflects the changes caused by normal data, while the residual subspace mainly reflects the changes caused by noise and interference.

PCA calculates the optimal load vectors *p* by solving the optimization problem:

$$J = \max\_{\mathbf{p} \neq \mathbf{0}} \frac{\mathbf{p}^{\mathrm{T}} X^{\mathrm{T}} X \mathbf{p}}{\mathbf{p}^{\mathrm{T}} \mathbf{p}}. \tag{2.8}$$

The number *r* of principal components is generally determined by the cumulative percent variance (CPV). Using an eigenvalue decomposition or singular value decomposition of the covariance matrix of *X*, obtain all the eigenvalues λ*<sup>i</sup>*. The CPV is defined as follows:

$$\text{CPV} = \frac{\sum\_{i=1}^{r} \lambda\_i}{\sum\_{i=1}^{m} \lambda\_i}. \tag{2.9}$$

Generally, *r* is chosen as the smallest number of components for which the CPV value reaches 85%.
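The CPV rule (2.9) can be sketched as a small helper (names are illustrative; the eigenvalues are assumed sorted in descending order):

```python
import numpy as np

def select_r(eigvals, cpv_limit=0.85):
    # CPV for r = 1, 2, ... as the running sum of eigenvalues over their total
    cpv = np.cumsum(eigvals) / np.sum(eigvals)
    # smallest r whose CPV reaches the limit
    return int(np.searchsorted(cpv, cpv_limit) + 1)

lam = np.array([5.0, 2.5, 1.0, 0.3, 0.2])   # example eigenvalues, descending
print(select_r(lam))   # CPV = 0.56, 0.83, 0.94, ... -> r = 3
```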

## *2.1.2 PCA Component Extraction Algorithm*

There are two algorithms to implement PCA component extraction. Algorithm 1 is based on the singular value decomposition (SVD) of the covariance matrix, and Algorithm 2 obtains each principal component with the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm, developed by H. Wold first for PCA and later for PLS (Wold 1992). NIPALS gives more numerically accurate results than the SVD of the covariance matrix, but is slower to compute.

The PCA dimensional reduction is illustrated by simple two-dimensional random data. Figure 2.1 shows the original random data sample in two-dimensional space. Figure 2.2 is a visualization with principal axis and confidence ellipse of the original data. The green ray gives the direction with the largest variance of the original data and the black ray shows the direction of second largest variance.

PCA projects the original data *X* from the two-dimensional space into a one-dimensional subspace along the direction of maximum variance. The dimensional reduction is shown in Fig. 2.3.

#### **Algorithm 1** SVD-based component extraction algorithm

#### **Input:**

Data matrix *X*.

#### **Output:**

*r* principal components.

[S1] Normalize the original data set *X* = [*x*<sup>T</sup>(1), *x*<sup>T</sup>(2), . . . , *x*<sup>T</sup>(*n*)]<sup>T</sup> ∈ *R*<sup>*n*×*m*</sup>, in which *x* = [*x*1, *x*2,..., *xm*] ∈ *R*<sup>1×*m*</sup>, to zero mean and unit variance.

[S2] Calculate the covariance matrix *S* of the normalized data matrix *X*:

$$S = \frac{1}{n-1} X^{\mathrm{T}} X. \tag{2.10}$$

[S3] Find the eigenvalues and eigenvectors of the covariance matrix *S* using eigenvalue decomposition:

$$\begin{aligned} |\lambda\_i I - S| &= 0 \\ (\lambda\_i I - S)\, p\_i &= 0. \end{aligned} \tag{2.11}$$

[S4] Sort the eigenvalues from large to small and determine the first *r* eigenvalues based on the CPV index. Construct the corresponding eigenvector matrix *P* = [*p*1, *p*2,..., *pr*] according to the eigenvalues *D* = (λ1,...,λ*r*).

[S5] Calculate the score matrix *T* based on the following relationship:

$$T = XP. \tag{2.12}$$

[S6] The normalized data matrix *X* is decomposed as follows:

$$X = \hat{X} + \tilde{X} = TP^{\mathrm{T}} + \tilde{X}, \tag{2.13}$$

where *X*ˆ is the principal component part of the data and *X*˜ is the residual part. **return** *r* principal components
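The steps S1–S6 can be sketched in a few lines of NumPy on simulated data (a sketch, not the book's reference implementation; the 85% CPV cut-off follows Sect. 2.1.1):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 200, 4
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))

X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # [S1] zero mean, unit variance
S = X.T @ X / (n - 1)                              # [S2] covariance matrix (2.10)
lam, P = np.linalg.eigh(S)                         # [S3] eigen-decomposition (2.11)
lam, P = lam[::-1], P[:, ::-1]                     # [S4] sort from large to small
r = int(np.searchsorted(np.cumsum(lam) / lam.sum(), 0.85) + 1)   # CPV cut-off
P_r = P[:, :r]
T = X @ P_r                                        # [S5] score matrix, T = X P (2.12)
X_hat = T @ P_r.T                                  # [S6] principal component part
X_res = X - X_hat                                  # residual part
print(r, np.allclose(X, X_hat + X_res))
```

Because the columns of `P_r` are eigenvectors of *S*, the columns of `T` come out mutually orthogonal, as required by (2.3).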

**Fig. 2.2** Visualization of the change principal axis and confidence ellipse of the original data

#### **Algorithm 2** NIPALS-based component extraction algorithm

#### **Input:**

Data matrix *X*.

#### **Output:**

*r* principal components.

[S1] Normalize the original data *X*.

[S2] Set *i* = 1 and choose a column *x <sup>j</sup>* from *X* and mark it as *t*1,*<sup>i</sup>* , that is, *t*1,*<sup>i</sup>* = *x <sup>j</sup>* .

[S3] Calculate the load vector *p*<sup>1</sup>

$$p\_1 = \frac{X^\mathrm{T} t\_{1,i}}{t\_{1,i}^\mathrm{T} t\_{1,i}}.\tag{2.14}$$

[S4] Normalize *p*1,

$$p\_1 = \frac{p\_1}{\|p\_1\|}. \tag{2.15}$$

[S5] Calculate the score vector *t*1,*i*+1,

$$t\_{1,i+1} = \frac{X \, p\_1}{p\_1^T \, p\_1}.\tag{2.16}$$

[S6] Compare *t*1,*<sup>i</sup>* and *t*1,*i*+1. If ‖*t*1,*i*+1 − *t*1,*<sup>i</sup>*‖ < ε, where ε > 0 is a very small positive constant, go to S7. If ‖*t*1,*i*+1 − *t*1,*<sup>i</sup>*‖ ≥ ε, set *i* = *i* + 1 and go back to S3.

[S7] Calculate the residual *E*1 = *X* − *t*1 *p*<sup>T</sup>1, replace *X* with *E*1, and return to S2 to calculate the next principal component *t*2, until the CPV value meets the requirement.

[S8] *r* principal components are obtained, namely:

$$X = t\_1 \mathbf{p}\_1^T + t\_2 \mathbf{p}\_2^T + \dots + t\_r \mathbf{p}\_r^T + \tilde{X} = TP^T + \tilde{X},\tag{2.17}$$

**return** *r* principal components
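The loop S2–S7 can be sketched as follows (NumPy; the fixed iteration cap is a safeguard added here, not part of the algorithm as stated):

```python
import numpy as np

def nipals(X, r, eps=1e-10, max_iter=1000):
    X = X.copy()
    scores, loads = [], []
    for _ in range(r):
        t = X[:, 0].copy()                    # [S2] start from one column of X
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)             # [S3] load vector (2.14)
            p = p / np.linalg.norm(p)         # [S4] normalize (2.15)
            t_new = X @ p / (p @ p)           # [S5] score vector (2.16)
            if np.linalg.norm(t_new - t) < eps:   # [S6] convergence check
                t = t_new
                break
            t = t_new
        X = X - np.outer(t, p)                # [S7] deflate, continue with residual
        scores.append(t); loads.append(p)
    return np.column_stack(scores), np.column_stack(loads)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
T, P = nipals(X, r=2)
# t_i^T t_i / (n - 1) recovers the leading eigenvalues of the covariance matrix
lam = np.sort(np.linalg.eigvalsh(X.T @ X / 99))[::-1]
print(np.allclose((T ** 2).sum(axis=0) / 99, lam[:2], rtol=1e-4))
```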

## *2.1.3 PCA-Based Fault Detection*

PCA can be applied to many kinds of data analysis problems, such as exploration and visualization of high-dimensional data sets, data compression, data preprocessing, dimensionality reduction, removal of data redundancy, and denoising. When applied to the field of FDD, the detection process is divided into offline modeling and online monitoring.


In online monitoring, a new sample *x* is decomposed as

$$\begin{aligned} x &= \hat{x} + \tilde{x} \\ \hat{x} &= PP^{\mathrm{T}} x \\ \tilde{x} &= \left(I - PP^{\mathrm{T}}\right) x, \end{aligned} \tag{2.18}$$

where *x*ˆ is the projection of the sample *x* onto the PCS and *x*˜ is its projection onto the RS. Calculate the statistics of the new sample *x*: SPE (1.12) on the RS and T<sup>2</sup> (1.9) on the PCS, respectively. Compare the statistics of the new sample with the control limits obtained from the training data. If a statistic of the new sample exceeds its control limit, a fault has occurred; otherwise, the system is in normal operation.

*x*ˆ and *x*˜ are not only orthogonal (*x*ˆ<sup>T</sup>*x*˜ = 0) but also statistically independent (E[*x*ˆ<sup>T</sup>*x*˜] = 0). So, there are natural advantages in applying the PCA algorithm to process monitoring. The flowchart of PCA-based fault detection is shown in Fig. 2.4. In general, a fault detection process based on multivariate statistical analysis is similar to that of PCA; only the statistical model and the statistical indices differ.
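A compact sketch of the whole offline/online procedure on simulated data (the χ<sup>2</sup> quantile uses the Wilson–Hilferty approximation via the standard library, an assumption standing in for a statistics package; `is_faulty` is an illustrative name):

```python
import numpy as np
from statistics import NormalDist

def chi2_ppf(q, df):
    # Wilson-Hilferty approximation of the chi-square quantile
    z = NormalDist().inv_cdf(q)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

rng = np.random.default_rng(7)
n, m, r = 500, 5, 2

# --- offline modeling on normal data ---
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Xn = (X - mu) / sd
S = Xn.T @ Xn / (n - 1)
lam, P = np.linalg.eigh(S)
lam, P = lam[::-1], P[:, ::-1]
P_r, lam_r = P[:, :r], lam[:r]

alpha = 0.05
T2_lim = chi2_ppf(1 - alpha, r)                    # T^2 limit, cf. (1.10)
g = np.sum(lam[r:] ** 2) / np.sum(lam[r:])
h = np.sum(lam[r:]) ** 2 / np.sum(lam[r:] ** 2)
SPE_lim = g * chi2_ppf(1 - alpha, h)               # SPE limit, cf. (1.17)

# --- online monitoring of a new sample ---
def is_faulty(x):
    xn = (x - mu) / sd
    T2 = xn @ P_r @ np.diag(1 / lam_r) @ P_r.T @ xn
    res = xn - P_r @ (P_r.T @ xn)
    return bool(T2 > T2_lim or res @ res > SPE_lim)

print(is_faulty(X[0]), is_faulty(X[0] + 10 * sd))  # a large bias triggers an alarm
```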

## **2.2 Fisher Discriminant Analysis**

Industrial processes are heavily instrumented, and large amounts of data are collected online and stored in computer databases. A lot of data are usually collected during out-of-control operations. When the data collected during an out-of-control operation have been previously diagnosed, the data can be classified into separate categories, where each category is related to a specific fault. When the data have not been diagnosed before, cluster analysis can help diagnose the operation from which the data were collected, and the data can be assigned to a new category accordingly. If hyperplanes can separate the data

**Fig. 2.4** PCA-based fault detection

in the class, as shown in Fig. 2.5, these separation planes can define the boundaries of each fault area. Once a fault is detected using the online data observation, the fault can be diagnosed by determining the fault area where the observation is located. Assuming that the detected fault is represented in the database, the fault can be correctly diagnosed in this way.

## *2.2.1 Principle of FDA*

Fisher discriminant analysis (FDA), a dimensionality reduction technique that has been extensively studied in the pattern classification domain, takes into account the information between classes. For fault diagnosis, data collected from the plant during specific fault operations are categorized into classes, where each class contains data representing a particular fault. FDA is a classical linear dimensionality reduction technique that is optimal in maximizing the separation between these classes. The main idea of FDA is to project data from a high-dimensional space into a lower dimensional space, while ensuring that the projection maximizes the scatter between classes and minimizes the scatter within each class. This means that high-dimensional data of the same class are projected into the low-dimensional space and clustered together, while different classes are kept far apart.

Given training data for all classes *X* ∈ *R*<sup>*n*×*m*</sup>, where *n* and *m* are the numbers of observations and measurement variables, respectively. In order to understand FDA,

**Fig. 2.5** Two-dimensional comparison of FDA and PCA

it is first necessary to define various matrices, including the total scatter matrix, intraclass (within-class) scatter matrix, and inter-class (between-class) scatter matrix. The total scatter matrix is

$$\mathbf{S}\_{l} = \sum\_{i=1}^{n} \left( \mathbf{x}(i) - \bar{\mathbf{x}} \right) \left( \mathbf{x}(i) - \bar{\mathbf{x}} \right)^{\mathsf{T}},\tag{2.19}$$

where *x*(*i*) represents the vector of measurement variables for the *i*-th observation and *x*¯ is the total mean vector.

$$\bar{\mathbf{x}} = \frac{1}{n} \sum\_{i=1}^{n} \mathbf{x}(i). \tag{2.20}$$

The within-class scatter matrix for class *j* is

$$S\_{j} = \sum\_{\mathbf{x}(i) \in \mathcal{X}\_{j}} \left( \mathbf{x}(i) - \bar{\mathbf{x}}\_{j} \right) \left( \mathbf{x}(i) - \bar{\mathbf{x}}\_{j} \right)^{\mathrm{T}}, \tag{2.21}$$

where *X<sup>j</sup>* is the set of vectors *x*(*i*) which belong to the class *j* and *x*¯ *<sup>j</sup>* is the mean vector for class *j*:

$$\bar{\mathbf{x}}\_{j} = \frac{1}{n\_{j}} \sum\_{\mathbf{x}(i) \in \mathcal{X}\_{j}} \mathbf{x}(i),\tag{2.22}$$

where *n <sup>j</sup>* is the number of observations in the *j*-th class. The **intra-class scatter matrix** is

$$\mathbf{S}\_w = \sum\_{j=1}^p \mathbf{S}\_j,\tag{2.23}$$

where *p* is the number of classes. The **inter-class scatter matrix** is

$$\mathbf{S}\_{b} = \sum\_{j=1}^{p} n\_{j} \left(\bar{\mathbf{x}}\_{j} - \bar{\mathbf{x}}\right) \left(\bar{\mathbf{x}}\_{j} - \bar{\mathbf{x}}\right)^{\mathrm{T}}.\tag{2.24}$$

It is obvious that the following relationship always holds:

$$\mathcal{S}\_t = \mathcal{S}\_b + \mathcal{S}\_w. \tag{2.25}$$

The maximum inter-class scatter means that the sample centers of different classes are as far apart as possible after projection (max *v*<sup>T</sup>*S<sup>b</sup>v*). The minimum intra-class scatter is equivalent to making the sample points of the same class cluster together as much as possible after projection (min *v*<sup>T</sup>*S*<sub>w</sub>*v*, |*S*<sub>w</sub>| ≠ 0), where *v* ∈ *R*<sup>*m*</sup>.

The optimal FDA projection *w* is obtained by

$$J = \max\_{\mathbf{w} \neq 0} \frac{\mathbf{w}^{\mathsf{T}} \mathbf{S}\_b \mathbf{w}}{\mathbf{w}^{\mathsf{T}} \mathbf{S}\_w \mathbf{w}}. \tag{2.26}$$

Both the numerator and denominator contain the projection vector *w*. Considering that *w* and α*w*, α ≠ 0, have the same effect, let *w*<sup>T</sup>*S*<sub>w</sub>*w* = 1; then the optimal objective (2.26) becomes

$$\begin{aligned} J &= \max\_{\mathbf{w}} \boldsymbol{w}^{\mathsf{T}} \boldsymbol{S}\_{b} \boldsymbol{w} \\ \text{s.t.} \quad & \boldsymbol{w}^{\mathsf{T}} \boldsymbol{S}\_{w} \boldsymbol{w} = 1. \end{aligned} \tag{2.27}$$

First, consider the optimization of the first FDA vector *w*<sub>1</sub>, solving (2.27) by the Lagrange multiplier method:

$$L(\boldsymbol{w}\_1, \boldsymbol{\lambda}\_1) = \boldsymbol{w}\_1^\mathrm{T} \mathbf{S}\_b \boldsymbol{w}\_1 - \boldsymbol{\lambda}\_1 (\boldsymbol{w}\_1^\mathrm{T} \mathbf{S}\_w \boldsymbol{w}\_1 - 1)$$

Setting the partial derivative of *L* with respect to *w*<sub>1</sub> to zero gives

$$\frac{\partial L}{\partial \boldsymbol{w}\_1} = 2\boldsymbol{S}\_b \boldsymbol{w}\_1 - 2\lambda\_1 \boldsymbol{S}\_w \boldsymbol{w}\_1 = \boldsymbol{0}.$$

The first FDA vector *w*<sub>1</sub> is therefore an eigenvector of the generalized eigenvalue problem

$$\mathbf{S}\_b \mathbf{w}\_1 = \lambda\_1 \mathbf{S}\_w \mathbf{w}\_1 \to \mathbf{S}\_w^{-1} \mathbf{S}\_b \mathbf{w}\_1 = \lambda\_1 \mathbf{w}\_1. \tag{2.28}$$

Finding the first FDA vector thus reduces to finding the eigenvector *w*<sub>1</sub> corresponding to the largest eigenvalue of the matrix *S*<sub>w</sub><sup>−1</sup>*S*<sub>b</sub>.

The second FDA vector maximizes the inter-class scatter while minimizing the intra-class scatter on all axes perpendicular to the first FDA vector, and the same holds for the remaining FDA vectors. The *k*-th FDA vector is obtained from

$$\mathbf{S}\_{w}^{-1}\mathbf{S}\_{b}\boldsymbol{w}\_{k}=\lambda\_{k}\boldsymbol{w}\_{k},$$

where λ<sub>1</sub> ≥ λ<sub>2</sub> ≥ ··· ≥ λ<sub>*p*−1</sub>, and λ<sub>*k*</sub> indicates the degree of overall separability among the classes when the data are projected onto *w*<sub>*k*</sub>.

When *S*<sub>w</sub> is invertible, the FDA vectors can be computed from the generalized eigenvalue problem. This is almost always the case when the number of observations *n* is significantly larger than the number of measured variables *m*, as it usually is in practice. If *S*<sub>w</sub> is not invertible, PCA can be used to project the data onto *m*<sub>1</sub> dimensions before executing FDA, where *m*<sub>1</sub> is the number of non-zero eigenvalues of the covariance matrix *S*<sub>*t*</sub>.

The first FDA vector is the eigenvector associated with the largest eigenvalue, the second FDA vector is the eigenvector associated with the second largest eigenvalue, and so on. A large eigenvalue λ<sub>*k*</sub> indicates that when the data are projected onto the associated eigenvector *w*<sub>*k*</sub>, the class means are well separated relative to the within-class variance, so there is a large degree of separation among the classes along *w*<sub>*k*</sub>. Since the rank of *S*<sub>*b*</sub> is less than *p*, at most *p* − 1 eigenvalues are non-zero, and FDA provides a useful ordering of the eigenvectors only in these directions.
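The computation described above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration under assumed toy data, not a reference implementation; the function name `fda_directions` is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def fda_directions(X, labels, n_components=None):
    """FDA directions from data X (n x m) and integer class labels.

    Solves the generalized eigenvalue problem S_b w = lambda S_w w and
    returns eigenvalues/eigenvectors sorted by decreasing eigenvalue.
    """
    classes = np.unique(labels)
    x_bar = X.mean(axis=0)                   # total mean vector (2.20)
    m = X.shape[1]
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for j in classes:
        X_j = X[labels == j]
        D = X_j - X_j.mean(axis=0)
        S_w += D.T @ D                       # intra-class scatter (2.23)
        d = (X_j.mean(axis=0) - x_bar)[:, None]
        S_b += len(X_j) * (d @ d.T)          # inter-class scatter (2.24)
    # eigh handles the symmetric-definite pencil (S_b, S_w) directly, so
    # S_w^{-1} S_b is never formed; eigenvalues come back in ascending order.
    eigvals, eigvecs = eigh(S_b, S_w)
    order = np.argsort(eigvals)[::-1]
    if n_components is None:
        n_components = len(classes) - 1      # at most p - 1 useful directions
    keep = order[:n_components]
    return eigvals[keep], eigvecs[:, keep]

# Toy data: two classes elongated along the first axis, shifted along the second.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [3.0, 0.3], size=(100, 2))
X2 = rng.normal([0.0, 2.0], [3.0, 0.3], size=(100, 2))
X = np.vstack([X1, X2])
y = np.array([0] * 100 + [1] * 100)
lam, W = fda_directions(X, y)
```

Because the class means differ only along the second coordinate while the within-class spread dominates the first, the single FDA direction found here points essentially along the second axis.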

When FDA is used for pattern classification, dimensionality reduction is applied to all classes of data simultaneously. Denote *W*<sub>*a*</sub> = [*w*<sub>1</sub>, *w*<sub>2</sub>, ..., *w*<sub>*a*</sub>] ∈ *R*<sup>*m*×*a*</sup>. The discriminant function can be deduced as

$$\begin{split} g\_{j}(\mathbf{x}) &= -\frac{1}{2} \left( \mathbf{x} - \bar{\mathbf{x}}\_{j} \right)^{\mathrm{T}} \mathbf{W}\_{a} \left( \frac{1}{n\_{j} - 1} \mathbf{W}\_{a}^{\mathrm{T}} \mathbf{S}\_{j} \, \mathbf{W}\_{a} \right)^{-1} \mathbf{W}\_{a}^{\mathrm{T}} \left( \mathbf{x} - \bar{\mathbf{x}}\_{j} \right) + \ln \left( p\_{j} \right) \\ &\quad - \frac{1}{2} \ln \left[ \det \left( \frac{1}{n\_{j} - 1} \mathbf{W}\_{a}^{\mathrm{T}} \mathbf{S}\_{j} \, \mathbf{W}\_{a} \right) \right], \end{split} \tag{2.29}$$

where *p*<sub>*j*</sub> is the prior probability of class *j*.

FDA can also be used to detect faults by defining an additional class alongside the fault classes, namely data collected under normal operating conditions. The reliability of fault detection using (2.29) depends on the similarity between the normal-operation data and the fault-class data in the training set. Fault detection using FDA yields small miss rates for known fault classes when a transformation *W* exists such that the normal-operation data can be reasonably separated from the fault classes.
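As a sketch, the discriminant (2.29) can be evaluated directly once a projection matrix *W*<sub>a</sub> is available. The helper below is hypothetical (its name, the toy data, and the default equal priors are assumptions for the example); it scores a sample against each class and returns the class with the largest g<sub>j</sub>(x).

```python
import numpy as np

def fda_discriminant(x, class_data, W_a, priors=None):
    """Score sample x against each class with the discriminant (2.29)
    and return the index of the maximizing class."""
    p = len(class_data)
    priors = priors or [1.0 / p] * p         # equal priors by default
    scores = []
    for X_j, p_j in zip(class_data, priors):
        n_j = len(X_j)
        x_bar_j = X_j.mean(axis=0)
        D = X_j - x_bar_j
        S_j = D.T @ D                        # class-j scatter matrix
        C = W_a.T @ S_j @ W_a / (n_j - 1)    # covariance in the reduced space
        d = W_a.T @ (x - x_bar_j)
        g = (-0.5 * d @ np.linalg.solve(C, d)
             + np.log(p_j)
             - 0.5 * np.log(np.linalg.det(C)))
        scores.append(g)
    return int(np.argmax(scores))

# Two classes separated along the second coordinate; a single FDA
# direction along that axis is then a reasonable projection.
rng = np.random.default_rng(1)
X0 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X1 = rng.normal([0.0, 5.0], 0.5, size=(50, 2))
W_a = np.array([[0.0], [1.0]])
```

A sample near a class mean is assigned to that class, e.g. `fda_discriminant(np.array([0.0, 0.0]), [X0, X1], W_a)` selects class 0.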

## *2.2.2 Comparison of FDA and PCA*

As two classical techniques for dimensionality reduction of a single data set, PCA and FDA exhibit similar properties in many respects. Their optimization problems, formulated mathematically in (2.8) and (2.26), can also be written as

$$J\_{\rm PCA} = \max\_{\mathbf{w} \neq \mathbf{0}} \frac{\mathbf{w}^{\rm T} \mathbf{S}\_t \mathbf{w}}{\mathbf{w}^{\rm T} \mathbf{w}} \tag{2.30}$$

$$J\_{\rm FDA} = \max\_{\mathbf{w} \neq \mathbf{0}} \frac{\mathbf{w}^{\rm T} \mathbf{S}\_t \mathbf{w}}{\mathbf{w}^{\rm T} \mathbf{S}\_w \mathbf{w}} \tag{2.31}$$

In the special case *S*<sub>w</sub> = *aI* with *a* ≠ 0, the two techniques yield identical projection vectors. This occurs when the data of each class can be described by a uniformly distributed ball (i.e., without a dominant direction), even if the balls have different sizes. The two techniques differ only when the data of some class appear elongated. Such elongated shapes occur in highly correlated data sets, for example, data collected in industrial processes. Thus, when FDA and PCA are applied to the same process data, the FDA vectors and the PCA loading vectors are significantly different. The different objectives of (2.30) and (2.31) indicate that FDA performs better than PCA at distinguishing among fault classes.

Figure 2.5 illustrates a difference between PCA and FDA. The first FDA vector and the first PCA loading vector are almost perpendicular. PCA maps the entire data set onto the coordinates that represent it most compactly; the mapping uses no class information contained in the data. Therefore, although the entire data set is easier to represent after PCA (reduced dimensionality with minimal loss of information), it may become harder to classify. The projections of the red and blue samples overlap in the PCA direction but are separated in the FDA direction. The two classes become easier to distinguish after the FDA mapping (they can be separated in a low-dimensional space, greatly reducing the amount of computation).

To illustrate more clearly the difference between PCA and FDA, the following numerical example of binary classification is given.

$$\begin{aligned} \mathbf{x}\_1 &= [\ldots + 0.05\mu(0, 1);\ 3.2 + 0.9\mu(0, 1)] \in \mathbb{R}^{2 \times 100} \\ \mathbf{x}\_2 &= [\ldots + 0.05\mu(0, 1);\ 3.2 + 0.9\mu(0, 1)] \in \mathbb{R}^{2 \times 100} \\ X &= [\mathbf{x}\_1, \mathbf{x}\_2] \in \mathbb{R}^{2 \times 200}, \end{aligned}$$

where *µ*(0, 1) ∈ *R*<sup>1×100</sup> is a uniformly distributed random vector on [0, 1]. *X* is a two-mode data set, and its projections onto the first FDA and PCA directions are shown in Fig. 2.6. The distribution of the data within each class is somewhat elongated. The linear transformation of the data onto the first FDA vector separates the two classes better than the linear transformation onto the first PCA loading vector.

**Fig. 2.6** Two-dimensional data projection comparison of FDA and PCA
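A small simulation in the spirit of the example above makes the contrast concrete (the exact class means are partly elided in the printed equation, so the offsets 1.0 and 1.2 used here are assumptions): the first PCA loading follows the elongated high-variance axis, while the first FDA vector follows the axis that separates the class means.

```python
import numpy as np
from scipy.linalg import eigh

# Two elongated classes: wide spread along the second coordinate, narrow
# spread along the first, with the class means shifted along the first.
rng = np.random.default_rng(42)
x1 = np.column_stack([1.0 + 0.05 * rng.uniform(size=100),
                      3.2 + 0.9 * rng.uniform(size=100)])
x2 = np.column_stack([1.2 + 0.05 * rng.uniform(size=100),
                      3.2 + 0.9 * rng.uniform(size=100)])
X = np.vstack([x1, x2])

# First PCA loading vector: dominant eigenvector of the total scatter.
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(Xc.T @ Xc)
pca_dir = vecs[:, -1]                        # eigh sorts eigenvalues ascending

# First FDA vector: dominant eigenvector of the pencil (S_b, S_w).
S_w = np.zeros((2, 2))
S_b = np.zeros((2, 2))
for Xj in (x1, x2):
    Dj = Xj - Xj.mean(axis=0)
    S_w += Dj.T @ Dj
    d = (Xj.mean(axis=0) - X.mean(axis=0))[:, None]
    S_b += len(Xj) * (d @ d.T)
_, W = eigh(S_b, S_w)
fda_dir = W[:, -1] / np.linalg.norm(W[:, -1])
```

On this data the two directions come out nearly perpendicular: `pca_dir` is dominated by the second (high-variance) coordinate and `fda_dir` by the first (class-separating) coordinate.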

Both PCA and FDA can be used to classify the original data after dimensionality reduction. PCA is an unsupervised method, i.e., it uses no class labels; after dimensionality reduction, unsupervised algorithms such as K-means or self-organizing maps are needed for classification. FDA is a supervised method: it first reduces the dimensionality of the training data and then finds a linear discriminant function. The similarities and differences between FDA and PCA can be summarized as follows.

Similarities:

	- (1) Both are used to reduce dimensionality;
	- (2) Both assume the data are Gaussian-distributed.

Differences:

	- (1) FDA is a supervised dimensionality reduction method, while PCA is unsupervised;
	- (2) FDA can reduce the dimensionality to at most *k* − 1, where *k* is the number of classes, while PCA has no such restriction;
	- (3) FDA relies more on the class means; if the discriminative information lies mainly in the variance, FDA will not perform as well as PCA;
	- (4) FDA may overfit the data.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Multivariate Statistics Between Two-Observation Spaces**

As mentioned in the previous chapter, industrial data are usually divided into two categories, process data and quality data, which belong to different measurement spaces. The vast majority of smart manufacturing problems, such as soft sensing, control, monitoring, and optimization, inevitably require modeling the relationships between the two kinds of measurement variables. The subject of this chapter is discovering the correlation between data sets in different observation spaces.

Multivariate statistical analysis relying on correlation among variables generally includes canonical correlation analysis (CCA) and partial least squares regression (PLS). Both perform linear dimensionality reduction with the goal of maximizing the relationship between variables in two measurement spaces. The difference is that CCA maximizes **correlation**, while PLS maximizes **covariance**.

# **3.1 Canonical Correlation Analysis**

Canonical correlation analysis (CCA) was first proposed by Hotelling in 1936 (Hotelling 1936). It is a multivariate statistical analysis method that uses the correlation between two composite variables to reflect the overall correlation between two sets of variables. The CCA algorithm is widely used in the analysis of data correlation and it is also the basis of partial least squares. In addition, it is also used in feature fusion, data dimensionality reduction, and fault detection (Yang et al. 2015; Zhang and Dou 2015; Zhang et al. 2020; Hou 2013; Chen et al. 2016a, b).

## *3.1.1 Mathematical Principle of CCA*

Assume that there are *l* dependent variables *y* = (*y*<sub>1</sub>, *y*<sub>2</sub>,..., *y*<sub>*l*</sub>)<sup>T</sup> and *m* independent variables *x* = (*x*<sub>1</sub>, *x*<sub>2</sub>,..., *x*<sub>*m*</sub>)<sup>T</sup>. To capture the correlation between the dependent and independent variables, *n* sample points are observed, which constitute two data sets

$$\begin{aligned} X &= \left[\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(n)\right]^\mathrm{T} \in \boldsymbol{R}^{n \times m} \\\\ \mathbf{Y} &= \left[\mathbf{y}(1), \mathbf{y}(2), \dots, \mathbf{y}(n)\right]^\mathrm{T} \in \boldsymbol{R}^{n \times l} \end{aligned}$$

CCA draws on the idea of component extraction to find a canonical component *u*, a linear combination of the variables *x*<sub>*i*</sub>, and a canonical component *v*, a linear combination of the *y*<sub>*i*</sub>. During extraction, the correlation between *u* and *v* is required to be maximized. The degree of correlation between *u* and *v* roughly reflects the correlation between *X* and *Y*.

Without loss of generality, assume that the original variables are standardized, i.e., each column of *X* and *Y* has mean 0 and variance 1. The covariance matrix cov(*X*, *Y*) then equals the correlation coefficient matrix:

$$\text{cov}(X, Y) = \frac{1}{n} \begin{bmatrix} X^\mathrm{T} X & X^\mathrm{T} Y \\ Y^\mathrm{T} X & Y^\mathrm{T} Y \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}\_{xx} & \boldsymbol{\Sigma}\_{xy} \\ \boldsymbol{\Sigma}\_{xy}^\mathrm{T} & \boldsymbol{\Sigma}\_{yy} \end{bmatrix}$$

PCA analyzes *Σ*<sub>xx</sub> or *Σ*<sub>yy</sub>, while CCA analyzes *Σ*<sub>xy</sub>.

Now the problem is how to find the direction vectors *α* and *β*, and then use them to construct the canonical components:

$$\begin{aligned} u &= \alpha\_1 \mathbf{x}\_1 + \alpha\_2 \mathbf{x}\_2 + \dots + \alpha\_m \mathbf{x}\_m \\ v &= \beta\_1 \mathbf{y}\_1 + \beta\_2 \mathbf{y}\_2 + \dots + \beta\_l \mathbf{y}\_l, \end{aligned} \tag{3.1}$$

where *α* = [α<sub>1</sub>, α<sub>2</sub>,..., α<sub>*m*</sub>]<sup>T</sup> ∈ *R*<sup>*m*×1</sup> and *β* = [β<sub>1</sub>, β<sub>2</sub>,..., β<sub>*l*</sub>]<sup>T</sup> ∈ *R*<sup>*l*×1</sup>, such that the correlation between *u* and *v* is maximized. Obviously, the sample means of *u* and *v* are zero, and their sample variances are as follows:

$$\begin{aligned} \text{var}(u) &= \alpha^\top \Sigma\_{xx} \alpha\\ \text{var}(v) &= \beta^\top \Sigma\_{\text{yy}} \beta \end{aligned}$$

The covariance of *u* and v is

$$\text{cov}(u, v) = \alpha^{\text{T}} \boldsymbol{\Sigma}\_{xy} \boldsymbol{\beta}.$$

One way to maximize the correlation of *u* and v is to make the corresponding correlation coefficient maximum, i.e.,

$$\max \rho(u, v) = \frac{\text{cov}(u, v)}{\sqrt{\text{var}(u)\,\text{var}(v)}}.\tag{3.2}$$

In CCA, the following optimization objective is used:

$$\begin{aligned} J\_{\text{CCA}} &= \max\_{\alpha, \beta} \alpha^{\text{T}} \boldsymbol{\Sigma}\_{xy} \beta \\ \text{s.t.} &\ \alpha^{\text{T}} \boldsymbol{\Sigma}\_{xx} \alpha = 1; \ \beta^{\text{T}} \boldsymbol{\Sigma}\_{yy} \beta = 1. \end{aligned} \tag{3.3}$$

This optimization objective can be summarized as seeking a unit vector *α* on the subspace of *X* and a unit vector *β* on the subspace of *Y* such that the correlation between *u* and *v* is maximized. Geometrically, ρ(*u*, *v*) equals the cosine of the angle between *u* and *v*, so (3.3) is equivalent to minimizing the angle ω between *u* and *v*.

It can be seen from (3.3) that the goal of the CCA algorithm is finally transformed into a constrained optimization problem. The maximum value of this objective is the canonical correlation coefficient of *X* and *Y*, and the corresponding *α* and *β* are the projection vectors, or linear coefficients. After the first pair of canonical correlation variables is obtained, the second to *k*-th pairs of canonical correlation variables, mutually uncorrelated, can be calculated similarly.

The following Fig. 3.1 shows the basic principle diagram of the CCA algorithm:

At present, there are two main methods for solving the above objective to obtain *α* and *β*: eigenvalue decomposition and singular value decomposition.

**Fig. 3.1** Basic principle diagram of the CCA algorithm

## *3.1.2 Eigenvalue Decomposition of CCA Algorithm*

Using the Lagrangian function, the objective function of (3.3) is transformed as follows:

$$\max J\_{\rm CCA}(\alpha, \beta) = \alpha^{\rm T} \boldsymbol{\Sigma}\_{xy} \beta - \frac{\lambda\_1}{2} (\alpha^{\rm T} \boldsymbol{\Sigma}\_{xx} \alpha - 1) - \frac{\lambda\_2}{2} (\beta^{\rm T} \boldsymbol{\Sigma}\_{yy} \beta - 1).\tag{3.4}$$

Setting ∂*J*/∂*α* = 0 and ∂*J*/∂*β* = 0 yields

$$\begin{aligned} \boldsymbol{\Sigma}\_{\rm xy}\boldsymbol{\beta} - \lambda\_1 \boldsymbol{\Sigma}\_{\rm xx} \boldsymbol{\alpha} &= 0 \\ \boldsymbol{\Sigma}\_{\rm xy}^{\rm T} \boldsymbol{\alpha} - \lambda\_2 \boldsymbol{\Sigma}\_{\rm yy} \boldsymbol{\beta} &= 0. \end{aligned} \tag{3.5}$$

Let λ = λ<sub>1</sub> = λ<sub>2</sub> = *α*<sup>T</sup>*Σ*<sub>xy</sub>*β*, and multiply the two equations of (3.5) on the left by *Σ*<sub>xx</sub><sup>−1</sup> and *Σ*<sub>yy</sub><sup>−1</sup>, respectively, to get

$$\begin{aligned} \boldsymbol{\Sigma}\_{\rm xx}^{-1} \boldsymbol{\Sigma}\_{\rm xy} \boldsymbol{\beta} &= \lambda \boldsymbol{\alpha} \\ \boldsymbol{\Sigma}\_{\rm yy}^{-1} \boldsymbol{\Sigma}\_{\rm yx} \boldsymbol{\alpha} &= \lambda \boldsymbol{\beta}. \end{aligned} \tag{3.6}$$

Substituting the second formula in (3.6) into the first formula, we can get

$$
\boldsymbol{\Sigma}\_{\rm xx}^{-1} \boldsymbol{\Sigma}\_{\rm xy} \boldsymbol{\Sigma}\_{\rm yy}^{-1} \boldsymbol{\Sigma}\_{\rm yx} \boldsymbol{\alpha} = \lambda^2 \boldsymbol{\alpha} \tag{3.7}
$$

From (3.7), the largest eigenvalue λ² and the corresponding eigenvector *α* are obtained by eigenvalue decomposition of the matrix *Σ*<sub>xx</sub><sup>−1</sup>*Σ*<sub>xy</sub>*Σ*<sub>yy</sub><sup>−1</sup>*Σ*<sub>yx</sub>. The vector *β* can be obtained in a similar way. This gives the projection vectors *α* and *β* of the first pair of canonical correlation variables.
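A minimal NumPy sketch of this eigenvalue route (the function name `cca_eig` and the toy data sharing one latent factor are assumptions for the example):

```python
import numpy as np

def cca_eig(X, Y):
    """First pair of canonical directions via the eigenvalue route (3.6)-(3.7).

    X (n x m) and Y (n x l) are assumed standardized column-wise.
    """
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    # (3.7): Sxx^{-1} Sxy Syy^{-1} Syx alpha = lambda^2 alpha
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(eigvals.real))
    alpha = eigvecs[:, k].real.copy()
    beta = np.linalg.solve(Syy, Sxy.T @ alpha)   # (3.6), up to scale
    alpha /= np.sqrt(alpha @ Sxx @ alpha)        # unit-variance canonical variates
    beta /= np.sqrt(beta @ Syy @ beta)
    return alpha, beta, float(np.sqrt(eigvals.real[k]))

# Toy data: both blocks share one latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)
alpha, beta, rho = cca_eig(X, Y)
```

The returned `rho` is the sample canonical correlation; the sample correlation of the projected variates `X @ alpha` and `Y @ beta` reproduces it.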

## *3.1.3 SVD Solution of CCA Algorithm*

Let *α* = *Σ*<sub>xx</sub><sup>−1/2</sup>*a* and *β* = *Σ*<sub>yy</sub><sup>−1/2</sup>*b*; then we can get

$$\begin{cases} \boldsymbol{\alpha}^{\mathrm{T}} \boldsymbol{\Sigma}\_{xx} \boldsymbol{\alpha} = 1 \to \boldsymbol{\mathsf{a}}^{\mathrm{T}} \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\Sigma}\_{xx} \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\mathsf{a}} = 1 \to \boldsymbol{\mathsf{a}}^{\mathrm{T}} \boldsymbol{\mathsf{a}} = 1\\ \boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{\Sigma}\_{\mathrm{yy}} \boldsymbol{\beta} = 1 \to \boldsymbol{\mathsf{b}}^{\mathrm{T}} \boldsymbol{\Sigma}\_{\mathrm{yy}}^{-1/2} \boldsymbol{\Sigma}\_{\mathrm{yy}} \boldsymbol{\Sigma}\_{\mathrm{yy}}^{-1/2} \boldsymbol{\mathsf{b}} = 1 \to \boldsymbol{\mathsf{b}}^{\mathrm{T}} \boldsymbol{\mathsf{b}} = 1\\ \boldsymbol{\alpha}^{\mathrm{T}} \boldsymbol{\Sigma}\_{\mathrm{xy}} \boldsymbol{\beta} = \boldsymbol{\mathsf{a}}^{\mathrm{T}} \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\Sigma}\_{\mathrm{xy}} \boldsymbol{\Sigma}\_{\mathrm{yy}}^{-1/2} \boldsymbol{\mathsf{b}}. \end{cases} (3.8)$$

In other words, the objective function of (3.3) can be transformed into as follows:

$$\begin{aligned} J\_{\text{CCA}}(\boldsymbol{a}, \boldsymbol{b}) &= \arg\max\_{\boldsymbol{a}, \boldsymbol{b}} \boldsymbol{a}^{\mathrm{T}} \boldsymbol{\Sigma}\_{\text{xx}}^{-1/2} \boldsymbol{\Sigma}\_{\text{xy}} \boldsymbol{\Sigma}\_{\text{yy}}^{-1/2} \boldsymbol{b} \\ \text{s.t. } \boldsymbol{a}^{\mathrm{T}} \boldsymbol{a} &= \boldsymbol{b}^{\mathrm{T}} \boldsymbol{b} = 1. \end{aligned} \tag{3.9}$$

A singular value decomposition for matrix *M* yields

$$\mathbf{M} = \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\Sigma}\_{xy} \boldsymbol{\Sigma}\_{yy}^{-1/2} = \boldsymbol{\Gamma} \boldsymbol{\Sigma} \boldsymbol{\Psi}^{\text{T}}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Lambda}\_{\kappa} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \tag{3.10}$$

where κ is the number of principal elements, i.e., non-zero singular values, with κ ≤ min(*l*, *m*), *Λ*<sub>κ</sub> = diag(λ<sub>1</sub>,..., λ<sub>κ</sub>), and λ<sub>1</sub> ≥ ··· ≥ λ<sub>κ</sub> > 0.

Since the columns of *Γ* and *Ψ* form orthonormal bases, the maximum of (3.9) is attained when *a*<sup>T</sup>*Γ* and *Ψ*<sup>T</sup>*b* are vectors with a single entry equal to 1 and all remaining entries equal to 0. So we can get

$$\mathbf{a}^{\mathrm{T}}\boldsymbol{\Sigma}\_{xx}^{-1/2}\boldsymbol{\Sigma}\_{xy}\boldsymbol{\Sigma}\_{yy}^{-1/2}\boldsymbol{b} = \mathbf{a}^{\mathrm{T}}\boldsymbol{\Gamma}\boldsymbol{\Sigma}\boldsymbol{\Psi}^{\mathrm{T}}\boldsymbol{b} = \sigma\_{ab}.\tag{3.11}$$

From (3.11), it can be seen that *a*<sup>T</sup>*Σ*<sub>xx</sub><sup>−1/2</sup>*Σ*<sub>xy</sub>*Σ*<sub>yy</sub><sup>−1/2</sup>*b* is maximized when *a* and *b* are the left and right singular vectors corresponding to the maximum singular value of *M*. Thus, using the corresponding left and right singular vectors in *Γ* and *Ψ*, we can obtain the projection vectors *α* and *β* of a pair of canonical correlation variables, namely,

$$\begin{aligned} \alpha &= \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{a} \\ \beta &= \boldsymbol{\Sigma}\_{yy}^{-1/2} \boldsymbol{b}. \end{aligned} \tag{3.12}$$
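The SVD route can be sketched as follows, assuming standardized data; `scipy.linalg.fractional_matrix_power` is one way to form the inverse square roots Σ<sup>−1/2</sup>, and the function name `cca_svd` plus the toy data are assumptions for the example.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def cca_svd(X, Y):
    """All canonical projection vectors via the SVD route (3.9)-(3.12).

    X (n x m) and Y (n x l) are assumed standardized column-wise.
    """
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Sxx_ih = fractional_matrix_power(Sxx, -0.5).real   # Sigma_xx^{-1/2}
    Syy_ih = fractional_matrix_power(Syy, -0.5).real   # Sigma_yy^{-1/2}
    M = Sxx_ih @ Sxy @ Syy_ih                          # matrix of (3.10)
    Gamma, sing, PsiT = np.linalg.svd(M)
    alphas = Sxx_ih @ Gamma                            # (3.12): alpha = Sxx^{-1/2} a
    betas = Syy_ih @ PsiT.T                            #         beta  = Syy^{-1/2} b
    return alphas, betas, sing

rng = np.random.default_rng(1)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)
alphas, betas, sing = cca_svd(X, Y)
```

The singular values of *M* are the sample canonical correlations, so the correlation of the first projected pair equals `sing[0]`.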

## *3.1.4 CCA-Based Fault Detection*

When there is a clear input-output relationship between two types of data measurable online, CCA can be used to design an effective fault detection system. The CCA-based fault detection method can be considered an alternative to PCA-based fault detection and an extension of PLS-based fault detection (Chen et al. 2016a).

Let

$$\begin{aligned} \boldsymbol{J}\_s &= \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\Gamma}(:, 1:\kappa) \\ \boldsymbol{L}\_s &= \boldsymbol{\Sigma}\_{yy}^{-1/2} \boldsymbol{\Psi}(:, 1:\kappa) \\ \boldsymbol{J}\_{\text{res}} &= \boldsymbol{\Sigma}\_{xx}^{-1/2} \boldsymbol{\Gamma}(:, \kappa + 1:m) \\ \boldsymbol{L}\_{\text{res}} &= \boldsymbol{\Sigma}\_{yy}^{-1/2} \boldsymbol{\Psi}(:, \kappa + 1:l). \end{aligned}$$

According to the CCA method, *J*<sub>s</sub><sup>T</sup>*x* and *L*<sub>s</sub><sup>T</sup>*y* are closely related. However, in actual systems, the measured variables are inevitably affected by noise, and the correlation between *J*<sub>s</sub><sup>T</sup>*x* and *L*<sub>s</sub><sup>T</sup>*y* can be expressed as

$$\boldsymbol{L}\_s^\mathrm{T} \boldsymbol{y}(k) = \boldsymbol{\Lambda}\_\kappa^\mathrm{T} \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{x}(k) + \upsilon\_s(k),\tag{3.13}$$

where υ<sub>s</sub> is a noise term weakly correlated with *J*<sub>s</sub><sup>T</sup>*x*. Based on this, the residual vector is

$$\boldsymbol{r}\_1(k) = \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{y}(k) - \boldsymbol{\Lambda}\_\kappa^\mathrm{T} \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{x}(k). \tag{3.14}$$

Assume that the input and output data obey the Gaussian distribution. It is known that linear transformation does not change the distribution of random variables, so the residual signal *r*<sup>1</sup> also obeys the Gaussian distribution and its covariance matrix is

$$\boldsymbol{\Sigma}\_{r1} = \frac{1}{N-1} \left( \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{Y} - \boldsymbol{\Lambda}\_\kappa^\mathrm{T} \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{X} \right) \left( \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{Y} - \boldsymbol{\Lambda}\_\kappa^\mathrm{T} \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{X} \right)^\mathrm{T} = \frac{\boldsymbol{I}\_\kappa - \boldsymbol{\Lambda}\_\kappa^2}{N-1}. \tag{3.15}$$

Similarly, another residual vector can be obtained

$$\boldsymbol{r}\_2(k) = \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{x}(k) - \boldsymbol{\Lambda}\_\kappa \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{y}(k). \tag{3.16}$$

Its covariance matrix is

$$\boldsymbol{\Sigma}\_{r2} = \frac{1}{N-1} \left( \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{X} - \boldsymbol{\Lambda}\_\kappa \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{Y} \right) \left( \boldsymbol{J}\_s^\mathrm{T} \boldsymbol{X} - \boldsymbol{\Lambda}\_\kappa \boldsymbol{L}\_s^\mathrm{T} \boldsymbol{Y} \right)^\mathrm{T} = \frac{\boldsymbol{I}\_\kappa - \boldsymbol{\Lambda}\_\kappa^2}{N-1}. \tag{3.17}$$

It can be seen from (3.15) and (3.17) that the covariance matrices of the residuals *r*<sub>1</sub> and *r*<sub>2</sub> are the same. For fault detection, the following two statistics can be constructed:

$$T\_1^2(k) = (N - 1)\boldsymbol{r}\_1^\mathrm{T}(k) \left(\boldsymbol{I}\_\kappa - \boldsymbol{\Lambda}\_\kappa^2\right)^{-1} \boldsymbol{r}\_1(k) \tag{3.18}$$

$$T\_2^2(k) = (N - 1)\boldsymbol{r}\_2^\mathrm{T}(k) \left(\boldsymbol{I}\_\kappa - \boldsymbol{\Lambda}\_\kappa^2\right)^{-1} \boldsymbol{r}\_2(k). \tag{3.19}$$
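A compact sketch of such a detector is given below; the toy data, the simulated sensor bias, and the χ² alarm threshold are assumptions for illustration. Note that when the Σ blocks are estimated directly as sample covariances of standardized data, the per-sample residual covariance is approximately *I* − Λ²<sub>κ</sub>, so the (*N* − 1) factors in (3.15) and (3.18) cancel and the statistic is computed simply as r₁ᵀ(*I* − Λ²<sub>κ</sub>)⁻¹r₁.

```python
import numpy as np
from scipy import stats
from scipy.linalg import fractional_matrix_power

def cca_fault_detector(X, Y, kappa):
    """Train the residual generator (3.14) and the T_1^2 statistic on
    standardized normal-operation data X (N x m), Y (N x l)."""
    N = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    Sxx_ih = fractional_matrix_power(Sxx, -0.5).real
    Syy_ih = fractional_matrix_power(Syy, -0.5).real
    Gamma, sing, PsiT = np.linalg.svd(Sxx_ih @ Sxy @ Syy_ih)
    Js = Sxx_ih @ Gamma[:, :kappa]                 # J_s
    Ls = Syy_ih @ PsiT.T[:, :kappa]                # L_s
    Lam = np.diag(sing[:kappa])                    # Lambda_kappa
    # Per-sample residual covariance for standardized data is I - Lambda^2;
    # the statistic below is the Mahalanobis norm of the residual r_1.
    S_inv = np.linalg.inv(np.eye(kappa) - Lam @ Lam)

    def t1_squared(x, y):
        r1 = Ls.T @ y - Lam.T @ (Js.T @ x)         # residual (3.14)
        return float(r1 @ S_inv @ r1)

    return t1_squared

# Normal-operation training data sharing one latent factor.
rng = np.random.default_rng(2)
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.column_stack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)

t1sq = cca_fault_detector(X, Y, kappa=1)
threshold = stats.chi2.ppf(0.99, df=1)             # alarm limit for kappa = 1
t_normal = t1sq(X[0], Y[0])
t_fault = t1sq(X[0], Y[0] + np.array([5.0, 0.0]))  # simulated bias on y_1
```

A sample with a simulated bias on *y*<sub>1</sub> produces a statistic far above both the alarm limit and the value of a normal sample.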

## **3.2 Partial Least Squares**

Multiple linear regression is common, and the least squares method is generally used to estimate the regression coefficients. But least squares often fails when the independent variables are strongly collinear or when the number of samples is smaller than the number of variables. The partial least squares technique was developed to resolve this problem. S. Wold, C. Albano, and co-workers first proposed partial least squares and applied it in the field of chemistry (Wold et al. 1989). It addresses regression modeling between two sets of highly correlated multivariate data and integrates the basic functions of multiple linear regression analysis, principal component analysis, and canonical correlation analysis. PLS is also called the second-generation regression analysis method because it simplifies the modeling of data structure and correlation (Hair et al. 2016). It has developed rapidly and has been widely used in various fields in recent years (Okwuashi et al. 2020; Ramin et al. 2018).

## *3.2.1 Fundamentals of PLS*

Suppose there are *l* dependent variables (*y*<sub>1</sub>, *y*<sub>2</sub>,..., *y*<sub>*l*</sub>) and *m* independent variables (*x*<sub>1</sub>, *x*<sub>2</sub>,..., *x*<sub>*m*</sub>). To study the statistical relationship between the dependent and independent variables, *n* sample points are observed, which constitute the data sets *X* = [*x*<sub>1</sub>, *x*<sub>2</sub>,..., *x*<sub>*m*</sub>] ∈ *R*<sup>*n*×*m*</sup> and *Y* = [*y*<sub>1</sub>, *y*<sub>2</sub>,..., *y*<sub>*l*</sub>] ∈ *R*<sup>*n*×*l*</sup> of the independent and dependent variables.

To address the problems encountered in least squares multiple regression between *X* and *Y*, the concept of component extraction is introduced in PLS regression analysis. Recall that principal component analysis, for a single data matrix *X*, finds the composite variables that best summarize the information in the original data; the principal component *T* of *X* is extracted to carry the maximum variance information:

$$\max \text{var}(T).\tag{3.20}$$

PLS extracts component vectors *t*<sub>*i*</sub> and *u*<sub>*i*</sub> from *X* and *Y*, which means *t*<sub>*i*</sub> is a linear combination of (*x*<sub>1</sub>, *x*<sub>2</sub>,..., *x*<sub>*m*</sub>) and *u*<sub>*i*</sub> is a linear combination of (*y*<sub>1</sub>, *y*<sub>2</sub>,..., *y*<sub>*l*</sub>). To meet the needs of regression analysis, the extracted components should satisfy the following two requirements:

	- (1) *t*<sub>*i*</sub> and *u*<sub>*i*</sub> should carry as much of the variation information in *X* and *Y*, respectively, as possible;
	- (2) the correlation between *t*<sub>*i*</sub> and *u*<sub>*i*</sub> should be maximized.

The two requirements indicate that *t*<sub>*i*</sub> and *u*<sub>*i*</sub> should represent the data sets *X* and *Y* as well as possible, while the component *t*<sub>*i*</sub> of the independent variables has the best ability to explain the component *u*<sub>*i*</sub> of the dependent variables.

## *3.2.2 PLS Algorithm*

The most popular algorithm used in PLS to compute the vectors in the calibration step is known as nonlinear iterative partial least squares (NIPALS). First, the data are normalized to facilitate the calculations: normalizing *X* gives the matrix *E*<sub>0</sub>, and normalizing *Y* gives the matrix *F*<sub>0</sub>:

$$\boldsymbol{E}\_0 = \begin{bmatrix} x\_{11} & \cdots & x\_{1m} \\ \vdots & & \vdots \\ x\_{n1} & \cdots & x\_{nm} \end{bmatrix}, \quad \boldsymbol{F}\_0 = \begin{bmatrix} y\_{11} & \cdots & y\_{1l} \\ \vdots & & \vdots \\ y\_{n1} & \cdots & y\_{nl} \end{bmatrix} \tag{3.21}$$

In the first step, let *t*<sub>1</sub> = *E*<sub>0</sub>*w*<sub>1</sub> be the first component of *E*<sub>0</sub>, where *w*<sub>1</sub>, the first direction vector of *E*<sub>0</sub>, is a unit vector, ‖*w*<sub>1</sub>‖ = 1. Similarly, let *u*<sub>1</sub> = *F*<sub>0</sub>*c*<sub>1</sub> be the first component of *F*<sub>0</sub>, where *c*<sub>1</sub>, the first direction vector of *F*<sub>0</sub>, is also a unit vector, ‖*c*<sub>1</sub>‖ = 1.

According to the principle of principal component analysis, *t*<sup>1</sup> and *u*<sup>1</sup> should meet the following conditions in order to be able to represent the data variation information in *X* and *Y* well:

$$\begin{aligned} &\max \text{var}(t\_1) \\ &\max \text{var}(u\_1) \end{aligned} \tag{3.22}$$

On the other hand, *t*<sup>1</sup> is further required to have the best explanatory ability for *u*<sup>1</sup> due to the needs of regression modeling. According to the thinking of canonical correlation analysis, the correlation between *t*<sup>1</sup> and *u*<sup>1</sup> should reach the maximum value:

$$\max r\left(t\_1, u\_1\right). \tag{3.23}$$

The covariance of *t*<sup>1</sup> and *u*<sup>1</sup> is usually used to describe the correlation in partial least squares regression:

$$\max \text{Cov}\left(t\_1, u\_1\right) = \sqrt{\text{var}\left(t\_1\right) \text{var}\left(u\_1\right)}\, r\left(t\_1, u\_1\right) \tag{3.24}$$

Converted into a standard mathematical expression, *t*<sub>1</sub> and *u*<sub>1</sub> are found by solving the following optimization problem:

$$\begin{aligned} \max\_{\boldsymbol{w}\_{1}, \boldsymbol{c}\_{1}} & \langle E\_{0} \boldsymbol{w}\_{1}, F\_{0} \boldsymbol{c}\_{1} \rangle \\ \text{s.t.} & \begin{cases} \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{w}\_{1} = 1 \\ \boldsymbol{c}\_{1}^{\mathrm{T}} \boldsymbol{c}\_{1} = 1. \end{cases} \end{aligned} \tag{3.25}$$

Therefore, we need to maximize *w*<sub>1</sub><sup>T</sup>*E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*c*<sub>1</sub> under the constraints ‖*w*<sub>1</sub>‖² = 1 and ‖*c*<sub>1</sub>‖² = 1.

In this case, the Lagrangian function is

$$s = \boldsymbol{w}\_1^\mathrm{T} \boldsymbol{E}\_0^\mathrm{T} \boldsymbol{F}\_0 \boldsymbol{c}\_1 - \lambda\_1 \left(\boldsymbol{w}\_1^\mathrm{T} \boldsymbol{w}\_1 - 1\right) - \lambda\_2 \left(\boldsymbol{c}\_1^\mathrm{T} \boldsymbol{c}\_1 - 1\right). \tag{3.26}$$

Calculate the partial derivatives of *s* with respect to *w*1, *c*1, λ1, and λ2, and let them be zero

$$\frac{\partial s}{\partial \mathbf{w}\_1} = E\_0^\mathrm{T} F\_0 \mathbf{c}\_1 - 2\lambda\_1 \mathbf{w}\_1 = 0,\tag{3.27}$$


$$\frac{\partial s}{\partial \boldsymbol{c}\_{1}} = \boldsymbol{F}\_{0}^{\mathrm{T}} \boldsymbol{E}\_{0} \boldsymbol{w}\_{1} - 2\lambda\_{2} \boldsymbol{c}\_{1} = \boldsymbol{0},\tag{3.28}$$

$$\frac{\partial \mathbf{s}}{\partial \lambda\_1} = -\left(\mathbf{w}\_1^\mathrm{T} \mathbf{w}\_1 - 1\right) = 0,\tag{3.29}$$

$$\frac{\partial \mathbf{s}}{\partial \lambda\_2} = -\left(\mathbf{c}\_1^\mathsf{T} \mathbf{c}\_1 - 1\right) = 0.\tag{3.30}$$

It can be derived from the above formulas that

$$2\lambda\_1 = 2\lambda\_2 = \boldsymbol{w}\_1^\mathsf{T} \boldsymbol{E}\_0^\mathsf{T} \boldsymbol{F}\_0 \boldsymbol{c}\_1 = \langle E\_0 \boldsymbol{w}\_1, F\_0 \boldsymbol{c}\_1 \rangle \tag{3.31}$$

Let θ<sub>1</sub> = 2λ<sub>1</sub> = 2λ<sub>2</sub> = *w*<sub>1</sub><sup>T</sup>*E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*c*<sub>1</sub>, so θ<sub>1</sub> is the value of the objective function of the optimization problem (3.25). Then (3.27) and (3.28) can be rewritten as

$$E\_0^\mathrm{T} F\_0 \boldsymbol{c}\_1 = \theta\_1 \boldsymbol{w}\_1,\tag{3.32}$$

$$F\_0^\mathrm{T} E\_0 \boldsymbol{w}\_1 = \theta\_1 \boldsymbol{c}\_1. \tag{3.33}$$

Substitute (3.33) into (3.32),

$$E\_0^\mathrm{T} F\_0 F\_0^\mathrm{T} E\_0 \boldsymbol{w}\_1 = \theta\_1^2 \boldsymbol{w}\_1. \tag{3.34}$$

Similarly, substituting (3.32) into (3.33),

$$\boldsymbol{F}\_0^\mathrm{T} \boldsymbol{E}\_0 \boldsymbol{E}\_0^\mathrm{T} \boldsymbol{F}\_0 \mathbf{c}\_1 = \theta\_1^2 \mathbf{c}\_1. \tag{3.35}$$

Equation (3.34) shows that *w*<sub>1</sub> is an eigenvector of the matrix *E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub> with corresponding eigenvalue θ<sub>1</sub>². Since θ<sub>1</sub> is the objective function to be maximized, *w*<sub>1</sub> should be the unit eigenvector corresponding to the largest eigenvalue of *E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub>. Similarly, *c*<sub>1</sub> should be the unit eigenvector corresponding to the largest eigenvalue of the matrix *F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub>*E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>.

Then the first components *t*<sup>1</sup> and *u*<sup>1</sup> are calculated from the direction vectors *w*<sup>1</sup> and *c*1:

$$\begin{aligned} t\_1 &= E\_0 \boldsymbol{w}\_1 \\ u\_1 &= F\_0 \boldsymbol{c}\_1. \end{aligned} \tag{3.36}$$

The regression equations of *E*<sub>0</sub> and *F*<sub>0</sub> on *t*<sub>1</sub> and *u*<sub>1</sub> are

$$\begin{aligned} E_0 &= t_1 \boldsymbol{p}_1^\mathrm{T} + E_1\\ F_0 &= u_1 \boldsymbol{q}_1^\mathrm{T} + F_1^*\\ F_0 &= t_1 \boldsymbol{r}_1^\mathrm{T} + F_1. \end{aligned} \tag{3.37}$$

The regression coefficient vectors in (3.37) are

$$\begin{aligned} p\_1 &= \frac{E\_0^\mathrm{T} t\_1}{\|t\_1\|^2} \\ q\_1 &= \frac{F\_0^\mathrm{T} u\_1}{\|u\_1\|^2} \\ r\_1 &= \frac{F\_0^\mathrm{T} t\_1}{\|t\_1\|^2} .\end{aligned} \tag{3.38}$$

$E_1$, $F_1^*$, and $F_1$ are the residual matrices of the three regression equations.

The second step is to replace $E_0$ and $F_0$ with the residual matrices $E_1$ and $F_1$, respectively, and find the second pair of direction vectors $\mathbf{w}_2$, $\mathbf{c}_2$ and the second pair of components $t_2$ and $u_2$:

$$\begin{aligned} t_2 &= E_1 \mathbf{w}_2 \\ u_2 &= F_1 \mathbf{c}_2 \\ \theta_2 &= \mathbf{w}_2^\mathrm{T} E_1^\mathrm{T} F_1 \mathbf{c}_2. \end{aligned} \tag{3.39}$$

Similarly, $\mathbf{w}_2$ is the unit eigenvector corresponding to the largest eigenvalue of the matrix $E_1^\mathrm{T} F_1 F_1^\mathrm{T} E_1$, and $\mathbf{c}_2$ is the unit eigenvector corresponding to the largest eigenvalue of the matrix $F_1^\mathrm{T} E_1 E_1^\mathrm{T} F_1$. Calculate the regression coefficients

$$\begin{aligned} p\_2 &= \frac{\mathbf{E}\_1^\mathrm{T} t\_2}{\|\mathbf{t}\_2\|^2} \\ r\_2 &= \frac{\mathbf{F}\_1^\mathrm{T} t\_2}{\|\mathbf{t}\_2\|^2} . \end{aligned} \tag{3.40}$$

The regression equation is updated:

$$\begin{aligned} E\_1 &= t\_2 p\_2^\mathrm{T} + E\_2\\ F\_1 &= t\_2 r\_2^\mathrm{T} + F\_2. \end{aligned} \tag{3.41}$$

Repeat the calculation according to the above steps. If the rank of $X$ is $R$, the regression equations can be obtained:

$$\begin{aligned} E\_0 &= t\_1 \boldsymbol{p}\_1^\mathrm{T} + \cdots + t\_R \boldsymbol{p}\_R^\mathrm{T} \\ F\_0 &= t\_1 \boldsymbol{r}\_1^\mathrm{T} + \cdots + t\_R \boldsymbol{r}\_R^\mathrm{T} + F\_R. \end{aligned} \tag{3.42}$$
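The sequential extraction and deflation (3.36)-(3.42) can be sketched in a few lines of NumPy. The helper `pls_nipals` and its data are illustrative stand-ins, not code from the book:

```python
import numpy as np

def pls_nipals(X, Y, a):
    """Extract `a` PLS components by the deflation scheme of (3.36)-(3.41)."""
    E, F = X.copy(), Y.copy()
    T, P, R = [], [], []
    for _ in range(a):
        U, s, Vt = np.linalg.svd(E.T @ F)    # leading pair gives w, c
        w, c = U[:, 0], Vt[0, :]
        t = E @ w                            # score vector, Eq. (3.36)
        p = E.T @ t / (t @ t)                # loading vector, Eq. (3.38)
        r = F.T @ t / (t @ t)                # Y-regression vector, Eq. (3.38)
        E = E - np.outer(t, p)               # deflation, Eqs. (3.37)/(3.41)
        F = F - np.outer(t, r)
        T.append(t); P.append(p); R.append(r)
    return np.column_stack(T), np.column_stack(P), np.column_stack(R), E, F

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
Y = X @ rng.standard_normal((5, 2)) + 0.01 * rng.standard_normal((30, 2))
T, P, R, E_res, F_res = pls_nipals(X, Y, a=5)
# with a = rank(X) = 5 components the X-residual vanishes, cf. Eq. (3.42)
print(np.abs(E_res).max())
```

Each deflation removes the current score direction from $E$, so after $R = \operatorname{rank}(X)$ steps the $X$-residual is numerically zero, matching (3.42).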

If the number of components used in the PLS modeling is large enough, the residuals can be zero. In general, it suffices to select $a$ ($a \le R$) components among them to form a regression model with good prediction. The number of principal components required for modeling is determined by the cross-validation discussed in Sect. 3.2.3. Once the appropriate number of components is determined, the external relationship of the input variable matrix $X$ is written as

$$X = \boldsymbol{T}\boldsymbol{P}^{\mathrm{T}} + \bar{\boldsymbol{X}} = \sum\_{h=1}^{a} \boldsymbol{t}\_{h}\boldsymbol{p}\_{h}^{\mathrm{T}} + \bar{\boldsymbol{X}}.\tag{3.43}$$

The external relationship of the output variable matrix *Y* can be written as

$$Y = U Q^\mathrm{T} + \bar{Y} = \sum_{h=1}^{a} u_h \boldsymbol{q}_h^\mathrm{T} + \bar{Y}.\tag{3.44}$$

The internal relationship is expressed as

$$
\hat{\mathbf{u}}\_h = \mathbf{b}\_h \mathbf{t}\_h, \quad \mathbf{b}\_h = \mathbf{t}\_h^\mathsf{T} \mathbf{u}\_h / \mathbf{t}\_h^\mathsf{T} \mathbf{t}\_h. \tag{3.45}
$$

# *3.2.3 Cross-Validation Test*

In many cases, the PLS equation does not require the selection of all principal components for regression modeling, but rather, as in principal component analysis, the first *d*(*d* ≤ *l*) principal components can be selected in a truncated manner, and a better predictive model can be obtained using only these *d* principal components. In fact, if the subsequent principal components no longer provide more meaningful information to explain the dependent variable, using too many principal components will only undermine the understanding of the statistical trend and lead to wrong prediction conclusions. The number of principal components required for modeling can be determined by cross-validation.

Cross-validation is used to prevent the over-fitting caused by a complex model. Sometimes referred to as rotation estimation, it is a statistically useful method for cutting a data sample into smaller subsets. The analysis is first done on one subset, while the other subset is used for subsequent confirmation and validation of this analysis. The subset used for analysis is called the training set; the other is called the validation set and is generally kept separate from the testing set. Two cross-validation methods often used in practice are $K$-fold cross-validation (K-CV) and leave-one-out cross-validation (LOO-CV).

K-CV divides the $n$ original data into $K$ groups (generally evenly divided) and lets each subset serve as the validation set once, while the remaining $K-1$ subsets form the training set, so K-CV results in $K$ models. In general, $K$ is selected between 5 and 10. LOO-CV is essentially $n$-CV. The process of determining the number of principal components is described in detail below using LOO-CV as an example.
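The K-CV splitting just described can be sketched in a few lines (sizes are illustrative; LOO-CV is the special case $K = n$):

```python
import numpy as np

def kfold_indices(n, K, seed=0):
    """Split n sample indices into K roughly even folds for K-CV."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, K)

folds = kfold_indices(n=20, K=5)
for i, val in enumerate(folds):           # each fold is the validation set once
    train = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit a model on `train`, validate on `val`: K models in total
print([len(f) for f in folds])            # [4, 4, 4, 4, 4]
```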

All $n$ samples are divided into two parts. The first part is the set of all samples excluding a certain sample $i$ (containing $n-1$ samples in total), and a regression equation with $d$ principal components is fitted to this data set. The second part substitutes the excluded $i$th sample into the fitted regression equation to obtain the predicted values $\hat{y}_{(i)j}(d)$, $j = 1, 2, \ldots, l$, of $y_j$. Repeating the above test for each $i = 1, 2, \ldots, n$, the sum of squared prediction errors for $y_j$ is defined as $\mathrm{PRESS}_j(d)$:

$$\text{PRESS}\_{j}(d) = \sum\_{i=1}^{n} \left( \mathbf{y}\_{ij} - \hat{\mathbf{y}}\_{(i)j}(d) \right)^{2}, j = 1, 2, \dots, l. \tag{3.46}$$

The sum of squared prediction errors of $Y = (y_1, \ldots, y_l)^\mathrm{T}$ can be obtained as

$$\text{PRESS}(d) = \sum\_{j=1}^{l} \text{PRESS}\_j(d). \tag{3.47}$$

Obviously, if the robustness of the regression equation is poor, it is very sensitive to changes in the samples, and the effect of this perturbation error increases the $\mathrm{PRESS}(d)$ value.

On the other hand, use all sample points to fit a regression equation containing $d$ components. In this case, the fitted value of the $i$th sample point is $\hat{y}_{ij}(d)$. The fitted error sum of squares for $y_j$ is defined as $\mathrm{SS}_j(d)$:

$$\text{SS}\_j(d) = \sum\_{i=1}^n \left( \mathbf{y}\_{ij} - \hat{\mathbf{y}}\_{ij}(d) \right)^2. \tag{3.48}$$

The sum of squared errors of *Y* is

$$\text{SS}(d) = \sum_{j=1}^{l} \text{SS}_j(d). \tag{3.49}$$

Generally, $\mathrm{PRESS}(d)$ is greater than $\mathrm{SS}(d)$ because $\mathrm{PRESS}(d)$ contains an unknown perturbation error, and the fitting error decreases as components are added, i.e., $\mathrm{SS}(d)$ is less than $\mathrm{SS}(d-1)$. Next, compare $\mathrm{SS}(d-1)$ and $\mathrm{PRESS}(d)$: $\mathrm{SS}(d-1)$ is the fitting error of the regression equation fitted with all samples using $d-1$ components, while $\mathrm{PRESS}(d)$ contains the perturbation error of the samples but uses one more component. If the prediction error of the $d$-component regression equation, perturbation included, is somewhat smaller than the fitting error of the $(d-1)$-component regression equation, then adding the component $t_d$ is considered to yield a significant improvement in prediction accuracy. Therefore, it is always desirable that the ratio $\mathrm{PRESS}(d)/\mathrm{SS}(d-1)$ be as small as possible. A common criterion is


$$\frac{\text{PRESS}(d)}{\text{SS}(d-1)} \le (1 - 0.05)^2 = 0.95^2. \tag{3.50}$$

If $\mathrm{PRESS}(d) \le 0.95^2\,\mathrm{SS}(d-1)$, the addition of the component $t_d$ is considered beneficial. Conversely, if $\mathrm{PRESS}(d) > 0.95^2\,\mathrm{SS}(d-1)$, the newly added component is considered to bring no significant improvement in reducing the prediction error of the regression equation.

In practice, the following cross-validation index is used. For each dependent variable $y_j$, define

$$\mathbf{Q}\_{dj}^2 = 1 - \frac{\text{PRESS}\_j(d)}{\text{SS}\_j(d-1)}.\tag{3.51}$$

For the full dependent variable $Y$, the cross-validation index of the component $t_d$ is defined as

$$\mathbf{Q}\_d^2 = 1 - \frac{\text{PRESS}(d)}{\text{SS}(d-1)}.\tag{3.52}$$
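A minimal LOO-CV sketch of (3.46)-(3.52) follows. Here `fit_predict` is a stand-in $d$-component regressor on synthetic data; any $d$-component model, such as PLS, could be plugged in instead:

```python
import numpy as np

def fit_predict(Xtr, Ytr, Xte, d):
    # Stand-in d-component model: regress Y on the top-d principal
    # directions of Xtr (any d-component regression would fit here)
    if d == 0:                      # no components: predict the training mean
        return np.tile(Ytr.mean(axis=0), (len(Xte), 1))
    _, _, Vt = np.linalg.svd(Xtr, full_matrices=False)
    B = np.linalg.lstsq(Xtr @ Vt[:d].T, Ytr, rcond=None)[0]
    return (Xte @ Vt[:d].T) @ B

def press(X, Y, d):
    """Leave-one-out prediction error PRESS(d), Eqs. (3.46)-(3.47)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        y_hat = fit_predict(X[mask], Y[mask], X[i:i + 1], d)
        total += np.sum((Y[i] - y_hat) ** 2)
    return total

def ss(X, Y, d):
    """Fitting error SS(d) on all samples, Eqs. (3.48)-(3.49)."""
    return np.sum((Y - fit_predict(X, Y, X, d)) ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6))
Y = X[:, :2] @ rng.standard_normal((2, 1)) + 0.1 * rng.standard_normal((40, 1))
for d in (1, 2, 3):
    Q2 = 1.0 - press(X, Y, d) / ss(X, Y, d - 1)            # Eq. (3.52)
    keep = press(X, Y, d) <= 0.95**2 * ss(X, Y, d - 1)     # rule (3.50)
    print(d, round(Q2, 3), keep)
```

Components are added while the rule (3.50) holds and dropped once it fails.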

The marginal contribution of the component $t_d$ to the predictive accuracy of the regression model is thus measured by these two cross-validation indices.


## **References**


Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321–377


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Simulation Platform for Fault Diagnosis**

The previous chapters have described the mathematical principles and algorithms of multivariate statistical methods, as well as the monitoring processes when they are used for fault diagnosis. To validate the effectiveness of data-driven multivariate statistical analysis methods in the field of fault diagnosis, it is necessary to conduct the corresponding fault monitoring experiments. Therefore, this chapter introduces two kinds of simulation platforms: the Tennessee Eastman (TE) process simulation system and the fed-batch penicillin fermentation process simulation system. They are widely used as test platforms for process monitoring, fault classification, and identification in industrial processes. The related experiments based on PCA, CCA, PLS, and FDA are completed on the TE simulation platform.

## **4.1 Tennessee Eastman Process**

The original TE industrial process control problem was developed by Downs and Vogel in 1993. It is used for open and challenging control-related topics, including multi-variable controller design, optimization, adaptive and predictive control, nonlinear control, estimation and identification, process monitoring and diagnosis, and education. The TE process model is established according to an actual chemical process and has been widely used as a benchmark process for control and monitoring research. Figure 4.1 shows the flow diagram of the TE process with five major units: reactor, condenser, compressor, vapor-liquid separator, and stripper. Four kinds of gaseous material *A*, *C*, *D*, and *E* are fed in for reaction; in addition, a small amount of inert gas *B* is contained in the above feeds. The final products are three liquids *G*, *H*, and *F*, where *F* is the by-product.

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_4

**Fig. 4.1** Tennessee Eastman process


Briefly, the TE process consists of two data modules: the XMV module containing 12 manipulated variables (XMV(1)–XMV(12): $x_{23}$–$x_{34}$) and the XMEAS module consisting of 22 process measured variables (XMEAS(1)–XMEAS(22): $x_1$–$x_{22}$) and 19 component measured variables (XMEAS(23)–XMEAS(41): $x_{35}$–$x_{53}$), as listed in Tables 4.1 and 4.2.

The code used in this book is available online at http://depts.washington.edu/control/LARRY/TE/download.html, where the code and data sets can be downloaded. The Simulink simulator allows easy setting and generation of the operation modes, measurement noises, sampling time, and magnitudes of the faults. It is thus very helpful for data-driven process monitoring studies. The 21 artificial disturbances (considered as faulty operations for the fault diagnosis problem) in the TE process are shown in Table 4.3. In general, the entire TE data consists of a training set and a testing set, and each set includes 22 kinds of data under different simulation operations. Each kind of data has sampled measurements of 53 observed variables.

In the data set given in the web link above, d00.dat to d21.dat are training sets, and d00\_te.dat to d21\_te.dat are testing sets. d00.dat and d00\_te.dat are samples under



**Table 4.1** Monitoring variables in the TE process ($x_1$–$x_{34}$)

the normal operation conditions. The training samples of d00.dat are sampled over a 25 h running simulation; the total number of observations is 500. The d00\_te.dat test samples are obtained over a 48 h running simulation, and the total number of observations is 960. d01.dat–d21.dat (for training) and d01\_te.dat–d21\_te.dat (for testing) are sampled with different faults, in which the numerical label of the data set corresponds to the fault type.

All the testing data sets are obtained under a 48 h running simulation with the faults introduced at 8 h. A total of 960 observations are collected, of which the first 160 observations are in normal operation. It is worth pointing out that the data sets


**Table 4.2** Monitoring variables in the TE process ($x_{35}$–$x_{53}$)

**Table 4.3** Disturbances for the TE process


generated by Chiang et al. (2001) are widely accepted for process monitoring and fault diagnosis research. The data sets are smoothed, filtered, and normalized. The monitored variables are $x_1$–$x_{53}$.

## **4.2 Fed-Batch Penicillin Fermentation Process**

Fed-batch fermentation processes are widely used in the pharmaceutical industry, where yield maximization is usually the main goal. The characteristics that distinguish batch operation from continuous operation include strong nonlinearity, non-stationary conditions, batch-to-batch variability, and strong time-varying behavior. These features make the yield difficult to predict. Therefore, fault detection, classification, and identification in batch/fed-batch processes are more difficult than in the continuous TE process.

The model of the fed-batch penicillin fermentation process is described by Birol et al. (2002):

$$\begin{aligned} \dot{X} &= f(X, S, C_L, H, T) \\ \dot{S} &= f(X, S, C_L, H, T) \\ \dot{C}_L &= f(X, S, C_L, H, T) \\ \dot{P} &= f(X, S, C_L, H, T, P) \\ \dot{CO}_2 &= f(X, H, T) \\ \dot{H} &= f(X, H, T), \end{aligned}$$

where $X$, $S$, $C_L$, $P$, $CO_2$, $H$, and $T$ are the biomass concentration, substrate concentration, dissolved oxygen concentration, penicillin concentration, carbon dioxide concentration, hydrogen ion concentration $[\mathrm{H}^+]$ (for pH), and temperature, respectively. The corresponding detailed mathematical model is given in Birol et al. (2002).

A research group at the Illinois Institute of Technology has developed a dynamic simulation of penicillin production based on an unstructured model, PenSim v2.0. This model has been used as a benchmark for statistical process monitoring studies of batch/fed-batch reaction processes. The flow chart of the fermentation process is depicted in Fig. 4.2. The fermentation unit consists of a fermentation reactor and a coil-based heat exchange unit. The pH and temperature are automatically controlled by two PID controllers that adjust the flow rates of acid/base and cold/hot water. In the fed-batch operation mode, the glucose substrate is fed continuously into the fermentation reactor in open loop.

Fourteen variables are considered in the PenSim v2.0 model, shown in Table 4.4: 5 input variables (1–4, 14) and 9 process variables (5–13). Since variables 11–13 are not measured online in industry, only 11 variables are monitored here.

**Fig. 4.2** Flow chart of the penicillin fermentation process


**Table 4.4** Variables in penicillin fermentation process

## **4.3 Fault Detection Based on PCA, CCA, and PLS**

This section tests the effectiveness of various multivariate statistical methods on the TE process. Faults in the standard TE data set are introduced at the 160th sample. For comparison purposes, the normal operation data d00\_te are chosen to train the statistical model, and the faulty operation data d01\_te–d21\_te are used to test the model and detect faults. In the experiments for the PCA and PLS methods, the process variable matrix $X$ consists of process variables (XMEAS(1–22)) and manipulated variables (XMV(1–11)). XMEAS(35) is used as the quality variable matrix $Y$ for PLS. In the CCA experiment, the process variables (XMEAS(1–22)) are used as one data set and the manipulated variables (XMV(1–11)) as another.

The fault detection rate (FDR) and false alarm rate (FAR) are defined as follows:

$$\begin{aligned} \text{FDR} &= \frac{\text{No. of samples}\,(J > J_{th} \mid f \neq 0)}{\text{total samples}\,(f \neq 0)} \times 100\\ \text{FAR} &= \frac{\text{No. of samples}\,(J > J_{th} \mid f = 0)}{\text{total samples}\,(f = 0)} \times 100. \end{aligned} \tag{4.1}$$
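Given a vector of statistic values and a threshold, (4.1) reduces to simple counting. The sketch below follows the TE convention of a fault introduced at sample 160; the statistic trace and threshold are synthetic stand-ins:

```python
import numpy as np

def fdr_far(J, J_th, fault_start):
    """Fault detection rate and false alarm rate, Eq. (4.1)."""
    alarms = J > J_th
    faulty = np.arange(len(J)) >= fault_start     # f != 0 after fault onset
    fdr = 100.0 * np.sum(alarms & faulty) / np.sum(faulty)
    far = 100.0 * np.sum(alarms & ~faulty) / np.sum(~faulty)
    return fdr, far

# Illustrative statistic trace: 960 samples, fault introduced at sample 160
rng = np.random.default_rng(3)
J = np.concatenate([rng.uniform(0.0, 1.0, 160),    # normal operation
                    rng.uniform(0.8, 3.0, 800)])   # faulty operation
fdr, far = fdr_far(J, J_th=1.0, fault_start=160)
print(f"FDR = {fdr:.1f}%, FAR = {far:.1f}%")
```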

Experiment and model parameters are determined as follows. The number of principal components of PCA is determined by a cumulative contribution of 90%. The number of principal components of PLS is selected as 6. The $T^2$ and $Q$ statistics are used to monitor process faults. It should be noted that in CCA monitoring, (3.18) and (3.19) are used as monitoring indices, and the corresponding monitoring results differ slightly. For the 21 fault types, the FDRs of PCA, CCA, and PLS based on the control limit at the 99% confidence level are shown in Table 4.5. It can be seen that the multivariate statistical methods listed in this section (PCA, CCA, and PLS) can accurately detect the significant process faults.

Figures 4.3, 4.4, and 4.5 show the monitoring results based on the PCA, CCA, and PLS models for the typical faults IDV(1), IDV(16), and IDV(20), respectively. Here, the black line is the statistic calculated from the real-time data and the red line is the normal statistic threshold from the offline model calculation.

From Table 4.5, it is easy to find that CCA detects certain fault types better, such as faults IDV(10), IDV(16), IDV(19), and IDV(20). The monitoring results for faults IDV(16) and IDV(20) are shown in Figs. 4.4 and 4.5. Why does CCA show better detection capability than the other two methods for certain faults? Consider the setting of the process variables $X$ for the three methods. In contrast to PCA and PLS, CCA splits its X-space directly into two parts and extracts the latent variables by examining the correlation between these two parts, i.e., the latent variables extracted by CCA can better characterize the changes in the process.

## **4.4 Fault Classification Based on FDA**

To further test the effectiveness of fault classification, samples from the 161st to the 700th of the 21 fault data sets and the normal data set are used for training the FDA model. The corresponding data from the 701st to the 960th samples are used to test the FDA model and its classification ability. FDA, introduced in Sect. 2.2, is a classical method to validate the classification effect and identify the fault types. The following distance metric is introduced to further quantify the difference between different faults:


**Table 4.5** FDRs of PCA, CCA and PLS

$$D_2 = \left\| \text{FDA}_i - \text{FDA}_j \right\|,$$

where $\text{FDA}_i$ denotes the FDA feature vector of the $i$th fault.
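A sketch of this index, under the assumption that $\text{FDA}_i$ is the mean of fault $i$'s samples projected onto the discriminant directions (the directions and data below are illustrative stand-ins):

```python
import numpy as np

def d2_index(scores_i, scores_j):
    """D2 = || FDA_i - FDA_j ||, with FDA_k the mean projected feature vector."""
    return np.linalg.norm(scores_i.mean(axis=0) - scores_j.mean(axis=0))

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 2))      # stand-in FDA discriminant directions
fault_a = rng.standard_normal((100, 5)) + np.array([3.0, 0, 0, 0, 0])
fault_b = rng.standard_normal((100, 5))
print(d2_index(fault_a @ W, fault_b @ W))
```

A large $D_2$ between two faults indicates they are well separated in the discriminant space and hence easy to classify apart.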

The simulation results are shown in Fig. 4.6. The 22 kinds of data (the normal operation and 21 faulty operations) can be roughly divided into two major categories. The first category consists of the faults that are significantly different from the others, namely IDV(2) (line with ♦), IDV(6) (line with ∗), and IDV(18) (line with ◦); the other category is the set of faults whose characteristics are relatively close to each other.

The faults IDV(1), IDV(2), IDV(6), and IDV(20) are further analyzed. The FDA results for fault classification are shown in Fig. 4.7. The D2 indices for these faults vary considerably, as the classification results clearly illustrate. Conversely, certain faults have very small differences in their D2 indices. For example, faults IDV(4), IDV(11), and IDV(14) have similar FDA D2 indices, as shown in Fig. 4.8. These faults are difficult to classify accurately based on the FDA model, as shown in Fig. 4.9.

**Fig. 4.6** D2 index for different faults

**Fig. 4.7** FDA identification result for the fault 1, 2, 6, and 20

**Fig. 4.8** D2 indices for fault 4, 11, and 14

**Fig. 4.9** FDA identification result for the fault 4, 11, and 14

# **4.5 Conclusions**

Two kinds of simulation platforms are introduced for verifying statistical monitoring methods, and several experiments based on the traditional methods PCA, PLS, CCA, and FDA are presented. These basic experiments illustrate the characteristics of the methods and their fault detection effects. Many improved methods exist to overcome the shortcomings and deficiencies of the original multivariate statistical analysis methods. Each method has its own conditions and scope of application; no single method completely outperforms the others. Furthermore, data-based fault detection methods need to be combined with the actual monitored objects, and existing methods need to be improved according to their knowledge and characteristics. Therefore, the remainder of this book focuses on fault detection (discrimination) strategies for batch processes and strongly nonlinear systems.

## **References**

Birol G, Undey C, Cinar A (2002) A modular simulation package for fed-batch fermentation: penicillin production. Comput Chem Eng 26:1553–1565

Chiang LH, Russell EL, Braatz RD (2001) Fault detection and diagnosis in industrial systems. Springer, London

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Soft-Transition Sub-PCA Monitoring of Batch Processes**

Batch and semi-batch processes are used to produce high-value-added products in the biological, food, and semiconductor industries. Batch processes, such as fermentation, polymerization, and pharmaceutical production, are highly sensitive to abnormal changes in operating conditions. Monitoring such processes is extremely important in order to achieve higher productivity. However, it is more difficult to develop an exact monitoring model for batch processes than for continuous processes, due to the common nature of batch processes: non-steady, time-varying, finite-duration, and nonlinear behaviors. The lack of an exact monitoring model in most batch processes means that an operator cannot identify faults when they occur. Therefore, effective techniques for monitoring batch processes exactly are necessary in order to prompt the operator to take corrective actions before the situation becomes more dangerous.

Generally, many batch processes are carried out in a sequence of steps, called multi-stage or multi-phase batch processes. Different phases have different inherent natures, so it is desirable to develop stage-based models in which each model represents a specific stage and focuses on a local behavior of the batch process. This chapter focuses on monitoring methods based on multi-phase models. An improved online sub-PCA method for multi-phase batch processes is proposed. A two-step stage-dividing algorithm based on the support vector data description (SVDD) technique is given to divide the multi-phase batch process into several operation stages reflecting their inherent process correlation nature. Mechanism knowledge is first considered by introducing the sampling time into the loading matrices of the PCA model, which can avoid segmentation mistakes caused by faulty data. Then the SVDD method is used to strictly refine the initial division and obtain the soft-transition sub-stage between the stable and transition periods. The idea of soft transition helps further improve the division accuracy. A representative model is then built for each sub-stage, and an online fault monitoring algorithm is given based on the division techniques above. This method can detect faults earlier and avoid false alarms


because of its more precise stage division, compared with the conventional sub-PCA method.

## **5.1 What Is Phase-Based Sub-PCA**

The general monitoring approach for batch processes is the phase/stage-based sub-PCA method, which divides the process into several phases (Yao and Gao 2009). Phase-based sub-PCA consists of three steps: **data matrix unfolding**, **phase division**, and **sub-PCA modeling**. The details of each are introduced below.

#### 1. **Data Matrix Unfolding**

Different from the continuous process, the historical data of a batch process form a three-dimensional array $X(I \times J \times K)$, where $I$ is the number of batches, $J$ is the number of variables, and $K$ is the number of sampling times. The original data $X$ should be rearranged into two-dimensional matrices prior to developing statistical models. Two traditional methods are widely applied, batch-wise unfolding and variable-wise unfolding, with batch-wise unfolding being the most common. The three-dimensional matrix $X$ is cut into $K$ time-slice matrices after the batch-wise unfolding is completed.

The three-dimensional process data $X(I \times J \times K)$ are batch-wise unfolded into the two-dimensional forms $X_k(I \times J)$, $(k = 1, 2, \ldots, K)$. Each time-slice matrix is placed beneath the previous one, not beside it, as shown in Fig. 5.1 (Westerhuis et al. 1999; Wold et al. 1998). Sometimes batches have different lengths, i.e., the sampling number $K$ differs from batch to batch; the process data then need to be aligned before unfolding. Many data alignment methods have been proposed, such as directly filling zeros at missing sampling times (Arteaga and Ferrer 2002) and dynamic time warping (Kassidas et al. 1998). These unfolding approaches do not require any estimation of unknown future data for online monitoring.
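For equal-length batches, batch-wise unfolding can be sketched as follows (array sizes are illustrative):

```python
import numpy as np

# X: I batches x J variables x K sampling times (illustrative sizes)
I, J, K = 20, 6, 100
X = np.random.default_rng(5).standard_normal((I, J, K))

# Batch-wise unfolding: one I x J time-slice matrix per sampling time k,
# stacked beneath one another rather than side by side
slices = [X[:, :, k] for k in range(K)]          # K matrices of shape (I, J)
X_unfolded = np.vstack(slices)                   # shape (I*K, J)
print(X_unfolded.shape)                          # (2000, 6)
```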

#### 2. **Phase Division**

The traditional multivariate statistical analysis methods are valid for the continuous process, since all variables are supposed to stay around a certain stable state and the correlation between variables remains relatively stable. Non-steady-state operating conditions, such as time-varying and multi-phase behavior, are typical characteristics of a batch process. The process correlation structure may change due to process dynamics and time-varying factors. A statistical model may be ill-suited if it takes the entire batch data as a single object, as the process correlations among different stages are then not captured effectively. Multi-phase statistical analysis therefore aims at employing a separate model for each forthcoming period, instead of using a single model for the entire process. Phase division plays a key role in batch process monitoring.

Much of the literature divides the process into multiple phases based on mechanism knowledge. For example, the division can be based on different processing units or distinguishable operational phases within each unit (Dong and McAvoy 1996; Reinikainen and Hoskuldsson 2007). It is suggested that process data can be naturally divided into groups prior to modeling and analysis. This stage division directly reflects the operational state of the process. However, the known prior knowledge is usually not sufficient to divide processes into phases reasonably.

Besides, Muthuswamy and Srinivasan identified several division points according to process variable features described in the form of multivariate rules (Muthuswamy and Srinivasan 2003). Undey and Cinar used an indicator variable containing significant landmarks to detect the completion of each phase (Undey and Cinar 2002). Doan and Srinivasan divided the phases based on singular points in some known key variables (Doan and Srinivasan 2008). Kosanovich, Dahl, and Piovoso pointed out that changes in the process variance information explained by principal components could indicate the division points between process stages (Kosanovich and Dahl 1996). There are many results in this area, but they do not give a clear strategy to distinguish the steady phase from the transition phase (Camacho and Pico 2006; Camacho et al. 2008; Yao and Gao 2009).

#### 3. **Sub-PCA Modeling**

Statistical models are constructed for all phases after the phase division and are not limited to PCA methods; sub-PCA is a representative example of these sub-statistical monitoring methods. The final sub-PCA model of each phase is calculated by taking the average of the time-slice PCA models in the corresponding phase. The number of principal components of each phase is determined based on the relative cumulative variance.

The $T^2$ and SPE statistics and their corresponding control limits are calculated according to the sub-PCA model. For new data, check the Euclidean distance from the center of each stage cluster to determine at which stage the new data are located; the corresponding sub-PCA model is then used to monitor the new data. A fault warning is issued according to the control limits of $T^2$ or SPE.
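A minimal sketch of this monitoring step for a single sub-stage model (NumPy; the model data are synthetic and the control limits are placeholder constants, whereas in practice they come from the corresponding statistical distributions):

```python
import numpy as np

def monitor(x_new, mean, std, P, lam, T2_lim, SPE_lim):
    """Check one new sample against a sub-stage PCA model via T2 and SPE.

    P is the J x a loading matrix and lam the a retained eigenvalues
    (principal-component variances) of the sub-stage model.
    """
    x = (x_new - mean) / std            # scale as in model building
    t = P.T @ x                         # principal-component scores
    T2 = np.sum(t**2 / lam)             # Hotelling's T2 statistic
    e = x - P @ t                       # residual part
    SPE = e @ e                         # squared prediction error (Q)
    return bool(T2 > T2_lim or SPE > SPE_lim)

# Stand-in sub-stage model built from synthetic normal-operation data
rng = np.random.default_rng(6)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))
mean, std = X.mean(axis=0), X.std(axis=0)
Xs = (X - mean) / std
_, s, Vt = np.linalg.svd(Xs, full_matrices=False)
a = 2
P, lam = Vt[:a].T, s[:a] ** 2 / (len(X) - 1)
print(monitor(mean, mean, std, P, lam, T2_lim=15.0, SPE_lim=10.0))   # False
```

A sample at the model center triggers no alarm, while a sample far along a principal direction exceeds the $T^2$ limit.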

## **5.2 SVDD-Based Soft-Transition Sub-PCA**

An industrial batch process operates in a variety of states, including grade changes, start-up, shutdown, and maintenance operations. A transitional region between neighboring stages is very common in a multi-stage process, showing the gradual changeover from one operation pattern to another. Usually a transitional phase first shows characteristics more similar to the previous stable phase and then, toward the end of the transition, more similar to the next stable phase. Different transition phases undergo different trajectories from one stable mode to another, with changes in characteristics that are more pronounced over the sampling time and more complex than those within a phase. Therefore, valid process monitoring during transitions is very important. Up to now, few investigations of transition modeling and monitoring have been reported (Zhao et al. 2007). Here, a new transition identification and monitoring method based on the SVDD division method is proposed.

# *5.2.1 Rough Stage-Division Based on Extended Loading Matrix*

The original three-dimensional array $X(I \times J \times K)$ is first batch-wise unfolded into the two-dimensional form $X_k$. By subtracting the grand mean of each variable over all times and all batches, the unfolding matrix $X_k$ is centered and scaled:

$$X\_k = \frac{[X\_k - \text{mean}\,(X\_k)]}{\sigma\,(X\_k)},\tag{5.1}$$

where mean$(X_k)$ and $\sigma(X_k)$ represent the mean value and the standard deviation of the matrix $X_k$, respectively. The main nonlinear and dynamic components of every variable are still left in the scaled matrix.
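The normalization (5.1) over the batch dimension can be sketched as follows (array sizes are illustrative):

```python
import numpy as np

I, J, K = 20, 6, 100
X = np.random.default_rng(7).standard_normal((I, J, K))

# Eq. (5.1): every time-slice X_k (I x J) is centered and scaled by the
# mean and standard deviation taken over the batch dimension
X_scaled = np.empty_like(X)
for k in range(K):
    Xk = X[:, :, k]
    X_scaled[:, :, k] = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0)
```

After scaling, each variable has zero mean and unit variance across batches at every sampling time.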

Suppose the unfolded matrix at each time slice is *X<sub>k</sub>*. Project it into the principal component subspace by the loading matrix *P<sub>k</sub>* to obtain the score matrix *T<sub>k</sub>*:

$$X\_k = T\_k \mathbf{P}\_k^T + E\_k,\tag{5.2}$$

where *E<sub>k</sub>* is the residual. The first few principal components, which represent the major variation of the original data set *X<sub>k</sub>*, are retained. The original data set *X<sub>k</sub>* is thus divided into the model prediction *X***ˆ**<sub>*k*</sub> = *T<sub>k</sub>P<sub>k</sub>*<sup>T</sup> and the residual matrix *E<sub>k</sub>*. Useful techniques, such as cross-validation, can be applied to determine the appropriate number of retained principal components. The loading matrix *P<sub>k</sub>* and singular value matrix *S<sub>k</sub>* of each time-slice matrix *X<sub>k</sub>* can then be obtained.
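The time-slice PCA of (5.2) can be sketched via the SVD. This is a minimal sketch: the cumulative-variance rule below for picking the number of components is a simple stand-in for the cross-validation mentioned in the text.

```python
import numpy as np

def time_slice_pca(Xk, var_explained=0.85):
    """PCA of one normalized time-slice matrix Xk (I x J) via SVD.
    Returns loadings P_k (J x a), retained singular values, scores T_k,
    and residuals E_k, with a chosen by cumulative explained variance."""
    U, s, Vt = np.linalg.svd(Xk, full_matrices=False)
    var_ratio = s**2 / np.sum(s**2)
    a = int(np.searchsorted(np.cumsum(var_ratio), var_explained) + 1)
    Pk = Vt[:a].T                # loading matrix
    Tk = Xk @ Pk                 # score matrix
    Ek = Xk - Tk @ Pk.T          # residual matrix
    return Pk, s[:a], Tk, Ek

rng = np.random.default_rng(1)
Xk = rng.normal(size=(20, 5))
Xk = (Xk - Xk.mean(0)) / Xk.std(0, ddof=1)   # normalized time slice
Pk, s, Tk, Ek = time_slice_pca(Xk)
```

The loadings are orthonormal and the score/residual split reconstructs the data exactly, matching (5.2).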

As the loading matrix *P<sub>k</sub>* reflects the correlations of the process variables, it is usually used to identify the process stage. However, disturbances caused by measurement noise or other factors can lead to incorrect division, because a loading matrix obtained from process data alone cannot distinguish corrupted data from transition-phase data. Generally, the different phases in a batch process can first be distinguished using mechanistic knowledge.

The sampling time is added to the loading matrix in order to divide the process precisely. As the sampling time is a continuously increasing quantity, it must also be centered and scaled before being appended to the loading matrix. Unlike the process data *X*, the sampling time is centered and scaled not along the batch dimension but along the time dimension within one batch. The scaled time *t<sub>k</sub>* is then expanded into a vector ***t***<sub>*k*</sub> by multiplying it with a unit column vector, giving the new extended time-slice matrix *P***ˆ**<sub>*k*</sub> = [*P<sub>k</sub>*, ***t***<sub>*k*</sub>], in which ***t***<sub>*k*</sub> is a column vector whose entries all repeat the current sampling time. The sampling time does not change much as the batch proceeds, but it has an obvious effect on phase separation. Define the Euclidean distance between extended loading matrices as

$$\begin{aligned} \left\| \hat{\boldsymbol{P}}\_{i} - \hat{\boldsymbol{P}}\_{j} \right\|^{2} &= \left[ \mathbf{P}\_{i} - \mathbf{P}\_{j}, \ t\_{i} - t\_{j} \right] \left[ \mathbf{P}\_{i} - \mathbf{P}\_{j}, \ t\_{i} - t\_{j} \right]^{\mathrm{T}} \\ &= \left\| \mathbf{P}\_{i} - \mathbf{P}\_{j} \right\|^{2} + \left\| t\_{i} - t\_{j} \right\|^{2} . \end{aligned} \tag{5.3}$$

The batch process can then be divided into *S*<sub>1</sub> stages by clustering the extended loading matrices *P***ˆ**<sub>*k*</sub> with the *K*-means method.

Clearly, the Euclidean distance between extended loading matrices includes both the data difference and the sampling-time difference. Data at different stages differ significantly in sampling time. Therefore, when noise makes data from different stages present the same or similar characteristics, the large difference in sampling time keeps the final Euclidean distance large. Erroneously divided data differ greatly in sampling time from the data of other stages, whereas transition-stage data vary only slightly in sampling time, so erroneous divisions caused by noise can easily be distinguished from the transition phase.
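The rough division can be sketched as follows, under simplifying assumptions: each extended loading matrix *P***ˆ**<sub>*k*</sub> is flattened into a feature row with the scaled sampling time appended, the two "regimes" in the data are synthetic, and a minimal Lloyd's *K*-means with deterministic initialization stands in for a library clusterer.

```python
import numpy as np

def kmeans(F, n_clusters, init_idx, iters=100):
    """Minimal Lloyd's K-means on feature rows F; init_idx selects the
    initial centers deterministically (a stand-in for a library call)."""
    centers = F[list(init_idx)].astype(float)
    labels = np.zeros(len(F), dtype=int)
    for _ in range(iters):
        d = ((F[:, None, :] - centers[None]) ** 2).sum(-1)   # squared distances
        labels = d.argmin(1)
        new = np.array([F[labels == c].mean(0) if np.any(labels == c)
                        else centers[c] for c in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Hypothetical extended loadings: K time slices, each contributing one
# row [flattened P_k, scaled time t_k]; two synthetic regimes of 30 slices.
K = 60
t = (np.arange(K) - np.arange(K).mean()) / np.arange(K).std()
P_rows = np.vstack([np.ones((30, 2)), -np.ones((30, 2))])
F = np.column_stack([P_rows, t])             # rows play the role of P_hat_k
labels = kmeans(F, n_clusters=2, init_idx=(0, K - 1))
```

With the first and last slices as initial centers, the two regimes are recovered cleanly, illustrating the rough *S*<sub>1</sub>-stage division.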

## *5.2.2 Detailed Stage-Division Based on SVDD*

As mentioned before, the extended time-slice loading matrices *P***ˆ**<sub>*k*</sub> represent the local covariance information and the underlying process behavior, so they are used to determine the operating stages through proper analysis and clustering. The process is divided into different stages, each containing a series of successive samples. The transition stage, however, should not be forcibly incorporated into a steady stage, because its process characteristics vary in a complex way. This transiting alteration of characteristics degrades the accuracy of stage-based sub-PCA monitoring models, and fault detection performance deteriorates if a steady sub-PCA model is used to monitor the transition stage. Consequently, a new SVDD-based method is proposed to separate the transition regions after the rough stage division obtained by *K*-means clustering.

SVDD is a relatively new data description method, originally proposed by Tax and Duin for the one-class classification problem (Tax and Duin 1999, 2004). It has been employed for damage detection, image classification, one-class pattern recognition, etc., and has recently been applied to the monitoring of continuous processes. However, SVDD has not yet been used for batch process phase separation and recognition.

The loading matrices of each stage are used to train the SVDD model of the transition process. The SVDD model first maps the data from the original space to a feature space by a nonlinear transformation defined implicitly by a kernel function. A hypersphere with minimum volume is then sought in the feature space. To construct such a minimum-volume hypersphere, the following optimization problem is formulated:

$$\begin{aligned} \min & \varepsilon \left( \mathcal{R}, A, \xi \right) = \mathcal{R}^2 + \mathcal{C} \sum\_{i} \xi\_i \\ \text{s.t. } & \left\| \hat{P}\_i - A \right\|^2 \le \mathcal{R}^2 + \xi\_i, \xi\_i \ge 0, \forall \ i, \end{aligned} \tag{5.4}$$

where *R* and *A* are the radius and center of the hypersphere, respectively, and *C* gives the trade-off between the volume of the hypersphere and the number of misclassified samples. ξ*<sub>i</sub>* is a slack variable that allows some training samples to be wrongly classified. The dual form of the optimization problem (5.4) can be written as

$$\begin{aligned} \max\_{\alpha} \ & \sum\_{i} \alpha\_{i} K\left(\hat{\boldsymbol{P}}\_{i}, \hat{\boldsymbol{P}}\_{i}\right) - \sum\_{i,j} \alpha\_{i} \alpha\_{j} K\left(\hat{\boldsymbol{P}}\_{i}, \hat{\boldsymbol{P}}\_{j}\right) \\ \text{s.t. } & 0 \le \alpha\_{i} \le C, \ \sum\_{i} \alpha\_{i} = 1, \end{aligned} \tag{5.5}$$

where *K*(*P***ˆ**<sub>*i*</sub>, *P***ˆ**<sub>*j*</sub>) is the kernel function and α*<sub>i</sub>* is the Lagrange multiplier; here the Gaussian kernel is selected. A general quadratic programming method is used to solve the optimization problem (5.5). The hypersphere radius *R* can be calculated from the optimal solution α*<sub>i</sub>*:

$$R^2 = 1 - 2\sum\_{i=1}^n \alpha\_i K\left(\hat{\boldsymbol{P}}\_i, \hat{\boldsymbol{P}}\_s\right) + \sum\_{i,j=1}^n \alpha\_i \alpha\_j K\left(\hat{\boldsymbol{P}}\_i, \hat{\boldsymbol{P}}\_j\right), \tag{5.6}$$

where *P***ˆ**<sub>*s*</sub> is any support vector lying on the hypersphere boundary (0 < α*<sub>s</sub>* < *C*).

Here, the loading matrices *P***ˆ**<sub>*k*</sub> corresponding to nonzero multipliers α*<sub>k</sub>* are the support vectors; only they affect the SVDD model. The transition phase can then be distinguished from the steady phase by inputting all the time-slice matrices *P***ˆ**<sub>*k*</sub> into the SVDD model. When a new sample *P***ˆ**<sub>*new*</sub> is available, the feature-space distance from the new sample to the hypersphere center is calculated first:

$$D^2 = \left\|\hat{\boldsymbol{P}}\_{new} - \boldsymbol{A}\right\|^2 = 1 - 2\sum\_{i=1}^n \alpha\_i K\left(\hat{\boldsymbol{P}}\_{new}, \hat{\boldsymbol{P}}\_i\right) + \sum\_{i,j=1}^n \alpha\_i \alpha\_j K\left(\hat{\boldsymbol{P}}\_i, \hat{\boldsymbol{P}}\_j\right). \tag{5.7}$$

If the hyperspace distance is less than the hypersphere radius, i.e., *D*<sup>2</sup> ≤ *R*<sup>2</sup>, the sample *P***ˆ**<sub>*new*</sub> belongs to a steady stage; otherwise (*D*<sup>2</sup> > *R*<sup>2</sup>), it is assigned to a transition stage. In the detailed division, the whole batch is divided into *S*<sub>2</sub> stages, comprising *S*<sub>1</sub> steady stages and *S*<sub>2</sub> − *S*<sub>1</sub> transition stages.
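A minimal sketch of the SVDD step, under stated assumptions: in the hard-margin case with a Gaussian kernel and Σ<sub>*i*</sub> α<sub>*i*</sub> = 1, the dual reduces to a minimum enclosing ball in feature space, which a simple Frank-Wolfe iteration can solve in place of a general QP solver. The 2-D "loading" data and the far test point are synthetic.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def svdd_fit(F, sigma=1.0, iters=2000):
    """Hard-margin SVDD with a Gaussian kernel (so K(x, x) = 1), solved
    by a Frank-Wolfe scheme for the minimum enclosing ball in feature
    space -- a lightweight stand-in for a QP solver of the dual (5.5)."""
    n = len(F)
    K = gauss_kernel(F, F, sigma)
    alpha = np.full(n, 1.0 / n)
    for t in range(iters):
        d2 = 1.0 - 2.0 * K @ alpha + alpha @ K @ alpha   # distances as in (5.7)
        k = int(d2.argmax())             # furthest training point
        eta = 1.0 / (t + 2)              # diminishing step size
        alpha = (1 - eta) * alpha
        alpha[k] += eta                  # keeps alpha on the simplex
    d2 = 1.0 - 2.0 * K @ alpha + alpha @ K @ alpha
    return alpha, float(d2.max())        # R^2 = distance of furthest point

def svdd_dist2(F_new, F, alpha, sigma=1.0):
    """Squared feature-space distance D^2 of new samples to the center."""
    const = float(alpha @ gauss_kernel(F, F, sigma) @ alpha)
    return 1.0 - 2.0 * gauss_kernel(F_new, F, sigma) @ alpha + const

rng = np.random.default_rng(0)
F = rng.normal(scale=0.1, size=(40, 2))          # tight "steady stage" cluster
alpha, R2 = svdd_fit(F)
D2_far = svdd_dist2(np.array([[5.0, 5.0]]), F, alpha)[0]   # transition-like sample
```

A sample far from the training cluster yields *D*<sup>2</sup> > *R*<sup>2</sup> and would be assigned to a transition stage by the rule above.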

Because the time-slice loading matrices within one stage are similar, the mean loading matrix *P***¯**<sub>*s*</sub> can be adopted to build the sub-PCA model of the *s*th stage, where *P***¯**<sub>*s*</sub> is the mean of the loading matrices *P<sub>k</sub>* in that stage. The number of principal components *a<sub>s</sub>* is obtained by accumulating the relative variance of the principal components until it reaches 85%. The mean loading matrix is then truncated to the retained principal components. The sub-PCA model can be described as

$$\begin{cases} \boldsymbol{T}\_k = \boldsymbol{X}\_k \bar{\boldsymbol{P}}\_s \\ \bar{\boldsymbol{X}}\_k = \boldsymbol{T}\_k \bar{\boldsymbol{P}}\_s^\mathrm{T} \\ \bar{\boldsymbol{E}}\_k = \boldsymbol{X}\_k - \bar{\boldsymbol{X}}\_k. \end{cases} \tag{5.8}$$

The T<sup>2</sup> and SPE statistic control limits are calculated:

$$\begin{aligned} \mathbf{T}^2\_{s,i,\alpha} &\sim \frac{a\_{s,i}(I - 1)}{I - a\_{s,i}} F\_{a\_{s,i},\, I - a\_{s,i},\, \alpha} \\ \text{SPE}\_{k,\alpha} &= g\_k \chi^2\_{h\_k, \alpha}, \ g\_k = \frac{\upsilon\_k}{2m\_k}, \ h\_k = \frac{2m\_k^2}{\upsilon\_k}, \end{aligned} \tag{5.9}$$

where *m<sub>k</sub>* and υ*<sub>k</sub>* are the mean and variance of the SPE values of all batches at time *k*, *a<sub>s,i</sub>* is the number of principal components retained for batch *i* (*i* = 1, 2,..., *I*) in stage *s*, *I* is the number of batches, and α is the significance level.
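The SPE limit of (5.9) can be computed directly from *m<sub>k</sub>* and υ*<sub>k</sub>*. A sketch, assuming the Wilson–Hilferty approximation of the chi-square quantile so that no statistics library is needed; the values *m<sub>k</sub>* = 4 and υ*<sub>k</sub>* = 8 are illustrative.

```python
import math

def spe_limit(m_k, v_k, z_alpha=2.3263):
    """SPE control limit g_k * chi2_{h_k,alpha} of (5.9), with g_k and
    h_k matched to the mean m_k and variance v_k of SPE over batches.
    The chi-square quantile is approximated by the Wilson-Hilferty
    formula; z_alpha is the standard normal quantile (2.3263 for 99%)."""
    g = v_k / (2.0 * m_k)
    h = 2.0 * m_k ** 2 / v_k
    chi2_q = h * (1.0 - 2.0 / (9.0 * h)
                  + z_alpha * math.sqrt(2.0 / (9.0 * h))) ** 3
    return g * chi2_q

limit = spe_limit(m_k=4.0, v_k=8.0)   # here g = 1 and h = 4
```

For *g* = 1, *h* = 4 the 99% limit is close to the exact χ²<sub>4, 0.99</sub> ≈ 13.28; an exact quantile function (e.g. from a statistics library) can be substituted where available.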

## *5.2.3 PCA Modeling for Transition Stage*

Based on the foregoing, a soft-transition multi-phase PCA modeling method using SVDD is now presented. It uses the SVDD hypersphere radius to determine the extent of the transition region between two stages, and introduces membership grades to evaluate quantitatively the similarity between the current sample and the transition (or steady) stage models. Separate sub-PCA models are established for the steady and transition phases, which greatly improves model accuracy and reflects the characteristic changes across neighboring stages. Time-varying monitoring models in the transition regions are built from the membership grades as weighted sums of the nearby steady-phase and transition-phase sub-models. The membership grades describe the partition problem with ambiguous boundaries and objectively reflect how the process correlations change from one stage to another. Here, the hyperspace distance *D<sub>k,s</sub>* is defined from the sample at time *k* to the center of the *s*th SVDD sub-model, and is used as a dissimilarity index to quantify the changing trend of the process characteristics. The correlation coefficients λ*<sub>l,k</sub>* serve as the weights of the soft-transition sub-model and are defined as

$$\begin{cases} \lambda\_{s-1,k} = \frac{D\_{k,s} + D\_{k,s+1}}{2\left(D\_{k,s-1} + D\_{k,s} + D\_{k,s+1}\right)}\\ \quad \lambda\_{s,k} = \frac{D\_{k,s-1} + D\_{k,s+1}}{2\left(D\_{k,s-1} + D\_{k,s} + D\_{k,s+1}\right)}\\ \quad \lambda\_{s+1,k} = \frac{D\_{k,s-1} + D\_{k,s}}{2\left(D\_{k,s-1} + D\_{k,s} + D\_{k,s+1}\right)}, \end{cases} \tag{5.10}$$

where *l* = *s* − 1, *s*, *s* + 1 denotes the previous steady stage, the current transition stage, and the next steady stage, respectively. Each correlation coefficient is inversely related to the corresponding hyperspace distance: the greater the distance to a sub-model, the smaller its weight. The monitoring model for each time interval of the transition phase is obtained as the weighted sum of the sub-PCA models, i.e.,

$$\mathcal{P}'\_k = \sum\_{l=s-1}^{s+1} \lambda\_{l,k} \tilde{\mathcal{P}}\_l. \tag{5.11}$$
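The weights of (5.10) have a useful property: the three membership grades always sum to one, since their numerators add up to the common denominator. This can be checked with a few lines (the distance values are illustrative):

```python
def transition_weights(D_prev, D_cur, D_next):
    """Membership grades of (5.10) from the hyperspace distances of the
    current sample to the (s-1)th, sth, and (s+1)th SVDD sub-models.
    By construction the three weights sum to one."""
    total = 2.0 * (D_prev + D_cur + D_next)
    return ((D_cur + D_next) / total,     # lambda_{s-1,k}
            (D_prev + D_next) / total,    # lambda_{s,k}
            (D_prev + D_cur) / total)     # lambda_{s+1,k}

# A sample still close to the previous steady stage (smallest distance
# D_prev) receives the largest weight for that stage's sub-model.
w_prev, w_cur, w_next = transition_weights(0.2, 0.5, 1.5)
```

The weighted loading matrix of (5.11) is then simply `w_prev * P_prev + w_cur * P_cur + w_next * P_next` for the three sub-model loading matrices.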

The soft-transition PCA model in (5.11) properly reflects the time-varying development of the transition. The score matrix *T*′<sub>*k*</sub> and the covariance matrix *S*′<sub>*k*</sub> can be obtained at each time instant. The SPE control limit is still calculated by (5.9). Since different batches differ somewhat in their transition stages, the average T<sup>2</sup> limits over all batches are used in order to improve the robustness of the proposed method. The T<sup>2</sup> control limits are calculated from the historical batch data and the correlation coefficients:

$$\mathbf{T}\_{\alpha}^{2'} = \sum\_{l=s-1}^{s+1} \sum\_{i=1}^{I} \lambda\_{l,i,k} \frac{\mathbf{T}^2\_{l,i,\alpha}}{I},\tag{5.12}$$

where *i* (*i* = 1, 2,..., *I*) is the batch number and T<sup>2</sup><sub>*l*,*i*,α</sub> is the sub-stage T<sup>2</sup> control limit of batch *i*, calculated by (5.9) for sub-stage *l*.

Now the soft-transition model of each time interval in transition stages is obtained. The batch process can be monitored efficiently by combining with the steady stage model given in Sect. 5.2.2.

## *5.2.4 Monitoring Procedure of Soft-Transition Sub-PCA*

The whole batch process has been divided into several steady stages and transition stages after the two-step stage division described in Sects. 5.2.1 and 5.2.2. The new soft-transition sub-PCA method is applied to obtain the detailed sub-models as in Sect. 5.2.3. The modeling steps are given as follows:


The whole flowchart of the improved sub-PCA modeling based on SVDD soft transition is shown in Fig. 5.2. The modeling process is performed offline and depends on the historical data of *I* batches.

The following steps should be adopted during online process monitoring.

If the current sample belongs to a transition stage, the time-varying loading matrix *P*′<sub>*new*</sub> from (5.11) is used to calculate the score vector *t<sub>new</sub>* and error vector *e<sub>new</sub>*,
$$\begin{aligned} \boldsymbol{t}\_{new} &= \boldsymbol{x}\_{new} \boldsymbol{P}'\_{new} \\ \boldsymbol{e}\_{new} &= \boldsymbol{x}\_{new} - \hat{\boldsymbol{x}}\_{new} = \boldsymbol{x}\_{new} \left( \boldsymbol{I} - \boldsymbol{P}'\_{new} \boldsymbol{P}'^{\mathrm{T}}\_{new} \right) \end{aligned} \tag{5.13}$$

If it belongs to a steady stage instead, the mean loading matrix *P***¯**<sub>*s*</sub> is used to calculate the score vector *t<sub>new</sub>* and error vector *e<sub>new</sub>*,

**Fig. 5.2** Illustration of soft-transition sub-PCA modeling

$$\begin{aligned} \boldsymbol{t}\_{new} &= \boldsymbol{x}\_{new} \bar{\boldsymbol{P}}\_s \\ \boldsymbol{e}\_{new} &= \boldsymbol{x}\_{new} - \hat{\boldsymbol{x}}\_{new} = \boldsymbol{x}\_{new} \left( \boldsymbol{I} - \bar{\boldsymbol{P}}\_s \bar{\boldsymbol{P}}\_s^{\mathrm{T}} \right). \end{aligned} \tag{5.14}$$

(4) Calculate the SPE and T<sup>2</sup> statistics of current data as follows:

$$\begin{aligned} \mathbf{T}\_{new}^2 &= \boldsymbol{t}\_{new} \boldsymbol{S}\_s^{-1} \boldsymbol{t}\_{new}^{\mathrm{T}} \\ \text{SPE}\_{new} &= \boldsymbol{e}\_{new} \boldsymbol{e}\_{new}^{\mathrm{T}}. \end{aligned} \tag{5.15}$$

(5) Judge whether the SPE and T<sup>2</sup> statistics of the current data exceed the control limits. If either exceeds its control limit, an abnormality alarm is raised; if neither does, the current data is normal.
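Steps (3)–(5) of the online procedure can be sketched as below for a steady stage. The toy loading matrix, inverse score covariance, and control limits are illustrative values, not taken from the process.

```python
import numpy as np

def monitor_sample(x_new, P, S_inv, T2_limit, SPE_limit):
    """One online check, steps (3)-(5): project the normalized sample
    with the stage's loading matrix P, form T^2 and SPE as in (5.15),
    and flag a fault if either statistic exceeds its control limit."""
    t = x_new @ P                        # score vector
    e = x_new - t @ P.T                  # residual vector
    T2 = float(t @ S_inv @ t)
    SPE = float(e @ e)
    return T2, SPE, (T2 > T2_limit) or (SPE > SPE_limit)

# Toy 2-variable model with one principal component (illustrative only).
P = np.array([[1.0], [0.0]])
S_inv = np.array([[1.0]])                # inverse score covariance
T2, SPE, fault = monitor_sample(np.array([0.5, 0.1]), P, S_inv,
                                T2_limit=9.0, SPE_limit=1.0)
```

For a transition-stage sample, the same routine applies with the weighted loading matrix *P*′<sub>*new*</sub> of (5.11) in place of *P*.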

# **5.3 Case Study**

## *5.3.1 Stage Identification and Modeling*

The Fed-Batch Penicillin Fermentation Process is used as a simulation case in this section; a detailed description is available in Chap. 4. A reference data set of 10 batches is simulated under nominal conditions with small perturbations. The completion time is 400 h, and all variables are sampled every 1 h, so each batch provides 400 samples.

The rough division result based on the *K*-means method is shown in Fig. 5.3. Initially, the batch process is classified into 3 steady stages, i.e., *S*<sub>1</sub> = 3. An SVDD classifier with a Gaussian kernel is then used for the detailed division. The hypersphere radius of each of the original 3 stages is calculated, and the distances from each sample to the hypersphere centers are shown in Fig. 5.4.

As can be seen from Fig. 5.4, the samples between two stages, such as those during the time intervals 28–42 and 109–200, clearly lie outside the hypersphere. This means that the data in these two time regions differ significantly from those of the steady stages, so the two regions are considered transition stages. The process is thus further divided into 5 stages according to the detailed SVDD division, as shown in Fig. 5.5.

**Fig. 5.3** Rough division result based on K-mean clustering

**Fig. 5.4** SVDD stage classification result

**Fig. 5.5** Detailed process division result based on SVDD

Obviously, the stages during the time intervals 1–27, 43–109, and 202–400 are steady stages. The hyperspace distances during 28–42 and 109–200 clearly exceed the hypersphere radius, so these two regions are separated out as transition stages. The SVDD classifier model is then rebuilt. Using the phase identification method proposed in this chapter, the whole batch data set is divided into five stages, that is, *S*<sub>2</sub> = 5.

## *5.3.2 Monitoring of Normal Batch*

Monitoring results of the improved sub-PCA method for a normal batch are presented in Fig. 5.6. The blue line is the statistic of the online data and the red line is the 99% confidence control limit calculated from normal historical data. Owing to the large change of hyperspace distance at about 30 h in Fig. 5.4, the T<sup>2</sup> control limit drops sharply; the T<sup>2</sup> statistic of this batch nevertheless stays below the confidence limit. Neither monitoring statistic (T<sup>2</sup> or SPE) yields any false alarms, which means this batch behaves normally throughout the run.

## *5.3.3 Monitoring of Fault Batch*

To illustrate its effectiveness, the monitoring results of the proposed method are compared with those of the traditional sub-PCA method. Two kinds of faults are used to test the monitoring system. Fault 1 is a 10% step decrease in the agitator power variable over the time interval 20–100 h. Figures 5.7 and 5.8 show that the SPE statistic increases sharply beyond the control limit for both methods, while the T<sup>2</sup> statistic, which reflects changes in the sub-PCA model itself, does not exceed the control limit for the traditional sub-PCA method. This means the proposed soft-transition method produces a more accurate model than traditional sub-PCA.

**Fig. 5.6** Monitoring plots for a normal batch

**Fig. 5.7** The proposed soft-transition monitoring for fault 1

**Fig. 5.8** The traditional Sub-PCA monitoring for fault 1

The differences between the two methods can be seen directly in the projection maps, Figs. 5.9 and 5.10. The blue dots are the projections of the data in the time interval 50–100 onto the first two principal components, and the red line is the control limit. Figure 5.10 shows that none of the data falls outside the control limit with the traditional sub-PCA method, because traditional sub-PCA does not separate the transition stage. The proposed soft-transition sub-PCA effectively detects the abnormal or faulty data, as shown in Fig. 5.9.

Fault 2 is a ramp decrease with slope 0.1 added to the substrate feed rate over the time interval 20–100 h. Online monitoring results of the traditional sub-PCA and the proposed method are shown in Figs. 5.11 and 5.12. The fault is detected by both methods. The SPE statistic of the proposed method exceeds the limit at about 50 h, and the T<sup>2</sup> statistic alarms at 45 h; both then increase slightly and continuously until the end of the fault.

**Fig. 5.11** Proposed Soft-transition monitoring results for fault 2

**Fig. 5.12** The traditional Sub-PCA monitoring for fault 2

Figure 5.12 clearly shows that the SPE statistic of traditional sub-PCA does not alarm until about 75 h, which lags far behind the proposed method. Meanwhile, its T<sup>2</sup> statistic raises a fault alarm at the beginning of the process; this is a false alarm caused by the change of the process initial state. In comparison, the proposed method has fewer false alarms, and its fault alarm time is clearly earlier than that of traditional sub-PCA.


**Table 5.1** Monitoring results of FA for other faults

The monitoring results for 12 other faults are presented in Table 5.1. Fault variables No. 1, 2, and 3 represent the aeration rate, agitator power, and substrate feed rate, respectively, as described in Chap. 4. Here FA is the number of false alarms during the operating life.

It can be seen that the false alarm counts of the conventional sub-PCA method are clearly higher than those of the proposed method; in comparison, the proposed method shows good robustness. The false alarms here are caused by small changes in the process initial state. Initial states usually differ in real situations, leading to changes in the monitoring model, and many false alarms result from these small changes. The conventional sub-PCA method shows poor monitoring performance in some transition stages and even fails to detect these faults because of its inaccurate stage division.

## **5.4 Conclusions**

In a multi-stage batch process, the correlations between process variables change as the process moves from stage to stage, which makes MPCA and traditional sub-PCA methods inadequate for process monitoring and fault diagnosis. This chapter proposes a new phase identification method that explicitly identifies stable and transitional phases. Each phase usually has its own dynamic characteristics and deserves to be treated separately; in particular, the transition phase between two stable phases has its own dynamic transition characteristics and is difficult to identify.

Two techniques are adopted in this chapter to overcome the above problems. First, inaccurate phase division caused by faulty data is avoided in the rough division by introducing the sampling time into the loading matrix. Then, based on the distance of the process data to the center of the SVDD hypersphere, transition phases are identified and separated from the nearby stable phases. Separate sub-PCA models are built for the stable and transitional phases. In particular, the soft-transition sub-PCA model is a weighted sum of the sub-models of the previous stable stage, the current transition stage, and the next stable stage, so it reflects the dynamic characteristic changes of the transition phase.

Finally, the proposed method is applied to the penicillin fermentation process, and the simulation results show its effectiveness. The method can be applied to monitoring any batch or semi-batch process for which detailed process information is unavailable, and it is helpful for identifying the dynamic transitions of unknown batch or semi-batch processes.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Statistics Decomposition and Monitoring in Original Variable Space**

The traditional process monitoring method first projects the measured process data into the principal component subspace (PCS) and the residual subspace (RS), then calculates the T<sup>2</sup> and SPE statistics to detect abnormality. However, these two statistics detect abnormality via the principal components of the process. Principal components have no specific physical meaning and do not directly help to identify the faulty variable and its root cause. Researchers have proposed many methods to identify the faulty variable accurately based on the projection space; the most popular is the contribution plot, which measures the contribution of each process variable to the statistics (Wang et al. 2017; Luo et al. 2017; Liu and Chen 2014). Moreover, to determine the control limits of the two statistics, their probability distributions must be estimated or assumed to follow specific forms. Fault identification through such statistics is not intuitive enough to directly reflect the role and trend of each variable when the process changes.

In this chapter, direct monitoring in the original measurement space is investigated, in which the two statistics are each decomposed into a unique sum of contributions from the original process variables. Monitoring the original process variables is direct and physically explicit, but it is relatively complicated and time consuming because each variable must be monitored in both the SPE and T<sup>2</sup> statistics. To address this issue, a new combined index is proposed and interpreted in geometric space, which differs from other combined indices (Qin 2003; Alcala and Qin 2010). Compared with traditional latent-space methods, the combined-index-based monitoring does not require a prior distribution assumption to calculate the control limits, so the monitoring complexity is greatly reduced.

## **6.1 Two Statistics Decomposition**

According to the traditional PCA method, the process variable vector *x* can be divided into two parts, the principal component part *x*ˆ and the residual *e*:

$$\mathbf{x} = t\mathbf{P}^{\mathrm{T}} + \mathbf{e} = \hat{\mathbf{x}} + \mathbf{e},\tag{6.1}$$

where *P* is the matrix associated with the loading vectors that define the latent variable space, *t* is the score matrix that contains the coordinates of *x* in that space, and *e* is the matrix of residuals. T<sup>2</sup> and SPE statistics are used to measure the distance from the new data to the model data. Generally, T<sup>2</sup> and SPE statistics should be analyzed simultaneously so that the cumulative effects of all variables can be utilized. However, most of the literatures have only considered the decomposition of T2. Therefore, this chapter considered the SPE statistical decomposition to obtain the original process variables monitored in T<sup>2</sup> and in the SPE statistical space.

## *6.1.1* **T<sup>2</sup>** *Statistic Decomposition*

The T<sup>2</sup> statistic can be reformulated as follows:

$$\mathbf{T}^2 := \mathbf{D} = \boldsymbol{t}\boldsymbol{\Lambda}^{-1}\boldsymbol{t}^{\mathrm{T}} = \boldsymbol{x}\boldsymbol{P}\boldsymbol{\Lambda}^{-1}\boldsymbol{P}^{\mathrm{T}}\boldsymbol{x}^{\mathrm{T}} = \boldsymbol{x}\boldsymbol{A}\boldsymbol{x}^{\mathrm{T}} = \sum\_{i=1}^J \sum\_{j=1}^J a\_{i,j}x\_i x\_j \ge 0,\qquad(6.2)$$

where *A* = *PΛ*<sup>−1</sup>*P*<sup>T</sup> ≥ 0, *Λ*<sup>−1</sup> is the inverse of the score covariance matrix estimated from a reference population, and *a<sub>i,j</sub>* is the (*i*, *j*) element of the matrix *A*.

One of the T<sup>2</sup> statistic decompositions (Birol et al. 2002) is given as follows:

$$\begin{split} \mathbf{D} &= \sum\_{k=1}^{J} \frac{a\_{k,k}}{2} \left[ \left( \mathbf{x}\_{k} - \mathbf{x}\_{k}^{\*} \right)^{2} + \left( \mathbf{x}\_{k}^{2} - \mathbf{x}\_{k}^{\*2} \right) \right] \\ &= \sum\_{k=1}^{J} a\_{k,k} \left[ \left( \mathbf{x}\_{k}^{2} - \mathbf{x}\_{k}^{\*} \mathbf{x}\_{k} \right) \right] \\ &= \sum\_{k=1}^{J} c\_{k}^{D} . \end{split} \tag{6.3}$$

The *x*<sup>∗</sup><sub>*k*</sub> is given as follows:

$$x\_k^\* = -\frac{\sum\_{j=1,\, j \neq k}^{J} a\_{k,j} x\_j}{a\_{k,k}},$$

where *c*<sup>D</sup><sub>*k*</sub> is the decomposed T<sup>2</sup> statistic of variable *x<sub>k</sub>*, calculated as

$$c\_k^D = a\_{k,k} \left[ \left( \mathbf{x}\_k^2 - \mathbf{x}\_k^\* \mathbf{x}\_k \right) \right]. \tag{6.4}$$

The detailed T<sup>2</sup> statistic decomposition is not shown here; details can be found in Alvarez et al. (2007, 2010).
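The decomposition (6.3)–(6.4) can be verified numerically: the per-variable contributions *c*<sup>D</sup><sub>*k*</sub> must sum exactly to T<sup>2</sup>. A sketch with synthetic reference data and two retained components (the data and model sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))               # reference data, J = 4 variables
X = (X - X.mean(0)) / X.std(0, ddof=1)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:2].T                               # loadings, 2 retained components
Lam = np.diag(s[:2] ** 2 / (len(X) - 1))   # score covariance
A = P @ np.linalg.inv(Lam) @ P.T           # matrix A of (6.2)

x = X[0]                                   # one sample
T2 = float(x @ A @ x)

# x_k^* and the per-variable contributions c_k^D of (6.4)
J = len(x)
x_star = np.array([-(A[k] @ x - A[k, k] * x[k]) / A[k, k] for k in range(J)])
c = np.diag(A) * (x ** 2 - x_star * x)
```

The identity holds because each *c*<sup>D</sup><sub>*k*</sub> collects exactly the *k*th row of the quadratic form *x**A**x*<sup>T</sup>.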

## *6.1.2 SPE Statistic Decomposition*

The SPE statistic, which reflects the change of the random quantity in the residual subspace, also has a quadratic form:

$$\begin{split} \text{SPE} &:= \mathbf{Q} = \mathbf{e}\mathbf{e}^{\mathrm{T}} = \mathbf{x} \left( I - \mathbf{P}\mathbf{P}^{\mathrm{T}} \right) \left( I - \mathbf{P}\mathbf{P}^{\mathrm{T}} \right)^{\mathrm{T}} \mathbf{x}^{\mathrm{T}} \\ &= \mathbf{x}\mathbf{B}\mathbf{x}^{\mathrm{T}} = \sum\_{i=1}^{J} \sum\_{j=1}^{J} b\_{i,j} \mathbf{x}\_{i} \mathbf{x}\_{j}, \end{split} \tag{6.5}$$

where *B* = (*I* − *P P*<sup>T</sup>)(*I* − *P P*<sup>T</sup>)<sup>T</sup> and *b<sub>i,j</sub>* is the (*i*, *j*) element of matrix *B*, with *b<sub>i,j</sub>* = *b<sub>j,i</sub>*. (Since *I* − *P P*<sup>T</sup> is symmetric and idempotent, *B* = *I* − *P P*<sup>T</sup>.) Similar to the decomposition of the T<sup>2</sup> statistic, the SPE statistic can also be decomposed into a new statistic for each variable.

Firstly, the SPE statistic Q can be reformulated in terms of a single variable *xk* :

$$\mathbf{Q} = \mathbf{Q}\_k = b\_{k,k}\mathbf{x}\_k^2 + \left(2\sum\_{j=1, j\neq k}^J b\_{k,j}\mathbf{x}\_j\right)\mathbf{x}\_k + \sum\_{i=1, i\neq k}^J \sum\_{j=1, j\neq k}^J b\_{i,j}\mathbf{x}\_i\mathbf{x}\_j. \tag{6.6}$$

The minimum value of Q*<sup>k</sup>* can be calculated as

$$\frac{\partial \mathbf{Q}\_k}{\partial \mathbf{x}\_k} = 2b\_{k,k}\mathbf{x}\_k^\* + 2\sum\_{j=1, j\neq k}^J b\_{k,j}\mathbf{x}\_j = 0 \Rightarrow \mathbf{x}\_k^\* = -\sum\_{j=1, j\neq k}^J b\_{k,j}\mathbf{x}\_j/b\_{k,k}\tag{6.7}$$

$$\mathbf{Q}\_k^{\min} = -b\_{k,k}\mathbf{x}\_k^{\*2} + \sum\_{\substack{i=1, i \neq k}}^J \sum\_{j=1, j \neq k}^J b\_{i,j}\mathbf{x}\_i\mathbf{x}\_j. \tag{6.8}$$

The difference between the SPE statistic Q and Q<sup>min</sup><sub>*k*</sub> is

$$\mathbf{Q} - \mathbf{Q}\_k^{\min} = b\_{k,k} \left(\mathbf{x}\_k - \mathbf{x}\_k^\*\right)^2. \tag{6.9}$$

The sum of the Qmin *<sup>k</sup>* for *k* = 1, 2,..., *J* is

$$\begin{split} \sum\_{k=1}^{J} \mathbf{Q}\_{k}^{\text{min}} &= \sum\_{k=1}^{J} \left( -b\_{k,k} \mathbf{x}\_{k}^{\ast 2} + \sum\_{i=1, i \neq k}^{J} \sum\_{j=1, j \neq k}^{J} b\_{i,j} \mathbf{x}\_{i} \mathbf{x}\_{j} \right) \\ &= (J-2)\,\mathbf{Q} + \sum\_{k=1}^{J} b\_{k,k} \left( \mathbf{x}\_{k}^{2} - \mathbf{x}\_{k}^{\ast 2} \right). \end{split} \tag{6.10}$$

The SPE statistic obtained from (6.10) can be evaluated as the sum of the contributions of each variable *xk* :

$$\begin{split} \mathbf{Q} &= \sum\_{k=1}^{J} \frac{b\_{k,k}}{2} \left[ \left( \mathbf{x}\_{k} - \mathbf{x}\_{k}^{\*} \right)^{2} + \left( \mathbf{x}\_{k}^{2} - \mathbf{x}\_{k}^{\*2} \right) \right] \\ &= \sum\_{k=1}^{J} b\_{k,k} \left[ \left( \mathbf{x}\_{k}^{2} - \mathbf{x}\_{k}^{\*} \mathbf{x}\_{k} \right) \right] \\ &= \sum\_{k=1}^{J} q\_{k}^{\mathrm{SPE}}. \end{split} \tag{6.11}$$

The per-variable SPE statistic used to monitor the system status is

$$q\_k^{\rm SPE} = b\_{k,k} \left[ \left( \mathbf{x}\_k^2 - \mathbf{x}\_k^\* \mathbf{x}\_k \right) \right]. \tag{6.12}$$

So the SPE statistic can be evaluated as a unique sum of the per-variable contributions *q*<sup>SPE</sup><sub>*k*</sub> (*k* = 1, 2,..., *J*), which are used for monitoring in the original variable space.
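Analogously, the SPE decomposition (6.11)–(6.12) can be verified numerically: the contributions *q*<sup>SPE</sup><sub>*k*</sub> sum exactly to Q, which also equals the squared residual norm. Synthetic data again, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = (X - X.mean(0)) / X.std(0, ddof=1)
P = np.linalg.svd(X, full_matrices=False)[2][:2].T   # loadings, J x 2

I4 = np.eye(4)
B = (I4 - P @ P.T) @ (I4 - P @ P.T).T    # matrix B of (6.5)

x = X[0]
Q = float(x @ B @ x)                     # SPE of this sample

# x_k^* of (6.7) and the per-variable contributions q_k^SPE of (6.12)
J = len(x)
x_star = np.array([-(B[k] @ x - B[k, k] * x[k]) / B[k, k] for k in range(J)])
q = np.diag(B) * (x ** 2 - x_star * x)
```

The second check below confirms that Q agrees with the squared norm of the residual *e* = *x*(*I* − *P P*<sup>T</sup>), as (6.5) requires.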

## *6.1.3 Fault Diagnosis in Original Variable Space*

Similar to other PCS monitoring strategies, the proposed original-variable monitoring technique consists of an offline stage and an online stage. In the offline stage, the control limits of the two statistics (T<sup>2</sup> and SPE) for each time interval are determined from a reference population of normal batches. In the online stage, the two statistics are calculated at each sampling instant; if either statistic exceeds its control limit, a faulty mode is declared.

The historical data of the batch process are composed of a three-dimensional array $X(I \times J \times K)$, where *I*, *J*, and *K* are the numbers of batches, process variables, and sampling times, respectively. The three-dimensional process data must be unfolded into two-dimensional forms $X_k(I \times J)$, $k = 1, 2, \ldots, K$, before performing the PCA operation. The unfolded matrix $X_k$ is normalized to zero mean and unit variance in each variable. The main nonlinear and dynamic components of the variables still remain in the scaled data matrix $X_k$.
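The unfolding and normalization step can be sketched as follows (the array dimensions are illustrative assumptions, chosen to match the 11-variable case study later in the chapter):

```python
import numpy as np

# Unfold X(I x J x K) into K time-slice matrices X_k(I x J), each
# normalized to zero mean and unit variance per variable.
rng = np.random.default_rng(1)
I, J, K = 30, 11, 400                      # batches, variables, sampling times
X = rng.normal(size=(I, J, K))             # synthetic stand-in for batch data

slices = []
for k in range(K):
    Xk = X[:, :, k]                        # I x J time slice at instant k
    Xk = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0, ddof=1)
    slices.append(Xk)

print(len(slices), slices[0].shape)        # K matrices, each I x J
```

Each `slices[k]` is then decomposed by PCA independently, which is the time-slice (sub-PCA style) treatment used in the rest of the section.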

The normalized data matrix $X_k$ is projected into the principal component subspace by the loading matrix $P_k$ to obtain the score matrix $T_k$:

$$X\_k = \mathcal{T}\_k \mathcal{P}\_k^T + \mathcal{E}\_k,$$

where $E_k$ is the residual matrix. The two statistics associated with the *i*th batch, the *j*th variable, and the *k*th time interval are denoted $c^{D}_{i,j,k}$ and $q^{\mathrm{SPE}}_{i,j,k}$.

The control limit of a continuous process can be determined using the kernel density estimation (KDE) method. Another method has been used for calculating the control limit for batch processes, in which the limit is determined by the mean and variance of each statistic (Yoo et al. 2004; Alvarez et al. 2007). The mean and variance of $c^{D}_{i,j,k}$ are calculated as follows:

$$\begin{aligned} \bar{c}\_{j,k}^D &= \sum\_{i=1}^I c\_{i,j,k}^D / I \\ \text{var}\left(c\_{j,k}^D\right) &= \sum\_{i=1}^I \left(c\_{i,j,k}^D - \bar{c}\_{j,k}^D\right)^2 / (I-1). \end{aligned} \tag{6.13}$$

The control limit of the statistic $c^{D}_{j,k}$ is estimated as

$$c\_{j,k}^{\text{limit}} = \bar{c}\_{j,k}^{D} + \lambda\_1 \text{(var } \left( c\_{j,k}^{D} \right) \text{)}^{1/2},\tag{6.14}$$

where $\lambda_1$ is a predefined parameter. Similarly, the control limit of the SPE statistic is

$$q\_{j,k}^{\text{limit}} = \bar{q}\_{j,k}^{\text{SPE}} + \lambda\_2 \left(\text{var}\left(q\_{j,k}^{\text{SPE}}\right)\right)^{\frac{1}{2}},\tag{6.15}$$

where $\lambda_2$ is a predefined parameter,

$$\begin{aligned} \bar{q}\_{j,k}^{\text{SPE}} &= \sum\_{i=1}^{I} q\_{i,j,k}^{\text{SPE}} / I \\ \text{var}(q\_{j,k}^{\text{SPE}}) &= \sum\_{i=1}^{I} \left( q\_{i,j,k}^{\text{SPE}} - \bar{q}\_{j,k}^{\text{SPE}} \right)^2 / (I - 1) . \end{aligned} \tag{6.16}$$
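The limit calculation in (6.13)–(6.16) can be sketched as follows (the statistics are synthetic placeholders, and $\lambda_1 = 3$ is an arbitrary illustrative choice, not a value prescribed by the chapter):

```python
import numpy as np

# Control limits per (6.13)-(6.14): mean plus lambda_1 times the standard
# deviation of each decomposed statistic over the I reference batches.
rng = np.random.default_rng(2)
I, J, K = 30, 11, 400
c = np.abs(rng.normal(size=(I, J, K)))     # stand-in for c^D_{i,j,k}
lam1 = 3.0                                  # predefined parameter lambda_1

c_bar = c.mean(axis=0)                      # (6.13): mean over batches -> J x K
c_var = c.var(axis=0, ddof=1)               # unbiased variance over batches
c_limit = c_bar + lam1 * np.sqrt(c_var)     # (6.14): one limit per (j, k)

print(c_limit.shape)                        # (J, K)
```

The SPE limits in (6.15)–(6.16) follow the identical pattern with $q^{\mathrm{SPE}}_{i,j,k}$ and $\lambda_2$, so the whole limit table is a pair of vectorized mean/variance passes over the reference batches.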

As shown above, the control limit calculation is very simple. Although the amount of computation increases, the extra calculations can be performed offline, so there is no restriction during the online monitoring stage. The proposed monitoring technique, corresponding to the offline and online stages, is summarized as follows:

#### **A. Offline Stage**

1. Obtain the normal process data of *I* batches *X*, unfold them into two-dimensional time-slice matrix *X<sup>k</sup>* , and then normalize the data.


#### **B. Online Stage**


## **6.2 Combined Index-Based Fault Diagnosis**

The monitoring method in the original process variables can avoid some of the disadvantages of the traditional statistical approach in the latent variable space, such as indirect monitoring (Yoo et al. 2004). However, the original variable monitoring method is relatively complicated because each variable is monitored in both the SPE and T<sup>2</sup> statistics. That is, each variable is monitored twice, which increases the computational load. Thus, a new combined index, composed of the SPE and T<sup>2</sup> statistics, is proposed to decrease the monitoring complexity.

## *6.2.1 Combined Index Design*

In this section, we use the symbol $X(I \times J)$ to substitute for the unfolded process data matrix $X_k(I \times J)$ for general analysis. Similarly, $P_k$, $T_k$, $E_k$ are substituted by $P$, $T$, $E$. The process data $X$ can be decomposed into the PCS and the RS when performing PCA:

$$X = T\mathbf{P}^{\mathsf{T}} + E = \hat{X} + E,\tag{6.17}$$

where $\hat{X}$ is the projection onto the PCS and $E$ lies in the RS. If the number of principal components is *m*, then an *m*-dimensional PCS and a $(J-m)$-dimensional RS are obtained. When new data $x$ are measured, they are projected into the principal subspace:

$$t = \mathbf{x}P.\tag{6.18}$$

**Fig. 6.1** Graphical representation of T<sup>2</sup> and SPE statistics

The principal component (PC) score vector $t(1 \times m)$ is the projection of the new data $x$ onto the PCS. Subsequently, the PC score vector is projected back into the original process variables to estimate the process data $\hat{x} = tP^{\mathrm{T}}$. The residual vector $e$ is defined as

$$e = \mathbf{x} - \hat{\mathbf{x}} = \mathbf{x} \left( I - P\mathbf{P}^{\mathsf{T}} \right). \tag{6.19}$$

Residual vector *e* reflects the difference between new data *x* and modeling data *X* in the RS. A graphical interpretation of T<sup>2</sup> and SPE statistics is shown in Fig. 6.1.

To describe the statistics clearly in geometric terms, the principal component subspace is taken as a hyperplane. The SPE statistic checks the model validity by measuring the distance between the data in the original process variables and their projection onto the model plane. The T<sup>2</sup> statistic is generally described by the Mahalanobis distance of the projected point $t$ to the projection center of the normal process data, which checks whether the new observation is projected within the limits of normal operation. The residual space is perpendicular to the principal hyperplane, and the SPE statistic measures the distance from the new data $x$ to the principal component hyperplane.
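The projection and residual computation in (6.18)–(6.19) can be sketched as follows (toy data; $m = 2$ principal components is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:2].T                   # J x m loading matrix

x = rng.normal(size=6)
t = x @ P                      # score vector: projection onto the PCS, (6.18)
x_hat = t @ P.T                # reconstruction in the original variables
e = x - x_hat                  # residual vector, lies in the RS, (6.19)

SPE = e @ e                    # squared distance to the principal hyperplane
print(np.allclose(e @ P, 0))   # residual is orthogonal to the PCS -> True
```

The final check makes the geometry of the figure concrete: the residual has no component along the loadings, so the PCS hyperplane and the RS are orthogonal complements.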

A new distance index $\varphi$, measuring the distance from the new data to the principal component projection center of the modeling data, is given in the following. It can be used for monitoring instead of the SPE and T<sup>2</sup> indicators. Consider the singular value decomposition (SVD) of the covariance matrix $R_x = \mathrm{E}\{X^{\mathrm{T}}X\}$ for the given normal data $X$,

$$\mathbf{R}\_x = U\boldsymbol{A}\boldsymbol{U}^\mathrm{T},$$

where $\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m, \mathbf{0}_{J-m}\}$ contains the eigenvalues of $R_x$. The loading matrix $U \in R^{J \times J}$ is a unitary matrix with $UU^{\mathrm{T}} = I$. The columns of the unitary matrix form a set of standard orthogonal bases of the space they span. The basis vectors of the principal component space and the residual space, obtained by partitioning the matrix $U$, are orthogonal to each other. Furthermore,

$$U = [P, \ \ P\_e],\tag{6.20}$$

where $P \in R^{J \times m}$ is the loading matrix, and $P_e \in R^{J \times (J-m)}$ can be treated as the loading matrix of the residual space. Thus, $P$ and $P_e$ are expressed in terms of $U$ as follows:

$$P = UF\_1, \ P\_e = UF\_2,\tag{6.21}$$

where

$$\boldsymbol{F}\_{1} = \begin{bmatrix} \boldsymbol{I}\_{m} \\ \mathbf{0}\_{J-m} \end{bmatrix}\_{J \times m}, \ \boldsymbol{F}\_{2} = \begin{bmatrix} \mathbf{0}\_{m} \\ \boldsymbol{I}\_{J-m} \end{bmatrix}\_{J \times (J-m)}, \tag{6.22}$$

where $I_m$ and $I_{J-m}$ are the $m$- and $(J-m)$-dimensional identity matrices, respectively, and $\mathbf{0}_m$ and $\mathbf{0}_{J-m}$ are the corresponding zero matrices. Furthermore, the SPE and T<sup>2</sup> statistics are expressed in terms of $U$:

$$\begin{split} e &= \mathbf{x} \left( I - PP^{\mathrm{T}} \right) = \mathbf{x} \left( UU^{\mathrm{T}} - UF\_{1}F\_{1}^{\mathrm{T}}U^{\mathrm{T}} \right) \\ &= \mathbf{x} \left( UU^{\mathrm{T}} - UE\_{1}U^{\mathrm{T}} \right) = \mathbf{x}U \left( I - E\_{1} \right)U^{\mathrm{T}} = \mathbf{x}UE\_{2}U^{\mathrm{T}}, \end{split} \tag{6.23}$$

where

$$E\_1 = \begin{bmatrix} I\_m & \mathbf{0}\_{m,J-m} \\ \mathbf{0}\_{J-m,m} & \mathbf{0}\_{J-m} \end{bmatrix}, \ E\_2 = \begin{bmatrix} \mathbf{0}\_m & \mathbf{0}\_{m,J-m} \\ \mathbf{0}\_{J-m,m} & \mathbf{I}\_{J-m} \end{bmatrix}. \tag{6.24}$$

Define *y* = *xU*, then

$$\begin{split} \text{SPE} &:= \mathbf{Q} = \mathbf{e} \mathbf{e}^{\mathrm{T}} = \mathbf{x} \, U \boldsymbol{E}\_{2} \mathbf{U}^{\mathrm{T}} \mathbf{U} \boldsymbol{E}\_{2} \mathbf{U}^{\mathrm{T}} \mathbf{x}^{\mathrm{T}} \\ &= \mathbf{x} \, U \boldsymbol{E}\_{2} \mathbf{U}^{\mathrm{T}} \mathbf{x}^{\mathrm{T}} = \mathbf{y} \, \mathbf{E}\_{2} \mathbf{y}^{\mathrm{T}} = \sum\_{i=m+1}^{J} \mathbf{y}\_{i}^{2} . \end{split} \tag{6.25}$$

Similarly, we can describe the T<sup>2</sup> statistic as follows:

$$\begin{split} \mathbf{T}^{2} &:= \mathbf{D} = t\,\boldsymbol{\Lambda}\_{m}^{-1}\mathbf{t}^{\mathrm{T}} = \mathbf{x}\,\mathbf{P}\,\mathbf{A}\_{m}^{-1}\,\mathbf{P}^{\mathrm{T}}\mathbf{x}^{\mathrm{T}} \\ &= \mathbf{x}\,\boldsymbol{U}\,\boldsymbol{F}\_{1}\boldsymbol{A}\_{m}^{-1}\,\mathbf{F}\_{1}^{\mathrm{T}}\mathbf{U}^{\mathrm{T}}\mathbf{x}^{\mathrm{T}} = \mathbf{x}\,\boldsymbol{U}\,\boldsymbol{A}^{-1}\boldsymbol{U}^{\mathrm{T}}\mathbf{x}^{\mathrm{T}} \\ &= \mathbf{y}\,\boldsymbol{A}^{-1}\mathbf{y}^{\mathrm{T}} = \sum\_{i=1}^{m} \mathbf{y}\_{i}^{2}\boldsymbol{\sigma}\_{i}^{2}, \end{split} \tag{6.26}$$

where

$$\boldsymbol{\Lambda}\_{m}^{-1} = \operatorname{diag} \{ \sigma\_1^2, \sigma\_2^2, \dots, \sigma\_m^2 \}, \ \boldsymbol{\Lambda}^{-1} = \operatorname{diag} \{ \boldsymbol{\Lambda}\_{m}^{-1}, \mathbf{0}\_{(J-m)\times(J-m)} \}.$$

The new combined index could be obtained directly by composing the two statistics as

$$\varphi = D + \mathbf{Q} = \sum\_{i=1}^{m} \mathbf{y}\_i^2 \sigma\_i^2 + \sum\_{i=m+1}^{J} \mathbf{y}\_i^2. \tag{6.27}$$
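As a quick numerical check of (6.25)–(6.27), the sketch below uses synthetic full-rank data (the sizes $n$, $J$, and $m$ are illustrative): in the rotated coordinates $y = xU$, SPE is the energy of the last $J-m$ components, T<sup>2</sup> is the eigenvalue-scaled energy of the first $m$, and $\varphi$ is their direct sum.

```python
import numpy as np

rng = np.random.default_rng(4)
n, J, m = 200, 5, 2
X = rng.normal(size=(n, J))
X -= X.mean(axis=0)

# Eigendecomposition of the covariance R_x, eigenvalues sorted descending
Rx = X.T @ X / (n - 1)
lam, U = np.linalg.eigh(Rx)
lam, U = lam[::-1], U[:, ::-1]
P = U[:, :m]                                # principal loadings

x = rng.normal(size=J)
y = x @ U                                   # rotated coordinates y = x U

Q = np.sum(y[m:] ** 2)                      # SPE, (6.25)
D = np.sum(y[:m] ** 2 / lam[:m])            # T^2, (6.26)
phi = D + Q                                 # combined index, (6.27)

# Cross-check against the definitions in the original variable space
e = x - (x @ P) @ P.T
assert np.isclose(Q, e @ e)
assert np.isclose(D, (x @ P) @ ((x @ P) / lam[:m]))
```

Both assertions hold because $U$ is orthogonal, so rotating into $y$-coordinates changes neither distances nor the block split between the PCS and RS components.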

The above mathematical derivation proves that the two decomposed statistics can be added together directly in the same geometric space. This result demonstrates that the T<sup>2</sup> and SPE statistics can be combined naturally, which is an intrinsic property. Thus, the combined index has a more general geometric interpretation than other combined indices. The monitoring strategy with the novel index is introduced in the next subsection.

## *6.2.2 Control Limit of Combined Index*

In Sect. 6.1, the T<sup>2</sup> and SPE statistics are decomposed into two new statistics for each variable. To reduce the calculation of process monitoring, the two new statistics are combined into a new statistic ϕ to monitor the process.

$$\varphi\_{i,j,k} = c\_{i,j,k}^{\text{D}} + q\_{i,j,k}^{\text{SPE}},\tag{6.28}$$

where $\varphi_{i,j,k}$ is the combined statistic of the *j*th variable at sampling time *k* for the *i*th batch. The method described in Sect. 6.1.3 can be used to calculate the control limit of the new statistic,

$$\varphi\_{j,k}^{\text{limit}} = \bar{\varphi}\_{j,k} + \kappa \left( \text{var} \left( \varphi\_{j,k} \right) \right)^{1/2}, \tag{6.29}$$

where κ is a predefined parameter, and

$$\begin{aligned} \bar{\varphi}\_{j,k} &= \sum\_{i=1}^{I} \varphi\_{i,j,k}/I\\ \text{var}\left(\varphi\_{j,k}\right) &= \sum\_{i=1}^{I} \left(\varphi\_{i,j,k} - \bar{\varphi}\_{j,k}\right)^2/(I-1). \end{aligned} \tag{6.30}$$

Online process monitoring is performed by comparing the new statistic with its control limit. Several points should be highlighted when the proposed control limit is used. Firstly, the mean and variance may be inaccurate for a small number of samples; hence, a sufficient number of training samples should be collected during the offline stage. Secondly, the predefined parameter κ is important; it is chosen by engineers according to the actual process conditions, and its tuning is similar to that of the Shewhart control chart. Equation (6.29) shows that the influence of the variance term depends on κ, and so does the fluctuation of the control limit at each sample: the control limit is smooth when κ is small and fluctuates when κ is large.

If the combined statistic of a new sample differs significantly from those of the reference data set, then a fault is detected, and a fault isolation procedure is initiated to find the fault roots. This fault response process is one of the advantages of original process variable monitoring, as each variable has a unique formulation and physical meaning. The proposed monitoring steps are similar to those in Sect. 6.1.2.

## **6.3 Case Study**

A fed-batch penicillin fermentation process is considered in the case study; its detailed mathematical model is given in Birol et al. (2002). A detailed description of the fed-batch penicillin fermentation process is available in Chap. 4.

## *6.3.1 Variable Monitoring via Two Statistics Decomposition*

Firstly, the original process variable monitoring algorithm described in Sect. 6.1.2 is tested. Presenting the monitoring results of all variables would be long and tedious, so only several typical variables are shown here for demonstration and comparison. The monitoring result of variable 1 in a normal test batch is shown in Fig. 6.2. Neither of the two statistics ($c^{D}_{1,k}$ and $q^{\mathrm{SPE}}_{1,k}$) exceeds its control limit, and the statistics ($c^{D}_{j,k}$ and $q^{\mathrm{SPE}}_{j,k}$, $j = 2, \ldots, 11$) of all the other variables do not exceed their control limits either. The monitoring results of the other variables are similar to that of variable 1 and are omitted for brevity. These results show that the proposed algorithm produces no false alarms when monitoring the normal batch.

Next, the fault batch data are used to test the proposed monitoring algorithm of the original process variables, and two types of faults are chosen here.

**Fault 1:** step type, e.g., a 20% step decrease is added in variable 3 at 200–250 h.

**Fig. 6.2** Original variables monitoring for normal batch (variable 1)

**Fig. 6.3** Monitoring result for Fault 1 (variable 1)

The monitoring results are shown as follows. Figure 6.3 shows the monitoring result of variable 1 for Fault 1; the statistics change noticeably while the fault is present. However, they do not exceed the control limits, i.e., the process status changes, but variable 1 is not the fault source. The monitoring results of variables 2, 4, 8, 9, and 11 are almost the same as that of variable 1 and are not presented here.

The monitoring results of variable 3 and variable 5 are shown in Figs. 6.4 and 6.5, respectively. Both variables' statistics exceed the control limits at sampling time 200 h. Regarding variables 6, 7, and 10, their statistics also exceed the control limits, and the simulation results of these variables are nearly the same as that of variable 5 (the results are not presented here).

The question is: which variable is the fault source, variable 3, 5, or another? From the amplitudes in Figs. 6.4 and 6.5, it is easy to see that the two statistics for variable 3 exceed the control limits to a much greater extent than those for variable 5 and the other variables. In particular, the Q statistic of variable 3 is 40 times greater than its control limit. From this perspective, variable 3 can be concluded to be the fault source, as it contributes most significantly to the statistics. Note that there is no smearing effect in the proposed method. The smearing effect means that non-faulty variables exhibit large contribution values while the contributions of faulty variables are smaller. Because the statistics are decomposed into a unique sum of the variable contributions, each monitoring figure is plotted against the decomposed per-variable statistics. Furthermore, the proposed method may identify several faulty variables if they have large contributions of similar magnitude.

**Fig. 6.4** Monitoring result for Fault 1 (variable 3)

**Fig. 6.5** Monitoring result for Fault 1 (variable 5)

**Fig. 6.6** Relative contribution rate of *Rc* for Fault 1

To confirm the monitoring conclusion, the relative statistical contribution rate of the *j*th variable at time *k* is defined as

$$\begin{aligned} R\_c^{j,k} &= c\_{j,k}^{\mathcal{D}} / \sum\_{j=1}^{J} c\_{j,k}^{\mathcal{D}}\\ R\_q^{j,k} &= q\_{j,k}^{\mathrm{SPE}} / \sum\_{j=1}^{J} q\_{j,k}^{\mathrm{SPE}}. \end{aligned}$$
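A sketch of the relative contribution rates at a single sampling time (the statistic values are synthetic placeholders for the $J = 11$ case-study variables):

```python
import numpy as np

rng = np.random.default_rng(5)
c_k = np.abs(rng.normal(size=11))      # stand-in for c^D_{j,k}, j = 1..11
q_k = np.abs(rng.normal(size=11))      # stand-in for q^SPE_{j,k}

# Each variable's share of the total decomposed statistic at time k
R_c = c_k / c_k.sum()
R_q = q_k / q_k.sum()

# Both rates sum to one, so the dominant variable is directly comparable
print(np.isclose(R_c.sum(), 1.0), int(np.argmax(R_c)))
```

Because each $q^{\mathrm{SPE}}_{j,k}$ and $c^{D}_{j,k}$ belongs to exactly one variable, the largest rate points at the fault source without the smearing that afflicts traditional contribution plots.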

The relative statistical contribution rates of the 11 variables are shown in Figs. 6.6 and 6.7. It is clear that variable 3 is the source of Fault 1. Variables 9, 10, and 11 still have higher contributions after the fault is eliminated, because the fault in variable 3 causes changes in the other process variables. The effect on the whole process persists even after the fault is eliminated, and the fault propagates from the original variable 3 to other process variables.

**Fault 2:** ramp type, i.e., fault involving a ramp increasing with a slope of 0.3 in variable 3 at 20–80 h.

The two monitoring statistics of variable 3 are shown in Figs. 6.8 and 6.9. Both statistics exceed the control limits at approximately 50 h. The alarm lags the fault occurrence time (approximately 20 h) because this fault variable changes gradually. When the fault is eliminated after 80 h, the relationship among the variables returns to normal. The T<sup>2</sup> statistic clearly falls below the control limit, while the SPE statistic still exceeds it because the error caused by Fault 2 persists.

**Fig. 6.7** Relative contribution rate of *Rq* for Fault 1

**Fig. 6.8** Fault 2 monitoring by *c* statistic (variable 3)

**Fig. 6.9** Fault 2 monitoring by *q* statistic (variable 3)

## *6.3.2 Combined Index-Based Monitoring*

The same test data as in Sect. 6.3.1 are used to test the monitoring effectiveness of the new combined index. For a normal batch, the monitoring result of the ϕ statistic is shown in Fig. 6.10. Variable 1 is again monitored in this section for comparison with Sect. 6.3.1. The new index ϕ of variable 1 stays far below its control limit, as do the index values of the other variables. The method performs well: the number of false alarms is zero in normal batch monitoring. The new index is more stable than the two separate statistics and is easier for operators to observe.

**Fault 1:** step type, e.g., a 20% step decrease is added in variable 3 at 200–250 h.

The new statistic ϕ of variable 1 does not exceed the control limit in Fig. 6.11, although it changes from 200 h to 250 h while the fault is present. The values of the new statistic ϕ for variables 2, 4, 8, 9, and 11 also do not exceed their control limits; the corresponding monitoring plots are omitted here. Thus, these variables have no direct relationship with the fault variable, i.e., they are not the fault source.

Furthermore, the monitoring results of variables 3 and 5 are shown in Figs. 6.12 and 6.13, respectively. The statistics of variables 3 and 5 clearly exceed their control limits, as do those of variables 6, 7, and 10. As discussed in Sect. 6.3.1, the statistic ϕ of variable 3 changes to a greater extent than those of the other variables, so variable 3 is the potential fault source. This result shows that the proposed approach is an efficient technique for fault detection.

**Fig. 6.10** Original variables monitoring based on combined index for normal batch (variable 1)

**Fig. 6.11** Fault 1 monitoring based on combined index (variable 1)

**Fig. 6.12** Fault 1 monitoring based on combined index (variable 3)

**Fig. 6.13** Fault 1 monitoring based on combined index (variable 5)

The relative contribution of the new statistic is used to confirm the fault source, which is defined as

$$R\_{\varphi}^{j,k} = \varphi\_{j,k} / \sum\_{j=1}^{J} \varphi\_{j,k}.$$

The relative contribution of variable 3 is nearly 100%, as shown in Fig. 6.14, so variable 3 is confirmed as the fault source. Variables 9, 10, and 11 still have higher contributions after the fault is eliminated, because the fault in variable 3 causes changes in the other process variables, and the effect on the whole process persists even after the fault is eliminated.

Note that the relative contribution plot (RCP) is an auxiliary tool to locate the fault roots. It is only used for comparison with the proposed monitoring method to confirm diagnostic conclusions. Furthermore, the RCP is completely different from the traditional contribution diagram in this work. The RCP in this work is calculated using the original process variables, i.e., there is no smearing effect of the RCP. The contribution of each variable is independent of the other variables. Therefore, the proposed method is a novel and helpful approach in terms of original process variable monitoring. Furthermore, the color map of the fault contribution is intuitive. As a result, the map will promote the operator's initiative to find the fault source, and engineers can find some useful information to avoid more serious accidents.

**Fault 2:** ramp type, i.e., fault involving a ramp increasing with a slope of 0.3 in variable 3 at 20–80 h.

**Fig. 6.14** Relative contribution rate of ϕ statistic for Fault 1

**Fig. 6.15** Fault 2 monitoring of variable 3 by ϕ statistic

The monitoring result of variable 3 is shown in Fig. 6.15. It can be seen that the new statistic ϕ exceeds the control limit at approximately 50 h, and then it falls below the control limit after 80 h. The result shows that the combined index can detect different faults.

## *6.3.3 Comparative Analysis*

The monitoring performances of different methods are compared using several performance indices. False alarms (FA) is the number of false alarms over the operating duration. Time detected (TD) is the time at which the statistic first exceeds the control limit under faulty operation, which reflects the detection sensitivity.

The monitoring results of the proposed method are compared with those of the traditional sub-PCA method (Lu et al. 2004) in the latent space and the soft-transition sub-PCA (Wang et al. 2013) to illustrate its effectiveness. The FA and TD results for the other 12 faults are presented in Tables 6.1 and 6.2, respectively. The fault variable numbers (1, 2, and 3) represent the aeration rate, agitator power, and substrate feed rate, as described in Chap. 4. The fault type and occurrence time for each variable are given in Table 6.1, and the input conditions are the same as those in Sects. 6.3.1 and 6.3.2.


**Table 6.1** Monitoring results of FA for other faults

It can be seen from Table 6.1 that the traditional sub-PCA method produces multiple false alarms when detecting faults, while the original process variable monitoring method based on the combined index ϕ proposed in this chapter shows fewer false alarms. Among the three indices of the original-space monitoring, the *c* and *q* statistics may each yield a large number of false alarms for different reasons, but the new combined index ϕ is more accurate because it balances the two indices.

Table 6.2 indicates that the original process variable monitoring gives accurate and timely detection results compared with the other two detection methods. The detection delay exceeds 10 h for Faults 4, 7, 8, and 11 with the traditional sub-PCA and the soft-transition sub-PCA; such a delay is unacceptable in a complex industrial process. In contrast, the difference between the detected time and the real fault time for the proposed approach is less than 10 h, except for Fault 4. This result is helpful and meaningful in practice, as the proposed approach can provide more suitable process information to operators. Thus, the proposed monitoring method based on a combined index shows the advantages of rapid detection and fewer false alarms compared with the traditional and soft-transition sub-PCA approaches, whose monitoring operates in the latent space rather than the original measurement space.


**Table 6.2** Comparing the time of fault detected

## **6.4 Conclusions**

A new multivariate statistical method for the monitoring and diagnosis of batch processes, which operates on the original process variables, was presented in this chapter. The proposed monitoring method is based on the decomposition of the T<sup>2</sup> and SPE statistics into a unique sum of per-variable contributions. However, problems may arise if the number of variables is large when the original process variables technique is applied. To reduce the workload of the monitoring calculation, a new combined index was proposed. A mathematical derivation was given to prove that the two decomposed statistics can be added together directly. Compared to the traditional PCA method in the latent space, the proposed method is sufficiently direct, and only one statistical index is utilized, thereby decreasing the calculation burden.

The new original variable space monitoring method can detect a fault with a clear, per-variable result. The fault source can be determined directly from the statistical index rather than through the traditional contribution plot. Furthermore, the control limit of the new combined statistic is very simple and does not require assuming that the statistic follows a particular probability distribution. The simulation results show that the new combined statistic can detect faults efficiently. As the new statistical index is the combination of two decomposed statistics, it can avoid many problems introduced by the use of a single statistic, such as false alarms or missing alarms.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Kernel Fisher Envelope Surface for Pattern Recognition**

Batch processes are more difficult to monitor than continuous processes due to their complex features, such as nonlinearity, non-stable operation, unequal production cycles, and the fact that some variables are only measured at the end of a batch. Traditional methods for batch processes, such as multiway FDA (Chen 2004) and multi-model FDA (He et al. 2005), cannot solve these issues well: they require complete batch data, which are only available at the end of a batch. Therefore, the complete batch trajectory must be estimated in real time, or alternatively only the values measured at the current moment are used for online diagnosis. Moreover, the above approaches do not consider the problem of inconsistent production cycles.

To address these issues, this chapter presents the modeling of the kernel Fisher envelope surface (KFES) and applies it to fault identification in batch processes. This method builds separate envelope models for the normal and faulty data based on the eigenvalues projected onto the two discriminant vectors of kernel FDA. The highlights of the proposed method include the kernel projection aimed at the nonlinearity, batch-wise data unfolding, envelope modeling aimed at unequal cycles, and a new detection indicator that is easy to implement online.

# **7.1 Process Monitoring Based on Kernel Fisher Envelope Analysis**

## *7.1.1 Kernel Fisher Envelope Surface*

Consider the batch-wise data matrix with *I* batches, i.e.,

$$X(k) = \begin{bmatrix} X^1(k), X^2(k), \dots, X^I(k) \end{bmatrix}^\text{T},$$

© The Author(s) 2022

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_7

where $X^i$ consists of $n_i$ ($i = 1, \ldots, I$) row vectors, and each row vector $X^i_j(k)$, $j = 1, \ldots, n_i$, is a sample acquired at time *k* in batch *i*. Each batch has the same sampling period but a different operation cycle, i.e., batch *i* has $n_i$ ($i = 1, 2, \ldots, I$) sampling points. Let *K* be the largest number of sampling instants among all batches, i.e., $K = \max[n_1, n_2, \ldots, n_I]$.

Let Φ(*x*) be a nonlinear mapping that maps the sample data from the original space *X* into the high-dimensional feature space *F*. Suppose that each batch is treated as a class; then the whole data set can be categorized into *I* classes. The optimal discriminant vector *w* is obtained using the Fisher criterion function in the feature space *F*. Since computing Φ(*x*) explicitly is not always feasible, a kernel function is introduced,

$$K(\mathbf{x}\_i, \mathbf{x}\_j) = \langle \Phi(\mathbf{x}\_i), \Phi(\mathbf{x}\_j) \rangle = \Phi(\mathbf{x}\_i)^\mathsf{T} \Phi(\mathbf{x}\_j). \tag{7.1}$$

This kernel function allows the dot product in *F* to be evaluated without directly computing Φ. According to the reproducing kernel principle, any solution *w* ∈ *F* for the discriminant vector must lie in the span of all training samples:

$$\mathfrak{w} = \sum\_{m=1}^{n} \alpha\_m \Phi(\mathbf{x}\_m) = \Phi \boldsymbol{\alpha},\tag{7.2}$$

where $x_m$, $m = 1, \ldots, n$, with $n = n_1 + n_2 + \cdots + n_I$, are the row vectors of $X$; $\boldsymbol{\Phi} = [\Phi(x_1), \ldots, \Phi(x_n)]$ and $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_n)^{\mathrm{T}}$. The eigenvalues $T_{ij}$ are obtained by projecting the mapped samples $\Phi(x^i_j)$ in the feature space onto $\mathfrak{w}$:

$$\begin{split} T\_{ij} &= \boldsymbol{\mathfrak{w}}^{\mathrm{T}} \boldsymbol{\Phi}(\mathbf{x}^{i}\_{j}) = \boldsymbol{\alpha}^{\mathrm{T}} \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}(\mathbf{x}^{i}\_{j}) \\ &= \boldsymbol{\alpha}^{\mathrm{T}} \left[ \boldsymbol{\Phi}(\mathbf{x}\_{1})^{\mathrm{T}} \boldsymbol{\Phi}(\mathbf{x}^{i}\_{j}), \dots, \boldsymbol{\Phi}(\mathbf{x}\_{n})^{\mathrm{T}} \boldsymbol{\Phi}(\mathbf{x}^{i}\_{j}) \right]^{\mathrm{T}} \\ &= \boldsymbol{\alpha}^{\mathrm{T}} \boldsymbol{\xi}^{i}\_{j}. \end{split} \tag{7.3}$$

The kernel sample vector $\xi^i_j$ is defined as follows:

$$\boldsymbol{\xi}\_{j}^{i} = \left[ K(\mathbf{x}\_{1}, \mathbf{x}\_{j}^{i}), K(\mathbf{x}\_{2}, \mathbf{x}\_{j}^{i}), \dots, K(\mathbf{x}\_{n}, \mathbf{x}\_{j}^{i}) \right]^{\mathrm{T}}.\tag{7.4}$$
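A sketch of (7.4), assuming an RBF (Gaussian) kernel; the chapter does not prescribe a specific kernel function, and the kernel width and data sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 8))            # n = 40 stacked samples from all batches

def rbf(a, b, gamma=0.1):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = rng.normal(size=8)              # a sample x^i_j
# Kernel sample vector xi^i_j: kernel of x^i_j against every training sample
xi = np.array([rbf(x_m, x_new) for x_m in X])
print(xi.shape)                          # (40,)
```

The projection $T_{ij} = \alpha^{\mathrm{T}}\xi^i_j$ is then a single dot product, which is what makes the online evaluation cheap once $\alpha$ has been trained.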

Considering the projection of the within-class mean vector $m^{\Phi}_i$, $i = 1, \ldots, I$, the kernel within-class mean vector $\mu_i$ is obtained as

$$\mu\_i = \left[ \frac{1}{n\_i} \sum\_{j=1}^{n\_i} K(\mathbf{x}\_1, \mathbf{x}\_j^i), \dots, \frac{1}{n\_i} \sum\_{j=1}^{n\_i} K(\mathbf{x}\_n, \mathbf{x}\_j^i) \right]^T. \tag{7.5}$$

Then the kernel between-class scatter matrix *K<sup>b</sup>* is


$$\mathbf{K}\_b = \sum\_{i=1}^{l} \frac{n\_i}{n} (\mu\_i - \mu\_0)(\mu\_i - \mu\_0)^\mathrm{T}. \tag{7.6}$$

Similarly, considering the projection of the overall mean vector $m^{\Phi}_0$ onto the discriminant vector *w*, the kernel overall mean vector $\mu_0$ and the kernel within-class scatter matrix $K_w$ are calculated as

$$\mu\_0 = \left[ \frac{1}{n} \sum\_{j=1}^n K(\mathbf{x}\_1, \mathbf{x}\_j), \dots, \frac{1}{n} \sum\_{j=1}^n K(\mathbf{x}\_n, \mathbf{x}\_j) \right]^T \tag{7.7}$$

$$\mathbf{K}\_w = \frac{1}{n} \sum\_{i=1}^{I} \sum\_{j=1}^{n\_i} (\xi\_j^i - \mu\_i)(\xi\_j^i - \mu\_i)^\mathrm{T}. \tag{7.8}$$

The discriminant function, with the objective of maximizing the between-class scatter and minimizing the within-class scatter, is equivalent to

$$\begin{split} \max J(\boldsymbol{\alpha}) &= \frac{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} \mathbf{K}\_{b} \boldsymbol{\alpha})}{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} \mathbf{K}\_{w} \boldsymbol{\alpha})} \\ &= \frac{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{b} \boldsymbol{\Lambda}\_{b} \boldsymbol{V}\_{b}^{\mathrm{T}}) \boldsymbol{\alpha})}{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{w} \boldsymbol{\Lambda}\_{w} \boldsymbol{V}\_{w}^{\mathrm{T}}) \boldsymbol{\alpha})}, \end{split} \tag{7.9}$$

where $\mathbf{K}_b = V_b \Lambda_b V_b^{\mathrm{T}}$ and $\mathbf{K}_w = V_w \Lambda_w V_w^{\mathrm{T}}$ are the eigenvalue decompositions of the between-class and within-class scatter matrices, respectively. To construct the envelope surface model, two discriminant vectors are usually obtained, namely the optimal discriminant vector and the suboptimal discriminant vector. The kernel sampling vector for sampling point $k$ of batch $i$ is $\xi_k^i$, which is projected onto the two discriminant vectors to obtain the eigenvalues $T_{ik}^1$ and $T_{ik}^2$.
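One way to obtain the optimal and suboptimal discriminant vectors maximizing (7.9) is to solve the generalized eigenproblem $\mathbf{K}_b \alpha = \lambda \mathbf{K}_w \alpha$ and keep the leading eigenvectors. The sketch below uses SciPy; the small ridge term `reg` that keeps $\mathbf{K}_w$ invertible is an illustrative safeguard, not part of the original derivation.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_discriminant_vectors(Kb, Kw, n_vectors=2, reg=1e-6):
    # Solve Kb a = lam Kw a and return the eigenvectors with the
    # largest eigenvalues (optimal, suboptimal, ... directions).
    n = Kw.shape[0]
    lam, A = eigh(Kb, Kw + reg * np.eye(n))   # ascending eigenvalues
    order = np.argsort(lam)[::-1]
    return A[:, order[:n_vectors]]

# diagonal toy scatter matrices: the best direction is the coordinate
# with the largest between/within ratio
Kb = np.diag([4.0, 1.0, 0.5])
Kw = np.eye(3)
alphas = fisher_discriminant_vectors(Kb, Kw)
```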

The eigenvalue vectors of all batches at time $k$ in the first two projection directions are $[T_{1k}^1, T_{2k}^1, \dots, T_{Ik}^1]$ and $[T_{1k}^2, T_{2k}^2, \dots, T_{Ik}^2]$. The means of these two eigenvalue vectors are $mean_1(k)$ and $mean_2(k)$, respectively. Define

$$\begin{aligned} \max\_{1}(k) &= \max\left[|T\_{1k}^{1} - mean\_{1}(k)|, \ \cdots \ , \ |T\_{Ik}^{1} - mean\_{1}(k)|\right] \\ \max\_{2}(k) &= \max\left[|T\_{1k}^{2} - mean\_{2}(k)|, \ \cdots \ , \ |T\_{Ik}^{2} - mean\_{2}(k)|\right], \end{aligned} \tag{7.10}$$

where $\max(k)$ is the larger of $\max_1(k)$ and $\max_2(k)$, for each $k = 1, 2, \dots, K$. Then the envelope surface in the high-dimensional space is

$$(\mathbf{x}\_k - mean\_1(k))^2 + (\mathbf{y}\_k - mean\_2(k))^2 = \max \left( k \right)^2 (k = 1, 2, \dots, K), \quad (7.11)$$

where $(x_k, y_k)$ is the projection of the original data in the feature space, i.e., $x_k$ is the eigenvalue in the optimal discriminant direction and $y_k$ is the eigenvalue in the suboptimal discriminant direction. Equation (7.11) gives the envelope surface bounding the maximum variation allowed for the eigenvalues of this class of data at each sampling time.
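The envelope parameters of (7.10)–(7.11) follow directly from the batch eigenvalue trajectories. A minimal sketch, with the array shapes assumed as (batches × sampling times):

```python
import numpy as np

def envelope_parameters(T1, T2):
    # T1, T2: (I, K) projections of batch i at time k onto the optimal /
    # suboptimal discriminant directions
    mean1 = T1.mean(axis=0)                  # mean_1(k)
    mean2 = T2.mean(axis=0)                  # mean_2(k)
    max1 = np.abs(T1 - mean1).max(axis=0)    # max_1(k) of (7.10)
    max2 = np.abs(T2 - mean2).max(axis=0)    # max_2(k) of (7.10)
    return mean1, mean2, np.maximum(max1, max2)  # envelope radius max(k)

# I = 2 batches, K = 2 sampling times
T1 = np.array([[0.0, 1.0], [2.0, 3.0]])
T2 = np.array([[0.0, 0.0], [4.0, 2.0]])
mean1, mean2, radius = envelope_parameters(T1, T2)
```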

#### **Unequal Cycle Discussion**

Suppose the production period of each batch is different, i.e., $n_i$ varies with the batch index $i$. The envelope surface model is built as described above; the difference lies in the composition of the eigenvalue vectors. As a simple example, suppose the training data set contains $I$ batches, the sampling time $k$ of each batch runs from 1 up to at most $K$, where $K$ is the largest sampling time over all batches, and only batch $i$ does not reach the maximum sampling time $K$, i.e., $k = 1, \dots, n_i$ with $n_i \le K$. For $k = 1, \dots, n_i$, the eigenvalue vectors are $[T_{1k}^1, T_{2k}^1, \dots, T_{Ik}^1]$ and $[T_{1k}^2, T_{2k}^2, \dots, T_{Ik}^2]$. When the time increases to $k = n_i + 1, \dots, K$, batch $i$ drops out and the eigenvalue vectors become $[T_{1k}^1, \dots, T_{(i-1)k}^1, T_{(i+1)k}^1, \dots, T_{Ik}^1]$ and $[T_{1k}^2, \dots, T_{(i-1)k}^2, T_{(i+1)k}^2, \dots, T_{Ik}^2]$. Obviously, the parameters of the envelope surface model (7.11), $mean_1(k)$, $mean_2(k)$, and $\max(k)$, are time varying with $k$.

## *7.1.2 Detection Indicator*

Define the detection indicators as follows:

$$\begin{aligned} P\_1(k) &= \frac{|T\_k^1 - mean\_1(k)|}{\max(k)}\\ P\_2(k) &= \frac{|T\_k^2 - mean\_2(k)|}{\max(k)}\\ T(k) &= (T\_k^1)^2 + (T\_k^2)^2, \end{aligned} \tag{7.12}$$

where $T_k^1$ and $T_k^2$ are the eigenvalues obtained by mapping the real-time sample vector $x_k$ onto the discriminant vectors in the high-dimensional space. When the eigenvalue trajectory at that moment is contained within the envelope surface, both $P_1(k) < 1$ and $P_2(k) < 1$ hold. If the new batch of data differs greatly from the training data of this envelope surface model, the Gaussian kernel function used in the kernel Fisher criterion is almost zero, so that $T_k^1 = 0$ and $T_k^2 = 0$, i.e., $T(k) = 0$. Thus, for given measured data, a judgement can be made using the above indicators. When $P_1(k) < 1$, $P_2(k) < 1$, and $T(k) = 0$ does not occur, the data sampled at that moment belong to this mode type. When $T(k) = 0$ occurs consistently, the sampled data do not belong to this mode type.
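The indicators of (7.12) can be evaluated as follows. Since the projections are only numerically close to zero for mismatched data, the tolerance `eps` used to decide whether $T(k) = 0$ is an assumption of this sketch:

```python
import numpy as np

def detection_indicators(T1k, T2k, mean1_k, mean2_k, max_k, eps=1e-12):
    # P1, P2 and T of (7.12) for one sample at time k
    P1 = abs(T1k - mean1_k) / max_k
    P2 = abs(T2k - mean2_k) / max_k
    T = T1k ** 2 + T2k ** 2
    in_envelope = (P1 < 1.0) and (P2 < 1.0) and (T > eps)
    return P1, P2, T, in_envelope

# inside the envelope -> the sample belongs to this mode type
P1, P2, T, ok = detection_indicators(0.5, 0.5, 0.4, 0.4, 1.0)
# T(k) = 0 -> the sample does not belong to this mode type
_, _, T0, ok0 = detection_indicators(0.0, 0.0, 0.4, 0.4, 1.0)
```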

It is assumed that the normal operating envelope surface model has determined that the batch of data is faulty at some point. Fault identification is then carried out using the fault envelope surface models. Consider one of the fault envelope surface models: if $P_1(k) < 1$, $P_2(k) < 1$, and no $T(k) = 0$ occurs, then the batch fault belongs to the current fault type. If $T(k) = 0$ appears consistently in every envelope model, then the fault may be a new one. When that fault occurs multiple times, the pattern type needs to be updated and an additional envelope model needs to be constructed for the new fault.

The fault identification using the proposed kernel Fisher envelope surface analysis (KFES) is given as follows. Its fault monitoring flowchart is shown in Fig. 7.1.

#### **Fault Monitoring Algorithm Based on KFES**

**Step 1**: Collect the historical data with $S$ fault categories. Construct an envelope surface model for each of the $S$ categories based on the description in Sect. 7.1.1:

$$(\mathbf{x}\_k - mean\_1^S(k))^2 + (\mathbf{y}\_k - mean\_2^S(k))^2 = \max^S(k)^2, (k = 1, 2, \dots, K). \quad (7.13)$$

Then store all the model parameters $mean_1^S(k)$, $mean_2^S(k)$, and $\max^S(k)$, $k = 1, 2, \dots, K$. Thus the envelope model library $Env\text{-}model(S, k)$ is constructed.

**Step 2**: Sample the real-time data $x_k$. After normalization, the kernel sampling vector $\xi_k$ is obtained.

**Step 3**: Under each of the known $S$ fault envelope surface models at time $k$, project the kernel sampling vector $\xi_k$ of $x_k$ in the direction of the discriminant vectors. Calculate the corresponding projection eigenvalues $T_k^1$, $T_k^2$ and the detection indicators. If $P_1^S(k) < 1$, $P_2^S(k) < 1$, and $T^S(k) = 0$ does not occur, then the fault belongs to category $S$.

**Step 4**: If the detection indicators in Step 3 are not satisfied for any known fault type, it is possible that a new fault has occurred. When that unknown fault lasts for a period of time, the model library needs to be updated: the envelope surface for this new fault is modeled from the accumulated new batch data as in Step 1 and augmented into the model library.

# *7.1.3 KFES-PCA-Based Synthetic Diagnosis in Batch Process*

The basic idea of synthetic diagnosis is to integrate the advantages of KFES and PCA. A multiway PCA model is built for the normal operating data in the historical database, and the monitoring statistics T<sup>2</sup> and SPE of the PCA model, together with their control limits, are calculated; the multiway PCA model is used for fault detection. For the fault data in the historical database, KFES models are built for the known fault categories; the KFES analysis is used for fault identification. The modeling and online monitoring process of the synthetic diagnosis is shown in Fig. 7.2.

The normal operating data and the $S$ classes of fault data are obtained from the historical data set. Firstly, the normal operating condition data $X(I \times J \times K)$ is expanded into the two-dimensional matrix $X(I \times JK)$ in the time direction. After normalization, the data is unfolded again as $Y(IK \times J)$ in the batch direction. Multiway PCA is performed on this matrix to obtain the score matrix $T(IK \times R)$ and the loading matrix $P(J \times R)$, where $R$ is the number of principal components. Then the control limits of the statistics T<sup>2</sup> and SPE are calculated.
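The unfolding and normalization steps above can be sketched as follows; the exact memory layout is an assumption consistent with the description, and multiway PCA would then be applied to the returned $Y(IK \times J)$ matrix.

```python
import numpy as np

def batchwise_unfold_normalize(X):
    # X: (I, K, J) array of batches x sampling times x variables
    I, K, J = X.shape
    X2 = X.reshape(I, K * J)        # time-direction unfolding X(I x JK)
    mu = X2.mean(axis=0)
    sd = X2.std(axis=0) + 1e-12     # avoid division by zero
    X2 = (X2 - mu) / sd             # normalize each variable at each time
    return X2.reshape(I * K, J)     # batch-direction unfolding Y(IK x J)

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4, 2))      # I=3 batches, K=4 times, J=2 variables
Y = batchwise_unfold_normalize(X)
```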

**Fig. 7.2** Process monitoring flowchart based on KFES-PCA

Instead of using contribution maps, kernel Fisher envelope surface analysis is used for fault diagnosis. Assume that there are $S$ classes in the fault data set. An envelope surface model is first constructed for each fault type. When the new data $x_{new,k}$ is obtained, the PCA model judges whether the current operation is normal. If T<sup>2</sup> or SPE exceeds its control limit, a fault is detected. Then the type of fault can be identified with the KFES model library. If the eigenvalues do not satisfy the indicators in any of the known fault models, the fault appears to be a new one. As soon as enough data for KFES modeling are collected, the new fault model is added to the model library.

#### **Process Monitoring Algorithm Based on KFES-PCA A. Offline Modeling**

**Step 1**: Develop an improved multiway PCA model for the normal operating condition data, calculate the statistics T<sup>2</sup> and SPE, and determine the corresponding control limits T<sup>2</sup><sub>lim</sub> and SPE<sub>lim</sub> based on the score matrix $T(IK \times R)$ and loading matrix $P(J \times R)$ obtained from the normal model.

**Step 2**: Apply KFES analysis to the fault data and construct a fault envelope for each type of fault separately. Find the optimal discriminant weight matrix *Wα*, the mean *mean*1(*k*), *mean*2(*k*), and maximum max (*k*) of the eigenvalue vectors.

**Step 3**: Store T<sup>2</sup><sub>lim</sub> and SPE<sub>lim</sub>, the discriminant weight matrix $W_\alpha$ for each fault type, the means $mean_1(k)$, $mean_2(k)$, and the maximum $\max(k)$ of the eigenvalues.

#### **B. Online Monitoring**

**Step 1**: Normalize the new batch of data $x_{new,k}(J \times 1)$ at the $k$th sampling moment.

**Step 2**: Calculate the statistics T<sup>2</sup> and SPE and determine whether they exceed the limits; if not, return to Step 1. Otherwise, proceed to the next step.

**Step 3**: The known fault envelope surface models are used for fault identification at that moment. $x_{new,k}(J \times 1)$, the data sampled at the $k$th sampling moment, is normalized and projected onto the discriminant weight matrix $W_\alpha$ of each kernel Fisher envelope model to obtain the eigenvalues $T_k^1$ and $T_k^2$. The eigenvalues are substituted into the indicators; if $P_1(k) < 1$, $P_2(k) < 1$, and no $T(k) = 0$ occurs, the fault belongs to this fault type.

**Step 4**: If a fault has been detected based on step 2, but it does not belong to any known fault type obtained from step 3, this indicates that a new fault may have occurred. When that unknown fault has occurred several times, the mode type needs to be updated and the envelope surface model for that fault needs to be augmented with the accumulated batches of new faults in an offline situation.

## **7.2 Simulation Experiment Based on KFES-PCA**

The fed-batch penicillin fermentation simulation platform is used to verify the effectiveness of the KFES-PCA method for fault diagnosis here. Eleven variables affecting the fermentation reaction were selected for modeling, and these variables were air flow, stirring power, substrate flow acceleration rate, temperature, etc. Three simulation failure types were selected as shown in Table 7.1. The total data sets (including 50 batches) were generated from the Pensim 2.0 simulation platform with 1 h sampling interval, consisting of 20 batches of normal operation, 10 batches of bottom flow acceleration rate drop failure, 10 batches of agitation power drop failure, and 10 batches of air flow drop failure. The normal operation data are obtained at different product cycles, one batch with 95 h, two batches with 96 h, two batches with 97 h, three batches with 98 h, five batches with 99 h, and seven batches with 100 h. Similarly, change the reaction duration of each batch, and change the time and amplitude of the failure occurrence. The failure batch data are collected.

Figure 7.3a–d gives the envelope surfaces of the kernel Fisher discriminant envelope models, trained offline, for normal operation and the three known fault operations, respectively. Here the *x*-axis and *y*-axis represent the directions of the optimal and suboptimal discriminant vectors, and the *z*-axis represents time.

The traditional monitoring methods, such as MPCA and MFDA, require the modeling batches to be of equal length. However, the duration of different batches tends to vary in practice, so the data must be preprocessed to equal length before using these methods. The proposed KFES-PCA method unfolds the data in the batch direction during preprocessing, which copes simply with unequal batches of data and is therefore easily applied in practice.

The following experiments are designed to perform online detection with known fault data and new unknown fault data, respectively. The two batches of test data are not included in the training data, in order to obtain a valid evaluation. In addition, a comparative validation using the conventional contribution map method and the improved MFDA method is also carried out (Jiang et al. 2003).


**Table 7.1** Types of faults in penicillin fermentation processes

**Fig. 7.3** Envelope surface for normal and three fault operations

## *7.2.1 Diagnostic Effect on Existing Fault Types*

#### **Experiment 1: Step Drop Fault at Stirring Power**

A fault batch data set is regenerated for testing with the stirring power drop fault. The fault occurs at 50 h as a step disturbance of −12% in magnitude that lasts until the process ends. The sampled data is first monitored based on the T<sup>2</sup> and SPE statistics, as shown in Fig. 7.4. It can be seen that T<sup>2</sup> and SPE continue to exceed the limits from 50 h to the process end, so the failure is detected when it occurs at 50 h. Table 7.2 records the indicators when the data is diagnosed using the envelope surface model of fault 2. It shows that $P_1(k) < 1$, $P_2(k) < 1$, and no $T(k) = 0$ occurs from 50 h through 100 h. So it is concluded that the fault of this testing batch belongs to fault 2. Figure 7.5 shows the diagnosis results based on each envelope surface model. It can also be seen that the fault matches the second type of fault, the stirring power drop fault.

The contribution plot is used to analyze the testing data at 50 h, as shown in Fig. 7.6. It is found that the second variable contributes significantly to both the statistics T<sup>2</sup> and SPE. This also diagnoses that the fault belongs to fault 2. Therefore, the envelope surface model is equally successful in diagnosing the fault type when compared with the contribution plot method.

**Fig. 7.4** Monitoring statistics of KFES-PCA method: experiment 1


**Table 7.2** The indicators detected in fault 2 envelope surface: experiment 1

The comparison experiment is carried out with the improved MFDA method, as shown in Fig. 7.7. The horizontal coordinate is time. The vertical coordinate is the fault type, where 0 represents normal operation, and 1, 2, 3, and 4 correspond to fault 1, fault 2, fault 3, and an unknown fault, respectively. It can be seen that the improved MFDA has a relatively high rate of misdiagnosis and its diagnosis result is not ideal.

#### **Experiment 2: Step Drop Fault at Air Flow**

The testing fault is the air flow drop failure. The testing data is regenerated with the failure occurring at 58 h, as a −10% step disturbance that lasts until the process ends. The monitoring statistics T<sup>2</sup> and SPE are given in Fig. 7.8. T<sup>2</sup> and SPE continue to exceed the control limits from 58 h to the end, so a fault is detected at 58 h in real time.

Figure 7.9 is the monitoring result using the proposed envelope surface model. Table 7.3 records the indicators when using the envelope surface model of fault 3. It can be seen that there are *P*1(*k*) < 1, *P*2(*k*) < 1, and no *T* (*k*) = 0 between 58 h and 100 h, so it is judged that the fault which occurred in this testing batch belongs to fault 3. Figure 7.9 shows all the diagnosis results with different envelope surface models. It can also be seen that this fault matches with the model of fault 3, i.e., the air flow drop fault.


**Fig. 7.5** Fault diagnosis based on envelope surfaces: experiment 1

**Fig. 7.6** Contribution plot to statistics T<sup>2</sup> and SPE at 50 h

The contribution plot of the sampling data at 58 h is shown in Fig. 7.10, where variables 1, 4, 6, and 8 contribute more to the statistic T<sup>2</sup>, and variable 3 contributes more to the statistic SPE. The diagnosis result is inconclusive. Therefore, the envelope surface method can successfully diagnose faults that the contribution plot fails to diagnose.

**Fig. 7.7** Fault diagnosis based on improved MFDA: experiment 1

**Fig. 7.8** Monitoring statistics of KFES-PCA method: experiment 2

The comparison results of the improved MFDA method are given in Fig. 7.11. It shows a relatively higher rate of misdiagnosis and its diagnosis result is not very satisfactory, compared with the proposed KFES-PCA.

**Fig. 7.9** Fault diagnosis based on envelope surfaces: experiment 2


**Table 7.3** The indicators detected in fault 3 envelope surface: experiment 2

**Fig. 7.10** Contribution plot to statistics T<sup>2</sup> and SPE at 58 h: experiment 2

**Fig. 7.11** Fault diagnosis based on improved MFDA: experiment 2

## *7.2.2 Diagnostic Effect on Unknown Fault Types*

#### **Experiment 3: Slope Drop Fault at Air Flow Rate**

Here a new fault is used to test the diagnosis ability of the proposed KFES-PCA method. A slope fault different from the three known fault types is considered: a ramp fault in which the air flow rate drops by 15% starting at 50 h. Firstly, the T<sup>2</sup> and SPE statistics are used to detect this new fault. Figure 7.12 shows that both statistics detect the fault in time at 50 h.

**Fig. 7.12** Monitoring statistics of KFES-PCA: experiment 3

**Table 7.4** The indicator detected in fault 3 envelope surface: experiment 3


**Fig. 7.13** Fault diagnosis based on different envelope surfaces: experiment 3

The known envelope surface models are used to diagnose this fault. Table 7.4 records that all the indicators are zero when the envelope surface model of fault 3 is used for diagnosis, which means that fault 3 has not occurred. The same indicator results are obtained from the envelope surface models of the other known faults. Figure 7.13 gives the diagnosis results under the different envelope surface models. This fault therefore does not belong to any known fault category and is diagnosed as a new fault. The proposed method thus realizes real-time diagnosis of unknown faults.

The diagnosis result of improved MFDA method is given in Fig. 7.14. It can be seen that the improved MFDA does not make a timely and correct diagnosis when the fault occurs. It gives a wrong diagnosis result, fault type 3. The correct result is

**Fig. 7.14** Fault diagnosis based on improved MFDA: experiment 3

not reported until 63 h, when this fault is finally diagnosed as a new fault, a delay of 13 h. Therefore, the improved MFDA method fails to identify new faults in time.

## **7.3 Conclusions**

This chapter describes a monitoring method based on KFES-PCA for batch processes. The production cycles of batch processes are often unequal, while monitoring methods for batch processes generally require batch data with consistent production cycles. Although data preprocessing can equalize the cycles, it may lose important information about the faults. In addition, many existing monitoring methods require a complete production trajectory for online monitoring, and filling in or estimating unknown values inevitably degrades the diagnostic performance. To address these two problems, the modeling process of the KFES method is described in detail and an online monitoring flowchart is presented. Furthermore, a batch fault diagnosis method integrating KFES and the improved PCA method is proposed. The method is applied to a penicillin fermentation simulation platform and compared with the traditional contribution map method and the improved MFDA method. The results show that the proposed method has better monitoring performance: it can diagnose faults early and effectively and has the ability to identify unknown faults.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Fault Identification Based on Local Feature Correlation**

Industrial data variables show obvious high dimension and strong nonlinear correlation. Traditional multivariate statistical monitoring methods, such as PCA, PLS, CCA, and FDA, are only suitable for high-dimensional data with linear correlation. The kernel mapping method is the most common technique for dealing with nonlinearity; it projects the original data from the low-dimensional space to a high-dimensional space through appropriate kernel functions so as to achieve linear separability in the new space. However, the projection from low dimension to high dimension is contradictory to the actual requirement of dimensionality reduction, so the kernel-based method inevitably increases the complexity of data processing. For this reason, we have proposed another kind of nonlinear processing approach based on manifold learning, a class of unsupervised models that describe data sets as low-dimensional manifolds embedded in high-dimensional spaces. It characterizes the original data as a low-dimensional manifold to achieve nonlinear correlation processing, a strategy consistent with the goal of dimensionality reduction. Furthermore, manifold learning fits the nonlinear correlation by means of piecewise linearization in an intuitive sense, with significantly less complexity than the kernel mapping method.

This chapter develops pattern classification techniques for multivariate variables with strong nonlinear correlation and applies them to the fault identification of batch processes. Two kinds of pattern classification methods are given in this chapter: (1) kernel exponential discriminant analysis (KEDA), which addresses the nonlinear correlation among multiple variables at two levels, kernel mapping and exponential discrimination, and can significantly improve the classification accuracy compared with the traditional FDA method; (2) fusion methods based on manifold learning and discriminant analysis, for which two different fusion strategies, local linear exponential discriminant analysis (LLEDA) and neighborhood-preserving embedding discriminant analysis (NPEDA), are given. Here

<sup>©</sup> The Author(s) 2022

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_8

locally linear embedding (LLE) is a popular manifold learning algorithm. Both strategies combine the advantage of global discriminant analysis with local structure preservation. LLEDA is a parallel strategy that finds a trade-off projection vector between local geometric structure preservation and global data classification. NPEDA is a cascaded strategy whose dimensionality reduction is implemented in two serial steps. The two methods emphasize the intrinsic structure of the data while utilizing the global discriminant information, so they achieve better classification than the traditional EDA method. Finally, a hybrid fault diagnosis scheme is given for complex industrial processes, consisting of PCA-based fault detection, hierarchical clustering-based pre-diagnosis, and LLEDA-based final identification.

# **8.1 Fault Identification Based on Kernel Discriminant Exponent Analysis**

## *8.1.1 Methodology of KEDA*

The kernel exponent discriminant analysis (KEDA) is also a discriminative classification method, which aims to find a series of discriminant vectors that can transform the data into the kernel space and achieve the greatest separation between different types of data in the projection direction.

Consider the batch process data set with *I* batches, i.e.,

$$X(k) = \left[X^1(k), X^2(k), \dots, X^I(k)\right]^\top,$$

where $X^i$ consists of $n_i$, $i = 1, \dots, I$, row vectors, and each row vector is a sample vector $X_j^i(k)$, $j = 1, \dots, n_i$, acquired at time $k$ in batch $i$. According to the analysis of equations (7.1)–(7.9) in Sect. 7.1.1, the optimization function of kernel Fisher discriminant analysis (KFDA) is given as follows:

$$\begin{split} \max J(\boldsymbol{\alpha}) &= \frac{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} \mathbf{K}\_{b} \boldsymbol{\alpha})}{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} \mathbf{K}\_{w} \boldsymbol{\alpha})} \\ &= \frac{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{b} \boldsymbol{\Lambda}\_{b} \boldsymbol{V}\_{b}^{\mathrm{T}}) \boldsymbol{\alpha})}{\operatorname{tr}(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{w} \boldsymbol{\Lambda}\_{w} \boldsymbol{V}\_{w}^{\mathrm{T}}) \boldsymbol{\alpha})}, \end{split} \tag{8.1}$$

where $\mathbf{K}_b = V_b \Lambda_b V_b^{\mathrm{T}}$ and $\mathbf{K}_w = V_w \Lambda_w V_w^{\mathrm{T}}$ are the eigenvalue decompositions of the between-class and within-class scatter matrices, respectively. $\Lambda_b = \operatorname{diag}(\lambda_{b1}, \lambda_{b2}, \dots, \lambda_{bn})$ and $\Lambda_w = \operatorname{diag}(\lambda_{w1}, \lambda_{w2}, \dots, \lambda_{wn})$ are the eigenvalues, and $V_b = (v_{b1}, v_{b2}, \dots, v_{bn})$ and $V_w = (v_{w1}, v_{w2}, \dots, v_{wn})$ are the corresponding eigenvectors. The basic objective is to simultaneously maximize the between-class distance and minimize the within-class distance during the projection.

In order to improve the discrimination accuracy further, the discriminant function (8.1) is exponentiated:

$$\begin{split} \max J(\boldsymbol{\alpha}) &= \frac{tr(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{b} \exp(\boldsymbol{A}\_{b}) \boldsymbol{V}\_{b}^{\mathrm{T}}) \boldsymbol{\alpha})}{tr(\boldsymbol{\alpha}^{\mathrm{T}} (\boldsymbol{V}\_{w} \exp(\boldsymbol{A}\_{w}) \boldsymbol{V}\_{w}^{\mathrm{T}}) \boldsymbol{\alpha})} \\ &= \frac{tr(\boldsymbol{\alpha}^{\mathrm{T}} \exp(\boldsymbol{K}\_{b}) \boldsymbol{\alpha})}{tr(\boldsymbol{\alpha}^{\mathrm{T}} \exp(\boldsymbol{K}\_{w}) \boldsymbol{\alpha})} . \end{split} \tag{8.2}$$

The optimization problem (8.2) is transferred to the following generalized eigenvalue problem:

$$\exp(\mathbf{K}\_b)\boldsymbol{\alpha} = \boldsymbol{\Lambda}\exp(\mathbf{K}\_w)\boldsymbol{\alpha} \;\Leftrightarrow\; \exp(\mathbf{K}\_w)^{-1}\exp(\mathbf{K}\_b)\boldsymbol{\alpha} = \boldsymbol{\Lambda}\boldsymbol{\alpha}, \tag{8.3}$$

where $\Lambda$ is the eigenvalue and $\alpha$ is the corresponding eigenvector. The discriminant vectors are calculated from (8.3). Usually the first two, the optimal and suboptimal ones, are selected for dimensionality reduction.

The within-class and between-class scatter matrices are exponentiated in KEDA. By the general property of the exponential function, $e^x > x$ for any $x > 0$, the scatter matrices of KEDA are greater than those of KFDA, which means KEDA has better discriminatory capability than KFDA. Moreover, if the number of samples is less than the number of variables, the rank of the within-class scatter matrix is less than the dimension of the variables; the within-class scatter matrix is then singular and its inverse does not exist. But both the within-class and between-class scatter matrices are exponentiated in KEDA, and the exponentiated matrices must be full rank, so the singularity problem caused by small samples is solved. From this view, the KEDA method not only solves the small sample problem, but also efficiently classifies the sample data into different categories, which helps to improve the classification accuracy.
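A sketch of the KEDA eigenproblem (8.3): exponentiate the symmetric scatter matrices with `scipy.linalg.expm` and solve the resulting generalized eigenproblem. The toy matrices below are illustrative; note that a rank-deficient $K_w$ needs no regularization after exponentiation.

```python
import numpy as np
from scipy.linalg import expm, eigh

def keda_discriminant_vectors(Kb, Kw, n_vectors=2):
    # Solve exp(Kw)^{-1} exp(Kb) alpha = lam alpha via the symmetric
    # generalized eigenproblem exp(Kb) alpha = lam exp(Kw) alpha.
    lam, A = eigh(expm(Kb), expm(Kw))
    order = np.argsort(lam)[::-1]
    return A[:, order[:n_vectors]]

# Kw is singular (rank 1), yet exp(Kw) is full rank and positive definite
Kb = np.diag([3.0, 1.0, 0.2])
Kw = np.diag([1.0, 0.0, 0.0])
alphas = keda_discriminant_vectors(Kb, Kw)
```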

Consider the nonlinear mapping $\Phi(x_k^i)$ of the original sample $x_k^i$ and project it onto the optimal and suboptimal discriminant directions, respectively. Then the eigenvalue vector $T_i(k) = [T_{ik}^1, T_{ik}^2]^{\mathrm{T}}$ is obtained, whose components represent the projection values in the optimal and suboptimal discriminant directions. Usually, data in the same class show similar projection eigenvalues in the directions of the selected discriminant vectors. If the test data match a known fault class, they have large, clearly nonzero, projection eigenvalues under this model. If the test data do not match this class, the eigenvalues are small, even close to zero. It is unreliable to judge the data type simply from the magnitude of the eigenvalues, so the difference degree $D$ between two projection vectors $T_i(k)$ and $T_j(k)$ is defined as follows:

$$D\_{i,j}(k) = 1 - \frac{(\boldsymbol{T}\_i(k))^\mathrm{T} \boldsymbol{T}\_j(k)}{\|\boldsymbol{T}\_i(k)\|\_2 \left\|\boldsymbol{T}\_j(k)\right\|\_2}. \tag{8.4}$$

The smaller the difference degree $D$, the better the model matches.
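The difference degree (8.4) is simply one minus the cosine similarity of the two projection vectors; a minimal sketch:

```python
import numpy as np

def difference_degree(Ti, Tj):
    # D = 1 - cos(Ti, Tj): 0 for perfectly matched projection directions,
    # growing toward 2 as they point in opposite directions
    return 1.0 - float(Ti @ Tj) / (np.linalg.norm(Ti) * np.linalg.norm(Tj))
```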

The KEDA-based fault classification and identification process for batch process is given as follows:

**Step 1**: Data preprocessing. The three-dimensional data set $X(L \times J \times K)$ is batch-wise unfolded into the two-dimensional data $X(LK \times J)$, normalized along the time in the batch cycle, and variable-wise rearranged.

**Step 2**: Kernel projection. The original data $X$ is mapped to a high-dimensional feature space via a nonlinear kernel function, and the kernel sampling vectors $\xi_j^i = [K(x_1, x_j^i), K(x_2, x_j^i), \dots, K(x_n, x_j^i)]^{\mathrm{T}}$ are obtained.

**Step 3**: KEDA modeling. The optimal kernel discriminant vectors are solved from the discriminant function equation (8.3). Project the sample data $\xi_j^i$ onto the selected kernel discriminant vectors and calculate the corresponding eigenvalues $T_i(k)$.

**Step 4**: Test calculation. The test sample $x\_{j,\mathrm{new}}(k)$ is collected and the corresponding eigenvalues $T\_{i,\mathrm{new}}(k)$ are calculated according to each of the $S$ known class models.

**Step 5**: Fault identification. The class of the test data is determined by calculating the difference degree (8.4) between the test sample and the trained data.
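The difference degree in (8.4) amounts to one minus a cosine similarity. A minimal NumPy sketch (the function name and the handling of zero-norm projections are illustrative assumptions, not part of the original method):

```python
import numpy as np

def difference_degree(T_i, T_j):
    """Difference degree D between two projection-value vectors, Eq. (8.4):
    one minus their cosine similarity."""
    denom = np.linalg.norm(T_i) * np.linalg.norm(T_j)
    if denom == 0:
        # Zero projections leave D undefined (reported as "-" in Table 8.3).
        return None
    return 1.0 - float(np.dot(T_i, T_j)) / denom

# Collinear projections match (D near 0); orthogonal projections do not (D = 1).
d_match = difference_degree(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
d_mismatch = difference_degree(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In Step 5 the test sample would be assigned to the class whose model yields the smallest defined difference degree.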

## *8.1.2 Simulation Experiment*

The proposed KEDA was used for fault identification in the penicillin fermentation process mentioned in Sect. 4.2. Here nine process variables were considered for monitoring, and the three faults are shown in Table 8.1. The data were generated by the penicillin simulator while changing the amplitude and time of the faults. A total of 40 batches were selected as the training data set: 10 batches each for the normal condition and the three known faults. The KEDA method with a Gaussian kernel function was used to find the optimal discriminant vectors for each type of model, and four different models were obtained.

**Experiment 1: Data classification** Figures 8.1, 8.2, 8.3, and 8.4 show the classification comparison of KFDA and KEDA for penicillin data: normal data and three types of fault data. When the test data differ from the known four types, the projections are also separated from each other. But KFDA shows weaker classification performance: some faults lie closer together and the boundaries are not easily distinguishable, such as the fault 3 data (red) and the test fault data (black) in Figs. 8.1 and 8.3. However, KEDA classifies these data better, and


**Table 8.1** Description of the fault type of penicillin process

**Fig. 8.1** Two-dimensional classification visualization: KFDA method

the red and black parts are clearly separated in Figs. 8.2 and 8.4. These plots show that both the between-class and within-class distances have increased for the different types of data in KEDA, but the between-class distance has increased by a larger magnitude than the within-class distance, so the different types of data can be better separated.

**Experiment 2: Fault-type identification** Let's consider the testing data set, which also consists of the four types of data and one unknown fault. Table 8.2 gives the eigenvalues of the four testing data sets calculated based on the KEDA model of fault 2. The eigenvalues are obtained by projecting the testing data onto the selected optimal discriminant directions. If there is a large difference between the testing data and the training data, then $\|u - v\|^2$ is large and the Gaussian kernel function, $K(u, v) = \exp(-\|u - v\|^2/(2\sigma)^2)$, is almost zero. However, sometimes the eigenvalues at fault occurrence are not close to zero, as shown in Table 8.2. In this case, the eigenvalues of the test data need to be analyzed further.
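The decay of the Gaussian kernel for mismatched data can be sketched as follows (a minimal illustration assuming the kernel form above; the function name and sample values are hypothetical):

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel K(u, v) = exp(-||u - v||^2 / (2*sigma)^2)."""
    d2 = float(np.sum((np.asarray(u) - np.asarray(v)) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma) ** 2))

# A test sample close to the training data gives a kernel value near 1;
# a distant (mismatched) sample drives the kernel value, and hence the
# projection eigenvalue, toward zero.
near = gaussian_kernel([0.0, 0.0], [0.1, 0.1])    # close to 1
far = gaussian_kernel([0.0, 0.0], [10.0, 10.0])   # close to 0
```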

It is impossible to show the values at every sampling instant, so we further analyze the statistical characteristics of the eigenvalues projected onto the optimal discriminant direction of a known model. If the eigenvalue of the testing data follows a normal distribution under a model, the testing data belong to this model. Conversely, if the eigenvalue does not follow a normal distribution, the testing data do not match this model. Figures 8.5, 8.6, and 8.7 give the statistical analysis of the testing data (normal, faults 1 and 3) under the known fault 3 model. The eigenvalue of fault 3 follows a normal distribution in the fault 3 model, while the normal data and the fault 1 data do not.

**Fig. 8.2** Two-dimensional classification visualization: KEDA method

**Fig. 8.3** Three-dimensional classification visualization: KFDA method

**Fig. 8.4** Three-dimensional classification visualization: KEDA method


**Table 8.2** The eigenvalues of test data in fault 2 model

Moreover, the difference degree between the test data and the known models is used to determine the type of fault. The results are shown in Table 8.3. Since some of the test data have zero eigenvalues in the known model, making the denominator in definition (8.4) zero, the difference degree cannot be calculated and is expressed as "–". The difference degree is small if the test data belong to the known model type, and large if they do not. It is found that the test data have the smallest difference degree under the matching model.

**Fig. 8.5** The eigenvalues of test normal data in fault 3 model

**Fig. 8.6** The eigenvalues of test fault 1 data in fault 3 model

**Table 8.3** The difference degree of test data in different models


## **8.2 Fault Identification Based on LLE and EDA**

A new dimensionality reduction approach based on the combination of EDA and LLE is proposed with two different combination schemes, Local Linear Exponential Discriminant Analysis (LLEDA) and Neighborhood-Preserving Embedding Discriminant Analysis (NPEDA). This fusion idea combines global discriminant analysis with local structure preservation during the dimensionality reduction process. LLEDA and NPEDA are solved via different optimization objectives, and the corresponding maxima are derived to reduce the computational complexity. Both exhibit good local preservation and global discrimination capabilities. The nonlinear analysis is transformed into an equivalent neighborhood-preserving problem based on the idea of piecewise linearization.

The main difference between the two methods is that LLEDA is a parallel strategy whereas NPEDA is a cascading strategy. LLEDA focuses on the global supervised discrimination balanced with local nonlinear dimensionality reduction. It finds a balanced projection vector between the local geometry and the data classification and results in an optimal subspace projection of the samples. When faults are difficult to distinguish, LLEDA method can improve the identification rate by adjusting the trade-off parameter between the global index and the local index. NPEDA is a cascading strategy where the dimensionality reduction process is implemented in two successive steps: the first aims at maintaining the local geometric relationships and reconstructing each sample point using a linear weighted combination of nearest neighbors, the second at performing discriminant analysis on the reconstructed sample.

# *8.2.1 Local Linear Exponential Discriminant Analysis*

The basic idea of LLEDA is to project the samples into the optimal discriminant space while maintaining the local geometric structure of the original data. The schematic diagram is shown in Fig. 8.8. LLEDA combines the advantages of LLE and EDA, which extracts the global classification information while compressing the dimensionality of the feature space without destroying local relationships. It finds a balance between global supervised discrimination and local preservation of nonlinearity through an adjusted trade-off parameter.

Consider the original data being mapped into a hidden space $F$ via the transformation $A$. An explicit linear mapping from $X$ to $Y$, $Y = A^{\mathrm{T}}X$, is constructed to circumvent the out-of-sample problem. The original LLE problem is written as follows:

$$\begin{split} \min \varepsilon(\mathbf{Y}) &= \sum\_{j=1}^{n} \left| \mathbf{y}\_{j} - \sum\_{r=1}^{k} W\_{jr} \mathbf{y}\_{jr} \right|^{2} = \| \mathbf{Y}(I - \mathbf{W}) \|^{2} \\ &= tr(\mathbf{Y}(I - \mathbf{W})(I - \mathbf{W})^{\mathrm{T}} \mathbf{Y}^{\mathrm{T}}) \\ &= tr(A^{\mathrm{T}} X \mathbf{M} \mathbf{X}^{\mathrm{T}} A). \end{split} \tag{8.5}$$

The LLEDA problem is proposed with the following objective function:

$$\max J(A) = \frac{\operatorname{tr}\left(\mathbf{A}^{\mathrm{T}} \exp(\mathbf{S}\_{b})\mathbf{A}\right)}{\operatorname{tr}\left(\mathbf{A}^{\mathrm{T}} \exp(\mathbf{S}\_{w})\mathbf{A}\right)} - \mu \cdot \operatorname{tr}\left(\mathbf{A}^{\mathrm{T}} \mathbf{X} \mathbf{M} \mathbf{X}^{\mathrm{T}} \mathbf{A}\right), \tag{8.6}$$

where μ is a trade-off parameter that balances the intrinsic geometry and global discriminant information. In general, (8.6) is equivalently transformed into an optimization problem with constraint,

$$\begin{aligned} \max J(A) &= tr\left(A^{\mathsf{T}} \exp(\mathbf{S}\_b) A\right) - \mu \cdot tr(A^{\mathsf{T}} X M X^{\mathsf{T}} A) \\ \text{s.t.} \quad A^{\mathsf{T}} \exp(\mathbf{S}\_w) A &= I, \end{aligned} \tag{8.7}$$

where $A = [\mathbf{a}\_1, \mathbf{a}\_2, \ldots, \mathbf{a}\_n]$. Problem (8.7) is solved by introducing a Lagrange multiplier:

$$L\_1(a\_i) = \boldsymbol{a}\_i^\mathsf{T} \left( \exp(\mathbf{S}\_b) - \mu \mathbf{X} \mathbf{M} \mathbf{X}^\mathsf{T} \right) \boldsymbol{a}\_i + \theta (1 - \boldsymbol{a}\_i^\mathsf{T} \exp(\mathbf{S}\_w) \boldsymbol{a}\_i), \qquad (8.8)$$

**Fig. 8.8** The schematic diagram of LLEDA

where $\theta$ is the Lagrange multiplier. Setting the gradient of $L\_1(\mathbf{a}\_i)$ with respect to $\mathbf{a}\_i$ to zero, we have

$$\begin{aligned} (\exp(\mathbf{S}\_b) - \mu \mathbf{X} \mathbf{M} \mathbf{X}^{\mathrm{T}}) \mathbf{a}\_i &= \theta \exp(\mathbf{S}\_w) \mathbf{a}\_i \\ \exp(\mathbf{S}\_w)^{-1} (\exp(\mathbf{S}\_b) - \mu \mathbf{X} \mathbf{M} \mathbf{X}^{\mathrm{T}}) \mathbf{a}\_i &= \theta \mathbf{a}\_i, \end{aligned} \tag{8.9}$$

where $\theta$ is treated as a generalized eigenvalue. The discriminant matrix $A$ is made up of the eigenvectors corresponding to the first $d$ largest eigenvalues in (8.9).
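The eigenproblem (8.9) could be sketched numerically as below (assumptions: NumPy only, symmetric scatter matrices so the matrix exponential can be taken via eigendecomposition, and toy data; all names are illustrative, not the authors' implementation):

```python
import numpy as np

def sym_expm(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def lleda_projection(Sb, Sw, X, M, mu=0.1, d=2):
    """Sketch of (8.9): the columns of A are the eigenvectors of
    exp(Sw)^{-1} (exp(Sb) - mu * X M X^T) for the d largest eigenvalues."""
    mat = np.linalg.inv(sym_expm(Sw)) @ (sym_expm(Sb) - mu * X @ M @ X.T)
    vals, vecs = np.linalg.eig(mat)
    order = np.argsort(vals.real)[::-1]   # sort by decreasing eigenvalue
    return vecs[:, order[:d]].real

# Toy scatter matrices for a 4-variable problem (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 10))
M = np.eye(10)
A = lleda_projection(np.diag([2.0, 1.0, 0.5, 0.1]), np.eye(4), X, M,
                     mu=0.01, d=2)
print(A.shape)  # (4, 2)
```

In practice the scatter matrices and the LLE alignment matrix $M$ would come from the training data as derived above.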

# *8.2.2 Neighborhood-Preserving Embedding Discriminant Analysis*

NPEDA also seeks a series of discriminant vectors and maps the samples into a new space. Each sample point is represented linearly by its neighbors to maintain the local geometry as much as possible during the projection process. The schematic diagram is shown in Fig. 8.9. NPEDA is a cascade strategy in which the dimensionality reduction process is divided into two successive steps: the first maintains the local geometric relationships by reconstructing each sample point from a linearly weighted combination of its neighbors, and the second performs discriminant analysis on the reconstructed samples.

Rewrite the between-class scatter matrix $\mathbf{S}\_b$ and the within-class scatter matrix $\mathbf{S}\_w$ under the explicit linear mapping $Y = A^{\mathrm{T}}X$:

**Fig. 8.9** The schematic diagram of NPEDA

$$\begin{split} \mathbf{S}\_{b} &= \sum\_{i=1}^{c} n\_{i} (\bar{\mathbf{y}}^{i} - \bar{\mathbf{y}})^{2} = \sum\_{i=1}^{c} n\_{i} \left( A^{\mathrm{T}} \bar{\mathbf{x}}^{i} - A^{\mathrm{T}} \bar{\mathbf{x}} \right)^{2} \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} n\_{i} (\bar{\mathbf{x}}^{i} - \bar{\mathbf{x}}) (\bar{\mathbf{x}}^{i} - \bar{\mathbf{x}})^{\mathrm{T}} \right) A \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} \frac{1}{n\_{i}} (\mathbf{x}^{i}\_{1} + \dots + \mathbf{x}^{i}\_{n\_i}) (\mathbf{x}^{i}\_{1} + \dots + \mathbf{x}^{i}\_{n\_i})^{\mathrm{T}} - 2n \bar{\mathbf{x}} \bar{\mathbf{x}}^{\mathrm{T}} + n \bar{\mathbf{x}} \bar{\mathbf{x}}^{\mathrm{T}} \right) A \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} \sum\_{j,k=1}^{n\_{i}} \frac{1}{n\_{i}} \mathbf{x}^{i}\_{j} \mathbf{x}^{i\mathrm{T}}\_{k} - n \bar{\mathbf{x}} \bar{\mathbf{x}}^{\mathrm{T}} \right) A \\ &= A^{\mathrm{T}} \left( X B X^{\mathrm{T}} - n \bar{\mathbf{x}} \bar{\mathbf{x}}^{\mathrm{T}} \right) A \\ &= A^{\mathrm{T}} X \left( B - \frac{1}{n} \mathbf{e} \mathbf{e}^{\mathrm{T}} \right) X^{\mathrm{T}} A, \end{split} \tag{8.10}$$

where $\bar{\mathbf{x}}^i = \frac{1}{n\_i}\sum\_{j=1}^{n\_i} \mathbf{x}\_j^i$, $\bar{\mathbf{y}} = \frac{\sum\_{i=1}^{c} n\_i \bar{\mathbf{y}}^i}{\sum\_{i=1}^{c} n\_i}$, $\bar{\mathbf{x}} = \frac{\sum\_{i=1}^{c} n\_i \bar{\mathbf{x}}^i}{\sum\_{i=1}^{c} n\_i} = \frac{1}{n}\sum\_{i=1}^{c} n\_i \bar{\mathbf{x}}^i$; $\mathbf{e} = [1, 1, \ldots, 1]^{\mathrm{T}}$ with dimension $n$, and

$$B\_{ij} = \begin{cases} \frac{1}{n\_k} & \mathbf{x}\_i \text{ and } \mathbf{x}\_j \in k\text{-th class.}\\ 0 & \text{otherwise.} \end{cases}$$

$$\begin{split} \mathbf{S}\_{w} &= \sum\_{i=1}^{c} \sum\_{j=1}^{n\_{i}} (\mathbf{y}\_{j}^{i} - \bar{\mathbf{y}}^{i})^{2} = \sum\_{i=1}^{c} \sum\_{j=1}^{n\_{i}} \left( A^{\mathrm{T}} \mathbf{x}\_{j}^{i} - A^{\mathrm{T}} \bar{\mathbf{x}}^{i} \right)^{2} \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} \left( \sum\_{j=1}^{n\_{i}} (\mathbf{x}\_{j}^{i} - \bar{\mathbf{x}}^{i})(\mathbf{x}\_{j}^{i} - \bar{\mathbf{x}}^{i})^{\mathrm{T}} \right) \right) A \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} \left( \sum\_{j=1}^{n\_{i}} \mathbf{x}\_{j}^{i} \mathbf{x}\_{j}^{i\mathrm{T}} - n\_{i} \bar{\mathbf{x}}^{i} \bar{\mathbf{x}}^{i\mathrm{T}} \right) \right) A \\ &= A^{\mathrm{T}} \left( \sum\_{i=1}^{c} \left( X\_{i} X\_{i}^{\mathrm{T}} - \frac{1}{n\_{i}} X\_{i} \mathbf{e}\_{i} \mathbf{e}\_{i}^{\mathrm{T}} X\_{i}^{\mathrm{T}} \right) \right) A \\ &= A^{\mathrm{T}} \sum\_{i=1}^{c} (X\_{i} L\_{i} X\_{i}^{\mathrm{T}}) A, \end{split} \tag{8.11}$$

where $L\_i = I - \frac{1}{n\_i} \mathbf{e}\_i \mathbf{e}\_i^{\mathrm{T}}$, $I$ is the identity matrix, and $\mathbf{e}\_i = [1, 1, \ldots, 1]^{\mathrm{T}}$ with dimension $n\_i$. The discriminant vectors $A^\*$ are solved from the following optimization problem:

$$A^\* = \arg\max \frac{\left| A^\mathrm{T} X (\mathcal{B} - \frac{1}{n} \mathbf{e} \mathbf{e}^\mathrm{T}) X^\mathrm{T} A \right|}{\left| A^\mathrm{T} \sum\_{i=1}^c (X\_i L\_i X\_i^\mathrm{T}) A \right|}. \tag{8.12}$$

Consider the original data being reconstructed by their neighbors with error less than $\varepsilon$:

$$\sum\_{j=1}^{n} \parallel \mathbf{x}\_j - \sum\_{r=1}^{k} \mathbf{W}\_{jr} \mathbf{x}\_{jr} \parallel^2 < \varepsilon,$$

where $\varepsilon$ is a small positive number and $\mathbf{W}$ is the reconstruction weight matrix such that $\sum\_{r=1}^{k} W\_{ir} = 1$. Then

$$\left\|\mathbf{x}\_i - \sum\_{r=1}^k \mathbf{W}\_{ir}\mathbf{x}\_{ir}\right\|^2 = \left\|\sum\_{r=1}^k (\mathbf{W}\_{ir}\mathbf{x}\_i - \mathbf{W}\_{ir}\mathbf{x}\_{ir})\right\|^2 = \left\|\mathcal{Q}\_i\mathbf{W}\_i\right\|^2,$$

where *Q<sup>i</sup>* = [*x<sup>i</sup>* − *xi*1, *x<sup>i</sup>* − *xi*2,..., *x<sup>i</sup>* − *xir*].

Matrix $\mathbf{W}$ can be solved by the Lagrange multiplier method:

$$\begin{aligned} L\_2 &= \frac{1}{2} \left\| \mathcal{Q}\_i \mathbf{W}\_i \right\|^2 - \lambda\_i \left[ \sum\_{r=1}^k \mathbf{W}\_{ir} - 1 \right] \\ \frac{\partial L\_2}{\partial \mathbf{W}\_i} &= \mathcal{Q}\_i^\mathrm{T} \mathcal{Q}\_i \mathbf{W}\_i - \lambda\_i E = \mathcal{C}\_i \mathbf{W}\_i - \lambda\_i E = 0, \end{aligned}$$

which gives $\mathbf{W}\_i = \lambda\_i \mathcal{C}\_i^{-1} \mathbf{E}$, where $\mathcal{C}\_i = \mathcal{Q}\_i^{\mathrm{T}} \mathcal{Q}\_i$ and $\mathbf{E} = [1, 1, \ldots, 1]^{\mathrm{T}}$ with dimension $k$. Considering

$$\sum\_{r=1}^{k} \mathbf{W}\_{ir} = \mathbf{E}^{\mathrm{T}} \mathbf{W}\_i = 1 \Longrightarrow \mathbf{E}^{\mathrm{T}} \lambda\_i \mathbf{C}\_i^{-1} \mathbf{E} = 1 \Longrightarrow \lambda\_i = (\mathbf{E}^{\mathrm{T}} \mathbf{C}\_i^{-1} \mathbf{E})^{-1},$$

we have

$$\boldsymbol{W}\_{i} = \lambda\_{i}\boldsymbol{\mathcal{C}}\_{i}^{-1}\boldsymbol{E} = \frac{\boldsymbol{\mathcal{C}}\_{i}^{-1}\boldsymbol{E}}{\boldsymbol{E}^{\top}\boldsymbol{\mathcal{C}}\_{i}^{-1}\boldsymbol{E}}.$$
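The closed-form weights above might be computed as in this sketch (the small regularization term is an added assumption to keep $\mathcal{C}\_i$ invertible for degenerate neighborhoods; names and data are illustrative):

```python
import numpy as np

def reconstruction_weights(x_i, neighbors, reg=1e-3):
    """Closed-form LLE weights W_i = C_i^{-1} E / (E^T C_i^{-1} E), with
    C_i = Q_i^T Q_i and Q_i = [x_i - x_{i1}, ..., x_i - x_{ik}]."""
    Q = x_i[:, None] - neighbors        # columns are x_i - x_{ir}
    k = neighbors.shape[1]
    C = Q.T @ Q + reg * np.eye(k)       # regularized local Gram matrix
    w = np.linalg.solve(C, np.ones(k))  # proportional to C^{-1} E
    return w / w.sum()                  # enforce sum_r W_ir = 1

x = np.array([0.0, 0.0])
nb = np.array([[1.0, -1.0, 0.0],
               [0.0, 0.0, 1.0]])        # three neighbors as columns
w = reconstruction_weights(x, nb)
```

The normalization in the last line of the function implements the constraint $\sum\_{r=1}^{k} W\_{ir} = 1$ directly rather than via the multiplier $\lambda\_i$; the two are algebraically equivalent.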

The sample point is reconstructed by the optimal weights $\mathbf{W}$, i.e., $\mathbf{x}\_j = \sum\_{r=1}^{k} W\_{jr} \mathbf{x}\_{jr}$. It is linearly represented by its neighbors, which maintains the local geometry during the dimensionality reduction process. Substituting it into (8.12), the NPEDA optimization is revised as follows:

$$\begin{split} A^\* &= \arg\max\_{A} \frac{\left| A^\top \exp\left( (\sum\_{r=1}^k W\_{ir} \mathbf{x}\_{ir}) (\mathbf{B} - \frac{1}{n} \mathbf{e} \mathbf{e}^\top) (\sum\_{r=1}^k W\_{ir} \mathbf{x}\_{ir})^\top \right) A \right|}{\left| A^\top \exp\left( \sum\_{i=1}^c (\sum\_{r=1}^k W\_{jr} \mathbf{X}\_{jr}^i) \mathbf{L}\_i (\sum\_{r=1}^k W\_{jr} \mathbf{X}\_{jr}^i)^\top \right) A \right|} \\ &= \arg\max\_A \frac{\left| A^\top \exp(\mathbf{S}\_{nb}) A \right|}{\left| A^\top \exp(\mathbf{S}\_{nw}) A \right|}. \end{split} \tag{8.13}$$

Equation (8.13) is equivalent to solving the maximum eigenvalues of the generalized eigenvalue decomposition problem:

$$\begin{aligned} \exp(\mathbf{S}\_{nb})A &= \sigma \exp(\mathbf{S}\_{nw})A\\ \text{or} \\ \exp(\mathbf{S}\_{nw})^{-1} \exp(\mathbf{S}\_{nb})A &= \sigma A, \end{aligned} \tag{8.14}$$

where $\sigma$ is the generalized eigenvalue, and the linear transformation matrix $A$ of NPEDA consists of the eigenvectors corresponding to the first $d$ largest eigenvalues of $\exp(\mathbf{S}\_{nw})^{-1} \exp(\mathbf{S}\_{nb})$.

# *8.2.3 Fault Identification Based on LLEDA and NPEDA*

In this section, the LLEDA and NPEDA methods are implemented for fault identification; the monitoring flowchart is shown in Fig. 8.10. The fault recognition rate (FCR) is introduced to test the identification effectiveness. The FCR of fault model *i* is defined as the percentage of test samples identified as this fault out of the total number of test samples:

$$\text{FCR}(i) = \frac{n\_{i,identify}}{n\_{all}} \times 100\%,\tag{8.15}$$

where $n\_{i,\mathrm{identify}}$ denotes the number of samples identified as fault $i$ and $n\_{all}$ denotes the total number of samples of fault $i$. The identification process is given as follows,


$$\begin{split} g(\mathbf{x}) &= -\frac{1}{2} (\mathbf{x} - \bar{\mathbf{x}}^i)^\mathrm{T} A \left( \frac{1}{n\_i - 1} A^\mathrm{T} \exp(S\_w^i) A \right)^{-1} A^\mathrm{T} (\mathbf{x} - \bar{\mathbf{x}}^i) \\ &+ \ln(c) - \frac{1}{2} \ln \left[ \det \left( \frac{1}{n\_i - 1} A^\mathrm{T} \exp(S\_w^i) A \right) \right]. \end{split} \tag{8.16}$$

If the value of the discriminant function exceeds the normal limitation, a fault occurs.

5. The fault type of online data can be determined when its posterior probability value is maximum. The posterior probability of data *x* in fault *c<sup>i</sup>* class is calculated as

**Fig. 8.10** Flowchart of fault identification with LLEDA and NPEDA methods

$$P(\mathbf{x} \in \mathbf{c}\_i | \mathbf{x}) = \frac{P(\mathbf{x} | \mathbf{x} \in \mathbf{c}\_i) P(\mathbf{x} \in \mathbf{c}\_i)}{\sum\_{i=1}^c P(\mathbf{x} | \mathbf{x} \in \mathbf{c}\_i) P(\mathbf{x} \in \mathbf{c}\_i)},\tag{8.17}$$

where *P*(*x* ∈ *ci*) is the prior probability and *P*(*x*|*x* ∈ *ci*) is the conditional probability density function of the sample *x*:

$$P(\mathbf{x}|\mathbf{x}\in\mathbf{c}\_{i}) = \frac{\exp[-\frac{1}{2}(\mathbf{x}-\bar{\mathbf{x}}^{i})^{\mathrm{T}}\boldsymbol{A}\boldsymbol{P}\_{q}\boldsymbol{A}^{\mathrm{T}}(\mathbf{x}-\bar{\mathbf{x}}^{i})]}{(2\pi)^{\frac{q}{2}}[\frac{1}{n\_{i}-1}\boldsymbol{A}^{\mathrm{T}}(\sum\_{\mathbf{x}\in\mathbf{c}\_{i}}(\mathbf{x}-\bar{\mathbf{x}}^{i})(\mathbf{x}-\bar{\mathbf{x}}^{i})^{\mathrm{T}})\boldsymbol{A}]^{\frac{1}{2}}},\qquad(8.18)$$

where $P\_q = \left[\frac{1}{n\_i-1} \boldsymbol{A}^{\mathrm{T}}\left(\sum\_{\mathbf{x}\in\mathbf{c}\_i}(\mathbf{x}-\bar{\mathbf{x}}^i)(\mathbf{x}-\bar{\mathbf{x}}^i)^{\mathrm{T}}\right)\boldsymbol{A}\right]^{-1}$.
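The posterior rule (8.17) reduces to normalizing the likelihood-prior products over the classes. A minimal sketch (the density values are hypothetical placeholders, not outputs of (8.18)):

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes posterior of Eq. (8.17): P(x in c_i | x) is proportional to
    P(x | x in c_i) * P(x in c_i), normalized over all classes."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

# Hypothetical class-conditional densities for three fault classes with
# equal priors; the sample is assigned to the class of maximum posterior.
p = posterior([0.02, 0.70, 0.05], [1 / 3, 1 / 3, 1 / 3])
print(int(p.argmax()))  # 1
```

With equal priors the decision reduces to picking the largest class-conditional density, matching step 5 above.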

## *8.2.4 Simulation Experiment*

Five multi-classification methods, FDA, EDA, LLE+FDA, LLEDA, and NPEDA, were carried out to evaluate the classification performance on the TE simulation platform. The TE operation lasted for 48 h, with faults occurring in the 8th hour and samples taken every 3 min. 400 training samples were selected for building the classification model and 400 testing samples for evaluating its performance. Three different types of faults were considered: faults 2, 8, and 13. Fault 2 refers to a step change in the B component feed with the *A*-*C* feed ratio remaining constant. Fault 8 refers to a random change in the A, B, and C feed component variables. Fault 13 refers to a slow drift in the reaction dynamics. Here faults 8 and 13 are difficult to identify due to their random variation and slow drift, respectively. The training and testing data for the three fault types were projected onto the first and second eigenvectors by the different methods, and the classification results are shown in Fig. 8.11.

Table 8.4 shows the identification rates for faults 2, 8, and 13 under the different classification methods. Here the number of discriminant directions, i.e., the reduction order, ranges from 1 to 10. The identification rates improve as the number of discriminant vectors increases. The recognition rate for fault 2 is high, almost 100%. The recognition rates for faults 8 and 13 gradually increase with the number of discriminant vectors. NPEDA and LLEDA show higher recognition rates on faults 2, 8, and 13 than the other methods, such as FDA and LLE+EDA.

Figure 8.12 shows the posterior probability values for the different test data under the LLEDA and NPEDA methods. A larger posterior probability value means a higher possibility that the test data belong to that category. Furthermore, the diagnostic results are related to the classification capability: good classification performance yields a higher identification rate.

## **8.3 Cluster-LLEDA-Based Hybrid Fault Monitoring**

## *8.3.1 Hybrid Monitoring Strategy*

Generally, the data collected from an actual industrial process are unlabeled and initially undiagnosed. It is worth noting that the LLEDA method performs well in fault identification, but it is a supervised algorithm that requires known class labels for the historical data set. To overcome this problem, the supervised LLEDA method is extended into an unsupervised learning method by introducing cluster analysis. The clustering method obtains the fault data category information, which is input to the LLEDA modeling module as a prior. To make better use of the proposed cluster-LLEDA classification method, a hybrid fault monitoring strategy is given, as shown in Fig. 8.13.

Figure 8.13 indicates that the hybrid fault monitoring strategy is mainly divided into three parts: **historical data analysis**, **fault model library establishment**, and **online detection and fault identification**. First, the historical data of the industrial process are coarsely screened by PCA to label the fault data. Then the hierarchical clustering technique classifies the data detected as faulty into different types. The model library is established for all fault types by LLEDA, which further extracts the fault features and obtains fine-grained identification. Finally, online detection and fault identification are realized.

The procedure of historical data analysis part is summarized as follows:

**Fig. 8.11** Projection of different fault data on the first two feature vectors


**Table 8.4** Comparison of identification rate for faults 2, 8, and 13


**Fig. 8.12** Diagnosis results of faults 2, 8, and 13 by LLEDA and NPEDA methods

**Fig. 8.13** Hybrid fault detection and diagnosis information process

The procedure of fault model library establishment is summarized as follows:


The procedure of online detection and fault identification is summarized as follows:


**Clustering Analysis** The hierarchical clustering algorithm is widely used and has the advantages of simple calculation, speed, and ease of obtaining consistent results, without the number of clusters needing to be known in advance (Saxena et al. 2017). The clustering starts with $n$ samples, each as its own class, and specifies the distance between samples and between classes. Then the two closest classes are merged into a new class, and the distances between the new class and the other classes are calculated. The merging of the two closest classes is repeated, with the number of classes reduced by one after each merge. Merging stops when all samples are merged into one class or a certain condition is met.

The class is denoted by $G$ in the cluster analysis. Suppose class $G$ has $m$ samples denoted by the column vectors $\mathbf{x}\_i$ $(i = 1, 2, \ldots, m)$, $d\_{ij}$ is the distance between $\mathbf{x}\_i$ and $\mathbf{x}\_j$, and $D\_{KL}$ is the distance between two different classes $G\_K$ and $G\_L$. The squared distance $D\_{KL}^2$ between $G\_K$ and $G\_L$ is defined as follows:

$$D\_{KL}^2 = \frac{1}{n\_K n\_L} \Sigma\_{\mathbf{x}\_i \in G\_K, \mathbf{x}\_j \in G\_L} d\_{ij}^2. \tag{8.19}$$

The recursive formula for the between-class squared distance, after merging classes $G\_K$ and $G\_L$ into $G\_M$ (with $G\_J$ any other class), is

$$D\_{MJ}^2 = \frac{n\_K}{n\_M} D\_{KJ}^2 + \frac{n\_L}{n\_M} D\_{LJ}^2. \tag{8.20}$$

The inconsistency coefficient matrix $Y$ is used to determine the final number of clusters $c$. Here $Y$ is an $(n-1) \times 4$ matrix, in which the first column is the mean of the link lengths (i.e., merging-class distances) involved, the second column is the standard deviation of the related link lengths, the third column is the number of related links, and the fourth column is the inconsistency coefficient.

For the links obtained by the *k*th merging class, the inconsistency coefficient is calculated as follows:

$$Y(k,4) = \frac{(Z(k,3) - Y(k,1))}{(Y(k,2))},\tag{8.21}$$

where the input $Z\_{(n-1)\times 3}$ is the matrix of the hierarchical clustering tree. Under the condition that the number of classes is kept as small as possible, the change of the inconsistency coefficient determines the final number of classes.
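Assuming SciPy is available, the clustering-tree matrix, the $(n-1)\times 4$ inconsistency matrix, and the final cluster labels can be obtained as in this sketch (the data, the average-linkage choice echoing (8.19), and the cluster-count threshold are all illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, inconsistent, linkage

# Two well-separated groups of 2-D samples (illustrative data only).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
                  rng.normal(5.0, 0.1, (10, 2))])

Z = linkage(data, method="average")  # (n-1) x 4 clustering-tree matrix
Y = inconsistent(Z)                  # columns: mean, std, count, coefficient
labels = fcluster(Z, t=2, criterion="maxclust")
print(len(set(labels)))              # 2
```

In the text, the jump in the fourth column of `Y` (the inconsistency coefficient) is what determines the final number of classes rather than a fixed `maxclust` value.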

## *8.3.2 Simulation Study*

The experiment uses the Tennessee Eastman (TE) process to evaluate the effectiveness of the proposed hybrid method.

**Experiment 1: Failure Initial Screening and Classification** The TE data set was first screened by the PCA method, and the fault detection results are shown in Fig. 8.14; the final T<sup>2</sup> and SPE statistics obtained were 0.4951 and 0.6882, respectively. The specific detection results are listed in Table 8.5. The results show that the recognition rates of faults 1, 2, 6, 7, 8, 12, 13, 14, 17, and 18 are high, while the recognition rates of the other faults are low. This indicates that the significant faults can be detected, while the potential faults cannot.

Therefore, PCA-based fault detection can only coarsely split the data set and detect significant faults. Potential faults can be identified with a high identification rate only when the fault categories are known. In the coarse separation stage of historical data, the fault data can be identified not only by the PCA method but also by improved PCA or other fault detection methods to further improve the identification rate.

After the historical data analysis, the fault data set is collected and clustered into different fault classes by the hierarchical clustering method. According to the inconsistency coefficient, the final number of fault classes is 10. As there are many fault types, it is difficult to display all the classified fault data together in one tree diagram. As an example, we select faults 1, 2, and 6 to demonstrate the clustering effect of the hierarchical cluster analysis algorithm. Fault 1 is a step change in the A/C feed ratio with component B remaining unchanged, while fault 2 is a step change

**Fig. 8.14** Fault detection based on PCA


**Table 8.5** Fault recognition rate based on PCA

**Fig. 8.15** Hierarchical cluster analysis

in component B with the A/C ratio remaining unchanged. Fault 6 is a step loss of the A feed. The hierarchical clustering tree diagram is given in Fig. 8.15. The final number of categories is three according to the inconsistency coefficient, which is consistent with the actual classification.

Now the fault data have been divided into 10 classes by hierarchical cluster analysis. Obviously, the dimension is high and direct visualization is poor. To improve the visualization while showing the trends of, and interrelationships between, the variables, the parallel coordinate visualization method

**Fig. 8.16** Parallel coordinate visualization of fault data

**Fig. 8.16** (continued)

is selected. It is a visualization technique in which high-dimensional variables are represented by a series of mutually parallel axes, with each variable's value corresponding to a position on its axis.

The visualization results for each type of fault data are shown in Fig. 8.16. The blue dashes in each subplot indicate the normal data and the dashes in other colors indicate different fault data. Since each variable in the TE data has a corresponding physical meaning, the type of fault can be judged by comparing the colored dashes with the blue dashes for each variable. These faults can then be labeled for establishing the fault model library.
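A parallel coordinate plot of the kind shown in Fig. 8.16 can be sketched with pandas (assuming pandas and Matplotlib are available; the data, column names, colors, and file name are all hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure can be saved to file
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical normal and fault samples over four process variables.
rng = np.random.default_rng(2)
cols = ["v1", "v2", "v3", "v4"]
normal = pd.DataFrame(rng.normal(0.0, 1.0, (20, 4)), columns=cols)
fault = pd.DataFrame(rng.normal(2.0, 1.0, (20, 4)), columns=cols)
normal["class"], fault["class"] = "normal", "fault"
df = pd.concat([normal, fault], ignore_index=True)

# One polyline per sample across the parallel variable axes.
parallel_coordinates(df, "class", color=["tab:blue", "tab:red"])
plt.savefig("parallel_coords.png")
```

Each sample becomes one polyline, so a fault class whose lines deviate from the normal (blue) band on a particular axis points to the affected variable.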

**Experiment 2: LLEDA-based Fault Identification** The fault identification method used here is LLEDA, which increases the distance between different classes and improves the classification ability even when the fault samples are few. Here faults 4, 8, and 13 are selected as examples to show the identification results. Fault 4 is a minor fault, manifested as a step change in the inlet temperature of the reactor cooling water, while the other 50 variables remain in a stable state and change by less than 2% compared with the normal data. Fault 13 refers to a slow drift of the reactor kinetic constants; when it occurs, it causes violent reactions in each variable, and the final product G is always in a fluctuating state. Fault 8 refers

**Fig. 8.17** Projection of different fault data on feature vectors

to random changes in the A, B, and C feed compositions when the fault occurs.

**Fig. 8.18** Diagnosis results of fault 4, 8, and 13 by LLEDA methods

To better observe the classification in spatial structure, the training data and testing data of the three faults are projected onto the first three feature vectors by different methods. The classification results are shown in Fig. 8.17.

Figure 8.18 shows the posterior probability values of different test data under different models by the LLEDA method. The posterior probability values are larger when the samples belong to category *i*. The colored bars indicate the diagnostic results, i.e., probability values, with the color bar from bottom to top corresponding to probability values 0–1 (white indicates an identification probability of 0 and red an identification probability of 1). In this way, the fault identification results are visualized. The diagnosis result is related to the classification ability: better classification performance leads to a higher fault recognition rate. Here fault 13 is poorly classified owing to the small number of feature vectors; its recognition rate can be improved by increasing the number of feature vectors.

## **8.4 Conclusion**

This chapter presents three discriminant analysis methods, KEDA, LLEDA, and NPEDA, that can handle nonlinearities and avoid small-sample problems. Normal and faulty data models are developed, these models are used to check whether abnormal behavior occurs, and variance-based performance metrics are used to identify the type of the tested data. In particular, two new supervised dimensionality reduction methods, LLEDA and NPEDA, are proposed, which combine the advantages of local linear embedding and exponential discriminant analysis, taking both global and local information into account. The nonlinear data are piecewise linearized by maintaining the internal structure during feature extraction. Both methods overcome the singularity problem of the within-class scatter matrix and therefore perform well on small-sample problems.

Furthermore, a hybrid process monitoring and fault identification algorithm is proposed in this chapter, which effectively combines the initial detection of PCA, the classification of hierarchical clustering, and the discriminant analysis of LLEDA. This hybrid method ensures that monitoring and diagnosis are performed directly on the collected data without a priori knowledge.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 9 Global Plus Local Projection to Latent Structures**


Owing to rising demands on process operation and product quality, modern industrial processes have become more complicated and produce large numbers of process and quality variables. Therefore, quality-related fault detection and diagnosis are extremely necessary for complex industrial processes. Data-driven statistical process monitoring plays an important role in this topic by extracting useful information from these highly correlated process and quality variables, because the quality variables are measured at a much lower frequency and usually with a significant time delay (Ding 2014; Aumi et al. 2013; Peng et al. 2015; Zhang et al. 2016; Yin et al. 2014). Monitoring the process variables related to the quality variables is significant for finding potential harm that may lead to system shutdown with possible enormous economic loss.

PLS is a typical multivariate statistical analysis technique in two coordinate spaces, which is well suited for quality-related fault detection and process monitoring. However, actual industrial data often exhibit strong nonlinearity, dynamics, and coupling. The PLS method only considers a static linear mapping between multiple sources of data, so it is difficult to obtain accurate detection results by directly applying PLS. How to introduce local structure-preserving capability into the global structure projection of PLS, in order to extract the complex features of industrial data, has become an important research direction. This fusion of global and local structure can usually be implemented by two strategies: plus and embedding. This chapter focuses on the plus idea: the global and local partial least squares (GLPLS) method is introduced first, and the global plus local projection to latent structure (GPLPLS) method is then proposed, with three different performance functions derived from the projection requirements of the input and output measurement spaces, separately or simultaneously. The next two chapters focus on the embedding idea, in which two different embedding methods, locality-preserving partial least squares (LPPLS) and local linear embedded projection of latent structure (LLEPLS), are proposed, using LPP and LLE as the local structure-preserving technique, respectively.

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_9

# **9.1 Fusion Motivation of Global Structure and Local Structure**

Currently, partial least squares (PLS), one of these data-driven methods (Severson et al. 2016; Ge et al. 2012; Li et al. 2010; Zhao 2014; Zhang and Qin 2008), is widely used because of its advantages in extracting latent variables by establishing the relationship between the input and output spaces for quality-relevant process monitoring (Qin 2010). It maintains the maximum correlation between quality and process variables and has good quality-related fault detection capability. However, PLS is by nature a linear projection, which is not applicable to nonlinear systems. It uses only global structural information, such as the mean and variance, and performs poorly on systems with strong local nonlinear characteristics.

Nonlinear PLS methods can be divided into two categories: external nonlinear PLS models and internal nonlinear PLS models, as shown in Fig. 9.1.

External nonlinear PLS models are a class of nonlinear PLS models that introduce nonlinear transformations of the input and/or output variables. An example is kernel partial least squares (KPLS) (Rosipal and Trejo 2001; Godoy et al. 2014), which describes the nonlinear relationship between the independent variables and extends the linear relationship between the inputs and outputs. KPLS effectively solves the nonlinear problem between the principal components of the input and output spaces, but the selection of the kernel function is difficult in practical applications. Similarly, the kernel concurrent canonical correlation analysis (KCCCA) algorithm has been proposed for quality-relevant nonlinear process monitoring and considers the nonlinearity in the quality variables (Zhu et al. 2017). Kernel-based methods map the original data into a (possibly high-dimensional) Hilbert space (eigenspace), but the projection in the eigenspace is complex, the direction and length of the projection cannot be determined, and the choice of kernel function is not straightforward.

**Fig. 9.1** Outer and inner model presentation for linear PLS decomposition

In an inner nonlinear PLS model, the internal linear model between latent variables is replaced by a nonlinear one while the external model remains unchanged; examples include quadratic partial least squares (QPLS) (Wold et al. 1989), spline function PLS (SPLS) (Wold 1992), and neural network PLS (NNPLS) (Qin and McAvoy 1992, 1996). Recursive nonlinear PLS (RNPLS) models are built by extending the input and output matrices on top of PLS (Li et al. 2005); nonlinear PLS based on the slice transformation (NPLSSLT) can be used for nonlinear correction, where SLT-based segmented linear mapping functions construct the nonlinear relationships between the input and output score vectors (Shan et al. 2015); and the nonlinear iterative partial least squares algorithm (NIPALS) is improved by assuming that the score vector is a linear projection of the original variables in the internal nonlinear PLS, at the cost of increased computational and optimization complexity.

A third class of PLS methods introduces nonlinearities in both the outer model and the inner model. An example is the orthogonal nonlinear PLS method (O-NLPLS), which considers orthogonally correlated nonlinearities between the input and output variables (Doymaz et al. 2003). This method retains the orthogonality properties of the PCA method because it is based on a neural network architecture. Similarly, an RBF network can be used to identify the nonlinearity of the input variables and to establish the nonlinear relationship between the input and output variables (Zhao et al. 2006; Shimizu et al. 2006).

The different linear PLS representations are mathematically equivalent, but different nonlinear PLS methods yield different performance and characteristics. Existing nonlinear PLS methods have several shortcomings: the difficulty of choosing kernel functions or latent structures for unknown nonlinear systems, the increased computational complexity when neural networks are used for nonlinear mapping, and the lack of a superior PLS decomposition algorithm. Therefore, how to simplify nonlinear PLS modeling is an urgent problem to be solved.

Considering that PLS and its extended algorithms focus only on global structural information and cannot extract the local neighborhood structure of the data well, they are not suitable for extracting nonlinear features. Therefore, local linearization methods for dealing with nonlinear problems are taken into account. In recent years, locality-preserving projections (LPP) (He and Niyogi 2003; He et al. 2005), which belongs to the family of manifold learning methods, has been proposed to solve the local neighborhood structure problem and effectively makes up for this deficiency. There are many other manifold learning methods as well, such as isometric feature mapping (Tenenbaum et al. 2000), local linear embedding (LLE) (Roweis and Saul 2000), and the Laplacian eigenmap (Belkin and Niyogi 2003).

Manifold learning methods preserve the local features by projecting the global structure to an approximate linear space, and by constructing a neighborhood graph to explore the inherent geometric features and manifold structure from the sample data sets. But these methods cannot consider the overall structure and lack a detailed analysis and explanation of the correlation between process and quality variables. Therefore, combining the global projection methods, such as PLS, and the manifold learning method, such as LPP and LLE, has become a new topic of concern for a growing number of engineers.

Regarding the combination of global and local information, Zhong et al. proposed a quality-related global and local partial least squares (GLPLS) model (Zhong et al. 2016). The GLPLS method integrates the advantages of the LPP and PLS methods and extracts meaningful low-dimensional representations from high-dimensional process and quality data. The principal components in GLPLS preserve the local structural information in their respective data sets as much as possible. However, the correlation between the process and quality variables is not enhanced, and the constraints of LPP are removed from the optimization objective function, so the monitoring results are seriously affected.

After further analysis of the geometric characteristics of LPP and PLS, Wang et al. proposed a new integration method, the locality-preserving partial least squares (LPPLS) model, which pays more attention to the locality-preserving characteristics (Wang et al. 2017). LPPLS can exploit the underlying geometric structure, which contains the local characteristics, in the input and output spaces. Although the maximization of the correlation between the process and quality variables is considered, the global characteristics are converted into a combination of multiple local linearized characteristics rather than expressed directly. In many processes the linear relationship may be the most important one, and the best way is to describe it directly rather than through a combination of multiple local linearized characteristics.

## **9.2 Mathematical Description of Dimensionality Reduction**

## *9.2.1 PLS Optimization Objective*

The PLS algorithm models the relationship between the normalized data sets $X = [x(1), x(2), \ldots, x(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times m}$ (with $x = [x_1, x_2, \ldots, x_m]^{\mathrm{T}}$) and $Y = [y(1), y(2), \ldots, y(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times l}$ (with $y = [y_1, y_2, \ldots, y_l]^{\mathrm{T}}$). $X$ collects the process variables and $Y$ the quality variables; $m$ and $l$ are the dimensions of the input and output spaces, and $n$ is the number of samples. $X$ and $Y$ are decomposed as follows:

$$X = T P^{\mathrm{T}} + \bar{X} \tag{9.1}$$

$$Y = U Q^{\mathrm{T}} + \bar{Y}, \tag{9.2}$$

where $T = [t_1, t_2, \ldots, t_d] \in \mathbb{R}^{n \times d}$ and $U = [u_1, u_2, \ldots, u_d] \in \mathbb{R}^{n \times d}$ are the score matrices of $X$ and $Y$, respectively; $P = [p_1, p_2, \ldots, p_d] \in \mathbb{R}^{m \times d}$ and $Q = [q_1, q_2, \ldots, q_d] \in \mathbb{R}^{l \times d}$ are the loading matrices of $X$ and $Y$; $\bar{X} \in \mathbb{R}^{n \times m}$ and $\bar{Y} \in \mathbb{R}^{n \times l}$ are the residual matrices of $X$ and $Y$; and $d$ is the number of latent variables. The weight vectors $w$ and $c$ are derived by the NIPALS algorithm such that the covariance of the score vectors $t$ and $u$ is maximized:

$$\begin{aligned} \max \operatorname{Cov}(t, u) &= \sqrt{\operatorname{Var}(t)\operatorname{Var}(u)}\, r(t, u) \\ &= \sqrt{\operatorname{Var}(Xw)\operatorname{Var}(Yc)}\, r(Xw, Yc). \end{aligned} \tag{9.3}$$

Equation (9.3) is actually equivalent to solving the following optimization problem:

$$\begin{aligned} \max_{w, c}\ & \langle Xw,\, Yc \rangle \\ \text{s.t. } & \|w\| = 1,\ \|c\| = 1 \end{aligned} \tag{9.4}$$

or

$$\begin{aligned} J_{\mathrm{PLS}} &= \max\ w^{\mathrm{T}} X^{\mathrm{T}} Y c \\ \text{s.t. } & \|w\| = 1,\ \|c\| = 1. \end{aligned} \tag{9.5}$$
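As a sketch of how (9.1), (9.2), and (9.5) fit together, the following minimal NIPALS loop (our own Python/NumPy illustration, not code from the book; function and variable names are ours) extracts $d$ latent variables and reproduces the decomposition $X = T P^{\mathrm{T}} + \bar{X}$:

```python
import numpy as np

def nipals_pls(X, Y, d, n_iter=500):
    """Minimal NIPALS sketch: extract d latent variables and return the
    scores T, U, loadings P, Q, and the X-residual, so that the input
    satisfies X = T P^T + Xbar as in Eq. (9.1)."""
    X, Y = X.copy(), Y.copy()
    T, P, U, Q = [], [], [], []
    for _ in range(d):
        u = Y[:, [0]]                        # start from a quality column
        for _ in range(n_iter):              # power iterations: w tends to the
            w = X.T @ u                      # leading left singular vector of X^T Y,
            w /= np.linalg.norm(w)           # i.e. the maximizer in (9.5)
            t = X @ w                        # X-score
            q = Y.T @ t
            q /= np.linalg.norm(q)           # Y-loading direction
            u = Y @ q                        # Y-score
        p = X.T @ t / (t.T @ t)              # X-loading
        X = X - t @ p.T                      # deflate X
        Y = Y - t @ (Y.T @ t / (t.T @ t)).T  # deflate Y
        T.append(t); P.append(p); U.append(u); Q.append(q)
    return np.hstack(T), np.hstack(P), np.hstack(U), np.hstack(Q), X

# Toy example: quality variables driven linearly by the process variables
rng = np.random.default_rng(1)
X0 = rng.standard_normal((30, 6)); X0 -= X0.mean(0)
Y0 = X0 @ rng.standard_normal((6, 2)) + 0.1 * rng.standard_normal((30, 2))
Y0 -= Y0.mean(0)
T, P, U, Q, Xres = nipals_pls(X0, Y0, d=3)
```

By construction, the deflation steps guarantee $X = T P^{\mathrm{T}} + \bar{X}$ exactly, which is the external model of (9.1).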

## *9.2.2 LPP and PCA Optimization Objectives*

LPP aims to project the points of space $X$ into a low-dimensional space $\Phi = [\phi^{\mathrm{T}}(1), \phi^{\mathrm{T}}(2), \ldots, \phi^{\mathrm{T}}(n)]^{\mathrm{T}} \in \mathbb{R}^{n \times d}$ ($d < m$, $\phi = [\phi_1, \ldots, \phi_d]$) via the projection matrix $W = [w_1, \ldots, w_d] \in \mathbb{R}^{m \times d}$, that is,

$$\phi(i) = x(i) W, \quad i = 1, 2, \ldots, n. \tag{9.6}$$

The optimal mapping of the input space can be obtained by solving the following minimization problem:

$$\begin{aligned} J_{\mathrm{LPP}}(w) &= \min \frac{1}{2} \sum_{i,j=1}^{n} \|\phi_i - \phi_j\|^2 s_{xij} \\ &= \min \left( w^{\mathrm{T}} X^{\mathrm{T}} D_x X w - w^{\mathrm{T}} X^{\mathrm{T}} S_x X w \right) \\ \text{s.t. } & w^{\mathrm{T}} X^{\mathrm{T}} D_x X w = 1, \end{aligned} \tag{9.7}$$

where $S_x = [s_{xij}] \in \mathbb{R}^{n \times n}$ is the neighborhood relationship matrix between $x_i$ and $x_j$, $D_x = [d_{xij}]$ is a diagonal matrix with $d_{xii} = \sum_j s_{xij}$, and

$$s\_{xij} = \begin{cases} e^{-\frac{\|\mathbf{x}(i) - \mathbf{x}(j)\|^2}{2\delta\_x^2}}, & \mathbf{x}(i) \text{ and } \mathbf{x}(j) \in \text{``neighbors''}\\ 0, & \text{otherwise} \end{cases} \tag{9.8}$$

$\delta_x$ is the neighborhood width parameter. The "neighbors" of $x(i)$ and $x(j)$ are determined by the K-nearest neighbors method.
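A small sketch of building the heat-kernel neighbor matrix $S_x$ of (9.8) and the degree matrix $D_x$ (our own Python/NumPy illustration; the function name, symmetrization choice, and defaults are ours):

```python
import numpy as np

def lpp_weights(X, k=5, delta=1.0):
    """Heat-kernel neighbor matrix S_x of Eq. (9.8) and degree matrix D_x.
    s_ij = exp(-||x(i)-x(j)||^2 / (2 delta^2)) if x(j) is among the k nearest
    neighbors of x(i) (relation symmetrized), else 0."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbr = np.argsort(d2[i])[1:k + 1]                 # k nearest, skipping self
        S[i, nbr] = np.exp(-d2[i, nbr] / (2 * delta ** 2))
    S = np.maximum(S, S.T)                               # make the relation symmetric
    D = np.diag(S.sum(axis=1))                           # d_xii = sum_j s_xij
    return S, D

rng = np.random.default_rng(0)
Xs = rng.standard_normal((20, 3))
S, D = lpp_weights(Xs, k=4)
```

The symmetrization step reflects that "x(i) and x(j) are neighbors" is used as a symmetric relation in (9.8); one-sided k-NN graphs are not symmetric by themselves.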

The LPP problem (9.7) in space *X* is updated as follows:

$$\begin{aligned} J_{\mathrm{LPP}}(w) &= \max\ w^{\mathrm{T}} X^{\mathrm{T}} S_x X w \\ \text{s.t. } & w^{\mathrm{T}} X^{\mathrm{T}} D_x X w = 1. \end{aligned} \tag{9.9}$$

The local structure information of $X$ is contained in the matrices $X^{\mathrm{T}} S_x X$ and $X^{\mathrm{T}} D_x X$. The magnitude of the diagonal elements indicates how strongly the corresponding variables contribute to preserving the local structure, and the off-diagonal elements correspond to the correlations between the observed variables. Similarly, the optimization problem for PCA can be expressed as follows:

$$\begin{aligned} J_{\mathrm{PCA}}(w) &= \max\ w^{\mathrm{T}} X^{\mathrm{T}} X w \\ \text{s.t. } & w^{\mathrm{T}} w = 1. \end{aligned} \tag{9.10}$$
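Both (9.9) and (9.10) reduce to eigenvalue problems: (9.9) is the generalized problem $X^{\mathrm{T}} S_x X w = \mu X^{\mathrm{T}} D_x X w$, while (9.10) is an ordinary symmetric eigenproblem. A hypothetical numerical sketch (toy data, neighbor count, and kernel width are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4)); X -= X.mean(0)

# Toy k-NN heat-kernel weights S_x and degree matrix D_x (cf. Eq. 9.8)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.zeros((50, 50))
for i in range(50):
    nbr = np.argsort(d2[i])[1:6]            # 5 nearest neighbors, skipping self
    S[i, nbr] = np.exp(-d2[i, nbr] / 2.0)   # heat kernel with delta_x = 1
S = np.maximum(S, S.T)
D = np.diag(S.sum(axis=1))

# LPP (9.9): max w^T X^T S_x X w  s.t.  w^T X^T D_x X w = 1
A, B = X.T @ S @ X, X.T @ D @ X             # B is positive definite here
ev, V = np.linalg.eig(np.linalg.solve(B, A))  # B^{-1} A w = mu w
w_lpp = np.real(V[:, np.argmax(ev.real)])
w_lpp /= np.sqrt(w_lpp @ B @ w_lpp)         # enforce the constraint of (9.9)

# PCA (9.10): max w^T X^T X w  s.t.  w^T w = 1  -> ordinary eigenproblem
evp, Vp = np.linalg.eigh(X.T @ X)
w_pca = Vp[:, -1]                           # eigh returns ascending eigenvalues
```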

Based on the similarity of the optimization goals of LPP and PCA, and on the component extraction idea of PCA embedded in PLS, it is natural to fuse LPP features into PLS to compensate for the main limitation of PLS, its lack of local feature extraction capability. The simplest feature fusion method is to combine the two optimization goals into a new one through trade-off parameters, as in GLPLS (Zhong et al. 2016).

## **9.3 Introduction to the GLPLS**

The GLPLS method is introduced in this section to obtain the relationship between the quality and measurement variables while maintaining the local characteristics as much as possible. The main idea is to integrate the LPP method, which preserves the local structural characteristics, with the PLS method, which performs the quality-relevant statistical analysis. As a result, the GLPLS method is able not only to identify the latent characteristic directions of both the measurement and the quality data spaces but also to preserve, to the greatest extent possible, the local structural characteristics in the two hidden subspaces.

The manifold structures of the process variables $X$ and the product output variables $Y$ are both considered by introducing parameters $\lambda_1$ and $\lambda_2$ to control the trade-off between the extraction of global and local features. The objective of the GLPLS-based method is therefore defined as

$$\begin{aligned} J_{\mathrm{GLPLS}}(w, c) &= \arg\max \{ w^{\mathrm{T}} X^{\mathrm{T}} Y c + \lambda_1 w^{\mathrm{T}} \theta_x w + \lambda_2 c^{\mathrm{T}} \theta_y c \} \\ \text{s.t. } & w^{\mathrm{T}} w = 1,\ c^{\mathrm{T}} c = 1, \end{aligned} \tag{9.11}$$

where $\theta_x = X^{\mathrm{T}} S_x X$ and $\theta_y = Y^{\mathrm{T}} S_y Y$ represent the local structure information of the process variables and the quality variables, respectively. $S_x$, $S_y$, $D_1$, and $D_2$ are the local feature parameters of the LPP algorithm. The parameters $\lambda_1$ and $\lambda_2$ control the weight coefficients between global and local features.

It can be found from (9.11) that the objective function of GLPLS contains the objective function of the PLS algorithm, $w^{\mathrm{T}} X^{\mathrm{T}} Y c$, and parts of the optimization problem of the LPP algorithm, $w^{\mathrm{T}} X^{\mathrm{T}} S_x X w$ and $c^{\mathrm{T}} Y^{\mathrm{T}} S_y Y c$.

The optimization function (9.11) seems to combine the global characteristics of the PLS algorithm and the locality-preserving characteristics of the LPP algorithm well. Is that really the case? Let us first analyze the solution of the optimization problem. To solve the optimization objective (9.11), the following Lagrange function is introduced:

$$\begin{aligned} \psi(w, c) &= w^{\mathrm{T}} X^{\mathrm{T}} Y c + \lambda_1 w^{\mathrm{T}} \theta_x w + \lambda_2 c^{\mathrm{T}} \theta_y c \\ &\quad - \eta_1 (w^{\mathrm{T}} w - 1) - \eta_2 (c^{\mathrm{T}} c - 1). \end{aligned} \tag{9.12}$$

Then, according to the conditions for extremum, (9.11) is resolved as follows (Zhong et al. 2016):

$$J_{\mathrm{GLPLS}}(w, c) = \eta_1 + \eta_2. \tag{9.13}$$

Let $\lambda_1 = \eta_1$ and $\lambda_2 = \eta_2$. The best projection vector $w$ is the eigenvector corresponding to the largest eigenvalue of $(I - \theta_x)^{-1} X^{\mathrm{T}} Y (I - \theta_y)^{-1} Y^{\mathrm{T}} X$, and the best projection vector $c$ is the eigenvector corresponding to the largest eigenvalue of $(I - \theta_y)^{-1} Y^{\mathrm{T}} X (I - \theta_x)^{-1} X^{\mathrm{T}} Y$, that is,

$$\begin{aligned} (I - \theta_x)^{-1} X^{\mathrm{T}} Y (I - \theta_y)^{-1} Y^{\mathrm{T}} X w &= 4 \eta_1 \eta_2 w \\ (I - \theta_y)^{-1} Y^{\mathrm{T}} X (I - \theta_x)^{-1} X^{\mathrm{T}} Y c &= 4 \eta_1 \eta_2 c. \end{aligned} \tag{9.14}$$
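A numerical sketch of solving the eigenproblems (9.14) (our own illustration; it assumes $I - \theta_x$ and $I - \theta_y$ are invertible, and the toy weight matrices are hypothetical):

```python
import numpy as np

def glpls_directions(X, Y, Sx, Sy):
    """Best projection vectors (w, c) of GLPLS from the eigenproblems (9.14)."""
    theta_x, theta_y = X.T @ Sx @ X, Y.T @ Sy @ Y
    m, l = X.shape[1], Y.shape[1]
    Ax = np.linalg.solve(np.eye(m) - theta_x, X.T @ Y)  # (I - theta_x)^{-1} X^T Y
    Ay = np.linalg.solve(np.eye(l) - theta_y, Y.T @ X)  # (I - theta_y)^{-1} Y^T X
    ew, Vw = np.linalg.eig(Ax @ Ay)                     # w: eigvec of largest eigval
    ec, Vc = np.linalg.eig(Ay @ Ax)                     # c: eigvec of largest eigval
    w = np.real(Vw[:, np.argmax(ew.real)])
    c = np.real(Vc[:, np.argmax(ec.real)])
    return w / np.linalg.norm(w), c / np.linalg.norm(c)

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4)); Y = rng.standard_normal((20, 2))
Sx = rng.random((20, 20)) * 0.01; Sx = (Sx + Sx.T) / 2  # toy symmetric weights
Sy = rng.random((20, 20)) * 0.01; Sy = (Sy + Sy.T) / 2
w, c = glpls_directions(X, Y, Sx, Sy)
```

Note that both matrices in (9.14) share the eigenvalue $4\eta_1\eta_2$, which is exactly the product quantity discussed next.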

Equation (9.13) shows that the optimal value of GLPLS is $\eta_1 + \eta_2$, but in the actual calculation (9.14), the quantity maximized by the GLPLS algorithm is $\eta_1 \eta_2$. Obviously, in most cases the conditions for maximizing $\eta_1 + \eta_2$ and $\eta_1 \eta_2$ are different.

To explain this result, we return once more to the GLPLS optimization objective (9.11), which combines a global (PLS) and a local (LPP) feature optimization problem. This combination is reasonable to a certain extent. However, the latent variables of PLS are chosen so that they capture as much variation as possible and the correlation between latent variables is as strong as possible, whereas the LPP method only needs to keep as much local structure information as possible when constructing its latent variables. In other words, although the local features of the process variables ($\theta_x = X^{\mathrm{T}} S_x X$) and the quality variables ($\theta_y = Y^{\mathrm{T}} S_y Y$) are enhanced, the correlation between the local features is not. Therefore, this direct combination of global and local features may lead to erroneous results.

In the GLPLS method, LPP is used to maintain local structural features. Locally linear embedding (LLE) is another commonly used manifold learning algorithm. Like LPP, the LLE algorithm converts a global nonlinear problem into a combination of multiple local linear problems by maintaining local structural information, but it has fewer adjustable parameters than LPP. Therefore, the LLE algorithm is another good solution for strongly locally nonlinear process systems. The LLE algorithm has been briefly introduced in Chap. 11, where its optimization objective function is transformed into a general maximization form. In the next section, we therefore combine the PLS method and the LLE/LPP method in a new way, trying to maintain the global and local structural information of the process and quality variables at the same time while enhancing the correlation between them.

# **9.4 Basic Principles of GPLPLS**

## *9.4.1 The GPLPLS Model*

According to the Taylor series expansion, a nonlinear function can be written as follows:

$$F(Z) = A(Z - Z\_0) + g(Z - Z\_0),\tag{9.15}$$

where $A(Z - Z_0)$ and $g(Z - Z_0)$ represent the linear part and the nonlinear part, respectively. In many real systems, especially near the equilibrium point $Z_0$, the linear part is primary and the nonlinear part is secondary. The PLS method has difficulty modeling nonlinear systems well because it uses the linear dimensionality reduction method PCA to obtain the principal components, which only establishes the relationship between the linear parts of the input variable space ($X$) and the output variable space ($Y$). To obtain a better model with local nonlinear features, the KPLS model (Rosipal and Trejo 2001) maps the original data to a high-dimensional feature space, while the LPPLS model (Wang et al. 2017) transforms nonlinear features into a combination of multiple local linearized features. Both methods can solve some nonlinear problems. However, the feature space of the KPLS model is not easy to determine, and the main linear part in the LPPLS model would be more suitably described directly by global structural features.

In fact, the PLS optimization (9.5) includes two goals for the selected latent variables: the latent variables should capture as much variance as possible, and the correlation between the latent variables of the input and output spaces should be as strong as possible. Although the GLPLS model combines global and local feature information, the combination of the two is not coordinated. How can the two kinds of features be combined under one consistent objective? According to the expression of a nonlinear function (9.15), the input and output spaces can both be divided into a linear part and a nonlinear part. By introducing local structure information, the nonlinear part can be transformed into a combination of multiple local linear problems.

Inspired by the role of the PCA term ($w^{\mathrm{T}} X^{\mathrm{T}} X w$) in the PLS objective ($w^{\mathrm{T}} X^{\mathrm{T}} Y c$) and by the limitations of the GLPLS algorithm, this section proposes a novel dimensionality reduction method. It combines global (PCA) and local (LLE/LPP) features to extract the latent variables of nonlinear systems. The input space $X$ and the output space $Y$ are mapped to new feature spaces $X_F$ and $Y_F$, respectively. Each new feature space contains a global linear subspace and multiple local linear subspaces. The new feature spaces $X_F$ and $Y_F$ replace the original spaces $X$ and $Y$, respectively. Consequently, the objective function of the global plus local projection to latent structure (GPLPLS) method is given by the following optimization objective:

$$\begin{aligned} J_{\mathrm{GPLPLS}}(w, c) &= \arg\max \{ w^{\mathrm{T}} X_F^{\mathrm{T}} Y_F c \} \\ \text{s.t. } & w^{\mathrm{T}} w = 1,\ c^{\mathrm{T}} c = 1, \end{aligned} \tag{9.16}$$

where $X_F$ and $Y_F$ satisfy $X_F = X + \lambda_x \theta_x^{\frac{1}{2}}$ and $Y_F = Y + \lambda_y \theta_y^{\frac{1}{2}}$.

The new feature spaces $X_F$ and $Y_F$ are thus both divided into a linear part ($X$, $Y$) and a nonlinear part ($\lambda_x \theta_x^{\frac{1}{2}}$, $\lambda_y \theta_y^{\frac{1}{2}}$), similar to (9.15). Figure 9.2 shows the principle of the GPLPLS method. Here $X_{\mathrm{global}}$ and $Y_{\mathrm{global}}$ are the corresponding linear parts of the input space and the output space, respectively; they are projected to the dimensionality reduction space by the traditional global projection method, PLS. $X_{\mathrm{local}}$ and $Y_{\mathrm{local}}$ are the corresponding nonlinear parts, which are projected by the locality-preserving projection method (LPP).

**Fig. 9.2** The schematic diagram of the GPLPLS method

The core of extracting the principal components is PCA, so the linear model of $X$ and $Y$ is established by (9.16). It actually contains two relationships: one is that the input and output spaces are divided into "scores" and "loadings" (the external relationship), and the other is the relationship between the latent variables of the input space and the output space (the internal relationship). These two relationships can also be seen in the schematic diagram (Fig. 9.2) of the GPLPLS model. Obviously, we can keep only the internal model, or only the external model, or retain the local structure information of the internal and external models at the same time. Therefore, by setting four different value combinations of $\lambda_x$ and $\lambda_y$, four different optimization objective functions can be obtained: $\lambda_x = \lambda_y = 0$ recovers the standard PLS objective; $\lambda_x \neq 0$, $\lambda_y = 0$ gives the GPLPLS$_x$ model; $\lambda_x = 0$, $\lambda_y \neq 0$ gives the GPLPLS$_y$ model; and $\lambda_x \neq 0$, $\lambda_y \neq 0$ gives the GPLPLS$_{x+y}$ model.
## *9.4.2 Relationship Between GPLPLS Models*

The optimization objective function of the GPLPLS method is given by (9.16). There are three GPLPLS models according to the different values of $\lambda_x$ and $\lambda_y$. What is the relationship between these three GPLPLS models? What are the differences in their modeling? These issues are discussed in this section.

Suppose the original relationship is $Y = f(X)$. Local linear embedding or locality-preserving projection can be regarded as linearizing the system around equilibrium points. From this perspective, the models with different combinations of $\lambda_x$ and $\lambda_y$ are as follows:

(1) PLS model: $\hat{Y} = A_0 X$.

(2) GPLPLS$_x$ model: $\hat{Y} = A_1 [X, x_{zi}]$.

(3) GPLPLS$_y$ model: $[\hat{Y}, y_{lj}] = A_2 X$.

(4) GPLPLS$_{x+y}$ model: $[\hat{Y}, y_{lj}] = A_3 [X, x_{zi}]$.

Here $x_{zi}$ ($i = 1, 2, \ldots, k_x$) and $y_{lj} = f(x_{lj})$ ($j = 1, 2, \ldots, k_y$) are the local feature points of the input space and the output space, respectively, and $A_0$, $A_1$, $A_2$, and $A_3$ are the model coefficient matrices. Obviously, PLS uses a simple linear approximation of the original system, which is generally not good for a relatively strongly nonlinear system. The GPLPLS uses spatial local decomposition and approximates the original system with the sum of multiple simple linear models. GPLPLS$_x$ or GPLPLS$_y$ is a special case of GPLPLS$_{x+y}$. It seems that these three combinations cover all possible GPLPLS models; however, let us return to the optimization function of the GPLPLS$_{x+y}$ model.

$$\begin{aligned} J_{\mathrm{GPLPLS}_{x+y}}(w, c) &= \arg\max_{w, c} \left\{ w^{\mathrm{T}} \left( X + \lambda_x \theta_x^{\frac{1}{2}} \right)^{\mathrm{T}} \left( Y + \lambda_y \theta_y^{\frac{1}{2}} \right) c \right\} \\ &= \arg\max_{w, c} \left\{ w^{\mathrm{T}} X^{\mathrm{T}} Y c + \lambda_x w^{\mathrm{T}} \theta_x^{\frac{1}{2}\mathrm{T}} Y c + \lambda_y w^{\mathrm{T}} X^{\mathrm{T}} \theta_y^{\frac{1}{2}} c + \lambda_x \lambda_y w^{\mathrm{T}} \theta_x^{\frac{1}{2}\mathrm{T}} \theta_y^{\frac{1}{2}} c \right\} \\ \text{s.t. } & w^{\mathrm{T}} w = 1,\ c^{\mathrm{T}} c = 1. \end{aligned} \tag{9.17}$$

Obviously, (9.17) contains two coupling terms ($\theta_x^{\frac{1}{2}\mathrm{T}} Y$ and $X^{\mathrm{T}} \theta_y^{\frac{1}{2}}$), which represent the correlation between the linear primary part and the nonlinear part. In some cases, these coupling terms may have a negative impact on modeling. On the other hand, in addition to the external relationship between the input and output spaces, which can be extended to a combination of linear and nonlinear parts, the internal relationship between the input and output spaces (the final model) can also be described as such a combination. It is therefore natural to model the linear and nonlinear parts without considering the coupling between them, and correspondingly there is no need to include the coupling terms in the optimization function of the model. This yields the optimization objective of the following GPLPLS$_{xy}$ model:

$$\begin{aligned} J_{\mathrm{GPLPLS}_{xy}}(w, c) &= \arg\max \{ w^{\mathrm{T}} X^{\mathrm{T}} Y c + \lambda_{xy} w^{\mathrm{T}} \theta_x^{\frac{1}{2}\mathrm{T}} \theta_y^{\frac{1}{2}} c \} \\ \text{s.t. } & w^{\mathrm{T}} w = 1,\ c^{\mathrm{T}} c = 1. \end{aligned} \tag{9.18}$$

Here, the parameter $\lambda\_{xy}$ controls the trade-off between global and local features.

## *9.4.3 Principal Components of the GPLPLS Model*

In this section, we introduce how to obtain the principal components of the GPLPLS model. To facilitate the comparison with the traditional linear PLS model, denote $E\_{0F} = X\_F$ and $F\_{0F} = Y\_F$. The optimization objective functions of the four GPLPLS models are all covered by the following optimization objective:

$$\begin{aligned} J\_{\text{GPLPLS}}(\boldsymbol{w}, \boldsymbol{c}) &= \arg\max \left\{ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}\_{F}^{\mathrm{T}} \boldsymbol{Y}\_{F} \boldsymbol{c} + \lambda\_{xy} \boldsymbol{w}^{\mathrm{T}} \boldsymbol{\theta}\_{x}^{\frac{1}{2}\mathrm{T}} \boldsymbol{\theta}\_{y}^{\frac{1}{2}} \boldsymbol{c} \right\} \\ &\quad \text{s.t.} \quad \boldsymbol{w}^{\mathrm{T}} \boldsymbol{w} = 1, \ \boldsymbol{c}^{\mathrm{T}} \boldsymbol{c} = 1, \end{aligned} \tag{9.19}$$

where at least one of $[\lambda\_x, \lambda\_y]$ and $\lambda\_{xy}$ is nonzero. The steps for obtaining the latent variables of the GPLPLS model (9.19) are as follows.

First, Lagrange multipliers are introduced to transform the objective function (9.19) into the following unconstrained form:

$$\begin{split} \Psi(\boldsymbol{w}\_{1}, \boldsymbol{c}\_{1}) &= \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{E}\_{0F}^{\mathrm{T}} \boldsymbol{F}\_{0F} \boldsymbol{c}\_{1} + \lambda\_{xy} \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{\theta}\_{x}^{\frac{1}{2}\mathrm{T}} \boldsymbol{\theta}\_{y}^{\frac{1}{2}} \boldsymbol{c}\_{1} \\ & \quad - \lambda\_{1} (\boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{w}\_{1} - 1) - \lambda\_{2} (\boldsymbol{c}\_{1}^{\mathrm{T}} \boldsymbol{c}\_{1} - 1). \end{split} \tag{9.20}$$

Setting $\partial\Psi/\partial\boldsymbol{w}\_1 = 0$ and $\partial\Psi/\partial\boldsymbol{c}\_1 = 0$ yields the optimal solutions for $\boldsymbol{w}\_1$ and $\boldsymbol{c}\_1$. The objective function (9.19) is then transformed into

$$\left[\boldsymbol{E}\_{0F}^{\mathrm{T}}\boldsymbol{F}\_{0F} + \lambda\_{xy}\boldsymbol{\theta}\_{x}^{\frac{1}{2}\mathrm{T}}\boldsymbol{\theta}\_{y}^{\frac{1}{2}}\right] \left[\boldsymbol{E}\_{0F}^{\mathrm{T}}\boldsymbol{F}\_{0F} + \lambda\_{xy}\boldsymbol{\theta}\_{x}^{\frac{1}{2}\mathrm{T}}\boldsymbol{\theta}\_{y}^{\frac{1}{2}}\right]^{\mathrm{T}} \boldsymbol{w}\_{1} = \theta^{2}\boldsymbol{w}\_{1}\tag{9.21}$$

$$\left[\boldsymbol{F}\_{0F}^{\mathrm{T}}\boldsymbol{E}\_{0F} + \lambda\_{xy}\boldsymbol{\theta}\_{y}^{\frac{1}{2}\mathrm{T}}\boldsymbol{\theta}\_{x}^{\frac{1}{2}}\right] \left[\boldsymbol{F}\_{0F}^{\mathrm{T}}\boldsymbol{E}\_{0F} + \lambda\_{xy}\boldsymbol{\theta}\_{y}^{\frac{1}{2}\mathrm{T}}\boldsymbol{\theta}\_{x}^{\frac{1}{2}}\right]^{\mathrm{T}} \boldsymbol{c}\_{1} = \theta^{2}\boldsymbol{c}\_{1},\tag{9.22}$$

where $\theta = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{X}\_{F}^{\mathrm{T}} \boldsymbol{Y}\_{F} \boldsymbol{c} + \lambda\_{xy}\boldsymbol{w}^{\mathrm{T}}\boldsymbol{\theta}\_{x}^{\frac{1}{2}\mathrm{T}} \boldsymbol{\theta}\_{y}^{\frac{1}{2}} \boldsymbol{c}$. The target vectors $\boldsymbol{w}\_1$ and $\boldsymbol{c}\_1$ are calculated from (9.21) and (9.22). After obtaining the target vectors (that is, the direction vectors of the latent variables), the latent variables $\boldsymbol{t}\_1$ and $\boldsymbol{u}\_1$, the loading vectors $\boldsymbol{p}\_1$ and $\boldsymbol{q}\_1$, and the residual matrices $\boldsymbol{E}\_1$ and $\boldsymbol{F}\_1$ can be calculated as follows:

$$\mathbf{t}\_{1} = \mathbf{E}\_{0F}\mathbf{w}\_{1},\tag{9.23}$$

$$\mathbf{p}\_1 = \frac{E\_{0F}^\mathrm{T} t\_1}{\|\mathbf{t}\_1\|^2}, \qquad \qquad \qquad \mathbf{q}\_1 = \frac{F\_{0F}^\mathrm{T} t\_1}{\|\mathbf{t}\_1\|^2} \tag{9.24}$$

$$E\_{1F} = E\_{0F} - t\_1 \mathbf{p}\_1^\mathrm{T},\qquad\qquad\qquad F\_{1F} = F\_{0F} - t\_1 \mathbf{q}\_1^\mathrm{T}.\tag{9.25}$$

Similar to the PLS method, the remaining latent variables of the GPLPLS model can be obtained by continuing to decompose the residual matrices $E\_{iF}$ and $F\_{iF}$ ($i = 1, 2, \ldots, d-1$). Usually, the first $d$ latent variables are used to produce a better predictive regression model, and $d$ can be determined by a cross-validation test (Zhou et al. 2010).
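Assuming the locality terms $\boldsymbol{\theta}\_{x}^{\frac{1}{2}}$ and $\boldsymbol{\theta}\_{y}^{\frac{1}{2}}$ have already been computed, the extraction loop of (9.20)-(9.25) might be sketched as follows (all function and variable names are illustrative; the dominant singular pair of the cross-product matrix solves the eigenproblems (9.21)-(9.22)):

```python
import numpy as np

def gplpls_components(E0F, F0F, theta_x_half, theta_y_half, lam_xy, d):
    """Sketch of GPLPLS latent-variable extraction via Eqs. (9.20)-(9.25).

    E0F, F0F      : locality-enhanced data matrices X_F (n x m), Y_F (n x l)
    theta_x_half,
    theta_y_half  : precomputed local-structure terms theta_x^(1/2), theta_y^(1/2)
    lam_xy        : trade-off parameter lambda_xy
    d             : number of latent variables (e.g. from cross-validation)
    """
    E, F = E0F.copy(), F0F.copy()
    W, T, P, Q = [], [], [], []
    for _ in range(d):
        # cross-product matrix A = E^T F + lam_xy * theta_x^(1/2)T theta_y^(1/2)
        A = E.T @ F + lam_xy * theta_x_half.T @ theta_y_half
        # w: dominant left singular vector of A (eigenvector of A A^T, Eq. 9.21)
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        w, c = U[:, 0], Vt[0, :]
        t = E @ w                       # score vector, Eq. (9.23)
        p = E.T @ t / (t @ t)           # loading vectors, Eq. (9.24)
        q = F.T @ t / (t @ t)
        E = E - np.outer(t, p)          # deflation, Eq. (9.25)
        F = F - np.outer(t, q)
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return (np.column_stack(W), np.column_stack(T),
            np.column_stack(P), np.column_stack(Q))
```

With $\lambda\_{xy} = 0$ and zero locality terms, the loop reduces to ordinary NIPALS-style PLS, which is a useful sanity check.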

The above completes the establishment of the GPLPLS model and its principal component extraction process. We now compare the GPLPLS model with the GLPLS model.

First of all, GPLPLS shares its main idea with the GLPLS method, i.e., combining local and global structural features (covariance). However, the GPLPLS method integrates global and local structural features better than the GLPLS method. Unlike GLPLS, the GPLPLS method not only maintains the local structural features, but also extracts as much of the relevant information in the input and output spaces as possible. Therefore, the GPLPLS method can extract the largest global correlation while preserving the local structural correlation between process and quality variables.

Compared with the LPPLS method (Chap. 10) and the LLEPLS method (Chap. 11), in which all characteristics are described by local features, such an indiscriminate description has advantages in strongly nonlinear systems, but not necessarily in linearly dominant systems with only local nonlinearity. The GPLPLS method proposed in this chapter is aimed at linearly dominant processes while still maintaining some nonlinear relationships: it integrates global features (covariance) and nonlinear local correlation as much as possible.

## **9.5 GPLPLS-Based Quality Monitoring**

## *9.5.1 Process and Quality Monitoring Based on GPLPLS*

The GPLPLS-based monitoring method is very similar to the PLS-based one. The common monitoring indicators of PLS are T² and SPE. Chapter 11 explains in detail why the SPE statistic is not suitable for monitoring the residual space of PLS. Therefore, in this chapter, process monitoring based on the GPLPLS method uses T² statistics to monitor both the principal component subspace and the residual subspace. The monitoring procedure is divided into two parts, offline training and online monitoring, detailed as follows.

The input space $X$ and the output space $Y$ of the GPLPLS model are mapped to a low-dimensional space defined by a small number of latent variables $[\boldsymbol{t}\_1, \ldots, \boldsymbol{t}\_d]$. The decomposition of $E\_{0F}$ and $F\_{0F}$ is as follows:

$$\begin{aligned} E\_{0F} &= \sum\_{i=1}^{d} \boldsymbol{t}\_i \boldsymbol{p}\_i^{\mathrm{T}} + \overline{E}\_{0F} = \boldsymbol{T} \boldsymbol{P}^{\mathrm{T}} + \overline{E}\_{0F} \\ F\_{0F} &= \sum\_{i=1}^{d} \boldsymbol{t}\_i \boldsymbol{q}\_i^{\mathrm{T}} + \overline{F}\_{0F} = \boldsymbol{T} \boldsymbol{Q}^{\mathrm{T}} + \overline{F}\_{0F}, \end{aligned} \tag{9.26}$$

where $T = [\boldsymbol{t}\_1, \boldsymbol{t}\_2, \ldots, \boldsymbol{t}\_d]$ is the score matrix, and $P = [\boldsymbol{p}\_1, \ldots, \boldsymbol{p}\_d]$ and $Q = [\boldsymbol{q}\_1, \ldots, \boldsymbol{q}\_d]$ are the loading matrices of the process variables $E\_{0F}$ and the quality variables $F\_{0F}$, respectively. Substituting $E\_{0F}$ for $\boldsymbol{t}\_i$ gives:

$$T = E\_{0F} \boldsymbol{R} = \left(\boldsymbol{I} + \lambda\_x \boldsymbol{S}\_x^{\frac{1}{2}}\right) E\_0 \boldsymbol{R},\tag{9.27}$$

where $R = [\boldsymbol{r}\_1, \ldots, \boldsymbol{r}\_d]$ is the decomposition matrix, and

$$\boldsymbol{r}\_i = \prod\_{j=1}^{i-1} \left( \boldsymbol{I}\_m - \boldsymbol{w}\_j \boldsymbol{p}\_j^{\mathrm{T}} \right) \boldsymbol{w}\_i.$$

Note that $E\_{0F}$ contains the results of locality-preserving learning. Operations (9.26) and (9.27) are executable during model training. During online monitoring, however, the data are sampled in real time, and an individual real-time sample cannot be used to construct the transformation matrix $S\_x$ or $S\_y$ for locality learning. For practical application, (9.26) and (9.27) should therefore be transformed into the decomposition of the normalized matrices $E\_0$ and $F\_0$,

$$E\_0 = T\_0 \mathbf{P}^\mathrm{T} + \bar{E}\_0 \tag{9.28}$$

$$F\_0 = T\_0 \bar{\mathbf{Q}}^T + \bar{F}\_0 = E\_0 R \,\bar{\mathbf{Q}}^T + \overline{F}\_0,\tag{9.29}$$

where $T\_0 = E\_0 R$, $\bar{E}\_0 = E\_0 - T\_0 P^{\mathrm{T}}$, and $\bar{Q} = T\_0^{+} F\_0$.
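Given $R$ and $P$ from the latent-variable extraction, the offline decomposition of (9.28)-(9.29) might be sketched as below (names are illustrative; note the convention $F\_0 = T\_0 \bar{Q}^{\mathrm{T}} + \bar{F}\_0$, so $\bar{Q}^{\mathrm{T}} = T\_0^{+} F\_0$):

```python
import numpy as np

def offline_decomposition(E0, F0, R, P):
    """Sketch of Eqs. (9.28)-(9.29) on the normalized training matrices."""
    T0 = E0 @ R                          # scores of the normalized data
    E0_bar = E0 - T0 @ P.T               # input residual matrix, Eq. (9.28)
    Q_bar = (np.linalg.pinv(T0) @ F0).T  # l x d output loading, Qbar^T = T0^+ F0
    F0_bar = F0 - T0 @ Q_bar.T           # output residual matrix, Eq. (9.29)
    return T0, E0_bar, Q_bar, F0_bar
```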

During online monitoring, for new samples $\boldsymbol{x}$ and $\boldsymbol{y}$ (standardized data), an oblique projection is introduced in the input space:

$$\boldsymbol{x} = \hat{\boldsymbol{x}} + \boldsymbol{x}\_e \tag{9.30}$$

$$\hat{\boldsymbol{x}} = \boldsymbol{P}\boldsymbol{R}^{\mathrm{T}}\boldsymbol{x} \tag{9.31}$$

$$\boldsymbol{x}\_e = (\boldsymbol{I} - \boldsymbol{P}\boldsymbol{R}^{\mathrm{T}})\boldsymbol{x}.\tag{9.32}$$

The statistics $\mathrm{T}\_{pc}^{2}$ and $\mathrm{T}\_{e}^{2}$ of the principal component subspace and the residual subspace are calculated as follows:

$$\mathbf{t} = \mathbf{R}^{\mathrm{T}} \mathbf{x} \tag{9.33}$$

$$\mathrm{T}\_{pc}^{2} := \boldsymbol{t}^{\mathrm{T}} \boldsymbol{\Lambda}^{-1} \boldsymbol{t} = \boldsymbol{t}^{\mathrm{T}} \left\{ \frac{1}{n-1} \boldsymbol{T}\_{0}^{\mathrm{T}} \boldsymbol{T}\_{0} \right\}^{-1} \boldsymbol{t} \tag{9.34}$$

$$\mathrm{T}\_{e}^{2} := \boldsymbol{x}\_e^{\mathrm{T}} \boldsymbol{\Lambda}\_{e}^{-1} \boldsymbol{x}\_e = \boldsymbol{x}\_e^{\mathrm{T}} \left\{ \frac{1}{n-1} \bar{\boldsymbol{E}}\_{0}^{\mathrm{T}} \bar{\boldsymbol{E}}\_{0} \right\}^{-1} \boldsymbol{x}\_e,\tag{9.35}$$

where $\Lambda$ and $\Lambda\_e$ are covariance matrices. $\mathrm{T}\_{pc}^{2}$ and $\mathrm{T}\_{e}^{2}$ are statistics with thresholds $\mathrm{Th}\_{pc,\alpha}$ and $\mathrm{Th}\_{e,\alpha}$, respectively. Since the statistics $\mathrm{T}\_{pc}^{2}$ and $\mathrm{T}\_{e}^{2}$ are not obtained from the normalized data $E\_0$ alone, and the variables may not obey a Gaussian distribution, the corresponding thresholds cannot be calculated from the F-distribution. Their probability density functions should therefore first be estimated by nonparametric kernel density estimation (KDE) (Lee et al. 2010).
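As an illustrative sketch (function names and the quantile-grid resolution are our own choices, not from the text), the two statistics and their KDE-based control limits might be computed as follows; here $\hat{\boldsymbol{x}}$ is taken as $P R^{\mathrm{T}} \boldsymbol{x}$, consistent with $\boldsymbol{t} = R^{\mathrm{T}} \boldsymbol{x}$:

```python
import numpy as np
from scipy.stats import gaussian_kde

def t2_statistics(x, R, P, T0, E0_bar, n):
    """Online statistics of Eqs. (9.30)-(9.35) for one standardized sample x."""
    t = R.T @ x                                 # scores, Eq. (9.33)
    x_e = x - P @ (R.T @ x)                     # residual part, Eq. (9.32)
    Lam = (T0.T @ T0) / (n - 1)                 # score covariance
    Lam_e = (E0_bar.T @ E0_bar) / (n - 1)       # residual covariance
    T2_pc = t @ np.linalg.solve(Lam, t)         # Eq. (9.34)
    T2_e = x_e @ np.linalg.pinv(Lam_e) @ x_e    # Eq. (9.35); Lam_e may be singular
    return T2_pc, T2_e

def kde_threshold(stats, alpha=0.9975, grid=4096):
    """Control limit as the alpha-quantile of a KDE fitted to the statistic
    evaluated on normal (fault-free) training data."""
    kde = gaussian_kde(stats)
    xs = np.linspace(0.0, stats.max() * 2, grid)
    cdf = np.cumsum(kde(xs))
    cdf /= cdf[-1]                              # normalize to a CDF
    return xs[np.searchsorted(cdf, alpha)]
```

Because the limit is read off the estimated density rather than an F-distribution, no Gaussian assumption on the scores or residuals is needed.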

The fault diagnosis logic based on the GPLPLS model is as follows:

$$\begin{cases} \mathrm{T}\_{pc}^{2} > \mathrm{Th}\_{pc,\alpha} & \text{Quality-relevant faults} \\ \mathrm{T}\_{pc}^{2} > \mathrm{Th}\_{pc,\alpha} \ \text{or} \ \mathrm{T}\_{e}^{2} > \mathrm{Th}\_{e,\alpha} & \text{Process-relevant faults} \\ \mathrm{T}\_{pc}^{2} \le \mathrm{Th}\_{pc,\alpha} \ \text{and} \ \mathrm{T}\_{e}^{2} \le \mathrm{Th}\_{e,\alpha} & \text{Fault free} \end{cases} \tag{9.36}$$
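The logic of (9.36) can be sketched as a small function (names illustrative). Since a quality-relevant sample also satisfies the process-relevant condition, the more specific verdict is returned first:

```python
def diagnose(T2_pc, T2_e, th_pc, th_e):
    """Fault diagnosis logic of Eq. (9.36)."""
    if T2_pc > th_pc:
        return "quality-relevant fault"   # also process relevant by (9.36)
    if T2_e > th_e:
        return "process-relevant fault"
    return "fault free"
```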

The process monitoring procedure of the GPLPLS algorithm with multiple-input, multiple-output data is as follows:

(1) Standardize the original data $X$ and $Y$. Calculate $T\_0$, $\bar{Q}$, and $R$ based on the GPLPLS algorithm and (9.28)-(9.29). Determine the number of principal components $d$ by cross-validation.


## *9.5.2 Posterior Monitoring and Evaluation*

Many quality-related process monitoring methods have been verified on the well-known TE process simulation platform. The goal of most methods is to make the quality-related alarm rate as high as possible, but the reasonableness of the monitoring result seems to receive little attention. Therefore, similar to the performance evaluation index of a control loop, we introduce a posterior monitoring assessment (PMA) index to evaluate the reasonableness of the quality-related alarm rate. PMA is defined as follows:

$$\text{PMA} = \frac{\mathbb{E}\left(\mathbf{y}\_N^2\right)}{\mathbb{E}\left(\mathbf{y}\_F^2\right)},\tag{9.37}$$

where $\mathbb{E}(\cdot)$ is the mathematical expectation, and $\boldsymbol{y}\_N$ and $\boldsymbol{y}\_F$ are the output data of the training data set and of the fault data set, respectively; both are normalized by the mean and standard deviation of $\boldsymbol{y}\_N$. PMA → 1 indicates that the quality under the fault is close to normal operation, and PMA > 1 indicates that the quality is better than normal. Conversely, a PMA far from 1 means that the quality differs greatly from normal; in that case the corresponding quality-related index T² (PLS method) or $\mathrm{T}\_{pc}^{2}$ (GPLPLS method) should be high, and the others low.

However, widespread feedback controllers reduce the impact of certain faults, especially small ones, so a single PMA indicator cannot truly reflect the dynamic changes. Two PMA indicators are therefore adopted to describe the dynamic and steady-state effects, respectively,

$$\text{PMA}\_{1} = \min \left\{ \frac{\mathbb{E}\left(\mathbf{Y}\_{N}^{2}(k\_{0}:k\_{1},i)\right)}{\mathbb{E}\left(\mathbf{Y}\_{F}^{2}(k\_{0}:k\_{1},i)\right)} \right\}, \quad i = 1,2,\ldots,l \tag{9.38}$$

$$\text{PMA}\_2 = \min \left\{ \frac{\mathbb{E}\left(\mathbf{Y}\_N^2(k\_2:n,i)\right)}{\mathbb{E}\left(\mathbf{Y}\_F^2(k\_2:n,i)\right)} \right\}, \quad i = 1,2,\ldots,l,\tag{9.39}$$

where $k\_0$, $k\_1$, and $k\_2$ are constant sample indices. Note that the worst-case (minimum) strategy is selected to ensure the rationality of the evaluation. Moreover, the two PMA indicators are only used to test whether the previous fault detection results are reasonable. Their evaluations are objective, but unlike detection based on the GPLPLS model, they do not indicate whether a fault is quality related. Quality testing is still necessary for further diagnosis.
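A sketch of the two indicators in (9.38)-(9.39) (array shapes and names are assumptions for illustration):

```python
import numpy as np

def pma_indices(YN, YF, k0, k1, k2):
    """Sketch of PMA_1 and PMA_2, Eqs. (9.38)-(9.39).

    YN, YF     : normal and faulty output data (n x l), both standardized
                 by the mean/std of the normal set
    k0, k1, k2 : sample indices delimiting the dynamic and steady windows
    """
    def pma(a, b):
        # worst-case (minimum) ratio over the l output variables
        return np.min(np.mean(a**2, axis=0) / np.mean(b**2, axis=0))
    pma1 = pma(YN[k0:k1], YF[k0:k1])   # dynamic window
    pma2 = pma(YN[k2:], YF[k2:])       # steady-state window
    return pma1, pma2
```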

## **9.6 TE Process Simulation Analysis**

Process monitoring and fault diagnosis based on the GPLPLS model are tested on the TE simulation platform. The monitoring performance of several models, namely PLS, concurrent projection to latent structures (CPLS) (Qin 2012), and GPLPLS, is compared. In CPLS, the input and output spaces are projected and decomposed into five subspaces: the input principal subspace, input residual subspace, output principal subspace, output residual subspace, and joint input-output subspace. Focusing on quality-related faults only, the principal and residual subspaces of the input are replaced by the input remaining subspace $\boldsymbol{x}\_e$ in the CPLS model, and the corresponding monitoring statistics are replaced by $\mathrm{T}\_{e}^{2}$. The output principal and residual subspaces of the CPLS model are not considered, in order to highlight process-based quality monitoring. Two different data sets are used, from Zhang et al. (2017) and Wang et al. (2017).

## *9.6.1 Model and Discussion*

The input matrix is composed of process variables [XMEAS(1:22)] and manipulated variables [XMV(1:11), except XMV(5) and XMV(9)]. The output matrix is composed of the quality variables [XMEAS(35), XMEAS(36)]. The training data are the normal data IDV(0), and the test data are the 21 fault data sets IDV(1-21). The threshold is calculated at the 99.75% confidence level (see Equation (1.10) for details).

The simulation parameters of the GPLPLS model (specifically the GPLPLS$\_{xy}$ model) are $k\_x = 22$, $k\_y = 23$, $\lambda\_x = \lambda\_y = 0$, $\lambda\_{xy} = 1$, $k\_0 = 161$. Note that the local nonlinear structural features are extracted by the LLE method. The numbers of principal components of the PLS, CPLS, and GPLPLS models are 6, 6, and 2, respectively, determined by cross-validation. $k\_1 = n = 960$, $k\_2 = 701$. The detection results, including FDR, FAR, and the PMA indicators, are listed in Table 9.1.

With the two PMA indices in Table 9.1, the 21 faults are divided into two types: quality-independent faults (PMA$\_1$ > 0.9 or PMA$\_1$ + PMA$\_2$ > 1.5), including IDV(3, 4, 9, 11, 14, 15, 19), and quality-related faults. The quality-related faults are further subdivided into four types:

**Type 1**: the fault has a slight impact on quality [IDV(10, 16, 17, and 20)], $0.5 < \mathrm{PMA}\_i < 0.8$, $i = 1, 2$.

**Type 2**: the fault is quality recoverable [IDV(1, 5, and 7)], $\mathrm{PMA}\_1 < 0.35$ and $\mathrm{PMA}\_2 > 0.65$.

**Type 3**: the fault has a serious impact on quality [IDV(2, 6, 8, 12, 13, and 18)], $\mathrm{PMA}\_i < 0.1$, $i = 1, 2$.

**Type 4**: the fault causes the output variables to drift slowly [IDV(21)].


**Table 9.1** FDRs of PLS, CPLS, GPLPLS*x y* , and PMA

Although this classification is only a preliminary result that depends on the choice of the parameters $k\_0$, $k\_1$, and $k\_2$, it still has reference value. All methods show consistent results for the serious quality-related faults, which are therefore not discussed in the following fault detection analysis.

## *9.6.2 Fault Diagnosis Analysis*

From the above results, it is found that for some faults the detection results are not consistent across methods; these include quality-recoverable faults, slight quality-related faults, and quality-independent faults. A detailed analysis of the three situations is given below. In all monitoring graphs, the horizontal axis represents the sample, the vertical axis represents the statistic (the upper panel shows $\mathrm{T}\_{pc}^{2}$, the lower panel shows $\mathrm{T}\_{e}^{2}$), the red dotted line is the threshold at the 99.75% confidence level, and the blue line is the actual monitoring value. In all prediction graphs, the horizontal axis represents the sample, the vertical axis represents the output value, the blue dashed line is the actual value, and the green line is the prediction.

**Fig. 9.3** Output prediction for IDV(1), IDV(5), and IDV(7) using the GPLPLS$\_{xy}$ method

(1) Quality-recoverable fault

Quality-recoverable faults include IDV(1), IDV(5), and IDV(7). They are all step-change faults, but a feedback or cascade controller can reduce their effect on quality during the actual process. Therefore, the quality variables under faults IDV(1), IDV(5), and IDV(7) should return to normal. The output prediction is shown in Fig. 9.3. As an example, the corresponding fault monitoring results for IDV(7) are shown in Fig. 9.4 for the PLS, CPLS, and GPLPLS$\_{xy}$ models, respectively. Here the statistics $\mathrm{T}\_{pc}^{2}$ and $\mathrm{T}\_{e}^{2}$ monitor the input space for process-related faults. For the GPLPLS$\_{xy}$ model, the $\mathrm{T}\_{pc}^{2}$ statistic returns to its normal value, while the $\mathrm{T}\_{e}^{2}$ statistic still maintains a high value. This means that these faults are quality recoverable. PLS and CPLS report these faults as quality related but give many false alarms, especially for IDV(7): their $\mathrm{T}\_{pc}^{2}$ values are very close to the threshold yet still exceed it. They continue to indicate a fault alarm even after the operation has returned to normal under the controller, and thus fail to grasp the essence of fault detection with recoverable quality. In this case, the GPLPLS$\_{xy}$ method can accurately reflect the process and quality changes.

**Fig. 9.4** PLS, CPLS, and GPLPLS$\_{xy}$ monitoring results for IDV(7)

**Fig. 9.5** Output prediction for IDV(4), IDV(11), and IDV(14) using the GPLPLS$\_{xy}$ method

(2) Quality-independent fault

Quality-independent faults include IDV(4), IDV(11), and IDV(14); they are nonetheless process related. All of these faults involve the reactor cooling water, and such disturbances hardly affect the quality of the output products. The corresponding output quality prediction of the GPLPLS$\_{xy}$ method is shown in Fig. 9.5. The monitoring results for IDV(14) by the PLS, CPLS, and GPLPLS$\_{xy}$ methods are shown in Fig. 9.6. In the GPLPLS$\_{xy}$ model, $\mathrm{T}\_{pc}^{2}$ stays almost entirely under the threshold, which indicates that these faults are not related to quality. For the PLS and CPLS models, however, these faults are detected in both $\mathrm{T}\_{pc}^{2}$ and $\mathrm{T}\_{e}^{2}$; in other words, PLS and CPLS suggest that these disturbances are quality related. Compared with PLS, the CPLS method can filter out fault alarms in $\mathrm{T}\_{pc}^{2}$ to a certain extent, but it still alarms more often than GPLPLS$\_{xy}$. For quality-independent faults, PLS and CPLS have a high detection rate but fail to identify the faults as quality independent.

(3) Slight quality-related faults

Faults such as IDV(10), IDV(16), IDV(17), and IDV(20) have a slight impact on quality. Few studies address this type of fault. Their quality-related alarm rates are similar to those of quality-recoverable faults. Although they are quality related, they have little impact on quality, and their $\mathrm{T}\_{pc}^{2}$ values are relatively small. To some extent, these faults can also be regarded as quality independent. Many methods, such as the PLS method, fail to detect them accurately. The output prediction values of the GPLPLS$\_{xy}$ model are shown in Fig. 9.7. The monitoring results of the three models for fault IDV(20) are shown in Fig. 9.8. It can be seen that the monitoring results of the GPLPLS$\_{xy}$ model are the most accurate, whereas the PLS and CPLS models give false alarms. In the GPLPLS$\_{xy}$ model, the process changes better match the quality changes.

**Fig. 9.6** PLS, CPLS, and GPLPLS$\_{xy}$ monitoring results for IDV(14)

**Fig. 9.7** Output predicted values for IDV(16), IDV(17), and IDV(20) using the GPLPLS$\_{xy}$ method

**Fig. 9.8** PLS, CPLS, and GPLPLS$\_{xy}$ monitoring results for IDV(20)

From the three situations analyzed above, it can be seen that the GPLPLS method can filter out harmful alarms. It can handle slight quality-related faults, quality-independent faults, and quality-recoverable faults. There are two possible reasons for the good fault diagnosis performance of the GPLPLS method: first, its principal components are based on global features plus nonlinear local structural features, which enhances its nonlinear mapping ability; second, the GPLPLS method uses a non-Gaussian (KDE-based) threshold, which allows it to handle signals that do not necessarily satisfy the Gaussian assumption.

## *9.6.3 Comparison of Different GPLPLS Models*

For the same data set as above, the FDRs of the other three models, GPLPLS$\_x$, GPLPLS$\_y$, and GPLPLS$\_{x+y}$ (local nonlinear structural features are all extracted by the LLE method), are shown in Table 9.2, where $K = [k\_x, k\_y]$. It can be seen from the table that the results of these methods are very good and lead to consistent conclusions. In particular, the FDRs of the GPLPLS$\_{x+y}$ model and the GPLPLS$\_{xy}$ model are very similar.

**Table 9.2** FDRs of GPLPLS methods with LLE local feature

To discuss these models more clearly, fault IDV(7) is selected for further analysis. It can be seen from Table 9.2 that the monitoring results of the GPLPLS$\_y$ model for IDV(7) are obviously inconsistent with the other methods: its $\mathrm{T}\_{pc}^{2}$ statistic gives a high alarm rate (79.25%). According to the previous analysis, this is an annoying false alarm. The other three models have relatively low alarm rates for fault IDV(7), near 26%, which means their monitoring effect is very good. A possible reason for the false alarm is that the GPLPLS$\_y$ model only enhances the local nonlinear structural features of the output space; it treats the input space as linear and only the output space as nonlinear. If the input space were indeed linear, the monitoring results might be better. However, the input space of the TE simulation process may also be strongly nonlinear, which leads to the poor monitoring results of the GPLPLS$\_y$ model, while the other three models show higher consistency for this type of fault.

The above results of the GPLPLS models are obtained by combining them with the LLE method to retain local nonlinear structural features. Below, the monitoring results of the GPLPLS model combined with another locality-preserving algorithm, the LPP method, are given in Table 9.3, where $\Sigma = [\sigma\_x, \sigma\_y]$. Table 9.3 leads to consistent conclusions, so a detailed analysis is omitted here.

Many methods share a similar idea of fusing global projection and local preservation, such as GLPLS, LPPLS, and others. These methods all require tuning parameters, and different parameters give different results. To be as consistent as possible with the existing results of other methods, we chose the same data set as in Wang et al. (2017) for the following tests.


**Table 9.3** FDRs of GPLPLS methods with LPP local feature

In the following comparison experiment, the input variable matrix $X$ is composed of the process variables [XMEAS(1:22)] and 11 manipulated variables [XMEAS(23:33)], except XMV(12). The quality variable matrix $Y$ includes XMEAS(35) and XMEAS(38). The parameters of the models based on the combination of manifold learning algorithms and PLS are set as follows:

(1) The GLPLS model: $\delta\_x = 0.1$, $\delta\_y = 0.8$, $k\_x = 12$, $k\_y = 12$.

(2) The LPPLS model: $\delta\_x = 1.5$, $\delta\_y = 0.8$, $k\_x = 20$, $k\_y = 15$.

(3) The GPLPLS model: $k\_x = 11$, $k\_y = 16$ (mainly referring to the GPLPLS$\_{xy}$ model).

Table 9.4 lists the FDR values of the different quality-related monitoring methods, corresponding to the PLS, CPLS, GLPLS, and GPLPLS models; the corresponding detection thresholds are calculated at the 99.75% confidence level. The last two columns are the PMA values calculated for this data set.

It can be seen from Tables 9.1 and 9.4 that although the data sets are different, the PMA results are similar. Therefore, the quality-related monitoring results should also be similar, and the GPLPLS model clearly gives consistent conclusions. The higher FDRs of the other models compared with GPLPLS stem from their inability to distinguish whether these faults are quality related. Although GLPLS follows a similar idea of fusing global features and local structure, its weak monitoring performance is caused by inappropriate parameters and model construction. Because suitable parameters are difficult to select, the parameter determination method remains an open issue.

**Table 9.4** FDRs comparison for different quality-related methods

In summary, the GPLPLS model shows good monitoring performance. It suitably combines global and local structural features, so its output prediction and fault monitoring results are better than those of the other models.

# **9.7 Conclusions**

This chapter proposes a new statistical monitoring method based on the global plus local projection to latent structures (GPLPLS) model. This model not only maintains the global and local structural characteristics of the data, but also pays more attention to the correlation between the extracted principal components. First, the GLPLS method is introduced and the unreasonable aspects of its model construction are pointed out; the GPLPLS method is then proposed to maintain the global and local features with a new structure. A monitoring model based on the GPLPLS method is established, and its monitoring performance is verified on the TE process simulation platform. The results show that, compared with the PLS, CPLS, and GLPLS methods, the GPLPLS method has better process monitoring performance for quality-related faults.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 10 Locality-Preserving Partial Least Squares Regression**


This chapter proposes another nonlinear PLS method, named locality-preserving partial least squares (LPPLS), which embeds the nonlinear degenerative and structure-preserving properties of LPP into the PLS model. The core of LPPLS is to replace the role of PCA in PLS with LPP. When extracting the principal components $\boldsymbol{t}\_i$ and $\boldsymbol{u}\_i$, two conditions must be satisfied: (1) $\boldsymbol{t}\_i$ and $\boldsymbol{u}\_i$ retain the most information about the local nonlinear structure of their respective data sets; (2) the correlation between $\boldsymbol{t}\_i$ and $\boldsymbol{u}\_i$ is maximal. Finally, a quality-related monitoring strategy is established based on LPPLS.

First, the geometric interpretation of PCA, PLS, and LPP is introduced, and the LPPLS model and the LPPLS-based quality-related process monitoring method are proposed. Three different types of LPPLS models are given within the same framework, addressing three nonlinear cases: nonlinear correlation in the input space $X$, in the output space $Y$, and between them. A typical algorithm for extracting the principal components is derived. Then, the feasibility and effectiveness of the LPPLS method are verified by artificial 3-D data and Tennessee Eastman process simulations.

## **10.1 The Relationship Among PCA, PLS, and LPP**

For the normalized data sets of process variables $X = [\boldsymbol{x}^{\mathrm{T}}(1), \boldsymbol{x}^{\mathrm{T}}(2), \ldots, \boldsymbol{x}^{\mathrm{T}}(n)]^{\mathrm{T}} \in R^{n \times m}$ ($\boldsymbol{x} \in R^{1 \times m}$) and quality variables $Y = [\boldsymbol{y}^{\mathrm{T}}(1), \boldsymbol{y}^{\mathrm{T}}(2), \ldots, \boldsymbol{y}^{\mathrm{T}}(n)]^{\mathrm{T}} \in R^{n \times l}$ ($\boldsymbol{y} \in R^{1 \times l}$), where $m$ and $l$ are the dimensions of the process and quality variable spaces and $n$ is the number of samples, the principal component extraction of PCA, LPP, and PLS is equivalent to the following constrained optimization problems.

$$\begin{aligned} J\_{\mathrm{PCA}}(\boldsymbol{w}) &= \max\_{\boldsymbol{w}} \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{X} \boldsymbol{w} \\ &\text{s.t.} \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{w} = 1 \end{aligned} \tag{10.1}$$

$$\begin{aligned} J\_{\mathrm{LPP}}(\boldsymbol{w}) &= \max\_{\boldsymbol{w}} \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{S}\_{x} \boldsymbol{X} \boldsymbol{w} \\ &\text{s.t.} \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{D}\_{x} \boldsymbol{X} \boldsymbol{w} = 1 \end{aligned} \tag{10.2}$$

$$\begin{aligned} J\_{\mathrm{PLS}}(\boldsymbol{w}, \boldsymbol{c}) &= \max \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{Y} \boldsymbol{c} \\ &\text{s.t.} \ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{w} = 1, \ \boldsymbol{c}^{\mathrm{T}} \boldsymbol{c} = 1 \end{aligned} \tag{10.3}$$

The meanings of the related variables such as $\boldsymbol{w}$ and $\boldsymbol{c}$ were given in Chap. 9. Also in Chap. 9, to weaken the limitation that PLS lacks local feature extraction capability, the input space $X$ and the output space $Y$ are mapped into new feature spaces $X\_F$ and $Y\_F$ that include a global linear subspace and a number of local linear subspaces. Consequently, the following optimization objective of the global plus local projection to latent structures (GPLPLS) method is immediately obtained by using the feature spaces $X\_F$ and $Y\_F$ to replace the original spaces $X$ and $Y$,

$$\begin{aligned} J\_{\mathrm{GPLPLS}}(\boldsymbol{w}, \boldsymbol{c}) &= \arg\max \{ \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}\_{F}^{\mathrm{T}} \boldsymbol{Y}\_{F} \boldsymbol{c} \} \\ &\text{s.t.} \quad \boldsymbol{w}^{\mathrm{T}} \boldsymbol{w} = 1, \ \boldsymbol{c}^{\mathrm{T}} \boldsymbol{c} = 1, \end{aligned} \tag{10.4}$$

where $X\_F = X + \lambda\_x \boldsymbol{\theta}\_x^{\frac{1}{2}}$ and $Y\_F = Y + \lambda\_y \boldsymbol{\theta}\_y^{\frac{1}{2}}$.

Although adding local features to the global features makes the GPLPLS model show excellent performance in fault detection, the GPLPLS model does not fully implement local feature extraction; its local features are only approximately extracted. The main reason is that the constraint of the GPLPLS model is still that of PCA or PLS. In general, this way of combining cannot guarantee the constraints of PCA and LPP at the same time.

Only the nonlinear part of the function is described by the local features; the linear part is still characterized by the traditional covariance matrix in Chap. 9. In fact, the linear part can also be described by local characteristics. In this way, the linear and nonlinear parts can be treated as a whole, avoiding unnecessary parameter trade-offs. In the following, we analyze the differences and similarities between PCA and LPP.

The local characteristics of *X* in LPP are contained in the matrices $\boldsymbol{X}^{\mathrm{T}} \boldsymbol{S}\_x \boldsymbol{X}$ and $\boldsymbol{X}^{\mathrm{T}} \boldsymbol{D}\_x \boldsymbol{X}$. To study the similarity of LPP and PCA, the matrices $\boldsymbol{S}\_x$ and $\boldsymbol{D}\_x$ are decomposed as $\boldsymbol{S}\_x = \boldsymbol{S}\_x^{\frac{1}{2}\mathrm{T}} \boldsymbol{S}\_x^{\frac{1}{2}}$ and $\boldsymbol{D}\_x = \boldsymbol{D}\_x^{\frac{1}{2}\mathrm{T}} \boldsymbol{D}\_x^{\frac{1}{2}}$, respectively. The LPP criterion (10.2) is then transformed into

$$\begin{aligned} J\_{\text{LPP}}(\boldsymbol{w}) &= \max \; \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}\_{M}^{\mathrm{T}} \boldsymbol{X}\_{M} \boldsymbol{w} \\ &\text{s.t. } \boldsymbol{w}^{\mathrm{T}} \boldsymbol{M}\_{x}^{\mathrm{T}} \boldsymbol{M}\_{x} \boldsymbol{w} = 1, \end{aligned} \tag{10.5}$$

where $\boldsymbol{M}\_x = \boldsymbol{D}\_x^{\frac{1}{2}} \boldsymbol{X}$ and $\boldsymbol{X}\_M = \boldsymbol{S}\_x^{\frac{1}{2}} \boldsymbol{X}$.
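The mappings in (10.5) are easy to verify numerically. The following sketch (ours, not a reference implementation) forms $X\_M = S\_x^{1/2} X$ and $M\_x = D\_x^{1/2} X$, assuming $S\_x$ is symmetric positive semi-definite so that a real matrix square root exists, and $D\_x$ is the diagonal degree matrix:

```python
import numpy as np
from scipy.linalg import sqrtm

def neighborhood_maps(X, S, D):
    """Form X_M = S^(1/2) X and M_x = D^(1/2) X as in (10.5).
    S: symmetric PSD similarity matrix; D: its diagonal degree matrix."""
    X_M = np.real(sqrtm(S)) @ X              # sqrtm may carry tiny imaginary parts
    M_x = np.sqrt(np.diag(D))[:, None] * X   # D is diagonal, so its root is elementwise
    return X_M, M_x
```

By construction $X\_M^{\mathrm{T}} X\_M = X^{\mathrm{T}} S\_x X$ and $M\_x^{\mathrm{T}} M\_x = X^{\mathrm{T}} D\_x X$, which is exactly why (10.2) and (10.5) are equivalent.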

Comparing (10.5) and (10.1), it can be found that the structure in the mathematical description of the optimization problem of LPP and PCA is similar. "PCA selects a subspace consisting of the eigenvectors corresponding to the largest eigenvalues of the global covariance matrix, while LPP selects a subspace consisting of the eigenvectors corresponding to the smallest eigenvalues of the local covariance matrix (He et al. 2005)". Therefore, LPP can replace PCA in the PLS decomposition process, thus achieving the preservation of strong local nonlinearity.

In the PLS criterion (10.3), PCA is used to extract a set of components that transform the original data *X* into a set of t-scores *T* when forming latent variables. PCA and PLS extract only global linear features and therefore do not reflect the local information of the samples or their nonlinear features. Actually, PCA is not the only method for extracting principal components. LPP, which converts global nonlinearity into a combination of multiple local linearities, can also be used to extract principal components. Therefore, LPP is suitable for systems with strong local nonlinear features.
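Viewed this way, computing the LPP directions of (10.2) amounts to solving the generalized symmetric eigenproblem $X^{\mathrm{T}} S\_x X w = \lambda\, X^{\mathrm{T}} D\_x X w$ and keeping the leading eigenvectors. A minimal sketch with an illustrative heat-kernel neighborhood graph (the parameters `n_neighbors` and `sigma` and the small regularization term are our assumptions, not values from the text):

```python
import numpy as np
from scipy.linalg import eigh

def lpp_directions(X, n_neighbors=5, sigma=1.0, n_components=2):
    """Sketch of LPP as in (10.2): maximize w^T X^T S X w
    subject to w^T X^T D X w = 1, via a generalized eigenproblem."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:n_neighbors + 1]       # k nearest neighbors of sample i
        S[i, idx] = np.exp(-d2[i, idx] / (2 * sigma ** 2))  # heat-kernel weights
    S = np.maximum(S, S.T)                               # symmetrize the graph
    D = np.diag(S.sum(axis=1))                           # degree matrix
    A = X.T @ S @ X
    B = X.T @ D @ X + 1e-8 * np.eye(X.shape[1])          # regularize for invertibility
    vals, vecs = eigh(A, B)                              # generalized symmetric eigenproblem
    order = np.argsort(vals)[::-1]                       # largest eigenvalues first
    return vecs[:, order[:n_components]]
```

PCA, for comparison, maximizes $w^{\mathrm{T}} X^{\mathrm{T}} X w$ subject to $w^{\mathrm{T}} w = 1$, i.e., the same structure with $S\_x$ replaced by the identity and the constraint matrix by $I$.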

## **10.2 LPPLS Models and LPPLS-Based Fault Detection**

## *10.2.1 The LPPLS Models*

Based on (10.3), the two criteria for selecting the latent vectors $u\_i$ and $t\_i$ for PLS are as follows:

(1) The latent vectors should carry as much of the variation in *X* and *Y* as possible;

(2) The correlation between the latent vectors should be as strong as possible.

The optimization objective for extracting the first component pair ($t\_1$, $u\_1$) is

$$\begin{aligned} J\_{\text{PLS}}(\boldsymbol{w}\_{1}, \boldsymbol{c}\_{1}) &= \max \, \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{Y} \boldsymbol{c}\_{1} \\ \text{s.t.} \quad \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{w}\_{1} &= 1, \, \boldsymbol{c}\_{1}^{\mathrm{T}} \boldsymbol{c}\_{1} = 1. \end{aligned} \tag{10.6}$$

The optimization objective (10.6) is used for fast extraction of principal components in PLS. Define $E\_0 = X$ and $F\_0 = Y$; then the latent variables $t\_1$ and $u\_1$ are calculated by $t\_1 = E\_0 w\_1$ and $u\_1 = F\_0 c\_1$, where $w\_1$ and $c\_1$ are the eigenvectors corresponding to the maximum eigenvalues of the following matrices:

$$E\_0^\mathrm{T} F\_0 F\_0^\mathrm{T} E\_0 \mathfrak{w}\_1 = \theta\_1^2 \mathfrak{w}\_1 \tag{10.7}$$

$$\boldsymbol{F}\_0^\mathrm{T} \boldsymbol{E}\_0 \boldsymbol{E}\_0^\mathrm{T} \boldsymbol{F}\_0 \mathbf{c}\_1 = \theta\_1^2 \mathbf{c}\_1. \tag{10.8}$$
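Since (10.7) and (10.8) identify $w\_1$ and $c\_1$ as the dominant left and right singular vectors of $E\_0^{\mathrm{T}} F\_0$, the first PLS component pair can be sketched with a single SVD (a generic illustration of the criterion, not the book's NIPALS code):

```python
import numpy as np

def pls_first_pair(X, Y):
    """First PLS weight pair via (10.7)-(10.8): w1 and c1 are the dominant
    eigenvectors of E0^T F0 F0^T E0 and F0^T E0 E0^T F0, i.e. the leading
    singular vectors of E0^T F0."""
    E0, F0 = X, Y
    U, s, Vt = np.linalg.svd(E0.T @ F0, full_matrices=False)
    w1, c1 = U[:, 0], Vt[0, :]   # theta_1 = s[0] is the objective value in (10.6)
    t1 = E0 @ w1                 # input score
    u1 = F0 @ c1                 # output score
    return w1, c1, t1, u1
```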

Considering the similarity between LPP and PCA discussed in the previous section, LPP is used instead of PCA to extract the principal components (10.3) in the PLS decomposition, giving the LPPLS model. Three LPPLS models (types I, II, and III) are developed to address different nonlinear relationships.

The type I LPPLS model deals with the case where the input space *X* has a nonlinear relationship while the correlation between the input *X* and the output *Y* is linear. The principal components of the input space *X* of type I LPPLS are extracted by LPP, and the principal components of the output space *Y* are extracted by PCA. The optimization objective is as follows:

$$\begin{aligned} J\_{\text{LPPLS}\_{\text{I}}}(\boldsymbol{w}, \boldsymbol{c}) &= \max \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}\_M^{\mathrm{T}} \boldsymbol{Y} \boldsymbol{c} \\ \text{s.t. } &\boldsymbol{c}^{\mathrm{T}} \boldsymbol{c} = 1, \; \boldsymbol{w}^{\mathrm{T}} \boldsymbol{M}\_x^{\mathrm{T}} \boldsymbol{M}\_x \boldsymbol{w} = 1. \end{aligned} \tag{10.9}$$

The type II LPPLS model deals with a nonlinear correlation between the input space *X* and the output space *Y*, but a linear correlation within the input space *X*. The principal components of the input space *X* are extracted by PCA, and the principal components of the output space *Y* are extracted by LPP. The optimization function is

$$\begin{aligned} J\_{\text{LPPLS}\_{\text{II}}}(\boldsymbol{w}, \boldsymbol{c}) &= \max \, \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{Y}\_{M} \boldsymbol{c} \\ \text{s.t.} \quad \boldsymbol{w}^{\mathrm{T}} \boldsymbol{w} &= 1, \, \boldsymbol{c}^{\mathrm{T}} \boldsymbol{M}\_{y}^{\mathrm{T}} \boldsymbol{M}\_{y} \boldsymbol{c} = 1 \end{aligned} \tag{10.10}$$

in which

$$\begin{aligned} \boldsymbol{Y}\_M &= \boldsymbol{S}\_y^{\frac{1}{2}} \boldsymbol{Y}, \quad \boldsymbol{S}\_y = \boldsymbol{S}\_y^{\frac{1}{2}\mathrm{T}} \boldsymbol{S}\_y^{\frac{1}{2}} \\ \boldsymbol{M}\_y &= \boldsymbol{D}\_y^{\frac{1}{2}} \boldsymbol{Y}, \quad \boldsymbol{D}\_y = \boldsymbol{D}\_y^{\frac{1}{2}\mathrm{T}} \boldsymbol{D}\_y^{\frac{1}{2}} \end{aligned}$$

where $S\_y$ and $D\_y$ are defined similarly to $S\_x$ and $D\_x$, with a different neighborhood parameter $\delta\_y$ in (9.8).

The type III LPPLS model is given for a nonlinear correlation between the input space *X* and the output space *Y* as well as within the input space *X*. In this case, the principal components of the input space *X* and the output space *Y* are both extracted by LPP. The corresponding optimization objective function is

$$\begin{aligned} J\_{\text{LPPLS}\_{\text{III}}}(\boldsymbol{w}, \boldsymbol{c}) &= \max \boldsymbol{w}^{\mathrm{T}} \boldsymbol{X}\_{M}^{\mathrm{T}} \boldsymbol{Y}\_{M} \boldsymbol{c} \\ \text{s.t. } \boldsymbol{w}^{\mathrm{T}} \boldsymbol{M}\_{x}^{\mathrm{T}} \boldsymbol{M}\_{x} \boldsymbol{w} &= 1, \; \boldsymbol{c}^{\mathrm{T}} \boldsymbol{M}\_{y}^{\mathrm{T}} \boldsymbol{M}\_{y} \boldsymbol{c} = 1. \end{aligned} \tag{10.11}$$

The criteria for the selection of latent vectors *u<sup>i</sup>* and *t<sup>i</sup>* for type III LPPLS are as follows:

(1) The nonlinear variation on the latent vector is manifested as much as possible;

(2) The correlation between latent vectors is as strong as possible.

**Discussion** One of the aims of GLPLS is to choose factors $u\_i$ and $t\_i$ that better represent the nonlinear variation of the factor changes. The optimization objective of GLPLS is given in (10.12) (Zhong et al. 2016).

$$\begin{aligned} J\_{\text{GLPLS}}(\boldsymbol{\mathfrak{w}}, \boldsymbol{\mathfrak{c}}) &= \max \left\{ \boldsymbol{\mathfrak{w}}^{\text{T}} \boldsymbol{X}^{\text{T}} \boldsymbol{Y} \boldsymbol{\mathfrak{c}} + \beta\_{1} \boldsymbol{\mathfrak{w}}^{\text{T}} \boldsymbol{X}\_{M}^{\text{T}} \boldsymbol{X}\_{M} \boldsymbol{\mathfrak{w}} + \beta\_{2} \boldsymbol{\mathfrak{c}}^{\text{T}} \boldsymbol{Y}\_{M}^{\text{T}} \boldsymbol{Y}\_{M} \boldsymbol{\mathfrak{c}} \right\} \\ &\text{ s.t. } \ \boldsymbol{\mathfrak{w}}^{\text{T}} \boldsymbol{\mathfrak{w}} = 1, \ \boldsymbol{\mathfrak{c}}^{\text{T}} \boldsymbol{\mathfrak{c}} = 1, \end{aligned} \tag{10.12}$$

where the parameters β₁ and β₂ trade off global against local feature extraction. Here the embedding properties and data screening of LPP are lost, because the LPP constraints $\boldsymbol{w}^{\mathrm{T}}\boldsymbol{X}^{\mathrm{T}}\boldsymbol{D}\_x\boldsymbol{X}\boldsymbol{w} = 1$ and $\boldsymbol{c}^{\mathrm{T}}\boldsymbol{Y}^{\mathrm{T}}\boldsymbol{D}\_y\boldsymbol{Y}\boldsymbol{c} = 1$ are removed in (10.12). The GLPLS model is thus a fusion of the PLS model with a partial LPP model. "The best vectors *w* and *c* from (10.12) ensure maximum correlation (PLS) and relative or local optimal data filtering and embedding capabilities for *X* and *Y* (Zhong et al. 2016)". On the other hand, $\boldsymbol{w}^{\mathrm{T}}\boldsymbol{X}^{\mathrm{T}}\boldsymbol{S}\_x\boldsymbol{X}\boldsymbol{w}$ and $\boldsymbol{c}^{\mathrm{T}}\boldsymbol{Y}^{\mathrm{T}}\boldsymbol{S}\_y\boldsymbol{Y}\boldsymbol{c}$ only introduce the local features of the input and output spaces, not the correlation features between them. In contrast, the LPP model is fully embedded in the LPPLS model: it is embedded in the outer layer, the inner layer, or both layers of the PLS model, giving the three types of LPPLS models, while the correlation information between the input and output spaces is retained.

Type III LPPLS is used as an example to show the extraction of principal components. Suppose the first component pair is ($t\_1$, $u\_1$). Define $E\_{0L} = X\_M$ and $F\_{0L} = Y\_M$ to facilitate comparison with traditional linear PLS.

First, the optimization (10.11) for the first component pair ($t\_1$, $u\_1$) is converted into an unconstrained problem by the Lagrangian multiplier method,

$$\Psi(\boldsymbol{w}\_{1}, \boldsymbol{c}\_{1}) = \boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{E}\_{0L}^{\mathrm{T}} \boldsymbol{F}\_{0L} \boldsymbol{c}\_{1} - \lambda\_{1} (\boldsymbol{w}\_{1}^{\mathrm{T}} \boldsymbol{M}\_{x}^{\mathrm{T}} \boldsymbol{M}\_{x} \boldsymbol{w}\_{1} - 1) - \lambda\_{2} (\boldsymbol{c}\_{1}^{\mathrm{T}} \boldsymbol{M}\_{y}^{\mathrm{T}} \boldsymbol{M}\_{y} \boldsymbol{c}\_{1} - 1). \tag{10.13}$$

Let $\frac{\partial \Psi}{\partial \boldsymbol{w}\_1} = 0$ and $\frac{\partial \Psi}{\partial \boldsymbol{c}\_1} = 0$; then the optimal pair $\boldsymbol{w}\_1$ and $\boldsymbol{c}\_1$ satisfies

$$E\_{0L}^{\rm T} F\_{0L} \mathbf{c}\_1 = 2\lambda\_1 \mathbf{M}\_x^{\rm T} \mathbf{M}\_x \mathbf{w}\_1 \tag{10.14}$$

$$\boldsymbol{F}\_{0L}^{\mathrm{T}} \boldsymbol{E}\_{0L} \boldsymbol{w}\_1 = 2\lambda\_2 \boldsymbol{M}\_{y}^{\mathrm{T}} \boldsymbol{M}\_{y} \boldsymbol{c}\_1. \tag{10.15}$$

Multiplying (10.14) and (10.15) on the left by $\boldsymbol{w}\_1^{\mathrm{T}}$ and $\boldsymbol{c}\_1^{\mathrm{T}}$, respectively, gives

$$\boldsymbol{\theta}\_{1} \coloneqq 2\lambda\_{1} = 2\lambda\_{2} = \boldsymbol{\mathfrak{w}}\_{1}^{\mathrm{T}} \boldsymbol{E}\_{0L}^{\mathrm{T}} \boldsymbol{F}\_{0L} \mathbf{c}\_{1} = \boldsymbol{\mathfrak{c}}\_{1}^{\mathrm{T}} \boldsymbol{F}\_{0L}^{\mathrm{T}} \boldsymbol{E}\_{0L} \boldsymbol{w}\_{1}. \tag{10.16}$$

Comparing (10.11) and (10.16), it is found that θ₁ is the value of the objective function. Substituting (10.16) into (10.14) and (10.15), the relationship between $\boldsymbol{w}\_1$ and $\boldsymbol{c}\_1$ is obtained:

$$\boldsymbol{w}\_{1} = \frac{1}{\theta\_{1}} (\boldsymbol{M}\_{x}^{\mathrm{T}} \boldsymbol{M}\_{x})^{-1} \boldsymbol{E}\_{0L}^{\mathrm{T}} \boldsymbol{F}\_{0L} \boldsymbol{c}\_{1} \tag{10.17}$$

$$\boldsymbol{c}\_{1} = \frac{1}{\theta\_{1}} (\boldsymbol{M}\_{y}^{\mathrm{T}} \boldsymbol{M}\_{y})^{-1} \boldsymbol{F}\_{0L}^{\mathrm{T}} \boldsymbol{E}\_{0L} \boldsymbol{w}\_{1}. \tag{10.18}$$

Substituting (10.18) into (10.14) and (10.17) into (10.15), the following equations for the first vector pair are obtained:

$$(\boldsymbol{M}\_x^{\mathrm{T}} \boldsymbol{M}\_x)^{-1} \boldsymbol{E}\_{0L}^{\mathrm{T}} \boldsymbol{F}\_{0L} (\boldsymbol{M}\_y^{\mathrm{T}} \boldsymbol{M}\_y)^{-1} \boldsymbol{F}\_{0L}^{\mathrm{T}} \boldsymbol{E}\_{0L} \boldsymbol{w}\_1 = \theta\_1^2 \boldsymbol{w}\_1 \tag{10.19}$$

$$(\boldsymbol{M}\_{y}^{\mathrm{T}}\boldsymbol{M}\_{y})^{-1}\boldsymbol{F}\_{0L}^{\mathrm{T}}\boldsymbol{E}\_{0L}(\boldsymbol{M}\_{x}^{\mathrm{T}}\boldsymbol{M}\_{x})^{-1}\boldsymbol{E}\_{0L}^{\mathrm{T}}\boldsymbol{F}\_{0L}\boldsymbol{c}\_{1}=\theta\_{1}^{2}\boldsymbol{c}\_{1}.\tag{10.20}$$

The optimal weight vectors $\boldsymbol{w}\_1$ and $\boldsymbol{c}\_1$ are the eigenvectors corresponding to the maximum eigenvalues of (10.19) and (10.20). The latent variables $t\_1$ and $u\_1$ are then calculated as follows:

$$\boldsymbol{t}\_1 = \boldsymbol{E}\_{0L} \boldsymbol{w}\_1, \ \boldsymbol{u}\_1 = \boldsymbol{F}\_{0L} \boldsymbol{c}\_1.$$

Calculation of the loading vectors:

$$p\_1 = \frac{E\_{0L}^{\mathrm{T}} t\_1}{\|t\_1\|^2}, \ \bar{q}\_1 = \frac{F\_{0L}^{\mathrm{T}} t\_1}{\|t\_1\|^2}.$$

The residual matrices $E\_{1L}$ and $F\_{1L}$ are

$$\boldsymbol{E}\_{1L} = \boldsymbol{E}\_{0L} - \boldsymbol{t}\_1 \boldsymbol{p}\_1^{\mathrm{T}},\ \ \boldsymbol{F}\_{1L} = \boldsymbol{F}\_{0L} - \boldsymbol{t}\_1 \bar{\boldsymbol{q}}\_1^{\mathrm{T}}.$$

The first optimal weight vector $\boldsymbol{w}\_1$ of PLS (10.7) is the eigenvector of the matrix $\boldsymbol{E}\_0^{\mathrm{T}} \boldsymbol{F}\_0 \boldsymbol{F}\_0^{\mathrm{T}} \boldsymbol{E}\_0$, while in LPPLS (10.19) it corresponds to the eigenvector of the matrix $(\boldsymbol{M}\_x^{\mathrm{T}} \boldsymbol{M}\_x)^{-1} \boldsymbol{E}\_{0L}^{\mathrm{T}} \boldsymbol{F}\_{0L} (\boldsymbol{M}\_y^{\mathrm{T}} \boldsymbol{M}\_y)^{-1} \boldsymbol{F}\_{0L}^{\mathrm{T}} \boldsymbol{E}\_{0L}$. The maximum-eigenvalue optimization problem in (10.19) is very similar to that of traditional linear PLS. Therefore, the traditional NIPALS technique is convenient for extracting the remaining principal components.

The other latent variables are calculated based on the residual matrices $\boldsymbol{E}\_{iL}$ and $\boldsymbol{F}\_{iL}$, $i = 1, 2, \ldots, d-1$:

$$t\_{i+1} = E\_{iL} \mathfrak{w}\_{i+1}, \ \mathfrak{u}\_{i+1} = F\_{iL} \mathfrak{c}\_{i+1},$$

where $\boldsymbol{w}\_{i+1}$ is the eigenvector corresponding to the maximum eigenvalue $\theta\_{i+1}^2$ of the matrix $(\boldsymbol{M}\_x^{\mathrm{T}} \boldsymbol{M}\_x)^{-1} \boldsymbol{E}\_{iL}^{\mathrm{T}} \boldsymbol{F}\_{iL} (\boldsymbol{M}\_y^{\mathrm{T}} \boldsymbol{M}\_y)^{-1} \boldsymbol{F}\_{iL}^{\mathrm{T}} \boldsymbol{E}\_{iL}$.

Similarly, $\boldsymbol{c}\_{i+1}$ is the eigenvector corresponding to the maximum eigenvalue of $(\boldsymbol{M}\_y^{\mathrm{T}} \boldsymbol{M}\_y)^{-1} \boldsymbol{F}\_{iL}^{\mathrm{T}} \boldsymbol{E}\_{iL} (\boldsymbol{M}\_x^{\mathrm{T}} \boldsymbol{M}\_x)^{-1} \boldsymbol{E}\_{iL}^{\mathrm{T}} \boldsymbol{F}\_{iL}$. Then,

$$p\_{i+1} = \frac{E\_{iL}^{\mathrm{T}} t\_{i+1}}{\|t\_{i+1}\|^2}, \ \bar{q}\_{i+1} = \frac{F\_{iL}^{\mathrm{T}} t\_{i+1}}{\|t\_{i+1}\|^2}.$$

Finally, the number *d* of latent variables of LPPLS is determined using the cross-validation method.
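The full type III extraction, (10.17)–(10.20) plus NIPALS-style deflation, can be sketched as follows. This is our illustrative implementation, not the book's code: it solves (10.19) by a dense eigendecomposition at each step and assumes $\boldsymbol{M}\_x^{\mathrm{T}}\boldsymbol{M}\_x$ and $\boldsymbol{M}\_y^{\mathrm{T}}\boldsymbol{M}\_y$ are invertible:

```python
import numpy as np

def lppls_components(E0L, F0L, Mx, My, d):
    """Illustrative type III LPPLS extraction with deflation.
    E0L = X_M, F0L = Y_M; Mx, My as defined for (10.5) and (10.10)."""
    Gx = np.linalg.inv(Mx.T @ Mx)
    Gy = np.linalg.inv(My.T @ My)
    E, F = E0L.copy(), F0L.copy()
    T, P, Qbar = [], [], []
    for _ in range(d):
        # matrix of (10.19); its dominant eigenvector is the next weight w
        A = Gx @ E.T @ F @ Gy @ F.T @ E
        vals, vecs = np.linalg.eig(A)
        w = np.real(vecs[:, np.argmax(np.real(vals))])
        t = E @ w                      # score vector t_{i+1} = E_iL w_{i+1}
        p = E.T @ t / (t @ t)          # loading p_{i+1}
        q = F.T @ t / (t @ t)          # loading qbar_{i+1}
        E = E - np.outer(t, p)         # deflate the residual matrices
        F = F - np.outer(t, q)
        T.append(t); P.append(p); Qbar.append(q)
    return np.column_stack(T), np.column_stack(P), np.column_stack(Qbar)
```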

## *10.2.2 LPPLS for Process and Quality Monitoring*

*X* and *Y* are projected onto a low-dimensional space spanned by the latent variables ($t\_1, \ldots, t\_d$). The neighboring mappings of the original data, $\boldsymbol{E}\_{0L}$ and $\boldsymbol{F}\_{0L}$, are decomposed as follows:

$$\begin{aligned} \boldsymbol{E}\_{0L} &= \sum\_{i=1}^{d} \boldsymbol{t}\_i \boldsymbol{p}\_i^{\mathrm{T}} + \bar{\boldsymbol{E}} = \boldsymbol{T} \boldsymbol{P}^{\mathrm{T}} + \bar{\boldsymbol{E}} \\ \boldsymbol{F}\_{0L} &= \sum\_{i=1}^{d} \boldsymbol{t}\_i \bar{\boldsymbol{q}}\_i^{\mathrm{T}} + \bar{\boldsymbol{F}} = \boldsymbol{T} \bar{\boldsymbol{Q}}^{\mathrm{T}} + \bar{\boldsymbol{F}}, \end{aligned} \tag{10.21}$$

where $\boldsymbol{T} = [\boldsymbol{t}\_1, \boldsymbol{t}\_2, \ldots, \boldsymbol{t}\_d]$ are the latent score vectors, and $\boldsymbol{P} = [\boldsymbol{p}\_1, \ldots, \boldsymbol{p}\_d]$ and $\bar{\boldsymbol{Q}} = [\bar{\boldsymbol{q}}\_1, \ldots, \bar{\boldsymbol{q}}\_d]$ are the loading matrices of $\boldsymbol{E}\_{0L}$ and $\boldsymbol{F}\_{0L}$, respectively. $\boldsymbol{T}$ is represented by the neighboring mapping data $\boldsymbol{E}\_{0L}$,

$$\boldsymbol{T} = \boldsymbol{E}\_{0L} \boldsymbol{R} = \boldsymbol{S}\_x^{\frac{1}{2}} \boldsymbol{E}\_0 \boldsymbol{R},\tag{10.22}$$

where $\boldsymbol{R} = [\boldsymbol{r}\_1, \ldots, \boldsymbol{r}\_d]$,

$$r\_i = \prod\_{j=1}^{i-1} \left( I\_n - w\_j p\_j^\top \right) w\_i$$

As in the GPLPLS method, (10.21) and (10.22) are difficult to apply in practice, since the locality transformation matrix $\boldsymbol{S}\_x$ cannot be obtained during online measurements. They are therefore changed to the direct decomposition of $\boldsymbol{E}\_0$ and $\boldsymbol{F}\_0$:

$$E\_0 = S\_x^{-\frac{1}{2}} (TP^\mathrm{T} + \bar{E}) = T\_0 P^\mathrm{T} + E^\prime \tag{10.23}$$

$$F\_0 = \mathbf{S}\_{\mathbf{y}}^{-\frac{1}{2}} (\mathbf{S}\_{\mathbf{x}}^{\frac{1}{2}} T\_0 \bar{\mathbf{Q}}^T + \bar{F}),\tag{10.24}$$

where $\boldsymbol{T}\_0 = \boldsymbol{E}\_0 \boldsymbol{R}$ and $\boldsymbol{E}' = \boldsymbol{S}\_x^{-\frac{1}{2}} \bar{\boldsymbol{E}}$.

Process and quality monitoring for new scaled and mean-centered data samples *x* and *y* is performed by the oblique projection of the input data *x*.

$$\begin{aligned} \boldsymbol{x} &= \hat{\boldsymbol{x}} + \boldsymbol{x}\_{\varepsilon} \\ \hat{\boldsymbol{x}} &= \boldsymbol{P} \boldsymbol{R}^{\mathrm{T}} \boldsymbol{x} \\ \boldsymbol{x}\_{\varepsilon} &= \left( \boldsymbol{I} - \boldsymbol{P} \boldsymbol{R}^{\mathrm{T}} \right) \boldsymbol{x}. \end{aligned} \tag{10.25}$$

The residual space still contains much variation information (Qin and Zheng 2012), but it is not the main focus of LPPLS. To facilitate the comparison with traditional monitoring methods, this chapter will directly adopt traditional fault monitoring indices without any modification. The *T* <sup>2</sup> and *Q* statistics are defined,

$$\begin{aligned} \boldsymbol{t} &= \boldsymbol{R}^{\mathrm{T}} \boldsymbol{x} \\ T^2 &= \boldsymbol{t}^{\mathrm{T}} \boldsymbol{\Lambda}^{-1} \boldsymbol{t} = \boldsymbol{t}^{\mathrm{T}} \left( \frac{1}{n-1} \boldsymbol{T}\_0^{\mathrm{T}} \boldsymbol{T}\_0 \right)^{-1} \boldsymbol{t} \\ Q &= \| \boldsymbol{x}\_{\varepsilon} \|^2 = \boldsymbol{x}^{\mathrm{T}} (\boldsymbol{I} - \boldsymbol{P} \boldsymbol{R}^{\mathrm{T}})^{\mathrm{T}} (\boldsymbol{I} - \boldsymbol{P} \boldsymbol{R}^{\mathrm{T}}) \boldsymbol{x}, \end{aligned} \tag{10.26}$$

where $\boldsymbol{\Lambda}$ is the sample covariance matrix of the scores. The matrix $\tilde{\boldsymbol{X}}$ (i.e., $\boldsymbol{E}\_{0L}$) of type III LPPLS is not scaled and mean-centered. Moreover, in nonlinear systems the output variables may not obey a Gaussian distribution even if the input variables do. The control limits of the $T^2$ and $Q$ statistics are therefore not computed from the *F* and χ² distributions; they should be calculated from the probability density functions obtained by the non-parametric kernel density estimation method (Lee et al. 2004).
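A sketch of the monitoring statistics (10.26) together with a KDE-based control limit is given below. The grid construction and the handling of the confidence level `alpha` are our choices; the chapter only specifies that the limits come from non-parametric kernel density estimation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def monitoring_stats(x, R, P, T0, n):
    """T^2 and Q statistics of (10.26) for one scaled, mean-centered sample x."""
    t = R.T @ x
    Lam = (T0.T @ T0) / (n - 1)          # sample covariance of the training scores
    T2 = t @ np.linalg.inv(Lam) @ t
    x_res = x - P @ (R.T @ x)            # residual part x_eps = (I - P R^T) x
    Q = x_res @ x_res
    return T2, Q

def kde_control_limit(stats, alpha=0.9975, grid=2000):
    """Control limit from a KDE of the training statistics: the smallest
    grid value whose estimated CDF exceeds alpha."""
    kde = gaussian_kde(stats)
    xs = np.linspace(0, stats.max() * 2, grid)
    cdf = np.cumsum(kde(xs))
    cdf /= cdf[-1]
    return xs[np.searchsorted(cdf, alpha)]
```

A sample then raises a quality-related alarm when its $T^2$ value exceeds `kde_control_limit` evaluated on the normal-operation $T^2$ values, and analogously for $Q$.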

**Remark 10.1** The LPPLS decomposition (10.23) is similar to linear PLS, but its residual space $\boldsymbol{E}'$ is related to the locality-preserving projection matrix $\boldsymbol{S}\_x^{\frac{1}{2}}$. It is difficult to obtain $\boldsymbol{S}\_x^{\frac{1}{2}}$ for new data during online fault detection. However, the sample covariance matrix $\boldsymbol{\Lambda}$ and the $T^2$ and $Q$ statistics (10.26) are not directly related to $\boldsymbol{S}\_x^{\frac{1}{2}}$, which is a useful feature for online monitoring.

Although the matrix $\boldsymbol{S}\_L := \boldsymbol{S}\_y^{-\frac{1}{2}} \boldsymbol{S}\_x^{\frac{1}{2}} \in \mathbb{R}^{n \times n}$ is constant, the regression equation (10.24) cannot be used for output projections. As mentioned above, the first reason is that the locality-preserving projection matrices $\boldsymbol{S}\_x^{\frac{1}{2}}$ and $\boldsymbol{S}\_y^{\frac{1}{2}}$ for new data are difficult to obtain. The second is that direct application of the least squares solution $\boldsymbol{S}\_R = \boldsymbol{E}\_0^{+} \boldsymbol{S}\_L \boldsymbol{E}\_0$ may lead to poor prediction performance, and the prediction performance directly determines whether a model needs to be updated in practice. Based on (10.23), a regression equation can instead be constructed from $\boldsymbol{F}\_0$ and $\boldsymbol{T}\_0$:

$$F\_0 = T\_0 \mathbf{Q}^\mathrm{T} + \tilde{F}.\tag{10.27}$$

**Remark 10.2** In the special case of $\boldsymbol{S}\_L = \boldsymbol{I}$, (10.24) and (10.27) are equal. In most cases the regression coefficients ($\bar{\boldsymbol{Q}}$ and $\boldsymbol{Q}$) are significantly different. But since both $\bar{\boldsymbol{Q}}$ and $\boldsymbol{Q}$ are least squares solutions of their respective regression equations, the regression errors $\bar{\boldsymbol{F}}$ and $\tilde{\boldsymbol{F}}$ are equivalent in theory. Therefore, the latter regression equation (10.27) can be used to predict the corresponding output of new input data.
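Prediction with (10.27) then needs only the raw scaled data, since $\boldsymbol{T}\_0 = \boldsymbol{E}\_0 \boldsymbol{R}$. A one-line sketch (the argument names are ours):

```python
import numpy as np

def predict_quality(X_new, R, Q):
    """Sketch of quality prediction via (10.27): scores T0 = X R from the raw
    (scaled, mean-centered) data, then Y_hat = T0 Q^T."""
    return (X_new @ R) @ Q.T
```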

**Fig. 10.1** Projection results of the PLS and LPPLS models for the S-curve data set with *y* = 2*x*1 − *x*3. The type I LPPLS model is used


**Fig. 10.2** Projection results of the PLS and LPPLS models for the Swiss roll data set with *y* = *x*1*x*3. The type III LPPLS model is used

## *10.2.3 Locality-Preserving Capacity Analysis*

Here two three-dimensional artificial data sets, the S-curve and the Swiss roll, are used to illustrate the locality-preserving capacity of LPPLS. They are commonly used to validate the performance of manifold learning algorithms:

$$\begin{aligned} \boldsymbol{X}\_1 &= [\boldsymbol{x}\_1; \boldsymbol{x}\_2; \boldsymbol{x}\_3] = [[\cos(\alpha), -\cos(\alpha)]; 5v\_1; [\sin(\alpha), 2 - \sin(\alpha)]] \\ \boldsymbol{X}\_2 &= [\boldsymbol{x}\_1; \boldsymbol{x}\_2; \boldsymbol{x}\_3] = [t \cos(t); 2v\_3; t \sin(t)], \end{aligned}$$

where $\alpha = (1.5v\_2 - 1)/\pi$ and $t = \frac{3\pi}{2}(1 + 2v\_4)$; $v\_1$, $v\_2$, $v\_3$, and $v\_4$ are uniformly distributed on (0, 1). Two kinds of output functions are defined: *y* = 2*x*1 − *x*3 (linear) and *y* = *x*1*x*3 (nonlinear).

1000 sample points are randomly generated in the 3-D space [*x*1, *x*2, *x*3], and the dimensionality reduction is performed for the PLS and LPPLS models. The two-dimensional projection results of the two models are shown in Figs. 10.1 and 10.2, respectively.
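The Swiss roll set $\boldsymbol{X}\_2$ and the nonlinear output *y* = *x*1*x*3 can be reproduced as below (a sketch following the formulas above; the uniform variables and sample count are as stated, while the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed
n = 1000
v3, v4 = rng.random(n), rng.random(n)    # uniform on (0, 1)
t = 3 * np.pi / 2 * (1 + 2 * v4)         # roll parameter, t in [3*pi/2, 9*pi/2)
x1 = t * np.cos(t)
x2 = 2 * v3
x3 = t * np.sin(t)
X2 = np.column_stack([x1, x2, x3])       # 1000 x 3 Swiss roll data set
y = x1 * x3                              # nonlinear output used with type III LPPLS
```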

The projection results show that PLS does not preserve the local structural information of the S-curve and Swiss roll; in other words, the data are not correctly classified by color. However, LPPLS preserves the local structural features and achieves good classification results. The LPPLS model improves the local preserving capability of the PLS model; moreover, LPPLS better discriminates the boundary features. Thus, the LPPLS method can be used to detect faults related to output variables in systems with strong nonlinearity.

## **10.3 Case Study**

Validation of the proposed LPPLS-based fault detection method is performed on the Tennessee Eastman Process (TEP) simulation platform (Lyman and Georgakis 1995). TEP is described in detail in (Lee et al. 2006). The related data sets are downloaded from "http://web.mit.edu/braatzgroup/links.html". PCA (Dunia and Qin 1998; Good et al. 2010) and other global-local preserving projection methods (Luo 2014; Bao et al. 2016; Luo et al. 2016) do not merge any information from the output space, so only the LPPLS method and two quality-related monitoring methods (the PLS and GLPLS methods) are compared.

## *10.3.1 PLS, GLPLS and LPPLS Models*

The input variable matrix $X = [x\_1, x\_2, \ldots, x\_{33}]^{\mathrm{T}}$ consists of 22 process variables (XMEAS(1:22) := $x\_1 : x\_{22}$) and 11 manipulated variables ($x\_{23} : x\_{33}$), excluding XMV(12). The quality variable matrix $Y = [y\_1; y\_2]$ is composed of component *G* of stream 9 and component *E* of stream 11, i.e., XMEAS(35) ($y\_1$) and XMEAS(38) ($y\_2$). The training set is the normal data IDV(0) containing 960 samples. The test sets are the fault data IDV(1:21); each fault data set has 960 samples (the first 160 samples are normal and the last 800 are faulty). The model parameters are $\delta\_x = 1.5$, $\delta\_y = 0.8$, $K\_x = 20$, and $K\_y = 15$, where $K\_x$ and $K\_y$ are the neighborhood parameters in the input and output spaces, respectively. The regression coefficients obtained by the PLS, GLPLS, and LPPLS models are shown in Table 10.1, and the relative training errors are shown in Fig. 10.3. Here the relative error is calculated as error = $(y\_i - y\_{i,tr})/y\_i$, $i = 1, 2$, where $y\_{i,tr}$ is the corresponding output of the training model.

The training errors in Fig. 10.3 show that the training results of the PLS, GLPLS, and LPPLS models satisfy the modeling requirements. Output prediction experiments for these models were completed under all fault conditions (i.e., the test data sets), and similar prediction abilities were obtained in most cases. Take fault IDV(21) as an example: the output predictions of the three models are shown in Fig. 10.4, with $y\_1$ at the top and $y\_2$ at the bottom of the figure. Fault IDV(21) causes the output variables to drift slowly (Lee et al. 2006), but the prediction performance of the three methods is still good even in this fault case. The generalization capability of the three models is thus verified.


**Table 10.1** Regression coefficients of PLS, GLPLS, and LPPLS models

**Fig. 10.3** Relative errors of PLS, GLPLS, and LPPLS models

**Fig. 10.4** Prediction results for IDV(21) of PLS, GLPLS, and LPPLS models

## *10.3.2 Quality Monitoring Analysis*

The $T^2$ statistic represents the mapping between process variables and quality variables for PLS and its related methods; an alarm in the $T^2$ statistic indicates a quality-related fault. In contrast, the $Q$ statistic represents only the residuals in the input space, so its alarm indicates a fault that is not quality related. Table 10.2 gives the monitoring FDRs, whose control limits are calculated with a confidence level of 99.75%.

The product quality consists of component G (XMEAS(35)) and component E (XMEAS(38)). Faults IDV(3, 4, 9, 11, 14, 15, 19) have almost no effect on product quality, but the remaining faults cause significant changes in the quality variables. The FDR results of the LPPLS method match this actual TEP behavior: it detects quality-related faults with much higher accuracy than the PLS and GLPLS models (e.g., IDV(5) and IDV(12) in Table 10.2). In this section, the fault detection performance is further examined for three fault scenarios: a disturbance of the reactor cooling water, a disturbance of the condenser cooling water, and a constant position of the stream 4 valve.

#### **Experiment 1: Disturbance in Reactor Cooling Water (Quality-Independent Fault)**

The faults related to the reactor cooling water are IDV(4), IDV(11), and IDV(14). As mentioned above, they have little effect on product quality but are process related. The monitoring results for the variation of the reactor cooling water are shown in Fig. 10.5. Fault IDV(14) is taken as an example in order to compare with other quality-related methods, such as the GPLPLS given in Chap. 9.


**Table 10.2** FDR of PLS, GLPLS, and LPPLS models

**Fig. 10.5** PLS, GLPLS, and LPPLS monitoring for IDV(14)

The faults related to the reactor cooling water cause variations in the reactor temperature, but the reactor temperature is controlled by a cascade controller. Therefore these disturbances, including the step fault IDV(4), the random fault IDV(11), and the valve sticking disturbance IDV(14), do not affect product quality. Table 10.2 shows the fault detection rates of the PLS, GLPLS, and LPPLS methods. The $Q$ statistics of all three methods detect these process-related faults in the input space with high FDR. The FDR values of LPPLS for the $T^2$ statistic are much smaller than those of the other methods, which indicates that these faults are quality independent. Fault IDV(14) is a special case: when traditional analysis methods such as filtering or PLS are applied to it, most of the fault feature information is lost, which makes this fault difficult to detect in the input space. Now let us check the detection results for fault IDV(14). The FDRs in the $T^2$ statistic for the PLS and GLPLS models are 33.5% and 96.88%, far higher than that of LPPLS, meaning that PLS and GLPLS classify fault IDV(14) as quality related. The FDR of LPPLS in the $T^2$ statistic is 2.5%, close to that of GPLPLS (Tables 9.2 and 9.3). Thus LPPLS can effectively filter out quality-irrelevant faults, similarly to the GPLPLS method.

#### **Experiment 2: Disturbance in Condenser Cooling Water (Quality-Related Fault)**

These faults include the quality-related faults IDV(5) and IDV(12). Fault IDV(5) is caused by a step change in the cooling water flow rate of the condenser. Since the cascade controller compensates for this step change, the separator temperature returns to its setpoint. PLS and GLPLS give similar predicted results, returning to the setpoint about 10 h after the fault, but LPPLS-based monitoring provides a persistent alarm in the $T^2$ statistic (Fig. 10.6). "The persistence of the fault detection statistic is demonstrated by the fact that it continues to alert the operator to process anomalies even though all process variables appear to have returned to their normal values, especially important in quality-related process fault detection (Lee et al. 2006)". In fact, a disturbance in the condenser cooling water, such as its flow rate, always affects the output quality. It should be pointed out that the cooling water flow rate of the condenser plays an important role both in the output quality and in the safety of the chemical plant. This fault cannot be eliminated by the cascade controller and should raise an alarm. Although the controller can compensate for the variations caused by this fault, the process-related monitoring in the $Q$ statistic (Fig. 10.6) provides a consistent alarm. The experimental results show that the PLS and GLPLS models do not actually capture the source of the fault, while LPPLS does.

#### **Experiment 3: Constant Position of the Stream 4 Valve**

Fault IDV(21), due to slow output drift, has been little studied. The sensitivity of fault detection is related to the magnitude of the drift; therefore, fast detection of fault IDV(21) is beneficial for quality control. The process monitoring results are shown in Fig. 10.7. GLPLS, LPPLS, and PLS fully detect this fault as quality related after about 650, 720, and 780 samples, respectively. LPPLS and GLPLS detect fault IDV(21) faster than the PLS method.

The following conclusions are drawn from the above experiments.

**Fig. 10.6** PLS, GLPLS, and LPPLS monitoring for IDV(5)

**Fig. 10.7** PLS, GLPLS, and LPPLS monitoring for IDV(21)


## **10.4 Conclusions**

In this chapter, the LPPLS statistical model is proposed, and LPPLS-based quality-related fault detection and prediction is given. LPPLS not only retains the local information of the original data, but also maintains the correlation between *X* and *Y* to the maximum extent, thus achieving accurate prediction of the quality variables. LPPLS offers excellent detection performance for locally nonlinear systems, due to the local feature extraction ability controlled by the two parameters $\delta\_x$ and $\delta\_y$. Experimental results on the artificial three-dimensional data sets, the S-curve and the Swiss roll, show that LPPLS maintains local structural features well. The experimental results on the TEP simulator show that LPPLS extracts the local nonlinear features more effectively and has better fault detection performance than the PLS and GLPLS models.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 11 Locally Linear Embedding Orthogonal Projection to Latent Structure**

Quality variables are measured much less frequently, and usually with a significant time delay, compared with process variables. Monitoring process variables and their associated quality variables is an essential undertaking, as undetected faults can lead to potential hazards that may cause system shutdowns and thus huge economic losses. Partial least squares (PLS) analysis extracts the maximum correlation between quality variables and process variables (Kruger et al. 2001; Song et al. 2004; Li et al. 2010; Hu et al. 2013; Zhang et al. 2015). To deal with the nonlinear correlation of industrial data, this chapter proposes two further nonlinear PLS methods. The first, locally linear embedding projection to latent structures (LLEPLS), is an oblique projection on the input data space. By further decomposing the LLEPLS model, locally linear embedding orthogonal projection to latent structures (LLEOPLS) is proposed, in which an orthogonal projection on the input space is obtained. LLEPLS and LLEOPLS extract the maximum relevant information while simultaneously preserving the local nonlinear structure between input and output.

From the viewpoint of statistical analysis, LLEPLS and LLEOPLS project the input and output spaces into three subspaces: (1) the joint input-output subspace, which captures the nonlinear relationship between input and output and can be used for quality prediction; (2) the output-residual subspace, which monitors quality-related faults that cannot be predicted from the process data; and (3) the orthogonal input-residual subspace, which identifies whether a predictable fault is quality related. The corresponding monitoring strategies are established based on the LLEPLS and LLEOPLS models, respectively.

## **11.1 Comparison of GPLPLS, LPPLS, and LLEPLS**

PLS performs better than PCA in detecting quality-relevant faults. As shown in Fig. 11.1, the output space (*Y*) and input space (*X*) are decomposed for the PLS model. Here the external relationship is the "foundation" and the internal relationship is the "result". For nonlinear PLS, the desired "results" cannot be obtained by internal structure adjustment alone if the external relationships remain linear (Zhang and Qin 2008). Therefore, it is possible to build better internal relationships by starting with the analysis of external relationships. A nonlinear function is usually approximated by a series of locally weighted linear models. For example, Wang et al. (2014) and Yin et al. (2016, 2017) use locally weighted projection regression (LWPR) or a few univariate regressions to learn the nonlinearity of the external relationships. Such a PLS regression can, to some extent, be considered a multi-KPLS regression with a Gaussian kernel.

The locality-preserving partial least squares (LPPLS) model (given in Chap. 10) is another external nonlinear PLS model, and its structure is relatively simple compared to the KPLS model (Wang et al. 2017). However, the LPPLS model has at least two limitations. First, the local geometric structure cannot be well preserved with uniform weights, and the σ parameter of the Gaussian weights (Kokiopoulou and Saad 2007) is difficult to select properly. Second, it performs an oblique decomposition of the measured process variables. The LPPLS model extracts the principal components and retains the local structure by locality-preserving projection (LPP). LLE, another nonlinear dimensionality reduction technique, transforms the global nonlinear problem into a combination of several local linear problems by introducing local geometric information. Compared with the LLE method, the locality-preserving strategy of LPP is more complex, and its Gaussian-weight parameters are more numerous and harder to tune.

The global plus local projection to latent structure (GPLPLS) model (given in Chap. 9) integrates the advantages of the PLS and LLE methods. Its distinctive feature is that the local nonlinear features are enhanced by LLE in the PLS decomposition (Zhou et al. 2018). GPLPLS uses an additive ("plus") strategy rather than an embedding one: the new feature space is divided into a linear part (global projection) and a nonlinear part (local preserving). This confirms that the LLE-plus-PLS algorithm can decompose the input and output spaces while effectively preserving the local geometric structure. However, the combination needs further research, for example, how to combine the two parts more effectively, how to complete an orthogonal decomposition, and how to evaluate the monitoring effect quantitatively.

Based on the above analysis, locally linear embedded projection to latent structure (LLEPLS) is proposed. It extracts the maximum correlation information between input and output while revealing and preserving the intrinsic nonlinear structure of the original data. The principal components of the input space (the measured-variable space) extracted by LLEPLS still contain variations orthogonal to *Y*. These variations are output irrelevant and do not contribute to the output prediction. Moreover, LLEPLS is an oblique projection on the input space. Orthogonalization is an alternative solution to these issues. The locally linear embedded orthogonal projection to latent structure (LLEOPLS) model is therefore proposed to further explain the LLEPLS prediction model and to detect quality-related faults. LLEOPLS eliminates from the *T*<sup>2</sup> statistic the variations orthogonal to the output. LLEOPLS differs significantly from other existing nonlinear PLS models in that it performs orthogonal projections with local geometric structure preservation and has fewer parameters to fix.

## **11.2 A Brief Review of the LLE Method**

Given the normalized data set *X* = [*x*<sup>T</sup>(1), *x*<sup>T</sup>(2), . . . , *x*<sup>T</sup>(*n*)]<sup>T</sup> ∈ *R*<sup>*n*×*m*</sup>, (*x* = [*x*<sub>1</sub>, *x*<sub>2</sub>, . . . , *x*<sub>*m*</sub>] ∈ *R*<sup>1×*m*</sup>) of the model, where *n* is the number of samples and *m* is the number of input variables. The LLE algorithm introduces local structural information and transforms the global nonlinear problem into a combination of multiple local linear problems. It excels at locally nonlinear processes.

The neighborhood size *k*<sub>*x*</sub> is crucial for the local geometric structure. Based on a distance measure such as the Euclidean distance, the *k* nearest neighbors (KNN) of each sample can be selected (Kouropteva et al. 2002),

$$k_{x,opt} = \arg\min_{k_x} \left(1 - \rho^2_{D_x D_{\phi_x}}\right),\tag{11.1}$$

where *D*<sub>*x*</sub> and *D*<sub>φ<sub>*x*</sub></sub> denote the distance matrices (between point pairs) in *X* and *Φ*<sub>*x*</sub> (φ<sub>*x*</sub> given in (11.4)), and ρ denotes the standard linear correlation coefficient between *D*<sub>*x*</sub> and *D*<sub>φ<sub>*x*</sub></sub>.
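The criterion (11.1) can be sketched numerically: for each candidate *k*<sub>*x*</sub> one builds an embedding and scores how well its pairwise distances correlate with those of *X*. The following NumPy sketch uses brute-force distance matrices; the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X (self excluded)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return np.argsort(D, axis=1)[:, 1:k + 1]

def neighbor_criterion(X, Phi):
    """The quantity 1 - rho^2 minimized in (11.1): rho is the linear
    correlation between the pairwise-distance matrices of X and of the
    embedding Phi."""
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).ravel()
    dp = np.linalg.norm(Phi[:, None, :] - Phi[None, :, :], axis=2).ravel()
    rho = np.corrcoef(dx, dp)[0, 1]
    return 1.0 - rho ** 2
```

For a perfect distance-preserving embedding the criterion is zero; *k*<sub>*x*,*opt*</sub> is the candidate whose embedding drives it closest to zero.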


Next, the *k*<sub>*x*</sub> nearest neighbors of the sample *x*(*i*) are obtained. Then *x*(*i*) can be linearly expressed in terms of its *k*<sub>*x*</sub> nearest neighbors *x*(*j*) by the following optimization objective,

$$\begin{aligned} J(A\_x) &= \min \sum\_{i=1}^n \left\| \mathbf{x}(i) - \sum\_{j=1}^{k\_x} a\_{ij,x} \mathbf{x}(j) \right\|^2 \\ &\text{s.t. } \sum\_{j=1}^{k\_x} a\_{ij,x} = 1, \end{aligned} \tag{11.2}$$

where [*a*<sub>*i j*,*x*</sub>] := *A*<sub>*x*</sub> ∈ *R*<sup>*n*×*k*<sub>*x*</sub></sup>, (*i* = 1, 2, . . . , *n*, *j* = 1, 2, . . . , *k*<sub>*x*</sub>) denotes the weight coefficients. Usually, points belonging to the space *X* are projected onto a new low-dimensional reduced space *Φ*<sub>*x*</sub> = [*φ*<sup>T</sup><sub>*x*</sub>(1), *φ*<sup>T</sup><sub>*x*</sub>(2), . . . , *φ*<sup>T</sup><sub>*x*</sub>(*n*)]<sup>T</sup> ∈ *R*<sup>*n*×*d*</sup>, (*d* < *m*, *φ*<sub>*x*</sub> ∈ *R*<sup>1×*d*</sup>) determined by the following optimization:

$$\begin{aligned} J\_{\text{LLE}}(W) &= \min \sum\_{i=1}^{n} \left\| \phi\_x(i) - \sum\_{j=1}^{k\_x} a\_{ij,x} \phi\_x(j) \right\|^2 \\ \text{s.t.} \quad & \Phi\_x^\mathrm{T} \Phi\_x = I. \end{aligned} \tag{11.3}$$

For further analysis, a linear mapping matrix *W* = [*w*<sub>1</sub>, . . . , *w*<sub>*d*</sub>] ∈ *R*<sup>*m*×*d*</sup> is introduced with the guarantee of local embedding,

$$\phi\_x(i) = \mathbf{x}(i)\,\mathbf{W}, \ \ (i = 1, 2, \ldots, n). \tag{11.4}$$

where *w*<sub>*j*</sub>, *j* = 1, . . . , *d* denote the projection vectors. Then the optimization (11.3) is rewritten as

$$\begin{aligned} J\_{\text{LLE}}\left(W\right) &= \min \text{tr}\left(\mathbf{W}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{M}\_{\text{x}}^{\text{T}}\mathbf{M}\_{\text{x}}\mathbf{X}W\right) \\ &\text{s.t.} \quad \mathbf{W}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}W = I, \end{aligned} \tag{11.5}$$

where *M<sup>x</sup>* = (*I* − *A<sup>x</sup>* ) ∈ *R<sup>n</sup>*×*<sup>n</sup>*. SVD operation is performed on *M<sup>x</sup>* in order to simplify the dimensionality reduction problem,

$$M_x = \begin{bmatrix} U_x \ \bar{U}_x \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} S_x & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} V_x \\ \bar{V}_x \end{bmatrix}.$$

Then, the minimum value problem (11.5) is changed as follows:

$$\begin{aligned} J\_{\text{LLE}}(W) &= \max \text{tr} \left( W^{\text{T}} X\_M^{\text{T}} X\_M W \right) \\ \text{s.t.} \quad & W^{\text{T}} X^{\text{T}} X W = I, \end{aligned} \tag{11.6}$$

where $X_M := \begin{bmatrix} S_x^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} V_x \\ \bar{V}_x \end{bmatrix} X = S_{V_x} X$. Generally, LLE must choose the reduced dimension *d* in (11.3) in advance, whereas PCA can determine the dimension from a specific criterion such as the cumulative contribution. The optimization problem (11.6) is further rewritten,

$$\begin{aligned} J\_{\text{LLE}}\left(\boldsymbol{w}\right) &= \max \boldsymbol{w}^{\text{T}} \boldsymbol{X}\_{M}^{\text{T}} \boldsymbol{X}\_{M} \boldsymbol{w} \\ \text{s.t.} \quad & \boldsymbol{w}^{\text{T}} \boldsymbol{X}^{\text{T}} \boldsymbol{X} \boldsymbol{w} = 1, \end{aligned} \tag{11.7}$$

where *w* ∈ *R*<sup>*m*×1</sup>. The criteria for determining the number of principal components in PCA can be directly applied to LLE. Based on the SVD algorithm, the matrix *X*<sub>*M*</sub> is decomposed into a "load matrix" *P*<sub>*d*</sub> = [*p*<sub>1</sub>, *p*<sub>2</sub>, . . . , *p*<sub>*d*</sub>] and a "score matrix" *T*<sub>*d*</sub> = [*t*<sub>1</sub>, *t*<sub>2</sub>, . . . , *t*<sub>*d*</sub>],

$$X_M^{\mathrm{T}} X_M = \begin{bmatrix} P_{d0} \ P_{r0} \end{bmatrix} \begin{bmatrix} \Lambda_d & \\ & \Lambda_r \end{bmatrix} \begin{bmatrix} P_{d0}^{\mathrm{T}} \\ P_{r0}^{\mathrm{T}} \end{bmatrix}$$

and define *P*<sub>*d*</sub> = *P*<sub>*d*0</sub>/‖*X P*<sub>*d*0</sub>‖, *P*<sub>*r*</sub> = *P*<sub>*r*0</sub>/‖*X P*<sub>*r*0</sub>‖, and

$$\begin{split} \boldsymbol{X}\_{M} &= \boldsymbol{T}\_{d}\boldsymbol{\mathsf{P}}\_{d}^{\mathrm{T}} + \boldsymbol{T}\_{r}\boldsymbol{\mathsf{P}}\_{r}^{\mathrm{T}} \\ &= \boldsymbol{\mathsf{P}}\_{d}\boldsymbol{\mathsf{P}}\_{d}^{\mathrm{T}}\boldsymbol{X}\_{M} + \left(\boldsymbol{I} - \boldsymbol{\mathsf{P}}\_{d}\boldsymbol{\mathsf{P}}\_{d}^{\mathrm{T}}\right)\boldsymbol{X}\_{M}, \end{split} \tag{11.8}$$

where *T*<sub>*d*</sub> = *X*<sub>*M*</sub> *P*<sub>*d*</sub> and *T*<sub>*r*</sub> = *X*<sub>*M*</sub> *P*<sub>*r*</sub>.

It is observed from (11.7) and (11.8) that the projection direction of LLE can be obtained by maximizing the variance. Thus, the LLE constructs a new PLS regression with the local geometric structure-preserving ability according to the component extraction criteria.
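The local reconstruction step (11.2) admits a compact closed-form solution per sample: center the neighbors on *x*(*i*), solve a small linear system on the local Gram matrix, and rescale the weights to sum to one. A minimal NumPy sketch follows; the regularizer `reg` is an assumption added for numerical stability when the local Gram matrix is singular, not part of the original formulation.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Solve (11.2): reconstruct each sample from its k nearest neighbors
    with weights constrained to sum to one."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest neighbors, self excluded
        Z = X[nbrs] - X[i]                 # center neighbors on x(i)
        G = Z @ Z.T                        # local Gram matrix (k x k)
        G = G + reg * np.trace(G) * np.eye(k) / k   # stabilizing regularizer
        w = np.linalg.solve(G, np.ones(k))
        A[i, nbrs] = w / w.sum()           # enforce the sum-to-one constraint
    return A
```

The full weight matrix *A*<sub>*x*</sub> is sparse by construction: row *i* is non-zero only at the *k* neighbor positions.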

Variance (factor variation) is used to extract the latent variables in the PLS algorithm. It transforms the original data *X* and *Y* into a set of t-scores *T* and u-scores *U*. The latent factors *T* and *U* are chosen by maximizing the factor variation, aiming to use fewer dimensions while retaining more features of the original data. PLS is a linear dimensionality reduction technique, but it does not explore the intrinsic structure of the original data. This is not conducive to data classification and may mix data together. The phenomena that may occur with PLS are shown in Fig. 11.2, similar to PCA. Figure 11.2a shows a two-mode data space *X* and Fig. 11.2b gives its first principal component *t*<sub>1</sub> in PCA, whose contribution is 99%. As shown in Fig. 11.2b, the blue *o* and black ∗ points in the one-dimensional coordinate system are mixed together. The second principal component is discarded due to its small contribution, although it maintains the local geometric structure.

# **11.3 LLEPLS Models and LLEPLS-Based Fault Detection**

## *11.3.1 LLEPLS Models*

In order to extract the first component pair (*t*1, *u*1), the traditional PLS optimization is expressed as

$$\begin{aligned} J\_{\text{PLS}}\left(\boldsymbol{\omega}\_{\text{1}}, \mathbf{c}\_{\text{1}}\right) &= \max \, \boldsymbol{\omega}\_{\text{1}}^{\text{T}} \mathbf{X}^{\text{T}} \mathbf{Y} \mathbf{c}\_{\text{1}} \\ \text{s.t.} \quad \boldsymbol{\omega}\_{\text{1}}^{\text{T}} \boldsymbol{\omega}\_{\text{1}} &= 1, \, \mathbf{c}\_{\text{1}}^{\text{T}} \mathbf{c}\_{\text{1}} = 1. \end{aligned} \tag{11.9}$$

Define *E*<sub>0</sub> = *X* and *F*<sub>0</sub> = *Y*. The PLS latent variables *t*<sub>1</sub> and *u*<sub>1</sub> are obtained from *t*<sub>1</sub> = *E*<sub>0</sub>*w*<sub>1</sub> and *u*<sub>1</sub> = *F*<sub>0</sub>*c*<sub>1</sub>. Here *w*<sub>1</sub> and *c*<sub>1</sub> are the eigenvectors corresponding to the maximum eigenvalues of the matrices,

$$E\_0^\mathrm{T} F\_0 F\_0^\mathrm{T} E\_0 \mathfrak{w}\_1 = \theta\_1^2 \mathfrak{w}\_1 \tag{11.10}$$

$$F\_0^\mathrm{T} E\_0 E\_0^\mathrm{T} F\_0 \mathbf{c}\_1 = \theta\_1^2 \mathbf{c}\_1. \tag{11.11}$$
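The eigen-problems (11.10)-(11.11) can be solved directly with a symmetric eigendecomposition, since both matrices are symmetric positive semidefinite. The sketch below is illustrative, not the authors' implementation; it returns unit-norm weight vectors, matching the constraint in (11.9).

```python
import numpy as np

def pls_first_component(X, Y):
    """First PLS weight pair from the eigen-problems (11.10)-(11.11):
    w1 is the leading eigenvector of E0^T F0 F0^T E0 (with E0 = X, F0 = Y),
    c1 the leading eigenvector of F0^T E0 E0^T F0."""
    _, vecs_w = np.linalg.eigh(X.T @ Y @ Y.T @ X)
    w1 = vecs_w[:, -1]              # eigh sorts eigenvalues in ascending order
    _, vecs_c = np.linalg.eigh(Y.T @ X @ X.T @ Y)
    c1 = vecs_c[:, -1]
    return w1, c1, X @ w1, Y @ c1   # scores t1 = E0 w1, u1 = F0 c1
```

In practice the remaining components are obtained by deflating *E*<sub>0</sub> and *F*<sub>0</sub> and repeating the same extraction.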

Locally linear embedded partial least squares (LLEPLS) is proposed with the following optimization:

$$\begin{aligned} J_{\text{LLEPLS}}(w_1, c_1) &= \max w_1^{\mathrm{T}} X_M^{\mathrm{T}} Y_M c_1 \\ \text{s.t.} \quad & w_1^{\mathrm{T}} X^{\mathrm{T}} X w_1 = 1,\ c_1^{\mathrm{T}} Y^{\mathrm{T}} Y c_1 = 1 \end{aligned} \tag{11.12}$$

in which,

$$\begin{aligned} Y_M &= \begin{bmatrix} S_y^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} V_y \\ \bar{V}_y \end{bmatrix} Y = S_{V_y} Y \\ M_y &= I - A_y = \begin{bmatrix} U_y \ \bar{U}_y \end{bmatrix}^{\mathrm{T}} \begin{bmatrix} S_y & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \begin{bmatrix} V_y \\ \bar{V}_y \end{bmatrix} \end{aligned}$$

where *A*<sub>*y*</sub> is constructed from the output neighborhoods with its own parameter *k*<sub>*y*</sub>, similar to *A*<sub>*x*</sub>. *S*<sub>*y*</sub>, *V*<sub>*y*</sub>, and *U*<sub>*y*</sub> are defined analogously to *S*<sub>*x*</sub>, *V*<sub>*x*</sub>, and *U*<sub>*x*</sub>.

The criteria of LLEPLS component decomposition and latent factors extraction are given as follows:


Then, the latent variable calculation of the LLEPLS model proceeds as follows. Denote *E*<sub>0*L*</sub> = *X*<sub>*M*</sub> and *F*<sub>0*L*</sub> = *Y*<sub>*M*</sub>, similar to the traditional linear PLS. The constrained optimization problem (11.12) is transformed by introducing Lagrange multipliers,

$$\Psi(w_1, c_1) = w_1^{\mathrm{T}} E_{0L}^{\mathrm{T}} F_{0L} c_1 - \lambda_1 \left( w_1^{\mathrm{T}} X^{\mathrm{T}} X w_1 - 1 \right) - \lambda_2 \left( c_1^{\mathrm{T}} Y^{\mathrm{T}} Y c_1 - 1 \right). \tag{11.13}$$

The optimal *w*<sub>1</sub> and *c*<sub>1</sub> are solved from ∂Ψ/∂*w*<sub>1</sub> = 0 and ∂Ψ/∂*c*<sub>1</sub> = 0. The optimization problem (11.13) then reduces to the maximum eigenvalue problems,

$$\left(X^{\mathrm{T}} X\right)^{-1} E_{0L}^{\mathrm{T}} F_{0L} \left(Y^{\mathrm{T}} Y\right)^{-1} F_{0L}^{\mathrm{T}} E_{0L} w_1 = \theta_1^2 w_1 \tag{11.14}$$

$$\left(Y^{\mathrm{T}} Y\right)^{-1} F_{0L}^{\mathrm{T}} E_{0L} \left(X^{\mathrm{T}} X\right)^{-1} E_{0L}^{\mathrm{T}} F_{0L} c_1 = \theta_1^2 c_1. \tag{11.15}$$

The first optimal weight vector *w*<sub>1</sub> in conventional linear PLS (11.10) corresponds to the matrix *E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub>. For LLEPLS (11.14), the optimal *w*<sub>1</sub> is

**Fig. 11.3** Outer- and inner-model presentation for LLEPLS decomposition

derived from the corresponding matrix (*X*<sup>T</sup>*X*)<sup>−1</sup>*E*<sub>0*L*</sub><sup>T</sup>*F*<sub>0*L*</sub>(*Y*<sup>T</sup>*Y*)<sup>−1</sup>*F*<sub>0*L*</sub><sup>T</sup>*E*<sub>0*L*</sub>. These matrices are structurally similar. The extraction and modeling of the residual components can be done by traditional PLS methods.

It is worth pointing out that the input space *X* and/or the output space *Y* may not have full column rank, in which case the inverse of *X*<sup>T</sup>*X* and/or *Y*<sup>T</sup>*Y* does not exist. Similar to *S*<sub>*x*</sub> in (11.6), a corresponding (pseudo-)inverse can be obtained for *X* and/or *Y*. This does not affect the following analysis, so both cases are treated uniformly in the rest of this chapter.
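The eigen-problem (11.14) can be sketched with pseudo-inverses standing in for (*X*<sup>T</sup>*X*)<sup>−1</sup> and (*Y*<sup>T</sup>*Y*)<sup>−1</sup>, so rank-deficient data are handled. Two assumptions in this sketch: *E*<sub>0*L*</sub> is approximated by *M*<sub>*x*</sub>*X* = (*I* − *A*<sub>*x*</sub>)*X* rather than the SVD-rescaled *X*<sub>*M*</sub>, and only the leading weight vector is extracted.

```python
import numpy as np

def llepls_first_weight(X, Y, Ax, Ay):
    """Leading weight vector w1 of the LLEPLS eigen-problem (11.14),
    returned with unit norm (a sketch, not the authors' implementation)."""
    n = X.shape[0]
    E0L = (np.eye(n) - Ax) @ X           # neighborhood-mapped input
    F0L = (np.eye(n) - Ay) @ Y           # neighborhood-mapped output
    M = (np.linalg.pinv(X.T @ X) @ E0L.T @ F0L
         @ np.linalg.pinv(Y.T @ Y) @ F0L.T @ E0L)
    vals, vecs = np.linalg.eig(M)        # M is not symmetric in general
    w1 = np.real(vecs[:, np.argmax(np.real(vals))])
    return w1 / np.linalg.norm(w1)
```

With *A*<sub>*x*</sub> = *A*<sub>*y*</sub> = 0 the sketch degenerates to the eigen-structure of ordinary PLS, which is a useful sanity check.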

The first *d* components are used to build the regression model, where *d* is determined by cross-validation. Similar to the outer- and inner-model presentation for the PLS decomposition, the corresponding LLEPLS decomposition is shown in Fig. 11.3. Note that the new feature spaces *X*<sub>*F*</sub> and *Y*<sub>*F*</sub> are both constructed from the nonlinear part, i.e., the local structure information. Compared with the decomposition of GPLPLS shown in Fig. 9.2, the global linear part is eliminated.

## *11.3.2 LLEPLS for Process and Quality Monitoring*

In the LLEPLS model, the linear local embedding in the low-dimensional spaces of *X* and *Y* is formed by a few latent variables (*t*<sub>1</sub>, . . . , *t*<sub>*d*</sub>). The neighborhood mappings *E*<sub>0*L*</sub> and *F*<sub>0*L*</sub> are decomposed as follows:

$$\begin{aligned} E\_{0L} &= \sum\_{i=1}^{d} t\_i \mathbf{p}\_i^{\mathrm{T}} + \bar{E}\_{0L} = \mathbf{T} \mathbf{P}^{\mathrm{T}} + \bar{E}\_{0L} \\ F\_{0L} &= \sum\_{i=1}^{d} t\_i \mathbf{q}\_i^{\mathrm{T}} + \bar{F}\_{0L} = \mathbf{T} \mathbf{Q}^{\mathrm{T}} + \bar{F}\_{0L}, \end{aligned} \tag{11.16}$$

where *T* = [*t*<sub>1</sub>, *t*<sub>2</sub>, . . . , *t*<sub>*d*</sub>] denotes the score matrix, and *P* = [*p*<sub>1</sub>, . . . , *p*<sub>*d*</sub>] and *Q* = [*q*<sub>1</sub>, . . . , *q*<sub>*d*</sub>] denote the loading matrices of *E*<sub>0*L*</sub> and *F*<sub>0*L*</sub>, respectively. The score matrix *T* is represented in terms of the neighborhood-mapped data *E*<sub>0*L*</sub>,

$$T = E\_{0L} \mathbf{R} = S\_{V\_x} E\_0 \mathbf{R},\tag{11.17}$$

where *R* = [*r*1,..., *r<sup>d</sup>* ], and

$$r_i = \prod_{j=1}^{i-1} \left( I_m - w_j p_j^{\mathrm{T}} \right) w_i.$$

Equations (11.16) and (11.17) are difficult to apply directly in practice due to the calculation of the locality-preserving matrix *S*<sub>*V*<sub>*x*</sub></sub>, so the decomposition of the scaled and mean-centered *E*<sub>0</sub> and *F*<sub>0</sub> is given,

$$E\_0 = T\_0 \mathbf{P}^\mathrm{T} + \bar{E}\_0 \tag{11.18}$$

$$\begin{split} F_0 &= T_0 \bar{Q}^{\mathrm{T}} + \bar{F}_0 \\ &= E_0 R \bar{Q}^{\mathrm{T}} + \bar{F}_0, \end{split} \tag{11.19}$$

where *T*<sub>0</sub> = *E*<sub>0</sub>*R* and *Q̄* = *T*<sub>0</sub><sup>+</sup>*F*<sub>0</sub>.

Now consider the monitoring of a new sample *x* and subsequently its output *y*. The sample is first scaled and mean-centered; then an oblique projection on the input data space is derived,

$$\begin{aligned} \mathbf{x} &= \hat{\mathbf{x}} + \mathbf{x}\_{\varepsilon} \\ \hat{\mathbf{x}} &= \mathbf{P} \mathbf{R}^{\mathsf{T}} \mathbf{x} \\ \mathbf{x}\_{\varepsilon} &= \left( I - \mathbf{P} \mathbf{R}^{\mathsf{T}} \right) \mathbf{x} . \end{aligned} \tag{11.20}$$

The statistics *T* <sup>2</sup> and *Q* are calculated as follows:

$$\begin{aligned} t &= R^{\mathrm{T}} x \\ \mathrm{T}^2 &= t^{\mathrm{T}} \Lambda^{-1} t = t^{\mathrm{T}} \left( \frac{1}{n-1} T_0^{\mathrm{T}} T_0 \right)^{-1} t \\ \mathrm{Q} &= \|x_{\varepsilon}\|^2 = x^{\mathrm{T}} \left( I - P R^{\mathrm{T}} \right) x, \end{aligned} \tag{11.21}$$

where *Λ* is the sample covariance matrix.

The space of measured variables, i.e., the input space, is divided into two subspaces: the score subspace and the residual subspace. LLEPLS detects quality-related faults by the T<sup>2</sup> statistic in the score subspace and quality-irrelevant faults by the Q statistic in the residual subspace. However, the PLS scores that constitute the T<sup>2</sup> statistic still include variation orthogonal to *Y*, so LLEPLS remains deficient in quality-related fault detection.
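For a single scaled sample, the two statistics in (11.21) reduce to a few matrix-vector products. In this sketch Q is computed as the squared norm of the oblique-projection residual (*I* − *P R*<sup>T</sup>)*x*, which is one common reading of (11.21); the function and argument names are assumptions.

```python
import numpy as np

def llepls_statistics(x, R, P, T0):
    """T^2 and Q statistics of (11.21) for one scaled sample x.
    R, P: weight and loading matrices (m x d); T0: training scores (n x d)."""
    n = T0.shape[0]
    t = R.T @ x                            # score of the new sample
    Lam = (T0.T @ T0) / (n - 1)            # sample covariance of the scores
    T2 = float(t @ np.linalg.solve(Lam, t))
    x_res = x - P @ t                      # residual (I - P R^T) x
    return T2, float(x_res @ x_res)
```

A sample alarms in the score subspace when T<sup>2</sup> exceeds its control limit, and in the residual subspace when Q does.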

# **11.4 LLEOPLS Models and LLEOPLS-Based Fault Detection**

As demonstrated in Li et al. (2010) and Ding et al. (2013), the standard PLS performs an oblique decomposition of the measured process variables. The LLEPLS model (11.16) is likewise an oblique decomposition (11.20) of the measured process variables, similar to the standard PLS model. Thus, the major part of the measured process variables may include variations orthogonal to the output variables. In other words, the principal components still include output-irrelevant variation, and the residual part may include a large amount of output-related variation. In addition, the number of principal components often depends on the operator's decision and is likely to cause component redundancy. To solve these problems, it is necessary to further decompose the LLEPLS model in (11.18) and obtain an orthogonal decomposition of the measured process variables. In this model, the regression coefficient *R Q̄*<sup>T</sup> in (11.19) describes the relationship between *E*<sub>0</sub> and *F*<sub>0</sub>. Performing an SVD on *R Q̄*<sup>T</sup> yields the orthogonal decomposition,

$$\boldsymbol{\mathcal{R}} \boldsymbol{\bar{Q}}^{\mathrm{T}} = \boldsymbol{U}\_{pc} \mathbf{S}\_{pc} \mathbf{V}\_{pc}^{\mathrm{T}},\tag{11.22}$$

where *Spc* contains all non-zero singular values in descending order. *V pc* and *U pc* are the corresponding right and left singular vectors. Then,

$$\begin{split} F_0 &= E_0 U_{pc} S_{pc} V_{pc}^{\mathrm{T}} + \bar{F}_0 \\ &= T_{pc} Q_{pc}^{\mathrm{T}} + \bar{F}_0, \end{split} \tag{11.23}$$

where *T*<sub>*pc*</sub> = *E*<sub>0</sub>*U*<sub>*pc*</sub> and *Q*<sub>*pc*</sub> = *V*<sub>*pc*</sub>*S*<sub>*pc*</sub>. The output-residual subspace *F̄*<sub>0</sub> represents the part of the output that cannot be predicted from the process data, but it may still contain fault-relevant variation.

Furthermore, *E*<sup>0</sup> decomposes into two orthogonal subspaces by *T pc*.

$$\begin{split} E\_0 &= \hat{E}\_0 + X\_\varepsilon \\ &= T\_{pc} \boldsymbol{U}\_{pc}^\mathrm{T} + E\_0 \left( \boldsymbol{I} - \boldsymbol{U}\_{pc} \boldsymbol{U}\_{pc}^\mathrm{T} \right), \end{split} \tag{11.24}$$

where *Ê*<sub>0</sub> := *T*<sub>*pc*</sub>*U*<sub>*pc*</sub><sup>T</sup> and *X*<sub>ε</sub> = *E*<sub>0</sub>(*I* − *U*<sub>*pc*</sub>*U*<sub>*pc*</sub><sup>T</sup>). *X*<sub>ε</sub> denotes the orthogonal input-residual subspace. A new data sample *x* and subsequently *y* are orthogonally projected on the input data space for process and quality monitoring,

$$\begin{aligned} x &= \hat{x} + x_{\varepsilon} \\ \hat{x} &= U_{pc} U_{pc}^{\mathrm{T}} x \\ x_{\varepsilon} &= \left( I - U_{pc} U_{pc}^{\mathrm{T}} \right) x \\ t_{pc} &= U_{pc}^{\mathrm{T}} x \\ y_{\varepsilon} &= y - Q_{pc} t_{pc} \end{aligned} \tag{11.25}$$
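The chain (11.22)-(11.24) is a plain SVD followed by two projections. A minimal NumPy sketch, with assumed shapes *E*<sub>0</sub> (*n*×*m*), *R* (*m*×*d*), and *Q̄* (*p*×*d*) so that *R Q̄*<sup>T</sup> is *m*×*p*; the rank cutoff `1e-10` is an illustrative choice.

```python
import numpy as np

def lleopls_projection(E0, R, Qbar):
    """Orthogonal decomposition (11.22)-(11.24): an SVD of the regression
    coefficient R Qbar^T gives U_pc, and E0 splits into the span of U_pc
    and its orthogonal complement."""
    U, s, Vt = np.linalg.svd(R @ Qbar.T, full_matrices=False)
    r = int(np.sum(s > 1e-10))            # keep the non-zero singular values
    Upc, Spc, Vpc = U[:, :r], s[:r], Vt[:r].T
    Tpc = E0 @ Upc                        # output-related scores
    E0_hat = Tpc @ Upc.T                  # predictable part of E0
    Xres = E0 - E0_hat                    # orthogonal input-residual part
    Qpc = Vpc * Spc                       # Q_pc = V_pc S_pc (column scaling)
    return Upc, Qpc, Tpc, E0_hat, Xres
```

Because *U*<sub>*pc*</sub> has orthonormal columns, *X*<sub>ε</sub>*U*<sub>*pc*</sub> = 0, i.e., the residual part carries no output-related score, which is exactly what distinguishes LLEOPLS from the oblique LLEPLS split.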

The LLEOPLS model given in (11.23) and (11.24) has several parameters to be determined in advance. The selection of optimal parameters has been described for LLE (Kouropteva et al. 2002). The optimal parameters [*k*<sub>*x*</sub>, *k*<sub>*y*</sub>] of the LLEOPLS model are determined by simultaneously considering the characteristics of the LLE itself and the relationship between the input and output spaces. The following optimization determines [*k*<sub>*x*</sub>, *k*<sub>*y*</sub>]:

$$\begin{split} \left[k_x, k_y\right]_{opt} = \arg\min_{k_x, k_y} \Big( & 1 - \rho^2_{D_x D_{\phi_x}} + 1 - \rho^2_{D_y D_{\phi_y}} \\ & + 1 - \rho^2_{D_{\hat{y}} D_y} \Big|_{train} + 1 - \rho^2_{D_{\hat{y}} D_y} \Big|_{pre} \Big), \end{split} \tag{11.26}$$

where *ŷ* = *Q*<sub>*pc*</sub>*t*<sub>*pc*</sub>, and ·|<sub>*train*</sub> and ·|<sub>*pre*</sub> refer to the training and testing data sets, respectively. The first two terms in (11.26), 1 − ρ<sup>2</sup><sub>*D*<sub>*x*</sub>*D*<sub>φ<sub>*x*</sub></sub></sub> and 1 − ρ<sup>2</sup><sub>*D*<sub>*y*</sub>*D*<sub>φ<sub>*y*</sub></sub></sub>, evaluate the geometric similarity between the embedding space and the high-dimensional space. The last two terms, 1 − ρ<sup>2</sup><sub>*D*<sub>*ŷ*</sub>*D*<sub>*y*</sub></sub> on the training and testing sets, indicate the quality of the model, which indirectly reflects the role of the first two terms. Cross-validation is used to ensure the training results of the model. The last term is the most important part of (11.26),

$$\left[k_x, k_y\right]_{opt} = \arg\min_{k_x, k_y} \left( 1 - \left. \rho^2_{D_{\hat{y}} D_y} \right|_{pre} \right). \tag{11.27}$$

A generalized LLEOPLS model with the optimal parameters *k*<sub>*x*</sub> and *k*<sub>*y*</sub> can then be used to monitor the operation of the system. The T<sup>2</sup> statistics monitor the output-related scores (*T*<sub>*pc*</sub>), the output-residual part, and the input-residual part,

$$\begin{aligned} \mathrm{T}^2_{pc} &= t_{pc}^{\mathrm{T}} \Lambda_{pc}^{-1} t_{pc} = t_{pc}^{\mathrm{T}} \left\{ \frac{1}{n-1} T_{pc}^{\mathrm{T}} T_{pc} \right\}^{-1} t_{pc} \\ \mathrm{T}^2_{\varepsilon} &= x_{\varepsilon}^{\mathrm{T}} \Lambda_{x,\varepsilon}^{-1} x_{\varepsilon} = x_{\varepsilon}^{\mathrm{T}} \left\{ \frac{1}{n-1} X_{\varepsilon}^{\mathrm{T}} X_{\varepsilon} \right\}^{-1} x_{\varepsilon} \\ \mathrm{T}^2_{y,\varepsilon} &= y_{\varepsilon}^{\mathrm{T}} \Lambda_{y,\varepsilon}^{-1} y_{\varepsilon} = y_{\varepsilon}^{\mathrm{T}} \left\{ \frac{1}{n-1} Y_{\varepsilon}^{\mathrm{T}} Y_{\varepsilon} \right\}^{-1} y_{\varepsilon}, \end{aligned} \tag{11.28}$$

where *Λ*<sub>*pc*</sub>, *Λ*<sub>*x*,ε</sub>, and *Λ*<sub>*y*,ε</sub> denote the sample covariance matrices, and *Y*<sub>ε</sub> := *F̄*<sub>0</sub> = *F*<sub>0</sub> − *T*<sub>*pc*</sub>*Q*<sub>*pc*</sub><sup>T</sup>.

The *T*<sub>*pc*</sub> of the LLEOPLS method is not obtained from a scaled and mean-centered matrix *E*<sub>0*L*</sub>, so the control limits of the T<sup>2</sup> statistics are usually calculated from the probability density function estimated by the non-parametric kernel density estimation (KDE) method. The T<sup>2</sup><sub>*pc*</sub> and T<sup>2</sup><sub>ε</sub> statistics are both univariate, although the processes they represent are multivariate. The control limits for the monitoring statistics (T<sup>2</sup><sub>*pc*</sub>, T<sup>2</sup><sub>ε</sub>, and T<sup>2</sup><sub>*y*,ε</sub>) are then obtained from the corresponding PDF estimates,

$$\int_{-\infty}^{\mathrm{Th}_{pc,\alpha}} g(\mathrm{T}^2_{pc})\, d\mathrm{T}^2_{pc} = \alpha$$

$$\int_{-\infty}^{\mathrm{Th}_{x_{\varepsilon},\alpha}} g(\mathrm{T}^2_{\varepsilon})\, d\mathrm{T}^2_{\varepsilon} = \alpha$$

$$\int_{-\infty}^{\mathrm{Th}_{y_{\varepsilon},\alpha}} g(\mathrm{T}^2_{y,\varepsilon})\, d\mathrm{T}^2_{y,\varepsilon} = \alpha,$$

where

$$\mathbf{g}(z) = \frac{1}{lh} \sum\_{j=1}^{l} \mathcal{K}\left(\frac{z - z\_j}{h}\right).$$

in which *K*(·) and *h* are the kernel function and its bandwidth (smoothing parameter), respectively.
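The KDE-based control limit amounts to inverting the estimated CDF at confidence level α. A minimal sketch with a Gaussian kernel follows; the grid resolution and the Silverman rule-of-thumb bandwidth are illustrative assumptions, not prescribed by the chapter.

```python
import numpy as np

def kde_control_limit(stats, alpha=0.9975, h=None):
    """Control limit for a T^2 series: the smallest grid point at which
    the Gaussian-kernel density estimate's CDF reaches alpha."""
    z = np.asarray(stats, dtype=float)
    if h is None:
        h = 1.06 * z.std() * len(z) ** (-0.2)   # Silverman's rule of thumb
    grid = np.linspace(z.min() - 3 * h, z.max() + 3 * h, 2000)
    pdf = np.exp(-0.5 * ((grid[:, None] - z[None, :]) / h) ** 2).sum(axis=1)
    pdf /= len(z) * h * np.sqrt(2.0 * np.pi)
    cdf = np.cumsum(pdf) * (grid[1] - grid[0])  # rectangle-rule CDF
    idx = min(np.searchsorted(cdf, alpha), len(grid) - 1)
    return grid[idx]
```

Applied to each normal-operation statistic series in turn, this yields the three thresholds used by the detection logic below.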

Finally, the fault detection logic for the output-residual subspace is given,

$$\begin{aligned} \mathrm{T}^2_{y,\varepsilon} &> \mathrm{Th}_{y_{\varepsilon},\alpha} & \quad &\text{Unpredictable output faults} \\ \mathrm{T}^2_{y,\varepsilon} &\leq \mathrm{Th}_{y_{\varepsilon},\alpha} & \quad &\text{Fault-free in the unpredictable output.} \end{aligned} \tag{11.29}$$

T<sup>2</sup><sub>*y*,ε</sub> includes the output information, so it is suitable for monitoring the output-residual subspace. However, this a posteriori quality monitoring is not the focus here; process-based quality monitoring is of greater interest. The fault detection logic for the input space is (Zhou et al. 2018):

$$\begin{aligned} \mathrm{T}^2_{pc} &> \mathrm{Th}_{pc,\alpha} & &\text{Quality-relevant faults} \\ \mathrm{T}^2_{pc} &> \mathrm{Th}_{pc,\alpha} \ \text{or} \ \mathrm{T}^2_{\varepsilon} > \mathrm{Th}_{x_{\varepsilon},\alpha} & &\text{Process-relevant faults} \\ \mathrm{T}^2_{pc} &\le \mathrm{Th}_{pc,\alpha} \ \text{and} \ \mathrm{T}^2_{\varepsilon} \le \mathrm{Th}_{x_{\varepsilon},\alpha} & &\text{Fault-free} \end{aligned} \tag{11.30}$$
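The input-space detection logic (11.30) can be sketched as a simple decision function. One simplification in this sketch: a quality-relevant alarm is also process-relevant in (11.30), but the function returns the single most specific label.

```python
def classify_sample(T2_pc, T2_eps, th_pc, th_eps):
    """Detection logic of (11.30): an alarm on the output-related score is
    quality-relevant; an alarm on either statistic is process-relevant;
    otherwise the sample is fault-free."""
    if T2_pc > th_pc:
        return "quality-relevant"
    if T2_eps > th_eps:
        return "process-relevant"
    return "fault-free"
```

In an online setting this function is evaluated per sample against the KDE-based thresholds computed from normal-operation data.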

The monitoring procedure of the LLEOPLS algorithm for a complex industrial system is given as follows:


## **11.5 Case Study**

The fault detection strategies based on the proposed LLEPLS and LLEOPLS models are performed on the Tennessee Eastman Process (TEP) simulation platform (Lyman and Georgakis 1995). To demonstrate the effectiveness and rationality of the proposed monitoring strategy, the PLS monitoring strategy and the concurrent projection to latent structure (CPLS) model (Qin and Zheng 2012) are compared. With the CPLS algorithm, the input and output spaces are projected into five subspaces: the input-principal subspace, the input-residual subspace, the output-principal subspace, the output-residual subspace, and the joint input-output subspace. When only the monitoring capability for quality-related faults is considered, the input-residual subspace replaces the input-residual and input-principal subspaces in the CPLS model, and the T<sup>2</sup><sub>ε</sub> statistic replaces the corresponding monitoring strategy. To emphasize process-based quality monitoring, the output-residual subspace in the LLEOPLS model is not considered; similarly, the output-principal and output-residual subspaces in the CPLS model are not considered.

## *11.5.1 Models and Discussion*

All process measurement variables (XMEAS(1:22)) and manipulated variables (XMV(1:11)) form the input variable matrix *X*. The quality variable matrix *Y* consists of XMEAS(35) and XMEAS(38). The training data set is the normal data IDV(0), and the testing data consist of the 21 fault data sets IDV(1-21). The optimal parameters of LLEPLS and LLEOPLS are *k*<sub>*x*</sub> = 24 and *k*<sub>*y*</sub> = 20. The numbers of principal components of the PLS, CPLS, LLEPLS, and LLEOPLS models are 6, 6, 5, and 5, respectively.

From the analysis in Chaps. 9 and 10, it is known that, when component G (XMEAS(35)) and component E (XMEAS(38)) are selected as product quality variables, faults IDV(3,4), IDV(9,11), IDV(14,15), and IDV(19) have almost no effect on product quality, while the other faults produce significant variations in the quality variables. The FDR and FAR of PLS, LLEPLS, CPLS, and LLEOPLS at the control limit


**Table 11.1** FDR of PLS, LLEPLS, CPLS, and LLEOPLS

with confidence level 99.75% are shown in Tables 11.1 and 11.2, respectively. Based on the two tables, the monitoring results of LLEOPLS differ somewhat from the others, whose FARs are almost identical; examples are IDV(14) and IDV(17). These are considered quality-related faults by the PLS method, whereas the LLEOPLS method indicates that they are quality-irrelevant.

Which monitoring results are more credible? The posterior quality alarm rate (PQAR) is quantified below to assess whether the final fault detection result is reasonable.

$$\text{PQAR} = \frac{\text{No. of samples}\,(|Y_F| > 3 \mid f \neq 0)}{\text{total samples}\,(f \neq 0)} \times 100,\tag{11.31}$$

where *Y<sub>F</sub>* denotes the scaled and mean-centered output data of the fault cases. The PQAR is also given in Table 11.1. The 21 faults are divided into two categories by PQAR. Type I faults are quality-independent (PQAR<sub>i</sub> < 6, *i* = 1, 2, ..., 21), including IDV(3,4,9,11,14,15,17,19,20). Type II faults are quality-relevant and are further


**Table 11.2** FAR of PLS, LLEPLS, CPLS, and LLEOPLS

classified into three categories: IDV(16) has a slight effect on quality; IDV(1, 2, 5, 6, 7, 8, 10, 12, 13, 18) have a serious effect on quality; and IDV(21) causes a slow drift of the output variable. Apparently, the LLEOPLS method (T<sup>2</sup><sub>pc</sub>) reaches a conclusion consistent with this classification. That is, the LLEOPLS model better eliminates quality-independent interference alarms. However, there are still some differences between the PQAR and T<sup>2</sup><sub>pc</sub> alarm rates, such as for IDV(5), IDV(7), and IDV(20). What causes these differences? Next, the differences between the LLEOPLS method and the other methods are further analyzed based on the PQAR and T<sup>2</sup><sub>pc</sub> alarm rates.
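As a concrete sketch of Eq. (11.31), the snippet below scales the fault-period outputs with normal-data statistics and counts samples outside the 3σ band. It is a minimal numpy illustration, not the TEP evaluation code; the "alarm if any quality variable exceeds the band" rule and all names are our assumptions.

```python
import numpy as np

def pqar(y_fault, mu, sigma):
    """Posterior quality alarm rate, cf. (11.31): percentage of fault-period
    samples whose scaled, mean-centered output leaves the 3-sigma band
    (here: alarm if ANY quality variable exceeds the band)."""
    y_scaled = (np.asarray(y_fault) - mu) / sigma   # scale with normal-data statistics
    alarms = np.any(np.abs(y_scaled) > 3.0, axis=1)
    return 100.0 * alarms.mean()

# toy illustration: one quality variable sustains a large deviation
rng = np.random.default_rng(0)
y_normal = rng.normal(size=(500, 2))
mu, sigma = y_normal.mean(axis=0), y_normal.std(axis=0)
y_fault = rng.normal(size=(200, 2))
y_fault[:, 0] += 5.0                                # sustained 5-sigma shift
print(pqar(y_fault, mu, sigma))                     # high alarm rate for this drift
```

A quality-recoverable fault, whose scaled outputs return inside the band, would yield a low PQAR under the same computation.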

# *11.5.2 Fault Detection Analysis*

The differences in fault detection results are discussed for the PLS (CPLS) model and the LLEPLS (LLEOPLS) model, respectively. For output or process variables with no faults or only minor faults (IDV(3,9,15)), both approaches

**Fig. 11.4** PLS, LLEPLS, CPLS, and LLEOPLS monitoring result for IDV(1) and the output predicted values

provide consistent conclusions. For the other faults, there are some differences in the diagnostic results. Two failure cases, quality-recoverable faults and quality-irrelevant faults, are analyzed as follows. Subplots (a–d) of Figs. 11.4, 11.5 and 11.6 are the monitoring results based on the statistics T<sup>2</sup><sub>pc</sub> and T<sup>2</sup><sub>e</sub>, respectively. The blue line shows the monitored value and the red dashed line shows the 99.75% control limit. The corresponding subplots (e) and (f) give the output prediction, where the blue dashed line is the measured value and the green line is the predicted value.

#### **Experiment 1: Quality-Recoverable Faults**

Consider the faults IDV(1), IDV(5), and IDV(7). All of them are step faults, but the in-process feedback controller or cascade controller can compensate the changes in the output variables; therefore, the product quality variables under the fault conditions IDV(1), IDV(5), and IDV(7) tend to return to normal. The monitoring results of the PLS, LLEPLS, CPLS, and LLEOPLS methods for IDV(1) are shown in Fig. 11.4.

It is easy to find that the T<sup>2</sup><sub>e</sub> statistics of the CPLS and LLEOPLS methods can detect the process-related faults. The T<sup>2</sup><sub>pc</sub> statistic of the LLEOPLS model returns below the control limit, which indicates that these faults are quality recoverable. Existing work in the literature reports high detection rates for these faults. For example, the PLS, CPLS, and LLEPLS methods give many false alarms based on T<sup>2</sup> for IDV(1). In this case, the LLEOPLS method accurately reflects the changes in both process variables and quality variables.

**Fig. 11.5** PLS and LLEPLS monitoring result for IDV(17) and the output predicted values

**Fig. 11.6** PLS and LLEPLS monitoring result for IDV(20) and the output predicted values

For IDV(1), a huge difference between FDR(T<sup>2</sup>) and PQAR can be observed. On the one hand, FDR(T<sup>2</sup> or T<sup>2</sup><sub>pc</sub>) is based on the principal components of the process variables (without time delay), while PQAR is obtained from the actual output values (with time delay); the two are not equivalent. On the other hand, the data used for modeling are collected under normal operating conditions, not under fault conditions, so the nonlinear features may not be fully excited (i.e., the nonlinearities appear linear during normal, steady operation). When a fault occurs, the nonlinearity is fully excited and may lead to false and missed alarms because the original model can no longer predict the output. In fact, using T<sup>2</sup> to monitor quality-related faults implicitly assumes that the output of the system can still be well predicted by the model in case of a failure. Although the variation of the predicted value of the PLS model (XMEAS(38)) follows the variation of the actual output value, the predicted value is too large, which results in a much larger FDR (T<sup>2</sup> in the PLS, CPLS, and LLEPLS models) than the PQAR. Nevertheless, the monitoring results of CPLS and LLEPLS are closer to reality thanks to the orthogonalization strategy and the local linear embedding strategy, respectively.

#### **Experiment 2: Quality-Irrelevant Faults**

Faults IDV(4,11,14,17,19,20) are quality-irrelevant, among which IDV(4), IDV(11), IDV(14), and IDV(17) are quality-independent but process-related. The monitoring results and output predictions for IDV(17) are shown in Fig. 11.5. As shown in Fig. 11.5e, f, the PLS model cannot predict the output values well, while the LLEPLS model predicts them very accurately. Consequently, many false alarms are generated by the T<sup>2</sup> statistic of the PLS method. There are two possible reasons: the PLS model does not map the nonlinear functions well, and its principal components contain variations orthogonal to the output variables. Although CPLS removes the orthogonal part of PLS, its ability to extract nonlinearity is still poor. In contrast, the LLEPLS model captures the nonlinear structure well and filters out these false alarms through LLE.

IDV(20) is another touchstone for fault detection. The monitoring results and output predictions are shown in Fig. 11.6. Judged by PQAR, no method detects this fault well, but the LLEOPLS method performs best. The prediction results show that the LLEPLS model can predict the output variation well. With the orthogonal component removed, the question remains why T<sup>2</sup><sub>pc</sub> still fails to yield consistent results. One underlying reason is that the nonlinear dynamics excited by IDV(20) cannot be well described by the parameters [*k<sub>x</sub>*, *k<sub>y</sub>*] = [24, 20], which in turn leads to a wrong classification. Another reason may be the different control limits of PQAR and T<sup>2</sup><sub>pc</sub>. The PQAR statistics are obtained by assuming that the output variables obey a Gaussian distribution, so their control limits are determined by the threefold standard deviation criterion. In contrast, the 99.75% control limit of T<sup>2</sup><sub>pc</sub> is obtained by non-parametric estimation, which differs from the Gaussian assumption: the 99.75% control limit of T<sup>2</sup><sub>pc</sub> is 9.9583 under non-parametric KDE but 12.0708 under the Gaussian assumption. In fact, the monitoring results of T<sup>2</sup><sub>pc</sub> of LLEOPLS

**Fig. 11.7** PQAR and the corresponding LLEOPLS monitoring results

show that most of the alarms are transient alarms and few are continuous, where the transient alarms may be caused by noise.

#### **Experiment 3: Other Quality-Related Faults**

For the other quality-related faults, the FDR results of these methods in Table 11.1 are essentially the same. However, the results differ significantly for IDV(2), IDV(8), IDV(21), etc. The superiority of the proposed method is further verified by comparing the PQAR of IDV(2), IDV(8), and IDV(21); the monitoring results are shown in Fig. 11.7. Although faults IDV(2) and IDV(8) are quality-related, the quality still meets the production requirements even under these fault conditions, so the quality-related alarm rate should not be high. The monitoring results of the proposed LLEOPLS method are consistent with the PQAR.

## **11.6 Conclusions**

Nonlinear regression modeling and analysis is a particularly tricky task. The LLEPLS model transforms the nonlinear regression problem into a combination of multiple local linear regression problems using the local linear embedding feature. It not only preserves the local properties of the original data, but also maximizes the correlation between the input space and the output space, thereby predicting the quality variables more accurately. However, the T<sup>2</sup><sub>pc</sub> statistic of the LLEPLS model contains variation orthogonal to the output. To eliminate it, the input space of LLEPLS is further orthogonally decomposed and the corresponding statistical criteria are established, yielding LLEOPLS. The characteristics of the LLEOPLS model, with its nonlinear mapping and orthogonal decomposition, are further clarified by comparison with the PLS, CPLS, and LLEPLS models on the TEP benchmark simulation. The simulation results show that the LLEOPLS model is more effective for nonlinear systems and yields better (more consistent) fault detection performance than the PLS, CPLS, and LLEPLS models. Although LLEOPLS has good quality-related monitoring performance for nonlinear processes, it has some limitations, such as the assumptions that the low-dimensional manifold on which the sampled data lie is locally linear and that the noise follows a Gaussian distribution. These are the directions of our further research.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 12 New Robust Projection to Latent Structure**

In many actual nonlinear systems, especially near the equilibrium point, linearity is the primary feature and nonlinearity is the secondary feature. For a system that deviates from the equilibrium point, the secondary nonlinearity or local structural feature can also be regarded as a small uncertainty, just as nonlinearity can be used to represent the uncertainty of a system (Wang et al. 2019). This chapter therefore also focuses on how to deal with the nonlinearity in PLS-type methods, but from a different view, i.e., robust PLS. Here the system nonlinearity is treated as uncertainty, and a new robust L1-PLS is proposed.

The traditional PLS method and its nonlinear improvements usually maximize the covariance between the input and output data, i.e., the square of the L2 norm. The L2 norm has a clear physical meaning and is convenient to compute, and its solution is unique, unbiased, and dense. However, it is powerless for systems with rich local features, such as nonlinear or uncertain systems. The proposed robust L1-PLS aims at robustness of both the feature extraction and the regression coefficients. The method maintains the relative size of the signal during feature extraction. Moreover, it guarantees that the features are robust to outliers from a global statistical view and sensitive to local structural information.

# **12.1 Motivation of Robust L1-PLS**

Many robust PLS methods have recently been developed to increase the robustness of the traditional PLS method. Branden (2004) and Hubert (2008) replaced the empirical variance–covariance matrix in PLS with a robust covariance estimator, using the minimum covariance determinant (MCD) estimator and the reweighted MCD (RMCD) estimator for low-dimensional data sets. Turkmen (2013) proposed an influence function analysis for the robust PLS estimator. Currently, the existing robust PLS methods rely on robust covariance estimation techniques combined with the identification of multivariate outliers to maintain robustness (Fortuna et al. 2007; Filzmoser 2016). These

<sup>©</sup> The Author(s) 2022

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_12

methods implicitly assume that the signal follows a Gaussian distribution, which is not satisfied in many industrial processes. Industrial data usually contain many outliers and follow either a heavy-tailed distribution (Domański 2019) or a multipeak distribution (Wang 2000). In other words, the statistical properties of such data cannot be described by robust covariance matrix estimation. Furthermore, outliers may contain very important information, so they cannot simply be deleted or replaced (Liu et al. 2018). Besides the outliers, the data also have some nondominant local structural features. Robust covariance estimation methods also fail to handle this small uncertainty correctly.

Recently, a robust PCA (RPCA) (Kwak 2008) and a robust sparse PCA (RSPCA) (Meng et al. 2012) were proposed; both maximize the L1 norm rather than the square of the L2 norm of the input data. Experiments showed that they are efficient and robust for data with inherent uncertainty and outliers. However, these two improved RPCA methods do not exploit any information from the output quality variables, so it is difficult to apply them directly to quality-relevant process monitoring and fault diagnosis (Zhou et al. 2018). The monitoring system will automatically raise an alarm whenever a fault is detected, whether or not it affects the product quality, and many such alarms are meaningless for the final product quality.

It is known that least absolute deviation (LAD) regression is often better than least squares (LS) regression for non-Gaussian signals, especially those with heavy-tailed distributions, and LAD regression is insensitive to outliers. However, the solution of LAD regression is not unique, and optimization techniques must be introduced to obtain an optimal solution, so LAD regression of a high-dimensional system is time-consuming. To improve the efficiency of the LAD algorithm, the idea of partial least squares (PLS) regression is used to extend conventional LAD regression to partial LAD regression. The PLS-based monitoring method decomposes the process space through the correlation between the quality and the process variables, which can reflect the quality-relevant product changes in the process variables (Wang et al. 2017; Zhou et al. 2018).

To enhance the robustness of the PLS method in a new way, this chapter proposes a novel dual-robustness projection to latent structure regression method based on the L1 norm, L1-PLS. In the PLS method, the optimization objective during principal component extraction is the square of the L2 norm, i.e., a least squares regression problem. L1-PLS replaces this L2 norm maximization with L1 norm maximization, and L1 norm penalty terms are added to the direction vectors in the latent structure construction. Moreover, partial LAD regression is used to obtain the regression coefficients. Therefore, the L1-PLS regression method achieves dual robustness: robust principal components and robust regression coefficients. In addition, the L1 norm objective retains local structural features better than the L2 norm objective.

L1-PLS is distinguished from other existing robust PLS methods in several respects:


# **12.2 Introduction to RSPCA Method**

Consider the input data *X* = [*x*(1), ···, *x*(*n*)] ∈ *R*<sup>*m*×*n*</sup>, where *x* = [*x*<sub>1</sub>, ···, *x*<sub>*m*</sub>]<sup>T</sup>; *m* and *n* are the dimensionality of the input data and the number of samples, respectively. The traditional PCA method aims to find the *d* (*d* < *m*) dimensional linear subspace with the largest input data variance. The objective function is as follows:

$$W^* = \arg\max \left\| W^{\mathrm{T}} X \right\|_2^2, \ \text{s.t.} \ W^{\mathrm{T}} W = I_d,\tag{12.1}$$

where *W* = [*w*<sub>1</sub>, ..., *w*<sub>*d*</sub>] ∈ *R*<sup>*m*×*d*</sup> is the weight matrix, and ‖·‖<sub>2</sub> represents the L2 norm of a matrix or vector.

However, the principal components obtained by PCA are usually linear combinations of the original variables with non-zero weights. These non-zero weights cause many irrelevant variables to be included in the final model and introduce unnecessary interference. Therefore, the sparse PCA (SPCA) method was proposed to achieve as sparse an expression of the principal components as possible (Liu 2014). Its objective function is

$$W^* = \arg\max \left\| W^{\mathrm{T}} X \right\|_2^2, \ \text{s.t.} \ W^{\mathrm{T}} W = I_d, \ \|W\|_1 < s,\tag{12.2}$$

where ‖·‖<sub>1</sub> is the L1 norm of a matrix or vector, introduced as a constraint or penalty term to enhance the sparsity of the principal components, and *s* is the number of non-zero weights. The L1 norm penalty term (‖*W*‖<sub>1</sub> < *s*) realizes the sparse expression of the direction vectors.

Figure 12.1 shows the amplifying effect of the L1 norm and the L2 norm on noise. The blue dotted line is the square of the L2 norm (for one-dimensional data, this is equivalent to the L2 norm), and the red line is the L1 norm. Obviously, the squared L2 norm suppresses data with |*x*| ≤ 1 and amplifies data with |*x*| > 1. The L1 norm maintains the relative size of the original data and has a

relatively small expansion effect on all data. To further improve the robustness of SPCA, the RSPCA method was proposed to reduce the sensitivity of the principal components to outliers: the L2 norm in the objective function is replaced by the L1 norm (Zou et al. 2006). The optimization function of RSPCA is given as follows:

$$w^* = \arg\max \left\| X^{\mathrm{T}} w \right\|_1, \ \text{s.t.} \ w^{\mathrm{T}} w = 1, \ \|w\|_1 < s.\tag{12.3}$$

Here the optimization problem combines L1 norm maximization with an L1 norm penalty term. To obtain the principal components of the RSPCA method, the optimal direction vector *w*<sup>∗</sup> is calculated by Algorithm 3.

The convergence of Algorithm 3 and the rationality of the obtained sparse direction vectors have been theoretically verified (Zou et al. 2006). However, Algorithm 3 requires the sparsity of the data to be given a priori when computing the sparse direction vector. Generally speaking, the sparsity of input data is unknown and contains uncertainty. More importantly, the RSPCA method cannot be directly applied to quality-related process monitoring. Therefore, this chapter introduces the L1 norm into the PLS method.
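The update in Algorithm 3 can be sketched as follows. This is a simplified reading of the RSPCA iteration (sign-weighted sum, soft thresholding at the (*s* + 1)th largest |v<sub>i</sub>|, renormalization) with a basic convergence test added; the function name, initialization, and stopping tolerances are illustrative choices, not taken from the original algorithm.

```python
import numpy as np

def rspca_direction(X, s, max_iter=200, tol=1e-8):
    """One sparse principal direction by L1-norm maximization with
    soft thresholding -- a simplified reading of Algorithm 3.
    X is m x n (variables x samples); s is the target sparsity."""
    m, n = X.shape
    w = np.full(m, 1.0 / np.sqrt(m))            # unit-norm initial direction
    for _ in range(max_iter):
        p = np.where(w @ X >= 0, 1.0, -1.0)     # polarities p_i(k) = sgn(w^T(k) X_i)
        v = X @ p                               # v = sum_i p_i(k) X_i
        gamma = np.sort(np.abs(v))[::-1][min(s, m - 1)]   # (s+1)th largest |v_i|
        beta = np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)  # soft threshold
        if np.linalg.norm(beta) == 0.0:         # threshold removed everything
            break
        w_new = beta / np.linalg.norm(beta)
        if np.linalg.norm(w_new - w) < tol:     # converged
            return w_new
        w = w_new
    return w
```

Because only entries strictly above the threshold survive, the returned direction has at most *s* non-zero weights.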

# **12.3 Basic Principle of L1-PLS**

The dual-robustness projection to latent structure (L1-PLS) method is derived based on the L1 norm, aiming to improve the robustness of the traditional PLS method. The PLS method extracts principal components from the input space and the output space, and the principal components should satisfy the following conditions: carry the maximum variation information (representation) of their respective variable spaces as much as

#### **Algorithm 3** RSPCA algorithm for one sparse PC

#### **Input:**

Data matrix *X*, sparsity *s*.

#### **Output:**

The *s* sparse PC *w*∗.

1: Initialization: *w*(0) ∈ *R*<sup>*m*×1</sup>, set *w*(0) = *w*(0)/‖*w*(0)‖<sub>2</sub>, and *k* = 0.

2: Let *v* = (v<sub>1</sub>, ..., v<sub>*m*</sub>)<sup>T</sup> = Σ<sub>*i*=1</sub><sup>*n*</sup> *p*<sub>*i*</sub>(*k*)*X*<sub>*i*</sub>, where *p*<sub>*i*</sub>(*k*) = 1 if *w*<sup>T</sup>(*k*)*X*<sub>*i*</sub> ≥ 0 and *p*<sub>*i*</sub>(*k*) = −1 if *w*<sup>T</sup>(*k*)*X*<sub>*i*</sub> < 0, and *X*<sub>*i*</sub> is the *i*th column of the matrix *X*. Let γ be the (*s* + 1)th largest element in |*v*|.

3: Let *β* = (β<sub>1</sub>, ..., β<sub>*m*</sub>)<sup>T</sup>, where β<sub>*i*</sub> = sgn(v<sub>*i*</sub>)(|v<sub>*i*</sub>| − γ)<sub>+</sub>, *i* = 1, ..., *m*, with (*z*)<sub>+</sub> = *z* for *z* > 0 and 0 for *z* ≤ 0, and sgn(*z*) = 1 for *z* > 0, 0 for *z* = 0, −1 for *z* < 0. Set *w*(*k* + 1) = *β*/‖*β*‖<sub>2</sub> and *k* = *k* + 1.

7: **return** *w*<sup>∗</sup>;

possible, and the degree of correlation between different variable spaces is as large as possible (correlation). Take the extraction of the first principal component as an example. The PLS method is expressed as follows:

$$\begin{array}{l}E\_0^\mathrm{T}F\_0F\_0^\mathrm{T}E\_0\mathfrak{w}\_1 = \theta^2\mathfrak{w}\_1\\F\_0^\mathrm{T}E\_0E\_0^\mathrm{T}F\_0\mathfrak{c}\_1 = \theta^2\mathfrak{c}\_1,\end{array} \tag{12.4}$$

where *w*<sub>1</sub> and *c*<sub>1</sub> are the direction vectors of the principal components *t*<sub>1</sub> and *u*<sub>1</sub>. The optimization problem (12.4) is transformed into finding the unit direction vectors *w*<sub>1</sub> and *c*<sub>1</sub> corresponding to the maximum eigenvalue θ<sup>2</sup> of the matrices *E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>*F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub> and *F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub>*E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub>, respectively. It can be seen that the solution of (12.4) satisfies the representation and correlation requirements of the PLS method.

Then, multiplying both sides of equation (12.4) by *w*<sub>1</sub><sup>T</sup> and *c*<sub>1</sub><sup>T</sup>, respectively, we obtain

$$\begin{aligned} \mathbf{w}\_1^T \mathbf{E}\_0^\mathsf{T} \mathbf{F}\_0 \mathbf{F}\_0^\mathsf{T} \mathbf{E}\_0 \mathbf{w}\_1 &= \theta^2, \quad \text{s.t.} \mathbf{w}\_1^\mathsf{T} \mathbf{w}\_1 = 1\\ \mathbf{c}\_1^\mathsf{T} \mathbf{F}\_0^\mathsf{T} \mathbf{E}\_0 \mathbf{E}\_0^\mathsf{T} \mathbf{F}\_0 \mathbf{c}\_1 &= \theta^2, \quad \text{s.t.} \mathbf{c}\_1^\mathsf{T} \mathbf{c}\_1 = 1. \end{aligned} \tag{12.5}$$

To simplify further, we can get

$$\begin{aligned} \boldsymbol{\mathfrak{w}}\_{1}^{\*} &= \arg\max \left\lVert \boldsymbol{\mathfrak{w}}\_{1}^{\mathrm{T}} \boldsymbol{E}\_{0}^{\mathrm{T}} \boldsymbol{F}\_{0} \right\rVert\_{2}^{2}, \quad \text{s.t.} \quad \boldsymbol{\mathfrak{w}}\_{1}^{\mathrm{T}} \boldsymbol{\mathfrak{w}}\_{1} = 1\\ \boldsymbol{\mathfrak{c}}\_{1}^{\*} &= \arg\max \left\lVert \boldsymbol{\mathfrak{c}}\_{1}^{\mathrm{T}} \boldsymbol{F}\_{0}^{\mathrm{T}} \boldsymbol{E}\_{0} \right\rVert\_{2}^{2}, \quad \text{s.t.} \quad \boldsymbol{\mathfrak{c}}\_{1}^{\mathrm{T}} \boldsymbol{\mathfrak{c}}\_{1} = 1. \end{aligned} \tag{12.6}$$

The optimization problem of the traditional PLS (12.4) is thus expressed as the L2 norm optimization (12.6), where *w*<sub>1</sub><sup>∗</sup> and *c*<sub>1</sub><sup>∗</sup> are the optimal direction vectors.

It is known that in most cases noise flows into the regression model through the direction vectors (*w*<sub>1</sub> and *c*<sub>1</sub>), which affects the estimation of the regression parameters in the PLS method. Similar to the idea of equation (12.3), we replace the maximization of the L2 norm in the objective function (12.6) with the maximization of the L1 norm, and add an L1 norm penalty term on the direction vectors. The objective function of the L1-PLS method is therefore given as follows:

$$\begin{aligned} \boldsymbol{\mathfrak{w}}\_{1}^{\*} &= \arg\max \left\lVert \boldsymbol{\mathfrak{w}}\_{1}^{\mathrm{T}} \boldsymbol{E}\_{0}^{\mathrm{T}} \boldsymbol{F}\_{0} \right\rVert\_{1}, \quad \text{s.t.} \quad \boldsymbol{\mathfrak{w}}\_{1}^{\mathrm{T}} \boldsymbol{\mathfrak{w}}\_{1} = 1, \quad \lVert \boldsymbol{\mathfrak{w}}\_{1} \rVert\_{1} < s\_{1} \\ \boldsymbol{\mathfrak{c}}\_{1}^{\*} &= \arg\max \left\lVert \boldsymbol{\mathfrak{c}}\_{1}^{\mathrm{T}} \boldsymbol{F}\_{0}^{\mathrm{T}} \boldsymbol{E}\_{0} \right\rVert\_{1}, \quad \text{s.t.} \quad \boldsymbol{\mathfrak{c}}\_{1}^{\mathrm{T}} \boldsymbol{\mathfrak{c}}\_{1} = 1, \quad \lVert \boldsymbol{\mathfrak{c}}\_{1} \rVert\_{1} < s\_{2}, \end{aligned} \tag{12.7}$$

where *s*<sub>1</sub> and *s*<sub>2</sub> are the sparsities of the input-space and output-space data, respectively.

According to the above analysis, although the direction vectors (*w*<sub>1</sub> and *c*<sub>1</sub>) in (12.4) contain the correlation between the input data *E*<sub>0</sub> and the output data *F*<sub>0</sub>, they can fortunately be solved separately in (12.7). Therefore, Algorithm 3 is also suitable for solving (12.7), with the input data matrix *X* replaced by *E*<sub>0</sub><sup>T</sup>*F*<sub>0</sub> and *F*<sub>0</sub><sup>T</sup>*E*<sub>0</sub>, respectively. Note that *w*<sub>1</sub> and *c*<sub>1</sub> are solved independently rather than jointly by Algorithm 3.

Once the optimal direction vectors *w*<sub>1</sub> and *c*<sub>1</sub> are obtained, the score vectors in the latent space, i.e., the first principal component pair *t*<sub>1</sub> and *u*<sub>1</sub>, can be calculated as

$$t_1 = E_0 w_1, \quad u_1 = F_0 c_1.\tag{12.8}$$

Next, the regression coefficients (loading vectors) of *F*<sup>0</sup> and *E*<sup>0</sup> to *t*<sup>1</sup> will be established. In the traditional PLS model, the regression coefficients *p*<sup>1</sup> and *q*<sup>1</sup> are estimated by least squares, namely,

$$\begin{array}{lcl}\mathbf{p}\_{1} = \mathbf{E}\_{0}^{\mathrm{T}} \mathbf{t}\_{1} / \|\mathbf{t}\_{1}\|^{2} \\ \mathbf{q}\_{1} = \mathbf{F}\_{0}^{\mathrm{T}} \mathbf{t}\_{1} / \|\mathbf{t}\_{1}\|^{2} .\end{array} \tag{12.9}$$

Similarly, least squares estimation is also susceptible to outliers, and the least absolute deviation (LAD) method is introduced to deal with this problem. Therefore, in order to further improve the robustness, LAD regression is used to solve the regression coefficients in the L1-PLS algorithm, namely,

$$p_1^* = \arg\min \left\| E_0 - t_1 p_1^{\mathrm{T}} \right\|_1, \quad q_1^* = \arg\min \left\| F_0 - t_1 q_1^{\mathrm{T}} \right\|_1,\tag{12.10}$$

where *p*<sub>1</sub><sup>∗</sup> and *q*<sub>1</sub><sup>∗</sup> are the optimal loading vectors of (12.10).

Obviously, (12.10) is also an L1 norm formulation. When there are few outliers, it is not strictly necessary to use the L1 norm to solve the regression coefficients: since the direction vectors have already been obtained by L1 norm maximization, the influence of outliers is reduced, as can be seen from Fig. 12.1. When the outliers are small, the L2 norm and the L1 norm have the same effect.
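Because *t*<sub>1</sub> is a single score vector, each column of the LAD problem (12.10) decouples into a one-dimensional LAD fit through the origin, which is solved exactly by a weighted median of the ratios *E*[*i*, *j*]/*t*<sub>*i*</sub> with weights |*t*<sub>*i*</sub>|. The following sketch illustrates this special case; the helper names are ours, and a general-purpose LAD solver (e.g., linear programming) would be needed for multi-regressor settings.

```python
import numpy as np

def weighted_median(values, weights):
    """Point z minimizing sum_i weights_i * |values_i - z|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w)
    return v[np.searchsorted(cdf, 0.5 * cdf[-1])]

def lad_loading(E, t):
    """Columnwise LAD loading p* = argmin ||E - t p^T||_1, cf. (12.10).
    With a single score vector t, column j decouples into
    min_p sum_i |t_i| * |E[i, j]/t_i - p|, an exact weighted median."""
    mask = np.abs(t) > 1e-12                    # drop near-zero scores
    tm = t[mask]
    return np.array([weighted_median(E[mask, j] / tm, np.abs(tm))
                     for j in range(E.shape[1])])
```

Unlike the least squares loading (12.9), a few grossly corrupted samples move the weighted median very little, which is exactly the robustness (12.10) is after.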

Calculate the residual matrix *E*<sup>1</sup> and *F*1:

$$E\_1 = E\_0 - t\_1 \mathbf{p}\_1^\mathrm{T}, \; F\_1 = F\_0 - t\_1 \mathbf{q}\_1^\mathrm{T} \tag{12.11}$$

Similar to the extraction of the first principal component pair, the other principal components are calculated iteratively by decomposing the residuals *E<sub>i</sub>* and *F<sub>i</sub>* (*i* = 1, ..., *d* − 1). The extraction stops once the model determined by the extracted principal components satisfies the desired requirements.

The dual robustness of the L1-PLS algorithm is reflected in the following two aspects:


# **12.4 L1-PLS-Based Process Monitoring**

Only the calculation of the direction vectors *w*<sub>1</sub> and *c*<sub>1</sub> (12.7) and of the regression coefficients *p*<sub>1</sub> and *q*<sub>1</sub> (12.10) is modified in the L1-PLS method; the other steps are unaffected. Therefore, the monitoring procedure based on the L1-PLS method is the same as for the PLS method. In L1-PLS-based process monitoring, the T<sup>2</sup> and T<sup>2</sup><sub>e</sub> statistics are still used to monitor the principal component subspace and the residual subspace. The L1-PLS-based monitoring is described in detail in Algorithm 4 (offline process training) and Algorithm 5 (online process monitoring); the corresponding flowchart is shown in Fig. 12.2.

In Algorithms 4 and 5, *Λ* and *Λ<sub>e</sub>* represent the sample covariance matrices. The non-parametric kernel density estimation (KDE) method (1.33) is used to estimate the corresponding control limits of T<sup>2</sup> and T<sup>2</sup><sub>e</sub>.
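As an illustration of these statistics and the KDE-based limits, the sketch below computes per-sample T<sup>2</sup> values and reads a control limit off an estimated density. It uses scipy's `gaussian_kde` as a stand-in, whose default bandwidth need not match the KDE of Eq. (1.33); the T<sup>2</sup><sub>e</sub> statistic follows the same pattern with the residuals and *Λ<sub>e</sub>*.

```python
import numpy as np
from scipy.stats import gaussian_kde

def t2_statistics(T_scores):
    """Per-sample Hotelling T^2: t Lambda^{-1} t^T, with Lambda the
    sample covariance of the training scores."""
    n = T_scores.shape[0]
    lam_inv = np.linalg.inv((T_scores.T @ T_scores) / (n - 1))
    return np.einsum('ij,jk,ik->i', T_scores, lam_inv, T_scores)

def kde_control_limit(stats, alpha=0.9975, grid=4000):
    """Control limit as the alpha-quantile of a KDE fitted to the
    training statistics (smallest grid point whose CDF reaches alpha)."""
    kde = gaussian_kde(stats)
    xs = np.linspace(0.0, 3.0 * stats.max(), grid)
    cdf = np.cumsum(kde(xs))
    cdf /= cdf[-1]
    return xs[np.searchsorted(cdf, alpha)]
```

Online, a new sample alarms when its T<sup>2</sup> exceeds the limit returned by `kde_control_limit` on the training statistics.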

There is still a key problem in the implementation of Algorithm 4: the sparsity degrees *s*<sub>1</sub> and *s*<sub>2</sub> need to be given a priori. There are two common strategies to determine *s*<sub>1</sub> and *s*<sub>2</sub>. (1) The first is the variable importance in prediction (VIP) method (Farrés et al. 2015). It judges whether a variable is irrelevant based on its VIP score for the *j*th predicted value of the response variable. Usually, a "greater than" criterion is used for selection; more precisely, the threshold

#### **Algorithm 4** L1-PLS method for Offline process training

#### **Input:**

Normal data sets *<sup>X</sup>* = [*x*1,..., *xm*] ∈ *<sup>R</sup>n*×*m*, *<sup>Y</sup>* = [*y*1,..., *yl*] ∈ *<sup>R</sup>n*×*<sup>l</sup>* , sparsity *s*<sup>1</sup> and *s*2. **Output:**

	- to get the direction vectors *w<sup>i</sup>* and *c<sup>i</sup>* , respectively.
	- (2.2) Calculate the score vectors: *t<sup>i</sup>* = *Ei*−1*wi*, *u<sup>i</sup>* = *Fi*−<sup>1</sup> *c<sup>i</sup>* .
	- (2.3) Calculate the load vectors:

$$\begin{array}{llll} \mathbf{p}\_{1} = E\_{0}^{\mathrm{T}} t\_{1} / \|\boldsymbol{t}\_{1}\|^{2} & \mathbf{p}\_{1}^{\*} = \mathrm{arg}\,\min \left\| \boldsymbol{E}\_{0} - \boldsymbol{t}\_{1} \right\| \boldsymbol{t}\_{1} \\ \mathbf{q}\_{1} = F\_{0}^{\mathrm{T}} t\_{1} / \|\boldsymbol{t}\_{1}\|^{2} & \mathbf{q}\_{1}^{\*} = \mathrm{arg}\,\min \left\| \boldsymbol{F}\_{0} - \boldsymbol{t}\_{1} \boldsymbol{q}\_{1}^{\mathrm{T}} \right\| \boldsymbol{t}\_{1} \\ \dots & \dots & \dots & \dots \end{array}$$

(2.4) Calculate the Residual matrix: *<sup>E</sup><sup>i</sup>* <sup>=</sup> *<sup>E</sup>i*−<sup>1</sup> <sup>−</sup> *<sup>t</sup><sup>i</sup> <sup>p</sup>*<sup>T</sup> *<sup>i</sup>* , *<sup>F</sup><sup>i</sup>* <sup>=</sup> *<sup>F</sup>i*−<sup>1</sup> <sup>−</sup> *<sup>u</sup><sup>i</sup> <sup>q</sup>*<sup>T</sup> *i* . (3) Describe *t<sup>i</sup>* with the original matrix *E*0: *T* = *E*0*R*,

$$\mathcal{R} = [r\_1, \dots, r\_d] \text{, in which } \boldsymbol{r}\_i = \prod\_{j=1}^{i-1} (I\_n - \boldsymbol{w}\_j \boldsymbol{p}\_j^\top) \boldsymbol{w}\_i.$$

$$\hat{\boldsymbol{E}} = \boldsymbol{T} \boldsymbol{P}^\top = E\_0 \boldsymbol{R} \boldsymbol{P}^\top$$

$$\bar{\boldsymbol{E}} = E\_0 - \hat{\boldsymbol{E}} = E\_0 (I\_n - \boldsymbol{R} \boldsymbol{P}^\top)$$

(4) For a normalized data sample *x*, calculate its estimate, residual and the corresponding PC value.

$$\begin{aligned} \hat{\boldsymbol{x}} &= \boldsymbol{R} \boldsymbol{P}^{\mathsf{T}} \boldsymbol{x} \\ \boldsymbol{t} &= \boldsymbol{R} \boldsymbol{x} \\ \boldsymbol{e} &= \boldsymbol{x} - \hat{\boldsymbol{x}} = \left( \boldsymbol{I} - \boldsymbol{R} \boldsymbol{P}^{\mathsf{T}} \right) \boldsymbol{x} \end{aligned}$$

(5) Calculate the statistics T<sup>2</sup> and T<sup>2</sup><sub>*e*</sub>:

$$\mathbf{T}^2 := \boldsymbol{t}\boldsymbol{A}^{-1}\boldsymbol{t}^{\mathrm{T}} = \boldsymbol{t}\left(\frac{1}{n-1}\boldsymbol{T}^{\mathrm{T}}\boldsymbol{T}\right)^{-1}\boldsymbol{t}^{\mathrm{T}}$$

$$\mathbf{T}\_e^2 := \boldsymbol{e}\boldsymbol{A}\_e^{-1}\boldsymbol{e}^{\mathrm{T}} = \boldsymbol{e}\left(\frac{1}{n-1}\bar{\boldsymbol{E}}^{\mathrm{T}}\bar{\boldsymbol{E}}\right)^{-1}\boldsymbol{e}^{\mathrm{T}}$$

**return** T<sup>2</sup><sub>lim</sub> and T<sup>2</sup><sub>*e*,lim</sub>;
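The deferred projection in step (3) can be assembled directly from the stored weights and loadings; a minimal numpy sketch (the function itself is illustrative, the matrix names follow the algorithm):

```python
import numpy as np

def projection_matrix(W, P):
    """Assemble R = [r_1, ..., r_d] with r_i = prod_{j<i}(I - w_j p_j^T) w_i,
    so that the training scores satisfy T = E_0 R (step (3))."""
    m, d = W.shape
    R = np.zeros((m, d))
    M = np.eye(m)                                 # running product of deflation terms
    for i in range(d):
        R[:, i] = M @ W[:, i]
        M = M @ (np.eye(m) - np.outer(W[:, i], P[:, i]))
    return R
```

For the first component r<sub>1</sub> = w<sub>1</sub>; each later column absorbs the deflation that was applied to *E*<sub>*i*−1</sub> before *w<sub>i</sub>* was computed.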

should be adjusted based on the distribution of the overall data in different situations. (2) The second strategy is the selectivity ratio method (Branden and Hubert 2004). The variable selection ratio is calculated as the ratio of the variance of the *X* variables explained by the *Y* target projection component to the residual variance. Then an F-test is performed to define the boundary between important variables and irrelevant variables. Since the VIP method is simple and easy to implement, it is selected here to determine the sparsity *s*<sub>1</sub> and *s*<sub>2</sub>.
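The chapter uses VIP to pick *s*<sub>1</sub> and *s*<sub>2</sub> but does not spell the score out; one common formulation of the VIP index for a fitted PLS model is sketched below (the formula and all names are assumptions, not taken from the text):

```python
import numpy as np

def vip_scores(W, T, Q):
    """One common VIP form (assumed here): for variable j,
    VIP_j = sqrt(m * sum_a ssy_a (w_aj/||w_a||)^2 / sum_a ssy_a),
    where ssy_a = ||q_a||^2 ||t_a||^2 is the Y-variance explained by
    component a.  W: m x d weights, T: n x d scores, Q: l x d loadings."""
    m, d = W.shape
    ssy = np.array([(Q[:, a] @ Q[:, a]) * (T[:, a] @ T[:, a]) for a in range(d)])
    Wn = W / np.linalg.norm(W, axis=0)        # normalize each weight vector
    return np.sqrt(m * (Wn ** 2 @ ssy) / ssy.sum())
```

Variables with VIP ≥ 1 are conventionally kept, so *s*<sub>1</sub> could be set to their count; by construction the VIP² values average to one.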

#### **Algorithm 5** L1-PLS method for Online process monitoring

#### **Input:**

New normalized data samples *x<sub>new</sub>* and *y<sub>new</sub>*.

#### **Output:**

Online process monitoring results.

(1) Calculate the new score vector: *t<sub>new</sub>* = *x<sub>new</sub>R*.

(2) Calculate the new prediction matrix and new residual:

$$
\hat{\boldsymbol{x}}\_{new} = \boldsymbol{t}\_{new} \boldsymbol{P}^{\mathrm{T}} = \boldsymbol{x}\_{new} \boldsymbol{R} \boldsymbol{P}^{\mathrm{T}}
$$

$$
\boldsymbol{e}\_{new} = \boldsymbol{x}\_{new} - \hat{\boldsymbol{x}}\_{new} = \boldsymbol{x}\_{new} (\boldsymbol{I}\_n - \boldsymbol{R} \boldsymbol{P}^{\mathrm{T}}).
$$

(3) Calculate the new statistics *T*<sup>2</sup><sub>new</sub> and *T*<sup>2</sup><sub>*e*,new</sub>:

$$T\_{new}^2 = \boldsymbol{t}\_{new}\boldsymbol{A}^{-1}\boldsymbol{t}\_{new}^{\mathrm{T}} = \boldsymbol{t}\_{new}\left(\frac{1}{n-1}\boldsymbol{T}^{\mathrm{T}}\boldsymbol{T}\right)^{-1}\boldsymbol{t}\_{new}^{\mathrm{T}}$$
 
$$T\_{e,new}^2 = \boldsymbol{e}\_{new}\boldsymbol{A}\_e^{-1}\boldsymbol{e}\_{new}^{\mathrm{T}} = \boldsymbol{e}\_{new}\left(\frac{1}{n-1}\bar{\boldsymbol{E}}^{\mathrm{T}}\bar{\boldsymbol{E}}\right)^{-1}\boldsymbol{e}\_{new}^{\mathrm{T}}$$

(4) Compare *T*<sup>2</sup><sub>new</sub> and *T*<sup>2</sup><sub>*e*,new</sub> with the corresponding control limits T<sup>2</sup><sub>lim</sub> and T<sup>2</sup><sub>*e*,lim</sub>.

**return** Online process monitoring results.

It is worth noting that the role of sparsity is to achieve variable selection. If the established system model contains many irrelevant variables, specifying the sparsity helps to limit the number of irrelevant variables and so realizes L1-sparse-PLS. However, if the sparsity of the input data is uncertain, the sparsity degrees *s*<sub>1</sub> and *s*<sub>2</sub> can be set equal to the number of variables in the input and output space, respectively, to eliminate the uncertainty caused by the sparsification. In this view, the proposed method is uniformly referred to as the L1-(S)PLS method, depending on the chosen sparsity.
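Steps (1)–(4) of Algorithm 5 amount to a few matrix products per sample. A sketch of one online monitoring step, assuming the offline stage has produced *R*, *P*, the score matrix *T*, the residual matrix *Ē*, and the control limits (function and variable names are illustrative):

```python
import numpy as np

def monitor_sample(x_new, R, P, T_scores, E_res, t2_lim, t2e_lim):
    """One step of the online loop.  R, P come from the offline stage;
    T_scores (training scores T) and E_res (residual matrix E-bar) give
    the covariance estimates A and A_e used in the statistics."""
    n = T_scores.shape[0]
    t_new = x_new @ R                          # step (1): new score vector
    x_hat = t_new @ P.T                        # step (2): prediction ...
    e_new = x_new - x_hat                      # ... and residual
    A = (T_scores.T @ T_scores) / (n - 1)      # step (3): T^2 statistics
    A_e = (E_res.T @ E_res) / (n - 1)
    t2 = t_new @ np.linalg.inv(A) @ t_new
    t2e = e_new @ np.linalg.pinv(A_e) @ e_new  # pinv: residual cov. is rank-deficient
    # step (4): alarm flags from the control-limit comparison
    return t2 > t2_lim, t2e > t2e_lim
```

The pseudo-inverse is used for the residual covariance because *Ē* lives in the complement of the score space and its covariance is generally singular.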

## **12.5 TE Simulation Analysis**

In this simulation, the input variable *X* is composed of 31 variables [XMEAS(1:22)] and [XMV(1:11) (except XMV(5) and XMV(9))]. The output variable *Y* consists of the quality components *G* (XMEAS(35)) and *H* (XMEAS(36)). Two simulation examples are used to verify the effectiveness of the L1-PLS method for fault detection.

**Fig. 12.2** The Flow chart of Algorithms 4 and 5

## *12.5.1 Robustness of Principal Components*

The robustness of the L1-PLS method is mainly embodied in the direction vectors, which directly reflect the robustness of the PCs. The variation of the PC structure caused by outliers is therefore the focus of the robustness analysis. Here the results of the PLS and RPLS methods are given for comparison. The input and output data (*X* ∈ *R*<sup>960×31</sup>, *Y* ∈ *R*<sup>960×2</sup>) are sampled from the TE process under normal operation as training data. To test the proposed L1-PLS further, outliers are added in the input space in the following form:

$$X(k) = X^\*(k) + \Xi\_j(k),\tag{12.12}$$

where *X*<sup>∗</sup>(*k*) is the *k*th normal sample (*k* = 1, 2,..., 960) and *Ξ<sub>j</sub>* is the *j*th randomly generated outlier, which obeys a Gaussian distribution *Ξ<sub>j</sub>* ∼ *N*(0, 2000). For ease of verification, three kinds of repeatable outliers, generated using a specific random seed, are added to the training set,

**Fig. 12.3** The relative change rates of *t*<sup>1</sup> using PLS and L1-PLS

$$\begin{aligned} \Xi\_1(12) &= [-71.294, 4.929, 35.199, -0.100]^\mathrm{T} & \text{for } x\_{14:17} \\ \Xi\_2(140) &= [4.164, -16.912, -66.307]^\mathrm{T} & \text{for } x\_{29:31} \\ \Xi\_3(200) &= [-1.960, 42.969, 77.737, -19.239, -72.776, 7.439]^\mathrm{T} & \text{for } x\_{1:6} \end{aligned}$$

Outlier *Ξ*<sub>1</sub>(12) means that only the 14th–17th variables at the 12th sampling instant *X*(12) are abnormal, while all variables at the other sampling instants remain normal. The other two outliers have similar meanings.
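The injection scheme of (12.12) is easy to reproduce with a fixed random seed; a sketch (the helper and the drawn values are illustrative, only the indices mirror *Ξ*<sub>1</sub>(12)):

```python
import numpy as np

def add_outliers(X, sample_idx, var_slice, rng):
    """Perturb the selected variables of one sample with N(0, 2000) noise
    (variance 2000, i.e. std sqrt(2000)), as in X(k) = X*(k) + Xi_j(k);
    every other entry stays untouched."""
    Xc = X.copy()
    width = var_slice.stop - var_slice.start
    Xc[sample_idx, var_slice] += rng.normal(0.0, np.sqrt(2000.0), size=width)
    return Xc

# Repeatable outliers via a fixed (illustrative) seed: sample 12 (index 11),
# variables 14-17 (indices 13-16), as in Xi_1(12).
rng = np.random.default_rng(42)
X_normal = np.zeros((960, 31))
X_out = add_outliers(X_normal, 11, slice(13, 17), rng)
```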

The sparsity *s*<sub>1</sub> and *s*<sub>2</sub> in the L1-PLS method are set to 31 and 2, i.e., equal to the number of variables in the input and output space, respectively. In other words, the L1-PLS method can reflect the changes of all variables. The component numbers *d* are determined by cross-validation: 6, 6, and 2 for the PLS, RPLS, and L1-PLS methods, respectively. The principal components are *t<sub>i</sub>* = Σ<sup>*n*</sup><sub>*j*=1</sub> *w<sub>ij</sub>x<sub>j</sub>*, *i* = 1,..., *d*, in which *w<sub>ij</sub>* is the *j*th element of *r<sub>i</sub>*. The coefficients *w<sub>ij</sub>* are used to reflect whether the outliers affect the principal components. The relative rates of change (RRC) indices are defined as follows:

$$\begin{aligned} RRC\_{1,i} &= \max\{ |\mathbf{w}\_{ij,normal} - \mathbf{w}\_{ij,outliers}| \} \\ RRC\_{2,i} &= ||\mathbf{w}\_{i,normal} - \mathbf{w}\_{i,outliers}||\_1, \end{aligned} \tag{12.13}$$

where *w*<sub>*i*,normal</sub> = [w<sub>*ij*</sub>]<sub>normal</sub> and *w*<sub>*i*,outliers</sub> = [w<sub>*ij*</sub>]<sub>outliers</sub> are the normalized coefficient vectors of the *i*th PC computed from the normal samples and from the samples with added outliers, respectively.

*RRC*<sup>1</sup> represents the maximum absolute deviation of the two coefficient sets, which indicates the worst changes of the normalized *wi j* . *RRC*<sup>2</sup> represents the sum of the absolute deviations of the two coefficient sets, which indicates the overall change of the normalized *wi j* .
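The two indices in (12.13) are one line each; a minimal sketch:

```python
import numpy as np

def rrc(w_normal, w_outliers):
    """RRC_1: worst elementwise change (max absolute deviation);
    RRC_2: overall change (L1 norm of the deviation) of the normalized
    coefficient vector of one PC, per (12.13)."""
    d = np.abs(w_normal - w_outliers)
    return d.max(), d.sum()
```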

The normalized coefficient *w<sub>ij</sub>* values of the first two PCs (*t*<sub>1</sub> and *t*<sub>2</sub>) of the PLS, RPLS, and L1-PLS methods are shown in Figs. 12.3 and 12.4. The corresponding indices *RRC<sub>i</sub>*, *i* = 1, 2, are given in Table 12.1 (a smaller value is better).

**Fig. 12.4** The relative change rates of *t*<sup>2</sup> using PLS and L1-PLS

**Table 12.1** *RRCi* of *t*<sup>1</sup> and *t*<sup>2</sup> of the PLS, L1-PLS and L1-SPLS methods


It can be seen from Figs. 12.3–12.4 and Table 12.1 that, no matter which method is used, the outliers always affect the structure of the PCs to some extent. In general, the outliers have a large adverse effect on the PC extraction of the PLS method, resulting in the largest change in its PC structures. With the robust covariance estimation, the outliers have little effect on the PC extraction of the RPLS method. The L1-PLS method relies only on the L1 norm for its insensitivity to outliers, without any explicit outlier processing. The changes that the outliers cause in the structure of its two PCs are nearly identical and within an acceptable range, in terms of both *RRC*<sub>1</sub> and *RRC*<sub>2</sub>. When the data set follows a heavy-tailed distribution, the samples considered to be outliers may be a true reflection of the system state (Domański 2019). It is then more important to retain all the samples when extracting the PCs, even though the outliers have a certain influence on the direction vectors.

By further analyzing the structure of *t*<sub>1</sub> and *t*<sub>2</sub>, it can easily be found that the PCs extracted by these methods are quite different. To better explain the structural differences of *t*<sub>1</sub> and *t*<sub>2</sub> among the methods, IDV(14) is taken as an example for in-depth analysis. The typical process variable monitoring results of IDV(14) are given in Fig. 12.5, in which *x*<sub>9</sub>, *x*<sub>21</sub> and *x*<sub>30</sub> have similar monitoring results. Within *t*<sub>1</sub> and *t*<sub>2</sub>, the sum of the absolute weights of *x*<sub>9</sub>, *x*<sub>21</sub> and *x*<sub>30</sub> in the PLS method (0.062) is more than twice that in the L1-PLS method (0.025).

These weight differences do not significantly affect the output prediction and the monitoring performance in the normal operation. But these differences are amplified in the fault modes. For example, consider the monitoring under the fault modes IDV(14) and IDV(17). The role of *x*<sup>21</sup> and *x*<sup>30</sup> (especially *x*30) in the PLS method is

**Fig. 12.5** Typical process variable monitoring results of IDV(14)

exaggerated, leading to incorrect predictions and quality-relevant monitoring results (see Figs. 12.6 and 12.7). Correspondingly, the L1 norm can better maintain the relative size of those variables, therefore, the role of *x*<sup>21</sup> and *x*<sup>30</sup> in the extracted PCs is not exaggerated. In other words, the extracted PCs by the L1 norm better capture the relationship between the input space and output space.

## *12.5.2 Robustness of Prediction and Monitoring Performance*

The robustness of the principal components of the L1-PLS method was discussed in the previous section. However, the number of principal components differs among the three methods, which reflects only one aspect of robustness. Now the robustness of the prediction and monitoring performance is analyzed further; in particular, the prediction performance directly reflects the quality of the model. There are 21 types of faults in the TE process. Fault IDV(21) is a slow output drift caused by a constant shift of the steam valve position, so it does not reflect the robustness of the model. Therefore, the first 20 faults are analyzed in this simulation experiment. The sparsity in the L1-SPLS model is determined by the VIP method: input space *s*<sub>1</sub> = 14, output space *s*<sub>2</sub> = 2.

#### **Experiment 1: Prediction Performance Analysis**

In this experiment, the L1-PLS model shows good output prediction results for the 20 fault data sets. L1-PLS(outliers) and PLS(outliers) denote the two models trained on the normal operation data with added outliers, as described in Sect. 12.5.1. To illustrate the above conclusions more clearly, four faults, IDV(7), IDV(14), IDV(17), and IDV(18), are selected to compare the prediction performance of the PLS model and the L1-PLS model. The output prediction results are good for all fault modes, but the four faults come from four different fault types, and the results of the L1-PLS model and the PLS model differ considerably. Figures 12.6 and 12.7 give the output prediction results of the faults IDV(7), IDV(14), IDV(17),

**Fig. 12.6** Output predicted values for IDV(7), IDV(14), IDV(17), and IDV(18) using PLS(outliers)

and IDV(18). The horizontal axis represents data samples, and the vertical axis represents output values. The blue dashed line is the actual output value, and the green is the predicted output value.

In these prediction and monitoring diagrams, the first 160 samples are normal data, and the last 800 samples are data under the different fault modes. The output prediction of fault IDV(7) shows a consistent conclusion under the step-change fault: the feedback or cascade controller reduces the impact of faults and abnormal values on product quality. For the other three faults, IDV(14), IDV(17), and IDV(18), there are some differences in the output prediction results. When the system is under normal operation, the PLS and L1-PLS models give equally good predictions. However, after adding outliers, the PLS method cannot accurately predict the output (Fig. 12.6), while the L1-PLS method still quickly detects the output changes and makes correct predictions (Fig. 12.7). In particular, for faults IDV(17) and IDV(18), the PLS method gives seriously wrong predictions. The experiments show that the prediction performance of the L1-PLS method is better than that of PLS. Even if the data are contaminated by outliers, L1-PLS can still predict the output accurately. In other words, the L1-PLS model has more robust prediction performance.

**Fig. 12.7** Output predicted values for IDV(7), IDV(14), IDV(17), and IDV(18) using L1- PLS(outliers)

#### **Experiment 2: Monitoring Performance Analysis**

The robustness of the monitoring performance is mainly verified by the accuracy of fault detection. The detection indices are FDR and FAR (4.1), and the control limit is calculated with a confidence level of 99.75% for both the PLS and L1-PLS methods. The FAR results of the two models are basically the same, which indicates that the proposed L1-PLS method does not increase the risk of false alarms, so FAR is not analyzed further in this section. Table 12.2 lists the FDR results of the first 20 faults without added outliers, for the models PLS, L1-PLS, and L1-SPLS, respectively. Table 12.3 shows the FDR results of the 20 faults after adding outliers, for the models PLS(outliers), L1-PLS(outliers), and L1-SPLS(outliers).
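With the sample layout used in this simulation (first 160 samples normal, last 800 faulty), FDR and FAR reduce to two means over the boolean alarm sequence; a sketch (names hypothetical):

```python
import numpy as np

def fdr_far(alarms, n_normal=160):
    """alarms: boolean alarm sequence from one statistic; the first
    n_normal samples are fault-free, the remaining ones are faulty.
    FDR = alarms among faulty samples; FAR = alarms among normal ones."""
    alarms = np.asarray(alarms, dtype=bool)
    far = alarms[:n_normal].mean()     # false alarm rate
    fdr = alarms[n_normal:].mean()     # fault detection rate
    return fdr, far
```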

For serious quality-related faults IDV(2), IDV(6), IDV(8), IDV(12), IDV(13), and IDV(18), the six models give consistent results. Therefore, these faults are not analyzed in this chapter. For other types of faults, their results are very different, including the quality-irrelevant faults, the quality-recoverable faults, and slight quality-related faults. The detailed analysis of the three situations is given below. In the monitoring figures of this section, the blue line represents the value of the statistic, where the


**Table 12.2** FDRs of PLS, L1-PLS, and L1-SPLS

upper curve is T<sup>2</sup> and the lower one is T<sup>2</sup><sub>*e*</sub>. The system alarms if the blue line exceeds the red control limit.

#### **Case 1: Quality Irrelevant Fault**

It can be found from Table 12.2 that very low alarm values are given for faults IDV(3), IDV(9), IDV(15), and IDV(19). However, the alarm values of the L1-PLS and L1-SPLS models are lower, which indicates that fewer false alarms will occur during the monitoring. It can also be seen from the corresponding Figs. 12.8, 12.9, 12.10, 12.11, and 12.12 that the alarm points of the latter two models are much fewer. Faults IDV(4), IDV(11), and IDV(14) are all related to the reactor cooling water and hardly affect the quality of the output products. The PLS model gives a higher alarm value, which may lead to serious false alarms, while the L1-PLS model effectively avoids these alarms and reduces the number of false alarms. In addition, the L1-PLS model eliminates most of the false alarms in the monitoring Figs. 12.8, 12.9, and 12.10, and the L1-SPLS model eliminates almost all false alarms.

When outliers are added, the PLS model still gives wrong results for the quality-irrelevant faults. The specific FDR values are shown in Table 12.3. However, the


**Table 12.3** FDRs of PLS(outliers), L1-PLS(outliers), and L1-SPLS(outliers)

**Fig. 12.8** PLS, L1-PLS and L1-SPLS monitoring results for IDV(4)

**Fig. 12.9** PLS, L1-PLS and L1-SPLS monitoring results for IDV(11)

**Fig. 12.10** PLS, L1-PLS and L1-SPLS monitoring results for IDV(14)

**Fig. 12.11** PLS, L1-PLS and L1-SPLS monitoring results for IDV(15)

**Fig. 12.12** PLS, L1-PLS and L1-SPLS monitoring results for IDV(19)

monitoring performance of the L1-PLS model is still very good. For faults IDV(9), IDV(14), and IDV(19), the detection rate is reduced to 0, which means that false alarms are completely eliminated in these cases. Therefore, the added outliers do not interfere with the fault monitoring results of the L1-(S)PLS model. It should be noted that the monitoring performance of the L1-PLS model after adding outliers (Table 12.3) is even better than under normal conditions (Table 12.2). A possible reason is that, with the outliers, the total noise in the input data becomes larger, and the L1-PLS method filters out noise more effectively during the modeling. Therefore, the established model is more accurate and the monitoring performance is improved.

#### **Case 2: Quality-Recoverable Fault**

Faults IDV(1), IDV(5), and IDV(7) are quality-recoverable faults. The prediction value should tend to return to normal, but the statistic should remain at a high value. Figure 12.13 shows the monitoring results of the three models on fault IDV(1). It can be seen that both the L1-PLS and L1-SPLS models give the

**Fig. 12.13** PLS, L1-PLS and L1-SPLS monitoring results for IDV(1)

**Fig. 12.14** PLS, L1-PLS and L1-SPLS monitoring results for IDV(5)

**Fig. 12.15** PLS, L1-PLS and L1-SPLS monitoring results for IDV(7)

correct alarm results. In the PLS model, the value of the statistic exceeds the control limit, so a false alarm is generated in the process monitoring. Fault IDV(5) is also a process-related fault. It can be seen from Tables 12.2 and 12.3 that the fault detection rates of the L1-PLS and L1-SPLS models are lower than those of the PLS model, which means that the monitoring results are more accurate. Figures 12.14 and 12.16 show the monitoring diagrams of the three models for fault IDV(5) in the normal case (without added outliers) and with added outliers, respectively. For fault IDV(7), the corresponding monitoring results are shown in Fig. 12.15. The PLS model gives a completely wrong result, while the results of the other two models are more accurate.

The detection result for fault IDV(1) obtained by the L1-PLS(outliers) model seems to be better than that of the L1-PLS model, and the monitoring results are more reasonable. In addition, for fault IDV(5), although the monitoring results of the L1-PLS(outliers) and L1-SPLS(outliers) models may not be ideal, as shown in Fig. 12.14, the T<sup>2</sup><sub>*e*</sub> statistics of the L1-PLS and L1-SPLS models can detect the input-space

**Fig. 12.16** PLS(outliers), L1-PLS(outliers) and L1-SPLS(outliers) monitoring results for IDV(5)

**Fig. 12.17** Typical process variable monitoring results of IDV(5)

process-related faults, whereas the PLS(outliers), L1-PLS(outliers), and L1-SPLS(outliers) models give wrong results (Fig. 12.16).

There are two possible reasons for this phenomenon. First, the outliers were added directly without being regulated by the dynamic system, so their influence on the extraction of the principal components cannot be determined directly. Second, the typical process dynamics corresponding to fault IDV(5) are shown in Fig. 12.17. Among all the monitored variables, only variable 31 shows a step change; the rest gradually return to normal under the action of the controller. In terms of the composition of the principal components, the contribution of variable 31 to the principal components is small. Therefore, its role lies more in the residual space in the normal case (without added outliers). After the outliers are added, its contribution to the principal components increases, which means its role in the residual space is weakened. This in turn causes the monitoring indicators in the residual space to return to normal. On the other hand, the percentage of its contribution to the principal components is still small, so the monitoring indicators in the principal component space also do not significantly reflect its characteristics.

#### **Case 3: Slightly Quality-Related Fault**

Faults IDV(16) and IDV(17) have only a slight impact on output quality. Figure 12.18 shows the monitoring results of the three models after adding outliers. The fault monitoring results of the PLS(outliers)

**Fig. 12.18** PLS(outliers), L1-PLS(outliers), and L1-SPLS(outliers) monitoring results for IDV(17)

model are poor, with many false alarms. The L1-PLS(outliers) and L1-SPLS(outliers) models effectively reduce these false alarms. It can also be seen from the corresponding FDR values that the monitoring results of the L1-PLS(outliers) and L1-SPLS(outliers) models are more reasonable.

It can be seen from the above comparison results that, even when outliers are added to the input data, the monitoring results of the L1-(S)PLS model remain greatly improved over those of PLS. In other words, the L1-(S)PLS model improves both the robustness and the fault detection performance.

## **12.6 Conclusions**

This chapter proposes a quality-related statistical monitoring method based on a doubly robust projection to latent structures (L1-PLS), which enhances the robustness of the PLS algorithm in two respects. On the one hand, the L1-PLS method replaces the L2 norm in the objective function with the L1 norm and adds an L1-norm penalty term on the direction vector; on the other hand, the regression coefficients of the L1-PLS algorithm are also obtained via the L1 norm. Therefore, the L1-PLS algorithm is doubly robust. A monitoring model based on the L1-PLS method is then established, and its robustness and monitoring performance are verified on the TE process simulation platform. The results show that the L1-PLS method has better robustness and better performance in process monitoring and fault diagnosis.

## **References**


Farrés M, Platikanov S, Tsakovski S, Tauler R (2015) Comparison of the variable importance in projection (VIP) and of the selectivity ratio (SR) methods for variable selection and interpretation. J Chemom 29(10):528–536


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 13 Bayesian Causal Network for Discrete Variables**

Ensuring the safety of industrial systems requires not only detecting faults but also locating them so that they can be eliminated. The previous chapters have discussed fault detection and identification methods. Fault traceability is also an important issue in industrial systems. This chapter and Chap. 14 address fault inference and root tracking based on the probabilistic graphical model. This model explores the internal linkages of the system variables quantitatively and qualitatively, so it avoids the bottleneck of multivariate statistical models, which lack a clear mechanistic interpretation. The extracted features or principal components of a multivariate statistical model are linear or nonlinear combinations of the system variables and have no physical meaning. So the multivariate statistical model is good at fault detection and identification, but not at fault root tracking.

A Bayesian network (BN) can estimate and predict the potentially harmful factors of a general system, but its structure learning has some deficiencies when applied to complex systems, such as a complex training mechanism and intricate variable causalities. To simplify the network structure, many assumptions must be presupposed, which inevitably causes a loss of generality. Usually, a generative model (linear or nonlinear) is built to explain the data generating process, i.e., the causalities. A variety of causal discovery methods have been proposed recently to find the causalities (Hyvärinen et al. 2010; Hong et al. 2017). The most classical method is the linear non-Gaussian acyclic model (LiNGAM) (Shimizu et al. 2010), in which the full structure of the BN is identifiable without pre-specifying a causal order of the variables. An improved LiNGAM method was proposed to estimate the causal order of the variables without any prior structural knowledge and to provide better statistical performance (Shimizu et al. 2011). The nonlinear causality of a pair of variables is discovered in Johnson and Bhattacharyya (2015), but the proposed method shows limitations when dealing with multivariate cases.

The above approaches exploit the complexity of the marginal and conditional probability distributions in one way or another. Although a large number of methods for bivariate causal discovery have been proposed over the last few years, their practical performance has not been studied systematically. These methods have yet

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_13

to be applied to actual industrial systems, which usually do not meet the linear and bivariate assumptions. To address the above issues, this chapter proposes a more generalized multivariate post-nonlinear acyclic causal model for complex industrial processes. The proposed model, named the Bayesian Causal Network (BCN), can easily find the causality among multiple variables. It shows a more compact structure and better consistency with the mechanism than the traditional BN structure. In addition, it avoids the complex learning mechanism of the traditional BN and so is easier to implement without compromising accuracy.

# **13.1 Construction of Bayesian Causal Network**

It is known that there are many ways to describe system characteristics from observational data and expert knowledge, such as the graph model (Hipel et al. 2011), the neural network model (Li et al. 2016), and the fuzzy model (Jiang et al. 2015). The graph model is composed of points and lines that describe the system structure and the causal relationships among the variables. It provides an effective method for studying various systems, especially complex ones. The Bayesian network, a typical graph model, is the main method for dealing with knowledge representation and uncertainty based on probability theory. It builds the causality and probability among the process components and the system variables from prior knowledge and process data. BN learning consists of structure learning and parameter learning: structure learning aims at determining the causalities among the system variables, and parameter learning aims at revealing the quantitative relationships of these causalities. Bayesian networks have been applied to fault diagnosis, financial analysis, automatic target recognition, military applications, and many other areas (Zhu et al. 2017).

## *13.1.1 Description of Bayesian Network*

A Bayesian network, also known as a belief network or directed acyclic graphical model, is a probabilistic graphical model. It was first proposed by Judea Pearl in 1985 (Pearl 1986). It is an uncertainty processing model that simulates the causal relationships in the human reasoning process, and its network topology is a directed acyclic graph (DAG). The nodes in the DAG represent random variables, including observable variables, hidden variables, unknown parameters, etc. Variables or propositions believed to have a causal relationship (or not to be conditionally independent) are connected by arrows; in other words, an arrow connecting two nodes indicates that the two random variables have a causal relationship or are not conditionally independent. If two nodes are connected by a single arrow, one of the nodes is the "cause" and the other the "effect", and a conditional probability value is used to describe the degree of causality quantitatively.

For example, assume that node *A* directly affects node *B*, i.e., *A* → *B*. The arrow from *A* to *B* establishes a directed arc (*A*, *B*) from node *A* to node *B*, and its weight (connection strength) is determined by the conditional probability *P*(*B*|*A*). In short, a BN is formed by drawing the random variables in a directed graph according to whether they are conditionally independent. Circles usually represent the random variables (nodes) and arrows represent the conditional dependencies. Figure 13.1 gives a simple Bayesian network (Ishak et al. 2011).
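The quantitative side of the arc *A* → *B* is just the factorization *P*(*A*, *B*) = *P*(*A*)*P*(*B*|*A*); a toy two-node network with made-up numbers:

```python
# Two-node network A -> B.  All probability values are illustrative.
p_a = {True: 0.3, False: 0.7}                      # prior P(A)
p_b_given_a = {True: {True: 0.9, False: 0.1},      # CPT P(B | A)
               False: {True: 0.2, False: 0.8}}

def joint(a, b):
    """P(A=a, B=b) = P(A=a) * P(B=b | A=a)."""
    return p_a[a] * p_b_given_a[a][b]

# Marginal of B by summing out A: P(B=True) = sum_a P(a) P(B=True | a)
p_b_true = sum(p_a[a] * p_b_given_a[a][True] for a in (True, False))
```

The same factorization extends to any DAG: the joint distribution is the product of each node's conditional probability given its parents.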

## *13.1.2 Establishing Multivariate Causal Structure*

Model-based causal discovery assumes a generative model to explain the data generating process. When existing knowledge about the data model is unavailable, the assumed model should be sufficiently general to approximate the real data generation process. Furthermore, the model should be identifiable, so that it can distinguish the causes from the effects. A nonlinear and multivariable system always possesses the following three characteristics (Chen et al. 2018):


To discover the causality among multiple variables in complex industrial systems, a more generalized multivariate post-nonlinear acyclic causal model with inner additive noise is proposed. The model is expressed in the form of a graph and a Bayesian network structure. Assume that there is a DAG representing the relationship among the multiple observed variables. Mathematically, the generating process of *X<sub>i</sub>* is

$$X\_i = f\_{i,2}(f\_{i,1}(PA\_i) + \mathfrak{e}\_i),\tag{13.1}$$

where the observed variables *X<sub>i</sub>*, *i* = {1, 2,..., *n*}, are arranged in a causal order, such that no later variable causes any earlier variable. *PA<sub>i</sub>* is the direct cause of

*X<sub>i</sub>*. *f*<sub>*i*,1</sub> denotes the nonlinear effect of this cause, and *f*<sub>*i*,2</sub> denotes the invertible post-nonlinear distortion in the variable *X<sub>i</sub>*. *e<sub>i</sub>* is the independent disturbance, a continuous-valued random variable with a non-Gaussian distribution of non-zero variance. Model (13.1) satisfies the aforementioned three characteristics: the function *f*<sub>*i*,1</sub> accounts for the nonlinear effect of the causes *PA<sub>i</sub>*; *e<sub>i</sub>* is the noise effect during the transmission from *PA<sub>i</sub>* to *X<sub>i</sub>*; the invertible function *f*<sub>*i*,2</sub> reflects the nonlinear distortion caused by the sensor or measurement.
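The generating process (13.1) can be simulated to make the role of each function concrete; a sketch with illustrative choices: *f*<sub>1</sub>(*x*) = *x* + *x*³, *f*<sub>2</sub> = cube root (invertible), and a uniform (hence non-Gaussian) disturbance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x_i = rng.normal(size=n)                  # direct cause PA_i
e_j = rng.uniform(-0.5, 0.5, size=n)      # non-Gaussian inner disturbance
f1 = lambda x: x + x ** 3                 # nonlinear effect of the cause
f2 = np.cbrt                              # invertible post-nonlinear distortion
x_j = f2(f1(x_i) + e_j)                   # X_j = f2(f1(X_i) + e_j)

# Inverting the process, e_j = f2^{-1}(X_j) - f1(X_i), recovers the
# disturbance, which is independent of X_i in the true causal direction.
e_rec = x_j ** 3 - f1(x_i)
```

Trying the same inversion in the wrong direction (*X<sub>j</sub>* → *X<sub>i</sub>*) generally leaves a residual that is dependent on the hypothesized cause, which is exactly what the identification test below exploits.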

Randomly select a pair of variables *X<sub>i</sub>* and *X<sub>j</sub>*, *i*, *j* = {1, 2,..., *n*}. Assume that the pair (*X<sub>i</sub>*, *X<sub>j</sub>*) has the causal relation *X<sub>i</sub>* → *X<sub>j</sub>*. Its data generating process can be described by the generative model

$$X\_j = f\_{j,2}(f\_{j,1}(X\_i) + \mathfrak{e}\_j),\tag{13.2}$$

where *e<sub>j</sub>* is independent of *X<sub>i</sub>*. Define *s<sub>i</sub>* = *f*<sub>*j*,1</sub>(*X<sub>i</sub>*) and *s<sub>j</sub>* = *e<sub>j</sub>*; then *s<sub>i</sub>* is independent of *s<sub>j</sub>*.

Rewrite the generating process *X<sup>i</sup>* → *X <sup>j</sup>* as follows:

$$\begin{aligned} X\_i &= f\_{j,1}^{-1}(\mathbf{s}\_i), \\ X\_j &= f\_{j,2}(\mathbf{s}\_i + \mathbf{s}\_j). \end{aligned} \tag{13.3}$$

*X<sub>i</sub>* and *X<sub>j</sub>* in (13.3) are post-nonlinear (PNL) mixtures of the independent sources *s<sub>i</sub>* and *s<sub>j</sub>*. The PNL mixing model can be seen as a special case of the general nonlinear independent component analysis (ICA) model, so a nonlinear ICA method is used here to solve (13.3).

Generally, there are two possibilities for the causal relation between any two random variables $X_i$ and $X_j$: $X_i \to X_j$ or $X_j \to X_i$. The correct relation is identified by judging which one satisfies the assumed model (13.2). If the causal relation is $X_i \to X_j$ (i.e., $X_i$ and $X_j$ satisfy model (13.2)), we can invert the data-generating process (13.2) to recover the disturbance $e_j$, which is expected to be independent of $X_i$. Two steps are used to examine the possible causal relationships between variables.

In the first step, recover the disturbance $e_j$ corresponding to the assumed causal relation $X_i \to X_j$ based on constrained nonlinear ICA. If this causal relation holds, there exist nonlinear functions $f_{j,2}^{-1}$ and $f_{j,1}$ such that

$$e_j = f_{j,2}^{-1}(X_j) - f_{j,1}(X_i),\tag{13.4}$$

where $e_j$ is independent of $X_i$. Thus, perform nonlinear ICA using the structure in Fig. 13.2; the outputs of the system are

$$\begin{aligned} Y_i &= X_i, \\ Y_j &= e_j = g_j(X_j) - g_i(X_i). \end{aligned}\tag{13.5}$$

**Fig. 13.2** Constrained nonlinear ICA system used to verify whether the causal relation $X_i \to X_j$ holds

The nonlinearities $g_i$ and $g_j$ are modeled by multilayer perceptrons (MLPs), and the parameters of $g_i$ and $g_j$ are learned by making $Y_i$ and $Y_j$ as independent as possible, i.e., by minimizing the mutual information between $Y_i$ and $Y_j$,

$$I(Y\_i, Y\_j) = H(Y\_i) + H(Y\_j) - H(Y),\tag{13.6}$$

where $H(Y)$ is the joint entropy of $Y = (Y_i, Y_j)^T$,

$$\begin{split} H(Y) &= -\mathbb{E}\left[\log p_Y(Y)\right] \\ &= -\mathbb{E}\left[\log p_X(X) - \log|J|\right] \\ &= H(X) + \mathbb{E}\left[\log|J|\right]. \end{split}\tag{13.7}$$

The joint density of $Y = (Y_i, Y_j)^T$ is $p_Y(Y) = p_X(X)/|J|$, where $J$ is the Jacobian matrix of the transformation from $(X_i, X_j)$ to $(Y_i, Y_j)$, i.e.,

$$J = \frac{\partial(Y_i, Y_j)}{\partial(X_i, X_j)}, \qquad |J| = \begin{vmatrix} 1 & 0 \\ -g_i' & g_j' \end{vmatrix} = \left|g_j'\right|.\tag{13.8}$$

Substituting (13.7) and (13.8) into (13.6), we have

$$\begin{split} I(Y_i, Y_j) &= H(Y_i) + H(Y_j) - \mathbb{E}[\log|J|] - H(X) \\ &= -\mathbb{E}\left[\log p_{Y_i}(Y_i)\right] - \mathbb{E}\left[\log p_{Y_j}(Y_j)\right] - \mathbb{E}\left[\log|g_j'|\right] - H(X), \end{split}\tag{13.10}$$

where $H(X)$ does not depend on the parameters of $g_i$ and $g_j$ and can be treated as a constant. The minimization problem (13.10) is solved by gradient-descent methods; the details of the optimization are omitted here.

In the second step, verify whether the estimated disturbance $Y_j$ is independent of the assumed cause $Y_i$ using a statistical independence test. The kernel-based statistical test is adopted with significance level $\alpha = 0.01$ (Giga 2014). Denote the test statistic as $test_{i \to j}$. If $test_{i \to j} > test_{j \to i}$, then $Y_i$ and $Y_j$ are not independent, i.e., $X_i \to X_j$ does not hold. Repeat the above procedure with $X_i$ and $X_j$ exchanged to verify whether $X_j \to X_i$ holds. If $test_{i \to j} < test_{j \to i}$, it is concluded that $X_i$ causes $X_j$; $g_i$ and $g_j$ then provide estimates of $f_{j,1}$ and $f_{j,2}^{-1}$, respectively.

For a complex system with $n$ process variables, following the test sequence $X_1 \to X_2, X_1 \to X_3, \ldots, X_{n-1} \to X_n$, $N$ groups of statistics should be tested,

$$N = (n - 1) + (n - 2) + \dots + 1 = \frac{n(n - 1)}{2}.\tag{13.11}$$

The total computation is proportional to $2N$, since each pair is tested in both directions. As the number of variables increases, the amount of computation grows accordingly. The statistics measured in the forward order and in the reverse order are stored as

$$\begin{aligned} A &= [test\_{X\_1 \to X\_2}, test\_{X\_1 \to X\_3}, \dots, test\_{X\_{n-1} \to X\_n}], \\ B &= [test\_{X\_2 \to X\_1}, test\_{X\_3 \to X\_1}, \dots, test\_{X\_n \to X\_{n-1}}]. \end{aligned} \tag{13.12}$$

Comparing the corresponding elements of the vectors $A$ and $B$, the causal direction of each pair of variables is determined by the smaller statistic. Once the causality of all variables has been found by this cyclic search, the pairwise directions are integrated into a DAG.
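The pairwise decision rule can be sketched in code. The sketch below is a simplification, not the chapter's exact algorithm: the constrained nonlinear ICA with MLPs is replaced by kernel-ridge regression residuals (an additive-noise approximation that omits the post-nonlinearity $f_{j,2}$), and a biased HSIC estimate stands in for the kernel-based independence test. The direction with the smaller statistic is kept, mirroring the comparison of $A$ and $B$ in (13.12); all function names are illustrative.

```python
import numpy as np

def rbf_gram(v, gamma=None):
    # RBF Gram matrix with a median-heuristic bandwidth.
    d2 = (v[:, None] - v[None, :]) ** 2
    if gamma is None:
        gamma = 1.0 / np.median(d2[d2 > 0])
    return np.exp(-gamma * d2)

def hsic(u, v):
    # Biased HSIC statistic: larger value = stronger dependence.
    n = len(u)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ rbf_gram(u) @ H @ rbf_gram(v)) / n ** 2

def regression_residual(x, y, lam=1e-3):
    # Kernel ridge regression of y on x; the residual plays the
    # role of the recovered disturbance e_j in (13.4).
    K = rbf_gram(x)
    alpha = np.linalg.solve(K + lam * len(x) * np.eye(len(x)), y)
    return y - K @ alpha

def causal_direction(x, y):
    # Smaller residual-vs-input dependence wins, as in Sect. 13.1.2.
    test_xy = hsic(x, regression_residual(x, y))
    test_yx = hsic(y, regression_residual(y, x))
    return "x->y" if test_xy < test_yx else "y->x"

# Demo on synthetic data: nonlinear mechanism with uniform (non-Gaussian) noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 300)
y = x + 2.0 * x ** 3 + rng.uniform(-0.5, 0.5, 300)
```

Running `causal_direction` over all pairs and keeping the winning direction of each comparison yields the edge set that is then assembled into the DAG.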

## *13.1.3 Network Parameter Learning*

The multivariate causality model gives a framework, similar to a Bayesian network, for finding the internal structure of complex systems. Its graphical structure expresses the causal interactions and direct/indirect relations as a probabilistic network. Its parameters represent the strength of the complex inter-relationships among the cause-effect variables.

Consider a finite set $U = \{X_1, \ldots, X_n\}$ of discrete random variables, where each variable $X_i$ may take on several discrete states from a finite set. A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution over the set of random variables $U$. Formally, the Bayesian network for $U$ is constructed as a pair $B = \langle G, \Theta \rangle$. $G$ is a directed acyclic graph whose vertices correspond to the random variables $X_1, \ldots, X_n$. $\Theta$ is the parameter set that quantifies the network, with $\theta_{ijk} = p(x_i^k \mid pa_i^j)$ and $\sum_k \theta_{ijk} = 1$, where $x_i^k$ is a discrete state of $X_i$ and $pa_i^j$ is one configuration of the complete parent set $PA_i$ of $X_i$ in $G$. Every variable $X_i$ is conditionally independent of its non-descendants given its parents (Markov condition). The joint probability distribution over the set $U$ is


$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^n P_B(X_i \mid PA_i) = \prod_{i=1}^n \theta_{X_i \mid PA_i}.\tag{13.13}$$

The parameters of the causality Bayesian network are mainly learned from statistical analysis of sample data. Maximum likelihood estimation (MLE) is one of the most classical and effective algorithms for parameter learning.

Given a data set $D = \{D_1, \ldots, D_N\}$ over all Bayesian network nodes, the goal of parameter learning is to find the most probable values for $\Theta$. These values best explain the data set $D$, which is quantified by the log-likelihood function $\log p(D \mid \theta)$, denoted $L_D(\theta)$. Assume that all samples are drawn independently from the underlying distribution. According to the conditional independence assumptions, we have

$$L\_D(\theta) = \log \prod\_{i=1}^n \prod\_{j=1}^{q\_i} \prod\_{k=1}^{r\_i} \theta\_{ijk}^{n\_{ijk}},\tag{13.14}$$

where $q_i$ is the number of configurations $pa_i^j$ of the parent nodes of $X_i$, and $r_i$ is the number of states of node $X_i$. $n_{ijk}$ indicates how many samples in $D$ contain both $x_i^k$ and $pa_i^j$. If the data set $D$ is complete, the MLE method can be described as a constrained optimization problem,

$$\begin{aligned} &\max L_D(\theta), \\ &\text{s.t.}\quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0, \quad \forall i = 1, \ldots, n, \; \forall j = 1, \ldots, q_i. \end{aligned}\tag{13.15}$$

Its global optimum solution is

$$\theta_{ijk} = \frac{n_{ijk}}{n_{ij}},\tag{13.16}$$

where $n_{ij} = \sum_{k=1}^{r_i} n_{ijk}$.
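The counting estimator (13.16) is straightforward to implement. The following is a minimal sketch under an assumed data layout (a list of dictionaries, one per sample, mapping node names to discrete states; all names are hypothetical): `n_ijk` counts co-occurrences of the child state $x_i^k$ with the parent configuration $pa_i^j$, and each CPT row is then normalized by $n_{ij}$.

```python
from collections import Counter
from itertools import product

def mle_cpt(data, child, parents, child_states, parent_state_lists):
    """Maximum likelihood CPT: theta_ijk = n_ijk / n_ij, per (13.16)."""
    # n_ijk: joint counts of (parent configuration pa_i^j, child state x_i^k).
    n_ijk = Counter(
        (tuple(row[p] for p in parents), row[child]) for row in data
    )
    cpt = {}
    for pa in product(*parent_state_lists):
        n_ij = sum(n_ijk[(pa, k)] for k in child_states)  # n_ij = sum_k n_ijk
        if n_ij:
            cpt[pa] = {k: n_ijk[(pa, k)] / n_ij for k in child_states}
    return cpt

# Toy example with alarm-level-style states (hypothetical data).
data = [
    {"X2": "H", "X5": "H"},
    {"X2": "H", "X5": "H"},
    {"X2": "H", "X5": "N"},
    {"X2": "L", "X5": "L"},
]
cpt = mle_cpt(data, "X5", ["X2"], ["H", "N", "L"], [["H", "L"]])
```

Each returned row sums to one, satisfying the constraint $\sum_k \theta_{ijk} = 1$ in (13.15) by construction.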

## **13.2 BCN-Based Fault Detection and Inference**

The complete monitoring model is established by combining the multivariate causal structure with Bayesian parameter learning. The qualitative and quantitative relationships among the process variables are revealed to the greatest extent. This model is then used in the forward direction to predict the operating status and detect faults of the critical process variables (forward inference). Similarly, it can also be used in the reverse direction to find the source of faults (backward inference). The overall block diagram of the proposed method is shown in Fig. 13.3.

Causality network prediction, or inference, calculates the probability of the hypothesis variables being at a certain status according to the network topology and the conditional probability distribution of the evidence variables. An inference or query $P(Q = q \mid E = e_0)$ calculates the posterior probability of a query variable $Q$ taking a specific value $q$ given the evidence $e_0$ for node $E$.

There are many existing network inference algorithms, such as the variable elimination algorithm and the junction tree (JT) algorithm. These algorithms use the hypothesis variables and the specific independence relations induced by the evidence in the BN to simplify the updating task. JT implements the inference procedure in four steps (Borsotto et al. 2006).


The inference starts from a root clique. The core step of message propagation consists of a collection phase and a distribution phase. The cliques of the junction tree are connected by separators such that the so-called junction tree property holds. When a message is passed from one clique $X$ to another clique $Y$, it is mediated by the separator set $S$ between the two cliques. Every conditional probability distribution of the original BN is associated with a clique such that the domain of the distribution is a subset of the clique domain (the notation $dom(\phi)$ refers to the domain of a potential $\phi$). In standard junction tree architectures, the set of distributions $\Phi_X$ associated with a clique $X$ is combined to form the initial clique potential

$$\phi_X = \prod_{\phi \in \Phi_X} \phi.\tag{13.17}$$

For a clique, a potential or a message is a mapping from the value assignments of the nodes to the interval $[0, 1]$. A message pass from $X$ to $Y$ consists of two procedures, projection and absorption, based on the Hugin architecture (proposed by Jensen et al. 1990). The projection procedure saves the current potential and assigns a new one to $S$:

$$\phi_S^{old} \leftarrow \phi_S, \qquad \phi_S \leftarrow \sum_{X \setminus S} \phi_X.\tag{13.18}$$

The absorption procedure assigns a new potential to *Y* using both the old and the new tables of *S*,

$$\phi_Y \leftarrow \phi_Y \frac{\phi_S}{\phi_S^{old}},\tag{13.19}$$

where $\phi_S$ is the current separator potential, $\phi_S^{old}$ is the old separator potential, and $\phi_X$ and $\phi_Y$ are the clique potentials for $X$ and $Y$, respectively.

The query answering step has two procedures. First, the marginalization procedure calculates the joint probability of $Q$ and $E = e_0$: $P(Q, E = e_0) = \sum_{X \setminus \{Q\}} \phi_X$. Second, the normalization procedure calculates the inference result,

$$P(Q = q \mid E = e_0) = \frac{P(Q = q, E = e_0)}{\sum_{Q} P(Q, E = e_0)}.\tag{13.20}$$

A fault in the operational variables is an intervention that has various effects on the production process. The main task in fault detection is to predict the system output and detect whether a fault occurs. The object of causal inference is to find the real root cause under the faulty intervention.
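For small networks, the query (13.20) can be answered by brute-force enumeration of the joint distribution (13.13) instead of the junction tree machinery. The sketch below takes that shortcut, with a hypothetical two-node fault/alarm network for illustration; forward prediction and backward root-cause queries use the same call.

```python
from itertools import product

def joint_prob(assign, cpds):
    # Product of local CPD entries, equation (13.13).
    p = 1.0
    for node, (parents, table) in cpds.items():
        p *= table[(tuple(assign[q] for q in parents), assign[node])]
    return p

def infer(cpds, states, query, evidence):
    # P(query | evidence) by enumeration, normalized as in (13.20).
    dist = {}
    free = [n for n in cpds if n not in evidence]
    for vals in product(*(states[n] for n in free)):
        assign = {**dict(zip(free, vals)), **evidence}
        dist[assign[query]] = dist.get(assign[query], 0.0) + joint_prob(assign, cpds)
    z = sum(dist.values())
    return {k: v / z for k, v in dist.items()}

# Hypothetical two-node network: fault F -> alarm A, binary states.
states = {"F": [0, 1], "A": [0, 1]}
cpds = {
    "F": ([], {((), 0): 0.7, ((), 1): 0.3}),
    "A": (["F"], {((0,), 0): 0.8, ((0,), 1): 0.2,
                  ((1,), 0): 0.1, ((1,), 1): 0.9}),
}
# Backward (root cause) query: probability of the fault given the alarm.
posterior = infer(cpds, states, "F", {"A": 1})
```

Enumeration is exponential in the number of free nodes, which is exactly why the junction tree algorithm is preferred for networks of realistic size.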

## **13.3 Case Study**

In order to evaluate the performance of the proposed method, the experimental results are reported from three aspects: causal direction identification for multiple variables, network parameter learning, and probability inference.

## *13.3.1 Public Data Sets Experiment*

Four published data sets proposed by Mooij and Janzing (Leoand et al. 2001) are used to test the effectiveness of the nonlinear multivariate causal model. The cause-effect pairs are available at http://webdav.tuebingen.mpg.de/cause-effect/ and are considered a benchmark for testing causal detection algorithms. The four data sets are (1) the ground altitude and temperature sampled at 349 stations in the US; (2) the census income data set, which contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau.

**Fig. 13.4** Scatter plots of four data sets, **a**–**d** corresponding to data sets (1)–(4), respectively


**Table 13.1** Independence test statistics under different assumption of causal directions

**Table 13.2** Causal results of the public data sets


The variables include age and wage per hour; (3) the attribute information (age and heart rate) from the Cardiac Arrhythmia database; (4) the population with sustainable access to improved drinking-water sources (%, total) and the infant mortality rate (per 1000 live births, both sexes), 2006. Each data set consists of two random variables whose cause-effect relationship is known. The four data sets have different attributes, which is sufficient to demonstrate the generality and comprehensiveness of the evaluation.

Figure 13.4 gives the scatter plots of the selected data sets (1)–(4). Table 13.1 shows the independence test statistics on $x$ and $y$ for data sets (1)–(4) under different assumptions of causal direction. The statistics are calculated separately under each assumption. Comparing the test statistics under the two assumptions in Table 13.1, the causal direction of each data set is determined as $x \to y$. Table 13.2 summarizes the causal results obtained by the multivariate causality model. The test results are consistent with the real causal relationships. We can conclude that the proposed method correctly identifies the causal direction regardless of the diversity of the data.

## *13.3.2 TE Process Experiment*

In order to illustrate the applicability of the proposed method to an actual complex industrial process, the network topology of the TE process is established and used to predict the alarm variables. The TE platform simulates an actual chemical process; a detailed description of the TE process is given in Chap. 4.

#### **Experiment 1: Build Causal Structure**

In this experiment, eight important process variables are selected for causality calculation in order to facilitate visualization of the results. From the mechanism analysis of the TE process, it is known that when the reactor feed $X_2$ increases, the material first enters the reactor, so the reactor level $X_4$ must increase; hence the reactor feed $X_2$ directly affects the reactor level $X_4$. The cooling water temperature $X_8$ and the reactor feed $X_2$ are the main factors affecting the reactor temperature $X_5$. The reactor pressure $X_3$ changes synchronously with the reactor temperature $X_5$ according to general physical principles. In addition, the more intense the chemical reaction in the reactor, the more the compressor work $X_7$ is strengthened through the sequential loop. At the same time, the reactor pressure $X_3$ also has an obvious influence on the recovered flow $X_1$ and the material level $X_6$ in the separator. The initial structure of the causality network is built from this mechanism analysis (including expert prior knowledge and intuitive correlation analysis of the process variables) and named Bnet0, shown in Fig. 13.5.

The pre-defined fault is random variation in the A, B, and C compositions in stream 4. The corresponding data for the eight variables are collected from the simulation platform. The simulation length is 700 h to ensure that the data sufficiently reflect the process behavior; 500 samples are obtained after equal-interval down-sampling. The causal directions of the paired variables are shown in Table 13.3. Three different causality models are compared: (1) Bnet1, the proposed multivariate post-nonlinear acyclic causal model, shown in Fig. 13.6a; (2) Bnet2, an alternative network obtained from the traditional BN structure learning method (the K2 algorithm, which requires a node ordering), shown in Fig. 13.6b; (3) Bnet3, the network structure learned with the expectation maximization (EM) algorithm, shown in Fig. 13.6c.

**Table 13.3** Causal direction of TE variables

**Fig. 13.6** The network comparison: **a** Bnet1, **b** Bnet2, **c** Bnet3

Comparing the mechanism-based structure Bnet0 and the structure Bnet1 determined by the proposed Bayesian causal network, it is found that Bnet1 is exactly consistent with Bnet0. The structure determined by the proposed method exactly matches the mechanism and expert knowledge, which indicates that the causal structure is credible and accurate. However, Bnet2 and Bnet3, learned with the traditional BN methods, are not consistent with the mechanism and show a large gap from the actual physical relationships. This demonstrates that general BN learning methods fail when applied to complex nonlinear systems, while the proposed multivariate causality model proves its superiority.


**Table 13.4** Threshold setting for alarm status in different variables

**Experiment 2: Parameter Learning** Once the TE network structure is determined, the alarm prediction model can be obtained by parameter learning on this causal structure. In general, a process alarm event can be divided into five alarm levels, namely high-high alarm (HH), high alarm (H), normal (N), low alarm (L), and low-low alarm (LL), numbered 5, 4, 3, 2, and 1, respectively. The first step is to discretize the continuous variables into the five alarm levels by setting different thresholds, as shown in Table 13.4.
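The discretization step is a simple threshold lookup. In the sketch below, the thresholds are placeholders, not the values from Table 13.4, and the returned numbers assume the coding 1 = LL up to 5 = HH used in the inference discussion that follows:

```python
import bisect

# Placeholder thresholds (LL/L, L/N, N/H, H/HH boundaries) for one variable;
# the actual values come from Table 13.4.
cuts = [20.0, 40.0, 60.0, 80.0]

def alarm_level(value, cuts):
    # Returns 1 (LL), 2 (L), 3 (N), 4 (H), or 5 (HH),
    # assuming the numbering used in the inference discussion.
    return bisect.bisect_right(cuts, value) + 1
```

Applying `alarm_level` to every sample of every variable turns the continuous records into the discrete states consumed by the MLE parameter learning of (13.16).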

Here the MLE algorithm is adopted to learn the network parameters and obtain a complete probability table. Suppose that the initial probabilities of the alarm levels under normal conditions are theoretically equal. Then the conditional probability values for all variables are calculated by BN parameter learning. For the two root nodes $X_2$ and $X_8$, the corresponding probabilities of the five statuses are 0.0843, 0.2211, 0.4704, 0.2026, and 0.0217, respectively. The probabilities of the descendant variables are shown in Fig. 13.7. A heat map is used to display the probabilities, since the precise values are not essential for alarm prediction and inference; the color represents the probability range between 0 and 1.

Attention should be paid to probability values close to 1; these are the key points in determining the inference results. When a probability is less than 0.5, the corresponding status is unlikely to appear in actual inference. Figure 13.7a shows the probability of $X_5$ under the combined action of $X_2$ and $X_8$. The abscissa is the status condition of $X_8$ and $X_2$, and the ordinate is the probability of each of the five alarm statuses of $X_5$, displayed in the corresponding color. $P(X_5 = 1 \mid X_8 \in \{1, 2\}, X_2 = 1) \approx 1$ in the lower left corner of Fig. 13.7a, meaning that $X_5$ raises the low-low alarm with probability close to 1 when $X_2$ and $X_8$ are in the low-low alarm status. $P(X_5 = 5 \mid X_8 \in \{4, 5\}, X_2 = 5) \approx 1$ in the upper right corner, meaning that $X_5$ raises the high-high alarm with probability close to 1 when $X_2$ and $X_8$ are in the high-high alarm status. These inference results are consistent with the actual mechanism.

Figure 13.7b–e reflects the probability relationships between pairs of variables. Figure 13.7b shows the probability of $X_4$ under the action of $X_3$.

**Fig. 13.7** Conditional probability of the descendant variables: **a** $P(X_5|X_8, X_2)$, **b** $P(X_4|X_3)$, **c** $P(X_3|X_5)$, **d** $P(X_7|X_5)$, **e** $P(X_1|X_3)$, **f** $P(X_6|X_3)$

**Table 13.5** Alarm level prediction of compressor work $X_7$

$P(X_4 = 5 \mid X_3 = 5) \approx 1$ in the upper right corner, meaning that $X_4$ raises the high-high alarm with probability close to 1 when $X_3$ is in the high-high alarm status. However, $P(X_4 = 1 \mid X_3 = 5) = 0$ in the lower right corner, meaning that $X_4$ raises the low-low alarm with probability close to 0 when $X_3$ is in the high-high alarm status. $P(X_4 = 1 \mid X_3 = 2) \approx P(X_4 = 2 \mid X_3 = 2) \approx 0.5$ in the green area, meaning that the low-low and low alarms of $X_4$ are almost equally likely when $X_3$ is in the low alarm status. Similarly, the inference results obtained from Fig. 13.7c–e are consistent with the mechanism.

**Experiment 3: Alarm Prediction** Alarm prediction is top-down inference from the evidence variables. The probabilistic analysis calculates the likelihood of each status that the result variable may take; the discrete status with the maximum probability is the alarm prediction result.

Using the established multivariate causality network model, the compressor work $X_7$ is predicted when its parent variables $X_2$, $X_8$, and $X_5$ are known. The prediction results for model Bnet1 are shown in Table 13.5, where $\hat{X}_7$ is the predicted value of $X_7$.

The total prediction accuracy over the 20 simulation experiments is 75%. When the maximum probability of the predicted value is greater than 0.5, the prediction is confident; such high-probability predictions are consistent with the true status. When the maximum probability of the predicted value is less than 0.5, the prediction is not reliable. The mis-predictions confuse adjacent statuses, such as the normal status 3 and the low alarm 2 (or the high alarm 4). The simulation results show that the multivariate causality network can find the intrinsic relationships among the process variables and give precise fault or alarm predictions.

## **13.4 Conclusions**

This chapter proposes a multivariate causality model to analyze the causal directions among multiple variables and finally determine the network topology. The proposed method describes the system structure more accurately than traditional BN structure learning methods, especially when the industrial process is highly complex. Combined with network parameter learning and evidence inference techniques, accurate monitoring and alarm prediction can be performed. The validity of the proposed method is verified on public data sets and the TE process. A compact network structure and confident alarm predictions are obtained for the TE process based on causal analysis and probability inference. Both the methodology and the simulation results show that the proposed multivariate causality model has great value for process industry modeling and monitoring.

Some issues are worth further discussion. The computational efficiency of the proposed multivariate post-nonlinear acyclic causal modeling method must be considered when solving large-scale causal analysis problems in the real world. Developing efficient algorithms to find the causal relationships of multiple variables based on general functional causal models is still an important topic. To improve computational efficiency, a feasible solution is to limit the complexity of the causal structure, for example by decreasing the number of direct causes of each variable. Moreover, a smarter optimization procedure should be considered in place of the exhaustive search.

## **References**


Giga M (2014) Statistical tests, test of independence. Nihon Ika Daigaku Igakkai Zasshi 10(2):115– 119

Hipel KW, Kilgour DM, Fang L (2011) The graph model for conflict resolution. Wiley encyclopedia of operations research and management science


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 14 Probabilistic Graphical Model for Continuous Variables**

Most of the sampled data in complex industrial processes are sequential in time. The traditional BN learning mechanisms therefore have limitations in estimating the probabilities and cannot be applied to time series. The model established in Chap. 13 is a graphical model similar to a Bayesian network, but its parameter learning method can only handle discrete variables. This chapter develops a probabilistic graphical model directly for continuous process variables, which avoids the assumption of discrete or Gaussian distributions.

This chapter extends the work of Chap. 13 from random discrete variables to random continuous variables. In addition to enhancing causal structure and parameter learning for continuous variables, kernel density estimation (KDE) is used to construct the node association strength of the causal graph network in the form of probability densities. The conditional probability density is obtained by mathematical operations between the low-dimensional probability density and the high-dimensional joint probability density. This non-parametric estimation method directly estimates the probability density of continuous variables and avoids the limitations of the traditional Gaussian assumption. Moreover, this chapter rigorously derives evaluation indicators for the quality of the KDE. The proposed causal learning mechanism imposes no restrictions such as linearity or particular distribution functions. It establishes an accurate causal probability graphical model to detect faults and find their root causes.

# **14.1 Construction of Probabilistic Graphical Model**

# *14.1.1 Multivariate Causal Structure Learning*

The first step of building a graphical model is to construct a causal topological relationship. The causal hypothesis model is a post-nonlinear model; it can determine the causal relationships between multiple variables through hypothesis testing. Detailed information can be found in Chap. 13 (Chen et al. 2018).

J. Wang et al., *Data-Driven Fault Detection and Reasoning for Industrial Monitoring*, Intelligent Control and Learning Systems 3, https://doi.org/10.1007/978-981-16-8044-1\_14

Consider a model that represents the causal relationships between variables. Here a generative model is used to explain the data-generation process. When the underlying mechanism of the data cannot be determined, the hypothetical model should be sufficiently versatile that it can approximate the actual data-generation process. In addition, the model should be identifiable, so that cause and effect can be distinguished.

In order to discover the causality of multiple variables in a complex system, a generalized multivariate nonlinear acyclic causal model with internal additive noise is given, the same as in Chap. 13. The model adopts the form of graph theory and the Bayesian network structure. Assume that a directed acyclic graph (DAG) represents the relationships between the observed variables. Select a pair of variables $X_i$ and $X_j$, $i, j \in \{1, 2, \ldots, n\}$, from the system. If $X_i$ is the parent node of $X_j$, the data-generating process is described by a post-nonlinear (PNL) mixing model: $X_j = f_{j,2}\left(f_{j,1}(X_i) + e_j\right)$, where $f_{j,1}$ denotes the nonlinear effect of the cause, $f_{j,2}$ denotes the invertible post-nonlinear distortion in variable $X_j$, and $e_j$ is the independent disturbance. A combination of hypothesis testing and nonlinear independent component analysis (ICA) is applicable to this problem (Shimizu et al. 2011). In simplified terms, the procedure can be divided into two steps.


For any pair of variables in the system, two causal assumptions can be made: the causality is assumed in the forward and in the reverse direction, and the direction is determined by comparing the calculated test statistics. After $n(n-1)$ hypotheses and tests, the causality of all system variables is finally determined. Therefore, this multivariate nonlinear acyclic causal modeling method does not suffer from the limitations of Bayesian network structure learning and can effectively establish the causal structure of the process.

## *14.1.2 Probability Density Estimation*

Section 14.1.1 completed the construction of the causal structure of the model. The complete graphical model should also include the quantitative relationships between nodes, described here as the probabilistic connections of the nodes. The probability density of a node variable is determined by non-parametric probability density estimation. Because a child node is affected by its parent nodes, the probabilistic connection manifests itself in the conditional probability density. Kernel density estimation (KDE) is a prominent non-parametric density estimation method; the explicit form of the density function is its main advantage (Chen et al. 2018).

Let $X_1, X_2, X_3, \ldots, X_n$ be a set of samples of the random variable $X$, whose density function $f(x)$, $x \in \mathbb{R}$, is unknown. The density function $f(x)$ can be derived from the corresponding cumulative distribution function $F(x)$,

$$f(x) = \frac{dF(x)}{dx} \approx \frac{F(x + h) - F(x - h)}{2h},\tag{14.1}$$

where $h > 0$ is the window width. The empirical distribution function $F_n(x) = \frac{1}{n}\sum_i I(X_i \le x)$ is used to estimate $F(x)$. Substituting it into (14.1),

$$\begin{split} \hat{f}(x) &\approx \frac{F_n(x + h) - F_n(x - h)}{2h} \\ &= \frac{1}{2nh} \sum_i I(x - h < X_i \le x + h) \\ &= \frac{1}{nh} \sum_i K_0\left(\frac{X_i - x}{h}\right). \end{split}\tag{14.2}$$

Equation (14.2) gives the KDE of $f(x)$ with window width $h$ and kernel function $K_0(u) = \frac{1}{2} I(|u| \le 1)$.

The more general kernel density estimate is

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right),\tag{14.3}$$

where $\hat{f}(\mathbf{x})$ is the estimate of the probability density function, and *n*, *h*, and *K* are the number of samples, the window width, and the kernel function, respectively.
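A minimal sketch of the estimator (14.3), with the Gaussian kernel as the assumed default and the naive kernel *K*<sub>0</sub> of (14.2) included for comparison (function and variable names are illustrative, not from the text):

```python
import numpy as np

def kde(x, samples, h, kernel="gaussian"):
    """Kernel density estimate (14.3): f_hat(x) = (1/(n h)) * sum_i K((X_i - x)/h)."""
    u = (samples[:, None] - np.atleast_1d(x)[None, :]) / h
    if kernel == "uniform":                      # K0(u) = 0.5 * I(|u| <= 1), as in (14.2)
        k = 0.5 * (np.abs(u) <= 1.0)
    else:                                        # Gaussian kernel
        k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=0) / (len(samples) * h)

rng = np.random.default_rng(0)
X = rng.normal(size=2000)                        # samples of the random variable X
grid = np.linspace(-3.0, 3.0, 61)
f_hat = kde(grid, X, h=0.4)                      # density estimate on the grid
```

The estimate integrates to (approximately) one over the support of the samples, illustrating that $\hat{f}$ is itself a valid density.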

Computing a conditional probability density requires additional mathematical operations. Consider two random sample sets *X*<sub>1</sub>*, X*<sub>2</sub>*, X*<sub>3</sub>*,..., X<sub>n</sub>* and *Y*<sub>1</sub>*, Y*<sub>2</sub>*, Y*<sub>3</sub>*,..., Y<sub>n</sub>*, where *X* is the cause variable and *Y* is the effect variable. The joint probability density of *x* and *y* is estimated as

$$\hat{f}(\mathbf{x}, \mathbf{y}) = \frac{1}{n} \sum\_{i=1}^{n} \frac{1}{h\_1 h\_2} K\left(\frac{\mathbf{x} - X\_i}{h\_1}, \frac{\mathbf{y} - Y\_i}{h\_2}\right),\tag{14.4}$$

where *h*<sub>1</sub> and *h*<sub>2</sub> are the window widths corresponding to the cause variable *x* and the effect variable *y*, respectively.

According to the definition of conditional probability, the conditional density *f (y*|*x)* is obtained as follows:

$$f(\mathbf{y}|\mathbf{x}) = \frac{f(\mathbf{x}, \mathbf{y})}{f(\mathbf{x})}.\tag{14.5}$$

**Table 14.1** Common kernel functions

The kernel function affects the precision of the kernel density estimate, so selecting an appropriate kernel is an important issue. Usually, symmetry, non-negativity, and normalization should be considered (Zeng et al. 2017). The mathematical descriptions of common kernel functions are given in Table 14.1 (Jiang and Nicholas 2014).
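Equations (14.4) and (14.5) can be sketched with a product Gaussian kernel (an assumed choice; any kernel from Table 14.1 would serve) on a hypothetical cause/effect pair:

```python
import numpy as np

def joint_kde(x, y, Xs, Ys, h1, h2):
    """Joint estimate f_hat(x, y) of (14.4) with a product Gaussian kernel."""
    kx = np.exp(-0.5 * ((x - Xs) / h1) ** 2) / np.sqrt(2.0 * np.pi)
    ky = np.exp(-0.5 * ((y - Ys) / h2) ** 2) / np.sqrt(2.0 * np.pi)
    return np.mean(kx * ky) / (h1 * h2)

def marginal_kde(x, Xs, h1):
    """Marginal estimate f_hat(x) of (14.3)."""
    kx = np.exp(-0.5 * ((x - Xs) / h1) ** 2) / np.sqrt(2.0 * np.pi)
    return np.mean(kx) / h1

def conditional_kde(y, x, Xs, Ys, h1, h2):
    """Conditional estimate f_hat(y | x) = f_hat(x, y) / f_hat(x), cf. (14.5)."""
    return joint_kde(x, y, Xs, Ys, h1, h2) / marginal_kde(x, Xs, h1)

# toy cause/effect pair: Y = 0.8 X + noise (illustrative data, not process data)
rng = np.random.default_rng(1)
Xs = rng.normal(size=3000)
Ys = 0.8 * Xs + 0.3 * rng.normal(size=3000)

# the conditional density at the conditional mean should exceed that in the tail
p_center = conditional_kde(0.8, 1.0, Xs, Ys, 0.25, 0.25)
p_tail = conditional_kde(-1.0, 1.0, Xs, Ys, 0.25, 0.25)
```

The ratio form of (14.5) means the same marginal estimate can be reused for every *y*, which matters when the conditional density is evaluated sample by sample during monitoring.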

It can be seen from the KDE expression that the kernel function *K*, the sample size *n*, and the window width *h* are the main factors determining $\hat{f}(\mathbf{x})$. Once *n* is fixed, *K* and *h* directly affect the accuracy of the system model parameters, and hence the effectiveness of fault detection and root cause diagnosis. Therefore, in order to estimate the probability density more accurately and improve the estimation quality of KDE, a KDE evaluation criterion is given in the next section. It has been shown that the choice of kernel function has a negligible effect on the result of kernel density estimation (Silverman 1998), so the optimization of *K* is not considered here.

## *14.1.3 Evaluation Index of Estimation Quality*

According to the definition of kernel density, consider the following two cases. (1) The window width *h* is very large: the scaling transformation $\frac{\mathbf{x}-X\_i}{h}$ smooths away the local details of the probability density function, producing an over-smooth estimation curve. The resolution is relatively low in this case, and the estimation bias is enlarged. (2) The window width is very small: on the contrary, the influence of sample randomness increases and important features of the density are masked. The density estimate fluctuates strongly, the stability deteriorates, and the estimation variance is too large in this case (Jiang and Nicholas 2014).

An accurate estimate should be close to the true values and remain stable across different observation sets. These two attributes are described by the estimation bias and variance, given as

$$\begin{aligned} \text{Bias}\{\hat{f}(\mathbf{x})\} &= \mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x})\\ \text{Var}\{\hat{f}(\mathbf{x})\} &= \mathbb{E}[\hat{f}^2(\mathbf{x})] - [\mathbb{E}\hat{f}(\mathbf{x})]^2. \end{aligned} \tag{14.6}$$

The probability density of a child node in the causal model is affected by its parent nodes, so it is usually multidimensional. Consider a two-dimensional kernel density function *f (x, y)* as an example. Its bias and variance are

$$\begin{aligned} \text{Bias}\{\hat{f}(\mathbf{x}, \mathbf{y})\} &= \mathbb{E}\left[\hat{f}(\mathbf{x}, \mathbf{y})\right] - f(\mathbf{x}, \mathbf{y}) \\ \text{Var}\{\hat{f}(\mathbf{x}, \mathbf{y})\} &= \mathbb{E}\left[\hat{f}^2(\mathbf{x}, \mathbf{y})\right] - \left[\mathbb{E}\hat{f}(\mathbf{x}, \mathbf{y})\right]^2. \end{aligned} \tag{14.7}$$

Here the mean integrated squared error (MISE) is introduced as the evaluation index of KDE. The MISE index has a unique advantage in evaluating the difference between the estimated function and the true function, while also balancing the fitness and smoothness of the kernel estimate.

One-dimensional MISE is defined as

$$\text{MISE}[\hat{f}(\mathbf{x})] = \mathbb{E} \int \left[ \hat{f}(\mathbf{x}) - f(\mathbf{x}) \right]^2 d\mathbf{x}.\tag{14.8}$$

Two-dimensional MISE is defined as

$$\text{MISE}[\hat{f}(\mathbf{x}, \mathbf{y})] = \mathbb{E} \iint \left[ \hat{f}(\mathbf{x}, \mathbf{y}) - f(\mathbf{x}, \mathbf{y}) \right]^2 d\mathbf{x}\, d\mathbf{y}.\tag{14.9}$$

The above MISE indices can be simplified as follows (the details can be found in the supporting information of Chen et al. (2018)):

$$\begin{split} \text{MISE}[\hat{f}(\mathbf{x})] &= \int \text{Var}(\hat{f}(\mathbf{x}))d\mathbf{x} + \int \text{Bias}^2(\hat{f}(\mathbf{x}))d\mathbf{x} \\ &= \frac{1}{nh} \int K^2(t)dt + \frac{1}{4}h^4 \left[ \int t^2 K(t)dt \right]^2 \int [f''(\mathbf{x})]^2 d\mathbf{x}, \end{split} \tag{14.10}$$

$$\begin{split} \text{MISE}[\hat{f}(\mathbf{x}, \mathbf{y})] &= \frac{1}{nh\_1h\_2} \int K^2(t)dt + \frac{1}{4}h\_1^4h\_2^4 \\ &\quad \times \left[ \int t^2 K(t)dt \right]^2 \iint (\nabla f(\mathbf{x}, \mathbf{y}))^2 d\mathbf{x}d\mathbf{y}. \end{split} \tag{14.11}$$

It is found from (14.10) and (14.11) that the values of $\int K^2(t)dt$ and $\int t^2 K(t)dt$ depend only on the kernel function *K*; they are not difficult to calculate once the mathematical expression of the kernel is substituted into the above equations. Generally speaking, the window width *h* has a much greater impact on the MISE value, so optimizing *h* is critical. Here (14.10) and (14.11) are also used as the optimization objectives to find the best window width *h*.
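For the Gaussian kernel (an assumed choice here) the two constants have the closed forms $\int K^2(t)dt = 1/(2\sqrt{\pi})$ and $\int t^2K(t)dt = 1$, which a quick numerical check confirms:

```python
import numpy as np

# Numerical check of the two kernel-dependent constants in (14.10)-(14.11),
# assuming the Gaussian kernel K(t) = exp(-t^2/2)/sqrt(2*pi)
t = np.linspace(-8.0, 8.0, 20001)
dt = t[1] - t[0]
K = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

int_K2 = np.sum(K**2) * dt        # integral of K^2(t): 1/(2*sqrt(pi)) ~ 0.2821
int_t2K = np.sum(t**2 * K) * dt   # integral of t^2 K(t): 1 (the kernel's variance)
```

The same two numbers can be computed once per kernel and reused in every bandwidth optimization below.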

For the one-dimensional probability density, let $d\,\text{MISE}[\hat{f}(\mathbf{x})]/dh = 0$. Then

$$h\_{opt} = \sqrt[5]{\frac{\int K^2(t)dt}{n\left[\int t^2 K(t)dt\right]^2 \int [f''(\mathbf{x})]^2 d\mathbf{x}}}. \tag{14.12}$$

For the two-dimensional probability density, let

$$\begin{split} \frac{\partial \text{MISE}[\hat{f}(\mathbf{x}, \mathbf{y})]}{\partial h\_1} &= h\_1^3 h\_2^4 \left( \int t^2 K(t) dt \right)^2 \iint (\nabla f(\mathbf{x}, \mathbf{y}))^2 d\mathbf{x} d\mathbf{y} \\ &\quad - \frac{1}{n h\_1^2 h\_2} \int K^2(t) dt \\ &= 0, \\ \frac{\partial \text{MISE}[\hat{f}(\mathbf{x}, \mathbf{y})]}{\partial h\_2} &= h\_2^3 h\_1^4 (\int t^2 K(t) dt)^2 \iint (\nabla f(\mathbf{x}, \mathbf{y}))^2 d\mathbf{x} d\mathbf{y} \\ &\quad - \frac{1}{n h\_2^2 h\_1} \int K^2(t) dt \\ &= 0. \end{split} \tag{14.13}$$

Then

$$\begin{split} h\_1^{opt} &= \sqrt[5]{\frac{\int K^2(t)dt}{nh\_2^5 (\int t^2 K(t)dt)^2 \iint (\nabla f(\mathbf{x}, \mathbf{y}))^2 d\mathbf{x} d\mathbf{y}}} \\ h\_2^{opt} &= \sqrt[5]{\frac{\int K^2(t)dt}{nh\_1^5 (\int t^2 K(t)dt)^2 \iint (\nabla f(\mathbf{x}, \mathbf{y}))^2 d\mathbf{x} d\mathbf{y}}}. \end{split} \tag{14.14}$$

If the kernel function is predetermined, $\int K^2(t)dt \big/ \left(\int t^2K(t)dt\right)^2 = C(K)$ is a constant. Usually the true probability density functions *f (x)* and *f (x, y)* are unknown, so the estimated densities (14.3) and (14.4) are substituted into (14.12) and (14.14), respectively. Then the optimal parameter *h* for one-dimensional estimation, or *h*<sub>1</sub> and *h*<sub>2</sub> for two-dimensional estimation, is obtained.
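As a worked instance of (14.12): if the Gaussian kernel is assumed and the unknown *f* is replaced by a Gaussian reference density *N*(0, σ²), for which $\int [f''(\mathbf{x})]^2 d\mathbf{x} = 3/(8\sqrt{\pi}\sigma^5)$, the optimum reduces to the well-known rule of thumb $h = (4/(3n))^{1/5}\sigma$ (Silverman 1998):

```python
import numpy as np

def silverman_bandwidth(samples):
    """Plug a Gaussian kernel and a Gaussian reference density into (14.12):
    h_opt = (4 / (3 n))**(1/5) * sigma  ~  1.06 * sigma * n**(-1/5)."""
    n = len(samples)
    sigma = np.std(samples, ddof=1)
    return (4.0 / (3.0 * n)) ** 0.2 * sigma

rng = np.random.default_rng(2)
h = silverman_bandwidth(rng.normal(size=960))   # 960 samples, as in the TE data set
```

This is only the reference-density shortcut; the chapter's procedure instead substitutes the estimated densities themselves into (14.12) and (14.14).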

## **14.2 Dynamic Threshold for the Fault Detection**

Generally speaking, the process variables show obvious differences between their measurements in normal operation and faulty operation, and this difference must be reflected in the probability density distribution. Fault detection amounts to finding these differences based on appropriate thresholds. However, it is not feasible to use the confidence interval of the normal state directly to distinguish a fault: actual process data are usually contaminated with noise, so the distribution is not ideal even in normal operation, and the confidence limit cannot be described by a constant horizontal line. A constant confidence line makes it difficult to distinguish normal operation from faulty operation. Therefore, the idea of a dynamic threshold is introduced. The fused lasso (FL) method is commonly used for denoising in signal processing; here it is used to design dynamic confidence limits, providing a reasonable range for each node based on the normal data.

The fused lasso signal approximator (FLSA) aims at eliminating noise and smoothing data (Bensi et al. 2013). The real-valued observations *y*<sub>1</sub>*,..., y<sub>N</sub>* are approximated by finding the sequence *β*<sub>1</sub>*,..., β<sub>N</sub>* that minimizes the criterion

$$J\_{FL} = \frac{1}{2} \sum\_{k=1}^{N} (\mathbf{y}\_k - \beta\_k \mathbf{x}\_k)^2 + \lambda\_1 \sum\_{k=1}^{N} |\beta\_k| + \lambda\_2 \sum\_{k=2}^{N} |\beta\_k - \beta\_{k-1}|,\tag{14.15}$$

where λ<sub>1</sub> and λ<sub>2</sub> are tuning parameters and *x*<sub>1</sub>*,..., x<sub>N</sub>* are the feature variables. The objective *J<sub>FL</sub>* consists of three parts: $\frac{1}{2}\sum\_{k=1}^{N}(\mathbf{y}\_k-\beta\_k\mathbf{x}\_k)^2$ is the traditional least squares term, which pursues the regression accuracy of the model on all existing measurements $[\mathbf{x}\_k, \mathbf{y}\_k]$; the two penalty terms $\lambda\_1\sum\_{k=1}^{N}|\beta\_k| + \lambda\_2\sum\_{k=2}^{N}|\beta\_k-\beta\_{k-1}|$ encourage sparsity of the regression coefficients and of their differences. The parameters λ<sub>1</sub> and λ<sub>2</sub> are adjusted to trade off regression accuracy against denoising power; (14.15) reduces to a pure denoising problem if λ<sub>1</sub> = 0.
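The criterion (14.15) can be written down directly. In the toy data below (illustrative, with *x<sub>k</sub>* = 1 so that *β<sub>k</sub>* plays the role of the denoised signal), a piecewise-constant *β* achieves a lower objective than simply copying the noisy observations, which is exactly what the total-variation term rewards:

```python
import numpy as np

def fused_lasso_objective(beta, y, x, lam1, lam2):
    """Criterion (14.15): squared error + sparsity + total-variation penalties."""
    fit = 0.5 * np.sum((y - beta * x) ** 2)
    sparsity = lam1 * np.sum(np.abs(beta))
    smooth = lam2 * np.sum(np.abs(np.diff(beta)))
    return fit + sparsity + smooth

y = np.array([1.0, 1.1, 0.9, 5.0, 5.1])   # noisy signal with one level shift
x = np.ones_like(y)
flat = fused_lasso_objective(np.array([1.0, 1.0, 1.0, 5.0, 5.0]), y, x, 0.0, 0.1)
noisy = fused_lasso_objective(y.copy(), y, x, 0.0, 0.1)   # beta = y: zero fit error
```

With λ<sub>1</sub> = 0 the comparison isolates the denoising role of λ<sub>2</sub>: the exact copy pays more in total variation than it saves in fit.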

Here a hidden Markov model (HMM) and maximum likelihood estimation are used for the optimization. The HMM posits an emission probability *Pr*(*y<sub>k</sub>*|*β<sub>k</sub>*) that is a standard normal distribution, and a transition probability *Pr*(*β*<sub>*k*+1</sub>|*β<sub>k</sub>*) that is double exponential with parameter λ<sub>2</sub> (where *Pr* denotes probability).

The Viterbi algorithm is a typical dynamic programming algorithm for this HMM problem; a detailed description can be found in Rabiner et al. (1989). The objective function (14.15) is rewritten as a maximization in the more general form

$$J\_{FL} = \sum\_{k=1}^{N} e\_k(\beta\_k) - \lambda\_2 \sum\_{k=2}^{N} d(\beta\_k, \beta\_{k-1}),\tag{14.16}$$

where $e\_k(b) = \sum\_{i=1}^{R} y\_{ik} v\_i(b)$.

Denote the variable sequence *(x*<sub>1</sub>*, x*<sub>2</sub>*,..., x<sub>k</sub> )* by the shorthand *x*<sub>1:*k*</sub>. Rewrite the criterion (14.16) as follows:

$$\begin{split} J\_{FL} &= \max\_{\beta\_{1:N}} \left[ \sum\_{k=1}^{N} e\_k(\beta\_k) - \lambda\_2 \sum\_{k=2}^{N} d(\beta\_k, \beta\_{k-1}) \right] \\ &= \max\_{\beta\_N} \left[ e\_N(\beta\_N) + \max\_{\beta\_{1:(N-1)}} \left[ \sum\_{k=1}^{N-1} e\_k(\beta\_k) - \lambda\_2 \sum\_{k=2}^{N} d(\beta\_k, \beta\_{k-1}) \right] \right] \end{split} \tag{14.17}$$

and

$$\begin{split} f\_{N}(\beta\_{N}) &:= \max\_{\beta\_{1:(N-1)}} \left[ \sum\_{k=1}^{N-1} e\_{k}(\beta\_{k}) - \lambda\_{2} \sum\_{k=2}^{N} d(\beta\_{k}, \beta\_{k-1}) \right] \\ &= \max\_{\beta\_{N-1}} \left[ e\_{N-1}(\beta\_{N-1}) - \lambda\_{2} d(\beta\_{N}, \beta\_{N-1}) \right. \\ &\quad \left. + \max\_{\beta\_{1:(N-2)}} \left[ \sum\_{k=1}^{N-2} e\_{k}(\beta\_{k}) - \lambda\_{2} \sum\_{k=2}^{N-1} d(\beta\_{k}, \beta\_{k-1}) \right] \right]. \end{split} \tag{14.18}$$

The definitions of functions *fN*−<sup>1</sup>*(β<sup>N</sup>*−<sup>1</sup>*), fN*−<sup>2</sup>*(β<sup>N</sup>*−<sup>2</sup>*), . . . , f*2*(β*2*)* are similar to *fN (β<sup>N</sup> )*. The maximization problem is solved further iteratively. It is summarized by introducing the intermediate functions with *k* ranging from 2 to *N*,

$$\begin{aligned} \delta\_1(b) &:= e\_1(b) \\ \psi\_k(b) &:= \arg\max\_{\widetilde{b}} [\delta\_{k-1}(\widetilde{b}) - \lambda\_2 |b - \widetilde{b}|] \\ f\_k(b) &:= \delta\_{k-1}(\psi\_k(b)) - \lambda\_2 |b - \psi\_k(b)| \\ \delta\_k(b) &:= e\_k(b) + f\_k(b). \end{aligned} \tag{14.19}$$

The functions ψ*<sub>k</sub>(*·*)* take part in the backward pass of the algorithm. This backward pass computes $\hat{\beta}\_1, \ldots, \hat{\beta}\_N$ through a recursion identical to that of the Viterbi algorithm for HMMs:

$$\begin{aligned} \hat{\beta}\_N &= \arg\max\_b \{ \delta\_N(b) \} \\ \hat{\beta}\_k &= \psi\_{k+1}(\hat{\beta}\_{k+1}) \quad \text{for } k = N-1, N-2, \dots, 1. \end{aligned} \tag{14.20}$$
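The recursions (14.19)–(14.20) can be sketched by discretizing *β* onto a finite grid, so that the arg max in ψ*<sub>k</sub>* becomes a search over grid points (a simplification of ours; exact FLSA solvers avoid discretization). Here λ<sub>1</sub> = 0 and the emission score is taken as *e<sub>k</sub>(b)* = −½(*y<sub>k</sub>* − *b*)²:

```python
import numpy as np

def flsa_viterbi(y, lam2, grid):
    """Maximize (14.16) over a discretized beta grid via the forward recursion
    (14.19) and backward pass (14.20); e_k(b) = -0.5*(y_k - b)^2, lam1 = 0."""
    N, G = len(y), len(grid)
    trans = -lam2 * np.abs(grid[:, None] - grid[None, :])   # -lam2 * |b - b~|
    delta = -0.5 * (y[0] - grid) ** 2                       # delta_1(b) = e_1(b)
    psi = np.zeros((N, G), dtype=int)
    for k in range(1, N):
        cand = delta[None, :] + trans                       # scores over previous b~
        psi[k] = np.argmax(cand, axis=1)                    # psi_k(b)
        delta = -0.5 * (y[k] - grid) ** 2 + cand[np.arange(G), psi[k]]
    idx = np.empty(N, dtype=int)                            # backward pass (14.20)
    idx[-1] = int(np.argmax(delta))
    for k in range(N - 2, -1, -1):
        idx[k] = psi[k + 1][idx[k + 1]]
    return grid[idx]

y = np.array([0.1, -0.1, 0.0, 2.1, 1.9, 2.0])   # noisy signal with one level shift
beta_hat = flsa_viterbi(y, lam2=1.0, grid=np.linspace(-1.0, 3.0, 81))
# beta_hat is (nearly) piecewise constant: low on the first half, high on the second
```

The grid width trades accuracy for cost: the forward pass is *O(N G²)*, so a coarse grid suffices for thresholding purposes.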

The FL theory above is thus used to obtain the dynamic threshold of the data model. During fault detection, the KDE-estimated probability values are the input of the FLSA algorithm for smoothing. The influence of data noise on the estimated probability density function is eliminated, and a credible threshold is found to distinguish normal operation from faulty operation.

## **14.3 Forward Fault Diagnosis and Reverse Reasoning**

The previous sections provided the necessary theoretical support, including the construction of the probability graph model, the selection of the probability density estimation evaluation index and parameter optimization, and the setting of a dynamic threshold for fault detection. The model structure is determined by the causal directions between operating units, which represent the qualitative relationships between nodes. Non-parametric KDE is used to obtain the parameters of the graph model, i.e., the causal probability relationships; probability quantitatively describes the dependence between process variables. The evaluation index of the probability relationship estimation is derived and calculated to ensure the accuracy of the graphical model.

**Fig. 14.1** The overall framework

Now this section combines and implements the above theoretical methods into a certain fault detection and diagnosis framework, which can be used to diagnose abnormal events in the system and locate the root cause of the fault. The overall framework of the proposed method is represented in Fig. 14.1.

The main steps for fault detection and root tracing are summarized in the flow chart in Fig. 14.2.


**Fig. 14.2** Flowchart for detecting and tracing faults

## **14.4 Case Study: Application to TEP**

The proposed methods are verified on the Tennessee Eastman (TE) process simulator. The TE process contains a total of 52 process and measurement variables. Eight variables in the reactor module are selected to test the causal structure, the same as in Chap. 13. The physical meanings of these variables are listed in Table 14.2. According to the causal analysis method, it is not difficult to obtain the causal relationships among the eight variables (the detailed analysis can also be found in Chap. 13). The corresponding topology is shown in Fig. 14.3.


**Table 14.2** Process manipulated variables

List all the probability density functions and conditional probability densities of the nodes in the causal graph: in total, *f (x*<sub>2</sub>*)*, *f (x*<sub>8</sub>*)*, *f (x*<sub>4</sub>|*x*<sub>2</sub>*)*, *f (x*<sub>5</sub>|*x*<sub>8</sub>*)*, *f (x*<sub>7</sub>|*x*<sub>5</sub>*)*, *f (x*<sub>3</sub>|*x*<sub>5</sub>*)*, *f (x*<sub>1</sub>|*x*<sub>3</sub>*)*, and *f (x*<sub>6</sub>|*x*<sub>3</sub>*)* need to be estimated. The root nodes *x*<sub>2</sub> and *x*<sub>8</sub> have one-dimensional probability density functions. The window width *h* is optimized to obtain accurate probability estimates.

The training data set contains 960 samples of normal operation; these data are used to obtain the KDE of the model. Combined with the causal structure constructed in the previous step, a complete graphical model is obtained. Fault IDV(4), a step change of the reactor cooling water inlet temperature, is a minor fault used as the test case to verify the effectiveness and sensitivity of the proposed method to minor faults. The fault is introduced in the middle of the reaction, giving a testing data set of 960 samples, in which the first 480 samples are normal and the following 480 are faulty.

In order to be able to trace the root cause of the fault, a child node must be selected for fault detection. Randomly select the child node *x*<sub>7</sub> of the graphical model as the experimental object. According to the causal structure, it is easy to see that *x*<sub>7</sub> is directly related to *x*<sub>5</sub>; *x*<sub>5</sub> is the parent node of *x*<sub>7</sub>, so the conditional probability density *f (x*<sub>7</sub>|*x*<sub>5</sub>*)* is calculated first. Figure 14.4 gives the graphical representation of the probability relationship between these two variables. Figure 14.4a depicts the probability density of the normal data and fault data as a function of sampling time. Based on the FLSA method, the obtained KDE estimate is used as a rough signal for denoising and restoration. The crossed line in Fig. 14.4b represents the KDE recovered after denoising, which is set as the dynamic threshold. It can be clearly seen that after about 480 samples, the conditional probability of *x*<sub>7</sub> exceeds the normal limit.
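The detection logic of this experiment can be sketched on synthetic stand-in data (the signals below imitate, but are not, the real TE variables *x*<sub>5</sub> and *x*<sub>7</sub>, and the constant threshold 0.05 stands in for the FLSA-based dynamic limit):

```python
import numpy as np

def cond_density(y, x, Xs, Ys, h1, h2):
    """f_hat(y|x) from (14.4)-(14.5) with Gaussian product kernels."""
    kx = np.exp(-0.5 * ((x - Xs) / h1) ** 2)
    ky = np.exp(-0.5 * ((y - Ys) / h2) ** 2)
    return (kx * ky).sum() / (h2 * np.sqrt(2.0 * np.pi) * kx.sum())

# normal training data: x5 drives x7 (illustrative stand-ins, not TE signals)
rng = np.random.default_rng(3)
x5 = rng.normal(size=960)
x7 = 0.5 * x5 + 0.2 * rng.normal(size=960)

# test set: 480 normal samples, then 480 with a step change in x7
x5_test = rng.normal(size=960)
x7_test = 0.5 * x5_test + 0.2 * rng.normal(size=960)
x7_test[480:] += 1.0

p = np.array([cond_density(yv, xv, x5, x7, 0.25, 0.1)
              for xv, yv in zip(x5_test, x7_test)])
threshold = 0.05            # a fixed stand-in for the dynamic FLSA limit
alarm = p < threshold       # after sample 480 the conditional density collapses
```

The qualitative behavior matches Fig. 14.4: the conditional density of the faulty samples drops below the limit while the normal samples remain above it.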

**Fig. 14.4** Conditional probability of *x*<sub>7</sub> under *x*<sub>5</sub>: (a) probability density versus sampling time; (b) conditional probability of *x*<sub>7</sub> under *x*<sub>5</sub> versus sampling time

Fault tracing refers to finding the root cause of the failure detected in *x*<sub>7</sub>. The graph model clearly shows the causal relationships between the nodes, so the propagation path of the fault can easily be analyzed. Reverse reasoning is carried out on the established causal structure parameter model: start from the faulty variable and calculate the probability density functions of its parent nodes in turn. The probability density curves obtained under normal and fault conditions are compared to determine whether the variables on each path are faulty. This step continues until the root cause of the failure is found. To infer the root of the fault in *x*<sub>7</sub> by reverse reasoning, it is necessary to calculate *f (x*<sub>5</sub>|*x*<sub>8</sub>*)*, *f (x*<sub>5</sub>|*x*<sub>2</sub>*)*, *f (x*<sub>2</sub>*)*, and *f (x*<sub>8</sub>*)* separately. The simulation results are shown in Fig. 14.5.

From the detection results, the true propagation path of the fault can be analyzed. The test shows that the root of the fault is *x*<sub>8</sub>. Corresponding to the physical meaning of this variable, the root cause is the cooling water temperature, and fault IDV(4) is precisely a step change in the cooling water inlet temperature. The result is consistent with the actual process.
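The reverse-reasoning loop can be sketched over the causal structure of Fig. 14.3 (the `is_abnormal` test stands in for the density comparison described above, and the detection outcome used here is hypothetical):

```python
# child -> parents map read off the causal graph of Fig. 14.3
parents = {"x7": ["x5"], "x5": ["x8"], "x4": ["x2"],
           "x3": ["x5"], "x1": ["x3"], "x6": ["x3"]}

def trace_root(node, is_abnormal):
    """Follow abnormal parents upward until a root (or a normal parent) is reached."""
    for p in parents.get(node, []):
        if is_abnormal(p):
            return trace_root(p, is_abnormal)
    return node   # no abnormal parent: this node is the root cause

# hypothetical detection results: x7, x5 and x8 flagged as abnormal
faulty = {"x7", "x5", "x8"}
root = trace_root("x7", lambda v: v in faulty)   # traces x7 -> x5 -> x8
```

Starting from the detected fault in *x*<sub>7</sub>, the walk terminates at *x*<sub>8</sub>, the cooling water temperature, matching the simulation result above.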

# **14.5 Conclusions**

This chapter proposes a probability graph model built directly on the continuous process variables, aiming at fault detection and root tracing. The model structure is determined by the causal relationships, and the probability relationships in the model are determined by the KDE method. For the child nodes in the causal structure, i.e., variables affected by other nodes, the conditional probability density functions are calculated from the multidimensional joint probability density and the low-dimensional probability density; they reflect the strength of the causal connections between the variables. An MISE index is rigorously derived to evaluate the estimation accuracy of KDE and to optimize the KDE parameters. A dynamic threshold is constructed based on the FLSA algorithm to check changes of the probability density and thereby detect faults. The experimental results on the TE process show that the proposed method not only accurately detects the occurrence of the failure, but also succeeds in finding its root cause.

# **References**


Silverman BW (1998) Density estimation for statistics and data analysis. Routledge, Boca Raton

Zeng J, Luo S, Cai J, Kruger U (2017) Nonparametric density estimation of hierarchical probabilistic graph models for assumption-free monitoring. Ind Eng Chem Res 56(5):1278–1287

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.