**Studies in Classification, Data Analysis, and Knowledge Organization**

Paula Brito · José G. Dias · Berthold Lausen · Angela Montanari · Rebecca Nugent Editors

# Classification and Data Science in the Digital Age

# Studies in Classification, Data Analysis, and Knowledge Organization

#### Managing Editors

- Wolfgang Gaul, Karlsruhe, Germany
- Maurizio Vichi, Rome, Italy
- Claus Weihs, Dortmund, Germany

#### Editorial Board

- Daniel Baier, Bayreuth, Germany
- Frank Critchley, Milton Keynes, UK
- Reinhold Decker, Bielefeld, Germany
- Edwin Diday, Paris, France
- Michael Greenacre, Barcelona, Spain
- Carlo Natale Lauro, Naples, Italy
- Jacqueline Meulman, Leiden, The Netherlands
- Paola Monari, Bologna, Italy
- Shizuhiko Nishisato, Toronto, Canada
- Noboru Ohsumi, Tokyo, Japan
- Otto Opitz, Augsburg, Germany
- Gunter Ritter, Passau, Germany
- Martin Schader †, Mannheim, Germany

Studies in Classification, Data Analysis, and Knowledge Organization is a book series which offers constant and up-to-date information on the most recent developments and methods in the fields of statistical data analysis, exploratory statistics, classification and clustering, handling of information and ordering of knowledge. It covers a broad scope of theoretical, methodological as well as application-oriented articles, surveys and discussions from an international authorship and includes fields like computational statistics, pattern recognition, biological taxonomy, DNA and genome analysis, marketing, finance and other areas in economics, databases and the internet. A major purpose is to show the intimate interplay between various, seemingly unrelated domains and to foster the cooperation between mathematicians, statisticians, computer scientists and practitioners by offering well-based and innovative solutions to urgent problems of practice.


**Editors**

Paula Brito
Faculty of Economics, University of Porto, Porto, Portugal
INESC TEC, Centre for Artificial Intelligence and Decision Support (LIAAD), Porto, Portugal

José G. Dias
Business Research Unit, University Institute of Lisbon, Lisbon, Portugal

Berthold Lausen
Department of Mathematical Sciences, University of Essex, Colchester, UK

Angela Montanari
Department of Statistical Sciences "Paolo Fortunati", University of Bologna, Bologna, Italy

Rebecca Nugent
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA

ISSN 1431-8814, ISSN 2198-3321 (electronic)
Studies in Classification, Data Analysis, and Knowledge Organization
ISBN 978-3-031-09033-2, ISBN 978-3-031-09034-9 (eBook)
https://doi.org/10.1007/978-3-031-09034-9

Mathematics Subject Classification: 62H30, 62H25, 62R07, 68T09, 62H86, 68T10, 94A16, 68T30

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# **Preface**

"Classification and Data Science in the Digital Age", the 17th Conference of the International Federation of Classification Societies (IFCS), is held in Porto, Portugal, from July 19th to July 23rd 2022, locally organised by the Faculty of Economics of the University of Porto and the Portuguese Association for Classification and Data Analysis, CLAD.

The International Federation of Classification Societies (IFCS), founded in 1985, is an international scientific organization with non-profit and non-political motives. Its purpose is to promote mutual communication, co-operation and interchange of views among all those interested in scientific principles, numerical methods, theory and practice of data science, data analysis, and classification in a broad sense and in as wide a range of applications as possible; to serve as an agency for the dissemination of scientific information related to these areas of interest; to prepare international conferences; to publish a newsletter and other publications. The scientific activities of the Federation are intended for all people interested in theory of classification and data analysis, and related methods and applications. IFCS 2022 – originally scheduled for August 2021, and postponed due to the Covid-19 pandemic – will be its 17th edition; previous editions were held in Thessaloniki (2019), Tokyo (2017) and Bologna (2015).

Keynote lectures are delivered by Genevera Allen (Rice University, USA), Charles Bouveyron (Université Côte d'Azur, Nice, France), Dianne Cook (Monash University, Melbourne, Australia), and João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal). The conference program includes two tutorials: "Analysis of Data Streams" by João Gama (Faculty of Economics, University of Porto & LIAAD INESC TEC, Portugal) and "Categorical Data Analysis and Visualization" by Rosaria Lombardo (Università degli Studi della Campania Luigi Vanvitelli, Italy) and Eric Beh (University of Newcastle, Australia). IFCS 2022 features highlighted topics, which lead to Semi-Plenary Invited Sessions. The conference program also includes Thematic Tracks on specific areas, as well as free contributed sessions on different topics (both oral communications and posters).

The Conference Scientific Program Committee is co-chaired by Paula Brito, José G. Dias, Berthold Lausen, and Angela Montanari, and includes representatives of the IFCS member societies: Adalbert Wilhelm – GfKl, Ahmed Moussa – MCS, Arthur White – IPRCS, Brian Franczak – CS, Eva Boj del Val – SEIO, Fionn Murtagh – BCS, Francesco Mola – CLADAG, Hyunjoong Kim – KCS, Javier Trejos Zelaya – SoCCCAD, Koji Kurihara – JCS, Krzysztof Jajuga – SKAD, Mark de Rooij – VOC, Mohamed Nadif – SFC, Niel le Roux – MDAG, Simona Korenjak Černe – SSS, Theodore Chadjipadelis – GSDA, who were responsible for the Conference Scientific Program, and whom the organisers wish to thank for their precious cooperation. Special thanks are also due to the chairs of the Thematic Tracks, for their invaluable collaboration.

The papers included in this volume present new developments in relevant topics of Data Science and Classification, constituting a valuable collection of methodological and applied papers that represent the current research in highly developing areas. Combining new methodological advances with a wide variety of real applications, this volume is certainly of great value for Data Science researchers and practitioners alike.

First of all, the organisers of the Conference and the editors would like to thank all authors for their cooperation and commitment. We are especially grateful to all colleagues who served as reviewers, and whose work was decisive for the scientific quality of these proceedings. We also thank all those who contributed to the design and production of this Book of Proceedings at Springer, in particular Veronika Rosteck, for her help concerning all aspects of publication.

The organisers would like to express their gratitude to the Portuguese Association for Classification and Data Analysis, CLAD, as well as to the Faculty of Economics of the University of Porto (FEP–UP), who enthusiastically supported the Conference from the very start, and contributed to its success. We cordially thank all members of the Local Organising Committee – Adelaide Figueiredo, Carlos Ferreira, Carlos Marcelo, Conceição Rocha, Fernanda Figueiredo, Fernanda Sousa, Jorge Pereira, M. Eduarda Silva, Paulo Teles, Pedro Campos, Pedro Duarte Silva, and Sónia Dias – and all people at FEP–UP who worked actively for the conference organisation, and whose work is much appreciated. We are very grateful to all our sponsors, for their generous support. Finally, we thank all authors and participants, who made the conference possible.

Porto, July 2022

*Paula Brito*
*José G. Dias*
*Berthold Lausen*
*Angela Montanari*
*Rebecca Nugent*

# **Acknowledgements**

The Editors are extremely grateful to the reviewers, whose work was decisive for the scientific quality of these proceedings. They were, in alphabetical order:

Adalbert Wilhelm, Agustín Mayo-Iscar, Alípio Jorge, André C. P. L. F. de Carvalho, Ann Maharaj, Anuška Ferligoj, Arthur White, Berthold Lausen, Brian Franczak, Carlos Soares, Christian Hennig, Conceição Amado, Eva Boj del Val, Francesco Mola, Francisco de Carvalho, Geoff McLachlan, Gilbert Saporta, Glòria Mateu-Figueras, Hans Kestler, Hélder Oliveira, Hyunjoong Kim, Jaime Cardoso, Javier Trejos, Jean Diatta, José A. Lozano, José A. Vilar, José Matos, Koji Kurihara, Krzysztof Jajuga, Laura Palagi, Laura Sangalli, Lazhar Labiod, Luis Angel García-Escudero, Luis Teixeira, M. Rosário Oliveira, Margarida G. M. S. Cardoso, Mark de Rooij, Michelangelo Ceci, Mohamed Nadif, Niel Le Roux, Paolo Mignone, Patrice Bertrand, Pedro Campos, Pedro Duarte Silva, Pedro Ribeiro, Peter Filzmoser, Rosanna Verde, Rosaria Lombardo, Salvatore Ingrassia, Satish Singh, Simona Korenjak-Černe, Theodore Chadjipadelis, Veronica Piccialli, and Vladimir Batagelj.

# **Partners & Sponsors**

We are extremely grateful to the following institutions whose support contributes to the success of IFCS 2022:

# **Sponsors**

Banco de Portugal

Berd

Comissão de Viticultura da Região dos Vinhos Verdes

Indie Campers

INESC/TEC

Luso-American Development Foundation

PSE

Sociedade Portuguesa de Estatística

Instituto Nacional de Estatística/Statistics Portugal

Unilabs

Universidade do Porto

# **Partners**

- Associação Portuguesa para a Investigação Operacional
- Associação Portuguesa de Reconhecimento de Padrões
- Associação de Turismo do Porto e Norte
- Centro Internacional de Matemática
- Faculdade de Engenharia da Universidade do Porto
- International Association of Statistical Computing
- International Association of Statistical Education
- Sociedade Portuguesa de Matemática
- Springer

# **Organisation**

CLAD - Associação Portuguesa de Classificação e Análise de Dados Faculdade de Economia da Universidade do Porto

# **Contents**







# **A Topological Clustering of Individuals**

Rafik Abdesselam

**Abstract** The clustering of objects (individuals) is one of the most widely used approaches to exploring multidimensional data. The two common unsupervised clustering strategies are Hierarchical Ascending Clustering (HAC) and k-means partitioning, both used to identify groups of similar objects in a dataset and so divide it into homogeneous groups. The proposed Topological Clustering of Individuals, or TCI, studies a homogeneous set of individuals (rows of a data table) based on the notion of neighborhood graphs; the columns (variables) are more or less correlated or linked according to whether the variables are quantitative or qualitative. The method enables a topological analysis of the clustering of individuals whose variables can be quantitative, qualitative or a mixture of the two. It first analyzes the correlations or associations observed between the variables in a topological context of principal component analysis (PCA) or multiple correspondence analysis (MCA), depending on the type of variable, and then classifies the individuals into homogeneous groups relative to the structure of the variables considered. The proposed TCI method is presented and illustrated here using a real dataset with quantitative variables, but it can also be applied with qualitative or mixed variables.

**Keywords:** hierarchical clustering, proximity measure, neighborhood graph, adjacency matrix, multivariate data analysis

# **1 Introduction**

The objective of this article is to propose a topological method of data analysis in the context of clustering. The proposed approach, Topological Clustering of Individuals

Rafik Abdesselam
University of Lyon, Lyon 2, ERIC - COACTIS Laboratories, Department of Economics and Management, 69365 Lyon, France, e-mail: rafik.abdesselam@univ-lyon2.fr

© The Author(s) 2023
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_1

(TCI) is different from those that already exist and with which it is compared. There are approaches specifically devoted to the clustering of individuals, for example, the Cluster procedure implemented in SAS software, but as far as we know, none of these approaches has been proposed in a topological context.

Proximity measures play an important role in many areas of data analysis [16, 5, 9]. The results of any operation involving structuring, clustering or classifying objects are strongly dependent on the proximity measure chosen.

This study proposes a method for the topological clustering of individuals, whatever the type of variables considered: quantitative, qualitative or a mixture of both. Any associations or correlations between the variables depend partly on the database being used, and the results can change according to the selected proximity measure. A proximity measure is a function which measures the similarity or dissimilarity between two objects or variables within a set.

Several topological data analysis studies have been proposed, both in the context of factorial analyses (discriminant analysis [4], simple and multiple correspondence analyses [3], principal component analysis [2]) and in the context of the clustering of variables [1] and the clustering of individuals [10], to which the proposed TCI approach belongs.

This paper is organized as follows. In Section 2, we briefly recall the basic notion of neighborhood graphs, we define and show how to construct an adjacency matrix associated with a proximity measure within the framework of the analysis of the correlation structure of a set of quantitative variables, and we present the principles of TCI for continuous data. This is illustrated in Section 3 using an example based on real data, and the TCI results are compared with those of the well-known classical clustering of individuals. Finally, Section 4 presents concluding remarks on this work.

# **2 Topological Context**

Topological data analysis is an approach based on the concept of the neighborhood graph. The basic idea is quite simple: for a given proximity measure, for continuous or binary data, and for a chosen topological structure, we can build an induced topological graph on the set of objects.

In the case of continuous data, we consider $E = \{x^1, \dots, x^j, \dots, x^p\}$, a set of $p$ quantitative variables. Cases of qualitative or even mixed variables are covered in [1].

By means of a proximity measure $u$, we can define a neighborhood relationship $V_u$ as a binary relationship on $E \times E$. There are many possibilities for building this binary neighborhood relationship.

Thus, for a given proximity measure *u*, we can build a neighborhood graph on 𝐸, where the vertices are the variables and the edges are defined by a property of the neighborhood relationship.

One can choose, for example, the Minimal Spanning Tree (MST) [7], the Gabriel Graph (GG) [11] or, as is the case here, the Relative Neighborhood Graph (RNG) [14].
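As an illustration of these neighborhood graphs, the following sketch builds the RNG adjacency matrix from a matrix of pairwise distances: two objects are RNG neighbours when no third object is simultaneously closer to both of them. The function name and the toy data are illustrative, not taken from the paper.

```python
import numpy as np

def rng_adjacency(dist):
    """Relative Neighborhood Graph adjacency matrix.

    k and l are neighbours iff u(k, l) <= max(u(k, t), u(t, l))
    for every third object t (no t lies "between" k and l).
    """
    p = dist.shape[0]
    adj = np.zeros((p, p), dtype=int)
    for k in range(p):
        for l in range(k + 1, p):
            neighbours = True
            for t in range(p):
                if t in (k, l):
                    continue
                if dist[k, l] > max(dist[k, t], dist[t, l]):
                    neighbours = False
                    break
            if neighbours:
                adj[k, l] = adj[l, k] = 1
    return adj

# Toy example: three collinear points at 0, 1 and 3 on the real line;
# 0 and 3 are not neighbours because 1 lies between them.
pts = np.array([[0.0], [1.0], [3.0]])
d = np.abs(pts - pts.T)
adj = rng_adjacency(d)
```

MST, GG and RNG differ only in the geometric predicate tested inside the double loop; the RNG predicate above is the one used in the rest of the paper.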

For any given proximity measure $u$, we can construct the associated binary symmetric adjacency matrix $V_u$ of order $p$, where all pairs of neighboring variables in $E$ satisfy the following RNG property:

$$V_u(x^k, x^l) = \begin{cases} 1 & \text{if } u(x^k, x^l) \leq \max\left[\, u(x^k, x^t),\; u(x^t, x^l) \,\right]; \ \forall x^t \in E,\ t \neq k, l \\ 0 & \text{otherwise.} \end{cases}$$

**Fig. 1** Data - RNG structure - Euclidean distance - Associated adjacency matrix.

Figure 1 shows a simple illustrative example in $\mathbb{R}^2$ of a set of quantitative variables that verify the structure of the RNG graph, with the Euclidean distance as proximity measure: $u(x^k, x^l) = \sqrt{\sum_{j=1}^{2} (x^k_j - x^l_j)^2}$.

This generates a topological structure on the objects in $E$ which is completely described by the binary adjacency matrix $V_u$.

#### **2.1 Reference Adjacency Matrices**

Three topological factorial approaches are described in [1] according to the type of variables considered: quantitative, qualitative or a mixture of both. We consider here the case of a set of quantitative variables.

We assume that we have at our disposal a set $E = \{x^j;\ j = 1, \dots, p\}$ of $p$ quantitative variables and $n$ individuals (objects). The objective here is to analyze, in a topological way, the structure of the correlations between the variables considered [2], from which the clustering of individuals will then be established.

We construct the reference adjacency matrix, denoted $V_{u^\star}$, from the correlation matrix. Expressions of suitable reference adjacency matrices for cases involving qualitative or mixed variables are given in [1].

To examine the correlation structure between the variables, we look at the significance of their linear correlations. The reference adjacency matrix $V_{u^\star}$, associated with the reference measure $u^\star$, can be written using Student's t-test of the Bravais-Pearson linear correlation coefficient $\rho$:

**Definition 1** For quantitative variables, $V_{u^\star}$ is defined as:

$$V_{u^\star}(x^k, x^l) = \begin{cases} 1 & \text{if } p\text{-value} = P\left[\, |T_{n-2}| > \text{t-value} \,\right] \leq \alpha\,; \ \forall k, l = 1, \dots, p \\ 0 & \text{otherwise} \end{cases}$$

where the $p$-value is the significance of the test of the linear correlation coefficient, for the two-sided test of the null and alternative hypotheses $H_0: \rho(x^k, x^l) = 0$ vs. $H_1: \rho(x^k, x^l) \neq 0$.

Let $T_{n-2}$ denote a Student $t$-distributed random variable with $\nu = n - 2$ degrees of freedom. The null hypothesis is rejected if the $p$-value is less than or equal to a chosen significance level $\alpha$, for example $\alpha = 5\%$. In a linear correlation test, a very small $p$-value means that the null hypothesis is very unlikely to be correct, and consequently we can reject it.
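Definition 1 can be sketched directly from pairwise correlation tests; `scipy.stats.pearsonr` returns exactly the two-sided $p$-value of the test above. The function name and toy variables are illustrative, not from the paper.

```python
import numpy as np
from scipy import stats

def reference_adjacency(X, alpha=0.05):
    """Reference adjacency matrix V_u* (Definition 1):
    V[k, l] = 1 when the two-sided p-value of the Bravais-Pearson
    correlation test (H0: rho = 0, T ~ Student t, n - 2 d.f.)
    is at most alpha."""
    n, p = X.shape
    V = np.eye(p, dtype=int)          # each variable is its own neighbour
    for k in range(p):
        for l in range(k + 1, p):
            r, pval = stats.pearsonr(X[:, k], X[:, l])
            if pval <= alpha:
                V[k, l] = V[l, k] = 1
    return V

# Illustrative data: x2 is strongly correlated with x1, x3 is noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
V = reference_adjacency(np.column_stack([x1, x2, x3]))
```

The significantly correlated pair (x1, x2) is linked in `V`; the matrix is symmetric with ones on the diagonal, as required of an adjacency matrix on $E$.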

#### **2.2 Topological Analysis - Selective Review**

Whatever the type of variable set considered, the built reference adjacency matrix $V_{u^\star}$ is associated with an unknown reference proximity measure $u^\star$.

The robustness of the approach with respect to the $\alpha$ error risk chosen for the null hypothesis (no linear correlation in the case of quantitative variables, or positive deviation from independence in the case of qualitative variables) can be studied by varying the significance threshold, in order to analyze the sensitivity of the results. The numerical results will certainly change, but probably not their interpretation.

We assume that we have at our disposal a set $\{x^k;\ k = 1, \dots, p\}$ of $p$ homogeneous quantitative variables measured on $n$ individuals, gathered in the $(n \times p)$ data matrix $X$; $M_p$ and $D_n$ denote the metrics on the spaces of variables and individuals, respectively.
We first analyze, in a topological way, the correlation structure of the variables using a Topological PCA, which consists of carrying out the standardized PCA [6, 8] of the triplet $(\widehat{X}, M_p, D_n)$ of the projected data matrix $\widehat{X} = X V_{u^\star}$ and, for comparison, the duality diagram of the classical standardized PCA of the triplet $(X, M_p, D_n)$ of the initial data matrix $X$. We then proceed with a clustering of individuals based on the significant principal components of the previous Topological PCA.

**Definition 2** TCI consists of performing a HAC, based on the Ward criterion1 [15], on the significant factors of the standardized PCA of the triplet $(\widehat{X}, M_p, D_n)$.

<sup>1</sup> Aggregation based on the criterion of the loss of minimal inertia.
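Definition 2 can be sketched end to end as follows: build $V_{u^\star}$ from the correlation tests, project $\widehat{X} = X V_{u^\star}$, run a standardized PCA, and apply Ward HAC to the retained components. The numbers of retained factors and of clusters are treated here as given inputs (in practice they are chosen from the eigenvalues and the dendrogram); all names and the toy data are illustrative, not from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import pearsonr

def tci(X, alpha=0.05, n_factors=2, n_clusters=3):
    """Sketch of TCI: standardized PCA of X @ V_u*, then Ward HAC
    on the retained principal components."""
    n, p = X.shape
    # reference adjacency matrix V_u* from pairwise correlation tests
    V = np.eye(p, dtype=int)
    for k in range(p):
        for l in range(k + 1, p):
            if pearsonr(X[:, k], X[:, l])[1] <= alpha:
                V[k, l] = V[l, k] = 1
    Xp = X @ V                                  # projected data X_hat = X V_u*
    Z = (Xp - Xp.mean(0)) / Xp.std(0)           # standardized PCA via SVD
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_factors].T               # retained principal components
    link = linkage(scores, method="ward")       # HAC, Ward criterion
    return fcluster(link, t=n_clusters, criterion="maxclust")

# Two well-separated synthetic groups of individuals
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(5, 1, (20, 4))])
labels = tci(X, n_clusters=2)
```

The same pipeline can equally be assembled from the PCA and HAC procedures of SAS, SPAD or R, as the paper notes in its conclusion.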

# **3 Illustrative Example**

The data used [13] to illustrate the TCI approach describe the renewable electricity (RE) of the 13 French regions in 2017 by means of 7 quantitative RE-related variables. The growth of renewable energy in France is significant. Some French regions have expertise in this area; however, the regions' profiles appear to differ.

The objective is to specify regional disparities in terms of RE by applying topological clustering to the French regions in order to identify which were the country's greenest regions in 2017. Statistics relating to the variables are displayed in Table 1.


**Table 1** Summary statistics of renewable energy variables.

**Table 2** Correlation matrix ($p$-values) - Reference adjacency matrix $V_{u^\star}$.


The adjacency matrix $V_{u^\star}$, associated with the proximity measure $u^\star$ adapted to the data considered, is built from the correlation matrix in Table 2 according to Definition 1. Note that in this case, which uses quantitative variables, two positively correlated variables are considered related, and two negatively correlated variables are considered related but remote. We therefore take the sign of the correlation between variables into account in the adjacency matrix.

We first carry out a Topological PCA to identify the correlation structure of the variables. A HAC, according to Ward's criterion, is then applied to the significant principal components of the PCA of the projected data. We then compare the results of a topological and a classical PCA.

Figure 2 presents, for comparison on the first factorial plane, the correlations between principal components-factors and the original variables.

We can see that these correlations are slightly different, as are the percentages of inertia explained on the first principal planes of the Topological and Classical PCA.

**Fig. 2** Topological & Classical PCA of RE of the French regions.

The first two factors of the Topological PCA explain 57.89% and 26.11% of the variance, respectively, accounting for 83.99% of the total variation in the data set, whereas the first two factors of the Classical PCA add up to 75.20%. Thus, the first two factors provide an adequate synthesis of the data, that is, of RE in the French regions. We restrict the comparison to the first significant factorial axes.

For comparison, Figure 3 shows the dendrograms of the Topological and Classical clusterings of the French regions according to their RE. Note that the chosen partitions into 5 clusters are appreciably different, as much in composition as in characterization. The percentage of variance explained by the TCI approach, $R^2 = 86.42\%$, is higher than that of the classical approach, $R^2 = 84.15\%$, indicating that the clusters produced by the TCI approach are more homogeneous than those generated by the Classical one.
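The $R^2$ used to compare the two partitions is the share of the total inertia explained by the between-cluster inertia; a minimal sketch on toy data (function name and data are illustrative, not the paper's regions):

```python
import numpy as np

def partition_r2(X, labels):
    """R^2 of a partition: between-cluster inertia / total inertia.
    Higher values indicate more homogeneous (tighter) clusters."""
    g = X.mean(axis=0)                       # overall centroid
    total = ((X - g) ** 2).sum()             # total inertia
    between = sum(
        (labels == k).sum() * ((X[labels == k].mean(axis=0) - g) ** 2).sum()
        for k in np.unique(labels)
    )
    return between / total

# Two perfectly separated singleton-variance-free clusters: R^2 = 1
X = np.array([[0.0], [0.0], [10.0], [10.0]])
labels = np.array([0, 0, 1, 1])
r2 = partition_r2(X, labels)
```

With real data the within-cluster inertia is non-zero, so $R^2 < 1$; the comparison in the text (86.42% vs. 84.15%) is between two partitions of the same data.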

Based on the TCI analysis, the Corse region alone constitutes the fourth cluster, and the Nouvelle-Aquitaine region is found in the second cluster with the Grand-Est, Occitanie and Provence-Alpes-Côte-d'Azur (PACA) regions; in the Classical clustering, however, these two regions - Corse and Nouvelle-Aquitaine - together constitute the third cluster.

Figure 4 summarizes the significant profiles (+) and anti-profiles (-) of the two typologies; with a risk of error less than or equal to 5%, they are quite different.

The first cluster produced by the TCI approach, consisting of a single region, Auvergne-Rhône-Alpes (AURA), is characterized by a high share of hydroelectricity, a high level of coverage of regional consumption, and high RE production and consumption. The second cluster - which groups together the four regions of Grand-Est, Occitanie, Provence-Alpes-Côte-d'Azur (PACA) and Nouvelle-Aquitaine - is a homogeneous cluster, meaning that none of the seven RE characteristics differ significantly from the average of these characteristics across all regions. This cluster can therefore be considered to reflect the typical picture of RE in France.

**Fig. 3** Topological and Classical dendrograms of the French regions.

**Fig. 4** Typologies - Characterization of TCI & Classical clusters

Cluster 3, which consists of six regions, is characterized by a high degree of wind energy, a low degree of hydroelectricity, low coverage of regional consumption, and low production and consumption of RE compared to the national average. Cluster 4, represented by the Corse region, is characterized by a high share of solar energy and low production and consumption of RE. The last cluster, represented by the Ile-de-France region, is characterized by a high share of biomass energy; the shares of the other types of RE are close to the national average.

# **4 Conclusion**

This paper proposes a new topological approach to the clustering of individuals which can enrich classical data analysis methods within the framework of the clustering of objects. The results of the topological clustering approach, based on the notion of a neighborhood graph, are as good as - or even better than, according to the R-squared results - those of the existing classical method. The TCI approach can easily be programmed using the PCA and HAC procedures of SAS, SPAD or R software. Future work will involve extending this topological approach to other methods of data analysis, in particular in the context of evolutionary data analysis.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Model Based Clustering of Functional Data with Mild Outliers**

Cristina Anton and Iain Smith

**Abstract** We propose a procedure, called CFunHDDC, for clustering functional data with mild outliers which combines two existing clustering methods: the functional high dimensional data clustering (FunHDDC) [1] and the contaminated normal mixture (CNmixt) [3] method for multivariate data. We adapt the FunHDDC approach to data with mild outliers by considering a mixture of multivariate contaminated normal distributions. To fit the functional data in group-specific functional subspaces we extend the parsimonious models considered in FunHDDC, and we estimate the model parameters using an expectation-conditional maximization algorithm (ECM). The performance of the proposed method is illustrated for simulated and real-world functional data, and CFunHDDC outperforms FunHDDC when applied to functional data with outliers.

**Keywords:** functional data, model-based clustering, contaminated normal distribution, EM algorithm

# **1 Introduction**

Recently, model-based clustering for functional data has received a lot of attention. Real data are often contaminated by outliers that affect the estimation of the model parameters. Here we propose a method for clustering functional data with mild outliers. Mild outliers are usually sampled from a population different from the assumed model, so we need to choose a model flexible enough to accommodate them.

Cristina Anton
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada, e-mail: popescuc@macewan.ca

Iain Smith
MacEwan University, 10700 – 104 Avenue Edmonton, AB, T5J 4S2, Canada, e-mail: smithi23@mymacewan.ca

© The Author(s) 2023
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_2

Functional data live in an infinite-dimensional space, and model-based methods for clustering are not directly available because the notion of a probability density function generally does not exist for such data. A first approach is to use a two-step method: first discretize the functional data or decompose them in a basis of functions (such as Fourier series, B-splines, etc.), and then directly apply multivariate clustering methods to the discretization or the basis coefficients. A second approach, which allows interaction between the discretization and the clustering steps, is based on a probabilistic model for the basis coefficients [1, 2].

We follow the second approach and propose a method, called CFunHDDC, which extends the functional high dimensional data clustering (FunHDDC) [1] to clustering functional data with mild outliers. There are several methods for detecting outliers in functional data, and a robust clustering methodology based on trimming is presented in [4]. Our approach does not involve trimming the outliers; it is inspired by the CNmixt method [3] for clustering multivariate data with mild outliers. We propose a model for the basis coefficients based on a mixture of contaminated multivariate normal distributions. A multivariate contaminated normal distribution is a two-component normal mixture in which the bad observations (outliers) are represented by a component with a small prior probability and an inflated covariance matrix.
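The contaminated normal just described is a two-component mixture: a proportion $\alpha$ of good points from $N(\mu, \Sigma)$ and $1 - \alpha$ of outliers from $N(\mu, \eta\Sigma)$ with inflation factor $\eta > 1$. A minimal density sketch (the values of $\alpha$ and $\eta$ and the function name are illustrative, not the paper's estimates):

```python
import numpy as np
from scipy.stats import multivariate_normal

def contaminated_normal_pdf(x, mu, Sigma, alpha=0.95, eta=10.0):
    """Density of a multivariate contaminated normal distribution:
    alpha * N(mu, Sigma) + (1 - alpha) * N(mu, eta * Sigma),
    with eta > 1 inflating the covariance of the outlier component."""
    good = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    bad = multivariate_normal.pdf(x, mean=mu, cov=eta * Sigma)
    return alpha * good + (1 - alpha) * bad

mu = np.zeros(2)
Sigma = np.eye(2)
# Far from the mean, the inflated component dominates, giving the
# heavy tails that accommodate mild outliers.
x_far = np.array([5.0, 5.0])
```

Because both components share the mean $\mu$, contamination fattens the tails without shifting the center, which is exactly the "mild outlier" behaviour the method targets.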

In the next section we present the model and its parsimonious variants. Parameter estimation is included in Section 3. In Section 4 we present applications to simulated and real-world data. The last section includes the conclusions.

# **2 The Model**

We suppose that we observe $n$ curves $\{x_1, \dots, x_n\}$ that we want to cluster into $K$ homogeneous groups. For each curve $x_i$ we have access to a finite set of values $x_{ij} = x_i(t_{ij})$, where $0 \leq t_{i1} < t_{i2} < \dots < t_{im_i} \leq T$. We assume that the observed curves are independent realizations of an $L^2$-continuous stochastic process $X = \{X(t)\}_{t \in [0,T]}$ whose sample paths are in $L^2[0,T]$. To reconstruct the functional form of the data we assume that the curves belong to a finite-dimensional space spanned by a basis of functions $\{\xi_1, \dots, \xi_p\}$, so that each curve admits the expansion

$$x\_i(t) = \sum\_{j=1}^p \gamma\_{ij}\, \xi\_j(t).$$

Here we assume that the dimension $p$ is fixed and known. We consider a model based on a mixture of multivariate contaminated normal distributions for the coefficient vectors $\{\gamma_1, \dots, \gamma_n\} \subset \mathbb{R}^p$, $\gamma_i = (\gamma_{i1}, \dots, \gamma_{ip})^\top \in \mathbb{R}^p$, $i = 1, \dots, n$.
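The coefficients $\gamma_i$ of one observed curve can be obtained by least squares on a fixed basis evaluated at the observation times. The sketch below uses an illustrative Fourier basis (the paper does not prescribe a specific basis; function and variable names are hypothetical):

```python
import numpy as np

def basis_coefficients(t, x, p=5):
    """Least-squares coefficients gamma_i of one curve in a fixed
    basis {xi_1, ..., xi_p}: x_i(t) ~ sum_j gamma_ij * xi_j(t).
    Illustrative basis: constant plus sine/cosine pairs on [0, 1]."""
    cols = [np.ones_like(t)]
    for j in range(1, (p - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * j * t))
        cols.append(np.cos(2 * np.pi * j * t))
    B = np.column_stack(cols[:p])            # design matrix, one basis fn per column
    gamma, *_ = np.linalg.lstsq(B, x, rcond=None)
    return gamma

# A curve lying exactly in the basis span: x(t) = 2 + sin(2 pi t),
# so the recovered coefficients are (2, 1, 0, 0, 0).
t = np.linspace(0, 1, 101)
x = 2.0 + np.sin(2 * np.pi * t)
gamma = basis_coefficients(t, x, p=5)
```

Repeating this for every observed curve yields the coefficient vectors $\gamma_1, \dots, \gamma_n$ on which the contaminated mixture model is placed.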

We suppose that there exist two unobserved random variables $Z = (Z_1, \dots, Z_K)$, $\Upsilon = (\Upsilon_1, \dots, \Upsilon_K) \in \{0,1\}^K$, where $Z$ indicates the cluster membership and $\Upsilon$ whether an observation is good or bad (an outlier): $Z_k = 1$ if $X$ belongs to the $k$th cluster and $Z_k = 0$ otherwise, and $\Upsilon_k = 1$ if $X$ belongs to the $k$th cluster and is a good observation, and $\Upsilon_k = 0$ otherwise. For clustering we need to predict the value $z_i = (z_{i1}, \dots, z_{iK})$ of $Z$, and to determine the bad observations we need to predict the value $\nu_i = (\nu_{i1}, \dots, \nu_{iK})$ of $\Upsilon$, for each observed curve $x_i$, $i = 1, \dots, n$.

We consider a set of $n_k$ observed curves of the $k$th cluster with coefficients $\{\gamma_1, \dots, \gamma_{n_k}\} \subset \mathbb{R}^p$. We assume that $\{\gamma_1, \dots, \gamma_{n_k}\}$ are independent realizations of a random vector $\Gamma \in \mathbb{R}^p$, and that the stochastic process associated with the $k$th cluster can be described in a lower-dimensional subspace $\mathbb{E}^k[0,T] \subset L^2[0,T]$ with dimension $d_k \le p$, spanned by the first $d_k$ elements of a group-specific basis of functions $\{\phi_{kj}\}_{j=1,\dots,d_k}$ that can be obtained from $\{\xi_j\}_{j=1,\dots,p}$ by a linear transformation

$$\phi_{kj} = \sum_{l=1}^{p} q_{k,jl} \xi_l,$$

with $Q_k = (q_{k,jl})$ a $p \times p$ orthogonal matrix. In [1], for FunHDDC, the assumption is that the distribution of $\Gamma$ for the $k$th cluster is $\Gamma \sim N(\mu_k, \Sigma_k)$, with $\Sigma_k = Q_k \Delta_k Q_k^\top$, where

$$
\Delta_k = \begin{pmatrix} a_{k1} & & & & & \\ & \ddots & & & & \\ & & a_{kd_k} & & & \\ & & & b_k & & \\ & & & & \ddots & \\ & & & & & b_k \end{pmatrix}
$$

is diagonal, with the first block of size $d_k$, the second of size $p - d_k$, and $a_{ki} > b_k$, $i = 1, \dots, d_k$. We can say that the variance of the actual data in the $k$th cluster is modeled by $a_{k1}, \dots, a_{kd_k}$, while the parameter $b_k$ models the variance of the noise [1].

We follow the approach in [3] and assume that $\Gamma$ for the $k$th cluster has a multivariate contaminated normal distribution with density

$$f(\gamma_i; \theta_k) = \alpha_k \phi(\gamma_i; \mu_k, \Sigma_k) + (1 - \alpha_k) \phi(\gamma_i; \mu_k, \eta_k \Sigma_k), \tag{1}$$

where $\alpha_k \in (0.5, 1)$, $\eta_k > 1$, $\theta_k = \{\alpha_k, \mu_k, \Sigma_k, \eta_k\}$, and $\phi(\gamma_i; \mu_k, \Sigma_k)$ is the density of the $p$-variate normal distribution $N(\mu_k, \Sigma_k)$:

$$\phi(\gamma_i; \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\left(-\frac{1}{2} (\gamma_i - \mu_k)^\top \Sigma_k^{-1} (\gamma_i - \mu_k)\right). \tag{2}$$

Here $\alpha_k$ is the proportion of uncontaminated data in the $k$th cluster and $\eta_k$ represents the degree of contamination. We can see $\eta_k$ as an inflation parameter that measures the increase in variability due to the bad observations.
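A minimal sketch of the contaminated normal density of Eq. (1), written with plain NumPy; the helper names (`mvn_pdf`, `contaminated_pdf`) are illustrative, not from the paper.

```python
import numpy as np

def mvn_pdf(y, mu, Sigma):
    """Density of the multivariate normal N(mu, Sigma) at y, as in Eq. (2)."""
    p = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

def contaminated_pdf(y, alpha, mu, Sigma, eta):
    """Eq. (1): good component plus a bad component with inflated covariance."""
    return alpha * mvn_pdf(y, mu, Sigma) + (1 - alpha) * mvn_pdf(y, mu, eta * Sigma)

# Density at the common mode mu for a bivariate example.
mu = np.zeros(2)
Sigma = np.eye(2)
val = contaminated_pdf(np.zeros(2), alpha=0.9, mu=mu, Sigma=Sigma, eta=10.0)
```

At the mode both components contribute, the bad one scaled down by the inflated determinant $|\eta\Sigma|^{1/2}$.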

Each curve $x_i$ has a basis expansion with coefficient vector $\gamma_i$, a random vector whose distribution is a mixture of contaminated Gaussians with density

$$p(\gamma; \theta) = \sum_{k=1}^{K} \pi_k f(\gamma; \theta_k), \tag{3}$$

where $\pi_k = P(Z_k = 1)$ is the prior probability of the $k$th cluster and $\theta = \bigcup_{k=1}^{K} (\theta_k \cup \{\pi_k\})$ is the set of all the parameters. We refer to this model as FCLM$[a_{kj}, b_k, Q_k, d_k]$ (functional contaminated latent mixture). As in [1] we consider the parsimonious sub-models FCLM$[a_{kj}, b, Q_k, d_k]$, FCLM$[a_k, b_k, Q_k, d_k]$, FCLM$[a, b_k, Q_k, d_k]$, FCLM$[a_k, b, Q_k, d_k]$, and FCLM$[a, b, Q_k, d_k]$.

# **3 Model Inference**

To fit the models we use the ECM algorithm [3], a variant of the EM algorithm in which the M-step is replaced by two simpler CM-steps, given by the partition of the parameter set $\theta = \{\Psi_1, \Psi_2\}$, where $\Psi_1 = \{\pi_k, \alpha_k, \mu_k, a_{kj}, b_k, q_{kj},\ k = 1, \dots, K,\ j = 1, \dots, d_k\}$, $\Psi_2 = \{\eta_k,\ k = 1, \dots, K\}$, and $q_{kj}$ is the $j$th column of $Q_k$.

We have two sources of missing data: the cluster labels and the type of observation (good or bad). Thus the complete data are given by $S = \{\gamma_i, z_i, \nu_i\}_{i=1,\dots,n}$, and the complete-data likelihood is

$$L_c(\theta; S) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left\{ \pi_k \left[ \alpha_k \phi(\gamma_i; \mu_k, \Sigma_k) \right]^{\nu_{ik}} \left[ (1 - \alpha_k) \phi(\gamma_i; \mu_k, \eta_k \Sigma_k) \right]^{1 - \nu_{ik}} \right\}^{z_{ik}}$$

We denote the complete-data log-likelihood by $l_c(\theta; S) = \log L_c(\theta; S)$.

Next we present the ECM algorithm for the model FCLM$[a_{kj}, b_k, Q_k, d_k]$. At iteration $q$ of the ECM algorithm, in the E-step we calculate $E[l_c(\theta^{(q-1)}; S) \mid \gamma_1, \dots, \gamma_n, \theta^{(q-1)}]$, given the current values of the parameters $\theta^{(q-1)}$. This reduces to the calculation of $z_{ik}^{(q)} := E[Z_{ik} \mid \gamma_i, \theta^{(q-1)}]$ and $\nu_{ik}^{(q)} := E[\Upsilon_{ik} \mid \gamma_i, z_i, \theta^{(q-1)}]$.
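A hedged sketch of this E-step: $z_{ik}$ are the posterior cluster memberships and $\nu_{ik}$ the posterior probabilities of being a good observation within cluster $k$, both computed from the component densities of Eq. (1). The function name `e_step` and the array layout are assumptions for illustration.

```python
import numpy as np

def mvn_pdf(y, mu, Sigma):
    """Density of N(mu, Sigma) at y."""
    p = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

def e_step(gamma, pis, alphas, mus, Sigmas, etas):
    """Return (z, nu): z[i,k] cluster posteriors, nu[i,k] = P(good | gamma_i, cluster k)."""
    n, K = len(gamma), len(pis)
    good = np.empty((n, K))  # alpha_k * phi(gamma_i; mu_k, Sigma_k)
    f = np.empty((n, K))     # full contaminated density, Eq. (1)
    for k in range(K):
        for i in range(n):
            g = alphas[k] * mvn_pdf(gamma[i], mus[k], Sigmas[k])
            b = (1 - alphas[k]) * mvn_pdf(gamma[i], mus[k], etas[k] * Sigmas[k])
            good[i, k] = g
            f[i, k] = g + b
    z = pis * f
    z /= z.sum(axis=1, keepdims=True)  # normalize over clusters
    nu = good / f
    return z, nu

# Two well-separated clusters: each point should be assigned with near certainty.
gamma = np.array([[0.0, 0.0], [10.0, 10.0]])
mus = np.array([[0.0, 0.0], [10.0, 10.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
z, nu = e_step(gamma, np.array([0.5, 0.5]), np.array([0.9, 0.9]),
               mus, Sigmas, np.array([10.0, 10.0]))
```

A production implementation would work in log-space for numerical stability; this version keeps the algebra visible.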

In the first CM-step of iteration $q$ we calculate $\Psi_1^{(q)}$ as the value of $\Psi_1$ that maximizes $l_c^{(q-1)}$ with $\Psi_2$ fixed at $\Psi_2^{(q-1)}$. We obtain

$$\pi_k^{(q)} = \frac{\sum_{i=1}^n z_{ik}^{(q)}}{n}, \quad \alpha_k^{(q)} = \frac{\sum_{i=1}^n z_{ik}^{(q)} \nu_{ik}^{(q)}}{\sum_{i=1}^n z_{ik}^{(q)}}, \quad \mu_k^{(q)} = \frac{\sum_{i=1}^n z_{ik}^{(q)} \left(\nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}}\right) \gamma_i}{\sum_{i=1}^n z_{ik}^{(q)} \left(\nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}}\right)} \tag{4}$$

$$\Sigma_k^{(q)} = \frac{1}{\sum_{i=1}^n z_{ik}^{(q)}} \sum_{i=1}^n z_{ik}^{(q)} \left( \nu_{ik}^{(q)} + \frac{1 - \nu_{ik}^{(q)}}{\eta_k^{(q-1)}} \right) (\gamma_i - \mu_k^{(q)}) (\gamma_i - \mu_k^{(q)})^\top \tag{5}$$

We introduce a value $\alpha^*$ and constrain $\alpha_k \in (\alpha^*, 1)$. If the estimate $\alpha_k^{(q)}$ in (4) is less than $\alpha^*$, we use the *optimize*() function in the *stats* package in R to perform a numerical search for $\alpha_k^{(q)}$.
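The updates (4)-(5) can be sketched directly from the E-step quantities $z$ and $\nu$ and the previous $\eta$ values; the function name `cm_step1` and the array shapes are assumptions for illustration, and the $\alpha^*$ constraint is omitted here.

```python
import numpy as np

def cm_step1(gamma, z, nu, eta_prev):
    """First CM-step updates (4)-(5). gamma: (n,p); z, nu: (n,K); eta_prev: (K,)."""
    n, K = z.shape
    w = z * (nu + (1 - nu) / eta_prev)   # per-observation weights from Eq. (4)
    pi = z.sum(axis=0) / n               # mixing proportions
    alpha = (z * nu).sum(axis=0) / z.sum(axis=0)
    mus, Sigmas = [], []
    for k in range(K):
        mu_k = (w[:, k, None] * gamma).sum(axis=0) / w[:, k].sum()
        diff = gamma - mu_k
        # Eq. (5): weighted scatter divided by the cluster size sum_i z_ik
        S_k = (w[:, k, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) \
              / z[:, k].sum()
        mus.append(mu_k)
        Sigmas.append(S_k)
    return pi, alpha, np.array(mus), np.array(Sigmas)

# One cluster, all observations good: mu is the mean and Sigma the scatter.
gamma = np.array([[0.0, 0.0], [2.0, 2.0]])
z = np.ones((2, 1))
nu = np.ones((2, 1))
pi, alpha, mus, Sigmas = cm_step1(gamma, z, nu, np.array([10.0]))
```

Note that Eq. (5) normalizes by $\sum_i z_{ik}$, not by the sum of the down-weighted weights, which is what makes the bad observations count less in the scatter.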

As in [1] we obtain the updated values $a_{kj}^{(q)}$, $b_k^{(q)}$, $q_{kj}^{(q)}$, $k = 1, \dots, K$, $j = 1, \dots, d_k$, from the sample covariance matrix $\Sigma_k^{(q)}$ of cluster $k$, using also the matrix of inner products between the basis functions, $W = (w_{jl})_{1 \le j,l \le p}$, where $w_{jl} = \int_0^T \xi_j(t) \xi_l(t)\, dt$.

In the second CM-step of iteration $q$ we calculate $\eta_k^{(q)}$ as the value that maximizes $l_c^{(q-1)}$ with $\Psi_1$ fixed at $\Psi_1^{(q)}$.

At the end of the ECM algorithm we perform a two-step classification to obtain the expected clustering. If $q_f$ is the last iteration of the algorithm before convergence, an observation $\gamma_i \in \mathbb{R}^p$ is assigned to the cluster $k_0 \in \{1, \dots, K\}$ with the largest $z_{ik}^{(q_f)}$. Next, an observation $\gamma_i$ assigned to cluster $k_0$ is considered good if $\nu_{ik_0}^{(q_f)} > 0.5$, and bad otherwise. After the classification step we can eliminate the bad observations and run FunHDDC to re-cluster the remaining observations.

The cluster-specific dimension $d_k$ is selected through the scree test of Cattell, by comparing the differences between consecutive eigenvalues with a given threshold [1]. The number of clusters $K$ as well as the parsimonious model are selected using the BIC criterion.
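One common variant of the Cattell scree test can be sketched as follows: retain the dimensions up to the last eigenvalue drop that is at least a fraction $\epsilon$ of the largest drop. This is an illustrative sketch; the exact rule used by FunHDDC may differ in detail.

```python
import numpy as np

def cattell(eigenvalues, eps=0.2):
    """Return the retained intrinsic dimension d via the Cattell scree test."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]
    drops = -np.diff(lam)                 # consecutive eigenvalue drops
    keep = drops >= eps * drops.max()     # "large" drops relative to the biggest
    return int(np.nonzero(keep)[0].max() + 1)

# Three dominant eigenvalues followed by a flat noise floor give d = 3.
lam = np.array([100.0, 60.0, 30.0, 2.0, 1.9, 1.8])
d = cattell(lam, eps=0.2)
```

The threshold $\epsilon$ plays the same role as the values $\epsilon \in \{0.05, 0.1, 0.2\}$ used in the experiments below.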

# **4 Applications**

**Fig. 1** Smooth data simulated without outliers (a), according to scenario A (b), scenario B (c), and scenario C (d), coloured by group for one simulation.

We simulate 1000 curves based on the model FCLM$[a_k, b_k, Q_k, d_k]$. The number of clusters is fixed to $K = 3$ and the mixing proportions are equal: $\pi_1 = \pi_2 = \pi_3 = 1/3$. We consider the following values of the parameters:

Group 1: $d = 5$, $a = 150$, $b = 5$, $\mu = (1, 0, 50, 100, 0, \dots, 0)$
Group 2: $d = 20$, $a = 15$, $b = 8$, $\mu = (0, 0, 80, 0, 40, 2, 0, \dots, 0)$
Group 3: $d = 10$, $a = 30$, $b = 10$, $\mu = (0, \dots, 0, 20, 0, 80, 0, 0, 100)$,

where $d$ is the intrinsic dimension of the subgroups, $\mu$ is the mean vector of size 70, $a$ is the value of the first $d$ diagonal elements of $\Delta$, and $b$ the value of the last $70 - d$ ones. Curves are smoothed using 35 Fourier basis functions. We repeat the simulation 100 times. A sample of these data is plotted in Figure 1 a. We consider the following contamination schemes, where the scores are simulated from contaminated normal distributions with the previous parameters and

A: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 7$, $\eta_2 = 10$, $\eta_3 = 17$.
B: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 5$, $\eta_2 = 50$, $\eta_3 = 15$.
C: $\alpha_i = 0.9$, $i = 1, \dots, 3$, and $\eta_1 = 100$, $\eta_2 = 70$, $\eta_3 = 170$.

Samples of data generated according to scenarios A, B, and C are plotted in Figure 1 b, c, d, respectively. We notice that there is more overlap between the 3 groups as the values of $\eta$ increase.
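Drawing scores from a contaminated normal amounts to drawing from $N(\mu, \Sigma)$ with probability $\alpha$ and from $N(\mu, \eta\Sigma)$ otherwise. The sketch below uses example parameters in the style of scenario A, not the paper's exact 70-dimensional setup.

```python
import numpy as np

def rcontaminated(n, alpha, mu, Sigma, eta, rng):
    """Draw n samples from the contaminated normal of Eq. (1); also return the good flags."""
    p = len(mu)
    good = rng.random(n) < alpha                    # True -> uncontaminated draw
    scale = np.where(good, 1.0, np.sqrt(eta))       # bad draws get inflated scale
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + scale[:, None] * z, good

rng = np.random.default_rng(1)
x, good = rcontaminated(1000, alpha=0.9, mu=np.zeros(2),
                        Sigma=np.eye(2), eta=7.0, rng=rng)
```

Roughly 90% of the draws are good, matching $\alpha_i = 0.9$ in all three scenarios.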

**Table 1** Mean (and standard deviation) of ARI for the BIC-best model over 100 simulations. Bold values indicate the highest value for each method.


The quality of the estimated partitions obtained using FunHDDC and CFunHDDC is evaluated using the Adjusted Rand Index (ARI) [3], and the results are reported in Table 1. For FunHDDC we use the *funHDDC* library in R. We run both algorithms for $K = 3$ with all 6 sub-models, and the best solution in terms of the highest BIC value over all those sub-models is returned. The initialization is done with the $k$-means strategy with 50 repetitions, and the maximum number of iterations for the stopping criterion is 200. We use $\epsilon \in \{0.05, 0.1, 0.2\}$ in the Cattell test.

**Table 2** Correct classification rates for each method.
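For reference, the Adjusted Rand Index used to score the partitions can be computed from the contingency table of the two labelings with the standard library alone; the function name `ari` is illustrative.

```python
from collections import Counter
from math import comb

def ari(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same n items."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table counts
    a = Counter(labels_true)                         # row sums
    b = Counter(labels_pred)                         # column sums
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Perfect agreement gives ARI = 1 regardless of how the labels are named.
score = ari([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```

ARI is chance-corrected: a random partition scores near 0, a perfect one scores 1, which is why values close to 1 in Table 1 indicate near-perfect recovery.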

We notice that CFunHDDC outperforms FunHDDC and gives excellent results even in scenario C. For CFunHDDC the best results are obtained for $\epsilon = 0.2$ in the Cattell test, and the values of the ARI are close to 1.

Next, we consider the NOx data available in the *fda.usc* library in R, representing daily curves of nitrogen oxides (NOx) emissions in the neighborhood of the industrial area of Poblenou, Barcelona (Spain). The measurements of NOx (in $\mu$g/m$^3$) were taken hourly, resulting in 76 curves for "working days" and 39 curves for "non-working days" (see Figure 2 a). Since NOx is a contaminant agent, the detection of outlying emissions is useful for environmental protection. This data set has been used for testing methods for the detection of outliers and to illustrate robust clustering based on trimming for functional data [4].

We apply CFunHDDC, FunHDDC, and CNmixt to the NOx data. Curves are smoothed using a basis of 8 Fourier functions, and we run the algorithms for $K = 2$ clusters. For CFunHDDC and FunHDDC we use $\epsilon \in \{0.001, 0.05, 0.1, 0.2\}$ in the Cattell test, and the rest of the settings are the same as in the simulation study. We run CNmixt for all 14 models from the *ContaminatedMixt* R library, based on the coefficients in the Fourier basis, with 1000 iterations for the stopping criterion, and initialization done with the $k$-means method. The correct classification rates (CCR) are reported in Table 2.

The CCR for CFunHDDC are slightly better than the ones for FunHDDC and CNmixt, and are comparable with the ones reported in Table 1 in [4] for Funclust,

**Fig. 2** a. Daily NOx curves for 115 days; b., c. Clustering obtained with CFunHDDC, $\epsilon = 0.05$, $\alpha^* = 0.85$; non-working days (blue), working days (red), outliers (green).

RFC, and TrimK. In Figure 2 b, c we present the clusters and the detected outliers for $\epsilon = 0.05$ and $\alpha^* = 0.85$. The curves detected as outliers (green lines) exhibit different patterns from the rest of the curves.

One of the advantages of extending FunHDDC to CFunHDDC is outlier detection. For $\alpha^* = 0.85$ and $\epsilon = 0.05$, CFunHDDC detects 16 outliers, which are the same as the outliers mentioned in [4]. For the data without outliers, CFunHDDC becomes equivalent to FunHDDC, and for the trimmed data the CCR increases to 0.79.

# **5 Conclusion**

We propose a new method, CFunHDDC, that extends the FunHDDC functional clustering method to data with mild outliers. Unlike other robust functional clustering algorithms, CFunHDDC does not involve trimming the data. CFunHDDC is based on a mixture of contaminated multivariate normal distributions, which makes parameter estimation more difficult than for FunHDDC, so we use an ECM instead of an EM algorithm. The clustering and outlier detection performance of CFunHDDC is tested on simulated data and on the NOx data, and it always outperforms FunHDDC. Moreover, CFunHDDC has a performance comparable with robust functional clustering methods based on trimming, such as RFC and TrimK, and similar or better performance when compared to a two-step method based on CNmixt. Although there are several model-based methods for multivariate data with outliers that can be used to construct two-step methods for functional data, as observed in [1], these two-step methods always suffer from the difficulty of choosing the best discretization. CFunHDDC can be extended to multivariate functional data; recently, and independently of our work, a similar approach was followed in [5], but without considering the parsimonious models and the value $\alpha^*$.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Trivariate Geometric Classification of Decision Boundaries for Mixtures of Regressions**

Filippo Antonazzo and Salvatore Ingrassia

**Abstract** Mixtures of regressions play a prominent role in regression analysis when it is known that the population of interest is divided into homogeneous and disjoint groups. This typically consists in partitioning the observational space into several regions through particular hypersurfaces called decision boundaries. A geometrical analysis of these surfaces highlights properties of the classifier in use. In particular, a geometrical classification of decision boundaries for the three most used mixtures of regressions (with fixed covariates, with concomitant variables, and with random covariates) was provided in the case of one and two covariates, under Gaussian assumptions and in the presence of a single real response variable. This work extends these results to a more complex setting where three independent variables are considered.

**Keywords:** mixtures of regressions, decision boundaries, hyperquadrics, model-based clustering

# **1 Introduction**

Linear regression is commonly employed to model the relationship between a $d$-dimensional real vector of covariates **X** and a real response variable $Y$. It is well suited if we can assume that the regression coefficients are fixed over all possible realizations $(\mathbf{x}, y) \in \mathbb{R}^{d+1}$ of the couple $(\mathbf{X}, Y)$. This assumption fails if it is a priori known that realizations come from a population $\Omega$ which can be partitioned into $G$ disjoint

Filippo Antonazzo
Inria, Université de Lille, CNRS, Laboratoire de mathématiques Painlevé, 59650 Villeneuve d'Ascq, France, e-mail: filippo.antonazzo@inria.fr

Salvatore Ingrassia
Dipartimento di Economia e Impresa, Università di Catania, Corso Italia 55, 95129 Catania, Italy, e-mail: salvatore.ingrassia@unict.it

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_3

homogeneous groups $\Omega_g$, $g = 1, \dots, G$. In this case, a mixture of linear regressions (or clusterwise regression) is a more appropriate statistical tool. According to their degree of flexibility and generality, we can distinguish three types of mixtures of regressions: mixtures of regressions with fixed covariates (MRFC) [3]; mixtures of regressions with concomitant variables (MRCV) [6]; and mixtures of regressions with random covariates (MRRC), also referred to in the literature as cluster-weighted models [3, 4].

Mixtures of regressions can also be employed from a classification point of view to identify the group membership of each observation. In this case, the generated classifier divides the real space into $G$ regions through particular surfaces in $\mathbb{R}^{d+1}$ called decision boundaries. In [5], the decision boundaries generated by each type of mixture are analyzed from a geometrical point of view, especially in the cases $d = 1, 2$ and $G = 2$. The aim of the present work is to extend the results of the aforementioned paper to the higher-dimensional case $d = 3$, giving more insight into the properties of these classifiers. The rest of the paper is organized as follows. In Section 2 we summarize the main ideas about mixtures of regressions. In Section 3 decision boundaries are defined, and a geometrical classification for $d = 3$ and $G = 2$ is proposed in Section 4. In Section 5 we conclude by investigating with a practical example the shape of three-dimensional decision boundaries in the presence of variables following heavy-tailed $t$-distributions.

# **2 Mixtures of Regressions**

Below we briefly define three types of mixtures of regressions, ordered according to their generality and flexibility, given by an increasing number of parameters.

**MRFC.** Mixtures of regressions with fixed covariates have the following density:

$$p(y|\mathbf{x}; \psi) = \sum_{g=1}^{G} \pi_g f(y|\mathbf{x}; \theta_g). \tag{1}$$

The density $f(y|\mathbf{x}; \theta_g)$ is indexed by a parameter vector $\theta_g$ belonging to a Euclidean parametric space $\Theta_g$. Moreover, every $\pi_g$ is positive and $\sum_{g=1}^{G} \pi_g = 1$. The vector $\psi = (\pi_1, \dots, \pi_G, \theta_1, \dots, \theta_G)$ denotes the set of all the parameters of the model.

**MRCV.** The density of a mixture of regressions with concomitant variables is:

$$p(y|\mathbf{x}; \psi) = \sum_{g=1}^{G} f(y|\mathbf{x}; \theta_g)\, p(\Omega_g|\mathbf{x}; \alpha), \tag{2}$$

where the vector $\psi = (\theta_1, \dots, \theta_G, \alpha)$ contains all parameters indexing the model. More specifically, $p(\Omega_g|\mathbf{x}; \alpha)$ is a function depending on $\mathbf{x}$ through a vector of real parameters $\alpha$. Typically, the probability $p(\Omega_g|\mathbf{x}; \alpha)$ is a multinomial logistic density with $\alpha = (\alpha_1^t, \dots, \alpha_G^t)^t$ and $\alpha_g = (\alpha_{g0}, \alpha_{g1}^t)^t \in \mathbb{R}^{d+1}$, i.e.:

$$p(\Omega_g|\mathbf{x}; \alpha) = \frac{\exp(\alpha_{g0} + \alpha_{g1}^t \mathbf{x})}{\sum_{j=1}^{G} \exp(\alpha_{j0} + \alpha_{j1}^t \mathbf{x})}.$$

For identifiability reasons, it is necessary to add the constraint $\alpha_1 = \mathbf{0}$; see [2].
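The concomitant probabilities are a softmax over affine scores in $\mathbf{x}$, with the first group's parameters pinned to zero. A minimal sketch (the function name `concomitant_probs` is illustrative):

```python
import numpy as np

def concomitant_probs(x, alpha0, alpha1):
    """p(Omega_g | x; alpha): multinomial logistic.
    alpha0: (G,) intercepts, alpha1: (G, d) slopes, with alpha0[0]=0, alpha1[0]=0."""
    scores = alpha0 + alpha1 @ x
    scores -= scores.max()        # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

# G = 2, d = 2; group 1 is the zero-parameter reference group.
alpha0 = np.array([0.0, 1.0])
alpha1 = np.array([[0.0, 0.0], [0.5, -0.5]])
p = concomitant_probs(np.array([1.0, 1.0]), alpha0, alpha1)
```

Because the probabilities depend only on score differences, fixing $\alpha_1 = \mathbf{0}$ removes the redundant degree of freedom.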

**MRRC.** Mixtures of regressions with random covariates propose the following decomposition of the joint density $p(\mathbf{x}, y; \psi)$:

$$p(\mathbf{x}, y; \psi) = \sum_{g=1}^{G} f(y|\mathbf{x}; \theta_g)\, p(\mathbf{x}; \xi_g)\, \pi_g, \tag{3}$$

where $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$. Furthermore, the model is fully parametrized by the vector $\psi = (\pi_1, \dots, \pi_G, \theta_1, \dots, \theta_G, \xi_1, \dots, \xi_G)$, where each $\theta_g$ indexes the conditional density $f(y|\mathbf{x}; \theta_g)$, while each $\xi_g$ refers to the density of $\mathbf{X}$ in the group $\Omega_g$, denoted by $p(\mathbf{x}; \xi_g)$.

In particular, under Gaussian assumptions it results that $Y|\mathbf{x}, \Omega_g \sim N(\beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)$, where each $\beta_g = (\beta_{g0}, \beta_{g1})$ is a vector of real parameters. For the MRRC model only, we will further assume $\mathbf{X}|\Omega_g \sim N(\mu_g, \Sigma_g)$ for all $g = 1, \dots, G$, where $\mu_g$ denotes the mean of the Gaussian distribution, while $\Sigma_g$ is its covariance matrix. Denoting by $\phi(\cdot)$ the Gaussian density function, equations (1)-(3) can be rewritten, respectively, as

$$p(y|\mathbf{x}; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \pi_g, \tag{4}$$

$$p(y|\mathbf{x}; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, p(\Omega_g|\mathbf{x}; \alpha), \tag{5}$$

$$p(\mathbf{x}, y; \psi) = \sum_{g=1}^{G} \phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \phi(\mathbf{x}; \mu_g, \Sigma_g)\, \pi_g. \tag{6}$$

Maximum likelihood estimates for $\psi$ are usually obtained with the Expectation-Maximization (EM) algorithm. The final estimate is then used to build classifiers which group observations into $G$ disjoint classes.

# **3 Decision Boundaries: Generality**

There are different ways to build classifiers. One of the best known is the method of discriminant functions. The aim of this procedure is to define $G$ functions $D_g(\mathbf{x}, y; \psi)$ and a decision rule to divide the real space $\mathbb{R}^{d+1}$ into $G$ decision regions, named $\mathcal{R}_1, \dots, \mathcal{R}_G$. The decision regions have a one-to-one relationship with the subgroups $\Omega_g$, i.e., if an observation $(\mathbf{x}, y) \in \mathbb{R}^{d+1}$ is assigned to $\mathcal{R}_g$, it is classified as part of $\Omega_g$. Among all possible decision rules, the most used one consists in assigning $(\mathbf{x}, y)$ to $\mathcal{R}_g$ if:

$$D_g(\mathbf{x}, y; \psi) > D_j(\mathbf{x}, y; \psi) \quad \forall j \neq g. \tag{7}$$

Decision boundaries are then defined as the surfaces in $\mathbb{R}^{d+1}$ separating the decision regions $\mathcal{R}_g$, where observations cannot be uniquely classified. Formally, each decision boundary is a hypersurface represented by the equation $D_j(\mathbf{x}, y; \psi) - D_k(\mathbf{x}, y; \psi) = 0$, $j \neq k$.

Different choices of discriminant functions are possible: under Gaussian assumptions it is convenient to define $D_g(\cdot)$ as the logarithm of the $g$th component mixture density, as it brings useful computational simplifications [5]. So, for all three models, we can define these discriminant functions:

$$MRFC:\quad D_g(\mathbf{x}, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \pi_g] \tag{8}$$

$$MRCV:\quad D_g(\mathbf{x}, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2) \exp(\alpha_{g0} + \alpha_{g1}^t \mathbf{x})] \tag{9}$$

$$MRRC:\quad D_g(\mathbf{x}, y; \psi) = \ln[\phi(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2)\, \phi(\mathbf{x}; \mu_g, \Sigma_g)\, \pi_g] \tag{10}$$

#### **3.1 The Case with** 𝑮 = **2**

In the case of interest, where $G = 2$, there is a single decision boundary defined by the equation $D(\mathbf{x}, y; \psi) = D_2(\mathbf{x}, y; \psi) - D_1(\mathbf{x}, y; \psi) = 0$. Thus, the assignment rule for every point $(\mathbf{x}, y) \in \mathbb{R}^{d+1}$ is based on the sign of $D(\mathbf{x}, y; \psi)$: it assigns $(\mathbf{x}, y)$ to $\Omega_2$ if $D(\mathbf{x}, y; \psi) > 0$, and to $\Omega_1$ otherwise.
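This sign rule can be sketched for the MRFC discriminant functions (8) under Gaussian assumptions; the function names and all parameter values below are made up for illustration.

```python
import math

def log_normal_pdf(y, mean, sigma2):
    """Log density of the univariate normal N(mean, sigma2) at y."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (y - mean) ** 2 / (2 * sigma2)

def mrfc_assign(x, y, beta1, beta2, sigma2_1, sigma2_2, pi1, pi2):
    """Assign (x, y) to group 1 or 2 by the sign of D = D_2 - D_1, Eq. (8)."""
    mean1 = beta1[0] + sum(b * xi for b, xi in zip(beta1[1:], x))
    mean2 = beta2[0] + sum(b * xi for b, xi in zip(beta2[1:], x))
    d1 = log_normal_pdf(y, mean1, sigma2_1) + math.log(pi1)
    d2 = log_normal_pdf(y, mean2, sigma2_2) + math.log(pi2)
    return 2 if d2 - d1 > 0 else 1

# d = 3: a point lying exactly on the second regression hyperplane goes to Omega_2.
g = mrfc_assign((1.0, 0.0, 0.0), 5.0,
                beta1=(0.0, 1.0, 1.0, 1.0), beta2=(2.0, 3.0, 0.0, 0.0),
                sigma2_1=1.0, sigma2_2=1.0, pi1=0.5, pi2=0.5)
```

Setting $D(\mathbf{x}, y; \psi) = 0$ in this function's algebra gives exactly the quadric studied in the propositions below.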

In [5] the geometrical properties of the hypersurfaces defined by the equation $D(\mathbf{x}, y; \psi) = 0$ have been investigated up to dimension $d = 2$, providing the following propositions for quadrics.

**Proposition 1 (MRFC quadrics)** *The decision boundary between* $\Omega_1$ *and* $\Omega_2$ *is always a degenerate quadric.*

**Proposition 2 (MRCV quadrics)** *If* $\alpha^t(\beta_{21} - \beta_{11}) \neq 0$*, then the decision boundary between* $\Omega_1$ *and* $\Omega_2$ *is a paraboloid; otherwise it is a degenerate quadric.*

**Proposition 3 (MRRC quadrics)** *Under convenient conditions, the decision boundary between* $\Omega_1$ *and* $\Omega_2$ *can be a degenerate quadric, but it can also assume any of the general quadric forms.*

These results show that models with more flexibility, i.e. with more parameters, can generate more varieties of decision boundaries. In the following section, we will extend these statements to dimension 𝑑 = 3.

# **4 Geometrical Classification of Decision Boundaries with** 𝑮 = **2 and** 𝒅 = **3**

In this section we extend the previous results to mixtures of regressions in the presence of two classes and $d = 3$, where decision boundaries turn out to be hyperquadrics in $\mathbb{R}^4$. The mathematical proofs of the results for the MRFC and MRCV models are based on an algebraic analysis of the matrices representing these hyperquadrics.

**MRFC.** Mixtures of regressions with fixed covariates are characterized by a low degree of flexibility. Indeed, all decision boundaries are degenerate hyperquadrics as the following result shows.

**Proposition 4 (MRFC hyperquadrics)** *The decision boundary between* $\Omega_1$ *and* $\Omega_2$ *is a degenerate hyperquadric of rank at most equal to* 3*. The rank is less than 3 if* $\beta_{11} = \beta_{21}$ *or* $\frac{1-\pi_1}{\pi_1} = \frac{\sigma_2}{\sigma_1}$*.*

**MRCV.** An MRCV allows more degrees of freedom than an MRFC. A consequence is that the obtained decision boundaries are higher-rank hyperquadrics, as the following result states.

**Proposition 5 (MRCV hyperquadrics)** *The decision boundary between* $\Omega_1$ *and* $\Omega_2$ *is a degenerate hyperquadric with rank at most equal to 4. In particular, the rank is equal to 4 if* $\alpha^t(\beta_{21} - \beta_{11}) \neq 0$*. In addition, if* $\alpha^t(\beta_{21} - \beta_{11}) = 0$ *and* $\sigma_1^2 = \sigma_2^2$*, the matrix has rank at most equal to 2, and therefore the hyperquadric is reducible.*

**MRRC.** Proposition 3 shows that MRRC models exhibit a high number of possible types of conics and quadrics [5]. This fact is confirmed in dimension $d = 3$, even if a strong theoretical result is difficult to obtain with simple algebra, due to the mathematical complexity of the MRRC hyperquadric matrix. It is nonetheless possible to show such flexibility by building several practical examples (not displayed here), where hyperquadrics of various shapes arise.

Analyzing the provided results, we note that they perfectly match the hierarchy established in dimension $d = 2$. Indeed, an MRFC can generate only degenerate hyperquadrics of rank 3; the surfaces generated by an MRCV, which has more parameters, are still degenerate, but with a higher rank (equal to 4), depending on the same mathematical condition as in Proposition 2; finally an MRRC, the most flexible model in terms of number of parameters, can give rise to various hyperquadrics, as in $d = 2$.

# **5 Beyond Gaussian Assumptions:** 𝒕**-distribution in** 𝒅 = **2**

In [5], Gaussian assumptions were relaxed by illustrating the case of a simple linear regression ($G = 2$ and $d = 1$) where more general $t$-distributions were required for robustness reasons. It is shown that the generated decision boundaries are more flexible than their Gaussian counterparts, as they can assume a wider variety of shapes, although these surfaces can be computed only numerically. In this section, we continue the exploration of the $t$-distribution case by adding one more variable, thus $d = 2$. Under these more general assumptions, the discriminant functions (8)-(10) become:

$$MRFC\text{-}t:\quad D_g(\mathbf{x}, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g)\, \pi_g], \tag{11}$$

$$MRCV\text{-}t:\quad D_g(\mathbf{x}, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g) \exp(\alpha_{g0} + \alpha_{g1}^t \mathbf{x})], \tag{12}$$

$$MRRC\text{-}t:\quad D_g(\mathbf{x}, y; \psi) = \ln[q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g)\, q(\mathbf{x}; \mu_g, \Sigma_g, \nu_g)\, \pi_g], \tag{13}$$

where $q(y; \beta_{g0} + \beta_{g1}^t \mathbf{x}, \sigma_g^2, \eta_g)$ denotes a generalized $t$-distribution density, with non-centrality parameter $\beta_{g0} + \beta_{g1}^t \mathbf{x}$, scaling parameter $\sigma_g^2$, and degrees of freedom $\eta_g$. Similarly, $q(\mathbf{x}; \mu_g, \Sigma_g, \nu_g)$ is a multivariate generalized $t$-distribution density, where $\mu_g$ is the non-centrality parameter, $\Sigma_g$ denotes the scaling matrix, and $\nu_g$ represents the degrees of freedom. Figures 1-2 display the decision boundaries for the three considered models, with parameters given in Table 1: they clearly show the gain in flexibility given by the more general distributional assumptions. Moreover, $t$-boundaries with $\eta_1 = \eta_2 = 10$ (Figure 2; red curves) are closer to the Gaussian ones (blue curves) than those with $\eta_1 = \eta_2 = 3$ (Figure 1; orange curves): this is coherent with standard probabilistic theory.
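The univariate density $q(y; m, \sigma^2, \eta)$ appearing in (11)-(13) is a location-scale Student $t$; a stdlib-only sketch of its log density (the function name `log_t_pdf` is illustrative):

```python
import math

def log_t_pdf(y, m, sigma2, eta):
    """Log density of the generalized t with location m, scale sigma2, df eta."""
    sigma = math.sqrt(sigma2)
    z = (y - m) / sigma
    return (math.lgamma((eta + 1) / 2) - math.lgamma(eta / 2)
            - 0.5 * math.log(math.pi * eta) - math.log(sigma)
            - (eta + 1) / 2 * math.log1p(z * z / eta))

# As eta grows, the t density approaches the Gaussian with the same location/scale,
# which is why the eta = 10 boundaries hug the Gaussian ones more than eta = 3.
log_gauss = -0.5 * math.log(2 * math.pi)   # log N(0,1) density at 0
approx = log_t_pdf(0.0, 0.0, 1.0, eta=1e6)
```

Plugging these log densities into (11)-(13) in place of the Gaussian ones yields the numerically computed $t$-boundaries.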


**Table 1** Parameters used in Figures 1-2. MRRC: the covariance matrices $\Sigma_1$ and $\Sigma_2$ are equal to the identity matrix $\mathbf{I}_2$.

# **6 Conclusions**

This work has provided a trivariate geometrical classification of the decision boundaries generated by mixtures of regressions in the presence of two classes. Under Gaussian assumptions, our results confirm the same hierarchy that was shown for $d = 2$: MRRC turns out to exhibit a huge variety of decision boundaries, while the other models generate only degenerate surfaces. This is coherent with its high degree of flexibility given by its very general parametrization. The provided results

**Fig. 1** Decision boundaries under assumptions of Gaussian (in blue) and 𝑡-distributed variables with 𝜂<sup>1</sup> = 𝜂<sup>2</sup> = 3 (in orange) for the three considered mixtures of regressions.

**Fig. 2** Decision boundaries under assumptions of Gaussian (in blue) and $t$-distributed variables with $\eta_1=\eta_2=10$ (in red) for the three considered mixtures of regressions.

could help to select the right model depending on the shape of the data. For example, if in a descriptive analysis the data turn out to be approximately separated by a simple degenerate hyperquadric, it will be better to estimate an MRFC or an MRCV instead of a complex MRRC. Conversely, if the separation surface seems to be nondegenerate, it will be preferable to fit a general MRRC. Moreover, this work also showed that the degree of flexibility (and thus the variety of possible decision boundaries) can be enhanced by going beyond Gaussianity, assuming, for example, $t$-distributed variables. This encourages additional extensions in which more general distributions are included, allowing a better comprehension of mixtures and possible applications to generalized linear models where categorical variables are considered.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Generalized Spatio-temporal Regression with PDE Penalization**

Eleonora Arnone, Elia Cunial, and Laura M. Sangalli

**Abstract** We develop a novel generalised linear model for the analysis of data distributed over space and time. The model involves a nonparametric term $f$, a smooth function over space and time. The estimation is carried out by minimizing an appropriate penalized negative log-likelihood functional, with a roughness penalty on $f$ that involves space and time differential operators in a separable fashion, or an evolution partial differential equation. The model can include covariate information in a semiparametric setting. The functional is discretized by means of finite elements in space, and B-splines or finite differences in time. Thanks to the use of finite elements, the proposed method can efficiently model data sampled over irregularly shaped spatial domains with complicated boundaries. To illustrate the proposed model, we present an application studying criminality in the city of Portland from 2015 to 2020.

**Keywords:** functional data analysis, spatial data analysis, semiparametric regression with roughness penalty

Eleonora Arnone ()

Elia Cunial

Laura M. Sangalli

© The Author(s) 2023 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_4

Dipartimento di Scienze Statistiche, Università di Padova, Via Cesare Battisti, 241, 35121 Padova, Italy, e-mail: eleonora.arnone@unipd.it

Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy, e-mail: elia.cunial@mail.polimi.it

Dipartimento di Matematica, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy, e-mail: laura.sangalli@polimi.it

# **1 Introduction**

In this work we develop a novel generalised linear model for the analysis of data distributed over space and time. Let $Y$ be a real-valued variable of interest, and $\mathbf{W}$ a vector of $q$ covariates, observed at $n$ spatio-temporal locations $\{(\mathbf{p}_i, t_i)\}_{i=1,\dots,n} \in \Omega\times T$, where $\Omega\subset\mathbb{R}^2$ is a bounded spatial domain, and $T\subset\mathbb{R}$ a temporal interval. We assume that the expected value of $Y$, conditional on the covariates and the location of observation, can be modeled as:

$$g(\mathbb{E}[Y\mid \mathbf{W},\mathbf{p},t]) = \mathbf{W}^\top \boldsymbol{\beta} + f(\mathbf{p},t)$$

where $g$ is a known monotone link function, chosen on the basis of the stochastic nature of $Y$, $\boldsymbol{\beta}\in\mathbb{R}^q$ is an unknown vector of regression coefficients, and $f:\Omega\times T\to\mathbb{R}$ is an unknown deterministic function, which captures the spatio-temporal variation of the phenomenon under study. Starting from the values $\{y_i,\mathbf{w}_i\}_{i=1,\dots,n}$ of the observed response variable and covariates, we estimate $\boldsymbol{\beta}$ and $f$ in a semiparametric fashion. In particular, following the approach in [9], which considers a similar problem for data scattered over space only, we minimize the functional

$$\ell\left(\{y_i, \mathbf{w}_i, \mathbf{p}_i, t_i\}_{i=1,\dots,n}; \boldsymbol{\beta}, f\right) + \mathcal{P}(f)$$

where $\ell$ is the appropriate negative log-likelihood, and $\mathcal{P}(f)$ is a penalty that forces $f$ to be a regular function.

Similarly to the regression methods in [1, 2, 3, 4, 5, 7, 8], the roughness penalty on $f$, $\mathcal{P}(f)$, involves some partial differential operators. In particular, our aim is to extend Spatio-Temporal regression with partial differential equation regularization (ST-PDE), developed in [2, 3, 4], to generalized linear model settings, further broadening the class of regression models with PDE regularization reviewed in [6]. Hence, like ST-PDE, the proposed generalized linear model has a roughness penalty that involves a second order linear differential operator $L$ applied to $f$. Specifically, as in [4], we may consider the penalty

$$\mathcal{P}(f) = \lambda_T \int_{\Omega} \int_0^T \left(\frac{\partial^2 f}{\partial t^2}\right)^2 + \lambda_S \int_{\Omega} \int_0^T \left(Lf\right)^2,$$

where the first term accounts for the regularity of the function in time, while the second accounts for the regularity of the function in space; the importance of each term is controlled by the two smoothing parameters $\lambda_T$ and $\lambda_S$. Alternatively, as in [2], we may consider a single penalty that accounts for both the spatial and the temporal regularity:

$$\mathcal{P}(f) = \lambda \int_{\Omega} \int_0^T \left( \frac{\partial f}{\partial t} + Lf - u \right)^2 .$$

Differently from the models in [2, 3, 4], the estimation functional to be minimized is not quadratic. This poses increased difficulties from the computational point of view. The minimization is performed via a functional version of the penalized iteratively reweighted least squares algorithm.
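The penalized iteratively reweighted least squares iteration can be sketched in a simplified numpy form. The sketch below is only a finite-dimensional toy analogue of the functional algorithm: the spatio-temporal basis expansion is replaced by an ordinary design matrix `X`, and the roughness penalty by a generic symmetric matrix `P` (here the identity), for a Poisson response with log link.

```python
import numpy as np

def pirls_poisson(X, y, P, lam, n_iter=50):
    """Penalized IRLS for a Poisson log-linear model:
    minimizes -loglik(theta) + lam * theta' P theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                     # working response
        A = X.T @ (mu[:, None] * X) + 2 * lam * P   # penalized weighted normal equations
        theta = np.linalg.solve(A, X.T @ (mu * z))
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(0.5 + 0.3 * X[:, 1]))
P, lam = np.eye(2), 0.1
theta = pirls_poisson(X, y, P, lam)
```

At convergence the estimate satisfies the penalized score equation $X^\top(y-\mu) = 2\lambda P\theta$, which is the finite-dimensional counterpart of the stationarity condition of the penalized functional.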

The estimation problem is appropriately discretized. In particular, in time, the discretization involves either cubic B-splines, for the two-penalty case, or finite differences, when the single penalty is employed. The discretization in space is performed via finite elements, on a triangulation of the spatial domain of interest. This makes it possible to appropriately handle spatial domains with complicated boundaries, such as the one considered in the following section, concerning the study of criminality data over the city of Portland.

# **2 Application to Criminality Data**

This section describes the Portland criminality data, which will be used to illustrate the proposed methodology. We present a Poisson model to count the crimes in the city, and study their evolution from April 2015 to November 2020. In addition, we consider as a covariate the population of the city neighborhoods. The crime data are publicly available on the website of the city's Police Bureau<sup>1</sup>.

The crime counts are aggregated by trimester and at the neighborhood level. Figure 1 shows the city neighborhoods, each colored according to its total population. The bottom part of the same figure shows the temporal evolution of the crimes in each neighborhood: each curve corresponds to a neighborhood and is colored according to the neighborhood's population. In both panels, the three neighborhoods with the highest number of crimes are indicated by the numbers 1, 2 and 3. The figure highlights some correlation between neighborhood population and number of crimes. However, criminality is not fully explained by population. For instance, neighborhoods 1 and 3 present a high number of crimes with a moderate population. This raises the interest towards a semiparametric generalized linear model, such as the one introduced in Section 1, with a nonparametric term accounting for the spatio-temporal variability of the phenomenon that cannot be explained by population or other census quantities. Figure 2 shows the same data for four different trimesters on the Portland map. As already pointed out, the three areas with the highest number of crimes are in the city center and in the Hazelwood neighborhood, in the east part of the city.

From Figures 1 and 2 we can see that the shape of the domain is complicated; the city is indeed crossed by a river, with few bridges connecting the two parts, most of them located downtown. Therefore, neighborhoods on opposite sides of the river and far from the center, where most bridges are located, are close in Euclidean distance, but far apart in reality. This particular morphology influences the phenomenon under study; for example, in the north of the city, the east side of the river is characterized by a higher number of crimes than the west side. Due to these characteristics of the data and the domain, it is of crucial importance to take into account the shape

<sup>1</sup> Police Bureau crime data: https://www.portlandoregon.gov/police/71978

**Fig. 1** Top: the city of Portland divided into neighborhoods, each neighborhood colored according to the total population. Bottom: the total crimes over time for each neighborhood; each curve corresponds to a neighborhood and is colored according to the neighborhood's population. The three neighborhoods with the highest number of crimes are indicated by numbers 1, 2 and 3.

**Fig. 2** Total crime counts per neighborhood per trimester; green indicates lower number of crimes, red indicates a higher number of crimes.

of the domain during the estimation process. For this reason, estimation based on classical semiparametric models, such as those based on thin-plate splines, would give poor results, while the proposed method is particularly well suited, being able to comply with the nontrivial shape of the domain.

# **References**



# **A New Regression Model for the Analysis of Microbiome Data**

Roberto Ascari and Sonia Migliorati

**Abstract** Human microbiome data are becoming extremely common in biomedical research due to their relevant connections with different types of diseases. A widespread discrete distribution for analyzing this kind of data is the Dirichlet-multinomial. Despite its popularity, this distribution often fails in modeling microbiome data due to the strict parameterization imposed on its covariance matrix. The aim of this work is to propose a new distribution for analyzing microbiome data and to define a regression model based on it. The new distribution can be expressed as a structured finite mixture model with Dirichlet-multinomial components. We illustrate how this mixture structure can improve a microbiome data analysis by clustering patients into "enterotypes", a classification based on the bacteriological composition of the gut microbiota. The comparison between the two models is performed through an application to a real gut microbiome dataset.

**Keywords:** count data, Bayesian inference, mixture model, multivariate regression

# **1 Introduction**

The human microbiome is defined as the set of genes associated with the microbiota, i.e. the microbial community living in the human body, including bacteria, viruses and some unicellular eukaryotes [1, 8]. The mutualistic relationship between microbiota and human beings is often beneficial, though it can sometimes

Sonia Migliorati

© The Author(s) 2023

Roberto Ascari ()

Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, Milan, Italy, e-mail: roberto.ascari@unimib.it

Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, Milan, Italy, e-mail: sonia.migliorati@unimib.it

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_5

become detrimental for several health outcomes. For example, changes in the gut microbiome composition can be associated with diabetes, cardiovascular disease, obesity, autoimmune diseases, anxiety and many other factors impacting human health [1, 5, 12, 14]. Moreover, the development of next-generation sequencing technologies nowadays makes it possible to survey the microbiome composition using direct DNA sequencing of either marker genes or the whole metagenome, without the need for isolation and culturing. These are the two main reasons for the recent explosion of research on the microbiome, and they highlight the importance of understanding the association between microbiome composition and biological and environmental covariates.

A widespread distribution for handling microbiome data is the Dirichlet-multinomial (DM) (e.g., see [4, 16]), a generalization of the multinomial distribution obtained by assuming that, instead of being fixed, the underlying taxa proportions come from a Dirichlet distribution. This makes it possible to model overdispersed count data, that is, data showing a variance much larger than that predicted by the multinomial model. Despite its popularity, the DM distribution is often inadequate to model real microbiome datasets due to the strict covariance structure imposed by its parameterization, which hinders the description of co-occurrence and co-exclusion relationships between microbial taxa.

The aim of this work is to propose a new distribution that generalizes the DM, namely the flexible Dirichlet-multinomial (FDM), and a regression model based on it. The new model provides a better fit to real microbiome data, while preserving a clear interpretation of its parameters. Moreover, being a finite mixture with DM components, it makes it possible to account for the latent group structure of the data, and thus to identify clusters sharing similar biota compositions.

# **2 Statistical Models for Microbiome Data**

In this section, we define a new distribution for multivariate counts and a regression model based on it, which allows microbiome abundances to be linked with covariates. Note that, once the DNA sequence reads have been aligned to the reference microbial genomes, the abundances of microbial taxa can be quantified. Thus, microbiome data represent the count composition of $D$ bacterial taxa in a specific biological sample, and a microbiome dataset is a sequence of $D$-dimensional vectors $\mathbf{Y}_1, \mathbf{Y}_2, \dots, \mathbf{Y}_N$, where $Y_{ir}$ counts the number of occurrences of taxon $r$ in the $i$-th sample ($i=1,\dots,N$ and $r=1,\dots,D$). Since the $i$-th sample contains a number $n_i$ of bacteria, microbiome observations are subject to a fixed-sum constraint, that is $\sum_{r=1}^{D} Y_{ir} = n_i$.

#### **2.1 Count Distributions**

Following a compound approach, we assume that $\mathbf{Y}\mid\boldsymbol{\Pi}=\boldsymbol{\pi} \sim \text{Multinomial}(n,\boldsymbol{\pi})$, and we consider suitable distributions for the vector of probabilities $\boldsymbol{\Pi}\in\mathcal{S}^D$. The set $\mathcal{S}^D = \{\boldsymbol{\pi}=(\pi_1,\dots,\pi_D)^\top : \pi_r>0, \sum_{r=1}^D \pi_r = 1\}$ is the $D$-part simplex and it is the proper support of continuous compositional vectors. A distribution for $\mathbf{Y}$ is obtained by marginalizing the joint distribution of $(\mathbf{Y},\boldsymbol{\Pi})$. A common choice for this distribution is the mean-precision parameterized Dirichlet, whose probability density function (p.d.f.) is

$$f_{\text{Dir}}(\boldsymbol{\pi};\boldsymbol{\mu},\alpha^{+}) = \frac{\Gamma(\alpha^{+})}{\prod_{r=1}^{D}\Gamma(\alpha^{+}\mu_{r})} \prod_{r=1}^{D} \pi_{r}^{\alpha^{+}\mu_{r}-1},$$

where $\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{\Pi}] \in \mathcal{S}^D$, and $\alpha^+ > 0$ is a precision parameter. Compounding the multinomial distribution with the Dirichlet leads to the DM distribution, widely used in microbiome data analysis, whose probability mass function (p.m.f.) is

$$f_{\rm DM}(\mathbf{y};n,\boldsymbol{\mu},\alpha^{+}) = \frac{n!\,\Gamma(\alpha^{+})}{\Gamma(\alpha^{+}+n)} \prod_{r=1}^{D} \frac{\Gamma(\alpha^{+}\mu_{r}+y_{r})}{y_{r}!\,\Gamma(\alpha^{+}\mu_{r})}.$$

The mean vector of a DM distribution is $\mathbb{E}[\mathbf{Y}] = n\boldsymbol{\mu}$, so that the parameter $\boldsymbol{\mu} = \mathbb{E}[\mathbf{Y}]/n$ can be thought of as a scaled mean vector. Moreover, its covariance matrix is

$$\mathbb{V}\left[\mathbf{Y}\right] = n\mathbf{M} \left[1 + \frac{n-1}{\alpha^{+} + 1}\right],\tag{1}$$

where $\mathbf{M} = \text{Diag}(\boldsymbol{\mu}) - \boldsymbol{\mu}\boldsymbol{\mu}^\top$. Equation (1) highlights how the additional parameter $\alpha^+$ increases the flexibility of the variability structure with respect to the standard multinomial distribution.
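Equation (1) is easy to verify numerically for a small example. The sketch below (assuming only `numpy`/`scipy`, with illustrative parameter values) implements the DM log-p.m.f. via log-gamma functions and compares the exact moments, obtained by enumerating all count vectors for $D=3$ and $n=4$, with the closed forms.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def dm_logpmf(y, n, mu, alpha):
    """Log p.m.f. of the DM distribution with scaled mean mu and precision alpha."""
    a = alpha * mu
    return (gammaln(n + 1) + gammaln(alpha) - gammaln(alpha + n)
            + np.sum(gammaln(a + y) - gammaln(y + 1) - gammaln(a)))

n, mu, alpha = 4, np.array([0.5, 0.3, 0.2]), 2.0
# enumerate the whole support {y : sum(y) = n} for D = 3
ys = [np.array(v) for v in product(range(n + 1), repeat=3) if sum(v) == n]
pr = np.array([np.exp(dm_logpmf(y, n, mu, alpha)) for y in ys])

# exact moments by enumeration, versus E[Y] = n*mu and Eq. (1)
mean = sum(p * y for p, y in zip(pr, ys))
cov = sum(p * np.outer(y - n * mu, y - n * mu) for p, y in zip(pr, ys))
M = np.diag(mu) - np.outer(mu, mu)
cov_eq1 = n * M * (1 + (n - 1) / (alpha + 1))
```

The enumerated covariance matches Equation (1) entry by entry, including the overdispersion factor $(\alpha^+ + n)/(\alpha^+ + 1)$ relative to the multinomial.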

We propose to take advantage of an alternative sound distribution defined on $\mathcal{S}^D$, namely the flexible Dirichlet (FD) [7, 9]. The latter is a structured finite mixture with Dirichlet components, entailing some constraints among the components' parameters to ensure model identifiability. Thanks to its mixture structure, the p.d.f. of an FD-distributed random vector can be expressed as

$$f_{\rm FD}(\boldsymbol{\pi}; \boldsymbol{\mu}, \alpha^{+}, w, \mathbf{p}) = \sum_{j=1}^{D} p_{j} f_{\rm Dir} \left(\boldsymbol{\pi}; \boldsymbol{\lambda}_{j}, \frac{\alpha^{+}}{1 - w}\right),\tag{2}$$

where

$$
\boldsymbol{\lambda}_j = \boldsymbol{\mu} - w\mathbf{p} + w\mathbf{e}_j \tag{3}
$$

is the mean vector of the $j$-th component, $\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{\Pi}] \in \mathcal{S}^D$, $\alpha^+ > 0$, $\mathbf{p} \in \mathcal{S}^D$, $0 < w < \min\left\{1, \min_{r\in\{1,\dots,D\}} \left\{\frac{\mu_r}{p_r}\right\}\right\}$, and $\mathbf{e}_j$ is a vector with all elements equal to zero except for the $j$-th, which is equal to one.

Equation (2) points out that the Dirichlet components have different mean vectors and a common precision parameter, the latter being determined by $\alpha^+$ and $w$. In particular, inspecting Equation (3), it is easy to observe that any two vectors $\boldsymbol{\lambda}_r$ and $\boldsymbol{\lambda}_h$, $r\neq h$, coincide in all elements except the $r$-th and the $h$-th.

If $\boldsymbol{\Pi}$ is assumed to be FD distributed, a new discrete distribution for count vectors can be defined, which we shall call the flexible Dirichlet-multinomial (FDM). The p.m.f. of the FDM can be expressed as

$$\begin{split} f_{\text{FDM}}(\mathbf{y};n,\boldsymbol{\mu},\alpha^{+},\mathbf{p},w) &= \sum_{j=1}^{D} p_{j} f_{\text{DM}} \left( \mathbf{y};n,\boldsymbol{\lambda}_{j},\frac{\alpha^{+}}{1-w} \right) \\ &= \sum_{j=1}^{D} p_{j} \frac{n!\,\Gamma(\frac{\alpha^{+}}{1-w})}{\Gamma(\frac{\alpha^{+}}{1-w}+n)} \prod_{r=1}^{D} \frac{\Gamma(\frac{\alpha^{+}}{1-w}\lambda_{jr}+y_{r})}{y_{r}!\,\Gamma(\frac{\alpha^{+}}{1-w}\lambda_{jr})}, \end{split} \tag{4}$$

where $\boldsymbol{\lambda}_j$ is defined in Equation (3). Interestingly, it is possible to recognize the flexible beta-binomial (FBB) [3] distribution as a special case of the FDM. The FBB is a generalization of the binomial distribution that is successful in dealing with overdispersion. Moreover, note that when $\mathbf{p} = \boldsymbol{\mu}$ and $w = 1/(\alpha^+ + 1)$ the DM distribution is recovered.
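The mixture representation in Equation (4) can be sketched directly. The toy check below (illustrative parameter values, $D=3$, $n=4$) verifies that the resulting p.m.f. is properly normalized and that the special case $\mathbf{p}=\boldsymbol{\mu}$, $w=1/(\alpha^++1)$ indeed recovers the DM p.m.f.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def dm_logpmf(y, n, mu, alpha):
    """Log p.m.f. of the DM distribution."""
    a = alpha * mu
    return (gammaln(n + 1) + gammaln(alpha) - gammaln(alpha + n)
            + np.sum(gammaln(a + y) - gammaln(y + 1) - gammaln(a)))

def fdm_pmf(y, n, mu, alpha, p, w):
    """Eq. (4): mixture of DM components with means lambda_j, precision alpha/(1-w)."""
    D, phi = len(mu), alpha / (1 - w)
    return sum(p[j] * np.exp(dm_logpmf(y, n, mu - w * p + w * np.eye(D)[j], phi))
               for j in range(D))

n, mu, alpha = 4, np.array([0.5, 0.3, 0.2]), 2.0
ys = [np.array(v) for v in product(range(n + 1), repeat=3) if sum(v) == n]

# normalization over the whole support (p and w chosen to satisfy the constraint on w)
total = sum(fdm_pmf(y, n, mu, alpha, np.array([0.3, 0.3, 0.4]), 0.25) for y in ys)

# special case p = mu, w = 1/(alpha+1): the FDM collapses to the DM
w0 = 1 / (alpha + 1)
fdm_vals = np.array([fdm_pmf(y, n, mu, alpha, mu, w0) for y in ys])
dm_vals = np.array([np.exp(dm_logpmf(y, n, mu, alpha)) for y in ys])
```

In the special case each component becomes a DM with Dirichlet parameters $\alpha^+\boldsymbol{\mu} + \mathbf{e}_j$ and mixing weight $\mu_j$, and the mixture collapses to $\text{DM}(n,\boldsymbol{\mu},\alpha^+)$.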

Equation (4) shows that the FDM is a finite mixture with DM components displaying a common precision parameter and different scaled mean vectors $\boldsymbol{\lambda}_j$, $j=1,\dots,D$. The overall mean vector and the covariance matrix of the FDM can be expressed as

$$\begin{aligned} \mathbb{E}\left[\mathbf{Y}\right] &= n\boldsymbol{\mu}, \\ \mathbb{V}\left[\mathbf{Y}\right] &= n\mathbf{M}\left[1 + \frac{n-1}{\phi+1}\right] + n\frac{(n-1)\phi w^2}{\phi+1}\mathbf{P}, \end{aligned} \tag{5}$$

where $\mathbf{M} = \text{Diag}(\boldsymbol{\mu}) - \boldsymbol{\mu}\boldsymbol{\mu}^\top$, $\mathbf{P} = \text{Diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top$, and $\phi = \alpha^+/(1-w)$ is the common precision parameter of the DM components. A comparison between Equations (5) and (1) points out that the covariance matrix of the FDM distribution is an easily interpretable extension of the DM's covariance matrix. Indeed, it is composed of two terms, the first coinciding with the DM's covariance matrix, whereas the second depends on the mixture structure of the FDM model. In particular, the FDM covariance matrix has $D$ additional parameters with respect to the DM, namely the $D-1$ distinct elements of the vector of mixing weights $\mathbf{p}$, and the parameter $w$, which controls the distance among the components' barycenters [7]. This is the key element explaining the better ability of the FDM to model a wide range of scenarios.
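The decomposition in Equation (5) can also be cross-checked against the law of total variance applied to the mixture representation, using only matrix algebra (illustrative parameter values): the within-component term is the common DM covariance of Equation (1) with precision $\phi$, and the between-component term comes from the shifts of the component means $n\boldsymbol{\lambda}_j$ around $n\boldsymbol{\mu}$.

```python
import numpy as np

n, D = 7, 3
mu = np.array([0.5, 0.3, 0.2])
p = np.array([0.3, 0.3, 0.4])
alpha, w = 2.0, 0.25
phi = alpha / (1 - w)

# closed form of Eq. (5)
M = np.diag(mu) - np.outer(mu, mu)
P = np.diag(p) - np.outer(p, p)
V_eq5 = n * M * (1 + (n - 1) / (phi + 1)) + n * (n - 1) * phi * w**2 / (phi + 1) * P

# law of total (co)variance over the D mixture components of Eq. (4)
lams = [mu - w * p + w * np.eye(D)[j] for j in range(D)]
within = sum(p[j] * n * (np.diag(l) - np.outer(l, l)) * (1 + (n - 1) / (phi + 1))
             for j, l in enumerate(lams))
between = sum(p[j] * np.outer(n * (l - mu), n * (l - mu)) for j, l in enumerate(lams))
V_mix = within + between
```

The two computations agree entry by entry, confirming the algebra behind Equation (5).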

#### **2.2 Regression Models**

With the aim of performing a regression analysis, let $\mathbf{Y} = (\mathbf{Y}_1,\dots,\mathbf{Y}_N)^\top$ be a set of independent multivariate responses collected on a sample of $N$ subjects/units. For the $i$-th subject, $\mathbf{Y}_i$ counts the number of times each of the $D$ possible taxa occurred among $n_i$ trials, and $\mathbf{x}_i$ is a $(K+1)$-dimensional vector of covariates.

A parameterization of the FDM useful in a regression perspective is the one based on $\boldsymbol{\mu}$, $\mathbf{p}$, $\alpha^+$, and $\tilde{w}$, where

$$
\tilde{w} = \frac{w}{\min\left\{1, \min_{r} \left\{\frac{\mu_r}{p_r}\right\}\right\}} \in (0, 1). \tag{6}
$$

We can define the FDM regression (FDMReg) and the DM regression (DMReg) models by assuming that $\mathbf{Y}_i$ follows an $\text{FDM}(n_i, \boldsymbol{\mu}_i, \alpha^+, \mathbf{p}, \tilde{w})$ or a $\text{DM}(n_i, \boldsymbol{\mu}_i, \alpha^+)$ distribution, respectively. Even if the FDM and DM distributions do not belong to the dispersion-exponential family, we can follow a GLM-type approach [6], linking the parameter $\boldsymbol{\mu}_i$ to the linear predictor through a proper link function, such as the multinomial logit link function, that is

$$\log\left(\frac{\mu_{ir}}{\mu_{iD}}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}_r, \qquad r = 1, \ldots, D - 1,\tag{7}$$

where $\boldsymbol{\beta}_r = (\beta_{r0}, \beta_{r1}, \dots, \beta_{rK})^\top$ is a vector of regression coefficients for the $r$-th element of $\boldsymbol{\mu}_i$. Note that the last category has been conventionally chosen as the baseline category, thus $\boldsymbol{\beta}_D = \mathbf{0}$.
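In practice, the inverse of the link (7) is a softmax with the baseline category's linear predictor fixed at zero. A minimal numpy sketch, with a hypothetical design matrix and coefficient matrix chosen only for illustration:

```python
import numpy as np

def inv_mlogit(X, B):
    """Inverse of the link (7): rows of the result are the vectors mu_i.
    B has shape (K+1, D-1); the baseline category D has coefficients zero."""
    eta = np.column_stack([X @ B, np.zeros(X.shape[0])])   # append eta_iD = 0
    e = np.exp(eta - eta.max(axis=1, keepdims=True))       # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(5), np.linspace(-1.0, 1.0, 5)])  # intercept + 1 covariate
B = np.array([[0.2, -0.1],
              [1.0,  0.5]])                                   # (K+1) x (D-1), hypothetical
mu = inv_mlogit(X, B)
```

Each row of `mu` lies on the simplex, and the log-ratios $\log(\mu_{ir}/\mu_{iD})$ reproduce the linear predictors $\mathbf{x}_i^\top\boldsymbol{\beta}_r$ exactly.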

The parameterization of the FDMReg based on $\boldsymbol{\mu}$, $\mathbf{p}$, $\alpha^+$, and $\tilde{w}$ defines a variation-independent parameter space, meaning that no constraints exist among the parameters. In a Bayesian framework, this allows us to assume prior independence and, consequently, to specify a prior distribution for each parameter separately. In order to induce minimum impact on the posterior distribution, we select weakly informative priors: (i) $\boldsymbol{\beta}_r \sim N_{K+1}(\mathbf{0}, \Sigma)$, where $\mathbf{0}$ is the $(K+1)$-vector of zeros and $\Sigma$ is a diagonal matrix with 'large' variance values, (ii) $\alpha^+ \sim \text{Gamma}(g_1, g_2)$ for small values of $g_1$ and $g_2$, (iii) $\tilde{w} \sim \text{Unif}(0,1)$, and (iv) a uniform prior on the simplex for $\mathbf{p}$.

Inferential issues are dealt with by a Bayesian approach through a Hamiltonian Monte Carlo (HMC) algorithm [10], which is a popular generalization of the Metropolis-Hastings algorithm. The Stan modeling language [13] allows implementing an HMC method to obtain a simulated sample from the posterior distribution.

To compare the fit of the models we use the Watanabe-Akaike information criterion (WAIC) [15, 17], a fully Bayesian criterion that balances goodness-of-fit against model complexity: lower values of WAIC indicate a better fit.
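Given a matrix of pointwise log-likelihood values evaluated over posterior draws, WAIC can be computed in a few lines. The sketch below uses the common variance-based estimate of the effective number of parameters (one of the variants discussed in [15, 17]); the scale $-2(\text{lppd} - p_{\text{WAIC}})$ matches the "lower is better" convention used here.

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S posterior draws) x (N observations) matrix of
    pointwise log-likelihood values."""
    m = log_lik.max(axis=0)
    # log pointwise predictive density, via a stabilized log-mean-exp
    lppd = np.sum(m + np.log(np.mean(np.exp(log_lik - m), axis=0)))
    # effective number of parameters: sum of posterior variances
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

For instance, a degenerate posterior with constant pointwise log-likelihood $-1.3$ for $N=10$ observations has $p_{\text{WAIC}}=0$ and WAIC $= -2\times(-13) = 26$.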

# **3 A Gut Microbiome Application**

In this section, we fit the DM and the FDM regression models to a microbiome dataset analyzed by Xia et al. [19] and previously proposed by Wu et al. [18]. They collected gut microbiome data on 98 healthy volunteers. In particular, the counts of three bacterial genera were recorded, namely Bacteroides, Prevotella, and Ruminococcus. Arumugam et al. [2] used these three bacteria to define three groups they called enterotypes. These enterotypes provide information about the human body's ability to produce vitamins.

Wu et al. analyzed the same dataset by conducting a cluster analysis via the 'partitioning around medoids' (PAM) approach. They detected only two of the three enterotypes defined in the work by Arumugam et al. Moreover, these two clusters have quite different frequencies: 86 of the 98 samples were allocated to the first enterotype, whereas only 12 samples were clustered into enterotype 2. This is due to the small number of subjects with a high abundance of Prevotella (i.e., only 36 samples showed a Prevotella count greater than 0).

Besides the bacterial data, we also consider $K=9$ covariates, representing information on micro-nutrients in the habitual long-term diet, collected using a food frequency questionnaire. These 9 additional variables were selected by Xia et al. using an $\ell_1$-penalized regression approach.

Table 1 shows the posterior mean and 95% credible set (CS) of each parameter involved in the DMReg and the FDMReg models. Though the significant covariates are the same across the models, the FDMReg shows a lower WAIC, thus being the better model in terms of fit. This is due to the additional set of parameters involved in the mixture structure, which help provide information on this dataset.

The mixture structure of the FDMReg model can be exploited to cluster observations into groups through a model-based approach. More specifically, each observation can be allocated to the mixture component that most likely generated it. Indeed, note that the mixing weight estimates (0.637, 0.357 and 0.006, from Table 1) confirm the presence of two of the three enterotypes defined by Arumugam et al. [2]. To further illustrate the benefits of the FDMReg model in a microbiome data analysis, we compare the clustering profile obtained by the FDMReg model with the one obtained with the PAM approach used by Wu et al. In particular, Table 2 summarizes this comparison in a confusion matrix. Although the clustering generated by the FDMReg is based on distributional assumptions (i.e., the response is FDM distributed), it agrees with the one obtained by the PAM algorithm on 84% of the observations. This result is obtained using the covariates selected by Xia et al. in a logistic normal multinomial regression model context. Clearly, the results could be improved by developing an ad hoc variable selection procedure for the FDMReg model. The main advantage of considering the FDMReg (that is, a model-based clustering approach) is that, besides the clustering of the data points, it also provides information on the detected clusters (e.g., their size and a measure of their distance) and on the relationship between the response and the set of covariates. This additional information may increase the insight we can gain from


**Table 1** Posterior mean and 95% CS for the parameters of the DMReg and FDMReg models. Regression coefficients in bold are related to 95% CS's not containing the zero value.

the data. Further improvements could be obtained by considering an even more flexible distribution for $\boldsymbol{\Pi}$, that is, the extended flexible Dirichlet [11].

**Table 2** Confusion matrix for clustering based on the FDMReg model compared to the PAM algorithm.

| PAM \ FDMReg | 1 | 2 |
|---|---|---|
| 1 | 70 | 16 |
| 2 | 0 | 12 |

# **References**

1. Amato, K.: An introduction to microbiome analysis for human biology applications. Am. J. Hum. Biol. **29** (2017)



# **Stability of Mixed-type Cluster Partitions for Determination of the Number of Clusters**

Rabea Aschenbruck, Gero Szepannek, and Adalbert F. X. Wilhelm

**Abstract** For partitioning clustering methods, the number of clusters has to be determined in advance. One approach to dealing with this issue is the use of stability indices. In this paper, several stability-based validation methods are investigated with regard to the *k*-prototypes algorithm for mixed-type data. The stability-based approaches are compared to common validation indices in a comprehensive simulation study in order to analyze their preferability as a function of the underlying data generating process.

**Keywords:** cluster stability, cluster validation, mixed-type data

# **1 Introduction**

In cluster analysis practice, it is common to work with mixed-type data (i.e., numerical and categorical variables), while theoretical research is traditionally often restricted to numerical data. A comprehensive overview of cluster analysis for mixed-type data is given in [1]. To cluster such mixed-type data, a popular approach is the *k*-prototypes algorithm, an extension of the popular *k*-means algorithm, as proposed in [2] and implemented in [3].

As for all partitioning clustering methods, the number of clusters has to be specified in advance. In the past, several validation methods have been identified for the

Rabea Aschenbruck ()

Gero Szepannek

Adalbert F.X. Wilhelm Jacobs University Bremen, Campus Ring 1, 28759 Bremen, Germany, e-mail: A.Wilhelm@jacobs-university.de

© The Author(s) 2023

Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: rabea.aschenbruck@hochschule-stralsund.de

Stralsund University of Applied Sciences, Zur Schwedenschanze 15, 18435 Stralsund, Germany, e-mail: gero.szepannek@hochschule-stralsund.de

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_6

*k*-prototypes algorithm to enable the rating of cluster partitions and the determination of the index-optimal number of clusters. A brief overview is given in Section 2, followed by an examination of the investigated stability indices for clustering mixed-type data<sup>1</sup>. In Section 3, a simulation study is conducted in order to compare the performance of the stability indices as well as a newly proposed adjustment, and additionally to rate their performance with respect to internal validation indices. Finally, a summary, which does not establish a general superiority of the stability-based approaches over internal validation indices, and an outlook are given in Section 4.

# **2 Stability of Cluster Partitions**

The assessment of cluster quality can be used to compare clusters resulting from different methods or from the same method with different input parameters, e.g., a different number of clusters. Especially the latter was already an important issue in partitioning clustering many decades ago [5], and much work has been done on this subject since. Hennig [6] points out that nowadays some literature uses the term *cluster validation* exclusively for methods that decide on the optimal number of clusters, in the following called *internal validation*. An overview of internal validation indices is given, e.g., in [7] or [8]. In [9], a set of internal cluster validation indices for mixed-type data to determine the number of clusters for the *k*-prototypes algorithm was derived and analyzed. In the following, stability indices are presented, before they are compared to each other and additionally to internal validation indices in Section 3. Since cluster stability is a model-agnostic concept, the indices are applicable to any clustering algorithm and not limited to numerical data [10].

A partition $S$ splits the data $Y = \{y_1, \ldots, y_n\}$ into $K$ groups $S_1, \ldots, S_K \subseteq Y$. The focus of this paper is on the evaluation and rating of cluster partitions with so-called stability indices. To calculate these, as discussed by Dolnicar and Leisch [11] or mentioned by Fang and Wang [12], $b \in \{1, \ldots, B\}$ bootstrap samples $Y^b$ (drawn with replacement, see e.g. [13]) are taken from the original data set $Y$. For every bootstrap sample $Y^b$, a cluster partition $S^b = \{S^b_1, \ldots, S^b_{L_b}\}$ is determined. For the validation of the different results of these bootstrap samples, the set $X^b = Y \cap Y^b$ of points from the original data set that are also part of the $b$-th bootstrap sample is used, where $n_b$ is the size of $X^b$. Furthermore, let $C^b = \{S_k \cap X^b \mid k = 1, \ldots, K\}$ and $D^b = \{S^b_l \cap X^b \mid l = 1, \ldots, L_b\}$, with $B^{\star}_C$ being the number of bootstrap samples for which $C^b \neq \emptyset$, and let $n_{S_k}$, $n_{C^b_k}$, $n_{S^b_l}$, and $n_{D^b_l}$, with $k \in \{1, \ldots, K\}$ and $l \in \{1, \ldots, L_b\}$, be the numbers of objects in the cluster groups $S_k$, $C^b_k$, $S^b_l$, and $D^b_l$, respectively.
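The construction of $X^b$ and $C^b$ from a bootstrap sample can be sketched in a few lines of Python (an illustrative reconstruction; the paper's implementation extends the R package clustMixType, and all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
Y = np.arange(n)                              # indices of the original observations
S = {0: {0, 1, 2, 3, 4}, 1: {5, 6, 7, 8, 9}}  # a partition S of Y into K = 2 groups

Yb = rng.choice(Y, size=n, replace=True)      # one bootstrap sample Y^b (with replacement)
Xb = set(Y) & set(Yb)                         # points of Y that also appear in Y^b
nb = len(Xb)                                  # its size n_b

# C^b: the original partition restricted to X^b; D^b would arise analogously
# from re-clustering the bootstrap sample Y^b and restricting it to X^b.
Cb = {k: Sk & Xb for k, Sk in S.items()}
```

Since $S$ partitions $Y$, the groups of $C^b$ together cover exactly $X^b$.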

In 2002, Ben-Hur et al. [14] presented stability-based methods which can be used to determine the optimal number of clusters. In their work, the basis for the calculation of the stability indices is a binary matrix $P^{C^b}$, which represents the cluster partition $C^b$ in the following way

<sup>1</sup> The mentioned and analyzed stability indices will extend the R package clustMixType [4].

Stability for Determination of the Number of Clusters

$$P\_{ij}^{C^b} = \begin{cases} 1, \text{ if objects } \mathbf{x}\_i^b, \mathbf{x}\_j^b \in X^b \text{ are in the same cluster and } i \neq j, \\ 0, \text{ otherwise.} \end{cases} \tag{1}$$

With $P^{D^b}$ defined analogously, the dot product of the two cluster partitions $C^b$ and $D^b$ is defined as $D(P^{C^b}, P^{D^b}) = \sum_{i,j} P^{C^b}_{ij} P^{D^b}_{ij}$. This leads to a Jaccard-coefficient-based index of the two cluster partitions $C^b$ and $D^b$

$$\text{Stab}\_1(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{D(P^{C^b}, P^{C^b}) + D(P^{D^b}, P^{D^b}) - D(P^{C^b}, P^{D^b})} . \tag{2}$$

Hennig likewise proposed a Jaccard-coefficient-based, so-called local stability measure for every cluster group in a cluster partition [15]. To obtain a single stability value $Stab_{\mathrm{J;cw}}$ for the whole partition, the mean of the cluster-wise values, weighted by the sizes of the cluster groups, is determined. Another stability-based index presented by Ben-Hur et al., based on the simple matching coefficient, is the Rand index [16], defined as

$$Stab\_{\mathbb{R}}(P^{C^b}, P^{D^b}) = 1 - \frac{1}{n^2} \| P^{C^b} - P^{D^b} \|^2. \tag{3}$$

Additionally, they present the stability index based on a similarity measure, which was originally mentioned by Fowlkes and Mallows [17],

$$Stab\_{\rm FM}(P^{C^b}, P^{D^b}) = \frac{D(P^{C^b}, P^{D^b})}{\sqrt{D(P^{C^b}, P^{C^b})D(P^{D^b}, P^{D^b})}}.\tag{4}$$
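Given the co-membership matrices, the indices (2)–(4) take only a few lines each. The following Python sketch is illustrative only (function names are ours, not from the authors' R implementation):

```python
import numpy as np

def comembership(labels):
    """Binary matrix P with P[i, j] = 1 iff objects i and j share a cluster (i != j)."""
    labels = np.asarray(labels)
    P = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(P, 0.0)
    return P

def _dot(PA, PB):
    """Dot product D(PA, PB) = sum_ij PA_ij * PB_ij."""
    return float((PA * PB).sum())

def stab_jaccard(PA, PB):                      # eq. (2)
    d = _dot(PA, PB)
    return d / (_dot(PA, PA) + _dot(PB, PB) - d)

def stab_rand(PA, PB):                         # eq. (3)
    n = PA.shape[0]
    return 1.0 - ((PA - PB) ** 2).sum() / n ** 2

def stab_fm(PA, PB):                           # eq. (4)
    return _dot(PA, PB) / np.sqrt(_dot(PA, PA) * _dot(PB, PB))

# Identical partitions (even after relabelling) reach the maximal value 1.
C = comembership([0, 0, 1, 1])
D = comembership([1, 1, 0, 0])                 # same grouping, different labels
print(stab_jaccard(C, D), stab_fm(C, D))       # 1.0 1.0
```

Because the matrices encode co-membership rather than labels, all three indices are invariant to relabelling of the clusters.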

For the determination of the number of clusters, Ben-Hur et al. proposed analyzing the distribution of index values calculated between pairs of clustered sub-samples, where high pairwise similarities indicate a stable partition. The authors suggest examining the transition from a stable to an unstable clustering state. In the simulation study, this qualitative criterion was numerically approximated by the differences in the areas under these curves. Furthermore, von Luxburg [18] published an approach to obtain cluster partition stability based on the minimal matching distance, where the minimum is taken over all permutations of the $K$ cluster labels. The distances are then simply averaged to obtain $Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$ and, respectively, $Stab_{\mathrm{L}}(P^{C^b}, P^{D^b}) = 1 - Instab_{\mathrm{L}}(P^{C^b}, P^{D^b})$.
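Von Luxburg's minimal matching distance can be sketched by brute force over label permutations, which is feasible for the small numbers of clusters considered here (illustrative code, not the paper's implementation):

```python
from itertools import permutations
import numpy as np

def min_matching_instability(labels_a, labels_b):
    """Minimal matching distance: the smallest misclassification rate over all
    relabellings (permutations of the cluster labels) of the second partition."""
    a = np.asarray(labels_a)
    ks = sorted(set(labels_b))
    best = 1.0
    for perm in permutations(ks):
        mapping = dict(zip(ks, perm))
        relabelled = np.array([mapping[x] for x in labels_b])
        best = min(best, float(np.mean(a != relabelled)))
    return best

# A relabelled copy of the same partition has instability 0 (i.e. stability 1).
print(min_matching_instability([0, 0, 1, 1], [1, 1, 0, 0]))   # 0.0
```

For larger $K$ the minimum over permutations is usually computed with an assignment algorithm instead of enumeration.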

# **3 Simulation Study**

In order to compare the stability indices with each other and, subsequently, with the internal validation indices, a simulation study was conducted. In the following, the setup and execution of this simulation study are briefly presented, starting with the data generation; subsequently, the results are evaluated.

#### **3.1 Data Generation and Execution of Simulation Study**

The simulation study is based on artificial data generated for different scenarios. Table 1 lists the features that define the data scenarios and their corresponding parameter values. Since a full factorial design is used, there are 120 different data settings in the conducted simulation study.<sup>2</sup> The selection of the considered features follows the characteristics of the simulation study in [19] and was extended with respect to the ratio of the variable types as in [20].


**Table 1** Features and the associated feature specifications used to generate the data scenarios.

The clusters of the 200 observations are defined by the feature settings. Each variable can be either *active* or *inactive*. For the numerical variables, *active* means drawing values from the normal distribution $X_1 \sim \mathcal{N}(\mu_1, 1)$, with random $\mu_1 \in \{0, \ldots, 20\}$, and *inactive* means drawing from $X_0 \sim \mathcal{N}(\mu_0, 1)$ with $\mu_0 = 2 q_{1-\frac{v}{2}} - \mu_1$, where $q_\alpha$ is the $\alpha$-quantile of $\mathcal{N}(\mu_1, 1)$ and $v \in \{0.05, 0.1\}$. This results in an overlap of $v$ for the two normal distributions. To achieve an overlap of $v = 0$, the inactive variable is drawn from $\mathcal{N}(\mu_1 - 10, 1)$. Furthermore, each factor variable has two levels, $l_0$ and $l_1$. For an active variable, the probability of drawing level $l_0$ is $v$ and that of $l_1$ is $(1-v)$. For an inactive variable, the probability of $l_0$ is $(1-v)$ and that of $l_1$ is $v$.
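This generation scheme can be sketched in Python (an illustrative reconstruction; the quantile relation $\mu_0 = 2 q_{1-v/2} - \mu_1$ places the two normals symmetrically around $q_{1-v/2}$, so their overlap mass is $v$):

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(1)

def numeric_variable(cluster_active, v=0.1):
    """Active observations ~ N(mu_1, 1); inactive ones ~ N(mu_0, 1) with
    mu_0 = 2*q_{1-v/2} - mu_1, so the two normals overlap with mass v."""
    mu1 = int(rng.integers(0, 21))             # random mu_1 in {0, ..., 20}
    if v == 0:
        mu0 = mu1 - 10                         # paper's convention for zero overlap
    else:
        mu0 = 2 * NormalDist(mu1, 1).inv_cdf(1 - v / 2) - mu1
    means = np.where(cluster_active, mu1, mu0)
    return rng.normal(means, 1.0)

def factor_variable(cluster_active, v=0.1):
    """Two-level factor: active observations draw level l0 with probability v,
    inactive ones with probability 1 - v."""
    p_l0 = np.where(cluster_active, v, 1 - v)
    return np.where(rng.random(len(cluster_active)) < p_l0, "l0", "l1")
```

The distance between the two cluster means is $2 z_{1-v/2}$, e.g. about $3.29$ for $v = 0.1$.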

Below, the code structure of the simulation study is presented. For each of the 120 data scenarios, a repetition of 𝑁 = 10 runs was performed. This should mitigate the influence of the random initialization of the 𝑘*-prototypes* algorithm. For the range of two up to nine cluster groups, the stability indices are determined based on bootstrap samples as suggested in [21]. In order to rank the performance of the stability-based indices, the internal validation indices were also determined on the same data.

#### **Pseudo-Code Simulation Study**

```
for(every data situation){
  for(i in 1:N){ # 10 iterations to mitigate/soften random influences
    data <- create.data(data situation)
    for(q in 2:9){
      output <- kproto(data, k = q, nstarts = 20)
      # stability-based indices determined with the usage of 100 bootstrap samples
      stab_val_method <- stab_kproto(output, B = 100, method)
      int_val_method <- validation_kproto(output, method) # internal validation
    }
    # determine optimal cluster size for every method
    cs_method <- max/min(int_val_method or stab_val_method)
  }
}
```
<sup>2</sup> There is no data scenario with two variables and eight cluster groups. Additionally, if there are two variables, obviously only the 0.5 ratio between factor and numerical variables is possible.

**Fig. 1** The evaluations of the four stability-based cluster indices are presented. There are ten repetitions of rating the data situation for $k$ clusters in the range of two to nine, and the index-optimal number of clusters is highlighted. The parameters of the underlying data structure are nV = 8, fac\_prop = 0.5, overlap = 0.1 and symm = FALSE. The number of clusters nC in the data structure varies row-wise.

#### **3.2 Analysis of the Results**

Figure 1 shows exemplary results of the simulation study for three different data scenarios over the 10 repetitions. Each row of the figure shows a different data scenario and each column one of the four stability-based indices. The first row relates to a data scenario with two clusters (marked by a vertical green line). Each plot shows the examined numbers of clusters and the determined index values for the 10 repetitions. The maximum index value of each repetition is highlighted with a larger dot and marks the index-optimal number of clusters of that repetition. It can be seen that all four indices detected the two clusters in the underlying data structure. Rows two and three show the evaluations of data with cluster partitions of four and eight clusters, respectively. It can be seen that the generated number of clusters is not always rated as index-optimal (for example, with four clusters, two or three clusters were often also evaluated as optimal). Since the results shown here are representative of all scenarios, the four cluster indices and their interpretation were examined in more detail.

In the left part of Figure 2, different transformations of the index values are presented. Besides the standard index values (green line), the numerical approximation of the approach of Ben-Hur et al. mentioned above is also shown (red line). For the Jaccard-based evaluation, the proposed cluster-wise stability determination by Hennig is presented in orange. Additionally, we propose an adjustment of the index values (hereinafter referred to as *new adjust*), similar to [22], to take into account not only the magnitude of the index but also the local slope: The index value scaled with the geometric mean of the changes to the neighbor values is presented in dark green.
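The paper does not spell out the adjustment formula in full, so the following is only one plausible reading: scale each index value by the geometric mean of its differences to the two neighboring values, so that pronounced local peaks are rewarded. The function name and the clamping of negative differences at zero are our assumptions.

```python
import math
import numpy as np

def new_adjust(stab):
    """Hypothetical reconstruction of the 'new adjust' idea: scale each index
    value by the geometric mean of its (clamped) differences to both neighbors."""
    v = np.asarray(stab, dtype=float)
    adj = np.full_like(v, np.nan)              # endpoints have only one neighbor
    for k in range(1, len(v) - 1):
        left = max(v[k] - v[k - 1], 0.0)       # clamping at 0 is our assumption
        right = max(v[k] - v[k + 1], 0.0)
        adj[k] = v[k] * math.sqrt(left * right)
    return adj

# A local peak (here at the third evaluated cluster number) is amplified
# relative to its neighbors, which receive an adjusted value of 0.
vals = [0.80, 0.70, 0.90, 0.60, 0.55]          # index values for k = 2, ..., 6
adj = new_adjust(vals)
```

Under this reading, a monotonically decaying stability curve yields no peak at all, while a curve with a sharp local maximum keeps exactly that maximum.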

**Fig. 2** *Left:* Example of the variations of the index values at an iteration of the data scenario with the parameters nC = 4, nV = 8, fac\_prop = 0.5, overlap = 0.1 and symm = FALSE. *Right:* Proportion of correct determinations, partitioned according to the different number of clusters in the underlying data structure.

Again, for each variation of the indices, the index-optimal value is highlighted. The numerically determined index values according to the approach of Ben-Hur et al. yield no benefit, so it can be concluded that this quantification is not appropriate for the purpose and that further research is required. The cluster-wise stability determination of the Jaccard index also does not seem to improve the determination of the number of clusters to a large extent. In the example in Figure 2, the local slope at four evaluated cluster groups is strengthened by the new adjustment, which leads to a determination of four cluster groups (the generated number of clusters). Since only one iteration of one data scenario is shown on the left, the counts of correctly determined numbers of clusters with respect to the generated number of clusters are shown on the right-hand side of Figure 2. These counts for two, four, and eight clusters in the underlying data structure point out the improvement achieved by the proposed adjustment of the index values. Especially for more than two clusters, the rate of correctly determined numbers of clusters can be increased.

Finally, the internal validation indices were examined comparatively. To analyze the outcome of the simulation study, the determined index-optimal numbers of clusters are shown in Table 2. While the comparison for two clusters in the underlying data shows a slight advantage for the stability-based indices, for eight clusters in particular the preference is in favor of the internal validation indices. To gain a better understanding of the mean success rate of determining the correct number of clusters for each data scenario, Figure 3 further shows the results of a linear regression on the various data parameters. It can be seen that in most cases there is not much difference between the considered methods. The stability-based indices do a better job of determining the number of clusters for data with equally large cluster groups. A larger number of variables leads to a better determination of the number of clusters. The largest variation in the influence on the proportion of correct determinations can be seen for the parameter *number of clusters*: the more cluster groups there are in the underlying data structure, the worse the determination becomes (especially for the stability-based indices and the indices Ptbiserial and Tau).

**Table 2** Determined number of clusters for all data scenarios with nC ∈ {2, 4, 8}, summarized by the stability-based as well as internal validation indices and the evaluated number of clusters.

**Fig. 3** Linear regression coefficients for the parameters of the five data set features; coefficients whose confidence intervals contain 0 are displayed as transparent.

# **4 Conclusion**

The aim of this study was to investigate the determination of the optimal number of clusters based on stability indices. Several variations of analysis methods for stability-based index values were presented and comparatively analyzed in a simulation study. The proposed adjustment of the index values, which takes into account not only their magnitude but also the local slope, was able to improve the standard stability indices, especially for a smaller number of clusters. The simulation study did not show any general superiority of stability-based approaches over internal validation indices.

In the future, the various methods of analyzing the stability-based index values should be examined in more detail, e.g., taking into account the Adjusted Rand Index. For this purpose, further research may address the characteristics of the evaluated curves more precisely, or further extend the approach of Ben-Hur et al. as a quantitative determination method, which has not been done yet.

# **References**

- https://CRAN.R-project.org/package=clustMixType

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Review on Official Survey Item Classification for Mixed-Mode Effects Adjustment**

Afshin Ashofteh and Pedro Campos

**Abstract** The COVID-19 pandemic has had a direct impact on the development, production, and dissemination of official statistics. This situation led National Statistics Institutes (NSIs) to make methodological and practical choices for survey collection without direct contact by interviewing staff (i.e. remote survey data collection). Mixing telephone interviews (CATI) and computer-assisted web interviewing (CAWI) with direct-contact interviewing constitutes a new way of collecting data at the time of the COVID-19 crisis. This paper presents a literature review that summarizes the role of statistical classification and design weights in controlling coverage errors and non-response bias in mixed-mode questionnaire design. We identified 289 research articles with a computerized search over two databases, Scopus and Web of Science. It was found that, although employing mixed-mode surveys can be considered a substitute for traditional face-to-face interviews (CAPI), proper statistical classification of survey items and responders is important to control nonresponse rates and coverage error risk.

**Keywords:** mixed-mode official surveys, item classification, weighting methods, clustering, measurement error

Afshin Ashofteh ()
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de Informação) and NOVA Information Management School (NOVA IMS) and MagIC, Universidade Nova de Lisboa, Lisboa, Portugal, e-mail: afshin.ashofteh@ine.pt

Pedro Campos
Statistics Portugal (Instituto Nacional de Estatística, Departamento de Metodologia e Sistemas de Informação) and Faculty of Economics, Universidade do Porto, and LIAAD INESC TEC, Portugal, e-mail: pedro.campos@ine.pt

# **1 Introduction**

This paper provides a summary of a systematic literature review of the role of classification variables and weighting methods of mixed-mode surveys in minimizing the measurement error, coverage error, and nonresponse bias.

Before the COVID-19 pandemic, the statistical adjustment of mode-specific measurement effects was studied by many scholars. After the pandemic began, however, survey methodologists made a strong effort to meet the challenges of new restrictions on collecting data with proper quality [1]. Data collection mixing different modes, considering each mode's contribution to the overall published statistics, was considered a solution by NSIs. Methodologists have been trying to use technology, data science, and mixed-device surveys to decrease the expected coverage error and nonresponse bias for the new target populations of the pandemic period, rather than relying on the traditional interviewer-assisted and paper survey modes [2]. This coverage error is caused by the change of the target population from the general population to the part of the general population accessible through technological devices. Te Braak et al. [3] highlighted how the representativeness of self-administered online surveys is expected to be impacted by decreased response rates. Their research demonstrates that a large group of respondents drop out selectively and that this selectivity varies depending on the dropout moment and demographic categorical information.

According to studies at Statistics Portugal, using classification methods on categorical variables and applying repeated weighting techniques seem fruitful for estimating and adjusting for mode and device effects. Fortunately, many authors have discussed the use of weights in statistical analysis [4]. It is important to improve inference in cases where mixed-mode effects are combined with measurement errors caused by primary data collection on categorical variables and socio-demographic information. On the one hand, since categorical variables are collected with the help of responders (primary data), the survey mode has a strong impact on answering behavior and answering conditions. Respondents might regard some of the new categorical variables, which are necessary for statistical classification, as sensitive or privacy-intrusive information, and they may not be willing to share these personal data by telephone or technological devices. Additionally, for NSIs, the new data collection channels are costly, and redesigning the survey estimation methodology is time-consuming. On the other hand, the categorical variables should be available in sampling frames (secondary data), where coverage error is the main concern. For instance, in CATI surveys of Statistics Portugal after COVID-19, the population was considered as belonging to the following categories: (i) households with a listed landline telephone, (ii) households that do not have a landline telephone but use only a mobile telephone, and (iii) households that do not have a telephone at all (or whose number is unknown). These households can be expected to have very different socioeconomic characteristics, and new methods of classification or clustering can be helpful for measurement error adjustment at the time of the pandemic. However, if the groups differ in the important categorical variables of the survey, a weighting solution could amplify a part of the sample that does not represent the population. As a result, statistical classification would become another source of bias instead of solving the problem. Therefore, two approaches can be expected. First, we could ignore classification, simply because we consider the groups homogeneous, and weighting could be recommended to adjust for the COVID-19 pandemic situation and non-observation errors. Second, the groups of responders are different and we need categorical variables. In this case, the non-observation errors of CATI and CAWI cannot be covered by changing only the weights, and CAPI has to be recommended to collect categorical information, applying both clustering and weighting together to obtain reasonable coverage by mixed modes.

This study undertakes a systematic literature review on this topic, guided by the following question: What is the best methodology or modified estimation strategy to mitigate mode-effect problems based on design weighting and classification? To answer this question, we performed a systematic review limited to the following databases: Web of Science, Scopus, and working papers from NSIs. We only considered papers written in English. This article is organized as follows: Section 2 presents the research methodology, covering keyword identification, search, databases, and bibliometric analysis. In Section 3, we present the results, identifying the PRISMA flow diagram, the characteristics of the articles, author co-authorship analysis, and keyword occurrence over the years. In Section 4, we discuss the content analysis. Section 5 presents the main conclusions and, finally, in Section 6, the main research gaps and future works are outlined.

# **2 Methods**

To accomplish the research, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology was adopted. The algorithm for paper selection from the databases (Scopus and WoS) was based on a screening that started with the search keywords ((mixed-mode\* OR "Mode effect\*") AND (weighting OR weight\* OR classification) AND ("Measurement error\*" OR "Non-response bias" OR "Data quality" OR "response rate\*") AND (capi OR "Computer Assisted Personal Interview\*" OR cawi OR "Assisted Web Interview\*" OR cati OR "Computer Assisted Telephone Interview\*" OR "web survey\*" OR "mail survey\*" OR "telephone survey\*")), after which the result was filtered by "official statistics". The results of the two databases were merged, and duplicates were removed. For bibliometric analysis, the Mendeley open-source tool was used to extract metadata and eliminate duplicates. For network analysis, the VOSviewer open-source tool was applied to visualize the extracted information from the data set and obtain quantitative and qualitative outcomes. After assessing eligibility, books and review papers were omitted from the results, and the relevant articles were selected from the databases. The final dataset was selected according to the flow diagram in Figure 1, which shows detailed information about this systematic literature review.

**Fig. 1** Literature review flow diagram. (Source: Author's preparation).

**Fig. 2** Density visualization analysis of the 22 leader authors who have at least 3 papers.

# **3 Results**

The 28 leading authors who had at least 4 papers are presented in Figure 2. Author occurrence analysis was performed by applying the VOSviewer network analysis tool. The top three authors were Mick P. Couper with 14 articles, Barry Schouten with 14 articles, and Roger Tourangeau with 11 articles. With the help of VOSviewer, a keyword analysis was also carried out. We analyzed the co-occurrence of author keywords with the full counting method. In the first step, we set the minimum occurrence of a keyword to one, which resulted in 711 keywords. The application of keywords over the years can be seen in Figure 3. Some of the keywords were not exactly the same, but their use and meaning were the same.

**Fig. 3** Application of keywords over years.

We decided to match similar words to make the output clearer. Choosing the full counting method resulted in a total of 592 authors meeting the threshold.

# **4 Content Analysis**

The studies emphasize the dramatic change in mixed-mode strategies over the last decades, based on design-based and model-assisted survey sampling, time series methods, and small area estimation [6], and a high expectation of further changes, especially after the substantial experience NSIs gained trying new modes after the COVID-19 pandemic [7].

The problem concerns mixed-mode effects and calibration; briefly, several approaches can be followed, such as design weighting to find sampling weights, nonresponse weighting adjustment, and calibration. The design weight of a unit may be interpreted as the number of population units represented by a specific sample unit. Most surveys, if not all, suffer from item or unit nonresponse. Auxiliary information can be used to improve the quality of design-weighted estimates. An auxiliary variable must have at least two characteristics to be considered in calibration: (i) it must be available for all sample units; and (ii) its population total must be known.

The categorical variables from the demographic information of nonrespondents such as education level, age, income, location, language, and marital status could help the survey methodologists to categorize the target population and recognize the best sequence of the modes [8]. Van Berkel et al. [9] considered nine strata in their classification tree by using age, ethnicity, urbanization, and income as explanatory variables. Re-interview design and inverse regression estimator (IREG) are among the best approaches to improve measurement bias by using related auxiliary information [10].

The focus of this approach is on the weights of estimators rather than on the bias of the measurements. For an estimator, consider $y_{i,m}$, the measurement obtained from unit $i$ through mode $m$. The measurement $y_{i,m}$ consists of $u_i$, the observed value for respondent $i$, an additive mode-dependent measurement bias $b_m$, and a mode-dependent measurement error $\varepsilon_{i,m}$ with an expected value equal to zero. Equation (1) shows the measurement error model.

$$
y\_{i,m} = u\_i + b\_m + \varepsilon\_{i,m} \tag{1}
$$

If we consider two different modes $m$ and $m'$, then the differential measurement error between these two modes is given by

$$y\_{i,m} - y\_{i,m'} = (b\_m - b\_{m'}) + (\varepsilon\_{i,m} - \varepsilon\_{i,m'}) \tag{2}$$

The expected value of $(b_m - b_{m'})$ is the differential measurement bias. If we consider $\hat{t}_y$ as an estimate of the total of variable $y$ according to its observations in the different modes $y_{i,m}$, then

$$\hat{t}\_{y} = \sum\_{i=1}^{n} \omega\_{i} y\_{i,m} \tag{3}$$

where $\omega_i$ is a survey weight assigned to unit $i$, with $n$ the number of respondents. From a combination of equations (2) and (3), and taking the expectation over the measurement error model (1), we would have

$$E\left(\hat{t}\_{y}\right) = E\left(\sum\_{i=1}^{n} \omega\_{i} y\_{i,m}\right) = \sum\_{i=1}^{n} \omega\_{i}u\_{i} + \sum\_{i=1}^{n} \omega\_{i}\delta\_{i,m}b\_{m} + \sum\_{i=1}^{n} \omega\_{i}\delta\_{i,m}E\left(\varepsilon\_{i,m}\right) \tag{4}$$

with $\delta_{i,m} = 1$ if unit $i$ responded through mode $m$, and zero otherwise. Since $E(\varepsilon_{i,m}) = 0$,

$$E\left(\hat{t}\_{y}\right) = E\left(\sum\_{i=1}^{n} \omega\_{i} y\_{i,m}\right) = \sum\_{i=1}^{n} \omega\_{i}u\_{i} + \sum\_{i=1}^{n} \omega\_{i}\delta\_{i,m}b\_{m} \tag{5}$$

stating that the expected total of the survey estimate for $Y$ consists of the estimated true total of $U$ plus the total of $b_m$ from data collected through mode $m$. Since $b_m$ is an unobserved mode-dependent measurement bias, the term $\sum_{i=1}^{n} \omega_i \delta_{i,m} b_m$ in equation (5) indicates the existence of an unknown mode-dependent bias in the estimation of $t_y$. According to equation (5), there is an unknown measurement bias in sequential mixed-mode designs that might be adjusted by different estimators. Data obtained via a re-interview design on a sub-set of respondents to the first stage of a sequential mixed-mode survey provide the necessary auxiliary information to adjust measurement bias in sequential mixed-mode surveys. Klausch et al. [10] propose six different estimators and show that an inverse version of the regression estimator (IREG) performs well under all considered scenarios. The idea of IREG is to use re-interview data to estimate the inverse slope of an ordinary or generalized least squares linear regression of the focal measurements $y^{m_j}$ on the benchmark measurements $y^{m_b}$, as follows [11]

$$\mathbf{y}\_{i}^{m\_{j}} = \beta\_{0} + \beta\_{1}\mathbf{y}\_{i}^{m\_{b}} \tag{6}$$

and estimate the measurement of the target variable by applying the inverse of $\hat{\beta}_1$ in the following estimator, the so-called inverse regression estimator

$$\hat{t}\_{y}^{\,ireg} = \frac{1}{\hat{N}\_{m\_1} + \hat{N}\_{m\_2}} \left( \sum\_{i=1}^{n\_{m\_b}} d\_{i}\, y\_{i}^{m\_b} + \sum\_{i=1}^{n\_{m\_j}} d\_{i} \left( \hat{\bar{y}}\_{re}^{\,m\_b} - \frac{1}{\hat{\beta}\_1} \left( \hat{\bar{y}}\_{re}^{\,m\_j} - y\_{i}^{m\_j} \right) \right) \right), \quad j = 1, 2;\ b \neq j \tag{7}$$

where $\hat{\bar{y}}_{re}^{m_j}$ and $\hat{\bar{y}}_{re}^{m_b}$ are the respondent means of the focal and benchmark mode outcomes in the re-interview, and $d_i$ denotes the design weight of the sample design. For a detailed presentation and discussion of the methods, see Chapter 8.5 in [12]. However, for longitudinal studies with different modes at different time points, the effect of time on the respondents makes it difficult to estimate the pure mixed-mode effect, especially for volatile classification variables such as the address of immigrants. The solution could be conducting the survey on parallel or separate samples to evaluate the time effect and the mode effect separately.
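A toy reconstruction of the IREG idea in Python (illustrative only; Klausch et al.'s estimators are richer): estimate $\hat{\beta}_1$ from the re-interview pairs via eq. (6), map the focal-mode observations onto the benchmark scale via the inner term of eq. (7), and pool the design-weighted totals. Estimating $\hat{N}$ by the summed design weights and all function names are our simplifications.

```python
import numpy as np

def ireg_mean(y_bench, d_bench, y_focal, d_focal, re_bench, re_focal):
    """Sketch of the inverse regression estimator for a population mean."""
    beta1 = np.polyfit(re_bench, re_focal, 1)[0]   # eq. (6): OLS slope of focal on benchmark
    # eq. (7) inner term: map each focal-mode observation onto the benchmark scale
    y_adj = re_bench.mean() - (re_focal.mean() - y_focal) / beta1
    total = (d_bench * y_bench).sum() + (d_focal * y_adj).sum()
    return total / (d_bench.sum() + d_focal.sum())  # N-hat taken as summed design weights

# Toy check: the focal mode measures a scaled and shifted version of the true value;
# IREG maps it back onto the (unbiased) benchmark scale.
rng = np.random.default_rng(0)
u = rng.normal(10, 2, 4000)                  # true values
y_b = u[:2000]                               # benchmark mode: unbiased
y_f = 5 + 1.5 * u[2000:]                     # focal mode: biased measurement
re_u = rng.normal(10, 2, 500)                # re-interviewed units, measured in both modes
est = ireg_mean(y_b, np.ones(2000), y_f, np.ones(2000), re_u, 5 + 1.5 * re_u)
```

With a purely linear mode bias, the estimator recovers the mean of the true values almost exactly, since the re-interview regression identifies the slope without error.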

In practice, Statistics Portugal has been using the available information of a sampling frame that is part of the FNA (the dwellings national register database) at the time of COVID-19. Telephone numbers were linked to samples drawn from the population register in the FNA for CATI rotation-scheme surveys such as the Labour Force Survey. In 2020, the Labour Force Survey (LFS) in Portugal, a mandatory survey for the member states of the EU, was adjusted for the undercoverage of the percentage of households with a listed landline telephone. As a result, the comparison of these surveys before and after COVID-19 shows the usefulness of the discussed methodologies. In 2021, the successful CAWI-mode census by Statistics Portugal showed that respondents tend to favor the web-based questionnaire to avoid the risk of COVID-19 infection in a face-to-face interview. This shows a potential change in mode tendency among responders.

# **5 Conclusions**

The COVID-19 crisis led to new solutions for item classification and mixed-mode effect adjustment, such as applying mode calibration to population subgroups defined by categorical variables such as gender, region, and age group. Studies suggest a sequential mixed-mode design that starts with CAWI, the cheapest mode, supported by an initial postal or telephone contact and possibly a cash incentive. After a lag, non-respondents are followed up and offered a choice between CAPI and CATI according to their specific classification group and demographic information, such as education level, age, income, location, language, and marital status. This approach reduces cost and increases accuracy simultaneously.

This study showed that sampling frames might need updates of necessary categorical information that is based on choices made several years ago. Additionally, more research seems necessary on ethics concerns, privacy regulations, and standards for using categorical variables and classification information in social mixed-mode surveys and official statistics.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Clustering and Blockmodeling Temporal Networks – Two Indirect Approaches**

Vladimir Batagelj

**Abstract** Two approaches to clustering and blockmodeling of temporal networks are presented: the first is based on an adaptation of the clustering of symbolic data described by modal values and the second is based on clustering with relational constraints. Different options for describing a temporal block model are discussed.

**Keywords:** social networks, network analysis, blockmodeling, symbolic data analysis, clustering with relational constraints

# **1 Temporal Networks**

Temporal networks described by *temporal quantities* (TQs) were introduced in the paper [2]. We get a *temporal network* $\mathcal{N}\_T = (\mathcal{V}, \mathcal{L}, \mathcal{T}, \mathcal{P}, \mathcal{W})$ by attaching the *time* $\mathcal{T}$ to an ordinary network, where $\mathcal{V}$ is the set of nodes, $\mathcal{L}$ is the set of links, $\mathcal{P}$ is the set of node properties, $\mathcal{W}$ is the set of link weights, and $\mathcal{T} = [T\_{min}, T\_{max})$ is a linearly ordered set of time points $t \in \mathcal{T}$, which are usually integers or reals.

In a temporal network the activity/presence of nodes and links, node properties, and link weights can change through time. These changes are described with TQs. A TQ is described by a sequence $a = [(s\_r, f\_r, v\_r) : r = 1, 2, \dots, k]$ where $[s\_r, f\_r)$ determines a time interval and $v\_r$ is the value of the TQ $a$ on this interval. The set $T\_a = \bigcup\_r [s\_r, f\_r)$ is called the *activity set* of $a$. For $t \notin T\_a$ its value is *undefined*, $a(t) = \varnothing$.

Assuming that for every $x \in \mathbb{R} \cup \{\varnothing\}$: $x + \varnothing = \varnothing + x = x$ and $x \cdot \varnothing = \varnothing \cdot x = \varnothing$, we can extend the addition and multiplication to TQs

Vladimir Batagelj ()

IMFM, Jadranska 19, 1000 Ljubljana, Slovenia & IAM UP, Muzejski trg 2, 6000 Koper, Slovenia & HSE, 11 Pokrovsky Bulvar, 101000 Moscow, Russian Federation, e-mail: vladimir.batagelj@fmf.uni-lj.si

<sup>©</sup> The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_8

$$\begin{array}{cccc}(a+b)(t) = a(t) + b(t) & \text{and} & T\_{a+b} = T\_a \cup T\_b\\(a \cdot b)(t) = a(t) \cdot b(t) & \text{and} & T\_{a \cdot b} = T\_a \cap T\_b\end{array}$$

Let $T\_V(v) \subseteq \mathcal{T}$, $T\_V \in \mathcal{P}$, be the activity set for a node $v \in \mathcal{V}$ and $T\_L(\ell) \subseteq \mathcal{T}$, $T\_L \in \mathcal{W}$, the activity set for a link $\ell \in \mathcal{L}$. The following *consistency condition* must be fulfilled for activity sets: if a link $\ell(u, v)$ is active at the time point $t$ then its end-nodes $u$ and $v$ should also be active at the time point $t$: $T\_L(\ell(u, v)) \subseteq T\_V(u) \cap T\_V(v)$.

In the following we will need

1. *Total*: $\text{total}(a) = \sum\_i (f\_i - s\_i) \cdot v\_i$
2. *Average*: $\text{average}(a) = \text{total}(a) / |T\_a|$ where $|T\_a| = \sum\_i (f\_i - s\_i)$
3. *Maximum*: $\max(a) = \max\_i v\_i$
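As a sketch, under the triple representation above (a TQ held as a list of `(s, f, v)` triples; this representation and the function names are assumptions of the illustration, not the TQ library's API), the three aggregates read:

```python
# Illustrative TQ aggregates; a TQ is modeled as a list of (s, f, v) triples,
# where [s, f) is an activity interval and v the value on it.

def total(a):
    # total(a) = sum_i (f_i - s_i) * v_i
    return sum((f - s) * v for s, f, v in a)

def activity(a):
    # |T_a| = sum_i (f_i - s_i)
    return sum(f - s for s, f, _ in a)

def average(a):
    # average(a) = total(a) / |T_a|
    return total(a) / activity(a)

def maximum(a):
    # max(a) = max_i v_i
    return max(v for _, _, v in a)

a = [(1, 3, 4), (5, 8, 2)]  # value 4 on [1,3), value 2 on [5,8)
print(total(a), average(a), maximum(a))  # 14 2.8 4
```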

To support computations with TQs we developed the Python libraries TQ and Nets; see https://github.com/bavla/TQ .
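TQ addition can be sketched as follows under the same triple representation (again an illustration, not the TQ library's actual interface): the sum is defined on the union of the two activity sets, with an undefined operand acting as the neutral element.

```python
# Sum of two temporal quantities: (a+b)(t) = a(t) + b(t) with T_{a+b} = T_a ∪ T_b.
# A TQ is a list of (s, f, v) triples; None marks an undefined value.

def tq_sum(a, b):
    points = sorted({p for s, f, _ in a + b for p in (s, f)})

    def value_at(tq, t):
        for s, f, v in tq:
            if s <= t < f:
                return v
        return None  # undefined outside the activity set

    out = []
    for s, f in zip(points, points[1:]):
        va, vb = value_at(a, s), value_at(b, s)
        if va is None and vb is None:
            continue  # t outside T_a ∪ T_b stays undefined
        v = (va if va is not None else 0) + (vb if vb is not None else 0)
        if out and out[-1][1] == s and out[-1][2] == v:
            out[-1] = (out[-1][0], f, v)  # merge contiguous equal-valued intervals
        else:
            out.append((s, f, v))
    return out

print(tq_sum([(1, 5, 2)], [(3, 7, 1)]))  # [(1, 3, 2), (3, 5, 3), (5, 7, 1)]
```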

# **2 Traditional (Generalized) Blockmodeling Scheme**

A *blockmodel* (BM) [11] consists of structures obtained by identifying all units from the same cluster of the clustering / *partition* $\mathbf{C} = \{C\_i\}$, $\pi(v) = i \Leftrightarrow v \in C\_i$. Each pair of clusters $(C\_i, C\_j)$ determines a block consisting of the links from $C\_i$ to $C\_j$. For an exact definition of a blockmodel we also have to be precise about which blocks produce an arc in the *reduced graph* on classes and which do not, what the *weight* of this arc is, and, in the case of generalized BM, of what *type* it is. The reduced graph can be represented by a relational matrix, also called the *image matrix*.

**Fig. 1** Blockmodel.

To develop a BM method we specify a criterion function $P(\mu)$ measuring the "error" of the BM $\mu$. We can introduce additional knowledge by constraining the partitions to a set $\Phi$ of feasible partitions. We are searching for a partition $\pi^* \in \Phi$ such that the corresponding BM $\mu^*$ minimizes the criterion function $P(\mu)$.

# **3 BM of Temporal Networks**

For an early attempt at temporal network BM see [2, 5]. To the traditional BM scheme we add the time dimension. We assume that the network is described using temporal quantities [2] for the activity/presence of nodes and links, and for some node properties and link weights. Then also the BM partition $\pi$ is described for each node $v$ with a temporal quantity $\pi(v, t)$: $\pi(v, t) = i$ means that at time $t$ node $v$ belongs to cluster $i$. The structure and activity of clusters $C\_i(t) = \{v : \pi(v, t) = i\}$ can change through time, but they preserve their identity.

For the BM $\mu$ the clusters are mapped into BM nodes $\mu: C\_i \to [i]$. To determine the BM we still have to specify how the links from $C\_i$ to $C\_j$ are represented in the BM. In general, for the model arc $([i], [j])$, we have to specify two TQs: its *weight* $a\_{ij}$ and, in the case of generalized BM, its *type* $\tau\_{ij}$. The weight can be an object of a different type than the weights of the block links in the original temporal network.

We assume that in a temporal network $\mathcal{N} = (\mathcal{V}, \mathcal{L}, \mathcal{T}, \mathcal{P}, \mathcal{W})$ the link weights are described by TQs $w \in \mathcal{W}$. We intend to develop BM methods case by case for the following approaches:

- a. indirect approach based on clustering of TQs: $p(v) = \sum\_{u \in N(v)} w(v, u)$, hierarchical clustering and leaders;
- b. indirect approach by conversion to *clustering with relational constraints* (CRC);
- c. direct approach by (local) optimization of the criterion function $P$ over $\Phi$.

In this paper, we present approaches for cases a and b.

In the literature there exist other approaches to BM of temporal networks. A recent overview is available in the book [12].

#### **3.1 Adapted Symbolic Clustering Methods**

In [8] we adapted the traditional leaders [13, 10] and agglomerative hierarchical [14, 1] clustering methods for clustering modal-valued symbolic data. They can be applied almost directly to clustering units described by variables whose values are temporal quantities.

For a unit $X\_i$, each variable $V\_j$ is described by a size $h\_{ij}$ and a temporal quantity $\mathbf{x}\_{ij}$, $X\_{ij} = (h\_{ij}, \mathbf{x}\_{ij})$. In our algorithms we use *normalized* values of temporal variables $V' = (h, \mathbf{p})$ where

$$\mathbf{p} = [(s\_r, f\_r, p\_r) : r = 1, 2, \dots, k] \qquad \text{and} \qquad p\_r = \frac{v\_r}{h}$$

In the case when $h = \text{total}(\mathbf{x})$, the normalized TQ $\mathbf{p}$ is essentially a probability distribution.
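This normalization with $h = \text{total}(\mathbf{x})$ can be sketched as follows (triple-list representation of a TQ assumed, as an illustration):

```python
# Normalize a TQ x into (h, p) with h = total(x) and p_r = v_r / h,
# so that p behaves like a probability distribution over time.

def normalize(x):
    h = sum((f - s) * v for s, f, v in x)  # h = total(x)
    p = [(s, f, v / h) for s, f, v in x]
    return h, p

h, p = normalize([(0, 2, 3), (2, 4, 2)])
print(h, p)  # 10 [(0, 2, 0.3), (2, 4, 0.2)]
```

Note that the interval-length-weighted sum of the `p` values is 1, as expected for a distribution.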

Both methods create cluster representatives that are represented in the same way.

#### **3.2 Clustering of Temporal Network and CRC**

To use the CRC in the construction of a node partition we have to define a dissimilarity measure $d(u, v)$ (or a similarity $s(u, v)$) between nodes. An obvious solution is $s(u, v) = f(w(u, v))$, for example


We can transform a similarity $s(u, v)$ into a dissimilarity by $d(u, v) = \frac{1}{s(u, v)}$ or $d(u, v) = S - s(u, v)$ where $S > \max\_{u,v} s(u, v)$. In this way, we transformed the temporal network partitioning problem into a clustering with relational constraints problem [6, pp. 360–369]. It can be solved efficiently also for large sparse networks.

#### **3.3 Block Model**

Having the partition $\pi$, to produce a BM we have to specify the values on its links. There are different options for the model link weights $a(([i], [j]))$.


# **4 Example: September 11th Reuters Terror News**

The *Reuters Terror News* network was obtained from the CRA (Centering Resonance Analysis) networks produced by Steve Corman and Kevin Dooley at Arizona State University. The network is based on all the stories released during 66 consecutive days by the news agency Reuters concerning the September 11 attack on the U.S., beginning at 9:00 AM EST 9/11/01.

The nodes of this network, $n = 13332$, are important words (terms). For a given day, there is an edge between two words iff they appear in the same utterance (for details see the paper [9]). The network has $m = 243447$ edges. The weight of an edge is its daily frequency. There are no loops in the network. The Terror News network is undirected, and so is its BM.

The Reuters Terror News network was used as a case network for the Viszards visualization session at the XXII International Sunbelt Social Network Conference, New Orleans, USA, February 13–17, 2002. It is available at http://vlado.fmf.uni-lj.si/pub/networks/data/CRA/terror.htm .

We transformed the Pajek version of the network into the NetsJSON format used in the libraries TQ and Nets. As a temporal description of each node/word for clustering we took its activity (the sum of the TQs on all edges adjacent to a given node $v$)

$$\text{act}(v) = \sum\_{u \in N(v)} w(v:u).$$

Our leaders' and hierarchical clustering methods are compatible: they are based on the same clustering error criterion function. Usually, the leaders' method is used to reduce a large clustering problem to a few hundred units. With hierarchical clustering of the leaders of the obtained clusters, we afterwards determine the "right" number of clusters and their representatives.
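This two-stage scheme can be sketched on plain Euclidean vectors (the chapter's methods operate on TQs with a compatible error criterion; the vector setting, function names, and toy data below are simplifying assumptions of the illustration):

```python
# Stage 1: a k-means-style "leaders" pass reduces many units to k representatives.
# Stage 2: greedy centroid-linkage merging of the leaders down to `target` groups.
import random

def leaders(units, k, iters=20, seed=0):
    rnd = random.Random(seed)
    reps = [list(u) for u in rnd.sample(units, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for u in units:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(u, reps[i])))
            clusters[j].append(u)
        # new leader = cluster mean; keep the old leader for an empty cluster
        reps = [[sum(col) / len(c) for col in zip(*c)] if c else reps[i]
                for i, c in enumerate(clusters)]
    return reps

def agglomerate(reps, target):
    groups = [[list(r)] for r in reps]

    def centroid(g):
        return [sum(col) / len(g) for col in zip(*g)]

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    while len(groups) > target:
        i, j = min(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda p: dist2(centroid(groups[p[0]]), centroid(groups[p[1]])))
        groups[i] += groups[j]
        del groups[j]
    return groups

units = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0], [9.2, 0.1]]
groups = agglomerate(leaders(units, 4), 3)
print(len(groups))  # 3
```

In the chapter, the same idea is applied with 13332 units reduced to 100 leaders, which are then clustered hierarchically.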

**Fig. 2** Hierarchical clustering of 100 leaders in Terror News.

To cluster all 13332 words (nodes) in Terror News we used the adapted leaders' method searching for 100 clusters. We continued with the hierarchical clustering of the obtained 100 leaders. The result is presented in the dendrogram in Figure 2.

**Fig. 3** Word clouds for clusters 𝐶58 and 𝐶81.

To get an insight into the content of a selected cluster we draw the corresponding word cloud based on the cluster's leader. In Figure 3 the word clouds for clusters 𝐶58 and 𝐶81 (|𝐶58| = 1396, |𝐶81| = 2226 ) are presented.

We can also compare the activities of pairs of clusters by considering the overlap of p-components (probability distributions) of their leaders. In Figure 4, we compare cluster 𝐶58 with cluster 𝐶81, and cluster 𝐿96 with cluster 𝐶66. In the right diagram some values are outside the display area: 𝐿96[15] = 0.3524, 𝐶66[4] = 0.1961, 𝐶66[5] = 0.2917.

**Fig. 4** Comparing activities of clusters (blue – first cluster, red – second cluster, violet – overlap).

We decided to consider in the BM the clustering of Terror News into 5 clusters **C** = {𝐶94, 𝐶88, 𝐶95, 𝐿43, 𝐿74}. The split of cluster C95 gives clusters of sizes 325 and 629 (for sizes, see the right side of Figure 5). Both clusters C94 and C88 have a chaining pattern at their top levels.

Because of large differences in the cluster sizes, it is difficult to interpret the total intensities image matrix. We get an overall insight into the BM structure from the geometric average intensities image matrix (right side of Figure 5) and the corresponding BM network at cut level 0.3 (left side of Figure 5).


**Fig. 5** Block model and image matrix.

A more detailed BM is presented by the activities (𝑝-components) image matrix in Figure 6.

**Fig. 6** BM represented as 𝑝-components of temporal activities of links between pairs of clusters.

A more compact representation of a temporal BM is a heatmap display of this matrix, shown in Figure 7. Because of some relatively very large values, a display of the matrix with logarithmic values turns out to provide much more information.


**Fig. 7** BM heatmap with log2 values.

We also applied the clustering with relational constraints approach to the Terror News network. Because of the limited space available for each paper, we cannot present it here. A description of the analysis with the corresponding code is available at https://github.com/bavla/TQ/wiki/BMRC .

# **5 Conclusions**

The presented research is work in progress. It deals only with the two simplest cases of temporal blockmodeling. We provided some answers to the problem of normalizing model weight TQs when comparing them, and some ways to present/display temporal BMs.

We used different tools (R, Python, and Pajek) to obtain the results. We intend to provide the software support in a single tool – probably in Julia. We also intend to create a collection of interesting and well-documented temporal networks for testing and demonstrating the developed software.

**Acknowledgements** The paper contains an elaborated version of ideas presented in my talks at the XXXX Sunbelt Social Networks Conference (on Zoom), July 13-17, 2020 and at the EUSN 2021 – 5th European Conference on Social Networks, Naples (on Zoom), September 6-10, 2021.

This work is supported in part by the Slovenian Research Agency (research program P1-0294 and research projects J1-9187, J1-2481, and J5-2557), and prepared within the framework of the HSE University Basic Research Program.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Latent Block Regression Model**

Rafika Boutalbi, Lazhar Labiod, and Mohamed Nadif

**Abstract** When dealing with high-dimensional sparse data, such as in recommender systems, co-clustering turns out to be more beneficial than one-sided clustering, even if one is interested in clustering along one dimension only. Thereby, co-clusterwise is a natural extension of clusterwise. Unfortunately, existing approaches do not consider covariates on both dimensions of a data matrix. In this paper, we propose a *Latent Block Regression Model* (LBRM) overcoming this limitation. For inference, we propose an algorithm performing co-clustering and regression simultaneously, where a linear regression model characterizes each block. Estimating the model parameters by maximum likelihood, we derive a Variational Expectation-Maximization (VEM) algorithm. The effectiveness of the proposed VEM-LBRM is illustrated on simulated datasets.

**Keywords:** co-clustering, clusterwise, tensor, data mining

# **1 Introduction**

The *cluster-wise* linear regression algorithm CLR (or Latent Regression Model) is a finite mixture of regressions and one of the most commonly used methods for simultaneous learning and clustering [14, 5]. It aims to find clusters of entities that minimize the overall sum of squared errors of regressions performed over these clusters. Specifically, $\mathbf{X} = [x\_{ij}] \in \mathbb{R}^{n \times v}$ is the covariate matrix and $\mathbf{Y} \in \mathbb{R}^{n \times 1}$ the response vector. The *cluster-wise* method aims to find $g$ clusters $C\_1, \dots, C\_g$ and regression coefficients $\boldsymbol{\beta}^{(k)} \in \mathbb{R}^{v \times 1}$ by minimizing the objective function $\sum\_{k=1}^{g} \sum\_{i \in C\_k} (y\_i - \sum\_{j=1}^{v} \beta\_j^{(k)} x\_{ij} - b\_k)^2$ where:


Lazhar Labiod · Mohamed Nadif

Centre Borelli UMR 9010, Université Paris Cité, France, e-mail: lazhar.labiod@u-paris.fr;mohamed.nadif@u-paris.fr

Rafika Boutalbi ()

Institute for Parallel and Distributed Systems, Analytic Computing, University of Stuttgart, Germany, e-mail: rafika.boutalbi@ipvs.uni-stuttgart.de

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_9

Various adjustments have been made to this model to improve its performance in terms of clustering and prediction. In our contribution, we propose to embed the co-clustering in the model.

Co-clustering is a simultaneous clustering of both dimensions of a data matrix that has proven more beneficial than traditional one-sided clustering, especially when dealing with sparse data. When dealing with high-dimensional data, sparse or not, co-clustering turns out to be more valuable than one-sided clustering [1, 13], even if one is interested in clustering along one dimension only. In [4] the authors proposed the SCOAL approach (Simultaneous Co-clustering and Learning), leading to co-clustering and prediction for binary data, and generalized the model to continuous data. However, this model does not take into account the sparsity of the data, in the sense that it does not lead to homogeneous blocks. The results obtained in terms of *Mean Square Error* (MSE) are good, but no analysis has been presented in terms of co-clustering (homogeneity of co-clusters). This model is also related to the soft PDLF (Predictive Discrete Latent Factor) model [2], where the response value $y\_{ij}$ in each co-cluster is modeled as a sum $\boldsymbol{\beta}^\top \mathbf{x}\_{ij} + \delta\_{k\ell}$, where $\boldsymbol{\beta}$ is a global regression coefficient vector and $\delta\_{k\ell}$ is a co-cluster-specific offset. More recently, in [17] the authors proposed an algorithm taking into account only row covariate information to realize co-clustering and regression simultaneously. To this end, the authors build on the latent block model [8]. In our contribution, we propose to rely on this model as well, but considering both row and column covariates.

The proposed Latent Block Regression Model (LBRM) is an extension of finite mixtures of regression models where the co-clustering is embedded. It allows us to deal with co-clustering and regression simultaneously while taking into account covariates. To estimate the parameters we rely on a *Variational* Expectation-Maximization algorithm [7] referred to as VEM-LBRM.

# **2 From Clusterwise Regression to Co-clusterwise Regression**

#### **2.1 Latent Block Model (LBM)**

Given an $n \times d$ data matrix $\mathbf{X} = (x\_{ij}, i \in I = \{1, \dots, n\}; j \in J = \{1, \dots, d\})$, it is assumed that there exists a partition on $I$ and a partition on $J$. A partition of $I \times J$ into $g \times m$ blocks will be represented by a pair of partitions $(\mathbf{z}, \mathbf{w})$. The $k$-th row cluster corresponds to the set of rows $i$ such that $z\_{ik} = 1$ and $z\_{ik'} = 0$ for all $k' \neq k$. Thereby, the partition represented by $\mathbf{z}$ can also be represented by a matrix of elements in $\{0, 1\}^g$ satisfying $\sum\_{k=1}^{g} z\_{ik} = 1$. Similarly, the $\ell$-th column cluster corresponds to a set of columns $j$, and the partition $\mathbf{w}$ can be represented by a matrix of elements in $\{0, 1\}^m$ satisfying $\sum\_{\ell=1}^{m} w\_{j\ell} = 1$.

Considering the Latent Block Model (LBM) [6], it is assumed that each element $x\_{ij}$ of the $k\ell$-th block is generated according to a parameterized probability density function (pdf) $f(x\_{ij}; \alpha\_{k\ell})$. Furthermore, in the LBM the univariate random variables $x\_{ij}$ are assumed to be conditionally independent given $(\mathbf{z}, \mathbf{w})$, so that the conditional pdf of $\mathbf{X}$ factorizes as $f(\mathbf{X}|\mathbf{z}, \mathbf{w}; \boldsymbol{\alpha}) = \prod\_{i,j,k,\ell} f(x\_{ij}; \alpha\_{k\ell})^{z\_{ik} w\_{j\ell}}$. From this hypothesis, we then consider the latent block model where the two sets $I$ and $J$ are considered as random samples and the row and column labels become latent variables. Therefore, the parameter of the latent block model is $\boldsymbol{\Theta} = (\boldsymbol{\pi}, \boldsymbol{\rho}, \boldsymbol{\alpha})$, with $\boldsymbol{\pi} = (\pi\_1, \dots, \pi\_g)$ and $\boldsymbol{\rho} = (\rho\_1, \dots, \rho\_m)$, where $\pi\_k = P(z\_{ik} = 1)$, $k = 1, \dots, g$, and $\rho\_\ell = P(w\_{j\ell} = 1)$, $\ell = 1, \dots, m$, are the mixing proportions, and $\boldsymbol{\alpha} = (\alpha\_{k\ell}; k = 1, \dots, g, \ell = 1, \dots, m)$ where $\alpha\_{k\ell}$ is the parameter of the distribution of block $k\ell$. Considering that the complete data are the vector $(\mathbf{X}, \mathbf{z}, \mathbf{w})$, i.e., assuming that the latent variables $\mathbf{z}$ and $\mathbf{w}$ are known, the resulting complete-data log-likelihood of the latent block model $L\_C(\mathbf{X}, \mathbf{z}, \mathbf{w}, \boldsymbol{\Theta}) = \log f(\mathbf{X}, \mathbf{z}, \mathbf{w}; \boldsymbol{\Theta})$ can be written as follows

$$L\_C(\mathbf{X}, \mathbf{z}, \mathbf{w}, \boldsymbol{\Theta}) = \sum\_{k=1}^{g} z\_{.k} \log \pi\_k + \sum\_{\ell=1}^{m} w\_{.\ell} \log \rho\_\ell + \sum\_{i=1}^{n} \sum\_{j=1}^{d} \sum\_{k=1}^{g} \sum\_{\ell=1}^{m} z\_{ik} w\_{j\ell} \log f(x\_{ij}; \alpha\_{k\ell})$$

with $z\_{.k} = \sum\_i z\_{ik}$ and $w\_{.\ell} = \sum\_j w\_{j\ell}$,

where the $\pi\_k$'s and $\rho\_\ell$'s denote the proportions of row and column clusters respectively; see for instance [8]. Note that the complete-data log-likelihood breaks into three terms: the first depends on the proportions of row clusters, the second on the proportions of column clusters, and the third on the pdf of each block or co-cluster. The objective is then to maximize the function $L\_C(\mathbf{z}, \mathbf{w}, \boldsymbol{\Theta})$.

#### **2.2 Latent Block Regression Model (LBRM)**

For co-clustering of continuous data, the Gaussian latent block model can be used. For instance, it is easy to show that the minimization of the well-known criterion $||\mathbf{X} - \mathbf{z}\boldsymbol{\mu}\mathbf{w}^\top||^2 = \sum\_{k=1}^{g} \sum\_{\ell=1}^{m} \sum\_{i|z\_{ik}=1} \sum\_{j|w\_{j\ell}=1} (x\_{ij} - \mu\_{k\ell})^2$, where $\mathbf{z} \in \{0,1\}^{n \times g}$, $\mathbf{w} \in \{0,1\}^{d \times m}$ and $\boldsymbol{\mu} \in \mathbb{R}^{g \times m}$, is associated with the Gaussian latent block model with $\alpha\_{k\ell} = (\mu\_{k\ell}, \sigma^2\_{k\ell})$ when the proportions of row clusters and column clusters are equal and, in addition, the variances of the blocks are identical [9]. Note that 1) a characteristic of the latent block model is that the rows and the columns are treated symmetrically, and 2) the estimation of the parameters requires a variational approximation [7, 17]. In the sequel, we show how a regression model can be integrated. Hereafter, we propose a novel Latent Block Regression model for simultaneous co-clustering and learning. The model considers the response matrix $\mathbf{Y} = [y\_{ij}] \in \mathbb{R}^{n \times d}$ and the covariate tensor $\mathcal{X} = [(1, \mathbf{x}\_{ij})] \in \mathbb{R}^{n \times d \times v}$ where $n$ is the number of rows, $d$ the number of columns, and $v$ the number of covariates (the leading 1 accounts for the intercept). Figure 1 presents the data structure for the proposed model LBRM.
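For illustration, the block criterion $||\mathbf{X} - \mathbf{z}\boldsymbol{\mu}\mathbf{w}^\top||^2$ mentioned above can be evaluated directly from hard partitions; a didactic sketch (list-of-lists matrices and one-hot partitions are assumptions of the example, not the authors' implementation):

```python
# ||X - z mu w^T||^2 for hard row/column partitions z (n x g) and w (d x m),
# given as one-hot matrices, and block means mu (g x m).

def block_sse(X, z, w, mu):
    err = 0.0
    for i, row in enumerate(X):
        k = z[i].index(1)          # row cluster of row i
        for j, x in enumerate(row):
            l = w[j].index(1)      # column cluster of column j
            err += (x - mu[k][l]) ** 2
    return err

X = [[1.0, 5.0], [1.0, 5.0]]
z = [[1, 0], [1, 0]]               # both rows in row cluster 0
w = [[1, 0], [0, 1]]               # one column per column cluster
mu = [[1.0, 5.0], [0.0, 0.0]]      # block means matching X exactly
print(block_sse(X, z, w, mu))      # 0.0
```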

In the following we propose to integrate a mixture of regressions [5] per block into the Latent Block Model (LBM), considering the distribution $\Phi(y\_{ij}|\mathbf{x}\_{ij}; \lambda\_{k\ell})$. We assume in the following the normality of $\Phi$,

$$\Phi(y\_{ij}|\mathbf{x}\_{ij};\lambda\_{k\ell}) = p(y\_{ij}|\mathbf{x}\_{ij}, \boldsymbol{\beta}\_{k\ell}, \sigma\_{k\ell}^2) = (2\pi\sigma\_{k\ell}^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma\_{k\ell}^2} \left(y\_{ij} - \boldsymbol{\beta}\_{k\ell}^\top \mathbf{x}\_{ij}\right)^2\right\}$$

**Fig. 1** Data representation for proposed model.
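The block-conditional Gaussian density can be written in log form as a small sketch (`x` carries a leading 1 for the intercept, as in the covariate tensor; the function name is an assumption of the illustration):

```python
import math

# log Phi(y | x; beta, sigma^2) for one block: Gaussian regression density,
# with x = (1, x_1, ..., x_v) and beta = (beta^0, beta^1, ..., beta^v).

def log_phi(y, x, beta, sigma2):
    mu = sum(b * xi for b, xi in zip(beta, x))  # beta^T x
    return -0.5 * (math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)

# y exactly on the regression line, unit variance -> -0.5 * log(2*pi):
print(round(log_phi(2.0, [1.0, 1.0], [1.0, 1.0], 1.0), 4))  # -0.9189
```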

With the LBRM, the parameter $\boldsymbol{\Omega}$ is composed of the row and column proportions $\boldsymbol{\pi}$ and $\boldsymbol{\rho}$ respectively, $\boldsymbol{\beta} = \{\boldsymbol{\beta}\_{11}, \dots, \boldsymbol{\beta}\_{gm}\}$ with $\boldsymbol{\beta}\_{k\ell}^\top = (\beta^0\_{k\ell}, \beta^1\_{k\ell}, \dots, \beta^v\_{k\ell})$ where $\beta^0\_{k\ell}$ represents the intercept of the regression, and $\boldsymbol{\sigma} = \{\sigma\_{11}, \dots, \sigma\_{gm}\}$. The classification log-likelihood can be written:

$$L\_C(\mathbf{z}, \mathbf{w}, \boldsymbol{\Omega}) = \sum\_{i,k} z\_{ik} \log \pi\_k + \sum\_{j,\ell} w\_{j\ell} \log \rho\_\ell - \frac{1}{2} \sum\_{k,\ell} z\_{.k} w\_{.\ell} \log(\sigma\_{k\ell}^2) - \sum\_{i,j,k,\ell} \frac{z\_{ik} w\_{j\ell}}{2\sigma\_{k\ell}^2} \left(y\_{ij} - \boldsymbol{\beta}\_{k\ell}^\top \mathbf{x}\_{ij}\right)^2$$

with $z\_{.k} = \sum\_i z\_{ik}$ and $w\_{.\ell} = \sum\_j w\_{j\ell}$.

# **3 Variational EM Algorithm**

To estimate $\boldsymbol{\Omega}$, the EM algorithm [3] is a candidate for this task. It maximizes the log-likelihood $f(\mathcal{X}, \boldsymbol{\Omega})$ w.r.t. $\boldsymbol{\Omega}$ iteratively by maximizing the conditional expectation of the complete-data log-likelihood $L\_C(\mathbf{z}, \mathbf{w}; \boldsymbol{\Omega})$ w.r.t. $\boldsymbol{\Omega}$, given a previous current estimate $\boldsymbol{\Omega}^{(c)}$ and the observed data. Unfortunately, difficulties arise owing to the dependence structure among the variables $x\_{ij}$ of the model. To solve this problem, an approximation using the interpretation of the EM algorithm given in [12] can be proposed; see, e.g., [7, 8]. Hence, the aim is to maximize the following lower bound of the log-likelihood criterion: $F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}; \boldsymbol{\Omega}) = L\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega}) + H(\tilde{\mathbf{z}}) + H(\tilde{\mathbf{w}})$ where $H(\tilde{\mathbf{z}}) = -\sum\_{i,k} \tilde{z}\_{ik} \log \tilde{z}\_{ik}$ with $\tilde{z}\_{ik} = P(z\_{ik} = 1|\mathcal{X})$, $H(\tilde{\mathbf{w}}) = -\sum\_{j,\ell} \tilde{w}\_{j\ell} \log \tilde{w}\_{j\ell}$ with $\tilde{w}\_{j\ell} = P(w\_{j\ell} = 1|\mathcal{X})$, and $L\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}; \boldsymbol{\Omega})$ is the fuzzy complete-data log-likelihood (up to a constant), given by

$$\begin{split} L\_{C}(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega}) &= \sum\_{i,k} \tilde{z}\_{ik} \log \pi\_{k} + \sum\_{j,\ell} \tilde{w}\_{j\ell} \log \rho\_{\ell} - \frac{1}{2} \sum\_{k,\ell} \tilde{z}\_{.k} \tilde{w}\_{.\ell} \log (\sigma\_{k\ell}^{2}) \\ &\quad - \sum\_{i,j,k,\ell} \frac{\tilde{z}\_{ik} \tilde{w}\_{j\ell}}{2\sigma\_{k\ell}^{2}} \left(y\_{ij} - \boldsymbol{\beta}\_{k\ell}^\top \mathbf{x}\_{ij}\right)^{2} \end{split}$$

The maximization of $F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$ can be reached by the following three optimizations: update $\tilde{\mathbf{z}}$ by $\operatorname{argmax}\_{\tilde{\mathbf{z}}} F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$, update $\tilde{\mathbf{w}}$ by $\operatorname{argmax}\_{\tilde{\mathbf{w}}} F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$, and update $\boldsymbol{\Omega}$ by $\operatorname{argmax}\_{\boldsymbol{\Omega}} F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$. In what follows, we detail the Expectation (E) and Maximization (M) steps of the Variational EM algorithm for tensor data.

**E-step.** It consists in computing, for all $i, k, j, \ell$, the posterior probabilities $\tilde{z}\_{ik}$ and $\tilde{w}\_{j\ell}$ maximizing $F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$ given the estimated parameters. It is easy to show that the posterior probability $\tilde{z}\_{ik}$ maximizing $F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$ is given by $\tilde{z}\_{ik} \propto \pi\_k \exp\left(\sum\_{j,\ell} \tilde{w}\_{j\ell} \log p(y\_{ij}|\mathbf{x}\_{ij}, \boldsymbol{\beta}\_{k\ell}, \sigma\_{k\ell})\right)$. In the same manner, the posterior probability $\tilde{w}\_{j\ell}$ is given by $\tilde{w}\_{j\ell} \propto \rho\_\ell \exp\left(\sum\_{i,k} \tilde{z}\_{ik} \log p(y\_{ij}|\mathbf{x}\_{ij}, \boldsymbol{\beta}\_{k\ell}, \sigma\_{k\ell})\right)$.

**M-step.** Given the previously computed posterior probabilities $\tilde{\mathbf{z}}$ and $\tilde{\mathbf{w}}$, the M-step consists in updating, for all $k, \ell$, the parameters $\pi\_k$, $\rho\_\ell$, $\boldsymbol{\beta}\_{k\ell}$ and $\sigma^2\_{k\ell}$ maximizing $F\_C(\tilde{\mathbf{z}}, \tilde{\mathbf{w}}, \boldsymbol{\Omega})$. Using the quantities computed in the E-step, the M-step involves the following closed-form updates.


$$\boldsymbol{\beta}\_{k\ell} = \left(\sum\_{i,j} \tilde{z}\_{ik} \tilde{w}\_{j\ell}\, \mathbf{x}\_{ij} \mathbf{x}\_{ij}^\top\right)^{-1} \left(\sum\_{i,j} \tilde{z}\_{ik} \tilde{w}\_{j\ell}\, y\_{ij} \mathbf{x}\_{ij}\right), \qquad \sigma\_{k\ell}^2 = \frac{\sum\_{i,j} \tilde{z}\_{ik} \tilde{w}\_{j\ell} \left(y\_{ij} - \boldsymbol{\beta}\_{k\ell}^\top \mathbf{x}\_{ij}\right)^2}{\sum\_{i,j} \tilde{z}\_{ik} \tilde{w}\_{j\ell}}$$

The proposed algorithm for tensor data, referred to as VEM-LBRM, alternates the two previously described Expectation and Maximization steps. At convergence, a hard co-clustering is deduced from the posterior probabilities.
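The E-step proportionalities are normalized per row; this can be done stably in log-space with the log-sum-exp trick. A minimal sketch (`score[i][k]` stands for the inner sum $\sum_{j,\ell}\tilde{w}_{j\ell}\log p(y_{ij}|\mathbf{x}_{ij},\boldsymbol{\beta}_{k\ell},\sigma_{k\ell})$, assumed precomputed; the function name is illustrative):

```python
import math

# z~_ik ∝ pi_k * exp(score[i][k]); normalize each row with log-sum-exp.

def row_posteriors(pi, score):
    out = []
    for row in score:
        logits = [math.log(p) + s for p, s in zip(pi, row)]
        m = max(logits)                       # subtract the max for stability
        w = [math.exp(l - m) for l in logits]
        total = sum(w)
        out.append([x / total for x in w])
    return out

print(row_posteriors([0.5, 0.5], [[0.0, 0.0]]))  # [[0.5, 0.5]]
```

The column posteriors $\tilde{w}_{j\ell}$ are normalized in exactly the same way, with $\boldsymbol{\rho}$ in place of $\boldsymbol{\pi}$.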

# **4 Experimental Results**

First, we evaluate the proposed VEM-LBRM on three synthetic datasets in terms of co-clustering and regression. We compare VEM-LBRM with several clustering and regression methods, namely the Global model (a single multiple linear regression model fitted on all observations), K-means, Clusterwise, Co-clustering, and SCOAL. We retain two widely used measures to assess the quality of clustering, namely the Normalized Mutual Information (NMI) [16] and the Adjusted Rand Index (ARI) [15]. Intuitively, NMI quantifies how informative the estimated clustering is about the true clustering. The ARI metric is related to clustering accuracy and measures the degree of agreement between an estimated clustering and a reference clustering. Both NMI and ARI are equal to 1 if the resulting clustering is identical to the true one. On the other hand, we use the RMSE (Root MSE) and MAE (Mean Absolute Error) metrics to evaluate the precision of prediction; RMSE is a loss function suitable for Gaussian noise, while MAE uses absolute values and is less sensitive to extreme values.

We generated tensor data $\mathcal{X}$ of size $200 \times 200 \times 2$ according to a Gaussian model per block. In the simulation study, we considered three scenarios by varying the regression parameters; the examples have blocks with different regression collinearity and different co-cluster structure complexity. The parameters for each example are reported in Table 1. Figures 2 and 3 depict the true regression planes and the true simulated response matrix $\mathbf{Y}$.



**Table 1** Parameter generation for the examples.

**Fig. 2** Synthetic data: True regression plans according to the chosen parameters.

**Fig. 3** Synthetic data: True co-clustering according to the chosen parameters.

In our illustrations, we consider both the co-clustering and the regression challenges. All metrics concerning rows and columns are computed by averaging over ten random 80%/20% training/validation splits. Thereby, we compare VEM-LBRM with the Global model (a multiple linear regression), K-means, and Clusterwise by reshaping the tensor to a matrix of size $N \times v$ where $N = n \times d$. On the other hand, the VEM algorithm for co-clustering is applied to the response matrix $\mathbf{Y}$. Furthermore, for the clustering algorithms, the RMSE, MAE, and R-squared are computed by applying linear regression on each obtained co-cluster. Table 2 reports the performances of all algorithms. The missing values represent measures that cannot be computed by the corresponding models. From these comparisons, we observe that, whether the block structure is easy to identify or not, VEM-LBRM outperforms the other algorithms.

To go further, note that in [11] the authors reformulated the clusterwise approach and introduced the linear cluster-weighted model (CWM) in a statistical setting, showing that it is a general and flexible family of mixture models. They included in

78


**Table 2** (co)-clustering and prediction: mean and sd in parentheses.

the classical model of clusterwise the probability Φ<sup>0</sup> (**x**<sup>𝑖</sup> |𝛀<sup>𝑘</sup> ) to model the covariates, whereas the classical cluster-wise model the output only using Φ(𝑦<sup>𝑖</sup> |**x**𝑖 ; 𝜆<sup>𝑘</sup> ). They prove that sufficient conditions for model identifiability are provided under a suitable assumption of Gaussian covariates [10]. We can include in LBRM a joint probability Φ0 (**x**𝑖 𝑗 |𝛀𝑘ℓ ) where 𝛀𝑘ℓ = [𝝁𝑘ℓ , 𝚺𝑘ℓ ] to evaluate its impact in terms of clustering and regression. Figure 4 presents the graphical model of LBRM and its extension. The first experiments on real datasets give encouraging results.

**Fig. 4** Graphical model of LBRM (left) and its extension (right).

# **5 Conclusion**

Inspired by the flexibility of the latent block model (LBM), we proposed extending it to tensor data with two tasks in mind: co-clustering and prediction. The resulting model (LBRM) gives rise to a variational EM algorithm for co-clustering and prediction, referred to as VEM-LBRM. This algorithm, which can be viewed as a co-clusterwise algorithm, can easily deal with sparse data. Empirical results on synthetic data showed that VEM-LBRM gives more encouraging results for clustering and regression than several algorithms devoted to one or both of these tasks. For future work, we plan to develop the extension of LBRM and to apply the proposed models to the recommender system task.

**Acknowledgements** Our work is funded by the German Federal Ministry of Education and Research under Grant Agreement Number 01IS19084F (XAPS).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Using Clustering and Machine Learning Methods to Provide Intelligent Grocery Shopping Recommendations**

Nail Chabane, Mohamed Achraf Bouaoune, Reda Amir Sofiane Tighilt, Bogdan Mazoure, Nadia Tahiri, and Vladimir Makarenkov

**Abstract** Nowadays, grocery lists are part of the shopping habits of many customers. With the popularity of e-commerce and the plethora of products and promotions available in online stores, it can become increasingly difficult for customers to identify products that both satisfy their needs and represent the best deals overall. In this paper, we present a grocery recommender system based on traditional machine learning methods that assists customers with the creation of their grocery lists on the MyGroceryTour platform, which displays weekly grocery deals in Canada. Our recommender system relies on individual user purchase histories, as well as on the available product and store features, to compose intelligent weekly grocery lists. Using clustering prior to supervised machine learning allowed us to identify customer profiles and reduce the choice of potential products of interest for each customer, thus improving the prediction results. The highest average F-score of 0.499 for the considered dataset of 826 Canadian customers was obtained using the Random Forest prediction model, which was compared to the Decision Tree, Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost, Support Vector Machine and Naive Bayes models in our study.

**Keywords:** clustering, dimensionality reduction, grocery shopping recommendation, intelligent shopping list, machine learning, recommender systems

Nail Chabane · Mohamed Achraf Bouaoune · Reda Amir Sofiane Tighilt · Vladimir Makarenkov ()

Université du Québec à Montreal, 405 Rue Sainte-Catherine Est, Montreal, Canada, e-mail: chabane.nail\_amine@courrier.uqam.ca; bouaoune.mohamed\_achraf@courrier.uqam.ca; tighilt.reda@courrier.uqam.ca; makarenkov.vladimir@uqam.ca

Bogdan Mazoure

McGill University and MILA - Quebec AI Institute, 845 Rue Sherbrooke O, Montreal, Canada, e-mail: bogdan.mazoure@mail.mcgill.ca

Nadia Tahiri

University of Sherbrooke, 2500 Bd de l'Université, Sherbrooke, Canada, e-mail: Nadia.Tahiri@USherbrooke.ca

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_10

# **1 Introduction**

Grocery shopping is a common activity that involves different factors such as budget and impulse purchasing pressure [1]. Customers typically rely on a mental or digital list to facilitate their grocery trips. Many of them show a favorable interest in tools and applications that help them manage their grocery lists while keeping them updated with special offers, coupons and promotions [2, 3]. Major retailers throughout the world typically offer discounts on different products every week in order to improve sales and attract new customers. As a result of this very common practice, thousands of items go on special simultaneously across different retailers in a given week. The resulting information overload often makes it difficult for customers to quickly identify the deals that best suit their needs, which can become a source of frustration [4]. To address this problem, many grocery stores have taken advantage of the popularity of e-commerce to set up their own websites featuring various functionalities, including recommender systems, to assist customers during the shopping process.

Recommender Systems (RSs) [5] are tools and techniques that offer personalized suggestions to users based on several parameters (e.g. their past behavior). RSs have recently become a field of interest for researchers and retailers, as many e-commerce platforms, online book stores and streaming services have started to offer this service on their websites (e.g. Amazon, Netflix and Spotify). Here, we recall some recent works in this field. Faggioli et al. [6] used the popular Collaborative Filtering (CF) approach to predict the customer's next basket in a grocery shopping context, taking into account the recency parameter. When comparing their model with the CF baseline models, Faggioli et al. observed a consistent improvement in their prediction results. Che et al. [7] used attention-based recurrent neural networks to capture both inter- and intra-basket relationships, thus modelling users' long-term preferences and dynamic short-term decisions.

Content-based recommendation has also proven effective in the literature, as demonstrated by Xia et al. [8], who proposed a tree-based model for coupon recommendation. By processing their data with undersampling methods, the authors were able to increase the estimated click rate from 1.20% to 7.80%, as well as to significantly improve the F-score results using the Random Forest classifier and the recall results using XGBoost. Dou [9] presented a statistical model to predict whether a user will buy an item or not using Yandex's CatBoost method [10]. Dou relied on contextual and temporal features, as well as on some session features such as the time of visit of specific web pages, to demonstrate the efficiency of CatBoost in this context. Finally, Tahiri et al. [11] used recurrent and feedforward neural networks (RNNs and FFNs) in combination with non-negative matrix factorization and gradient boosting trees to create intelligent weekly grocery baskets to be recommended to the users of MyGroceryTour. Tahiri et al. considered different features (from those in our study) characterizing the users of MyGroceryTour to provide their predictions, with the best F-score of 0.37 obtained on the augmented dataset.

# **2 Materials and Methods**

#### **2.1 Data Considered**

In this section, we describe the dataset obtained from the MyGroceryTour website and used in our research. MyGroceryTour [11] is a Canadian grocery shopping website and database available in both English and French. The main purpose of the website is to present the weekly specials offered by the major grocery retailers in Canada. It allows users to display grocery products available in their area, compare products across different stores, and build their grocery shopping baskets based on the provided insights. MyGroceryTour users can easily archive and manage their grocery lists and access them at any time.

In this study, we considered 826 MyGroceryTour users with varying numbers of grocery baskets (between 3 and 100 baskets were available per user). The grocery baskets contained the products added by users when creating their weekly shopping lists. In our recommender system (i.e. the current basket prediction experiment), we considered the following features:


In addition, we engineered the *total\_bought* feature, which represents, for each product, the total number of times it has been bought over all users.

#### **2.2 Data Normalization**

Data normalization is an important data preprocessing step in both unsupervised and supervised machine learning [12], as well as in data mining [13]. Prior to feeding the data to our models, we rescaled the available features using z-score standardization. Thus, all rescaled features had a mean of 0 and a standard deviation of 1:

$$z(\mathbf{x}\_f) = \frac{\mathbf{x}\_f - \mu\_f}{\sigma\_f},\tag{1}$$

where 𝑥<sub>𝑓</sub> is the original value of the observation at feature 𝑓, 𝜇<sub>𝑓</sub> is the mean and 𝜎<sub>𝑓</sub> is the standard deviation of 𝑓.
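Eq. (1) amounts to the following column-wise operation (a minimal NumPy sketch; the array values are illustrative):

```python
import numpy as np

def zscore(X):
    """Column-wise z-score standardization, Eq. (1): (x_f - mu_f) / sigma_f."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two illustrative features on three observations.
Xz = zscore(np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]))
# Each rescaled column now has mean 0 and standard deviation 1.
```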

#### **2.3 Further Data Preprocessing Steps**

In order to determine which weekly products could be recommended to a given user, we propose to classify them using both clustering (unsupervised learning) and traditional supervised machine learning methods. The final recommendation is based on the availability of the products, the data on the products' regular prices and available discounts, as well as on the user's shopping history. In our context, the baskets contain only the products bought by the users. The information about the other available products (not selected by the user at the moment he/she composed his/her shopping basket) is also available on MyGroceryTour; it was used to create a large class of available items that were not bought by the user.

While we considered the items bought by a given user as positive feedback, we regarded the items that were available to this user at the time of the order, but not acquired, as negative feedback. For an order of size 𝑃, if 𝑇 is the total number of items available to the user at the time of the order, the negative feedback 𝑁 for that order is 𝑁 = 𝑇 − 𝑃. In this context, 𝑁 usually represents thousands of products, while 𝑃 is typically below 50. This imbalance between positive and negative feedback can lead to imbalanced training data and result in a substantial loss in performance. Similarly to Xia et al. [8], we applied an undersampling method to balance our data instead of considering all of the available disregarded items as negative feedback.
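A minimal sketch of such random undersampling (the function name, the 1:1 ratio and the item lists are our assumptions; [8] does not prescribe this exact scheme):

```python
import random

def undersample_negatives(positives, negatives, ratio=1, seed=0):
    """Keep all P positive items and randomly sample up to ratio * P of the
    T - P negative (available but not bought) items to balance the classes."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return positives, rng.sample(negatives, k)

pos = ["milk", "bread", "eggs"]                 # P items bought by the user
neg = [f"item_{i}" for i in range(2000)]        # thousands of disregarded items
kept_pos, kept_neg = undersample_negatives(pos, neg)
```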

To identify customer profiles and preselect products likely to be of interest to a given user, we first carried out clustering of the normalized original dataset (the 𝐾-means [14] and DBSCAN [15] partitioning algorithms were used). Then, we limited the choice of items offered to a given user to the products purchased by the members of his/her cluster. By doing so, we reduced the number of products which could be recommended to the user and thus minimized potential classification mistakes. The clustering phase is detailed in Subsection 2.4. Traditional machine learning methods were then used to provide the final weekly recommendation. The size 𝑆 of the weekly basket recommended to a given user was set to the mean size of his/her previous shopping baskets. As the number of items recommended by the machine learning methods was often greater than 𝑆, we retained as the final recommendation the top 𝑆 items, ranked according to the confidence score (i.e. the probability estimate for a given observation, computed using the *predict\_proba* function from the *scikit-learn* [16] library).
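The top-𝑆 selection can be sketched as follows (a hypothetical helper around any fitted scikit-learn classifier; the helper name is ours):

```python
import numpy as np

def recommend_top_s(model, candidates, X_cand, s):
    """Rank candidate items by the classifier's confidence score for the
    positive class (predict_proba) and keep the top s of them."""
    proba = model.predict_proba(X_cand)[:, 1]
    order = np.argsort(proba)[::-1][:s]
    return [candidates[i] for i in order]
```

In practice, `model` would be one of the classifiers of Section 3 fitted on the user's history, and `X_cand` the feature rows of the products preselected by the clustering step.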

#### **2.4 Data Clustering**

In this section, we present the steps we carried out to obtain the clusters of users. As explanatory features used to generate the clusters, we considered the mean prices and mean specials of the products purchased by the user, as well as a new feature, called here the fidelity ratio 𝐹𝑅<sub>𝑢</sub>, which is meant to give insight into whether a given user 𝑢 has a favorite store where he/she makes most of his/her grocery purchases. 𝐹𝑅<sub>𝑢</sub> is defined as follows:


$$FR\_{u} = \frac{X\_{max,u} - \frac{1}{n-1}\sum\_{i=2}^{n} X\_{i,u}}{X\_{total,u}},\tag{2}$$

where 𝑋<sub>𝑚𝑎𝑥,𝑢</sub> is the total number of products bought by user 𝑢 at the store where he/she made most of his/her purchases, 𝑛 (𝑛 > 1) is the total number of stores visited by user 𝑢, and 𝑋<sub>𝑡𝑜𝑡𝑎𝑙,𝑢</sub> = 𝑋<sub>𝑚𝑎𝑥,𝑢</sub> + Σ<sub>𝑖=2</sub><sup>𝑛</sup> 𝑋<sub>𝑖,𝑢</sub> is the total number of products purchased by user 𝑢 over all the stores he/she visited. A high fidelity ratio means that user 𝑢 buys most of his/her products at the same store, whereas a low fidelity ratio indicates that user 𝑢 buys his/her products at different stores. When user 𝑢 purchases all of his/her products at the same store (𝑋<sub>𝑚𝑎𝑥,𝑢</sub> = 𝑋<sub>𝑡𝑜𝑡𝑎𝑙,𝑢</sub> and 𝑛 = 1), the fidelity ratio equals 1. It equals 0 when he/she purchases the same number of products at different stores.
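Eq. (2) translates directly into code (a sketch; the per-store counts below are illustrative):

```python
def fidelity_ratio(counts):
    """Fidelity ratio FR_u of Eq. (2). `counts` holds the number of products
    bought by user u in each store he/she visited (all entries positive)."""
    n, total, x_max = len(counts), sum(counts), max(counts)
    if n == 1:                       # single store: FR_u = 1 by definition
        return 1.0
    others = sorted(counts, reverse=True)[1:]   # X_{2,u}, ..., X_{n,u}
    return (x_max - sum(others) / (n - 1)) / total
```

For instance, a user buying 6, 2 and 2 products in three stores gets FR = (6 − 2)/10 = 0.4, while equal purchases in two stores give FR = 0.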

The 𝐾-means [14] and DBSCAN [15] algorithms were used to perform clustering. Here we present the results of DBSCAN, as the clusters it provided had less entity overlap than those provided by 𝐾-means. The main advantage of DBSCAN is that this density-based algorithm is able to capture clusters of any shape.

**Fig. 1** Davies-Bouldin cluster validity index variation with respect to the number of clusters.

We used the Davies-Bouldin (DB) [17] cluster validity index to determine the number of clusters in our dataset. The Davies-Bouldin index is the average similarity between each cluster 𝐶<sub>𝑖</sub>, for 𝑖 = 1, ..., 𝑘, and its most similar counterpart 𝐶<sub>𝑗</sub>. It is calculated as follows:

$$DB = \frac{1}{k} \sum\_{i=1}^{k} \max\_{j \neq i} R\_{ij},\tag{3}$$

where 𝑅<sub>𝑖𝑗</sub> is the similarity measure between clusters, calculated as (𝑑<sub>𝑖</sub> + 𝑑<sub>𝑗</sub>)/𝛿<sub>𝑖𝑗</sub>, where 𝑑<sub>𝑖</sub> (𝑑<sub>𝑗</sub>) is the mean distance between the objects of cluster 𝐶<sub>𝑖</sub> (𝐶<sub>𝑗</sub>) and the cluster centroid, and 𝛿<sub>𝑖𝑗</sub> is the distance between the centroids of clusters 𝐶<sub>𝑖</sub> and 𝐶<sub>𝑗</sub>.
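Eq. (3) can be checked with a small direct implementation (ours, using Euclidean distances; scikit-learn's `davies_bouldin_score` computes the same quantity):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index of Eq. (3): the mean, over clusters, of the
    worst-case similarity R_ij = (d_i + d_j) / delta_ij with any other
    cluster (lower is better)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # d_i: mean distance of the points of cluster i to its centroid.
    d = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, cents)])
    k = len(ks)
    return sum(max((d[i] + d[j]) / np.linalg.norm(cents[i] - cents[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k
```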

Figure 1 illustrates the variation of the Davies-Bouldin cluster validity index, whose lowest (i.e. best) value for our dataset was reached with 6 clusters. The resulting clusters are shown in Figure 2. After performing the clustering, we applied the t-SNE [18] dimensionality reduction method for data visualisation purposes. Since t-SNE is known to preserve the local but not the global structure of the data, we used the PCA initialization parameter to mitigate this issue.

**Fig. 2** Clustering obtained with DBSCAN using the best number of clusters according to the Davies-Bouldin index. Dimensionality reduction was performed using t-SNE. The 6 clusters of customers found by DBSCAN are represented by different symbols.

We have noticed that the users in Cluster 1 (see Fig. 2) are fairly sensitive to specials and have a high fidelity score, the users in Cluster 2 mostly purchase products on special in different stores, the users in Cluster 3 seem to be sensitive to the total price of their shopping baskets, Cluster 4 includes the users who are sensitive to specials but have a low fidelity score, Cluster 5 includes the users who are not very attracted by specials but are rather loyal to their favorite store(s), and the users in Cluster 6 tend to buy products on special and have high fidelity scores.

# **3 Application of Supervised Machine Learning Methods**

To predict the products to be recommended for the current weekly basket, we used the following supervised machine learning methods: Decision Tree, Random Forest, Gradient Boosting Tree, XGBoost, Logistic Regression, CatBoost, Support Vector Machine and Naive Bayes, through their *scikit-learn* implementations [16]. Due to the lack of large datasets, we did not use deep learning models in our study: the classical machine learning methods above are usually recommended for smaller datasets, contrary to their deep learning counterparts, and deep learning algorithms usually do not properly handle the mixed feature types present in our data. Most of the methods we used are ensemble methods, i.e. they aggregate multiple replicates to reduce the variance. The F-score results provided by each method without clustering (using all available products) and with clustering (using only the products purchased by the cluster members) are presented in Table 1.
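A hedged sketch of such a comparison using the scikit-learn implementations (on synthetic data, not the MyGroceryTour dataset; XGBoost and CatBoost are omitted since they live in separate packages):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (bought / not bought) prediction task.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}
# F-score of each model on the held-out split.
scores = {name: f1_score(yte, m.fit(Xtr, ytr).predict(Xte))
          for name, m in models.items()}
```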

As shown in Table 1, Random Forest outperformed the other competing methods both without and with data clustering, providing average F-scores of 0.494 and 0.499 (obtained over all users), respectively. Tree-based models relying on gradient boosting performed relatively well and could possibly give better results with different data preprocessing. We can also notice that all the methods, except CatBoost, benefited from the data clustering process.


**Table 1** F-scores provided by ML methods without and with clustering of MyGroceryTour users.

# **4 Conclusion**

In this paper, we presented a novel recommender system intended to predict the content of a customer's weekly basket based on his/her purchase history. Our system is also able to predict the store(s) where the purchase(s) will take place. The clustering step allowed us to identify customer profiles and to improve the F-score for every tested machine learning model, except CatBoost. Using our methodology and the new data available on MyGroceryTour, we improved the F-score performance by a margin of 0.129 compared to the results obtained by Tahiri et al. [11]. Our model can predict products that will be purchased again or acquired for the first time by a given user, but it is not yet able to predict the optimal quantity of each product to be bought. Another important issue is how to provide plausible recommendations for customers without a shopping history (i.e. the cold start problem). We will tackle these important issues in our future work.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **COVID-19 Pandemic: a Methodological Model for the Analysis of Government's Preventing Measures and Health Data Records**

Theodore Chadjipadelis and Sofia Magopoulou

**Abstract** The study aims to investigate the associations between the government's response measures during the COVID-19 pandemic and weekly incidence data (positivity rate, mortality rate and testing rate) in Greece. The study focuses on the period from the detection of the first case in the country (26th February 2020) to the first week of 2022 (08th January 2022). Data analysis was based on Correspondence Analysis on a fuzzy-coded contingency table, followed by Hierarchical Cluster Analysis (HCA) on the factor scores. Results revealed distinct time periods during which interesting interactions took place between control measures and incidence data.

**Keywords:** hierarchical cluster analysis, correspondence analysis, COVID-19, evidence-based policy making

# **1 Introduction**

The present study focuses on the period of the COVID-19 pandemic in Greece, from the detection of the first case of COVID-19 to the first week of 2022. This period can be divided into five distinct phases. The first phase extends from the beginning of 2020 until the first lockdown, i.e., from the first case reported in Greece until the end of the first quarantine period in May 2020. The second phase concerns the interim period from June to October 2020, when the pandemic indices improved, and policies were loosened for the opening of tourism. The third phase concerns the second lockdown and the evolution of the pandemic in the country from November 2020 to April 2021, when the first vaccination period of the adult population took place. The fourth phase includes the interim period from May 2021 to October

© The Author(s) 2023, P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_11

Theodore Chadjipadelis ()

Aristotle University of Thessaloniki, Greece, e-mail: chadji@polsci.auth.gr

Sofia Magopoulou

Aristotle University of Thessaloniki, Greece, e-mail: sofimago@polsci.auth.gr

2021, where a general stabilization of the number of cases occurred, while the last period refers to a significant increase in the number of cases from November 2021 to January 2022.

Overall, from March 2020 to January 2022, a total of 1.79 million cases of COVID-19 (Figure 1) and a total of 22,635 deaths were recorded in Greece. As of January 2022, vaccination coverage exceeds 65% of the country's population, i.e., 7,241,468 fully vaccinated citizens.

**Fig. 1** Record of cases of COVID-19 in Greece (March 2020-January 2022).

In this study, a combination of multivariate data analysis methods was employed to analyze COVID-19-related data, so as to assess the quality of decision-making outputs during the crisis and to improve evidence-based decision-making processes. Section 2 presents the methodology and describes the data sources and the data analysis workflow. Section 3 presents the study results, and Section 4 discusses them, proposes methodological tools and presents the conclusions.

# **2 Methodology**

#### **2.1 Data**

For the study purposes, data from the Oxford Covid-19 Government Response Tracker (OxCGRT) were combined with self-collected Covid-19 data for Greece [3], updated daily in Greek. The OxCGRT collects publicly available information reflecting government responses in 180 countries since 1 January 2020 [4]. The tracker is based on data for 23 indicators. In this study, two groups of indicators were considered for the case of Greece: Containment & Closure and Health Systems. The first group of indicators refers to "collective"-level policies and measures, such as school closures and mobility restrictions, while the second refers to "individual"-level policies and measures, such as testing and vaccination. Specifically, the collective-level indicators refer to policies taken by the government that affect society at a collective level: school closing, workplace closing, cancelation of public events, restrictions on gatherings, closure of public transport, stay-at-home requirements, internal movement restrictions and international travel controls. The health system policies primarily touch upon the individual level and specifically refer to: public information campaigns, testing, contact tracing, healthcare facilities, vaccine investments, facial coverings, vaccination and protection of the elderly. All collective-level indicators (C1 to C8) were summed to yield a total score (ranging from 0 to 16). Similarly, the individual-level indicators (H1 to H3 and H6 to H8) were summed to compute a total score (ranging from 0 to 12).

The self-collected data refer to positive cases, number of Covid-19-related deaths, number of tests and total number of vaccinations administered. These data have been recorded daily since March 2020 from public announcements by official and verified sources. A total of 94 time points were considered in the present study, corresponding to weekly data (Monday was used as a reference). Three quantitative indicators were derived, a positivity index (#cases / #tests), a mortality index (#deaths / #cases) and a testing index (#tests / #people). The number of vaccinations is not used in the present study because the vaccination process began in January 2021 and the administration of the booster dose began in September 2021. The final data set consisted of five indicators: two ordinal total scores, and three quantitative indices.
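The three derived indices are simple ratios of the weekly counts (a sketch; the counts below are illustrative, not actual Greek data):

```python
def weekly_indices(cases, deaths, tests, population):
    """Positivity, mortality and testing indices for one week."""
    return {
        "positivity": cases / tests,      # #cases / #tests
        "mortality": deaths / cases,      # #deaths / #cases
        "testing": tests / population,    # #tests / #people
    }

r = weekly_indices(cases=500, deaths=10, tests=5000, population=100000)
```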

#### **2.2 Data Analysis**

A four-step data analysis strategy was adopted. In the first step, the three quantitative variables (positivity rate, mortality rate and testing rate) were transformed into ordinal categorical variables with minimum information loss, via a method used in [7] (see Step 1 therein). Three ordinal variables were derived. In the second step, the five ordinal variables (i.e., the three recoded variables and the two ordinal total scores) were fuzzy-coded into three categories each, using the barycentric coding scheme proposed in [7]. This scheme has recently been evaluated in the context of hierarchical clustering in [7] and was applied with the DIAS Excel add-in [6]. Barycentric coding allows us to convert an m-point ordinal variable into an n-point fuzzy-coded variable [6, 7]. The fuzzy coding thus resulted in a generalized 0-1 matrix (fuzzy-coded matrix), where for each variable we obtain the estimated probability of each category. A drawback of the proposed approach is that the ordinal information in the 5 ordinal variables is lost.

The third step involved the application of Correspondence Analysis (CA) on the fuzzy-coded table with the 94 weeks as rows and the fifteen fuzzy categories as columns (see [1] for a similar approach). The number of significant axes was determined based on the percentage of inertia explained, and the significant points on each axis were determined based on the values of two statistics that accompany standard CA output: quality of representation (COR) greater than 200 and contribution (CTR) greater than 1000/(n + 1), where n is the total number of categories (i.e., 15 in our case). In the final step, Hierarchical Cluster Analysis (HCA) using Benzecri's chi-squared distance and Ward's linkage criterion [2, 8] was employed to cluster the 94 points (weeks) in the space of the CA axes obtained in the previous step. The number of clusters was determined using the empirical criterion of the change in the ratio of between-cluster inertia to total inertia when moving from a partition with r clusters to a partition with r − 1 clusters [8]. Lastly, we interpreted the clusters after determining the contribution of each indicator to each cluster. All analyses were conducted with the M.A.D. [Méthodes de l'Analyse des Données] software [5].
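The final clustering step can be sketched as follows (an illustrative SciPy approximation on random stand-in factor scores; the paper uses Benzecri's chi-squared distance within the M.A.D. software, whereas Euclidean/Ward clustering of CA factor scores is used here as a common proxy):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
factor_scores = rng.normal(size=(94, 4))    # 94 weeks on 4 significant CA axes

Z = linkage(factor_scores, method="ward")   # Ward's linkage criterion
labels = fcluster(Z, t=7, criterion="maxclust")   # cut into (at most) 7 clusters
```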

# **3 Results**

Correspondence Analysis resulted in four significant axes, which explain 74.91% of the total inertia (Figure 2). For each axis, we describe the main contrast between groups of categories based on their coordinates, COR and CTR values (Figure 3). "Low and moderate mortality rates" and "high testing rates" define a pole on the first axis, which is opposed to "average and high levels of 'individual' measures". On the second axis, "low positivity rate" and "average levels of collective measures" define a pole, while "average and high positivity rate" and "high levels of collective measures" define the opposite pole. The third axis is characterized by "moderate and high mortality rate", "high levels of collective measures" and "average levels of individual measures", which are opposed to "average levels of collective measures". On the fourth axis, "average levels of collective measures" are opposed to "average testing rate" and "high levels of collective measures".


**Fig. 2** Explained inertia by axis.


**Fig. 3** Category coordinates on the four CA axes (#G), quality of representation (COR) and contribution (CTR). COR values greater than 200 and CTR values greater than 1000 / 16 = 62.5 are shown in yellow. Positive coordinates are shown in green and negative in pink.

Hierarchical Cluster Analysis on the factor scores resulted in seven clusters using the empirical criterion for cluster determination (see Section 2.2). The corresponding dendrogram is shown in Figure 4. The seven nodes in the figure that correspond to the seven clusters are 182, 181, 175, 177, 171, 133 and 179. The cluster content reflects the different periods (phases) presented in the introductory section.

**Fig. 4** Dendrogram of the HCA.

The first cluster (182) combines data points from March 2020, the onset of the pandemic, with data points from a period following the summer of 2020 (October and November). This cluster is characterized by high positivity rate, low testing rate, high levels of "collective" measures (containment & closure) and low levels of "individual" measures (health system). The second cluster (181) contains data points from April and May 2020 and is characterized by low positivity rate, average to high mortality rate, low testing rate, high levels of "collective" measures and average levels of "individual" measures. The third cluster (175) combines the summer months of 2020 and 2021; it is characterized by low positivity rate, low testing rate and average levels of "collective" measures. The fourth cluster (177) marks the period of December 2020 and the spring of 2021, with average positivity rate and high levels of "collective" measures. The fifth cluster (171) refers to the period from December 2020 to February 2021, but also includes August 2021, with high levels of "collective" measures. The sixth cluster (133) refers to the period following the summer of 2021 (September and October 2021); in this cluster, average positivity rates were observed, but also strict containment and closure measures.

Lastly, the seventh cluster (179) refers to November and December 2021, including also January 2022, with high positivity and high testing rates, while high levels of containment and closure and health system measures were observed. Figure 5 shows the contributions of each indicator in each cluster.


**Fig. 5** Cluster description (contribution values of the indicators in each cluster - node).

# **4 Discussion**

Based on the study results, we can argue that, when it comes to measures and real time data following a situation such as the pandemic, "the chicken and egg" dilemma arises. The question is whether "collective" and "individual" measures affect daily incidence data or the inverse (i.e., that the daily data lead to measures). We conclude that in fact the two should be perceived as working in conjunction and not independently from one another. The analysis showed that lower positivity rate is accompanied by average levels of measures from the government at both the "individual" and the "collective" level. Furthermore, higher positivity rate is accompanied by higher levels of measures, as a response. With regard to mortality rate, we observed that higher mortality invokes higher levels of "collective" measures and average levels of "individual" measures, whereas average levels of "collective" measures are associated with higher mortality rate.

It is therefore evident that, when it comes to decision making in crisis situations, the systematic collection, analysis and use of data is linked to a more effective government response overall. Evidence-based policy making should therefore be linked to crisis management. This paper presents a first attempt to capture an ongoing phenomenon, and it is thus crucial that the collection and analysis of data continue until the phenomenon has run its course.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **pcTVI: Parallel MDP Solver Using a Decomposition into Independent Chains**

Jaël Champagne Gareau, Éric Beaudry, and Vladimir Makarenkov

**Abstract** Markov Decision Processes (MDPs) are useful to solve real-world probabilistic planning problems. However, finding an optimal solution in an MDP can take an unreasonable amount of time when the number of states in the MDP is large. In this paper, we present a way to decompose an MDP into Strongly Connected Components (SCCs) and to find dependency chains for these SCCs. We then propose a variant of the Topological Value Iteration (TVI) algorithm, called *parallel chained TVI* (pcTVI), which is able to solve independent chains of SCCs in parallel leveraging modern multicore computer architectures. The performance of our algorithm was measured by comparing it to the baseline TVI algorithm on a new probabilistic planning domain introduced in this study. Our pcTVI algorithm led to a speedup factor of 20, compared to traditional TVI (on a computer having 32 cores).

**Keywords:** Markov decision process, automated planning, strongly connected components, dependency chains, parallel computing

# **1 Introduction**

Automated planning is a branch of Artificial Intelligence (AI) aiming at finding optimal plans to achieve goals. One example of problems studied in automated planning is the electric vehicle path-planning problem [1]. Planning problems with non-deterministic actions are known to be much harder to solve. Markov Decision Processes (MDPs) are generally used to solve such problems, leading to probabilistic models of applicable actions [2].

Jaël Champagne Gareau ()
Université du Québec à Montréal, Canada, e-mail: champagne\_gareau.jael@uqam.ca

Éric Beaudry
Université du Québec à Montréal, Canada, e-mail: beaudry.eric@uqam.ca

Vladimir Makarenkov
Université du Québec à Montréal, Canada, e-mail: makarenkov.vladimir@uqam.ca

© The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_12

In probabilistic planning, a solution is generally a policy, i.e., a mapping specifying which action should be executed in each observed state to achieve an objective. Usually, dynamic programming algorithms such as Value Iteration (VI) are used to find an optimal policy [3]. Since VI is time-expensive, many improvements have been proposed to find an optimal policy faster, using for example the Topological Value Iteration (TVI) algorithm [4]. However, very large domains often remain out of reach. One unexplored way to reduce the computation time of TVI is by taking advantage of the parallel architecture of modern computers and by decomposing an MDP into independent parts which could be solved concurrently.

In this paper, we show that state-of-the-art MDP planners such as TVI can run an order of magnitude faster when considering task-level parallelism of modern computers. Our main contributions are as follows:


# **2 Related Work**

Many MDP solvers are based on the Value Iteration (VI) algorithm [3], or more precisely on asynchronous variants of VI. In asynchronous VI, MDP states can be backed up in any order and do not need to be considered the same number of times. One way to take advantage of this is by assigning a priority to every state and by considering them in priority order.

Several state-of-the-art MDP algorithms have been proposed to increase the speed of computation. Many of them are able to focus on the most promising parts of the MDP through heuristic search algorithms such as LRTDP [5] or LAO\* [6]. Other MDP algorithms use partitioning methods to decompose the state-space into smaller parts. For example, the P3VI (Partitioned, Prioritized, Parallel Value Iteration) algorithm partitions the state-space, uses a priority metric to order the partitions in an approximate best solving order, and solves them in parallel [7]. The biggest disadvantage of P3VI is that the partitioning is done on a case-by-case basis depending on the planning domain, i.e., P3VI does not include a general state-space decomposition method. The inter-process communication between the solving threads also incurs an overhead on the computation time. The more recent TVI (Topological Value Iteration) algorithm [4] also decomposes the state-space, but does so by considering the topological structure of the underlying graph of the MDP, making it more general than P3VI. Unfortunately, to the best of our knowledge, no parallel version of TVI has been proposed in the literature.

# **3 Problem Definition**

There exist different types of MDP, including Finite-Horizon MDP, Infinite-Horizon MDP and Stochastic Shortest Path MDP (SSP-MDP) [2]. The first two of them can be viewed as special cases of SSP-MDP [8]. In this work, we focus on SSP-MDPs, which we describe formally in Definition 1 below.

**Definition 1** A *Stochastic Shortest Path MDP* (SSP-MDP) is given by a tuple (𝑆, 𝐴, 𝑇, 𝐶, 𝐺), where:
- 𝑆 is a finite set of states;
- 𝐴 is a finite set of actions;
- 𝑇 : 𝑆 × 𝐴 × 𝑆 → [0, 1] is the transition function, where 𝑇(𝑠, 𝑎, 𝑠′) gives the probability of reaching state 𝑠′ when executing action 𝑎 in state 𝑠;
- 𝐶 : 𝑆 × 𝐴 → R is the cost function;
- 𝐺 ⊆ 𝑆 is the set of goal states.


We generally search for a policy 𝜋 : 𝑆 → 𝐴 that tells us which action should be executed at each state, such that an execution following the actions given by 𝜋 until a goal is reached has a minimal expected cost. This expected cost is given by a value function 𝑉<sup>𝜋</sup> : 𝑆 → R. The Bellman Optimality Equations are a system of equations satisfied by any optimal policy.

**Definition 2** The Bellman Optimality Equations are the following:

$$V(s) = \begin{cases} 0, & \text{if } s \in G, \\ \min\_{a \in A} \left[ C(s, a) + \sum\_{s' \in S} T(s, a, s') V(s') \right], & \text{otherwise.} \end{cases}$$

The expression between square brackets is called the *Q-value* of a state-action pair:

$$\mathcal{Q}(s, a) = C(s, a) + \sum\_{s' \in \mathcal{S}} T(s, a, s')V(s').$$

When an optimal value function 𝑉<sup>★</sup> has been computed, an optimal policy 𝜋<sup>★</sup> can be found greedily:

$$
\pi^\star(\mathbf{s}) = \operatorname{argmin}\_{a \in A} \mathcal{Q}^\star(\mathbf{s}, a).
$$

Most MDP solvers are based on dynamic programming algorithms like Value Iteration (VI), which iteratively update an arbitrarily initialized value function until convergence with a given precision 𝜖. In the worst case, VI needs to do |𝑆| sweeps of the state space, where one sweep consists of updating the value estimate of every state using the Bellman Optimality Equations. Hence, the number of state updates (each called a *backup*) is O(|𝑆|<sup>2</sup>). When the MDP is acyclic, most of these backups are wasteful, since the MDP can in this situation be solved using only |𝑆| backups (ordered in reverse topological order), thus allowing one to find an optimal policy in O(|𝑆|) [8].
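To make the iteration concrete, here is a minimal Python sketch of VI with asynchronous backups and greedy policy extraction; the toy two-state domain and all names are illustrative assumptions of ours, not the paper's implementation.

```python
# Minimal sketch of Value Iteration (VI) for an SSP-MDP; the toy domain below
# is illustrative only (not from the paper).

def value_iteration(states, actions, T, C, goals, eps=1e-6):
    """Bellman backups until the largest value change falls below eps.

    T[s][a] is a list of (next_state, probability) pairs; C[s][a] is a cost.
    """
    V = {s: 0.0 for s in states}                 # arbitrary initialization
    while True:
        residual = 0.0
        for s in states:
            if s in goals:
                continue                         # V(s) = 0 for every goal state
            q = min(C[s][a] + sum(p * V[sn] for sn, p in T[s][a])
                    for a in actions[s])
            residual = max(residual, abs(q - V[s]))
            V[s] = q                             # asynchronous (in-place) backup
        if residual < eps:
            return V

def greedy_policy(states, actions, T, C, goals, V):
    """Extract a policy greedily from the value function (argmin of Q-values)."""
    return {s: min(actions[s],
                   key=lambda a: C[s][a] + sum(p * V[sn] for sn, p in T[s][a]))
            for s in states if s not in goals}

# Toy domain: from s0, 'risky' reaches goal g with probability 0.5 for cost 1
# (otherwise stays in s0); 'safe' reaches g surely for cost 3.
states, goals = ['s0', 'g'], {'g'}
actions = {'s0': ['risky', 'safe'], 'g': []}
C = {'s0': {'risky': 1.0, 'safe': 3.0}}
T = {'s0': {'risky': [('g', 0.5), ('s0', 0.5)], 'safe': [('g', 1.0)]}}
V = value_iteration(states, actions, T, C, goals)
```

Here the expected cost of repeating 'risky' until success is 1/0.5 = 2 < 3, so the greedy policy selects it; the backup line mirrors the Bellman Optimality Equations of Definition 2.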

# **4 Parallel-chained TVI**

In this section, we describe an improvement to the TVI algorithm, named pcTVI (Parallel-Chained Topological Value Iteration), which is able to solve an MDP in parallel (as P3VI). pcTVI uses the decomposition proposed by TVI, known to give good performance on many planning domains. We start by summarizing how the original TVI algorithm works.

First, TVI uses Kosaraju's graph algorithm on a given MDP to find the strongly connected components (SCCs) of its graphical structure (the graph corresponding to its all-outcomes determinization). The SCCs are found by Kosaraju's algorithm in reverse topological order, which means that for every 𝑖 < 𝑗, there is no path from a state in the 𝑖-th SCC to a state in the 𝑗-th SCC. This property ensures that every SCC can be solved separately by VI sweeps if the previous SCCs (according to the reverse topological order) have already been solved. The second step of TVI is thus to solve every SCC one by one in that order. Since TVI divides the MDP into multiple subparts, it maximizes the usefulness of every state backup by ensuring that only useful information (i.e., converged state values) is propagated through the state-space.

Unfortunately, TVI can only solve one SCC at a time. Since modern computers have many computing units (cores) which can work in parallel, we could theoretically solve many SCCs in parallel to greatly reduce computation time. Instead of choosing SCCs to solve in parallel arbitrarily or using a priority metric (as in P3VI), which incur a computational overhead to propagate the values between the threads, we want to consider their topological order (as in TVI) to minimize redundant or useless computations. One way to share the work between the processes is to find independent chains of SCCs which can be solved in parallel. The advantage of independent chains is that no coordination and communication is needed between the SCCs, which both removes some running-time overhead and simplifies the implementation.

The Parallel-Chained TVI algorithm we propose (Algorithm 1) works as follows. First, we find the graph 𝐺 corresponding to the graphical structure of the MDP, decompose it into SCCs, and find the reverse topological order of the SCCs (as in TVI, but we use Tarjan's algorithm instead of Kosaraju's algorithm since it is about twice as fast). We then build the condensation of the graph 𝐺, i.e., the graph 𝐺<sub>𝑐</sub> whose vertices are the SCCs of 𝐺, where an edge is present between two vertices 𝑠𝑐𝑐<sub>1</sub> and 𝑠𝑐𝑐<sub>2</sub> if there exists an edge in 𝐺 between a state 𝑠<sub>1</sub> ∈ 𝑠𝑐𝑐<sub>1</sub> and a state 𝑠<sub>2</sub> ∈ 𝑠𝑐𝑐<sub>2</sub>. We also store the reversed edges in 𝐺<sub>𝑐</sub>, as well as a counter 𝑐<sub>𝑠𝑐𝑐</sub> on every vertex 𝑠𝑐𝑐 which indicates how many incoming neighbors have not yet been computed. We use this (usually small) graph 𝐺<sub>𝑐</sub> to detect which SCCs are ready to be considered (the SCCs whose incoming neighbors have all been determined with precision 𝜖, i.e., the SCCs whose associated counter 𝑐<sub>𝑠𝑐𝑐</sub> is 0). When a new SCC is ready, it is inserted into a work queue from which the waiting threads acquire their next task.

```
1: procedure pcTVI(𝑀: MDP, 𝑡: Number of threads)
2: ⊲ Find the SCCs of 𝑀
3: 𝐺 ← Graph(𝑀) ⊲ 𝐺 implicitly shares the same data structures as 𝑀
4: 𝑆𝐶𝐶𝑠 ← Tarjan(𝐺) ⊲ SCCs are found in reverse topological order
5:
6: ⊲ Build the graph of SCCs of 𝐺
7: 𝐺𝑐 ← GraphCondensation(𝐺, 𝑆𝐶𝐶𝑠)
8:
9: ⊲ Solve in parallel independent SCCs
10: 𝑃𝑜𝑜𝑙 ← CreateThreadPool(𝑡) ⊲ Create 𝑡 threads
11: 𝑉 ← NewValueFunction() ⊲ Arbitrarily initialized; Shared by all threads
12: 𝑄 ← CreateQueue() ⊲ Shared by all threads
13: Insert(𝑄, Head(𝑆𝐶𝐶𝑠)) ⊲ The goal SCC is inserted in the queue
14: while NotEmpty(𝑄) do ⊲ Only one thread runs this loop
15: 𝑠𝑐𝑐 ← ExtractNextItem(𝑄)
16: for all 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ∈ Neighbors(𝑠𝑐𝑐) do
17: Decrement NumIncomingNeighbors(𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟)
18: if NumIncomingNeighbors(𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟) = 0 then
19: AssignTaskToAvailableThread(𝑃𝑜𝑜𝑙, PartialVI(𝑀, 𝑉 , 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟))
20: Push(𝑄, 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟) ⊲ Neighbors of 𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 are ready to be considered next
21: end if
22: end for
23: end while
24:
25: ⊲ Compute and return an optimal policy using the computed value function
26: Π ← GreedyPolicy(𝑉 )
27: return Π
28: end procedure
```
# **Algorithm 1** Parallel-Chained Topological Value Iteration

# **5 Empirical Evaluation**

In this section, we evaluate empirically the performance of pcTVI by comparing it to the three following algorithms: (1) VI, the standard dynamic programming algorithm (here we use its asynchronous round-robin variant); (2) LRTDP, a well-known heuristic search algorithm; and (3) TVI, the Topological Value Iteration algorithm described in Section 4. In the case of LRTDP, we used the admissible and domain-independent ℎmin heuristic, first described in the original paper introducing LRTDP [5]:

$$h\_{\min}(s) = \begin{cases} 0, & \text{if } s \in G, \\ \min\_{a \in A\_s} \left[ C(s, a) + \min\_{s' \in succ\_a(s)} V(s') \right], & \text{otherwise}, \end{cases}$$

where 𝐴<sub>𝑠</sub> denotes the set of applicable actions in state 𝑠 and 𝑠𝑢𝑐𝑐<sub>𝑎</sub>(𝑠) is the set of successors when applying action 𝑎 at state 𝑠. The four competing algorithms (VI, TVI, LRTDP and pcTVI) were implemented in C++ by the authors of this paper and compiled using the GNU g++ compiler (version 11.2). All tests were performed on a computer equipped with four Intel Xeon E5-2620V4 processors (each of them having 8 cores at 2.1 GHz, for a total of 32 cores). For every test domain, we measured the running time of the four compared algorithms carried out until convergence to an 𝜖-optimal value function (we used 𝜖 = 10<sup>−6</sup>). Every domain was tested 15 times with randomly generated MDP instances. To minimize random factors, we report the median values obtained over these 15 MDP instances.
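The ℎmin recurrence above can be computed as a simple fixed point; the sketch below and its toy two-step chain are illustrative assumptions of ours.

```python
# Fixed-point sketch of the h_min heuristic: h(s) = 0 on goals, otherwise
# min over actions of [C(s, a) + min over successors of h(s')]. The toy
# two-step chain below is illustrative only.

def h_min(states, actions, C, succ, goals, eps=1e-9):
    """succ[s][a] lists the possible successors of action a in state s."""
    h = {s: 0.0 for s in states}
    changed = True
    while changed:                     # sweep until the recurrence stabilizes
        changed = False
        for s in states:
            if s in goals:
                continue
            new = min(C[s][a] + min(h[sn] for sn in succ[s][a])
                      for a in actions[s])
            if abs(new - h[s]) > eps:
                h[s], changed = new, True
    return h

# Chain s0 -> s1 -> g with unit costs; each action may also stay in place,
# which h_min optimistically ignores by taking the best successor.
states, goals = ['s0', 's1', 'g'], {'g'}
actions = {'s0': ['step'], 's1': ['step'], 'g': []}
C = {'s0': {'step': 1.0}, 's1': {'step': 1.0}}
succ = {'s0': {'step': ['s1', 's0']}, 's1': {'step': ['g', 's1']}}
h = h_min(states, actions, C, succ, goals)
```

Taking the minimum over successors (instead of the probability-weighted sum) is what makes the heuristic optimistic, hence admissible.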

Since there is no standard MDP domain in the scientific literature suitable to benchmark a parallel MDP solver, we propose a new general parametric MDP domain that we use to evaluate the algorithms. This domain, which we call Chained-MDP, uses 5 parameters: (1) 𝑘, the number of independent chains {𝑐<sub>1</sub>, 𝑐<sub>2</sub>, . . . , 𝑐<sub>𝑘</sub>} in the MDP; (2) 𝑛<sub>𝑠𝑐𝑐</sub>, the number of SCCs {𝑠𝑐𝑐<sub>𝑖,1</sub>, 𝑠𝑐𝑐<sub>𝑖,2</sub>, . . . , 𝑠𝑐𝑐<sub>𝑖,𝑛𝑠𝑐𝑐</sub>} in every chain 𝑐<sub>𝑖</sub>; (3) 𝑛<sub>𝑠𝑝𝑠</sub>, the number of states per SCC; (4) 𝑛<sub>𝑎</sub>, the number of applicable actions per state; and (5) 𝑛<sub>𝑒</sub>, the number of probabilistic effects per action. The possible successors 𝑠𝑢𝑐𝑐(𝑠) of a state 𝑠 in 𝑠𝑐𝑐<sub>𝑖,𝑗</sub> are the states in 𝑠𝑐𝑐<sub>𝑖,𝑗</sub> and either the states in 𝑠𝑐𝑐<sub>𝑖,𝑗+1</sub> if it exists, or the goal state otherwise. When generating the transition function of a state-action pair (𝑠, 𝑎), we sampled 𝑛<sub>𝑒</sub> states uniformly from 𝑠𝑢𝑐𝑐(𝑠) with random probabilities. In each of our tests, we used 𝑛<sub>𝑠𝑐𝑐</sub> = 2, 𝑛<sub>𝑎</sub> = 5 and 𝑛<sub>𝑒</sub> = 5. A representation of a Chained-MDP instance is shown in Figure 1.

**Fig. 1** A chained-MDP instance where 𝑛<sup>𝑐</sup> = 3 and 𝑛𝑠𝑐𝑐 = 4. Each ellipse represents a strongly connected component.
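A Chained-MDP instance with these parameters can be generated along the following lines; the state encoding and the cost-free transition table are our own illustrative choices, not the generator used in the paper.

```python
# Illustrative generator for the Chained-MDP domain: k chains of n_scc blocks
# with n_sps states each; successors stay in the same block or move to the
# next one (or to the goal at the end of a chain).

import random

def make_chained_mdp(k, n_scc, n_sps, n_a, n_e, seed=0):
    rng = random.Random(seed)
    goal = ('goal',)
    T = {goal: {}}                        # T[s][a] = list of (s', prob) pairs
    for i in range(k):
        for j in range(n_scc):
            block = [('s', i, j, m) for m in range(n_sps)]
            nxt = ([('s', i, j + 1, m) for m in range(n_sps)]
                   if j + 1 < n_scc else [goal])
            candidates = block + nxt      # succ(s): same SCC or the next one
            for s in block:
                T[s] = {}
                for a in range(n_a):
                    outs = rng.sample(candidates, min(n_e, len(candidates)))
                    weights = [rng.random() + 1e-9 for _ in outs]
                    z = sum(weights)
                    T[s][a] = [(o, w / z) for o, w in zip(outs, weights)]
    return T, goal

T, goal = make_chained_mdp(k=3, n_scc=2, n_sps=4, n_a=2, n_e=2)
```

Costs are omitted for brevity; intra-block edges make each block a candidate SCC, and the block-to-block edges reproduce the chain structure of Figure 1.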

Figure 2 presents the obtained results for the Chained-MDP domain when varying the number of states and fixing the number of chains (32). We can observe that when the number of states is small, pcTVI does not provide an important advantage over the existing algorithms, since the overhead of creating and managing the threads consumes most of the potential gains. However, as the number of states increases, the gap in running time between pcTVI and the three other algorithms widens. This indicates that pcTVI is particularly useful on very large MDPs, which are usually needed when considering real-world domains.

Figure 3 presents the obtained results for the same Chained-MDP domain when varying the number of chains and fixing the number of states (1M). When the number of chains increases, the total number of SCCs implicitly increases (which also implies that the number of states per SCC decreases). This explains why each tested algorithm becomes faster (TVI becomes faster by design, since it solves SCCs one by one without doing useless state backups, and VI and LRTDP become faster due to an increased locality of the considered states in memory, which improves cache performance). The performance of pcTVI increases as the number of chains increases (for the same reason as the other algorithms, but also due to increased parallelization opportunities). We can also observe that for domains with only 4 chains, pcTVI still clearly outperforms the other methods. This means that pcTVI does not need a highly parallel server CPU and can be used on a standard 4-core computer.

**Fig. 2** Average running times (in s) for the Chained-MDP domain with varying number of states and fixed number of chains (32).

**Fig. 3** Average running times (in s) for the Chained-MDP domain with varying number of chains and fixed number of states (1M).

# **6 Conclusion**

The main contributions of this paper are two-fold. First, we presented a new algorithm, pcTVI, which is, to the best of our knowledge, the first MDP solver that takes into account both the topological structure of the MDP (as in TVI) and the parallel capacities of modern computers (as in P3VI). Second, we introduced a new parametric planning domain, Chained-MDP, which models any situation where different strategies (corresponding to a chain) can reach a goal, but where, once committed to a strategy, it is not possible to switch to a different one. This domain is ideal to evaluate the parallel performance of an MDP solver. Our experiments indicate that pcTVI outperforms the other competing methods (VI, LRTDP, and TVI) on every tested instance of the Chained-MDP domain. Moreover, pcTVI is particularly effective when the considered MDP has many SCC chains (for increased parallelization opportunities) of large size (for decreased overhead of assigning small tasks to the threads). As future work, we plan to investigate ways of pruning provably suboptimal actions, which would allow more SCCs to be found. While this paper focuses on the automated planning side of MDPs, the proposed optimization and parallel computing approaches could also be applied when using MDPs with Reinforcement Learning and other ML algorithms.

**Acknowledgements** This research has been supported by the *Natural Sciences and Engineering Research Council of Canada* (NSERC) and the *Fonds de Recherche du Québec — Nature et Technologies* (FRQNT).

# **References**



# **Three-way Spectral Clustering**

Cinzia Di Nuzzo and Salvatore Ingrassia

**Abstract** In this paper, we present a spectral clustering approach for clustering *three-way data*. Three-way data concern data characterized by three modes: 𝑛 units, 𝑝 variables, and 𝑡 different occasions. In other words, three-way data contain a 𝑡 × 𝑝 observed matrix for each statistical observation. The units generated by the simultaneous observation of variables in different contexts are usually structured as three-way data, so each unit is basically represented as a matrix. In order to cluster the 𝑛 units into 𝐾 groups, applying spectral clustering to three-way data can be a powerful tool for unsupervised classification. Here, an example on real three-way data is presented, showing that the spectral clustering method is competitive for clustering this type of data.

**Keywords:** spectral clustering, kernel function, three-way data

# **1 Introduction**

Spectral clustering methods are based on graph theory, where the units are represented by the vertices of an undirected graph and the edges are weighted by the pairwise similarities coming from a suitable kernel function; the clustering problem is thus reformulated as a graph partition problem, see e.g. [16, 6]. The spectral clustering algorithm is a very powerful method for finding non-convex clusters of data; moreover, it is a handy approach for handling high-dimensional data, since it works on a transformation of the raw data having a smaller dimension than the space of the original data.

Cinzia Di Nuzzo ()
Department of Statistics, University of Roma La Sapienza, Piazzale Aldo Moro, 5, 00185 Roma, Italy, e-mail: cinzia.dinuzzo@uniroma1.it

Salvatore Ingrassia
Department of Economics and Business, University of Catania, Piazza Università, 2, 95131 Catania, Italy, e-mail: s.ingrassia@unict.it

© The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_13

Three-way data derive from the observation of various attributes measured on a set of units in different situations; some examples are longitudinal data on multiple response variables and multivariate spatial data. Three-way data can also derive from temporal measurements of a feature vector, thus having the dataset composed of three modes: 𝑛 units (matrices), 𝑝 variables (columns), and 𝑡 times (rows). Clustering of three-way data has attracted growing interest in the literature, see e.g. [14], [1]; model-based clustering of three-way data has been introduced by [15] in the framework of matrix-variate normal mixtures. Recent papers include [9], who work on parsimonious models for modeling matrix data; [11], who introduce two matrix-variate distributions, both elliptical heavy-tailed generalizations of the matrix-variate normal distribution; [12], who deal with three-way data clustering using matrix-variate cluster-weighted models (MV-CWM); and [13], who consider an application to educational data via mixtures of parsimonious matrix-normal distributions.

In this paper, we present a spectral clustering approach for clustering *three-way data*, and a suitable kernel function between matrices is introduced. Here the data matrices represent the vertices of the graph; consequently, each edge must be weighted by a single value.

The rest of the paper is organized as follows: in Section 2 the spectral clustering method is summarized; in Section 3 a method to select the parameters of the spectral clustering algorithm is described; in Section 4 the three-way spectral clustering approach with a new kernel function is introduced; in Section 5 an application based on real three-way data is presented. Finally, Section 6 provides concluding remarks.

# **2 Spectral Clustering**

The spectral clustering algorithm for two-way data has been described in [8, 16, 6]. Here, we summarize its main steps.

Let 𝑉 = {𝒙1, 𝒙2, . . . , 𝒙𝑛} be a set of points in X ⊆ R<sup>𝑝</sup>. In order to group the data 𝑉 into 𝐾 clusters, the first step concerns the definition of a symmetric and continuous function 𝜅 : X × X → [0, ∞) called the *kernel function*. Afterwards, a *similarity matrix* 𝑊 = (𝑤𝑖𝑗) can be assigned by setting 𝑤𝑖𝑗 = 𝜅(𝒙𝑖, 𝒙𝑗) ≥ 0 for 𝒙𝑖, 𝒙𝑗 ∈ X, and finally the *normalized graph Laplacian* matrix 𝐿sym ∈ R<sup>𝑛×𝑛</sup> is introduced:

$$L\_{\rm sym} = I - D^{-1/2} W D^{-1/2},\tag{1}$$

where 𝐷 = diag(𝑑1, 𝑑2, . . . , 𝑑𝑛) is the *degree matrix*, 𝑑𝑖 = ∑<sub>𝑗≠𝑖</sub> 𝑤𝑖𝑗 is the *degree* of the vertex 𝒙𝑖, and 𝐼 denotes the 𝑛 × 𝑛 identity matrix. The Laplacian matrix 𝐿sym is positive semi-definite with 𝑛 non-negative eigenvalues. For a fixed 𝐾 ≪ 𝑛, let {𝜸1, . . . , 𝜸𝐾} be the eigenvectors corresponding to the 𝐾 smallest eigenvalues of 𝐿sym. Then, the *normalized Laplacian embedding in the* 𝐾 *principal subspace* is defined as the map Φ𝚪 : {𝒙1, . . . , 𝒙𝑛} → R<sup>𝐾</sup> given by

$$\Phi\_{\Gamma}(\mathbf{x}\_{i}) = (\gamma\_{1i}, \dots, \gamma\_{Ki}), \quad i = 1, \dots, n,$$

where 𝛾1𝑖, . . . , 𝛾𝐾𝑖 are the 𝑖-th components of 𝜸1, . . . , 𝜸𝐾, respectively. In other words, the function Φ𝚪(·) maps the data from the input space X to a feature space defined by the 𝐾 principal subspace of 𝐿sym. Afterwards, let 𝒀 be the 𝑛 × 𝐾 matrix whose 𝑖-th row is the embedded point 𝒚𝑖 = Φ𝚪(𝒙𝑖), for 𝑖 = 1, . . . , 𝑛. Finally, the embedded data 𝒀 are clustered according to some clustering procedure; usually, the 𝑘-means algorithm is considered in the literature. However, to this end Gaussian mixtures have been proposed because they yield elliptical, i.e. more flexible, cluster shapes with respect to 𝑘-means, see [2]. Finally, we point out that the performance of other mixture models based on non-Gaussian component densities has been analyzed, but Gaussian mixture models can be considered a good trade-off between model simplicity and effectiveness, see [3] for details.
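The pipeline above (similarity matrix, normalized Laplacian, embedding, clustering of the rows of 𝒀) can be sketched with NumPy; the global-bandwidth Gaussian kernel and the tiny deterministic 𝑘-means below are simplifying assumptions of ours.

```python
# NumPy sketch of the spectral clustering pipeline of Section 2: similarity
# matrix W, normalized Laplacian L_sym, embedding via the eigenvectors of the
# K smallest eigenvalues, then clustering of the embedded rows.

import numpy as np

def spectral_embedding(X, K, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian kernel
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))                 # D^(-1/2)
    L_sym = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)                   # ascending eigenvalues
    return vecs[:, :K]                                   # rows = embedded data Y

def kmeans(Y, K, iters=50):
    centers = [Y[0]]
    for _ in range(K - 1):                               # farthest-first seeding
        d = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = Y[labels == k].mean(0)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
labels = kmeans(spectral_embedding(X, K=2), K=2)
```

On two well-separated groups, the embedded rows collapse into two tight spikes and any reasonable clustering of 𝒀 recovers the partition.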

# **3 A Graphical Approach for Parameter Selection**

According to spectral clustering algorithm introduced in Section 2, the spectral approach requires to set: *i*) the number of clusters 𝐾, *ii*) the kernel function 𝜅 (with the corresponding parameter). In order to select these quantities, in the following we summarize the method proposed in [4].

To begin with, we point out that the choice of the kernel function affects the entire data structure in the graph and, consequently, the structure of the Laplacian matrix and its eigenvectors. An optimal kernel function should lead to a similarity matrix 𝑊 that is (as far as possible) block diagonal: in this case, we get well-separated groups and we are also able to infer the number of groups in the data set by counting the number of blocks. For the sake of simplicity, we consider here the self-tuning kernel introduced by [17]

$$\kappa(\mathbf{x}\_i, \mathbf{x}\_j) = \exp\left(-\frac{||\mathbf{x}\_i - \mathbf{x}\_j||^2}{\epsilon\_i \epsilon\_j}\right) \tag{2}$$

with 𝜖<sub>𝑖</sub> = ‖𝒙<sub>𝑖</sub> − 𝒙<sub>ℎ</sub>‖, where 𝒙<sub>ℎ</sub> is the ℎ-th neighbor of point 𝒙<sub>𝑖</sub> (similarly for 𝜖<sub>𝑗</sub>). This function allows us to obtain a similarity matrix that does not depend on a global scale parameter, so that the spectral clustering algorithm is based on the pairwise proximity between units. Instead, we only need to select the neighbor order ℎ in (2).
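In NumPy, the self-tuning similarity (2) can be sketched as follows (the function name and the toy data are our own):

```python
# Sketch of the self-tuning kernel (2): the local scale eps_i is the distance
# from x_i to its h-th nearest neighbor.

import numpy as np

def self_tuning_similarity(X, h):
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    eps = np.sort(d, axis=1)[:, h]        # column 0 is the point itself
    W = np.exp(-(d ** 2) / np.outer(eps, eps))
    np.fill_diagonal(W, 0.0)              # no self-loops
    return W

# Three points on a line at 0, 1 and 3: eps = (1, 1, 2) for h = 1.
W = self_tuning_similarity(np.array([[0.0], [1.0], [3.0]]), h=1)
```

Each pairwise similarity is thus rescaled by the local density around both points, which is what makes the kernel parameter-free apart from ℎ.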

The main novelty of the joint graphical approach concerns the analysis of some graphical features of the Laplacian matrix, including the shape of the embedded space. Indeed, the embedded data provide useful information for the clustering; in particular, the main results in [10] and [5] allow one to deduce that if the embedded data assume a cone structure, then the number of clusters is equal to the number of cones/spikes in the feature space; furthermore, a clearer clustering structure emerges when the spikes are narrower and well separated.

The idea behind the graphical approach is to select the number 𝐾 of groups and the parameter ℎ in the kernel function from a joint analysis of three main characteristics: the plot of the Laplacian matrix; the maxima values of the eigengaps between two consecutive eigenvalues; the scatter plot of the mapped data in the feature space and in particular the number of spikes counted in the embedded data space.

We remark that we cannot analyze all possible values of ℎ ∈ {1, 2, . . . , 𝑛−1} and hence we choose a suitable subset H ⊂ {1, 2, . . . , 𝑛 − 1}, in particular we choose H = {1%, 2%, 5%, 10%, 15%, 20%} × 𝑛 ⊂ {1, 2, . . . , 𝑛 − 1}, and select ℎ ∈ H, see the following procedure for details.

$$\text{Parameter selection } (K \text{ and } h)$$

*Input:* data set 𝑉, kernel function 𝜅, H.

	- a. If this plot shows a unique maximum eigengap for each ℎ ∈ H, then set 𝐾 according to this maximum. Go to Step 5.
	- b. If this plot shows multiple maxima for different ℎ ∈ H, select the number of clusters 𝐾 not to be smaller than the number of tight spikes in the corresponding plot of the embedded data.

*Output*: 𝐾, ℎ.

# **4 Three-way Spectral Clustering**

In this section, we propose a spectral approach for clustering three-way data. Three-way data consist of a data set referring to the same sets of units and variables observed in different situations, i.e., a set of multivariate matrices, that can be organized in three modes: 𝑛 units, 𝑝 variables, and 𝑡 situations. Therefore, we are given 𝑛 matrices that represent the vertices of the graph; each matrix is composed of 𝑝 columns, which represent our variables, and 𝑡 rows, which represent the time or another feature. So we have a tensor of dimension 𝑛 × 𝑡 × 𝑝, i.e., the dataset is a tensor {𝑿}𝑖𝑠𝑘 for 𝑖 = 1, . . . , 𝑛, 𝑠 = 1, . . . , 𝑡, 𝑘 = 1, . . . , 𝑝.

We define a distance function 𝛿<sub>𝑀</sub> : R<sup>𝑡×𝑝</sup> × R<sup>𝑡×𝑝</sup> → [0, +∞) between two matrices 𝐴, 𝐵 ∈ R<sup>𝑡×𝑝</sup> as


$$\delta\_M(A, B) := \|A - B\|\_F = \sqrt{\sum\_{s=1}^{t} \sum\_{k=1}^{p} |a\_{sk} - b\_{sk}|^2} \tag{3}$$

where ‖ · ‖<sub>𝐹</sub> is the Frobenius norm<sup>1</sup>. Thus the distance between two units in the matrix data 𝑿 is equal to

$$\delta\_M(X\_{i\_1sk}, X\_{i\_2sk}) = \sqrt{\sum\_{s=1}^{t} \sum\_{k=1}^{p} |X\_{i\_1sk} - X\_{i\_2sk}|^2}, \qquad \text{for } i\_1, i\_2 = 1, \dots, n. \tag{4}$$

For simplicity, in the following, we denote 𝛿<sup>𝑀</sup> (𝑋𝑖1𝑠𝑘 , 𝑋𝑖2𝑠𝑘 ) by 𝛿<sup>𝑀</sup> (𝑖1, 𝑖2). Moreover, we define the three-way self-tuning kernel function as

$$\kappa\_S: X \times X \to [0, +\infty), \qquad \kappa\_S(i\_1, i\_2) = \exp\left(-\frac{\delta\_M(i\_1, i\_2)}{\epsilon\_{i\_1}\epsilon\_{i\_2}}\right) \tag{5}$$

where 𝜖𝑖<sup>1</sup> and 𝜖𝑖<sup>2</sup> need to be selected like in the kernel defined in (2).

Afterwards, we compute the similarity matrix 𝑊 given by 𝑤<sub>𝑖1𝑖2</sub> = 𝜅<sub>𝑆</sub>(𝑖1, 𝑖2), so that we can apply the spectral clustering algorithm.

Finally, we point out that, differently from approaches based on mixtures of matrix-variate data, the number of variables of the data set is not a critical issue because the spectral clustering algorithm is based on distance measures.
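Putting (3)-(5) together, the three-way similarity matrix can be sketched in NumPy as below; the names and the toy data are our own, and note that, following (5), the Frobenius distance enters the exponent unsquared.

```python
# Sketch of the three-way self-tuning kernel (5): units are t x p matrices
# compared via the Frobenius distance (3)-(4).

import numpy as np

def three_way_similarity(X, h):
    """X has shape (n, t, p); returns the n x n similarity matrix W."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            D[i1, i2] = D[i2, i1] = np.linalg.norm(X[i1] - X[i2])  # Frobenius
    eps = np.sort(D, axis=1)[:, h]        # distance to the h-th neighbor
    W = np.exp(-D / np.outer(eps, eps))   # (5): distance, not squared distance
    np.fill_diagonal(W, 0.0)
    return W

# Two tight groups of 2 x 2 matrices: similarities are high within groups.
X = np.stack([np.full((2, 2), v) for v in (0.0, 0.1, 10.0, 10.1)])
W = three_way_similarity(X, h=1)
```

Once 𝑊 is available, the rest of the pipeline (Laplacian, embedding, clustering) is exactly the two-way algorithm of Section 2.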

# **5 A Real Data Application**

We apply the three-way spectral clustering to the analysis of the Insurance data set, available in the splm R package. This dataset was initially introduced by [7] and has recently been analyzed by [12]. The goal is to study the consumption of non-life insurance during the years 1998-2002 in the 103 Italian provinces, so 𝑡 = 5 and 𝑛 = 103. As regards the number of variables, we consider all the variables contained in the data set, so 𝑝 = 11. Thus, we have 103 matrices of dimensions 5 × 11.

The 103 Italian provinces are divided into north-west (24 provinces), north-east (22 provinces), center (21 provinces), south (23 provinces), and islands (13 provinces).

As regard the choice of 𝐾 and ℎ, we consider the graphical approach introduced in Section 3. In Figure 1 the geometric features of spectral clustering are plotted as ℎ varies. From the number of blocks of the Laplacian matrix (Figure 1-𝑎)), the first maximum eigengap (Figure 1-𝑏)) and the number of spikes in the feature space (Figure 1-𝑐)), we deduce that the number of clusters is 𝐾 = 2. For the selection of

$$\|\mathbf{A}\|\_F := \sqrt{\sum\_{f=1}^m \sum\_{l=1}^n |a\_{lf}|^2}.$$

<sup>1</sup> In general, given a matrix 𝐴 ∈ R <sup>𝑛</sup>×𝑚, with 𝐴 = (𝑎𝑖 𝑗) for 𝑖 = 1, . . . , 𝑛 and 𝑗 = 1, . . . , 𝑚. The Frobenius norm is defined by

**Fig. 1** *Insurance data.* Spectral clustering features: 𝑎) plot of the Laplacian matrix in greyscale; 𝑏) plot of the first eight eigengap values; 𝑐) scatterplot of the embedded data along with the directions ($\gamma_1$, $\gamma_2$).


**Table 1** *Insurance data.* Table of spectral clustering result.

ℎ, we can choose either ℎ = 15 or ℎ = 21, since in these cases the maximum eigengap attains its largest values in correspondence with 𝐾 = 2. Table 1 presents the clustering results. It shows that only 6 central provinces are classified together with the southern provinces. To check whether these provinces neighbour the southern ones, let us analyze the spectral clustering results on the map of Italy. Figure 2-𝑎) illustrates the partition obtained by spectral clustering on the political map of Italy, where Italian regions are delimited by yellow lines and provinces by black lines. The result shows a clear separation between centre-north Italy and south-insular Italy: indeed, the centre-north has a level of insurance penetration close to the European average, while the South is economically less developed. However, the province of Massa-Carrara should arguably belong to the centre-north group. Moreover, we remark that the province of Rome, hosting the capital of Italy, has a socio-economic development comparable to that of northern Italy, which justifies its membership of the centre-north group.

Furthermore, in Figure 2-𝑏) we also show the partition produced by the MN-CWM proposed in [12]. We note that the two clustering results are very similar and differ only in one province of central Italy (precisely, the province of Terni). It should also be emphasized that the data set analyzed by [12] differs from the one analyzed here since, to avoid over-parameterization of the models, those authors selected only 𝑝 = 5 of the variables in the data set.

**Fig. 2** *Insurance data.* 𝑎) Three-way spectral clustering; 𝑏) Method proposed by [12].

# **6 Conclusion**

In this paper, a spectral approach for clustering three-way data has been proposed. The data are organized in a tensor, and the vertices of the graph are represented by matrices of dimension 𝑡 × 𝑝. In order to weigh the matrices in the graph, a kernel function based on the Frobenius norm of the difference between matrices has been introduced. The performance of the spectral clustering algorithm has been shown on a real three-way data set. Our method is competitive with respect to other clustering methods proposed in the literature for matrix-data clustering. Finally, as a suggestion for future research, other kernel functions can be introduced by considering distances other than the Frobenius norm.

**Acknowledgements** This work was supported by the University of Catania grant PIACERI/CRASI (2020).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Improving Classification of Documents by Semi-supervised Clustering in a Semantic Space**

Jasminka Dobša and Henk A. L. Kiers

**Abstract** In this paper we propose a method for representing documents in a lower-dimensional semantic space, based on a modified Reduced 𝑘-means method which penalizes clusterings that are distant from the classification of training documents given by experts. Reduced 𝑘-means (RKM) enables simultaneous clustering of documents and extraction of factors. By projecting documents represented in the vector space model onto the extracted factors, documents are clustered in the semantic space in a semi-supervised way (using penalization), because the clustering is guided by the classification given by experts, which improves the classification performance on test documents.

Classification performance is tested for classification by logistic regression and support vector machines (SVMs) on classes of the Reuters-21578 data set. It is shown that representation of documents by the RKM method with penalization improves the average precision of classification by SVMs for the 25 largest classes of the Reuters collection by about 5.5%, with the same level of average recall, in comparison to the basic representation in the vector space model. In the case of classification by logistic regression, representation by RKM with penalization improves average recall by about 1% in comparison to the basic representation.

**Keywords:** classification of textual documents, LSA, reduced 𝑘-means

Jasminka Dobša ()
Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 40000 Varaždin, Croatia, e-mail: jasminka.dobsa@foi.hr

Henk A. L. Kiers
Department of Psychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands, e-mail: h.a.l.kiers@rug.nl

© The Author(s) 2023 121
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_14

# **1 Introduction**

There are two main families of methods that deal with the representation of documents and the words that index them: global matrix factorization methods such as Latent Semantic Analysis (LSA) [2], and local context window methods such as the continuous bag of words (CBOW) model and the continuous skip-gram model [8]. The latter use neural networks to learn representations of words and have been intensively explored lately in the scientific community, since the development of fast processors has enabled the processing of huge amounts of data, which has resulted in performance improvements across a wide spectrum of text mining and natural language tasks. However, representing words solely by context window methods has a drawback due to the neglect of information about global corpus statistics [9].

In this paper we propose a method for representing documents by applying a penalized version of the RKM method [4] to a term-document matrix. The corpus of textual documents is represented by a sparse term-document matrix in which entry (*i, j*) is equal to the weight of the *i*-th index term in the *j*-th document. Weights of terms are given by Tf-Idf weighting, which utilizes local information about the frequency of the *i*-th term in the *j*-th document and global information about the usage of the *i*-th term in the entire collection. A benchmark method that utilizes global matrix factorization of term-document matrices is LSA [2], which uses truncated singular value decomposition (SVD) to represent terms and documents in a lower-dimensional semantic space. SVD does not capture the clustering structure of the data, which motivates the application of RKM.
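For concreteness, the Tf-Idf weighting and the LSA baseline can be sketched with plain NumPy. This is one common Tf-Idf variant, given for illustration only; the exact weighting formula used in the paper is not spelled out.

```python
import numpy as np

def tfidf(counts):
    """Tf-Idf weighting of a raw term-document count matrix (terms x docs)."""
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)  # term frequency per doc
    df = (counts > 0).sum(axis=1, keepdims=True)                    # document frequency
    idf = np.log(counts.shape[1] / np.maximum(df, 1))               # inverse document frequency
    return tf * idf

def lsa_embed(X, k):
    """LSA baseline: coordinates of the documents in the rank-k SVD space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T                              # docs x k
```

A term that occurs in every document gets idf = log(1) = 0 and is effectively discarded, which is the intended global-statistics effect.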

The rest of the paper is organized as follows: the second section describes related work on representation of documents and words and methods of dimensionality reduction related to RKM. The third section describes the modified RKM method with penalization, while the fourth section describes an experiment on Reuters-21578 data set. In the last section conclusions and directions for further work are given.

# **2 Related Work**

#### **2.1 Representation by Matrix Factorization Methods**

A benchmark method among those that utilize matrix factorization for the representation of textual documents is LSA, introduced in 1994 [2]. By LSA, a sparse term-document matrix is transformed via SVD into a dense matrix of the same term-document type, with representations of words (index terms) and documents in a lower-dimensional space. The idea is to map similar documents, or those that describe the same topics, closer to each other regardless of the terms used in them. A very efficient application of LSA is cross-lingual information retrieval, where relevant documents for a query in one language are retrieved from a set of documents in another language [7]. To our knowledge, the application of methods that simultaneously cluster objects and extract factors in the field of text mining is very limited. In [6], a method for cross-lingual information retrieval based on the RKM method is proposed.

#### **2.2 Neural Network Word Embeddings**

Another approach is to learn representations of words, so-called embeddings, by using local context windows. In 2003, Bengio and coauthors [1] proposed a neural probabilistic language model that uses a simple neural network architecture to learn distributed representations for each word, as well as probability functions for word sequences expressed in terms of these representations. Mikolov and coauthors [8] proposed in 2013 two models based on single-layer neural network architectures: the skip-gram model, which predicts context words given the current word, and the continuous bag of words model, which predicts the current word based on its context. In 2014, the GloVe model [9] was proposed, based on the critique that neural network models do not utilize co-occurrence statistics of the entire corpus, but scan only context windows of words, ignoring vast amounts of repetition in the data. That model exploits the advantages of both global matrix factorization methods, by utilizing term-term co-occurrence matrices, and local context window methods.

Word embeddings can be classified as static, such as word2vec [8] and GloVe [9], and contextual, such as ELMo [10] and BERT [5]. Contextual representations were introduced in [10] in order to model characteristics of word use (syntax and semantics) on the one hand, and variation in word representation due to the context in which words appear on the other.

#### **2.3 Methods for Simultaneous Clustering and Factor Extraction**

A standard procedure for clustering objects in a lower-dimensional space is tandem analysis, which consists of projecting the data by principal components and clustering the data in the lower-dimensional space. Such an approach was criticized in [3] and [4], since principal components may extract dimensions which do not significantly contribute to the identification of a clustering structure in the data. As a response, De Soete and Carroll proposed the RKM method [4], which simultaneously clusters data and extracts factors of the variables by reconstructing the original data using only the centroids of clusters in a lower-dimensional space. The Factorial 𝑘-means (FKM) algorithm proposed by Vichi and Kiers [13] has the same aim of simultaneous reduction of objects and variables, and reconstructs the data in the lower-dimensional space by its centroids in that same space. The application of the latter method in text mining is limited, since the method requires the number of variables to be less than the number of cases. In [11], the RKM and FKM methods are compared by simulations and theoretically, in order to identify cases for their application. Timmerman and associates also proposed the Subspace 𝑘-means method [12], which gives insight into cluster characteristics in terms of the relative positions of the clusters, given by the centroids, and the shapes of the clusters, given by the within-cluster residuals.

# **3 Reduced** 𝒌**-Means with Penalization**

Let **X** be an 𝑚 × 𝑛 term-document matrix. We use the following notation:


By definition, we suppose that every document in the collection belongs to exactly one cluster. The RKM method minimizes the loss function

$$\mathbf{F(M,A)} = \|\mathbf{X} - \mathbf{A}\mathbf{Y}^T\mathbf{M}^T\|^2 \tag{1}$$

in the least squares sense. The dimension of the lower-dimensional space must be less than or equal to the number of clusters. The modified RKM with penalization minimizes the loss function

$$\mathbf{F(M,A)} = \|\mathbf{X} - \mathbf{A}\mathbf{Y}^T\mathbf{M}^T\|^2 + \lambda\|\mathbf{M} - \mathbf{G}\|^2\tag{2}$$

where **G** is an 𝑛 × 𝑐 membership matrix based on expert judgements: if *c* is the number of classes, then 𝑔𝑖 𝑗 = 1 if object (document) *i* belongs to class *j*, and 0 otherwise. By the second summand in the loss function we penalize clusterings that are distant from the classes given by expert judgements, using the parameter 𝜆 to regulate the importance of the penalization. We use an alternating least squares (ALS) algorithm analogous to the one in [4], which alternates between corrections of the loading matrix **A** in one step and of the membership matrix **M** in the other. As each step of the ALS algorithm improves the loss function, the algorithm converges to at least a local minimum. By starting the procedure from a large number of random initial estimates and choosing the best solution, the chances of obtaining the global minimum are increased.
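A minimal sketch of this alternating scheme is given below. It is our own simplified rendering, not the authors' Matlab code: the loadings step uses an SVD of the centroid-reconstructed data, leaving the expert class costs 2𝜆 (exactly two entries of a membership row change), and for brevity the sketch warm-starts from the expert classification instead of using many random starts as in the paper.

```python
import numpy as np

def penalized_rkm(X, G, k, lam, n_iter=20):
    """Sketch of RKM with penalization via alternating least squares.
    X: m x n term-document matrix, G: n x c expert membership (one-hot rows),
    k: dimension of the semantic space (k <= c), lam: penalty weight."""
    n, c = G.shape
    M = G.copy()                                   # warm start from the expert classes
    for _ in range(n_iter):
        # loadings step: rank-k subspace of the centroid-reconstructed data
        P = M @ np.linalg.pinv(M.T @ M) @ M.T      # projects docs onto cluster means
        U, _, _ = np.linalg.svd(X @ P, full_matrices=False)
        A = U[:, :k]                               # m x k orthonormal loadings
        Z = A.T @ X                                # documents in the semantic space
        Y = Z @ M @ np.linalg.pinv(M.T @ M)        # cluster centroids, k x c
        # membership step: nearest centroid, plus a 2*lam cost for leaving
        # the expert class (two entries of the membership row change)
        dist = ((Z[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)  # n x c
        dist += 2.0 * lam * (1.0 - G)
        M = np.eye(c)[dist.argmin(axis=1)]
    return A, M
```

Setting lam = 0 recovers (one local run of) plain RKM, while large lam forces the clustering to agree with the expert classification.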

# **4 Experiment**

#### **4.1 Design of Experiment**

Experiments are conducted for classification on the Reuters-21578 data set, specifically using the ModApte split, which assigns Reuters reports from April 7, 1987 and before to the training set, and later reports, until the end of 1987, to the test set. It consists of 9603 training and 3299 test documents. The collection has 90 classes which contain at least one training and one test document. Documents are represented by a bag of words representation. A list of index terms is formed from terms that appear in at least four documents of the collection, which results in a list of 9867 index terms.

Classification is conducted by logistic regression (LR) and the SVM algorithm. The basic model is the bag of words representation (full representation), while representations in the lower-dimensional space are obtained by SVD (Latent Semantic Analysis), RKM and RKM with penalization (𝜆 = 0.1, 0.2, 0.4, 0.6). For RKM and RKM with penalization, representations are obtained by applying the matrix factorization to the term-document matrix of the training documents, and by projecting the test documents onto the factors given by the matrix **A** in the factorization. RKM is computed for 90 clusters (corresponding to the number of classes in the collection) using 𝑘 = 85 as the dimension of the lower-dimensional space, and truncated SVD is computed for 𝑘 = 85 as well. The RKM and RKM with penalization algorithms are run 10 times (with different starting estimates), and the representation and factorization with the minimal loss function are chosen. The optimal cost parameter for LR and SVM is chosen by a grid search over the values 0.1, 0.5, 1, 10, 100 and 1000. For the classification methods, the LiblineaR library in R is used, while the RKM and RKM with penalization algorithms are implemented in Matlab.

#### **4.2 Results**

Results are given in terms of precision, recall, and the $F_1$ measure of the classification. Recall is the proportion of correctly classified samples among all positive samples (i.e., samples actually belonging to the class according to the expert), while precision is the proportion of correctly classified samples among all samples classified as positive by the model. Figures 1 and 2 show the average $F_1$ measures of classification for groups of 5 classes sorted in descending order by their size, i.e. the number of training documents (which is 2877 to 389 for classes 1-5, 369 to 181 for classes 6-10, 140 to 111 for classes 11-15, 101 to 75 for classes 16-20, 75 to 55 for classes 21-25, 50 to 41 for classes 26-30, 40 to 37 for classes 31-35, 35 to 24 for classes 36-40, 23 to 19 for classes 41-45, 18 to 16 for classes 46-50, 16 to 13 for classes 51-55, and 13 to 10 for classes 56-60). Figure 1 shows the results for classification by LR, and Figure 2 for classification by SVM. Only the 60 largest classes are considered, since smaller classes (with fewer than 10 training documents) are not interesting for the

**Fig. 1** Average $F_1$ measure of classification by LR for groups of 5 classes sorted by their size.

research, because for those classes recall is low and it can be expected that the full bag of words representation will give better recognition, since such classes can possibly be recognized by key words, but not by transformed representations. It can be seen that the $F_1$ measures are comparable for the full representation and the various representations by RKM with penalization, for both classification algorithms, for the 25 biggest classes. For smaller classes, the results for representation by RKM with penalization are unstable, although for some classes they were better than for the basic representation (in the case of LR). Classification for the representations obtained by SVD and by RKM without penalization resulted in lower $F_1$ measures for all class sizes.

Table 1 shows the average precision, recall and $F_1$ measures for the 25 largest classes, for both classification algorithms and all considered representations. In the case of classification by LR, the average recall is improved for the representation by RKM with penalization (for 𝜆 = 0.4) by approximately 1% compared to the basic full representation. For classification by SVM, average precision is improved for the representation by RKM with penalization (for 𝜆 = 0.6) by almost 6%, and the $F_1$ measure is improved for the representation by RKM with penalization (𝜆 = 0.4) by 2% in comparison to the basic full representation. The best results are obtained for classification by the SVM algorithm with the representation by RKM with penalization with 𝜆 = 0.2, for which precision is improved by 5% with a level of recall similar to that of the basic representation.

**Fig. 2** Average $F_1$ measure of classification by SVM for groups of 5 classes sorted by their size.


**Table 1** Average precision, recall, and 𝐹<sup>1</sup> measure of classification for the 25 largest classes.

# **5 Conclusions and Further Work**

In this paper we propose a modification of the RKM method that simultaneously clusters documents and extracts factors, while penalizing clusterings that are distant from the classification of the training documents given by experts. We show that such a modification enables a representation of textual documents in a lower-dimensional semantic space that improves classification performance. The method is tested on classes of the Reuters-21578 data set and compared to the full bag of words representation and the LSA method. It is also shown that the original RKM method without the proposed modification does not have the same effect on classification performance; its effect is similar to that of the LSA method.

The proposed representation method can improve the precision and recall of classification for sufficiently large classes, i.e. those that have enough training documents to enable the capturing of semantic relations and characteristics of the classes. The more important effect can be observed in the improvement of precision.

In the future we plan to investigate hybrid models using representation of words by neural language models and application in different domains, such as classification of images.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Trends in Data Stream Mining**

João Gama

**Abstract** Learning from data streams is a hot topic in machine learning and data mining. This article presents our recent work on the topic of learning from data streams. We focus on emerging topics, including fraud detection and hyper-parameter tuning for streaming data. The first study is a case study on interconnected by-pass fraud. This is a real-world problem from high-speed telecommunications data that clearly illustrates the need for online data stream processing. In the second study, we present an optimization algorithm for online hyper-parameter tuning from nonstationary data streams.

**Keywords:** fraud detection, hyperparameter tuning, learning from data streams

# **1 Introduction**

The development of information and communication technologies has dramatically changed data collection and processing methods. What distinguishes current data sets from earlier ones are automatic data feeds. We do not just have people entering information into a computer; we have computers entering data into each other. In the most challenging applications, data are best modeled not as persistent tables, but rather as transient data streams.

This article presents our recent work on the topic of learning from data streams. It is organized into two main sections. The first one presents a real-world application of data stream techniques to a telecommunications fraud detection problem; it is based on the work presented in [5]. The second discusses the problem of hyperparameter tuning in the context of data stream mining; it is based on the work presented in [4].

João Gama ()

FEP-University of Porto and INESC TEC

R. Dr. Roberto Frias, Porto, Portugal, e-mail: jgama@fep.up.pt

<sup>©</sup> The Author(s) 2023 131

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_15

# **2 Fraud Detection: a Case Study**

The high asymmetry between international and domestic termination rates, where international calls are charged more by the operator where the call terminates, is fertile ground for the appearance of fraud in telecommunications. Several types of fraud exploit this differential, Interconnect Bypass Fraud being one of the most significant [1, 3].

In this type of fraud, one of the several intermediaries responsible for delivering the calls forwards the traffic over a low-cost IP connection, reintroducing the call into the destination network as a local call, using VoIP gateways. The entity that sent the traffic is charged the amount corresponding to the delivery of international traffic; however, since the call is illegally delivered as national traffic, the intermediary does not have to pay the international termination fee, appropriating this amount.

Traditionally, telecom operators analyze the calls of these gateways to detect fraud patterns and, once the gateways are identified, have their SIM cards blocked. The constant evolution of the technology adopted in these gateways allows them to work as real SIM farms, capable of manipulating identifiers, simulating standard call patterns similar to those of regular users, and even being mounted on vehicles to complicate detection based on location information.

Interconnect bypass fraud detection algorithms typically consume a stream 𝑆 of events, where each event contains information about the origin number (A-Number), the destination number (B-Number), the associated timestamp, and the status of the call (accomplished or not). The expected output of this type of algorithm is a set of potentially fraudulent A-Numbers that require validation by the telecom operator. This process is not fully automated, to avoid blocking legitimate A-Numbers and incurring penalties. In interconnect bypass fraud, we can observe three different types of abnormal behavior:



Figures 1 and 2 present the evolving top-10 most active phone numbers. Figure 1 presents the top-10 cumulative counts, while Figure 2 presents the top-10 counts with forgetting.
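The difference between the two counting schemes can be imitated with a toy sketch. This is our own illustration, not the system's implementation: the actual system uses Lossy Counting, an approximate bounded-memory algorithm, whereas this sketch keeps exact (decayed) counts for every key.

```python
from collections import defaultdict

def top_active(stream, k=10, alpha=1.0):
    """Top-k most active keys. alpha = 1.0 gives cumulative counts (Figure 1);
    alpha < 1.0 decays all counts at each event, i.e. fast forgetting (Figure 2)."""
    counts = defaultdict(float)
    for key in stream:
        if alpha < 1.0:
            for other in counts:
                counts[other] *= alpha      # forget old activity
        counts[key] += 1.0
    return sorted(counts, key=counts.get, reverse=True)[:k]
```

With forgetting, a number that was very active long ago drops out of the top-10 in favour of recently bursting numbers, which is exactly the behavior useful for detecting bursts of fraudulent activity.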

# **3 Learning to Learn Hyperparameters**

A hyperparameter is a parameter whose value is used to control the learning process. Hyperparameter optimization (or tuning) is the problem of choosing a set of optimal hyperparameters for a learning algorithm. For this purpose, we adapt the Nelder-Mead algorithm [4] to the streaming context. This is a simplex search algorithm for multidimensional unconstrained optimization without derivatives. The vertices of the simplex, which define a convex hull, are iteratively updated in order to sequentially discard the vertex associated with the largest cost function value.

The Nelder-Mead algorithm relies on four simple operations: *reflection*, *shrinkage*, *contraction* and *expansion*. Figure 3 illustrates the four corresponding Nelder-Mead operators 𝑅, 𝑆, 𝐶 and 𝐸. Each vertex represents a model containing a set of hyperparameters. The vertices (the models under optimisation) are ordered and named according to their root mean square error (RMSE) values: best (𝐵), good (𝐺), which is

**Fig. 1** Approximate Counts with Lossy Counting.

**Fig. 2** Approximate Counts with Lossy Counting and Fast Forgetting.

the closest to the best vertex, and worst (𝑊); 𝑀 is a mid vertex (an auxiliary model). The bottom panel of Figure 3 describes the four operations: Contraction, Reflection, Expansion, and Shrink.

For each Nelder-Mead operation, it is necessary to compute an additional set of vertices (midpoint 𝑀, reflection 𝑅, expansion 𝐸, contraction 𝐶 and shrinkage 𝑆) and to verify that the computed vertices belong to the search space. First, the algorithm computes the midpoint (𝑀) of the best face of the shape as well as the reflection point (𝑅). After this initial step, it determines whether to reflect or expand based on a set of heuristics.
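These operations follow the standard Nelder-Mead scheme; one iteration can be sketched as below. This is our own simplification with the textbook coefficients (reflection 1, expansion 2, contraction and shrink 1/2), without the search-space checks and parallel evaluation described in the text.

```python
import numpy as np

def nelder_mead_step(simplex, f):
    """One iteration: reflect the worst vertex through the midpoint of the
    best face, then expand, contract, or shrink depending on the cost f."""
    simplex = sorted(simplex, key=f)                  # best ... worst
    best, worst = simplex[0], simplex[-1]
    M = np.mean(simplex[:-1], axis=0)                 # midpoint of the best face
    R = M + (M - worst)                               # reflection
    if f(R) < f(best):                                # very good: try expansion
        E = M + 2.0 * (M - worst)
        simplex[-1] = E if f(E) < f(R) else R
    elif f(R) < f(worst):                             # improvement: accept R
        simplex[-1] = R
    else:                                             # no improvement: contract
        C = M + 0.5 * (worst - M)
        if f(C) < f(worst):
            simplex[-1] = C
        else:                                         # last resort: shrink to best
            simplex = [best] + [best + 0.5 * (v - best) for v in simplex[1:]]
    return simplex
```

In the streaming setting, each vertex is a model and f is its incremental RMSE over the last sample-size interval rather than a closed-form function.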

The dynamic sample size, which is based on the RMSE metric, attempts to identify significant changes in the streamed data. Whenever such a change is detected, the Nelder-Mead algorithm compares the performance of the 𝑛 + 1 models under analysis to choose the most promising model. The sample size $S_{size}$ is given by Equation 1, where 𝜎 represents the standard deviation of the RMSE and 𝑀 the desired error margin; we use 𝑀 = 95%.

**Fig. 3** SPT working modes: Exploration and Deployment. The bottom panel illustrates the Nelder & Mead operators.

$$S\_{size} = \frac{4\sigma^2}{M^2} \tag{1}$$

However, to avoid using small samples, which imply error estimates with large variance, we define a lower bound of 30 samples. The adaptation of the Nelder-Mead algorithm to online scenarios relies extensively on parallel processing. The main thread launches the 𝑛 + 1 model threads and starts a continuous event processing loop. This loop dispatches the incoming events to the model threads and, whenever it reaches the sample size interval, assesses the running models and calculates the new sample size. The model assessment involves ordering the 𝑛 + 1 models by RMSE value and applying the Nelder-Mead algorithm to substitute the worst model. The parallel Nelder-Mead implementation creates a dedicated thread per Nelder-Mead operator, totaling seven threads. Each Nelder-Mead operator thread generates a new model and calculates the incremental RMSE using the instances of the last sample size interval. The worst model is substituted by the model of the Nelder-Mead operator thread with the lowest RMSE.
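Equation 1 together with the lower bound amounts to a small helper (here `margin` is the numeric error margin of the RMSE estimate; the mapping from the paper's 𝑀 = 95% setting to a numeric margin is our own illustrative assumption):

```python
import statistics

def dynamic_sample_size(rmse_window, margin, floor=30):
    """Sample size from Equation 1, S = 4*sigma^2 / M^2, floored at 30 samples."""
    sigma = statistics.pstdev(rmse_window)   # std. deviation of the recent RMSE values
    return max(floor, int(4.0 * sigma ** 2 / margin ** 2))
```

A stable RMSE (small sigma) thus yields the minimum sample size, while a volatile RMSE triggers longer assessment intervals.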

Figure 4 presents the critical difference diagram [2] of three hyperparameter tuning approaches: SPT, grid search, and default parameter values, on four benchmark classification data sets. The diagram clearly illustrates the good performance of SPT.

# **4 Conclusions**

This paper reviews our recent work in learning from data streams. The two works present different approaches to dealing with high-speed and time-evolving data: from applied research in fraud detection to fundamental research on hyperparameter

**Fig. 4** Critical Difference Diagram comparing Self hyperparameter tuning, Grid hyperparameter tuning, and default parameters in 4 classification problems.

optimization for streaming algorithms. The first work identifies bursts of activity in phone calls, using approximate counting with forgetting. The second work presents a streaming optimization method to find the minimum of a function, and its application to finding the hyperparameter values that minimize the error. We believe that the two works reported here will have an impact on the work of other researchers.

**Acknowledgements** I would like to thank my collaborators Bruno Veloso and Rita P. Ribeiro, who contributed to this work.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Old and New Constraints in Model Based Clustering**

Luis A. García-Escudero, Agustín Mayo-Iscar, Gianluca Morelli, and Marco Riani

**Abstract** Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constraints on the scatter matrices of the components, these maximizations result in ill-posed problems. Moreover, without constraints, non-interesting or "spurious" clusters are often detected by the EM and CEM algorithms traditionally used for the maximization of the likelihood criteria. A useful approach to avoid spurious solutions is to restrict the relative scatter of the components by a prespecified tuning constant. Recently, new methodologies for constrained parsimonious model-based clustering have been introduced, which include as limit cases the 14 parsimonious models that are often applied in model-based clustering when assuming normal components. In this paper we first review the traditional approaches and illustrate through an example the benefits of adopting the new constraints.

**Keywords:** model based clustering, mixture modelling, constraints

L. A. García-Escudero
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain, e-mail: lagarcia@eio.uva.es

A. Mayo-Iscar
Department of Statistics and Operational Research and IMUVA, University of Valladolid, Spain, e-mail: agustinm@eio.uva.es

G. Morelli
Department of Economics and Management and Interdepartmental Centre of Robust Statistics, University of Parma, Italy, e-mail: gianluca.morelli@unipr.it

M. Riani ()
Department of Economics and Management and Interdepartmental Centre of Robust Statistics, University of Parma, Italy, e-mail: mriani@unipr.it

© The Author(s) 2023 139
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_16

# **1 Introduction**

Given a sample of observations $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^p$, a widely used approach in unsupervised learning is to assume multivariate normal components and to adopt a maximum likelihood criterion for clustering purposes. With this idea in mind, well-known classification and mixture likelihood approaches can be followed.

In this work, we use $\phi(\cdot; \mu, \Sigma)$ to denote the probability density function of a $p$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$.

In the *classification likelihood* approach, we search for a partition $\{H_1, \ldots, H_k\}$ of the indices $\{1, \ldots, n\}$, centres $\mu_1, \ldots, \mu_k$ in $\mathbb{R}^p$, symmetric positive semidefinite $p \times p$ scatter matrices $\Sigma_1, \ldots, \Sigma_k$ and positive weights $\pi_1, \ldots, \pi_k$ with $\sum_{j=1}^{k} \pi_j = 1$, which maximize

$$\sum\_{j=1}^{k} \sum\_{i \in H\_j} \log \left( \pi\_j \phi(\mathbf{x}\_i; \boldsymbol{\mu}\_j, \boldsymbol{\Sigma}\_j) \right). \tag{1}$$

On the other hand, in the *mixture likelihood* approach, we seek the maximization of

$$\sum\_{i=1}^{n} \log \left( \sum\_{j=1}^{k} \pi\_j \phi(\mathbf{x}\_i; \boldsymbol{\mu}\_j, \boldsymbol{\Sigma}\_j) \right), \tag{2}$$

with similar notation and conditions on the parameters as above. In this second approach, a partition into $k$ groups can also be obtained from the fitted mixture model, by assigning each observation to the cluster-component with the highest posterior probability.

Unfortunately, it is well known that the maximization of "log-likelihoods" like (1) and (2) without constraints on the $\Sigma_j$ matrices is a mathematically ill-posed problem [1, 2]. To see this unboundedness issue, we can just take $\mu_1 = x_1$, $\pi_1 > 0$ and $|\Sigma_1| \to 0$, which makes (2) diverge to infinity; (1) diverges as well with $H_1 = \{1\}$.
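The divergence is easy to reproduce numerically. The following sketch (a hypothetical univariate two-component mixture; all names and numbers are ours, not from the paper) fixes one component's mean at the first observation and shrinks its standard deviation, making the mixture log-likelihood in (2) grow without bound:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def mixture_loglik(x, mu1, sigma1, mu2=0.0, sigma2=1.0, pi1=0.5):
    """Mixture log-likelihood as in (2), for two univariate normal components."""
    dens = pi1 * normal_pdf(x, mu1, sigma1) + (1 - pi1) * normal_pdf(x, mu2, sigma2)
    return float(np.log(dens).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=20)                 # any sample works

# mu1 = x_1 and sigma1 -> 0: the log-likelihood diverges to +infinity.
lls = [mixture_loglik(x, mu1=x[0], sigma1=s) for s in (1e-3, 1e-5, 1e-7, 1e-9)]
assert all(a < b for a, b in zip(lls, lls[1:]))   # strictly increasing
```

Shrinking `sigma1` increases the first observation's log-density like $-\log \sigma_1$ while the remaining terms stay bounded below by the second component, so no finite maximizer exists.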

This lack of boundedness can be solved by just focusing on local maxima of the likelihood target functions. However, many local maxima are often found and it is difficult to know which are the most interesting ones. See [3] for a detailed discussion of this issue. In fact, non-interesting local maxima denoted as "spurious" solutions, which consist of a few, almost collinear, observations, are often detected by the Classification EM algorithm (CEM), traditionally applied when maximizing (1), and by the EM algorithm, traditionally applied when maximizing (2). A recent review of approaches for dealing with this lack of boundedness and for reducing the detection of spurious solutions can be found in [4].

It is also common to enforce constraints on the $\Sigma_j$ scatter matrices when maximizing (1) or (2). Among them, the use of "parsimonious" models [5, 6] is one of the most popular and widely applied approaches in practice. These parsimonious models follow from a decomposition of the $\Sigma_j$ scatter matrices as

$$
\Sigma\_j = \lambda\_j \Omega\_j \Gamma\_j \Omega\_j',\tag{3}
$$

with $\lambda_j = |\Sigma_j|^{1/p}$ (volume parameters),

$$\Gamma\_j = \mathsf{diag}(\gamma\_{j1}, \dots, \gamma\_{jl}, \dots, \gamma\_{jp}) \text{ with } \mathsf{det}(\Gamma\_j) = \prod\_{l=1}^p \gamma\_{jl} = 1$$

(shape matrices), and $\Omega_j$ (rotation matrices) with $\Omega_j \Omega_j' = I_p$. Different constraints on the $\lambda_j$, $\Omega_j$ and $\Gamma_j$ elements are considered across components to obtain 14 parsimonious models (coded with a combination of three letters). These models notably reduce the number of free parameters to be estimated, thus improving efficiency and model interpretability. Moreover, many of them turn the constrained maximization of the likelihoods into well-defined problems and help to avoid spurious solutions. Unfortunately, the problems remain for models with unconstrained $\lambda_j$ volume parameters, which are coded with a V as the first letter (V\*\* models). Aside from relying on good initializations, it is common to stop the iterations early when approaching scatter matrices with very small eigenvalues or when detecting components accounting for a reduced number of observations. A not fully iterated solution (or no solution at all) is then returned in these cases. This strategy is known to be problematic, for instance, when dealing with (well-separated) components made up of a few observations.
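The decomposition (3) can be computed for any given scatter matrix via an eigendecomposition. A minimal numpy sketch (function name ours; the sign and ordering conventions of the rotation come from the eigensolver):

```python
import numpy as np

def volume_shape_rotation(Sigma):
    """Decompose Sigma as lambda * Omega @ Gamma @ Omega.T, with
    lambda = |Sigma|^(1/p) (volume), det(Gamma) = 1 (shape) and
    Omega orthogonal (rotation), as in (3)."""
    p = Sigma.shape[0]
    eigval, Omega = np.linalg.eigh(Sigma)   # Sigma = Omega diag(eigval) Omega.T
    lam = np.prod(eigval) ** (1.0 / p)      # volume parameter |Sigma|^{1/p}
    Gamma = np.diag(eigval / lam)           # normalized shape, det(Gamma) = 1
    return lam, Gamma, Omega

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
lam, Gamma, Omega = volume_shape_rotation(Sigma)
assert np.allclose(lam * Omega @ Gamma @ Omega.T, Sigma)   # reconstruction
assert np.isclose(np.linalg.det(Gamma), 1.0)               # unit-determinant shape
```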

Starting from a seminal paper [7], an alternative approach is to constrain the $\Sigma_j$ scatter matrices by specifying tuning constants that control the strength of the constraints. In this direction, the ratio between the largest and the smallest of the $k \times p$ eigenvalues of the $\Sigma_j$ matrices was forced to be smaller than a given fixed constant $c^* \ge 1$ [8, 9, 10, 11, 12]. This means that the maximization of (1) and (2) is done under the (simpler) constraint:

$$\max\_{jl} \lambda\_l(\Sigma\_j) / \min\_{jl} \lambda\_l(\Sigma\_j) \le c^\*,\tag{4}$$

where $\{\lambda_l(\Sigma_j)\}_{l=1}^{p}$ is the set of eigenvalues of the matrix $\Sigma_j$, $j = 1, \ldots, k$.

With this eigenvalue-ratio approach, a very high value of $c^*$ is needed to be close to affine equivariance. Unfortunately, such a high $c^*$ value does not always prevent spurious solutions.
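Constraint (4) is typically enforced during the EM/CEM iterations by truncating eigenvalues. The sketch below (names ours) is a simplified projection that pins the admissible range to the largest pooled eigenvalue; the published tclust procedure instead chooses the truncation threshold optimally with respect to the likelihood criterion, so this only illustrates the constraint, not the actual algorithm:

```python
import numpy as np

def enforce_eigen_ratio(Sigmas, c_star):
    """Return scatter matrices whose pooled eigenvalues satisfy (4).
    Simplified: eigenvalues are clipped to [top / c_star, top], where top
    is the largest pooled eigenvalue."""
    decomp = [np.linalg.eigh(S) for S in Sigmas]
    top = max(ev.max() for ev, _ in decomp)
    floor = top / c_star
    return [U @ np.diag(np.clip(ev, floor, top)) @ U.T for ev, U in decomp]

Sigmas = [np.diag([100.0, 1.0]), np.diag([4.0, 0.01])]   # ratio 100/0.01 = 10^4
constrained = enforce_eigen_ratio(Sigmas, c_star=16.0)
vals = [e for S in constrained for e in np.linalg.eigvalsh(S)]
assert max(vals) / min(vals) <= 16.0 + 1e-9              # (4) now holds
```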

# **2 The New Constraints**

García-Escudero *et al.* [13] have recently introduced three different types of constraints on the $\Sigma_j$ matrices, which depend on three constants $c_{\text{det}}$, $c_{\text{shw}}$ and $c_{\text{shb}}$, all greater than or equal to 1.

The first type of constraint serves to control the maximal ratio among determinants and, consequently, the maximum allowed difference between component volumes:

$$\text{deter:} \qquad \frac{\max\_{j=1,\ldots,k} |\Sigma\_j|}{\min\_{j=1,\ldots,k} |\Sigma\_j|} = \frac{\max\_{j=1,\ldots,k} \lambda\_j^p}{\min\_{j=1,\ldots,k} \lambda\_j^p} \le c\_{\text{det}}.\tag{5}$$

The second type of constraint controls departures from sphericity "within" each component:

$$\text{shape-"within":} \qquad \frac{\max\_{l=1,\ldots,p} \gamma\_{jl}}{\min\_{l=1,\ldots,p} \gamma\_{jl}} \le c\_{\text{shw}} \text{ for } j = 1,\ldots,k. \tag{6}$$

This provides a set of $k$ constraints which, in the most constrained case $c_{\text{shw}} = 1$, impose $\Gamma_1 = \ldots = \Gamma_k = I_p$, where $I_p$ is the identity matrix of size $p$, i.e., sphericity of the components.

Note that the new determinant-and-shape constraints (based on $c_{\text{det}} > 1$ and $c_{\text{shw}} = 1$) in (5) and (6) allow us to deal with spherical "heteroscedastic" cases, whereas the eigenvalue-ratio constraint (4) with $c^* = 1$ can only handle the spherical "homoscedastic" case. Constraints (5) and (6) were the basis for the "deter-and-shape" constraints in [14]. These two constraints alone result in mathematically well-defined constrained maximizations of the likelihoods in (1) and (2). However, although highly operative in many cases, they do not include as limit cases all the 14 parsimonious models already mentioned. For instance, we may be interested in the same (or not very different) $\Gamma_j$ or $\Sigma_j$ matrices for all the mixture components, and these cannot be obtained as limit cases of the "deter-and-shape" constraints.

The third constraint serves to control the maximum allowed difference between shape elements "between" components:

$$\text{shape-"between":} \qquad \frac{\max\_{j=1,\ldots,k} \gamma\_{jl}}{\min\_{j=1,\ldots,k} \gamma\_{jl}} \le c\_{\text{shb}} \text{ for } l = 1,\ldots,p. \tag{7}$$

This new type of constraint allows us to impose "similar" shape matrices for the components and, consequently, enforces $\Gamma_1 = \ldots = \Gamma_k$ in the most constrained case $c_{\text{shb}} = 1$.
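As a sketch of how the three constants act, the following function computes the empirical ratios implied by a list of scatter matrices, using the decomposition (3). All names are ours, and aligning shape elements across components by sorting eigenvalues is a simplifying convention of this sketch:

```python
import numpy as np

def constraint_ratios(Sigmas):
    """Empirical ratios corresponding to c_det, c_shw and c_shb for a list
    of scatter matrices, via the decomposition (3)."""
    p = Sigmas[0].shape[0]
    lams, shapes = [], []
    for S in Sigmas:
        ev = np.sort(np.linalg.eigvalsh(S))[::-1]
        lam = np.prod(ev) ** (1.0 / p)          # volume parameter lambda_j
        lams.append(lam)
        shapes.append(ev / lam)                 # shape elements gamma_{j1..jp}
    lams, G = np.array(lams), np.array(shapes)  # G has one row per component
    c_det = (lams.max() / lams.min()) ** p                 # constraint (5)
    c_shw = (G.max(axis=1) / G.min(axis=1)).max()          # constraint (6)
    c_shb = (G.max(axis=0) / G.min(axis=0)).max()          # constraint (7)
    return c_det, c_shw, c_shb

# Two components with identical shape but different volume:
Sigmas = [np.diag([4.0, 1.0]), np.diag([8.0, 2.0])]
c_det, c_shw, c_shb = constraint_ratios(Sigmas)
assert np.isclose(c_det, 4.0) and np.isclose(c_shw, 4.0) and np.isclose(c_shb, 1.0)
```

In the example the determinants are 4 and 16 (ratio 4), each shape matrix is diag(2, 0.5) (within-shape ratio 4), and the shapes coincide across components (between-shape ratio 1).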

# **3 An Illustration Example of the New Constraints**

Figure 1 shows an example based on three groups. The data have been generated imposing equal determinants ($c_{\text{det}} = 1$), a considerable departure from sphericity "within" each component ($c_{\text{shw}} = 40$), and a very moderate difference "between" shape elements across components ($c_{\text{shb}} = 1.3$). No constraint has been imposed on the rotation matrices. Finally, an average overlap of 0.10 has been imposed. These data sets have been generated through the MixSim method of [15], as extended by [16] and incorporated into the FSDA Matlab toolbox [17]. The overlap is defined as a sum of pairwise misclassification probabilities; see [16] for more details.

The application of the traditional tclust approach with maximum eigenvalue ratio $c^*$ equal to 128 and to $10^{10}$ produces the classifications shown in the left panels of Figure 2. In fact, the results in the top left panel would be exactly the same for any choice of $c^*$ within the interval [16, 128]. This means that an even higher value of $c^*$ would apparently be needed to detect

**Fig. 1** An example of simulated data with 3 clusters in two dimensions. The average overlap is 0.10. The data have been generated using equal determinants, a moderate difference in shape elements "between" components and a considerable departure from sphericity "within" each component.

those two almost parallel clusters shown in Figure 1. However, choosing a greater value for $c^*$ may destroy the desired protection against spurious solutions provided by the constraints. For example, the lower left panel shows how the choice $c^* = 10^{10}$ results in the detection of a spurious group consisting of a single observation.

The panels on the right, on the other hand, show the partitions resulting from the 3 new constraints imposed on the component covariance matrices. The top right panel shows the result of applying the 3 new restrictions with values of the tuning constants very close to those used to generate the dataset. In this case, it is possible to recover the real structure of the data generating process. Moreover, the real cluster structure is also recovered in the lower right panel by choosing larger values of these tuning constants, but not too large, in order to avoid the detection of spurious solutions. Some guidelines on how to choose these tuning constants can be found in [13].

**Fig. 2** Comparison between the traditional (left panels) and new tclust procedure (right panels).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Clustering Student Mobility Data in 3-way Networks**

Vincenzo Giuseppe Genova, Giuseppe Giordano, Giancarlo Ragozini, and Maria Prosperina Vitale

**Abstract** The present contribution aims at introducing a network data reduction method for the analysis of 3-way networks in which classes of nodes of different types are linked. The proposed approach enables simplifying a 3-way network into a weighted two-mode network by considering the statistical concept of joint dependence in a multiway contingency table. Starting from a real application on student mobility data in Italian universities, a 3-way network is defined, where provinces of residence, universities and educational programmes are considered as the three sets of nodes, and occurrences of student exchanges represent the set of links between them. The Infomap community detection algorithm is then chosen for partitioning two-mode networks of students' cohorts to discover different network patterns.

**Keywords:** 3-way network, complex network, community detection, mobility data, tertiary education

Vincenzo Giuseppe Genova
Department of Economics, Business, and Statistics, University of Palermo, Italy, e-mail: vincenzogiuseppe.genova@unipa.it

Giuseppe Giordano
Department of Political and Social Studies, University of Salerno, Italy, e-mail: ggiordano@unisa.it

Giancarlo Ragozini
Department of Political Science, Federico II University of Naples, Italy, e-mail: giragoz@unina.it

Maria Prosperina Vitale ()
Department of Political and Social Studies, University of Salerno, Italy, e-mail: mvitale@unisa.it

© The Author(s) 2023 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_17

# **1 Introduction**

Many complex relational data structures can be described as multimode or multiway networks in which nodes belonging to different modes are linked. The most common multimode network in social networks is represented by the affiliation network, where two-mode data, actors and events, form a bipartite graph divided into two groups [6]. In the case of tripartite networks, we deal with three types of nodes, and different graph structures can be defined.

Although only a few papers deal with methods for these networks, a growing number of works have appeared in recent years, especially for the bipartite and tripartite cases, to disentangle the inherent complexity of such data structures. Looking at the clustering and community detection algorithms proposed to partition a network into groups, we can identify some strands, all deriving from generalizations of methods suited for one-mode [19] and two-mode networks [2]. A classical approach consists of applying the usual community detection algorithms to a unique supra-adjacency matrix defined by combining all the possible two-mode networks in a block matrix [11, 15]. Alternative methods rely on projecting each two-mode network and applying the usual community detection algorithms separately on these matrices [10]. In addition, there are methods adopting both an optimization procedure for 3-way networks [16, 17, 14], which extends the idea of bipartite modularity [2], and an indirect blockmodeling approach, which derives a dissimilarity measure based on the concept of structural equivalence [3].

In our opinion, approaches based on examining the $k$ modes through the collection of the $k(k-1)/2$ two-mode networks [10] cannot take into account the statistical associations among all modes at the same time. Hence, the aim of this contribution is to present a network data reduction method based on the concept of joint dependence in a multiway contingency table [1].

Starting from real applications on the Italian student mobility phenomenon in higher education [12, 21, 7, 8, 13, 22], a 3-way network is defined, where provinces of residence, universities and educational programmes are the three modes. Student mobility flows, measured in terms of occurrences, represent the set of links between them. Assuming that the statistical dependency between the set of provinces of residence and the other two sets of nodes can be captured by the joined pairs of nodes (universities and educational programmes), the tripartite network is transformed into a bipartite network whose two modes are the Italian provinces of residence (first mode) and the set of all possible pairs of universities and educational programmes (second mode). Taking advantage of this network simplification, network indexes and clustering techniques for bipartite networks become available. Hence, the Infomap community detection algorithm is adopted [9, 4] to partition the derived network.

The remainder of the paper is organized as follows. Section 2 presents the details of the proposed strategy of analysis, and the main results are reported from the analysis of student mobility data of Italian universities. Section 3 provides final remarks.

# **2 Simplification of 3-way Networks**

In the present paper, the case of a tripartite network is considered as an example to show how the proposed network data simplification method works. In particular, we consider the real case study of student mobility paths in Italian universities. The MOBYSU.IT dataset<sup>1</sup> enables the reconstruction of network data structures based on student mobility flows among territorial units and universities.

More formally, given $\mathcal{V}_P \equiv \{p_1, \ldots, p_i, \ldots, p_I\}$, the set of $I$ provinces of residence; $\mathcal{V}_U \equiv \{u_1, \ldots, u_j, \ldots, u_J\}$, the set of $J$ Italian universities; and $\mathcal{V}_E \equiv \{e_1, \ldots, e_k, \ldots, e_K\}$, the set of $K$ educational programmes, a weighted tripartite 3-uniform hyper-graph $\mathcal{T}$ can be defined as a triple $(\mathcal{V}, \mathcal{L}, \mathcal{W})$, with $\mathcal{V} = \{\mathcal{V}_P, \mathcal{V}_U, \mathcal{V}_E\}$ the collection of the three sets of vertices, one for each mode, and $\mathcal{L} = \{\mathcal{L}_{PUE}\}$, $\mathcal{L}_{PUE} \subseteq \mathcal{V}_P \times \mathcal{V}_U \times \mathcal{V}_E$, the collection of hyper-edges, whose generic element $(p_i, u_j, e_k)$ is the link joining the $i$-th province, the $j$-th university, and the $k$-th educational programme. Finally, $\mathcal{W}$ is the set of weights, obtained through the function $w: \mathcal{L}_{PUE} \to \mathbb{N}$, where $w(p_i, u_j, e_k) = w_{ijk}$ is the number of students moving from province $p_i$ to university $u_j$ in educational programme $e_k$. Such a network structure can be described as a three-way array $A = (a_{ijk})$, with $a_{ijk} \equiv w_{ijk}$, and it has been called a 3-way network [3].

To deal with such a complex network structure, and aiming at obtaining communities in which the three modes are mixed, we wish to simplify the tripartite nature of the graph without losing significant information. In statistical terms, the array $A$ can be interpreted as a 3-way contingency table, and the statistical techniques for evaluating the association among variables (i.e., the modes) can then be exploited [1]. Because a 3-way contingency table is a cross-classification of observations by the levels of three categorical variables, we are defining a network structure where the sets of nodes are the levels of the categorical variables. Specifically, we assume that if two modes are jointly associated (as universities and educational programmes are by their very nature), the tripartite network can be logically simplified into a bipartite one. In the student mobility network, we join the pairs of nodes in $\mathcal{V}_U$ and $\mathcal{V}_E$, and then deal with the relationships between these *dyads* and the nodes in $\mathcal{V}_P$.

Following this assumption, the sets of nodes $\mathcal{V}_U$ and $\mathcal{V}_E$ are merged into a set of joint nodes, namely $\mathcal{V}_{UE}$. The tripartite network $\mathcal{T}$ can now be represented as a bipartite network $\mathcal{B}$ given by the triple $\{\mathcal{V}^*, \mathcal{L}^*, \mathcal{W}^*\}$, with $\mathcal{V}^* = \{\mathcal{V}_P, \mathcal{V}_{UE}\}$. The set of hyper-edges $\mathcal{L}$ is thus simplified into a set of edges $\mathcal{L}^* = \{\mathcal{L}_{P,UE}\}$, $\mathcal{L}_{P,UE} \subseteq \mathcal{V}_P \times \mathcal{V}_{UE}$. The new edges $(p_i, (u_j; e_k))$ connect a province $p_i$ with an educational programme $e_k$ running in a given university $u_j$. The weights $\mathcal{W}^*$ are the same as in the hyper-graph $\mathcal{T}$, i.e., $w^*_{i,jk} = w_{ijk}$. Note that the weights contained in the 3-way array $A$ are preserved, but are now organized in a rectangular matrix **A** of $I$ rows and $(J \times K)$ columns.
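In array terms, joining the university and programme modes is simply a reshape of the 3-way array into the bipartite weight matrix. A toy numpy sketch (dimensions are illustrative, not the real MOBYSU.IT data):

```python
import numpy as np

# Toy 3-way array A: I provinces x J universities x K programmes.
I, J, K = 3, 2, 2
rng = np.random.default_rng(1)
A = rng.integers(0, 5, size=(I, J, K))

# Joining the (university, programme) modes flattens A into the
# I x (J*K) weight matrix of the bipartite network B.
B = A.reshape(I, J * K)

# Weights are preserved: entry (i, j*K + k) of B equals a_{ijk}.
assert B[1, 1 * K + 0] == A[1, 1, 0]
assert B.sum() == A.sum()
```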

<sup>1</sup> Database MOBYSU.IT [Mobilità degli Studi Universitari in Italia], research protocol MUR - Universities of Cagliari, Palermo, Siena, Torino, Sassari, Firenze, Cattolica and Napoli Federico II, Scientific Coordinator Massimo Attanasio (UNIPA), Data Source ANS-MUR/CINECA.

Taking advantage of this method, we aim to analyse weighted bipartite graphs through clustering methods. Among others, we use the Infomap community detection algorithm [9, 4] to study flow patterns in network structures, instead of the modularity optimization proposed in topological approaches [18, 5]. Indeed, the rationale of this algorithm, the *map equation*, exploits the duality between finding communities and minimizing the description length (*codelength*) of a random walker's movement on a network. The partition with the shortest codelength is the one that best captures the community structure in the bipartite data. Formally, the algorithm defines a module partition **M** of *n* vertices into *m* modules such that each vertex is assigned to one and only one module. Infomap looks for the partition **M** that minimizes the expected *codelength* $L(M)$ of a random walker, given by the following map equation:

$$L(M) = q\_{\curvearrowright} H(\mathcal{Q}) + \sum\_{i=1}^{m} p\_{\circlearrowright}^{i} H(\mathcal{P}^{i}) \tag{1}$$

In equation (1), $q_{\curvearrowright} H(\mathcal{Q})$ represents the entropy of the movement between modules, weighted by the probability that the random walker switches modules on any given step ($q_{\curvearrowright}$), and $\sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i})$ is the entropy of the movements within modules, where each term is weighted by the fraction of within-module movements occurring in module $i$ plus the probability of exiting module $i$ ($p_{\circlearrowright}^{i}$), such that $\sum_{i=1}^{m} p_{\circlearrowright}^{i} = 1 + q_{\curvearrowright}$ [9].
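A toy computation of the two-level map equation can clarify how the two terms interact. The sketch below takes the module exit probabilities and node visit rates as given inputs (in Infomap they are derived from the random walker's stationary distribution on the actual network); all names and numbers are illustrative:

```python
import numpy as np

def H(probs):
    """Shannon entropy (bits) of a probability vector, normalized internally."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    if p.size == 0:
        return 0.0
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

def map_equation(q_exit, visit_rates):
    """Two-level map equation L(M): q_exit[i] is the probability of exiting
    module i on a step; visit_rates[i] holds the visit probabilities of the
    nodes inside module i (summing to 1 over all modules)."""
    q = float(np.sum(q_exit))                 # total module-switch probability
    index_term = q * H(q_exit)                # index codebook: between modules
    module_term = sum(
        (float(np.sum(v)) + q_exit[i]) * H(list(v) + [q_exit[i]])
        for i, v in enumerate(visit_rates)
    )
    return index_term + module_term

# Two modules, 10% exit probability each (illustrative numbers):
L_two = map_equation([0.1, 0.1], [[0.2, 0.2], [0.3, 0.3]])
# One module, no exits: the codelength reduces to the entropy of node visits.
L_one = map_equation([0.0], [[0.2, 0.2, 0.3, 0.3]])
assert L_two > 0 and np.isclose(L_one, H([0.2, 0.2, 0.3, 0.3]))
```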

In our case, the Infomap algorithm is adopted to discover communities of students characterized by similar mobility patterns. Indeed, for mobility data, where links represent patterns of student movement among territorial units and universities, flow-based approaches are likely to identify the most important features. Finally, in our student mobility network, a filtering procedure based on the Empirical Cumulative Distribution Function (ECDF) of the link weights is adopted, in order to focus only on relevant student flows.

#### **2.1 Main Findings**

Cohorts of students enrolled in Italian universities in four academic years (a.y.), 2008–09, 2011–12, 2014–15, and 2017–18, are analysed. The number of nodes in the sets $\mathcal{V}_P$ (107 provinces), $\mathcal{V}_U$ (79–80 universities), and $\mathcal{V}_E$ (45 educational programmes), and the number of students involved in the four cohorts, are quite stable over time (Table 1). Furthermore, the percentage of movers (i.e., students enrolled in a university outside their region of residence) increased from 16.4% in a.y. 2008–09 to 20.6% in a.y. 2017–18, and it is higher for males than for females.


**Table 1** Percentage of students according to their mobility status by cohort and gender.

Following the network simplification approach, the tripartite networks, one for each cohort, are simplified into bipartite networks, and the four ECDFs of link weights are considered to filter relevant flows. The distributions suggest that more than 50% of the links between pairs of nodes have weight equal to 1 (i.e., flows of only one student), and that about 95% of the flows involve fewer than 10 students. Thus, the networks retaining only links with weight greater than or equal to 10 are further analysed.
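The filtering step can be sketched with a plain ECDF over link weights (the weights below are made up for illustration; the real distribution comes from the MOBYSU.IT networks):

```python
import numpy as np

def ecdf(w, t):
    """Empirical cumulative distribution function of weights w evaluated at t."""
    return float((np.asarray(w) <= t).mean())

# Hypothetical link weights: many single-student flows, few large ones.
weights = np.array([1] * 60 + [2, 3, 5, 8] * 8 + [12, 25, 40])

assert ecdf(weights, 1) > 0.5          # over half the links carry one student
assert ecdf(weights, 9) > 0.9          # most flows involve fewer than 10 students
kept = weights[weights >= 10]          # retain only the relevant flows
assert len(kept) == 3
```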

To reveal groups of universities and educational programmes attracting students, the Infomap community detection algorithm is applied. Looking at Table 2, we notice a reduction in the number of communities from the first to the last student cohort, suggesting a sort of stabilization in the trajectories of movers towards brand universities of the center-north, together with an increase in north-north mobility [20], and a relevant dichotomy between scientific and humanistic educational programmes. Network visualizations by groups (Figures 1 and 2) confirm that the most attractive universities are located in the north of Italy, especially for educational programmes in economics and engineering (Bocconi University, the Polytechnic of Turin and the Cattolica University).

**Table 2** Number of communities, codelength, and relative saving codelength per cohort.


**Fig. 1** Network visualization by groups, student cohort a.y. 2008–09.

**Fig. 2** Network visualization by groups, student cohort a.y. 2017–18.

# **3 Concluding Remarks**

The proposed network simplification strategy on tripartite graphs defined for student mobility data provides interesting insights into the phenomenon under analysis. The main attractive destinations remain the northern universities for educational programmes such as engineering and business. Besides the well-known south-to-north route, other interregional routes within the northern area appear. In addition, the reduction in the number of communities suggests a sort of stabilization in the mobility routes of movers towards brand universities, highlighting student destination choices close to labor market demand.

Hyper-graphs and multipartite networks still remain very active areas for research and challenging tasks for scholars interested in discovering the complexities underlying these kinds of data. Specific tools for such complex network structures should be designed combining network analysis and other statistical techniques. As future lines of research, the comparison of community detection algorithms that better represent the structural constraints of the phenomena under analysis and the assessment of other backbone approaches to filter the significant links will be developed.

**Acknowledgements** The contribution has been supported from Italian Ministerial grant PRIN 2017 "From high school to job placement: micro-data life course analysis of university student mobility and its impact on the Italian North-South divide", n. 2017 HBTK5P - CUP B78D19000180001.

# **References**



# **Clustering Brain Connectomes Through a Density-peak Approach**

Riccardo Giubilei

**Abstract** The density-peak (DP) algorithm is a mode-based clustering method that identifies cluster centers as data points being surrounded by neighbors with lower density and far away from points with higher density. Since its introduction in 2014, DP has reaped considerable success for its favorable properties. A striking advantage is that it does not require data to be embedded in vector spaces, potentially enabling applications to arbitrary data types. In this work, we propose improvements to overcome two main limitations of the original DP approach, i.e., the unstable density estimation and the absence of an automatic procedure for selecting cluster centers. Then, we apply the resulting method to the increasingly important task of graph clustering, here intended as gathering together similar graphs. Potential implications include grouping similar brain networks for ability assessment or disease prevention, as well as clustering different snapshots of the same network evolving over time to identify similar patterns or abrupt changes. We test our method in an empirical analysis whose goal is clustering brain connectomes to distinguish between patients affected by schizophrenia and healthy controls. Results show that, in the specific analysis, our method outperforms many existing competitors for graph clustering.

**Keywords:** nonparametric statistics, mode-based clustering, networks, graph clustering, kernel density estimation

# **1 Introduction**

Clustering is the task of grouping elements from a set in such a way that elements in the same group, also defined as *cluster*, are in some sense similar to each other, and dissimilar to those from other groups. Mode-based clustering is a nonparametric approach that works by first estimating the density, and then identifying in some

Riccardo Giubilei ()
Luiss Guido Carli, Rome, Italy, e-mail: rgiubilei@luiss.it

© The Author(s) 2023 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_18

way its modes and the corresponding clusters. An effective method to find modes and clusters is through the density-peak (DP) algorithm [12], which has drawn considerable attention since its introduction in 2014. One of the striking advantages of DP is that it does not require data to be embedded in vector spaces, implying that it can be applied to arbitrary data types, provided that a proper distance is defined. In this work, we focus on its application to clustering graph-structured data objects.

The expression *graph clustering* can refer either to *within-graph clustering* or to *between-graph clustering*. In the first case, the elements to be grouped are the vertices of a single graph; in the second, the objects are distinct graphs. Here, *graph clustering* is intended as *between-graph clustering*. Between-graph clustering is an emerging but increasingly important task due to the growing need of analyzing and comparing multiple graphs [10, 4]. Potential applications include clustering: brain networks of different people for ability assessment, disease prevention, or disease evaluation; online social ego networks of different users to find people with similar social structures; different snapshots of the same network evolving over time to identify similar patterns, cycles, or abrupt changes.

Heretofore, the task of between-graph clustering has not been exhaustively investigated in the literature, implying a substantial lack of well-established methods. The goal of this work is to improve and adapt the density-peak algorithm to define a fairly general method for between-graph clustering. For validation and comparison purposes, the resulting procedure and its main competitors are applied to grouping brain connectomes of different people to distinguish between patients affected by schizophrenia and healthy controls.

# **2 Related Work**

Existing techniques for between-graph clustering can be divided into two main categories: 1) transforming graph-structured data objects into Euclidean feature vectors in order to apply standard clustering algorithms; 2) using the distances between the original graphs in distance-based clustering methods.

The most common technique within the first category is the use of classical clustering techniques on the vectorized adjacency matrices [10]. Nonetheless, more advanced numerical summaries have been proposed to better capture the structural properties of the graphs and to decrease feature dimensionality. Examples include: shell distribution [1], traces of powers of the adjacency matrix [10], and graph embeddings such as *graph2vec* [11]; see [4] for a longer list. Techniques from the first category share an important drawback: the transformation into feature vectors necessarily implies loss of information. Additionally, methods for extracting features may be domain-specific.

The second category features Partitioning Around Medoids (PAM) [7], or $k$-medoids, which finds representative observations by iteratively minimizing a cost function based on the distances between data objects, and assigns other observations to the closest medoid. PAM's main limitations are that it requires the number of clusters in advance and can only identify convex-shaped groups. Density-based spatial clustering of applications with noise [3], or DBSCAN, overcomes these two constraints by computing the density of data points starting from their distances, and defining clusters as samples of high density that are close to each other (and surrounded by areas of lower density). A similar approach is the density-peak (DP) algorithm, described in greater detail in Section 3.1. Alternatively, hierarchical clustering can be applied to distances between graphs, as in [13], where a spectral Laplacian-based distance is proposed and used. Finally, $k$-groups [8] is a clustering technique within the Energy Statistics framework [14] whose goal is minimizing the total within-cluster Energy distance, which is computed starting from the distances between the original observations.

# **3 Methods**

In this section, we first describe the original DP approach; then, we introduce the DP-KDE method, which is partly named after Kernel Density Estimation; finally, we discuss how to employ it for graph clustering.

#### **3.1 Original DP**

The density-peak algorithm [12] is based on a simple idea: since cluster centers are identified as the distribution's modes, they must be 1) surrounded by neighbors with lower density, and 2) at a relatively large distance from points with higher density. Consequently, two quantities are computed for each observation $x_i$: the local density $\rho_i$, and the minimum distance $\delta_i$ from other data points with higher density. The local density $\rho_i$ of $x_i$ is defined as:

$$\rho\_i = \sum\_j I(d\_{ij} - d\_c),\tag{1}$$

where $I(\cdot)$ is the indicator function (equal to 1 when its argument is negative and 0 otherwise), $d_{ij} = d(x_i, x_j)$ is the distance between $x_i$ and $x_j$, and $d_c$ is a cutoff distance. In simple terms, $\rho_i$ is the number of points that are closer than $d_c$ to $x_i$. The DP algorithm is robust with respect to $d_c$, at least with large datasets [12]. Once the density is computed, the definition of the minimum distance $\delta_i$ between point $x_i$ and any other point $x_j$ with higher density is straightforward:

$$\delta\_i = \min\_{j:\rho\_j > \rho\_i} (d\_{ij}). \tag{2}$$

By convention, the point with the highest density has $\delta_i = \max_j(d_{ij})$. The interpretation of $\delta_i$ reflects the algorithm's core idea: data points that are not local or global density maxima have their $\delta_i$ constrained by other points within the same cluster, hence cluster centers have large values of $\delta_i$. However, this is not sufficient: they also need to have a large $\rho_i$, because otherwise the point could be merely distant from any other. After identifying cluster centers, other observations are assigned to the same cluster as their nearest neighbor of higher density.
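As a concrete illustration, the steps above can be sketched in a few lines of Python operating on a precomputed distance matrix. This is our own minimal sketch, not the authors' implementation; the function name and the use of the product `rho * delta` for ranking candidate centers are illustrative choices.

```python
# Illustrative sketch of the density-peak (DP) steps: local density via the
# cutoff d_c (Eq. 1), minimum distance to a higher-density point (Eq. 2),
# center selection, and single-step assignment. `d` is a distance matrix.

def density_peak(d, d_c, k):
    """Cluster points given pairwise distances d, cutoff d_c, and k centers."""
    n = len(d)
    # Local density: number of points closer than the cutoff d_c (Eq. 1).
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c) for i in range(n)]
    # Minimum distance to any point of higher density (Eq. 2); points with no
    # denser neighbor get the largest row distance, by convention.
    delta, nn_higher = [0.0] * n, [None] * n
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            nn_higher[i] = min(higher, key=lambda j: d[i][j])
            delta[i] = d[i][nn_higher[i]]
        else:
            delta[i] = max(d[i])
    # Centers: the k points with the largest product rho * delta.
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:k]
    labels = {c: idx for idx, c in enumerate(centers)}
    # Assign remaining points to their nearest higher-density neighbor's
    # cluster, processing points in decreasing order of density.
    for i in sorted(range(n), key=lambda i: -rho[i]):
        if i not in labels:
            labels[i] = labels[nn_higher[i]]
    return [labels[i] for i in range(n)]
```

On two well-separated one-dimensional groups, the sketch recovers the intended partition in a single pass, with no iterative refinement.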

The density-peak algorithm has many favorable properties: it can detect nonspherical clusters; it requires neither the number of clusters in advance nor data embedded in a vector space; it is computationally fast, since it does not explicitly maximize each data point's density field and performs cluster assignment in a single step; it estimates a clear population quantity; and it has only one tuning parameter (the cutoff distance $d_c$).

#### **3.2 DP-KDE**

The density-peak approach also has drawbacks. Over the last few years, many articles have proposed improvements to overcome two main critical points: the unstable density estimation and the absence of an automatic procedure for selecting cluster centers. In this work, we explicitly tackle these two aspects.

The instability of the density estimation induced by Equation (1) has been widely documented [9, 16, 15]. Although many solutions have been proposed, we follow the research line suggesting the use of Kernel Density Estimation (KDE) to compute $\rho_i$ [9, 15]:

$$\rho\_i = \frac{1}{nh} \sum\_{j=1}^{n} K\left(\frac{\mathbf{x}\_i - \mathbf{x}\_j}{h}\right). \tag{3}$$

In Equation (3), ℎ is the *bandwidth*, which is a smoothing parameter, and 𝐾(·) is the *kernel*, which is a non-negative function weighting the contribution of each data point to the density of the 𝑖-th observation. We use the Epanechnikov kernel, which is normalized, symmetric, and optimal in the Mean Square Error sense [2]:

$$K(u) = \begin{cases} 3/4(1 - u^2), & |u| \le 1 \\ 0, & |u| > 1 \end{cases} \tag{4}$$

Equation (4) implies a null contribution of observation $j$ to the $i$-th density whenever $|(x_i - x_j)/h| \geq 1$; in the opposite case, it yields a positive weight depending quadratically on $(x_i - x_j)/h$. Consequently, $h$ may be regarded as the cutoff distance for the DP-KDE method.
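As an illustration, Equations (3) and (4) can be computed directly. This is our own sketch for a one-dimensional sample, not the authors' code; the function names are hypothetical.

```python
# Epanechnikov kernel of Eq. (4) and the KDE local density of Eq. (3).

def epanechnikov(u):
    """K(u) = 3/4 (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde_density(x, h):
    """Local density rho_i for each point of a 1-D sample x, bandwidth h."""
    n = len(x)
    return [sum(epanechnikov((xi - xj) / h) for xj in x) / (n * h) for xi in x]
```

Note that, as in Equation (3), the sum includes the $j = i$ term, which contributes the kernel's value at zero.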

The automatic selection of cluster centers involves many aspects: the cutoff distance, the number of clusters, and which data points to select. In the following, we use a cutoff distance $h$ such that the average number of neighbors is between 1 and 2% of the sample size, as suggested by [12]. The number of clusters $k$ is here considered a given parameter, leaving the search for its optimal value to future work. Finally, the method for selecting data points as cluster centers is obtained by refining an intuition contained in [12], where candidates are observations with sufficiently large values of $\gamma_i = \delta_i \rho_i$. However, this quantity has two major drawbacks: first, if $\delta_i$ and $\rho_i$ are not defined over the same scale, results could be misleading; second, it implicitly assumes that $\delta_i$ and $\rho_i$ should be given the same weight in the decision. We overcome these two limitations by first normalizing both $\delta_i$ and $\rho_i$ between 0 and 1, and then giving them different weights based on their informativeness. We measure the latter using the Gini coefficient of the two (normalized) quantities, under the assumption that the less concentrated of the two distributions is the more informative. Specifically, each observation is given a measure of importance defined as:

$$
\gamma\_i^G = \delta\_{01,i}^{G(\delta\_{01})} \rho\_{01,i}^{G(\rho\_{01})},\tag{5}
$$

where $\delta_{01}$ and $\rho_{01}$ are the normalized versions of $\delta$ and $\rho$ respectively, $\delta_{01,i}$ and $\rho_{01,i}$ are the corresponding $i$-th values, and $G(x)$ denotes the Gini coefficient of $x$. Then, the selected cluster centers are the top $k$ observations in terms of $\gamma_i^G$. Assigning observations to the same cluster as their nearest neighbor of higher density concludes the DP-KDE method.
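A minimal sketch of the center-selection rule of Equation (5), with hypothetical function names of our choosing (`gini`, `gamma_g`):

```python
# Gini-weighted importance of Eq. (5): normalize delta and rho to [0, 1],
# exponentiate each by its own Gini coefficient, and rank the products.

def gini(x):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    n, s = len(x), sum(x)
    xs = sorted(x)
    # Standard formula: G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n
    return 2.0 * sum((i + 1) * v for i, v in enumerate(xs)) / (n * s) - (n + 1) / n

def gamma_g(delta, rho, k):
    """Indices of the top-k cluster centers by the importance of Eq. (5)."""
    norm = lambda x: [(v - min(x)) / (max(x) - min(x)) for v in x]
    d01, r01 = norm(delta), norm(rho)
    gd, gr = gini(d01), gini(r01)
    gamma = [d ** gd * r ** gr for d, r in zip(d01, r01)]
    return sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)[:k]
```

The exponents down-weight whichever of the two quantities is more uniformly distributed, since a nearly constant $\delta_{01}$ or $\rho_{01}$ carries little discriminating information.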

#### **3.3 Graph Clustering**

A *graph* is a mathematical object composed of a collection of *vertices* linked by *edges*. Formally, a graph is denoted by $\mathcal{G} = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. If $e \in E$ joins vertices $u, v \in V$, i.e., $e = \{u, v\}$, then $u$ and $v$ are *adjacent* or *neighbors*. The number of edges incident to a vertex $v$ is the *degree* of $v$. Each edge $e \in E$ is represented through a numerical value $w_e$ called *edge weight*: if weights equal 1 for all and only the existent edges, and 0 for the others, $\mathcal{G}$ is *unweighted*; when existent edges have real-valued weights, $\mathcal{G}$ is *weighted*. If $w_{\{u,v\}} = w_{\{v,u\}}$ for all $u, v \in V$, the graph $\mathcal{G}$ is *undirected*; otherwise, it is *directed*. The entire information about $\mathcal{G}$'s connectivity is stored in a $|V| \times |V|$ *adjacency matrix* $\mathbf{A}$ whose entry in the $u$-th row and $v$-th column is $w_e$, where $e = \{u, v\}$ and $u, v \in V$.
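As a toy illustration of the definitions above, a hypothetical helper (our naming) can build the symmetric adjacency matrix of a small weighted undirected graph:

```python
# Build the |V| x |V| adjacency matrix of a weighted undirected graph from a
# list of (u, v, weight) edges; symmetry encodes the undirected property.

def adjacency_matrix(n, weighted_edges):
    """Matrix A with A[u][v] = A[v][u] = w for each edge {u, v} of weight w."""
    A = [[0.0] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        A[u][v] = A[v][u] = w  # undirected: w_{u,v} = w_{v,u}
    return A

# A weighted triangle on vertices 0, 1, 2 plus an isolated vertex 3:
A = adjacency_matrix(4, [(0, 1, 0.5), (1, 2, 1.0), (0, 2, 2.0)])
```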

The DP-KDE method can be used for graph clustering once a proper distance between graphs is defined. In this work, we employ the Edge Difference Distance [6], defined as the Frobenius norm of the difference between the two graphs' adjacency matrices. The choice is motivated by several factors: a flexible definition that can be applied directly to directed and weighted graphs as well, the reasonable results it yields when node correspondence is a concern, and its limited computational time complexity. Formally, the Edge Difference Distance between two graphs $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as:

$$d\_{ED}(\mathbf{x}\_i, \mathbf{x}\_j) = ||\mathbf{A}^i - \mathbf{A}^j||\_F \, \coloneqq \sqrt{\sum\_p \sum\_q |A^i\_{pq} - A^j\_{pq}|^2},\tag{6}$$

where $\mathbf{A}^i$ and $\mathbf{A}^j$ are the adjacency matrices of $\mathbf{x}_i$ and $\mathbf{x}_j$ respectively, and $||\cdot||_F$ denotes the Frobenius norm.
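Equation (6) amounts to an elementwise computation that can be sketched in plain Python (the function name is ours):

```python
# Edge Difference Distance of Eq. (6): the Frobenius norm of the difference
# between two adjacency matrices, given as nested lists of equal shape.

def edge_difference_distance(A, B):
    """sqrt of the sum of squared entrywise differences between A and B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) ** 0.5
```

For example, the distance between a single unweighted undirected edge and the empty graph on two vertices is $\sqrt{2}$, since the edge appears twice in the symmetric adjacency matrix.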

Consequently, the two fundamental quantities of the DP-KDE method are computed in the following as:

$$\rho\_i = \sum\_{j=1}^{n} K\left(\frac{d\_{ED}(\mathbf{x}\_i, \mathbf{x}\_j)}{h}\right),\tag{7}$$

where 𝐾(·) is the Epanechnikov kernel defined in Equation (4) and the normalizing constant is omitted because we are simply interested in the ranking between the densities, and:

$$\delta\_i = \min\_{j:\rho\_j > \rho\_i} (d\_{ED}(\mathbf{x}\_i, \mathbf{x}\_j)). \tag{8}$$

Finally, cluster centers are selected as the observations with the largest values of $\gamma_i^G$, as defined in Equation (5), and other observations are assigned to the same cluster as their nearest neighbor in terms of $\delta_i$.

# **4 Empirical Analysis**

The DP-KDE method for graph clustering is employed in an unsupervised empirical analysis where the ground truth is known, and its performance is compared in terms of accuracy both with natural competitors and with a method treating the problem as supervised. The ultimate goal is clustering brain connectomes, one for each individual, correctly distinguishing between patients affected by schizophrenia (SZ) and healthy controls.

We use publicly available<sup>1</sup> data from a recent study [5] whose aim is finding relevant links between Regions of Interest (ROIs) for predicting schizophrenia from multimodal brain connectivity data. The cohort is composed of 27 schizophrenic patients and 27 age-matched healthy participants acting as control subjects. In the current work, we focus only on this cohort's functional Magnetic Resonance Imaging (fMRI) connectomes. Functional connectivity matrices have been computed from the fMRI scans, treating them as time series and computing Pearson's correlation coefficient between the time series of distinct ROIs. The resulting matrices are weighted, undirected, and made of 83 nodes.

The aforementioned study [5] treats every functional connectivity matrix as a single multivariate realization of $(83 \cdot 82)/2 = 3403$ numeric variables, each representing a connection between two of the 83 ROIs. They reduce feature dimensionality by performing Recursive Feature Elimination based on Support Vector Machines (SVM-RFE), and tackle the classification problem as supervised using 20 repetitions of nested 5-fold cross-validation. When using only functional connectivity data, they achieve an average accuracy of 68.28%<sup>2</sup> over the resulting 100 test sets.

<sup>1</sup> https://doi.org/10.5281/zenodo.3758534.

<sup>2</sup> This exact figure is not included in the article, but the analysis is fully reproducible since the authors made their source code available at https://github.com/leoguti85/BiomarkersSCHZ.

The approach we adopt in this work is rather different. First, graphs are analyzed in their original form, without any simplification into numeric variables, resulting in a single graph-structured variable. There are 54 observations, each representing the functional connectome of a different individual. We tackle the problem with an unsupervised classification approach seeking to cluster connectomes into two groups: schizophrenic and healthy. To this end, we use the DP-KDE method for graph clustering described in Section 3.3. Starting from the 54 connectomes, each observation's local density $\rho_i$ and minimum distance $\delta_i$ are computed using Equations (7) and (8), respectively. The centers of the two clusters are the observations with the largest $\gamma_i^G$. Then, other observations are assigned to the same cluster as their nearest neighbor of higher density. Finally, the clustering performance is evaluated by comparing the algorithm's assignment to the ground truth. The DP-KDE method achieves an accuracy of 70.37%, which is more than 2 percentage points higher than the one obtained in [5].

Table 1 reports the accuracy of both the DP-KDE and the SVM-RFE methods, as well as that of other graph clustering competitors. Specifically, we consider: the classical DP algorithm on the original data objects, with the same cutoff distance as in DP-KDE and manually selected cluster centers; k-means clustering on the 3403 numeric variables obtained by vectorizing the adjacency matrices; DBSCAN on the original data objects, with parameters $\varepsilon = 20.2$ and 15 as the minimum number of points required to form a dense region; PAM and $k$-groups on the original data objects. In all these cases, the number of clusters has been kept fixed at $k = 2$. The method that yields the best accuracy on this specific problem is DP-KDE.

**Table 1** Accuracy for DP-KDE and some of its possible competitors.


# **5 Concluding Remarks**

After explaining the importance of graph clustering and briefly reviewing some existing methods for this task, we have considered the adoption of a density-peak approach. We have improved the original DP algorithm by using a more robust definition of the density $\rho_i$, and by automatically selecting cluster centers based on the quantity $\gamma_i^G$ we have introduced. We have also selected a proper distance between graphs, namely the Edge Difference Distance. Finally, we have used the resulting method in an empirical analysis with the goal of clustering brain connectomes to distinguish between schizophrenic patients and healthy controls. Our method outperforms one treating the specific task as supervised, and it is clearly the best among the graph clustering competitors considered.

An initial idea for future work is the search for the optimal number of clusters. This may be achieved either by fixing a threshold for $\gamma_i^G$ or by selecting all the data points after the largest increase in terms of $\gamma_i^G$. The cutoff distance could also be tuned, possibly by maximizing in some way the dispersion of points in the bivariate distribution of $\rho$ and $\delta$. Then, the DP-KDE method needs to be extended beyond the univariate case. Finally, other distances between graphs could be considered to better reflect alternative application-specific needs, e.g., when graphs are not defined over the same set of nodes.

**Acknowledgements** The author would like to thank Pierfrancesco Alaimo Di Loro, Federico Carlini, Marco Perone Pacifico, and Marco Scarsini for several engaging and stimulating discussions.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Similarity Forest for Time Series Classification**

Tomasz Górecki, Maciej Łuczak, and Paweł Piasecki

**Abstract** The idea of the similarity forest comes from Sathe and Aggarwal [19] and is derived from the random forest. Over its already 20 years of existence, the random forest has proved to be one of the most successful methods, showing top performance across a vast array of domains while remaining simple, time-efficient, and interpretable. However, its usage is limited to multidimensional data. The similarity forest does not require such a representation: it only needs the similarities between observations to be computable. Thus, it may be applied to data for which a multidimensional representation is not available. In this paper, we propose an implementation of the similarity forest for time series classification. We investigate two distance measures, Euclidean and dynamic time warping (DTW), as the underlying measure for the algorithm. We compare the performance of the similarity forest with the 1-nearest neighbor and random forest classifiers on the UCR (University of California, Riverside) benchmark database. We show that the similarity forest with DTW, taking mean ranks into account, outperforms the other classifiers. The comparison is enriched with statistical analysis.

**Keywords:** time series, time series classification, random forest, similarity forest

Tomasz Górecki ()

Maciej Łuczak

Paweł Piasecki

© The Author(s) 2023 165 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_19

Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poznańskiego 4, Poznań, Poland, e-mail: tomasz.gorecki@amu.edu.pl

Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poznańskiego 4, Poznań, Poland, e-mail: maciej.luczak@amu.edu.pl

Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Uniwersytetu Poznańskiego 4, Poznań, Poland, e-mail: pawel.piasecki@amu.edu.pl

# **1 Introduction**

Time series classification is a fast-developing research field that has gained much attention from researchers and business during the last two decades, apparently because more and more of the data around us is located in the time domain and thus fulfills the definition of a time series. Predictive maintenance [18], quality monitoring [22], stock market analysis [20], and sales forecasting [17] are just a few present-day problems where time series are indeed present. The reason why we usually apply different methods to time series than to regular (non-time series) data is that time series are ordered in time (or some other space with an ordering), and it is beneficial to use the information conveyed by this ordering.

In recent years, one could observe many advances in the field of time series classification. In 2017, Bagnall et al. presented a comprehensive comparison of time series classification algorithms [2], showing that although there are dozens of far more complex methods, 1-Nearest Neighbor (1NN) [6, 11] coupled with the DTW [3] distance constitutes a good baseline. In fact, it has been outperformed by several classifiers, with the Collective of Transformation Ensembles (COTE) [1] as the most efficient one. Furthermore, COTE was extended with a Hierarchical Vote system, first to HIVE-COTE [13] and then to HIVE-COTE 2.0 [15], the current state-of-the-art classifier for time series. In general, the success of the COTE family of classifiers is based on the observation that, in the case of time series, it is highly beneficial to use different data representations. For example, HIVE-COTE 1.0 utilizes five ensembles based on different data transformation domains. However, a common criticism of such an approach is its time complexity. In the case of HIVE-COTE, it equals $O(n^2 l^4)$, where $n$ is the number of observations and $l$ is the length of a series. Another drawback, especially significant for practitioners, is the complex structure of the model ensembles, which makes it hard to use HIVE-COTE without spending a considerable amount of time studying its components beforehand.

An alternative to such complex models may be to accept possibly slightly worse performance in favor of model simplicity and reduced computation time. A group of classifiers that seems to hold great potential are those inspired by the Random Forest (RF) [4]. This already 20-year-old algorithm remains at the forefront of classifiers, showing extremely good performance and robustness across multiple domains. Fernández-Delgado et al. [10] performed a comparison of 179 classifiers on 121 non-time series data sets originating from the UCI Machine Learning Repository [9], concluding that RF is the most accurate one. Unfortunately, the usage of RF is essentially limited to multidimensional data, as it samples features from the original space while creating each node of its decision trees.

In this paper, we propose a method for extending RF to time series using similarity forests (SF), significantly widening the applicability of the RF approach to time series data; furthermore, the approach even outperforms traditional time series classifiers. SF was initially proposed by Sathe and Aggarwal in 2017 [19] as a method extending random forests to arbitrary data sets, provided that we are able to compute similarities between observations. Our main goal is to enrich the pool of time series classifiers by implementing and tuning SF for time series data. We investigate the performance of the model using two distance measures (the algorithm's hyper-parameter): Euclidean and DTW. We also provide a comparison with other selected time series classifiers, evaluating SF against 1NN-ED, 1NN-DTW, and RF.

The rest of the paper is structured as follows. In Section 2, we provide details of similarity forest and we give more details about random forests. Additionally, we discuss how similarity forest is related to random forest. Section 3 describes data sets that we used and the comparison methodology. The corresponding results are presented in Section 4. Finally, in Section 5 we give a brief summary of our research.

# **2 Classification Methods Used in Comparison**

In this paper, we compare the standard random forest and the similarity forest with two distance measures: ED (Euclidean distance) and DTW (dynamic time warping distance). As benchmark methods, we also use the nearest neighbor method (1NN) with the ED and DTW distance measures. 1NN-ED and 1NN-DTW are very common methods for time series classification [2]. For a review of these methods, refer to [14].

#### **2.1 General Method of Random Forest Construction**

Random forest consists of random decision trees. For the construction of a random forest we usually take decision trees as simple as possible — without special criteria for stopping, pruning, etc.

When building a decision tree, we start at a node $N$ which contains the entire data set (bootstrap sample). Then, according to an established criterion, we split the node $N$ into two subnodes $N_1$ and $N_2$, each containing a subset of the data in node $N$. We make this split in a way that is optimal for a given split method, and in each node we record how the split occurred. Then, proceeding recursively, we split subsequent nodes into subnodes until the stop criterion occurs. In our case we take the simplest such criterion: we stop splitting a node when it contains only elements of the same class. We call such a node a leaf and assign it the label shared by its elements.

Having built a tree, we can now use it (in the testing phase) to classify a new observation. We pass this observation through the trained tree — starting from the node 𝑁 selecting each time one of the subnodes, according to the condition stored in the node. We do this until we reach one of the leaves, and then we assign the test observation to the class of the leaf.

Now, to construct the random forest, we collect a certain number of decision trees, train them independently according to the above method and, in the test phase, use each of the trees to classify the new observation. Thus, each tree assigns a label to the test observation. The final label (for the entire forest) is obtained by voting: we choose the label appearing most frequently among the decision trees.
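The voting step just described can be sketched as follows. The name `forest_predict` and the representation of trained trees as plain callables are our own illustrative choices, not part of the original algorithm description:

```python
# Forest-level majority vote: each trained tree labels the test observation,
# and the most frequent label among the trees wins.
from collections import Counter

def forest_predict(trees, x):
    votes = [tree(x) for tree in trees]         # one label per tree
    return Counter(votes).most_common(1)[0][0]  # majority vote
```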

#### **2.2 Classical Random Forest**

To create a (classical) random tree and a random forest [4], we proceed as described above using the following node split method:

To obtain split conditions for a single tree, we randomly select a certain number of features ($\sqrt{k}$ for classification, where $k$ is the number of features), and for each feature we create a feature vector (column, variable) made of all elements of the data set (bootstrap sample). For a given feature vector (variable), we determine a threshold vector. First, we sort the values of the feature vector uniquely, without repeating values. Let us name this sorted feature vector $V = (V_1, V_2, \dots)$. Then we take the split values as means of successive values of the vector $V$:

$$v\_i = \frac{V\_i + V\_{i+1}}{2} \quad i = 1, 2, \dots \tag{1}$$

Each split value $v_i$ divides the data set in node $N$ into two subsets: the left one, containing the elements whose feature value is smaller than $v_i$, and the right one, containing the remaining elements. Then we check the quality of such a split.

The split point is chosen so as to minimize the Gini index of the children nodes. If $p_1, p_2, \dots, p_c$ are the fractions of data points belonging to the $c$ different classes in node $N$, then the Gini index of that node is given by:

$$G(N) = 1 - \sum\_{i=1}^{c} p\_i^2.$$

Then, if the node $N$ is split into two children nodes $N_1$ and $N_2$, with $n_1$ and $n_2$ points respectively, the Gini quality of the children nodes is given by:

$$G\mathcal{Q}(N\_1, N\_2) = \frac{n\_1 G(N\_1) + n\_2 G(N\_2)}{n\_1 + n\_2}.$$

The quality of the split is given by: $GQ(N) = G(N) - GQ(N_1, N_2)$.
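The split search described in this subsection, with thresholds built as in (1) and scored by the Gini quality, can be sketched as follows (the function names are ours):

```python
# Threshold enumeration (midpoints of consecutive sorted unique values, Eq. 1)
# and split scoring by the Gini quality of the children nodes.
from collections import Counter

def gini_index(labels):
    """G(N) = 1 - sum of squared class fractions in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, gain) maximizing G(N) - GQ(N1, N2)."""
    v = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(v, v[1:])]
    parent = gini_index(labels)
    best = (None, -1.0)
    for t in thresholds:
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        gq = (len(left) * gini_index(left)
              + len(right) * gini_index(right)) / len(labels)
        if parent - gq > best[1]:
            best = (t, parent - gq)
    return best
```

On a perfectly separable feature, the midpoint between the two classes yields pure children and therefore the maximal gain, equal to the parent's Gini index.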

#### **2.3 Similarity Forest**

The similarity forest [19] differs from the ordinary (classical) random forest only in the way we split the nodes of trees. Instead of selecting a certain number of features, we randomly select a pair of elements $e_1, e_2$ belonging to different classes. Then, for each element $e$ of the subset of elements in a given node, we calculate the difference of the squared distances to the elements $e_1$ and $e_2$:

$$w(e) = d(e, e\_1)^2 - d(e, e\_2)^2,$$

where $d$ is any fixed distance measure on the elements of the data set. We sort the vector $w$ uniquely (without duplicates), creating the vector $V$, and continue as for the classical decision tree: we calculate the split values $v_i$ (1), compute the quality of each node split using the Gini index (Section 2.2), and choose the best split. In the learning phase, we record in each node how the optimal split occurred (the elements $e_1, e_2$ and the value $w(e)$).
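The projection at the heart of the similarity split can be sketched as follows (the function name is ours); once the scores $w(e)$ are computed, the one-dimensional split search proceeds exactly as in Section 2.2:

```python
# Similarity-forest projection: score each element by the difference of
# squared distances to the two exemplars e1 and e2 picked at this node.

def similarity_projection(elements, e1, e2, d):
    """w(e) = d(e, e1)^2 - d(e, e2)^2 for every element e."""
    return [d(e, e1) ** 2 - d(e, e2) ** 2 for e in elements]
```

Elements close to $e_1$ get negative scores and elements close to $e_2$ get positive ones, so a threshold near zero tends to separate the two classes the exemplars were drawn from.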

#### **2.4 Random Forest vs Similarity Forest**

The difference between a classical random tree and a similarity tree is that instead of selecting $\sqrt{k}$ of the features, we select only one pair of elements $e_1, e_2$. In general, we have far fewer possible node splits, which has a very good effect on the computation time.

The second important difference is that in the similarity tree we can use any distance measure between elements of the data set. Therefore, we can use distance measures specific to a data set. For example, for time series we can use the DTW distance, which is much better suited to measuring the distance between time series than the Euclidean distance.

# **3 Experimental Setup**

We investigated the performance of the similarity forest on the UCR time series repository [7] (128 data sets). The latest update of the UCR database introduced several data sets with missing observations and uneven series lengths. However, the repository includes a standardized version of the database without these impediments, and that is the version we used.

All data sets come split into a training and a testing subset, and all parameter optimization is conducted on the training set only. We combined both parts and then used 100 random train/test splits.

# **4 Results**

The error rates for each classifier can be found on the accompanying website<sup>1</sup>. In Table 1 we show a short summary of the results, including the number of wins (a draw is not counted as a win) and the mean ranks. Taking mean ranks into account, SF-DTW is the best classifier, slightly ahead of RF (mean ranks correspondingly equal 2.64

<sup>1</sup> https://github.com/ppias/similarity\_forest\_for\_tsc

**Table 1** Number of wins (clear wins) and mean ranks for the examined methods.

| Method    | 1NN-ED | 1NN-DTW | RF     | SF-ED | SF-DTW   |
|-----------|--------|---------|--------|-------|----------|
| Wins      | 12     | 28      | **38** | 10    | 31       |
| Mean rank | 3.59   | 2.89    | 2.69   | 3.19  | **2.64** |

and 2.69, respectively). Figure 1 shows a comparison of error rates and ranks for the classifiers. These results lead to the conclusion that, even though there is no clear winner, the top positions are dominated by RF- and SF-based classifiers. Figure 2 shows scatter plots of the errors for pairs of classifiers.

**Fig. 1** Comparison of error rates and ranks.

**Fig. 2** Comparison of error rates.

To identify differences between the classifiers, we present a detailed statistical comparison. First, we test the null hypothesis that all classifiers perform the same and the observed differences are merely random. The Friedman test with the Iman & Davenport extension is probably the most popular omnibus test, and it is usually a good choice when comparing different classifiers [12]. The $p$-value from this test is practically equal to 0, indicating that we can safely reject the null hypothesis that all the algorithms perform the same. We can therefore proceed with post-hoc tests to detect significant pairwise differences among the classifiers.

Demšar [8] proposes the use of the Nemenyi test [16], which compares all the algorithms pairwise. For a significance level $\alpha$, the test determines the critical difference (CD). If the difference between the average rankings of two algorithms is greater than the CD, the null hypothesis that the algorithms perform the same is rejected. Additionally, Demšar [8] introduces a plot to visually check the differences: the CD plot. In this plot, algorithms that are not joined by a line can be regarded as different.

In our case, with a significance level of $\alpha = 0.05$, any two algorithms with a difference in mean rank above 0.54 are regarded as not equal (Figure 3). We can see that we have three groups of methods: the first contains SF-DTW, RF, and 1NN-DTW; the second RF, 1NN-DTW, and SF-ED; and the last SF-ED and 1NN-ED. Unfortunately, the groups are not disjoint. The first group is the one with the highest classification accuracy. Hence, SF-DTW does not statistically outperform RF. However, we can recommend it over RF because it offers statistically the same quality with much better computational properties.
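The threshold of 0.54 quoted above can be reproduced from the standard Nemenyi critical-difference formula $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ with $k = 5$ classifiers and $N = 128$ data sets. This check is ours; the constant $q_{0.05} \approx 2.728$ for five classifiers is taken from standard studentized-range tables:

```python
# Nemenyi critical difference for k classifiers compared over N data sets.
k, N, q = 5, 128, 2.728          # q_0.05 for k = 5, from standard tables
cd = q * (k * (k + 1) / (6 * N)) ** 0.5
print(round(cd, 2))              # prints 0.54, the threshold used in the text
```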

**Fig. 3** Critical difference plot.

# **5 Conclusions**

Our contribution is the implementation of the similarity forest for time series classification using two distance measures: Euclidean and DTW. A comparison based on the recently updated UCR data repository (128 data sets) was presented. We showed that SF-DTW outperforms the other classifiers, including 1NN-DTW, which has been considered a strong, hard-to-beat baseline for years. The statistical comparison showed that the differences between RF and SF-DTW are statistically insignificant; however, taking mean ranks into account, the latter is the best method.

There are many improvements that could be applied to the proposed implementation. For example, we could test other distance measures, such as LCSS [21] or ERP [5], that have been used successfully in time series tasks. Another idea is to investigate the use of a boosting algorithm.

**Acknowledgements** The research work was supported by grant No. 2018/31/N/ST6/01209 of the National Science Centre.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Detection of the Biliary Atresia Using Deep Convolutional Neural Networks Based on Statistical Learning Weights via Optimal Similarity and Resampling Methods**

Kuniyoshi Hayashi, Eri Hoshino, Mitsuyoshi Suzuki, Erika Nakanishi, Kotomi Sakai, and Masayuki Obatake

**Abstract** Recently, artificial intelligence methods have been applied in several fields, and their usefulness is attracting attention. These methods correspond to models trained with batch or online processes. Owing to advances in computational power, as represented by parallel computing, online techniques with several tuning parameters are widely accepted and demonstrate good results. Neural networks are representative online models for prediction and discrimination. Many online methods require large training datasets to attain sufficient convergence; thus, online models may not converge effectively on small and noisy training datasets. For such cases, to realize effective learning convergence in online models, we introduce statistical insights into an existing method for setting the initial weights of deep convolutional neural networks. Using an optimal similarity and a resampling method, we propose an initial weight configuration approach for neural networks. For a practical example, the identification of biliary atresia (a rare disease), we verified the usefulness

Kuniyoshi Hayashi ()

Graduate School of Public Health, St. Luke's International University, 3-6 Tsukiji, Chuo-ku, Tokyo, Japan, 104-0045, e-mail: khayashi@slcn.ac.jp

Eri Hoshino · Kotomi Sakai

Research Organization of Science and Technology, Ritsumeikan University, 90-94 Chudoji Awatacho, Shimogyo Ward, Kyoto, Japan, 600-8815, e-mail: erihoshino119@gmail.com; koto.sakai1227@gmail.com

Mitsuyoshi Suzuki

Department of Pediatrics, Juntendo University Graduate School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, Japan, 113-8421, e-mail: msuzuki@juntendo.ac.jp

Erika Nakanishi

Department of Palliative Nursing, Health Sciences, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Japan, 980-8575, e-mail: nakanishi.erika.q3@dc.tohoku.ac.jp

Masayuki Obatake

Department of Pediatric Surgery, Kochi Medical School, 185-1 Kohasu, Oko-cho, Nankoku-shi, Kochi, Japan, 783-8505, e-mail: mobatake@kochi-u.ac.jp

of the proposed method by comparing it with existing methods for setting the initial weights of neural networks.

**Keywords:** AUC, bootstrap method, leave-one-out cross-validation, projection matrix, rare disease, sensitivity and specificity

# **1 Introduction**

The core technique in deep learning corresponds to neural networks, including the convolutional process. Since 2012, deep learning architectures have been frequently used for image classification [1, 2]. Moreover, deep convolutional neural networks (DCNNs) are representative nonlinear classification methods for pattern recognition, and the DCNN technique serves as a powerful framework throughout image processing [3]. The clinical medicine field presents many opportunities to perform diagnoses using imaging data from patients. Therefore, DCNN techniques are applied to enhance diagnostic quality, e.g., applying a DCNN to a chest X-ray dataset to classify pneumonia [2] and detecting breast cancer [4]. However, DCNN architectures involve many parameters to be learned from training data, so effective and efficient model development requires good learning convergence for these parameters. In particular, it is important to set the initial parameter values well to achieve better learning convergence. Several methods have been proposed for setting initial parameter values in the artificial intelligence (AI) field [5, 6]; however, there are no clear guidelines for determining which existing method should be used in which situation. Thus, we propose an efficient initial weight approach that builds on existing methods from the viewpoints of optimal similarity and resampling. Using a real-world clinical biliary atresia (BA) dataset, we evaluate the performance of the proposed method against existing DCNN initialization methods, and we show its usefulness in terms of learning convergence and prediction accuracy.

# **2 Background**

BA is a rare disease that occurs in children and is fatal unless treated early. Previous studies have investigated models to identify BA by applying neural networks to patient data [7] and using an ensemble deep learning model to detect BA [8]. However, these models were essentially for use in medical institutions, e.g., hospitals. Generally, certain stool colors in infants and children are highly correlated with BA [9]. In Japan, the maternal and child health handbook includes a stool color card so that parents can compare their child's stool color to the information on the card. Such fecal color cards are widely used to detect BA because of their easy accessibility outside clinical environments. However, this stool color card screening approach for BA is subjective; thus, accurate and objective diagnoses are not always possible. Previously, we developed a mobile application to classify BA and non-BA stools using baby stool images captured with an iPhone [10]. There, a batch-type classification method was used, namely the subspace method originating from the pattern recognition field. Since BA is a rare disease, the number of events in the case group is generally small. Thus, when we set the explanatory variables of the target observation as the pixel values of a target image, the number of explanatory variables is much higher than the number of observations, especially in the disease group. With the subspace method, we can efficiently discriminate such high-dimensional small-sample data. For example, our previous study using the subspace method to classify BA and non-BA stools showed that BA could be discriminated with reasonable accuracy by applying the method to the pixel data of stool images captured by a mobile phone [10]. This application was an automated version of the stool color card from the maternal and child health handbook. Unlike the previous studies [7, 8], the application is widely available outside hospital environments.
As described previously, DCNNs are useful for image classification, including the automatic classification of stool images for early BA detection.

# **3 Proposed Method**

Dimension reduction and discrimination processing can be realized using the subspace method and DCNN techniques. In DCNN, layers based on padding, convolution, and pooling correspond to the dimension reduction functions, and the affine layer performs the discrimination. The primary motivation of this study is to propose a method that properly sets the initial weights of the parameters in a DCNN using statistical approaches. Our secondary motivation is to apply the proposed method to real-world, high-dimensional, and small-sample clinical data.

#### **3.1 Description of Related Procedures of the Convolution**

For image discrimination in pattern recognition and machine learning fields, the pixel values of the image data are set as the explanatory variables for the target outcome. Here, the data to be classified correspond to a high-dimensional observation. To improve efficiency and demonstrate the feasibility of discriminant processing, the dimensionality must be reduced to a manageable size before classification. The most representative dimensionality reduction method is convolution in pattern recognition and machine learning, which involves padding, convolution, and pooling operations. After converting the input image to a pixel data matrix, the pixel data matrix is surrounded with a numeric value of 0. Using a convolution filter, we reconstruct the pixel data matrix while considering pixel adjacency information. Generally, the size and convolution filter type are parameters that need optimization to realize sufficient prediction accuracy. However, some representative convolution filters that exhibit good performance are known in the AI field, and we can essentially fix the size and type of the convolution filter. Finally, pooling is performed to reduce the size of the pixel data matrix after convolution. Here, we refer to the sequence of processing from padding to pooling as the layer for feature selection.
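The padding, convolution, and pooling operations described above can be sketched in a few lines of pure Python; the filter and matrix sizes below are illustrative placeholders, not those used in the study.

```python
def pad(mat, width=1):
    """Surround a pixel matrix with zeros (zero padding)."""
    cols = len(mat[0]) + 2 * width
    out = [[0.0] * cols for _ in range(width)]
    for row in mat:
        out.append([0.0] * width + list(row) + [0.0] * width)
    out += [[0.0] * cols for _ in range(width)]
    return out

def convolve(mat, filt):
    """Valid 2-D convolution (cross-correlation) with a small filter."""
    fh, fw = len(filt), len(filt[0])
    oh, ow = len(mat) - fh + 1, len(mat[0]) - fw + 1
    return [[sum(mat[i + a][j + b] * filt[a][b]
                 for a in range(fh) for b in range(fw))
             for j in range(ow)] for i in range(oh)]

def max_pool(mat, size=2):
    """Non-overlapping max pooling, shrinking the matrix."""
    return [[max(mat[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(mat[0]) - size + 1, size)]
            for i in range(0, len(mat) - size + 1, size)]

# One pass of the "feature selection layer": pad, convolve, pool.
feat = max_pool(convolve(pad([[1.0, 2.0], [3.0, 4.0]]), [[0.25, 0.25],
                                                         [0.25, 0.25]]))
print(feat)  # -> [[2.5]]
```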

#### **3.2 Setting Conditions Assumed in This Study**

We denote the input pattern matrices comprising the pixel values in hue (H), saturation (S), and value (V) as $\mathbf{X}^H, \mathbf{X}^S, \mathbf{X}^V \in \mathbb{R}^{p \times q}$, respectively. First, we performed padding on the input pattern matrices in H, S, and V, and then performed a convolution on each signal's pattern matrix using a convolution filter. Next, we applied max pooling to each pattern matrix after convolution. We denote the pattern matrices after padding, convolution, and max pooling as $\tilde{\mathbf{X}}^H, \tilde{\mathbf{X}}^S, \tilde{\mathbf{X}}^V \in \mathbb{R}^{p' \times q'}$, where $p'$ and $q'$ are less than $p$ and $q$, respectively. We then combine the three pattern matrices into a single pattern matrix by simply adding them together; the combined pattern matrix after applying the feature selection layer is $\tilde{\mathbf{X}} \in \mathbb{R}^{p' \times q'}$. Next, we applied convolution and max pooling to the combined pattern matrix $k$ times. The input vector obtained after these $k$ rounds is denoted $\mathbf{x} \in \mathbb{R}^{\ell \times 1}$, and the output of the DCNN and the label are denoted $\mathbf{y} \in \mathbb{R}^{1 \times 1}$ and $\mathbf{t} \in \mathbb{R}^{1 \times 1}$, respectively. In this study, we evaluated the difference between $\mathbf{y}$ and $\mathbf{t}$ with the mean squared error function $L(\mathbf{y}, \mathbf{t}) = \frac{1}{\ell} \| \mathbf{t} - \mathbf{y} \|\_2^2$. Here, we consider a simple neural network with three layers. Concretely, between the first and second layers we perform a linear transformation using $\mathbf{W}\_1 \in \mathbb{R}^{2 \times \ell}$ and $\mathbf{b}\_1 \in \mathbb{R}^{2 \times 1}$; between the second and third layers a linear transformation is performed using $\mathbf{W}\_2 \in \mathbb{R}^{1 \times 2}$ and $\mathbf{b}\_2 \in \mathbb{R}^{1 \times 1}$.
Next, we define $f\_1(\mathbf{x}) = \mathbf{W}\_1 \mathbf{x} + \mathbf{b}\_1$ and $f\_2(\mathbf{x}) = \mathbf{W}\_2 f\_1(\mathbf{x}) + \mathbf{b}\_2$. We assume that $\eta\_2$ is a nonlinear transformation between the second and third layers, and we calculate the output as $\mathbf{y} = \eta\_2(f\_2 \circ f\_1(\mathbf{x}))$. Generally, $\mathbf{y}$ is a continuous value; for example, with classification and regression tree methods, we can determine the optimal cutoff point of the $\mathbf{y}\_s$ from a prediction perspective.

#### **3.3 General Approach to Update Parameters in CNNs**

Here, we denote $f\_1(\mathbf{x})$ and $f\_2 \circ f\_1(\mathbf{x})$ from the previous subsection as $\mathbf{u}\_1$ and $\mathbf{u}\_2$, respectively. Taking the partial derivative of $L(\mathbf{y}, \mathbf{t})$ with respect to $\mathbf{W}\_2$, we obtain $\frac{\partial L}{\partial \mathbf{W}\_2^T} = \frac{\partial L}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{u}\_2} \frac{\partial \mathbf{u}\_2}{\partial \mathbf{W}\_2^T}$, where $\frac{\partial L}{\partial \mathbf{y}} = -\frac{2}{\ell}(\mathbf{t} - \mathbf{y})$, $\frac{\partial \mathbf{y}}{\partial \mathbf{u}\_2} = \frac{\partial \eta\_2(\mathbf{u}\_2)}{\partial \mathbf{u}\_2}$, and $\frac{\partial \mathbf{u}\_2}{\partial \mathbf{W}\_2^T} = \mathbf{u}\_1$. We calculate $\eta\_2(\mathbf{u}\_2)$ as $1/(1 + \exp(-\mathbf{u}\_2))$ using the representative sigmoid function, so that $\frac{\partial \mathbf{y}}{\partial \mathbf{u}\_2} = \eta\_2(\mathbf{u}\_2)(1 - \eta\_2(\mathbf{u}\_2))$. Therefore, we obtain
$$\frac{\partial L}{\partial \mathbf{W}\_2^T} = -\frac{2}{\ell}(\mathbf{t} - \mathbf{y}) \, \eta\_2(\mathbf{u}\_2)(1 - \eta\_2(\mathbf{u}\_2)) \, \mathbf{u}\_1.$$
With the learning coefficient $\gamma\_2$, we update $\mathbf{W}\_2^T$ to $\mathbf{W}\_2^T - \gamma\_2 \frac{\partial L}{\partial \mathbf{W}\_2^T}$. Similarly, taking the partial derivative of $L(\mathbf{y}, \mathbf{t})$ with respect to $\mathbf{W}\_1$, we obtain $\frac{\partial L}{\partial \mathbf{W}\_1} = \frac{\partial L}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{u}\_2} \frac{\partial \mathbf{u}\_2}{\partial \mathbf{u}\_1} \frac{\partial \mathbf{u}\_1}{\partial \mathbf{W}\_1}$, where $\frac{\partial \mathbf{u}\_2}{\partial \mathbf{u}\_1} = \mathbf{W}\_2^T$ and $\frac{\partial \mathbf{u}\_1}{\partial \mathbf{W}\_1} = 2\mathbf{x}^T$. Thus, we obtain
$$\frac{\partial L}{\partial \mathbf{W}\_1} = -\frac{4}{\ell}(\mathbf{t} - \mathbf{y}) \, \eta\_2(\mathbf{u}\_2)(1 - \eta\_2(\mathbf{u}\_2)) \, \mathbf{W}\_2^T \mathbf{x}^T.$$
With the learning coefficient $\gamma\_1$, we update $\mathbf{W}\_1$ to $\mathbf{W}\_1 - \gamma\_1 \frac{\partial L}{\partial \mathbf{W}\_1}$.
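As an illustration, one gradient step with the expressions above might look as follows; this is a sketch with toy dimensions (scalar sigmoid output, MSE loss), and the factor 2 in the derivative with respect to W1 follows the paper's expression.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gradient_step(x, t, W1, b1, W2, b2, g1=0.1, g2=0.1):
    """One update of W1 (2 x l) and W2 (1 x 2) following Sec. 3.3:
    u1 = W1 x + b1 (2-vector), u2 = W2 u1 + b2 (scalar), y = sigmoid(u2)."""
    l = len(x)
    u1 = [sum(W1[r][c] * x[c] for c in range(l)) + b1[r] for r in range(2)]
    u2 = W2[0] * u1[0] + W2[1] * u1[1] + b2
    y = sigmoid(u2)
    d = -(2.0 / l) * (t - y) * y * (1.0 - y)      # dL/du2
    grad_W2 = [d * u1[r] for r in range(2)]       # dL/dW2^T = d * u1
    # dL/dW1 = d * W2^T * du1/dW1, with du1/dW1 = 2 x^T as in the text
    grad_W1 = [[2.0 * d * W2[r] * x[c] for c in range(l)] for r in range(2)]
    W2_new = [W2[r] - g2 * grad_W2[r] for r in range(2)]
    W1_new = [[W1[r][c] - g1 * grad_W1[r][c] for c in range(l)]
              for r in range(2)]
    return W1_new, W2_new, y
```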

#### **3.4 Setting the Initial Weight Matrix in the Affine Layer**

To ensure proper learning convergence in situations with limited training data, we propose a method using optimal similarity and bootstrap methods. The number of training data and the training set are denoted $n$ and $S$ ($\ni \mathbf{x}\_j$), respectively, where $\mathbf{x}\_j$ is the $j$-th training observation ($j = 1, \ldots, n$). Additionally, we normalized each observation vector so that its norm is one. Considering the discrimination problem of two groups whose outcomes are 0 and 1, we divided $\{\mathbf{x}\_j\}$ into $S\_0 = \{\mathbf{x}\_j \mid y\_j = 0\}$ and $S\_1 = \{\mathbf{x}\_j \mid y\_j = 1\}$. First, we calculated the autocorrelation matrix of the observations belonging to $S\_0$. Then, using the eigenvalues $\hat{\lambda}\_{s\_0}$ and eigenvectors $\hat{\mathbf{u}}\_{s\_0}$ of this autocorrelation matrix, we calculated the following projection matrix:

$$\hat{P}\_0 := \sum\_{s\_0=1}^{\ell\_0'} \hat{\mathbf{u}}\_{s\_0} \hat{\mathbf{u}}\_{s\_0}^T,\tag{1}$$

where $\ell\_0'$ takes values from 1 to $\ell$ in Equation (1). Similarly, we calculated the autocorrelation matrix of the observations belonging to $S\_1$. Then, with the eigenvalues $\hat{\lambda}\_{s\_1}$ and eigenvectors $\hat{\mathbf{u}}\_{s\_1}$ of this autocorrelation matrix, we calculate the following projection matrix:

$$\hat{P}\_1 := \sum\_{s\_1=1}^{\ell\_1'} \hat{\mathbf{u}}\_{s\_1} \hat{\mathbf{u}}\_{s\_1}^T,\tag{2}$$

where $\ell\_1'$ takes values from 1 to $\ell$ in Equation (2). Here, if $\mathbf{x}^T (\hat{P}\_1 - \hat{P}\_0) \mathbf{x} > 0$, we classify $\mathbf{x}$ into $S\_1$; otherwise, we classify $\mathbf{x}$ into $S\_0$.
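A compact sketch of this subspace classification rule (projection matrices from the leading eigenvectors of each class's autocorrelation matrix) could look as follows; the data and component counts are illustrative assumptions, not the study's settings.

```python
import numpy as np

def projection_matrix(X, n_components):
    """Projection onto the leading eigenvectors of the autocorrelation
    matrix of the unit-norm rows of X (class subspace)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    R = X.T @ X / len(X)                 # autocorrelation matrix
    w, U = np.linalg.eigh(R)             # eigenvalues in ascending order
    U = U[:, ::-1][:, :n_components]     # keep the leading eigenvectors
    return U @ U.T

def classify(x, P0, P1):
    """Assign x to S1 if x^T (P1 - P0) x > 0, else to S0."""
    x = x / np.linalg.norm(x)
    return int(x @ (P1 - P0) @ x > 0)

# Toy data: S0 lies near the first axis, S1 near the second axis.
X0 = np.array([[1.0, 0.05], [1.0, -0.05]])
X1 = np.array([[0.05, 1.0], [-0.05, 1.0]])
P0 = projection_matrix(X0, 1)
P1 = projection_matrix(X1, 1)
print(classify(np.array([0.1, 1.0]), P0, P1))  # -> 1
```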

From a prediction perspective, using leave-one-out cross-validation [11], we determined the optimal $\hat{\ell}\_0'$ and $\hat{\ell}\_1'$, which are the minimum values satisfying $\tau < (\sum\_{s\_0=1}^{\ell\_0'} \hat{\lambda}\_{s\_0}) / (\sum\_{s\_0=1}^{\ell} \hat{\lambda}\_{s\_0})$ and $\tau < (\sum\_{s\_1=1}^{\ell\_1'} \hat{\lambda}\_{s\_1}) / (\sum\_{s\_1=1}^{\ell} \hat{\lambda}\_{s\_1})$, respectively. Here, $\tau$ is a tuning parameter, also optimized using leave-one-out cross-validation. In the second step, based on $\hat{P}\_1$, we estimated $\hat{y}\_j$ as $\mathbf{x}\_j^T \hat{P}\_1 \mathbf{x}\_j$. In the third step, using existing approaches [5, 6], we generated normal random numbers and set an initial matrix, vector, and scalar as $\hat{\mathbf{W}}\_2$, $\hat{\mathbf{b}}\_1$, and $\hat{\mathbf{b}}\_2$, respectively. Next, we extracted $m$ observations randomly using the bootstrap method [12]. Using $\hat{\mathbf{W}}\_2$, $\hat{\mathbf{b}}\_1$, $\hat{\mathbf{b}}\_2$, and a bootstrap sample of size $m$, we estimated $\mathbf{W}\_2 \mathbf{W}\_1$ as follows:

$$
\hat{\mathbf{W}}\_2 \hat{\mathbf{W}}\_1 = \frac{1}{m} \sum\_{i=1}^m (\eta\_2^{-1}(\hat{\mathbf{y}}\_i) - (\hat{\mathbf{W}}\_2 \hat{\mathbf{b}}\_1 + \hat{\mathbf{b}}\_2)) \mathbf{x}\_i^T (\mathbf{x}\_i \mathbf{x}\_i^T)^{-1},\tag{3}
$$

where the inverse of $\mathbf{x}\_i \mathbf{x}\_i^T$ in Equation (3) is estimated with a naive approach from the diagonal elements of $\mathbf{x}\_i \mathbf{x}\_i^T$. Additionally, using the generalized inverse approach, we obtained $\hat{\mathbf{W}}\_1$ on the basis of $\hat{\mathbf{W}}\_2$ and $\hat{\mathbf{W}}\_2 \hat{\mathbf{W}}\_1$. Finally, $\hat{\mathbf{b}}\_1$, $\hat{\mathbf{b}}\_2$, $\hat{\mathbf{W}}\_1$, and $\hat{\mathbf{W}}\_2$ were used as the initial vectors and matrices when updating the parameters of the convolutional neural network.
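Equation (3) and the generalized-inverse recovery of the initial weights can be sketched as follows; the random initialisation, bootstrap size m, and dimensions below are illustrative assumptions, not the study's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit(y):
    """eta_2^{-1} for the sigmoid; y must lie strictly in (0, 1)."""
    return np.log(y / (1.0 - y))

def initial_weights(X, y_hat, m):
    """Sketch of Eq. (3): average over a bootstrap sample to estimate
    W2 W1, then recover W1 via a generalized inverse of W2."""
    ell = X.shape[1]
    W2 = rng.normal(size=(1, 2))          # random initial W2, b1, b2
    b1 = rng.normal(size=(2, 1))
    b2 = rng.normal(size=(1, 1))
    idx = rng.integers(0, len(X), size=m)  # bootstrap resample
    acc = np.zeros((1, ell))
    for i in idx:
        x = X[i].reshape(-1, 1)
        # naive "inverse" of x x^T built from its diagonal elements only
        d_inv = np.diag(1.0 / np.maximum(np.diag(x @ x.T), 1e-8))
        acc += (logit(y_hat[i]) - (W2 @ b1 + b2)) @ x.T @ d_inv
    W2W1 = acc / m
    W1 = np.linalg.pinv(W2) @ W2W1         # generalized inverse step
    return b1, b2, W1, W2
```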

# **4 Analysis Results on Real-world Data**

All analyses in this paper were performed using R version 4.1.2 (R Foundation for Statistical Computing). We applied the proposed method to a real BA dataset, using stool image data in which objects such as diapers were partially photographed. In this numerical experiment, we randomly divided the 35 observations into 15 training and 20 test data. We then compared the proposed and existing methods with respect to learning convergence on the training data and prediction accuracy on the test data. The learning coefficients $\gamma\_1$ and $\gamma\_2$ were both set to 0.1. We prepared a single feature selection layer and performed the convolution and max pooling process seven times. Each time an initial value was set randomly, learning was performed 1000 times on the 15 training data, and learning was judged to have converged when the sum of the absolute differences between $\hat{y}\_j$ and $t\_j$, divided by 1000, fell below 0.01. We repeated the random division of the 35 observations into 15 training and 20 test data five times, creating five datasets. For each dataset, the sensitivity, specificity, and AUC values on the training and test data were calculated using the parameters ($\hat{\mathbf{b}}\_1$, $\hat{\mathbf{b}}\_2$, $\hat{\mathbf{W}}\_1$, and $\hat{\mathbf{W}}\_2$) obtained when learning first converged for the existing and proposed methods. Figure 1 shows, for each method, the average over the five runs of the absolute difference between the correct label and the predicted value at each step until learning first converged. We can observe that the error decreased more steadily for the proposed method than for the existing methods.
When the model constructed using the weights at the learning convergence point was applied to the 15 training data, the average sensitivity and specificity were 100.0% and the average AUC was 1.000 for all methods. However, differences were observed among the compared methods on the test data. For the method of [5], the average sensitivity, specificity, and AUC on the test data were 83.3%, 42.5%, and 0.629, respectively. For that of [6], they were 85.0%, 40.0%, and 0.625, respectively. With the proposed method, the average sensitivity, specificity, and AUC obtained on the test data were 85.0%, 67.5%, and 0.763, respectively.

**Fig. 1** Transition of learning in each method.

# **5 Conclusion and Limitations**

In this paper, we considered a discrimination problem using a DCNN for high-dimensional small-sample data and proposed a method for setting the initial weight matrix in the affine layer. Although transfer learning can be used in situations with limited training data, we proposed an efficient learning method within the DCNN framework itself. In terms of learning convergence and the results obtained on the test data, the proposed method performs well. However, the results presented in this paper are limited, and the proposed method needs to be examined in more detail. In future work, through large-scale simulation studies and applications to other real-world data, we plan to investigate the differences between the proposed and existing methods by changing the number of feature selection layers and using different convolution filters. We also plan to study the robustness of the proposed method by introducing outliers into simulated data.

**Acknowledgements** We thank Shinsuke Ito, Takashi Taguchi, Dr. Yusuke Yamane, Ms. Saeko Hishinuma, and Dr. Saeko Hirai for their advice. In addition, we acknowledge the biliary atresia patients' community (BA no kodomowo mamorukai) for their generous support of this project. This work was supported by the Mitsubishi Foundation.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Some Issues in Robust Clustering**

Christian Hennig

**Abstract** Some key issues in robust clustering are discussed with focus on the Gaussian mixture model based clustering, namely the formal definition of outliers, ambiguity between groups of outliers and clusters, the interaction between robust clustering and the estimation of the number of clusters, the essential dependence of (not only) robust clustering on tuning decisions, and shortcomings of existing measurements of cluster stability when it comes to outliers.

**Keywords:** Gaussian mixture model, trimming, noise component, number of clusters, user tuning, cluster stability

# **1 Introduction**

Cluster analysis is about finding groups in data. Robust statistics is about methods that are not affected strongly by deviations from the statistical model assumptions or moderate changes in a data set. Particular attention has been paid in the robustness literature to the effect of outliers. Outliers and other model deviations can have a strong effect on cluster analysis methods as well. There is now much work on robust cluster analysis, see [1, 19, 9] for overviews.

There are standard techniques of assessing robustness such as the influence function and the breakdown point [15] as well as simulations involving outliers, and these have been applied to robust clustering as well [19, 9].

Here I will argue that due to the nature of the cluster analysis problem, there are issues with the standard reasoning regarding robustness and outliers.

The starting point will be clustering based on the Gaussian mixture model, for details see [3]. For this approach, 𝑛 observations are assumed i.i.d. with density

Christian Hennig ()

Dipartimento di Scienze Statistiche "Paolo Fortunati", University of Bologna, Via delle Belle Arti 41, 40126 Bologna, Italy, e-mail: christian.hennig@unibo.it

<sup>©</sup> The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_21

$$f\_{\eta}(\mathbf{x}) = \sum\_{k=1}^{K} \pi\_k \varphi\_{\mu\_k, \Sigma\_k}(\mathbf{x}),$$

$x \in \mathbb{R}^p$, with $K$ mixture components with proportions $\pi\_k$, $\varphi\_{\mu\_k, \Sigma\_k}$ being the Gaussian density with mean vector $\mu\_k$ and covariance matrix $\Sigma\_k$, $k = 1, \ldots, K$, and $\eta$ being the vector of all parameters. For given $K$, $\eta$ can be estimated by maximum likelihood (ML) using the EM-algorithm, as implemented for example in the R-package "mclust". A standard approach to estimate $K$ is the optimisation of the Bayesian Information Criterion (BIC). Normally, mixture components are interpreted as clusters, and observations $x\_i$, $i = 1, \ldots, n$, can be assigned to clusters using the estimated posterior probability that $x\_i$ was generated by mixture component $k$. A problem with ML estimation is that the likelihood degenerates if all observations assigned to a mixture component lie on a lower-dimensional hyperplane, i.e., a $\Sigma\_k$ has an eigenvalue of zero. This can be avoided by placing constraints on the eigenvalues of the covariance matrices [8]. Alternatively, a non-degenerate local optimum of the likelihood can be used, and if this cannot be found, constrained covariance matrix models (such as $\Sigma\_1 = \ldots = \Sigma\_K$) can be fitted instead, as is the default of mclust. Several issues with robustness that occur here are also relevant for other clustering approaches.
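As a minimal illustration of ML fitting and BIC-based choice of $K$ (mclust does this properly in R, with covariance model selection; the univariate pure-Python EM below is only a sketch with crude initialisation and a variance floor against degeneracy):

```python
import math

def gauss(x, m, v):
    """Gaussian density N(x | mean m, variance v)."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def loglik(xs, params):
    return sum(math.log(sum(w * gauss(x, m, v) for w, m, v in params))
               for x in xs)

def em_fit(xs, K, iters=200):
    """EM for a univariate K-component Gaussian mixture."""
    lo, hi = min(xs), max(xs)
    # crude init: means spread over the data range, unit variances
    params = [(1.0 / K, lo + (hi - lo) * (k + 0.5) / K, 1.0)
              for k in range(K)]
    for _ in range(iters):
        # E-step: posterior responsibilities of each component
        resp = []
        for x in xs:
            ps = [w * gauss(x, m, v) for w, m, v in params]
            s = sum(ps)
            resp.append([p / s for p in ps])
        # M-step: weighted proportions, means, variances
        params = []
        for k in range(K):
            nk = sum(r[k] for r in resp)
            m = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            v = sum(r[k] * (x - m) ** 2 for r, x in zip(resp, xs)) / nk
            params.append((nk / len(xs), m, max(v, 1e-6)))  # variance floor
    return params

def bic(xs, K):
    """BIC = q log n - 2 log L, with q = 3K - 1 free parameters."""
    return (3 * K - 1) * math.log(len(xs)) - 2 * loglik(xs, em_fit(xs, K))

# Two well-separated clusters: BIC prefers K = 2 over K = 1.
xs = [i * 0.01 for i in range(20)] + [10 + i * 0.01 for i in range(20)]
print(bic(xs, 1) > bic(xs, 2))
```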

# **2 Outliers vs Clusters**

It is well known that the sample mean and sample covariance matrix as estimators of the parameters of a single Gaussian distribution can be driven to breakdown by a single outlier [15]. Under a Gaussian mixture model with fixed $K$, an outlier must be assigned to a mixture component $k$ and will break down the estimators of $\mu\_k$ and $\Sigma\_k$ (which are weighted sample means and covariance matrices) for that component in the same manner; the same holds for a cluster mean in $k$-means clustering.

Addressing this issue, and dealing with more outliers in order to achieve a high breakdown point, is a starting point for robust clustering. Central ideas are trimming a proportion of observations [7], adding a "noise component" with constant density to catch the outliers [4, 3], and mixtures with more robust component-wise estimators such as mixtures of heavy-tailed distributions (Sec. 7 of [18]).
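To make the trimming idea concrete, here is a toy one-dimensional trimmed k-means sketch; the deterministic initialisation from the first k points is purely for reproducibility of the illustration, and real implementations such as tclust optimise assignments and trimming jointly.

```python
def trimmed_kmeans(xs, k, alpha, iters=20):
    """Toy 1-D trimmed k-means: in each iteration, the alpha-fraction of
    points farthest from their nearest centre is left unassigned."""
    centres = list(xs[:k])            # deterministic init for this sketch
    n_trim = int(alpha * len(xs))
    for _ in range(iters):
        # keep the points closest to their nearest centre
        scored = sorted((min(abs(x - c) for c in centres), x) for x in xs)
        kept = [x for _, x in scored[:len(xs) - n_trim]]
        # re-assign kept points and update each centre as a group mean
        groups = [[] for _ in range(k)]
        for x in kept:
            j = min(range(k), key=lambda j: abs(x - centres[j]))
            groups[j].append(x)
        centres = [sum(g) / len(g) if g else centres[j]
                   for j, g in enumerate(groups)]
    return sorted(centres)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 100.0]   # 100.0 is a gross outlier
print(trimmed_kmeans(data, k=2, alpha=0.15))   # roughly [0.1, 5.1]
```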

But cluster analysis is essentially different from estimating a homogeneous population. Given a data set with 𝐾 clear Gaussian clusters and standard ML-clustering, consider adding a single outlier that is far enough away from the clusters. Assuming a lower bound on covariance matrix eigenvalues, the outlier will form a one-point cluster, the mean of which will diverge with the added outlier, and the original clusters will be merged to form 𝐾 − 1 clusters [10].

The same will happen with a group of several outliers close together, once more added far enough away from the Gaussian clusters. "Breakdown" of an estimator is usually understood as the estimator becoming useless; it is questionable whether this is the case here. In fact, the "group of outliers" can well be interpreted as a cluster in its own right, and putting all these points together in a cluster could be seen as desirable behaviour of the ML estimator, at least if two of the original $K$ clusters are close enough to each other that merging them produces a cluster that is fairly well fitted by a single Gaussian distribution; note that the Gaussian mixture model does not assume strong separation between components, and a mixture of two Gaussians may be unimodal and in fact very similar to a single Gaussian. A breakdown point larger than a given $\alpha$, $0 < \alpha < \frac{1}{2}$, may not be seen as desirable in cluster analysis if there can be clusters containing a proportion of less than $\alpha$ of the data, as a larger breakdown point will stop a method from taking such clusters (when added at a large distance from the rest of the data) appropriately into account.

The core problem is that it is not clear what distinguishes a group of outliers from a legitimate cluster. I am not aware of any formal definition of outliers and clusters in the literature that allows this distinction. Even a one-point cluster is not necessarily invalid. Here are some possible and potentially conflicting aspects of such a distinction.


Most of these items require specific decisions that cannot be made in any objective and general manner, but only taking into account subject matter information, such as the minimum size of valid clusters or the density level below which observations are seen as outliers (potentially compared to density peaks in the distribution). This implies that an appropriate treatment of outliers in cluster analysis cannot be expected to be possible without user tuning.

# **3 Robustness and the Number of Clusters**

The last item suggests that there is an interplay between outlier identification and the number of clusters, and that adding clusters might be a way of dealing with outliers; as long as clusters are assumed to be Gaussian, a single additional component may not be enough. More generally, concentrating robustness research on the case of fixed 𝐾 may be seen as unrealistic, because 𝐾 is rarely known, although estimating 𝐾 is a notoriously difficult problem even without worrying about outliers [13].

The classical robustness concepts, the breakdown point and the influence function, assume parameters from $\mathbb{R}^q$ with fixed $q$. If $K$ is not fixed, the number of parameters is not fixed either, and the classical concepts do not apply.

As an alternative to the breakdown point, [11] defined a "dissolution point". Dissolution is measured in terms of cluster memberships of points rather than in terms of parameters, and is therefore also applicable to nonparametric clustering methods. Furthermore, dissolution applies to individual clusters in a clustering; certain clusters may dissolve, i.e., there may be no sufficiently similar cluster in a new clustering computed after, e.g., adding an outlier; and others may not dissolve. This does not require 𝐾 to be fixed; the definition is chosen so that if a clustering changes from 𝐾 to 𝐿 < 𝐾 clusters, at least 𝐾 − 𝐿 clusters dissolve.
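Dissolution is commonly operationalised via a set-similarity coefficient between a cluster and its best match in the new clustering; the Jaccard coefficient and the 0.5 threshold in the sketch below follow the convention used in Hennig's work, and should be treated as assumptions of this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two clusters given as sets of point indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dissolved(cluster, new_clustering, threshold=0.5):
    """A cluster is regarded as dissolved if no cluster in the new
    clustering is sufficiently similar to it."""
    return max(jaccard(cluster, c) for c in new_clustering) <= threshold

# A cluster that survives (best match 4/5 = 0.8) vs one that dissolves.
print(dissolved({1, 2, 3, 4}, [{1, 2, 3, 4, 5}, {6, 7}]))  # -> False
print(dissolved({1, 2, 3, 4}, [{1, 9, 10}, {3, 11}]))      # -> True
```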

Hennig [10, 11] showed that when estimating 𝐾 using the BIC and standard ML estimation, reasonably well separated clusters do not dissolve when adding possibly even a large percentage of outliers (this does not hold for every method of estimating the number of clusters, see [11]). Furthermore, [11] showed that no method with fixed 𝐾 can be robust for data in which 𝐾 is misspecified; already [7] had found that robustness features in clustering generally depend on the data.

An implication of these results is that even in the fixed-𝐾 problem, the standard ML method can be a valid competitor regarding robustness if it comes with a rule that allows adding one or possibly more clusters that can then be used to fit the outliers (this is rarely explored in the literature, but [18], Sec. 7.7, shows an example in which adding a single component does not work very well).

An issue with adding clusters to accommodate outliers is that in many applications it is appropriate to distinguish between meaningful clusters, and observations that cannot be assigned to such clusters (often referred to as "noise"). Even though adding clusters of outliers can formally prevent the dissolution of existing clusters, it may be misleading to interpret the resulting clusters as meaningful, and a classification as outliers or noise can be more useful. This is provided by the trimming and noise component approaches to robust clustering. Also some other clustering methods such as the density-based DBSCAN [5] provide such a distinction. On the other hand, modelling clusters by heavy-tailed distributions such as in mixtures of t-distributions will implicitly assign outlying observations to clusters that potentially are quite far away. For this reason, [18], Sec. 7.7, provide an additional outlier identification rule on top of the mixture fit. [6] even distinguish between "mild" outliers that are modelled as having a larger variance around the same mean, and "gross" outliers to be trimmed. The variety of approaches can be connected to the different meanings that outliers can have in applications. They can be erroneous, they can be irrelevant noise, but they can also be caused by unobserved but relevant special conditions (and would as such qualify as meaningful clusters), or they could be valid observations legitimately belonging to a meaningful cluster that regularly produces observations further away from the centre than modelled by a Gaussian distribution.

Even though currently there is no formal robustness property that requires both the estimation of 𝐾 and an identification or downweighting of outliers, there is demand for a method that can do both.

Estimating 𝐾 comes with an additional difficulty that is relevant in connection with robustness. As mentioned before, in clustering based on the Gaussian mixture model normally every mixture component will be interpreted as a cluster. In reality, however, meaningful clusters are not perfectly Gaussian. Gaussian mixtures are very flexible for approximating non-Gaussian distributions. Using a consistent method for estimating 𝐾 means that for large enough 𝑛 a non-Gaussian cluster will be approximated by several Gaussian mixture components. The estimated 𝐾 will be fine for producing a Gaussian mixture density that fits the data well, but it will overestimate the number of interpretable clusters. The estimation of 𝐾, if interpreted as the number of clusters, relies on precise Gaussianity of the clusters, and is as such itself riddled with a robustness problem; in fact slightly non-Gaussian clusters may even drive the estimated 𝐾 → ∞ if 𝑛 → ∞ [12, 14].
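The overestimation effect is easy to reproduce numerically. The following sketch (using scikit-learn; the skewed example distribution is our choice) fits Gaussian mixtures with increasing 𝐾 to a single, clearly non-Gaussian "cluster" and selects 𝐾 by the BIC, which then chooses more than one component.

```python
# One strongly non-Gaussian cluster: BIC-based selection of K picks
# several Gaussian components, overestimating the number of
# interpretable clusters. Illustrative sketch using scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(3000, 1))   # a single skewed cluster

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 6)}
best_k = min(bic, key=bic.get)                    # typically more than one
```

The selected number of components depends on 𝑛 and on how non-Gaussian the cluster is; for a sample of this size from a skewed distribution, a single Gaussian component is clearly rejected by the BIC.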

This is connected with the more fundamental problem that there is no unique definition of a cluster either. The cluster analysis user needs to specify the cluster concept of interest even before robustness considerations, and arguably different clustering methods imply different cluster concepts [13]. A Gaussian mixture model defines clusters by the Gaussian distributional shape (unless mixture components are merged to form clusters [12]). Although this can be motivated in some real situations, robustness considerations require that distributional shapes fairly close to the Gaussian should be accepted as clusters as well. This requires another specification, namely how far from a Gaussian a cluster is allowed to be, or alternatively how separated Gaussian components have to be in order to count as separate clusters. A similar problem occurs in nonparametric clustering; if clusters are associated with density modes or level sets, the cluster concept depends on how weak a mode, or a gap between high-density level sets, is allowed to be while still being treated as meaningful.

Hennig and Coretto [14] propose a parametric bootstrap approach to simultaneously estimate 𝐾 and assign outliers to a noise component. This requires two basic tuning decisions. The first one is the minimum percentage of observations by which the noise component must be reducible for the researcher to be willing to add another cluster. The second one specifies a tolerance that allows a data subset to count as a cluster even though it deviates to some extent from what is expected under a perfectly Gaussian distribution. There is a third tuning parameter that is in effect for fixed 𝐾 and tunes how much of the tails of a non-Gaussian cluster can be assigned to the noise in order to improve the Gaussian appearance of the cluster. One could even see the required constraints on covariance matrix eigenvalues as a further tuning decision. Default values can be provided, but situations in which matters can be improved by deviating from the default values are easy to construct.

# **4 More on User Tuning**

User tuning is not popular, as it is often difficult to make appropriate tuning decisions. Many scientists believe that subjective user decisions threaten scientific objectivity, and also background knowledge dependent choices cannot be made when investigating a method's performance by theory and simulations. The reason why user tuning is indispensable in robust cluster analysis is that it is required in order to make the problem well defined. The distinction between clusters and outliers is an interpretative one that no automatic method can make based on the data alone. Regarding the number of clusters, imagine two well separated clusters (according to whatever cluster concept of interest), and then imagine them to be moved closer and closer together. Below what distance are they to be considered a single cluster? This is essentially a tuning decision that the data cannot make on their own.

There are methods that do not require user tuning. Consider the mclust implementation of Gaussian mixture model based clustering. The number of clusters is by default estimated by the BIC. As seen above, this is not really appropriate for large data sets, but its derivation is essentially asymptotic, so that there is no theoretical justification for small data sets either. Empirically it often, but not always, works well, and there is little investigation of whether it tends to make the "right" decision in ambiguous situations, where without user tuning it is not even clear what it means to be "right". Covariance matrix constraints in mclust are not governed by a tuning of eigenvalues or their ratios to be specified by the user. Rather, the BIC decides between different covariance matrix models. This can be erratic and unstable: it depends on whether the EM algorithm gets caught in a degenerate likelihood maximum, and in situations where two or more covariance matrix models have similar BIC values (which happens quite often), a tiny change in the data can result in a different covariance matrix model being selected and in substantial changes of the clustering. A tunable eigenvalue condition can result in much smoother behaviour. When it comes to outlier identification, mclust offers the addition of a uniform "noise" mixture component governed by the range of the data, again supposedly without user tuning. This starts from an initial noise estimation that requires tuning (Sec. 3.1.2 of [3]) and is less robust in terms of breakdown and dissolution than trimming and the improper noise component, both of which require tuning [10, 11]. The ICL, an alternative to the BIC (Sec. 2.6 of [3]), on the other hand, is known to merge different Gaussian mixture components already at a distance at which they intuitively still seem to be separated clusters.
Similar comments apply to mixtures of t-distributions: they require user tuning for identifying outliers and for scatter matrix constraints, and they have the same issues with the BIC and ICL as the Gaussian mixture.

Summarising, both the identification of and robustness against outliers and the estimation of the number of clusters require tuning in order to be well defined problems; user tuning can only be avoided by taking tuning decisions out of the user's hands and making them internally, which will work in some situations and fail in others, and the impression of automatic data driven decision making that a user may have is rather an illusion. This, however, does not free method designers from the necessity to provide default tunings for experimentation and cases in which the users do not feel able to make the decisions themselves, and tuning guidance for situations in which more information is available. A decision regarding the smallest valid size of a cluster is rather well interpretable; a decision regarding admissible covariance matrix eigenvalues is rather difficult and abstract.

# **5 Stability Measurement**

Robustness is closely connected to stability. Both experimental and theoretical investigation of the stability of clusterings require formal stability measurements, usually comparing two clusterings on the same data (potentially modified by replacing or adding observations). Not assuming any parametric model, proximity measures such as the Adjusted Rand Index (ARI; [16]), the Hamming distance (HD; [2]), or the Jaccard distance between individual clusters [11] can be used. Note that [2], a standard reference on cluster stability in the machine learning community, states that stability and instability are caused in the first place by ambiguities in the cluster structure of the data, rather than by a method's robustness or lack of it. Although the outlier problem is ignored in that paper, it is true that cluster analysis can have other stability issues that are as serious as or worse than gross outliers.

To my knowledge, none of the measures currently in use allow for a special treatment of a set of outliers or noise; either these have to be ignored, or treated just as any other cluster. Both ARI and HD, comparing clusterings $\mathcal{C}\_1$ and $\mathcal{C}\_2$, consider pairs of observations $x\_i, x\_j$ and check whether those that are in the same cluster in $\mathcal{C}\_1$ are also in the same cluster in $\mathcal{C}\_2$. An appropriate treatment of noise sets $N\_1 \in \mathcal{C}\_1$, $N\_2 \in \mathcal{C}\_2$ would require that $x\_i, x\_j \in N\_1$ are not just in the same cluster in $\mathcal{C}\_2$ but rather in $N\_2$, i.e., whereas the numberings of the regular clusters do not have to be matched (which is appropriate because cluster numbering is meaningless), $N\_1$ has to be matched to $N\_2$. Corresponding re-definitions of these proximities will be useful for robustness studies.
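As an illustration of such a re-definition, the following sketch (an illustrative measure of ours, not a published one) counts a pair as agreeing only if the noise memberships of its two points match across the two clusterings, while regular clusters are compared in the usual pair-counting way, i.e. labels only need to agree on same/different.

```python
import itertools

def noise_aware_pair_agreement(c1, c2, noise=-1):
    """Fraction of point pairs treated consistently by clusterings c1, c2,
    where the noise class (label `noise`) must be matched to the noise
    class, while regular cluster labels only need to agree on
    same-cluster/different-cluster."""
    n, agree, total = len(c1), 0, 0
    for i, j in itertools.combinations(range(n), 2):
        total += 1
        if noise in (c1[i], c1[j], c2[i], c2[j]):
            # pairs touching noise agree iff noise membership is identical
            agree += ((c1[i] == noise) == (c2[i] == noise)) and \
                     ((c1[j] == noise) == (c2[j] == noise))
        else:
            agree += (c1[i] == c1[j]) == (c2[i] == c2[j])
    return agree / total

# identical clusterings up to relabelling of regular clusters: full agreement
a = [0, 0, 1, 1, -1]
b = [5, 5, 7, 7, -1]
print(noise_aware_pair_agreement(a, b))        # 1.0
# moving the noise point into a regular cluster lowers the agreement
c = [5, 5, 7, 7, 7]
print(noise_aware_pair_agreement(a, c))        # 0.6
```

Renaming the regular clusters leaves the measure unchanged, but exchanging noise points with cluster points is penalised, as suggested above.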

# **6 Conclusion**

Key practical implications of the above discussions are:

- developers need to provide sensible defaults, but also to guide the users regarding a meaningful interpretation of the tuning decisions.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Robustness Aspects of Optimized Centroids**

Jan Kalina and Patrik Janáček

**Abstract** Centroids are often used for object localization tasks, supervised segmentation in medical image analysis, or classification in other specific tasks. This paper starts by contributing to the theory of centroids by evaluating the effect of modified illumination on the weighted correlation coefficient. Further, the robustness of various centroid-based tools is investigated in experiments related to mouth localization in non-standardized facial images and to classification of high-dimensional data in a matched pairs design. The most robust results are obtained if the sparse centroid-based method for supervised learning is accompanied by an intrinsic variable selection. Robustness, sparsity, and energy-efficient computation turn out not to contradict the requirement of optimal performance of the centroids.

**Keywords:** image processing, optimized centroids, robustness, sparsity, low-energy replacements

# **1 Introduction**

Methods based on centroids (templates, prototypes) are simple yet widely used tools for object localization or supervised segmentation in image analysis tasks, and also within other supervised or unsupervised machine learning methods. This is true e.g. in various biomedical imaging tasks [1], where researchers typically cannot afford a large number of available images [3]. Biomedical applications also benefit from the interpretability (comprehensibility) of centroids [11].

This paper focuses on the question of how centroid-based methods are influenced by data contamination. Section 2 recalls the main approaches to centroid-based object localization in images, as well as a recently proposed method of [6] for optimizing centroids and their weights. The robustness of these methods to data contamination (non-standard conditions) has, however, not been sufficiently investigated. In particular, we are interested in the performance of low-energy replacements of the optimal centroids and in the effect of posterior variable selection (pixel selection). Section 2.1 presents novel expressions for images with a changed illumination. Numerical experiments are presented in Section 3. These are devoted to mouth localization over raw facial images as well as over artificially modified images; further experiments are devoted to high-dimensional data in a matched pairs design. The optimized centroids of [6], and especially their modification proposed here, turn out to have remarkable robustness properties. Section 4 brings conclusions.

Jan Kalina () · Patrik Janáček

The Czech Academy of Sciences, Institute of Computer Science, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic, e-mail: kalina@cs.cas.cz; janacekpatrik@gmail.com

© The Author(s) 2023 193

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_22

# **2 Centroid-based Classification (Object Localization)**

Commonly used centroid-based approaches to object localization (template matching) in images construct the centroid simply as the average of the positive examples, and typically use the Pearson product-moment correlation coefficient $r$ as the most common measure of similarity between a centroid **c** and a candidate part of the image (say **x**). While the centroid and the candidate areas are matrices of size (say) $I \times J$ pixels, they are used in computations after being transformed to vectors of length $d := IJ$. This allows us to use the notation $\mathbf{c} = (c\_1, \ldots, c\_d)^T$ and $\mathbf{x} = (x\_1, \ldots, x\_d)^T$.

*Assumptions* A: We assume the whole image to have size $N\_R \times N\_C$ pixels. We assume the centroid $\mathbf{c} = (c\_{ij})$ with $i = 1, \ldots, I$ and $j = 1, \ldots, J$ to be a matrix of size $I \times J$ pixels. A candidate area **x** and nonnegative weights **w** with $\sum\_i \sum\_j w\_{ij} = 1$ are assumed to be matrices of the same size as **c**.

For a given image, E will denote the set of its rectangular candidate areas of size 𝐼 × 𝐽. The candidate area fulfilling

$$\arg\max\_{\mathbf{x}\in\mathsf{E}} r(\mathbf{x}, \mathbf{c})\tag{1}$$

or (less frequently)

$$\arg\min\_{\mathbf{x}\in\mathsf{E}} ||\mathbf{x}-\mathbf{c}||\_2\tag{2}$$

is classified to correspond to the object (e.g. the mouth).

Let us consider here replacing $r$ by the weighted correlation coefficient $r\_w$ in

$$\arg\max\_{\mathbf{x}\in\mathsf{E}} r\_w(\mathbf{x}, \mathbf{c}; \mathbf{w}) \tag{3}$$

with given non-negative weights $\mathbf{w} = (w\_1, \ldots, w\_d)^T \in \mathbb{R}^d$ with $\sum\_{i=1}^d w\_i = 1$, where $\mathbb{R}$ denotes the set of all real numbers. Let us further use the notation $\bar{x}\_w = \sum\_{j=1}^d w\_j x\_j = \mathbf{w}^T\mathbf{x}$ and $\bar{c}\_w = \mathbf{w}^T\mathbf{c}$. We may recall that $r\_w$ between **x** and **c** is defined as

$$r\_w(\mathbf{x}, \mathbf{c}; \mathbf{w}) = \frac{\sum\_{i=1}^d w\_i (x\_i - \bar{x}\_w)(c\_i - \bar{c}\_w)}{\sqrt{\sum\_{i=1}^d \left[w\_i (x\_i - \bar{x}\_w)^2\right] \sum\_{i=1}^d \left[w\_i (c\_i - \bar{c}\_w)^2\right]}}. \tag{4}$$

**Fig. 1** The workflow of the optimization procedure of [6].

A detailed study [2] investigated theoretical foundations of centroid-based classification, however for the rare situation when (1) is replaced by (2).

The sophisticated centroid optimization method of [6], outlined in Figure 1, requires minimizing a nonlinear loss function corresponding to a regularized margin-like distance (exploiting $r\_w$) evaluated for the worst pair from the worst image over the training database (i.e. the worst with respect to the loss function). Subsequently, optimization of the weights may also be performed, which forces many pixels to obtain zero weights (i.e. yields a sparse solution). The optimal centroid may be used as such, even without any weights at all; still, optimization of the weights leads to a further improvement of the classification performance. In the current paper, we always consider a linear (i.e. approximate) approach to centroid optimization, although nonlinear optimization is also successful, as revealed in the comparisons in [6].

# **2.1 Centroid-Based Object Localization: Asymmetric Modification of the Candidate Area**

In the context of object localization as described above, our aim is to express $r\_w(\mathbf{x}^\*, \mathbf{c}; \mathbf{w})$ under modified candidate areas (say $\mathbf{x}^\*$) of the image **x**; we stress that the considered modification of the image does not modify the centroid **c** and the weights **w**. These considerations are useful for centroid-based object localization when asymmetric illumination is present in the whole image or its part. The weighted variance $S\_w^2(\mathbf{x})$ of **x** with weights **w** and the weighted covariance $S\_w(\mathbf{x}, \mathbf{c})$ between **x** and **c** are denoted as

$$S\_w^2(\mathbf{x}) = \sum\_{i,j} w\_{ij} \left(x\_{ij} - \bar{x}\_w\right)^2, \quad S\_w(\mathbf{x}, \mathbf{c}) = \sum\_{i,j} w\_{ij} \left(x\_{ij} - \bar{x}\_w\right) \left(c\_{ij} - \bar{c}\_w\right). \tag{5}$$

Further, the notation $\mathbf{x} + a$ with $\mathbf{x} = (x\_{ij})\_{i,j}$ is used to denote the matrix $(x\_{ij} + a)\_{i,j}$ for a given $a \in \mathbb{R}$. We also use the following notation. The image **x** is divided into two parts $\mathbf{x} = (\mathbf{x}\_1, \mathbf{x}\_2)^T \in \mathbb{R}^d$, where $\sum\_{I}$ or $\sum\_{II}$ denote the sum over the pixels of the first or second part, respectively.

**Theorem 1** *Under Assumptions* A*, the following statements hold.*

*1. For* $\mathbf{x}^\* = \mathbf{x} + \varepsilon$*, it holds* $r\_w(\mathbf{x}^\*, \mathbf{c}) = r\_w(\mathbf{x}, \mathbf{c})$ *for* $\varepsilon > 0$.
*2. For* $\mathbf{x}^\* = k\mathbf{x}$ *with* $k > 0$*, it holds* $r\_w(\mathbf{x}^\*, \mathbf{c}) = r\_w(\mathbf{x}, \mathbf{c})$.
*3. For* $\mathbf{x} = (\mathbf{x}\_1, \mathbf{x}\_2)^T$ *and* $\mathbf{x}^\* = (\mathbf{x}\_1, \mathbf{x}\_2 + \varepsilon)^T$ *with* $\varepsilon \in \mathbb{R}$*, it holds*

$$r\_w(\mathbf{x}^\*, \mathbf{c}) = \frac{S\_w(\mathbf{x},\mathbf{c}) + \varepsilon \sum\_{II} w\_{ij} c\_{ij} - \varepsilon v\_2 \bar{c}\_w}{S\_w(\mathbf{c}) \sqrt{S\_w^2(\mathbf{x}) + v\_2 (1 - v\_2) \varepsilon^2 + 2\varepsilon \left(\sum\_{II} w\_{ij} x\_{ij} - v\_2 \bar{x}\_w\right)}},\quad(6)$$

*where* $v\_2 = \sum\_{II} w\_{ij}$.
*4. For* $\mathbf{x} = (\mathbf{x}\_1, \mathbf{x}\_2)^T$ *and* $\mathbf{x}^\* = (\mathbf{x}\_1, k\mathbf{x}\_2)^T$ *with* $k > 0$*, it holds*

$$r\_w(\mathbf{x}^\*, \mathbf{c}) = r\_w(\mathbf{x}, \mathbf{c}) \frac{S\_w(\mathbf{x})}{S\_w^\*(\mathbf{x})} + \frac{(k - 1) \sum\_{II} w\_{ij} x\_{ij} (c\_{ij} - \bar{c}\_w)}{S\_w(\mathbf{c}) S\_w^\*(\mathbf{x})},\tag{7}$$

*where*

$$\left(S\_w^\*(\mathbf{x})\right)^2 = S\_w^2(\mathbf{x}) + (k^2 - 1)\sum\_{II} w\_{ij}x\_{ij}^2 - (k^2 - 1)\left(\sum\_{II} w\_{ij}x\_{ij}\right)^2 - 2(k - 1)\left(\sum\_{I} w\_{ij}x\_{ij}\right)\left(\sum\_{II} w\_{ij}x\_{ij}\right). \tag{8}$$

The proofs of the formulas are technical but straightforward, exploiting known properties of $r\_w$. The theorem reveals $r\_w$ to be vulnerable to modified illumination, i.e. all the centroid-based methods of Section 2 may be strongly influenced by such data modifications.
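Statements 1 and 2 of Theorem 1 (invariance under a global shift and a global positive rescaling), and the vulnerability under a partial shift as in statement 3, can be checked numerically. The following sketch uses random vectors in place of vectorized images; the data and names are ours.

```python
import numpy as np

def r_w(x, c, w):
    """Weighted correlation r_w of (4); x, c, w are flat arrays, sum(w) = 1."""
    xb, cb = w @ x, w @ c
    cov = np.sum(w * (x - xb) * (c - cb))
    return cov / np.sqrt(np.sum(w * (x - xb) ** 2) * np.sum(w * (c - cb) ** 2))

rng = np.random.default_rng(0)
d = 40
x, c = rng.normal(size=d), rng.normal(size=d)
w = rng.uniform(size=d)
w /= w.sum()

base = r_w(x, c, w)
# statements 1 and 2: global shift and positive rescaling leave r_w unchanged
print(np.isclose(r_w(x + 3.0, c, w), base))    # True
print(np.isclose(r_w(2.5 * x, c, w), base))    # True
# statement 3: shifting only one part of the "image" does change r_w
x_part = x.copy()
x_part[d // 2:] += 3.0
print(np.isclose(r_w(x_part, c, w), base))     # False here
```

The partial shift mimics asymmetric illumination of one half of the candidate area, which is exactly the situation covered by formulas (6)-(8).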

# **3 Experiments**

#### **3.1 Data**

Three datasets are considered in the experiments. In the first dataset, the task is to localize the mouth in a database containing 212 grey-scale 2D images of faces of healthy individuals, each of size 192 × 256 pixels. The database, previously analyzed in [6], was acquired at the Institute of Human Genetics, University of Duisburg-Essen, within research on genetic syndrome diagnostics based on facial images [1] under the projects BO 1955/2-1 and WU 314/2-1 of the German Research Council (DFG). We consider the training dataset to consist of the first 124 images, while the remaining 88 images represent an independent test set acquired later but still under the same standardized conditions, fulfilling the assumptions of unbiased evaluation. The centroid described below is used with 𝐼 = 26 and 𝐽 = 56.

Using always raw training images, the methods are applied not only to the raw test set, but also to the test set after being artificially modified using models inspired by Section 2.1. On the whole, five different versions of the test database are considered; the modifications required that we first manually localized the mouths in the test images:

1. Raw images.

2. Illumination. If we consider a pixel $[i, j]$ with intensity $f\_{ij}$ in an image (say) $f$, then the modified grey-scale intensity $f^\*\_{ij}$ will be

$$f\_{ij}^\* = f\_{ij} + \lambda |j - j\_0|, \quad i = 1, \ldots, I, \quad j = 1, \ldots, J,\tag{9}$$

where $[i\_0, j\_0]$ are the coordinates of the mouth and $\lambda = 0.002$.


3. Asymmetry. The intensities are modified asymmetrically as

$$x\_{ij}^{\*} = \begin{cases} x\_{ij} + 0.2, & i = 1, \ldots, 26, \ j = 1, \ldots, 15, \\ x\_{ij}, & i = 1, \ldots, 26, \ j = 16, \ldots, 41, \\ x\_{ij} + 0.1, & i = 1, \ldots, 26, \ j = 42, \ldots, 56. \end{cases} \tag{10}$$


The optimized centroids were explained in [6] to be applicable also to classification tasks for data other than images, if they follow a matched pairs design. We use two such datasets from [6] in the experiments; their classification accuracies are reported using 10-fold cross validation.


**Fig. 2** The average centroid used as the initial choice for the centroid optimization.

#### **3.2 Methods**

The following methods are compared in the experiments; standard methods are computed using R software, and we use our own C++ implementation of the centroid-based methods. The average centroid is obtained as the average of all mouths of the training set, or the average across all patients. The centroid optimization starts with the average centroid as the initial one, and the optimization of weights starts with equal weights as the initial ones:


$$\cos\theta = \frac{\mathbf{x}^T \mathbf{y}}{||\mathbf{x}||\_2 ||\mathbf{y}||\_2} = \frac{\sum\_{i=1}^d x\_i y\_i}{\left(\sum\_{i=1}^d x\_i^2\right)^{1/2} \left(\sum\_{j=1}^d y\_j^2\right)^{1/2}}.\tag{11}$$


$$\psi\_1(t) = \exp\left\{-\frac{t^2}{2\tau^2}\right\} \mathbf{1}\left[t < \frac{3}{4}\right], \quad t \in [0, 1], \tag{12}$$

corresponding to a (trimmed) density of the Gaussian N(0, 1) distribution; 1 denotes an indicator function. To explain, the computation of $r\_{\mathrm{LWS}}(x, y)$ starts by fitting the LWS estimator in the linear regression with $y$ as the response and $x$ as the regressor, and $r\_w$ is then used with the weights determined by the LWS estimator.



**Table 1** Classification accuracy for three datasets. For the mouth localization data, modifications of the test images are described in Section 3: (i) None (raw images); (ii) Illumination; (iii) Asymmetry; (iv) Rotation; (v) Image denoising. A detailed description of the methods is given in Section 3.2.

# **3.3 Results**

The results, as ratios of correctly classified cases, are presented in Table 1. For the mouth localization, the optimized centroids of methods D, F, and H turn out to outperform the simple centroids (A, B, and C); the novel modifications E and G, performing intrinsic variable selection, yield the best results. Simple standard centroids (A, B, and C) are non-robust to data contamination; this follows from Section 2.1 and from analogous considerations for other types of contamination of the images. On the other hand, the robustness of the optimized centroids is achieved by their optimization (and not by using $r\_w$ as such). Methods E and G even outperform methods I and J based on $r\_{\mathrm{LWS}}$. We recall that $r\_{\mathrm{LWS}}$ is globally robust in terms of the breakdown point [4], but is computationally very demanding and does not seem to allow any feasible optimization. Other results reported previously in [6] revealed that numerous standard machine learning methods are also too vulnerable (non-robust) to data contamination if similarity is measured by $r$ or $r\_w$.

For the AMI dataset, methods E and G with variable selection yield the best results for the raw as well as the contaminated datasets. For the simulated data, method G yields the best results, with method E only slightly behind as the second best method.

# **4 Conclusions**

Understanding the robustness of centroids represents a crucial question in image processing with applications to convolutional neural networks (CNNs), because centroids are very versatile tools that may be based on deep features learned by deep learning. We focus on small datasets, for which CNNs cannot be used [10]. This paper is interested in the performance of centroid-based object localization over small databases with non-standardized images, which commonly appear e.g. in medical image analysis.

The requirements on robustness with respect to modifications of the images turn out not to contradict the requirements on optimality of the centroids. The method G, applying intrinsic variable selection to the optimal centroid and weights [6], can be interpreted within a broader framework of robust dimensionality reduction (see [8] for an overview) or low-energy approximate computation. Additional results not presented here reveal the method based on optimized centroids to be robust also to small shifts. Neither the theoretical part of this paper nor the experiments exploit any specific properties of faces. The presented robust method thus has potential also for various other applications, e.g. for deep fake detection by centroids, robust template matching by CNNs [9], or applying filters in convolutional layers of CNNs.

**Acknowledgements** The research was supported by the grant 22-02067S of the Czech Science Foundation.

# **References**



# **Data Clustering and Representation Learning Based on Networked Data**

Lazhar Labiod and Mohamed Nadif

**Abstract** To deal simultaneously with attributed network embedding and clustering, we propose a new model exploiting both content and structure information. The proposed model relies on the approximation of the relaxed continuous embedding solution by the true discrete clustering. Thereby, we show that incorporating an embedding representation provides simpler and more easily interpretable solutions. Experimental results demonstrate that the proposed algorithm performs better, in terms of clustering, than state-of-the-art algorithms, including deep learning methods devoted to similar tasks.

**Keywords:** networked data, clustering, representation learning, spectral rotation

# **1 Introduction**

In recent years, *Networks* [4] and *Attributed Networks* (AN) [8] have been used to model a large variety of real-world networks, such as academic and health care networks, where both node links and attributes/features are available for analysis. Unlike plain networks, in which only node links and dependencies are observed, with AN each node is associated with a valuable set of features. In other words, we have **X** and **W** obtained/available independently of each other. More recently, representation learning has received a significant amount of attention as an important aim in many applications, including social networks, academic citation networks and protein-protein interaction networks. Hence, *Attributed Network Embedding* (ANE) [2] aims to seek a continuous low-dimensional matrix representation for nodes in a network, such that the original network topological structure and node attribute proximity can be preserved in the new low-dimensional embedding.

Although many approaches have emerged for *Network Embedding* (NE), research on ANE still remains to be explored

Lazhar Labiod () · Mohamed Nadif

Centre Borelli UMR9010, Université Paris Cité, 75006-Paris, France, e-mail: lazhar.labiod@u-paris.fr, e-mail: mohamed.nadif@u-paris.fr

© The Author(s) 2023 203

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_23

[3]. Unlike NE, which learns from plain networks, ANE aims to capitalize on both the proximity information of the network and the affinity of node attributes. Note that, due to the heterogeneity of the two information sources, it is difficult for existing NE algorithms to be directly applied to ANE. To sum up, the learned representation has been shown to be helpful in many learning tasks such as network clustering [13]. ANE is therefore a challenging research problem due to the high dimensionality, sparsity and non-linearity of graph data.

The paper is organized as follows. In Section 2 we formulate the objective function to be optimized, describe the different matrices used, and present a *Simultaneous Attributed Network Embedding and Clustering* (SANEC) framework for embedding and clustering. Section 3 is devoted to numerical experiments. Finally, the conclusion summarizes the advantages of our contribution.

# **2 Proposed Method**

In this section, we describe the SANEC method. We will present the formulation of an objective function and an effective algorithm for data embedding and clustering. But first, we show how to construct two matrices **S** and **M** integrating both types of information –content and structure information– to reach our goal.

#### **2.1 Content and Structure Information**

An attributed network $\mathcal{G} = (\mathcal{V}, E, \mathbf{X})$ consists of $\mathcal{V}$, the set of nodes, $E \subseteq \mathcal{V} \times \mathcal{V}$, the set of links, and $\mathbf{X} = [\mathbf{x}\_1, \mathbf{x}\_2, \ldots, \mathbf{x}\_n]$, where $n = |\mathcal{V}|$ and $\mathbf{x}\_i \in \mathbb{R}^d$ is the feature/attribute vector of the node $v\_i$. Formally, the graph can be represented by two types of information, the content information $\mathbf{X} \in \mathbb{R}^{n \times d}$ and the structure information $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $\mathbf{A}$ is the adjacency matrix of $\mathcal{G}$ and $a\_{ij} = 1$ if $e\_{ij} \in E$, otherwise $a\_{ij} = 0$; we consider each node to be a neighbor of itself, so we set $a\_{ii} = 1$ for all nodes. Thereby, we model the node proximity by an $(n \times n)$ transition matrix $\mathbf{W}$ given by $\mathbf{W} = \mathbf{D}^{-1}\mathbf{A}$, where $\mathbf{D}$ is the degree matrix of $\mathbf{A}$ defined by $d\_{ii} = \sum\_{i'=1}^n a\_{i'i}$.

In order to exploit additional information about node similarity from **X**, we preprocess **X** to produce a similarity graph input $\mathbf{W}^{\mathbf{X}}$ of size $(n \times n)$; we construct a K-Nearest-Neighbor (KNN) graph. To this end, we use the heat kernel and the $L\_2$ distance, the KNN neighborhood mode with $K = 15$, and we set the width of the neighborhood to $\sigma = 1$. Note that any appropriate distance or dissimilarity measure can be used. Finally, we combine, in an $(n \times n)$ matrix **S**, node proximity from both the content information **X** and the structure information **W**. In this way, we intend to perturb the similarity **W** by adding the similarity from $\mathbf{W}^{\mathbf{X}}$; we choose to take **S** defined by $\mathbf{S} = \mathbf{W} + \mathbf{W}^{\mathbf{X}}$ (Figure 1).

As we aim to perform clustering, we propose to integrate it in the formulation of a new data representation by assuming that nodes with the same label tend to have similar social relations and similar node attributes.

**Fig. 1** Model and objective function of SANEC.

This idea is inspired by the fact that labels are strongly influenced by both content and structure information and are inherently correlated with both information sources. The new data representation, referred to as $\mathbf{M} = (m_{ij})$ of size $(n \times d)$, can thus be considered a multiplicative integration of $\mathbf{W}$ and $\mathbf{X}$, replacing each node by the centroid (barycenter) of its neighborhood: $m_{ij} = \sum_{k=1}^{n} w_{ik} x_{kj}$ for all $i, j$, i.e., $\mathbf{M} = \mathbf{W}\mathbf{X}$. In this way, given a graph $G$, graph clustering aims to partition the nodes of $G$ into $k$ disjoint clusters $\{C_1, C_2, \ldots, C_k\}$ so that: (1) nodes within the same cluster are close to each other while nodes in different clusters are distant in terms of graph structure; and (2) nodes within the same cluster are more likely to have similar attribute values.
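The construction of $\mathbf{S}$ and $\mathbf{M}$ can be sketched as follows. This is a minimal NumPy illustration under the stated choices (heat kernel, $L_2$ distance, $K$ nearest neighbors), not the authors' reference code; in particular, the symmetrization of the KNN graph is our own assumption:

```python
import numpy as np

def build_S_and_M(A, X, K=15, sigma=1.0):
    """Build S = W + W_X and M = W X from adjacency A and features X."""
    n = A.shape[0]
    A = A.copy()
    np.fill_diagonal(A, 1)                    # each node is a neighbor of itself
    W = A / A.sum(axis=1, keepdims=True)      # W = D^{-1} A (transition matrix)

    # pairwise squared L2 distances between feature vectors
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    heat = np.exp(-sq / (2 * sigma ** 2))     # heat-kernel similarities
    # keep, for each node, only its K nearest neighbors (excluding itself)
    WX = np.zeros_like(heat)
    for i in range(n):
        nn = np.argsort(sq[i])[1:K + 1]
        WX[i, nn] = heat[i, nn]
    WX = np.maximum(WX, WX.T)                 # symmetrize the KNN graph (our choice)

    S = W + WX                                # structure + content proximity
    M = W @ X                                 # each node replaced by the
    return S, M                               # barycenter of its neighborhood
```

Any other dissimilarity (e.g. cosine, as used later for document data) can be substituted for the squared $L_2$ distance without changing the rest of the construction.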

#### **2.2 Model, Optimization and Algorithm**

Let 𝑘 be the number of clusters and the number of components into which the data is embedded. With **M** and **S**, the SANEC method that we propose aims to obtain the maximally informative embedding according to the clustering structure in the attributed network data. Therefore, we propose to optimize

$$\min\_{\mathbf{B}, \mathbf{Z}, \mathbf{Q}, \mathbf{G}} \left\| \mathbf{M} - \mathbf{B} \mathbf{Q}^{\top} \right\|^2 + \lambda \left\| \mathbf{S} - \mathbf{G} \mathbf{Z} \mathbf{B}^{\top} \right\|^2 \quad \text{s.t.} \quad \mathbf{B}^{\top} \mathbf{B} = \mathbf{I}, \; \mathbf{Z}^{\top} \mathbf{Z} = \mathbf{I}, \; \mathbf{G} \in \{0, 1\}^{n \times k} \tag{1}$$

where $\mathbf{G} = (g_{ij}) \in \{0, 1\}^{n \times k}$ is the cluster membership matrix, $\mathbf{B} = (b_{ij})$ of size $(n \times k)$ is the embedding matrix, and $\mathbf{Z} = (z_{ij})$ of size $(k \times k)$ is an orthonormal rotation matrix which most closely maps $\mathbf{B}$ to $\mathbf{G}$. $\mathbf{Q} \in \mathbb{R}^{d \times k}$ is the feature embedding matrix. Finally, the parameter $\lambda$ is non-negative and can be viewed as a regularization parameter. The intuition behind the factorization of $\mathbf{M}$ and $\mathbf{S}$ is to encourage nodes with similar proximity (those with higher similarity in both matrices) to have closer representations in the latent space given by $\mathbf{B}$. In doing so, the optimization of (1) leads to a clustering of the nodes into $k$ clusters given by $\mathbf{G}$. Note that both tasks, embedding and clustering, are performed simultaneously and supported by $\mathbf{Z}$; this is the key to attaining a good embedding while taking the clustering structure into account. To infer the latent factor matrices $\mathbf{Z}$, $\mathbf{B}$, $\mathbf{Q}$ and $\mathbf{G}$, we derive an alternating optimization algorithm. To this end, we rely on the following proposition.

**Proposition 1.** Let $\mathbf{S} \in \mathbb{R}^{n \times n}$, $\mathbf{G} \in \{0, 1\}^{n \times k}$, $\mathbf{Z} \in \mathbb{R}^{k \times k}$, and $\mathbf{B} \in \mathbb{R}^{n \times k}$ with $\mathbf{B}^\top\mathbf{B} = \mathbf{I}$. Then

$$\left\|\mathbf{S} - \mathbf{G}\mathbf{Z}\mathbf{B}^{\top}\right\|^2 = \left\|\mathbf{S} - \mathbf{S}\mathbf{B}\mathbf{B}^{\top}\right\|^2 + \left\|\mathbf{S}\mathbf{B} - \mathbf{G}\mathbf{Z}\right\|^2\tag{2}$$

**Proof.** We first expand the squared norm on the left-hand side of (2):

$$\left\|\mathbf{S} - \mathbf{G}\mathbf{Z}\mathbf{B}^{T}\right\|^{2} = \left\|\mathbf{S}\right\|^{2} + \left\|\mathbf{G}\mathbf{Z}\mathbf{B}^{T}\right\|^{2} - 2Tr(\mathbf{S}\mathbf{G}\mathbf{Z}\mathbf{B}^{T})\tag{3}$$

Similarly, we obtain for the two terms on the right-hand side of (2):

$$\left\|\mathbf{S} - \mathbf{S}\mathbf{B}\mathbf{B}^{\top}\right\|^2 = \left\|\mathbf{S}\right\|^2 - \left\|\mathbf{S}\mathbf{B}\right\|^2 \quad \text{due to } \mathbf{B}^{\top}\mathbf{B} = \mathbf{I} \tag{4}$$

$$\text{and} \quad \|\mathbf{SB} - \mathbf{GZ}\|^2 = \|\mathbf{SB}\|^2 + \|\mathbf{GZ}\|^2 - 2Tr(\mathbf{SBZ}\mathbf{G}^\top).$$

Also, due to $\mathbf{B}^{\top}\mathbf{B} = \mathbf{I}$, we have

$$\left\|\mathbf{SB} - \mathbf{GZ}\right\|^2 = \left\|\mathbf{SB}\right\|^2 + \left\|\mathbf{GZ}\mathbf{B}^\top\right\|^2 - 2Tr\left(\mathbf{SGZ}\mathbf{B}^\top\right) \tag{5}$$

Summing (4) and (5) yields the left-hand side of (2):

$$\|\mathbf{S}\|^2 + \|\mathbf{G}\mathbf{Z}\|^2 - 2Tr(\mathbf{S}\mathbf{G}\mathbf{Z}\mathbf{B}^\top) = \left\|\mathbf{S} - \mathbf{G}\mathbf{Z}\mathbf{B}^T\right\|^2 \text{ due to } \|\mathbf{G}\mathbf{Z}\|^2 = \left\|\mathbf{G}\mathbf{Z}\mathbf{B}^\top\right\|^2$$

**Compute Z**. Fixing $\mathbf{G}$ and $\mathbf{B}$, the problem arising from (1) is equivalent to $\min_{\mathbf{Z}} \|\mathbf{S} - \mathbf{G}\mathbf{Z}\mathbf{B}^\top\|^2$. From Proposition 1, we deduce that

$$\min\_{\mathbf{Z}} \left\lVert \mathbf{S} - \mathbf{G} \mathbf{Z} \mathbf{B}^{\top} \right\rVert^{2} \Leftrightarrow \min\_{\mathbf{Z}} \left\lVert \mathbf{S} - \mathbf{S}\mathbf{B} \mathbf{B}^{\top} \right\rVert^{2} + \left\lVert \mathbf{S} \mathbf{B} - \mathbf{G} \mathbf{Z} \right\rVert^{2} \tag{6}$$

which can be reduced to $\max_{\mathbf{Z}} \operatorname{Tr}(\mathbf{Z}^\top\mathbf{G}^\top\mathbf{S}\mathbf{B})$ s.t. $\mathbf{Z}^\top\mathbf{Z} = \mathbf{I}$. As proved on page 29 of [1], if $\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ is the SVD of $\mathbf{G}^\top\mathbf{S}\mathbf{B}$, then $\mathbf{Z} = \mathbf{U}\mathbf{V}^\top$.
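This orthogonal Procrustes update can be illustrated with a small numerical sketch (a random matrix stands in for $\mathbf{G}^\top\mathbf{S}\mathbf{B}$; the names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
M = rng.standard_normal((k, k))   # stands in for G^T S B

# Orthogonal Procrustes: Z = U V^T maximizes Tr(Z^T M) over Z^T Z = I
U, _, Vt = np.linalg.svd(M)
Z = U @ Vt

# Z is orthonormal, and no orthogonal matrix gives a larger trace
assert np.allclose(Z.T @ Z, np.eye(k))
for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    assert np.trace(Q.T @ M) <= np.trace(Z.T @ M) + 1e-9
```

The maximum attained, $\operatorname{Tr}(\mathbf{Z}^\top\mathbf{M})$, equals the sum of the singular values of $\mathbf{M}$.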

**Compute Q.** Given $\mathbf{G}$, $\mathbf{Z}$ and $\mathbf{B}$, the optimization problem (1) is equivalent to $\min_{\mathbf{Q}} \|\mathbf{M} - \mathbf{B}\mathbf{Q}^\top\|^2$, and we get

$$\mathbf{Q} = \mathbf{M}^{\mathsf{T}} \mathbf{B}.\tag{7}$$

Thereby $\mathbf{Q}$ can be seen as an embedding of the attributes.

**Compute B**. Given **G**, **Q** and **Z**, the problem (1) is equivalent to

$$\max\_{\mathbf{B}} \quad \operatorname{Tr}\left((\mathbf{M} \mathbf{Q} + \lambda \mathbf{S} \mathbf{G} \mathbf{Z}) \mathbf{B}^\top\right) \quad \text{s.t.} \quad \mathbf{B}^\top \mathbf{B} = \mathbf{I}.$$

In the same manner as for the computation of $\mathbf{Z}$, let $\hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^{\top}$ be the SVD of $(\mathbf{M}\mathbf{Q} + \lambda\mathbf{S}\mathbf{G}\mathbf{Z})$; we get

$$\mathbf{B} = \hat{\mathbf{U}}\hat{\mathbf{V}}^{\top}.\tag{8}$$

It is important to emphasize that, at each step, **B** exploits the information from the matrices **Q**, **G**, and **Z**. This highlights one of the aspects of the simultaneity of embedding and clustering.

**Compute G**. Finally, given $\mathbf{B}$, $\mathbf{Q}$ and $\mathbf{Z}$, the problem (1) is equivalent to $\min_{\mathbf{G}} \|\mathbf{S}\mathbf{B} - \mathbf{G}\mathbf{Z}\|^2$. As $\mathbf{G}$ is a cluster membership matrix, it is computed as follows: fixing $\mathbf{Q}$, $\mathbf{Z}$ and $\mathbf{B}$, let $\tilde{\mathbf{B}} = \mathbf{S}\mathbf{B}$ and calculate

$$g\_{ik} = 1 \;\text{ if }\; k = \arg\min\_{k'} \|\tilde{\mathbf{b}}\_i - \mathbf{z}\_{k'}\|^2, \;\text{ and }\; g\_{ik} = 0 \;\text{ otherwise.} \tag{9}$$

In summary, the steps of the SANEC algorithm relying on $\mathbf{S}$, referred to as SANEC$_S$, are summarized in Algorithm 1. The convergence of SANEC$_S$ is guaranteed, but, depending on the initialization, only to a local optimum. Hence, we run the algorithm several times and select the best result, i.e., the one minimizing the objective function (1).
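The four alternating updates can be sketched as follows. This is a minimal NumPy illustration of the update rules for $\mathbf{Z}$, $\mathbf{Q}$ (7), $\mathbf{B}$ (8) and $\mathbf{G}$ (9), under our own choice of random initialization; it is not the authors' reference implementation of Algorithm 1:

```python
import numpy as np

def sanec_s(M, S, k, lam=1e-3, n_iter=50, seed=0):
    """Alternate the Z, Q, B, G updates of the SANEC_S objective (1)."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    # random orthonormal embedding and random hard assignments to start
    B, _ = np.linalg.qr(rng.standard_normal((n, k)))
    G = np.eye(k)[rng.integers(0, k, size=n)]
    for _ in range(n_iter):
        # Z-step: orthogonal Procrustes on G^T S B
        U, _, Vt = np.linalg.svd(G.T @ S @ B)
        Z = U @ Vt
        # Q-step: Q = M^T B, as in (7)
        Q = M.T @ B
        # B-step: SVD of (M Q + lam * S G Z), as in (8)
        U, _, Vt = np.linalg.svd(M @ Q + lam * S @ G @ Z, full_matrices=False)
        B = U @ Vt
        # G-step: assign each row of SB to its nearest row of Z, as in (9)
        Bt = S @ B
        d2 = ((Bt[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        G = np.eye(k)[d2.argmin(1)]
    return B, Q, Z, G
```

In practice the procedure is restarted from several random initializations, keeping the run with the smallest value of the objective (1), as described above.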


# **3 Numerical Experiments**

In the following, we compare SANEC with some competitive methods described later. The performance of all clustering methods is evaluated on challenging real-world datasets commonly used for ANE, where the clusters are known. Specifically, we consider three public citation network datasets, Citeseer, Cora and Wiki, which contain a sparse bag-of-words feature vector for each document and a list of citation links between documents. Each document has a class label. We treat documents as nodes and citation links as edges. The characteristics of the datasets are summarized in Table 1. The balance coefficient is defined as the ratio of the number of documents in the smallest class to the number of documents in the largest class, while *nz* denotes the percentage of sparsity.

**Table 1** Description of datasets (#: the cardinality).


In our comparison we include standard methods as well as recent deep learning methods; these differ in the information they use. Some, such as 𝑘-means, use only $\mathbf{X}$ as a baseline; we also compare against TADW [14], DeepWalk [7] and Spectral Clustering [11]. Using $\mathbf{X}$ and $\mathbf{W}$, we evaluated GAE and VGAE [5], ARVGA [6], AGC [15] and DAEGC [12].

With the SANEC model, the parameter $\lambda$ controls the role of the second term $||\mathbf{S}-\mathbf{G}\mathbf{Z}\mathbf{B}^\top||^2$ in (1). To measure its impact on the clustering performance of SANEC$_S$, we vary $\lambda$ in $\{0, 10^{-6}, 10^{-3}, 10^{-1}, 10^{0}, 10^{1}, 10^{3}\}$. Based on many experiments, as illustrated in Figure 2, we choose $\lambda = 10^{-3}$. The choice of $\lambda$ nevertheless warrants in-depth evaluation.

**Fig. 2** Sensitivity analysis of 𝜆 using ACC, NMI and ARI.

In our experiments, the clustering performance is assessed against the true available clusters by *accuracy* (ACC), *normalized mutual information* (NMI) and *adjusted rand index* (ARI). We repeat the experiments 50 times with different random initializations, and the averages are reported in Table 2; the best performance for each dataset is highlighted in bold.

First, we observe the high performance of methods integrating information from $\mathbf{W}$. For instance, RTM and RMSC are better than classical methods using only either $\mathbf{X}$ or $\mathbf{W}$. All methods relying on both $\mathbf{X}$ and $\mathbf{W}$, including the deep learning algorithms, are better still. Regarding SANEC, in both versions, relying on $\mathbf{W}$ (referred to as SANEC$_W$) or on $\mathbf{S}$ (referred to as SANEC$_S$), we note high performance on all datasets; with SANEC$_S$ we remark the impact of $\mathbf{W}^X$: it learns low-dimensional representations that suit the clustering structure.

To go further in our investigation, and given the sparsity of $\mathbf{X}$, we standardized $\mathbf{X}$ using tf-idf weighting followed by $L_2$ normalization, as is often done for document-term matrices (see e.g. [9, 10]), while in the construction of $\mathbf{W}^X$ we used the cosine metric. The results, reported in Figure 3, show a slight improvement.


**Table 2** Clustering performances (ACC % , NMI % and ARI %).

**Fig. 3** Evaluation of SANEC**<sup>S</sup>** using tf-idf normalization of **X** and cosine metric for **WX**.

# **4 Conclusion**

In this paper, we proposed a novel matrix decomposition framework for simultaneous attributed network (AN) data embedding and clustering. Unlike known methods that combine the objective functions of AN embedding and of clustering separately, we proposed a single framework, SANEC$_S$, performing AN embedding and node clustering jointly. We showed that the optimized objective function can be decomposed into three terms: the first is the objective function of a kind of PCA applied to $\mathbf{X}$, the second is a graph embedding criterion in a low-dimensional space, and the third is a clustering criterion. We also integrated a discrete rotation functionality, which allows a smooth transformation from the relaxed continuous embedding to a discrete solution and guarantees a tractable optimization problem with a discrete solution. Thereby, we developed an effective algorithm capitalizing on both representation learning and clustering. The obtained results show the advantage of combining both tasks over other approaches; SANEC$_S$ outperforms all recent methods devoted to the same tasks, including deep learning methods, which require pretraining of deep models. However, some points warrant in-depth evaluation, such as the choice of $\lambda$ and the complexity of the algorithm in terms of network size. The proposed framework offers several perspectives. We have noted that the construction of $\mathbf{M}$ and $\mathbf{S}$ is important; it highlights the role of $\mathbf{W}$. As for $\mathbf{W}^X$, we have observed that it is fundamental, as it makes it possible to link the information from $\mathbf{X}$ to the network; this has been verified in many experiments. First, we would like to measure the impact of each matrix $\mathbf{W}$ and $\mathbf{W}^X$ in the construction of $\mathbf{S}$ by considering two different weights, $\mathbf{S} = \alpha\mathbf{W} + \beta\mathbf{W}^X$. Finally, as we have stressed that $\mathbf{Q}$ is an embedding of attributes, this suggests also considering simultaneous ANE and co-clustering.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Towards a Bi-stochastic Matrix Approximation of** 𝒌**-means and Some Variants**

Lazhar Labiod and Mohamed Nadif

**Abstract** The 𝑘-means algorithm and some of its variants have been shown to be useful and effective for tackling the clustering problem. In this paper we embed 𝑘-means variants in a bi-stochastic matrix approximation (BMA) framework, and then derive from the 𝑘-means objective function a new formulation of the criterion. In particular, we show that some 𝑘-means variants are equivalent to an algebraic problem of bi-stochastic matrix approximation under suitable constraints. To optimize the derived objective function, we develop two algorithms: the first consists in learning a bi-stochastic similarity matrix, while the second seeks the optimal partition, which is the equilibrium state of a Markov chain process. Numerical experiments on real datasets demonstrate the interest of our approach.

**Keywords:** 𝑘-means, reduced 𝑘-means, factorial 𝑘-means, bi-stochastic matrix

# **1 Introduction**

In the last decades, unsupervised learning, and specifically clustering, has received a significant amount of attention as an important problem with many applications in data science. Let $A = (a_{ij})$ be an $n \times m$ continuous data matrix, where the set of rows (objects, individuals) is denoted by $I$ and the set of columns (attributes, features) by $J$. Many clustering methods, hierarchical or not, aim to construct an optimal partition of $I$ or, sometimes, of $J$.

In this paper we show how some 𝑘-means variants can be presented as a bi-stochastic matrix approximation problem under suitable constraints generated by the properties of the reached solution. To this end, we first demonstrate that some variants of 𝑘-means are equivalent to learning a bi-stochastic similarity matrix having a diagonal block structure. Based on this formulation, referred to as BMA, we derive two iterative algorithms: the first learns a bi-stochastic $n \times n$ similarity matrix, while the second directly seeks an optimal clustering solution.

Our main contribution is to establish the theoretical connection between conventional 𝑘-means and some of its variants and the BMA framework. The implications of the reformulation of 𝑘-means as a BMA problem are manifold:

Lazhar Labiod () · Mohamed Nadif

Centre Borelli UMR9010, Université Paris Cité, 75006-Paris, France, e-mail: lazhar.labiod@u-paris.fr, e-mail: mohamed.nadif@u-paris.fr

<sup>©</sup> The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_24


The rest of the paper is organized as follows. Section 2 introduces some variants of 𝑘-means. Section 3 provides *Matrix Factorization* (MF) and BMA formulations of 𝑘-means variants. Section 4 discusses the BMA clustering algorithm, and Section 5 is devoted to numerical experiments. Finally, the conclusion summarizes the interest of our contribution.

# **2 Variants of** 𝒌**-Means**

Given a data matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$, the aim of clustering is to cluster the rows or the columns of $A$ so as to optimize the difference between $A$ and the clustered matrix revealing a significant block structure. More formally, we seek to partition the set of rows $I = \{1, \ldots, n\}$ into $k$ clusters $C = \{C_1, \ldots, C_\ell, \ldots, C_k\}$. The partition naturally induces a cluster index matrix $R = (r_{i\ell}) \in \mathbb{R}^{n \times k}$, defined as a binary classification matrix with $r_{i\ell} = 1$ if the row $a_i \in C_\ell$ and $r_{i\ell} = 0$ otherwise. On the other hand, we denote by $S \in \mathbb{R}^{m \times k}$ a reduced matrix specifying the cluster representation. The detection of homogeneous clusters of objects can then be achieved by seeking the two matrices $R$ and $S$ minimizing the total squared residue measure

$$\mathcal{L}\_{KM}(R, S) = ||A - RS^\top||^2 \tag{1}$$

The term $RS^\top$ characterizes the information of $A$ that can be described by the cluster structure. The clustering problem can thus be formulated as a matrix approximation problem in which clustering aims to minimize the approximation error between the original data $A$ and the reconstructed matrix based on the cluster structure.
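This matrix view of 𝑘-means can be checked directly. In the sketch below (our own illustration, not taken from the paper), `R` is the binary membership matrix and `S^T = (R^T R)^{-1} R^T A` holds the cluster means, so the approximation error coincides with the usual within-cluster sum of squares:

```python
import numpy as np

def kmeans_matrix_objective(A, labels, k):
    """Evaluate ||A - R S^T||^2 for a given row partition."""
    R = np.eye(k)[labels]           # n x k binary membership matrix
    St = np.linalg.pinv(R) @ A      # S^T = (R^T R)^{-1} R^T A: cluster means
    return np.linalg.norm(A - R @ St) ** 2

A = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
print(kmeans_matrix_objective(A, labels, 2))  # ≈ 1.0, the within-cluster SSQ
```

Minimizing this quantity over all partitions is exactly the 𝑘-means problem.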

Factorial 𝑘-means analysis (FKM) [9] and reduced 𝑘-means analysis (RKM) [1] are clustering methods that aim at simultaneously achieving a clustering of the objects and a dimension reduction of the features. The advantage of these methods is that both the clustering of objects and the low-dimensional subspace capturing the cluster structure are obtained simultaneously. To achieve this objective, RKM is defined by the minimization of the following criterion

$$\mathcal{L}\_{RKM}(R, S, Q) = ||A - RS^{\top}Q^{\top}||^2 \tag{2}$$

and FKM is defined by the minimization of the following criterion

$$\mathcal{L}\_{FKM}(R, S, Q) = ||AQ - RS^{\top}||^2 \tag{3}$$

where $S \in \mathbb{R}^{p \times k}$ for both RKM and FKM, and $Q$ is an $m \times p$ column-wise orthonormal loading matrix.

# **3 Bi-stochastic Matrix Approximation of** 𝒌**-Means Variants**

#### **3.1 Low-rank Matrix Factorization (MF)**

By considering 𝑘-means as a low-rank matrix factorization with constraints, rather than as a clustering method, we can formulate the constraints to impose on the MF formulation. Let $D_r^{-1} \in \mathbb{R}^{k \times k}$ be the diagonal matrix defined by $D_r^{-1} = \mathrm{Diag}(r_1^{-1}, \ldots, r_k^{-1})$, where $r_\ell$ is the size of cluster $C_\ell$ (so that $D_r = R^\top R$). Using the matrices $D_r$, $A$ and $R$, the summary matrix $S$ can be expressed as $S^\top = D_r^{-1}R^\top A$. Plugging $S$ into the objective function (1) leads to optimizing $\|A - R(D_r^{-1}R^\top A)\|^2$, which equals

$$\mathcal{L}\_{MF-KM}(\mathbf{R}) = ||A - \mathbf{R}\mathbf{R}^\top A||^2,\text{ where } \mathbf{R} = RD\_r^{-0.5}.\tag{4}$$

On the other hand, it is easy to verify that the approximation $\mathbf{R}\mathbf{R}^\top A$ of $A$ takes the same value within each block $A_\ell$, $\ell = 1, \ldots, k$. Specifically, the matrix $\mathbf{R}^\top A$, equal to $S^\top$ up to the scaling $D_r^{1/2}$, plays the role of a summary of $A$ and absorbs the different scales of $A$ and $\mathbf{R}$. Finally, $\mathbf{R}\mathbf{R}^\top A$ gives the row cluster mean vectors. Note that it is easy to show that $\mathbf{R}$ satisfies the following properties

$$\mathbf{R} \ge 0, \mathbf{R}^\top \mathbf{R} = I\_k, \mathbf{R} \mathbf{R}^\top \mathbb{I} = \mathbb{1}, \operatorname{Trace}(\mathbf{R} \mathbf{R}^\top) = k, (\mathbf{R} \mathbf{R}^\top)^2 = \mathbf{R} \mathbf{R}^\top \tag{5}$$
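The properties in (5) are easy to verify numerically for any partition (a small sketch of our own; the partition chosen is arbitrary):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])            # an arbitrary partition of n = 6 rows
k = 3
R = np.eye(k)[labels]                             # binary membership matrix
Dr = R.T @ R                                      # diagonal matrix of cluster sizes
Rn = R @ np.diag(np.diag(Dr) ** -0.5)             # normalized R = R D_r^{-1/2}
Pi = Rn @ Rn.T                                    # the induced similarity matrix RR^T

assert np.allclose(Rn.T @ Rn, np.eye(k))          # R^T R = I_k
assert np.allclose(Pi @ np.ones(6), np.ones(6))   # RR^T 1 = 1
assert np.isclose(np.trace(Pi), k)                # Trace(RR^T) = k
assert np.allclose(Pi @ Pi, Pi)                   # (RR^T)^2 = RR^T
```

The matrix `Pi` built here is exactly the bi-stochastic matrix $\boldsymbol{\Pi} = \mathbf{R}\mathbf{R}^\top$ studied in the next subsection.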

Next, in a similar way, we can derive an MF formulation of FKM,

$$\mathcal{L}\_{MF-FKM}(\mathbf{R}) = ||AQ - \mathbf{R}\mathbf{R}^\top AQ||^2,\tag{6}$$

$$\text{and of RKM, }\,\,\mathcal{J}\_{MF-RKM}(\mathbf{R}) = ||A - \mathbf{R}\mathbf{R}^{\top}AQQ^{\top}||^{2}.\tag{7}$$

#### **3.2 BMA Formulation**

Let 𝚷 = **RR**<sup>&</sup>gt; be a bi-stochastic similarity matrix, before giving the BMA formulation of 𝑘-means variants, we need first to spell out the good properties of 𝚷. Indeed, by construction from **R**, 𝚷 has at least the following properties reported below that can be easily proven.

$$\boldsymbol{\Pi} \ge 0, \quad \boldsymbol{\Pi}^{\top} = \boldsymbol{\Pi}, \quad \boldsymbol{\Pi}\mathbb{1} = \mathbb{1}, \quad \operatorname{Trace}(\boldsymbol{\Pi}) = k, \quad \boldsymbol{\Pi}\boldsymbol{\Pi}^{\top} = \boldsymbol{\Pi}, \quad \operatorname{Rank}(\boldsymbol{\Pi}) = k \tag{8}$$

Given a data matrix $A$ and $k$ row clusters, we can hope to discover the cluster structure of $A$ from $\boldsymbol{\Pi}$. Notice from (8) that $\boldsymbol{\Pi}$ is non-negative, symmetric, bi-stochastic (doubly stochastic) and idempotent. By setting 𝑘-means in the BMA framework, the clustering problem is reformulated as learning a structured bi-stochastic similarity matrix $\boldsymbol{\Pi}$ by minimizing one of the following 𝑘-means variant objectives,

$$\mathcal{J}\_{BMA-kM}(\boldsymbol{\Pi}) = ||A - \boldsymbol{\Pi}A||^2,\tag{9}$$

$$\mathcal{J}\_{BMA-FKM}(\Pi) = ||AQ - \Pi AQ||^2,\tag{10}$$

$$\mathcal{J}\_{BMA-RKM}(\Pi) = ||A - \Pi A Q Q^{\top}||^2,\tag{11}$$

with respect to the following constraints on 𝚷

$$\Pi \ge 0, \quad \Pi = \Pi^\top, \quad \Pi \mathbb{1} = \mathbb{1}, \quad \operatorname{Tr}(\Pi) = k, \quad \Pi\Pi^\top = \Pi \tag{12}$$

$$\text{and } Q^{\top} Q = I \quad \text{for equations (10) and (11).}$$

In the rest of the paper, we will consider only the non-negativity, symmetry and bi-stochasticity constraints.

#### **3.3 The Equivalence Between BMA and** 𝒌**-Means**

The theorem below demonstrates that the optimization of the 𝑘-means objective and that of the BMA objective under suitable constraints are equivalent. Equation (13) establishes the equivalence between 𝑘-means and the BMA formulation: solving the BMA objective function (9) is equivalent to finding a global solution of the 𝑘-means criterion (1).

#### **Theorem 1**

$$\arg\min\_{R,S} ||A - RS^{\top}||^2 \Leftrightarrow \arg\min\_{\{\Pi \ge 0,\, \Pi = \Pi^{\top},\, \Pi\mathbb{1} = \mathbb{1},\, \operatorname{Tr}(\Pi) = k,\, \Pi\Pi^{\top} = \Pi\}} ||A - \Pi A||^2 \qquad (13)$$

The proof of this equivalence is given in the appendix. Note that this new formulation provides some interesting insights into 𝑘-means and its variants:


# **4 BMA Clustering Algorithm**

First, we establish the relationship between our objective function and the one used in [12, 11]. From $\|A - \boldsymbol{\Pi}A\|^2 = \operatorname{Trace}(AA^\top) + \operatorname{Trace}(\boldsymbol{\Pi}AA^\top\boldsymbol{\Pi}^\top) - 2\operatorname{Trace}(AA^\top\boldsymbol{\Pi})$ and the idempotence property $\boldsymbol{\Pi}\boldsymbol{\Pi}^\top = \boldsymbol{\Pi}$, we can show that

$$\arg\min\_{\Pi} \left||A - \Pi A||^2 \Leftrightarrow \arg\min\_{\Pi} \left||AA^\top - \Pi||^2 \Leftrightarrow \arg\max\_{\Pi} \operatorname{Trace}(AA^\top \Pi).$$
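For a fixed partition-induced $\boldsymbol{\Pi}$, these identities can be checked numerically (a sketch of our own, with $\boldsymbol{\Pi}$ built from a hard partition as in Section 3):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])
k = 3
R = np.eye(k)[labels]
Pi = R @ np.diag(1.0 / R.sum(0)) @ R.T    # bi-stochastic and idempotent

lhs = np.linalg.norm(A - Pi @ A) ** 2
# ||A - Pi A||^2 = Trace(A A^T) - Trace(A A^T Pi) when Pi Pi^T = Pi
assert np.isclose(lhs, np.trace(A @ A.T) - np.trace(A @ A.T @ Pi))
# ||A A^T - Pi||^2 is an affine function of ||A - Pi A||^2 under Trace(Pi) = k,
# so all three problems share the same argmin over partition matrices Pi
const = np.linalg.norm(A @ A.T) ** 2 + k - 2 * np.trace(A @ A.T)
assert np.isclose(np.linalg.norm(A @ A.T - Pi) ** 2, 2 * lhs + const)
```

The constant term depends only on $A$ and $k$, never on the partition, which is what makes the three formulations interchangeable.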

The algorithm for learning the similarity matrix is summarized in Algorithm 1, as in [12, 11]. Once the bi-stochastic similarity matrix $\boldsymbol{\Pi}$ is obtained, the basic idea of BMA is based on the following steps:



**Why does this work?** At first glance, this process might seem uninteresting, since for any starting vector it eventually leads to a vector in which all entries coincide. However, our practical experience shows that the vector $\pi$ very quickly collapses into row blocks, and these blocks then move towards each other relatively slowly. If we stop the power method iteration at this point, the algorithm has a potential application for data visualization and clustering. The structure of $\pi$ during this short-run stabilization makes the discovery of the row ordering straightforward: the key is to look for values of $\pi$ that are approximately equal and to reorder the rows and columns of the data accordingly. The BMA algorithm thus reorganizes the rows of the data matrix $\hat{A}$ according to the sorted $\pi$. It also allows us to locate the points corresponding to an abrupt change in the curve of the first left singular vector $\pi$, and thereby to assess the number of clusters and the rows belonging to each cluster.
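This short-run behavior can be sketched as follows. The example below is our own idealized illustration of truncated power iteration on a block-structured bi-stochastic matrix, not the authors' Algorithm 1; `Pi` stands in for the learned similarity matrix:

```python
import numpy as np

# an idealized 2-block bi-stochastic similarity matrix with a small
# uniform coupling between blocks (so that the blocks merge only slowly)
n = 8
block = np.kron(np.eye(2), np.full((4, 4), 0.25))
Pi = 0.95 * block + 0.05 / n

# truncated power iteration: stopping early, before pi collapses to a
# constant vector, leaves a (piecewise-)constant pi revealing the blocks
rng = np.random.default_rng(0)
pi = rng.random(n)
for _ in range(5):
    pi = Pi @ pi

# rows with approximately equal pi values belong to the same cluster;
# sorting pi reorders the rows of the data matrix block by block
order = np.argsort(pi)
print(order[:4], order[4:])
```

With real, noisy similarity matrices the within-block values are only approximately equal, and the abrupt changes in the sorted $\pi$ curve indicate the block boundaries.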

# **5 Experiments Analysis**

In this section we first ran our algorithm on two real-world datasets. The 16 townships data consists of characteristics (rows) of 16 townships (columns); each cell indicates the presence (1) or absence (0) of a characteristic in a township. This example was used by Niermann [7] for a data ordering task, where the author aims to reveal a block diagonal form. The second dataset, called Mero, comes from archaeological data on Merovingian buckles found in north-eastern France. This data matrix consists of 59 buckles characterized by 26 descriptive attributes (see Marcotorchino [6] for more details). Figure 1 shows, in order, $A$, $\hat{A}$, $S_R = AA^\top$ reorganized according to the sorted $\pi$, and the sorted $\pi$ plot for both datasets. We also evaluated

**Fig. 1** left: 16 Townships data - right: Mero data.

the performance of BMA on some challenging real datasets described in Table 1. We compared BMA with spectral co-clustering (SpecCo) [2], Non-negative Matrix Factorization (NMF) and Orthogonal Non-negative Matrix Tri-Factorization (NMTF) [3] using two evaluation metrics: accuracy (ACC), corresponding to the percentage of well-classified elements, and normalized mutual information (NMI) [8]. In Table 1, we observe that BMA outperforms all compared algorithms on all tested datasets.


**Table 1** Clustering Accuracy and Normalized Mutual Information (%).

# **6 Conclusion**

In this paper we have presented a new reformulation of some variants of 𝑘-means in a unified BMA framework and established the equivalence between 𝑘-means and BMA under suitable constraints. In doing so, 𝑘-means leads to learning a structured bi-stochastic matrix, which is beneficial for the clustering task. The proposed approach not only learns a similarity matrix from the data matrix, but uses this matrix in an iterative process that converges to a matrix $\hat{A}$ in which each row is represented by its prototype. The clustering solution is given by the first left eigenvector of $\hat{A}$, without requiring prior knowledge of the number of clusters. For future work, we plan to integrate the idempotence and trace constraints on $\boldsymbol{\Pi}$ so that the approximated similarity matrix best fits a block diagonal structure.

# **Appendix**

From the BMA formulation, we know that one can easily construct a feasible solution for 𝑘-means from a feasible solution of the BMA formulation. Therefore, it remains to show that from a global solution of the BMA formulation we can obtain a feasible solution of 𝑘-means. In order to show the equivalence between the optimization of the 𝑘-means formulation and the BMA formulation, we first consider the following lemma.

**Lemma** If Π is a symmetric and positive semi-definite matrix, then we have

(a) $\pi_{ii'} \le \sqrt{\pi_{ii}\,\pi_{i'i'}}$ (geometric mean) $\forall i, i'$;
(b) $\pi_{ii'} \le \frac{1}{2}(\pi_{ii} + \pi_{i'i'})$ (arithmetic mean) $\forall i, i'$;
(c) $\max_{i,i'} \pi_{ii'} = \max_{i} \pi_{ii}$;
(d) $\pi_{ii} = 0 \Rightarrow \pi_{ii'} = \pi_{i'i} = 0$ $\forall i'$.

**Proposition.** Any positive semi-definite matrix Π satisfying the constraints:

$$\begin{cases} \pi\_{ii'} = \pi\_{i'i} \quad \forall i, i' \qquad (\text{symmetry})\\ \pi\_{ii'} = \sum\_{i''} \pi\_{ii''} \pi\_{i'i''} \quad \forall i, i' \text{ (idempotence)}\\ \sum\_{i'} \pi\_{ii'} = 1 \quad \forall i \text{ (stochasticity)}\\ \sum\_{i} \pi\_{ii} = k \text{ (trace)} \end{cases}$$

is a matrix partitioned into $k$ diagonal blocks, $\Pi = \mathrm{diag}(\Pi^1, \ldots, \Pi^\ell, \ldots, \Pi^k)$, with $\Pi^\ell = \frac{1}{n_\ell}\mathbb{1}_\ell\mathbb{1}_\ell^\top$, $\mathrm{trace}(\Pi^\ell) = 1$ for all $\ell$, and $\sum_{\ell=1}^{k} n_\ell = n$; here $\mathbb{1}_\ell$ denotes the all-ones vector of the appropriate dimension.

**Proof.** Since $\Pi$ is idempotent ($\Pi^2 = \Pi$), we have $\pi_{ii} = \sum_{i'} \pi_{ii'}^2$ for all $i$. From the Lemma above, there exists $i^0 \in \{1, 2, \ldots, n\}$ such that $\max_{i,i'} \pi_{ii'} = \pi_{i^0 i^0} > 0$. Considering the set $A_{i^0}$ defined by $A_{i^0} = \{i \mid \pi_{i^0 i} > 0\}$, we can write, for all $i \in A_{i^0}$, $\pi_{ii} = \sum_{i' \in A_{i^0}} \pi_{i'i}^2$, and

$$\forall i \in A\_{i^0}; \quad \sum\_{i' \in A\_{i^0}} \pi\_{i'i} = \sum\_{i' \in I} \pi\_{i'i} = 1 \tag{14}$$

and,

$$\sum\_{i' \in A\_{i^0}} \sum\_{i \in A\_{i^0}} \pi\_{i'i} = \sum\_{i \in A\_{i^0}} \pi\_{i.} = \sum\_{i \in A\_{i^0}} 1 = |A\_{i^0}| \tag{15}$$

$$\forall i \; \pi\_{ii} = \sum\_{i'} \pi\_{ii'}^2 \Rightarrow \forall i \in A\_{i^0}; \quad \sum\_{i' \in A\_{i^0}} \frac{\pi\_{ii'}^2}{\pi\_{ii}} = \sum\_{i' \in A\_{i^0}} (\frac{\pi\_{ii'}}{\pi\_{ii}}) \pi\_{ii'} = 1. \tag{16}$$

From (14) and (16), we deduce that, for all $i \in A_{i^0}$, $\sum_{i' \in A_{i^0}} \pi_{i'i} = \sum_{i' \in A_{i^0}} \left(\frac{\pi_{ii'}}{\pi_{ii}}\right)\pi_{ii'}$, implying that $\pi_{ii'} = \pi_{ii}$ for all $i, i' \in A_{i^0}$. Substituting $\pi_{ii'}$ by $\pi_{ii}$ in (15) for all $i, i' \in A_{i^0}$ leads to $\sum_{i' \in A_{i^0}} \pi_{ii'} = \sum_{i' \in A_{i^0}} \pi_{ii} = |A_{i^0}|\,\pi_{ii} = 1$ for all $i \in A_{i^0}$. From this we deduce that $\pi_{ii} = \pi_{ii'} = \frac{1}{|A_{i^0}|}$ for all $i, i' \in A_{i^0}$. We can therefore rewrite the matrix $\Pi$ in the block diagonal form $\Pi = \mathrm{diag}(\Pi^0, \bar{\Pi}^0)$, where $\Pi^0$ is the block whose general term is $\Pi^0_{ii'} = \frac{1}{|A_{i^0}|}$ for $i, i' \in A_{i^0}$, with $\mathrm{trace}(\Pi^0) = 1$. The matrix $\bar{\Pi}^0$ is a positive semi-definite matrix which also satisfies the constraints $(\bar{\Pi}^0)^\top = \bar{\Pi}^0$, $\bar{\Pi}^0\mathbb{1} = \mathbb{1}$, $(\bar{\Pi}^0)^2 = \bar{\Pi}^0$ and $\mathrm{trace}(\bar{\Pi}^0) = k - 1$. Repeating the same process $k - 1$ times yields the block diagonal form of $\Pi$:

$$\Pi = \mathrm{diag}(\Pi^0, \Pi^1, \ldots, \Pi^\ell, \ldots, \Pi^{k-1}), \quad \text{with } \Pi^\ell = \frac{1}{n\_\ell}\mathbb{1}\_\ell\mathbb{1}\_\ell^\top, \; \mathrm{trace}(\Pi^\ell) = 1 \; \forall \ell, \; \text{and } \sum\_{\ell=0}^{k-1} n\_\ell = n.$$

# **References**



# **Clustering Adolescent Female Physical Activity Levels with an Infinite Mixture Model on Random Effects**

Amy LaLonde, Tanzy Love, Deborah R. Young, and Tongtong Wu

**Abstract** Physical activity trajectories from the Trial of Activity in Adolescent Girls (TAAG) capture various exercise habits over female adolescence. Previous analyses of this longitudinal data from the University of Maryland field site examined the effect of various individual-, social-, and environmental-level factors on the change in physical activity levels from 14 to 23 years of age. We aimed to understand the differences in physical activity levels after controlling for these factors. Using a Bayesian linear mixed model that incorporates a model-based clustering procedure for the random deviations, without specifying the number of groups *a priori*, we find that physical activity levels are starkly different for about 5% of the study sample: these young girls exercise on average 23 more minutes per day.

**Keywords:** Bayesian methodology, Markov chain Monte Carlo, mixture model, reversible jump, split-merge procedures

# **1 Introduction**

Physical activity and diet are arguably the two main controllable factors having the greatest impact on our health. Whereas we have little to no control over factors like our genetic predisposition to disease or exposure to environmental toxins, we have

Amy LaLonde
University of Rochester, NY, USA, e-mail: amylalonde2@gmail.com

Tanzy Love (✉)
University of Rochester, NY, USA, e-mail: tanzy\_love@urmc.rochester.edu

Deborah Rohm Young
University of Maryland, MD, USA, e-mail: dryoung@umd.edu

Tongtong Wu
University of Rochester, NY, USA, e-mail: tongtong\_wu@urmc.rochester.edu

© The Author(s) 2023 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_25

much greater control over our diet and activity levels. Despite our ability to choose to engage in healthy behaviors such as exercising and eating a healthy diet, these choices are plagued with the complexity of human psychology and the modern demands and distractions that pervade our lives today. Several factors influence levels of physical activity; we explore the factors impacting female adolescents using longitudinal data.

The University of Maryland, one of the six initial university field centers of the Trial of Activity in Adolescent Girls (TAAG), selected to follow its 2006 8th-grade cohort at two additional time points over adolescence: 11th grade and 23 years of age. The females were therefore measured at roughly ages 14, 17, and 23. In these waves there was no intervention, as this observational longitudinal study aimed at exploring the patterns of physical activity levels and associated factors over time.

The model presented in Wu et al. [1] motivates the current work. We fit a similar linear mixed model controlling for the same variables. Rather than cluster the raw physical activity trajectories to identify groups, we cluster the females within the model-fitting procedure based on the values of the subject-specific deviations from the adjusted physical activity levels. Fitting a Bayesian linear mixed model, we simultaneously explore the subject groups through the use of reversible jump Markov chain Monte Carlo (MCMC) applied to the random effects. Bayesian model-based clustering methods have been applied within linear mixed models to identify groups by clustering the fitted values of the dependent variable. For example, [2] fits cluster-specific linear mixed models to the gene expression outcome using an EM algorithm, and [3] clusters gene expression in a similar fashion, except using Bayesian methods. In contrast, we perform the clustering on the random effects, which allows us to investigate the variability that is unexplained by the covariates of interest. This methodology is advantageous because of its ability to jointly estimate all effects while also exploring the infinite space of group arrangements.

# **2 Bayesian Mixture Models for Heterogeneity of Random Effects**

Let $\mathbf{y}_i = (y_{i,1}, \ldots, y_{i,T})$ be the $i$-th subject's average daily moderate-to-vigorous physical activity (MVPA) at each of the $T = 3$ time points. The MVPA was collected from ActiGraph accelerometers (Manufacturing Technologies Inc. Health Systems, Model 7164, Shalimar, FL) worn for seven consecutive days. Accelerometers offered a great alternative to self-report for tracking physical activity levels, and measuring over seven days helped to account for differences in activity patterns during weekdays and weekends. Wu et al. [1] analyzed this cohort using mixed models that accounted for the subject-specific variability. We let $\mathbf{X}_i$ represent the $i$-th subject's values for covariates.

Furthermore, let $\mathbf{r} = (r_1, \ldots, r_n)$ represent the subject-specific random effects for the $n$ subjects. The simple linear mixed model is written in terms of each subject as

$$\mathbf{y}\_i = \mathbf{X}\_i \boldsymbol{\beta} + r\_i \mathbf{1}\_T + \boldsymbol{\epsilon}\_i \tag{1}$$

where $\boldsymbol{\beta}$ represents the coefficients for the covariate effects and $\boldsymbol{\epsilon}_i = (\epsilon_{i,1}, \ldots, \epsilon_{i,T})$ are the residuals. We assume independence and normality in the residuals and the random effects; hence, $r_i \sim N(0, \sigma_r^2)$ and $\boldsymbol{\epsilon}_i \sim N(\mathbf{0}, \sigma_\epsilon^2 \mathbf{I}_T)$ for $i = 1, \ldots, n$.
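As a quick illustration, the generative structure of Eq. (1) can be sketched in a few lines. The sample sizes match the study ($n = 428$, $T = 3$), but the design matrix, coefficients, and variance values below are hypothetical placeholders, not TAAG estimates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, T, p = 428, 3, 2                  # subjects, time points, number of covariates
beta = np.array([25.0, -1.5])        # hypothetical covariate effects
sigma_r, sigma_e = 6.0, 4.0          # hypothetical SDs of r_i and the residuals

X = rng.normal(size=(n, T, p))               # X_i: a T x p design per subject
r = rng.normal(0.0, sigma_r, size=n)         # subject-specific random effects
eps = rng.normal(0.0, sigma_e, size=(n, T))  # residuals

# y_i = X_i beta + r_i 1_T + eps_i  (Eq. 1): r_i shifts the whole trajectory
y = X @ beta + r[:, None] + eps
print(y.shape)  # prints (428, 3)
```

Each subject's trajectory is shifted by a single draw $r_i$; these subject-level deviations are exactly what the later sections cluster.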

Fitting the mixed model reveals substantial heteroscedasticity in the residuals: the variability increases as the fitted values increase. A traditional approach to fixing this violation would be to re-fit the model to the log-transformed MVPA values. Plots of residuals versus fitted values for this model also exhibited evidence of heteroscedasticity, thus still violating a core assumption of the regression framework. Given the changes adolescents experience as they grow into young adults, we expect to see heterogeneity in the physical activity patterns across this duration of follow-up time. However, the inability of the model to capture such changes over time at the higher levels of physical activity suggests the need for model improvements. The purpose of this analysis is to present our adjustments to previous analyses in order to investigate underlying characteristics across different groups of females formed based on deviations from adjusted physical activity levels.

**Fig. 1** The plot on the left depicts the residuals versus fitted values for the linear mixed model in Eq. (1); they demonstrate severe heteroscedasticity. The variance increases as the fitted values increase. The plot on the right depicts the distribution of the random effects.

We fit the mixed model in Eq. (1) to the sample of female adolescents. The heteroscedasticity depicted in Figure 1 reveals an increase in variance with predicted minutes of moderate-to-vigorous physical activity, which we would expect. The plot on the right in Figure 1 demonstrates that the distribution of the random effects does not appear to satisfy our assumption of normality centered around zero. The random effects do appear to follow a normal distribution over the lower range of deviations, with a subset of the subjects having larger positive deviations from the estimated adjusted physical activity levels.

To capture the heterogeneity and allow the random effects to follow a non-normal distribution, we assign the random effects a Gaussian mixture distribution. Before introducing the model for heterogeneity, we note the likelihood for the observed outcomes, $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_n)'$. The moderate-to-vigorous physical activity distribution is


$$p\left(\mathbf{Y}|\boldsymbol{\beta},\mathbf{r},\sigma\_{\epsilon}^{2}\right) = \prod\_{i=1}^{n} \prod\_{t=1}^{T} \left(2\pi\sigma\_{\epsilon}^{2}\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma\_{\epsilon}^{2}}(y\_{i,t} - \mathbf{X}\_{i,t}\boldsymbol{\beta} - r\_{i})^{2}\right\}.\tag{2}$$

Then, to account for the heterogeneity across subjects, the probability density for the subject-specific deviations in physical activity is expressed as a mixture of one-dimensional normal densities,

$$p\left(r\_i|\boldsymbol{\mu},\boldsymbol{\sigma}\_r^2\right) = \sum\_{g=1}^G \pi\_g \left(2\pi\sigma\_{r,g}^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma\_{r,g}^2} (r\_i - \mu\_g)^2\right\}.\tag{3}$$

Here, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_G)'$ defines the group-specific mean deviations, $\boldsymbol{\sigma}_r^2 = (\sigma_{r,1}^2, \ldots, \sigma_{r,G}^2)'$ characterizes the variances of the group-specific deviations, and $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_G)'$ collects the probabilities of membership in each group $g$.

The model in Eqs. (2) and (3) can be fit using either EM or Bayesian MCMC procedures. Both require specification of a fixed number of groups, $G$. We may hypothesize that there are only two groups, one normally distributed and centered at zero and another normally distributed and centered at a larger mean, but this assumption hinges on what we have seen in plots like those in Figure 1. The random effects in the aforementioned histogram, however, are shrunk towards zero by assumption, while a mixture model allows the data to more accurately depict the deviations observed in the girls' physical activity levels. The assumed number of groups $G$ can strongly influence the results of our model fitting. To circumvent the issues associated with selecting $G$ in either an EM algorithm or a Bayesian finite mixture model framework, we implement a Bayesian mixture model that treats $G$ as an additional unknown parameter.
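To see why fixing $G$ matters, here is a minimal sketch (not the authors' implementation) of the fixed-$G$ alternative: a plain EM fit of a univariate Gaussian mixture to synthetic random effects, with $G$ chosen by BIC. The data-generating values (a bulk near 0 plus a small component near +23) are hypothetical, chosen only to mimic the pattern reported later in the chapter:

```python
import numpy as np

def em_gmm_1d(x, G, iters=300):
    """Plain EM for a G-component univariate Gaussian mixture, plus BIC."""
    n = len(x)
    mu = np.quantile(x, np.linspace(0.05, 0.95, G))  # spread-out deterministic init
    var = np.full(G, x.var())
    w = np.full(G, 1.0 / G)
    for _ in range(iters):
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)          # E-step
        nk = resp.sum(axis=0)
        w, mu = nk / n, (resp * x[:, None]).sum(axis=0) / nk   # M-step
        var = np.maximum((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk, 1e-6)
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    loglik = np.log(dens.sum(axis=1)).sum()
    bic = (3 * G - 1) * np.log(n) - 2 * loglik   # params: weights, means, variances
    return mu, bic

rng = np.random.default_rng(1)
# synthetic deviations: a bulk near 0 and a small highly active subset near +23
x = np.concatenate([rng.normal(0, 5, 400), rng.normal(23, 3, 25)])
bics = {G: em_gmm_1d(x, G)[1] for G in (1, 2, 3)}
print(min(bics, key=bics.get))
```

Every such fit conditions on a chosen $G$; the reversible jump approach described next instead samples $G$ along with everything else.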

#### **2.1 Bayesian Mixed Models With Clustering**

Richardson and Green [4] adapted the reversible jump methodology to univariate normal mixture models. In addition to characterizing the distribution of $G$, this Bayesian framework can simultaneously explore the posterior distribution of the covariate effects of interest. Furthermore, we obtain the posterior distributions of the group-defining parameters rather than just point estimates. Since we are interested in the physical activity differences among subjects when controlling for these covariates, we use Eq. (1) as the basis of our model.

The foundation of our clustering model is a finite mixture model on the random effects, $r_i$, as shown in Eq. (3). For all $i = 1, \ldots, n$ and $g = 1, \ldots, G$,

$$r\_i \mid c\_i, \boldsymbol{\mu} \sim F\_r(\mu\_{c\_i}, \sigma\_{r,c\_i}^2), \qquad (c\_i = g) \mid \boldsymbol{\pi}, G \sim \mathrm{Categorical}(\pi\_1, \ldots, \pi\_G),$$

$$\mu\_g \mid \tau \sim N(\mu\_0, \tau), \qquad \sigma\_{r,g}^2 \mid c, \delta \sim IG(c, \delta), \qquad \boldsymbol{\pi} \mid G \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha), \qquad G \sim \mathrm{Uniform}[1, G\_{max}],$$

where $c_i$ is the latent grouping variable tracking the assignment of $r_i$ into any one of the $G$ clusters. The *likelihood function* for these subject-specific deviations, given the group assignment, $c_i$, is simply

$$p\left(r\_i \mid c\_i = g, \mu\_g, \sigma\_{r,g}^2\right) = \left(2\pi\sigma\_{r,g}^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma\_{r,g}^2}(r\_i - \mu\_g)^2\right\}.$$

This replaces the typical independent and identically distributed assumption of $r_i \sim N(0, \sigma_r^2)$ for all $i$ with a normal distribution that is now conditional on group assignment. The remainder of the model formulation follows closely the framework constructed in [4], except that we have an additional layer of unknown parameters defining the linear mixed model in Eq. (1).

We select conjugate priors so that the posterior distributions of the unknown parameters are analytically tractable. The prior on the mixing probabilities, $\boldsymbol{\pi}$, is a symmetric Dirichlet distribution, reflecting the prior belief that belonging to any one cluster is equally likely. To use the sampling methods of [4], we select a discrete uniform prior on $G$ that reflects our uncertainty about the number of groups, and impose an a priori ordering of the $\mu_g$, such that for any given value of $G$, $\mu_1 < \mu_2 < \cdots < \mu_G$, to remove label switching. Thus, the priors for the clustering parameters are

$$p(\boldsymbol{\mu}) = G! \prod\_{g=1}^{G} (2\pi\tau)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\tau}(\mu\_g - \mu\_0)^2\right\},$$

$$p(\sigma\_{r,g}^2) = \frac{\delta^c}{\Gamma(c)} (\sigma\_{r,g}^2)^{-c-1} \exp\left\{-\frac{\delta}{\sigma\_{r,g}^2}\right\}$$

$$p(G) = \frac{1}{G\_{max}} \mathbf{1}\{G \in [1, G\_{max}]\},$$

where $G_{max}$ is set to be reasonably large and $\mathbf{1}\{G \in [1, G_{max}]\}$ is a discrete indicator function, equal to 1 on the interval $[1, G_{max}]$ and 0 elsewhere.

The capacity of our sampler to move between dimensions is essential to our ability to explore the grouping of the observations while simultaneously exploring the parameters describing the relationships between the covariates and the outcome. This means that the number of components of our mixture model on the random effects can increase or decrease at each state of the MCMC chain. Such changes impact the dimension of the parameters of the mixture model, $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2, G, \boldsymbol{\pi}, \mathbf{c})$.

Let $\boldsymbol{\theta}$ denote the current state of the parameters $(\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2, G, \boldsymbol{\pi}, \mathbf{c})$ when proposing move $m$, where $m \in \{S, M, B, D\}$ corresponds to a split, merge, birth, and death, respectively. Given the current state, $\boldsymbol{\theta}$, and move $m$, we propose a new state, $\boldsymbol{\theta}^m$, under move $m$. The acceptance probability is written as

$$acc\_m(\boldsymbol{\theta}^m, \boldsymbol{\theta}) = \min\left[1, \frac{p(\boldsymbol{\theta}^m|\mathbf{r})\,q(\boldsymbol{\theta}^m|m^{-1})}{p(\boldsymbol{\theta}|\mathbf{r})\,q(\boldsymbol{\theta}|m)}\,|J|\right],$$

where $p(\cdot)$ and $q(\cdot)$ denote the target and proposal distribution, respectively. In our case, the target distribution is the posterior distribution of our group-specific parameters, $(\boldsymbol{\mu}, \boldsymbol{\sigma}_r^2, \boldsymbol{\pi}, \mathbf{c})$, given the data, $\mathbf{r}$, which are the random effects. Each proposed move changes the dimension of the parameters in $\boldsymbol{\theta}$ by 1, adding or deleting group-specific parameters. The ratio $q(\boldsymbol{\theta}^m|m^{-1})/q(\boldsymbol{\theta}|m)$ ensures "dimension balancing", as explained in [4]. For moves increasing in dimension, the Jacobian, $|J|$, is computed as $|\partial\boldsymbol{\theta}^m/\partial(\boldsymbol{\theta}, \mathbf{u})|$ because moving from $\boldsymbol{\theta}$ to $\boldsymbol{\theta}^m$ requires additional parameters, $\mathbf{u}$, to appropriately match dimensions. The opposite is true for moves decreasing in dimension. This is what we refer to as the reversible jump mechanism; each time a split is proposed, we must also design the reverse move that would result in the currently merged component, and vice versa.
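The structure of the accept/reject step can be sketched generically. This is only a log-scale skeleton of the $\min[1, \cdot]$ formula above, with the posterior, proposal, and Jacobian terms supplied by the caller rather than derived for a concrete split or merge move:

```python
import numpy as np

def rj_accept(log_post_new, log_post_cur, log_q_reverse, log_q_forward, log_jac, rng):
    """Generic reversible-jump accept/reject step:
    acc = min(1, [p(theta_m|r) q(theta_m|m^-1)] / [p(theta|r) q(theta|m)] * |J|),
    evaluated on the log scale for numerical stability."""
    log_acc = min(0.0, (log_post_new - log_post_cur)
                       + (log_q_reverse - log_q_forward)
                       + log_jac)
    return bool(np.log(rng.uniform()) < log_acc)

# hypothetical numbers: a proposal with higher posterior is always accepted here
rng = np.random.default_rng(0)
accepted = rj_accept(-1.0, -2.0, 0.0, 0.0, 0.0, rng)
print(accepted)  # prints True
```

In a full sampler, each move type would compute its own proposal densities and Jacobian before calling this step.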

Split and merge moves are implemented for our model. These moves update $\pi$, $\mu$, and $\sigma$ for two adjacent groups, or create two adjacent groups, using three Beta-distributed additional parameters, $u$, for dimension balancing in a similar way to [4]. Within our context of random effects, births and deaths are not appropriate: a singleton causes issues of identifiability because its $r_i$ is no longer random. We therefore do not allow birth and death moves in our reversible jump methods.

# **3 Trial of Activity in Adolescent Girls (TAAG) and Model Results**

Our analysis focuses only on those girls from the University of Maryland site of the TAAG study who were measured at all three follow-up time points, beginning in 2006. After excluding girls with missing outcomes, the final sample consisted of 428 girls measured in 2006, 2009, and 2014. Missing covariate values were imputed for four subjects using the values from the nearest time point.

We determine the group assignments using an MCMC sampler run for 10,000 iterations, with a burn-in of 500 draws. The posterior distribution for $G$ was extremely peaked at $G = 2$. Summarizing the posterior distribution of the group assignments via the least squares clustering method delivers the final arrangement, $\hat{\mathbf{c}}_{LS}$, of girls into two groups describing their physical activity levels [5]. Since our sampler explores several models in which the group assignments and $G$ can vary, we sample additional draws from the posterior distribution of the remaining parameters of interest using an MCMC sampler with the model specification of Eq. (1) and groups fixed at our posterior assignment, $\hat{\mathbf{c}}_{LS}$, for the subject-specific random effects. This additional chain was run for 10,000 iterations with a burn-in of 500 draws, yielding the results summarized below. Convergence diagnostics indicated that 10,000 iterations sufficiently met the effective sample size threshold for estimating the coefficients for the covariate effects, $\boldsymbol{\beta}$, and the group-specific means, $\boldsymbol{\mu}$, describing the deviations of the girls' physical activity levels [6].
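The least squares clustering summary of [5] can be sketched as follows: average the co-clustering indicator matrices over the posterior label draws, then report the sampled partition that minimizes the squared distance to that average. The toy draws below are purely illustrative:

```python
import numpy as np

def least_squares_clustering(label_draws):
    """Least-squares summary of posterior partitions: pick the sampled
    labeling closest (in squared error) to the average co-clustering matrix."""
    draws = np.asarray(label_draws)          # shape: (n_draws, n_subjects)
    co = np.stack([(d[:, None] == d[None, :]).astype(float) for d in draws])
    pbar = co.mean(axis=0)                   # posterior co-clustering probabilities
    losses = ((co - pbar) ** 2).sum(axis=(1, 2))
    return draws[np.argmin(losses)]

# toy posterior: most draws agree on the split {0,1,2} vs {3,4}
draws = [[0, 0, 0, 1, 1]] * 8 + [[0, 1, 0, 1, 1]] * 2
print(least_squares_clustering(draws))  # prints [0 0 0 1 1]
```

Because the summary is itself one of the sampled partitions, it is always a valid clustering even though $G$ varied across draws.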

After controlling for covariates believed to best describe the variation in the physical activity levels of females, our method finds that there is a small subset of the females who are much more active than the remainder of the sample. Every subject in the more active group has fitted trajectories above the recommended 30 minutes of exercise. Most of the population does not get the recommended allowance of daily physical activity and this is well-supported in our analysis. All but two subjects in the less active group have fitted trajectories that never pass the recommended 30 minutes of exercise. The random effects from this model better fit a normal distribution (not centered at 0) for each of the two groups and do not show as much heteroscedasticity over time as the one group model depicted in Figure 1.

Given that these differences are observed even after controlling for the aforementioned variables, we would like to further examine the characteristics that may set these highly active females apart from the rest of the girls in our sample. To do this, we look at a number of other covariates that were either excluded during the variable selection process or were not measured at all time points. We use simple Wilcoxon tests on the available time points of the additional variables and on all time points for covariates we adjusted for in the initial model.
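A single such two-group Wilcoxon rank-sum comparison can be sketched with SciPy's `ranksums`; the BMI values below are entirely synthetic, hypothetical stand-ins for the study variables:

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(3)
# hypothetical BMI values: 24 highly active girls vs. the remaining 404
bmi_high_active = rng.normal(21, 2, 24)
bmi_rest = rng.normal(24, 3, 404)

# Wilcoxon rank-sum test for a location difference between the two groups
stat, p = ranksums(bmi_high_active, bmi_rest)
print(p < 0.05)  # prints True for this synthetic shift
```

In the actual analysis, one such test would be run per variable and per available time point.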

We first note that the median BMI of the subset of highly active girls is significantly lower than that of the remaining girls consistently at each TAAG wave. Similarly, mother's education level is also consistently significant at each time point. These values are measured at each time point to reflect changes as the mother pursues additional education, or as the girls become more aware of their mother's education. The majority of the highly active girls have mothers who have completed college or higher (75% or more at each time point), whereas the remainder of the sample has mothers with a range of education levels (less than high school through college or more). The number of parks within a one-mile radius of the home is significantly different between the high and low groups in the middle school and high school years, when the girls are likely to be living at home. This variable may be an indicator of socioeconomic status, as families with more money may live in neighborhoods nearer to parks. Finally, in the high school and college-aged years, self-management strategies are rated significantly higher among the highly active girls than in the remainder of the population.

In high school, the subset of highly active girls tend to have better self-described health, participate in more sports teams, have access to more physical education classes, and were older at the time of their first menstrual period. At the college age, these girls still have higher self-described health; in addition, the global physical activity score and self-esteem scores are now significantly higher in the subset of highly active females.

# **4 Discussion**

We extended the mixed models of [1] with the application still focused on the same 428 girls from the TAAG, TAAG 2, and TAAG 3 studies. Within the Bayesian linear mixed model, we implemented a clustering procedure aimed at clustering girls into groups based on deviations from the adjusted physical activity levels. These groups reflected the tendency for small subsets of females to be highly active. Not surprisingly, only 24 girls (5% of our sample) were classified as highly active.

This group of highly active girls differs in several ways. These girls are more active, and thus we expect that the age at first menstrual period will be higher. We may also expect that the highly active girls are involved in more sports teams and that they will have higher global physical activity scores. Other interesting characteristics of these girls, however, are their increased self-management strategies, self-esteem scores, and self-described health. This may suggest that interventions focusing on time management and emphasizing self-efficacy could impact adolescent female physical activity levels. In doing so, we could also aim to increase self-esteem and self-described health.

The ability to account for heterogeneity in the subject-specific deviations from an adjusted model allows us to keep the outcome on the original scale while still improving model assumptions. Our approach estimates model parameters while identifying groups of observations with differing activity levels. In contrast, a frequentist approach could be taken using an EM algorithm; however, we would lose the ability to draw statistical inference on the appropriate number of groups from the data and to incorporate posterior samples with different numbers of groups into the estimated class labels.

The current analysis looks only at identifying groups based on deviations from the overall adjusted minutes of MVPA for the females. A natural extension would be to cluster on the slope for time, to begin to understand the various patterns we observe among adolescent females over time. Furthermore, we may want to incorporate a variable selection procedure into the fixed portion of the model. The groups we find by clustering on subject-specific intercepts and/or slopes would be sensitive to the covariates selected, depending on the variability captured by this fixed portion of the model. Physical activity, like most human behavior, varies widely for a multitude of reasons, many of which we may not think to measure or are unable to measure. Identifying groups when a traditional mixed model constructed using standard variable selection methods suggests lack of fit can be a useful step towards better understanding differences through post-hoc analyses of the groups' characteristics.

**Acknowledgements** Research reported in this publication was supported by the National Institutes of Health (NIH) under award numbers T32ES007271 and R01HL119058. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

# **References**



# **Unsupervised Classification of Categorical Time Series Through Innovative Distances**

Ángel López-Oriona, José A. Vilar, and Pierpaolo D'Urso

**Abstract** In this paper, two novel distances for nominal time series are introduced. Both of them are based on features describing the serial dependence patterns between each pair of categories. The first dissimilarity employs the so-called association measures, whereas the second computes correlation quantities between indicator processes whose uniqueness is guaranteed under standard stationarity conditions. The metrics are used to construct crisp algorithms for clustering categorical series. The approaches are able to group series generated from similar underlying stochastic processes, achieve accurate results with series coming from a broad range of models, and are computationally efficient. An extensive simulation study shows that the devised clustering algorithms outperform several alternative procedures proposed in the literature. Specifically, they achieve better results than approaches based on maximum likelihood estimation, which take advantage of knowing the true underlying processes. Both innovative dissimilarities could be useful for practitioners in the field of time series clustering.

**Keywords:** categorical time series, clustering, association measures, indicator processes

# **1 Introduction**

Clustering of time series concerns the challenge of splitting a set of unlabeled time series into homogeneous groups, which is a pivotal problem in many knowledge discovery tasks [1]. Categorical time series (CTS) are a particular class of time series exhibiting a qualitative range which consists of a finite number of categories. Most of the classical statistical tools used for real-valued time series (e.g., the autocorrelation function) are not useful in the categorical case, so measures different from the standard ones are needed for a proper analysis of CTS. CTS

Ángel López-Oriona (✉), José A. Vilar
Research Group MODES, Research Center for Information and Communication Technologies (CITIC), University of A Coruña, Spain,
e-mail: oriona38@hotmail.com;jose.vilarf@udc.es

Pierpaolo D'Urso
Department of Social Sciences and Economics, Sapienza University of Rome, Italy, e-mail: pierpaolo.durso@uniroma1.it

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_26

arise in an extensive assortment of fields [2, 3, 7, 8, 9]. Since only a few works have addressed the problem of CTS clustering [4, 5], the main goal of this paper is to introduce novel clustering algorithms for CTS.

# **2 Two Novel Feature-based Approaches for Categorical Time Series Clustering**

Consider a set of $s$ categorical time series $\mathcal{S} = \{X_t^{(1)}, \ldots, X_t^{(s)}\}$, where the $j$-th element $X_t^{(j)}$ is a $T_j$-length partial realization from a categorical stochastic process $(X_t)_{t \in \mathbb{Z}}$ taking values on a number $r$ of unordered qualitative categories, which are coded from 1 to $r$ so that the range of the process can be seen as $\mathcal{V} = \{1, \ldots, r\}$. We suppose that the process $(X_t)_{t \in \mathbb{Z}}$ is bivariate stationary, i.e., the pairwise joint distribution of $(X_{t-k}, X_t)$ is invariant in $t$. Our goal is to perform clustering on the elements of $\mathcal{S}$ in such a way that series assumed to be generated from identical stochastic processes are placed together. To that aim, we propose two distance metrics based on feature extraction.

#### **2.1 Descriptive Features for Categorical Processes**

Let $\{X_t, t \in \mathbb{Z}\}$ be a bivariate stationary categorical stochastic process with range $\mathcal{V} = \{1, \ldots, r\}$. Denote by $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_r)$ the marginal distribution of $X_t$, that is, $P(X_t = j) = \pi_j > 0$, $j = 1, \ldots, r$. For fixed $l \in \mathbb{N}$, we use the notation $p_{ij}(l) = P(X_t = i, X_{t-l} = j)$, with $i, j \in \mathcal{V}$, for the lagged bivariate probability, and the notation $p_{i|j}(l) = P(X_t = i \mid X_{t-l} = j) = p_{ij}(l)/\pi_j$ for the conditional bivariate probability.

To extract suitable features characterizing the serial dependence of a given CTS, we start by defining the concepts of perfect serial independence and dependence for a categorical process. We have perfect serial independence at lag $l \in \mathbb{N}$ if and only if $p_{ij}(l) = \pi_i \pi_j$ for any $i, j \in \mathcal{V}$. On the other hand, we have perfect serial dependence at lag $l \in \mathbb{N}$ if and only if the conditional distribution $p_{\cdot|j}(l)$ is a one-point distribution for any $j \in \mathcal{V}$. There are several association measures which describe the serial dependence structure of a categorical process at lag $l$. One such measure is the so-called Cramer's $v$, which is defined as

$$v(l) = \sqrt{\frac{1}{r-1} \sum\_{i,j=1}^{r} \frac{(p\_{ij}(l) - \pi\_i \pi\_j)^2}{\pi\_i \pi\_j}}. \tag{1}$$

Cramer's $v$ summarizes the serial dependence patterns of a categorical process over every pair $(i, j)$ at lag $l \in \mathbb{N}$. However, this quantity is not appropriate for characterizing a given stochastic process, since two different processes can have the same value of $v(l)$. A better way to characterize the process $X_t$ is by considering the matrix $\boldsymbol{V}(l) = \left(V_{ij}(l)\right)_{1 \le i,j \le r}$, where $V_{ij}(l) = \frac{(p_{ij}(l) - \pi_i \pi_j)^2}{\pi_i \pi_j}$. The elements of the matrix $\boldsymbol{V}(l)$ give information about the so-called *unsigned* dependence of the process. However, it is often useful to know whether a process tends to stay in the state it has reached or, on the contrary, the repetition of the same state after $l$ steps is infrequent. This motivates the concept of *signed* dependence, which arises as an analogy of the autocorrelation function of a numerical process, since such a quantity can take either positive or negative values. Provided that perfect serial dependence holds, we have perfect *positive* (*negative*) serial dependence if $p_{i|i}(l) = 1$ ($p_{i|i}(l) = 0$) for all $i \in \mathcal{V}$.

Since $\boldsymbol{V}(l)$ does not shed light on the signed dependence structure, it is valuable to complement the information contained in $\boldsymbol{V}(l)$ with features describing signed dependence. In this regard, a common measure of signed serial dependence at lag $l$ is Cohen's $\kappa$, which takes the form

$$\kappa(l) = \frac{\sum\_{j=1}^{r} (p\_{jj}(l) - \pi\_j^2)}{1 - \sum\_{j=1}^{r} \pi\_j^2}. \tag{2}$$

Proceeding as with $v(l)$, the quantity $\kappa(l)$ can be decomposed in order to obtain a complete representation of the signed dependence pattern of the process. In this way, we consider the vector $\mathcal{K}(l) = (\mathcal{K}_1(l), \ldots, \mathcal{K}_r(l))$, where each $\mathcal{K}_i(l)$ is defined as

$$\mathcal{K}\_{i}(l) = \frac{p\_{ii}(l) - \pi\_i^2}{1 - \sum\_{j=1}^r \pi\_j^2},\tag{3}$$

$i = 1, \ldots, r$.

In practice, the matrix $\boldsymbol{V}(l)$ and the vector $\mathcal{K}(l)$ must be estimated from a $T$-length realization of the process, $\{X_1, \ldots, X_T\}$. To this aim, we consider the estimators $\widehat{\pi}_i$ of $\pi_i$ and $\widehat{p}_{ij}(l)$ of $p_{ij}(l)$, defined as $\widehat{\pi}_i = \frac{N_i}{T}$ and $\widehat{p}_{ij}(l) = \frac{N_{ij}(l)}{T-l}$, where $N_i$ is the number of variables $X_t$ equal to $i$ in the realization $\{X_1, \ldots, X_T\}$, and $N_{ij}(l)$ is the number of pairs $(X_t, X_{t-l}) = (i, j)$ in the realization $\{X_1, \ldots, X_T\}$. Hence, estimates $\widehat{\boldsymbol{V}}(l)$ of $\boldsymbol{V}(l)$ and $\widehat{\mathcal{K}}(l)$ of $\mathcal{K}(l)$ can be obtained by plugging the estimates $\widehat{\pi}_i$ and $\widehat{p}_{ij}(l)$ into (2) and (3), respectively. This leads directly to estimates of $v(l)$ and $\kappa(l)$, denoted by $\widehat{v}(l)$ and $\widehat{\kappa}(l)$.
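A minimal sketch of these plug-in estimators follows (states are coded $0, \ldots, r-1$ rather than $1, \ldots, r$, and the i.i.d. test series is synthetic, so both dependence measures should come out near zero):

```python
import numpy as np

def cts_features(x, l, r):
    """Estimate pi_i, p_ij(l), Cramer's v(l), and Cohen's kappa(l)
    from a categorical series x with states coded 0..r-1."""
    T = len(x)
    pi = np.bincount(x, minlength=r) / T          # pi_hat_i = N_i / T
    p = np.zeros((r, r))
    for t in range(l, T):                         # count pairs (X_t, X_{t-l})
        p[x[t], x[t - l]] += 1.0
    p /= (T - l)                                  # p_hat_ij(l) = N_ij(l) / (T - l)
    outer = np.outer(pi, pi)
    v = np.sqrt(((p - outer) ** 2 / outer).sum() / (r - 1))            # Cramer's v
    kappa = (np.trace(p) - (pi ** 2).sum()) / (1 - (pi ** 2).sum())    # Cohen's kappa
    return pi, p, v, kappa

rng = np.random.default_rng(0)
x = rng.integers(0, 3, 5000)        # i.i.d. series: no serial dependence
pi, p, v, kappa = cts_features(x, 1, 3)
print(round(v, 2), round(kappa, 2))  # both near 0 for an i.i.d. series
```

The matrix $\widehat{\boldsymbol{V}}(l)$ is simply `(p - outer) ** 2 / outer` before summation, and $\widehat{\mathcal{K}}(l)$ is the per-state decomposition of `kappa`.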

An alternative way of describing the dependence structure of the process $\{X_t, t \in \mathbb{Z}\}$ is to consider its equivalent representation as a multivariate binary process. The so-called *binarization* of $\{X_t, t \in \mathbb{Z}\}$ is constructed as follows. Let $\boldsymbol{e}_1, \ldots, \boldsymbol{e}_r \in \{0,1\}^r$ be unit vectors such that $\boldsymbol{e}_k$ has all its entries equal to zero except for a one in the $k$-th position, $k = 1, \ldots, r$. Then, the binary representation of $\{X_t, t \in \mathbb{Z}\}$ is given by the process $\{\boldsymbol{Y}_t = (Y_{t,1}, \ldots, Y_{t,r})^\top, t \in \mathbb{Z}\}$ such that $\boldsymbol{Y}_t = \boldsymbol{e}_j$ if $X_t = j$. For fixed $l \in \mathbb{N}$ and $i, j \in \mathcal{V}$, consider the correlation $\phi_{ij}(l) = Corr(Y_{t,i}, Y_{t-l,j})$, which measures the linear dependence between the $i$-th and $j$-th categories with respect to lag $l$. The following proposition provides some properties of the quantity $\phi_{ij}(l)$.

#### **Proposition 1**

Let $\{X_t, t \in \mathbb{Z}\}$ be a bivariate stationary categorical process with range $\mathcal{V} = \{1, \ldots, r\}$. Then the following properties hold:

1. For every $i, j \in \mathcal{V}$, the function $\phi_{ij} : \mathbb{N} \to [-1, 1]$ given by $l \mapsto \phi_{ij}(l) = Corr(Y_{t,i}, Y_{t-l,j})$ is well-defined.
2. $\phi_{ij}(l) = 0 \Leftrightarrow p_{ij}(l) = \pi_i \pi_j$.
3. $\phi_{ij}(l) = \pm 1 \Leftrightarrow p_{ij}(l) = \pm\sqrt{\pi_i(1-\pi_i)\pi_j(1-\pi_j)} + \pi_i \pi_j$.
4. $\phi_{ij}(l) = \sqrt{\frac{\pi_j(1-\pi_i)}{\pi_i(1-\pi_j)}} \Leftrightarrow p_{i|j}(l) = 1$.

The proof of Proposition 1 is straightforward and is omitted for the sake of brevity. According to Proposition 1, the quantity $\phi_{ij}(l)$ can be used to describe both types of dependence, signed and unsigned, within the underlying process. In the case of perfect unsigned independence at lag $l$, we have $p_{ij}(l) = \pi_i \pi_j$ for all $i, j \in V$, so that $\phi_{ij}(l) = 0$ for all $i, j \in V$, in accordance with Property 2 of Proposition 1. Under perfect positive dependence at lag $l$, $p_{i|i}(l) = 1$ for all $i \in V$, so $\phi_{ii}(l) = 1$ for all $i \in V$ by Property 4 of Proposition 1. The same property allows us to conclude that $\phi_{ii}(l) = -\pi_i/(1-\pi_i)$ for all $i \in V$ in the case of perfect negative dependence. In sum, $\phi_{ij}(l)$ evaluates unsigned dependence when $i \neq j$ and signed dependence when $i = j$. The previous quantities can be collected in a matrix $\boldsymbol{\Phi}(l) = (\phi_{ij}(l))_{1 \le i, j \le r}$, which can be directly estimated by $\widehat{\boldsymbol{\Phi}}(l) = (\widehat{\phi}_{ij}(l))_{1 \le i, j \le r}$, where each entry is computed as
$$\widehat{\phi}_{ij}(l) = \frac{\widehat{p}_{ij}(l) - \widehat{\pi}_i \widehat{\pi}_j}{\sqrt{\widehat{\pi}_i(1-\widehat{\pi}_i)\,\widehat{\pi}_j(1-\widehat{\pi}_j)}},$$
an expression derived in the proof of Proposition 1.
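The estimate $\widehat{\boldsymbol{\Phi}}(l)$ can be sketched in a few lines (an illustration under the same 0-based category encoding as before; numpy assumed):

```python
# Estimate Phi_hat(l): phi_hat_ij(l) = (p_hat_ij(l) - pi_i*pi_j)
#                      / sqrt(pi_i*(1-pi_i)*pi_j*(1-pi_j)).
import numpy as np

def phi_matrix(x, r, l):
    """Estimated lag-l correlation matrix of the binarized series."""
    x = np.asarray(x)
    T = len(x)
    pi = np.bincount(x, minlength=r) / T
    p = np.zeros((r, r))
    for t in range(l, T):
        p[x[t], x[t - l]] += 1.0
    p /= (T - l)
    denom = np.sqrt(np.outer(pi * (1 - pi), pi * (1 - pi)))
    return (p - np.outer(pi, pi)) / denom
```

For the strictly alternating series 0, 1, 0, 1, … the diagonal entries at lag 1 are close to $-\pi_i/(1-\pi_i) = -1$, the perfect-negative-dependence value predicted by Proposition 1.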

#### **2.2 Two Innovative Dissimilarities Between CTS**

In this section we introduce two dissimilarity measures between categorical series based on the features described above. Suppose we have a pair of CTS $X_t^{(1)}$ and $X_t^{(2)}$, and consider a set of $L$ lags, $\mathcal{L} = \{l_1, \ldots, l_L\}$. A dissimilarity based on Cramér's $v$ and Cohen's $\kappa$, denoted $d_{CC}$, is defined as

$$d_{CC}(X_t^{(1)}, X_t^{(2)}) = \sum_{k=1}^{L} \left[ \left\| vec\left(\widehat{\boldsymbol{V}}(l_k)^{(1)} - \widehat{\boldsymbol{V}}(l_k)^{(2)}\right) \right\|^2 + \left\| \widehat{\boldsymbol{K}}(l_k)^{(1)} - \widehat{\boldsymbol{K}}(l_k)^{(2)} \right\|^2 \right] + \left\| \widehat{\boldsymbol{\pi}}^{(1)} - \widehat{\boldsymbol{\pi}}^{(2)} \right\|^2,$$

where the superscripts $(1)$ and $(2)$ indicate that the corresponding estimates are obtained from the realizations $X_t^{(1)}$ and $X_t^{(2)}$, respectively.

An alternative distance measure relying on the binarization of the processes, denoted $d_B$, is defined as

$$d_B(X_t^{(1)}, X_t^{(2)}) = \sum_{k=1}^{L} \left\| vec\left(\widehat{\boldsymbol{\Phi}}(l_k)^{(1)} - \widehat{\boldsymbol{\Phi}}(l_k)^{(2)}\right) \right\|^2 + \left\| \widehat{\boldsymbol{\pi}}^{(1)} - \widehat{\boldsymbol{\pi}}^{(2)} \right\|^2.$$
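Since $d_B$ is fully determined by the estimates above, it can be sketched directly from its definition (an illustration, not the authors' code; 0-based categories and numpy assumed):

```python
# d_B between two categorical series: sum over lags of the squared Frobenius
# distance between the estimated Phi matrices, plus the squared distance
# between the estimated marginal distributions.
import numpy as np

def _pi(x, r):
    return np.bincount(np.asarray(x), minlength=r) / len(x)

def _phi(x, r, l):
    x = np.asarray(x)
    T = len(x)
    pi = _pi(x, r)
    p = np.zeros((r, r))
    for t in range(l, T):
        p[x[t], x[t - l]] += 1.0
    p /= (T - l)
    denom = np.sqrt(np.outer(pi * (1 - pi), pi * (1 - pi)))
    return (p - np.outer(pi, pi)) / denom

def d_B(x1, x2, r, lags):
    dist = sum(np.sum((_phi(x1, r, l) - _phi(x2, r, l)) ** 2) for l in lags)
    return dist + np.sum((_pi(x1, r) - _pi(x2, r)) ** 2)
```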

For a given set of categorical series, the distances $d_{CC}$ and $d_B$ can be used as input to traditional clustering algorithms. In this manuscript we consider the *Partitioning Around Medoids* (PAM) algorithm.
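Given a precomputed dissimilarity matrix (filled, e.g., with $d_{CC}$ or $d_B$ values), a minimal PAM sketch looks as follows (our illustration, not the implementation used in the paper; numpy assumed):

```python
# A basic PAM (k-medoids) loop: assign each item to its nearest medoid, then
# replace each medoid by the cluster member minimizing total within-cluster
# dissimilarity, until the medoid set stabilizes.
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """Partition n items given an n x n dissimilarity matrix D.
    Returns (medoid indices, label array)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # member with the smallest total dissimilarity to its cluster
                new_medoids[c] = members[
                    np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```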

# **3 Partitioning Around Medoids Clustering of CTS**

In this section we examine the performance of both metrics $d_{CC}$ and $d_B$ in the context of hard clustering (i.e., each series is assigned to exactly one cluster) of CTS through a simulation study.

#### **3.1 Experimental Design**

The simulated scenarios encompass a broad variety of generating processes. In particular, three setups were considered, namely clustering of (i) Markov Chains (MC), (ii) Hidden Markov Models (HMM) and (iii) New Discrete ARMA (NDARMA) processes. The generating models with respect to each class of processes are given below.

**Scenario 1**. Clustering of MC. Consider four three-state MC, denoted MC$_1$, MC$_2$, MC$_3$ and MC$_4$, with respective transition matrices $\boldsymbol{P}_1^1$, $\boldsymbol{P}_2^1$, $\boldsymbol{P}_3^1$ and $\boldsymbol{P}_4^1$ given by

$$\begin{aligned} \boldsymbol{P}_1^1 &= Mat^3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2), \\ \boldsymbol{P}_2^1 &= Mat^3(0.1, 0.8, 0.1, 0.6, 0.3, 0.1, 0.6, 0.2, 0.2), \\ \boldsymbol{P}_3^1 &= Mat^3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05), \\ \boldsymbol{P}_4^1 &= Mat^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3), \end{aligned}$$

where the operator $Mat^k$, $k \in \mathbb{N}$, transforms a vector into a square matrix of order $k$ by sequentially placing the corresponding numbers by rows.
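The $Mat^k$ operator and the simulation of a CTS from one of these chains can be sketched as follows (an illustration of the data-generating step, not the authors' code; the initial state is drawn uniformly, an assumption on our part, and numpy is assumed):

```python
# Mat^k operator plus a simple Markov-chain simulator.
import numpy as np

def mat_k(v, k):
    """Place the entries of v row by row into a k x k matrix."""
    return np.asarray(v, dtype=float).reshape(k, k)

def simulate_mc(P, T, seed=0):
    """Simulate T steps of a Markov chain with transition matrix P;
    the initial state is drawn uniformly (our assumption)."""
    rng = np.random.default_rng(seed)
    r = P.shape[0]
    x = np.empty(T, dtype=int)
    x[0] = rng.integers(r)
    for t in range(1, T):
        x[t] = rng.choice(r, p=P[x[t - 1]])
    return x

P11 = mat_k([0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2], 3)
```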

**Scenario 2**. Clustering of HMM. Consider the bivariate process $(X_t, Q_t)_{t \in \mathbb{Z}}$, where $Q_t$ stands for the hidden states and $X_t$ for the observable random variables. The process $(Q_t)_{t \in \mathbb{Z}}$ constitutes a homogeneous MC. Both $(X_t)_{t \in \mathbb{Z}}$ and $(Q_t)_{t \in \mathbb{Z}}$ are assumed to be count processes with range $\{1, \ldots, r\}$, and $(X_t, Q_t)_{t \in \mathbb{Z}}$ is assumed to satisfy the three classical assumptions of a HMM. Based on these considerations, let HMM$_1$, HMM$_2$, HMM$_3$ and HMM$_4$ be four three-state HMM with respective transition matrices $\boldsymbol{P}_1^2$, $\boldsymbol{P}_2^2$, $\boldsymbol{P}_3^2$ and $\boldsymbol{P}_4^2$ and emission matrices $\boldsymbol{E}_1^2$, $\boldsymbol{E}_2^2$, $\boldsymbol{E}_3^2$ and $\boldsymbol{E}_4^2$ given by

$$\begin{aligned} \boldsymbol{P}_1^2 &= Mat^3(0.05, 0.90, 0.05, 0.05, 0.05, 0.90, 0.90, 0.05, 0.05), \quad \boldsymbol{P}_2^2 = \boldsymbol{P}_1^2, \\ \boldsymbol{P}_3^2 &= Mat^3(0.1, 0.7, 0.2, 0.4, 0.4, 0.2, 0.4, 0.3, 0.3), \\ \boldsymbol{P}_4^2 &= Mat^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3), \quad \boldsymbol{E}_1^2 = \boldsymbol{P}_1^2, \\ \boldsymbol{E}_2^2 &= Mat^3(0.1, 0.8, 0.1, 0.5, 0.4, 0.1, 0.6, 0.2, 0.2), \quad \boldsymbol{E}_3^2 = \boldsymbol{E}_2^2, \\ \boldsymbol{E}_4^2 &= Mat^3(1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3, 1/3). \end{aligned}$$

**Scenario 3**. Clustering of NDARMA processes. Let $(X_t)_{t \in \mathbb{Z}}$ and $(\epsilon_t)_{t \in \mathbb{Z}}$ be two count processes with range $\{1, \ldots, r\}$ following the equation

$$X_t = \alpha_{t,1} X_{t-1} + \dots + \alpha_{t,p} X_{t-p} + \beta_{t,0} \epsilon_t + \dots + \beta_{t,q} \epsilon_{t-q},$$

where $(\epsilon_t)_{t \in \mathbb{Z}}$ is i.i.d. with $P(\epsilon_t = i) = \pi_i$, independent of $(X_s)_{s<t}$, and the i.i.d. multinomial random vectors

$$(\alpha_{t,1}, \ldots, \alpha_{t,p}, \beta_{t,0}, \ldots, \beta_{t,q}) \sim \text{MULT}(1; \phi_1, \ldots, \phi_p, \varphi_0, \ldots, \varphi_q),$$

are independent of $(\epsilon_t)_{t \in \mathbb{Z}}$ and $(X_s)_{s<t}$. The considered models are three three-state NDARMA(2,0) processes and one three-state NDARMA(1,0) process with marginal distribution $\boldsymbol{\pi}^3 = (2/3, 1/6, 1/6)$, and corresponding probabilities in the multinomial distribution given by

$$\begin{aligned} (\phi_1, \phi_2, \varphi_0)_1^3 &= (0.7, 0.2, 0.1), & (\phi_1, \phi_2, \varphi_0)_2^3 &= (0.1, 0.45, 0.45), \\ (\phi_1, \phi_2, \varphi_0)_3^3 &= (0.5, 0.25, 0.25), & (\phi_1, \varphi_0)_4^3 &= (0.2, 0.8). \end{aligned}$$
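The multinomial mechanism above means that, at each step, $X_t$ copies one of its $p$ lagged values or takes a fresh innovation, with probabilities $(\phi_1, \ldots, \phi_p, \varphi_0)$. A simulation sketch for the NDARMA($p$, 0) case used here (our illustration; the first $p$ values are initialized i.i.d. from $\boldsymbol{\pi}$, an assumption on our part; numpy assumed):

```python
# Simulate an NDARMA(p, 0) process: X_t equals X_{t-1-c} with probability
# phi_{c+1}, or an i.i.d. innovation with probability varphi_0.
import numpy as np

def simulate_ndarma(pi, probs, T, seed=0):
    """pi: innovation distribution; probs: (phi_1, ..., phi_p, varphi_0)."""
    rng = np.random.default_rng(seed)
    p = len(probs) - 1
    r = len(pi)
    x = list(rng.choice(r, size=p, p=pi))   # i.i.d. initialization (assumed)
    for t in range(p, T):
        choice = rng.choice(p + 1, p=probs)
        if choice < p:
            x.append(x[t - 1 - choice])     # copy the (choice+1)-lagged value
        else:
            x.append(int(rng.choice(r, p=pi)))  # fresh innovation
    return np.array(x)

x = simulate_ndarma([2/3, 1/6, 1/6], [0.7, 0.2, 0.1], 200)
```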

The simulation study was carried out as follows. For each scenario, 5 CTS of length $T \in \{200, 600\}$ were generated from each process, which allowed us to run the clustering algorithms for both series lengths and thus analyze the impact of $T$. The clustering solution produced by each considered algorithm was stored. The simulation procedure was repeated 500 times for each scenario and value of $T$. The computation of $d_{CC}$ and $d_B$ was carried out with $\mathcal{L} = \{1\}$ in Scenarios 1 and 2, and $\mathcal{L} = \{1, 2\}$ in Scenario 3. In this way, the distances were adapted to the maximum number of significant lags in each setting.

#### **3.2 Alternative Metrics and Assessment Criteria**

To better assess the performance of the metrics $d_{CC}$ and $d_B$, we also obtained partitions with alternative techniques for clustering of categorical series. The considered procedures are described below.


Note that the approach based on the distance $d_{MLE}$ can be seen as a strict benchmark in the evaluation task. The effectiveness of the clustering approaches was assessed by comparing the clustering solution produced by the algorithms with the true clustering partition, the so-called ground truth. The latter consisted of $C = 4$ clusters in all scenarios, each group including the five CTS generated from the same process. The value $C = 4$ was provided as an input parameter to the PAM algorithm in the case of $d_{CC}$, $d_B$, $d_{MLE}$ and $d_{MV}$. As for the approach $d_{CZ}$, 4 components were considered for the mixture model. Experimental and true partitions were compared by means of three well-known external clustering quality indexes: the Adjusted Rand Index (ARI), the Jaccard Index (JI) and the Fowlkes-Mallows Index (FMI).

#### **3.3 Results and Discussion**

Average values of the quality indexes by taking into account the 500 simulation trials are given in Tables 1, 2 and 3 for Scenarios 1, 2 and 3, respectively.


**Table 1** Average results for Scenario 1.

The results in Table 1 indicate that the dissimilarity $d_{CC}$ performs best when dealing with MC, outperforming the MLE-based metric $d_{MLE}$. The distance $d_B$ is also superior to $d_{MLE}$. The measure $d_{CZ}$ attains results similar to those of $d_{CC}$ in Scenario 1, especially for $T = 600$. The good performance of $d_{CZ}$ was expected, since the assumption of first-order Markov models underlying this metric is fulfilled in Scenario 1. Table 2 shows a completely different picture, indicating that the metrics $d_{CC}$ and $d_B$ are significantly more effective than the remaining dissimilarities. Finally, the quantities in Table 3 reveal that the model-based distance $d_{MLE}$ attains the best results when $T = 200$, but is outperformed by $d_B$ when


**Table 2** Average results for Scenario 2.

**Table 3** Average results for Scenario 3.


$T = 600$. The metric $d_{CZ}$ again suffers from model misspecification. In summary, the numerical experiments carried out throughout this section show the excellent ability of both measures $d_{CC}$ and $d_B$ to discriminate between a broad variety of categorical processes. Specifically, these metrics either outperform or behave comparably to distances based on estimated model coefficients, which have the advantage of knowing the true underlying models.

It is worth highlighting that the methods proposed in this paper could have promising applications in fields such as the clustering of genetic data sequences.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Fuzzy Clustering by Hyperbolic Smoothing**

David Masís, Esteban Segura, Javier Trejos, and Adilson Xavier

**Abstract** We propose a novel method for building fuzzy clusters of large data sets, using a smoothing numerical approach. The usual sum-of-squares criterion is relaxed so that the search for good fuzzy partitions is made in a continuous space, rather than the combinatorial space of classical methods [8]. The smoothing converts a strongly non-differentiable problem into a sequence of differentiable, low-dimensional, unconstrained optimization subproblems, by means of an infinitely differentiable function. For the implementation of the algorithm we used the statistical software R, and the results obtained were compared to the traditional fuzzy $C$-means method proposed by Bezdek [1].

**Keywords:** clustering, fuzzy sets, numerical smoothing

# **1 Introduction**

Methods for making groups from data sets are usually based on the idea of disjoint sets, such as the classical crisp clustering. The most well known are hierarchical and 𝑘-means [8], whose resulting clusters are sets with no intersection. However, this restriction may not be natural for some applications, where the condition for

David Masís
Costa Rica Institute of Technology, Cartago, Costa Rica, e-mail: dmasis@itcr.ac.cr

Esteban Segura
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica, e-mail: esteban.seguraugalde@ucr.ac.cr

Javier Trejos (✉)
CIMPA & School of Mathematics, University of Costa Rica, San José, Costa Rica, e-mail: javier.trejos@ucr.ac.cr

Adilson E. Xavier
Universidade Federal do Rio de Janeiro, Brazil, e-mail: adilson.xavier@gmail.com

© The Author(s) 2023 243 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_27

some objects may be to belong to two or more clusters, rather than only one. Several methods for constructing overlapping clusters have been proposed in the literature [4, 5, 8]. Since Zadeh introduced the concept of fuzzy sets [17], the principle of belonging to several clusters has been expressed as a degree of membership in those clusters. In this direction, Bezdek [1] introduced a fuzzy clustering method that became very popular, since it solved both the representation of clusters by centroids and the assignment of objects to clusters through the minimization of a well-stated numerical criterion. Several fuzzy clustering methods have since been proposed in the literature; a survey can be found in [16].

In this paper we propose a new fuzzy clustering method based on the numerical principle of hyperbolic smoothing [15]. The fuzzy $C$-means method is presented in Section 2 and our proposed hyperbolic smoothing fuzzy clustering method in Section 3. Comparative results between the two methods are presented in Section 4. Finally, Section 5 is devoted to concluding remarks.

# **2 Fuzzy Clustering**

The most well-known method for fuzzy clustering is Bezdek's original $C$-means method [1]. It is based on the same principles as $k$-means or dynamical clusters [2], that is, iterations over two main steps: i) class representation by the optimization of a numerical criterion, and ii) assignment of each object to the closest class representative in order to construct clusters; these iterations are repeated until convergence to a local minimum of the overall quality criterion.

Let us introduce the notation and the numerical criterion to be optimized. Let **X** be an $n \times p$ data matrix containing $p$ numerical observations on $n$ objects. We look for a $K \times p$ matrix **G** whose rows represent the centroids of $K$ clusters of the $n$ objects and an $n \times K$ membership matrix **U** with elements $\mu_{ik} \in [0, 1]$, such that the following criterion is minimized:

$$\begin{aligned} W(\mathbf{X}, \mathbf{U}, \mathbf{G}) &= \sum_{i=1}^{n} \sum_{k=1}^{K} \left( \mu_{ik} \right)^{m} \left\| \mathbf{x}_{i} - \mathbf{g}_{k} \right\|^{2} \\ \text{subject to } & \sum_{k=1}^{K} \mu_{ik} = 1, \text{ for all } i \in \{1, 2, \dots, n\}, \\ & 0 < \sum_{i=1}^{n} \mu_{ik} < n, \text{ for all } k \in \{1, 2, \dots, K\}, \end{aligned} \tag{1}$$

where $\mathbf{x}_i$ is the $i$-th row of **X** and $\mathbf{g}_k$ is the $k$-th row of **G**, representing in $\mathbb{R}^p$ the centroid of the $k$-th cluster.

The parameter $m > 1$ in (1) controls the fuzziness of the clusters. According to the literature [16], it is usual to take $m = 2$: as $m$ approaches 1, the memberships tend to the usual crisp partitions, as in $k$-means, while larger values of $m$ yield increasingly fuzzy memberships. We also assume that the number of clusters, $K$, is fixed.

Minimization of (1) is a nonlinear optimization problem with constraints, which can be solved using Lagrange multipliers as presented in [1]. The solution for each row of the centroid matrix, given a membership matrix **U**, is:


$$\mathbf{g}\_k = \sum\_{i=1}^n (\mu\_{ik})^m \mathbf{x}\_i \left/ \sum\_{i=1}^n (\mu\_{ik})^m \right. \tag{2}$$

The solution for the membership matrix, given a centroid matrix **G**, is [1]:

$$\mu\_{ik} = \left[ \sum\_{j=1}^{K} \left( \frac{||\mathbf{x}\_i - \mathbf{g}\_k||^2}{||\mathbf{x}\_i - \mathbf{g}\_j||^2} \right)^{1/(m-1)} \right]^{-1} \,\tag{3}$$

The following pseudo-code shows the main steps of Bezdek's fuzzy $C$-means method [1].

#### **Bezdek's Fuzzy c-Means (FCM) Algorithm**


Fuzzy 𝐶-Means method starts from an initial partition that is improved in each iteration, according to (1), applying Steps 2 and 3 of the algorithm. It is clear that this procedure may lead to local optima of (1) since iterative improvement in (2) and (3) is made by a local search strategy.
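The alternating updates (2) and (3) can be sketched as follows (a minimal illustration with $m = 2$ and random initial memberships, not a full-featured replacement for package implementations such as *fclust*; numpy assumed):

```python
# Fuzzy C-means: alternate centroid update (2) and membership update (3)
# until the memberships stabilize.
import numpy as np

def fcm(X, K, m=2.0, n_iter=200, tol=1e-7, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, K))
    U /= U.sum(axis=1, keepdims=True)              # rows of U sum to one
    G = None
    for _ in range(n_iter):
        Um = U ** m
        G = (Um.T @ X) / Um.sum(axis=0)[:, None]   # centroid update (2)
        d2 = ((X[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)                 # guard exact zeros
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)   # membership update (3)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, G
```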

# **3 Algorithm for Hyperbolic Smoothing Fuzzy Clustering**

For the problem of clustering the $n$ rows of the data matrix **X** into $K$ clusters, we can seek the minimum distance between every $\mathbf{x}_i$ and its class center $\mathbf{g}_k$:

$$z\_i^2 = \min\_{\mathbf{g}\_k \in \mathbf{G}} \|\mathbf{x}\_i - \mathbf{g}\_k\|\_2^2$$

where $\|\cdot\|_2$ is the Euclidean norm. The minimization can be stated as a sum-of-squares:

$$\min \sum\_{i=1}^{n} \min\_{\mathbf{g}\_k \in \mathbf{G}} \left\lVert \mathbf{x}\_i - \mathbf{g}\_k \right\rVert\_2^2 = \min \sum\_{i=1}^{n} z\_i^2$$

leading to the following constrained problem:

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } z\_i = \min\_{\mathbf{g}\_k \in \mathbf{G}} \| \mathbf{x}\_i - \mathbf{g}\_k \|\_2, \text{ with } i = 1, \dots, n.$$

This is equivalent to the following minimization problem:

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } z\_i - \|\mathbf{x}\_i - \mathbf{g}\_k\|\_2 \le 0, \text{ with } i = 1, \dots, n \text{ and } k = 1, \dots, K.$$

Considering the function: 𝜑(𝑦) = max(0, 𝑦), we obtain the problem:

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } \sum\_{k=1}^{K} \varphi(z\_i - \|\mathbf{x}\_i - \mathbf{g}\_k\|\_2) = 0 \text{ for } i = 1, \dots, n.$$

That problem can be re-stated as the following one:

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } \sum\_{k=1}^{K} \varphi \left( z\_i - \|\mathbf{x}\_i - \mathbf{g}\_k\|\_2 \right) > 0, \text{ for } i = 1, \dots, n.$$

Given a perturbation $\epsilon > 0$, this leads to the problem:

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } \sum\_{k=1}^{K} \varphi(z\_i - \|\mathbf{x}\_i - \mathbf{g}\_k\|\_2) \ge \epsilon \text{ for } i = 1, \dots, n.$$

It should be noted that the function $\varphi$ is not differentiable. Therefore, we apply a smoothing procedure in order to obtain a differentiable formulation and proceed with minimization by a numerical method. To this end, consider the functions $\psi(y, \tau) = \frac{y + \sqrt{y^2 + \tau^2}}{2}$, for all $y \in \mathbb{R}$, $\tau > 0$, and $\theta(\mathbf{x}_i, \mathbf{g}_k, \gamma) = \sqrt{\sum_{j=1}^{p} (x_{ij} - g_{kj})^2 + \gamma^2}$, for $\gamma > 0$. Hence, the minimization problem is transformed into:
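The two smoothing functions are simple enough to transcribe directly (an illustration for scalar inputs; only the math module is used):

```python
# psi is a smooth approximation of max(0, y): psi -> max(0, y) as tau -> 0.
# theta is a smooth approximation of the Euclidean distance ||x - g||_2.
import math

def psi(y, tau):
    return (y + math.sqrt(y * y + tau * tau)) / 2.0

def theta(x, g, gamma):
    return math.sqrt(sum((xi - gi) ** 2 for xi, gi in zip(x, g)) + gamma * gamma)
```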

$$\min \sum\_{i=1}^{n} z\_i^2 \text{ subject to } \sum\_{k=1}^{K} \psi(z\_i - \theta(\mathbf{x}\_i, \mathbf{g}\_k, \boldsymbol{\gamma}), \boldsymbol{\tau}) \ge \epsilon, \text{ for } i = 1, \dots, n.$$

Finally, according to the Karush–Kuhn–Tucker conditions [10, 11], all the constraints are active and the final formulation of the problem is:

$$\begin{aligned} \min & \sum\_{i=1}^{n} z\_i^2\\ \text{subject to} & \quad h\_i(z\_i, \mathbf{G}) = \sum\_{k=1}^{K} \psi(z\_i - \theta(\mathbf{x}\_i, \mathbf{g}\_k, \boldsymbol{\gamma}), \boldsymbol{\tau}) - \epsilon = 0, \text{ for } i = 1, \dots, n, \\ & \epsilon, \boldsymbol{\tau}, \boldsymbol{\gamma} > 0. \end{aligned} \tag{4}$$

Considering (4), the Hyperbolic Smoothing Clustering Method stated in [15] is presented in the following algorithm.

#### **Hyperbolic Smoothing Clustering Method (HSCM) Algorithm**


6. Solve problem (P): $\min f(\mathbf{G}) = \sum_{i=1}^{n} z_i^2$ with $\gamma = \gamma^l$, $\tau = \tau^l$ and $\epsilon = \epsilon^l$, with $\mathbf{G}^{l-1}$ being the initial value and $\mathbf{G}^l$ the obtained solution.
7. Let $\gamma^{l+1} = \rho_1 \gamma^l$, $\tau^{l+1} = \rho_2 \tau^l$, $\epsilon^{l+1} = \rho_3 \epsilon^l$ and $l = l + 1$.

The most demanding task in the hyperbolic smoothing clustering method is finding the zeroes of the functions $h_i(z_i, \mathbf{G}) = \sum_{k=1}^{K} \psi(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau) - \epsilon = 0$ for $i = 1, \ldots, n$. In this paper we used the Newton-Raphson method for finding these zeroes [3], together with the BFGS procedure [12]. Convergence of the Newton-Raphson method was successful mainly thanks to a good choice of initial solutions; in our implementation, these initial approximations were generated by calculating the minimum distance between the $i$-th object and the $k$-th centroid for a given partition. Once the zeroes $z_i$ of the functions $h_i$ are obtained, the hyperbolic smoothing is applied. The final solution consists in solving a finite number of optimization subproblems corresponding to problem (P) in Step 6 of the HSCM algorithm. Each of these subproblems was solved with the R routine *optim* [13], a useful tool for nonlinear programming. As far as we know there is no closed-form solution for this step. In the future we may implement this step ourselves, but for this paper we use this R routine.

Since $\sum_{k=1}^{K} \psi(z_i - \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma), \tau) = \epsilon$, each entry $\mu_{ik}$ of the membership matrix is given by $\mu_{ik} = \psi(z_i - d_k, \tau)/\epsilon$, where $d_k = \theta(\mathbf{x}_i, \mathbf{g}_k, \gamma)$. It is worth noting that fuzziness is controlled by the parameter $\epsilon$.
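As an illustration of this step, the root $z_i$ of $h_i$ can be found and the memberships read off as follows (a sketch using bisection instead of the Newton-Raphson/BFGS procedure of the paper; only the math module is used):

```python
# Solve h_i(z) = sum_k psi(z - d_k, tau) - eps = 0 by bisection, then compute
# mu_ik = psi(z - d_k, tau) / eps; the memberships sum to one at the root.
import math

def psi(y, tau):
    return (y + math.sqrt(y * y + tau * tau)) / 2.0

def memberships(dists, eps=0.01, tau=0.001, iters=200):
    """dists: distances d_k = theta(x_i, g_k, gamma) to the K centroids."""
    h = lambda z: sum(psi(z - d, tau) for d in dists) - eps
    lo, hi = min(dists) - 1.0, min(dists) + eps + 1.0  # h(lo) < 0 < h(hi)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    z = (lo + hi) / 2.0
    return [psi(z - d, tau) / eps for d in dists]
```

The closest centroid receives the largest membership, consistent with the discussion above.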

The following algorithm contains the main steps of the Hyperbolic Smoothing Fuzzy Clustering (HSFC) method.

#### **Hyperbolic Smoothing Fuzzy Clustering (HSFC) Algorithm**


# **4 Comparative Results**

The performance of the HSFC method was studied on a data table well known in the literature, Fisher's iris data [7], and on 16 simulated data tables built with a semi-Monte Carlo procedure [14].

For comparing FCM and HSFC, we used the implementation of FCM in the R package *fclust* [6]. The comparison was based on the within-class sum-of-squares $W(P) = \sum_{k=1}^{K} \sum_{i=1}^{n} \mu_{ik} \|\mathbf{x}_i - \mathbf{g}_k\|^2$. Both methods were applied 50 times and the best value of $W$ is reported. For simplicity, for HSFC we used the following parameters: $\rho_1 = \rho_2 = \rho_3 = 0.25$, $\epsilon = 0.01$ and $\gamma = \tau = 0.001$ as initial values. Table 1 shows the results for Fisher's iris data, for which HSFC performs slightly better. It also contains the Adjusted Rand Index (ARI) [9] between the HSFC solution and the best FCM result among 100 runs; the ARI compares the fuzzy membership matrices hardened into crisp partitions.

**Table 1** Minimum sum-of-squares (SS) reported for the Fisher's iris data table with HSFC and FCM, $K$ being the number of clusters, and ARI comparing both methods. Best method in bold.


Simulated data tables were generated in a controlled experiment as in [14], with random numbers following a Gaussian distribution. Factors of the experiment were:


Table 2 contains the codes and characteristics of the simulated data tables.

Table 3 contains the minimum values of the sum-of-squares obtained with our HSFC method and Bezdek's FCM; the best solution of 100 random runs of FCM is presented, together with one run of HSFC. It also contains the ARI values comparing the HSFC solution with that best FCM solution. It can be seen that the HSFC method generally tends to obtain better results than FCM, with only a few exceptions: HSFC obtains better results in 23 cases, FCM is better in 5 cases, and the results are the same in 17 cases. However, the ARI shows that the partitions produced by the two methods tend to be very similar.


**Table 2** Codes and characteristics of simulated data tables; 𝑛: number of objects, 𝐾: number of clusters, card: cardinality, DS: standard deviation.

**Table 3** Minimum sum-of-squares (SS) reported for HSFC and FCM methods on the simulated data tables. Best method in bold.


# **5 Concluding Remarks**

In hyperbolic smoothing, the parameters $\tau$, $\gamma$ and $\epsilon$ tend to zero, so the constraints of the subproblems make problem (P) tend to the solution of (1). The parameter $\epsilon$ controls the degree of fuzziness of the clustering: the larger it is, the fuzzier the solution; the smaller it is, the crisper the clustering. In order to compare the results and efficiency of the HSFC method, the zeroes of the functions $h_i$ can be obtained with any method for solving equations in one variable or with a predefined routine. According to the results obtained so far with our implementation of hyperbolic smoothing for fuzzy clustering, we can conclude that the HSFC method generally performs slightly better than Bezdek's original FCM on small real and simulated data tables. Further research is required to test the performance of the HSFC method on very large data sets, with measures of efficiency, quality of solutions and running time. We also plan to study further comparisons between HSFC and FCM with different indices, and to write our own program for Step 6 of the HSFC algorithm, the minimization of $f(\mathbf{G})$, instead of using the *optim* routine in R.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Stochastic Collapsed Variational Inference for Structured Gaussian Process Regression Networks**

Rui Meng, Herbert K. H. Lee, and Kristofer Bouchard

**Abstract** This paper presents an efficient variational inference framework for a family of structured Gaussian process regression network (SGPRN) models. We incorporate auxiliary inducing variables in latent functions and jointly treat both the distributions of the inducing variables and hyper-parameters as variational parameters. Then we take advantage of the collapsed representation of the model and propose structured variational distributions, which enables the decomposability of a tractable variational lower bound and leads to stochastic optimization. Our inference approach is able to model data in which outputs do not share a common input set, and with a computational complexity independent of the size of the inputs and outputs to easily handle datasets with missing values. Finally, we illustrate our approach on both synthetic and real data.

**Keywords:** stochastic optimization, Gaussian process, variational inference, multivariate time series, time-varying correlation

# **1 Introduction**

Multi-output regression problems arise in various fields. Often, the processes that generate such datasets are nonstationary. Modern instrumentation has resulted in increasing numbers of observations, as well as the occurrence of missing values. This motivates the development of scalable methods for forecasting in such datasets.

Multi-output Gaussian process models or multivariate Gaussian process models (MGP) generalise the powerful Gaussian process predictive model to vector-valued

© The Author(s) 2023 253 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_28

Rui Meng () · Kristofer Bouchard

Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, USA, e-mail: rmeng@lbl.gov;kebouchard@lbl.gov

Herbert K. H. Lee University of California, Santa Cruz, USA, e-mail: herbie@ucsc.edu

random fields [1]. These models demonstrate improved prediction performance compared with independent univariate Gaussian processes (GP) because MGPs express correlations between outputs. Since the correlation information of the data is encoded in the covariance function, modeling flexible and computationally efficient cross-covariance functions is of interest. In the literature on multivariate processes, many approaches have been proposed to build valid cross-covariance functions, including the linear model of coregionalization (LMC) [2], kernel convolution techniques [3], and B-spline based coherence functions [4]. However, most of these models are designed for modelling low-dimensional stationary processes and require Monte Carlo simulations, making inference in large datasets computationally intractable.

Modelling complicated temporal dependencies across variables is addressed in [5, 6] by several adaptations of the stochastic LMC. Such models can handle input-varying correlation across multivariate outputs. In particular, for multivariate time series, [6] propose an SGPRN that captures time-varying scale, correlation, and smoothness. However, the inference in [6] is difficult in applications where either the number of observations or the dimension is large, or where missing data exist.

Here, we propose an efficient variational inference approach for the SGPRN by employing the inducing variable framework on all latent processes [7], taking advantage of its collapsed representation where nuisance parameters are marginalized out [8] and proposing a tractable variational bound amenable to doubly stochastic variational inference. We call our approach variational SGPRN (VSGPRN). This variational framework allows the model to handle missing data without increasing the computational complexity of inference. We numerically provide evidence of the benefits of simultaneously modeling time-varying correlation, scale and smoothness in both a synthetic experiment and a real-world problem.

The main contributions of this work are threefold:


# **2 Model**

Assume $\mathbf{y}(\mathbf{x}) \in \mathbb{R}^D$ is a vector-valued function of $\mathbf{x} \in \mathbb{R}^P$, where $D$ is the dimension of the outputs and $P$ is the dimension of the inputs. SGPRN assumes that the noisy observations $\mathbf{y}(\mathbf{x})$ are a linear combination of latent variables $\mathbf{g}(\mathbf{x}) \in \mathbb{R}^D$, corrupted by Gaussian noise $\boldsymbol{\epsilon}(\mathbf{x})$. The coefficients $\mathbf{L}(\mathbf{x}) \in \mathbb{R}^{D \times D}$ of the latent functions are assumed to form a stochastic lower triangular matrix with

**Fig. 1** Graphical model of VSGPRN. Left: Illustration of the generative model. Right: Illustration of the variational structure. The dashed (red) block means that we marginalize out those latent variables in the variational inference framework.

positive values on the diagonal for model identification [9, 6]. Thus, SGPRN is defined by the generative model of Figure 1:
$$\mathbf{y}(\mathbf{x}) = \mathbf{f}(\mathbf{x}) + \boldsymbol{\epsilon}(\mathbf{x}), \quad \mathbf{f}(\mathbf{x}) = \mathbf{L}(\mathbf{x})\mathbf{g}(\mathbf{x}),$$
with independent white noise $\boldsymbol{\epsilon}(\mathbf{x}) \overset{iid}{\sim} \mathcal{N}(0, \sigma^2_{err} I)$. Each latent function $g_d$ in $\mathbf{g}$ is independently sampled from a GP with a nonstationary kernel $K^g$, and the stochastic coefficients are modeled via a structured GP based prior, as proposed in [9], with a stationary kernel $K^l$:
$$g_d \overset{iid}{\sim} \mathcal{GP}(0, K^g), \; d = 1, \ldots, D, \qquad l_{ij} \sim \begin{cases} \mathcal{GP}(0, K^l), & i > j, \\ \log\mathcal{GP}(0, K^l), & i = j, \end{cases}$$
where $\log\mathcal{GP}$ denotes the log Gaussian process [10]. $K^g$ is modelled as a Gibbs correlation function
$$K^g(\mathbf{x}, \mathbf{x}') = \sqrt{\frac{2\,\ell(\mathbf{x})\,\ell(\mathbf{x}')}{\ell(\mathbf{x})^2 + \ell(\mathbf{x}')^2}} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\ell(\mathbf{x})^2 + \ell(\mathbf{x}')^2}\right), \quad \ell \sim \log\mathcal{GP}(0, K^\ell),$$
where $\ell$ determines the input-dependent length scale of the shared correlations in $K^g$ for all latent functions $g_d$. The varying length-scale process $\ell$ plays an important role in modelling nonstationary time series, as illustrated in [11, 6].
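The Gibbs correlation function is easy to evaluate for a given length-scale function (a sketch for scalar inputs with a hypothetical length-scale function `ell`; only the math module is used):

```python
# Gibbs nonstationary correlation function with input-dependent length scale.
import math

def gibbs_kernel(x, xp, ell):
    """K^g(x, x') for scalar inputs; ell is a callable length-scale function."""
    lx, lxp = ell(x), ell(xp)
    s = lx * lx + lxp * lxp
    return math.sqrt(2.0 * lx * lxp / s) * math.exp(-((x - xp) ** 2) / s)
```

By construction the function is symmetric in its arguments and equals one at $\mathbf{x} = \mathbf{x}'$, as a correlation function should.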

Let $X = \{\mathbf{x}_i\}_{i=1}^N$ be the set of observed inputs and $Y = \{\mathbf{y}_i\}_{i=1}^N$ be the set of observed outputs. Denote by $\eta$ the concatenation of all coefficients and all log length-scale parameters evaluated at the training inputs $X$, i.e., $\eta = (\mathbf{l}, \tilde{\ell})$. Here, $\mathbf{l}$ is a vector collecting the entries below the main diagonal and the diagonal entries on the log scale, and $\tilde{\ell} = \log \ell$ collects the length-scale parameters on the log scale. Also, denote by $\theta = (\theta_l, \theta_\ell, \sigma^2_{err})$ all hyper-parameters, where $\theta_l$ and $\theta_\ell$ are the hyper-parameters of the kernels $K_l$ and $K_\ell$. Directly inferring the posterior of the latent variables $p(\eta|Y, \theta) \propto p(Y|\eta, \sigma^2_{err})\, p(\eta|\theta_l, \theta_\ell)$ is computationally intractable in general, because its computational complexity is $\mathcal{O}(N^3 D^3)$. To overcome this issue, we propose an efficient variational inference scheme that significantly reduces the computational burden in the next section.

# **3 Inference**

We introduce a shared set of inducing inputs $Z = \{\mathbf{z}_m\}_{m=1}^M$ that lie in the same space as the inputs $X$, and a set of inducing variables $\mathbf{w}_d$ for each latent function $g_d$ evaluated at the inducing inputs $Z$. Likewise, we consider inducing variables $\mathbf{u}_{ii}$ for the functions $\log L_{ii}$ ($i = j$), $\mathbf{u}_{ij}$ for the functions $L_{ij}$ ($i > j$), and inducing variables $\mathbf{v}$ for the function $\log \ell(\mathbf{x})$, all evaluated at the inducing inputs $Z$. We denote these collective variables as $\mathbf{l} = \{\mathbf{l}_{ij}\}_{i \ge j}$, $\mathbf{u} = \{\mathbf{u}_{ij}\}_{i \ge j}$, $\mathbf{g} = \{\mathbf{g}_d\}_{d=1}^D$, $\mathbf{w} = \{\mathbf{w}_d\}_{d=1}^D$, $\ell$ and $\mathbf{v}$. Then we redefine the model parameters as $\eta = (\mathbf{l}, \mathbf{u}, \mathbf{g}, \mathbf{w}, \ell, \mathbf{v})$, and their prior factorizes as $p(\eta) = p(\mathbf{l}|\mathbf{u})\,p(\mathbf{u})\,p(\mathbf{g}|\mathbf{w}, \ell, \mathbf{v})\,p(\mathbf{w})\,p(\ell|\mathbf{v})\,p(\mathbf{v})$.

The core assumption of inducing-point-based sparse inference is that the inducing variables are sufficient statistics for the training and testing data, in the sense that the training and testing data are conditionally independent given the inducing variables. In the context of our model, this means that the posterior processes of $L$, $g$ and $\ell$ are sufficiently determined by the posterior distribution of $\mathbf{u}$, $\mathbf{w}$ and $\mathbf{v}$. We propose a structured variational distribution and a corresponding variational lower bound. Due to the nonconjugacy of this model, instead of computing the expectations in the evidence lower bound (ELBO) analytically, as is normally done in the literature, we marginalize out the inducing variables $\mathbf{u}$ and $\mathbf{w}$ together with the latent variables $\mathbf{g}$, and then use the reparameterization trick to apply end-to-end training with stochastic gradient descent. We also discuss a procedure for inference and prediction with missing data.

To capture the posterior dependency between the latent functions, we propose a structured variational distribution over the model parameters $\eta$ to approximate their posterior: $q(\eta) = p(\mathbf{l}|\mathbf{u})\,p(\mathbf{g}|\mathbf{w}, \ell, \mathbf{v})\,p(\ell|\mathbf{v})\,q(\mathbf{u}, \mathbf{w}, \mathbf{v})$. This variational structure is illustrated in Figure 1. The variational distribution of the inducing variables $q(\mathbf{u}, \mathbf{w}, \mathbf{v})$ fully characterizes $q(\eta)$; thus, the inference of $q(\mathbf{u}, \mathbf{w}, \mathbf{v})$ is of interest. We assume the parameters $\mathbf{u}$, $\mathbf{w}$, and $\mathbf{v}$ are Gaussian and mutually independent.

Given the Gaussian process priors of the SGPRN, the conditional distributions $p(\mathbf{l}|\mathbf{u})$, $p(\mathbf{g}|\mathbf{w}, \tilde{\ell}, \mathbf{v})$, and $p(\ell|\mathbf{v})$ have closed-form expressions and are all Gaussian, except for $p(\ell|\mathbf{v})$, which is log Gaussian. The ELBO of the log likelihood of the observations under our structured variational distribution $q(\eta)$ is derived using Jensen's inequality as:

$$\log p(Y) \ge E_{q(\eta)} \left[ \log \left( \frac{p(Y \mid \mathbf{g}, \mathbf{l})\, p(\mathbf{u})\, p(\mathbf{w})\, p(\mathbf{v})}{q(\mathbf{u}, \mathbf{w}, \mathbf{v})} \right) \right] = R + A \,, \tag{1}$$

where $R = \sum_{n=1}^{N} \sum_{d=1}^{D} E_{q(\mathbf{g}_n, \mathbf{l}_n)} \log p(y_{nd} \mid \mathbf{g}_n, \mathbf{l}_n)$ is the reconstruction term and $A = \mathrm{KL}(q(\mathbf{u})\|p(\mathbf{u})) + \mathrm{KL}(q(\mathbf{w})\|p(\mathbf{w})) + \mathrm{KL}(q(\mathbf{v})\|p(\mathbf{v}))$ is the regularization term. Here $\mathbf{g}_n = \{g_{dn} = (\mathbf{g}_d)_n\}_{d=1}^D$ and $\mathbf{l}_n = \{l_{ijn} = (\mathbf{l}_{ij})_n\}_{i \ge j}$ are the latent variables evaluated at input $\mathbf{x}_n$.
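Since the inducing variables are Gaussian under both the prior and the variational distribution, each KL term in the regularization term has the standard closed form for multivariate Gaussians. A small NumPy sketch (the helper name is ours):

```python
import numpy as np

def kl_gaussian(m0, S0, m1, S1):
    """KL( N(m0, S0) || N(m1, S1) ) between multivariate Gaussians,
    as used for each term of the regularization term A."""
    k = len(m0)
    S1_inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
```

The KL divergence vanishes when the two distributions coincide, which is a quick sanity check of an implementation.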

The structured decomposition trick for $q(\eta)$ has also been used by [12] to derive variational inference for the multivariate output case. The benefit of this structure is that all conditional distributions in $q(\eta)$ cancel in the derivation of the lower bound (1), which alleviates the computational burden of inference. Because the reconstruction term in (1) is conditionally independent across data points given $\mathbf{g}$ and $\mathbf{l}$, the lower bound decomposes across both inputs and outputs, which enables stochastic optimization methods. Moreover, due to the Gaussian assumption in the prior and variational distributions of the inducing variables, all KL divergences in the regularization term $A$ are analytically tractable. Next, instead of computing the expectations directly, we leverage stochastic inference [13].

Stochastic inference requires sampling $\mathbf{l}$ and $\mathbf{g}$ from the joint variational posterior $q(\eta)$. Sampling them directly would introduce extra variance through the intermediate variables and thus make inference inefficient. To tackle this issue, we marginalize out the intermediate variables $\mathbf{u}$ and $\mathbf{w}$ and obtain the marginal distributions

$$q(\mathbf{l}) = \prod_{i=j} \log\mathcal{N}(\mathbf{l}_{ii} \mid \tilde{\mu}^l_{ii}, \tilde{\Sigma}^l_{ii}) \prod_{i>j} \mathcal{N}(\mathbf{l}_{ij} \mid \tilde{\mu}^l_{ij}, \tilde{\Sigma}^l_{ij}), \qquad q(\mathbf{g}\mid\ell, \mathbf{v}) = \prod_{d=1}^{D} \mathcal{N}(\mathbf{g}_d \mid \tilde{\mu}^g_d, \tilde{\Sigma}^g_d),$$

with the joint distribution $q(\ell, \mathbf{v}) = p(\ell|\mathbf{v})\,q(\mathbf{v})$, where the conditional means and covariance matrices are easily derived. The corresponding marginal distributions $q(\mathbf{l}_n)$ and $q(\mathbf{g}_n|\ell, \mathbf{v})$ at each $n$ are also easy to derive. Moreover, we conduct collapsed inference by marginalizing the latent variables $\mathbf{g}_n$, so that each individual expectation becomes

$$\mathbb{E}_{q(\mathbf{g}_n, \mathbf{l}_n)} \log p(y_{nd}\mid\mathbf{g}_n, \mathbf{l}_n) = \int L_{nd}\; q(\ell_n, \mathbf{v})\, q(\mathbf{l}_{d\cdot n})\; d(\mathbf{l}_{d\cdot n}, \ell_n, \mathbf{v}), \tag{2}$$

where $L_{nd} = \log \mathcal{N}\big(y_{nd} \mid \sum_{j=1}^{D} l_{djn}\, \tilde{\mu}^g_{jn},\ \sigma^2_{err}\big) - \frac{1}{2\sigma^2_{err}} \sum_{j=1}^{D} l^2_{djn}\, \tilde{\sigma}^{g2}_{jn}$ measures the reconstruction quality for observation $y_{nd}$.
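Marginalizing the inducing variables $\mathbf{w}_d \sim \mathcal{N}(\mathbf{m}, S)$ out of $p(\mathbf{g}_d|\mathbf{w}_d)\,q(\mathbf{w}_d)$ yields the familiar sparse-GP marginal behind $\tilde{\mu}^g_d$ and $\tilde{\Sigma}^g_d$. A NumPy sketch under the standard sparse variational GP formulas; the RBF kernel, jitter, and names are our assumptions for illustration:

```python
import numpy as np

def rbf(a, b, ls=0.5):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

def svgp_marginal(x, z, m, S, kern=rbf, jitter=1e-8):
    """Marginal q(g) = N(mu, Sigma) obtained by integrating the
    inducing variables w ~ N(m, S) out of p(g | w) q(w)."""
    Kmm = kern(z, z) + jitter * np.eye(len(z))
    Knm = kern(x, z)
    Knn = kern(x, x)
    A = np.linalg.solve(Kmm, Knm.T).T          # K_nm K_mm^{-1}
    mu = A @ m
    Sigma = Knn - A @ (Kmm - S) @ A.T
    return mu, Sigma
```

When $S = K_{mm}$ and $\mathbf{m} = 0$, the marginal reduces to the prior, which is a useful correctness check.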

Directly evaluating the ELBO is still challenging due to the non-linearities introduced by our structured prior. Recent progress in black-box variational inference [13] avoids this difficulty by computing noisy unbiased estimates of the gradient of the ELBO, approximating the expectations with unbiased Monte Carlo estimates and relying on either score function estimators [14] or reparameterization gradients [13] to differentiate through the sampling process. Here we leverage reparameterization gradients for stochastic optimization of the model parameters. We note that evaluating the ELBO (1) involves two sources of stochasticity: Monte Carlo sampling in (2) and data sub-sampling [15]. The prediction procedure is based on Bayes' rule, replacing the posterior distribution by the inferred variational distribution. In the case of missing data, the only modification in (1) is in the reconstruction term, where we sum the likelihoods of the observed data instead of the complete data.
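The reparameterization trick mentioned above writes a Gaussian sample as a deterministic, differentiable function of its parameters, $\mathbf{z} = \mu + L\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $LL^\top = \Sigma$. A minimal sketch (function name ours):

```python
import numpy as np

def reparam_sample(mu, Sigma, n_samples, rng):
    """Draw samples from N(mu, Sigma) as mu + L @ eps with eps ~ N(0, I),
    so each sample is a differentiable function of (mu, Sigma)."""
    L = np.linalg.cholesky(Sigma)
    eps = rng.standard_normal((n_samples, len(mu)))
    return mu + eps @ L.T
```

In an autodiff framework, gradients of a Monte Carlo ELBO estimate flow through `mu` and `Sigma` because the randomness is isolated in `eps`.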

# **4 Experiments**

This section illustrates the performance of our model on multivariate time series. We first show that our approach can model the time-varying correlation and smoothness of outputs on 2-D synthetic datasets in three scenarios with different frequency patterns but the same missing-data mechanism. Then, we compare the imputation performance on missing data against other inducing-variable-based sparse multivariate Gaussian process models on a real dataset.

We conduct experiments on three synthetic time series with low frequency (LF), high frequency (HF) and varying frequency (VF), respectively. They are generated from the system

$$y_1(t) = 5 \cos(2\pi w t^s) + \epsilon_1(t), \qquad y_2(t) = 5(1-t)\cos(2\pi w t^s) - 5t\cos(2\pi w t^s) + \epsilon_2(t),$$

where $\{\epsilon_i(t)\}_{i=1}^2$ are independent standard white noise processes. The value of $w$ sets the frequency and the value of $s$ characterizes the smoothness. The LF and HF datasets both use $s = 1$, so the smoothness is invariant across time, but they employ different frequencies: $w = 2$ for LF and $w = 5$ for HF (i.e., two and five periods per unit time interval, respectively). The VF dataset takes $s = 2$ and $w = 5$, so the frequency gradually increases with time. For all three datasets, as $t$ increases from 0 to 1, the correlation between $y_1(t)$ and $y_2(t)$ gradually varies from positive to negative. Within each dataset, we randomly select 200 training points: 100 time stamps sampled on the interval $(0, 0.8)$ for the first dimension and 100 time stamps sampled on $(0.2, 1)$ for the second dimension. For the test inputs, we randomly select 100 time stamps on $(0, 1)$ for each dimension.
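The generating system is a few lines of code; the helper name `simulate` and the optional `noise` argument are ours. Note that the deterministic part of $y_2$ simplifies to $5(1-2t)\cos(2\pi w t^s)$, which makes the drift of the correlation from positive to negative explicit:

```python
import numpy as np

def simulate(t, w, s, rng, noise=1.0):
    """Simulate (y1, y2) from the synthetic system; the deterministic
    part of y2 equals (1 - 2t) times that of y1."""
    base = 5.0 * np.cos(2.0 * np.pi * w * t**s)
    y1 = base + noise * rng.standard_normal(t.shape)
    y2 = (1.0 - t) * base - t * base + noise * rng.standard_normal(t.shape)
    return y1, y2
```

At $t = 0$ the two noiseless signals coincide; at $t = 1$ they are exact negatives of each other.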

**Table 1** Prediction measurements on three synthetic datasets and different models. LF, HF and VF refer to low-frequency, high-frequency, and time-varying datasets. Three prediction measures are root mean square error (RMSE), average length of confidence interval (ALCI), and coverage rate (CR). All three measurements are summarized by the mean and standard deviation across 10 runs with different random initializations.


We quantify model performance in terms of root mean square error (RMSE), average length of confidence interval (ALCI), and coverage rate (CR) on the test set. A smaller RMSE corresponds to better predictive performance, and a smaller ALCI implies smaller predictive uncertainty. As for CR, the better the predictive performance, the closer CR is to the nominal level of the credible band. Results are reported as the mean and standard deviation over 10 different random initializations of the model parameters. Quantitative comparisons for all three datasets are given in Table 1. We compare with independent Gaussian process regression (IGPR) [16], the intrinsic coregionalization model (ICM) [17], collaborative multi-output Gaussian processes (CMOGP) [12], and variational inference of Gaussian process regression networks [18] on the three synthetic datasets. In both the CMOGP and VSGPRN approaches, we use 20 inducing variables. We further examined predictive performance on a real-world dataset, the PM2.5 dataset from the UCI Machine Learning Repository [19]. This dataset tracks the hourly concentration of fine inhalable particles in five cities in China, along with meteorological data, from Jan 1st, 2010 to Dec 31st, 2015. We compare our model with two sparse Gaussian process models, namely independent sparse Gaussian process regression (ISGPR) [20] and the sparse linear model of coregionalization (SLMC) [17]. We consider six important attributes, use 20% of the first 5000 standardized multivariate observations for training, and use the rest for testing. The RMSEs on the testing data are shown in Table 2, illustrating that VSGPRN achieved better predictive performance than ISGPR and SLMC, even when using fewer inducing points.
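The three measures can be computed directly from the predictive means and standard deviations. This sketch assumes a Gaussian predictive band with nominal 95% level ($z = 1.96$); the function name and that choice are ours:

```python
import numpy as np

def prediction_metrics(y, mu, sigma, z=1.96):
    """RMSE, average length of the confidence interval mu +/- z*sigma,
    and the coverage rate of y by that interval."""
    rmse = np.sqrt(np.mean((y - mu) ** 2))
    alci = np.mean(2.0 * z * sigma)
    covered = np.abs(y - mu) <= z * sigma
    return rmse, alci, covered.mean()
```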

**Table 2** Empirical results for PM2.5 dataset. Each model's performance is summarized by its RMSE on the testing data. The number of equi-spaced inducing points is given in parentheses.


# **5 Conclusions**

We propose a novel variational inference approach for structured Gaussian process regression networks, named the variational structured Gaussian process regression network (VSGPRN). We introduce inducing variables and propose a structured variational distribution to reduce the computational burden. Moreover, we take advantage of the collapsed representation of our model and construct a tractable lower bound of the log likelihood, making it suitable for doubly stochastic inference and able to handle missing data easily. In our method, the per-iteration computational complexity is independent of the number of inputs and outputs. We illustrate the superior predictive performance on both synthetic and real data.

Our inference approach, VSGPRN, can be widely used for high-dimensional time series to model complicated time-varying dependence across multivariate outputs. Moreover, due to its scalability and flexibility, it can be applied to the irregularly sampled, incomplete large datasets that arise in various research fields, including healthcare, environmental science and geoscience.

# **References**

	- https://royalsocietypublishing.org/doi/abs/10.1098/rspa.2015.0257

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **An Online Minorization-Maximization Algorithm**

Hien Duy Nguyen, Florence Forbes, Gersende Fort, and Olivier Cappé

**Abstract** Modern statistical and machine learning settings often involve high data volume and data streaming, which require the development of online estimation algorithms. The online Expectation–Maximization (EM) algorithm extends the popular EM algorithm to this setting, via a stochastic approximation approach. We show that an online version of the Minorization–Maximization (MM) algorithm, which includes the online EM algorithm as a special case, can also be constructed in a similar manner. We demonstrate our approach via an application to the logistic regression problem and compare it to existing methods.

**Keywords:** expectation-maximization, minorization-maximization, parameter estimation, online algorithms, stochastic approximation

# **1 Introduction**

Expectation–Maximization (EM) [6, 17] and Minorization–Maximization (MM) algorithms [15] are important classes of optimization procedures that allow for the construction of estimation routines for many data analytic models, including

Hien Duy Nguyen ()

School of Mathematics and Physics, University of Queensland, St. Lucia, 4067 QLD, Australia, e-mail: h.nguyen7@uq.edu.au

Florence Forbes Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000, Grenoble, France, e-mail: florence.forbes@inria.fr

© The Author(s) 2023 263 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_29

Gersende Fort Institut de Mathématiques de Toulouse, CNRS, Toulouse, France, e-mail: gersende.fort@math.univ-toulouse.fr

Olivier Cappé ENS Paris, Universite PSL, CNRS, INRIA, France, e-mail: Olivier.Cappe@cnrs.fr

many finite mixture models. The benefit of such algorithms comes from the use of computationally simple surrogates in place of difficult optimization objectives.

Driven by high volume of data and streamed nature of data acquisition, there has been a rapid development of online and mini-batch algorithms that can be used to estimate models without requiring data to be accessed all at once. Online and mini-batch versions of EM algorithms can be constructed via the classic Stochastic Approximation framework (see, e.g., [2, 13]) and examples of such algorithms include those of [3, 7, 8, 10, 11, 12, 19]. Via numerical assessments, many of the algorithms above have been demonstrated to be effective in mixture model estimation problems. Online and mini-batch versions of MM algorithms on the other hand have largely been constructed following convex optimizations methods (see, e.g., [9, 14, 23]) and examples of such algorithms include those of [4, 16, 18, 22].

In this work, we provide a stochastic approximation construction of an online MM algorithm using the framework of [3]. The main advantage of our approach is that we do not make convexity assumptions and instead replace them with oracle assumptions regarding the surrogates. Compared to the online EM algorithm of [3] that this work is based upon, the Online MM algorithm extends the approach to allow for surrogate functions that do not require latent variable stochastic representations, which is especially useful for constructing estimation algorithms for mixture of experts (MoE) models (see, e.g. [20]). We demonstrate the Online MM algorithm via an application to the MoE-related logistic regression problem and compare it to competing methods.

**Notation.** By convention, vectors are column vectors. For a matrix $A$, $A^\top$ denotes its transpose. The Euclidean scalar product is denoted by $\langle a, b \rangle$. For a continuously differentiable (resp. twice continuously differentiable) function $\theta \mapsto h(\theta)$, $\nabla_\theta h$ (or simply $\nabla h$ when there is no confusion) is its gradient (resp. $\nabla^2_{\theta\theta} h$ is its Hessian). We denote by vec the vectorization operator that converts matrices to column vectors.

# **2 The Online MM Algorithm**

Consider the optimization problem

$$\underset{\theta \in \mathsf{T}}{\operatorname{arg\,max}}\; \mathbb{E}\left[f\left(\theta; X\right)\right],\tag{1}$$

where $\mathsf{T}$ is a measurable open subset of $\mathbb{R}^p$, $\mathsf{X}$ is a topological space endowed with its Borel sigma-field, $f : \mathsf{T} \times \mathsf{X} \to \mathbb{R}$ is a measurable function and $X$ is an $\mathsf{X}$-valued random variable on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. In this paper, we are interested in the setting where the expectation $\mathbb{E}[f(\theta; X)]$ has no closed form, and the optimization problem is solved by an MM-based algorithm.

Following the terminology of [15], we say that $g : \mathsf{T} \times \mathsf{X} \times \mathsf{T} \to \mathbb{R}$, $(\theta, x, \tau) \mapsto g(\theta, x; \tau)$, is a *minorizer of* $f$ if, for any $\tau \in \mathsf{T}$ and any $(\theta, x) \in \mathsf{T} \times \mathsf{X}$, it holds that

$$f(\theta; \mathbf{x}) - f(\tau; \mathbf{x}) \ge g(\theta, \mathbf{x}; \tau) - g(\tau, \mathbf{x}; \tau). \tag{2}$$

In our work, we consider the case when the minorizer function 𝑔 has the following structure:

A1 The minorizer surrogate 𝑔 is of the form:

$$g\left(\theta, x; \tau\right) = -\psi\left(\theta\right) + \left\langle \bar{S}\left(\tau; x\right), \phi\left(\theta\right) \right\rangle,\tag{3}$$

where $\psi : \mathsf{T} \to \mathbb{R}$, $\phi : \mathsf{T} \to \mathbb{R}^d$ and $\bar{S} : \mathsf{T} \times \mathsf{X} \to \mathbb{R}^d$ are measurable functions. In addition, $\phi$ and $\psi$ are continuously differentiable on $\mathsf{T}$.

We also make the following assumptions:

A2 There exists a measurable, open and convex set $\mathsf{S} \subseteq \mathbb{R}^d$ such that, for any $s \in \mathsf{S}$, $\gamma \in [0, 1)$ and any $(\tau, x) \in \mathsf{T} \times \mathsf{X}$:

$$s + \gamma \left\{ \bar{S}(\tau; x) - s \right\} \in \mathsf{S}.$$


A3 $\{X_n, n \ge 1\}$ is a sequence of i.i.d. random variables with the same distribution as $X$, and the expectation $\mathbb{E}[\bar{S}(\tau; X)]$ exists for any $\tau \in \mathsf{T}$.

A4 For any $s \in \mathsf{S}$, the function $\theta \mapsto -\psi(\theta) + \langle s, \phi(\theta) \rangle$ admits a unique global maximizer on $\mathsf{T}$, denoted by $\bar{\theta}(s)$.

Seen as a function of $\theta$, $g(\cdot, x; \tau)$ is the sum of two functions: $-\psi$ and a linear combination of the components of $\phi = (\phi_1, \ldots, \phi_d)$. Assumption A1 implies that the minorizer surrogate lies in a functional space spanned by these $(d+1)$ functions. By (2) and A1-A3, it follows that

$$\mathbb{E}\left[f(\theta;X)\right] - \mathbb{E}\left[f(\tau;X)\right] \ge \psi(\tau) - \psi(\theta) + \left\langle \mathbb{E}\left[\bar{S}(\tau;X)\right], \phi(\theta) - \phi(\tau) \right\rangle,\tag{4}$$

thus providing a minorizer for the objective function $\theta \mapsto \mathbb{E}[f(\theta; X)]$. By A4, the usual MM algorithm would iteratively define the sequence $\theta_{n+1} = \bar{\theta}\left(\mathbb{E}\left[\bar{S}(\theta_n; X)\right]\right)$. Since the expectation may have no closed form while an unlimited stream of data is available (see A3), we propose a novel Online MM algorithm. It defines the sequence $\{s_n, n \ge 0\}$ as follows: given positive step sizes $\{\gamma_n, n \ge 1\}$ in $(0, 1)$ and an initial value $s_0 \in \mathsf{S}$, set for $n \ge 0$:

$$s_{n+1} = s_n + \gamma_{n+1} \left\{ \bar{S}\left(\bar{\theta}(s_n); X_{n+1}\right) - s_n \right\}. \tag{5}$$

The update mechanism (5) is a Stochastic Approximation iteration, which defines an $\mathsf{S}$-valued sequence (see A2). It consists of the construction of a sequence of minorizer functions through the definition of their *parameter* $s_n$ in the functional space spanned by $-\psi, \phi_1, \ldots, \phi_d$.
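The generic recursion (5) is only a few lines of code. In this sketch the callables `S_bar`, `theta_bar` and the step-size schedule `step` are supplied by the user; all names are our illustrative choices:

```python
import numpy as np

def online_mm(s0, S_bar, theta_bar, stream, step):
    """Generic Online MM iteration (5):
    s_{n+1} = s_n + gamma_{n+1} * (S_bar(theta_bar(s_n), X_{n+1}) - s_n)."""
    s = s0
    for n, x in enumerate(stream, start=1):
        s = s + step(n) * (S_bar(theta_bar(s), x) - s)
    return s
```

As a degenerate check, taking `S_bar(theta, x) = x`, `theta_bar(s) = s` and `step(n) = 1/n` turns (5) into an exact running mean of the stream.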

If algorithm (5) converges, any limit point $s_\star$ satisfies $\mathbb{E}\left[\bar{S}(\bar{\theta}(s_\star); X)\right] = s_\star$. Hence, our algorithm is designed to approximate the intractable expectation, evaluated at $\bar{\theta}(s_\star)$, where $s_\star$ satisfies a fixed-point equation. The following lemma establishes the relation between the limit points of (5) and the optimization problem (1): namely, any limit value $s_\star$ provides a stationary point $\theta_\star := \bar{\theta}(s_\star)$ of the objective function $\mathbb{E}[f(\theta; X)]$ (i.e., $\theta_\star$ is a root of the derivative of the objective function). The proof follows the technique of [3]. Set

$$h(s) := \mathbb{E}\left[\bar{S}\left(\bar{\theta}(s); X\right)\right] - s, \qquad \Gamma := \{s \in \mathsf{S} : h(s) = 0\}.$$

**Lemma 1** *Assume that* $\theta \mapsto \mathbb{E}[f(\theta; X)]$ *is continuously differentiable on* $\mathsf{T}$ *and denote by* $\mathcal{L}$ *the set of its stationary points. If* $s_\star \in \Gamma$*, then* $\bar{\theta}(s_\star) \in \mathcal{L}$*. Conversely, if* $\theta_\star \in \mathcal{L}$*, then* $s_\star := \mathbb{E}\left[\bar{S}(\theta_\star; X)\right] \in \Gamma$*.*

*Proof* A4 implies that

$$-\nabla \psi(\vec{\theta}(\mathbf{s})) + \nabla \phi(\vec{\theta}(\mathbf{s}))^\top \mathbf{s} = \mathbf{0}, \qquad \mathbf{s} \in \mathbb{S}. \tag{6}$$

Use (2) and A1, and take the expectation w.r.t. $X$ (under A3). This yields (4), which holds for any $\theta, \tau \in \mathsf{T}$. This inequality provides a minorizer of $\theta \mapsto \mathbb{E}[f(\theta; X)]$: the difference between the two sides is nonnegative and minimal (i.e., equal to zero) at $\theta = \tau$. Under the assumptions and A1, this yields

$$\nabla \mathbb{E}\left[f(\cdot;X)\right]|\_{\theta=\tau} + \nabla \psi(\tau) - \nabla \phi(\tau)^{\top} \mathbb{E}\left[\bar{S}(\tau;X)\right] = 0. \tag{7}$$

Let 𝑠★ ∈ Γ and apply (7) with 𝜏 ← 𝜃¯(𝑠★). It then follows that

$$\nabla \mathbb{E}\left[f(\cdot;X)\right]|\_{\theta=\tilde{\theta}(s\_\star)} + \nabla \psi(\tilde{\theta}(s\_\star)) - \nabla \phi(\tilde{\theta}(s\_\star))^\top s\_\star = 0,$$

which implies $\bar{\theta}(s_\star) \in \mathcal{L}$ by (6). Conversely, if $\theta_\star \in \mathcal{L}$, then by (7) we have

$$
\nabla \psi(\theta\_\star) - \nabla \phi(\theta\_\star)^\top \mathbb{E} \left[ \bar{S}(\theta\_\star; X) \right] = 0,
$$

which, by A3 and A4, implies that $\theta_\star = \bar{\theta}\left(\mathbb{E}\left[\bar{S}(\theta_\star; X)\right]\right) = \bar{\theta}(s_\star)$. By definition of $s_\star$, this yields $s_\star = \mathbb{E}\left[\bar{S}\left(\bar{\theta}(s_\star); X\right)\right]$; i.e., $s_\star \in \Gamma$.

By applying the results of [5] on the asymptotic convergence of Stochastic Approximation algorithms, additional regularity assumptions on $\phi$, $\psi$, $\bar{\theta}$ imply that algorithm (5) possesses a continuously differentiable Lyapunov function $V$ defined on $\mathsf{S}$ and given by $V : s \mapsto \mathbb{E}\left[f(\bar{\theta}(s); X)\right]$, satisfying $\langle \nabla V(s), h(s) \rangle \le 0$, where the inequality is strict outside the set $\Gamma$ (see [3, Prop. 2]). In addition to Lemma 1, assumptions on the distribution of $X$ and on the stability of the sequence $\{s_n, n \ge 0\}$ are provided in [5, Thm. 2 and Lem. 1]; combined with the usual conditions on the step sizes, $\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^2 < \infty$, they yield the almost-sure convergence of $\{s_n, n \ge 0\}$ to the set $\Gamma$, and the almost-sure convergence of $\{\bar{\theta}(s_n), n \ge 0\}$ to the set $\mathcal{L}$ of stationary points of the objective function $\theta \mapsto \mathbb{E}[f(\theta; X)]$. Due to limited space, the exact statement of these convergence results for our Online MM framework is omitted.

# **3 Example Application**

As an example, we consider the logistic regression problem, where we solve (1) with

$$f\left(\theta; x\right) := y\, w^\top \theta - \log\left\{1 + \exp\left(w^\top \theta\right)\right\}, \qquad x := (y, w),$$

where $y \in \{0, 1\}$, $w \in \mathbb{R}^p$, and $\theta \in \mathsf{T} := \mathbb{R}^p$. Here, we assume that $X = (Y, W)$ is a random variable such that $\mathbb{E}[f(\theta; X)]$ exists for each $\theta$.

Denote by $\lambda$ the standard logistic function $\lambda(\cdot) := \exp\{\cdot\}/(1 + \exp\{\cdot\})$. Following [1], (2) and A1 are verified by taking

$$\psi\left(\theta\right) := 0, \qquad \phi\left(\theta\right) := \begin{bmatrix} \theta \\ \operatorname{vec}\left(\theta\theta^{\top}\right) \end{bmatrix}, \qquad \bar{S}\left(\tau; x\right) = \begin{bmatrix} \bar{s}_1\left(\tau; x\right) \\ \operatorname{vec}\left(\bar{S}_2\left(\tau; x\right)\right) \end{bmatrix},$$

where

$$\bar{s}_1\left(\tau; x\right) := \left\{ y - \lambda\left(\tau^\top w\right) \right\} w + \frac{1}{4} w w^\top \tau, \qquad \bar{S}_2\left(\tau; x\right) := -\frac{1}{8} w w^\top.$$

With $\mathsf{S} := \left\{(s_1, \operatorname{vec}(S_2)) : s_1 \in \mathbb{R}^p \text{ and } S_2 \in \mathbb{R}^{p \times p} \text{ is symmetric negative definite}\right\}$, it follows that $\bar{\theta}(s) = -(2S_2)^{-1} s_1$.

**Online MM.** Let $s_n = (s_{1,n}, S_{2,n}) \in \mathsf{S}$. The corresponding Online MM recursion is then

$$\begin{aligned} s_{1,n+1} &= s_{1,n} + \gamma_{n+1} \left( \left\{ Y_{n+1} - \lambda\left( \bar{\theta}(s_n)^\top W_{n+1} \right) \right\} W_{n+1} + \frac{1}{4} W_{n+1} W_{n+1}^\top \bar{\theta}(s_n) - s_{1,n} \right), && (8) \\ S_{2,n+1} &= S_{2,n} + \gamma_{n+1} \left( -\frac{1}{8} W_{n+1} W_{n+1}^\top - S_{2,n} \right), && (9) \end{aligned}$$

where $\{(Y_{n+1}, W_{n+1}), n \ge 0\}$ are i.i.d. pairs with the same distribution as $X = (Y, W)$. Parameter estimates are then obtained by setting $\theta_{n+1} := \bar{\theta}(s_{n+1})$.
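Putting (8)-(9) together with $\bar{\theta}(s) = -(2S_2)^{-1}s_1$ gives a runnable sketch of the Online MM recursion for logistic regression. The initialization of the summary statistics and the step-size offset (keeping every $\gamma_{n+1}$ strictly below one) are our choices, not prescribed by the text:

```python
import numpy as np

def theta_bar(s1, S2):
    # theta_bar(s) = -(2 S_2)^{-1} s_1
    return np.linalg.solve(-2.0 * S2, s1)

def online_mm_logistic(Y, W, gamma_exponent=0.6):
    """Online MM recursion (8)-(9) for logistic regression (a sketch)."""
    n, p = W.shape
    s1 = np.zeros(p)
    S2 = -0.125 * np.eye(p)      # crude negative-definite initialization
    lam = lambda u: 1.0 / (1.0 + np.exp(-u))
    for i in range(n):
        g = (i + 2) ** (-gamma_exponent)   # step sizes in (0, 1)
        th = theta_bar(s1, S2)
        w, y = W[i], Y[i]
        s1 = s1 + g * ((y - lam(th @ w)) * w
                       + 0.25 * np.outer(w, w) @ th - s1)
        S2 = S2 + g * (-0.125 * np.outer(w, w) - S2)
    return theta_bar(s1, S2)
```

Since each update forms a convex combination of a negative definite matrix and a negative semi-definite one, $S_{2,n}$ stays negative definite, so $\bar{\theta}(s_n)$ is always well defined.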

For comparison, we also consider two Stochastic Approximation schemes directly on 𝜃 in the parameter-space: a stochastic gradient (SG) algorithm and a Stochastic Newton Raphson (SNR) algorithm.

**Stochastic gradient.** SG requires the gradient of $f(\theta; x)$ with respect to $\theta$, $\nabla f(\theta; x) = \{y - \lambda(\theta^\top w)\}\, w$, which leads to the recursion

$$
\hat{\theta}\_{n+1} = \hat{\theta}\_n + \gamma\_{n+1} \left\{ Y\_{n+1} - \lambda (\hat{\theta}\_n^\top W\_{n+1}) \right\} W\_{n+1}. \tag{10}
$$

**Stochastic Newton-Raphson.** In addition, SNR requires the Hessian with respect to $\theta$, given by $\nabla^2_{\theta\theta} f(\theta; x) = -\lambda(\theta^\top w)\left\{1 - \lambda(\theta^\top w)\right\} w w^\top$. The SNR recursion is then

$$
\hat{A}\_{n+1} = \hat{A}\_n + \gamma\_{n+1} \left\{ \nabla\_{\theta\theta}^2 f(\hat{\theta}\_n; X\_{n+1}) - \hat{A}\_n \right\} \tag{11}
$$

$$G\_{n+1} = -\hat{A}\_{n+1}^{-1} \tag{12}$$

$$
\hat{\theta}\_{n+1} = \hat{\theta}\_n + \gamma\_{n+1} G\_{n+1} \left\{ Y\_{n+1} - \lambda (\hat{\theta}\_n^T W\_{n+1}) \right\} W\_{n+1} \,. \tag{13}
$$

Equation (12) assumes that $\hat{A}_{n+1}$ is invertible. In this logistic example, this can be guaranteed by choosing $\hat{A}_0$ to be invertible; otherwise, $\hat{A}_n$ is invertible for all $n$ sufficiently large, with probability one. Again in the logistic case, observe that, from the structure of $\nabla^2_{\theta\theta} f$ and the Woodbury matrix identity, Equations (11)-(12) can be replaced by

$$G_{n+1} = \frac{G_n}{1 - \gamma_{n+1}} - \frac{\gamma_{n+1}}{1 - \gamma_{n+1}} \frac{a_{n+1}\, G_n W_{n+1} W_{n+1}^\top G_n}{\left\{ (1 - \gamma_{n+1}) + \gamma_{n+1} a_{n+1} W_{n+1}^\top G_n W_{n+1} \right\}}\,,$$

where $a_{n+1} := \lambda(\hat{\theta}_n^\top W_{n+1})\left\{1 - \lambda(\hat{\theta}_n^\top W_{n+1})\right\}$.
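The rank-one update above can be checked numerically against the direct inverse of $A_{n+1} = (1-\gamma_{n+1})A_n - \gamma_{n+1} a_{n+1} W_{n+1}W_{n+1}^\top$; the helper name is ours:

```python
import numpy as np

def woodbury_gain_update(G, w, a, gamma):
    """Rank-one (Sherman-Morrison) update of G_{n+1} = -A_{n+1}^{-1},
    avoiding an explicit matrix inversion at each step."""
    Gw = G @ w
    denom = (1.0 - gamma) + gamma * a * (w @ Gw)
    return G / (1.0 - gamma) - (gamma / (1.0 - gamma)) * a * np.outer(Gw, Gw) / denom
```

This replaces the $O(p^3)$ inversion in (12) with an $O(p^2)$ update.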

It appears that the Online MM recursion in the $s$-space defined by (8) and (9) is equivalent to the SNR recursion above (i.e., (11)-(13)) when the Hessian $\nabla^2_{\theta\theta} f(\theta; x)$ is replaced by its lower bound $-\frac{1}{4} w w^\top$. This observation holds whenever $g$ is quadratic in $(\theta - \tau)$.

**Polyak averaging.** In practice, for the Online MM, SG, and SNR recursions, it is common to apply Polyak averaging [21], starting from some iteration $n_0$ chosen so as to discard the initial, highly volatile estimates. Set $\hat{\theta}^A_{n_0} := 0$, and for $n \ge n_0$,

$$
\hat{\theta}\_{n+1}^A = \hat{\theta}\_n^A + \alpha\_{n-n\_0+1} (\hat{\theta}\_n - \hat{\theta}\_n^A),
\tag{14}
$$

where $\alpha_n$ is usually set to $\alpha_n := n^{-1}$.
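With $\alpha_n = n^{-1}$, recursion (14) is simply a running mean of the iterates from $n_0$ onward. A minimal sketch (function name ours):

```python
import numpy as np

def polyak_average(thetas, n0):
    """Running Polyak average (14) of the iterates thetas[n0:] with
    alpha_n = 1/n; returns the final averaged estimate."""
    avg = np.zeros_like(thetas[0], dtype=float)
    for k, th in enumerate(thetas[n0:], start=1):
        avg = avg + (th - avg) / k
    return avg
```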

**Numerical illustration.** We now demonstrate the performance of the Online MM algorithm for logistic regression, defined by (5) and the derivations above. To do so, a sequence $\{X_i = (Y_i, W_i), i \in \{1, \ldots, n_{\max}\}\}$ of $n_{\max} = 10^5$ i.i.d. replicates of $X = (Y, W)$ is simulated: $W = (1, U)$, where $U \sim \mathcal{N}(0, 1)$ and $[Y \mid W = w] \sim \operatorname{Ber}\left(\lambda(\theta_0^\top w)\right)$, with $\theta_0 = (3, -3)$. Online MM is run with the learning rate $\gamma_n = n^{-0.6}$, as suggested in [3]. The algorithm is initialized with $\hat{\theta}_0 = (0, 0)$ and $s_0 = \sum_{i=1}^{2} \bar{S}(\hat{\theta}_0; X_i)/2$.

For comparison, we also show, in Figure 1, the SG and SNR estimates and their Polyak-averaged values in $\theta$-space. As is usually recommended with Stochastic Approximation, the first few volatile estimates are discarded; similarly, for Polyak averaging, we set $n_0 = 10^3$. As expected, we observe that the Online MM and SNR recursions are very close, with SNR showing more variability. After Polyak averaging, their trajectories are nearly indistinguishable, while the SG trajectory is clearly different and shows more bias. Final estimates [Polyak-averaged estimates] of $\theta_0$ from the SG, SNR, and Online MM algorithms are respectively $(2.67, -2.66)$ $[(2.51, -2.48)]$, $(3.03, -3.03)$ $[(2.99, -3.03)]$, and $(3.01, -3.03)$ $[(2.98, -3.02)]$, which can be compared to the batch maximum likelihood estimate $(3.00, -3.05)$ (obtained via the glm function in R). Notice the remarkable closeness between the Online MM and batch estimates.

**Fig. 1** Logistic regression example: the first row shows Online MM (black), SG (blue), and SNR (red) recursions. The second row shows the respective Polyak averaging recursions. The estimates of the first 𝜃 (first column) and the second (second column) components of 𝜃 are plotted started from 𝑛 = 10<sup>3</sup> for readability.

# **4 Final Remarks**

*Remark 1* For a parametric statistical model indexed by $\theta$, let $f(\theta; x)$ be the log-density of a random variable $X$ with stochastic representation $f(\theta; x) = \log \int_{\mathsf{Y}} p_\theta(x, y)\, \mu(\mathrm{d}y)$, where $p_\theta(x, y)$ is the joint density of $(X, Y)$ with respect to the positive measure $\mu$, for some latent variable $Y \in \mathsf{Y}$. Then, via [15, Sec. 4.2], we recover the Online EM algorithm by using the minorizer function $g$:

$$g\left(\theta, x; \tau\right) := \int_{\mathsf{Y}} \log p_\theta\left(x, y\right)\, p_\tau(x, y) \exp(-f(\tau; x))\, \mu(\mathrm{d}y)\,.$$

*Remark 2* Via the minorization approach of [1] (as used in Section 3) and the mixture representation from [19], we can construct an Online MM algorithm for MoE models, analogous to the MM algorithm of [20]. We shall provide exposition on such an algorithm in future work.

**Acknowledgements** Part of the work by G. Fort is funded by the *Fondation Simone et Cino Del Duca, Institut de France.* H. Nguyen is funded by ARC Grant DP180101192. The work is supported by Inria project LANDER.

# **References**



# **Detecting Differences in Italian Regional Health Services During Two Covid-19 Waves**

Lucio Palazzo and Riccardo Ievoli

**Abstract** During the first two waves of the Covid-19 pandemic, territorial healthcare systems were severely stressed in many countries. The availability (and complexity) of data requires proper comparisons for understanding differences in the performance of health services. We apply a three-step approach to compare the performance of the Italian healthcare system at the territorial level (NUTS 2 regions), considering daily time series regarding both intensive care units and ordinary hospitalizations of Covid-19 patients. Changes between the two waves at the regional level emerge from the main results, allowing us to map the pressure on territorial health services.

**Keywords:** regional healthcare, time series, multidimensional scaling, cluster analysis, trimmed 𝑘-means

# **1 Introduction**

During the Covid-19 pandemic, the evaluation of similarities and differences between territorial health services [23] is relevant for decision makers and should guide the governance of countries [15] through the so-called "waves". This type of analysis becomes even more crucial in countries whose national healthcare system is regionally based, as is the case in Italy (or Spain), among others. Italy is one of the European countries most affected by the pandemic, and the pressure on Regional Health Services (RHS) has produced dramatic effects also in the economic [2] and social [3] spheres.

© The Author(s) 2023 273

Lucio Palazzo ()

Department of Political Sciences, University of Naples Federico II, via Leopoldo Rodinò 22 - 80138 Napoli, Italy, e-mail: lucio.palazzo@unina.it

Riccardo Ievoli

Department of Chemical, Pharmaceutical and Agricultural Sciences, University of Ferrara, via Luigi Borsari 46 - 44121 Ferrara, Italy, e-mail: riccardo.ievoli@unife.it

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_30

Regional Covid-19-related health indicators are extremely relevant for monitoring the territorial spread of the pandemic [21], and for imposing (or relaxing) restrictions in accordance with the level of health risk.

The aim of this work is to exploit the potential of Multidimensional Scaling (MDS) to detect the main imbalances that occurred in the RHSs, observing the hospital admission dynamics of patients with Covid-19. Both the daily time series of patients treated in Intensive Care (IC) units and of individuals hospitalized in other hospital wards are used to evaluate and compare the reaction to healthcare pressure in 21 geographical areas (NUTS 2 Italian regions), considering the first two waves [4] of the pandemic. Indeed, territorial imbalances in terms of RHS performance [24] should first be driven by the geographical propagation of the virus (first wave). Then, RHSs may react differently to the pandemic shock, and changes in the imbalances can be observed in the second wave.

Our proposal consists of three subsequent steps. First, a matrix of distances between regional time series is obtained through a dissimilarity metric [29]. Then, we apply a (weighted) MDS [19, 22] to map similarity patterns in a reduced space, adding a weighting scheme based on the number of neighbouring regions. Finally, we perform a cluster analysis to identify groups according to RHS performance in the two waves.

The paper is organized as follows: Section 2 describes the methodological approach used to compare and cluster time series, while Section 3 introduces the data and descriptive analysis. Results regarding the RHSs are presented and discussed in Section 4, while Section 5 concludes with some remarks and possible extensions.

# **2 Time Series Clustering**

Given a 𝑇 × 𝑛 matrix, where 𝑇 is the number of days and 𝑛 the number of regions, our methodological approach consists of three subsequent steps:

1. computation of a dissimilarity measure for each pair of regional time series;
2. (weighted) multidimensional scaling of the resulting dissimilarity matrix;
3. cluster analysis on the coordinates of the reduced space.

In the first step, a dissimilarity measure is computed for each pair of regional time series. The objective is to obtain a dissimilarity matrix 𝐷 (with elements 𝑑<sub>𝑖 𝑗</sub>) that provides synthetic measures of the differences between regions. There are several alternatives for comparing time series; comprehensive overviews can be found in [29, 13].

A reasonable choice is the Fourier dissimilarity 𝑑<sup>𝐹</sup> (**x**, **y**), which applies the 𝑛-point Discrete Fourier Transform [1] to the two time series, allowing their similarity to be compared after converting each series into a combination of structural elements, such as trend and/or cycle.
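As an illustration, a Fourier-type dissimilarity can be computed by comparing truncated DFT coefficient vectors. The sketch below (the function name `fourier_dist` and the truncation level `m` are our own choices, not the exact metric of [1]) keeps the first `m` coefficients of each series and takes the Euclidean distance between them:

```python
import numpy as np

def fourier_dist(x, y, m=10):
    """Euclidean distance between the first m DFT coefficients
    of two equal-length series (a truncated Fourier dissimilarity)."""
    fx = np.fft.rfft(x)[:m]
    fy = np.fft.rfft(y)[:m]
    return float(np.sqrt(np.sum(np.abs(fx - fy) ** 2)))

# Toy check: a noisy copy of a series is closer to it than a phase-shifted one.
t = np.arange(109)                       # one wave = 109 days, as in the paper
a = np.sin(2 * np.pi * t / 109)
b = a + 0.01 * np.random.default_rng(1).normal(size=t.size)
c = np.cos(2 * np.pi * t / 109)
```

A full 𝑛 × 𝑛 dissimilarity matrix 𝐷 is then obtained by evaluating this function for every pair of regional series.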

In the second step, we apply multidimensional scaling [31]. Due to its flexibility, MDS has also been introduced in time series analysis [25] and has recently been applied to a variety of topics [30, 9, 16].

Since our aim is to take into account the degree of proximity between regions, we also employ a weighted multidimensional scaling technique (wMDS) [17, 14]. The L<sup>2</sup> norm is multiplied by a set of weights 𝝎 = (𝜔<sub>1</sub>, . . . , 𝜔<sub>𝑛</sub>) such that observations with high weights have a stronger influence on the result than those with low weights.
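One concrete way to realize such a weighting is to scale each distance 𝑑<sub>𝑖 𝑗</sub> by √(𝜔<sub>𝑖</sub> 𝜔<sub>𝑗</sub>) before running classical (Torgerson) MDS. The sketch below implements this under that assumption; it is one plausible reading of a weighted MDS, not necessarily the exact variant of [17, 14]:

```python
import numpy as np

def weighted_classical_mds(D, w, k=2):
    """Classical (Torgerson) MDS on a weighted distance matrix.
    Each d_ij is scaled by sqrt(w_i * w_j), so heavily weighted units
    influence the low-dimensional configuration more strongly."""
    w = np.asarray(w, dtype=float)
    Dw = D * np.sqrt(np.outer(w, w))
    n = Dw.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (Dw ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:k]           # k largest eigenpairs
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

# With unit weights this is ordinary classical MDS: a planar configuration
# is recovered exactly (up to rotation and reflection).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
Z = weighted_classical_mds(D, np.ones(4), k=2)
```

With non-uniform weights, units with many neighbours are pulled into a more influential position in the reduced space, which is the spatial feature exploited in Section 4.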

The reduced space generated by MDS can be used as a starting point for subsequent analyses: a clustering algorithm can then be run on the coordinates of the reduced space [18]. Several procedures may be suitable for clustering the wMDS coordinate map; for an overview of modern time series clustering techniques, see e.g. [26].

In our case, both the geographical spread of the pandemic and population density can determine remarkable differences in hospitalization rates [12]. To mitigate the risk that regional outliers in the data generate potential *spurious* clusters, we employ the trimmed 𝑘-means algorithm [8, 11]. A relevant issue in cluster analysis is the choice of the number of groups 𝑘. Our strategy is purely data-driven and based on the minimization of the within-cluster variance.
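A bare-bones version of trimmed 𝑘-means can be written as follows. This is only a naive sketch of the idea behind [8, 11] (the initialization, iteration count and trimming rule are our simplifications, not the reference implementation): at each Lloyd step, the ⌈𝛼𝑛⌉ points farthest from their nearest center are discarded before the centers are recomputed.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, init=None):
    """Naive trimmed k-means: at every Lloyd step, discard the
    ceil(alpha*n) points farthest from their nearest center, then
    update the centers from the retained points only."""
    n = X.shape[0]
    keep_n = n - int(np.ceil(alpha * n))
    idx0 = list(range(k)) if init is None else list(init)
    centers = X[idx0].astype(float).copy()      # naive init (k-means++ is better)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        nearest = d.argmin(axis=1)
        kept = np.argsort(d.min(axis=1))[:keep_n]   # trim the farthest points
        for j in range(k):
            pts = X[kept[nearest[kept] == j]]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, nearest, kept

# Two tight clusters plus two gross outliers: the outliers are trimmed,
# so they do not drag the cluster centers away.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2)),
               [[50.0, 50.0], [-40.0, 60.0]]])
centers, labels, kept = trimmed_kmeans(X, k=2, alpha=0.05, init=[0, 50])
```

The number of groups 𝑘 can then be selected by rerunning the procedure over a grid of 𝑘 values and minimizing the within-cluster variance of the retained points, mirroring the data-driven strategy described above.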

# **3 Data and Descriptive Statistics**

Daily regional time series reporting a) the number of patients treated in IC units and b) the number of patients admitted to other hospital wards are retrieved from the official website of the Italian Civil Protection<sup>1</sup>. All patients tested positive for Covid-19 (nasal and oropharyngeal swab). To take into account the different population sizes, both a) and b) are normalized by the population of each territorial unit (estimated at 2020/01/01). The rates of patients treated in IC units and of hospitalized (HO) patients in other hospital wards are then multiplied by 100,000.

The whole dataset contains two identified waves<sup>2</sup> of Covid-19, as follows:

Wave 1 (W1): 𝑇 = 109 days, from February 24 to June 11, 2020
Wave 2 (W2): 𝑇 = 109 days, from September 14 to December 31, 2020

The observed dynamics may also depend on external factors, such as the restrictive measures introduced by the Italian Government [27, 6], which influenced the observed differences between W1 and W2. We remark that a full national lockdown was in force between March 9 and May 18, 2020.

Figure 1 shows the time series for HO and IC (rows), according to the two waves of Covid-19 (columns). The anomaly of the small region Valle D'Aosta emerges both in the first wave (particularly for IC) and in the second (also for HO), while Lombardia, the largest and most populous region, dominates the other territories, especially for HO in W1.

<sup>1</sup> Source: www.dati-covid.italia.it

<sup>2</sup> Refer to [7] for further details.

**Fig. 1** Time series distributions of Italian regions.

The upper panels of Figure 1 help to understand the differences between the two waves in terms of admissions to intensive care: while regions with high, medium and low IC rates can be identified by visual inspection of the series during W1, more homogeneity is observed in W2. Furthermore, with the exception of Valle D'Aosta, the IC rate always remains below 10 for all considered observations.

As for the HO rate (lower panels of Figure 1), Lombardia reaches values greater than 100 in W1 (especially in April), while during W2 this threshold was exceeded by Valle D'Aosta and Piemonte (both in November). Again, while W1 contrasts regions with high and (moderately) low HO rates, in W2 the following situation arises: a) Valle D'Aosta and Piemonte reach values over 100, b) four regions (Liguria, Lazio, P.A. Trento and P.A. Bolzano) present values over 75, and c) the majority of territories share similar trends, with peaks always lower than 75.

# **4 Grouping Regions by Clustering and Discussion**

In order to confirm and deepen the descriptive results of Section 3, we perform a cluster analysis following the scheme proposed in Section 2. We compute the wMDS equipped with the Fourier distance<sup>3</sup>, using a set of weights 𝝎 proportional to the number of neighbouring regions, thus incorporating a spatial feature into the model.

Figure 2 displays the main results of the wMDS, distinguishing between four levels of critical issues experienced by the RHSs. Outlying performances are coloured in **Violet**. A first cluster (in **Red**) includes "critical" regions, while the group depicted in **Orange** contains territories with high pressure on their RHS. Regions in the **Green** cluster experienced moderate pressure on their RHS, while **Blue** indicates territories under low pressure. These clusters may also be interpreted as a ranking of health service risk.

As regards HO during W1, leaving aside the two outliers (Lombardia and P.A. Bolzano), the "red" cluster is composed of three Northern regions (Piemonte, Valle d'Aosta and Emilia-Romagna). The high-pressure group is composed of Liguria, Marche and P.A. Trento, while the green cluster involves Lazio, Abruzzo and Toscana (from the centre of Italy) and Veneto. The last group includes nine regions, 7 of which are located in southern Italy. In W2 the clustering procedure identifies Piemonte and Valle d'Aosta as outliers, while the high-pressure group is composed of the two autonomous provinces (Trento and Bolzano), Lombardia and Liguria. The "orange" group consists of regions located in the North-East (Friuli-Venezia Giulia, Emilia-Romagna and Veneto), along with Abruzzo and Lazio. Southern regions are allocated to the "green" group (together with Umbria, Toscana and Marche), while Molise, Calabria and Basilicata remain in the low-pressure cluster.

Regarding IC rates, during W1 Lombardia and Valle d'Aosta are considered outliers, while the "red" cluster is composed of four northern regions (Emilia-Romagna, P.A. Trento, Piemonte and Liguria) and Marche (located in the centre). The "orange" cluster contains Toscana, Veneto and P.A. Bolzano, while the moderate-pressure cluster involves three areas of central Italy (Lazio, Umbria and Abruzzo), along with Friuli-Venezia Giulia (from the north-east). The last cluster includes only regions from the south. According to the bottom right panel of Figure 2, apart from Valle D'Aosta, the procedure identifies Calabria as an outlier in W2. The "red" group acquires two observations from the centre of Italy, Toscana and Umbria, while the majority of regions are classified in the moderately pressured group. Only three southern areas are allocated to the last group (in green).

If the geography of the disease appears fundamental in W1, especially for the territories adjoining Lombardia, in W2 this effect is less evident. Thus, regions improving (e.g. Emilia-Romagna) or worsening (such as Lazio and Abruzzo) their clustering "ranking" can easily be observed. As mentioned, the differences between the restrictive measures imposed by the Government in the two periods may have played a role in these results.

<sup>3</sup> We remark that other distance measures were also applied; a) the Fourier distance shows the best performance in terms of goodness of fit, and b) the results are not sensitive to the choice of distance.

**Fig. 2** Map of the identified regional clusters.

# **5 Concluding Remarks**

The Covid-19 pandemic has put a strain on the Italian healthcare system. The reactions of the RHSs play a relevant role in mitigating the health crisis at the territorial level and in guaranteeing equitable access to healthcare.

This work helps to understand similarities and divergences between the Italian regions in relation to the health pressure of the first two waves of the virus. Considering crucial measures such as the HO and IC rates, the comparison between the two waves allows us to understand differences in the RHSs' reactions to pandemic shocks. Although northern Italy was the epicentre of the Covid-19 spread in the first wave, some of its regions (e.g. Veneto and Friuli-Venezia Giulia) seem to have succeeded in avoiding hospital overcrowding, while southern regions (and the islands) definitely suffered less pressure. In the second wave, the differences appear somewhat smoothed and the cluster sizes are more homogeneous. There are, however, some exceptions, such as Emilia-Romagna, which seems to have been less affected by the second wave compared to the other regions. The detection of clusters represents a starting point for the improvement of health governance and can be used to monitor potential imbalances in future unfortunate waves.

Further analyses may employ other dedicated indicators coming, for instance, from the Italian National Institute of Statistics<sup>4</sup>, or use different proposals for combining wMDS with dissimilarity measures and clustering [28]. Following a different methodological approach, the recent method proposed in [10] could be applied to these data to include more complex spatial relationships between territories.

# **References**


<sup>4</sup> See, for example, the BES indicators of the domains "Health" and "Quality of services": https://www.istat.it/it/files//2021/03/BES\\_2020.pdf


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Political and Religion Attitudes in Greece: Behavioral Discourses**

Georgia Panagiotidou and Theodore Chadjipadelis

**Abstract** The research presented in this paper attempts to explore the relationship between religious and political attitudes. More specifically, we investigate how religious behavior, in terms of belief intensity and practice frequency, is related to specific patterns of political behavior such as ideology, the understanding of democracy and one's set of moral values. The analysis is based on multivariate methods, more specifically Hierarchical Cluster Analysis and Multiple Correspondence Analysis in two steps. The findings are based on a survey implemented in 2019 on a sample of 506 respondents in the wider area of Thessaloniki, Greece. The aim of the research is to highlight the role of people's religious practice intensity in shaping their political views by displaying the profiles resulting from the analysis and linking individual religious and political characteristics as measured by various variables. The final output of the analysis is a map on which all variable categories are visualized, bringing forward models of political behavior associated with other factors such as religion, moral values and democratic attitudes.

**Keywords:** political behavior, religion, democracy, multivariate methods, data analysis

# **1 Introduction**

In this research we present the results of a survey administered in April 2019 to 506 respondents in Thessaloniki, focusing on their religious profile as well as their political attitudes, their moral profile and the way they comprehend democracy. The aim of the analysis is to investigate and highlight the role of religious practice in shaping political behavior.

© The Author(s) 2023 283 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_31

Georgia Panagiotidou ()

Aristotle University of Thessaloniki, Greece, e-mail: gvpanag@polsci.auth.gr

Theodore Chadjipadelis Aristotle University of Thessaloniki, Greece, e-mail: chadji@polsci.auth.gr

In the field of political behavior analysis, religion, and more specifically church practice, has emerged as one of the main pillars shaping the political attitudes of voters. Religious habits appear to have a decisive influence on electoral choices, as shown by Lazarsfeld's research at Columbia University in 1944 [3], followed by the work of Butler and Stokes in 1969 [1] and the research of Michelat and Simon in France [6]. More specifically, the comparative study of Rose in 1974 [9] shows that more religious voters tend to be more conservative, placing themselves on the right side of the ideological "left-right" axis, while non-religious voters opt for the left parties. The research of Michelat and Simon [6] brings to the surface two opposing cultural models: on the one hand, the deeply religious voters, who belong to the middle and upper classes, residing in the cities or in the countryside; on the other hand, the non-religious left voters with working-class characteristics. The first framework is articulated around religion: those who belong to it identify themselves as religious people and are inspired by a conservative value system that puts the family, ancestral heritage and tradition before the value of the individual. The second cultural context is articulated around class rivalries and socio-economic realities; those who belong to it identify themselves as "us workers against the others". They believe in the values of collective action, vote for left-wing parties, participate actively in unions and defend the interests of the working class. To measure the influence of religious practice on political behavior, applied research uses measurement scales for the intensity of religious beliefs and the frequency of church attendance as indicators of the level of one's religious integration.

To measure the level of religious intensity, variables are used such as how often one attends church services, and how strongly one believes in the existence of God, in the afterlife, in the dogmas of the church, and so on. Since the 1990s there has been a rapid decline in the frequency with which the population attends church services or self-identifies strongly as religious. Nevertheless, the correlation between electoral preference and religious practice remains strong [5]. The most significant change for non-religious people is that the left is losing its universal influence among them, as many of these voters also drift towards the center. Strongly religious people continue to support the right more and, in some cases, strengthen the far right. In this paper, apart from attempting to explore and verify the existing literature on the effect of religion on political behavior in the Greek case, the approach exploits methods that visualize all existing relationships between different sets of variables. To link numerous variables and their categories together and construct a model of religious and political behavior, multiple applications of Hierarchical Cluster Analysis (HCA) are made, followed by Multiple Correspondence Analysis (MCA) on the emerging clusters. In this way, a semantic map is constructed [7], which visualizes discourses of political and religious behavior and the inner antagonisms between the behavioral profiles.

# **2 Methodology**

For the implementation of the research, a poll was conducted on a random sample of 506 people in the greater area of Thessaloniki, Greece, during April 2019. The research tool was a questionnaire, administered on site to randomly approached respondents. The questionnaire consisted of three sections: a) the first section included seven questions on demographic data of the respondent, such as gender, age, educational level, marital status, household income, occupation and the social class to which the respondent considers they belong; b) the second part contained seven questions, ordinal variables, related to the religious practice and beliefs of the respondent: i) how often does one go to church? ii) how often does one pray? iii) how close does one feel to God, the Virgin Mary (or to another seven religious concepts) during church service? iv) how strongly does one experience seven different feelings during church service? v) does one believe or not in the saints, miracles, prophecies (and another six religious concepts)? Two more questions investigated the respondents' profile in terms of what is taught in the Christian dogma: vi) one asking whether one can progress only by being an ethical person, and vii) another asking whether they agree with the pain/righteousness scheme, that is, that one who suffers in this life will be rewarded later or in the afterlife; c) questions concerning the political profile of the respondent are developed in the third part of the questionnaire: i) one's self-positioning on the ideological left-right axis, ii) a set of nine ordinal variables requiring one's level of agreement or disagreement with sentences that reflect the dimensions of liberalism-authoritarianism and left-right, iii) this last section also includes two different sets of pictures, used as symbolic representations of the "democratic self" and the "moral self" [4].
The first set of twelve pictures represents various conceptualizations of democracy, and one is asked to select three pictures that represent democracy. The second set of pictures represents moral values in life, and one is asked to choose three pictures that represent one's set of personal values. Variables are ordinal, using a five-point Likert scale, apart from the question regarding whether one believes or not in prophecies, magic etc., and the two last questions with the pictures, where we use a binary yes-no (one-zero) scale, where zero stands for a non-selected picture and one for a selected picture.

Data analysis was implemented with the M.A.D software (Méthodes d'Analyse des Données), developed by Professor Dimitris Karapistolis (more about the M.A.D software at www.pylimad.gr). First, Hierarchical Cluster Analysis (HCA) using the chi-square distance and Ward's linkage assigns subjects to distinct groups based on their response patterns. This first step produces a cluster membership variable, assigning each subject to a group. In addition, the behavior typology of each group is examined by testing the connection of each variable level to each cluster with a two-proportion 𝑧-test (significance level set at 0.05) between respondents belonging to cluster 𝑖 and those not belonging to cluster 𝑖. The number of clusters is determined using the empirical criterion of the change in the ratio of between-cluster inertia to total inertia when moving from a partition with 𝑟 clusters to a partition with 𝑟 − 1 clusters [8]. In the second step of the analysis, the cluster membership variable is analyzed together with the existing variables using MCA on the Burt table [2]. All associations among the variable categories are given on a set of orthogonal axes, with the least possible loss of the information in the original Burt table. Next, we apply HCA to the coordinates of the variable categories on the total number of dimensions of the reduced space resulting from the MCA. In this way we cluster the variable categories, as we previously clustered the subjects. By clustering the variable response categories, we detect the various discourses of behavior, where each cluster of categories stands as a behavioral profile linked with a set of responses and characteristics. To produce the final output, the semantic map, we created a table including the output variables of the questionnaire, including demographics and variables for political behavior.
Applying the same two-step HCA and MCA procedure to this final table, the semantic map is constructed, positioning the variable categories on a biplot created by the first two dimensions of the MCA.
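The first step of this pipeline (hierarchical clustering of respondents followed by per-level two-proportion 𝑧-tests) can be sketched as follows. This is a generic illustration on toy 0/1 data, not the M.A.D implementation; in particular, scipy's `ward` linkage works on Euclidean distances, whereas the paper uses the chi-square distance, which would have to be precomputed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import norm

# Toy 0/1 response matrix: 12 respondents x 4 items (illustrative data only).
X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0],
              [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]], float)

# Step 1: HCA with Ward's linkage, cut into two groups.
Z = linkage(X, method="ward")
membership = fcluster(Z, t=2, criterion="maxclust")

def two_prop_z(in_cluster, out_cluster):
    """Two-proportion z-test for one variable level: share of positive
    answers inside cluster i versus outside it."""
    n1, n2 = len(in_cluster), len(out_cluster)
    p1, p2 = in_cluster.mean(), out_cluster.mean()
    p = (in_cluster.sum() + out_cluster.sum()) / (n1 + n2)  # pooled proportion
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))    # z statistic and two-sided p-value

mask = membership == 1
z, pval = two_prop_z(X[mask, 0], X[~mask, 0])
```

For each cluster and each variable level, a small p-value flags the levels that characterize the cluster, which is how the behavior typologies described above are profiled.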

# **3 Results**

In the first step of the analysis, we apply HCA to each set of variables in each question. For the question "How close do you feel during the service: 1-To God, 2-To the Virgin, 3-To Christ, 4-To some Saint or Angel, 5-To the other churchgoers, 6-To Paradise, 7-To Hell, 8-To the divine service, 9-To the preaching priest", we get four clusters (Figure 1).


**Fig. 1** Four clusters on how close the respondents feel during church service.

For the question: "How strongly do you feel, after the end of the service: 1-The Grace of God in me, 2-Power of the soul, 3-Forgiveness for those who have hurt me, 4-Forgiveness for my sins, 5-Peace, 6-Relief it is over", we get six clusters (Figure 2).


**Fig. 2** Six clusters on how the respondents feel at the end of church service.

We obtain five clusters (Figure 3) for the question: "Do you believe in 1-Bad (magic influence)? 2-Magic? 3-Destiny? 4-Miracles? 5-Prophecies of the Saints? 6-Do you have pictures of holy figures in your house? 7-In your workplace? 8-Do you have a family Saint?".


**Fig. 3** Five clusters on the beliefs of the respondents on various aspects of the Christian faith.

Six clusters are detected (Figure 4) for the question: "How do you feel when you come face to face with a religious image 1-Peace, 2-Awe, 3-The presence of God, 4-Emotion, 5-The need to pray, 6-Contact with the person in the picture".


**Fig. 4** Six clusters on how the respondents feel when facing a religious image.

We proceed with the clustering of the replies on political views and we get seven clusters of political profiles (Figure 5).


**Fig. 5** Seven clusters according to the political views- profile of the respondents.

For the symbolic representation of the democratic self, where respondents choose three pictures that represent democracy, we find eight clusters (Figure 6), and eight clusters for the symbolic representation of the moral self, as shown in Figure 7.


**Fig. 6** Eight clusters on how the respondents understand democracy.


**Fig. 7** Eight clusters on the different sets of moral values of the respondents.

In the second step of the analysis, we jointly process the cluster membership variables. MCA produces the coefficients of each variable category, which are positioned in a two-dimensional map as seen in Figure 9. HCA is then applied again to the coefficients of the items, which brings forward three main clusters modeling political and religious behavior. In Figure 8, Cluster 77 is connected to the centre and moderate religious behaviour, cluster 78 reflects the voters of the right, with strong religious habits and beliefs, individualistic attitudes and more authoritarian and nationalistic political views, whereas cluster 79 represents the leftist, non-religious voters, closer to revolutionary political views and collective goods. Examining the antagonisms on the behavioral map (Figure 9), the first (horizontal) axis, which explains 22.8% of the total inertia, is created by the antithesis between right political ideology with strong religious behavior and left political ideology with no religious behavior (cluster 78 opposite cluster 79). The second (vertical) axis accounts for 7% of the inertia and is explained as the opposition between the center (moderate religious behavior) and both the left and the right (cluster 77 opposite clusters 78 and 79).


**Fig. 8** Three main behavioral discourses linking all variable categories together.

**Fig. 9** The semantic map visualizing the behavioral profiles of voters, and the inner antagonisms.

# **4 Discussion**

The analysis uncovers the strong relationship between religious habits and political views in the Greek case. The semantic map indicates two main antagonistic cultural discourses, each including religious, political and moral characteristics. The first discourse (cluster 77) is described by moderate religious practice and beliefs, connected to the ideological center. These voters have political attitudes that lie between the center-left and the center-right. They associate democracy with money, direct democracy and electronic democracy. Their moral set of values is naturalistic and individualistic. The next behavioral discourse (cluster 78) describes the voters of the right, with strong religious beliefs and frequent religious practice. They appear as very ethical and believe in the concept of pain and righteousness. Regarding their political attitudes, these more religious voters are against violence and hold more authoritarian and nationalistic positions. They view democracy as parliamentary, representative and ancient Greek, but also associate it with the church, while their moral values appear clearly naturalistic, Christian and nationalistic.

Cluster 79 reflects the exact opposite discourse to cluster 78. These voters belong to the left and are non-religious. They do not adopt the idea of the ethical person, or the scheme of pain and righteousness as taught in the Christian dogma. In terms of political attitudes, they are pro-welfare state. These non-religious, left voters understand democracy as direct, with the need for revolution, protest and riot, and they support collective goods. Interpreting further the antagonisms as visualized on the semantic map, the main competition exists between the "right ideology, strong religious behavior, individualism" discourse and the "left ideology, no religious behavior, collectivism" discourse. A secondary opposition is found between the "center ideology, moderate religious behavior" discourse and the extreme positions of the left and the right.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Supervised Classification via Neural Networks for Replicated Point Patterns**

Kateřina Pawlasová, Iva Karafiátová, and Jiří Dvořák

**Abstract** A spatial point pattern is a collection of points observed in a bounded region of $\mathbb{R}^d$, $d \ge 2$. Individual points represent, e.g., observed locations of cell nuclei in a tissue ($d = 2$) or centers of undesirable air bubbles in industrial materials ($d = 3$). The main goal of this paper is to show the possibility of solving the supervised classification task for point patterns via neural networks with general input space. To predict the class membership for a newly observed pattern, we compute an empirical estimate of a selected functional characteristic (e.g., the pair correlation function). Then, we consider this estimated function to be a functional variable that enters the input layer of the network. A short simulation example illustrates the performance of the proposed classifier in the situation where the observed patterns are generated from two models with different spatial interactions. In addition, the proposed classifier is compared with convolutional neural networks (with point patterns represented by binary images) and with kernel regression. Kernel regression classifiers for point patterns have been studied in our previous work, and we consider them a benchmark in this setting.

**Keywords:** spatial point patterns, pair correlation function, supervised classification, neural networks, functional data

Kateřina Pawlasová ()
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech Republic, e-mail: pawlasova@karlin.mff.cuni.cz

Iva Karafiátová
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech Republic, e-mail: karafiatova@karlin.mff.cuni.cz

Jiří Dvořák
Charles University, Faculty of Mathematics and Physics, Ke Karlovu 3, 121 16 Praha 2, Czech Republic, e-mail: dvorak@karlin.mff.cuni.cz

© The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_32

# **1 Introduction**

Spatial point processes have recently received increasing attention in a broad range of scientific disciplines, including biology, statistical physics, and materials science [9]. They are used to model the locations of objects or events randomly occurring in $\mathbb{R}^d$, $d \ge 2$. We distinguish between the stochastic model (point process) and its realization observed in a bounded observation window (point pattern).

Typically, analyzing spatial point pattern data means working with just one pattern, which comes from a specific physical measurement. In this paper, we take another perspective: we suppose that a collection of patterns, which are independent realizations of some underlying stochastic models, is to be analyzed simultaneously. These independent realizations are then referred to as replicated point patterns. Recently, this type of data has become more frequent, encouraging the adaptation of methods such as supervised classification to the point pattern setting.

Since we are dealing with supervised classification, our task is to predict the label variable (indicating class membership) for a newly observed point pattern, using the knowledge about a collection of patterns with known labels (training data). In the literature, this problem has been studied only to a limited extent. Properties of a classifier constructed specifically for the situation where the observed patterns were generated by inhomogeneous Poisson point processes with different intensity functions are discussed in [5]. However, this method is based on the special properties of the Poisson point process, and its use is thus limited to a small class of models. On the other hand, no assumptions about the underlying stochastic models are made in [12], where the task for replicated point patterns is transformed, with the help of multidimensional scaling [16], into a classification task in $\mathbb{R}^2$. In [10, 11], the kernel regression classifier for functional data [4] is adapted to replicated point patterns. Instead of classifying the patterns themselves, a selected functional characteristic (e.g., the pair correlation function) is estimated for each pattern. These estimated functions are considered functional observations, and the classification is performed in the context of functional data. The idea of linking point patterns to functional data also appears in [12]: the dissimilarity matrix needed for the multidimensional scaling is based on the same type of dissimilarity measure that is used for the kernel regression classifier in [10, 11]. Finally, [17] briefly discusses model-based supervised classification, and unsupervised classification is explored in [2].

In this paper, our goal is to discuss the use of classifiers based on artificial neural networks in the context of replicated point patterns. We pay special attention to the procedure described in [14], where both functional and scalar observations enter the input layer. Hence, similarly to [10, 11], each pattern can be represented by the estimated values of a selected functional characteristic, and the classification is performed in the context of functional data. The resulting decision about class membership is based on those spatial properties of the observed patterns that are captured by the selected characteristic. Therefore, with a thoughtfully chosen characteristic, this method has great potential within a wide range of possible classification scenarios. Moreover, it can be used without assuming stationarity of the underlying point processes, and it can be easily extended to more complicated settings (e.g., point patterns in non-Euclidean spaces or realizations of random sets).

We present a short simulation experiment that illustrates the behaviour of the neural network described in [14]. Binary classification is performed on realizations of two different point process models – the Thomas process (model for attractive interactions among pairs of points) and the Poisson point process (benchmark model for no interactions among points). This approach is then compared to the classification based on convolutional neural networks (CNNs) [8], where each pattern enters the network as a binary image. Finally, both methods based on artificial neural networks are compared to the kernel regression classifier studied in [10, 11] which can be considered a benchmark in the context of replicated point patterns.

This paper is organized as follows. Section 2 provides a brief theoretical background on spatial point processes and their functional characteristics, including the definition of the pair correlation function, which plays a crucial role in the sequel. Section 3 summarizes the methodology introduced in [14] about neural network models with general input space. Section 4 is devoted to a short simulation example.

# **2 Point Processes and Point Patterns**

This section presents the necessary definitions from point process theory. Our exposition closely follows the book [13]. For a detailed explanation of the theoretical foundations, see, e.g., [7]. Throughout the paper, a simple point process $X$ is defined as a random locally finite subset of $\mathbb{R}^d$, $d \ge 2$, where each point $x \in X$ corresponds to a specific object or event occurring at the location $x \in \mathbb{R}^d$. In applications, $X$ can be used as a mathematical tool to model random locations of cell nuclei in a tissue (with $d = 2$) or centers of undesirable air bubbles in industrial materials ($d = 3$). We distinguish between the mathematical model $X$, which is called a point process, and its observed realization $\mathbf{x}$, which is called a point pattern. Examples of four different point patterns are given in Figure 1.

Before properly defining the pair correlation function, a functional characteristic that plays a key role in the sequel, we need to define some moment properties of $X$. The *intensity function* $\lambda(\cdot)$ is a non-negative measurable function on $\mathbb{R}^d$ such that $\lambda(x)\,\mathrm{d}x$ corresponds to the probability of observing a point of $X$ in a neighborhood of $x$ with an infinitesimally small area $\mathrm{d}x$. If $X$ is stationary (its distribution is translation invariant in $\mathbb{R}^d$), then $\lambda(\cdot) = \lambda$ is a constant function and the constant $\lambda$ is called the *intensity* of $X$. In this case, $\lambda$ is interpreted as the expected number of points of $X$ that occur in a set with unit $d$-dimensional volume. Similarly, the *second-order product density* $\lambda^{(2)}(\cdot,\cdot)$ is a non-negative measurable function on $\mathbb{R}^d \times \mathbb{R}^d$ such that $\lambda^{(2)}(x, y)\,\mathrm{d}x\,\mathrm{d}y$ corresponds to the probability of observing two points of $X$ occurring jointly in the neighborhoods of $x$ and $y$ with infinitesimally small areas $\mathrm{d}x$ and $\mathrm{d}y$.

Assuming the existence of $\lambda$ and $\lambda^{(2)}$, the *pair correlation function* $g(x, y)$ is defined as $\lambda^{(2)}(x, y) / (\lambda(x)\lambda(y))$, for $\lambda(x)\lambda(y) > 0$. If $\lambda(x) = 0$ or $\lambda(y) = 0$, we set $g(x, y) = 0$. We write $g(x, y) = g(x - y)$ when $g$ is translation invariant, and $g(x, y) = g(\|x - y\|)$ when $g$ is also isotropic (invariant under rotations around the origin). For the Poisson point process, a model for complete spatial randomness, $\lambda^{(2)}(x, y) = \lambda(x)\lambda(y)$ and $g \equiv 1$. Thus, $g(x, y)$ quantifies how likely it is to observe two points of $X$ jointly occurring in infinitesimally small neighbourhoods of $x$ and $y$, relative to the "no interactions" benchmark.
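For the models used later in the simulation example, $g$ has a known closed form. The sketch below (Python, for illustration only; the closed-form pcf of the planar modified Thomas process with parent intensity $\kappa$ and offspring standard deviation $\sigma$ is a standard textbook formula, not quoted from this chapter) contrasts the Poisson benchmark $g \equiv 1$ with the Thomas case:

```python
import math

def pcf_poisson(r: float) -> float:
    """Pair correlation function of the Poisson process: g is identically 1."""
    return 1.0

def pcf_thomas(r: float, kappa: float, sigma: float) -> float:
    """Theoretical pcf of the planar (modified) Thomas process:
    g(r) = 1 + exp(-r^2 / (4 sigma^2)) / (4 pi kappa sigma^2),
    where kappa is the intensity of the parent process and sigma is the
    standard deviation of the offspring displacements."""
    return 1.0 + math.exp(-r ** 2 / (4 * sigma ** 2)) / (4 * math.pi * kappa * sigma ** 2)

# Attraction: g > 1 everywhere, strongest at short range, and the curve
# approaches the Poisson benchmark g = 1 as r grows (cf. Figure 1).
for sigma in (0.1, 0.05, 0.02):
    print(sigma, pcf_thomas(0.01, kappa=25, sigma=sigma))
```

Smaller $\sigma$ yields a higher peak near $r = 0$, which is exactly the behaviour exploited by the classifiers of Sect. 4.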

A large variety of characteristics (both functional and numerical) have been developed to capture various hypotheses about the stochastic models that generated the observed point patterns at hand. We have focused on the pair correlation function $g$ mainly because of its widespread use in practical applications and its ease of interpretation. Other popular characteristics are based on $g$, e.g., its cumulative counterpart, traditionally called the $K$-function. Others are based on inter-point distances, such as the nearest-neighbor distance distribution function $G$ and the spherical contact distribution function $F$. A comprehensive summary of commonly used characteristics, including a list of possible empirical estimators, is presented in [9, 13]. Estimators of $g$, $K$, $G$, and $F$ are implemented in the R package spatstat [3].

# **3 Neural Networks with General Input Space**

This section prepares the theoretical background for the supervised classification of replicated point patterns via artificial neural networks. The recent approach of [14, 15] is the cornerstone of our proposed classifier, and hence we focus on its description in the following paragraphs. On the other hand, the approach based on CNNs is more established in the literature. We use it primarily for comparison and thus we refer the reader to [8] for a detailed description.

Following the setup in [14], let us assume that we want to build a neural network that takes $K \in \mathbb{N}$ functional variables and $J \in \mathbb{N}$ scalar variables as input. In detail, suppose that we have $f_k: \tau_k \to \mathbb{R}$, $k = 1, 2, \dots, K$ (the $\tau_k$ are possibly different intervals in $\mathbb{R}$), and $z_j^{(1)} \in \mathbb{R}$, $j = 1, 2, \dots, J$. Furthermore, suppose that the first layer of the network contains $n_1 \in \mathbb{N}$ neurons. We then want the $i$-th neuron of the first layer to transfer the value

$$z\_i^{(2)} = g \left( \sum\_{k=1}^K \int\_{\tau\_k} \beta\_{ik}(t) f\_k(t) \, \mathrm{d}t + \sum\_{j=1}^J w\_{ij}^{(1)} z\_j^{(1)} + b\_i^{(1)} \right), \quad i = 1, 2, \dots, n\_1, $$

where $b_i^{(1)} \in \mathbb{R}$ is the bias and $g: \mathbb{R} \to \mathbb{R}$ is the activation function. Two types of weights appear in the formula: the functional weights $\{\beta_{ik}: \tau_k \to \mathbb{R}\}$ and the scalar weights $\{w_{ij}^{(1)}, b_i^{(1)}\}$. The optimal values of all these weights should be found during the training of the network. To overcome the difficulty of finding the optimal weight functions $\beta_{ik}$, we can express $\beta_{ik}$ as a linear combination of basis functions $\phi_1, \dots, \phi_{m_k}$ (from the Fourier or $B$-spline basis), where $m_k$ is chosen by the user. The sum $\sum_{k=1}^{K} \int_{\tau_k} \beta_{ik}(t) f_k(t) \,\mathrm{d}t$ can then be expressed as $\sum_{k=1}^{K} \sum_{l=1}^{m_k} c_{ilk} \int_{\tau_k} \phi_l(t) f_k(t) \,\mathrm{d}t$, where the integrals $\int_{\tau_k} \phi_l(t) f_k(t) \,\mathrm{d}t$ can be calculated a priori, and the coefficients $\{c_{ilk}\}$ of the linear combination of the basis functions act as scalar weights of the first layer and are learned by the network. The scalar values $z_i^{(2)}$, $i = 1, \dots, n_1$, then propagate through the next fully connected layers as usual. An in-depth analysis from the computational point of view is provided in [14]. In R, neural networks with general input space are covered by the package FuncNN [15], built on top of the packages keras [6] and tensorflow [1]; the latter two packages are also used to handle the CNNs.

**Fig. 1** Theoretical values of the pair correlation function $g$ for the Poisson point process and the Thomas process with different values of the model parameter $\sigma$. For these models, $g$ is translation invariant and isotropic. A single realization of the Poisson point process and of the Thomas process with parameter $\sigma$ set to 0.1, 0.05 and 0.02, respectively, is shown in the right part of the figure.
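The basis-expansion step can be made concrete in a few lines. The following Python snippet is an illustrative reimplementation, not the FuncNN code: the grid, the toy functional input, the basis size $m = 5$, and the weight initialisation are all our assumptions. It precomputes the integrals $\int_{\tau_1} \phi_l(t) f_1(t)\,\mathrm{d}t$ and evaluates the first layer once:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid on tau_1 = (0, 0.25) and one functional input f_1, e.g. an
# estimated pair correlation function evaluated on the grid.
t = np.linspace(0.001, 0.25, 200)
f1 = 1.0 + np.exp(-t ** 2 / (4 * 0.1 ** 2))      # toy functional observation

# Fourier-type basis phi_1, ..., phi_m on tau_1 (m chosen by the user).
m = 5
T = t[-1] - t[0]
basis = [np.ones_like(t)]
for l in range(1, (m - 1) // 2 + 1):
    basis.append(np.sin(2 * np.pi * l * (t - t[0]) / T))
    basis.append(np.cos(2 * np.pi * l * (t - t[0]) / T))
basis = np.stack(basis[:m])                      # shape (m, len(t))

# Integrals I_l = int phi_l(t) f_1(t) dt, computed once, a priori
# (trapezoidal rule on the grid).
vals = basis * f1
I = np.sum((vals[:, 1:] + vals[:, :-1]) * np.diff(t) / 2, axis=1)

# First layer with n_1 neurons: the coefficients c_il act as ordinary
# scalar weights, so z_i = g( sum_l c_il I_l + b_i ).
n1 = 4
C = rng.normal(size=(n1, m))                     # learnable coefficients c_il
b = rng.normal(size=n1)                          # biases b_i
z2 = np.maximum(C @ I + b, 0.0)                  # ReLU activation g
print(z2.shape)                                  # (4,)
```

Because the integrals are fixed once the basis is chosen, training reduces to ordinary backpropagation over the scalar weights $\{c_{il}\}$ and the biases.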

# **4 Simulation Example**

This section presents a simple simulation experiment in which we illustrate the performance of the classification rule based on the neural network with general input space. Binary classification is considered, where the group membership indicates whether a point pattern was generated by a stationary Poisson point process or a stationary Thomas process, the latter exhibiting attractive interactions among pairs of points [13]. The sample realizations can be seen in Figure 1.

We consider the Thomas process as a model with one parameter $\sigma$. Small values of $\sigma$ indicate strong attractive short-range interactions between points, while larger values of $\sigma$ result in looser clusters of points. The attractive interactions between the points of a Thomas process result in values of the pair correlation function greater than the constant 1, which corresponds to the Poisson case. The effect of $\sigma$ on the shape of the theoretical pair correlation function of the Thomas process (which is translation invariant and isotropic) is illustrated in Figure 1.
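The two models can be simulated in a few lines. The sketch below is a minimal Python illustration, not the spatstat simulators used for the actual experiment: parents are generated only inside the window and only offspring are clipped, so clusters near the boundary are slightly thinned compared to the stationary model, and the parameter split $\kappa \cdot \mu = 400$ is our own choice.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_thomas(kappa, mu, sigma, window=(0.0, 1.0)):
    """Modified Thomas process on a square window (simplified sketch):
    parents ~ Poisson(kappa) uniform in the window, each parent gets a
    Poisson(mu) number of offspring displaced by N(0, sigma^2 I); only
    the offspring falling inside the window are kept."""
    lo, hi = window
    n_parents = rng.poisson(kappa * (hi - lo) ** 2)
    parents = rng.uniform(lo, hi, size=(n_parents, 2))
    kept = []
    for p in parents:
        off = p + rng.normal(0.0, sigma, size=(rng.poisson(mu), 2))
        inside = (off >= lo).all(axis=1) & (off <= hi).all(axis=1)
        kept.append(off[inside])
    return np.vstack(kept) if kept else np.empty((0, 2))

def simulate_poisson(lam, window=(0.0, 1.0)):
    """Homogeneous Poisson process: complete spatial randomness."""
    lo, hi = window
    n = rng.poisson(lam * (hi - lo) ** 2)
    return rng.uniform(lo, hi, size=(n, 2))

# Intensity 400 as in the experiment: kappa * mu = 400 (our split 25 * 16).
x_thomas = simulate_thomas(kappa=25.0, mu=16.0, sigma=0.1)
x_poisson = simulate_poisson(lam=400.0)
print(len(x_thomas), len(x_poisson))
```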

Since the model parameter $\sigma$ affects the strength and range of the attractive interactions between points of the Thomas process, the complexity of the binary classification task described above increases with increasing values of $\sigma$ [10, 11]. Therefore, this experiment focuses on the situation where $\sigma$ is set to 0.1, and all realizations are observed on the unit square $[0, 1]^2$. We fix the intensity of both models to 400 (in spatial statistics, patterns with several hundred points are standard nowadays). In this framework, we expect the classification task to be challenging enough to observe differences in the performance of the considered classifiers. On the other hand, it is still possible to distinguish (w.r.t. the chosen observation window) the realizations of the model with attractive interactions from the realizations corresponding to complete spatial randomness.

Two different collections of labelled point patterns are considered as training sets. The first, referred to as *Training data 1*, is composed of 1 000 patterns per group. The second, called *Training data 2*, is composed of 100 patterns per group. The test and validation sets have the same size and composition as *Training data 2*. Table 1 presents the accuracy of the three classification rules (described below) with respect to the test set. For the first two rules, the accuracy is averaged over five runs corresponding to different settings of the initial weights in the underlying neural network. Concerning the network architecture, we fix the ReLU function as the activation function for all layers except the output one. The output layer consists of one neuron with a sigmoid activation function. The loss function is the binary cross-entropy. A detailed description of the individual layers is given below.

*Rule 1* is based on the neural network with general input space. We set $K$ and $J$ from Sect. 3 to 1 and 0, respectively, and $\tau_1 = (0, 0.25)$. The value 0.25 reflects the fact that the observation window of the point patterns at hand is $[0, 1]^2$. Then, $f_1$ is the vector of the estimated values of the pair correlation function $g$ (estimated by the function pcf.ppp from the package spatstat [3] with default settings, except the option divisor set to d), considered as a functional observation. Furthermore, we set $m_1 = 29$ and consider the Fourier basis. The data preparation (estimation of $g$, computation of the integrals from Sect. 3) takes 740 s of elapsed time (w.r.t. *Training data 1*, on a standard personal computer). To tune the hyperparameters of the final neural network (number of hidden layers, number of neurons per hidden layer, dropout, etc.), we performed a rough grid search (models with various combinations of the hyperparameters were trained on *Training data 1*, and we used the loss function and the accuracy computed on the validation set to compare their performance). The resulting network consists of one hidden layer with 128 neurons followed by a dropout layer with a rate of 0.3. We use the Adam optimizer, and the learning rate decays exponentially, with initial value 0.001 and decay parameter 0.05. In total, the network has 3 969 trainable parameters. To train the network, we perform 50 epochs with an average elapsed time of 200 ms per epoch (w.r.t. *Training data 1*).
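The paper estimates $g$ with pcf.ppp from spatstat. For intuition, a deliberately simplified estimator can be written directly (Python sketch; fixed bandwidth, Epanechnikov kernel, and no edge correction, unlike pcf.ppp, so the estimate is biased for larger distances):

```python
import numpy as np

def pcf_estimate(points, r_grid, h=0.01, window_area=1.0):
    """Naive kernel estimator of the pair correlation function:
    ghat(r) = sum_{i != j} k_h(r - ||x_i - x_j||) / (2 pi r lambda^2 |W|),
    with an Epanechnikov kernel k_h and no edge correction."""
    n = len(points)
    lam = n / window_area                       # intensity estimate
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d = d[~np.eye(n, dtype=bool)]               # pairwise distances, i != j
    g = np.empty_like(r_grid)
    for i, r in enumerate(r_grid):
        u = (r - d) / h
        k = 0.75 * (1 - u ** 2) * (np.abs(u) <= 1) / h   # Epanechnikov
        g[i] = k.sum() / (2 * np.pi * r * lam ** 2 * window_area)
    return g

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(400, 2))          # a Poisson-like pattern
r_grid = np.linspace(0.02, 0.2, 10)
ghat = pcf_estimate(pts, r_grid)
print(np.round(ghat, 2))                        # hovers around 1
```

For a clustered (Thomas) pattern, the same estimator would exceed 1 at short range, which is precisely the functional information fed to the network.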

*Rule 2* uses CNNs. Similarly to the previous case, our decision about the network architecture is based on a rough grid search. The final network has two convolutional layers, each of them with 8 filters and a square kernel matrix with 36 rows (first layer) or 16 rows (second layer), each followed by an average pooling layer with the pool size fixed at 2 × 2. We add a dropout layer after each pooling, with a rate of 0.3 (after the first pooling) and 0.2 (after the second pooling). The batch size is set to 32. We use the Adam optimizer, and the learning rate decays exponentially, with initial value 0.001 and decay parameter 0.1. The total number of trainable parameters is equal to 32 785, and we perform 50 epochs with the average elapsed time per epoch (w.r.t. *Training data 1*) equal to 930 s. Data preparation (converting point patterns to binary images) takes less than 10 s of elapsed time (w.r.t. *Training data 1*).

*Rule 3* is the kernel regression classifier studied in [10, 11]. We use the Epanechnikov kernel together with an automatic procedure for the selection of the smoothing parameter. The underlying dissimilarity measure for point patterns is constructed as the integrated squared difference of the corresponding estimates of the pair correlation function $g$; for more details, see [10]. The elapsed time needed to compute the upper triangle of the dissimilarity matrix (containing the dissimilarities between every pair of patterns from *Training data 1*) is 390 s. Predicting the class membership for the test set (w.r.t. *Training data 1*) took 206 s. The classification procedure involves no random initialization of weights, so there is no reason to average the accuracy in Table 1 over multiple runs.
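The logic of such a kernel regression classifier can be sketched compactly (Python; our illustrative version with a fixed smoothing parameter $h$ and toy curves, whereas the paper selects the smoothing parameter automatically):

```python
import numpy as np

def kernel_classify(g_new, g_train, labels, h):
    """Kernel regression classifier for functional data: the dissimilarity
    between two patterns is the integrated squared difference of their
    estimated pcf curves, and the predicted label is a kernel-weighted
    vote using the Epanechnikov kernel."""
    d = np.array([np.sum((g_new - g) ** 2) for g in g_train])  # squared L2
    u = d / h
    w = 0.75 * (1 - u ** 2) * (u <= 1)          # Epanechnikov weights
    if w.sum() == 0:                            # nothing within bandwidth:
        w = np.ones_like(w)                     # fall back to a plain vote
    return int(round(float(np.average(labels, weights=w))))

# Toy curves: class 0 ~ flat pcf (Poisson-like), class 1 ~ peaked (clustered).
r = np.linspace(0.01, 0.25, 50)
g0 = np.ones_like(r)
g1 = 1 + 5 * np.exp(-r / 0.05)
g_train = np.stack([g0, 1.02 * g0, g1, 0.98 * g1])
labels = np.array([0, 0, 1, 1])
print(kernel_classify(1 + 5.05 * np.exp(-r / 0.05), g_train, labels, h=50.0))  # 1
```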

For *Training data 1*, Table 1 shows that the highest accuracy was achieved by the neural network with general input space. The standard deviation of the five accuracy values is considerably higher for the CNN, which has almost ten times more trainable parameters than the network with general input space. For *Training data 2*, the kernel regression method achieved the highest accuracy; the performance of this classifier is stable even in the case of small training data. For the first two rules, the neural network models chosen with the help of the grid search (where the networks were trained w.r.t. the bigger training set) are now trained w.r.t. the smaller training set. The resulting accuracy is still around 0.90 for the network with general input space, but it drops to 0.5 (random assignment of labels) for the CNN. The size of *Training data 2* seems to be too small to successfully optimize the large number of trainable parameters of the convolutional network.

To conclude, our simulation example suggests that the classifier based on the CNN (using information about the precise configuration of points) is, in the presented situation, outperformed by the classifiers based on the estimated values of the pair correlation function (using information about the interactions between pairs of points). The high number of trainable parameters of the CNN makes its use rather demanding in terms of computational time. The approach based on neural networks with general input space proved competitive with, or even superior to, the current benchmark method (the kernel regression classifier), especially for large datasets; it also has the lowest demands regarding computational time. In the case of a small dataset, the low number of hyperparameters speaks in favor of kernel regression. Finally, in the simple classification scenario presented here, the choice of the pair correlation function was adequate. In practical applications, a problem-specific characteristic should be constructed to achieve satisfactory performance.

**Acknowledgements** The work of Kateřina Pawlasová and Iva Karafiátová has been supported by the Grant schemes at Charles University, project no. CZ.02.2.69/0.0/0.0/19 073/0016935. The work of Jiří Dvořák has been supported by the Czech Grant Agency, project no. 19-04412S.

# **References**



# **Parsimonious Mixtures of Seemingly Unrelated Contaminated Normal Regression Models**

Gabriele Perrone and Gabriele Soffritti

**Abstract** In recent years, the research into linear multivariate regression based on finite mixture models has been intense. With such an approach, it is possible to perform regression analysis for a multivariate response by taking account of the possible presence of several unknown latent homogeneous groups, each of which is characterised by a different linear regression model. For a continuous multivariate response, mixtures of normal regression models are usually employed. However, in real data, it is not unusual to observe mildly atypical observations that can negatively affect the estimation of the regression parameters under a normal distribution in each mixture component. Furthermore, in some fields of research, a multivariate regression model with a different vector of covariates for each response should be specified, based on some prior information to be conveyed in the analysis. To take account of all these aspects, mixtures of contaminated seemingly unrelated normal regression models have been recently developed. A further extension of such an approach is presented here so as to ensure parsimony, which is obtained by imposing constraints on the group-covariance matrices of the responses. A description of the resulting parsimonious mixtures of seemingly unrelated contaminated regression models is provided together with the results of a numerical study based on the analysis of a real dataset, which illustrates their practical usefulness.

**Keywords:** contaminated normal distribution, ECM algorithm, mixture of regression models, model-based cluster analysis, seemingly unrelated regression

© The Author(s) 2023

Gabriele Perrone ()

Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna, Italy, e-mail: gabriele.perrone4@unibo.it

Gabriele Soffritti

Department of Statistical Sciences, University of Bologna, via delle Belle Arti 41, 40126 Bologna, Italy, e-mail: gabriele.soffritti@unibo.it

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_33

# **1 Introduction**

Seemingly unrelated (SU) regression equations are usually employed in a multivariate regression analysis whenever the dependence of a vector $\mathbf{Y} = (Y_1, \dots, Y_M)'$ of $M$ continuous variables on a vector $\mathbf{X} = (X_1, \dots, X_P)'$ of $P$ regressors has to be modelled by allowing the error terms in the different equations to be correlated and, thus, the regression parameters of the $M$ equations have to be jointly estimated [14]. With such an approach, the researcher is also enabled to convey prior information on the phenomenon under study into the specification of the regression equations by defining a different vector of regressors for each dependent variable. This latter feature is particularly useful in any situation in which different regressors are expected to be relevant in the prediction of different responses, such as in [3, 6, 16]. This approach has been recently embedded into the framework of Gaussian mixture models, leading to multivariate SU normal regression mixtures [7]. In these models, the effect of the regressors on the dependent variables changes with some unknown latent sub-populations composing the population that has generated the sample of observations to be analysed. Thus, when the sample is characterised by unobserved heterogeneity, model-based cluster analysis is simultaneously carried out.

Another source of complexity which could affect the data and make the prediction of $\mathbf{Y}$ a difficult task to perform is represented by mildly atypical observations [13]. Robust methods of parameter estimation insensitive to the presence of such observations in a sample characterised by unobserved heterogeneity have been introduced in [9], where the conditional distribution $\mathbf{Y}|\mathbf{X} = \mathbf{x}$ is modelled through a mixture of $K$ multivariate contaminated normal models, where $K$ is the number of latent sub-populations. A limitation associated with these latter models is that the same vector of regressors has to be specified for the prediction of all the dependent variables. To overcome this limitation while preserving all the features mentioned above, a more flexible approach which employs mixtures of multivariate SU contaminated normal regression models has been recently introduced in [11]. These latter models are able to capture the linear effects of the regressors on the dependent variables from sample observations coming from heterogeneous populations. The researcher is also enabled to specify a different vector of regressors for each dependent variable. Finally, a robust estimation of the regression parameters and the detection of mild outliers in the data are ensured.

In the presence of many responses and many latent sub-populations, analyses based on these latter models can become unfeasible in practical applications because of a large number of model parameters. In order to keep this number as low as possible, an approach due to [4], based on the spectral decompositions of the $K$ covariance matrices of $\mathbf{Y}|\mathbf{X} = \mathbf{x}$, is exploited here so as to obtain fourteen different covariance structures. The resulting parsimonious mixtures of SU contaminated regression models are described in Section 2. The usefulness of these new models is illustrated through a study aiming at determining the effect of prices and promotional activities on sales of canned tuna in the US market. A summary of the obtained results is provided in Section 3.

# **2 Parsimonious SU Contaminated Normal Regression Mixtures**

In a system of $M$ SU regression equations for modelling the linear dependence of $\mathbf{Y}$ on $\mathbf{X}$, let $\mathbf{X}_m = (X_{m_1}, X_{m_2}, \dots, X_{m_{P_m}})'$ be the $P_m$-dimensional sub-vector of $\mathbf{X}$ composed of the $P_m$ regressors expected to be relevant for the explanation of $Y_m$, for $m = 1, \dots, M$. Furthermore, let $\mathbf{X}_m^* = (1, \mathbf{X}_m')'$. The mixture of $K$ SU normal regression models described in [7] can be defined as follows:

$$\mathbf{Y} = \begin{cases} \tilde{\mathbf{X}}^{\*\prime} \boldsymbol{\beta}\_{1}^{\*} + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N\_{M}(\mathbf{0}\_{M}, \boldsymbol{\Sigma}\_{1}) \text{ with probability } \pi\_{1}, \\ \dots & \\ \tilde{\mathbf{X}}^{\*\prime} \boldsymbol{\beta}\_{K}^{\*} + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim N\_{M}(\mathbf{0}\_{M}, \boldsymbol{\Sigma}\_{K}) \text{ with probability } \pi\_{K}, \end{cases} \tag{1}$$

where $\pi_k$ is the prior probability of the $k$th latent sub-population, with $\pi_k > 0$ for $k = 1, \dots, K$ and $\sum_{k=1}^{K} \pi_k = 1$; $\tilde{\mathbf{X}}^*$ is the following $(P^* + M) \times M$ partitioned matrix:

$$
\tilde{\mathbf{X}}^{\*} = \begin{bmatrix}
\mathbf{X}\_1^{\*} & \mathbf{0}\_{P\_1+1} & \dots & \mathbf{0}\_{P\_1+1} \\
\mathbf{0}\_{P\_2+1} & \mathbf{X}\_2^{\*} & \dots & \mathbf{0}\_{P\_2+1} \\
\vdots & \vdots & & \vdots \\
\mathbf{0}\_{P\_M+1} & \mathbf{0}\_{P\_M+1} & \dots & \mathbf{X}\_M^{\*} \\
\end{bmatrix},
$$

with $\mathbf{0}_{P_m+1}$ denoting the $(P_m + 1)$-dimensional null vector; $P^* = \sum_{m=1}^{M} P_m$; $\boldsymbol{\beta}_k^* = (\boldsymbol{\beta}_{k1}^{*\prime}, \dots, \boldsymbol{\beta}_{km}^{*\prime}, \dots, \boldsymbol{\beta}_{kM}^{*\prime})'$ is the $(P^* + M)$-dimensional vector containing all the linear effects on the $M$ responses in the $k$th latent sub-population, with $\boldsymbol{\beta}_{km}^* = (\beta_{0k,m}, \boldsymbol{\beta}_{km}')'$, for $m = 1, \dots, M$; $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_M)'$ is the vector of the errors, which are supposed to be independent and identically distributed; $N_M(\mathbf{0}_M, \boldsymbol{\Sigma}_k)$ denotes the $M$-dimensional normal distribution with mean vector $\mathbf{0}_M$ and positive-definite covariance matrix $\boldsymbol{\Sigma}_k$. From now on, this mixture regression model is denoted as MSUN. When $\mathbf{X}_m = \mathbf{X}$ $\forall m$ (the $P$ regressors are employed in all the $M$ equations), model (1) reduces to the mixture of $K$ normal (MN) regression models (see [8]).
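The block structure of $\tilde{\mathbf{X}}^*$ is easy to assemble programmatically. The following Python sketch (our illustrative helper, not code from the chapter; all numbers are toy values) builds the matrix for $M = 2$ responses with $P_1 = 2$ and $P_2 = 1$ and evaluates the component mean $\tilde{\mathbf{X}}^{*\prime}\boldsymbol{\beta}_k^*$:

```python
import numpy as np

def build_X_star(x_lists):
    """Assemble the (P* + M) x M block matrix X~* of model (1) from the
    per-equation covariate vectors X_m, each augmented with a leading 1
    for the intercept (X*_m = (1, X_m')')."""
    cols = [np.concatenate(([1.0], np.asarray(x, dtype=float))) for x in x_lists]
    X = np.zeros((sum(len(c) for c in cols), len(cols)))
    row = 0
    for m, c in enumerate(cols):
        X[row:row + len(c), m] = c               # column m carries X*_m
        row += len(c)
    return X

# M = 2 responses, P_1 = 2 and P_2 = 1 regressors, so P* = 3.
X = build_X_star([[0.5, -1.0], [2.0]])
beta = np.array([1.0, 2.0, 3.0,   # beta*_{k1}: intercept + 2 slopes
                 0.5, -1.0])      # beta*_{k2}: intercept + 1 slope
print(X.shape, X.T @ beta)        # (5, 2) [-1.  -1.5]
```

The transpose in $\tilde{\mathbf{X}}^{*\prime}\boldsymbol{\beta}_k^*$ is what maps the stacked coefficient vector back to one mean per response.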

When the data are contaminated by the presence of mild outliers, departures from the normal distribution could be observed within any of the $K$ latent sub-populations. A model able to manage this situation has been recently introduced in [11]. It has been obtained from equation (1) by replacing the normal distribution with the contaminated normal distribution. Under this latter distribution, the probability density function (p.d.f.) of $\boldsymbol{\epsilon}$ within the $k$th sub-population is equal to $h(\boldsymbol{\epsilon}; \boldsymbol{\vartheta}_k) = \alpha_k \phi_M(\boldsymbol{\epsilon}; \mathbf{0}_M, \boldsymbol{\Sigma}_k) + (1 - \alpha_k)\phi_M(\boldsymbol{\epsilon}; \mathbf{0}_M, \eta_k \boldsymbol{\Sigma}_k)$, where $\phi_M(\cdot; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the p.d.f. of the distribution $N_M(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\alpha_k \in (0.5, 1)$ and $\eta_k > 1$ are the proportion of typical observations within the $k$th sub-population and a parameter that inflates the elements of $\boldsymbol{\Sigma}_k$, respectively, and $\boldsymbol{\vartheta}_k = (\alpha_k, \eta_k, \boldsymbol{\Sigma}_k)$. As a consequence, a mixture of $K$ SU contaminated normal (MSUCN) regression models is given by:

$$\mathbf{Y} = \begin{cases} \mathbf{X}^{*}\boldsymbol{\beta}^{*}_{1} + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim CN_M(\alpha_1, \eta_1, \mathbf{0}_M, \boldsymbol{\Sigma}_1) \text{ with probability } \pi_1, \\ \quad\vdots \\ \mathbf{X}^{*}\boldsymbol{\beta}^{*}_{K} + \boldsymbol{\epsilon}, & \boldsymbol{\epsilon} \sim CN_M(\alpha_K, \eta_K, \mathbf{0}_M, \boldsymbol{\Sigma}_K) \text{ with probability } \pi_K, \end{cases} \tag{2}$$

where $CN_M(\alpha_k, \eta_k, \mathbf{0}_M, \boldsymbol{\Sigma}_k)$ denotes the $M$-dimensional contaminated normal distribution with p.d.f. $h(\boldsymbol{\epsilon}; \boldsymbol{\vartheta}_k)$. The parameter vector of model (2) is $\boldsymbol{\psi} = (\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_k, \ldots, \boldsymbol{\psi}_K)$, where $\boldsymbol{\psi}_k = (\pi_k, \boldsymbol{\theta}_k)$ and $\boldsymbol{\theta}_k = (\boldsymbol{\beta}^*_k, \boldsymbol{\vartheta}_k)$. The number of free elements of $\boldsymbol{\psi}$ is $n_{\boldsymbol{\psi}} = 3K - 1 + K(P^* + M) + n_{\boldsymbol{\sigma}}$, where $n_{\boldsymbol{\sigma}}$ denotes the total number of free variances and covariances, with $n_{\boldsymbol{\sigma}} = K n_{\boldsymbol{\Sigma}}$ and $n_{\boldsymbol{\Sigma}} = \frac{M(M+1)}{2}$. When $\mathbf{X}_m = \mathbf{X}$ $\forall m$, model (2) coincides with the mixture of $K$ contaminated normal (MCN) regression models described in [9]. For $\alpha_k \to 1$ or $\eta_k \to 1$ $\forall k$, model (2) reduces to model (1). Conditions ensuring identifiability of model (2) are provided in [11]. The ML estimation of $\boldsymbol{\psi}$ in equation (2) can be carried out by means of a sample $S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_I, \mathbf{y}_I)\}$ of $I$ independent observations drawn from model (2) and an expectation-conditional maximisation (ECM) algorithm [10]. Details about this algorithm, including strategies for the initialisation of $\boldsymbol{\psi}$ and convergence criteria, are illustrated in [11]. In practical applications, the value of $K$ is generally unknown and has to be properly chosen. This task can be carried out by resorting to model selection criteria, such as the Bayesian information criterion [15]: $BIC = 2\ell(\hat{\boldsymbol{\psi}}) - n_{\boldsymbol{\psi}} \ln I$, where $\hat{\boldsymbol{\psi}}$ is the maximum likelihood estimator of $\boldsymbol{\psi}$.
Another commonly used information criterion is the integrated completed likelihood [2], which admits two slightly different formulations: $ICL_1 = BIC + 2\sum_{i=1}^{I}\sum_{k=1}^{K} \mathrm{MAP}(\hat{z}_{ik}) \ln \hat{z}_{ik}$ and $ICL_2 = BIC + 2\sum_{i=1}^{I}\sum_{k=1}^{K} \hat{z}_{ik} \ln \hat{z}_{ik}$, where $\hat{z}_{ik}$ is the estimated posterior probability that the $i$th sample observation comes from the $k$th sub-population (for further details see [11]), and $\mathrm{MAP}(\hat{z}_{ik}) = 1$ if $\max_h\{\hat{z}_{ih}\}$ occurs at $h = k$, $\mathrm{MAP}(\hat{z}_{ik}) = 0$ otherwise. Whenever the specification of the subvectors $\mathbf{X}_m$, $m = 1, \ldots, M$, to be considered in the $M$ equations of the multivariate regression model is questionable, such criteria can also be employed to perform subset selection.
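As a minimal numerical illustration, the three criteria can be computed from the maximised log-likelihood, the number of free parameters, and the matrix of estimated posterior probabilities $\hat{z}_{ik}$; the function name and array layout below are illustrative choices, not taken from [11]:

```python
import numpy as np

def information_criteria(loglik, n_params, z_hat):
    """BIC, ICL1 and ICL2 as defined above (here larger is better,
    since BIC = 2*loglik - n_params*ln(I))."""
    I = z_hat.shape[0]                       # number of observations
    bic = 2.0 * loglik - n_params * np.log(I)
    # MAP(z_ik): 1 for the most probable component of each observation
    map_z = np.zeros_like(z_hat)
    map_z[np.arange(I), z_hat.argmax(axis=1)] = 1.0
    log_z = np.log(np.clip(z_hat, 1e-300, None))   # guard against log(0)
    icl1 = bic + 2.0 * np.sum(map_z * log_z)
    icl2 = bic + 2.0 * np.sum(z_hat * log_z)
    return bic, icl1, icl2
```

Both ICL variants penalise BIC by the classification entropy, so they favour well-separated clusters; with crisp posteriors ($\hat{z}_{ik} \in \{0, 1\}$) all three criteria coincide.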

As the number of free parameters $n_{\boldsymbol{\psi}}$ increases quadratically with $M$, analyses based on model (2) can become unfeasible in real applications. A way to manage this problem is to introduce suitable constraints on the elements of $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, K$, based on the following eigen-decomposition [4]: $\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}'_k$, where $\lambda_k = |\boldsymbol{\Sigma}_k|^{1/M}$, $\mathbf{A}_k$ is a diagonal matrix with entries (sorted in decreasing order) proportional to the eigenvalues of $\boldsymbol{\Sigma}_k$ (with the constraint $|\mathbf{A}_k| = 1$) and $\mathbf{D}_k$ is an $M \times M$ orthogonal matrix whose columns are the eigenvectors of $\boldsymbol{\Sigma}_k$ (ordered according to the eigenvalues). This decomposition makes it possible to recover the variances and covariances in $\boldsymbol{\Sigma}_k$ from $\lambda_k$, $\mathbf{A}_k$ and $\mathbf{D}_k$. From a geometrical point of view, $\lambda_k$ determines the volume, $\mathbf{A}_k$ the shape and $\mathbf{D}_k$ the orientation of the $k$th cluster of sample observations detected by the fitted model. By constraining $\lambda_k$, $\mathbf{A}_k$ and $\mathbf{D}_k$ to be equal or variable across the $K$ clusters, a class of fourteen mixtures of $K$ SUCN regression models is obtained (see Table 1). With variable volumes, shapes and orientations (VVV in Table 1), the resulting model coincides with (2). When $K > 1$, the other covariance structures yield thirteen different parsimonious mixtures of $K$ SUCN regression models (i.e., with a reduced $n_{\boldsymbol{\sigma}}$). When $K = 1$, the possible covariance structures for $\boldsymbol{\Sigma}_1$ are: diagonal with different entries, diagonal with equal entries, and fully unconstrained.
The ML estimation of $\boldsymbol{\psi}$ under model (2) with any of these parameterisations can be carried out through an ECM algorithm in which the CM-step update for $\boldsymbol{\Sigma}_k$ can be computed either in closed form or through iterative procedures, depending on the parameterisation employed (see [4]).
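The eigen-decomposition above can be sketched numerically as follows (a hedged illustration with our own variable names; the actual constrained CM-step updates are given in [4]):

```python
import numpy as np

def eigen_decompose_cov(Sigma):
    """Decompose Sigma_k = lambda_k * D_k A_k D_k' as described above:
    lambda_k = |Sigma_k|^(1/M) (volume), A_k diagonal with |A_k| = 1
    (shape), D_k orthogonal eigenvector matrix (orientation)."""
    M = Sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]        # decreasing eigenvalues
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = np.prod(eigvals) ** (1.0 / M)      # = det(Sigma)^(1/M)
    A = np.diag(eigvals / lam)               # normalised so det(A) = 1
    return lam, A, D
```

Reassembling `lam * D @ A @ D.T` recovers $\boldsymbol{\Sigma}_k$; constraining each of the three factors to be equal or variable across clusters generates the parameterisations of Table 1.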


**Table 1** Features of the parameterisations for the covariance matrices $\boldsymbol{\Sigma}_k$, $k = 1, \ldots, K$ ($K > 1$).

# **3 Analysis of U.S. Canned Tuna Sales**

The models illustrated in Section 2 have been fitted to a dataset [5] containing the volume of sales (Move), a measure of the display activity (Nsale) and the log price (Lprice) for seven of the top 10 U.S. brands in the canned tuna product category in the $I = 338$ weeks between September 1989 and May 1997. The goal of the analysis is to study the dependence of canned tuna sales on prices and promotional activities for two products: Star Kist 6 oz. (SK) and Bumble Bee Solid 6.12 oz. (BBS). To this end, the following vectors have been considered: $\mathbf{Y}' = (Y_1 = \text{Lmove SK}, Y_2 = \text{Lmove BBS})$, $\mathbf{X}' = (X_1 = \text{Nsale SK}, X_2 = \text{Lprice SK}, X_3 = \text{Nsale BBS}, X_4 = \text{Lprice BBS})$, where Lmove denotes the logarithm of Move. The analysis has been carried out using all the parameterisations of the MSUN, MN, MSUCN and MCN models for each $K \in \{1, 2, 3, 4, 5, 6\}$. Furthermore, MSUN and MSUCN models have been fitted by considering all possible subvectors of $\mathbf{X}$ as vectors $\mathbf{X}_m$, $m = 1, 2$, for each $K$. In this way, best subset selections for Lmove SK and Lmove BBS have been included in the analysis both with and without contamination. The overall number of fitted models is 37376, including the fully unconstrained models (i.e., with the VVV parameterisation) previously employed in [11] to perform the same analysis.

Table 2 reports some information about the nine models which best fit the analysed dataset according to the three model selection criteria over the six examined values of $K$ within each model class. An analysis based on a single linear regression model ($K = 1$), both with and without contamination, appears to be inadequate according to all criteria. All the examined criteria indicate that the overall best model for studying the effect of prices and promotional activities on sales of SK and BBS tuna is a parsimonious mixture of two SU contaminated Gaussian linear regression models with the EVE parameterisation for the covariance matrices, in which the log unit sales of SK tuna are regressed on the log prices and the promotional activities of the same brand, while the regressors selected for the BBS log unit sales are the log prices of both brands and the promotional activities of BBS. Thus, the analysis suggests that two sources of complexity affect the analysed dataset: unobserved heterogeneity over time ($K = 2$ clusters of weeks have been detected) and the presence of mildly atypical observations. Since the two estimated proportions of typical observations are quite similar (see the values of $\hat{\alpha}_k$ in Table 3), contamination seems to characterise the two clusters of weeks detected by the model in almost the same way. As far as the strength of the contaminating effects on the conditional variances and covariances of $\mathbf{Y}|\mathbf{X} = \mathbf{x}$ is concerned, it appears to be stronger in the first cluster, where the estimated inflation parameter is larger ($\hat{\eta}_1 = 15.70$). Turning to the other estimates, it appears that some of the estimated regression coefficients, variances and covariances are also affected by heterogeneity over time.
Sales of SK tuna turn out to be negatively affected by prices and positively affected by promotional activities of the same brand within both clusters detected by the model, but with effects that are slightly stronger in the first cluster of weeks. A similar behaviour is detected in the estimated regression equation for Lmove BBS, which also highlights that Lmove BBS is positively affected by the log prices of SK tuna, especially in the first cluster of weeks. Furthermore, typical weeks in the first cluster show values of Lmove SK which are more homogeneous than those of Lmove BBS; the opposite holds true for the typical weeks belonging to the second cluster. The correlation between log sales of SK and BBS products is also affected by heterogeneity over time: while in the largest cluster of weeks this correlation has been estimated to be slightly positive (0.200), the first cluster is characterised by a mild estimated negative correlation ($-0.151$). An interesting feature of this latter cluster is that 17 of the 20 weeks assigned to it are consecutive, from week no. 58 to week no. 74, corresponding to the period from mid-October 1990 to mid-February 1991 characterised by a worldwide boycott campaign encouraging consumers not to buy Bumble Bee tuna, because Bumble Bee was found to be buying yellow-fin tuna caught with dolphin-unsafe techniques [1]. Such events could represent one of the sources of the unobserved heterogeneity detected by the model. According to the overall best model, some weeks have been detected as mild outliers. In the first cluster, this has happened for week no. 60 (immediately after Halloween 1990) and week no. 73 (two weeks before Presidents' Day 1991).
The analysis of the estimated sample residuals $\mathbf{y}_i - \hat{\boldsymbol{\mu}}_1(\mathbf{x}_i; \hat{\boldsymbol{\beta}}^*_1)$ for the 20 weeks belonging to the first cluster (see the scatterplot on the left side of Figure 1) clearly shows that weeks 60 and 73 deviate noticeably from the other weeks. Among the 318 weeks of the second cluster, 32 turned out to be mild outliers, most of which are associated with holidays and special events that took place between September 1989 and mid-October 1990 or between mid-February and May 1997 (see the scatterplot on the right side of Figure 1). These results are almost equal to those obtained using the overall best fully unconstrained fitted model in the analysis presented in [11]. However, the EVE parameterisation for the MSUCN model has yielded a better trade-off among fit, model complexity and the uncertainty of the estimated partition of the weeks; furthermore, it has led to a slightly lower number of mild outliers in the second cluster of weeks.


**Table 2** Maximised log-likelihood $\ell(\hat{\boldsymbol{\psi}})$ and values of $BIC$, $ICL_1$ and $ICL_2$ for nine models selected from the classes MSUCN, MCN, MSUN and MN in the analysis of tuna sales.

**Table 3** Parameter estimates of the overall best model for the analysis of tuna sales.


**Fig. 1** Scatterplots of the estimated residuals for the weeks assigned to the first (left) and second (right) clusters detected by the overall best model. Points in the first scatterplot are labelled with the numbers of the corresponding weeks. Black circles and red triangles in the second scatterplot correspond to typical and outlying weeks, respectively.

# **4 Conclusions**

The parsimonious mixtures of seemingly unrelated linear regression models for contaminated data introduced here can account for heterogeneous regression data in the presence of mild outliers and multivariate correlated dependent variables, each of which is regressed on a different vector of covariates. Models from this class allow for simultaneous robust clustering and detection of mild outliers in multivariate regression analysis. They encompass several other types of Gaussian mixture-based linear regression models previously proposed in the literature, such as the ones illustrated in [7, 8, 9], providing a robust and flexible tool for modelling data in practical applications where different regressors are considered to be relevant for the prediction of different dependent variables. Previous research (see [9, 11]) demonstrated that BIC and ICL can be effectively employed to select a proper value for $K$ in the presence of mildly contaminated data. Thanks to the imposition of an eigen-decomposed structure on the $K$ variance-covariance matrices of $\mathbf{Y}|\mathbf{X} = \mathbf{x}$, the presented models are characterised by a reduced number of variance-covariance parameters, thus improving the flexibility, usefulness and effectiveness of an approach to multivariate linear regression analysis based on finite Gaussian mixture models in real data applications.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Penalized Model-based Functional Clustering: a Regularization Approach via Shrinkage Methods**

Nicola Pronello, Rosaria Ignaccolo, Luigi Ippoliti, and Sara Fontanella

**Abstract** With the advance of modern technology, and with data being recorded continuously, functional data analysis has gained a lot of popularity in recent years. Working in a mixture model-based framework, we develop a flexible functional clustering technique achieving dimensionality reduction through an $L_1$ penalization. The proposed procedure results in an integrated modelling approach where shrinkage techniques are applied to enable sparse solutions in both the means and the covariance matrices of the mixture components, while preserving the underlying clustering structure. This leads to an entirely data-driven methodology suitable for simultaneous dimensionality reduction and clustering. Preliminary experimental results, both from simulation and real data, show that the proposed methodology is worth considering within the framework of functional clustering.

**Keywords:** functional data analysis, $L_1$-penalty, silhouette width, graphical LASSO, mixture model

Nicola Pronello
Department of Neurosciences, Imaging and Clinical Sciences, University of Chieti-Pescara, Chieti, Italy, e-mail: nicola.pronello@unich.it

Rosaria Ignaccolo
Department of Economics and Statistics "Cognetti de Martiis", University of Torino, Torino, Italy, e-mail: rosaria.ignaccolo@unito.it

Luigi Ippoliti
Department of Economics, University of Chieti-Pescara, Pescara, Italy, e-mail: luigi.ippoliti@unich.it

Sara Fontanella
National Heart and Lung Institute, Imperial College London, London, United Kingdom, e-mail: s.fontanella@imperial.ac.uk

© The Author(s) 2023
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_34

# **1 Introduction**

In recent decades, technological innovations have produced data that are increasingly complex, high dimensional, and structured. A large amount of these data can be characterized as functions defined on some continuous domain, and their statistical analysis has attracted the interest of many researchers. This surge of interest is explained by the ubiquitous examples of functional data that can be found in different application fields (see for example [2], and references therein for specific examples). With functions as the basic units of observation, the analysis of functional data poses significant theoretical and practical challenges to statisticians. Despite these difficulties, methodology for clustering functional data has advanced rapidly during the past years; recent surveys of functional data clustering are presented in [7] and [2]. Popular approaches have extended classical clustering concepts for vector-valued multivariate data to functional data.

In this paper, we consider a finite mixture as a flexible model for clustering. In particular, applying a functional model-based clustering algorithm with an $L_1$ penalty function on a set of projection coefficients, we extend the results of [8] and [9] for vector-valued multivariate data to a functional data framework. This approach appears particularly appealing whenever the functions are spatially heterogeneous, meaning that some parts of a function can be smoother than others, or that distant parts of the function may be correlated with each other. Furthermore, the introduction of a shrinkage penalty allows us to look for the directions in the feature space (which is now the space of expansion/projection coefficients) that are the most useful in separating the underlying groups, without first applying dimensionality reduction techniques.

In Section 2 we first present the methodology, along with some details on model estimation (subsection 2.2). Then, in Section 3, we perform a validation study with simulated and real data for which the classes are known a priori.

# **2 Shrinkage Method for Model-based Clustering for Functional Data**

Here we consider the problem of clustering a set of $n$ observed curves into $K$ homogeneous groups (or clusters). To this end, we propose a flexible model based on a finite mixture of Gaussian distributions with an $L_1$ penalized likelihood, which we name *Penalized model-based Functional Clustering* (PFC-$L_1$).

#### **2.1 Model Definition**

We consider a set of $n$ observed curves, $x_1, \ldots, x_n$, that are independent realizations of a continuous stochastic process $X = \{X(t)\}_{t \in [0,T]}$ taking values in $L^2[0,T]$. In practice, such curves/trajectories are available only at a discrete set of domain points $\{t_{is} : i = 1, \ldots, n,\ s = 1, \ldots, m_i\}$, and the $n$ curves need to be reconstructed. To this end, it is common to assume that the curves belong to a finite-dimensional space spanned by a basis of functions, so that given a basis $\boldsymbol{\Phi} = \{\psi_1, \ldots, \psi_p\}$ each curve $x_i(t)$ admits the following decomposition:

$$x_i(t) = \sum_{j=1}^{p} \beta_{j,i}\, \psi_j(t), \qquad i = 1, \ldots, n; \tag{2.1}$$

that is, the stochastic process $X$ admits a corresponding truncated basis expansion

$$X(t) = \sum_{j=1}^{p} \beta_j(X)\, \psi_j(t),$$

where $\boldsymbol{\beta} = \{\beta_1(X), \ldots, \beta_p(X)\}$ is a random vector in $\mathbb{R}^p$. By considering observations with a sampling error, such that

$$x_i^{obs}(t) = x_i(t) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{2.2}$$

with $\epsilon_i \sim \mathcal{N}(0, \sigma^2_{\epsilon})$, the realizations of the random coefficients $\beta_{j,i}$, $j = 1, \ldots, p$, describing each curve can be obtained via least squares as $\hat{\boldsymbol{\beta}}_i = (\boldsymbol{\Theta}'_i \boldsymbol{\Theta}_i)^{-1} \boldsymbol{\Theta}'_i \mathbf{X}^{obs}_i$, where $\boldsymbol{\Theta}_i = (\psi_j(t_{is}))$, $1 \le j \le p$, $1 \le s \le m_i$, contains the basis functions evaluated at the fixed domain points and $\mathbf{X}^{obs}_i = (x^{obs}_i(t_{i1}), \ldots, x^{obs}_i(t_{i m_i}))'$ is the vector of observed values of the $i$-th curve.
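The projection step above can be sketched as follows, with a hypothetical Fourier basis standing in for $\boldsymbol{\Phi}$ (the basis choice and function names are illustrative assumptions):

```python
import numpy as np

def fourier_basis(t, p):
    """First p Fourier basis functions on [0, 1] (an illustrative choice)."""
    cols = [np.ones_like(t)]
    j = 1
    while len(cols) < p:
        cols.append(np.sin(2 * np.pi * j * t))
        cols.append(np.cos(2 * np.pi * j * t))
        j += 1
    return np.column_stack(cols)[:, :p]

def project_curve(t_obs, x_obs, p):
    """beta_hat_i = (Theta' Theta)^{-1} Theta' X_i^obs via least squares."""
    Theta = fourier_basis(t_obs, p)          # m_i x p design matrix
    beta_hat, *_ = np.linalg.lstsq(Theta, x_obs, rcond=None)
    return beta_hat
```

Each curve is thus reduced to a $p$-vector of coefficients, and the clustering then operates on these vectors rather than on the raw discretized observations.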

With the goal of dividing the observed curves $x_1, \ldots, x_n$ into $K$ homogeneous groups, let us assume that there exists an unobservable grouping variable $\mathbf{Z} = (Z_1, \ldots, Z_K) \in \{0, 1\}^K$ indicating the cluster membership: $z_{i,k} = 1$ if $x_i$ belongs to cluster $k$, and 0 otherwise ($z_{i,k}$ is indeed what we want to predict for each curve).

In adopting a model-based clustering approach, we denote by $\pi_k$ the (a priori) probability of belonging to group $k$:

$$
\pi\_k = \mathbb{P}(Z\_k = 1), \qquad k = 1, \ldots, K,
$$

such that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k > 0$ for each $k$, and we assume that, conditionally on $Z$, the random vector $\boldsymbol{\beta}$ follows a multivariate Gaussian distribution, that is, for each cluster

$$\boldsymbol{\beta} \mid (Z_k = 1) \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where $\boldsymbol{\mu}_k = (\mu_{1,k}, \ldots, \mu_{p,k})^T$ and $\boldsymbol{\Sigma}_k$ are respectively the mean vector and the covariance matrix of the $k$-th group. Then the marginal distribution of $\boldsymbol{\beta} = \{\beta_1, \ldots, \beta_p\}$ can be written as a finite mixture with mixing proportions $\pi_k$ as

$$p(\boldsymbol{\beta}) = \sum_{k=1}^{K} \pi_k f(\boldsymbol{\beta}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where $f$ is the multivariate Gaussian density function. The log-likelihood function can then be written as

$$\ell(\boldsymbol{\theta}; \boldsymbol{\beta}) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\boldsymbol{\beta}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

where $\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K; \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K; \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$ is the vector of parameters to be estimated and $\boldsymbol{\beta}_i = (\beta_{1,i}, \ldots, \beta_{p,i})^T$ is the vector of projection coefficients of the $i$-th curve.

In this modeling framework, we consider a very general situation, introducing no constraints on either the cluster means or the covariance matrices, which can differ across clusters. This flexibility, however, leads to overparameterization and, as an alternative to constraints, we consider a penalty that allows for regularized parameter estimation.

To define a suitable penalty term, we follow the penalized approach introduced by Zhou et al. [8] in the high-dimensional setting, and thus consider a penalty composed of two terms: the first on the mean vector $\boldsymbol{\mu}_k$ of each cluster, and the second on the inverse of the covariance matrix of each group, $\mathbf{W}_k = \boldsymbol{\Sigma}_k^{-1}$, also known as the precision matrix, with elements $W_{k;j,l}$. The proposed penalized log-likelihood function, given the projection coefficients $\boldsymbol{\beta}_i$, is

$$\ell_P(\boldsymbol{\theta}; \boldsymbol{\beta}) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k f(\boldsymbol{\beta}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) - \lambda_1 \sum_{k=1}^{K} \|\boldsymbol{\mu}_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|,$$

where $\|\boldsymbol{\mu}_k\|_1 = \sum_{j=1}^{p} |\mu_{k,j}|$, and $\lambda_1 > 0$ and $\lambda_2 > 0$ are penalty parameters to be suitably chosen.

The penalty term on the cluster mean vectors allows for component selection in the functional data framework (whereas it would be variable selection in the multivariate case), considering that when the $j$-th component in the basis expansion is not useful in separating the groups it has a common mean across groups, that is, $\mu_{1,j} = \ldots = \mu_{K,j} = 0$. Component selection is thus realized through the term $\sum_{k=1}^{K} \|\boldsymbol{\mu}_k\|_1$.

The second part of the penalty, namely $\sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|$, imposes a shrinkage on the elements of the precision matrices, thus avoiding possible singularity problems and facilitating the estimation of large and sparse covariance matrices.
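Putting the pieces together, the penalized objective can be evaluated as follows (a sketch under our own naming conventions; `beta` is the $n \times p$ matrix of projection coefficients):

```python
import numpy as np
from scipy.stats import multivariate_normal

def penalized_loglik(beta, pi, mus, Sigmas, lam1, lam2):
    """Mixture log-likelihood minus the two L1 penalty terms above."""
    K = len(pi)
    dens = np.column_stack(
        [multivariate_normal.pdf(beta, mean=mus[k], cov=Sigmas[k])
         for k in range(K)])                      # n x K component densities
    loglik = np.sum(np.log(dens @ np.asarray(pi)))
    pen_mu = lam1 * sum(np.abs(mu).sum() for mu in mus)    # mean penalty
    pen_w = lam2 * sum(np.abs(np.linalg.inv(S)).sum()      # precision penalty
                       for S in Sigmas)
    return loglik - pen_mu - pen_w
```

This is only an evaluation of the objective; the maximization itself proceeds through the E-M scheme of the next subsection.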

#### **2.2 Model Estimation via E-M Algorithm**

Since the membership of each observation to a cluster is unobservable, the data related to the grouping variable $\mathbf{Z}$ are inevitably missing, and the maximum penalized log-likelihood estimator can be obtained by means of the E-M algorithm [4], which iterates over two steps: expectation (E) of the complete-data (penalized) log-likelihood, with the unknown parameters set equal to those obtained at the previous iteration (or to initialization values), and maximization (M) of a lower bound of the obtained expected value with respect to the unknown parameters.

In particular, at the 𝑑-th iteration, given a current estimate 𝜽 (𝑑) , the lower bound after the E-step assumes the following form:

$$Q_P\left(\boldsymbol{\theta}; \boldsymbol{\theta}^{(d)}\right) = \sum_{k=1}^{K} \sum_{i=1}^{n} \tau_{k,i}^{(d)} \left[\log \pi_k + \log f\left(\boldsymbol{\beta}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)\right] - \lambda_1 \sum_{k=1}^{K} \|\boldsymbol{\mu}_k\|_1 - \lambda_2 \sum_{k=1}^{K} \sum_{j,l}^{p} |W_{k;j,l}|,$$

where $\tau_{k,i} = \mathbb{P}(Z_k = 1 \mid X = x_i)$ is the posterior probability that observation $i$ belongs to group $k$. The M-step maximizes the function $Q_P$ in order to update the estimate of $\boldsymbol{\theta}$.

As suggested by [9], it is possible to maximize each of the $K$ terms using a "graphical lasso" (GLASSO) algorithm (first proposed by [5]), thanks to the close connection between fitting Gaussian mixture models and Gaussian graphical models. Indeed, the GLASSO objective function has the form $\log \det(\mathbf{W}) - \mathrm{tr}(\mathbf{S}\mathbf{W}) - \lambda \sum_{j,l}^{p} |W_{j,l}|$, so that the algorithm implemented in the R package "glasso" can be used with $\mathbf{W} = \mathbf{W}_k$, $\mathbf{S} = \tilde{\mathbf{S}}_k$ and $\lambda = 2\lambda_2 / \sum_{i=1}^{n} \tau^{(d)}_{k,i}$ for each $k$ to obtain the elements $\hat{W}^{(d+1)}_{k;j,l}$ of the precision matrices.
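An equivalent M-step update can be sketched with scikit-learn's `graphical_lasso` in place of the R package (a hedged illustration under our own names; note that scikit-learn penalizes only the off-diagonal entries of the precision matrix, a slight variant of the all-entries penalty in the text):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def m_step_precision(beta, tau_k, mu_k, lam2):
    """Update W_k by solving a GLASSO problem with the tau-weighted
    scatter matrix S_k and penalty alpha = 2*lam2 / sum_i tau_ki."""
    w = tau_k / tau_k.sum()
    centred = beta - mu_k
    S_k = (w[:, None] * centred).T @ centred     # weighted covariance
    alpha = 2.0 * lam2 / tau_k.sum()
    _, W_k = graphical_lasso(S_k, alpha=alpha)   # returns (cov, precision)
    return W_k
```

Larger values of `lam2` drive more off-diagonal entries of $\mathbf{W}_k$ to exactly zero, which is what produces the sparse precision structure exploited by the model.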

#### **2.3 Model Selection via Silhouette Profile**

A fundamental, and probably unsolved, problem in cluster analysis is determining the "true" number of groups in a dataset. To this purpose, for simplicity, here we treat the choice of the number of groups as a cluster validation problem and use the *average silhouette width* index as a model selection heuristic. The silhouette value for curve $i$ is given by

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the average distance of curve $i$ to all other curves $h$ assigned to the same cluster (if $i$ is the only observation in its cluster, then $s(i) = 0$), and $b(i)$ is the minimum average distance of curve $i$ to the observations $h$ assigned to a different cluster. This definition ensures that $s(i)$ takes values in $[-1, 1]$, where values close to one indicate "better" clustering solutions. Conditional on $K$ and a pair of values $(\lambda_1, \lambda_2)$, we thus assess the overall cluster solution using the total average of silhouette values

$$S(K, \lambda_1, \lambda_2) = \frac{1}{n} \sum_{i=1}^{n} s(i).$$

In particular, by doing a grid search over the triple $(K, \lambda_1, \lambda_2)$, the best cluster solution is obtained by looking for the largest value of the *average silhouette width* (*ASW*) index. Note that, to evaluate $s(i)$, $i = 1, \ldots, n$, and then the objective function $S(K, \lambda_1, \lambda_2)$, we need to compute a distance between pairs of curves $X_i$ and $X_h$. One possibility is the squared Euclidean distance

$$d_E^2(i, h) = \int \left(X_i(t) - X_h(t)\right)^2 dt.$$
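A minimal numerical version of the ASW criterion can use scikit-learn's silhouette with a precomputed distance matrix (function names and the Riemann-sum approximation of the integral are our own choices, not the authors' implementation):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def average_silhouette_width(curves, labels, t):
    """ASW with the squared L2 distance d_E^2(i, h), approximated by a
    Riemann sum on the common evaluation grid t."""
    n, dt = curves.shape[0], t[1] - t[0]
    D = np.zeros((n, n))
    for i in range(n):
        D[i] = ((curves[i] - curves) ** 2).sum(axis=1) * dt
    return silhouette_score(D, labels, metric="precomputed")
```

The grid search then retains the triple $(K, \lambda_1, \lambda_2)$ that maximises this value.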

# **3 Experimental Results**

#### **3.1 Simulation**

We present here a simulated scenario in order to investigate the effectiveness of the $L_1$ regularization in removing noise while preserving dominant local features, accommodating the spatial heterogeneity of the curves.

The statistical analysis is illustrated for data simulated by means of a finite mixture of multivariate Gaussian distributions. In particular, based on equations (2.1) and (2.2), the curves are simulated using a combination of $p = 25$ Fourier basis functions defined over a one-dimensional regular grid of 100 points. We consider a mixture of four ($K = 4$) multivariate Gaussian distributions with isotropic covariance matrices, i.e.

$$\boldsymbol{\beta}_k \sim \mathcal{N}(\boldsymbol{\mu}_k; \mathbf{I}_k) \ \text{ with } \ \epsilon_i \sim \mathcal{N}(0; 0.5), \qquad k = 1, \ldots, 4.$$

With the exclusion of 3 entries per group, the mean vectors $\boldsymbol{\mu}_k$ are all zero. Under this scenario, the simulated curves (25 per group) and the non-zero group expansion coefficients are represented in Figure 1. For this simple simulation setting, estimation results suggest that, using the Euclidean distance to compute the *ASW*, the grid search procedure is always able to correctly select the cluster-relevant basis functions. This is confirmed by Figure 2, which shows both the distribution (over 100 replications) of the selected basis functions and the data projected on these bases, clearly highlighting the identification of 4 clusters. Under this scenario, the quality of the estimated clusters thus appears very good, as the analysis of the misclassification rate indicates 100% accuracy in all the replicated datasets.
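This simulation design can be replicated along the following lines (the positions and magnitudes of the non-zero mean entries are illustrative assumptions, not the exact values used by the authors):

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n_per = 25, 4, 25
t = np.linspace(0, 1, 100)

# Fourier design matrix: constant plus sine/cosine pairs (100 x 25)
Phi = np.column_stack(
    [np.ones_like(t)]
    + [f(2 * np.pi * j * t) for j in range(1, 13) for f in (np.sin, np.cos)]
)[:, :p]

# Group means: all zero except 3 illustrative entries per group
mu = np.zeros((K, p))
for k in range(K):
    mu[k, 1 + 3 * k:4 + 3 * k] = 3.0

# beta_k ~ N(mu_k, 0.5 I); curves x_i = Phi beta_i as in eq. (2.1)
beta = np.vstack([rng.normal(mu[k], np.sqrt(0.5), size=(n_per, p))
                  for k in range(K)])
curves = beta @ Phi.T          # 100 curves on the 100-point grid
```

Because each group differs from the others only in three coefficients, the $L_1$ penalty on the means should zero out all cluster-irrelevant components, which is exactly what Figure 2 illustrates.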

Similar results hold for more complex simulation designs, where we consider different structures of the covariance matrices in the data generating process.

#### **3.2 Performance on Real Data Sets**

We evaluate the PFC-𝐿<sup>1</sup> model on a well-known benchmark data set, namely the electrocardiogram (ECG) data set (data can be found at the UCR Time Series Classification Archive [3]).

The ECG data set comprises a set of 200 electrocardiograms from 2 groups of patients, myocardial infarction and healthy, each sampled at 96 time instants.

**Fig. 1** Left: 25 simulated curves for each group. Right: Vector of expansion coefficients for each group, with only three non-zero coefficients corresponding to basis functions with specific periodicities (Hertz values).

**Fig. 2** Left: Data projected on cluster specific functional subspace generated by the selected basis functions. Right: Distribution (over 100 replications) of the selected basis functions shown for pairs of sine and cosine basis functions, according to the Hertz values.

This data set was previously used to compare the performance of several functional clustering models in [1]. The results in Table 5 of [1] show that the FunFEM models, compared to other state-of-the-art methodologies, achieved the best performance in terms of accuracy. Hence, here, we limit the comparison to the results obtained with the PFC-$L_1$ and the FunFEM models. Although the FunFEM models rely on a mixture of Gaussian distributions describing the likelihood of the data, similarly to our proposal, they differ in how they face the intrinsic high dimension of the problem, estimating a latent discriminant subspace in parallel with the steps of an EM algorithm.

For all the data, we reconstruct the functional form from the sampled curves by arbitrarily choosing 20 cubic spline basis functions. We tested the PFC-$L_1$ models considering five different values for the number of clusters, $K \in \{2, 3, 4, 5, 6\}$, and six values for $\lambda_1 \in \{0.5, 1, 5, 10, 15, 20\}$.

Considering that the GLASSO penalty parameter $\lambda$ depends linearly on $\lambda_2$, the choice of $\lambda_2$ has to provide suitable values for $\lambda$. A practical approach is to choose values that avoid convergence problems with GLASSO. Here $\lambda_2$ was set to $\{5, 7.5, 10, 12, 15, 20\}$ for the ECG data. Both the PFC-$L_1$ and FunFEM algorithms were initialized using a $K$-means procedure.

The clustering accuracies, computed with respect to the known labels, are 69% for FunFEM DFM[𝛼𝑘 𝑗 𝛽𝑘] (chosen among 12 different model parameterizations with the BIC index) and 75% for PFC-𝐿<sup>1</sup> [𝜆<sub>1</sub> = 0.5, 𝜆<sub>2</sub> = 5] (tuning parameter values chosen by the ASW index). PFC-𝐿<sup>1</sup> thus achieves good performance, with a relative increase in accuracy of about 9%.
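The ASW (average silhouette width) selection mentioned above can be illustrated with `sklearn.metrics.silhouette_score`: among candidate settings, keep the one whose partition has the highest average silhouette. This is a hedged toy sketch on synthetic coefficient vectors, not the paper's actual pipeline; the two-cluster data and candidate labelings are invented for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for expansion-coefficient vectors of curves from two well-separated clusters.
coefs = np.vstack([rng.normal(0.0, 0.3, size=(30, 20)),
                   rng.normal(3.0, 0.3, size=(30, 20))])
labels_k2 = np.repeat([0, 1], 30)     # candidate partition with K = 2 (matches the data)
labels_k3 = np.repeat([0, 1, 2], 20)  # a mismatched candidate with K = 3

asw = {2: silhouette_score(coefs, labels_k2),
       3: silhouette_score(coefs, labels_k3)}
best_K = max(asw, key=asw.get)        # keep the setting with the highest ASW
```

With Euclidean distance (the default), the well-matched two-cluster partition scores higher, so it would be the selected parameterization.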

# **4 Discussion**

In this paper we investigated the potential of shrinkage methods for clustering functional data. Our numerical examples show the advantages of performing clustering with feature selection, such as uncovering interesting structures underlying the data while preserving good clustering accuracy. To the best of our knowledge, this is the first proposal that penalizes both the means and the covariances of the mixture components in functional model-based clustering. In the model selection section we defined a heuristic criterion, based on the average silhouette index, to choose among different model parameterizations. It may be interesting in future research to evaluate different (i.e., non-Euclidean) distances for computing this index. Moreover, we will consider more complex simulation designs to investigate the robustness of the proposal, and extend the comparison with state-of-the-art methodologies to more benchmark datasets.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Emotion Classification Based on Single Electrode Brain Data: Applications for Assistive Technology**

Duarte Rodrigues, Luis Paulo Reis, and Brígida Mónica Faria

**Abstract** This research case focused on the development of an emotion classification system intended to be integrated into projects committed to improving assistive technologies. An experimental protocol was designed to acquire an electroencephalogram (EEG) signal that translated a certain emotional state. To trigger this stimulus, a set of clips was retrieved from an extensive database of pre-labeled videos. The signals were then processed in order to extract valuable features and patterns to train the machine and deep learning models. Three classification hypotheses were put forward: recognition of 6 core emotions; distinguishing between 2 different emotions; and recognising whether the individual was being directly stimulated or merely processing the emotion. Results showed that the first classification task was challenging because of the limited sample size. Nevertheless, good results were achieved in the second and third scenarios (70% and 97% accuracy, respectively) through the application of a recurrent neural network.

**Keywords:** emotions, brain-computer interface, EEG, supervised learning, machine and deep learning

Duarte Rodrigues

Luis Paulo Reis

Brígida Mónica Faria ()

Faculty of Engineering of University of Porto (FEUP), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal, e-mail: up201705420@fe.up.pt

Faculty of Engineering of University of Porto (FEUP) and Artificial Intelligence and Computer Science Laboratory (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal, e-mail: lpreis@fe.up.pt

School of Health, Polytechnic of Porto (ESS-P.PORTO) and Artificial Intelligence and Computer Science (LIACC), Rua Dr. Roberto Frias, s/n 4200-465 Porto, Portugal, e-mail: monica.faria@ess.ipp.pt

# **1 Introduction**

Emotions are a part of our lives: as humans, we know how to identify the tiniest of microexpressions to unveil what someone is feeling, but also how to use them to express ourselves. From the youngest of ages we see and interact with others and build a database of patterns of, for example, what joy is and how different it is from fear or sadness. Computers, on the other hand, have no idea what an emotion is or how to recognize it. Or do they?

The Artificial Intelligence and Computer Science Laboratory (LIACC) established 2 projects where emotion recognition can be of the utmost importance. The first project, "IntellWheels 2.0" [1], intends to develop an interactive and intelligent electric wheelchair. This innovative equipment will have a diverse set of features, such as an adaptive control system (through eye gaze, a brain-computer interface, hand orientation, among others) and a personalized multi-modal interface allowing communication with multiple devices, both for patients and caregivers. In this case, having information about the mood of the patient is very beneficial, because the interface can update the nursing staff on the patient's emotional condition. The second project, "Sleep at the Wheel" [2], focuses on the research of an interface that can sense and predict a driver's drowsiness state, detecting whether the driver has fallen asleep while driving and, consequently, supporting an alarm system to provide safer routing and driving. Here the state of mind of the driver is a very important aspect, as different emotions, like anger or fear, can provoke dangerous situations or unpredictable scenarios, making the driver less attentive to their surroundings.

In this work, emotions are sensed through a brain-computer interface (BCI). These are commercial devices that allow the acquisition of a surface electroencephalogram (EEG). This signal measures the electrical activity of the brain, which fluctuates with the firing of neurons and is quantified in microvolts. In this research, the BCI used was the "NeuroSky MindWave2", which possesses a single electrode on the forehead, from which it collects a signal from the activity of the frontal lobe. This brain area is responsible for the higher executive functions, including emotional regulation, planning, reasoning and problem solving [3].

The study of emotion recognition started with psychologist Paul Ekman, who defined, based on a cross-cultural study, six core emotions: Fear, Anger, Happiness, Sadness, Surprise and Disgust [4]. Later, psychologist Robert Plutchik established a model called the "Wheel of Emotions", a diagram in which every emotion can be derived from the core six.

It is also important to have a way to measure what someone is feeling or what emotion they are experiencing. An easy way to do this is through the "Discrete Emotion Questionnaire", a psychologically validated questionnaire to assess the intensity of a certain emotion. It presents the 6 core emotions to the subjects, asking them to rate the intensity felt, from 1 to 7 [5].

As a first approach in this area, the current work aims to be able to identify the core emotions using EEG signals collected with the BCI.

# **2 Experimental Methodology**

In order to correctly identify the core emotions, the first step is to trigger them efficiently, so that the brain data collected is as informative as possible. To do so, the emotions were prompted via a set of video clips lasting 5-7 seconds. These videos were selected from a certified database, where they are labeled according to the intensity and kind of emotion they caused in the subjects [6]. For each of the 6 core emotions, the 4 videos rated with the highest intensity were selected to be presented to the participants of this research work.

For each of the 24 video clips (4 videos for each of the 6 emotions), 3 EEG samples are collected. The first is taken before the display of the video, while a fixation cross is presented, in order to collect the idle/blank state of the user, who is asked to relax. The second sample is the EEG during the video (active visual stimulus); the third is taken after the video finishes, while the volunteer is processing the triggered emotion (higher-level thinking) and getting back to the initial relaxed state, with the fixation cross presented again. To confirm that the volunteers experience the same emotion defined in the pre-determined label, they are prompted to answer the "Discrete Emotion Questionnaire" after the 3 EEG samples are collected.

Regarding the physiological signal processing, this step is important because the raw EEG signal that comes directly from the BCI has a low signal-to-noise ratio, as well as many surrounding artifacts that contaminate the readings, especially eye blinks and facial movements triggered by the various emotions. The interfering signals caused by the latter, called electromyograms (EMG), are characterized by high frequencies (50-150 Hz) that make the underlying signal very noisy. Every time a person blinks, the EEG signal shows a very high peak with a very low frequency (<1 Hz). To remove these muscle artifacts, a 5th-order Butterworth band-pass filter with cut-off frequencies of 1 Hz and 50 Hz was applied (this filter type was chosen because it has the flattest frequency response, which leads to less signal distortion) [7]. The attenuation of very low frequencies is important to remove the eye-blink artifacts. As for the upper cut-off frequency, 50 Hz is very convenient since it mitigates the effects of the power-line noise and the EMG artifacts. In this way, no important brain data is lost. At this step, the EEG was segmented into the brain waves of interest, i.e., the alpha and beta waves. The best way to perform this is to apply band-pass filters (of the same type as before) in the corresponding bandwidths, 8-13 Hz and 13-32 Hz, to obtain the alpha and beta bands, respectively.
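The filtering pipeline above can be sketched with `scipy.signal`; the sampling rate and toy signal below are illustrative assumptions (the NeuroSky device's actual rate is not stated here), with a 10 Hz "alpha" component that should survive and a 90 Hz EMG-like interference that should not.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 512                                   # assumed sampling rate, in Hz
# 5th-order Butterworth band-pass, 1-50 Hz, applied forward-backward (zero phase).
b, a = butter(N=5, Wn=[1.0, 50.0], btype="bandpass", fs=fs)

t = np.arange(0, 2.0, 1.0 / fs)
eeg = np.sin(2 * np.pi * 10 * t)           # 10 Hz component (alpha band, kept)
noise = 0.8 * np.sin(2 * np.pi * 90 * t)   # 90 Hz EMG-like interference (removed)
clean = filtfilt(b, a, eeg + noise)

# Further band splitting, e.g. the alpha band (8-13 Hz):
b_a, a_a = butter(5, [8.0, 13.0], btype="bandpass", fs=fs)
alpha = filtfilt(b_a, a_a, clean)
```

Using `filtfilt` rather than a single forward pass doubles the effective attenuation and removes phase distortion, which matters when features are later computed on the band-limited waveforms.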

At this stage, the EEG signals expose the "emotional data", allowing the features to be extracted. To do so, multiple mathematical operations were applied to obtain relevant information from the signals. Feature extraction methods depend on the domain, as will be seen ahead [8]. Most strategies to extract features from the EEG are formulas applied in the time domain, such as the common statistical measures, the Hjorth parameters, and mean and zero crossings (the number of times the signal crosses these 2 thresholds) [8]. Besides these, more advanced feature extraction methods were applied, based on fractal dimensions and entropy analysis (methods to assess the complexity, or irregularity, of a time series) [9]. Regarding frequency-domain approaches, these features can only be calculated on the filtered EEG and not on the individual brain waves, as their spectrum is very narrow. In terms of the pure frequency band, the only feature computed was the Power Spectral Density (PSD), based on the Welch method. These domains can be combined, creating the time-frequency domain and leading to more sophisticated methods, like the Hilbert-Huang transform, where the original signal is decomposed into intrinsic mode functions (IMF) [10].
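A few of the time-domain features named above (the Hjorth parameters and zero crossings) can be sketched directly in numpy. The toy 5 Hz segment is an assumption for illustration; the definitions themselves are the standard ones.

```python
import numpy as np

def hjorth(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)                                  # signal power
    mobility = np.sqrt(np.var(dx) / np.var(x))            # mean frequency proxy
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def zero_crossings(x):
    """Number of sign changes of the mean-removed signal."""
    s = np.sign(x - x.mean())
    return int(np.sum(s[:-1] * s[1:] < 0))

t = np.linspace(0, 1, 512, endpoint=False)
seg = np.sin(2 * np.pi * 5 * t + 0.1)      # toy 5 Hz "EEG" segment
act, mob, comp = hjorth(seg)
zc = zero_crossings(seg)
```

For a pure sinusoid the complexity is close to 1; irregular, EMG-contaminated segments push it higher, which is why these parameters are informative for EEG.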

The resulting number of features is too high to feed directly into the machine learning models: many features correlate poorly with the classes, so their information is virtually the same between different emotions. This would introduce uncertainty into the class weights of the models, so the number of features needs to be reduced. To do this, the "Min Redundancy Max Relevance" (MRMR) method was applied, with the objective of finding the optimal number of features with high inter-class variability, in order to find distinct patterns between emotions [11]. The features were used raw, normalized or standardized to train the models.
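The mRMR idea can be sketched with a greatly simplified greedy loop: at each step, keep the candidate feature with the highest relevance (here, absolute correlation with the label) minus its mean redundancy (absolute correlation with already-selected features). This is an illustrative approximation, not the exact criterion of [11], which typically uses mutual information; the synthetic features below are invented for the demonstration.

```python
import numpy as np

def mrmr(X, y, n_select):
    n_feat = X.shape[1]
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(np.argmax(rel))]            # start from the most relevant feature
    while len(selected) < n_select:
        scores = []
        for j in range(n_feat):
            if j in selected:
                scores.append(-np.inf)
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            scores.append(rel[j] - red)         # relevance minus mean redundancy
        selected.append(int(np.argmax(scores)))
    return selected

rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 100)
f0 = y + 0.3 * rng.standard_normal(200)         # relevant
f1 = f0                                         # exact duplicate: fully redundant
f2 = y + 0.5 * rng.standard_normal(200)         # relevant, different noise
f3 = rng.standard_normal(200)                   # irrelevant
X = np.column_stack([f0, f1, f2, f3])
sel = mrmr(X, y, 3)                             # the duplicate f1 is penalized
```

The duplicate feature has relevance identical to `f0` but redundancy 1 with it, so the greedy score pushes it to the back of the selection, which is exactly the behaviour wanted here.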

In this study, all the models implemented are based on supervised learning and fully depend on the input data. Concerning emotion classification there is no single machine learning approach that is optimal, so 9 different types of models were implemented to verify which performs best. These models are designed to adapt to various kinds of input data through the definition of hyper-parameters. Hence, to tune them to the best possible configuration, a grid search with cross-validation (GridSearchCV) was performed. This method exhaustively searches over a given list of candidate parameters, applying cross-validation to each combination. In the end, the best-performing model is chosen to be trained with the resulting feature matrix.
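The tuning step can be sketched with scikit-learn's `GridSearchCV`. The classifier, grid values and synthetic data below are illustrative assumptions — the paper does not list its exact grids or which of the 9 model types this was applied to.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 30-feature matrix used in the experiments.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), grid, cv=5)   # exhaustive search, 5-fold cross-validation
search.fit(X, y)
best = search.best_params_                 # configuration with the best CV accuracy
```

`best_estimator_` is then refit on the full training data, mirroring the "chosen to be trained with the resulting feature matrix" step described above.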

A deep learning model was also implemented, based on a recurrent neural network (RNN), a very common architecture in classification problems using EEG. A particularity of this network is its GRU, a layer that helps mitigate the vanishing gradient problem (a common issue in artificial neural networks), giving long-term memory to the model [12].
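Why a GRU helps can be seen from a single cell step, sketched here in plain numpy (the actual model would use a deep learning framework; the sizes and random weights below are toy assumptions): the update gate `z` interpolates between the previous hidden state and a candidate state, letting information — and gradients — pass through many time steps largely unchanged.

```python
import numpy as np

def gru_step(x, h, W, U, b):
    """One GRU step; W, U, b hold the reset (r), update (z) and candidate (n) params."""
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    n = np.tanh(W["n"] @ x + U["n"] @ (r * h) + b["n"])  # candidate state
    return (1.0 - z) * n + z * h                         # gated interpolation

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                                         # toy sizes, not from the paper
W = {k: 0.1 * rng.standard_normal((d_h, d_in)) for k in "rzn"}
U = {k: 0.1 * rng.standard_normal((d_h, d_h)) for k in "rzn"}
b = {k: np.zeros(d_h) for k in "rzn"}

h = np.zeros(d_h)
for step in range(10):                                   # run over a short random sequence
    h = gru_step(rng.standard_normal(d_in), h, W, U, b)
```

When `z` is close to 1 the state is copied forward almost untouched, which is the "long-term memory" behaviour referred to above.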

# **3 Evaluation and Discussion of Results**

In this experiment, 12 subjects volunteered to participate. Each EEG recording is labeled according to the emotion registered in the original database, as well as whether it was recorded before, during or after the video. The answers to the "Discrete Emotion Questionnaire" were used to validate whether the emotion triggered by the video was as expected and, if so, the data was used. With this dataset structure, 3 hypotheses were tested; their results are discussed below.

An important aspect to take into consideration is that the EEG collected while the subject is relaxing, i.e., while the fixation cross is presented before the video, does not carry relevant cognitive information regarding emotions. Therefore, these segments were not used to train any of the models.

#### **3.1 Core Emotions Classification**

This first hypothesis describes the main goal of the project: a model was developed to classify the 6 core emotions.

First, feature extraction was performed. At this step, the optimal number of features to select was tested, iterating from 5 to 50 in steps of 5. The best number found was 30, which gave the best accuracies with balanced computation time and power. This value was used for the 3 feature matrices (raw, normalized and standardized). The dataset was then divided into fully independent training and testing sets with an 80/20 split. Each model was then trained and assessed by computing its accuracy on the test set. Table 1 presents the results for each model.


**Table 1** Results of the 6 Core Emotions Classification.

When comparing the various models, the average accuracy is around 16-18%, as expected from the number of classes in the problem (100%/6 ≈ 16.7%). Despite this, the best result reached was 25% accuracy, with the features in their raw state: since the magnitude information was not lost, patterns in different emotions could be more easily identified due to the high discrepancy in the values. These results are not discouraging, since the main objective of the study is very ambitious: we are trying to create a model that defines universally what an emotion is. There is no task more subjective or abstract, and the only way to achieve this universal standardization would be with a sample population as wide and diverse as possible, with different beliefs, nationalities, age groups, etc. Although this is an initial study, it shows that it is possible to register and identify differences in the electrical activity of the prefrontal cortex and, with that information, categorize what someone is feeling.

#### **3.2 One vs One – Dual Emotion Classification**

As the results of the previous hypothesis could not precisely identify one emotion against the other 5, the problem was narrowed down and a new hypothesis was tested, to continue the proposed research. In this experiment, the model was trained to discern between only 2 emotions, decided *a priori*. For demonstration purposes, a concrete example can be seen in Table 2, comparing "fear" vs "surprise".



In this case, most of the machine learning algorithms have accuracies on the order of 50-53%. These results are not ideal, as they are no better than a random choice between the two classes; however, this can be justified by the low population sample, which is not large enough to bring concrete patterns in the features to the surface. Regarding the deep learning approach, the RNN has an advantage in this case, giving a final accuracy of 69%. This result shows that the model is reliable, and in the majority of cases the 2 emotions can be distinguished. In this particular case, facial expressions and their muscle activity can induce big artifacts in the EEG. Someone who feels surprised tends to raise their eyebrows and open their mouth. These movements can lead to a difference in the EEG and, consequently, in the patterns of the features, making the distinction between surprise and fear more noticeable. The same reasoning applies to other emotions that trigger facial movement, like laughing, frowning, among others.

#### **3.3 Stimulus vs No Stimulus Classification**

Besides the good results presented in the last premise, one last hypothesis was assessed, regarding the difference between experiencing the emotion while watching the video (direct stimulus), and after, when the fixation cross is presented, while the volunteer is simply thinking and cognitively processing the emotion.

Table 3 summarizes the results of the various models.


**Table 3** Results of Stimulus vs No Stimulus classification.

As can be seen, for this experiment most models did fairly well using the standardized features, with all accuracies higher than 80%. However, when testing the deep learning approach, this architecture turned out to fit the testing data almost perfectly, with an accuracy higher than 96%. This hypothesis is the proof of concept that the characteristics of the signal collected during the stimulus itself are very different from those of a signal obtained when the person is simply thinking and cognitively processing the emotion (this change would be obvious if the EEG were collected from the occipital lobe, which is responsible for visual perception, but it is remarkable when spotted in the prefrontal cortex).

# **4 Conclusions**

In conclusion, as a first approach, the results achieved are very satisfactory and reveal high potential for the proposed applications in both the "IntellWheels 2.0" and "Sleep at the Wheel" projects. Nevertheless, by collecting more data the models will generalize better, resulting in more realistic patterns and, consequently, higher prediction accuracies.

Compared to the literature, using simple visual stimuli to distinguish six emotions, in a relaxed state, is a novel tactic. Most studies complement the stimulus with forced facial expressions, introducing different characteristics into the signal and leading to better results. Other studies use BCIs with more electrodes (channels), covering a wider cranial surface and, consequently, collecting more EEG data, which leads to more robust results.

As future work, the preprocessing of the data could be polished, improving the removal of artifacts and enhancing the underlying information of the EEGs. To obtain better results, a transfer learning approach could also be used, pre-training the models on other emotion-related EEG databases.

**Acknowledgements** This work was financially supported by Base Funding - UIDB/00027/2020 of the Artificial Intelligence and Computer Science Laboratory – LIACC - funded by national funds through the FCT/MCTES (PIDDAC); Sono ao Volante 2.0 - Information system for predicting sleeping while driving and detecting disorders or chronic sleep deprivation (NORTE-01-0247-FEDER-039720); and IntellWheels 2.0 – Intelligent Wheelchair with Flexible Multimodal Interface and Realistic Simulator (POCI-01-0247-FEDER-39898), supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **The Death Process in Italy Before and During the Covid-19 Pandemic: a Functional Compositional Approach**

Riccardo Scimone, Alessandra Menafoglio, Laura M. Sangalli, and Piercesare Secchi

**Abstract** In this talk, based on [1], we propose a spatio-temporal analysis of daily death counts in Italy, collected by ISTAT (the Italian Statistical Institute) for Italian provinces and municipalities. While in [1] the focus was on the elderly class (70+ years old), we here focus on the middle class (50-69 years old), carrying out analogous analyses and comparative observations. We analyse historical provincial data from 2011 up to 2020, the year in which the impact of the Covid-19 pandemic on the overall death process is assessed and analysed. The cornerstone of our analysis pipeline is a novel functional compositional representation of the death counts during each calendar year: specifically, we work with mortality densities over the calendar year, embedding them in the Bayes space 𝐵<sup>2</sup> of probability density functions. This Hilbert space embedding allows for the formulation of functional linear models, which are used to split each yearly realization of the mortality density process into a predictable and an unpredictable component, based on the mortality in previous years. The unpredictable components of the mortality density are then spatially analysed in the framework of Object Oriented Spatial Statistics. Via spatial downscaling of the results obtained at the provincial level, we obtain smooth predictions at the fine scale of Italian municipalities; this also enables us to perform anomaly detection, identifying municipalities which behave unusually with respect to their surroundings.

Riccardo Scimone ()
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and Society, Human Technopole, Milano, Italy, e-mail: riccardo.scimone@polimi.it

Alessandra Menafoglio
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy, e-mail: alessandra.menafoglio@polimi.it

Laura M. Sangalli
MOX, Dipartimento di Matematica, Politecnico di Milano, Milano, Italy, e-mail: laura.sangalli@polimi.it

Piercesare Secchi
MOX, Dipartimento di Matematica, Politecnico di Milano and Center for Analysis, Decision and Society, Human Technopole, Milano, Italy, e-mail: piercesare.secchi@polimi.it

© The Author(s) 2023 333 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_36

**Keywords:** COVID-19, O2S2, functional data analysis, spatial downscaling

# **1 Introduction and Data Presentation**

At the dawn of the third year of the global pandemic, we can affirm that no aspect of people's everyday life has been left untouched by the consequences of Covid-19. The virus, in addition to exacting a heavy death toll, has caused great upheavals in the global economy, education systems, technological development and countless other aspects of human life. Given this global reach, we deem it appropriate to analyse death counts from all causes, and not just those directly attributed to Covid-19, as a proxy for how Italian administrative units, be they municipalities or provinces, have been affected by the pandemic. This choice is driven by the following considerations:


The purpose of the analysis of such data is twofold: (1) to study the correlation structure of the death process in Italy before and during the pandemic, assessing possible perturbations caused by its outbreak, and (2) to assess local anomalies at the municipality level (i.e., identifying municipalities which behave unusually with respect to their surroundings). This talk is entirely devoted to presenting data and results concerning people aged between 50 and 69 years. The elderly class was the focus of [1], while analyses focusing on younger age classes can be freely examined at https://github.com/RiccardoScimone/Mortality-densities-italy-analysis.git.

Daily death counts for the 107 Italian provinces, in the time interval from 2017 to 2020, are shown in Fig. 1: for each province, we draw death counts along the year in light blue. The black solid line is the weighted mean number of deaths, where each province has a weight proportional to its population. We also highlight four provinces with colours: Rome, Milan, Naples, and Bergamo. By visual inspection, it is easy to see that, during the years 2017, 2018 and 2019, the mortality in this age class has an almost uniform behaviour, with only a very slight increase in deaths during winter for some provinces. Conversely, 2020 presents an abnormal behaviour in many provinces, due to the pandemic outbreak: look, for example, at the double peak for Milan, hit by both pandemic waves, or the single, dramatically sharp peak of Bergamo, which during the first wave reached higher death counts than provinces several times bigger, such as Rome or Naples. By comparison with the plots in [1], one can see that all these peaks are less sharp than for the elderly class: this is perfectly reasonable, since people aged more than 70 years are much more susceptible to death by Covid-19.

<sup>1</sup> https://www.istat.it/it/archivio/240401

**Fig. 1** Daily death counts during the last four years, for the Italian provinces. The plots refer to people aged between 50 and 69 years. For each province, death counts along the year are plotted in light blue: curves are overlaid one on top of the other to visualize their variability. The black solid line is the weighted mean number of deaths, where each province has a weight proportional to its population, while some selected provinces are highlighted in colour.

To set some notation, we denote the available death counts data as 𝑑𝑖𝑦𝑡, where 𝑖 is a geographical index, identifying provinces or municipalities, 𝑦 is the year and 𝑡 is the day within year 𝑦. Moreover, we denote by 𝑇𝑖𝑦 the absolutely continuous random variable *time of death along the calendar year*, that models the instant of death of a person living in area 𝑖 and passing away during year 𝑦. We hence consider the empirical discrete probability density of this random variable,

$$p\_{iyt} = \frac{d\_{iyt}}{\sum\_{t'} d\_{iyt'}} \qquad \text{for } t = 1, \ldots, 365$$

for each area 𝑖 and year 𝑦. The family {𝑝𝑖𝑦 }𝑖𝑦 is the main focus of our analysis: we show these discrete densities in Fig. 2, with the same colour choices as in Fig. 1. It is clear that using densities provides a natural alignment of areas whose populations differ significantly, providing complementary insights with respect to the absolute death counts: greater emphasis is placed on the temporal structure of the phenomenon. For example, the astonishing behaviour of the province of Bergamo during the first pandemic wave in 2020 is now much more visible.
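The empirical densities defined above amount to a row-wise normalization of the count matrix. A minimal numpy sketch, with Poisson toy counts standing in for the actual ISTAT data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy daily counts for one year: 107 provinces x 365 days (placeholder for d_{iyt}).
d = rng.poisson(lam=5.0, size=(107, 365))

# p[i, t] = d[i, t] / sum_t d[i, t]: each row becomes a discrete probability density.
p = d / d.sum(axis=1, keepdims=True)
```

Each row of `p` sums to one, so provinces of very different populations become directly comparable in terms of *when* deaths occur along the year.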

**Fig. 2** Empirical densities of daily mortality, for people aged between 50 and 69 years, at the provincial scale. For each province, the empirical density of the daily mortality is plotted in light blue: densities are overlaid one on top of the other to visualize their variability. The black solid line is the weighted mean density, where the weight for each province has been set to be proportional to its population; some selected provinces are highlighted in colour.

In this talk, we will show results obtained by embedding a smoothed version of the {𝑝𝑖𝑦 }𝑖𝑦, i.e., an estimate { 𝑓𝑖𝑦 }𝑖𝑦 of the continuous density functions of the {𝑇𝑖𝑦 }𝑖𝑦, in the Hilbert space 𝐵<sup>2</sup>(Θ), called the *Bayes space* [2, 4, 3], where Θ denotes the calendar year. This is the set (of equivalence classes) of functions

$$B^2(\Theta) = \{ f : \Theta \to \mathbb{R}^+ \text{ s.t.} \, f > 0, \log(f) \in L^2(\Theta) \}$$

where the equivalence relation in 𝐵<sup>2</sup>(Θ) is defined among *proportional* functions, i.e., 𝑓 =<sub>𝐵²</sub> 𝑔 if 𝑓 = 𝛼𝑔 for a constant 𝛼 > 0. In [1], we also propose a preliminary exploration of the {𝑝𝑖𝑦 }𝑖𝑦 based on the *Wasserstein space* embedding, a very regular metric space of probability measures with a straightforward physical interpretation [5]. For the sake of brevity, we here focus on the analysis in 𝐵<sup>2</sup>(Θ), which constitutes our main contribution.
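The Bayes-space geometry can be made concrete on a discretized density via the centred log-ratio (clr) transform, the standard tool that maps Bayes-space operations to ordinary vector operations. The sketch below is a finite-grid illustration under that assumption (the paper works with continuous densities): the Bayes-space "sum" (perturbation) of two densities is their normalized pointwise product, and clr turns it into plain addition.

```python
import numpy as np

def clr(f):
    """Centred log-ratio of a strictly positive discretized density."""
    lf = np.log(f)
    return lf - lf.mean()

def perturb(f, g):
    """Bayes-space 'sum' of two densities: normalized pointwise product."""
    h = f * g
    return h / h.sum()

grid = np.linspace(0.0, 1.0, 365)
f = np.exp(-0.5 * ((grid - 0.3) / 0.1) ** 2); f /= f.sum()
g = np.exp(-0.5 * ((grid - 0.6) / 0.2) ** 2); g /= g.sum()

lhs = clr(perturb(f, g))   # clr of the Bayes-space sum ...
rhs = clr(f) + clr(g)      # ... equals the sum of the clr coordinates
```

Note also that `clr(f)` is unchanged if `f` is multiplied by a positive constant, which is exactly the equivalence among proportional functions stated above.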

**Fig. 3** Smooth estimates of the mortality densities (50-69 years) over the 107 Italian provinces. The usual pattern of mortality is visible until 2019, while the functional process is completely different in 2020, with the two pandemic waves clearly captured by the estimated densities. The black thick lines represent the mean density, computed in 𝐵<sup>2</sup>, with weights proportional to the population in each area.

𝐵<sup>2</sup>(Θ) is equipped with a Hilbert geometry, constituted by appropriate operations of sum, multiplication by a scalar, and inner product, which make it the infinite-dimensional counterpart of the Aitchison simplex used in standard compositional analysis [6, 7]: for this reason this space is considered the most suited Hilbert embedding for positive continuous density functions. The smoothed densities { 𝑓𝑖𝑦 }𝑖𝑦 are shown in Fig. 3: they are obtained by smoothing the {𝑝𝑖𝑦 }𝑖𝑦 via *compositional splines* [8, 9]. It is easy to see, by comparison with Fig. 2, how smoothing filters out a good amount of noise, much more so than for the elderly class: this is fairly reasonable, since the death process is usually noisier for younger age classes. From now on, the { 𝑓𝑖𝑦 }𝑖𝑦 are analysed as a spatio-temporal functional random sample taking values in 𝐵<sup>2</sup>(Θ). We briefly anticipate the results of such analysis:


Points 1 and 2 above are detailed in Section 2, while point 3 will be discussed during the talk. The reader is referred to [1] for full details on the analysis pipeline.

# **2 Some Results**

The first step in the analysis of the random sample { 𝑓𝑖𝑦 }𝑖𝑦, where 𝑖 indexes the 107 Italian provinces, is the formulation of a family of function-on-function linear models in 𝐵<sup>2</sup>(Θ), extending the classical models formulated in the 𝐿<sup>2</sup> case [17], namely

$$f\_{iy}(t) = \beta\_{0y}(t) + \langle \beta\_{y}(\cdot, t), \overline{f}\_{iy} \rangle\_{B^2} + \epsilon\_{iy}(t), \quad i = 1, \ldots, 107, \quad t \in \Theta,\tag{1}$$

where $\overline{f}\_{iy} = \frac{1}{4} \sum\_{r=y-4}^{y-1} f\_{ir}$ is the 𝐵<sup>2</sup> mean of the observed densities in the four years preceding year 𝑦; the functional parameters 𝛽0𝑦 (𝑡), 𝛽𝑦 (𝑠, 𝑡) are defined in the 𝐵<sup>2</sup> sense, as are the residual terms 𝜖𝑖𝑦 (𝑡) and all operations of summation and multiplication by a scalar. Model (1) explains the realization of the mortality density 𝑓𝑖𝑦 for year 𝑦 in province 𝑖 as a linear function of what happened in the same province during the preceding years. It is thus interesting to look at the following functional prediction errors:

$$
\delta\_{i\mathbf{y}} = f\_{i\mathbf{y}} - \hat{f}\_{i\mathbf{y}} \tag{2}
$$

where

$$
\hat{f}\_{iy}(t) := \beta\_{0,y-1}(t) + \langle \beta\_{y-1}(\cdot, t), \overline{f}\_{iy} \rangle\_{B^2}.\tag{3}
$$

The $\delta_{iy}$ are not the estimates $\hat{\epsilon}_{iy}$ of the residuals of model (1): they rather represent

**Fig. 4** First four panels, from the left: heatmaps of the $B^2$ norm of the prediction errors $\delta_{iy}$, in logarithmic scale, for the elderly class. In 2020 the pandemic diffusion is clearly visible in northern Italy, while the prediction errors are generally higher in all provinces. Last panel: result of a $K$-means $B^2$ functional clustering ($K = 3$) on the $\delta_{iy}$ for 2020.

the error committed in forecasting $f_{iy}$ using the model fitted at year $y-1$. Thus, we can regard the densities $\delta_{iy}$ as the *unpredictable component* of $f_{iy}$, i.e., as a proxy for what happened at year $y$ that could not be predicted from information available in the previous years, and analyse them from the spatial viewpoint. For example, we can look at the spatial heatmaps of the $B^2$ norms of the $\delta_{iy}$, shown in Fig. 4. The magnitude of the error norms makes clear that what happened during 2020 was to a large extent unpredictable, since almost all Italian provinces exhibit higher errors than in previous years. More significantly, in 2020 a clear spatial pattern can be noticed, at least during the first wave in northern Italy: a diffusive process, centred on the provinces most gravely hit by the first pandemic wave, seems to take place in northern Italy. As one might expect, this pattern is slightly less evident than in the case of the elderly class analysed in [1]. Going in this direction, we also show in Fig. 4 the result of a $K$-means functional clustering, set in the $B^2$ space, of the $\delta_{iy}$ for the year 2020. We clearly identify the provinces hit by the first wave (blue cluster), while the other two clusters behave irregularly: this is a neat contrast with the case of people aged over 70 years, where each cluster clearly identifies a different kind of pandemic behaviour (see [1]). For a more precise investigation of the spatial correlation structure of the

**Fig. 5** Empirical trace-semivariograms for the prediction errors 𝛿𝑖𝑦, in people aged between 50 and 69 years. The purple lines are the corresponding fitted exponential models. Distances on the x-axes are expressed in kilometers. The last panel shows the 2020 severe perturbation of the spatial dependence structure of the process generating the prediction errors.

process across different years, from the $\delta_{iy}$ we compute a *functional trace-variogram* for each year; we show them for 2017 up to 2020 in Figure 5. Without entering into the details of the mathematical definition of variograms, the fitted curves in Figure 5 can be read as follows. Distances are on the x-axis, while the y-axis carries a function of the spatial correlation of the process: when the curve reaches its horizontal asymptote, it has reached the total variance of the process and we are beyond the maximum correlation length. In this perspective, it is immediately apparent not only that the total variance of the functional process $\delta_{iy}$ sharply increased in 2020, but also that a significant spatial correlation emerged, consistent with the presence of a pandemic. In the main work [1], we further deepen the connection between the pandemic and the upheavals in the spatial structure by means of Principal Component Analysis of the $\delta_{iy}$ in the Bayes space (SFPCA, [16]).
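As a rough illustration of the last step, the empirical trace-semivariogram can be sketched as follows. This is a minimal version, assuming the functional errors have already been clr-transformed onto a common grid (so that squared $B^2$ distances reduce to ordinary squared $L^2$ distances); the function and variable names are ours, not the authors'.

```python
import numpy as np

def trace_semivariogram(deltas, coords, bins):
    """Empirical trace-semivariogram of a functional sample.

    deltas: (n_sites, n_grid) array of clr-transformed prediction errors.
    coords: (n_sites, 2) array of site coordinates (e.g. in km).
    bins:   increasing array of distance-bin edges.
    Returns bin centres and, per bin, the estimate
        gamma(h) = mean over pairs at distance ~ h of ||delta_i - delta_j||^2 / 2.
    """
    n = len(deltas)
    dists, sqnorms = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(coords[i] - coords[j]))
            # squared L^2 distance approximated by a mean over the grid
            sqnorms.append(np.mean((deltas[i] - deltas[j]) ** 2))
    dists, sqnorms = np.array(dists), np.array(sqnorms)
    centres, gamma = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (dists >= lo) & (dists < hi)
        if mask.any():
            centres.append((lo + hi) / 2)
            gamma.append(sqnorms[mask].mean() / 2)
    return np.array(centres), np.array(gamma)
```

A parametric model (e.g. exponential, as in Fig. 5) would then be fitted to the returned `(centres, gamma)` pairs; the sill of the fitted curve estimates the total variance of the process.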

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Clustering Validation in the Context of Hierarchical Cluster Analysis: an Empirical Study**

Osvaldo Silva, Áurea Sousa, and Helena Bacelar-Nicolau

**Abstract** The evaluation of clustering structures is a crucial step in cluster analysis. This study presents the main results of the hierarchical cluster analysis of variables concerning a real dataset in the context of Higher Education. The goal of this research is to find a typology of some relevant items taking into account both the homogeneity and the isolation of the clusters. Two similarity measures, namely the standard affinity coefficient and Spearman's correlation coefficient, were used, combined with three probabilistic aggregation criteria (*AVL*, *AVB* and *AV1*) from a parametric family in the scope of the *VL* (Validity Link) methodology. The best partitions were selected based on some validation indices, namely the global *STAT* levels statistics and the measures P(I2, Σ) and γ, adapted to the case of similarity coefficients. In order to evaluate the clusters and identify their most representative elements, the Mann and Whitney *U* statistics and the silhouette plot were also used.

**Keywords:** clustering validation, affinity coefficient, Spearman correlation coefficient, *VL* methodology


© The Author(s) 2023 343 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_37

Osvaldo Silva () Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, 9500-321, Portugal, e-mail: osvaldo.dl.silva@uac.pt

Áurea Sousa Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal, e-mail: aurea.st.sousa@uac.pt

Helena Bacelar-Nicolau Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health (ISAMB/FM-UL), Portugal, e-mail: hbacelar@psicologia.ulisboa.pt

# **1 Introduction**

Cluster analysis or unsupervised classification usually concerns exploratory multivariate data analysis methods and techniques for grouping either a set of data units or an associated set of descriptive variables in such a way that elements in the same group (cluster) are more similar to each other than elements in different clusters [6]. Therefore, it is important to validate the results obtained, bearing in mind that, in an ideal situation, the clusters should be internally homogeneous and externally well separated or isolated. Thus, according to Silva et al. ([15], p. 136), there are some important questions, such as: "i) How to compare partitions obtained using different cluster algorithms? ii) Is it possible to join information from several approaches in the decision-making process of choosing the most representative partition?"

This paper presents the main results of a hierarchical cluster analysis of variables concerning a real dataset in the field of Higher Education, in order to find a typology taking into account relevant validation measures. Two similarity measures (the standard affinity coefficient and Spearman's correlation coefficient) were used, combined with aggregation criteria from a parametric family in the scope of the *VL* methodology (e.g., [10, 11, 17]).

With regard to the validation of clustering structures, some validation indices, described in Section 2, were used to evaluate the partitions and the clusters that compose them. The main results are presented and discussed in Section 3. Section 4 contains some final remarks.

# **2 Data and Methods**

Data were obtained from a questionnaire administered to three hundred and fifty students who were attending Higher Education in a public university, after their informed consent. The questionnaire contains, among others, eleven questions related to academic life and the respective courses.

Several algorithms of hierarchical cluster analysis of variables were applied on the data matrix. The variables (items) are: T1-Participation, T2-Interest, T3- Expectations, T4-Accomplishment, T5-Job Outlook, T6- Teachers' Professional Competence, T7-Distribution of Curricular Units, T8- Number of weekly hours of lessons, T9-Number of hours of daily study, T10-School Outcomes and T11- Assessment Methods, which were evaluated based on a Likert scale from 1 to 5 (1-Totally disagree, 2- Partially disagree, 3- Neither disagree nor agree, 4- Partially agree, 5- Totally agree).

The Ascendant Hierarchical Cluster Analysis (AHCA) was based on the standard affinity coefficient [1, 17] and Spearman's correlation coefficient. In this paper both measures of comparison were combined with three probabilistic aggregation criteria (*AVL*, *AVB* and *AV1*) issued from the *VL* parametric family. This methodology, in the scope of Cluster Analysis, uses probabilistic comparison functions between pairs of elements, which correspond to random variables following a standard uniform distribution. In addition, this approach considers probabilistic aggregation criteria, which can be interpreted as distribution functions of statistics of random variables that are i.i.d. uniform on [0, 1] (e.g., [17]).

Let $A$ and $B$ be two clusters with cardinals $\alpha$ and $\beta$, respectively, and let $\gamma_{xy}$ be a similarity measure between pairs of elements $x, y \in E$ (the set of elements to classify). Concerning the family of *AVL* methods (e.g., *SL*, *AV1*, *AVB*, and *AVL*), the comparison functions between clusters can be summarized by the following unified formula:

$$
\Gamma(A,B) = (p\_{AB})^{g(\alpha,\beta)}\tag{1}
$$

where $\alpha = \mathrm{Card}\,A$, $\beta = \mathrm{Card}\,B$, and $p_{AB} = \max\{\gamma_{ab} : (a, b) \in A \times B\}$, with $1 \le g(\alpha, \beta) \le \alpha\beta$, establishing a bridge between the *SL* method and the *AVL* methods, which have a braking effect on the formation of chains. For example, $g(\alpha, \beta) = 1$ for *SL*, $g(\alpha, \beta) = (\alpha + \beta)/2$ for *AV1*, $g(\alpha, \beta) = \sqrt{\alpha\beta}$ for *AVB*, and $g(\alpha, \beta) = \alpha\beta$ for *AVL* (see [3, 17]).
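Formula (1) can be transcribed directly: the four members of the family differ only in the exponent $g(\alpha, \beta)$. A minimal sketch, with names of our own choosing and a similarity matrix assumed to take values in [0, 1]:

```python
import math

def vl_comparison(sim, A, B, method="AVL"):
    """Gamma(A, B) = p_AB ** g(alpha, beta), as in formula (1).

    sim:    similarity matrix (nested lists or array) with values in [0, 1]
    A, B:   clusters, given as lists of element indices
    method: one of "SL", "AV1", "AVB", "AVL"
    """
    alpha, beta = len(A), len(B)
    p_ab = max(sim[a][b] for a in A for b in B)
    g = {
        "SL": 1,                      # single linkage: no braking
        "AV1": (alpha + beta) / 2,
        "AVB": math.sqrt(alpha * beta),
        "AVL": alpha * beta,          # strongest braking effect
    }[method]
    return p_ab ** g
```

Since $p_{AB} \le 1$, a larger exponent yields a smaller $\Gamma(A, B)$, which is precisely the "braking effect" on chaining mentioned above.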

The application of the two measures of comparison between elements (the Spearman correlation coefficient and the standard affinity coefficient), combined with the aforementioned aggregation criteria, aims to find a typology of items corresponding to the best among the best partitions obtained by the several algorithms, in order to verify whether there are any substantial changes in the results. Therefore, some validation indices based on the values of the corresponding proximity matrices were used, namely the global levels statistics (*STAT*) [1, 10, 11] and the indices P(I2mod, Σ) and γ [8], adapted to this type of matrices [16], so that the choice of the best partition is judicious and based on the desirable properties (e.g., isolation and homogeneity of the clusters). Concerning the best partitions, the evaluation of the respective clusters and the identification of their most representative elements were based on appropriate adaptations of the Mann and Whitney *U* statistics [8] and of the silhouette plots [14] to the case of similarity measures.

Each level of a dendrogram corresponds to a stage in the constitution of the hierarchy of partitions. Therefore, the study of the most relevant partition(s) is strictly related to the choice of the best cut-off levels (e.g., [5, 6]).

According to Bacelar-Nicolau [1, 2], the global levels statistics (*STAT*) values must be calculated for each of the $k = 1, \ldots, nivmax$ levels of the corresponding dendrograms, denoted $STAT(k)$. At each level $k$, $STAT(k)$ is the global statistic that measures the total information given by the pre-order associated with the corresponding partition, relative to the initial pre-order associated with the similarity or dissimilarity measure. A "significant" level is one that corresponds to a partition for which the global statistic undergoes a significant increase relative to the information provided by neighbouring levels, that is, a local maximum of the differences $DIF(k) = STAT(k) - STAT(k-1)$, $k = 1, \ldots, nivmax$.
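Given the $STAT(k)$ values for a dendrogram, the "significant" levels are then simply the local maxima of the successive differences. A minimal sketch, assuming the *STAT* values have already been computed (the function name and the indexing convention are ours):

```python
def significant_levels(stat):
    """Return the levels whose DIF(k) = STAT(k) - STAT(k-1) is a local maximum.

    stat: list of STAT values, stat[k] for levels k = 0, ..., nivmax - 1.
    Levels are reported as indices into `stat`.
    """
    dif = [stat[k] - stat[k - 1] for k in range(1, len(stat))]
    levels = []
    for k in range(len(dif)):
        left = dif[k - 1] if k > 0 else float("-inf")
        right = dif[k + 1] if k < len(dif) - 1 else float("-inf")
        if dif[k] > left and dif[k] > right:
            levels.append(k + 1)  # dif[k] corresponds to level k + 1
    return levels
```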

#### **2.1 Adaptation of the P(I2, Σ) Index**

To evaluate the partitions, an appropriate adaptation of the index P(I2, Σ) [8] to the case of similarity measures was used, given by the following formula:

$$P(I2mod,\Sigma) = \frac{1}{c} \sum\_{r=1}^{c} \frac{\sum\_{i \in C\_r} \sum\_{j \notin C\_r} s\_{ij}}{n\_r \times (N - n\_r)} \tag{2}$$

where $c$ is the number of clusters of the partition, $n_r$ is the cardinality of cluster $C_r$, $N$ is the total number of elements, and $s_{ij}$ is the value of the similarity measure between an element $i$ belonging to cluster $C_r$ and an element $j$ belonging to another cluster. This index takes into account the number of clusters and the number of elements in each cluster, and evaluates the isolation of the clusters of a given partition.
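A direct transcription of formula (2) can be sketched as follows (the similarity-matrix and partition encodings are ours):

```python
import numpy as np

def p_i2mod(S, clusters):
    """Adapted isolation index P(I2mod, Sigma), formula (2).

    S:        (N, N) symmetric similarity matrix
    clusters: list of clusters, each a list of element indices
    """
    N = S.shape[0]
    c = len(clusters)
    total = 0.0
    for Cr in clusters:
        outside = [j for j in range(N) if j not in Cr]
        # sum of similarities between cluster C_r and the rest,
        # normalised by the number of such pairs
        s_out = sum(S[i, j] for i in Cr for j in outside)
        total += s_out / (len(Cr) * (N - len(Cr)))
    return total / c
```

Since the index averages between-cluster similarities, lower values correspond to better-isolated clusters.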

#### **2.2 Goodman and Kruskal Index (γ)**

The γ index, proposed by Goodman and Kruskal [7], has been widely used in cluster validation [9]. Comparisons are made between all within-cluster similarities $s_{ij}$ and all between-cluster similarities $s_{kl}$ [18]. A comparison is judged concordant (respectively, discordant) if $s_{ij}$ is strictly greater (respectively, smaller) than $s_{kl}$. The γ index is defined by:

$$\gamma = (\mathcal{S}\_{+} - \mathcal{S}\_{-})/(\mathcal{S}\_{+} + \mathcal{S}\_{-}),\tag{3}$$

where $S_+$ (respectively, $S_-$) is the number of concordant (respectively, discordant) comparisons. This index is a global stopping rule that evaluates the fit of the partition into *c* clusters based on the homogeneity (high similarity between elements within clusters) and the isolation (low similarity between elements of different clusters) of the clusters. Note that the higher the value of this index, the better the fit of the partition.
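A direct, if brute-force, implementation of (3) can be sketched as follows (function name and input encoding are ours):

```python
def gk_gamma(S, labels):
    """Goodman-Kruskal gamma index, formula (3), for a similarity matrix S.

    S:      symmetric similarity matrix (nested lists or array)
    labels: cluster label of each element
    """
    n = len(labels)
    within, between = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (within if labels[i] == labels[j] else between).append(S[i][j])
    # compare every within-cluster similarity with every between-cluster one
    s_plus = sum(1 for w in within for b in between if w > b)   # concordant
    s_minus = sum(1 for w in within for b in between if w < b)  # discordant
    return (s_plus - s_minus) / (s_plus + s_minus)
```

The brute-force double loop costs $O(|S_w| \cdot |S_b|)$ comparisons, which is acceptable for the small item sets considered here.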

The use of the *STAT*, γ and P(I2mod, Σ) indices can help identify the most significant levels of a dendrogram, taking into account both the homogeneity and the isolation of the clusters [15].

#### **2.3 U Statistics (Mann and Whitney)**

*U* statistics [12] are relevant for assessing the suitability of a cluster, combining the concepts of compactness and isolation. Thus, the "best" cluster is the one with the lowest values of the global *U*-index, $U_G$, and the local *U*-index, $U_L$ [8]. In the present paper we used an appropriate adaptation of these indices to the case of similarity measures (for details, see [19]). Moreover, the clusters considered "ideal" are those for which $U_G$ and $U_L$ both take the value zero. The Mann and Whitney *U* statistics are useful for decision making under uncertainty, in the evaluation of both clusters and partitions.

#### **2.4 Silhouette Plots**

We also used an appropriate adaptation of the silhouette plots [14], which allows the assessment of the compactness and relative isolation of clusters. The adaptation of this measure to the case of similarity measures, $Sil(i)$, considers the average of the similarities between an element $i$ belonging to cluster $C_r$, which contains $n_r$ ($\ge 2$) elements, and all other elements that do not belong to this cluster (see [19]). The values $\{Sil(i) : i \in C_r\}$ lie between −1 and +1, with "values near +1 indicating that element strongly belongs to the cluster in which it has been placed" ([8], p. 205). In the case of a singleton cluster, $Sil(i)$ takes the value zero [8] in the corresponding algorithm.
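The exact adaptation is detailed in [19]; as a rough sketch of the idea, one can mirror the classical silhouette with similarities in place of dissimilarities. The version below is our own simplified illustration, not the authors' formulation:

```python
def sil_similarity(S, labels, i):
    """Rough silhouette-style score for element i under a similarity matrix S.

    a_i: average similarity of i to its own cluster (excluding i);
    b_i: highest average similarity of i to any other cluster;
    Sil(i) = (a_i - b_i) / max(a_i, b_i); singletons get 0 by convention.
    """
    own = [j for j in range(len(labels)) if labels[j] == labels[i] and j != i]
    if not own:          # singleton cluster: Sil(i) = 0 by convention
        return 0.0
    a = sum(S[i][j] for j in own) / len(own)
    b = 0.0
    for lab in set(labels) - {labels[i]}:
        members = [j for j in range(len(labels)) if labels[j] == lab]
        b = max(b, sum(S[i][j] for j in members) / len(members))
    return (a - b) / max(a, b)
```

Values near +1 then indicate that element $i$ is far more similar to its own cluster than to any other, matching the reading of the silhouette plot above.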

# **3 Results and Discussion**

The best partitions provided by the dendrograms are shown in Table 1.


**Table 1** The best partitions concerning the dendrograms.

Figure 1 shows the dendrograms obtained, respectively, by the standard affinity coefficient (left side) and Spearman's correlation coefficient (right side), both combined with the *AVL* method.

**Fig. 1** Dendrograms based on standard affinity coefficient (left side) and Spearman's correlation coefficient (right side) - *AVL*.

The "best" partition obtained using the affinity coefficient and the *AVL* method is the partition into two clusters (level 9 of the aggregation process). The first cluster consists of nine items that highlight the importance of the teachers' professional competence, the structure/content of the course, and the future perspectives regarding career opportunities, mostly factors exogenous to the students. The second one is composed of two items (T2 and T9), which emphasize the role of interest in the study of Mathematics.

The algorithms using the standard affinity coefficient are the ones that provided the best partitions, and their hierarchies remained closest to the initial pre-orders. In fact, in the case of the Spearman correlation coefficient, the values of the *STAT* and γ indices are clearly lower than the previous ones. Moreover, the cluster {T1, T3, T4, T5, T6, T7, T8, T10, T11}, corresponding to the best partition provided by the combination of the standard affinity coefficient with the aggregation criteria *AVL*, *AV1* and *AVB*, presents $U_G = 39$ and $U_L = 4$, both lower than the values obtained for the cluster {T3, T4, T2, T9, T6} ($U_G = 65$ and $U_L = 26$) provided by the Spearman correlation coefficient combined with the *AV1* and *AVB* methods, respectively.

Focusing on the first two partitions of Table 1, the only difference between them is that while the best partition provided by the *AV1* and *AVB* methods contains the singletons T2 and T9, the best partition given by *AVL* joins these two singletons in the same cluster. The values of the numerical validation indices shown in Table 1 indicate that the best partition is the one provided by the *AV1* and *AVB* methods. This conclusion is reinforced by the silhouette plot (see Figure 2), which indicates that the cluster joining T2 and T9, given by the *AVL* method, includes the elements with the two lowest values of *Sil*, and *Sil*(T2) is negative

**Fig. 2** Silhouette plot - standard affinity coefficient and *AVL* method.

(i.e., T2 does not fit very well in this cluster). Note that the silhouette plot cannot be used for the best partition, since it does not apply to singletons.

# **4 Final Remarks**

This research proved useful for identifying relevant partitions of items in the context of Higher Education. In the cases where the affinity and the Spearman correlation coefficients were used, the probabilistic criteria *AV1* and *AVB* showed a higher agreement regarding the hierarchies of partitions obtained than the *AVL* method.

The validation measures *STAT*, γ and P(I2mod, Σ) help us determine the best cut-off levels of a hierarchy of clusters, taking into account both the homogeneity and the isolation of the clusters. When there is no absolute consensus among these three measures, the Mann and Whitney *U* statistics and the silhouette plot prove very useful, as we have seen in the application of this methodology to evaluate both the clusters and the partitions obtained.

**Acknowledgements** Funding. This work is financed by national funds through FCT – Foundation for Science and Technology, I.P., within the scope of the project «UIDB/04647/2020» of CICS.NOVA – Centro de Ciências Sociais da Universidade Nova de Lisboa.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **An MML Embedded Approach for Estimating the Number of Clusters**

Cláudia Silvestre, Margarida G. M. S. Cardoso, and Mário Figueiredo

**Abstract** Assuming that the data originate from a finite mixture of multinomial distributions, we study the performance of an integrated *Expectation Maximization* (EM) algorithm that uses the *Minimum Message Length* (MML) criterion to select the number of mixture components. The referred EM-MML approach, rather than selecting one among a set of pre-estimated candidate models (which requires running EM several times), seamlessly integrates estimation and model selection in a single algorithm. Comparisons are provided with EM combined with well-known information criteria, e.g. the Bayesian Information Criterion. We resort to synthetic data examples and a real application. The EM-MML computation time is a clear advantage of this method; also, the real data solution it provides is more parsimonious, which reduces the risk of model order overestimation and improves interpretability.

**Keywords:** finite mixture model, EM algorithm, model selection, minimum message length, categorical data

# **1 Introduction**

Clustering is a technique commonly used in several research and application areas. Most of the clustering techniques are focused on numerical data. In fact, clustering

Cláudia Silvestre () Escola Superior de Comunicação Social, Campus de Benfica do IPL, 1549-014 Lisboa, Portugal, e-mail: csilvestre@escs.ipl.pt

Margarida G. M. S. Cardoso BRU-UNIDE, ISCTE-IUL, Av. das Forças Armadas, 1649-026 Lisboa, Portugal, e-mail: margarida.cardoso@iscte-iul.pt

Mário Figueiredo Instituto de Telecomunicações, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal, e-mail: mario.figueiredo@tecnico.ulisboa.pt

methods for categorical data are more challenging [12] and there are fewer techniques available [11].

In order to determine the number of clusters, model-based approaches commonly resort to information-based criteria, e.g., the *Bayesian Information Criterion* (BIC) [15] or the *Akaike Information Criterion* (AIC) [1]. These criteria look for a balance between the model's fit to the data (which corresponds to maximizing the likelihood function) and parsimony (using penalties associated with measures of model complexity), thus trying to avoid over-fitting. The use of information criteria follows the estimation of candidate finite mixture models, each with a predetermined number of clusters, generally obtained with an EM (*Expectation Maximization*) algorithm [7]. In this work, we focus on determining the number of clusters while clustering categorical data, using an EM-embedded approach to estimate the number of clusters. This approach does not rely on selecting among a set of pre-estimated candidate models, but rather integrates estimation and model selection in a single algorithm. Our new implementation, which deals with categorical variables by estimating a finite mixture of multinomials, follows a previous version described in [16]. We capitalized on the work of Figueiredo and Jain [9] for clustering continuous data and extended it to deal with categorical data. The embedded method is thus based on a *Minimum Message Length* (MML) criterion to select the number of clusters and on an EM algorithm to estimate the model parameters.

# **2 Clustering with Finite Mixture Models**

The literature on finite mixture models and their application is vast, including some books covering theory, geometry, and applications [8, 13, 3]. When applying finite mixture models to social sciences, the analyst is often confronted with the need to uncover sub-populations based on qualitative indicators.

#### **2.1 Definitions and Concepts**

Let $\mathbf{Y} = \{\underline{y}_i, i = 1, \ldots, n\}$ be a set of $n$ independent and identically distributed (i.i.d.) observations of a random vector $\underline{Y} = [Y_1, \ldots, Y_L]'$. We assume $\underline{Y}$ follows a mixture of $K$ component densities $f(\underline{y}|\underline{\theta}_k)$ ($k = 1, \ldots, K$), with probabilities $\{\alpha_1, \ldots, \alpha_K\}$, where $\underline{\theta}_k$ are the distributional parameters defining the $k$-th component and $\Theta = \{\underline{\theta}_1, \ldots, \underline{\theta}_K, \alpha_1, \ldots, \alpha_K\}$ is the set of all the parameters of the model. The $\alpha$ values, also called *mixing probabilities*, are subject to the usual constraints: $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \ge 0$, $k = 1, \ldots, K$. The log-likelihood of the observed sample is

$$\log f(\mathbf{Y}|\Theta) = \log \prod\_{i=1}^{n} f(\underline{\mathbf{y}}\_{i}|\Theta) = \sum\_{i=1}^{n} \log \sum\_{k=1}^{K} \alpha\_{k} f(\underline{\mathbf{y}}\_{i}|\underline{\theta}\_{k}).\tag{1}$$

In clustering, the identity of the component that generated each sample observation is unknown. The observed data $\mathbf{Y}$ are therefore regarded as incomplete, the missing data being a set of indicator vectors $\mathbf{Z} = \{\underline{z}_1, \ldots, \underline{z}_n\}$, each of the form $\underline{z}_i = [z_{i1}, \ldots, z_{iK}]'$, where $z_{ik}$ is a binary indicator: $z_{ik}$ takes the value 1 if the observation $\underline{y}_i$ was generated by the $k$-th component, and 0 otherwise. It is usually assumed that the $\{\underline{z}_i, i = 1, \ldots, n\}$ are i.i.d., following a multinomial distribution with $K$ categories and probabilities $\{\alpha_1, \ldots, \alpha_K\}$. The log-likelihood of the complete data $\{\mathbf{Y}, \mathbf{Z}\}$ is given by

$$\log f(\mathbf{Y}, \mathbf{Z}|\boldsymbol{\Theta}) = \sum\_{i=1}^{n} \sum\_{k=1}^{K} z\_{ik} \log \left[ \alpha\_{k} f(\underline{\mathbf{y}}\_{i}|\underline{\boldsymbol{\Theta}}\_{k}) \right]. \tag{2}$$

#### **2.2 Discrete Finite Mixture Models**

Consider that each variable $Y_l$ ($l = 1, \ldots, L$) in $\underline{Y}$ can take one of $C_l$ categories. Conditionally on having been generated by the $k$-th component of the mixture, each $Y_l$ is thus modeled by a multinomial distribution with $n_l$ trials, $C_l$ categories, and non-negative parameters $\underline{\theta}_{kl} = \{\theta_{klc}, c = 1, \ldots, C_l\}$, with $\sum_{c=1}^{C_l} \theta_{klc} = 1$. For a sample $y_{il}$ ($i = 1, \ldots, n$) of $Y_l$, we denote by $y_{ilc}$ the number of outcomes in category $c$, which is a sufficient statistic; naturally, $\sum_{c=1}^{C_l} y_{ilc} = n_l$. Thus, with $\underline{\theta}_k = \{\underline{\theta}_{k1}, \ldots, \underline{\theta}_{kL}\}$ and $\Theta = \{\underline{\theta}_1, \ldots, \underline{\theta}_K, \alpha_1, \ldots, \alpha_K\}$, the log-likelihood function for a set of observations corresponds to a discrete finite mixture model (a mixture of multinomials). This log-likelihood can be seen as corresponding to a missing-data problem, where the missing data have exactly the same meaning and structure as above. The log-likelihood of the complete data $\{\mathbf{Y}, \mathbf{Z}\}$ is thus given by

$$\log p(\mathbf{Y}, \mathbf{Z}|\boldsymbol{\Theta}) = \sum\_{i=1}^{n} \sum\_{k=1}^{K} z\_{ik} \log \left( \alpha\_{k} \prod\_{l=1}^{L} \left[ n\_{l}! \prod\_{c=1}^{C\_{l}} \frac{(\theta\_{klc})^{y\_{ilc}}}{y\_{ilc}!} \right] \right) . \tag{3}$$
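For concreteness, (3) can be computed directly. A minimal sketch, where the array shapes are our own convention (categories padded with zeros to a common width):

```python
import math
import numpy as np

def complete_loglik(Y, Z, alpha, theta):
    """Complete-data log-likelihood of a multinomial mixture, formula (3).

    Y:     (n, L, C) counts y_ilc
    Z:     (n, K) binary indicators z_ik
    alpha: (K,) mixing probabilities
    theta: (K, L, C) multinomial parameters theta_klc
    """
    n, L, C = Y.shape
    K = len(alpha)
    total = 0.0
    for i in range(n):
        for k in range(K):
            if Z[i, k] == 0:
                continue
            term = math.log(alpha[k])
            for l in range(L):
                n_l = int(Y[i, l].sum())
                term += math.lgamma(n_l + 1)  # log n_l!
                for c in range(C):
                    y = int(Y[i, l, c])
                    if y > 0:
                        term += y * math.log(theta[k, l, c]) - math.lgamma(y + 1)
                    # y == 0 contributes log(theta^0 / 0!) = 0
            total += term
    return total
```

The `lgamma` calls compute the log-factorials of (3) stably, avoiding overflow for large counts.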

To obtain a *maximum-likelihood* (ML) or *maximum a posteriori* (MAP) estimate of the parameters of a multinomial mixture, the well-known EM algorithm is usually the tool of choice [7].

# **3 Model Selection for Categorical Data**

Model selection is an important problem in statistical analysis [6]. In model-based clustering, the term *model selection* usually refers to the problem of determining the number of clusters, although it may also refer to the problem of selecting the structure of the clusters. Model-based clustering provides a statistical framework to solve this problem, usually resorting to *information criteria*. Among the best-known information criteria we find BIC and AIC, their modifications, namely the consistent AIC (CAIC) and the modified AIC (MAIC), and also the Integrated Completed Likelihood (ICL) [14, 4]. They are all easily implemented, the final model being selected as a compromise between its fit to the data and its complexity. In this work, we use the *Minimum Message Length* (MML) criterion to choose the number of components of a mixture of multinomials. MML is based on the information-theoretic view of estimation and model selection, according to which an adequate model is one that allows a short description of the observations. MML-type criteria evaluate statistical models according to their ability to compress a message containing the data, looking for a balance between choosing a simple model and one that describes the data well. According to Shannon's information theory, if $Y$ is a random variable with probability distribution $p(y|\Theta)$, the optimal code-length (in an expected-value sense) for an outcome $y$ is $l(y|\Theta) = -\log_2 p(y|\Theta)$, measured in bits (from the base-2 logarithm). If $\Theta$ is unknown, the total code-length function has two parts, $l(y, \Theta) = l(y|\Theta) + l(\Theta)$: the first part encodes the outcome $y$, while the second encodes the parameters of the model. The first part corresponds to the fit of the model to the data (a better fit corresponds to higher compression), while the second represents the complexity of the model. The message length function for a mixture of distributions (as developed in [2]) is:

$$d(\mathbf{y}, \Theta) = -\log p(\Theta) - \log p(\mathbf{y}|\Theta) + \frac{1}{2}\log|I(\Theta)| + \frac{C}{2}\left(1 - \log(12)\right), \qquad (4)$$

where $p(\Theta)$ is a prior distribution over the parameters, $p(\mathbf{y}|\Theta)$ is the likelihood function of the mixture, $|I(\Theta)| \equiv |{-E}\left[\frac{\partial^2}{\partial \Theta^2} \log p(Y|\Theta)\right]|$ is the determinant of the expected Fisher information matrix, and $C$ is the number of parameters of the model that need to be estimated. For example, for the mixture of $K$ multinomial distributions presented in (3), $C = (K - 1) + K \sum_{l=1}^{L} (C_l - 1)$. The expected Fisher information matrix of a mixture leads to a complex analytical form of MML which cannot be easily computed. To overcome this difficulty, Figueiredo and Jain [9] replace the expected Fisher information matrix by its complete-data counterpart $I_c(\Theta) \equiv -E\left[\frac{\partial^2}{\partial \Theta^2} \log p(Y, Z|\Theta)\right]$. They also adopt independent Jeffreys' *priors* for the mixture parameters, each proportional to the square root of the determinant of the corresponding Fisher information matrix. The resulting message length function is

$$l(\mathbf{y}, \Theta) = \frac{M}{2} \sum\_{k:\,\alpha\_k > 0} \log\left(\frac{n\,\alpha\_k}{12}\right) + \frac{k\_{nz}}{2} \log\frac{n}{12} + \frac{k\_{nz}(M+1)}{2} - \log p(\mathbf{y}|\Theta) \tag{5}$$

where $M$ is the number of parameters specifying each component (the dimension of each $\underline{\theta}_k$) and $k_{nz}$ is the number of components with nonzero probability (for more details on the derivation of (5), see [9, 2]).
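As a sketch, (5) can be evaluated as a function of the current parameter estimates; the log-likelihood $\log p(\mathbf{y}|\Theta)$ is passed in, since it depends on the mixture at hand (the function name and argument conventions are ours):

```python
import math

def message_length(alphas, n, M, loglik):
    """Message length l(y, Theta), formula (5).

    alphas: mixing probabilities (components with alpha_k = 0 are skipped)
    n:      sample size
    M:      number of parameters per component
    loglik: log p(y | Theta), the mixture log-likelihood of the data
    """
    nz = [a for a in alphas if a > 0]
    k_nz = len(nz)
    return (
        (M / 2) * sum(math.log(n * a / 12) for a in nz)  # per-component cost
        + (k_nz / 2) * math.log(n / 12)                  # cost of the alphas
        + k_nz * (M + 1) / 2
        - loglik                                         # data-encoding cost
    )
```

Minimizing this quantity trades off the likelihood term against the two terms that grow with the number of surviving components.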

# **4 The MML Based EM Algorithm**

In order to estimate a mixture of multinomials, we use a variant of the EM algorithm (herein termed EM-MML) which integrates both estimation and model selection by directly minimizing (5). The algorithm results from observing that (5) contains, in addition to the log-likelihood term, an explicit penalty on the number of components (the two terms proportional to $k_{nz}$), and a term (the first one) that can be seen as a log-prior on the $\alpha_k$ parameters of $\Theta$, which directly affects the M-step.

**E-step:** The E-step of EM-MML is precisely the same as in ML or MAP estimation, since the generative model for the data is the same. Since we are dealing with a multinomial mixture, we simply plug in the corresponding multinomial probability function, yielding

$$\bar{z}\_{ik}^{(t)} = \frac{\widehat{\alpha}\_k^{(t)} \prod\_{l=1}^{L} \left[ n\_l! \prod\_{c=1}^{C\_l} \frac{(\widehat{\theta}\_{klc}^{(t)})^{y\_{ilc}}}{y\_{ilc}!} \right]}{\sum\_{j=1}^{K} \widehat{\alpha}\_j^{(t)} \prod\_{l=1}^{L} \left[ n\_l! \prod\_{c=1}^{C\_l} \frac{(\widehat{\theta}\_{jlc}^{(t)})^{y\_{ilc}}}{y\_{ilc}!} \right]},\tag{6}$$

for 𝑖 = 1, . . . , 𝑛 and 𝑘 = 1, . . . , 𝐾.
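The E-step update (6) can be sketched as follows. Since, for a given observation, the multinomial coefficients are identical in numerator and denominator, they cancel and are omitted; the $L$ variables' counts are concatenated into a single count matrix. This is an illustrative sketch with our own names and layout, not the authors' implementation:

```python
import numpy as np

def e_step(Y, alphas, thetas):
    """Responsibilities of equation (6). Y: (n, D) matrix of category
    counts (the L variables' counts concatenated); thetas: (K, D)
    per-component category probabilities; alphas: (K,) weights.
    Multinomial coefficients cancel between numerator and denominator
    for each observation, so they are omitted."""
    log_p = Y @ np.log(thetas).T + np.log(alphas)  # (n, K) joint log-probs
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```

Working in the log domain avoids underflow when the counts $y\_{ilc}$ are large.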

**M-step:** For the M-step, notice that the first term in (5) can be seen as the negative log-prior $-\log p(\alpha\_k) = \frac{C-K+1}{2K} \log \alpha\_k$ (plus a constant), since, for the multinomial mixture, $M = \sum\_{l=1}^{L}(C\_l - 1) = \frac{C-K+1}{K}$. Enforcing the conditions $\alpha\_k \geq 0$, for $k = 1, \ldots, K$, and $\sum\_{k=1}^{K} \alpha\_k = 1$ yields the following updates for the estimates of the $\alpha\_k$ parameters:

$$\widehat{\alpha}\_{k}^{(t+1)} = \frac{\max\left\{0, \sum\_{i=1}^{n} \bar{z}\_{ik}^{(t)} - \frac{C - K + 1}{2K}\right\}}{\sum\_{j=1}^{K} \max\left\{0, \sum\_{i=1}^{n} \bar{z}\_{ij}^{(t)} - \frac{C - K + 1}{2K}\right\}},\tag{7}$$

for $k = 1, \ldots, K$. Notice that some $\widehat{\alpha}\_k^{(t+1)}$ may be zero; in that case, the $k$-th component is excluded from the mixture model. The multinomial parameters of components with $\widehat{\alpha}\_k^{(t+1)} = 0$ need not be further computed, since these components do not contribute to the likelihood. For the components with non-zero probability, $\widehat{\alpha}\_k^{(t+1)} > 0$, the estimates of the multinomial parameters are updated to their standard weighted ML estimates:

$$\widehat{\theta}\_{klc}^{(t+1)} = \frac{\sum\_{i=1}^{n} \bar{z}\_{ik}^{(t)} y\_{ilc}}{n\_l \sum\_{i=1}^{n} \bar{z}\_{ik}^{(t)}},\tag{8}$$

for $k = 1, \ldots, K$, $l = 1, \ldots, L$, and $c = 1, \ldots, C\_l$. Notice that, in accordance with the meaning of the $\theta\_{klc}$ parameters, $\sum\_{c=1}^{C\_l} \widehat{\theta}\_{klc}^{(t+1)} = 1$.
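A minimal sketch of the M-step updates (7) and (8), restricted for readability to a single categorical variable with $n\_l$ trials per observation; names are our own, and a full implementation would loop over the $L$ variables:

```python
import numpy as np

def m_step(Y, Z, C, K):
    """Updates (7) and (8) for one categorical variable. Z: (n, K)
    responsibilities from the E-step; C: total number of free
    parameters; Y: (n, D) counts, with n_l trials per observation."""
    penalty = (C - K + 1) / (2 * K)
    raw = np.maximum(0.0, Z.sum(axis=0) - penalty)  # (7): may prune to 0
    alphas = raw / raw.sum()
    n_l = Y[0].sum()                                # trials per observation
    thetas = (Z.T @ Y) / (n_l * Z.sum(axis=0))[:, None]  # (8)
    # rows of thetas for components with alphas == 0 can be discarded
    return alphas, thetas
```

The `np.maximum(0, ...)` step is what allows components to be annihilated during the iterations, merging model selection into estimation.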

# **5 Data Analysis and Results**

First, we evaluate the performance of the EM-MML algorithm on 10 synthetic data sets, over 50 runs. The data sets were generated from a mixture of 2 components over 3 categorical variables (with 2, 3 and 4 levels). The corresponding Silhouette index values illustrate the diversity of the structures: 0.099; 0.216; 0.217; 0.230; 0.713; 0.733; 0.746; 0.778; 0.805; 0.817. The results obtained are compared with those of a standard EM algorithm combined with the BIC, AIC, CAIC, MAIC, and ICL criteria.

The comparison resorts to a cohesion-separation measure and a concordance measure: the Fuzzy Silhouette index [5] of the clustering structure obtained, and the Adjusted Rand index [10] between that structure and the original one. In Table 1 we can verify that there are no significant differences between EM-MML and the other criteria, except for ICL, which only recovers the very well separated structures. Regarding the number of clusters, EM-MML and MAIC are tied, recovering this number correctly for all data sets. The same is not true for the other criteria: AIC identifies 3 clusters in 3 data sets and 4 clusters once; in addition, BIC and CAIC could not find any cluster structure once, and ICL failed to do so for 4 data sets. In terms of computation time, since EM-MML does not require a sequential approach, it is clearly faster than the other criteria (the Friedman test yields $\chi^2(5) = 2500$, p-value < 0.01; post hoc tests with Bonferroni correction only reveal statistically significant differences between EM-MML and the other criteria).


**Table 1** Criteria performance.

<sup>𝑎</sup> 1000 bootstrap samples were used to estimate the Confidence Intervals (CI).

Additional insight into the performance of EM-MML is obtained by applying it to a real data set from the 6th European Working Conditions Survey (Eurofound, 2015), the most recent edition of this survey at the time of the study.

For the purpose of our experiment, we consider the aggregate data referring to 305 European regions and the answers to the following questions: Are you able to


**Fig. 1** Clusters' profile and their dimensions (𝑛).

choose or change: a) your order of tasks; b) your methods of work; c) your speed or rate of work. Do you work in a group or team that has common tasks and can plan its work?

EM-MML selected 7 clusters, fewer than the remaining criteria (ICL, BIC, CAIC, AIC and MAIC select 10, 12, 12, 15 and 15 clusters, respectively). This avoids the estimation problems associated with very small segments and also improves the interpretability of the clustering solution.

The segments selected by the EM-MML criterion are presented in Figure 1. Workers with slightly above-average autonomy (cluster 7) live in several countries; Ireland stands out, as well as Belgium, Germany, the Netherlands, Switzerland, and the UK regions. Denmark, Estonia, Malta, and Norway are the countries where the most independent workers are found (cluster 3). The smallest cluster, cluster 6, includes Sweden as well as Kriti and Açores, a Greek and a Portuguese region, respectively. Cluster 5, where workers claim they have no autonomy, includes regions from many countries.

# **6 Discussion and Perspectives**

In this work, a model selection criterion and method for finite mixture models of categorical observations was studied: EM-MML. This algorithm simultaneously performs model estimation and selects the number of components/clusters. Compared to the information criteria commonly associated with the use of the EM algorithm, the EM-MML method exhibits several advantages: 1) it easily recovers the true number of clusters in synthetic data sets with various degrees of separation; 2) its computation times are significantly lower than those required by standard approaches resorting to the sequential use of EM and an information criterion; 3) when applied to a real data set, it produces a more parsimonious solution, which is thus easier to interpret. An additional advantage stemming from more parsimonious solutions is that they have a higher number of observations per cluster, thus helping to overcome potential estimation problems.

The performance of the EM-MML is encouraging for selecting the number of clusters, and the same criterion was already used for feature selection [17]. However, future research is required, namely considering data sets with different numbers of clusters and high dimensional data.

**Acknowledgements** This work was supported by Fundação para a Ciência e Tecnologia, grant UIDB /00315/2020.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Typology of Motivation Factors for Employees in the Banking Sector: An Empirical Study Using Multivariate Data Analysis Methods**

Áurea Sousa, Osvaldo Silva, M. Graça Batista, Sara Cabral, and Helena Bacelar-Nicolau

**Abstract** Leadership has been considered a competitive advantage for organizations, contributing to their success and to effective and efficient performance. Motivation, in turn, is assumed to be a basic competence of leadership. The main purpose of this paper is therefore to understand the perceptions of bank employees regarding the main motivational factors in the organizational context. Data analysis was performed using several statistical methods, among which Categorical Principal Component Analysis (CatPCA) and some agglomerative hierarchical clustering algorithms from the *VL* (*V* for Validity, *L* for Linkage) parametrical family, applied to the items that assess the aspects most valued by bankers in the work context. CatPCA extracted four principal components, which explain almost 70% of the total data variance. The dendrograms provided by the hierarchical clustering algorithms over the same data exhibit four main branches, which are associated with different main motivational factors. Moreover, the CatPCA and clustering results show an important correspondence concerning the main motivations in this sector.

**Keywords:** leadership, welfare, motivational factors, CatPCA, cluster analysis

Áurea Sousa ()

Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, 9500-321, Portugal, e-mail: aurea.st.sousa@uac.pt

Osvaldo Silva Universidade dos Açores and CICSNOVA.UAc, Rua da Mãe de Deus, Portugal, e-mail: osvaldo.dl.silva@uac.pt

M. Graça Batista Universidade dos Açores and CEEAplA, Rua da Mãe de Deus, Portugal, e-mail: maria.gc.batista@uac.pt

Sara Cabral Universidade dos Açores, Rua da Mãe de Deus, Portugal, e-mail: sara\_crc@hotmail.com

Helena Bacelar-Nicolau Universidade de Lisboa (UL) Faculdade de Psicologia and Institute of Environmental Health (ISAMB/FM-UL), Portugal, e-mail: hbacelar@psicologia.ulisboa.pt

© The Author(s) 2023 P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_39

# **1 Introduction**

Motivation has always been a subject of analysis by the scientific community, and numerous definitions have emerged. For Robbins and Judge ([21], p. 184), motivation is defined as "the processes that account for an individual's intensity, direction, and persistence of effort toward attaining a goal". These three indicators are assumed to be key factors of motivation: intensity describes the individual's effort to achieve the proposed goals; this effort should go in a direction that benefits the organization; and persistence is the extent to which the individual is able to maintain that effort. In this context, the individual's behavior is determined by what motivates them, which is why their performance results not only from ability and skills, but also from motivation. Motivation is complex and influenced by innumerable variables, given the diverse needs and expectations that individuals try to satisfy in different ways [15]. Moreover, different leadership practices may lead to better or worse motivational responses from employees.

The main purpose of this paper is to analyse the perceptions of bank employees who work in the banks that operate in the Autonomous Region of the Azores on the main motivational factors in the organizational context. Our study also intends to perform a reduction of the dimensionality of the data and to find a typology of a set of items that was used to evaluate the latent variable "Motivation", regarding the most valued aspects in the work context. Thus, Section 2 concerns the materials and methods of research. Section 3 presents and discusses the main results of this study. Finally, Section 4 contains the main conclusions.

# **2 Materials and Methods**

This study was based on a quantitative approach, using a validated questionnaire, which can be found in Cabral [7]. The sample consists of 202 bank employees (51.0 % male and 49.0 % female) of the Autonomous Region of the Azores (response rate: 6.4%). Most respondents are 36 years old or older (60.9%) and have higher education (56.7%).

The present study refers to a subset of twenty-seven items used to evaluate the latent variable "Motivation" in work context, namely: 1 - The opportunity for career advancement, 2 - Have greater responsibility, 3 - The feeling of being involved in decision making, 4 - A job that gives you prestige and status, 5 - Have an interesting and challenging job, 6 - The recognition and appreciation of others for the accomplished work, 7 - Have a good relationship with your colleagues, 8 - Have a good relationship with your superiors, 9 - A work environment where there is trust and respect, 10 - The loyalty of superiors towards the collaborators, 11 - Team spirit, 12 - Sense of belonging to the organization, 13 - An adequate discipline, 14 - There is equality of treatment and opportunities between the various employees, 15 - Earn respect and esteem of your colleagues and superiors, 16 - Professional development, 17 - Salary appropriate to the professional functions, 18 - A stable job that gives you security, 19 - Good working conditions, 20 - Balance between personal and professional life, 21 - Being able to express your opinion and ideas without fear of reprisals, 22 - Availability to solve problems/personal situations, 23 - Have a fair and adequate system of objectives and incentives, 24 - Being rewarded for overtime work, 25 - Being pressured to achieve the proposed objectives, 26 - Ability to handle pressure at work, and 27 - Appropriate training to the professional functions.

For each item, respondents could pick only one of six response modalities according to their level of agreement or disagreement with the items that assess motivation: Totally disagree; Disagree most of the time; Slightly disagree; Slightly agree; Agree most of the time; and Totally agree. In this study, Categorical Principal Components Analysis (CatPCA), using the Varimax rotation method with Kaiser normalization, and some agglomerative hierarchical clustering algorithms (AHCA) were used. Data analysis was performed using the packages IBM SPSS Statistics 26 and CLUST11 [19].

Principal Components Analysis (PCA) aims to reduce the dimensionality of the original data so that "the first few dimensions account for as much of the available information as possible" ([9], p. 83), assuming linear relationships among numeric variables. Each principal component is uncorrelated with all others, and it is expressed as a linear combination of the original variables. CatPCA optimally quantifies categorical (ordinal or nominal) variables and can handle and discover nonlinear relationships between variables (e.g., [12]). In the present study, we applied the CatPCA due to the ordinal nature of the items under analysis.
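For contrast with CatPCA, standard linear PCA can be sketched via the singular value decomposition of the centered data matrix. This is only an illustrative sketch of plain PCA (the names are our own); it omits the optimal quantification of categorical variables that distinguishes CatPCA:

```python
import numpy as np

def pca(X, n_components):
    """Plain linear PCA via the SVD of the centered data matrix.
    Returns the component scores and the fraction of total variance
    explained by each retained component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / (X.shape[0] - 1)            # component variances
    scores = Xc @ Vt[:n_components].T        # mutually uncorrelated scores
    return scores, var[:n_components] / var.sum()
```

Each score column is a linear combination of the original variables, and the columns are uncorrelated, matching the properties stated above.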

The goal of a clustering algorithm is to obtain a partition in which the elements within a cluster are similar and elements (objects/individuals/groups of individuals or variables) in different clusters are dissimilar, identifying natural clustering structures in a data set (e.g., [8]). Agglomerative clustering algorithms usually start with each element in its own separate cluster of size 1 (singleton). At each step, the algorithms find the two "closest" clusters, according to the aggregation criterion, and join them. The process continues until a cluster containing all elements to classify is obtained. The AHCA of the set of items was based on the affinity coefficient as a measure of comparison between elements, combined with two classic aggregation criteria (Single-Linkage (*SL*) and Complete-Linkage (*CL*)) and a family of probabilistic *VL* (*V* for Validity, *L* for Linkage) aggregation criteria (e.g., [1, 2, 3, 10, 11, 16, 17, 18, 22]).

According to Ng et al. ([20], p. 849), "the task of finding good clusters has been the focus of considerable research in machine learning and pattern recognition". However, the identification of the best partitions using validation indices is also of crucial importance. A pertinent question therefore arises: "How well does the partition fit the data?" ([8], p. 505). As far as validation of results is concerned, the identification of the best partitions in the present study was based on the global level statistics, *STAT* [1, 10, 11]. The global maximum of *STAT* indicates the best cut-off level of a dendrogram, and the local maxima of the *STAT* differences indicate the most significant levels.

The affinity coefficient between two distribution functions was introduced by Matusita in 1951 (e.g., [13, 14]). Bacelar-Nicolau extended it to the non-supervised classification field as a similarity measure between profiles. Let *V* be a set of *p* variables describing a set *D* of *N* statistical data units (individuals), so that each of the $N \times p$ cells of the corresponding data table $X$ contains one single non-negative real value $x\_{ik}$ ($i = 1, \ldots, N$; $k = 1, \ldots, p$), which denotes the value of the $k$-th variable on the $i$-th individual. The standard affinity coefficient $a(k, k')$ between a pair of variables, $V\_k$ and $V\_{k'}$ ($k, k' = 1, \ldots, p$), is given by formula (1), where $x\_{.k} = \sum\_{i=1}^{N} x\_{ik}$ and $x\_{.k'} = \sum\_{i=1}^{N} x\_{ik'}$.

$$a(k,k') = \sum\_{i=1}^{N} \sqrt{\frac{x\_{ik}}{x\_{.k}} \cdot \frac{x\_{ik'}}{x\_{.k'}}} \tag{1}$$

The coefficient (1) is a symmetric similarity coefficient which takes values in [0, 1] (1 for equal or proportional vectors and 0 for orthogonal vectors). Note that its mathematical formula corresponds to the inner product between the square-root column profiles associated with those variables and measures a monotone tendency between column profiles. In the particular case of binary variables, the affinity coefficient coincides with the well-known Ochiai coefficient. Furthermore (e.g., [4, 6]), it is related to the Hellinger distance $d$ by the relation $d^2 = 2(1 - a)$, which has been used in the context of spherical factor analysis by Michel Volle. Later on, the standard affinity coefficient was extended to the clustering of statistical data units or variables, mainly in a three-way approach (e.g., [3, 4, 5, 6]). The standard affinity coefficient between individuals can be computed by first transposing the data matrix and then applying formula (1).
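Formula (1) admits a direct transcription, assuming the two variables are given as non-negative column vectors (the function name is our own):

```python
import numpy as np

def affinity(x, y):
    """Standard affinity coefficient (1): inner product of the
    square-root column profiles of two non-negative vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    p, q = x / x.sum(), y / y.sum()   # column profiles
    return float(np.sqrt(p * q).sum())
```

Proportional vectors give a value of 1, orthogonal vectors give 0, and the Hellinger relation $d^2 = 2(1-a)$ can be verified numerically.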

The probabilistic aggregation criteria within the scope of the *VL* methodology can be interpreted as distribution functions of statistics of random variables that are i.i.d. uniform on [0, 1] (e.g., [3, 17]). The *SL* aggregation criterion can lead to very long clusters (chaining effect). On the other hand, the *AVL* (Aggregation Validity Link) criterion has a tendency to form equicardinal clusters with an even number of elements. The comparison functions between a pair of clusters, A and B, concerning the family I of *AVL* methods can be generated by the following conjoined formula (e.g., [17, 10, 11]):

$$
\Gamma(A,B) = (p\_{AB})^{g(\alpha,\beta)}\tag{2}
$$

with $\alpha = Card\,A$, $\beta = Card\,B$, and $p\_{AB} = \max\{\gamma\_{ab} : (a,b) \in A \times B\}$, where $1 \leq g(\alpha,\beta) \leq \alpha\beta$ and $\gamma\_{xy}$ is a similarity measure between pairs of elements $x$ and $y$ of the set of elements to classify (e.g., $g(\alpha,\beta) = 1$ for *SL*, $g(\alpha,\beta) = \alpha\beta$ for *AVL*). Note that, by varying $g(\alpha,\beta)$ with $1 < g(\alpha,\beta) < \alpha\beta$, a sort of compromise can be built between the *SL* and *AVL* methods (e.g., $g(\alpha,\beta) = (\alpha+\beta)/2$ for *AV1*). Thus, $\Gamma(A,B)$ will be "more polluted by the chain effect when $g(\alpha,\beta)$ remains near 1, and more contaminated by the symmetry effect as long as $g(\alpha,\beta)$ is in the neighbourhood of $\alpha\beta$" ([17], p. 95). Among the criteria that establish a compromise between *AVL* and *SL*, the *AV1* method stands out: its behavior is very similar to that of *AVL*, and it often provides, at its cut-off level, a partition better adjusted to the preorder than the "best" classification obtained by *AVL*.
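Formula (2) can also be transcribed directly; the three generators below correspond to the *SL*, *AVL* and *AV1* criteria mentioned in the text (all names are our own):

```python
def gamma_AB(sim, A, B, g):
    """Comparison function (2): Gamma(A, B) = p_AB ** g(|A|, |B|),
    where p_AB is the maximum pairwise similarity between the two
    clusters and sim(a, b) takes values in [0, 1]."""
    p_ab = max(sim(a, b) for a in A for b in B)
    return p_ab ** g(len(A), len(B))

# generators g(alpha, beta) for the criteria mentioned in the text
g_SL  = lambda a, b: 1            # Single-Linkage
g_AVL = lambda a, b: a * b        # AVL
g_AV1 = lambda a, b: (a + b) / 2  # AV1: compromise between SL and AVL
```

Since $p\_{AB} \in [0, 1]$, a larger exponent $g(\alpha,\beta)$ shrinks $\Gamma(A,B)$, which is how *AVL* discourages the chaining that *SL* permits.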

# **3 Main Results and Discussion**

Concerning the CatPCA, the best solution comprises four principal components, which account for almost 70% (about 69%) of the total data variance (percentage of variance accounted for, PVAF). All extracted components have eigenvalues above 1. Moreover, the first three principal components have very good internal consistency and the fourth component has acceptable internal consistency, as shown by the values of Cronbach's Alpha coefficient (see Table 1).


**Table 1** Rotated component loadings of the 4-component solution - Motivational factors.

The most important items for the first dimension are items M6, M7, M8, M9, M10, M11, M12, M13, M15, M19, M21, M22, and M27, which are related to human relationships/interactions with colleagues and hierarchical superiors, so it is called "Psychological well-being/Interpersonal relationships". This dimension explains the highest proportion of data variance (29.59%).

Concerning the second dimension, the items M14, M17, M18, M20, M23, and M24 are the most important, so this dimension was designated "Remuneration, job stability and incentive system". The most relevant items regarding the third dimension are M1, M2, M3, M4, M5, and M16; so, this dimension was called "Career progression/Professional achievement". Finally, the most important items for the fourth dimension are M25 and M26 related to "Fulfilment of the proposed objectives and the timings to achieve them".

Regarding the AHCA of the same set of items, and considering the best cut-off levels, the results of the present study are summarized in Table 2.


**Table 2** The best partition - Standard affinity coefficient.

According to the *STAT* values, the best partitions were obtained by the classic *SL*/*CL* methods and the probabilistic *AV1* method (see Table 2). All dendrograms highlighted four main branches, which are associated with different motivational factors ("Career progression"; "Psychological well-being/Interpersonal relationships"; "Organizational environment and working conditions"; "Conformity with objectives and time to reach them"), bringing new information and identifying some singletons, as shown in Figure 1.

# **4 Conclusion**

Organizations and their leaders have become increasingly aware of the importance of employee well-being and of the fact that negative feelings can harm productivity. Thus, it is essential to ensure the well-being of employees, taking into account the main motivational factors identified in this study. CatPCA made it possible to extract four principal components (dimensions), which explain almost 70% of the total variance of the data and were designated, respectively, "Psychological well-being/Interpersonal relationships"; "Remuneration, job stability and incentive system"; "Career progression/Professional achievement"; and "Fulfilment of objectives and timings to achieve them". Regarding the AHCA of the items that

**Fig. 1** Dendrogram - Standard affinity coefficient + *AV1*.

assess motivation, the dendrograms highlight four main branches, which are associated with different motivational factors called "Career progression"; "Psychological well-being/Interpersonal relationships"; "Organizational environment and working conditions"; and "Conformity with objectives and time to reach them". They also carry new information and identify some singletons. Comparing the dendrograms, we conclude that the clusters in the best partitions are quite similar, with the observed differences mainly concerning the few singletons. Moreover, the effective and fruitful correspondence between the AHCA and CatPCA results may help to better understand the main types of factors identified. In fact, the four main branches of all dendrograms are related to motivational factors whose interpretation is in consonance with those identified through CatPCA.

**Acknowledgements** This paper is financed by Portuguese national funds through FCT – Fundação para a Ciência e a Tecnologia, I.P., project number UIDB/00685/2020.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Proposal for Formalization and Definition of Anomalies in Dynamical Systems**

Jan Michael Spoor, Jens Weber, and Jivka Ovtcharova

**Abstract** Although many scientists focus strongly on anomaly detection in different applications and domains, there is currently no universally accepted definition of anomalies and outliers. Using an approach based on control theory and dynamical systems, as well as a definition of anomalies as described by the philosophy of science, the authors propose a generalized framework viewing anomalies as key drivers of progress towards a better understanding of the dynamical systems around us. By mathematically defining anomalies and delimiting deviations within expectations from completely unforeseen instances, this paper aims to contribute to establishing a universally accepted definition of anomalies and outliers.

**Keywords:** anomaly detection, outlier analysis, dynamical systems

# **1 Introduction**

Anomalies, often interchangeably called outliers [1], are of key interest in explorative data analysis. Anomaly detection therefore finds application in many different scientific fields, e.g., social science, economics, engineering, and medical science [2]. In particular, research in these domains regarding databases, data mining, machine learning or statistics focuses strongly on anomaly detection [3]. Despite the wide

Jan Michael Spoor ()

Jens Weber

Jivka Ovtcharova

Institut für Informationsmanagement im Ingenieurwesen (IMI), Karlsruhe Institute of Technology, Karlsruhe, Germany, e-mail: jan.spoor@kit.edu

Team Digital Factory Sindelfingen, Mercedes-Benz Group AG, Sindelfingen, Germany, e-mail: jens.je.weber@mercedes-benz.com

Institut für Informationsmanagement im Ingenieurwesen (IMI), Karlsruhe Institute of Technology, Karlsruhe, Germany, e-mail: jivka.ovtcharova@kit.edu

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_40

range of anomaly detection, there is currently no universally accepted definition of what an outlier or anomaly is [2], and the mathematical definition depends on the selected method to find these anomalies [4].

The authors previously proposed an applied framework to formalize anomalies within the context of control theory and dynamical systems [5]. In this publication, the idea is discussed in more depth, and a generalization of the framework is proposed to extend its application to more domains, since dynamical systems are relevant in engineering and science [6] as well as in management science and economics [7]. Furthermore, the proposed definition of anomalies should also be applicable outside the context of control theory and aims to contribute to establishing a universally accepted definition of anomalies and outliers.

When controlling or simulating dynamical systems, a measurement and prediction process is used. Anomalies occur in this process as substantial deviations of a measured system state (an actual value) from an expected system state (a planned value) [5]. Despite simulation and planning effort, these deviations still occur. While some deviations fall within an acceptable range and within the expectations of normal system behavior, other anomalies are completely unforeseen and do not fit the set-up and expectations of the system. Three sequential questions are derived to further investigate the nature of anomalies within dynamical systems:


# **2 Definition of Anomalies for Dynamical Systems**

#### **2.1 Definitions of Anomalies and Outliers**

In general, it is assumed that anomalies are somehow visible within the data of the observed systems. This is clearly stated by the definition of an outlier or anomaly as a data point with a substantial deviation from the norm, since this requires a normal state of the system and a measurable deviation [8]. Furthermore, anomaly detection requires the existence and knowledge of a normal state, a definition of a deviation, a metric, and a threshold measure of distance based on the selected metric. All distances between the norm and the data points which are above (in the case of distance measures) or below (in the case of similarity measures) the defined threshold are assumed to be substantial.
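The threshold scheme described above can be sketched as follows, here with a Euclidean distance as one possible choice of metric; the names are illustrative only, not part of the framework in [5]:

```python
import numpy as np

def is_anomaly(x, norm, threshold, metric):
    """A data point is flagged as anomalous when its distance from the
    normal state exceeds the threshold; for a similarity measure the
    comparison would be '< threshold' instead."""
    return metric(x, norm) > threshold

def euclid(x, y):
    """Euclidean distance, one possible metric choice."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))
```

The sketch makes the four required ingredients explicit as arguments: the normal state, the deviation (implicit in the metric), the metric itself, and the threshold.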

Therefore, in addition, the selection of an appropriate metric becomes an important tool to accurately describe an anomaly. Some authors claim that, in a practical application, the selection of a suitable metric might be more important than the algorithm itself. For example, if clusters are clearly separated within the examined dataset in context of the selected metric, clusters will be found independently of the used method or algorithm [9]. Other authors claim that the selected method for investigating clusters is of importance [10].

To summarize, there is no trivial definition of a normal state, of a deviation, or of when a deviation is substantial. Some authors therefore describe the usefulness of an analysis only within the context of the goals of the analysis [11]. Outlier detection then becomes more of a technical target than an actual scientific finding of something novel, since the novelty is always defined within the technical target of the analysis. Alternatively, the normal model of the data defines an anomaly [1].

This results, for example, in approaches of regression diagnostics that exclude outliers and anomalous data prior to an analysis, or that conduct the analysis along the standard model in a more robust way which is less affected by anomalies [12]. Both approaches maintain the normal model by treating anomalies as if they were less adequate or not at all representative of the data set.

Since anomalies are only relevant within a context, a typology of anomalies within different dataset contexts can be created. Thus, Foorthuis [13] proposes a typology along the following dimensions: types of data (qualitative, quantitative or mixed), anomaly level (atomic or aggregated) and cardinality of relationship (univariate or multivariate). Anomalies are, within this kind of typology, always dependent on the dataset and behave differently along the measured features, which have been classified as relevant for the specific analysis. The anomaly detection becomes a detection of unfitting, surprising values while maintaining the normal model.

#### **2.2 Definition by Philosophy of Science**

If the assumptions regarding normal states, deviation, and substantiality are dropped, it is possible to discuss anomalies on a more fundamental level for understanding our surroundings and the observations of them.

To do this, anomalies have to be placed in the historic context of science and research. Since anomaly detection as a discipline of data science is placed within the scientific context [14], anomaly detection can also be analyzed as part of the scientific method and therefore a comparison with the historical understanding of anomalies in the context of science becomes relevant. By definition of Kuhn [15], anomalies play an important role in the scientific discovery of novelties:

Discovery commences with the awareness of anomaly, i.e., with the recognition that nature has somehow violated the paradigm-induced expectations that govern normal science. It then continues with a (...) exploration of the area of anomaly. And it closes only when the paradigm theory has been adjusted so that the anomalous has become the expected.

This statement describes scientific progress as a stepwise discovery and the placement of anomalies within a normal state by science. The discussed normal state is therefore dictated by current scientific knowledge, which encompasses the predictions of the currently available and widely used models and theories. An anomaly violates the normal state by violating the predictions of these models. The steps of scientific progress are then as follows:


Therefore, different states of an anomaly exist as follows:


The states of anomalies correspond to the questions initially posed in the introduction regarding the delimitation of anomalous states from normal states, the exploration of the causes of anomalies, and the modeling of and planning with the now known anomalies. If the states of anomalies are used to describe practical errors in engineering, error states of systems are not anomalies: error states that are classified as such beforehand are already known and described. This corresponds to the idea that outliers or anomalies are created by a different underlying mechanism [16] and therefore imply an unknown system behavior, which needs modeling to better describe the system. In addition, this follows the assumption of a normal state from which anomalies simply deviate [1], since they are not part of the normal model. This idea also relates strongly to the discussion of the relation between novelty and anomaly detection [17].

Following the definitions of Kuhn [15], science is driven by internal progress, limited by the current methods and available resources, while technicians are driven by external targets defined by stakeholders, e.g., society or companies. This description matches the idea that the usefulness of an analysis should be evaluated within the context of its goals [11] and distinguishes two types of anomalies: "scientific" anomalies of a novel observation, and "technical" anomalies as deviations from a predefined norm using a predefined measurement of substantiality.

"Scientific" anomalies might still result in unwanted system states, which then can result in some kind of error or critical system state. Nevertheless, not every "scientific" anomaly inevitably results in an error state and not every error state is a "scientific" anomaly. An anomaly is not a "scientific" anomaly if the error state is already documented or can be described by the standard model. In this case, the anomaly becomes a "technical" anomaly.

Using the philosophy of science definition of anomalies, the normal state is the prediction by the system model, the deviation is the difference between the prediction of the system state and the measured actual state of the system, and the substantiality is defined by the noise and precision of our predictions and measurement tools.
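Under this reading, a minimal anomaly check reduces to comparing the deviation between prediction and measurement against a substantiality threshold derived from the noise level. The sketch below illustrates this with a hypothetical k-sigma rule; the function name and the specific threshold are illustrative assumptions, not part of the text:

```python
import numpy as np

def is_anomalous(predicted, measured, noise_sd, k=3.0):
    """Flag a measurement as anomalous when its deviation from the model
    prediction exceeds what noise and measurement precision can plausibly
    explain (here: a hypothetical k-sigma band)."""
    deviation = np.abs(np.asarray(predicted) - np.asarray(measured))
    substantiality = k * noise_sd  # threshold derived from the noise level
    return deviation > substantiality

# A deviation of 0.5 with noise sd 0.1 is substantial at k = 3:
print(is_anomalous(1.0, 1.5, noise_sd=0.1))  # True
```

Here the "normal state" is the prediction, the deviation is the absolute difference, and substantiality is operationalized by the noise-scaled threshold.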

# **3 Proposed Framework for a Formalization of Anomalies**

To separate "scientific" and "technical" anomalies, a formerly proposed framework [5] is generalized as illustrated in Fig. 1 and mathematically defined in this section.

**Fig. 1** Formalization of "scientific" and "technical" anomalies and system states.

**Definition 1 (System State)** There exists a multivariate description $x_i$ of a state $i$ with a finite number of features. For each feature $j$ of state $i$ a value $x_{ij}$ exists, which is a realization of the feature space $R_j$. The value $x_{ij}$ is the actual and precise state description of feature $j$ at state $i$. Although there exists only a single true value $x_{ij}$, the value itself does not necessarily have to be a single data point but can be a multivariate or symbolic data value and can be of any data type.

$$\forall i \; \forall j \; \exists! \; x\_{ij}, \quad x\_{ij} \in R\_j \tag{1}$$

The set 𝐶 of all combinations of system state values with 𝐽 features is given by:

$$C = \{ \mathbf{x}\_i \mid \forall j : x\_{ij} \in R\_j \} = R\_1 \times \dots \times R\_J \tag{2}$$

**Definition 2 (Operation)** An operation is an analytical function 𝑓 which changes the system state from state 𝑖 to the following state 𝑖 + 1. Both states belong to the set of all combinations of system states 𝐶.

$$f\colon C \to C, \quad f(\mathbf{x}\_i) = \mathbf{x}\_{i+1} \tag{3}$$

There exists a finite set 𝐹 of functions of endogenous state transformations. This set of functions is the scope of operations that can be performed. These functions are the fundamental functionality of a system, which can be performed without any external involvement. For all functions the following expression is applied:

$$\forall g \in F \; \forall f \in F : g \circ f \in F \tag{4}$$

Using the defined function space, a restriction of reachable system states via all functions from 𝐹 is defined, resulting in the set of physically possible system states.

**Definition 3 (Physically Possible System States)** The functions in $F$ span the complete space of state changes of a system using the entire scope of operations. The resulting space is the set of all possible system states. The physically possible system states are the possible realizations of $x_i$ reachable from a starting point when only functions from $F$ are applied. The set $P$ is a group with a neutral element of operations.

$$P = \{ \mathbf{x}\_i \mid \forall f \in F : f(\mathbf{x}\_i) \in P \} \subseteq C \tag{5}$$

**Definition 4 (Observed System States)** Of the $J$ existing features of the system state, only $D$ features are known, with $D \le J$. Since not all system states can be measured, a function $z$ transforms the real system states and real operations of the system into observable system states and operations.

$$z: C \to M, \quad z(\mathbf{x}\_i) = \mathbf{x}\_{i^\*} \tag{6}$$

Therefore, the set $M = R_1 \times \dots \times R_D$ is the space of all observable and known system states. The function $z$ is the measurement process.

**Definition 5 (Observed Operations)** Not all functions of the whole set $F$ are known or observable when planning and operating a system.

$$F' \subseteq F \tag{7}$$

Additionally, only observable system states are modeled when operating a system. The observed operations of a system are therefore projections of a subset $F'$ of the known operations of $F$ and operate within the observed and known system states.

$$F^\* = z(F')\tag{8}$$

The actually conducted operations $f$ are always from the set of operations $F$, but the expectation and prediction utilize, due to lack of system knowledge, only $f^* \in F^*$.

$$f^\*\colon M \to M, \quad f^\*(\mathbf{x}\_{i^\*}) = \mathbf{x}\_{(i+1)^\*} \tag{9}$$

Therefore, all states reached by applying operations $f^* \in F^*$ are defined as expected system states.

**Definition 6 (Expected System States)** The system states that are possible if only the observed and known operations of the set $F^*$ are applied to all system states $\mathbf{x}_{i^*} \in E$ constitute the expected system behavior.

$$E = \{ \mathbf{x}\_{i^\*} \mid \forall f^\* \in F^\* : f^\*(\mathbf{x}\_{i^\*}) \in E \} \subseteq M \tag{10}$$

The expected system states can be further split into desired system states, where the system runs most beneficially for its usage; critical system states, where possible errors or rare system states are measured; and error states, which are system faults with operational risks involved, as defined by Basel III [18]. Applied in engineering, this definition is compatible with DIN EN 13306, since the system is at risk of being unable to perform a certain range of functions without necessarily being completely inoperable [19]. All kinds of errors, warnings, and non-beneficial system states are the "technical" anomalies within the contextual analysis of the data set.

**Definition 7 (Unforeseen System States)** The set of unforeseen system states $U$ therefore consists of all measurable system states within the realm of observable system states but not within the expected system states:

$$U = M \setminus E \tag{11}$$

"Scientific" anomalies in unforeseen system states are measured if the real operation $f$ differs from $f^*$ such that a prediction error occurs:

$$f^\*(\mathbf{x}\_{i^\*}) \in E, \quad z(f(\mathbf{x}\_i)) \neq f^\*(\mathbf{x}\_{i^\*}), \quad z(f(\mathbf{x}\_i)) \notin E \tag{12}$$

"Scientific" anomalies are part of the unforeseen system states. Another reason for unforeseen system states is a measurement of an impossible system state. Anomalies originated by physically impossible system states are to be distinguished from "scientific" anomalies since the reason for their occurrence follows a different mechanism. Thus, they are assigned to the "technical" anomalies.

**Definition 8 (Physically Impossible System States)** Physically impossible system states $I$ are combinations of states in set $C$ which are not reachable using the functions $f \in F$:

$$I = C \setminus P \tag{13}$$

**Definition 9 (External Influence)** When changes are applied to the system, the feature space also changes. Consequently, the space of physically possible system states changes: previously impossible system states become possible system states.

**Definition 10 (Faulty Data Points)** If a measurement is conducted incorrectly, the measured values could be within the impossible system states. Faulty data points are therefore neither measurement noise nor imprecision, but should be systematically excluded. Note that faulty data points could be within the possible system space but need to be excluded either way.
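The interplay of the sets defined above can be illustrated with a small sketch. The extensional encoding of $E$ and the observed projection of $P$ as finite sets is a hypothetical simplification for illustration; the framework itself imposes no such finiteness restriction:

```python
def classify_state(x_obs, expected, possible_obs):
    """Classify an observed system state following the framework:
    states inside E are expected; states outside E are unforeseen (Eq. 11);
    among those, states outside the (observed projection of the) possible
    states P point to "technical" anomalies such as faulty data points,
    while possible-but-unexpected states are "scientific" anomalies."""
    if x_obs in expected:
        return "expected"
    if x_obs not in possible_obs:
        return "technical anomaly (impossible state / faulty data)"
    return "scientific anomaly (unforeseen but possible)"

E = {(0, 0), (0, 1)}            # hypothetical expected states
P_obs = E | {(1, 1)}            # observed projection of possible states
print(classify_state((1, 1), E, P_obs))  # scientific anomaly (unforeseen but possible)
```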

# **4 Conclusion**

It is concluded that the anomaly concept is often loosely defined and depends heavily on assumptions about a normal state, deviation, and substantiality. These definitions are often case-specific and influenced by the choices of the conducting researchers. A rigorous definition of anomalies can therefore further streamline the discourse and increase the common understanding of what kind of anomaly is being described.

Building on the distinction between "technical" and "scientific" anomalies, further research will be conducted to set up models detecting both types of anomalies separately. Differences between observed and real system states and operations are a focus of further research, in order to analyze more precisely the hidden processes of "scientific" anomaly generation. A more fundamental discussion of the philosophical definition of anomalies within the philosophy of science and its applications to anomaly detection in general should also be conducted to gain further insight into the true nature of anomalies.

The authors plan to validate the concept by using the proposed definition and framework in exemplary applications within industrial processes. Furthermore, anomaly detection methods designed for applications in dynamical systems using the proposed framework are planned to be developed.

**Acknowledgements** The Mercedes-Benz Group AG funds this research. The research was prepared within the framework of the doctoral program of the Institut für Informationsmanagement im Ingenieurwesen (IMI) at the Karlsruhe Institute of Technology (KIT).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **New Metrics for Classifying Phylogenetic Trees Using** 𝑲**-means and the Symmetric Difference Metric**

Nadia Tahiri and Aleksandr Koshkarov

**Abstract** The 𝑘-means method can be adapted to any type of metric space and is sometimes linked to median procedures. This is the case for the symmetric difference (or Robinson and Foulds) distance in phylogeny, where it can lead to median trees as well as to a Euclidean embedding. We show how a specific version of the popular 𝑘-means clustering algorithm, based on interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one cluster (when the data is homogeneous) or several clusters (when the data is heterogeneous). We have adapted the popular Silhouette and *Gap* cluster validity indices to tree clustering with 𝑘-means. We show results of this new approach on a real dataset (aminoacyl-tRNA synthetases). This new version of phylogenetic tree clustering is well suited for the analysis of large genomic datasets.

**Keywords:** clustering, symmetric difference metrics, 𝑘-means, phylogenetic trees, cluster validity indices

# **1 Introduction**

In biology, one of the most significant organizing principles is the "Tree of Life" (ToL) [12]. In genetic studies, there is evidence of an enormous number of branches, but even a rough estimate of the total size of the tree remains difficult. Many recent

Nadia Tahiri ()

Aleksandr Koshkarov

Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada, e-mail: Nadia.Tahiri@USherbrooke.ca

Department of Computer Science, University of Sherbrooke, Sherbrooke, QC J1K2R1, Canada; Center of Artificial Intelligence, Astrakhan State University, Astrakhan, 414056, Russia, e-mail: Aleksandr.Koshkarov@USherbrooke.ca

representations of ToL have emphasized either the existence of deep evolutionary relationships [7] or the knowledge of a large and diverse variety of life, with an emphasis on Eukaryotes [8]. These approaches do not consider the dramatic evolution in our understanding of the diversity of life due to genomic sampling of previously unexplored environments.

As a result, Maddison in 1991 [11] was the first to formulate the idea of multiple consensus trees when he described his phylogenetic island method. He observed that island consensus trees can differ significantly from each other and are generally better resolved than the species-wide consensus tree. The most intuitive approach to discovering and clustering genes that share similar evolutionary histories is to cluster their genetic phylogenies. In this context, Stockham et al. in 2002 [18] proposed a tree clustering algorithm based on 𝑘-means [4, 9, 10] and the Robinson and Foulds quadratic distance [15]. Their clustering algorithm aims to infer a set of strict consensus trees, minimizing information loss. They proceed by determining the consensus trees for each set of clusters in all intermediate partitioning solutions tested by 𝑘-means. This makes the Stockham et al. algorithm very expensive in terms of execution time. More recently, Tahiri et al. in 2018 [19] proposed a fast and accurate tree clustering method based on 𝑘-medoids. Finally, Silva and Wilkinson in 2021 [17] introduced a revised definition of tree islands based on any tree-to-tree metric that usefully extends this notion to any set or multiset of trees and provided an interesting discussion of biological applications of their method.

In this context, the use of a method that infers multiple supertrees (i.e., a supertree clustering method) would help discover and cluster alternative evolutionary scenarios for several ToL subtrees.

The paper is structured as follows. In the next section, we introduce a new metric for the 𝑘-means algorithm based on the Robinson and Foulds distance. Section 3 presents the results obtained on a real dataset with our algorithm, compared to other clustering methods. Finally, we discuss our contributions in Section 4.

# **2 Methods**

The 𝑘-means algorithm [9, 10] is a very common algorithm for partitioning data. From a set of $N$ observations $x_1, \dots, x_N$, each described by $M$ variables, this algorithm creates a partition into $k$ homogeneous classes, or clusters. Each observation corresponds to a point in an $M$-dimensional space, and the proximity between two points is measured by the distance between them. In the framework of 𝑘-means, the most commonly used distances are the Euclidean, Manhattan, and Minkowski distances [4]. To be precise, the objective of the algorithm is to find the partition of the $N$ points into $k$ clusters such that the sum of the squared distances of the points to the center of gravity of the group to which they are assigned is minimal. Finding an optimal partition according to the 𝑘-means least-squares criterion is known to be NP-hard [13]. Considering this fact, several polynomial-time heuristics have been developed, most of which have a time complexity of $O(KNIM)$ for finding an approximate partitioning solution, where $K$ is the maximum possible number of clusters, $N$ is the number of objects (for example, phylogenetic trees), $I$ is the number of iterations of the 𝑘-means algorithm, and $M$ is the number of variables characterizing each of the $N$ objects.
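The Lloyd-style iteration underlying these heuristics works in any metric space once a distance and a center routine are supplied. A generic sketch (the function names and the toy Euclidean usage are illustrative assumptions; for trees, `dist` would be the RF-based distance and `center` a median-tree/consensus routine):

```python
import random

def k_means(objects, k, dist, center, iters=10, seed=0):
    """Generic k-means sketch for an arbitrary metric space:
    alternate between assigning each object to its nearest center
    and recomputing each cluster's center with `center`."""
    rng = random.Random(seed)
    centers = rng.sample(objects, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in objects:
            j = min(range(k), key=lambda c: dist(x, centers[c]))
            clusters[j].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [center(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

# Toy 1-D usage with the Euclidean distance and the mean as center:
data = [1.0, 1.2, 0.9, 9.8, 10.1, 10.0]
out = k_means(data, 2, dist=lambda a, b: abs(a - b),
              center=lambda cl: sum(cl) / len(cl))
```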

A well-known metric for comparing two tree topologies in computational biology is the Robinson-Foulds ($RF$) distance, also known as the symmetric-difference distance [15]. The $RF$ distance is a topological distance, which means that it does not consider the lengths of the edges of the trees. The $RF$ distance can be described as $n_1(T_1) + n_2(T_2)$, where $n_1(T_1)$ is the number of partitions of the data implied by tree $T_1$ but not by tree $T_2$, and $n_2(T_2)$ is the number of partitions of the data implied by tree $T_2$ but not by tree $T_1$. According to Barthélemy and Monjardet [1], the majority-rule consensus tree of a set of trees is the median tree of this set. This fact makes the use of tree clustering possible.
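Viewing each tree as the set of bipartitions (splits) it induces, the RF distance is simply the size of the symmetric difference of the two split sets. A minimal sketch, where the encoding of splits as frozensets of leaf labels is an illustrative assumption:

```python
def rf_distance(bip1, bip2):
    """Robinson-Foulds distance as the size of the symmetric difference
    of the two trees' bipartition (split) sets: n1(T1) + n2(T2)
    in the notation of the text."""
    return len(bip1 - bip2) + len(bip2 - bip1)

# Hypothetical 5-leaf trees, each encoded by its non-trivial splits:
t1 = {frozenset({"a", "b"}), frozenset({"d", "e"})}
t2 = {frozenset({"a", "b"}), frozenset({"c", "d"})}
print(rf_distance(t1, t2))  # 2: one split unique to each tree
```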

#### **2.1 Silhouette Index Adapted for Tree Clustering**

The first popular cluster validity index we consider in our study is the Silhouette width (𝑆𝐻) [16]. Traditionally, the Silhouette width of the cluster 𝑘 is defined as follows:

$$s(k) = \frac{1}{N\_k} \left[ \sum\_{i=1}^{N\_k} \frac{b(i) - a(i)}{\max(a(i), b(i))} \right],\tag{1}$$

where $N_k$ is the number of objects belonging to cluster $k$, $a(i)$ is the average distance between object $i$ and all other objects belonging to cluster $k$, and $b(i)$ is the smallest, over all clusters $k'$ different from cluster $k$, of all average distances between $i$ and all the objects of cluster $k'$.

We used Equations (2) and (4) for calculating 𝑎(𝑖) and 𝑏(𝑖), respectively, in our tree clustering algorithm (see also [19]). For instance, the quantity 𝑎(𝑖) can be calculated as follows:

$$a(i) = \left[\frac{\sum\_{j=1}^{N\_k} RF(T\_{ki}, T\_{kj})}{2n(T\_{ki}, T\_{kj}) - 6} + \xi\right] / N\_k \; , \tag{2}$$

where $N_k$ is the number of trees in cluster $k$, $T_{ki}$ and $T_{kj}$ are, respectively, trees $i$ and $j$ in cluster $k$, $n(T_{ki})$ is the number of leaves in tree $T_{ki}$, $n(T_{kj})$ is the number of leaves in tree $T_{kj}$, and $\xi$ is a penalty function defined as follows:

$$\xi = \alpha \times \frac{\operatorname{Min}(n(T\_{ki}), n(T\_{kj})) - n(T\_{ki}, T\_{kj})}{\operatorname{Min}(n(T\_{ki}), n(T\_{kj}))},\tag{3}$$

where $\alpha$ is the penalization (tuning) parameter, taking values between 0 and 1, used to prevent trees having small percentages of leaves in common from being put into the same cluster, and $n(T_{ki}, T_{kj})$ is the number of common leaves in trees $T_{ki}$ and $T_{kj}$.

The formula for 𝑏(𝑖) is as follows:

$$b(i) = \min\_{1 \le k' \le K, k' \ne k} \left[ \frac{\sum\_{j=1}^{N\_{k'}} RF(T\_{ki}, T\_{k'j})}{2n(T\_{ki}, T\_{k'j}) - 6} + \xi \right] / N\_{k'} \,\,,\tag{4}$$

where $T_{k'j}$ is tree $j$ of cluster $k'$, such that $k' \neq k$, and $N_{k'}$ is the number of trees in cluster $k'$.

The optimal number of clusters, 𝐾, corresponds to the maximum average Silhouette width, 𝑆𝐻, which is calculated as follows:

$$SH = \overline{s}(K) = \sum\_{k=1}^{K} \left[ s(k) \right] / K \,\,. \tag{5}$$

The value of the Silhouette index defined by Equation (5) ranges from -1 to +1.
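Given a matrix of pairwise normalized tree distances (the RF-based quantities of Eqs. (2)-(4), with the penalty $\xi$ already folded into the entries), the average Silhouette width of Eq. (5) can be sketched as below. The toy distance matrix is hypothetical:

```python
import numpy as np

def silhouette(D, labels):
    """Average Silhouette width SH (Eqs. (1) and (5)) from a precomputed
    distance matrix D and hard cluster labels. Following Eq. (2) as
    written, the within-cluster average a(i) includes the zero
    self-distance in its sum over all N_k objects."""
    labels = np.asarray(labels)
    ks = np.unique(labels)
    s_per_cluster = []
    for k in ks:
        idx = np.where(labels == k)[0]
        s_vals = []
        for i in idx:
            a = D[i, idx].sum() / len(idx)        # within-cluster average
            b = min(D[i, labels == k2].mean()     # nearest other cluster
                    for k2 in ks if k2 != k)
            s_vals.append((b - a) / max(a, b))
        s_per_cluster.append(np.mean(s_vals))     # s(k), Eq. (1)
    return np.mean(s_per_cluster)                 # SH, Eq. (5)

# Hypothetical distances between 4 trees forming two tight clusters:
D = np.array([[0., .1, .9, .8],
              [.1, 0., .7, .9],
              [.9, .7, 0., .2],
              [.8, .9, .2, 0.]])
print(round(float(silhouette(D, [0, 0, 1, 1])), 3))  # 0.909
```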

#### **2.2** *Gap* **Statistic Adapted for Tree Clustering**

It is worth noting that the $SH$ cluster validity index (Equations (1) to (5)) does not allow comparing the solution consisting of a single consensus tree ($K = 1$; the calculation of $SH$ is impossible in this case) with clustering solutions involving multiple consensus trees or supertrees ($K \ge 2$). This can be considered an important disadvantage of $SH$-based classifications, because a good tree clustering method should be able to recover a single consensus tree or supertree when the input set of trees is homogeneous (e.g., for a set of gene trees that share the same evolutionary history).

The *Gap* statistic was first used by Tibshirani et al. [20] to estimate the number of clusters provided by partitioning algorithms. The formulas proposed by Tibshirani et al. were based on the properties of the Euclidean distance. In the context of tree clustering, the *Gap* statistic can be defined as follows. Consider a clustering of $N$ trees into $K$ non-empty clusters, where $K \ge 1$. First, we define the total intracluster distance, $D_k$, characterizing the cohesion between the trees belonging to the same cluster $k$:

$$D\_k = \sum\_{i=1}^{N\_k} \sum\_{j=1}^{N\_k} \left[ \frac{RF(T\_{ki}, T\_{kj})}{2n(T\_{ki}, T\_{kj}) - 6} + \xi \right]. \tag{6}$$

Then, the sum of the average total intracluster distances, $V_K$, can be calculated using this formula:

$$V\_K = \sum\_{k=1}^{K} \frac{1}{2N\_k} D\_k \ . \tag{7}$$

Finally, the *Gap* statistic, which reflects the quality of a given clustering solution including 𝐾 clusters, can be defined as follows:

$$\operatorname{Gap}\_N(K) = E\_N^\* \left\{ \log(V\_K) \right\} - \log(V\_K) \,,\tag{8}$$

where $E_N^*$ denotes the expectation under a sample of size $N$ from the reference distribution. The following formula [20] for the expectation of $\log(V_K)$ was used in our algorithm:

$$E\_N^\*\left\{\log(V\_K)\right\} = \log(Nn/12) - (2/n)\log(K)\,,\tag{9}$$

where 𝑛 is the number of tree leaves.

The largest value of the *Gap* statistic corresponds to the best clustering.
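Equations (6)-(9) combine into a short routine. In the sketch below, the distance matrix and labels are hypothetical, and the penalty $\xi$ is assumed to be already included in the matrix entries:

```python
import numpy as np

def gap_statistic(D, labels, n_leaves):
    """Gap statistic of Eqs. (6)-(9) for one clustering solution,
    given a matrix D of normalized RF distances and hard labels."""
    labels = np.asarray(labels)
    N, K = len(labels), len(np.unique(labels))
    V_K = 0.0
    for k in np.unique(labels):
        idx = labels == k
        D_k = D[np.ix_(idx, idx)].sum()          # Eq. (6): total intracluster distance
        V_K += D_k / (2 * idx.sum())             # Eq. (7): sum of averages
    expected = np.log(N * n_leaves / 12) - (2 / n_leaves) * np.log(K)  # Eq. (9)
    return expected - np.log(V_K)                # Eq. (8)

# Hypothetical distances between 4 trees on 5 leaves, two clusters:
D = np.array([[0., .1, .9, .8],
              [.1, 0., .7, .9],
              [.9, .7, 0., .2],
              [.8, .9, .2, 0.]])
gap = gap_statistic(D, [0, 0, 1, 1], n_leaves=5)
```

The candidate clustering with the largest `gap` value would then be retained.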

# **3 Results - A Biological Example**

To illustrate the methods described above, we used a dataset from Woese et al. [22]. The aminoacyl-tRNA synthetases (aaRSs) are enzymes that attach the appropriate amino acid onto its cognate transfer RNA. The structure-function aspect of aaRSs has long attracted the attention of biologists [22, 6]. Moreover, the relationship of the aaRSs to the genetic code has been examined from an evolutionary point of view: the central role played by the aaRSs in translation suggests that their histories and that of the genetic code are somehow intertwined [22]. The novel domain additions to aaRS genes play an important role in the inference of the ToL.

We encoded 20 original aminoacyl-tRNA synthetase trees from Woese et al. [22] in Newick format and then split some of them into subtrees to account for cases where the same species appeared more than once in the original tree, since our approach cannot handle data that includes multiple instances of the same species in the input trees. Thus, 36 aaRS trees with different numbers of leaves (including 72 species in total) were used as input to our algorithm (their Newick strings are available at: https://github.com/tahiri-lab/PhyloClust). Our approach was applied with the 𝛼 parameter set to 1.

First, we implemented our new approach with the *Gap* statistic cluster validity index, which suggested the presence of 7 clusters of trees in the data, i.e., a heterogeneous scenario of their evolution. Then, we conducted the computation using the $SH$ cluster validity index and obtained 2 clusters of trees, each of which could be represented by its own supertree. The first cluster obtained using $SH$ included 19 trees for a total of 56 organisms, whereas the second cluster included 17 trees for a total of 61 organisms. The supertrees (see Figure 1) for the two obtained clusters of trees were inferred using the CLANN program [5]. Further, we inferred the most common horizontal gene transfers characterizing the evolution of the gene trees included in the two obtained tree clusters. The method of [3], which reconciles the species and gene phylogenies to infer transfers, was used for this purpose. The species phylogenies followed the NCBI taxonomic classification. These phylogenies were not fully resolved (the species phylogeny in Figure 1a contains 9 internal nodes

**Fig. 1** Nonbinary species trees corresponding to the NCBI taxonomic classification, represented with (a) 56 species for cluster 1, where the 4 HGTs (indicated by arrows) were found by the $SH$ index; (b) 61 species for cluster 2, where the 2 HGTs (indicated by arrows) were found by the $SH$ index with $\alpha$ equal to 1. We applied the Most Similar Supertree method ($dfit$) [5] implemented in the CLANN software with the $mrp$ criterion, i.e., matrix representation employing the parsimony criterion.

with a degree higher than 3 and the species phylogeny in Figure 1b contains 10 internal nodes with a degree higher than 3).

We used the version of the HGT (Horizontal Gene Transfer) algorithm available on the T-Rex website [2] to identify the scenarios of HGT events that reconcile the species tree and each of the supertrees. We chose the same root for the species trees and the supertrees: the root that separates Bacteria from the clade of Eukaryota and Archaea.

For the first cluster, composed of 56 species, we obtained 40 transfers: 22 regular and 18 trivial HGTs. Trivial HGTs are those necessary to transform a non-binary tree into a binary tree. We removed the trivial HGTs and selected among the regular HGTs. The non-trivial HGTs with low representation are most likely due to tree reconstruction artefacts. In Figure 1a, we illustrate only those HGTs that are most represented in the dataset.

We followed the same procedure for the second cluster, composed of 61 species, and obtained 42 transfers: 28 regular and 14 trivial HGTs, the latter not represented here. We selected only the most frequent HGTs in the dataset; these transfers are represented in Figure 1b.

The transfer linking *P. horikoshii* to the clade of *spirochetes* (i.e., *B. burgdorferi* and *T. pallidum*) was also found by [3, 14]. The transfer of *P. horikoshii* to *P. aerophilum* was also found by [14]. These results confirm the existing HGT findings of [3, 14].

# **4 Discussion**

Many research groups are estimating trees containing several thousands to hundreds of thousands of species, toward the eventual goal of the estimation of the Tree of Life, containing perhaps several million leaves. These phylogenetic estimations present enormous computational challenges, and current computational methods are likely to fail to run even with datasets on the low end of this range. One approach to estimate a large species tree is to use phylogenetic estimation methods (such as maximum likelihood) on a supermatrix produced by concatenating multiple sequence alignments for a collection of markers; however, the most accurate of these phylogenetic estimation methods are extremely computationally intensive for datasets with more than a few thousand sequences. Supertree methods, which assemble phylogenetic trees from a collection of trees on subsets of the taxa, are important tools for phylogeny estimation where phylogenetic analyses based upon maximum likelihood (ML) are infeasible.

In this article, we described a new algorithm for partitioning a set of phylogenetic trees into several clusters in order to infer multiple supertrees, where the input trees have different, but mutually overlapping, sets of leaves. We presented new formulas that allow the use of the popular Silhouette and *Gap* statistic cluster validity indices along with the Robinson and Foulds topological distance in the framework of tree clustering based on the popular 𝑘-means algorithm. The new algorithm can be used to address a number of important issues in bioinformatics, such as the identification of genes having similar evolutionary histories, e.g., those that underwent the same horizontal gene transfers or those that were affected by the same ancient duplication events. It can also be used for the inference of multiple subtrees of the Tree of Life. To compute the Robinson and Foulds topological distance between such pairs of trees, we first reduce them to a common set of leaves. After this reduction, the Robinson and Foulds distance is normalized by its maximum value, which is equal to $2n - 6$ for two binary trees with $n$ leaves. Overall, the good performance achieved by the new algorithm, in both clustering quality and running time, makes it well suited for analyzing large genomic and phylogenetic datasets. A C++ program called PhyloClust (Phylogenetic trees Clustering), implementing the discussed tree partitioning algorithm, is freely available at https://github.com/tahiri-lab/PhyloClust.

**Acknowledgements** We would like to thank Andrey Veriga and Boris Morozov for helping us with the analysis of Aminoacyl-tRNA synthetases data. We also thank Compute Canada for providing access to high-performance computing facilities. This work was supported by Fonds de Recherche sur la Santé of Québec and University of Sherbrooke grant.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **On Parsimonious Modelling via Matrix-variate t Mixtures**

Salvatore D. Tomarchio

**Abstract** Mixture models for matrix-variate data have become more and more popular in recent years. One issue with these models is the potentially high number of parameters. To address this concern, parsimonious mixtures of matrix-variate normal distributions have recently been introduced in the literature. However, when the data contains groups of observations with longer-than-normal tails or atypical observations, the use of the matrix-variate normal distribution for the mixture components may affect the fit of the resulting model. Therefore, we consider a more robust approach based on the matrix-variate 𝑡 distribution for modeling the mixture components. To introduce parsimony, we use the eigen-decomposition of the components' scale matrices and we allow the degrees of freedom to be equal across groups. This produces a family of 196 parsimonious matrix-variate 𝑡 mixture models. Parameter estimation is performed using an AECM algorithm. The use of our parsimonious models is illustrated via a real data application, where parsimonious matrix-variate normal mixtures are also fitted for comparison purposes.

**Keywords:** matrix-variate, mixture models, clustering, parsimonious models

# **1 Introduction**

The matrix-variate model-based clustering literature has been expanding over the last few years, as confirmed by the high number of contributions using finite mixture models for modeling matrix-variate data [1, 2, 3, 4, 5, 6, 7, 8]. This kind of data is arranged in three-dimensional arrays and, depending on the entities indexed in each of the three layers, different data examples might be considered [9]. In many of these applications, we observe a 𝑝 × 𝑟 matrix for each statistical

Salvatore D. Tomarchio

University of Catania, Department of Economics and Business, Catania, Italy, e-mail: daniele.tomarchio@unict.it

<sup>©</sup> The Author(s) 2023

P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_42

observation. Thus, from a model-based clustering perspective, the challenge is to suitably cluster realizations coming from random matrices.

One problem of matrix-variate mixture models is the potentially high number of parameters. To cope with this issue, [5] have recently proposed a family of parsimonious mixtures based on the matrix-variate normal (MVN) distribution. Nevertheless, for many datasets, the tails of the MVN distribution are often shorter than required. This has several consequences for parameter estimation as well as for proper data classification [4, 7]. Therefore, in this paper we relax the normality assumption of the mixture components by using (in a parsimonious setting) the matrix-variate 𝑡 (MVT) distribution. The MVT distribution has been used within the finite mixture model paradigm by [10] in an unconstrained framework. Here, to introduce parsimony in this model, (i) we use the eigen-decomposition of the two scale matrices of each mixture component and (ii) we allow the degrees of freedom to be tied across groups. This produces the family of 196 parsimonious MVT mixture models (MVT-Ms) discussed in Section 2. Parameter estimation is implemented by using an alternating expectation-conditional maximization (AECM) algorithm [12]. In Section 3, our parsimonious MVT-Ms, along with parsimonious MVN mixture models (MVN-Ms) for comparison purposes, are fitted to a Swedish municipalities expenditure dataset. The differences in fit between the two families of models are illustrated. The estimated parameters and the data partition of the overall best fitting model are also commented on. Finally, some conclusions are drawn in Section 4.

# **2 Methodology**

#### **2.1 Parsimonious Mixtures of Matrix-variate** 𝒕 **Distributions**

The probability density function (pdf) of a 𝑝 × 𝑟 random matrix **X** coming from a finite mixture model is

$$f_{\text{MIXT}}(\mathbf{X}; \boldsymbol{\Omega}) = \sum_{g=1}^{G} \pi_g f(\mathbf{X}; \boldsymbol{\Theta}_g), \tag{1}$$

where $\pi_g$ is the $g$th mixing proportion, such that $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$, $f(\mathbf{X}; \boldsymbol{\Theta}_g)$ is the $g$th component pdf with parameter $\boldsymbol{\Theta}_g$, and $\boldsymbol{\Omega}$ contains all of the parameters of the mixture. In this paper, for the $g$th component of model (1), we adopt the MVT distribution having pdf

$$f_{\text{MVT}}(\mathbf{X}; \boldsymbol{\Theta}_g) = \frac{|\boldsymbol{\Sigma}_g|^{-\frac{r}{2}} |\boldsymbol{\Psi}_g|^{-\frac{p}{2}} \Gamma\left(\frac{pr+\nu_g}{2}\right)}{\left(\pi \nu_g\right)^{\frac{pr}{2}} \Gamma\left(\frac{\nu_g}{2}\right)} \left[1 + \frac{\delta_g\left(\mathbf{X}; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g\right)}{\nu_g}\right]^{-\frac{pr+\nu_g}{2}}, \tag{2}$$

where $\delta_g\left(\mathbf{X}; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g\right) = \operatorname{tr}\left[\boldsymbol{\Sigma}_g^{-1}(\mathbf{X} - \mathbf{M}_g)\boldsymbol{\Psi}_g^{-1}(\mathbf{X} - \mathbf{M}_g)'\right]$, $\mathbf{M}_g$ is the $p \times r$ component mean matrix, $\boldsymbol{\Sigma}_g$ is the $p \times p$ component row scale matrix, $\boldsymbol{\Psi}_g$ is the $r \times r$ component column scale matrix and $\nu_g > 0$ is the component degrees of freedom. It is interesting to recall that the pdf in (2) can be hierarchically obtained via the matrix-variate normal scale mixture model when the mixing random variable $W$ follows a gamma distribution with shape and rate parameters both set to $\nu_g/2$ [10]. Specifically, a hierarchical representation of the MVT distribution can be given as follows

$$\begin{aligned} 1.\ & W \sim \mathcal{G}\left(\nu_g/2, \nu_g/2\right), \\ 2.\ & \mathbf{X} \mid W = w \sim \mathcal{N}\left(\mathbf{M}_g, \boldsymbol{\Sigma}_g/w, \boldsymbol{\Psi}_g\right), \end{aligned}$$

where $\mathcal{G}(\cdot)$ denotes a gamma distribution and $\mathcal{N}(\cdot)$ denotes the MVN distribution. This representation will be convenient for the parameter estimation presented in Section 2.2.
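As an illustration, the hierarchical representation above gives a direct way to simulate MVT variates: draw $w$ from the gamma distribution, then a matrix-variate normal draw with row scale $\boldsymbol{\Sigma}_g/w$. A minimal NumPy sketch (the function name and parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmvt(n, M, Sigma, Psi, nu):
    """Draw n matrix-variate t variates via the gamma scale mixture:
    W ~ Gamma(shape=nu/2, rate=nu/2), then X | W=w ~ MVN(M, Sigma/w, Psi)."""
    p, r = M.shape
    L = np.linalg.cholesky(Sigma)   # row scale factor, Sigma = L L'
    R = np.linalg.cholesky(Psi)     # column scale factor, Psi = R R'
    draws = np.empty((n, p, r))
    for i in range(n):
        w = rng.gamma(shape=nu / 2, scale=2 / nu)  # rate nu/2 == scale 2/nu
        Z = rng.standard_normal((p, r))
        draws[i] = M + (L @ Z @ R.T) / np.sqrt(w)  # MVN(M, Sigma/w, Psi)
    return draws

X = rmvt(5000, M=np.zeros((3, 2)), Sigma=np.eye(3), Psi=np.eye(2), nu=5.0)
print(X.shape)  # (5000, 3, 2)
```

With a small `nu` the simulated matrices show the heavy-tailed behavior that motivates the MVT components.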

As discussed in Section 1, the mixture model in (1) may be characterized by a potentially high number of parameters. To address this concern, we first use the eigen-decomposition of the component scale matrices $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Psi}_g$. In detail, we recall that a generic $q \times q$ scale matrix $\boldsymbol{\Phi}_g$ can be decomposed as [11]

$$\boldsymbol{\Phi}_g = \lambda_g \boldsymbol{\Gamma}_g \boldsymbol{\Delta}_g \boldsymbol{\Gamma}_g', \tag{3}$$

where $\lambda_g = |\boldsymbol{\Phi}_g|^{1/q}$, $\boldsymbol{\Gamma}_g$ is a $q \times q$ orthogonal matrix whose columns are the normalized eigenvectors of $\boldsymbol{\Phi}_g$, and $\boldsymbol{\Delta}_g$ is the scaled ($|\boldsymbol{\Delta}_g| = 1$) diagonal matrix of the eigenvalues of $\boldsymbol{\Phi}_g$. By constraining the three components in (3), the following family of 14 parsimonious structures is obtained: EII, VII, EEI, VEI, EVI, VVI, EEE, VEE, EVE, VVE, EEV, VEV, EVV, VVV, where "E" stands for equal, "V" means varying and "I" denotes the identity matrix.
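A small NumPy sketch of the decomposition in (3) (the function name is illustrative), recovering $\lambda_g$, $\boldsymbol{\Gamma}_g$ and $\boldsymbol{\Delta}_g$ from a scale matrix and checking that $|\boldsymbol{\Delta}_g| = 1$ and that the three pieces reproduce $\boldsymbol{\Phi}_g$:

```python
import numpy as np

def eigen_decomposition(Phi):
    """Split a q x q scale matrix into (lambda, Gamma, Delta) as in (3):
    lambda = |Phi|^(1/q), Gamma holds the normalized eigenvectors, and
    Delta is the diagonal eigenvalue matrix scaled so that |Delta| = 1."""
    q = Phi.shape[0]
    eigvals, Gamma = np.linalg.eigh(Phi)
    lam = np.linalg.det(Phi) ** (1 / q)
    Delta = np.diag(eigvals / lam)
    return lam, Gamma, Delta

Phi = np.array([[4.0, 1.0],
                [1.0, 2.0]])
lam, Gamma, Delta = eigen_decomposition(Phi)
print(round(np.linalg.det(Delta), 6))                    # 1.0
print(np.allclose(lam * Gamma @ Delta @ Gamma.T, Phi))   # True
```

Constraining $\lambda_g$, $\boldsymbol{\Gamma}_g$ or $\boldsymbol{\Delta}_g$ to be equal across groups (or to the identity) is what generates the 14 structures listed above.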

If we apply the decomposition in (3) to 𝚺<sup>𝑔</sup> and 𝚿𝑔, we obtain 14 × 14 = 196 parsimonious structures. However, to solve a well-known identifiability issue related to the scale matrices of matrix-variate distributions [1, 3, 5], we impose the restriction |𝚿<sup>𝑔</sup> | = 1, which makes the parameter 𝜆<sup>𝑔</sup> unnecessary, and reduces the number of parsimonious structures related to 𝚿<sup>𝑔</sup> from 14 to 7: II, EI, VI, EE, VE, EV, VV. Thus, we have 14×7 = 98 parsimonious structures for the component scale matrices.

To further increase the parsimony of model (1), we also consider the option of constraining the component degrees of freedom $\nu_g$ to be equal across groups. The nomenclature used is the same as that adopted for the scale matrices. This option, combined with that discussed above for the scale matrices, allows us to produce a total of 98 × 2 = 196 parsimonious MVT-Ms.
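The bookkeeping behind the count is a plain Cartesian product; a minimal sketch enumerating the family's model identifiers (the three-part naming convention, e.g. VVE-EE-V, follows the one used in Section 3):

```python
from itertools import product

# 14 structures for Sigma_g, 7 for Psi_g (|Psi_g| = 1 removes the volume
# term), and E/V for the degrees of freedom: 14 * 7 * 2 = 196 models.
sigma_structures = ["EII", "VII", "EEI", "VEI", "EVI", "VVI", "EEE",
                    "VEE", "EVE", "VVE", "EEV", "VEV", "EVV", "VVV"]
psi_structures = ["II", "EI", "VI", "EE", "VE", "EV", "VV"]
df_structures = ["E", "V"]

family = ["-".join(m) for m in
          product(sigma_structures, psi_structures, df_structures)]
print(len(family))             # 196
print(family[0], family[-1])   # EII-II-E VVV-VV-V
```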

#### **2.2 An AECM Algorithm for Parameter Estimation**

To estimate the parameters of our family of mixture models, we implement an AECM algorithm. By using the hierarchical representation of Section 2.1, our complete data are $\mathbf{S}_c = \{\mathbf{X}_i, \mathbf{z}_i, w_i\}_{i=1}^{N}$, where $\mathbf{z}_i = (z_{i1}, \ldots, z_{iG})'$, such that $z_{ig} = 1$ if observation $i$ belongs to group $g$ and $z_{ig} = 0$ otherwise, and $w_i$ is the realization of $W$. Therefore, the complete-data log-likelihood can be written as

$$\ell_c\left(\boldsymbol{\Omega}; \mathbf{S}_c\right) = \ell_{1c}\left(\boldsymbol{\pi}; \mathbf{S}_c\right) + \ell_{2c}\left(\boldsymbol{\Xi}; \mathbf{S}_c\right) + \ell_{3c}\left(\boldsymbol{\vartheta}; \mathbf{S}_c\right), \tag{4}$$

where

$$\ell_{1c}\left(\boldsymbol{\pi}; \mathbf{S}_c\right) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln\left(\pi_g\right),$$

$$\ell_{2c}\left(\boldsymbol{\Xi}; \mathbf{S}_c\right) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left[ -\frac{pr}{2}\ln(2\pi) + \frac{pr}{2}\ln(w_{ig}) - \frac{r}{2}\ln\left|\boldsymbol{\Sigma}_g\right| - \frac{p}{2}\ln\left|\boldsymbol{\Psi}_g\right| - \frac{w_{ig}\,\delta_g(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g)}{2} \right], \tag{5}$$

$$\ell_{3c}\left(\boldsymbol{\vartheta}; \mathbf{S}_c\right) = \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left\{ \frac{\nu_g}{2}\ln\frac{\nu_g}{2} - \ln\left[\Gamma\left(\frac{\nu_g}{2}\right)\right] + \left(\frac{\nu_g}{2} - 1\right)\ln w_{ig} - \frac{\nu_g}{2} w_{ig} \right\},$$

with $\boldsymbol{\pi} = \{\pi_g\}_{g=1}^{G}$, $\boldsymbol{\Xi} = \{\mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g\}_{g=1}^{G}$ and $\boldsymbol{\vartheta} = \{\nu_g\}_{g=1}^{G}$.

Our AECM algorithm then proceeds as follows (note that the parameters marked with one dot are the updates of the previous iteration, while those marked with two dots are the updates at the current iteration):

E-step At the E-step we have to compute the following quantities

$$\ddot{z}_{ig} = \frac{\dot{\pi}_g f_{\text{MVT}}\left(\mathbf{X}_i; \dot{\boldsymbol{\Theta}}_g\right)}{\sum_{h=1}^{G} \dot{\pi}_h f_{\text{MVT}}\left(\mathbf{X}_i; \dot{\boldsymbol{\Theta}}_h\right)} \quad \text{and} \quad \ddot{w}_{ig} = \frac{pr + \dot{\nu}_g}{\dot{\nu}_g + \delta_g\left(\mathbf{X}_i; \dot{\mathbf{M}}_g, \dot{\boldsymbol{\Sigma}}_g, \dot{\boldsymbol{\Psi}}_g\right)}. \tag{6}$$

There is no need to compute the expected value of $\ln W_{ig}$, given that we do not use this quantity to update $\nu_g$.

CM-step 1 At the first CM-step, we have the following updates

$$\ddot{\pi}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig}}{N} \quad \text{and} \quad \ddot{\mathbf{M}}_g = \frac{\sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} \mathbf{X}_i}{\sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig}}.$$
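Both updates are weighted averages driven by $\ddot{z}_{ig}\ddot{w}_{ig}$; a NumPy sketch under the assumption that the E-step quantities are stored as arrays (function and variable names are illustrative):

```python
import numpy as np

def cm_step_1(X, z, w):
    """Closed-form updates of the mixing proportions and mean matrices,
    given posteriors z (N x G) and weights w (N x G); X is (N, p, r)."""
    pi_new = z.mean(axis=0)
    zw = z * w
    # z*w-weighted average of the X_i within each group
    M_new = np.einsum("ig,ipr->gpr", zw, X) / zw.sum(axis=0)[:, None, None]
    return pi_new, M_new

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3, 2))
z = rng.dirichlet(np.ones(2), size=100)     # soft assignments, G = 2
w = rng.gamma(2.0, 1.0, size=(100, 2))
pi_new, M_new = cm_step_1(X, z, w)
print(round(pi_new.sum(), 6))   # 1.0
print(M_new.shape)              # (2, 3, 2)
```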

Because of space constraints, we cannot report here the updates of each parsimonious structure related to $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Psi}_g$. However, they can be obtained by generalizing the results in [5]. The only differences consist in the updates of the row and column scatter matrices of the $g$th component, which are here defined as

$$\begin{aligned} \ddot{\mathbf{W}}_g^{R} &= \sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} \left(\mathbf{X}_i - \ddot{\mathbf{M}}_g\right) \dot{\boldsymbol{\Psi}}_g^{-1} \left(\mathbf{X}_i - \ddot{\mathbf{M}}_g\right)', \\ \ddot{\mathbf{W}}_g^{C} &= \sum_{i=1}^{N} \ddot{z}_{ig} \ddot{w}_{ig} \left(\mathbf{X}_i - \ddot{\mathbf{M}}_g\right)' \ddot{\boldsymbol{\Sigma}}_g^{-1} \left(\mathbf{X}_i - \ddot{\mathbf{M}}_g\right). \end{aligned}$$
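These two weighted scatter sums for a single component can be sketched in NumPy as follows (function and variable names are illustrative, not the author's implementation):

```python
import numpy as np

def scatter_matrices(X, z, w, M, Psi_inv, Sigma_inv):
    """Row (p x p) and column (r x r) weighted scatter sums for one
    component; X is (N, p, r), M is (p, r), z and w are length-N."""
    D = X - M                # centred observations
    zw = z * w
    WR = np.einsum("i,ipr,rs,iqs->pq", zw, D, Psi_inv, D)
    WC = np.einsum("i,ipr,pq,iqs->rs", zw, D, Sigma_inv, D)
    return WR, WC

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3, 2))
WR, WC = scatter_matrices(X, z=np.ones(50), w=np.ones(50),
                          M=np.zeros((3, 2)),
                          Psi_inv=np.eye(2), Sigma_inv=np.eye(3))
print(WR.shape, WC.shape)   # (3, 3) (2, 2)
```

With identity scale inverses the two matrices share the same trace, since $\operatorname{tr}(\mathbf{D}\mathbf{D}') = \operatorname{tr}(\mathbf{D}'\mathbf{D})$.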

CM-step 2 At the second CM-step, we firstly define the "partial" complete-data log-likelihood function according to the following specification

$$\ell_{pc}\left(\boldsymbol{\Omega}; \mathbf{S}_{pc}\right) = \ell_{1c}\left(\boldsymbol{\pi}; \mathbf{S}_{pc}\right) + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \ln f_{\text{MVT}}(\mathbf{X}_i; \boldsymbol{\Theta}_g), \tag{7}$$

where "partial" refers to the fact that the complete data are now defined as $\mathbf{S}_{pc} = \{\mathbf{X}_i, \mathbf{z}_i\}_{i=1}^{N}$. Then, $\ddot{\nu}_g$ is determined by maximizing

$$\sum_{i=1}^{N} \ddot{z}_{ig} \ln f_{\text{MVT}}(\mathbf{X}_i; \ddot{\boldsymbol{\Theta}}_g) \quad \text{or} \quad \sum_{i=1}^{N} \sum_{g=1}^{G} \ddot{z}_{ig} \ln f_{\text{MVT}}(\mathbf{X}_i; \ddot{\boldsymbol{\Theta}}_g),$$

over $\nu_g \in (0, 100)$, depending on the parsimonious structure selected, i.e. V or E, respectively. Note that a higher upper bound could also have been selected for the maximization problem but, with the chosen value, the differences between an estimated MVT distribution and the nested MVN distribution would already be negligible. Furthermore, when a heavy-tailed distribution approaches normality, the precision of the estimated tailedness parameters is unreliable [4].
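A sketch of this bounded one-dimensional maximization, using the log-density in (2) and SciPy's bounded scalar optimizer (function names and the synthetic check are illustrative, not the author's implementation):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def mvt_logpdf(X, M, Sigma, Psi, nu):
    """Log of the matrix-variate t density in (2); X has shape (N, p, r)."""
    p, r = M.shape
    D = X - M
    # delta_i = tr(Sigma^{-1} (X_i - M) Psi^{-1} (X_i - M)')
    delta = np.einsum("ipr,pq,iqs,rs->i",
                      D, np.linalg.inv(Sigma), D, np.linalg.inv(Psi))
    return (-r / 2 * np.log(np.linalg.det(Sigma))
            - p / 2 * np.log(np.linalg.det(Psi))
            + gammaln((p * r + nu) / 2) - gammaln(nu / 2)
            - p * r / 2 * np.log(np.pi * nu)
            - (p * r + nu) / 2 * np.log1p(delta / nu))

def update_nu(X, z, M, Sigma, Psi):
    """Maximise the z-weighted component log-likelihood over nu in (0, 100)."""
    objective = lambda nu: -np.sum(z * mvt_logpdf(X, M, Sigma, Psi, nu))
    return minimize_scalar(objective, bounds=(1e-3, 100.0), method="bounded").x

# check on synthetic heavy-tailed data generated with nu = 5
rng = np.random.default_rng(3)
w = rng.gamma(5 / 2, 2 / 5, size=2000)
X = rng.standard_normal((2000, 3, 2)) / np.sqrt(w)[:, None, None]
nu_hat = update_nu(X, np.ones(2000), np.zeros((3, 2)), np.eye(3), np.eye(2))
print(round(nu_hat, 1))  # close to 5
```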

# **3 Real Data Application**

Here, we analyze the Municipalities dataset contained in the **AER** package [13] for the R statistical software. It consists of expenditure information for 𝑁 = 265 Swedish municipalities over 𝑟 = 9 years (1979–1987). For each municipality, we measure the following 𝑝 = 3 variables: (i) total expenditures, (ii) total own-source revenues and (iii) intergovernmental grants received.

We fitted parsimonious MVT-Ms and MVN-Ms for 𝐺 ∈ {1, 2, 3, 4, 5} to the data, and for each family of models the Bayesian information criterion (BIC) [14] is used to select the best fitting model. According to our results, the best among the MVN-Ms has a BIC of -82362.61, a VVV-EE structure and 𝐺 = 4 groups, while the best among the MVT-Ms has a BIC of -82701.59, a VVE-EE-V structure and 𝐺 = 3 groups. Thus, the overall best fitting model is that selected among the MVT-Ms. The MVN-Ms seem to overfit the data, given that an additional group is detected. This is not an unusual behavior: the tails of normal mixture models cannot adequately accommodate deviations from normality, and additional groups are consequently found in the data [4, 7, 15]. In any case, the best fitting models of the two families agree in finding varying volumes and shapes in the component row scale matrices and equal shapes and orientations in the component column scale matrices.

Figure 1 illustrates the parallel coordinate plots of the data partition detected by the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated mean for that variable, across the time, in that group. We notice that the first group contains municipalities having, on average, slightly higher expenditures, an intermediate

**Fig. 1** Parallel coordinate plots of the data partition obtained by the VVE-EE-V MVT-Ms. The dashed lines correspond to the estimated means.

level of revenues and higher levels of intergovernmental grants than the other two groups. Furthermore, it seems to cluster several outlying observations, as confirmed by the estimated degrees of freedom $\nu_1 = 3.75$, which implies quite heavy tails for this mixture component. The second group shows the lowest average levels of expenditures and revenues, but an amount of received grants similar to that of the third group. Interestingly, this group does not present many outlying observations, as also supported by the estimated degrees of freedom $\nu_2 = 10.95$. Lastly, the third group has the highest levels of revenues but, as already said, it is similar to the other two groups in the other variables. Also in this case, we have a moderately heavy tail behavior, given that the estimated degrees of freedom is $\nu_3 = 6.05$.

To evaluate the correlations of the variables with each other and over time, for the three groups, we now report the correlation matrices **R**(·) related to the covariance matrices associated to 𝚺<sup>𝑔</sup> and 𝚿𝑔:

$$\mathbf{R}_{\boldsymbol{\Sigma}_1} = \begin{pmatrix} 1.00 & 0.48 & 0.14 \\ 0.48 & 1.00 & -0.06 \\ 0.14 & -0.06 & 1.00 \end{pmatrix}, \quad \mathbf{R}_{\boldsymbol{\Sigma}_2} = \begin{pmatrix} 1.00 & 0.55 & 0.18 \\ 0.55 & 1.00 & -0.07 \\ 0.18 & -0.07 & 1.00 \end{pmatrix}, \quad \mathbf{R}_{\boldsymbol{\Sigma}_3} = \begin{pmatrix} 1.00 & 0.73 & 0.22 \\ 0.73 & 1.00 & -0.02 \\ 0.22 & -0.02 & 1.00 \end{pmatrix},$$

$$\mathbf{R}_{\boldsymbol{\Psi}_1} = \mathbf{R}_{\boldsymbol{\Psi}_2} = \mathbf{R}_{\boldsymbol{\Psi}_3} = \begin{pmatrix} 1.00 & 0.80 & 0.72 & 0.67 & 0.65 & 0.59 & 0.58 & 0.55 & 0.52 \\ 0.80 & 1.00 & 0.79 & 0.73 & 0.69 & 0.62 & 0.62 & 0.57 & 0.54 \\ 0.72 & 0.79 & 1.00 & 0.80 & 0.73 & 0.69 & 0.66 & 0.63 & 0.60 \\ 0.67 & 0.73 & 0.80 & 1.00 & 0.79 & 0.73 & 0.71 & 0.67 & 0.64 \\ 0.65 & 0.69 & 0.73 & 0.79 & 1.00 & 0.83 & 0.80 & 0.73 & 0.71 \\ 0.59 & 0.62 & 0.69 & 0.73 & 0.83 & 1.00 & 0.80 & 0.76 & 0.73 \\ 0.58 & 0.62 & 0.66 & 0.71 & 0.80 & 0.80 & 1.00 & 0.81 & 0.78 \\ 0.55 & 0.57 & 0.63 & 0.67 & 0.73 & 0.76 & 0.81 & 1.00 & 0.79 \\ 0.52 & 0.54 & 0.60 & 0.64 & 0.71 & 0.73 & 0.78 & 0.79 & 1.00 \end{pmatrix}.$$

When $\mathbf{R}_{\boldsymbol{\Sigma}_1}$, $\mathbf{R}_{\boldsymbol{\Sigma}_2}$ and $\mathbf{R}_{\boldsymbol{\Sigma}_3}$ are considered, we notice that, as might reasonably be expected, the correlations between total expenditures and total own-source revenues or intergovernmental grants received are positive, and they increase as we move from the first to the third group. Conversely, there exists a slightly negative correlation between total own-source revenues and intergovernmental grants received, with no great differences among the groups in this case. As regards $\mathbf{R}_{\boldsymbol{\Psi}_1}$, $\mathbf{R}_{\boldsymbol{\Psi}_2}$ and $\mathbf{R}_{\boldsymbol{\Psi}_3}$, we observe that the correlation among the columns, i.e. between time points, decreases as the temporal distance increases. Furthermore, considering the dimensionality of these column matrices, the benefit, in terms of the number of parameters to be estimated, of an EE parsimonious structure with respect to a fully unconstrained model is readily understandable.

Finally, we analyze the uncertainty of the detected classification. This can be computed, for each observation, by subtracting the probability $z_{ig}$ of the most likely group from 1 [16]. The lower the uncertainty, the stronger the assignment. The quantiles of the obtained uncertainties can be used to measure the quality of the classification. In this regard, we noticed that 75% of the observations have an uncertainty equal to or lower than 0.05. However, we observed a maximum value of 0.50. This happens when groups intersect, since uncertain classifications are expected in the overlapping regions [17]. Relatedly, more detailed information can be gained by looking at the uncertainty plot illustrated in Figure 2, which reports the (sorted) uncertainty values of all the municipalities. We see that the municipalities clustered

**Fig. 2** Uncertainty plot for the Municipalities dataset.

in the first group, excluding a couple of cases, have practically null uncertainties. This applies to a lesser extent to the municipalities in the other two groups, given the slightly higher number of exceptions. For example, there are 15 observations (approximately 5% of the total sample size) that have uncertainty values greater than 0.3. However, and as said above, this is due to the closeness between the groups, which can be confirmed by looking at the parallel plots in Figure 1.
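Given a matrix of posterior probabilities, this uncertainty measure is a one-liner; a small sketch with invented probabilities:

```python
import numpy as np

def classification_uncertainty(Z):
    """Uncertainty of each observation: 1 minus the posterior probability
    of its most likely group."""
    return 1.0 - Z.max(axis=1)

# invented posterior probabilities for three observations, G = 3 groups
Z = np.array([[0.97, 0.02, 0.01],
              [0.50, 0.45, 0.05],
              [0.34, 0.33, 0.33]])
u = classification_uncertainty(Z)
print(u.round(2).tolist())  # [0.03, 0.5, 0.66]
```

The first observation is assigned almost certainly, while the last sits where the groups overlap, mirroring the behavior seen in Figure 2.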

# **4 Conclusions**

One serious concern of matrix-variate mixture models is the potentially high number of parameters. Furthermore, many real datasets require models having heavier-than-normal tails. To address both aspects, in this paper a family of 196 parsimonious mixture models, based on the matrix-variate 𝑡 distribution, is introduced. The eigen-decomposition of the component scale matrices, as well as constraints on the component degrees of freedom, are used to attain parsimony. An AECM algorithm for parameter estimation has been presented. Our family of models has been fitted to a real dataset along with parsimonious mixtures of matrix-variate normal distributions. The results demonstrate the better fit of our models and the overfitting tendency of matrix-variate normal mixtures. Lastly, the estimated parameters and data partition for the best of our models have been reported and commented on.

**Acknowledgements** This work was supported by the University of Catania grant PIACERI/CRASI (2020).

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Evolution of Media Coverage on Climate Change and Environmental Awareness: an Analysis of Tweets from UK and US Newspapers**

Gianpaolo Zammarchi, Maurizio Romano, and Claudio Conversano

**Abstract** Climate change represents one of the biggest challenges of our time. Newspapers might play an important role in raising awareness of this problem and its consequences. We collected all tweets posted by six UK and US newspapers in the last decade to assess whether 1) the space given to this topic has grown, 2) any breakpoint can be identified in the time series of tweets on climate change, and 3) any main topics can be identified in these tweets. Overall, the number of tweets posted on climate change increased for all newspapers during the last decade. Although a sharp decrease was observed in 2020 due to the pandemic, for most newspapers climate change coverage started to rise again in 2021. While different breakpoints were observed, for most newspapers 2019 was identified as a key year, which is plausible given the coverage received by the activities organized by the Fridays for Future movement. Finally, using different topic modeling approaches, we observed that, while unsupervised models partly capture topics relevant to climate change, such as those related to politics, consequences for health, or pollution, semi-supervised models can help to reach higher informativeness of the words assigned to the topics.

**Keywords:** climate change, Twitter, environment, time series, topic modeling

Gianpaolo Zammarchi

University of Cagliari, Viale Sant'Ignazio 17, 09123, Cagliari, Italy, e-mail: gp.zammarchi@unica.it

© The Author(s) 2023
P. Brito et al. (eds.), *Classification and Data Science in the Digital Age*, Studies in Classification, Data Analysis, and Knowledge Organization, https://doi.org/10.1007/978-3-031-09034-9\_43

Maurizio Romano University of Cagliari, Viale Sant'Ignazio 17, 09123, Cagliari, Italy, e-mail: romano.maurizio@unica.it

Claudio Conversano University of Cagliari, Viale Sant'Ignazio 17, 09123, Cagliari, Italy, e-mail: conversa@unica.it

# **1 Introduction**

Climate change is one of the biggest challenges for our society. Its consequences, which include, among others, melting glaciers, warming oceans, rising sea levels, and shifting weather or rainfall patterns, are already impacting our health and imposing costs on society. Without drastic action aimed at reducing or preventing human-induced emissions of greenhouse gasses, these consequences are expected to intensify in the coming years. Despite its global and severe impacts, individuals may perceive climate change as an abstract problem [1]. It is also a well-known fact that the level of information plays a crucial role in the awareness of a topic (e.g. healthy food [2] and smoking [3]). Media represent a crucial source of information and can exert substantial effects on public opinion, thus helping to raise awareness of climate change. For instance, media can explain the consequences of climate change as well as portray actions that governments, communities and single individuals can take. For this reason, it is important to distinguish themes that might have gained popularity from those that may have seen a decrease of interest. Nowadays, social media have become a reliable and popular source of information for people from all around the world. Twitter is one of the most popular microblogging services and is used by many traditional newspapers on a daily basis. While we can hypothesize that in the last few years media coverage of climate change might have risen, due for instance to international climate strike movements, the recent emergence of the coronavirus disease 2019 (COVID-19) pandemic might have led to a decrease of attention to other relevant topics.

The aims of this work were to: (1) assess trends in media coverage of climate change using tweets posted by major international newspapers based in the United Kingdom (UK) and the United States (US), and (2) identify the main topics discussed in these tweets using topic modeling.

# **2 Dataset and Methods**

We downloaded all tweets posted from January 1st, 2012 to December 31st, 2021 from the official Twitter accounts of six widely known newspapers based in the UK (The Guardian, The Independent and The Mirror) or the US (The New York Times, The Washington Post and The Wall Street Journal), leading to a collection of 3,275,499 tweets. Next, we determined which tweets were related to climate change and environmental awareness based on the presence of at least one of the following keywords: "climate change", "sustainability", "earth day", "plastic free", "global warming", "pollution", "environmentally friendly" or "renewable energy". We plotted the number of tweets on climate change posted by each newspaper during each year using R v. 4.1.2 [4].
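The keyword filter can be sketched as a simple case-insensitive substring match (the example tweets are invented for illustration; the keyword list is the one given above):

```python
# Invented example tweets; the keyword list is the one given above.
KEYWORDS = ["climate change", "sustainability", "earth day", "plastic free",
            "global warming", "pollution", "environmentally friendly",
            "renewable energy"]

def is_climate_tweet(text):
    """Case-insensitive substring match against the keyword list."""
    text = text.lower()
    return any(kw in text for kw in KEYWORDS)

tweets = ["Global warming is accelerating, a new report warns",
          "Stocks closed higher on Wall Street today",
          "Cities invest in renewable energy projects"]
print([is_climate_tweet(t) for t in tweets])  # [True, False, True]
```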

We analyzed the association between the number of tweets on climate change and the total number of tweets posted by each newspaper using Spearman's correlation analysis. For each year and for each newspaper, we computed and plotted the differences in the number of posted tweets compared to the previous year, for both (a) tweets related to climate change and (b) all tweets. Finally, we used the changepoint R package [5] to conduct an analysis aimed at identifying structural breaks, i.e. unexpected changes in a time series. In many applications, it is reasonable to believe that there might be *m* breakpoints at which a shift in the mean value occurs (especially if some exogenous event takes place). The changepoint package estimates the breakpoints using several penalty criteria, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC). We estimated the breakpoints using the Binary Segmentation (BinSeg) method [6] implemented in the package.
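The core step that BinSeg applies recursively — locating the single mean-shift split that minimizes the within-segment sum of squares — can be sketched as follows (the series is invented for illustration, not the paper's data):

```python
import numpy as np

def best_split(y):
    """Single mean-shift changepoint: the index k minimising the total
    within-segment sum of squares (the step BinSeg applies recursively)."""
    best_k, best_cost = None, np.inf
    for k in range(1, len(y)):
        left, right = y[:k], y[k:]
        cost = ((left - left.mean()) ** 2).sum() + \
               ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# invented yearly counts with a jump after the 8th year (index 7),
# mimicking a 2019 breakpoint in a 2012-2021 series
y = np.array([20, 22, 19, 24, 23, 25, 22, 60, 58, 62], dtype=float)
print(best_split(y))  # 7
```

BinSeg then re-applies the same search on each resulting segment, stopping when the penalized criterion no longer improves.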

Lastly, we used tweets posted by The Guardian to perform topic modeling, a method for the classification of text into topics. Preprocessing (including lemmatization, removal of stopwords and creation of the document-term matrix) was conducted with tm [7] and quanteda [8] in R. We used two different approaches: 1) Latent Dirichlet Allocation (LDA), implemented in the textmineR R package [9]; and 2) Correlation Explanation (CorEx), an alternative approach to LDA that allows both unsupervised and semi-supervised topic modeling [10].

# **3 Results**

#### **3.1 Analysis of Tweet Trends and Breakpoints**

Among the 3,275,499 collected tweets, we identified 11,155 tweets related to climate change and environmental awareness. Figure 1A shows the number of tweets on climate change posted by each of the analyzed newspapers from 2012 to 2021, while Figure 1B shows the total number of tweets posted by each newspaper.

**Fig. 1** Number of tweets on climate change (A) or total number of tweets (B) posted by the six newspapers from 2012 to 2021.

For the majority of newspapers, the number of tweets on climate change increased from 2014 to 2019, saw a sharp decrease in 2020, coinciding with the emergence of the COVID-19 pandemic, and rose again in 2021. On the other hand, the

**Fig. 2** Year-over-year percentage changes of overall tweets and tweets on climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F, The Wall Street Journal.

number of tweets on climate change posted by The Guardian showed a peak during 2015 and a subsequent decrease. However, it must be noted that The Guardian is also the newspaper that showed a more pronounced decrease in the overall number of tweets.

The number of tweets on climate change was significantly positively correlated with the overall number of tweets posted from 2012 to 2021 for four newspapers (The Guardian, Spearman's rho = 0.95, 𝑝 < 0.001; The Mirror, Spearman's rho = 0.95, 𝑝 < 0.001; The Independent, Spearman's rho = 0.76, 𝑝 = 0.016; The Washington Post, Spearman's rho = 0.70, 𝑝 = 0.031) but not for The New York Times (Spearman's rho = 0.18, 𝑝 = 0.63) or The Wall Street Journal (Spearman's rho = 0.49, 𝑝 = 0.15). Year-over-year percentage changes among either tweets related to climate change or all posted tweets can be observed in Figure 2.
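The correlation analysis can be reproduced in outline with SciPy's `spearmanr` (the yearly counts below are invented for illustration, not the paper's data):

```python
import numpy as np
from scipy.stats import spearmanr

# invented yearly counts over ten years: overall tweet volume and
# climate-change tweet volume moving together
total = np.array([300, 320, 310, 350, 370, 360, 400, 420, 410, 450])
climate = np.array([80, 90, 100, 130, 160, 150, 190, 220, 210, 250])
rho, p_value = spearmanr(total, climate)
print(round(rho, 2))  # 0.99
```

Since Spearman's rho works on ranks, it captures the monotone association between the two series without assuming linearity.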

Looking at Figure 2, we can observe a great variability in the posted number of tweets during the years, both for the total number of tweets and for the number of tweets on climate change. While the analysis aimed at identifying structural changes

**Fig. 3** Structural changes in the time series of tweets related to climate change. A: The Guardian, B: The Mirror, C: The Independent, D: The New York Times, E: The Washington Post, F, The Wall Street Journal. The red line represents the years between two breakpoints.

in the time series comprising tweets on climate change identified three or four breakpoints for all newspapers, wide variability was observed regarding the specific year in which these structural changes were identified (Figure 3). Despite the great variability, Figure 3 shows that even if a common breakpoint cannot be identified, 2019 was a key year for five out of six newspapers (except for The Independent).

#### **3.2 Topic Modeling**

Finally, we exploited the topic modeling approach to identify and analyze the main topics discussed by newspapers in their tweets. Due to space limitations, we focus only on The Guardian, since this newspaper showed a trend in contrast with the others. The data come from 2,916 tweets posted by The Guardian, analyzed using LDA and CorEx. For LDA, a range of 5-20 unsupervised topics was tested, with the most interpretable results obtained with 10 topics (Table 1). The topic coherence ranged from 0.01 to 0.34 (mean: 0.13). For each topic, bi-gram topic labels were assigned with the labeling algorithm implemented in textmineR. We can observe that topics are related to politics or leaders (Topics 3, 7 and 10), environmental scientists or climate journalists (Topics 1 and 5), energy sources (Topics 4 and 8) and effects of climate change (Topics 2, 6 and 9). The intertopic distance map obtained with LDAvis is shown in Figure 4. The area of each circle is proportional to the relative prevalence of that topic in the corpus, while inter-topic distances are computed based on the Jensen-Shannon divergence.


**Table 1** Top terms for the ten topics identified with LDA.

**Fig. 4** Intertopic distance map.

Finally, we conducted a semi-supervised topic modeling analysis based on anchored words using CorEx. When anchoring a word to a topic, CorEx maximizes the mutual information between that word and the topic, thus guiding the topic model towards specific subsets of words. A model with five topics and three anchored words per topic (Table 2) showed a total correlation (i.e. the measure maximized by CorEx when constructing the topic model) of 4.36. This value was higher than that observed with an unsupervised CorEx analysis with the same number of topics (total correlation = 0.97; topics not shown due to space limits). Topics related to politics (Topic 3) and science (Topic 5) were found to be the most informative in our dataset based on the total correlation metric.


**Table 2** Topics with anchored words and examples of tweets.

The anchored words are reported in bold.

# **4 Discussion**

The present study aims to evaluate how some of the most relevant British and American newspapers have given space to the topic of climate change on their Twitter pages over the last decade. Apart from The Guardian, which shows a decreasing trend in the number of tweets related to climate change, all the other newspapers showed an overall growing trend, except during 2020. During that year, the number of tweets related to climate change declined for all six newspapers, most probably because the COVID-19 outbreak was massively covered by all media. By analyzing the breakpoints in Figure 3, it is possible to observe that 2019 was a relevant year for climate change. This is plausible considering that, starting from the end of 2018, the strikes launched by the Fridays for Future movement to raise awareness of climate change gained high media coverage.

Our topic modeling analysis showed that the main topics defined using unsupervised models such as LDA are mostly related to politics, environmental scientists, energy sources and effects of climate change. While unsupervised models capture relevant topics, using CorEx we found a semi-supervised model to be able to reach a higher total correlation, which is a measure of informativeness of the topics, compared to an unsupervised model with the same number of topics.

As future developments, we plan to extend our analyses to newspapers from other countries. We believe our work to be useful to gain more knowledge and awareness about the climate change topic and on how much space relevant newspapers have given to this issue on social media. Increasing the knowledge about the nature of the topics covered by newspapers will lay the basis for future studies aimed at evaluating public awareness on this highly relevant challenge.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Index**

#### Symbols

k-means, 213
3-way network, 147

#### A

Abdesselam, R., 1
adjacency matrix, 1
affinity coefficient, 343
anomaly detection, 373
Anton, C., 11
Antonazzo, F., 21
Arnone, E., 29
Ascari, R., 35
Aschenbruck, R., 43
Ashofteh, A., 53
association measures, 233
AUC, 176
automated planning, 101

#### B

Bacelar-Nicolau, H., 343, 363
Batagelj, V., 63
Batista, M. G., 363
Bayesian inference, 35
Bayesian methodology, 223
Beaudry, É., 101
bi-stochastic matrix, 213
blockmodeling, 63
bootstrap method, 176
Bouaoune, M. A., 83
Bouchard, K., 253

Boutalbi, R., 73
brain-computer interface, 323

#### C

Cabral, S., 363
Campos, J., 53
Cappé, O., 263
Cardoso, M. G. M. S., 353
categorical data, 353
categorical time series, 233
CatPCA, 363
Chabane, N., 83
Chadjipadelis, T., 93, 283
Champagne Gareau, J., 101
classification of textual documents, 121
climate change, 403
cluster analysis, 273, 363
cluster stability, 43, 183
cluster validation, 43
cluster validity indices, 383
clustering, 53, 83, 203, 233, 243, 383, 393
clustering validation, 343
clustering with relational constraints, 63
clusterwise, 73
co-clustering, 73
community detection, 147
complex network, 147
constraints, 139
contaminated normal distribution, 11, 303
Conversano, C., 403
correspondence analysis, 93

count data, 35
COVID-19, 93, 334
Cunial, E., 29

#### D

D'Urso, P., 233
data analysis, 283
data mining, 73
decision boundaries, 21
democracy, 283
dependency chains, 101
Di Nuzzo, C., 111
dimensionality reduction, 83
Dobša, J., 121
Dvorák, J., 293
dynamical systems, 373

#### E

ECM algorithm, 303
EEG, 323
EM algorithm, 11, 353
emotions, 323
environment, 403
evidence-based policy making, 93
expectation-maximization, 263

#### F

factorial k-means, 213
Faria, B. M., 323
Figueiredo, M., 353
finite mixture model, 353
Fontanella, S., 313
Forbes, F., 263
Fort, G., 263
fraud detection, 131
functional data, 11, 293
functional data analysis, 29, 313, 334
fuzzy sets, 243

#### G

Gama, J., 131
García-Escudero, L. A., 139
Gaussian mixture model, 183
Gaussian process, 253
Genova, V. G., 147
Giordano, G., 147
Giubilei, R., 155
graph clustering, 155
graphical LASSO, 313
grocery shopping recommendation, 83
Górecki, T., 165

#### H

Hayashi, K., 175
Hennig, C., 183
hierarchical cluster analysis, 93
hierarchical clustering, 1
Hoshino, E., 175
hyperparameter tuning, 131
hyperquadrics, 21

#### I

Ievoli, R., 273
Ignaccolo, R., 313
image processing, 193
indicator processes, 233
Ingrassia, S., 21, 111
intelligent shopping list, 83
Ippoliti, L., 313
item classification, 53

#### J

Janácek, P., 193

#### K

k-means, 383
Kalina, J., 193
Karafiátová, I., 293
kernel density estimation, 155
kernel function, 111
Kiers, H. A. L., 121
Koshkarov, A., 383

#### L

L1-penalty, 313
López-Oriona, Á., 233
Labiod, L., 73, 203, 213
LaLonde, A., 223
leadership, 363
learning from data streams, 131
leave-one-out cross-validation, 176
Lee, H. K. H., 253
Love, T., 223
low-energy replacements, 193
LSA, 121

#### M

machine and deep learning, 323
machine learning, 83
Magopoulou, S., 93
Makarenkov, V., 83, 101
Markov chain Monte Carlo, 223
Markov decision process, 101
Masís, D., 243


matrix-variate, 393
Mayo-Iscar, A., 139
Mazoure, B., 83
measurement error, 53
Menafoglio, A., 333
Meng, R., 253
Migliorati, S., 35
minimum message length, 353
minorization-maximization, 263
mixed-mode official surveys, 53
mixed-type data, 43
mixture model, 35, 223, 313
mixture modelling, 139
mixture models, 393
mixture of regression models, 303
mixtures of regressions, 21
mobility data, 147
mode-based clustering, 155
model based clustering, 139
model selection, 353
model-based cluster analysis, 303
model-based clustering, 11, 21
Morelli, G., 139
motivational factors, 363
multidimensional scaling, 273
multivariate data analysis, 1
multivariate methods, 283
multivariate regression, 35
multivariate time series, 253

#### N

Nadif, M., 73, 203, 213
Nakanishi, E., 175
neighborhood graph, 1
network analysis, 63
networked data, 203
networks, 155
neural networks, 293
Nguyen, H. D., 263
noise component, 183
nonparametric statistics, 155
number of clusters, 183
numerical smoothing, 243

#### O

O2S2, 334
Obatake, M., 175
online algorithms, 263
optimized centroids, 193
outlier analysis, 373
Ovtcharova, J., 373

#### P

pair correlation function, 293
Palazzo, L., 273
Panagiotidou, G., 283
parallel computing, 101
parameter estimation, 263
parsimonious models, 393
Pawlasová, K., 293
Perrone, G., 303
phylogenetic trees, 383
Piasecki, P., 165
political behavior, 283
projection matrix, 176
Pronello, N., 313
proximity measure, 1

#### R

Ragozini, G., 147
random forest, 165
rare disease, 176
recommender systems, 83
reduced k-means, 121, 213
regional healthcare, 273
Reis, L. P., 323
religion, 283
representation learning, 203
reversible jump, 223
Riani, M., 139
robustness, 193
Rodrigues, D., 323
Romano, M., 403

#### S

Sakai, K., 175
Sangalli, L. M., 29, 333
Scimone, R., 333
Secchi, P., 333
seemingly unrelated regression, 303
Segura, E., 243
semiparametric regression with roughness penalty, 29
sensitivity and specificity, 176
silhouette width, 313
Silva, O., 343, 363
Silvestre, C., 353
similarity forest, 165
Smith, I., 11
social networks, 63
Soffritti, G., 303
Sousa, Á., 343, 363
sparsity, 193

spatial data analysis, 29
spatial downscaling, 334
spatial point patterns, 293
Spearman correlation coefficient, 343
spectral clustering, 111
spectral rotation, 203
split-merge procedures, 223
Spoor, J. M., 373
stochastic approximation, 263
stochastic optimization, 253
strongly connected components, 101
supervised classification, 293
supervised learning, 323
Suzuki, M., 175
symbolic data analysis, 63
symmetric difference metrics, 383
Szepannek, G., 43

#### T

Tahiri, N., 83, 383
tensor, 73
tertiary education, 147
three-way data, 111
Tighilt, R. A. S., 83
time series, 165, 273, 403
time series classification, 165
time-varying correlation, 253
Tomarchio, S. D., 393
topic modeling, 403
Trejos, J., 243

trimmed k-means, 273
trimming, 183
Twitter, 403

#### U

user tuning, 183

#### V

variational inference, 253
Vilar, J. A., 233
Vitale, M. P., 147
VL methodology, 343

#### W

Weber, J., 373
weighting methods, 53
welfare, 363
Wilhelm, A. F. X., 43
Wu, T., 223

#### X

Xavier, A., 243

#### Y

Young, D. R., 223

#### Z

Zammarchi, G., 403
Łuczak, T., 165