**Michael R. Berthold Ad Feelders Georg Krempl (Eds.)**

# **Advances in Intelligent Data Analysis XVIII**

**18th International Symposium on Intelligent Data Analysis, IDA 2020 Konstanz, Germany, April 27–29, 2020 Proceedings**

# Lecture Notes in Computer Science 12080

## Founding Editors

- Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Juris Hartmanis, Cornell University, Ithaca, NY, USA

### Editorial Board Members

- Elisa Bertino, Purdue University, West Lafayette, IN, USA
- Wen Gao, Peking University, Beijing, China
- Bernhard Steffen, TU Dortmund University, Dortmund, Germany
- Gerhard Woeginger, RWTH Aachen, Aachen, Germany
- Moti Yung, Columbia University, New York, NY, USA

More information about this series at http://www.springer.com/series/7409

Michael R. Berthold • Ad Feelders • Georg Krempl (Eds.)

# Advances in Intelligent Data Analysis XVIII

18th International Symposium on Intelligent Data Analysis, IDA 2020 Konstanz, Germany, April 27–29, 2020 Proceedings

Editors

- Michael R. Berthold, University of Konstanz, Konstanz, Germany
- Ad Feelders, Utrecht University, Utrecht, The Netherlands
- Georg Krempl, Utrecht University, Utrecht, The Netherlands

ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-44583-6 ISBN 978-3-030-44584-3 (eBook)
https://doi.org/10.1007/978-3-030-44584-3

LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI

© The Editor(s) (if applicable) and The Author(s) 2020. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

## Preface

We are proud to present the proceedings of the 18th International Symposium on Intelligent Data Analysis (IDA 2020), which was held during April 27–29, 2020, in Konstanz, Germany. The first symposium of this series was organized in 1995 and held biennially until 2009, when the conference switched to being held annually. Following demand expressed by the IDA community in a survey held in 2018, IDA 2020 was the first of the series to take place in spring rather than fall, as was common before.

The switch to April, and a more organized outreach to the community, coincided with an increase in the number of submissions from 65 in 2018 to 114 in 2020. After a rigorous review process, 45 of these 114 submissions were accepted for presentation. Almost all submissions were reviewed by at least three Program Committee (PC) members (only two papers had two reviews), and a substantial number of submissions received more than three reviews. In addition to the PC, the review process also involved program chair advisors – a select set of senior researchers with a multi-year involvement in the IDA symposium series. Whenever a program chair advisor flagged a paper with an informed, thoughtful, positive review because the paper presented a particularly interesting and novel idea, the paper was accepted irrespective of the other reviews. Each accepted paper was offered a slot for either oral presentation (15 papers) or poster presentation (30 papers).

We wish to express our gratitude to the authors of all submitted papers for their high-quality contributions; to the PC members and additional reviewers for their efforts in reviewing, discussing, and commenting on all submitted papers; to the program chair advisors for their active involvement; and to the IDA council for their ongoing guidance and support. Many people have helped behind the scenes to make IDA 2020 possible, but this year we are particularly grateful to our publicity chairs who helped spread the word: Daniela Gawehns and Hugo Manuel Proença!

February 2020

Georg Krempl
Ad Feelders
Michael R. Berthold

## Organization

## Program Chairs


## Program Chair Advisors

- Niall Adams, Imperial College London, UK
- Michael R. Berthold, University of Konstanz, Germany
- Tijl De Bie, Ghent University, Belgium
- Elisa Fromont, Université de Rennes 1, France
- Jaakko Hollmén, Aalto University, Finland
- Nada Lavrač, Jozef Stefan Institute, Slovenia
- Xiaohui Liu, Brunel University, UK
- Panagiotis Papapetrou, Stockholm University, Sweden
- Stephen Swift, Brunel University, UK
- Hannu Toivonen, University of Helsinki, Finland
- Allan Tucker, Brunel University, UK
- Albrecht Zimmermann, Université Caen Normandie, France

## Program Committee

- Fabrizio Angiulli, DEIS, University of Calabria, Italy
- Martin Atzmueller, Tilburg University, The Netherlands
- José Luis Balcázar, Universitat Politècnica de Catalunya, Spain
- Hendrik Blockeel, Katholieke Universiteit Leuven, Belgium
- Giacomo Boracchi, Politecnico di Milano, Italy
- Christian Borgelt, Universität Salzburg, Austria
- Henrik Boström, KTH Royal Institute of Technology, Sweden
- Elizabeth Bradley, University of Colorado Boulder, USA
- Paula Brito, University of Porto, Portugal
- Dariusz Brzezinski, Poznań University of Technology, Poland
- José Del Campo-Ávila, Universidad de Málaga, Spain
- Cassio de Campos, Eindhoven University of Technology, The Netherlands
- Andre de Carvalho, University of São Paulo, Brazil
- Paulo Cortez, University of Minho, Portugal
- Bruno Cremilleux, Université de Caen Normandie, France
- Brett Drury, LIAAD-INESC-TEC, Portugal
- Wouter Duivesteijn, Eindhoven University of Technology, The Netherlands
- Saso Dzeroski, Jozef Stefan Institute, Slovenia
- Nuno Escudeiro, Instituto Superior de Engenharia do Porto, Portugal
- Douglas Fisher, Vanderbilt University, USA
- Johannes Fürnkranz, Johannes Kepler University Linz, Austria
- Joao Gama, University of Porto, Portugal
- Lawrence Hall, University of South Florida, USA
- Barbara Hammer, Bielefeld University, Germany
- Martin Holena, Institute of Computer Science, Czech Republic
- Frank Höppner, Ostfalia University of Applied Sciences, Germany
- Tomas Horvath, Eötvös Loránd University, Hungary
- Francois Jacquenet, Laboratoire Hubert Curien, France
- Baptiste Jeudy, Laboratoire Hubert Curien, France
- Ulf Johansson, Jönköping University, Sweden
- Alipio M. Jorge, University of Porto, Portugal
- Frank Klawonn, Ostfalia University of Applied Sciences, Germany
- Arno Knobbe, Leiden University, The Netherlands
- Irena Koprinska, The University of Sydney, Australia
- Daniel Kottke, University of Kassel, Germany
- Petra Kralj Novak, Jozef Stefan Institute, Slovenia
- Rudolf Kruse, University of Magdeburg, Germany
- Mark Last, Ben-Gurion University of the Negev, Israel
- Niklas Lavesson, Jönköping University, Sweden
- Daniel Lawson, University of Bristol, UK
- Matthijs van Leeuwen, Leiden University, The Netherlands
- Jefrey Lijffijt, Ghent University, Belgium
- Ling Luo, The University of Melbourne, Australia
- George Magoulas, Birkbeck University of London, UK
- Vlado Menkovski, Eindhoven University of Technology, The Netherlands
- Vera Migueis, University of Porto, Portugal
- Decebal Constantin Mocanu, Eindhoven University of Technology, The Netherlands
- Emilie Morvant, University of Saint-Etienne, LaHC, France
- Mohamed Nadif, Paris Descartes University, France
- Siegfried Nijssen, Université Catholique de Louvain, Belgium
- Andreas Nuernberger, Otto-von-Guericke University of Magdeburg, Germany
- Kaustubh Raosaheb Patil, Massachusetts Institute of Technology, USA
- Mykola Pechenizkiy, Eindhoven University of Technology, The Netherlands
- Jose-Maria Pena, Universidad Politécnica de Madrid, Spain
- Ruggero G. Pensa, University of Torino, Italy
- Marc Plantevit, LIRIS, Université Claude Bernard Lyon 1, France
- Lubos Popelinsky, Masaryk University, Czech Republic
- Eric Postma, Tilburg University, The Netherlands
- Miguel A. Prada, Universidad de Leon, Spain
- Ronaldo Prati, Universidade Federal do ABC, UFABC, Brazil
- Peter van der Putten, Leiden University and Pegasystems, The Netherlands
- Jesse Read, École Polytechnique, France
- Antonio Salmeron, University of Almería, Spain
- Vítor Santos Costa, University of Porto, Portugal
- Christin Seifert, University of Twente, The Netherlands
- Roberta Siciliano, University of Naples Federico II, Italy
- Arno Siebes, Utrecht University, The Netherlands
- Jerzy Stefanowski, Poznań University of Technology, Poland
- Frank Takes, Leiden University and University of Amsterdam, The Netherlands
- Maguelonne Teisseire, Irstea, UMR Tetis, France
- Ljupco Todorovski, University of Ljubljana, Slovenia
- Melissa Turcotte, LANL, USA
- Cor Veenman, Netherlands Forensic Institute, The Netherlands
- Veronica Vinciotti, Brunel University, UK
- Filip Zelezny, Czech Technical University, Czech Republic
- Leishi Zhang, Middlesex University, UK

## Contents







# **Multivariate Time Series as Images: Imputation Using Convolutional Denoising Autoencoder**

Abdullah Al Safi, Christian Beyer(B), Vishnu Unnikrishnan, and Myra Spiliopoulou

Fakultät für Informatik, Otto-von-Guericke-Universität, Postfach 4120, 39106 Magdeburg, Germany
abdullah.safi@st.ovgu.de, {christian.beyer,vishnu.unnikrishnan,myra}@ovgu.de

**Abstract.** Missing data is a common occurrence in the time series domain, for instance due to faulty sensors, server downtime or patients not attending their scheduled appointments. One of the best methods to impute these missing values is *Multiple Imputation by Chained Equations (MICE)*, which has the drawback that it can only model linear relationships among the variables in a multivariate time series. The advancement of deep learning and its ability to model non-linear relationships among variables make it a promising candidate for time series imputation. This work proposes a modified Convolutional Denoising Autoencoder (CDA) based approach to impute multivariate time series data in combination with a preprocessing step that encodes time series data into 2D images using the Gramian Angular Summation Field (GASF). We compare our approach against a standard feed-forward Multi Layer Perceptron (MLP) and MICE. All our experiments were performed on 5 UEA MTSC multivariate time series datasets, where 20 to 50% of the data was simulated to be missing completely at random. The CDA model outperforms all the other models in 4 out of 5 datasets and is tied for the best algorithm in the remaining case.

**Keywords:** Convolutional Denoising Autoencoder · Gramian Angular Summation Field · MICE · MLP · Imputation · Time series

## **1 Introduction**

Time series data resides in various domains of industry and research and is often corrupted with missing data. For further use or analysis, the data often needs to be complete, which gives rise to the need for imputation techniques that introduce the least possible error into the data. One of the most prominent imputation methods is MICE, which uses iterative regression and value replacement to achieve state-of-the-art imputation quality but has the drawback that it can only model linear relationships among variables (dimensions).

In the past few years, deep learning architectures have broken into many problem domains, often exceeding the performance previously achieved by other algorithms [7]. Areas like speech recognition, natural language processing and computer vision were greatly impacted and improved by deep learning architectures. Deep learning models have a robust capability of modelling latent representations of the data and non-linear patterns, given enough training data. Hence, this work presents a deep learning based imputation model called Convolutional Denoising Autoencoder (CDA) with altered convolution and pooling operations in the Encoder and Decoder segments. Instead of the traditional order of convolution and pooling, we use deconvolution and upsampling, inspired by [5]. The time series to image transformation mechanisms proposed in [12] and [13] were adopted as a preprocessing step, as CDA models are typically designed for images. As rival imputation models, Multiple Imputation by Chained Equations (MICE) and a Multi Layer Perceptron (MLP) based imputation were incorporated.

## **2 Related Work**

Three distinct types of missingness in data were identified in [8]. The first one is *Missing Completely At Random (MCAR)*, where the missingness of the data does not depend on itself or any other variables. In *Missing At Random (MAR)* the missing value depends on other variables but not on the variable where the data is actually missing and in *Missing Not At Random (MNAR)* the missingness of an observation depends on the concerned variable itself. All the experiments in this study were carried out on MCAR missingness as reproducing MAR and MNAR missingness can be challenging and hard to distinguish [5].

Multiple Imputation by Chained Equations (MICE) has secured its place as a principal method for imputing missing data [1]. Costa et al. [3] showed experimentally that MICE offered better imputation quality than a Denoising Autoencoder based model for several missing percentages and missing types.

A novel approach was proposed in [14], incorporating Generative Adversarial Networks (GAN) to perform imputation; the authors named it Generative Adversarial Imputation Nets (GAIN). The approach imputed well compared to several state-of-the-art imputation methods, including MICE. An Autoencoder based approach was proposed in [4] and compared against an Artificial Neural Network (NN) model on the MCAR missing type and several missing percentages; the proposed model performed well against the NN. A novel Denoising Autoencoder based imputation using partial loss (DAPL) approach was presented in [9], where different missing data percentages and the MCAR missing type were simulated in a breast cancer dataset. The comparison incorporated statistical and machine learning based approaches as well as a standard Denoising Autoencoder (DAE) model, and DAPL outperformed DAE and all the other models. An MLP based imputation approach was presented for MCAR missingness in [10] and also outperformed other statistical models. A Convolutional Denoising Autoencoder model which did not impute missing data but denoised audio signals was presented in [15]. A Denoising Autoencoder with more units in the encoder layer than in the input layer was presented in [5] and achieved good imputation results against MICE. Our work was inspired by both of these works, which is why we combined the two approaches into a Convolutional Denoising Autoencoder that maps input data into a higher-dimensional subspace in the Encoder.

## **3 Methodology**

In this section, we first describe how we introduce missing data into our datasets; then we show the process used to turn multivariate time series into images, which is required by one of our imputation methods; finally, we introduce the imputation methods that were compared in this study.

#### **3.1 Simulating Missing Data**

Simulating missing data is a mechanism of artificially introducing unobserved data into a complete time series dataset. Our experiments incorporated 20%, 30%, 40% and 50% missing data, and the missing type was MCAR. Introducing MCAR missingness is quite simple, as it does not depend on observed or unobserved data. Many studies assume the MCAR missing type when there is no concrete evidence about the type of missingness [6]. In our experimental framework, values at randomly selected indices were erased from randomly selected variables, which simulated MCAR missingness at different percentages.
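For illustration, a minimal sketch of this corruption step; the function name and the (time, variable) array layout are our own choices, not taken from the paper's code:

```python
import numpy as np

def simulate_mcar(data, missing_rate, rng=None):
    """Erase values completely at random from a (time, variable) array.

    Returns a copy with NaNs at the erased positions, so the original
    array can serve as ground truth during evaluation.
    """
    rng = np.random.default_rng(rng)
    corrupted = data.astype(float).copy()
    # Draw a Boolean mask over all (timepoint, variable) cells.
    mask = rng.random(corrupted.shape) < missing_rate
    corrupted[mask] = np.nan
    return corrupted

# Example: a 3-variable series of length 100 with 30% MCAR missingness.
series = np.random.rand(100, 3)
incomplete = simulate_mcar(series, missing_rate=0.3, rng=42)
```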

#### **3.2 Translating Time Series into Images**

A novel approach for encoding time series data into various types of images using the Gramian Angular Field (GAF) was presented in [12] to improve classification and imputation. One of the variants of GAF is the Gramian Angular Summation Field (GASF), which comprises multiple steps to perform the encoding. First, the time series is scaled to the [−1, 1] range.

$$x\_i' = \frac{(x\_i - Max(X)) + (x\_i - Min(X))}{Max(X) - Min(X)} \tag{1}$$

Here, $x\_i$ is the value at time point $i$, $x'\_i$ is its scaled version, and $X$ is the time series. The time series is scaled to the $[-1, 1]$ range so that it can be represented in polar coordinates, obtained by applying the angular cosine.

$$\theta\_i = \arccos(x\_i') \quad \{-1 \le x\_i' \le 1, \; x\_i' \in X\} \tag{2}$$

The polar encoded time series vector is then transformed into a matrix. If the length of the time series vector is $n$, then the transformed matrix is of shape $(n \times n)$.

$$GASF\_{i,j} = \cos(\theta\_i + \theta\_j) \tag{3}$$

The GASF represents the temporal features in the form of an image where the timestamps move from top-left to bottom-right, thereby preserving the time factor in the data. Figure 1 shows the different steps of the time series to image transformation.

**Fig. 1.** Time series to image transformation

The methods of encoding time series into images described in [12] are only applicable to univariate time series. The GASF transformation generates one image per time series dimension, and thus it is possible to generate multiple images for a multivariate time series. An approach which vertically stacks the images transformed from the different variables was presented in [13], see Fig. 2. The images were grayscaled, and different orders of vertical stacking (ascending, descending and random) were examined by performing a statistical test. The stacking order did not impact classification accuracy.

**Fig. 2.** Vertical stacking of images transformed from different variables
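The transformation pipeline of this subsection can be sketched as follows; the clipping guard and the fixed stacking order are our own simplifications of Eqs. 1–3 and of the stacking scheme of [13]:

```python
import numpy as np

def gasf(x):
    """Encode one univariate series as a GASF image (Eqs. 1-3)."""
    # Eq. 1: rescale to [-1, 1].
    x_scaled = ((x - x.max()) + (x - x.min())) / (x.max() - x.min())
    # Eq. 2: polar encoding via the angular cosine (clip guards float error).
    theta = np.arccos(np.clip(x_scaled, -1.0, 1.0))
    # Eq. 3: GASF(i, j) = cos(theta_i + theta_j), an (n x n) matrix.
    return np.cos(theta[:, None] + theta[None, :])

def multivariate_gasf(series):
    """Vertically stack the GASF images of each variable, as in [13]."""
    return np.vstack([gasf(series[:, d]) for d in range(series.shape[1])])

image = multivariate_gasf(np.random.rand(50, 3))  # shape (150, 50)
```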

#### **3.3 Convolutional Denoising Autoencoder**

The Autoencoder is a very popular unsupervised deep learning model, frequently found in different application areas. It is trained in an unsupervised fashion and reconstructs the original input by discovering robust features in the hidden layer representation. The latent representation of high dimensional data in the hidden layer contributes to reconstructing the original data. The architecture of an Autoencoder consists of two principal segments named Encoder and Decoder. The Encoder usually compresses the original representation of the data into a lower-dimensional representation. The Decoder decodes this low dimensional representation back into the original dimensional representation.

$$Encoder(x^n) = s(x^n W\_E + b\_E) = x^d \tag{4}$$

$$Decoder(x^d) = s(x^d W\_D + b\_D) = x^n \tag{5}$$

Here, $x^n$ is the original input with $n$ dimensions, $s$ is a non-linear activation function, $W$ is a weight matrix, and $b$ is a bias vector.

The Denoising Autoencoder is an extension of the Autoencoder, where the input is reconstructed from a corrupted version of itself. There are different ways of adding corruption, such as Gaussian noise or setting some values to zero. The corrupted version is fed as input, and the model minimizes the loss between the clean input and the reconstruction of the corrupted input. The objective function looks as follows:

$$RMSE(X, X') = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( X\_{clean,i} - X'\_{reconstructed,i} \right)^2} \tag{6}$$

The Convolutional Denoising Autoencoder (CDA) incorporates the convolution operation that is typically performed in Convolutional Neural Networks (CNN). A CNN is a model in which layers of perceptrons are replaced by convolution layers, and the convolution operation is performed on the data. Convolution is an operation between two functions over a finite or infinite range, the two functions here being the input data (e.g., an image) and a fixed-size kernel. The kernel traverses the input space to generate feature maps, which consist of important features of the data. The feature maps are then pooled, preserving the most important features.

The combination of convolutional feature map generation and pooling is performed in the Encoder of a CDA, where the corrupted version of the input is fed into the input layer of the network. The Decoder performs deconvolution and upsampling, which decompresses the output of the Encoder back into the shape of the input data. The loss between the reconstructed data and the clean data is minimized. In this work, the default architecture of the CDA is tweaked in favor of imputing multivariate time series data: deconvolution and upsampling are performed in the Encoder, and convolution and maxpooling are performed in the Decoder. The motivation behind this specific tweak came from [5], where a Denoising Autoencoder was designed with more hidden units in the Encoder layer than in the input layer. The high dimensional representation in the Encoder layer created additional features that contributed to data recovery.

#### **3.4 Competitor Models**

*Multiple Imputation by Chained Equations (MICE):* MICE, sometimes referred to as fully conditional specification or sequential regression multiple imputation, has emerged in the statistical literature as the principal method of addressing missing data [1]. MICE creates multiple versions of the imputed dataset through its multiple imputation technique.

The steps for performing MICE are the following:

1. Replace every missing value with a simple placeholder, such as the mean of the variable.
2. Set the placeholders of one variable back to missing and regress that variable on the other variables, using only the observed cases.
3. Replace the missing values of that variable with the predictions of the regression model.
4. Repeat steps 2–3 for every variable with missing data; one pass over all variables constitutes a cycle.
5. Repeat the cycles for a number of iterations, so that the imputed values stabilize.

This is the standard chained-equations procedure described in [1].
According to our experimental setup, MICE had three different regression supports, namely Linear, Ridge and Lasso regression.
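The paper does not name a MICE implementation; a comparable chained-equations setup can be sketched with scikit-learn's IterativeImputer (a single-imputation variant of MICE), swapping in the three regression supports:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso

def mice_impute(incomplete, estimator, n_iterations=10, seed=0):
    """Chained-equations imputation with a pluggable regressor."""
    imputer = IterativeImputer(estimator=estimator,
                               max_iter=n_iterations,
                               random_state=seed)
    return imputer.fit_transform(incomplete)

# Toy example with two partially observed variables.
X = np.array([[1.0, 2.0], [2.0, np.nan], [np.nan, 6.0], [4.0, 8.0]])
for est in (LinearRegression(), Ridge(), Lasso()):
    X_imputed = mice_impute(X, est)
```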

*Multi Layer Perceptron (MLP) Based Imputation:* The imputation mechanism of the MLP is inspired by the MICE algorithm. Nevertheless, MLP based imputation models do not perform the chained or multiple imputations like MICE, but improve the quality of imputation over several epochs as stochastic gradient descent optimizes the weights and biases. A concrete MLP architecture was described in the literature [10]: a three layered MLP with the hyperbolic tangent activation function in the hidden layer and the identity (linear) function as the activation function of the output layer. The train and test splits were handled slightly differently, in that both the training set and the test set consisted of observed and unobserved data.

The imputation process of the MLP model in our work is similar to MICE, but the non-linear activation function of the MLP facilitates finding complex non-linear patterns. However, the imputation of a variable is performed only once, in contrast to the multiple iterations in MICE.

## **4 Experiments**

In this section we present the datasets used, the preprocessing steps conducted before training, the chosen hyperparameters, and our evaluation method. Our complete imputation process for the CDA model is depicted in Fig. 3. The process for the competitors is the same, except that corrupting the training data and turning the time series into images are omitted.

**Fig. 3.** Experiment steps for the CDA model

#### **4.1 Datasets and Data Preprocessing**

Our experiments were conducted on 5 time series datasets from the UEA MTSC repository [2]. Each dataset in the UEA time series archive has training and test splits and a specific number of dimensions, and each instance in a split is a time series. The table below presents all the relevant structural details (Table 1).


**Table 1.** A structural summary of the 5 UEA MTSC dataset

The Length column of the table denotes the length of each time series. In our framework, each time series was transformed into images. The number of time series in each dataset was not very high. As we had selected a deep learning model for imputation, such a low number of samples could cause overfitting, and experiments showed that the default number of time series did not perform well. Therefore, the main idea was to increase the number of time series by splitting them into multiple parts and reducing their corresponding lengths. This modification introduced more patterns for learning, which aided imputation. The final lengths chosen were those that yielded the best results. The table below presents the modified number of time series and lengths for each dataset (Table 2).


**Table 2.** Modified number of time series and lengths

The evaluation of the imputation models requires a complete dataset and the corresponding incomplete dataset. Therefore, artificial missingness was introduced at different percentages (20%, 30%, 40% and 50%) into all the datasets. After simulating artificial missingness, each dataset has an observed part, which contains all the time series segments where no variables are missing, and an unobserved part, where at least one variable is missing; the observed part was further processed for training. As CDA models learn denoising from a corrupted version of the input, we introduced noise by discarding a certain number of values of specific variables for each observed case and replacing them by the mean of the corresponding variables. A higher amount of noise has been seen to contribute more to learning the dependencies between different variables, which leads to denoising of good quality [11]. The variables selected for adding noise were the same variables that have missing data in the unobserved part. Different amounts of noise were examined, and 90% noise led to good results. The unobserved data was also mean imputed, as the CDA model applies the denoising technique to this "mean-noise" for imputation. So the CDA learns to deal with "mean-noise" on the observed part and is then applied to the mean imputed unobserved part to create the final imputation.

The next step was the time series to image transformation, where all the observed and unobserved chunks were rescaled to between −1 and 1 using min-max scaling. The rescaled data was further transformed into polar coordinates, and then a GASF encoded image was obtained for each dimension. The multiple images corresponding to the multiple variables were vertically aggregated. Finally, both the observed and unobserved splits consisted of their own sets of images.

Note that the data preprocessing described above was performed only for the CDA based imputation model. The competitor models imputed using the raw format of the data.

#### **4.2 Model Architecture and Hyperparameters**

Our model architecture differs from a general CDA in that the Encoder incorporates Deconvolution and Upsampling operations and the Decoder incorporates Convolution and Maxpooling operations. The Encoder and Decoder both have 3 layers. The table below shows the structure of the imputation model (Table 3).


**Table 3.** The architecture of CDA based imputation model

Hyperparameters were specified by performing random search over different random combinations of hyperparameter values, and the root mean square error (RMSE) was used to decide on the best combination. Random search allowed us to avoid the exhaustive search that grid search requires. Applying random search, we selected stochastic gradient descent (SGD) as the optimizer, which backpropagates the error to optimize the weights and biases. The number of epochs was 100 and the batch size was 16.
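A hedged PyTorch sketch of this tweaked architecture: the layer counts and the SGD/L2 settings follow the text, but the channel numbers, kernel sizes and learning rate are illustrative assumptions, since Table 3 is not reproduced here.

```python
import torch
import torch.nn as nn

class CDAImputer(nn.Module):
    """Tweaked CDA: the Encoder expands the input with transposed
    convolutions and upsampling; the Decoder compresses it back with
    convolutions and maxpooling, so the output matches the input shape."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.ConvTranspose2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CDAImputer()
# SGD with L2 regularization (weight 0.01, Sect. 4.4); lr is our guess.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
loss_fn = nn.MSELoss()  # trained for 100 epochs with batch size 16
```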

#### **4.3 Competitor Model's Architecture and Hyperparameters**

As competitor models, MICE and MLP based imputation models were selected. The MLP based model had 3 hidden layers, and the number of hidden units in each layer was 2/3 of the number of input units. The hyperparameters of both models were tuned using random search.

The hyperbolic tangent was selected as the activation function, with a dropout rate of 0.3. Stochastic gradient descent served as the optimizer, running for 150 epochs with a batch size of 20.

MICE based imputation was run with Linear, Ridge and Lasso regression, and 10 iterations were performed for each of them.

#### **4.4 Training**

Training was performed on the preprocessed data with the model architecture described above. L2 regularization was used with a weight of 0.01, and stochastic gradient descent was used as the optimizer, which outperformed the Adam and Adagrad optimizers. The whole training process was about learning to minimize the loss between the clean and the corrupted data, so that the model can be applied to the unobserved data (noisy data after mean imputation) to perform imputation. The training and validation split was 70% and 30%. The training and validation loss saturated after approximately 10–15 epochs in most cases.

The training was conducted on a machine with an Nvidia RTX 2060 GPU and 16 GB of RAM. The programming language for the training and all the steps above was Python 3.7, and the operating system was Ubuntu 16.04 LTS.

#### **4.5 Evaluation Criteria**

As all the time series datasets contain continuous numeric values, the Root Mean Square Error (RMSE) was selected for evaluation. In our experimental setup, RMSE is not calculated over the whole time series; only the missing data points are compared with the ground truth:

$$RMSE = \sqrt{\frac{1}{m} \sum\_{i \in I} (x\_i - \hat{x}\_i)^2}$$

where $m$ is the total number of missing time points and $I$ is the set of indices of missing values across the time series.
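This criterion translates directly into code; a small sketch, assuming the Boolean missing mask from the simulation step has been kept:

```python
import numpy as np

def rmse_missing(ground_truth, imputed, missing_mask):
    """RMSE computed only over the imputed (previously missing) cells."""
    diff = ground_truth[missing_mask] - imputed[missing_mask]
    return np.sqrt(np.mean(diff ** 2))
```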

## **5 Results**

Our proposed CDA based imputation model was compared with the MLP model and three different versions of MICE, each using a different type of regression. Figure 4 presents the RMSE values for 20%, 30%, 40% and 50% missingness.

**Fig. 4.** RMSE plots for different missing proportions

The RMSE values of the CDA based model are the lowest at every percentage of missingness on the *Handwriting*, *ArticularyWordRecognition*, *UWaveGestureLibrary* and *Cricket* datasets. The depiction of the results on the *Cricket* dataset is omitted due to space limitations. Unexpectedly, on the *StandWalkJump* dataset the performance of the MLP and CDA models is very similar, and the MLP is even better at 30% missingness. MICE (Linear) and MICE (Ridge) produce identical imputations on all the datasets. MICE (Lasso) performed worst of all the models, which implies that changing the regression type can have an impact on imputation quality. The MLP model beat all the MICE models but was outperformed by the CDA model in at least 80% of the cases.

## **6 Conclusion**

In this work, we introduce an architecture of a Convolutional Denoising Autoencoder (CDA) adapted for multivariate time series imputation, which inflates the size of the hidden layers in the Encoder instead of reducing them. We also employ a preprocessing step that turns the time series into 2D images based on Gramian Angular Summation Fields in order to make the data more suitable for our CDA. We compare our method against a standard Multi Layer Perceptron (MLP) and the state-of-the-art imputation method Multiple Imputation by Chained Equations (MICE) with three different types of regression (Linear, Ridge and Lasso). Our experiments were conducted on five different multivariate time series datasets, for which we simulated 20%, 30%, 40% and 50% missingness with data missing completely at random. Our results show that the CDA based imputation outperforms MICE on all five datasets and also beats the MLP on four datasets. On the fifth dataset, CDA and MLP perform very similarly, but CDA is still better on four out of the five degrees of missingness. Additionally, we presented a preprocessing step that manipulates the time series lengths to generate more training samples for our model, which led to better performance. The results show that the CDA model performs strongly against both linear and non-linear regression based imputation models. Deep learning networks are usually computationally more intensive than MICE, but the imputation quality of the CDA was convincing enough for it to be chosen over MICE or MLP based imputation.

In the future, we plan to investigate other types of missing data apart from *Missing Completely At Random (MCAR)*, and we want to incorporate more datasets as well as other deep learning based approaches to imputation.

**Acknowledgments.** This work is partially funded by the German Research Foundation, project OSCAR "Opinion Stream Classification with Ensembles and Active Learners". The principal investigators of OSCAR are Myra Spiliopoulou and Eirini Ntoutsi. Additionally, Christian Beyer is also partially funded by a PhD grant from the federal state of Saxony-Anhalt.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Dual Sequential Variational Autoencoders for Fraud Detection**

Ayman Alazizi<sup>1,2</sup>(B), Amaury Habrard<sup>1</sup>, François Jacquenet<sup>1</sup>, Liyun He-Guelton<sup>2</sup>, and Frédéric Oblé<sup>2</sup>

<sup>1</sup> Univ. Lyon, Univ. St-Etienne, UMR CNRS 5516, Laboratoire Hubert-Curien, 42000 Saint-Etienne, France {ayman.alazizi,amaury.habrard,francois.jacquenet}@univ-st-etienne.fr <sup>2</sup> Worldline, 95870 Bezons, France {ayman.alazizi,liyun.he-guelton,frederic.oble}@worldline.com

**Abstract.** Fraud detection is an important research area where machine learning has a significant role to play. An important task in that context, on which the quality of the results depends, is feature engineering. Unfortunately, this is very time-consuming and labor-intensive. Thus, in this article, we present DuSVAE, a generative model that takes into account the sequential nature of the data. It combines two variational autoencoders that can generate a condensed representation of the input sequential data, which can then be processed by a classifier to label each new sequence as fraudulent or genuine. The experiments we carried out on a large real-world dataset from the Worldline company demonstrate the ability of our system to better detect frauds in credit card transactions without any feature engineering effort.

**Keywords:** Anomaly detection · Fraud detection · Sequential data · Variational autoencoder

## **1 Introduction**

An anomaly (also called outlier, change, deviation, surprise, peculiarity, intrusion, etc.) is a pattern in a dataset that does not conform to expected behavior. Thus, anomaly detection is the process of finding anomalies in a dataset [4]. Fraud detection, a subdomain of anomaly detection, is a research area where the use of machine learning can have a significant financial impact for companies suffering from large frauds, and it is not surprising that a very large amount of research has been conducted over many years in that field [1].

At the Worldline company, we process billions of electronic transactions per year in our highly secured data centers. It is obvious that detecting frauds in that context is a very difficult task. For many years, the detection of credit card frauds within Worldline has been based on a set of rules manually designed by experts. Nevertheless, such rules are difficult to maintain, difficult to transfer to other business lines, and dependent on experts who need a very long training period. The contribution of machine learning in this context seems obvious, and Worldline decided several years ago to develop research in this field.

Firstly, Worldline put a lot of effort into feature engineering [3,9,12] to develop discriminative handcrafted features. This drastically improved the supervised learning of classifiers that aim to label card transactions as genuine or fraudulent. Nevertheless, designing such features requires a huge amount of time and human resources, which is very costly. Thus, developing automatic feature engineering methods becomes a critical issue to improve the efficiency of our models. However, in our industrial setting, we have to face many issues, among which is the presence of highly imbalanced data, where the fraud ratio is about 0.3%. For this reason, we first focused on classic unsupervised approaches in anomaly detection, where the objective is to learn a model from normal data and then isolate non-compliant samples and consider them as anomalies [5,17,19,21,22].

In this context, the deep autoencoder [7] is considered a powerful data modeling tool in the unsupervised setting. An autoencoder (AE) is made up of two parts: an encoder designed to generate a compressed coding from the training input data, and a decoder that reconstructs the original input from the compressed coding. In the context of anomaly detection [6,20,22], an autoencoder is generally trained by minimizing the reconstruction error only on normal data, and the reconstruction error is afterwards used as an anomaly score. This assumes that the reconstruction error for normal data should be small, as such data is close to the training data, while the reconstruction error for abnormal data should be high.

However, this assumption is not always valid. Indeed, it has been observed that sometimes the autoencoder generalizes so well that it can also reconstruct anomalies, which leads to viewing some anomalies as normal data. This can also happen when some abnormal data share characteristics with normal data in the training set, or when the decoder is "too powerful" to properly decode abnormal codings. To address these shortcomings of autoencoders, [13,18] proposed the *negative learning technique*, which aims to control the compressing capacity of an autoencoder by optimizing conflicting objectives on normal and abnormal data. Thus, this approach looks for a solution in the gradient direction for the desired normal input and in the opposite direction for the undesired input.

This approach could be very appealing for fraud detection problems, but we found that it is sometimes not sufficient in the context of our data. Indeed, it is generally almost impossible to obtain in advance a dataset containing all representative frauds, especially when unknown fraudulent transactions occur on new terminals or via new fraudulent behaviors. This has led us to consider more complex models, namely variational autoencoders (VAE), a probabilistic generative extension of AEs able to model complex generative distributions, which we found better adapted to efficiently model new possible frauds.

Another important point for credit card fraud detection is the sequential aspect of the data. Indeed, to test a card, a fraudster may try to make several (small) transactions in a short time interval, or directly perform an abnormally high transaction with respect to the existing transactions of the true card holder. This sequential aspect has been addressed either indirectly via aggregated features [3], which we would like to avoid designing, or directly by sequential models such as LSTMs, although [9] reports that the LSTM did not much improve detection performance for e-commerce transactions. One of the main contributions of this paper is a method to identify fraudulent sequences of credit card transactions in the context of highly imbalanced data. For this purpose, we propose a model called DuSVAE, for Dual Sequential Variational Autoencoders, that consists of a combination of two variational autoencoders. The first one is trained on fraudulent sequences of transactions in order to project the input data into another feature space and to assign a fraud score to each sequence thanks to the reconstruction error information. Once this model is trained, we plug a second VAE at the output of the first one. This second VAE is then trained with a negative learning approach, with the objective of maximizing the reconstruction error of the fraudulent sequences and minimizing the reconstruction error of the genuine ones.

Our method has been evaluated on a Worldline dataset for credit card fraud detection. The results show that DuSVAE can extract hidden representations that provide results close to those obtained after significant feature engineering work, therefore saving time and human effort. It is even possible to improve the results by combining engineered features with DuSVAE.

The article is organized as follows: some preliminaries about the techniques used in this work are given in Sect. 2. Then we describe the architecture and the training strategy of the DuSVAE method in Sect. 3. Experiments are presented in Sect. 4, after a presentation of the dataset and the metrics used. Finally, Sect. 5 concludes this article.

## **2 Preliminaries**

In this section, we briefly describe the main techniques that are used in DuSVAE: vanilla and variational autoencoders, negative learning and mixture of experts.

### **2.1 Autoencoder (AE)**

An AE is a neural network [7] optimized in an unsupervised manner, usually used to reduce the dimensionality of the input data. It is made up of two parts linked together: an encoder $E(x)$ and a decoder $\mathcal{D}(z)$. Given an input sample $x$, the encoder generates $z$, a condensed representation of $x$. The decoder is then tuned to reconstruct the original input $x$ from the encoded representation $z$. The objective function used during the training of the AE is given by:

$$\mathcal{L}\_{AE}(x) = \|x - \mathcal{D}(E(x))\|\tag{1}$$

where $\|\cdot\|$ denotes an arbitrary distance function; the $\ell\_2$ norm is typically applied here. The AE can be optimized, for example, using stochastic gradient descent (SGD) [10].

### **2.2 Variational Autoencoder (VAE)**

A VAE [11,16] is an attractive probabilistic generative version of the standard autoencoder. It can learn a complex distribution and then use it as a generative model defined by a prior $p(z)$ and a conditional distribution $p\_\theta(x|z)$. Because the true likelihood of the data is generally intractable, a VAE is trained by maximizing the evidence lower bound (ELBO):

$$\mathcal{L}(x;\theta,\phi) = \mathbb{E}\_{q\_{\phi}(z|x)}\left[\log p\_{\theta}(x|z)\right] - D\_{\text{KL}}\left(q\_{\phi}(z|x) \| p(z)\right) \tag{2}$$

where the first term $\mathbb{E}\_{q\_\phi(z|x)}[\log p\_\theta(x|z)]$ is a negative reconstruction loss that enforces $q\_\phi(z|x)$ (the encoder) to generate a meaningful latent vector $z$, so that $p\_\theta(x|z)$ (the decoder) can reconstruct the input $x$ from $z$. The second term $D\_{\mathrm{KL}}(q\_\phi(z|x) \| p(z))$ is a KL regularization loss that minimizes the KL divergence between the approximate posterior $q\_\phi(z|x)$ and the prior $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.
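As an illustration, a common PyTorch rendering of the negative ELBO with a Gaussian posterior and an MSE reconstruction term; the reparameterization step that produces mu and log_var is omitted, and this is our sketch, not the paper's code:

```python
import torch

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO of Eq. 2: reconstruction term plus the closed-form
    KL divergence between N(mu, sigma) and the prior N(0, I)."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```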

### **2.3 Negative Learning**

Negative learning is a technique used for regularizing the training of an AE in the presence of labelled data by limiting its reconstruction capability (LRC) [13]. The basic idea is to maximize the reconstruction error for abnormal instances while minimizing the reconstruction error for normal ones, in order to improve the discriminative ability of the AE. Let $x \in \mathbb{R}^n$ be an input instance and $y \in \{0, 1\}$ its associated label, where $y = 1$ stands for a fraudulent instance and $y = 0$ for a genuine one. The objective function of LRC to be minimized is:

$$(1 - y)\mathcal{L}\_{AE}(x) - (y)\mathcal{L}\_{AE}(x)\tag{3}$$

Training LRC-based models has the major disadvantage of being generally unstable, due to the fact that the anomaly reconstruction error is not upper bounded. The LRC approach then tends to maximize the reconstruction error for known anomalies rather than minimizing the reconstruction error for normal points, leading to a bad reconstruction of normal data points. To overcome this problem, [18] proposed Autoencoding Binary Classifiers (ABC) for supervised anomaly detection, which improves on LRC by using an objective function based on a bounded reconstruction loss for anomalies, leading to better training stability. The objective function of ABC to be minimized is:

$$(1 - y)\mathcal{L}\_{AE}(x) - y\log\_2(1 - e^{-\mathcal{L}\_{AE}(x)})\tag{4}$$
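A minimal sketch of the ABC objective of Eq. 4, assuming per-sample reconstruction errors have already been computed and labels are float tensors; the clamping constant is our own guard against log(0):

```python
import torch

def abc_loss(rec_error, y):
    """Eq. 4: plain reconstruction loss for genuine samples (y=0),
    bounded negative-learning term for fraudulent ones (y=1).

    rec_error: per-sample L_AE(x); y: float labels in {0, 1}.
    """
    eps = 1e-8  # guards log2(0) when the reconstruction error vanishes
    bounded = torch.log2(torch.clamp(1.0 - torch.exp(-rec_error), min=eps))
    return torch.mean((1.0 - y) * rec_error - y * bounded)
```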

### **2.4 Mixture-of-Experts Layer (MoE)**

In addition to the previous methods, we now present the notion of MoE layer [8] that will be used in our model.

**Fig. 1.** An illustration of the MoE layer architecture

The MoE layer aims to combine the outputs of a group of $n$ neural networks called experts $EX\_1, EX\_2, \ldots, EX\_n$. The experts have their own specific parameters but work on the same input; their $n$ outputs are combined linearly with the outputs of a gating network $G$, which weights the experts according to the input $x$. See Fig. 1 for an illustration. Let $E\_i(x)$ be the output of expert $EX\_i$, and $G(x)\_i$ be the $i$-th attribute of $G(x)$; then the output $y$ of the MoE is defined as follows:

$$y = \sum\_{i=1}^{n} G(x)\_i EX\_i(x). \tag{5}$$

The intuition behind MoE layers is to train different expert networks that can focus on specific peculiarities of the data and then choose an appropriate combination of experts with respect to the input $x$. In our industrial context, such a layer helps us take into account different behaviors from millions of cardholders, which result in a variety of data distributions. The different expert networks can thus model the various behaviors observed in the dataset and be combined adequately as a function of the input data.
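A compact PyTorch sketch of such a layer implementing Eq. 5; the softmax gating and linear experts are our assumptions, since the paper only states that the experts are one-layer feed-forward networks:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Eq. 5: y = sum_i G(x)_i * EX_i(x), with a softmax gating network."""
    def __init__(self, in_dim, out_dim, n_experts):
        super().__init__()
        # One-layer feed-forward experts, all fed with the same input x.
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)            # G(x)
        outputs = torch.stack([ex(x) for ex in self.experts], dim=-1)
        # Weighted sum over the expert dimension.
        return (outputs * weights.unsqueeze(-2)).sum(dim=-1)
```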

## **3 The DuSVAE Model**

In this section, we present our approach to extract a hidden representation of input sequences to be used for anomaly/fraud detection. We first introduce the model architecture with the loss functions used, then we describe the learning procedure used to train the model.

### **3.1 Model Architecture**

We assume in the following that we are given as input a set of sequences $X = \{x \mid x = (t^1, t^2, \ldots, t^m) \text{ with } t^i \in \mathbb{R}^d\}$, every sequence being composed of $m$ transactions encoded by numerical vectors. Each sequence is associated with a label $y \in \{0, 1\}$ such that $y = 1$ indicates a fraudulent sequence and $y = 0$ a genuine one. We label a sequence as fraudulent if its last transaction is a fraud.

**Fig. 2.** The DuSVAE model architecture

As illustrated in Fig. 2, our approach consists of two sequential variational autoencoders. The first one is trained only on fraudulent sequences of the training data. We use the generative capacity of this autoencoder to generate diverse and representative fraudulent instances with respect to the sequences given as input. This autoencoder has the objective to prepare the data for the second autoencoder and also to provide a first anomaly/fraud score via the reconstruction error.

The first layers of the autoencoders are bi-directional GRU layers allowing us to handle sequential data. The remaining parts of the encoder and the decoder contain GRU and fully connected (FC) layers, as shown in Fig. 2. The loss function used to optimize the reconstruction error of the first autoencoder is defined as follows:

$$\mathcal{L}\_{rec}(x, \phi\_1, \theta\_1) = mse(x, \mathcal{D}\_{\theta\_1}(E\_{\phi\_1}(x))) + D\_{\text{KL}}\left(q\_{\phi\_1}(z|x) \| p(z)\right),\tag{6}$$

where $mse$ is the mean square error function and $p(z) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. The encoder $E\_{\phi\_1}(x)$ generates a latent representation $z$ according to $q\_{\phi\_1}(z|x) = \mathcal{N}(\mu\_1, \sigma\_1)$. The decoder $\mathcal{D}\_{\theta\_1}$ tries to reconstruct the input sequence from $z$. In order to avoid mode collapse between the reconstructed transactions of the sequence, we add the following loss function to control the reconstruction of individual transactions with respect to relative distances within an input sequence $x$:

$$\mathcal{L}\_{trx}(x,\phi\_1,\theta\_1) = \sum\_{i=1}^{m} \sum\_{j=i+1}^{m} \frac{1}{d} \|abs(t^i - t^j) - abs(\overline{t}^i - \overline{t}^j)\|\_1 \tag{7}$$

where $\overline{t}^i$ is the reconstruction obtained by the AE for the $i$-th transaction of the sequence, and $abs(t)$ returns a vector whose features are the absolute values of the original input vector $t$.

So, we train the parameters (φ1, θ1) of the first autoencoder by minimizing the following loss function over all the fraudulent sequences of the training samples:

$$\mathcal{L}\_1(x, \phi\_1, \theta\_1) = \mathcal{L}\_{rec}(x, \phi\_1, \theta\_1) + \lambda \mathcal{L}\_{trx}(x, \phi\_1, \theta\_1), \tag{8}$$

where λ is a tradeoff parameter.

The second autoencoder is then trained over all the training sequences by negative learning. It takes as input both a sequence x and its reconstructed version from the first autoencoder AE1(x) that corresponds to the output of its last layer. The loss function considered to optimize the parameters (φ2, θ2) of the second autoencoder is then defined as follows:

$$\begin{aligned} \mathcal{L}\_2(x, AE\_1(x), \phi\_2, \theta\_2) &= (1 - y) \mathcal{L}\_1(x, \phi\_2, \theta\_2) \\ &- y(\overline{\mathcal{L}\_1}(x, \phi\_1, \theta\_1) + \epsilon) \log\_2(1 - e^{-\mathcal{L}\_1(x, \phi\_2, \theta\_2)}), \end{aligned} \tag{9}$$

where $\overline{\mathcal{L}\_1}(x, \phi\_1, \theta\_1)$ denotes the reconstruction loss $\mathcal{L}\_1$ rescaled to the $[0, 1]$ interval with respect to all fraudulent sequences, and $\epsilon$ is a small value used to smooth very low anomaly scores. The architecture of this second autoencoder is similar to that of the first one, except that we use a MoE layer to compute the mean of the normal distribution $\mathcal{N}(\mu\_2, \sigma\_2)$ defined by the encoder. As said previously, the objective is to take into account the variety of behavior patterns found in our genuine data. The experts used in that layer are simple one-layer feed-forward neural networks.

### **3.2 The Training Strategy**

The global learning algorithm is presented in Algorithm 1. We have two training phases. The first one focuses on training the first autoencoder $AE\_1$ as a backing model for the second phase; it is trained only on fraudulent sequences by minimizing Eq. 8. Once the model has converged, we freeze its weights and start the second phase. For training the second autoencoder $AE\_2$, we use both genuine and fraudulent sequences and their reconstructed versions given by $AE\_1$, and we optimize the weights of $AE\_2$ by minimizing Eq. 9. To control the imbalance ratio, the training is done at each iteration by sampling $n$ examples from the fraudulent sequences and $n$ from the genuine sequences. We repeat this step iteratively, increasing the number $n$ of sampled sequences at each new iteration, until the model converges.



**Algorithm 1.** The DuSVAE training procedure: $AE\_1$ is first trained on fraudulent sequences only (Eq. 8); its weights are then frozen, and $AE\_2$ is trained by negative learning (Eq. 9) on balanced samples $X\_1 \leftarrow Sample(X\_f, n) \cup Sample(X\_g, n)$ of fraudulent and genuine sequences, with $n$ increasing at each iteration.


**Table 1.** Properties of the Worldline dataset used in the experiments.

## **4 Experiments**

In this section, we provide an experimental evaluation of our approach on a real-world dataset of credit card e-payment transactions provided by Worldline. First, we present the dataset, then we present the metrics used to evaluate the models learned by our system and finally, we compare DuSVAE with other stateof-the-art approaches.

### **4.1 Dataset**

The dataset provided by Worldline covers 4 months of credit card e-payment transactions made by European cardholders in e-commerce mode. It has been split into **Train**, **Validation** and **Test** sets, used respectively to train, tune and test the learned models. Its main challenges have been studied in [2], one of them being the imbalance ratio, as can be seen in Table 1, which presents the main characteristics of this dataset.

Each transaction is described by 12 features. A Boolean value is assigned to each transaction to specify whether it corresponds to a fraud or not. This labeling is handled by a team of human experts.

Since most features have a large number of values, brute one-hot encoding would generate a huge number of features. For example, the "Merchant Category Code" feature has 283 possible values, and one-hot encoding would produce 283 new features. That would make our approach inefficient. Thus, before using one-hot encoding, we replace each categorical value of each feature by a score that reflects its risk of being associated with a fraudulent transaction. Consider for example a categorical feature $f$. We can compute the probability of the $j$-th value of feature $f$ being associated with a fraudulent transaction, denoted $\beta\_j$, as follows: $\beta\_j = \frac{N^{+}\_{f=j}}{N\_{f=j}}$, where $N^{+}\_{f=j}$ is the number of fraudulent transactions where the value of feature $f$ is equal to $j$ and $N\_{f=j}$ is the total number of transactions where the value of feature $f$ is equal to $j$. In order to take into account the number of transactions related to a particular value of a given feature, we follow [14]. For each value $j$ of a given feature, the fraud score $S\_j$ for this value is defined as follows:

$$S\_j = \alpha\_j' \beta\_j + \left(1 - \alpha\_j'\right) \text{AFP} \tag{10}$$

This score is a weighted combination of $\beta\_j$ and the probability of having a fraud in a day (Average Fraud Probability, $AFP$). The weight $\alpha'\_j$ is a normalized value of $\alpha\_j$ in the range $[0, 1]$, where $\alpha\_j$ is defined as the proportion of the number of transactions with that value over the total number $N$ of transactions: $\alpha\_j = \frac{N\_{f=j}}{N}$.

Having replaced each value of each feature by its score, we can then run one-hot encoding and thus significantly reduce the number of features generated. For example, for the "Merchant Category Code" feature with its 283 possible values, this technique produces only 29 features instead of 283.
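A pandas sketch of this encoding (Eq. 10); the column names are hypothetical, and we approximate $AFP$ by the overall fraud rate and the normalization of $\alpha\_j$ by min-max scaling, both assumptions on our part:

```python
import pandas as pd

def fraud_scores(df, feature, label='is_fraud'):
    """Eq. 10: map each categorical value j of `feature` to
    S_j = alpha'_j * beta_j + (1 - alpha'_j) * AFP."""
    grouped = df.groupby(feature)[label]
    beta = grouped.mean()                  # P(fraud | f = j)
    alpha = grouped.size() / len(df)       # share of transactions with value j
    alpha_norm = (alpha - alpha.min()) / (alpha.max() - alpha.min())
    afp = df[label].mean()                 # average fraud probability (assumed)
    return alpha_norm * beta + (1 - alpha_norm) * afp

# Usage: replace the raw values by their scores before one-hot encoding.
# df['mcc_score'] = df['mcc'].map(fraud_scores(df, 'mcc'))
```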

Finally, to generate sequences from transactions, we grouped all the transactions by cardholder ID and ordered each cardholder's transactions by time. Then, with a sliding window over the transactions, we obtained a time-ordered sequence of transactions for each cardholder. To each sequence we assigned the label, *fraudulent* or *genuine*, of its last transaction.
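A sketch of this sequence-building step; column names such as cardholder_id and timestamp are hypothetical:

```python
import numpy as np
import pandas as pd

def build_sequences(df, m, feature_cols, label='is_fraud'):
    """Group by cardholder, order by time, and slide a window of m
    transactions; each sequence takes the label of its last transaction."""
    sequences, labels = [], []
    for _, group in df.sort_values('timestamp').groupby('cardholder_id'):
        values = group[feature_cols].to_numpy()
        flags = group[label].to_numpy()
        for end in range(m, len(group) + 1):
            sequences.append(values[end - m:end])
            labels.append(flags[end - 1])
    return np.stack(sequences), np.array(labels)
```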

### **4.2 Metrics**

In the context of fraud detection, fortunately, the number of fraudulent transactions is significantly lower than the number of normal transactions, which leads to a very imbalanced dataset. In this situation, traditional performance measures are not appropriate: with an overall fraud rate of 0.3%, classifying every transaction as normal yields an accuracy of 99.7%, despite the model being absolutely naive. This means we have to choose appropriate performance measures that are robust in the case of imbalanced data. In this work we rely on the area under the precision-recall curve (AUC-PR) as a robust and clear measure of the accuracy of the classifier in an imbalanced setting. Each point of the precision-recall curve corresponds to the precision of the classifier at a specific recall level.

Once an alert is raised after a fraud has been detected, fraud experts can contact the cardholder to check the validity of suspicious transactions. Within a single day, the fraud experts have to check a large number of transactions flagged by the fraud detection system. Consequently, the precision over the transactions flagged as fraudulent is an important metric, as it helps the human experts at Worldline focus on the most important frauds instead of losing time on minor ones. For this purpose, we rely on P@K (precision at K) as a global metric to compare models: the average of the precision over the first $K$ transactions, computed according to the following equation.

$$\text{Average}\,P@K = \frac{1}{K} \sum_{i=1}^{K} P@i \tag{11}$$
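
A minimal sketch of this metric, assuming `scores` ranks transactions by predicted fraud risk:

```python
import numpy as np

def average_p_at_k(y_true, scores, k):
    """Average of P@i over i = 1..K (Eq. 11), where P@i is the precision
    among the i highest-scored transactions."""
    order = np.argsort(scores)[::-1][:k]            # top-K by fraud score
    hits = np.asarray(y_true)[order]                # 1 if truly fraudulent
    p_at_i = np.cumsum(hits) / np.arange(1, k + 1)  # P@1, P@2, ..., P@K
    return p_at_i.mean()
```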

#### **4.3 Comparison with the State of the Art**

We compare our approach with the following methods: a variational autoencoder [11,16] trained on fraudulent or genuine data only (VAE(F) and VAE(G), respectively); limiting reconstruction capability (LRC) [13]; and autoencoding binary classifiers for supervised anomaly detection (ABC) [18]. It is important to note that ABC and LRC are not sequential models by nature, so, to make the comparison fairer, we adapted their implementations to let them process sequential data. As a classifier, we used CatBoost [15], which is robust in the context of imbalanced data and efficient on GPUs.

**Table 2.** AUC-PR achieved by CatBoost using various autoencoder models

First, as we can observe in Table 2, the AUC-PR values obtained by running CatBoost directly on transactions and on sequences of transactions are 0.19 and 0.40, respectively. If we look at the AUC-PR values obtained by running CatBoost on the reconstructed transactions and sequences of transactions, we observe that the results are always greater than those obtained by running CatBoost on the raw data. Moreover, it is interesting to note that DuSVAE achieved the best results (0.51 and 0.53) compared to the other state-of-the-art systems.

Now, if we look at the performance obtained by CatBoost on the hidden representation vectors Code1 and Code2, we observe that DuSVAE outperforms the other state-of-the-art systems, with results quite similar to those obtained on the reconstructed sequences of transactions. This is interesting because it means that DuSVAE can produce a condensed representation of the input data that yields approximately the same results as the reconstructed sequences of transactions, which are of much higher dimensionality (about 10 times more) and are processed less efficiently by the classifier. Finally, when using the reconstruction error as a score to classify fraudulent data, as is usually done in anomaly detection, we observe that DuSVAE is competitive with the best method. However, since the performance of Code1 and Code2 with CatBoost is significantly better, using the hidden representations is a better strategy than using the reconstruction error.

We then evaluated the impact of the handcrafted features built by Worldline on classifier performance. As we can see in the first two rows of Table 3, adding handcrafted features to the original sequential raw dataset leads to much better results for both the AUC-PR and P@K measures.

Now if we consider using DuSVAE (rows 3 and 4 of Table 3), we also notice a significant improvement of the results obtained on the raw dataset of sequences augmented by handcrafted features, compared to the results obtained on the original dataset without these additional features. This holds for both the AUC-PR and the P@K measures. For the moment, using a classifier on the sequences reconstructed by DuSVAE from just the raw dataset (AUC-PR = 0.53), we cannot reach the results obtained when we use this classifier on the raw dataset augmented by handcrafted features (AUC-PR = 0.60). This can be explained by the fact that those features are based on history and profiling techniques that embed information covering a period of time larger than the one used for our dataset. Nevertheless, we are not far off, and the fact that using DuSVAE on the dataset augmented by handcrafted features (AUC-PR = 0.65) leads to better results than using the classifier without DuSVAE (AUC-PR = 0.60) is promising.

**Table 3.** AUC-PR and P@K achieved by CatBoost for sequence classification.

Table 3 also shows very good P@K values when running the classifier on the sequences of transactions reconstructed by DuSVAE. This means that DuSVAE can significantly help experts focus on truly fraudulent transactions instead of wasting time on false alarms.

## **5 Conclusion**

In this paper, we presented DuSVAE, a new fraud detection technique. Our model combines two sequential variational autoencoders to produce a condensed representation vector of the input sequential data, which can then be used by a classifier to label new sequences of transactions as genuine or fraudulent. Our experiments have shown that the DuSVAE model produces much better results, in terms of the AUC-PR and P@K measures, than state-of-the-art systems. Moreover, the DuSVAE model produces a condensed representation of the input data that can very favorably replace the handcrafted features. Indeed, running a classifier on the condensed representation built by DuSVAE outperforms the results obtained on the raw data, with or without handcrafted features.

We believe that a first interesting way to further improve our results is to focus on attention mechanisms, so as to better take into account the history of past transactions when detecting present frauds. A second direction is to better capture the temporal aspects of the sequential representation of our data and to reflect them in the core algorithm.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Principled Approach to Analyze Expressiveness and Accuracy of Graph Neural Networks**

Asma Atamna1,3(B), Nataliya Sokolovska<sup>2</sup>, and Jean-Claude Crivello<sup>3</sup>

<sup>1</sup> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France asma.atamna@telecom-paris.fr

<sup>2</sup> NutriOmics, INSERM, Sorbonne University, Paris, France nataliya.sokolovska@sorbonne-universite.fr

<sup>3</sup> ICMPE (UMR 7182), CNRS, University of Paris-Est, Thiais, France crivello@icmpe.cnrs.fr

**Abstract.** Graph neural networks (GNNs) have seen increasing success recently, with many GNN variants achieving state-of-the-art results on node and graph classification tasks. The proposed GNNs, however, often implement complex node and graph embedding schemes, which makes it challenging to explain their performance. In this paper, we investigate the link between a GNN's *expressiveness*, that is, its ability to map different graphs to different representations, and its generalization performance in a graph classification setting. In particular, we propose a principled experimental procedure where we (i) define a practical measure for expressiveness, (ii) introduce an expressiveness-based loss function that we use to train a simple yet practical GNN that is permutation-invariant, and (iii) illustrate our procedure on benchmark graph classification problems and on an original real-world application. Our results reveal that expressiveness alone does not guarantee a better performance, and that a powerful GNN should be able to produce graph representations that are *well separated* with respect to the class of the corresponding graphs.

**Keywords:** Graph neural networks · Classification · Expressiveness

## **1 Introduction**

Many real-world data present an inherent structure and can be modelled as sequences, graphs, or hypergraphs [2,5,9,15]. Graph-structured data, in particular, are very common in practice and are at the heart of this work.

We consider the problem of graph classification. That is, given a set $\mathcal{G} = \{G_i\}_{i=1}^m$ of arbitrary graphs and their respective labels $\{y_i\}_{i=1}^m$, where $y_i \in \{1,\dots,C\}$ and $C$ is the number of classes, we aim at finding a mapping

Supported by the Emergence@INC-2018 program of the French National Center for Scientific Research (CNRS) and the *DiagnoLearn* ANR JCJC project.

$f_\theta : \mathcal{G} \to \{1,\dots,C\}$ that minimizes the classification error, where $\theta$ denotes the parameters to optimize.

Graph neural networks (GNNs) and their deep learning variants, the graph convolutional networks (GCNs) [1,7,9,10,13,17,20,27], have gained considerable interest recently. GNNs learn latent node representations by recursively aggregating the neighboring node features for each node, thereby capturing the structural information of a node's neighborhood.

Despite the profusion of GNN variants, some of which achieve state-of-the-art results on tasks like node classification, graph classification, and link prediction, GNNs remain very little studied. In particular, it is often unclear what a GNN learns and how the learned graph (or node) mapping influences its generalization performance. In a recent work, [25] present a theoretical framework to analyze the expressive power of GNNs, where a GNN's *expressiveness* is defined as its ability to compute different graph representations for different graphs. Theoretical conditions under which a GNN is maximally expressive are derived. Although it is reasonable to assume that a higher expressiveness would result in a higher accuracy on classification tasks, this link has not been explicitly studied so far.

In this paper, we design a principled experimental procedure to analyze the link between expressiveness and the test accuracy of GNNs. In particular, we (i) define a practical measure for expressiveness, (ii) introduce an expressiveness-based loss function that we use to train a simple yet practical GNN that is permutation-invariant, and (iii) illustrate our procedure on benchmark graph classification problems and on an original real-world application.
To illustrate our experimental framework, we introduce a simple yet practical architecture, the Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN). We also present an original graph data set of metal hydrides that we use along with benchmark graph data sets to evaluate SPI-GCN.

This paper is organized as follows. Section 2 discusses the related work. Section 3 introduces preliminary notations and concepts related to graphs and GNNs. In Sect. 4, we introduce our graph neural network, SPI-GCN. In Sect. 5, we present a practical expressiveness estimator and a new expressiveness-based loss function as part of our experimental framework. Section 6 presents our results and Sect. 7 concludes the paper.

## **2 Related Work**

Graph neural networks (GNNs) were first introduced in [11,19]. They learn latent node representations by iteratively aggregating neighborhood information for each node. Their more recent deep learning variants, the graph convolutional networks (GCNs), generalize conventional convolutional neural networks to irregular graph domains. In [13], the authors present a GCN for node classification where the computed node representations can be interpreted as the graph coloring returned by the 1-dimensional Weisfeiler-Lehman (WL) algorithm [24]. A related GCN that is invariant to node permutation is presented in [27]. The graph convolution operator is closely related to the one in [13], and the authors introduce a permutation-invariant pooling operator that sorts the convolved nodes before feeding them to a 1-dimensional classical convolution layer for graph-level classification. A popular GCN is Patchy-san [17]. Its graph convolution operator extracts normalized local "patches" (neighborhood representations) of the graph which are then sorted and fed to a 1-dimensional traditional convolution layer for graph-level classification. The method, however, requires the definition of a node ordering and running the WL algorithm in a preprocessing step. On the other hand, the normalization of the extracted patches implies sorting the nodes again and using the external graph software Nauty [14].

Despite the success of GNNs, there are relatively few papers that analyze their properties, either mathematically or empirically. A notable exception is the recent work by [25] that studies the expressive power of GNNs. The authors prove that (i) GNNs are at most as powerful as the WL test in distinguishing graph structures and that (ii) if the graph function of a GNN—i.e. its graph embedding scheme—is injective, then the GNN is as powerful as the WL test. The authors also present the Graph Isomorphism Network (GIN), which approximates the theoretical maximally expressive GNN. In another study [4], the authors present a simple neural network defined on a set of graph augmented features and show that their architecture can be obtained by linearizing graph convolutions in GNNs.

Our work is related to [25] in that we adopt the same definition of expressiveness, that is, the ability of a GNN to compute distinct graph representations for distinct input graphs. However, we go one step further and investigate how the graph function learned by GNNs affects their generalization performance. On the other hand, our SPI-GCN extends the GCN in [13] to graph-level classification. Our SPI-GCN is also related to [27] in that we use a similar graph convolution operator inspired by [13]. Unlike [27], however, our architecture does not require any node ordering, and we only use a simple multilayer perceptron (MLP) to perform classification.

### **3 Some Graph Concepts**

A graph $G$ is a pair $(V, E)$ of a set $V = \{v_1,\dots,v_n\}$ of vertices (or nodes) $v_i$ and a set $E \subseteq V \times V$ of edges $(v_i, v_j)$. In this work, we represent a graph $G$ by two matrices: (i) an *adjacency matrix* $A \in \mathbb{R}^{n \times n}$ such that $a_{ij} = 1$ if there is an edge between nodes $v_i$ and $v_j$ and $a_{ij} = 0$ otherwise,<sup>1</sup> and (ii) a *node feature matrix* $X \in \mathbb{R}^{n \times d}$, with $d$ being the number of node features. Each row $x_i \in \mathbb{R}^d$ of $X$ contains the feature representation of node $v_i$, where $d$ is the dimension of the feature space. Since we only consider node features in this paper (as opposed to *edge features* for instance), we refer to the node feature matrix $X$ simply as the feature matrix in the rest of this paper.

<sup>1</sup> Given a matrix $M$, $m_i$ denotes its $i$th row and $m_{ij}$ denotes the entry at its $i$th row and $j$th column. More generally, we denote matrices by capital letters and vectors by small letters. Scalars, on the other hand, are denoted by small italic letters.

An important notion in graph theory is *graph isomorphism*. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijection $g : V_1 \to V_2$ such that every edge $(u, v)$ is in $E_1$ if and only if the edge $(g(u), g(v))$ is in $E_2$. Informally, this definition states that two graphs are isomorphic if there exists a vertex permutation such that when applied to one graph, we recover the vertex and edge sets of the other graph.

## **3.1 Graph Neural Networks**

Consider a graph $G$ with adjacency matrix $A$ and feature matrix $X$. GNNs use the graph structure ($A$) and the node features ($X$) to learn a node-level or a graph-level representation—or *embedding*—of $G$. GNNs iteratively update a node representation by aggregating its neighbors' representations. At iteration $l$, a node representation captures its $l$-hop neighborhood's structural information. Formally, the $l$th layer of a general GNN can be defined as follows:

$$\mathbf{a}_i^{l+1} = \text{AGGREGATE}^l(\{\mathbf{z}_j^l : j \in N(i)\}) \tag{1}$$

$$\mathbf{z}_i^{l+1} = \text{COMBINE}^l(\mathbf{z}_i^l, \mathbf{a}_i^{l+1}), \tag{2}$$

where $\mathbf{z}_i^{l+1}$ is the feature vector of node $v_i$ at layer $l+1$ and where $\mathbf{z}_i^0 = \mathbf{x}_i$. While COMBINE usually consists in concatenating node representations from different layers, different—and often complex—architectures for AGGREGATE have been proposed. In [13], the presented GCN merges the AGGREGATE and COMBINE functions as follows:

$$\mathbf{z}_i^{l+1} = \text{ReLU}\left(\text{mean}(\{\mathbf{z}_j^l : j \in N(i) \cup \{i\}\}) \cdot \mathbf{W}^l\right), \tag{3}$$

where ReLU is a rectified linear unit and W<sup>l</sup> is a trainable weight matrix. GNNs for graph classification have an additional module that aggregates the node-level representations to produce a graph-level one as follows:

$$\mathbf{z}_G = \text{READOUT}(\{\mathbf{z}_i^L : v_i \in V\}), \tag{4}$$

for a GNN with $L$ layers. In [25], the authors discuss the impact that the choice of AGGREGATE$^l$, COMBINE$^l$, and READOUT has on the so-called *expressiveness* of the GNN, that is, its ability to map different graphs to different embeddings. They present theoretical conditions under which a GNN is maximally expressive.

We now present a simple yet practical GNN architecture on which we illustrate our experimental framework.

## **4 Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN)**

Our Simple Permutation-Invariant Graph Convolutional Network (SPI-GCN) consists of the following sequential modules: (1) a *graph convolution module* that encodes local graph structure and node features in a substructure feature matrix whose rows represent the nodes of the graph, (2) a *sum-pooling layer* as a READOUT function to produce a single-vector representation of the input graph, and (3) a *prediction module* consisting of dense layers that reads the vector representation of the graph and outputs predictions.

Let $G$ be a graph represented by the adjacency matrix $A \in \mathbb{R}^{n \times n}$ and the feature matrix $X \in \mathbb{R}^{n \times d}$, where $n$ and $d$ represent the number of nodes and the dimension of the feature space, respectively. Without loss of generality, we consider graphs without self-loops.

#### **4.1 Graph Convolution Module**

Given a graph G with its adjacency and feature matrices, A and X, we define the first convolution layer as follows:

$$\mathbf{Z} = f(\hat{\mathbf{D}}^{-1} \hat{\mathbf{A}} \mathbf{X} \mathbf{W}) \ , \tag{5}$$

where $\hat{A} = A + I_n$ is the adjacency matrix of $G$ with added self-loops, $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$,<sup>2</sup> $W \in \mathbb{R}^{d \times d'}$ is a trainable weight matrix, $f$ is a nonlinear activation function, and $Z \in \mathbb{R}^{n \times d'}$ is the convolved graph. To stack multiple convolution layers, we generalize the propagation rule in (5) as follows:

$$\mathbf{Z}^{l+1} = f^l(\hat{\mathbf{D}}^{-1} \hat{\mathbf{A}} \mathbf{Z}^l \mathbf{W}^l) \; \,, \tag{6}$$

where $Z^0 = X$, $Z^l$ is the output of the $l$th convolution layer, $W^l$ is a trainable weight matrix, and $f^l$ is the nonlinear activation function applied at layer $l$. Similarly to the GCN presented in [13], from which we draw inspiration, our graph convolution module merges the AGGREGATE and COMBINE functions (see (1) and (2)), and we can rewrite (6) as:

$$\mathbf{z}_i^{l+1} = f^l\left(\text{mean}(\{\mathbf{z}_j^l : j \in N(i) \cup \{i\}\}) \cdot \mathbf{W}^l\right), \tag{7}$$

where $\mathbf{z}_i^{l+1}$ is the $i$th row of $\mathbf{Z}^{l+1}$.

We return the result of the last convolution layer: for a network with $L$ convolution layers, the result of the convolution is the last substructure feature matrix $\mathbf{Z}^L$. Note that (6) is able to process graphs with varying numbers of nodes.

#### **4.2 Sum-Pooling Layer**

The *sum-pooling* layer produces a graph-level representation $\mathbf{z}_G$ by summing the rows of $\mathbf{Z}^L$ returned by the convolution module. Formally:

$$\mathbf{z}_G = \sum_{i=1}^{n} \mathbf{z}_i^L. \tag{8}$$

<sup>2</sup> If $G$ is a directed graph, $\hat{D}$ corresponds to the *outdegree* diagonal matrix of $\hat{A}$.

The resulting vector $\mathbf{z}_G \in \mathbb{R}^{d_L}$ contains the final vector representation (or *embedding*) of the input graph $G$ in a $d_L$-dimensional space. This vector representation is then used for prediction, graph classification in our case.

Using a sum pooling operator is a simple idea that has been used in GNNs such as [1,21]. Additionally, it results in the invariance of our architecture to node permutation, as stated in Theorem 1.

**Theorem 1.** *Let $G$ and $G^\varsigma$ be two arbitrary isomorphic graphs. The sum-pooling layer of SPI-GCN produces the same vector representation for $G$ and $G^\varsigma$.*

This invariance property is crucial for GNNs, as it ensures that two isomorphic (and hence equivalent) graphs will result in the same output. The proof of Theorem 1 is straightforward and omitted due to space limitations.

## **4.3 Prediction Module**

The prediction module of SPI-GCN is a simple MLP that takes as input the graph-level representation $\mathbf{z}_G$ returned by the sum-pooling layer and returns either (i) a probability $p$ in the case of binary classification or (ii) a vector $\mathbf{p}$ of probabilities such that $\sum_i p_i = 1$ in the case of multi-class classification.

Note that SPI-GCN can be trained in an end-to-end fashion through backpropagation. Additionally, since only one graph is treated in a forward pass, the training complexity of SPI-GCN is linear in the number of graphs.
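
To make the three modules concrete, here is a minimal NumPy sketch of one forward pass up to the graph embedding, using the propagation rule (6) and the sum-pooling (8); the activation choices follow the instance described later in Sect. 6.2, while the weight shapes are illustrative and the MLP head is left out.

```python
import numpy as np

def row_softmax(Z):
    """Softmax applied independently to each row (per-node softmax)."""
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def spi_gcn_embed(A, X, W1, W2):
    """Two convolution layers (Eq. 6) followed by sum-pooling (Eq. 8)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse degree matrix
    Z1 = np.tanh(D_inv @ A_hat @ X @ W1)       # layer 1 (tanh activation)
    Z2 = row_softmax(D_inv @ A_hat @ Z1 @ W2)  # layer 2 (per-node softmax)
    return Z2.sum(axis=0)                      # z_G: sum over the nodes

# Usage sketch: with d input features, W1 of shape (d, 128) and W2 of
# shape (128, 32), z_G = spi_gcn_embed(A, X, W1, W2) feeds the MLP head.
```

Since the final row sum is independent of any node ordering, the permutation invariance of Theorem 1 is immediate in this sketch.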

In the next section, we describe a practical methodology for studying the expressiveness of SPI-GCN and its connection to the generalization performance of the algorithm.

## **5 Investigating Expressiveness of SPI-GCN**

We start here by introducing a practical definition of expressiveness. We then show how the defined measure can be used to train SPI-GCN and help understand the impact expressiveness has on its generalization performance.

## **5.1 Practical Measure of Expressiveness**

The expressiveness of a GNN, as defined in [25], is its ability to map different graph structures to different embeddings and, therefore, reflects the injectivity of its graph embedding function. Since studying injectivity can be tedious, we characterize expressiveness—and hence injectivity—as a function of the pairwise distance between graph embeddings.

Let $\{\mathbf{z}_{G_i}\}_{i=1}^m$ be the set of graph embeddings computed by a GNN $\mathcal{A}$ for a given input graph data set $\{G_i\}_{i=1}^m$. We define $\mathcal{A}$'s expressiveness, $\mathcal{E}(\mathcal{A})$, as follows:

$$\mathcal{E}(\mathcal{A}) = \text{mean}(\{\|\mathbf{z}_{G_i} - \mathbf{z}_{G_j}\|_2 : i, j = 1, \dots, m,\ i \neq j\}), \tag{9}$$

that is, $\mathcal{E}(\mathcal{A})$ is the average pairwise Euclidean distance between the graph embeddings produced by $\mathcal{A}$. While not strictly equivalent to injectivity, $\mathcal{E}$ is a reasonable indicator thereof, as the average pairwise distance reflects the *diversity* within the graph representations, which, in turn, is expected to be higher for more diverse input graph data sets. For permutation-invariant GNNs like SPI-GCN,<sup>3</sup> $\mathcal{E}$ is zero when all graphs $\{G_i\}_{i=1}^m$ are isomorphic.
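
In practice, (9) is a one-liner; a sketch assuming the $m$ embeddings are stacked row-wise in a matrix `Z`:

```python
from scipy.spatial.distance import pdist

def expressiveness(Z):
    """Eq. (9): mean pairwise Euclidean distance between the rows of Z,
    where row i is the embedding of graph G_i (pdist enumerates each
    unordered pair i != j exactly once)."""
    return pdist(Z, metric="euclidean").mean()
```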

#### **5.2 Penalized Cross Entropy Loss**

We train SPI-GCN using a *penalized cross entropy loss*, $\mathcal{L}_p$, that consists of a classical cross entropy augmented with a penalty term defined as a function of the expressiveness of SPI-GCN. Formally:

$$\mathcal{L}_p = \text{cross-entropy}(\{y_i\}_{i=1}^m, \{\hat{y}_i\}_{i=1}^m) - \alpha \cdot \mathcal{E}(\text{SPI-GCN}), \tag{10}$$

where $\{y_i\}_{i=1}^m$ (resp. $\{\hat{y}_i\}_{i=1}^m$) is the set of true (resp. predicted) graph labels, $\alpha$ is a non-negative penalty factor, and $\mathcal{E}$ is defined in (9), with $\{\mathbf{z}_{G_i}\}_{i=1}^m$ being the graph embeddings computed by SPI-GCN.

By adding the penalty term $-\alpha \cdot \mathcal{E}(\text{SPI-GCN})$ to $\mathcal{L}_p$, the expressiveness is maximized while the cross entropy is minimized during training. The penalty factor $\alpha$ controls the importance attributed to $\mathcal{E}(\text{SPI-GCN})$ when $\mathcal{L}_p$ is minimized. Consequently, higher values of $\alpha$ allow training more expressive variants of SPI-GCN, whereas for $\alpha = 0$, only the cross entropy is minimized.
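
For gradient-based training, the penalty must remain differentiable with respect to the embeddings; a PyTorch sketch under that assumption, where `logits` are the outputs of the prediction module and `Z` stacks the graph embeddings $\mathbf{z}_{G_i}$ of the batch:

```python
import torch
import torch.nn.functional as F

def penalized_loss(logits: torch.Tensor, targets: torch.Tensor,
                   Z: torch.Tensor, alpha: float) -> torch.Tensor:
    """Eq. (10): cross entropy minus alpha times the expressiveness."""
    expressiveness = F.pdist(Z, p=2).mean()  # differentiable E(SPI-GCN)
    return F.cross_entropy(logits, targets) - alpha * expressiveness
```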

In the next section, we assess the performance of SPI-GCN for different values of α. We also compare SPI-GCN with other more complex GNN architectures, including the state-of-the-art method.

## **6 Experiments**

We carry out a first set of experiments where we compare our approach, SPI-GCN, with two recent GCNs. In a second set of experiments, we train different instances of SPI-GCN with increasing values of the penalty factor $\alpha$ (see (10)), in an attempt to understand how the expressiveness of SPI-GCN affects its test accuracy and whether it is the determining factor of its generalization performance, as implicitly suggested in [25]. Our code and data are available at https://github.com/asmaatamna/SPI-GCN.

#### **6.1 Data Sets**

We use nine public benchmark data sets including five bioinformatics data sets (MUTAG [6], PTC [22], ENZYMES [3], NCI1 [23], PROTEINS [8]), two social network data sets (IMDB-BINARY, IMDB-MULTI [26]), one image data set where images are represented as region adjacency graphs (COIL-RAG [18]), and one synthetic data set (SYNTHIE [16]). We also evaluate SPI-GCN on an original real-world data set collected at the ICMPE,<sup>4</sup> HYDRIDES, that contains metal hydrides in graph format, labelled as *stable* or *unstable* according to specific energetic properties that determine their ability to store hydrogen efficiently.

<sup>3</sup> As mentioned previously, we consider permutation-invariance a minimal requirement for any practical GNN.

<sup>4</sup> East Paris Institute of Chemistry and Materials Science, France.

## **6.2 Architecture of SPI-GCN**

The instance of SPI-GCN that we use in the experiments has two graph convolution layers of 128 and 32 hidden units, followed by a hyperbolic tangent function and a per-node softmax function, respectively. The sum-pooling layer is a classical sum applied row-wise; it is followed by a prediction module consisting of an MLP with one hidden layer of 256 units, a batch normalization layer, and a ReLU. We chose this architecture by trial and error and keep it unchanged throughout the experiments.

## **6.3 Comparison with Other Methods**

In these experiments, we consider the simplest variant of SPI-GCN where the penalty term in (10) is discarded by setting α = 0. That is, the algorithm is trained using only the cross entropy loss.

**Baselines.** We compare SPI-GCN with the well-known GCN, Patchy-san (PSCN) [17], the Deep Graph Convolutional Neural Network (Dgcnn) [27] that uses a similar convolution module to ours, and the recent state-of-the-art Graph Isomorphism Network (GIN) [25].

**Experimental Procedure.** We train SPI-GCN using the full-batch Adam optimizer [12], with cross entropy as the loss function to minimize ($\alpha = 0$ in (10)). Upon experimentation, we set Adam's hyperparameters as follows: the algorithm is trained for 200 epochs on all data sets, and the learning rate is set to $10^{-3}$. To estimate the accuracy, we perform 10-fold cross validation, using 9 folds for training and one fold for testing each time. We report the average (test) accuracy and the corresponding standard deviation in Table 1. Note that we only use node attributes in our experiments; in particular, SPI-GCN does not exploit node or edge labels of the data sets. When node attributes are not available, we use the identity matrix as the feature matrix for each graph.

We follow the same procedure for Dgcnn. We use the authors' implementation<sup>5</sup> and perform 10-fold cross validation with the recommended values for training epochs, learning rate, and SortPooling parameter k, for each data set.

For PSCN, we report the results from the original paper [17] (for receptive field size $k = 10$), as we could not find a public implementation by the authors. The experiments were conducted using a procedure similar to ours.

For GIN, we also report the published results [25] (GIN-0 in the paper), as it was not straightforward to use the authors' implementation.

**Results.** Table 1 shows the results for our algorithm (SPI-GCN), Dgcnn [27], PSCN [17], and the state-of-the-art GIN [25]. We observe that SPI-GCN is highly competitive with other algorithms despite using the same architecture for all data sets. The only noticeable exceptions are on the NCI1 and IMDB-BINARY data sets, where the best approach (GIN) is up to 1.28 times better. On the other hand, SPI-GCN appears to be highly competitive on classification tasks

<sup>5</sup> https://github.com/muhanzhang/pytorch_DGCNN.

with more than 3 classes (ENZYMES, COIL-RAG, SYNTHIE). The difference in accuracy is particularly significant on COIL-RAG (100 classes), where SPI-GCN is around 34 times better than Dgcnn, suggesting that the features extracted by SPI-GCN are more suitable to characterize the graphs at hand. SPI-GCN also achieves a very reasonable accuracy on the HYDRIDES data set and is 1.06 times better than Dgcnn on ENZYMES.

The results in Table 1 show that despite its simplicity, SPI-GCN is competitive with other practical graph algorithms and, hence, it is a reasonable architecture to consider for our next set of experiments involving expressiveness.


**Table 1.** Accuracy results for SPI-GCN and three other deep learning methods (Dgcnn, PSCN, GIN).

#### **6.4 Expressiveness Experiments**

Through these experiments, we try to answer the following question: does a higher expressiveness lead to a better test accuracy, and is it the determining factor of the generalization performance of SPI-GCN?
To this end, we train increasingly injective instances of SPI-GCN on the penalized cross entropy loss $\mathcal{L}_p$ (10) by setting the penalty factor $\alpha$ to increasingly large values. Then, for each trained instance, we investigate (i) its test accuracy, (ii) its expressiveness $\mathcal{E}(\text{SPI-GCN})$ (9), and (iii) the *average normalized Inter-class Graph Embedding Distance* (IGED), defined as the average pairwise Euclidean distance between the class-wise mean graph embeddings, divided by $\mathcal{E}(\text{SPI-GCN})$. Formally:

$$\text{IGED} = \frac{\text{mean}(\{\|\mathbf{z}_c^* - \mathbf{z}_{c'}^*\|_2 : c, c' = 1, \dots, C,\ c \neq c'\})}{\mathcal{E}(\text{SPI-GCN})}, \tag{11}$$

**Table 2.** Expressiveness experiment results. SPI-GCN is trained on the penalized cross entropy loss $\mathcal{L}_p$ with increasing values of the penalty factor $\alpha$. For each data set and each value of $\alpha$, we report the test accuracy (a), the expressiveness $\mathcal{E}(\text{SPI-GCN})$ (b), and the IGED (c). Highlighted are the maximal values for each quantity.


where $\mathbf{z}_c^*$ is the mean graph embedding for class $c$. The IGED can be interpreted as an estimate of how well the graph embeddings computed by SPI-GCN are *separated* with respect to their respective classes.
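
A small sketch of the IGED computation under the same conventions as before (`Z` stacks the embeddings, `y` their class labels):

```python
import numpy as np
from scipy.spatial.distance import pdist

def iged(Z: np.ndarray, y: np.ndarray) -> float:
    """Eq. (11): mean pairwise distance between class-wise mean embeddings,
    normalized by the expressiveness E(SPI-GCN)."""
    centroids = np.stack([Z[y == c].mean(axis=0) for c in np.unique(y)])
    return pdist(centroids).mean() / pdist(Z).mean()
```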

**Experimental Procedure.** We train SPI-GCN on the penalized cross entropy loss $\mathcal{L}_p$ (10), where we sequentially choose $\alpha$ from $\{0, 10^{-3}, 10^{-1}, 1, 10\}$. We do so using the full-batch Adam optimizer, run for 200 epochs with a learning rate of $10^{-3}$, on all the graph data sets introduced previously. For each data set and each value of $\alpha$, we perform 10-fold cross validation using 9 folds for training and one fold for testing. We report in Table 2 the average and standard deviation of (a) the test accuracy, (b) the expressiveness $\mathcal{E}(\text{SPI-GCN})$, and (c) the IGED (11), for each value of $\alpha$ and each data set.

**Results.** We observe from Table 2 that using a penalty term in $\mathcal{L}_p$ to maximize the expressiveness (or injectivity) of SPI-GCN helps to improve the test accuracy on some data sets, notably on MUTAG, PTC, and SYNTHIE. However, larger values of $\mathcal{E}(\text{SPI-GCN})$ do not correspond to a higher test accuracy except in two cases (PTC, SYNTHIE). Overall, $\mathcal{E}(\text{SPI-GCN})$ increases when $\alpha$ increases, as expected, since the expressiveness is maximized during training when $\alpha > 0$. The IGED, on the other hand, is correlated with the best performance in four out of ten cases (ENZYMES, IMDB-BINARY, and IMDB-MULTI), where the test accuracy is maximal when the IGED is maximal. On HYDRIDES, the difference in IGED between $\alpha = 10^{-1}$ (highest accuracy) and $\alpha = 1$ (highest IGED value) is negligible.

Our empirical results indicate that while optimizing the expressiveness of SPI-GCN may result in a higher test accuracy in some cases, more expressive GNNs do not systematically perform better in practice. The IGED, however, which reflects a GNN's ability to compute graph representations that are correctly clustered according to their effective class, better explains the generalization performance of the GNN.

## **7 Conclusion**

In this paper, we challenged the common belief that more expressive GNNs achieve a better performance. We introduced a principled experimental procedure to analyze the link between the expressiveness of a GNN and its test accuracy in a graph classification setting. To the best of our knowledge, our work is the first that explicitly studies the generalization performance of GNNs by trying to uncover the factors that control it, and paves the way for more theoretical analyses. Interesting directions for future work include the design of better expressiveness estimators, as well as different (possibly more complex) penalized loss functions.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Efficient Batch-Incremental Classification Using UMAP for Evolving Data Streams**

Maroua Bahri1,3(B) , Bernhard Pfahringer<sup>2</sup>, Albert Bifet1,2, and Silviu Maniu3,4,5

<sup>1</sup> LTCI, Télécom Paris, IP-Paris, Paris, France {maroua.bahri,albert.bifet}@telecom-paris.fr <sup>2</sup> University of Waikato, Hamilton, New Zealand bernhard@waikato.ac.nz <sup>3</sup> Université Paris-Saclay, LRI, CNRS, Orsay, France silviu.maniu@lri.fr

<sup>4</sup> Inria, Paris, France

<sup>5</sup> ENS DI, CNRS, École Normale Supérieure, Université PSL, Paris, France

**Abstract.** Learning from potentially infinite and high-dimensional data streams poses significant challenges in the classification task. For instance, *k*-Nearest Neighbors (*k*NN) is one of the most often used algorithms in the data stream mining area, and it has proved to be very resource-intensive when dealing with high-dimensional spaces. Uniform Manifold Approximation and Projection (UMAP) is a novel manifold technique and one of the most promising dimension reduction and visualization techniques in the non-streaming setting, because of its high performance in comparison with competitors. However, there is no version of UMAP that copes with the challenging context of streams. To overcome these restrictions, we propose a batch-incremental approach that pre-processes data streams using UMAP, producing successive embeddings on a stream of disjoint batches in order to support an incremental *k*NN classification. Experiments conducted on publicly available synthetic and real-world datasets demonstrate the substantial gains that can be achieved with our proposal compared to state-of-the-art techniques.

**Keywords:** Data stream · *<sup>k</sup>*-Nearest Neighbors · Dimension reduction · UMAP

## **1 Introduction**

With the evolution of technology, several kinds of devices and applications are continuously generating large amounts of data in a fast-paced way as *streams*. Hence, the data stream mining area has become indispensable and ubiquitous in many real-world applications that require real-time – or near real-time –


This work was done in the context of IoTA AAP Emergence DigiCosme Project and was funded by Labex DigiCosme.


processing, e.g., social networks, weather forecast, spam filters, and more. Unlike traditional datasets, the dynamic environment and the tremendous volume of data streams make them impossible to store or to scan multiple times [12].

*Classification* is an active area of research in the data stream mining field, where several researchers are working to develop new algorithms or improve existing ones [14]. However, the dynamic nature of data streams has outpaced the capabilities of traditional classification algorithms. In this context, a multitude of supervised algorithms for static datasets, widely studied in offline processing but of limited effectiveness on large data, have been extended to work within a streaming framework [3,5,11,18]. Data stream mining approaches can be divided into two main types [23]: (i) *instance-incremental* approaches, which update the model with each instance as soon as it arrives, such as Self-Adjusting Memory *k*NN (sam*k*NN) [18] and the Hoeffding Adaptive Tree (HAT) [4]; and (ii) *batch-incremental* approaches, which make no change to their model until a batch is completed, e.g., support vector machines [10] and batch-incremental ensembles of decision trees [15]. Nevertheless, the high dimensionality of data complicates the classification task for some algorithms and increases their computational cost, most notably for *k*-Nearest Neighbors (*k*NN), because it needs the entire dataset to predict the labels of test instances [23]. To cope with this issue, a promising approach is *feature transformation*, which transforms the input features into a new set of features, containing the most relevant components, in some lower-dimensional space.

In an attempt to improve the performance of *k*NN, we incorporate a batch-incremental feature transformation strategy to tackle potentially high-dimensional and possibly infinite batches of evolving data streams, while ensuring effectiveness and quality of learning (e.g., *accuracy*). This is achieved via a new manifold technique that has attracted a lot of attention recently: Uniform Manifold Approximation and Projection (UMAP) [21], built upon rigorous mathematical foundations, namely Riemannian geometry. To the best of our knowledge, no incremental version of UMAP exists, which makes it inapplicable to large, evolving datasets. The main contributions are summarized as follows:


The paper is organized as follows. Section 2 reviews the prominent related work. Section 3 provides the background of UMAP, followed by the description of our approach. In Sect. 4 we present and discuss the results of experiments on diverse datasets. Finally, we draw our conclusions and present future directions.

## **2 Related Work**

Dimensionality reduction (DR) is a powerful tool in data science to look for hidden structure in data and reduce the resource usage of learning algorithms. The problem of dimensionality has been widely studied [25] and addressed throughout different domains, such as image processing and face recognition. Dimensionality reduction techniques facilitate the classification task by removing redundancies and extracting the most relevant features in the data, and they permit better data visualization. A common taxonomy divides these approaches into two major groups: *matrix factorization* and *graph-based* approaches.

Matrix factorization algorithms require matrix computation tools; a prominent example is Principal Components Analysis (PCA) [16], a well-known linear technique that uses singular value decomposition and aims to find a lower-dimensional basis, converting the data into features called principal components obtained from the eigenvalues and eigenvectors of a covariance matrix. This straightforward technique is computationally cheap but ineffective with data streams, since it relies on the whole dataset. Therefore, some incremental versions of PCA have been developed to handle streams of data [13,24,26].

Graph/neighborhood-based techniques are leveraged in the context of dimension reduction and visualization by using the insight that similar instances in a large space should be represented by close instances in a low-dimensional space, whereas dissimilar instances should be well separated. t-distributed Stochastic Neighbor Embedding (tSNE) [20] is one of the most prominent DR techniques in the literature. It was proposed to visualize high-dimensional data embedded in a lower space (typically 2 or 3 dimensions). In addition to being computationally expensive, tSNE does not preserve distances between all instances, which can affect any density- or distance-based algorithm; it thus conserves more of the local structure than of the global structure.

### **3 Batch-Incremental Classification**

In the following, we assume a data stream *S* is a *sequence* of instances $X_1, \dots, X_N$, where $N$ denotes the number of observations available so far. Each instance $X_i$ is a vector of $d$ attributes, $X_i = (x_i^1, \dots, x_i^d)$. The dimensionality reduction of $S$ consists in finding a low-dimensional representation $S' = Y_1, \dots, Y_N$, where $Y_i = (y_i^1, \dots, y_i^p)$ and $p \ll d$.

#### **3.1 Prior Work**

Unlike tSNE [20], UMAP has no restriction on the projected space size making it useful not only for visualization but also as a general dimension reduction technique for machine learning algorithms. It starts by constructing open balls over all instances and building simplicial complexes. Dimension reduction is obtained by finding a representation, in a lower space, that closely resembles the topological structure in the original space. Given the new dimension, an equivalent

**Fig. 1.** Projection of CNAE dataset in 2-dimensional space. Offline: (a) UMAP, (b) tSNE, and (c) PCA. Batch-incremental: (d) UMAP, (e) tSNE, and (f) PCA. (Color figure online)

fuzzy topological representation is then constructed [21]. UMAP then optimizes it by minimizing the cross-entropy between these two fuzzy topological representations. UMAP offers better visualization quality than tSNE, preserving more of the global structure in a shorter running time. To the best of our knowledge, neither of these techniques has a streaming version. Ultimately, both techniques are essentially transductive<sup>1</sup> and do not learn a mapping function from the input space. Hence, they need to process all the data for each new unseen instance, which prevents them from being usable in data stream classification models.

Figure 1 shows the projection of the CNAE dataset (see Table 1) into 2 dimensions in offline and online fashions, where each color represents a label. In Fig. 1a, we note that UMAP offers the most interesting visualization while separating the classes (9 classes). The overlap in the new space, for instance with tSNE in Fig. 1b, can potentially affect the later classification task, notably for distance-based algorithms, since properties like global distances and density may be lost. On the other hand, linear transformations, such as PCA, cannot discriminate between instances, which prevents instances from being represented in the form of clusters (Fig. 1c). To motivate our choice, we project the same dataset using our batch-

<sup>1</sup> Transductive learning consists in learning from a fully given dataset (including instances with unknown labels), where prediction is made only on the known set of unlabeled instances from that same dataset. This is achieved by clustering data instances.

incremental strategy (more details in Sect. 3.2). Figure 1d illustrates the change from the offline UMAP representation; it is not as drastic as the changes produced by tSNE and PCA (Figs. 1e and f, respectively), showing their limitations in capturing information from data that arrives in a batch-incremental manner.

### **3.2 Algorithm Description**

A very efficient and simple scheme in supervised learning is *lazy learning* [1]. Since lazy learning approaches are based on distances between every pair of instances, they unfortunately perform poorly in terms of execution time. The *k*-Nearest Neighbors (*k*NN) is a well-known lazy algorithm that does not require any work during training; instead, it uses the entire dataset to predict labels for test instances. However, it is impossible to store a potentially infinite evolving data stream, or to scan it multiple times, due to its tremendous volume. To tackle this challenge, a basic incremental version of *k*NN has been proposed that uses a fixed-length window which slides through the stream, merging newly arriving instances with the closest ones already in the window [23].

To predict the class label of an incoming instance, we take the majority class label of its nearest neighbors inside the window, using a defined distance metric (Eq. 2). Since we keep only the recently arrived instances inside the sliding window for prediction, the search for the nearest neighbors is still costly in terms of memory and time [3,7], and high-dimensional streams require even more resources.

Given a window $w$, the distance between $X_i$ and $X_j$ is defined as follows:

$$D_{X_j}(X_i) = \sqrt{\|X_i - X_j\|^2}. \tag{1}$$

Similarly, the *k*-Nearest Neighbors distance is defined as follows:

$$D_{w,k}(X_i) = \min_{\binom{w}{k},\, X_j \in w} \sum_{j=1}^{k} D_{X_j}(X_i), \tag{2}$$

where $\binom{w}{k}$ denotes the subset of the $k$ nearest neighbors of $X_i$ in $w$.
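
A compact sketch of this sliding-window kNN, with $k = 5$ and a window of 1,000 instances as in the experiments of Sect. 4.2; the class and method names are our own.

```python
import numpy as np
from collections import Counter, deque

class WindowedKNN:
    """kNN over a fixed-length sliding window of (instance, label) pairs."""

    def __init__(self, k: int = 5, window: int = 1000):
        self.k = k
        self.buffer = deque(maxlen=window)  # oldest instances fall out first

    def partial_fit(self, X, y):
        for xi, yi in zip(X, y):
            self.buffer.append((np.asarray(xi), yi))

    def predict(self, X):
        W = np.stack([x for x, _ in self.buffer])
        labels = [lbl for _, lbl in self.buffer]
        preds = []
        for xi in np.asarray(X):
            dists = np.linalg.norm(W - xi, axis=1)   # Eq. (1)
            nearest = np.argsort(dists)[: self.k]    # k nearest in the window
            preds.append(Counter(labels[i] for i in nearest).most_common(1)[0][0])
        return preds
```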

When dealing with high-dimensional data, a pre-processing phase before applying a learning algorithm is imperative to avoid the curse of dimensionality from a *computational* point of view: it may otherwise increase resource usage and decrease the performance of some algorithms, such as *k*NN. The main idea to mitigate this curse is to use an efficient dimension reduction strategy with consistent and promising results, such as UMAP.

Since UMAP is a transductive technique, an instance-incremental learning approach that includes UMAP does not work, because the entire stream would need to be processed for each new incoming instance; the process would be costly and would not meet the streaming requirements. To alleviate the processing cost within a framework where several challenges must be respected, including the memory constraint and the incremental nature of the data, we adopt a batch-incremental strategy. In the following, we introduce the procedure of our novel approach, batch-incremental UMAP-*k*NN.

**Fig. 2.** Batch-incremental UMAP-*k*NN scheme

**Step 1:** *Partition of the Stream.* During this step, we assume that data arrive in batches – or chunks – by dividing the stream into disjoint partitions $S_1, S_2, \dots$ of size $s$. The first part of Fig. 2 shows a stream of instances divided into batches: instead of instances being available one at a time, they arrive as groups of instances, $S_1, \dots, S_q$, where $S_q$ is the $q$th chunk. A simple example of such a data stream is a video sequence, where at each instant we have a succession of images.

**Step 2:** *Data Pre-processing.* We aim to construct a low-dimensional representation $Y_i \in \mathbb{R}^p$ from an infinite stream of high-dimensional data $X_i \in \mathbb{R}^d$, where $p \ll d$. As mentioned before, UMAP cannot compress data incrementally and needs to transform more than one observation at a time, because it builds a neighborhood graph on a set of instances and then lays it out in a lower-dimensional space [21]. Thus, our proposed approach operates on batches of the stream, where a single batch $S_i$ of data is processed at time $T_i$. The first two steps in Fig. 2 depict the application of UMAP on the disjoint batches. Once a batch is complete, during the second step, we apply UMAP on it independently of the chunks that have already been processed, so each $S_i \in \mathbb{R}^d$ is transformed and represented by $S_i' \in \mathbb{R}^p$. This new representation is very likely devoid of redundancies and irrelevant attributes; it is obtained by finding potentially useful non-linear combinations of existing attributes, i.e., by repacking the relevant information of the larger feature space and encoding it more compactly.

For UMAP to learn when moving from one batch to another, we seed each chunk's embedding with the outcome of the previous one, i.e., we match the initial coordinates of the instances in the current embedding to the final coordinates in the preceding one. This helps avoid losing the topological information of the stream and keeps successive embeddings stable as we transition from one batch to its successor. Afterwards, we use the compressed representation of the high-dimensional chunk in the next step, which consists in supporting the incremental *k*NN classification algorithm.

**Step 3:** *kNN Classification.* UMAP-*k*NN aims to decrease the computational cost of *k*NN on high-dimensional data streams by reducing the input space size with UMAP, in a batch-incremental way. Like the prediction phase of the *k*NN algorithm, which is based on the neighborhood,<sup>2</sup> UMAP also operates on a *k*-nearest-neighbor graph (topological representation) and optimizes the low-dimensional representation of the data using gradient descent. One nice takeaway is that UMAP, thanks to its solid theoretical backing as a manifold technique, preserves properties such as density and pairwise distances; thus, it does not bias the performance of the neighborhood-based *k*NN.

This step consists of classifying the evolving data stream, where the learning task occurs on consecutive batches, i.e., we train *k*NN incrementally with instances that become successively available in chunk buffers after pre-processing. Figure 2 shows the underlying batch-incremental learning scheme, which is built upon a divide-and-conquer strategy. Since UMAP is applied independently to batches, once a chunk is complete and has been transformed into $\mathbb{R}^p$, we feed half of the batch to the sliding window and incrementally predict the class labels of the second half (the rest of the instances).

Given that *k*NN is adaptive, the main novelty of UMAP-*k*NN lies in how it merges the current batch with previous ones: the current batch is added to the instances from previous chunks that are still inside the *k*NN window. Past chunks are discarded, but some of their instances are stored and maintained while the adaptive window slides. Thereafter, the instances kept temporarily inside the window are used to define the neighborhood and predict the class labels of later incoming instances. As presented in Fig. 2, the intuitive idea for combining results from different batches is to use one half of each batch for training and the other half for prediction. In general, because successive embeddings can differ substantially, one might expect this to affect the global performance of our approach; we adopt this scheme precisely to maintain stability in an adaptive batch-incremental manifold classification approach.
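
Putting the three steps together, here is a sketch of the UMAP-*k*NN loop built on the umap-learn package, whose `UMAP` accepts a NumPy array as `init`; this is how we realize the seeding of Step 2 (it requires the current chunk to have as many rows as the previous embedding, which holds for equal-sized chunks). The output dimension (3) and neighborhood size (15) follow the experimental settings of Sect. 4.2, and `WindowedKNN` refers to the sketch given earlier.

```python
import umap  # umap-learn package

def umap_knn(stream_chunks, knn, n_components=3, n_neighbors=15):
    """Batch-incremental UMAP-kNN: embed each disjoint chunk, feed half of
    it to the kNN window, and predict the labels of the other half."""
    prev_embedding = None
    for X, y in stream_chunks:                    # Step 1: disjoint chunks
        if prev_embedding is not None and len(prev_embedding) == len(X):
            init = prev_embedding                 # seed with the previous layout
        else:
            init = "spectral"                     # default for the first chunk
        reducer = umap.UMAP(n_components=n_components,
                            n_neighbors=n_neighbors, init=init)
        Z = reducer.fit_transform(X)              # Step 2: per-chunk embedding
        prev_embedding = Z
        half = len(Z) // 2
        knn.partial_fit(Z[:half], y[:half])       # Step 3: train on one half
        yield knn.predict(Z[half:]), y[half:]     # predict the second half
```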

## **4 Experiments**

In this section, we present a series of experiments carried out on various datasets, reporting three main measures: accuracy, memory (MB), and time (sec).

### **4.1 Datasets**

We use 3 synthetic and 6 real-world datasets from different scenarios that have been thoroughly used in the literature to evaluate the performance of data

<sup>2</sup> The distances between the new incoming instance and the instances already available inside the adaptive window are computed in order to assign it to a particular class.

stream classifiers. Table 1 presents a short description of each dataset, and further details are provided in what follows.

*Tweets.* The dataset was created using the tweets text data generator provided by MOA [6], which simulates sentiment analysis on tweets, where messages can be classified depending on whether they convey positive or negative feelings. Tweets1, Tweets2, and Tweets3 produce instances with 500, 1,000, and 1,500 attributes, respectively.

*Har.* Human Activity Recognition dataset [2] built from several subjects performing daily living activities, such as walking, sitting, standing and laying, while wearing a waist-mounted smartphone equipped with sensors. The sensor signals were pre-processed using noise filters and attributes were normalized.

*CNAE.* CNAE is the national classification of economic activities dataset [9]. Instances represent descriptions of Brazilian companies categorized into 9 classes. The original texts were pre-processed to obtain the current highly sparse data.

*Enron.* The Enron corpus dataset is a large set of email messages that was made public during the legal investigation concerning the Enron corporation [17]. This cleaned version of Enron consists of 1,702 instances and 1,000 attributes.


**Table 1.** Overview of the datasets

*IMDB.* IMDB movie reviews dataset was proposed for sentiment analysis [19], where each review is encoded as a sequence of word indexes (integers).

*Nomao.* Nomao dataset [8] was provided by Nomao Labs where data come from several sources on the web about places (name, address, localization, etc.).

*Covt.* The forest covertype dataset, obtained from US Forest Service resource information system data, where each class label represents a different cover type.

**Fig. 3.** (a) Varying the chunk size. (b) Varying the neighborhood size for UMAP.

#### **4.2 Results and Discussions**

We compare our proposed classifier, UMAP-*k*NN, to various commonly used baseline methods from the dimensionality reduction and machine learning areas: PCA [24], tSNE (fixing the perplexity to 30, the best value as reported in [20]), and SAM-*k*NN (S*k*NN) [18]. We also use HAT [4], a classifier with a different, tree-based structure, to assess its performance with the neighborhood-based UMAP. For a fair comparison, we compare the performance of the UMAP-*k*NN approach with competitors that also use UMAP in the same batch-incremental manner. Incremental *k*NN has two crucial parameters: (i) the number of neighbors $k$, fixed to 5; and (ii) the window size $w$ that maintains the low-dimensional data, fixed to 1,000. According to previous studies such as [7], a bigger window increases resource usage, while a smaller one hurts accuracy.

The experiments were conducted on a machine equipped with an Intel Core i5 CPU and 4 GB of RAM. All experiments were implemented and evaluated in Python by extending the Scikit-multiflow framework<sup>3</sup> [22].

Figure 3a depicts the influence of the chunk size on the accuracy of UMAP-*k*NN for some datasets. Fixing the chunk size imposes the following dilemma: choose a small size to obtain an accurate reflection of the current data, or choose a large size that may increase the accuracy since more data are available. Ideally, a batch would contain as many instances as possible, so as to represent the whole stream. In practice, the chunk size needs to be small enough to fit in main memory, otherwise the running time of the approach increases. Since UMAP is relatively slow, we choose small chunk sizes to mitigate this issue in UMAP-*k*NN. Based on the obtained results, we fix the chunk size to 400, which gives the best accuracy-memory trade-off.
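The following is a minimal sketch of this batch-incremental pipeline with the parameter values chosen above, assuming the umap-learn and scikit-multiflow packages; the synthetic stream and the exact bookkeeping are our own illustration, not the authors' code.

```python
import numpy as np
import umap
from skmultiflow.lazy import KNNClassifier

rng = np.random.default_rng(0)
X_stream = rng.normal(size=(4000, 500))          # stand-in for a real stream
y_stream = rng.integers(0, 2, size=4000)

CHUNK_SIZE = 400                                 # best accuracy-memory trade-off
knn = KNNClassifier(n_neighbors=5, max_window_size=1000)

correct = total = 0
for start in range(0, len(X_stream), CHUNK_SIZE):
    X_chunk = X_stream[start:start + CHUNK_SIZE]
    y_chunk = y_stream[start:start + CHUNK_SIZE]
    # Embed the whole chunk at once: UMAP is (re)fitted on every batch.
    emb = umap.UMAP(n_neighbors=15, n_components=3).fit_transform(X_chunk)
    half = len(emb) // 2
    knn.partial_fit(emb[:half], y_chunk[:half])  # first half: training
    y_pred = knn.predict(emb[half:])             # second half: prediction
    correct += int(np.sum(y_pred == y_chunk[half:]))
    total += len(y_pred)

print("stream accuracy:", correct / total)
```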

We investigate the behavior of a crucial parameter controlling UMAP, the number of neighbors, through the classification performance of our approach. Based

<sup>3</sup> https://scikit-multiflow.github.io/.

**Fig. 4.** Comparison of UMAP-*k*NN, tSNE-*k*NN, PCA-*k*NN, and *k*NN (with the entire datasets) while projecting into 3 dimensions: (a) Accuracy. (b) Memory. (c) Time.

on the size of the neighborhood, UMAP constructs the manifold and focuses on preserving local and global structures. Figure 3b shows the accuracy as the number of neighbors is varied on diverse datasets. For all datasets, the accuracy stays roughly constant, with no large differences (e.g., Har). Since a large neighborhood leads to a slower learning process, in the following we fix the neighborhood size to 15.

tSNE is a visualization technique, so it can only project high-dimensional data into 2 or 3 dimensions. For a fair comparison of our proposal against tSNE-*k*NN and PCA-*k*NN, we therefore project data into a 3-dimensional space. Figure 4a shows that UMAP-*k*NN makes significantly more accurate predictions, consistently beating the best performing baselines (tSNE-*k*NN and PCA-*k*NN), notably on CNAE and the Tweets datasets. Figure 4b depicts the amount of memory needed by the three algorithms, which is practically the same for some datasets. Compared to *k*NN operating on the whole data without projection, UMAP-*k*NN consumes much less memory while sacrificing a little accuracy, since many attributes are removed. Figure 4c shows that our approach is consistently faster than tSNE-*k*NN, because tSNE computes the distances between every pair of instances to project. PCA-*k*NN is a bit faster still, thanks to the simplicity of PCA, but given this trade-off our approach performs well on almost all datasets.

In addition to its good classification performance in comparison with competitors, the batch-incremental UMAP-*k*NN does a better job of preserving density by capturing both global and local structures, as shown in Fig. 1d. The fact that UMAP and *k*NN are both neighborhood-based methods emerges as a key element in achieving good accuracy. UMAP not only has the power of visualization but also the ability to reduce the dimensionality of data efficiently, which makes it useful as a pre-processing technique for machine learning.

Table 2 reports the comparison of UMAP-*k*NN against state-of-the-art classifiers. Our approach performs better on almost all datasets. It achieves accuracies similar to UMAP-S*k*NN on several datasets, but in terms of resources the latter is slower because of its drift detection mechanism.


**Table 2.** Comparison of UMAP-*k*NN, PCA-*k*NN, UMAP-S*k*NN, and UMAP-HAT.

UMAP-*k*NN performs better than PCA-*k*NN (e.g., on the Tweets datasets) at the cost of being slower. We also observe that UMAP-HAT fails to outperform our approach (in terms of accuracy, memory, and time), which we attribute to the integration of a neighborhood-based technique (UMAP) into a tree structure (HAT).

Figure 5 reports detailed results for the Tweet1 dataset with five output dimensions. Figure 5a exhibits the accuracy of our approach, which is consistently above

**Fig. 5.** Comparison of UMAP-*k*NN, PCA-*k*NN, UMAP-S*k*NN, and UMAP-HAT over different output dimensions on Tweet1: (a) Accuracy. (b) Memory. (c) Time.

competitors whilst remaining stable across different manifolds. Figures 5b and c show that the *k*NN-based classifiers use much fewer resources than the tree-based UMAP-HAT. UMAP-*k*NN requires less time than UMAP-HAT and UMAP-S*k*NN to process the stream, but PCA-*k*NN is the fastest thanks to its simplicity. Still, the gain in accuracy with UMAP-*k*NN is more significant.

### **5 Concluding Remarks and Future Work**

In this paper, we presented a novel batch-incremental approach for mining data streams using the *k*NN algorithm. UMAP-*k*NN combines the simplicity of *k*NN with the high performance of UMAP, used as an internal pre-processing step to reduce the feature space of data streams. In an extensive evaluation against well-known state-of-the-art algorithms on various datasets, we showed that UMAP can efficiently embed data streams within a batch-incremental strategy. We further demonstrated that the batch-incremental approach is just as effective as the offline approach for visualization, and that its accuracy outperforms reputed baselines while using reasonable resources.

We would like to pursue this promising approach further, enhancing its runtime performance by applying a fast dimensionality reduction before UMAP. Another direction for future work is a different mechanism, such as applying UMAP to each incoming instance inside a sliding window. We believe this may be slow, but it would be suited to instance-incremental learning.

### **References**



# GraphMDL: Graph Pattern Selection Based on Minimum Description Length

Francesco Bariatti(B), Peggy Cellier, and Sébastien Ferré

Univ Rennes, INSA, CNRS, IRISA, Campus de Beaulieu, Rennes, France {francesco.bariatti,peggy.cellier,sebastien.ferre}@irisa.fr

Abstract. Many graph pattern mining algorithms have been designed to identify recurring structures in graphs. The main drawback of these approaches is that they often extract too many patterns for human analysis. Recently, pattern mining methods using the *Minimum Description Length* (MDL) principle have been proposed to select a characteristic subset of patterns from transactional, sequential and relational data. In this paper, we propose an MDL-based approach for selecting a characteristic subset of patterns on labeled graphs. A key notion in this paper is the introduction of *ports* to encode connections between pattern occurrences without any loss of information. Experiments show that the number of patterns is drastically reduced. The selected patterns have complex shapes and are representative of the data.

Keywords: Pattern mining · Graph mining · Minimum Description Length

## 1 Introduction

Many fields have complex data that need labeled graphs, i.e. graphs where vertices and edges have labels, for an accurate representation. For instance, in chemistry and biology, molecules are represented as atoms and bonds; in linguistics, sentences are represented as words and dependency links; in the semantic web, knowledge graphs are represented as entities and relationships. Depending on the domain, graph datasets can consist of large graphs or large collections of graphs. Analyzing graphs to extract knowledge is complex; identifying their frequent structures, for instance, helps make them more intelligible.

In the field of pattern mining, there have been a number of proposals, namely *graph mining* approaches, to extract frequent subgraphs. Classical approaches to graph mining, e.g. gSpan [12] and Gaston [7], work on collections of graphs and generate all patterns w.r.t. a frequency threshold. The major drawback of this kind of approach is the huge number of generated patterns, which makes them difficult to analyze. Some approaches, such as CloseGraph [13], reduce the number of patterns by only generating *closed patterns*. However, the set of closed patterns generally remains too large, with a lot of redundancy between patterns. *Constraint-based* approaches, such as gPrune [14], reduce the number of extracted patterns by extracting only the patterns following a certain acceptance rule. These algorithms generally manage to reduce the number of patterns; however, they also limit the type of patterns. Additionally, if the acceptance rule is user-provided, the user needs some background knowledge of the data.

More effective approaches to reduce the number of patterns are those based on the *Minimum Description Length* (MDL) principle [3]. The MDL principle comes from information theory, and states that the *model* that describes the data the best is the one that compresses the data the best. It has been shown on sets of items [10], sequences [9], and relations [4] that an MDL-based approach can select a small and descriptive subset of patterns. Few MDL-based approaches have been proposed for graphs. SUBDUE [1] iteratively compresses a graph by replacing each occurrence of a pattern by a single vertex. At each step, the chosen pattern is the one that compresses the most. The drawback of SUBDUE is that the replacement of pattern occurrences by vertices entails a loss of information. VoG [5] summarizes graphs as a composition of predefined families of patterns (e.g., paths, stars). Like SUBDUE, VoG aims to extract only "interesting" patterns, but instead of evaluating each pattern individually as SUBDUE does, it evaluates the set of extracted patterns as a whole. This allows the algorithm to find a "good set of patterns" instead of a "set of good patterns". One limitation of VoG is that the types of patterns are restricted to predefined ones. Another limitation is that VoG works on unlabeled graphs (e.g., network graphs), while we are interested in labeled graphs.

The contribution of this paper (Sect. 3) is a novel approach called GraphMDL, leveraging the MDL principle to select graph patterns from labeled graphs. Contrary to SUBDUE, GraphMDL ensures that there is no loss of information, thanks to the introduction of the notion of *ports* associated with graph patterns. Ports represent how adjacent occurrences of patterns are connected. We evaluate our approach experimentally (Sect. 4) on two datasets with different kinds of graphs: one on AIDS-related molecules (few labels, many cycles), and the other on dependency trees (many labels, no cycles). Experiments validate our approach by showing that the data can be significantly compressed and that the number of selected patterns is drastically reduced compared to the number of candidate patterns. Moreover, we observe that the patterns can have complex and varied shapes, and are representative of the data.

## 2 Background Knowledge

### 2.1 The MDL Principle

The *Minimum Description Length* (MDL) principle [3] is a technique from the domain of information theory that allows selecting, from a family of models, the model that best describes some data. The MDL principle states that the best model $M$ for describing some data $D$ is the one that minimizes the *description length* $L(M, D) = L(M) + L(D|M)$, where $L(M)$ is the length of the model and $L(D|M)$ the length of the data encoded with the model. The MDL principle does

Fig. 1. A labeled undirected simple graph.


Fig. 2. Embeddings of a pattern in the graph of Fig. 1.

Fig. 3. Two singleton patterns.

not define how to compute every possible description length. However, common primitives exist for data and distributions [6]:

- An element $x$ drawn uniformly from a finite set $X$ can be encoded with $L(x) = \log(|X|)$ bits.
- An element $x$ can be encoded with an optimal prefix code based on its relative usage, with $L(x) = -\log(P(x))$ bits, where $P(x)$ is the ratio of the usage of $x$ to the total usage.
- An arbitrarily large integer $n$ can be encoded with a universal integer encoding, with $L_{\mathbb{N}}(n)$ bits.<sup>1</sup>
Description lengths of elements that are common to all models are usually ignored, since they do not affect their comparison.
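As a quick illustration, these primitives translate directly into code; in this sketch the function names are ours, and $L_{\mathbb{N}}$ follows the shifted Elias gamma length of footnote 1.

```python
import math

def l_uniform(domain_size: int) -> float:
    """Bits for an element drawn uniformly from a set of the given size."""
    return math.log2(domain_size)

def l_usage(usage: int, total_usage: int) -> float:
    """Bits for an element under an optimal prefix code based on usage."""
    return -math.log2(usage / total_usage)

def l_integer(n: int) -> float:
    """Universal integer code length: 2*floor(log2(n + 1)) + 1 bits."""
    return 2 * math.floor(math.log2(n + 1)) + 1
```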

Krimp [10] is a pattern mining algorithm using the MDL principle to select a "characteristic" set of itemset patterns from a transactional database. Because of its good performance, Krimp has been adapted to other types of data, such as sequences [9] and relational databases [4]. In our approach, we redefine Krimp's key concepts on graphs in order to apply a Krimp-like approach to graph mining.

#### 2.2 Graphs and Graph Patterns

**Definition 1.** *A* labeled graph *$G = (V, E, l_V, l_E)$ over two label sets $L_V$ and $L_E$ is a data structure composed of a set of* vertices *$V$, a set of* edges *$E \subseteq V \times V$, and two* labeling functions *$l_V \in V \to 2^{L_V}$ and $l_E \in E \to L_E$ that associate a set of labels to vertices, and one label to edges.*

*$G$ is said* undirected *if $E$ is symmetric, and* simple *if $E$ is irreflexive.*

Although our approach applies to all labeled graphs, in the following we only consider undirected simple graphs, so as to compare ourselves with existing tools and benchmarks. Figure 1 shows an example graph with 8 vertices and 7 edges, defined over the vertex label set {W, X, Y, Z} and the edge label set {a, b}. Unlike usual definitions in graph mining, our definition allows vertices to have several labels or none, because this makes the approach applicable to more datasets.

<sup>1</sup> In our implementation we use *Elias gamma encoding* [2], shifted by 1 so that it can encode 0. Therefore $L_{\mathbb{N}}(n) = 2\lfloor\log(n + 1)\rfloor + 1$.


Fig. 4. Example of a GraphMDL code table over the graph of Fig. 1. Pattern and port usages, and code lengths have been added as illustration and are not part of the table definition. Unused singleton patterns are omitted.

**Definition 2.** *Let $G^P$ and $G^D$ be graphs. An* embedding *(or* occurrence*) of $G^P$ in $G^D$ is an injective function $\varepsilon \in V^P \to V^D$ such that: (1) $l^P_V(v) \subseteq l^D_V(\varepsilon(v))$ for all $v \in V^P$; (2) $(\varepsilon(u), \varepsilon(v)) \in E^D$ for all $(u, v) \in E^P$; and (3) $l^P_E(e) = l^D_E(\varepsilon(e))$ for all $e \in E^P$.*

We define *graph patterns* as graphs $G^P$ having some occurrences in the data graph $G^D$. Figure 2 shows the three embeddings $\varepsilon_1$, $\varepsilon_2$, $\varepsilon_3$ of a two-vertex graph pattern into the graph of Fig. 1. We define *singleton patterns* as the elementary patterns. A *vertex singleton pattern* is a graph with one vertex having one label. An *edge singleton pattern* is a graph with two unlabeled vertices, connected by a single labeled edge. Figure 3 shows examples of singleton patterns.
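As an illustration of Definition 2, the sketch below checks the three conditions for a candidate mapping; the dictionary-based graph representation is an assumption of this sketch, not the paper's implementation.

```python
def is_embedding(eps, pattern, data):
    """Check whether eps (pattern vertex -> data vertex) is an embedding.

    Graphs are modeled here as {"vl": {vertex: set_of_labels},
    "el": {(u, v): edge_label}}, each undirected edge stored once.
    """
    # Injectivity: no two pattern vertices map to the same data vertex.
    if len(set(eps.values())) != len(eps):
        return False
    # (1) Pattern vertex labels are a subset of the image vertex's labels.
    for v, labels in pattern["vl"].items():
        if not labels <= data["vl"].get(eps[v], set()):
            return False
    # (2) + (3) Every pattern edge exists in the data with the same label.
    for (u, v), lab in pattern["el"].items():
        image = data["el"].get((eps[u], eps[v])) or data["el"].get((eps[v], eps[u]))
        if image != lab:
            return False
    return True
```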

## 3 GraphMDL: MDL for Graphs

In this section we present our contribution: the GraphMDL approach. This approach takes as input a graph, the *original graph* $G^o$, and a set of patterns extracted from that graph, the *candidate patterns*, and outputs the most descriptive subset of candidate patterns according to the MDL principle. The candidates can be generated with any graph mining algorithm, e.g. gSpan [12].

The intuition behind GraphMDL is that since data and patterns are both graphs, the data can be seen as a composition of pattern embeddings. Informally, we want a user analyzing the output of GraphMDL to be able to say "the data is composed of one occurrence of pattern A, connected to one occurrence of pattern B, which is itself connected to one occurrence of pattern C". Moreover, we want the user to be able to tell *how* these structures are connected together: which vertices of each pattern are used to connect it to other patterns.

### 3.1 Model: A Code Table for Graph Patterns

Similarly to Krimp [10], we define our model as a *Code Table* (CT), i.e. a set P of patterns with associated coding information. A first difference with Krimp is that the patterns are graph patterns. A second difference is the need for additional coding information: a single code would not suffice since all the information related to connectivity between pattern occurrences would be lost.

Fig. 5. How the data graph of Fig. 1 is encoded with the code table of Fig. 4. *(a)* Retained occurrences of CT patterns. *(b)* The rewritten graph. Blue squares are pattern embeddings (their label indicates the pattern), white circles are port vertices. Edge labels indicate which pattern port corresponds to each port vertex. (Color figure online)

We therefore introduce the notion of *ports* in order to represent how pattern embeddings connect to each other to form the original graph. The set of ports of a pattern is a subset of the vertices of the pattern. Intuitively, a pattern vertex is a port if at least one pattern embedding maps this vertex to a vertex in the original graph that is also used by another embedding (be it of the same pattern or a different one). For example, in Fig. 5a the three occurrences of pattern P1 are inter-connected through their middle vertex: this vertex is a port. Since port information increases the description length, we expect our approach to select patterns with few ports.

Figure 4 shows an example of CT associated to the graph of Fig. 1. Every row of the CT is composed of three parts, and contains information about a pattern $P \in \mathcal{P}$ (e.g. the first row contains information about pattern $P_1$). The first part of a row is the graph $G^P$, which represents the structure of the pattern (e.g. $P_1$ is a pattern with three labeled vertices and two labeled edges). The second part of a row is the code $c_P$ associated to the pattern. The third part of a row is the description of the port set of the pattern, $\Pi_P$ (e.g. $P_1$ has two ports, its first two vertices, with codes of 2 and 0.42 bits<sup>2</sup>). We note $\Pi$ the set of all ports of all patterns. Like Krimp, the length of the code of a pattern or port depends on its usage in the encoding of the data, i.e. how many times it is used to describe the original graph $G^o$ (e.g. $P_1$ has a code of 1 bit because it is used 3 times and the sum of pattern usages in the CT is 6, see Sects. 3.2 and 3.3).

#### 3.2 Encoding the Data with a Code Table

The intuition behind GraphMDL is that we can represent the original graph $G^o$ (i.e. the data) as a set of pattern occurrences, connected via ports. Encoding the data with a CT consists in creating a structure that makes explicit which occurrences are used and how they interconnect to form the original graph. We call this structure the *rewritten graph* $G^r$.

<sup>2</sup> MDL approaches deal with *theoretical* code lengths, which may not be integers.

**Definition 3.** *A* rewritten graph *$G^r = (V^r, E^r, l^r_V, l^r_E)$ is a graph where the set of vertices is $V^r = V^r_{emb} \cup V^r_{port}$: $V^r_{emb}$ is the set of* pattern embedding vertices *and $V^r_{port}$ is the set of* port vertices*. $E^r \subseteq V^r_{emb} \times V^r_{port}$ is the set of edges from embeddings to ports, and $l^r_V \in V^r_{emb} \to \mathcal{P}$ and $l^r_E \in E^r \to \Pi$ are the labelings.*

In order to compute the encoding of the data graph with a given CT, we start with an empty rewritten graph. One after another, we select patterns from the CT. For each pattern, we compute the occurrences of its graph $G^P$. Similarly to Krimp, we limit embedding overlaps: we admit overlap on vertices (since it is the key notion behind ports), but we forbid edge overlaps.

Each retained embedding is represented in the rewritten graph by a *pattern embedding vertex*: a vertex $v_e \in V^r_{emb}$ with a label $P \in \mathcal{P}$ indicating which pattern it instantiates. Vertices that are shared by several embeddings are represented in the rewritten graph by a *port vertex* $v_p \in V^r_{port}$. We add an edge $(v_e, v_p) \in E^r$ between the pattern embedding vertex $v_e$ of a pattern $P$ and the port vertex $v_p$ when the embedding associated to $v_e$ maps the pattern's port $v_\pi \in \Pi_P$ to $v_p$. We label this edge $v_\pi$.

We make sure that code tables always include all singleton patterns, so that they can always encode any vertex and edge of the original graph.

Figure 5 shows the graph of Fig. 1 encoded with the CT of Fig. 4. Embeddings of CT patterns become pattern embedding vertices in the rewritten graph (blue squares). Vertices that are at the boundary between multiple embeddings become port vertices in the rewritten graph (white circles). When an embedding has a port, its pattern embedding vertex in the rewritten graph is connected to the corresponding port vertex, and the edge label indicates which pattern port it is. For instance, the three retained occurrences of pattern P1 all share the same vertex labeled Y (middle of the original graph), so in the rewritten graph the three corresponding pattern embedding vertices are connected to the same port vertex via port $v_2$.

#### 3.3 Description Lengths

In this section we define how to compute the description length of the CT and the rewritten graph. Description lengths are used to compare CTs. Formulas are explained below and grouped in Fig. 6.

Code Table. The description length L(M) = L(CT) of a CT is the sum of the description lengths of its rows (skipping rows with unused patterns), and every row is composed of three parts: the pattern graph structure, the pattern code, and the pattern port description.

To describe the structure $G = G^P$ of a pattern ($L(G)$) we start by encoding the number of vertices of the pattern. Then we encode the vertices one after the other. For each vertex $v$, we encode its labels, then its adjacent edges. To encode the vertex labels ($L_V(v, G)$) we specify their number first, then the labels themselves. To encode the adjacent edges ($L_E(v, G)$) we specify their number (between 0 and $|V| - 1$ in a simple graph), then for each edge, its destination

Fig. 6. Formulas used for computing description lengths. The structure $G^P = (V^P, E^P, l^P_V, l^P_E)$ is shortened to $G = (V, E, l_V, l_E)$ for ease of reading.

vertex and its label. To avoid encoding the same edge twice, we decide, in undirected graphs, to encode edges with the vertex with the smallest identifier. Vertex and edge labels are encoded based on their relative usage in the original graph $G^o$ ($L^{usage}_{L_V}(l, G^o)$ and $L^{usage}_{L_E}(l_E(v, w), G^o)$). Since this encoding does not change between CTs, it is a meaningful way to compare them.

The second element of a CT row is the code $c_P$ associated to the pattern ($L(c_P)$). This code is based on the usage of the pattern in the rewritten graph.

The last element of a CT row is the description of the pattern's ports ($L(\Pi_P)$). First, we encode the number of the pattern's ports (between 0 and $|V|$). Then we specify which vertices are ports: if there are $k$ ports, then there are $\binom{|V|}{k}$ possibilities. Finally, we encode the port codes ($L(c_\pi, P)$): their code is based on the usage of the port in the rewritten graph w.r.t. the other ports of the pattern.

Rewritten Graph. The rewritten graph has two types of vertices: port vertices and pattern embedding vertices. Port vertices do not have any associated information, so we just need to encode their number. The description length $L(D|M) = L(G^r)$ of the rewritten graph is the length needed for encoding the number of port vertices plus the sum of the description lengths $L_{emb}(v, P, G^r)$ of the pattern embedding vertices $v$. Every pattern embedding vertex has a label $l^r_V(v)$ specifying its pattern $P$, encoded with the code $c_P$ of the pattern. We then encode the number of edges of the vertex, i.e. the number of ports of this particular embedding (between 0 and $|\Pi_P|$). Then for each edge we encode the port vertex to which it is connected and to which port it corresponds (using the port code $c_\pi$).
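Transcribing this paragraph into formulas gives the following; this is a reconstruction from the prose above, the authoritative formulas being those grouped in Fig. 6:

$$L(D|M) = L(G^r) = L_{\mathbb{N}}(|V^r_{port}|) + \sum_{v \in V^r_{emb}} L_{emb}(v, P, G^r)$$

$$L_{emb}(v, P, G^r) = L(c_P) + \log(|\Pi_P| + 1) + \sum_{(v, v_p) \in E^r} \big(\log(|V^r_{port}|) + L(c_\pi)\big)$$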


Table 1. Characteristics of the datasets used in the experiments

## 3.4 The GraphMDL Algorithm

In the previous subsections we presented the MDL definitions that GraphMDL uses to evaluate pattern sets (CTs). A naive algorithm for finding the most descriptive pattern set (in the MDL sense) would create a CT for every possible subset of candidates and retain the one yielding the smallest description length. However, such an approach is usually infeasible because of the large number of possible subsets. That is why GraphMDL applies a greedy heuristic, adapting the Krimp algorithm [10] to our MDL definitions.

Like Krimp, our algorithm starts with a CT composed of all singletons, which we call $CT_0$. One after the other, candidates are added to the CT if they lower the description length. Two heuristics guide GraphMDL: the candidate order and the order of patterns in the CT. We use the same heuristics as Krimp, with the difference that we define the size of a pattern as its total number of labels (vertices and edges). We also implement Krimp's "post-acceptance pruning": after a pattern is accepted into the CT, GraphMDL verifies whether removing some patterns from the CT lowers the description length $L(M, D)$.
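The greedy loop can be summarized as follows, under our own simplifications: a code table is modeled as a plain set of patterns, and `description_length` stands in for $L(M, D)$; neither is the authors' actual API.

```python
def graph_mdl(candidates, singletons, description_length):
    """Krimp-style greedy selection: keep a candidate only if it compresses."""
    ct = set(singletons)                    # CT0: the singleton-only code table
    best = description_length(ct)
    for p in candidates:                    # candidates in heuristic order
        if description_length(ct | {p}) < best:
            ct.add(p)
            best = description_length(ct)
            # Post-acceptance pruning: drop patterns that no longer help.
            for q in list(ct - set(singletons)):
                if description_length(ct - {q}) < best:
                    ct.discard(q)
                    best = description_length(ct)
    return ct
```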

## 4 Experimental Evaluation

In order to evaluate our proposal, we developed a prototype of GraphMDL. The prototype was developed in Java 1.8 and is available as a git repository<sup>3</sup>.

### 4.1 Datasets

The first two datasets that we use, AIDS-CA and AIDS-CM, are part of the National Cancer Institute AIDS antiviral screen data<sup>4</sup>. They are collections of graphs often used to compare graph mining algorithms [11]. Graphs of this collection represent molecules: vertices are atoms and edges are bonds. We stripped all hydrogen atoms from the molecules, since their positions can be inferred.

We took our third dataset, UD-PUD-En, from the Universal Dependencies project<sup>5</sup>. This project curates a collection of trees describing dependency

<sup>3</sup> https://gitlab.inria.fr/fbariatt/graphmdl.

<sup>4</sup> https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data.

<sup>5</sup> https://universaldependencies.org/.


Table 2. Experimental results for different candidate sets

relationships between words of sentences of multiple corpora in multiple languages. We used the trees corresponding to the English version of the PUD corpus.

Table 1 presents the main characteristics of the three datasets that we use: the number of elementary graphs in the dataset, the total amount of vertices, the total amount of edges, the number of different vertex labels, and the number of different edge labels. Since GraphMDL works on a single graph instead of a collection, we aggregate collections into a single graph with multiple connected components when needed. We generate candidate patterns by using a gSpan implementation available on its author's website<sup>6</sup>.

### 4.2 Quantitative Evaluation

Table 2 presents the results of the first experiment. For instance, the first line shows that we ran GraphMDL on the AIDS-CA dataset, with as candidates the 2194 patterns generated by gSpan for a support threshold of 20%. Our approach took 19 min to select a CT composed of 115 patterns, yielding a description length that is 24% of the description length obtained by the singleton-only $CT_0$. The selected patterns have a median of 9 labels and 3 ports.

We observe that the number of patterns of a CT is often significantly smaller than the number of candidates. This is particularly remarkable for experiments run with small support thresholds, where GraphMDL reduces the number of patterns by up to a factor of 300: patterns generated at these support thresholds probably contain a lot of redundancy, which GraphMDL avoids.

We also note that the description lengths of the CTs found by GraphMDL are between 20% and 40% of the lengths of the baseline code tables $CT_0$, which shows that our algorithm succeeds in finding regularities in the data. Description

<sup>6</sup> https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm.

lengths are smaller when the number of candidates is higher: this may be because with more candidates, there are more chances of finding "good" candidates that reduce the description length further.

Fig. 7. How GraphMDL (left) and SUBDUE (right) encode one of AIDS-CM graphs.

We observe that GraphMDL can find patterns of non-trivial size, as shown by the median label count in Table 2. Also, most patterns have few ports, which shows that GraphMDL manages to find models in which the original graph is described as a set of components with few connections between them. We think that a human will interpret such a model more easily, as opposed to a model composed of "entangled" components.

### 4.3 Qualitative Evaluations

*Interpretation of Rewritten Graphs.* Figure 7 shows how GraphMDL uses patterns selected on the AIDS-CM dataset to encode one of the graphs of the dataset (more results are available in our git repository). It illustrates the key idea behind our approach: find a set of patterns so that each one describes part of the data, and connect their occurrences via ports to describe the whole data.

We observe that GraphMDL selects bigger patterns (such as P2), describing big chunks of data, as well as smaller patterns (such as P3, an edge singleton) that can form bridges between pattern occurrences. Big patterns increase the description length of the CT but describe more of the data in a single occurrence, whereas small patterns do the opposite. Following the MDL principle, GraphMDL finds a good balance between the two types of patterns.

It is interesting to note that pattern P1 in Fig. 7 corresponds to the carboxylic acid functional group, common in organic chemistry. GraphMDL selected this pattern without any prior knowledge of chemistry, solely by using MDL.

*Comparison with SUBDUE.* On the right of Fig. 7 we can observe the encoding found by SUBDUE on the same graph. The main disadvantage of SUBDUE is information loss: we can see that the data is composed of two occurrences of pattern P1, but not how these two occurrences are connected. Thanks to the notion of ports, GraphMDL does not suffer from this problem: the user can exactly know which atoms lie at the boundary of each pattern occurrence.


Table 3. Classification accuracies. Results of methods marked with \* are from [8].

*Assessing Patterns Through Classification.* We showed in the previous experiments that GraphMDL manages to reduce the number of patterns, and that the introduction of ports allows for a precise analysis of graphs. We now ask whether the extracted patterns are *characteristic* of the data. To evaluate this aspect, we adopt the classification approach used by Krimp [10]. We apply GraphMDL independently on each class of a multi-class dataset, and then use the resulting CTs to classify each graph: we encode it with each of the CTs, and classify it in the class whose CT yields the smallest description length $L(D|M)$. Since GraphMDL is not designed with classification in mind, we would expect existing classifiers to outperform it. In particular, note that patterns are selected on each class independently of the other classes. Indeed, GraphMDL follows a descriptive approach, whereas classifiers generally follow a discriminative approach. Table 3 presents the results of this new experiment. We compare GraphMDL with graph classification algorithms found in the literature [8], and with a baseline that classifies all graphs as belonging to the largest class. The AIDS-CA/CI dataset is composed of the CA class of the AIDS dataset and a same-size, same-labels random sample from the CI class (corresponding to negative examples). The other datasets<sup>7</sup> are from [8]. We performed a 10-fold cross-validation repeated 10 times and report average accuracies and standard deviations.

GraphMDL clearly outperforms the baseline on two datasets, AIDS and Mutag, but is only comparable to the baseline on the PTC datasets. On Mutag, GraphMDL is less accurate than the other classifiers, but closer to them than to the baseline. On the PTC datasets, we hypothesize that the learned descriptions are not discriminative w.r.t. the chosen classes, although they are characteristic enough to reduce the description length. Nonetheless, results are still better than random guessing (which would have 50% accuracy). An interesting point of GraphMDL

<sup>7</sup> For concision, we do not report on PTC-{MM,FM}, they yield similar results.

classification is that it is explainable: the user can look at how the patterns of the two classes encode a graph (similarly to Fig. 7) and understand *why* one class is chosen over another.
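The decision rule itself is a one-liner; in this hedged sketch, `encode_length` stands in for computing $L(D|M)$ of a graph under a given code table (an assumed helper, not the authors' API).

```python
def classify(graph, code_tables, encode_length):
    """Pick the class whose code table encodes the graph most compactly.

    code_tables: dict mapping class label -> CT learned on that class.
    """
    return min(code_tables, key=lambda c: encode_length(graph, code_tables[c]))
```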

## 5 Conclusion

In this paper, we have proposed GraphMDL, an MDL-based pattern mining approach to select a representative set of graph patterns on labeled graphs. We proposed MDL definitions that allow computing the description lengths necessary to apply the MDL principle. The originality of our approach lies in the notion of *ports*, which guarantees that the original graph can be perfectly reconstructed, i.e., without any loss of information. Our experiments show that GraphMDL significantly reduces the number of patterns w.r.t. complete approaches. Further, the selected patterns can have complex shapes with simple connections. The introduction of the notion of ports facilitates interpretation w.r.t. SUBDUE. We plan to apply our approach to more complex graphs, e.g. knowledge graphs.

## References



# **Towards Content Sensitivity Analysis**

Elena Battaglia, Livio Bioglio, and Ruggero G. Pensa(B)

Department of Computer Science, University of Turin, Turin, Italy *{*elena.battaglia,livio.bioglio,ruggero.pensa*}*@unito.it

**Abstract.** With the availability of user-generated content in the Web, malicious users have at their disposal huge repositories of private (and often sensitive) information regarding a large part of the world's population. The self-disclosure of personal information, in the form of text, pictures, and videos, exposes the authors of such contents (and not only them) to many criminal acts such as identity theft, stalking, burglary, fraud, and so on. In this paper, we propose a way to evaluate the harmfulness of any form of content by defining a new data mining task called *content sensitivity analysis*. According to our definition, a score can be assigned to any object (text, picture, video...) according to its degree of sensitivity. Even though the task is similar to sentiment analysis, we show that it has its own peculiarities and may lead to a new branch of research. Through some preliminary experiments, we show that content sensitivity analysis cannot be addressed as a simple binary classification task.

**Keywords:** Privacy · Text mining · Text categorization

## **1 Introduction**

Internet privacy has gained much attention in the last decade due to the success of online social networks and other social media services that expose our lives to the wide public. In addition to personal and behavioral data collected more or less legitimately by companies and organizations, many websites and mobile/web applications store and publish tons of user-generated content in the form of text posts and comments, pictures, and videos which, very often, capture and represent private moments of our life. The availability of user-generated content is a huge source of relatively easy-to-access private (and often very sensitive) information concerning habits, preferences, families and friends, hobbies, health, and philosophy of life, which exposes the authors of such contents (or any other individual referenced by them) to many (cyber)criminal risks, including identity theft, stalking, burglary, fraud, cyberbullying, or "simply" discrimination in the workplace or in life in general. Sometimes users are not aware of the dangers due to the uncontrolled diffusion of their sensitive information and would probably avoid publishing it if only someone told them how harmful it could be.

In this paper, we address exactly this problem by proposing a way to measure the degree of sensitivity of any type of user-generated content. To this purpose, we define a new data mining task that we call *content sensitivity analysis* (CSA), inspired by sentiment analysis [13]. The goal of CSA is to assign a score to any object (text, picture, video...) according to the amount of sensitive information it potentially discloses. The problem of private content analysis has already been investigated as a way to characterize anonymous vs. non-anonymous content posting in specific social media [5,15,16] or question-and-answer platforms [14]. However, the link between anonymity and sensitive content is not that obvious: users may post anonymously because, for instance, they are referring to illegal matters (e.g., software/streaming piracy, black market, and so on); conversely, fully identifiable persons may post very sensitive content simply because they underestimate the visibility of their actions [18,19]. Although CSA has some points in common with anonymous content analysis and the well-known sentiment analysis task, we show that it has its own peculiarities and may lead to a brand new branch of research, opening many intriguing challenges in several computer science and linguistics fields.

Through some preliminary but extensive experiments on a large annotated corpus of social media posts, we show that content sensitivity analysis cannot be addressed straightforwardly. In particular, we design a simplified CSA task leveraging binary classification to distinguish between sensitive and non-sensitive posts, testing several bag-of-words and word embedding models. According to our experiments, the classification performances achieved by the most accurate models are far from satisfactory. This suggests that content sensitivity analysis should consider more complex linguistic and semantic aspects, as well as more sophisticated machine learning models.

The remainder of the paper is organized as follows: we report a short analysis of the related scientific literature in Sect. 2 and Sect. 3 provides the definition of content sensitivity analysis and presents some challenging aspects of this new task together with some hints for possible solutions; the preliminary experiments are reported and discussed in Sect. 4; finally, Sect. 5 concludes by also presenting some open problems and suggestions for future research.

## **2 Related Work**

With the success of online social networks and content sharing platforms, understanding and measuring the exposure of user privacy in the Web has become crucial [11,12]. Thus, many different metrics and methods have been proposed with the goal of assessing the risk of privacy leakage in posting activities [1,23]. Most research efforts, however, focus on measuring the overall exposure of users according to their privacy settings [8,19] or position within the network [18].

Very few research works address the problem of measuring the amount of sensitivity of user-generated content, and different definitions of sensitivity are adopted. In [5], for instance, the authors define the sensitivity of a social media post as the extent to which users think the post should be anonymous. Then, they try to understand the nature of content posted anonymously and analyze the differences between content posted on anonymous (e.g., Whisper) and non-anonymous (e.g., Twitter) social media sites. They also find significant linguistic differences between anonymous and non-anonymous content. A similar approach has been applied to posts collected from a famous question-and-answer website [14]. The authors of this work identify categories of questions for which users are more likely to exercise anonymity and analyze different machine learning models to predict whether a particular answer will be written anonymously. They also show that post sensitivity should be viewed as a nuanced measure rather than as a binary concept. In [2], the authors propose a ranking-based method for assessing the privacy risk emerging from textual contents related to sensitive topics, such as depression. They use latent topic models to capture the background knowledge of a hypothetical rational adversary who aims at targeting the most exposed users. Additionally, the results are exploited to inform and alert users about their risk of being targeted.

Similarly to sentiment analysis [13], valuable linguistic resources are needed to identify sensitive content in texts. To the best of our knowledge, the only works addressing this issue are [6,22], where the authors leverage prototype theory and traditional theoretical approaches to describe and evaluate a dictionary of privacy designed for content analysis. Dictionary categories are evaluated according to privacy-related categories from an existing content analysis tool, using a variety of text corpora.

The problem of sensitive content detection has also been investigated as a pattern recognition problem in images. In [25], the authors leverage massive collections of social images and their privacy settings to learn the object-privacy correlation and identify categories of privacy-sensitive objects automatically. To increase the accuracy and speed of the classifier, they propose a deep multi-task learning architecture that learns more representative deep convolutional neural networks and a more discriminative tree classifier. Additionally, they use the outcomes of this model to identify the most suitable privacy settings and/or blur sensitive objects automatically. This framework is further improved in [24], where the authors add a clustering-based approach that also incorporates the trustworthiness of the users granted access to the images into the prediction model.

Contrary to the above-mentioned works, in this paper we formally define the general task of *content sensitivity analysis* independently of the type of data to be analyzed. Additionally, we provide some suggestions for improving the accuracy of the results and show experimentally that the task is challenging and deserves further investigation and greater research efforts.

### **3 Content Sensitivity Analysis**

In this section, we introduce the new data mining task that we call *content sensitivity analysis* (CSA), aimed at determining the amount of privacy-sensitive content expressed in any user-generated content. We first distinguish two cases, namely *basic CSA* and *continuous CSA*, according to the outcome of the analysis (binary or continuous). Then, we identify a set of subtasks and discuss their theoretical and technical details. Before introducing the technical details of CSA, we briefly provide the intuition behind CSA by describing a motivating example.

## **3.1 Motivating Example**

To explain the main objectives of CSA and the scientific challenges associated with them, we consider the example in Fig. 1. To decide whether (and to what extent) the sentence is sensitive, an inference algorithm should be able to answer the following questions:

1. Who are the subjects of the disclosure?
2. Does the post contain spatiotemporal references?
3. Does the post mention potentially sensitive terms?
4. Does the post deal with a potentially sensitive topic?
5. What does the sentence structure imply about the subjects?
**Fig. 1.** An example of a potentially privacy-sensitive post.

By observing the post in Fig. 1, it is clear that: the post discloses information about the author and his friend Alice Green (1); the post contains spatiotemporal references ("now" and "General Hospital"), which are generally considered intrinsically sensitive (2); the post mentions "chemo", a potentially sensitive term (3); the sentence is related to "cancer", a potentially sensitive topic (4); and the sentence structure suggests that the two subjects of disclosure have cancer and are both about to start their first course of chemotherapy (5).

It is clear that reducing sensitivity to anonymity, as done in previous research [5,14], is only one side of the coin. Instead, CSA has much more in common with the famous *sentiment analysis* (SA) task, where the objective is to measure the "polarity" or "sentiment" of a given text [7,13]. However, while SA has a well-established theory and can count on a set of easy-to-access and easy-to-use tools, CSA has never been defined before. Therefore, apart from the known open problems in SA (such as sarcasm detection), CSA involves three new scientific challenges:


In the following, we give the formal definitions concerning CSA and some preliminary ideas on how to address the problem.

#### **3.2 Definitions**

Here, we provide the details regarding the formal framework of *content sensitivity analysis*. To this purpose, we consider generic user-generated contents, without specifying their nature (whether textual, visual or audiovisual). We will propose a definition of "sensitivity" further in this section. The simplest way to define CSA is as follows:

**Definition 1 (basic content sensitivity analysis).** *Given a user-generated object $o_i \in \mathcal{O}$, with $\mathcal{O}$ being the domain of all user-generated contents, the* basic content sensitivity analysis *task consists in designing a function $f_s : \mathcal{O} \to \{\mathrm{sens}, \mathrm{na}, \mathrm{ns}\}$, such that $f_s(o_i) = \mathrm{sens}$ iff $o_i$ is privacy-sensitive, $f_s(o_i) = \mathrm{ns}$ iff $o_i$ is not sensitive, and $f_s(o_i) = \mathrm{na}$ otherwise.*

The na value is required since the assignment of a correct sensitivity value could be problematic when dealing with controversial contents or borderline topics. In some cases, assessing the sensitivity of a content object is simply impossible without some additional knowledge, e.g., the conversation a post is part of, the identity of the author, and so on. In addition, sensitivity is not the same for all sensitive objects: a post dealing with health is certainly more sensitive than a post dealing with vacations, although both can be considered sensitive. This suggests that, instead of considering sensitivity as a binary feature of a text, a more appropriate definition of CSA should take different degrees of sensitivity into account, as follows:

**Definition 2 (continuous content sensitivity analysis).** *Let $o_i \in \mathcal{O}$ be a user-generated object, with $\mathcal{O}$ being the domain of all user-generated contents. The* continuous content sensitivity analysis *task consists in designing a function $f_s : \mathcal{O} \to [-1, 1]$, such that $f_s(o_i) = 1$ iff $o_i$ is maximally privacy-sensitive, $f_s(o_i) = -1$ iff $o_i$ is minimally privacy-sensitive, and $f_s(o_i) = 0$ iff $o_i$ has unknown sensitivity. The value $\sigma_i = f_s(o_i)$ is the* sensitivity score *of object $o_i$.*

According to this definition, sensitive objects have $0 < \sigma \le 1$, while non-sensitive posts have $-1 \le \sigma < 0$. In general, when $\sigma \approx 0$, the sensitivity of an object cannot be assessed confidently. Of course, by setting appropriate thresholds, a continuous CSA task can easily be turned into a basic CSA one.
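As an illustration of this reduction, the sketch below maps a continuous score to the three basic CSA labels; the threshold value is ours and purely illustrative.

```python
def basic_label(score: float, tau: float = 0.25) -> str:
    """Map a continuous sensitivity score in [-1, 1] to {sens, na, ns}."""
    if score >= tau:       # confidently sensitive
        return "sens"
    if score <= -tau:      # confidently non-sensitive
        return "ns"
    return "na"            # near zero: sensitivity cannot be assessed
```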

At this point, a congruent definition of "sensitivity" is required to set up the task correctly. Although different characterizations of privacy-sensitivity exist, there is no consistent and uniform theory [22]; so, in this work, we consider a more generic, flexible and application-driven definition of privacy-sensitive content.

**Definition 3 (privacy-sensitive content).** *A generic user-generated content object is privacy-sensitive if it makes the majority of users feel uncomfortable in writing or reading it because it may reveal some aspects of their own or others' private life to unintended people.*

Notice that "uncomfortableness" should not be guided by some moral or ethical judgement about the disclosed fact, but uniquely by its harmfulness towards privacy. Such a definition allows the adoption of the "wisdom of the crowd" principle in contexts where providing an objective definition of what is sensitive (and what is not sensitive) is particularly hard. Moreover, it has also an intuitive justification. Different social media may have different meaning of sensitivity. For instance, in a professional social networking site, revealing details about one's own job is not only tolerated, but also encouraged, while one may want to hide detailed information about her professional life in a generic photo-video sharing platform. Similarly, in a closed message board (or group), one may decide to disclose more private information than in open ones. Sensitivity towards certain topics also varies from country to country. As a consequence, function f*<sup>s</sup>* can be learnt according to an annotated corpus of content objects as follows.

**Definition 4 (sensitivity function learning).** *Let $O = \{(o_i, \sigma_i)\}_{i=1}^{N}$ be a set of $N$ annotated objects $o_i \in \mathcal{O}$ with their related sensitivity scores $\sigma_i \in [-1, 1]$. The goal of a sensitivity function learning algorithm is to search for a function $f_s : \mathcal{O} \to [-1, 1]$ such that $\sum_{i=1}^{N} \left(f_s(o_i) - \sigma_i\right)^2$ is minimum.*

The simplest way to address this problem is by setting up a regression (or classification, in the case of basic CSA) task. However, we will show in Sect. 4 that such an approach is unable to capture the actual manifold of sensitivity accurately. Hence, in the following sections, we present a fine-grained definition of CSA together with a list of open subproblems related to CSA, and provide some hints on how to address them.

#### **3.3 Fine-Grained Content Sensitivity Analysis**

In the previous section, we considered contents as monolithic objects with a sensitivity score associated to them. However, in general, any user-generated content object (text, video, picture) may contain both privacy-sensitive and privacy-insensitive elements. For instance, a long text post (or video) may deal with a non-sensitive topic, but the author may insert some references to her or his private life. Similarly, a user may post a picture of her own desk deemed to be anonymous, but some elements may disclose very private information (e.g., the presence of train tickets, drug paraphernalia, someone else's photo, and so on). Moreover, the same object (or some of its elements) may violate the privacy of multiple subjects, in different ways, including the author and other people mentioned in the content. For all these reasons, here we propose a fine-grained definition of content sensitivity analysis:

**Definition 5 (fine-grained content sensitivity analysis).** *Let $o_i \in \mathcal{O}$ be a user-generated content object. Let $E_i = \{e^i_j\}_{j=1}^{m_i} \subset \mathcal{E}$ be a set of $m_i \ge 1$* elements *(or* components*) that constitute the object $o_i$, with $\mathcal{E}$ being the domain of all possible elements. Let $P_i = \{p^i_k\}_{k=1}^{n_i} \subset \mathcal{P}$ be the set of $n_i \ge 1$* persons *(or* subjects*) mentioned in $o_i$, with $\mathcal{P}$ being the domain of all subjects. The* fine-grained content sensitivity analysis *task consists in designing a function $f_s : \mathcal{E} \times \mathcal{P} \to [-1, 1]$, such that $f_s(e^i_j, p^i_k) = 1$ iff $e^i_j$ is maximally privacy-sensitive for subject $p^i_k$, $f_s(e^i_j, p^i_k) = -1$ iff $e^i_j$ is minimally privacy-sensitive for subject $p^i_k$, and $f_s(e^i_j, p^i_k) = 0$ iff $e^i_j$ has unknown sensitivity for subject $p^i_k$. The value $\sigma^i_{jk} = f_s(e^i_j, p^i_k)$ is the* sensitivity score *of element $e^i_j$ towards subject $p^i_k$.*

Notice that $|E_i| \ge 1$, since each object contains at least one element (when $|E_i| = 1$, the only element $e^i_1$ corresponds to the object $o_i$ itself). Similarly, $|P_i| \ge 1$, because each object has at least the author as subject. In the example reported in Fig. 1, the post contains only one element (there is only one sentence) and concerns two subjects (the author and Alice Green). According to Definition 5 (and to what we said in Sect. 3.1), the sensitivity score of the post towards both the author and Alice Green will be high.

#### **3.4 Challenges and Possible Solutions**

Fine-grained content sensitivity analysis presents many scientific and technical challenges, and may benefit from the cross-fertilization of computational linguistics, machine learning, and semantic analysis. Addressing the problem of connecting sensitivity to specific subjects in texts requires the solution of many NLP tasks such as named entity recognition, relation extraction [21], and coreference resolution [4]. Additionally, concept extraction and topic modeling are important to understand whether a given text deals with sensitive content. To this purpose, privacy dictionaries [22] could provide valid support for tagging certain topics/terms as sensitive or non-sensitive. Sentiment analysis and emotion detection could also reveal private personality traits when related to contents associated with certain topics, persons, or categories of persons. Furthermore, elements in a sentence cannot simply be considered as separate entities: the connections between different parts of a text play an important role in determining the correct fine-grained sensitivity. It is clear that such a complex problem requires the availability of massive annotated text corpora and the design of robust machine learning algorithms to cope with the sparsity of the feature space. All these considerations apply to visual and audiovisual content as well; in addition, the intrinsic difficulty of handling multimedia data makes the above-mentioned challenges even harder and more computationally expensive.

In the next section, we show how the basic content sensitivity analysis setting can be modeled as a binary classification problem on text data; the different approaches we try achieve only scarce to moderate success, which shows the necessity of a more systematic and in-depth investigation of the problem.

## **4 Preliminary Experiments**

In this section, we report the results of some preliminary experiments aimed at showing the feasibility of content sensitivity analysis together with its difficulties. The experiments are conducted under the basic CSA framework (see Definition 1 in Sect. 3), with the only difference that we do not consider the "na" class. We set up a binary classification task to distinguish whether a given input text is privacy-sensitive or not. Before presenting the results, we first introduce the data and then provide the details of our experimental protocol.

### **4.1 Annotated Corpus**

Since all previous attempts at identifying sensitive text have leveraged user anonymity as a discriminant for sensitive content [5,14], there is no reliable annotated corpus that we can use as a benchmark. Hence, we construct our own dataset through a crowdsourcing experiment. We use one of the datasets described in [3], consisting of 9917 anonymized social media posts, mostly written in English, with a minimum length of 2 characters and a maximum length of 435 (the average length is 80). Thus, they represent typical short social media posts well. On the other hand, they are not annotated for the specific purpose of our experiment and, because of their shortness, they are also very difficult to analyze. Consequently, after discarding all useless posts (mostly incomprehensible ones), we set up a crowdsourcing experiment using a Telegram bot that, for each post, asks whether it is sensitive or not. As a third option, it was also possible to select "unable to decide". We collected annotations for 829 posts from 14 distinct annotators. For each annotated post, we retain the most frequently chosen annotation. Overall, 449 posts were tagged as non-sensitive, 230 as sensitive, and 150 as undecidable. Thus, the final dataset consists of the 679 posts of the first two categories (we discarded all 150 undecidable posts).
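A minimal sketch of this aggregation step (majority vote per post, then discarding the undecidable posts), assuming annotations are stored per post; all names and the data layout are ours:

```python
from collections import Counter

def aggregate(annotations):
    """Majority-vote aggregation: annotations is {post_id: [label, ...]}
    with labels in {'sensitive', 'non-sensitive', 'undecidable'}."""
    return {pid: Counter(labels).most_common(1)[0][0]
            for pid, labels in annotations.items()}

votes = {"p1": ["sensitive", "sensitive", "non-sensitive"],
         "p2": ["undecidable", "undecidable", "sensitive"]}
labels = aggregate(votes)
# keep only posts whose majority label is decidable
dataset = {p: l for p, l in labels.items() if l != "undecidable"}
```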

### **4.2 Datasets**

We consider two distinct types of document representation for the dataset: a bag-of-words model and four word vector models. To obtain the bag-of-words representation, we perform the following steps. First, we remove all punctuation characters from the terms contained in the input posts, as well as short terms (less than two characters) and terms containing digits. Then, we build the bag-of-words model with all remaining 2584 terms weighted by their *tfidf* score. Differently from classic text mining approaches, we deliberately exclude lemmatization, stemming, and stop word removal from text preprocessing, since those common steps would affect content sensitivity analysis negatively. Indeed, inflections (removed by lemmatization and stemming) and stop words (like "me", "myself") are important to decide whether a sentence reports personal thoughts or a private action/status. Hereinafter, the bag-of-words representation is referred to as *BW2584*.
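A possible implementation of this preprocessing with scikit-learn is sketched below; the token pattern (letters only, length at least two) is our reading of the filtering rules above, and neither a stop word list nor a stemmer is applied:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["I saw my doctor on Monday...", "Nice weather today!"]  # toy posts
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words=None,                        # keep "me", "myself", ...
    token_pattern=r"(?u)\b[A-Za-z]{2,}\b",  # letters only, length >= 2
)
X_bw = vectorizer.fit_transform(posts)      # tf-idf weighted bag-of-words
```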

The word vector representation, instead, is built using word vectors pre-trained on two billion tweets (corresponding to 42 billion tokens) with the *GloVe* (Global Vectors) model [17]. We use this word embedding method as it consistently outperforms both the continuous bag-of-words and skip-gram model architectures of word2vec [10]. In detail, we use three representations, here called *WV25*, *WV50* and *WV100*, with, respectively, 25, 50 and 100 dimensions<sup>1</sup>. Additionally, we build an ensemble by concatenating the three vector spaces. The latter representation is named *WVEns*.
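The sketch below shows one common way to obtain such post-level representations (the aggregation by averaging is our assumption, as the construction is not detailed above): average the GloVe vectors of a post's tokens for each dimensionality, then concatenate the three spaces to obtain *WVEns*. The toy vocabulary stands in for the pre-trained files of footnote 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the pre-trained GloVe files of footnote 1 (toy vocabulary)
vocab = ["hello", "doctor", "monday"]
glove = {d: {w: rng.normal(size=d) for w in vocab} for d in (25, 50, 100)}
posts = ["hello doctor", "monday again"]

def embed(post, vectors, dim):
    """Average the known word vectors of the post (zeros if none known)."""
    toks = [t for t in post.lower().split() if t in vectors]
    if not toks:
        return np.zeros(dim)  # empty representation, removed later (Sect. 4.2)
    return np.mean([vectors[t] for t in toks], axis=0)

wv = {d: np.vstack([embed(p, glove[d], d) for p in posts]) for d in (25, 50, 100)}
wv_ens = np.hstack([wv[25], wv[50], wv[100]])   # the 175-dimensional WVEns
```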

Finally, from all five datasets we removed all posts having an empty bag-of-words or word vector representation. This preprocessing step further reduces the size of the dataset to 611 posts (221 sensitive and 390 non-sensitive), but allows for a fair performance comparison.

### **4.3 Experimental Settings**

Each dataset obtained as described above is given as input to a set of six classifiers. In detail, we use k-NN, decision tree (DT), multi-layer perceptron (MLP), SVM, random forest (RF), and gradient boosted trees (GBT). We do not execute any systematic parameter selection procedure, since our main goal is not to compare the performances of the classifiers but, rather, to show the overall level of accuracy that can be achieved in a basic content sensitivity analysis task. Hence, we use the default parameters of each classifier.


<sup>1</sup> Pre-trained vectors are available at https://nlp.stanford.edu/projects/glove/.

All experiments are conducted by performing ten-fold cross-validation, using, for each iteration, nine folds as training set and the remaining fold as test set.
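A sketch of this protocol with scikit-learn defaults is shown below; the generated data stands in for the five representations and the binary labels, and the weighted F1 scorer is our assumption for "average F1-score":

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

classifiers = {
    "k-NN": KNeighborsClassifier(),
    "DT":   DecisionTreeClassifier(),
    "MLP":  MLPClassifier(),
    "SVM":  SVC(),
    "RF":   RandomForestClassifier(),
    "GBT":  GradientBoostingClassifier(),
}
for name, clf in classifiers.items():       # default parameters throughout
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: {scores.mean():.4f}")   # average F1 over the ten folds
```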

### **4.4 Results and Discussion**

The summary of the results, in terms of average F1-score, is reported in Table 1. It is worth noting that the scores are, in general, very low (between 0.5826, obtained by the neural network on the bag-of-words model, and 0.6858, obtained by random forest on the word vector representation with 50 dimensions). Of course, these results are biased by the fact that the data are moderately unbalanced (64% of posts fall in the non-sensitive class). However, they are not completely negative, meaning that there is room for improvement. We observe that the winning model-classifier pair (50-dimensional word vectors processed with random forest) exhibits high recall on the non-sensitive class (0.928) and rather similar results in terms of precision for the two classes (0.671 and 0.688 for the sensitive and non-sensitive classes, respectively). The truly negative result is the low recall on the sensitive class (only 0.258), due to the high number of false negatives<sup>2</sup>. We recall that the number of annotated sensitive posts is only 221, i.e., the number of examples is not sufficiently large to train a prediction model accurately.


**Table 1.** Classification in terms of average F1-score for different post representations.

These results highlight the following issues and perspectives. First, the negative (or not-so-positive) results are certainly due to the lack of annotated data (especially for the sensitive class). Sparsity is certainly a problem in our setting. Hence, a larger annotated corpus is needed, although obtaining one is not trivial. In fact, private posts are often difficult to obtain, because social media platforms (luckily, somehow) do not allow users to retrieve them through their APIs. As a consequence, all previous attempts to guess the sensitivity of text or construct privacy dictionaries strongly leverage user anonymity in public post sharing activities [5,14], or rely on focus groups and surveys [22]. Moreover, without a sufficiently large corpus, not even the application of otherwise successful deep learning techniques (e.g., RNNs for sentiment analysis [9]) would produce valid results. Second, simple classifiers, even when applied to rather complex and rich representations, cannot capture the manifold of privacy sensitivity accurately.

<sup>2</sup> Due to space limitations, we do not report detailed precision/recall results.

So, more complex and heterogeneous models should be considered. An accurate content sensitivity analysis tool would probably need to consider lexical, semantic, and grammatical features. Topics are certainly important, but sentence construction and lexical choices are also fundamental. Therefore, reliable solutions would consist of a combination of computational linguistic techniques, machine learning algorithms, and semantic analysis. Third, the success of picture and video sharing platforms (such as Instagram and TikTok) implies that any successful content sensitivity analysis tool should be able to cope with audiovisual content and, in general, with multimodal/multimedia objects (an open problem in sentiment analysis as well [20]). Finally, provided that a taxonomy of privacy categories in everyday life exists (e.g., health, location, politics, religious belief, family, relationships, and so on), a more complex CSA setting might consider, for a given content object, the privacy sensitivity degree in each category.

## **5 Conclusions**

In this paper, we have addressed the problem of determining whether a given content object is privacy-sensitive or not by defining the generic task of content sensitivity analysis (CSA). We have then instantiated it in problem settings of increasing complexity. Although the task promises to be challenging, we have shown that it is not unfeasible by presenting a simplified formulation of CSA based on text categorization. With some preliminary but extensive experiments, we have shown that, no matter the data representation, the accuracy of such classifiers cannot be considered satisfactory. Thus, it is worth investigating more complex techniques borrowed from machine learning, computational linguistics, and semantic analysis. Moreover, without a strong effort in building massive and reliable annotated corpora, the performance of any CSA tool would be barely sufficient, no matter the complexity of the learning model.

**Acknowledgments.** The authors would like to thank Daniele Scanu for implementing the Telegram bot used by the annotators. This work is supported by Fondazione CRT (grant number 2019-0450).

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Gibbs Sampling Subjectively Interesting Tiles**

Anes Bendimerad<sup>1(B)</sup>, Jefrey Lijffijt<sup>2</sup>, Marc Plantevit<sup>3</sup>, Céline Robardet<sup>1</sup>, and Tijl De Bie<sup>2</sup>

<sup>1</sup> Univ Lyon, INSA, CNRS UMR 5205, 69621 Villeurbanne, France ahmed-anes.bendimerad@insa-lyon.fr

<sup>2</sup> IDLab, ELIS Department, Ghent University, Ghent, Belgium <sup>3</sup> Univ Lyon, UCBL, CNRS UMR 5205, 69621 Lyon, France

**Abstract.** The local pattern mining literature has long struggled with the so-called pattern explosion problem: the size of the set of patterns found exceeds the size of the original data. This causes computational problems (enumerating a large set of patterns will inevitably take a substantial amount of time) as well as problems for interpretation and usability (trawling through a large set of patterns is often impractical).

Two complementary research lines aim to address this problem. The first aims to develop better measures of interestingness, in order to reduce the number of uninteresting patterns that are returned [6,10]. The second aims to avoid an exhaustive enumeration of all 'interesting' patterns (where interestingness is quantified in a more traditional way, e.g. frequency), by directly sampling from this set in a way that more 'interesting' patterns are sampled with higher probability [2].

Unfortunately, the first research line does not reduce the computational cost, while the second may miss the most interesting patterns. In this paper, we combine the best of both worlds for mining interesting tiles [8] from binary databases. Specifically, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its subjective interestingness [6]—an interestingness measure reported to better represent true interestingness.

The experimental evaluation confirms the theory, but also reveals an important weakness of the proposed approach which we speculate is shared with any other pattern sampling approach. We thus conclude with a broader discussion of this issue, and a forward look.

**Keywords:** Pattern mining · Subjective interestingness · Pattern sampling · Gibbs sampling

## **1 Introduction**

Pattern mining methods aim to select elements from a given language that bring to the user "implicit, previously unknown, and potentially useful information from data" [7]. To meet the challenge of selecting the appropriate patterns for a user, several lines of work have been explored: (1) many constraints on measures that assess the quality of a pattern using exclusively the data have been designed [4,12,13]; (2) preference measures have been considered to retrieve only the patterns that are non-dominated in the dataset; (3) active learning systems have been proposed that interact with the user to make her interest in the patterns explicit and guide the exploration toward those she is interested in; (4) subjective interestingness measures [6,10] have been introduced that aim to take into account the implicit knowledge of a user by modeling her prior knowledge and retrieving the patterns that are unlikely according to this background model.

The shift from threshold constraints on objective measures toward the use of subjective measures provides an elegant solution to the so-called pattern explosion problem by considerably reducing the output to only truly interesting patterns. Unfortunately, the discovery of subjectively interesting patterns with exact algorithms remains computationally challenging.

In this paper we explore another strategy: pattern sampling. The aim is to reduce the computational cost while identifying the most important patterns, and to allow for distributed computations. There are two families of local pattern sampling techniques.

The first family uses Metropolis–Hastings [9], a Markov chain Monte Carlo (MCMC) method. It performs a random walk over a transition graph representing the probability of reaching a pattern given the current one. This can be done with the guarantee that, on the sample set, the distribution of the considered quality measure is proportional to its distribution over the whole pattern set [1]. However, each iteration of the random walk is accepted only with a probability equal to the acceptance rate α, which can be very small and may result in a prohibitively slow convergence rate. Moreover, in each iteration the part of the transition graph representing the probability of reaching patterns given the current one has to be materialized in both directions, further raising the computational cost. Other approaches [5,11] relax this constraint but lose the guarantee.

Methods in the second family are referred to as direct pattern sampling approaches [2,3]. A notable example is [2], where a two-step procedure is proposed that samples frequent itemsets without simulating stochastic processes. In a first step, it randomly selects a row according to a first distribution, and from this row, draws a subset of items according to another distribution. The combination of both steps follows the desired distribution. Generalizing this approach to other pattern domains and quality measures appeared to be difficult.

In this paper, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its subjective interestingness (SI) [6]. Gibbs sampling – described in Sect. 3 – is a special case of Metropolis–Hastings where the acceptance rate α is always equal to 1. In Sect. 4, we show how the random walk can be simulated without materializing any part of the transition graph except the currently sampled pattern. While we present this approach particularly for mining tiles in rectangular databases, applying it to other pattern languages can be achieved relatively easily. The experimental evaluation (Sect. 5) confirms the theory, but also reveals a weakness of the proposed approach which we speculate is shared by other direct pattern sampling approaches. We thus conclude with a broader discussion of this issue (Sect. 6), and a forward look (Sect. 7).

## **2 Problem Formulation**

### **2.1 Notation**

*Input Dataset.* A dataset **D** is a Boolean matrix with $m$ rows and $n$ columns. For $i \in [1, m]$ and $j \in [1, n]$, $\mathbf{D}(i, j) \in \{0, 1\}$ denotes the value of the cell corresponding to the $i$-th row and the $j$-th column. For a given set of rows $I \subseteq [1, m]$, we define the support function $supp_C(I)$ that gives all the columns having a value of 1 in all the rows of $I$, i.e., $supp_C(I) = \{j \in [1, n] \mid \forall i \in I : \mathbf{D}(i, j) = 1\}$. Similarly, for a set of columns $J \subseteq [1, n]$, we define the function $supp_R(J) = \{i \in [1, m] \mid \forall j \in J : \mathbf{D}(i, j) = 1\}$. Table 1 shows a toy example of a Boolean matrix, where for $I = \{4, 5, 6\}$ we have that $supp_C(I) = \{2, 3, 4\}$.

**Table 1.** Example of a binary dataset **D**.


*Pattern Language.* This paper is concerned with a particular kind of pattern known as a tile [8], denoted $\tau = (I, J)$ and defined as an ordered pair of a set of rows $I \subseteq \{1, ..., m\}$ and a set of columns $J \subseteq \{1, ..., n\}$. A tile $\tau$ is said to be contained (or present) in **D**, denoted as $\tau \in \mathbf{D}$, iff $\mathbf{D}(i, j) = 1$ for all $i \in I$ and $j \in J$. The set of all tiles present in the dataset is denoted as $\mathcal{T}$ and is defined as: $\mathcal{T} = \{(I, J) \mid I \subseteq \{1, ..., m\} \wedge J \subseteq \{1, ..., n\} \wedge (I, J) \in \mathbf{D}\}$. In Table 1, the tile $\tau_1 = (\{4, 5, 6, 7\}, \{2, 3, 4\})$ is present in **D** ($\tau_1 \in \mathcal{T}$), because each of its cells has a value of 1, but $\tau_2 = (\{1, 2\}, \{2, 3\})$ is not present ($\tau_2 \notin \mathcal{T}$) since $\mathbf{D}(1, 3) = 0$.
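A minimal sketch of these definitions on a toy Boolean matrix (0-indexed, unlike the 1-indexed text; the matrix is not the one of Table 1):

```python
import numpy as np

D = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 1]], dtype=bool)       # toy dataset

def supp_C(D, I):
    """Columns having a 1 in every row of I."""
    return set(np.flatnonzero(D[sorted(I), :].all(axis=0)))

def supp_R(D, J):
    """Rows having a 1 in every column of J."""
    return set(np.flatnonzero(D[:, sorted(J)].all(axis=1)))

def present(D, I, J):
    """True iff the tile (I, J) is contained in D."""
    return bool(D[np.ix_(sorted(I), sorted(J))].all())

print(supp_C(D, {0, 1}), present(D, {0, 1}, {2}))   # -> {0, 2} True
```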

### **2.2 The Interestingness of a Tile**

In order to assess the quality of a tile τ , we use the framework of subjective interestingness SI proposed in [6]. We briefly recapitulate the definition of this measure for tiles, denoted SI(τ ) for a tile τ , and refer the reader to [6] for more details. SI(τ ) measures the quality of a tile τ as the ratio of its subjective information content IC(τ ) and its description length DL(τ ):

$$\text{SI}(\tau) = \frac{\text{IC}(\tau)}{\text{DL}(\tau)}.$$

Tiles with large SI(τ ) thus compress subjective information in a short description. Before introducing IC and DL, we first describe the background model—an important component required to define the subjective information content IC.

*Background Model.* The SI is subjective in the sense that it accounts for the prior knowledge of the current data miner. A tile τ is informative for a particular user if it is somehow surprising for her; otherwise, it does not bring new information. The most natural way to formalize this is to use a background distribution representing the data miner's prior expectations, and to compute the probability Pr(τ ∈ **D**) of this tile under this distribution. The smaller Pr(τ ∈ **D**), the more information this pattern contains. Concretely, the background model consists of a value Pr(**D**(i, j) = 1) associated with each cell **D**(i, j) of the dataset, denoted $p_{ij}$. More precisely, $p_{ij}$ is the probability that **D**(i, j) = 1 under the user's prior beliefs. In [6], it is shown how to compute the background model and derive all the values $p_{ij}$ corresponding to a given set of considered user priors. Based on this model, the probability of having a tile τ = (I, J) in **D** is:

$$\Pr(\tau \in \mathbf{D}) = \Pr\Big(\bigwedge_{i \in I, j \in J} \mathbf{D}(i, j) = 1\Big) = \prod_{i \in I, j \in J} p_{ij}.$$

*Information Content IC .* This measure aims to quantify the amount of information conveyed to a data miner when she is told about the presence of a tile in the dataset. It is defined for a tile τ = (I,J) as follows:

$$\text{IC}(\tau) = -\log(\Pr(\tau \in \mathbf{D})) = \sum_{i \in I, j \in J} -\log(p_{ij}).$$

Thus, the smaller $\Pr(\tau \in \mathbf{D})$, the higher $\text{IC}(\tau)$, and the more informative $\tau$. Note that for $\tau_1, \tau_2 \in \mathbf{D}$: $\text{IC}(\tau_1 \cup \tau_2) = \text{IC}(\tau_1) + \text{IC}(\tau_2) - \text{IC}(\tau_1 \cap \tau_2)$.

*Description Length DL.* This function should quantify how difficult it is for a user to assimilate the pattern. The description length of a tile τ = (I,J) should thus depend on how many rows and columns it refers to: the larger are |I| and |J|, the larger is the description length. Thus, DL(τ ) can be defined as:

$$\text{DL}(\tau) = a + b \cdot \left( |I| + |J| \right),$$

where $a$ and $b$ are two constants that can be tuned to give more or less importance to the contributions of $|I|$ and $|J|$ in the description length.
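Putting the pieces together, SI can be computed directly from the matrix of background probabilities $p_{ij}$, as in the following sketch (the constants $a$ and $b$ are chosen arbitrarily here):

```python
import numpy as np

def IC(p, I, J):
    """Information content: sum of -log(p_ij) over the tile's cells."""
    return float(-np.log(p[np.ix_(sorted(I), sorted(J))]).sum())

def DL(I, J, a=1.0, b=1.0):
    """Description length a + b * (|I| + |J|); a, b chosen arbitrarily."""
    return a + b * (len(I) + len(J))

def SI(p, I, J, a=1.0, b=1.0):
    return IC(p, I, J) / DL(I, J, a, b)

p = np.full((3, 3), 0.5)          # toy background model: all p_ij = 0.5
print(SI(p, {0, 1}, {2}))         # 2 cells of log 2, over a + b * 3
```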

### **2.3 Problem Statement**

Given a Boolean dataset **D**, the goal is to sample a tile τ from the set of all the tiles $\mathcal{T}$ present in **D**, with a sampling probability $P_S$ proportional to SI(τ), that is: $$P_S(\tau) = \frac{\text{SI}(\tau)}{\sum_{\tau' \in \mathcal{T}} \text{SI}(\tau')}.$$

A naïve approach to sample a tile pattern according to this distribution is to generate the list $\{\tau_1, ..., \tau_N\}$ of all the tiles present in **D**, sample $x \in [0, 1]$ uniformly at random, and return the tile $\tau_k$ with $$\frac{\sum_{i=1}^{k-1} \text{SI}(\tau_i)}{\sum_{i} \text{SI}(\tau_i)} \leq x < \frac{\sum_{i=1}^{k} \text{SI}(\tau_i)}{\sum_{i} \text{SI}(\tau_i)}.$$ However, the goal behind using sampling approaches is to avoid materializing the pattern space, which is generally huge. We want to sample without exhaustively enumerating the set of tiles. In [2], an efficient procedure is proposed to directly sample patterns according to some measures such as the frequency and the area. However, this procedure is limited to only some specific measures. Furthermore, it is proposed for pattern languages defined on only the column dimension, for example, itemset patterns. In such a language, the rows related to an itemset pattern $F \subseteq \{1, ..., n\}$ are uniquely identified: they correspond to all the rows containing the itemset, that is, $supp_R(F)$. In our work, we are interested in tiles, which are defined by both column and row indices. In this case, it is not clear how the direct procedure proposed in [2] can be applied.
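For contrast, the naïve sampler just described reads as follows; it requires the full list of tiles and their SI values, which is exactly what the approach developed below avoids materializing:

```python
import numpy as np

def naive_sample(tiles, si_values, rng=None):
    """Draw one tile with probability proportional to its SI (exhaustive)."""
    rng = rng or np.random.default_rng()
    cdf = np.cumsum(si_values) / np.sum(si_values)
    x = rng.random()                               # x ~ U[0, 1]
    k = int(np.searchsorted(cdf, x, side="right")) # first k with cdf[k] > x
    return tiles[k]
```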

For more complex pattern languages, a generic procedure based on the Metropolis–Hastings algorithm has been proposed in [9], and illustrated for subgraph patterns with some quality measures. While this approach is generic and can be extended relatively easily to different mining tasks, a major drawback of the Metropolis–Hastings algorithm is that the random walk procedure contains an acceptance test that needs to be processed in each iteration, and the acceptance rate α can be very small, which makes the convergence rate extremely slow in practice. Furthermore, Metropolis–Hastings can be computationally expensive, as the part of the transition graph representing the probability of reaching patterns given the current one has to be materialized.

Interestingly, a very useful MCMC technique is Gibbs sampling, which is a special case of the Metropolis–Hastings algorithm. A significant benefit of this approach is that the acceptance rate α is always equal to 1, i.e., the proposal of each sampling iteration is always accepted. In this work, we use Gibbs sampling to draw patterns with a probability distribution that converges to $P_S$. In what follows, we first present the Gibbs sampling approach generically, and then show how we exploit it efficiently for our problem. Unlike Metropolis–Hastings, the proposed procedure performs a random walk by materializing in each iteration only the currently sampled pattern.

## **3 Gibbs Sampling**

Suppose we have a random variable $X = (X_1, X_2, ..., X_l)$ taking values in some domain $Dom$. We want to sample a value $x \in Dom$ following the joint distribution $P(X = x)$. Gibbs sampling is suitable when it is hard to sample directly from $P$, but easy to sample just one dimension $x_k$ ($k \in [1, l]$) from the conditional probability $P(X_k = x_k \mid X_1 = x_1, ..., X_{k-1} = x_{k-1}, X_{k+1} = x_{k+1}, ..., X_l = x_l)$. The idea of Gibbs sampling is to generate samples by sweeping through each variable (or block of variables) and sampling from its conditional distribution with the remaining variables fixed to their current values. Algorithm 1 depicts a generic Gibbs sampler. At the beginning, $x$ is set to its initial values (often sampled from a prior distribution $q$). Then, the algorithm performs a random walk of $p$ iterations. In each iteration $k$, we first sample $x_1^{(k)} \sim P(X_1 = x_1 \mid X_2 = x_2^{(k-1)}, ..., X_l = x_l^{(k-1)})$ (while fixing the other dimensions), then we follow the same procedure to sample $x_2^{(k)}, \ldots$, until $x_l^{(k)}$.

#### **Algorithm 1:** Gibbs sampler

Initialize $x^{(0)} \sim q(x)$
**for** $k \in [1, p]$ **do**
&nbsp;&nbsp;draw $x_1^{(k)} \sim P\big(X_1 = x_1 \mid X_2 = x_2^{(k-1)}, X_3 = x_3^{(k-1)}, \ldots, X_l = x_l^{(k-1)}\big)$
&nbsp;&nbsp;draw $x_2^{(k)} \sim P\big(X_2 = x_2 \mid X_1 = x_1^{(k)}, X_3 = x_3^{(k-1)}, \ldots, X_l = x_l^{(k-1)}\big)$
&nbsp;&nbsp;...
&nbsp;&nbsp;draw $x_l^{(k)} \sim P\big(X_l = x_l \mid X_1 = x_1^{(k)}, X_2 = x_2^{(k)}, \ldots, X_{l-1} = x_{l-1}^{(k)}\big)$
**return** $x^{(p)}$

The random walk needs to satisfy some constraints to guarantee that the Gibbs sampling procedure converges to the stationary distribution P. In the case of a finite number of states (a finite space Dom in which X takes values), sufficient conditions for the convergence are irreducibility and aperiodicity:

*Irreducibility*. A random walk is irreducible if, for any two states x, y ∈ Dom s.t. P(x) > 0 and P(y) > 0, we can get from x to y with a probability > 0 in a finite number of steps. I.e. the entire state space is reachable.

*Aperiodicity*. A random walk is aperiodic if we can return to any state x ∈ Dom at any time. I.e. revisiting x is not conditioned to some periodicity constraint.

One can also use blocked Gibbs sampling. This consists in grouping several variables together and sampling from their joint distribution conditioned on the remaining variables, rather than sampling each variable $x_i$ individually. Blocked Gibbs sampling can reduce the problem of slow mixing caused by the high number of dimensions that are sampled from.
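For a finite state space, the sweep of Algorithm 1 can be sketched generically as follows; `domains` and `conditional` (which must return the distribution of $x_k$ given the other coordinates) are assumptions of this sketch:

```python
import numpy as np

def gibbs(x0, domains, conditional, p, rng=None):
    """Generic Gibbs sweep: p iterations over all dimensions of x.

    domains[k]: list of possible values of x_k;
    conditional(k, x): P(X_k = v | X_{-k} = x_{-k}) for each v, sums to 1."""
    rng = rng or np.random.default_rng()
    x = list(x0)
    for _ in range(p):
        for k in range(len(x)):          # one sweep over the dimensions
            probs = conditional(k, x)
            x[k] = rng.choice(domains[k], p=probs)
    return tuple(x)
```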

## **4 Gibbs Sampling of Tiles with Respect to SI**

In order to sample a tile τ = (I, J) with a probability proportional to SI(τ), we propose to use Gibbs sampling. The simplest solution is to consider a tile τ as $m + n$ binary random variables $(x_1, ..., x_m, ..., x_{m+n})$, each of which corresponds to a row or a column, and then apply the procedure described in Algorithm 1. In this case, an iteration of Gibbs sampling requires sampling each column and row separately while fixing all the remaining rows and columns. The drawback of this approach is the high number of variables ($m + n$), which may lead to a slow mixing time. In order to reduce the number of variables, we propose to split τ = (I, J) into only two separate blocks of random variables, I and J, and to sample directly from each block while fixing the value of the other block. This means that an iteration of the random walk contains only two sampling operations instead of $m + n$. In what follows, we explain in more detail how this blocked Gibbs sampling approach can be applied, and how to compute the distributions used to directly sample a block of rows or columns.

**Algorithm 2:** Blocked Gibbs sampler for tiles

Initialize $(I, J)^{(0)} \sim q(x)$
**for** $k \in [1, p]$ **do**
&nbsp;&nbsp;draw $I^{(k)} \sim P\big(\mathbf{I} = I \mid \mathbf{J} = J^{(k-1)}\big)$, draw $J^{(k)} \sim P\big(\mathbf{J} = J \mid \mathbf{I} = I^{(k)}\big)$
**return** $(I, J)^{(p)}$

Algorithm 2 depicts the main steps of blocked Gibbs sampling for tiles. We start by initializing $(I, J)^{(0)}$ with a distribution $q$ proportional to the area ($|I| \times |J|$), following the approach proposed in [2]. This choice is mainly motivated by the linear time complexity of this sampling. Then, we need to efficiently sample from $P(\mathbf{I} = I \mid \mathbf{J} = J)$ and $P(\mathbf{J} = J \mid \mathbf{I} = I)$. In the following, we explain how to sample $I$ with $P(\mathbf{I} = I \mid \mathbf{J} = J)$; since the SI is symmetric w.r.t. rows and columns, the same strategy can be used symmetrically to sample a set of columns with $P(\mathbf{J} = J \mid \mathbf{I} = I)$.

*Sampling a Set of Rows $I$ Conditioned on Columns $J$.* For a specific $J \subseteq \{1, ..., n\}$, the number of tiles $(I, J)$ present in the dataset can be huge, up to $2^m$. This means that naïvely generating all these candidate tiles and then sampling from them is not a solution. Thus, to sample a set of rows $I$ conditioned on a fixed set of columns $J$, we propose an iterative algorithm that builds the sampled $I$ by drawing each $i \in I$ separately, while ensuring that the joint distribution of all the drawings is equal to $P(\mathbf{I} = I \mid \mathbf{J} = J)$. $I$ is built using two variables: $R_1 \subseteq \{1, ..., m\}$, made of the rows that belong to $I$, and $R_2 \subseteq \{1, ..., m\} \setminus R_1$, which contains the candidate rows that can still be sampled and added to $R_1$. Initially, we have $R_1 = \emptyset$ and $R_2 = supp_R(J)$. At each step, we take $i \in R_2$, perform a random draw to determine whether $i$ is added to $R_1$ or not, and remove it from $R_2$. When $R_2 = \emptyset$, the sampled set of rows $I$ is set equal to $R_1$. To apply this strategy, all we need is to compute $P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)$, the probability of sampling $i$ given the current sets $R_1$, $R_2$ and $J$:

$$
\begin{aligned}
P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)
&= \frac{P(R_1 \cup \{i\} \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)}{P(R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)}
= \frac{\sum_{F \subseteq R_2 \setminus \{i\}} \text{SI}(R_1 \cup \{i\} \cup F, J)}{\sum_{F \subseteq R_2} \text{SI}(R_1 \cup F, J)} \\[4pt]
&= \frac{\sum_{F \subseteq R_2 \setminus \{i\}} \frac{\text{IC}(R_1 \cup \{i\}, J) + \text{IC}(F, J)}{a + b \cdot (|R_1| + |F| + 1 + |J|)}}{\sum_{F \subseteq R_2} \frac{\text{IC}(R_1, J) + \text{IC}(F, J)}{a + b \cdot (|R_1| + |F| + |J|)}} \\[4pt]
&= \frac{\sum_{k=0}^{|R_2|-1} \frac{1}{a + b \cdot (|R_1| + k + 1 + |J|)} \Big( \binom{|R_2|-1}{k} \text{IC}(R_1 \cup \{i\}, J) + \binom{|R_2|-2}{k-1} \text{IC}(R_2 \setminus \{i\}, J) \Big)}{\sum_{k=0}^{|R_2|} \frac{1}{a + b \cdot (|R_1| + k + |J|)} \Big( \binom{|R_2|}{k} \text{IC}(R_1, J) + \binom{|R_2|-1}{k-1} \text{IC}(R_2, J) \Big)} \\[4pt]
&= \frac{\text{IC}(R_1 \cup \{i\}, J) \cdot f(|R_2|-1, |R_1|+1) + \text{IC}(R_2 \setminus \{i\}, J) \cdot f(|R_2|-2, |R_1|+2)}{\text{IC}(R_1, J) \cdot f(|R_2|, |R_1|) + \text{IC}(R_2, J) \cdot f(|R_2|-1, |R_1|+1)},
\end{aligned}
$$

with $f(x, y) = \sum_{k=0}^{x} \frac{\binom{x}{k}}{a + b \cdot (y + k + |J|)}$.

*Complexity*. Let us compute the complexity of sampling $I$ with probability $P(\mathbf{I} = I \mid \mathbf{J} = J)$. Before starting the sampling of rows from $R_2$, we first compute the value of $\text{IC}(\{i\}, J)$ for each $i \in R_2$ (in $O(n \cdot m)$). This allows us to compute in $O(1)$ the values of IC that appear in $P(i \in \mathbf{I} \mid R_1 \subseteq \mathbf{I} \subseteq R_1 \cup R_2 \wedge \mathbf{J} = J)$, based on the relation $\text{IC}(I_1 \cup I_2, J) = \text{IC}(I_1, J) + \text{IC}(I_2, J)$ for disjoint $I_1, I_2 \subseteq [1, m]$. In addition, sampling each element $i \in R_2$ requires computing the corresponding values of $f(x, y)$. These values are computed once for the first sampled row $i \in R_2$ at a cost of $O(m)$, and can then be updated directly when sampling the next rows, using the following relation:

$$f(x-1, y) = f(x, y) - f(x-1, y+1),$$

which follows from Pascal's rule $\binom{x}{k} = \binom{x-1}{k} + \binom{x-1}{k-1}$ applied to the definition of $f$.

This means that the overall cost of sampling the whole set of rows $I$ with probability $P(\mathbf{I} = I \mid \mathbf{J} = J)$ is $O(n \cdot m)$. Following the same approach, sampling $J$ conditioned on $I$ is done in $O(n \cdot m)$. As we have $p$ sampling iterations, the worst-case complexity of the whole Gibbs sampling procedure for a tile τ is $O(p \cdot n \cdot m)$.
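A small sketch of $f$ together with a numerical check of the update relation used above (the constants and sizes are arbitrary):

```python
from math import comb, isclose

def f(x, y, J_size, a=1.0, b=1.0):
    """f(x, y) = sum_{k=0}^{x} C(x, k) / (a + b * (y + k + |J|))."""
    return sum(comb(x, k) / (a + b * (y + k + J_size)) for k in range(x + 1))

# the relation f(x-1, y) = f(x, y) - f(x-1, y+1) follows from Pascal's rule
x, y, J = 5, 3, 4
assert isclose(f(x - 1, y, J), f(x, y, J) - f(x - 1, y + 1, J))
```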

*Convergence Guarantee*. In order to guarantee convergence to the stationary distribution proportional to the SI measure, the Gibbs sampling procedure needs to satisfy some constraints. In our case, the sampling space is finite, as the number of tiles is at most $2^{m+n}$. The sampling procedure therefore converges if it satisfies the aperiodicity and irreducibility constraints. The Gibbs sampling for tiles is indeed aperiodic, as in each iteration it is possible to remain in exactly the same state. We only have to verify that the irreducibility property is satisfied. We show that, in some cases, the random walk is reducible, and how to make Gibbs sampling irreducible in those cases.

**Theorem 1.** *Let us consider the bipartite graph* $G = (U, V, E)$ *derived from the dataset* **D***, s.t.* $U = \{1, ..., m\}$*,* $V = \{1, ..., n\}$*, and* $E = \{(i, j) \mid i \in [1, m] \wedge j \in [1, n] \wedge \mathbf{D}(i, j) = 1\}$*. A tile* $\tau = (I, J)$ *present in* **D** *corresponds to a complete bipartite subgraph* $G_\tau = (I, J, E_\tau)$ *of* $G$*. If the bipartite graph* $G$ *is connected, then the Gibbs sampling procedure on tiles of* **D** *is irreducible.*

*Proof.* We need to prove that, for every pair of tiles $\tau_1 = (I_1, J_1)$, $\tau_2 = (I_2, J_2)$ present in **D**, the Gibbs sampling procedure can go from $\tau_1$ to $\tau_2$. Let $G_{\tau_1}$, $G_{\tau_2}$ be the complete bipartite graphs corresponding to $\tau_1$ and $\tau_2$. As $G$ is connected, there is a path from any vertex of $G_{\tau_1}$ to any vertex of $G_{\tau_2}$. The probability that the sampling procedure walks through one of these paths is not 0, as each step along these paths constitutes a tile present in **D**. After walking along one of these paths, the procedure finds itself on a tile $\tau \subseteq \tau_2$. Reaching $\tau_2$ from $\tau$ is then possible in one iteration, by sampling the right rows and then the right columns.

Thus, if the bipartite graph G is connected, the Gibbs sampling procedure converges to a stationary distribution. To make the random walk converge when G is not connected, we can compute the connected components of G, and then apply Gibbs sampling separately in each corresponding subset of the dataset.

**Table 2.** Dataset characteristics.

**Fig. 1.** Distribution of sampled patterns in synthetic data with 10 rows and 10 columns.

## **5 Experiments**

We report our experimental study to evaluate the effectiveness of Gibbs-SI. Java source code is made available<sup>1</sup>. We consider three datasets whose characteristics are given in Table 2. *mushrooms* and *chess*, from the UCI repository<sup>2</sup>, are commonly used for evaluation purposes. *kdd* contains a set of SIGKDD paper abstracts between 2001 and 2008, downloaded from the ACM website. Each abstract is represented by a row, and words correspond to columns after stop word removal and stemming. For each dataset, the user priors that we represent in the SI background model are the row and column margins. In other terms, we consider that the user knows (or is already informed about) the following statistics: $\sum_j \mathbf{D}(i, j)$ for each row $i$, and $\sum_i \mathbf{D}(i, j)$ for each column $j$.

*Empirical Sampling Distribution.* First, we want to evaluate experimentally how well the Gibbs sampling distribution matches the desired distribution. We need to run Gibbs-SI on small datasets where the size of $\mathcal{T}$ is not huge, and take a sufficiently large number of samples so that the sampling distribution can be estimated. To this aim, we synthetically generated a dataset containing 10 rows, 10 columns, and 855 tiles. We run Gibbs-SI with three different numbers of iterations $p$: 1k, 10k, and 100k; in each case, we keep all the visited tiles and study their distribution w.r.t. their SI values. Figure 1 reports the results. For 1k sampled patterns, the proportionality between sampling frequency and SI is not yet clearly established. For higher numbers of sampled patterns, a linear relation between the two axes is evident, especially for the case of 100k sampled patterns, which represents around 100 times the total number of tiles in the dataset. The two tiles with the highest SI are sampled the most, and the sampling frequency clearly decreases with the SI value.

<sup>1</sup> http://tiny.cc/g5zmgz.

<sup>2</sup> https://archive.ics.uci.edu/ml/.

**Fig. 2.** Distributions of the sampled patterns w.r.t. # rows, # columns and SI.

*Characteristics of Sampled Tiles.* To investigate which kinds of patterns are sampled by Gibbs-SI, we show in Fig. 2 the distribution of sampled tiles w.r.t. their number of rows, number of columns, and SI, for each of the three datasets given in Table 2. For *mushrooms* and *chess*, Gibbs-SI is able to return patterns with a diverse number of rows and columns. It samples many more patterns with low SI than patterns with high SI values. In fact, even though we are sampling proportionally to SI, the number of tiles in $\mathcal{T}$ with poor quality is significantly higher than the number with high quality values. Thus, the probability of sampling a low-quality pattern is higher than that of sampling one of the few high-quality patterns. For *kdd*, although the number of columns in sampled tiles varies, all the sampled tiles unfortunately cover only one row. In fact, the particularity of this dataset is the existence of some very large transactions (max = 180).

*Quality of the Sampled Tiles.* In this part of the experiments, we want to study whether the quality of the top sampled tiles is sufficient. As exhaustively mining the best tiles w.r.t. SI is not feasible, we need some strategy that identifies high-quality tiles. We propose to use LCM [14] to retrieve the closed tiles corresponding to the top 10k frequent closed itemsets. A closed tile τ = (I, J) is a tile that is present in **D** and whose I and J cannot be extended any further. Although closed tiles are not necessarily the ones with the highest SI, we make the hypothesis that at least some of them have high SI values, as they maximize the value of the IC function. For each of the three real-world datasets, we compare the SI of the top closed tiles identified with LCM with those identified by Gibbs-SI. In Table 3, we show the SI of the top-1 tile and the average SI of the top-10 tiles, for both LCM and Gibbs-SI.

Unfortunately, the scores of tiles retrieved with LCM are substantially larger than those of Gibbs-SI, especially for *mushrooms* and *chess*.


**Table 3.** The SI of the top-1 tile, and the average SI of the top-10 tiles, found by LCM and Gibbs-SI in the studied datasets.

Importantly, there may exist tiles that are even better than the ones found by LCM. This means that Gibbs-SI fails to identify the top tiles in the dataset. We believe that this is due to the very large number of low-quality tiles, which vastly outnumber the high-quality ones. The probability of sampling a high-quality tile is exceedingly small, so the number of samples needed to identify one is too large in practice.

## **6 Discussion**

Our results show that efficiently sampling from the set of tiles with a sampling probability proportional to the tiles' subjective interestingness is possible. Yet, they also show that if the purpose is to identify some of the most interesting patterns, direct pattern sampling may not be a good strategy. The reason is that the number of tiles with low subjective interestingness is vastly larger than the number with high subjective interestingness. This imbalance is not sufficiently offset by the relative differences in their interestingness, and thus in their sampling probability. As a result, the number of tiles that need to be sampled in order to sample one of the few top interesting ones is of the same order as the total number of tiles.

To mitigate this, one could attempt to sample from alternative distributions that attribute an even higher probability to the most interesting patterns, e.g. with probabilities proportional to the *square* or other high powers of the subjective interestingness. We speculate, however, that the computational cost of sampling from such more highly peaked distributions will also be larger, undoing the benefit of needing to sample fewer of them. This intuition is supported by the fact that direct sampling schemes according to itemset support are computationally cheaper than according to the square of their support [2].

That said, the use of sampled patterns as features for downstream machine learning tasks, even if these samples do not include the most interesting ones, may still be effective as an alternative to exhaustive pattern mining.

## **7 Conclusions**

Pattern sampling has been proposed as a computationally efficient alternative to exhaustive pattern mining. Yet, existing techniques have been limited in terms of which interestingness measures they could handle efficiently.

In this paper, we introduced an approach based on Gibbs sampling, which is capable of sampling from the set of tiles proportional to their subjective interestingness. Although we present this approach for a specific type of pattern language and quality measure, we can relatively easily follow the same scheme to apply Gibbs sampling for other pattern mining settings. The empirical evaluation demonstrates effectiveness, yet, it also reveals a potential weakness inherent to pattern sampling: when the number of interesting patterns is vastly outnumbered by the number of non-interesting ones, a large number of samples may be required, even if the samples are drawn with a probability proportional to the interestingness. Investigating our conjecture that this problem affects all approaches for sampling interesting patterns (for sensible measures of interestingness) seems a fruitful avenue for further research.

**Acknowledgements.** This work was supported by the ERC under the EU's Seventh Framework Programme (FP7/2007-2013)/ERC Grant Agreement no. 615517, the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme, the FWO (project no. G091017N, G0F9816N, 3G042220), the EU's Horizon 2020 research and innovation programme and the FWO under the Marie Skłodowska-Curie Grant Agreement no. 665501, and by the ACADEMICS grant of IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX-0005.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Even Faster Exact** *k***-Means Clustering**

Christian Borgelt1,2(B)

<sup>1</sup> Department of Mathematics/Computer Sciences, Paris-Lodron-University of Salzburg, Hellbrunner Straße 34, 5020 Salzburg, Austria christian.borgelt@sbg.ac.at <sup>2</sup> Department of Computer and Information Science, University of Konstanz, Universitätsstraße 10, 78457 Konstanz, Germany christian@borgelt.net

**Abstract.** A naïve implementation of k-means clustering requires computing for each of the n data points the distance to each of the k cluster centers, which can result in fairly slow execution. However, by storing distance information obtained in earlier computations, as well as information about distances between cluster centers, the triangle inequality can be exploited in different ways to reduce the number of needed distance computations, e.g. [3–5,7,11]. In this paper I present an improvement of the Exponion method [11] that generally accelerates the computations. Furthermore, by evaluating several methods on a fairly wide range of artificial data sets, I derive a kind of map showing, for which data set parameters, which method (often) yields the lowest execution times.

**Keywords:** Exact k-means · Triangle inequality · Exponion

## **1 Introduction**

The k-means algorithm [9] is, without doubt, the best known and (among) the most popular clustering algorithm(s), mainly because of its simplicity. However, a naïve implementation of the k-means algorithm requires O(nk) distance computations in each update step, where n is the number of data points and k is the number of clusters. This can be a severe obstacle if clustering is to be carried out on truly large data sets with hundreds of thousands or even millions of data points and hundreds to thousands of clusters, especially in high dimensions.

Hence, in our "big data" age, considerable effort was spent on trying to accelerate the computations, mainly by reducing the number of needed distance computations. This led to several very clever approaches, including [3–5,7,11]. These methods exploit the fact that, for assigning data points to cluster centers, knowing the actual distances is not essential (in contrast to e.g. fuzzy c-means clustering [2]). All one really needs to know is which center is closest. This, however, can sometimes be determined without actually computing (all) distances.

A core idea is to maintain, for each data point, bounds on its distance to different centers, especially to the closest center. These bounds are updated by exploiting the triangle inequality, and can enable us to ascertain that the center that was closest before the most recent update step is still closest. Furthermore, by maintaining additional information, tightening these bounds can sometimes be done by looking at only a subset of the cluster centers.

In this paper I present an improvement of one of the most sophisticated of such schemes: the Exponion method [11]. In addition, by comparing my new approach to other methods on several (artificial) data sets with a wide range of numbers of dimensions and numbers of clusters, I derive a kind of map showing, for which data set parameters, which method (often) yields the lowest execution times.

## **2** *k***-Means Clustering**

The k-means algorithm is a very simple, yet effective clustering scheme that finds a user-specified number k of clusters in a given data set. This data set is commonly required to consist of points in a metric space. The algorithm starts by choosing an initial set of k cluster centers, which may naïvely be obtained by sampling uniformly at random from the given data points. In the subsequent cluster center optimization phase, two steps are executed alternatingly: (1) each data point is assigned to the cluster center that is closest to it (that is, closer than any other cluster center) and (2) the cluster centers are recomputed as the vector means of the data points assigned to them (to enable these mean computations, the data points are supposed to live in a vector space).

Using $\nu_m(x)$ to denote the cluster center $m$-th closest to a point $x$ in the data space, this update scheme can be written (for $n$ data points $x_1, ..., x_n$) as

$$\forall i; 1 \le i \le k: \qquad c_i^{t+1} = \frac{\sum_{j=1}^n \mathbb{1}(\nu_1^t(x_j) = c_i^t) \cdot x_j}{\sum_{j=1}^n \mathbb{1}(\nu_1^t(x_j) = c_i^t)},$$

where the indices $t$ and $t+1$ indicate the update step and the function $\mathbb{1}(\phi)$ yields 1 if $\phi$ is true and 0 otherwise. Here $\nu_1^t(x_j)$ represents the assignment step, and the fraction computes the mean of the data points assigned to center $c_i$.
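As a baseline for the accelerated methods discussed next, one update step of this naïve scheme can be sketched as follows (empty clusters are not handled, for brevity):

```python
import numpy as np

def kmeans_step(X, centers):
    """One naive k-means update: O(nk) distances, then mean recomputation."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x k
    assign = d.argmin(axis=1)                  # nu_1: index of closest center
    return np.vstack([X[assign == i].mean(axis=0)   # new center c_i^{t+1}
                      for i in range(len(centers))])

X = np.random.default_rng(1).normal(size=(100, 2))
c = X[:3].copy()                               # naive initialization
for _ in range(10):
    c = kmeans_step(X, c)
```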

It can be shown that this update scheme must converge, that is, must reach a state in which another execution of the update step does not change the cluster centers anymore [14]. However, there is no guarantee that the obtained result is optimal in the sense that it yields the smallest sum of squared distances between the data points and the cluster centers they are assigned to. Rather, it is very likely that the optimization gets stuck in a local optimum. It has even been shown that k-means clustering is NP-hard for 2-dimensional data [10].

Furthermore, the quality of the obtained result can depend heavily on the choice of the initial centers. A poor choice can lead to inferior results due to a local optimum. However, improvements over naïvely sampling uniformly at random from the data points are easily found, for example the Maximin method [8] and the k-means++ procedure [1], which has become the *de facto* standard.

## **3 Bounds-Based Exact** *k***-Means Clustering**

Some approaches to accelerate the k-means algorithm rely on approximations, which may lead to different results, e.g. [6,12,13]. Here, however, I focus on methods to accelerate *exact* k-means clustering, that is, methods that, starting from the same initialization, produce the same result as a naïve implementation.

**Fig. 1.** Using the triangle inequality to update the distance bounds for a data point $x_j$.

The core idea of these methods is to compute for each update step the distance each center moved, that is, the distance between the new and the old location of the center. Applying the triangle inequality, one can then derive how close or how far away an updated center can be from a data point in the worst possible case. For this we distinguish between the center closest (before the update) to a data point $x_j$ on the one hand and all other centers on the other.

k **Distance Bounds.** The first approach along these lines was developed in [5] and maintains one distance bound for each of the k cluster centers.

For the center closest to a data point $x_j$, an upper bound $u_j^t$ on its distance is updated as shown in Fig. 1(a): If we know before the update that the distance between $x_j$ and its closest center $c_{j1}^t = \nu_1^t(x_j)$ is (at most) $u_j^t$, and the update moved the center $c_{j1}^t$ to the new location $c_{j1}^{t*}$, then the distance $d(x_j, c_{j1}^{t*})$ between the data point and the new location of this center<sup>1</sup> cannot be greater than $u_j^{t+1} = u_j^t + d(c_{j1}^t, c_{j1}^{t*})$. This bound is actually reached if before the update the bound was tight and the center $c_{j1}^t$ moves away from the data point $x_j$ on the straight line through $x_j$ and $c_{j1}^t$ (that is, if the triangle is "flat").

For all other centers, that is, centers that are *not* closest to the point $x_j$, lower bounds $\ell_{ji}^t$, $i = 2, ..., k$, are updated as shown in Fig. 1(b): If we know before the update that the distance between $x_j$ and a center $c_{ji}^t = \nu_i^t(x_j)$ is (at least) $\ell_{ji}^t$, and the update moved the center $c_{ji}^t$ to the new location $c_{ji}^{t*}$, then the distance $d(x_j, c_{ji}^{t*})$ between the data point and the new location of this center cannot be less than $\ell_{ji}^{t+1} = \ell_{ji}^t - d(c_{ji}^t, c_{ji}^{t*})$. This bound is actually reached if before the update the bound was tight and the center $c_{ji}^t$ moves towards the data point $x_j$ on the straight line through $x_j$ and $c_{ji}^t$ ("flat" triangle).

<sup>1</sup> Note that it may be that $c_{j1}^{t*} \neq c_{j1}^{t+1}$ (although equality is not ruled out either), because the update may have changed which cluster center is closest to the data point $x_j$.

These bounds are easily exploited to avoid distance computations for a data point $x_j$: If we find that $u_j^{t+1} < \ell_j^{t+1} = \min_{i=2}^{k} \ell_{ji}^{t+1}$, that is, if the upper bound on the distance to the center that was closest before the update (in step $t$) is less than the smallest lower bound on the distances to any other center, then the center that was closest before the update must still be closest after the update (that is, in step $t+1$). Intuitively: even if the worst possible case happens, namely if the formerly closest center moves straight away from the data point and the other centers move straight towards it, no other center can have been brought closer than the one that was already closest before the update.

And even if this test fails, one first computes the actual distance between the data point $x_j$ and $c_{j1}^{t*}$. That is, one tightens the bound $u_j^{t+1}$ to the actual distance and then re-evaluates the test. If it succeeds now, the center that was closest before the update must still be closest. Only if the test fails even with the tightened bound do the distances between the data point and the remaining cluster centers have to be computed, in order to find the closest center and to reinitialize the bounds (all of which are tight after such a computation).

This scheme leads to considerable acceleration, because the cost of computing the distances between the new and the old locations of the cluster centers as well as the cost of updating the bounds is usually outweighed by the distance computations that are saved in those cases in which the test succeeds.
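A sketch of this per-point test is given below, assuming the lower bounds have already been decreased by the corresponding center displacements as described above; the function name and return convention are ours:

```python
import numpy as np

def closest_unchanged(x, u, l, c_old, c_new):
    """Return (still_closest, tightened_u) for one data point x.

    u: upper bound on d(x, old closest center); l: array of lower bounds
    to the other centers (already shifted by their movements);
    c_old/c_new: old/new location of the formerly closest center."""
    u = u + np.linalg.norm(c_new - c_old)   # worst-case shift of the bound
    if u < l.min():
        return True, u                      # test succeeds with loose bound
    u = np.linalg.norm(x - c_new)           # tighten: compute actual distance
    return u < l.min(), u                   # re-evaluate the test
```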

2 **Distance Bounds.** A disadvantage of the scheme just described is that $k$ bound updates are needed for each data point. In order to reduce this cost, in [7] only two bounds are kept per data point: $u_j^t$ and $\ell_j^t$; that is, all non-closest centers are captured by a single lower bound. This bound is updated according to $\ell_j^{t+1} = \ell_j^t - \max_{i=2}^{k} d(c_{ji}^t, c_{ji}^{t*})$. Even though this leads to worse lower bounds for the non-closest centers (since they are all treated as if they moved by the maximum of the distances any one of them moved), the fact that only two bounds have to be updated leads to faster execution, at least in many cases.

**YinYang Algorithm.** Instead of having either one distance bound for each center (k bounds) or capturing all non-closest centers by a single bound (2 bounds), one may consider a hybrid approach that maintains lower bounds for subsets of the non-closest centers. This improves the quality of bounds over the 2 bounds approach, because bounds are updated only by the maximum distance a center in the corresponding group moved (instead of the global maximum). On the other hand, (considerably) fewer than k bounds have to be updated.

This is the idea of the YinYang algorithm [4], which forms the groups of centers by clustering the initial centers with k-means clustering. The number of groups is chosen as k/10 in [4], but other factors may be tried. The groups found initially are maintained, that is, there is no re-clustering after an update.
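A minimal sketch of the grouping step and the per-group bound update, assuming scikit-learn's KMeans for clustering the initial centers (the paper's implementation is in C; the names here are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def group_centers(centers, divisor=10):
    """Group the k initial centers into about k/divisor groups by k-means,
    as the YinYang algorithm does; the grouping is kept fixed afterwards."""
    g = max(1, len(centers) // divisor)
    labels = KMeans(n_clusters=g, n_init=10).fit_predict(centers)
    return [np.where(labels == i)[0] for i in range(g)]

def update_group_bounds(l_groups, groups, shift):
    """Per-point, per-group lower bounds shrink only by the largest shift
    of a center in that group (instead of the global maximum)."""
    for gi, members in enumerate(groups):
        l_groups[:, gi] -= shift[members].max()
    return l_groups
```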

However, apart from fewer bounds (compared to k bounds) and better bounds (compared to 2 bounds), grouping the centers has yet another advantage: if the bounds test fails, even with a tightened bound $u^t_j$, the groups and their bounds may be used to limit the centers for which a distance recomputation is needed, because if the test succeeds for some group, one can infer that the closest center cannot be in that group.

**Fig. 2.** If $2u^{t+1}_j < d(c^{t*}_{j1}, \nu^{t+1}_2(c^{t*}_{j1}))$, then the center $c^{t*}_{j1}$ must still be closest to the data point $x_j$, due to the triangle inequality.

**Fig. 3.** Annular algorithm [3]: If even after the upper bound $u_j$ for the distance from data point $x_j$ to its (updated) formerly closest center $c^{t*}_{j1}$ has been made tight, the lower bound $\ell_j$ for distances to other centers is still lower, it is necessary to recompute the two closest centers. Exploiting information about the distance between $c^{t*}_{j1}$ and another center $\nu_2(c^{t*}_{j1})$ closest to it, these two centers are searched for in a (hyper-)annulus around the origin (dot in the bottom left corner) with $c^{t*}_{j1}$ in the middle and thickness $2\theta_j$, where $\theta_j = 2u_j + \delta_j$ and $\delta_j = d(c^{t*}_{j1}, \nu_2(c^{t*}_{j1}))$. (Color figure online)

Only centers in groups for which the group-specific test fails need to be considered for recomputation.

**Cluster to Cluster Distances.** The described bounds test can be improved by not only computing the distance each center moved, but also the distances between (updated) centers, to find for each center another center that is closest to it [5]. With my notation I can denote such a center as $\nu^{t+1}_2(c^{t*}_{j1})$, that is, the center that is second closest<sup>2</sup> to the point $c^{t*}_{j1}$. Knowing the distance $d(c^{t*}_{j1}, \nu^{t+1}_2(c^{t*}_{j1}))$, one can test whether $2u^{t+1}_j < d(c^{t*}_{j1}, \nu^{t+1}_2(c^{t*}_{j1}))$. If this is the case, the center that was closest to the data point $x_j$ before the update must still be closest after, as

<sup>2</sup> Note that $\nu^{t+1}_1(c^{t*}_{j1}) = c^{t*}_{j1}$, because a center is certainly the center closest to itself.

is illustrated in Fig. 2 for the worst possible case (namely $x_j$, $c^{t*}_{j1}$ and $\nu^{t+1}_2(c^{t*}_{j1})$ lie on a straight line with $c^{t*}_{j1}$ and $\nu^{t+1}_2(c^{t*}_{j1})$ on opposite sides of $x_j$).

Note that this second test can be used with k as well as with 2 bounds. However, it should also be noted that, although it can lead to an acceleration, if used in isolation it may also make an algorithm slower, because of the $O(k^2)$ distance computations needed to find the k distances $d(c^{t+1}_i, \nu^{t+1}_2(c^{t+1}_i))$.
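As a sketch (names mine), the nearest-other-center distances can be computed once per update from the full pairwise distance matrix, after which the test is a single comparison per data point:

```python
import numpy as np

def nearest_other_center_dist(centers):
    """For each center, the distance to the center closest to it; this is
    the O(k^2) computation mentioned in the text."""
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def still_closest(u, closest, s):
    """Test 2u < d(c_j1, nu_2(c_j1)) of Fig. 2: if true, the formerly
    closest center must still be closest; s = nearest_other_center_dist."""
    return 2.0 * u < s[closest]
```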

**Annular Algorithm.** With the YinYang algorithm an idea appeared on the scene that is the focus of all of the following methods: try to limit the centers that need to be considered in the recomputations if the tests fail even with a tightened bound $u^{t+1}_j$. Especially if one uses the 2 bounds approach, significant gains may be obtained: all we need to achieve in this case is to find $c^{t+1}_{j1} = \nu^{t+1}_1(x_j)$ and $c^{t+1}_{j2} = \nu^{t+1}_2(x_j)$, that is, the two centers closest to $x_j$, because these are all that is needed for the assignment step as well as for the (tight) bounds $u^{t+1}_j$ and $\ell^{t+1}_j$.

One such approach is the Annular algorithm [3]. For its description, as generally in the following, I drop the time step indices t+1 in order to simplify the notation. The Annular algorithm relies on the following idea: if the tests described above fail with a tightened bound $u_j$, we cannot infer that $c^{t*}_{j1}$ is still the center closest to $x_j$. But we know that the closest center must lie in a (hyper-)ball with radius $u_j$ around $x_j$ (darkest circle in Fig. 3). Any center outside this (hyper-)ball cannot be closest to $x_j$, because $c^{t*}_{j1}$ is closer. Furthermore, if we know the distance to another center closest to $c^{t*}_{j1}$, that is, to $\nu_2(c^{t*}_{j1})$, we know that even in the worst possible case (which is depicted in Fig. 3: $x_j$, $c^{t*}_{j1}$ and $\nu_2(c^{t*}_{j1})$ lie on a straight line), the two closest centers must lie in a (hyper-)ball with radius $u_j + \delta_j$ around $x_j$, where $\delta_j = d(c^{t*}_{j1}, \nu_2(c^{t*}_{j1}))$ (medium circle in Fig. 3), because we already know two centers that are this close, namely $c^{t*}_{j1}$ and $\nu_2(c^{t*}_{j1})$. Therefore, if we know the distances of the centers from the origin, we can easily restrict the recomputations to those centers that lie in a (hyper-)annulus (hence the name of this algorithm) around the origin with $c^{t*}_{j1}$ in the middle and thickness $2\theta_j$, where $\theta_j = 2u_j + \delta_j$ with $\delta_j = d(c^{t*}_{j1}, \nu_2(c^{t*}_{j1}))$ (see Fig. 3, light gray ring section, origin in the bottom left corner; note that the green line is perpendicular to the red/blue lines only by accident/for drawing convenience).

**Exponion Algorithm.** The Exponion algorithm [11] improves over the Annular algorithm by switching from annuli around the origin to (hyper-)balls around the (updated) formerly closest center $c^{t*}_{j1}$. Again we know that the center closest to $x_j$ must lie in a (hyper-)ball with radius $u_j$ around $x_j$ (darkest circle in Fig. 4) and that the two closest centers must lie in a (hyper-)ball with radius $u_j + \delta_j$ around $x_j$, where $\delta_j = d(c^{t*}_{j1}, \nu_2(c^{t*}_{j1}))$ (medium circle in Fig. 4). Therefore, if we know the pairwise distances between the (updated) centers, we can easily restrict the recomputations to those centers that lie in the (hyper-)ball with radius $r_j = 2u_j + \delta_j$ around $c^{t*}_{j1}$ (lightest circle in Fig. 4).
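A sketch of the resulting candidate filter (names mine), assuming the pairwise center distance matrix `cdist` has been computed for the updated centers:

```python
import numpy as np

def exponion_candidates(j1, u, cdist):
    """Indices of centers inside the ball of radius r = 2u + delta around
    center j1, where delta is the distance from j1 to its nearest other
    center; only these need distance recomputations for the data point."""
    d = cdist[j1].copy()
    d[j1] = np.inf
    delta = d.min()
    r = 2.0 * u + delta
    d[j1] = 0.0                      # j1 itself is always a candidate
    return np.where(d <= r)[0]
```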

The Exponion algorithm also relies on a scheme that avoids having to sort, for each cluster center, the list of the other centers by their distance. For this purpose concentric annuli, one set centered at each center, are created, with each annulus further out containing twice as many centers as the preceding one.

**Fig. 4.** Exponion algorithm [11]: If even after the upper bound $u_j$ for the distance from a data point $x_j$ to its (updated) formerly closest center $c^{t*}_{j1}$ has been made tight, the lower bound $\ell_j$ for distances to other centers is still lower, it is necessary to recompute the two closest centers. Exploiting information about the distance between $c^{t*}_{j1}$ and another center $\nu_2(c^{t*}_{j1})$ closest to it, these two centers are searched for in a (hyper-)sphere around center $c^{t*}_{j1}$ with radius $r_j = 2u_j + \delta_j$, where $\delta_j = d(c^{t*}_{j1}, \nu_2(c^{t*}_{j1}))$. (Color figure online)

Clearly this creates an onion-like structure, with an exponentially increasing number of centers in each layer (hence the name of the algorithm).

However, avoiding the sorting comes at a price, namely that more centers may have to be checked (although at most twice as many [11]) for finding the two closest centers and thus additional distance computations ensue. In my implementation I avoided this complication and simply relied on sorting the distances, since the gains achievable by concentric annuli over sorting are somewhat unclear (in [11] no comparisons of sorting versus concentric annuli are provided).

**Shallot Algorithm.** The Shallot algorithm is the main contribution of this paper. It starts with the same considerations as the Exponion algorithm, but adds two improvements. In the first place, not only the closest center $c_{j1}$ and the two bounds $u_j$ and $\ell_j$ are maintained for each data point (as for Exponion), but also the second closest center $c_{j2}$. This comes at practically no cost (apart from having to store an additional integer per data point), because the second closest center has to be determined anyway in order to set the bound $\ell_j$.

If a recomputation is necessary, because the tests fail even for a tightened $u_j$, it is *not* automatically assumed that $c^{t*}_{j1}$ is the best center z for a (hyper-)ball to search. As it is plausible that the formerly second closest center $c^{t*}_{j2}$ may now be closer to $x_j$ than $c^{t*}_{j1}$, the center $c^{t*}_{j2}$ is processed first among the centers $c^{t*}_{ji}$, $i = 2,\dots,k$. If it turns out that it is actually closer to $x_j$ than $c^{t*}_{j1}$, then $c^{t*}_{j2}$ is chosen as the center z of the (hyper-)ball to check. In this case the (hyper-)ball will be smaller (since we found that $d(x_j, c^{t*}_{j2}) < d(x_j, c^{t*}_{j1})$). For the following, let p denote the other (updated) center that was not chosen as the center z.

The second improvement may be understood best by viewing the chosen center z of the (hyper-)ball as the initial candidate $c^*_{j1}$ for the closest center in step t+1. Hence we initialize $u_j = d(x_j, z)$. For the initial candidate $c^*_{j2}$ for the second closest center in step t+1 we have two choices, namely p and $\nu_2(z)$. We choose $c^*_{j2} = p$ if $u_j + d(x_j, p) < 2u_j + \delta_j$ and $c^*_{j2} = \nu_2(z)$ otherwise, and initialize $\ell_j = u_j + d(x_j, p)$ or $\ell_j = 2u_j + \delta_j$ accordingly, thus minimizing the radius, which can then be written, regardless of the choice taken, as $r_j = u_j + \ell_j$.

While traversing the centers in the constructed (hyper-)ball, better candidates may be obtained. If this happens, the radius of the (hyper-)ball may be reduced, thus potentially reducing the number of centers to be processed. This idea is illustrated in Fig. 5. Let $u^\circ_j$ be the initial value of $u_j$ when the (hyper-)ball center was chosen, but before the search is started, that is, $u^\circ_j = d(x_j, z)$. If a new closest center (candidate) $c^*_{j1}$ is found (see Fig. 5(a)), we can update $u_j = d(x_j, c^*_{j1})$ and $\ell_j = d(x_j, c^*_{j2}) = u^\circ_j$. Hence we can shrink the radius to $r_j = 2u^\circ_j = u^\circ_j + \ell_j$. If then an even closer center is found (see Fig. 5(b)), the radius may be shrunk further as $u_j$ and $\ell_j$ are updated again. As should be clear from these examples, the radius is always $r_j = u^\circ_j + \ell_j$.

**Fig. 5.** Shallot algorithm: If a center closer to the data point than the two currently closest centers is found, the radius of the (hyper-)ball to be searched can be shrunk.
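The following Python sketch (my naming) illustrates the shrinking search; the candidates are assumed to be the other centers sorted by their distance to the ball center z, as described above:

```python
import numpy as np

def shallot_search(x, z, sorted_cand, dist_to_z, centers, u0, l0):
    """Traverse centers sorted by their distance to the ball center z,
    shrinking the search radius r = u0 + l whenever the two currently
    closest candidates improve; stops once the sorted list leaves the ball."""
    best, u = z, u0            # z is the initial closest-center candidate
    second, l = None, l0       # initial lower bound as described in the text
    for i, dz in zip(sorted_cand, dist_to_z):
        if dz > u0 + l:        # outside the (possibly shrunk) ball: done
            break
        d = np.linalg.norm(x - centers[i])
        if d < u:
            second, l = best, u
            best, u = i, d     # radius shrinks to u0 + l
        elif d < l:
            second, l = i, d
    return best, second, u, l
```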

A *shallot* is a type of onion, smaller than, for example, a bulb onion. I chose this name to indicate that the (hyper-)ball that is searched for the two closest centers tends to be smaller than for the Exponion algorithm. The reference to an onion may appear misguided, because I rely on sorting the list of other centers by their distance for each cluster center, rather than using concentric annuli. However, an onion reference may also be justified by the fact that my algorithm may shrink the (hyper-)ball radius during the traversal of centers in the (hyper-) ball, as this also creates a layered structure of (hyper-)balls.

## **4 Experiments**

In order to evaluate the performance of the different exact k-means algorithms I generated a large number of artificial data sets. Standard benchmark data sets proved to be too small to measure performance differences reliably and would also not have permitted drawing "performance maps" (see below). I fixed the number of data points in these data sets at n = 100 000. Anything smaller renders the time measurements too unreliable, anything larger requires an unpleasantly long time to run all benchmarks. Thus I varied only the dimensionality m of the data space, namely as $m \in \{2, 3, 4, 5, 6, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50\}$, and the number k of clusters, from 20 to 300 in steps of 20. For each parameter combination I generated 10 data sets, with clusters that are (roughly, due to random deviations) equally populated with data points and that may vary in size by a factor of at most ten per dimension. All clusters were modeled as isotropic normal (or Gaussian) distributions. Each data set was then processed 10 times with different initializations. All optimization algorithms started from the same initializations, thus making the comparison as fair as possible.
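A minimal sketch of such a generator (the exact parameter ranges below are illustrative assumptions, not the paper's):

```python
import numpy as np

def make_dataset(n=100_000, m=10, k=50, seed=0):
    """Sample n points from k isotropic Gaussian clusters in m dimensions,
    roughly equally populated, with cluster sizes varying by at most a
    factor of ten."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0.0, 1.0, size=(k, m))
    sigmas = rng.uniform(0.01, 0.1, size=k)     # factor-of-ten size range
    labels = rng.integers(0, k, size=n)         # roughly equal population
    points = centers[labels] + rng.normal(size=(n, m)) * sigmas[labels, None]
    return points, labels
```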

The clustering program is written in C (however, there is also a Python version, see the link to the source code below). All implementations of the different algorithms are entirely my own and use the same code to read the data and to write the clustering results. This adds to the fairness of the comparison, as in this way any differences in execution time can only result from differences in the actual algorithms. The test system was an Intel Core 2 Quad Q9650@3GHz with 8 GB of RAM running Ubuntu Linux 18.04 64bit.

**Fig. 6.** Map of the algorithms that produced the best execution times over number of dimensions (horizontal) and number of clusters (vertical), showing fairly clear regions of algorithm superiority. Enjoyably, the Shallot algorithm that was developed in this paper yields the best results for the largest number of parameter combinations.

**Fig. 7.** Relative comparison between the Shallot algorithm and the Exponion algorithm. The left diagram refers to the number of distance computations, the right diagram to execution time. Blue means that Shallot is better, red that Exponion is better. (Color figure online)

The results of these experiments are visualized in Figs. 6, 7 and 8. Figure 6 shows, on a grid spanned by the number of dimensions (horizontal axis) and the number of clusters in the data set (vertical axis), which algorithm performed best (in terms of execution time) for each combination. Clearly, the Shallot algorithm wins most parameter combinations. Only for larger numbers of dimensions and larger numbers of clusters is the YinYang algorithm superior.

In order to get deeper insights, Fig. 7 shows on the same grid a comparison of the number of distance computations (left) and the execution times (right) of the Shallot algorithm and the Exponion algorithm. The relative performance

**Fig. 8.** Variation of the execution times over number of dimensions (horizontal) and number of clusters (vertical). The left diagram refers to the Shallot algorithm, the right diagram to the Exponion algorithm. The larger variation for fewer clusters and fewer dimensions may explain the speckled look of Figs. 6 and 7.

**Fig. 9.** Relative comparison between the Shallot algorithm and the YinYang algorithm using the cluster to cluster distance test (pure YinYang is very similar, though). The left diagram refers to the number of distance computations, the right diagram to execution time. Blue means that Shallot is better, red that YinYang is better. (Color figure online)

is color-coded: saturated blue means that the Shallot algorithm needed only half the distance computations or half the execution time of the Exponion algorithm, saturated red means that it needed 1.5 times the distance computations or execution time compared to the Exponion algorithm.

W.r.t. distance computations there is no question who the winner is: the Shallot algorithm wins all parameter combinations, some with a considerable margin. W.r.t. execution times, the Shallot algorithm also clearly wins the region towards more dimensions and more clusters, but for fewer clusters and fewer dimensions the diagram looks a bit speckled. This is a somewhat strange result, as a smaller number of distance computations should lead to lower execution times, because the effort spent on organizing the search, which is also carried out in exactly the same situations, is hardly different between the Shallot and the Exponion algorithm.

The reason for this speckled look could be that the benchmarks were carried out with heavy parallelization (in order to minimize the total time), which may have distorted the measurements. As a test of this hypothesis, Fig. 8 shows the standard deviation of the execution times relative to their mean. White means no variation, fully saturated blue indicates a standard deviation half as large as the mean value. The left diagram refers to the Shallot, the right diagram to the Exponion algorithm. Clearly, for a smaller number of dimensions and especially for a smaller number of clusters the execution times vary more (this may be, at least in part, due to the generally lower execution times for these parameter combinations). It is plausible to assume that this variability is the explanation for the speckled look of the diagrams in Fig. 6 and in Fig. 7 on the right.

Finally, Fig. 9 shows, again on the same grid, a comparison of the number of distance computations (left) and the execution times (right) of the Shallot algorithm and the YinYang algorithm (using the test based on cluster to cluster distances, although a pure YinYang algorithm performs very similarly). The relative performance is color-coded in the same way as in Fig. 7. Clearly, the smaller number of distance computations explains why the YinYang algorithm is superior for more clusters and more dimensions.

The reason is likely that grouping the centers leads to better bounds. This hypothesis is confirmed by the fact that the Elkan algorithm (k distance bounds) always needs the fewest distance computations (not shown as a grid) and loses on execution time only due to having to update so many distance bounds.

## **5 Conclusion**

In this paper I introduced the Shallot algorithm, which adds two improvements to the Exponion algorithm [11], both of which can potentially shrink the (hyper-) ball that has to be searched for the two closest centers if recomputation becomes necessary. This leads to a measurable, sometimes even fairly large speedup compared to the Exponion algorithm due to fewer distance computations. However, for high-dimensional data and large numbers of clusters the YinYang algorithm [4] (with or without the cluster to cluster distance test) is superior to both algorithms. Yet, since clustering in high dimensions is problematic anyway due to the curse of dimensionality, it may be claimed reasonably confidently that the Shallot algorithm is the best choice for standard clustering tasks.

**Software.** My implementation of the described methods (C and Python), with which I conducted the experiments, can be obtained under the MIT License at http://www.borgelt.net/cluster.html.

**Complete Results.** A table with the complete experimental results I obtained can be retrieved as a simple text table at

http://www.borgelt.net/docs/clsbench.txt.

More maps comparing the performance of the algorithms can be found at http://www.borgelt.net/docs/clsbench.pdf.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Ising-Based Consensus Clustering on Specialized Hardware**

Eldan Cohen1(B) , Avradip Mandal<sup>2</sup>, Hayato Ushijima-Mwesigwa<sup>2</sup>, and Arnab Roy<sup>2</sup>

<sup>1</sup> University of Toronto, Toronto, Canada ecohen@mie.utoronto.ca <sup>2</sup> Fujitsu Laboratories of America, Inc., Sunnyvale, USA {amandal,hayato,aroy}@us.fujitsu.com

**Abstract.** The emergence of specialized optimization hardware such as CMOS annealers and adiabatic quantum computers carries the promise of solving hard combinatorial optimization problems more efficiently in hardware. Recent work has focused on formulating different combinatorial optimization problems as Ising models, the core mathematical abstraction used by a large number of these hardware platforms, and evaluating the performance of these models when solved on specialized hardware. An interesting area of application is data mining, where combinatorial optimization problems underlie many core tasks. In this work, we focus on consensus clustering (clustering aggregation), an important combinatorial problem that has received much attention over the last two decades. We present two Ising models for consensus clustering and evaluate them using the Fujitsu Digital Annealer, a quantum-inspired CMOS annealer. Our empirical evaluation shows that our approach outperforms existing techniques and is a promising direction for future research.

## **1 Introduction**

The increasingly challenging task of scaling the traditional Central Processing Unit (CPU) has led to the exploration of new computational platforms such as quantum computers, CMOS annealers, neuromorphic computers, and so on (see [3] for a detailed exposition). Although their physical implementations differ significantly, adiabatic quantum computers, CMOS annealers, memristive circuits, and optical parametric oscillators all share Ising models as their core mathematical abstraction [3]. This has led to a growing interest in the formulation of computational problems as Ising models and in the empirical evaluation of these models on such novel computational platforms. This body of literature includes clustering and community detection [14,19,23], graph partitioning [26], and many NP-Complete problems such as covering, packing, and coloring [17].

Consensus clustering is the problem of combining multiple 'base clusterings' of the same set of data points into a single consolidated clustering [9]. Consensus clustering is used to generate robust, stable, and more accurate clustering

E. Cohen—Work done while at Fujitsu Laboratories of America.


results compared to a single clustering approach [9]. The problem of consensus clustering has received significant attention over the last two decades [9], and was previously considered under different names (clustering aggregation, cluster ensembles, clustering combination) [10]. It has applications in different fields including data mining, pattern recognition, and bioinformatics [10], and a number of algorithmic approaches have been used to solve this problem. Consensus clustering is, in essence, a combinatorial optimization problem [28], and different instances of the problem have been proven to be NP-hard (e.g., [6,25]).

In this work, we investigate the use of special purpose hardware to solve the problem of consensus clustering. To this end, we formulate the problem of consensus clustering using Ising models and evaluate our approach on a specialized CMOS annealer. We make the following contributions:


## **2 Background**

### **2.1 Problem Definition**

Let $X = \{x_1, \dots, x_n\}$ be a set of $n$ data points. A *clustering* of $X$ is a process that partitions $X$ into subsets, referred to as *clusters*, that together cover $X$. A clustering is represented by the mapping $\pi : X \to \{1, \dots, k_\pi\}$, where $k_\pi$ is the number of clusters produced by clustering $\pi$. Given $X$ and a set $\Pi = \{\pi_1, \dots, \pi_m\}$ of $m$ clusterings of the points in $X$, the *Consensus Clustering Problem* is to find a new clustering, $\pi^*$, of the data $X$ that best summarizes the set of clusterings $\Pi$. The new clustering $\pi^*$ is referred to as the *consensus* clustering.

Due to the ambiguity in the definition of an optimal consensus clustering, several approaches have been proposed to measure the solution quality of consensus clustering algorithms [9]. In this work, we focus on the approach of determining a consensus clustering that agrees the most with the original clusterings. As an objective measure of this agreement, we use the mean Adjusted Rand Index (ARI) metric (Eq. 14). However, we also consider clustering quality measured by the mean Silhouette Coefficient [22] and clustering accuracy based on true labels. In Sect. 4 these evaluation criteria are discussed in more detail.

#### **2.2 Existing Criteria and Methods**

Various criteria or objectives have been proposed for the Consensus Clustering Problem. In this work we mainly focus on two well-studied criteria, one based on the pairwise similarity of the data points, and the other based on the different assignments of the base clusterings. Other well-known criteria and objectives for the Consensus Clustering Problem can be found in the excellent surveys of [9,27], with most defining NP-Hard optimization problems.

*Pairwise Similarity Approaches:* In this approach, a similarity matrix S is constructed such that each entry in S represents the fraction of clusterings in which two data points belong to the same cluster [20]. In particular,

$$S\_{uv} = \frac{1}{m} \sum\_{i=1}^{m} \mathbb{1}(\pi\_i(u) = \pi\_i(v)),\tag{1}$$

with $\mathbb{1}$ being the indicator function. The value $S_{uv}$ lies between 0 and 1, and is equal to 1 if all the base clusterings assign points $u$ and $v$ to the same cluster. Once the pairwise similarity matrix is constructed, one can use any similarity-based clustering algorithm on $S$ to find a consensus clustering with a fixed number of clusters, $K$. For example, [16] proposed to find a consensus clustering $\pi^*$ with exactly $K$ clusters that minimizes the within-cluster dissimilarity:

$$\min \sum\_{\substack{u,v \in X \colon \\ \pi^\*(u) = \pi^\*(v)}} (1 - S\_{uv}). \tag{2}$$
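For illustration, a small numpy sketch (names mine, not from the paper) that builds $S$ from base clusterings given as label arrays and evaluates the objective of Eq. (2) for a candidate consensus clustering:

```python
import numpy as np

def similarity_matrix(partitions):
    """S[u, v] = fraction of base clusterings that put u and v together
    (Eq. (1)); partitions is an (m, n) array of cluster labels."""
    m, n = partitions.shape
    S = np.zeros((n, n))
    for labels in partitions:
        S += (labels[:, None] == labels[None, :])
    return S / m

def within_cluster_dissimilarity(S, consensus):
    """Objective of Eq. (2): sum of (1 - S_uv) over pairs of points that
    the consensus clustering assigns to the same cluster."""
    same = consensus[:, None] == consensus[None, :]
    return float(((1.0 - S) * same).sum())
```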

*Partition Difference Approaches:* An alternative formulation is based on the different assignments between clusterings. Consider two data points $u, v \in X$, and two clusterings $\pi_i, \pi_j \in \Pi$. The following binary indicator tests whether $\pi_i$ and $\pi_j$ disagree on the clustering of $u$ and $v$:

$$d\_{u,v}(\pi\_i, \pi\_j) = \begin{cases} 1, & \text{if } \pi\_i(u) = \pi\_i(v) \text{ and } \pi\_j(u) \neq \pi\_j(v) \\ 1, & \text{if } \pi\_i(u) \neq \pi\_i(v) \text{ and } \pi\_j(u) = \pi\_j(v) \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

The distance between two clusterings is then defined based on the number of pairwise disagreements:

$$d(\pi\_i, \pi\_j) = \frac{1}{2} \sum\_{u, v \in X} d\_{u, v}(\pi\_i, \pi\_j) \tag{4}$$

with the $\frac{1}{2}$ factor taking care of double counting (it can be ignored for optimization purposes). This measure is defined as the number of pairs of points that are in the same cluster in one clustering and in different clusters in the other, essentially considering the (unadjusted) Rand index [9]. Given this measure, a common objective is to find a consensus clustering $\pi^*$ with respect to the following optimization problem:

$$\min \sum\_{i=1}^{m} d(\pi\_i, \pi^\*). \tag{5}$$

*Methods and Algorithms:* The two different criteria given above define fundamentally different optimization problems, and thus different algorithms have been proposed. One key difference between the two approaches inherently lies in determining the number of clusters $k_{\pi^*}$ in $\pi^*$. The pairwise similarity approaches (e.g., Eq. (2)) require an input parameter $K$ that fixes the number of clusters in $\pi^*$, whereas the partition difference approaches such as Eq. (5) do not have this requirement, and determining $k_{\pi^*}$ is part of the objective of the problem. Therefore, for example, Eq. (2) attains its minimum value when $k_{\pi^*} = n$; however, this does not hold for Eq. (5).

The Cluster-based Similarity Partitioning Algorithm (CSPA) is proposed in [24] for solving the pairwise similarity based approach. CSPA constructs a similarity-based graph with each edge having a weight proportional to the similarity given by $S$. Determining the consensus clustering with exactly $K$ clusters is treated as a $K$-way graph partitioning problem, which is solved by methods such as METIS [12]. In [20], the authors experiment with different clustering algorithms, including hierarchical agglomerative clustering (HAC) and iterative techniques that start from an initial partition and iteratively reassign points to clusters based on their pairwise similarities. For the partition difference approach, Li et al. [15] proposed to solve Eq. (5) using nonnegative matrix factorization (NMF). Gionis et al. [10] proposed several algorithms that make use of the connection between Eq. (5) and the problem of correlation clustering. These three approaches (CSPA, HAC, and NMF) are used as baselines in our empirical evaluation (Sect. 4).

#### **2.3 Ising Models**

Ising models are graphical models that include a set of nodes representing spin variables and a set of edges corresponding to the interactions between the spins. The energy of an Ising model, which we aim to minimize, is given by:

$$E(\sigma) = \sum\_{(i,j)\in\mathcal{E}} J\_{i,j}\sigma\_i\sigma\_j + \sum\_{i\in\mathcal{N}} h\_i\sigma\_i,\tag{6}$$

where the variables $\sigma_i \in \{-1, 1\}$ are the spin variables and the couplers, $J_{i,j}$, represent the interaction between the spins.

A Quadratic Unconstrained Binary Optimization (QUBO) model includes binary variables $q_i \in \{0, 1\}$ and couplers, $c_{i,j}$. The objective to minimize is:

$$E(\mathbf{q}) = \sum\_{i=1}^{n} c\_i q\_i + \sum\_{i<j} c\_{i,j}\, q\_i q\_j, \tag{7}$$

QUBO models can be transformed to Ising models by setting $\sigma_i = 2q_i - 1$ [2].
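As a sketch, substituting $q_i = (\sigma_i + 1)/2$ into Eq. (7) yields the Ising couplers, local fields, and a constant offset; the helper below (names mine) performs this conversion for coefficients stored as a vector and an upper-triangular matrix:

```python
import numpy as np

def qubo_to_ising(c_lin, c_quad):
    """Convert QUBO coefficients (Eq. (7)) to Ising form (Eq. (6)) via
    q_i = (sigma_i + 1) / 2; c_quad is an upper-triangular (n, n) matrix."""
    J = c_quad / 4.0                                          # couplers
    h = c_lin / 2.0 + (c_quad + c_quad.T).sum(axis=1) / 4.0   # local fields
    offset = c_lin.sum() / 2.0 + c_quad.sum() / 4.0           # constant shift
    return J, h, offset
```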

## **3 Ising Approach for Consensus Clustering on Specialized Hardware**

In this section, we present our approach for solving consensus clustering on specialized hardware using Ising models. We present two Ising models that correspond to the two approaches in Sect. 2.2. We then demonstrate how they can be solved on the Fujitsu Digital Annealer (DA), a specialized CMOS hardware.

#### **3.1 Pairwise Similarity-Based Ising Model**

For each data point $u \in X$, let $q_{uc} \in \{0, 1\}$ be the binary variable such that $q_{uc} = 1$ if $\pi^*$ assigns $u$ to cluster $c$, and 0 otherwise. Then the constraints

$$\sum\_{c=1}^{K} q\_{uc} = 1,\quad \text{for each } u \in X \tag{8}$$

ensure that $\pi^*$ assigns each point to exactly one cluster. Subject to the constraints (8), the sum of quadratic terms $\sum_{c=1}^{K} q_{uc} q_{vc}$ is 1 if $\pi^*$ assigns both $u, v \in X$ to the same cluster, and 0 if they are assigned to different clusters. Therefore the value

$$\sum\_{\substack{u,v \in X \colon \\ \pi^\*(u) = \pi^\*(v)}} (1 - S\_{uv}) = \sum\_{u,v \in X} (1 - S\_{uv}) \sum\_{c=1}^{K} q\_{uc} q\_{vc} \tag{9}$$

represents the sum of within-cluster dissimilarities in $\pi^*$: $(1 - S_{uv})$ is the fraction of clusterings in $\Pi$ that assign $u$ and $v$ to different clusters while $\pi^*$ assigns them to the same cluster. We therefore reformulate Eq. (2) as a QUBO:

$$\min \sum\_{u,v \in X} (1 - S\_{uv}) \sum\_{c=1}^{K} q\_{uc} q\_{vc} + \sum\_{u \in X} A (\sum\_{c=1}^{K} q\_{uc} - 1)^2. \tag{10}$$

where the term $\sum_{u \in X} A (\sum_{c=1}^{K} q_{uc} - 1)^2$ is added to the objective function to ensure that the constraints (8) are satisfied. $A$ is a positive constant that penalizes the objective for violations of the constraints (8). One can show that if $A \geq n$, the optimal solution of the QUBO in Eq. (10) does not violate the constraints (8). The proof is very similar to the proof of Theorem 1 and to a similar result in [14].
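For illustration, a numpy sketch (names mine) that assembles the QUBO of Eq. (10) as a symmetric matrix over the $nK$ variables $q_{uc}$, flattened as $u \cdot K + c$:

```python
import numpy as np

def pairwise_qubo(S, K, A):
    """Symmetric QUBO matrix for Eq. (10); the energy is q^T Q q, the
    diagonal holds the linear terms, and variable u*K + c encodes q_uc."""
    n = S.shape[0]
    Q = np.zeros((n * K, n * K))
    D = 1.0 - S                        # within-cluster dissimilarities
    for c in range(K):
        idx = np.arange(n) * K + c
        Q[np.ix_(idx, idx)] += D       # (1 - S_uv) q_uc q_vc terms
    for u in range(n):                 # penalty A * (sum_c q_uc - 1)^2
        idx = u * K + np.arange(K)
        Q[np.ix_(idx, idx)] += A       # cross terms q_uc q_ul
        Q[idx, idx] -= 2.0 * A         # diagonal ends up at -A
    return Q                           # constant offset A*n omitted
```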

#### **3.2 Partition Difference Ising Model**

The partition difference approach essentially considers the (unadjusted) Rand Index [9] and therefore can be expected to perform better. The *Correlation Clustering Problem* is another important problem in data mining. Gionis et al. [10] showed that Eq. (5) is a restricted case of the Correlation Clustering Problem, and that Eq. (5) can be expressed as the following equivalent form of the Correlation Clustering Problem

$$\min\_{\pi^\star} \sum\_{\substack{u,v \in X \colon \\ \pi^\star(u) = \pi^\star(v)}} (1 - S\_{uv}) + \sum\_{\substack{u,v \in X \colon \\ \pi^\star(u) \neq \pi^\star(v)}} S\_{uv} \,. \tag{11}$$

We take advantage of this equivalence to model Eq. (5) as a QUBO. In a similar fashion to the QUBO formulated in the preceding subsection, the terms

$$\sum\_{\substack{u,v \in X:\\ \pi^\*(u) \neq \pi^\*(v)}} S\_{uv} = \sum\_{u,v \in X} S\_{uv} \sum\_{1 \le c \ne l \le K} q\_{uc} q\_{vl} \tag{12}$$

measure the similarity between points in *different* clusters, where $K$ represents an *upper bound* on the number of clusters in $\pi^*$. This then leads to minimizing the following QUBO:

$$\sum\_{u,v \in X} \left(1 - S\_{uv}\right) \sum\_{c=1}^{K} q\_{uc} q\_{vc} + \sum\_{u,v \in X} S\_{uv} \sum\_{1 \le c \ne l \le K} q\_{uc} q\_{vl} + \sum\_{u \in X} B(\sum\_{c=1}^{K} q\_{uc} - 1)^2. \tag{13}$$

Intuitively, Eq. (13) measures the disagreement between the consensus clustering and the clusterings in $\Pi$. This disagreement is due to points that are clustered together in the consensus clustering but not in the clusterings in $\Pi$; it is also due to points that are assigned to different clusters in the consensus partition but to the same cluster in some of the partitions in $\Pi$.

Formally, we can show that Eq. (13) is equivalent to the correlation clustering formulation in Eq. (11) when setting $B \geq n$. Consistent with other methods that optimize Eq. (5) (e.g., [15]), our approach takes as an input $K$, an *upper bound* on the number of clusters in $\pi^*$; however, the obtained solution can use a smaller number of clusters. In our proof, we assume $K$ is large enough to represent the optimal solution, i.e., greater than the number of clusters in optimal solutions to the correlation clustering problem in Eq. (11).

**Theorem 1.** *Let $\bar{\mathbf{q}}$ be the optimal solution to the QUBO given by Eq. (13). If $B \geq n$, then for a large enough $K \leq n$, an optimal solution to the Correlation Clustering Problem in Eq. (11), $\bar{\pi}$, can be efficiently evaluated from $\bar{\mathbf{q}}$.*

*Proof.* First we show that the optimal solution to the QUBO in Eq. (13) satisfies the one-hot encoding ($\sum_c q_{uc} = 1$ for every $u$). This implies that, given $\bar{\mathbf{q}}$, we can create a valid clustering $\bar{\pi}$. Note that the optimal solution will never have $\sum_c q_{uc} > 1$, as this can only increase the cost. The only case in which an optimal solution would have $\sum_c q_{uc} < 1$ is when the cost of assigning a point to a cluster is higher than the cost of not assigning it to any cluster (i.e., the penalty $B$). Assigning a point $u$ to a cluster incurs a cost of $(1 - S_{uv})$ for each point $v$ in the same cluster and $S_{uv}$ for each point $v$ that is not in that cluster. As there are $n-1$ additional points in total, and both $(1 - S_{uv})$ and $S_{uv}$ are at most one (Eq. (1)), setting $B \geq n$ guarantees that the optimal solution satisfies the one-hot encoding.

Now assume that $\bar{\pi}$ is not optimal, i.e., there exists an optimal solution $\hat{\pi}$ to Eq. (11) with a strictly lower cost than $\bar{\pi}$. Let $\hat{\mathbf{q}}$ be the QUBO solution corresponding to $\hat{\pi}$, such that $\hat{q}_{uk} = 1$ if and only if $\hat{\pi}(u) = k$. This is possible because $K$ is large enough to accommodate all clusters in $\hat{\pi}$. As both $\bar{\mathbf{q}}$ and $\hat{\mathbf{q}}$ satisfy the one-hot encoding (their penalty terms are zero), their costs are identical to the costs of $\bar{\pi}$ and $\hat{\pi}$, respectively. Since the cost of $\hat{\pi}$ is strictly lower than that of $\bar{\pi}$, and the cost of $\bar{\mathbf{q}}$ is lower than or equal to that of $\hat{\mathbf{q}}$, we have a contradiction.

#### **3.3 Solving Consensus Clustering on the Fujitsu Digital Annealer**

The Fujitsu Digital Annealer (DA) is recent CMOS hardware for solving combinatorial optimization problems formulated as QUBO [1,8]. We use the second generation of the DA, which is capable of representing problems with up to 8192 variables with up to 64 bits of precision. The DA has previously been used to solve problems in areas such as communication [18] and signal processing [21].

The DA algorithm [1] is based on simulated annealing (SA) [13], while taking advantage of the massive parallelization provided by the CMOS hardware [1]. It has several key differences compared to SA, most notably a *parallel-trial* scheme in which each Monte Carlo step considers all possible one-bit flips in parallel, and a *dynamic offset* mechanism that increases the energy of a state to escape local minima [1].

**Encoding Consensus Clustering on the DA.** When embedding our Ising models on the DA, we need to consider the hardware specification and adapt the representation of our model accordingly. Due to the hardware's precision limit, we need to embed the couplers and biases on an integer scale with limited granularity. In our experiments, we normalize the pairwise costs $S_{uv}$ to the discrete range $[0, 100]$, $D_{uv} = [S_{uv} \cdot 100]$, and accordingly $(1 - S_{uv})$ is replaced by $(100 - D_{uv})$. Note that the theoretical bound $B = n$ is adjusted accordingly to $B = 100 \cdot n$.

The theoretical bound guarantees that all constraints are satisfied if problems are solved to optimality. In practice, the DA does not necessarily solve problems to optimality, and due to the nature of annealing-based algorithms, using very high weights for constraints is likely to create deep local minima and result in solutions that may satisfy the constraints but are often of low quality. This is especially relevant to our pairwise similarity model, where the bound tends to become loose as the number of clusters grows. In our experiments, we use constant, reasonably high weights that were empirically found to perform well across datasets. For the pairwise similarity-based model (Eq. (10)) we use $A = 2^{14}$, and for the partition difference model (Eq. (13)) we use $B = 2^{15}$. While we expect to get better performance by tuning the weights per dataset, our goal is to demonstrate the performance of our approach in a general setting. Automatic tuning of the weight values for the DA is a direction for future work.

Unlike many of the existing consensus clustering algorithms that run until convergence, our method runs for a given time limit (defined by the number of runs and iterations) and returns the best solution encountered. In our experiments, we arbitrarily choose *three seconds* as a (reasonably short) time limit to solve our Ising models. As with the weights, we employ a single temperature schedule across all datasets, and *do not* tune it per dataset.

## **4 Empirical Evaluation**

We perform an extensive empirical evaluation of our approach using a set of seven benchmark datasets. We first describe how we generate the set of clusterings, Π. Next, we describe the baselines, the evaluation metrics, and the datasets.

**Generating Partitions.** We follow [7] and generate a set of clusterings by randomizing the parameters of the K-Means algorithm, namely the number of clusters and the initial cluster centers. In this work, we only use labelled datasets for which we know the number of clusters, $K$, based on the true labels. To generate the base clusterings we run the K-Means algorithm with random initial cluster centers and a number of clusters chosen randomly from the range $[2, 3K]$. For each dataset, we generate 100 clusterings to serve as the clustering set $\Pi$.
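A minimal sketch of this procedure, assuming scikit-learn's KMeans (the names here are illustrative; the paper specifies only the protocol):

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_partitions(X, K, m=100, seed=0):
    """m base clusterings of X, each from K-Means with random initial
    centers and a cluster count drawn uniformly from [2, 3K]."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(m):
        k = int(rng.integers(2, 3 * K + 1))
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=int(rng.integers(2**31)))
        partitions.append(km.fit_predict(X))
    return np.array(partitions)
```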

**Baseline Algorithms.** We compare our pairwise similarity-based Ising model, referred to as DA-Sm, and our correlation clustering Ising model, referred to as DA-Cr, to three popular algorithms for consensus clustering:


**Evaluation.** We evaluate the different methods using three measures. Our main concern in this work is the level of agreement between the consensus clustering and the set of input clusterings. To this end, one requires a metric measuring the similarity of two clusterings that can be used to measure how close the consensus clustering $\pi^*$ is to each base clustering $\pi_i \in \Pi$. Two popular metrics for measuring the similarity between two clusterings are the Rand Index (RI) and the Adjusted Rand Index (ARI) [11]. The Rand Index of two clusterings lies between 0 and 1, attaining the value 1 when both clusterings perfectly agree. Likewise, the maximum score of the ARI, which is the corrected-for-chance version of the RI, is achieved when both clusterings perfectly agree. $ARI(\pi_i, \pi^*)$ can be viewed as a measure of *agreement* between the consensus clustering $\pi^*$ and a base clustering $\pi_i \in \Pi$. We use the mean ARI as the main evaluation criterion:

$$\frac{1}{m} \sum\_{i=1}^{m} ARI(\pi\_i, \pi^\*) \tag{14}$$

We also evaluate π<sup>∗</sup> based on clustering quality and accuracy. For clustering quality, we use the mean Silhouette Coefficient [22] of all data points (computed using the Euclidean distance between the data points). For clustering accuracy, we compute the ARI between the consensus partition π<sup>∗</sup> and the true labels.
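All three measures are available in scikit-learn; a brief sketch (names mine):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

def evaluate(consensus, partitions, X, true_labels):
    """Mean ARI to the base clusterings (Eq. (14)), Silhouette quality,
    and accuracy as the ARI against the true labels."""
    mean_ari = np.mean([adjusted_rand_score(p, consensus)
                        for p in partitions])
    quality = silhouette_score(X, consensus, metric="euclidean")
    accuracy = adjusted_rand_score(true_labels, consensus)
    return mean_ari, quality, accuracy
```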

**Benchmark Datasets.** We run experiments on seven datasets with different characteristics: *Iris, Optdigits, Pendigits, Seeds, Wine* from the UCI repository [5] as well as *Protein* [29] and *MNIST*.<sup>1</sup> *Optdigits-389* is a randomly sampled subset of Optdigits containing only the digits {3, 8, 9}. Similarly, *MNIST-3689* and *Pendigits-149* are subsets of the MNIST and Pendigits datasets.

<sup>1</sup> http://yann.lecun.com/exdb/mnist/.

Table 1 provides statistics on each of the datasets, with the coefficient of variation (CV) [4] describing the degree of class imbalance: zero indicates perfectly balanced classes, while higher values indicate a higher degree of class imbalance.


**Table 1.** Datasets

#### **4.1 Results**

We compare the baseline algorithms to the two Ising models in Sect. 3 solved using the Fujitsu Digital Annealer described in Sect. 3.3.

Clustering is typically an unsupervised task and the number of clusters is unknown. The number of clusters in the true labels, $K$, is not available in real scenarios. Furthermore, $K$ is not necessarily the best value for clustering tasks (e.g., in many cases it is better to have smaller clusters that are more pure). We therefore test the algorithms in two configurations: when the number of clusters is set to $K$, as in the true labels, and when the number of clusters is set to $2K$.


**Table 2.** Consensus performance measured by mean ARI across partitions

**Consensus Criteria.** Table 2 shows the mean ARI between $\pi^*$ and the clusterings in $\Pi$. To avoid bias due to very minor differences, we consider all methods that achieve a mean ARI within a threshold of 0.0025 of the best method to be equivalent and highlight them in bold. We also summarize the number of times each method was considered best across the different datasets.

The results show that DA-Cr is the best performing method for both $K$ and $2K$ clusters. The results of DA-Sm are not consistent: DA-Sm and NMF perform well for $K$ clusters, while HAC performs better for $2K$ clusters.

**Clustering Quality.** Table 3 reports the mean Silhouette Coefficient of all data points. Again, DA-Cr is the best performing method across datasets, followed by HAC. NMF seems to be equivalent to HAC for $2K$.


**Table 3.** Clustering quality measured by Silhouette

**Clustering Accuracy.** Table 4 shows the clustering accuracy measured by the ARI between $\pi^*$ and the true labels. For $K$, we find DA-Sm to be the best-performing solution (followed by DA-Cr). For $2K$, DA-Cr outperforms the other methods. Interestingly, there is no clear winner between CSPA, NMF, and HAC.

**Experiments with Higher** *K.* In partition difference approaches, increasing $K$ does not necessarily lead to a $\pi^*$ that has more clusters. Instead, $K$ serves as an upper bound, and new clusters are used only if they reduce the objective.

To demonstrate how different algorithms handle different $K$ values, Table 5 shows the consensus criteria and the actual number of clusters in $\pi^*$ for different values of $K$ (note that $K = 3$ in Iris). The results show that the performance of the pairwise similarity methods (CSPA, HAC, DA-Sm) degrades as we increase $K$. This is associated with the fact that the actual number of clusters in $\pi^*$ is equal to $K$, which is significantly higher compared to the clusterings in $\Pi$. Methods based on partition difference (NMF and DA-Cr) do not exhibit significant degradation, and the actual number of clusters does not grow beyond 5 for DA-Cr and 6 for NMF. Note that the average number of clusters in $\Pi$ is 5.26.


**Table 4.** Clustering accuracy measured by ARI compared to true labels

**Table 5.** Results for Iris dataset with different number of clusters


## **5 Conclusion**

Motivated by the recent emergence of specialized hardware platforms, we present a new approach to the consensus clustering problem that is based on Ising models and solved on the Fujitsu Digital Annealer, a specialized CMOS hardware. We perform an extensive empirical evaluation and show that our approach outperforms existing methods on a set of seven datasets. These results show that using specialized hardware for core data mining tasks can be a promising research direction. As future work, we plan to investigate additional problems in data mining that can benefit from the use of specialized optimization hardware, as well as experimenting with different types of specialized hardware platforms.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Transfer Learning by Learning Projections from Target to Source**

Antoine Cornuéjols<sup>1(B)</sup>, Pierre-Alexandre Murena<sup>1,2</sup>, and Raphaël Olivier<sup>3</sup>

<sup>1</sup> UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France antoine.cornuejols@agroparistech.fr <sup>2</sup> Telecom ParisTech - Université Paris-Saclay, 75013 Paris, France <sup>3</sup> Carnegie Mellon University, Pittsburgh, USA http://www.agroparistech.fr/mia/equipes:membres:page:antoine

**Abstract.** Using transfer learning to help in solving a new classification task where labeled data is scarce is becoming popular. Numerous experiments with deep neural networks, where the representation learned on a source task is transferred to learn a target neural network, have shown the benefits of the approach. This paper, similarly, deals with hypothesis transfer learning. However, it presents a new approach where, instead of transferring a representation, the source hypothesis is kept and it is a translation from the target domain to the source domain that is learned. In a way, a change of representation is learned. We show how this method performs very well on a time series classification task where the space of time series changes between source and target.

**Keywords:** Transfer learning · Boosting

## **1 Introduction**

While transfer learning has a long history, dating back at least to the study of analogy reasoning, it has enjoyed a spectacular rise of interest in recent years, thanks largely to its use and effectiveness in learning new tasks with deep neural networks using an architecture learned on a source task. This approach is called Hypothesis Transfer Learning [6]. The justification for this strategy is that, in the absence of enough data in the target domain to learn anew a good hypothesis, it might be effective to transfer the intermediate representations learned on the source task. This is indeed the case, for instance, in face analysis when the source task is to guess the age of the person, and the target task is to recognize the gender. Technically, with neural networks, this amounts to keeping the first layers of the source neural network in the target network and learning only the last layers, the ones that combine intermediate representations of the examples in order to make a prediction.

Let X, Y and Z be the input, output and feature spaces respectively. Let F be a class of representation functions, where f ∈ F : X→Z. Let G be a class

of decision functions that use descriptions of the examples in the feature space: g ∈ G : Z→Y. Then, in the context of deep neural networks, the hypothesis class is H := {h : ∃f ∈ F, g ∈ G st. h = g ◦ f} and while f is kept (at least approximately) from the source problem to the target one, only g remains to be learned to solve the target problem.

In this paper, we adopt a dual perspective: we propose to keep the decision function g fixed, and learn *translation functions* from the target input space to the source input space, π : X<sup>T</sup> → X<sup>S</sup> , such that the target hypothesis space becomes H<sup>T</sup> := {h<sup>T</sup> : ∃π ∈ Π, f ∈ F, g ∈ G st. h<sup>T</sup> = g ◦f ◦π}, which, given that h<sup>S</sup> = g ◦ f might be considered as the source hypothesis, may be re-expressed as: H<sup>T</sup> := {h<sup>T</sup> : ∃π ∈ Π, f ∈ F, g ∈ G st. h<sup>T</sup> = h<sup>S</sup> ◦ π}.

Indeed, for some problems, it might be much easier to learn a translation (also called a *projection* in this paper) from the target input space X<sup>T</sup> to the source input space X<sup>S</sup> than to learn a new target decision function. Furthermore, this allows one to tackle problems with different input spaces X<sup>S</sup> and X<sup>T</sup>.

In the following, Sect. 2 presents TransBoost, a new algorithm for transfer learning. The theoretical analysis of Sect. 3 provides a PAC-learning bound on the generalization error on the target domain. Controlled experiments are described in Sect. 4, together with an analysis of the results. The new approach is put in perspective in Sect. 5 before we conclude in Sect. 6.

## **2 A New Algorithm for Transfer Learning**

Suppose that we have a system that is able to recognize poppy fields in satellite images. We might imagine that, knowing how to translate a biopsy image into a satellite image, we could, using the recognition function defined on satellite images, decide if there are cancerous cells in the biopsy.

Ideally then, one could translate a target query: "what is the label of **x**<sup>T</sup> ∈ X<sup>T</sup> " into a source query "what is the label of π(**x**<sup>T</sup> ) ∈ X<sup>S</sup> " where h<sup>S</sup> is the source hypothesis which, applied to π(**x**<sup>T</sup> ) ∈ X<sup>S</sup> , provides the answer we are looking for. Notice here that we suppose that Y<sup>S</sup> = Y<sup>T</sup> , but not X<sup>S</sup> = X<sup>T</sup> .

The goal is then to learn a good translation π : X<sup>T</sup> → X<sup>S</sup>. However, defining a proper space of candidate projections Π might be problematic, not to mention the risk of overfitting if the space of functions h<sup>S</sup> ◦ Π has too high a capacity. It might be easier and more manageable to discover "weak projections" from X<sup>T</sup> to X<sup>S</sup> using a boosting learning scheme.

**Definition 1.** *A weak projection w.r.t. source decision function* h<sup>S</sup> *is a function* π : X<sup>T</sup> → X<sup>S</sup> *such that the decision function* h<sup>S</sup> ◦ π(**x**<sup>T</sup>) *has better than random classification performance on the target training set* S<sup>T</sup>*.*

In this setting, the training set $S_T = \{(\mathbf{x}^T_i, y^T_i)\}_{1 \le i \le m}$ is used to learn *weak projections* (Fig. 1).

Once the concept of weak projection is granted, it is natural to use a boosting algorithm in order to learn a set of such weak projections and to combine them to get a final good classification of the elements of X_T. This is what the

**Fig. 1.** The principle of prediction using TransBoost. A given target example **x**_i^T is projected into the source domain using a set of identified weak projections π_j, and the prediction for **x**_i^T is computed as H_T(**x**_i^T) = sign(Σ_{j=1}^N α_j h_S(π_j(**x**_i^T))).

TransBoost algorithm does (see Algorithm 1). It relies on the ability of boosting to find and combine weak rules into a strong(er) rule.

### **3 Theoretical Analysis**

Here, we study the question: can we get *guarantees* about the performance of the learned decision function H_T in the target space using TransBoost?

We tackle this question in two steps. First, we suppose that we learn a single projection function π ∈ Π : X_T → X_S, so that h_T = h_S ◦ π, and we find bounds on the generalization error in the target domain given the generalization error in the source domain. Second, we turn to the TransBoost algorithm in order to justify the use of a boosting approach.

#### **3.1 Generalization Error Bounds When Using a Single Projection**

For this analysis, we suppose the existence of a source input distribution **P**_{X_S} in addition to the target input distribution **P**_{X_T}. We consider the binary classification setting Y = {−1, +1}, and we denote by h̄_S and h̄_T the source and target labelling functions, respectively. We write R_S(h) (resp. R_T(h)) for the risk of a hypothesis h on the source (resp. target) domain: R_S(h) = E_{**x**_S ∼ **P**_{X_S}}[h(**x**_S) ≠ h̄_S(**x**_S)] (resp. R_T(h) = E_{**x**_T ∼ **P**_{X_T}}[h(**x**_T) ≠ h̄_T(**x**_T)]). Let R̂_S(h) and R̂_T(h) be the corresponding empirical risks, with m_S training points for S and m_T training points for T. Let d_H denote the VC dimension of a hypothesis space H.

In the following, what is learned is a projection π ∈ Π : X_T → X_S, in order to get a target hypothesis of the form ĥ_T = ĥ_S ◦ π, where ĥ_S = ArgMin_{h∈H_S} R̂_S(h) is the learned source hypothesis. Our aim is to upper-bound R_T(ĥ_T), the risk of the learned hypothesis on the target domain, in terms of the empirical risk of ĥ_S on the source domain, of the capacities of the hypothesis spaces involved, and of a function ω measuring the link between the source and target problems (see Eq. (2) below).

### **Algorithm 1.** Transfer learning by boosting

**Input**: h_S : X_S → Y_S, the source hypothesis; S_T = {(**x**_i^T, y_i^T)}_{1≤i≤m}, the target training set

**Initialization** of the distribution on the training set: D_1(i) = 1/m for i = 1, ..., m ;

**for** n = 1, ..., N **do**
  Find a projection π_n : X_T → X_S s.t. h_S(π_n(·)) performs better than random on D_n(S_T) ;
  Let ε_n be the error rate of h_S(π_n(·)) on D_n(S_T): ε_n := **P**_{i∼D_n}[h_S(π_n(**x**_i^T)) ≠ y_i^T] (with ε_n < 0.5) ;
  Compute α_n = ½ log((1 − ε_n)/ε_n) ;
  Update, for i = 1, ..., m:

$$D\_{n+1}(i) \;=\; \frac{D\_n(i)}{Z\_n} \times \begin{cases} e^{-\alpha\_n} & \text{if } h\_S(\pi\_n(\mathbf{x}\_i^T)) = y\_i^T \\ e^{\alpha\_n} & \text{if } h\_S(\pi\_n(\mathbf{x}\_i^T)) \neq y\_i^T \end{cases} \;=\; \frac{D\_n(i)\exp\left(-\alpha\_n\, y\_i^T\, h\_S(\pi\_n(\mathbf{x}\_i^T))\right)}{Z\_n}$$

  where Z_n is a normalization factor chosen so that D_{n+1} is a distribution on S_T ;

**end**

**Output**: the final target hypothesis H_T : X_T → Y_T:

$$H\_T(\mathbf{x}^T) = \text{sign}\left\{\sum\_{n=1}^N \alpha\_n \, h\_S\left(\pi\_n(\mathbf{x}^T)\right)\right\} \tag{1}$$
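For illustration, the boosting loop of Algorithm 1 can be sketched in Python as follows (a sketch, not the authors' implementation; `h_S` is a callable source classifier returning labels in {−1, +1}, and `sample_projection` is an assumed random generator of candidate projections X_T → X_S):

```python
import numpy as np

def transboost(h_S, X_T, y_T, sample_projection, n_steps=10, max_tries=1000):
    """Learn weak projections pi_n and weights alpha_n as in Algorithm 1."""
    m = len(y_T)
    D = np.full(m, 1.0 / m)                 # distribution D_1 on the sample
    pis, alphas = [], []
    for _ in range(n_steps):
        # search for a weak projection (weighted error clearly below 0.5)
        for _ in range(max_tries):
            pi = sample_projection()
            preds = h_S(pi(X_T))
            eps = float(np.sum(D * (preds != y_T)))
            if 0.0 < eps < 0.45:
                break
        else:
            break                           # no weak projection found: stop
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D *= np.exp(-alpha * y_T * preds)   # AdaBoost-style re-weighting
        D /= D.sum()                        # normalisation by Z_n
        pis.append(pi)
        alphas.append(alpha)

    def H_T(X):                             # Eq. (1): weighted vote
        return np.sign(sum(a * h_S(p(X)) for a, p in zip(alphas, pis)))
    return H_T
```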


For the latter term, we adapt the theoretical study of McNamara and Balcan [9] on the transfer of representations in deep neural networks. We suppose that P_S, P_T, h_S, h_T = h_S ◦ π (π ∈ Π), H_S and Π have the property:

$$\forall \, \widehat{h}\_{\mathcal{S}} \in \mathcal{H}\_{\mathcal{S}} : \quad \underset{\pi \in \Pi}{\text{Min}} \, R\_T(\widehat{h}\_{\mathcal{S}} \circ \pi) \le \, \omega \Big(R\_{\mathcal{S}}(\widehat{h}\_{\mathcal{S}})\Big) \tag{2}$$

where ω : ℝ → ℝ is a non-decreasing function.

Equation (2) means that the best target hypothesis expressed using the learned source hypothesis has a true risk bounded by a non-decreasing function of the true risk on the source domain of the learned source hypothesis.

We are now in a position to state the desired theorem.

**Theorem 1.** *Let ω : ℝ → ℝ be a non-decreasing function. Suppose that P_S, P_T, h_S, h_T = h_S ◦ π (π ∈ Π), H_S and Π have the property given by Eq. (2). Let π̂ := ArgMin_{π∈Π} R̂_T(ĥ_S ◦ π) be the best apparent projection.*

*Then, with probability at least* 1 − δ (δ ∈ (0, 1)) *over pairs of training sets for tasks* S *and* T *:*

$$\begin{split} R\_T(\widehat{h}\_T) &\leq \omega \left( \widehat{R}\_S(\widehat{h}\_S) \right) + 2 \sqrt{\frac{2 \, d\_{\mathcal{H}\_S} \log(2em\_S/d\_{\mathcal{H}\_S}) + 2 \log(8/\delta)}{m\_S}} \\ &\quad + 4 \sqrt{\frac{2 \, d\_{h\_S \circ \Pi} \log(2em\_T/d\_{h\_S \circ \Pi}) + 2 \log(8/\delta)}{m\_T}} \end{split} \tag{3}$$

*Proof.* Let π* = ArgMin_{π∈Π} R_T(h_S ◦ π). With probability at least 1 − δ:

$$\begin{split} R\_T(h\_S\circ\widehat{\pi}) &\leq \widehat{R}\_T(h\_S\circ\widehat{\pi}) + 2\sqrt{\frac{2\,d\_{h\_S\circ\Pi}\log(2em\_T/d\_{h\_S\circ\Pi}) + 2\log(8/\delta)}{m\_T}} \\ &\leq \widehat{R}\_T(h\_S\circ\pi^{\*}) + 2\sqrt{\frac{2\,d\_{h\_S\circ\Pi}\log(2em\_T/d\_{h\_S\circ\Pi}) + 2\log(8/\delta)}{m\_T}} \\ &\leq R\_T(h\_S\circ\pi^{\*}) + 4\sqrt{\frac{2\,d\_{h\_S\circ\Pi}\log(2em\_T/d\_{h\_S\circ\Pi}) + 2\log(8/\delta)}{m\_T}} \\ &\leq \omega\left(R\_S(\widehat{h}\_S)\right) + 4\sqrt{\frac{2\,d\_{h\_S\circ\Pi}\log(2em\_T/d\_{h\_S\circ\Pi}) + 2\log(8/\delta)}{m\_T}} \\ &\leq \omega\left(\widehat{R}\_S(\widehat{h}\_S)\right) + 2\sqrt{\frac{2\,d\_{\mathcal{H}\_S}\log(2em\_S/d\_{\mathcal{H}\_S}) + 2\log(8/\delta)}{m\_S}} \\ &\qquad + 4\sqrt{\frac{2\,d\_{h\_S\circ\Pi}\log(2em\_T/d\_{h\_S\circ\Pi}) + 2\log(8/\delta)}{m\_T}} \end{split}$$

This follows from the fact [10] (p. 48) that, using m training points and a hypothesis class of VC dimension d, with probability at least 1 − δ, for all hypotheses h simultaneously, the true risk R(h) and the empirical risk R̂(h) satisfy |R(h) − R̂(h)| ≤ 2√((2d log(2em/d) + 2 log(4/δ))/m). For h_S ◦ Π, this yields the first and third inequalities, each with probability at least 1 − δ/2. For H_S, this yields the fifth inequality with probability at least 1 − δ/2. Applying the union bound achieves the desired result. The second inequality follows from the definition of π̂, and the fourth inequality is where we inject our assumption about the transferability (or proximity) between the source and the target problems. □

We can thus control the generalization error on the target domain by controlling d_{h_S◦Π}, m_S, and ω, which measures the link between the source domain and the target domain. The number m_T of target training points is typically supposed to be small in transfer learning, and thus cannot be used to control the error.

#### **3.2 Boosting Projections from Target to Source**

The above analysis bounds the generalization error of the learned target hypothesis h_S ◦ π̂ in terms, among others, of the VC dimension of the space h_S ◦ Π. The problem of controlling the capacity of such a space of functions in order to prevent under- or over-fitting is the same as in the traditional supervised learning setting. The difficulty lies in choosing the right space Π of projection functions from X_T to X_S.

The space of hypothesis functions considered is:

$$L(h\_{\mathcal{S}} \circ \Pi\_B) \ := \left\{ \mathbf{x} \mapsto \text{sign} \left[ \sum\_{n=1}^N \alpha\_n \left( h\_{\mathcal{S}} \circ \pi\_n(\mathbf{x}^T) \right) \right] : \forall n, \alpha\_n \in \mathbb{R}, \text{ and } \pi\_n \in \Pi\_B \right\}.$$

where Π_B is a space of weak projections satisfying Definition 1.

Now, from [11] (p. 109), the VC dimension of the space L(h_S ◦ Π_B) satisfies:

$$d\_{L(h\_S \circ \Pi\_B)} \le N(d\_{h\_S \circ \Pi\_B} + 1) \left( 3 \log(N(d\_{h\_S \circ \Pi\_B} + 1)) + 2 \right)$$

If d_{h_S◦Π_B} ≪ d_{h_S◦Π}, then d_{L(h_S◦Π_B)} can also be much less than d_{h_S◦Π}, and Theorem 1 provides tighter bounds.

Using the TransBoost method, we can thus gain both on the theoretical bounds on the generalization error and on the ease of finding an appropriate space of projections X_T → X_S.

## **4 Design of the Experiments**

#### **4.1 The Main Dimensions of Experiments in Transfer Learning**

There are two dimensions that can be expected to govern the efficiency of transfer learning:

- the *level of signal* in the target data, and
- the *relatedness* between the source and the target domains.
Regarding the *first dimension*, one can expect that if there is no signal in the target data (i.e., the examples are labelled randomly), then no regularity can be extracted, either directly or using transfer: only overfitting of the training data can occur. If, on the contrary, the target learning task is easy, then there cannot be much advantage in using transfer learning. A question therefore arises as to whether there might be an optimal level of signal in the target data so as to maximally benefit from transfer learning.

The *second dimension* is tricky. Here, we intuitively expect that the closer the source and target domains (and problems), the more profitable transfer learning should be. However, how should we measure the "relatedness" of the source and target problems? In the domain adaptation setting, closeness can be measured through a measure of the divergence between the source distribution and the target one, since they are defined on the same input space. In transfer learning, the input spaces can be different, so that it is much more difficult to define a divergence between distributions. This is why we resorted to the function ω in our theoretical analysis. In our experiments, we control relatedness through the information shared between source and target (see below).

#### **4.2 Experimental Setup**

In our study, we devised an experimental setup that would allow us to control the two dimensions above.

In the **target domain**, the learning task is to classify time series of length t_T into two classes: h_T : ℝ^{t_T} → {−1, +1}. By controlling the level of noise and the difference between the distributions governing the two classes, we can control the signal level, that is, the difficulty of extracting information from the target training data. We control the amount of information by varying the size m_T of the target training set.

Likewise, the source input space is the space of sequences of real measurements of length t_S; therefore, h_S : ℝ^{t_S} → {−1, +1}.

Varying |t_S − t_T| is a way of controlling the information potentially shared by the two domains. With t_S = t_T, the two input domains are the same.

Note that learning to classify time series is not a trivial task. It has many applications, some of which involve classifying time series whose length differs from the length for which a classifier exists.

### **4.3 Description of the Experiments**

Time series were generated according to the following equation:

$$x\_t = \underbrace{t \times \text{slope} \times \text{class}}\_{\text{information gain}} + \underbrace{\mathbf{x}\_{max} \sin(\omega\_i \times t + \varphi\_j)}\_{\text{sub shape within class}} + \underbrace{\eta(t)}\_{\text{noise factor}} \tag{4}$$

The fact that the noise factor is generated according to a Gaussian distribution induces a distribution over the data (class ∈ {−1, +1}).

The **level of signal in the training data** is governed by:

- the slope of the trend, which controls the separation between the two classes,
- the standard deviation of the noise term η(t), and
- the length t_T of the target time series.
In our experiments, the noise factor η(t) is generated according to a Gaussian distribution with mean 0 and standard deviation in {0.001, 0.002, 0.02, 0.2, 1}.

Figure 2 illustrates what can be obtained with slope = 0.01, with 3 subclasses in the +1 class and 2 subclasses in the −1 class.

**Fig. 2.** A synthetic data set S with 5 time series, where η is Gaussian (μ = 0, σ = 0.2).

In the experiments reported here, we kept the size of the training set constant. In each experiment, 900 time series of length 200 were generated according to the equation described above: 450 time series in each class, −1 or +1. We varied the difficulty of learning by varying the slope from almost nonexistent (0.001) to significant (0.01). Similarly, we varied the length t_T of the target time series in {20, 50, 70, 100}, thus providing increasing levels of signal.
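A possible NumPy rendering of the generator of Eq. (4) is sketched below (the sub-shape frequencies and phases, and the way they are sampled, are illustrative assumptions):

```python
import numpy as np

def make_series(length, cls, slope=0.01, sigma=0.2, x_max=1.0, rng=None):
    """One series from Eq. (4): class trend + sub-shape + Gaussian noise."""
    rng = rng or np.random.default_rng()
    t = np.arange(length)
    omega = rng.choice([0.05, 0.1, 0.2])   # sub-shape frequency (assumed)
    phi = rng.uniform(0.0, 2.0 * np.pi)    # sub-shape phase (assumed)
    return (t * slope * cls                # informative trend
            + x_max * np.sin(omega * t + phi)
            + rng.normal(0.0, sigma, size=length))

# 900 series of length 200, 450 per class, as in the experiments
X = np.stack([make_series(200, c) for c in (-1, 1) for _ in range(450)])
y = np.repeat([-1, 1], 450)
```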

A *target training set* of 300 time series was drawn, equally balanced between the two classes. Note that this relatively small number corresponds to transfer learning scenarios where training data is limited in the target domain. The remaining 600 time series were used as a *test set*. The source hypothesis was learned using the complete time series generated as explained above.

In these experiments, the *set of projections* Π was chosen as a set of "hinge functions" defined by three parameters: the slope of the first linear part, the time t where the hinge takes place, and the slope of the second linear part. The set is explored randomly by the algorithm, and a projection is retained if its error rate on the current weighted data is lower than 0.45. We explored other, richer spaces of projections without obtaining superior performance; this simple set seems sufficient for this learning task.
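One plausible reading of these hinge projections in Python, for the case t_S > t_T, is sketched below: each target series is completed up to length t_S by a piecewise-linear continuation anchored at its last observed value (this completion behaviour, and the parameter ranges, are our assumptions):

```python
import numpy as np

def sample_hinge_projection(t_T, t_S, rng=None):
    """Sample one 'hinge' projection pi : R^{t_T} -> R^{t_S} with three
    parameters: first slope, hinge time, second slope (assumes t_S > t_T)."""
    rng = rng or np.random.default_rng()
    s1, s2 = rng.uniform(-0.05, 0.05, size=2)  # the two slopes
    t0 = int(rng.integers(t_T, t_S))           # time where the hinge occurs

    def pi(X):                                 # X has shape (n_series, t_T)
        t = np.arange(t_T, t_S)
        ramp = np.where(t < t0, s1 * (t - t_T),
                        s1 * (t0 - t_T) + s2 * (t - t0))
        return np.concatenate([X, X[:, -1:] + ramp], axis=1)
    return pi
```

A generator of this kind can play the role of `sample_projection` in the TransBoost sketch of Sect. 2.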

In order to better assess the value of TransBoost, its performance was compared (1) to a classifier (a Gaussian-kernel SVM, as implemented in Scikit-learn) acting directly on the target training data, (2) to a boosting algorithm operating in the target domain with Gaussian SVMs as base classifiers, and (3) to a baseline transfer learning method that finds a regression from the target input space to the source input space using SVR. In this last method, the regression acts as a translation from X_T to X_S, and the class of an example **x**_T is given by h_S(regression(**x**_T)).

Table 1 provides representative examples of the results obtained. Each cell of the table shows the average performance (and standard deviation) computed from 100 repetitions of the experiment under the same conditions. The experimental conditions are organized according to the level of signal in the training data. In the experiments corresponding to this table, the source hypotheses were learned according to the first protocol defined above.

**Several lessons** can be drawn. First of all, in most situations, TransBoost brings *very significant gains* over learning without transfer or using transfer learning with regression. Figures 3 and 4, which sum up a larger set of experimental

**Table 1.** Comparison of the error rate (lower is better) between: learning directly in the target domain (columns h_T (train) and h_T (test)), using TransBoost (columns H_T (train) and H_T (test)), learning in the source domain (column h_S (test)) and, finally, mapping the time series with an SVR regression and using h_S (naïve transfer, column H̃_T (test)). Test errors are highlighted in the orange columns. Bold numbers indicate where TransBoost significantly dominates both learning without transfer and learning with naïve transfer.


conditions, make this even more striking. In both figures, the x-axis reports the error rate obtained using TransBoost, while the y-axis reports the error rate of the competing algorithm: either the hypothesis h_T learned on the target training data alone (Fig. 3), or the hypothesis H̃_T learned on the target data projected onto the source input space using SVR regression (Fig. 4). The remarkable efficiency of TransBoost over a large spectrum of situations is readily apparent.

Secondly, as expected, TransBoost is less dominant when the data is so noisy that no method can learn from it (high level of noise or low slope), which is apparent on the right part of Figs. 3 and 4 (near the diagonal), or when the task is so easy (large slope and/or low noise) that nothing can be gained from transfer learning (left part of the two graphs).

We do not report here the results obtained with boosting directly in the target input space X_T, since its performance was almost the same as that of the SVM classifier. This shows that it is not boosting in itself that brings the gain.

**Fig. 3.** Comparison of error rates. y-axis: test error of the SVM classifier (without transfer). x-axis: test error of the TransBoost classifier with 10 boosting steps. The results of 75 experiments (each one repeated 100 times) are summed up in this graph.

**Fig. 4.** Comparison of error rates. y-axis: test error of the "naïve" transfer method. x-axis: test error of the TransBoost classifier with 10 boosting steps. The results of 75 experiments (each one repeated 100 times) are summed up in this graph.

#### **4.4 Additional Experiments**

We show here, in Figs. 5, 6 and 7, qualitative results obtained on the classical half-moon problem. It is apparent that TransBoost brings satisfying results.

**Fig. 5.** Experiments on the half-moon problem.

### **5 Comparison to Previous Works**

In the theoretical analysis of Ben-David *et al.* [1,2], one central idea is that a *common representation space* should be found in which the projections of the source data {**x**_i^S}_{1≤i≤m} and of the target data {**x**_i^T}_{1≤i≤m} are as indistinguishable as possible by discriminative functions from the hypothesis space H. The intuition is that if the domains become indistinguishable, a classifier constructed for the source domain should also work for the target domain. This idea has been at the core of many methods proposed so far [3,5,7,12].

In [8] a scenario in which multiple sources are available for a single target domain is studied. For each source <sup>i</sup> ∈ {1,...,k}, the input distribution <sup>D</sup>i is

**Fig. 6.** A KNN model trained on the few target data points (in yellow). (Color figure online)

**Fig. 7.** A KNN model transboosted on the few target data points.

known, as well as a hypothesis h_i with loss bounded by ε on D_i. It is further assumed that the target input distribution is a mixture of the k source distributions D_i. The adaptation problem is thus seen as finding a combination of the hypotheses h_i. It is shown that guarantees on the loss of the combined target hypothesis can be given for some forms of combination. However, the authors do not show how to learn the parameters of these combinations. In [4], the authors present a system called TrAdaBoost, which uses a boosting scheme to eliminate data points that seem irrelevant for a new task defined over the same space X. Despite the use of boosting, its scope is quite different from ours.

Finally, the authors in [6] study a scheme seemingly very close to ours. They define *Hypothesis Transfer Learning algorithms* as algorithms taking as input a training set in the target domain and a source hypothesis in the source domain, and producing a target hypothesis:

$$A^{\text{htl}}: (\mathcal{X}\_T \times \mathcal{Y}\_T)^m \times \mathcal{H}\_{\mathcal{S}} \to \mathcal{H}\_T \subseteq \mathcal{Y}^{\mathcal{X}}$$

One goal of the paper is to identify the effect of the source hypothesis on the generalization properties of A^{htl}. However, the scope of the analysis is limited in several ways. First, it focuses on linear regression with the Regularized Least Squares algorithm. Second, the formal framework requires that, in fact, X_T = X_S and Y_T = Y_S; it is thus more an analysis of domain adaptation than of transfer learning. Third, the transfer learning algorithm in effect tries to find a weight vector **w**_T as close as possible to the source weight vector **w**_S while fitting the target data set, so there is a parameter λ to set. More importantly, the consequence is that the analysis singles out the performance of the source hypothesis on the target domain as the most significant factor controlling the expected error on the target problem. Again, therefore, the target hypothesis cannot be much different from the source one, which seems to defeat the whole purpose of transfer learning.

### **6 Conclusion**

This paper has presented a new transfer learning algorithm, TransBoost, that uses the boosting mechanism in an original way, by selecting and combining weak projections from the target domain to the source domain. The algorithm inherits some nice features from boosting: there is only one parameter to set, the number of boosting steps, and guarantees on the training error and on the test error are easily derived from those obtained in the theory of boosting.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Computing Vertex-Vertex Dissimilarities Using Random Trees: Application to Clustering in Graphs**

Kevin Dalleau(B), Miguel Couceiro, and Malika Smail-Tabbone

Universite de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France {kevin.dalleau,miguel.couceiro,malika.smail}@loria.fr

**Abstract.** A current challenge in graph clustering is to tackle the issue of complex networks, *i.e.*, graphs with attributed vertices and/or edges. In this paper, we present GraphTrees, a novel method that relies on random decision trees to compute pairwise dissimilarities between vertices in a graph. We show that, using different types of trees, it is possible to extend this framework to graphs where the vertices have attributes. While many existing methods that tackle the problem of clustering vertices in an attributed graph are limited to categorical attributes, GraphTrees can handle heterogeneous types of vertex attributes. Moreover, unlike other approaches, the attributes do not need to be preprocessed. We also show that our approach is competitive with well-known methods in the case of non-attributed graphs in terms of quality of clustering, and provides promising results in the case of vertex-attributed graphs. By extending the use of an already well-established approach – the random trees – to graphs, our proposed approach opens new research directions, by leveraging decades of research on this topic.

**Keywords:** Graph clustering · Attributed graph · Random tree · Dissimilarity · Heterogeneous data

## **1 Introduction**

Identifying community structure in graphs is a challenging task in many applications: computer networks, social networks, etc. Graphs have an expressive power that enables an efficient representation of relations between objects as well as their properties. Attributed graphs where vertices or edges are endowed with a set of attributes are now widely available, many of them being created and curated by the semantic web community. While these so-called knowledge graphs<sup>1</sup> contain a lot of information, their exploration can be challenging in practice. In particular, common approaches to find communities in such graphs rely on rather complex transformations of the input graph.

<sup>1</sup> Although many definitions can be found in the literature [9].


Funded by the RHU FIGHT-HF (ANR-15-RHUS-0004) and the Region Grand Est (France).


In this paper, we propose a decision tree based method that we call GraphTrees (GT) to compute dissimilarities between vertices in a straightforward manner. The paper is organized as follows. In Sect. 2, we briefly survey related work. We present our method in Sect. 3, and we discuss its performance in Sect. 4 through an empirical study on real and synthetic datasets. In the last section of the paper, we present a brief discussion of our results and state some perspectives for future research.

### **Main Contributions of the Paper:**

- GraphTrees (GT), a novel method that relies on random decision trees to compute pairwise dissimilarities between vertices in a graph;
- an extension of this framework to vertex-attributed graphs with heterogeneous attribute types, without any preprocessing of the attributes;
- an empirical study showing that GT is competitive with well-known methods on non-attributed graphs and provides promising results on vertex-attributed graphs.
## **2 Related Work**

Community detection aims to find highly connected groups of vertices in a graph. Numerous methods have been proposed to tackle this problem [1,8,24]. In the case of vertex-attributed<sup>2</sup> graphs, clustering aims at finding homogeneous groups of vertices sharing (i) common neighbourhoods and structural properties, and (ii) common attributes. A *vertex-attributed graph* is thought of as a finite structure G = (V, E, A), where V is the set of vertices, E is the set of edges, and A associates with each vertex a vector of attribute values.
In the case of vertex-attributed graphs, the problem of clustering refers to finding communities (*i.e.*, clusters), where vertices in the same cluster are densely connected, whereas vertices that do not belong to the same cluster are sparsely connected. Moreover, as attributes are also taken into account, the vertices in the same cluster should be similar w.r.t. attributes.

In this section, we briefly recall existing approaches to tackle this problem.

**Weight-Based Approaches.** The weight-based approach consists in transforming the attributed graph into a weighted graph. Standard clustering algorithms that focus on structural properties can then be applied.

The problem of mapping attribute information into edge weights has been considered by several authors. Neville *et al.* define a matching coefficient [20] as

<sup>2</sup> To avoid terminology-related issues, we will exclusively use the terms vertex for graphs and node for random trees throughout the paper.

a similarity measure S between two vertices v_i and v_j, based on the number of attribute values the two vertices have in common. The value S_{v_i,v_j} is used as the edge weight between v_i and v_j. Although this approach leads to good results using Min-Cut [15], MajorClust [26] and spectral clustering [25], only nominal attributes can be handled. An extended matching coefficient was proposed in [27] to overcome this limitation, based on a combination of normalized dissimilarities between continuous attributes and increments of the resulting weight per pair of common categorical attributes.

**Optimization of Quality Functions.** A second type of method aims at finding an optimal clustering of the vertices by optimizing a quality function over the partitions (clusters).

A commonly used quality function is *modularity* [21], which measures the density differences between vertices within the same cluster and vertices in different clusters. However, modularity is only based on the structural properties of the graph. In [6], the authors combine a modularity-based optimization with entropy as the quality metric for the attributes. Another method, recently proposed by Combe *et al.* [5], groups similar vertices by maximizing both modularity and *inertia*.

However, these methods suffer from the same drawbacks as any other modularity-optimization-based method on simple graphs. Indeed, it was shown in [17] that such methods are biased and do not always lead to the best clustering: for instance, they fail to detect small clusters in graphs with clusters of different sizes.

**Aggregated Distance Measures.** Another way to find communities in vertex-attributed graphs is to define an aggregated vertex-vertex distance combining a topological distance and a symbolic (attribute) distance. All these methods express a distance d_{v_i,v_j} between two vertices v_i and v_j as d_{v_i,v_j} = α d_T(v_i, v_j) + (1 − α) d_S(v_i, v_j), where d_T is a structural distance and d_S is a distance in the attribute space. These structural and attribute distances represent the two different aspects of the data, and can be chosen from the vast number available in the literature. For instance, in [4] a combination of geodesic distance and cosine similarity is used. The parameter α controls the importance of each aspect of the overall similarity in each use case. These methods are appealing because, once the distances between vertices are obtained, many clustering algorithms that cannot be applied directly to structures such as graphs can be used to find communities.

**Miscellaneous.** There is yet another family of methods that enable the use of common clustering methods on attributed graphs. SA-cluster [3,32] is a method performing the clustering task by adding new vertices. The virtual vertices represent possible values of the attributes. This approach, although appealing by its simplicity, has some drawbacks. First, continuous attributes cannot be taken into account. Second, the complexity can increase rapidly as the number of added vertices depends on the number of attributes and values for each attribute. However, the authors proposed an improvement of their method named *Inc-Cluster* in [33], where they reduce its complexity.

Some authors have worked on model-based approaches for clustering in vertex-attributed settings. In [29], the authors proposed a method based on a Bayesian probabilistic model that performs the clustering of vertex-attributed graphs by transforming the clustering problem into a probabilistic inference problem. Graph embeddings can also be used for the task of vertex-attributed graph clustering. Examples of these techniques include node2vec [13] and DeepWalk [23], which aim to efficiently learn a low-dimensional vector representation of each vertex. Some authors have focused on extending vertex embeddings to vertex-attributed networks [11,14,30].

In this paper, we take a different approach and present a tree-based method enabling the computation of vertex-vertex dissimilarities. This method is presented in the next section.

## **3 Method**

Previous works [7,28] have shown that random partitions of data can be used to compute a similarity between instances. In particular, in Unsupervised Extremely Randomized Trees (UET), the idea is that all instances ending up in the same leaves are more similar to each other than to other instances. The pairwise similarities s(i, j) are obtained by increasing s(i, j) for each leaf where both i and j appear. A normalisation is finally performed when all trees have been constructed, so that values lie in the interval [0, 1]. Leaves and, more generally, nodes of the trees can be viewed as regions of a partition of the original space. Enumerating the co-occurrences in the leaves is then the same as enumerating the co-occurrences of instances in the smallest regions of a specific partition.

So far, this type of approach has not been applied to graphs. The intuition behind our proposed method, GT, is to leverage a similar partitioning of the vertices of a graph. Instead of the similarity computation that we described previously, we chose to use the mass-based approach introduced by Ting *et al.* [28]. The key property of their measure is that the dissimilarity between two instances in a dense region is higher than the same interpoint dissimilarity between two instances in a sparse region of the same space. One interesting aspect of this approach is that a dissimilarity is obtained without any post-processing.

Let H ∈ H(D) be a hierarchical partitioning of the original space of a dataset D into non-overlapping and non-empty regions, and let R(x, y|H) be the smallest local region covering x and y with respect to H. The mass-based dissimilarity m*<sup>e</sup>* estimated by a finite number t of models – here, random trees – is given by the following equation:

$$m\_e(x, y | D) = \frac{1}{t} \sum\_{i=1}^{t} \tilde{P}(R(x, y | H\_i)) \tag{1}$$

where P̃(R) = (1/|D|) Σ_{z∈D} 1(z ∈ R). Figure 1 presents an example of a hierarchical partitioning H of a dataset D containing 8 instances; these instances are vertices in our case. For the sake of the example, let us compute m_e(1, 4) and m_e(1, 8). We have m_e(1, 4) = (1/8)(2) = 0.25, as the smallest region where instances 1 and 4 co-appear contains 2 instances. However, m_e(1, 8) = (1/8)(8) = 1, since instances 1 and 8 only co-appear in one region, the original space of size 8. The same approach can be applied to graphs.
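The worked example above can be reproduced in a few lines of Python, representing each tree by the list of all its node regions, internal nodes included (a sketch):

```python
def mass_dissimilarity(x, y, trees, n):
    """Eq. (1): average over trees of |R(x, y|H)| / n, where R(x, y|H) is
    the smallest region of the tree covering both x and y."""
    total = 0.0
    for regions in trees:   # each tree = list of its regions (sets)
        covering = (len(r) for r in regions if x in r and y in r)
        total += min(covering) / n
    return total / len(trees)

# The 8-instance example of Fig. 1: a single tree, from root to leaves
tree = [set(range(1, 9)), {1, 2, 3, 4}, {5, 6, 7, 8}, {1, 4}, {2, 3}]
print(mass_dissimilarity(1, 4, [tree], n=8))   # 2/8 = 0.25
print(mass_dissimilarity(1, 8, [tree], n=8))   # 8/8 = 1.0
```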

**Fig. 1.** Example of partitioning of 8 instances in non-overlapping non-empty regions using a random tree structure. The blue and red circles denote the smallest nodes (*i.e.*, regions) containing vertices 1 and 4 and vertices 1 and 8, respectively. (Color figure online)

Our proposed method is based on two steps: (i) obtain several partitions of the vertices using random trees, and (ii) use the trees to obtain a relevant dissimilarity measure between the vertices. Algorithm 1 describes how to build one tree, corresponding to one possible partition of the vertices; each tree corresponds to one model H_i in Eq. (1). Finally, the dissimilarity can be obtained using Eq. (1).

The computation of pairwise vertex-vertex dissimilarities using GraphTrees and the mass-based dissimilarity just described has a time complexity of O(t·Ψ log Ψ + n²·t log Ψ) [28], where t is the number of trees, Ψ the maximum height of the trees, and n the number of vertices. When Ψ ≪ n, this time complexity becomes O(n²).

To extend this approach to vertex-attributed graphs, we propose to build a forest containing trees obtained by GT over the vertices and trees obtained by UET on the vertex attributes. We can then compute the dissimilarity between vertices by averaging the dissimilarities obtained by both types of trees.

In the next section, we evaluate GT on both real-world and synthetic datasets.

## **4 Evaluation**

This section is divided into two subsections. First, we assess GT's performance on graphs without vertex attributes (Subsect. 4.1). Then, we present the performance of our proposed method in the case of vertex-attributed graphs (Subsect. 4.2). An implementation of GT, as well as these benchmarks, is available at https://github.com/jdalleau/gt.

**Algorithm 1.** Building a random tree partitioning the vertices of a graph.

**Data:** a graph G(V, E), an uninitialized stack S
root node = V ; // *the root node contains all the vertices of* G
v_s = a vertex sampled without replacement from V ; V_left = N(v_s) ∪ {v_s} ; // N(v) *returns the set of neighbours of* v
V_right = V \ V_left ; Push V_left and V_right to S ;
leaves = [ ] ; // *leaves is an empty list*
**while** S *is not empty* **do**
  V_node = pop the last element of S ;
  **if** |V_node| < n_min **then**
    Append V_node to leaves ; // *node size is lower than* n_min*: it is a leaf node*
  **else**
    v_s = a vertex sampled without replacement from V_node ;
    V_left = (V_node ∩ N(v_s)) ∪ {v_s} ; V_right = V_node \ V_left ;
    Push V_left to S ; Push V_right to S ;
  **end**
**end**
**return** *leaves* ;
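A compact Python rendering of Algorithm 1 could look as follows (a sketch: `neighbours` is assumed to return the neighbour set of a vertex, and the guard that turns a failed split into a leaf is our addition to avoid degenerate loops):

```python
import random

def graph_tree_partition(vertices, neighbours, n_min=2):
    """Build one random tree over the vertices of a graph (Algorithm 1);
    returns the list of leaf regions (sets of vertices)."""
    stack, leaves = [set(vertices)], []
    while stack:
        node = stack.pop()                   # pop the last element of S
        if len(node) < n_min:
            leaves.append(node)              # small node: it is a leaf
            continue
        v = random.choice(sorted(node))      # sample a split vertex
        left = (node & neighbours(v)) | {v}  # v and its neighbours in node
        right = node - left
        if not right:                        # split failed (our guard):
            leaves.append(node)              # keep the node as a leaf
        else:
            stack.append(left)
            stack.append(right)
    return leaves

# Toy graph given as an adjacency dict (hypothetical data)
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(graph_tree_partition(adj, lambda v: adj[v]))
```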

### **4.1 Graph Trees on Simple Graphs**

We first evaluate our approach on simple graphs with no attributes, in order to assess whether our proposed method is able to discriminate clusters in such graphs. This evaluation is performed on both synthetic and real-world graphs, presented in Table 1.

**Table 1.** Datasets used for the evaluation of clustering on simple graphs using graph-trees


The graphs we call *SBM* are synthetic graphs generated using stochastic block models composed of k blocks of user-defined sizes, connected by edges with user-defined probabilities. The Football graph represents a network of American football games during a given season [12]. The Email-Eu-Core graph [18,31] represents relations between members of a research institution, where edges represent communication between those members. We also use a random graph in our first experiment: an Erdős–Rényi graph [10] generated with parameters n = 300 and p = 0.2. Finally, the PolBooks data [16] is a graph where vertices represent books about US politics sold by an online merchant, and edges connect books that were frequently purchased by the same buyers.

Our first empirical setting aims to compare the differences between the mean intracluster and the mean intercluster dissimilarities. These metrics enable a comparison that is agnostic to a subsequent clustering method.

The mean difference is computed as follows. First, the arithmetic mean of the pairwise dissimilarities between all vertices with the same label is computed, giving the mean intracluster dissimilarity μ*intra*. The same process is performed for vertices with different labels, giving the mean intercluster dissimilarity μ*inter*. We finally compute the difference Δ = |μ*intra* − μ*inter*|. In our experiments, this difference Δ is computed 20 times; Δ̄ denotes the mean of the differences across runs, and σ their standard deviation. The results are presented in Table 2. We observe that in the case of the random graph, Δ̄ is close to 0, unlike for the graphs where a cluster structure exists. A projection of the vertices based on their pairwise dissimilarities obtained using GT is presented in Fig. 2.
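The computation of Δ from a pairwise dissimilarity matrix can be sketched as follows (each unordered pair is counted once):

```python
import numpy as np

def mean_cluster_gap(D, labels):
    """Delta = |mu_intra - mu_inter| for a dissimilarity matrix D."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)   # each pair once
    same = labels[i] == labels[j]
    mu_intra = D[i, j][same].mean()            # same-label pairs
    mu_inter = D[i, j][~same].mean()           # different-label pairs
    return abs(mu_intra - mu_inter)
```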

**Table 2.** Mean difference between intercluster and intracluster similarities in different settings.


We then compare the Normalized Mutual Information (NMI) obtained using GT with the NMI obtained using two well-known clustering methods on simple graphs, namely MCL [8] and Louvain [1]. NMI is a clustering quality metric when a ground truth is available. Its values lie in the range [0, 1], with a value of 1 being a perfect matching between the computed clusters and the reference one. The empirical protocol is the following:


**Fig. 2.** Projection of the vertices obtained using GT on (left) a random graph, (middle) an SBM-generated graph, and (right) the football graph. Each cluster membership is denoted by a different color. Note how, in the case of the random graph, no clear cluster can be observed. (Color figure online)

We repeated this procedure 20 times and computed means and standard deviations of the NMI.

The results are presented Table 3. We compared the mean NMI using the t-test, and checked that the differences between the obtained values are statistically significant.

We observe that our approach is competitive with the two well-known methods we chose, in the case of non-attributed graphs on the benchmark datasets. In one specific case, the graphs generated by the SBM, GraphTrees even significantly outperforms the state of the art. Since the dissimilarity computation is based on the method proposed by [28] to find clusters in regions of varying densities, this may indicate that our approach performs particularly well in the case of clusters of different sizes.


**Table 3.** Comparison of NMI on benchmark graph datasets. Best results are in boldface.

#### **4.2 Graph Trees on Attributed Graphs**

Now that we have tested GT on simple graphs, we can assess its performance on vertex-attributed graphs. The datasets used in this subsection are presented in Table 4.

WebKB represents relations between web pages of four universities, where each vertex label corresponds to the university and the attributes represent the words that appear in the page. The Parliament dataset is a graph where the vertices represent French members of parliament, linked by an edge if they cosigned a bill. The vertex attributes indicate their constituency, and each vertex has a label that corresponds to their political party.


**Table 4.** Datasets used for the evaluation of clustering on attributed graphs using GT

The empirical setup is the following. We first compute the vertex-vertex dissimilarities using GT, and the vertex-vertex dissimilarities using UET; in this first step, a forest of trees on the structure and a forest of trees on the attributes of each vertex are constructed. We then compute the average of the pairwise dissimilarities. Finally, we apply t-SNE and use the k-means algorithm on the points in the embedded space; we set k to the number of clusters, since we have the ground truths. We repeat these steps 20 times and report the means and standard deviations. During our experiments, we found that preprocessing the dissimilarities prior to the clustering phase may lead to better results, in particular with *Scikit-learn*'s [22] *QuantileTransformer*. This transformation tends to spread out the most frequent values and to reduce the impact of outliers. In our evaluations, we performed this quantile transformation prior to every clustering, with n_quantiles = 10.
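In scikit-learn terms, the clustering step just described can be sketched as follows (assuming `D` is the averaged GT+UET dissimilarity matrix; the re-symmetrisation after the quantile transform is our addition, and t-SNE over a precomputed dissimilarity needs a random initialisation and more samples than its perplexity):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def cluster_from_dissimilarities(D, k, seed=0):
    """Quantile-transform D, embed with t-SNE, then run k-means."""
    Dq = QuantileTransformer(n_quantiles=10).fit_transform(D)
    Dq = (Dq + Dq.T) / 2.0            # re-symmetrise after the transform
    np.fill_diagonal(Dq, 0.0)         # a valid precomputed distance matrix
    emb = TSNE(metric="precomputed", init="random",
               random_state=seed).fit_transform(Dq)
    return KMeans(n_clusters=k, n_init=10,
                  random_state=seed).fit_predict(emb)
```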

The NMI values obtained after the clustering step are presented in Table 5.

**Table 5.** NMI using GT on the structure only, UET on the attributes only and GT+UET. Best results are indicated in boldface.


We observe that for two datasets, namely *WebKB* and *HVR*, considering both structural and attribute information leads to a significant improvement in NMI. For the other dataset considered in this evaluation, while the attribute information does not improve the NMI, we observe that it does not decrease it either. Here, we give the same weight to structural and attribute information.

**Fig. 3.** Projection of the WebKB data based on the dissimilarities computed (left) using GT on structural data, (middle) using UET on the attributes data and (right) using the aggregated dissimilarity. Each cluster membership is denoted by a different color. (Color figure online)

In Fig. 3 we present the projection of the WebKB dataset, where we observe that the structure and attribute information both bring a different view of the data, each with a strong cluster structure.

The HVR and Parliament datasets are extracted from [2]. Using their proposed approach, the authors obtain NMIs of 0.89 and 0.78, respectively. Although the NMI values we obtain with our approach are not consistently better in this first assessment, the method still gives similar results without any fine-tuning.

### **5 Discussion and Future Work**

In this paper, we presented a method based on the construction of random trees to compute dissimilarities between graph vertices, called GT. For vertex clustering purposes, our proposed approach is plug-and-play, since any clustering algorithm that can work on a dissimilarity matrix can then be used. Moreover, it could find application beyond graphs, for instance in relational structures in general.

Although the goal of our empirical study was not to show a clear superiority in terms of clustering, but rather to assess the vertex-vertex dissimilarities obtained by GT, we showed that our proposed approach is competitive with well-known clustering methods, Louvain and MCL. We also showed that by computing forests of graph trees together with other trees that specialize in other types of input data, *e.g.*, feature vectors, it is possible to compute pairwise dissimilarities between vertices in attributed graphs.

Some aspects are still to be considered. First, the importance of the vertex attributes is dataset dependent and, in some cases, considering the attributes can add noise. Moreover, the aggregation method between the graph trees and the attribute trees can play an essential role. Indeed, in all our experiments, we gave the same importance to the attribute and structural dissimilarities. This choice implies that both the graph trees and the attribute trees have the same weight, which may not always be the case. Finally, we chose here a specific algorithm to compute the dissimilarity in the attribute space, namely, UET. The poor results we obtained for some datasets may be caused by some limitations of UET in these cases.

It should be noted that our empirical results depend on the choice of a specific clustering algorithm. Indeed, GT is not a clustering method *per se*, but a method to compute pairwise dissimilarities between vertices. As for other dissimilarity-based methods, this is a strength of our proposal: the clustering task can be performed using many algorithms, leveraging their respective strengths and weaknesses.

As a future work, we will explore an approach where the choice of whether to consider the attribute space in the case of vertex-attributed graphs is guided by the distribution of the variables or the visualization of the embedding. We also plan to apply our methods on bigger graphs than the ones we used in this paper.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Evaluation of CNN Performance in Semantically Relevant Latent Spaces**

Jeroen van Doorenmalen(B) and Vlado Menkovski(B)

Eindhoven University of Technology, Eindhoven, The Netherlands j.v.doorenmalen@student.tue.nl, v.menkovski@tue.nl

**Abstract.** We examine deep neural network (DNN) performance and behavior using contrasting explanations generated from a semantically relevant latent space. We develop a semantically relevant latent space by training a variational autoencoder (VAE) augmented by a metric learning loss on the latent space. The properties of the VAE provide for a smooth latent space supported by a simple density and the metric learning term organizes the space in a semantically relevant way with respect to the target classes. In this space we can both linearly separate the classes and generate meaningful interpolation of contrasting data points across decision boundaries. This allows us to examine the DNN model beyond its performance on a test set for potential biases and its sensitivity to perturbations of individual factors disentangled in the latent space.

**Keywords:** Deep learning · VAE · Metric learning · Interpretability · Explanation

## **1 Introduction**

Advances in machine learning and deep learning have had a profound impact on many tasks involving high-dimensional data, such as object recognition and behavior monitoring. The domain of Computer Vision in particular has witnessed great progress in bridging the gap between the capabilities of humans and machines. This field tries to enable machines to perceive the world as humans do, and to use this knowledge for a multitude of tasks such as image and video recognition, image analysis and classification, media recreation, recommender systems, etc. Such models have since been deployed in high-stakes domains like COMPAS [8], healthcare [3] and politics [17]. However, since the inner workings of black-box models are still hardly understood, they can lead to dangerous situations [3], such as racial bias [8] and gender inequality [1].

The need for confidence, certainty, trust and explanations when using supervised black-box models is substantial in domains with high responsibility. This paper provides an approach towards a better understanding of a model's predictions by investigating its behavior on semantically relevant (contrastive) explanations. To build a semantically relevant latent space, we need a smooth space that corresponds well with the generating factors of the data (i.e., regions well supported by the associated density should correspond to realistic data points) and with a distance metric that conveys semantic information about the target task. The vanilla VAE without any extra constraints is insufficient, as it does not necessarily deliver a distance metric that corresponds to the semantics of the target class assignment (in our task). Our target is to develop semantically relevant decision boundaries in the latent space, which we can use to examine our target classification model. Therefore, we propose to use a weakly-supervised VAE that combines metric learning and VAE disentanglement to create a semantically relevant, smooth and well-separated space. We show that this VAE and its semantically relevant latent space can be used for various interpretability/explainability tasks, such as validating predictions made by the CNN, generating (contrastive) explanations when predictions are odd, and detecting bias. The approach we propose for these tasks is explained in more detail using Fig. 1.
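For concreteness, a combined objective of the kind described here can be written as the usual VAE loss augmented with a triplet margin term on the latent codes (the weight λ, the margin m and the Euclidean distance are illustrative assumptions; the exact formulation used is given in Sect. 3.1):

$$\mathcal{L} \;=\; \underbrace{-\mathbb{E}\_{q\_\phi(z|x)}\big[\log p\_\theta(x|z)\big] + D\_{KL}\big(q\_\phi(z|x)\,\|\,p(z)\big)}\_{\text{VAE term}} \;+\; \lambda\,\underbrace{\max\big(0,\;\|z\_a - z\_p\|\_2 - \|z\_a - z\_n\|\_2 + m\big)}\_{\text{triplet metric-learning term}}$$

where z_a, z_p and z_n are the latent codes of an anchor, of a same-class example and of a different-class example, respectively: the triplet term pulls same-class codes together and pushes different-class codes apart, which is what makes the latent space semantically relevant and (approximately) linearly separable.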

**Fig. 1.** The diagnostics approach to validate and understand the behavior of the CNN. (1) extra constraints, loss functions are applied during training of the VAE in order to create semantically relevant latent spaces. The generative model captures the essential semantics within the data and is used by (2) A linear Support Vector Machine. The linear SVM is trained on top of the latent space to classify input on semantics rather than the direct mapping from input data X and labels Y . If the SVM and CNN do not agree on a prediction then (3) we traverse the latent space in order to generate and capture semantically relevant synthetic images, tested against the CNN, in order to check what elements have to change in order to change its prediction from a to b, where a and b are different classes.

In this paper, the key contributions are: (1) an approach that can be used to validate and check predictions made by a CNN, by utilizing a weakly-supervised generative model that is trained to create semantically relevant latent spaces. (2) The semantically relevant latent spaces are then used to train a linear support vector machine that captures the decision rules defining a class assignment; the SVM is then used to check predictions based on semantics rather than on the direct mapping of the CNN. (3) If there is a misalignment in the predictions (i.e., the CNN and the SVM do not agree), then we posit the top k best candidate classes and, for these candidates, traverse the latent space in order to generate semantically relevant (contrastive) explanations by utilizing the decision boundaries of the SVM.

To conclude, this paper posits a method that allows for the validation of CNN performance by comparing it against a linear classifier based on semantics, and provides a framework that generates explanations when the classifiers do not agree. The explanations are provided qualitatively to an expert in the field. Each explanation encompasses the original image, reconstructed images, and the path towards the most probable answers; additionally, it shows the minimal difference that makes the classifier change its prediction to one of the most probable answers. The expert can then check these results to make a quick assessment of the class to which the image actually belongs. Additionally, the framework provides the ability to further investigate the model mathematically, using the linear classifier as a proxy model.

## **2 Related Work**

Interest in interpretability and explainability studies has grown significantly since the inception of the "Right to Explanation" [20] and ethicality studies into the behavior of machine learning models [1,3,8,17]. As a result, developers of AI are encouraged, and required, amongst others, to create algorithms that are transparent, non-discriminatory, robust and safe. Interpretability is most commonly used as an umbrella term that stands for providing insight into the behavior and thought processes behind machine-learning algorithms; many other terms are used for this phenomenon, such as interpretable AI, explainable machine learning, causality, safe AI, computational social science, etc. [5]. We position our research as an interpretability study, but this does not necessarily mean that other interpretability studies are closely related to this work.

There have been many approaches working towards the goal of understanding black-box models. Linear proxy models such as LIME [18] locally approximate complex models using linear fits. Decision trees and rule extraction methods, such as DeepRED [21], are also considered highly explainable, but quickly become intractable as complexity increases. Saliency mapping [19] provides visual information as to which part of an image is most likely used in a prediction, but it has been demonstrated to be unreliable if not strongly conditioned [10]. Additionally, another approach to interpretability is to explain the role of each part of a black-box model, such as the role of a layer or of individual neurons [2], or of representation vectors within the activation space [9].

Most of the approaches stated above assume that there has to be a trade-off between model performance and explainability. Moreover, the current interpretability methods for black-box models are still insufficient approximations, and can cause more harm than good when communicated as methods that solve all problems. Many interpretability methods do not take into account the actual needs of stakeholders [13], or fail to take into account the vast research into explanations in the field of psychology [14] and the social sciences [15]. The "Explanation in Artificial Intelligence" study by Miller [15] describes the current state of interpretable and explainable algorithms, how most of the techniques currently fail to capture the essence of an explanation, and how to improve: an interpretability or explainability method should at least include, but is not limited to, a non-disputable textual and/or mathematical and/or visual explanation that is selective, social and, depending on the proof, contrastive.

For this reason, our approach focuses on providing selective (contrastive) explanations that combine visual aspects with the ability to further investigate the model mathematically using a proxy model that does not impact the CNN directly. Generative models such as Variational Autoencoders (VAEs) [11] and Generative Adversarial Networks (GANs) are usually unsupervised and used to sample and generate images from a latent space obtained by training the generative network. However, we propose to use a weakly-supervised generative network in order to impose (discriminative) structure, in addition to variational inference, on the latent space of said model using metric learning [6].

This method is therefore most related to the interpretability area of sub-sampling proxy generative models to answer questions about a discriminative black-box model. The two closest studies are the preprint of CDeepEx [4] by Feghahati et al. and xGEMs [7] by Joshi et al. Both CDeepEx and xGEMs propose the use of a proxy generative model, primarily a generative adversarial network (GAN), to explain the behavior of a black-box model. The xGEMs paper presents a framework to characterize and explain binary classification models by generating manifold-guided examples using a generative model. The behavior of the black-box model is summarized by quantitatively perturbing data samples along the manifold, and xGEMs detects and quantifies bias during model training to understand how bias affects black-box models. The xGEMs approach is similar to ours in its use of a generative model to explain a black-box model. Similarly, the CDeepEx paper frames its work as generating contrastive explanations using a proxy generative model. The generated explanations focus on answering the question "why a and not b?" with GANs, where a is the class of an input example I and b is a chosen class against which to capture the differences.

However, neither of these papers addresses the fact that, in a multi-class (discriminative) classification problem, unexpected behavior can occur if the generative model's latent space is not smooth, well separated, and semantically relevant. For instance, when traversing the latent space it is possible to pass from a through any number of classes before reaching class b, because the space is not well separated and smooth. This produces ineffective explanations, as, depending on how the explanations are generated, they will give information on 'why class a and not b using properties of c'. An exact geodesic path along the manifold would require great effort, especially in high dimensions. Our approach also differs in that we utilize a weakly-supervised generative model as well as an extra linear classifier on top of the latent space, which provides us with extra information on the data and the latent space. Some of our approaches, however, are very similar, such as using a generative model as a proxy to explain a black-box model, and sub-sampling the latent space to probe the behavior of the black-box model and generate explanations from the predictions.

## **3 Methodology**

This paper posits its methodology as a way to explain and validate decisions made by a CNN. The predictions made by the CNN are validated and explained utilizing the properties of a weakly-supervised proxy generative model, more specifically, a triplet-VAE. Three main factors contribute to the validation and explanation of the CNN. First, a triplet-VAE is trained in order to provide a semantically relevant and well separated latent space. Second, this latent space is used to train an interpretable linear support vector machine, which validates decisions by the CNN through comparison. Third, when a CNN decision is misaligned with the decision boundaries in the latent space, we generate explanations by stating the K most probable answers, together with a qualitative explanation to validate them. These factors refer to the numbers in Fig. 1 and link to the following sections: (1) triplet-VAE, Sect. 3.1; (2) CNN decision validation, Sect. 3.2; (3) generating (contrastive) explanations, Sect. 3.3.

### **3.1 Semantically Relevant Latent Space**

Typically, a triplet network consists of three instances of a neural network that share parameters. These three instances are separately fed different types of input: an anchor, a positive sample, and a negative sample. These are then used to learn useful representations by distance comparisons. We propose to incorporate this notion of a triplet network to semantically structure and separate the latent space of the VAE using the available input and labels. A triplet VAE consists of three instances of the encoder with shared parameters that are each fed precomputed triplets: an anchor, a positive sample, and a negative sample, $x_a$, $x_p$ and $x_n$. The anchor $x_a$ and positive sample $x_p$ are of the same class but not the same image, whereas the negative sample $x_n$ is from a different class. In each iteration of training, the input triplet is fed to the encoder network to get the mean latent embeddings: $F(x_a)^\mu = z^\mu_a$, $F(x_p)^\mu = z^\mu_p$, $F(x_n)^\mu = z^\mu_n$. These are then used to compute a similarity loss function that induces loss when a negative sample $z^\mu_n$ is closer to $z^\mu_a$ than $z^\mu_p$ distance-wise, i.e., $\delta_{ap}(z^\mu_a, z^\mu_p) = ||z^\mu_a - z^\mu_p||$ and $\delta_{an}(z^\mu_a, z^\mu_n) = ||z^\mu_a - z^\mu_n||$, which provides us with three possible situations: $\delta_{ap} > \delta_{an}$, $\delta_{ap} < \delta_{an}$ and $\delta_{ap} = \delta_{an}$ [6].

We wish to find an embedding where samples of a certain class lie close to each other in the latent space of the VAE. For this reason, we wish to add loss to the algorithm when we arrive in the situation where $\delta_{ap} > \delta_{an}$. In other words,

**Fig. 2.** Given an input image I we check the prediction of the CNN as well as the SVM. If both classifiers predict the same class, we return the predicted class. In contrast, if the classifiers do not predict the same class, we propose to return the top k most probable answers as well as an explanation why those classes are the most probable.

we wish to push $x_n$ further away, such that we ultimately arrive in the situation where $\delta_{ap} < \delta_{an}$ or $\delta_{ap} = \delta_{an}$ with some margin $\phi$. As such we arrive at the triplet loss function that we will use in addition to the KL divergence and reconstruction loss within the VAE: $\mathcal{L}(z^\mu_a, z^\mu_p, z^\mu_n) = \alpha \cdot \max\{||z^\mu_a - z^\mu_p|| - ||z^\mu_a - z^\mu_n|| + \phi,\, 0\}$, where $\phi$ provides leeway when $\delta_{ap} = \delta_{an}$ and pushes the negative sample away even when the distances are equal.

We have an already-present CNN which we would like to validate, trained on input data $X: x_1 \dots x_n$ and labels $Y: y_1 \dots y_n$, where each $y_i$ states the true class of $x_i$. We use the same $X$ and $Y$ to train the triplet-VAE. (1) First, we compute triplets of the form $x_a, x_p, x_n$ from the input data $X$ and labels $Y$, which are then used to train the triplet-VAE. A typical VAE consists of an encoder $F(x) = \mathrm{Encoder}(x) \sim q(z|x)$, which compresses the data into a latent space $Z$; a decoder $G(z) = \mathrm{Decoder}(z) \sim p(x|z)$, which reconstructs the data given the latent space $Z$; and a prior $p(z)$, in our case a Gaussian $\mathcal{N}(0, 1)$, imposed on the model. In order for the VAE to learn a latent space similar to its prior and be able to reconstruct images, it is trained by minimizing the Evidence Lower Bound (ELBO): $\mathrm{ELBO} = -\mathbb{E}_{z \sim Q(z|X)}[\log P(x|z)] + KL[Q(z|X)||P(z)]$. This can be read as the reconstruction loss, or expected negative log-likelihood, $-\mathbb{E}_{z \sim Q(z|X)}[\log P(x|z)]$, plus the KL divergence loss $KL[Q(z|X)||P(z)]$, to which we add the triplet loss:

$$\mathcal{L}(z_a^{\mu}, z_p^{\mu}, z_n^{\mu}) = \alpha \cdot \max \{ ||z_a^{\mu} - z_p^{\mu}|| - ||z_a^{\mu} - z_n^{\mu}|| + \phi \,,\; 0 \}$$

This compound loss semi-forces the latent space of the VAE to be well separated due to the triplet loss, disentangled due to the KL divergence loss combined with the $\beta$ scalar, and able to (reasonably) reconstruct images due to the reconstruction loss. This results in the following loss function for training the VAE:

$$loss = -\mathbb{E}\_{z \sim \mathcal{Q}(z|X)}[\log P(x|z)] + \beta \ast \mathcal{KL}[\mathcal{Q}(z|X) || P(z)] + \mathcal{L}(z\_a^{\mu}, z\_p^{\mu}, z\_n^{\mu}).$$
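To make the compound objective concrete, the following is a minimal PyTorch-style sketch of this loss; the function and tensor names (`triplet_vae_loss`, `mu_a`, `logvar_a`, and so on) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as TF

def triplet_vae_loss(x_a, recon_a, mu_a, logvar_a, mu_p, mu_n,
                     alpha=1.0, beta=1.0, phi=0.2):
    """Compound loss: reconstruction + beta * KL + triplet term.

    mu_a, mu_p, mu_n are the mean latent embeddings of the anchor,
    positive, and negative samples; recon_a is the decoded anchor."""
    # Reconstruction loss (expected negative log-likelihood for
    # Bernoulli-distributed pixels, e.g. MNIST).
    recon = TF.binary_cross_entropy(recon_a, x_a, reduction='sum')
    # KL divergence between q(z|x) and the standard normal prior N(0, 1).
    kl = -0.5 * torch.sum(1 + logvar_a - mu_a.pow(2) - logvar_a.exp())
    # Triplet margin term: penalise the anchor being closer to the
    # negative than to the positive, with margin phi.
    d_ap = torch.norm(mu_a - mu_p, dim=1)
    d_an = torch.norm(mu_a - mu_n, dim=1)
    triplet = alpha * torch.clamp(d_ap - d_an + phi, min=0).sum()
    return recon + beta * kl + triplet
```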

#### **3.2 Decision Validation**

Given a semantically relevant latent space, we can use it for steps two and three as indicated in Fig. 1. (2) Second step, CNN decision validation: we train an additional classifier on top of the triplet-VAE latent space, specifically on $z^\mu$. We train a linear support vector machine using $Z^\mu$ as input data and $Y$ as labels, where $[Z^\mu, Z^\sigma] = F(X)$. The goal of the linear support vector machine is two-fold. It provides a means of validating each prediction made by the CNN by using the encoder and the linear classifier: given an input example $I$, we have $C(I) = \hat{y}_{C(I)}$ and $S(F(I)^\mu) = \hat{y}_{S(I)}$, and compare them against each other, $\hat{y}_{C(I)} = \hat{y}_{S(I)}$. And, as the linear classifier is a simpler model than the highly complex CNN, it functions as the ground-truth base for the predictions that are made. As such, we arrive at two possible cases:

$$\text{Comparison}(\mathbf{I}) = \begin{cases} \text{Positive} & \text{if } (\hat{y}\_{\mathcal{C}(I)} = \hat{y}\_{\mathcal{S}(I)}) \\ \text{Negative} & \text{if } (\hat{y}\_{\mathcal{C}(I)} \neq \hat{y}\_{\mathcal{S}(I)}) \end{cases} \tag{1}$$

First, if both classifiers agree, we arrive at an optimal state, meaning that the prediction is based on semantics as well as the direct mapping found by the CNN. In this case we can say with high confidence that the prediction is correct. In the second case, if the classifiers do not agree, three situations can occur: the SVM is correct and the CNN is incorrect, the SVM is incorrect and the CNN is correct, or both the SVM and the CNN are incorrect. In each of these situations we can suggest a most probable answer as well as a selective (contrastive) explanation, indicated as step 3 of the framework and explained in Fig. 2.
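As a sketch, this comparison step can be written down in a few lines; `cnn_predict`, `encoder_mu`, and the fitted scikit-learn `svm` are hypothetical stand-ins for the trained components.

```python
import numpy as np

def validate_prediction(cnn_predict, svm, encoder_mu, image):
    """Compare the CNN's prediction against the latent-space SVM."""
    y_cnn = cnn_predict(image)                 # CNN prediction for I
    z_mu = encoder_mu(image)                   # F(I)^mu: mean latent code
    y_svm = svm.predict(np.asarray(z_mu).reshape(1, -1))[0]  # SVM prediction
    if y_cnn == y_svm:
        return 'Positive', y_cnn               # agreement: accept prediction
    return 'Negative', (y_cnn, y_svm)          # disagreement: explain (Sect. 3.3)
```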

#### **3.3 Generating (contrastive) Explanations**

An explanation consists of (1) the most probable answers and (2) a qualitative investigation of the latent traversal towards the most probable answers. The most probable answers are obtained by the averaged sum rule [12] over the predicted probabilities per class for both the CNN and SVM, selecting the top K answers, where K can be appropriately selected. An SVM does not natively return a probabilistic answer; applying Platt's method [16], we fit an additional sigmoid function to map the SVM outputs into probabilities. These top K answers are then used to present and generate selective contrastive explanations.
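As an illustration, the averaged sum rule reduces to a few lines; scikit-learn's `SVC(probability=True)` does fit Platt's sigmoid internally, while the remaining names are ours.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training step: probability=True enables Platt scaling,
# mapping SVM decision values to class probabilities.
# svm = SVC(kernel='linear', probability=True).fit(Z_mu, y)

def top_k_answers(p_cnn, p_svm, k=3):
    """Averaged sum rule over the CNN's and SVM's class probabilities."""
    p_avg = (np.asarray(p_cnn) + np.asarray(p_svm)) / 2.0
    order = np.argsort(p_avg)[::-1][:k]   # indices of the K most probable classes
    return order, p_avg[order]
```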

The top K predictions or classes are used to traverse and sub-sample the latent space from the initial representation $Z^\mu_I$ towards another class. We can find a path by finding the closest point within the latent space such that the decision boundary is crossed and the SVM predicts the target class. Alternatively, we can use the closest training point in the latent space, $\mathrm{argmin}_{x_i \in X} ||F(x_i)^\mu - Z^\mu_I||$. Traversing and sub-sampling the latent space changes the semantics minimally until the class prediction changes. We capture the minimal change needed to change both the SVM and CNN predictions to the target class. This information is then

**Fig. 3.** Generating (contrastive) explanations consists of several steps. First, given an input image $I$ in question and the top K most probable answers, where class $k$ denotes training data $X$ labeled with $y = k$, we feed both $I$ and the class-$k$ data through the encoder $F$ to receive their respective semantic locations in the latent space. We then find the closest training point that belongs to the target class $k$ and the vector $v$ pointing towards it. Afterwards, we uniformly sample data points along this vector $v$, denoted $Z^\mu_v$ and indexed by $j$. $Z^\mu_v$ is checked against the SVM and used to generate images $X_{Z^\mu_v}$ using the decoder $G(Z^\mu_v)$. The generated images are then fed to the CNN to make a prediction, and as the images semantically change along the vector, the prediction changes as well. Afterwards, we compare the predictions from both the CNN and the SVM. Subsequently, we use the first moment where both predictions are equal to the target class $k$, denoted moment $l$, for generating an explanation: the minimal semantic difference necessary to be equal to the target class, $\Delta U_l$.

presented to the domain expert for verification and answers the following question: The most probable answer is a because the input image I is semantically closest to the following features, where the features are presented qualitatively. The explanations are generated as follows: see Fig. 3.

The decision boundaries around the clusters within the latent space are fitted by the SVM and can be used to answer questions of the form 'why a and not b?'. If $\hat{y}_{C(I)}$ and $\hat{y}_{S(I)}$ are not the same class, we assume that $\hat{y}_{S(I)}$ is correct. We then find a path, indicated by $v$, from $Z^\mu_I$ to the target class. This can be done by calculating a vector orthogonal to the hyperplane fitted by the SVM towards the target class. Alternatively, we can find the closest $z^\mu \in Z^\mu$ that satisfies $\hat{y}_{S(z^\mu)} = \hat{y}_{C(z^\mu)}$ and differs from the initial prediction $\hat{y}_{C(I)}$. In that case $v$ is the vector from $I$ to the closest data point of the target class, with respect to Euclidean distance.

We then uniformly sample points along vector $v$ and check them against the SVM as well as the CNN. The sampled points can directly be fed to the SVM to get a prediction $\hat{y}_{S(v_i)}$ for every $v_i \in V$. Similarly, we can get predictions from the CNN by transforming the points into images using the decoder $G$; the images are then fed to the CNN to get a prediction $\hat{y}_{C(G(v_i))}$ for every $v_i \in V$. The predictions of both classifiers change as the images start looking more and more like the target class while the generative factors change along the vector. If we capture the changes that make the prediction flip, we can show the minimal difference required to change the prediction of the CNN. In this way we can generate contrastive examples: for the closest class that is not $\hat{y}_I$ we answer the question 'why $\hat{y}_I$ and not the other semantically close class?'. Hence, we find the answer to the question "why a and not b?", as the answer is the shortest approximate change between the two classes that makes the CNN change its prediction. As a result, we have found a way to validate the inner workings of the CNN: if there are doubts about a prediction, it can be investigated and checked.
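A sketch of this traversal, assuming `z_I` and `z_target` are latent mean vectors and `decoder`/`cnn_predict` are hypothetical callables; it returns the first moment $l$ at which both classifiers output the target class.

```python
import numpy as np

def traverse_explanation(z_I, z_target, svm, cnn_predict, decoder,
                         target_class, n_steps=50):
    """Uniformly sample the segment from z_I towards the closest point of
    the target class and find the first point where SVM and CNN agree."""
    for l, t in enumerate(np.linspace(0.0, 1.0, n_steps)):
        z = (1 - t) * z_I + t * z_target       # point along vector v
        y_svm = svm.predict(z.reshape(1, -1))[0]
        x_gen = decoder(z)                     # G(z): generated image
        y_cnn = cnn_predict(x_gen)
        if y_svm == target_class and y_cnn == target_class:
            return l, x_gen                    # moment l: minimal change found
    return None, None                          # target class never reached
```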

### **4 Results**

In this paper we show experimental results on MNIST, generating (contrastive) explanations to provide extra information for predictions made by the CNN and evaluating its performance. The creation of these explanations requires a semantically relevant and well separated latent space. Therefore, we first show the difference between the latent space of the vanilla VAE and that of the triplet-VAE, and its effect on training a linear classifier on top of the latent space. Figures 4 and 5 show t-SNE visualizations of the separation of classes within the latent space. Not surprisingly, the triplet-VAE separates the data in a far more semantically relevant way, and this is also reflected in the accuracy of a linear model trained on the data.

**Fig. 4.** Visualization of a two-dimensional latent space of a vanilla VAE on MNIST

**Fig. 5.** Visualization of a two-dimensional latent space of a triplet-VAE on MNIST

Second, Table 1 shows how often the two classifiers agree, as a percentage per possible case. Not surprisingly, case four happens more often than case three, which can mean two things: our latent space is too simple to capture the full complexity of the class assignment, and the CNN is not constrained by extra loss functions. However, in three of the four cases in which a prediction is incorrect we can explain

**Table 1.** This table shows the percentages of agreement with respect to all possible cases.


the most probable predictions and provide a generated (contrastive) explanation. The only case we cannot check or know about is case two, where both $\hat{y}_S$ and $\hat{y}_C$ predict the same class but are wrong. The only way to capture this behavior would be to generate explanations for every single decision. Nevertheless, to illustrate the generation of explanations we use example 6783 (case 5), as shown in Fig. 6.

Generating explanations consists of three parts. First, we propose the top K probable answers: for this example the true label is 1, and the most probable answers are 6, 8, and then 1 with averaged probabilities 0.5123, 0.3382, and 0.1150. Second, for each of those most probable target classes, 6, 8, and 1, we traverse the latent space from the initial location $Z^\mu_I$ to the closest point of that class that is predicted correctly, i.e., on which the SVM and CNN agree. Figure 7 shows the generated images from the uniformly

**Fig. 6.** Once the SVM and the CNN both predict the target class, we capture the minimal changes necessary to change their predictions.

sampled data points along the vectors $v_k \in V$, where $k \in K$ stands for 6, 8, and 1 in this case. The figures show which changes happen when traversing the latent space and at which points both the SVM and the CNN agree with respect to their decision.

For the traversal from $Z^\mu_I$ to class 6, it can be seen that both classifiers agree rather quickly and only minimal changes are required to change the predictions. Third, for such an occurrence we can further zoom in on what is happening and what really makes that the most probable answer. Figure 6 shows the minimal changes required to change the prediction as well as the transformed image on which the classifiers agree. The first row shows the original image, the positive changes, the negative changes, and the changes combined. The second row shows the reconstructed image and the reconstructed images with the positive, negative, and combined changes, respectively. In this way, for each probable answer the framework shows its closest representative and the changes required to be part of that class.

**Fig. 7.** For each of the top k probable answers we traverse and sample the latent space to generate images that can be used to test the behavior of the CNN. The red line indicates the moment where both the SVM and the CNN predict the target class. (Color figure online)

## **5 Conclusion**

This paper examines a deep neural network's behaviour and performance by utilizing a weakly-supervised generative model as a proxy. The weakly-supervised generative model aims to uncover the generative factors underlying the data and to separate abstract classes by applying metric learning. The proxy's goal is three-fold: the semantically meaningful latent space serves as the base for a linear support vector machine; the model's generative capabilities are used to generate images that can be probed against the black box in question; and the latent space is traversed and sampled from an anchor I to another class k in order to find the minimal important difference that changes both classifiers' predictions. The goal of the framework is to gain assurance about the predictions made by the black box by better understanding the behaviour of the CNN, simulating questions of the form 'Why a and not b?' where a and b are different classes.

We examine deep neural network (DNN) performance and behaviour using contrastive explanations generated from a semantically relevant latent space. The results show that each of the above goals can be achieved and the framework performs as expected. We develop a semantically relevant latent space by training a variational autoencoder (VAE) augmented with a metric learning loss on the latent space. The properties of the VAE provide a smooth latent space supported by a simple density, and the metric learning term organizes the space in a semantically relevant way with respect to the target classes. In this space we can both linearly separate the classes and generate relevant interpolations of contrasting data points across decision boundaries, finding the minimal important difference that changes the classifier's predictions. This allows us to examine the DNN model beyond its performance on a test set, for potential biases and for its sensitivity to perturbations of individual factors in the latent space.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Vouw: Geometric Pattern Mining Using the MDL Principle**

Micky Faas(B) and Matthijs van Leeuwen

LIACS, Leiden University, Leiden, The Netherlands micky@edukitty.org, m.van.leeuwen@liacs.leidenuniv.nl

**Abstract.** We introduce geometric pattern mining, the problem of finding recurring local structure in discrete, geometric matrices. It differs from existing pattern mining problems by identifying complex spatial relations between elements, resulting in arbitrarily shaped patterns. After we formalise this new type of pattern mining, we propose an approach to selecting a set of patterns using the Minimum Description Length principle. We demonstrate the potential of our approach by introducing Vouw, a heuristic algorithm for mining exact geometric patterns. We show that Vouw delivers high-quality results with a synthetic benchmark.

## **1 Introduction**

Frequent pattern mining [1] is the well-known subfield of data mining that aims to find and extract recurring substructures from data, as a form of knowledge discovery. The generic concept of pattern mining has been instantiated for many different types of patterns, e.g., for item sets (in Boolean transaction data) and subgraphs (in graphs/networks). Little research, however, has been done on pattern mining for raster-based data, i.e., geometric matrices in which the row and column orders are fixed. The exception is geometric tiling [4,11], but that problem only considers tiles, i.e., rectangular-shaped patterns, in Boolean data.

In this paper we generalise this setting in two important ways. First, we consider geometric patterns *of any shape* that are geometrically connected, i.e., it must be possible to reach any element from any other element in a pattern by only traversing elements in that pattern. Second, we consider *discrete geometric data* with any number of possible values (which includes the Boolean case). We call the resulting problem *geometric pattern mining*.

Figure 1 illustrates an example of geometric pattern mining. Figure 1a shows a 32 × 24 grayscale 'geometric matrix', with each element in [0, 255], apparently filled with noise. If we take a closer look at all horizontal pairs of elements, however, we find that the pair (146, 11) is, amongst others, more prevalent than expected from 'random noise' (Fig. 1b). If we were to continue trying all combinations of elements that 'stand out' from the background noise, we would eventually find four copies of the letter 'I' set in 16-point Garamond Italic (Fig. 1c).

**Fig. 1.** Geometric pattern mining example. Each element is in [0, 255].

The 35 elements that make up a single 'I' in the example form what we call a *geometric pattern*. Since its four occurrences jointly cover a substantial part of the matrix, we could use this pattern to describe the matrix more succinctly than by 768 independent values. That is, we could describe it as the pattern 'I' at locations (5, 4),(11, 11),(20, 3),(25, 10) plus 628 independent values, hereby separating structure from accidental (noise) data. Since the latter description is shorter, we have compressed the data. At the same time we have learned something about the data, namely that it contains four I's. This suggests that we can use compression as a criterion to find patterns that describe the data.

**Approach and Contributions.** Our first contribution is that we introduce and formally define *geometric pattern mining*, i.e., the problem of finding recurring local structure in geometric, discrete matrices. Although we restrict the scope of this paper to two-dimensional data, the generic concept applies to higher dimensions. Potential applications include the analysis of satellite imagery, texture recognition, and (pattern-based) clustering.

We distinguish three types of geometric patterns: (1) *exact* patterns, which must appear exactly identical in the data to match; (2) *fault-tolerant* patterns, which may have noisy occurrences and are therefore better suited to noisy data; and (3) *transformation-equivalent* patterns, which are identical after some transformation (such as mirror, inverse, rotate, etc.). Each consecutive type makes the problem more expressive and hence more complex. In this initial paper we therefore restrict the scope to the first, exact type.

As many geometric patterns can be found in a typical matrix, it is crucial to find a compact set of patterns that together describe the structure in the data well. We regard this as a model selection problem, where a model is defined by a set of patterns. Following our observation above, that geometric patterns can be used to compress the data, our second contribution is the formalisation of the model selection problem by using the *Minimum Description Length (MDL) principle* [5,8]. Central to MDL is the notion that 'learning' can be thought of as 'finding regularity' and that regularity itself is a property of data that is exploited by *compressing* said data. This matches very well with the goals of pattern mining, as a result of which the MDL principle has proven very successful for MDL-based pattern mining [7,12].

Finally, our third contribution is Vouw, a heuristic algorithm for MDL-based geometric pattern mining that (1) finds compact yet descriptive sets of patterns, (2) requires no parameters, and (3) is tolerant to noise in the data (but not in the occurrences of the patterns). We empirically evaluate Vouw on synthetic data and demonstrate that it is able to accurately recover planted patterns.

## **2 Related Work**

As the first pattern mining approach using the MDL principle, Krimp [12] was one of the main sources of inspiration for this paper. Many papers on pattern-based modelling using MDL have appeared since, both improving search, e.g., Slim [10], and extending to other problems, e.g., Classy [7] for rule-based classification.

The problem closest to ours is probably that of geometric tiling, as introduced by Gionis et al. [4] and later also combined with the MDL principle by Tatti and Vreeken [11]. Geometric tiling, however, is limited to Boolean data and rectangularly shaped patterns (tiles); we strongly relax both these limitations (but as of yet do not support patterns based on densities or noisy occurrences).

Campana et al. [2] also use matrix-like data (textures) in a compression-based similarity measure. Their method, however, has less value for *explanatory* analysis as it relies on generic compression algorithms that are essentially a black box.

Geometric pattern mining is different from graph mining, although the concept of a matrix can be redefined as a grid-like graph where each node has a fixed degree. This is the approach taken by Deville et al. [3], solving a problem similar to ours but using an approach akin to bag-of-words instead of the MDL principle.

## **3 Geometric Pattern Mining Using MDL**

We define geometric pattern mining on bounded, discrete, two-dimensional raster-based data. We represent this data as an $M \times N$ matrix $A$ whose rows and columns are finite and in a fixed ordering (i.e., reordering rows and columns semantically alters the matrix). Each element $a_{i,j} \in S$, where row $i \in [0; N)$, column $j \in [0; M)$, and $S$ is a finite set of symbols, i.e., the alphabet of $A$.

According to the MDL principle, the shortest (optimal) description of $A$ reveals all structure of $A$ in the most succinct way possible. This optimal description is only optimal if we can unambiguously reconstruct $A$ from it and nothing more: the compression is both minimal and lossless. Figure 2 illustrates how an example matrix can be succinctly described using patterns: matrix $A$ is decomposed into patterns $X$ and $Y$. A set of such patterns constitutes the **model** for a matrix $A$, denoted $H_A$ (or $H$ for short when $A$ is clear from the context). In order to reconstruct $A$ from this model, we also need a mapping from $H_A$ back to $A$. This mapping represents what (two-part) MDL calls **the data given the model** $H_A$. In this context we can think of it as the set of all instructions required to rebuild $A$ from $H_A$, which we call the **instantiation** of $H_A$, denoted by $I$ in the example. These concepts allow us to express matrix $A$ as a decomposition into sets of local and global spatial information, which we describe next in more detail.

$$A = \begin{bmatrix} 1 & \cdot & \cdot & 1 \cdot 1 \\ \cdot & 1 & 1 \cdot 1 & \cdot \\ 1 & 1 & \cdot & \cdot & 1 \end{bmatrix}, \ I = \begin{bmatrix} X \cdot \cdot \cdot \cdot \cdot Y \cdot \\ \cdot \cdot \cdot Y \cdot X \cdot \cdot \\ Y \cdot \cdot \cdot \cdot \cdot \cdot \cdot \cdot \cdot \end{bmatrix}, \ H = \left\{ X = \begin{bmatrix} 1 \cdot \\ \cdot \cdot 1 \end{bmatrix}, Y = \begin{bmatrix} 1 \cdot 1 \end{bmatrix} \right\}$$

**Fig. 2.** Example decomposition of A into instantiation I and patterns X, Y .

#### **3.1 Patterns and Instances**


From this definition, the dimensions $M_X \times N_X$ give the smallest rectangle around $X$ (the *bounding box*). We also define the cardinality $|X|$ of $X$ as the number of non-empty elements. We call a pattern $X$ with $|X| = 1$ a **singleton pattern**, i.e., a pattern containing exactly one element of $A$.

Each pattern contains a special **pivot** element: $\mathrm{pivot}(X)$ is the first non-empty element of $X$. A pivot can be thought of as a fixed point in $X$ which we can use to position its elements in relation to $A$. This translation, or **offset**, is a tuple $q = (i, j)$ on the same domain as an index in $A$. We realise this translation by placing all elements of $X$ in an empty $M \times N$ matrix such that the pivot element is at $(i, j)$. We formalise this in the **instantiation operator** ⊗:


Since this does not yield valid results for arbitrary offsets (i, j), we enforce two constraints: (1) an instance must be **well-defined**: placing pivot(X) at index (i, j) must result in an instance that contains all elements of X, and (2) elements of instances cannot *overlap*, i.e., each element of A can be described only once.


From here on we will use the same letter in lower case to denote an arbitrary instance of a pattern, e.g., $x = X \otimes q$ when the exact value of $q$ is unimportant. Since instances are simply patterns projected onto an $M \times N$ matrix, we can reverse ⊗ by removing all completely empty rows and columns:


We briefly introduced the instantiation $I$ as a set of 'instructions' describing where instances of each pattern should be positioned in order to obtain $A$. As Fig. 2 suggests, this mapping has the shape of an $M \times N$ matrix.
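Under the definitions above, the instantiation operator and its reverse can be sketched in NumPy; the sentinel value for the 'empty' element and the function names are assumptions of this sketch.

```python
import numpy as np

EMPTY = -1  # sentinel standing in for the 'empty' element

def instantiate(X, offset, M, N):
    """Instantiation operator ⊗: place pattern X, with its pivot at
    `offset`, into an otherwise empty matrix with N rows and M columns.
    Assumes the offset yields a well-defined, in-bounds instance."""
    inst = np.full((N, M), EMPTY)
    nonempty = np.argwhere(X != EMPTY)   # row-major = lexicographical order
    pi, pj = nonempty[0]                 # pivot: first non-empty element
    i, j = offset
    for r, c in nonempty:
        inst[i + r - pi, j + c - pj] = X[r, c]
    return inst

def project(inst):
    """Reverse of ⊗: drop all completely empty rows and columns."""
    rows = np.any(inst != EMPTY, axis=1)
    cols = np.any(inst != EMPTY, axis=0)
    return inst[rows][:, cols]
```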


### **3.2 The Problem and Its Solution Space**

Larger patterns can be naturally constructed by joining (or merging) smaller patterns in a bottom-up fashion. To limit the considered patterns to those relevant to $A$, instances can be used as an intermediate step. As Fig. 3 demonstrates, we can sum two instances with a simple element-wise matrix addition and project the result to obtain a joined pattern. Here we start by instantiating $X$ and $Y$ with offsets $(1, 0)$ and $(1, 1)$, respectively. We add the resulting $x$ and $y$ to obtain $z$, the union of $X$ and $Y$ with relative offset $(1, 1) - (1, 0) = (0, 1)$.

$$x = X \otimes (1, 0) = \begin{bmatrix} \cdot \cdot \\ 1 \cdot \cdot \\ \cdot 1 \end{bmatrix}, \ y = Y \otimes (1, 1) = \begin{bmatrix} \cdot \cdot \\ \cdot 1 \\ \cdot \cdot \end{bmatrix} \\ x + y = \begin{bmatrix} \cdot \cdot \\ 1 \cdot 1 \\ \cdot 1 \end{bmatrix}, \ Z = \bigcirc (x + y) = \begin{bmatrix} 1 \cdot 1 \\ \cdot 1 \end{bmatrix}$$

**Fig. 3.** Example of joining patterns X and Y to construct a new pattern Z.

**The Sets $\mathcal{H}_A$ and $\mathcal{I}_A$.** We define the **model class** $\mathcal{H}$ as the set of all possible models for all possible inputs. Without any prior knowledge, this would be the search space. To simplify the search, however, we only consider the more bounded subset $\mathcal{H}_A$ of all possible models for $A$, and $\mathcal{I}_A$, the set of all possible instantiations for these models. To this end we first define $H^0_A$ to be the model with only singleton patterns, i.e., $H^0_A = S$, and denote its corresponding instantiation matrix by $I^0_A$. Given that each element of $I^0_A$ must correspond to exactly one element of $A$ in $H^0_A$, we see that each $I_{i,j} = a_{i,j}$ and so we have $I^0_A = A$.

Using $H^0_A$ and $I^0_A$ as base cases, we can now inductively define $\mathcal{I}_A$:

**Base case** $I^0_A \in \mathcal{I}_A$. **By induction** If $I$ is in $\mathcal{I}_A$, then take any pair $I_{i,j}, I_{k,l} \in I$ such that $(i, j) \le (k, l)$ in lexicographical order. Then the set $I'$ is also in $\mathcal{I}_A$, providing $I'$ equals $I$ except:

$$I'_{i,j} := \bigcirc\big(I_{i,j} \otimes (i, j) + I_{k,l} \otimes (k, l)\big), \qquad I'_{k,l} := \cdot$$

This shows that we can add any two instances together, in any order, as they are by definition always non-overlapping and thus valid in $A$, and hereby obtain another element of $\mathcal{I}_A$. Eventually this results in one big instance that is equal to $A$. Note that when we take two elements $I_{i,j}, I_{k,l} \in I$ we force $(i, j) \le (k, l)$, not only to eliminate different routes to the same instance matrix, but also so that the pivot of the new pattern coincides with $I_{i,j}$. We can then leave $I'_{k,l}$ empty.

The construction of $\mathcal{I}_A$ also implicitly defines $\mathcal{H}_A$. While this may seem odd (defining models from instantiations instead of the other way around), note that there is no unambiguous way to find one instantiation for a given model. Instead we obtain the following definition by applying the inductive construction:

$$\mathcal{H}_A = \left\{ \{ \bigcirc(x) \mid x \in I \} \; \middle| \; I \in \mathcal{I}_A \right\}.\tag{1}$$

So for any instantiation $I \in \mathcal{I}_A$ there is a corresponding set in $\mathcal{H}_A$ of all patterns that occur in $I$. This results in an interesting symbiosis between model and instantiation: increasing the complexity of one decreases that of the other. This construction gives a tightly connected lattice, as shown in Fig. 4.

#### **3.3 Encoding Models and Instances**

From all models in $\mathcal{H}_A$ we want to select the model that describes $A$ best. Two-part MDL [5] tells us to choose the model that minimises the sum $L_1(H_A) + L_2(A|H_A)$, where $L_1$ and $L_2$ are functions that give the length of the model and the length of 'the data given the model', respectively. In this context, the data given the model is given by $I_A$, which represents the accidental information needed to reconstruct the data $A$ from $H_A$.

**Fig. 4.** Model space lattice for a 2×2 Boolean matrix. The V, W, and Z columns show which pattern is added in each step, while I depicts the current instantiation.

In order to compute their lengths, we need to decide how to encode H*<sup>A</sup>* and I. As this encoding is of great influence on the outcome, we should adhere to the conditions that follow from MDL theory: (1) the model and data must be encoded losslessly; and (2) the encoding should be as concise as possible, i.e., it should be optimal. Note that for the purpose of model selection we only need the length functions; we do not need to actually encode the patterns or data.

**Code Length Functions.** Although the patterns in H and instantiation matrix I are all matrices, they have different characteristics and thus require different encodings. For example, the size of I is constant and can be ignored, while the


**Table 1.** Code length definitions. Each row specifies the code length given by the first column as the sum of the remaining terms.

sizes of the patterns vary and should be encoded. Hence we construct different length functions<sup>1</sup> for the different components of H and I, as listed in Table 1.

When encoding $I$, we observe that it contains each pattern $X \in H$ multiple times, given by the **usage** of $X$. Using the **prequential plug-in code** [5] to encode $I$ enables us to omit encoding these usages separately, which would create unwanted bias. The prequential plug-in code gives the following length function for $I$. We use $\epsilon = 0.5$ and elaborate on its derivation in the Appendix<sup>2</sup>.

$$L_{pp}(I \mid P_{plugin}) = -\sum_{X_i \in H}^{|H|} \left[ \log \frac{\Gamma(\text{usage}(X_i) + \epsilon)}{\Gamma(\epsilon)} \right] + \log \frac{\Gamma(|I| + \epsilon |H|)}{\Gamma(\epsilon |H|)} \tag{2}$$

Each length function has four terms. First we encode the total size of the matrix. Since we assume $MN$ to be known/constant, we can use this constant to define the uniform distribution $\frac{1}{MN}$, so that $\log MN$ encodes an arbitrary index of $A$. Next we encode the number of elements that are non-empty. For patterns this value is encoded together with the third term, namely the positions of the non-empty elements: we use the previously encoded $M_X N_X$ in the binomial coefficient to enumerate the ways the $|X|$ elements can be placed onto a grid of $M_X N_X$. This gives us both *how many* non-empties there are and *where* they are. Finally, the fourth term is the length of the actual symbols that encode the elements of the matrix. When we encode single elements of $A$, we assume that each unique value in $A$ occurs with equal probability; without other prior knowledge, the uniform distribution has minimax regret and is therefore optimal. For the instance matrix, which maps symbols to patterns, the prequential code is used as demonstrated above. Note that $L_{\mathbb{N}}$ is the universal prior for the integers [9], which can be used for arbitrary integers and penalises larger integers.
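As an illustration, Eq. (2) translates directly into code via the log-gamma function; a short sketch (function name ours), returning lengths in bits.

```python
import numpy as np
from scipy.special import gammaln

def l_pp(usages, eps=0.5):
    """Prequential plug-in code length (Eq. 2) for the instantiation
    matrix. `usages` holds usage(X_i) for every pattern X_i in H; the
    total number of instances |I| is their sum."""
    usages = np.asarray(usages, dtype=float)
    H, I = len(usages), usages.sum()
    per_pattern = gammaln(usages + eps) - gammaln(eps)
    total = gammaln(I + eps * H) - gammaln(eps * H)
    return (total - per_pattern.sum()) / np.log(2)  # nats -> bits
```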

## **4 The Vouw Algorithm**

Pattern mining often yields vast search spaces and geometric pattern mining is no exception. We therefore use a heuristic approach, as is common in MDL-based approaches [7,10,12]. We devise a greedy algorithm that exploits the inductive

<sup>1</sup> We calculate code lengths in bits and therefore all logarithms have base 2.

<sup>2</sup> The appendix is available on https://arxiv.org/abs/1911.09587.

definition of the search space as shown by the lattice in Fig. 4. We start with a completely underfit model (leftmost in the lattice), where there is one instance for each matrix element. Next, in each iteration we combine two patterns, resulting in one or more pairs of instances to be merged (i.e., we move one step right in the lattice). In each step we merge the pair of patterns that improves compression most, and we repeat this until no improvement is possible.

### **4.1 Finding Candidates**

The first step is to find the 'best' **candidate** pair of patterns for merging (Algorithm 1). A candidate is denoted as a tuple (X, Y, δ), where X and Y are patterns and δ is the relative offset of X and Y as they occur in the data. Since we only need to consider pairs of patterns and offsets that actually occur in the instance matrix, we can directly enumerate candidates from the instantiation matrix and never even need to consider the original data.


The **support** of a candidate, written sup(X, Y, δ), tells how often it is found in the instance matrix. Computing support is not completely trivial, as one candidate occurs multiple times in 'mirrored' configurations, such as (X, Y, δ) and (Y, X, <sup>−</sup>δ), which are equivalent but can still be found separately. Furthermore, due to the definition of a pattern, many potential candidates cannot be considered by the simple fact that their elements are not adjacent.

**Peripheries.** For each instance $x$ we define its *periphery*: the set of instances which are positioned such that their union with $x$ produces a valid pattern. This set is split into the *anterior* periphery $\mathrm{ANT}(x)$ and the *posterior* periphery $\mathrm{POST}(x)$, containing instances that come before and after $x$ in lexicographical order, respectively. This enables us to scan the instance matrix once, in lexicographical order. For each instance $x$, we only consider the instances in $\mathrm{POST}(x)$ as candidates, thereby eliminating any (mirrored) duplicates.

**Self-overlap.** Self-overlap happens for candidates of the form $(X, X, \delta)$. In this case, too many or too few copies may be counted. Take for example a straight line of five instances of $X$: there are four unique pairs of two $X$'s, but only two can be merged at a time, in three different ways. Therefore, when considering candidates of the form $(X, X, \delta)$, we also compute an *overlap coefficient*. This coefficient is given by $e = (2N_X + 1)\delta_i + \delta_j + N_X$, which essentially transforms $\delta$ into a one-dimensional coordinate space of all possible ways that $X$ can be arranged *after* and *adjacent* to itself. For each instance $x_1$, a vector of bits $V(x_1)$ is used to remember whether we have already encountered a combination $x_1, x_2$ with coefficient $e$, such that we do not count a combination $x_2, x_3$ with an equal $e$. This eliminates incorrect counting due to self-overlap.
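A small sketch of this bookkeeping (names ours): compute the coefficient for an offset and only count a pair whose coefficient has not yet been seen for the first instance.

```python
def overlap_coefficient(delta, n_x):
    """e = (2*N_X + 1)*delta_i + delta_j + N_X: maps a self-overlap
    offset onto a one-dimensional coordinate."""
    di, dj = delta
    return (2 * n_x + 1) * di + dj + n_x

def count_pair(seen, x1_id, delta, n_x):
    """Return True iff the combination may be counted; `seen` maps an
    instance id to the set of coefficients already encountered."""
    e = overlap_coefficient(delta, n_x)
    if e in seen.setdefault(x1_id, set()):
        return False
    seen[x1_id].add(e)
    return True
```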

#### **4.2 Gain Computation**

After candidate search we have a set of candidates $C$ and their respective supports. The next step is to select the candidate that gives the best *gain*: the improvement in compression from merging the candidate pair of patterns. For each candidate $c = (X, Y, \delta)$ the gain $\Delta L(A', c)$ comprises two parts: (1) the negative gain of adding the union pattern $Z$ to the model $H$, resulting in $H'$, and (2) the gain of replacing all instances $x, y$ with relative offset $\delta$ by $Z$ in $I$, resulting in $I'$. We use the length functions $L_1, L_2$ to derive an equation for gain:

$$\begin{aligned} \Delta L(A', c) &= \left( L\_1(H') + L\_2(I') \right) - \left( L\_1(H) + L\_2(I) \right) \\ &= L\_N(|H|) - L\_N(|H| + 1) - L\_p(Z) + \left( L\_2(I') - L\_2(I) \right) \end{aligned} \tag{3}$$

As we can see, the $L_1$ terms simplify to $-L_p(Z)$ plus the change in the encoded model size, because $L_1$ is simply a summation of individual pattern lengths. The equation for $L_2$ requires recomputing the length of the entire instance matrix, which is expensive considering that we need to perform it for *every candidate* in *every iteration*. However, we can rework the function $L_{pp}$ in Eq. (2) by observing that we can isolate the logarithms and generalise them into

$$\log\_G(a,b) = \log \frac{\Gamma(a+b\epsilon)}{\Gamma(b\epsilon)} = \log \Gamma(a+b\epsilon) - \log \Gamma(b\epsilon),\tag{4}$$

which can be used to rework the second part of Eq. (3) in such a way that the gain equation can be computed in constant time complexity.

$$\begin{aligned} L\_2(I') - L\_2(I) &= \log\_G(U(X), 1) + \log\_G(U(Y), 1) \\ &- \log\_G(U(X) - U(Z), 1) - \log\_G(U(Y) - U(Z), 1) \\ &- \log\_G(U(Z), 1) + \log\_G(|I|, |H|) - \log\_G(|I'|, |H'|) \end{aligned} \tag{5}$$

Notice that in some cases the usages of X and Y are equal to that of Z, which means additional gain is created by removing X and Y from the model.
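Equations (4) and (5) translate into constant-time code with SciPy's log-gamma; the argument names below are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_g(a, b, eps=0.5):
    """log-Gamma ratio of Eq. (4), in bits."""
    return (gammaln(a + b * eps) - gammaln(b * eps)) / np.log(2)

def delta_l2(u_x, u_y, u_z, n_inst, n_pat, n_inst_new, n_pat_new):
    """L2(I') - L2(I) of Eq. (5), given the usages of X, Y, Z and the
    sizes |I|, |H| before and after the merge."""
    return (log_g(u_x, 1) + log_g(u_y, 1)
            - log_g(u_x - u_z, 1) - log_g(u_y - u_z, 1)
            - log_g(u_z, 1)
            + log_g(n_inst, n_pat) - log_g(n_inst_new, n_pat_new))
```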

### **4.3 Mining a Set of Patterns**

In the second part of the algorithm, listed in Algorithm 2, we select the candidate (X, Y, δ) with the largest gain and merge X and Y to form Z, as explained in Sect. 3.2. We linearly traverse I to replace all instances x and y with relative offset δ by instances of Z. (X, Y, δ) was constructed by looking in the posterior periphery of all x to find Y and δ, which means that Y always comes after X in lexicographical order. The pivot of a pattern is the first element in lexicographical order, therefore pivot(Z) = pivot(X). This means that we can replace all matching <sup>x</sup> with an instance of <sup>Z</sup> and all matching <sup>y</sup> with ·.

### **4.4 Improvements**

**Local Search.** To improve the efficiency of finding large patterns without sacrificing the underlying idea of the original heuristics, we add an optional local search. Observe that without local search, Vouw generates a large pattern X

**Fig. 5.** Synthetic patterns are added to a matrix filled with noise. The difference between the ground truth and the matrix reconstructed by the algorithm is used to compute precision and recall.

**Fig. 6.** The influence of SNR in the ground truth (left) and prevalence on recall (right)

by adding small elements to an incrementally growing pattern, a behaviour that requires up to $|X| - 1$ steps. To speed this up, we can try to 'predict' which elements will be added to $X$ and add them immediately. After selecting candidate $(X, Y, \delta)$ and merging $X$ and $Y$ into $Z$, for all $m$ resulting instances $z_i \in \{z_0, \dots, z_{m-1}\}$ we try to find a pattern $W$ and offset $\delta$ such that

$$\forall_{i \in 0 \dots m}\; \exists w \in \text{ANT}(z_i) \cup \text{POST}(z_i) \;:\; \bigcirc(w) = W \,\wedge\, \mathrm{dist}(z_i, w) = \delta. \tag{6}$$

This yields zero or more candidates $(Z, W, \delta)$, which are then treated as any set of candidates: candidates with the highest gain are iteratively merged until no candidates with positive gain exist. This essentially means that we run the baseline algorithm only on the peripheries of all $z_i$, with the condition that the support of the candidates is equal to that of $Z$.

**Reusing Candidates.** We can improve performance by reusing the candidate set and slightly changing the search heuristic of the algorithm. The **Best-\*** heuristic selects multiple candidates on each iteration, as opposed to the baseline **Best-1** heuristic that only selects a single candidate with the highest gain. Best-\* selects candidates in descending order of gain until no candidates with positive gain are left. Furthermore we only consider candidates that are all *disjoint*, because when we merge candidate (X, Y, δ), remaining candidates with X and/or Y have unknown support and therefore unknown gain.

## **5 Experiments**

To assess Vouw's practical performance we primarily use Ril, a synthetic dataset generator developed for this purpose. Ril utilises random walks to populate a matrix with patterns of a given size and prevalence, up to a specified density, while filling the remainder of the matrix with noise. Both the pattern elements and the noise are picked from the same uniform random distribution on the interval [0, 255]. The *signal-to-noise ratio* (SNR) of the data is defined as the number of pattern elements over the matrix size MN. The objective of the experiment is to assess whether Vouw recovers all of the signal (the patterns) and none of the noise. Figure 5 gives an example of the generated data and how it is evaluated. A more extensive description can be found in the Appendix (see footnote 2).

**Implementation.** The implementation<sup>3</sup> used consists of the Vouw algorithm (written in vanilla C/C++), a GUI, and the synthetic benchmark Ril. Experiments were performed on an Intel Xeon-E2630v3 with 512 GB RAM.

**Evaluation.** Completely random data (noise) is unlikely to be compressed. The SNR tells us how much of the data is noise and thus conveniently gives us an upper bound of how much compression could be achieved. We use the ground truth SNR versus the resulting compression ratio as a benchmark to tell us how close we are to finding all the structure in the ground truth.

<sup>3</sup> https://github.com/mickymuis/libvouw.

In addition, we also compare the ground truth matrix to the obtained model and instantiation. As singleton patterns do not yield any compression over the baseline model, we reconstruct the matrix omitting any singleton patterns. Ignoring the actual values, this gives us a Boolean matrix with 'positives' (pattern occurrence = signal) and 'negatives' (no pattern = noise). By comparing each element in this matrix with the corresponding element in the ground truth matrix, *precision* and *recall* can be calculated and evaluated.
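For reference, this element-wise comparison is a standard precision/recall computation over Boolean matrices; a minimal sketch:

```python
import numpy as np

def precision_recall(pred, truth):
    """Precision/recall between the reconstructed Boolean matrix
    (True = covered by a non-singleton pattern) and the ground truth
    (True = planted signal)."""
    tp = np.logical_and(pred, truth).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / truth.sum() if truth.sum() else 0.0
    return precision, recall
```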

Figure 6 (left) shows the influence of ground truth SNR on compression ratio for different matrix sizes. Compression ratio and SNR are clearly strongly correlated. Figure 6 (right) shows that patterns with a low prevalence (i.e., number of planted occurrences) have a lower probability of being 'detected' by the algorithm as they are more likely to be accidental/noise. Increasing the matrix size also increases this threshold. In Table 2 we look at the influence of the two improvements upon the baseline algorithm as described in Sect. 4.4. In terms of quality, local search can improve the results quite substantially while Best-\* notably *lowers* precision. Both improve speed by an order of magnitude.

**Table 2.** Performance measurements for the baseline algorithm and its optimisations.


## **6 Conclusions**

We introduced geometric pattern mining, the problem of finding recurring structures in discrete, geometric matrices, or raster-based data. Further, we presented Vouw, a heuristic algorithm for finding sets of geometric patterns that are good descriptions according to the MDL principle. It is capable of accurately recovering patterns from synthetic data, and the resulting compression ratios are on par with the expectations based on the density of the data. For the future, we think that extensions to fault-tolerant patterns and clustering have large potential.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Consensus Approach to Improve NMF Document Clustering**

Mickael Febrissy(B) and Mohamed Nadif

LIPADE, Universit´e de Paris, 75006 Paris, France mickael.febrissy@u-paris.fr

**Abstract.** Nonnegative Matrix Factorization (NMF), originally designed for dimensionality reduction, has received a tremendous amount of attention over the years for clustering purposes in several fields such as image processing and text mining. However, despite its mathematical elegance and simplicity, NMF has a main issue: its strong sensitivity to starting points, which makes it struggle to converge toward an optimal solution. We also observed that, even with a meaningful initialization, selecting the solution with the best local minimum does not always lead to the best clustering quality; a better clustering can sometimes be obtained from a solution slightly off in terms of the criterion. Therefore, in this paper we study the clustering characteristics and quality of a set of the best NMF solutions, and provide a method delivering a better partition using a consensus made of the best NMF solutions.

**Keywords:** NMF · Clustering · Clustering ensemble · Consensus

## **1 Introduction**

When dealing with text data, document clustering techniques make it possible to divide a set of documents into groups, so that documents assigned to the same group are more similar to each other than to documents assigned to other groups [12,18,21,22]. In information retrieval, the use of clustering relies on the assumption that if a document is relevant to a query, then other documents in the same cluster may also be relevant. This hypothesis can be used at different stages of the information retrieval process, the two most notable being cluster-based retrieval, to speed up search, and search-result clustering, to help users navigate and understand the search results. Document clustering, which remains a hot topic, can be tackled with different approaches. In our contribution we rely on nonnegative matrix factorization for its simplicity and popularity. We do not propose a new variant of NMF but rather a consensus approach that boosts its performance.

Unlike supervised learning, the evaluation of clustering algorithms (unsupervised learning) remains a difficult problem. When relying on generative models, it is easier to evaluate the performance of a given clustering algorithm based on the simulated partition. On real, already-labeled data, many papers evaluate the performance of clustering algorithms using indices such as Accuracy (ACC), Normalized Mutual Information (NMI) [25], and the Adjusted Rand Index (ARI) [14]. However, the commonly used algorithms of the k-means, EM [8], Classification EM [6], or NMF [15] type are iterative and require several initializations; the resulting partition is the one optimizing the objective function. In some of these works, we observe comparative studies between methods based on the maximum ACC/NMI/ARI measures obtained over several initializations, rather than on the criterion optimized by the algorithm. Such a comparison is not accurate, because these measures cannot be calculated in practice and cannot be used in this way to evaluate the quality of a clustering algorithm.

A fair comparison can only be made on the basis of objective functions designed for clustering; for example, within-cluster inertia, likelihood or classification likelihood for mixture models, factorization error, etc. Nonetheless, in our experiments we observed that, while clustering results tend to improve in terms of ACC/NMI/ARI as the objective function value improves, the best objective value is not necessarily associated with the best results. However, when ranking the objective values, the best partition tends to be among those achieving the first best scores. We illustrate this behavior in Fig. 4. This remark leads us to consider an *ensemble method*, widely used in supervised learning [11,24] but somewhat less in unsupervised learning [25]. While this approach, referred to as *consensus clustering*, is often used to combine partitions obtained with different algorithms, it is less studied with a single algorithm.

The paper is organized as follows. In Sect. 2, we review nonnegative matrix factorization with the Frobenius norm and the Kullback-Leibler divergence. Section 3 is devoted to describing the ensemble method and the commonly used consensus algorithms. In Sect. 4, we perform comparisons on document-term matrices and propose a strategy to improve document clustering with NMF.

## **2 Nonnegative Matrix Factorization**

Nonnegative Matrix Factorization (NMF) [15], which aims to deliver a lower-rank decomposition of a nonnegative data matrix *X*, has exhibited clustering properties for which strong connections with k-means or spectral clustering can be drawn [16]. However, while several variants have arisen to accommodate its clustering ability [10,29–31], its original formulation does not involve a clustering objective: NMF was presented as a dimension reduction algorithm with exclusively nonnegative factors. This matters especially in text mining, where NMF produces a meaningful interpretation of document-term matrices, in contrast with methods like Singular Value Decomposition (SVD) or Latent Semantic Analysis (LSA) [7], whose factors may take negative values. NMF seeks to approximate a matrix $X \in \mathbb{R}_+^{n \times d}$ by the product of two lower-rank matrices $Z \in \mathbb{R}_+^{n \times g}$ and $W \in \mathbb{R}_+^{d \times g}$ with $g(n + d) < nd$. This problem can be formulated as a constrained optimization problem

$$\mathcal{F}(\mathbf{Z}, \mathbf{W}) = \min_{\mathbf{Z} \ge 0,\, \mathbf{W} \ge 0} D(\mathbf{X}, \mathbf{Z}\mathbf{W}^\top) \tag{1}$$

where D is a fitting error measuring the quality of the approximation of *X* by *ZW*⊤, the most popular choices being the Frobenius norm and the Kullback-Leibler (KL) divergence. In a clustering setup, *Z* is referred to as the soft classification matrix and *W* as the centers matrix. Despite its benefits in many applications, NMF has a recurrent downside: its initialization. NMF provides a different solution for every initialization, making it substantially sensitive to starting points, as its convergence directly depends on the given entries. Several publications have sought the best way to start an NMF algorithm by providing a structured initialization, in some cases obtained from the results of clustering algorithms such as k-means or Spherical K-means [27,28] (especially when applying NMF to document-term matrices), Nonnegative Double Singular Value Decomposition (NNDSVD) [4] or SVD-based strategies [17]. The optimization procedures for D equal to the Frobenius norm and to the KL divergence, based on multiplicative update rules, are given in Algorithms 1 and 2.
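Purely as an illustration of these update rules (the function names, iteration budget and small `eps` guard are our assumptions, not the authors' code), a minimal NumPy sketch of both procedures, using the random initialization U(0, 1) × mean(*X*) described in Sect. 4, could look as follows:

```python
import numpy as np

def nmf_frobenius(X, g, n_iter=200, eps=1e-10, seed=None):
    """Multiplicative updates minimizing ||X - Z W^T||_F (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0, 1, (X.shape[0], g)) * X.mean()  # random init, U(0,1) x mean(X)
    W = rng.uniform(0, 1, (X.shape[1], g)) * X.mean()
    for _ in range(n_iter):
        Z *= (X @ W) / (Z @ (W.T @ W) + eps)
        W *= (X.T @ Z) / (W @ (Z.T @ Z) + eps)
    return Z, W

def nmf_kl(X, g, n_iter=200, eps=1e-10, seed=None):
    """Multiplicative updates minimizing KL(X || Z W^T) (sketch of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0, 1, (X.shape[0], g)) * X.mean()
    W = rng.uniform(0, 1, (X.shape[1], g)) * X.mean()
    for _ in range(n_iter):
        Z *= ((X / (Z @ W.T + eps)) @ W) / W.sum(axis=0)
        W *= ((X / (Z @ W.T + eps)).T @ Z) / Z.sum(axis=0)
    return Z, W
```

A hard partition can then be read off the soft classification matrix, e.g. `labels = Z.argmax(axis=1)`.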


## **3 Cluster Ensembles (CE)**

In machine learning, the idea of exploiting multiple partitions of the data first occurred with multi-learner systems, where the outputs of several classifiers were combined in order to improve the accuracy and robustness of a classification or regression, with strong acknowledged performances [24,25]. At that stage, few approaches had applied a similar concept to unsupervised learning algorithms. In this regard, we note the work of [5], who tried to combine several clustering partitions by combining their cluster centers. In the early 2000s, [25] were the first to consider combining several data partitions without access to the original sources of information (features) or to previously computed centers. This approach is referred to as *cluster ensembles*. Their idea was motivated by the possibility of taking advantage of existing information, such as prior clustering partitions or an expert categorization (regrouped under the term Knowledge Reuse), which may still be relevant for a user to consider in a new analysis of the same objects, whether or not the data associated with these objects differ from the data used to define the prior partitions. Another motivation was *distributed computing*, that is, analyzing different sources of data (which might be complicated to merge, for instance for privacy reasons) stored in different locations. In our setting, we use *cluster ensembles* to improve the quality of the final partition (as opposed to selecting a unique one) and therefore to extract all the possibilities offered by the miscellaneous best solutions produced by NMF.

In [25], the authors introduced three consensus methods that can produce a partition. All of them consider the consensus problem on a hypergraph representation *H* of the set of partitions. More specifically, each partition $\lambda^{(q)}$ corresponds to a binary membership matrix $H^{(q)}$ (with objects in rows and clusters in columns), and the concatenation of all these matrices defines the hypergraph *H*.


Furthermore, in [25] the authors proposed an objective function to characterize the *cluster ensembles* problem, thereby allowing the selection, among the three consensus algorithms, of the one delivering the best ensemble partition. Let $\Lambda = \{\lambda^{(q)} \mid q \in \{1,...,r\}\}$ be a given set of $r$ partitions $\lambda^{(q)}$ represented as label vectors. The ensemble partition, denoted $\lambda^{(k\text{-}opt)}$ and called the optimal combined clustering, aims at maximizing the Average Normalized Mutual Information (ANMI). It is defined as follows:

$$\lambda^{(k\text{-}opt)} = \operatorname*{argmax}_{\tilde{\lambda}} \sum_{q=1}^{r} \text{NMI}(\tilde{\lambda}, \lambda^{(q)}) \tag{2}$$

The ANMI is simply the average normalized mutual information of a label vector $\tilde{\lambda}$ with all label vectors $\lambda^{(q)}$ in $\Lambda$:

$$\text{ANMI}(\Lambda, \tilde{\lambda}) = \frac{1}{r} \sum_{q=1}^{r} \text{NMI}(\tilde{\lambda}, \lambda^{(q)}) \tag{3}$$

To handle cases where the label vectors $\lambda^{(q)}$ have missing values, the authors proposed a generalized expression of (2) that is not substantially different; readers can refer to the original paper [25].
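For illustration, Eq. (3) can be evaluated directly with an off-the-shelf NMI implementation. The sketch below assumes scikit-learn is available and represents each partition as an integer label vector:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def anmi(candidate, partitions):
    """Average NMI (Eq. 3) of a candidate label vector with a set of label vectors."""
    return np.mean([normalized_mutual_info_score(candidate, lam) for lam in partitions])

# The optimal combined clustering of Eq. (2) is the candidate maximizing this score:
# best = max(candidate_partitions, key=lambda lam: anmi(lam, partitions))
```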

## **4 Experiments**

We conduct several experiments to emphasize the behavior of NMF in a clustering task compared to a dedicated clustering algorithm such as Spherical K-means, referred to as S-Kmeans [9], which was introduced for clustering large sets of sparse text data (or directional data) and remains appealing for its low computational cost and good performance. S-Kmeans was retained, alongside random starting points (generated according to a uniform distribution U(0, 1) × mean(*X*)), as an initialization for NMF. We use two error measures frequently employed with NMF: the Frobenius norm (referred to as NMF-F) and the Kullback-Leibler divergence (NMF-KL). Finally, we compute the consensus partition using the Cluster Ensembles Python package<sup>1</sup>, which implements the consensus methods defined earlier [25].
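To make the protocol used below concrete (several runs, keep the best by criterion, then build a consensus), here is a hedged sketch relying on scikit-learn's NMF, whose `reconstruction_err_` serves as the criterion; for NMF-KL one would pass `beta_loss='kullback-leibler'` and `solver='mu'`. The final consensus call is shown as a comment, as the exact entry point of the Cluster Ensembles package may differ:

```python
import numpy as np
from sklearn.decomposition import NMF

def top_partitions(X, g, n_runs=30, n_keep=10):
    """Run NMF several times, rank runs by final criterion, keep the best partitions."""
    runs = []
    for seed in range(n_runs):
        model = NMF(n_components=g, init='random', random_state=seed, max_iter=300)
        Z = model.fit_transform(X)                    # soft classification matrix
        runs.append((model.reconstruction_err_, Z.argmax(axis=1)))
    runs.sort(key=lambda t: t[0])                     # lowest criterion first
    return np.array([labels for _, labels in runs[:n_keep]])

# Consensus over the kept partitions (call shape as in the package documentation):
# import Cluster_Ensembles as CE
# consensus = CE.cluster_ensembles(top_partitions(X, g), N_clusters_max=g)
```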

## **4.1 Datasets**

We apply NMF to five benchmark document-term matrices whose detailed characteristics are given in Table 1, where nz indicates the percentage of non-zero values and the *balance* coefficient is defined as the ratio of the number of documents in the smallest class to the number of documents in the largest class. These datasets exhibit a variety of challenging situations in terms of number of clusters, dimensions, cluster balance, degree of mixture of the different groups, and sparsity. We normalized each data matrix with TF-IDF and scaled its document vectors to unit L2 norm to remove the bias introduced by document length.
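With scikit-learn, for instance, this preprocessing amounts to a few lines; `docs` below is a toy stand-in for the actual corpora:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "cats and dogs"]  # toy documents (illustrative)
vectorizer = TfidfVectorizer(norm='l2')       # TF-IDF weights, rows scaled to unit L2 norm
X = vectorizer.fit_transform(docs).toarray()  # dense nonnegative document-term matrix
```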


**Table 1.** Datasets description: # denotes the cardinality

### **4.2 NMF Raw Performances and Initialization**

The results obtained by NMF-F and NMF-KL with S-Kmeans and random starting points are reported in Table 2. The clustering quality of the

<sup>1</sup> https://pypi.org/project/Cluster_Ensembles/.

S-Kmeans partitions given as input to both algorithms is also displayed. We use two relevant measures to quantify and assess the clustering quality of each algorithm. The first is the NMI [25], which quantifies how much information the clustering partition shares with the true partition; the second is the ARI [14], which is sensitive to cluster proportions and measures the degree of agreement between the clustering and the true partition. To replicate a realistic unsupervised use case, we rely on the criterion of each algorithm to select the 10 best solutions (out of 30 runs) and report their average NMI and ARI with respect to the true partition.

One can clearly see that NMF-F and NMF-KL do not react similarly to the different initializations. While NMF-F substantially benefits from the S-Kmeans initialization on every dataset compared to the random initialization, NMF-KL does not seem to accommodate S-Kmeans entries. In fact, S-Kmeans starting values seem to worsen NMF-KL solutions, especially on CLASSIC4 and NG5. For this reason, we avoid this initialization strategy for NMF-KL in the following, although it improves results on RCV1. Also, NMF-KL with a random initialization provides much better results than the other algorithms on almost all datasets.


**Table 2.** Mean and standard deviation of NMI and ARI computed over the 10 best solutions.

In Figs. 1, 2, 3 and 4 we report the clustering quality of each algorithm's solutions, ranked from the best to the poorest in terms of criterion. The criterion of each algorithm is normalized to lie in [0, 1].

When one has the true partition, a common practice to evaluate a clustering algorithm is to rely on the best solution obtained by optimizing the objective function. Figures 1 and 3 highlight a critical behavior of NMF-F, which tends to produce solutions whose lowest minima do not yield the best clustering partitions, sometimes with a substantial gap (see CSTR, RCV1, NG5 in Fig. 1). Moreover, a lesser but still similar behavior is exhibited by S-Kmeans, which, unlike NMF, optimizes a clustering objective by definition; the results are displayed in Fig. 2. In fact, this behavior can be observed with several types of what we refer to as clustering algorithms hosting an optimization procedure. Initializing NMF-F randomly, as shown in Fig. 3, seems to lighten this

**Fig. 1.** NMF-F: NMI/ARI behaviour according to the objective function F (initializations by S-Kmeans)

effect (on CSTR, CLASSIC4 and RCV1). On the other hand, NMF-KL, which to this day remains recognized as a relevant method for document clustering [13], seems to consistently deliver solutions whose lowest criteria align with the goodness of their clustering, supporting the use of NMF for clustering purposes. Furthermore, compared to the others, NMF-KL is the only method exhibiting a wide variety of solutions, and it therefore seems to explore far more possibilities than NMF-F or S-Kmeans. Its better behavior might almost justify selecting the best partition in terms of criterion as the one to keep. However, it still fails on RCV1, the toughest dataset to partition, mainly because of its high sparsity. Finally, it remains risky to select the best partition solely on the criterion: even with NMF-KL, the solution providing the best clustering among the best ones is not necessarily the first one (see CSTR, CLASSIC4 and NG5).

In addition, while the best solutions may share a similar amount of information with the true partition, they can be fairly distinct from each other, making them appealing for deriving an even more exhaustive solution. Figure 5 shows the pairwise NMI and ARI between the top 10 partitions (criterion-wise) of each algorithm. NMF-KL's best solutions appear to be fairly different from each other.

**Fig. 2.** S-Kmeans: NMI/ARI behaviour according to the objective function F (Random initializations)

**Fig. 3.** NMF-F: NMI/ARI behaviour according to the objective function F (Random initializations)

**Fig. 4.** NMF-KL: NMI/ARI behaviour according to the objective function F (Random initializations)

**Fig. 5.** Average pairwise NMI & ARI between top 10 solutions

#### **4.3 Consensus Clustering**

Following the previous observations, we computed a cluster ensemble (CE) for NMF-F and NMF-KL according to their best initialization strategy, as well as for S-Kmeans, given its pertinence for initializing NMF-F and its wide recognition as a relevant method for document clustering. The results are reported in Table 3. The consensus obtained with the top 10 results of each method generally outperforms the best single solution. This effect is even stronger for NMF-KL, where the ensemble clustering increases the NMI and ARI by 11 and 13 points, respectively, on NG20. Note that NG20 is the dataset where the average pairwise NMI and ARI between the 10 top partitions are the lowest, i.e., where the partitions are the most different (see Fig. 5). Furthermore, it is interesting to note that these performances are obtained from solutions whose average NMI and ARI are smaller than those of the best solution itself.

**Table 3.** Mean and standard deviation, first best result and CE consensus computed over the 10 best solutions.


#### **4.4 Consensus Multinomial**

Following the cluster-based consensus approach, which implies a similarity-based clustering algorithm, we make use of model-based clustering to try to obtain a better final partition than the one delivered by *cluster ensembles*. In [26], the authors used the Multinomial mixture approach to propose a consensus function. In model-based clustering, it is assumed that the data are generated by a mixture of underlying probability distributions, where each component k of the mixture represents a cluster.

Let $\Lambda \in \mathbb{N}_0^{n \times r}$ be the data matrix of label vectors from the top $r$ solutions. Our data being categorical, we use a Multinomial Mixture Model (MMM) to partition the elements $\lambda_i$. Categorical data being a generalization of binary data, and assuming a perfect scenario where no partition has an empty cluster, a disjunctive matrix $M \in \{0, 1\}^{n \times rg}$ is usually used instead of $\Lambda$, with entries $m_{iq}^{(h)}$ where $h \in \{1,...,g\}$ is a cluster label. The values $m_{iq}^{(h)}$ are assumed to be generated from a Multinomial distribution $\mathcal{M}(m_{iq}^{(h)}; \alpha_{kq}^{(h)})$, where $\alpha_{kq}^{(h)}$ is the probability that an element $m_i$ in group $k$ takes the category $h$ for the partition/variable $\lambda_q$. The probability density function of the model can be stated as:

$$f(M; \theta) = \prod_{i=1}^{n} \sum_{k=1}^{g} \pi_k \prod_{q=1}^{r} \prod_{h=1}^{g} \left(\alpha_{kq}^{(h)}\right)^{m_{iq}^{(h)}} \tag{4}$$

where $\theta = (\pi, \alpha)$ are the parameters of the model, with $\pi = (\pi_1,...,\pi_g)$ the proportions and $\alpha$ the vector of component parameters.
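Our experiments fit this model with the Rmixmod R package (see below). Purely as an illustration of Eq. (4), a plain EM for the multinomial mixture over the disjunctive coding M could be sketched as follows; all names and the fixed iteration budget are our choices:

```python
import numpy as np

def mmm_consensus(labels, g, n_iter=100, seed=0):
    """EM for a multinomial mixture over r label vectors (Eq. 4); returns a consensus.

    labels: (n, r) integer array, column q holding partition lambda_q in {0,...,g-1}.
    """
    rng = np.random.default_rng(seed)
    n, r = labels.shape
    M = np.stack([np.eye(g)[labels[:, q]] for q in range(r)], axis=1)  # one-hot, (n, r, g)
    pi = np.full(g, 1.0 / g)                                           # proportions pi_k
    alpha = rng.dirichlet(np.ones(g), size=(g, r))                     # alpha_kq^(h), (g, r, g)
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in log space for stability
        log_p = np.log(pi) + np.einsum('nqh,kqh->nk', M, np.log(alpha + 1e-12))
        t = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
        # M-step: update proportions and multinomial parameters
        pi = t.mean(axis=0)
        alpha = np.einsum('nk,nqh->kqh', t, M)
        alpha /= alpha.sum(axis=2, keepdims=True)
    return t.argmax(axis=1)   # consensus partition of the n objects
```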


**Table 4.** MMM consensus results over the 10 best solutions

The Rmixmod package<sup>2</sup> is used for our analysis. We use the default settings, which select among 10 parsimonious models according to the Bayesian Information Criterion (BIC) [23]. On CSTR, the model mainly selected is the one keeping the proportions $\pi_k$ free while being independent of the variables (label vectors), i.e., $\mathcal{M}(m_{iq}^{(h)}; \alpha_k)$. CSTR is the dataset with the highest pairwise NMI and ARI, hence with the most similar best solutions. On CLASSIC4 and RCV1, where the pairwise NMI and ARI are somewhat lower, the model mainly chosen has free proportions and parameters $\alpha$ depending on distinct components and label vectors ($\mathcal{M}(m_{iq}^{(h)}; \alpha_{kq}^{(h)})$). On NG5, where the best solutions are fairly similar (high pairwise NMI and ARI), the retained model depends on the components and the label vectors, but the proportions are kept equal. On NG20, where the best solutions are fairly distinct, the selected model depends on the components and the variables, and here as well the proportions $\pi_k$ are kept equal. In light of the characteristics in Table 1, it is notable that the datasets where the proportions are kept equal are those with the most balanced true cluster proportions. The results of the obtained consensus are displayed in Table 4, which only retains the prior results of NMF-KL's top 10 solutions and the CE consensus, as they were the best overall. Apart from CSTR, we can see that MMM does a better job than CE at computing a better partition from the top 10 solutions.

### **5 Conclusion**

In this paper, using *cluster ensembles*, we have proposed a simple method to obtain a better clustering with NMF algorithms on text data. Due to its

<sup>2</sup> https://cran.r-project.org/web/packages/Rmixmod/Rmixmod.pdf.

aggregating nature, this process also alleviates the uncertainty around the overall quality of the final partition, compared to other selection practices such as keeping a unique solution according to the best criterion. Furthermore, we have shown that it is possible to improve the consensus quality through the use of finite mixture models, allowing more powerful underlying settings than cluster-based consensus involving plain similarities or distances. Future work will investigate the use of *cluster ensembles* for other recent clustering algorithms [1–3,19,20].

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Discriminative Bias for Learning Probabilistic Sentential Decision Diagrams**

Laura Isabel Galindez Olascoaga1(B), Wannes Meert<sup>2</sup>, Nimish Shah<sup>1</sup>, Guy Van den Broeck<sup>3</sup>, and Marian Verhelst<sup>1</sup>

<sup>1</sup> Electrical Engineering Department, KU Leuven, Leuven, Belgium
{laura.galindez,nimish.shah,marian.verhelst}@esat.kuleuven.be
<sup>2</sup> Computer Science Department, KU Leuven, Leuven, Belgium
wannes.meert@cs.kuleuven.be

<sup>3</sup> Computer Science Department, University of California, Los Angeles, USA guyvdb@cs.ucla.edu

**Abstract.** Methods that learn the structure of Probabilistic Sentential Decision Diagrams (PSDD) from data have achieved state-of-the-art performance in tractable learning tasks. These methods learn PSDDs incrementally by optimizing the likelihood of the induced probability distribution given the available data, and are thus robust against missing values, a relevant trait for addressing the challenges of embedded applications, such as failing sensors and resource constraints. However, PSDDs are outperformed by discriminatively trained models in classification tasks. In this work, we introduce D-LearnPSDD, a learner that improves the classification performance of the LearnPSDD algorithm by introducing a discriminative bias that encodes the conditional relation between the class and feature variables.

**Keywords:** Probabilistic models · Tractable inference · PSDD

## **1 Introduction**

Probabilistic machine learning models have been shown to be a well-suited approach to address the challenges inherent to embedded applications, such as the need to handle uncertainty and missing data [11]. Moreover, current efforts in the field of Tractable Probabilistic Modeling have been making great strides toward successfully balancing the trade-offs between model performance and inference efficiency: probabilistic circuits, such as Probabilistic Sentential Decision Diagrams (PSDDs), Sum-Product Networks (SPNs), Arithmetic Circuits (ACs) and Cutset Networks, possess a myriad of desirable properties [4] that make them amenable to application scenarios where strict resource budget constraints must be met [12]. But these models' robustness against missing data, which comes from learning them generatively, is often at odds with their discriminative capabilities. We address this conflict by proposing a discriminative-generative probabilistic circuit learning strategy that aims to improve the models' discriminative capabilities while maintaining their robustness against missing features.

We focus in particular on the PSDD [17], a state-of-the-art tractable representation that encodes a joint probability distribution over a set of random variables. Previous work [12] has shown how to learn hardware-efficient PSDDs that remain robust to missing data and noise. This approach relies largely on the LearnPSDD algorithm [20], a generative algorithm that incrementally learns the structure of a PSDD from data. Moreover, it has been shown how to exploit such robustness to trade off resource usage with accuracy. And while the achieved accuracy is competitive when compared to Bayesian Network classifiers, discriminatively learned models perform consistently better than purely generative models [21] since the latter remain agnostic to the discriminative task they ought to perform. This begs the question of whether the discriminative performance of the PSDD could be improved while remaining robust and tractable.

In this work, we propose a hybrid discriminative-generative PSDD learning strategy, D-LearnPSDD, that enforces the discriminative relationship between class and feature variables by capitalizing on the model's ability to encode domain knowledge as a logic formula. We show that this approach consistently outperforms the purely generative PSDD and is competitive compared to other classifiers, while remaining robust to missing values at test time.

### **2 Background**

*Notation.* Variables are denoted by upper case letters X and their instantiations by lower case letters x. Sets of variables are denoted in bold upper case **X** and their joint instantiations in bold lower case **x**. For the classification task, the feature set is denoted by **F** while the class variable is denoted by C.

**Fig. 1.** A Bayesian network and its equivalent PSDD (taken from [20]).

*PSDD.* Probabilistic Sentential Decision Diagrams (PSDDs) are circuit representations of joint probability distributions over binary random variables [17]. They were introduced as probabilistic extensions to Sentential Decision Diagrams (SDDs) [7], which represent Boolean functions as logical circuits. The inner nodes of a PSDD alternate between AND gates with two inputs and OR gates with an arbitrary number of inputs; the root must be an OR node; and each leaf node encodes a distribution over a variable X (see Fig. 1c). The combination of an OR gate with its AND gate inputs is referred to as a *decision* node, where the left input of the AND gate is called the *prime* (p) and the right one the *sub* (s). The n edges of a decision node are annotated with a normalized probability distribution $\theta_1, ..., \theta_n$.

PSDDs possess two important syntactic restrictions: (1) Each AND node must be *decomposable*, meaning that its input variables must be disjoint. This property is enforced by a *vtree*, a binary tree whose leaves are the random variables and which determines how variables will be arranged in primes and subs in the PSDD (see Fig. 1d): each internal vtree node is associated with the PSDD nodes at the same level, the variables appearing in the left subtree **X** being the primes and the ones appearing in the right subtree **Y** the subs. (2) Each decision node must be *deterministic*, so that only one of its inputs can be true.

Each PSDD node q represents a probability distribution. Terminal nodes encode a univariate distribution. Decision nodes, when normalized for a vtree node with **X** in its left subtree and **Y** in its right subtree, encode the following distribution over **XY** (see also Fig. 1a and c):

$$\Pr\nolimits_q(\mathbf{XY}) = \sum_i \theta_i \Pr\nolimits_{p_i}(\mathbf{X}) \Pr\nolimits_{s_i}(\mathbf{Y}) \tag{1}$$

Thus, each decision node decomposes the distribution into independent distributions over **X** and **Y**. In general, prime and sub variables are independent at a PSDD node q given the prime *base* [q] [17]. This base is the support of the node's distribution, over which it defines a non-zero probability, and it is written as a logical sentence using the recursion $[q] = \bigvee_i [p_i] \wedge [s_i]$. Kisa et al. [17] show that prime and sub variables are independent in a PSDD node q given a prime base:

$$\Pr\nolimits_q(\mathbf{XY} \mid [p_i]) = \Pr\nolimits_{p_i}(\mathbf{X} \mid [p_i]) \Pr\nolimits_{s_i}(\mathbf{Y} \mid [p_i]) = \Pr\nolimits_{p_i}(\mathbf{X}) \Pr\nolimits_{s_i}(\mathbf{Y}) \tag{2}$$

This equation encodes *context-specific independence* [2], where variables (or sets of variables) are independent given a logical sentence. The structural constraints of the PSDD are meant to exploit such independencies, leading to a representation that can answer a number of complex queries in polynomial time [1], which is not guaranteed when performing inference on Bayesian Networks, as they do not encode, and therefore cannot exploit, such local structures.
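To make Eq. (1) concrete, the toy sketch below evaluates a decision node as a mixture of prime-sub products; the class and the made-up numbers are illustrative only and do not reflect any PSDD package's API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DecisionNode:
    # Each element is (prime distribution, sub distribution, theta_i), cf. Eq. (1)
    elements: List[Tuple[Callable, Callable, float]]

    def pr(self, x, y):
        """Pr_q(XY) = sum_i theta_i * Pr_{p_i}(X) * Pr_{s_i}(Y)."""
        return sum(theta * p(x) * s(y) for p, s, theta in self.elements)

# A root with primes [not-c] and [c], as later used in Sect. 3.1 (numbers made up):
root = DecisionNode(elements=[
    (lambda c: float(c == 0), lambda f: 0.8 if f == 0 else 0.2, 0.4),  # prime base [not-c]
    (lambda c: float(c == 1), lambda f: 0.3 if f == 0 else 0.7, 0.6),  # prime base [c]
])
print(root.pr(1, 1))  # 0.6 * 1 * 0.7 = 0.42
```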

*LearnPSDD.* The LearnPSDD algorithm [20] generatively learns a PSDD by maximizing the log-likelihood given the available data. The algorithm starts by learning a *vtree* that minimizes the mutual information among all possible sets of variables. This vtree is then used to guide the PSDD structure learning stage, which relies on the iterative application of the Split and Clone operations [20]. These operations keep the PSDD syntactically sound while improving the likelihood of the distribution represented by the PSDD. A problem with LearnPSDD, when using the resulting model for classification, is that when the class variable is only weakly dependent on the features, the learner may choose to ignore that dependency, potentially rendering the model unfit for classification tasks.

## **3 A Discriminative Bias for PSDD Learning**

Generative learners such as LearnPSDD optimize the likelihood of the distribution given available data rather than the conditional likelihood of the class variable C given a full set of feature variables **F**. As a result, their accuracy is often worse than that of simple models such as Naive Bayes (NB), and its close relative Tree Augmented Naive Bayes (TANB) [12], which perform surprisingly well on classification tasks even though they encode a simple—or naive—structure [10]. One of the main reasons for their performance, despite being generative, is that (TA)NB models have a discriminative bias that directly encodes the conditional dependence of all the features on the class variable.

We introduce D-LearnPSDD, an extension to LearnPSDD based on the insight that the learned model should satisfy the "class conditional constraint" present in Bayesian Network classifiers. That is, all feature variables must be conditioned on the class variable. This enforces a structure that is beneficial for classification while still allowing to generatively learn a PSDD that encodes the distribution over all variables using a state-of-the-art learning strategy [20].

### **3.1 Discriminative Bias**

The classification task can be stated as a probabilistic query:

$$\Pr(C \mid \mathbf{F}) \propto \Pr(\mathbf{F} \mid C) \cdot \Pr(C). \tag{3}$$

Our goal is to learn a PSDD whose root decision node directly represents the conditional probability distribution Pr(**F**|C). This can be achieved by forcing the primes of the first line in Eq. 2 to be $[p_0]=[\neg c]$ and $[p_1]=[c]$, where [c] states that the propositional variable c representing the class variable is true (i.e. C = 1), and similarly $[\neg c]$ represents C = 0. For now we assume the class is binary; we will show later how to generalize to a multi-valued class variable. For the feature variables we can assume they are binary without loss of generality, since a multi-valued variable can be converted to a set of binary variables via a one-hot encoding (see, for example, [20]). To achieve our goal we first need the following proposition:

**Proposition 1.** *Given (i) a vtree with a single variable* C *as the prime and variables* **F** *as the sub of the root node, and (ii) an initial PSDD where the root decision node decomposes the distribution as* $[root] = ([p_0] \wedge [s_0]) \vee ([p_1] \wedge [s_1])$*; applying the Split and Clone operators will never change the root decision decomposition* $[root] = ([p_0] \wedge [s_0]) \vee ([p_1] \wedge [s_1])$*.*

*Proof.* The D-LearnPSDD algorithm iteratively applies two operations: Clone and Split (following the algorithm in [20]). First, the Clone operator requires a parent node, which is not available for the root node. Since the initial PSDD follows the logical formula described above, whose only restriction is on the root node, there is no parent available to clone, and the root's base thus remains intact when applying the Clone operator. Second, the Split operator splits one of the subs to extend the sentence that is used to mutually exclusively and exhaustively define all children. Since the given vtree has only one variable, C, as the prime of the root node, there are no other variables available to add to the sub. The Split operator can thus not be applied anymore, and the root's base stays intact (see Figs. 1c and d).

We can now show that the resulting PSDD contains nodes that directly represent the distribution Pr(**F**|C).

**Proposition 2.** *A PSDD of the form* $[root] = ([\neg c] \wedge [s_0]) \vee ([c] \wedge [s_1])$*, with* c *the propositional variable stating that the class variable is true, and* $s_0$ *and* $s_1$ *any formulas over propositional feature variables* $f_0,...,f_n$*, directly expresses the distribution* Pr(**F**|C)*.*

*Proof.* Applying this to Eq. 1 results in:

$$\begin{split} \Pr\nolimits_q(C\mathbf{F}) &= \Pr\nolimits_{\neg c}(C) \Pr\nolimits_{s_0}(\mathbf{F}) + \Pr\nolimits_c(C) \Pr\nolimits_{s_1}(\mathbf{F}) \\ &= \Pr\nolimits_{\neg c}(C \mid [\neg c]) \cdot \Pr\nolimits_{s_0}(\mathbf{F} \mid [\neg c]) + \Pr\nolimits_c(C \mid [c]) \cdot \Pr\nolimits_{s_1}(\mathbf{F} \mid [c]) \\ &= \Pr\nolimits_{\neg c}(C{=}0) \cdot \Pr\nolimits_{s_0}(\mathbf{F} \mid C{=}0) + \Pr\nolimits_c(C{=}1) \cdot \Pr\nolimits_{s_1}(\mathbf{F} \mid C{=}1) \end{split}$$

The learned PSDD thus contains a node $s_0$ with distribution $\Pr_{s_0}$ that directly represents Pr(**F**|C = 0) and a node $s_1$ with distribution $\Pr_{s_1}$ that represents Pr(**F**|C = 1). The PSDD hence encodes Pr(**F**|C) directly, because the two possible value assignments of C are C = 0 and C = 1.

The following examples illustrate why both the specific vtree and initial PSDD are required.

*Example 1.* Figure 2b shows a PSDD that encodes a fully factorized probability distribution normalized for the vtree in Fig. 2a. The PSDD shown in this example initializes the incremental learning procedure of LearnPSDD [20]. Note that the vtree does not connect the class variable C to all feature variables (e.g. F1). Therefore, when initializing the algorithm on this vtree-PSDD combination, there are no guarantees that the conditional relations between certain features and the class will be learned.

*Example 2.* Figure 2e shows a PSDD that explicitly conditions the feature variables on the class variables by normalizing for the vtree in Fig. 2c and by following the logical formula from Proposition 2. This biased PSDD is then used to initialize the D-LearnPSDD learner. Note that the vtree in Fig. 2c forces the prime of the root node to be the class variable C.

*Example 3.* Figure 2d shows, however, that setting the vtree in Fig. 2c alone is not sufficient for the learner to condition the features on the class. When initializing on a PSDD that encodes a fully factorized formula and then applying the Split and Clone operators, the relationship between the class variable and the features is not guaranteed to be learned. In this worst-case scenario, the learned model could perform even worse than the one from Example 1. By applying Eq. 1 on the top split, we can give an intuition of why this is the case:

$$\begin{aligned} \Pr\nolimits_q(C\mathbf{F}) &= \Pr\nolimits_{p_0}(C \mid [c \vee \neg c]) \cdot \Pr\nolimits_{s_0}(\mathbf{F} \mid [c \vee \neg c]) \\ &= (\Pr\nolimits_{p_1}(C \mid [c]) + \Pr\nolimits_{p_2}(C \mid [\neg c])) \cdot \Pr\nolimits_{s_0}(\mathbf{F} \mid [c \vee \neg c]) \\ &= (\Pr\nolimits_{p_1}(C{=}1) + \Pr\nolimits_{p_2}(C{=}0)) \cdot \Pr\nolimits_{s_0}(\mathbf{F}) \end{aligned}$$

The PSDD thus encodes a distribution that assumes that the class variable is independent from all feature variables. While this model might still have a high likelihood, its classification accuracy will be low.

We have so far introduced D-LearnPSDD for a binary classification task. However, it can easily be generalized to an n-valued classification scenario: (1) The class variable C is represented by multiple propositional variables $c_0, c_1,...,c_n$ that represent the assignments C = 0, C = 1,...,C = n, of which exactly one is true at all times. (2) The vtree in Proposition 1 now starts as a right-linear tree over $c_0,...,c_n$. The **F** variables are the sub of the node that has $c_n$ as prime. (3) The initial PSDD in Proposition 2 now has a root of the form $[root] = \bigvee_{i=0...n}\left(\left[c_i \wedge \bigwedge_{j=0...n,\, j \neq i} \neg c_j\right] \wedge [s_i]\right)$, which remains the same after applying Split and Clone. The root decision node now represents the distribution $\Pr_q(C\mathbf{F}) = \sum_{i=0...n} \Pr_{c_i \wedge \bigwedge_{j \neq i} \neg c_j}(C{=}i) \cdot \Pr_{s_i}(\mathbf{F} \mid C{=}i)$ and therefore has nodes at the top of the tree that directly represent the discriminative bias.

### **3.2 Generative Bias**

Learning the distribution over the feature variables is a generative learning process, and we can achieve it by applying the Split and Clone operators in the same way as the original LearnPSDD algorithm. In the previous section we had not yet defined how Pr(**F**|C) from Proposition 2 should be represented in the initial PSDD; we only explained how our constraint enforces it. The question is thus how exactly we define the nodes corresponding to $s_0$ and $s_1$ with distributions Pr(**F**|C = 0) and Pr(**F**|C = 1). We follow the intuition behind (TA)NB and start with a PSDD that encodes a distribution where all feature variables are independent given the class variable (see Fig. 2e). Next, the LearnPSDD algorithm incrementally learns the relations between the feature variables by applying the Split and Clone operations following the approach in [20].

### **3.3 Obtaining the Vtree**

In LearnPSDD, the decision nodes decompose the distribution into independent distributions. Thus, the vtree is learned from data by maximizing the approximate pairwise mutual information, as this metric quantifies the level of independence between two sets of variables. For D-LearnPSDD we are interested in the level of conditional independence between sets of feature variables given the class variable. We therefore obtain the vtree by optimizing for Conditional Mutual Information instead, replacing mutual information in the approach in [20] with:
$$\text{CMI}(\mathbf{X}, \mathbf{Y} \mid \mathbf{Z}) = \sum_{\mathbf{x}} \sum_{\mathbf{y}} \sum_{\mathbf{z}} \Pr(\mathbf{xyz}) \log \frac{\Pr(\mathbf{z}) \Pr(\mathbf{xyz})}{\Pr(\mathbf{xz}) \Pr(\mathbf{yz})}.$$
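As an illustrative sketch, CMI can be computed from an empirical joint probability table as follows; the 3-axis array layout is our choice, and the log base only rescales scores without affecting the vtree ranking:

```python
import numpy as np

def cmi(joint):
    """CMI(X, Y | Z) from a table joint[x, y, z] = Pr(x, y, z)."""
    pxz = joint.sum(axis=1, keepdims=True)       # Pr(x, z)
    pyz = joint.sum(axis=0, keepdims=True)       # Pr(y, z)
    pz = joint.sum(axis=(0, 1), keepdims=True)   # Pr(z)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = joint * np.log2(pz * joint / (pxz * pyz))
    return float(np.nansum(terms))               # 0 * log(0) treated as 0
```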

**Fig. 2.** Examples of vtrees and initial PSDDs.

## **4 Experiments**

We compare the performance of D-LearnPSDD, LearnPSDD, two generative Bayesian classifiers (NB and TANB) and a discriminative classifier (logistic regression). In particular, we discuss the following research questions: (1) Sect. 4.2 examines whether the introduced discriminative bias improves classification performance on PSDDs. (2) Sect. 4.3 analyzes the impact of the vtree and the imposed structural constraints on model tractability and performance. (3) Finally, Sect. 4.4 compares the robustness to missing values for all classification approaches.

**Table 1.** Datasets


### **4.1 Setup**

We ran our experiments on the suite of 15 standard machine learning benchmarks listed in Table 1. All of the datasets come from the UCI machine learning repository [8], with the exception of "Mofn" and "Corral" [18]. As pre-processing steps, we applied the discretization method described in [9] and binarized all variables using a one-hot encoding. Moreover, we removed instances with missing values as well as features whose value was always equal to 0. Table 1 summarizes the number of binary features |**F**|, the number of classes |C| and the available number of training samples |N| per dataset.

### **4.2 Evaluation of D-LearnPSDD**

Table 2 compares D-LearnPSDD, LearnPSDD, Naive Bayes (NB), Tree Augmented Naive Bayes (TANB) and logistic regression (LogReg)<sup>1</sup> in terms of accuracy via five-fold cross-validation<sup>2</sup>. For LearnPSDD, we incrementally learned a model on each fold until convergence on the validation-data log-likelihood, following the methodology in [20].

For D-LearnPSDD, we incrementally learned a model on each fold until the likelihood converged, and then selected the incremental model with the highest training-set accuracy. For NB and TANB, we learned a model per fold and compiled them to Arithmetic Circuits<sup>3</sup>, a more general form of PSDDs [6], which allows us to compare the size of these Bayes net classifiers and the PSDDs. Finally, we compare all probabilistic models with a discriminative classifier, a multinomial logistic regression model with a ridge estimator.

Table 2 shows that the proposed D-LearnPSDD clearly benefits from the introduced discriminative bias, outperforming LearnPSDD on all but two datasets, as the latter method is not guaranteed to learn significant relations between feature and class variables. Moreover, it outperforms Bayesian classifiers on most benchmarks, as the learned PSDDs are more expressive and can encode complex relationships among sets of variables, or local dependencies such as context-specific independence, while remaining tractable. Finally, note that D-LearnPSDD is competitive in terms of accuracy with logistic regression (LogReg), a purely discriminative classification approach.

### **4.3 Impact of the Vtree on Discriminative Performance**

The structure and size of the learned PSDD is largely determined by the vtree it is normalized for. Naturally, the vtree also has an important role in determining the quality (in terms of log-likelihood) of the probability distribution encoded by the learned PSDD [20]. In this section, we study the impact that the choice of vtree and learning strategy has on the trade-offs between model tractability, quality and discriminative performance.

<sup>1</sup> NB, TANB and LogReg are learned using Weka with default settings.

<sup>2</sup> In each fold, we hold 10% of the data for validation.

<sup>3</sup> Using the ACE tool, available at http://reasoning.cs.ucla.edu/ace/.


**Table 2.** Five-fold cross-validation accuracy and size in number of parameters

Figure 3a shows test-set log-likelihood and Fig. 3b classification accuracy as a function of model size (in number of parameters) for the "Chess" dataset. We display average log-likelihood and accuracy over logarithmically distributed ranges of model size. This figure contrasts the results of three learning approaches: D-LearnPSDD when the vtree learning stage optimizes mutual information (MI, shown in light blue); when it optimizes conditional mutual information (CMI, shown in dark blue); and the traditional LearnPSDD (in orange).

Figure 3a shows that likelihood improves at a faster rate during the first iterations of LearnPSDD, but eventually settles to the same values as D-LearnPSDD, because both optimize for log-likelihood. However, the discriminative bias guarantees that classification accuracy on the initial model will be at least as high as that of a Naive Bayes classifier (see Fig. 3b). Moreover, this results in consistently superior accuracy (for the CMI case) compared to the purely generative LearnPSDD approach, as also shown in Table 2. The dip in accuracy during the second and third intervals is a consequence of the generative learning, which optimizes for log-likelihood and can therefore initially yield feature-value correlations that decrease the model's performance as a classifier.

Finally, Fig. 3b demonstrates that optimizing the vtree for conditional mutual information results in an overall better size vs. accuracy trade-off than optimizing for mutual information. Such a conditional mutual information objective function is consistent with the conditional independence constraint we impose on the structure of the PSDD, and it allows the model to account for the special status of the class variable in the discriminative task.

**Fig. 3.** Log-likelihood and accuracy vs. model size trade-off of the incremental PSDD learning approaches. MI and CMI denote mutual information and conditional mutual information vtree learning, respectively. (Color figure online)

#### **4.4 Robustness to Missing Features**

The generative models in this paper encode a joint probability distribution over all variables and therefore tend to be more robust against missing features than discriminative models, which only learn relations relevant to their discriminative task. In this experiment, we assessed this robustness aspect by simulating the random failure of 10% of the original feature set per benchmark and per fold in five-fold cross-validation. Figure 4 shows the average accuracy over 10 such feature failure trials in each of the 5 folds (flat markers) in relation to their full feature set accuracy reported in Table 2 (shaped markers). As expected, the performance of the discriminative classifier (LogReg) suffers the most during feature failure, while D-LearnPSDD and LearnPSDD are notably more robust than any other approach, with accuracy losses of no more than 8%. Note from the flat markers that the performance of D-LearnPSDD under feature failure is the best in all datasets but one.

**Fig. 4.** Classification robustness per method.

## **5 Related Work**

A number of works have dealt with the conflict between generative and discriminative model learning, some dating back decades [14]. There are multiple techniques that support learning of the parameters [13,23] and structure [21,24] of probabilistic circuits. Typically, different approaches are followed to learn either generative or discriminative tasks, but some methods exploit discriminative models' properties to deal with missing variables [22]. Other works that also constrain the structure of PSDDs have been proposed before, such as Choi et al. [3]. However, they only perform parameter learning, not structure learning: their approach to improving accuracy is to learn separate structured PSDDs for each distribution of features given the class and feed them to an NB classifier. In [5], Correia and de Campos propose a constrained SPN architecture that shows both computational efficiency and classification performance improvements. However, it focuses on decision robustness rather than robustness against missing values, which is essential to the application range discussed in this paper. There are also a number of methods that focus specifically on the interaction between discriminative and generative learning. In [15], Khosravi et al. provide a method to compute, in a tractable manner, the expected predictions of a discriminative model with respect to a probability distribution defined by an arbitrary generative model. This combination makes it possible to handle missing values using discriminative counterparts of generative classifiers [16]. More distant to this work is the line of hybrid discriminative-generative models [19]; their focus is on semi-supervised learning and dealing with missing labels.

## **6 Conclusion**

This paper introduces a PSDD learning technique that improves classification performance by introducing a discriminative bias. Meanwhile, robustness against missing data is kept by exploiting generative learning. The method capitalizes on PSDDs' domain knowledge encoding capabilities to enforce the conditional relation between the class and the features. We prove that this constraint is guaranteed to be enforced throughout the learning process and we show how not encoding such a relation might lead to poor classification performance. Evaluation on a suite of benchmarking datasets shows that the proposed technique outperforms purely generative PSDDs in terms of classification accuracy and the other baseline classifiers in terms of robustness.

**Acknowledgements.** This work was supported by the EU-ERC Project Re-SENSE grant ERC-2016-STG-71503; NSF grants IIS-1943641, IIS-1633857, CCF-1837129; DARPA XAI grant N66001-17-2-4032; gifts from Intel and Facebook Research; and the "Onderzoeksprogramma Artificiële Intelligentie Vlaanderen" programme from the Flemish Government.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Widening for MDL-Based Retail Signature Discovery**

Clément Gautrais<sup>1(B)</sup>, Peggy Cellier<sup>2</sup>, Matthijs van Leeuwen<sup>3</sup>, and Alexandre Termier<sup>2</sup>

<sup>1</sup> Department of Computer Science, KU Leuven, Leuven, Belgium clement.gautrais@cs.kuleuven.be

<sup>2</sup> Univ Rennes, Inria, INSA, CNRS, IRISA, Rennes, France <sup>3</sup> LIACS, Leiden University, Leiden, The Netherlands

**Abstract.** *Signature patterns* have been introduced to model repetitive behavior, e.g., of customers repeatedly buying the same set of products in consecutive time periods. A disadvantage of existing approaches to signature discovery, however, is that the required number of occurrences of a signature needs to be manually chosen. To address this limitation, we formalize the problem of selecting the best signature using the minimum description length (MDL) principle. To this end, we propose an encoding for signature models and for any data stream given such a signature model. As finding the MDL-optimal solution is infeasible, we propose a novel algorithm that is an instance of *widening*, i.e., a diversified beam search that heuristically explores promising parts of the search space. Finally, we demonstrate the effectiveness of the problem formalization and the algorithm on a real-world retail dataset, and show that our approach yields relevant signatures.

**Keywords:** Signature discovery · Minimum description length · Widening

## **1 Introduction**

When analyzing (human) activity logs, it is especially important to discover recurrent behavior. Recurrent behavior can indicate, for example, personal preferences or habits, and can be useful in contexts such as personalized marketing. Some types of behavior are elusive to traditional data mining methods: for example, behavior that has some temporal regularity but not strong enough to be periodic, and which does not form simple itemsets or sequences in the log. A prime example is the set of products that is essential to a retail customer: all of these products are bought regularly, but often not periodically due to different

C. Gautrais—This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No [694980] SYNTH: Synthesising Inductive Data Models).

depletion rates, and they are typically bought over several transactions—in any arbitrary order—rather than all at the same time.

To model and detect such behavior, we have proposed *signature patterns* [3]: patterns that identify irregular recurrences in an event sequence by segmenting the sequence (see Fig. 1). We have shown the relevance of signature patterns in the retail context, and demonstrated that they are general enough to be used in other domains, such as political speeches [2]. As a disadvantage, however, signature patterns require the analyst to provide the number of recurrences, i.e., the number of segments in the segmentation. This number of segments influences the signature: fewer segments give a more detailed signature, while more segments result in a simpler signature. Although in some cases domain experts may have some intuition on how to choose the number of segments, it is often difficult to decide on a good trade-off between the number of segments and the complexity of the signature. The main problem that we study in this paper is therefore how to automatically set this parameter in a principled way, based on the data.

Our first main contribution is a problem formalization that defines the best signature for a given dataset, so that the analyst no longer needs to choose the number of segments. By considering the signature corresponding to each possible number of segments as a model, we can naturally formulate the problem of selecting the best signature as a model selection problem. We formalize this problem using the minimum description length (MDL) principle [4], which, informally, states that the best model is the one that compresses the data best. The MDL principle perfectly fits our purposes because (1) it allows selecting the simplest model that adequately explains the data, and (2) it has previously been shown to be very effective for the selection of pattern-based models (e.g., [7,11]).

After defining the problem using the MDL principle, the remaining question is how to solve it. As the search space of signatures is extremely large and the MDL-based problem formulation does not offer any properties that could be used to substantially prune the search space, we resort to heuristic search. Here as well, the properties of signature patterns lead to technical challenges. In particular, we empirically show that a naïve beam search often gets stuck in suboptimal solutions. Our second main contribution is therefore a diverse beam search algorithm, i.e., an instance of *widening* [9], that ensures that a diverse set of candidate solutions is maintained on each level of the beam search. For this, we define a distance measure for signatures based on their segmentations.

## **2 Preliminaries**

**Fig. 1.** A sequence of transactions and a 4-segmentation. We have the signature items R = {a, b}, the remaining items E = {c, d, e}, the set of items I = {a, b, c, d, e}, and the segmentation $S = \langle [T_1, T_2, T_3], [T_4, T_5], [T_6], [T_7] \rangle$.

*Signatures.* Let us first recall the definition of a *signature* as presented in [3]. Let I be the set of all items, and let $\alpha = \langle T_1 \ldots T_n \rangle$, $T_i \subseteq I$, be a sequence of itemsets. A k*-segmentation* of α, denoted $S(\alpha, k) = \langle S_1 \ldots S_k \rangle$, is a sequence of k non-overlapping consecutive sub-sequences of α, denoted $S_i$ and called *segments*, each consisting of consecutive transactions. An example of a 4-segmentation is given in Fig. 1. Given a k-segmentation $S(\alpha, k) = \langle S_1 \ldots S_k \rangle$, we define $Rec(S(\alpha, k)) = \bigcap_{S_i \in S(\alpha,k)} \left( \bigcup_{T_j \in S_i} T_j \right)$: the set of all recurrent items that are present in each segment of S(α, k). For example, in Fig. 1, the segmentation $S(\alpha, 4) = \langle S_1, S_2, S_3, S_4 \rangle$ gives Rec(S(α, 4)) = {a, b}. Given k and α, one can compute $S_{max}(\alpha, k)$, the set of k-segmentations of α yielding the largest sets of recurrent items: $S_{max}(\alpha, k) = \operatorname{argmax}_{S(\alpha,k)} |Rec(S(\alpha, k))|$. For example, in Fig. 1, $\langle S_1, S_2, S_3, S_4 \rangle$ is the only 4-segmentation yielding two recurrent items. As all other 4-segmentations yield either zero or one recurrent item, $S_{max}(\alpha, 4) = \{\langle S_1, S_2, S_3, S_4 \rangle\}$. A k-signature (also simply named signature when k is clear from context) is then defined as a maximal set of recurrent items in a k-segmentation S, with $S \in S_{max}(\alpha, k)$. As $S_{max}(\alpha, k)$ can contain several segmentations, we define the k-signature set $Sig(\alpha, k)$, which contains all k-signatures: $Sig(\alpha, k) = \{Rec(S_m(\alpha, k)) \mid S_m \in S_{max}(\alpha, k)\}$. Here k gives the number of recurrences of the recurrent items in the sequence α. Given a number of recurrences k, finding a k*-signature* relies on finding a k-segmentation that maximizes the size of the itemset occurring in each segment of that segmentation. For example, in Fig. 1, given the segmentation $S = \langle S_1, S_2, S_3, S_4 \rangle$ and given that $S_{max}(\alpha, 4) = \{S\}$, we have $Sig(\alpha, 4) = \{Rec(S)\} = \{\{a, b\}\}$. For simplicity, the segmentation associated with a k-signature in $Sig(\alpha, k)$ is denoted $S = \langle S_1 \ldots S_k \rangle$, and the signature items are denoted $R \subseteq I$. The remaining items are denoted E, i.e., $E = I \setminus R$.
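For illustration, these definitions translate almost literally into code. The brute-force search below enumerates all boundary placements, so it is exponential and only meant to make Rec and S_max concrete; the toy transactions are invented, not those of Fig. 1:

```python
from itertools import combinations

def rec(segmentation):
    """Recurrent items: items present in every segment (union inside, intersection across)."""
    return set.intersection(*(set.union(*seg) for seg in segmentation))

def best_k_segmentations(alpha, k):
    """Brute-force S_max(alpha, k): all k-segmentations with the most recurrent items."""
    n, best, best_size = len(alpha), [], 0
    for cuts in combinations(range(1, n), k - 1):   # boundary positions
        bounds = (0, *cuts, n)
        segs = [alpha[bounds[i]:bounds[i + 1]] for i in range(k)]
        size = len(rec(segs))
        if size > best_size:
            best, best_size = [segs], size
        elif size == best_size:
            best.append(segs)
    return best

# Toy sequence (invented): the best 4-segmentations recover the signature {a, b}
alpha = [{'a','b','c'}, {'d'}, {'a','b'}, {'a','b'}, {'e'}, {'a','b'}, {'a','b','c'}]
print({frozenset(rec(s)) for s in best_k_segmentations(alpha, 4)})
```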

*Minimum Description Length (MDL).* Let us now briefly introduce the basic notions of the minimum description length (MDL) principle [4] as it is commonly used in compression-based pattern mining [7]. Given a set of models M and a dataset D, the best model M ∈ M is the one that minimizes L(D, M) = L(M) + L(D|M), with L(M) the length, in bits, of the encoding of M, and L(D|M) the length, in bits, of the encoding of the data given M. This is called *two-part MDL* because it separately encodes the model and the data given the model, which results in a natural trade-off between model complexity and data complexity. To fairly compare all models, the encoding has to be *lossless*. To use the MDL principle for model selection, the model class M has to be defined (in our case, the set of all signatures), as well as how to compute the length of the model and the length of the data given the model. It should be noted that only the *encoded length* of the data is of interest, not the encoded data itself.

## **3 Problem Definition**

To extract recurrent items from a sequence using signatures, one must define the number of segments k. Providing meaningful values for k usually requires expert knowledge and/or many trials, as there is no general rule for setting k automatically. Our problem is therefore to devise a method that adjusts k depending on the data at hand. As this is a typical model selection problem, our approach relies on the minimum description length (MDL) principle to find the best model from a set of candidate models. However, the signature model must be refined into a probabilistic model in order to use the MDL principle for model selection. In particular, the occurrences of items in α should be defined according to a probability distribution. With no information about these occurrences, the uniform distribution is the most natural choice. Indeed, without information on the transactions in which an item occurs, it is best to assume it can occur uniformly at random in any transaction of the sequence α. Moreover, the choice of the uniform distribution has been shown to minimize the worst-case description length [4].

To make the signature model probabilistic, we assume that it generates three different types of occurrences independently and uniformly. As the signature gives the information that there is at least one occurrence of every signature item in every segment, the first type of occurrence corresponds to this one occurrence of each signature item in every segment. These occurrences are generated uniformly over the transactions of each segment. The second type consists of the remaining occurrences of the signature items. Here, the information is that these items already have occurrences generated as the first type. As α is a sequence of itemsets, an item can occur at most once in a transaction. Hence, for a given signature item, the occurrences of the second type are distributed uniformly over the transactions where this item does not already occur as a first-type occurrence. Finally, the third type comprises the occurrences of the remaining items: the items that are not part of the signature. There is no information about these occurrences, hence we assume them to be generated uniformly over all transactions of α.

With these three types of occurrences, the signature model is probabilistic: all occurrences in α are generated according to a probability distribution that takes into account the information provided by the signature specification. Hence, we can now define the problem we are tackling:

*Problem 1.* Let $\mathbb{S}$ denote the set of signatures for all values of k, $\mathbb{S} = \bigcup_{k=1}^{|\alpha|} Sig(\alpha, k)$. Given a sequence α, it follows from the MDL principle that the best signature $S \in \mathbb{S}$ is the one that minimizes the two-part encoded length of S and α, i.e.,

$$S\_{MDL} = \operatorname{argmin}\_{S \in \mathbb{S}} L(\alpha, S),$$

where L(α, S) is the two-part encoded length that we present in the next section.

### **4 An Encoding for Signatures**

As typically done in compression-based pattern mining [7], we use a two-part MDL code that leads to decomposing the total encoded length L(α, S) into two parts: L(S) and L(α|S), with the relation L(α, S) = L(S) + L(α|S). In the upcoming subsection we define L(S), i.e., the encoded length of a signature, after which Subsect. 4.2 introduces L(α|S), i.e., the length of the sequence α given a signature S. In the remainder of this paper, all logarithms are in base 2.

### **4.1 Model Encoding:** *L***(***S***)**

A signature is composed of two parts: (1) the signature items, and (2) the signature segmentation. The two parts are detailed below.

**Signature Items Encoding.** The encoding of the signature items consists of three parts. The signature items are a subset of $\mathcal{I}$, hence we first encode the number of items in $\mathcal{I}$. A common way to encode non-negative integer numbers is to use the universal code for integers [4,8], denoted $L_{\mathbb{N}}$<sup>1</sup>. This yields a code of size $L_{\mathbb{N}}(|\mathcal{I}|)$. Next, we encode the number of items in the signature, using again the universal code for integers, with length $L_{\mathbb{N}}(|\mathcal{R}|)$. Finally, we encode the items of the signature. As the order of signature items is irrelevant, we can use an $|\mathcal{R}|$-combination of $|\mathcal{I}|$ elements without replacement. This yields a length of $\log\binom{|\mathcal{I}|}{|\mathcal{R}|}$. From $\mathcal{R}$ and $\mathcal{I}$, we can deduce $\mathcal{E}$.

**Segmentation Encoding.** We now present the encoding of the second part of the signature: the signature segmentation. To encode the segmentation, we encode the segment boundaries. These boundaries are indexed on the size of the sequence, hence we first need to encode the number of transactions n. This can again be done using the universal code for integers, at a cost of $L_{\mathbb{N}}(n)$. Then, we need to encode the number of segments $|S|$, which costs $L_{\mathbb{N}}(|S|)$. To encode the segments, we only have to encode the boundaries between two consecutive segments. As there are $|S| - 1$ such boundaries, a naive encoding would cost $(|S| - 1) \cdot \log(n)$ bits. An improved encoding takes the previously encoded segments into account. For example, when encoding the second boundary, we know that its value cannot be higher than $n - |S_1|$. Hence, we can encode it in $\log(n - |S_1|)$ instead of $\log(n)$ bits. This principle can be applied to encode all boundaries. The encoded length can be further reduced by using the fact that each signature segment contains at least one transaction: we can subtract the number of remaining segments when encoding the boundary of the current segment. This yields an encoded length of $\sum_{i=1}^{|S|-1} \log\big(n - (|S| - i) - \sum_{j=1}^{i-1} |S_j|\big)$.

*Putting Everything Together.* The total encoded length of a signature S is

$$\begin{aligned} L(S) &= L_{\mathbb{N}}(|\mathcal{I}|) + L_{\mathbb{N}}(|\mathcal{R}|) + \log\binom{|\mathcal{I}|}{|\mathcal{R}|} + \\ & \quad L_{\mathbb{N}}(n) + L_{\mathbb{N}}(|S|) + \sum_{i=1}^{|S|-1} \log\Big(n - (|S| - i) - \sum_{j=1}^{i-1} |S_j|\Big). \end{aligned}$$

<sup>1</sup> $L_{\mathbb{N}}(n) = \log^*(n) + \log(2.865064)$, with $\log^*(n) = \log(n) + \log(\log(n)) + \ldots$
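
To make the encoding concrete, here is a minimal Python sketch of $L_{\mathbb{N}}$ and $L(S)$; the function names and the representation of a segmentation as a list of segment sizes are our own illustrative choices, not the authors' implementation.

```python
import math

def l_universal(n):
    """Universal code length L_N(n) in bits (see footnote 1):
    log*(n) + log2(2.865064), summing only the positive terms of log*."""
    if n < 1:
        return 0.0  # assumption: a zero count costs nothing to encode
    bits = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

def model_length(num_items, num_sig_items, seg_sizes):
    """L(S): encoded length in bits of a signature model, following the
    formula above; seg_sizes lists the transaction count of each segment."""
    n, k = sum(seg_sizes), len(seg_sizes)
    bits = l_universal(num_items) + l_universal(num_sig_items)
    bits += math.log2(math.comb(num_items, num_sig_items))  # item subset
    bits += l_universal(n) + l_universal(k)
    prefix = 0  # total size of the segments already encoded
    for i in range(1, k):  # k - 1 boundaries
        bits += math.log2(n - (k - i) - prefix)
        prefix += seg_sizes[i - 1]
    return bits
```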

**Fig. 2.** A sequence of transactions and its encoding scheme. We have R = {a, b}, E = {c, d, e} and I = {a, b, c, d, e}. The first occurrence of each signature item in each segment is encoded in the red stream, the remaining signature items occurrences in the orange stream, and the items from E in the blue stream. (Color figure online)

## **4.2 Data Encoding:** *L***(***α|S***)**

We now present the encoding of the sequence given the model: L(α|S). This encoding relies on the refinement of the signature model into a probabilistic model presented in Sect. 3. To summarize, we have three separate encoding streams that encode the three different types of occurrences presented in Sect. 3: (1) one that encodes one occurrence of every signature item in every segment, (2) one that encodes the rest of the signature items occurrences, and (3) one that encodes the remaining items occurrences. An example illustrating the three different encoding streams is presented in Fig. 2.

**Encoding One Occurrence of Each Signature Item in Each Segment.** As stated in Sect. 3, the signature says that each segment contains at least one occurrence of each signature item. The size of each segment is known (from the encoding of the model, Subsect. 4.1), hence we encode one occurrence of each signature item in segment $S_i$ by encoding the index, within $S_i$, of the transaction that contains this occurrence. From Sect. 3, this occurrence is uniformly distributed over the transactions in $S_i$. As encoding an index over $|S_i|$ equiprobable possibilities costs $\log(|S_i|)$ bits, and as $|\mathcal{R}|$ occurrences are encoded this way in each segment, we encode each segment in $|\mathcal{R}| \cdot \log(|S_i|)$ bits.

**Encoding the Remaining Signature Items' Occurrences.** As presented in Fig. 2, we now encode the remaining signature item occurrences to guarantee a lossless encoding. Again, this encoding relies on encoding the transactions where signature items occur. For each item $a$, we encode its occurrences $occ(a) = \sum_{T_i \in \alpha} \sum_{p \in T_i} \mathbf{1}_{a=p}$ by encoding the transactions to which they belong. As $|S|$ occurrences have already been encoded using the previous stream, there remain $occ(a) - |S|$ occurrences to encode. These occurrences can be in any of the $n - |S|$ remaining transactions. From Sect. 3, we use a uniform distribution to encode them. More precisely, the first occurrence of item $a$ can belong to any of the $n - |S|$ transactions where $a$ does not already occur. For the second occurrence of $a$, there are only $n - |S| - 1$ transactions left where $a$ can occur. Applying this principle, we encode all the remaining occurrences of $a$ in $\sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i)$ bits. For each item, we also use $L_{\mathbb{N}}(occ(a) - |S|)$ bits to encode the number of occurrences. This yields a total length of $\sum_{a \in \mathcal{R}} \big( L_{\mathbb{N}}(occ(a) - |S|) + \sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i) \big)$.

**Remaining Items Occurrences Encoding.** Finally, we encode the occurrences of the remaining items, i.e., the items in $\mathcal{E}$. The encoding technique is identical to the one used for the additional signature item occurrences, except that remaining item occurrences can initially be present in any of the n transactions. This yields a total code length of $\sum_{a \in \mathcal{E}} \big( L_{\mathbb{N}}(occ(a)) + \sum_{i=0}^{occ(a)-1} \log(n - i) \big)$.

*Putting Everything Together.* The total encoded length of the data given the model is given by:

$$L(\alpha|S) = \sum_{S_i \in S} |\mathcal{R}| \cdot \log(|S_i|) + \sum_{a \in \mathcal{R}} \Big( L_{\mathbb{N}}(occ(a) - |S|) + \sum_{i=0}^{occ(a)-|S|-1} \log(n - |S| - i) \Big) + \sum_{a \in \mathcal{E}} \Big( L_{\mathbb{N}}(occ(a)) + \sum_{i=0}^{occ(a)-1} \log(n - i) \Big).$$
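
Continuing the sketch above, $L(\alpha|S)$ can be computed from the segment sizes and the per-item occurrence counts; `occ_sig` and `occ_other` are assumed to be dictionaries mapping each item of $\mathcal{R}$ and $\mathcal{E}$, respectively, to its count $occ(a)$.

```python
def data_length(seg_sizes, occ_sig, occ_other):
    """L(alpha|S): encoded length in bits of the sequence given the model."""
    n, k = sum(seg_sizes), len(seg_sizes)
    # stream 1: one occurrence of each signature item per segment
    bits = sum(len(occ_sig) * math.log2(s) for s in seg_sizes)
    # stream 2: remaining occurrences of the signature items
    for occ in occ_sig.values():
        bits += l_universal(occ - k)
        bits += sum(math.log2(n - k - i) for i in range(occ - k))
    # stream 3: occurrences of the non-signature items
    for occ in occ_other.values():
        bits += l_universal(occ)
        bits += sum(math.log2(n - i) for i in range(occ))
    return bits
```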

## **5 Algorithms**

The previous section presented how a sequence is encoded, completing our problem formalization. The remaining problem is to find the signature minimizing the code length, that is, finding $S_{MDL} = \operatorname{argmin}_{S \in \mathbb{S}} L(\alpha, S)$.

**Naive Algorithm.** A naive approach would be to directly mine the whole set of signatures $\mathbb{S}$ and select the signature that minimizes the code length. However, mining a signature with k segments has time complexity $O(n^2 k)$. Mining the whole set of signatures requires k to vary from 1 to n, resulting in a total complexity of $O(n^4)$. This quartic complexity prevents quickly mining the complete set of possible signatures on large datasets, hence we have to rely on heuristic approaches.

To quickly search for the signature in S that minimizes the code length, we initially rely on a top-down greedy algorithm. We start with one segment containing the whole sequence, and then search for the segment boundary that minimizes the encoded length. Then, we recursively search for a new single segment boundary that minimizes the encoded length. We stop when no segment can be added, i.e., when the number of segments is equal to the number of transactions. During this process, we record the signature with the best encoded length. However, this algorithm can perform early segment splits that seem promising initially, but that eventually impair the search for the best signature.
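
The following minimal sketch illustrates this greedy search; it assumes a `total_length` callable returning $L(\alpha, S)$ for a sorted tuple of boundary positions (both names are ours).

```python
def greedy_search(n, total_length):
    """Top-down greedy split search; boundaries are transaction indices
    where a new segment starts."""
    bounds = ()
    best_len, best_bounds = total_length(bounds), bounds
    while len(bounds) < n - 1:  # stop at one segment per transaction
        # try every possible extra boundary, commit the best single split
        bounds = min((tuple(sorted(bounds + (b,)))
                      for b in range(1, n) if b not in bounds),
                     key=total_length)
        cur_len = total_length(bounds)
        if cur_len < best_len:  # record the best signature seen so far
            best_len, best_bounds = cur_len, bounds
    return best_bounds, best_len
```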

#### **5.1 Widening for Signatures**

To solve this issue, a solution is to keep the w signatures with the lowest code length at each step, instead of keeping only the best one. This technique is called *beam search* and has been used to tackle optimization problems in pattern mining [6]. The beam width w is the number of solutions kept at each step of the algorithm. However, beam search suffers from the fact that the best w signatures tend to be similar, corresponding to slight variations of one signature. Here, this means that most signatures in the beam would have segmentations that are very similar. The widening technique [9] solves this issue by adding a diversity constraint into the beam. Different constraints exist [5,6,9], but a common solution is to add a distance constraint between each pair of elements in the beam: all pairwise distances between the signatures in the beam have to be larger than a given threshold θ. As this threshold depends on the data and the beam width, we propose a method to automatically set its value.

Algorithm 1 presents the proposed widening algorithm. Line 3 iterates over the number of segments. Line 4 computes all signatures having k segments that are considered to enter the beam. More specifically, the function *Split1Segment* computes the direct refinements of all signatures in BestKSign. A direct refinement of a signature corresponds to splitting one segment of the segmentation associated with that signature. Line 5 selects the refinement having the smallest code length; if several refinements yield the smallest code length, one of them is chosen at random. Lines 8 to 11 perform the widening step by adding new signatures to the beam while respecting the pairwise distance constraint. Line 8 computes the distance threshold (θ) depending on the diversity parameter (β), the beam width (w), and the current refinements. Algorithm 2 presents the details of the threshold computation. With this threshold, we recursively add new elements to the beam, until either the beam is full or no new element can be added (line 9). Lines 10 and 11 add the signature having the smallest code length among those at distance at least θ from every current element of the beam. Line 12 returns the best overall signature encountered.

**Distance Between Signatures.** We now define the distance measure for signatures (used in line 10 of Algorithm 1). As the purpose of the signature distance is to ensure diversity in the beam, we use the segmentation to define the distance between two elements of the beam, i.e., between two signatures. Terzi et al. [10] presented several distance measures for segmentations. The *disagreement distance* is particularly appealing for our purposes, as it compares how transactions belonging to the same segment in one segmentation are allocated in the other segmentation. Let $S^a = \langle S_{a_1} \ldots S_{a_k} \rangle$ and $S^b = \langle S_{b_1} \ldots S_{b_k} \rangle$ be two k-segmentations of a sequence α. We denote by $d(S^a, S^b)$ the disagreement distance between segmentation a and segmentation b. The disagreement distance corresponds to the number of transaction pairs that belong to the same segment in one segmentation, but not in the same segment in the other segmentation. Techniques to efficiently compute this distance are presented in [10].
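
For illustration, here is a direct quadratic implementation of the disagreement distance ([10] describes faster techniques); representing each segmentation as an array assigning a segment id to every transaction is our own simplification.

```python
from itertools import combinations

def disagreement(seg_a, seg_b):
    """Count transaction pairs grouped together in one segmentation
    but separated in the other."""
    return sum((seg_a[i] == seg_a[j]) != (seg_b[i] == seg_b[j])
               for i, j in combinations(range(len(seg_a)), 2))
```

For example, `disagreement([0, 0, 1, 1], [0, 1, 1, 1])` returns 3: the pairs (0, 1), (1, 2) and (1, 3) are grouped in one segmentation but split in the other.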

**Defining a Distance Threshold.** Algorithm 1 uses a distance threshold θ between two signatures that controls the diversity constraint in the beam. If θ is equal to 0, there is no diversity constraint, as any distance between two different signatures is greater than 0. Higher values of θ enforce more diversity in the beam: good signatures are not included in the beam if they are too close to signatures already in it. However, setting the θ threshold is not easy. For example, θ depends on the beam width w: with large beam widths, θ should be low enough to allow many good signatures to enter the beam.

To this end, we introduce a method that automatically sets the θ parameter, depending on the beam width and on a new parameter β that is easier to interpret. The β parameter ranges from 0 to 1 and controls the strength of the diversity constraint. The intuition behind β is that its value will approximately correspond to the relative rank of the worst signature in the beam. For example, if β is set to 0.2, it means that signatures in the beam are in the top-20% in ascending order of code length. Algorithm 2 details how θ is derived from β and w; this algorithm is called by the threshold function in line 8 of Algorithm 1.

Knowing the set of all candidate signatures that are considered to enter the beam, we retain only the proportion β of the best signatures (line 3 of Algorithm 2). Then, in line 4 we extract the best signature. Finally, we look for the distance threshold θ such that the number of signatures within a distance of θ from the best signature is equal to the number of considered signatures divided by the beam width w (line 5). The rationale behind this threshold is that since we are adding w signatures to the beam and we want to use the proportion β of the best signatures, the distance threshold should approximately discard 1/w of the proportion β of the best signatures around each signature of the beam.
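
A sketch of this threshold computation (Algorithm 2), under the assumption that `code_length` and `dist` implement the encoded length and the disagreement distance:

```python
def diversity_threshold(candidates, beta, w, code_length, dist):
    """Derive theta from the diversity parameter beta and beam width w."""
    ranked = sorted(candidates, key=code_length)
    kept = ranked[:max(1, round(beta * len(ranked)))]  # best beta-fraction
    best = kept[0]
    dists = sorted(dist(best, s) for s in kept)
    # theta such that roughly |kept| / w signatures lie within it
    return dists[min(len(dists) - 1, len(kept) // w)]
```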

## **6 Experiments**

This section analyzes the runtimes and code lengths of variants of our algorithm on a real retail dataset<sup>2</sup>. We show that our method runs significantly faster than the naive baseline, and give advice on how to choose the w and β parameters. Next, we illustrate the usefulness of the encoding for analyzing retail customers.

**Fig. 3. Left**: Mean relative code length for different instances of the widening algorithm. For each customer, the relative code length is computed with regard to the smallest code length found for this customer. Averaging these lengths across all customers gives the mean relative code length. The β parameter sets the diversity constraint and w the beam width. The solid black line shows the mean code length of the naive algorithm. Bootstrapped 95% confidence intervals [1] are displayed. **Right**: Mean runtime in seconds for different instances of the widening algorithm. The dotted black lines show a bootstrapped 95% confidence interval of the naive algorithm's mean runtime.

### **6.1 Algorithm Runtime and Code Length Analysis**

Here we analyze the runtimes and code lengths obtained by variants of Algorithm 1. We randomly selected 3000 customers having more than 40 baskets in the Instacart 2017 dataset<sup>3</sup>. Customers having few purchases are less relevant, as we are looking for purchase regularities. These 3000 customers are analyzed individually, hence the algorithm is evaluated on 3000 different sequences.

<sup>2</sup> Code is available at https://bitbucket.org/clement_gautrais/mdl_signature_ida_2020/.

<sup>3</sup> The Instacart Online Grocery Shopping Dataset 2017, accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 05/04/2018.

**Code Length Analysis.** To assess the performance of the different algorithms, we analyze the code length produced by each algorithm on each of these 3000 customers. We evaluate instances of the widening algorithm with different beam widths w and diversity constraints β. The resulting mean relative code lengths per algorithm instance are presented in Fig. 3 (left). For a fixed β value, increasing the beam width always decreases the code length. This is expected, as increasing the beam size allows the widening algorithm to explore more solutions. As increasing the beam size improves the search, we recommend setting it as high as the computational budget allows.

Increasing the β parameter usually leads to better code lengths. However, for w = 5, higher β values give slightly worse results: if β is too high, good signatures might not be included in the beam when they are too close to existing solutions. Therefore, we recommend setting β to a moderate value, for example between 0.3 and 0.5. A strong point of our method is that it is not overly sensitive to the β value, so setting this parameter to its optimal value is not critical. The enforced diversity is highly relevant, as a fixed beam size with some diversity finds code lengths similar to those found by a larger beam with no diversity. For example, with w = 5 and β = 0.3, the code lengths are better than with w = 10 and β = 0. As using a beam of width 5 with β = 0.3 is also faster than a beam of width 10 with β = 0, diversity is an effective way to decrease runtime while yielding smaller code lengths.

**Runtime Analysis.** We now present the runtimes of the different widening instances in Fig. 3 (right). The beam width mostly determines the runtime, whereas the β value has a smaller influence. Overall, increasing β slightly increases computation time while yielding a noticeable improvement in the resulting code length, especially for small beam sizes. Our method also runs 5 to 10 times faster than the naive method. In this experiment, customers have a limited number of baskets (at most 100), so the $O(n^4)$ complexity of the naive approach still exhibits reasonable runtimes. However, in settings with more transactions (retail data over a longer period, for example), the naive approach will require hours to run, and the performance gain of our widening approach becomes a necessity. Another important observation is that the naive method has a high variability in runtimes: confidence intervals are narrow for the widening algorithm (barely noticeable on the plot), whereas the naive algorithm's interval spans over 5 s.

### **6.2 Qualitative Analysis**

Figure 4 presents two signatures of a customer, to illustrate that signatures are of practical use to analyze retail customers, and that finding signatures with smaller code lengths is of interest. We use the widening algorithm to get a variety of good signatures according to our MDL encoding. The top signature in Fig. 4 is the best signature found: it has the smallest code length. This signature seems to correctly capture the regular behavior of this customer, as it contains 7 products that are regularly bought throughout the whole purchase sequence.

**Fig. 4.** Example of two signatures found by our algorithms. Gray vertical lines are segments boundaries and each dot represents an item occurrence in a purchase sequence. **Top**: best signature (code length of 5221.33 bits) found by the widening algorithm, with w = 20 and β = 0.5. **Bottom**: signature found by the beam search algorithm: w = 1 and β = 0, with a code length of 5338.46 bits (the worst code length).

Knowing these 7 favorite products, a retailer could target its offers. The segments also give some information regarding the temporal behavior of this customer. For example, because segments tend to be smaller and more frequent towards the end of the sequence, one could guess that this customer is becoming a regular.

On the other hand, the bottom signature is significantly worse than the top one. It is clear that it mostly contains products that are bought only at the end of the purchase sequence of this customer. This phenomenon occurs because the beam search algorithm, with w = 1, only picks the best solution at each step of the algorithm. Hence, it can quickly get stuck in a local minimum. This example shows that considering larger beams and adding diversity is an effective approach to optimize code length. Indeed, having a large and diverse beam is necessary to have the algorithm explore different segmentations, yielding better signatures.

## **7 Conclusions**

We tackled the problem of automatically finding the best number of segments for signature patterns. To this end, we defined a model selection problem for signatures based on the minimum description length principle. Then, we introduced a novel algorithm that is an instance of widening. We evaluated the relevance and effectiveness of both the problem formalization and the algorithm on a retail dataset. We have shown that the widening-based algorithm outperforms the beam search approach as well as a naive baseline. Finally, we illustrated the practical usefulness of the signature on a retail use case. As part of future work, we would like to study our optimization techniques on larger databases (thousands of transactions), like online news feeds. We would also like to work on model selection for *sets* of interesting signatures, to highlight diverse recurrences.

## **References**



# **Addressing the Resolution Limit and the Field of View Limit in Community Mining**

Shiva Zamani Gharaghooshi<sup>1</sup>, Osmar R. Zaïane1(B), Christine Largeron<sup>2</sup>, Mohammadmahdi Zafarmand<sup>1</sup>, and Chang Liu<sup>1</sup>

<sup>1</sup> Alberta Machine Intelligence Institute, University of Alberta, Edmonton, AB, Canada {zamanigh,zaiane,zafarman,chang6}@ualberta.ca <sup>2</sup> Laboratoire Hubert Curien, Université de Lyon, Saint-Etienne, France Christine.Largeron@univ-st-etienne.fr

**Abstract.** We introduce a novel efficient approach for community detection based on a formal definition of the notion of community. We name the links that run between communities weak links and the links inside communities strong links. We put forward a new objective function, called SIWO (Strong Inside, Weak Outside), which encourages adding strong links to the communities while avoiding weak links. This process allows us to effectively discover communities in social networks without the resolution and field of view limit problems some popular approaches suffer from. The time complexity of this new method is linear in the number of edges. We demonstrate the effectiveness of our approach on various real and artificial datasets with large and small communities.

**Keywords:** Community detection · Social network analysis

## **1 Introduction**

Community detection is an important task in social network analysis and can be used in different domains where entities and their relations are presented as graphs. It allows us to find groups of linked nodes, which we call communities, inside graphs. There are community detection methods that partition the graph into subgroups of nodes, such as the spectral bisection method [4] or the Kernighan-Lin algorithm [27]. There are also hierarchical methods, such as the divisive algorithms based on edge betweenness of Girvan et al. [18], or agglomerative algorithms based on dynamical processes such as Walktrap [20], Infomap [24] or Label propagation [22]. We do not detail them and refer the interested reader to [7,10,12], but we come back to another class of hierarchical algorithms that aim at maximizing the Q-modularity introduced by Newman et al. [18]. After the greedy agglomerative algorithm initially introduced by Newman [19], Blondel et al. [5] proposed Louvain, one of the fastest algorithms to optimize Q-modularity and to solve the community detection task. However, Fortunato et al. [11] showed that Q-modularity suffers from the resolution limit, which means that by optimizing


Q-modularity, communities that are smaller than a certain scale cannot be resolved. The field of view limit [25], in contrast to the resolution limit, leads to over-partitioning communities with a large diameter.

To overcome the resolution limit of Q-modularity, several proposals have been made, notably by [2,17,23], who introduced variants of this criterion allowing the detection of community structures at different levels of granularity. However, these revised criteria make the method time-consuming, since they require tuning a parameter. Therefore, we retain the greedy approach of Louvain for its efficiency and ability to handle very large networks, but we introduce SIWO, which relies on the notions of strong and weak links defined in Sect. 2.

We consider that a community corresponds to a subgraph sparsely connected to the rest of the graph. Contrary to the majority of methods, which do not formally define what a community is and simply consider that it corresponds to a subset of nodes densely connected internally, we define in Sect. 2 the conditions a subgraph should meet to be considered a community. In Sect. 3, we present the generic community detection algorithm. This general process can be applied regardless of the objective function, to improve other community detection methods, as our experiments show.

Finally, the extensive experiments described in Sects. 4 and 5, confirm that our objective function is less sensitive to the resolution and the field of view limit compared to the objective functions mentioned earlier. Also, our algorithm has consistently good performance regardless of the size of communities in a network and is efficient on large size networks having up to a million edges.

## **2 Notations and Definitions**

### **2.1 Strong and Weak Links**

A community is oftentimes defined as a subgraph in which nodes are densely connected while being sparsely connected to the rest of the graph. One way to find such subgraphs is to divide the network into parts so that the number of links lying inside each part is maximized. However, if there is no prior information about the number of communities or their sizes, one can maximize the number of links within communities by simply putting all the nodes in one community, even though the result will not be the true communities. To avoid this trivial solution, one could penalize the missing links within communities; instead, we introduce the notions of strong and weak links.

**Fig. 1.** A network with two communities; each consists of a clique of size 5.

**Fig. 2.** A network with 2 communities and 4 dangling nodes (1, 2, 3, and 4).

Weak links lie between communities, while strong links are inside them. We develop our criterion so that it encourages adding strong links to the communities while avoiding weak ones, instead of penalizing missing links. These different types of links play different roles in graph connectivity: removing a weak link may divide the graph into disconnected subgraphs, whereas removing a random link usually would not. Let us focus on the link between nodes i and j in Fig. 1, and also on the link between nodes j and k in this graph. Node j is connected to all the neighbors of node k, whereas nodes i and j have no common neighbors. As nodes in the same community are generally more likely to have common neighbors, (i, j) can be considered a weak link whereas (j, k) is a strong link, and this is exactly what we want to capture through the weights assigned to the links.

### **2.2 Edge Strength**

Given a graph G = (V, E), where V is the set of nodes and E the set of edges, we propose to assign a weight in the range (−1, 1) to each edge, such that strong links have larger weights. As nodes in the same community tend to have more common neighbors than nodes in different communities, if $S_{xy} > S_{x'y'}$ then $e_{xy}$ is more likely to be a strong link than $e_{x'y'}$, with $S_{xy}$ defined by:

$$S\_{xy} = |\{k \in V : (x, k) \in E, (y, k) \in E\}|\tag{1}$$

We can compare two links according to S only if they share a node. Indeed, if we consider nodes x and y that have 5 and 20 links incident to them, then S can be in the range [0, 4] for x and [0, 19] for y. Consequently, for comparisons, we have to scale S values down to (−1, 1). Let $S^{max}_x = \max_{y:(x,y)\in E} S_{xy}$ denote the maximum S value for a particular node x. We divide the range [−1, 1] into $S^{max}_x + 1$ equal-length segments. Each S value n in the range $[0, S^{max}_x]$ is then mapped to the center of the (n + 1)-th segment using the equation:

$$w\_{xy}^x = S\_{xy} \frac{2}{S\_x^{max} + 1} + \frac{1}{S\_x^{max} + 1} - 1 \tag{2}$$

where $w^x_{xy}$ is the scaled value of $S_{xy}$ from the viewpoint of node x (min-max normalization could also work). We can also scale $S_{xy}$ from the viewpoint of node y: $w^y_{xy} = S_{xy}\frac{2}{S^{max}_y + 1} + \frac{1}{S^{max}_y + 1} - 1$, where $S^{max}_y = \max_{x:(y,x)\in E} S_{xy}$. To decide whether we should trust x or y, we need to look at the importance of each one in the network. The local clustering coefficient (CC) [28], given below, is a measure that reflects the importance of nodes, and it can be computed even on large graphs, for instance with MapReduce [15].

$$CC(x) = \frac{|\{e\_{ij} : i, j \in N\_x, e\_{ij} \in E\}|}{\binom{d\_x}{2}} \tag{3}$$

where $d_x$ and $N_x$ are respectively the degree and the set of neighbors of node x. CC is in the range [0, 1], with 1 for nodes whose neighbors form a clique, and 0 for nodes whose neighbors are not connected to each other directly. Here, we scale each edge from the viewpoint of the endpoint that is more likely to be in a dense neighborhood, characterized by a large CC:

$$w\_{xy} = \begin{cases} w\_{xy}^x, & \text{if } CC(x) \ge CC(y) \\ w\_{xy}^y, & \text{otherwise} \end{cases} \tag{4}$$
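
A compact sketch of Eqs. (1)–(4) using networkx; the function name and decomposition are our own illustrative choices.

```python
import networkx as nx

def edge_strength(G, x, y):
    """w_xy: scaled strength of edge (x, y), following Eqs. (1)-(4)."""
    def s(u, v):  # Eq. (1): number of common neighbors
        return len(set(G[u]) & set(G[v]))

    def scaled(u, v):  # Eq. (2): scale S to (-1, 1) from u's viewpoint
        s_max = max(s(u, z) for z in G[u])
        return s(u, v) * 2 / (s_max + 1) + 1 / (s_max + 1) - 1

    cc = nx.clustering(G, [x, y])  # Eq. (3): local clustering coefficient
    # Eq. (4): trust the endpoint with the larger clustering coefficient
    return scaled(x, y) if cc[x] >= cc[y] else scaled(y, x)
```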

#### **2.3 SIWO Measure**

The new measure that we propose encourages adding strong links into the communities while keeping the weak links outside of the communities (**S**trong **I**nside, **W**eak **O**utside). This measure is defined as follows:

$$SIWO = \sum\_{i,j \in V} \frac{w\_{ij} \delta(c\_i, c\_j)}{2} \tag{5}$$

where $c_i$ is the community of node i and δ(x, y) is 1 if x = y and 0 otherwise. SIWO is the sum of the weights of the edges that reside within communities. This objective function provides a way to partition the set of nodes, but it does not specify the conditions required for a subset of nodes to be a community. These conditions are defined in the following.
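
Given precomputed strength weights, the SIWO value of a partition reduces to a sum over intra-community edges, as in this sketch, where `weights` maps each undirected edge to $w_{ij}$ and `part` maps each node to its community id (both structures are our assumptions):

```python
def siwo_value(weights, part):
    """Eq. (5): sum of strength weights of edges inside communities.
    Summing each undirected edge once absorbs the factor 1/2 in Eq. (5)."""
    return sum(w for (i, j), w in weights.items() if part[i] == part[j])
```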

#### **2.4 Community Definition**

Following [21] we consider that a subgraph C is a community in a weak sense if the following condition is satisfied:

$$\frac{1}{2} \sum\_{v \in C} |N\_v^C| > \sum\_{v \in C} |N\_v - N\_v^C| \tag{6}$$

where $N_v$ is the set of the neighbors of node v and $N_v^C$ is the set of the neighbors of node v that are also in community C. This condition means that the nodes of a community collectively have more neighbors within the community than outside it. In this paper, we expand this definition by adding one more condition. Given a partition $p = \{C_1, C_2, \ldots, C_t\}$ of a network, subgraph $C_i$ is considered a **qualified community** if it satisfies the following conditions:


$$\frac{1}{2} \sum\_{v \in C\_i} |N\_v^{C\_i}| > \sum\_{v \in C\_i} |N\_v^{C\_j}|, j \in [1..t], j \neq i \tag{7}$$

## **3 The SIWO Method**

This method has four steps: pre-processing, optimizing SIWO, qualified community identification, and post-processing. They are discussed in detail below.

## **Step 1. Pre-processing**

The first step calculates the edge strength weights ($w_{ij}$) needed during the SIWO optimization. Moreover, to reduce the computational time, we temporarily remove the dangling nodes. Node x is a dangling node if there exists a node y such that removing $e_{xy}$ would divide the network into two disconnected parts, with $part_x$ (the part containing node x) being a tree. Since $part_x$ has a tree structure, it cannot form a community on its own, so all the nodes in $part_x$ belong to the same community as node y. In Fig. 2, nodes 1, 2, 3 and 4 are dangling nodes and belong to the same community as node 5, unless we consider them outliers. Even though such tree-structured subgraphs attached to the network are very sparse and cannot be considered communities, they satisfy Eqs. (6) and (7) defined for qualified communities, so we do not need to consider them during the community detection process. To remove them (and the links incident to them), we initially need to investigate every node of the network to identify nodes with degree 1; after this first pass, we only need to check the neighbors of the nodes removed in the previous pass.
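
A sketch of this dangling-node peeling with networkx; the removal order is recorded so that Step 4 can reinsert the nodes in reverse.

```python
def remove_dangling(G):
    """Iteratively remove degree-1 nodes from G (in place); return
    (node, neighbor) pairs in removal order, where `neighbor` determines
    the node's future community."""
    removed = []
    frontier = [v for v in G.nodes if G.degree(v) == 1]
    while frontier:
        next_frontier = set()
        for v in frontier:
            if v not in G or G.degree(v) != 1:
                continue  # already removed, or degree changed
            anchor = next(iter(G[v]))  # the unique neighbor
            removed.append((v, anchor))
            G.remove_node(v)
            if anchor in G and G.degree(anchor) == 1:
                next_frontier.add(anchor)  # anchor became dangling
        frontier = list(next_frontier)
    return removed
```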

## **Step 2. Optimizing SIWO**

We use Louvain's optimization process to maximize SIWO, since it has been proven very efficient, but we replace modularity with our criterion. This greedy optimization process has two main phases, performed iteratively until a local maximum of the objective function (the SIWO measure) is reached. The first phase starts by placing each node of graph G in its own community. Then each node is moved to the neighboring community that results in the maximum gain of the SIWO value. If no gain can be achieved, the node stays in its community. In the second phase, a new weighted graph G′ is created in which each node corresponds to a community in G. Two nodes in G′ are connected if there exists at least one edge lying between their corresponding communities in G. Finally, we assign each edge $e_{xy}$ in G′ a weight equal to the sum of the weights of the edges between the communities corresponding to x and y. These two phases are repeated until no further improvement in the SIWO objective function can be achieved.

## **Step 3. Qualified Community Identification**

This step determines, among the dense subgraphs discovered in the previous step, the qualified communities complying with Eqs. (6) and (7). However, there may exist communities consisting of a single node that is weakly connected to all of its neighbors ($S^{max}_x = 0$), with only non-positively weighted links incident to it; we call these *lone communities*. Since the community of such nodes cannot be decided based on edge strength, we let the majority of their neighbors decide. To reduce the computational time, as for dangling nodes, we temporarily remove these nodes in this step and bring them back in the final step. Then, we identify the unqualified communities, which do not satisfy Eq. (6) or (7). We keep merging each unqualified community with one of its neighboring communities (qualified or not) until no unqualified community remains. To do so, we first assign a weight of 1 to each edge. Then, we repeat the two phases of Louvain. In phase 1, we create a new graph G* in which each node corresponds to a community (identified in the previous step for the first iteration, and in phase 2 for the following ones), and where each edge $e_{xy}$ is assigned a weight equal to the sum of the weights of the edges between the communities corresponding to x and y. We also add to each node a self-loop whose weight equals the sum of the weights of the edges residing in its corresponding community. In phase 2, we visit all nodes in G*. If a node x has a self-loop whose weight is larger than (1) half the sum of the weights of the edges incident to x, and (2) the weight of any edge connecting x to another node in G*, then the community assigned to x satisfies both conditions in Eqs. (6) and (7), and we let x stay in its community. Otherwise, we move node x to the neighboring community that results in the maximum decrease in the sum of the weights of the edges lying between communities of G*.

#### **Step 4. Post-processing**

Finally, each lone community that was temporarily removed is sequentially added back to the network and merged with the community in which it has the most neighbors. If two or more communities tie and have more than one connection to the node, one is chosen at random; otherwise, we choose the community of the most important neighbor, based on the largest degree centrality within its community. Since we add lone nodes one after the other, the community that an earlier node was assigned to might no longer be the best for that node. To resolve this issue, once all lone nodes are added to the network, we repeatedly move each of them to the community of the majority of its neighbors until no further movement can be made. Dangling nodes are also added back to the network in the reverse order of their removal, and each is assigned to the community of its unique neighbor.

## **4 The Resolution Limit of SIWO**

Fortunato and Barthélemy [11] used two sample networks, shown in Fig. 3, to demonstrate how Q-modularity is affected by the resolution limit. The first example is a ring of cliques, where each clique is connected to its adjacent cliques through a single link. If the number of cliques is larger than about √m, with m the total number of edges in the network, then optimizing Q-modularity results in merging the adjacent cliques into groups of two or more, even though each clique corresponds to a community. The second example is a network containing 4 cliques: 2 of size k and 2 of size p. If k ≫ p, Q-modularity similarly fails to find the correct communities, and the cliques of size p are merged.

Proving that SIWO overcomes the resolution limit of Q-modularity in general would require knowing the exact structure of the network, which is not possible. Instead, we analyze whether SIWO is affected by the resolution limit on these two networks. Given the definition of SIWO, let us consider the edge $e_{xy}$ between two adjacent cliques

**Fig. 3.** Schematic examples (a) a ring of cliques; adjacent cliques are connected through a single link (b) a network with 2 cliques of size *k* and 2 cliques of size *p*.

in the first network. Since x and y do not have any common neighbors, the edge between them has a non-positive weight. Therefore, maximizing the SIWO measure in our algorithm will not merge the adjacent cliques. For the edge $e_{xy}$ between the cliques of size p in the second network, since x and y have at most one common neighbor, the edge between them has a non-positive weight. Therefore, the cliques in the second network will not be merged either.

### **5 Experimental Results**

We compared the performance of our method with the most widely used and efficient algorithms, as pointed out in several recent state-of-the-art studies [8,29], on both real and synthetic networks. The algorithms are: 1- Fastgreedy [6]; 2- Infomap; 3- Infomap+, which is Infomap to which we added the third step of our algorithm (to relieve its sensitivity to the field of view limit and demonstrate that our framework can be used to improve other algorithms); 4- Label Propagation [22]; 5- Louvain<sup>1</sup> [5]; 6- Walktrap<sup>2</sup> [20]. It should be noted that, among these algorithms, Infomap is the only one that suffers from the field of view limit.

The results are evaluated according to the Adjusted Rand Index (ARI) [14] and Normalized Mutual Information (NMI) [26]. As both ARI and NMI show similar results, we only present ARI results for lack of space. We also compared the results of different methods according to the ratio of the number of detected communities over the true number of communities in the ground-truth to observe how a method is affected by the resolution and the field of view limits.

#### **5.1 Real Networks**

We used 5 real networks and the ground-truth communities are available for 4 of them. Table 1 presents the properties of these networks.

We compared SIWO and Louvain on the Eurosis network [9], which represents scientific web pages from 12 European countries and the hyperlinks between them, and for which no ground-truth communities are known. However, since each European country has its own language, web pages in different countries are sparsely connected to each other. Moreover, as reported in [9], some of the countries can be

<sup>1</sup> https://github.com/taynaud/python-louvain.

<sup>2</sup> https://www-complexnetworks.lip6.fr/∼latapy/PP/walktrap.html.


**Table 1.** Properties of real networks

*<sup>a</sup>*http://www.orgnet.com

divided into smaller components, e.g., the Montenegro network includes three components: 1- Telecom and Engineering, 2- Faculties, and 3- High Schools. Louvain detects 13 communities whereas SIWO detects 16 communities in this network. Louvain assigns all nodes of the Montenegro network to one giant community. However, SIWO puts Faculties and High Schools in one community and Telecom and Engineering web pages in another. These two communities are connected to each other by only 7 links, yet Louvain cannot separate them due to its resolution limit.

**Table 2.** Comparison of 7 algorithms according to ARI and the ratio of the number of detected communities over the true number of communities in the ground-truth, on real networks. The table shows the average results and standard deviations computed over 10 iterations of the algorithms on each network.


Table 2 presents the comparison with respect to ARI and C/C*r*, the ratio of the number of detected communities over the true number of communities in the ground-truth (both ARI and C/C*r* should be as close to 1 as possible), on real networks with ground-truth communities. It shows that SIWO performs better on Karate and Polbooks based on ARI. It also outperforms the other methods on the Karate, Football, and Polblogs networks according to the C/C*r* measure (SIWO detects the exact number of ground-truth communities on these networks). Infomap detects a considerably larger number of communities in the Polblogs network, which indicates this algorithm is sensitive to the field of view limit [25]. However, Infomap+ is much less sensitive to this limit, which implies the third step of SIWO, added to Infomap+, is effective in resolving the field of view limit. Considering the results for all networks, SIWO is the top performer among these algorithms on a variety of networks.

### **5.2 Synthetic Networks**

To analyze the effect of the resolution and field of view limits, it is important to test how community detection algorithms perform on networks with small/large communities. Therefore, in this work we generated two sets of networks using LFR [16] to test the different algorithms: one with large communities and one with small communities. The first set favors algorithms that suffer from the resolution limit, such as Louvain, and the second set favors algorithms with the field of view limit, such as Infomap. Each set includes networks with a varying number of nodes and mixing parameter; the mixing parameter controls the fraction of edges that lie between communities. We do not generate networks with mixing parameter ≥ 0.5 since, at and beyond this point, the communities in the ground truth no longer satisfy the definition of community. The input parameters used to generate these two sets are presented in Table 3. Figures 4 and 5 present, respectively, the ARI and the ratio of the number of detected communities over the true number of communities (C/C*r*). Panels correspond to networks with a specific number of nodes (1000 to 100000) and are divided into two parts; the lower (respectively upper) part illustrates the average ARI (or C/C*r*) (respectively the standard deviation) computed over 20 graphs (10 with small and 10 with large communities) as a function of the mixing parameter.

**Table 3.** Input parameters of LFR benchmark: Set 1 contains networks with large communities and Set 2 contains networks with small communities. For each combination of parameters we generated 10 networks.


Figure 4 shows that the performance of Fastgreedy decreases as the mixing parameter increases. Louvain and Walktrap perform well on the smallest networks in the set; however, their performance drops when applied to networks of size 50000 and larger. Label propagation, Infomap and Infomap+ perform well as long as the mixing parameter stays below 0.3; a larger mixing parameter causes a rapid decrease in the ARI value on the two largest networks in the set. These three algorithms also have a large standard deviation, and their outputs are not stable on these networks. SIWO correctly detects the communities when the mixing parameter is at most 0.3 (ARI ≈ 1), regardless of the network size, and has the best performance overall.

Figure 5 clearly shows the resolution limit of Louvain and Fastgreedy, as they underestimate the number of communities. SIWO is the best performer in terms of the number of detected communities and has a very small standard deviation, whereas Infomap+ and Label propagation have a large standard deviation and fail to find the correct number of communities when the mixing parameter exceeds 0.3.

**Fig. 4.** Evaluation according to ARI on synthetic networks generated with LFR.

**Fig. 5.** Evaluation of SIWO, Label propagation, Infomap+, Louvain and Fastgreedy according to *C/Cr* on synthetic networks generated with LFR.

### **6 Scalability**

We analyze how the computational cost of SIWO varies with the size of the network. The pre-processing step has two phases: removing dangling nodes, which requires time of the order of n, where n is the number of nodes, and calculating the edge strength weights, which requires time of the order of $nd^2 = 2md$, where m is the number of edges and d is the average degree. In many real networks d is much smaller than n and does not grow with n [10]. The second and third steps follow the same greedy process as Louvain. Louvain is theoretically cubic but was experimentally demonstrated to be quasi-linear [3], and it has been applied successfully to large networks with several million nodes and 100 million links. The time complexity of the post-processing step depends on the number of lone communities; if all nodes are in lone communities, it requires time $O(nd^2)$. Overall, the time complexity of SIWO is $O(n + md)$, similar to Louvain, given that d is small and n = 2m/d. SIWO can detect communities in a network with 100000 nodes and 1 million edges in about one minute on a commodity laptop (i7, 8 GB RAM). The current implementation of SIWO is in Python<sup>3</sup>, derived from python-louvain.

## **7 Conclusion**

This paper introduces SIWO, a novel objective function based on edge strength for community detection, together with a formal definition of community, which we use to guide the community detection process after optimizing the objective function. This framework can also be applied to other community detection methods to remedy their sensitivity to the resolution or field of view limit. Our extensive experiments on both small and large networks confirm that our algorithm is consistent, effective and scalable for networks with either large or small communities, showing less sensitivity to the resolution and field of view limits that most community mining algorithms suffer from. As a future direction, we will generalize the proposed algorithm to weighted/directed networks. Notably, the SIWO algorithm can easily be generalized to handle weighted graphs: it only requires adjusting the pre-processing step by combining the weights of the input graph with the weights computed by SIWO to evaluate the edge strength.

## **References**


<sup>3</sup> SIWO Code and datasets available at https://www.dropbox.com/sh/eehjt5qblll0yvg/ AACW2XjHJjHX2Q876Vbk0e4Ya?dl=0 .



# **Estimating Uncertainty in Deep Learning for Reporting Confidence: An Application on Cell Type Prediction in Testes Based on Proteomics**

Biraja Ghoshal1(B) , Cecilia Lindskog<sup>2</sup>, and Allan Tucker<sup>1</sup>

<sup>1</sup> Brunel University London, Uxbridge UB8 3PH, UK biraja.ghoshal@brunel.ac.uk <sup>2</sup> Department of Immunology, Genetics and Pathology, Rudbeck Laboratory, Uppsala University, 75185 Uppsala, Sweden https://www.brunel.ac.uk/computer-science

**Abstract.** Multi-label classification in deep learning is a practical yet challenging task, because class overlap in the feature space means that each instance can be associated with multiple class labels. This requires predicting more than one class category for each input instance. To the best of our knowledge, this is the first deep learning study that quantifies uncertainty and model interpretability in multi-label classification, and applies it to the problem of recognising proteins expressed in cell types in testes based on immunohistochemically stained images. Multi-label classification is achieved by thresholding the class probabilities, with the optimal thresholds adaptively determined by a grid search scheme based on Matthews correlation coefficients. We adopt MC-Dropweights to approximate Bayesian inference in multi-label classification and evaluate the usefulness of estimating uncertainty alongside the predictive score to avoid overconfident, incorrect predictions in decision making. Our experimental results show that MC-Dropweights visibly improves uncertainty estimation compared to state-of-the-art approaches.

**Keywords:** Uncertainty estimation · Multi-label classification · Cell type prediction · Human Protein Atlas · Proteomics

## **1 Introduction**

Proteins are the essential building blocks of life, and resolving the spatial distribution of all human proteins at an organ, tissue, cellular, and subcellular level greatly improves our understanding of human biology in health and disease. The testes are among the most complex organs in the human body [15]. Owing to the spermatogenesis process, the testes contain more tissue-specific genes than any other part of the human body. Based on an integrated 'omics' approach using transcriptomics and antibody-based proteomics, more than 500 proteins with distinct testicular protein expression patterns have previously been identified [10], and transcriptomics data suggest that over 2,000 genes are elevated in testes compared to other organs. The function of a large proportion of these proteins is, however, largely unknown, and all genes involved in the complex process of spermatogenesis are yet to be characterized. Manual annotation provides the standard for scoring immunohistochemical staining patterns in different cell types. However, it is tedious, time-consuming and expensive, as well as subject to human error, since it is sometimes challenging to separate cell types by the human eye. It would be extremely valuable to develop an automated algorithm that can recognise the various cell types in testes based on antibody-based proteomics images, while providing information on which proteins are expressed by each cell type [10]. This is, therefore, a multi-label image classification problem.

**Fig. 1.** Schematic overview: cell type-specific expression of testis elevated genes [10]

Exact Bayesian inference with deep neural networks is computationally intractable. Many methods have been proposed for quantifying uncertainty or confidence estimates. Recently, Gal [5] proved that a dropout neural network, a well-known regularisation technique [13], is equivalent to a specific variational approximation in Bayesian neural networks. Uncertainty estimates can thus be obtained by training a network with dropout and then taking Monte Carlo (MC) samples of the prediction, using dropout at test time. Following Gal [5], Ghoshal et al. [7] showed similar results for neural networks with Dropweights, and Teye et al. [14] for batch normalisation layers during training (Fig. 1).

In this paper, we aim to:


Our objective is not to achieve state-of-the-art performance on these problems, but rather to evaluate the usefulness of estimating uncertainty, leveraging MC-Dropweights with a predictive score in multi-label classification, to avoid overconfident, incorrect predictions in decision making.

## **2 Multi-label Cell-Type Recognition and Localization with Estimated Uncertainty**

## **2.1 Problem Definition**

Given a set of training data $\mathcal{D}$, let $X = \{x_1, x_2, \ldots, x_N\}$ be the set of N images and $Y = \{y_1, y_2, \ldots, y_N\}$ the corresponding cell-type labels. Each $y_i = (y_{i,1}, y_{i,2}, \ldots, y_{i,M})$ is a binary vector, where $y_{i,j} = 1$ indicates that the i-th image belongs to the j-th cell type. Note that an image may belong to multiple cell types, i.e., $1 \le \sum_j y_{i,j} \le M$. Based on $\mathcal{D}(X, Y)$, we construct a Bayesian deep learning model that outputs, for a given image $x_i$, the predictive probability of belonging to each cell category, with an estimated uncertainty. That is, the constructed model acts as a function $f : X \to Y$, parameterized by the neural network weights ω, with $0 \le \hat{y}_{i,j} \le 1$, that approximates the original function that generated the outputs Y: the estimated values $(\hat{y}_{i,1}, \hat{y}_{i,2}, \ldots, \hat{y}_{i,M})$ should be as close as possible to the actual values $(y_{i,1}, y_{i,2}, \ldots, y_{i,M})$.

### **2.2 Solution Approach**

We tailored Deep Convolutional Neural Network (DCNN) architectures for cell-type detection and localisation, combining a sufficiently large model capacity with a binary cross-entropy loss, sigmoid activations, Dropweights in the fully connected layers, and Batch Normalization, so that uncertainty is propagated through the network and meaningful model uncertainty can be estimated.

**Multi-label Setup:** There are multiple approaches to transforming a multi-label classification problem into multiple single-label problems with an associated loss function [8]. In this study, we used immunohistochemically stained testes tissue comprising 8 cell types, corresponding to 512 testis elevated genes.

Therefore, given the 8 cell types, we define for each image an 8-dimensional class label vector $y = (y_1, y_2, \ldots, y_8)$ with $y_c \in \{0, 1\}$, where $y_c$ indicates whether the corresponding cell type expresses the protein in the image; the all-zero vector $[0, 0, 0, 0, 0, 0, 0, 0]$ represents "Absence" (no cell type in any of the 8 categories expresses the protein).

**Multi-label Classification Cost Function:** The cost function for multi-label classification has to account for the fact that class predictions are not mutually exclusive. We therefore selected a sigmoid activation for each output combined with a binary cross-entropy loss.
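As a minimal illustration of this choice, the sketch below computes such a multi-label loss with NumPy; the function name and the mean aggregation over labels are our own assumptions, not taken from the paper.

```python
import numpy as np

def multilabel_bce(y_true, logits):
    """Binary cross-entropy under element-wise sigmoids.

    Each of the M labels gets its own independent Bernoulli probability,
    so several cell types can be 'present' at once -- unlike softmax,
    which would normalise the probabilities across classes.
    """
    p = 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid
    eps = 1e-12                        # numerical guard for log(0)
    return -np.mean(y_true * np.log(p + eps) + (1.0 - y_true) * np.log(1.0 - p + eps))
```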

**Data Augmentation:** We used Keras' image pre-processing package to apply affine transformations to the images, such as rotation, scaling, shearing, and translation, during both training and inference. This reduces epistemic uncertainty during training, captures heteroscedastic aleatoric uncertainty during inference, and overall improves model performance.
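A possible configuration using Keras' `ImageDataGenerator` is sketched below; the specific parameter values are illustrative assumptions, as the paper does not report them.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Affine augmentations (rotation, scaling/zoom, shear, translation).
# All ranges below are illustrative, not the paper's settings.
datagen = ImageDataGenerator(
    rotation_range=30,       # random rotations up to 30 degrees
    zoom_range=0.1,          # random scaling by +/- 10 %
    shear_range=0.1,         # shear intensity
    width_shift_range=0.1,   # horizontal translation (fraction of width)
    height_shift_range=0.1,  # vertical translation (fraction of height)
)
# train_images: (N, w, h, c) array; train_labels: (N, 8) binary matrix
# model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=1000)
```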

**Multi-label Classification Algorithm:** In Bayesian classification, the mean of the predictive posterior corresponds to the parameter point estimates, and the width of the posterior reflects the confidence of the predictions. The output of the network is an M-dimensional probability vector, where each dimension indicates how likely each cell type in a given image is to express the protein. The number of cell types that simultaneously express the protein in an image varies. One way to solve this multi-label classification problem is to place a threshold on each dimension; however, different dimensions may require different thresholds. If the value of the $i$-th dimension of $\hat{y}$ is greater than its threshold, we say that the $i$-th cell type is expressed in the given tissue. The main problem is then defining the threshold for each class label.

A threshold based on the Matthews Correlation Coefficient (MCC) is applied to the model outcome to determine the predicted classes and improve the accuracy of the models.

We adopted a grid search scheme based on the MCC to estimate the optimal threshold for each cell type-specific protein expression [2]. Details of the optimal threshold finding procedure are shown in Algorithm 1.

The idea is to estimate the threshold for each cell category separately. For each candidate threshold, we convert the predicted probability vector into a binary vector and calculate the MCC between this thresholded prediction and the actual value. The MCC values for all candidate thresholds are stored in the vector $\omega$, from which we find the index of the threshold yielding the largest correlation; the optimal threshold for the $i$-th dimension is the corresponding candidate value. We then leveraged the Bias-Corrected Uncertainty quantification method [6] using Deep Convolutional Neural Network (DCNN) architectures with Dropweights [7].

```
Input: Ground truth vectors {y_1, ..., y_N} with y_n = (y_n,1, ..., y_n,M);
       Estimated probability vectors {ŷ_1, ..., ŷ_N} with ŷ_n = (ŷ_n,1, ..., ŷ_n,M);
       Upper bound for threshold = Ω, and threshold stride = S
Result: The optimal thresholds T = (ot_1, ot_2, ..., ot_M)
Initialization: T = (ot_1 = 0, ot_2 = 0, ..., ot_M = 0);
for i ← 1 to M do
    t ← 0; ω ← (); π ← ();
    while t < Ω do
        // binarize the i-th dimension of every prediction with threshold t
        for n ← 1 to N do
            if ŷ_n,i > t then v_n ← 1 else v_n ← 0;
        end
        ω ← ω.append(MCC((y_1,i, ..., y_N,i), v));
        π ← π.append(t);
        t ← t + S
    end
    m̂ ← argmax_m (ω_1, ω_2, ..., ω_m, ...);
    ot_i ← π[m̂]
end
```
**Algorithm 1.** Find Optimal Threshold
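For concreteness, a compact Python sketch of this grid search is given below, using `matthews_corrcoef` from scikit-learn; the function name `optimal_thresholds` and the default stride are our own choices.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def optimal_thresholds(y_true, y_prob, upper=1.0, stride=0.01):
    """Per-label threshold search maximising the MCC (cf. Algorithm 1).

    y_true: (N, M) binary ground truth; y_prob: (N, M) predicted probabilities.
    Returns an array of M optimal thresholds, one per cell type.
    """
    grid = np.arange(0.0, upper, stride)   # candidate thresholds 0, S, 2S, ...
    best = np.zeros(y_true.shape[1])
    for m in range(y_true.shape[1]):
        scores = [matthews_corrcoef(y_true[:, m], (y_prob[:, m] > t).astype(int))
                  for t in grid]
        best[m] = grid[int(np.argmax(scores))]  # threshold with largest MCC
    return best
```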

**Network Architecture:** Our models are trained and evaluated using Keras with a TensorFlow backend. For the DCNN architecture, we used a generic building block with the following structure: Conv-ReLU-BatchNorm-MaxPool-Conv-ReLU-BatchNorm-MaxPool-Dense-ReLU-Dropweights-Dense-ReLU-Dropweights-Dense-Sigmoid, with 32 convolution kernels of size 3 × 3, 2 × 2 max-pooling, dense layers with 512, 128, and 8 units, and a Dropweights probability of 0.3. We optimised the model using the Adam optimizer with the default learning rate of 0.001. Training was conducted for 1000 epochs with mini-batch size 32. We repeated each experiment three times per algorithm and report the mean of the results.
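A minimal Keras sketch of this architecture is given below. Since Keras has no built-in Dropweights (DropConnect) layer, we use `Dropout` kept active at test time (`training=True`) as a stand-in so that Monte Carlo sampling remains possible; the input shape is also an assumption, as the paper does not state it.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(64, 64, 3))        # input size assumed for illustration
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(512, activation="relu")(x)
x = layers.Dropout(0.3)(x, training=True)      # stand-in for Dropweights, active at test time
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.3)(x, training=True)      # stand-in for Dropweights, active at test time
outputs = layers.Dense(8, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy")
```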

## **3 Estimating Bias-Corrected Uncertainty Using Jackknife Resampling Method**

### **3.1 Bayesian Deep Learning and Estimating Uncertainty**

There are many measures for estimating uncertainty, such as softmax variance, expected entropy, mutual information, predictive entropy, and averaging predictions over multiple models. In supervised learning, information gain, i.e. the mutual information between the input data and the model parameters, is considered the most relevant measure of epistemic uncertainty [4,12]. However, estimating entropy from a finite set of data suffers from a severe downward bias when the data is under-sampled, and even small biases can result in significant inaccuracies [9]. We therefore leveraged the Jackknife resampling method to calculate a bias-corrected entropy [11].

Given a set of training data $D$, where $X = \{x_1, x_2, \ldots, x_N\}$ is the set of $N$ images and $Y = \{y_1, y_2, \ldots, y_N\}$ the corresponding labels, a BNN is defined in terms of a prior $p(\omega)$ on the weights and the likelihood $p(D \mid \omega)$. Consider the class probabilities $p(y_i = c \mid x_i, \omega_t, D)$ with $\omega_t \sim q(\omega \mid D)$, where $\mathcal{W} = (\omega_t)_{t=1}^{T}$ is a set of independent and identically distributed (i.i.d.) samples drawn from $q(\omega \mid D)$. The following procedure computes the Monte Carlo (MC) estimate of the posterior predictive distribution, its entropy, and the mutual information (MI):

$$\hat{\mathbb{I}}_{\text{MC}}(y_i; \omega \mid x_i, D) = \mathbb{H}\{\hat{p}(y_i \mid x_i, D)\} - \frac{1}{|\mathcal{W}|} \sum_{\omega \in \mathcal{W}} \mathbb{H}\{p(y_i \mid x_i, \omega, D)\}\,, \tag{1}$$

where

$$\hat{p}(y\_i \mid x\_i, D) = \frac{1}{|\mathcal{W}|} \sum\_{\omega \in \mathcal{W}} p(y\_i \mid x\_i, \omega, D) \,. \tag{2}$$

The stochastic predictive entropy is $\mathbb{H}[y \mid x, \omega] = \mathbb{H}(\hat{p}) = -\sum_c \hat{p}_c \log \hat{p}_c$, where $\hat{p}_c = \frac{1}{T} \sum_t p_{tc}$ is the full-sample maximum likelihood estimate of the class probabilities.

The first term in the MC estimate of the mutual information is called the plug-in estimator of the entropy. It has long been known that the plug-in estimator underestimates the true entropy, i.e. the plug-in estimate is biased [11,17].

A classic method for correcting this bias is Jackknife resampling [3]. To address the bias problem, we propose a Jackknife estimator of the epistemic uncertainty to improve the entropy-based estimation; unlike MC-Dropout, it does not assume constant variance. If $D(X, Y)$ is the observed random sample, the $i$-th Jackknife sample $x_{(i)}$ is the subset that leaves out observation $x_i$: $x_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$. For sample size $N$, the Jackknife standard error $\hat{\sigma}$ is defined as $\sqrt{\frac{N-1}{N} \sum_{i=1}^{N} (\hat{\sigma}_{(i)} - \hat{\sigma}_{(\cdot)})^2}$, where $\hat{\sigma}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\sigma}_{(i)}$ is the empirical average of the Jackknife replicates. Here, the Jackknife estimator is an unbiased estimator of the variance of the sample mean. The Jackknife correction of a plug-in estimator $\mathbb{H}(\cdot)$ is computed as follows [3]:

Given a sample $(p_t)_{t=1}^{T}$, where each $p_t$ is a discrete distribution over the classes $1, \ldots, C$ and $T$ is the total number of MC-Dropweights forward passes at test time:

- calculate the leave-one-out estimators: $\hat{p}_c^{-t} = \frac{1}{T-1} \sum_{j \neq t} p_{jc}$
- calculate the plug-in entropy estimates: $\hat{\mathbb{H}}^{-t} = \mathbb{H}(\hat{p}^{-t})$

We leveraged the following relation:

$$
\mu\_{-i} = \frac{1}{T - 1} \sum\_{j \neq i} x\_j = \mu + \frac{\mu - x\_i}{T - 1}.
$$

which removes the $i$-th data point from the sample mean $\mu = \frac{1}{T} \sum_i x_i$ and recomputes the mean $\mu_{-i}$ without a full pass over the data. This makes it possible to calculate leave-one-out estimators of a discrete probability distribution quickly.
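A minimal NumPy sketch of this computation is shown below; it assumes the standard Jackknife bias correction $\hat{\mathbb{H}}_J = T\hat{\mathbb{H}} - (T-1)\,\overline{\hat{\mathbb{H}}^{-t}}$ applied to the plug-in entropy, and the function name is ours.

```python
import numpy as np

def jackknife_entropy(probs):
    """Bias-corrected (Jackknife) entropy from T stochastic forward passes.

    probs: (T, C) array of per-pass class probabilities p_t.
    """
    T = probs.shape[0]
    p_mean = probs.mean(axis=0)                      # plug-in estimate
    h_plug = -np.sum(p_mean * np.log(p_mean + 1e-12))
    # Leave-one-out means via the constant-time update mu_{-t} = mu + (mu - x_t)/(T - 1)
    p_loo = p_mean + (p_mean - probs) / (T - 1)      # (T, C), one row per left-out pass
    h_loo = -np.sum(p_loo * np.log(p_loo + 1e-12), axis=1)
    # Standard Jackknife bias correction of the plug-in estimator (an assumption)
    return T * h_plug - (T - 1) * h_loo.mean()
```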

The epistemic uncertainty can be obtained as the difference between the approximate predictive posterior entropy (or total entropy) and the average uncertainty in predictions (i.e. the aleatoric entropy):

$$I(\mathbf{y} : \boldsymbol{\omega}) = H\_e(\mathbf{y}|\mathbf{x}) = \hat{H}\_J(\mathbf{y}|\mathbf{x}) - H\_a(\mathbf{y}|\mathbf{x}) = \hat{H}\_J(\mathbf{y}|\mathbf{x}) - \mathbb{E}\_{q(\boldsymbol{\omega}|\mathbf{D})}[\hat{H}\_J(\mathbf{y}|\mathbf{x}, \boldsymbol{\omega})]$$

Therefore, the mutual information $I(\mathbf{y} : \omega)$, i.e. the bias-corrected epistemic uncertainty, represents the variability in the predictions made by neural network weight configurations drawn from the approximate posterior. It derives an estimate of the finite-sample bias from the leave-one-out estimators of the entropy and reduces the bias considerably, down to $O(n^{-2})$ [3].

The bias-corrected uncertainty estimates flag regions of the data space that are ambiguous or difficult to classify, for example because of noise in the inputs or because the model was trained on data from a different domain. Noisy inputs should be assigned a higher aleatoric uncertainty, and as a result we can expect high model uncertainty in these regions.

Following Gal [5], we define the stochastic versions of Bayesian uncertainty using MC-Dropweights: the class probabilities $p(y_i = c \mid x_i, \omega_t, D)$ with $\omega_t \sim q(\omega \mid D)$ and $\mathcal{W} = (\omega_t)_{t=1}^{T}$ a set of i.i.d. samples drawn from $q(\omega \mid D)$ can be approximated by averaging over the MC-Dropweights forward passes.

We trained the multi-label classification network on all eight classes and dichotomised the network outputs using the optimal per-cell-type thresholds from Algorithm 1, with 1000 MC-Dropweights forward passes at test time. In these detection tasks, checking whether $\hat{p}(y_{i,c} \mid x_i, D) \ge \mathrm{OptimalThreshold}_c$, where 1 marks the presence of the cell type, is sufficient to indicate the most likely decision along with its estimated uncertainty.
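A sketch of this test-time procedure, assuming the Keras model above with its stochastic layers kept active, might look as follows; the function name and return convention are ours.

```python
import numpy as np

def mc_predict(model, x, thresholds, T=1000):
    """MC-Dropweights prediction: average T stochastic forward passes,
    then dichotomise each label with its optimal threshold."""
    probs = np.stack([model(x, training=True).numpy() for _ in range(T)])
    p_mean = probs.mean(axis=0)                    # predictive probability per label
    presence = (p_mean >= thresholds).astype(int)  # 1 marks presence of the cell type
    return presence, p_mean, probs                 # probs feeds the uncertainty estimates
```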

#### **3.2 Dataset**

Our main dataset is taken from the Human Protein Atlas project, which maps the distribution of all human proteins in human tissues and organs [15]. We used high-resolution digital images of immunohistochemically stained testes tissue comprising 8 cell types: spermatogonia, preleptotene spermatocytes, pachytene spermatocytes, round/early spermatids, elongated/late spermatids, Sertoli cells, Leydig cells, and peritubular cells, publicly available in the Human Protein Atlas version 18 (v18.proteinatlas.org), as shown in Fig. 2.

**Fig. 2.** Examples of proteins expressed only in one cell-type [10]

**Fig. 3.** Annotated heatmap of a correlation matrix between cell types

Figure 3 illustrates the correlation coefficients between cell types. A relationship was observed between spermatogonia and preleptotene spermatocytes, and between round/early spermatids and elongated/late spermatids together with pachytene spermatocytes. The overall pattern is that very few cell types are strongly correlated with each other.

### **3.3 Results and Discussions**

We conducted experiments on the Human Protein Atlas dataset to validate the proposed MC-Dropweights algorithm for multi-label classification.

**Multi-label Classification Model Performance:** Model evaluation metrics for multi-label classification differ from those used in multi-class (or binary) classification. The performance metrics of multi-label classifiers can be divided into label-based (i.e. each label is evaluated separately) and example-based measures [16]. In this work, example-based measures (accuracy score, Hamming loss, F1-score) and rank loss are used to evaluate the performance of the classifiers.


**Table 1.** Performance metrics

In the first experiment, we compared the MC-Dropweights neural network-based method with five multi-label classification (MLC) algorithms introduced in Sect. 1 – binary relevance (BR), Classifier Chain (CC), Probabilistic Classifier Chain (PCC), Condensed Filter Tree (CFT), and Cost-Sensitive Label Embedding with Multi-dimensional Scaling (CLEMS) – as well as the MC-Dropout neural network model. Table 1 shows that MC-Dropweights exhibits considerably better performance than all the other algorithms, which demonstrates the importance of Dropweights in the neural network.

**Cell Type-Specific Predictive Uncertainty:** The relationship between uncertainty and predictive accuracy grouped by correct and incorrect predictions is shown in Fig. 4. It is interesting to note that, on average, the highest uncertainty is associated with Elongated/late Spermatids and Round/early Spermatids. This indicates that there is some feature which contributes greater uncertainty to the Spermatids class types than to the other cell types.

**Cell Type Localization:** Saliency mapping with estimated uncertainty is a simple technique to uncover the discriminative image regions that most strongly influence the network prediction for a specific class label. It highlights the most influential features in image space [1] and visualises the contributions of individual pixels to epistemic and aleatoric uncertainty separately. We calculated class activation maps (CAM) [18] using the activations of the fully connected layer and the weights from the prediction layer, as shown in Fig. 5.

**Fig. 4.** Distribution of uncertainty values for all protein images, grouped by correct and incorrect predictions. Label assignment was based on optimal thresholding (Algorithm 1). For an incorrect prediction, there is a strong likelihood that the predictive uncertainty is also high in all cases except for Spermatids.

**Fig. 5.** Saliency maps for some common methods towards model explanation

## **4 Conclusion and Discussion**

In this study, a multi-label classification method was developed using a deep learning architecture with Dropweights to predict cell type-specific protein expression with estimated uncertainty. This increases interpretability and confidence, making deep learning models more applicable in practice. The results show that a deep learning model with MC-Dropweights yields the best performance among all the popular classifiers considered.

Building a truly large-scale, fully automated, high-precision, very high-dimensional image analysis system that can recognise various cell type-specific protein expressions, specifically for elongated/late spermatids and round/early spermatids, remains a strenuous task. Properties of the dataset such as label correlations and label cardinality can strongly affect the uncertainty quantification in the predictive probabilities of a Bayesian deep learning algorithm in multi-label settings. There is no systematic study on how and why performance varies over different data properties; any such study would be of great benefit in progressing multi-label algorithms.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Adversarial Attacks Hidden in Plain Sight**

Jan Philip Göpfert<sup>1(B)</sup>, André Artelt<sup>1</sup>, Heiko Wersing<sup>2</sup>, and Barbara Hammer<sup>1</sup>

<sup>1</sup> Bielefeld University, Bielefeld, Germany jgoepfert@techfak.uni-bielfeld.de

<sup>2</sup> Honda Research Institute Europe GmbH, Offenbach, Germany

**Abstract.** Convolutional neural networks have been used to achieve a string of successes during recent years, but their lack of interpretability remains a serious issue. Adversarial examples are designed to deliberately fool neural networks into making any desired incorrect classification, potentially with very high certainty. Several defensive approaches increase robustness against adversarial attacks, demanding attacks of greater magnitude, which in turn lead to visible artifacts. By considering human visual perception, we compose a technique that allows such adversarial attacks to be hidden in regions of high complexity, such that they are imperceptible even to an astute observer. We carry out a user study on classifying adversarially modified images to validate the perceptual quality of our approach and find significant evidence for its concealment with regards to human visual perception.

## **1 Introduction**

The use of convolutional neural networks has led to tremendous achievements since Krizhevsky et al. [1] presented AlexNet in 2012. Despite efforts to understand the inner workings of such neural networks, they mostly remain black boxes that are hard to interpret or explain. The issue was exacerbated in 2013 when Szegedy et al. [2] showed that "adversarial examples" – images perturbed in such a way that they fool a neural network – prove that neural networks do not simply generalize correctly the way one might naïvely expect. Typically, such adversarial attacks change an input only slightly, but in an adversarial manner, such that humans do not regard the difference of the inputs relevant, but machines do. There are various types of attacks, such as one pixel attacks, attacks that work in the physical world, and attacks that produce inputs fooling several different neural networks without explicit knowledge of those networks [3–5].

Adversarial attacks are not strictly limited to convolutional neural networks. Even the simplest binary classifier partitions the entire input space into labeled regions, and where there are no training samples close by, the respective label can only be nonsensical with regards to the training data, in particular near decision boundaries. One explanation of the "problem" that convolutional neural networks have is that they perform extraordinarily well in high-dimensional settings, where the training data only covers a very thin manifold, leaving a lot of "empty space" with ragged class regions. This creates a lot of room for an

**Fig. 1.** Two adversarial attacks carried out using the Basic Iterative Method (first two rows) and our Entropy-based Iterative Method (last two rows). The original image (a) (and (g)) is correctly classified as *umbrella* but the modified images (b) and (h) are classified as *slug* with a certainty greater than 99 %. Note the visible artifacts caused by the perturbation (c), shown here with maximized contrast. The perturbation (i) does not lead to such artifacts. (d), (e), (f), (j), (k), and (l) are enlarged versions of the marked regions in (a), (b), (c), (g), (h), and (i), respectively.

attacker to modify an input sample and move it away from the manifold on which the network can make meaningful predictions, into regions with nonsensical labels. Due to this, even adversarial attacks that simply blur an image, without any specific target, can be successful [6]. There are further attempts at explaining the origin of the phenomenon of adversarial examples, but so far, no conclusive consensus has been established [7–10].

A number of defenses against adversarial attacks have been put forward, such as defensive distillation of trained networks [11], adversarial training [12], specific regularization [9], and statistical detection [13–16]. However, no defense succeeds in universally preventing adversarial attacks [17,18], and it is possible that the existence of such attacks is inherent in high-dimensional learning problems [6]. Still, some of these defenses do result in more robust networks, where an adversary needs to apply larger modifications to inputs in order to successfully create adversarial examples. This raises the question of how robust a network can become, and whether robustness is a property that needs to be balanced with other desirable properties, such as the ability to generalize well [19] or a reasonable complexity of the network [20].

Strictly speaking, it is not entirely clear what defines an adversarial example as opposed to an incorrectly classified sample. Adversarial attacks are devised to change a given input minimally such that it is classified incorrectly – in the eyes of a human. While astonishing parallels between human visual information processing and deep learning exist, as highlighted e. g. by Yamins and DiCarlo [21] and Rajalingham et al. [22], they disagree when presented with an adversarial example. Experimental evidence has indicated that specific types of adversarial attacks can be constructed that also deteriorate the decisions of humans, when they are allowed only limited time for their decision making [23]. Still, human vision relies on a number of fundamentally different principles when compared to deep neural networks: while machines process image information in parallel, humans actively explore scenes via saccadic moves, displaying unrivaled abilities for structure perception and grouping in visual scenes as formalized e. g. in the form of the Gestalt laws [24–27]. As a consequence, some attacks are perceptible by humans, as displayed in Fig. 1. Here, humans can detect a clear difference between the original image and the modified one; in particular in very homogeneous regions, attacks lead to structures and patterns which a human observer can recognize. We propose a simple method to address this issue and answer the following questions. How can we attack images using standard attack strategies, such that a human observer does not recognize a clear difference between the modified image and the original? How can we make use of the fundamentals of human visual perception to "hide" attacks such that an observer does not notice the changes?

Several different strategies for performing adversarial attacks exist. For a multiclass classifier, the attack's objective can be to have the classifier predict *any* label other than the correct one, in which case the attack is referred to as *untargeted*, or *some specifically chosen* label, in which case the attack is called *targeted*. The former corresponds to minimizing the likelihood of the original label being assigned; the latter to maximizing that of the target label. Moreover, the classifier can be fooled into classifying the modified input with extremely high confidence, depending on the method employed. This, in particular, can however lead to visible artifacts in the resulting images (see Fig. 1). After looking at a number of examples, one can quickly learn to make out typical patterns that depend on the classifying neural network. In this work, we propose a method for changing this procedure such that this effect is avoided.

For this purpose, we extend known techniques for adversarial attacks. A particularly simple and fast method for attacking convolutional neural networks is the aptly named Fast Gradient Sign Method (FGSM) [4,7]. This method, in its original form, modifies an input image x along a linear approximation of the objective of the network. It is fast but limited to untargeted attacks. An extension of FGSM, referred to as the Basic Iterative Method (BIM) [28], repeatedly adds small perturbations and allows targeted attacks. Moosavi-Dezfooli et al. [29] linearize the classifier and compute smaller (with regards to the $\ell_p$ norm) perturbations that result in untargeted attacks. Using more computationally demanding optimizations, Carlini and Wagner [17] minimize the $\ell_0$, $\ell_2$, or $\ell_\infty$ norm of a perturbation to achieve targeted attacks that are still harder to detect. Su et al. [3] carry out attacks that change only a single pixel, but these attacks are only possible for some input images and target labels. Further methods exist that do not result in obvious artifacts, e. g. the Contrast Reduction Attack [30], but these are again limited to untargeted attacks – the input images are merely corrupted such that the classification changes. None of the methods mentioned here regard human perception directly, even though they all strive to find imperceptibly small perturbations. Schönherr et al. [31] successfully do this within the domain of acoustics.

We rely on BIM as the method of choice for attacks based on images, because it allows robust targeted attacks with results that are classified with arbitrarily high certainty, even though it is easy to implement and efficient to execute. Its drawbacks are the aforementioned visible artifacts. To remedy this issue, we will take a step back and consider human perception directly as part of the attack. In this work, we propose a straightforward, very effective modification to BIM that ensures targeted attacks are visually imperceptible, based on the observation that attacks do not need to be applied homogeneously across the input image and that humans struggle to notice artifacts in image regions of high local complexity. We hypothesize that such attacks, in particular, do not change saccades as severely as generic attacks, and so humans perceive the original image and the modified one as very similar – we confirm this hypothesis in Sect. 3 as part of a user study.

## **2 Adversarial Attacks**

Recall the objective of a targeted adversarial attack. Given a classifying convolutional neural network $f$, we want to modify an input $x$ such that the network assigns a different label $f(x')$ to the modified input $x'$ than to the original $x$, where the target label $f(x')$ can be chosen at will. At the same time, $x'$ should be as similar to $x$ as possible, i. e. we want the modification to be small. This results in the optimization problem:

$$\min \|x' - x\| \quad \text{such that} \quad f(x') = y \neq f(x), \tag{1}$$

where $y = f(x')$ is the target label of the attack. BIM finds such a small perturbation $x' - x$ by iteratively adapting the input according to the update rule

$$x \gets x - \epsilon \cdot \text{sign}[\nabla\_x J(x, y)] \tag{2}$$

until $f$ assigns the label $y$ to the modified input with the desired certainty, where the certainty is typically computed via the softmax over the activations of all class-wise outputs. $\mathrm{sign}[\nabla_x J(x, y)]$ denotes the sign of the gradient of the objective function $J(x, y)$ and is computed efficiently via backpropagation; $\epsilon$ is the step size. The norm of the perturbation is not considered explicitly, but because in each iteration the change is distributed evenly over all pixels/features in $x$, its $\ell_\infty$-norm is minimized.

#### **2.1 Localized Attacks**

The main technical observation, based on which we hide attacks, is the fact that one can weigh and apply attacks locally in a precise sense: during prediction, a convolutional neural network extracts features from an input image, condenses the information contained therein, and conflates it, in order to obtain its best guess for classification. Where exactly in an image a certain feature is located is of minor consequence compared to how strongly it is expressed [32,33]. As a result, we find that during BIM's update, it is not strictly necessary to apply the computed perturbation evenly across the entire image. Instead, one may choose to leave parts of the image unchanged, or perturb some pixels more or less than others, i. e. one may localize the attack. This can be directly incorporated into Eq. (2) by setting an individual value of $\epsilon$ for every pixel.

For an input image $x \in [0, 1]^{w \times h \times c}$ of width $w$ and height $h$ with $c$ color channels, we formalize this by setting a strength map $\mathcal{E} \in [0, 1]^{w \times h}$ that holds an update magnitude for each pixel. Such a strength map can be interpreted as a grayscale image where the brightness of a pixel corresponds to how strongly the respective pixel in the input image is modified. The adaptation rule (2) of BIM is changed to the update rule

$$x_{ijk} \leftarrow x_{ijk} - \epsilon \cdot \mathcal{E}_{ij} \cdot \text{sign}[\nabla_x J(x, y)]_{ijk} \tag{3}$$

for all pixel values $(i, j, k)$. In order to express the overall strength of an attack, for a given strength map $\mathcal{E}$ of size $w$ by $h$, we call

$$\kappa(\mathcal{E}) = \frac{\sum\_{i,j \in \overline{w} \times \overline{h}} \mathcal{E}\_{i,j}}{w \cdot h} \tag{4}$$

**Fig. 2.** Localized attacks with different relative total strengths. The strength maps (d), (e), and (f), which are based on Perlin noise, scaled such that the relative total strength is 0.43, 0.14, and 0.04, are used to create the adversarial examples in (a), (b), and (c), respectively. In each case, the attacked image is classified as *slug* with a certainty greater than 99 %. The attacks took 14, 17, and 86 iterations. (g), (h), and (i) are enlarged versions of the marked regions in (a), (b), and (c).

the *relative total strength* of $\mathcal{E}$, where for $n \in \mathbb{N}$ we let $\overline{n} = \{1, \ldots, n\}$ denote the set of natural numbers from 1 to $n$. In the special case where $\mathcal{E}$ only contains either black or white pixels, $\kappa(\mathcal{E})$ is the ratio of white pixels, i. e. the number of attacked pixels over the total number of pixels in the attacked image.

As long as the scope of the attack, i. e. $\kappa(\mathcal{E})$, remains large enough, adversarial attacks can still be carried out successfully – if not as easily – with more iterations required until the desired certainty is reached. This leads to the attacked pixels being perturbed more, which in turn leads to more pronounced artifacts. A given strength map $\mathcal{E}$ can be modified to increase or decrease $\kappa(\mathcal{E})$ by adjusting its brightness or by applying appropriate morphological operations. See Fig. 2 for a demonstration that uses pseudo-random noise as a strength map.
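A minimal NumPy sketch of the localized update (3) is given below; clipping the result to the valid pixel range is our own addition, and the function name is hypothetical.

```python
import numpy as np

def localized_bim_step(x, grad, strength_map, eps=0.004):
    """One localized BIM step (Eq. 3): each pixel is perturbed in proportion
    to the strength map E in [0, 1]^(w x h).

    x: (w, h, c) image in [0, 1]; grad: gradient dJ/dx of the same shape.
    """
    E = strength_map[:, :, np.newaxis]   # broadcast E over the color channels
    x_new = x - eps * E * np.sign(grad)  # descend towards the target label
    return np.clip(x_new, 0.0, 1.0)      # keep pixel values valid (our addition)
```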

### **2.2 Entropy-Based Attacks**

The crucial component for "hiding" adversarial attacks is choosing a strength map $\mathcal{E}$ that appropriately considers human perceptual biases. The strength map essentially determines which "norm" is chosen in Eq. (1); if it differs from a uniform weighting, the norm treats different regions of the image differently. The choice of the norm is critical when discussing the visibility of adversarial attacks. Methods that explicitly minimize the $\ell_p$ norm of the perturbation, for some $p$, only "accidentally" lead to perturbations that are hard to detect visually, since the $\ell_p$ norm does not actually resemble, e. g., the human visual focus for the specific image. We propose to instead make use of how humans perceive images and to carefully choose those pixels where the resulting artifacts will not be noticeable.

Instead of trying to hide our attack in the background or "where an observer might not care to look", we focus on those regions where there is high local complexity. This choice is based on the rationale that humans inspect images in saccadic moves, and a focus mechanism guides how a human can process highly complex natural scenes efficiently in a limited amount of time. *Visual interest* serves as a selection mechanism, singling out relevant details and arriving at an optimized representation of the given stimuli [34]. We rely on the assumption that adversarial attacks remain hidden if they do not change this scheme. In particular, regions which do not attract focus in the original image should not increase their level of interest, while relevant parts can, as long as the adversarial attack is not adding additional relevant details to the original image.

Due to its dependence on semantics, it is hard – if not impossible – to agnostically compute the magnitude of *interest* for specific regions of an image. Hence, we rely on a simple information theoretic proxy, which can be computed based on the visual information in a given image: the entropy in a local region. This simplification relies on the observation that regions of interest such as edges typically have a higher entropy than homogeneous regions and the entropy serves as a measure for how much information is already contained in a region – that is, how much relative difference would be induced by additional changes in the region.

Algorithmically, we compute the *local entropy* at every pixel in the input image as follows: after discarding color, we bin the gray values, i. e. the intensities, in the neighborhood of pixel $(i, j)$ such that $B_{i,j}$ contains the respective occurrence ratios. The occurrence ratios can be interpreted as estimates of the intensity probability in this neighborhood, hence the local entropy $S_{i,j}$ can be calculated as the Shannon entropy

$$S\_{i,j} = -\sum\_{p \in B\_{i,j}} p \log p. \tag{5}$$

Through this, we obtain a measure of local complexity for every pixel in the input image, and after adjusting the overall intensity, we use it as suggested above to scale the perturbation pixel-wise during BIM's update. In other words, we set

$$\mathcal{E} = \phi(S) \tag{6}$$

where φ is a nonlinear mapping, which adjusts the brightness. The choice of a strength map based on the local entropy of an image allows us to perform an attack as straightforward as BIM, but localized, in such a way that it does not produce visible artifacts, as we will see in the following experiments.
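The local entropy computation of Eq. (5) can be sketched as follows; the neighborhood radius, bin count, and the use of the natural logarithm are illustrative choices (the paper does not fix them here), and $\phi$ is shown as the simple binarization used later in Sect. 3.1.

```python
import numpy as np

def local_entropy(gray, radius=4, bins=32):
    """Shannon entropy (Eq. 5) of the intensity histogram in a square
    neighborhood around every pixel; gray is a 2D array with values in [0, 1]."""
    h, w = gray.shape
    binned = np.minimum((gray * bins).astype(int), bins - 1)
    S = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = binned[max(0, i - radius):i + radius + 1,
                           max(0, j - radius):j + radius + 1]
            p = np.bincount(patch.ravel(), minlength=bins) / patch.size
            p = p[p > 0]                      # only non-empty bins contribute
            S[i, j] = -np.sum(p * np.log(p))
    return S

# Strength map via a binarizing phi (the threshold value is an assumption):
# E = (local_entropy(gray) >= threshold).astype(float)
```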

While we could attach our technique to any attack that relies on gradients, we use BIM because of the aforementioned advantages including simplicity, versatility, and robustness, but also because as the direct successor to FGSM we consider it the most typical attack at present. As a method of performing adversarial attacks, we refer to our method as the *Entropy-based Iterative Method (EbIM)*.

## **3 A Study of How Humans Perceive Adversarial Examples**

It is often claimed that adversarial attacks are imperceptible<sup>1</sup>. While this can be the case, there are many settings in which it does not necessarily hold true – as can be seen in Fig. 1. When robust networks are considered and an attack is expected to reliably and efficiently produce adversarial examples, visible artifacts appear. This motivated us to consider human visual perception directly, which led to our method. To confirm that there are in fact differences in how adversarial examples produced by BIM and EbIM are perceived, we conducted a user study with 35 participants.

<sup>1</sup> We do not want to single out any specific source for this claim, and it should not necessarily be considered strictly false, because there is no commonly accepted rigorous definition of what constitutes an adversarial example or an adversarial attack, just as it remains unclear how to best measure adversarial robustness. Whether an adversarial attack results in noticeable artifacts depends on a multitude of factors, such as the attacked model, the underlying data (distribution), the method of attack, and the target certainty.

#### **3.1 Generation of Adversarial Examples**

To keep the course of the study manageable, so as not to bore our relatively small number of participants, and still acquire statistically meaningful (i. e. with high statistical power) and comparable results, we randomly selected 20 labels and 4 samples per label from the validation set of the *ILSVRC 2012 classification challenge* [35], giving a total of 80 images. For each of these 80 images we generated a targeted high-confidence adversarial example using BIM and another one using EbIM – resulting in a total of 240 images. We set a fixed target class and the target certainty to 0.99. We attacked the pretrained *Inception v3* model [36] as provided by *keras* [37]. We set the parameters of BIM to $\epsilon = 1.0$, step size 0.004, and a maximum of 1000 iterations. For EbIM, we binarized the entropy mask with a threshold of 4.2. We chose these parameters such that the algorithms can reliably generate targeted high-certainty adversarial examples across all images, without requiring expensive per-sample parameter searches.

#### **3.2 Study Design**

For our study, we assembled the images in pairs according to *three different conditions*: (i) an original image paired with itself, (ii) an original image paired with its BIM adversarial, and (iii) an original image paired with its EbIM adversarial (cf. Fig. 3).

This resulted in 240 pairs of images that were to be evaluated during the study.

All image pairs were shown to each participant in a random order – we also randomized the positioning (left and right) of the two images in each pair. For each pair, the participant was asked to determine whether the two images were identical or different. If the participant thought that the images were identical they were to click on a button labeled "Identical" and otherwise on a button labeled "Different" – the ordering of the buttons was fixed for a given participant but randomized when they began the study. To facilitate completion of the study in a reasonable amount of time, each image pair was shown for 5 s only; the participant was, however, able to wait as long as they wanted until clicking on a button, whereby they moved on to the next image pair.

#### **3.3 Hypotheses Tests**

Our hypothesis was that it would be more difficult to perceive the changes in images generated by EbIM than in those generated by BIM. We therefore expect our participants to click "Identical" more often when seeing an adversarial example generated by EbIM than when seeing one generated by BIM.

As a test statistic, we compute, *for each participant* and *for each of the three conditions separately*, the percentage of times they clicked on "Identical". These values can be interpreted as means if we encode "Identical" as 1 and "Different" as 0. Hereinafter we refer to these mean values as $\mu_{\text{BIM}}$ and $\mu_{\text{EbIM}}$. For each of

**Fig. 3.** Percentage of times users clicked on "Identical" when seeing two identical images (condition (i), blue box), a BIM adversarial (condition (ii), orange box), or an EbIM adversarial (condition (iii), green box). (Color figure online)

the three conditions, we provide a boxplot of the test statistics in Fig. 3 – the scores of EbIM are much higher than BIM, which indicates that it is in fact much harder to perceive the modifications introduced by EbIM compared to BIM. Furthermore, users almost always clicked on "Identical" when seeing two identical images.

Finally, we can phrase our belief as a hypothesis test. We determine whether we can reject the following five hypotheses:


We use a *one-tailed t-test* and the (non-parametric) *Wilcoxon signed rank test* with a significance level α = 0.05 in both tests. The cases (1), (4) and (5) are tested as *paired tests* and the other two cases (2) and (3) as *one-sample tests*.

Because the t-test assumes that the mean difference is normally distributed, we test for normality<sup>2</sup> using the *Shapiro-Wilk normality test*. The Shapiro-Wilk test yields a p-value of 0.425; we therefore assume that the mean difference follows a normal distribution. The resulting p-values are listed in Table 1 – we can reject all null hypotheses with very low p-values.

<sup>2</sup> Because we have 35 participants, we assume that normality approximately holds because of the central limit theorem.
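As a sketch of this testing procedure with SciPy (the per-participant scores themselves are not reproduced here, so the data below are synthetic placeholders; `alternative=` requires SciPy ≥ 1.6):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-participant fractions of "Identical" clicks (n = 35);
# synthetic stand-ins for the study data in conditions (ii) and (iii).
mu_bim = rng.uniform(0.0, 0.4, size=35)
mu_ebim = rng.uniform(0.6, 1.0, size=35)

diff = mu_ebim - mu_bim
print(stats.shapiro(diff))                                      # normality of the mean difference
print(stats.ttest_rel(mu_ebim, mu_bim, alternative="greater"))  # one-tailed paired t-test
print(stats.wilcoxon(mu_ebim, mu_bim, alternative="greater"))   # Wilcoxon signed-rank test
```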


**Table 1.** p-values of each hypothesis (columns) under each test (rows). We reject all null hypotheses.

In order to compute the power of the t-test, we compute the effect size using *Cohen's d*. We find that $d \approx 2.29$, which is considered a huge effect size [38]. The power of the one-tailed t-test is then approximately 1.

We have empirically shown that adversarial examples produced by EbIM are significantly harder to perceive than adversarial examples generated by BIM. Furthermore, adversarial examples produced by EbIM are not perceived as differing from their respective originals.

### **4 Discussion**

Adversarial attacks will remain a potential security risk on the one hand and an intriguing phenomenon that leads to insight into neural networks on the other. Their nature is difficult to pinpoint and it is hard to predict whether they constitute a problem that will be solved. To further the understanding of adversarial attacks and robustness against them, we have demonstrated two key points: adversarial attacks need not be applied homogeneously across the input image but can be localized via a strength map, and localizing them in regions of high local entropy makes them hard for humans to perceive.
This has allowed us to develop the Entropy-based Iterative Method (EbIM), which performs adversarial attacks against convolutional neural networks that are hard to detect visually even when their magnitude is considerable with regards to an $\ell_p$-norm. It remains to be seen how current adversarial defenses perform when confronted with entropy-based attacks, and whether robust networks learn special kinds of features when trained adversarially using EbIM.

Through our user study we have made clear that not all adversarial attacks are imperceptible. We hope that this is only the start of considering human perception explicitly during the investigation of deep neural networks in general and adversarial attacks against them specifically. Ideally, this would lead to a concise definition of what constitutes an adversarial example.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Enriched Weisfeiler-Lehman Kernel for Improved Graph Clustering of Source Code**

Frank Höppner(B) and Maximilian Jahnke

Department of Computer Science, Ostfalia University of Applied Sciences, 38302 Wolfenbüttel, Germany f.hoeppner@ostfalia.de

**Abstract.** To perform cluster analysis on graphs we utilize graph kernels, the Weisfeiler-Lehman kernel in particular, to transform graphs into a vector representation. Despite good results, these kernels have been criticized in the literature for high dimensionality and high sensitivity, so we propose an efficient subtree distance measure that is subsequently used to enrich the vector representations and enables more sensitive distance measurements. We demonstrate the usefulness in an application where the graphs represent different source code snapshots, and a cluster analysis of these snapshots provides the lecturer with an overview of the overall performance of a group of students.

## **1 Motivation**

Graphs are a universal data structure that has become very popular over recent years in various domains with structured data (e.g. protein function prediction, drug toxicity prediction, malware detection, etc.). To apply existing clustering or classification techniques to graphs, either a distance (or similarity) measure is needed, or a transformation into the vector representation for which most clustering and classification algorithms were developed. In this paper we are concerned with repeatedly clustering graphs to understand the evolution of students' source code. As will be explained in Sect. 2, we settle on the Weisfeiler-Lehman (WL) graph kernel [9] to decompose each graph into subtrees and to define a similarity function over the number of common substructures across graphs. It has been criticized, however, that WL subtree kernels produce (a) many different substructures, so that only a few substructures are common across graphs, which establishes (b) a tendency of *being only similar to itself*. In this paper we propose to include subtree similarity in an efficient post-processing step to tackle both problems: we exploit the fact that many of the substructures may be formally distinct but actually quite similar. By enriching the vector representations we obtain positive effects for the overall graph similarity.

**Algorithm 1.** WLSK(G, l_{i−1})

```
Require: graph G = (V, E), label function l_{i-1} : V → Σ*
Ensure: returns new label function l_i : V → Σ*
for v ∈ V do
    store node label l_{i-1}(v) in s
    for w ∈ V, (v, w) ∈ E in (some lexicographical) order of l_{i-1}(w) do
        append l_{i-1}(w) to s
    end for
    compress s ← h(s) by applying a hash function h
    assign new label to node v: l_i(v) ← s
end for
return l_i
```

### **2 Related Work**

### **2.1 Measuring Similarity Directly**

A common approach to comparing graphs is to calculate the *edit distance* between graphs $F$ and $G$: the minimal number of steps to transform $G$ into $F$. For the special case of trees, these steps consist of node deletion, node insertion, and node relabelling. A survey on tree edit distance can be found in [1]; an efficient $O(n^3)$ algorithmic solution, $n$ being the maximal number of nodes in $F$ and $G$, is proposed in [2]. To adapt a tree edit distance to a specific application, there are approaches that learn appropriate cost parameters [6]. With general graphs, the editing process becomes more complicated, as additional operations need to be considered (edge insertion and edge deletion). A survey on graph edit distance is given in [3]. Its computation is exponential in the number of nodes and therefore infeasible for large graphs.

#### **2.2 Measuring Similarity Indirectly**

Instead of coping with the full graph, one may decompose the graph into a set of smaller entities and compare these sets instead of the graphs. These entities may be frequent subgraphs (e.g. [8]), walks (short paths), graphlets (e.g. [10]), or subtrees (e.g. [9]). Many *graph kernel* approaches explicitly construct a vector representation, where the $i$-th element indicates how often the $i$-th substructure occurs in the graph. From this vector a kernel or similarity matrix may be calculated. Recent approaches, such as subgraph2vec [5], use deep learning to translate graphs into such a vector representation.

This section particularly reviews the construction of a WL subtree kernel (following [9]), as it will be foundation of the next section. The subtree kernel transforms a graph into a vector, where a non-zero entry indicates the occurrence of a specific subtree in the graph. The total number of dimensions is determined by all subtrees that have been identified in the full set of graphs.

Given a graph $G = (V, E)$, a label function $l : V \to \Sigma^*$ yields for each node $v \in V$ a label over a finite alphabet $\Sigma$. The initial labels $l_0(v)$ are provided together with the graph $G$ (original labels). A new label function $l_i$ is obtained by calling $WLSK(G, l_{i-1})$, which is shown in Algorithm 1: it constructs new labels by concatenating all child labels deterministically (by processing children in some lexicographic order). A series of $n$ WLSK calls provides a sequence of $n$ label functions $l_0, \ldots, l_n$, where a node label $l_i(v)$ takes all children of $v$ up to depth $i$ into account. A label $l_i(v)$ may thus serve as a kind of fingerprint (hashcode) of the neighbourhood of $v$. Let $L_i = \{l_i^1, l_i^2, \ldots, l_i^{k_i}\} = l_i(V)$ be the set of all different $l_i$-labels in $G$. The final vector representation of a graph is obtained from

$$\Phi(G) = \left( \#l\_0^1, \dots, \#l\_0^{k\_0}, \#l\_1^1, \dots, \#l\_1^{k\_1}, \dots, \#l\_{n-1}^1, \dots, \#l\_{n-1}^{k\_{n-1}} \right)$$

where $\#l_i^j$ denotes how many nodes received the label $l_i^j$. Originally this approach was proposed as a test of isomorphism [11], as isomorphic graphs exhibit identical substructures (labels).

Figure 1 shows an illustrative example. On the top left we have two graphs $G_1$ and $G_2$ with nodes $v_1$–$v_7$ and $v_8$–$v_{14}$, respectively. The (numeric) label is written in the node; the node identifiers are shown in gray. The table next to the graphs shows, for each node, how the new label $s$ is constructed from the current node label and its successors. For instance, node $v_1$ of $G_1$ has label 0 and successors with labels 2, 0, 1. Algorithm 1 creates new labels by appending the node label and the successor labels (in sorted order), which yields "0 : 0, 1, 2" for $v_1$. The rightmost table shows a dictionary, where each new label (here: 0 : 0, 1, 2) gets a fresh ID (here: 3). Algorithm 1 refers to this step as hashing the node label into a new ID (or hashcode) – we use consecutive numbers just for illustrative purposes. Children need to be ordered deterministically to get the same hash for identical subtrees. The new label $l_1(v_1) = 3$ thus encodes a subtree of depth 1 with root 0 and children 0, 1, 2. Once all new labels are determined (lower half of Fig. 1), the nodes $v_1$ and $v_8$ still have the same label, $l_1(v_1) = 3 = l_1(v_8)$, because their subtrees of depth 1 are identical. After another WLSK iteration, however, the subtrees of depth 2 are no longer identical for $v_1$ and $v_8$, so their $l_2$-labels differ: $l_2(v_1) = 11 \neq 17 = l_2(v_8)$. The final vector representation for $G_1$ and $G_2$ (after 2 iterations) consists of counts for each label (from all depths):

$$\begin{aligned} \Phi(G_1) &= (\underbrace{4, 1, 2}_{L_0}, \underbrace{1, 1, 1, 1, 2, 1, 0, 0}_{L_1}, \underbrace{1, 1, 1, 1, 2, 1, 0, 0, 0, 0}_{L_2\text{-label counts}}), \\ \Phi(G_2) &= (\underbrace{3, 2, 2}_{L_0}, \underbrace{2, 0, 0, 0, 2, 1, 1, 1}_{L_1}, \underbrace{0, 0, 0, 0, 2, 1, 1, 1, 2, 1}_{L_2\text{-label counts}}) \end{aligned}$$

The vector representation Φ(G) enables us to construct a kernel matrix or apply standard clustering and classification directly.
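A compact Python sketch of this construction is given below; directed graphs are represented as adjacency dictionaries, and we let label tuples themselves act as the hash function $h$, which is an implementation choice of ours.

```python
from collections import Counter

def wlsk_relabel(adj, labels):
    """One WLSK iteration (Algorithm 1): relabel every node with its own
    label plus the sorted labels of its successors.

    adj: dict node -> list of successor nodes; labels: dict node -> label.
    """
    return {v: (labels[v],) + tuple(sorted(labels[w] for w in adj[v]))
            for v in adj}   # the tuple itself serves as the hashcode h(s)

def wl_features(adj, labels, iterations=2):
    """Label counts over all iterations -- the vector Phi(G) as a Counter."""
    counts = Counter(labels.values())
    for _ in range(iterations):
        labels = wlsk_relabel(adj, labels)
        counts.update(labels.values())
    return counts

# Toy usage (not the graphs of Fig. 1): a -> b, a -> c with labels 0, 1, 2
adj = {"a": ["b", "c"], "b": [], "c": []}
print(wl_features(adj, {"a": 0, "b": 1, "c": 2}))
```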

### **2.3 Discussion**

Measuring graph similarity indirectly is in general more efficient than direct approaches. Among the kernel approaches it has been pointed out that with some

**Fig. 1.** Illustrative example of 2 WLSK iterations. left: initial labels l0, middle: l1, right: l<sup>2</sup>

substructures, e.g. short paths (aka walks), many different graphs map to the same point in the feature space (cf. [7]). Subtree kernels (and in particular the WLSK) have been reported to be efficient<sup>1</sup> and well-performing in subsequent tasks (e.g. SVM classification). However, from the example in Fig. 1 we can also acknowledge the critique of the approach: although $G_2$ has been obtained from $G_1$ by removing $v_4$ and adding $v_{12}$ only, the vector representations are very different. Spotting differences early is good when checking for isomorphic graphs, but may be less desirable for similarity assessment (e.g. clustering). Despite the few changes, more than half of the labels occur exclusively in only one of the graphs (13 entries out of 21 are zero in one of the two graphs). Continuous (rather than integer) features may help, as provided by some deep learning approaches, but deep learning requires a huge amount of training data, which makes it unsuitable for datasets of moderate size.

### **3 Enriching WL Subtree Kernels**

Revisiting Fig. 1, node $v_3$ of $G_1$ and node $v_{10}$ of $G_2$ differ only by a missing node labelled '1'. From the different $l_1$-hashcodes of the two nodes (5 for $v_3$ and 3 for $v_{10}$) we cannot conclude what they have in common. Secondly, node $v_2$ of $G_1$ and $v_9$ of $G_2$ are similar in the sense that nodes labelled 0 and 2 can be reached, only in $G_1$ there is an intermediate node $v_4$. If we accept that the node pairs $(v_2, v_9)$ and $(v_3, v_{10})$ are somewhat similar, this should positively affect the $l_2$-similarity of $v_1$ and $v_8$, too. We want to take this kind of similarity into account without

<sup>1</sup> The only necessary data structure is a hash table that collects how often each node label occurred.

sacrificing the efficiency of the WLSK. Instead of integer features (subtree counts) we introduce continuous features to better reflect a partial matching of subtrees. We stick to the WLSK construction, but propose a post-processing step which replaces the zero entries in the vector representation. As many subtrees (with different hashcodes) are in fact similar, we obtain highly correlated dimensions which are safe to remove, thus reducing the dimensionality. We optionally apply dimensionality reduction to arrive at a vector of moderate size.

### **3.1 Subtree Similarity**

Given a graph $G = (V, E)$, let $L_i = l_i(V)$ be the set of all hashcodes for subtrees of depth $i$ (cf. the tables on the right of Fig. 1). The hashcodes compress the newly constructed node labels, but no longer contain any information about the subtree, so we track this information in tables: for every occurring hashcode $h \in L_i$, we denote the root node label by $r_h \in L_{i-1}$ and the multiset of successor labels by $S_h \subseteq L_{i-1}$. (*Example:* For $h = 11 \in L_2$ in Fig. 1 we have $r_h = 3$ and $S_h = \{4, 5, 7\}$.)

Next we define a series of distance functions $d_i : L_i \times L_i \to \mathbb{R}$ to capture the distance between subtree hashcodes of the same depth $i$. We start with a distance $d_0$ for the original graph node labels. In the absence of any background knowledge we use for the initial level

$$d\_0(h, h') := \begin{cases} 0 & \text{if } h = h' \\ 1 & \text{otherwise} \end{cases},\tag{1}$$

but generally assume that some background information can be provided to arrive at meaningful distances for the initial node labels.

For non-trivial subtrees (that is, $i > 0$) we recursively define distance functions $d\_i(h, h')$. It is natural to define the distance as the sum of distances between root and child nodes. This requires assigning the child nodes of $h$ uniquely to the child nodes of $h'$, which is provided by a bijective function $f : S\_h \to S\_{h'}$:

$$d\_i(h, h') := \underbrace{d\_{i-1}(r\_h, r\_{h'})}\_{\text{root node distance}} + \underbrace{\min\_{f \in B(S\_h, S\_{h'})} \sum\_{k \in S\_h} d\_{i-1}(k, f(k))}\_{\text{distance of best subtree alignment}} \tag{2}$$

Here $B(S, T)$ denotes the set of bijective functions $f : S \to T$. The first term measures the distance between the root node labels and the second term identifies the minimal distance among all node assignments. Finding the assignment with minimal distance is known as the *assignment problem*, which has well-known solutions; we adopt the Munkres algorithm for this task [4].

We are likely to deal with unbalanced assignments, that is, different numbers of children for $h$ and $h'$. A bijective assignment requires $|S\_h| = |S\_{h'}|$, so we add the necessary number of *missing nodes* (denoted by $\perp$) to the smaller multiset.<sup>2</sup>

<sup>2</sup> More formally, $B(S, T)$ is the set of bijective functions $f : S' \to T'$ where $|S'| = k = |T'|$, $S \subseteq S'$, $T \subseteq T'$, and $S'$ has $k - |S|$ (and $T'$ has $k - |T|$) additional $\perp$ elements.
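To make the recursion concrete, here is a minimal sketch of Eqs. (2) and (3) in Python, assuming the bookkeeping of Sect. 3.1 is stored as per-level dictionaries (`root[i][h]` for $r\_h$, `children[i][h]` for $S\_h$); these names and the `BOTTOM` sentinel are our own assumptions, and SciPy's `linear_sum_assignment` stands in for the Munkres solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment   # Hungarian/Munkres solver

BOTTOM = None   # stands for the missing-node symbol ⊥

def subtree_distance(h1, h2, i, root, children, d0):
    """d_i(h1, h2) along Eqs. (2) and (3); either hashcode may be BOTTOM.
    d0 must also cover the ⊥ row/column of the label-distance table."""
    if i == 0:
        return d0(h1, h2)
    if h1 is BOTTOM:                       # keep the missing side on the right
        h1, h2 = h2, h1
    if h2 is BOTTOM:                       # Eq. (3): distance to a missing subtree
        return (subtree_distance(root[i][h1], BOTTOM, i - 1, root, children, d0)
                + sum(subtree_distance(k, BOTTOM, i - 1, root, children, d0)
                      for k in children[i][h1]))
    s1, s2 = list(children[i][h1]), list(children[i][h2])
    while len(s1) < len(s2): s1.append(BOTTOM)     # pad so that f is bijective
    while len(s2) < len(s1): s2.append(BOTTOM)
    cost = np.array([[subtree_distance(a, b, i - 1, root, children, d0)
                      for b in s2] for a in s1])
    rows, cols = linear_sum_assignment(cost)       # optimal child alignment
    return (subtree_distance(root[i][h1], root[i][h2], i - 1, root, children, d0)
            + cost[rows, cols].sum())
```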

**Fig. 2.** Left: a priori distances $d\_0$ between labels of $L\_0$. Case (i): assignment matrix for the $d\_1$ distance of $l\_1(v\_2) = 4$ and $l\_1(v\_9) = 9$. Case (ii): assignment matrix for the $d\_1$ distance of $v\_3$ ($\{0, 2\}$) and $v\_{10}$ ($\{0, 1, 2\}$). Right: derived $d\_1$-distances from cases (i) and (ii). Case (iii): assignment matrix for the $d\_2$ distance of $v\_0$ ($\{4, 5, 7\}$) and $v\_8$ ($\{3, 7, 9\}$) (Color figure online)

We extend the distance $d\_0$ to the case of missing nodes, which corresponds to an additional row/column in the $d\_0$-matrix (see the example $d\_0$-matrix in Fig. 2 (left)). Again, these $\perp$-distances may be an arbitrary constant or specifically provided for each label $h \in L\_0$ using background knowledge. Then Eq. (2) extends naturally to $\perp$-values:

$$d\_i(h, \perp) := d\_{i-1}(r\_h, \perp) + \sum\_{k \in S\_h} d\_{i-1}(k, \perp) \tag{3}$$

Figure 2 shows an example. The leftmost table shows the $d\_0$-distances between original node labels (cf. Fig. 1: $L\_0 = \{0, 1, 2\}$), including the case of a missing label $\perp$. For the sake of illustration we assume a distance of $\frac{1}{2}$ for the label pair $(0, 2)$. Consider the comparison of $v\_2$ and $v\_9$ for depth-1 subtrees: $d\_1(h, h')$ with $h = l\_1(v\_2)$, $h' = l\_1(v\_9)$. Both root nodes are identical ($r\_h = r\_{h'} = 0$), but the multisets of successors are not ($S\_h = \{0, 0\}$, $S\_{h'} = \{0, 2\}$). Matrix (i) shows the distance matrix for the assignment problem: all nodes of $h$ (rows) have to be assigned to a node of $h'$ (columns). As the child nodes represent $l\_0$-hashcodes, we take the distances from the $d\_0$ table. An optimal assignment is marked in red and we obtain a distance $d\_1(h, h') = 0 + (0 + \frac{1}{2}) = \frac{1}{2}$. Matrix (ii) shows a second example for the $d\_1$ comparison of $v\_3$ vs $v\_{10}$: as $v\_{10}$ has three children but $v\_3$ only two, we introduce one $\perp$-element to obtain a square matrix. The optimal assignment is shown in red, and the $d\_1$-distance becomes 1.0. The two examples contribute two values to the $d\_1$-distance table (fourth matrix), from which we may then calculate, e.g., $d\_2(l\_2(v\_1), l\_2(v\_8)) = 0 + (\frac{1}{2} + 1 + 0) = 1.5$ (matrix (iii)).

#### **3.2 Updating Vector Representations**

Once the WLSK algorithm has been executed, we determine all $d\_i$-distances from the $l\_i$-labels alone (without revisiting the graphs). Then we update the vector

**Fig. 3.** Insertion of nodes to compensate side-effects of superfluous nodes. (Color figure online)

representations of all graphs, the zero entries in particular. Suppose **x** is the vector representation of $G$ and $\mathbf{x}\_h = 0$ for some $h \in L\_i$, which means that subtree $h$ is not present in $G$. Among the subtrees that *do occur* in $G$ we can now find the one most similar to $h$, i.e. the $h' \in L\_i$ with the smallest distance $d\_i(h, h')$, and replace $\mathbf{x}\_h$ by

$$\mathbf{x}\_h \leftarrow k(d\_i(h, h')) \cdot \mathbf{x}\_{h'}$$

where $k : \mathbb{R}^+ \to [0, 1]$ is a monotonically decreasing function that turns distances into similarities, with $k(0) = 1$. The multiplication with $\mathbf{x}\_{h'}$ accounts for the fact that $h'$ may occur multiple times in $G$. We used $k(d) = e^{-(d/\delta)^2}$, where $\delta$ is a user-defined threshold.
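A minimal sketch of this update for the labels of one depth level, assuming the pairwise distances $d\_i$ have been precomputed into a nested dictionary `dist`; the function and variable names are illustrative only.

```python
import numpy as np

def enrich(x, labels, dist, delta=1.0):
    """Replace the zero entries of a WLSK count vector. `x` holds the counts
    for the depth-i labels listed in `labels`; `dist[h][h2]` holds the
    precomputed d_i distances between labels of that depth."""
    k = lambda d: np.exp(-(d / delta) ** 2)             # similarity k(d), k(0) = 1
    present = [h for h, v in zip(labels, x) if v > 0]   # subtrees occurring in G
    x = np.asarray(x, dtype=float).copy()
    for j, h in enumerate(labels):
        if x[j] == 0 and present:
            best = min(present, key=lambda p: dist[h][p])   # most similar occurring subtree
            x[j] = k(dist[h][best]) * x[labels.index(best)]
    return x
```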

### **3.3 Compensating Superfluous Nodes**

We say $v$ is a *superfluous node* if it is just a *stopover* on the way to yet another node, but does not contribute to the graph structure itself, that is, if the in- and out-degree of $v$ is 1. In Fig. 1 the node $v\_4$ in $G\_1$ is such a superfluous node. In some applications nodes with certain labels may occur occasionally, but do not carry any important information. Their presence or absence should therefore not affect the graph similarity too much.

The discussed distance measure can cope with such differences when comparing, e.g., the subtree of $v\_2$ with that of $v\_9$. But if we consider $v\_4$ as a superfluous intermediate node, it brings another undesired effect: it may introduce completely new subtrees which are not present in other graphs. In the example of Fig. 1 the node $v\_4$ introduces subtrees with hashcodes 6 (at depth 1) and 14 (at depth 2), which are not present in $G\_2$. When measuring the similarity of $G\_1$ and $G\_2$, such subtrees make the graphs appear less similar.

We address such cases by considering the insertion of a superfluous node in our distance calculation. Figure 3 shows the situation once more: to enrich the vector representation of $G\_2$ we seek a closest match for label $h$. According to Sect. 3.1 we consider, amongst others, the node $v\_9$ with label $h'$ as a candidate. With both nodes having a single child only, finding the optimal bijective assignment $f$ is trivial ($f(k) = k'$) and Eq. (2) boils down to $d\_{i-1}(r\_h, r\_{h'}) + d\_{i-1}(k, k')$. Now we additionally consider the *insertion* of a superfluous node $v\_s$ with the same label as $v\_4$, as shown in Fig. 3 (red). Note that a hashcode $l\_i(v\_s)$ for the newly inserted node was not necessarily generated earlier. How would the distance between a node $v$ and $v\_s$ evaluate? According to (2) we have

$$d\_i(l\_i(v), l\_i(v\_s)) = d\_{i-1}(l\_{i-1}(v), l\_{i-1}(v\_s)) + d\_{i-1}(k, k')$$

The second part consists of a single term because both nodes have a single child only. Note that it does not depend on $v\_s$. Substituting the first term repeatedly by its definition eventually leads us to

$$d\_i(l\_i(v), l\_i(v\_s)) = \underbrace{d\_0(l\_0(v), l\_0(v\_s))}\_{0 \text{ by construction}} + \sum\_{j=0}^{i-1} d\_j(l\_j(k), l\_j(k')) \tag{4}$$

The level-0 distance to the newly inserted node is 0 by construction; however, we replace it by a penalty term $d\_I(l\_0(v))$ to reflect the fact that we had to insert a new node. As with $d\_0(\cdot, \cdot)$ we assume that $d\_I(\cdot)$ can be derived meaningfully from the application context: if, for instance, nodes with a certain label $h$ are optional, we choose a low insertion distance $d\_I(h)$, and may otherwise set $d\_I(h) = \infty$ to prevent undesired insertions.

We thus arrive at a distance $d\_i^\*(h, h')$ for the insertion of a superfluous node:

$$d\_i^\*(h, h') := \begin{cases} \min\{d\_I(r\_h), d\_I(r\_{h'})\} + \sum\_{j=0}^{i-1} d\_j(l\_j(k), l\_j(k')) & \text{if } S\_h = \{k\} \wedge S\_{h'} = \{k'\} \\ \infty & \text{otherwise} \end{cases} \tag{5}$$

which yields $\infty$ if the prerequisites of a superfluous node are not met, and considers node insertion on both sides (inner min-term). The original distance (2) may then be replaced by $\min\{d\_i(h, h'), d\_i^\*(h, h')\}$ to reflect the occurrence of superfluous nodes appropriately. These changes can be handled during the precalculation of the distance matrices; the vector enrichment remains unchanged.
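A sketch of $d\_i^\*$ under the same assumed table layout as before; `d_I` and the `peel` helper (which recovers $l\_j(k)$ by repeatedly taking root labels) are our own naming.

```python
import math

def peel(h, lvl, j, root):
    """Recover l_j(k) from k's level-`lvl` hashcode by repeatedly taking roots."""
    while lvl > j:
        h = root[lvl][h]
        lvl -= 1
    return h

def star_distance(h1, h2, i, root, children, d_I, d):
    """d*_i(h1, h2) of Eq. (5). `d[j]` maps a pair of level-j hashcodes to
    their distance; `d_I` maps a root label to its insertion penalty."""
    s1, s2 = list(children[i][h1]), list(children[i][h2])
    if len(s1) != 1 or len(s2) != 1:
        return math.inf                   # prerequisites of Eq. (5) not met
    k1, k2 = s1[0], s2[0]
    penalty = min(d_I[root[i][h1]], d_I[root[i][h2]])
    chain = sum(d[j][peel(k1, i - 1, j, root), peel(k2, i - 1, j, root)]
                for j in range(i))
    return penalty + chain

# the effective distance is then min(d_i(h1, h2), d*_i(h1, h2))
```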

#### **3.4 Complexity**

Enriching the vector representations requires two steps: (1) The calculation of all distance matrices $d\_i$ requires computing $\sum\_i |L\_i|^2$ entries. For each entry we have to solve an assignment problem, which is $O(d^2 \log d)$ where $d$ is the maximal node degree. The method is therefore unattractive for highly connected graphs, but many applications with large graphs have a bounded node degree. (2) The vector representations **x** of all $n$ graphs need to be enriched. This takes $O(m\_z \cdot m\_{nz})$ for each graph, where $m\_z$ (resp. $m\_{nz}$) is the number of zero (resp. non-zero) entries in **x**: for each 0-entry in **x** we have to find the most similar non-zero entry. The number of all labels from all graphs ($m = \sum\_i |L\_i|$) is much larger than the number of nodes in a single graph, whereas $m\_{nz}$ is bounded by the number of nodes in a single graph. With $m\_{nz} \ll m\_z$ we may consider $m\_{nz}$ a constant (maximal number of nodes) and arrive at $O(n \cdot m)$ for the vector enrichment.

**Fig. 4.** Example of a source code snapshot and its graph representation. The student has not yet finished the solution at this snapshot; the return statement is still missing.

## **4 Application**

We demonstrate the usefulness of the proposed modification in an application from computer science education. The increase in the number of CS students over the last years calls for tools that help lecturers assess the stage of development of a whole group of students, rather than inspecting the solutions one by one. Our dataset consists of editing streams from the students' source code editor (for selected exercises of an introductory programming course using Java). In our preliminary evaluation we have about 30–50 such streams per task. We extract snapshots of the code whenever a student starts to edit a different code line than before. (Many snapshots thus do not represent compilable code.) The goal is to compare editing paths against each other, for instance, to identify the most common paths or outliers. We replace the textual representation of each source snapshot by a graph capturing the abstract syntax tree and the variable usage, as can be seen in the example of Fig. 4. We want to cluster the snapshots and to construct a new graph where nodes correspond to clusters (of code snapshots) and edges indicate editing paths of students. For the experiments we applied some preprocessing (e.g. variable renaming in the graph) and assigned low insertion costs to expression and declaration nodes, because students may phrase conditions quite differently. Our use case for superfluous nodes (Sect. 3.3) are code blocks ({ }), which are optional if the code within the block consists of a single statement only (e.g. the ++count in Fig. 4).

### **4.1 Effect on Distances**

To measure the effect of the enriched kernel we have manually subdivided a set of snapshots into *similar* and *dissimilar snapshots*. In a clustering setting we want


**Table 1.** Effect of vector enrichment on distances.

the modification to carve out clusters more clearly. We therefore compare the mean distance $\mu\_w$ (and variance $\sigma\_w$) *within* the group of similar graphs against the mean distance $\mu\_b$ (and variance $\sigma\_b$) *between* both groups. By the factor $f$ we denote the size of the gap between both means in multiples of the within-group standard deviation $\sigma\_w$, that is, $f = \frac{\mu\_b - \mu\_w}{\sigma\_w}$. The factor $f$ may be considered a measure of separation between the cluster of similar graphs and the remaining graphs. From Table 1 we find that the enriched representation consistently yields higher values of $f$ than the standard vector representation.

### **4.2 Dimensionality**

New node labels are introduced for every new subtree, which leads to a high-dimensional vector representation that has been identified as problematic in the literature (Sect. 2.3). Enriching the vector representation can help to overcome this problem, because labels with minor changes receive similar (enriched) entries. For instance, a dataset with 718 code snapshot graphs generated as many as 5179 different subtree labels (depth 3). After enrichment we identified the attributes that might be removed from the dataset because the dataset already contains a highly correlated attribute. This leads to a substantial reduction in the number of columns: depending on the Pearson correlation threshold (0.9/0.95/0.99), as much as 77%/68%/55% of the attributes can be discarded.
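Such a correlation filter fits in a few lines; the following sketch (assuming the enriched vectors form the columns of a pandas `DataFrame`) greedily keeps a column only if it stays below the Pearson threshold against every column kept so far.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Greedily keep a column only if its absolute Pearson correlation with
    every previously kept column stays below `threshold`."""
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in keep):
            keep.append(col)
    return df[keep]
```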

### **4.3 Code Graph Clustering**

To reduce the dimensionality further, a principal component analysis (PCA) may be applied. Figure 5 shows the scatter plots of principal component (PC) #2 against PCs #1, #3 and #4 for the standard representation (top) and the enriched vectors (bottom). The colors indicate cluster memberships from a mean shift clustering over 4 principal components. Note that, by construction of the dataset, we do not expect the source code snapshots to fall apart into well-separated clusters, because the data represents the evolution towards a final solution; snapshots differ by incremental changes only. In the standard case the data scatters more uniformly and with less structure (left; PC1 vs PC2), while the enriched data shows two long-stretched clusters that reflect a somewhat linear code evolution for two different approaches to solving the exercise, which

**Fig. 5.** Principal component #2 versus principal component #1, #3 and #4 for standard (left) and enriched (right) vectors. (Color figure online)

**Fig. 6.** Snapshot evolution for a group of students: Nodes represent clusters, edges represent snapshot transitions. (Color figure online)

corresponds much better to our expectation. When taking an additional component into account (PC3), the scatter plot in the middle (PC2 vs PC3) reveals a clearer structure for the enriched data (e.g. the separation of the curved red cluster at the top) than for the original data.
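A plausible reconstruction of this pipeline with scikit-learn (the exact parameters behind Fig. 5 are not given in the text, so the defaults below are assumptions):

```python
from sklearn.decomposition import PCA
from sklearn.cluster import MeanShift

def cluster_snapshots(X):
    """Project (enriched) snapshot vectors onto 4 principal components and
    cluster them with mean shift, as described for Fig. 5."""
    Z = PCA(n_components=4).fit_transform(X)
    labels = MeanShift().fit_predict(Z)
    return Z, labels
```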

Figure 6 shows how the clusters are used in the context of our application. Each cluster (like those in Fig. 5, but for a different exercise) corresponds to a node in this graph. Whenever a student changes the code and thereby moves to a different cluster, a (directed) edge is inserted. The number of students who have followed a path is written next to the edge. Clusters that have only one incoming and one outgoing edge are not shown for the sake of brevity. The green color indicates the degree of unit-test fulfillment. The node labels $a : b(c|d)$ carry information about the cluster id $a$, the number of students $b$ who came across this node, and the number of students $c$ (resp. $d$) who started (resp. ended) in this node. From this example the lecturer can immediately recognize that 42 students start in cluster #1, from where most students (25) transition to cluster #2, and 10 more students reach the same cluster via cluster #4 as an intermediate step. Cluster #2 does not yet correspond to a perfect solution, and only 12 students manage to reach the green cluster #3 from cluster #2. Other clusters and edges have much smaller numbers; they cover exotic solutions or trial-and-error approaches. The graph provides a good overview of the students' performance as a group.

## **5 Conclusions**

Weisfeiler-Lehman subtree kernels can be used to transform graphs into a meaningful vector representation, but suffer from high dimensionality and sparsity, which limits the similarity assessment. We overcome both problems by taking subtree distances into account, which are simpler to assess than general tree distances because only subtrees of equal depth need to be considered. Based on the subtree distance we enrich the zero entries of graph vectors and improve the similarity assessment. A removal of highly correlated attributes reduces the dimensionality considerably. The modifications turned out to be advantageous in a use case of source code snapshot clustering.

## **References**



# **Overlapping Hierarchical Clustering (OHC)**

Ian Jeantet(B), Zoltán Miklós, and David Gross-Amblard

Univ Rennes, CNRS, IRISA, Rennes, France {ian.jeantet,zoltan.miklos,david.gross-amblard}@irisa.fr

**Abstract.** Agglomerative clustering methods have been widely used by many research communities to cluster their data into hierarchical structures. These structures ease data exploration and are understandable even for non-specialists. But these methods necessarily result in a tree, since, at each agglomeration step, two clusters have to be merged. This may bias the data analysis process if, for example, a cluster is almost equally attracted by two others. In this paper we propose a new method that allows clusters to overlap until a strong cluster attraction is reached, based on a density criterion. The resulting hierarchical structure, called a quasi-dendrogram, is represented as a directed acyclic graph and combines the advantages of hierarchies with the precision of a less arbitrary clustering. We validate our work with extensive experiments on real data sets and compare it with existing tree-based methods, using a new measure of similarity between heterogeneous hierarchical structures.

## **1 Introduction**

Agglomerative hierarchical clustering methods are widely used to analyze large amounts of data. These successful methods construct a dendrogram – a tree structure – that enables a natural exploration of data which is very suitable even for non-expert users. Various tools offer intuitive top-down or bottom-up exploration strategies, zoom-in and zoom-out operations, etc.

Let us consider the following real-life scenario: a social science researcher would like to understand the structure of specific scientific domains based on a large corpus of publications, such as dblp or Wiley. A contemporary approach is to construct a word embedding [23] of the key terms in publications, that is, to map terms into a high-dimensional space such that terms frequently used in the same context appear close together in this space (for the sake of simplicity, we omit interesting issues such as preprocessing, polysemy, etc.). Identifying, for example, the denser regions in this space directly leads to insights on the key terms of Science. Moreover, building a dendrogram of key terms using an agglomerative method is typically used [9,14] to organize terms into hierarchies. This dendrogram (Fig. 1a) eases data exploration and is understandable even for non-specialists of data science.

Despite its usefulness, the dendrogram structure might be limiting. Indeed, any embedding of key terms has a limited precision, and key-term proximity is a debatable question. For example, in Fig. 1a, we can see that the *bioinformatics* key term is almost equally attracted by *biology* and *computing*, meaning that these terms appear frequently together, but in different contexts (e.g. different scientific conferences). Unfortunately, with classical agglomerative clustering, a merging decision has to be made, even if the advantage of one cluster over another is very small. Let us suppose that, arbitrarily, *biology* and *bioinformatics* are merged. This may suggest to our analyst (not an expert in computer science) that *bioinformatics* is part of *biology*, and its link to *computing* may only appear at the root of the dendrogram. Clearly, an interesting part of the information is lost in this process.

In this paper, our goal is to combine the advantages of hierarchies while avoiding early cluster merges. Going back to the previous example, we would like to provide two different clusters showing that *bioinformatics* is close both to *biology* and *computing*. At a larger level of granularity, these clusters will still collapse, showing that these terms belong to a broader community. This way, we deviate from the strict notion of trees, and produce a directed acyclic graph that we call a quasi-dendrogram (Fig. 1b).

(a) A classical dendrogram, hiding the early relationship between *bioinformatics* and *computing*.

(b) A **quasi-dendrogram**, preserving the relationships of *bioinformatics*.

**Fig. 1.** Dendrogram and quasi-dendrogram for the structure of Science.

Our contributions are the following:


The rest of the paper is organized as follows: Sect. 2 describes our proposed overlapping hierarchical clustering framework<sup>1</sup>. Section 3 details our experimental evaluation. Section 4 presents the related works, while Sect. 5 concludes the paper.

<sup>1</sup> Source code available at https://gitlab.inria.fr/ijeantet/ohc.

## **2 Overlapping Hierarchical Clustering**

### **2.1 Intuition and Basic Definitions**

In a nutshell, our method obtains clusters in a gradual agglomerative fashion and in a precise way. At each step, when we increase the neighbourhood of the clusters by including more interconnections, we consider the points that fall in this connected neighbourhood and we take the decision to merge some of them whenever they are connected enough to a cluster using a *density criterion* λ. Taking interconnections into account may lead to overlapping clusters.

More precisely, we consider a set $V = \{X\_1, \ldots, X\_N\}$ of $N$ points in an $n$-dimensional space, i.e. $X\_i \in V \subset \mathbb{R}^n$ where $n \geq 1$ and $|V| = N$. In order to explore this space in an iterative way, we consider points that are close up to a limit distance $\delta \geq 0$. We define the $\delta$-neighbourhood graph of $V$ as follows:

**Definition 1 (**δ**-neighbourhood graph).** *Let $V \subset \mathbb{R}^n$ be a finite set of data points and $E \subset V^2$ a set of pairs of elements of $V$, let $d$ be a metric on $\mathbb{R}^n$ and let $\delta \geq 0$ be a positive number. The $\delta$-neighbourhood graph $G\_\delta(V, E)$ is a graph with vertices labelled with the data points in $V$, and where there is an edge $(X, Y) \in E$ between $X \in V$ and $Y \in V$ if and only if $d(X, Y) \leq \delta$.*

*Property 1.* If $\delta = 0$ then the $\delta$-neighbourhood graph consists of isolated points, while if $\delta = \delta\_{max}$, where $\delta\_{max}$ is the maximum distance between any two nodes in $V$, then $G\_\delta(V, E)$ is the complete graph on $V$.

Varying $\delta$ allows us to progressively extend the neighbourhood of the vectors to form bigger and bigger clusters. Clusters are formed according to the density of a region of the graph.

**Definition 2 (Density).** *The density [16]* dens(G) *of a graph* $G(V, E)$ *is given by the ratio of the number of edges of* $G$ *to the number of edges of* $G$ *if it were a complete graph, that is,* $\text{dens}(G) = \frac{2|E|}{|V|(|V|-1)}$*. If* $|V| = 1$*,* $\text{dens}(G) = 1$*.*

A cluster is simply defined as a subset of the nodes of the graph and its density is defined as the density of the corresponding subgraph.
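To make these definitions concrete, here is a small sketch of the $\delta$-neighbourhood graph and the density computation; the helper names are ours, and the cosine distance anticipates the choice made in Sect. 2.2.

```python
import itertools
import networkx as nx
from scipy.spatial.distance import cosine

def neighbourhood_graph(points, delta):
    """G_delta(V, E): one vertex per point, an edge whenever d(X, Y) <= delta
    (Definition 1); here d is the cosine distance."""
    g = nx.Graph()
    g.add_nodes_from(range(len(points)))
    for i, j in itertools.combinations(range(len(points)), 2):
        if cosine(points[i], points[j]) <= delta:
            g.add_edge(i, j)
    return g

def density(g, cluster):
    """Density of the subgraph induced by `cluster` (Definition 2)."""
    n = len(cluster)
    if n == 1:
        return 1.0
    m = g.subgraph(cluster).number_of_edges()
    return 2 * m / (n * (n - 1))
```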

### **2.2 Computing Hierarchies with Overlaps**

Our algorithm, called OHC, computes a hierarchy of clusters that we can identify in the data. We call the generated structure a quasi-dendrogram and it is defined as follows.

**Definition 3 (Quasi-dendrogram).** *A quasi-dendrogram is a hierarchical structure, represented as a directed acyclic graph, where the nodes are labelled with a set of data points, the clusters, such as:*


The OHC method works as presented in Algorithm 1. We first compute the distance matrix of the data points (l. 3). We chose the cosine distance, widely used in NLP. Then we construct and maintain the $\delta$-neighbourhood graph $G\_\delta(V, E)$, starting from $\delta = 0$ (l. 4).

We also initialize the set of clusters, i.e. the leaves of our quasi-dendrogram, with the individual data points (l. 4). At each iteration, we increase $\delta$ (l. 6) and consider the newly added links of the graph (l. 8) and the impacted clusters (l. 9). We extend these clusters by integrating the most linked neighbour vertices as long as the density does not change by more than a given threshold $\lambda$ (l. 10–15). We remove all the clusters included in these extended clusters (l. 16) and add the new set of clusters to the hierarchy as a new level (l. 18). We stop when all the points are in the same cluster, which means that we have reached the root of the quasi-dendrogram.

To improve the efficiency of this algorithm we also use dynamic programming to avoid recomputing information related to the clusters, such as their density and the list of their neighbour vertices. This leads to significant improvements in the execution time of the algorithm, as discussed further in Sect. 3.3.
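For readers who prefer code, the following loose Python sketch mirrors the loop just described, reusing `neighbourhood_graph` and `density` from the sketch above; the greedy growth rule is our reading of l. 10–15, not a faithful reproduction of Algorithm 1.

```python
def ohc(points, lam, deltas):
    """Loose sketch of the OHC loop. `deltas` is the increasing sequence of
    distance thresholds to scan; `lam` is the density criterion lambda."""
    n = len(points)
    clusters = {frozenset([i]) for i in range(n)}          # leaves (l. 4)
    levels = [set(clusters)]
    for delta in deltas:                                   # increase delta (l. 6)
        g = neighbourhood_graph(points, delta)
        grown_set = set()
        for c in clusters:
            grown = set(c)
            # candidates ordered by how many links they have into the cluster
            cand = sorted((v for v in g.nodes if v not in grown),
                          key=lambda v: -sum(g.has_edge(u, v) for u in grown))
            for v in cand:                                 # grow if density allows
                if abs(density(g, grown) - density(g, grown | {v})) <= lam:
                    grown.add(v)
            grown_set.add(frozenset(grown))
        # drop clusters contained in an extended cluster (l. 16)
        clusters = {c for c in grown_set if not any(c < d for d in grown_set)}
        levels.append(set(clusters))                       # new level (l. 18)
        if frozenset(range(n)) in clusters:                # reached the root
            break
    return levels
```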

*Property 2 (*λ = 0*).* When $\lambda = 0$, each level $\delta\_i$ of a quasi-dendrogram contains exactly the cliques (complete subgraphs) of the $\delta\_i$-neighbourhood graph $G\_{\delta\_i}$.

*Property 3 (*λ = 1*).* When $\lambda = 1$, each level $\delta\_i$ of a quasi-dendrogram contains exactly the connected subgraphs of the $\delta\_i$-neighbourhood graph $G\_{\delta\_i}$.

## **3 Experimental Evaluation**

#### **3.1 Experimental Methodology**

**Tests:** The tests we performed focused on the quality of the hierarchical structures produced by our algorithm. To measure this quality we used the classical hierarchy produced by *SLINK*, an optimal single-linkage clustering algorithm proposed by *Sibson* [28], as a baseline. Our goal was to study the behaviour of the **merging criterion** parameter $\lambda$ that we introduced, as well as its influence on the **execution time**, to verify whether for $\lambda = 1$ we experimentally obtain the same hierarchy as *SLINK* (Property 3), and hence observe the **conservativeness** of our algorithm. We also compared our method to other agglomerative

### **Algorithm 1.** Overlapping Hierarchical Clustering (OHC)


methods such as the *Ward* variant [29] and *HDBSCAN\** [8]. To compare such structures we needed to create a new similarity measure which is described in Sect. 3.2.

**Datasets:** To partially assess the scalability of our algorithm, but also to avoid overly long running times, we had to limit the size of the datasets to a few thousand vectors. To be able to compare the results, we ran the tests on datasets of the same size, fixed to **1000 vectors**.


**Experimental Setting:** All our experiments were done on an Intel Xeon 5 Core 1.4 GHz, running MacOS 10.2, with an SSD hard drive. Our code is developed with

<sup>2</sup> https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining.

Python 3.5, and the visualization part was done in a Jupyter notebook. We used the *SLINK* and *Ward* implementations from the scikit-learn Python package and the *HDBSCAN\** implementation of *McInnes et al.* [21].

### **3.2 A Hierarchy Similarity Measure**

As there is no ground truth on the hierarchy of the data we used, we need a similarity measure to compare the hierarchical structures produced by hierarchical clustering algorithms. The goal is not only to compare the topology but also the content of the nodes of the structure. However, to the best of our knowledge, there is very little in the literature about hierarchy comparison, especially when the structure is similar to a DAG or a quasi-dendrogram. *Fowlkes and Mallows* [19] defined a similarity measure per level, and the new similarity function we propose is based on the same principle. First we construct a similarity between two given levels of the hierarchies, and then we extend it to the global structures by exploring all the existing levels.

**Level Similarity:** Given two hierarchies $h\_1$ and $h\_2$ and a cardinality $i$, we assume that it is possible to identify a set $l\_1$ (resp. $l\_2$) of $i$ clusters for a given level of hierarchy $h\_1$ (resp. $h\_2$). Then, to measure the similarity between $l\_1$ and $l\_2$, we take, for each cluster of $l\_1$, the maximal Jaccard similarity between it and every cluster of $l\_2$. The average of these similarities, one for each cluster of $l\_1$, gives us the similarity between the two sets. If we consider the similarity matrix of $h\_1$ and $h\_2$, with a cluster of $l\_1$ for each row, a cluster of $l\_2$ for each column, and the Jaccard similarity between each pair of clusters at the respective coordinates in the matrix, we can compute the similarity between $l\_1$ and $l\_2$ by taking the average of the maximal value of each row. Hence, the similarity function between two sets of clusters $l\_1$, $l\_2$ is defined as:

$$\text{sim}\_l(l\_1, l\_2) = \text{mean}\{\max\{J(c\_1, c\_2) \mid c\_2 \in l\_2\} \mid c\_1 \in l\_1\} \tag{1}$$

where J is the Jaccard similarity function.

However, taking the maximal value of each row only shows how the clusters of the first set are represented in the second. If we take the maximal value of each column, we see the opposite, i.e. how the second set is represented in the first. Hence, with this definition the similarity might not be symmetrical, so we propose a corrected similarity measure that shows how both sets are represented in each other:

$$\text{sim}\_l^\*(l\_1, l\_2) = \text{mean}(\text{sim}\_l(l\_1, l\_2), \text{sim}\_l(l\_2, l\_1))\tag{2}$$
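Both level similarities translate directly into code; a minimal sketch with clusters represented as Python sets (the function names are ours):

```python
def jaccard(c1: set, c2: set) -> float:
    return len(c1 & c2) / len(c1 | c2)

def sim_level(l1, l2):
    """Eq. (1): average over the clusters of l1 of their best Jaccard match in l2."""
    return sum(max(jaccard(c1, c2) for c2 in l2) for c1 in l1) / len(l1)

def sim_level_sym(l1, l2):
    """Eq. (2): symmetrised level similarity."""
    return (sim_level(l1, l2) + sim_level(l2, l1)) / 2
```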

**Complete Similarity:** Now that we can compare two levels of the hierarchical structures, we can simply average the similarity over corresponding levels of the same size. For classical dendrograms, each level has a distinct number of clusters, so the identification of levels is easy. Conversely, our quasi-dendrograms may have several distinct levels (pseudo-levels) with the same number of clusters. If so, we need to find the best similarity between these pseudo-levels. For a given level (i.e. number of clusters), we want to build a matching $M$ that maps each pseudo-level $l\_1^1, l\_1^2, \ldots$ of $h\_1$ to at least one pseudo-level $l\_2^1, l\_2^2, \ldots$ of $h\_2$ and conversely (see Fig. 2). This matching $M$ should maximize the similarity between pseudo-levels while preserving their hierarchical relationship. That is, for $a, b, c, d$ representing the heights of pseudo-levels in the hierarchies, if $(l\_1^a, l\_2^c) \in M$ and $(l\_1^b, l\_2^d) \in M$, then $(b \geq a \Rightarrow d \geq c)$ and $(b < a \Rightarrow d < c)$ (no "crossings" in $M$, such as $(l\_1^{231}, l\_2^{303})$ together with $(l\_1^{230}, l\_2^{304})$).

To produce this mapping, our simple algorithm is the following. We initialize $M$ and two pointers with the two highest pseudo-levels ($(l\_1^{231}, l\_2^{304})$, step 1 of Fig. 2). At each step, for each hierarchy, we consider the current pointers and their children, and compute all their similarities (step 2). We then add the pseudo-levels with maximal similarity to $M$ (here, $(l\_1^{230}, l\_2^{303})$). Whenever a child is chosen, the respective pointer advances, and at each step at least one pointer advances. Once the pseudo-levels have been consumed on one side, ending with $l^f$, we finish the process by adding $(l^f, l)$ to $M$ for every remaining pseudo-level $l$ on the other side (here, $l^f = l\_1^{230}$). In our example, the final matching is $M = \{(l\_1^{231}, l\_2^{304}), (l\_1^{230}, l\_2^{303}), (l\_1^{230}, l\_2^{302}), (l\_1^{230}, l\_2^{301}), (l\_1^{230}, l\_2^{300}), (l\_1^{230}, l\_2^{299})\}$.

**Fig. 2.** Computing the similarity between two quasi-dendograms h<sup>1</sup> and h<sup>2</sup> for levels having the same number of clusters.

Finally, from (2) we define the similarity between two hierarchies as

$$\text{sim}(h\_1, h\_2) = \text{mean}\{\text{sim}\_l^\*(l\_1, l\_2) \mid (l\_1, l\_2) \in M\}.\tag{3}$$

#### **3.3 Experimental Results**

**Expressiveness:** With the following small example we would like to illustrate the expressiveness of our algorithm compared to classical hierarchical clustering algorithms such as *SLINK*. In the hand-built example shown in Fig. 3a we can clearly distinguish two groups of points, {A, B, C, D, E} and {G, H, I, J, K}, and two points that we can consider as noise, F and L. Due to the chaining effect we expect that the *SLINK* algorithm will regroup the two sets of points early in the hierarchy, while we would like to prevent this by allowing some cluster overlaps.

Figure 3b shows the dendrogram computed by *SLINK*, and we can see, as expected, that when F merges with the cluster formed by {A, B, C, D, E}, the next step is to merge this new cluster with {G, H, I, J, K}.

On the contrary, Fig. 4, which presents the hierarchy built with our method for a specific merging criterion, shows an example of the diamond shape that

**Fig. 3.** A hand-built example (a) and its *SLINK* dendrogram (b).

is specific to our quasi-dendrogram. For simplicity, the view here slightly differs from the quasi-dendrogram definition, as we used dashed arrows to represent the provenance of some elements of a cluster instead of going further down in the hierarchy to have a perfect inclusion and respect the lattice-like structure. The merge between the clusters {A, B, C, D, E} and {G, H, I, J, K} is delayed to the very last moment, and the point F belongs to these two clusters instead of forcing them to merge. Also, depending on the merging criterion, we obtain different hierarchical structures by merging some clusters earlier or later.

**Merging Criterion:** As we can see in Fig. 5b, when the merging criterion increases we obtain a hierarchy more and more similar to the one produced by the classical *SLINK* algorithm, until we obtain exactly the same hierarchy for a merging criterion of 1. Knowing this, it is also expected that the similarity between OHC and *Ward* (resp. *HDBSCAN\**) hierarchies converges to the similarity between *SLINK* and *Ward* (resp. *HDBSCAN\**) hierarchies. However, we can notice that the OHC and *Ward* hierarchies are most similar for a merging criterion smaller than 1.

**Fig. 4.** *OHC* quasi-dendrogram obtained from the hand-built example in Fig. 3a for λ = 0.2.

(a) Execution time according to the number of vectors.

(b) Similarity between hierarchical structures according to the merging criterion.

**Fig. 5.** Study of the merging criterion.

**Execution Time:** We observe that when the merging criterion increases, the execution time decreases. This is due to the fact that when the merging criterion increases, we are more likely to completely merge clusters, so we reach the top of the hierarchy faster. This means fewer levels and fewer overlapping clusters, and therefore less computation. However, in this case we have the same drawback of the chaining effect as the single-linkage clustering that we wanted to avoid. Even though it was not the objective of this work, we set $\lambda = 0.1$, as it is an interesting value according to the study of the merging criterion (Fig. 5b), to observe the evolution of the execution time (Fig. 5a). The trend gives a function in $O(n^{2.45})$, so to speed up the process and scale up our algorithm it is possible to precompute a set of possibly overlapping clusters over a given $\delta$-neighbourhood graph with a classical method, for instance CLIQUE, and build the OHC hierarchy on top of that.

### **4 Related Work**

Our goal is to group together data points represented as vectors in R<sup>n</sup>. For our motivating application domain of understanding the structure of scientific fields, it is important to construct structures (i) that are hierarchical, (ii) that allow overlaps between the identified groups of vectors and (iii) which groups (clusters) are related to dense areas of the data. There are a number of other application domains where obtaining a structure with these properties is important. In the following, we relate our work to relevant literature.

**Hierarchical Clustering:** There exist two kinds of hierarchical clustering. Divisive methods follow a top-down strategy, while agglomerative techniques compute the hierarchy in a bottom-up fashion and produce the well-known dendrogram structure [1]. One of the oldest methods is single-linkage clustering, which first appeared in the work of *Florek et al.* [18]. It saw many improvements over the years, until an optimal algorithm named *SLINK* was proposed by *Sibson* [28]. However, the commonly cited drawback of single-linkage clustering is that it is not robust to noise and suffers from chaining effects (spurious points merging clusters prematurely). This led to the invention of many variants with their advantages and disadvantages. In the NLP world we have for instance the *Brown clustering* [7] and its generalized version [13]; the drawback of choosing the number of clusters beforehand, present in the original Brown clustering, is corrected in the generalized version. Researchers have also tried to address the chaining-effect problem directly by defining new objective functions, such as in *Robust Hierarchical Clustering* [4,11]. However, these variants do not allow any overlaps in the clusters. Other variants allow fuzzy clustering in the hierarchy, such as *SOHC* [10], a hierarchical clustering based on a spatial overlapping metric but with a fixed number of clusters, or *HCOSM* [26], which uses an overlap similarity measure to merge clusters and then computes a hierarchy from an already determined set of clusters. Generalizations of the dendrogram to more complex structures, like *Pyramidal Clustering* [15] and *Weak Hierarchies* [5], were also proposed. We can find examples proving that our method produces even more general hierarchical structures, which include the weak hierarchies.

**Density-Based Clustering:** Another important class of work is density-based clustering. Here, clusters are defined as regions in the data that have a higher density. The data points in the sparse areas that are required to separate clusters are considered noise or border points. One of the most widely used algorithms of this category is *DBSCAN*, defined by *Ester et al.* [17]. This method connects data points that satisfy a specific density-based criterion: the minimum number of other data points within a given radius must be above a predefined threshold. The main advantage of this method is that it allows detecting clusters of arbitrary shapes. More recently, improved versions of *DBSCAN* were proposed, such as *HDBSCAN\** [8]. This new variant not only improved notions from *DBSCAN* and *OPTICS* [3] but also proposed a procedure to extract a simplified cluster tree from the reachability relation, which allows determining a hierarchy of the clusters, but again without overlaps.

**Overlapping Clustering:** Fuzzy clustering methods [6] allow certain data points to belong to multiple clusters with different levels of confidence. In this way, the boundary of clusters is fuzzy and we can talk about overlaps of these clusters. Ours is a different notion: in our definition, a data point either does or does not belong to a specific cluster, and it may belong to multiple clusters. While *HDBSCAN* is closely related to connected components of certain level sets, its clusters do not overlap (since an overlap would imply connectivity).

**Community Detection in Networks:** A number of algorithmic methods have been proposed to identify communities. The first kind of methods produces a partition where a vertex can belong to one and only one community. Following the *modularity* function of *Newman and Girvan* [24], numerous quality functions have been proposed to evaluate the goodness of a partition, with a fundamental drawback: the now-proven existence of a resolution limit. The second kind of methods, such as *CLIQUE* [2], *k-clique* [25], *DBLC* [31] or *NMF* [30], aims at finding sets of vertices that respect an edge-density criterion, which allows overlaps but can lead to an incomplete cover of the network. Similarly to *HCOSM*, the method *EAGLE* [27] builds a dendrogram over a set of predetermined clusters, here the maximal cliques of the network, so overlaps appear only at the leaf level. Coscia et al. [12] have proposed an algorithm to reconstruct a hierarchical and overlapping community structure of a network by hierarchically merging local ego neighbourhoods.

## **5 Conclusion and Future Work**

We propose an overlapping hierarchical clustering framework. We construct a quasi-dendrogram hierarchical structure to represent the clusters, which is not necessarily a tree but a directed acyclic graph. In this way, at each level, we represent a set of possibly overlapping clusters. We experimentally evaluated our method on several datasets using our new similarity measure, which hence proved its usefulness. If the clusters present in the data show no overlaps, the obtained clusters are identical to the clusters we can compute using agglomerative clustering methods. In the case of overlapping and nested clusters, however, our method results in a richer representation that can contain relevant information about the structure of the clusters of the underlying dataset. As future work we plan to identify interesting clusters on the basis of the concept of stability. Such methods give promising results in the context of hierarchical density-based clustering [21], but the presence of overlaps in the clusters requires specific considerations.

## **References**



# **Digital Footprints of International Migration on Twitter**

Jisu Kim<sup>1(B)</sup>, Alina Sîrbu<sup>2(B)</sup>, Fosca Giannotti<sup>3(B)</sup>, and Lorenzo Gabrielli<sup>3(B)</sup>

<sup>1</sup> Scuola Normale Superiore, Pisa, Italy jisu.kim@sns.it <sup>2</sup> University of Pisa, Pisa, Italy alina.sirbu@unipi.it <sup>3</sup> Istituto di Scienza e Tecnologie dell'Informazione, National Research Council of Italy, Pisa, Italy {fosca.giannotti,lorenzo.gabrielli}@isti.cnr.it

**Abstract.** Studying migration using traditional data has some limitations. To date, there have been several studies proposing innovative methodologies to measure migration stocks and flows from social big data. Nevertheless, a uniform definition of a migrant is difficult to find as it varies from one work to another depending on the purpose of the study and nature of the dataset used. In this work, a generic methodology is developed to identify migrants within the Twitter population. This describes a migrant as a person who has the current residence different from the nationality. The residence is defined as the location where a user spends most of his/her time in a certain year. The nationality is inferred from linguistic and social connections to a migrant's country of origin. This methodology is validated first with an internal gold standard dataset and second with two official statistics, and shows strong performance scores and correlation coefficients. Our method has the advantage that it can identify both immigrants and emigrants, regardless of the origin/destination countries. The new methodology can be used to study various aspects of migration, including opinions, integration, attachment, stocks and flows, motivations for migration, etc. Here, we exemplify how trending topics across and throughout different migrant communities can be observed.

**Keywords:** International migration *·* Emigration *·* Big data *·* Twitter

This work was supported by the European Commission through the Horizon2020 European project "SoBigData Research Infrastructure—Big Data and Social Mining Ecosystem" (grant agreement no 654024) and partially by the Horizon2020 European project "HumMingBird – Enhanced migration measures from a multidimensional perspective" (grant agreement no 870661).


## **1 Introduction**

Understanding where migrants are is an important topic because it touches upon multidimensional aspects of the sending and receiving countries' societies. Not only the demographic fabric of countries, but also labour market and economic conditions may alter due to demographic adjustment. Understanding migrants' allocation is essential for both policy makers and researchers to make the best of migration's effects.

Official data such as census, survey and administrative data have been traditionally the main data source to study migration. However, these data have some limitations [12]. They are inconsistent across different nations because countries employ different definitions of a migrant. Moreover, collecting traditional data is costly and time consuming, thus tracking instantaneous stocks of migrants becomes difficult. This becomes even harder when tracking emigrants because of the lack of motivation from citizens to declare their departure.

In recent years, however, alternative data sources for migration have become available. The availability of social big data allows us to study social behaviours both at large scale and at a granular level, and to peek into real-world phenomena. Although known to suffer from other types of issues, such as selection bias, these data can bring complementary value to standard statistics.

Here, we propose a method to identify migrants based on Twitter data, to be used in further analyses. According to the official definition, a migrant<sup>1</sup> is "a person who moves to a country other than that of his or her usual residence for a period of at least a year". In the context of Twitter, we define a migrant as "*a person who has the current residence different from the nationality*".

Following this definition, we performed a two-step analysis. First, we estimated the current residence of users by examining location information from tweets. The residence is defined as the country where the user spends most of the time in a year. Second, we estimated nationality by considering the social network of users. In the international literature, nationality is defined as a relationship between a state and an individual, with rights and duties on both sides [1,6]. Related concepts are ethnicity (in terms of cultural features) and citizenship (in terms of political life). In this paper, we employ the term nationality to denote the ensemble of features that make a person feel like they belong to a certain country [2,5]. This could be the country where a person was born, raised and/or lived most of their life. By comparing the residence and nationality labels of a user, we were able to understand whether the person has moved from their home country to a host country, and thus whether they are a migrant. We validated our estimation internally, from the data itself, and externally, with two official datasets (Italian register and Eurostat data).

One of the advantages of our methodology is that it is generic enough to allow for identification of both immigrants and emigrants. We also overcome one of the limitations of traditional data by setting up a uniform definition of

<sup>1</sup> Recommendations on Statistics of International Migration, Revision 1 (p. 113). United Nations, 1998.

a migrant across different countries. Furthermore, our definition of a migrant is very close to the official definition. We establish the fact that a person has spent a significant period at the current location, and we eliminate visitors or short-term stays that do not follow the definition of a migrant. This is also validated by the comparison with official datasets. Another advantage of our method is that it uses only very basic features of the Twitter data: location, language and network information. This is useful since the settings of the freely available Twitter API change constantly; some of the user attributes that the existing literature uses to estimate nationality are no longer available. In addition, we make use of unknown locations of tweets by examining whether they intersect with identified locations. By doing so, we do not neglect the information provided by tweets from unknown locations, which later provide useful information on trending topics of Italian emigrants overseas.

One of the issues with our method is that the migrants that we observed are selected from the Twitter population, and not from the general world population, and it is known that some demographic groups are missing. Nevertheless, we believe that studying the Twitter migrant population can provide important insight into migration phenomena, even if some findings may not apply to the other demographic groups that are not represented in the data.

It is important to note that tracking individual migrants is not the objective of our study, but it is only an intermediate stage to enable further analyses. We simply perform user classification to identify migrants among users in our data, and then aggregate the findings. Further studies we envision are aimed at devising new population-level indices useful to evaluate and improve the quality of life of migrants, through targeted evidence-based policy making. No individual personal information nor migration status is released at any stage during the current analysis, nor in any population-level analysis, which is performed following the highest ethical and privacy standards.

The rest of the paper is organised as follows. In the next section we describe related work that studies migration using big data. In Sect. 3, we provide details of the experimental setting for data collection as well as data pre-processing. We then explain our identification strategy for both residence and nationality in Sect. 4. In Sect. 5, we evaluate our estimation using both internal and external data. Section 6 covers a possible application of our method on studying trending topics among Italian emigrants, while Sect. 7 concludes the paper.

## **2 Related Work**

In the past few years, there have been several works on migration studies using social big data. Most of these employed Twitter data but Facebook, Skype, Email as well as Call Detail Record (CDR) data have also been used to study both international and internal migration [3,9,10,14,16]. Here, we focus on studies that have employed freely available data. The definition of a migrant varied from one work to another depending on the purpose of the study and the nature of the dataset. Thus, the definitions provided fit under different types of migration such as refugees, internal migrants, seasonal migrants or even visitors.

One example of using Twitter to observe migration flows is [15]. They defined residence as the country from which tweets were most frequently sent out over periods of four months. If one's residence changed in the following four-month period, it was considered that the person had moved. In a more recent work, [11] measured migration flows from Venezuela to neighbouring countries between 2015 and 2019. They looked at the bounding boxes and country labels provided by the tweets and identified the most common country of tweets posted monthly. Their definition of a migrant was "any individual leaving Venezuela during the time window of observation", which was observed when an identified Venezuelan resident appeared for the first time in a different country. Our definition of residence is somewhat similar to these works. However, unlike them, we measure stocks of migrants, not flows. Thus, we take into account the duration of stay, which naturally eliminates short-term trips and visits.

Apart from geo-tagged tweets, there is other information provided by the Twitter API that can help us infer whether a person is a migrant or not. Although [8] did not directly study migrants, but looked at foreigners present in Qatar, it provides important insights into which of the features provided by Twitter are useful in identifying the nationality of users. They gathered features from both the profiles and tweets of users. For features based on profile pictures and names, they performed facial recognition and name-ethnicity detection. Their final results showed that ethnicity of name, race, language of tweets, language of mentions, and location of followers and friends are the six most useful features. In this paper, we purely employ data provided by Twitter for the analysis and therefore do not have name, ethnicity and race features. Nevertheless, our work also shows that the locations of users and friends are useful features. The difference here is that we propose to use the social network of users as one of the main features in identifying nationality, which is more flexible than having to perform ethnicity detection on names and profile pictures.

## **3 Experimental Setting for Data Collection**

We began with a Twitter dataset collected by the SoBigData.eu Laboratory [4]. We started from a three-month period of geo-tagged tweets from August to October 2015. Due to our focus on Italy, we selected from these data the users that tweeted from Italy, obtaining 34,160 users. We then crawled the network of geo-enabled friends of these 34,160 users, using the Twitter API. Friends are the people that the individual users are following. We focused on friends because we believe that, for a user, the information on whom they follow is more informative when it comes to nationality than whom they are followed by. We concentrated on geo-enabled friends because geo-location is necessary for our analysis. By collecting friends, the list of users crossed our initial geographic boundary, i.e., Italy. At this stage, the number of unique users grew to over 250,000. For all users we also scraped the profile information and the 200 most recent tweets using the Twitter API. During this process, we were able to collect all 200 recent tweets for 97% of users and at least 55 tweets for 99% of users. Our final user network consists of 258,455 nodes and 1,205,133 edges, which includes both our initial 34,160 users and their geo-tagged friends.

For identifying migration status, we focus on the 34,160 core users. We assign a residence and a nationality to each user based on the geo-locations included in the data, the language of the tweets and the profile information. The final dataset includes 237 unique countries from which individuals have sent tweets, plus an 'undefined' location: even when a user enables geo-tags, not all of their tweets are geo-tagged, and as a result 21% of the tweets in our data are 'undefined'. As for languages, there are 66 unique languages, and 12% of the tweets are in English.

**Fig. 1.** Distribution of the number of days (left) and the number of tweets (right) observed in the data per user: on average, users tweeted on 47 days and posted 82 tweets in 2018.

As for the profile features, we observe that 40% of the users have filled out the location description, and most users have set their profile language to English. The number of unique profile languages detected in our data is 58, which is smaller than the number of tweet languages, indicating that some users tweet in languages other than their profile language.

In order to assign a place of residence, we needed to restrict the observation period. We chose to look at one year of tweets in order to assign residence labels for the 2018 solar year, and selected the users that tweeted in 2018, identifying 128,305 of them. To remove bots, we checked whether a user tweets excessively: we considered an average of more than 50 tweets per day excessive and eliminated 39 users in this way. In addition, we removed users that were not very active in 2018: if a user had fewer than 20 tweets, we checked whether their tweeting days were spread out over the year, filtering out the user if they were not (well-spread days indicate a regular, if infrequent, tweeter). This step removed 10,764 users. After removing bots and inactive users, 117,502 users remain; a sketch of this filtering step is given below. For these users, Fig. 1 shows the distribution of the number of tweets and of the number of days on which they tweeted: on average we see 47 days and 82 tweets.
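As a concrete illustration, here is a minimal pandas sketch of this filter, assuming tweets are available as a DataFrame with `user_id` and `date` columns; the concrete "spread out" criterion (tweets in at least six distinct months) is an illustrative assumption.

```python
import pandas as pd

def filter_users(tweets: pd.DataFrame) -> pd.Index:
    """tweets: one row per tweet, with columns 'user_id' and 'date' (datetime64)."""
    stats = tweets.groupby("user_id").agg(
        n_tweets=("date", "size"),
        n_days=("date", lambda d: d.dt.date.nunique()),
        n_months=("date", lambda d: d.dt.to_period("M").nunique()),
    )
    is_bot = stats.n_tweets / stats.n_days > 50             # >50 tweets per active day
    sparse = (stats.n_tweets < 20) & (stats.n_months < 6)   # few tweets, not spread out
    return stats.index[~is_bot & ~sparse]                   # ids of retained users
```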

In addition to the Twitter data, we also collected a list of official and spoken languages for countries identified in our data<sup>2</sup>.

## **4 Identifying Migrants**

A migrant is a person whose residence differs from their nationality. We thus consider our 34,160 core Twitter users and assign each of them a residence and a nationality based on the information included in our dataset. The difference between the two labels allows us to detect individuals who have migrated and are currently living in a place different from their home country. The methodology we propose is based on a simple hypothesis: a person who has moved away from their home country stays in contact with friends back in the home country and may keep using their mother tongue.

### **4.1 Assigning Residence**

For a place to be called a residence, a person has to spend a considerable amount of time at that location. Our definition of residence is therefore based on the amount of time a Twitter user is observed in a country during a given solar year. More precisely, the residence of a user is the country with the longest length of stay, calculated by taking into account both the number of days on which the user tweets from a country and the periods between consecutive tweets in the same country. In this work we compute residences based on 2018 data.

To compute the residence, we first count, for each user and each country, the number of days with tweets. If the top country is not 'undefined', it is chosen as the residence. Otherwise, we check whether any tweet with an 'undefined' country was sent on the same day as a tweet from the second top country; if at least one date matches between the two locations (on average, 5 dates matched), we assign the second country as the user's place of residence. This rests on the assumption that a user cannot tweet from two different countries on the same day; although this fails when a user travels, it holds for most days of the year. This approach, sketched below, allowed us to assign a 2018 residence to 57,180 users.
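A minimal sketch of this first rule, assuming the per-user tweet dates have already been grouped by country (the input format is an assumption made for illustration):

```python
import datetime
from typing import Dict, Optional, Set

def residence_by_days(days: Dict[str, Set[datetime.date]]) -> Optional[str]:
    """days maps each country to the set of dates on which the user tweeted from it."""
    ranked = sorted(days, key=lambda c: len(days[c]), reverse=True)
    if not ranked:
        return None
    if ranked[0] != "undefined":
        return ranked[0]
    # Top location is 'undefined': fall back to the runner-up country when at
    # least one tweeting day overlaps (a user is assumed not to tweet from two
    # different countries on the same day).
    if len(ranked) > 1 and days["undefined"] & days[ranked[1]]:
        return ranked[1]
    return None  # deferred to the length-of-stay fallback described next
```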

For the remaining 60,322 users, a slightly different approach was used. We computed the length of stay in each country, in days, by summing the durations between consecutive tweets in the same country, and selected the country with the largest length of stay. If the top country was 'undefined', we checked whether the 'undefined' locations fell between segments of the second top country, in which case the second country was chosen. In this way, an additional 11,046 users were assigned a place of residence. The remaining 49,276 users were discarded, as we considered that we did not have enough information to assign them a residence.

<sup>2</sup> Retrieved from http://www.geonames.org and https://www.worlddata.info.

### **4.2 Assigning Nationality**

In order to estimate the nationality of Twitter users, we took into account two types of information included in our Twitter data. The first relates to the users themselves: the countries from which their tweets are sent and the languages in which they tweet. For each user *u* we define two dictionaries, *loc<sup>u</sup>* and *lang<sup>u</sup>*, which store, for each country and each language respectively, the proportion of the user's tweets in that country/language.

**Fig. 2.** Example of the calculation of the *floc* and *flang* values for a user. The calculation of *floc<sup>U1</sup>* and *flang<sup>U1</sup>* is based on the *loc* and *lang* values of the three friends, showing the distribution of tweets over countries/languages for each.

The second type of information relates to the user's friends: again, we look at the languages friends speak and the locations from which they tweet. Specifically, starting from the *loc* and *lang* dictionaries of all friends of a user, we define two further dictionaries, *floc* and *flang*. The first stores all countries from which friends tweet, together with the average fraction of tweets in that country, computed over all friends:

$$floc^u[C] = \frac{1}{|F(u)|} \sum_{f \in F(u)} loc^f[C] \tag{1}$$

where *F*(*u*) is the set of friends of user *u*. Similarly, the *flang* dictionary stores all languages spoken by friends, with the average fraction of tweets in each language *l*:

$$flang^u[l] = \frac{1}{|F(u)|} \sum_{f \in F(u)} lang^f[l] \tag{2}$$

Figure 2 shows an example of a (fictitious) user with their friends, and the four resulting dictionaries.

The four dictionaries defined above are then used to assign a nationality score to each country *C* for each user *u*:

$$N_C^u = w_{loc}\, loc^u[C] + w_{lang} \sum_{l \in languages(C)} lang^u[l] \;+ \tag{3}$$

$$w_{floc}\, floc^u[C] + w_{flang} \sum_{l \in languages(C)} flang^u[l] \tag{4}$$

where *languages*(*C*) is the set of languages spoken in country *C*, and $w_{loc}$, $w_{lang}$, $w_{floc}$ and $w_{flang}$ are parameters of our model that need to be estimated from the data (one global value estimated for all users). Each *w* value weights the corresponding user attribute in the calculation of the nationality. To select the nationality of each user we simply pick the country with the maximum score: $N^u = \operatorname{argmax}_C N_C^u$.
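Read as code, the scoring of Eqs. 3–4 and the final argmax could look like the following sketch. The four dictionaries are those defined above; `languages_of` stands for the country-to-language lists collected in Sect. 3, and restricting the candidate countries to those appearing in *loc* or *floc* is a simplification made here for illustration.

```python
from typing import Dict, List, Optional, Tuple

def nationality(loc: Dict[str, float], lang: Dict[str, float],
                floc: Dict[str, float], flang: Dict[str, float],
                languages_of: Dict[str, List[str]],
                w: Tuple[float, float, float, float]) -> Optional[str]:
    w_loc, w_lang, w_floc, w_flang = w

    def score(c: str) -> float:  # the nationality score N_C^u of Eqs. 3-4
        langs = languages_of.get(c, [])
        return (w_loc * loc.get(c, 0.0)
                + w_lang * sum(lang.get(l, 0.0) for l in langs)
                + w_floc * floc.get(c, 0.0)
                + w_flang * sum(flang.get(l, 0.0) for l in langs))

    candidates = (set(loc) | set(floc)) - {"undefined"}
    return max(candidates, key=score) if candidates else None
```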

## **5 Evaluation**

To evaluate our strategy for identifying migrants, we first propose an internal validation procedure: we define gold-standard datasets for residence and nationality and compute the classification performance of our two strategies for identifying these user attributes. The gold-standard datasets are produced from profile information, as it is provided by the users themselves. We then perform an external validation, comparing the migrant percentages obtained from our data with official statistics.

### **5.1 Internal Validation: Gold Standards Derived from Our Data**

**Residence.** To devise a gold-standard dataset for residence, we consider the profile locations set by users, assuming that if a user declares a location in their profile, it is most probably their residence. Few users actually declare a location, and not all of them provide a valid one, so we only kept profile locations identifiable at the country level. Among the user accounts for which we could estimate the residence, 3,065 had a valid country in their profile location. Using these accounts as validation data, we computed the F1 score of our residence assignment. Table 1 shows the overall results as well as scores for the most common individual countries. The weighted average F1 score is 86%, with individual countries reaching up to 94%, supporting the validity of our residence estimation procedure.

**Nationality.** To build a gold standard for nationality, we use the profile language declared by the users, on the assumption that the profile language provides a hint of one's nationality [13]. However, many users may not set their profile language and simply keep the default English setting; for this reason, we exclude users whose profile language is English from the gold standard.

**Table 1.** Average precision, recall and F1 scores, together with scores for the top 7 residences in terms of support size.



**Table 2.** Average precision, recall and F1 scores for the top 8 nationalities in terms of support size.

The profile language, however, does not immediately translate into a nationality. While for some languages the correspondence to a country is immediate, for many others it is not: Spanish, for instance, is spoken in Spain and in most American countries, so the correct country needs to be selected. For this, we look at tweet locations. We consider all countries that match the profile language and, among these, select the one with the largest number of tweets, provided it accounts for at least 10% of the user's total tweets. This selects the most probable country, also for users who reside outside their native country. If no location satisfies this criterion, the user is not included in the gold standard. We were able to identify the nationality of 12,223 users. Because data collection focused on geo-tags in Italy, the dataset contains a significant number of Italians.

**Fig. 3.** Distribution of residences and nationalities of top 30 countries, for all users that possess both residence and nationality labels.

We employed this gold-standard dataset in two ways. First, we needed to select suitable values for the *w* weights of Eqs. 3–4, which express the importance of the four components used in the nationality computation: own language and location, and friends' language and location. We performed a simple grid search, sketched below, and obtained the best accuracy on the gold standard with weight 0 for both language components, 2 for the user's own location and 1.5 for the friends' locations. We thus conclude that locations matter most in defining nationality for Twitter users, with a slightly stronger weight on the individual's own location than on the friends'. The final F1 scores, both overall and for the top individual nationalities, are included in Table 2, showing a very good performance in all cases.
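The grid search itself is straightforward; in the sketch below, the grid values and the layout of the user objects are illustrative assumptions, and `nationality` is the scoring function sketched in Sect. 4.2.

```python
import itertools

def fit_weights(users, gold, languages_of, grid=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """users: objects with .id, .loc, .lang, .floc, .flang attributes;
    gold: dict mapping a user id to its gold-standard nationality."""
    best_w, best_acc = None, -1.0
    for w in itertools.product(grid, repeat=4):  # all weight combinations
        hits = sum(nationality(u.loc, u.lang, u.floc, u.flang, languages_of, w)
                   == gold[u.id]
                   for u in users if u.id in gold)
        acc = hits / len(gold)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```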

To assign final residences and nationalities to our core users, we combined the predictions with the gold standards (predicting only where no gold-standard label was available). Figure 3 shows the final distribution of residences and nationalities over the top 30 countries, for all users that have both labels. Users whose residence differs from their nationality can be interpreted as either immigrants or emigrants.

**Fig. 4.** Comparison between the true and predicted data; the first two plots show predicted versus AIRE/EUROSTAT data on European countries. The last plot shows predicted versus AIRE data on non-European countries.

### **5.2 External Validation with Ground-Truth Data**

To validate our results against ground-truth data, we study users labelled with Italian nationality and non-Italian residence, i.e., Italian emigrants. We computed the normalised percentage of Italian emigrants in our data for each country and compared it with two official datasets: AIRE (Anagrafe Italiani residenti all'estero), the Italian register of residents abroad, and Eurostat, the statistical office of the European Union. For the comparison we use Spearman correlation coefficients, which quantify the monotonic relationship between the ground-truth data and our estimates by comparing ranks, as in the snippet below.
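A minimal example with SciPy, using toy per-country values in place of the normalised emigrant percentages described above:

```python
from scipy.stats import spearmanr

# toy aligned per-country emigrant shares: our estimates vs. official counts
predicted = [0.31, 0.12, 0.08, 0.25, 0.05]
official = [0.28, 0.15, 0.07, 0.30, 0.04]

rho, p = spearmanr(predicted, official)  # rank-based, insensitive to scale
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```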

Figure 4 displays the values obtained, compared with the official data. A first interesting remark is that even the official datasets do not match each other completely: the correlation between the two is 0.91. Second, we observe good agreement between our predictions and the official data for European countries: the correlation with AIRE is 0.753, and with Eurostat 0.711. For non-European countries, however, the correlation with AIRE drops to 0.626. We believe the lower performance is due to several factors related to sampling bias and data quality: bias on Twitter and in our methods, but also errors in the official data, which may be larger for non-EU countries due to less efficient information-sharing channels.

All in all, we believe our method shows good performance and can be successfully used to build population level indices for studying migration. We do not aim to perform nowcasting of immigrant stocks, but rather to identify a population that can be representative enough for further analyses.

## **6 Case Study: Topics on Twitter**

In this section we show that our methodology can be employed to study how topics trending in Italy are discussed among Italian emigrants. As an example, we selected a hashtag that has been very popular in recent years: #Salvini, referring to the Italian politician Matteo Salvini, who until recently served as Deputy Prime Minister and Minister of the Interior. To this we added the nine hashtags that most frequently co-occur with #Salvini in our data: Berlusconi, Conti, Diciott, DiMaio, Facciamorete, Legga, M5S, Migrant, Ottoemezzo. They represent people who are often mentioned together with Salvini, political parties, and other issues associated with the hashtag.

**Fig. 5.** Stream graph: appearance of hashtags related to #Salvini from Italians across 10 selected residence countries in 2018. The discussion was continuously present in Italy throughout the year and was increasingly taken up by Italians overseas as Salvini gained political attention.

Figure 5 shows the evolution of the usage of the 10 hashtags above across Italian communities both within Italy and abroad. The values shown are the number of tweets from Italian nationals residing in each country that include one of the 10 hashtags, divided by the total number of tweets from Italian nationals in that country, computed monthly. This yields the monthly popularity of the topic in each country, so that even tweets from less represented countries remain visible. As the figure shows, the hashtags were used continuously by Italians in Italy, and their usage gradually spread to other residence countries as Salvini received more and more attention. Most of the attention comes from Italians residing in Europe, with non-European countries less represented.

## **7 Conclusion and Future Work**

We have developed a new methodology to provide a snapshot of migrants within the Twitter population. We considered the length of stay in a country as the key factor defining a user's residence. For nationality, the connections that migrants maintain with their country of origin provided a good indication: in particular, the locations of friends proved to be a strong feature, together with the locations of the users themselves. Tweet language, on the other hand, was not considered relevant by our model. This is probably because English dominates Twitter: to get attention from other users, one has to tweet in a language that is widely understood. We validated our results both internally and externally; they show good classification performance and good correlation with official datasets.

The constructed dataset can be applied in different scenarios. We have shown how it can be used to study trending topics on Twitter and how attention is divided between emigrants and non-migrants of a given nationality. In the future, we plan to analyse social ties, integration and assimilation of migrants [7], as well as the strength of their ties with the community of origin.

## **References**



# **Percolation-Based Detection of Anomalous Subgraphs in Complex Networks**

Corentin Larroche<sup>1,2(B)</sup>, Johan Mazel<sup>1</sup>, and Stephan Clémençon<sup>2</sup>

<sup>1</sup> French National Cybersecurity Agency (ANSSI), Paris, France
{corentin.larroche,johan.mazel}@ssi.gouv.fr

<sup>2</sup> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France
{corentin.larroche,stephan.clemencon}@telecom-paris.fr

**Abstract.** The ability to detect an unusual concentration of extreme observations in a connected region of a graph is fundamental in a number of use cases, ranging from traffic accident detection in road networks to intrusion detection in computer networks. This task is usually performed using scan statistics-based methods, which require explicitly finding the most anomalous subgraph and thus are computationally intensive.

We propose a more scalable method in the case where the observations are assigned to the edges of a large-scale network. The rationale behind our work is that if an anomalous cluster exists in the graph, then the subgraph induced by the most individually anomalous edges should contain an unexpectedly large connected component. We therefore reformulate our problem as the detection of anomalous sample paths of a percolation process on the graph, and our contribution can be seen as a generalization of previous work on percolation-based cluster detection. We evaluate our method through extensive simulations.

## **1 Introduction**

Detection of a significant connected subgraph in a larger background network is a ubiquitous task: such significant regions can be indicative of fraudulent behavior in social networks [15] or of the propagation of an intruder in a computer network [22], for instance. Therefore, being able to discern them from ambient noise has valuable applications in a number of settings. This anomaly detection problem is, however, remarkably challenging: the large size and complex structure of real-world graphs make the characterization of normal behavior difficult and the search for non-trivial substructures computationally expensive.

The aim of this paper is to propose a scalable method for anomalous connected subgraph detection in a graph with observations attached to its edges. The null distribution of the observations, or an approximation thereof, is assumed to be known. Building upon this knowledge, the degree of abnormality of each individual edge with respect to the model can be measured, and our goal is to detect a significant concentration of anomalous edges in a connected region of the graph. Usual methods for this task are built around scan statistics [14]. Such methods boil down to maximizing a scoring function over the set of connected regions of the graph, then rejecting the null hypothesis (*i.e.* absence of anomalous subgraph) if the maximum exceeds a certain threshold. This implies solving a combinatorial optimization problem over the class of all connected subgraphs, which is expensive due to the exponentially growing size of the latter.

In contrast, our approach does not require explicitly searching for the best candidate subgraph. Instead, we build on the following idea: under the null hypothesis, the most individually anomalous edges are randomly spread out over the graph, so removing all but the k most anomalous edges is equivalent to drawing k edges uniformly at random and extracting the subgraph they induce. In other words, this procedure amounts to bond percolation on a graph. When an anomalous subgraph is present, on the other hand, the locations of the individual anomalies are no longer random, and the subgraph induced by the k most anomalous edges should thus contain an unexpectedly large connected component. This link between anomalous subgraph detection and percolation theory has already been introduced in the context of regular lattices [6,19,20], but to the best of our knowledge, it has not yet been studied for arbitrary graphs.

We argue that our method is more scalable than traditional ones while retaining an acceptable detection power, especially when seeking to detect small anomalous regions in large graphs. We assess this detection performance through numerical experiments on several realistic synthetic graphs.

The rest of this paper is structured as follows. In Sect. 2, we introduce the statistical framework for our problem and present some related work. Section 3 describes our detection method, while Sect. 4 is devoted to its empirical evaluation on simulated data. Finally, we discuss our results and some interesting leads for future work in Sect. 5, then briefly conclude in Sect. 6.

## **2 Problem Formulation and Related Work**

We begin with a thorough formulation of our problem as a case of statistical hypothesis testing, then review the main existing approaches to it.

### **2.1 Problem Formulation – Statistical Hypothesis Testing**

Consider an undirected and connected graph G = (V, E), where V (resp. E) is the set of vertices (resp. edges) of G. Letting |A| denote the number of elements of a set A, we write m = |E|, and we use E and [m] = {1,...,m} interchangeably to represent the set of edges. We further write 2<sup>A</sup> for the set of all subsets of A and 1{·} for the indicator function of an event.

Let $\Lambda \subset 2^E$ denote the class of subsets of $E$ whose induced subgraph in $G$ is connected. Given a signal $\mathbf{X} = (X_1, \dots, X_m) \in \mathbb{R}^m$ observed on the edges of $G$ and a known probability distribution $F_0$, the null hypothesis is defined as $H_0 : X_i \overset{\text{iid}}{\sim} F_0$. For each $\mathcal{S} \in \Lambda$, we further define the alternative

$$H_{\mathcal{S}} : \begin{cases} \mathbf{X}_{|\mathcal{S}} \sim F_{\mathcal{S}} \\ \forall i \notin \mathcal{S}, \ X_i \sim F_0 \end{cases}$$

where $\mathbf{X}_{|\mathcal{S}}$ is the restriction of $\mathbf{X}$ to $\mathcal{S}$ and $F_{\mathcal{S}}$ is a joint probability distribution. $F_{\mathcal{S}}$ is only assumed to be different from $F_0^{\otimes |\mathcal{S}|}$, and it can differ in various ways. In many applications, the observations in $\mathcal{S}$ are simply larger than expected (consider for instance network intrusion detection, where the presence of an intruder results in additional activity in a connected region of the network). The problem considered in this paper can be formulated as

$$H_0 \quad \text{vs.} \quad H_1 = \bigcup_{\mathcal{S} \in \Lambda} H_{\mathcal{S}}.$$

That is, we want to know whether there exists a connected subgraph of $G$ inside which the observations $X_i$ are drawn from an alternative distribution. Note that we only care about detection, leaving the reconstruction of $\mathcal{S}$ aside.

### **2.2 Related Work – Scan Statistics and Beyond**

A lot of existing work deals with a specific instance of the problem defined above, namely elevated mean detection on a graph. In this setting, the observations are independent standard normal random variables under the null, while $X_i$ has mean $\mu_{\mathcal{S}} \mathbb{1}\{i \in \mathcal{S}\}$ (for some $\mu_{\mathcal{S}} > 0$) under the alternative $H_{\mathcal{S}}$. Theoretical conditions for detectability in this case are stated in [1]. A closely related problem arises when the observations are associated with vertices rather than edges; this setting was studied in [3–5]. However, these papers focus on statistical analysis and do not provide computationally tractable tests.

From a more practical perspective, the most common approach to anomalous subgraph detection is based on scan statistics. Broadly speaking, this method consists in defining a scoring function $f : 2^E \to \mathbb{R}$, computing the test statistic $t = \max_{\mathcal{S} \in \Lambda} f(\mathcal{S})$, then rejecting $H_0$ if $t$ exceeds a given threshold. This amounts to finding the most anomalous subset $\mathcal{S}^*$ in $\Lambda$, then rejecting the null hypothesis if $\mathcal{S}^*$ is anomalous enough. Defining $f$ requires some hypotheses on the class of alternative distributions $\{F_{\mathcal{S}}\}$: for instance, when $F_{\mathcal{S}}$ has a parametric form, $f(\mathcal{S})$ can be defined as the likelihood ratio between $H_{\mathcal{S}}$ and $H_0$. In the more general case considered here, however, finding a suitable scoring function is non-trivial. Moreover, computing $t$ implies maximizing $f$ over the combinatorial class $\Lambda$, which quickly becomes computationally intensive as the graph grows. Therefore, most related work focuses on making the computation of scan statistics more efficient. Ways to achieve this include the following:

**Restriction of the Class** *Λ***.** The easiest way to speed up the computation is to simply reduce the size of the search space by considering only a subset of Λ. Such restriction can be based on domain-specific knowledge [17,18,22,25] or more general heuristics [24].


Despite the popularity of scan statistics, other ideas have also been considered in the literature. We focus on one of these alternative approaches, namely the Largest Open Cluster (LOC) test, first studied in the context of object detection in images [19,20]. The idea of this method is to represent an image as a two-dimensional lattice, each node carrying a random variable standing for the value of the associated pixel. After deleting from the lattice every vertex whose pixel value is lower than a suitable threshold, the largest remaining connected component is expected to be small if there is no object in the image; if an object is present, on the other hand, an unexpectedly large connected component should remain in the thresholded lattice. The theory behind the LOC test has since been extended to lattices of arbitrary dimension [6], but to the best of our knowledge, the underlying idea of using percolation theory to detect anomalous connected subgraphs has not yet been applied to complex, arbitrarily shaped networks.

## **3 Local Anomaly Detection and Percolation Theory**

We now describe our method, first introducing some necessary notions of percolation theory, then highlighting their relevance to our anomaly detection problem. Finally, we provide a detailed description of our testing procedure.

### **3.1 Some Notions of Percolation Theory**

An interesting aspect of the LOC test is that the behavior of its test statistic under the null hypothesis can be described using percolation theory. We therefore first review some useful results from this field, which motivate our approach; for more details, see for example [10] and references therein. Since our primary interest is in signals associated with edges, we focus on bond percolation, where each edge of a connected graph with $n$ vertices is independently occupied with probability $p$ or unoccupied with probability $1 - p$.

Let $C(p)$ denote the size of the largest connected component of this graph at occupation probability $p$. The main focus of percolation theory is the limiting behavior of $C(p)$ as $n$ becomes large. Extremal values of $p$ yield obvious results: for $p = 0$, $C(p) = 1$ for any $n$, and for $p = 1$, $\lim_{n \to \infty} C(p) = \infty$. For intermediate values of $p$, however, there are two possible regimes. If $p$ is small enough, only small connected components are present and $C(p)/n$ converges in probability to 0. Larger values of $p$, on the other hand, lead to the emergence of a giant connected component containing a constant fraction of the vertices. The transition between the two regimes happens at a critical value of $p$ called the percolation threshold $p_c$. Note that $p_c$ depends on the graph structure and can be vanishingly small. Although this phase transition is only well defined in the limit of an infinite graph, a somewhat similar behavior can be observed in the finite case [8,16]. In particular, define the percolation process $\{C(p)\}_{0 \le p \le 1}$ as follows: assign to each edge $e$ an independent random variable $U_e$, uniformly distributed on $[0, 1]$; then, keeping the $U_e$ fixed, let $p$ vary on $[0, 1]$, deleting $e$ from the graph whenever $U_e > p$. A tightly related process is obtained by considering the imbedded Markov chain $\{G_k\}_{k \ge 0}$, where $G_k$ is the subgraph induced by the edges associated with the $k$ smallest random variables. Letting $C_k$ denote the size of the largest connected component of $G_k$, $\{C_k\}_{k \ge 0}$ can be seen as a discretized version of $\{C(p)\}_{0 \le p \le 1}$. Even for finite graphs, sample paths of these two processes do not deviate significantly from the mean trajectory, making them suitable candidates for anomaly detection; the sketch below simulates one such sample path.
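A minimal simulation of the imbedded chain, tracking the largest component with a union-find structure (the core idea of the Newman-Ziff algorithm [23] used in Sect. 3.3); feeding it a uniformly shuffled edge list produces one null sample path of $\{C_k\}_{k \ge 0}$.

```python
def largest_component_path(n_vertices, edge_order):
    """Sizes C_1, ..., C_m of the largest component as the given edges are
    occupied one by one (union-find with path halving and union by size)."""
    parent = list(range(n_vertices))
    size = [1] * n_vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    path, c_max = [], 1
    for u, v in edge_order:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru            # merge the smaller root into the larger
            size[ru] += size[rv]
            c_max = max(c_max, size[ru])
        path.append(c_max)             # C_k after the k-th occupied edge
    return path
```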

### **3.2 Application to Anomalous Subgraph Detection**

We now motivate the idea of mapping a signal $\mathbf{X}$ onto a sample path of the percolation process. For $i \in [m]$, define $P_i = 1 - F_0(X_i)$ as the upper-tail p-value associated with $X_i$. Define also, for $k \in \{0, \dots, m\}$, the subgraph $G_k$ induced by the edges associated with the $k$ smallest p-values, and let $S_k$ denote the size of its largest connected component. Under the null hypothesis, the random variables $\{P_i\}$ are independent and uniformly distributed on $[0, 1]$; therefore, $S_k$ has the same distribution as $C_k$ for all $k \in \{0, \dots, m\}$. Under the alternative $H_{\mathcal{S}}$, however, the distribution of the variables $\{P_i\}_{i \in \mathcal{S}}$ is altered, which induces a deviation of the process $\{S_k\}_{0 \le k \le m}$ with respect to the normal percolation process. Our test aims to detect this deviation.

Figure 1 illustrates the normal and anomalous behaviors of the percolation process for three graph models: a two-dimensional square lattice, an Erdős-Rényi random graph [13] and a Barabási-Albert preferential attachment graph [7]. For each model, a graph with 1024 vertices and approximately 2000 edges is generated, and the mean and standard deviation of the fraction of vertices in the largest connected component at each value of $p$ are estimated using 10000 Monte Carlo simulations. Then, for each graph, we generate a subtree $\mathcal{S}$ containing a fraction $\delta$ of the vertices, assign to each edge $e$ an independent Gaussian random variable $X_e \sim \mathcal{N}(\mu \mathbb{1}\{e \in \mathcal{S}\}, 1)$ and compute the associated sample path of the percolation process. This experiment was repeated 1000 times for each graph, and the mean sample path for different values of $\delta$ and $\mu$ is displayed. The two regimes of the percolation process can be observed, and the shape and location of the phase transition both clearly depend on the graph model. While the

**Fig. 1.** Evolution of the fraction of vertices in the largest connected component as $p$ varies from 0 to 1, under $H_0$ and various alternatives, for three kinds of graphs: a two-dimensional square lattice (left), an Erdős-Rényi random graph (center) and a Barabási-Albert preferential attachment graph (right).

separation between the two regimes is quite clear for the lattice and the Erdős-Rényi graph, it is much blurrier for the Barabási-Albert model, which yields more complex structures, most interestingly heavy-tailed degree distributions. Since such properties are often found in real-world networks, it is important to qualify their impact on the feasibility of percolation-based cluster detection. Figure 1 shows that although the anomalous sample paths become harder to distinguish as the phase transition gets hazier, the normal trajectories are concentrated enough to make even small deviations visible, which motivates our approach.

### **3.3 Putting It All Together – Description of Our Test**

We now proceed with the description of our test. First, define

$$K = \min\left\{ k \le m \,:\, \mathbb{E}_0[S_k] \ge \sqrt{|\mathcal{V}|} \right\},$$

where $\mathbb{E}_0$ denotes the expected value under $H_0$. $K$ can be understood as the index corresponding to the onset of the phase transition. Since we aim to detect the appearance of an unexpectedly large connected component in the early steps of the percolation process, the test statistic we use is

$$\chi = \frac{1}{|\mathcal{V}| \cdot K} \sum_{k=1}^{K} S_k.$$

This statistic is, up to normalization, the area under a piecewise constant interpolation of the sequence of points $\{(k, S_k)\}_{0 \le k \le K}$, and is therefore expected to be higher than usual in the presence of an anomalous subgraph.

Estimation of $K$ and calibration of the test are both done through Monte Carlo simulation: using the Newman-Ziff algorithm [23], $N$ random sample paths of the imbedded Markov chain are computed. Let $\{S_k^{(i)}\}_{0 \le k \le m}$ denote the trajectory of the largest connected component's size for the $i$th realization of the process. We get the following estimates:

$$\hat{K} = \min \left\{ k \le m \,:\, \frac{1}{N} \sum_{i=1}^{N} S_k^{(i)} \ge \sqrt{|\mathcal{V}|} \right\}, \qquad \hat{\chi} = \frac{1}{|\mathcal{V}| \cdot \hat{K}} \sum_{k=1}^{\hat{K}} S_k.$$

Finally, the empirical p-value can be expressed as

$$\hat{p} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ \hat{\chi} \le \hat{\chi}^{(i)} \}, \quad \text{where } \hat{\chi}^{(i)} = \frac{1}{|\mathcal{V}| \cdot \hat{K}} \sum_{k=1}^{\hat{K}} S_k^{(i)} \ \text{for } i \in \{1, \dots, N\}.$$
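Putting the pieces together, the whole procedure can be sketched as follows, reusing `largest_component_path` from the simulation sketch in Sect. 3.1; the inputs (edge list, per-edge p-values, number of Monte Carlo trajectories $N$) follow the notation above.

```python
import math
import random

def percolation_test(n_vertices, edges, p_values, N=1000, seed=0):
    rng = random.Random(seed)
    # N null trajectories from uniformly random edge orders (Newman-Ziff style)
    null = []
    for _ in range(N):
        order = list(edges)
        rng.shuffle(order)
        null.append(largest_component_path(n_vertices, order))
    # K: first step at which the mean null trajectory reaches sqrt(|V|)
    m = len(edges)
    mean_path = [sum(t[k] for t in null) / N for k in range(m)]
    K = next(k + 1 for k in range(m) if mean_path[k] >= math.sqrt(n_vertices))

    def chi(path):  # the area-based test statistic
        return sum(path[:K]) / (n_vertices * K)

    # observed trajectory: occupy edges by increasing p-value
    ranked = [e for _, e in sorted(zip(p_values, edges))]
    chi_obs = chi(largest_component_path(n_vertices, ranked))
    return sum(chi(t) >= chi_obs for t in null) / N  # empirical p-value
```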

## **4 Experiments**

In order to assess the power of our test, we ran it on several synthetic graphs containing random anomalous trees. This section describes the procedure we used to generate the dataset, then presents our results and their interpretation.

### **4.1 Generation of the Dataset**

The dataset is generated using the stochastic Kronecker graph model [21]. Kronecker graphs exhibit structural properties similar to those of real-world networks, most importantly power-law-distributed degrees and a small diameter. Hence, this model allows us to evaluate our test in a somewhat realistic setting.

Two parameter matrices are used: $\Theta_1$ = [0.9 0.5; 0.5 0.3] (core-periphery network) and $\Theta_2$ = [0.9 0.2; 0.2 0.9] (hierarchical network). For a given matrix and for $i \in \{12, 13, 14, 15\}$, we generate an undirected graph through $i$ iterations of the Kronecker product, keeping only the largest connected component in order to obtain a connected network with approximately $2^i$ vertices. Using this procedure, 10 graphs are generated for each pair of parameters $(\Theta, i)$; we thus evaluate our test on graphs with sizes ranging from a few thousand to a few tens of thousands of vertices, which covers a wide scope of potential use cases. For each synthetic graph, anomalies are then generated as follows: given $\delta \in (0, 1)$, a random subtree $\mathcal{S}$ containing a fraction $\delta$ of the vertices is drawn, and a random observation $X_e \sim \mathcal{N}(\mu \mathbb{1}\{e \in \mathcal{S}\}, 1)$ is independently drawn for each edge $e$ of the graph (where $\mu$ is a fixed signal strength). For a given graph and a pair of parameters $(\delta, \mu)$, 1000 anomalous signals $\mathbf{X} = (X_1, \dots, X_m)$ are generated, and 1000 signals are also drawn from the null distribution (that is, $\mathbf{X} \sim \mathcal{N}(0, I_m)$, where $I_m$ is the $m \times m$ identity matrix) for comparison. Finally, for each graph, the null distribution of the test statistic is estimated using 10000 random realizations of the percolation process. Using the obtained histogram, the empirical p-values associated with the normal and anomalous samples are derived, and we construct the Receiver Operating Characteristic (ROC) curve for each pair $(\delta, \mu)$. This procedure exposes the influence of various parameters on the performance of our test, namely the graph size, the generator matrix, the size $\delta$ of the anomalous region and the signal strength $\mu$.
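To make the generation step concrete, here is a rough sampler for stochastic Kronecker graphs in their dense "probability matrix" formulation. It is only practical for small $i$ (the experiments' $i \in \{12, \dots, 15\}$ call for the fast sampling scheme of [21]), and the extraction of the largest connected component is omitted.

```python
import numpy as np

def kronecker_graph(theta: np.ndarray, iterations: int, seed: int = 0) -> np.ndarray:
    """Sample an undirected stochastic Kronecker graph (dense formulation)."""
    rng = np.random.default_rng(seed)
    P = theta.copy()
    for _ in range(iterations - 1):
        P = np.kron(P, theta)          # edge-probability matrix, size 2^i x 2^i
    mask = rng.random(P.shape) < P     # one Bernoulli draw per vertex pair
    upper = np.triu(mask, k=1)         # keep the upper triangle: undirected, no loops
    return np.argwhere(upper)          # array of edges (u, v)

theta1 = np.array([[0.9, 0.5], [0.5, 0.3]])  # the core-periphery setting
edges = kronecker_graph(theta1, 10)          # ~1024 vertices; small i for the demo
```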

### **4.2 Detectability Conditions – Empirical Study**

Our results are displayed in Table 1 and Figs. 2 and 3. Our main interest is in finding out which parameters have the strongest influence on the power of the test, and we provide some key observations and interpretations below.

**Fig. 2.** Aggregated ROC curves of our test for 10 Kronecker graphs with initial matrix $\Theta_1$ = [0.9 0.5; 0.5 0.3], for several values of the number of Kronecker product iterations $i$, the proportion $\delta$ of vertices in the anomalous tree and the signal strength $\mu$.

*Influence of the Graph Size.* The first thing we notice in Figs. 2 and 3 is that for a given pair of parameters $(\delta, \mu)$, the performance of the test consistently improves as the size of the graph increases. One possible explanation comes from percolation theory: before the phase transition, the size of the largest connected component is sublinear in the size of the graph. This implies that, for a fixed ratio of vertices in the anomalous component, the difference between the size of the latter and the expected size of the largest null component grows with the graph size. Therefore, the anomalous component becomes more visible as the graph grows. Note, however, that some structural properties of our synthetic graphs (*e.g.* density) might not remain identical for different values of $i$, making it difficult to pinpoint the influence of the number of vertices alone.

**Fig. 3.** Aggregated ROC curves of our test for 10 Kronecker graphs with initial matrix $\Theta_2$ = [0.9 0.2; 0.2 0.9], for several values of the number of Kronecker product iterations $i$, the proportion $\delta$ of vertices in the anomalous tree and the signal strength $\mu$.

*Trade-Off Between* δ *and* μ*.* As could be intuitively expected, our test performs better for higher values of δ and μ. More interestingly, these two parameters are intertwined: what makes an anomalous subgraph detectable is not only the number of vertices it contains (which is controlled by δ), but also the presence of a sufficient fraction of its edges among the most individually anomalous edges of the graph (which is controlled by μ). In terms of experimental results, this translates to poor performance when at least one of these parameters is too low. However, there seems to be a range of values of δ and μ in which a decrease in one can be made up for by an increase in the other. In particular, this implies that even small subgraphs can be detected by our test as long as the signal is strong enough. This is useful in "needle-in-a-haystack" scenarios such as network intrusion detection, where the anomalies one looks for are often localized.

*Influence of the Graph Structure.* As evidenced by Fig. 1, structural properties of the graph heavily influence the normal behavior of the percolation process, in turn affecting the viability of percolation-based cluster detection. This explains the observable difference in detection power between the two kinds of graphs we consider. Further analysis shows that the generator $\Theta_1$ yields more heavy-tailed degree distributions, which is a plausible cause for the performance gap.

## **5 Discussion and Future Work**

We now discuss the main properties of our test, identifying some limitations and providing leads for future work.

**Table 1.** Aggregated AUC score of our test for 10 Kronecker graphs, using various combinations of initial matrix Θ, number of iterations of the Kronecker product i, proportion δ of vertices in the anomalous tree and signal strength μ.


*Theoretical Guarantees.* From a theoretical perspective, our setting is more complex than that of [6]: we consider arbitrary networks instead of regular lattices, and our test statistic depends on the whole sample path of the percolation process rather than the marginal behavior at a given occupation probability. Therefore, the search for theoretical guarantees for our test was left out of the scope of this work, although it would certainly be of great interest.

*Computational Cost.* The main advantage of our method is its computational efficiency. Indeed, computing the empirical p-value for a given graph and an observed signal only requires N + 1 runs of the Newman-Ziff algorithm, which has a very low cost. In contrast, a scan statistic-based test would require N + 1 runs of a combinatorial optimization algorithm (one for the observed data and N additional runs to estimate the distribution of the test statistic under the null). Even with a very efficient optimization method, this is significantly more intensive. In terms of complexity, our test requires sorting the observations Xi, running the Newman-Ziff algorithm N + 1 times, computing the mean sample path and the index K, and summing the first K values for each of the N + 1 trajectories, resulting in O(m(log m + N)) operations. Note that the algorithm can be further optimized using the fact that the test statistic depends only on the first K steps of the percolation process. Although the exact value of K depends on the graph, we empirically observe that it is generally smaller than the number of vertices |V|. Therefore, early stopping of the Newman-Ziff algorithm and partial sorting can reduce the complexity to O(m + |V|(N + log |V|)).

*Detection Power.* The expected downside of our method's low computational cost is a loss in detection power. Our simulations show, however, that the proposed test can detect reasonably small anomalous subgraphs in large enough ambient graphs, which is our main goal here. Moreover, it does not rely on prior knowledge of the alternative distribution and can be used with only a rough estimate of F0, which improves its usability in realistic settings.

Although the influence of some factors on the performance of the test was left out of the scope of this work, a wider analysis would be an interesting topic for future work. These factors include the density of the graph and the shape of the anomalous subgraph. More specifically, we only evaluated our test in the case of random anomalous trees, which provides general results but no insight into the influence of the diameter and the density of the anomalous subgraph.

## **6 Conclusion**

By extending previous work on percolation-based cluster detection to a more general setting, we propose a computationally efficient test to detect an anomalous connected subgraph in an edge-weighted network. The underlying intuition is that it is often possible to find out whether such a subgraph is present without explicitly finding it: instead of enumerating all possible candidates, a much faster method can be obtained by looking for properties of the whole graph that are affected by the appearance of an anomalous cluster. Our work suggests that percolation theory can provide such properties.

Since it scales easily to large graphs and does not rely on extensive knowledge of the null and alternative distributions of the observed signal, we argue that our method is applicable to real-world problems. Moreover, we show through extensive simulations that its detection power remains acceptable, and that it can in particular detect small anomalous regions in large graphs. Therefore, we think the link between cluster detection and percolation theory deserves further exploration, both from a theoretical and applied point of view.

## **References**



# **A Late-Fusion Approach to Community Detection in Attributed Networks**

Chang Liu<sup>1</sup>, Christine Largeron<sup>2</sup>, Osmar R. Zaïane<sup>1(B)</sup>, and Shiva Zamani Gharaghooshi<sup>1</sup>

<sup>1</sup> Alberta Machine Intelligence Institute, University of Alberta, Edmonton, Canada
{chang6,zaiane,zamanigh}@ualberta.ca

<sup>2</sup> Laboratoire Hubert Curien, Université de Lyon, Saint-Etienne, France
christine.largeron@univ-st-etienne.fr

**Abstract.** The majority of research on community detection in attributed networks follows an "early fusion" approach, in which the structural and attribute information about the network are integrated together as the guide to community detection. In this paper, we propose an approach called *late fusion*, which looks at this problem from a different perspective. We first exploit the network structure and node attributes separately to produce two different partitionings. Later on, we combine these two sets of communities via a fusion algorithm, where we introduce a parameter for weighting the importance given to each type of information: node connections and attribute values. Extensive experiments on various real and synthetic networks show that our late-fusion approach can improve detection accuracy over using network structure alone. Moreover, our approach runs significantly faster than other attributed community detection algorithms, including early-fusion ones.

**Keywords:** Community detection *·* Attributed networks *·* Late fusion

## **1 Introduction**

In many modern applications, data is represented in the form of relationships between nodes forming a *network*, or interchangeably a *graph*. A typical characteristic of these real networks is the *community structure*, where network nodes can be grouped into densely connected modules called communities. Community identification is an important issue because it can help to understand the network structure and leads to many substantial applications [6]. While traditional community detection methods focus on the network topology, where communities can be defined as sets of nodes densely connected internally, increasing attention has recently been paid to the attributes associated with the nodes in order to take homophily effects into account, and several works have been devoted to community detection in attributed networks. The aim of such a process is to obtain a partitioning of the nodes where vertices belonging to the same subgroup are densely connected and homogeneous in terms of attribute values.

In this paper, we propose a new method designed for community detection in attributed networks, called *late fusion*. This is a two-step approach: we first identify two sets of communities based on the network topology and the node attributes respectively, then merge them to produce the final partitioning of the network, which exhibits the homophily effect according to which linked nodes are more likely to share the same attribute values. The communities based on the network topology are obtained by simply applying an existing algorithm such as Louvain [2]. For graphs whose node attributes are numeric, we use existing clustering algorithms to get the communities (i.e., clusters) based on node attributes. We extend the method to binary-attributed graphs by generating a virtual graph from the attribute similarities between the nodes and performing traditional community detection on this virtual graph. Despite its simplicity, extensive experiments show that our late-fusion method is competitive in terms of both accuracy and efficiency when compared against other algorithms. Our main contributions in this work are the following:


The rest of the paper is organized as follows: In Sect. 2, we provide a brief review of community detection algorithms suited for attributed networks, next we present our late fusion approach in Sect. 3. Experiments to illustrate the effectiveness of the proposed method are detailed in Sect. 4. Finally, we summarize our work and point out several future directions in Sect. 5.

## **2 Related Work**

How to incorporate node attribute information into the process of network community detection has been studied for a long time. One of the early ideas is to transform attribute similarities into edge weights. For example, [13] proposes the *matching coefficient*, a count of the attributes shared between two connected nodes in a network; [15] extends the matching coefficient to networks with numeric node attributes; [4] defines edge weights based on self-organizing maps. A drawback of these methods is that the new edge weights apply only to edges that already exist, hence the attribute information is not fully utilized. To overcome this issue, a different approach is to *augment* the original graph by adding virtual edges and/or nodes based on node attribute values. For instance, [14] generates content edges based on the cosine similarity between node attribute vectors, in graphs where nodes are textual documents and the corresponding attribute vector is the TF-IDF vector describing their content. The kNN-enhance algorithm [9] adds directed virtual edges from a node to one of its k-nearest neighbors if their attributes are similar. SA-Clustering [17] adds both virtual nodes and edges to the original graph, where the virtual nodes represent binary-valued attributes and the virtual edges connect the real nodes to the virtual nodes representing the attributes they possess.

Another class of methods is inspired by the modularity measure, incorporating attribute information into an optimization objective such as modularity. [5] injects an attribute-based similarity measure into the modularity function; [1] combines the modularity gain with multiple common user attributes into an integrated objective; the I-Louvain algorithm [3] proposes an inertia-based modularity to describe the similarity between nodes with numeric attributes, and adds it to the original modularity formula to form the new optimization objective.

With the wide spread of deep learning, network representation learning and node embedding (e.g., [8]) have motivated new solutions. [12] proposes an embedding-based community detection algorithm that applies representation learning to graphs in order to learn a feature representation of the network structure, which is combined with node attributes to form a cost function; minimizing this cost function yields the optimal community membership matrix.

Probabilistic models can be used to describe the relationship between node connections, attributes and community membership; the task of community detection is then converted into inferring the community assignment of the nodes. A representative of this kind is the CESNA algorithm [16], which builds a generative graphical model for inferring community memberships.

Whereas the majority of the previous methods exploit both types of information simultaneously, we propose a late-fusion approach that combines two sets of communities obtained separately and independently from the network structure and the node attributes via a fusion algorithm.

## **3 The Late-Fusion Method**

Given an attributed network $G = (V, E, A)$, with $V$ the set of $m$ nodes, $E$ the set of $n$ edges, and $A$ an $m \times r$ attribute matrix describing the values of $r$ attributes for the nodes, the goal is to build a partitioning $P = \{C_1, \dots, C_k\}$ of $V$ into $k$ communities such that nodes in the same community are densely connected and similar in terms of attributes, whereas nodes from distinct communities are loosely connected and different in terms of attributes.

For networks with numeric attributes, we can directly apply a community detection algorithm $F_s$ on $G$ to identify a set of communities based on node connections, $P_s = \{C_1, C_2, \dots, C_{k_s}\}$, and a clustering algorithm $F_a$ on $A$ to find a set of clusters based on node attributes, $P_a = \{C_1, C_2, \dots, C_{k_a}\}$. For binary-attributed networks, traditional clustering algorithms are not applicable; we instead build a virtual graph $G_a$ that shares the same node set as $G$ but has an edge between two nodes only when they are similar enough in terms of attributes. We then apply $F_s$ on $G_a$ to obtain $P_a$. Note that we omit categorical attributes, since categorical values can easily be converted to the binary case.

The second step is to combine the partitionings $P_s$ and $P_a$. We first derive the adjacency matrices $D_s$ and $D_a$ from $P_s$ and $P_a$ respectively, where $d_{ij} = 1$ when nodes $i$ and $j$ are in the same community of the partitioning and $d_{ij} = 0$ otherwise. Next, an integrated adjacency matrix $D$ is given by $D = \alpha D_s + (1 - \alpha) D_a$, where $\alpha$ is a weighting parameter that balances the strength of the network topology against that of the node attributes. In this way, the information about the network topology and the node attributes of the original graph $G$ is represented in $D$. The graph $G_{int}$ derived from $D$ is then an integrated, virtual, weighted graph whose edges embody the homophily effect of $G$. Algorithm 1 shows the steps of our late-fusion approach applied to networks with binary attributes; a simplified sketch follows below.
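This fusion step is easy to prototype. Below is a minimal sketch (not Algorithm 1 itself), assuming the two partitionings are given as lists of node sets and using networkx's Louvain implementation as a stand-in for $F_s$; the integrated matrix $D$ is stored sparsely as weighted co-membership edges, which avoids materializing a dense node-by-node matrix (singleton communities contribute no edges and are omitted).

```python
import itertools
import networkx as nx

def late_fusion(partition_s, partition_a, alpha=0.5):
    """partition_s, partition_a: lists of node sets (P_s and P_a).
    Builds the integrated weighted graph D = alpha*D_s + (1 - alpha)*D_a
    and clusters it with Louvain."""
    G_int = nx.Graph()
    for weight, partition in ((alpha, partition_s), (1.0 - alpha, partition_a)):
        for community in partition:
            for i, j in itertools.combinations(sorted(community), 2):
                w = G_int.edges[i, j]["weight"] if G_int.has_edge(i, j) else 0.0
                G_int.add_edge(i, j, weight=w + weight)
    return nx.community.louvain_communities(G_int, weight="weight", seed=0)
```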


Here we address an important detail: how to build the virtual graph $G_a$ from the node-attribute matrix $A$. We compute the inner product of the attribute vectors of each node pair as a similarity measure; if it exceeds a predetermined threshold, we regard the two nodes as similar and add a virtual edge between them. The threshold can be determined heuristically from the distribution of the node similarities, but it should be chosen so that the resulting $G_a$ is neither too dense nor too sparse, as either case could harm the quality of the final communities. Under this guidance, we put forward two thresholding approaches:


**Fig. 1.** Node attribute distribution for three groups of experiments. (a) Strong attributes, (b) Medium attributes, (c) Weak attributes. Each color represents a unique community (Color figure online)

## **4 Experiments**

Our proposed method has been evaluated through experiments on multiple synthetic and real networks, and the results are presented in this section. For networks with numeric attributes, we take advantage of existing clustering algorithms to obtain the attribute-based communities (i.e., clusters), and for networks with binary attributes, we employ Algorithm 1 to perform community detection. We have also released our code so that readers can reproduce the results<sup>1</sup>.

### **4.1 Synthetic Networks with Numeric Attributes**

**Data.** We use an attributed graph generator [10] to create three attributed graphs with ground-truth communities, denoted G_strong, G_medium, and G_weak, indicating that the corresponding ground-truth partitionings are *strong*, *medium*, and *weak* in terms of modularity Q. To examine the effect of attributes on community detection, we assign to each of G_strong, G_medium, and G_weak three different attribute distributions, as shown in Fig. 1: the attributes in Fig. 1a and b are generated from a Gaussian mixture model with a shared standard deviation, and Fig. 1c presents the original attributes generated by [10]. In this way, each graph with a specific community structure (G_strong, G_medium, G_weak) also comes with three types of attributes, denoted strong, medium, and weak attributes, leading to 9 datasets in total.

**Evaluation Measures and Baselines.** Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and running time are used to evaluate algorithm accuracy and efficiency. Louvain [2] and SIWO [7] have been chosen as baseline algorithms that use only the links to identify network communities.

<sup>1</sup> https://github.com/changliu94/attributed-community-detection.


**Table 1.** Properties of synthetic networks



Note that since the attribute distribution does not affect Louvain and SIWO, their results are only presented in Table 3. We choose Spectral Clustering (SC) and DBSCAN as two representative clustering algorithms, as both can handle non-flat geometry. We treat the number of clusters as a known input parameter of SC, and the neighborhood size of DBSCAN is set to the average node degree. We adopt the default values of the remaining parameters from the *scikit-learn* implementation of these two algorithms. Finally, we take the I-Louvain algorithm, which exploits both links and attribute values, as our contender; its code is available online<sup>2</sup>. Combining Louvain or SIWO with SC or DBSCAN yields four variants of our late-fusion method. In all experiments, the parameter α in Algorithm 1 is set to 0.5, i.e., the same weight is allocated to structural and attribute information.
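For concreteness, the clustering setup just described might look as follows in *scikit-learn*. This is a hedged sketch: in particular, the paper's "neighborhood size" for DBSCAN is ambiguous between `eps` and `min_samples`, so the mapping below is an assumption.

```python
from sklearn.cluster import SpectralClustering, DBSCAN

def attribute_clusterings(A, n_clusters, avg_degree):
    """Cluster the node-attribute matrix A with the two algorithms from the text."""
    sc = SpectralClustering(n_clusters=n_clusters)   # number of clusters treated as known
    db = DBSCAN(eps=avg_degree)                      # 'neighborhood size' read as eps
    return sc.fit_predict(A), db.fit_predict(A)      # remaining parameters: defaults
```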


**Table 3.** Results of strong attributes, time is measured in seconds

<sup>2</sup> https://www.dropbox.com/sh/j4aqitujiaifgq4/AAAAH0L3uIPYNWKoLpcAh0TPa.


**Table 4.** Results of medium attributes, time is measured in seconds

**Results.** Table 3, corresponding to strong attributes, shows that late fusion is the best-performing algorithm in terms of NMI on G_strong and G_medium, and very close to SC on G_weak (0.765 against 0.768), while being better in terms of ARI on this last graph. In Tables 4 and 5, corresponding respectively to medium and weak attributes, the accuracy of late fusion degrades as the attribute quality deteriorates, but it remains at a consistently high level compared to I-Louvain and the clustering algorithms. Moreover, the performance of the late-fusion methods is less sensitive to this deterioration than that of the clustering algorithms, thanks to the complementary structural information. As for running time, the classic community detection algorithms Louvain and SIWO are, as expected, the fastest, since they do not consider node attributes, but the late-fusion method still outperforms I-Louvain by a remarkable margin.


**Table 5.** Results of weak attributes, time is measured in seconds

### **4.2 Real Network with Numeric Attributes**

**Data and Baselines.** Sina Weibo<sup>3</sup> is the largest online Chinese micro-blog social networking website. Table 2 shows the properties of the Sina Weibo network built by [9]<sup>4</sup>. It includes the within-inertia ratio I, a measure of attribute homogeneity of the data points assigned to the same subgroup: the lower the within-inertia ratio, the more similar the nodes in the same community. As the DBSCAN algorithm performs poorly on the Sina Weibo network and it is costly to find a good combination of its hyper-parameters, we replaced it with k-means as a supplement to spectral clustering. The number of clusters required as input by k-means and SC is inferred from the 'elbow method', which happens to be 10, the actual number of clusters. Moreover, since we have the prior knowledge that the ground-truth communities are based on the topics of the forums from which the users were gathered, we reckon that the formation of communities depends more on the attribute values than on the structure, and set the parameter α to 0.2.

**Results.** Table 6 presents the results on the Sina Weibo network. The two baseline algorithms Louvain and SIWO and the contending algorithm I-Louvain perform poorly on this network, whereas the clustering algorithms show high accuracy. In particular, the k-means algorithm, together with our four late-fusion methods that emphasize attribute information, produces the best NMI and ARI. This is because both the modularity of the Sina Weibo network (0.05, as indicated in Table 2) and its within-inertia ratio (0.04) are low. The results also validate our assumption that communities in this network are mainly determined by the attributes. We further explore the effect of α in Sect. 4.4.


**Table 6.** Experimental results on Sina Weibo network



<sup>3</sup> http://www.weibo.com.

<sup>4</sup> This dataset is available online https://github.com/smileyan448/Sinanet.

### **4.3 Real Network with Binary Attributes**

**Data.** The Facebook dataset [11] contains 10 egocentric networks with binary attributes corresponding to anonymized user information such as name, work, and education, together with ground-truth communities. The dataset is available online<sup>5</sup>, and Table 7 presents the properties of these networks.

We still treat Louvain and SIWO as our baselines. We use the CESNA algorithm [16], which can handle binary attributes in addition to the links, as our contender<sup>6</sup>. To compare the two thresholding strategies proposed in Sect. 3, we present experimental results of four late-fusion methods: Louvain + equal-edge thresholding (Louvain-EET), Louvain + median thresholding (Louvain-MT), SIWO + equal-edge thresholding (SIWO-EET), and SIWO + median thresholding (SIWO-MT). We set α to its default value of 0.5.


**Table 8.** NMI of different community detection results on Facebook network

**Table 9.** ARI of different community detection results on Facebook network


<sup>5</sup> http://snap.stanford.edu/data.

<sup>6</sup> The source code of CESNA is available online https://github.com/snap-stanford/snap/tree/master/examples/cesna.


**Table 10.** Running time of different community detection results on Facebook network, measured in seconds

**Results.** Results in terms of NMI, ARI, and running time are presented in Tables 8, 9, and 10, respectively. In terms of NMI, Table 8 shows again that our late-fusion algorithms can significantly improve community detection accuracy over Louvain. On average, Louvain-EET outperforms Louvain, SIWO, and CESNA by 30.8%, 42.2%, and 33.2% respectively, and Louvain-MT outperforms the three by 14.1%, 24.0%, and 16.2% respectively. However, all of the late-fusion methods perform poorly when evaluated by ARI. This results from the goal of our late-fusion approach: we aim to find communities such that nodes in the same subgroup are densely connected and similar in terms of attributes, whereas nodes in different communities are loosely connected and dissimilar in attributes. This objective leads the late-fusion approach to over-partition communities that are supported by only one of the two sources of information, and the over-partitioning greatly hurts the ARI scores. A post-processing step to resolve the over-partitioning issue is left as future work. The running time results in Table 10 again demonstrate the efficiency advantage of our late-fusion methods over CESNA.

### **4.4 Effect of Parameter** *α*

In the Sina Weibo experiment, we saw the advantage of having a weighting parameter to leverage the relative strength of the two sources of information. In this section, we examine the effect of α on the community detection results more closely. To do so, we devise an experiment using the G_strong and G_weak graphs introduced in Table 1, but with the roles of the attributes reversed: we assign **weak** attributes to G_strong and **strong** attributes to G_weak. We then run our late-fusion algorithm on these two graphs with varying α values, choosing SIWO as F_s and k-means as F_a.

Table 11 presents the NMI and ARI of late fusion with SIWO and k-means as α varies. G_strong has communities with a strong structure but weak attributes, so the NMI and ARI scores go up as we put more weight on the structure; conversely, G_weak has weak structural communities but strong attributes, hence the accuracy decreases as α increases. One can also notice that when α is sufficiently high or low, late fusion becomes equivalent to using community detection or clustering alone, which is in accordance with our observation on the Sina Weibo experiment.


**Table 11.** Effect of α

In practice, when network communities are mainly determined by the links, α should be greater than 0.5; α < 0.5 is recommended if attributes play the more important role in forming the communities; and when prior knowledge about the network communities is unavailable, or both sources of information contribute equally, α should be set to 0.5.

### **4.5 Complexity of Late Fusion**

It is a known drawback of attributed community detection algorithms that they are time-consuming due to the need to consider node attributes. Our late-fusion method circumvents this problem by taking advantage of existing, well-optimized community detection and clustering algorithms and combining their results in a simple way. To further demonstrate the computational efficiency of our late-fusion method, we measure its running time and compare it with the other methods.

**Fig. 2.** Running time of Louvain, SIWO, late fusion and I-Louvain on networks of different sizes

We test the running time of four different community detection methods on five graphs with 2,000, 4,000, 6,000, 8,000, and 10,000 nodes. These graphs are also generated by the attributed graph generator [10]. We keep the modularity of each graph in the range 0.64–0.66 and hold the other hyperparameters fixed. For each size, we randomly sample 10 graphs from the generator and plot the average running time of each method. As shown in Fig. 2, our late-fusion method is, as expected, slower than the two community detection methods that use only node connections. However, it runs considerably faster than the I-Louvain algorithm, although both scale approximately linearly with network size.

## **5 Conclusion and Future Direction**

In this paper, we proposed a new approach to the problem of community detection in attributed networks that follows a late-fusion strategy. We showed with extensive experiments that our late-fusion method is most often able not only to improve the detection accuracy of traditional community detection algorithms, but also to outperform the chosen contenders in terms of both accuracy and efficiency. We learned that combining node connections with attributes is not always the best solution: when one source of information is strong while the other is weak, using only the stronger source can lead to better detection results. Understanding when and how the extra attribute information should be used to help community detection is part of our future work. Since ARI suffers greatly from the over-partitioning issue when our late fusion is applied to networks with binary attributes, a post-processing step to resolve this issue is also desired. Finally, we hope to extend the late-fusion approach to networks with a mixture of binary and numeric attributes, as well as to networks with overlapping communities.

## **References**



# **Reconciling Predictions in the Regression Setting: An Application to Bus Travel Time Prediction**

João Mendes-Moreira<sup>1,2</sup> and Mitra Baratchi<sup>3(B)</sup>

<sup>1</sup> LIAAD-INESC TEC, Porto, Portugal
jmoreira@fe.up.pt
<sup>2</sup> Faculty of Engineering, University of Porto, Porto, Portugal
<sup>3</sup> LIACS, Leiden University, Leiden, The Netherlands
m.baratchi@liacs.leidenuniv.nl

**Abstract.** In different application areas, the prediction of values that are hierarchically related is required. As an example, consider predicting the revenue per month and per year of a company, where the prediction for the year should be equal to the sum of the predictions for the months of that year. The idea of reconciling predictions on grouped time-series has been previously proposed to provide optimal forecasts based on such data. This method, in effect, models the time-series collectively rather than providing a separate model for the time-series at each level. While the idea of reconciliation was originally developed for data of a time-series nature, it is not clear whether such an approach is also applicable to regression settings where multi-attribute data is available. In this paper, we address this problem by proposing Reconciliation for Regression (R4R), a two-step approach for prediction and reconciliation. To evaluate the method, we test its applicability in the context of Travel Time Prediction (TTP) of bus trips, where two levels of values need to be calculated: (i) travel times of the links between consecutive bus stops; and (ii) total trip travel time. The results show that R4R can improve the overall results in terms of both link TTP performance and reconciliation between the sum of the link TTPs and the total trip travel time. We compare the results with those of group-based reconciliation methods and show that the proposed reconciliation approach in a regression setting can provide better results in some cases. The method can be generalized to other domains as well.

**Keywords:** Regression · Reconciliation · Bus travel time

## **1 Introduction**

Regression analysis provides a simple framework for predicting numerical target attributes from a set of independent predictive attributes. Addressing any problem within this framework requires designing models that fully capture the relations between predictive and target attributes. This has so far led to many classes of regression models being designed. For instance, multi-target regression models [11] predict the values of multiple target attributes, as opposed to basic regression models that aim at predicting only a single target attribute at a time. In another case, when one target variable is being predicted from a set of hierarchically ordered predictive attributes, the problem is known as multi-level regression [5].

In this paper, we address the problem of regression for a class of problems where dependent variables are additionally hierarchically organized following different levels of aggregation. An example is the revenue forecasts per month and also per year of a given company. The forecasts for the new year can be the sum of the predictions done for each of the twelve months of the new year or can be done directly for the full new year. However, in many situations, it is important that the sum of the prediction per month is equal to the prediction for the full year. Moreover, relevant questions in this regard can arise. Can we obtain better predictions using both predictions for all months and for the full year? How may we reconcile the sum of the predictions done per month with the prediction done for the full year? Authors of [8] answered these questions for hierarchies of time series, i.e., a sequence of values, typically equally spaced, where this sequence can be aggregated by a given dimension.

This notion of hierarchy can also exist in the regression setting, i.e., a problem with a set of n instances (X_i, y_i), i = 1, ..., n, where each instance has a vector X_i of p predictive attributes (x_{i1}, x_{i2}, ..., x_{ip}) and a quantitative target attribute y_i. The hierarchy can exist in this regression setting when, for instance, two of the p predictive attributes have a 1-to-many relation, as in relational databases. Addressing this problem in the regression setting leads to more flexible and robust solutions compared to the time-series approach because: (1) any number of observations per time interval can be used; (2) there are no limitations on the time interval between consecutive observations; and (3) any other type of predictive attribute can be used to better explain the target attribute.

In this work, we present an approach to reconcile predictions in the regression setting. We achieve this by proposing a new method named Reconciliation for Regression (R4R). The R4R method is tested on the bus travel time prediction problem. This problem considers buses running on predefined routes, where each route is composed of several links, a link being the road stretch between two consecutive bus stops. Reconciling the predictions in this problem means reconciling the sum of the predictions for each link with the prediction for the full route. To the best of the authors' knowledge, this is the first work on reconciling predictions in the regression setting. This work also differs from multi-target and multi-level variants, being a combination of both (having multiple targets that are hierarchically ordered).

The R4R method can be applied to any other regression problem that exhibits a one-to-many relationship between instances and where the aggregated target value (the one) is the sum of the detailed target values (the many). In the examples above: (1) in revenue forecasting, the many component targets are the revenue forecasts per month, and the one component target is the revenue forecast for the full year; (2) in bus travel time prediction, the many component targets are the link predictions, while the one component target is the full route prediction. In this paper, we only discuss the sum as aggregation criterion (the one should be equal to the sum of the many), but the proposed method could easily be extended to other aggregation criteria, e.g., the average.

The remainder of this paper is organized as follows. In Sect. 2, we present previous work on reconciling predictions. Section 3 elaborates on the proposed methodology. In Sect. 4, we describe the case study. The results of the case study are presented and discussed in Sect. 5. Finally, the conclusions are presented in Sect. 6.

## **2 Literature Review**

In this section, we review previous research on (i) methods for forecasting hierarchically organized time-series data and (ii) the application area of travel time prediction.

**Methods for Forecasting Hierarchically Organized Data:** Common methods used to reconcile predictions for hierarchically organized time-series data can be grouped into three categories, based on the level that is predicted first: bottom-up, top-down, and middle-out. Bottom-up strategies forecast all the low-level target attributes and use the sum of these predictions as the forecast for the higher-level attribute. Top-down approaches, on the contrary, predict the top-level attribute and then split the prediction over the lower-level attributes based on historical proportions, which may be estimated. For time-series data with more than two levels of hierarchy, a middle-out approach can be used, combining the bottom-up and top-down approaches [3]. These methods form linear mappings from the initial predictions to reconciled estimates, so that the sum of the forecasts of the components of a hierarchical time series equals the forecast of the whole; however, this is achieved without guaranteeing an optimal solution. Authors of [8] presented a framework for optimally reconciling the forecasts of all series in a hierarchy to ensure they add up. The method first computes the forecasts independently for each level of the hierarchy, and then optimally reconciles these base forecasts so that they aggregate appropriately across the hierarchy. The optimal reconciliation is based on a generalized least squares estimator and requires an estimate of the covariance matrix of the reconciliation errors. Using Australian domestic tourism data, the authors of [8] compare their optimal method with bottom-up and conventional top-down approaches; the results show that the optimal combination approach and the bottom-up approach outperform the top-down method. The same authors extended this work in [9] to cover non-hierarchical groups of time series, as well as large collections of time series with a partial hierarchical structure. A new combination forecasting approach is proposed that incorporates the information from a full covariance matrix of forecast errors when obtaining a set of aggregate forecasts; a weighted least squares method is used due to the difficulty of estimating the covariance matrix for large hierarchies.
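As a quick illustration of the two classical strategies (with made-up numbers), consider a two-level hierarchy of monthly and yearly revenue forecasts:

```python
import numpy as np

monthly = np.array([10., 12, 11, 13, 12, 14, 15, 14, 13, 12, 11, 10])  # base forecasts
yearly = 150.0                        # independently produced top-level forecast

# Bottom-up: the yearly forecast is simply the sum of the monthly ones.
bu_year = monthly.sum()               # 147.0

# Top-down: split the yearly forecast by (here, estimated) historical proportions.
td_months = yearly * monthly / monthly.sum()   # sums exactly to `yearly`
```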

In [16], an alternative representation is used that involves inverting only a single matrix of lower dimension. The combination forecasting approach incorporates the information from a full covariance matrix of forecast errors in obtaining a set of aggregate-consistent forecasts, and minimizes the mean squared error of these forecasts across the entire collection of time series.

A game-theoretically optimal reconciliation method is proposed in [6]. The authors address the problem in two independent steps: they first compute the best possible forecasts for the time series without taking the hierarchical structure into account, and then apply a game-theoretic reconciliation procedure to make the forecasts aggregate consistent.

The previously mentioned methods are limited by the time-series nature of the approach they take. It is often impossible to take advantage of additional features and attributes accompanying the data with such an approach. Furthermore, prevalent data imperfection problems such as missing data lead to imperfect time series, which reduces the applicability of time-series models that require equally spaced samples.

In our work, we take advantage of additional features and of the structure of the grouped data to improve and reconcile predictions. Instead of forecasting each time series independently and then combining the predictions, in a regression setting we can reconcile predictions of future events using only a selection of past events. This leads to a solution suitable for online applications.

**Application Area of Travel Time Prediction:** A considerable number of research papers address the problem of travel time prediction for transport applications. Accurate travel time information is essential, as it attracts more commuters and increases commuters' satisfaction [1]. The majority of these works concern short-term travel time prediction [19], aimed at applications in advanced traveler information systems. There are also works on long-term travel time prediction [13], which can be used as a planning tool for public transport companies or even for freight transport.

Link travel time prediction can be used for route guidance [17], for bus bunching detection [14], or to predict the bus arrival time at the next station [18], which in turn supports passenger information services. More recently, Global Positioning System (GPS) data has become increasingly available, allowing travel times to be predicted from GPS trajectories. These trajectories can be used to construct origin-destination matrices of travel times or traffic flows, an important tool for mobility studies [2].

Using both the link travel time predictions and the full trip travel time prediction to improve all of these predictions is this paper's contribution to the transportation field.

## **3 The R4R Method**

### **3.1 Problem Definition**

Consider a dataset D = ⟨**X**, **L**, **r**⟩. Here **X** denotes the **set of predictive attributes** and is a matrix of size N × Q, representing N instances each composed of Q predictive attributes. **L** is the **set of the many component targets** and is a matrix of size N × K, with K the number of elements of the many component target. **r** represents the **set of one component targets** and is a vector of length N. Each r_n ∈ **r** is the target attribute of the one component, and each l_{n,k} ∈ **L** is the k-th target attribute of the many component. We assume r_n = Σ_{k=1}^{K} l_{n,k}, i.e., the sum of all the many component targets equals the corresponding one component target.
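In code, the dataset shapes and the aggregation constraint read as follows (a sketch with arbitrary dimensions):

```python
import numpy as np

N, Q, K = 1000, 5, 30       # instances, predictive attributes, many-targets per instance
X = np.random.rand(N, Q)    # predictive attributes
L = np.random.rand(N, K)    # many component targets (e.g., link travel times)
r = L.sum(axis=1)           # one component target: r_n = sum_k l_{n,k}
```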

Denoting the prediction of each l_{n,k} by p_{n,k}, we are looking for a model that ensures that the sums of the predictions of the many component targets are as close as possible to the corresponding r_n. In other words, after making predictions, we want the following to hold:

$$\{\sum\_{k=1}^{K} p\_{n,k} \approx r\_n | n \le N\}\tag{1}$$

### **3.2 Methodology**

In this section, we elaborate on our proposed method, Reconciliation for Regression (R4R), which addresses the above-mentioned problem in two steps. In the first step, it learns models for the prediction of the **many component targets** separately. In the second step, it reconciles the many predictions with the one component.

In order to improve the individual p_{n,k} predictions such that Eq. 1 holds, our proposed framework uses a modified version of the least squares optimization method to compute a set of corrective coefficients (see Eq. 4) that are used to update the individual p_{n,k} predictions.

**Step 1, Learning the Predictive Models:** In the first step, the predictions of the many component targets are calculated using a chosen base learning method. K different models are trained, one for each of the K elements of the many target component; a different learning method may be selected for each element to ensure higher accuracy. The resulting predictions for each of the K elements are referred to as p_{m,k}, where m is the instance number and k identifies the element of the many component targets. Algorithm 1 depicts these steps. The algorithm outputs **P**, a matrix of size M × K composed of the predictions p_{m,k}, which is used in the second step for reconciliation.

#### **Algorithm 1.** Learning the predictive model

**Input: D** (dataset matrix of size N × (Q + K)), Me (base learning method), γ (a percentage value)

**Output: P** (predictions matrix of size M × K)


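A minimal sketch of this first step, assuming the multivariate linear regression base learner used later in the paper (Sect. 4) and scikit-learn as an illustrative implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def learn_link_models(X, L):
    """Fit one base model per element of the many component target.

    X: (N, Q) predictive attributes; L: (N, K) many component targets.
    Returns the K fitted models and the (N, K) prediction matrix P.
    """
    K = L.shape[1]
    models = [LinearRegression().fit(X, L[:, k]) for k in range(K)]
    P = np.column_stack([m.predict(X) for m in models])
    return models, P
```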

**Step 2, Reconciling Predictions:** In the second step, the framework updates the predictions produced by the models of Algorithm 1. This is achieved by estimating a corrective coefficient θ_k for each element of the many target component. Multiplying the model predictions p_{m,k} by these coefficients should minimize the error with respect to both the one component target (r_m) and the many component targets (l_{m,k}). We achieve this using a least squares method on the current training dataset, with the objective functions given by Eqs. 2 and 3, to estimate *θ* = (θ_1, ..., θ_K).

$$\mathop{\arg\min}\_{lb<\theta\_k<ub} \sum\_{m=1}^{M} \Big( \sum\_{k=1}^{K} \theta\_k \, p\_{m,k} - r\_m \Big)^2 \tag{2}$$

$$\mathop{\arg\min}\_{lb<\theta\_k<ub} \sum\_{m=1}^{M} \sum\_{k=1}^{K} \big( \theta\_k \, p\_{m,k} - l\_{m,k} \big)^2 \tag{3}$$

The first objective function, Eq. 2, optimizes reconciliation with respect to the value of the one component target. The second objective function, Eq. 3, minimizes the error of the predictions with respect to each element of the many component targets separately. Both objective functions can be combined and expanded into the stacked linear system of Eq. 4: the first M rows represent the objective function of Eq. 2, and the remaining M × K rows represent the objective function of Eq. 3.

$$
\begin{bmatrix}
p\_{1,1} & p\_{1,2} & \cdots & p\_{1,K} \\
\vdots & \vdots & & \vdots \\
p\_{M,1} & p\_{M,2} & \cdots & p\_{M,K} \\
p\_{1,1} & 0 & \cdots & 0 \\
0 & p\_{1,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p\_{M,K}
\end{bmatrix}
\begin{bmatrix}
\theta\_{1} \\ \theta\_{2} \\ \vdots \\ \theta\_{K}
\end{bmatrix}
=
\begin{bmatrix}
r\_{1} \\ \vdots \\ r\_{M} \\ l\_{1,1} \\ l\_{1,2} \\ \vdots \\ l\_{M,K}
\end{bmatrix}
\tag{4}
$$

As seen in Eqs. 2 and 3, we constrain the values of *θ*. The aim is to regularize the modifications to the predictions for each element of the many component targets in a sensible manner (e.g., negative factors cannot be allowed when negative predictions are not meaningful). We therefore assume, without loss of generality, that all values of *θ* are positive, with lower (lb) and upper (ub) bound constraints 0 < lb < θ_k < ub. Both lb and ub are free input parameters. We reduce the number of free parameters to one (α) by defining a symmetric bound region (lb, ub) = (1 − α, 1 + α).
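The bounded least squares system of Eq. 4 can be assembled and solved, for example, with the `lsq_linear` routine from SciPy that the paper reports using (Sect. 4). The nearest-neighbor selection of Algorithm 2 is omitted here, so this is only a sketch of the core computation.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_theta(P, L, r, alpha=0.01):
    """Corrective coefficients theta from the stacked system of Eq. 4.

    P: (M, K) predictions p_{m,k}; L: (M, K) observed targets l_{m,k};
    r: (M,) observed one component targets; bounds are (1-alpha, 1+alpha).
    """
    M, K = P.shape
    top = P                              # rows of Eq. 2: sum_k theta_k p_{m,k} ~ r_m
    bottom = np.zeros((M * K, K))        # rows of Eq. 3: theta_k p_{m,k} ~ l_{m,k}
    for m in range(M):
        for k in range(K):
            bottom[m * K + k, k] = P[m, k]
    A = np.vstack([top, bottom])
    b = np.concatenate([r, L.reshape(-1)])
    return lsq_linear(A, b, bounds=(1 - alpha, 1 + alpha)).x

# Reconciled predictions: P_new = P * fit_theta(P, L, r)
```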

The process of reconciling the predictions is explained in Algorithm 2. A least squares method is used to calculate the corrective coefficients *θ*. To allow robustness against outliers, we suggest using the nk nearest neighbors of each instance for estimating *θ*; we assume that similar trips from the past behave similarly, as shown in [12]. The algorithm takes into account the predictions for both the many component elements and the one component, predicted from similar instances in **P**, in order to satisfy Eq. 1 on reconciliation. In its final step, the prediction matrix for all elements of the many component targets is updated using the corrective coefficients *θ*; the new predictions are denoted **P**_new.

**Algorithm 2.** Reconciling the predictions

**Input: P** (predictions matrix of size M × K), nk (number of nearest neighbors), lb, ub (lower and upper bounds for the θ_k)

**Output: P**_new (new predictions matrix of size M × K), *θ* (vector of corrective coefficients)


**Table 1.** Characteristics of tested STCP bus routes

## **4 Case Study**

To test the methodology explained in Sect. 3.2, we conduct a series of experiments using a real dataset that has our desired hierarchical organization of target values. Measuring travel times in a public transport system produces such a dataset, and accurate Travel Time Prediction (TTP) is an important goal for public transport companies. On the one hand, *travel time prediction for the link between two consecutive stops* (the many component targets in our model) allows timely informing of roadside users about the arrival of buses at bus stops (we refer to this value as link TTP). On the other hand, *total trip travel time prediction* (the one component in our model) is useful for better scheduling drivers' duty services (we refer to this value as total TTP) [4].

The dataset used in this section is provided by the Sociedade de Transportes Colectivos do Porto (STCP), the main mass public transportation company in Porto, Portugal.

The experiments described in the following sections are based on data collected from six bus routes (shown in Table 1) during the period from January 1 to March 30, 2010. All six selected bus routes operate between 5:30 a.m. and 2:00 a.m.; however, we have considered only bus trips starting after 6 a.m.

The collected dataset has multiple nominal and ordinal attributes that make it suitable for defining a regression problem. We have selected five features that characterize each bus trip: (1) WEEKDAY: the day of the week {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}; (2) DAYTYPE: the type of the day {holiday, normal, non-working day, weekend holiday}; (3) Bus Day Month: {1,...,31}; (4) Shift ID; (5) Travel ID.

We implemented R4R using the R software [15] and the *lsq_linear* routine from the *SciPy* Python library [10]. For the first step of R4R, as depicted in Algorithm 1, we use simple multivariate linear regression as the base learning method, referred to as Bas. We split the data as follows: a 30-day window is used for selecting training samples, and a 60-day window is used for selecting test samples.

In our experiments, the parameter α used for determining the lower and upper bounds on *θ* varies from 0.01 to 0.04; at the largest value, *θ* is constrained to lie between 0.96 and 1.04.

## **5 Comparative Study**

### **5.1 Can Reconciliation Be Achieved Using R4R?**

Firstly, using the proposed R4R method, we try to answer the following question: is it possible to use the total trip travel time to improve the link TTPs while simultaneously guaranteeing a better reconciliation between the sum of the link TTPs and the total TTP? To answer this question, we measure the relative performance improvement achieved by R4R compared to multivariate linear regression as the base learning method (denoted Bas).

We evaluate the performance on the following indicators: (i) link travel time prediction (LP); (ii) the sum of the link travel time predictions (SFP); and (iii) full trip time prediction (FP). Methods are compared using the Root Mean Square Error (RMSE), defined in Eq. 5.

$$RMSE = \sqrt{\frac{1}{N\_{test}} \sum\_{i=1}^{N\_{test}} (\hat{y}\_i - y\_i)^2} \tag{5}$$

where y_i and ŷ_i represent the target and predicted bus arrival times for the i-th example in the test set, respectively, and N_test is the total number of test samples. For the link travel time prediction indicator LP, the mean RMSE over the bus links is reported.
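Equation 5 is the standard RMSE; in NumPy it amounts to:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error of Eq. 5."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```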

The results of the comparison between R4R and Bas are presented in Fig. 1. Note that relative gains are shown for the sake of readability: the duration of travel times varies widely, which leads to unreadable graphs when absolute values are plotted.

As seen, R4R outperforms the base multivariate regression model in all cases. This answers the question posed earlier: R4R improves the predictions of the base regression learning method while simultaneously guaranteeing a better reconciliation between the sum of the link TTPs and the total trip travel time.

**Fig. 1.** Relative improvement of R4R (Res) relative to Baseline (Bas) for mean LP - Link Prediction (red), sum of the link travel time predictions (green) and the full trip time prediction (blue). (Color figure online)

### **5.2 How Does R4R Perform Against Baselines Made for Time Series Data?**

We continue our experiments by comparing R4R with the methods proposed by Hyndman et al. in related work [8,9,16], denoted H2011, W2015, and H2016. For the comparison we used the available implementation in the R package [7]. Since these baseline models are designed for time-series data, we also define a time-series problem on our dataset: we represent the data as a time series with a resolution of one hour, computing the mean link travel time for each hour between 6:00 a.m. and 2:00 a.m. the next day, i.e., 20 data points per "bus day". In the majority of cases, each interval contains more than one link travel time, so we average the link travel times within each hour. Because the dataset has a considerable number of missing values, interpolation was used to fill in missing link travel times; however, the results presented in the paper do not take into account predictions made for intervals with no data.
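The hourly aggregation just described could be reproduced with pandas roughly as follows; `df`, its column names, and the wrap-around time window are assumptions for illustration.

```python
import pandas as pd

def hourly_link_series(df):
    """Average link travel times per one-hour interval, interpolating gaps."""
    s = (df.set_index('timestamp')['link_travel_time']
           .resample('1H').mean()     # one value per hour on the "bus day" grid
           .interpolate())            # fill intervals with no observations
    return s.between_time('06:00', '02:00')  # 20 points from 6 a.m. to 2 a.m.
```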

The pre-processing tasks that were necessary to use the approaches proposed by Hyndman et al. already suggest the value of methods such as R4R that work in a more general and flexible regression setting. Indeed, discretizing the data into a time-series format implies making predictions for intervals instead of the point-wise predictions of the regression setting, and it requires filling in missing data when intervals contain no instances. The latter problem can be mitigated by considering larger intervals, but larger intervals imply a loss of detail. Moreover, the regression setting deals naturally with additional attributes that can partially explain the target attribute.

**Fig. 2.** RMSE for each of the Link Travel Time Predictions of R4R against the methods proposed in H2011 [8], W2015 [16], H2016 [9] applied to bus route L305. SUM is the RMSE of the sum of the LTT prediction for the entire trip against the full trip time. This plot shows the results before the bus starts its journey.



Figure 2 presents the prediction results for bus route L305. We show only the results for α = 0.01, the value that consistently gave the best performance in all our experiments; indeed, the errors increased with increasing values of α in every experiment. The results show very small differences between the methods under study.

The data is not homogeneous, which can adversely affect the performance of the least squares method when outlying data is used to find the corrective coefficients *θ*. To avoid such problems, our framework selects the nk nearest neighbors of each bus trip (see Algorithm 2). Thus, after each link travel time prediction, the whole process must be recomputed, i.e., a new set of similar bus trips is selected, the coefficients are found using the least squares method, and the predictions are updated. Compared with the methods of Hyndman et al., this leads to a more computationally expensive solution. It is also important to find a suitable value for nk; during our experiments, we observed that the best results were achieved for nk = 3, so all results presented in this paper are based on nk = 3.

Table 2 shows the overall prediction results for all tested bus routes using multivariate linear regression as the base learning method (Bas). The results show that R4R outperforms Bas in all cases. There are a number of cases where one of the time-series models proposed by Hyndman et al. performs better than R4R. These differences can be explained by the simple linear regression algorithm we used as the base learner in Algorithm 1: a linear model cannot capture non-linear relations between features. The performance of R4R can be improved further, as it allows using any other regression method. Furthermore, using extra features, such as weather conditions, could improve the performance of R4R even more, whereas the methods proposed by Hyndman et al. cannot benefit from extra features.

## **6 Conclusion**

In this paper, we study the problem of reconciling predictions in a regression setting. We presented a two-step framework for prediction and reconciliation. To evaluate the performance and applicability of this method, we conducted a set of experiments using a real dataset collected from buses in Porto, Portugal. The results demonstrate that R4R improves the predictions of the base learning method. R4R is also able to further improve the reconciliation of the link TTPs after each iteration in an online manner, although this is not shown here due to space constraints. We also compared the results achieved in a regression setting with those of a time-series approach. In the case study discussed in this paper, R4R is able to reduce the error of link TTPs and increase reconciliation. An important advantage of R4R compared to time-series variants is that it provides a flexible framework that can take advantage of any regression model and of additional features accompanying the data. Furthermore, R4R is not affected by data imperfection problems, such as missing data, that reduce the applicability of time-series models requiring equally spaced samples.

**Acknowledgments.** This work is financed by the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, through national funds, and co-funded by the FEDER, where applicable.

We also thank STCP - Sociedade de Transportes Colectivos do Porto, SA, for providing the data used in this work.

### **References**

1. Amita, J., Jain, S., Garg, P.: Prediction of bus travel time using ANN: a case study in Delhi. Transp. Res. Procedia **17**, 263–272 (2016). International Conference on Transportation Planning and Implementation Methodologies for Developing Countries (12th TPMDC) Selected Proceedings, IIT Bombay, Mumbai, India, 10–12 December 2014



# **A Distribution Dependent and Independent Complexity Analysis of Manifold Regularization**

Alexander Mey<sup>1(B)</sup>, Tom Julian Viering<sup>1</sup>, and Marco Loog<sup>1,2</sup>

<sup>1</sup> Delft University of Technology, Delft, The Netherlands
{a.mey,t.j.viering}@tudelft.nl
<sup>2</sup> University of Copenhagen, Copenhagen, Denmark
m.loog@tudelft.nl

**Abstract.** Manifold regularization is a commonly used technique in semi-supervised learning. It enforces the classification rule to be smooth with respect to the data manifold. Here, we derive sample complexity bounds, based on pseudo-dimension, for models that add a convex data-dependent regularization term to a supervised learning process, as is done in particular in manifold regularization. We then compare the bound for those semi-supervised methods to purely supervised methods, and discuss a setting in which the semi-supervised method can, ignoring logarithmic terms, only yield a constant improvement. By viewing manifold regularization as a kernel method, we then derive Rademacher bounds which allow for a distribution *dependent* analysis. Finally, we illustrate that these bounds may be useful for choosing an appropriate manifold regularization parameter in situations with very sparsely labeled data.

**Keywords:** Semi-supervised learning · Learning theory · Manifold regularization

## **1 Introduction**

In many applications, such as image or text classification, gathering unlabeled data is easier than gathering labeled data. Semi-supervised methods try to extract information from the unlabeled data to improve classification performance over purely supervised methods. A well-known technique for incorporating unlabeled data into a learning process is manifold regularization (MR) [7,18]. This procedure adds a data-dependent penalty term to the loss function that penalizes classification rules that behave non-smoothly with respect to the data distribution. This paper presents a sample complexity and a Rademacher complexity analysis for this procedure. In addition, it illustrates how our Rademacher complexity bounds may be used for choosing a suitable manifold regularization parameter.

We organize this paper as follows. In Sects. 2 and 3 we discuss related work and introduce the semi-supervised setting. In Sect. 4 we formalize the idea of adding a distribution-dependent penalty term to a loss function; algorithms such as manifold, entropy, or co-regularization [7,14,21] follow this idea. Section 5 generalizes a bound from [4] to derive sample complexity bounds for the proposed framework, and thus in particular for MR. For the specific case of regression, we furthermore adapt a sample complexity bound from [1], which is essentially tighter than the first bound, to the semi-supervised case. In the same section we sketch a setting in which we show that if our hypothesis set has finite pseudo-dimension, then, ignoring logarithmic factors, any semi-supervised learner (SSL) that falls into our framework has at most a constant improvement in terms of sample complexity. In Sect. 6 we show how one can obtain distribution *dependent* complexity bounds for MR. We review a kernel formulation of MR [20] and show how it can be used to estimate Rademacher complexities for *specific* datasets. In Sect. 7 we illustrate on an artificial dataset how the distribution-dependent bounds could be used for choosing the regularization parameter of MR. This is particularly useful as the analysis does not require an additional labeled validation set. The practicality of this approach requires further empirical investigation. In Sect. 8 we discuss our results and speculate about possible extensions.

## **2 Related Work**

In [13] we find an investigation of a setting where the distributions on the input space X are restricted to ones that correspond to unions of irreducible algebraic sets of a fixed size k ∈ N, and each algebraic set is labeled either 0 or 1. An SSL that knows the true distribution on X can identify the algebraic sets and reduce the hypothesis space to the 2^k possible label combinations on those sets. As we are then left with finitely many hypotheses, we can learn efficiently, while the authors show that every supervised learner is left with a hypothesis space of infinite VC dimension.

The work in [18] considers manifolds that arise as embeddings of a circle, where the labeling over the circle is (up to the decision boundary) smooth. It is then shown that a learner that has knowledge of the manifold can learn efficiently, while for every fully supervised learner one can find an embedding and a distribution for which this is not possible.

The relation to our paper is as follows. Those works provide specific examples where the gap in sample complexity between a semi-supervised and a supervised learner is infinitely large, while we explore general sample complexity bounds for MR and sketch a setting in which MR cannot essentially improve over supervised methods.

## **3 The Semi-supervised Setting**

We work in the statistical learning framework: we assume we are given a feature domain X and an output space Y together with an unknown probability distribution P over X × Y. In binary classification we usually have Y = {−1, 1}, while for regression Y = R. We use a loss function φ : R × Y → R, which is convex in the first argument and in practice usually a surrogate for the 0–1 loss in classification, or the squared loss in regression tasks. A hypothesis f is a function f : X → R. We let (X, Y) be a random variable distributed according to P, while lowercase x and y are elements of X and Y, respectively. Our goal is to find a hypothesis f, within a restricted class F, such that the expected loss Q(f) := E[φ(f(X), Y)] is small. In the standard supervised setting we choose a hypothesis f based on an i.i.d. sample S_n = {(x_i, y_i)}_{i∈{1,..,n}} drawn from P. With that we define the empirical risk of a model f ∈ F with respect to φ, measured on the sample S_n, as Q̂(f, S_n) = (1/n) Σ_{i=1}^{n} φ(f(x_i), y_i). For ease of notation we sometimes omit S_n and just write Q̂(f). Given a learning problem defined by (P, F, φ) and a labeled sample S_n, one way to choose a hypothesis is by the empirical risk minimization principle

$$f\_{\text{sup}} = \arg\min\_{f \in \mathcal{F}} \hat{Q}(f, S\_n). \tag{1}$$

We refer to f_sup as the *supervised solution*. In SSL we additionally have samples with unknown labels, so we assume to have n + m samples (x_i, y_i)_{i∈{1,..,n+m}} drawn independently according to P, where y_i has not been observed for the last m samples. We furthermore set U = {x_1, ..., x_{n+m}}, so U is the set that contains all our available information about the feature distribution.

Finally, we denote by m_L(ε, δ) the sample complexity of an algorithm L: for all n ≥ m_L(ε, δ) and all possible distributions P, if L outputs a hypothesis f_L after seeing an n-sample, then with probability at least 1 − δ over the n-sample S_n we have Q(f_L) − min_{f∈F} Q(f) ≤ ε.

## **4 A Framework for Semi-supervised Learning**

We follow the work of [4] and introduce a second convex loss function ψ : F × X → R_+ that depends only on the input feature and a hypothesis. We refer to ψ as the *unsupervised loss*, as it does not depend on any labels. We propose to incorporate the unlabeled data through the loss function ψ, adding it as a penalty term to the supervised loss to obtain the semi-supervised solution

$$f\_{\text{semi}} = \arg\min\_{f \in \mathcal{F}} \frac{1}{n} \sum\_{i=1}^{n} \phi(f(x\_i), y\_i) + \lambda \frac{1}{n+m} \sum\_{j=1}^{n+m} \psi(f, x\_j), \tag{2}$$

where λ > 0 controls the trade-off between the supervised and the unsupervised loss. This is in contrast to [4], where the unsupervised loss is used to restrict the hypothesis space directly. In the following section we recall the important insight that the two formulations are equivalent in certain scenarios, so we can use [4] to generate sample complexity bounds for the SSL framework presented here.

For ease of notation we set R̂(f, U) = (1/(n+m)) Σ_{j=1}^{n+m} ψ(f, x_j) and R(f) = E[ψ(f, X)]. We do not claim any novelty for the idea of adding an unsupervised loss for regularization; a different framework can be found in [11, Chapter 10]. We are, however, not aware of a deeper analysis of this particular formulation, as done, for example, by the sample complexity analysis in this paper. As we are particularly interested in the class of MR schemes, we first show that this method fits our framework.

*Example: Manifold Regularization.* Overloading the notation, we write P(X) for the distribution P restricted to X. In MR one assumes that the input distribution P(X) has support on a compact manifold M ⊂ X and that the predictor f ∈ F varies smoothly in the geometry of M [7]. There are several regularization terms that can enforce this smoothness, one of which is ∫_M ||∇_M f(x)||² dP(x), where ∇_M f is the gradient of f along M. This integral may be approximated with a finite sample of X drawn from P(X) [6]. Given such a sample U = {x_1, ..., x_{n+m}}, one first defines a weight matrix W with W_ij = e^{−||x_i − x_j||²/σ}, and sets L to be the Laplacian matrix L = D − W, where D is the diagonal matrix with D_ii = Σ_{j=1}^{n+m} W_ij. Let furthermore f_U = (f(x_1), ..., f(x_{n+m}))^t be the evaluation vector of f on U. The expression (1/(n+m)²) f_U^t L f_U, where f_U^t L f_U = ½ Σ_{i,j} (f(x_i) − f(x_j))² W_ij, converges to ∫_M ||∇_M f||² dP(x) under certain conditions [6]. This motivates setting the unsupervised loss as ψ(f, (x_i, x_j)) = (f(x_i) − f(x_j))² W_ij. Note that f_U^t L f_U is indeed convex in f: as L is a Laplacian matrix it is positive semi-definite, so f_U^t L f_U is a convex quadratic form in the evaluations of f.
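The following NumPy sketch computes this empirical penalty for a given evaluation vector; σ and the dense pairwise computation are illustrative choices, not the authors' implementation.

```python
import numpy as np

def manifold_penalty(X, f_values, sigma=1.0):
    """Graph approximation of the manifold regularizer.

    X: (n+m, d) sample drawn from the input distribution;
    f_values: evaluations (f(x_1), ..., f(x_{n+m})) of the hypothesis.
    Returns f_U^t L f_U / (n+m)^2.
    """
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / sigma)          # weight matrix W_ij
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian L = D - W
    f_U = np.asarray(f_values)
    return f_U @ L @ f_U / len(f_U) ** 2
```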

## **5 Analysis of the Framework**

In this section we analyze the properties of the solution f_semi found in Eq. (2). We derive sample complexity bounds for this procedure, using results from [4], and compare them to sample complexities for the supervised case. In [4] the unsupervised loss is used to restrict the hypothesis space directly, while we use it as a regularization term in the empirical risk minimization, as is usually done in practice. To switch between the constrained optimization view and our formulation (2), we use the following classical result from convex optimization [15, Theorem 1].

**Lemma 1.** *Let* φ(f(x), y) *and* ψ(f,x) *be functions convex in* f *for all* x, y*. Then the following two optimization problems are equivalent:*

$$\min\_{f \in \mathcal{F}} \frac{1}{n} \sum\_{i=1}^{n} \phi(f(x\_i), y\_i) + \lambda \frac{1}{n+m} \sum\_{i=1}^{n+m} \psi(f, x\_i) \tag{3}$$

$$\min\_{f \in \mathcal{F}} \frac{1}{n} \sum\_{i=1}^{n} \phi(f(x\_i), y\_i) \quad subject \text{ to } \sum\_{i=1}^{n+m} \frac{1}{n+m} \psi(f, x\_i) \le \tau \tag{4}$$

*Where equivalence means that for each* λ *we can find a* τ *such that both problems have the same solution and vice versa.*

For our later results we will need the conditions of this lemma to hold, which we believe is not a strong restriction. In our sample complexity analysis we stay as close as possible to the actual formulation and implementation of MR, which is usually a convex optimization problem. We first turn to our sample complexity bounds.

### **5.1 Sample Complexity Bounds**

Sample complexity bounds for supervised learning typically use a notion of complexity of the hypothesis space to bound the worst-case difference between the estimated and the true risk. As our hypothesis class allows for real-valued functions, we will use the notion of pseudo-dimension Pdim(F, φ), an extension of the VC-dimension to real-valued loss functions φ and hypothesis classes F [17,22]. Informally speaking, the pseudo-dimension is the VC-dimension of the set of binary functions that arise when we threshold the real-valued functions. Note that the pseudo-dimension sometimes takes the loss function as input and sometimes not; this is because some results use the concatenation of loss function and hypotheses to determine the capacity, while others only use the hypothesis class. We can now state our first main result, which is a generalization of [4, Theorem 10] to bounded loss functions and real-valued function spaces.

**Theorem 1.** *Let* $\mathcal{F}^{\psi}_{\tau} := \{f \in \mathcal{F} \mid \mathbb{E}[\psi(f, x)] \le \tau\}$*. Assume that* $\phi, \psi$ *are measurable loss functions such that there exist constants* $B_1, B_2 > 0$ *with* $\psi(f, x) \le B_1$ *and* $\phi(f(x), y) \le B_2$ *for all* $x, y$ *and* $f \in \mathcal{F}$*, and let* $P$ *be a distribution. Furthermore let* $f^*_{\tau} = \arg\min_{f \in \mathcal{F}^{\psi}_{\tau}} Q(f)$*. Then an unlabeled sample* $U$ *of size*

$$m \geq \frac{8B\_1^2}{\epsilon^2} \left[ \ln \frac{16}{\delta} + 2 \operatorname{Pdim}(\mathcal{F}, \psi) \ln \frac{4B\_1}{\epsilon} + 1 \right] \tag{5}$$

*and a labeled sample* $S_n$ *of size*

$$n \ge \max\left(\frac{8B\_2^2}{\epsilon^2} \left[ \ln \frac{8}{\delta} + 2\operatorname{Pdim}(\mathcal{F}\_{\tau+\frac{\epsilon}{2}}^{\psi}, \phi) \ln \frac{4B\_2}{\epsilon} + 1 \right], \frac{h}{4} \right) \tag{6}$$

*is sufficient to ensure that with probability at least* $1 - \delta$ *the classifier* $g \in \mathcal{F}$ *that minimizes* $\hat{Q}(\cdot, S_n)$ *subject to* $\hat{R}(\cdot, U) \le \tau + \frac{\epsilon}{2}$ *satisfies*

$$Q(g) \le Q(f\_\tau^\*) + \epsilon. \tag{7}$$

*Proof (Sketch):* The idea is to combine three partial results with a union bound. For the first part, we use Theorem 5.1 from [22] with $h = \operatorname{Pdim}(\mathcal{F}, \psi)$ to show that an unlabeled sample size of

$$m \geq \frac{8B\_1^2}{\epsilon^2} \left[ \ln \frac{16}{\delta} + 2h \ln \frac{4B\_1}{\epsilon} + 1 \right] \tag{8}$$

is sufficient to guarantee $\hat{R}(f) - R(f) < \frac{\epsilon}{2}$ for all $f \in \mathcal{F}$ with probability at least $1 - \frac{\delta}{4}$. In particular, choosing $f = f^*_{\tau}$ and noting that by definition $R(f^*_{\tau}) \le \tau$, we conclude that with the same probability

$$
\hat{R}(f\_\tau^\*) \le \tau + \frac{\epsilon}{2}.\tag{9}
$$

For the second part we use Hoeffding's inequality to show that the labeled sample size is big enough that with probability at least $1 - \frac{\delta}{4}$ it holds that

$$
\hat{Q}(f\_\tau^\*) \le Q(f\_\tau^\*) + B\_2 \sqrt{\ln(\frac{4}{\delta}) \frac{1}{2n}}.\tag{10}
$$

The third part again uses Theorem 5.1 from [22], now with $h = \operatorname{Pdim}(\mathcal{F}^{\psi}_{\tau+\frac{\epsilon}{2}}, \phi)$, to show that $n \ge \frac{8B_2^2}{\epsilon^2}\left[\ln\frac{8}{\delta} + 2h \ln\frac{4B_2}{\epsilon} + 1\right]$ is sufficient to guarantee $Q(f) \le \hat{Q}(f) + \frac{\epsilon}{2}$ with probability at least $1 - \frac{\delta}{2}$.

Putting everything together with the union bound, we get that with probability at least $1 - \delta$ the classifier $g$ that minimizes $\hat{Q}(\cdot, S_n)$ subject to $\hat{R}(\cdot, U) \le \tau + \frac{\epsilon}{2}$ satisfies

$$Q(g) \le \hat{Q}(g) + \frac{\epsilon}{2} \le \hat{Q}(f\_\tau^\*) + \frac{\epsilon}{2} \le Q(f\_\tau^\*) + \frac{\epsilon}{2} + B\_2 \sqrt{\frac{\ln(\frac{4}{\delta})}{2n}}.\tag{11}$$

Finally, the labeled sample size is big enough to bound the last term on the right-hand side by $\frac{\epsilon}{2}$. □

The next subsection uses this theorem to derive sample complexity bounds for MR. First, however, a remark about the assumption that the loss function $\phi$ is globally bounded. If we assume that $\mathcal{F}$ is a reproducing kernel Hilbert space, there exists an $M > 0$ such that for all $f \in \mathcal{F}$ and $x \in \mathcal{X}$ it holds that $|f(x)| \le M \|f\|_{\mathcal{F}}$. If we restrict the norm of $f$ by introducing a regularization term with respect to the norm $\|\cdot\|_{\mathcal{F}}$, we know that the image of $\mathcal{F}$ is globally bounded. If the image is also closed it will be compact, and thus $\phi$ will be globally bounded in many cases, as most loss functions are continuous. This can also be seen as a justification for using an intrinsic regularization on the norm of $f$ in addition to the regularization by the unsupervised loss, as only then do the guarantees of Theorem 1 apply. Using this bound together with Lemma 1, we can state the following corollary, which gives a PAC-style guarantee for our proposed framework.
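The constant $M$ can be made explicit when the kernel $k$ of $\mathcal{F}$ is bounded; assuming $\sup_x k(x, x) \le M^2$ and writing $k_x$ for the reproducing element of $x$, the reproducing property and the Cauchy-Schwarz inequality give

$$|f(x)| = |\langle f, k_x \rangle_{\mathcal{F}}| \le \|k_x\|_{\mathcal{F}} \, \|f\|_{\mathcal{F}} = \sqrt{k(x, x)} \, \|f\|_{\mathcal{F}} \le M \|f\|_{\mathcal{F}}.$$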

**Corollary 1.** *Let* $\phi$ *and* $\psi$ *be a convex supervised and an unsupervised loss function fulfilling the assumptions of Theorem 1. Then* $f_{semi}$ *from* (2) *satisfies the guarantees given in Theorem 1 when we substitute it for* $g$ *in Inequality* (7)*.*

Recall that in the MR setting $\hat{R}(f) = \frac{1}{(n+m)^2} \sum_{i,j=1}^{n+m} W_{ij}(f(x_i) - f(x_j))^2$, so we gather unlabeled samples from $\mathcal{X} \times \mathcal{X}$ instead of $\mathcal{X}$. Collecting $m$ samples from $\mathcal{X}$ equates to $m^2 - m$ samples from $\mathcal{X} \times \mathcal{X}$, and thus we only need $\sqrt{m}$ instead of $m$ unlabeled samples for the same bound.

### **5.2 Comparison to the Supervised Solution**

In the SSL community it is well known that using SSL does not come without risk [11, Chapter 4]. It is thus of particular interest how those methods compare to purely supervised schemes. There are, however, many potential supervised methods one could think of; many works avoid this problem by comparing to all possible supervised schemes [8,12,13]. The framework introduced in this paper allows for a more fine-grained analysis, as the semi-supervision happens on top of an already existing supervised method. For our framework it is thus natural to compare the sample complexity of $f_{sup}$ with the sample complexity of $f_{semi}$. To compare the supervised and the semi-supervised solution we restrict ourselves to the square loss, which allows us to draw from [1, Chapter 20], where one can find lower and upper sample complexity bounds for the regression setting. The main insight from [1, Chapter 20] is that in this setting the sample complexity depends on whether the hypothesis class is (closure) convex or not. As we anyway need convexity of the space, which is stronger than closure convexity, to use Lemma 1, we can adapt Theorem 20.7 from [1] to our semi-supervised setting.

**Theorem 2.** *Assume that* $\mathcal{F}^{\psi}_{\tau+\epsilon}$ *is a closure convex class with functions mapping to* [0, 1]<sup>1</sup>*, that* $\psi(f, x) \le B_1$ *for all* $x \in \mathcal{X}$ *and* $f \in \mathcal{F}$*, and that* $\phi(f(x), y) = (f(x) - y)^2$*. Assume further that there is a* $B_2 > 0$ *such that* $(f(x) - y)^2 < B_2$ *almost surely for all* $(x, y) \in \mathcal{X} \times \mathcal{Y}$ *and* $f \in \mathcal{F}^{\psi}_{\tau+\epsilon}$*. Then an unlabeled sample size of*

$$m \geq \frac{2B\_1^2}{\epsilon^2} \left[ \ln \frac{8}{\delta} + 2 \operatorname{Pdim}(\mathcal{F}, \psi) \ln \frac{2B\_1}{\epsilon} + 2 \right] \tag{12}$$

*and a labeled sample size of*

$$n \geq \mathcal{O}\left(\frac{B\_2^2}{\epsilon} \left(\text{Pdim}(\mathcal{F}\_{\tau+\epsilon}^{\psi}) \ln \frac{\sqrt{B\_2}}{\epsilon} + \ln \frac{2}{\delta}\right)\right) \tag{13}$$

*is sufficient to guarantee that with probability at least* $1 - \delta$ *the classifier* $g$ *that minimizes* $\hat{Q}(\cdot)$ *subject to* $\hat{R}(f) \le \tau + \epsilon$ *satisfies*

$$Q(g) \le \min\_{f \in \mathcal{F}\_\tau^\psi} Q(f) + \epsilon. \tag{14}$$

*Proof:* As in the proof of Theorem 1, the unlabeled sample size is sufficient to guarantee with probability at least $1 - \frac{\delta}{2}$ that $\hat{R}(f^*_{\tau}) \le \tau + \epsilon$. The labeled sample size is big enough to guarantee with probability at least $1 - \frac{\delta}{2}$ that $Q(g) \le Q(f^*_{\tau+\epsilon}) + \epsilon$ [1, Theorem 20.7]. Using the union bound, with probability at least $1 - \delta$ we have $Q(g) \le Q(f^*_{\tau+\epsilon}) + \epsilon \le Q(f^*_{\tau}) + \epsilon$. □

Note that the previous theorem of course implies the same learning rate in the supervised case, as the only difference will be the pseudo-dimension term. Since in specific scenarios this is also the best possible learning rate, we obtain the following negative result for SSL.

**Corollary 2.** *Assume that* $\phi$ *is the square loss, that* $\mathcal{F}$ *maps to the interval* [0, 1] *and that* $\mathcal{Y} = [1 - B, B]$ *for a* $B \ge 2$*. If* $\mathcal{F}$ *and* $\mathcal{F}^{\psi}_{\tau}$ *are both closure convex, then for sufficiently small* $\epsilon, \delta > 0$ *it holds that* $m_{sup}(\epsilon, \delta) = \tilde{\mathcal{O}}(m_{semi}(\epsilon, \delta))$*, where*

<sup>1</sup> In the remarks after Theorem <sup>1</sup> we argue that in many cases *<sup>|</sup>*f(x)*<sup>|</sup>* is bounded, and in those cases we can always map to [0,1] by re-scaling.

$\tilde{\mathcal{O}}$ *suppresses logarithmic factors, and* $m_{semi}, m_{sup}$ *denote the sample complexities of the semi-supervised and the supervised learner respectively. In other words, the semi-supervised method can improve the learning rate by at most a constant which may depend on the pseudo-dimensions, ignoring logarithmic factors. Note that this holds in particular for the manifold regularization algorithm.*

*Proof:* The assumptions made in the theorem allow us to invoke Equation (19.5) from [1], which states that $m_{semi} = \Omega\left(\frac{1}{\epsilon} + \frac{\operatorname{Pdim}(\mathcal{F}^{\psi}_{\tau})}{\epsilon}\right)$.<sup>2</sup> Using Inequality (13) as an upper bound for the supervised method and comparing it to Eq. (19.5) from [1], we observe that all differences are either constant or logarithmic in $\epsilon$ and $\delta$. □

### **5.3 The Limits of Manifold Regularization**

We now relate our result to the conjecture published in [19]: an SSL method cannot learn faster by more than a constant (which may depend on the hypothesis class $\mathcal{F}$ and the loss $\phi$) than the supervised learner. Theorem 1 from [12] showed that this conjecture is true, up to a logarithmic factor, much like our result, for classes with finite VC-dimension and SSL methods that do *not* make any distributional assumptions. Corollary 2 shows that this statement also holds in some scenarios for all SSL methods that fall into our proposed framework. This is somewhat surprising, as our result holds explicitly for SSL methods that *do* make assumptions about the distribution: MR assumes that the labeling function behaves smoothly w.r.t. the underlying manifold.

## **6 Rademacher Complexity of Manifold Regularization**

In order to find out in which scenarios semi-supervised learning can help, it is useful to also look at distribution *dependent* complexity measures. For this we derive computationally feasible upper and lower bounds on the Rademacher complexity of MR. We first review the work of [20]: they create a kernel such that the inner product in the corresponding reproducing kernel Hilbert space automatically contains the regularization term from MR. Given this kernel we can use standard upper and lower bounds on the Rademacher complexity of RKHSs, as found for example in [10]. The analysis is thus similar to [21], who consider a co-regularization setting. In particular, [20, p. 1] show the following, here informally stated, theorem.

**Theorem 3 ([20, Propositions 2.1, 2.2]).** *Let* $H$ *be an RKHS with inner product* $\langle \cdot, \cdot \rangle_H$*. Let* $U = \{x_1, \ldots, x_{n+m}\}$*,* $f, g \in H$ *and* $f_U = (f(x_1), \ldots, f(x_{n+m}))^t$*. Furthermore let* $\langle \cdot, \cdot \rangle_{\mathbb{R}^{n+m}}$ *be any inner product in* $\mathbb{R}^{n+m}$*. Let* $\tilde{H}$ *be the same space of functions as* $H$*, but with the newly defined inner product* $\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + \langle f_U, g_U \rangle_{\mathbb{R}^{n+m}}$*. Then* $\tilde{H}$ *is an RKHS.*

<sup>2</sup> Note that the original formulation is in terms of the fat-shattering dimension, but this is always bounded by the pseudo-dimension.

Assume now that $L$ is a positive definite $(n+m)$-dimensional matrix and that we set the inner product $\langle f_U, g_U \rangle = f_U^t L g_U$. By setting $L$ as the Laplacian matrix (Sect. 4) we note that the norm of $\tilde{H}$ automatically regularizes w.r.t. the data manifold given by $\{x_1, \ldots, x_{n+m}\}$. We furthermore know the exact form of the kernel of $\tilde{H}$.

**Theorem 4 ([20, Proposition 2.2]).** *Let* $k(x, y)$ *be the kernel of* $H$*,* $K$ *the Gram matrix given by* $K_{ij} = k(x_i, x_j)$ *and* $k_x = (k(x_1, x), \ldots, k(x_{n+m}, x))^t$*. Finally let* $I$ *be the* $(n+m)$*-dimensional identity matrix. The kernel of* $\tilde{H}$ *is then given by* $\tilde{k}(x, y) = k(x, y) - k_x^t (I + LK)^{-1} L k_y$*.*
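In code, the deformed kernel of Theorem 4 is a few lines of linear algebra. The sketch below is our own illustration; the Gaussian weight matrix and the regularization weight $\mu$ folded into $L$ (as in Corollary 3 below) are assumptions, not choices prescribed by [20]:

```python
import numpy as np

def deformed_kernel(k, X_u, mu=1.0):
    """Kernel k~(x, y) = k(x, y) - k_x^t (I + LK)^{-1} L k_y of Theorem 4.

    k   : base kernel function k(x, y) -> float
    X_u : (n+m, d) array of the points in U
    mu  : regularization weight, folded into L (i.e. L <- mu * Laplacian)
    """
    N = len(X_u)
    K = np.array([[k(a, b) for b in X_u] for a in X_u])        # Gram matrix on U
    W = np.exp(-((X_u[:, None] - X_u[None, :]) ** 2).sum(-1))  # example weights
    L = mu * (np.diag(W.sum(1)) - W)                           # scaled Laplacian
    M = np.linalg.inv(np.eye(N) + L @ K)                       # (I + LK)^{-1}

    def k_tilde(x, y):
        kx = np.array([k(xi, x) for xi in X_u])
        ky = np.array([k(xi, y) for xi in X_u])
        return k(x, y) - kx @ M @ (L @ ky)

    return k_tilde
```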

This interpretation of MR is useful for deriving computationally feasible upper and lower bounds on the empirical Rademacher complexity, giving distribution *dependent* complexity bounds. With $\sigma = (\sigma_1, \ldots, \sigma_n)$ i.i.d. Rademacher random variables (i.e., $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$), recall that the empirical Rademacher complexity of the hypothesis class $H$, measured on the labeled input features $\{x_1, \ldots, x_n\}$, is defined as

$$\operatorname{Rad}\_n(H) = \frac{1}{n} \mathbb{E}\_\sigma \sup\_{f \in H} \sum\_{i=1}^n \sigma\_i f(x\_i). \tag{15}$$

**Theorem 5 ([10, p. 333]).** *Let* $H$ *be an RKHS with kernel* $k$ *and* $H_r = \{f \in H \mid \|f\|_H \le r\}$*. Given an* $n$*-sample* $\{x_1, \ldots, x_n\}$ *we can bound the empirical Rademacher complexity of* $H_r$ *by*

$$\frac{r}{n\sqrt{2}}\sqrt{\sum\_{i=1}^{n}k(x\_i, x\_i)} \le \text{Rad}\_n(H\_r) \le \frac{r}{n}\sqrt{\sum\_{i=1}^{n}k(x\_i, x\_i)}.\tag{16}$$
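For an RKHS ball the supremum in (15) has the closed form $\sup_{\|f\|_H \le r} \sum_i \sigma_i f(x_i) = r\sqrt{\sigma^t K \sigma}$, so both the complexity and the bounds of Theorem 5 are easy to check numerically. A small Monte Carlo sketch (our own illustration, not from [10]):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian Gram matrix
r, n = 1.0, len(X)

# Monte Carlo estimate of Rad_n(H_r) = (r/n) E_sigma sqrt(sigma^t K sigma)
sigmas = rng.choice([-1.0, 1.0], size=(10_000, n))
rad = r / n * np.mean(np.sqrt(np.einsum('si,ij,sj->s', sigmas, K, sigmas)))

lower = r / (n * np.sqrt(2)) * np.sqrt(np.trace(K))
upper = r / n * np.sqrt(np.trace(K))
print(lower <= rad <= upper)  # the estimate respects both bounds of (16)
```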

The previous two theorems lead to upper bounds on the complexity of MR, in particular we can bound the maximal reduction over supervised learning.

**Corollary 3.** *Let* $H$ *be an RKHS and for* $f, g \in H$ *define the inner product* $\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + f_U^t (\mu L) g_U$*, where* $L$ *is a positive definite matrix and* $\mu \in \mathbb{R}$ *is a regularization parameter. Let* $\tilde{H}_r$ *be defined as before; then*

$$\operatorname{Rad}\_n(\tilde{H}\_r) \le \frac{r}{n} \sqrt{\sum\_{i=1}^n k(x\_i, x\_i) - k\_{x\_i}^t (\frac{1}{\mu}I + LK)^{-1} Lk\_{x\_i}}.\tag{17}$$

*Similarly we can obtain a lower bound in line with Inequality* (16)*.*

The corollary shows in particular that the difference between the Rademacher complexities of the supervised and the semi-supervised method is driven by the terms $k_{x_i}^t (\frac{1}{\mu} I_{n+m} + LK)^{-1} L k_{x_i}$. This can be used, for example, to compute generalization bounds [17, Chapter 3]. We can also use the kernel to compute local Rademacher complexities, which may yield tighter generalization bounds [5]. Here we illustrate the use of our bounds for choosing the regularization parameter $\mu$ without the need for an additional labeled validation set.

## **7 Experiment: Concentric Circles**

We illustrate the use of Eq. (17) for model selection. In particular, it can be used to get an initial idea of how to choose the regularization parameter $\mu$. The idea is to plot the Rademacher complexity versus the parameter $\mu$, as in Fig. 1. We propose to use a heuristic which is often used in clustering, the so-called elbow criterion [9]: we essentially want to find a $\mu$ such that increasing $\mu$ further does not result in much reduction of the complexity anymore. We test this idea on a dataset which consists of two concentric circles with 500 data points in $\mathbb{R}^2$, 250 per circle; see also Fig. 2. We use a Gaussian base kernel with bandwidth 0.5. The MR matrix $L$ is the Laplacian matrix, whose weights are computed with a Gaussian kernel with bandwidth 0.2. Note that these parameters have to be carefully set in order to capture the structure of the dataset, but this is not our current concern: we assume we have already found a reasonable choice for them. We add a small L2-regularization term that ensures that the radius $r$ in Inequality (17) is finite. The precise value of $r$ plays a secondary role, as the behavior of the curve in Fig. 1 remains the same.
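The computation behind Fig. 1 can be sketched with the bound of Corollary 3. In the snippet below, the data generation, the bandwidth convention, and the choice $r = 1$ are our own simplifications of the setup described above:

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 500)          # two concentric circles,
radii = np.repeat([1.0, 2.0], 250)               # 250 points per circle
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]

sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.5)                            # Gaussian base kernel
W = np.exp(-sq / 0.2)                            # Laplacian weights
L = np.diag(W.sum(1)) - W
n, r = len(X), 1.0

def rad_upper(mu):
    # upper bound (17) on the Rademacher complexity of the regularized ball;
    # here the same points serve as labeled sample and as U, for illustration
    M = np.linalg.inv(np.eye(n) / mu + L @ K)
    red = np.einsum('ij,jk,ki->i', K, M, L @ K)  # k_x^t (I/mu + LK)^{-1} L k_x
    return r / n * np.sqrt(np.sum(np.diag(K) - red))

for mu in [0.01, 0.02, 0.1, 0.2, 1.0]:
    print(mu, rad_upper(mu))                     # look for the elbow in this curve
```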

Looking at Fig. 1, we observe that for $\mu$ smaller than 0.1 the curve still drops steeply, while after 0.2 it starts to flatten out. We thus plot the resulting kernels for $\mu = 0.02$ and $\mu = 0.2$ in Fig. 2, showing the isolines of the kernel around a point of class one, the red dot in the figure. We indeed observe that for $\mu = 0.02$ not much structure is captured yet, while for $\mu = 0.2$ the two concentric circles are almost completely separated by the kernel. Whether this procedure can be elevated to a practical method needs further empirical testing.

**Fig. 1.** The behavior of the Rademacher complexity when using manifold regularization on the concentric circles dataset with different regularization values *µ*.

**Fig. 2.** The resulting kernel when we use manifold regularization with the parameter *µ* set to 0.02 and 0.2.

## **8 Discussion and Conclusion**

This paper analysed improvements in terms of sample or Rademacher complexity for a certain class of SSL methods. The performance of such methods depends both on how the approximation error of the class $\mathcal{F}$ compares to that of $\mathcal{F}^{\psi}_{\tau}$ and on the reduction of complexity achieved by switching from the former to the latter. In our analysis we discussed the second part. The first part depends on a notion the literature often refers to as a *semi-supervised assumption*, which basically states that we can learn with $\mathcal{F}^{\psi}_{\tau}$ as well as with $\mathcal{F}$. Without prior knowledge, it is unclear whether one can efficiently test whether this assumption is true or not. Or is it possible to treat just this as a model selection problem? The only two works we know of that provide some analysis in this direction are [3], which discusses the sample consumption of testing the so-called cluster assumption, and [2], which analyzes the overhead of cross-validating the hyper-parameter coming from their proposed semi-supervised approach.

As some of our settings need restrictions, it is natural to ask whether we can extend the results. First, Lemma 1 restricts us to convex optimization problems. If that assumption were unnecessary, one might get interesting extensions. Neural networks, for example, are typically not convex in their function space, and we cannot guarantee the fast learning rate from Theorem 2 for them. But maybe there are semi-supervised methods that make this space convex, and that could thus achieve fast rates. In Theorem 2 we had to restrict the loss to be the square loss, and [1, Example 21.16] shows that for the absolute loss one cannot achieve such a result. Whether Theorem 2 holds for the hinge loss, a typical choice in classification, is unknown to us. We speculate that it does not, as at least the related classification task, which uses the 0-1 loss, cannot achieve a rate faster than $\frac{1}{\sqrt{n}}$ [19, Theorem 6.8].

Corollary 2 sketches a scenario in which the sample complexity improvement of MR can be at most a constant factor over its supervised counterpart. This may sound like a negative result, as other methods with similar assumptions can achieve exponentially fast learning rates [16, Chapter 6]. But a constant improvement can still have significant effects if this constant can be arbitrarily large. If we set the regularization parameter $\mu$ in the concentric circles example high enough, the only possible classification functions will be those that classify each circle uniformly as one class. At the same time, the pseudo-dimension of the supervised model can be arbitrarily high, and thus so can the constant in Corollary 2. In conclusion, one should be aware of the significant influence constant factors can have in finite sample settings.

## **References**



# **Actionable Subgroup Discovery and Urban Farm Optimization**

Alexandre Millot<sup>1</sup>, Romain Mathonat<sup>1,2</sup>, Rémy Cazabet<sup>3</sup>, and Jean-François Boulicaut<sup>1(B)</sup>

<sup>1</sup> Univ de Lyon, CNRS, INSA Lyon, LIRIS, UMR5205, 69621 Villeurbanne, France *{*alexandre.millot,romain.mathonat,jean-francois.boulicaut*}*@insa-lyon.fr <sup>2</sup> Atos, 69100 Villeurbanne, France <sup>3</sup> Univ de Lyon, CNRS, Université Lyon 1, LIRIS, UMR5205, 69622 Villeurbanne, France remy.cazabet@univ-lyon1.fr

**Abstract.** Designing, selling and/or exploiting connected vertical urban farms is now receiving a lot of attention. In such farms, plants grow in controlled environments according to recipes that specify the different growth stages and instructions concerning many parameters (e.g., temperature, humidity, CO2, light). During the whole process, automated systems collect measures of such parameters and, at the end, we can get some global indicator about the used recipe, e.g., its yield. Looking for innovative ideas to optimize recipes, we investigate the use of a new optimal subgroup discovery method from purely numerical data. It concerns here the computation of subsets of recipes whose labels (e.g., the yield) show an interesting distribution according to a quality measure. When considering optimization, e.g., maximizing the yield, our virtuous circle optimization framework iteratively improves recipes by sampling the discovered optimal subgroup description subspace. We provide our preliminary results about the added-value of this framework thanks to a plant growth simulator that enables inexpensive experiments.

**Keywords:** Subgroup discovery · Virtuous circle · Urban farms

## **1 Introduction**

Conventional farming methods have to face many challenges, for instance soil erosion and/or an overuse of pesticides. The crucial problems related to climate change also stimulate the design of new production systems. The concept of urban farms (see, e.g., AeroFarms, FUL, Infarm<sup>1</sup>) could be part of a solution. It enables the growth of plants in fully controlled environments close to the place where consumers are [8]. Most crop protection chemical products can be removed while optimizing both the quantity and the quality of plants (e.g., improving the flavor [9] or their chemical proportions [20]).

<sup>1</sup> https://aerofarms.com/, http://www.fermeful.com/, https://infarm.com/.


Urban farms can generate large amounts of data that can be pushed to a cloud environment, such that various machine learning and data mining methods can be used. We may then provide new insights about the plant growth process itself (discovering knowledge about not yet identified or understood phenomena) but also offer new services to farm owners. We focus here on services that rely on the optimization of a given target variable, e.g., the yield. The number of parameters influencing plant growth can be relatively large (e.g., temperature, hygrometry, water pH level, nutrient concentration, LED lighting intensity, CO<sup>2</sup> concentration), and there are numerous ways of measuring the crop end-product (e.g., energy cost, plant mass and size, flavor and chemical properties). In general, for a given type of plant, expert knowledge exists concerning the available sub-systems (e.g., to model the impact of nutrients on growth, the effect of LED lighting on photosynthesis, or the energy consumption w.r.t. the temperature instruction), but we are far from a global understanding of the interactions between the various underlying phenomena. In other words, setting the optimal instructions for the diverse set of parameters given an optimization task remains an open problem.

We want to address such an issue by means of data mining techniques. Plant growth recipes are made of instructions in time and space for many numerical attributes. Once a recipe has been completed, collections of measures are available, and we assume that at least one numerical target label value is among them, e.g., the yield. Can we learn from available recipe records to suggest new ones that should provide better results w.r.t. the selected target attribute? For that purpose, we investigate the use of subgroup discovery [12,21]. It aims at discovering subsets of objects - called subgroups - with a high quality according to a quality measure calculated on the target label. Such a quality measure has to capture deviations of the target label distribution in the considered subset of objects with respect to the overall data set. When addressing subgroup discovery in purely numerical data, a few approaches for numerical attributes [6,15] and numerical target labels [14] have been described. To the best of our knowledge, the reference algorithm for subgroup discovery in purely numerical data is SD-Map\* [14]. However, like other methods, it relies on discretization, which leads to loss of information and sub-optimal results.

Our first contribution concerns the proposal of a simple branch and bound algorithm called MinIntChange4SD that exploits the exhaustive enumeration strategy from [11] to achieve a guaranteed optimal subgroup discovery in numerical data without any discretization. Discussing details about this algorithm is out of the scope of this paper and we recently designed a significantly optimized version of MinIntChange4SD in [17]. Our main contribution concerns a new methodology for plant growth recipe optimization that (i) uses MinIntChange4SD to find the optimal subgroup of recipes and (ii) exploits the subgroup description to design better recipes which can in turn be analyzed with subgroup discovery, and so on.

The paper is organized as follows. Section 2 formalizes the problem. In Sect. 3, we discuss related works and their limitations. In Sect. 4, we introduce our new

**Fig. 1.** (**left**) Purely numerical dataset. (**center**) Non-closed ($p_1 = \langle [2, 4], [1, 3] \rangle$, non-hatched) and closed ($p_2 = \langle [2, 4], [2, 3] \rangle$, hatched) interval patterns. (**right**) Depth-first traversal of $m_2$ using minimal changes.

optimal subgroup discovery algorithm and we detail our framework for plant growth recipe optimization. An empirical evaluation of our method is in Sect. 5. Section 6 briefly concludes.

## **2 Problem Definition**

*Numerical Dataset.* A numerical dataset (*G, M, T*) is given by a set of objects *G*, a set of numerical attributes *M* and a numerical target label *T*. In a given dataset, the domain of any attribute *m* ∈ *M* (resp. label *T*) is a finite ordered set denoted $D_m$ (resp. $D_T$). Figure 1 (left) provides a numerical dataset made of two attributes $M = \{m_1, m_2\}$ and a target label *T*. A subgroup *p* is defined by a pattern, i.e., its intent or description, and the set of objects from the dataset where it appears, i.e., its extent, denoted *ext*(*p*). For instance, in Fig. 1, the domain of $m_1$ is $\{1, 2, 3, 4\}$ and the intent $\langle [2, 4], [1, 3] \rangle$ (see the definition of interval patterns later) denotes a subgroup whose extent is $\{g_3, g_4, g_5, g_6\}$.

*Quality Measure, Optimal Subgroup.* The interestingness of a subgroup in a numerical dataset is measured by a numerical value. We consider here the quality measure based on the mean introduced in [14]. Let *p* be a subgroup. The quality of *p* is given by $q^a_{mean}(p) = |ext(p)|^a \times (\mu_{ext(p)} - \mu_{ext(\emptyset)})$, $a \in [0, 1]$, where $|ext(p)|$ denotes the cardinality of *ext*(*p*), $\mu_{ext(p)}$ is the mean of the target label in the extent of *p*, $\mu_{ext(\emptyset)}$ is the mean of the target label in the overall dataset, and *a* is a parameter that controls the number of objects of the subgroups. Let (*G, M, T*) be a numerical dataset, *q* a quality measure and *P* the set of all subgroups of (*G, M, T*). A subgroup $p \in P$ is said to be optimal iff $\forall p' \in P: q(p') \le q(p)$.

*Plant Growth Recipe and Optimization Measure.* A plant growth recipe (*M, P, T*) is given by a set of numerical parameters *M* specifying the growing conditions by means of intervals on numerical values, a numerical value *P* representing the number of stages of the growth cycle, and a numerical target label *T* quantifying the recipe quality. In a given recipe, each parameter of *M* is repeated *P* times s.t. we have $|M| \times P$ numerical attributes. Our goal is to optimize recipes, and we want to discover actionable patterns in the sense that delivering such patterns will support the design of new growing conditions. An optimization measure *f* quantifies the quality of an iteration. We are interested in the mean of the target label of the objects of the optimal subgroup after each iteration. The measure is given by $f_{mean} = \frac{\sum_{i \in ext(p)} T(i)}{|ext(p)|}$, where $T(i)$ is the value of the target label for object *i*.
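Both measures translate directly into code; a minimal sketch (variable names and toy values are ours):

```python
import numpy as np

def quality(extent_labels, all_labels, a=0.5):
    """q^a_mean(p) = |ext(p)|^a * (mean over ext(p) - mean over the dataset)."""
    ext = np.asarray(extent_labels, dtype=float)
    return len(ext) ** a * (ext.mean() - np.mean(all_labels))

def f_mean(extent_labels):
    """Optimization measure: mean target label over the optimal subgroup."""
    return float(np.mean(extent_labels))

all_yields = [10.0, 12.5, 9.0, 20.0, 22.0, 21.5]  # all recipes
subgroup_yields = [20.0, 22.0, 21.5]              # recipes in a candidate subgroup
print(quality(subgroup_yields, all_yields), f_mean(subgroup_yields))
```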

## **3 Related Work**

Designing recipes that optimize a given target attribute (e.g., the mass, the energy cost) is often tackled by domain experts who exploit the scientific literature. In our setting, however, this has two major drawbacks. First, most of the literature remains oriented towards conventional growing conditions and farming methods, whereas in urban farms many more parameters can be controlled. Secondly, the amount of knowledge about plants is unbalanced from one plant to another. Therefore, relying only on expert knowledge for plant recipe optimization is not sufficient. We face an optimization problem under a limited number of iterations: experimenting with plant growth recipes is time consuming (taking weeks or months), so we have to minimize the number of experiments needed to optimize a given recipe. There are two main families of methods addressing the problem of optimizing a function over numerical variables: *direct* and *model-based* [18]. For *direct* methods, the common idea is to apply various strategies to sequentially evaluate solutions in the search space of recipes; such methods, however, do not address the problem of minimizing the number of experiments. For *model-based* methods, the idea is to build a model simulating the ground truth using the available data and then to use it to guide the search process. For instance, [9] introduced a solution for recipe optimization using this type of method, with the goal of optimizing the flavor of plants. Their framework is based on a surrogate model, in this case Symbolic Regression [13]. It considers recipe optimization by means of a promising virtuous circle, but it suffers from several shortcomings: there is no guarantee on the quality of the generated models (i.e., they may not model the ground truth correctly), the number of tested parameters is small (only 3), and the ratio between the number of objects and the number of parameters in the data needs to be at least ten for Symbolic Regression [10], which would clearly restrict the search to only a few parameters.

Heuristic [2,15] and exhaustive [1,5] solutions have been proposed for subgroup discovery. Usually, these approaches consider a set of nominal attributes with a binary label. To work with numerical data, prior discretization of the attributes is then required (see, e.g., [3]) and it leads to loss of information and suboptimal results. A major issue with exhaustive pattern mining is the size of the search space. Fortunately, optimistic estimates can be used to prune the search space and provide tractability in practice [7,21]. [14] introduces a large panel of quality measures and corresponding optimistic estimates for an exhaustive subgroup mining given numerical target labels. They describe SD-Map\*, the reference algorithm for subgroup discovery in numerical data. Notice however that for [14] or others [6,15], discretization techniques over the numerical attributes have to be performed. When looking for an exhaustive search of frequent patterns - not subgroups - in numerical data without discretization, we find the MinIntChange algorithm [11]. Using closure operators (see, e.g., [4]) has become a popular solution to reduce the size of the search space. We indeed exploit most of these ideas to design our optimal subgroup discovery algorithm.

## **4 Optimization with Subgroup Discovery**

### **4.1 An Efficient Algorithm for Optimal Subgroup Discovery**

Let us first introduce MinIntChange4SD, our branch and bound algorithm for the optimal subgroup discovery in purely numerical data. It exploits smart concepts about interval patterns from [11].

*Interval Patterns, Extent and Closure.* In a numerical dataset (*G, M, T*), an interval pattern *p* is a vector of intervals $p = \langle [a_i, b_i] \rangle_{i \in \{1, \ldots, |M|\}}$ with $a_i, b_i \in D_{m_i}$, where each interval is a restriction on an attribute of *M*, and $|M|$ is the number of attributes. Let $g \in G$ be an object. *g* is in the extent of an interval pattern $p = \langle [a_i, b_i] \rangle_{i \in \{1, \ldots, |M|\}}$ iff $\forall i \in \{1, \ldots, |M|\}, m_i(g) \in [a_i, b_i]$. Let $p_1$ and $p_2$ be two interval patterns. $p_1 \subseteq p_2$ means that $p_2$ encloses $p_1$, i.e., the hyper-rectangle of $p_1$ is included in that of $p_2$; $p_1$ is then said to be a specialization of $p_2$. Let *p* be an interval pattern and *ext*(*p*) its extent. *p* is defined as *closed* if and only if it is the most restrictive pattern (i.e., the smallest hyper-rectangle) that contains *ext*(*p*). Figure 1 (center) depicts the dataset of Fig. 1 (left) in a Cartesian plane as well as examples of interval patterns that are closed ($p_2$) or not ($p_1$).
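Extent and closure translate directly into code (our own sketch; the toy data loosely mimics Fig. 1):

```python
import numpy as np

def extent(pattern, data):
    """Rows of `data` ((n, |M|) array) inside every interval of the pattern."""
    mask = np.ones(len(data), dtype=bool)
    for i, (a, b) in enumerate(pattern):
        mask &= (data[:, i] >= a) & (data[:, i] <= b)
    return data[mask]

def closure(pattern, data):
    """Smallest hyper-rectangle (closed pattern) containing ext(pattern)."""
    ext = extent(pattern, data)
    return [(int(col.min()), int(col.max())) for col in ext.T]

data = np.array([[1, 1], [2, 2], [4, 3], [3, 2]])
print(closure([(2, 4), (1, 3)], data))  # -> [(2, 4), (2, 3)], cf. p2 in Fig. 1
```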

*Traversing the Search Space with Minimal Changes.* To guarantee the optimal subgroup discovery, we proceed by the so-called minimal changes introduced in MinIntChange, which enable an exhaustive enumeration within the interval pattern search space. A left minimal change consists in replacing the left bound of an interval by the closest higher value in the domain of the corresponding attribute. Similarly, a right minimal change consists in replacing the right bound by the closest lower value. The search starts with the computation of the minimal interval pattern that covers all the objects of the dataset. Consecutive right or left minimal changes are then applied until each interval of the pattern has left and right bounds of equal value, at which point the algorithm backtracks until it finds a pattern on which a minimal change can still be applied. Figure 1 (right) depicts the depth-first traversal of attribute $m_2$ from the dataset of Fig. 1 (left) using minimal changes.
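The traversal itself can be sketched as follows (our reading of the minimal-change enumeration; the canonical order used to avoid duplicates may differ in detail from [11]):

```python
def enumerate_intervals(interval, domain, allow_left=True):
    """Depth-first enumeration of all sub-intervals via minimal changes.

    `domain` is the sorted, finite domain of the attribute. After a right
    minimal change we only allow further right changes, so that each
    interval is generated exactly once.
    """
    a, b = interval
    yield interval
    inside = [v for v in domain if a <= v <= b]
    if len(inside) > 1:
        if allow_left:
            yield from enumerate_intervals((inside[1], b), domain, True)  # left change
        yield from enumerate_intervals((a, inside[-2]), domain, False)    # right change

print(list(enumerate_intervals((1, 3), [1, 2, 3])))
# [(1, 3), (2, 3), (3, 3), (2, 2), (1, 2), (1, 1)]
```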

*Compressing and Pruning the Search Space.* We leverage the concept of closure to significantly reduce the number of candidate interval patterns: after a minimal change, instead of evaluating the resulting interval pattern, we compute and evaluate its corresponding closed interval pattern. We also exploit advanced pruning techniques to reduce the size of the search space, thanks to a tight optimistic estimate and a combination of *forward checking* and *branch reordering*. Given an interval pattern, the set of all its direct specializations (application of a right or left minimal change on each interval) is computed - forward checking - and those whose optimistic estimate is higher than the quality of the best subgroup found so far are stored. Branch reordering by descending optimistic estimate value is then carried out, which enables exploring the most promising parts of the search space first; it also enables more efficient pruning by raising the minimal quality early. Providing further details about the algorithm is out of the scope of this paper, though its source code is available at https://bit.ly/3bA87NE. The important outcome is that it guarantees the discovery of optimal subgroups for a given quality measure. Provided that it remains tractable, runtime efficiency is not an issue here, given that we want to use the algorithm at a few steps of rather slow vegetable growth processes.

### **4.2 Leveraging Subgroups to Optimize Recipes**

*A Virtuous Circle.* Our optimization framework can be seen as a virtuous circle, where each new iteration uses the information previously gathered to iteratively improve the targeted process. First, a set of recipe experiments is created, with or without the use of expert knowledge. With expert knowledge, values or domains of values are defined for each attribute and recipes are produced using these values; without prior knowledge, we create recipes by randomly sampling the values of each attribute. Secondly, we use subgroup discovery to find the best subgroup of recipes according to the chosen quality measure (e.g., the subgroup of recipes with the best average yield). Then, we exploit the subgroup description - i.e., we apply new restrictions on the range of each parameter according to the description - to generate new, better recipe experiments. Finally, these recipes are in turn processed to find the best subgroup among them, and so on, until the recipes cannot be improved anymore. This way, we sample recipes in a space which gets smaller after each iteration and where the ratio between good and bad solutions gets larger and larger. Figure 2 depicts a step-by-step example of the process behind the framework. Our framework makes use of several hyperparameters that affect runtime efficiency, the number of iterations and the quality of the results.

*Convergence.* The first hyperparameter is the parameter *a* used in the $q^a_{mean}$ quality measure. In standard subgroup discovery, it controls the number of objects in the returned subgroups: a higher value of *a* means larger subgroups. For us, a larger subgroup means a larger search space to sample and, by extension, a higher value of *a* means more iterations are needed to reach small subspaces of

**Fig. 2.** Example of execution of the optimization framework in 3 iterations. We consider a two-dimensional space (i.e., 2 attributes *m*<sup>1</sup> and *m*2) where 4 recipes are generated during each iteration using our first sampling method. The best subgroup (optimizing the yield) of each iteration (hatched) serves as the next iteration sampling space.

the search space. For that reason, we rename this parameter the *convergence rate*. The second hyperparameter is the *minimal improvement* (*minImp*). It defines the minimal improvement of the optimization measure - $f_{mean}$ in our setting - required from one iteration to the next for the framework to keep running. After each iteration, we check whether the following statement holds:

$$\frac{f\_{mean\_{it}} - f\_{mean\_{it-1}}}{f\_{mean\_{it-1}}} \ge minImp$$

If it holds, the optimization framework keeps running; otherwise we consider that the recipes cannot be improved any further. This parameter has a direct effect on the number of iterations needed for the algorithm to converge: a higher value of *minImp* means fewer iterations, and vice versa. Alternatively, we can drop *minImp* and fix the number of iterations by means of another parameter that denotes a budget.
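Putting the pieces together, the framework's main loop can be sketched as follows; the three callables stand in for components described in this paper (e.g., MinIntChange4SD for `best_subgroup`) and every name is our own:

```python
def virtuous_circle(sample, evaluate, best_subgroup, n_recipes=30, min_imp=0.005):
    """Iteratively shrink the recipe search space around the optimal subgroup.

    sample(space, n)               -> n recipes drawn from the current space
    evaluate(recipes)              -> target labels (e.g. yields)
    best_subgroup(recipes, labels) -> (subgroup description, its f_mean)
    """
    space, f_prev = None, None  # None = initial, unrestricted search space
    while True:
        recipes = sample(space, n_recipes)
        labels = evaluate(recipes)
        space, f_curr = best_subgroup(recipes, labels)
        # stop once the relative improvement of f_mean drops below minImp
        if f_prev is not None and (f_curr - f_prev) / f_prev < min_imp:
            return space
        f_prev = f_curr
```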

*Sampling the Subspace.* After each iteration, to generate new recipes to experiment with, we need to sample the subspace corresponding to the description of the best subgroup. Three sampling methods are currently available, which defines yet another hyperparameter. The first method consists in sampling recipes using the original set of values of each attribute (i.e., that of the first iteration) minus the values excluded by the new restrictions applied to the subspace. Let $D^1_m$ be the domain of values of attribute *m* at Iteration 1 and $[a^i_m, b^i_m]$ the interval of attribute *m* at Iteration *i* according to the description of the best subgroup of Iteration $i-1$. Then $\forall v \in D^1_m, v \in D^i_m \Leftrightarrow a^i_m \le v \le b^i_m$. Using this method, the number of values available for sampling each attribute gets smaller after each iteration, meaning that each iteration is faster than the previous one. The second method consists in discretizing the search space through the discretization of each attribute into *k* intervals of equal length; the parameter *k* is set before launching the framework, and recipes are then sampled using the discretized domain of values of each attribute. Finally, we can use *Latin Hypercube Sampling* [16] as a third method: each attribute is divided into *S* equally probable intervals, with *S* the number of samples (i.e., recipes), and recipes are sampled such that each recipe is the only one in the hyperspace that contains it. The number of samples generated at each iteration is also a hyperparameter of the framework.
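For the third method, SciPy's quasi-Monte Carlo module provides a ready-made Latin Hypercube sampler; a sketch with illustrative interval bounds (the bounds are placeholders of ours, not values from the paper):

```python
from scipy.stats import qmc

# intervals of the best subgroup's description, one (lower, upper) per attribute
lower = [5000.0, 0.0, 10.0]
upper = [20000.0, 15.0, 30.0]

sampler = qmc.LatinHypercube(d=len(lower), seed=0)
unit = sampler.random(n=30)              # 30 recipes in the unit hypercube
recipes = qmc.scale(unit, lower, upper)  # rescale into the subgroup's box
print(recipes[:2])
```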

*An Explainable Generic Framework.* Our optimization framework is explainable, contrary to black-box optimization algorithms: each step of the process is easily understandable due to the descriptive nature of subgroup discovery. Although we have been referring to our algorithm MinIntChange4SD when introducing the optimization framework, other subgroup discovery algorithms can be used, including [14] and [17]; notice, however, that the better the quality of the provided subgroup, the better the results returned by our framework. Finally, our method can be applied to many application domains where a numerical target has to be optimized given collections of numerical features (e.g., hyperparameter optimization in machine learning).

## **5 Experiments**

We work on urban farm recipe optimization, though we do not have access to real farming data yet. One of our partners in the FUI DUF 4.0 project (2018-2021) is designing new types of urban farms. We found a way to support the empirical study of our recipe optimizing framework thanks to inexpensive experiments enabled by a simulator. In an urban farm, plants grow in a controlled environment; in the absence of failure, recipe instructions are followed and we can investigate the optimization of the plant yield at the end of the growth cycle. We simulate recipe experiments by using the PCSE<sup>2</sup> simulation environment, setting the characteristics (e.g., the climate) of the different growth stages. We focus on 3 variables that set the amount of solar irradiation (range [0, 25000]), wind (range [0, 30]) and rain (range [0, 40]). The plant growth is split into 3 stages of equal length, such that we finally get 9 attributes. In real life, we can control most of the parameters of an urban farm (e.g., providing more or less light), and a recipe optimization iteration needs new insights about promising parameter values. This is what we can emulate using the crop simulator: given the description of the optimal subgroup, we get insights to support the design of the next simulations, say experiments, as if we were controlling the growth environment. At the end of the growth cycle, we retrieve the total mass of plants harvested using a given recipe. Note that in the following experiments, unless stated otherwise, no assumption is made on the values of parameters (i.e., no restriction is applied on the ranges of values defined above, and expert knowledge is not taken into account). Table 1 features examples of plant growth recipes. The source code and datasets used in our evaluation are available at https://bit.ly/3bA87NE.

<sup>2</sup> https://pcse.readthedocs.io/en/stable/index.html.


**Table 1.** Examples of growth recipes split in 3 stages (P1, P2, P3), 3 attributes, and a target label (Yield).

**Table 2.** Comparison between the descriptions of the overall dataset (DS), the optimal subgroup returned by MinIntChange4SD (MIC4SD), and the optimal subgroup returned by SD-Map\*. "–" means no restriction on the attribute compared to DS; Q and S denote respectively the quality and the size of the subgroup.


### **5.1** MinIntChange4SD **vs** SD-Map\*

We study the descriptions of the best subgroups returned by MinIntChange4SD and SD-Map\*, the state-of-the-art algorithm for subgroup discovery in numerical data. Table 2 depicts the descriptions for a dataset comprised of 30 recipes generated randomly with the simulator. Besides the higher quality of the subgroup returned by MinIntChange4SD, the optimal subgroup description also enables extracting information that is missing from the description obtained with SD-Map\*. In fact, where SD-Map\* only offers a strong restriction on 3 attributes, MinIntChange4SD provides actionable information on all 9 considered attributes. This confirms its qualitative superiority over SD-Map\*, which has to proceed to attribute discretizations.

### **5.2 Empirical Evaluation of the Model Hyperparameters**

Our optimization framework involves several hyperparameters whose values need to be studied to define proper ranges or values that lead to optimized results with a minimized number of recipe experiments. We choose to apply a random search on discretized hyperparameters; note that in this setting grid search is a bad option, due to the combinatorial number of hyperparameter values and the high time cost of the optimization process itself. We discretize each hyperparameter into several values (the convergence rate is split into 10 values ranging from 0.1 to 1, the minimal improvement parameter into 12 values between 0 and 0.05, the sampling parameter over the 3 available methods, and the number of recipes for each iteration is either 20 or 30). We run 100 iterations of random search, with each iteration - read: set of hyperparameter values - being tested 10 times and averaged to account for the randomness of the

**Fig. 3.** Yield of the best recipe depending on the value of different hyperparameters using 100 sample recipes for each hyperparameter.

recipes generated. After each iteration of random search, we store the set of hyperparameter values and the corresponding best recipe found. Figure 3 depicts the results of these experiments. Optimal values for the convergence rate seem to be around 0.5, between 0.001 and 0.01 for the minimal improvement, and the best sampling method is a tie between the first and the second one. Generating 30 recipes per iteration yields better results than 20 (an average yield of 23857 for 30 recipes against 22829 for 20). To compare our method against other methods, we run our framework with the following parameters: 30 recipes times 5 iterations (for a total of 150 recipes), a 0.5 convergence rate, and the second sampling method with *k* = 15. To address the variance in the yield due to randomness in the recipe generation process, we run the framework 10 times, store the best recipe found in each run, and compute the average of the stored recipes. We report the results in Table 3.

### **5.3 Comparison with Alternative Methods**

Good hyperparameter values have been defined for our optimization framework, and we can now compare our method with other ones. Let us consider the use of expert knowledge and random search. First, we want to create a model using expert knowledge: with the help of an agricultural engineer, we defined a priori good values for each parameter and generated a recipe that serves as a baseline for our experiments. We then compare our method against a random search model without expert knowledge. We set the number of recipes to 150 for all methods, to provide a fair comparison with our own model where the number of recipes is also set to 150. To account for randomness in the recipe generation, we run 10 iterations of the random search model, store the value of the best recipe found in each iteration, and compute their average yield. The results of the experiments and a description of the best recipe for each method are available in Table 3. Random search and expert knowledge find recipes with almost equal yields, while our framework finds recipes with higher


**Table 3.** Comparison of the description and the yield of the best recipe returned by each method. EK = Expert Knowledge, RS = Random Search, SM = Surrogate Modeling, VC = Virtuous Circle (our framework).

yield. Note that in industrial settings, an improved yield of 3% to 4% has a significant impact on revenues.

Let us now compare our framework to the Surrogate Modeling method presented in [9]. To be fair, we give the same number of data points to build the Symbolic Regression surrogate model as we used in the previous experiments, i.e., 150 for training the model (we evaluated the RMSE of the model on a test set of 38 other samples). We use gplearn [19] with default parameters, except for the number of generations and the number of models evaluated in each generation, which are respectively 1000 and 2000, as in [9]. Note that the obtained model has an RMSE of 2112 and is composed of more than 2000 terms (including mathematical operators); the argument of interpretability is therefore questionable. A grid search is finally done on this model, and we select the best recipe and obtain its true yield using the PCSE simulation environment. The number of steps for each attribute of the grid search has to be defined; we set it to 5. As we have 9 parameters, the model needs to be evaluated on nearly 2 million potential recipes ($5^9 = 1953125$). Also, the model is composed of hundreds of terms, such that these evaluations are computationally expensive. The best recipe found this way is given in Table 3. The surrogate model predicts a yield of 21137; compared to the ground truth of 10170, the model has a strong bias. This illustrates that using a surrogate model for this kind of problem will give good recipes only if it is reliable enough. Interestingly, the RMSE seems quite good at first glance, but this does not guarantee that the model behaves correctly on all elements of the search space: on the best recipe found, it largely overestimates the yield, leading to an uninteresting recipe. It seems that this method performs poorly on recipes with more attributes than in [9]; further studies are needed here.

## **6 Conclusion**

We investigated the optimization of plant growth recipes in controlled environments, a key process in connected urban farms. We motivated the reasons why existing methods fall short of real-life constraints, including the necessity to minimize the number of experiments needed to provide good results. We detailed a new optimization framework that leverages subgroup discovery to iteratively find better growth recipes by means of a virtuous circle. We also introduced an efficient algorithm for optimal subgroup discovery in purely numerical datasets, which has recently been improved much further in [17]. We avoid discretization, and this provides a qualitative added value (i.e., more interesting optimal subgroups). Future work includes extending our framework to deal with multiple target labels at the same time (e.g., optimizing the yield while keeping the energy cost as low as possible).

**Acknowledgment.** Our research is partially funded by the French FUI programme (project DUF 4.0, 2018–2021).

## **References**



# **AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model**

Tien-Dung Nguyen<sup>1(B)</sup>, Tomasz Maszczyk<sup>1</sup>, Katarzyna Musial<sup>1</sup>, Marc-André Zöller<sup>2</sup>, and Bogdan Gabrys<sup>1</sup>

<sup>1</sup> University of Technology Sydney, Sydney, Australia TienDung.Nguyen-2@student.uts.edu.au, *{*Tomasz.Maszczyk,Katarzyna.Musial-Gabrys,Bogdan.Gabrys*}*@uts.edu.au <sup>2</sup> USU Software AG, Karlsruhe, Germany m.zoeller@usu.de

**Abstract.** The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Therefore, the pipeline composition and optimisation of these methods requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). The AVATAR makes it possible to accelerate automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that the AVATAR is more efficient in evaluating complex pipelines than the traditional evaluation approaches requiring their execution.

## **1 Introduction**

Automatic machine learning (AutoML) has been studied to automate the process of data analytics: collecting and integrating data, composing and optimising ML pipelines, and deploying and maintaining predictive models [1-3]. Although many existing studies have proposed methods to tackle the problem of pipeline composition and optimisation [2,4-9], these methods have two main drawbacks. Firstly, the pipelines' structures, which define the execution order of the pipeline components, use fixed templates [2,5]; although fixed structures reduce the number of invalid pipelines during composition and optimisation, they limit the exploration of promising pipelines which may have a variety of structures. Secondly, while methods based on evolutionary algorithms [4] allow random pipeline structures using the concept of evolution, this randomness tends to construct more invalid pipelines than valid ones.


Besides, the search spaces of the pipelines' structures and of the hyperparameters of the pipelines' components expand significantly. Therefore, the existing approaches tend to be inefficient, as they often attempt to evaluate invalid pipelines. There have been several attempts to reduce the randomness of pipeline construction by using context-free grammars [8,9] or AI planning to guide the construction of pipelines [6,7]. Nevertheless, all of these methods evaluate the validity of a pipeline by executing it (T-method). After executing a pipeline, if the result is a predictive model, the T-method evaluates the pipeline as valid; otherwise it is invalid. If a pipeline is complex, if the complexity of the preprocessing/predictor components within the pipeline is high, or if the size of the dataset is large, the evaluation of the pipeline is expensive. Consequently, the optimisation will require a significant time budget to find well-performing pipelines.

To address this issue, we propose the AVATAR to evaluate ML pipelines using their surrogate models. The AVATAR transforms a pipeline into its surrogate model and evaluates the surrogate instead of executing the original pipeline. We use the business process model and notation (BPMN) [10] to represent ML pipelines. BPMN was invented as a graphical representation of business processes, as well as a description of resources for process execution. In addition, BPMN simplifies the understanding of business activities and the interpretation of the behaviour of ML pipelines. The ML pipelines' components use the Weka libraries<sup>1</sup> for ML algorithms. The evaluation of the surrogate models requires a knowledge base which is generated from many synthetic datasets. To this end, this paper has two main contributions:


This paper is divided into five sections. After the Introduction, Sect. 2 reviews previous approaches to representing and evaluating ML pipelines in the context of AutoML. Section 3 presents the AVATAR to evaluate ML pipelines. Section 4 presents experiments to motivate our research and prove the efficiency of the proposed method. Finally, Sect. 5 concludes this study.

## **2 Related Work**

Salvador et al. [2] proposed an automatic pipeline composition and optimisation method for multicomponent predictive systems (MCPS) to deal with the problem of combined algorithm selection and hyperparameter optimisation (CASH). This method is implemented in the tool AutoWeka4MCPS [2], developed on top of Auto-Weka 0.5 [11]. The pipelines generated by AutoWeka4MCPS are represented using Petri nets [12]. A Petri net is a mathematical modelling language used to represent pipelines [2] as well as data service compositions [13]. The main idea of Petri nets is to represent the transitions of states of a system. Although it is not clearly stated in these previous works [4–7], a directed acyclic graph (DAG) is often used to model sequential pipelines in methods/tools such as AutoWeka4MCPS [14], ML-Plan [6], P4ML [7], TPOT [4] and Auto-sklearn [5]. A DAG is a type of graph with connected vertexes whose connections have only one direction [15]. In addition, a DAG does not allow any directed loop, which means it has a topological ordering. ML-Plan generates sequential workflows consisting of ML components; thus, the workflows are a type of DAG. The final output of P4ML is a pipeline constructed by making an ensemble of other pipelines. Auto-sklearn generates fixed-length sequential pipelines consisting of scikit-learn components. TPOT constructs pipelines consisting of multiple preprocessing sub-pipelines. The authors claim that the representation of the pipelines is a tree-based structure. However, a tree-based structure always starts with a root node and ends with many leaf nodes, whereas the output of a TPOT pipeline is a single predictive model. Therefore, the representation of a TPOT pipeline is more like a DAG. P4ML uses a tree-based structure to make a multi-layer ensemble. This tree-based structure can be specialised into a DAG, because the execution of these pipelines starts from the leaf nodes and ends at the root nodes, where the construction of the ensembles is completed. It means that the control flows of these pipelines have one direction, i.e., they are topologically ordered. Using a DAG to model an ML pipeline makes it easy for humans to understand, as DAGs facilitate visualisation and interpretation of the control flow. However, DAGs do not model the inputs/outputs (i.e., possibly datasets, output predictive models, parameters and hyperparameters of components) between vertexes. Therefore, the existing studies use ad-hoc approaches and make assumptions about the data inputs/outputs of the pipelines' components.

<sup>1</sup> https://www.cs.waikato.ac.nz/ml/weka/.

Although AutoWeka4MCPS, ML-Plan, P4ML, TPOT and Auto-sklearn evaluate pipelines by executing them, these methods have strategies to limit the generation of invalid pipelines. Auto-sklearn uses a fixed pipeline template including preprocessing, predictor and ensemble components. AutoWeka4MCPS also uses a fixed pipeline template, consisting of six components. TPOT, ML-Plan and P4ML use manually designed grammars/primitive catalogues to guide the construction of pipelines. Although these approaches can reduce the number of invalid pipelines, our experiments showed that the time wasted evaluating invalid pipelines is still significant. Moreover, using fixed templates, grammars and primitive catalogues reduces the search space of potential pipelines, which is a drawback during pipeline composition and optimisation.

## **3 Evaluation of ML Pipelines Using Surrogate Models**

Because the evaluation of ML pipelines is expensive in certain cases (i.e., complex pipelines, high-complexity pipeline components and large datasets) in the context of AutoML, we propose the AVATAR<sup>2</sup> to speed up the process by evaluating their surrogate pipelines. The main idea of the AVATAR is to expand the purpose and representation of MCPS introduced in [12]. The AVATAR uses a surrogate model in the form of a Petri net. This surrogate pipeline keeps the structure of the original pipeline, replaces the datasets in the form of data matrices (i.e., simplified mappings of the components' inputs/outputs) by matrices of transformed-features, and replaces the ML algorithms by transition functions that calculate the output from the input tokens (i.e., the matrices of transformed-features). Because of the simplicity of the surrogate pipelines in terms of the size of the tokens and the simplicity of the transition functions, the evaluation of these pipelines is substantially less expensive than that of the original ones.

### **3.1 The AVATAR Knowledge Base**

We define transformed-features as features that represent a dataset's characteristics. These characteristics can change as a result of transformations of the dataset by ML algorithms. Table 1 describes the transformed-features used for the knowledge base. We select these transformed-features because the ability of an ML algorithm to work with a dataset depends on them. These transformed-features are extended from the capabilities of Weka algorithms<sup>3</sup>.

**Table 1.** Descriptions of the transformed-features of a dataset.

<sup>2</sup> https://github.com/UTS-AAi/AVATAR.

The purpose of the AVATAR knowledge base is to describe the logic of the transition functions of the surrogate pipelines. This logic comprises the capabilities and effects of ML algorithms (i.e., pipeline components).

The capabilities are used to verify whether an algorithm is compatible with a dataset. For example, can the linear regression algorithm work with missing values and numeric attributes? The capabilities consist of a list of transformed-features. The value of each capability-related transformed-feature is either 0 (i.e., the algorithm cannot work with a dataset which has this transformed-feature) or 1 (i.e., the algorithm can work with a dataset which has this transformed-feature). Based on the capabilities, we can determine which components of a pipeline (i.e., ML algorithms) are not able to process specific transformed-features of a dataset.

The effects describe data transformations. Similar to the capabilities, the effects consist of a list of transformed-features. Each effect-related transformed-feature can have three values: 0 (i.e., this transformed-feature is not transformed), 1 (i.e., one or more attributes/classes are transformed to this transformed-feature), or −1 (i.e., the effect of this transformed-feature on one or more attributes/classes is disabled).
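To make these two notions concrete, the sketch below shows one way a knowledge-base entry could be encoded; the dictionary layout and the feature names are illustrative assumptions, not the exact schema of the published knowledge base.

```python
# Illustrative encoding of one knowledge-base entry. The layout and feature
# names are assumptions for this sketch, not the published schema.
# Capabilities: 0 = cannot handle the transformed-feature, 1 = can handle it.
# Effects: 0 = leaves it unchanged, 1 = introduces it, -1 = removes it.
linear_regression = {
    "capabilities": {
        "numeric_attribute": 1,
        "nominal_attribute": 0,
        "missing_values": 0,
        "numeric_class": 1,
        "nominal_class": 0,
    },
    "effects": {
        "missing_values": 0,
    },
}
```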

To generate the AVATAR knowledge base<sup>4</sup>, we use synthetic datasets<sup>5</sup> to minimise the number of active transformed-features in each dataset, so that we can evaluate which transformed-features impact the capabilities and effects of ML algorithms<sup>6</sup>, and how. Real-world datasets usually have many active transformed-features, which makes them unsuitable for our purpose. We minimise the number of available transformed-features in each synthetic dataset so that the knowledge base is applicable to a variety of pipelines and datasets. Figure 1 presents the algorithm to generate the AVATAR knowledge base. This algorithm has four main stages:


<sup>3</sup> http://weka.sourceforge.net/doc.dev/weka/core/Capabilities.html.

<sup>4</sup> https://github.com/UTS-AAi/AVATAR/blob/master/avatar-knowledge-base/avatar_knowledge_base.json.

<sup>5</sup> https://github.com/UTS-AAi/AVATAR/tree/master/synthetic-datasets.

<sup>6</sup> https://github.com/UTS-AAi/AVATAR/blob/master/supplementary-documents/avatar_algorithms.txt.

**Fig. 1.** Algorithm to generate the knowledge base for evaluating surrogate pipelines.

current value is a default value, we set this effect-related transformed-feature equal to the difference of the values of this transformed-feature in the output and input datasets.

### **3.2 Evaluation of ML Pipelines**

The AVATAR evaluates an ML pipeline by mapping it to its surrogate pipeline and evaluating this surrogate pipeline. BPMN is the most promising method to represent an ML pipeline, because a BPMN-based ML pipeline can be executed, offers a better interpretation of the pipeline in terms of control flow, data flow and resources for execution, and can be integrated into existing business processes as a subprocess. Moreover, we claim that a Petri net is the most promising method to represent a surrogate pipeline, because the validity of a Petri-net-based simplified ML pipeline can be verified quickly.

**Fig. 2.** Mapping a ML pipeline to its surrogate model.

**Mapping an ML Pipeline to Its Surrogate Model.** The AVATAR maps a BPMN pipeline to a Petri net pipeline via three stages (Fig. 2).


**Fig. 3.** Algorithm for firing a transition of the surrogate model.

**Evaluating a Surrogate Model.** The evaluation of a surrogate model executes the Petri net pipeline. This execution starts by firing each transition of the Petri net pipeline and transforming the input token. As shown in Fig. 3, firing a transition consists of two tasks: (i) the evaluation of the capabilities of the component; and (ii) the calculation of the output token. The first task verifies the validity of the component using the following rule: if the value of a transformed-feature stored in the input token (*f\_in\_token\_i*) is 1 and the corresponding transformed-feature in the component's capabilities (*f\_cap\_i*) is 0, the component is invalid; otherwise, the component is valid. If any component is invalid, the surrogate pipeline is evaluated as invalid. The second task calculates each transformed-feature stored in the output token (*f\_out\_token\_i*) placed in the next place, by adding the value of the transformed-feature stored in the input token (*f\_in\_token\_i*) and the respective transformed-feature in the component's effects (*f\_effect\_i*).
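The firing rule above is simple enough to sketch in a few lines. The code below is a minimal illustration, assuming the dictionary encoding of components introduced in Sect. 3.1 and tokens represented as feature-to-value dictionaries; it is not the released AVATAR implementation.

```python
def fire_transition(component, in_token):
    """Fire one transition: capability check, then output-token calculation.

    `component` follows the illustrative {"capabilities": ..., "effects": ...}
    encoding; `in_token` maps transformed-feature names to values.
    Returns the output token, or None if the component is invalid."""
    # Task (i): invalid if the input token carries a transformed-feature
    # (f_in_token_i == 1) that the component cannot handle (f_cap_i == 0).
    for feature, f_in in in_token.items():
        if f_in == 1 and component["capabilities"].get(feature, 1) == 0:
            return None
    # Task (ii): output token = input token plus the component's effects.
    return {feature: f_in + component["effects"].get(feature, 0)
            for feature, f_in in in_token.items()}

def evaluate_surrogate(pipeline, token):
    """A surrogate pipeline is valid iff every transition fires successfully."""
    for component in pipeline:
        token = fire_transition(component, token)
        if token is None:
            return False
    return True
```

Evaluating a surrogate thus amounts to folding `fire_transition` over the component list, which is why it is substantially cheaper than executing the original pipeline.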

## **4 Experiments**

To investigate the impact of invalid pipelines on ML pipeline composition and optimisation, we first conducted a series of experiments with current state-of-the-art AutoML tools. After that, we conducted experiments to compare the performance of the AVATAR and the existing methods.

### **4.1 Experimental Settings**

Table 2 summarises the characteristics of the datasets<sup>7</sup> used for the experiments. We use these datasets because they were used in previous studies [2,4,5]. The AutoML tools used for the experiments are AutoWeka4MCPS [2] and Auto-sklearn [5]. These tools were selected because their ability to construct and optimise hyperparameters of complex ML pipelines has been empirically proven to be effective in a number of previous studies [2,5,16]. However, these previous experiments had not investigated the negative impact of the evaluation of invalid pipelines on the quality of the pipeline composition and optimisation. This is the goal of our first set of experiments. In the second set of experiments, we show that the AVATAR can significantly reduce the evaluation time of ML pipelines.

**Table 2.** Summary of the datasets' characteristics: the number of numeric attributes, nominal attributes, distinct classes, and instances in the training and testing sets.

<sup>7</sup> https://archive.ics.uci.edu.

### **4.2 Experiments to Investigate the Impact of Invalid Pipelines**

To investigate the impact of invalid pipelines, we use five iterations (Iter) for the first set of experiments. We run these experiments on AWS EC2 t3a.small virtual machines, which have 2 vCPUs and 2 GB of memory. Each iteration uses a different seed number. We set the time budget to 1 h and the memory to 1 GB. We evaluate the pipelines produced by the AutoML tools using three criteria: (1) the number of invalid/valid pipelines, (2) the total evaluation time of invalid/valid pipelines (seconds), and (3) the wasted evaluation time (%). The wasted evaluation time is calculated as the percentage of the total evaluation time of invalid pipelines over the total runtime of the pipeline composition and optimisation. The wasted evaluation time represents the degree of the negative impact of invalid pipelines.
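Stated as a formula (our notation, added here only for clarity):

$$\text{wasted evaluation time (\%)} = 100 \cdot \frac{\sum\_{p \,\in\, P\_{\text{invalid}}} T\_{\text{eval}}(p)}{T\_{\text{total}}},$$

where $P_{\text{invalid}}$ is the set of invalid pipelines, $T_{\text{eval}}(p)$ the evaluation time of pipeline $p$, and $T_{\text{total}}$ the total runtime of the pipeline composition and optimisation.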

Tables 3 and 4 present the negative impacts of invalid pipelines in the ML pipeline composition and optimisation of AutoWeka4MCPS and Auto-sklearn using the above criteria. These tables show that not all of the constructed pipelines are valid. Because AutoWeka4MCPS can compose pipelines with up to six components, it is more likely to generate invalid pipelines, and the evaluation time of these invalid pipelines is significant. For example, the wasted evaluation time is 97.98% in the case of the dataset car and Iter 5. We can see that changing the random iteration has a strong impact on the wasted evaluation time in the case of AutoWeka4MCPS. For example, the experiments with the dataset abalone show that the wasted evaluation time lies between 0.66% and 92.87%. The reason is that the Weka libraries themselves can evaluate the compatibility of a single-component pipeline without execution. If the initialisation of the pipeline composition and optimisation with a specific seed number results in pipelines consisting of only one predictor, and these pipelines are well-performing, the optimisation tends to exploit similar ML pipelines; as a result, the wasted evaluation time is low. However, this impact is negligible in the case of Auto-sklearn, because Auto-sklearn uses meta-learning to initialise with promising ML pipelines. The experiments with the datasets abalone, car and gcredit show that Auto-sklearn limits the generation of invalid pipelines by making assumptions about cleaned input datasets: the experiments crash if the input datasets have multiple attribute types. This means that Auto-sklearn cannot handle invalid pipelines effectively.

**Table 3.** Negative impacts of invalid pipelines in pipeline composition and optimisation of AutoWeka4MCPS. (1): the number of invalid/valid pipelines, (2): the total evaluation time of invalid/valid pipelines (s), (3): the wasted evaluation time (%).

**Table 4.** Negative impacts of invalid pipelines in pipeline composition and optimisation of Auto-sklearn. (1): the number of invalid/valid pipelines, (2): the total evaluation time of invalid/valid pipelines (s), (3): the wasted evaluation time (%).

### **4.3 Experiments to Compare the Performance of AVATAR and the Existing Methods**

In order to demonstrate the efficiency of the AVATAR, we conducted a second set of experiments. We ran these experiments on a machine with an Intel Core i7-8650U CPU and 16 GB of memory. We compare the performance of the AVATAR with the T-method, which requires the execution of pipelines. The T-method is used to evaluate the validity of pipelines in the pipeline composition and optimisation of AutoWeka4MCPS and Auto-sklearn. We randomly generate ML pipelines with up to six components (the component types being missing value handling, dimensionality reduction, outlier removal, data transformation, data sampling and predictor). The predictor is put at the end of each pipeline because a valid pipeline always has a predictor at the end. Each pipeline is evaluated by the AVATAR and the T-method. We set the time budget to 12 h per dataset. We use the following criteria to compare the performance: the number of invalid/valid pipelines, the total evaluation time of invalid/valid pipelines (seconds), the number of pipelines that have the same evaluation results between the AVATAR and the T-method, and the percentage of the pipelines that the AVATAR can validate accurately (%) in comparison to the T-method.

**Table 5.** Comparison of the performance of the AVATAR and the T-method.

Table 5 compares the performance of the AVATAR and the T-method using the above criteria. We can see that the total evaluation time of invalid/valid pipelines of the AVATAR is significantly lower than that of the T-method. While the evaluation time of pipelines by the AVATAR is quite stable, the evaluation time of pipelines by the T-method is much higher and depends on the size of the datasets. It means that the AVATAR is faster than the T-method in evaluating both invalid and valid pipelines, regardless of the size of the datasets. Moreover, we can see that the accuracy of the AVATAR is approximately 99% in comparison with the T-method. We have carefully reviewed the pipelines which have different evaluation results between the AVATAR and the T-method. Interestingly, the AVATAR evaluates all of these pipelines as valid, whereas the T-method evaluates them as invalid. The reason is that the executions of these pipelines cause an out-of-memory problem. In other words, the AVATAR does not consider the allocated memory as an impact on the validity of a pipeline. A promising solution is to reduce the size of an input dataset by adding a sampling component with appropriate hyperparameters. If the sampling size is too small, we may miss important features; if the sampling size is large, we may still run out of memory. We cannot conclude whether the executions of these pipelines would be successful if we allocated more memory. This shows that the validity of a pipeline also depends on its execution environment, such as the available memory. These factors have not yet been considered in the AVATAR. This is an interesting research gap that should be addressed in the future.

**Table 6.** Five invalid pipelines with the longest evaluation time using the T-method on the gcredit dataset.


Finally, we take a detailed look at the invalid pipelines with the longest evaluation time using the T-method on the gcredit dataset, as shown in Table 6. Pipeline #1 (11.092 s) has the structure *ReplaceMissingValues* → *PeriodicSampling* → *NumericToNominal* → *PrincipalComponents* → *SMOreg*. This pipeline is invalid because *SMOreg* does not work with nominal classes, and there is no component transforming nominal to numeric data. The AVATAR is able to evaluate the validity of this pipeline without executing it in just 0.014 s.

## **5 Conclusion**

We have empirically demonstrated the problem of the generation of invalid pipelines during pipeline composition and optimisation. We propose the AVATAR, a pipeline evaluation method using a surrogate model. The AVATAR can be used to accelerate pipeline composition and optimisation methods by quickly ignoring invalid pipelines, improving the effectiveness of the AutoML optimisation process. In future work, we will improve the AVATAR to evaluate pipelines' quality in addition to their validity. Moreover, we will investigate how to employ the AVATAR to reduce search spaces dynamically.

**Acknowledgment.** This research is sponsored by AAi, University of Technology Sydney (UTS).

## **References**

1. Kadlec, P., Gabrys, B.: Architecture for development of adaptive on-line prediction models. Memetic Computing **1**, Article no. 241 (2009). https://doi.org/10.1007/s12293-009-0017-8



# **Detection of Derivative Discontinuities in Observational Data**

Dimitar Ninevski(B) and Paul O'Leary

University of Leoben, 8700 Leoben, Austria automation@unileoben.ac.at http://automation.unileoben.ac.at

**Abstract.** This paper presents a new approach to the detection of discontinuities in the n-th derivative of observational data. This is achieved by performing two polynomial approximations at each interstitial point. The polynomials are coupled by constraining their coefficients to ensure continuity of the model up to the (n − 1)-th derivative; while yielding an estimate for the discontinuity of the n-th derivative. The coefficients of the polynomials correspond directly to the derivatives of the approximations at the interstitial points through the prudent selection of a common coordinate system. The approximation residual and extrapolation errors are investigated as measures for detecting discontinuity. This is necessary since discrete observations of continuous systems are discontinuous at every point. It is proven, using matrix algebra, that positive extrema in the combined approximation-extrapolation error correspond exactly to extrema in the difference of the Taylor coefficients. This provides a relative measure for the severity of the discontinuity in the observational data. The matrix algebraic derivations are provided for all aspects of the methods presented here; this includes a solution for the covariance propagation through the computation. The performance of the method is verified with a Monte Carlo simulation using synthetic piecewise polynomial data with known discontinuities. It is also demonstrated that the discontinuities are suitable as knots for B-spline modelling of data. For completeness, the results of applying the method to sensor data acquired during the monitoring of heavy machinery are presented.

**Keywords:** Data analysis · Discontinuity detection · Free-knot splines

## **1 Introduction**

In the recent past *physics informed data science* has become a focus of research activities, e.g., [9]. It appears under different names e.g., *physics informed* [12]; *hybrid learning* [13]; *physics-based* [17], etc.; but with the same basic idea of embedding physical principles into the data science algorithms. The goal is to ensure that the results obtained obey the laws of physics and/or are based on physically relevant features. Discontinuities in the observations of continuous systems violate some very basic physics and for this reason their detection is of fundamental importance. Consider Newton's second law of motion,

$$F(t) = \frac{\mathrm{d}}{\mathrm{d}t} \left\{ m(t) \, \frac{\mathrm{d}}{\mathrm{d}t} y(t) \right\} = \dot{m}(t) \, \dot{y}(t) + m(t) \, \ddot{y}(t). \tag{1}$$

Any discontinuities in the observations of $m(t)$, $\dot{m}(t)$, $y(t)$, $\dot{y}(t)$ or $\ddot{y}(t)$ indicate a violation of some basic principle: be it that the observation is incorrect or that something unexpected is happening in the system. Consequently, detecting discontinuities is of fundamental importance in physics-based data science. A function $s(x)$ is said to be $C^n$ discontinuous if $s \in C^{n-1} \setminus C^n$, that is, if $s(x)$ has continuous derivatives up to and including order $n-1$, but the $n$-th derivative is discontinuous. Due to the discrete and finite nature of the observational data, only jump discontinuities in the $n$-th derivative are considered; asymptotic discontinuities are not. Furthermore, in more classical data modelling, $C^n$ jump discontinuities form the basis for the locations of knots in B-spline models of observational data [15].

### **1.1 State of the Art**

There are numerous approaches in the literature dealing with estimating regression functions that are smooth except at a finite number of points. These approaches can be classified into four groups: local polynomial methods, spline-based methods, kernel-based methods and wavelet methods. The approaches also vary with respect to the available a priori knowledge about the number of points of discontinuity or the derivative in which these discontinuities appear. For a good literature review of these methods, see [3]. The method used in this paper is relevant both in terms of local polynomials and of spline-based methods; however, the new approach requires no a priori knowledge about the data.

In the local polynomial literature, namely in [8] and [14], ideas similar to the ones presented here are investigated. In these papers, local polynomial approximations from the left and the right side of the point in question are used. The major difference is that neither of these methods uses constraints to ensure that the local polynomial approximations enforce continuity of the lower derivatives, as is done in this paper. As such, they use different residuals to determine the existence of a change point. Using constrained approximation ensures that the underlying physical properties of the system are taken into consideration, which is one of the main advantages of the approach presented here. Additionally, in the aforementioned papers it is not clear whether only co-locative points are considered as possible change points, or whether interstitial points are also considered. This distinction between co-locative and interstitial points is of great importance. Fundamentally, the method presented here can be applied to discontinuities at either location. However, it has been assumed that discontinuities only make sense between the sampled (co-locative) points, i.e., the discontinuities are interstitial.

In [11], on the other hand, one polynomial instead of two is used, and the focus is mainly on detecting $C^0$ and $C^1$ discontinuities. Additionally, the number of change-points must be known a priori, so only their locations are approximated; the required a priori knowledge makes the method unsuitable for real sensor-based system observation.

In the spline-based literature there are heuristic methods (top-down and bottom-up) as well as optimization methods. For a more detailed state of the art on splines, see [2]. Most heuristic methods use a discrete geometric measure to decide whether a point is a knot, such as discrete curvature, kink angle, etc., and then use some (mostly arbitrary) threshold to improve the initial knot set. The method presented here falls under the category of bottom-up approaches; its selection criterion is based on calculus and statistics, which allows the fundamental physical laws governing the system to be incorporated into the model, while also ensuring mathematical relevance and rigour.

### **1.2 The New Approach**

This paper presents a new approach to detecting $C^n$ discontinuities in observational data. It uses constrained coupled polynomial approximation to obtain two estimates for the $n$th Taylor coefficients and their uncertainties, at every interstitial point. These correspond to approximating the local function by polynomials, once from the left, $f(x, \boldsymbol{\alpha})$, and once from the right, $g(x, \boldsymbol{\beta})$. The constraints couple the polynomials to ensure that $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$. In this manner the approximations are $C^{n-1}$ continuous at the interstitial points, while delivering an estimate for the difference in the $n$th Taylor coefficients. All the derivations for the coupled constrained approximations and the numerical implementations are presented. Both the approximation and extrapolation residuals are derived. It is proven that the discontinuities must lie at local positive peaks in the extrapolation error. The new approach is verified with both known synthetic data and real sensor data obtained from observing the operation of heavy machinery.

## **2 Detecting** *C<sup>n</sup>* **Discontinuities**

Discrete observations $s(x_i)$ of a continuous system $s(x)$ are, by their very nature, discontinuous at every sample. Consequently, some measure for discontinuity, with uncertainty, is required, which provides the basis for further analysis.

The observations are considered to be the co-locative points, denoted by $x_i$ and collectively by the vector $\boldsymbol{x}$; however, we wish to estimate the discontinuity at the interstitial points, denoted by $\zeta_i$ and collectively as $\boldsymbol{\zeta}$. Using interstitial points ensures that each data point is used for only one polynomial approximation at a time. Furthermore, in the case of sensor data, one expects the discontinuities to happen between samples. Consequently, the data is segmented at the interstitial points, i.e., between the samples. This requires the use of interpolating functions, and in this work we have chosen to use polynomials.

Polynomials have been chosen because of their approximating, interpolating and extrapolating properties when modelling continuous systems. The Weierstrass approximation theorem [16] states that if $f(x)$ is a continuous real-valued function defined on the real interval $x \in [a, b]$, then for every $\varepsilon > 0$ there exists a polynomial $p(x)$ such that for all $x \in [a, b]$, the supremum norm $\|f(x) - p(x)\|_\infty < \varepsilon$. That is, *any* function $f(x)$ can be approximated by a polynomial to an arbitrary accuracy $\varepsilon$, given a sufficiently high degree.

The basic concept (see Fig. 1) to detect a $C^n$ discontinuity is: to approximate the data to the left of an interstitial point by the polynomial $f(x, \boldsymbol{\alpha})$ of degree $d_L$ and to the right by $g(x, \boldsymbol{\beta})$ of degree $d_R$, while constraining these approximations to be $C^{n-1}$ continuous at the interstitial point. This approximation ensures that,

$$\mathbf{f}^{(k-1)}(\zeta\_i) = \mathbf{g}^{(k-1)}(\zeta\_i), \quad \text{for every } k \in [1 \ldots n] \text{.} \tag{2}$$

while yielding estimates for $f^{(n)}(\zeta_i)$ and $g^{(n)}(\zeta_i)$, together with estimates for their variances $\lambda_f(\zeta_i)$ and $\lambda_g(\zeta_i)$. This corresponds exactly to estimating the Taylor coefficients of the function twice for each interstitial point, i.e., once from the left and once from the right. If they differ significantly, then the function's $n$-th derivative is discontinuous at this point. The Taylor series of a function $f(x)$ around the point $a$ is defined as,

$$f(x) = \sum\_{k=0}^{\infty} \frac{f^{(k)}\left(a\right)}{k!} \left(x - a\right)^k \tag{3}$$

**Fig. 1.** Schematic of a finite set of discrete observations (dotted circles) of a continuous function. The span of the observation is split into a left and a right portion at the interstitial point (circle), with lengths $l_L$ and $l_R$ respectively. The left and right sides are considered to be the functions $f(x)$ and $g(x)$, modelled by the polynomials $f(x, \boldsymbol{\alpha})$ and $g(x, \boldsymbol{\beta})$ of degrees $d_L$ and $d_R$.

for each x for which the infinite series on the right hand side converges. Furthermore, any function which is n + 1 times differentiable can be written as

$$f(x) = \tilde{\mathbf{f}}(x) + R(x) \tag{4}$$

where $\tilde{f}(x)$ is an $n$th-degree polynomial approximation of the function $f(x)$,

$$\tilde{\mathbf{f}}(x) = \sum\_{k=0}^{n} \frac{f^{(k)}\left(a\right)}{k!} \left(x - a\right)^{k} \tag{5}$$

and R(x) is the remainder term. The Lagrange form of the remainder R(x) is given by

$$R(x) = \frac{f^{(n+1)}\left(\xi\right)}{(n+1)!} \left(x - a\right)^{n+1} \tag{6}$$

where ξ is a real number between a and x.

A Taylor expansion around the origin (i.e. a = 0 in Eq. 3) is called a Maclaurin expansion; for more details, see [1]. In the rest of this work, the nth Maclaurin coefficient for the function f(x) will be denoted by

$$t\_f^{(n)} \triangleq \frac{f^{(n)}\,(0)}{n!}.\tag{7}$$

The coefficients of a polynomial $f(x, \boldsymbol{\alpha}) = \alpha_n x^n + \ldots + \alpha_1 x + \alpha_0$ are closely related to the coefficients of the Maclaurin expansion of this polynomial. Namely, it is easy to prove that

$$
\alpha\_k = t\_\mathbf{f}^{(k)}, \quad \text{for every } k \in [0 \dots n] \text{.} \tag{8}
$$

A prudent selection of a common local coordinate system, setting the interstitial point as the origin, ensures that the coefficients of the left and right approximating polynomials correspond to the derivative values at this interstitial point. Namely, one gets a very clear relationship between the coefficients of the left and right polynomial approximations, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, their Maclaurin coefficients, $t_f^{(n)}$ and $t_g^{(n)}$, and the values of the derivatives at the interstitial point

$$t\_{\mathbf{f}}^{(n)} = \alpha\_n = \frac{\mathbf{f}^{(n)}\,(0)}{n!} \quad \text{and} \quad t\_{\mathbf{g}}^{(n)} = \beta\_n = \frac{\mathbf{g}^{(n)}\,(0)}{n!}. \tag{9}$$

From Eq. 9 it is clear that performing a left and right polynomial approximation at an interstitial point is sufficient to get the derivative values at that point, as well as their uncertainties.

## **3 Constrained and Coupled Polynomial Approximation**

The goal here is to obtain $\Delta t_{fg}^{(n)} \triangleq t_f^{(n)} - t_g^{(n)}$ via polynomial approximation. To this end, two polynomial approximations are required, whereby the interstitial point is used as the origin of the common coordinate system, see Fig. 1. The approximations are coupled [6] at the interstitial point by constraining the coefficients such that $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$. This ensures that the two polynomials are $C^{n-1}$ continuous at the interstitial points. This also reduces the degrees of freedom during the approximation, and with this the variance of the solution is reduced. For more details on constrained polynomial approximation see [4,7].

To remain fully general, a local polynomial approximation of degree $d_L$ is performed to the left of the interstitial point with support length $l_L$, creating $f(x, \boldsymbol{\alpha})$; similarly to the right, with $d_R$ and $l_R$, creating $g(x, \boldsymbol{\beta})$. The $x$ coordinates to the left, denoted $\boldsymbol{x}_L$, are used to form the left Vandermonde matrix $\mathbf{V}_L$; similarly $\boldsymbol{x}_R$ forms $\mathbf{V}_R$ to the right. This leads to the following formulation of the approximation process,

$$\mathbf{y}\_L = \mathbf{V}\_L \,\mathbf{\alpha} \quad \text{and} \quad \mathbf{y}\_R = \mathbf{V}\_R \boldsymbol{\beta}. \tag{10}$$

$$
\begin{bmatrix} V\_L & \mathbf{0} \\ \mathbf{0} & V\_R \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} y\_L \\ y\_R \end{bmatrix} \tag{11}
$$

$C^{n-1}$ continuity implies $\alpha_i = \beta_i$ for every $i \in [0 \ldots n-1]$, which can be written in matrix form as

$$
\begin{bmatrix} \mathbf{0} & I\_{n-1} & \mathbf{0} & -I\_{n-1} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \mathbf{0} \tag{12}
$$

Defining

$$\mathbf{V} \triangleq \begin{bmatrix} \mathbf{V}\_{L} & \mathbf{0} \\ \mathbf{0} & \mathbf{V}\_{R} \end{bmatrix}, \quad \boldsymbol{\gamma} \triangleq \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix}, \quad \boldsymbol{y} \triangleq \begin{bmatrix} \boldsymbol{y}\_{L} \\ \boldsymbol{y}\_{R} \end{bmatrix} \quad \text{and} \quad \mathbf{C} \triangleq \begin{bmatrix} \mathbf{0} & I\_{n-1} & \mathbf{0} & -I\_{n-1} \end{bmatrix},$$

we obtain the task of least-squares minimization with homogeneous linear constraints,

$$\min\_{\gamma} \; \|\boldsymbol{y} - \mathbf{V}\,\boldsymbol{\gamma}\|\_{2}^{2} \quad \text{given} \quad \mathbf{C}\,\boldsymbol{\gamma} = \mathbf{0}. \tag{13}$$

Clearly *γ* must lie in the null-space of *C*; now, given *N*, an ortho-normal vector basis set for null {*C*}, we obtain,

$$
\gamma = \mathbf{N} \,\delta.\tag{14}
$$

Back-substituting into Eq. 13 yields,

$$\min\_{\delta} \|\boldsymbol{y} - \boldsymbol{V}\,\boldsymbol{N}\,\delta\|\_{2}^{2} \tag{15}$$

The least squares solution to this problem is,

$$\boldsymbol{\delta} = \left(\mathbf{V}\,\mathbf{N}\right)^{+} \boldsymbol{y},\tag{16}$$

and consequently,

$$\boldsymbol{\gamma} = \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix} = \mathbf{N}\,\left(\mathbf{V}\,\mathbf{N}\right)^{+}\,\boldsymbol{y}. \tag{17}$$

Formulating the approximation in the above manner ensures that the difference in the Taylor coefficients can be simply computed as

$$
\Delta t\_{\mathbf{f}\mathbf{g}}^{(n)} = t\_{\mathbf{f}}^{(n)} - t\_{\mathbf{g}}^{(n)} = \alpha\_n - \beta\_n. \tag{18}
$$

Now defining $\boldsymbol{d} = [1, \mathbf{0}_{d_L-1}, -1, \mathbf{0}_{d_R-1}]^{\mathrm{T}}$, $\Delta t_{fg}^{(n)}$ is obtained from $\boldsymbol{\gamma}$ as

$$
\Delta t\_{\rm fg}^{(n)} = \mathbf{d}^{\rm T} \boldsymbol{\gamma} = \mathbf{d}^{\rm T} \mathbf{N} \ (\mathbf{V} \ \mathbf{N})^{+} \ \boldsymbol{y}. \tag{19}
$$
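Equations (10)–(19) translate directly into a few lines of linear algebra. The sketch below is a minimal NumPy/SciPy implementation under our own conventions (coefficients stored in increasing order $\alpha_0, \ldots, \alpha_{d_L}$; the function name and signature are ours), not the authors' code.

```python
import numpy as np
from scipy.linalg import null_space

def coupled_taylor_difference(x_l, y_l, x_r, y_r, n, d_l, d_r, zeta):
    """Estimate Delta t_fg^(n) at the interstitial point zeta via the
    constrained coupled approximation of Eqs. (10)-(19)."""
    # Common local coordinates: the interstitial point becomes the origin.
    V_l = np.vander(np.asarray(x_l) - zeta, d_l + 1, increasing=True)
    V_r = np.vander(np.asarray(x_r) - zeta, d_r + 1, increasing=True)
    # Block-diagonal V and stacked y (Eq. 11).
    V = np.block([[V_l, np.zeros((len(x_l), d_r + 1))],
                  [np.zeros((len(x_r), d_l + 1)), V_r]])
    y = np.concatenate([y_l, y_r])
    # Constraints alpha_i = beta_i for i = 0..n-1 (Eq. 12).
    C = np.zeros((n, d_l + d_r + 2))
    C[:, :n] = np.eye(n)
    C[:, d_l + 1:d_l + 1 + n] = -np.eye(n)
    # gamma must lie in null(C): gamma = N delta (Eq. 14).
    N = null_space(C)
    delta, *_ = np.linalg.lstsq(V @ N, y, rcond=None)   # Eq. (16)
    gamma = N @ delta                                    # Eq. (17)
    alpha, beta = gamma[:d_l + 1], gamma[d_l + 1:]
    return alpha[n] - beta[n]                            # Eq. (18)
```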

### **3.1 Covariance Propagation**

Defining $\mathbf{K} = \mathbf{N}\,(\mathbf{V}\,\mathbf{N})^{+}$ yields $\boldsymbol{\gamma} = \mathbf{K}\,\boldsymbol{y}$. Then, given the covariance of $\boldsymbol{y}$, i.e., $\boldsymbol{\Lambda}_y$, one gets that,

$$\boldsymbol{\Lambda}\_{\gamma} = \mathbf{K} \,\boldsymbol{\Lambda}\_{y} \,\mathbf{K}^{\mathrm{T}}.\tag{20}$$

Additionally, from Eq. 19 one could derive the covariance of the difference in the Taylor coefficients

$$\boldsymbol{\Lambda}\_{\Delta} = \boldsymbol{d}^{\mathrm{T}} \,\boldsymbol{\Lambda}\_{\gamma} \,\boldsymbol{d} \tag{22}$$

Keep in mind that, if one uses approximating polynomials of degree $n$ to determine a discontinuity in the $n$th derivative, as done so far, $\boldsymbol{\Lambda}_{\Delta}$ is just a scalar and corresponds to the variance of $\Delta t_{fg}^{(n)}$.
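Continuing the sketch above, the covariance propagation is equally compact; `V` and `N` are as constructed there, `d` is the selection vector with $+1$ at the entry of $\alpha_n$ and $-1$ at that of $\beta_n$, and `Lambda_y` (the covariance of the stacked observations) is assumed known. Again, the name and signature are ours.

```python
import numpy as np

def taylor_difference_variance(V, N, d, Lambda_y):
    """Variance of Delta t_fg^(n) via Eqs. (20) and (22)."""
    K = N @ np.linalg.pinv(V @ N)        # gamma = K y
    Lambda_gamma = K @ Lambda_y @ K.T    # Eq. (20)
    return d @ Lambda_gamma @ d          # Eq. (22): a scalar for degree-n models
```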

## **4 Error Analysis**

In this paper we consider three measures for error:


### **4.1 Approximation Error**

The residual vector has the form

$$r = y - V\gamma = \begin{bmatrix} y\_L - V\_L \alpha \\ y\_R - V\_R \beta \end{bmatrix}.$$

The approximation error is calculated as

$$\begin{split} &E\_{a} = \|\mathbf{r}\|\_{2}^{2} = \|\mathbf{y}\_{L} - \mathbf{V}\_{L}\boldsymbol{\alpha}\|\_{2}^{2} + \|\mathbf{y}\_{R} - \mathbf{V}\_{R}\boldsymbol{\beta}\|\_{2}^{2} \\ &= \left(\mathbf{y}\_{L} - \mathbf{V}\_{L}\boldsymbol{\alpha}\right)^{\mathrm{T}} \left(\mathbf{y}\_{L} - \mathbf{V}\_{L}\boldsymbol{\alpha}\right) + \left(\mathbf{y}\_{R} - \mathbf{V}\_{R}\boldsymbol{\beta}\right)^{\mathrm{T}} \left(\mathbf{y}\_{R} - \mathbf{V}\_{R}\boldsymbol{\beta}\right) \\ &= \mathbf{y}^{\mathrm{T}}\boldsymbol{y} - 2\boldsymbol{\alpha}^{\mathrm{T}}\mathbf{V}\_{L}^{\mathrm{T}}\boldsymbol{y}\_{L} + \boldsymbol{\alpha}^{\mathrm{T}}\mathbf{V}\_{L}^{\mathrm{T}}\mathbf{V}\_{L}\boldsymbol{\alpha} - 2\boldsymbol{\beta}^{\mathrm{T}}\mathbf{V}\_{R}^{\mathrm{T}}\boldsymbol{y}\_{R} + \boldsymbol{\beta}^{\mathrm{T}}\mathbf{V}\_{R}^{\mathrm{T}}\mathbf{V}\_{R}\boldsymbol{\beta}. \end{split}$$

**Fig. 2.** Schematic of the approximations around the interstitial point. Red: left polynomial approximation $f(x, \boldsymbol{\alpha})$; dotted red: extrapolation of $f(x, \boldsymbol{\alpha})$ to the RHS; blue: right polynomial approximation $g(x, \boldsymbol{\beta})$; dotted blue: extrapolation of $g(x, \boldsymbol{\beta})$ to the LHS; $\varepsilon_i$ is the vertical distance between the extrapolated value and the observation. The approximation is constrained with the conditions: $f(0, \boldsymbol{\alpha}) = g(0, \boldsymbol{\beta})$ and $f'(0, \boldsymbol{\alpha}) = g'(0, \boldsymbol{\beta})$. (Color figure online)

### **4.2 Combined Error**

The basic concept, which can be seen in Fig. 2, is as follows: the left polynomial f (x, *<sup>α</sup>*), which approximates over the values *<sup>x</sup>*L, is extended to the right and evaluated at the points *<sup>x</sup>*R. Analogously, the right polynomial <sup>g</sup> (x, *<sup>β</sup>*) is evaluated at the points *x*L. If there is no C<sup>n</sup> discontinuity in the system, the polynomials f and g must be equal and consequently the extrapolated values won't differ significantly from the approximated values.

**Analytical Combined Error.** The extrapolation error in a continuous case, i.e. between the two polynomial models, can be computed with the following 2-norm,

$$\varepsilon\_x = \int\_{x\_{min}}^{x\_{max}} \left\{ \mathbf{f}(x, \alpha) - \mathbf{g}(x, \beta) \right\}^2 \, \mathrm{d}x. \tag{23}$$

Given the constraints, which ensure that $\alpha_i = \beta_i$ for $i \in [0, \ldots, n-1]$, we obtain,

$$
\varepsilon\_x = \int\_{x\_{min}}^{x\_{max}} \left\{ \left( \alpha\_n - \beta\_n \right) x^n \right\}^2 \,\mathrm{d}x. \tag{24}
$$

Expanding and performing the integral yields,

$$\varepsilon\_x = (\alpha\_n - \beta\_n)^2 \left\{ \frac{x\_{\max}^{2n+1} - x\_{\min}^{2n+1}}{2n+1} \right\} \tag{25}$$

Given fixed values for $x_{min}$ and $x_{max}$ across a single computation, the factor,

$$k = \frac{x\_{max}^{2n+1} - x\_{min}^{2n+1}}{2n+1} \tag{26}$$

is a constant. Consequently, the extrapolation error is directly proportional to the square of the difference in the Taylor coefficients,

$$
\varepsilon\_x \propto \left(\alpha\_n - \beta\_n\right)^2 \propto \left\{\Delta t\_{\mathbf{f}\mathbf{g}}^{(n)}\right\}^2. \tag{27}
$$

**Numerical Combined Error.** In the discrete case, one can write the errors of f(x, *<sup>α</sup>*) and g(x, *<sup>β</sup>*) as

$$\mathbf{e}\_{\mathsf{f}} = \mathbf{y} - \mathsf{f}(x, \alpha) \quad \text{and} \quad \mathbf{e}\_{\mathsf{g}} = \mathbf{y} - \mathsf{g}(x, \beta) \tag{28}$$

respectively. Consequently, one could define an error function as

$$E\_{\mathbf{f}\mathbf{g}} = \|\mathbf{e}\_{\mathbf{f}} - \mathbf{e}\_{\mathbf{g}}\|\_{2}^{2} = \|(\alpha\_{n} - \beta\_{n})\,\mathbf{z}\|\_{2}^{2} = (\alpha\_{n} - \beta\_{n})^{2}\,\mathbf{z}^{\mathrm{T}}\mathbf{z} = (\alpha\_{n} - \beta\_{n})^{2}\sum\_{i} x\_{i}^{2n} \tag{29}$$

where $\mathbf{z} \triangleq \boldsymbol{x}^{\wedge n}$ denotes the element-wise $n$th power of $\boldsymbol{x}$. From these calculations it is clear that in the discrete case the error is also directly proportional to the square of the difference in the Taylor coefficients, and that $E_{fg} \propto \varepsilon_x$. This proves that the numerical computation is consistent with the analytical continuous error.

### **4.3 Extrapolation Error**

One could also define a different kind of error, based just on the extrapolative properties of the polynomials. Namely, using the notation from the beginning of Sect. 3, one defines

$$\mathbf{r}\_{\text{ef}} = \mathbf{y}\_L - \mathbf{g}(\boldsymbol{x}\_L, \boldsymbol{\beta}) = \mathbf{y}\_L - \mathbf{V}\_L \boldsymbol{\beta} \quad \text{and} \quad \mathbf{r}\_{\text{eg}} = \mathbf{y}\_R - \mathbf{f}(\boldsymbol{x}\_R, \boldsymbol{\alpha}) = \mathbf{y}\_R - \mathbf{V}\_R \boldsymbol{\alpha}$$

and then calculates the error as

$$\begin{split} &E\_e = \mathbf{r}\_{\text{ef}}^{\text{T}} \mathbf{r}\_{\text{ef}} + \mathbf{r}\_{\text{eg}}^{\text{T}} \mathbf{r}\_{\text{eg}} \\ &= \left(\boldsymbol{y}\_L - \mathbf{V}\_L \boldsymbol{\beta}\right)^{\text{T}} \left(\boldsymbol{y}\_L - \mathbf{V}\_L \boldsymbol{\beta}\right) + \left(\boldsymbol{y}\_R - \mathbf{V}\_R \boldsymbol{\alpha}\right)^{\text{T}} \left(\boldsymbol{y}\_R - \mathbf{V}\_R \boldsymbol{\alpha}\right) \\ &= \boldsymbol{y}^{\text{T}} \boldsymbol{y} - 2\boldsymbol{\beta}^{\text{T}} \mathbf{V}\_L^{\text{T}} \mathbf{y}\_L + \boldsymbol{\beta}^{\text{T}} \mathbf{V}\_L^{\text{T}} \mathbf{V}\_L \boldsymbol{\beta} - 2\boldsymbol{\alpha}^{\text{T}} \mathbf{V}\_R^{\text{T}} \mathbf{y}\_R + \boldsymbol{\alpha}^{\text{T}} \mathbf{V}\_R^{\text{T}} \mathbf{V}\_R \boldsymbol{\alpha}. \end{split}$$

In the example in Sect. 5, it will be seen that there is no significant numerical difference between these two errors.

## **5 Numerical Testing**

The numerical testing is performed with: synthetic data from a piecewise polynomial, where the locations of the C<sup>n</sup> discontinuities are known; and with real sensor data emanating from the monitoring of heavy machinery.

### **5.1 Synthetic Data**

In the literature on splines, functions of the type $y(x) = e^{-x^2}$ are commonly used. However, this function is analytic and $C^\infty$ continuous; consequently it was not considered a suitable function for testing. In Fig. 3 a piecewise polynomial with a similar shape is shown; however, this curve has $C^2$ discontinuities at known locations. The algorithm was applied to the synthetic data from the piecewise polynomial, with added noise with $\sigma = 0.05$, and the results for a single case can be seen in Fig. 3. Additionally, a Monte Carlo simulation with $m = 10000$ iterations was performed and the results of the algorithm were compared to the true locations of the two known knots. The mean errors in the locations of the knots are: $\mu_1 = (5.59 \pm 2.05) \times 10^{-4}$ with 95% confidence, and $\mu_2 = (-4.62 \pm 1.94) \times 10^{-4}$. Errors on the scale of $10^{-4}$, in a support with range $[0, 1]$ and 5% noise amplitude in the curve, can be considered a highly satisfactory result.

**Fig. 3.** A piecewise polynomial of degree $d = 2$, created from the knot sequence $\boldsymbol{x}_k = [0, 0.3, 0.7, 1]$ with the corresponding values $\boldsymbol{y}_k = [0, 0.3, 0.7, 1]$. The end points are clamped with $y'(x)|_{0,1} = 0$. Gaussian noise is added with $\sigma = 0.05$. Top: the circles mark the known points of $C^2$ discontinuity; the blue and red lines indicate the detected discontinuities; additionally, the data has been approximated by the B-spline (red) using the detected discontinuities as knots. Bottom: shows $\Delta t_{fg}^{(n)} = t_f^{(n)} - t_g^{(n)}$, together with the two identified peaks. (Color figure online)
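As a usage illustration of the `coupled_taylor_difference` sketch from Sect. 3, the fragment below scans the interstitial points of a toy curve with a known $C^2$ discontinuity and picks the largest squared Taylor difference; the curve, window size and naive peak-picking are our own choices, not the paper's Monte Carlo setup.

```python
import numpy as np

# Toy data: C^0/C^1 continuous at x = 0.5, second derivative jumps 2 -> -2.
x = np.linspace(0.0, 1.0, 201)
y_true = np.where(x < 0.5, x**2, 0.25 + (x - 0.5) - (x - 0.5)**2)
y = y_true + np.random.default_rng(0).normal(0.0, 0.005, x.size)

n = d_l = d_r = 2     # look for C^2 discontinuities with quadratics
support = 10          # observations per side
dt = np.full(x.size - 1, np.nan)
for i in range(support, x.size - support + 1):
    zeta = 0.5 * (x[i - 1] + x[i])   # interstitial point between samples
    dt[i - 1] = coupled_taylor_difference(
        x[i - support:i], y[i - support:i],
        x[i:i + support], y[i:i + support], n, d_l, d_r, zeta)

j = np.nanargmax(dt**2)              # largest squared Taylor difference
print("detected C^2 discontinuity near x =", 0.5 * (x[j] + x[j + 1]))
```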

### **5.2 Sensor Data**

The algorithm was also applied to a set of real-world sensor data<sup>1</sup> emanating from the monitoring of heavy machinery. The original data set can be seen in Fig. 4 (top). It has many local peaks and periods of little or no change, so the algorithm was used to detect discontinuities in the first derivative, in order to determine the peaks and phases. The peaks in the Taylor differences were used in combination with the peaks of the extrapolation error to determine the points of discontinuity. A peak in the Taylor differences means that the Taylor coefficients are significantly different at that interstitial point, compared to other interstitial points in the neighbourhood. However, if there is no peak in the extrapolation errors at the same location, then the peak found by the Taylor differences is deemed insignificant, since one polynomial could model both the left and right values and as such the peak isn't a discontinuity. Additionally, it can be seen in

<sup>1</sup> For confidentiality reasons the data has been anonymized.

**Fig. 4.** The top-most graph shows a function $y(x)$, together with the detected $C^1$ discontinuity points. The middle graph shows the difference in the Taylor polynomials $\Delta t_{fg}^{(n)}$, calculated at every interstitial point. The red and blue circles mark the relevant local maxima and minima of the difference, respectively. According to these, the red and blue lines are drawn in the top-most graph. The bottom graph shows the approximation error evaluated at every interstitial point. (Color figure online)

**Fig. 5.** The two error functions, $E_e$ and $E_{fg}$, as defined in Sect. 4, for the example from Fig. 4. One can see that the locations of the peaks do not change, and the two errors do not differ significantly.

Fig. 5 that both the extrapolation error and the combined error, as defined in Sect. 4, have peaks at the same locations, and as such the results they provide do not differ significantly.

## **6 Conclusion and Future Work**

It may be concluded, from the results achieved, that coupled constrained polynomial approximation yields a good method for the detection of $C^n$ discontinuities in discrete observational data of continuous systems. Local peaks in the square of the difference of the Taylor polynomials provide a relative measure for determining the locations of discontinuities.

Current investigations indicate that the method can be implemented directly as a convolutional operator, which will yield a computationally efficient solution. The use of discrete orthogonal polynomials [5,10] is being tested as a means of improving the sensitivity of the results to numerical perturbations.

**Acknowledgements.** This work was partially funded by:

1. The COMET program within the K2 Center "Integrated Computational Material, Process and Product Engineering (IC-MPPE)" (Project No 859480). This program is supported by the Austrian Federal Ministries for Transport, Innovation and Technology (BMVIT) and for Digital and Economic Affairs (BMDW), represented by the Austrian research funding association (FFG), and the federal states of Styria, Upper Austria and Tyrol.

2. The European Institute of Innovation and Technology (EIT), a body of the European Union which receives support from the European Union's Horizon 2020 research and innovation programme. This was carried out under Framework Partnership Agreement No. 17031 (MaMMa - Maintained Mine & Machine).

The authors gratefully acknowledge this financial support.

## **References**



# **Improving Prediction with Causal Probabilistic Variables**

Ana Rita Nogueira1,2(B), João Gama1(B), and Carlos Abreu Ferreira1(B)

<sup>1</sup> LIAAD - INESC TEC, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal ana.r.nogueira@inesctec.pt, jgama@fep.up.pt, cgf@isep.ipp.pt <sup>2</sup> Faculdade de Ciências da Universidade do Porto, Rua do Campo Alegre 1021/1055, 4169-007 Porto, Portugal

**Abstract.** Feature engineering has commonly been applied in classification problems as a means to increase the performance of classification algorithms. There are already many methods for constructing features based on the combination of attributes but, to the best of our knowledge, none of these methods takes into account a particular characteristic found in many problems: causality. In many observational data sets, causal relationships can be found between the variables, meaning that it is possible to extract those relations from the data and use them to create new features. The main goal of this paper is to propose a framework for the creation of new supposedly causal probabilistic features that encode the inferred causal relationships between the target and the other variables. In our experiments, an improvement in performance was achieved when the new features were applied with the *Random Forest* algorithm.

**Keywords:** Causality · Causal discovery · Conditional probability · Feature engineering · Causal features

## **1 Introduction**

In regular classification problems, a set of data, classified with a finite set of classes, is used as input so that the chosen classification algorithm can build a model that represents the behaviour of the learning set. This classifier can produce better or worse results, depending on the data and on how the algorithm handles it.

Nevertheless, in many problems, applying only machine learning algorithms may not be the answer [4]. Instead, the use of feature engineering can be a way of improving the performance of these algorithms.

Feature engineering is a process by which new information is extracted from the available data to create new features. These new features are related to the original variables, but also to the target variable, and are a better representation of the knowledge embedded in the data, hence helping the algorithms achieve more accurate results [4]. These types of solutions are usually problem-related: one solution might work in one particular problem, but not in another. However, there is one particular characteristic common to many classification problems: causality. In observational data, causal relationships may exist between variables, especially in data related to medical problems (among others) [16,17]. This fact should be taken into consideration, for example when selecting or creating new features, since it can give clues as to which variables are the most important to the problem.

By definition, causality, more specifically causal discovery, relates to the search for possible cause-effect relationships between variables [13]. The application of causal discovery in the various tasks of machine learning can be challenging, both at the level of the causal process and of the sampling process that generates the observed data [9]. Despite this fact, this subject has been the focus of several researchers over the years, given the importance and the potential impact that the discovery of causal relationships between events can have on problem-solving. In the words of Judea Pearl: *"while probabilities encode our beliefs about a static world, causality tells us whether and how probabilities change when the world changes, be it by intervention or by an act of imagination"* [20]. By discovering causal relationships, it is possible to uncover not only correlations but also relations that explain how and why the variables behave the way they do.

In this paper, we propose a framework to create new features for discrete data sets (discrete features + discrete target) based on the causal relationships uncovered in the data. These attributes are created through the generation of a causal network, using a modified version of PC [21], and a posterior probabilistic analysis of the relations between the target variable and the variables considered relevant. The relevant variables can be chosen by two different methods: the parents and children of the target, or its Markov blanket [19]; a sketch of both follows.
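As an illustration of the two selection methods, the sketch below extracts both variable sets from a learned DAG; the representation of the graph as a `parents` mapping and the function names are our assumptions, not the paper's implementation.

```python
def parents_and_children(parents, target):
    """Variables directly connected to `target` in the DAG.
    `parents` maps each node to the set of its direct causes."""
    children = {v for v, ps in parents.items() if target in ps}
    return parents[target] | children

def markov_blanket(parents, target):
    """Parents, children and spouses (other parents of the children)."""
    children = {v for v, ps in parents.items() if target in ps}
    spouses = set().union(*(parents[c] for c in children)) - {target}
    return parents[target] | children | spouses
```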

This paper is organised as follows: Sect. 2 describes some important definitions. Section 3 describes the proposed framework and Sect. 4 the results obtained in the tests.

## **2 Background**

In this section, we introduce some important notations that are used throughout the document.

## **2.1 PC**

PC is a constraint-based algorithm and was proposed by Spirtes et al. [21]. This algorithm relies on the *faithfulness* assumption (*"If we have a joint probability distribution P of the random variables in some set V and a DAG G = (V, E), (G, P) satisfies the faithfulness condition if G entails all and only conditional independencies in P"* [18]), meaning that all the independencies in a *DAG* (directed acyclic graph) need to respect the d-separation criterion [8].

This algorithm is divided into two phases. In the first phase, the algorithm starts with a fully connected undirected graph. It removes an edge if the two nodes are independent, *i.e.*, if there is a set of nodes adjacent to both variables given which they are conditionally independent [12]. One of the most widely applied statistical independence tests is G<sup>2</sup>, proposed by Spirtes et al. [21] and later used in non-causal Bayesian networks by Tsamardinos et al. [24].

In the second phase [12], the algorithm orients the edges by first searching for v-structures (A → B ← C) and then by applying a set of rules, to create a completed partially directed acyclic graph (*CPDAG*), that is equivalent to the original one, where the faithfulness is respected.
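To make the two phases concrete, here is a minimal Python sketch (ours, not the authors' implementation); `indep_test` is a placeholder for any conditional-independence test that returns a p-value, e.g. G<sup>2</sup> or the Cochran-Mantel-Haenszel variant discussed next.

```python
from itertools import combinations
import networkx as nx

def pc_skeleton(variables, indep_test, alpha=0.05):
    """Phase 1 of PC: start from a complete undirected graph and remove the
    edge X-Y whenever X and Y are conditionally independent given some
    subset S of X's other neighbours (indep_test returns a p-value)."""
    g = nx.complete_graph(variables)
    sepset = {}
    depth = 0
    while any(len(set(g[x]) - {y}) >= depth for x, y in g.edges()):
        for x, y in list(g.edges()):
            if not g.has_edge(x, y):
                continue
            for s in combinations(set(g[x]) - {y}, depth):
                if indep_test(x, y, set(s)) > alpha:  # cannot reject independence
                    g.remove_edge(x, y)
                    sepset[frozenset((x, y))] = set(s)
                    break
        depth += 1
    return g, sepset

def orient_v_structures(g, sepset):
    """Start of phase 2: for every unshielded triple A - B - C, orient it as
    A -> B <- C when B is not in the separating set of A and C."""
    dg = nx.DiGraph()
    dg.add_nodes_from(g)
    for b in g:
        for a, c in combinations(g[b], 2):
            if not g.has_edge(a, c) and b not in sepset.get(frozenset((a, c)), set()):
                dg.add_edge(a, b)
                dg.add_edge(c, b)
    return dg
```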

### **2.2 Cochran-Mantel-Haenszel Test**

The Cochran-Mantel-Haenszel test [2] is an independence test that studies the influence of two variables on each other while taking into account the possible influence of other variables on this dependence, *i.e.*, it searches for causal dependence [11].

There are two different versions of this test: the standard Cochran-Mantel-Haenszel test, which is used on 2 × 2 × K tables (where K is the number of tables created), and the Generalised Cochran-Mantel-Haenszel test, which is used on I × J × K tables (where I and J are the numbers of categories of the studied variables, and K the number of layer categories [6]).

It is important to note that these types of contingency tables (three-way tables) represent the association between two variables while the influence of the other covariates is controlled.

Since many causal discovery algorithms (for discrete data) are used on data sets composed of a mixture of binary and non-binary discrete variables, the standard Cochran-Mantel-Haenszel test for 2 × 2 × K contingency tables is not sufficient. In such cases, the generalised version of this test can be applied instead (Generalised Cochran-Mantel-Haenszel test, Eq. (1) [15]).

$$Q_{CMH} = G' \, Var\{G|H_0\}^{-1} G \qquad G_h = B_h(n_h - m_h)$$

$$G = \sum_h G_h \qquad Var\{G|H_0\} = \sum_h Var\{G_h|H_0\} \qquad B_h = C_h \otimes R_h \tag{1}$$

In the equations presented above, B<sub>h</sub> denotes the Kronecker product of C<sub>h</sub> and R<sub>h</sub>, Var the covariance matrix, (n<sub>h</sub> − m<sub>h</sub>) the difference between the observed and expected counts, C<sub>h</sub> and R<sub>h</sub> the column scores and row scores respectively, and H<sub>0</sub> the null hypothesis<sup>1</sup>.
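For the classical 2 × 2 × K case, the CMH statistic is available in statsmodels; a small sketch follows (the table values are invented for illustration). The generalised I × J × K version of Eq. (1) is not part of statsmodels and would have to be implemented from the formulas above.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# K = 2 strata of 2x2 tables: rows are levels of X, columns levels of Y,
# one table per category of the conditioning (layer) variable.
tables = [np.array([[30, 10],
                    [12, 25]]),
          np.array([[18,  7],
                    [ 9, 21]])]

result = StratifiedTable(tables).test_null_odds(correction=True)  # CMH test
print(result.statistic, result.pvalue)  # small p-value -> dependence given the layers
```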

### **3 Framework**

In many machine learning problems, applying classification algorithms alone might not be enough to obtain satisfactory results [4]. The application

<sup>1</sup> *"For each of the separate levels of the co-variable set h = 1, 2, ..., q, the response variable is distributed at random with respect to the sub-populations,* i.e*. the data in the respective rows of the <sup>h</sup>th table can be regarded as a successive set of simple random samples of sizes {Nhi.} from a fixed population corresponding to the marginal total distribution of the response variable {Nh.j}."* [15].

of feature engineering to the target data can be a way of improving such results. There are already several methods to improve the overall performance of an algorithm through the creation or modification of attributes, but, to the best of our knowledge, none of them explores the potential causal relationships between the target variable and the other variables.

The addition of these new inferred causal attributes may help improve the performance of classification algorithms, since they encode the relationship between the target and the other variables, thus feeding more information about the data set and its behaviour to the model. Moreover, these features may also aid the interpretability of the generated models, since they encode the underlying relationships between the variables, making it easier to explain the decisions the models make.

In this section, we present a new framework to create new features using causal probabilities retrieved from a model that represents the causal associations between the variables. This framework can be divided into four different phases: (1) generation of the full causal network; (2) selection of the relevant variables; (3) inference of the probabilities relating the relevant variables to the target; and (4) creation of the new features. In the second phase, variables are considered relevant to the target if:

	- They are its parents and children;
	- They belong to its Markov blanket (*i.e.* parents, children and spouses).

In the first step, the framework starts by creating a full causal model that represents the causal associations between all the variables. This is done through the application of a modified version of PC [21]. In this modified version, the standard independence test (usually χ<sup>2</sup> or G<sup>2</sup>) is replaced by the Generalised Cochran-Mantel-Haenszel test presented in Sect. 2.2. This test has the advantage (over χ<sup>2</sup> and G<sup>2</sup>) of adjusting for confounding factors [22].

It is important to note that, in some cases, PC cannot direct every edge and therefore produces a CPDAG. In those cases, we apply a method to direct such edges. This method, proposed by Dor and Tarsi [5], recursively searches for ways to direct the undirected edges.

In the second step, the framework selects the relevant variables. To select these attributes, we propose two different approaches: parents and children, and Markov blanket.

In the parents and children (P-C) approach, as the name suggests, the selected variables are the ones that, in the causal graph, have an edge directed to the target (parents) or from the target (children).

In the Markov blanket (MB) approach, both the parents and children of the target are selected, as well as the nodes that have edges directed to the child nodes (also called spouse nodes). It is important to note that the most common way to select the variables that influence the target is through the Markov blanket (often used in causal feature selection methods [10]). However, several authors have proposed using only parents and children, as these variables can be considered the ones with the most influence on the target within its Markov blanket [1,3,23].

**Table 1.** Example of probabilities generated by the probability queries

In the third step, the framework infers a set of probabilities that represent the influence of each relevant variable on the classes of the target: the posterior probability distribution (Eq. (2)). The objective of these probabilistic queries is to determine the influence that evidence (a particular value of a relevant variable) has on the value of the target [14]. This is performed for all the values of each variable, and the resulting probability matrix is similar to Table 1.

$$P(Target = t \mid Atr = a) = \frac{occurrences_{t \cap a}}{occurrences_a} \tag{2}$$

Finally, in the fourth step, the new features are created and added to the data set. Each new feature represents the influence of a relevant variable on a specific class, *i.e.*, if we have, for example, a target variable with two classes ({0, 1}) and a relevant variable *Attr*, two new features are created, one per class (for each instance, the value of the feature is the probability of the class represented by that feature given the value of *Attr* in that instance).
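A minimal pandas sketch of steps 3 and 4 (function and column names are ours, not the authors'), computing Eq. (2) by counting co-occurrences and mapping the resulting probabilities onto each instance:

```python
import pandas as pd

def add_causal_probability_features(df, target, relevant):
    """Sketch of steps 3-4: for each relevant variable and each target
    class, add a column holding P(target = t | attr = a) from Eq. (2)."""
    out = df.copy()
    for attr in relevant:
        # row-normalised contingency table = conditional distribution of Eq. (2)
        cond = pd.crosstab(df[attr], df[target], normalize='index')
        for t in cond.columns:
            out[f'P_{target}={t}|{attr}'] = df[attr].map(cond[t])
    return out
```

For the illustrative example of Sect. 3.1 below, a call such as `add_causal_probability_features(df, 'B', ['A', 'E'])` would add the six P-C features.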

An overview of the framework can be seen in Fig. 1.

#### **3.1 An Illustrative Example**

To explain in more detail how this approach works, we will use as an example a data set with 6 discrete variables (A, B, C, D, E and F) and 5000 instances. The values of variables A, B, C, D, and E can be {0, 1, 2}, while F can take the values {0, 1}. For this example, we will use variable **B** as the target.

As explained in *Step 1*, the approach starts by generating the full network with PC and the Generalised Cochran-Mantel-Haenszel test. The generated network can be seen in Fig. 2.

After the creation of the full network, the relevant variables are selected. The selected variables can be parents or children (P-C) of **B** ({A, E}) or the Markov blanket (MB) of **B** ({A, E, F}).

**Fig. 1.** Example of the operation of the proposed framework

In the third step, the framework generates inference probabilities for the chosen variables (Table 2). Taking A = 0 and B = 0 as an example, the probability is obtained by dividing the number of times both A = 0 and B = 0 occur by the number of times A = 0 occurs; in other words, P(B = 0|A = 0) = 0.86.

These probabilities are then added to the global data set, producing a data set similar to Table 3. The two approaches differ in the number of new features created, since the number of generated features equals the product of the number of target values and the number of relevant variables. Because the MB approach selects more variables than the P-C approach, it will in general generate more features: in this example, the P-C approach yields 6 new features and the MB approach yields 9.

## **4 Results and Discussion**

To evaluate the proposed approaches and make a comparative study, the following experimental configuration was designed: the performance of Random Forest using the original data, as well as using the versions generated by the two proposed approaches, was compared.

**Fig. 2.** Example: network generated

**Table 2.** Probabilities generated for the Markov blanket variables. In the parents and children case, the probabilities for *F* are not generated.


This comparative analysis was made through 10-fold cross validation, in several public data sets (Table 4). For each fold, the two approaches are applied to the train set and then the resulting conditional probabilities are used to create the new features for both the train and test set (this ensures that no information about the classes in the test set is added to the new features).
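The following sketch illustrates this leakage-free protocol (helper and column names are ours, and the original attributes are assumed to be integer-coded): the conditional probabilities are always estimated on the training fold and merely looked up for the test fold.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def causal_feature_cv_error(df, target, relevant, n_splits=10):
    """Per fold: estimate P(target|attr) on the training part only,
    then map those probabilities onto both the train and test parts."""
    y = df[target]
    errors = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(df, y):
        train, test = df.iloc[tr].copy(), df.iloc[te].copy()
        for attr in relevant:
            cond = pd.crosstab(train[attr], train[target], normalize='index')
            for t in cond.columns:
                col = f'P_{target}={t}|{attr}'
                train[col] = train[attr].map(cond[t])
                # attribute values unseen in the training fold get probability 0
                test[col] = test[attr].map(cond[t]).fillna(0.0)
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train.drop(columns=[target]), train[target])
        pred = clf.predict(test.drop(columns=[target]))
        errors.append(1 - accuracy_score(test[target], pred))
    return sum(errors) / len(errors)
```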

To choose the optimal parameters for the approaches presented in the following sections, a sensitivity analysis was performed. This analysis consisted of obtaining the error (*1 - accuracy*) on the presented data sets (divided into 70% train, 30% test). In the case of PC, this test was repeated for significance levels of 1% and 5%. We concluded that the error of the algorithms on the data sets did not change much when the parameters were varied; for this reason, we selected and present a significance level of 5% for all the data sets.

The performance of this algorithm was compared in terms of error rate (Table 5). This comparison was performed using *No new features* as the reference. The performance of the classification algorithm trained with causal features on each data set was compared to the reference using the Wilcoxon signed-rank test. The sign +/− indicates that the algorithm is significantly better/worse than the reference with a p-value of less than 0.05. Besides this, the algorithms are also compared in terms of average and geometric mean of the errors, average ranks, average error ratio, wins/losses, significant wins/losses (number of times that the reference was better or worse than the algorithm, according to the signed-rank test) and the Wilcoxon signed-rank test. For the Wilcoxon signed-rank test we also consider a p-value of 0.05.

**Table 3.** Features generated with the probabilities for the Markov blanket variables. In the parents and children case, the features related to *F* are not generated.

**Table 4.** Data set description

If we analyse Table 5, it is possible to see that, in general, *+Causal features P-C* (the addition of features representing the conditional probability of the parent and children features on the target) performs better than *No new features*: the value obtained in the Wilcoxon test is 0.0266 (less than the p-value of 0.05), which means that the difference in performance is significant. This difference can also be seen in the average and geometric means of the errors. More specifically, looking at the average ranks, *+Causal features P-C* has a lower average rank than *No new features* (1.436 against 2.538).

If we now compare the second approach proposed (*+Causal features MB*) with the reference, we can see that there is a positive difference in the results


**Table 5.** Error rates of Random Forest for classification with causal features

**Table 6.** AUC for Lucas data set


(although not significant). It is possible to see this difference, once again, in the average and geometric mean, as well as in the average rank (1.538).

In Table 6, it is possible to see the AUC values of the three analysed approaches for the Lucas data set<sup>2</sup>. The results presented in this table were obtained by dividing this data set into train and test sets (70%/30%). The model scores were then obtained for the test data (with a 50% cutoff).

In this table it is possible to see that *+Causal features MB* has the highest area, meaning that, on the data set with the causal probabilistic features that represent the relations between the target and its Markov blanket, Random Forest can distinguish the classes better than with the data from the other approaches, thus performing better [7]. Although *+Causal features MB* was the

<sup>2</sup> http://www.causality.inf.ethz.ch/data/LUCAS.html.

best approach in terms of AUC, the other proposed approach *+Causal features P-C* also obtained an AUC higher than the reference.

Finally, from these results, we can conclude that there is evidence that applying causality to the creation of new features can have a positive impact on the performance of classification algorithms.

## **5 Conclusion**

The achievement of satisfactory results in a classification problem depends not only on the chosen classifier but also on the data being processed. One possible way to improve the performance of classifiers is to apply feature engineering, or in other words, to use the original data to infer new information, creating new attributes and altering others, to obtain more descriptive features. However, most of the proposed methodologies do not take into account the possible causal relationships in the data. This information can help to create more accurate models, since we encode in one variable information about the interaction between variables, thus reinforcing their importance.

In this paper we proposed a framework that uses causal discovery to create new features based on a posterior probabilistic analysis of the relations between a target variable and the variables considered relevant, these being either the parents and children of the target or its Markov blanket.

In the experiments, we compared the approaches against the original data, using Random Forest on public data sets. From the results, we can conclude that there is evidence that the application of causality to the creation of new supposed causal probabilistic features may have a positive impact on the overall performance of the classification algorithm.

In the future, we intend to study the application of these techniques in other classifiers, as well as in the classification of mixed data (continuous and discrete variables).

**Acknowledgments.** This research was carried out in the context of the project FailStopper (DSAIPA/DS/0086/2018) and supported by the *Fundação para a Ciência e a Tecnologia* (FCT), Portugal, through the PhD Grant SFRH/BD/146197/2019.

## **References**



# **DO-U-Net for Segmentation and Counting: Applications to Satellite and Medical Images**

Toyah Overton1,2(B) and Allan Tucker<sup>1</sup>

1 Department of Computer Science, Brunel University London, Uxbridge, UK *{*toyah.overton,allan.tucker*}*@brunel.ac.uk <sup>2</sup> Alcis Holdings Ltd., Guildford, UK

**Abstract.** Many image analysis tasks involve the automatic segmentation and counting of objects with specific characteristics. However, we find that current approaches look to either segment objects or count them through bounding boxes, and those methodologies that both segment and count struggle with co-located and overlapping objects. This restricts our capabilities when, for example, we require the area covered by particular objects as well as the number of those objects present, especially when we have a large number of images from which to obtain this information. In this paper, we address this by proposing a Dual-Output U-Net. DO-U-Net is an Encoder-Decoder style, Fully Convolutional Network (FCN) for object segmentation and counting in image processing. Our proposed architecture achieves precision and sensitivity superior to other, similar models by producing two target outputs: a segmentation mask and an edge mask. Two case studies are used to demonstrate the capabilities of DO-U-Net: locating and counting Internally Displaced People (IDP) tents in satellite imagery, and the segmentation and counting of erythrocytes in blood smears. The model was demonstrated to work with a relatively small training dataset, achieving a sensitivity of 98.69% for IDP camps of fixed resolution, and 94.66% for a scale-invariant IDP model. DO-U-Net achieved a sensitivity of 99.07% on the erythrocytes dataset. DO-U-Net has a reduced memory footprint, allowing for training and deployment on a machine with a lower to mid-range GPU, making it accessible to a wider audience, including non-governmental organisations (NGOs) providing humanitarian aid, as well as health care organisations.

**Keywords:** Convolutional neural networks *·* U-Net *·* Segmentation *·* Counting *·* Satellite imagery *·* Blood smear

## **1 Introduction**

Over recent years, the volumes of data collected across all industries globally have grown dramatically. As a result, we find ourselves in an ever greater need for fully automated analysis techniques. The most common approaches to large scale data analysis rely on the use of supervised and unsupervised Machine Learning, and, increasingly, Deep Learning. Using only a small number of human-annotated data samples, we can train models to rapidly analyse vast quantities of data without sacrificing the quality or accuracy compared with a human analyst. In this paper, we focus on images - a rich datatype that often requires rapid and accurate analysis, despite their volume and complexity. Object classification is one of the most common types of analysis undertaken. In many cases, a further step may be required in which the classified and segmented objects of interest need to be accurately counted. While easily performed by humans, albeit slowly, this task is often non-trivial in Computer Vision, especially in cases where the objects exist in complex environments or when objects are closely co-located and overlapping. We look at two such case studies: locating and counting Internally Displaced People (IDP) shelters in Western Afghanistan using satellite imagery, and the segmentation and counting of erythrocytes in blood smear images. Both applications have a high impact in the real world and are in need of new rapid and accurate analysis techniques.

### **1.1 Searching for Shelter: Locating IDP Tents in Satellite Imagery**

Over 40 million individuals were believed to have been internally displaced globally in 2018 [1]. Afghanistan is home to 2,598,000 IDPs displaced by conflict and violence, with the numbers growing by 372,000 in the year 2018 alone. In the same year, an additional 435,000 individuals were displaced due to natural disasters, 371,000 of whom were displaced due to drought.

The internally displaced population receive aid from various non-governmental organisations (NGOs), to prevent IDPs from becoming refugees. The Norwegian Refugee Council (NRC) is one such body, providing humanitarian aid to IDPs across 31 countries, assisting 8.5 million people in 2018 [2]. In Afghanistan, the NRC has been providing IDPs with tents as temporary living spaces. Alcis is assisting the NRC with the analysis of the number, flow, and concentration of these humanitarian shelters, enabling valuable aid to be delivered more effectively.

**Existing Methods.** In the past, Geographical Information System (GIS) Technicians relied mostly on industry-standard software, such as ArcGIS, for the majority of their analysis. These tools provide the user with a small number of built-in Machine Learning algorithms, such as the popularly used implementation of the Support Vector Machine (SVM) algorithm [3]. These generally involve a time consuming, semi-automated process, with each step being revisited multiple times as manual tuning of the model parameters is required. The methodology does not allow for batch processing as all stages must be repeated with human input for each image. An example of the ArcGIS process<sup>1</sup> used by GIS technicians is:

1. Manually segment the image by selecting a sample of objects exhibiting similar shape, spectral and spatial characteristics.

<sup>1</sup> Details of the process have been provided by Alcis.


More recently, many GIS specialists have begun to look towards the latest techniques in Data Science and Big Data analysis to create custom Machine Learning solutions. A review paper by Quinn et al. in 2018 [4] weighed up the merits of Machine Learning approaches used to segment and count both refugee and IDP camps. Their work used a sample of 87,123 structures; a magnitude which was required for training using their approach and was seen as a limitation. Quinn et al. used the popular Mask R-CNN [5] segmentation model to analyse their data; a model using a Region Proposal Network to simultaneously classify and segment images. This yielded an average precision of 75%, improving to 78% by applying a transfer learning approach.

### **1.2 Counting in Vein: Finding Erythrocytes in Blood Smears**

The counting of erythrocytes, or red blood cells, in blood smear images, is another application in which one must count complex objects. This is an important task in exploratory and diagnostic medicine, as well as medical research. An average blood smear imaged using a microscope contains several hundred erythrocytes of varying size, many of which are overlapping, making an accurate manual count both difficult and time-consuming.

**Existing Methods.** While only a small number of analyses were able to successfully perform an erythrocyte count, Tran et al. [6] have achieved a counting accuracy of 93.30%. Their technique relied on locating the cells using the SegNet [7] network. SegNet is an encoder-decoder style FCN architecture producing segmentation masks as its output. Due to the overlap of erythrocyte cells, they performed a Euclidean Distance Transform on the binary segmentation masks to obtain the location of each cell using a connected component labelling algorithm. A work by Alam and Islam [8] proposes an approach using YOLO9000 [9]; a network using a similar approach to Mask R-CNN, to locate elliptical bounding regions that roughly correspond to the outer contours of the cells. Using 300 images, each containing a small number of erythrocytes, for training, they achieve an accuracy of 96.09%. Bounding boxes acted as ground-truth for Alam and Islam, as opposed to segmentation masks used by Tran et al.

## **2 Data Description**

### **2.1 Satellite Imagery**

Working on the ground, the NRC identified areas within Western Afghanistan with known locations of IDP camps. Through their relationship with Maxar [10], Alcis has access to satellite imagery covering multiple camps, in a range of different environments. Figure 1 shows a section of a camp in Qala'e'Naw, Badghis.

This work uses imagery collected by the WorldView-2 and WorldView-3 satellites [11], by their operator and owner Maxar. WorldView-2 has a multispectral resolution of 1.85 m, while the multispectral resolution of WorldView-3 is 1.24 m [12], allowing tents of approximately 7.5 m long and 4 m wide to be resolved. The WorldView-2 images were captured on either 05/01/19 (DD/MM/YY) or 03/03/19, with the WorldView-3 images captured on 12/03/19. A further set of images, observed between 08/08/18 and 23/09/19 by WorldView-3, became available for some locations. This dataset can be used to show evolution in the camps during this period, allowing for a better understanding of the changes undergone in IDP camps. Due to the orbital position of the satellite, images observed at different times have varying resolutions, as well as other properties, due to differences in viewing angle and atmospheric effects.

**Training Data.** We developed DO-U-Net using a limited number of satellite images, obtained over a very limited time, with a nearly identical pixel scale. Each tent found in the training imagery has been marked with a polygon, using a custom Graphical User Interface (GUI) tool developed by Alcis. This has been done for a total of 6 images, covering an area of approximately 15 km<sup>2</sup> and containing 5,178 tents. Incidentally, this makes our training dataset nearly 17 times smaller than that used by Quinn et al. in their analysis.

The second satellite dataset includes imagery of varying quality and resolution, providing an opportunity to develop a scale-invariant version of our model. We used 3 additional training images, distinct from the original dataset, to train our scale-invariant DO-U-Net. These images contained 2,338 additional tents, in an area of around 130 km<sup>2</sup>, giving a total of 7,516 tents in over 140 km<sup>2</sup>.

### **2.2 Blood Smear Images**

We used blood smear images from the Acute Lymphoblastic Leukemia (ALL) Image Database for Image Processing<sup>2</sup>. These images were captured using an optical laboratory microscope, with magnification ranging from 300–500*×*, and a Canon PowerShot G5 camera. We used the ALL IDB1 dataset, comprised of 108 images taken during September 2005 from both ALL and non-ALL patients. An example blood smear image from an ALL patient can be seen in Fig. 2.

**Training Data.** We selected 10 images from the ALL IDB1 dataset to be used as training data. These images are representative of the diverse nature of the entire dataset, including the varying microscope magnifications and backgrounds. Of the images used, 3 belong to ALL patients, with the remaining 7

<sup>2</sup> Provided by the Department of Information Technology at Università degli Studi di Milano, https://homes.di.unimi.it/scotti/all/.

**Fig. 1.** *Left:* An IDP camp in Badghis, Afghanistan. NRC tents are clearly visible due to their uniform size and light colour. *Right:* The manually created ground-truth annotation for the image.

**Fig. 2.** *Left:* An image of a blood smear from an Acute Lymphoblastic Leukemia (ALL) patient. *Right:* The manually created ground-truth annotation for the image. The images also contain lymphocytes, which are not marked in the training data.

images coming from non-ALL patients. Similarly to the IDP camp dataset, all erythrocytes in the training data have been manually labelled with a polygon using our custom GUI tool. In the images belonging to ALL patients, a total of 1,300 erythrocytes were marked. A further 3,060 erythrocytes were marked in the images belonging to non-ALL patients, giving a total of 4,360 erythrocytes in the training data.

The training data does not distinguish between typical erythrocytes and those with any forms of morphology - of the 4,360 erythrocytes, just 106 display a clear morphology. The training data also does not contain any annotation for leukocytes. Instead, our focus is on correctly segmenting and counting all erythrocytes in the images.

## **3 Methodology**

Of late, several very advanced and powerful Computer Vision algorithms have been developed, including the popular Mask R-CNN [5] and YOLO [9] architectures. While their performance is undoubtedly impressive, they rely on a large number of images to train their complex networks, as highlighted by Quinn et al. [4]. More recently, many more examples of FCNs have been developed, including SegNet [7], DeconvNet [13] and U-Net [14], with the latter emerging as arguably one of the most popular encoder-decoder based architectures. Aimed at achieving a high degree of success even with sparse training datasets and developed to tackle biological image segmentation problems, it is a clear starting block for our architecture.

### **3.1 U-Net**

The classical U-Net, as proposed by Ronneberger et al. has revolutionised the field of biomedical image segmentation. Similarly to other encoder-decoder networks, U-Net is capable of producing highly precise segmentation masks.

**Fig. 3.** *<sup>a</sup>*: Sample segmentation mask in which some tent segmentations are seen to overlap. *<sup>b</sup>*: Sobel filtered image. *<sup>c</sup>*: Local entropy filtered image. *<sup>d</sup>*: Otsu filtered image. *<sup>e</sup>*: Image with contour ellipses applied. *<sup>f</sup>* : Image with gradient morphology applied. *<sup>g</sup>*: Eroded image. *<sup>h</sup>*: Tophat filtered image. *<sup>i</sup>*: Blackhat filtered image.

What differentiates it from Mask R-CNN, SegNet and other similar networks is its lack of reliance on large datasets [14]. This is achieved by the introduction of a large number of skip connections, which reintroduce some of the early encoder layers into the much deeper decoder layers. This greatly enriches the information received by the decoder part of the network, and hence reduces the overall size of the dataset required to train the network.

We have deployed the original U-Net on our dataset of satellite images of IDP camps in Western Afghanistan. While we were able to produce segmentation masks that very accurately marked the location of the tents, the segmentation masks contained significant overlaps between tents, as seen in Fig. 3. This overlap prevents us from carrying out an automated count, despite using several post-processing techniques to minimise the impact of these overlaps. The most successful post-processing approaches are shown in Fig. 3. The issues encountered with the classical U-Net have motivated our modifications to the architecture, as described in this work.

#### **3.2 DO-U-Net**

Driven by the need to reduce overlap in segmentation masks, we modified the U-Net architecture to produce dual outputs, thus developing the DO-U-Net. The idea of a contour aware network was first demonstrated by the DCAN architecture [15]. Based on a simple FCN, DCAN was trained to use the outer contours of the areas of interest to guide the training of the segmentation masks. This led to improved semantic and instance segmentation of the model, which in their case, looked at non-overlapping features in biomedical imaging.

With the aim of counting closely co-located and overlapping objects, we are predominantly interested in the correct detection of individual objects as opposed to the exact precision of the segmentation mask itself. An examination of the hidden convolutional layers of the classical U-Net showed that the penultimate layer of the network extracts information about the edges of our objects of

**Fig. 4.** The DO-U-Net architecture, showing two output layers that target the segmentation and edge masks corresponding to the training images.

interest, without any external stimulus. We introduce a secondary output layer to the network, targeting a mask segmenting the edges of our objects. By subtracting this "edge" mask from the original segmentation mask, we can obtain a "reduced" mask containing only non-overlapping objects.

As our objective was to identify tents of fixed scale in our image dataset, we were able to simplify the model considerably. This reduced the computational requirements for training the model, allowing not only for much faster development and training but also opening up the possibility of deploying the algorithm on a dataset covering a large proportion of the total area of Afghanistan, driven by our commercial requirements.

**Architecture.** Starting with the classical U-Net, we reduced the number of convolutional layers and skip connections in the model. Simultaneously, we minimised the complexity of the model by looking at smaller input regions of the images, thus minimising its memory footprint. We follow the approach of Ronneberger et al. by using unpadded convolutions throughout the network, resulting in a model with smaller output masks (100 *×* 100 px) corresponding to a central region of a larger (188 *×* 188 px) input image region. DO-U-Net uses two, independently trained, output layers of identical size. Figure 4 shows our proposed DO-U-Net architecture. The model can also be found in our public online repository<sup>3</sup>. Examples of the output edge and segmentation masks, as well as the final "reduced" mask, can be seen in Figs. 6 and 7. With the reduced memory footprint of our model, we can produce a "reduced" segmentation mask for a single 100 *×* 100 px region in 3 ms using TensorFlow 2.0 with an Intel i9-9820X CPU and a single NVIDIA RTX 2080 Ti GPU.
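A compact Keras sketch of this dual-output design is given below. The filter counts are illustrative rather than the authors' exact configuration, but three pooling steps with two unpadded 3 × 3 convolutions per block reproduce the stated 188 × 188 px input to 100 × 100 px output mapping:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # two unpadded ("valid") 3x3 convolutions, each shrinking the map by 2 px
    x = layers.Conv2D(filters, 3, activation='relu')(x)
    return layers.Conv2D(filters, 3, activation='relu')(x)

def crop_to(skip, target):
    # centre-crop an encoder feature map to the decoder's spatial size
    d = (skip.shape[1] - target.shape[1]) // 2
    return layers.Cropping2D(d)(skip)

def build_do_u_net(channels=3):
    inp = layers.Input((188, 188, channels))
    s1 = conv_block(inp, 32)                                       # 184x184
    s2 = conv_block(layers.MaxPooling2D()(s1), 64)                 # 88x88
    s3 = conv_block(layers.MaxPooling2D()(s2), 128)                # 40x40
    b = conv_block(layers.MaxPooling2D()(s3), 256)                 # 16x16
    x = layers.Conv2DTranspose(128, 2, strides=2)(b)               # 32x32
    x = conv_block(layers.concatenate([crop_to(s3, x), x]), 128)   # 28x28
    x = layers.Conv2DTranspose(64, 2, strides=2)(x)                # 56x56
    x = conv_block(layers.concatenate([crop_to(s2, x), x]), 64)    # 52x52
    x = layers.Conv2DTranspose(32, 2, strides=2)(x)                # 104x104
    x = conv_block(layers.concatenate([crop_to(s1, x), x]), 32)    # 100x100
    seg = layers.Conv2D(1, 1, activation='sigmoid', name='seg')(x)
    edge = layers.Conv2D(1, 1, activation='sigmoid', name='edge')(x)
    return tf.keras.Model(inp, [seg, edge])
```

Compiling with, e.g., `loss_weights={'seg': 0.1, 'edge': 0.9}` reproduces the 10%/90% loss split described under *Training* below.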

**Training.** The large training images were divided such that no overlap exists between the regions corresponding to the target masks, using zero-padding at the image borders. We train our model against both segmentation and edge masks. The edges of the mark-up polygons, annotated using our custom tool, were used as the "edge" masks during training. Due to the difference in the pixel

<sup>3</sup> https://github.com/ToyahJade/DO-U-Net.

**Fig. 5.** Scale-Invariant DO-U-Net, redesigned to work with datasets containing objects of variable scale.

size of tents and erythrocytes, the edges were taken to span 2 px and 4 px wide respectively in these case studies.

As our problem deals with segmentation masks covering only a small proportion of the image (*<*1% in some satellite imagery), the choice of loss function was a very important factor. We use the Focal Tversky Index, which is suited to training with positive labels that are sparse compared to the overall area of the training data [16]. Our best result, obtained using the Focal Tversky loss function, gave an improvement of 5% in the Intersect-over-Union (IoU) value compared to the Binary Cross-Entropy loss function used by Ronneberger et al. [14]. We found training to behave best when the combined loss function for the model was heavily weighted toward the edge mask segmentation; here, we used a 10%/90% split for the object and edge mask segmentation respectively.
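A common TensorFlow rendering of the Focal Tversky loss is sketched below; the parameter values are illustrative, as the paper does not report its exact α, β and γ:

```python
import tensorflow as tf

def focal_tversky_loss(alpha=0.7, beta=0.3, gamma=0.75, smooth=1e-6):
    """Focal Tversky loss [16]: alpha weights false negatives, beta false
    positives; the focal exponent gamma emphasises hard, sparse masks."""
    def loss(y_true, y_pred):
        y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
        y_pred = tf.reshape(y_pred, [-1])
        tp = tf.reduce_sum(y_true * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)
        tversky = (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)
        return tf.pow(1.0 - tversky, gamma)
    return loss
```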

**Counting.** As the resulting "reduced" masks produced by our approach do not contain any overlaps, we can use simple counting techniques, relying on the detection of the bounding polygons for the objects of interest. We apply a threshold to remove all negative values from the image, which may occur due to the subtractions. We then use the Marching Squares Algorithm implemented as part of Python's skimage.measure image analysis library [17].
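The counting step can therefore be sketched in a few lines (our simplification; `level` plays the role of the threshold mentioned above):

```python
import numpy as np
from skimage.measure import find_contours

def count_from_masks(seg_mask, edge_mask, level=0.5):
    """Subtract the edge mask from the segmentation mask, clip the negative
    values introduced by the subtraction, then count the closed contours
    returned by skimage's marching squares implementation."""
    reduced = np.clip(seg_mask - edge_mask, 0.0, None)
    contours = find_contours(reduced, level)
    return len(contours)
```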

**Scale-Invariant DO-U-Net.** In addition to the simple DO-U-Net, we propose a scale-invariant version of the architecture with an additional encoder and decoder block. Figure 5 shows the increased depth of the network as is required to capture the generalised model of our objects in the scale varying dataset. The addition of extra blocks resulted in a larger input layer of 380 *×* 380 px, corresponding to a segmentation mask of 196 *×* 196 px.

## **4 Results**

### **4.1 IDP Tent Results**

Using our DO-U-Net architecture, we were able to achieve a very significant improvement in the counting of IDP tents compared to the popularly used SVM

**Fig. 6.** *Left:* Segmentation mask produced for NRC tents in a camp near Qala'e'Naw. *Centre:* Edges mask produced for the same image. *Right:* The final output mask.

classifier available in ArcGIS. However, due to the manually intensive nature of the ArcGIS approach<sup>4</sup>, we were only able to directly compare a single test camp, located in the Qala'e'Naw region of the Badghis Province. This area contains 921 tents as identified in the ground-truth masks. Using DO-U-Net, we achieved a precision of 99.78% with a sensitivity of 99.46%. Using ArcGIS, we find a precision of 99.86% and a significantly lower sensitivity of 79.37%. Sensitivity, or the true positive rate, measures the probability of detecting an object and is, therefore, the most important metric for us as we aim to locate and count all tents in the region. The scale-invariant DO-U-Net achieved a precision of 98.48% and a sensitivity of 98.37% on the same image.

We also apply DO-U-Net to a larger dataset of five images containing a total of 3,447 tents and find an average precision of 97.01% and an average sensitivity of 98.68%. Similarly, we tested the scale-invariant DO-U-Net using 10 images with varying properties and resolutions containing a total of 5,643 tents. Here, the average precision was reduced to 91.45%, and the average sensitivity dropped to 94.66%. This result is not surprising as, on inspection, we find that without the scale constraints the resulting segmentation masks are contaminated with other structures of similar properties to NRC tents. We also find that, without scale constraints, NRC tents which are partially covered, e.g. with tarpaulin, may be missed or only partially segmented. Our DO-U-Net and scale-invariant DO-U-Net sensitivities of 98.68% and 94.66% respectively are very strong results when compared to the existing literature.

### **4.2 Erythrocyte Results**

To validate the performance of DO-U-Net at counting erythrocytes, we use 3 randomly selected blood smear images from ALL patients and a further 5 selected images from non-ALL patients. While randomly selected, the images are representative of the entire ALL IDB1 dataset. On a total of 2,775 erythrocytes, as found in these 8 validation images, DO-U-Net achieved an average precision of 98.31% and an average sensitivity of 99.07%.

<sup>4</sup> Results found using ArcGIS methodology can be found at https://storymaps.arcgis. com/stories/d85e5cca27464d97ad4c1bad3da7f140.

**Fig. 7.** *Left:* Segmentation mask produced for a blood smear of an ALL patient. *Centre:* Edges mask produced for the same image. *Right:* The final output mask.

**Fig. 8.** *Top:* Blood smear images of overlapping cells. *Bottom:* Segmentation masks produced by DO-U-Net. *Left:* An overlapped cell is counted twice when the "edges" from neighbouring cells overlap and break up the cell. *Centre:* A cell is missed due to an incomplete edge mask. *Right:* An uncertainty in the order of the cell overlap leads to the intersect between two cells being counted as an additional cell.

Whilst our proposed DO-U-Net is extremely effective at producing image and edge segmentation masks, as demonstrated in Fig. 7, we do note that the obtained erythrocyte count may not always match the near-perfect segmentation. Figure 8 shows examples of the three most common issues found to occur in the final "reduced" masks. These mistakes arise largely due to the translucent nature of erythrocytes and the difficulty in differentiating between a cell which is overlapping another and a cell which is overlapped. While these cases are rare, this demonstrates that further improvements can be made to the architecture.

#### **4.3 Future Work**

Our current model has been designed to segment only one type of object, which is a clear limitation of our solution. As an example, the blood smear images from the ALL IDB1 dataset contain normal erythrocytes as well as two clear types of morphology: burr cells and dacryocytes. These morphologies may be signs of disease in patients, though burr cells are common artefacts, especially known to occur when a blood sample has aged. It is therefore important not only to count all erythrocytes, but also to differentiate between their various morphologies.


**Table 1.** Summary of results for DO-U-Net, when tested on our two satellite imagery datasets and the ALL IDB1 dataset.

While our general theory can be applied to identifying different types of object, further modifications to our proposed DO-U-Net would be required.

## **5 Conclusion**

We have proposed a new approach to segmenting and counting closely co-located and overlapping objects in complex image datasets. For this, we developed DO-U-Net: a modified U-Net based architecture, designed to produce both a segmentation and an "edge" mask. By subtracting the latter from the former, we can locate and spatially separate objects of interest before automatically counting them. Our methodology was successful on both of our case studies: locating and counting IDP tents in satellite imagery, and the segmentation and counting of erythrocytes in blood smear images. In the first case study, DO-U-Net increased our sensitivity by approximately 20% compared to a popular ArcGIS based solution, achieving an average sensitivity of 98.69% for a dataset of fixed spatial resolution. Our network went on to achieve a precision of 91.45% and a sensitivity of 94.66% on a set of satellite images with varying resolutions and colour profiles. This is an impressive result when compared to Quinn et al., who achieved a precision of 78%. We also found DO-U-Net to be extremely successful at segmenting and counting erythrocytes in blood smear images, achieving a sensitivity of 99.07% for our test dataset. This is an improvement of 6% over the results found by Tran et al., who used the same training dataset, and 3% over Alam and Islam, who used a comparable set of images, giving us a near-perfect sensitivity when counting erythrocytes. The results are summarised in Table 1.

**Acknowledgements.** We thank Harry Robson, GIS Analyst at Alcis Holdings Ltd., for industry knowledge shared and for performing post-processing in ArcGIS. We would also like to show our gratitude to Richard Brittan, Tim Buckley and the Alcis team for sharing insight with us during the course of this research. We are also immensely grateful to the Department of Information Technology at Università degli Studi di Milano for providing the ALL IDB1 dataset from the Acute Lymphoblastic Leukemia Image Database for Image Processing.

## **References**



# **Enhanced Word Embeddings for Anorexia Nervosa Detection on Social Media**

Diana Ramírez-Cifuentes<sup>1(B)</sup>, Christine Largeron<sup>2</sup>, Julien Tissier<sup>2</sup>, Ana Freire<sup>1</sup>, and Ricardo Baeza-Yates<sup>1</sup>

<sup>1</sup> Universitat Pompeu Fabra, Carrer de Tanger, 122-140, 08018 Barcelona, Spain *{*diana.ramirez,ana.freire,ricardo.baeza*}*@upf.edu

<sup>2</sup> Univ Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien, UMR 5516, 42023 Saint-Etienne, France *{*julien.tissier,christine.largeron*}*@univ-st-etienne.fr

**Abstract.** Anorexia Nervosa (AN) is a serious mental disorder that has been shown to be traceable on social media through the analysis of users' written posts. Here we present an approach to generate word embeddings enhanced for a classification task dedicated to the detection of Reddit users with AN. Our method extends *Word2vec*'s objective function in order to bring domain-specific and semantically related words closer together. The approach is evaluated through the calculation of an average similarity measure, and by using the generated embeddings as features for the AN screening task. The results show that our method outperforms the use of fine-tuned pre-learned word embeddings, related methods dedicated to generating domain-adapted embeddings, and representations learned on the training set using *Word2vec*. This method can potentially be applied and evaluated on similar tasks that can be formalized as document categorization problems. Regarding our use case, we believe that this approach can contribute to the development of proper automated detection tools to alert and assist clinicians.

**Keywords:** Social media *·* Eating disorders *·* Word embeddings *·* Anorexia Nervosa *·* Representation learning

## **1 Introduction**

We present models to identify users with AN based on the texts they post on social media. Word embeddings previously learned on a large corpus have provided good results on predictive tasks [3]. However, in the case of writings generated by users living with a mental disorder such as AN, we observe specific vocabulary exclusively related to the topic. Terms such as "*cw*", used to refer to the current weight of a person, or "*ow*", referring to the objective weight,


This work was supported by the University of Lyon - IDEXLYON and the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Program (MDM-2015-0502).

are elements that are not easily found in large yet general collections extracted from Wikipedia, social media and news websites. Therefore, using pre-learned embeddings may not be the most suitable approach for the task.

We propose a method based on *Dict2vec* [15] to generate word embeddings enhanced for our task domain. The main contributions of our work are the following: (1) a method that modifies *Dict2vec* [15] in order to generate word embeddings enhanced for our classification task, which can also be applied to similar tasks that can be formulated as document categorization problems; (2) different ways to improve the performance of the embeddings generated by our method, corresponding to four embedding variants; and (3) a set of experiments to evaluate the performance of our generated embeddings in comparison to pre-learned embeddings and other domain adaptation methods.

## **2 Related Work**

In previous work related to the detection of mental disorders [8], documents were represented using bag-of-words (BoW) models, which represent words in terms of their frequencies. As these models do not consider contextual information or relations between the terms, other models have been proposed based on word embeddings [3]. These representations are generated considering the distributional hypothesis, which assumes that words appearing in similar contexts are related, and therefore should have close representations [11,13].

Embedding models allow words from a large corpus to be encoded as vectors in a high-dimensional space. The vectors are defined by taking into account the context in which the words appear in the corpus in such a way that two words having the same neighborhood should be close in the vector space.

Among the methods used for generating word embeddings we find *Word2vec* [11], which generates a vector for each word in the corpus considering it as an atomic entity. To build the embeddings, *Word2vec* defines two approaches: one known as continuous bag of words (CBOW) that uses the context to predict a target word; and another one called skip-gram, which uses a word to predict a target context. Another method is *fastText* [2], which takes into account the morphology of words, having each word represented as a bag of character n-grams for training. There is also *GloVe* [13], which proposes a weighted least squares model that does the training on global word-word co-occurrence counts.
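For reference, both *Word2vec* training modes are exposed by, e.g., the gensim library; the toy sentences and hyperparameters below are purely illustrative:

```python
from gensim.models import Word2Vec

sentences = [["cw", "is", "my", "current", "weight"],
             ["ow", "stands", "for", "objective", "weight"]]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative sampling with 5 draws
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,
                 negative=5, min_count=1, epochs=10)
print(model.wv.most_similar("cw", topn=3))
```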

In contrast to the previous methods, we can also mention recent methods like Embeddings from Language Models (ELMo) [14] and Bidirectional Encoder Representations from Transformers (BERT) [6], which generate representations that are aware of the context in which they are used. These approaches are useful for tasks where polysemic terms are relevant, and when there are enough sentences to learn such terms from their context. Regarding our use case, we observe that the vocabulary used by users with AN is very specific and contains almost no polysemic terms, which is why these methods are not addressed in our evaluation framework.

All the methods already mentioned are generally trained over large general purpose corpora. However, for certain domain specific classification tasks we have to work with small corpora. This is the case of mental disorders screening tasks given that the annotation phase is expensive, and requires the intervention of specialists. There are some methods that address this issue by either enhancing the embeddings learned over small corpora with external information, or adapting embeddings learned on large corpora to the task domain.

Among the enhancement methods we find the work of Zhang *et al.* [17], who made use of word embeddings learned in different health-related domains to recognize symptoms in psychiatry. They designed approaches to combine data from the source and target domains to generate word embeddings, which are considered in our experimental results.

Kuang *et al*. [9] propose learning weights based on the words' relative importance for the classification task (predictive terms). This method proposes weighting words according to their χ<sup>2</sup> [12] statistics to represent the context. However, this method differs from ours as we generate our embeddings through a different approach, which takes into account the context terms, introduces new domain related vocabulary, considers the predictive terms to be equally important, and moves apart the vectors of terms that are not predictive for the main target class.

Faruqui *et al*. [7] present an alternative, known as a retrofitting method, which makes use of relational information from semantic lexicons to improve pre-built word vectors. The main disadvantages are that no representations of new external terms can be introduced into the enhanced embeddings, and that, although related embeddings are moved closer, the embeddings of terms that should not be related (task-wise) cannot be pushed apart from each other. In our experimental setup, this method is used to define a baseline and to enhance the embeddings generated through our approach.

Our proposal is based on *Dict2vec* [15], which is an extension of the *Word2vec* approach. *Dict2vec* uses the lexical dictionary definitions of words in order to enrich the semantics of the embeddings generated. This approach has proved to perform well on small corpora because, in addition to the context defined by *Word2vec*, it introduces (1) a positive sampling, which moves closer the vectors of words co-occurring in their mutual dictionary definitions, and (2) a controlled negative sampling, which avoids moving apart the vectors of words that appear in the definition of others, as the authors assume that all the words in the definition of a term from a dictionary are semantically related to the word they define.

## **3 Method Proposed**

Our method generates word embeddings enhanced for a classification task dedicated to the detection of users with AN over a small corpus. In this context, users are represented by documents that contain their writings concatenated, and that are labeled as anorexic (positive) or control (negative) cases. These labels are the classes to predict for our task.

Our method is based on *Dict2vec*'s general idea [15]. We extend the *Word2vec* model with both a positive and a negative component, but our method differs from *Dict2vec* because both components are designed to learn vectors for a specific classification task. Within the word embeddings context, we assume that word-level n-grams' vectors, which are predictive for a class, should be placed close to each other given their relation with the class to be predicted. Therefore we first define sets of what we call *predictive pairs* for each class, and use them later for our learning approach.

#### **3.1 Predictive Pairs Definition**

Prior to learning our embeddings, we use χ<sup>2</sup> [12] to identify the predictive n-grams. This method is commonly used for feature reduction, being capable of identifying the most predictive features, in this case terms, for a classification task.

Based on the χ<sup>2</sup> score distribution, we obtain the n terms with the highest scores (most predictive terms) for each of the classes to predict (positive and negative). We then identify the most predictive term for the positive class, denoted t<sub>1</sub> or *pivot term*. Depending on the class for which a term is predictive, two types of *predictive pairs* are defined, so that every time a predictive word is found, it is moved closer to or further from t<sub>1</sub>. These predictive pair types are: (1) positive predictive pairs, where each predictive term for the positive class is paired with t<sub>1</sub> in order to bring its vector representation closer to t<sub>1</sub>; and (2) negative predictive pairs, where each term predictive for the negative class is also paired with t<sub>1</sub>, but with the goal of moving it away from t<sub>1</sub>.

In order to define the positive predictive terms for our use case, we consider: the predictive terms defined by the χ<sup>2</sup> method, AN-related (domain-specific) vocabulary, and the k words most similar to t<sub>1</sub> obtained from pre-learned embeddings, according to cosine similarity. In this way, information coming from external sources closely related to the task can be introduced. Terms that were not part of the corpus were appended to it, providing a way to add new vocabulary of semantic significance to the task.

Regarding the negative predictive terms, no further elements are considered aside from the (χ<sup>2</sup>) predictive terms of the negative class since, for our use case and similar tasks, control cases do not seem to share a vocabulary strictly related to a given topic. In other words, and as observed for the anorexia detection use case, control users are characterized by their discussions on topics unrelated to anorexia.

For the χ<sup>2</sup> method, in a binary task the resulting predictive features are the same for both classes (positive and negative). Therefore, we take the top *n* most predictive terms based on the distribution of the χ<sup>2</sup> scores over all the terms, and then examine the number of documents containing each of the selected *n* terms per class (anorexia or control). Given a term t, we calculate the number of documents belonging to the positive class (anorexia) containing t, denoted PCC, and the number of documents belonging to the negative class (control) containing t, denoted NCC. Then, for t we calculate the ratio of both counts to the total number of documents in each class, the total number of positive documents (TPD) and the total number of negative documents (TND), obtaining a positive class count ratio (PCCR) and a negative class count ratio (NCCR).

For a term to be part of the set of positive predictive terms, its PCCR value has to be higher than its NCCR, and the opposite applies for terms that belong to the set of negative predictive pairs. The positive and negative class count ratios are defined in Eqs. 1a and 1b as:

$$PCCR(t) = \frac{PCC(t)}{TPD} \tag{1a}$$

$$NCCR(t) = \frac{NCC(t)}{TND} \tag{1b}$$
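In code, this selection and split can be sketched in a few lines with scikit-learn; all names (`docs`, `labels`, `n_top`) are illustrative assumptions, not from the paper:

```python
# Sketch of the predictive-term selection of Sect. 3.1 (illustrative names;
# assumes a recent scikit-learn). `docs` holds one concatenated document per
# user, `labels` the binary classes (1 = anorexia, 0 = control).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

vec = CountVectorizer(ngram_range=(1, 2))        # word-level unigrams and bigrams
X = vec.fit_transform(docs)
scores, _ = chi2(X, labels)
terms = np.array(vec.get_feature_names_out())

n_top = 1000                                     # chosen from the score distribution
top = np.argsort(scores)[::-1][:n_top]           # indices of the top chi^2 terms
present = (X > 0).toarray()                      # document/term incidence matrix
pos = np.asarray(labels) == 1

pccr = present[pos][:, top].mean(axis=0)         # PCCR(t) = PCC(t) / TPD (Eq. 1a)
nccr = present[~pos][:, top].mean(axis=0)        # NCCR(t) = NCC(t) / TND (Eq. 1b)

positive_terms = terms[top[pccr > nccr]]         # to be paired with the pivot t1
negative_terms = terms[top[pccr <= nccr]]
pivot = terms[np.argmax(scores)]                 # t1, the highest-scoring term
```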

#### **3.2 Learning Embeddings**

Once the predictive pairs are defined, the objective function for a target term ω<sub>t</sub> (Eq. 2) adds a positive sampling cost (Eq. 3) and a negative sampling cost (Eq. 4a) to *Word2vec*'s usual cost for the target-context pair (ω<sub>t</sub>, ω<sub>c</sub>), where ℓ represents the logistic loss function, and v<sub>t</sub> and v<sub>c</sub> are the vectors associated with ω<sub>t</sub> and ω<sub>c</sub>, respectively.

$$J(\omega\_t, \omega\_c) = \ell(v\_t \cdot v\_c) + J\_{pos}(\omega\_t) + J\_{neg}(\omega\_t) \tag{2}$$

Unlike *Dict2vec*, J<sub>pos</sub> is computed for each target term, where P(ω<sub>t</sub>) is the set of all words that form a positive predictive pair with the word ω<sub>t</sub>, and v<sub>t</sub> and v<sub>i</sub> are the vectors associated with ω<sub>t</sub> and ω<sub>i</sub>, respectively. β<sub>P</sub> is a weight that defines the importance of the positive predictive pairs during the learning phase. Also differing from *Dict2vec*, the cost given by the predictive pairs is normalized by the size of the predictive pair set, |P(ω<sub>t</sub>)|, since all terms in the predictive pair set of ω<sub>t</sub> enter the computation; thus, when t<sub>1</sub> is found, the impact of moving it closer to a large number of terms is damped, and it remains a pivot element towards which the other predictive terms are drawn:

$$J\_{pos}(\omega\_t) = \beta\_P \sum\_{\omega\_i \in P(\omega\_t)} \frac{\ell(v\_t \cdot v\_i)}{|P(\omega\_t)|} \tag{3}$$

For negative sampling, we modify *Dict2vec*'s approach. We not only make sure that the vectors of the terms forming a positive predictive pair with ω<sub>t</sub> are not pushed away from it, but we also define a set of words that are predictive for the negative class and a cost given by these negative predictive pairs. As explained before, the main goal here is to push these terms away from t<sub>1</sub>, so this cost is added to the random negative sampling cost J<sub>n·r</sub> (Eq. 4b), as detailed in Eq. 4a.

$$J\_{neg}(\omega\_t) = J\_{n \cdot r}(\omega\_t) + \beta\_N \sum\_{\omega\_j \in N(\omega\_t)} \frac{\ell(-v\_t \cdot v\_j)}{|N(\omega\_t)|}\tag{4a}$$

$$J\_{n\cdot r}(\omega\_t) = \sum\_{\substack{\omega\_i \in F(\omega\_t) \\ \omega\_i \notin P(\omega\_t)}} \ell(-v\_t \cdot v\_i) \tag{4b}$$

The negative sampling cost considers, as in *Word2vec*, a set F(ω<sub>t</sub>) of k words selected randomly from the vocabulary. These words are pushed away from ω<sub>t</sub>, as they are unlikely to be semantically related. Following *Dict2vec*'s approach, we also make sure that no term belonging to the set of positive predictive pairs of ω<sub>t</sub> ends up being pushed away. In addition, we add another negative sampling cost, corresponding to pushing the most predictive terms of the negative class away from t<sub>1</sub>. Here, N(ω<sub>t</sub>) represents the set of all words that form a negative predictive pair with the word ω<sub>t</sub>, and β<sub>N</sub> is a weight that defines the importance of the negative predictive pairs during the learning phase.

The global objective function (Eq. 5) is given by the sum of every pair's cost across the whole corpus:

$$J = \sum\_{t=1}^{C} \sum\_{c=-n}^{n} J(\omega\_t, \omega\_{t+c}) \tag{5}$$

where C is the corpus size, and *n* represents the size of the context window.
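As a reading aid, the following is a minimal numpy sketch of the per-pair cost above; `V` (word to vector), `P` and `N` (word to positive/negative predictive-pair sets), `F` (the random negative sample), and the function names are our own notation, not the paper's:

```python
# Cost of one (target, context) pair, combining Eqs. 2, 3, 4a and 4b.
import numpy as np

def logistic(x):                                    # l(x) = log(1 + e^(-x))
    return np.log1p(np.exp(-x))

def pair_cost(w_t, w_c, V, P, N, F, beta_P, beta_N):
    pos = P.get(w_t, set())                         # positive predictive pairs of w_t
    neg = N.get(w_t, set())                         # negative predictive pairs of w_t
    J = logistic(V[w_t] @ V[w_c])                   # Word2vec target-context cost
    if pos:                                         # J_pos (Eq. 3), normalized by |P|
        J += beta_P * sum(logistic(V[w_t] @ V[w_i]) for w_i in pos) / len(pos)
    J += sum(logistic(-(V[w_t] @ V[w_i]))           # random negative sampling (Eq. 4b),
             for w_i in F if w_i not in pos)        # sparing positive-pair terms
    if neg:                                         # negative predictive pairs (Eq. 4a)
        J += beta_N * sum(logistic(-(V[w_t] @ V[w_j])) for w_j in neg) / len(neg)
    return J
```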

#### **3.3 Enhanced Embeddings Variations**

Given a pre-learned embedding that associates with a word ω a pre-learned representation v<sub>pl</sub>, and an enhanced embedding v for ω obtained through our approach with the same length *m* as v<sub>pl</sub>, we generate variations of our embeddings based on existing enhancement methods. First, we denote the embeddings generated exclusively by our approach (predictive pairs) as *Variation 0*; v is an instance of the representation of ω for this variation.

For the next variations, we address ways to combine the vectors of pre-learned embeddings (i.e., v<sub>pl</sub>) with those of our enhanced embeddings (i.e., v). For *Variation 1* we concatenate both representations, obtaining a vector of *2m* dimensions [16]. *Variation 2* involves concatenating both representations and applying truncated SVD as a dimensionality reduction method, obtaining a new representation SVD(v<sub>pl</sub>, v). *Variation 3* uses the values of the pre-learned vector v<sub>pl</sub> as initial weights for generating a representation with our learning approach; this variation is inspired by a popular transfer learning method that was successfully applied to similar tasks [5]. For these variations (1–3) we take into account the intersection of the vocabularies of both embedding types (pre-learned and *Variation 0*). Finally, *Variation 4* applies Faruqui's retrofitting method [7] to the embeddings of *Variation 0*.
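A minimal sketch of Variations 1 and 2, assuming two row-aligned matrices `E_pl` (pre-learned) and `E_v0` (*Variation 0*) over the shared vocabulary, each of shape (number of words, m); these names are ours:

```python
# Variations 1 and 2 as matrix operations (illustrative names and shapes).
import numpy as np
from sklearn.decomposition import TruncatedSVD

E_cat = np.hstack([E_pl, E_v0])                  # Variation 1: 2m dimensions
svd = TruncatedSVD(n_components=E_pl.shape[1])   # Variation 2: reduce back to m
E_svd = svd.fit_transform(E_cat)
```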

## **4 Evaluation Framework**

### **4.1 Data Set Description**

We used a Reddit data set [10] that consists of posts of users labeled as anorexic or control cases. This data set was defined in the context of an early risk detection shared task, and the training and test sets were provided by the organizers of the eRisk task.<sup>1</sup> Table 1 reports the statistics of the training and test sets. Given the incidence of Anorexia Nervosa, both sets contain a small yet significant number of AN cases compared to the control cases.


**Table 1.** Collection description, as reported in [10].

#### **4.2 Embeddings Generation**

The training corpus used to generate the embeddings, named the *anorexia corpus*, consisted of the concatenation of all the writings from all the training users. A set of stop words was removed. This resulted in a training corpus of 1,267,208 tokens with a vocabulary of 87,197 distinct tokens. In order to handle the bigrams defined by our predictive pairs, the words belonging to a bigram were paired and formatted as if they were a single term.

For the predictive pair generation with χ<sup>2</sup>, each user is an instance represented by a document composed of all the user's posts concatenated. χ<sup>2</sup> is applied over the training set, considering the user classes (anorexic or control) as the possible categories for the documents. Following the process described in Sect. 3.1, we obtain a list of 854 positive (anorexia) and 15 negative (control) predictive terms. Some of these terms are shown in Table 2, which displays the top 15 most predictive terms for each class. *Anorexia* itself turned out to be the term with the highest χ<sup>2</sup> score, i.e., the term denoted t<sub>1</sub> in Sect. 3.

The anorexia domain related terms from [1] were added as the topic-related vocabulary, and the top 20 words with the highest similarity to *anorexia*, taken from a set of pre-learned *GloVe* embeddings [13], were also paired with it when defining the predictive pair sets. The GloVe pre-learned vectors used are the 100-dimensional representations learned over 2B tweets with 27B tokens and a vocabulary of 1.2M terms.

<sup>1</sup> eRisk task: https://early.irlab.org/2018/index.html.


**Table 2.** List of some of the most predictive terms for each class.

The term *anorexia* was paired with 901 unique terms and, likewise, each of these terms was paired with *anorexia*. The same approach was followed for the 15 negative predictive terms, which were also paired with *anorexia*. An instance of a positive predictive pair is *(anorexia, underweight)*, whereas an instance of a negative predictive pair is *(anorexia, game)*. For learning the embeddings through our approach, which extends *Word2vec*, we used a window size of 5, five random negative pairs for negative sampling, a single thread/worker, and 5 training epochs.

#### **4.3 Evaluation Based on the Average Cosine Similarity**

This evaluation is done over the embeddings generated through *Variation 0* on the anorexia corpus. It averages the cosine similarities (sim) between t<sub>1</sub> and all the terms defined either as its *p* positive predictive pairs, giving a positive score denoted PS in Eq. 6a, or as its *n* negative predictive pairs, giving a negative score denoted NS in Eq. 6b. In these equations v<sub>a</sub> represents the vector of the term *anorexia*; v<sub>PPT<sub>i</sub></sub> represents the vector of the i-th positive predictive term (PPT) in the set of positive predictive pairs of *anorexia*, of size *p*; and v<sub>NPT<sub>i</sub></sub> represents the vector of the i-th negative predictive term (NPT) in the set of negative predictive pairs of *anorexia*, of size n:

$$PS(a) = \frac{\sum\_{i=1}^{p} sim(v\_a, v\_{PPT\_i})}{p} \tag{6a}$$

$$NS(a) = \frac{\sum\_{i=1}^{n} \text{sim}(v\_a, v\_{NPT\_i})}{n} \tag{6b}$$
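A small sketch of how PS and NS can be computed, with `emb` a word-to-vector mapping and the pair sets as built in Sect. 3.1 (all names are ours):

```python
# PS and NS (Eqs. 6a/6b) as average cosine similarities to the pivot term.
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def avg_sim(emb, pivot, terms):
    return float(np.mean([cos_sim(emb[pivot], emb[t]) for t in terms]))

PS = avg_sim(emb, "anorexia", positive_terms)    # expected to rise with our method
NS = avg_sim(emb, "anorexia", negative_terms)    # expected to fall
```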

We designed our experiments using PS and NS to analyze three main aspects: (1) we verify that, after applying our method, the predictive terms for the positive class are closer to the pivot term representation and the predictive terms for the negative class have moved away from it; (2) we evaluate the impact of different values of the parameters β<sub>P</sub> and β<sub>N</sub> in order to obtain the best representations, i.e., those where PS is as high as possible while NS stays as low as possible; and (3) we compare our generation method with *Word2vec* as the baseline, since this is the case in which our predictive pairs are not considered (β<sub>P</sub> = 0 and β<sub>N</sub> = 0). We expect our embeddings to obtain higher values for PS and lower values for NS than the baseline.

**Results.** Table 3 first shows the values of PS and NS obtained by our baseline, *Word2vec* (β<sub>P</sub> = 0 and β<sub>N</sub> = 0), and then the values obtained by embedding models generated with our approach (*Variation 0*), with different yet equal values of the parameters β<sub>P</sub> and β<sub>N</sub>, as these proved to provide the best results for PS and NS. We also evaluated individually the effect of varying only β<sub>P</sub> (with β<sub>N</sub> = 0), and then the effect of varying only β<sub>N</sub> (with β<sub>P</sub> = 0). The last row of the table shows a model combining the parameters with the best individual performance (β<sub>P</sub> = 75 and β<sub>N</sub> = 25).

After applying our approach, PS becomes greater than NS for most of our generated models, meaning that we obtain a representation where the positive predictive terms are closer to the pivot term *anorexia* and the negative predictive terms are farther from it. We can also observe that the averages change considerably depending on the values of β<sub>P</sub> and β<sub>N</sub>; in this case the best results according to PS are obtained with β<sub>P</sub> = 50 and β<sub>N</sub> = 50. Finally, comparing our scores with *Word2vec*, our method yields representations whose PS and NS values are respectively higher and lower than those of the baseline model.


**Table 3.** Positive Scores (PS) and Negative Scores (NS) for *Variation 0*. Different values of β<sub>P</sub> and β<sub>N</sub> are tested.

#### **4.4 Evaluation Based on Visualization**

We focus on the comparison of embeddings generated using *Word2vec* (baseline), *Variation 0* of our enhanced embeddings, and *Variation 4*. In order to plot the generated embedding vectors (see Fig. 1), we performed dimensionality reduction, from the original 200 dimensions down to 2, through Principal Component Analysis (PCA) applied to the vectors of the terms in Table 2 for these three representations, focusing on the embeddings of the positive and negative predictive terms. For the embeddings resulting from our method (*Variation 0*), we selected β<sub>P</sub> = 50 and β<sub>N</sub> = 50 as parameter values.

**Fig. 1.** Predictive terms sample represented on two dimensions after PCA was applied on their embeddings as dimensionality reduction method. From left to right each plot shows the vectorial representation of the predictive terms according to the embeddings obtained through (1) *Word2vec* (baseline), (2) *Variation 0*, and (3) *Variation 4*.

The positive predictive term representations are closer together after applying our method (*Variation 0*), and the negative predictive terms lie farther away, in comparison to the baseline. The last plot displays the terms for the embeddings generated through *Variation 4*. In this case, given the input format of the retrofitting method, *anorexia* was linked with all the remaining predictive terms of the anorexia class (901) and, likewise, each of these predictive terms was linked to the term *anorexia*. Notice that the retrofitting approach converges by changing the Euclidean distance of adjacent vertices, whereas closeness between terms in our approach is given by the cosine distance.
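For reference, the projection behind these plots is a standard PCA call; `term_vectors` (rows = the 200-dimensional embeddings of the terms in Table 2) is a hypothetical name:

```python
# Reduce the 200-dimensional term vectors to 2 dimensions for plotting.
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(term_vectors)  # shape: (n_terms, 2)
```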

#### **4.5 Evaluation Based on the Predictive Task**

In order to test our generated embeddings on the classification task dedicated to AN screening, we conduct a series of experiments to compare our method with related approaches. We define five baselines for our task. The first is a BoW model based on word-level unigrams and bigrams (*Baseline 1*); this model is kept mainly as a reference, since our main focus is to evaluate our approach against other word embedding based models. We create a second model using *GloVe*'s pre-learned embeddings (*Baseline 2*), and a third model that uses word embeddings learned on the training set with the *Word2vec* approach (*Baseline 3*). We evaluate a fourth approach (*Baseline 4*) given by enhancing the *Baseline 3* embeddings with the retrofitting method of Faruqui *et al.* [7]. *Baseline 5* uses the same retrofitting method over GloVe's pre-learned embeddings, as we expected that a domain adaptation of embeddings learned on an external source could be achieved this way.

**Predictive Models Generation.** To create our predictive models, each user is again an instance represented by their writings (see Sect. 4.2). For *Baseline 1* we applied a tf·idf vectorization to the users' documents, using the *TfidfVectorizer* provided by the *Scikit-learn* Python library, with a stop-word list and removal of the n-grams that appeared in fewer than 5 documents. The representation of each user through embeddings was given by the aggregation of the vector representations of the words in the user's concatenated texts, normalized by the size (word count) of the document; then, an L2 normalization was applied to all the instances.
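A minimal sketch of this embedding-based user representation (the mapping `emb` and dimensionality `dim` are assumed names):

```python
# Mean word vector of a user's concatenated posts, then L2 normalization.
import numpy as np

def user_vector(tokens, emb, dim):
    vecs = [emb[w] for w in tokens if w in emb]
    if not vecs:
        return np.zeros(dim)
    v = np.sum(vecs, axis=0) / len(tokens)       # aggregate, normalize by word count
    return v / (np.linalg.norm(v) + 1e-12)       # L2 normalization
```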

Given the small number of anorexia cases in the training set, we used SMOTE [4] as an over-sampling method to deal with the unbalanced classes. The Scikit-learn Python implementations of Logistic Regression (LR), Random Forest (RF), Multilayer Perceptron (MLP), and Support Vector Machines (SVM) were tested as classifiers over the training set with a 5-fold cross-validation approach, and a grid search over each method was performed to find the best parameters for the models.
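This training protocol can be sketched with imbalanced-learn and scikit-learn; the parameter grid shown is illustrative, not the one from the paper, and `X_train`, `y_train` are assumed names:

```python
# SMOTE over-sampling inside a 5-fold cross-validated grid search (LR shown).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("smote", SMOTE()),
                 ("clf", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_train, y_train)                       # X_train: the user vectors above
```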

**Results.** The results of the baselines are compared to models built with our variations. For *Variation 4* and Baselines 4 and 5 we use the 901 predictive terms of Sect. 4.4. To set the parameters of *Variation 3*, we tested different configurations, as in Sect. 4.3, and chose those with the best results according to PS.

Precision (P), Recall (R), F1-score (F1), and Accuracy (A) are used as evaluation measures. The scores for P, R, and F1 reported over the test set in Table 4 correspond to the anorexia (positive) class, as this is the most relevant one, whereas A is computed over both classes. Since there are six times more control cases than AN cases, and since false negatives (FN) are a bigger concern than false positives (FP), we prioritize R and F1 over P and A: as in most medical screening tasks, classifying a user at risk as a control case (FN) is worse than the opposite (FP), particularly for a classifier intended as a first filter that detects users at risk and eventually alerts clinicians, who then perform a specialized screening of the user profile. Table 4 shows the results for the best classifiers, with the best score for each measure highlighted.

Comparing the baselines, we notice that the embedding-based approaches improve R compared to the BoW model; however, this comes with a significant loss in P.

Regarding the embedding-based models, our variations outperform the results obtained by the baselines. The model with the embeddings generated by our method (*Variation 0*) provides significantly better results than the *Word2vec* model (*Baseline 3*), and even than the model with pre-learned embeddings (*Baseline 2*), despite the latter's wider vocabulary.

The combinations of pre-learned embeddings with embeddings learned on our training set provide the best results in terms of F1 and R. They also provide good accuracy, considering that most of the test cases are controls. We can also observe that using the weights of pre-learned embeddings to initialize our learning process over our corpus (*Variation 3*) significantly improves the R score in comparison with the embeddings generated by *Word2vec* (*Baseline 3*).

The worst results among our variations are given by *Variation 1*, which obtains results equivalent to *Baseline 2*. The best model in terms of F1 corresponds to *Variation 2*. Also, better results are obtained for P when the embeddings are enhanced by the retrofitting approach (*Variation 4*).


**Table 4.** Baselines and enhanced embeddings evaluated in terms of Precision (P), Recall (R), F1-Score (F1) and Accuracy (A).

## **5 Conclusions and Future Work**

We presented an approach for enhancing word embeddings towards a classification task for the detection of AN. Our method extends *Word2vec* by adding positive and negative costs to the objective function of a target term; the costs are defined through predictive terms for each of the target classes. The combination of the generated embeddings with pre-learned embeddings is also evaluated. Our results show that our enhanced embeddings outperform pre-learned embeddings and embeddings learned through *Word2vec*, despite the small size of the corpus. These results are promising, as they may open new research paths to explore.

Future work involves evaluating the method on similar tasks that can be formalized as document categorization problems over small corpora. Ablation studies will also be performed to assess the impact of each component on the obtained results.

## **References**



# **Event Recognition Based on Classification of Generated Image Captions**

Andrey V. Savchenko1,2(B) and Evgeniy V. Miasnikov<sup>1</sup>

<sup>1</sup> Samsung-PDMI Joint AI Center, St. Petersburg Department of Steklov Institute of Mathematics, Fontanka Street, St. Petersburg, Russia <sup>2</sup> National Research University Higher School of Economics, Laboratory of Algorithms and Technologies for Network Analysis, Nizhny Novgorod, Russia avsavchenko@hse.ru

**Abstract.** In this paper, we consider the problem of event recognition on single images. In contrast to the conventional fine-tuning of convolutional neural networks (CNNs), we propose to use image captioning, i.e., a generative model that converts images to textual descriptions. The motivation is the possibility of combining conventional CNNs with a completely different approach in an ensemble with high diversity. As the event recognition task has nothing serial or temporal about it, the obtained captions are one-hot encoded and summarized into a sparse feature vector suitable for learning an arbitrary classifier. We provide an experimental study of several feature extractors for the Photo Event Collection, the Web Image Dataset for Event Recognition, and the Multi-Label Curation of Flickr Events Dataset. It is shown that image captions trained on the Conceptual Captions dataset can be classified more accurately than the features from an object detector, though both are clearly not as rich as the CNN-based features. However, an ensemble of a CNN and our approach provides state-of-the-art results for several event datasets.

**Keywords:** Image captioning · Event recognition · Ensemble of classifiers · Convolutional neural network (CNN)

## **1 Introduction**

Nowadays, social networks and mobile devices create a vast stream of multimedia data, because people are taking more photos in recent years than ever before [1]. To organize a large gallery of personal photos, the photos may be assigned to albums according to events. Social events are happenings that are attended and shared by people [2,3] and take place in a specific environment [4], e.g., holidays, sports events, weddings, various activities, etc. The album labels are usually assigned either manually or by using locations from EXIF data, if the GPS tags of the camera are switched on. However, content-based image analysis has recently been introduced in photo organizing systems. Such analysis can be used to selectively look for photos of a particular event in order to keep nice memories of some episodes of our lives [4] or to gather our specific interests for personalized recommender systems.

There exist two different event recognition tasks [2]. In the first task, the event categories are recognized for the whole album (a sequence of photos). However, the assignments of images of the same event into albums may be unknown in practice. Hence, in this paper, we focus on the second task, namely, event recognition in single images from social media. As an event here is a complex scene with large variations in visual appearance [4], deep learning techniques [5] are widely used. It is typical to fine-tune existing convolutional neural networks (CNNs) on event datasets [4]. Sometimes CNN-based object detection is applied [6] for discovering particular categories, e.g., interior objects, food, transport, sports equipment, animals, etc. [7,8].

However, in this paper, a slightly different approach is considered. Instead of the conventional usage of a CNN as a discriminative model in classifier design [9], we propose to borrow generative models to represent an input image in another domain. In particular, we use existing methods of image captioning [10] that generate textual descriptions of images. Our main contribution is the demonstration that the generated descriptions can be fed to the input of a classifier in an ensemble in order to improve the event recognition accuracy of traditional methods. Though the proposed visual representation is not as rich as the features extracted by fine-tuned CNNs, it is better than the outputs of object detectors [8]. As our approach is completely different from traditional CNNs, it can be combined with them into an ensemble that possesses high diversity and, as a consequence, high accuracy.

The rest of the paper is organized as follows. In Sect. 2, the survey of image captioning models is given. In Sect. 3, we introduce the proposed pipeline for event recognition based on generated captions. Experimental results for several event datasets are presented in Sect. 4. Finally, concluding comments and future works are discussed in Sect. 5.

### **2 Literature Survey**

Most existing methods of event recognition in single photos rely on CNN-based architectures [2]. Four layers of a fine-tuned CNN were used to extract features for an LDA (Linear Discriminant Analysis) classifier in the ChaLearn LAP 2015 cultural event recognition challenge [11]. The iterative selection method [4] identifies the most relevant subset of classes for transferring representations from CNNs learned on the object (ImageNet) and scene (Places2) datasets. The bounding boxes of detected objects are projected onto multi-scale spatial maps in the paper [6]. An ensemble of scene classifiers and object detectors provided high accuracy [12] for the Photo Event Collection (PEC) [13]. Unfortunately, there is a significant gap between the accuracies of event classification in still photos [4] and in albums [14], so there is a strong demand for ever more accurate methods of single image processing.

That is why, in this paper, we propose to concentrate on other suitable visual features extracted with generative models, in particular image captioning techniques. There is a wide range of applications of image captioning: from the automatic generation of descriptions for photos posted in social networks to image retrieval from databases using generated text descriptions [15]. Image captioning methods are usually based on an encoder-decoder neural network, which first encodes an image into a fixed-length vector representation using a pre-trained CNN, and then decodes this representation into a caption (a natural language description). During the training of the decoder (generator), the input image and its ground-truth textual description are fed as inputs to the neural network, while the one-hot encoded description serves as the desired network output. The description is encoded using text embeddings in an Embedding (look-up) layer [5]. The generated image and text embeddings are merged by concatenation or summation and form the input to the decoder part of the network. It is typical to include a recurrent neural network (RNN) layer followed by a fully connected layer with a Softmax output.

One of the first successful models, "Show and Tell" [16], won the first MS COCO Image Captioning Challenge in 2015. It uses RNN with long short-term memory (LSTM) units in a decoder part. Its enhancement "Show, Attend and Tell" [17] incorporates a soft attention mechanism to improve the quality of the caption generation. The "Neural Baby Talk" image captioning model [18] is based on generating the template with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the object detectors. The foreground regions are obtained using the Faster-RCNN network [19], and LSTM with attention mechanism serves as a decoder. The "Multimodal Recurrent Neural Network" (mRNN) [20] is based on the Inception network for image features extraction and deep RNN for sentence generation. One of the best models nowadays is the Auto-Reconstructor Network (ARNet) [21], which uses the Inception-V4 network [22] in an encoder, and the decoder is based on LSTM. There exist two pre-trained models with greedy search (ARNet-g) and beam search (ARNet-b) with size 3 to generate the final caption for each input image.

## **3 Proposed Approach**

Our task can be formulated as a typical image recognition problem [9]. It is required to assign an input photo X from a gallery to one of C > 1 event categories (classes). A training set of N ≥ 1 images **X** = {X<sub>n</sub> | n ∈ {1, ..., N}} with known event labels c<sub>n</sub> ∈ {1, ..., C} is available for classifier learning. Sometimes the training photos of the same event are associated with an album [13,14]. In such a case, the training albums are unfolded into a set **X**, so that the collection-level label of the album is assigned to each photo of this album. This task possesses several characteristics that make it extremely challenging compared to album-based event recognition. One of these characteristics is the presence of irrelevant or unimportant photos that can be associated with any event [2]. These images can be detected by attention-based models when the whole album is available [1], but they may have a significant negative impact on the quality of event recognition in single images.

As N is usually rather small, transfer learning may be applied [5]. A deep CNN is first pre-trained on a large dataset, e.g., ImageNet or Places [23]. Second, this CNN is fine-tuned on **X**: the last layer is replaced by a new layer with Softmax activations and C outputs. An input image X is classified by feeding it to the fine-tuned CNN to compute the C scores of the output layer, i.e., the estimates of the posterior probabilities of all event categories. This procedure can be modified by extracting deep image features (embeddings) from one of the last layers of the pre-trained CNN [5,24]. The input image X and each training image X<sub>n</sub>, n ∈ {1, ..., N}, are fed to the input of the CNN, and the outputs of the last-but-one layer are used as the D-dimensional feature vectors **x** = [x<sub>1</sub>, ..., x<sub>D</sub>] and **x**<sub>n</sub> = [x<sub>n,1</sub>, ..., x<sub>n,D</sub>], respectively. Such deep learning-based feature extractors allow training a general classifier C<sub>emb</sub>, e.g., k-nearest neighbors, random forest (RF), support vector machine (SVM), or gradient boosting [9,25]. In both cases (fine-tuning, with the last Softmax layer in the role of the classifier C<sub>emb</sub>, and feature extraction with a general classifier), the C-dimensional vector of confidence scores **p**<sub>emb</sub> = C<sub>emb</sub>(**x**) is predicted for the input image. The final decision is made in favor of the class with maximal confidence.

In this paper, we use another approach to event recognition, based on generative models and image captioning. The proposed pipeline is presented in Fig. 1. At first, the conventional extraction of embeddings **x** is implemented using a pre-trained CNN. Next, these visual features and a vocabulary V are fed to a special RNN-based neural network (generator) that produces the caption describing the input image. The caption is represented as a sequence of L > 0 tokens

**Fig. 1.** Proposed event recognition pipeline based on image captioning

**t** = {t<sub>0</sub>, t<sub>1</sub>, ..., t<sub>L+1</sub>} from the vocabulary (t<sub>l</sub> ∈ V, l ∈ {0, ..., L + 1}). The caption is generated sequentially, word by word, starting from the special START token t<sub>0</sub> until a special END word t<sub>L+1</sub> is produced [21].

The generated caption **t** is fed into an event classifier. In order to learn its parameters, every n-th image from the training set is fed to the same image captioning network to produce a caption **t**<sub>n</sub> = {t<sub>n,0</sub>, t<sub>n,1</sub>, ..., t<sub>n,L<sub>n</sub>+1</sub>}. Since the number of tokens L<sub>n</sub> is not the same for all images, it is necessary either to train a sequential RNN-based classifier or to transform all captions into feature vectors of the same dimensionality. As the number of training instances N is not very large, we experimentally noticed that the latter approach is as accurate as the former, while its training time is significantly lower. This fact can be explained by the absence of anything temporal or serial in the initial task of event recognition in single images. Hence, we decided to use one-hot encoding and convert the sequences **t** and {**t**<sub>n</sub>} into vectors of 0s and 1s, as described in [26]. In particular, we select a subset of the vocabulary Ṽ ⊂ V by choosing the most frequently occurring words in the training data {**t**<sub>n</sub>}, with optional exclusion of stop words. Next, the input image is represented as the |Ṽ|-dimensional sparse vector **t̃** ∈ {0, 1}<sup>|Ṽ|</sup>, where |Ṽ| is the size of the reduced vocabulary Ṽ, and the v-th component of the vector **t̃** equals 1 only if at least one of the L words in the caption **t** equals the v-th word of Ṽ. This would mean, for instance, turning the sequence {1, 5, 10, 2} into a |Ṽ|-dimensional sparse vector that is all 0s except for indices 1, 2, 5 and 10, which are 1s [26]. The same procedure is used to describe each n-th training image with a |Ṽ|-dimensional sparse vector **t̃**<sub>n</sub>. After that, an arbitrary classifier C<sub>txt</sub> of such textual representations, suitable for sparse data, can be used to predict the C confidence scores **p**<sub>txt</sub> = C<sub>txt</sub>(**t̃**). It is known [26] that such an approach is even more accurate than conventional RNN-based classifiers (including one layer of LSTMs) for the IMDB dataset.
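A minimal sketch of this multi-hot encoding (the function name is ours):

```python
# Turn a caption, given as vocabulary indices, into a sparse 0/1 vector [26].
import numpy as np

def multi_hot(token_ids, vocab_size):
    v = np.zeros(vocab_size, dtype=np.float32)
    v[[i for i in token_ids if i < vocab_size]] = 1.0
    return v

multi_hot([1, 5, 10, 2], 5000)                   # 1s at indices 1, 2, 5 and 10
```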

In general, we do not expect the classification of short textual descriptions to be more accurate than conventional image recognition methods. Nevertheless, we believe that including image captions in an ensemble of classifiers can significantly improve its diversity [27]. Moreover, as the captions are generated from the extracted feature vector **x**, only one CNN inference is required if we combine the conventional classifier of embeddings from the pre-trained CNN with the classifier of image captions. In this paper, the outputs of the individual classifiers are combined by simple voting with soft aggregation. In particular, we compute the aggregated confidences as the weighted sum of the individual classifiers' outputs:

$$\mathbf{p}\_{ensemble} = [p\_1, \dots, p\_C] = w \cdot \mathbf{p}\_{emb} + (1 - w)\mathbf{p}\_{txt}.\tag{1}$$

The decision is taken in favor of the class with maximal confidence:

$$c^\* = \underset{c \in \{1, \dots, C\}}{\text{argmax}} \ p\_c. \tag{2}$$

The weight w ∈ [0, 1] in (1) can be chosen using a special validation subset in order to obtain the highest accuracy of criterion (2).
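Criteria (1) and (2) amount to a few lines of code (a sketch with our own names):

```python
# Weighted soft voting (Eq. 1) followed by the argmax decision rule (Eq. 2).
import numpy as np

def ensemble_predict(p_emb, p_txt, w):
    p = w * p_emb + (1.0 - w) * p_txt            # aggregated confidences
    return int(np.argmax(p))                     # class with maximal confidence
```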

Let us provide qualitative examples of the usage of our pipeline (Fig. 1). Results of (correct) event recognition using our ensemble are presented in Fig. 2. Here the first line of each title contains the generated image caption. In addition, the title displays the result of event recognition using the captions **t** (second line), the embeddings **x** (third line), and the whole ensemble (last line). As one can notice, the classification of the captions alone is not always correct. However, our ensemble is able to obtain a reliable solution even when individual classifiers make wrong decisions.

**Fig. 2.** Sample results of event recognition

## **4 Experimental Results**

In the experimental study, we examined the following event datasets:

- the Photo Event Collection (PEC) [13];
- the Web Image Dataset for Event Recognition (WIDER);
- the Multi-Label Curation of Flickr Events Dataset (ML-CUFED).

We used the standard train/test split for all datasets, as proposed by their creators. In PEC and ML-CUFED, the collection-level label is directly assigned to each image contained in the collection. Moreover, we completely ignore any metadata, e.g., temporal information, and use only the image itself, similarly to the paper [4]. As a result, the training and validation sets are not ideally balanced: the majority classes in each dataset contain five times more training images than the minority classes. However, the class distribution in the training and validation sets remains more or less identical, so the number of validation images for the majority classes is also five times higher than for the minority classes.

As we mainly focus on the possibility of implementing offline event recognition on mobile devices [12], we used the MobileNet v2 with α = 1 [28] and Inception v4 [22] CNNs to compare the proposed approach with conventional classifiers. At first, we pre-trained them on the Places2 dataset [23] for feature extraction. The linear SVM classifier from the scikit-learn library was used, as it had higher accuracy than the other classifiers from this library (RF, k-NN, and RBF SVM) on the considered datasets. Moreover, we fine-tuned these CNNs on the given training set as follows. At first, the weights in the base part of the CNN were frozen, and a new head (a fully connected layer with C outputs and Softmax activation) was learned using the ADAM optimizer (learning rate 0.001) for 10 epochs with an early stop, in the Keras 2.2 framework with the TensorFlow 1.15 backend. Next, the weights of the whole CNN were trained for 5 epochs with ADAM. Finally, the CNN was trained with SGD for 3 epochs at a 10-times lower learning rate.
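This three-stage schedule can be sketched in Keras as follows; the sketch is written against a current tf.keras for brevity (the paper uses Keras 2.2/TensorFlow 1.15), with ImageNet weights standing in for the Places2 pre-training, `train_ds`, `val_ds` and `num_classes` assumed, and the early-stopping callback omitted:

```python
# Three-stage fine-tuning: head only, whole network with Adam, then SGD.
from tensorflow.keras import Model, layers, optimizers
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(include_top=False, pooling="avg", weights="imagenet")
out = layers.Dense(num_classes, activation="softmax")(base.output)
model = Model(base.input, out)

base.trainable = False                           # stage 1: learn the new head
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10, validation_data=val_ds)

base.trainable = True                            # stage 2: unfreeze all weights
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)

model.compile(optimizer=optimizers.SGD(1e-4),    # stage 3: 10x lower rate
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=3)
```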

In addition, we used features from object detection models that are typical for event recognition [6,12]. As many photos of the same event often contain identical objects (e.g., a ball in football), these can be detected by contemporary CNN-based methods such as SSDLite [28] or Faster R-CNN [19]. These methods detect the positions of several objects in the input image and predict the scores of each class from a predefined set of K > 1 types. We extract a sparse K-dimensional vector of scores for the object types. If there are several objects of the same type, the maximal score is stored in this feature vector [8]. This feature vector is either classified by the linear SVM or used to train a feed-forward neural network with two hidden layers containing 32 units. Both classifiers were learned using the training set of each event dataset. In this study, we examined SSD with the MobileNet backbone and Faster R-CNN with the InceptionResNet backbone. The models pre-trained on the Open Images Dataset v4 (K = 601 objects) were taken from the TensorFlow Object Detection Model Zoo.
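A minimal sketch of this max-pooled detection feature (names are ours):

```python
# K-dimensional score vector: per object type, the maximal detection score [8].
import numpy as np

def detection_features(class_ids, det_scores, K=601):
    f = np.zeros(K, dtype=np.float32)
    for c, s in zip(class_ids, det_scores):      # one entry per detected box
        f[c] = max(f[c], s)
    return f
```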

Our preliminary experimental study with the pre-trained image captioning models discussed in Sect. 2 demonstrated that the best quality on the MS COCO captioning dataset is achieved by the ARNet model [21]. Thus, in this experiment, we used ARNet's encoder-decoder model. However, it can be replaced by any other image captioning technique without modification of our event recognition algorithm.

Unfortunately, event datasets do not contain captions (textual descriptions), which would be required to train or fine-tune the image captioning model. For this reason, the image captioning model was trained on the Conceptual Captions dataset, currently the largest dataset used for image captioning: it contains more than 3.3M image-URL/caption pairs in the training set and about 15 thousand pairs in the validation set. While there exist other, smaller datasets, such as MS COCO and Flickr, in our preliminary experiments the image captioning model trained on the Conceptual Captions dataset provided better worst-case performance in the cross-dataset evaluation.

The feature extraction in the encoder is implemented with the same CNNs (Inception and MobileNet v2). We extracted the |Ṽ| = 5000 most frequent words, excluding the special START and END tokens. These are classified by either a linear SVM or a feed-forward neural network with the same architecture as in the object detection case. Again, these classifiers are trained from scratch on each event training set. The weight w in our ensemble (Eq. 1) was estimated using the same set.

The results of the lightweight mobile models (MobileNet and the SSD object detector) and of the deep models (Inception and Faster R-CNN) for PEC, WIDER, and ML-CUFED are presented in Tables 1, 2, and 3, respectively. We also include the best known results for the same experimental setups.

Certainly, the proposed recognition of image captions is not as accurate as conventional CNN-based features. However, the classification of textual descriptions is much better than random guessing, whose accuracy is 100%/14 ≈ 7.14%, 100%/61 ≈ 1.64%, and 100%/23 ≈ 4.35% for PEC, WIDER, and ML-CUFED, respectively. It is important to emphasize that our approach has a lower error rate than the classification of features based on object detection in most cases. This gain is especially noticeable for the lightweight SSD models, which are 1.5–13% less accurate than the proposed classification of image captions due to the limitations of SSD-based models in detecting small objects (food, pets, fashion accessories, etc.). The Faster R-CNN-based detection features can be classified more accurately, but inference in Faster R-CNN with the InceptionResNet backbone is several times slower than decoding in the image captioning model (6–10 s vs. 0.5–2 s on a MacBook Pro 2015).


**Table 1.** Event recognition accuracy (%), PEC

**Table 2.** Event recognition accuracy (%), WIDER

Finally, the most appropriate way to use image captioning in event classification is its fusion with conventional CNNs. In this case, we improved the previous state of the art for PEC, 62.2% [4], even with the lightweight models (63.38%), when the fine-tuned CNNs are used in an ensemble. Our Inception-based model is even better (accuracy 65.12%). We have not yet reached the state-of-the-art accuracy of 53% [4] for the WIDER dataset, though our best accuracy (51.84%) is up to 9% higher than the best result (42.4%) of the original paper [6]. Our experimental setup for the ML-CUFED dataset is studied for the


**Table 3.** Event recognition accuracy (%), ML-CUFED

first time here, because this dataset was developed mostly for album-based event recognition. We should highlight that our preliminary experiments on the latter task with this dataset, using simple averaging of the MobileNet features extracted from all images of an album, slightly improved the state-of-the-art accuracy for this dataset, though more complex feature aggregation techniques [1] still need to be studied.

In practice, it is preferable to use a pre-trained CNN as a feature extractor, in order to avoid an additional inference pass through a fine-tuned CNN when it differs from the encoder of the image captioning model. Unfortunately, the accuracies of the SVM on pre-trained CNN features are 1.5–3% lower than those of the fine-tuned models for PEC and ML-CUFED; in this case, the additional inference may be acceptable. However, the difference in error rates between pre-trained and fine-tuned models on the WIDER dataset is not significant, so the pre-trained CNNs are definitely worth using there.

## **5 Conclusion**

In this paper, we have proposed to apply generative models to a classical discriminative task [9], namely image captioning to event recognition in still images. We have presented a novel pipeline using image captioning, with classification of the generated captions and retrieval of images based on their textual descriptions (Fig. 1). It has been experimentally demonstrated that our approach is more accurate than the widely used image representations obtained by object detectors [6,8]. Moreover, our approach is much faster than Faster R-CNNs, which do not implement one-shot detection. Especially useful for ensemble models [27], the generated caption provides additional diversity to conventional CNN-based recognition.

The motivation behind the study of image captioning techniques in this paper is connected not only with generating compact, informative descriptions of images, but also with the wide possibilities for ensuring the privacy of user data when further processing at remote servers is necessary. Moreover, as the vocabulary of generated captions is restricted, such techniques can be considered effective anonymization methods. Since textual descriptions can be easily perceived and understood by the user (as opposed to a vector of numeric features), his or her attitude towards the use of such methods will be more trusting.

Unfortunately, short conceptual textual descriptions are obviously not enough to classify event categories with high accuracy, even for a human, due to errors and a lack of specificity (see the examples of generated captions in Fig. 2). Another disadvantage of the proposed approach is the need for repeated inference when a fine-tuned CNN is applied in the ensemble. Hence, the decision-making time increases significantly, though the overall accuracy also becomes higher in most cases (Tables 1 and 3).

In the future, it is necessary to make the classification of the generated captions more accurate. First, though our preliminary experiments with LSTMs did not decrease the error rate of our simple approach with a linear SVM and one-hot encoded words, we strongly believe that a thorough study of RNN-based classifiers of the generated textual descriptors is required. Second, a comparison of image captioning models trained on the Conceptual Captions dataset is needed to choose the best model for caption generation; here, the impact of erroneously generated captions on event recognition accuracy should be examined. Third, additional research is needed to check whether we can fine-tune a CNN on an event dataset and use it as the encoder for caption generation without loss of quality; in this case, a more compact and fast solution could be achieved. Finally, the proposed pipeline should be extended to album-based event recognition [2,13] with, e.g., attention models [12].

**Acknowledgements.** This research is based on the work supported by Samsung Research, Samsung Electronics. The work of A.V. Savchenko was conducted within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE).

## **References**



# **Human-to-AI Coach: Improving Human Inputs to AI Systems**

Johannes Schneider(B)

Institute of Information Systems, University of Liechtenstein, Vaduz, Liechtenstein johannes.schneider@uni.li

**Abstract.** Humans increasingly interact with Artificial Intelligence (AI) systems. AI systems are optimized for objectives such as minimum computation or minimum error rate in recognizing and interpreting inputs from humans. In contrast, inputs created by humans are often treated as a given. We investigate how the inputs of humans can be altered to reduce misinterpretation by the AI system and to improve the efficiency of input generation for the human, while the altered inputs should remain as similar as possible to the original inputs. These objectives result in trade-offs, which we analyze for a deep learning system classifying handwritten digits. To create examples that serve as demonstrations for humans to improve, we develop a model based on a conditional convolutional autoencoder (CCAE). Our quantitative and qualitative evaluation shows that on many occasions the generated proposals lead to lower error rates, require less effort to create, and differ only modestly from the original samples.

## **1 Introduction**

Human-to-AI information flow is increasing rapidly in importance and extent across multiple modalities. For example, voice-machine interaction is becoming more and more popular, with deep learning networks recognizing text from speech. Similarly, progress in image recognition has lowered error rates in gesture and optical character recognition. Still, key AI technologies such as deep learning are not perfect. They may also err given ambiguous inputs created by humans. Errors may be more likely when humans are in a hurry, are unaware of the AI's recognition mechanism, are sloppy, or lack skill. Safety-critical application areas such as autonomous driving or medical applications, where an AI might depend on inputs from humans in one way or another, are becoming more and more prominent. Thus, mistakes in recognizing and processing inputs should be avoided. Apart from avoiding errors, humans might also have an incentive to provide inputs with less effort, e.g., "Why try to speak clearly and loudly in the presence of noise, if mumbling works just as well? Why do that extra stroke in writing a character, if detection works just as well without it?" In this work, we do not focus on how to improve AI systems that recognize and interpret human information; we aim at strategies for how humans can convey information better to such a system by adjusting their behavior. Identifying potential improvements becomes more difficult when deep learning is involved. Improvements are often based on a deep understanding of the mechanisms of the task at hand, i.e., how an AI system processes inputs, but deep learning is said to follow black-box behavior. Even worse, deep learning is well known to reason very differently from humans: deep learning models might astonish with their high accuracy rates, but disappoint at the same time by failing on simple examples that were only slightly modified, as well documented by so-called "adversarial examples". As such, humans might depend even more on being shown opportunities for generating better data that serves as input to an AI. In this work, we formalize the aforementioned, partially conflicting goals, such as minimizing wrongly recognized human inputs and reducing effort for humans, both in terms of the need to adjust their behavior and of interacting effortlessly. We focus on the classification of digits, where we aim to provide suggestions to humans by altering their generated inputs, as illustrated in Fig. 1. We express the problem as a multi-objective optimization problem, i.e., as a linear weighted sum. As the model we use a conditional convolutional autoencoder. Our qualitative and quantitative evaluation highlights that the generated samples are visually appealing, easy to interpret, and also lead to a lower error rate in recognition.

**Fig. 1.** "Human-to-AI" (H2AI) coach: From misunderstandings to understanding

## **2 Challenges of Human-to-AI Communication**

We consider the problem of improving human-generated inputs to an AI, illustrated in Fig. 1. A human wants to convey information to an AI using some mode, e.g., speech, writing, or gestures. The processing of the received signals by the AI often involves two steps: (i) recognition, i.e., identifying and extracting relevant information in the input signal, and (ii) interpretation, i.e., deriving actions by utilizing the information in a specific context. For recognition, the information has to be extracted from a physical (analog) signal, e.g., using speech recognition, image recognition, etc. When information is communicated digitally using structured data, recognition is commonly unnecessary. Often the extracted information has to be further processed by the AI using some form of sense-making or interpretation. This potentially requires semantic understanding capabilities, and the AI might rely on context such as prior discourse or the surroundings. We assume that the human interacts frequently with such a system, so that it is reasonable for the human to improve on objectives such as errors and efficiency in communication. In this paper, we consider the challenge of discovering variations of the original inputs that might help a human to improve.

More formally, we consider a classification problem where a user provides data D = (X, Y). Each sample X should be recognized as class Y by a classifier C<sub>H</sub>. We denote by X<sub>i</sub> the i-th feature of sample X. For illustration, in the case of handwritten digits a sample X is a gray-tone scan of a digit and Y ∈ {0, ..., 9} the digitized number; X<sub>i</sub> ∈ [0, 1] gives the brightness of the i-th pixel of the scan. The classification model C<sub>H</sub> was trained to optimize classification performance on human samples, i.e., to maximize P<sub>C<sub>H</sub></sub>(Y | X). We regard the model C<sub>H</sub> as given, i.e., we do not alter it in any way, but use it in our optimization process. The Human-to-AI coach (H2AI) takes as input one sample X with its label Y and returns at least one proposal X̂, i.e., X̂ := H2AI(X, Y). The suggestion X̂ should be superior to X according to some objective; e.g., we might demand higher certainty in recognition, P<sub>C<sub>H</sub></sub>(Y | X) < P<sub>C<sub>H</sub></sub>(Y | X̂). In a handwriting scenario, a human might use a proposal X̂ based on an input X to adjust her strokes.

## **3 Model and Objectives**

An essential requirement is that the modified samples are similar to the given input; otherwise, a trivial solution would be to always return "the perfect sample", identical for every input. This motivates utilizing an autoencoder (Sect. 3.1) and adding multiple loss terms to handle the various objectives (Sect. 3.2).

#### **3.1 Architecture**

Two approaches that allow creating (modified) samples are generative adversarial networks (GANs) and autoencoders (AEs). There are also combinations thereof, e.g., the pix2pix architecture [10] or the conditional variational autoencoder [2]. [10] and [2] contain an AE whose decoder serves as a generator based on the latent representation from the encoder and, additionally, a discriminator. AEs tend to generate outcomes that are closer to the inputs, but these are often smoother and less realistic looking. In our application, staying close to the input is a key requirement, since we only want to show how a sample can be modified rather than generate completely new samples. Thus, we decided to focus on an AE-based architecture, and we also investigate including a discriminator to improve the generated samples. More precisely, we utilize a conditional AE with extra loss terms for regularization, covering not only a discriminator loss but also losses for efficiency and for the classification of modified samples, as shown in Fig. 3. A conditional AE is given the class of a sample as input in addition to the sample itself. This often improves the generated samples, in particular for ambiguous samples, i.e., samples that seem to match multiple classes well.

**Fig. 2.** H2AI implementation using a convolutional conditional autoencoder (CCAE)

Convolutional AEs are known to work well on image data. Therefore, we propose a convolutional conditional AE (CCAE) as shown in Fig. 2, where the NN-upsample layers in the decoder denote nearest-neighbor upsampling. After each convolutional layer there is a ReLU layer, not shown in Fig. 2. Compared to transposed convolutional layers, NN-upsampling followed by convolutional layers prevents checkerboard artifacts in the resulting images.
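One decoder stage of this kind can be sketched in Keras as follows; the paper does not name a framework, and the kernel size is our assumption:

```python
# Nearest-neighbor upsampling followed by convolution and ReLU; this ordering
# avoids the checkerboard artifacts of transposed convolutions.
from tensorflow.keras import layers

def nn_upsample_block(x, channels):
    x = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    x = layers.Conv2D(channels, 3, padding="same")(x)
    return layers.ReLU()(x)                      # ReLU after each convolution
```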

**Fig. 3.** Human-to-AI (H2AI) model with its components and regularizers

### **3.2 Objectives and Loss Terms**

The generated input samples should meet multiple criteria, each of which is implemented as a loss term. The loss terms and their weighted sum (with weights α) are given in Eq. 1 and illustrated in Fig. 3. The total loss L<sub>Tot</sub>(X, Y) contains four parameters: α<sub>RE</sub>, α<sub>CL</sub>, α<sub>EF</sub>, and α<sub>D</sub>. It is possible to fix α<sub>RE</sub> and use the other three to control the relative importance of the following objectives:

$$\begin{aligned} \hat{X} &:= CCAE(X, Y) & \text{Sample proposed by the H2AI coach} \\ L\_{RE}(X, \hat{X}) &:= \sum\_{i} |X\_{i} - \hat{X}\_{i}| & \text{Reconstruction or change loss} \\ L\_{CL}(\hat{X}, Y) & & \text{Classification loss} \\ L\_{EF}(\hat{X}) &:= \sum\_{i} |\hat{X}\_{i}| & \text{Efficiency loss} \\ L\_{D}(\hat{X}) &:= \log(1 - D(\hat{X})) & \text{Discriminator loss} \\ L\_{Tot}(X, Y) &:= \alpha\_{RE} L\_{RE}(X, \hat{X}) + \alpha\_{CL} L\_{CL}(\hat{X}, Y) + \alpha\_{EF} L\_{EF}(\hat{X}) + \alpha\_{D} L\_{D}(\hat{X}) \end{aligned} \tag{1}$$

**Minimal Effort to Change:** Change might be difficult and tedious for humans. Thus, the effort for humans to adjust their behavior should be minimized. This implies that the original samples $X$ created by humans and the newly generated variations $\hat{X}$ should be similar. This is covered by the reconstruction loss $L\_{RE}(X, \hat{X})$ of the AE (see Eq. (1)). It enforces the output and the input to be similar. But parts of the input might be changed fairly drastically, e.g. for handwritten digits, pixels might change from 0 (black) to 1 (white) and vice versa. For that reason, we do not employ an L2-metric, which heavily penalizes such differences, but rather opt for an L1-metric.

**Reduce Misunderstanding:** The amount of wrongly extracted or interpreted information by the AI should be reduced. AEs are known to have a denoising, averaging effect. They are also known to improve performance in some cases in conjunction with classification tasks [11]. To further foster a reduction in misunderstandings we minimize the classification loss $L\_{CL}(\hat{X}, Y)$ of generated examples $\hat{X}$ under the model $C\_H$ the human communicates with.

**Realistic Samples:** The generated samples $\hat{X}$ should still be comprehensible for humans or other systems, i.e. look realistic. A generated proposal $\hat{X}$ can be so optimized for the given AI model $C\_H$ that it is not meaningful in general. That is, the proposal $\hat{X}$ might appear not only very different from prototypical examples of its class but very different from any example occurring in reality. While AEs partially counteract this, they do not enforce that samples look real, but tend to create smooth (averaged) samples. Thus, we add a discriminator $D$, resulting in a GAN architecture, that should distinguish between real and generated samples and make the latter look crisper. The added discriminator loss $L\_D(\hat{X})$ is $\log(1 - D(\hat{X}))$, where $\hat{X} := CCAE(X, Y)$ is the sample generated for an input sample $X$ of a human of class $Y$.

**Minimal Effort to Create Samples:** Interaction should be effortless for the human (and the AI). To quantify the effort of a human to create a sample, time might be a good option if available. If not, application-specific measures might be more appropriate. For measuring effort in handwriting, the number (and length) of strokes can be used. A good approximation is the total amount of needed "ink", which corresponds to the L1-loss of the proposal $\hat{X}$, i.e. $L\_{EF}(\hat{X}) := \sum\_i |\hat{X}\_i|$. We chose the L1 over the L2-metric, since having many low-intensity pixels (as fostered by L2) is generally discouraged.
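Taken together, the four objectives are combined as in Eq. (1). The following PyTorch sketch shows one way to assemble the weighted sum. Note that the exact form of $L\_{CL}$ is not spelled out above, so cross-entropy against the label under the fixed classifier is an assumption here, as are the helper names `c_h` and `disc`.

```python
import torch
import torch.nn.functional as F

def h2ai_loss(x, y, x_hat, c_h, disc,
              a_re=32.0, a_cl=1.0, a_ef=1.0, a_d=1.0):
    """Weighted sum of the four loss terms of Eq. (1).
    c_h: fixed classifier returning logits; disc: discriminator returning
    probabilities in (0, 1). Both are assumed interfaces, not the paper's code."""
    l_re = torch.abs(x - x_hat).sum()                 # L1 reconstruction / change loss
    l_cl = F.cross_entropy(c_h(x_hat), y)             # classification loss (assumed cross-entropy)
    l_ef = torch.abs(x_hat).sum()                     # efficiency loss: total "ink"
    l_d = torch.log(1.0 - disc(x_hat) + 1e-6).mean()  # discriminator loss log(1 - D(x_hat))
    return a_re * l_re + a_cl * l_cl + a_ef * l_ef + a_d * l_d
```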

## **4 Evaluation**

We conducted both a qualitative and a quantitative evaluation on the MNIST dataset, since it has been used by recent work in similar contexts [6,8]. It consists of 50000 handwritten digits from 0 to 9 for training and 10000 digits for testing. The classification model $C\_H$, i.e. the system a user is supposed to communicate well with, is a simple convolutional neural network (CNN) consisting of two convolutional layers (8 and 16 channels), each followed by a ReLU and a 2 × 2 max-pooling layer. The last layer is fully connected. The network achieved a test accuracy of 95.97%. While this could be improved, it is not of prime relevance for our problem, since the classifier $C\_H$ is treated as given. The architecture of the H2AI coach is shown in Fig. 3, with details of the AE in Fig. 2 and loss terms in Eq. 1. We did not employ any data augmentation. We used the Adam optimizer with learning rate 1e−4 for all models. Training lasted for 10 epochs with a batch size of 8. We trained 5 networks for each hyperparameter setting. We perform statistical analysis of our results using t-tests.
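As an illustration, the classifier just described could be built as follows; kernel sizes and padding are assumptions, since the text only specifies the channel counts, activations, and pooling.

```python
import torch.nn as nn

# Sketch of the classifier C_H described above: two convolutional layers
# (8 and 16 channels), each followed by ReLU and 2x2 max-pooling, then one
# fully connected layer. Kernel size 3 and padding 1 are assumptions.
c_h = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),  # logits for the ten digit classes
)
```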

For the ablation study we consider adding each of the losses in isolation to the baseline with just the AE, varying the parameters $\alpha\_{CL}$, $\alpha\_{EF}$, $\alpha\_D$ that control their impact. For the AE we used $\alpha\_{RE} = 32$ in all experiments.<sup>1</sup> Finally, we consider a model where we add all losses. There are no fixed ranges for the parameters $\alpha$, but they should be chosen so that all loss terms have a noticeable impact on the total loss – at least in the early phases of training.<sup>2</sup>

Our qualitative analysis is a visual assessment of the generated images. We investigate images that were improved (in terms of each of the metrics), worsened and remained roughly the same. As quantitative measures we used the losses as defined in Eq. 1 except for classification, where we used the more common accuracy metric.

### **4.1 Qualitative Analysis**

Figure 4 shows unmodified samples (left-most column) and various configurations of loss weights $\alpha$. We use *R.x* to denote "row x". The AE on its own (2nd column, $\alpha\_{RE} = 32$) already has an overall positive impact, yielding smoother images than the original ones. It tends to improve efficiency by removing "exotic" strokes, e.g. for the *2* in R.6 and the *5* in the last row, and sometimes also helps improve readability (e.g. ease of classification): the *8* in R.1 and the *6* in the 2nd-last row both become more readable. Other digits might seem more readable but are actually worsened, e.g. the *6* in R.6 appears to become a *0* (it is actually a *6*) and the *7* in R.7 appears to become more of a *9*. When optimizing in addition for efficiency (3rd column), some parts of digits get deleted, which is sometimes positive and sometimes negative. Some benefits of the AE seem to get undone, e.g. the *6* in the 2nd-last row now looks again more like the original with missing parts. The same holds for the *8* in R.1, though for both some improvement in shape remains. More interestingly, the digits in R.6 both get changed to *0*, which is incorrect. On the positive side, several figures become more readable through subtle changes, e.g. removals of parts like the *5* in the last row, the *2* in the 2nd-last row or the *3* in R.3. When using the AE and the discriminator (4th column in Fig. 4), we can observe that the samples become slightly more realistic, i.e. crisper. We can see clear improvements for the *7* in R.7 and the *6* in R.9. Many digits remain the same. When using the AE and the classification loss (last column), smoothness increases and digits appear blurry. Readability worsens for a few digits, e.g. the left *4* in R.2 can now easily be confused with a *9*, and the *6* in R.9 is no better than the original and worse than the one using a discriminator. Overall, the classification loss helps to improve many other samples. Some only now become readable, e.g. the *5* and *3* in R.8. Also, some digits become simpler, e.g. the *1* in R.1 and the *7*s in R.3, R.4 and R.7.

<sup>1</sup> $\alpha\_{RE}$ is not needed (it could be set to 1). But, in practice, it is easier to vary $\alpha\_{RE}$ than to change $\alpha\_{CL}$, $\alpha\_{EF}$, $\alpha\_D$, since they behave non-linearly.

<sup>2</sup> We found that altering $\alpha$ during training requires much more tuning, but yields only modest improvements.

**Fig. 4.** Original and generated samples using a subset of all loss terms

When combining all losses (Fig. 5), larger values are possible for some parameters $\alpha$ while still obtaining reasonable results, since the objectives partially counteract each other. For example, the discriminator loss pushes pixels to become brighter, whereas the efficiency loss pushes them to be darker. We noticed that the strong smoothing effect due to the classification loss is essentially removed, mainly by the discriminator loss but also partially by the efficiency loss. The benefits of the classification loss, however, mainly remain and are even improved: the *4* in R.2 and the *6* in R.9 become more readable. There are also differences in quality among the three configurations. Interestingly, the original images show somewhat more contrast, in particular compared to the second column. A careful observer will notice a few bright points in the upper part of both *4*s in R.2. These seem to be artifacts of the optimization. It is well known that training GANs might lead to non-convergence or mode collapse. The former was observed for (too) large discriminator weight $\alpha\_D$. We also noticed mode collapse for large values of $\alpha\_{CL}$ (not shown) and bad outcomes for large values of $\alpha\_{EF}$, as shown in the last column. Degenerated examples still score high on some of the metrics but are very poor on others, e.g. in the last column accuracy and efficiency loss are good, but the reconstruction loss is large. Still, overall, combining all losses leads to the best results.

**Fig. 5.** Original and generated samples using all loss terms

### **4.2 Quantitative Analysis**

Table 1 shows the loss terms (with accuracy instead of classification loss) for all loss configurations also shown in Fig. 4 for our ablation study, with the reconstruction loss (AE only) as baseline. We first discuss accuracy. The AE on its own leads to a small gain in accuracy compared to the baseline classifier $C\_H$ at 95.97%. Not surprisingly, optimizing accuracy directly (using a classification loss, i.e. $\alpha\_{CL} > 0$) leads to the best results: even for a seemingly small $\alpha\_{CL}$, accuracy exceeds 99.9%. While it appears that differences in accuracy between various values of $\alpha\_{CL}$ are not significant, from a statistical perspective (using a t-test) they are (p-value < .001). For any $\alpha\_{CL}$, the network tends to fail on the same samples, leading to very low variance in accuracy. The large accuracy values are no surprise, since also for the test set the network is fed the correct label and could therefore in principle always return a "prototypical" class sample, ignoring all other information. When varying the efficiency loss weight $\alpha\_{EF}$, accuracy decreases, but the decrease is only statistically significant for $\alpha\_{EF} \geq 8$ (p-value < .001). Adding a discriminator also negatively impacts accuracy, with $\alpha\_D \geq 0.64$ showing statistically significantly worse results (p-value < .01).


**Table 1.** Results when varying one loss term weight $\alpha\_{CL}$, $\alpha\_{EF}$, $\alpha\_D$

The reconstruction loss $L\_{RE}$ is most tightly correlated with the visual quality of the outcomes. In particular, a large AE loss is likely to imply poor visual outcomes, even when other metrics such as accuracy indicate good results. This can be observed in Table 1 for $\alpha\_{CL} = 0.24$. Generally, the reconstruction loss worsens when optimizing for accuracy ($\alpha\_{CL} > 0$) or adding a discriminator ($\alpha\_D > 0$). Differences to the baseline are significant (p-value < .01). For the efficiency loss, differences are only significant for values $\alpha\_{EF} \geq 8$ (p-value < .01).

The efficiency loss decreases when adding other losses. For the discriminator the differences compared to the baseline are not significant, while for the other losses they are, for any value of $\alpha\_{EF}$ and for $\alpha\_{CL} \geq 0.1$ (p-value < .01).

## **5 Related Work**

There are numerous types of AEs. Related to our application are denoising AEs, which are typically used with intentional noise injection with the goal of weight regularization. In contrast, we assume that noise is part of the input data, and its removal is thus not motivated by regularization. The idea to combine AEs and GANs for image generation has been explored previously, e.g. [2] uses a conditional variational AE and applies it to image inpainting and attribute morphing. In this work, we consider a novel application of this architecture type. Our work is a form of image-to-image translation [10]. Typically, inputs and outputs are fairly different, e.g. the input could be a colored segmentation of an image not showing any details and the output a photo-like image with many details. In contrast, in our scenario in- and outputs are fairly similar. For image in-painting or completion [9,16] a network learns to fill in blank spaces of an image. In contrast, we might both in-paint and erase. Image manipulation based on user edits has been studied in [18]. They learn the natural image manifold using a generative adversarial network and express manipulations as a constrained optimization problem. They apply both spatial and channel (i.e. color) flow regularization. Their primary goal is to obtain realistic-looking images after manipulations. Thus, their problem and approach are fairly different. Furthermore, in contrast to the mentioned prior works [2,9,10,16,18], our work can be classified as unsupervised learning. That is, we do not know the final outputs, i.e. the images that should be proposed to the human. Prior work trains by comparing outcomes to a target; in our case, we do not have pairs of human input (images) and improved input (images) in our training data.

The field of human-AI interaction is fairly broad. The effect of various user and system characteristics has been extensively studied [13]. There has been little work on how to improve communication and prevent misunderstandings. [12] discusses high-level, non-technical strategies to deal with errors in speech-based communication that originate either from humans or from machines. [4] lists some errors that occur when interacting with a robot using natural language, such as grammatical and geometrical misunderstandings as well as ambiguities. [5] highlighted the impact of nonverbal communication on efficiency and robustness in communication, showing that nonverbal communication can reduce errors. Our work also relates to the field of personalized explanations [15], since it aims to explain to a user how she might improve interaction with an AI. Explainability in the context of machine learning is generally more focused on interpreting decisions and models (see [1,15] for recent surveys). Counterfactual explanations also seek to identify some form of modification of the input. [6] explains by answering "How can an input be modified to obtain classification Y?" and "What is minimally needed?". The former focuses on mis-classified examples with the goal of changing them with minimal effort to the correct class. For the latter, all objectives except efficiency are ignored and there is only the constraint of maintaining classification confidence above a threshold. Thus, [6] discusses special cases of our work. Technically, [6] generates a perturbation added to the sample such that the perturbation is minimal given that a threshold confidence of the prediction (either as the correct class or as an alternative class) has been achieved. They use an ordinary AE as an optional element on the perturbation, which only slightly alters results. In contrast, we use a CCAE on the inputs, which is essential, and we optimize for multiple linearly weighted objectives without thresholds. [8] aims at explaining counterfactuals, i.e. showing how to change a class to another by combining images of both classes. That is, given a query image and a distractor image, they generate a composite image that essentially uses parts of each input. For instance, in the right part of Fig. 6 the "7" in the second row serves as query image, the "2" in the middle as distractor, and the right-most column shows the outcome. The implementation relies on a gating mechanism to select image parts. Differences are also noticeable in the outcomes, as shown in Fig. 6. The highlighted differences appear noisy in [6] and are not necessarily intuitive, e.g. for column CEM-PP for digit "3" a stroke on top is missing, but [6] finds a miniature "3" within the given digit. The generated images in [8] appear more natural but do have artifacts, e.g. the "2", being a composition of a "7" and a "2", has a "dot" at the bottom originating from the "7". In conclusion, while counterfactual explanations [6,8] are related to our work, the objectives differ (e.g. we include efficiency), as do the methodology and outcomes. While we also make recommendations to a user, there are only weak ties to recommender systems: even for interpretable recommendation systems [7], users typically seek to understand decisions but do not commonly aim to alter their behavior to obtain better recommendations.

**Fig. 6.** Left digits are taken from [6]. Right digits stem from [8].

### **6 Discussion and Conclusions**

Input from humans to AI is likely to gain further importance. This paper investigated improving the information flow from human to AI by proposing adjustments to human-generated examples based on optimizing multiple objectives. Our evaluation highlights that such an automatic approach is indeed feasible for handwriting. While we believe that our approach is suitable for other domains such as speech recognition, the details of the network architecture, the definition of the loss terms, and the loss weights likely need to be adjusted. Furthermore, our work focused on generating altered input samples fulfilling specific metrics, but it leaves many questions unanswered when applying it. For instance, we did not investigate how these samples are best shown or explained to users, e.g. by highlighting differences or, maybe, even in textual form. These points, as well as more advanced multi-objective optimization, i.e. exploring the set of (Pareto-)optimal solutions rather than manually adjusting the parameters $\alpha$, are subject to future work. Furthermore, one might include more objectives, e.g. generating proposals that require little energy to process by the AI [14] or taking into account behavioral norms expected by people, as is common for social robots [3,17]. We hope that in the future human-to-AI coaches will help non-experts to better interact with AI systems.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Aleatoric and Epistemic Uncertainty with Random Forests**

Mohammad Hossein Shaker(B) and Eyke Hüllermeier

Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, Paderborn, Germany {mhshaker,eyke}@upb.de

**Abstract.** Due to the steadily increasing relevance of machine learning for practical applications, many of which are coming with safety requirements, the notion of uncertainty has received increasing attention in machine learning research in the last couple of years. In particular, the idea of distinguishing between two important types of uncertainty, often referred to as *aleatoric* and *epistemic*, has recently been studied in the setting of supervised learning. In this paper, we propose to quantify these uncertainties, referring, respectively, to inherent randomness and a lack of knowledge, with random forests. More specifically, we show how two general approaches for measuring the learner's aleatoric and epistemic uncertainty in a prediction can be instantiated with decision trees and random forests as learning algorithms in a classification setting. In this regard, we also compare random forests with deep neural networks, which have been used for a similar purpose.

**Keywords:** Machine learning · Uncertainty · Random forest

## **1 Introduction**

The notion of uncertainty has received increasing attention in machine learning research in the last couple of years, especially due to the steadily increasing relevance of machine learning for practical applications. In fact, a trustworthy representation of uncertainty should be considered as a key feature of any machine learning method, all the more in safety-critical application domains such as medicine [9,22] or socio-technical systems [19,20].

In the general literature on uncertainty, a distinction is made between two inherently different sources of uncertainty, which are often referred to as *aleatoric* and *epistemic* [4]. Roughly speaking, aleatoric (*aka* statistical) uncertainty refers to the notion of randomness, that is, the variability in the outcome of an experiment which is due to inherently random effects. The prototypical example of aleatoric uncertainty is coin flipping. As opposed to this, epistemic (*aka* systematic) uncertainty refers to uncertainty caused by a lack of knowledge, i.e., it relates to the epistemic state of an agent or decision maker. This uncertainty can in principle be reduced on the basis of additional information. In other words, epistemic uncertainty refers to the *reducible* part of the (total) uncertainty, whereas aleatoric uncertainty refers to the *non-reducible* part.

More recently, this distinction has also received attention in machine learning, where the "agent" is a learning algorithm [18]. In particular, a distinction between aleatoric and epistemic uncertainty has been advocated in the literature on deep learning [6], where the limited awareness of neural networks of their own competence has been demonstrated quite nicely. For example, experiments on image classification have shown that a trained model does often fail on specific instances, despite being very confident in its prediction. Moreover, such models are often lacking robustness and can easily be fooled by "adversarial examples" [14]: Drastic changes of a prediction may already be provoked by minor, actually unimportant changes of an object. This problem has not only been observed for images but also for other types of data, such as natural language text [17].

In this paper, we advocate the use of decision trees and random forests, not only as a powerful machine learning method with state-of-the-art predictive performance, but also for measuring and quantifying predictive uncertainty. More specifically, we show how two general approaches for measuring the learner's aleatoric and epistemic uncertainty in a prediction (recalled in Sect. 2) can be instantiated with decision trees and random forests as learning algorithms in a classification setting (Sect. 3). In an experimental study on uncertainty-based abstention (Sect. 4), we compare random forests with deep neural networks, which have been used for a similar purpose.

### **2 Epistemic and Aleatoric Uncertainty**

We consider a standard setting of supervised learning, in which a learner is given access to a set of (i.i.d.) training data $\mathcal{D} := \{(\boldsymbol{x}\_i, y\_i)\}\_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is an instance space and $\mathcal{Y}$ the set of outcomes that can be associated with an instance. In particular, we focus on the classification scenario, where $\mathcal{Y} = \{y\_1, \ldots, y\_K\}$ consists of a finite set of class labels, with binary classification ($\mathcal{Y} = \{0, 1\}$) as an important special case.

Suppose a *hypothesis space* $\mathcal{H}$ to be given, where a hypothesis $h \in \mathcal{H}$ is a mapping $\mathcal{X} \longrightarrow \mathbb{P}(\mathcal{Y})$, i.e., a hypothesis maps instances $\boldsymbol{x} \in \mathcal{X}$ to probability distributions on outcomes. The goal of the learner is to induce a hypothesis $h^\* \in \mathcal{H}$ with low risk (expected loss)

$$R(h) := \int\_{\mathcal{X}\times\mathcal{Y}} \ell(h(\mathbf{x}), y) \, dP(\mathbf{x}, y), \tag{1}$$

where $P$ is the (unknown) data-generating process (a probability distribution on $\mathcal{X} \times \mathcal{Y}$), and $\ell : \mathcal{Y} \times \mathcal{Y} \longrightarrow \mathbb{R}$ a loss function. This choice of a hypothesis is commonly guided by the empirical risk

$$R\_{emp}(h) := \frac{1}{N} \sum\_{i=1}^{N} \ell(h(\boldsymbol{x}\_i), y\_i), \tag{2}$$

i.e., the performance of a hypothesis on the training data. However, since $R\_{emp}(h)$ is only an estimate of the true risk $R(h)$, the empirical risk minimizer (or any other predictor)

$$\widehat{h} := \operatorname\*{argmin}\_{h \in \mathcal{H}} R\_{emp}(h) \tag{3}$$

favored by the learner will normally not coincide with the true risk minimizer (Bayes predictor)

$$h^\* := \operatorname\*{argmin}\_{h \in \mathcal{H}} R(h). \tag{4}$$

Correspondingly, there remains uncertainty regarding $h^\*$ as well as the approximation quality of $\widehat{h}$ (in the sense of its proximity to $h^\*$) and its true risk $R(\widehat{h})$.

Eventually, one is often interested in the *predictive uncertainty*, i.e., the uncertainty related to the prediction $\widehat{y}\_q$ for a concrete query instance $\boldsymbol{x}\_q \in \mathcal{X}$. In other words, given a partial observation $(\boldsymbol{x}\_q, \cdot)$, we are wondering what can be said about the missing outcome, especially about the uncertainty related to a prediction of that outcome. Indeed, estimating and quantifying uncertainty in a transductive way, in the sense of tailoring it to individual instances, is arguably important and practically more relevant than a kind of average accuracy or confidence, which is often reported in machine learning.

**Fig. 1.** Different types of uncertainties related to different types of discrepancies and approximation errors: $f^\*$ is the pointwise Bayes predictor, $h^\*$ is the best predictor within the hypothesis space, and $\widehat{h}$ the predictor produced by the learning algorithm.

As the prediction $\widehat{y}\_q$ constitutes the end of a process that consists of different learning and approximation steps, all errors and uncertainties related to these steps may also contribute to the uncertainty about $\widehat{y}\_q$ (cf. Fig. 1):

– Since the dependency between $\mathcal{X}$ and $\mathcal{Y}$ is typically non-deterministic, the description of a new prediction problem in the form of an instance $\boldsymbol{x}\_q$ gives rise to a conditional probability distribution

$$p(y \mid x\_q) = \frac{p(x\_q, y)}{p(x\_q)}\tag{5}$$

on $\mathcal{Y}$, but it does normally not identify a single outcome $y$ in a unique way. Thus, even given full information in the form of the measure $P$ (and its density $p$), uncertainty about the actual outcome $y$ remains. This uncertainty is of an *aleatoric* nature. In some cases, the distribution (5) itself (called the predictive posterior distribution in Bayesian inference) might be delivered as a prediction. Yet, when having to commit to a point estimate, the best prediction (in the sense of minimizing the expected loss) is prescribed by the pointwise Bayes predictor $f^\*$, which is defined by

$$f^\*(x) := \operatorname\*{argmin}\_{\hat{y} \in \mathcal{Y}} \int\_{\mathcal{Y}} \ell(y, \hat{y}) \, dP(y \mid x) \tag{6}$$

for each $\boldsymbol{x} \in \mathcal{X}$.


As already said, aleatoric uncertainty is typically understood as uncertainty that is due to influences on the data-generating process that are inherently random, that is, due to the non-deterministic nature of the sought input/output dependency. This part of the uncertainty is irreducible, in the sense that the learner cannot get rid of it. Model uncertainty and approximation uncertainty, on the other hand, are subsumed under the notion of epistemic uncertainty, that is, uncertainty due to a lack of knowledge about the perfect predictor (6). Obviously, this lack of knowledge strongly depends on the underlying hypothesis space $\mathcal{H}$ as well as the amount of data seen so far: the larger the number $N = |\mathcal{D}|$ of observations, the less ignorant the learner will be when having to make a new prediction. In the limit, when $N \rightarrow \infty$, a consistent learner will be able to identify $h^\*$. Moreover, the "larger" the hypothesis space $\mathcal{H}$, i.e., the weaker the prior knowledge about the sought dependency, the higher the epistemic uncertainty will be, and the more data will be needed to resolve it.

How to capture these intuitive notions of aleatoric and epistemic uncertainty in terms of quantitative measures? In the following, we briefly recall two proposals that have recently been made in the literature.

#### **2.1 Entropy Measures**

An attempt at measuring and separating aleatoric and epistemic uncertainty on the basis of classical information-theoretic measures of entropy is made in [2]. This approach is developed in the context of neural networks for regression, but the idea as such is more general and can also be applied to other settings. A similar approach was recently adopted in [10].

Given a query instance $\boldsymbol{x}$, the idea is to measure the total uncertainty in a prediction in terms of the (Shannon) entropy of the predictive posterior distribution, which, in the case of discrete $\mathcal{Y}$, is given as

$$H\left[p(y\mid\boldsymbol{x})\right] = \mathbf{E}\_{p(y\mid\boldsymbol{x})}\left\{-\log\_2 p(y\mid\boldsymbol{x})\right\} = -\sum\_{y \in \mathcal{Y}} p(y\mid\boldsymbol{x}) \log\_2 p(y\mid\boldsymbol{x}).\tag{7}$$

Moreover, the epistemic uncertainty is measured in terms of the mutual information between hypotheses and outcomes (i.e., the Kullback-Leibler divergence between the joint distribution of outcomes and hypotheses and the product of their marginals):

$$I(y,h) = \mathbf{E}\_{p(y,h)} \left\{ \log\_2 \left( \frac{p(y,h)}{p(y)p(h)} \right) \right\},\tag{8}$$

Finally, the aleatoric uncertainty is specified in terms of the difference between (7) and (8), which is given by

$$\mathbf{E}\_{p(h \mid \mathcal{D})} H[p(y \mid h, \boldsymbol{x})] = -\int\_{\mathcal{H}} p(h \mid \mathcal{D}) \left( \sum\_{y \in \mathcal{Y}} p(y \mid h, \boldsymbol{x}) \log\_2 p(y \mid h, \boldsymbol{x}) \right) dh \tag{9}$$

The idea underlying (9) is as follows: by fixing a hypothesis $h \in \mathcal{H}$, the epistemic uncertainty is essentially removed. Thus, the entropy $H[p(y \mid h, \boldsymbol{x})]$, i.e., the entropy of the conditional distribution on $\mathcal{Y}$ predicted by $h$ for the query instance $\boldsymbol{x}$, is a natural measure of the aleatoric uncertainty. However, since $h$ is not precisely known, aleatoric uncertainty is measured in terms of the expectation of this entropy with regard to the posterior probability $p(h \mid \mathcal{D})$.

The epistemic uncertainty (8) captures the dependency between the probability distribution on $\mathcal{Y}$ and the hypothesis $h$. Roughly speaking, (8) is high if the distribution $p(y \mid h, \boldsymbol{x})$ varies a lot for different hypotheses $h$ with high probability. This is plausible, because the existence of different hypotheses, all considered (more or less) probable but leading to quite different predictions, can indeed be seen as a sign of high epistemic uncertainty.

Obviously, (8) and (9) cannot be computed efficiently, because they involve an integration over the hypothesis space $\mathcal{H}$. One idea, therefore, is to approximate these measures by means of ensemble techniques [10], that is, to represent the posterior distribution $p(h \mid \mathcal{D})$ by a finite ensemble of hypotheses $H = \{h\_1, \ldots, h\_M\}$. An approximation of (9) can then be obtained by

$$u\_a(\mathbf{x}) := -\frac{1}{M} \sum\_{i=1}^{M} \sum\_{y \in \mathcal{Y}} p(y \mid h\_i, \mathbf{x}) \log\_2 p(y \mid h\_i, \mathbf{x}),\tag{10}$$

an approximation of (7) by

$$u\_t(\boldsymbol{x}) := -\sum\_{y \in \mathcal{Y}} \left( \frac{1}{M} \sum\_{i=1}^M p(y \mid h\_i, \boldsymbol{x}) \right) \log\_2 \left( \frac{1}{M} \sum\_{i=1}^M p(y \mid h\_i, \boldsymbol{x}) \right), \tag{11}$$

and finally an approximation of (8) by $u\_e(\boldsymbol{x}) := u\_t(\boldsymbol{x}) - u\_a(\boldsymbol{x})$.
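As a sketch in Python/NumPy, the approximations (10) and (11) and the derived epistemic uncertainty can be computed directly from the ensemble's predicted class probabilities; the function name and array layout are our own.

```python
import numpy as np

def entropy_uncertainties(probs):
    """probs: array of shape (M, K) holding p(y | h_i, x) for the M ensemble
    members and K classes. Returns (total, aleatoric, epistemic) uncertainty."""
    eps = 1e-12
    p_mean = probs.mean(axis=0)                                    # ensemble-averaged distribution
    u_t = -np.sum(p_mean * np.log2(p_mean + eps))                  # Eq. (11): entropy of the mean
    u_a = -np.mean(np.sum(probs * np.log2(probs + eps), axis=1))   # Eq. (10): mean of the entropies
    return u_t, u_a, u_t - u_a                                     # u_e = u_t - u_a
```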

#### **2.2 Measures Based on Relative Likelihood**

Another approach, put forward in [18], is based on the use of relative likelihoods, historically proposed by [1] and then justified in other settings such as possibility theory [21]. Here, we briefly recall this approach for the case of binary classification, i.e., where $\mathcal{Y} = \{0, 1\}$; see [13] for an extension to the case of multinomial classification.

Given training data $\mathcal{D} = \{(\boldsymbol{x}\_i, y\_i)\}\_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, the normalized likelihood of $h \in \mathcal{H}$ is defined as

$$\pi\_{\mathcal{H}}(h) := \frac{L(h)}{L(h^{ml})} = \frac{L(h)}{\max\_{h' \in \mathcal{H}} L(h')},\tag{12}$$

where $L(h) = \prod\_{i=1}^N p(y\_i \mid h, \boldsymbol{x}\_i)$ is the likelihood of $h$, and $h^{ml} \in \mathcal{H}$ the maximum likelihood estimate. For a given instance $\boldsymbol{x}$, the degrees of support (plausibility) of the two classes are defined as follows:

$$\pi(1\mid\boldsymbol{x}) = \sup\_{h \in \mathcal{H}} \min\left[\pi\_{\mathcal{H}}(h), p(1\mid h, \boldsymbol{x}) - p(0\mid h, \boldsymbol{x})\right],\tag{13}$$

$$\pi(0 \mid \boldsymbol{x}) = \sup\_{h \in \mathcal{H}} \min \left[ \pi\_{\mathcal{H}}(h), p(0 \mid h, \boldsymbol{x}) - p(1 \mid h, \boldsymbol{x}) \right]. \tag{14}$$

So, $\pi(1 \mid \boldsymbol{x})$ is high if and only if a highly plausible hypothesis supports the positive class much more strongly (in terms of the assigned probability) than the negative class (and $\pi(0 \mid \boldsymbol{x})$ can be interpreted analogously). Given the above degrees of support, the degrees of epistemic and aleatoric uncertainty are defined as follows:

$$u\_e(\mathbf{x}) = \min\left[\pi(1 \mid \mathbf{x}), \pi(0 \mid \mathbf{x})\right],\tag{15}$$

$$u\_a(\mathbf{x}) = 1 - \max\left[\pi(1 \mid \mathbf{x}), \pi(0 \mid \mathbf{x})\right].\tag{16}$$

Thus, epistemic uncertainty refers to the case where both the positive and the negative class appear to be plausible, while the degree of aleatoric uncertainty (16) is the degree to which none of the classes is supported. More specifically, the above measures have the following properties:

– $u\_e(\boldsymbol{x})$ will be high if class probabilities strongly vary within the set of plausible hypotheses, i.e., if we are unsure how to compare these probabilities. In particular, it will be 1 if and only if we have $h(\boldsymbol{x}) = 1$ and $h'(\boldsymbol{x}) = 0$ for two totally plausible hypotheses $h$ and $h'$;

– $u\_a(\boldsymbol{x})$ will be high if class probabilities are similar for all plausible hypotheses, i.e., if there is strong evidence that $h(\boldsymbol{x}) \approx 0.5$. In particular, it will be close to 1 if all plausible hypotheses allocate their probability mass around $h(\boldsymbol{x}) = 0.5$.

As can be seen, the measures (15) and (16) are actually quite similar in spirit to the measures (8) and (9).

## **3 Random Forests**

Our basic idea is to instantiate the (generic) uncertainty measures presented in the previous section by means of decision trees [15,16], that is, with decision trees as an underlying hypothesis space H. This idea is motivated by the fact that, firstly, decision trees can naturally be seen as probabilistic predictors [7], and secondly, they can easily be used as an ensemble in the form of a random forest—recall that ensembling is needed for the (approximate) computation of the entropy-based measures in Sect. 2.1.

### **3.1 Entropy Measures**

The approach in Sect. 2.1 can be realized with decision forests in a quite straightforward way. Let $H = \{h\_1, \ldots, h\_M\}$ be a classifier ensemble in the form of a random forest consisting of decision trees $h\_i$. Moreover, recall that a decision tree $h\_i$ partitions the instance space $\mathcal{X}$ into (rectangular) regions $R\_{i,1}, \ldots, R\_{i,L\_i}$ (i.e., $\bigcup\_{l=1}^{L\_i} R\_{i,l} = \mathcal{X}$ and $R\_{i,k} \cap R\_{i,l} = \emptyset$ for $k \neq l$) associated with corresponding leaves of the tree (each leaf node defines a region $R$). Given a query instance $\boldsymbol{x}$, the probabilistic prediction produced by the tree $h\_i$ is specified by the Laplace-corrected relative frequencies of the classes $y \in \mathcal{Y}$ in the region $R\_{i,j} \ni \boldsymbol{x}$:

$$p(y \mid h\_i, x) = \frac{n\_{i,j}(y) + 1}{n\_{i,j} + |\mathcal{Y}|},$$

where $n\_{i,j}$ is the number of training instances in the leaf node $R\_{i,j}$, and $n\_{i,j}(y)$ the number of those instances with class $y$. With probabilities estimated in this way, the uncertainty degrees (10) and (11) can directly be derived.
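As a sketch, these Laplace-corrected probabilities can be extracted from a fitted scikit-learn random forest as follows. Note that, depending on the scikit-learn version, `tree_.value` stores raw class counts or class fractions; the code normalizes and rescales by the node size to handle both cases.

```python
import numpy as np

def per_tree_laplace_probs(forest, X):
    """Per-tree Laplace-corrected class probabilities for a fitted sklearn
    RandomForestClassifier: returns an array of shape (n_trees, n_samples, n_classes)."""
    all_probs = []
    for tree in forest.estimators_:
        leaves = tree.apply(X)                # leaf node index R_{i,j} of each sample
        raw = tree.tree_.value[leaves, 0, :]  # per-leaf class statistics
        # Recover the class counts n_{i,j}(y): older scikit-learn versions store
        # counts here, newer ones store fractions, so normalize and rescale by
        # the (weighted) node size to cover both cases.
        frac = raw / raw.sum(axis=1, keepdims=True)
        counts = frac * tree.tree_.weighted_n_node_samples[leaves][:, None]
        k = counts.shape[1]                   # number of classes |Y|
        all_probs.append((counts + 1.0) / (counts.sum(axis=1, keepdims=True) + k))
    return np.asarray(all_probs)
```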

### **3.2 Measures Based on Relative Likelihood**

Instantiating the approach in Sect. 2.2 essentially means computing the degrees of support (13–14), from which everything else can easily be derived.

As already said, a decision tree partitions the instance space into several regions, each of which can be associated with a constant predictor. More specifically, in the case of binary classification, the predictor is of the form $h\_\theta$, $\theta \in \Theta = [0, 1]$, where $h\_\theta(\boldsymbol{x}) \equiv \theta$ is the (predicted) probability $p(1 \mid \boldsymbol{x} \in R)$ of the positive class in the region. If we restrict inference to a local region, the underlying hypothesis space is hence given by $\mathcal{H} = \{h\_\theta \mid 0 \leq \theta \leq 1\}$.

With $p$ and $n$ the number of positive and negative instances, respectively, within a region $R$, the likelihood and the maximum likelihood estimate of $\theta$ are respectively given by

$$L(\theta) = \binom{n+p}{p} \theta^p (1-\theta)^n \text{ and } \theta^{ml} = \frac{p}{n+p}.\tag{17}$$

Therefore, the degrees of support for the positive and negative classes are

$$\pi(1\mid\boldsymbol{x}) = \sup\_{\theta \in [0,1]} \min\left(\frac{\theta^p (1-\theta)^n}{\left(\frac{p}{n+p}\right)^p \left(\frac{n}{n+p}\right)^n}, 2\theta - 1\right),\tag{18}$$

$$\pi(0 \mid \boldsymbol{x}) = \sup\_{\theta \in [0, 1]} \min \left( \frac{\theta^p (1 - \theta)^n}{\left(\frac{p}{n + p}\right)^p \left(\frac{n}{n + p}\right)^n}, 1 - 2\theta \right). \tag{19}$$

Solving (18) and (19) comes down to maximizing a scalar function over a bounded domain, for which standard solvers can be used. From (18–19), the epistemic and aleatoric uncertainty associated with the region R can be derived according to (15) and (16), respectively. For different combinations of n and p, these uncertainty degrees can be pre-computed.
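For illustration, the following sketch solves (18) and (19) with a bounded scalar optimizer from SciPy and derives (15) and (16); the function names are ours, and numerical edge cases (e.g. empty regions) are ignored.

```python
from scipy.optimize import minimize_scalar

def region_uncertainties(n, p):
    """Epistemic/aleatoric uncertainty (15)-(16) of a leaf region with
    n negative and p positive training instances, via Eqs. (18)-(19)."""
    theta_ml = p / (n + p)
    norm = theta_ml**p * (1.0 - theta_ml)**n           # maximum of theta^p (1-theta)^n

    def neg_objective(theta, sign):
        pi_h = theta**p * (1.0 - theta)**n / norm      # normalized likelihood of h_theta
        support = sign * (2.0 * theta - 1.0)           # p(1|h,x) - p(0|h,x), or its negative
        return -min(pi_h, support)                     # negate: we maximize via a minimizer

    pi_1 = -minimize_scalar(neg_objective, bounds=(0.0, 1.0), args=(+1.0,), method='bounded').fun
    pi_0 = -minimize_scalar(neg_objective, bounds=(0.0, 1.0), args=(-1.0,), method='bounded').fun
    return min(pi_1, pi_0), 1.0 - max(pi_1, pi_0)      # (u_e, u_a) per Eqs. (15)-(16)
```

As noted above, these values depend only on $(n, p)$, so they can be pre-computed and cached for all count combinations occurring in the forest.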

Note that, for this approach, the uncertainty degrees (15) and (16) can be obtained for a single tree. To leverage the ensemble H, we average both uncertainties over all trees in the random forest.

### **4 Experiments**

The empirical evaluation of methods for quantifying uncertainty is a non-trivial problem. In fact, unlike for the prediction of a target variable, the data normally does not contain information about any sort of "ground truth" uncertainty. What is often done, therefore, is to evaluate predicted uncertainties *indirectly*, that is, by assessing their usefulness for improved prediction and decision making. Adopting an approach of that kind, we produced *accuracy-rejection curves*, which depict the accuracy of a predictor as a function of the percentage of rejections [5]: a classifier which is allowed to abstain on a certain percentage $p$ of predictions will predict on those $(1-p)\%$ on which it feels most certain. If it is able to quantify its own uncertainty well, it should improve its accuracy with increasing $p$; hence the accuracy-rejection curve should be monotonically increasing (unlike the flat curve obtained for random abstention).

#### **4.1 Implementation Details**

For this work, we used the RandomForestClassifier from scikit-learn. The number of trees in the forest is set to 50, with the maximum tree depth set to 10. We use bootstrapping to create diversity between the trees of the forest.
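In scikit-learn terms, this setup corresponds roughly to the following; the synthetic data is only a stand-in for the UCI datasets used in the results below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the paper uses the UCI spect and diabetes datasets.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 50 trees, maximum depth 10, bootstrap sampling for diversity between trees.
forest = RandomForestClassifier(n_estimators=50, max_depth=10, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
```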

As a baseline to compare with, we used the DropConnect model for deep neural networks as introduced in [10]. The idea of DropConnect is similar to Dropout, but instead of randomly dropping neurons, connections between neurons are randomly dropped. In this model, dropping connections remains active in the test phase. In this way, the data passes through a different network on each forward pass, and therefore we can compute Monte Carlo samples for each query instance. The DropConnect model is a feed-forward neural network consisting of two DropConnect layers with 32 neurons each and a final softmax layer for the output. The model is trained for 20 epochs with a mini-batch size of 32. After training, we take 50 Monte Carlo samples to create an ensemble, from which the uncertainty values can be calculated.

### **4.2 Results**

Due to space limitations, we show results in the form of accuracy-rejection curves for only two exemplary data sets from the UCI repository<sup>1</sup>, spect and diabetes; yet, very similar results were obtained for other data sets. The data is randomly split into 70% for training and 30% for testing, and accuracy-rejection curves are computed on the latter (the curves shown are averages over 100 repetitions). In the following, we abbreviate the aleatoric and epistemic uncertainty degrees produced by the entropy-based approach (Sect. 2.1) and the approach based on relative likelihood (Sect. 2.2) as AU-ent, EU-ent, AU-rl, and EU-rl, respectively.
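One possible implementation of such accuracy-rejection curves is sketched below; the rejection grid and the helper name are our own choices.

```python
import numpy as np

def accuracy_rejection_curve(y_true, y_pred, uncertainty, steps=20):
    """Accuracy on the retained (most certain) samples for growing rejection rates."""
    order = np.argsort(uncertainty)  # most certain predictions first
    correct = (np.asarray(y_true)[order] == np.asarray(y_pred)[order]).astype(float)
    n = len(correct)
    rates = np.linspace(0.0, 0.9, steps)  # reject between 0% and 90% of predictions
    accs = np.array([correct[: max(1, int(np.ceil((1 - p) * n)))].mean() for p in rates])
    return rates, accs
```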

**Fig. 2.** Accuracy-rejection curves for aleatoric (above) and epistemic (below) uncertainty using random forests. The curve for random rejection is included as a baseline.

<sup>1</sup> https://archive.ics.uci.edu/ml/datasets/.

As can be seen from Figs. 2, 3 and 4, both approaches to measuring uncertainty are effective in the sense of producing monotonically increasing accuracy-rejection curves, and on the data sets we analyzed so far, we could not detect any systematic differences in performance. Besides, rejection seems to work well on the basis of both criteria, aleatoric as well as epistemic uncertainty. This is plausible, since both provide a learner with reasonable grounds to abstain from a prediction. Likewise, there are no big differences between random forests and neural networks, showing that the former are indeed a viable alternative to the latter—this was actually a major concern of our study.

**Fig. 3.** Scatter plot for test set on diabetes data, showing the relationship between the uncertainty degrees (aleatoric left, epistemic right) estimated by the two approaches.

**Fig. 4.** Comparison between random forests and neural networks (DropConnect) for aleatoric (above) and epistemic (below) uncertainty in the entropy-based approach.

## **5 Conclusion**

The distinction between aleatoric and epistemic uncertainty has recently received a lot of attention in machine learning, especially in the deep learning community [6]. Roughly speaking, the approaches in deep learning are either based on the idea of equipping networks with a probabilistic component, like in Bayesian deep learning [11], or on using ensemble techniques [8], which can be implemented (indirectly) through techniques such as Dropout [3] or DropConnect. The main purpose of this paper was to show that the use of decision trees and random forests is an interesting alternative to neural networks.

Indeed, as we have shown, the basic ideas underlying the estimation of aleatoric and epistemic uncertainty can be realized with random forests in a very natural way. In a sense, they even appear to be simpler and more flexible than neural networks. For example, while the approach based on relative likelihood (Sect. 2.2) could be realized efficiently for random forests, a neural network implementation is far from obvious (and was therefore not included in the experiments).

There are various directions for future work. For example, since the hyperparameters of random forests have an influence on the hypothesis space we are (indirectly) working with, they also influence the estimation of uncertainty degrees. This relationship calls for a thorough investigation. Besides, going beyond a proof of principle with statistics such as accuracy-rejection curves, it would be interesting to make use of uncertainty quantification with random forests in applications such as active learning, as recently proposed in [12].

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Master Your Metrics with Calibration**

Wissam Siblini(B), Jordan Fréry, Liyun He-Guelton, Frédéric Oblé, and Yi-Qing Wang

Worldline, 53 avenue Paul Krüger, 69100 Villeurbanne, France {wissam.siblini,jordan.frery,liyun.he-guelton,frederic.oble,yi-qing.wang}@worldline.com

**Abstract.** Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Curve of Precision Recall). Heavily dependent on the class prior, such metrics make it difficult to interpret the variation of a model's performance over different subpopulations/subperiods in a dataset. In this paper, we propose a way to calibrate the metrics so that they can be made invariant to the prior. We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use-cases where calibration is beneficial, such as model monitoring in production, reporting, or fairness evaluation.

**Keywords:** Performance metrics · Class imbalance · Precision-recall

## **1 Introduction**

In real-world machine learning systems, the predictive performance of a model is often evaluated on multiple datasets, and comparisons are made. These datasets can correspond to sub-populations in the data, or to different periods in time [15]. Choosing the best-suited metric is not a trivial task. Some metrics may prevent a proper interpretation of the performance differences between the sets [8,14], especially because different datasets generally not only have a different likelihood $\mathbf{P}(x|y)$ but also a different class prior $\mathbf{P}(y)$. A metric dependent on the prior (e.g. precision) will be affected by both differences indiscernibly [3], but a practitioner could be interested in isolating the variation of performance due to the likelihood, which reflects the model's intrinsic performance (see illustration in Fig. 1). Take the example of comparing the performance of a model across time periods: at time $t$, we receive data drawn from $\mathbf{P}\_t(x, y) = \mathbf{P}\_t(x|y)\mathbf{P}\_t(y)$, where $x$ are the features and $y$ the label. Hence the optimal scoring function (i.e. model) for this dataset is the likelihood ratio [11]:

$$s\_t(x) := \frac{\mathbb{P}\_t(x|y=1)}{\mathbb{P}\_t(x|y=0)}\tag{1}$$

In particular, if $\mathbf{P}\_t(x|y)$ does not vary with time, neither does $s\_t(x)$. In this case, even if the prior $\mathbf{P}\_t(y)$ varies, it is desirable to have a performance metric $M(\cdot)$ satisfying $M(s\_t, \mathbf{P}\_t) = M(s\_{t+1}, \mathbf{P}\_{t+1})\ \forall t$, so that the model maintains the same metric value over time. That being said, this does not mean that dependence on the prior is an intrinsically bad behavior. Some applications seek this property, as it reflects part of the difficulty of classifying on a given dataset (e.g. the performance of the random classifier evaluated with a prior-dependent metric is more or less high depending on the skew of the dataset).

**Fig. 1.** Evolution of the AUC-PR of a fraud detection system and of the fraud ratio ($\pi$, i.e. the empirical $\mathbf{P}\_t(y)$) over time. Both decrease but, as the AUC-PR depends on the prior, it does not allow telling whether the performance variation is only due to the variation of $\pi$ or whether there was a drift in $\mathbf{P}\_t(x|y)$

In binary classification, researchers often rely on the AUC-ROC (Area Under the Curve of Receiver Operating Characteristic) to measure a classifier's performance [6,9]. While this metric has the advantage of being invariant to the class prior, many real-world applications, especially when data are imbalanced, have recently begun to favor precision-based metrics such as AUC-PR and F-Score [12,13]. The reason is that AUC-ROC gives false positives too little importance [5], although the latter strongly deteriorate user experience and waste human effort with false alerts. Indeed, AUC-ROC considers a trade-off between TPR and FPR, whereas AUC-PR/F1-score consider a trade-off between TPR (Recall) and Precision. With a closer look, the difference boils down to the fact that AUC-ROC normalizes the number of false positives with respect to the number of true negatives, whereas precision-based metrics normalize it with respect to the number of true positives. In highly imbalanced scenarios (e.g. fraud/disease detection), the former is much more likely than the latter because negative examples are in the large majority.

Precision-based metrics give false positives more importance, but they are tied to the class prior [2,3]. A redefinition of precision and recall into precision gain and recall gain has recently been proposed to correct several drawbacks of AUC-PR [7]. But, while the resulting AUC-PR Gain has some advantages of the AUC-ROC, such as the validity of linear interpolation between points, it remains dependent on the class prior. Our study aims at providing metrics (i) that are precision-based, to tackle problems where the class of interest is highly under-represented, and (ii) that can be made independent of the prior for comparison purposes (e.g. monitoring the evolution of the performance of a classifier across several time periods). To reach this objective, this paper provides: (1) A formulation of calibration for precision-based metrics. It computes the value of precision as if the ratio $\pi$ of the test set were equal to a reference class ratio $\pi\_0$. We give theoretical arguments to explain why this allows invariance to the class prior. We also provide a calibrated version of precision gain and recall gain [7]. (2) An empirical analysis on both synthetic and real-world data to confirm our claims and show that the new metrics are still able to assess the model's performance and are easier to interpret. (3) Large-scale experiments on 614 datasets using OpenML [16] to (a) give more insights on correlations between popular metrics by analyzing how they rank models, and (b) explore the links between the calibrated metrics and the regular ones.

Not only does calibration solve the issue of dependence on the prior, but it also allows, through the parameter $\pi\_0$, anticipating a different ratio and controlling what the metric precisely reflects. This new property has several practical interests (e.g. for development, reporting, analysis), and we discuss them in realistic use-cases in Sect. 5.

## **2 Popular Metrics for Binary Classification: Advantages and Limits**

We consider a usual binary classification setting where a model has been trained and its performance is evaluated on a test dataset of $N$ instances. $y\_i \in \{0, 1\}$ is the ground-truth label of the $i$-th instance and is equal to 1 (resp. 0) if the instance belongs to the positive (resp. negative) class. The model provides $s\_i \in \mathbf{R}$, a score for the $i$-th instance to belong to the positive class. For a given threshold $\tau \in \mathbf{R}$, the predicted label is $\hat{y}\_i = 1$ if $s\_i > \tau$ and 0 otherwise. Predictive performance is generally measured using the number of true positives ($TP = \sum\_{i=1}^{N} \mathbf{1}(\hat{y}\_i = 1, y\_i = 1)$), true negatives ($TN = \sum\_{i=1}^{N} \mathbf{1}(\hat{y}\_i = 0, y\_i = 0)$), false positives ($FP = \sum\_{i=1}^{N} \mathbf{1}(\hat{y}\_i = 1, y\_i = 0)$) and false negatives ($FN = \sum\_{i=1}^{N} \mathbf{1}(\hat{y}\_i = 0, y\_i = 1)$). One can compute relevant ratios such as the True Positive Rate ($TPR$), also referred to as the Recall ($Rec = \frac{TP}{TP + FN}$), the False Positive Rate ($FPR = \frac{FP}{TN + FP}$), also referred to as the Fall-out, and the Precision ($Prec = \frac{TP}{TP + FP}$). As these ratios are biased towards a specific type of error and can easily be manipulated with the threshold, more complex metrics have been proposed. In this paper, we discuss the most popular ones, which have been widely adopted in binary classification: F1-Score, AUC-ROC, AUC-PR and AUC-PR Gain. F1-Score is the harmonic mean of $Prec$ and $Rec$:

$$F\_1 = \frac{2 \ast Prec \ast Rec}{Prec + Rec}.\tag{2}$$

The three other metrics consider every threshold $\tau$ from the highest $s\_i$ to the lowest. For each one, they compute $TP$, $FP$, $TN$ and $FN$. Then, they plot one ratio against another and compute the Area Under the Curve (Fig. 2). AUC-ROC considers the Receiver Operating Characteristic curve, where $TPR$ is plotted against $FPR$. AUC-PR considers the Precision vs. Recall curve. Finally, in AUC-PR Gain, the precision gain ($Prec\_G$) is plotted against the recall gain ($Rec\_G$). They are defined in [7] as follows ($\pi = \frac{\sum\_{i=1}^{N} y\_i}{N}$ is the positive class ratio, and we always consider it to be the minority class in this paper):

$$Prec\_G = \frac{Prec - \pi}{(1 - \pi)Prec} \tag{3}$$

$$Rec\_G = \frac{Rec - \pi}{(1 - \pi)Rec} \tag{4}$$
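For reference, the standard metrics can be computed with scikit-learn, and the gain versions follow directly from Eqs. (3) and (4). Note that `average_precision_score` is one common estimator of AUC-PR, not necessarily the exact variant referred to above, and the toy data below is purely illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def prec_gain(prec, pi):   # Eq. (3)
    return (prec - pi) / ((1 - pi) * prec)

def rec_gain(rec, pi):     # Eq. (4)
    return (rec - pi) / ((1 - pi) * rec)

y = np.array([1, 0, 0, 1, 0, 0, 0, 1])                  # toy ground-truth labels
s = np.array([0.9, 0.4, 0.3, 0.8, 0.6, 0.1, 0.2, 0.7])  # toy scores
print(roc_auc_score(y, s))                              # AUC-ROC
print(average_precision_score(y, s))                    # one common estimator of AUC-PR
print(f1_score(y, (s > 0.5).astype(int)))               # F1 at threshold 0.5
```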

**Fig. 2.** ROC, PR and PR gain curves for the same model evaluated on an extremely imbalanced test set from a fraud detection application (π = 0.003, in the top row) and on a balanced sample (π = 0.5, in the bottom row).

PR Gain enjoys many properties of the ROC that the regular PR analysis does not (e.g. the validity of linear interpolations or the existence of universal baselines) [7]. However, AUC-PR Gain becomes hardly usable in extremely imbalanced settings. In particular, we can derive from (3) and (4) that $Prec\_G$ and $Rec\_G$ will be mostly close to 1 if $\pi$ is close to 0 (see top-right chart in Fig. 2).

**Fig. 3.** Illustration of the impact of π on precision, recall, and the false positive rate. Instances are ordered from left to right according to their score given by the model. The threshold is illustrated as a vertical line between the instances: those on the left (resp. right) are classified as positive (resp. negative)

As explained in the introduction, precision-based metrics (F1, AUC-PR) are better suited than AUC-ROC for problems with class imbalance. On the other hand, only AUC-ROC is invariant to the positive class ratio. Indeed, $FPR$ and $Rec$ are both unrelated to the class ratio because they each focus on a single class, but this is not the case for $Prec$. Its dependency on the positive class ratio $\pi$ is illustrated in Fig. 3: when comparing a case (i) with a given ratio $\pi$ and another case (ii) where a randomly selected half of the positive examples has been removed, one can visually understand that both the recall and the false positive rate are the same, but the precision is lower in the second case.

## **3 Calibrated Metrics**

We seek a metric that is based on $Prec$, to tackle problems where data are imbalanced and the minority (positive) class is the one of interest, but we want it to be invariant w.r.t. the class prior, so that its variation across different datasets (e.g. different time periods) can be interpreted. To obtain such a metric, we modify those based on $Prec$ (AUC-PR, F1-Score and AUC-PR Gain) to make them independent of the positive class ratio $\pi$.

### **3.1 Calibration**

The idea is to fix a reference ratio $\pi\_0$ and to weigh the counts of TP and FP in order to calibrate them to the values they would have if $\pi$ were equal to $\pi\_0$. $\pi\_0$ can be chosen arbitrarily (e.g. 0.5 for balanced) but it is preferable to fix it according to the task at hand (we analyze the impact of $\pi\_0$ in Sect. 4 and describe simple guidelines for fixing it in Sect. 5).

If the positive class ratio is $\pi_0$ instead of $\pi$, the ratio between negative and positive examples is multiplied by $\frac{\pi(1 - \pi_0)}{\pi_0(1 - \pi)}$. In this case, we expect the ratio between false positives and true positives to be multiplied by the same factor. Therefore, we define the calibrated precision $Prec_c$ as follows:

$$Prec\_c = \frac{\text{TP}}{\text{TP} + \frac{\pi(1-\pi\_0)}{\pi\_0(1-\pi)}\text{FP}} = \frac{1}{1 + \frac{\pi(1-\pi\_0)}{\pi\_0(1-\pi)}\frac{\text{FP}}{\text{TP}}} \tag{5}$$

Since $\frac{1 - \pi}{\pi}$ is the imbalance ratio $\frac{N^-}{N^+}$, where $N^+$ (resp. $N^-$) is the number of positive (resp. negative) examples, we have: $\frac{\pi}{1 - \pi} \frac{FP}{TP} = \frac{FP/N^-}{TP/N^+} = \frac{FPR}{TPR}$, which is independent of $\pi$.

Based on the calibrated precision, we can also define the calibrated F1-score, the calibrated $PrecG$ and the calibrated $RecG$ by replacing $Prec$ by $Prec_c$ and $\pi$ by $\pi_0$ in Eqs. (2), (3) and (4). Note that calibration does not change precision gain. Indeed, the calibrated precision gain $\frac{Prec_c - \pi_0}{(1 - \pi_0)Prec_c}$ can be rewritten as $\frac{Prec - \pi}{(1 - \pi)Prec}$, which is equal to the regular precision gain. Also, the interesting properties of the recall gain were proved independently of the ratio $\pi$ in [7], which means that calibration preserves them.
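Eq. (5) and the calibrated F1-score translate directly into code. The sketch below uses our own function names and assumes the counts TP, FP, FN and the ratios $\pi$, $\pi_0$ are given:

```python
def calibrated_precision(tp, fp, pi, pi0):
    """Calibrated precision, Eq. (5): precision rescaled to a reference ratio pi0."""
    w = (pi * (1 - pi0)) / (pi0 * (1 - pi))   # weight applied to false positives
    return tp / (tp + w * fp)

def calibrated_f1(tp, fp, fn, pi, pi0):
    """Calibrated F1: Eq. (2) with Prec replaced by the calibrated precision."""
    prec_c = calibrated_precision(tp, fp, pi, pi0)
    rec = tp / (tp + fn)                      # recall is unaffected by calibration
    return 2 * prec_c * rec / (prec_c + rec)
```

Note that for $\pi = \pi_0$ the weight equals 1 and both functions reduce to the regular metrics.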

### **3.2 Robustness to Variations in** *π*

In order to evaluate the robustness of the new metrics to variations in π, we create a synthetic dataset where the label is drawn from a Bernoulli distribution with parameter π and the feature is drawn from Normal distributions:

$$p(x|y=1; \mu\_1) = \mathcal{N}(x; \mu\_1, 1), \qquad p(x|y=0; \mu\_0) = \mathcal{N}(x; \mu\_0, 1) \tag{6}$$

**Fig. 4.** Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions (AUC-PR*c*, AUC-PR Gain*c*, F1-score*c*) as $\pi$ decreases. We arbitrarily set $\pi_0 = 0.5$ for the calibrated metrics. The curves are obtained by averaging results over 30 runs and we show the confidence intervals.

For several values of $\pi$, data points are generated from (6) with $\mu_1 = 2$ and $\mu_0 = 1.8$. We consider a large number of points ($10^6$) so that the empirical class ratio is approximately equal to the Bernoulli parameter $\pi$. We empirically study the evolution of several metrics (F1-score, AUC-PR, AUC-PR Gain and their calibrated versions) for the optimal model (as defined in (1)) as $\pi$ decreases from $\pi = 0.5$ (balanced) to $\pi = 0.001$. We observe that the impact of the class prior on the regular metrics is important (Fig. 4). This can be a serious issue for applications where $\pi$ sometimes varies by one order of magnitude from one day to another (see [4] for a real-world example), as it leads to a significant variation of the measured performance (see the difference between AUC-PR when $\pi = 0.5$ and when $\pi = 0.05$) even if the optimal model remains the same. On the contrary, the calibrated versions remain very robust to changes in the class prior $\pi$, even for extreme values. Note that we experiment here with synthetic data to have full control over the distribution/prior and make the analysis easier, but the conclusions are exactly the same on real-world data.<sup>1</sup>

<sup>1</sup> See appendix in https://figshare.com/articles/Calibrated_metrics_IDA_Supplementary_material_pdf/11848146.
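The dependence of the regular AUC-PR on $\pi$ is easy to reproduce in outline. The sketch below (our own code, with a smaller sample size than the paper's $10^6$ points) draws data from (6), uses the feature itself as the score (which is monotone in the likelihood ratio for this Gaussian setting), and prints the regular AUC-PR for several priors:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)

def sample(pi, n=100_000, mu1=2.0, mu0=1.8):
    """Synthetic data of Sect. 3.2: Bernoulli(pi) labels, Gaussian feature (Eq. 6)."""
    y = (rng.random(n) < pi).astype(int)
    x = np.where(y == 1, rng.normal(mu1, 1, n), rng.normal(mu0, 1, n))
    return x, y

for pi in [0.5, 0.05, 0.005]:
    x, y = sample(pi)
    prec, rec, _ = precision_recall_curve(y, x)
    print(pi, auc(rec, prec))   # regular AUC-PR shrinks as pi decreases
```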

### **3.3 Assessment of the Model Quality**

Besides the robustness of the calibrated metrics to changes in $\pi$, we also want them to be sensitive to the quality of the model. If the latter decreases, regardless of the value of $\pi$, we expect all metrics, calibrated ones included, to decrease in value. Let us consider an experiment where we use the same synthetic dataset as defined in the previous section. However, instead of changing the value of $\pi$ only, we change $(\mu_1, \mu_0)$ to make the problem harder and harder and thus worsen the optimal model's performance. This can be done by reducing the distance between the two normal distributions in (6), because this results in more overlap between the classes and makes it harder to discriminate between them. As a distance, we consider the KL-divergence, which boils down to $\frac{1}{2}(\mu_1 - \mu_0)^2$.

**Fig. 5.** Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated version as KL(p1, p0) tends to 0 and as π randomly varies. This curve was obtained by averaging results over 30 runs.

Figure 5 shows how the values of the metrics evolve as the KL-divergence gets closer to zero. For each run, we randomly chose the prior $\pi$ in the interval $[0.001, 0.5]$. As expected, all metrics globally decrease as the problem gets harder. However, we can notice an important difference: the variations of the calibrated metrics are smooth and monotonic, whereas the original metrics are affected by the random changes in $\pi$. In that sense, variations of the calibrated metrics across the different generated datasets are much easier to interpret than those of the original metrics.

## **4 Link Between Calibrated and Original Metrics**

### **4.1 Meaning of** *π***<sup>0</sup>**

Let us first remark that for test datasets in which $\pi = \pi_0$, $Prec_c$ is equal to the regular precision $Prec$ since $\frac{\pi(1 - \pi_0)}{\pi_0(1 - \pi)} = 1$ (this is observable in Fig. 4 at the intersection of the metrics for $\pi = \pi_0 = 0.5$).

**Fig. 6.** Comparison between heuristic-based calibrated AUC-PR (red line) and our closed-form calibrated AUC-PR (blue dots). The red shadow represents the standard deviation of the heuristic-based calibrated AUC-PR over 1000 runs. (Color figure online)

If $\pi \neq \pi_0$, the calibrated metrics essentially have the value that the original ones would have if the positive class ratio $\pi$ were equal to $\pi_0$. To further demonstrate that, we compare our proposal for calibration (5) with the only previous proposal [10] designed for the same objective: a heuristic-based calibration. The approach from [10] consists in randomly undersampling the test set to make the positive class ratio $\pi$ equal to a chosen ratio (let us refer to it as $\pi_0$ for the analogy) and then computing the regular metrics on the sampled set. Because of the randomness, sampling may remove more hard examples than easy ones, so the performance can be over-estimated, and vice versa. To avoid that, the approach performs several runs and computes a mean estimate. In Fig. 6, we compare the results obtained with our formula and with their heuristic, for several reference ratios $\pi_0$, on a highly unbalanced ($\pi = 0.0017$) credit card fraud detection dataset available on Kaggle [4].

We can observe that our formula and the heuristic provide very close values. This can be explained theoretically (see Footnote 1) and confirms that our formula computes the value that the original metric would have if the ratio $\pi$ in the test set were $\pi_0$. Note that our closed-form calibration (5) can be seen as an improvement of the heuristic-based calibration from [10], as it directly provides the targeted value without running a costly Monte Carlo simulation.
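For comparison, a single run of the undersampling heuristic of [10] can be sketched as follows (our own code; it assumes $\pi < \pi_0$ so that all positives are kept, and it must be repeated and averaged over many runs as described above):

```python
import numpy as np

def undersample_to_ratio(scores, labels, pi0, rng):
    """One run of the heuristic from [10]: subsample the negatives so that the
    positive class ratio of the resulting test set becomes pi0."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_neg = int(len(pos) * (1 - pi0) / pi0)   # negatives needed for ratio pi0
    keep = np.concatenate(
        [pos, rng.choice(neg, size=min(n_neg, len(neg)), replace=False)])
    return scores[keep], labels[keep]
```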

### **4.2 Do the Calibrated Metrics Rank Models in the Same Order as the Original Metrics?**

Calibration results in evaluating the metric for a different prior. In this section, we analyze how this impacts the task of selecting the best model for a given dataset. To do this, we empirically analyze the correlation of several metrics in terms of model ordering. We use OpenML [16] to select the 614 supervised binary classification datasets on which at least 30 models have been evaluated with a 10-fold cross-validation. For each one, we randomly choose 30 models, fetch their predictions, and evaluate their performance with the metrics. This leaves us with 614 × 30 = 18,420 different values for each metric. To analyze whether they rank the models in the same order, we compute the Spearman rank correlation coefficient between them for the 30 models of each of the 614 problems.<sup>2</sup> Most datasets have roughly balanced classes ($\pi > 0.2$ in more than 90% of the datasets). Therefore, to also specifically analyze the imbalanced case, we run the same experiment on the subset of 4 highly imbalanced datasets ($\pi < 0.01$). The compared metrics are AUC-ROC, AUC-PR, AUC-PR Gain and the best F1-score over all possible thresholds. We also add the calibrated versions of the last three. In order to understand the impact of $\pi_0$, we use two different values: the arbitrary $\pi_0 = 0.5$ and another value $\pi_0 \approx \pi$ (for the first experiment with all datasets, $\pi_0 \approx \pi$ corresponds to $\pi_0 = 1.01\pi$; for the second experiment, where $\pi$ is very small, we go further and $\pi_0 \approx \pi$ corresponds to $\pi_0 = 10\pi$, which remains closer to $\pi$ than 0.5). The obtained correlation matrices are shown in Fig. 7. Each individual cell corresponds to the average Spearman correlation over all datasets between the row metric and the column metric.
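The rank-correlation computation itself is a few lines with SciPy. A sketch, under the assumption that each metric's values are stored as an array of shape (datasets × models):

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlations(metric_values):
    """metric_values: dict metric_name -> array of shape (n_datasets, n_models).
    Returns the average Spearman correlation for each pair of metrics."""
    out = {}
    for a, b in combinations(metric_values, 2):
        rhos = [spearmanr(metric_values[a][d], metric_values[b][d]).correlation
                for d in range(len(metric_values[a]))]
        out[(a, b)] = float(np.mean(rhos))   # one cell of the matrix in Fig. 7
    return out
```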

**Fig. 7.** Spearman rank correlation matrices between 10 metrics over 614 datasets for the left figure and the 4 highly imbalanced datasets for the right figure.

A general observation is that most metrics are less correlated with each other when classes are unbalanced (right matrix in Fig. 7). We also note that the best F1-score is more correlated to AUC-PR than to AUC-ROC or AUC-PR Gain. In the balanced case (left matrix in Fig. 7), we can see that metrics defined as areas under curves are generally more correlated with each other than with the threshold-sensitive classification metric F1-score. Let us now analyze the impact of calibration. As expected, when $\pi_0 \approx \pi$, calibrated metrics behave very similarly to the original metrics because $\frac{\pi(1 - \pi_0)}{\pi_0(1 - \pi)} \approx 1$ and therefore

<sup>2</sup> The implementation of the paper experiments can be found at https://github.com/wissam-sib/calibrated_metrics.

$Prec_c \approx Prec$. In the balanced case (left), since $\pi$ is close to 0.5, calibrated metrics with $\pi_0 = 0.5$ are also highly correlated with the original metrics. In the imbalanced case (right matrix of Fig. 7), when $\pi_0$ is arbitrarily set to 0.5, the calibrated metrics have a low correlation with the original ones. In fact, they are less correlated with them than with AUC-ROC. This makes sense given the relative weights that each metric applies to FP and TP. The original precision gives the same weight to TP and FP, although false positives are $\frac{1 - \pi}{\pi}$ times more likely to occur ($\frac{1 - \pi}{\pi} > 100$ if $\pi < 0.01$). The calibrated precision with the arbitrary value $\pi_0 = 0.5$ boils down to $\frac{TP}{TP + \frac{\pi}{1 - \pi} FP}$ and gives a weight $\frac{1 - \pi}{\pi}$ times smaller to false positives, which counterbalances their higher likelihood. ROC, like the calibrated metrics with $\pi_0 = 0.5$, gives $\frac{1 - \pi}{\pi}$ times less weight to FP because it is computed from FPR and TPR, which are linked to TP and FP through the relationship $\frac{\pi}{1 - \pi} \frac{FP}{TP} = \frac{FPR}{TPR}$.

To sum up the results, we first emphasize that the choice of the metric to rank classifiers seems to be much less sensitive when datasets are rather balanced than in the extremely imbalanced case. In the balanced case the least correlated metrics have an average rank correlation of 0.81. For the imbalanced datasets, on the other hand, many metrics have low correlations, which means that they often disagree on the best model. The choice of the metric is therefore very important here. Our experiment also seems to reflect that rank correlations are mainly a matter of how much weight is given to each type of error. Choosing these "weights" generally depends on the application at hand, and this should be remembered when using calibration. To preserve the nature of a given metric, $\pi_0$ has to be fixed to a value close to $\pi$ and not arbitrarily. The user still has the choice to fix it to another value if the purpose is specifically to place the results in a different reference setting with a different prior.

## **5 Guidelines and Use-Cases**

Calibration could benefit ML practitioners when analyzing the performance of a model across different datasets/time periods. Without being exhaustive, we give four use-cases where it is beneficial (setting $\pi_0$ depends on the target use-case):

**Comparing the Performance of a Model on Two Populations/Classes:** Consider a practitioner who wants to predict which patients have a disease and to evaluate the performance of the model on subpopulations of the dataset (e.g. children, adults and elderly people). If the prior differs from one population to another (e.g. elderly people are more likely to have the disease), precision will be affected: a population with a higher disease ratio will be more likely to have a higher precision. In this case, the calibrated precision can be used to obtain the precision of each population with respect to the same reference prior (for instance, $\pi_0$ can be chosen as the average prior over all populations). This provides an additional, balanced point of view, makes the analysis richer, and allows more precise conclusions to be drawn, perhaps also helping to study fairness [1].

**Model Performance Monitoring in an Industrial Context:** In systems where a model's performance is monitored over time with precision-based metrics like the F1-score, using calibration in addition to the regular metrics makes the evolution easier to understand, especially when the class prior can evolve (cf. application in Fig. 1). For instance, it can be useful to analyze drift (i.e. distinguish between variations linked to $\pi$ or to $P(X|y)$) and design adapted solutions: either updating the threshold or completely retraining the model. To avoid denaturing the F1-score too much, $\pi_0$ has to be fixed here based on realistic values (e.g. the average $\pi$ in historical data).

**Establishing Agreements with Clients:** As shown in the previous sections, $\pi_0$ can be interpreted as the ratio to which we refer when computing the metric. This can be useful to establish a guarantee, in an agreement, that will be robust to uncontrollable events. Indeed, if we take the case of fraud detection, the real positive class ratio $\pi$ can vary extremely from one day to another and on particular events (e.g. fraudster attacks, holidays), which significantly affects the measured metrics (see Fig. 4). Here, after both parties have agreed beforehand on a reasonable value for $\pi_0$ (based on their business knowledge), calibration will always compute the performance relative to this ratio and not to the real $\pi$, and will thus be easier to guarantee.

**Anticipating the Deployment of a Model in Production:** Imagine one collects a sample of data to develop an algorithm and reaches an AUC-PR deemed acceptable for production. If the prior in the collected data is different from reality, the non-calibrated metric might give either a pessimistic or an optimistic estimate of the post-deployment performance. This can be extremely harmful if production has strict constraints. Here, if the practitioner uses calibration with $\pi_0$ equal to the minimal prior envisioned for the application at hand, he/she would be able to anticipate the worst-case scenario.

## **6 Conclusion**

In this paper, we provided a formula for calibration, empirical results, and guidelines to make the values of metrics across different datasets more interpretable. Calibrated metrics are a generalization of the original ones. They rely on a reference ratio $\pi_0$ and compute the value that we would obtain if the positive class ratio $\pi$ in the evaluated test set were equal to $\pi_0$. If the user chooses $\pi_0 = \pi$, nothing changes and the regular metrics are retrieved. But, with different choices, the metrics can serve several purposes, such as obtaining robustness to variations of the class prior across datasets, or anticipation. They are useful in both academic and industrial applications as explained in the previous section: they help draw more accurate comparisons between subpopulations, or study incremental learning on streams by providing a point of view agnostic to virtual concept drift [17]. They can be used to provide more controllable performance indicators (easier to guarantee and report), help prepare deployment in production, and prevent false conclusions about the evolution of a deployed model. However, $\pi_0$ has to be chosen with caution as it controls the relative weights given to FP and TP and, consequently, can affect the selection of the best classifier.

## **References**



## **Supervised Phrase-Boundary Embeddings**

Manni Singh(B), David Weston, and Mark Levene

Department of Computer Science and Information Systems, Birkbeck, University of London, London WC1E 7HX, UK {manni,dweston,mark}@dcs.bbk.ac.uk

**Abstract.** We propose a new word embedding model, called SPhrase, that incorporates supervised phrase information. Our method modifies traditional word embeddings by ensuring that all target words in a phrase have exactly the same context. We demonstrate that including this information within a context window produces superior embeddings for both intrinsic evaluation tasks and downstream extrinsic tasks.

**Keywords:** Phrase embeddings · Named entity recognition · Natural language processing

## **1 Introduction**

Word embeddings represent words with multidimensional vectors that are used in various models for applications such as named entity recognition [9], query expansion [13], and sentiment analysis [21]. These embeddings are usually generated from a huge corpus with unsupervised learning models [3,16,18,23,24]. These models describe target words by their neighbouring words, which are considered as contexts. The selection of these context words is generally linear (i.e. *n* words surrounding the target). Alternatively, arbitrary context words were used in [16], where context selection is based on the syntactic dependencies of the target word.

These models treat words as lexical units and create a context window surrounding a target word. This approach can be problematic when the context window for a target word contains only part of a phrase. For example, consider a scenario where a target word is close to (and to the right of) the named entity "George W. Bush" but the context window only retains the word "George". Clearly this will generate ambiguity, as the independent word "George" may refer to another person (George Washington), a location (George Street, Oxford) or a music band (George). To deal with this issue, [19] used a data-driven approach to identify and treat these phrases as individual tokens. While this technique may learn a phrase representation, it cannot learn a representation of the individual words that comprise the phrase.

In our approach we obtain phrase information directly from Wikipedia. Terms from Wikipedia articles are formatted as hyperlinks to relevant articles. In a related method [22] these terms are extracted as named entities. This paper interprets these terms as phrases. By using Wikipedia for phrase information (unlike [16]) we avoid needing additional grammatical information. This also gives us the potential to generate multi-lingual embeddings, although we do not pursue this here.

In this work, we use phrase boundary information to generate word embeddings in a non-compositional manner, rather than phrase embeddings. We consider each of the words in a phrase as a part of a unit, where a unit is either a single word (i.e. not a link in Wikipedia) or otherwise a bag of words. The embeddings are then learned for each of the unit members by considering the surrounding units as context.

In the following section we present related work in this domain, Sect. 3 presents our model and in Sects. 4 to 6 we give details of the implementation and the experiments.

## **2 Related Work**

Word representations can be obtained from a language model where the goal is to predict a future word based on some previously observed information such as a sentence, a sequence, or a phrase. For this task, various models can be utilised, including joint probabilities of observations that may incorporate the Markov assumption. Under this assumption, we may say that the immediate future is independent of the entire past given the present. N-gram language models [4] use this assumption to predict token(s) using the previous *N* − 1 tokens [17]. This can be constructed efficiently for very large datasets using neural network based language modelling (NNLM) [2].

The NNLM of [2] used a non-linear hidden layer between the input and output layers. A simpler network, named the log bi-linear model, was introduced in [20] by dropping the hidden layer between the input and output layers. Instead of the hidden layer, context vectors are summed and projected to the output layer. This model was later used by [18] and named CBOW (Continuous Bag-of-words model), with a symmetric context (i.e. context words on both sides of the target word).

In addition, the Skip-gram model, was introduced in this work by reversing CBOW to predict context from the target word. Given a context range *c* and target word *w<sup>t</sup>* the objective is to maximise the average log probability,

$$\sum\_{-c \le j \le c,\, j \ne 0} \log p(w\_{t+j}|w\_t)$$

The model defines $p(w_{t+j}|w_t)$ using the softmax function,

$$p(w\_O|w\_I) = \frac{\exp\left({v'\_{w\_O}}^{\top} v\_{w\_I}\right)}{\sum\_{w=1}^{W} \exp\left({v'\_w}^{\top} v\_{w\_I}\right)}$$

where $v_w$ and $v'_w$ are the "input" and "output" vector representations of $w$, and $W$ is the number of words in the vocabulary. However, due to the large vocabulary, the computation becomes impractical. Thus, Noise Contrastive Estimation (NCE) [7] was used, which performs the same operation by sampling a very small number $k$ of words from the vocabulary as noise.

A similar technique, called Candidate Sampling [10], combines noise samples with the true classes from a set $Y$ of true classes into a set $S$, with the objective of predicting the true class from $S$. Embeddings are scored as,

$$
\hat{Y}\_s = (X\_s \* W\_s + b\_s) - \log(E(s)).
$$

where $X_s$ is a vector (embedding) corresponding to a word $s \in S$, $W_s$ is the corresponding weight, $b_s$ is the bias, and $E(s)$ is the expectation for $s$. Each score is approximated to a probability using the softmax function,

$$Softmax(\hat{Y}\_s) = \frac{\exp \hat{Y}\_s}{\sum\_{s' \in S} \exp \hat{Y}\_{s'}}.$$

In addition to words, phrases may also be considered. In [18], the words comprising a phrase were joined using the delimiter '\_' between them, and their joint embedding was learned. This scheme is called non-compositional embedding [8,26]. Alternatively, compositional embeddings [8] are generated by merging the word embeddings of the phrase components using a composition function. The main difference between these schemes is that the former learns the phrase embedding directly, while the latter merges already learned word embeddings to make the phrase embedding. Similarly, [3] introduced an extension of the Skip-gram model [18] that composes sub-word embeddings into word embeddings with summation as the composition function.

## **3 The SPhrase Model**

The proposed model uses information about which words belong to which phrases. This information can be conveniently represented as simply the locations for where phrases start and end, hence the name, *Supervised Phrase Boundary Representations model* (SPhrase).

The key assumption is that each word that comprises a phrase has the same context. This will produce an embedding where words that occur in the same phrase are likely to be close in the vector space. For example consider the sentence:

### *British Airways to New York has Departed*

This sentence includes the (noun) phrase 'New York'. Following the procedure for Word2vec we focus on the target word 'New' using a context window of size 1. The target, context pairs are (New, to) and (New, York). Repeating this procedure for the target word 'York', yields the target, context pairs (York, New) and (York, has).

For SPhrase, the context differs from Word2vec, both target words in 'New York' will have the same context based on the words immediately surrounding the phrase, hence the SPhrase target context pairs are (New, to), (New, has), (York, to), (York, has). Figure 1 highlights the context words for the word 'New' for both Word2vec and SPhrase.
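The difference between the two pair-generation schemes can be made concrete with a short sketch (our own illustration, with hypothetical function names):

```python
def word2vec_pairs(words, c):
    """Standard linear context: c words either side of each target word."""
    for i, target in enumerate(words):
        for j in range(max(0, i - c), min(len(words), i + c + 1)):
            if j != i:
                yield target, words[j]

def sphrase_pairs(units, c):
    """SPhrase: units are lists of words (phrases or single words); every word
    of a target unit shares the context formed by the c units on either side."""
    for i, unit in enumerate(units):
        context_units = units[max(0, i - c):i] + units[i + 1:i + c + 1]
        for target in unit:
            for cu in context_units:
                for ctx in cu:
                    yield target, ctx

sent = [["British", "Airways"], ["to"], ["New", "York"], ["has"], ["Departed"]]
print(list(sphrase_pairs(sent, 1)))
# includes ('New', 'to'), ('New', 'has'), ('York', 'to'), ('York', 'has')
```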

**Fig. 1.** Context words for the target word *New* using Word2vec and SPhrase. The context words are in bold. The context size is 1.

In the above, we demonstrated the target context pairs induced by a target word that is a member of a phrase, where its contexts are individual words. In the following, we generalise the approach to handle the situation where phrases are part of a context. We do this by introducing the concept of a *unit*, where a unit consists of a sequence of words. A unit of length 1 represents an individual word, a unit of length 2 represents a two-word phrase, and so on for larger phrases.

Thus we measure the context simply in terms of units. Figure 2 provides an example of a context of size 2 each side. Note that the left context for SPhrase contains 3 words. Thus the context size measured in words will be larger for SPhrase than Word2vec if there is a phrase within the context window.

**Fig. 2.** Context words for the target word *Rome* using Word2vec and SPhrase. The context words are in bold. The context size is 2.

### **3.1 SPhrase Context Sampling**

A standard approach to reduce the computation involved in generating embeddings is to shorten the effective context length by using only a sample of words from a context [18]. For SPhrase this can be achieved in several ways, as sketched below. First, it can be done at the level of units rather than words; this is denoted *unit context sampling* (SPhrase). Second, *random word context sampling* (R)<sup>1</sup> involves first performing unit context sampling and then, for each unit of length greater than one, sampling one word uniformly at random. This yields an effective context length that matches the context length of Word2vec. In addition, we generate embeddings named *without unit context sampling* (NU), where the target is still a unit but the context comprises individual words.
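A minimal sketch of the first two sampling regimes, assuming the context has already been gathered as a list of units (our own code, not the training implementation):

```python
import random

def unit_context_sample(context_units, n):
    """Unit context sampling: sample n whole units from the context."""
    units = random.sample(context_units, min(n, len(context_units)))
    return [w for u in units for w in u]

def random_word_context_sample(context_units, n):
    """(R): unit sampling, then one word drawn per unit, so the effective
    context length matches that of Word2vec."""
    units = random.sample(context_units, min(n, len(context_units)))
    return [random.choice(u) for u in units]
```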

## **4 Methods and Datasets**

### **4.1 Dataset**

In order to generate an embedding using our approach, we require a corpus that has phrases annotated. Unfortunately this is not readily available, so we use a proxy for phrase annotation. In datasets that include hyperlinks we assume that the *hyperlink displayed text* is a phrase. One such data set is Wikipedia; we use the English Wikipedia dump version 20180920 that contains over 3 billion tokens. The proportion of tokens in phrases of length 2 is 2.5%; of length 3, 4, 5, and greater is respectively 0.8%, 0.3%, 0.2%, and less than 0.1%. Obviously not all phrases are represented as hyperlink text and not all hyperlink texts are phrases. Indeed the longest hyperlink text in our data set is of length 16,382 (it included internal formatting of Wikipedia). For our study we restricted maximum length to 10. The embedding vocabulary contained tokens with a frequency of at least 100 which gave us a total of 400,919 distinct tokens.

### **4.2 Parameter Settings**

Training is performed in mini-batches of 60,000 tokens per batch with candidate sampling of 5,000 classes per batch (a value dictated by the available computational resources). The remaining parameters use standard values: the learning rate is initialised to 0.001 and optimisation is based on the *Adam* optimiser [12] for stochastic learning. The learning rate decay is set to 10% (i.e. learning rate × 0.9) after each epoch. The total number of epochs is set to 20. The weighting scheme for selecting words in context sampling is the same as for Word2vec [18].

## **5 Evaluation**

There are two types of evaluation tasks commonly accepted: intrinsic and extrinsic. Intrinsic evaluation tasks determine the quality of embeddings. Under this

<sup>1</sup> Pretrained embeddings are available at: https://github.com/ManniSingh/SPhrase.

class, word similarity/relatedness tasks generally use cosine distance as a metric to measure the similarity between two word vectors. Extrinsic evaluation tasks, on the other hand, are based on specific downstream tasks such as named entity recognition (NER), sentiment classification, or topic detection. In this work, we perform similarity-based intrinsic evaluation and NER-based extrinsic evaluation.

## **6 Experimental Design**

### **6.1 Intrinsic Evaluation**

The following experiments fit into the so-called *intrinsic* category of embedding evaluation. We aim to demonstrate that although the total number of phrases in our dataset is small compared to the number of words, they do have a positive impact on the resulting embeddings. In order to determine an optimal configuration of the method, intrinsic evaluation is done on embeddings trained on the first 10% of the corpus (see Fig. 3). The optimal configuration in this evaluation is SPhrase (R) with window size 5. For the extrinsic evaluation, described in Sect. 6.2, only this optimal configuration is used and the embeddings are trained on the full corpus.

In the following experiments we compare SPhrase embeddings with the ones generated by Word2vec. It is known that increasing the context window size generally improves the quality of the embedding. Recall that the expected context size for each target word is the same for Word2vec and SPhrase due to word context sampling.

We expect that words in phrases should be mapped to similar locations in the embedding, i.e. words within a phrase should be closer together than words that are not in the same phrase. In the following we first perform experiment on pairwise similarity and then we investigate further structure with an analogy task.

**Pairwise Similarity.** For pairwise similarity experiments we use phrases from three datasets.


**Fig. 3.** Similarity scores for phrases relative to 100 random words, comparing *unit context sampling* (SPhrase), without *unit context sampling* (NU), and *random word context sampling* (R). SPhrase (bold) and Word2vec (dashed) are compared on phrase lengths 2–7 (horizontal axis); the higher the score, the better.

In order to investigate how the distances between words within a phrase compare to the distances between those words and random words in the datasets, we use the following,

$$\text{Similarity Score } = \frac{1}{N\_l(l-1)} \sum\_{i=1}^{l-1} b(w\_i, w\_{i+1}, r)$$

where,

$$b(w\_i, w\_{i+1}, r) = \begin{cases} 1 & s(w\_i, w\_{i+1}) > s(w\_i, r), \\ 0 & \text{otherwise}, \end{cases}$$

where *r* is a word selected at random from another phrase. A new word is drawn for each phrase pair comparison. The similarity score is calculated 100 times and the overall average is taken, in order to reduce the noise generated by selecting only one word per comparison. The interpretation is similar to that of the cosine score, in that the larger the value the better.
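A single pass of this score could be computed as follows (our own sketch; we assume $s(\cdot,\cdot)$ is cosine similarity between embedding vectors stored in a dict `emb`):

```python
import random
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_similarity_score(phrases, emb):
    """One pass of the similarity score: for each adjacent word pair (w_i, w_{i+1})
    in a phrase, check whether it is more similar than (w_i, r) for a random word r
    from another phrase; the paper averages this over 100 passes."""
    hits = total = 0
    for p in phrases:
        for wi, wj in zip(p, p[1:]):
            r = random.choice(random.choice([q for q in phrases if q is not p]))
            hits += cos(emb[wi], emb[wj]) > cos(emb[wi], emb[r])
            total += 1
    return hits / total
```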

We computed scores for phrase lengths up to and including length 7. We have used context window sizes 3, 5 and 10. Figure 3 shows these scores for the context sampling regimes: with *unit context sampling*, without *unit context sampling*, and *word context sampling*.

We can see that regardless of the embedding, the scores in general reduce as the phrase gets longer. However, the larger the window size the more Word2vec and SPhrase agree. This is what we should expect, since there will be greater overlap in the context words between SPhrase and Word2vec. Nevertheless we see that, overall, SPhrase performs better.

**Google Analogy Test Set.** Analogy-based tasks are widely used, e.g. [5,6,11], to evaluate the quality of word embeddings. One well-known test set is the Google analogy test set [18]. This dataset comprises rows of four words, such as known unknown informed uninformed. The analogy task is to predict the final word from the first three using simple vector addition/subtraction of their vector representations. Informally, the task attempts to show how well words follow the vector relationship

*unknown - known = uninformed - informed*
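A minimal sketch of this prediction rule (our own code; `emb` maps words to vectors, and the three query words are excluded from the candidates, as is standard practice):

```python
import numpy as np

def solve_analogy(a, b, c, emb, vocab):
    """Predict d in 'a : b :: c : d' as the word whose vector has the highest
    cosine similarity to emb[b] - emb[a] + emb[c]."""
    q = emb[b] - emb[a] + emb[c]
    q /= np.linalg.norm(q)
    best, best_sim = None, -np.inf
    for w in vocab:
        if w in (a, b, c):
            continue                       # exclude the query words themselves
        v = emb[w] / np.linalg.norm(emb[w])
        if (s := v @ q) > best_sim:
            best, best_sim = w, s
    return best
```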


**Table 1.** Scores on Google analogy dataset with *unit context sampling* (SPhrase), here accuracy is the total correct count on the total count of instances.


**Table 2.** Scores on Google analogy dataset without *unit context sampling* (NU), here accuracy is the total correct count on the total count of instances.

The dataset is divided into categories, some of which are inherently phrase-based. In the category capital-common-countries a typical line is:

### Athens Greece Baghdad Iraq

Both *Athens Greece* and *Baghdad Iraq* can be reasonably construed to be phrases, unlike in the first example above. Two other categories have this same character, namely capital-world and city-in-state.

Example rows are: Athens Greece Canberra Australia and Chicago Illinois Houston Texas respectively.

With this in mind we show the accuracy of SPhrase and Word2vec stratified by category, in addition to the overall accuracy that is usually reported. The categories that have a phrasal quality are italicised in Tables 1, 2 and 3. We see that, overall, SPhrase performs better in these categories.

### **6.2 Extrinsic Evaluation**

We use Conll2003 English [25] and Wikigold [1] to evaluate the performance of the generated embeddings. The Conll dataset is widely used to evaluate various NER-based models. It contains 203,621 tokens in the training set, while the validation and test sets contain 51,362 and 46,435 tokens respectively. Wikigold, on the other hand, provides a single data file of 39,007 tokens that we used for testing, while the NER models were trained with the Conll training and validation data. We used the SPhrase (R) model with window size 5, since this configuration demonstrated significant improvements over Word2vec, as shown in Fig. 3. We recreated the BLSTM and CRF based model of [14], but without any feature engineering.


**Table 3.** Scores on Google analogy dataset with *random word context sampling* (R), here accuracy is the total correct count on the total count of instances.

**Table 4.** Comparison of Word2vec with SPhrase(NU) on Conll2003 English and Wikigold dataset


We trained this for 20 epochs, evaluating on the validation data after each epoch. We ran 10 instances of each of these models and present the range of F1 scores (using the Conll2003 evaluation script). Table 4 displays the results, which show a significant improvement over the Word2vec model trained on the same corpus.

## **7 Concluding Remarks**

This investigation demonstrates that using phrasal information can directly enrich word embeddings. In this work, we presented an alternative context sampling technique to that used in skip-gram Word2vec. We note that the SPhrase approach is not limited to augmenting Word2vec; it can also be applied to morphological extensions such as Fasttext [3].

We used the displayed text from hyperlinks as a proxy for phrases, and in this sense SPhrase is supervised. We are, however, planning to generalise the methodology by investigating whether we can identify useful phrase boundaries in a completely unsupervised fashion.

## **References**



# **Predicting Remaining Useful Life with Similarity-Based Priors**

Youri Soons<sup>1</sup>, Remco Dijkman2(B), Maurice Jilderda<sup>3</sup>, and Wouter Duivesteijn<sup>2</sup>

<sup>1</sup> Sitech Services B.V., Geleen, The Netherlands youri.soons@sitech.nl <sup>2</sup> Technische Universiteit Eindhoven, Eindhoven, The Netherlands {r.m.dijkman,w.duivesteijn}@tue.nl <sup>3</sup> Perfact Group, Munstergeleen, The Netherlands mauricejilderda@perfact-group.com

**Abstract.** Prognostics is the area of research that is concerned with predicting the remaining useful life of machines and machine parts. The remaining useful life is the time during which a machine or part can be used before it must be replaced or repaired. To create accurate predictions, predictive techniques must take into account external data on the operating conditions of the part and events that occurred during its lifetime. However, such data is often not available. Similarity-based techniques can help in such cases. They are based on the hypothesis that if a curve developed similarly to other curves up to a point, it will probably continue to do so. This paper presents a novel technique for similarity-based remaining useful life prediction. In particular, it combines Bayesian updating with priors that are based on similarity estimation. The paper shows that this technique outperforms other techniques on long-term predictions by a large margin, although other techniques still perform better on short-term predictions.

**Keywords:** Remaining useful life · Trajectory based similarity prediction · Bayesian updating · Similarity estimation · Prognostics · Prediction

## **1 Introduction**

Prognostics is the area of research that concerns the prediction of the remaining useful life (RUL) of machines or machine parts. A RUL prediction is a prediction of the time until a machine or machine part must be replaced or repaired. It is important that such predictions are accurate: early predictions lead to unnecessarily frequent maintenance with associated costs, while late predictions increase the risk of a machine break down with associated loss of production time and possibly sales.

Data-driven RUL prediction is based on run to failure data, i.e., observations on what happened to a part or machine in a run from the last maintenance

activity to the next. Figure 1 shows a typical example of run to failure data, in this case data of a filter in a chemical plant. The figure shows condition measurements on the filter over time, in terms of the difference in pressure before and after the filter. It shows that this difference is close to zero for some time. Then, the filter starts to clog up and the pressure builds up, until the filter is replaced and the pressure difference returns to normal. The resulting 'sawtooth' shape is frequently observed in run to failure data.

**Fig. 1.** Example run to failure data.

RUL prediction on run to failure data can be done by fitting a model, such as a regression model or a probability distribution, on the data. Many different techniques exist for those purposes [1]. However, as is evident from Fig. 1, different runs may have very different durations or shapes, and RUL prediction techniques rely on additional data to accurately predict the duration and shape of a particular run. Unfortunately, additional data is often unavailable or hard to relate to the run to failure data [2]. If additional data is unavailable, it is unclear which condition measurements are reliable and of course what their influence is on the RUL. One way to overcome these problems is to use similarity-based techniques, which work based on the hypothesis that, if a curve has developed similarly to some collection of other curves until now, it will likely continue to develop like that, and have a similar remaining useful life.

This paper explores the performance of two similarity-based techniques: trajectory-based similarity prediction, and Bayesian updating. It then adds its own: Bayesian updating with similarity-based priors. The contribution of this paper consists of this technique, described in Sect. 3.4, as well as a detailed evaluation of all three techniques in a case study from practice, described in Sect. 4.

Against this background, the remainder of this paper is structured as follows. Section 2 presents related work on remaining useful life prediction. Section 3 presents similarity-based remaining useful life prediction techniques, including the new technique. Section 4 compares the performance of the various techniques in a case study and Sect. 5 presents the conclusions.

## **2 Related Work**

RUL prediction can be considered a specialized form of survival analysis [10]. Essentially, two types of techniques exist for predicting RUL: model-based and data-driven techniques. Model-based techniques use physical models to accurately represent the wear and tear of a component over time [5]. Data-driven techniques do not presume any knowledge about how a component wears out over time, but merely predict the RUL based on past observations. Hybrid models, which combine physical and data-driven techniques, also exist [9]. This paper focuses on data-driven models, which are most suited when the physical mechanisms that cause a component to fail are too complex to model cost-effectively, or are not sufficiently understood.

A large number of data-driven techniques is available; they fall into two classes depending on whether a probability distribution of the RUL must be obtained or a point estimate is sufficient [1]. A probability distribution of the RUL has several benefits [16,17,20]. For example, it facilitates stochastic decision making, where maintenance is done when the probability that a part will fail exceeds a certain threshold, which is in line with the way in which maintenance decisions are made. When it is not necessary to produce a probability density function, several models can be used. The most obvious choices include regression models that use time as the primary independent variable, and time-series models. However, regression models require that the behavior of the curve is predictable over time [4,13], and time-series models [12] are only suitable for short-term predictions [3,16] or when the behavior of the curve is predictable over time. Regression models that take other variables into account can also be used [6]. Such models have the benefit that they do not only consider the dependency of the RUL on the time that the part has been in operation, but also on other relevant factors, such as the operational temperature or the vibration of the part.

When the RUL depends on other factors beyond time, but data on such factors is not available, one can include them as a black box. While we may not know the values of relevant factors, we can still find historical runs that are similar to the current run. If we assume that the factors that influenced historically similar runs are also similar to the current run, then the future behavior of the current run will also be similar to the behavior of the historically similar runs. This is called Trajectory Based Similarity Prediction (TBSP) [11,18,19]. Bayesian updating techniques use a similar principle [7,8]. Such techniques create a prior probability distribution of the RUL (based on data from historical runs to failure), which updates as more data of the current run is revealed.

## **3 Prediction Techniques**

This section presents similarity-based techniques that can be used for RUL prediction: TBSP and Bayesian updating, which are defined in related work as explained in Sect. 2. Subsequently, Sect. 3.4 presents a novel technique, Bayesian updating with similarity-based prior estimation, which is a combination of TBSP and Bayesian updating.

### **3.1 Preliminaries**

The remaining useful life of a part is defined as follows.

**Definition 1 (Remaining Useful Life (RUL)).** *Let* t *be a moment in a run and* $t_E$ *be the moment in the run at which the part fails. The Remaining Useful Life (RUL) at time* t*,* r(t)*, is defined as* $r(t) = t_E - t$*.*

Note that 'failure' can be interpreted broadly. It does not have to be the point at which the part breaks, but can also be the point at which the part reaches a condition in which it is not considered suitable for operation anymore, or a condition in which maintenance is considered necessary. Over time, multiple runs to failure will be observed, such as the runs to failure shown in Fig. 1.

**Definition 2 (Run to failure library).** L *is the library of past runs to failure. For each* l ∈ L*,* $t^l_E$ *is the moment in the run at which the part fails, and* $g^l(t)$ *is the function that returns the condition of the part at time* t *of the run.*

The function $g^l(t)$ is created by fitting a curve on the condition measurements of the run. We consider the one-dimensional case here (i.e., we only measure the condition of the part), but this can easily be extended to the multi-dimensional case, in which we also measure external factors such as the operating temperature or pressure, by considering the observations as vectors over multiple variables. We will also omit the superscript l if there can be no confusion about the run to which we refer.

### **3.2 Trajectory-Based Similarity Prediction**

Figure 2 shows a different (cf. Fig. 1) representation of a run to failure library. It shows all runs in the library, starting from the moment at which the condition variable starts to increase from the base condition. It also shows a 'current' run as a thicker, unfinished curve. The idea of trajectory-based similarity prediction is to find some number k of runs that are most similar to the current run. For each of these k similar runs, we know the time it took until the part failed.

**Fig. 2.** Example library of runs.

Trajectory-based Similarity Prediction (TBSP) estimates the time until failure as the mean failure time of the similar runs.

**Definition 3 (Distance of current run to library run).** *At a moment in time* t*, let* I *be the number of observations made in the current run, with values* $z_1,\ldots,z_I$ *observed at times* $t_1,\ldots,t_I$*, and let* l ∈ L *be a library run. We denote by* $d^l(t)$ *any distance measure contrasting* $z_1,\ldots,z_I$ *with* $g^l(t_1),\ldots,g^l(t_I)$*. Let* $E^l(t)$ *and* $M^l(t)$ *denote Euclidean and Manhattan distance, respectively.*

Clearly other distance functions can and indeed have been used as well in the context of remaining useful life prediction [21]. An in-depth analysis of the distance function that performs best for TBSP is beyond the scope of this work.

**Definition 4 (Fit of current run to library run).** *For each library run* l ∈ L*, let* $d^l(t)$ *be defined as in Definition 3. The fit of the current run to* l *is:*

$$S^l(t) = e^{-|d^l(t)|}$$

When, at time t of the current run, the library run l is found that fits the current run best, the remaining useful life of the current run can be predicted as the remaining useful life of that run l: $\hat{r}(t) = t^l_E - t$. It is also possible to base the prediction of the remaining useful life on the best k runs; sensitivity to k is part of our experiments. If k > 1, we can also aggregate RUL predictions by a weighted average, where the weights are the goodness of fit of the library runs to the current run.

**Definition 5 (Trajectory-based Similarity Prediction).** *For each library run* l ∈ L*, let* $S^l(t)$ *be the fit of the run to the current run as per Definition 4 and let* $r^l(t)$ *be the RUL of the run. Let* $L' \subseteq L$ *be the subset of past runs on which we want to base our RUL prediction. The predicted RUL of the current run,* $\hat{r}(t)$*, is:*

$$\hat{r}(t) = \frac{\sum\_{l \in L'} S^l(t) \cdot r^l(t)}{\sum\_{l \in L'} S^l(t)}$$
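Putting Definitions 3 to 5 together, a TBSP prediction can be sketched as follows (our own code; each library entry is assumed to be a fitted condition curve together with its failure time):

```python
import numpy as np

def tbsp_predict(t, obs_times, obs_values, library, k=3):
    """TBSP: weighted-average RUL of the k best-fitting library runs.
    library: list of (g, t_E) with g a fitted condition curve callable on an
    array of times, and t_E the failure time of that run."""
    fits = []
    for g, t_end in library:
        d = np.linalg.norm(obs_values - g(obs_times))   # Euclidean distance E^l(t)
        fits.append((np.exp(-abs(d)), t_end - t))       # (S^l(t), r^l(t))
    fits.sort(key=lambda f: -f[0])                      # keep the k best fits
    s, r = zip(*fits[:k])
    return float(np.dot(s, r) / np.sum(s))              # Definition 5
```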

### **3.3 Bayesian Updating**

A Bayesian updating method has also been proposed to create a probability distribution of the remaining useful life [7,8]. The probability distribution can be updated with each observation of the condition variable that is obtained. The method works by fitting an exponential model to the library runs and subsequently updating that model with observations of the current run.

Intuitively, looking at Fig. 2, Bayesian updating works by fitting a curve to each of the library runs or to a selection of library runs. Based on the resulting collection of curves, a prior probability distribution of the time until the part fails can be created, which represents the 'probable' curve that the current run —or in fact any run—will follow. The prior probability distribution can be updated each time a condition value is observed in the current run. This update leads to a posterior probability distribution that represents the curve that the current run will follow with a higher precision (smaller confidence interval).

**Definition 6 (RUL probability density).** *For each library run* l ∈ L*, let* $g^l(t)$ *be the function that returns the condition of the part at time* t *of the run. The condition function can be fitted as an exponential model of the form:*

$$g^l(t) = \phi + \theta e^{\beta t + \epsilon(t) - \frac{1}{2}\sigma^2 t}$$

*Here,* $\phi$ *is the intercept,* $\epsilon(t)$ *is the error term with mean 0 and variance* $\sigma^2$*, and* $\theta$ *and* $\beta$ *are random variables.*

If we set φ = 0 and take the natural logarithm of both sides, we get:

$$
\ln(g^l(t)) = \theta' + \beta t + \epsilon(t).
$$

where $\theta' = \ln(\theta) + \frac{1}{2}\sigma^2$. Considering that we have multiple runs $l \in L$, it is possible to fit this equation to each of those runs and calculate values for $\theta'$, $\beta$ and $\sigma$ for each run. With these values, we can compute the prior probability distributions of $\theta'$ and $\beta$. We assume these distributions are normal distributions with means $\mu'_0$ and $\mu_1$ and variances $\sigma_0^2$ and $\sigma_1^2$. While the prior distributions are created based on observations from library runs, the distribution can be updated as more observations become available in the current run.

**Proposition 1 (RUL probability density updating).** *Let* $\pi(\theta')$ *and* $\pi(\beta)$ *be the prior distributions of the random variables from Definition 6 with means* $\mu'_0$ *and* $\mu_1$ *and variances* $\sigma_0^2$ *and* $\sigma_1^2$*, where* $\theta' = \ln(\theta) + \frac{1}{2}\sigma^2$ *and* $\sigma^2$ *is the variance of the error term. Furthermore, let there be* I *observed values,* $z_1,\ldots,z_I$*, in the current run, made at times* $t_1,\ldots,t_I$*, and for* $i \in I$*, let* $L_i = \ln(z_i)$ *be the natural logarithm of each observation. The posterior distribution is a bivariate normal distribution in* $\theta'$ *and* $\beta$*, whose means* $\mu_{\theta'}$ *and* $\mu_\beta$*, variances* $\sigma^2_{\theta'}$ *and* $\sigma^2_\beta$*, and correlation coefficient* $\rho$ *can be calculated as follows:*

$$\mu\_{\theta'} = \frac{\left(\sum\_{i \in I} L\_i \sigma\_0^2 + \mu'\_0 \sigma^2\right)\left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right) - \left(\sum\_{i \in I} t\_i \sigma\_0^2\right)\left(\sum\_{i \in I} L\_i t\_i \sigma\_1^2 + \mu\_1 \sigma^2\right)}{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right) - \left(\sum\_{i \in I} t\_i \sigma\_1^2\right)\left(\sum\_{i \in I} t\_i \sigma\_0^2\right)}$$

$$\mu\_{\beta} = \frac{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sum\_{i \in I} L\_i t\_i \sigma\_1^2 + \mu\_1 \sigma^2\right) - \left(\sum\_{i \in I} t\_i \sigma\_1^2\right)\left(\sum\_{i \in I} L\_i \sigma\_0^2 + \mu'\_0 \sigma^2\right)}{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right) - \left(\sum\_{i \in I} t\_i \sigma\_1^2\right)\left(\sum\_{i \in I} t\_i \sigma\_0^2\right)}$$

$$\sigma^2\_{\theta'} = \frac{\sigma^2 \sigma\_0^2 \left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right)}{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right) - \left(\sum\_{i \in I} t\_i\right)^2 \sigma\_0^2 \sigma\_1^2}$$

$$\sigma^2\_{\beta} = \frac{\sigma^2 \sigma\_1^2 \left(|I|\sigma\_0^2 + \sigma^2\right)}{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sum\_{i \in I} t\_i^2 \sigma\_1^2 + \sigma^2\right) - \left(\sum\_{i \in I} t\_i\right)^2 \sigma\_0^2 \sigma\_1^2}$$

$$\rho = \frac{-\sigma\_0 \sigma\_1 \sum\_{i \in I} t\_i}{\sqrt{\left(|I|\sigma\_0^2 + \sigma^2\right)\left(\sigma\_1^2 \sum\_{i \in I} t\_i^2 + \sigma^2\right)}}$$

The proof of this proposition is given in [8]. Consequently, $\ln(g^l(t))$ for the current run to failure l is normally distributed with mean and variance:

$$
\mu(t) \cong \mu\_{\theta'} + \mu\_\beta t - \frac{1}{2}\sigma^2 \qquad \qquad \sigma^2(t) \cong \sigma\_{\theta'}^2 + \sigma\_\beta^2 t^2 + \sigma^2 + 2\rho t \sigma\_{\theta'} \sigma\_\beta
$$

With this information, the probability that future values of $\ln(g^l(t))$ exceed the maximum acceptable condition at some time t can be computed.
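The posterior update of Proposition 1 translates directly into code. A sketch with our own variable names, where `mu0p`/`s0` and `mu1`/`s1` are the prior parameters of $\theta'$ and $\beta$ and `sigma` is the error standard deviation:

```python
import numpy as np

def posterior_params(ts, zs, mu0p, s0, mu1, s1, sigma):
    """Posterior of (theta', beta) after observing condition values zs at times ts,
    following the closed-form update of Proposition 1."""
    L = np.log(zs)                                  # L_i = ln(z_i)
    n, St, St2 = len(ts), np.sum(ts), np.sum(ts**2)
    SL, SLt = np.sum(L), np.sum(L * ts)
    s02, s12, s2 = s0**2, s1**2, sigma**2
    den = (n * s02 + s2) * (St2 * s12 + s2) - St**2 * s02 * s12
    mu_t = ((SL * s02 + mu0p * s2) * (St2 * s12 + s2)
            - St * s02 * (SLt * s12 + mu1 * s2)) / den
    mu_b = ((n * s02 + s2) * (SLt * s12 + mu1 * s2)
            - St * s12 * (SL * s02 + mu0p * s2)) / den
    var_t = s2 * s02 * (St2 * s12 + s2) / den
    var_b = s2 * s12 * (n * s02 + s2) / den
    rho = -s0 * s1 * St / np.sqrt((n * s02 + s2) * (s12 * St2 + s2))
    return mu_t, mu_b, var_t, var_b, rho
```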

### **3.4 Bayesian Updating with Similarity-Based Prior Estimation**

The RUL probability density function in Definition 6 depends on the estimated prior distributions of $\theta'$ and $\beta$. These priors can be set by analyzing previous runs to failure, either based on the complete library of runs, or on a subset of the runs. More precisely, we can determine the prior distributions as follows.

**Definition 7 (Prior distributions).** *For each library run* l ∈ L*, let* $g^l(t)$ *be the exponential curve that is fitted to the observations in that run with parameters* $\theta^l$ *and* $\beta^l$ *as in Definition 6. For a subset* $M \subseteq L$ *of runs, we can determine the mean and standard deviation of* $\theta$ *and* $\beta$ *over all* $\theta^m$ *and* $\beta^m$*.*

Consequently, our priors depend on the subset M ⊆ L of runs that we use. For example, we can determine our priors based on M = L, the complete set of runs. Here, we consider a variant of the Bayesian updating method in which the priors are set based on the runs that are most similar to the current run, using Definition 4 for similarity and thresholds to select the most similar runs. More precisely, we select our priors as follows.

**Definition 8 (Similarity-based prior distributions).** *Let* t *be the moment in time at which we determine our prior distributions and* k *be the number of similar runs on which we base them. Furthermore, let* $S^l(t)$ *be the similarity of a run* l *to the observations in the current run until time* t*, as per Definition 4. The set of* k *most similar runs* $M \subseteq L$ *at moment* t *is then defined as the set in which, for all runs* $m \in M$*, there is no run* $l \in L - M$ *such that* $S^l(t) > S^m(t)$*.*

Note that this definition depends on variables t and k, which can therefore be expected to influence the performance of the technique. In our evaluation, we will explore the performance of the technique for different values of t and k.
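As an illustration of Definitions 7 and 8, the sketch below fits exponential curves to library runs and derives priors from the k most similar ones. The curve form, the stand-in similarity (Definition 4 itself is not reproduced in this excerpt, so negated Euclidean distance on the overlapping prefix is used in its place), and all names are our assumptions.

```python
# Sketch of similarity-based prior estimation (Definitions 7 and 8).
# Each library run is assumed to be a 1D numpy array of observations
# sampled at a common rate; everything here is illustrative.
import numpy as np
from scipy.optimize import curve_fit

def exp_curve(t, theta, beta):
    # an exponential degradation curve g(t) = theta * exp(beta * t)
    return theta * np.exp(beta * t)

def fit_run(run):
    t = np.arange(len(run), dtype=float)
    (theta, beta), _ = curve_fit(exp_curve, t, run, p0=(run[0], 0.01))
    return theta, beta

def similarity(run, current):
    # stand-in for Definition 4: higher value = more similar
    n = min(len(run), len(current))
    return -np.linalg.norm(run[:n] - current[:n])

def similarity_based_priors(library, current, k):
    """Fit curves to the k runs most similar to the current observations
    and return (mean, std) priors for theta and beta."""
    ranked = sorted(library, key=lambda r: similarity(r, current), reverse=True)
    params = np.array([fit_run(r) for r in ranked[:k]])
    return params.mean(axis=0), params.std(axis=0)
```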

### **4 Evaluation**

In this section, we put the RUL prediction techniques introduced in Sect. 3 to the test in a case study with real-world industrial data.

### **4.1 Case Study**

Our data originates from a chemical plant on the Chemelot Industrial Site<sup>1</sup>. The plant we investigate produces a steady flow of various chemical products; whatever the product happens to be, an unwanted byproduct is always generated. Filters have been installed to obtain an untainted final product. These filters have a variable service life, ranging between two and eight days. While the filter performs its function, it retains residue of the unwanted byproduct. This residue gradually builds up, forming a cake that increases the resistance of the filter. The additional resistance is measured as an increase in differential pressure (δP), as illustrated in Fig. 1. An unclogged filter has a δP of 0.2 bar. When δP reaches a threshold of 2.4 bar, a valve in front of the filter is switched to let the product run through a parallel, clean filter, which returns δP to 0.2 bar and enables engineers to maintain the clogged filter.

Sensor data, including δP, is stored in a NoSQL database as time series. Preprocessing is needed in several respects. First, the data has many missing values, which we replace by the last observed value. Second, the sensors generate a data point every second. We established experimentally that resampling the data to one-minute intervals loses hardly any information from the signal, while substantially reducing the size of the dataset. Third, to avoid the amplification of clear outliers, we remove them with a Hampel filter [14]. Fourth, we focus on the 'exponential deterioration stage' of the filter's life cycle [5], because—according to the company—the start of that stage is early enough to be able to act on time, and because it provides us with a dataset that is suitable for similarity-based RUL prediction techniques. The start and end of the exponential deterioration stage must be derived from the data. We do this by comparing the average pressure over the last hour with that of the preceding hour; a sketch of the whole pipeline follows below. To ensure that every run has only one start per stop, a detected start is ignored if another start was already detected in the same run.
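A minimal pandas sketch of these four preprocessing steps follows. The column name, window sizes, resampling rule, and start-detection ratio are our assumptions for illustration; the paper does not specify them, and the one-start-per-stop rule is omitted.

```python
# Sketch of the preprocessing pipeline; df is assumed to have a
# DatetimeIndex and a 'delta_p' column sampled at 1 Hz.
import numpy as np
import pandas as pd

def hampel(series, window=11, n_sigmas=3.0):
    """Replace outliers by the rolling median (a basic Hampel filter)."""
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
    outlier = (series - med).abs() > n_sigmas * 1.4826 * mad
    return series.where(~outlier, med)

def preprocess(df):
    dp = df["delta_p"].ffill()                # 1. forward-fill missing values
    dp = dp.resample("1min").mean().ffill()   # 2. resample 1 Hz data to minutes
    return hampel(dp)                         # 3. remove clear outliers

def detect_exponential_stage(dp, ratio=1.05):
    """4. Flag starts of the exponential deterioration stage: the first
    moments the last hour's mean pressure exceeds the preceding hour's
    mean by a given ratio (the paper's exact criterion is not given)."""
    last_hour = dp.rolling("1h").mean()
    prev_hour = last_hour.shift(60)           # 60 one-minute samples = 1 hour
    rising = last_hour > ratio * prev_hour
    starts = rising & ~rising.shift(1, fill_value=False)
    return dp.index[starts]
```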

### **4.2 Results**

We quantify our results using an α−λ graph. Intuitively, this graph represents the probability that, at a certain moment in the run to failure, the RUL prediction (λ) is within a pre-defined level of precision (α) [15]. We use a concise representation of the α−λ quality: rather than time into the run, we put the RUL on the x-axis, while the y-axis displays the probability. This representation allows us to visually compare different techniques. All analyses are done using 5-fold cross-validation; the results presented in the graphs are averages over the 5 folds.

Figures 3a, b, and c show the performance of the TBSP technique for various parameter settings. Figure 3a compares the performance of TBSP when fitting various types of curves (second ('poly2') and third ('poly3') order polynomials,

<sup>1</sup> An anonymized version of the data is made available at: https://surfdrive.surf.nl/ files/index.php/s/1dTFFXfZ7woeSUA.

**Fig. 3.** Comparison of hyperparameter settings.

exponential curves ('exp1'), and the sum of two exponential curves ('exp2')), Fig. 3b compares Manhattan and Euclidean distance, and Fig. 3c shows the sensitivity to the number of similar curves k. The graphs show that TBSP performs best with an exponential curve for short-term (<48 h) predictions and for k = 2, 3, or 4, while there is little to no performance difference between Manhattan and Euclidean distance or between k = 2, 3, or 4. For those reasons, we parameterize TBSP with exponential curves, Euclidean distance as the distance metric, and 3 similar curves to make the prediction.

Figure 3d shows the performance of the Bayesian updating technique for various choices of the set of runs on which the prior is based. We consider four alternatives. In the first, no prior is defined and the prediction is computed based on the current run only. In the second, the prior distribution is based on all runs in the library. In the third, we create a prior distribution by fitting the run whose run-to-failure time is (closest to) the average. In the fourth, we create a prior distribution by fitting the shortest, the longest, and the average run. The figure shows that for long-term predictions, a prior fitted on the 'average', shortest, and longest runs performs best, while for short-term predictions, a prior fitted on the whole library performs best.

**Fig. 4.** Performance across moments for setting priors.

Figure 4 shows the performance of Bayesian updating with similarity-based priors for various settings of the moment at which the priors are determined. The best performance is obtained when priors are determined 5 h into the current run; 10, 15, and 20 h were also considered. The number of similar runs on which the priors are based is a further parameter of this technique. Basing the priors on the 3 most similar runs led to the best results when comparing priors based on 1, 2, 3, 4, 5, and 10 similar runs.

Figure 5 shows the results for the various prediction techniques: TBSP, Bayesian updating, and Bayesian updating with similarity-based priors. The results show a clear distinction in the performance of the different techniques. TBSP performs best for short-term (<48 h before failure) predictions, while Bayesian updating with similarity-based priors performs best in the long term (150–200 h before failure). This is expected: for long-term prediction, Bayesian updating with similarity-based priors benefits from being based both on similar runs and on general Bayesian behavior, while after some updates the impact of the priors diminishes and the behavior approaches that of plain Bayesian updating. TBSP, on the other hand, benefits from an increasingly accurate estimate of the runs it is close to as time progresses.

**Fig. 5.** Overall comparison of techniques.

### **5 Conclusions**

In a case study, we show how techniques from the literature can be combined and parameterized to accurately predict the Remaining Useful Life (RUL) of a machine or part. While curves of the degradation of a machine or part over time typically have a similar shape, the challenge is that operational constraints, which may be unknown, influence the exact parameterization of that curve, as evidenced by the real-life runs displayed in Figs. 1 and 2. Therefore, we propose a similarity-based prediction technique: while it makes little sense to compare the current run with all previously observed runs, it is quite likely that *some* historical runs are similar to the current run because they share similar operational constraints, hence providing us with powerful predictive information.

This paper proposes a new similarity-based prediction technique, in which we obtain a probability distribution of the RUL through Bayesian updating, where the priors of the Bayesian distribution are calculated based on a careful selection of previously seen runs. As evidenced by Fig. 5, our technique outperforms alternative techniques in the case study by a large margin in the long-term region. For predicting the RUL at shorter horizons, Fig. 5 clearly indicates that other methods work better.

While we studied the performance of RUL prediction techniques in the context of a particular case study, in many other domains degradation patterns have similar properties. In particular, in many other domains: run to failure data has a 'sawtooth' shape as in Fig. 1, degradation depends on operational conditions that are unknown (e.g., because they are not measured), and long-term predictions are of interest (e.g., for planning maintenance activities). In such situations our technique can also be expected to work well.

## **References**



# **Orometric Methods in Bounded Metric Data**

Maximilian Stubbemann1,2(B) , Tom Hanika<sup>2</sup> , and Gerd Stumme1,2

<sup>1</sup> L3S Research Center, Leibniz University of Hannover, Hannover, Germany
{stubbemann,stumme}@l3s.de
<sup>2</sup> Knowledge and Data Engineering Group, University of Kassel, Kassel, Germany
{stubbemann,hanika,stumme}@cs.uni-kassel.de

**Abstract.** A large amount of the data accommodated in knowledge graphs (KG) is metric. For example, the Wikidata KG contains a plenitude of metric facts about geographic entities like cities or celestial objects. In this paper, we propose a novel approach that transfers orometric (topographic) measures to bounded metric spaces. While these methods were originally designed to identify relevant mountain peaks on the surface of the earth, we demonstrate how to use them for metric data sets in general, notably for metric sets of items enclosed in knowledge graphs. Based on this, we present a method for identifying outstanding items using the transferred valuation functions isolation and prominence. Building on this, we envision an item recommendation process. To demonstrate the relevance of the valuations for such processes, we evaluate the usefulness of isolation and prominence empirically in a machine learning setting. In particular, we find structurally relevant items in the geographic population distributions of Germany and France.

**Keywords:** Metric spaces · Orometry · Knowledge graphs · Classification

## **1 Introduction**

Knowledge graphs (KG), such as DBpedia [15] or Wikidata [24], are the state of the art for storing information and drawing knowledge from it. They represent knowledge through graphs and consist essentially of *items*, which are related through *properties* and *values*. This enables them to fulfill the task of giving exact answers to exact questions. However, their ability to present a concise overview of collections of items with metric distances is limited. The number of such data sets in Wikidata is tremendous, e.g., the set of all cities of the world, including their geographic coordinates. Further examples are celestial bodies and their trajectories or, more generally, feature spaces of data mining tasks.

One approach to understand such metric data is to identify outstanding elements, i.e., outstanding items. Based on such elements it is possible to compose or enhance item recommendations to users. For example, such recommendations could provide a set of the most relevant cities in the world with respect

**Fig. 1.** Isolation: minimal horizontal distance to another point of at least equal height. Prominence: minimal vertical descent to reach a point of at least equal height.

to being outstanding in their local surroundings. However, identifying outstanding items in metric data sets is a challenging task. In cases where the metric space is equipped with an additional valuation function, this task becomes more feasible. Such functions, often called *scores* or *height* functions, are often naturally provided: cities may be ranked by population; the importance of scientific authors, by the h-index [12]. A naïve approach for recommending relevant items in such settings would be: items with higher scores are more relevant. While this method seems reasonable for many applications, obstacles arise if the "highest" items concentrate in a specific region of the underlying metric space. For example, representing the cities of the world by the twenty most populated ones would include no western European city.<sup>1</sup> Recommending the 100 highest mountains would not lead to knowledge about the mountains outside of Asia.<sup>2</sup>

Our novel approach shall overcome this problem: we combine the valuation measure (e.g., "height") and distances to provide new valuation functions on the set of items, called *prominence* and *isolation*. These functions rate items based on their height in relation to the valuations of the surrounding items. This results in valuation functions on the set of items that reflect the extent to which an item is locally outstanding. The basic idea is the following: the prominence values an item based on the minimal descent (w.r.t. the height function) that is needed to get to another point of at least the same height. The isolation, sometimes also called *dominance radius*, values the distance to the next higher point w.r.t. the metric (Fig. 1). These measures are adapted from the field of topography, where isolation and prominence are used to identify outstanding mountain peaks. We base our approach on [22], where the authors proposed prominence and dominance for networks. We generalize these to the realm of bounded metric spaces.

We provide insights into the novel valuation functions and demonstrate their ability to identify relevant items for a given topic in metric knowledge graph applications. The contributions of this paper are as follows: • We propose prominence and isolation for bounded metric spaces. For this we generalize the results in [22] and overcome the limitation to finite, undirected graphs. • We demonstrate an artificial machine learning task for evaluating our novel valuation functions in metric data. • We introduce an approach for using prominence and isolation

<sup>1</sup> https://en.wikipedia.org/wiki/List_of_largest_cities on 2019-06-16.

<sup>2</sup> https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth on 2019-06-16.

to enrich metric data in knowledge graphs. We show empirically that this information helps to identify a set of representative items.

## **2 Related Work**

Item recommendation for knowledge graphs is a contemporary topic of high interest in research. Investigations cover, for example, music recommendation using content and collaborative information [17], or movie recommendations using PageRank-like methods [5]. The former is based on the common notion of embedding, i.e., embedding the graph structure into d-dimensional real vector spaces. The latter operates on the relational structure itself. Our approach differs from those, as it is based on combining a valuation measure with the metric of the data space. Nonetheless, given an embedding into a finite-dimensional real vector space, one could apply isolation and prominence there as well.

The novel valuation functions prominence and isolation are inspired by topographic measures, which have their origin in the classification of mountain peaks. The idea of ranking peaks solely by their absolute height was already deprecated in 1978 by Fry in his work [8]. The author introduced prominence for geographic mountains, a function still investigated in this realm, e.g., in Torres et al. [23], where the authors used deep learning methods to identify prominent mountain peaks. Another recent step was made in [14], where the authors investigated methods for discovering new ultra-prominent mountains. Isolation and further valuation functions motivated in the orometric realm are collected in [11]. A well-known procedure for identifying peaks and saddles in 3D terrain data is described in [6]. However, these approaches rely on data that approximates a continuous terrain surface via a regular square grid or a triangulation; our data cannot fulfill this requirement. Recently, the idea of transferring orometric functions to different realms of research has gained attention: the authors of [16] used topographic prominence to identify population areas in several U.S. states. In [22], the authors Schmidt and Stumme transferred prominence and dominance, i.e., isolation, to co-author graphs in order to evaluate their potential for identifying ACM Fellows. We build on this for proposing our valuation functions on bounded metric data. This generalization results in a wide range of applications.

## **3 Mathematical Modeling**

While the Wikidata knowledge graph itself could be analyzed with the prominence and isolation measures for networks, this paper focuses on bounded metric data sets. Analyzing such data sets is more suitable, since real-world networks often suffer from a small average shortest path length [26]. This leads to a low number of outstanding items: an item is outstanding if it is "higher" than the items that have a low distance to it, which makes for a strict measure on many real-world network data sets when the shortest path length is used as the metric. Hence, we model our functions for bounded metric data instead of networks.

We consider the following scenario: we have a data set M consisting of a set of items, in the following called *points*, equipped with a metric d and a valuation function h, in the following called *height function*. The goal of the orometric (topographic) measures prominence and isolation is to provide measures that reflect the extent to which a point is locally outstanding in its neighborhood.

More precisely, let M be a non-empty set and d : M × M → R≥0. We call d a *metric* on the set M iff • ∀x, y ∈ M : d(x, y) = 0 ⟺ x = y, • d(x, y) = d(y, x) for all x, y ∈ M (symmetry), and • ∀x, y, z ∈ M : d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality). If d is a metric on M, we call (M, d) a *metric space*, and if M is finite we call (M, d) a *finite metric space*. If there exists a C ∈ R≥0 such that d(m, n) ≤ C for all m, n ∈ M, we call (M, d) *bounded*. For the rest of our work we assume that |M| > 1 and that (M, d) is a bounded metric space. Additionally, M is equipped with a height function (valuation/score function) h : M → R≥0, m ↦ h(m).

**Definition 1 (Isolation).** *Let* (M, d) *be a bounded metric space and let* h : M → R≥0 *be a height function on* M*. The* isolation *of a point* m ∈ M *is then defined as follows:*


$$\text{iso}(m) := \inf\{d(m, n) \mid n \in M \setminus \{m\} \wedge h(n) \ge h(m)\}.$$

The isolation of a mountain peak is often called the *dominance radius*, or sometimes the *dominance*. Since the term *orometric dominance* of a mountain sometimes refers to the quotient of prominence and height, we stick to the term *isolation* to avoid confusion. While the isolation can be defined within the given setup, we have to equip our metric space with some more structure in order to transfer the notion of prominence. Informally, the prominence of a point is given by the minimal vertical distance one has to descend to get to a point of at least the same height. To adapt this measure to our setup of metric spaces with a height function, we have to define what a path is. Structures that provide paths in a natural way are graphs. For a given graph G = (V, E) with vertex set V and edge set E ⊆ $\binom{V}{2}$, *walks* are sequences of nodes (v_i)_{i=0}^{n} which satisfy {v_{i−1}, v_i} ∈ E for all i ∈ {1, ..., n}. If additionally v_i ≠ v_j for i ≠ j, we call such a sequence a *path*. For v, w ∈ V we say v and w are *connected* iff there exists a path connecting them. Furthermore, we denote by G(v) the *connected component* of G containing v, i.e., G(v) := {w ∈ V | v is connected with w}.

To use the prominence measure as introduced by Schmidt and Stumme in [22], which is defined on graphs, we have to derive an appropriate graph structure from our metric space. The topic of graphs embedded in finite-dimensional vector spaces, so-called spatial networks [2], is of current interest. These networks appear frequently in real-world scenarios, for example in the modeling of urban street networks [13]. Note that our setting, in contrast to the aforementioned, is not based on an a priori given graph structure. In our scenario the graph structure must be derived from the structure of the given metric space.

Our approach is to construct a *step size graph* or *threshold graph*, where we consider points in the metric space as nodes and connect two points through an edge iff their distance is smaller than a given threshold δ.

**Definition 2 (**δ**-Step Graph).** *Let* (M, d) *be a metric space and* δ > 0*. We define the* δ-step graph *or* δ-threshold graph*, denoted by* G_δ*, as the tuple* (M, E_δ) *via*

$$E\_\delta := \{ \{m, n\} \in \binom{M}{2} \mid d(m, n) \le \delta \}. \tag{1}$$

This approach is similar to the one found in the realm of random geometric graphs, where it is common to define random graphs by placing points uniformly in the plane and connecting them via edges iff their distance is less than a given threshold [21]. Having introduced a way to derive a graph that depends only on the metric space, we use a slight modification of the prominence definition from [22] for networks.

**Definition 3 (Prominence in Networks).** *Let* G = (V, E) *be a graph and let* h : V → R≥0 *be a height function. The* prominence prom_G(v) *of* v ∈ V *is defined by*

$$\text{prom}\_G(v) := \min\{h(v), \text{mindesc}\_G(v)\}\tag{2}$$

*where* mindesc_G(v) := inf{max{h(v) − h(u) | u ∈ p} | p ∈ P_v}*. The set* P_v *consists of all paths to vertices* w *with* h(w) ≥ h(v)*, i.e.,* P_v := {(v_i)_{i=0}^{n} ∈ P | v_0 = v ∧ v_n ≠ v ∧ h(v_n) ≥ h(v)}*, where* P *denotes the set of all paths of* G*.*

Informally, mindesc_G(v) reflects the minimal descent needed to get to a vertex in G which has a height of at least h(v). The definition makes use of the fact that inf ∅ = ∞; in this case prom_G(v) equals the height of v. A distinction from the definition in [22] is that we now consider all paths and not just shortest paths. This change better reflects the calculation of prominence for mountains. Based on this, we transfer the notions above to metric spaces.

**Definition 4 (**δ**-Prominence).** *Let* (M, d) *be a bounded metric space and* h : M → R≥0 *a height function. We define the* δ*-prominence* prom_δ(m) *of* m ∈ M *as* prom_{G_δ}(m)*, i.e., the prominence of* m *in* G_δ *from Definition 2.*

We now have a prominence notion for all bounded metric spaces that depends on a parameter δ. For any knowledge discovery procedure, choosing such a parameter is a demanding task. Hence, we provide in the following a natural choice for δ. We consider only those values of δ for which the corresponding G_δ does not exhibit noise, i.e., for which no element is without a neighbor.

**Definition 5 (Minimal Threshold).** *For a bounded metric space* (M, d) *with* |M| > 1 *we define the* minimal threshold δ<sup>M</sup> *of* M *as*

$$\delta\_M := \sup\{\inf\{d(m, n) \mid n \in M \setminus \{m\}\} \mid m \in M\}.$$

Based on this definition a natural notion of prominence for metric spaces (equipped with a height function) emerges via a limit process.

**Lemma 1.** *Let* M *be a bounded metric space and* δ<sup>M</sup> *as in Definition 5. For* m ∈ M *the following descending limit exists:*

$$\lim\_{\delta \searrow \delta\_M} \text{prom}\_{\delta}(m). \tag{3}$$

*Proof.* Fix any δ̂ > δ_M and consider on the open interval ]δ_M, δ̂[ the function that maps δ to prom_δ(m): prom_(·)(m) : ]δ_M, δ̂[ → R, δ ↦ prom_δ(m). It is sufficient to show that prom_(·)(m) is monotone decreasing in δ and bounded from above. Since prom_δ(m) ≤ h(m) holds for any δ, it remains to show monotonicity. Let δ₁, δ₂ ∈ ]δ_M, δ̂[ with δ₁ ≤ δ₂. Considering the corresponding graphs (M, E_{δ₁}) and (M, E_{δ₂}), it is easy to see that E_{δ₁} ⊆ E_{δ₂}. Hence, in Eq. (2) we have to consider more paths for E_{δ₂}, resulting in a value of the infimum that is no larger. We obtain prom_{δ₁}(m) ≥ prom_{δ₂}(m), as required.

**Definition 6 (Prominence in Metric Spaces).** *If* M *is a bounded metric space with* |M| > 1 *and a height function* h*, the prominence* prom(m) *of* m *is defined as:*

$$\text{prom}(m) := \lim\_{\delta \searrow \delta\_M} \text{prom}\_{\delta}(m).$$

Note that if we want to compute prominence on a real-world finite metric data set, it is possible to compute the prominence values directly: in that case the supremum in Definition 5 can be replaced by a maximum and the infimum by a minimum, which leads to prom(m) being equal to prom_{δ_M}(m). There are results for efficiently creating such step graphs [3]. However, for our needs in this work, in particular in the experiment section, a quadratic brute-force approach for generating all edges is sufficient; a sketch of such a computation follows below. We want to show that our prominence definition for bounded metric spaces is a natural generalization of Definition 3.
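The following sketch illustrates, under our own naming and API choices (none of this code is from the paper), how isolation and prominence can be computed brute-force on a finite data set: δ_M as in Definition 5, the δ_M-step graph via quadratic edge generation, and mindesc via a bottleneck variant of Dijkstra's algorithm that maximizes the minimum height along a path.

```python
# Sketch: brute-force isolation and prominence on a finite metric data set.
# D is an (n x n) numpy array of pairwise distances, h an array of heights;
# all names are illustrative.
import heapq
import numpy as np

def minimal_threshold(D):
    """delta_M (Definition 5): the largest nearest-neighbour distance."""
    n = D.shape[0]
    return max(min(D[i, j] for j in range(n) if j != i) for i in range(n))

def isolation(D, h):
    """iso(m) (Definition 1); np.inf for the globally highest point(s)."""
    n = D.shape[0]
    iso = np.full(n, np.inf)
    for i in range(n):
        cand = [D[i, j] for j in range(n) if j != i and h[j] >= h[i]]
        if cand:
            iso[i] = min(cand)
    return iso

def prominence(D, h, delta=None):
    """prom(m) (Definitions 2-4, 6) on the delta-step graph.

    mindesc is computed with a bottleneck ('widest path') Dijkstra that
    maximises the minimum height seen along a path from v to any other
    vertex that is at least as high as v."""
    n = D.shape[0]
    if delta is None:
        delta = minimal_threshold(D)  # prom(m) = prom_{delta_M}(m) here
    adj = [[j for j in range(n) if j != i and D[i, j] <= delta]
           for i in range(n)]  # quadratic brute-force edge generation
    prom = np.empty(n)
    for v in range(n):
        heap = [(-h[v], v)]   # max-heap on the bottleneck height b
        best = {v: h[v]}      # best bottleneck found per vertex
        found = None
        while heap:
            b, u = heapq.heappop(heap)
            b = -b
            if u != v and h[u] >= h[v]:
                found = b     # first qualifying pop is optimal
                break
            if b < best.get(u, -np.inf):
                continue      # stale heap entry
            for w in adj[u]:
                nb = min(b, h[w])
                if nb > best.get(w, -np.inf):
                    best[w] = nb
                    heapq.heappush(heap, (-nb, w))
        # inf over the empty path set is infinity, so prom(v) = h(v)
        prom[v] = h[v] if found is None else min(h[v], h[v] - found)
    return prom
```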

**Lemma 2.** *Let* G = (V,E) *be a finite, connected graph with* |V | ≥ 2*. Consider* V *equipped with the shortest path metric as a metric space. Then the prominence* promG(·) *from Definition 3 and* prom(·) *from Definition 6 coincide.*

*Proof.* Let M := V be equipped with the shortest path metric d of G. As G is connected and has more than one node, we have δ_M = 1. Hence, (M, E_{δ_M}) from Definition 2 and G are equal. Therefore, the prominence terms coincide.

## **4 Application**

*Score Based Item Recommending.* As an application we envisage a general approach for a score-based item recommending process. The task of item recommending with knowledge graphs is a current research topic [17,18]. However, most approaches are solely based on knowledge about preferences of the user and graph-structural properties, often accessed through KG embeddings [19]. The recommendation process we imagine differs from those: we build on a procedure that exploits the information entailed in the metric aspects of the data together with some (often naturally present) height function. We are aware that this limits our approach to metric data in KGs. Nonetheless, given the large amounts of metric item sets in prominent KGs, we claim the existence of a plenitude of applications. For example, while considering sets of cities, such a system could recommend a *relevant* subset, based on a height function, like population, and a metric, like geographic distance. By doing so, we introduce a new source of information for recommending metric data in relational structures like KGs. A common approach for analyzing and learning in KGs is embedding; there is an extensive amount of research about that, see for example [4,25]. Since our novel methods rely solely on bounded metric spaces and some valuation function, one may apply them after the embedding step as well. In particular, one may use isolation and prominence for investigating or completing KG embeddings. This constitutes our second envisioned application. Finally, common item recommending scores/ranks can also be used as height functions in our sense. Hence, computing prominence and isolation for already set-up recommendation systems is another possibility. Here, our valuation functions have the potential to enrich the recommendation process with additional information; in such a way our measures can provide a novel additional aspect to existing approaches. The realization and evaluation of our proposed recommendation approach is out of the scope of this paper. Nonetheless, we want to provide some first insights into the applicability of valuation functions for item sets based on empirical experiments. As a first experiment, we evaluate whether isolation and prominence help to separate important and unimportant items in specific item sets in Wikidata. In detail, we evaluate whether the valuation functions help to differentiate important and unimportant municipalities in France and Germany, solely based on their geographic metric properties and their population as height.

#### **4.1 Resulting Questions**

We are given a bounded metric space M, which represents the data set, and a given height function h. The following questions shall evaluate whether our functions isolation and prominence provide useful information about the relevance of given points in the metric space. If (M, d, h) is a metric space equipped with an additional height function, let c : M → {0, 1} be a binary function that classifies the points in the data set as relevant (1) or not (0). We connect this to our running example using a function that classifies municipalities as having a university (1) or not having a university (0). We admit that the underlying classification is not meaningful in itself; it treats a real geographic case while our model could also handle more abstract scenarios. However, since this setup is essentially a benchmark framework (in which we assume cities with universities to be more relevant), we refrain from employing a more meaningful classification task in favor of a controllable classification scenario. Our research questions are now: **1. Are prominence and isolation alone characteristic for relevance?** We use isolation and/or prominence for a given set of data points as features. To which extent do these features improve learning a classification function for relevance? **2. Do prominence and isolation provide additional information, not catered for by the absolute height?** Do prominence and isolation improve the prediction of relevance compared to just using the height? Does a classifier that uses prominence and isolation as additional features produce better results than a classifier that just uses the height? We take on the questions stated above in the following section and present experimental evidence in the realm of a KG.

## **5 Experiments**

We extract information about municipalities in the countries of Germany and France from the Wikidata KG. This KG is a structure that stores knowledge via *statements*, linking *entities* via *properties* to *values*. A detailed description can be found in [24], while [9] gives an explicit mathematical structure to the Wikidata graph and shows how to use it for extracting implicational knowledge from Wikidata subsets. We investigate whether prominence and isolation of a given municipality can be used as features to predict university locations in a classification setup. We use the query service of Wikidata<sup>3</sup> to extract points on the country maps of Germany and France and to extract all their universities.<sup>5</sup> We report all SPARQL queries employed on GitHub.<sup>4</sup>


<sup>3</sup> https://query.wikidata.org/.

<sup>4</sup> https://github.com/mstubbemann/Orometric-Methods-in-Bounded-Metric-Data.

<sup>5</sup> Queried on 2019-08-07.


## **5.1 Binary Classification Task**

*Setup.* We compute prominence and isolation for all data points and normalize them, as well as the height. The data used for the classification task consists of the following information for each city: the height, the prominence, the isolation, and the binary information whether the city has a university. Since our data set is highly imbalanced, common classifiers tend to simply predict the majority class. To overcome the imbalance, we use inverse penalty weights with respect to the class distribution. We want to stress again that the goal of the classification task introduced here is not to identify the best classifier; rather, we want to produce evidence for the applicability of isolation and prominence as features for learning a classification function. We decide to use logistic regression with L² regularization and support vector machines [7] with a radial kernel. For our experiments we use Scikit-Learn [20]. As penalty factor for the SVC we set C = 1, and additionally experiment with C ∈ {0.5, 1, 2, 5, 10, 100}. For γ we rely on previous work by [1] and set it to one. For all combinations of population, isolation, and prominence we use 100 iterations of 5-fold cross-validation.

*Evaluation.* We use the g-mean (i.e., geometric mean) as evaluation function. Consider for this the denotations TN (True Negative), FP (False Positive), FN (False Negative), and TP (True Positive). Overall accuracy is highly misleading for heavily imbalanced data. Therefore, we evaluate the classification decisions by the geometric mean of the accuracy on the positive instances, acc⁺ := TP/(TP + FN), and the accuracy on the negative instances, acc⁻ := TN/(TN + FP). The g-mean score is then defined by the formula g-mean := √(acc⁺ · acc⁻). This evaluation function is established in imbalanced data mining; it is mentioned in [10] and used for evaluation in [1].

<sup>6</sup> Last checked on 2019-10-26.


**Table 1.** Results of the classification task. We do 100 rounds of 5-fold cross-validation and shuffle the data between rounds. For each round we compute the g-mean value and then average over the 100 rounds.

po = population, pr = prominence, is = isolation

SVM = Support Vector Machine, LR = Logistic Regression

We compare the g-mean values for the following cases. First, we train a classifier purely on the feature population, prominence, or isolation. Second, we try combinations of them for the training process. We consider the classifier trained on the population feature as the baseline. An increase in g-mean when using prominence or isolation together with the population feature is evidence for the utility of the introduced valuation functions. Even stronger evidence is a favorable comparison of classifiers trained on isolation/prominence alone against this baseline.

In our experiments we do not expect high g-mean values, since the placement of university locations depends on many additional factors, including the historical evolution of the country and political decisions. Still, the described evaluation setup is sufficient to demonstrate the potential of the novel features.
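A compact sketch of this evaluation setup with Scikit-Learn follows. The synthetic data stands in for the (population, prominence, isolation) feature matrix, `class_weight='balanced'` is one standard way to realize inverse penalty weights, and all names are placeholders of ours.

```python
# Sketch of the evaluation setup: balanced classifiers, g-mean scoring,
# and 100 rounds of reshuffled 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

def g_mean(y_true, y_pred):
    """Geometric mean of the accuracies on positive and negative instances."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# imbalanced stand-in for the real (population, prominence, isolation) data
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.9], random_state=0)

scorer = make_scorer(g_mean)
models = {
    "LR": LogisticRegression(penalty="l2", class_weight="balanced"),
    "SVM": SVC(kernel="rbf", C=1.0, gamma=1.0, class_weight="balanced"),
}
for name, model in models.items():
    rounds = []
    for r in range(100):  # reshuffle the folds each round
        cv = KFold(n_splits=5, shuffle=True, random_state=r)
        rounds.append(cross_val_score(model, X, y, cv=cv, scoring=scorer).mean())
    print(name, np.mean(rounds))
```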

*Results.* The results of the computations are depicted in Table 1. • *Isolation is a good indicator for structural relevance.* For both countries and both classifiers, isolation outperforms population. • *Combining absolute height with our valuation functions leads to better results.* • *Prominence is not useful as a solo indicator.* We draw from our results that prominence alone is not a useful indicator. Prominence is a very strict valuation function: recall that we constructed the graphs by using distance margins as indicators for edges, leading to a dense graph structure in the denser parts of the metric space. Hence, a point in a denser part has many neighbors and thus many potential paths, which may lead to a very low prominence value; from Definition 3 we see that having a higher neighbor always leads to a prominence value of zero. The minimal threshold is about 34 km for Germany and 54 km for France. Thus, a municipality has a non-vanishing prominence only if it is the most populated point within a radius of over 34 km, respectively 54 km. Only 75 municipalities of France have nonzero prominence, 40 of them university locations; Germany has 104 municipalities with positive prominence, 72 of them university locations. Thus, prominence alone is an insufficient feature for the prediction of university locations. • *Support vector machine and logistic regression lead to similar results.* On the question whether our valuation functions improve the classification compared with the population feature, support vector machines and logistic regression provide the same answer: isolation always outperforms population, and a combination of all features is always better than using the plain population feature alone. • *Support vector machine penalty parameter.* Finally, we check the results for support vector machines using the penalty parameters C ∈ {0.5, 1, 2, 5, 10, 100}. We observe that increasing the penalty results in better performance using the population feature, whereas for lower values of C, i.e., models that overfit less, we see better performance using the isolation feature. In short, the more the model overfits due to C, the less useful are the novel valuation functions introduced in this paper.

## **6 Conclusion and Outlook**

In this work, we presented a novel approach to identify outstanding elements in item sets. For this we employed orometric valuation functions, namely prominence and isolation. We investigated a computationally reasonable transfer to the realm of bounded metric spaces. In particular, we generalized previously known results that were researched in the field of finite networks.

The theoretical work was motivated by the observation that KGs, like Wikidata, contain huge amounts of metric data, often naturally equipped with some kind of height function. Based on this, we laid in this work the groundwork for a locally working item recommendation scheme.

To evaluate the capabilities for identifying locally outstanding items we selected an artificial classification task. We identified all French and German municipalities from Wikidata and evaluated whether a classifier can learn a meaningful connection between our valuation functions and the relevance of a municipality. To obtain a binary classification task and to have a benchmark, we assumed that universities are primarily located at relevant municipalities. In consequence, we evaluated whether a classifier can use prominence and isolation as features to predict university locations. Our results showed that isolation and prominence are indeed helpful for identifying relevant items.

For future work we propose to develop the conceptualized item recommender system and to investigate its practical usability in an empirical user study. Furthermore, we encourage research into the transferability of other orometrically motivated valuation functions.

**Acknowledgement.** The authors would like to express thanks to Dominik Dürrschnabel for fruitful discussions. This work was funded by the German Federal Ministry of Education and Research (BMBF) in its program "Quantitative Wissenschaftsforschung" as part of the REGIO project under grant 01PU17012.

## **References**



# **Interpretable Neuron Structuring with Graph Spectral Regularization**

Alexander Tong<sup>1</sup>, David van Dijk<sup>2</sup>, Jay S. Stanley III<sup>2</sup>, Matthew Amodio<sup>1</sup>, Kristina Yim<sup>2</sup>, Rebecca Muhle<sup>2</sup>, James Noonan<sup>2</sup>, Guy Wolf<sup>3</sup>, and Smita Krishnaswamy1,2(B)

<sup>1</sup> Yale Department of Computer Science, New Haven, USA
smita.krishnaswamy@yale.edu
<sup>2</sup> Yale Department of Genetics, New Haven, USA
<sup>3</sup> Department of Mathematics and Statistics, Université de Montréal, Mila, Montreal, Canada

**Abstract.** While neural networks are powerful approximators used to classify or embed data into lower dimensional spaces, they are often regarded as black boxes with uninterpretable features. Here we propose *Graph Spectral Regularization* for making hidden layers more interpretable without significantly impacting performance on the primary task. Taking inspiration from spatial organization and localization of neuron activations in biological networks, we use a graph Laplacian penalty to structure the activations within a layer. This penalty encourages activations to be smooth either on a predetermined graph or on a feature-space graph learned from the data via co-activations of a hidden layer of the neural network. We show numerous uses for this additional structure including cluster indication and visualization in biological and image data sets.

**Keywords:** Neural Network Interpretability · Graph learning · Feature saliency

## **1 Introduction**

Common intuitions and motivating explanations for the success of deep learning approaches rely on analogies between artificial and biological neural networks, and the mechanism they use for processing information. However, one aspect that is overlooked is the spatial organization of neurons in the brain. Indeed, the hierarchical spatial organization of neurons, determined via fMRI and other technologies [13,16], is often leveraged in neuroscience works to explore, understand, and interpret various neural processing mechanisms and high-level brain functions. In artificial neural networks (ANN), on the other hand, hidden layers offer no organization that can be regarded as equivalent to the biological one. This lack of organization poses great difficulties in exploring and interpreting

A. Tong, D. van Dijk, G. Wolf and S. Krishnaswamy—Equal contribution.


the internal data representations provided by hidden layers of ANNs and the information encoded by them. This challenge, in turn, gives rise to the common treatment of ANNs as black boxes whose operation and data processing mechanisms cannot be easily understood. To address this issue, we focus on the problem of modifying ANNs to learn more interpretable feature spaces without degrading their primary task performance.

While most neural networks are treated as black boxes, we note that there are methods in the ANN literature for understanding the activations of filters in convolutional neural networks (CNNs) [11], either by examining trained networks [24] or by learning a better representation [12,17,18,22,25], but such methods rarely apply to other types of networks, in particular dense neural networks (DNNs), where a single activation is often not interpretable on its own. Furthermore, convolutions only apply to data types where we know the feature structure a priori, as in the case of images and natural language. In the layers of a DNN, there is no enforced structure between neurons; the correspondence between neurons and concepts is only determined by the random initialization of the network. In this work, we encourage *structure between neurons* in the same layer, creating more localized and interpretable layers in dense architectures.

More specifically we propose a *Graph Spectral Regularization* to encourage arbitrary graph structure between neurons within a layer. The internal layers of a neural network are constrained to take the structure of a graph, with graph neighbors activating on similar inputs. This allows us to map the activations of a given layer over the graph and interpret new input by examining the activations. We show that graph-structuring a hidden layer causes useful, interpretable features to emerge. For instance, we show that grid-structuring a layer of a classification network creates a structure over which convolution can be applied, and local receptive fields can be traced to understand classification decisions.

While imposing a known graph structure gives interpretable results the majority of the time, there are circumstances where we would like to learn the graph structure from data. In such cases we can learn and emphasize the natural graph structure of the feature space. We do this by an iterative process of encoding the data and modifying the graph based on the feature co-activation patterns. This procedure reinforces existing patterns in the data and allows us to learn an abstracted graph structure of features in high-dimensional domains such as single-cell RNA sequencing.

The main contributions of this work are as follows: (1) Demonstration of hierarchical, spatial, and smoothed feature maps for interpretability in dense networks. (2) A novel method for learning and reinforcing the natural graph structure for complex feature spaces. (3) Demonstration of graph learning and abstraction on single-cell RNA-sequencing data.

## **2 Related Work**

*Disentangled Representation Learning:* While there is no precise definition of what makes for a disentangled representation, the aim is to learn a representation that axis aligns with the generative factors of the data [2,8]. [9] suggest a way to disentangle the representation of variational autoencoders [10] with β-VAE. Subsequent work has generalized this to discrete representations [5], and simple hierarchical representations [6]. These works focus on learning a single vector representation of the data, where each element represents a single concept. In contrast, our work learns a representation where groups of neurons may be involved in representing a single concept. Moreover, disentangled representation learning can only be applied to unsupervised models and only the most compressed level of either an autoencoder [9] or generative adversarial network as in [4], whereas graph spectral regularization (GSR) can be applied to any or all layers of the network.

*Graph Structure in ANNs:* Graph-based penalties have been used in the graph signal processing literature [3,21,26], but are rarely used in an ANN setting. In the biological data setting, [14] used a graph penalty in sparse logistic regression on gene expression data. Another way of utilizing graph structure is through graph convolutional networks (GCN). GCNs are a related body of work introduced by [7] and expanded on by [19], but focus on a different set of problems (for an overview see [23]). GCNs require a known graph structure, whereas we focus on learning a graph representation of general data. This learned graph representation could be used as the input to a GCN, similar to our MNIST example.

### **3 Enforcing Graph Structure**

We consider the intra-layer relationships between neurons or larger structures such as capsules. For a given layer of neurons we construct a graph G = (V, E) with V = {v₁, ..., v_N} the set of vertices and E ⊆ V × V the set of edges. Let W be the weighted symmetric adjacency matrix of size N × N with W_ij = W_ji ≥ 0 representing the weight of the edge between v_i and v_j. The graph Laplacian L is then defined as L = D − W, where D_ii = Σ_j W_ij and D_ij = 0 for i ≠ j.

To enforce smoothing we use the Laplacian smoothing loss. On an activation vector z and a fixed Laplacian L we formulate the graph spectral regularization function G as:

$$G(z, \mathbf{L}) = z^T \mathbf{L} z = \frac{1}{2}\sum\_{ij} W\_{ij} \|z\_i - z\_j\|^2 \tag{1}$$

where ‖·‖ denotes the Frobenius norm (for scalar neuron activations this reduces to the squared difference). We add this term to the reconstruction or classification loss with a weighting term α. This adds an additional objective: activations should be smooth along the graph defined by L. This optimization procedure applies to any multi-layer model and any valid graph Laplacian. We apply this algorithm to grid and hierarchical graph structures on both autoencoder and classification dense architectures.


**Algorithm 1.** Learning the feature graph.

**Input:** batches x_i, model M with latent layer activations z_i, regularization weight α
**Pre-train** M on x_i with α = 0
**for** i = 1 **to** T **do**
Create the graph Laplacian L_i from the activations z_i
**for** j = 1 **to** m **do**
Train M on x_i with α = w and L = L_i, using MSE + the loss in Eq. (1)
**end for**
**end for**

#### **3.1 Learning and Reinforcing an Abstracted Feature-Space Graph**

Instead of enforcing smoothness over a fixed graph, we can learn a feature graph from the data (see Algorithm 1), using the neural network activations themselves to bootstrap the process. Note that most graph- and kernel-based methods are applied over the space of observations, but not over the space of features. One reason is that it is even more difficult to define a distance between features than between observations. To circumvent this problem, we propose to learn a feature graph in the latent space of a neural network, using feature co-activations as a measure of similarity.

We proceed by creating a graph using feature activation similarity, then applying this graph using Laplacian smoothing for a number of iterations. This converges to a graph of a latent feature space at the level of granularity of the number of dimensions in the corresponding layer.

Our algorithm for learning the graph consists of two phases. First, a pretraining phase where the model is learned with no graph regularization. Second, we alternate between constructing the graph from the similarities of the embedding layer features and further training the network for reconstruction and smoothness on the graph. There are many ways to create a graph from the feature × datapoint activation matrix. We use an adaptive Gaussian kernel,

$$K(z\_i, z\_j) = \frac{1}{2} \exp\left(-\frac{||z\_i - z\_j||\_2^2}{\sigma\_i^2}\right) + \frac{1}{2} \exp\left(-\frac{||z\_i - z\_j||\_2^2}{\sigma\_j^2}\right).$$

where σ_i is the adaptive bandwidth for node i, which we set as the distance to the k-th nearest neighbor of feature i. An adaptive-bandwidth Gaussian kernel is necessary for general architectures, as the scale of the activations is not fixed. Batch normalization can also be used to limit the activation scale.
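A sketch of this kernel construction on a features × datapoints activation matrix follows; the code, the default `k`, and all names are our own illustrative choices.

```python
# Sketch: adaptive Gaussian kernel between features from their activations.
# A is an (n_features x n_datapoints) activation matrix; k is the
# nearest-neighbour index used for the adaptive bandwidth sigma_i.
import numpy as np

def adaptive_gaussian_kernel(A, k=5):
    # pairwise squared euclidean distances between feature activation rows
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    d = np.sqrt(sq)
    # sigma_i: distance from feature i to its k-th nearest neighbour
    sigma = np.sort(d, axis=1)[:, k]
    K = 0.5 * np.exp(-sq / sigma[:, None] ** 2) \
      + 0.5 * np.exp(-sq / sigma[None, :] ** 2)
    np.fill_diagonal(K, 0.0)  # no self-loops in the feature graph
    return K  # symmetric weighted adjacency; Laplacian L = D - K
```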

Since we alternate between smoothing on the graph and constructing a new graph from the smoothed signal, the learned graph converges to a steady state, where the mean squared error acts as a repulsive force that stops the graph from collapsing further. We present the results of graph learning on a biological dataset and show that the learned structure adds interpretability to the activations.

## **4 Experiments**

Through examples, we show that visualizing the activations of data on the regularized layer highlights relationships in the data that are not easily visible without it. We establish this with two examples on fixed graphs, then move to graphs learned from the structure of the data with two examples of hierarchical structure and two with progression structure.

## **4.1 Fixed Structure**

Enforcing a fixed graph structure localizes activations for similar datapoints to a region of the graph. Here we show that enforcing an 8×8 grid graph on a layer of a dense MNIST classifier causes receptive fields to form, where each digit occupies a localized group of neurons on the grid. This can, in principle, be applied to any neural network layer to group neurons activating on similar features. As in fMRI data or a convolutional neural network, we can examine the activation patterns of each localized group of neurons. As a second example, we show the usefulness of encouraging localized structure on a capsule network architecture [18], where we are able to create a globally consistent structure for better alignment of features between capsules.

**Fig. 1.** Shows average activation by digit over an (8×8) 2D grid using graph spectral regularization and convolutions following the regularization layer. Next, we segment the embedding space by class to localize portions of the embedding associated with each class. Notice that the digit 4 here serves as the null case and does not show up in the segmentation. Finally, we show the top 10% activation on the embedding of some sample images. For two digits (9 and 3) we show a normal input, a correctly classified but transitional input, and a misclassified input. The highlighted regions of the embedding space correlate with the semantic description of the input.

**Enforcing Grid Structure on MNIST.** Without GSR, activations are unstructured and as a result difficult to interpret: it is difficult to visually identify even which class a digit comes from based on the activation pattern (see Fig. 1). With GSR we can organize the activations, making this representation more visually distinguishable. Since we can now treat this embedding as an image, it is possible to use a standard convolutional architecture in subsequent layers in order to further filter the encodings. When we add 3 layers of 3×3 2D convolutions with 2×2 max pooling, we see that representations for each digit are compressed into specific areas of the image. This leads to the formation of receptive fields over the network pertaining to similar datapoints. Using these receptive fields, we can extract the features responsible for digit classification. For example, features that contribute to the activation of the top right of our grid can be associated with those features that contribute to being the digit 9.

The activation patterns on the embedding layer correspond well to a human perception of the digit type. The 9 that is misclassified as a 7 both has significant activation in the 7 region of the embedding layer and looks visually close to a 7. We can now interpret the embedding layer as a sort of brain map, where regions of activation can be mapped to types of inputs. This is not possible in a standard neural network, where activations are not spatially organized.

**Fig. 2.** (a) shows the regularization structure between capsules. (b–c) show the reconstruction when one of the 16 dimensions in the DigitCaps representation is tweaked in steps of 0.05 over the range [−0.25, 0.25]. (b) Without GSR, each digit responds differently to perturbation of the same dimension. With GSR (c), a single dimension represents line thickness across all digits.

**Enforcing Node Consistency on Capsule Networks.** Capsule networks [18] represent the input as a set of vectors where norm denotes activation and each component corresponds to some abstract feature. These elements are generally unordered. Here we use GSR to order these features consistently between digits. We train a capsule net on MNIST with GSR on 16 fully connected graphs between the 10 digit capsules. In the standard capsule network, each capsule orders features randomly based on initialization. However, with GSR we obtain a *consistent feature ordering*, e.g. node 1 corresponds to line thickness across all digits. GSR enforces a more ordered and interpretable encoding where localized regions are similarly organized, and the global line thickness feature is consistently learned between digits. More generally, GSR can be used to order nodes such that features common across capsules appear together. Finally, GSR does not degrade performance much, as can be seen by the digit reconstructions in Fig. 2.

In these examples the goal was to enforce a specified structure on unstructured features, but next we will examine the case where the goal is to learn the structure of the reduced feature space.

#### **4.2 Learning Graph Structure**

Using the procedure defined in Sect. 3.1, we can learn a graph structure. We first show that depending on the data, the learned graph exhibits either cluster or trajectory structure. We then show that our framework can learn structures that are hierarchical, i.e. subclusters within clusters or trajectories within clusters. Hierarchies are a difficult structure for other interpretability methods to learn [6]. However, our method naturally captures this by allowing for arbitrary graph structure among neurons in a layer.

**Fig. 3.** Structure of the training data and snapshots of the learned graph for (a) three modules and (b) eight modules. (c) Mean and 95% CI of the number of connected components in the trained graph over 50 trials.

**Cluster Structure on Generated Data.** We construct the n*th* dataset to have exactly n feature clusters. We generate the data by first creating 2<sup>n</sup> data points representing the binary numbers from 0 to 2<sup>n</sup>−1 and then adding Gaussian noise N(0, 0.1). This creates a dataset with a ground-truth number of feature clusters: in the n*th* dataset the learned graph should have n connected components, one per independent feature. Fig. 3(a–b) shows how this graph evolves over training for 3 and 8 modules, and Fig. 3(c) shows that the learned graph recovers the correct number of connected components for each ground-truth number of clusters.
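
A minimal sketch of this construction, assuming each binary prototype is replicated a fixed number of times (the replication count is our choice for illustration):

```python
import numpy as np

def binary_cluster_data(n, copies=500, sigma=0.1, seed=0):
    """Data whose features form exactly n independent clusters: the binary
    numbers 0 .. 2^n - 1 as n-dimensional prototypes plus N(0, sigma) noise."""
    rng = np.random.default_rng(seed)
    protos = np.array([[(j >> k) & 1 for k in range(n)]
                       for j in range(2 ** n)], dtype=float)
    X = np.repeat(protos, copies, axis=0)
    return X + rng.normal(0.0, sigma, size=X.shape)

X = binary_cluster_data(3)   # dataset with 3 ground-truth feature clusters
```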

**Fig. 4.** (a) Graph structure over training iterations; (b) feature activations of parts of the trajectory; PHATE [15] embedding plots colored by (c) branch number and (d) inferred trajectory location, showing the branching structure of the data.

**Trajectory Structure on T Cell Development Data.** Next, we test graph learning on biological mass cytometry data, a high-dimensional single-cell protein dataset measured on differentiating T cells from the thymus [20]. The T cells lie along a bifurcating progression in which the cells eventually diverge into two lineages (CD4+ and CD8+). Here, the structure of the data is a trajectory (as opposed to a pattern of clusters). We can see in Fig. 4 how the activated nodes in the graph embedding layer correspond to locations along the data trajectory and, importantly, that the learned graph is a single connected component. The activated nodes (yellow) move from the bottom of the embedding to the top as T cells develop into CD8+ cells. The CD4+ lineage is also CD8- and thus looks like a mixture between the CD8+ branch and the naive T cells. The learned graph structure here has captured the transitioning structure of the underlying data.

**Fig. 5.** Graph architecture, PCA plot, and activation heatmaps of a standard autoencoder, a β-VAE [9], and a graph-regularized autoencoder, with ReLU activations normalized to [0, 1] for comparison. In the model with graph spectral regularization we can clearly decipher the hierarchical structure of the data, whereas with the standard autoencoder or the β-VAE the structure of the data is not clear.

**Clusters Within Clusters on Generated Data.** We demonstrate graph spectral regularization on data generated with sub-cluster structure. The data contains three large-scale structures, each comprising two Gaussian sub-clusters, generated in 15 dimensions (see Fig. 5). We use this dataset because it has both global and local structure. We demonstrate that our graph spectral regularized model picks up on both the global and the local structure of this dataset, where disentangling methods such as β-VAE cannot. We use a graph-structured layer of six nodes arranged as three connected node pairs and apply graph spectral regularization. After training, we find that each node pair acts as a "super node" that detects one large-scale cluster. Within each super node, each of the two nodes encodes one of the two Gaussian substructures. Thus, this specific graph topology is able to extract the hierarchical topology of the data.

**Fig. 6.** Correlation between a set of marker genes for specific cell types and embedding-layer activations, first for the standard autoencoder, then for our autoencoder with graph spectral regularization. The left heatmap is biclustered; the right heatmap is grouped by connected components of the learned graph. We can see progression, especially in the largest connected component, where features on the right of the component correspond to less developed neurons.

**Hierarchical Cluster and Trajectory Structure on Developing Mouse Cortex Data.** In Fig. 6 we learn a graph on a single-cell RNA-sequencing dataset of over 4,000 cells and over 8,000 genes. The data contains cells in the process of developing from neural stem cells into full neurons in the mouse brain. While many gene modules contribute to neuronal development, some developmental states are well studied, and we use a list of cell-type marker genes to validate our method. We use 1,000 PCA components of the data in an autoencoder with a 20-dimensional embedding space. We learn the graph using an adaptive-bandwidth Gaussian kernel, with the bandwidth for each feature set to the Euclidean distance to its nearest neighboring feature.
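
The following is a sketch of such an adaptive-bandwidth affinity between features, assuming features are compared as columns of the data matrix and that the resulting kernel is symmetrized; both details are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def adaptive_gaussian_affinity(F):
    """F: (n_samples, n_features). Returns a feature-by-feature affinity with
    per-feature bandwidth = Euclidean distance to the nearest other feature."""
    D = squareform(pdist(F.T))            # pairwise distances between features
    np.fill_diagonal(D, np.inf)
    sigma = D.min(axis=1)                 # nearest-neighbor bandwidths
    np.fill_diagonal(D, 0.0)
    K = np.exp(-D**2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(K, 0.0)              # no self-loops
    return (K + K.T) / 2                  # symmetrize
```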

Our method learns a graph with six components that represent meta-features over the gene space. We can identify each with a specific cell type or related types of cells. For example, the light green component (cluster 2) represents very early neural stem cells, as it is highly correlated with increased Aldoc, Pax6, and Sox2 expression. Most interesting is cluster 6, the largest component, which represents development into mature neurons. Within this component we see a progression from cells just past the intermediate-progenitor stage on the left (showing Eomes expression) to more mature neurons with higher expression of Tbr1 and Sox5. With a standard autoencoder we cannot see the progression structure of this dataset: while some of the global structure is captured, we fail to see the progression from intermediate progenitors to mature neurons. Learning a graph allows us to create receptive fields, i.e., clusters of neurons that correspond to specific structures within the data, in this case cell types. Within these neighborhoods we can pick up on the substructure within a single cell type, i.e., its developmental trajectory.

#### **4.3 Computational Cost**

Our method increases interpretability with little loss in representation power. At low levels, GSR can be thought of as rearranging the activations so that they become spatially coherent. As with other interpretability methods, GSR is not meant to increase representation power, but to create useful representations at a low cost in power. Since GSR does not require an information bottleneck, as in β-VAE, a GSR layer can be very wide while remaining interpretable. In comparing loss of representation power, GSR should be compared to other regularization methods, namely L1 and L2 penalties (see Table 1). In all three cases a higher penalty reduces model capacity, and GSR affects performance in approximately the same way as L1 and L2 regularization do. To confirm this, we ran an MNIST classifier and measured train and test accuracy over 10 replicates. Graph spectral regularization adds slightly more overhead than elementwise activation penalties: the added cost amounts to one matrix-vector product per pass. Empirically, GSR shows computational cost similar to simple regularizers such as L1 and L2. To compare costs, we used a Keras model with TensorFlow backend [1] on an Nvidia Titan X GPU and a dual Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz with batch size 256; we observed 233 milliseconds (ms) per training step with no regularization, 266 ms for GSR, and 265 ms for L2 penalties.



## **5 Conclusion**

We have introduced a novel biologically inspired method for regularizing features of the internal layers of dense neural networks to take the shape of a graph. We showed that coherent features emerge and can be used to interpret the underlying structure of the dataset. Furthermore, when the intended graph is not known a priori, we have presented a method for learning a graph structure relevant to the data. This regularization framework takes a step towards more interpretable neural networks and is applicable to future work seeking to reveal important structure in real-world biological datasets, as we have demonstrated here.

**Acknowledgements.** This research was partially funded by IVADO (l'institut de valorisation des données) [*G.W.*]; Chan-Zuckerberg Initiative grants 182702 & CZF2019-002440 [*S.K.*]; and NIH grants R01GM135929 & R01GM130847 [*G.W., S.K.*].

## **References**



# **Comparing the Preservation of Network Properties by Graph Embeddings**

Rémi Vaudaine<sup>1(B)</sup>, Rémy Cazabet<sup>2</sup>, and Christine Largeron<sup>1</sup>

<sup>1</sup> Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, 42023 Saint-Etienne, France {remi.vaudaine,christine.largeron}@univ-st-etienne.fr <sup>2</sup> Univ Lyon, UCBL, CNRS, LIRIS UMR 5205, 69621 Lyon, France remy.cazabet@gmail.com

**Abstract.** Graph embedding is a technique that finds a new representation for a graph, usually by representing its nodes as vectors in a low-dimensional real space. In this paper, we compare some of the best-known algorithms proposed over the last few years according to four structural properties of graphs: first-order and second-order proximities, isomorphic equivalence, and community membership. To study the embedding algorithms, we introduce several measures. We show that most of the algorithms are able to recover at most one of the properties, and that some algorithms are more sensitive to the embedding space dimension than others.

**Keywords:** Graph embedding · Network properties

## **1 Introduction**

Graphs are useful to model complex systems in a broad range of domains. Among the approaches designed to study them, graph embedding has attracted a lot of interest in the scientific community. It consists in encoding parts of the graph (node, edge, substructure) or the whole graph into a low-dimensional space while preserving structural properties. Because it allows the full range of data mining and machine learning techniques that require vectors as input to be applied to relational data, it can benefit many applications.

Several surveys have been published recently [5,6,8,20,21], some of them including a comparative study of the performance of the methods on specific tasks. Among them, Cui *et al.* [6] propose a typology of network embedding methods with three families: matrix factorization, random walk, and deep learning methods. Following the same typology, Goyal et al. [8] compare state-of-the-art methods on a few tasks such as link prediction, graph reconstruction, and node classification, and analyze the robustness of the algorithms with respect

**Electronic supplementary material** The online version of this chapter (https:// doi.org/10.1007/978-3-030-44584-3 41) contains supplementary material, which is available to authorized users.

to hyper-parameters. Recently, Cai et al. [5] extended the typology with deep learning methods that do not use random walks, and with two further families: graph kernel based methods, notably helpful to represent a whole graph as a low-dimensional vector, and generative models, which provide a latent space as embedding space. For their part, Zhang et al. [21] classify embedding techniques into two types, unsupervised and semi-supervised network representation learning, and list a number of embedding methods according to the information sources they use to learn. Like Goyal et al. [8], they compare the methods on different tasks. Finally, Hamilton et al. [10] introduce an encoder-decoder framework to describe representative embedding algorithms from a methodological perspective. In this framework, the encoder is the function which maps the elements of a graph to vectors. The decoder is a function which associates a specific graph statistic to the obtained vectors; for instance, for a pair of node embeddings the decoder can give their similarity in the vector space, allowing the similarity of the nodes in the original graph to be quantified.

From this last work, we retain the encoder-decoder framework and propose to use it for evaluating the different embedding methods. To that end, using metrics that we introduce, we compare the value computed by the decoder with the value of the equivalent function on the corresponding nodes in the graph. Thus, in this paper, we adopt a different point of view from previous task-oriented evaluations. Indeed, all of them treat embeddings as a *black box*, using the obtained features without considering their properties. They ignore the fact that embedding algorithms are designed, explicitly or implicitly, to preserve particular structural properties, and that their usefulness for a given task depends on how well they capture them. Through an experimental comparative study, we therefore compare the ability of embedding algorithms to capture specific properties: first-order proximity of nodes, structural equivalence (second-order proximity), isomorphic equivalence, and community structure.

In Sect. 2, these topological properties are formally defined and measures are introduced to evaluate to what extent embedding methods encode them. Section 3 presents the studied embedding methods. Section 4 describes the datasets used for the experiments, while Sect. 5 presents the results.

## **2 Structural Properties and Metrics**

There is a wide range of graph properties that are of interest. We study several of them which are at the basis of network analysis and directly linked to the usual learning and mining tasks on graphs [13]. First, we measure the ability of an embedding method to recover the set of neighbors of a node, the first-order proximity (P1). This property is important for several downstream tasks: clustering, where vectors of the same cluster represent nodes of the same community; graph reconstruction, where two similar vectors represent two nodes that are neighbors in the graph; and node classification based, for instance, on a majority vote of the neighbors. Secondly, we evaluate the ability of embedding methods to capture the second-order proximity (P2), the fact that two nodes have the same set of neighbors. This property is especially interesting when dealing with link prediction since, in social graphs, it is assumed that two nodes that share the same friends are likely to become friends too. Thirdly, we measure how well an embedding method captures the roles of nodes in a graph, the isomorphic equivalence (P3). This property is interesting when looking for specific nodes such as leaders or outsiders. Finally, we evaluate the ability of an embedding method to detect communities (P4) in a graph, which has been an ongoing field of research for the last 20 years. Next, we define both the properties and the measures we use to quantify how well an embedding method captures them.

Let *G*(*V*, *E*) be an unweighted and undirected graph, where *V* = {*v*<sub>0</sub>, ..., *v*<sub>n−1</sub>} is the set of *n* vertices, *E* = {*e*<sub>ij</sub>} is the set of *m* edges, and *A* is its binary adjacency matrix. Graph embedding consists in encoding the graph into a low-dimensional space R<sup>d</sup>, where *d* is the dimension of the real space, with a function *f* : *V* → *Y* which maps vertices to vector embeddings while preserving some properties of the graph. We note *Y* ∈ R<sup>n×d</sup> the embedding matrix and *Y*<sub>i</sub> its *i*-th row, representing the node *v*<sub>i</sub>.

**Neighborhood or first-order proximity (P1)**: capturing the neighborhood means that the embedding aims at keeping any two nodes *v*<sub>i</sub> and *v*<sub>j</sub> that are linked in the original graph (*A*<sub>ij</sub> = 1) close in the embedding space. The measure *S* designed for this property compares the set *N*(*v*<sub>i</sub>) of neighbors of every node *v*<sub>i</sub> in the graph with the set *N*<sub>E</sub>(*v*<sub>i</sub>) of its |*N*(*v*<sub>i</sub>)| nearest neighbors in the embedding space, where |*N*(*v*<sub>i</sub>)| is its degree. By averaging over all nodes, *S* quantifies the ability of an embedding to respect the neighborhood. The higher *S*, the better P1 is preserved.

$$S(v_i) = \frac{|N(v_i) \cap N_E(v_i)|}{|N(v_i)|}, \qquad S = \frac{1}{n} \sum_{i} S(v_i) \tag{1}$$
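
A sketch of how S can be computed, assuming a Euclidean embedding distance (the distance actually used per method is listed in Table 1) and nodes labeled 0..n−1; the function name is ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_score(G, Y):
    """S of Eq. (1): for each node, the overlap between its graph neighbors
    and its |N(v)| nearest neighbors in the embedding, averaged over nodes.
    G: networkx graph with nodes 0..n-1; Y: (n, d) embedding matrix."""
    nn = NearestNeighbors(metric='euclidean').fit(Y)
    scores = []
    for v in G.nodes:
        neigh = set(G.neighbors(v))
        k = len(neigh)
        if k == 0:
            continue                      # isolated nodes are skipped
        _, idx = nn.kneighbors(Y[[v]], n_neighbors=k + 1)  # +1: v itself
        emb_neigh = set(idx[0].tolist()) - {v}
        scores.append(len(neigh & emb_neigh) / k)
    return float(np.mean(scores))
```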

**Structural equivalence or second-order proximity (P2)**: two vertices are structurally equivalent if they share many of the same neighbors [13]. To measure how well an embedding method recovers structural equivalence, we compute the distance *dist*<sub>A</sub>(*A*<sub>i</sub>, *A*<sub>j</sub>) between the rows of the adjacency matrix corresponding to each pair of nodes (*v*<sub>i</sub>, *v*<sub>j</sub>), and the distance *dist*<sub>E</sub>(*Y*<sub>i</sub>, *Y*<sub>j</sub>) between their representative vectors in the embedding space. The metric for P2 is the correlation coefficient (Spearman or Pearson) *Struct_eq* between those values over all pairs of nodes. The closer *Struct_eq* is to 1, the better P2 is preserved by the algorithm.

$$L_A(v_i, v_j) = dist_A(A_i, A_j), \qquad L_E(v_i, v_j) = dist_E(Y_i, Y_j) \tag{2}$$

with *dist*<sub>A</sub> the distance in the adjacency matrix (cosine or Euclidean) and *dist*<sub>E</sub> the embedding distance indicated in Table 1. Finally,

$$Struct\_eq = pearson(L_A, L_E) \tag{3}$$
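
A sketch of this metric with Euclidean distances on both sides (cosine is the other option mentioned above); the function name is ours.

```python
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def structural_equivalence_score(A, Y):
    """Struct_eq of Eqs. (2)-(3): correlation between pairwise distances of
    adjacency rows (L_A) and of embedding vectors (L_E) over all node pairs."""
    L_A = pdist(A, metric='euclidean')   # condensed distance vector
    L_E = pdist(Y, metric='euclidean')
    return pearsonr(L_A, L_E)[0]
```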

**Isomorphic equivalence (P3)**: two nodes are isomorphically equivalent, i.e., they play the same role in the graph, if their ego-networks are isomorphic [4]. The ego-network of node *v*<sub>i</sub> is the subgraph *EN*<sub>i</sub> made up of its neighbors and the edges between them (without *v*<sub>i</sub> itself). To go beyond a binary evaluation, for each pair of nodes (*v*<sub>i</sub>, *v*<sub>j</sub>) we compute the Graph Edit Distance *GED*(*EN*<sub>i</sub>, *EN*<sub>j</sub>) between their ego-networks with the Graph Matching Toolkit [16], and the distance *dist*<sub>E</sub>(*Y*<sub>i</sub>, *Y*<sub>j</sub>) between their representative vectors in the embedding space; *dist*<sub>E</sub> is indicated in Table 1. Finally, the Pearson and Spearman correlation coefficients between these values over all pairs of nodes serve as an indicator for the whole graph. A negative correlation means that when the distance in the embedding space is large, exp(−GED), as in [15], is small. To ease reading, we take the opposite of the correlation coefficient so that, for all measures, the best result is 1. Thus, the higher *Isom_eq*, the better P3 is preserved by the algorithm.

$$L_{Egonet}(v_i, v_j) = \exp(-GED(EN_i, EN_j)), \qquad L_E(v_i, v_j) = dist_E(Y_i, Y_j) \tag{4}$$

$$Isom\_eq = -pearson(L_{Egonet}, L_E) \tag{5}$$
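
A sketch of the quantity in Eq. (4), using networkx's built-in (slow, exact) graph edit distance in place of the Graph Matching Toolkit used in the paper; exact GED is expensive, which is why node pairs are sampled in Sect. 5.3.

```python
import numpy as np
import networkx as nx

def egonet_affinity(G, vi, vj):
    """L_Egonet of Eq. (4): exp(-GED) between the ego-networks of vi and vj,
    with the ego nodes themselves excluded (center=False)."""
    ego_i = nx.ego_graph(G, vi, center=False)
    ego_j = nx.ego_graph(G, vj, center=False)
    return np.exp(-nx.graph_edit_distance(ego_i, ego_j))
```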

**Community/cluster membership (P4)**: communities can be defined as "groups of vertices having higher probability of being connected to each other than to members of other groups" [7], while clusters are sets of elements such that elements in the same cluster are more similar to each other than to those in other clusters. We study the ability of an embedding method to transfer a community structure into a cluster structure. Given a graph with *k* ground-truth communities, we cluster the node embeddings into *k* clusters using KMeans (since *k*, the number of communities, is known). Finally, we compare this partition with the ground-truth partition using the adjusted mutual information (AMI). We also used the normalized mutual information (NMI), but both measures showed similar results. Let *L*<sub>Community</sub> be the ground-truth labeling and *L*<sub>Clusters</sub> the one found by KMeans.

$$Score = AMI(L_{Community}, L_{Clusters}) \tag{6}$$
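
A sketch of this score using scikit-learn, following the text (KMeans with k set to the number of ground-truth communities); the function name is ours.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def community_score(Y, labels, k):
    """Score of Eq. (6): AMI between ground-truth community labels and the
    KMeans clustering of the embedding Y into k clusters."""
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
    return adjusted_mutual_info_score(labels, pred)
```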

## **3 Embeddings**

There are many different graph embedding algorithms. We present a non-exhaustive list of recent methods, representative of the different families proposed in the state of the art; we refer the reader to the original papers for more information. Table 1 lists all the embedding methods used in our comparative study, with the graph similarity they are supposed to preserve and the distance used in the embedding space to relate any pair of nodes. Two versions of N2V are used (A: *p* = 0.5, *q* = 4 for local random walks; B: *p* = 4, *q* = 0.5 for deeper random walks).



## **4 Graphs**

To evaluate the embedding algorithms, we choose real and generated graphs of different sizes and types: random (R), with preferential attachment (PA), social (S), and social with community structure (SC), as shown in Table 2. While the real graphs correspond to common datasets, the generators allow us to control the characteristics of the graphs; this prior knowledge makes evaluation easier and more precise. Table 2 gives the characteristics of these graphs, divided into three groups: small, medium, and large graphs.

**Table 2.** Dataset characteristics. All graphs are provided in our GitHub repository.


## **5 Results**

We used the metrics presented in Sect. 2 to quantify the ability of the embedding algorithms described in Sect. 3 to recover four properties of the graphs: first-order proximity (P1), structural and isomorphic equivalences (P2 and P3), and community membership (P4). Due to lack of space, we show only the most representative results and provide the rest as additional material<sup>1</sup>. For the same reason, both Pearson and Spearman correlation coefficients have been computed for P2 and P3, but we only show results for Pearson, as they are similar to Spearman's. Scores are normalized so that 1 means an algorithm fully captures a property and 0 means it does not. Moreover, a dash (-) in a table indicates that a method was not able to provide a result. Note that, due to their high complexity, KKL and MDS are not computed for every graph. Finally, the code and datasets are available online on our GitHub (see footnote 1).

### **5.1 Neighborhood (P1)**

**Fig. 1.** Neighborhood (P1) as a function of embedding dimension.

<sup>1</sup> https://github.com/vaudaine/Comparing_embeddings.


**Table 3.** Neighborhood (P1). *Italic*: best in row. **Bold**: best overall.

For the first-order proximity (P1), we measure the similarity *S* as a function of the dimension *d* for all the embedding methods. For computational reasons, on large graphs the measure is computed on 10% of the nodes. Results are shown in Fig. 1 and Table 3 for *d* varying from 2 up to approximately the number of nodes. We can make several observations: for networks with communities (Dancer and ZKC), only LE and LLE capture this property reasonably well. For the Barabasi-Albert and Erdos-Renyi networks, Verse, MDS and LE reach scores higher than LLE. It means that those algorithms are able to capture this property but are fooled by complex mesoscopic organizations. These results generalize, as shown in the additional material: MDS can perform well, for instance on the email dataset; Verse works only on our random graphs; LLE works only on ZKC and Dancer; and LE shows good performance on every graph when the right dimension is chosen. For LE and LLE there is an optimal dimension: the increase of the similarity as the dimension grows can be explained by the fact that enough information is learned, while the decrease is due to eigenvalue computation in high dimensions, which is very noisy. To conclude, LE seems to be the best option to recover the neighborhood, but the right dimension has to be found.

### **5.2 Structural Equivalence (P2)**

Concerning the second-order proximity (P2), we compute the Pearson correlation coefficient, as indicated in Sect. 2, as a function of the embedding space dimension *d* and we use the same sampling strategy as for property P1.

The results are shown in Fig. 2 and Table 4. Two methods are expected to perform well because they explicitly embed the structural equivalence: SVD and SDNE. HOPE does not explicitly embed this property, but a very similar one, the Katz index. On every small graph, SVD indeed performs best, and at the lowest dimension; HOPE also has very good results. The Pearson coefficient grows with the dimension of the embedding, which implies that the best results are obtained when the dimension of the space is high enough. The other algorithms fail to recover the structural equivalence. For medium

**Fig. 2.** Structural equivalence (P2) as a function of embedding dimension.

and large graphs, as presented in Table 4, SVD and HOPE still show very good performance, and the higher the dimension of the embedding space, the higher the correlation. On large graphs SDNE also shows very good results, but it seems to need more data to learn properly. In the end, SVD seems to be the best algorithm to capture the second-order proximity: it computes a singular value decomposition, which is fast and scalable. SDNE also performs very well on the largest graphs and, in that case, can outperform SVD.

### **5.3 Isomorphic Equivalence (P3)**

With property P3, we investigate the ability of an embedding algorithm to capture node roles in a graph. To do so, we compute the graph edit distance (GED) between pairs of nodes in the graph and the distance between their vectors in the embedding. We sample nodes at random and compute the GED only between pairs of sampled nodes, which reduces the computing time drastically: we sample 10% of the nodes for medium graphs and 1% for large graphs. Experiments have shown that the results are robust to this sampling. We present, in Fig. 3 and Table 5, the evolution of the correlation coefficient with the dimension of the embedding space. The only algorithm that is expected to perform well for this property is Struc2vec. Note also that algorithms which capture the structural equivalence can also give results, since two nodes that are structurally equivalent are also isomorphically equivalent


**Table 4.** Structural equivalence (P2). *Italic*: Best in row. **Bold**: best.

but the converse is not true. For small graphs, as illustrated in Fig. 3, Struc2vec (S2V) is nearly always the best. It performs well on medium and large graphs too, as shown in Table 5. However, the results obtained on other graphs (available in the supplementary material) indicate that Struc2vec is not always much better than the other algorithms. In short, Struc2vec remains the best algorithm for this measure, but it is not fully accurate, since the correlation coefficient is not close to 1 on every graph, e.g., on Dancer10k in Table 5(b).

**Fig. 3.** Isomorphic equivalence (P3) as a function of embedding dimension.


**Table 5.** Isomorphic equivalence (P3). *Italic*: Best in row. **Bold**: best.

### **5.4 Community Membership (P4)**

To study the ability of an embedding to recover the community structure of a graph (P4), we compare, using the adjusted mutual information (AMI) and normalized mutual information (NMI), the partition given by KMeans on the node embeddings with the ground-truth partition. The results are given only for the PPG (averaged over 3 instances) and Dancer graphs (20 different graphs), for which the community structure (ground truth) is provided by the generators. To obtain them, we generated planted partition graphs (PPG) with 10 communities and 100 nodes

**Fig. 4.** AMI for community detection on PPG (top) and Dancer (bottom)

per community. We set the probability of an edge existing between communities to *p*<sub>out</sub> = 0.01 and vary the probability of an edge existing within a community, *p*<sub>in</sub>, from 0.01 (no communities) to 1 (clearly defined communities), thus varying the modularity of the graph from 0 to 0.7. For Dancer, we generate 20 graphs with varying community structure by adding between-community edges and removing within-community edges. Moreover, we also apply standard community detection algorithms, namely Louvain's modularity maximization (maxmod) [2] and Infomap [3], to the graphs. Results are shown in Fig. 4. In low dimension (d = 2, left of the figure), every embedding is less efficient than the standard community detection algorithms. In higher dimension (d = 128, right of the figure), many embedding techniques, Verse, MDS, N2V (both versions) and HOPE (on PPG), match the results of the best community detection algorithm, Louvain, and, obviously, for all methods the AMI increases with the modularity.
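
A sketch of the graph generation side of this experiment, assuming networkx's planted partition generator; the embedding step itself is method-specific and omitted, and the resulting labels can be scored with the `community_score` sketch of Sect. 2.

```python
import networkx as nx
import numpy as np

# Sweep p_in from no community structure (0.01) to clear communities (1.0),
# keeping p_out = 0.01, as in the experiment described above.
for p_in in [0.01, 0.1, 0.3, 0.6, 1.0]:
    G = nx.planted_partition_graph(10, 100, p_in, 0.01, seed=1)
    parts = G.graph['partition']                    # list of node sets
    labels = np.empty(G.number_of_nodes(), dtype=int)
    for c, part in enumerate(parts):
        labels[list(part)] = c                      # ground-truth communities
    # embed G with the method under test, then: community_score(Y, labels, 10)
```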

## **6 Conclusion**

In this paper, we studied how a wide range of graph embedding techniques preserve essential structural properties of graphs. Most recent works on graph embedding focus on introducing new methods and on task-oriented evaluation; they ignore the rationale of the methods and only consider their performance on a specific task in a particular setting. As a consequence, methods designed to embed local structures are compared with methods that should embed global structures on tasks as diverse as link prediction and community detection. In contrast, we focused on (i) the structural properties for which each algorithm has been *designed*, and (ii) how well these properties are effectively preserved in practice, on networks having diverse topological properties. As a result, we have shown that no method embeds all properties efficiently, and that most methods effectively embed only one of them. We have also shown that most recently introduced methods are outperformed, or at least challenged, by older methods specifically designed for a given property, such as LE/LLE for P1, SVD for P2, and modularity optimization for P4. Finally, we have shown that, even when designed to embed a particular property, most methods fail to do so in every setting. In particular, some algorithms (notably LE and LLE) show a strong, non-monotonic sensitivity to the number of dimensions, which can be difficult to choose in an unsupervised context.

In order to improve graph embedding methods, we believe that we need to better understand the nature of the produced embeddings. We wish to pursue this work in two directions: (1) understanding how these methods can obtain good results on tasks that depend mainly on local structures, such as link prediction, when they do not encode local properties efficiently, and (2) studying how well the meso-scale structure is preserved by such algorithms.

**Acknowledgement.** This work has been supported by BITUNAM Project ANR-18-CE23-0004 and IDEXLYON ACADEMICS Project ANR-16-IDEX-0005 of the French National Research Agency.

## **References**



## **Making Learners (More) Monotone**

Tom Julian Viering<sup>1(B)</sup>, Alexander Mey<sup>1</sup>, and Marco Loog<sup>1,2</sup>

<sup>1</sup> Delft University of Technology, Delft, The Netherlands

{t.j.viering,a.mey,m.loog}@tudelft.nl
<sup>2</sup> University of Copenhagen, Copenhagen, Denmark

**Abstract.** Learning performance can show non-monotonic behavior. That is, more data does not necessarily lead to better models, even on average. We propose three algorithms that take a supervised learning model and make it perform more monotone. We prove consistency and monotonicity with high probability, and evaluate the algorithms on scenarios where non-monotone behaviour occurs. Our proposed algorithm MTHT makes less than 1% non-monotone decisions on MNIST while staying competitive in terms of error rate compared to several baselines. Our code is available at https://github.com/tomviering/monotone.

**Keywords:** Learning curve · Model selection · Learning theory

## **1 Introduction**

It is a widely held belief that more training data usually results in better generalizing machine learning models, cf. [11,17] for instance. Several learning problems have illustrated, however, that more training data can lead to worse generalization performance [3,9,12]. For the peaking phenomenon [3], this occurs exactly at the transition from the underparametrized to the overparametrized regime. This double-descent behavior has regained interest in the context of deep neural networks [1,18], since these models are typically overparametrized. Recently, several new examples have also been found where, in quite simple settings, more data results in worse generalization performance [10,19].

It can be difficult to explain to a user that machine learning models can actually perform worse when more, possibly expensively collected, data has been used for training. Besides, it seems generally desirable to have algorithms that guarantee increased performance with more data. How can we get such a guarantee? That is the question we investigate in this work, and for it we use learning curves: such curves plot the expected performance of a learning algorithm versus the amount of training data.<sup>1</sup> In other words, we ask how we can make learning curves monotonic.

The core approach to make learners monotone is that, when more data is gathered and a new model is trained, this newly trained model is compared to the currently adopted model that was trained on less data. Only if the new model performs better should it be used.

<sup>1</sup> Not to be confused with training curves, where the loss versus epochs (optimization iterations) is plotted.


We introduce several wrapper algorithms for supervised classification techniques that use the holdout set or cross-validation to make this comparison. Our proposed algorithm MTHT uses a hypothesis test to switch if the new model improves significantly upon the old model. Using guarantees from the hypothesis test, we can prove that the resulting learning curve is monotone with high probability. We empirically study the effect of the parameters of the algorithms and benchmark them on several datasets, including MNIST [8], to check to what degree the learning curves become monotone.

This work is organized as follows. The notion of monotonicity of learning curves is reviewed in Sect. 2. We introduce our approaches and algorithms in Sect. 3, and prove consistency and monotonicity with high probability in Sect. 4. Section 5 provides the empirical evaluation. We discuss the main findings of our results in Sect. 6 and end with the most important conclusions.

## **2 The Setting and the Definition of Monotonicity**

We consider the setting where we have a learner that now and then receives data and that is evaluated over time. The question is then how to make sure that the performance of this learner is monotone over time, or, in other words, how we can guarantee that this learner improves its performance over time.

We analyze this question in a (frequentist) classification framework. We assume there exists an (unknown) distribution P over X × Y, where X is the input space (features) and Y is the output space (classification labels). To simplify the setup we operate in rounds indicated by i ∈ {1, ..., n}. In each round, we receive a batch of samples S<sup>i</sup> that is sampled i.i.d. from P. The learner L can use this data in combination with data from previous rounds to come up with a hypothesis h<sub>i</sub> in round i. The hypothesis comes from a hypothesis space H. We consider learners L that, as a subroutine, use a supervised learner A : S → H, where S is the space of all possible training sets.

We measure performance by the error rate. The true error rate on P equals

$$\epsilon(h_i) = \int_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} l_{0-1}(h_i(x), y)\, dP(x, y) \tag{1}$$

where l<sub>0-1</sub> is the zero-one loss. We denote the empirical error rate of h on a sample S by ε̂(h, S). We call n rounds a run. The true error of the model h<sub>i</sub> returned by the learner L in round i is denoted ε<sub>i</sub>; all the ε<sub>i</sub>'s of a run form a learning curve. By averaging over multiple runs one obtains the expected learning curve ε̄<sub>i</sub>.

The goal for the learner L is twofold: the error rates ε<sub>i</sub> of the returned models should (1) be as small as possible, and (2) be monotonically decreasing. These goals can be at odds with one another. For example, always returning a fixed model ensures monotonicity but incurs large error rates. To measure (1), we summarize the performance of a learning curve using the Area Under the Learning Curve (AULC) [6,13,16], which averages all ε<sub>i</sub>'s of a run. A low AULC indicates that a learner manages to quickly reduce the error rate.

Monotone in round i means that ε<sub>i+1</sub> ≤ ε<sub>i</sub>. We may care about monotonicity of the expected learning curve *or* of individual learning curves. In practice, however, we typically get one chance to gather data and submit models. In that case, we rather want to make sure that any additional data also leads to better performance. Therefore, we are mainly concerned with monotonicity of *individual* learning curves. We quantify the monotonicity of a run by the fraction of non-monotone transitions in the individual curve.

## **3 Approaches and Algorithms**

We introduce three algorithms (learners L) that wrap around supervised learners with the aim of making them monotone. First, some intuition on how to achieve this: ideally, during the generation of the learning curve, we would check whether ε(h<sub>i+1</sub>) ≤ ε(h<sub>i</sub>). A fix to make a learner monotone would then be to output h<sub>i</sub> instead of h<sub>i+1</sub> if the error rate of h<sub>i+1</sub> is larger. Since learners do not have access to ε(h<sub>i</sub>), we have to estimate it using the incoming data. The first two algorithms, MTSIMPLE and MTHT, use the holdout method to this end; newly arriving data is partitioned into training and validation sets. The third algorithm, MTCV, makes use of cross-validation.

**MTSIMPLE: Monotone Simple.** The pseudo-code for MTSIMPLE is given by Algorithm 1 in combination with the function UpdateSimple. Batches S<sup>i</sup> are split into a training part S<sup>i</sup><sub>t</sub> and a validation part S<sup>i</sup><sub>v</sub>. The training set S<sub>t</sub> is enlarged each round with S<sup>i</sup><sub>t</sub> and a new model h<sub>i</sub> is trained. S<sup>i</sup><sub>v</sub> is used to estimate the performance of h<sub>i</sub> and of h<sub>best</sub>, the previously best performing model. If the new model h<sub>i</sub> is better, it is returned and h<sub>best</sub> is updated; otherwise h<sub>best</sub> is returned.

Because h<sub>i</sub> and h<sub>best</sub> are both evaluated on S<sup>i</sup><sub>v</sub>, the comparison is paired and therefore more accurate. After the comparison, S<sup>i</sup><sub>v</sub> can safely be added to the training set (line 7 of Algorithm 1).

We call this algorithm MTSIMPLE because the model selection is somewhat naive: for small validation sets, the variance in the performance estimate can be quite large, leading to many non-monotone decisions. In the limit of infinitely large S<sup>i</sup><sub>v</sub>, however, this algorithm is always monotone (and very data hungry).

**MTHT: Monotone Hypothesis Test.** The second algorithm, MTHT, aims to resolve the issues of MTSIMPLE with small validation set sizes. In addition, for this algorithm we prove that individual learning curves are monotone with high probability. The same pseudo-code is used as for MTSIMPLE (Algorithm 1), but with a different update function, UpdateHT. Now a hypothesis test HT determines whether the newly trained model is significantly better than the previous model, making sure that the new model is not preferred by chance (such as an unlucky sample). The hypothesis test is conservative and only switches to a new model if we are reasonably sure it is significantly better, to avoid non-monotone decisions. Japkowicz and Shah [7] provide an accessible introduction to frequentist hypothesis testing.

### **Algorithm 1.** MTSIMPLE and MTHT

**input:** supervised learner A, rounds n, batches S<sup>i</sup>, update function u ∈ {updateSimple, updateHT}; if u = updateHT: confidence level α, hypothesis test HT

**1** S<sub>t</sub> = {}
**2** **for** i = 1, ..., n **do**
**3** Split S<sup>i</sup> into S<sup>i</sup><sub>t</sub> and S<sup>i</sup><sub>v</sub>
**4** Append to S<sub>t</sub>: S<sub>t</sub> = [S<sub>t</sub>; S<sup>i</sup><sub>t</sub>]
**5** h<sub>i</sub> ← A(S<sub>t</sub>)
**6** Update<sub>i</sub> ← u(S<sup>i</sup><sub>v</sub>, h<sub>i</sub>, h<sub>best</sub>, α, HT) // see below
**7** Append to S<sub>t</sub>: S<sub>t</sub> = [S<sub>t</sub>; S<sup>i</sup><sub>v</sub>]
**8** **if** Update<sub>i</sub> *or* i = 1 **then**
**9** h<sub>best</sub> ← h<sub>i</sub>
**10** **end**
**11** Return h<sub>best</sub> in round i
**12** **end**
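
The following is a minimal Python sketch of this wrapper, under our own naming and with a simple 50/50 holdout split; it is meant to show the control flow of Algorithm 1, not the authors' implementation.

```python
def monotone_learner(A, batches, update_fn):
    """Sketch of Algorithm 1. A trains a model from a list of (x, y) pairs;
    update_fn(S_v, h_new, h_best) decides whether to adopt the new model."""
    S_t, h_best, curve = [], None, []
    for i, S_i in enumerate(batches):
        half = len(S_i) // 2
        S_tr, S_v = S_i[:half], S_i[half:]    # holdout split of the batch
        S_t += S_tr                           # line 4: grow the training set
        h_i = A(S_t)                          # line 5: train the new model
        switch = (i == 0) or update_fn(S_v, h_i, h_best)
        S_t += S_v                            # line 7: reuse validation data
        if switch:
            h_best = h_i                      # lines 8-9: adopt the new model
        curve.append(h_best)                  # line 11: model for round i
    return curve
```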


The choice of hypothesis test depends on the performance measure. For the error rate, the McNemar test can be used [7,14]. The hypothesis test should use paired data, since we evaluate two models on one sample, and it should be one-tailed. One-tailed, because we only want to know whether h<sub>i</sub> is better than h<sub>best</sub> (a two-tailed test would also switch to h<sub>i</sub> if its performance were significantly *different*). The test compares two hypotheses: H<sub>0</sub> : ε(h<sub>i</sub>) = ε(h<sub>best</sub>) and H<sub>1</sub> : ε(h<sub>i</sub>) < ε(h<sub>best</sub>).

Several versions of the McNemar test can be used [4,7,14]. We use the McNemar exact conditional test, which we briefly review. Let b be the random variable indicating the number of samples in S<sup>i</sup><sub>v</sub> classified correctly by h<sub>best</sub> and incorrectly by h<sub>i</sub>, and let N<sub>d</sub> be the number of samples where the two models disagree. The test conditions on N<sub>d</sub>. Assuming H<sub>0</sub> is true, $$P(b = x \mid H_0, N_d) = \binom{N_d}{x} \left(\frac{1}{2}\right)^{N_d}.$$ Given an observed value x of b, the p-value for our one-tailed test is $$p = \sum_{i=0}^{x} P(b = i \mid H_0, N_d).$$
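
Since b follows a Binomial(N<sub>d</sub>, 1/2) distribution under H<sub>0</sub>, this p-value is a binomial CDF; a minimal sketch (function name ours):

```python
from scipy.stats import binom

def mcnemar_exact_p(b, n_d):
    """One-tailed exact conditional McNemar p-value: P(B <= b) for
    B ~ Binomial(n_d, 1/2) under H0, matching the sum above."""
    return binom.cdf(b, n_d, 0.5)

# UpdateHT then switches models only when mcnemar_exact_p(b, n_d) <= alpha.
```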

The one-tailed p-value is the probability of observing a more extreme sample given hypothesis H<sub>0</sub>, considering the tail direction of H<sub>1</sub>. The smaller the p-value, the more evidence we have for H<sub>1</sub>. If the p-value is smaller than α, we accept H<sub>1</sub> and update the model h<sub>best</sub>. The smaller α, the more conservative the hypothesis test, and thus the smaller the chance that a wrong decision is made due to unlucky sampling. For the McNemar exact conditional test [4], the False Positive Rate (FPR, the probability of making a Type I error) is bounded by α: P(p ≤ α | H<sub>0</sub>) ≤ α. We need this to prove monotonicity with high probability.

**MTCV: Monotone Cross Validation.** In practice, K-fold cross-validation (CV) is often used to estimate model performance instead of the holdout. This is what MTCV does, similarly to MTSIMPLE. As described in Algorithm 2, for each incoming sample an index I maintains to which fold it belongs. These indices are used to generate the folds for the K-fold cross-validation.

During CV, K models are trained and evaluated on their validation sets. We therefore have to memorize the K previously best models, one for each fold. We average the performance of the newly trained models over the K folds and compare it to the average of the K previously best models. This averaging over folds is essential, as it reduces the variance of the model selection step compared to selecting the best model overall (as MTSIMPLE does).

In our framework we return a single model in each iteration. We return the model with the optimal training set size that performed best during CV. This can further improve performance.

### **Algorithm 2.** MTCV

**input:** K folds, learner A, rounds n, batches S<sup>i</sup>

**1** b ← 1 // keeps track of best round
**2** S = {}, I = {}
**3** **for** i = 1, ..., n **do**
**4** Generate stratified CV indices for S<sup>i</sup> and put them in I<sup>i</sup>; each index indicates to which validation fold the corresponding sample belongs
**5** Append to S: S ← [S; S<sup>i</sup>]
**6** Append to I: I ← [I; I<sup>i</sup>]
**7** **for** k = 1, ..., K **do**
**8** h<sup>k</sup><sub>i</sub> ← A(S[I ≠ k]) // training set of the kth fold
**9** P<sup>k</sup><sub>i</sub> ← ε̂(h<sup>k</sup><sub>i</sub>, S[I = k]) // validation set of the kth fold
**10** P<sup>k</sup><sub>b</sub> ← ε̂(h<sup>k</sup><sub>b</sub>, S[I = k]) // update performance of previous models
**11** **end**
**12** Update<sub>i</sub> ← (mean<sub>k</sub>(P<sup>k</sup><sub>i</sub>) ≤ mean<sub>k</sub>(P<sup>k</sup><sub>b</sub>)) // mean w.r.t. k
**13** **if** Update<sub>i</sub> *or* i = 1 **then**
**14** b ← i
**15** **end**
**16** k ← arg min<sub>k</sub> P<sup>k</sup><sub>b</sub> // break ties
**17** Return h<sup>k</sup><sub>b</sub> in round i
**18** **end**

## **4 Theoretical Analysis**

We derive the probability of a monotone learning curve for MTSIMPLE and MTHT, and we prove our algorithms are consistent if the model updates enough.

**Theorem 1.** *Assume we use the McNemar exact conditional test (see Sect. 3) with* α ∈ (0, 1/2]*; then the individual learning curve generated by Algorithm MTHT over* n *rounds is monotone with probability at least* (1 − α)<sup>n</sup>*.*

*Proof.* First we argue that the probability of making a non-monotone decision in round i is at most α. If H<sub>1</sub> : ε(h<sub>i</sub>) < ε(h<sub>best</sub>) or H<sub>0</sub> : ε(h<sub>i</sub>) = ε(h<sub>best</sub>) is true, we are monotone in round i, so we only need to consider a new alternative hypothesis, H<sub>2</sub> : ε(h<sub>i</sub>) > ε(h<sub>best</sub>). Under H<sub>0</sub> we have [4]: P(p ≤ α | H<sub>0</sub>) ≤ α. Conditioned on H<sub>2</sub>, b is binomial with a larger mean than under H<sub>0</sub>, so we observe larger p-values if α ∈ (0, 1/2]; thus P(p ≤ α | H<sub>2</sub>) ≤ P(p ≤ α | H<sub>0</sub>) ≤ α. Therefore the probability of being non-monotone in round i is at most α. This holds for any models h<sub>i</sub>, h<sub>best</sub> and anything that happened before round i. Since the S<sup>i</sup><sub>v</sub> are independent samples, being non-monotone in each round can be seen as independent events, resulting in (1 − α)<sup>n</sup>.

If we want the entire learning curve to be monotone with probability at least β, we can set α = 1 − β<sup>1/n</sup>, since then (1 − α)<sup>n</sup> = β; for example, for n = 100 rounds and β = 0.95 this gives α ≈ 5 × 10<sup>−4</sup>. Note that this analysis also holds for MTSIMPLE, since running MTHT with α = 1/2 results in the same algorithm as MTSIMPLE for the McNemar exact conditional test.

We now argue that all proposed algorithms are consistent under some conditions. First, let us revisit the definition of consistency [17].

**Definition 1 (Consistency** [17]**).** *Let* L *be a learner that returns a hypothesis* L(S) ∈ H *when evaluated on* S*. For all* ε<sub>excess</sub> ∈ (0, 1)*, for all distributions* D *over* X × Y*, for all* δ ∈ (0, 1)*, if there exists an* n(ε<sub>excess</sub>, D, δ) *such that for all* m ≥ n(ε<sub>excess</sub>, D, δ)*, if* L *uses a sample* S *of size* m*, the following holds with probability (over the choice of* S*) at least* 1 − δ*,*

$$\epsilon(L(S)) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess}, \tag{2}$$

*then* L *is said to be consistent.*

Before we can state the main result, we have to introduce some notation. U<sub>i</sub> indicates the event that the algorithm updates h<sub>best</sub> (or, in the case of MTCV, updates the variable b). H<sup>i+z</sup><sub>i</sub> indicates the event ¬U<sub>i</sub> ∩ ¬U<sub>i+1</sub> ∩ ... ∩ ¬U<sub>i+z</sub>, in words: that in rounds i to i + z there has been no update. To fulfill consistency, we need that, as the number of rounds grows to infinity, the probability of updating is large enough. Then consistency of A makes sure that h<sub>best</sub> has sufficiently low error. For this analysis it is assumed that the number of rounds of the algorithms is not fixed.

**Theorem 2.** *MTSIMPLE, MTHT and MTCV are consistent if* A *is consistent and if for all* i *there exist* z<sub>i</sub> ∈ ℕ \ {0} *and* C<sub>i</sub> > 0 *such that for all* k ∈ ℕ \ {0} *it holds that* P(H<sup>i+kz<sub>i</sub></sup><sub>i</sub>) ≤ (1 − C<sub>i</sub>)<sup>k</sup>*.*

*Proof.* Let A be consistent with n<sub>A</sub>(ε<sub>excess</sub>, D, δ) samples. Let us analyze a round i that is big enough such that<sup>2</sup> |S<sub>t</sub>| > n<sub>A</sub>(ε<sub>excess</sub>, D, δ/2). Assume that

$$\epsilon(h_{best}) > \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess}, \tag{3}$$

<sup>2</sup> In the case of MTCV, take |S<sub>t</sub>| to be the smallest training fold size in round i.

otherwise the proof is trivial. For any round j ≥ i, since A produces hypothesis h<sub>j</sub> from |S<sub>t</sub>| > n<sub>A</sub>(ε<sub>excess</sub>, D, δ/2) samples,

$$\epsilon(h_j) \le \min_{h \in \mathcal{H}} \epsilon(h) + \epsilon_{excess} \tag{4}$$

holds with probability at least 1 − δ/2. Now L should update. The probability that in the next kz<sub>i</sub> rounds we do not update is, by assumption, bounded by (1 − C<sub>i</sub>)<sup>k</sup>. Since C<sub>i</sub> > 0, we can choose k big enough so that (1 − C<sub>i</sub>)<sup>k</sup> ≤ δ/2. Thus the probability of not updating after kz<sub>i</sub> more rounds is at most δ/2, and we have a probability of δ/2 that the model after updating is not good enough. Applying the union bound, we find that the probability of failure is at most δ.

A few remarks about the assumption. It tells us that an update becomes more and more likely the more consecutive rounds pass without an update. It holds if every z<sub>i</sub> rounds the update probability is nonzero. A weaker but also sufficient assumption is ∀i : lim<sub>z→∞</sub> P(H<sup>i+z</sup><sub>i</sub>) = 0.

For MTSIMPLE and MTCV the assumption is always satisfied, because these algorithms look directly at the mean error rate, and due to fluctuations in the sampling there is always a non-zero probability that ε̂(h<sub>i</sub>) ≤ ε̂(h<sub>best</sub>). For MTHT, however, this may not always be satisfied. Especially if the validation batches N<sub>v</sub> are small, the hypothesis test may not be able to detect small differences in error; the test then has zero power. If N<sub>v</sub> stays small, the power may stay zero even in future rounds, in which case the learner is not consistent.

## **5 Experiments**

We evaluate MTSIMPLE and MTHT on artificial datasets to understand the influence of their parameters. Afterward we perform a benchmark where we also include MTCV and a baseline that uses validation data to tune the regularization strength. This last experiment is also performed on the MNIST dataset to get an impression of the practicality of the proposed algorithms. First we describe the experimental setup in more detail.

**Experimental Setup.** The peaking dataset [3] and the dipping dataset [9] are artificial datasets that cause non-monotone behaviour. We use stratified sampling to obtain batches S<sup>i</sup> for the peaking and dipping datasets; for MNIST we use random sampling. For simplicity, all batches have the same size. N indicates the batch size, and N<sub>v</sub> and N<sub>t</sub> indicate the sizes of the validation and training sets.

As model we use least squares classification [5,15], i.e., ordinary linear least squares regression on the classification labels {−1, +1} with intercept. For MNIST, one-versus-all is used to train a multi-class model. In case there are fewer training samples than dimensions, the required inverse of the covariance matrix is ill-defined and we resort to the Moore-Penrose pseudo-inverse.
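
A sketch of this binary least squares classifier, using the pseudo-inverse throughout (which coincides with the ordinary solution when the problem is well-posed); the class name is ours, and the one-versus-all extension for MNIST is omitted.

```python
import numpy as np

class LeastSquaresClassifier:
    """Linear least squares on labels {-1, +1} with intercept; the
    Moore-Penrose pseudo-inverse handles the underdetermined case."""
    def fit(self, X, y):
        Xb = np.hstack([X, np.ones((len(X), 1))])   # add intercept column
        self.w = np.linalg.pinv(Xb) @ y             # minimum-norm solution
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.sign(Xb @ self.w)
```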

Monotonicity is calculated as the fraction of non-monotone iterations per run. The AULC is also calculated per run. We do 100 runs with different batches and average, to reduce the variation from the randomness in the batches. Each run uses a newly sampled test set consisting of 10,000 samples. The test set is used to estimate the true error rate and is not accessible to any of the algorithms.

We evaluate MTSIMPLE, MTHT and MTCV against several baselines. The standard learner simply trains on all received data. A second baseline, λ<sub>S</sub>, splits the data into train and validation sets like MTSIMPLE and uses the validation data to select the optimal L<sub>2</sub> regularization parameter λ for the least squares classifier. Regularization is implemented by adding λI to the estimate of the covariance matrix.

In the first experiment we investigate the influence of $N_v$ and $\alpha$ on the decisions of MTSIMPLE and MTHT. A complicating factor is that if $N_v$ changes, not only the decisions change but also the training set sizes, because $S_v$ is appended to the training set (see line 7 of Algorithm 1). This makes interpretation of the results difficult, because decisions are then made in a different context. Therefore, for the first set of experiments, we do not add $S_v$ to the training sets, also not for the standard learner. For this set of experiments we use $N_t = 4$, $n = 150$, $d = 200$ for the peaking dataset, and we vary $\alpha$ and $N_v$.

For the benchmark, we set $N_t = 10$, $N_v = 40$, $n = 150$ for peaking and dipping, and we set $N_t = 5$, $N_v = 20$, $n = 40$ for MNIST. We fix $\alpha = 0.05$ and use $d = 500$ for the peaking dataset. For MNIST, as a preprocessing step we extract 500 random Fourier features, as also done by Belkin et al. [1]. For MTCV we use $K = 5$ folds. For $\lambda_S$ we try $\lambda \in \{10^{-5}, 10^{-4.5}, \ldots, 10^{4.5}, 10^{5}\}$ for peaking and dipping, and $\lambda \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ for MNIST.
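A minimal sketch of such a random Fourier feature map, assuming an RBF kernel as in the construction of Belkin et al. [1]; the bandwidth `gamma` is our placeholder, not the paper's setting.

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=0.02, seed=0):
    # Approximates the RBF kernel exp(-gamma * ||x - x'||^2) via
    # z(x) = sqrt(2/D) * cos(W^T x + b) with W ~ N(0, 2*gamma*I).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```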

**Results.** We perform a preliminary investigation of the algorithms MTSIMPLE and MTHT and the influence of the parameters $N_v$ and $\alpha$. We show several learning curves in Fig. 1a and d. For small $N_v$ and $\alpha$ we observe that MTHT gets stuck: it does not switch models anymore, indicating that consistency may be violated.

In Fig. 1b and e we give a more complete picture of all tried hyperparameters in terms of the AULC. In Fig. 1c and f we plot the fraction of non-monotone decisions during a run (note that the legends differ between subfigures). Observe that the axes are scaled differently (some are logarithmic). In some cases zero non-monotone decisions were observed, resulting in a missing value due to $\log(0)$. This occurs, for example, if MTHT always sticks to the same model; then no non-monotone decisions are made. The results of the benchmark are shown in Fig. 2. The AULC and the fraction of non-monotone decisions are given in Table 1.

## **6 Discussion**

**First Experiment: Tuning $\alpha$ and $N_v$.** As predicted, MTSIMPLE typically performs worse than MTHT in terms of AULC and monotonicity unless $N_v$ is very large. The variance in the estimate of the error rates on $S_v^i$ is so large that in most cases the algorithm doesn't switch to the correct model. However, MTSIMPLE seems to be consistently better than the standard learner in terms of monotonicity and AULC, while MTHT can perform worse if badly tuned.

**Fig. 1.** Influence of $N_v$ and $\alpha$ for MTSIMPLE and MTHT on the peaking and dipping datasets. Note that some axes are logarithmic and that b, c, e, and f share the same legend.

Larger $N_v$ typically leads to improved AULC for both. $\alpha \in [0.05, 0.1]$ seems to work best in terms of AULC for most values of $N_v$. If $\alpha$ is too small, MTHT can get stuck; if $\alpha$ is too large, it switches models too often and non-monotone behaviour occurs. If $\alpha \to \frac{1}{2}$, MTHT becomes increasingly similar to MTSIMPLE, as predicted by the theory.

The fraction of non-monotone decisions of MTHT is much lower than $\alpha$. This is in agreement with Theorem 1, but could additionally indicate that the hypothesis test is rather pessimistic. The standard learner and MTSIMPLE often make non-monotone decisions; in some cases almost 50% of the decisions are non-monotone.

**Fig. 2.** Expected learning curves on the benchmark datasets.


**Table 1.** Results of the benchmark. SL is the Standard Learner. AULC is the Area Under the Learning Curve of the error rate. Fraction indicates the average fraction of non-monotone decisions during a single run. Standard deviations are shown in parentheses. The best monotonicity result is underlined.

**Second Experiment: Benchmark on Peaking, Dipping, MNIST.** Interestingly, for the peaking and MNIST datasets any non-monotonicity (double descent [1]) in the *expected* learning curve almost completely disappears for λS, which tunes the regularization parameter using validation data (Fig. 2). We wonder whether regularization can also help reduce the severity of double descent in other settings. For the dipping dataset regularization doesn't help, showing that it cannot prevent non-monotone behaviour in general. Furthermore, the fraction of non-monotone decisions *per run* is largest for this learner (Table 1).

For the dipping dataset MTCV has a large advantage in terms of AULC. We hypothesize that this is largely due to tie breaking and the small training set sizes resulting from the 5 folds. Surprisingly, on the peaking dataset it seems to learn quite slowly. The expected learning curves of MTHT look better than those of MTSIMPLE; however, in terms of AULC the difference is quite small.

The fraction of non-monotone decisions per run for MTHT is very small, as guaranteed. However, it is interesting to note that this does not always translate into monotonicity of the expected learning curve. For example, for peaking and dipping the expected curve doesn't seem entirely monotone. But MTCV, which makes many non-monotone decisions per run, still seems to have a monotone expected learning curve. While monotonicity of each individual learning curve guarantees monotonicity of the expected curve, this result indicates that monotonicity of each individual curve may not be necessary. This raises the question: under what conditions do we have monotonicity of the expected learning curve?

**General Remarks.** The fraction of non-monotone decisions of MTHT being so much smaller than $\alpha$ could indicate that the hypothesis test is too pessimistic. Fagerland et al. [4] note that the asymptotic McNemar test can have more power, which could further improve the AULC. For this test the guarantee $P(p \le \alpha \mid H_0) \le \alpha$ can be violated, but in light of the monotonicity results obtained, in practice this may not be an issue.

MTHT is inconsistent at times, but this does not have to be problematic. If one knows the desired error rate, a minimum $N_v$ can be determined that ensures the hypothesis test will not get stuck before reaching that error rate. Another possibility is to make the size $N_v$ dependent on $i$: if $N_v$ is monotonically increasing, this directly leads to consistency of MTHT. It would be ideal if $N_v$ could somehow be tuned automatically to trade off sample size requirements, consistency, and monotonicity. Since for CV $N_v$ grows automatically, which directly implies consistency, a combination of MTHT and MTCV is another option.

Devroye et al. [2] conjectured that it is impossible to construct a consistent learner that is monotone in terms of the expected learning curve. Since we look at individual curves, our work does not disprove this conjecture, but some of the authors of this paper believe that the conjecture can be disproved. One necessary step is to gain a better understanding of the relation between individual learning curves and the expected one.

Currently, our definition judges any decision that increases the error rate, by however small an amount, as non-monotone. It would be desirable to have a broader definition of non-monotonicity that allows for small and negligible increases of the error rate. Using a hypothesis test satisfying such a less strict condition could allow us to use less data for validation.

Finally, the user of the learning system should be notified when non-monotonicity occurs, so that the cause can be investigated and mitigated by regularization, model selection, etc. In automated systems, however, our algorithm can prevent any known and unknown causes of non-monotonicity (as long as the data is i.i.d.), and can thus be used as a failsafe that requires no human intervention.

## **7 Conclusion**

We have introduced three algorithms to make learners more monotone. We proved under which conditions the algorithms are consistent, and we have shown for MTHT that the learning curve is monotone with high probability. If one cares only about monotonicity of the expected learning curve, MTSIMPLE with very large $N_v$ or MTCV may prove sufficient, as shown by our experiments. If $N_v$ is small, or one desires that individual learning curves are monotone with high probability (which is practically most relevant), MTHT is the right choice. Our algorithms are a first step towards developing learners that, given more data, improve their performance in expectation.

**Acknowledgments.** We would like to thank the reviewers for their useful feedback for preparing the camera ready version of this paper.

## **References**



# **Combining Machine Learning and Simulation to a Hybrid Modelling Approach: Current and Future Directions**

Laura von Rueden1,2(B), Sebastian Mayer1,3, Rafet Sifa1,2, Christian Bauckhage1,2, and Jochen Garcke1,3,4

<sup>1</sup> Fraunhofer Center for Machine Learning, Sankt Augustin, Germany <sup>2</sup> Fraunhofer IAIS, Sankt Augustin, Germany

<sup>3</sup> Fraunhofer SCAI, Sankt Augustin, Germany

<sup>4</sup> Institute for Numerical Simulation, University of Bonn, Bonn, Germany laura.von.rueden@iais.fraunhofer.de

**Abstract.** In this paper, we describe the combination of machine learning and simulation towards a hybrid modelling approach. Such a combination of data-based and knowledge-based modelling is motivated by applications that are partly based on causal relationships, while other effects result from hidden dependencies that are represented in huge amounts of data. Our aim is to bridge the knowledge gap between the two individual communities from machine learning and simulation to promote the development of hybrid systems. We present a conceptual framework that helps to identify potential combined approaches and employ it to give a structured overview of different types of combinations using exemplary approaches of simulation-assisted machine learning and machine-learning assisted simulation. We also discuss an advanced pairing in the context of Industry 4.0 where we see particular further potential for hybrid systems.

**Keywords:** Machine learning *·* Simulation *·* Hybrid approaches

## **1 Introduction**

*Machine learning* and *simulation* have a similar goal: to predict the behaviour of a system with data analysis and mathematical modelling. On the one hand, machine learning has shown great successes in fields like image classification [21], language processing [24], or socio-economic analysis [7], where causal relationships are often only sparsely given but huge amounts of data are available. On the other hand, simulation is traditionally rooted in the natural sciences and engineering, e.g. in computational fluid dynamics [35], where the derivation of causal relationships plays an important role, or in structural mechanics for the performance evaluation of structures regarding reactions, stresses, and displacements [6].

However, some applications can benefit from combining machine learning and simulation. Such a hybrid approach can be useful when the processing capabilities of classical simulation computations cannot handle the available dimensionality of the data, for example in earth system sciences [30], or when the behaviour of the system to be predicted is based on both known, causal relationships and unknown, hidden dependencies, for example in risk management [25]. However, such challenges are in practice often still approached with either machine learning or simulation alone, apparently because they historically originate from distinct fields. This raises the question of how these two modelling approaches can be combined into a hybrid approach in order to foster intelligent data analysis. Here, a key challenge in developing a hybrid modelling approach is to bridge the knowledge gap between the two individual communities, whose members are mostly experts in either machine learning or simulation. Both groups have extremely deep knowledge about the methods used in their particular fields. However, the terminologies used differ, which can impede an exchange of ideas between the communities.

Related work that describes a combination of machine learning with simulation can, not surprisingly, roughly be divided into two groups: work from a machine learning point of view and work from a simulation point of view. The first group frequently describes the integration of simulation into machine learning as an additional source of training data, for example in autonomous driving [23], thermodynamics [19], or biomedicine [13]. A typical motivation is the augmentation of data for scenarios that are not sufficiently represented in the available data. The second group describes the integration of machine learning techniques into simulation, often for a specific application, such as car crash simulation [6], fluid simulation [38], or molecular simulation [26]. A typical motivation is to identify surrogate models [16], which offer an approximate but cheaper-to-evaluate model to replace the full simulation. Another technique, used to adapt a dynamical simulation model to new measurements, is data assimilation, which is traditionally used in weather forecasting [22]. Related work that considers an equal combination of machine learning and simulation is quite rare. A work that comes closest to describing such a hybrid, symbiotic modelling approach is [4].

More generally, the integration of prior knowledge into machine learning can be described as *informed machine learning* [34] or *theory-guided data science* [18]. The survey [34] presents a taxonomy that structures approaches according to the knowledge type, representation, and integration stage. We reuse those categories in this paper. However, that survey considers a much broader spectrum of knowledge representations, from logic rules over simulation results to human interaction, whereas this paper puts an explicit focus on simulations.

Our goal is to make the key components of the two modelling approaches, *machine learning* and *simulation*, transparent, and to show the versatile combination possibilities in order to inspire and foster future developments of hybrid systems. We do not intend to go into technical details but rather give a high-level methodological overview. With our paper we want to outline a vision of a stronger, more automated interplay between data- and simulation-based analysis methods. We mainly aim our findings at the data analysis and machine learning community, but those from the simulation community are also welcome to read on.

**Fig. 1. Subfields of Combining Machine Learning and Simulation.** The fields of machine learning and simulation have an intersecting area, which we partition into three subfields: 1. Simulation-assisted machine learning describes the integration of simulations into machine learning. 2. Machine-learning assisted simulation describes the integration of machine learning into simulation. 3. A hybrid combination describes a combination of machine learning and simulation with a strong mutual interplay.

Generally, our target audience consists of researchers and users of one of the two modelling approaches who want to learn how they can use the other one.

The contributions of this paper are: 1. a conceptual framework serving as an orientation aid for comparing and combining machine learning and simulation, 2. a structured overview of combinations of both modelling approaches, and 3. our vision of a hybrid approach with a stronger interplay of data- and simulation-based analysis.

The paper is structured as follows: In Sect. 2 we give a brief overview of the subfields that result from combining machine learning and simulation. In Sect. 3 we present these two separate modelling approaches along our conceptual framework. In Sect. 4 we describe the versatile combinations by giving exemplary references and applications. In Sect. 5 we further discuss our observations in Industry 4.0 projects, which lead us to a vision for the advanced pairing of machine learning and simulation. Finally, we conclude in Sect. 6.

## **2 Overview**

In this section, we give a short overview of the subfields that result from a combination of machine learning with simulation. We view the combination with equal focus on both fields, driven by our vision of a hybrid modelling approach with a stronger and automated interplay. Figure 1 illustrates our view of the fields' overlap, which can be partitioned into the three subfields simulation-assisted machine learning, machine-learning assisted simulation, and a hybrid combination. The first two can be regarded as one-sided approaches, because they describe the integration from the point of view of one approach, whereas the last one is a two-sided approach.

**Fig. 2. Components of Machine Learning.** Machine learning consists of two phases: 1. model generation and 2. model application, where the focus is usually placed on the first phase, in which an inductive model is learned from data. The components of this phase are the training data, a hypothesis set, a learning algorithm, and a final hypothesis [1,34]. The phase describes the finding of patterns in an initially large data space, which are finally represented in condensed form by the final hypothesis. This is illustrated by the reversed triangle and can be described as a "bottom-up" approach.

Although the term *hybrid* is often used in the literature for the one-sided approaches as well, we prefer to use it only for the two-sided approach, in which machine learning and simulation have a strong mutual, symbiosis-like interplay.

## **3 Modelling Approaches**

In this section, we describe the two modelling approaches by means of a conceptual framework that aims to make them and their components transparent and comparable.

### **3.1 Machine Learning**

The main goal of machine learning is that a machine automatically learns a model that describes patterns in given data. The typical components of machine learning are illustrated in Fig. 2. In the first, main phase an inductive model is learned. Inductive means that the model is built by drawing conclusions from samples and is thus not guaranteed to depict causal relationships, but can instead identify hidden, previously unknown patterns, meaning that the model is usually not knowledge-based but rather data-based. This inductive model can finally be applied to new data in order to predict or infer a desired target variable.

The model generation phase can be roughly split into four sub-phases or respective components [1,34]. Firstly, training data is prepared that depicts historical records of the investigated process or system.

**Fig. 3. Components of Simulation.** Simulation comprises the two phases 1. model generation and 2. model application, where the focus is often on the second phase, in which a previously identified deductive model is used to create simulation results. The components of this phase are the simulation model, input parameters, a numerical method, and the simulation result. The phase describes the unfolding of local interactions from a compactly represented initial model into an expanded data space. This is illustrated by the triangle and can be described as a "top-down" approach.

Secondly, a hypothesis set is defined in the form of a function class or network architecture that is assumed to map input features to the target variables. Thirdly, a learning algorithm tunes the parameters of the hypothesis set so that the performance of the mapping is maximized, using optimization algorithms like gradient descent; this results in, fourthly, the final hypothesis, which is the desired inductive model. The model generation phase is often repeated in a loop-like manner by tuning hyper-parameters until a sufficient model performance is achieved.
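To make the four components concrete, the following schematic sketch lines them up in scikit-learn-style Python; the dataset and model choices are ours, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# 1. Training data: historical records of the investigated system
X_train, y_train = make_classification(n_samples=500, n_features=20)

# 2. Hypothesis set: a function class, here linear models
model = SGDClassifier(loss="log_loss")

# 3. Learning algorithm: (stochastic) gradient descent tunes the parameters
model.fit(X_train, y_train)

# 4. Final hypothesis: the fitted model, applied to new data in phase 2
y_pred = model.predict(X_train[:5])
```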

### **3.2 Simulation**

The goal of a simulation is to predict the behaviour of a system or process in a particular situation. There are different types of simulations, ranging from cellular automata through agent-based simulations to equation-based simulations [9,15,36]. In the following we concentrate on the last type, which is based on mathematical models and is especially used in science and engineering. The first, required stage preceding the actual simulation is the identification of a deductive model, often in the form of differential equations. Deductive in this context means that the model describes causal relationships and can thus be called knowledge-based. Such models are often developed through extensive research, starting with a derivation, for example in theoretical physics, and continuing with plentiful experimental validations. Some recent proof-of-concept research exists on identifying such models directly from data [8,33].

The main phase of a simulation is the application of the identified model for a specific scenario, often called running a simulation. This phase can be described in terms of four typical main components or sub-phases, which are, as illustrated in Fig. 3, the mathematical model, the input parameters, the numerical method, and finally the simulation result [36].

**Fig. 4. Types of Simulation-Assisted Machine Learning.** Simulations, in particular the simulation results, can be generally integrated into the four different components of machine learning. The triangles illustrate the machine learning (blue/dark gray) or the simulation (orange/light gray) approach and their components, which are themselves presented in Figs. 2 and 3. The simulation results can be used to (a) augment the training data, (b) define parts of the hypothesis set in the form of empirical functions, (c) steer the training algorithm in generative adversarial networks, or (d) verify the final hypothesis against scientific consistency. (Color figure online)

After the selection of a mathematical model, the input parameters that describe the specific scenario are defined in the second sub-phase. They can comprise general parameters such as the spatial domain or time of interest, as well as initial conditions quantifying the system's or process's initial status and boundary conditions defining the behaviour at the domain borders. In the third sub-phase, a numerical method computes the solution of the given model subject to the constraints resulting from the input parameters. Examples of numerical methods are finite differences, finite elements, or finite volume methods for spatial discretization [36], or particle methods based on interaction forces [26]. These form the basis for an approximate solution, which is the final simulation result. The model application phase is often repeated in a loop-like manner, e.g., by tuning the discretization to achieve a desired approximation accuracy and stability of the solution.
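As a toy illustration of these four sub-phases, consider an explicit-Euler solution of a simple decay equation; the model and all parameter values are ours, chosen only to make the components concrete.

```python
import numpy as np

# 1. Mathematical model: du/dt = -k * u, a deliberately simple
#    stand-in for the differential equations discussed above
k = 0.5

# 2. Input parameters: time domain and initial condition
t_end, dt, u0 = 5.0, 0.01, 1.0

# 3. Numerical method: explicit Euler, a finite-difference scheme
steps = int(t_end / dt)
u = np.empty(steps + 1)
u[0] = u0
for n in range(steps):
    u[n + 1] = u[n] + dt * (-k * u[n])

# 4. Simulation result: the approximate trajectory u(t)
print(u[-1], np.exp(-k * t_end))  # numerical vs. analytical solution
```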

## **4 Combining Machine Learning and Simulation**

In this section, we describe combinations of machine learning and simulation using our conceptual framework from Sect. 3. Here, we focus on simulation-assisted machine learning and machine-learning assisted simulation. For each of the methodical combination types, we give exemplary application references.

### **4.1 Simulation-Assisted Machine Learning**

Simulation offers an additional source of information for machine learning that goes beyond typically available data and that is rich in knowledge. This additional information can be integrated into the four components of machine learning, as illustrated in Fig. 4. In the following, we give an overview of these integration types, providing an illustrative example for each, and refer to [34] for a more detailed discussion.

Simulations are particularly useful for creating additional training data in a controlled environment. This is, for example, applied in autonomous driving, where simulations such as physics engines are employed to create photo-realistic traffic scenes, which can be used as synthetic training data for learning tasks like semantic segmentation [14], or for adversarial test generation [40]. As another example, in systems biology, simulations can be integrated into the training data of kernelized machine learning methods [13].
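A minimal sketch of this integration type: a hypothetical simulator augments a small real dataset with synthetic samples before training. The toy simulator and all names are ours and merely stand in for, e.g., a physics engine producing traffic scenes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simulate_scene(rng):
    # Hypothetical simulator returning one feature vector and label.
    x = rng.normal(size=4)
    return x, int(x.sum() > 0)

rng = np.random.default_rng(0)
X_real = rng.normal(size=(50, 4))                  # scarce real data
y_real = rng.integers(0, 2, size=50)
samples = [simulate_scene(rng) for _ in range(500)]
X_sim = np.array([x for x, _ in samples])          # abundant synthetic data
y_sim = np.array([y for _, y in samples])

# Train on real data augmented with simulated samples
clf = RandomForestClassifier().fit(
    np.vstack([X_real, X_sim]), np.concatenate([y_real, y_sim]))
```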

Moreover, simulations can be integrated into the hypothesis set, either directly as the solvers or through deduced, empirical functions that compactly describe the simulation results. These functions can be built into the architecture of a neural network, as shown for the application of finding an optimal design strategy for a warm forming process [20].

The integration of simulations into the learning algorithm can for example be realized by generative adversarial networks (GANs), which learn a prediction function that obeys constraints, which might be unknown but are implicitly given through a simulation [31].

Another important integration type is in the validation of the final hypothesis by simulations. An example for this comes from material discovery, where first a machine learning model suggests new compounds based on patterns in a data basis, and second the physical properties are computed and thus checked by a density functional theory simulation [17].

An approach that uses simulations along the whole machine learning pipeline is reinforcement learning (RL), when the model is learned in a simulated environment [2]. Studies under the keyword "sim-to-real" are often concerned with robots learning to grip or move unknown objects in simulations and usually require retraining in reality. An application for controlling the temperature of plasma follows an analogous approach, i.e., training based on a software physics model, where the learned RL model is then further adapted for use in reality [41].

### **4.2 Machine-Learning Assisted Simulation**

Machine learning is often used in simulation with the intention of supporting the solution process or detecting patterns in the simulation data. With respect to our conceptual framework presented in Sect. 3, machine learning techniques can be used for the initial model, the input parameters, the numerical method, and the final simulation results, as illustrated in Fig. 5. In the following we give an overview of these integration types. Again, we do not intend to cover the full spectrum of machine-learning assisted simulation; rather, we want to illustrate its diverse approaches through representative examples.

A prominent integration type of machine learning techniques into simulation is the identification of simpler models, such as surrogate models [11,12,16,26].

**Fig. 5. Types of Machine-Learning Assisted Simulation.** Machine learning techniques, in particular the final hypothesis, can be used in different simulation components. The triangles illustrate the machine learning (blue/dark gray) or the simulation (orange/light gray) approach and their components, which are themselves explained in Figs. 2 and 3. Exemplary use cases for machine learning models in simulation are (a) model order reduction and the development of surrogate models that offer approximate but simpler solutions, (b) the automated inference of an intelligent choice of input parameters for a next simulation run, (c) a partly trainable solver for differential equations, or (d) the identification of patterns in simulation results for scientific discovery. (Color figure online)

These are approximate and cheap-to-evaluate models that are of particular interest when the solution of the original, more precise model is very time- or resource-consuming. The surrogate model can then be used to analyse the overall behaviour of the system in order to reveal scenarios that should be further investigated with the detailed original simulation model. Such surrogate models can be developed with machine learning techniques either from data from real-world experiments or from data from high-fidelity simulations. One application example is the optimization of process parameters using deep neural networks as surrogate models [27]. Kernel-based approaches are also commonly used as surrogate models for simulations; an example improving the energetic efficiency of a gas transport network is shown in [10]. A well-established approach to surrogate modelling is model order reduction, for example with proper orthogonal decomposition, which is closely related to principal component analysis [5,37].
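A minimal sketch of the surrogate idea under our own toy assumptions: a handful of expensive high-fidelity runs provides training data for a cheap regression model, here a small neural network.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def high_fidelity_sim(p):
    # Stand-in for an expensive simulation of one design parameter p.
    return np.sin(3 * p) + 0.1 * p ** 2

# Run the costly simulation on a few design points ...
P = np.linspace(0, 3, 15).reshape(-1, 1)
y = high_fidelity_sim(P).ravel()

# ... and fit a cheap-to-evaluate surrogate on the results
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000)
surrogate.fit(P, y)
y_fast = surrogate.predict([[1.7]])  # approximate answer without simulating
```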

Data assimilation, which includes the calibration of constitutive models and the estimation of system states, is another area where machine learning techniques enhance simulations. Data assimilation problems can be modelled using dynamic Bayesian networks with continuous physically interpretable state spaces where the evaluation of transition kernels and observation operators requires forward-simulation runs [29].

Machine learning techniques can also be used to study the parameter dependence of simulation results. For example, after an engineer executes a sequence of simulations, a machine learning model can detect different behavioural modes in the results and thus reduce the analysis effort during the engineering process [6]. This supports the selection of the parameter setting for the next simulation, for which active learning techniques can also be employed. For example, [39] studied active learning for selecting the molecules for which the internal energy shall be determined by computationally expensive quantum-mechanical calculations, as well as for determining a surrogate model for the fluid flow in a well-bore while drilling.
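A sketch of such an active-learning loop, assuming a Gaussian-process surrogate whose predictive uncertainty selects the next simulation run; the toy data and the uncertainty criterion are our choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Surrogate fitted on the parameter settings simulated so far
P_run = np.array([[0.0], [1.0], [2.5]])
y_run = np.sin(3 * P_run).ravel()     # stand-in for past simulation results
gp = GaussianProcessRegressor().fit(P_run, y_run)

# Active learning: among candidate settings, simulate next where the
# surrogate is most uncertain (maximum predictive standard deviation)
candidates = np.linspace(0, 3, 100).reshape(-1, 1)
_, std = gp.predict(candidates, return_std=True)
next_setting = candidates[np.argmax(std)]
```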

The integration of machine learning techniques into the numerical method can support obtaining the numerical solution. One approach is to exchange parts of the model that are resource-consuming to solve with learned models that can be computed faster, for example with machine-learning-generated force fields in molecular dynamics simulations [26]. Another approach that has recently been investigated are trainable solvers for partial differential equations that determine the complete solution through a neural network [28].

A further, very important integration type is the application of machine learning techniques to the simulation results in order to detect patterns, often motivated by the goal of scientific discovery. While there are plenty of application domains, two exemplary representatives are particle physics [3] and the earth sciences, for example the use of convolutional neural networks for the detection of weather patterns in climate simulation data [30]. For further examples we refer to a survey about explainable machine learning for scientific discovery [32].

## **5 Advanced Pairing of Machine Learning and Simulation**

Section 4 gave a brief overview of the versatile existing approaches that integrate aspects of machine learning into simulation and vice versa, or that combine simulation and machine learning sequentially. Yet, we think that the integration of these two established worlds is only at the beginning, both in terms of modelling approaches and in terms of available software solutions.

In the following, we describe a number of observations from our project experience in the development of cyber-physical systems for Industry 4.0 applications that support this assessment. Note that the key technical goal of Industry 4.0 is the flexibilization of production processes. In addition to the broad integration of digital equipment in the production machinery, a key provider of flexibilization is a decrease of process design and dimensioning times and, ideally, a merging of the planning and production phases, which today are still strictly separated. This requires a new generation of computer-aided engineering (CAE) software systems that allow for very fast process optimization cycles with real-time feedback loops to the production machinery. An advanced pairing of machine learning and simulation will be key to realizing such systems by addressing the following issues:


experts. In this way, the fact that similar underlying systems might lead to similar surrogate models is exploited too little and, in consequence, too many costly high-fidelity simulations are run to generate the data basis, although parts of the learned surrogate models could be transferred.

– **Parameter studies and simulation engines:** Parameter and design studies are well-established tools in many fields of engineering. Surprisingly, the frameworks to conduct these studies and to build the surrogate models are third-party solutions that are separate from the core simulation engines. For the parameter study framework, the simulation engine is a black box, which does not know that it is currently being used for a parameter study. In turn, the standard rules to generate sampling points in the parameter space are not aware of the internals of the simulation engine. This raises the question of how much more efficiently parameter studies could be conducted if both software systems were more strongly connected to each other.

These observations lead us to a research concept that we propose in this paper and call **learning simulation engines**. A learning simulation engine is a hybrid system that combines machine learning and simulation in an optimal way. Such an engine can automatically decide when and where to apply learned surrogate models or high-fidelity simulations. Surrogate models are efficiently organized and re-used through transfer learning. Parameter and design optimization is an integral component of the learning simulation engine, and active learning methods allow the efficient re-use of costly high-fidelity computations.

Of course, the vision of a learning simulation engine raises numerous research questions. We describe some of them in view of Fig. 1. First of all, the question is how learning and simulation can be technically combined into such an advanced hybrid approach; in particular, whether they can only be integrated into each other via the final simulation results and the final hypothesis (as shown in Figs. 4 and 5), or whether they can also be combined at an earlier sub-phase. Moreover, the counterparts of the learning's model generation phase and the simulation's model application phase (see Figs. 2 and 3) should be investigated further in order to better understand the similarities and differences to the simulation's model generation phase and a learning's model application phase.

## **6 Conclusion**

In this paper, we described the combination of machine learning and simulation, motivated by the goal of fostering intelligent data analysis in applications that can benefit from a combination of data-based and knowledge-based solution approaches.

We categorized the overlap between the two fields into three sub-fields, namely simulation-assisted machine learning, machine-learning assisted simulation, and a hybrid approach with a strong and mutual interplay. We presented a conceptual framework for the two separate approaches in order to make them and their components transparent for the development of a potential combined approach. In summary, it describes machine learning as a bottom-up approach that generates an inductive, data-based model and simulation as a top-down approach that applies a deductive, knowledge-based model. Using this conceptual framework as an orientation aid for their integration into each other, we gave a structured overview of the combination of machine learning and simulation. We showed the versatility of the approaches through exemplary methods and use cases, ranging from simulation-based data augmentation and scientific consistency checking of machine learning models, to surrogate modelling and pattern detection in simulations for scientific discovery. Finally, we described the scenario of an advanced pairing of machine learning and simulation in the context of Industry 4.0, where we see particular further potential for hybrid systems.

## **References**



# **LiBRe: Label-Wise Selection of Base Learners in Binary Relevance for Multi-label Classification**

Marcel Wever<sup>1(B)</sup>, Alexander Tornede<sup>1</sup>, Felix Mohr<sup>2</sup>, and Eyke Hüllermeier<sup>1</sup>

<sup>1</sup> Heinz Nixdorf Institut, Paderborn University, Paderborn, Germany

{marcel.wever,alexander.tornede,eyke}@upb.de <sup>2</sup> Universidad de La Sabana, Chia, Cundinamarca, Colombia felix.mohr@unisabana.edu.co

**Abstract.** In multi-label classification (MLC), each instance is associated with a set of class labels, in contrast to standard classification, where an instance is assigned a single label. Binary relevance (BR) learning, which reduces a multi-label problem to a set of binary classification problems, one per label, is arguably the most straight-forward approach to MLC. In spite of its simplicity, BR proved to be competitive with more sophisticated MLC methods, and it still achieves state-of-the-art performance for many loss functions. Somewhat surprisingly, the optimal choice of the base learner for tackling the binary classification problems has received very little attention so far. Taking advantage of the label independence assumption inherent to BR, we propose a label-wise base learner selection method optimizing label-wise macro-averaged performance measures. In an extensive experimental evaluation, we find that our approach, called LiBRe, can significantly improve generalization performance.

**Keywords:** Multi-label classification · Algorithm selection · Binary relevance

## **1 Introduction**

By relaxing the assumption of mutual exclusiveness of classes, the setting of *multi-label classification* (MLC) generalizes standard (binary or multinomial) classification—subsequently also referred to as single-label classification (SLC). MLC has received a lot of attention in the recent machine learning literature [23, 29]. The motivation for allowing an instance to be associated with several classes simultaneously originated in the field of text categorization [19], but nowadays multi-label methods are used in applications as diverse as image processing [4,26] and video annotation [14], music classification [18], and bioinformatics [2].

Common approaches to MLC either adapt existing algorithms (*algorithm adaptation*) to the MLC setting, e.g., the structure and the training procedure for neural networks, or reduce the original MLC problem to one or multiple SLC problems (*problem transformation*). The most intuitive and straight-forward problem transformation is to decompose the original task into several binary classification tasks, one per label. More specifically, each task consists of training a classifier that predicts whether or not a specific label is relevant for a query instance. This approach is called *binary relevance* (BR) learning [3]. Beyond BR, many more sophisticated strategies have been developed, most of them trying to exploit correlations and interdependencies between labels [28]. In fact, BR is often criticized for ignoring such dependencies, implicitly assuming that the relevance of one label is (statistically) independent of the relevance of another label. In spite of this, or perhaps just because of this simplification, BR proved to achieve state-of-the-art performance, especially for so-called decomposable loss functions, for which its optimality can even be corroborated theoretically [7,9].

Techniques for reducing MLC to SLC problems involve the choice of a base learner for solving the latter. Somewhat surprisingly, this choice is often neglected, despite having an important influence on generalization performance [10–12,15]. Even in more extensive studies [10,12], a base learner is fixed a priori in a more or less arbitrary way. Broader studies considering multiple base learners, such as [6,22], are relatively rare and rather limited in the number of base learners considered. Only recently has greater attention been paid to the choice of the base learner in the field of automated machine learning (AutoML) [17,24,25], where the base learner is considered an important "hyper-parameter" to tune. Indeed, while optimizing the selection of base learners is laborious and computationally expensive in general, which could be one reason why it has been approached with reservation, AutoML now offers new possibilities in this direction.

Motivated by these opportunities, and building on recent AutoML methodology, we investigate the idea of base learner selection for BR in a more systematic way. Instead of only choosing a single base learner to be used for all labels simultaneously, we even allow for selecting an individual learner for each label (i.e., each binary classification task) separately. In an extensive experimental study, we find that customizing BR in a label-wise manner can significantly improve generalization performance.

## **2 Multi-label Classification**

The setting of *multi-label classification* (MLC) allows an instance to belong to several classes simultaneously. Consequently, several class labels can be assigned to an instance at the same time. For example, a single image could be tagged with labels Sun and Beach and Sea and Yacht.

### **2.1 Problem Setting**

To formalize this learning problem, let $\mathcal{X}$ denote an instance space and $\mathcal{L} = \{\lambda_1, \ldots, \lambda_m\}$ a finite set of $m$ class labels. An instance $x \in \mathcal{X}$ is then (non-deterministically) associated with a subset of class labels $L \in 2^{\mathcal{L}}$. The subset $L$ is often called the set of relevant labels, while its complement $\mathcal{L} \setminus L$ is considered irrelevant for $x$. Furthermore, a set $L$ of relevant labels can be identified with a binary vector $y = (y_1, \ldots, y_m)$, where $y_i = 1$ if $\lambda_i \in L$ and $y_i = 0$ otherwise (i.e., if $\lambda_i \in \mathcal{L} \setminus L$). The set of all label combinations is denoted by $\mathcal{Y} = \{0, 1\}^m$.

Generally speaking, a multi-label classifier $h$ is a mapping $h: \mathcal{X} \longrightarrow \mathcal{Y}$ returning, for a given instance $x \in \mathcal{X}$, a prediction in the form of a vector

$$h(x) = \left(h\_1(x), h\_2(x), \dots, h\_m(x)\right).$$

The MLC task can be stated as follows: Given a finite set of observations as training data $\mathcal{D}_{\text{train}} := (X_{\text{train}}, Y_{\text{train}}) = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X}^N \times \mathcal{Y}^N$, the goal is to learn a classifier $h: \mathcal{X} \longrightarrow \mathcal{Y}$ that generalizes well beyond these observations in the sense of minimizing the risk with respect to a specific loss function.

#### **2.2 Loss Functions**

A wide spectrum of loss functions has been proposed for MLC, many of which are generalizations or adaptations of losses for single-label classification. In general, these loss functions can be divided into two major categories: instance-wise and label-wise. While the latter first compute a loss for each label and then aggregate the values obtained across the labels, e.g., by taking the mean, instance-wise loss functions first compute a loss for each instance and subsequently aggregate the losses over all instances in the test data. As an obvious advantage of label-wise loss functions, note that they can be optimized by optimizing a standard SLC loss for each label separately. In other words, label-wise losses naturally harmonize with label-wise decomposition techniques such as BR. Since this allows for a simpler selection of the base learner per label, we focus on two such loss functions in the following. For additional details on MLC and loss functions, especially instance-wise losses, we refer to [23,29].

Let $\mathcal{D}_{\text{test}} := (X_{\text{test}}, Y_{\text{test}}) = \{(x_i, y_i)\}_{i=1}^{S} \subset \mathcal{X}^S \times \mathcal{Y}^S$ be a test set of size $S$. Further, let $H = (h(x_1), \ldots, h(x_S)) \subset \mathcal{Y}^S$. Then, the Hamming loss, which can be seen as a generalized form of the error rate, is defined<sup>1</sup> as

$$\mathcal{L}\_H(Y\_{\text{test}}, H) := \frac{1}{m} \sum\_{j=1}^m \frac{1}{S} \sum\_{i=1}^S \left[ y\_{i,j} \neq h\_j(x\_i) \right] \;. \tag{1}$$

Moreover, the label-wise macro-averaged F-measure (which is actually a measure of accuracy, not a loss function, and thus to be maximized) is given by

$$\mathcal{F}(Y\_{\text{test}}, H) := \frac{1}{m} \sum\_{j=1}^{m} \frac{2 \sum\_{i=1}^{S} y\_{i,j} h\_j(\mathbf{x}\_i)}{\sum\_{i=1}^{S} y\_{i,j} + \sum\_{i=1}^{S} h\_j(\mathbf{x}\_i)} \ . \tag{2}$$

Obviously, to optimize the measures (1) and (2), it is sufficient to optimize each label individually, which corresponds to optimizing the inner term of the (first) sum.

<sup>1</sup> $[\cdot]$ is the indicator function.
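Both measures translate directly into code. The sketch below (ours) assumes binary $S \times m$ matrices of true and predicted labels and guards against labels that are empty in both $Y$ and $H$, a corner case the formula leaves open.

```python
import numpy as np

def hamming_loss(Y, H):
    # Eq. (1): mean disagreement over all S x m label entries.
    return float(np.mean(Y != H))

def macro_f_measure(Y, H):
    # Eq. (2): per-label F-measure, averaged over the m labels.
    num = 2 * np.sum(Y * H, axis=0)
    den = np.sum(Y, axis=0) + np.sum(H, axis=0)
    return float(np.mean(num / np.maximum(den, 1)))  # guard: empty labels
```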

### **2.3 Binary Relevance**

As already said, binary relevance learning decomposes the MLC task into several binary classification tasks, one for each label. For every such task, a single-label classifier, such as an SVM, random forest, or logistic regression, is trained. More specifically, a classifier for the $j$-th label is trained on the dataset $\{(x_i, y_{i,j})\}_{i=1}^{N}$. Formally, BR induces a multi-label predictor

$$\mathbf{BR}\_b: \mathcal{X} \longrightarrow \mathcal{Y}, \quad \mathbf{x} \mapsto \left( b\_1(\mathbf{x}), b\_2(\mathbf{x}), \dots, b\_m(\mathbf{x}) \right),$$

where $b_j: \mathcal{X} \longrightarrow \{0, 1\}$ represents the prediction of the base learner for the $j$-th label.
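A minimal sketch of BR along these lines (ours, not the MEKA implementation used in the experiments below): one clone of the base learner is fitted per label, and the predictions are stacked into a label vector.

```python
import numpy as np
from sklearn.base import clone

class BinaryRelevance:
    def __init__(self, base_learner):
        self.base_learner = base_learner   # any scikit-learn classifier

    def fit(self, X, Y):                   # Y: N x m binary label matrix
        self.models_ = [clone(self.base_learner).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):                  # returns (b_1(x), ..., b_m(x))
        return np.column_stack([m.predict(X) for m in self.models_])
```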

## **3 Related Work**

Binary relevance has been subject to modifications in various directions, an excellent overview of which is provided in a recent survey [28]. Extensions of BR mainly address its inability to exploit label correlations, which results from treating all labels independently of each other. Three types of approaches have been proposed to overcome this problem. The first is to use *classifier chains* [15]. In this approach, one first defines a total order among the $m$ labels and then trains binary classifiers in this order. The input of the classifier for the $i$-th label is the original data plus the predictions of *all classifiers* for labels preceding this label in the chain. Similarly, in addition to the binary classifiers for the $m$ labels, *stacking* uses a second layer of $m$ meta-classifiers, one for each label, which take as input the original data augmented by the predictions of *all* base learners [11,21]. A third approach seeks to capture the dependencies in a Bayesian network and to learn such a network from the data [1,20]. One can then use probabilistic inference to compute the probability of each possible prediction.

Another line of research looks at how the problem of imbalanced classes can be addressed using BR. Class imbalance constitutes an important challenge in multi-label classification in general, since most labels are usually irrelevant for an instance, i.e., the overwhelming majority of labels in a binary task is negative. Using BR, the imbalance can be "repaired" in a label-wise manner, using techniques for standard binary classification, such as sampling [5] or thresholding the decision boundary [13]. An approach taking dependencies among labels into account (and hence applied prior to splitting the problem) is presented in [27].

To the best of our knowledge, this is the first approach in which the base learner used for the different labels is subject to optimization itself. In fact, except for AutoML tools, we are not even aware of an approach optimizing a single base learner applied to all labels. In all the above approaches, the choice of the base learners is an external decision and not part of the learning problem itself.

## **4 Label-Wise Selection of Base Learners**

As stated before, while various attempts at improving binary relevance learning by capturing label dependencies have been made, the choice of the base learner for tackling the underlying binary problems, as another potential source of improvement, has attracted much less attention in the literature so far. If considered at all, this choice has been restricted to the selection of a *single* learner, which is applied to all $m$ binary problems simultaneously.

We proceed from a portfolio of base learners

$$\mathcal{A} := \left\{ a \mid a: (\mathcal{X}^n \times \{0, 1\}^n) \longrightarrow (\mathcal{X} \longrightarrow \{0, 1\}) \right\}.$$

Then, given training data $\mathcal{D}_{\text{train}} = (X_{\text{train}}, Y_{\text{train}})$, the objective is to find the base learner $a$ for which BR presumably performs best on test data $\mathcal{D}_{\text{test}} = (X_{\text{test}}, Y_{\text{test}})$ with respect to some loss function $\mathcal{L}$:

$$\mathop{\arg\min}\_{a \in \mathcal{A}} \mathcal{L}\left(Y\_{\text{test}}, \mathbf{BR}\_b(X\_{\text{test}})\right), \text{ with } b\_j := a\left(X\_{\text{train}}, Y\_{\text{train}}^{(j)}\right), \tag{3}$$

where $Y^{(j)}_{\text{train}}$ denotes the $j$-th column of the label matrix $Y_{\text{train}}$.

Moreover, we propose to leverage the independence assumption underlying BR to select a different base learner for each of the labels, and refer to this variant as LiBRe. We are thus interested in solving the following problem:

$$\arg\min\_{a \in \mathcal{A}^m} \mathcal{L}\left(Y\_{\text{test}}, \mathbf{BR}\_b(X\_{\text{test}})\right), \text{ with } b\_j := a\_j \left(X\_{\text{train}}, Y\_{\text{train}}^{(j)}\right) \ . \tag{4}$$

Compared to (3), we thus significantly increase flexibility. In fact, by taking advantage of the different behaviour of the respective base learners, and the ability to model the relationship between features and a class label differently for each binary problem, one may expect to improve the overall performance of BR. On the other hand, the BR learner as a whole is now equipped with many degrees of freedom, namely the choice of the base learners, which can be seen as "hyper-parameters" of LiBRe. Since this may easily lead to undesirable effects such as over-fitting of the training data, an improvement in terms of generalization performance (approximated by the performance on the test data) is by no means self-evident. From this point of view, the restriction to a single base learner in (3) can also be seen as a sort of regularization. Such regularization can indeed be justified for various reasons. In most cases, for example, the binary problems are indeed not completely different but share important characteristics.

Computationally, (4) may appear more expensive than choosing a single base learner jointly for all labels, at least at first sight. However, the complexity in terms of the number of base learners to be evaluated remains exactly the same. In fact, just like in (3), we need to fit a BR model for every base learner exactly once. The only difference is that, instead of picking one of the base learners for all labels in the end, LiBRe assembles the base learners performing best for the respective labels (recall that we focus on label-wise decomposable performance measures).
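The selection in (4) then amounts to keeping, per label, the learner with the best validation score. A sketch under the assumption of a scikit-learn-style portfolio (the function and names are ours):

```python
import numpy as np
from sklearn.base import clone

def libre_select(portfolio, X_tr, Y_tr, X_val, Y_val, label_score):
    # label_score(y_true, y_pred) is a label-wise measure to maximize,
    # e.g. sklearn.metrics.f1_score. Every learner is fitted once per
    # label, exactly as in the grid search over (3).
    chosen = []
    for j in range(Y_tr.shape[1]):
        fitted = [clone(a).fit(X_tr, Y_tr[:, j]) for a in portfolio]
        scores = [label_score(Y_val[:, j], m.predict(X_val)) for m in fitted]
        chosen.append(fitted[int(np.argmax(scores))])
    return chosen  # one base learner per label
```

For SBB as in (3), one would instead average the label-wise scores per learner and keep a single learner for all labels.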

## **5 Experimental Evaluation**

This section presents an empirical evaluation of LiBRe, comparing it to the use of a single base learner as a baseline. We first describe the experimental setup (Sect. 5.1), specify the baseline with the single best base learner (Sect. 5.2), and define the oracle performance (Sect. 5.3) for an upper bound. Finally, the experimental results are presented in Sect. 5.4.

### **5.1 Experimental Setup**

For the evaluation, we considered a total of 24 MLC datasets. These datasets stem from various domains, such as text, audio, image classification, and biology, and range from small datasets with only a few instances and labels to larger datasets with thousands of instances and hundreds of labels. A detailed overview is given in Table 1, where, in addition to the number of instances (#I) and number of labels (#L), statistics regarding the label-to-instance ratio (L2IR), the percentage of unique label combinations (ULC), and the average label cardinality (card.) are given.

The train and validation folds were derived by conducting a nested 2-fold cross-validation, i.e., to assess the test performance we have an outer loop of 2-fold cross-validation. To tune the thresholds and select the base learner, we again split the training fold of the outer loop into train and validation sets by 2-fold cross-validation. The entire process is repeated 5 times with different random seeds for the cross-validation. Throughout this study, we trained and evaluated a total of 14,400 instances of BR and, accordingly, 649,800 base learners.

Furthermore, we consider two performance measures, namely the Hamming loss $\mathcal{L}_H$ and the macro-averaged label-wise F-measure as defined in (1) and (2), respectively. A binary prediction is obtained by thresholding the prediction of an underlying scoring classifier, which produces values in the unit interval (the higher the value, the more likely a label is considered relevant). The thresholds $\tau = (\tau_1, \tau_2, \ldots, \tau_m)$ are optimized by a grid search considering values $\tau_i \in [0, 1]$ with a step size of 0.01. When optimizing the thresholds, we either allow for label-wise optimization or constrain the threshold to be the same for all labels (uniform $\tau$), i.e., $\tau_i = \tau_j$ for all $i, j \in \{1, \ldots, m\}$.
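A sketch of this grid search (ours); for concreteness it minimizes the label-wise error rate, whereas in the study the respective target measure is tuned.

```python
import numpy as np

def tune_thresholds(scores, Y, step=0.01, uniform=False):
    # scores: S x m validation scores in [0, 1]; Y: S x m true labels.
    grid = np.arange(0.0, 1.0 + step, step)
    if uniform:                            # one tau shared by all labels
        errs = [np.mean((scores >= t) != Y) for t in grid]
        return np.full(Y.shape[1], grid[int(np.argmin(errs))])
    taus = []
    for j in range(Y.shape[1]):            # label-wise tau_j
        errs = [np.mean((scores[:, j] >= t) != Y[:, j]) for t in grid]
        taus.append(grid[int(np.argmin(errs))])
    return np.array(taus)
```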

In order to determine the significance of the results, we apply a Wilcoxon signed rank test with a threshold for the p-value of 0.05. Significant improvements of LiBRe are marked by • and significant degradations by ◦.

We executed the single BR evaluation runs, i.e., training and evaluating either on the validation or test split, on up to 300 nodes in parallel, each of them equipped with 8 CPU cores and 32 GB of RAM, and a timeout of 6 h. Due to these memory and runtime limits, some of the evaluations failed with memory overflows or timeouts.

The implementation is based on the Java machine learning library WEKA [8] and an extension for multi-label classification called MEKA [16]. In our study, we consider a total of 20 base learners from WEKA: BayesNet (BN), DecisionStump (DS), IBk, J48, JRip (JR), KStar (KS), LMT, Logistic (L),


**Table 1.** The datasets used in this study. For each dataset, the number of instances (#I), the number of labels (#L), the label-to-instance ratio (L2IR), the percentage of unique label combinations (ULC), and the label cardinality (card.) are given.

MultilayerPerceptron (MlP), NaiveBayes (NB), NaiveBayesMultinomial (NBM), OneR (1R), PART (P), REPTree (REP), RandomForest (RF), RandomTree (RT), SMO, SimpleLogistic (SL), VotedPerceptron (VP), and ZeroR (0R). All data and source code are made available via GitHub (https://github.com/mwever/LiBRe).

#### **5.2 Single Best Base Learner**

To figure out how much we can benefit from selecting a base learner for each label individually, and whether this flexibility is beneficial at all, we define the single best base learner, subsequently referred to as SBB, as a baseline. In principle, SBB is nothing but a grid search over the portfolio of base learners (3).

For each candidate base learner a, the same learner is employed for every label. After training and validating the performance, we pick the base learner that performs best overall. This baseline thus gives an upper bound on what can be achieved when the base learner is not chosen for each label individually. Simple and straightforward as it is, this baseline represents what is currently possible in implementations of MLC libraries, and already goes beyond what is most commonly done in the literature.
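The following sketch contrasts the two selection schemes on an assumed validation-performance matrix `val_perf` of shape [learners, labels] (hypothetical data); note that both schemes reuse the same evaluations, so label-wise selection incurs no extra training cost:

```python
import numpy as np

rng = np.random.default_rng(0)
val_perf = rng.random((20, 10))        # 20 base learners, 10 labels (dummy values)

# SBB: one learner for all labels, chosen by best average validation performance.
sbb_learner = int(np.argmax(val_perf.mean(axis=1)))

# LiBRe-style selection: the best learner per label, from the same evaluations.
per_label_learners = np.argmax(val_perf, axis=0)

print(sbb_learner, per_label_learners)
```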

#### **5.3 Optimistic Versus Validated Optimization**

In addition to the results obtained by selecting the base learner(s) according to the validation performance (obtained in the inner loop of the nested cross-validation), we consider optimistic performance estimates, which are obtained as follows: after having trained the base learners on the training data, we select the presumably best one, not on the basis of its performance on validation data, but based on its actual test performance (as observed in the outer loop of the nested cross-validation). Intuitively, this can be understood as a kind of "oracle" performance: given a set of candidate predictors to choose from, the oracle anticipates which of them will perform best on the test data.

**Fig. 1.** The heat map shows the average share of each base learner being employed for a label, with respect to the optimized performance measure: Hamming loss ($L_H$) or the label-wise macro-averaged F-measure (F).

Although these estimates should be treated with caution, and will certainly tend to overestimate the true generalization performance of a classifier, they give some indication of the potential of the optimization. More specifically, the optimistic performance estimates suggest an upper bound on what can be obtained by the nested optimization routine.
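Schematically, the two estimates differ only in which performance array the argmax is taken over (a sketch with hypothetical arrays for a single label):

```python
import numpy as np

rng = np.random.default_rng(1)
val_perf = rng.random(20)      # validation performance of 20 candidate learners
test_perf = rng.random(20)     # their test performance (outer loop)

validated_choice = int(np.argmax(val_perf))    # standard model selection
oracle_choice = int(np.argmax(test_perf))      # optimistic "oracle" estimate

# The oracle's test performance can only be matched, never exceeded,
# by the validated choice:
assert test_perf[validated_choice] <= test_perf[oracle_choice]
```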

#### **5.4 Results**

In Fig. 1, the average share of each base learner per label is shown. From this heat map, it becomes obvious that for the SBB baseline only a subset of base learners plays a role. However, one can also notice that the distribution of the shares varies with the performance measure being optimized. Furthermore, although random forest (RF) achieves substantial shares of 0.8 for the Hamming loss and around 0.6 for the F-measure, it is not the best choice on all datasets. Put differently, one still needs to optimize the base learner per dataset, especially when different performance measures are of interest.

In the case of LiBRe, the shares are distributed much more evenly over the base learners than for SBB. For example, the share of RF decreases to 0.29 for the F-measure and to 0.25 for the Hamming loss. Moreover, base learners that played no role at all in SBB now gain in importance and are selected quite often. Although there are considerable differences in how frequently the base learners are picked, not a single base learner in the portfolio was never selected.

In Table 2, the results for optimizing the Hamming loss are presented. The optimistic performance estimates already indicate that there is not much room for improvement. This comes as no surprise, since performance on these datasets is largely saturated, i.e., the loss is already close to 0 for most of them. While LiBRe performs competitively with SBB in the setting with uniform τ, SBB compares favourably to LiBRe when the thresholds can be tuned in a label-wise manner. Apparently, the additional degrees of freedom make LiBRe more prone to over-fitting, especially on smaller datasets.

**Table 2.** Results obtained for minimizing $L_H$, optimistically resp. with validation performances. Thresholds are optimized either jointly for all labels (uniform τ) or label-wise. Best performances per setting and dataset are highlighted in bold. Significant improvements of LiBRe are marked by • and degradations by ◦.

In contrast to the previous results, for the optimization of the F-measure, the optimistic performance estimates already give a promising outlook on the potential for improving the generalization performance through label-wise selection of the base learners. More precisely, they indicate that performance gains of up to 11 percentage points are possible. Independent of the threshold optimization variant, LiBRe outperforms the SBB baseline, yielding the best performance on two thirds of the considered datasets, with 13 significant improvements in the case of uniform τ and 11 in the case of label-wise τ. Significant degradations of LiBRe compared to SBB are observed for only 2 and 3 datasets, respectively. Hence, for the F-measure, LiBRe compares favorably to the SBB baseline.

In summary, we conclude that LiBRe does indeed yield performance improvements. However, increasing the flexibility of BR also makes it more prone to over-fitting. Furthermore, these results were obtained with a nested 2-fold cross-validation; while this keeps the computational cost of the evaluation reasonable, it implies that, for the purpose of validation, the base learners were trained on only one fourth of the original dataset. Considering a nested 5-fold or 10-fold cross-validation could therefore help to reduce the observed over-fitting.

**Table 3.** Results for maximizing the F-measure optimistically resp. with validation performances. Thresholds are optimized either jointly for all the labels (uniform τ ) or label-wise. Best performances per setting and dataset are highlighted in bold. Significant improvements of LiBRe are marked by a • and degradations by ◦.


## **6 Conclusion**

In this paper, we have demonstrated not only the potential of binary relevance to optimize label-wise macro-averaged measures, but also the importance of the base learner as a hyper-parameter for each label. Especially when optimizing the F-measure macro-averaged over the labels, we achieved significant performance improvements by choosing a proper base learner in a label-wise manner. Compared to selecting the single best base learner, choosing the base learner for each label individually comes at no additional cost in terms of base learner evaluations. Moreover, the label-wise selection of base learners can be realized by a straightforward grid search.

As the label-wise choice of a base learner has already led to considerable performance gains, we plan to examine to what extent the optimization of the hyper-parameters of those base learners can lead to further improvements. Furthermore, we want to increase the efficiency of the tuning by replacing the grid search with a heuristic approach.

Another direction of future work concerns the avoidance of over-fitting caused by the excessive flexibility of LiBRe. As already explained, the restriction to a single base learner can be seen as a kind of regularization, which, however, appears to be too strong, at least according to our results. On the other hand, the full flexibility of LiBRe does not always pay off either. An interesting compromise could be to restrict the number of different base learners used by LiBRe to a suitable value $k \in \{1, \dots, m\}$. Technically, this comes down to finding the arg min in (4), not over $\mathbf{a} \in \mathcal{A}^m$, but over $\{\mathbf{a} \in \mathcal{A}^m \mid \#\{a_1, \dots, a_m\} \le k\}$.
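A brute-force sketch of this restricted variant is given below; it enumerates learner subsets of size k exactly, whereas a practical implementation would use the kind of heuristic search mentioned above (`val_perf` is again a hypothetical validation-performance matrix, higher being better):

```python
from itertools import combinations
import numpy as np

def best_restricted_assignment(val_perf, k):
    """Best per-label assignment using at most k distinct base learners."""
    n_learners, n_labels = val_perf.shape
    best_score, best_assignment = -np.inf, None
    # Exponential enumeration of all learner subsets of size k.
    for subset in combinations(range(n_learners), k):
        sub = val_perf[list(subset), :]
        score = sub.max(axis=0).sum()   # best learner within the subset, per label
        if score > best_score:
            best_score = score
            best_assignment = [subset[i] for i in sub.argmax(axis=0)]
    return best_assignment

val_perf = np.random.default_rng(2).random((8, 5))   # dummy data
print(best_restricted_assignment(val_perf, k=2))
```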

**Acknowledgement.** This work was supported by the German Research Foundation (DFG) within the Collaborative Research Center "On-The-Fly Computing" (SFB 901/3 project no. 160364472). The authors also gratefully acknowledge support of this project through computing time provided by the Paderborn Center for Parallel Computing (PC<sup>2</sup>).

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Angle-Based Crowding Degree Estimation for Many-Objective Optimization**

Yani Xue<sup>1</sup>(B), Miqing Li<sup>2</sup>, and Xiaohui Liu<sup>1</sup>

<sup>1</sup> Department of Computer Science, Brunel University London, Uxbridge, Middlesex UB8 3PH, UK ynxue6219@gmail.com

<sup>2</sup> CERCIA, School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK

**Abstract.** Many-objective optimization, which deals with an optimization problem with more than three objectives, poses a big challenge to various search techniques, including evolutionary algorithms. Recently, a meta-objective optimization approach (called bi-goal evolution, BiGE) which maps solutions from the original high-dimensional objective space into a bi-goal space of proximity and crowding degree has received increasing attention in the area. However, it has been found that BiGE tends to struggle on a class of many-objective problems where the search process involves *dominance resistant solutions*, namely, solutions with an extremely poor value in at least one of the objectives but with (near) optimal values in some of the others. It is difficult for BiGE to get rid of dominance resistant solutions, as they are Pareto nondominated and far away from the main population, thus always having a good crowding degree. In this paper, we propose an angle-based crowding degree estimation method for BiGE (denoted as aBiGE) to replace the distance-based crowding degree estimation in BiGE. Experimental studies show the effectiveness of this replacement.

**Keywords:** Many-objective optimization · Evolutionary algorithm · Bi-goal evolution · Angle-based crowding degree estimation

## **1 Introduction**

Many-objective optimization problems (MaOPs) refer to the optimization of four or more conflicting criteria or objectives at the same time. MaOPs exist in many fields, such as environmental engineering, software engineering, control engineering, industry, and finance. For example, when assessing the performance of a machine learning algorithm, one may need to take into account not only accuracy but also some other criteria such as efficiency, misclassification cost, interpretability, and security.

There is often no single best solution for an MaOP, since improving one objective typically leads to a deterioration in others. In the past three decades, multi-objective evolutionary algorithms (MOEAs) have been successfully applied to many real-world optimization problems with a low-dimensional objective space (two or three conflicting objectives) to search for a set of trade-off solutions.

The major purpose of MOEAs is to provide a population (a set of optimal individuals or solutions) that balances proximity (converging the population to the Pareto front) and diversity (spreading the population over the whole Pareto front). With these two goals in mind, traditional MOEAs such as SPEA2 [13] and NSGA-II [1] mainly rely on Pareto dominance relations between solutions together with diversity control mechanisms.

However, well-known Pareto-based evolutionary algorithms, effective on low-dimensional problems, lose their efficiency in solving MaOPs. In MaOPs, most solutions in a population become mutually non-dominated, so the Pareto dominance selection criterion fails to distinguish between solutions and to drive the population towards the Pareto front. The density criterion is then activated to guide the search, resulting in a substantial reduction of the convergence of the population and a slowdown of the evolution process. This is termed the *active diversity promotion* (ADP) phenomenon in [11].

Some studies [6] observed that the main reason for the ADP phenomenon is the preference for dominance resistant solutions (DRSs). DRSs are solutions that are extremely inferior to others in at least one objective but have near-optimal values in some others. They are treated as Pareto-optimal despite having very poor performance in terms of proximity. As a result, Pareto-based evolutionary algorithms may end up with a population that is widely spread but far away from the true Pareto front.

To address the difficulties of MOEAs in high-dimensional objective space, one approach is to modify the Pareto dominance relation; powerful algorithms in this category include ε-MOEA [2] and fuzzy Pareto dominance [5]. These methods work well under certain circumstances, but they often involve extra parameters on whose setting their performance depends. The other approach, which does not rely on the Pareto dominance relation, may be divided into two categories: aggregation-based algorithms [15] and indicator-based algorithms [14]. Both have been successfully applied in practice; however, the diversity performance of aggregation-based algorithms depends on the distribution of the weight vectors, while indicator-based algorithms define specific performance indicators to guide the search.

Recently, a meta-objective optimization algorithm called Bi-Goal Evolution (BiGE) [8] was proposed for MaOPs; the corresponding paper has become the most cited paper published in the Artificial Intelligence journal over the past four years. BiGE was inspired by two observations in many-objective optimization: (1) the conflict between the proximity and diversity requirements is aggravated as the number of objectives increases, and (2) the Pareto dominance relation is not effective in solving MaOPs. In BiGE, two indicators are used to estimate the proximity and the crowding degree of solutions in the population, respectively. By doing so, BiGE maps solutions from the original objective space into a bi-goal objective space and handles the two goals by nondominated sorting. This provides sufficient selection pressure towards the Pareto front, regardless of the number of objectives of the optimization problem.

However, despite its attractive features, it has been found that BiGE tends to struggle on a class of many-objective problems where the search process involves DRSs. DRSs are far away from the main population and are always ranked as good solutions by BiGE, thus hindering the evolutionary progress of the population. To address this issue, this paper proposes an angle-based crowding degree estimation method for BiGE (denoted as aBiGE). The rest of the paper is organized as follows. Section 2 gives some concepts and terminology about many-objective optimization. In Sect. 3, we present our angle-based crowding degree estimation method and its incorporation into BiGE. The experimental results are detailed in Sect. 4. Finally, the conclusions and future work are set out in Sect. 5.

## **2 Concepts and Terminology**

When dealing with optimization problems in the real world, more than three performance criteria may be involved in determining how "good" a certain solution is. These criteria, termed objectives (e.g., cost, safety, efficiency), need to be optimized simultaneously but usually conflict with each other. This type of problem is called a many-objective optimization problem (MaOP). A minimization MaOP can be mathematically defined as follows:

$$\begin{array}{ll}\text{minimize} & F(x) = (f_1(x), f_2(x), \dots, f_N(x)) \\ \text{subject to} & g_j(x) \le 0, \quad j = 1, 2, \dots, J \\ & h_k(x) = 0, \quad k = 1, 2, \dots, K \\ & x = (x_1, x_2, \dots, x_M), \quad x \in \Omega \end{array} \tag{1}$$

where x denotes an M-dimensional decision variable vector from the feasible region Ω in the decision space, F(x) represents an N-dimensional objective vector (with N larger than three), $f_i(x)$ is the i-th objective to be minimized, the objective functions $f_1, f_2, \dots, f_N$ constitute the N-dimensional objective space, and $g_j(x) \le 0$ and $h_k(x) = 0$ define the J inequality and K equality constraints, respectively.

**Definition 1 (Pareto Dominance).** *Given two decision vectors* x, y ∈ Ω *of a minimization problem,* x *is said to (Pareto) dominate* y *(denoted as* x ≺ y*), or equivalently* y *is dominated by* x*, if and only if* [4]

$$\forall i \in \{1, 2, \ldots, N\}: f_i(x) \le f_i(y) \;\land\; \exists i \in \{1, 2, \ldots, N\}: f_i(x) < f_i(y). \tag{2}$$

*Namely, given two solutions, one is said to dominate the other if it is at least as good as the other in every objective and strictly better in at least one objective.*
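In code, Definition 1 for a minimization problem amounts to the following check (a minimal illustration):

```python
def dominates(x, y):
    """True iff objective vector x Pareto-dominates y (minimization)."""
    return all(xi <= yi for xi, yi in zip(x, y)) and \
           any(xi < yi for xi, yi in zip(x, y))

print(dominates([0.2, 0.5], [0.3, 0.5]))  # True: at least as good everywhere,
                                          # strictly better in the first objective
```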

**Definition 2 (Pareto Optimality).** *A solution* x ∈ Ω *is said to be Pareto optimal if and only if there is no solution* y ∈ Ω *that dominates it. Solutions that are not dominated by any other solution are said to be Pareto-optimal (or non-dominated).*

**Definition 3 (Pareto Set).** *All Pareto-optimal (or non-dominated) solutions in the decision space constitute the Pareto set (PS).*

**Definition 4 (Pareto Front).** *The Pareto front (PF) is the set of objective vectors corresponding to a Pareto set.*

**Definition 5 (Dominance Resistant Solution).** *Given a solution set, a dominance resistant solution (DRS) is a solution with an extremely poor value in at least one objective, but with a near-optimal value in some other objective.*

## **3 The Proposed Algorithm: aBiGE**

### **3.1 A Brief Review of BiGE**

Algorithm 1 shows the basic framework of BiGE. First, a parent population of M solutions is randomly initialized. Second, the proximity and crowding degree of each solution are estimated. Third, in the mating selection, individuals of better quality with regard to proximity and crowding degree tend to become parents of the next generation. Afterward, variation operators (e.g., crossover and mutation) are applied to these parents to produce an offspring population. Finally, the environmental selection reduces the expanded population of parents and offspring to M individuals, which form the new parent population of the next generation.
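To make the control flow concrete, the following schematic Python sketch mirrors the loop just described; Algorithm 1 itself is not reproduced here, and every function parameter (`initialize`, `estimate`, `mating_selection`, `vary`, `environmental_selection`) is a hypothetical placeholder for the corresponding step:

```python
# Schematic sketch of the BiGE main loop (not the authors' code).
def bige(M, n_generations, initialize, estimate, mating_selection, vary,
         environmental_selection):
    population = initialize(M)               # random parent population of size M
    for _ in range(n_generations):
        f_p, f_c = estimate(population)      # proximity and crowding degree
        parents = mating_selection(population, f_p, f_c)
        offspring = vary(parents)            # crossover and mutation
        # Reduce parents + offspring back to M individuals.
        population = environmental_selection(population + offspring, M)
    return population
```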


In particular, a simple aggregation function is adopted to estimate the proximity of an individual. For an individual x in a population, its aggregation value, denoted $f_p(x)$, is calculated as the sum of its normalized objective values in the range [0, 1] (line 3 in Algorithm 1), formulated as [8]:

$$f_p(x) = \sum_{j=1}^{N} \widetilde{f}_j(x), \tag{3}$$

where $\widetilde{f}_j(x)$ denotes the normalized value of individual x in the j-th objective, and N is the number of objectives. A smaller $f_p$ value usually indicates good performance on proximity. A DRS, in particular, is likely to obtain a significantly larger $f_p$ value than the other individuals in a population.

In addition, the crowding degree of an individual x (line 4 in Algorithm 1) is defined as follows [8]:

$$f_c(x) = \Big(\sum_{y \in \Omega,\, x \neq y} sh(x, y)\Big)^{1/2}, \tag{4}$$

where sh(x, y) denotes a sharing function: a penalized Euclidean distance between two individuals x and y, using a weight parameter, defined as follows:

$$sh(x,y) = \begin{cases} (0.5(1 - \frac{d(x,y)}{r}))^2, & \text{if } d(x,y) < r,\ f_p(x) < f_p(y) \\ (1.5(1 - \frac{d(x,y)}{r}))^2, & \text{if } d(x,y) < r,\ f_p(x) > f_p(y) \\ rand(), & \text{if } d(x,y) < r,\ f_p(x) = f_p(y) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where r is the radius of a niche, adaptively calculated as $r = 1/\sqrt[N]{M}$ (M is the population size and N is the number of objectives). The function rand() randomly assigns *either* $sh(x, y) = (0.5(1 - d(x, y)/r))^2$ and $sh(y, x) = (1.5(1 - d(x, y)/r))^2$ *or* $sh(x, y) = (1.5(1 - d(x, y)/r))^2$ and $sh(y, x) = (0.5(1 - d(x, y)/r))^2$. Individuals with a lower crowding degree show better performance on diversity.
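Under these definitions, the crowding degree computation might be sketched as follows; individuals are assumed to be tuples of normalized objective values, and `fp` is assumed to map each individual to its proximity value (both assumptions, not the original implementation). For simplicity, the rand() case here draws the factor independently rather than in the paired fashion prescribed by (5).

```python
import math
import random

def sharing(d, r, fp_x, fp_y):
    """Sharing value sh(x, y) from (5), for distance d = d(x, y), niche radius r."""
    if d >= r:
        return 0.0
    if fp_x < fp_y:
        return (0.5 * (1.0 - d / r)) ** 2
    if fp_x > fp_y:
        return (1.5 * (1.0 - d / r)) ** 2
    # Equal proximity: randomly assign the 0.5 or the 1.5 factor.
    return (random.choice([0.5, 1.5]) * (1.0 - d / r)) ** 2

def crowding_degree(x, population, fp, N, M):
    """Crowding degree f_c(x) from (4), with niche radius r = 1 / M**(1/N)."""
    r = 1.0 / M ** (1.0 / N)
    total = sum(sharing(math.dist(x, y), r, fp[x], fp[y])
                for y in population if y is not x)
    return math.sqrt(total)
```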

It has been observed that BiGE tends to struggle on a class of MaOPs where the search process involves DRSs, such as DTLZ1 and DTLZ3 (from the well-known benchmark test suite DTLZ [3]). Figure 1 shows the true Pareto front of the eight-objective DTLZ1 and the final solution set of BiGE in one typical run on this problem, shown by parallel coordinates. Parallel coordinates map a many-objective solution set to a 2D plane. Li et al. [9] systematically explained how to read many-objective solution sets in parallel coordinates, and showed that parallel coordinates can *partly* reflect the quality of a solution set in terms of convergence, coverage, and uniformity.

Clearly, some solutions produced by BiGE are far away from the Pareto front: the solution set on the eight-objective DTLZ1 ranges from 0 to around 450, compared to the Pareto front ranging from 0 to 0.5 on each objective. Such solutions always have a poor proximity degree but a good crowding degree (estimated by Euclidean distance) in the bi-goal objective space (i.e., convergence and diversity), and will be preferred since no solution in the population dominates them in BiGE. These solutions are detrimental to BiGE's ability to converge the population to the Pareto front, given their poor convergence. A straightforward way to remove DRSs is to change the crowding degree estimation method.

**Fig. 1.** The true Pareto front and the final solution set of BiGE on the eight-objective DTLZ1, shown by parallel coordinates.

### **3.2 Basic Idea**

The basic idea of the proposed method is based on some observations of DRSs. Figure 2 shows a typical situation: a non-dominated set of five individuals, including two DRSs (i.e., A and E), in a two-dimensional objective minimization scenario.

As can be seen, it is difficult to find a solution that dominates a DRS when the crowding degree is estimated using the Euclidean distance. Take individual A as an example: it performs well on objective f<sup>1</sup> (slightly better than B, with a near-optimal value of 0) but is inferior to all other solutions on objective f<sup>2</sup>. It is difficult to find a solution with a better value than A on objective f<sup>1</sup>; the same holds for individual E on objective f<sup>2</sup>. A and E (with poor proximity degree but good crowding degree) are considered good solutions and have a high chance of surviving into the next generation in BiGE. However, the results would be different if the distance-based crowding degree estimation were replaced by a vector angle. It can be observed that (1) an individual in a crowded area has a smaller vector angle to its neighbor than an individual in a sparse area, e.g., C and D, and (2) a DRS has an extremely small vector angle to its neighbor, e.g., the angle between A and B or the angle between E and D. Namely, DRSs would be assigned both poor proximity and poor crowding degrees, and would thus have a high chance of being deleted during the evolutionary process. Therefore, vector angles have the advantage of distinguishing DRSs in the population and can be used for crowding degree estimation.

**Fig. 2.** An illustration of a population of five solutions with two DRSs - A and E. They have good crowding degrees estimated by the Euclidean distance, but poor crowding degrees calculated by the vector angle between two neighbors.

### **3.3 Angle-Based Crowding Degree Estimation**

Inspired by the work in [12], we propose a novel angle-based crowding degree estimation method and integrate it into the BiGE framework (line 4 in Algorithm 1); the resulting algorithm is called aBiGE. Before estimating the diversity of an individual in a population in aBiGE, we first introduce some basic definitions.

**Norm.** For an individual $x_i$, its norm in the normalized objective space, denoted $norm(x_i)$, is defined as [12]:

$$norm(x_i) = \sqrt{\sum_{j=1}^{N} \widetilde{f}_j(x_i)^2}. \tag{6}$$

**Vector Angles.** The vector angle between two individuals $x_i$ and $x_k$ is defined as follows [12]:

$$angle_{x_i \to x_k} = \arccos\left|\frac{F'(x_i) \bullet F'(x_k)}{norm(x_i) \cdot norm(x_k)}\right|, \tag{7}$$

where $F'(x_i) \bullet F'(x_k)$ is the inner product between $F'(x_i)$ and $F'(x_k)$, defined as:

$$F'(x_i) \bullet F'(x_k) = \sum_{j=1}^{N} \widetilde{f}_j(x_i) \cdot \widetilde{f}_j(x_k). \tag{8}$$

Note that $angle_{x_i \to x_k} \in [0, \pi/2]$.

The vector angle from an individual $x_i \in \Omega$ to the population P is defined as the minimum vector angle between $x_i$ and any other individual in P: $\theta(x_i) = \min_{x_k \in P,\, x_k \neq x_i} angle_{x_i \to x_k}$.
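A direct transcription of (6)-(8) and of the angle to the population, assuming individuals are given as sequences of normalized objective values (a sketch, not the authors' code):

```python
import math

def norm(x):
    """Norm of an individual in the normalized objective space, as in (6)."""
    return math.sqrt(sum(v * v for v in x))

def angle(x, y):
    """Vector angle between two individuals, as in (7) and (8)."""
    inner = sum(a * b for a, b in zip(x, y))
    # Clamp to guard against floating-point values slightly above 1.
    c = min(1.0, abs(inner) / (norm(x) * norm(y)))
    return math.acos(c)          # lies in [0, pi/2]

def theta(x, population):
    """Minimum vector angle from x to any other individual in the population."""
    return min(angle(x, y) for y in population if y is not x)
```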

When an individual x is selected into the archive during environmental selection, its θ(x) value is penalized. Several factors need to be considered in order to achieve a good balance between proximity and diversity.


Keeping the above factors in mind, in aBiGE, the diversity estimate of an individual x ∈ Ω based on vector angles is defined as

$$f_a(x) = \frac{c+1}{\theta(x) \cdot (p+1) + \frac{\pi}{90}}. \tag{9}$$

By applying the angle-based crowding degree estimation method within the BiGE framework for minimizing many-objective optimization problems, we aim to enhance the selection pressure on the non-dominated solutions in each generation and to avoid the negative influence of DRSs during optimization. Note that a smaller value of $f_a(x)$ is preferred.

## **4 Experiments**

### **4.1 Experimental Design**

To test the performance of the proposed aBiGE on MaOPs where the search process involves DRSs, experiments are conducted on nine DTLZ test instances: for each test problem (i.e., DTLZ1, DTLZ3, and DTLZ7), five, eight, and ten objectives are considered.

To make a fair comparison with the state-of-the-art BiGE for MaOPs, we kept the same settings as in [8]. The settings for both BiGE and aBiGE are:

– The population size of both algorithms is set to 100 for all test problems.


The algorithms' performance is assessed by indicators that consider both proximity and diversity. In this paper, a modified version of the inverted generational distance indicator (IGD) [15], called IGD+ [7], is chosen as the performance indicator. Although IGD has been widely used to evaluate the performance of MOEAs on MaOPs, it has been shown [10] that IGD should be replaced by IGD+ to make the indicator compatible with Pareto dominance. IGD+ evaluates a solution set in terms of both convergence and diversity, and a smaller value indicates better quality.
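For reference, a small sketch of IGD+ following its standard definition for minimization [7]: each reference point z contributes the distance $d^+(a, z) = \sqrt{\sum_i \max(a_i - z_i, 0)^2}$ to its nearest solution (array names are illustrative):

```python
import numpy as np

def igd_plus(A, Z):
    """IGD+ of solution set A w.r.t. reference points Z (both [points, objectives])."""
    # d+(a, z): only the amount by which a is worse than z counts.
    diffs = np.maximum(A[None, :, :] - Z[:, None, :], 0.0)   # [|Z|, |A|, N]
    d_plus = np.sqrt((diffs ** 2).sum(axis=2))               # [|Z|, |A|]
    return d_plus.min(axis=1).mean()          # mean distance to nearest solution

Z = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])   # reference front (dummy)
A = np.array([[0.2, 0.9], [0.6, 0.6]])               # solution set (dummy)
print(igd_plus(A, Z))                                # smaller is better
```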

### **4.2 Performance Comparison**

**Test Problems with DRSs.** Table 1 shows the mean and standard deviation of the IGD+ results on the nine DTLZ test instances with DRSs. For each test instance, the algorithm with the best IGD+ result is shown in bold. As can be seen from the table, for MaOPs with DRSs, the proposed aBiGE performs significantly better than BiGE on all test instances in terms of convergence and diversity.



To visualize the experimental results, Figs. 3 and 4 plot, by parallel coordinates, the final solution sets of one run on the five-objective DTLZ1 and the five-objective DTLZ7, respectively. The run shown is the one whose IGD+ value is closest to the mean. As shown in Fig. 3(a), the approximation set obtained by BiGE has inferior convergence on the five-objective DTLZ1, its solution set ranging from 0 to about 400 in contrast to the Pareto front, which ranges from 0 to 0.5 on each objective. From Fig. 3(b), it can be observed that the solution set obtained by the proposed aBiGE converges to the Pareto front, with only a few individuals failing to converge.

**Fig. 3.** The final solution sets of the two algorithms on the five-objective DTLZ1, shown by parallel coordinates.

**Fig. 4.** The final solution sets of the two algorithms on the five-objective DTLZ7, shown by parallel coordinates.

For the solutions of the five-objective DTLZ7, the boundary of the first four objectives is in the range [0, 1], and the boundary of the last objective is in the range [3.49, 10], according to the formula of DTLZ7. As can be seen from Fig. 4, all solutions of the proposed aBiGE appear to converge to the Pareto front. In contrast, some solutions of BiGE (with objective values beyond the upper boundary of the 5th objective) fail to reach the Pareto front. In addition, the solution set of the proposed aBiGE covers the boundaries better than that of BiGE. In particular, the solution set of BiGE fails to cover the region from 3.49 to 6 on the last objective, while the solution set of the proposed aBiGE does not cover the part of the Pareto front below 4 on the 5th objective.

**Test Problem Without DRSs.** Figure 5 gives the final solution sets of both algorithms on the ten-objective DTLZ2 in order to visualize their distribution on an MaOP without DRSs. As can be seen, the final solution sets of both algorithms cover the Pareto front, with lower and upper boundaries within [0, 1] on each objective. Moreover, following [9], the parallel coordinates in Fig. 5 partly reflect that the diversity of the solutions obtained by aBiGE is slightly worse than that of BiGE. This observation is confirmed by the IGD+ indicator, for which BiGE obtains a slightly lower (better) value than the proposed aBiGE.

**Fig. 5.** The final solution sets of BiGE and aBiGE on the ten-objective DTLZ2 and evaluated by IGD+ indicator, shown by parallel coordinates. (a) BiGE (IGD+ = 2.4319E*−*01) (b) aBiGE (IGD+ = 2.5021E*−*01).

## **5 Conclusion**

In this paper, we have addressed an issue of the well-established evolutionary many-objective optimization algorithm BiGE on problems that are likely to produce dominance resistant solutions during the search. We have proposed an angle-based crowding degree estimation method to replace the distance-based estimation in BiGE, thus significantly reducing the effect of dominance resistant solutions on the algorithm. The effectiveness of the proposed method has been evaluated on three representative problems with dominance resistant solutions. It is worth mentioning that, for problems without dominance resistant solutions, the proposed method performs slightly worse than the original BiGE. In the near future, we would like to focus on problems without dominance resistant solutions, aiming at a comprehensive improvement of the algorithm on both types of problems.

## **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Author Index

Alazizi, Ayman 14 Amodio, Matthew 509 Artelt, André 235 Atamna, Asma 27

Baeza-Yates, Ricardo 404 Bahri, Maroua 40 Baratchi, Mitra 313 Bariatti, Francesco 54 Battaglia, Elena 67 Bauckhage, Christian 548 Bendimerad, Anes 80 Beyer, Christian 1 Bifet, Albert 40 Bioglio, Livio 67 Borgelt, Christian 93 Boulicaut, Jean-François 339

Cazabet, Rémy 339, 522 Cellier, Peggy 54, 197 Clémençon, Stephan 287 Cohen, Eldan 106 Cornuéjols, Antoine 119 Couceiro, Miguel 132 Crivello, Jean-Claude 27

Dalleau, Kevin 132 De Bie, Tijl 80 Dijkman, Remco 483 Duivesteijn, Wouter 483

Faas, Micky 158 Febrissy, Mickael 171 Ferré, Sébastien 54 Ferreira, Carlos Abreu 379 Freire, Ana 404 Fréry, Jordan 457

Gabrielli, Lorenzo 274 Gabrys, Bogdan 352 Galindez Olascoaga, Laura Isabel 184 Gama, João 379 Garcke, Jochen 548

Gautrais, Clément 197 Gharaghooshi, Shiva Zamani 210, 300 Ghoshal, Biraja 223 Giannotti, Fosca 274 Göpfert, Jan Philip 235 Gross-Amblard, David 261

Habrard, Amaury 14 Hammer, Barbara 235 Hanika, Tom 496 He-Guelton, Liyun 14, 457 Höppner, Frank 248 Hüllermeier, Eyke 444, 561

Jacquenet, François 14 Jahnke, Maximilian 248 Jeantet, Ian 261 Jilderda, Maurice 483

Kim, Jisu 274 Krishnaswamy, Smita 509

Largeron, Christine 210, 300, 404, 522 Larroche, Corentin 287 Levene, Mark 470 Li, Miqing 574 Lijffijt, Jefrey 80 Lindskog, Cecilia 223 Liu, Chang 210, 300 Liu, Xiaohui 574 Loog, Marco 326, 535

Mandal, Avradip 106 Maniu, Silviu 40 Maszczyk, Tomasz 352 Mathonat, Romain 339 Mayer, Sebastian 548 Mazel, Johan 287 Meert, Wannes 184 Mendes-Moreira, João 313 Menkovski, Vlado 145 Mey, Alexander 326, 535 Miasnikov, Evgeniy V. 418 Miklós, Zoltán 261 Millot, Alexandre 339 Mohr, Felix 561 Muhle, Rebecca 509 Murena, Pierre-Alexandre 119 Musial, Katarzyna 352

Nadif, Mohamed 171 Nguyen, Tien-Dung 352 Ninevski, Dimitar 366 Nogueira, Ana Rita 379 Noonan, James 509

O'Leary, Paul 366 Oblé, Frédéric 14, 457 Olivier, Raphaël 119 Overton, Toyah 391

Pensa, Ruggero G. 67 Pfahringer, Bernhard 40 Plantevit, Marc 80

Ramírez-Cifuentes, Diana 404 Robardet, Céline 80 Roy, Arnab 106

Safi, Abdullah Al 1 Savchenko, Andrey V. 418 Schneider, Johannes 431 Shah, Nimish 184 Shaker, Mohammad Hossein 444 Siblini, Wissam 457 Sifa, Rafet 548 Singh, Manni 470 Sîrbu, Alina 274 Smail-Tabbone, Malika 132 Sokolovska, Nataliya 27

Soons, Youri 483 Spiliopoulou, Myra 1 Stanley III, Jay S. 509 Stubbemann, Maximilian 496 Stumme, Gerd 496

Termier, Alexandre 197 Tissier, Julien 404 Tong, Alexander 509 Tornede, Alexander 561 Tucker, Allan 223, 391

Unnikrishnan, Vishnu 1 Ushijima-Mwesigwa, Hayato 106

van Dijk, David 509 van Doorenmalen, Jeroen 145 van Leeuwen, Matthijs 158, 197 Van den Broeck, Guy 184 Vaudaine, Rémi 522 Verhelst, Marian 184 Viering, Tom Julian 326, 535 von Rueden, Laura 548

Wang, Yi-Qing 457 Wersing, Heiko 235 Weston, David 470 Wever, Marcel 561 Wolf, Guy 509

Xue, Yani 574

Yim, Kristina 509

Zafarmand, Mohammadmahdi 210 Zaïane, Osmar R. 210, 300 Zöller, Marc-André 352