#### BOOK OF ABSTRACTS AND SHORT PAPERS

13th Scientific Meeting of the Classification and Data Analysis Group Firenze, September 9-11, 2021

#### edited by

Giovanni C. Porzio, Carla Rampichini, Chiara Bocci

#### PROCEEDINGS E REPORT

ISSN 2704-601X (PRINT) - ISSN 2704-5846 (ONLINE)

– 128 –

#### SCIENTIFIC PROGRAM COMMITTEE

Giovanni C. Porzio (chair) (University of Cassino and Southern Lazio - Italy)

Silvia Bianconcini (University of Bologna - Italy)
Christophe Biernacki (University of Lille - France)
Paula Brito (University of Porto - Portugal)
Francesca Marta Lilja Di Lascio (Free University of Bozen-Bolzano - Italy)
Marco Di Marzio ("Gabriele d'Annunzio" University of Chieti-Pescara - Italy)
Alessio Farcomeni ("Tor Vergata" University of Rome - Italy)
Luca Frigau (University of Cagliari - Italy)
Luis Ángel García Escudero (University of Valladolid - Spain)
Bettina Grün (Vienna University of Economics and Business - Austria)
Salvatore Ingrassia (University of Catania - Italy)
Volodymyr Melnykov (University of Alabama - USA)
Brendan Murphy (University College Dublin - Ireland)
Maria Lucia Parrella (University of Salerno - Italy)
Carla Rampichini (University of Florence - Italy)
Monia Ranalli (Sapienza University of Rome - Italy)
J. Sunil Rao (University of Miami - USA)
Marco Riani (University of Parma - Italy)
Nicola Salvati (University of Pisa - Italy)
Laura Maria Sangalli (Polytechnic University of Milan - Italy)
Bruno Scarpa (University of Padua - Italy)
Mariangela Sciandra (University of Palermo - Italy)
Luca Scrucca (University of Perugia - Italy)
Domenico Vistocco (Federico II University of Naples - Italy)
Mariangela Zenga (University of Milan-Bicocca - Italy)

#### LOCAL PROGRAM COMMITTEE

Carla Rampichini (chair) (University of Florence - Italy)

Chiara Bocci (University of Florence - Italy)
Anna Gottard (University of Florence - Italy)
Leonardo Grilli (University of Florence - Italy)
Monia Lupparelli (University of Florence - Italy)
Maria Francesca Marino (University of Florence - Italy)
Agnese Panzera (University of Florence - Italy)
Emilia Rocco (University of Florence - Italy)
Domenico Vistocco (Federico II University of Naples - Italy)

# CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

13th Scientific Meeting of the Classification and Data Analysis Group Firenze, September 9-11, 2021

edited by Giovanni C. Porzio, Carla Rampichini, Chiara Bocci

FIRENZE UNIVERSITY PRESS 2021

CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS : 13th Scientific Meeting of the Classification and Data Analysis Group Firenze, September 9-11, 2021 / edited by Giovanni C. Porzio, Carla Rampichini, Chiara Bocci. — Firenze : Firenze University Press, 2021. (Proceedings e report ; 128)

**INDEX**

**Keynote Speakers**

*Peter Rousseeuw, Jakob Raymaekers and Mia Hubert*

*Robert Tibshirani, Stephen Bates and Trevor Hastie*

*Jean-Michel Loubes*

*Cinzia Viroli*

**Plenary Session**

*Daniel Diaz*

*Jeffrey S. Morris*

*Bhramar Mukherjee*

*Danny Pfeffermann*

**Invited Papers**

*Emanuele Aliverti* 

*Bin Yu*

**Preface 1**

**Optimal transport methods for fairness in machine learning 5**

**Class maps for visualizing classification results 6**

**Understanding cross-validation and prediction error 7**

**Quantile-based classification 8**

**through deepTune 9**

**A simple correction for COVID-19 sampling bias 14**

**COVID-19 pandemic 15**

**science call to arms 16**

**Contributions of Israel's CBS to rout COVID-19 17**

**Robust issues in estimating modes for multivariate torus data 21**

**Bayesian nonparametric dynamic modeling of psychological traits 25**

**Veridical data science for responsible AI: characterizing V4 neurons** 

**A seat at the table: the key role of biostatistics and data science in the** 

**Predictions, role of interventions and the crisis of virus in India: a data** 

*Claudio Agostinelli, Giovanni Saraceno and Luca Greco*

https://www.fupress.com/isbn/9788855183406

ISSN 2704-601X (print) ISSN 2704-5846 (online) ISBN 978-88-5518-340-6 (PDF) ISBN 978-88-5518-341-3 (XML) DOI 10.36253/978-88-5518-340-6

Graphic design: Alberto Pizarro Fernández, Lettera Meccanica SRLs
Front cover: illustration by Anna Gottard of the statue *Appennino* by Giambologna (1579-1580)

CLAssification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS)

*FUP Best Practice in Scholarly Publishing* (DOI https://doi.org/10.36253/fup\_best\_practice) All publications are submitted to an external refereeing process under the responsibility of the FUP Editorial Board and the Scientific Boards of the series. The works published are evaluated and approved by the Editorial Board of the publishing house, and must be compliant with the Peer review policy, the Open Access, Copyright and Licensing policy and the Publication Ethics and Complaint policy.

#### Firenze University Press Editorial Board

M. Garzaniti (Editor-in-Chief), M.E. Alberti, F. Vittorio Arrigoni, E. Castellani, F. Ciampi, D. D'Andrea, A. Dolfi, R. Ferrise, A. Lambertini, R. Lanfredini, D. Lippi, G. Mari, A. Mariani, P.M. Mariano, S. Marinai, R. Minuti, P. Nanni, A. Orlandi, I. Palchetti, A. Perulli, G. Pratesi, S. Scaramuzzi, I. Stolzi.

The online digital edition is published in Open Access on www.fupress.com.

Content license: except where otherwise noted, the present work is released under Creative Commons Attribution 4.0 International license (CC BY 4.0: http://creativecommons.org/licenses/by/4.0/ legalcode). This license allows you to share any part of the work by any means and format, modify it for any purpose, including commercial, as long as appropriate credit is given to the author, any changes made to the work are indicated and a URL link is provided to the license.

Metadata license: all the metadata are released under the Public Domain Dedication license (CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/legalcode).

© 2021 Author(s)

Published by Firenze University Press Firenze University Press Università degli Studi di Firenze via Cittadella, 7, 50144 Firenze, Italy www.fupress.com

*This book is printed on acid-free paper Printed in Italy*



FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)

Giovanni C. Porzio, Carla Rampichini, Chiara Bocci (edited by), *CLADAG 2021 Book of abstracts and short papers. 13th Scientific Meeting of the Classification and Data Analysis Group Firenze, September 9-11, 2021*, © 2021 Author(s), content CC BY 4.0 International, metadata CC0 1.0 Universal, published by Firenze University Press (www.fupress.com), ISSN 2704-5846 (online), ISBN 978-88-5518-340-6 (PDF), DOI 10.36253/978-88-5518-340-6


*Pierpaolo D'Urso, Livia De Giovanni and Vincenzina Vitale*

*Tahir Ekin and Claudio Conversano*

*Matteo Fasiolo*

*Carlotta Galeone*

*Riccardo Peli* 

*Jayant Jha*

*and Isobel Claire Gormley*

*Silvia Facchinetti and Silvia Angela Osmetti*

*Francesca Greselin and Alina Jędrzejczak*

*Christian Hennig and Pietro Coretto*

*Yinxuan Huang and Natalie Shlomo*

*Maria Iannario and Claudia Tarantola*

**Spatial-temporal clustering based on B-splines: robust models with** 

*Leonardo Egidi, Roberta Pappadà, Francesco Pauli and Nicola Torelli* **PIVMET: pivotal methods for Bayesian relabelling in finite mixture** 

*Luis Angel García-Escudero, Agustín Mayo-Iscar and Marco Riani*

*Carlo Gaetan, Paolo Girardi and Victor Muthama Musau*

*Michael Gallaugher, Christophe Biernacki and Paul McNicholas*

**applications to COVID-19 pandemic 83**

**models 87**

**Cluster validity by random forests 91**

**Robust estimation of parsimonious finite mixture of Gaussian models 92**

**A risk indicator for categorical data 93**

**Additive quantile regression via the qgam R package 97**

**likelihood 98**

**On model-based clustering using quantile regression 102**

**Socioeconomic inequalities and cancer risk: myth or reality? 106**

**Parameter-wise co-clustering for high dimensional data 107**

**an analysis based on EU-SILC data from Poland and Italy 108**

**models 112**

**likelihood estimator 116**

**Improving the reliability of a nonprobability web survey 120**

**regression models 124**

**Best approach direction for spherical random variables 128**

**Quantifying the impact of covariates on the gender gap measurement:** 

*Alessandra Guglielmi, Mario Beraha, Matteo Giannella, Matteo Pegoraro and* 

**A transdimensional MCMC sampler for spatially dependent mixture** 

**Non-parametric consistency for the Gaussian mixture maximum** 

**A semi-Bayesian approach for the analysis of scale effects in ordinal** 

*Michael Fop, Dimitris Karlis, Ioannis Kosmidis, Adrian O'Hagan, Caitriona Ryan* 

**Gaussian mixture models for high dimensional data using composite** 


*Andres M. Alonso, Carolina Gamboa and Daniel Peña*

*Raffaele Argiento, Edoardo Filippi-Mazzola and Lucia Paci*

*Francesco Bartolucci, Fulvia Pennoni and Federico Cortese*

*Michela Battauz and Paolo Vidoni*

*Matteo Bottai*

*Marcello Chiodi*

*Silvia D'Angelo*

*Ferrari and Clelia Di Serio*

*Anna Denkowska and Stanisław Wanat*

*Antonio Balzanella, Antonio Irpino and Francisco de A.T. De Carvalho* **Mining multiple time sequences through co-clustering algorithms for** 

**Clustering financial time series using generalized cross correlations 27**

**Model-based clustering for categorical data via Hamming distance 31**

**distributional data 32**

**in multiple time-series 36**

**Boosting multidimensional IRT models 40**

**Understanding and estimating conditional parametric quantile models 44**

**Shapley Lorenz methods for eXplainable artificial intelligence 45**

**the stability of results 49**

**Issues in monitoring the EU trade of critical COVID-19 commodities 53**

**Smoothed non-linear PCA for multivariate data 54**

**Accounting for response behavior in longitudinal rating data 58**

**Network-based semi-supervised clustering of time series data 62**

**a latent class framework 64**

**Sender and receiver effects in latent space models for multiplex data 68**

**GARCH-MST model developed for European insurance institutions 71**

**Two-step estimation of multilevel latent class models with covariates 75**

**mixture models 79**

**Hidden Markov and regime switching copula models for state allocation** 

*Niklas Bussmann, Roman Enzmann, Paolo Giudici and Emanuela Raffinetti*

*Andrea Cerasa, Enrico Checchi, Domenico Perrotta and Francesca Torti*

*Claudio Conversano, Giulia Contu, Luca Frigau and Carmela Cappelli*

*Federica Cugnata, Chiara Brombin, Pietro Cippà, Alessandro Ceschi, Paolo* 

**DTW-based assessment of the predictive power of the copula-DCC-**

**Clustering data with non-ignorable missingness using semi-parametric** 

*Roberto Di Mari, Zsuzsa Bakk, Jennifer Oser and Jouni Kuha*

*Marie Du Roy de Chaumaray and Matthieu Marbac*

**Characterising longitudinal trajectories of COVID-19 biomarkers within** 

*Roberto Colombi, Sabrina Giordano and Maria Kateri*

*Andrea Cappozzo, Ludovic Duponchel, Francesca Greselin and Brendan Murphy* **Robust classification of spectroscopic data in agri-food: first analysis on** 


*Giuseppe Pandolfo*

*Panos Pardalos*

*Fabian Capitanio*

*Nicoleta Rogovschi*

*Massimiliano Russo*

*Florian Schuberth*

*Simone Vantini*

*Laura Trinchera*

*Xanthi Pedeli and Cristiano Varin*

*Mark Reiser and Maduranga Dassanayake*

*Paula Saavedra-Nieves and Rosa M. Crujeiras*

*Shuchismita Sarkar, Volodymyr Melnykov and Xuwen Zhu*

*Jarod Smith, Mohammad Arashi and Andriette Bekker*

*Paul Smith, Peter van der Heijden and Maarten Cruyff*

*Valentin Todorov and Peter Filzmoser*

**Tensor-variate finite mixture model for the analysis of university** 

**Specifying composites in structural equation modeling: the Henseler-**

**Network analysis implementing a mixture distribution from Bayesian** 

**Robust classification in high dimensions using regularized covariance** 

*Agostino Torti, Marta Galvani, Alessandra Menafoglio, Piercesare Secchi and* 

**Developing a multidimensional and hierarchical index following a** 

*Salvatore Daniele Tomarchio, Luca Bagnato and Antonio Punzo*

**A graphical depth-based aid to detect deviation from unimodality on** 

**hyperspheres 182**

**Networks of networks 186**

**Pairwise likelihood estimation of latent autoregressive count models 187**

**variables 191**

**Assessing food security issues in Italy: a quantile copula approach 195**

**Co-clustering for high dimensional sparse data 199**

**Malaria risk detection via mixed membership models 203**

**Nonparametric estimation of the number of clusters for directional data 207**

**professor remuneration 208**

**Ogasawara specification 209**

**viewpoint 210**

**Measurement errors in multiple systems estimation 211**

**estimates 215**

**Clustering via new parsimonious mixtures of heavy tailed distributions 216**

**A general bi-clustering technique for functional data 217**

**composite-based approach 220**

**A study of lack-of-fit diagnostics for models fit to cross-classified binary** 

*Giorgia Rivieccio, Jean-Paul Chavas, Giovanni De Luca, Salvatore Di Falco and* 


*Maria Kateri*

*John Kent*

**Simple effect measures for interpreting generalized binary regression** 

**Mixtures of Kato–Jones distributions on the circle, with an application to** 

**Identifying mortality patterns of main causes of death among young EU** 

**A nonparametric approach for statistical matching under informative** 

**Investigating model fit in item response models with the Hellinger** 

**Transformation mixture modeling for skewed data groups with heavy** 

*Jesper Møller , Mario Beraha, Raffaele Argiento and Alessandra Guglielmi* **MCMC computations for Bayesian mixture models using repulsive point** 

**Detection of internet attacks with histogram principal component** 

*Shogo Kato, Kota Nagasaki and Wataru Nakanishi*

*Simona Korenjak-Černe and Nataša Kejžar*

*Fabrizio Laurini and Gianluca Morelli*

*Daniela Marella and Danny Pfeffermann*

*Matteo Mazziotta and Adriano Pareto*

*Mariagiulia Matteucci and Stefania Mignani*

*Marcella Mazzoleni, Angiola Pollastri and Vanda Tulli*

*Yana Melnykov, Xuwen Zhu and Volodymyr Melnykov*

*Keefe Murphy, Cinzia Viroli and Isobel Claire Gormley*

*Stanislav Nagy, Petra Laketa and Rainer Dyckerhoff*

*M. Rosário Oliveira, Ana Subtil and Lina Oliveira*

*Sally Paganin*

*Yarema Okhrin, Gazi Salah Uddin and Muhammad Yahya*

*Luca Merlo, Lea Petrella and Nikos Tzavidis*

**models 129**

**traffic count data 133**

**How to design a directional distribution 137**

**population using SDA approaches 141**

**Robust supervised clustering: some practical issues 142**

**sampling and nonresponse 146**

**distance 150**

**PCA-based composite indices and measurement model 154**

**Gender inequalities from an income perspective 158**

**tails and scatter 162**

**Unconditional M-quantile regression 163**

**processes 167**

**Infinite mixtures of infinite factor analysers 168**

**Angular halfspace depth: computation 169**

**Nonlinear interconnectedness of crude oil and financial markets 173**

**analysis 174**

**Semiparametric IRT models for non-normal latent traits 178**


*Giuseppe Bove*

*Antonio Calcagni*

*Mayo-Iscar*

*Francesca Condino*

*Francesco Piccialli*

*Agostino Di Ciaccio*

*Maurizio Carpita and Silvia Golia*

*Luca Tardella and Giovanna Jona Lasinio*

*Cristina Davino and Giuseppe Lamberti*

*Lorenzo Focardi Olmi and Anna Gottard*

*Carlo Cavicchia, Maurizio Vichi and Giorgia Zaccaria*

**A subject-specific measure of interrater agreement based on the** 

*Andrea Cappozzo, Alessandro Casa and Michael Fop*

**homogeneity index 272**

**Estimating latent linear correlations from fuzzy contingency tables 276**

**Model-based clustering with sparse matrix mixture models 280**

**Exploring solutions via monitoring for cluster weighted robust models 284**

**Categorical classifiers in multi-class classification problems 288**

**abundance 292**

**Model-based clustering with parsimonious covariance structure 296**

**Clustering income data based on share densities 300**

**Group-dependent finite mixture model 304**

**A machine learning approach in stock risk management 308**

**Pathmox segmentation trees to compare linear regression models 312**

**Angular halfspace depth: classification using spherical bagdistances 316**

**Neural networks for high cardinality categorical data 320**

**clustering 324**

**notices of the ministry of labour 328**

**graphical models 332**

*Houyem Demni, Davide Buttarazzi, Stanislav Nagy and Giovanni Camillo Porzio*

*Andrea Cappozzo, Luis Angel García Escudero, Francesca Greselin and Agustín* 

*Gianmarco Caruso, Greta Panunzi, Marco Mingione, Pierfrancesco Alaimo Di Loro, Stefano Moro, Edoardo Bompiani, Caterina Lanfredi, Daniela Silvia Pace,* 

**Model-based clustering for estimating cetaceans site-fidelity and** 

*Paula Costa Fontichiari, Miriam Giuliani, Raffaele Argiento and Lucia Paci*

*Salvatore Cuomo, Federico Gatta, Fabio Giampaolo, Carmela Iorio and* 

*F. Marta L. Di Lascio, Andrea Menapace and Roberta Pappadà*

**Ali-Mikhail-Haq copula to detect low correlations in hierarchical** 

*Maria Veronica Dorgali, Silvia Bacci, Bruno Bertaccini and Alessandra Petrucci* **Higher education and employability: insights from the mandatory** 

**An alternative to joint graphical lasso for learning multiple Gaussian** 

#### **Contributed Papers**



*Rosanna Verde, Francisco T. de A. De Carvalho and Antonio Balzanella*

*Isadora Antoniano Villalobos, Simone Padoan and Boris Beranger*

*Antonino Abbruzzo, Maria Francesca Cracolici and Furio Urso*

*Luigi Augugliaro, Gianluca Sottile and Angelo Mineo*

*Claudia Berloco, Raffaele Argiento and Silvia Montagna*

*Marco Berrettini, Giuliano Galimberti and Saverio Ranciati*

*Stefano Calza*

*Giancarlo Ragozini*

*Ernst Wit and Lucas Kania*

*Qiuyi Wu and David Banks*


*Chiara Bardelli*

*Roberto Ascari and Sonia Migliorati*

*Marika Vezzoli, Francesco Doglietto, Stefano Renzetti, Marco Fontanella and* 

**A machine learning approach for evaluating anxiety in neurosurgical** 

**Prediction of large observations via Bayesian inference for extreme-**

*Maria Prosperina Vitale, Vincenzo Giuseppe Genova, Giuseppe Giordano and* 

**Community detection in tripartite networks of university student** 

**A generalised clusterwise regression for distributional data 223**

**patients during the COVID-19 pandemic 227**

**value theory 231**

**mobility flows 232**

**Causal regularization 236**

**Minimizing conflicts of interest: optimizing the JSM program 240**

**Model selection procedure for mixture hidden Markov models 243**

**A full mixture of experts model to classify constrained data 247**

**models 251**

**Semi-supervised learning through depth functions 255**

**A combined test of the Benford hypothesis with anti-fraud applications 256**

**Unbalanced classification of electronic invoicing 260**

**application for credit risk 264**

**splines 268**

**Sparse inference in covariate adjusted censored Gaussian graphical** 

*Simona Balzano, Mario Rosario Guarracino and Giovanni Camillo Porzio*

*Lucio Barabesi, Andrea Cerasa, Andrea Cerioli and Domenico Perrotta*

**Predictive power of Bayesian CAR models on scale free networks: an** 

**Semiparametric finite mixture of regression models with Bayesian P-**


*Marta Nai Ruscone and Dimitris Karlis*

*Roberto Rocci and Monia Ranalli*

*Theresa Scharl and Bettina Grün*

*Donatella Vicari and Paolo Giordani*

*Gianpaolo Zammarchi and Jaromir Antoch*

*Pieragostino*

*Luca Scrucca*

**Robustness methods for modelling count data with general dependence** 

**Bayesian analysis of a water quality high-frequency time series through** 

**Detecting the effect of secondary school in higher education university** 

**Semi-constrained model-based clustering of mixed-type data using a** 

**Antibodies to SARS-CoV-2: an exploratory analysis carried out through** 

**Modelling three-way RNA sequencing data with mixture of multivariate** 

*Annalina Sarra, Adelia Evangelista, Tonio Di Battista and Damiana* 

*Rosaria Simone, Cristina Davino, Domenico Vistocco and Gerhard Tutz*

**Using eye-tracking data to create a weighted dictionary for sentiment** 

*Venera Tomaselli, Giulio Giacomo Cantone and Valeria Mazzeo*

*Roberta Paroli, Luigi Spezia, Marc Stutter and Andy Vinten*

*Mariano Porcu, Isabella Sulis and Cristian Usala*

**structures 396**

**Markov switching autoregressive models 400**

**choices 404**

**composite likelihood approach 408**

**the Bayesian profile regression 412**

**Poisson-lognormal distribution 416**

**Stacking ensemble of Gaussian mixtures 420**

**A robust quantile approach to ordinal trees 424**

**The detection of spam behaviour in review bomb 428**

**Clustering models for three-way data 432**

**analysis: the eye dictionary 436**


*Francesca Fortuna, Alessia Naccarato and Silvia Terzi*

*Chiara Galimberti, Federico Castelletti and Stefano Peluso*

*Michele La Rocca, Francesco Giordano and Cira Perna*

**Clustering production indexes for construction with forecast** 

*Maria Mannone, Veronica Distefano, Claudio Silvestri and Irene Poli*

*Laura Marcis, Maria Chiara Pagliarella and Renato Salvatore*

*Paolo Mariani, Andrea Marletta and Matteo Locci*

*Federico Marotta, Paolo Provero and Silvia Montagna*

*Ana Martins, Paula Brito, Sónia Dias and Peter Filzmoser*

*Giovanna Menardi and Federico Ferraccioli*

**Clustering longitudinal data with category theory for diabetic kidney** 

**A redundancy analysis with multivariate random-coefficients linear** 

**Prediction of gene expression from transcription factors affinities: an** 

*Francesca Martella, Fabio Attorre, Michele De Sanctis and Giuliano Fanelli* **High dimensional model-based clustering of European georeferenced** 

*Massimo Mucciardi, Giovanni Pirrotta, Andrea Briglia and Arnaud Sallaberry* **Visualizing cluster of words: a graphical approach to grammar** 

*Siciliano*

*Petra Laketa and Stanislav Nagy*

*Andrea Gilardi, Riccardo Borgoni, Luca Presicce and Jorge Mateu*

**Measurement error models on spatial network lattices: car crashes in** 

*Carmela Iorio, Giuseppe Pandolfo, Michele Staiano, Massimo Aria and Roberta* 

**The LP data depth and its application to multivariate process control** 

*Sylvia Frühwirth-Schnatter, Bettina Grün and Gertraud Malsiner-Walli* **Estimating Bayesian mixtures of finite mixtures with telescoping** 

**Functional cluster analysis of HDI evolution in European countries 336**

**sampling 340**

**A Bayesian framework for structural learning of mixed graphical models 344**

**Leeds 348**

**charts 352**

**Angular halfspace depth: central regions 356**

**distributions 360**

**disease 364**

**models 368**

**The use of multiple imputation techniques for social media data 372**

**application of Bayesian non-linear modelling 376**

**vegetation plots 380**

**Multivariate outlier detection for histogram-valued variables 384**

**A nonparametric test for mode significance 388**

**acquisition 392**


#### **Preface**

This book collects the abstracts and short papers presented at CLADAG 2021, the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science, Applications 'G. Parenti' of the University of Florence, under the auspices of the University of Florence, the SIS and the International Federation of Classification Societies (IFCS).

CLADAG is a member of the IFCS, a federation of national, regional, and linguistically based classification societies. It is a non-profit, non-political scientific organization whose aim is to further classification research.

Every two years, CLADAG organizes a scientific meeting devoted to the presentation of theoretical and applied papers on classification and related methods of data analysis in the broad sense. This includes advanced methodological research in multivariate statistics, mathematical and statistical investigations, survey papers on the state of the art, real case studies, papers on numerical and algorithmic aspects, applications in special fields of interest, and the interface between classification and data science. The conference aims to encourage the interchange of ideas in the above-mentioned fields of research, as well as the dissemination of new findings. CLADAG conferences, initiated in 1997 in Pescara (Italy), were soon regarded as an attractive venue for exchanging information and became an important meeting point for people interested in classification and data analysis. A selection of the presented papers is regularly published in (post-conference) proceedings, typically by Springer.

The Scientific Committee of CLADAG 2021 conceived the Plenary and Invited Sessions to provide a fresh perspective on the state of the art of knowledge and research in the field. The scientific program of CLADAG 2021 is particularly rich: all in all, it comprises 5 Keynote Lectures, 26 Invited Sessions promoted by the members of the Scientific Program Committee, 10 Contributed Sessions, and a Plenary Session on *Statistical Issues in the COVID-19 Pandemic*. We thank all the session organizers for inviting renowned speakers coming from many different countries. We are greatly indebted to the referees for the time spent in a careful review of the abstracts and short papers collected in this book. The conference was planned as an in-presence event; unfortunately, due to the persistent uncertainty of the COVID-19 epidemic, CLADAG 2021 will be held completely online.

Special thanks are finally due to the members of the Local Organizing Committee and all the people who collaborated for CLADAG 2021. Last but not least, we thank all the authors and participants, without whom the conference would not have been possible.

Giovanni Camillo Porzio, Carla Rampichini, Chiara Bocci

Florence, September 2021


# **Keynote Speakers**

Giovanni C. Porzio, University of Cassino and Southern Lazio, Italy, porzio@unicas.it, 0000-0003-1208-6991
Carla Rampichini, University of Florence, Italy, carla.rampichini@unifi.it, 0000-0002-8519-083X
Chiara Bocci, University of Florence, Italy, chiara.bocci@unifi.it, 0000-0001-8189-4445



### OPTIMAL TRANSPORT METHODS FOR FAIRNESS IN MACHINE LEARNING

Jean-Michel Loubes1

<sup>1</sup> Université Toulouse Paul Sabatier, France (e-mail: loubes@math.univ-toulouse.fr)

ABSTRACT: The principle of supervised machine learning is to build a decision rule that fits the data from a set of labeled examples called the learning sample. This rule becomes a model, or decision algorithm, that will be used for the whole population. In certain cases, mathematical guarantees can be provided to control the generalization error of the algorithm, which corresponds to the approximation made by building the model from the observations without knowing the true model that actually generated the data. More precisely, the data are assumed to follow an unknown distribution, while only its empirical distribution is at hand. Yet any bias present in the learning sample will be implicitly learnt and incorporated in the prediction. This can amplify or generalize bias and thus create unfair decision rules. In this presentation we show how optimal transport methods can be used to control the bias of machine learning algorithms. From a global point of view, group discrimination can be quantified by looking at the behaviour of the algorithm across different groups of individuals. This makes it possible to measure the trade-off between the accuracy of the algorithm and its level of fairness using the notion of Wasserstein barycenter. From an individual point of view, optimal transport methods provide an alternative way to define counterfactual worlds that explain how changes in some attributes of an individual may affect the decisions of an algorithm. This allows the problem of training individually fair algorithms to be recast as ensuring regularity assumptions in both the normal and the counterfactual world.
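As a rough illustration of the group-level measure described above (a numpy-only sketch under simplified assumptions, not the speaker's implementation): in one dimension the 1-Wasserstein distance between two groups' score distributions can be approximated from their empirical quantile functions, and the Wasserstein barycenter's quantile function is simply the average of the two.

```python
import numpy as np

def wasserstein_1d(scores_a, scores_b, n_grid=1000):
    """Approximate the 1-Wasserstein distance between two empirical score
    distributions through their quantile functions (valid in dimension one)."""
    u = (np.arange(n_grid) + 0.5) / n_grid      # quantile levels in (0, 1)
    qa = np.quantile(scores_a, u)
    qb = np.quantile(scores_b, u)
    return float(np.mean(np.abs(qa - qb)))

rng = np.random.default_rng(0)
scores_g1 = rng.normal(0.0, 1.0, 5000)          # scores for group 1
scores_g2 = rng.normal(0.5, 1.0, 5000)          # systematically shifted scores
print(wasserstein_1d(scores_g1, scores_g1))     # identical distributions: 0.0
print(wasserstein_1d(scores_g1, scores_g2))     # close to the 0.5 shift
```

A large value signals that the algorithm scores the two groups differently; "repairing" both groups toward the barycenter (averaging the quantile functions) trades some accuracy for fairness.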

### CLASS MAPS FOR VISUALIZING CLASSIFICATION RESULTS


Peter J. Rousseeuw1, Jakob Raymaekers1 and Mia Hubert1

<sup>1</sup> Section of Statistics and Data Science, Dept of Mathematics, KU Leuven, Belgium, (e-mail: peter@rousseeuw.net, jakob.raymaekers@kuleuven.be, mia.hubert@kuleuven.be)

ABSTRACT: Classification is a major tool of statistics and machine learning. A classification method first processes a training set of objects with given classes (labels), with the goal of afterward assigning new objects to one of these classes. When running the resulting prediction method on the training data or on test data, it can happen that an object is predicted to lie in a class that differs from its given label. This is sometimes called label bias, and raises the question whether the object was mislabeled.

The proposed class map reflects how well an object lies within its class, by comparing it to an alternative class, as done in Rousseeuw (1987) for unsupervised classification. The class map also shows how far the object is from the other objects in its class, and whether some objects lie far from all classes. The goal is to visualize aspects of the classification results to gain insight into the data.

The display is constructed for discriminant analysis, the k-nearest neighbor classifier, support vector machines, logistic regression, and coupling pairwise classifications. It is illustrated on several benchmark datasets, including some consisting of images and texts.

KEYWORDS: discriminant analysis, k-nearest neighbors, mislabeling, pairwise coupling, support vector machines.
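The class map builds on the silhouette values of Rousseeuw (1987), cited below. A minimal numpy sketch of that underlying quantity (an illustration of the silhouette only, not the authors' class map construction):

```python
import numpy as np

def silhouette_scores(X, labels):
    """Silhouette value per object (Rousseeuw, 1987): compare the mean
    distance a(i) to the object's own class with the mean distance b(i)
    to the nearest alternative class; s(i) = (b - a) / max(a, b)."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False                                # exclude the object itself
        a = d[i, own].mean()
        b = min(d[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two well-separated classes: all silhouette values close to 1.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(silhouette_scores(X, y).round(2))
```

Values near 1 indicate objects that sit well inside their labeled class; values near 0 or below flag objects closer to an alternative class, i.e. candidates for mislabeling.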

#### References

RAYMAEKERS, J., & ROUSSEEUW, P. J. Transforming variables to central normality. *Machine Learning*.

RAYMAEKERS, J., ROUSSEEUW, P. J., & HUBERT, M. 2021. Class maps for visualizing classification results. *arXiv:2007.14495*.

ROUSSEEUW, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. *Journal of Computational and Applied Mathematics*, 20, 53–65.

### UNDERSTANDING CROSS-VALIDATION AND PREDICTION ERROR
Robert Tibshirani1, Stephen Bates2 and Trevor Hastie1

<sup>1</sup> Departments of Statistics, and Biomedical Data Science, Stanford University, (e-mail: tibs@stanford.edu, hastie@stanford.edu)

<sup>2</sup> Departments of Statistics, and Electrical Engineering and Computer Sciences, UC Berkeley, (e-mail: stephenbates@cs.berkeley.edu)

ABSTRACT: Cross-validation is a widely-used technique to estimate prediction accuracy. However, its properties are not that well understood. First, it is not clear exactly what form of prediction error is being estimated by cross-validation: one would like to think that cross-validation estimates the prediction error for the model and the data at hand. Surprisingly, we show here that this is not the case (at least in the special case of linear models), and derive the actual estimand(s). This phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, *Cp*, and AIC. Second, the standard (naïve) confidence intervals for prediction accuracy that are derived from cross-validation may fail to cover at the nominal rate, because each data point is used for both training and testing, inducing correlations among the measured accuracy for each fold. As a result, the variance of the CV estimate of error is larger than suggested by naïve estimators, which leads to confidence intervals for prediction accuracy that can have coverage far below the desired level. We introduce a nested cross-validation scheme to estimate the standard error of the cross-validation estimate of prediction error, showing empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail.
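To fix ideas, a numpy-only sketch of the naive construction being critiqued (not the authors' nested scheme): K-fold CV for least squares, with the usual standard error computed from the per-fold error means. The abstract's point is that this SE understates the true variability, because the folds share training data and are therefore correlated.

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Naive K-fold CV for least squares: returns the CV estimate of test
    MSE and the naive standard error based on the k per-fold error means
    (the folds are correlated, so this SE tends to be too small)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    fold_mse = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)                      # all other folds
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        fold_mse.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    fold_mse = np.array(fold_mse)
    return fold_mse.mean(), fold_mse.std(ddof=1) / np.sqrt(k)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, -1.0]) + rng.normal(size=200)
cv_est, naive_se = kfold_cv_mse(X, y)
print(round(cv_est, 2), round(naive_se, 2))   # CV MSE near the noise variance of 1
```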

### QUANTILE-BASED CLASSIFICATION


Cinzia Viroli1

<sup>1</sup> Department of Statistical Sciences, University of Bologna, (e-mail: cinzia.viroli@unibo.it)

ABSTRACT: The idea of using quantiles in classification is relatively recent. The median classifier for high-dimensional problems (Hall *et al.*, 2009), the quantile classifier (Hennig & Viroli, 2016), and the ensemble and directional quantile classifiers (Lai & McLeod, 2020; Farcomeni *et al.*, 2021) represent the main proposals for supervised classification. These methods have proved to perform well on high-dimensional and skewed data compared with other classical classification strategies. For clustering purposes, quantiles lead to analogous appealing advantages; in this context, K-quantiles clustering has recently been introduced (Hennig *et al.*, 2019). In this talk, the main quantile-based strategies for supervised and unsupervised classification will be presented and discussed, both from the theoretical and the empirical points of view.

KEYWORDS: L1 distance, supervised and unsupervised classification, k-means, skewness, high-dimensional data
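As a toy illustration of the quantile idea (a numpy sketch of the median classifier of Hall *et al.*, 2009, the symmetric special case of the quantile classifier; the data here are hypothetical, not from the talk):

```python
import numpy as np

def fit_medians(X, y):
    """Component-wise class medians: the 'centroids' of the median
    classifier (Hall et al., 2009)."""
    return {c: np.median(X[y == c], axis=0) for c in np.unique(y)}

def median_classify(X, medians):
    """Assign each row to the class whose median vector is closest in L1 distance."""
    classes = list(medians)
    d = np.stack([np.abs(X - medians[c]).sum(axis=1) for c in classes], axis=1)
    return np.array(classes)[d.argmin(axis=1)]

# Skewed (log-normal) features, where medians are more stable than means.
rng = np.random.default_rng(0)
X0 = rng.lognormal(0.0, 0.5, size=(100, 3))
X1 = rng.lognormal(1.0, 0.5, size=(100, 3))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 100)
pred = median_classify(X, fit_medians(X, y))
print((pred == y).mean())   # high training accuracy on this toy example
```

The quantile classifier replaces the median with a class-wise quantile at level θ and an asymmetric L1 loss, with θ chosen to minimize the misclassification rate.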

#### References

FARCOMENI, A., GERACI, M., & VIROLI, C. 2021. *Directional quantile classifiers*.

HALL, P., TITTERINGTON, D. M., & XUE, J.-H. 2009. Median-based classifiers for high-dimensional data. *Journal of the American Statistical Association*, 104(488), 1597–1608.

HENNIG, C., & VIROLI, C. 2016. Quantile-based classifiers. *Biometrika*, 103(2), 435–446.

HENNIG, C., VIROLI, C., & ANDERLUCCI, L. 2019. Quantile-based clustering. *Electronic Journal of Statistics*, 13(2), 4849–4883.

LAI, Y., & MCLEOD, I. 2020. Ensemble quantile classifier. *Computational Statistics & Data Analysis*, 144, 106849.

### VERIDICAL DATA SCIENCE FOR RESPONSIBLE AI: CHARACTERIZING V4 NEURONS THROUGH DEEPTUNE

Bin Yu1

<sup>1</sup> Departments of Statistics, and Electrical Engineering and Computer Sciences, UC Berkeley (e-mail: binyu@berkeley.edu)

> "A.I. is like nuclear energy – both promising and dangerous"

> Bill Gates, 2019

ABSTRACT: Data Science is a pillar of A.I. and has driven most of the recent cutting-edge discoveries in biomedical research. In practice, Data Science has a life cycle (DSLC) that includes problem formulation, data collection, data cleaning, modeling, result interpretation and the drawing of conclusions. Human judgment calls are ubiquitous at every step of this process, e.g., in choosing data cleaning methods, predictive algorithms and data perturbations. Such judgment calls are often responsible for the "dangers" of A.I. To maximally mitigate these dangers, we developed a framework based on three core principles: Predictability, Computability and Stability (PCS). Through a workflow and documentation (in R Markdown or Jupyter Notebook) that allows one to manage the whole DSLC, the PCS framework unifies, streamlines and expands on the best practices of machine learning and statistics, bringing us a step forward towards veridical Data Science.

The PCS framework will be illustrated through the development of the DeepTune framework for characterizing V4 neurons. DeepTune builds predictive models using DNNs and linear regression and applies the stability principle to obtain stable interpretations of 18 predictive models.

Finally, a general DNN interpretation method based on contextual decomposition (CD) will be discussed with applications to sentiment analysis and cosmological parameter estimation.

#### References


ABBASI-ASL, R., CHEN, Y., BLONIARZ, A., OLIVER, M., WILLMORE, B. D. B., GALLANT, J. L., & YU, B. 2018. The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. *bioRxiv*.

YU, B., & KUMBIER, K. 2020. Veridical data science. *Proceedings of the National Academy of Sciences*, 117(8), 3920–3929.

# **Plenary Session**


Giovanni C. Porzio, University of Cassino and Southern Lazio, Italy, porzio@unicas.it, 0000-0003-1208-6991 Carla Rampichini, University of Florence, Italy, carla.rampichini@unifi.it, 0000-0002-8519-083X Chiara Bocci, University of Florence, Italy, chiara.bocci@unifi.it, 0000-0001-8189-4445

FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)


## *Statistical Issues in the COVID-19 Pandemic*

ORGANIZER: **J. Sunil Rao**, Chair, Division of Biostatistics, University of Miami, USA

ABSTRACT: COVID-19 has become a pandemic of epic proportions, calling on scientific enquiry from a broad range of disciplines, including biology, chemistry, pharmacology, epidemiology, mathematics, statistics, and data science, among others. As a result, potential solutions to this problem have become highly interdisciplinary. Notwithstanding, statistics and data science have become paramount in the quest to provide evidence-based answers to a host of scientific problems associated with this novel virus. Among these problems are the issues of vaccine development, development of therapeutics, testing, contact tracing, forecasting, and inferential analysis.

The effects of the virus have varied greatly from country to country reflecting differences in data reporting, public health infrastructure, politics, economics, social contexts and the role of civil society. This session will discuss specific statistical issues related to the COVID-19 pandemic and will bring together prominent researchers who will share their experiences from Israel to India to the US.

#### SPEAKERS

#### **Daniel Diaz**

Research Assistant Professor, Division of Biostatistics, University of Miami, USA

#### **Jeffrey S. Morris**

Chair, Division of Biostatistics, University of Pennsylvania, USA

#### **Bhramar Mukherjee**

Chair, Department of Biostatistics, University of Michigan, USA

#### **Danny Pfeffermann**

Chief Statistician of Israel and Professor of Statistics, Hebrew University of Jerusalem, Israel & University of Southampton, UK

### A SIMPLE CORRECTION FOR COVID-19 SAMPLING BIAS


Daniel Diaz1

<sup>1</sup> Division of Biostatistics, University of Miami, USA, (e-mail: ddiaz3@miami.edu)

ABSTRACT: COVID-19 testing has become a standard approach for estimating prevalence, which then assists in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symptoms are more likely to be tested than those with no symptoms. This results in biased estimates of prevalence (too high). Typical post-sampling corrections are not always possible. Here we present a simple bias correction methodology derived and adapted from a correction for publication bias in meta-analysis studies. The methodology is general enough to allow a wide variety of customizations, making it more useful in practice. Implementation is easily done using already collected information. Via a simulation and two real datasets, we show that the bias corrections can provide dramatic reductions in estimation error.
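A small simulation of the sampling bias described above (illustrative only, with assumed toy testing probabilities; this demonstrates the problem, not the authors' correction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, prevalence = 1_000_000, 0.02
infected = rng.random(n) < prevalence
# Toy mechanism (assumed numbers): 80% of the infected are symptomatic,
# versus 5% of the non-infected.
symptomatic = np.where(infected, rng.random(n) < 0.8, rng.random(n) < 0.05)
# Symptomatic people are 50 times more likely to get tested.
tested = rng.random(n) < np.where(symptomatic, 0.5, 0.01)
naive_estimate = infected[tested].mean()
print(f"true prevalence {prevalence:.3f}, naive estimate {naive_estimate:.3f}")
```

Because the tested subsample over-represents the symptomatic (and hence the infected), the naive positive rate overstates true prevalence by an order of magnitude in this toy setting.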

### A SEAT AT THE TABLE: THE KEY ROLE OF BIOSTATISTICS AND DATA SCIENCE IN THE COVID-19 PANDEMIC

Jeffrey Morris1

<sup>1</sup> Division of Biostatistics, University of Pennsylvania, USA, (e-mail: jeffrey.morris@pennmedicine.upenn.edu)


ABSTRACT: The novel virus SARS-CoV-2 has produced a global pandemic, forcing doctors and policymakers to "fly blind", trying to deal with a virus and disease they knew virtually nothing about. Sorting through the information in real time has been a daunting process: processing data, media reports, commentaries, and research articles. In the USA this is exacerbated by an ideologically divided society that has difficulty with mutual trust, or even agreement on common facts. The skills underlying our statistical profession are central to this knowledge discovery process, filtering out biases, aggregating disparate data sources together, dealing with measurement error and missing data, identifying key insights while quantifying the uncertainty in these insights, and then communicating the results in an accessible, balanced way. As a result, we have had a central role to play in society to bring our perspective and expertise to bear on the pandemic to help ensure knowledge is efficiently discovered and put into practice. Unfortunately, our profession is often shy about asserting its perspective in broader societal ventures, perhaps not realizing the central importance of our perspective and mindset. I have authored a website and blog, covid-datascience.com, that represents my own personal efforts to disseminate information I have found reliable and insightful regarding the pandemic, accounting for subtle scientific and data analytical issues and uncertainties about our current knowledge, and seeking to filter out political and other subjective biases.

Using experiences with the covid-datascience blog as a backdrop, I will highlight how statistical and data scientific issues have been central to understanding the emerging knowledge in the pandemic. I will discuss various broad issues I have seen impede the knowledge discovery process, including subjective bias causing individuals to ignore some information and magnify others, viral misinformation spread on social media platforms, the danger of rushed and inadequately reviewed scientific studies, the conflating of political concerns and scientific messaging, and incomplete messaging from scientific leaders to the broader community. I will discuss these concepts in various specific contexts, including identification of key modes of spread and effective mitigation strategies, vaccine safety and efficacy, durability of immune protection and risk of reinfections or breakthrough infections, and the emergence of variants of concern and how this affects the pandemic moving forward. I will finish with a call urging statisticians to seek greater visibility and engagement with the media and policymakers to ensure our understanding of quantitative nuances is reflected in important societal-level decisions and in the dissemination of emerging scientific knowledge.


### PREDICTIONS, ROLE OF INTERVENTIONS AND THE CRISIS OF VIRUS IN INDIA: A DATA SCIENCE CALL TO ARMS

Bhramar Mukherjee1

<sup>1</sup> Department of Biostatistics, University of Michigan, USA, (e-mail: bhramar@umich.edu)


ABSTRACT: India, the world's largest democracy with 1.38 billion people, underwent five phases of national lockdown from March 25 to June 30, 2020 and several phases of unlocking during Wave 1 of the COVID-19 pandemic. The virus curve turned the corner in mid-September of 2020, and it appeared that India could avoid a second resurgence in the winter. Normalcy returned to the life of the Indian people, and vaccination had a sluggish start nationwide. Several hypotheses were postulated for this miraculous recovery of India, including herd immunity as implied by some serosurveys. Then came an astronomical Wave 2 for India, in which daily case counts reached more than 400,000 and daily death counts peaked around 4,500. In this presentation, we provide a brief chronicle of the modeling experience of our study team over the past year trying to understand the pandemic in India and to explain what caused this devastating second wave, including the role of the Delta variant. We discuss methodological innovations that incorporate imperfect viral testing when using case counts in an extended SEIR model for COVID-19. We use this model to estimate the unobserved infections and deaths, leading to an estimate of the infection fatality rates in India for Waves 1 and 2. This is joint work with many collaborators, with all supporting research materials and products available at covind19.org.
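For reference, a numpy sketch of the classical SEIR compartments underlying the extended model mentioned above (the testing and reporting layers of the actual model are omitted, and the parameter values are illustrative):

```python
import numpy as np

def seir(beta, sigma, gamma, n_pop, e0, days, dt=0.1):
    """Forward-Euler integration of the classical SEIR compartments:
    S -> E at rate beta*S*I/N, E -> I at rate sigma, I -> R at rate gamma."""
    s, e, i, r = n_pop - e0, float(e0), 0.0, 0.0
    out = []
    for _ in range(int(days / dt)):
        new_exposed = beta * s * i / n_pop
        s, e, i, r = (s - dt * new_exposed,
                      e + dt * (new_exposed - sigma * e),
                      i + dt * (sigma * e - gamma * i),
                      r + dt * gamma * i)
        out.append((s, e, i, r))
    return np.array(out)

# Illustrative parameters: R0 = beta / gamma = 3, mean latency 1/sigma = 4 days.
traj = seir(beta=0.6, sigma=0.25, gamma=0.2, n_pop=1e6, e0=100, days=300)
print(traj[-1].round(0))   # final (S, E, I, R); the total stays at 1e6
```

The extension in the talk layers imperfect testing on top of this, so that the observed case counts are a noisy, partial view of the latent I compartment.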

### CONTRIBUTIONS OF ISRAEL'S CBS TO ROUT COVID-19

Danny Pfeffermann1

<sup>1</sup> Central Bureau of Statistics and Hebrew University of Jerusalem, Israel; University of Southampton, UK, (e-mail: D.Pfeffermann@soton.ac.uk)

ABSTRACT: In this presentation, I shall describe the major problems that the Central Bureau of Statistics in Israel (ICBS) had faced during the pandemic, discuss the methodological issues involved and how we dealt with them. Issues considered are lack of health data; performing special household, business and serological surveys; accounting for NMAR nonresponse; publication of flash estimates; estimation of excess mortality; seasonal adjustment, trend estimation and weighting of CPI items in a year of pandemic.

# **Invited Papers**




### ROBUST ISSUES IN ESTIMATING MODELS FOR MULTIVARIATE TORUS DATA

Claudio Agostinelli<sup>1</sup>, Giovanni Saraceno<sup>1</sup> and Luca Greco<sup>2</sup>

<sup>1</sup> Department of Mathematics, University of Trento, (e-mail: claudio.agostinelli@unitn.it, giovanni.saraceno@unitn.it) <sup>2</sup> University Giustino Fortunato, Benevento (e-mail: l.greco@unifortunato.eu)

ABSTRACT: We consider the problem of robust fitting of statistical models for multivariate torus data, i.e., data that are multivariate angles. We discuss two different definitions of outliers, "geometric" and "probabilistic" outliers, and the robust methods proposed to cope with them. We mainly focus on multivariate wrapped models, together with some computational aspects.

KEYWORDS: circular data, multivariate torus data, outlier detection, robust estimation, wrapped models

### 1 Introduction

Multivariate circular data arise commonly in many different fields. Depending on the situation, observations can be thought of as points on the surface of a hyper-sphere ($\mathbb{S}^{p-1}$) or as points on the surface of a torus ($\mathbb{T}^p = [0, 2\pi)^p$). While the first setting is well studied in the literature, the latter has received much less attention, even though it is more common. Here, we review some aspects of robust fitting of torus data according to wrapped models. The peculiarity of multivariate torus data is periodicity, which is reflected in the boundedness of the sample space and often of the parametric space. Indeed, it is challenging to introduce the *geometric* concept of outliers as points that are far from the bulk of the data. However, it is always possible to define circular outliers from a *probabilistic* point of view, as points that are unlikely to occur under the assumed model. Notice that outliers are model dependent, since they are defined with respect to the specified model. A first general attempt to develop a robust parametric technique for multivariate torus data can be found in Saraceno *et al.*, 2021, where a weighted likelihood estimator is introduced and outliers are defined from the probabilistic point of view. In contrast, Greco *et al.*, 2021 develop robust estimators based on S/M/MM-estimators, as well as weighted likelihood estimators, considering the geometric approach.

### 2 Wrapped models

Let $\mathbf{X}$ be a multivariate random variable with model density $m(\mathbf{x}; \theta)$ on $\mathbb{R}^p$, parameterized by $\theta \in \Theta$. We can construct a wrapped model by $\mathbf{Y} = \mathbf{X} \bmod 2\pi$, where the $\bmod$ operator is applied component-wise. The density function of $\mathbf{Y}$ takes the form of an infinite sum over $\mathbb{Z}^p$ given by


$$m^{\circ}(\mathbf{y}; \theta) = \sum\_{\mathbf{j} \in \mathbb{Z}^p} m(\mathbf{y} + 2\pi \mathbf{j}; \theta) \; .$$

A good approximation, denoted by $m_J^{\circ}$, can be obtained in most cases with only a few terms of the summation, so that $\mathbb{Z}^p$ is replaced by $C_J = \otimes_{s=1}^{p} J$, where $J = (-J, -J+1, \dots, 0, \dots, J-1, J)$ for some fixed $J$. The support of $\mathbf{Y}$ is, for convenience, bounded and given by $[0, 2\pi)^p$, and the parametric space $\Theta$ might be restricted as well to ensure identifiability. The $p$-dimensional vector $\mathbf{j}$ is the vector of wrapping coefficients, that is, it indicates how many times each component of the $p$-toroidal data point has been wrapped. Given a sample $(\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_n)$, the approximated log-likelihood function is given by

$$\ell(\theta) = \sum\_{i=1}^{n} \log m\_J^{\circ}(\mathbf{y}\_i; \theta) = \sum\_{i=1}^{n} \log \sum\_{\mathbf{j} \in C\_J} m(\mathbf{y}\_i + 2\pi \mathbf{j}; \theta) \ .$$
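As a concrete sketch, the truncated sum $m_J^{\circ}$ and this approximated log-likelihood can be evaluated numerically. The code below assumes a Gaussian base density $m$; the function names are ours, not from the cited papers.

```python
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def wrapped_density(y, mu, Sigma, J=3):
    """Approximate the wrapped density m°_J(y; theta): truncate the sum
    over Z^p to the finite grid C_J = {-J, ..., J}^p."""
    y = np.atleast_2d(y)                      # (n, p) points on [0, 2*pi)^p
    p = y.shape[1]
    rv = multivariate_normal(mean=mu, cov=Sigma)
    dens = np.zeros(y.shape[0])
    for j in product(range(-J, J + 1), repeat=p):
        dens += rv.pdf(y + 2 * np.pi * np.array(j))
    return dens

def wrapped_loglik(y, mu, Sigma, J=3):
    """Approximated log-likelihood: sum_i log m°_J(y_i; theta)."""
    return float(np.sum(np.log(wrapped_density(y, mu, Sigma, J))))

# wrap a bivariate normal sample onto the torus [0, 2*pi)^2
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(2), np.eye(2), size=200)
y = np.mod(x, 2 * np.pi)
ll = wrapped_loglik(y, np.zeros(2), np.eye(2))
```

For moderate dispersion, a small $J$ (here $J = 3$) already captures essentially all of the mass of the infinite sum.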

Assuming that we could observe the vectors j*<sup>i</sup>* (*i* = 1,...,*n*), then we would have access to the unwrapped and unobserved sample xˆ*<sup>i</sup>* = y*<sup>i</sup>* + 2πj*i*. This leads to the following log-likelihood

$$\ell\_C(\theta) = \sum\_{i=1}^{n} \log m(\hat{\mathbf{x}}\_i; \theta) = \sum\_{i=1}^{n} \log m(\mathbf{y}\_i + 2\pi \mathbf{j}\_i; \theta) = \sum\_{i=1}^{n} \sum\_{\mathbf{j} \in C\_J} \nu\_{i\mathbf{j}} \log m(\mathbf{y}\_i + 2\pi \mathbf{j}; \theta) \,,$$

where $\nu_{i\mathbf{j}} = 1$ or $\nu_{i\mathbf{j}} = 0$ according to whether $\mathbf{y}_i$ has $\mathbf{j} \in C_J$ as its wrapping coefficient vector; the $\mathbf{j}_i$ are now additional unknown parameters that need to be estimated. Optimization of the above log-likelihood can be performed naturally through a Classification-Expectation-Maximization (CEM) algorithm; see Nodehi *et al.*, 2021 for more details. Hereafter, we concentrate on unimodal and elliptically symmetric densities $m$, i.e., given a strictly decreasing and nonnegative function $h$, and setting $\theta = (\mu, \Sigma)$ for a location vector parameter $\mu$ and a dispersion matrix $\Sigma$, we take $m(\mathbf{x}; \theta) \propto |\Sigma|^{-1/2}\, h\!\left((\mathbf{x}-\mu)^{\top} \Sigma^{-1} (\mathbf{x}-\mu)\right)$.
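The CEM alternation can be sketched for a wrapped normal as follows; this is a minimal illustration of the idea only, not the algorithm of Nodehi *et al.*, 2021, and all names are ours. The C-step assigns each observation its most likely wrapping vector, and the M-step refits $(\mu, \Sigma)$ from the unwrapped sample.

```python
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def cem_wrapped_normal(y, J=2, n_iter=20):
    """Classification EM for a wrapped normal on [0, 2*pi)^p: the C-step picks
    the most likely wrapping vector j_i for each point, the M-step refits
    (mu, Sigma) from the unwrapped sample x_i = y_i + 2*pi*j_i."""
    n, p = y.shape
    grid = 2 * np.pi * np.array(list(product(range(-J, J + 1), repeat=p)))  # C_J
    # crude initialization: component-wise circular means, identity dispersion
    mu = np.arctan2(np.sin(y).mean(axis=0), np.cos(y).mean(axis=0)) % (2 * np.pi)
    Sigma = np.eye(p)
    for _ in range(n_iter):
        rv = multivariate_normal(mean=mu, cov=Sigma)
        logd = np.stack([rv.logpdf(y + g) for g in grid])  # (|C_J|, n)
        x_hat = y + grid[np.argmax(logd, axis=0)]          # C-step: unwrap
        mu, Sigma = x_hat.mean(axis=0), np.cov(x_hat.T)    # M-step: refit
    return mu % (2 * np.pi), Sigma

rng = np.random.default_rng(1)
x = rng.multivariate_normal(np.zeros(2), 0.3 * np.eye(2), size=300)
mu_hat, Sigma_hat = cem_wrapped_normal(np.mod(x, 2 * np.pi))
```

Because the fitted location is only identified modulo $2\pi$, estimates near $0$ and near $2\pi$ are equivalent on the torus.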

### 3 Outliers in multivariate torus data

Consider $0 \le \varepsilon < 0.5$ and an arbitrary distribution $g(\mathbf{x})$ on $\mathbb{R}^p$. According to the usual gross error model, the true density $f(\mathbf{x})$ of the data is given by $f(\mathbf{x}) = (1-\varepsilon)\, m(\mathbf{x}; \mu, \Sigma) + \varepsilon\, g(\mathbf{x})$, and hence the corresponding wrapped density has the form

$$f^{\circ}(\mathbf{y}) = (1 - \varepsilon) \sum\_{\mathbf{j} \in \mathbb{Z}^p} m(\mathbf{y} + 2\pi \mathbf{j}; \mu, \Sigma) + \varepsilon \sum\_{\mathbf{j} \in \mathbb{Z}^p} g(\mathbf{y} + 2\pi \mathbf{j}) \tag{1}$$

$$= (1 - \varepsilon)\, m^{\circ}(\mathbf{y}; \mu, \Sigma) + \varepsilon\, g^{\circ}(\mathbf{y}) \, . \tag{2}$$

If we instead consider the approach leading to $\ell_C(\mu, \Sigma)$ and equation (1), for a given observation $\mathbf{y}_i$ we have

$$f^{\circ}(\mathbf{y}\_i) \approx (1 - \varepsilon)\, m(\mathbf{y}\_i + 2\pi \mathbf{j}\_i; \mu, \Sigma) + \varepsilon\, g(\mathbf{y}\_i + 2\pi \mathbf{j}\_i),$$

which suggests the classical geometric definition of outliers. In such cases, the degree of outlyingness of an observation is based on some "geometric" distance, e.g., the squared Mahalanobis distance. In contrast, we can define outliers directly on the torus, that is, according to equation (2), based on a "probabilistic" distance [Markatou *et al.*, 1998 and Agostinelli, 2007], where we compare the *true* density $f^{\circ}(\mathbf{y}_i)$ with the model density $m^{\circ}(\mathbf{y}_i; \mu, \Sigma)$. A measure of the agreement is provided by the finite sample Pearson residual function [Lindsay, 1994 and Markatou *et al.*, 1998], defined as $\delta_n(\mathbf{y}) = \hat{f}_n(\mathbf{y}) / \hat{m}(\mathbf{y}; \theta) - 1$, where $\hat{f}_n(\mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} k(\mathbf{y}; \mathbf{y}_i, h)$ is a non-parametric kernel density estimate (with kernel function $k$ and bandwidth $h$) of the true density $f(\mathbf{y})$, and $\hat{m}(\mathbf{y}; \mu, \Sigma) = \int k(\mathbf{y}; \mathbf{t}, h)\, m(\mathbf{t}; \mu, \Sigma)\, d\mathbf{t}$ is a smoothed version of the model density.
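To make the residual concrete, the sketch below works on $\mathbb{R}^p$ and ignores wrapping, which is a simplification of the torus setting above; with a Gaussian kernel, the smoothed version of a Gaussian model density is available in closed form as a normal with covariance $\Sigma + h^2 I$. All names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pearson_residuals(y_eval, sample, mu, Sigma, h=0.4):
    """delta_n(y) = f_hat_n(y) / m_hat(y; theta) - 1 with a Gaussian kernel.
    Convolving a N(mu, Sigma) model with the kernel gives N(mu, Sigma + h^2 I)."""
    n, p = sample.shape
    kern = multivariate_normal(mean=np.zeros(p), cov=h ** 2 * np.eye(p))
    f_n = np.array([kern.pdf(y - sample).mean() for y in np.atleast_2d(y_eval)])
    m_hat = multivariate_normal(mean=mu, cov=Sigma + h ** 2 * np.eye(p)).pdf(y_eval)
    return f_n / m_hat - 1.0

rng = np.random.default_rng(2)
bulk = rng.multivariate_normal(np.zeros(2), np.eye(2), size=500)
sample = np.vstack([bulk, [[6.0, 6.0]]])   # one far "probabilistic" outlier
delta = pearson_residuals(np.array([[0.0, 0.0], [6.0, 6.0]]), sample,
                          np.zeros(2), np.eye(2))
```

Points in the bulk give residuals near zero, while an outlying data point, supported by the kernel estimate but nearly impossible under the model, gives a very large residual.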

### 4 Example


Here, we illustrate the behavior of the robust estimators introduced in Saraceno *et al.*, 2021 and Greco *et al.*, 2021 using a simulated example; we point the reader to the cited papers for full details. The bulk of the data has been drawn from a bivariate wrapped normal distribution with $\mu = \mathbf{0}$ and $\Sigma = D^{1/2} R D^{1/2}$, where $R$ is a random correlation matrix and $D = \mathrm{diag}(\sigma \mathbf{1}_2)$ with $\sigma = \pi/4$. The sample size is $n = 500$ with 10% contamination. Two types of outlying observations are considered: scattered and point-mass. It is suggested to represent circular data points after they have been unwrapped on a "flat" torus in the form $\mathbf{x} = \mathbf{y} + 2\pi \mathbf{j}$ for $\mathbf{j} \in C_J$. The figure shows the unwrapped bivariate points (grey points), the scattered (red crosses) and the point-mass (green plus) outliers. The bivariate fitted models are displayed as ellipses based on the 0.99-level quantile of a $\chi^2_2$ distribution. We show the results obtained using the maximum likelihood estimator (grey line) and the proposed robust estimators. In particular, we
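A sketch of this simulation design (variable names ours; the exact contamination scheme of the cited papers may differ, and here the scattered outliers are simply drawn uniformly on the torus):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, eps, sigma = 500, 2, 0.10, np.pi / 4

# Sigma = D^(1/2) R D^(1/2), with R a random correlation matrix
# and D = diag(sigma * 1_p), as in the text
A = rng.standard_normal((p, p))
C = A @ A.T
R = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))
D_half = np.diag(np.full(p, np.sqrt(sigma)))
Sigma = D_half @ R @ D_half

n_out = int(eps * n)                                       # 10% contamination
bulk = rng.multivariate_normal(np.zeros(p), Sigma, size=n - n_out)
scattered = rng.uniform(0.0, 2 * np.pi, size=(n_out, p))   # or a point mass, e.g. at (pi, pi)
y = np.mod(np.vstack([bulk, scattered]), 2 * np.pi)        # wrapped sample on the torus
```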



According to the proposed model, a set of latent profiles characterizes the population-specific response patterns, while the individual propensity toward a specific profile is allowed to change in time and with subject-specific covariates, leveraging on a dependent stick-breaking construction for the mixture weights.


We illustrate the details of the proposed methodology and its application on the Italian population. Our empirical findings focus on the evolution of the psychosis across the pandemic and on the estimated sub-regional differences in terms of the impact of COVID-19 pandemic on the individual's psychology.



### CLUSTERING FINANCIAL TIME SERIES USING GENERALIZED CROSS CORRELATIONS

Andrés M. Alonso<sup>1</sup>, Carolina Gamboa<sup>1</sup> and Daniel Peña<sup>1</sup>

<sup>1</sup> Department of Statistics, Universidad Carlos III de Madrid, Spain. (e-mail: andres.alonso@uc3m.es, 100312917@alumnos.uc3m.es, daniel.pena@uc3m.es)

ABSTRACT: In this paper we propose a procedure for clustering financial time series using the generalized cross correlations (GCC) between the estimated volatilities and the squared residuals of ARMA(*p*,*q*) models. Monte Carlo experiments are carried out to analyze the performance of the proposed procedure. We show that the procedure is able to recover the original clustering structures in all cases studied. Finally, the methodology is applied to a set of real data.

KEYWORDS: unsupervised classification, dependence measure, conditional variance.

### 1 Introduction


A variety of methods have been proposed in the literature to cluster time series (see Caiado *et al.*, 2015 and the references cited there). In those methods the clustering problem is solved using two different strategies: the first one works directly on the original time series by defining an appropriate metric; in the second one, time series are projected into a smaller space of features or parameters. These methods are useful when the time series are independent; however, in many applications the assumption of independence does not hold. Few articles have proposed methods for clustering by dependency. Zhang & An, 2018 proposed a distance measure based on copulas to measure general dependence of the time series. Alonso & Peña, 2019 introduced the generalized cross correlation metric, based on all the cross correlations between two time series up to a certain lag, $k$.

These two methods assume that the dependency among the time series is in the levels, and do not consider the case in which the dependency is in the conditional variances. This case is important in many fields; for example, in financial time series, asset returns do not present a strong structure in the levels but do present it in the volatility. Some studies have taken into account the similarity of the evolution of the conditional variances (see Otranto, 2008 and D'Urso *et al.*, 2013 for GARCH models).

In this work we study a procedure to cluster time series by dependency in the conditional variability, integrating the concepts of dependency and heteroscedasticity of a set of time series. We extend the results presented in Alonso & Peña, 2019 to the search for dependencies between the squares of two time series or between their estimated volatilities. In Section 2 we present the new methodology, and in Section 3 we illustrate its use in a real data example. Some Monte Carlo experiments are available upon request from the authors.


### 2 Clustering time series by volatility dependency

Let $w_t$ and $z_t$ be two stationary time series and let $x_t = w_t^2$, $y_t = z_t^2$ be their corresponding squares, which will also be stationary. Using the results given in Alonso & Peña, 2019, we are going to define a linear dependence measure between $(x_t, y_t)$. We calculate the autocorrelations of $x_t$ and $y_t$, $\rho_x(h)$ and $\rho_y(h)$, and the cross correlations between $x_t$ and $y_t$, $\rho_{xy}(h)$, for lags $h = 0, \pm 1, \dots, \pm k$. The linear dependency between the two time series of squares can be summarized in the matrix

$$\mathbf{R}\_{k} = \begin{pmatrix} \mathbf{R}(0) & \mathbf{R}(1) & \dots & \mathbf{R}(k) \\ \mathbf{R}(-1) & \mathbf{R}(0) & \dots & \mathbf{R}(k-1) \\ \vdots & \vdots & \dots & \vdots \\ \mathbf{R}(-k) & \mathbf{R}(-k+1) & \dots & \mathbf{R}(0) \end{pmatrix},\tag{1}$$

where
$$\mathbf{R}(h) = \begin{pmatrix} \rho\_x(h) & \rho\_{xy}(h) \\ \rho\_{yx}(h) & \rho\_y(h) \end{pmatrix}.$$
The matrix $\mathbf{R}_k$ corresponds to the correlation matrix of the stationary process $(x_t, y_t, x_{t-1}, y_{t-1}, \dots, x_{t-k}, y_{t-k})^{T}$.

The *generalized correlation coefficient* is defined using matrix R*<sup>k</sup>* by

$$GCC(x\_t, y\_t) = 1 - \left( \frac{\det(\mathbf{R}\_{\mathbf{yx},k})}{\det(\mathbf{R}\_{\mathbf{xx},k}) \det(\mathbf{R}\_{\mathbf{yy},k})} \right)^{1/(k+1)}, \tag{2}$$

where $\mathbf{R}_{\mathbf{xx},k}$ and $\mathbf{R}_{\mathbf{yy},k}$ are the correlation matrices of $X_{t,k}$ and $Y_{t,k}$, respectively, and $\mathbf{C}_{\mathbf{xy},k}$ is the matrix of cross-correlations between these two vectors, so that $\mathbf{R}_{\mathbf{yx},k}$ is the correlation matrix of the joint lagged vector.

This similarity measure $GCC(x_t, y_t)$ satisfies the following properties: (1) $GCC(x_t, y_t) = GCC(y_t, x_t)$; (2) $0 \le GCC(y_t, x_t) \le 1$; it takes the value zero when all cross correlations are zero and the value one when the dependence between the two series is perfectly linear. Based on this measure we define the dissimilarity between $x_t$ and $y_t$ as $d(x_t, y_t) = 1 - GCC(x_t, y_t)$; in that way, high dissimilarity values are associated with weak dependence and values close to zero with strong dependence. Once the pairwise dissimilarities between time series are obtained, we can apply any clustering method that uses dissimilarity matrices as input.
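A minimal sketch of this computation (function names ours; in the proposed procedure the inputs would be the squared ARMA residuals or estimated volatilities rather than raw series):

```python
import numpy as np

def gcc_dissimilarity(x, y, k=5):
    """d(x, y) = 1 - GCC(x, y), built from the correlation matrix R_k of the
    joint lagged vector (x_t, y_t, x_{t-1}, y_{t-1}, ..., x_{t-k}, y_{t-k})."""
    n = len(x)
    cols = []
    for h in range(k + 1):                      # lag-h copies of each series
        cols += [x[k - h : n - h], y[k - h : n - h]]
    R = np.corrcoef(np.column_stack(cols).T)    # joint correlation matrix R_k
    idx_x = np.arange(0, 2 * (k + 1), 2)        # positions of the x-lags
    Rxx = R[np.ix_(idx_x, idx_x)]
    Ryy = R[np.ix_(idx_x + 1, idx_x + 1)]
    ratio = np.linalg.det(R) / (np.linalg.det(Rxx) * np.linalg.det(Ryy))
    return ratio ** (1.0 / (k + 1))             # = 1 - GCC

rng = np.random.default_rng(3)
e = rng.standard_normal(2000)
x = e + 0.1 * rng.standard_normal(2000)   # x and y share a common component
y = e + 0.1 * rng.standard_normal(2000)
z = rng.standard_normal(2000)             # z is independent of x
d_close, d_far = gcc_dissimilarity(x, y), gcc_dissimilarity(x, z)
```

By Fischer's inequality the determinant ratio lies in $(0, 1]$, so the dissimilarity stays in $[0, 1]$: near 0 for the strongly dependent pair, near 1 for the independent one.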

### 3 Real data example


In this section we use the set of portfolios designed by Kenneth R. French, which contains daily, equal-weighted returns of firms listed on the NYSE, AMEX, or NASDAQ. The portfolios are constructed based on different criteria such as company size, book/market ratio, company capitalization and/or industry classification; see http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. We analyze 100 time series consisting of 25 portfolios, based on market equity (ME) and the ratio of book equity to market equity (BE/ME), for each of the European (UE), Japanese (JAP), Pacific Asian (PA, except Japan) and North American (AM) markets.

First, we obtain the dendrogram using single linkage and the dissimilarity $d(w_t, z_t)$ for the levels of daily returns. The silhouette statistic suggests four clusters, which correspond to the four regions analyzed. In addition, the series that belong to each cluster present a strong dependence between them, except for the Asian time series. When we use the squares of daily returns for clustering, the silhouette statistic finds five clusters: the same four groups as in the levels and a fifth group with a single time series belonging to Pacific Asia. It is also observed that the group of American time series presents weaker dependencies than those observed in the levels. In Figure 1, we show that the dependency structures based on levels differ from the ones based on squared returns. In particular, it is remarkable that portfolios AM12, AM13, AM14, and AM15 make up a group of dependent series in the levels; however, this group is divided when squares are taken into account, and the same is true for the group of portfolios AM52, AM53, AM54.
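The clustering step can then use any method that accepts a dissimilarity matrix; below is a toy sketch with single linkage, where the $1 - GCC$ values are illustrative, not the French portfolio data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy 1 - GCC dissimilarity matrix: two blocks of strongly dependent series
m = 8
D = np.full((m, m), 0.9)
D[:4, :4] = 0.1
D[4:, 4:] = 0.1
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method="single")      # single-linkage dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
```

In practice the silhouette statistic (as used above) would guide the choice of the number of clusters.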

Acknowledgements: The authors gratefully acknowledge the financial support of the Spanish government, Agencia Estatal de Investigación (PID2019-108311GB-I00 / AEI / 10.13039/501100011033).

### References

ALONSO, A.M., & PEÑA, D. 2019. Clustering time series by linear dependency. *Statistics and Computing*, 29, 655–676.

CAIADO, J., MAHARAJ, E.A., & D'URSO, P. 2015. Time-series clustering. *Pages 262–285 of: Handbook of Cluster Analysis*. Chapman and Hall/CRC.

D'URSO, P., CAPPELLI, C., DI LALLO, D., & MASSARI, R. 2013. Clustering of financial time series. *Physica A: Statistical Mechanics and its Applications*, 392, 2114–2129.

OTRANTO, E. 2008. Clustering heteroskedastic time series by model-based procedures. *Computational Statistics & Data Analysis*, 52, 4685–4698.

ZHANG, B., & AN, B. 2018. Clustering time series based on dependence structure. *PloS One*, 13, e0206753.

Figure 1: Dendrograms for American and European portfolios. (a) Returns levels. (b) Squared returns levels.

### MODEL-BASED CLUSTERING FOR CATEGORICAL DATA VIA HAMMING DISTANCE

Raffaele Argiento<sup>1</sup>, Edoardo Filippi-Mazzola<sup>2</sup> and Lucia Paci<sup>1</sup>

<sup>1</sup> Università Cattolica del Sacro Cuore (e-mail: raffaele.argiento@unicatt.it, lucia.paci@unicatt.it)

<sup>2</sup> Università della Svizzera italiana (e-mail: edoardo.filippi-mazzola@usi.ch)

ABSTRACT: In this work a model-based approach for clustering categorical data with no natural ordering is introduced. The proposed method exploits the Hamming distance to define a family of probability mass functions to model categorical data. The elements of this family are considered as kernels of a finite mixture model with unknown number of components. Fully Bayesian inference is provided using a sampling strategy based on a trans-dimensional blocked Gibbs-sampler, facilitating computation with respect to the customary reversible-jump algorithm. Model performances are assessed via a simulation study, showing improvements both in terms of prediction and estimation, with respect to existing approaches. Finally, our method is illustrated with application to reference datasets.

KEYWORDS: Hamming distribution, mixture modelling, categorical data analysis, blocked Gibbs Sampling
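To fix ideas, a probability mass function built from the Hamming distance can take the following form (a sketch under our own parameterization, which may differ in its details from the family proposed in the paper):

```python
import math

def hamming(x, c):
    # Hamming distance: number of attributes on which x and c disagree
    return sum(a != b for a, b in zip(x, c))

def hamming_pmf(x, center, sigma, n_levels):
    # p(x | center, sigma) proportional to exp(-d_H(x, center) / sigma).
    # The normalizing constant factorizes over the attributes because the
    # distance is a sum of per-attribute mismatch indicators: attribute j
    # contributes 1 for a match and exp(-1/sigma) for each of the
    # g_j - 1 mismatching categories.
    z = 1.0
    for g in n_levels:
        z *= 1.0 + (g - 1) * math.exp(-1.0 / sigma)
    return math.exp(-hamming(x, center) / sigma) / z
```

The mass decays geometrically in the distance from the cluster center, so a small `sigma` yields a tightly concentrated kernel; such kernels can then be mixed with unknown component number, as in the abstract.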

### MINING MULTIPLE TIME SEQUENCES THROUGH CO-CLUSTERING ALGORITHMS FOR DISTRIBUTIONAL DATA


Antonio Balzanella<sup>1</sup>, Antonio Irpino<sup>1</sup> and Francisco T. de A. de Carvalho<sup>2</sup>

<sup>1</sup> Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", (e-mail: antonio.balzanella@unicampania.it, antonio.irpino@unicampania.it)

<sup>2</sup> CIN-UFPE, Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária 50.740-560, Recife, PE, Brasil (e-mail: fatc@cin.ufpe.br)

ABSTRACT: This paper deals with the co-clustering of distributional data applied to multiple time sequences. The aims are: to get a double partition of the data into clusters of units and variables; to summarize the main concepts in the data through histogram prototypes; and to overview the evolution over time of the monitored phenomenon. We extend the double k-means algorithm to handle distributional data by using the *L*<sup>2</sup> Wasserstein distance for comparing distributions. Moreover, we adapt the double k-means algorithm to compute optimal relevance weights associated with the variables.

KEYWORDS: co-clustering, distribution data, Wasserstein distance.

#### 1 Introduction

In recent years, several authors (Arroyo & Maté, 2009; Balzanella & Irpino, 2020) have proposed summarizing temporal sequences by a set of distributions. In particular, they assume that the time domain of the sequences is split into non-overlapping time windows, and the distribution of the records framed by each window is estimated through histograms or kernel density estimators. This summarization allows us to retain most of the information regarding the monitored phenomenon and to perform dimensionality reduction.

In this framework, we consider co-clustering of distributional data with the following objectives: 1) to summarize the main concepts in the data through histogram prototypes; 2) to reorganize the initial matrix into a block matrix; 3) to overview the evolution over time of the monitored phenomenon through the partition of the variables; 4) to evaluate the contribution of various periods (intervals of time) to the optimal partitioning by considering the weights of the variables; 5) to obtain a partition of the series so that groups of series that record similar data over time can be discovered.

We use a co-clustering approach (de A.T. De Carvalho et al., 2021) that extends the classic alternated double k-means. It performs a double partition of objects and variables to simultaneously discover blocks of subsets of the rows and columns of a data table according to a homogeneity criterion. We use two variants of this algorithm: the distributional double k-means (DDK) and the adaptive distributional double k-means (ADDK). The main difference between the two algorithms is that only ADDK computes a relevance weight for each variable. In both variants, the internal variability of clusters or co-clusters is measured by the Wasserstein-based sum of squared errors (Irpino & Verde, 2015).

#### 2 Distributional Double k-means (DDK) and the Adaptive Distributional Double k-means (ADDK)

Let us consider a set of *N* objects observed on *P* Distributional variables (DV). A DV takes as values one-dimensional theoretical or empirical (i.e., histograms) probability density functions.

The objects are indexed by *i* (with *i* = 1,...,*N*), the *P* variables are denoted by *Yj* (with *j* = 1,...,*P*), and the *i*−*th* one-dimensional distribution data (DD) of the *Yj* variable is denoted by *yi j*. The vector y*<sup>i</sup>* = [*yi*1,...,*yiP*] contains the description of the *i* −*th* object on the *P* DVs. Considering *yi j* an empirical probability density function, we refer to *Qi j* as the quantile function (qf), that is, the inverse of the cdf.

We use the squared $L^2$ Wasserstein metric between the DDs $y_{ij}$ and $y_{i'j}$, with support in $\Re$, defined as:

$$d_W^2\left(y_{ij}, y_{i'j}\right) = \int_0^1 \left( Q_{ij}(t) - Q_{i'j}(t) \right)^2 dt.$$

In order to consider the relevance of each variable we use the following notion of adaptive distances based on the squared *L*<sup>2</sup> Wasserstein distance. Let us consider a vector of positive weights Λ = [λ1,...,λ*P*]. According to (De Carvalho & Lechevallier, 2009), a general expression for the adaptive (squared) *L*<sup>2</sup> Wasserstein distance is:

$$d\_W^2\left(\mathbf{y}\_i, \mathbf{y}\_{i'} | \Lambda\right) = \sum\_{j=1}^P \lambda\_j d\_W^2\left(\mathbf{y}\_{ij}, \mathbf{y}\_{i'j}\right) \tag{1}$$

with $\lambda_j > 0 \ \forall j$ and $\prod_{j=1}^{P} \lambda_j = 1$.
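In sample terms, the squared Wasserstein distance can be approximated on a grid of quantile levels, and the adaptive distance of Eq. (1) is then a weighted sum over the variables. The following sketch illustrates this (our own helper names, not the authors' code):

```python
import numpy as np

def wasserstein2_sq(sample_a, sample_b, n_grid=1000):
    # squared L2 Wasserstein distance between two one-dimensional samples,
    # approximated as the mean squared difference of their quantile functions
    # evaluated on a regular grid of levels t in (0, 1)
    t = (np.arange(n_grid) + 0.5) / n_grid
    qa = np.quantile(sample_a, t)
    qb = np.quantile(sample_b, t)
    return float(np.mean((qa - qb) ** 2))

def adaptive_dist_sq(obj_i, obj_ip, weights):
    # adaptive (squared) distance of Eq. (1): a weighted sum, over the P
    # variables, of the per-variable squared Wasserstein distances
    return sum(w * wasserstein2_sq(a, b)
               for w, a, b in zip(weights, obj_i, obj_ip))
```

For two distributions differing only by a location shift of $m$, the squared distance is exactly $m^2$, which is a convenient sanity check.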


The objective is to obtain a co-clustering of the input data, that is, a double partition of the data table into *C* × *H* blocks such that *P* = {*P*1,...,*PC*} is a partition of the set of *N* objects into *C* clusters, and *Q* = {*Q*1,...,*QH*} is a partition of the set of *P* distributional-valued variables into *H* clusters.

Given the number of desired object clusters *C* and variable clusters *H*, the co-clustering returns the matrix G of prototypes, the partition *P* of the objects, and the partition *Q* of the variables. These are iteratively obtained by minimizing the following error function, denoted here as *JDDK*:

$$J_{DDK}(\mathbf{G}, \mathcal{P}, \mathcal{Q}) = \sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{e_i \in \mathcal{P}_k} \sum_{Y_j \in \mathcal{Q}_h} d_W^2\left(\mathbf{y}_{ij}, \mathbf{g}_{kh}\right), \tag{2}$$



where *gkh* is the prototype of the co-cluster Y*kh*.

In most applications, variables may have a different relevance. We propose to obtain relevance weights by minimizing an objective function denoted by *JADDK*:

$$J_{ADDK}(\mathbf{G}, \Lambda, \mathcal{P}, \mathcal{Q}) = \sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{e_i \in \mathcal{P}_k} \sum_{Y_j \in \mathcal{Q}_h} d_W^2\left(\mathbf{y}_{ij}, \mathbf{g}_{kh} \mid \Lambda\right), \tag{3}$$

where *d*<sup>2</sup> *<sup>W</sup>* (.|Λ) is the adaptive (squared) *L*<sup>2</sup> Wasserstein distance computed between the generic *yi j* and the prototype *gkh* of the belonging co-cluster Y*kh*, weighted by the elements of Λ.

The basic scheme of the DDK and ADDK co-clustering algorithms is the following: from an initial random partitioning of the objects and variables into clusters, the algorithms perform a sequence of alternating steps (three for DDK and four for ADDK) until they converge to a stationary value of the objective function:

i) *representation* step, in which the optimal representative (prototype) of each cluster is computed;

ii) *weighting* step (ADDK), in which the relevance weights for each variable and/or each component are computed;

iii) *object assignment* step, in which the optimal assignment of the objects to clusters is obtained;

iv) *variable assignment* step, in which the optimal assignment of the variables to clusters is obtained.
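Representing each distributional cell by its quantile function evaluated on a common grid makes this alternating scheme easy to prototype: the squared $L^2$ Wasserstein distance becomes a mean of squared quantile differences, and the co-cluster prototype is the quantile-wise average (the one-dimensional Wasserstein barycenter). The following is a minimal DDK sketch under that representation (our own implementation, without the ADDK weighting step):

```python
import numpy as np

def ddk(Q, C, H, n_iter=20, seed=0):
    # Q: array (N, P, G) holding, for each of the N objects and P variables,
    # the cell's quantile function on a common grid of G levels.
    N, P, G = Q.shape
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, C, size=N)   # object partition
    cols = rng.integers(0, H, size=P)   # variable partition
    for _ in range(n_iter):
        # representation step: prototype of each co-cluster is the
        # quantile-wise mean of the cells it contains
        proto = np.zeros((C, H, G))
        for k in range(C):
            for h in range(H):
                block = Q[rows == k][:, cols == h].reshape(-1, G)
                if block.size:
                    proto[k, h] = block.mean(axis=0)
        # object assignment step
        for i in range(N):
            costs = [sum(((Q[i, j] - proto[k, cols[j]]) ** 2).mean()
                         for j in range(P)) for k in range(C)]
            rows[i] = int(np.argmin(costs))
        # variable assignment step
        for j in range(P):
            costs = [sum(((Q[i, j] - proto[rows[i], h]) ** 2).mean()
                         for i in range(N)) for h in range(H)]
            cols[j] = int(np.argmin(costs))
    return rows, cols, proto
```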
#### 3 Conclusions

In this paper we propose to use two co-clustering algorithms for the analysis of time sequences. We tested the method on a real-world dataset, available at *http://db.csail.mit.edu/labdata/labdata.html*, which collects some environmental variables recorded inside a laboratory. We show in Fig. 1 the double partition obtained by DDK: the left side shows the obtained co-clusters, while the right side provides a reorganized version that highlights the main blocks.

Figure 1. *DDK algorithm: co-clustering structure.* a) Original data matrix; b) reorganized data.

#### References

ARROYO, J., & MATÉ, C. 2009. Forecasting histogram time series with k-nearest neighbours methods. *International Journal of Forecasting*, 25(1), 192–207.

BALZANELLA, A., & IRPINO, A. 2020. Spatial prediction and spatial dependence monitoring on georeferenced data streams. *Statistical Methods & Applications*, 29(1), 101–128.

DE A.T. DE CARVALHO, F., BALZANELLA, A., IRPINO, A., & VERDE, R. 2021. Co-clustering algorithms for distributional data with automated variable weighting. *Information Sciences*, 549, 87–115.

DE CARVALHO, F.A.T., & LECHEVALLIER, Y. 2009. Partitional clustering algorithms for symbolic interval data based on single adaptive distances. *Pattern Recognition*, 42(7), 1223–1236.

IRPINO, A., & VERDE, R. 2015. Basic statistics for distributional symbolic variables: a new metric-based approach. *Advances in Data Analysis and Classification*, 9(2), 143–175.


### HIDDEN MARKOV AND REGIME SWITCHING COPULA MODELS FOR STATE ALLOCATION IN MULTIPLE TIME-SERIES


Francesco Bartolucci<sup>1</sup>, Fulvia Pennoni<sup>2</sup>, and Federico P. Cortese<sup>3</sup>

<sup>1</sup> Department of Economics, University of Perugia (e-mail: francesco.bartolucci@unipg.it)

<sup>2</sup> Department of Statistics and Quantitative Methods, University of Milano-Bicocca (e-mail: fulvia.pennoni@unimib.it)

<sup>3</sup> Department of Economics, Management and Statistics, University of Milano-Bicocca (e-mail: f.cortese5@campus.unimib.it)

ABSTRACT: We consider hidden Markov and regime-switching copula models as approaches for state allocation in multiple time-series, where state allocation means prediction of the latent state characterizing each time occasion based on the observed data. This dynamic clustering, performed under the two model specifications, takes the correlation structure of the time-series into account. Maximum likelihood estimation of the model parameters is carried out by the expectation-maximization algorithm. For illustration we use data on the market of cryptocurrencies characterized by periods of high turbulence in which interdependence among assets is marked.

KEYWORDS: daily log-returns, expectation-maximization algorithm, forecast, latent variables, model-based clustering

#### 1 Introduction

In the analysis of multiple time-series, state allocation, namely the prediction of the state or regime underlying the observed data at a certain time occasion, is an important task, especially in finance and related fields. This type of clustering is dynamic because a different state may be predicted at every time occasion, and it may be based on models representing each time-specific state by a discrete latent variable assuming, typically, a few possible values. In this contribution, we compare two different model specifications of this type: multivariate hidden Markov (HM) models (Zucchini *et al.*, 2017) and regime-switching (RS) copulas (Rodriguez, 2007).

Among HM models we consider, in particular, those based on the assumption that the time-specific vector of observable variables follows a conditional Gaussian distribution with parameters depending on the latent state.

RS copulas are instead based on a copula function, which may be chosen among the Clayton, the Gumbel, the Gaussian, or the Student-*t*, with parameters governed by a hidden Markov process of first-order so as to flexibly account for the correlation patterns between each pair of series.


The expectation-maximization (EM) algorithm (Dempster *et al.*, 1977) is used for maximum likelihood estimation of the parameters of both models. Model selection is performed to choose the most appropriate number of hidden states and to evaluate the level of chain homogeneity over time (Bartolucci *et al.*, 2013). For the HM model, this selection is based on the Bayesian Information Criterion (BIC); for RS copulas, it is also based on a goodness-of-fit procedure relying on the Cramér-von Mises statistic.

As an illustration, we consider the problem of state allocation in analyzing the time-series of the daily log-returns of the main cryptocurrencies over a three-year period.

#### 2 Hidden Markov and Regime-Switching Copula Models

Let $y_t$, $t = 1, 2, \ldots$, be the vector whose elements $y_{tj}$, $j = 1, \ldots, r$, correspond to the value of time-series $j$ at time occasion $t$, with $r$ denoting the number of time-series under consideration. The main assumption of the multivariate HM model is that the random vectors $y_1, y_2, \ldots$ are conditionally independent given a hidden process $u_1, u_2, \ldots$ that follows a first-order Markov chain with $k$ states, labeled from 1 to $k$. This process is governed by the initial probabilities $\pi_u = p(u_1 = u)$, $u = 1, \ldots, k$, and the transition probabilities $\pi_{u|\bar{u}} = p(u_t = u \mid u_{t-1} = \bar{u})$, $t = 2, \ldots$, with $\bar{u}, u = 1, \ldots, k$. We assume a Gaussian distribution for the observations at every time occasion, that is, $y_t \mid u_t = u \sim N_r(\mu_u, \Sigma_u)$, where $\mu_u$ and $\Sigma_u$ are the mean vector and variance-covariance matrix for latent state $u$. The above assumptions imply that the conditional distribution of the time-series $y_1, y_2, \ldots$, given the sequence of hidden states, may be expressed as $f(y_1, y_2, \ldots \mid u_1, u_2, \ldots) = \prod_t \phi(y_t; \mu_{u_t}, \Sigma_{u_t})$, where $\phi(\cdot\,;\cdot)$ denotes the density of the multivariate Gaussian distribution. The manifest distribution of the multiple time-series has the following density function:

$$f(y_1, y_2, \dots) = \sum_{u_1} \pi_{u_1} \phi(y_1; \mu_{u_1}, \Sigma_{u_1}) \sum_{u_2} \pi_{u_2 \mid u_1} \phi(y_2; \mu_{u_2}, \Sigma_{u_2}) \cdots.$$
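The nested sums in this manifest density can be evaluated in $O(Tk^2)$ time with the standard forward recursion, rather than by enumerating all $k^T$ hidden paths. A minimal sketch (our own function names, not the authors' code):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    # density of the r-variate Gaussian N_r(mu, sigma) at y
    r = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.inv(sigma) @ diff
    norm = np.sqrt((2 * np.pi) ** r * np.linalg.det(sigma))
    return float(np.exp(-0.5 * quad) / norm)

def hm_likelihood(Y, pi, Pi, mus, sigmas):
    # forward recursion for the manifest density f(y_1, ..., y_T):
    # alpha_t(u) = f(y_1, ..., y_t, u_t = u), updated state by state
    k = len(pi)
    alpha = np.asarray(pi) * [gaussian_pdf(Y[0], mus[u], sigmas[u]) for u in range(k)]
    for t in range(1, len(Y)):
        emis = [gaussian_pdf(Y[t], mus[u], sigmas[u]) for u in range(k)]
        alpha = (alpha @ np.asarray(Pi)) * emis
    return float(alpha.sum())
```

With a single state the recursion reduces to the product of Gaussian densities, which is a convenient sanity check.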

Concerning the copula model, we first consider only the bivariate case, so we define $y_t = (y_{t1}, y_{t2})$ as the vector with elements $y_{tj}$, $j = 1, 2$, corresponding to the observation for time-series $j$ at time $t = 1, 2, \ldots$, and $F_1$ and $F_2$ as the marginal cdfs of each time-series. Sklar's theorem (Sklar, 1959) allows us to separate the fitting of the marginal cdfs from the fitting of the joint distribution, represented by a copula function. This approach consists in estimating the two marginal distributions, obtaining $\hat{F}_1$ and $\hat{F}_2$, and then computing the normalized ranks of the pseudo-observations $\tilde{e}_t = (\tilde{e}_{t1}, \tilde{e}_{t2})$ as $\tilde{e}_{tj} = \operatorname{rank}(\hat{z}_{tj})/(T+1)$, with $\hat{z}_{tj} = \hat{F}_j(y_{tj})$ and $T$ being the number of observed time occasions. Finally, for the pseudo-observations $\tilde{e}_t$, an RS copula model is assumed, based on a hidden homogeneous Markov process denoted as $v_1, v_2, \ldots$, with $k$ states. The copula density, indicated with $c(\cdot\,;\cdot)$, may be chosen among the Clayton, the Gumbel, the Gaussian, or the Student-$t$ copulas, with state-specific parameter $\beta_v$. The density of the pseudo-observations is given by
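Since each $\hat{F}_j$ is monotone, ranking $\hat{z}_{tj} = \hat{F}_j(y_{tj})$ is the same as ranking $y_{tj}$ itself, so the pseudo-observations can be computed directly from the raw series (a sketch; the function name is ours):

```python
import numpy as np

def pseudo_observations(Y):
    # normalized ranks rank(z_tj) / (T + 1), computed column by column;
    # Y is a T x r matrix of observations, and every output lies in (0, 1)
    T = Y.shape[0]
    ranks = Y.argsort(axis=0).argsort(axis=0) + 1  # 1-based ranks per series
    return ranks / (T + 1)
```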

$$f(\tilde{e}_1, \tilde{e}_2, \dots) = \sum_{v_1} \pi_{v_1} c(\tilde{e}_1; \beta_{v_1}) \sum_{v_2} \pi_{v_2 \mid v_1} c(\tilde{e}_2; \beta_{v_2}) \cdots,$$

and it is based on initial and transition probabilities defined as above.

Given that the state sequence is not observable, a full maximum likelihood approach for estimating the parameters of both models is carried out through the EM algorithm. Following the current literature, model selection for the HM model is based on the BIC, while for the RS copula it is also performed through a goodness-of-fit procedure consisting in calculating a *p*-value referred to the Cramér-von Mises statistic for the hypothesis of correct model specification.

We compare the performance of HM models and RS copulas focusing on the crucial aspect of state allocation. The optimal state allocation is performed by finding the optimal joint sequence $\tilde{u}_1, \tilde{u}_2, \ldots$ (or $\tilde{v}_1, \tilde{v}_2, \ldots$) of unknown states given the corresponding observations. This clustering procedure, also known as global decoding, is achieved through the Viterbi algorithm (Viterbi, 1967), which is a dynamic programming algorithm.

We also aim at extending the RS copula approach to an arbitrary number $r$ of time-series rather than to only two. In this regard, we propose the composite likelihood approach (Varin *et al.*, 2011) for estimation, which is based on considering all possible ordered pairs of time-series among the available ones.

#### 3 Application

Cash, for the period 2017-2020. For the RS copulas, allowing only for bivariate associations, we define four copulas in which the bivariate vector of observations consists of the Bitcoin and each of the other four cryptocurrencies. Results for the HM model show that the minimum value of the BIC is reached with a five-state heteroskedastic structure. According to these estimates, there are three negative regimes (in terms of estimated expected log-returns), with relatively high and positive correlations of Bitcoin with all the other cryptocurrencies, and two states with positive returns and lower correlations. Regarding the global decoding, these two states are the most likely in the first year of observation, while the other three states characterize the last two years.

Concerning the RS copulas, and considering as an example the pair of cryptocurrencies Bitcoin-Ethereum, we observe that a three-regime Clayton copula provides the best fit. Given that the Clayton copula allows for explicit computation of the lower tail correlation index, we estimate that two regimes provide zero or low values for the lower tail index, whereas the third regime provides high values for it. According to the optimal state sequence, we estimate that there is substantial interchangeability between the first two states over the whole period, whereas the third state is the most likely in the last year of observation.

#### References

BARTOLUCCI, F., FARCOMENI, A., & PENNONI, F. 2013. *Latent Markov Models for Longitudinal Data*. Boca Raton, FL: Chapman & Hall/CRC.

DEMPSTER, A. P., LAIRD, N. M., & RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). *Journal of the Royal Statistical Society, Series B*, 39, 1–38.

RODRIGUEZ, J. C. 2007. Measuring financial contagion: A copula approach. *Journal of Empirical Finance*, 14, 401–423.

SKLAR, M. 1959. Fonctions de répartition à n dimensions et leurs marges. *Publications de l'Institut Statistique de l'Université de Paris*, 8, 229–231.

VARIN, C., REID, N., & FIRTH, D. 2011. An overview of composite likelihood methods. *Statistica Sinica*, 21, 5–42.

VITERBI, A. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. *IEEE Transactions on Information Theory*, 13, 260–269.

ZUCCHINI, W., MACDONALD, I. L., & LANGROCK, R. 2017. *Hidden Markov Models for Time Series: An Introduction Using R*. Boca Raton, FL: CRC.

As an illustration, for the HM model we consider the joint daily log-returns\* of the five cryptocurrencies Bitcoin, Ethereum, Ripple, Litecoin, and Bitcoin

\*provided by the Crypto Asset Lab: https://cryptoassetlab.diseade.unimib. it/.

Cash, for the period 2017-2020. For the RS copulas, allowing only for bivariate associations, we define four copulas where the bivariate vector of observations consists of the Bitcoin and each of the other four cryptocurrencies. Results for the HM model show that the minimum value of the BIC is reached considering a five-state heteroschedastic structure. According to these estimates, there are three negative regimes (in terms of estimated expected log-returns), with relatively high and positive correlations of Bitcoin with all the other cryptocurrencies, and two states with positive returns and lower correlations. Regarding the global decoding, these two states are the most likely in the first year of observation, and the other three states characterize the last two years.

Concerning the RS copulas, and considering as an example the couple of cryptocurrencies Bitcoin-Ethereum, we observe that a three-regime Clayton copula provides the best fit. Given that the Clayton copula allows for explicit computation of the lower tail correlation index, we estimate that two regimes provide zero or low values for the lower tail index, and the third regime provides high values for it. According to the optimal state sequence, we estimate that there is substantial interchangeability between the first two states in the whole period, whereas the third state is the most likely for the last year of observation.

#### References

marginal cdfs of each time-series. Sklar's theorem (Sklar, 1959) allows us to separate the fitting of the marginal cdfs from the fitting of the joint distribution, represented by a copula function. This approach consists in estimating the two

<sup>1</sup> and *F*ˆ

ranks of the pseudo-observations *e*˜*<sup>t</sup>* = (*e*˜*t*1, *e*˜*t*2) as *e*˜*t j* = rank(*z*ˆ*t j*)/(*T* +1), with *z*ˆ*t j* = *F*ˆ*j*(*yt j*), and *T* being the number of observed time occasions. Finally, for the pseudo-observations *e*˜*t*, an RS copula model is assumed based on a hidden homogeneous Markov process denoted as *v*1,*v*2,..., with *k* states. The copula density indicated with *c*(·;·) may be chosen among the Clayton, the Gumbel, the Gaussian, or the Student-*t* copulas, with state-specific parameter β*v*. The

<sup>π</sup>*v*<sup>1</sup> *<sup>c</sup>*(*e*˜1;β*v*<sup>1</sup> )∑*<sup>v</sup>*<sup>2</sup>

Given that the state sequence is not observable, a full maximum likelihood approach for estimating the parameters of both models is carried out through the EM algorithm. Following the current literature, model selection for the HM model is based on the BIC, and for the RS copula it is also performed through a goodness-of-fit procedure consisting in calculating a *p*-value referred to the Cramer-von Mises statistic for the hypothesis of correct model specification. ´ We compare the performance of HM models and RS copulas focusing on the crucial aspect of state allocation. The optimal state allocation is performed by finding the optimal joint sequence *u*˜1,*u*˜2,... (or *v*˜1, *v*˜2,...) of unknown states given the corresponding observations. This clustering procedure, also known as global decoding, is achieved through the Viterbi algorithm (Viterbi, 1967),

We also aim at extending the RS copula approach to an arbitrary number of time-series *r* rather than to only 2. In this regard, we propose the composite likelihood approach (Varin *et al.*, 2011) for estimation, which is based on considering all possible ordered pairs of time-series among the available ones.

As an illustration, for the HM model we consider the joint daily log-returns\* of the five cryptocurrencies Bitcoin, Ethereum, Ripple, Litecoin, and Bitcoin

\*provided by the Crypto Asset Lab: https://cryptoassetlab.diseade.unimib.

and it is based on the initial and transition probabilities defined as above.

2, and then computing the normalized

π*v*2|*v*<sup>1</sup> *c*(*e*˜2;β*v*<sup>2</sup> )··· ,

marginal distributions, obtaining *F*ˆ

density of the pseudo-observations is given by

*<sup>f</sup>*(*e*˜1,*e*˜2,...) = <sup>∑</sup>*<sup>v</sup>*<sup>1</sup>

which is a dynamic programming algorithm.

3 Application

it/.
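The global decoding step shared by both models can be sketched in a few lines. The following is a minimal, illustrative log-space Viterbi routine for a generic *k*-state chain (a hypothetical sketch, not the authors' implementation):

```python
import numpy as np

def viterbi(log_pi, log_P, log_dens):
    """Global decoding: most likely joint state sequence of a hidden Markov chain.

    log_pi:   (k,)   initial-state log-probabilities
    log_P:    (k, k) transition log-probabilities, row = state at time t-1
    log_dens: (T, k) log-density of each observation under each state
    """
    T, k = log_dens.shape
    delta = np.empty((T, k))            # best log-score of paths ending in each state
    psi = np.zeros((T, k), dtype=int)   # back-pointers to the best predecessor
    delta[0] = log_pi + log_dens[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_P   # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_dens[t]
    states = np.empty(T, dtype=int)     # backtrack the optimal joint sequence
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```

With a sticky transition matrix and observations that clearly favour one state at a time, the decoded sequence switches regime only rarely, which is the clustering behaviour exploited in the application.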


### BOOSTING MULTIDIMENSIONAL IRT MODELS


Michela Battauz<sup>1</sup> and Paolo Vidoni<sup>1</sup>

<sup>1</sup> Department of Economics and Statistics, University of Udine (e-mail: michela.battauz@uniud.it, paolo.vidoni@uniud.it)

ABSTRACT: Multidimensional IRT models can be used to analyze the latent variables that underlie the responses given to a test or questionnaire. However, these models are not only difficult to estimate, but they also suffer from the rotational indeterminacy typical of factor analysis models. In this paper, we propose a boosting algorithm that, starting from a model that includes only the intercepts, sequentially updates a pair of coefficients in a component-wise approach. The solution provided by the algorithm tends to be sparse and to facilitate the interpretation without requiring an a posteriori rotation.

KEYWORDS: negative curvature direction, regularization, sparse solution.

#### 1 Introduction

IRT models are commonly applied in educational assessment and they are also considered, with increasing frequency, in the field of health and psychological measurement studies. In these models, the probability of observing a categorical response is a function of a single latent trait (simple IRT models) or of multiple latent traits (multiple IRT models) and of some item parameters (see for example Reckase, 2009). Various methods have been proposed for model estimation. However, in the multidimensional setting, serious computational problems may occur if the number of items is large and many latent variables have to be considered. Moreover, in this context, the interpretability of the solution is very important.

In this paper, the new statistical boosting procedure introduced in Battauz & Vidoni (2021) is applied for estimating multiple IRT models. More precisely, we consider a suitable likelihood-based boosting algorithm which may escape from a region of local non-convexity of the objective function, improve the optimization procedure, provide a more interpretable sparse solution and regularize the estimates. We apply this new procedure to the multidimensional two-parameter logistic IRT model for dichotomously scored outcomes. An example concerning a sample from the 2017 Eurobarometer survey is presented.

#### 2 Multidimensional IRT models: definition and inference

The response variable for subject *i* on item *j* is a Bernoulli random variable *Yi j*, *i* = 1,...,*n*, *j* = 1,...,*J*, with one denoting a positive response. The responses of subject *i* are collected in the vector Y*i* = (*Yi*1,...,*YiJ*). Let θ*i* = (θ*i*1,...,θ*iD*), *i* = 1,...,*n*, be a latent random vector, composed of independent standard normal variables. Furthermore, it is assumed that (Y*i*,θ*i*) are independent across subjects and that observations *Yi j* are conditionally independent given θ*i*. With particular attention to the multidimensional two-parameter logistic (2PL) IRT model, the conditional probability of giving a positive response to a specific item is defined as

$$P\_{ij} = P(Y\_{ij} = 1 \mid \boldsymbol{\theta}\_i; \beta\_j, \alpha\_{1j}, \dots, \alpha\_{Dj}) = \frac{\exp(\beta\_j + \alpha\_{1j}\theta\_{i1} + \dots + \alpha\_{Dj}\theta\_{iD})}{1 + \exp(\beta\_j + \alpha\_{1j}\theta\_{i1} + \dots + \alpha\_{Dj}\theta\_{iD})},$$

where β*j* is the intercept and α*d j*, *d* = 1,...,*D*, are the slope parameters. The vector of unknown model parameters is γ = (α1,...,αD, β), with αd = (α*d*1,...,α*dJ*), *d* = 1,...,*D*, and β = (β1,...,β*J*); the vector γ has dimension *J* + *JD*, which, in some applications, can be very large.

Given the responses y, realization of Y = (Y1,...,Yn), the marginal likelihood for γ can be obtained by integrating out the unobserved θ values from the complete likelihood *L*(γ; y) = ∏ᵢ₌₁ⁿ *f*(y*i*|θ*i*; γ)φ(θ*i*), where *f*(y*i*|θ*i*; γ) is a Bernoulli-type probability function based on *Pi j* and φ(·) denotes the density of a multivariate standard normal distribution with independent components. Thus, the marginal log-likelihood does not have a closed-form expression, since the *D*-dimensional integral does not have an analytic solution and requires numerical approximation. The most common methods for estimating the item parameters are based on the EM algorithm, approximating the integrals using Gaussian or adaptive quadrature procedures, or on suitable MCMC algorithms for handling the high dimension of the integrals.
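As a small numerical companion to the model just defined, the conditional probabilities *Pi j* can be computed for all subjects and items at once. This is a generic sketch with assumed toy parameter values, not the estimation code used by the authors:

```python
import numpy as np

def prob_2pl(theta, beta, alpha):
    """Multidimensional 2PL: P(Y_ij = 1 | theta_i) for all subjects and items.

    theta: (n, D) latent traits, beta: (J,) intercepts, alpha: (D, J) slopes.
    """
    eta = beta[None, :] + theta @ alpha    # (n, J) linear predictors
    return 1.0 / (1.0 + np.exp(-eta))      # elementwise inverse logit
```

Here beta_j shifts the baseline response probability of item j, while alpha_dj controls how strongly trait d discriminates on that item; a zero slope, as in the sparse solutions discussed later, removes the trait from the item entirely.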

#### 3 The boosting algorithm


We consider the boosting algorithm introduced in Battauz & Vidoni (2021), with the negative log-likelihood as objective function. Starting from a model that includes only the intercept terms, only two parameters are updated at each iteration of the algorithm, hence following a component-wise approach. The starting point of the algorithm poses a very challenging issue, since the gradient is null there, making any gradient descent method unable to move away from it. A peculiar feature of the method is that it exploits any local non-convexity of the objective function: the gradient vector and the Hessian matrix are used to define two alternative directions, namely the classical Newton-type direction and a negative curvature direction given by the eigenvector associated with the most negative eigenvalue (if any) of a 2×2 submatrix of the Hessian matrix. More specifically, at step *k* of the boosting algorithm, the Newton-type direction for each pair of parameters indexed *b*, *c* = 1,...,*J*(*D* + 1), *b* < *c*, is given by:

$$\mathbf{s}\_{bc}^{(k)} = -\widehat{\mathbf{H}}\_{bc}^{(k-1)^{-1}} \widehat{\mathbf{g}}\_{bc}^{(k-1)},\tag{1}$$

while the negative curvature direction is:

$$\mathbf{d}\_{bc}^{(k)} = -\operatorname{sign}\left\{ \left( \widehat{\mathbf{g}}\_{bc}^{(k-1)} \right)^{\top} \widehat{\mathbf{u}}\_{bc}^{(k-1)} \right\} \widehat{\mathbf{u}}\_{bc}^{(k-1)},\tag{2}$$

where $\widehat{\mathbf{g}}\_{bc}^{(k-1)}$ and $\widehat{\mathbf{H}}\_{bc}^{(k-1)}$ are the gradient and the Hessian computed at step *k* − 1, and $\widehat{\mathbf{u}}\_{bc}^{(k-1)}$ is the eigenvector corresponding to the most negative eigenvalue of $\widehat{\mathbf{H}}\_{bc}^{(k-1)}$. The algorithm computes the variation of a quadratic approximation of the objective function for all the pairs of parameters in both directions, and selects the pair and direction leading to the largest decrease. The algorithm is a particular application of the optimization method proposed by Gould et al. (2000), who proved convergence to second-order critical points. Since the algorithm converges to the maximum likelihood estimates, a suitable stopping criterion is necessary to obtain regularized estimates.

#### 4 A real-data example

The proposal was applied to the responses of 1027 Italian citizens to some items of the 2017 Eurobarometer survey regarding the areas in which people think that decisions should be made at the European level. Table 1 reports the items and the estimated parameters. The number of iterations of the algorithm, as well as the number of latent variables, was selected by 5-fold cross-validation. The table also reports the maximum likelihood estimates (MLEs) obtained with the R package mirt using the quartimax rotation, which was chosen for the higher similarity of the solution. The MLEs tend to assume more extreme values, while the boosting procedure provides regularized estimates. Both methods identify a first dimension strongly related to all the items. The interpretation of the second dimension seems clearer with the boosting algorithm, since it reveals a positive correlation between the areas of terrorism, immigration, and democracy and peace (which present the highest estimated discrimination parameters). However, the areas of energy supply, environment, and investment and job creation are also related to this dimension.

Table 1. *Items of the Eurobarometer survey included in the analysis and parameter estimates. Items are from question QC7: areas where more decision-making should take place at a European level.*

| Item | boosting β*j* | boosting α1*j* | boosting α2*j* | MLE β*j* | MLE α1*j* | MLE α2*j* |
|---|---:|---:|---:|---:|---:|---:|
| 1 Fighting terrorism | 3.07 | 4.02 | 1.61 | 6.10 | -8.80 | 2.83 |
| 2 Dealing with health and social security issues | 1.02 | 3.37 | 0.00 | 1.10 | -3.55 | -0.68 |
| 3 Promoting equal treatment of men and women | 1.41 | 3.34 | 0.00 | 1.45 | -3.37 | -0.54 |
| 4 Promoting democracy and peace | 1.95 | 2.99 | 0.92 | 2.08 | -3.37 | 0.21 |
| 5 Securing energy supply | 1.78 | 3.27 | 0.44 | 1.87 | -3.49 | -0.06 |
| 6 Dealing with migration issues from outside the EU | 2.36 | 3.32 | 1.06 | 2.49 | -3.73 | 0.33 |
| 7 Protecting the environment | 2.41 | 4.74 | 0.58 | 2.46 | -4.87 | -0.38 |
| 8 Stimulating investment and job creation | 1.80 | 4.41 | 0.74 | 2.02 | -5.06 | -0.24 |

#### References

BATTAUZ, M., & VIDONI, P. 2021. A new likelihood-based boosting algorithm for factor analysis models with binary data. *Submitted*.

GOULD, N. I. M., LUCIDI, S., ROMA, M., & TOINT, PH. L. 2000. Exploiting negative curvature directions in linesearch methods for unconstrained optimization. *Optimization Methods and Software*, 14(1-2), 75–98.

RECKASE, M. D. 2009. *Multidimensional Item Response Theory Models*. New York, NY: Springer Verlag.
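For a single 2×2 block of the Hessian, the two candidate directions of equations (1) and (2) can be sketched as follows (an illustrative reimplementation under the stated definitions, not the authors' code):

```python
import numpy as np

def directions_2x2(g, H):
    """Candidate steps for one pair of parameters (b, c).

    g: (2,) gradient block, H: (2, 2) Hessian block.
    Returns the Newton-type direction, eq. (1), and the negative-curvature
    direction, eq. (2), which is None when H has no negative eigenvalue.
    """
    s = -np.linalg.solve(H, g)        # Newton-type direction, eq. (1)
    lam, U = np.linalg.eigh(H)        # eigenvalues in ascending order
    d = None
    if lam[0] < 0:                    # a negative eigenvalue exists
        u = U[:, 0]                   # eigenvector of the most negative eigenvalue
        d = -np.sign(g @ u) * u       # eq. (2): orient it as a descent direction
    return s, d
```

At each boosting step, the decrease of a quadratic model of the objective is compared across all pairs and both candidate directions, and only the winning pair of parameters is updated.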


### UNDERSTANDING AND ESTIMATING CONDITIONAL PARAMETRIC QUANTILE MODELS


Matteo Bottai<sup>1</sup>

<sup>1</sup> Karolinska Institute, (e-mail: matteo.bottai@ki.se)

ABSTRACT: The talk gives an overview of conditional parametric quantile models. It outlines their features and potential, with a focus on their interpretation, modeling possibilities, and real-data examples. It is intended for a broad audience, including methodological and applied statisticians, data analysts, practitioners, and anyone who may be interested in knowing more about these models.

KEYWORDS: integrated loss function, quantile regression, quantile regression coefficients models
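As a minimal numerical illustration of the quantile-regression building block behind these models (a generic sketch with made-up data, not material from the talk): the τ-th quantile is the constant that minimizes the average pinball (check) loss, the integrand of the integrated loss function.

```python
import numpy as np

def pinball(u, tau):
    """Pinball (check) loss; its expectation is minimized at the tau-quantile."""
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)

# the empirical median (tau = 0.5) minimizes the average pinball loss
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grid = np.linspace(0.0, 6.0, 601)
losses = [pinball(y - c, 0.5).mean() for c in grid]
best = grid[int(np.argmin(losses))]   # the sample median, 3.0
```

Replacing the constant with a parametric function of covariates and of τ yields conditional parametric quantile models of the kind surveyed in the talk.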

## SHAPLEY LORENZ METHODS FOR EXPLAINABLE ARTIFICIAL INTELLIGENCE

Niklas Bussmann<sup>1</sup>, Roman Enzmann<sup>2</sup>, Paolo Giudici<sup>1</sup> and Emanuela Raffinetti<sup>1</sup>

<sup>1</sup> Department of Economics and Management, University of Pavia (Italy), (e-mail: niklas.bussmann01@universitadipavia.it, paolo.giudici@unipv.it, emanuela.raffinetti@unipv.it)

<sup>2</sup> University of Bonn (Germany), (e-mail: ryenzmann@hotmail.com)

ABSTRACT: A trustworthy application of Artificial Intelligence (AI) requires measuring its possible risks in advance. When applied to regulated industries, such as banking, finance and insurance, Artificial Intelligence methods lack explainability and, therefore, authorities aimed at monitoring risks may not validate them. To solve this issue, eXplainable Artificial Intelligence (XAI) methods have to be developed.

In this paper, we introduce an alternative XAI method, based on Lorenz Zonoids, that is statistically normalised and therefore more suitable to the risk management context. The application, focused on data involving more than 15,000 small and medium companies asking for credit, allows us to further stress the benefits deriving from our proposal.

KEYWORDS: Artificial Intelligence, Lorenz Zonoids, risk management.

### 1 Introduction


The key requirement for trustworthy Artificial Intelligence (AI) methods is their ability to measure the risks deriving from their use. When applied to regulated fields, such as finance and health, AI methods need to be validated by national regulators. It is worth noting that AI methods typically rely on the implementation of complex machine learning models which provide high predictive accuracy at the expense of explainability. This represents a problem for the regulated industries, where comprehensible results have to be made available in order to detect risks, especially in terms of the factors which can cause them. To avoid wrong actions being taken as a consequence of "automatic" choices, AI methods need to explain the reasons for their classifications and predictions.

In this paper, we propose a new explainable Artificial Intelligence method, based on the combination of the Shapley value approach (see, e.g., Shapley, 1953) and the Lorenz Zonoid tool described in Giudici and Raffinetti (2020). Shapley values belong to the class of local explanation methods, as they aim to interpret individual predictions in terms of which variables most affect them. Lorenz Zonoids are instead a global explanation method, as they aim to interpret all model predictions as a whole, in terms of which variables most determine them, for all observations.
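The global ingredient of the proposal, the Lorenz Zonoid of a vector of predictions, reduces in the univariate case to a rank-covariance formula. The following sketch uses made-up predictions; the Cov-based expression follows Giudici and Raffinetti (2020):

```python
import numpy as np

def lorenz_zonoid(p):
    """Univariate Lorenz Zonoid of predictions p: LZ = 2 Cov(p, r(p)) / sum(p),
    where r(p) are the rank scores 1..n (ties are ignored in this sketch)."""
    p = np.asarray(p, dtype=float)
    r = np.argsort(np.argsort(p)) + 1                 # rank scores 1..n
    cov = ((p - p.mean()) * (r - r.mean())).mean()    # population covariance
    return 2.0 * cov / p.sum()
```

For a constant prediction vector the Zonoid is zero (no mutual variability), while for positive predictions it coincides with the Gini coefficient; measuring variability through ranks is what makes the tool more robust to outlying observations than variance-based measures.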


We apply our methodology to a challenging problem: the prediction of a binary variable, representing credit default, through a large set of balance sheet variables.

The next section describes the methodology, while Section 3 illustrates the empirical findings obtained by applying our proposal to financial data.

#### 2 Methodology

Following Giudici and Raffinetti (2021), we consider, for financial risk management purposes, a global explainable AI method, named Shapley-Lorenz decomposition, which combines the interpretability power of the local Shapley value game-theoretic approach (see, e.g., Shapley, 1953) with a more robust global approach based on the Lorenz Zonoid model accuracy tool (see, e.g., Giudici and Raffinetti, 2020).

The Lorenz Zonoids, originally introduced by Koshevoy and Mosler (1996), were further developed by Giudici and Raffinetti (2020) as a generalisation of the ROC curve to a multidimensional setting; the Shapley-Lorenz decomposition therefore has the advantage of combining predictive accuracy and explainability into a single diagnostic. Furthermore, the Lorenz Zonoid is based on a measure of mutual variability that is more robust to the presence of outlying (anomalous) observations than the standard variability around the mean.

The Shapley-Lorenz decomposition combines the Shapley value formula with the Lorenz Zonoid tool. Formally, given $K$ explanatory variables, the contribution of the additional variable $X_k$, expressed as its differential contribution to the global predictive accuracy, equals

$$LZ^{X_k}(\hat{\pi}) = \sum_{X' \subseteq \mathcal{C}(X) \setminus X_k} \frac{|X'|!\,(K - |X'| - 1)!}{K!}\, \left[ LZ(\hat{\pi}_{X' \cup X_k}) - LZ(\hat{\pi}_{X'}) \right], \quad (1)$$

where $\hat{\pi}$ is the estimated probability of default; the term $[LZ(\hat{\pi}_{X' \cup X_k}) - LZ(\hat{\pi}_{X'})]$ measures the marginal contribution provided by the inclusion of variable $X_k$; $K$ is the number of available predictors; $\mathcal{C}(X) \setminus X_k$ is the set of all the possible model configurations which can be obtained with $K-1$ variables, excluding variable $X_k$; $|X'|$ denotes the number of variables included in each possible model.

Note that the Lorenz Zonoids $LZ(\hat{\pi}_{X' \cup X_k})$ and $LZ(\hat{\pi}_{X'})$ in equation (1) can be computed by resorting to the covariance operator, i.e.,

$$\begin{aligned} LZ(\hat{\pi}_{X' \cup X_k}) &= \frac{2}{\sum_{i=1}^n \hat{\pi}_{iX' \cup X_k}}\, Cov\big(\hat{\pi}_{X' \cup X_k},\, r(\hat{\pi}_{X' \cup X_k})\big) \quad \text{and} \\ LZ(\hat{\pi}_{X'}) &= \frac{2}{\sum_{i=1}^n \hat{\pi}_{iX'}}\, Cov\big(\hat{\pi}_{X'},\, r(\hat{\pi}_{X'})\big), \end{aligned}$$

where *r*(·) denotes the rank score.
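For concreteness, equation (1) and the covariance form of the Lorenz Zonoid can be sketched numerically as follows. This is an illustrative implementation, not the authors' code: `lorenz_zonoid`, `shapley_lorenz` and `toy_fit_predict` are names we introduce here, and the synthetic data stand in for a model refitted on each variable subset.

```python
# Minimal sketch of equation (1); all names are illustrative, not from the paper.
from itertools import combinations
from math import factorial

import numpy as np


def lorenz_zonoid(p):
    """LZ(p) = 2 * Cov(p, r(p)) / sum(p), with rank scores r(p) in 1..n."""
    p = np.asarray(p, dtype=float)
    ranks = np.argsort(np.argsort(p)) + 1          # rank scores r(p)
    cov = np.mean((p - p.mean()) * (ranks - ranks.mean()))
    return 2.0 * cov / p.sum()


def shapley_lorenz(fit_predict, X, k):
    """Shapley-Lorenz value of variable k: weighted marginal LZ gains over
    all subsets X' of the remaining K-1 variables (LZ of the null model is 0)."""
    K = X.shape[1]
    others = [j for j in range(K) if j != k]
    value = 0.0
    for size in range(K):
        for subset in combinations(others, size):
            w = factorial(size) * factorial(K - size - 1) / factorial(K)
            lz_with = lorenz_zonoid(fit_predict(X, list(subset) + [k]))
            lz_without = lorenz_zonoid(fit_predict(X, list(subset))) if subset else 0.0
            value += w * (lz_with - lz_without)
    return value


# toy usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.random((60, 2))

def toy_fit_predict(X, cols):
    # stand-in for refitting a model on the given columns
    return 1.0 / (1.0 + np.exp(-X[:, cols].sum(axis=1)))

phi = [shapley_lorenz(toy_fit_predict, X_demo, k) for k in range(2)]
```

By the efficiency property of Shapley values, the $K$ Shapley-Lorenz values sum to the Lorenz Zonoid of the full model.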

The Shapley-Lorenz decomposition is thus an agnostic eXplainable Artificial Intelligence method, which can be applied to the predictive output regardless of which model and data generated it.

#### 3 Application


We apply our proposed method to data supplied by a European External Credit Assessment Institution (ECAI) specialised in credit scoring for P2P platforms focused on SME commercial lending. In summary, the analysis relies on a dataset composed of official financial information, extracted from the balance sheets of 15,045 SMEs, mostly based in Southern Europe, for the year 2015. The information about the status (0 = active, 1 = defaulted) of each company one year later (2016) is also provided. The observed proportion of defaulted companies is equal to 10.9%. To carry out our analysis, we apply a logistic regression model after splitting the data into a training set (80%) and a test set (20%). We then calculate, on the same split, the contribution of each of the nineteen explanatory variables to the estimate of the probability of default, using two explainable AI methods: the Shapley value approach and the Shapley-Lorenz approach that we propose. Table 1 displays the result of the comparison. Table 1 shows that the variable which contributes most to the prediction of default, according to the sum of the Shapley values, is Variable 8, (Profit or Loss before tax + Interest paid)/Total assets, followed at a considerable distance by Variables 13 and 14 (both related to EBITDA) and by Variable 3 (Total assets/Total Liabilities). In terms of *G*<sup>2</sup> (deviance), instead, the differences between Variable 8 (the highest contributor) and Variables 14, 15 and 3 are smaller, and the role that Variable 13 plays in terms of Shapley value is taken by Variable 15. The first column of Table 1, giving the Shapley-Lorenz values, indicates instead that Variable 8, with a value of 0.16, and Variable 3, with a value of 0.11, are an order of magnitude higher than the others. This indicates a clearer-cut choice, with only two variables being selected: a measure of leverage and a measure of profitability. In the latter case, only the most contributing one, among the several that measure profitability, is chosen.
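The pipeline just described can be sketched on synthetic stand-in data (the ECAI dataset is not public); the logistic fit below uses plain gradient ascent in place of a library fitting routine, and the coefficient pattern mimicking Variables 8 and 3 is purely illustrative.

```python
# Illustrative 80/20 split and logistic fit on synthetic stand-in data;
# the data-generating coefficients are assumptions, not the ECAI data.
import numpy as np

rng = np.random.default_rng(0)
n, K = 2000, 19
X = rng.normal(size=(n, K))
true_beta = np.zeros(K)
true_beta[7], true_beta[2] = 2.0, 1.0          # pretend variables 8 and 3 drive default
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_beta - 2.2)))).astype(float)

split = int(0.8 * n)                            # 80% training, 20% test
Xtr, ytr, Xte = X[:split], y[:split], X[split:]

beta, b0 = np.zeros(K), 0.0
for _ in range(500):                            # gradient ascent on the log-likelihood
    p = 1 / (1 + np.exp(-(Xtr @ beta + b0)))
    beta += 0.1 * Xtr.T @ (ytr - p) / split
    b0 += 0.1 * float(np.mean(ytr - p))

p_test = 1 / (1 + np.exp(-(Xte @ beta + b0)))   # estimated default probabilities
```

The estimated probabilities `p_test` are what the Shapley and Shapley-Lorenz contributions would then be computed from, subset by subset.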

Table 1: Marginal contribution of each explanatory variable in terms of: Shapley-Lorenz Zonoids, *G*<sup>2</sup> and total Shapley values

| Variable | Shapley-Lorenz | *G*<sup>2</sup> | Shapley |
|---|---|---|---|
| Total assets/Equity | 0.00 | 0.16 | 2.53 |
| (Long term debt + Loans)/Shareholders Funds | 0.00 | 0.54 | -202.80 |
| Total assets/Total Liabilities | 0.11 | 1088.12 | -1273.97 |
| Current assets/Current Liabilities | 0.05 | 553.68 | -641.69 |
| (Current assets - Current assets: stocks)/Current Liabilities | 0.00 | 479.06 | -93.51 |
| (Shareholders Funds + Non current liabilities)/Fixed assets | 0.00 | 13.16 | 4180.56 |
| EBIT/interest paid | -0.01 | 411.10 | 1504.44 |
| (Profit or Loss before tax + Interest paid)/Total assets | 0.16 | 1633.51 | -13115.53 |
| Return on Equity | 0.05 | 826.96 | -1993.98 |
| Operating revenues/Total assets | 0.06 | 17.36 | -289.46 |
| Sales/Total assets | -0.02 | 10.96 | 252.59 |
| Interest paid/(Profit before taxes + Interest paid) | 0.01 | 103.26 | 379.73 |
| EBITDA/interest paid | 0.02 | 418.00 | -1697.31 |
| EBITDA/Operating revenues | 0.03 | 1254.63 | -1419.43 |
| EBITDA/Sales | 0.02 | 1122.05 | -785.95 |
| Trade Payables/Operating revenues | 0.00 | 14.73 | -193.60 |
| Trade Receivables/Operating revenues | 0.05 | 475.40 | -585.58 |
| Inventories/Operating revenues | 0.01 | 126.78 | 1190.47 |
| Turnover | 0.02 | 85.26 | 1072.37 |

#### References

GIUDICI, P., & RAFFINETTI, E. 2020. Lorenz Model Selection. *Journal of Classification*, 37, 754-768.

GIUDICI, P., & RAFFINETTI, E. 2021. Shapley-Lorenz eXplainable Artificial Intelligence. *Expert Systems With Applications*, 167.

KOSHEVOY, G., & MOSLER, K. 1996. The Lorenz Zonoid of a Multivariate Distribution. *Journal of the American Statistical Association*, 91, 873-882.

SHAPLEY, L.S. 1953. A value for *n*-person games. *Contributions to the Theory of Games*, 307-317.

### ROBUST CLASSIFICATION OF SPECTROSCOPIC DATA IN AGRI-FOOD: FIRST ANALYSIS ON THE STABILITY OF RESULTS

Andrea Cappozzo1, Ludovic Duponchel2, Francesca Greselin3 and Brendan Murphy4

<sup>1</sup> Department of Mathematics, Politecnico di Milano, (andrea.cappozzo@polimi.it)

<sup>2</sup> LASIR Lab, University of Lille, (ludovic.duponchel@univ-lille.fr)

<sup>3</sup> Department of Statistics and Quantitative Methods, University of Milano Bicocca, (francesca.greselin@unimib.it)

<sup>4</sup> School of Mathematics and Statistics, University College Dublin, (brendan.murphy@ucd.ie)

ABSTRACT: We investigate here the stability of the results of a variable selection method recently introduced in the literature and embedded in a model-based classification framework. It is applied to chemometric data, with the purpose of selecting a few wavenumbers (of the order of tens) among the thousands of measured ones, to build a (robust) decision rule for classification. The robust nature of the method safeguards it from potential label noise and outliers, which are particularly dangerous in the field of food-authenticity studies. As a by-product of the learning process, samples are grouped into similar classes, and anomalous samples are also singled out. Our first results show that there is some variability around a common pattern in the obtained selection.

KEYWORDS: Variable selection, Robust classification, Label noise, Outlier detection, Near infrared spectroscopy, Mid infrared spectroscopy, Agri-food.

#### 1 Introduction

Nowadays, many challenging classification problems, arising from scientific domains such as chemometrics, computer vision, engineering, and genetics, among others, have to deal with hundreds or thousands of variables on each sample. Many contributions in the literature show that inferential methods benefit greatly from the identification of a subset of relevant variables. Dimension reduction techniques, like Principal Component Analysis (PCA), projection to latent structures (PLS-DA), single class modeling (SIMCA) and kernel methods (SVM), are generally adopted to this aim. In some fields of application, like food authentication, mislabeled and adulterated spectra may appear in both the calibration and validation sets. This contamination produces dramatic effects on the model estimation and, consequently, on its prediction accuracy. To overcome this issue, a recent proposal in the literature introduces a variable selection step within the Robust Eigenvalue Decomposition Discriminant Analysis framework (Cappozzo *et al.*, 2019). Under the realistic assumption that only a portion of the spectral region is relevant for class discrimination, the procedure i) robustly identifies a subset of wavenumbers on which to build the decision rule, ii) protects it from potential label noise and outliers, and iii) simultaneously identifies anomalous samples.


We recall here the main idea on which the stepwise algorithm is based, referring the interested reader to Cappozzo *et al.* (2021) for a more detailed presentation. The detection of *p* relevant features (out of the whole collection of *P* ≫ *p* available variables) on which to train the classifier has many advantages. Firstly, parameter estimation and interpretation are enhanced; secondly, the loss of predictive power due to the inclusion of irrelevant and redundant information is avoided. Finally, a cost reduction in future data collection and processing is obtained.

In model-based discriminant analysis, the features that directly depend on the class membership itself are called *relevant* variables. Conversely, *irrelevant* or noisy variables do not contain any discriminating power: their distribution is completely independent of the group structure. Lastly, *redundant* variables essentially contain discriminant information that is already provided by the relevant ones: their distribution is conditionally independent of the grouping variable, given the relevant ones.

The algorithm starts from the empty set and, at each iteration, the inclusion of a *relevant* variable into the model is evaluated, based on its robustly assessed discriminating power. In a similar fashion, the removal of an existing variable from the model is also considered. The procedure iterates between variable addition and removal until two consecutive steps have been rejected.
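The inclusion/removal loop can be sketched generically as below; the `score` callable is a placeholder for the robustly assessed discriminating power (e.g. a BIC-type criterion), not the authors' actual implementation.

```python
# Schematic greedy forward/backward selection, stopping after two
# consecutive rejected steps; `score` is an assumed, user-supplied criterion.
def stepwise_select(n_vars, score):
    selected, best, rejections = [], score([]), 0
    while rejections < 2:
        # inclusion step: try adding the best candidate variable
        adds = [(score(selected + [j]), j) for j in range(n_vars) if j not in selected]
        if adds and max(adds)[0] > best:
            best, j = max(adds)
            selected.append(j)
            rejections = 0
        else:
            rejections += 1
        # removal step: try dropping one already selected variable
        drops = [(score([v for v in selected if v != j]), j) for j in selected]
        if drops and max(drops)[0] > best:
            best, j = max(drops)
            selected.remove(j)
            rejections = 0
        else:
            rejections += 1
    return selected


# toy criterion: additive gains per variable with a complexity penalty
gain = [1.0, 0.1, 0.9, 0.05]
crit = lambda S: sum(gain[j] for j in S) - 0.5 * len(S)
```

With the toy criterion, the loop adds variables 0 and 2 and then stops once neither adding nor dropping improves the score.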

#### 2 Stability study

In this section, we present the results of a bootstrap-based analysis, carried out on datasets produced by non-parametric re-sampling of the actual data. The aim is to investigate the stability of the variable selection procedure.

The data we analyze come from the chemometric challenge organized during the "Chimiométrie 2005" conference (Fernández Pierna & Dardenne, 2007). The learning scenario encompasses *N* = 215 training and *M* = 43 test MIR spectra of starches of *G* = 4 different classes. For each sample, a total of *P* = 2901 absorbance measurements are recorded. A subset of training observations is displayed in Fig. 1. The aim of the competition was to discriminate the four different groups, defining a classification rule from the training set. In addition, outlier detection was advisable: four intentionally corrupted spectra were manually placed in the test set, as described in Fernández Pierna & Dardenne (2007).

Figure 1. *Starches dataset: mid-infrared spectra of four starch classes.*

For the first experiment, 100 bootstrap datasets of the same size as the actual dataset were generated by sampling with replacement from the training set. For each bootstrapped sample, all models were fitted, the best-fitting model was chosen using the BIC criterion, and the selected wavelengths were recorded. The chosen wavelengths show which parts of the spectrum are important when classifying samples into the different starch types. Results are shown in Fig. 2 through a raster plot, and a pattern in the selected variables arises. As we expect, there is some variability, due to the fact that the role of "relevant" and "irrelevant" variables is judged in terms of the set of already selected features. The wavenumbers 997 *cm*<sup>−1</sup> and 995 *cm*<sup>−1</sup> correspond to spectral distributions of *amylose* and *amylopectin*, which are known to be present in different ratios across the starch classes. They have been selected with higher frequency, respectively 17 and 21 times in 67 runs.
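The bootstrap experiment follows a simple recipe that can be sketched as follows; the variance-based `select` stand-in is purely illustrative and replaces the robust stepwise selection actually used.

```python
# Sketch of the bootstrap stability experiment: resample rows with
# replacement, rerun a (placeholder) selection routine, tally retentions.
from collections import Counter

import numpy as np


def selection_frequency(X, select, n_boot=100, seed=1):
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap row indices
        counts.update(select(X[idx]))               # variables chosen on this resample
    return counts


# toy stand-in selector: keep the two highest-variance columns
X_demo = np.random.default_rng(0).normal(size=(215, 10))
X_demo[:, 3] *= 4.0
X_demo[:, 7] *= 3.0
freq = selection_frequency(X_demo, lambda Z: np.argsort(Z.var(axis=0))[-2:].tolist())
```

Plotting `freq` per variable gives exactly the kind of raster/frequency summary shown in Fig. 2.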

#### 3 Conclusions and further research

We developed a first stability analysis for a recent method for robust variable selection and classification, applied to spectrometric data. In a bootstrap simulation study on the learning set, although there was variability in the structure of the selected models, a stable pattern arises in the results. Further research is still needed to cast more light on this topic. For instance, to investigate the sensitivity of the derived decision model, its accuracy on the test set is worth analyzing, to establish the level of reliability of the resulting classification. This would mitigate the use of only a few real-data examples and hence allow a more general discussion of the results.

Figure 2. *Results of the stability analysis: for each of the 67 bootstrap samples, the selected wavenumbers are indicated in a raster plot.*

#### References


CAPPOZZO, A., GRESELIN, F., & MURPHY, T. B. 2019. A robust approach to model-based classification based on trimming and constraints. *Advances in Data Analysis and Classification*, 1–28.

CAPPOZZO, A., DUPONCHEL, L., GRESELIN, F., & MURPHY, T. B. 2021. Robust variable selection in the framework of classification with label noise and outliers: Applications to spectroscopic data in agri-food. *Analytica Chimica Acta*, 1153, 338245.

FERNÁNDEZ PIERNA, J. A., & DARDENNE, P. 2007. Chemometric contest at "Chimiométrie 2005": A discrimination study. *Chemometrics and Intelligent Laboratory Systems*, 86(2), 219–223.

### ISSUES IN MONITORING THE EU TRADE OF CRITICAL COVID-19 COMMODITIES

Andrea Cerasa1, Enrico Checchi1, Domenico Perrotta1 and Francesca Torti1

<sup>1</sup> European Commission, Joint Research Centre, (e-mail: andrea.cerasa@ec.europa.eu, enrico.checchi@ec.europa.eu, domenico.perrotta@ec.europa.eu, francesca.torti@ec.europa.eu)

ABSTRACT: The unexpected and constant increase in demand for the commodities needed to manage the COVID-19 pandemic impacted supply chains worldwide. Many countries, fearing shortages of those commodities, applied restrictions to the export of their national production.

The European Union, since the early stages of the pandemic, has monitored the procurement of these commodities by the EU Member States, to identify supply gaps, strong dependencies on extra-EU countries, as well as potential cases of fraud. Products like personal protective equipment, medicines, diagnostic kits, medical devices and (more recently) vaccines were scrutinized by an inter-service task force.

We illustrate some of the statistical issues encountered in analyzing these data from various perspectives, in particular the evolution over time of the traded prices and quantities of the most critical commodities. Robust statistical methods are used to identify and rank spikes, level shifts and trends in hundreds of time series of Customs declarations.

KEYWORDS: time series, international trade, COVID
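As a generic illustration of the kind of robust screening mentioned in the abstract, a median/MAD rule can flag spikes in a monthly series; the data and threshold below are placeholders, and the task force's actual methods are not shown here.

```python
# Illustrative robust spike detection via a median/MAD z-score;
# the series and the 3.5 cutoff are assumptions, not Customs data.
import numpy as np


def flag_spikes(series, threshold=3.5):
    """Return indices whose robust z-score exceeds the threshold."""
    x = np.asarray(series, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))          # median absolute deviation
    z = 0.6745 * (x - med) / mad              # rescaled to a normal-like scale
    return np.where(np.abs(z) > threshold)[0]


# toy monthly unit-price series with one spike (month index 6)
spikes = flag_spikes([10, 11, 10, 12, 11, 10, 95, 11, 12, 10, 11, 10])
```

Unlike mean/standard-deviation rules, the median/MAD score is not inflated by the spike itself, which is why robust variants are preferred for screening many series automatically.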

### SMOOTHED NON LINEAR PCA FOR MULTIVARIATE DATA

Indeed when looking for a dependence of a variable from another variable, we are not usually restricted to linear relationships, as estimated through linear regression, but we culd also use non parametric techniques, at least in an exploratory step of the analysis and with poor knowledge about the theoretical model which generated observed data. In this context usually some smooth functionis used which minimizes a compromise between fitting to data and

If we are given a set of *k* variables, without a clear assignment of the roles of *dependent* and *explicative* variable, in some situation we would like to study multiple mutual interdipendence between variables, without the constraint of

Something similar is made in functional principal component analysis,

In this paper we propose an exploratory tool, called smoothed PCA, which seek for function f(*t*) in a *k* dimensional space, close to observed points but

In our approach we searchs for a parametric curve f(*t*) in a *k* dimensional space, close to *n* observed points with some penalization or constrain P(·) for

A first component is found solving a least squares penalized problem:

As usual λ is a smoothing parametr which controls the amount of smooth-

After a first component f1(*t*) is found, a second component is similarly found, imposing a costrain of lack of correlation with the first component. Further components could be found in a similar way, even if this aspect is not

The choice of λ could be made by means of cross validation techniques.

The first problem we dealt with was the choice of the function f(*t*), which of course cannot be totally free. A natural choice was to seek for some family of parametric cubic splines. A possible choice, that we explored first, is to use


when seeking for reduction of dimensionality with functional data.

sufficiently rough. An extension is given to *k* components.

min f(*t*)

3 Explicit form of the approximant function.

smoothness.

linearity.

2 Aim of the method

curvature or length.

uniquely solved till now.

ing.

### SMOOTHED NON LINEAR PCA FOR MULTIVARIATE DATA

Marcello Chiodi<sup>1</sup>

<sup>1</sup> Department of Economics, Business and Statistics, University of Palermo, (e-mail: marcello.chiodi@unipa.it)

ABSTRACT: Principal Component Analysis is one of the most widely known and used tools of linear exploratory analysis, in its simplest form applied to *n* observations on *k* numerical variables. Besides the so-called reduction of dimensionality, achieved by taking the first components as new reduced coordinates, the first principal axis can be interpreted as the principal regression line, that is, the straight line which minimizes the sum of the orthogonal distances of the *n* points from the line in a *k*-dimensional space. Of course this interpretation of a principal line relies on the assumption of linearity of the relationships between variables, even if they are not jointly normal. In this paper we propose an approach which searches for a parametric curve f(*t*) in a *k*-dimensional space, with some constraints on curvature or length. An extension to *k* components is given.

KEYWORDS: Smoothing splines, Principal regression, Multivariate curvature

#### 1 Introduction

In a classical exploratory phase of the analysis of a set X of data with *n* observations on *k* numerical variables, Principal Component Analysis (PCA) is often used to obtain a reduction of dimensionality, achieved by taking the first components as new reduced coordinates, but more generally to gain insight into the multiple correlation structure among the variables.

In this view, the first principal axis can be interpreted as the principal regression line, that is, the straight line which minimizes the sum of the orthogonal distances of the *n* points from the line in a *k*-dimensional space. Of course this interpretation of a principal line relies on the assumption of linearity of the relationships between variables, even if they are not jointly normal.

However, if the real interdependence structure between variables is not linear, the components may not be meaningful. Furthermore, the distribution of the optimal distances from the first component may be quite irregular.

Similar considerations can be made for the components after the first, and this can be highlighted by appropriate residual analyses.

Indeed, when looking for the dependence of one variable on another, we are not usually restricted to linear relationships, as estimated through linear regression: we could also use nonparametric techniques, at least in an exploratory step of the analysis and with poor knowledge of the theoretical model that generated the observed data. In this context, some smooth function is usually used which minimizes a compromise between fit to the data and smoothness.

If we are given a set of *k* variables, without a clear assignment of the roles of *dependent* and *explicative* variable, in some situations we would like to study the multiple mutual interdependence between the variables, without the constraint of linearity.

Something similar is done in functional principal component analysis, when seeking a reduction of dimensionality with functional data.

In this paper we propose an exploratory tool, called smoothed PCA, which seeks a function f(*t*) in a *k*-dimensional space, close to the observed points but sufficiently smooth. An extension to *k* components is given.

#### 2 Aim of the method

In our approach we search for a parametric curve f(*t*) in a *k*-dimensional space, close to the *n* observed points, with some penalization or constraint P(·) on curvature or length.

A first component is found by solving a penalized least squares problem:

$$\min_{\mathbf{f}(t)} ||\mathbf{X} - \mathbf{f}(t)|| + \lambda \, \mathbf{P}(\mathbf{f}(t)) \tag{1}$$

As usual, λ is a smoothing parameter which controls the amount of smoothing.

After a first component f<sub>1</sub>(*t*) is found, a second component is found similarly, imposing a constraint of lack of correlation with the first component. Further components can be found in the same way, even if this aspect has not been uniquely settled so far.

The choice of λ can be made by means of cross-validation techniques.
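As a purely illustrative sketch of objective (1) (the names and the discretization are ours, not the smoothPCA implementation), the candidate curve can be discretized on a grid and both terms evaluated directly: the distance term via nearest-point projections, the curvature penalty approximated by squared second differences:

```python
import numpy as np

def smoothed_pca_objective(X, curve, lam):
    """Approximate ||X - f(t)|| + lam * P(f) for a curve discretized
    as a (g, k) array of points along the curve."""
    # distance term: each observation is projected onto its nearest curve point
    d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)  # (n, g)
    fit = np.sqrt(d2.min(axis=1)).sum()
    # roughness term: squared second differences approximate the curvature penalty
    pen = (np.diff(curve, n=2, axis=0) ** 2).sum()
    return fit + lam * pen

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
X = np.c_[t, t**2] + rng.normal(scale=0.05, size=(50, 2))  # noisy parabola
line = np.c_[t, t]         # straight candidate curve (zero penalty, poor fit)
curved = np.c_[t, t**2]    # curved candidate close to the data
# for a small lam the curved candidate attains the lower objective
print(smoothed_pca_objective(X, curved, lam=0.1) <
      smoothed_pca_objective(X, line, lam=0.1))  # True
```

For small λ the comparison is dominated by the fit term; as λ grows, the penalty increasingly favors the straight candidate, which is exactly the trade-off the method explores.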

#### 3 Explicit form of the approximant function.

The first problem we dealt with was the choice of the function f(*t*), which of course cannot be totally free. A natural choice was to look for some family of parametric cubic splines. A possible choice, the one we explored first, is to use

$$\mathbf{f}(t) = \sum_{l=1}^{m} \mathbf{c}_{l} B_{l}(t)$$


where the c<sub>*l*</sub>, *l* = 1,2,...,*m*, are a set of *k*-dimensional vectors, and the *B<sub>l</sub>*(*t*) are a set of *m* cubic B-spline basis functions defined on some set of *m*+4 knots.

An alternative setting, which is the one we use in our presentation, is to define a function f(*t*) composed of *k* components *f<sub>j</sub>*(*t*).

Each component *f<sub>j</sub>*(*t*) is a natural spline with *m* knots *z<sub>l</sub>*, *l* = 1,2,...,*m*, interpolating *m* points in the *j*-th dimension. With this setting, the unknown quantities of the problem are the *m* × *k* coordinates of the *m* *k*-dimensional points Q<sub>*l*</sub>, *l* = 1,2,...,*m*.

These points could perhaps be called *principal points*, but for now we simply use them as multivariate knots.
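This second setting can be sketched with SciPy (an illustrative sketch, not the actual smoothPCA code): one natural cubic spline per coordinate, jointly interpolating the multivariate knots Q<sub>*l*</sub>:

```python
import numpy as np
from scipy.interpolate import CubicSpline

m, k = 6, 3
z = np.linspace(0.0, 1.0, m)       # knot positions z_1, ..., z_m
rng = np.random.default_rng(1)
Q = rng.normal(size=(m, k))        # the m k-dimensional points Q_l
# one natural cubic spline per coordinate; 'natural' sets f_j'' = 0 at both ends
f = CubicSpline(z, Q, bc_type='natural')

t = np.linspace(0.0, 1.0, 200)
curve = f(t)                       # (200, k) points on the curve f(t)
print(curve.shape)                 # (200, 3)
print(np.allclose(f(z), Q))        # True: the spline interpolates every Q_l
```

Because `CubicSpline` accepts a two-dimensional array of values, the *k* coordinate splines share the knots *z<sub>l</sub>* and are fitted in a single call.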

#### 4 The roughness, or penalty, function.

The penalty function P(·) is defined as a measure of curvature in *R<sup>k</sup>*. Since here we have a curve in *R<sup>k</sup>*, measuring curvature is not straightforward, but with the definition of f(*t*) as a set of *k* natural splines, the curvature can easily be defined as the sum of the *k* curvatures of the single splines, based as usual on second derivatives, so that the simplest formulation is:

$$\mathbf{P}(\mathbf{f}(t)) = \sum_{j=1}^{k} \int [f_j''(t)]^2 \, dt$$

This formulation allows us to express the penalty P(·) as a simple function of the leading coefficients of the piecewise polynomials which define the splines, and some tricks are used in order to handle the penalty term as if it were a vector of residuals.
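Since each *f<sub>j</sub>* is piecewise cubic, *f<sub>j</sub>''* is piecewise linear and the integral has a closed form in the spline coefficients. An illustrative sketch (assuming SciPy's coefficient layout, leading coefficient first):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def curvature_penalty(spline):
    """Exact integral of the squared second derivative of a cubic spline,
    computed in closed form from the piecewise-polynomial coefficients."""
    c = spline.c                    # shape (4, n_intervals[, k]), cubic term first
    h = np.diff(spline.x)           # interval lengths
    if c.ndim == 3:                 # broadcast over the k coordinates
        h = h[:, None]
    a, b = 6.0 * c[0], 2.0 * c[1]   # f'' = a*(t - t_i) + b on each interval
    # integral of (a s + b)^2 over [0, h], summed over all intervals
    return float((a**2 * h**3 / 3 + a * b * h**2 + b**2 * h).sum())

x = np.linspace(0, 1, 8)
straight = CubicSpline(x, 2 * x + 1, bc_type='natural')   # a line: zero curvature
wiggly = CubicSpline(x, np.sin(6 * x), bc_type='natural')
print(abs(curvature_penalty(straight)) < 1e-8)   # True
print(curvature_penalty(wiggly) > 0)             # True
```

Summing this quantity over the *k* coordinate splines gives exactly the penalty above, with no numerical quadrature needed.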

#### 5 Numerical algorithms

The minimization problem in (1), for a fixed value of λ, is not easy, since the *n* optimal orthogonal projections *t<sub>i</sub>*, *i* = 1,2,...,*n*, of the *n* observed points onto the parametric curve f(*t*) must be found by solving *n* optimization sub-problems, so that the problem cannot be split into *k* simpler penalized problems. For each candidate curve f̂(*t*), the set of optimal points must be recomputed.

At the present moment, promising results are obtained with a double Levenberg-Marquardt type optimization, modified for the peculiarities of the problem. In our setting we tried to cast the penalization term in least squares form.

An R package, smoothPCA, is under construction, which tries to use existing optimized routines as much as possible for the majority of the steps.

The problem of the choice of the number of knots, *m*, is still open, even if it seems less crucial than the choice of λ, for which some bounding values are proposed. Satisfactory solutions are obtained using as starting points a linear set of points Q<sub>*l*</sub> computed along the line of the first principal component.
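The alternating structure of the problem (project the points onto the current curve, then refit the curve) can be sketched as follows. This is a simplified illustration of ours: it omits the penalty term and uses plain local averaging of the knots in place of the authors' Levenberg-Marquardt step, but it keeps the suggested principal-component starting points:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fit_smoothed_curve(X, m=8, n_iter=20, grid=400):
    """Simplified alternating scheme: project points onto the current curve,
    then refit the m multivariate knots as local means."""
    # starting points: projections on the first principal component
    Xc = X - X.mean(axis=0)
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]
    t_i = Xc @ v
    t_i = (t_i - t_i.min()) / (t_i.max() - t_i.min())   # rescale to [0, 1]
    z = np.linspace(0.0, 1.0, m)
    for _ in range(n_iter):
        # refit step: knot Q_l is the mean of the points whose t_i is nearest z_l
        idx = np.clip(np.round(t_i * (m - 1)).astype(int), 0, m - 1)
        Q = np.array([X[idx == l].mean(axis=0) if (idx == l).any()
                      else X.mean(axis=0) for l in range(m)])
        f = CubicSpline(z, Q, bc_type='natural')
        # projection step: nearest point on a dense discretization of the curve
        tg = np.linspace(0.0, 1.0, grid)
        d2 = ((X[:, None, :] - f(tg)[None, :, :]) ** 2).sum(axis=2)
        t_i = tg[d2.argmin(axis=1)]
    return f, t_i

rng = np.random.default_rng(3)
t0 = np.linspace(0.0, 1.0, 200)
X = np.c_[t0, t0**2] + rng.normal(scale=0.05, size=(200, 2))  # noisy parabola
f, t_i = fit_smoothed_curve(X)
resid = np.sqrt(((X - f(t_i)) ** 2).sum(axis=1)).mean()
print(resid < 0.2)   # True: the fitted curve stays close to the data
```

Even this crude scheme recovers a curved first component for nonlinear data, which is the behavior the full penalized algorithm refines.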

#### 6 Exploratory analysis

The utility of this technique in exploratory analysis lies in the possibility of giving a sort of multidimensional measure of joint nonlinearity, together with the possibility of describing the observed points in a reduced space obtained by nonlinear parametric transformations.

Some examples will be presented on standard datasets.

#### References

ALLEN, G. I., & WEYLANDT, M. 2019. Sparse and Functional Principal Components Analysis. *2019 IEEE Data Science Workshop (DSW)*, Jun.

SILVERMAN, B. W. 1996. Smoothed functional principal components analysis by choice of norm. *The Annals of Statistics*, 24(1), 1–24.


### ACCOUNTING FOR RESPONSE BEHAVIOR IN LONGITUDINAL RATING DATA

Roberto Colombi<sup>1</sup>, Sabrina Giordano<sup>2</sup> and Maria Kateri<sup>3</sup>

<sup>1</sup> Department of Management, Information and Production Engineering, University of Bergamo, Italy (e-mail: roberto.colombi@unibg.it)

<sup>2</sup> Department of Economics, Statistics and Finance "Giovanni Anania", University of Calabria, Italy (e-mail: sabrina.giordano@unical.it)

<sup>3</sup> Institute for Statistics, RWTH Aachen University, Germany (e-mail: maria.kateri@rwth-aachen.de)

ABSTRACT: We present a hidden Markov model for repeated ordinal responses observed on some units at different time occasions. The responses reflect the levels of unobservable latent constructs and can be observed under two latent regimes according to whether the respondents are confident with their preference or take shelter in the extremes/middle points of the rating scale.

KEYWORDS: latent variables; response style; financial capability.

#### Hidden Markov models with two regimes

Consider one ordinal response observed on *n* units at *T* time occasions. So *Y<sub>it</sub>* denotes the response of unit *i*, *i* ∈ *I* = {1,...,*n*}, at occasion *t*, *t* ∈ *T* = {1,...,*T*}, with *Y<sub>it</sub>* ∈ *C* = {1,..., *c*}. The response is assumed to reflect the levels of unobservable latent constructs *L<sub>it</sub>*, *i* ∈ *I*, *t* ∈ *T*, and can be observed under two different latent regimes, *awareness* (AWR) and *middle or extreme categories response style* (EMRS), which are captured by binary latent variables *U<sub>it</sub>*, *i* ∈ *I*, *t* ∈ *T*. The presence of two regimes is based on the idea that, when required to express their opinion on an item, respondents either identify their true preference with one category of the rating scale or, when in doubt or reluctant to disclose their opinion, take shelter by opting for the extreme or middle categories. These are the cases, for example, of patients asked to give a subjective assessment of their health or disability in daily living, or of people required to evaluate their financial capability; all of them can feel confident or reluctant to answer. The proposal is a hidden Markov model (HMM) defined by two components that describe the distribution of the latent variables and the conditional distribution of the response given the latent variables. It generalizes the models of Bartolucci *et al.*, 2012 to a bivariate latent Markov process. Here, we describe the main features of the model proposed by Colombi *et al.*, 2021.

The latent Markov model. For every *i* ∈ *I*, *t* ∈ *T*, the *latent construct* $L_{it}$ (e.g., health status, financial capability) has a finite discrete state space $S_L = \{1,\dots,k\}$, while the *latent binary response style indicator* $U_{it}$ has state space $S_U = \{1,2\}$, where 1 and 2 denote the EMRS and AWR states, respectively. The latent variables are independent across units and, for every unit, $\{L_{it},U_{it}\}_{t\in T}$ is a first-order bivariate Markov process with states $(u,l)$, $u \in S_U$, $l \in S_L$. The initial probabilities ($t = 1$) of $\{L_{it},U_{it}\}_{t\in T}$ are $\pi_{i1}(u,l)$, and $\pi_{it}(u,l \mid \bar{u},\bar{l})$ are the transition probabilities. The latter are simplified to

$$\pi_{it}(u,l \mid \bar{u},\bar{l}) = \pi^{U|L}_{it}(u \mid l,\bar{u}) \, \pi^{L}_{it}(l \mid \bar{l}), \qquad t = 2,\dots,T,$$

by assuming that $L_{it}$, given its past, does not depend on the past of $U_{it}$, and that the current $U_{it}$ depends on its own past and on the contemporaneous latent construct, but not on the past of the latent construct. The row vectors $x^{(m)}_{i}$ and $z^{(m)}_{it}$, $m \in \{L,U\}$, stand for the covariates, not necessarily different, influencing the initial and transition probabilities, respectively, of the latent variables.
Assuming independence between the latent variables at the first time point, the latent model is specified by the following logit models: A) a baseline logit model for the initial probabilities of the latent construct,
$$\log \frac{\pi^{L}_{i1}(l)}{\pi^{L}_{i1}(1)} = \alpha_{0l} + \alpha_{1l} x^{(L)}_{i}, \qquad l = 2,\dots,k;$$
B) a logit model for the initial probabilities of the response style indicator,
$$\log \frac{\pi^{U}_{i1}(1)}{\pi^{U}_{i1}(2)} = \bar{\alpha}_{0} + \bar{\alpha}_{1} x^{(U)}_{i};$$
C) baseline logit models for the marginal transition probabilities of the latent construct, with reference category the state $\bar{l}$ of the previous time point, i.e. for $\bar{l} \in S_L$,
$$\log \frac{\pi^{L}_{it}(l \mid \bar{l})}{\pi^{L}_{it}(\bar{l} \mid \bar{l})} = \beta_{0l\bar{l}} + \beta_{1l\bar{l}} z^{(L)}_{it}, \qquad l \in S_L,\ l \neq \bar{l},\ t = 2,\dots,T;$$
D) a logit model for the conditional transition probabilities of the response style indicator, for each response style state $\bar{u}$ of the previous occasion and for each current state $l$ of the latent construct,
$$\log \frac{\pi^{U|L}_{it}(1 \mid l,\bar{u})}{\pi^{U|L}_{it}(2 \mid l,\bar{u})} = \bar{\beta}_{0l\bar{u}} + \bar{\beta}_{1l\bar{u}} z^{(U)}_{it}, \qquad l \in S_L,\ \bar{u} \in S_U,\ t = 2,\dots,T.$$
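To make the logit parameterization concrete, here is a small numerical sketch of ours that maps the baseline logits of model C) to one row of the transition matrix; the covariates and coefficients are illustrative (loosely taken from the constant, G and Jse columns of Table 1, whereas the full model uses all covariates):

```python
import numpy as np

def transition_row(beta0, beta1, z, lbar):
    """Transition probabilities pi^L_it(l | lbar) from the baseline logit
    model C): log pi(l|lbar)/pi(lbar|lbar) = beta0[l] + beta1[l] @ z,
    with the previous state lbar as the reference category."""
    k = len(beta0)
    eta = np.zeros(k)                       # the reference category has logit 0
    for l in range(k):
        if l != lbar:
            eta[l] = beta0[l] + beta1[l] @ z
    return np.exp(eta) / np.exp(eta).sum()  # inverse multinomial logit

beta0 = np.array([0.0, -0.86])              # intercept for the non-reference state
beta1 = np.array([[0.0, 0.0], [1.32, 0.27]])
z = np.array([1.0, 0.0])                    # illustrative covariate vector z_it
row = transition_row(beta0, beta1, z, lbar=0)
print(np.isclose(row.sum(), 1.0))           # True: a valid probability row
```

Stacking such rows over all previous states $\bar{l}$ yields the full marginal transition matrix of the latent construct for one unit and occasion.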


The observation model. Independence is assumed among units. The conditional probability functions of $Y_{it}$, given the EMRS $(1,l)$ and AWR $(2,l)$ latent states, are both time and subject invariant, denoted by $f(y \mid l,u)$, $u \in S_U$, $l \in S_L$, $y \in C$, for $t \in T$, $i \in I$. Given the EMRS regime, $f(y \mid l,1)$, $l \in S_L$, is parameterized by the logits
$$\log \frac{f(y \mid l,1)}{f(y-1 \mid l,1)} = \phi_{0l} + \phi_{1l}\, s(y), \qquad y = 2,\dots,c,$$
where the scores are the known constants $s(y) = \left(\frac{c}{2} - y\right) \big/ \sum_{y=1}^{c-1} (y - c/2)^2$, $y \in C$; $\phi_0$ governs the skewness, $\phi_1$ the U and bell shape. Given the AWR regime, $f(y \mid l,2)$, $l \in S_L$, is parameterized by the logits
$$\log \frac{f(y \mid l,2)}{f(y-1 \mid l,2)} = \varphi_{yl}, \qquad y = 2,\dots,c.$$
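A quick numerical check of the EMRS parameterization (an illustrative sketch; the parameter values are ours): building $f(y \mid l,1)$ from the adjacent-category logits and the fixed scores $s(y)$ shows how a suitable $\phi_1$ produces U-shaped response probabilities, i.e. mass on the extreme categories:

```python
import numpy as np

def emrs_pmf(phi0, phi1, c):
    """f(y | l, 1) under the EMRS regime, built from the adjacent-category
    logits log f(y|l,1)/f(y-1|l,1) = phi0 + phi1 * s(y), y = 2, ..., c."""
    y = np.arange(1, c + 1)
    s = (c / 2 - y) / ((y - c / 2) ** 2)[:-1].sum()   # the fixed scores s(y)
    logf = np.concatenate(([0.0], np.cumsum(phi0 + phi1 * s[1:])))
    f = np.exp(logf)
    return f / f.sum()

c = 6                             # six ordered categories, as in the survey
pmf = emrs_pmf(-0.8, -8.0, c)     # illustrative values chosen to give a U shape
print(np.isclose(pmf.sum(), 1.0))       # True
print(pmf[0] + pmf[-1] > 0.5)           # True: the extremes dominate
```

Only two parameters per latent state are needed under EMRS, against the $c-1$ free logits of the AWR regime, which is what makes the two regimes identifiable in practice.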

Application to Bank of Italy data. We applied the model to the panel data from the Survey on Household Income and Wealth (Bank of Italy), collected every 2 years from 2006 to 2016 on 1109 Italian households. The ordinal response of interest is the perception of the household's financial ability to make ends meet (ve = very easily, e = easily, fe = fairly easily, sd = with some difficulty, d = with difficulty, gd = with great difficulty); the covariates are: G (female, *male*), J (Jse: self-employed, Jhrs: housekeeper/retired/student, *employee*), CH (with children, *no children*), D (with debts, *no debts*), S (with savings, *no savings*), E (up to secondary school, *over high school*), R (not risk averse in managing financial investments, *risk averse*), with the reference categories in italics. The minimum BIC corresponds to the model with *k* = 2 states, meaning that households can be grouped according to whether they feel financially confident (*l* = 1) or deal with financial stress (*l* = 2). Fig. 1 allows us to characterize the choices of the respondents in the 4 latent states. Individuals in the financially confident latent state, when in doubt about their perception, tend to choose the optimistic extreme points with higher probability, while AWR people are more inclined toward the intermediate ratings. Reluctant households (EMRS) in the latent group dealing with financial stress have the highest probabilities of reporting great difficulties, whereas AWR people in the same group are more likely to point out just some difficulties. The behavior in the 4 latent states is well distinguished, and optimistic/pessimistic choices are mainly due to the EMRS tendency.

Figure 1. *Observation probability functions of AWR and EMRS respondents in the two latent states of the perceived financial condition.*

Table 1. *Estimates (EM algorithm) of the parameters of logit models A, B, C, D.*

| parameters | cst | G | Jse | Jhrs | CH | D | S | E | R |
|---|---|---|---|---|---|---|---|---|---|
| (α02, α2) | 2.8 | 0.44<sup>∗</sup> | -1.38<sup>∗</sup> | -0.75<sup>∗</sup> | -0.15 | 0.02 | -1.44<sup>∗</sup> | -1.86<sup>∗</sup> | -0.35<sup>∗</sup> |
| (ᾱ0, ᾱ1) | -0.06 | -0.03 | 0.16 | 0.08 | -0.04 | 0.32 | 0.63<sup>∗</sup> | 0.04 | 0.14 |
| (β021, β121) | -0.86 | 1.32<sup>∗</sup> | 0.27 | -0.49 | -0.89<sup>∗</sup> | 0.48 | -1.69<sup>∗</sup> | -1.16<sup>∗</sup> | -0.17 |
| (β012, β112) | -11.93 | 0.18 | -0.91 | -0.21 | -0.36 | -0.23 | 8.44<sup>∗</sup> | 1.38<sup>∗</sup> | -8.83<sup>∗</sup> |
| (β̄011, β̄111) | 1.10 | 0.45 | -0.29 | 0.00 | -0.20 | 0.13 | -0.79<sup>∗</sup> | -0.47<sup>∗</sup> | -0.06 |
| (β̄021, β̄121) | -3.36 | -0.05 | 1.09<sup>∗</sup> | -0.33 | 0.45 | -0.37 | 1.97<sup>∗</sup> | 0.81<sup>∗</sup> | -0.37 |
| (β̄012, β̄112) | 1.91 | -0.07 | -0.35 | -0.23 | 0.00 | -0.05 | -0.19 | -0.29 | -0.39<sup>∗</sup> |
| (β̄022, β̄122) | 1.69 | -0.50 | -0.34 | -0.08 | 0.10 | -0.07 | 1.80<sup>∗</sup> | -0.09 | -0.37 |

cst: constant – <sup>∗</sup> 95% confidence interval does not contain zero

By the sign of the estimates in Table 1, row 1, we deduce that at the first occasion women, employees, and people without savings, with high education and risk averse, are with higher probability in a worse financial status. Further, respondents with savings show a greater propensity toward a response style at the beginning of the survey (row 2). From row 3, it seems that, over two consecutive occasions, women move from a financially confident condition (*l* = 1) to a worse status (*l* = 2) with higher probability, while low-educated households with children and savings tend to remain in the previous, more comfortable financial status (*l* = 1). Individuals who have savings and a low education pass with greater probability from the financially stressed status (*l* = 2) to the better condition (*l* = 1), while financially stressed households tend to remain in the same worse status with greater probability when they are not risk averse (row 4). From rows 5-6, a change from the EMRS status (*ū* = 1) to an AWR behavior (*u* = 2) is more likely for low-educated persons with savings who currently belong to the group of financially confident households, while self-employed and low-educated respondents with savings show a greater probability of remaining in the EMRS status if they were reluctant at the previous occasion (*ū* = 1) and are financially stressed at the current time (*l* = 2). Those who are not risk averse and currently feel financially confident have a higher probability of keeping the previous awareness in revealing their own financial capability. On the other hand, individuals with savings who are in the latent financially worrying status have a greater propensity to give up the previous AWR behavior and opt for a response style (rows 7-8).

#### References

BARTOLUCCI, F., FARCOMENI, A., & PENNONI, F. 2012. *Latent Markov Models for Longitudinal Data*. CRC Press.

COLOMBI, R., GIORDANO, S., & KATERI, M. 2021. Hidden Markov models for longitudinal rating data with dynamic response styles: evidence on household financial capability. *Submitted*.




### NETWORK-BASED SEMI-SUPERVISED CLUSTERING OF TIME SERIES DATA



Claudio Conversano<sup>1</sup>, Giulia Contu<sup>1</sup>, Luca Frigau<sup>1</sup> and Carmela Cappelli<sup>2</sup>

<sup>1</sup> Department of Economics and Business, University of Cagliari, (e-mail: conversa@unica.it, giulia.contu@unica.it, frigau@unica.it)

<sup>2</sup> Department of Humanities, University of Naples Federico II, (e-mail: carmela.cappelli@unina.it)

ABSTRACT: Semisupervised clustering extends standard clustering methods to the semisupervised setting, in some cases considering situations where clusters are associated with a given outcome variable that acts as a "noisy surrogate", that is, a good proxy of the unknown clustering structure. A novel approach to semisupervised clustering associated with an outcome variable, named network-based semisupervised clustering (NeSSC), has recently been introduced (Frigau *et al.*, 2021). It combines an initialization, a training and an agglomeration phase. In the initialization and training phases, a matrix of pairwise affinity of the instances is estimated by a classifier. In the agglomeration phase, the matrix of pairwise affinity is transformed into a complex network, in which a community detection algorithm searches for the underlying community structure. Thus, a partition of the instances into clusters highly homogeneous in terms of the outcome is obtained. A particular specification of NeSSC, called Community Detection Trees (Co-De Trees), uses classification or regression trees as classifiers and Louvain, Label propagation and Walktrap as possible community detection algorithms. NeSSC is based on an ad-hoc stopping criterion and a criterion for the choice of the optimal partition of the original data. In this presentation, we provide a new specification of the NeSSC algorithm that allows us to perform clustering of time series data. This specification is based on the integration between Co-De Trees and the Atheoretical Regression Tree (ART) approach introduced by Cappelli *et al.* (2013, 2015). ART exploits the concept of contiguous partitions within the framework of Least Squares Regression Trees, using as a single covariate an arbitrary sequence of completely ordered numbers *K* = 1, 2, ..., *i*, ..., *N*.
Tree-regressing the response variable *Y* on this artificial covariate amounts to creating and checking at any node *h* all possible binary contiguous partitions of the *Yi* ∈ *h*. These splits are the only ones that need to be checked to detect the binary partition that minimizes the sum of squares and, indeed, they are generated by using *K* as covariate. In other words, by the contiguity property the best split lies in *K* (or in its subintervals after the split of the root node has taken place), and the tree algorithm, based on the classical "reduction in impurity" splitting criterion, is forced to identify it. In general, the use of *K* as covariate enables ART to generate *G* different groups having different means. The effectiveness of the proposed NeSSC-ART combined approach for time series clustering is demonstrated on simulated and real data.
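As an illustration of the split search described above, the following Python sketch (illustrative only, not the authors' implementation) finds the best binary contiguous partition of an ordered response sequence by minimizing the total within-segment sum of squares:

```python
import numpy as np

def best_contiguous_split(y):
    """Return (cut, sse): the index cutting the ordered sequence y into
    y[:cut] and y[cut:] that minimizes the total within-segment sum of
    squares, i.e. the split the tree algorithm is forced to identify."""
    y = np.asarray(y, dtype=float)
    best_cut, best_sse = None, np.inf
    for cut in range(1, len(y)):          # only the n-1 contiguous splits
        left, right = y[:cut], y[cut:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = cut, sse
    return best_cut, best_sse

# A series with a mean shift after the fifth value is split exactly there.
series = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2, 5.0]
cut, sse = best_contiguous_split(series)
print(cut)  # -> 5
```

Because the covariate is the ordered index *K*, only these contiguous cut points need to be evaluated, which is what makes the tree-based change-point search efficient.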

KEYWORDS: network-based semisupervised clustering, community detection trees, atheoretical regression tree.

#### References

CAPPELLI, C., D'URSO, P., & DI IORIO, F. 2013. Change point analysis of imprecise time series. *Fuzzy Sets and Systems*, 225, 23–38.

CAPPELLI, C., D'URSO, P., & DI IORIO, F. 2015. Regime change analysis of interval-valued time series with an application to PM10. *Chemometrics and Intelligent Laboratory Systems*, 146, 337–346.

FRIGAU, L., CONTU, G., MOLA, F., & CONVERSANO, C. 2021. *Network-based semisupervised clustering*. Vol. 37.


### CHARACTERISING LONGITUDINAL TRAJECTORIES OF COVID-19 BIOMARKERS WITHIN A LATENT CLASS FRAMEWORK


Federica Cugnata<sup>1</sup>, Chiara Brombin<sup>1</sup>, Pietro E. Cippà<sup>2</sup>, Alessandro Ceschi<sup>3</sup>, Paolo Ferrari<sup>4</sup> and Clelia Di Serio<sup>1</sup>

<sup>1</sup> University Centre for Statistics in the Biomedical Sciences (CUSSB), Vita-Salute San Raffaele University, (e-mail: cugnata.federica@unisr.it, chiara.brombin@unisr.it, clelia.diserio@unisr.it)

<sup>2</sup> Department of Medicine, Division of Nephrology, Ente Ospedaliero Cantonale, Bellinzona and Faculty of Medicine, University of Zurich, (e-mail: Pietro.Cippa@eoc.ch)

<sup>3</sup> Faculty of Medicine, University of Zurich, Biomedical Faculty, Università della Svizzera Italiana, Lugano, Institute of Pharmacology and Toxicology, Ente Ospedaliero Cantonale, Bellinzona, (e-mail: Alessandro.Ceschi@eoc.ch)

<sup>4</sup> Department of Medicine, Division of Nephrology, Ente Ospedaliero Cantonale, Bellinzona and Biomedical Faculty, Università della Svizzera Italiana, Lugano, (e-mail: Paolo.Ferrari@eoc.ch)

ABSTRACT: In COVID-19 clinical research, identifying homogeneous subgroups of patients is essential for tailoring treatments. To address this issue from a statistical point of view, models accounting for unobservable heterogeneity in patients are needed. We propose latent class mixed models (LCMMs) to model trajectories of clinically relevant biomarkers for COVID-19, and we compare patients across the uncovered classes with respect to their baseline clinical characteristics and COVID-19 outcomes.

KEYWORDS: Latent class mixed model, C-Reactive Protein, serum creatinine

#### 1 Introduction

One of the main goals in COVID-19 clinical research is to identify patients' characteristics associated with different degrees of disease severity. Most published papers focus on patients' characteristics at hospital admission, linking them to the final outcome, either intensive care unit (ICU) admission or death. In this work we apply an alternative approach to evaluate the dynamics of commonly monitored biomarkers while uncovering subgroups of patients with specific longitudinal response patterns. In particular, here we focus on the trajectories of serum creatinine and C-Reactive Protein (CRP) from hospital admission.

#### 2 Sample description


A sample of 512 hospitalized patients, admitted to the Ente Ospedaliero Cantonale COVID-19 dedicated hospital between March 1 and May 1, 2020, diagnosed with COVID-19 and with at least two determinations of serum creatinine (3546 observations) or CRP (3592 observations), has been considered for the analysis. Diagnosis of COVID-19 was based on a positive nasopharyngeal swab specimen tested with a real-time RT-PCR assay or on high clinical suspicion. The study was approved by the Ethical Committee of the Canton of Ticino, Switzerland. Demographic and clinical characteristics, along with the comorbidities and symptoms of COVID-19, were recorded at admission time. Clinical and laboratory parameters were regularly monitored every 48h during hospitalization. The median patients' age was 72 years (IQR [60.75, 80.00]), ranging from 22 to 97 years; 317 (61.9%) were male. 379 patients (74%) were discharged, 95 patients (18.6%) died and 7.4% were still hospitalized. 116 patients (22.7%) in total were admitted to the ICU.

### 3 Statistical methods

To identify groups of patients with distinct biomarker trajectories over time, latent class linear mixed models (LCMMs; Proust-Lima *et al.*, 2017) were applied. LCMMs generalize traditional Linear Mixed Effects (LME) models, assuming that the population is heterogeneous and that *G* unobserved sub-populations (latent classes), each with its own mean trajectory profile, may be identified. Consistently with the literature on latent variable modelling, the approach requires the specification of a structural latent model, i.e., a standard linear mixed model without measurement errors, along with a measurement model linking the latent process to the outcome of interest. When a heterogeneous population is assumed, for a subject *i* belonging to class *ci* equal to *g* (*g* = 1,...,*G*), a latent class-specific process can be defined as

$$\Lambda\_i(t\_{ij})|\_{c\_i=g} = X\_{1i}(t\_{ij})'\beta + X\_{2i}(t\_{ij})'\gamma\_g + Z\_i(t\_{ij})'u\_{ig} + w\_i(t\_{ij})$$

where *ti j* denotes the time of measurement for subject *i* (*i* = 1,...,*N*) at occasion *j* (*j* = 1,...,*ni*), *X*1*i*(*ti j*) and *X*2*i*(*ti j*) are vectors of time-dependent covariates, respectively with fixed effects β common over classes and class-specific fixed effects γ*g*, *Zi*(*ti j*) is a vector of time-dependent covariates associated with individual class-specific random effects *uig*, and *wi*(*ti j*) represents an autocorrelated process. A measurement model is then defined as *Yi j*|*ci*=*<sup>g</sup>* = *H*(Λ*i*(*ti j*)|*ci*=*g* + ε*i j*; η), where *H* is a parametrized monotonic increasing link function (linear, splines, thresholds, etc., depending on the type of the longitudinal markers), ε*i j* are independent normally distributed errors, and Λ*i*(*ti j*)|*ci*=*g* + ε*i j* represents a noisy version of the latent process at time *ti j*. Every subject is assigned to one latent class only. For each subject, latent class membership is described by a latent variable *ci* that equals *g* if *i* belongs to class *g*, and the probability of latent class membership is modeled using a multinomial logistic regression according to covariates *X*3*i*:

$$\pi\_{ig} = P(c\_i = g \mid X\_{3i}) = \frac{e^{\xi\_{0g} + X\_{3i}'\xi\_{1g}}}{\sum\_{l=1}^{G} e^{\xi\_{0l} + X\_{3i}'\xi\_{1l}}}$$


where ξ0*<sup>g</sup>* is the intercept for class *g* and ξ1*<sup>g</sup>* is the vector of class-specific parameters related to the time-independent covariates *X*3*i*.
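Numerically, these class-membership probabilities are a softmax over the class-specific linear predictors ξ0*g* + *X*3*i*′ξ1*g*. A minimal Python sketch with hypothetical parameter values (the fitted estimates are not reported here):

```python
import numpy as np

def class_probabilities(x3, xi0, xi1):
    """pi_ig = softmax_g(xi0_g + x3' xi1_g): multinomial-logit latent class
    membership. x3: (p,) covariates; xi0: (G,) intercepts; xi1: (G, p)."""
    eta = xi0 + xi1 @ x3      # class-specific linear predictors
    eta = eta - eta.max()     # subtract the max for numerical stability
    w = np.exp(eta)
    return w / w.sum()

# Hypothetical values: two classes, one covariate; class 1 acts as the
# reference (its parameters fixed at zero for identifiability).
x3 = np.array([0.5])
xi0 = np.array([0.0, -1.0])
xi1 = np.array([[0.0], [2.0]])
pi = class_probabilities(x3, xi0, xi1)
print(pi)  # probabilities over the G classes, summing to 1
```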

Since we are specifically interested in identifying different dynamics over time for the biomarkers, only the measurement time from hospital admission has been considered as covariate. Spline link functions (with 5 equidistant knots; Ramsay, 1988) were considered to account for nonlinearities in the longitudinal response. Several LCMMs were estimated assuming different numbers of latent classes, and the BIC criterion was used to select the optimal number of latent classes. In the presence of more than two classes, Fisher's exact test and the Kruskal-Wallis test were used to compare patients' clinical features across latent classes.
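The BIC comparison across candidate numbers of classes works as in this sketch (the log-likelihoods and parameter counts below are made up for illustration; the actual models were fitted with the lcmm R package):

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion; smaller values are preferred."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Made-up fits for G = 1..4 latent classes: (log-likelihood, n. of parameters).
# Extra classes improve the fit but are penalized through the parameter count.
fits = {1: (-5100.0, 8), 2: (-4990.0, 13), 3: (-4978.0, 18), 4: (-4975.0, 23)}
n_obs = 512
scores = {g: bic(ll, p, n_obs) for g, (ll, p) in fits.items()}
best_g = min(scores, key=scores.get)
print(best_g)  # -> 2
```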

#### 4 Results

The best model for serum creatinine included two latent classes, with 453 subjects assigned to class 1 and 42 to class 2 (BIC = 974.92). At baseline, class 2 differs from class 1 (p-value<0.001), and for patients in class 1 creatinine significantly declined over time (p-value<0.0001), while class 2 remains stable. The class-specific mean predicted trajectories are reported in Figure 1(a). Average posterior probabilities of falling into the class to which the subjects were assigned are equal to 0.957 and 0.804. Examining differences between the two classes, it emerged that they are associated with diabetes (p-value<0.001), cardiovascular disease (p-value<0.001) and cough (p-value=0.044). Moreover, among the patients assigned to class 1, 20.5% were admitted to the intensive care unit and 15% died, whereas among the patients assigned to class 2, 52.4% were admitted to the intensive care unit and 61.9% died (both p-values<0.001).

Figure 1. *Class-specific mean predicted trajectories for serum creatinine (a) and C-Reactive Protein (b)*

With reference to the model for CRP, we found that the three-latent-class model was the best in terms of BIC, with 30 subjects assigned to class 1, 411 to class 2 and 64 to class 3 (BIC=9716.96). At baseline, class 2 and class 3 differ from class 1 (both p-values<0.001). Moreover, for patients in class 1 CRP significantly increased over time (p-value<0.001), whereas for patients in class 2 and class 3 CRP significantly declined over time (both p-values<0.001), with a larger decrease for class 3. The class-specific mean predicted trajectories are reported in Figure 1(b). Average posterior probabilities of falling into the class to which the subjects were assigned are equal to 0.807, 0.873 and 0.773. These classes significantly differ on age (median age is 76.50 in class 1, 72.00 in class 2 and 67.00 in class 3, p-value=0.001), and they are associated with diabetes, cardiovascular disease and respiratory symptoms. Moreover, the classes are associated with the outcome of the disease (p-value<0.001): the percentage of deaths is 66.7% in class 1, 17.8% in class 2 and 0% in class 3. In order to better understand the relationship between biomarker evolution and COVID-19 outcome, as a matter of future research the same latent class framework will be considered and, in particular, multi-process joint latent class mixed models will be applied.

#### References

PROUST-LIMA, CÉCILE, PHILIPPS, VIVIANE, & LIQUET, BENOIT. 2017. Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm. *Journal of Statistical Software*, 78(2), 1–56.

RAMSAY, JAMES O. 1988. Monotone regression splines in action. *Statistical Science*, 3(4), 425–441.



### SENDER AND RECEIVER EFFECTS IN LATENT SPACE MODELS FOR MULTIPLEX DATA


Silvia D'Angelo <sup>1</sup>

<sup>1</sup> School of Mathematics and Statistics, University College Dublin, (e-mail: silvia.dangelo@ucd.ie)

ABSTRACT: Network and multidimensional network (multiplex) data often entail transitivity and heterogeneity of the nodes. This last aspect is particularly of interest in multiplex data, as nodes' tendencies to send or receive links are often network-dependent. Here, a class of latent space models is discussed. This class allows one both to account for different levels of complexity in nodes' heterogeneity and to capture recurring symmetric relations between the nodes, via the inclusion of a shared latent space. The framework is quite general, as both weighted and binary networks are considered. Inference is carried out within a hierarchical Bayesian framework, while a Markov Chain Monte Carlo algorithm is used for estimation of model parameters.

KEYWORDS: latent space models, multiplex, Markov chain Monte Carlo

#### 1 Introduction

Network data are relational data representing interactions among a set of actors, the nodes. Interactions among pairs of nodes are represented as links binding them, the edges. Depending on the type of relation represented in a network, such links can either be binary, indicating the presence or absence of a relation, or weighted, expressing the "strength" of the interaction between pairs of nodes. Moreover, when multiple relationships are observed among the same set of nodes, a particular type of network can be defined, that is, a multidimensional network (or multiplex). Observed network data can display different characteristics, and these may have a direct impact on their structure. Two common features are transitivity ("a friend of my friend is my friend") and heterogeneity of the nodes. Building on previous work (D'Angelo *et al.*, 2020), we propose to address the first feature by defining a shared, low-dimensional latent space (see Hoff *et al.*, 2002 and Gollini & Murphy, 2016) underlying the network or multidimensional network. Nodes are embedded in this latent space, with the main assumption that their proximity denotes similarity and hence a larger probability of interacting in the observed network. Node-specific sender and receiver effects are then introduced to flexibly model heterogeneity in the data (see Hoff, 2005). Last, different link functions can be considered (see Sewell & Chen, 2016), to adapt the framework to either binary or weighted networks.
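The core assumption above — nodes that are close in the latent space are more likely to interact — can be sketched for a binary network with a logit link. In this Python illustration, sender and receiver effects enter the linear predictor additively purely for concreteness, and all parameter values are made up:

```python
import numpy as np

def link_probability(z_i, z_j, alpha, beta, theta_i, gamma_j):
    """P(y_ij = 1) under a logistic latent space model: the linear predictor
    decreases with the squared Euclidean distance between latent positions,
    with additive sender (theta) and receiver (gamma) effects."""
    d = float(np.sum((z_i - z_j) ** 2))   # squared Euclidean distance
    eta = alpha + theta_i + gamma_j - beta * d
    return 1.0 / (1.0 + np.exp(-eta))

# Made-up latent positions: node 2 is close to node 1, node 3 is far away.
z1, z2, z3 = np.array([0.0, 0.0]), np.array([0.2, 0.1]), np.array([2.0, 2.0])
p_near = link_probability(z1, z2, alpha=1.0, beta=1.0, theta_i=0.0, gamma_j=0.0)
p_far = link_probability(z1, z3, alpha=1.0, beta=1.0, theta_i=0.0, gamma_j=0.0)
print(p_near > p_far)  # -> True
```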

#### 2 The models


#### 2 The model

Given a set of $n$ nodes, $i, j = 1,\dots,n$, we can define a multidimensional network as a collection of $K$ adjacency matrices: $\mathbf{Y} = \big(Y^{(1)},\dots,Y^{(k)},\dots,Y^{(K)}\big)$. A single network can be viewed as a special case of $\mathbf{Y}$, with $K = 1$. In the case of binary networks, the generic entry $y_{ij}^{(k)}$ is either 1, if nodes $i$ and $j$ are connected, or 0, if they are not. Instead, in weighted networks $y_{ij}^{(k)}$ corresponds to the weight associated to the interaction between nodes $i$ and $j$ in network $k$, that is, the "strength" of their interaction. Generally, we assume that:

$$f\left(\mathrm{E}\left[y_{ij}^{(k)}\right]\right) = \alpha^{(k)}\,\phi_{ij}^{(k)} - \beta^{(k)}d_{ij},$$

where $f(\cdot)$ is some link function, depending on the type of edges considered. Similar specifications to those employed in generalized linear mixed models can be used for $f(\cdot)$ (Sewell & Chen, 2016). $\alpha = \{\alpha^{(k)}\}_{k=1}^{K}$ and $\beta = \{\beta^{(k)}\}_{k=1}^{K}$ are network-specific intercept and scale parameters, adapting the shared latent space structure to the different networks. Distances between pairs of nodes in the latent space are indicated by $d_{ij}$, here taken to be the squared Euclidean distance between $i$ and $j$ in the latent space. Last, $\phi_{ij}^{(k)}$ represents the effect of the dyad-specific heterogeneity on the probability of an interaction in network $k$ between nodes $i$ and $j$. More specifically, it is assumed that $\phi_{ij}^{(k)} = g\big(\theta_i^{(k)}, \gamma_j^{(k)}\big)$, where $\theta_i^{(k)}$ and $\gamma_j^{(k)}$ are, respectively, the sender effect of node $i$ and the receiver effect of node $j$ in network $k$. To flexibly describe different levels of heterogeneity in multidimensional networks, we define three scenarios of increasing complexity for both the sender and the receiver parameters: a NULL scenario, where no effect is present ($\theta_i^{(k)} = 0$ and/or $\gamma_j^{(k)} = 0$); a CONSTANT scenario, where node-specific effects do not vary across networks ($\theta_i^{(k)} = \theta_i$ and/or $\gamma_j^{(k)} = \gamma_j$); and a VARIABLE scenario, where effects are present and network-specific ($\theta_i^{(k)}$ and/or $\gamma_j^{(k)}$). Depending on the presence or absence of the effects, $g(\cdot,\cdot)$ can be defined as:

$$g(\cdot, \cdot) = \begin{cases} 1 & \text{if both effects are NULL,} \\ \theta_i^{(k)} & \text{if the receiver effects are NULL,} \\ \gamma_j^{(k)} & \text{if the sender effects are NULL,} \\ \dfrac{\theta_i^{(k)} + \gamma_j^{(k)}}{2} & \text{if neither the sender nor the receiver effects are NULL.} \end{cases}$$


Different combinations of the $g(\cdot,\cdot)$ specifications and the three scenarios give rise to a set of 9 latent space models, incorporating varying degrees of heterogeneity.
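As a concrete illustration, the linear predictor above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are ours, and `None` is our own convention for encoding a NULL effect.

```python
import math

def g_effect(theta_i, gamma_j):
    """Combine sender/receiver effects following the four g(.,.) cases;
    None encodes a NULL effect (our convention)."""
    if theta_i is None and gamma_j is None:
        return 1.0
    if gamma_j is None:
        return theta_i
    if theta_i is None:
        return gamma_j
    return (theta_i + gamma_j) / 2.0

def linear_predictor(alpha_k, beta_k, z_i, z_j, theta_i=None, gamma_j=None):
    """f(E[y_ij^(k)]) = alpha^(k) * g(theta_i, gamma_j) - beta^(k) * d_ij,
    with d_ij the squared Euclidean distance in the shared latent space."""
    d_ij = sum((a - b) ** 2 for a, b in zip(z_i, z_j))
    return alpha_k * g_effect(theta_i, gamma_j) - beta_k * d_ij

def edge_probability(eta):
    """For binary networks, a logit link maps the predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# NULL scenario: g(.,.) = 1, so the predictor reduces to alpha - beta * d_ij
eta = linear_predictor(alpha_k=1.0, beta_k=0.5, z_i=[0.0, 0.0], z_j=[1.0, 1.0])
print(edge_probability(eta))
```

In the CONSTANT or VARIABLE scenarios one would simply pass `theta_i`/`gamma_j` values that are shared across networks or network-specific, respectively.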

Last, inference is carried out within a hierarchical Bayesian framework, and a Markov Chain Monte Carlo algorithm is employed for estimation of model parameters.

#### 3 Conclusion

A class of latent space models for networks and multidimensional networks is discussed. The models flexibly account for transitivity and heterogeneity in network data, for both binary and weighted edges. Currently, only the class of models for binary networks is implemented in the *spaceNet* R package (https://CRAN.R-project.org/package=spaceNet), with the plan of including those for weighted networks in the near future.

#### References


D'ANGELO, S., ALFÒ, M., & MURPHY, T.B. 2020. Modeling node heterogeneity in latent space models for multidimensional networks. *Statistica Neerlandica*, 74, 324–341.

GOLLINI, I., & MURPHY, T.B. 2016. Joint modeling of multiple network views. *Journal of Computational and Graphical Statistics*, 25, 246–265.

HOFF, P. 2005. Bilinear mixed-effects models for dyadic data. *Journal of the American Statistical Association*, 100, 286–295.

HOFF, P., RAFTERY, A., & HANDCOCK, M. 2002. Latent space approaches to social network analysis. *Journal of the American Statistical Association*, 97, 1090–1098.

SEWELL, D., & CHEN, Y. 2016. Latent space models for dynamic networks with weighted edges. *Social Networks*, 44, 105–116.

#### **DTW-BASED ASSESSMENT OF THE PREDICTIVE POWER OF THE COPULA-DCC-GARCH-MST MODEL DEVELOPED FOR EUROPEAN INSURANCE INSTITUTIONS**

Anna Denkowska<sup>1</sup> and Stanisław Wanat<sup>2</sup>

<sup>1</sup> Department of Mathematics, Cracow University of Economics, Kraków, Poland (e-mail: anna.denkowska@uek.krakow.pl)
<sup>2</sup> Department of Mathematics, Cracow University of Economics, Kraków, Poland (e-mail: wanats@uek.krakow.pl)

**ABSTRACT**: We investigate the possibilities of using the Dynamic Time Warping (DTW) algorithm in two ways. The first is to assess the suitability of the Minimum Spanning Trees' topological indicators, constructed based on the tail dependence coefficients determined by the copula-DCC-GARCH model, in order to establish the links between insurance companies in the context of potential shock contagion. The second consists in using the DTW algorithm to group institutions by the similarity of their contribution to systemic risk, as expressed by DeltaCoVaR. The results obtained confirm the effectiveness of MST topological indicators for SR identification and the evaluation of indirect links between insurance institutions.

**KEYWORDS**: time series analysis, Minimum Spanning Trees, topological indicators of the MST, Dynamic Time Warping, insurance sector, systemic risk

#### **1. Introduction**

Our motivation is the report of the European Insurance and Occupational Pensions Authority (EIOPA, 2017), which encourages studying the dynamics of interconnectedness between institutions. In the present article we use Dynamic Time Warping (DTW, an algorithm that determines the similarity between time series which may be of different length and distorted, i.e. stretched or shifted, in relation to the time axis) in two ways in the different market states: 1) to evaluate the suitability of Minimum Spanning Trees' topological indicators in the context of SR; 2) to construct the MST, to establish the similarity between the time series of the DeltaCoVaR.

In the paper we analyze the dynamics of indirect connections between insurance companies that result from market price channels. We propose, as in (Denkowska and Wanat, 2020), a hybrid approach to the analysis of interlinkage dynamics based on combining the copula-DCC-GARCH model and Minimum Spanning Trees (MST, a connected and acyclic graph with the smallest sum of edge weights; vertices are insurance institutions and edges connect those lying at relatively small distances). The MST topology shows the links between institutions in the context of the possibility of propagating SR. We establish the similarity of the time series of the MST topological indicators in periods of financial crises and outside of crises. Moreover, we examine the contribution of a single insurer to the systemic risk of the European insurance sector using the measure DeltaCoVaR (cf. Denkowska and Wanat, 2021).
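The DTW algorithm mentioned above admits a compact dynamic-programming sketch. This is a toy illustration with an absolute-difference local cost, not the implementation used in the paper:

```python
def dtw_distance(s, t):
    """Dynamic Time Warping distance between two numeric series of possibly
    different length, via the standard O(len(s)*len(t)) dynamic program."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # a point may match one-to-one, or be stretched/shifted in time
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# a time-shifted copy of a series stays at DTW distance 0
print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 3, 2, 1]))  # 0.0
```

This robustness to stretching and shifting along the time axis is precisely why DTW is preferred here over pointwise measures such as the Pearson correlation.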


SR in the financial sector was analyzed by Bierth et al. (2015), Kanno (2016), Giglio et al. (2016) and Kaserer (2018), while risk contagion is studied by Hautsch et al. (2015). Petitjean et al. (2011) show that the non-parametric DTW similarity measure performs better than other measures, such as the Pearson correlation coefficient.

### **2. Data and Methodology**

We study the stock quotes of 38 European insurance institutions, most of them from the list of the top 50 insurance companies in Europe based on total assets. We analyze weekly logarithmic returns for the period from January 7th, 2005 to December 20th, 2019.

As in (Denkowska and Wanat, 2020), we carry out the analysis of the dynamics of interconnections between insurance companies using a new hybrid approach based on the combination of the copula-DCC-GARCH model and MST. For each period *t*, we determine the "distance" matrix between insurance companies using a metric based on the tail dependence coefficients; from this matrix, using the Kruskal algorithm (Mantegna and Stanley, 1999), we construct an MST with 38 vertices and 37 edges.

Based on the trees thus obtained, we determine the time series of the following topological network indicators (Denkowska and Wanat, 2020): Average Path Length (APL, the average number of steps along the shortest paths connecting all possible pairs of network nodes), Maximum Degree (Max.deg, the highest number of edges arising from a vertex), the parameter "alpha" of the power law of the degree distribution, Network Diameter (the length of the longest geodesic path between any two nodes), Rich Club Effect (RCE, the tendency of well-connected vertices to also connect with one another), and Assortativity (a measure of the way vertices connect due to their degree).
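For illustration, the MST construction via Kruskal's algorithm and the computation of APL, Diameter and Max.deg can be sketched in pure Python. This is a toy sketch under our own naming, with a 4x4 distance matrix standing in for the paper's 38x38 tail-dependence-based matrix:

```python
from collections import deque

def kruskal_mst(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns the MST
    as an adjacency list (n vertices, n-1 edges)."""
    n = len(dist)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    adj = {i: [] for i in range(n)}
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # keep the edge only if no cycle
            parent[ri] = rj
            adj[i].append(j)
            adj[j].append(i)
    return adj

def shortest_path_lengths(adj, s):
    """BFS distances (in edges) from s to every vertex of the tree."""
    d = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def apl_diameter_maxdeg(adj):
    """Average Path Length, Diameter and Maximum Degree of the tree."""
    n = len(adj)
    total, diam = 0, 0
    for s in adj:
        d = shortest_path_lengths(adj, s)
        total += sum(d.values())
        diam = max(diam, max(d.values()))
    apl = total / (n * (n - 1))           # average over ordered pairs
    maxdeg = max(len(nbrs) for nbrs in adj.values())
    return apl, diam, maxdeg

# toy symmetric distance matrix (zero diagonal)
D = [[0, 1, 4, 4],
     [1, 0, 1, 4],
     [4, 1, 0, 1],
     [4, 4, 1, 0]]
mst = kruskal_mst(D)
print(apl_diameter_maxdeg(mst))
```

A shrinking tree during a crisis would show up here exactly as in the paper's discussion: a smaller APL and Diameter together with a larger Max.deg.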

Next we determine the DTW distance between the series in the following periods:
- February 8th, 2008 – March 1st, 2013: the period of two subprime crises and excessive public debt, the Subprime Mortgage Crisis (SMC);
- 7th, 2015 – September 23rd, 2016: the Immigrant crisis (I);
- 2016 – April 14th, 2017: the Crisis (FIC);
- May 18th, 2018 – December 20th, 2019: among the normal periods (N).

DTW is one of the algorithms for measuring the similarity between two time series of different length that may differ in time (Raihan, 2017).

By examining the contribution to SR of all the analyzed insurance institutions, we establish a standard DeltaCoVaR measure for each of them, described in the paper (Denkowska and Wanat, 2021).

### **2. Empirical results and discussion**

The MST topological indicators constructed based on tail dependencies present different behaviors in the distinguished market states (Fig. 1). The analysis shows that during crises MSTs shrink, as evidenced by the decreasing APL and Diameter and the growing Max.Deg., which is favorable to the potential spread of undesirable effects of shocks on the insurance market. MSTs are scale-free in the studied period. The mean RCE for k = 4, where k is the degree of the vertex, is on a similar level. MSTs are non-assortative according to the previous definition, as the assortativity values are negative throughout the period considered.

The DTW results indicate a greater similarity of the APL time series fragments, separately in the periods of SMC, I, FIC, and in normal periods. The Diameter time series is noticeably divided into the group of SMC and FIC crises and a separate group of Normal states. The Max.Degree indicator remained at a similar level during the crises. During the entire period 2005-2019 MSTs are scale-free, as alpha has values in the range (2, 3). MSTs are not assortative in the entire analyzed period.

We present the average DeltaCoVaR for all analyzed institutions. We study the similarity of a fragment of this time series from the SMC period to other periods of crises or normal periods. SMC stands out in a separate group. Thus, not only the size of the SR contribution is observable on the basis of the time series itself, but also the dynamics of this contribution as assessed by the DTW is different. Also, the FIC or I crises are outside the group of similarities with most normal periods.

Now, using Kruskal's algorithm, we construct MSTs based on the DTW distance matrix, which shows the similarity of the DeltaCoVaR time series between pairs of insurers in the states SMC, I, FIC and N. As a result of this analysis, we found that during the SMC crisis the MST graph has the most compressed structure, as evidenced by the smallest APL, the largest Max.Deg and the smallest Assortativity (Tab. 1).

### **3. Conclusions**

The presented analysis is the first work in the literature to assess the possibility of identifying SR in the insurance sector using a hybrid model that combines the copula-DCC-GARCH-based MST with the DTW algorithm. We use the DTW algorithm to analyze the similarity of the time series of the MST topological indicators across different market regimes. The results obtained confirm the possibility of identifying SR in the insurance sector using the presented model.

**Funding:** Cracow University of Economics, POTENTIAL Program, project number 26/EIM/2021/POT



**Figure 1.** Topological indicators.

**Table 1.** DTW (DeltaCoVaR)-based MST topological indicators.

|           | N     | SMC   | I     | FIC   |
|-----------|-------|-------|-------|-------|
| APL       | 7.18  | 6.50  | 8.67  | 8.42  |
| max.deg   | 4.00  | 7.00  | 4.00  | 4.00  |
| alpha     | 2.03  | 3.6   | 3.81  | 3.85  |
| RCE (k=2) | 0.43  | 0.44  | 0.30  | 0.14  |
| diameter  | 0.01  | 0.02  | 0.02  | 0.01  |
| Assort.   | -0.34 | -0.41 | -0.29 | -0.30 |

### **References**

BIERTH, C., IRRESBERGER, F., & WEIß, G. N. 2015. Systemic risk of insurers around the globe. *Journal of Banking & Finance*, **55**, 232–245.

DENKOWSKA, A., & WANAT, S. 2020. A Tail Dependence-Based MST and Their Topological Indicators in Modeling Systemic Risk in the European Insurance Sector. *Risks*, **8(2)**, 1–39.

DENKOWSKA, A., & WANAT, S. 2021. A dynamic MST-deltaCoVaR model of systemic risk in the European insurance sector. *Statistics in Transition new series*, **22(2)**, 173–188.

EIOPA. 2017. Systemic risk and macroprudential policy in insurance. Luxembourg: Publications Office of the EU.

GIGLIO, S., KELLY, B., & PRUITT, S. 2016. Systemic risk and the macroeconomy: An empirical evaluation. *Journal of Financial Economics*, **119(3)**, 457–471.

HAUTSCH, N., SCHAUMBURG, J., & SCHIENLE, M. 2015. Financial Network Systemic Risk Contributions. *Review of Finance*, **19(2)**, 685–738.

KANNO, M. 2016. The network structure and systemic risk in the global non-life insurance market. *Insurance: Mathematics and Economics*, **67**, 38–53.

KASERER, C., & KLEIN, C. 2018. Supplementary Material to 'Systemic Risk in Financial Markets: How Systemically Important Are Insurers?' *Journal of Risk and Insurance*, **86(3)**, 729–759.

PETITJEAN, F., KETTERLIN, A., & GANCARSKI, P. 2011. A global averaging method for dynamic time warping, with applications to clustering. *Pattern Recognition*, **44(3)**, 678–693.

RAIHAN, T. 2017. Predicting US recessions: A dynamic time warping exercise in economics. SSRN 3047649.

### TWO–STEP ESTIMATION OF MULTILEVEL LATENT CLASS MODELS WITH COVARIATES

Roberto Di Mari<sup>1</sup>, Zsuzsa Bakk<sup>2</sup>, Jennifer Oser<sup>3</sup> and Jouni Kuha<sup>4</sup>

<sup>1</sup> Department of Economics and Business, University of Catania (e-mail: roberto.dimari@unict.it)
<sup>2</sup> Department of Methodology and Statistics, Leiden University, The Netherlands
<sup>3</sup> Department of Politics and Government, Ben-Gurion University, Israel
<sup>4</sup> Department of Statistics, London School of Economics and Political Science, London, UK

ABSTRACT: In this article we present a two-step estimation approach applied to multilevel latent class analysis (LCA) with covariates. In the first step, the measurement model for the low-level and the high-level latent class variables is estimated. In the second step, covariates are added as predictors of latent class memberships, keeping the measurement model parameters fixed at their first-step values. Separating the estimation of the structural from the measurement model generates a significant computational gain with respect to simultaneous estimation, greatly simplifying model building. Finite sample properties of the resulting estimator are investigated in a broad simulation study.

KEYWORDS: multilevel latent class analysis; covariates; two-step estimation; pseudo maximum likelihood

#### 1 Introduction

Latent class (LC) analysis is an approach used to create a clustering of a set of observed variables, based on an underlying unknown classification. In the multilevel extension of the baseline LC model, the respondents are assumed to belong to higher level groups - e.g. students nested in schools, or households in countries. Multilevel LCA is becoming increasingly popular in various fields. In most applications the focus is on lower level clustering, and on the difference in the distribution of the lower level classes in higher level units.

In LCA, creating a clustering is usually only the first step for applied researchers. The research interest often lies in including external variables as clustering predictors at a later stage of the analysis. While in single level LCA different approaches are available for relating LC membership to external variables, in multilevel settings only two classical approaches are used, both known to be suboptimal: the one-step and the classical three-step approach. Using the one-step approach, the full LC model including covariates is estimated simultaneously (for example, Mutz & Daniel, 2013). Using the alternative three-step approach, after estimating the measurement model in step 1, respondents are assigned to latent classes in step 2, and this posterior assigned class membership is related to the predictors of interest through a multinomial logistic regression in the third step (for example, Tomczyk *et al.*, 2015). However, in the second step a classification error is introduced that, if not corrected for, induces systematic bias in the step-3 model.

The multilevel LC model of Equation (1) can be parametrized by means of

*t* )

*t* )

, (3)

, (4)

*<sup>s</sup>*=<sup>2</sup> exp(γ*sm*) (5)

log*P*(Y*i j*), (6)

*K* ∏ *k*=1

. The one step approach finds θˆ by maximizing

log*P*(Y*j*|Z*j*), (9)

. (7)

*P*(*Yijk*|*Xi j* = *t*),

(8)

, where

*<sup>s</sup>*=<sup>1</sup> exp(γ0*sm* +γ1*sZ*<sup>1</sup> *<sup>j</sup>* +γ2*sZ*2*i j*)

1+exp(β*<sup>k</sup>*

*<sup>l</sup>*=<sup>2</sup> δ0*<sup>l</sup>*

1+∑*<sup>M</sup>*

1+∑*<sup>T</sup>*

Under the parametrizations (3), (5) and (4), given a sample of *J* groups,

,...,β*<sup>K</sup> TJ* ) .

Level 1 and level 2 covariates can be included to predict class membership. Denoting one level 2 covariate by *Z*<sup>1</sup> *<sup>j</sup>* and a level 1 covariate by *Z*2*i j* the multinomial logistic regression for *Xi j* with a random intercept can be written

A random slope for the level 1 covariate can be obtained by replacing γ2*<sup>t</sup>* by γ<sup>2</sup> *jt*. Level 2 covariates can be used also to predict group class membership, but for simplicity we present only a model with covariates on the level 1 LC

Under the parametrization (7) that now includes covariates, the model for

, can be specified as

which depends on the vector of unknown parameters θ = (θ1,θ2)

*J* ∑ *j*=1

*P*(*Xi j* = *t*|*Wj* = *m*,*Z*<sup>1</sup> *<sup>j</sup>*,*Z*2*i j*)

*J* ∑ *j*=1

21

*<sup>P</sup>*(*Xi j* <sup>=</sup> *<sup>t</sup>*|*Wj*,*Z*<sup>1</sup> *<sup>j</sup>*,*Z*2*i j*) = exp(γ0*tm* <sup>+</sup>γ1*tZ*<sup>1</sup> *<sup>j</sup>* <sup>+</sup>γ2*tZ*2*i j*) ∑*T*

*<sup>P</sup>*(*Yijk*|*Xi j* <sup>=</sup> *<sup>t</sup>*) = exp(β*<sup>k</sup>*

*<sup>P</sup>*(*Wj* <sup>=</sup> *<sup>m</sup>*) = exp(δ0*m*)

*<sup>P</sup>*(*Xi j* <sup>=</sup> *<sup>t</sup>*|*Wj* <sup>=</sup> *<sup>m</sup>*) = exp(γ*tm*)

for the group-level membership probabilities, and

for the individual latent class probabilities.

the model parameters can be found by maximizing

with respect to θ<sup>1</sup> = (δ02,...,δ0*M*,β<sup>1</sup>

Y*i j*|Z*j*, where Z*<sup>j</sup>* = (*Z*<sup>1</sup> *<sup>j</sup>*,*Z*2*i j*)

*M* ∑ *m*=1

θ<sup>2</sup> = (γ12,..., γ1*<sup>T</sup>* , γ22,..., γ2*<sup>T</sup>* )

*P*(*Wj* = *m*)

*T* ∑ *t*=1

log*L*(θ) =

as:

variable.

*P*(Y*i j*|Z*j*) =

with respect to θ.

log*L*(θ1) =

multinomial logistic regressions as follows

for the item-class probabilities,

In the current paper we introduce a two-step approach, extending Bakk & Kuha (2018)'s work to the multilevel LC model as an alternative to the one-step and classical three-step approaches, since both are known to be sub-optimal in single level LC models.

#### 2 The multilevel latent class model

Consider the vector of responses Y*i j* = (*Yi j*1,...,*YijK*), where *Yijk* denotes the response of individual *i* in group *j* on the *k*-th categorical indicator variable, with 1 ≤ *k* ≤ *K* and 1 ≤ *j* ≤ *J*, where *K* denotes the number of categorical indicators and *J* the number of level 2 units. In addition, we let *nj* denote the number of level 1 units within the *j*-th level 2 unit, with 1 ≤ *j* ≤ *J*. For simplicity of exposition, we focus on dichotomous indicators.

Adopting the nonparametric approach (Laird, 1978), multilevel LC analysis is an extension of the LC models (Goodman, 1974), assuming that level 1 units belong to one of the *T* categories belong to *T* categories ("latent classes") of an underlying categorical latent variable *X*, whereas level 2 units belong to one of the *M* categories of the group level latent class *W*. The model for Y*i j* can then be specified as

$$P(\mathbf{Y}\_{ij}) = \sum\_{m=1}^{M} P(W\_j = m) \sum\_{t=1}^{T} P(X\_{ij} = t | W\_j = m) P(\mathbf{Y}\_{ij} | X = t) \tag{1}$$

where *P*(*Wj* = *m*) = π*<sup>m</sup>* is the probability that group *j* belongs to class *m*, and *P*(*Xi j* = *t*|*Wj* = *m*) is the probability that individual *i* in group *j* belongs to class *t* given group membership *m*. The term *P*(Y*i j*|*X* = *t*) is the class-specific probability of observing a pattern of responses given that a person belongs to class *t*, under the common assumption that the item-conditional probabilities do not depend on the level 2 unit (Vermunt, 2003; Lukociene *et al.*, 2010). Furthermore, we make the "local independence" assumption that the *K* indicators are independent within latent classes, leading to

$$P(\mathbf{Y}\_{ij}) = \sum\_{m=1}^{M} P(W\_j = m) \sum\_{t=1}^{T} P(X\_{ij} = t | W\_j = m) \prod\_{k=1}^{K} P(Y\_{ijk} | X\_{ij} = t). \tag{2}$$
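The marginal probability in Equation (1), combined with the local-independence factorisation of the item responses, can be evaluated directly. The sketch below uses invented parameter values for a toy configuration (*M* = 2, *T* = 3, *K* = 4); nothing here is taken from the paper.

```python
import numpy as np

# Invented parameter values for a toy multilevel LC model:
# M = 2 group classes, T = 3 individual classes, K = 4 dichotomous items.
pi_m = np.array([0.6, 0.4])                 # P(W_j = m)
p_tm = np.array([[0.7, 0.2],                # P(X_ij = t | W_j = m);
                 [0.2, 0.3],                # each column sums to 1 over t
                 [0.1, 0.5]])
theta_tk = np.array([[0.9, 0.8, 0.7, 0.9],  # P(Y_ijk = 1 | X_ij = t)
                     [0.5, 0.4, 0.6, 0.5],
                     [0.1, 0.2, 0.1, 0.2]])

def p_response_pattern(y):
    """Marginal P(Y_ij = y): sum over group classes m and individual
    classes t, with local independence across the K items."""
    y = np.asarray(y)
    # P(Y_ij = y | X_ij = t) = prod_k theta^y * (1 - theta)^(1 - y)
    p_y_given_t = np.prod(theta_tk**y * (1 - theta_tk)**(1 - y), axis=1)
    return float(np.sum(pi_m * (p_tm * p_y_given_t[:, None]).sum(axis=0)))

print(p_response_pattern([1, 1, 0, 1]))
```

Summing this function over all 2*<sup>K</sup>* response patterns returns 1, which is a quick sanity check on any implementation of the model.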

The multilevel LC model of Equation (1) can be parametrized by means of multinomial logistic regressions as follows

$$P(Y\_{ijk} = 1 | X\_{ij} = t) = \frac{\exp(\beta\_t^k)}{1 + \exp(\beta\_t^k)},\tag{3}$$

for the item-class probabilities,



$$P(W\_j = m) = \frac{\exp(\delta\_{0m})}{1 + \sum\_{l=2}^{M} \exp(\delta\_{0l})},\tag{4}$$

for the group-level membership probabilities, and

$$P(X\_{ij} = t | W\_j = m) = \frac{\exp(\gamma\_{tm})}{1 + \sum\_{s=2}^{T} \exp(\gamma\_{sm})} \tag{5}$$

for the individual latent class probabilities.
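The parametrizations (3)–(5) map unconstrained logit parameters to probabilities, with the first category as baseline. A minimal sketch of that mapping, using invented values:

```python
import numpy as np

def class_probs(gamma_m):
    """P(X_ij = t | W_j = m) as in Equation (5): multinomial logit with
    class t = 1 as the baseline category (its logit is fixed at 0)."""
    logits = np.concatenate([[0.0], gamma_m])  # prepend the baseline
    e = np.exp(logits)
    return e / e.sum()

def item_prob(beta_tk):
    """P(Y_ijk = 1 | X_ij = t) as in Equation (3): a binary logit."""
    return np.exp(beta_tk) / (1.0 + np.exp(beta_tk))

# Illustrative values, not estimates from the paper
print(class_probs(np.array([0.5, -1.0])))  # T = 3 class probabilities
print(item_prob(1.2))
```

The same construction applies to Equation (4) for the group-level probabilities based on the δ0*m*.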

Under the parametrizations (3), (4) and (5), given a sample of *J* groups, the model parameters can be found by maximizing

$$\log L(\boldsymbol{\theta}\_{1}) = \sum\_{j=1}^{J} \log P(\mathbf{Y}\_{j}),\tag{6}$$

with respect to θ1 = (δ02, ..., δ0*M*, γ21, ..., γ*T M*, β1<sup>1</sup>, ..., β*T*<sup>*K*</sup>), the vector collecting the parameters of (3), (4) and (5).

Level 1 and level 2 covariates can be included to predict class membership. Denoting one level 2 covariate by *Z*<sup>1</sup> *<sup>j</sup>* and a level 1 covariate by *Z*2*i j* the multinomial logistic regression for *Xi j* with a random intercept can be written as:

$$P(\mathbf{X}\_{ij} = t | W\_j, \mathbf{Z}\_{1j}, \mathbf{Z}\_{2ij}) = \frac{\exp(\gamma\_{0tm} + \gamma\_{1t} Z\_{1j} + \gamma\_{2t} Z\_{2ij})}{\sum\_{s=1}^{T} \exp(\gamma\_{0sm} + \gamma\_{1s} Z\_{1j} + \gamma\_{2s} Z\_{2ij})}.\tag{7}$$

A random slope for the level 1 covariate can be obtained by replacing γ2*t* with γ2*jt*. Level 2 covariates can also be used to predict group class membership, but for simplicity we present only a model with covariates on the level 1 LC variable.

Under the parametrization (7), which now includes covariates, the model for Y*i j*|Z*j*, where Z*j* = (*Z*1*j*, *Z*2*i j*), can be specified as

$$P(\mathbf{Y}\_{ij}|\mathbf{Z}\_j) = \sum\_{m=1}^{M} P(W\_j = m) \sum\_{t=1}^{T} P(\mathbf{X}\_{ij} = t | W\_j = m, Z\_{1j}, Z\_{2ij}) \prod\_{k=1}^{K} P(Y\_{ijk} | \mathbf{X}\_{ij} = t),\tag{8}$$

which depends on the vector of unknown parameters θ = (θ1, θ2), where θ2 = (γ12, ..., γ1*T*, γ22, ..., γ2*T*). The one-step approach finds the estimate of θ by maximizing

$$\log L(\boldsymbol{\theta}) = \sum\_{j=1}^{J} \log P(\mathbf{Y}\_j | \mathbf{Z}\_j), \tag{9}$$

with respect to θ.

#### 3 A stepwise estimator for multilevel LC model with covariates

Step 1: the ML estimate of θ1 is found as the maximizer of the log-likelihood (6) of the simple multilevel LC model without covariates.


Step 2: covariates are added to the model. The log-likelihood (9) is maximized only with respect to θ2, while θ1 is kept fixed at its first-step estimate.

Our two-step estimator is an instance of pseudo maximum likelihood estimation (Gong & Samaniego, 1981). Such estimators are consistent under very general regularity conditions (see, for instance, Gourieroux & Monfort, 1995). We propose to compute the step-two standard errors so that they account for the uncertainty about the parameters fixed in step 1, applying the approach proposed by Bakk & Kuha (2018) for single-level LC models to the multilevel setting.

We will set up a simulation study to assess the finite-sample properties of the proposed estimator. To do so, we will generate data with varying sample sizes at both the lower and the higher level, with different levels of class separation and of association between the covariates and class membership. We expect the proposed two-step estimator to be unbiased (similarly to the one-step approach), as opposed to the three-step approach, and to be slightly less efficient than the one-step estimator.
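The computational pattern of the two-step (pseudo maximum likelihood) estimator — maximize a first criterion over θ1, then maximize the full criterion only over θ2 with θ1 held fixed — can be sketched on a deliberately simplified toy problem. The least-squares criteria below are stand-ins chosen for brevity; they are not the multilevel LC likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data: theta1 plays the role of a "measurement" parameter,
# theta2 of a "structural" parameter (all values invented).
y = rng.normal(loc=2.0, scale=1.0, size=500)
x = rng.normal(size=500)
z = 2.0 + 0.8 * x + rng.normal(size=500)

# Step 1: estimate theta1 from the first criterion alone
theta1_hat = minimize(lambda t: np.sum((y - t) ** 2), x0=[0.0]).x[0]

# Step 2: maximize the joint criterion only over theta2,
# keeping theta1 fixed at its step-1 value
theta2_hat = minimize(lambda t2: np.sum((z - theta1_hat - t2 * x) ** 2),
                      x0=[0.0]).x[0]
print(theta1_hat, theta2_hat)
```

The key point is the second call: the objective treats `theta1_hat` as a known constant, which is exactly why step-two standard errors understate the uncertainty unless they are corrected as in Bakk & Kuha (2018).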

#### References


BAKK, Z., & KUHA, J. 2018. Two-Step Estimation of Models Between Latent Classes and External Variables. *Psychometrika*, 83, 871–892.

FINCH, W. HOLMES, & FRENCH, BRIAN F. 2014. Multilevel latent class analysis: Parametric and nonparametric models. *The Journal of Experimental Education*, 82(3), 307–333.

GONG, GAIL, & SAMANIEGO, FRANCISCO J. 1981. Pseudo maximum likelihood estimation: Theory and applications. *The Annals of Statistics*, 9(4), 861–869.

GOODMAN, LEO A. 1974. The Analysis of Systems of Qualitative Variables When Some of the Variables Are Unobservable. Part I: A Modified Latent Structure Approach. *American Journal of Sociology*, 79(5), 1179–1259.

GOURIEROUX, CHRISTIAN, & MONFORT, ALAIN. 1995. *Statistics and Econometric Models*. Vol. 1. Cambridge University Press.

LAIRD, NAN. 1978. Nonparametric maximum likelihood estimation of a mixing distribution. *Journal of the American Statistical Association*, 73(364), 805–811.

LUKOCIENE, O., VARRIALE, R., & VERMUNT, J. K. 2010. The simultaneous decision(s) about the number of lower- and higher-level classes in multilevel latent class analysis. *Sociological Methodology*, 40(1), 247–283.

MUTZ, R., & DANIEL, H. D. 2013. University and student segmentation: Multilevel latent-class analysis of students' attitudes towards research methods and statistics. *British Journal of Educational Psychology*, 83(2), 280–304.

TOMCZYK, SAMUEL, HANEWINKEL, REINER, & ISENSEE, BARBARA. 2015. Multiple substance use patterns in adolescents: A multilevel latent class analysis. *Drug and Alcohol Dependence*, 155, 208–214.

VERMUNT, JEROEN K. 2003. Multilevel Latent Class Models. *Sociological Methodology*, 33(1), 213–239.

### CLUSTERING DATA WITH NON-IGNORABLE MISSINGNESS USING SEMI-PARAMETRIC MIXTURE MODELS

Marie Du Roy de Chaumaray<sup>1</sup> and Matthieu Marbac<sup>1</sup>

<sup>1</sup> Univ. Rennes, Ensai, CNRS, CREST - UMR 9194, F-35000 Rennes, France, (e-mail: marie.du-roy-de-chaumaray@ensai.fr, matthieu.marbac-lourdelle@ensai.fr)

ABSTRACT: We are concerned with clustering continuous data sets subject to non-ignorable missingness. Clustering is achieved by a semi-parametric mixture that, for each subject, considers the joint distribution of the observed variables and the response-data indicator vector. Estimation is performed by maximizing the smoothed likelihood via a Majorization-Minimization algorithm.

KEYWORDS: Clustering, Mixture Model, Non-ignorable Missingness, Smoothed Likelihood.

#### 1 Introduction

Mixture models allow clustering to be achieved in a rigorous framework, but the case of data subject to missingness is generally neglected. Moreover, the missing not at random scenario (MNAR; Little & Rubin, 2019), where the missingness mechanism depends on the missing values even conditionally on the observed variables, generally requires the missingness mechanism to be modelled in order to obtain consistent estimators. However, few statistical methods permit this scenario for clustering.

Two clustering approaches allow data subject to the MNAR scenario to be analyzed. Chi *et al.* (2016) introduce the *K*-POD algorithm, which extends the *K*-means algorithm to the case of missing data even if the missingness mechanism is unknown. However, this approach suffers from the standard drawbacks of the *K*-means algorithm (*i.e.,* assumptions of spherical clusters and equal cluster proportions). Alternatively, using a *selection model* approach, Miao *et al.* (2016) proposed specific Gaussian mixtures and *t*-mixtures to analyze data under the MNAR scenario. For such an approach, the missingness mechanism must be specified (probit and logit distributions are generally used). However, this approach produces strong bias if the parametric assumptions (made on the distribution of the variables or on the missingness mechanism) are violated.


In this paper, clustering is performed via a mixture model that uses a *pattern-mixture model* approach with non-parametric distributions. Thus, no assumptions are made on the data distribution or on the missingness mechanism, except that the variables are independent within components. Note that this assumption is quite standard for semi-parametric mixtures (Levine *et al.*, 2011; Kasahara & Shimotsu, 2014). For each mixture component, we estimate, for each variable, its probability of being observed and its conditional distribution given that the variable is observed. We emphasize that our concern is clustering and not imputation or density estimation. Indeed, without additional assumptions, the distribution of the variables within components cannot be estimated by our procedure.

#### 2 Mixture for nonignorable missingness

#### 2.1 The data

The observed sample is composed of *n* independent and identically distributed subjects arising from *K* homogeneous subpopulations. Each subject is described by *d* continuous variables, and some realizations of these variables may be unobserved. The probability, for a variable, of not being observed is allowed to depend on the values of the variable itself and on the subpopulation membership.

Each subject *i* is described by a vector of three variables (*Xi*, *Ri*, *Zi*), where *Xi* ∈ R*<sup>d</sup>* is a set of continuous variables, *Ri* = (*Ri*1,...,*Rid*) ∈ {0,1}*<sup>d</sup>* indicates whether *Xi j* is observed (*Ri j* = 1) and *Zi* = (*Zi*1,...,*ZiK*) indicates the subpopulation of subject *i* (*Zik* = 1 if subject *i* belongs to subpopulation *k* and *Zik* = 0 otherwise). Each subject belongs to exactly one subpopulation, so that ∑*<sup>K</sup>k*=1 *Zik* = 1. The realizations of *Zi* are unobserved, and part of the realizations of *Xi* may be unobserved too. Therefore, the observed variables for subject *i* are (*X*<sup>obs</sup>*i*, *Ri*), where *X*<sup>obs</sup>*i* is composed of the elements of *Xi* such that *Ri j* = 1, and the unobserved variables for subject *i* are (*X*<sup>miss</sup>*i*, *Zi*), where *X*<sup>miss</sup>*i* is composed of the elements of *Xi* such that *Ri j* = 0.
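To make this data structure concrete, the triplet (*Xi*, *Ri*, *Zi*) can be simulated with an observation probability that depends on the value of the variable itself (MNAR). The Gaussian components and the logistic observation probability below are illustrative assumptions, not part of the model, which leaves these distributions unspecified.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, K = 1000, 3, 2

# Subpopulation labels Z_i (integer labels instead of one-hot vectors)
z = rng.choice(K, size=n, p=[0.5, 0.5])

# Continuous variables X_i: component-specific Gaussians (illustrative)
means = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
x = means[z] + rng.normal(size=(n, d))

# MNAR mechanism: larger values of X_ij are more likely to be observed,
# so missingness depends on the (possibly unobserved) value itself
prob_obs = 1.0 / (1.0 + np.exp(-(x - 1.5)))
r = rng.random((n, d)) < prob_obs     # R_ij = 1 means X_ij is observed
x_obs = np.where(r, x, np.nan)        # mask the unobserved entries
print(np.isnan(x_obs).mean())         # overall missingness rate
```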

#### 2.2 General mixture model

We use mixture models for the purpose of clustering and not for density estimation. Clustering aims to estimate the subpopulation memberships given the observed variables (*i.e.,* the realization of *Zi* given (*X*<sup>obs</sup>*i*, *Ri*)) without assumptions on the missingness mechanism (*i.e.,* no assumption is made on the conditional distribution of *Ri* | *Xi*,*Zi*). The probability distribution function (pdf) of (*Xi*, *Ri*) for subpopulation *k* (*i.e., Zik* = 1) is denoted by *gk*(·). Using the *pattern-mixture model*, the pdf of (*Xi*, *Ri*) is defined by the pdf of the *K*-component mixture


$$g(\mathbf{x}\_{i}, r\_{i}; \boldsymbol{\theta}) = \sum\_{k=1}^{K} \pi\_{k} g\_{k}(\mathbf{x}\_{i}, r\_{i}; \boldsymbol{\tau}\_{k}) \text{ with } g\_{k}(\mathbf{x}\_{i}, r\_{i}; \boldsymbol{\tau}\_{k}) = g\_{k}(r\_{i}; \boldsymbol{\tau}\_{k}) \, g\_{k}(\mathbf{x}\_{i} \mid r\_{i}), \quad \text{(1)}$$

where π*<sup>k</sup>* > 0, ∑*<sup>K</sup>k*=1 π*<sup>k</sup>* = 1 and *gk*(·; τ*k*) is the pdf of component *k*. The couples of variables (*Xi j*,*Ri j*) are assumed to be conditionally independent given *Zi*. Thus, the distribution of *Ri* | *Zi* is a product of Bernoulli distributions and the conditional density of *Xi* | *Zi*,*Ri* is defined as the product of univariate densities. Thus, from (1), the pdf of component *k* can also be written as

$$g\_k(r\_i; \boldsymbol{\tau}\_k) = \prod\_{j=1}^d \tau\_{kj}^{r\_{ij}} (1 - \tau\_{kj})^{1 - r\_{ij}} \text{ and } g\_k(\mathbf{x}\_i \mid r\_i) = \prod\_{j=1}^d p\_{kj}^{r\_{ij}}(x\_{ij}) \, q\_{kj}^{1 - r\_{ij}}(x\_{ij}),$$

where τ*<sup>k</sup>* = (τ*k*1,..., τ*kd*), τ*k j* is the probability that *Xi j* is observed given that subject *i* belongs to subpopulation *k*, *pk j*(·) is the conditional density of *Xi j* given *Zik* = 1 and *Ri j* = 1, and *qk j*(·) is the conditional density of *Xi j* given *Zik* = 1 and *Ri j* = 0. Integrating out the unobserved variables *X*<sup>miss</sup>*i*, we obtain

$$g(\mathbf{x}\_i^{\text{obs}}, r\_i; \boldsymbol{\theta}) = \sum\_{k=1}^{K} \pi\_k g\_k(\mathbf{x}\_i^{\text{obs}}, r\_i; \boldsymbol{\tau}\_k), \text{ with } g\_k(\mathbf{x}\_i^{\text{obs}}, r\_i; \boldsymbol{\tau}\_k) = g\_k(r\_i; \boldsymbol{\tau}\_k) \prod\_{j=1}^{d} p\_{kj}^{r\_{ij}}(x\_{ij}),$$

where θ groups all the finite-dimensional parameters (π*<sup>k</sup>* and τ*k*) and all the infinite-dimensional parameters *pk j*(·). For clustering, the *pattern-mixture model* should be preferred to the *selection model* because it does not require the missingness mechanism to be specified, allows this mechanism to be nonignorable, and permits easy computation of the conditional probabilities of subpopulation membership given the observed values, defined by

$$\mathbb{P}(Z\_{ik} = 1 \mid \boldsymbol{x}\_i^{\mathrm{obs}}, r\_i) = \frac{\pi\_k \, g\_k(\boldsymbol{x}\_i^{\mathrm{obs}}, r\_i; \boldsymbol{\tau}\_k)}{\sum\_{\ell=1}^K \pi\_\ell \, g\_\ell(\boldsymbol{x}\_i^{\mathrm{obs}}, r\_i; \boldsymbol{\tau}\_\ell)}.$$

Note that we do not need to estimate *qk j*(·) for the clustering purpose, but this implies that we are not able to estimate the distribution of *Xi* | *Zi*. Thus, this approach does not permit estimation of the marginal distribution of *Xi* | *Zi* without additional assumptions on the missingness mechanism. This implies that the proposed approach can be used for clustering but not for density estimation. Model identifiability is obtained by extending Theorem 8 of Allman *et al.* (2009). Parameter estimation is performed by maximizing the smoothed likelihood over θ via an MM algorithm, as in Levine *et al.* (2011). More details are given in Du Roy de Chaumaray & Marbac (2020).
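The posterior membership probabilities above are straightforward to compute once the component densities are fixed. The sketch below replaces the paper's nonparametric densities *pk j*(·) by Gaussians purely for illustration, and works in log space for numerical stability; all parameter values are invented.

```python
import numpy as np
from scipy.stats import norm

def posterior_membership(x_obs, r, pi, tau, mu, sigma):
    """P(Z_ik = 1 | x_i^obs, r_i) for one subject. Gaussians stand in
    for the nonparametric densities p_kj; x_obs holds np.nan where r = 0."""
    K, d = tau.shape
    log_g = np.log(pi).copy()
    obs = r == 1
    for k in range(K):
        # Bernoulli part g_k(r_i; tau_k)
        log_g[k] += np.sum(r * np.log(tau[k]) + (1 - r) * np.log(1 - tau[k]))
        # Observed-variable part: product over observed j of p_kj(x_ij)
        log_g[k] += np.sum(norm.logpdf(x_obs[obs], mu[k, obs], sigma[k, obs]))
    w = np.exp(log_g - log_g.max())   # normalize in log space
    return w / w.sum()

# Invented parameters: K = 2 components, d = 2 variables
pi = np.array([0.4, 0.6])
tau = np.array([[0.9, 0.8], [0.5, 0.6]])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.ones((2, 2))
print(posterior_membership(np.array([2.8, np.nan]), np.array([1, 0]),
                           pi, tau, mu, sigma))
```

Note how the missingness pattern *r* itself contributes to the posterior through the τ*k j* terms, which is what makes the mechanism informative for clustering.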


#### 3 Conclusion

The proposed method allows continuous data sets with non-ignorable missingness to be clustered with no assumption beyond the independence of the variables within components. Selecting the number of components is a difficult task that could be achieved by extending the approach of Kasahara & Shimotsu (2014) to mixed-type data. A procedure for bandwidth selection should also be investigated.

#### References


ALLMAN, ELIZABETH S., MATIAS, CATHERINE, & RHODES, JOHN A. 2009. Identifiability of parameters in latent structure models with many observed variables. *The Annals of Statistics*, 37(6A), 3099–3132.

CHI, JOCELYN T., CHI, ERIC C., & BARANIUK, RICHARD G. 2016. k-POD: A method for k-means clustering of missing data. *The American Statistician*, 70(1), 91–99.

DU ROY DE CHAUMARAY, MARIE, & MARBAC, MATTHIEU. 2020. Clustering Data with Non-ignorable Missingness using Semi-Parametric Mixture Models. *arXiv preprint arXiv:2009.07662*.

KASAHARA, HIROYUKI, & SHIMOTSU, KATSUMI. 2014. Non-parametric identification and estimation of the number of components in multivariate mixtures. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 76(1), 97–111.

LEVINE, MICHAEL, HUNTER, DAVID R., & CHAUVEAU, DIDIER. 2011. Maximum smoothed likelihood for multivariate mixtures. *Biometrika*, 98(2), 403–416.

LITTLE, RODERICK J. A., & RUBIN, DONALD B. 2019. *Statistical Analysis with Missing Data*. Vol. 793. John Wiley & Sons.

MIAO, WANG, DING, PENG, & GENG, ZHI. 2016. Identifiability of normal and normal mixture models with nonignorable missing data. *Journal of the American Statistical Association*, 111(516), 1673–1683.

### SPATIAL-TEMPORAL CLUSTERING BASED ON B-SPLINES: ROBUST MODELS WITH APPLICATIONS TO COVID-19 PANDEMIC

Pierpaolo D'Urso<sup>1</sup>, Livia De Giovanni<sup>2</sup> and Vincenzina Vitale<sup>1</sup>

<sup>1</sup> Department of Social and Economic Sciences, Sapienza University of Rome, P.za Aldo Moro, 5 - 00185 Rome, Italy, (e-mail: pierpaolo.durso@uniroma1.it, vincenzina.vitale@uniroma1.it)

<sup>2</sup> Department of Political Sciences, LUISS University, Viale Romania, 32 - 00197 Rome, Italy, (e-mail: ldegiovanni@luiss.it)

ABSTRACT: Robust fuzzy *C*-Medoids clustering models based on B-splines with a spatial penalty term are proposed to cluster Italian regions according to the daily time series of cumulative COVID-19 cases over population (per 10000 inhabitants) and of cumulative COVID-19 deaths over population (per 10000 inhabitants), spanning from 2020-02-24 to 2021-02-08. Both spatial and time components are efficiently embedded in the model. Furthermore, the use of B-spline coefficients allows the computational burden to be reduced considerably.

KEYWORDS: B-splines, robust distance, PAM algorithm, COVID-19 data, contiguity matrix

#### 1 Introduction

The new 2019 coronavirus that has originated the COVID-19 disease spread out quickly from the Chinese city of Wuhan worldwide giving rise to a pandemic whose huge effects on national health systems are still evident. It has been well known that Italy, its Northern regions in particular, was the first country facing the outbreak in February 2020 to such an extent that the Italian government needed to impose a nationwide lockdown (on 9 March 2020) to dastrically reduce the incidence rate and the overfloading of the intensive care units. Italy faced other two outbreak waves, in October 2020 and then in March 2021, involving all territories, the Southern ones too. Three and then five risk profiles have been identified by the Scientific committee engaged by the national authorities to monitor pandemic's dynamic in order to differentiate the restrictive measures in the territories. At the beginning of 2021, the COVID-19 vaccination campaign has been started and is ongoing to this day. The Italian Civil Protection Department provides, daily, all information related to the COVID-19 outbreak in Italy at the regional level and, for some variables, also at the provincial level even if data reliability is low due to misreporting and lack of uniformity in the number of swabbed people per region. In this study we focused on the daily time-series of the cumulative cases over population (per 10000 inhabitants) and on the cumulative deaths over population (per 10000 inhabitants), spanning from 2020-02-24 to 2021-02-08, at a regional level. The aim of the work is to cluster Italian regions based on the aforementioned rates, separately. Being spatial time series, the proposed clustering appoaches embedded both the spatial and time components by including a spatial penalization term in the objective function, as proposed by D'Urso *et al.*, 2019, and a suitable transformation of the time series onto (finite dimensional) vectors of cubic B-splines basis coefficients. 
To deal with noisy data and outliers, three robust approaches have been proposed: one based on an exponential transformation of the distance, the other two based on the trimming and on the noise approach, respectively. The paper is structured as follows: Section 2 focuses on the proposed clustering models, while Section 3 presents the application to COVID-19 data.

#### 2 The Spatial-temporal clustering based on B-splines: robust methods

In a formal way, a spatial-time data matrix can be algebraically defined as (D'Urso, 2000):

$$\mathbf{X} \equiv \{ \mathbf{x}\_i(t) : i = 1, \dots, I; t = 1, \dots, T \} \tag{1}$$


where *i* indicates the generic spatial unit and *t* the generic time. The time series {(*t*,*xi*(*t*))} could be seen as the result of collecting a variable *X* on unit *i* at the *T* times {*t* = 1,...,*T*}. We can model each time series by a simple linear least-squares fit as:

$$x\_i(t) = \sum\_{s=1}^{p} b\_i^{s} B\_s(t) + \varepsilon\_i, \quad t = 1, \dots, T$$

where {*B<sub>s</sub>*(·), *s* = 1,...,*p*} is a *p*-dimensional functional basis. For the *I* time series x*<sub>i</sub>*, *i* = 1,...,*I*, we will have *I* vectors of fitted coefficients b*<sub>i</sub>* = (*b*<sup>1</sup><sub>*i*</sub>,...,*b<sup>s</sup><sub>i</sub>*,...,*b<sup>p</sup><sub>i</sub>*)′, *i* = 1,...,*I*. For the sake of simplicity, we show the results with reference to the Spatial-Temporal based on Exponential distance Fuzzy *C*-Medoids clustering model (ST-BS-Exp-FCMd), even if the same problem can be addressed by using the Spatial-Temporal Fuzzy Trimmed *C*-Medoids clustering model (ST-BS-Tr-FCMd) and the Spatial-Temporal Fuzzy *C*-Medoids clustering model

with Noise Cluster (ST-BS-Noise-FCMd). The ST-BS-Exp-FCMd model is defined as follows:


$$\begin{aligned} \min &: \sum\_{i=1}^{I} \sum\_{c=1}^{C} u\_{ic}^{m} [1 - \exp(-\beta ||\mathbf{b}\_{i} - \widetilde{\mathbf{b}}\_{c}||^{2})] + \frac{\gamma}{2} \sum\_{i=1}^{I} \sum\_{c=1}^{C} u\_{ic}^{m} \sum\_{i'=1}^{I} \sum\_{c' \in C\_{c}} p\_{ii'} u\_{i'c'}^{m} \\ &\text{s.t.} \sum\_{c=1}^{C} u\_{ic} = 1, \; u\_{ic} \ge 0 \end{aligned} \tag{2}$$

where b*<sub>i</sub>* and b̃*<sub>c</sub>* are the vectors of coefficients of the B-spline representation of the *i*-th spatial time series and of the *c*-th spatial medoid (*c* = 1,...,*C*), respectively, while *m* > 1 is the well-known fuzziness parameter. The β parameter, set as the inverse of the variability of the data, appropriately rescales the distance according to that variability.

As far as the spatial penalty term is concerned, γ is the tuning parameter of the spatial information. The spatial proximity among the *I* objects has been taken into account by means of the contiguity matrix P of order *I*×*I*, whose generic element *p<sub>ii′</sub>* = 1 if the object *i* is contiguous to the object *i*′, and 0 otherwise. The *u<sub>ic</sub>* is the membership degree of the unit *i* belonging to the cluster *c*:

$$u\_{ic} = \frac{\left[[1 - \exp(-\beta||\mathbf{b}\_{i} - \widetilde{\mathbf{b}}\_{c}||^{2})] + \gamma \sum\_{i'=1}^{I} \sum\_{c' \in C\_{c}} p\_{ii'} u\_{i'c'}^{m}\right]^{-\frac{1}{m-1}}}{\sum\_{c'=1}^{C} \left[[1 - \exp(-\beta||\mathbf{b}\_{i} - \widetilde{\mathbf{b}}\_{c'}||^{2})] + \gamma \sum\_{i'=1}^{I} \sum\_{c'' \in C\_{c'}} p\_{ii'} u\_{i'c''}^{m}\right]^{-\frac{1}{m-1}}} \tag{3}$$

For γ = 0, the ST-BS-Exp-FCMd model reduces to its no-spatial version, the BS-Exp-FCMd clustering model, while the ST-BS-Tr-FCMd and ST-BS-Noise-FCMd models reduce to their no-spatial versions, the BS-Tr-FCMd and BS-Noise-FCMd models, respectively.
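As a concrete illustration, the membership update in (3) can be sketched in a few lines of NumPy. This is only a sketch under our own assumptions, not the authors' implementation: the B-spline coefficient vectors, the medoid coefficients, the contiguity matrix and the parameter values (β, γ, *m*) are all taken as given.

```python
import numpy as np

def update_memberships(B, B_med, P, U_prev, beta, gamma, m):
    """One update of the fuzzy memberships u_ic of eq. (3).

    B      : (I, p) array, B-spline coefficient vectors b_i of the I series
    B_med  : (C, p) array, coefficient vectors of the C medoids
    P      : (I, I) 0/1 contiguity matrix
    U_prev : (I, C) memberships from the previous iteration
    """
    # robust exponential distance: 1 - exp(-beta * ||b_i - b_c||^2)
    d2 = ((B[:, None, :] - B_med[None, :, :]) ** 2).sum(axis=2)
    D = 1.0 - np.exp(-beta * d2)                        # (I, C)
    # spatial penalty: for cluster c, sum over neighbours i' of their
    # memberships (raised to m) in all rival clusters c' != c
    V = P @ (U_prev ** m)                               # (I, C)
    G = D + gamma * (V.sum(axis=1, keepdims=True) - V)
    W = G ** (-1.0 / (m - 1.0))
    return W / W.sum(axis=1, keepdims=True)             # rows sum to one
```

With γ = 0 the spatial penalty vanishes and the update reduces to the memberships of the no-spatial BS-Exp-FCMd model.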

#### 3 Clustering of Italian regions - COVID-19 data

In this study, we show the results with reference to the ST-BS-Exp-FCMd model applied to cluster the *I* = 20 Italian regions over the *T* = 351 days from 2020-02-24 to 2021-02-08. The optimal number of clusters has been identified by running the model with γ = 0 and *m* = 1.5 and choosing the number of groups that maximizes the Fuzzy Silhouette index. Then, fixed *C*, the optimal value of γ has been chosen according to a heuristic procedure based on the maximization of the spatial autocorrelation measure introduced in Coppi *et al.*, 2010. To assign each region to a specific cluster we have set the cut-off value *u<sub>ic</sub>* ≥ 0.6 (Maharaj & D'Urso, 2011). The clustering results are reported in Table 1 for both models\*. For each one, three clusters have been selected, whose medoids are denoted in each column header. For the *Total cases over population*, two fuzzy units have been identified, Basilicata and Sicily, the latter characterized by an anomalous increase of infections during the second wave. For the *Total deaths over population* due to the COVID-19 disease, two fuzzy units have been identified, Aosta Valley and Sicily; the former is a global outlier, the latter a local one. We argue that the three clusters match three risk levels, from the highest to the lowest one. The




main advantages of this methodology consist in the data reduction obtained by using B-spline coefficients and in the robustness gained through the exponential, trimming and noise approaches, while the spatial information is taken into account by adding a penalty term in the objective function.



\*One notices that, in the contiguity matrix, Calabria and Sicily have been considered contiguous since they have very frequent ferry connections.

### **PIVMET**: PIVOTAL METHODS FOR BAYESIAN RELABELLING IN FINITE MIXTURE MODELS

Leonardo Egidi<sup>1</sup>, Roberta Pappadà<sup>1</sup>, Francesco Pauli<sup>1</sup> and Nicola Torelli<sup>1</sup>

<sup>1</sup> Dipartimento di Scienze Economiche, Aziendali, Matematiche e Statistiche 'Bruno de Finetti', Università degli Studi di Trieste (e-mail: legidi@units.it, rpappada@units.it, francesco.pauli@deams.units.it, nicola.torelli@deams.units.it)

ABSTRACT: The identification of groups' prototypes, i.e. elements of a dataset that are representative of the group they belong to, is relevant to the tasks of clustering, classification and mixture modeling. The R package pivmet includes different methods for extracting pivotal units from a dataset, to be exploited for a Markov Chain Monte Carlo (MCMC) relabelling technique for dealing with label switching in Bayesian estimation of mixture models. Moreover, consensus clustering based on pivotal units may improve classical algorithms (e.g. *k*-means) by means of a careful seeding.

KEYWORDS: pivotal unit, mixture model, relabelling, consensus clustering.

#### 1 Introduction


Table 1. *Total cases (columns 1-3) and Total deaths (columns 4-6) over population (per 10000 inhabitants) - 3 clusters memberships*

| Region | Piedmont | Lazio | Calabria | Trentino-South Tyrol | Lazio | Calabria |
|---|---|---|---|---|---|---|
| Piedmont | 1.000 | 0.000 | 0.000 | 0.956 | 0.023 | 0.021 |
| Aosta Valley | 0.697 | 0.154 | 0.150 | *0.563* | 0.218 | 0.218 |
| Lombardy | 0.992 | 0.004 | 0.003 | 0.812 | 0.094 | 0.094 |
| Trentino-South Tyrol | 0.936 | 0.035 | 0.029 | 1.000 | 0.000 | 0.000 |
| Veneto | 0.849 | 0.083 | 0.068 | 0.975 | 0.014 | 0.011 |
| Friuli-Venezia Giulia | 0.688 | 0.222 | 0.089 | 0.850 | 0.095 | 0.055 |
| Liguria | 0.881 | 0.087 | 0.032 | 0.834 | 0.097 | 0.070 |
| Emilia-Romagna | 0.784 | 0.160 | 0.056 | 0.846 | 0.093 | 0.061 |
| Tuscany | 0.095 | 0.860 | 0.045 | 0.194 | 0.708 | 0.097 |
| Umbria | 0.031 | 0.953 | 0.016 | 0.004 | 0.990 | 0.007 |
| Marche | 0.019 | 0.964 | 0.017 | 0.173 | 0.748 | 0.079 |
| Lazio | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| Abruzzo | 0.000 | 0.999 | 0.000 | 0.011 | 0.978 | 0.011 |
| Molise | 0.013 | 0.943 | 0.044 | 0.000 | 1.000 | 0.000 |
| Campania | 0.020 | 0.964 | 0.015 | 0.001 | 0.997 | 0.002 |
| Apulia | 0.018 | 0.910 | 0.072 | 0.001 | 0.998 | 0.001 |
| Basilicata | 0.044 | 0.433 | *0.523* | 0.042 | 0.744 | 0.215 |
| Calabria | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 1.000 |
| Sicily | *0.376* | 0.279 | *0.345* | *0.563* | 0.298 | 0.139 |
| Sardinia | 0.027 | 0.036 | 0.937 | 0.000 | 0.001 | 0.999 |

Model with no spatial penalty (γ = γ<sub>opt</sub>) for Total cases over pop.; model with spatial penalty (γ = γ<sub>opt</sub>) for Total deaths over pop.

#### References

COPPI, R., D'URSO, P., & GIORDANI, P. 2010. A fuzzy clustering model for multivariate spatial time series. *J. Classification*, 27(1), 54–88.

D'URSO, P. 2000. Dissimilarity measures for time trajectories. *Stat. Methods Appl.*, 9(1-3), 53–83.

D'URSO, P., DE GIOVANNI, L., DISEGNA, M., & MASSARI, R. 2019. Fuzzy clustering with spatial–temporal information. *Spatial Statistics*, 30, 71–102.

MAHARAJ, E. A., & D'URSO, P. 2011. Fuzzy clustering of time series in the frequency domain. *Information Sciences*, 181(7), 1187–1211.

The identification of some units which may be representative of the group they belong to is often a matter of statistical importance, and it can help avoid an extra amount of work when processing the data. The advantage of such pivotal units (hereafter called pivots) is that they are chosen to be as far as possible from units in the other groups and as similar as possible to the units in the same group, and they may be beneficial in many statistical frameworks, such as clustering, classification, and mixture modeling.

The pivmet R package (Egidi *et al.*, 2021) implements various pivotal selection criteria, graphical tools, and the relabelling method (Papastamoulis, 2016) described in Egidi *et al.*, 2018 to deal with 'label switching' (Redner & Walker, 1984), a well-known phenomenon causing nonidentifiability of the mixture parameters during MCMC sampling (Frühwirth-Schnatter, 2001). Compared to other packages, it allows the user to fit their own mixture model, using data augmentation with component memberships, either via the JAGS (Plummer, 2018) or the Stan (Carpenter *et al.*, 2017) software, by specifying suitable prior distributions. Pivotal units are detected via the similarity matrix derived from the MCMC sample—whose elements are the estimated probabilities that any two units in the observed sample are drawn from the same component—and used to relabel the chains. Such units may be fruitfully used in Dirichlet process mixture models (DPMM) (Ferguson, 1973, Neal, 2000), a class of models that naturally sorts data into clusters, and in data clustering to guarantee a better final clustering solution starting from a careful seeding based on well-separated statistical units.
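The similarity-matrix idea can be sketched as follows. This is a simplified NumPy illustration, not the pivmet code, and the "maxsumdiff"-style score below is our own schematic reading of a pivotal criterion: a pivot should be highly similar to its own group and dissimilar from the others.

```python
import numpy as np

def coclustering_similarity(Z):
    """Z: (n_draws, n_units) array of MCMC component labels.
    Returns the (n_units, n_units) matrix of estimated probabilities
    that two units were drawn from the same mixture component."""
    return (Z[:, :, None] == Z[:, None, :]).mean(axis=0)

def pick_pivots(S, groups):
    """For each group, pick the unit with the largest sum of similarities
    to its own group minus its similarities to the other groups
    (a 'maxsumdiff'-style criterion, simplified here)."""
    pivots = {}
    for g in np.unique(groups):
        inside = np.where(groups == g)[0]
        outside = np.where(groups != g)[0]
        score = (S[np.ix_(inside, inside)].sum(axis=1)
                 - S[np.ix_(inside, outside)].sum(axis=1))
        pivots[int(g)] = int(inside[np.argmax(score)])
    return pivots
```

Note that the co-clustering matrix is invariant to label switching, which is what makes it a safe starting point for pivot detection.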

The aim of the paper is to provide a quick overview of the computational capabilities of our package in the field of Bayesian mixture models.

#### 2 Finite mixtures of Gaussian distributions

Consider a multivariate mixture of Gaussian distributions: let y<sub>*i*</sub> ∈ ℝ<sup>*d*</sup> and assume that

$$\mathbf{y}\_{i} \sim \sum\_{j=1}^{k} \eta\_{j} \mathcal{N}\_{d}(\mu\_{j}, \Sigma\_{j}), \ i = 1, \ldots, n, \tag{1}$$


where *µ<sub>j</sub>* ∈ ℝ<sup>*d*</sup> and Σ<sub>*j*</sub> is a *d*×*d* positive definite covariance matrix. We assume the following prior specification for the parameters in (1):

$$\mu\_j \sim \mathcal{N}\_{2}(\mu\_0, S\_2), \quad \Sigma\_j^{-1} \sim \text{Wishart}(S\_3, d+1), \quad \eta \sim \text{Dirichlet}(\alpha), \tag{2}$$

where α is a *k*-dimensional vector and *S*<sub>2</sub> and *S*<sub>3</sub> are positive definite matrices. We fix *µ*<sub>0</sub> = **0**, α = (1,...,1) and assume *S*<sub>2</sub> and *S*<sub>3</sub> are diagonal matrices, with diagonal elements equal to 10<sup>5</sup>. We simulate a sample of size *n* = 150 from a bivariate Gaussian distribution with the function piv\_sim and we fit the model using the JAGS option. From the bivariate traceplots of the chains for each mean component *µ<sub>j,1</sub>* and *µ<sub>j,2</sub>* in Figure 1, we clearly note that label switching has occurred and that the relabelling algorithm fixed it, by isolating the four bivariate high-density regions.
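The effect of the relabelling step can be illustrated with a toy sketch (ours, not the package's algorithm): if the pivots fall in distinct components in every draw, which is precisely the property pivotal selection aims to guarantee, each draw's labels can be permuted so that pivot *j* always carries label *j*.

```python
import numpy as np

def relabel_by_pivots(Z, pivots):
    """Undo label switching draw by draw. Z: (n_draws, n_units) component
    labels from the MCMC sample; pivots: one unit index per component,
    assumed to occupy distinct components in every draw."""
    k = int(Z.max()) + 1
    Z_rel = np.empty_like(Z)
    for t in range(Z.shape[0]):
        # map the label of pivot j in this draw to the fixed label j
        perm = {int(Z[t, p]): j for j, p in enumerate(pivots)}
        # send any label not claimed by a pivot to an arbitrary free slot
        free = iter(sorted(set(range(k)) - set(perm.values())))
        for lab in range(k):
            if lab not in perm:
                perm[lab] = next(free)
        Z_rel[t] = [perm[int(z)] for z in Z[t]]
    return Z_rel
```

After relabelling, component-specific posterior summaries (e.g. the means in Figure 1) can be computed directly from the permuted draws.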

#### 3 Dirichlet Process Mixture Models

DPMMs are useful tools for non-parametric density estimation and, more generally, the choice of a Dirichlet process prior avoids the specification of an inappropriate parametric form. The DPMM has the following form:

$$\begin{aligned} \mathbf{y}\_i &\sim K(\mathbf{y}\_i|\theta\_i), \ i = 1, \ldots, n, \\ \theta\_i &\sim F, \ F \sim \text{DP}(\alpha, G), \end{aligned} \tag{3}$$

Figure 1. *Bivariate mixture data: scatterplot for the mean parameters obtained via JAGS sampling (left plot) and relabelled estimates (right plot) via the maxsumdiff pivotal criterion.*

where *K*(·) is a parametric kernel function, usually continuous, *F* is an unknown probability distribution, DP is the nonparametric Dirichlet process prior with concentration parameter α and *base measure G*, which encapsulates any prior knowledge about *F*. A common choice for *K*(·) is a Gaussian mixture model, so that *K*(*y<sub>i</sub>*|θ<sub>*i*</sub>) = *N*(*µ<sub>i</sub>*, σ<sup>2</sup><sub>*i*</sub>). The DPMM sorts the data into clusters, corresponding to the mixture components; thus, it may be seen as an infinite-dimensional mixture model which generalizes finite mixture models, and pivotal unit detection may be quite relevant for this class of models in order to identify distinct groups' characteristics. We generate *n* = 200 data points from a Student-*t* distribution with 3 degrees of freedom and we draw posterior samples for *µ*<sub>1</sub>, *µ*<sub>2</sub>,...,*µ<sub>k</sub>* via the dirichletprocess package. Figure 2 represents the posterior density estimation for the simulated dataset, along with nine pivotal units (blue points) detected by pivmet via the maxsumdiff pivotal criterion.
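The DP prior in (3) can be made concrete through its truncated stick-breaking representation. The sketch below is generic and is unrelated to the dirichletprocess implementation:

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated stick-breaking draw of DP mixture weights:
    v_h ~ Beta(1, alpha), w_h = v_h * prod_{l<h} (1 - v_l).
    Small alpha concentrates mass on few atoms (few clusters)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining
```

The truncation level only needs to be large enough that the leftover stick mass is negligible for the chosen α.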

#### References


CARPENTER, BOB, GELMAN, ANDREW, HOFFMAN, MATTHEW D., LEE, DANIEL, GOODRICH, BEN, BETANCOURT, MICHAEL, BRUBAKER, MARCUS A., GUO, JIQIANG, LI, PETER, & RIDDELL, ALLEN. 2017. Stan: a probabilistic programming language. *Journal of Statistical Software*, 76(1), 1–32.


Figure 2. *Posterior density estimation (red line) for a sample of n* = 200 *data points from a student*−*t distribution. Blue points below the x-axis denote the pivotal units.*


### CLUSTER VALIDITY BY RANDOM FORESTS

Tahir Ekin<sup>1</sup> and Claudio Conversano<sup>2</sup>

<sup>1</sup> McCoy College of Business, Texas State University, (e-mail: tahirekin@txstate.edu)

<sup>2</sup> Department of Economics and Business Sciences, University of Cagliari, (e-mail: conversa@unica.it)

ABSTRACT: Clustering is a widely used unsupervised method, characterized by the lack of an outcome variable that supervises the analysis. In the literature, several indices have been developed to assess the goodness of a cluster partition. Nonetheless, they usually suffer from computational limitations and therefore may not be appropriate in big data circumstances. We propose a method that validates the outputs of multiple clustering algorithms and is scalable to a large number of observations. It utilizes machine learning classifiers to automatically rank the clustering outputs, accounting for the coherence of the partitions with the data patterns. We illustrate the performance of the proposed method by applying it to simulated clustering datasets, as well as to big data situations in health care fraud detection.

KEYWORDS: cluster validity, classifiers, big data
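The idea can be sketched with scikit-learn (our illustrative reading of the approach, not the authors' implementation): score each candidate partition by how well a random forest can re-predict its labels from the features under cross-validation, so that partitions coherent with the data patterns rank higher.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def classifier_validity(X, candidate_labels, seed=0):
    """Rank clustering outputs by cross-validated classifier accuracy.
    candidate_labels: dict mapping a name to a label vector for X."""
    scores = {}
    for name, y in candidate_labels.items():
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        scores[name] = cross_val_score(forest, X, y, cv=5).mean()
    return scores
```

Because the classifier is fit on subsamples, the score scales with the number of observations far better than pairwise-distance validity indices.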


EGIDI, LEONARDO, PAPPADÀ, ROBERTA, PAULI, FRANCESCO, & TORELLI, NICOLA. 2018. Relabelling in Bayesian mixture models by pivotal units. *Statistics and Computing*, 28(4), 957–969.

EGIDI, LEONARDO, PAPPADÀ, ROBERTA, PAULI, FRANCESCO, & TORELLI, NICOLA. 2021. pivmet: Pivotal methods for Bayesian relabelling and k-means clustering. *arXiv preprint arXiv:2103.16948*.

FERGUSON, THOMAS S. 1973. A Bayesian analysis of some nonparametric problems. *The Annals of Statistics*, 209–230.

FRÜHWIRTH-SCHNATTER, SYLVIA. 2001. Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. *Journal of the American Statistical Association*, 96(453), 194–209.

NEAL, RADFORD M. 2000. Markov chain sampling methods for Dirichlet process mixture models. *Journal of Computational and Graphical Statistics*, 9(2), 249–265.

PAPASTAMOULIS, PANAGIOTIS. 2016. label.switching: An R package for dealing with the label switching problem in MCMC outputs. *Journal of Statistical Software, Code Snippets*, 69(1), 1–24.

PLUMMER, MARTYN. 2018. *rjags: Bayesian graphical models using MCMC*. R package version 4-8.

REDNER, RICHARD A., & WALKER, HOMER F. 1984. Mixture Densities, Maximum Likelihood and the EM Algorithm. *SIAM Review*, 26(2), 195–239.


### ROBUST ESTIMATION OF PARSIMONIOUS FINITE MIXTURE OF GAUSSIAN MODELS

Luis Ángel García-Escudero<sup>1</sup>, Agustín Mayo-Iscar<sup>1</sup> and Marco Riani<sup>2</sup>

<sup>1</sup> Universidad de Valladolid, (e-mail: lagarcia@uva.es, agustin.mayo.iscar@uva.es)

<sup>2</sup> Università degli Studi di Parma, (e-mail: marco.riani@unipr.it)

ABSTRACT: Maximum likelihood estimators are typically robustified by means of impartial trimming. To robustify mixture model estimators, it is additionally necessary to impose constraints in order to avoid spurious solutions. We propose robust estimators, based on the joint application of trimming and constraints, for the classical collection of 14 parsimonious models of Celeux and Govaert. They include different versions of the constraints, which can be applied jointly, yielding a more flexible methodology. Feasible algorithms of EM and ECM type have been developed for these estimators. Empirical evidence on the performance of these estimators, applied to both artificial and real data, will be provided.

KEYWORDS: trimming, constrained estimation, mixture models.

### A RISK INDICATOR FOR CATEGORICAL DATA

Silvia Facchinetti<sup>1</sup> and Silvia Angela Osmetti<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, Università Cattolica del Sacro Cuore – Milano, (e-mail: silvia.facchinetti@unicatt.it, silvia.osmetti@unicatt.it)

ABSTRACT: In this paper we present a suitable measure of risk for data expressed on an ordinal scale. The proposed indicator is based on the cumulative probabilities of the ordinal variable that represents the level of severity for different risk events. The method relies on the construction of a Criticality index which may be used to provide an initial view of the level of risk, to compare environments, to indicate how risk changes over time, and to identify appropriate interventions. Along with the description of the methodology, we present two examples of application in the statistical quality control field and in cyber risk evaluation.

KEYWORDS: categorical variables, risk measure, Criticality index, ordinal data.

#### 1 Methodological proposal

The most common approach to risk modeling is quantitative. When data are available only on an ordinal scale, companies often use approaches based on categorical data that improperly treat the data as quantitative.

In this paper we propose a risk indicator which can exploit ordinal data to rank risks by their "criticality", so as to prioritise preventive actions aimed at mitigating and reducing their impact ex-ante rather than ex-post.

Let $X \sim \{x\_k, p\_k; k = 1,2,\dots,K\}$ be a categorical random variable (r.v.) with ordered categories $x\_k$ and probabilities $p\_k$ that represents a severity variable. In the loss data framework, the severity is a continuous r.v., while in the context of ordinal risk data, the severity is generally expressed on an ordinal scale, characterised by $K$ distinct levels, ordered according to the corresponding magnitude. We define the *Criticality Index* as follows:

$$I = \frac{1}{K - 1} \sum\_{k=1}^{K-1} (K - k)p\_k \tag{1}$$

It is a normalized index with values in [0,1], that provides a risk measure easy to interpret, with extreme values univocally defined, and intermediate values expressed as a percentage.

This index can be estimated by its empirical counterpart, replacing the probabilities $p\_k$ with their estimators $\hat{p}\_k = r\_k/n$, where $r\_k$ is the number of observations in the sample equal to the category $x\_k$ ($r\_k \in \mathbb{N}$ and $\sum\_{k=1}^{K} r\_k = n$). Henceforth, we use $I$ to indicate the index defined in Equation (1), and $\hat{I} = \frac{1}{K-1}\sum\_{k=1}^{K-1}(K-k)\hat{p}\_k$ to denote its estimator.
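As a concrete illustration (a minimal sketch, not the authors' code; the counts are hypothetical), the estimator $\hat{I}$ can be computed directly from the ordinal counts $r\_k$:

```python
import numpy as np

def criticality_index(r):
    """Estimate I_hat = (1/(K-1)) * sum_{k=1}^{K-1} (K-k) * p_hat_k from the
    counts r_k of the K ordered categories (assumption: k = 1 is the most
    severe level, so I_hat = 1 when all observations fall in category 1)."""
    r = np.asarray(r, dtype=float)
    K, n = len(r), r.sum()
    p_hat = r / n                      # empirical probabilities p_hat_k
    weights = K - np.arange(1, K)      # (K-1, K-2, ..., 1) for k = 1..K-1
    return float(weights @ p_hat[:-1]) / (K - 1)

# Hypothetical severity counts on a 3-point scale (critical, high, medium):
print(criticality_index([30, 0, 0]))    # 1.0: maximum risk, all mass at k = 1
print(criticality_index([10, 10, 10]))  # 0.5: uniform distribution
print(criticality_index([0, 0, 30]))    # 0.0: minimum risk
```

Note that the weighted sum $\sum\_{k=1}^{K-1}(K-k)p\_k$ equals the sum of the cumulative probabilities over the first $K-1$ categories, which is the "cumulative probabilities" view mentioned in the abstract.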

The *Criticality Index* estimator is an unbiased and consistent estimator for $I$ (Facchinetti & Osmetti, 2018). In fact, since every $r\_k$ ($k = 1,2,\dots,K$) follows a binomial distribution with parameters ($n$, $p\_k$), the expression $\sum\_{k=1}^{K-1}(K-k)\frac{r\_k}{n}$ is a mixture of binomial r.v.s ($k = 1,2,\dots,K-1$). Therefore, since $E(r\_k) = np\_k$ and $Var(r\_k) = np\_k(1-p\_k)$, the mean and the variance of $\hat{I}$ are respectively:

$$E(\hat{I}) = \frac{1}{K - 1} \sum\_{k=1}^{K-1} (K - k) \frac{E(r\_k)}{n} = \frac{1}{K - 1} \sum\_{k=1}^{K-1} (K - k) p\_k = I \tag{2}$$

$$Var(\hat{I}) = \frac{1}{n(K-1)^2} \left[ \sum\_{k=1}^{K-1} (K-k)^2 p\_k (1-p\_k) - 2 \sum\_{k=1}^{K-1} (K-k) p\_k \sum\_{l=1}^{k-1} (K-l) p\_l \right] \tag{3}$$

Finally, from (3), $\lim\_{n\to\infty} Var(\hat{I}) = 0$.

Moreover, a Kolmogorov–Smirnov test for discrete r.v.s (Facchinetti & Osmetti, 2013) ensures that the *Criticality Index* estimator is asymptotically normally distributed, with the mean and variance given in (2) and (3).
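The moment formulas can be checked numerically. The sketch below (hypothetical probabilities; multinomial sampling of the $K$ categories is assumed) compares the closed-form mean (2) and variance (3) of $\hat{I}$ with Monte Carlo estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # hypothetical severity probabilities (K = 3)
K, n = len(p), 200
w = K - np.arange(1, K)         # weights (K - k) for k = 1..K-1

# Closed-form mean (2) and variance (3)
I = float(w @ p[:-1]) / (K - 1)
var = (float(w**2 @ (p[:-1] * (1 - p[:-1])))
       - 2 * sum((K - k) * p[k - 1] * sum((K - l) * p[l - 1] for l in range(1, k))
                 for k in range(1, K))) / (n * (K - 1) ** 2)

# Monte Carlo: draw counts (r_1, ..., r_K) ~ Multinomial(n, p) many times
r = rng.multinomial(n, p, size=200_000)
I_hat = (r[:, :-1] @ w) / (n * (K - 1))
print(I, I_hat.mean())    # sample mean of I_hat matches I (unbiasedness)
print(var, I_hat.var())   # sample variance of I_hat matches formula (3)
```

For these values, $I = 0.65$ and formula (3) gives $Var(\hat{I}) = 0.61/(4n)$, both of which the simulation reproduces to Monte Carlo accuracy.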

#### 2 Applications

In this section we present two examples of application of the *Criticality Index*. First, in the statistical quality control field, the proposed index is naturally suited to measuring the risk of failure of a product in the testing and recall phases, or in similar situations where quality is expressed on an ordinal scale. Second, in cyber risk evaluation, since the data are very sensitive and it is unlikely that a private institution is willing to disclose them, we consider a classification of cyber risk loss data into severity levels and we apply the proposed methodology to measure cyber risks, using ordinal data.

#### 2.1 Statistical quality control

We apply the *Criticality Index* to real data concerning the severity, detection, and occurrence of defects in the components of hose assemblies (stripes, guard, fitting, and hose) produced by a sales company of a multinational manufacturer.

Severity is a measure of the gravity of a particular type of defect on a 3-point scale (serious, medium, minor defect); detection is a measure of the ease of identifying a failure mode on a 3-point scale (low, medium, high detection); occurrence is the frequency of a particular type of defect in a product.

This information is typically available in companies that apply Failure Mode and Effects Analysis (FMEA) to identify potential failures that could affect the customer's expectations of product quality or process performance (Sellappan & Palanikumar, 2013).

To obtain a global measure of risk, we summarize the *Criticality Indices* related to the severity and detection in a Criticality Impact Chart (Figure 1).

Figure 1. *Criticality Impact Chart for severity and detection.*

For each component of the product, we plot a ball with coordinates given by the levels of risk ($\hat{I}$) for severity and for detection. The dimension of the balls is related to the occurrence of each component. The lines represent the locus of points with equal joint level of risk. We observe that hose is the component with the highest joint level of risk. For guard, a situation of minimum heterogeneity occurs and, thus, the indices assume their minimum value.

This graph may be very useful for companies wanting to prioritize interventions on the production line of a finished product, as well as those wanting to improve related process controls.

#### 2.2 Cyber risk

We apply our proposal to real data on serious cyber attacks that occurred worldwide in 2017, described by Clusit (the Italian Association for Information Security) in its Report on ICT Security in Italy (Antonielli *et al.*, 2018).

We consider for each type of attack the ordinal variable severity on a 3-point scale (critical, high, medium severity). In Table 1, we report the *Criticality Index* estimates, the standard errors and the associated 90% asymptotic confidence intervals (Facchinetti & Osmetti, 2018).
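The asymptotic normality of $\hat{I}$ is what yields these intervals: each 90% CI is $\hat{I} \pm z\_{0.95}\,\mathrm{SE}$. A quick arithmetic sketch (using the Cybercrime row as an example) reproduces the reported bounds:

```python
# 90% asymptotic confidence interval I_hat +/- z_0.95 * SE,
# illustrated on the Cybercrime row of Table 1 (I_hat = 0.239, SE = 0.015).
z = 1.6449                   # 95th percentile of the standard normal
i_hat, se = 0.239, 0.015
lo, hi = i_hat - z * se, i_hat + z * se
print(round(lo, 3), round(hi, 3))   # 0.214 0.264, as reported in Table 1
```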

Table 1. *Criticality index estimates, standard errors and* 90% *CIs for type of attack.*

| Type of attack | $\hat{I}$ | SE | CI (90%) |
| --- | --- | --- | --- |
| Cybercrime | 0.239 | 0.015 | 0.214-0.264 |
| Hacktivism | 0.342 | 0.045 | 0.268-0.416 |
| Espionage/Sabotage | 0.973 | 0.014 | 0.950-0.996 |
| Information Warfare | 0.952 | 0.027 | 0.908-0.995 |

From Table 1 we obtain that Espionage and Information Warfare dominate Hacktivism in terms of severity, followed by Cybercrime. Our proposed measure can thus be employed as a simple and effective measurement to prioritise cyber risk.

#### References

ANTONIELLI, A., BECHELLI, L., BOSCO, F., & BUTTI, G., ET AL. 2018. *Rapporto Clusit 2019 sulla Sicurezza ICT in Italia*. Clusit.

FACCHINETTI, S., & OSMETTI, S.A. 2013. A goodness-of-fit test for maximum order statistics from discrete distributions. *Electronic Journal of Applied Statistical Analysis: Decision Support Systems and Services Evaluation*, 4, 9–21.

FACCHINETTI, S., & OSMETTI, S.A. 2018. A risk index for ordinal variables and its statistical properties: A priority of intervention indicator in quality control framework. *Quality and Reliability Engineering International*, 34, 265–275.

FACCHINETTI, S., GIUDICI, P., & OSMETTI, S.A. 2020. Cyber risk measurement with ordinal data. *Statistical Methods and Applications*, 29, 173–185.

SELLAPPAN, N., & PALANIKUMAR, K. 2013. Modified Prioritization Methodology for Risk Priority Number in Failure Mode and Effects Analysis. *International Journal of Applied Science and Technology*, 3, 27–36.

### ADDITIVE QUANTILE REGRESSION VIA THE QGAM R PACKAGE

Matteo Fasiolo<sup>1</sup>

<sup>1</sup> University of Bristol, (e-mail: matteo.fasiolo@bristol.ac.uk)

ABSTRACT: Generalized additive models (GAMs) are flexible non-linear regression models, which can be fitted efficiently using the approximate Bayesian methods provided by the mgcv R package. While the GAM methods provided by mgcv are based on the assumption that the response distribution is modelled parametrically, in this talk I will discuss more flexible methods that do not entail any parametric assumption. In particular, I will introduce the qgam package, which is an extension of mgcv providing fast calibrated Bayesian methods for fitting quantile GAMs (QGAMs) in R. QGAMs are based on a smooth version of the pinball loss of Koenker (2005), rather than on a likelihood function, hence jointly achieving satisfactory accuracy of the quantile point estimates and coverage of the corresponding credible intervals requires adopting the specialized Bayesian fitting framework of Fasiolo et al. (2020), which is implemented by the qgam package.

KEYWORDS: Bayesian quantile regression, generalized additive models, regression splines, calibrated Bayes, fast Bayesian inference.

### GAUSSIAN MIXTURE MODELS FOR HIGH DIMENSIONAL DATA USING COMPOSITE LIKELIHOOD

Michael Fop<sup>1</sup>, Dimitris Karlis<sup>2</sup>, Ioannis Kosmidis<sup>3</sup>, Adrian O'Hagan<sup>1</sup>, Caitriona Ryan<sup>4</sup> and Isobel Claire Gormley<sup>1</sup>

<sup>1</sup> School of Mathematics and Statistics, University College Dublin, Ireland. (e-mail: michael.fop@ucd.ie, claire.gormley@ucd.ie)

<sup>2</sup> Department of Statistics, Athens University of Economics and Business, Greece.

<sup>3</sup> Department of Statistics, University of Warwick, UK.

<sup>4</sup> Hamilton Institute, Maynooth University, Ireland.

ABSTRACT: The use of finite Gaussian mixture models (GMMs) is a well established approach to performing model-based clustering. Despite its popularity, its widespread use is hindered by its inability to transfer to high-dimensional data applications. This is often due to the difficulties related to dealing with high-dimensional covariance matrices and joint densities. Here we propose a composite likelihood framework to enable the use of GMMs for clustering high-dimensional data. The framework is specified by approximating the likelihood of a GMM by means of a block-pairwise composite likelihood, which allows the decomposition of the potentially high-dimensional density into terms of smaller dimensions. A computationally efficient expectation maximization algorithm is developed to facilitate estimation. Performance of the approach is demonstrated through simulated and real data examples.

KEYWORDS: composite likelihood, high-dimensional data, model-based clustering.

#### 1 Introduction

Model-based clustering of continuous data is routinely implemented by means of Gaussian mixture models (GMMs). Despite the popularity of GMMs, their widespread use is curtailed by their inability to transfer to settings where the number of variables *p* is large compared to the sample size *N*. Difficulties with storage and manipulation of the multivariate Gaussian distribution's covariance matrix in such settings lead to increased computational cost, and it often makes the approach impractical. Further, in settings where the number of variables *p* > *N*, fitting a GMM with an unconstrained covariance matrix is infeasible. To overcome these issues, parsimonious models based on covariance matrix factorization and/or strict independence restrictions have been proposed (Bouveyron & Brunet-Saumard, 2014). Despite their advantages, these methods still struggle in high-dimensional data settings (*p* ≫ *N*) and have several limitations in the case of highly correlated variables.

Recently, Ranalli & Rocci (2016, 2017) proposed the use of composite likelihood (CL) to estimate the parameters of a finite mixture model for ordinal and mixed mode data. The CL approach (see Varin *et al.*, 2011) uses smaller dimensional marginal and/or conditional pseudolikelihoods to estimate the parameters of the model. The use of CL avoids the need to fully specify the underlying joint distribution and estimates parameters from a product of lower dimensional likelihoods. Such an approximation is very helpful when the full model is difficult to specify or manipulate. The CL framework assists in avoiding the computational problems often arising from the need to deal with a multi-dimensional joint distribution. In addition, the specification of appropriate conditional likelihoods allows the modelling of the dependence structure by means of lower dimensional terms.

Here the CL approach is exploited to enable clustering of high-dimensional data using GMMs. Lower dimensional terms corresponding to Gaussian multivariate marginal distributions are involved in the construction of the pseudolikelihood, thus avoiding the use of high dimensional covariance matrices (and their inversion), which is advantageous in *p* ≫ *N* scenarios. We embed GMMs in the CL framework to serve a dual purpose: to facilitate the use of GMMs in high-dimensional scenarios, while capturing at the same time the complex dependence structures which are often present in such settings.

#### 2 Block-pairwise composite likelihood for GMMs

To deal with the complexities arising in high-dimensional settings (*p* ≫ *n*), we decompose the likelihood of a GMM into terms of tractable size, corresponding to lower dimensional Gaussian distributions in which *n* is larger than the number of variables involved in each term. To do so, we define a general composite likelihood based on pairs of blocks of variables.
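As a rough sketch of the bookkeeping this implies (illustrative code, not the authors' implementation; the contiguous equal-sized partition is an assumption for exposition), the pairs of blocks entering the pseudolikelihood can be enumerated as follows, both for the full set of pairs and for the restricted sequential variant introduced below to cut the number of terms:

```python
from itertools import combinations

def block_pairs(p, b, restricted=False):
    """Split variable indices 0..p-1 into K = p // b contiguous blocks of size
    b and return the variable index sets of the block pairs: all C(K, 2) pairs
    (BCL), or only the K - 1 consecutive pairs B_j with B_{j+1} (restricted)."""
    K = p // b
    blocks = [list(range(j * b, (j + 1) * b)) for j in range(K)]
    if restricted:
        return [blocks[j] + blocks[j + 1] for j in range(K - 1)]
    return [blocks[j] + blocks[k] for j, k in combinations(range(K), 2)]

pairs = block_pairs(p=100, b=10)                   # 100 variables, K = 10 blocks
print(len(pairs))                                  # C(10, 2) = 45 pairs (BCL)
print(len(block_pairs(100, 10, restricted=True)))  # 10 - 1 = 9 pairs (RBCL)
print(len(pairs[0]))                               # each pair spans 2b = 20 variables
```

Each returned index set identifies a 2*b*-dimensional marginal Gaussian term, so only 2*b* × 2*b* covariance matrices ever need to be formed or inverted.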

Suppose the vector $\mathbf{X}$ of $p$ variables is partitioned into a set of $K$ non-overlapping blocks $B = \{B\_1,\dots,B\_j,\dots,B\_K\}$. For ease of exposition, we take blocks having the same size $b$. Let $\mathcal{S}$ denote the set of all possible $\binom{K}{2}$ pairs constructed using the blocks in $B$. The generic element $S\_l \in \mathcal{S}$ is given by a pair of blocks, such that $S\_l = B\_j \cup B\_k$, with $j \neq k$, $\forall\, j,k = 1,\dots,K$. We then define the following *block-pairwise composite likelihood* (BCL)

$$\begin{split} \text{BCL}(\boldsymbol{\Theta}) &= \prod\_{S\_l \in \mathcal{S}} \left\{ \prod\_{i=1}^N \sum\_{g=1}^G \tau\_g \phi(\mathbf{x}\_i^l; \boldsymbol{\mu}\_g^l, \boldsymbol{\Sigma}\_g^l) \right\} \\ &= \prod\_{j=1}^{K-1} \prod\_{k>j} \left\{ \prod\_{i=1}^N \sum\_{g=1}^G \tau\_g \phi\left(\{\mathbf{x}\_i^j, \mathbf{x}\_i^k\}; \boldsymbol{\mu}\_g^{(j,k)}, \boldsymbol{\Sigma}\_g^{(j,k)}\right) \right\}, \end{split} \tag{1}$$

3 Estimation

References

267–278.

529–547.

The complete-data composite log-likelihood decomposes into the sum of a number of standard GMM complete data log-likelihood terms, each related to

ever, these terms are not independent of each other due to the coupling of the parameters in the factorization. Therefore, maximization of the BCL and RBCL is carried out by means of an Expectation-Conditional-Maximization algorithm (Meng & Rubin, 1993). In particular, the maximization step involves a series of conditional maximization passes, where the pairs are scanned in a sequential manner, such that the joint distribution of each pair is rewritten as the product of a marginal distribution estimated at the previous step and the conditional distribution of a block given the block of the previous step. The optimization is based on the conditional estimation procedure outlined in Fop *et al.*, 2021. While the main purpose of employing the CL framework when clustering high-dimensional data is computational efficiency, further work will explore the statistical properties (e.g. consistency) of the resulting parameter estimates. Performance of the CL approach and estimation procedure are

BOUVEYRON, CHARLES,&BRUNET-SAUMARD, CAMILLE. 2014. Modelbased clustering of high-dimensional data: A review. *Computational*

FOP, M., MATTEI, P.A., BOUVEYRON, C., & MURPHY, T.B. 2021. Unobserved classes and extra variables in high-dimensional discriminant anal-

MENG, XIAO-LI,&RUBIN, DONALD B. 1993. Maximum likelihood estimation via the ECM algorithm: A general framework. *Biometrika*, 80(2),

RANALLI, MONIA,&ROCCI, ROBERTO. 2016. Mixture models for ordinal data: a pairwise likelihood approach. *Statistics and Computing*, 26(1-2),

RANALLI, MONIA,&ROCCI, ROBERTO. 2017. Mixture models for mixedtype data through a composite likelihood approach. *Computational Statis-*

VARIN, CRISTIANO, REID, NANCY,&FIRTH, DAVID. 2011. An overview

of composite likelihood methods. *Statistica Sinica*, 5–42.

*<sup>i</sup>* <sup>=</sup> {x*<sup>j</sup>*

*<sup>i</sup>* ,x*<sup>k</sup>*

*<sup>i</sup>* }. How-

the marginal joint Gaussian distribution of the block pair x*<sup>l</sup>*

demonstrated through simulated and real data examples.

*Statistics & Data Analysis*, 71, 52–78.

*tics & Data Analysis*, 110, 87–102.

ysis. *arXiv:2102.01982*.

where x*<sup>l</sup> <sup>i</sup>* <sup>=</sup> {x*<sup>j</sup> <sup>i</sup>* ,x*<sup>k</sup> <sup>i</sup>* } is the observation *i* measured on the variables included in the pair of blocks *Sl* = *Bj* ∪*Bk*, φ(·) is the density of a multivariate Gaussian of dimension 2*b* and *µ<sup>l</sup> <sup>g</sup>* and Σ*<sup>l</sup> <sup>g</sup>* are the component parameters that relate to the variables in the pair *Sl*. The second expression in (1) makes explicit that *µ*(*j*,*k*) *g* and Σ(*j*,*k*) *<sup>g</sup>* are the parameters of the joint Gaussian distribution of {x*<sup>j</sup> <sup>i</sup>* ,x*<sup>k</sup> i* }.

In (1), each product term in the curly brackets is the likelihood of a GMM over the variables in the pair *Sl*. Hence, by setting *b n*/2, the potentially high-dimensional GMM likelihood is decomposed into a number of terms involving lower dimensional Gaussian distributions, enabling computationally efficient inference. As the BCL approach works with low dimensional Gaussian distributions, estimation and inversion of large covariance matrices are avoided, facilitating the use of GMMs in high-dimensional scenarios.

As long as 1 < *b n*/2, there is computational advantage in the BCL approach. However, for certain situations, *<sup>K</sup>* 2 can mean an intractable number of terms in the log-likelihood. To further reduce the complexity, instead of looking at all possible 2*b*-dimensional marginal distributions, we define a restricted subset *S* <sup>∗</sup> ⊂ *S* of pairs of blocks. In particular, we take this set as a sequential enumeration of pairs of blocks: *S* <sup>∗</sup> = *<sup>K</sup>*−<sup>1</sup> *<sup>j</sup>*=<sup>1</sup> *Bj* ∪ *Bj*+1. We then define the following *restricted block-pairwise composite likelihood* (RBCL):

$$\begin{split} \text{RBCL}(\boldsymbol{\Theta}) &= \prod\_{i=1}^{N} \left\{ \prod\_{S\_l \in S^{\*}} \sum\_{g=1}^{G} \tau\_g \varphi(\mathbf{x}\_i^l; \boldsymbol{\mu}\_g^l, \boldsymbol{\Sigma}\_g^l) \right\} \\ &= \prod\_{i=1}^{N} \left\{ \prod\_{\substack{j=1 \\ k=j+1}}^{K-1} \sum\_{g=1}^{G} \tau\_g \varphi(\{\mathbf{x}\_i^j, \mathbf{x}\_i^k\}; \boldsymbol{\mu}\_g^{(j,k)}, \boldsymbol{\Sigma}\_g^{(j,k)}) \right\}. \end{split} \tag{2}$$

Importantly, compared to the $\binom{K}{2}$ pairs of BCL, the number of pairs for RBCL is linear in $K$. The formulation of the GMM in terms of BCL and RBCL significantly reduces the complexity of computing and dealing with high-dimensional likelihood terms and covariance matrices.
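As a minimal numerical sketch (not the authors' implementation), the restricted block-pairwise composite log-likelihood in (2) can be computed with off-the-shelf Gaussian densities; the function name and the block representation below are our own illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbcl_log_likelihood(X, blocks, weights, means, covs):
    """Restricted block-pairwise composite log-likelihood, sketched.

    X       : (N, n) data matrix
    blocks  : list of K index arrays partitioning the n variables
    weights : (G,) mixing proportions tau_g
    means   : (G, n) component means
    covs    : (G, n, n) component covariances
    Only the K-1 sequential pairs (B_j, B_{j+1}) are visited.
    """
    N = X.shape[0]
    G = len(weights)
    total = 0.0
    for j in range(len(blocks) - 1):                      # sequential pairs only
        idx = np.concatenate([blocks[j], blocks[j + 1]])  # S_l = B_j ∪ B_{j+1}
        dens = np.zeros(N)
        for g in range(G):
            # marginal 2b-dimensional Gaussian of component g on the pair
            dens += weights[g] * multivariate_normal.pdf(
                X[:, idx], mean=means[g, idx], cov=covs[g][np.ix_(idx, idx)])
        total += np.log(dens).sum()
    return total
```

Note that with only $K = 2$ blocks the single sequential pair spans all the variables, so the composite log-likelihood coincides with the full GMM log-likelihood; with many small blocks, only $2b$-dimensional covariance blocks are ever evaluated or inverted, which is the computational point of the construction.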

### 3 Estimation

The complete-data composite log-likelihood decomposes into the sum of a number of standard GMM complete-data log-likelihood terms, each related to the marginal joint Gaussian distribution of the block pair $\mathbf{x}_i^l = \{\mathbf{x}_i^j, \mathbf{x}_i^k\}$. However, these terms are not independent of each other, due to the coupling of the parameters in the factorization. Therefore, maximization of the BCL and RBCL is carried out by means of an Expectation-Conditional-Maximization (ECM) algorithm (Meng & Rubin, 1993). In particular, the maximization step involves a series of conditional maximization passes, where the pairs are scanned in a sequential manner, such that the joint distribution of each pair is rewritten as the product of a marginal distribution estimated at the previous step and the conditional distribution of a block given the block of the previous step. The optimization is based on the conditional estimation procedure outlined in Fop *et al.* (2021). While the main purpose of employing the CL framework when clustering high-dimensional data is computational efficiency, further work will explore the statistical properties (e.g. consistency) of the resulting parameter estimates. The performance of the CL approach and of the estimation procedure is demonstrated through simulated and real data examples.
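The conditional maximization passes rest on the standard factorization of a joint Gaussian into a marginal and a conditional distribution. A small self-contained check of that identity (our own illustrative code, not the authors'):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_conditional(mu, cov, x_j, d):
    """Mean and covariance of x_k | x_j for a joint Gaussian split after coordinate d."""
    mu_j, mu_k = mu[:d], mu[d:]
    S_jj, S_jk = cov[:d, :d], cov[:d, d:]
    S_kj, S_kk = cov[d:, :d], cov[d:, d:]
    A = S_kj @ np.linalg.inv(S_jj)   # regression coefficients of x_k on x_j
    return mu_k + A @ (x_j - mu_j), S_kk - A @ S_jk
```

The factorization log φ(x; µ, Σ) = log φ(x_j) + log φ(x_k | x_j) is exactly what allows each pair's joint distribution to be rewritten as a marginal times a conditional during the sequential scan.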

### References


### ON MODEL-BASED CLUSTERING USING QUANTILE REGRESSION

Carlo Gaetan<sup>1</sup>, Paolo Girardi<sup>2</sup> and Victor Muthama Musau<sup>3</sup>

<sup>1</sup> DAIS, Ca' Foscari University of Venice (Italy), (e-mail: gaetan@unive.it)

<sup>2</sup> Department of Developmental and Social Psychology, University of Padova (Italy), (e-mail: paolo.girardi@unipd.it)

<sup>3</sup> Department of Pure and Applied Sciences, Kirinyaga University (Kenya), (e-mail: vmusau@kyu.ac.ke)

ABSTRACT: Clustering general regression functions or curves can suffer from a lack of robustness under the usual Gaussian assumption. In this note we introduce a new model-based clustering method that tries to overcome this limitation.

KEYWORDS: Functional data, hierarchical Bayesian model, MCMC algorithm

#### 1 Introduction

Unlike classical clustering approaches such as agglomerative hierarchical clustering and K-means clustering, which are largely heuristic and not based on formal statistical models, model-based clustering takes a likelihood-based approach, thus permitting inference to be drawn on the clusters. These techniques are based on finite mixture model theory (Fraley & Raftery, 2002), where each mixture component corresponds to a cluster. However, fundamental concerns remain about robustness and, in particular, the choice of the distribution representing the within-cluster density. Gaussian mixture models are historically the most popular tool for model-based clustering. However, if the distribution of the observed variable is characterized by asymmetry and the presence of outliers, a Gaussian distribution may not be an appropriate within-cluster density. The direct link between the univariate quantile regression approach and the Asymmetric Laplace Distribution (ALD) forms the basis for introducing a clustering model based on finite mixtures of ALDs to group individuals subject to heterogeneity due to regressor variables.

#### 2 Methodology


We start by considering a vector $\mathbf{y} = (y_1,\dots,y_T)$ of responses $y_t$ and the associated design matrix $\mathbf{X} = (\mathbf{x}_1,\dots,\mathbf{x}_T)$ that collects the vectors $\mathbf{x}_t$ of $L$ covariates. Further, let $Q_p(y_t|\mathbf{x}_t)$, for $0 < p < 1$, be the $p$-th quantile regression function of $y_t$ given $\mathbf{x}_t$, which can be modelled as $Q_p(y_t|\mathbf{x}_t) = \mathbf{x}_t^\prime\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of unknown parameters to be estimated. The regression coefficient estimate is obtained by minimizing (Koenker & Bassett, 1978)

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\text{argmin}} \sum\_{t=1}^{T} \rho\_p(y\_t - \mathbf{x}\_t^\prime \boldsymbol{\beta}) \tag{1}$$

where $\rho_p(\cdot)$ is the check loss function defined by $\rho_p(x) = x(p - I(x < 0))$ and $I(\cdot)$ denotes the usual indicator function. Koenker and Machado (1999) showed that there is a direct relationship between minimizing (1) and maximum likelihood theory based on independently distributed asymmetric Laplace variables with density

$$\text{ald}(y\_t | \boldsymbol{\beta}, \sigma, p) = \frac{p(1 - p)}{\sigma} \exp\left\{-\rho\_p \left(\frac{y\_t - \mathbf{x}\_t^\prime \boldsymbol{\beta}}{\sigma}\right)\right\} \tag{2}$$

where $\sigma > 0$ is a scale parameter and $0 < p < 1$ is the skewness parameter, which can be used directly to model any quantile of interest.
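The Koenker-Machado correspondence can be checked in a few lines; the following is an illustrative sketch with function names of our own choosing:

```python
import numpy as np

def check_loss(u, p):
    """Koenker-Bassett check function: rho_p(u) = u * (p - I(u < 0))."""
    return u * (p - (u < 0))

def ald_density(y, mu, sigma, p):
    """Asymmetric Laplace density (2) with location mu, scale sigma, skewness p."""
    return p * (1 - p) / sigma * np.exp(-check_loss((y - mu) / sigma, p))

# Up to a constant not depending on the location, the negative ALD
# log-likelihood equals the check loss of (1): minimizing one maximizes
# the other, for any fixed sigma.
neg_loglik = -np.log(ald_density(2.0, 0.5, 1.3, 0.25))
loss_plus_const = check_loss((2.0 - 0.5) / 1.3, 0.25) - np.log(0.25 * 0.75 / 1.3)
```

In this light, the mixture of ALDs introduced in (3) below is, in effect, a mixture of quantile regressions at the common order $p$.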

According to finite mixture theory, we define the likelihood of our mixture model for a single vector $\mathbf{y}$ as

$$L(\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\sigma}, p | \mathbf{y}) = \sum\_{k=1}^{K} \alpha\_{k} \prod\_{t=1}^{T} \text{ald}(y\_{t} | \boldsymbol{\beta}\_{k}, \sigma\_{k}, p) = \sum\_{k=1}^{K} \alpha\_{k} \text{ALD}(\mathbf{y} | \boldsymbol{\beta}\_{k}, \sigma\_{k}, p) \tag{3}$$

where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1,\dots,\boldsymbol{\beta}_K)$, $\boldsymbol{\sigma} = (\sigma_1,\dots,\sigma_K)$, and $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)$ is the vector of the mixing proportions for the $K$ clusters, which satisfy the conditions $0 < \alpha_k < 1$ and $\sum_{k=1}^{K} \alpha_k = 1$.

We now consider a set $Y = \{\mathbf{y}_i, i = 1,\dots,n\}$ of $n$ vectors $\mathbf{y}_i = (y_{i1},\dots,y_{iT})$ of independent observations, and we want to split the data set $Y$ into $K$ clusters. According to the mixture model (3), the cluster membership $c_i \in \{1,\dots,K\}$, where $c_i = k$ indicates that the $i$-th vector $\mathbf{y}_i$ belongs to cluster $k$, is a multinomial random variable with parameter $\boldsymbol{\alpha}$.
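Conditional on the parameter values, the cluster-membership probabilities implied by mixture (3) follow from Bayes' rule. The sketch below is our own illustration (a full Bayesian treatment, as in the next paragraph, would instead integrate over the parameters):

```python
import numpy as np

def check_loss(u, p):
    # rho_p(u) = u * (p - I(u < 0))
    return u * (p - (u < 0))

def log_ald(y, mu, sigma, p):
    """Elementwise log asymmetric Laplace density (2)."""
    return np.log(p * (1 - p) / sigma) - check_loss((y - mu) / sigma, p)

def membership_probs(y, X, alpha, betas, sigmas, p):
    """Pr(c = k | y, parameters) under mixture (3).

    y: (T,) response vector; X: (T, L) design matrix;
    alpha: (K,) mixing proportions; betas: (K, L); sigmas: (K,).
    """
    K = len(alpha)
    logw = np.empty(K)
    for k in range(K):
        mu = X @ betas[k]
        logw[k] = np.log(alpha[k]) + log_ald(y, mu, sigmas[k], p).sum()
    w = np.exp(logw - logw.max())   # numerically stabilised Bayes rule
    return w / w.sum()
```

A vector whose values sit close to one cluster's regression line receives a membership probability near one for that cluster.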

We adopt a Bayesian approach to make inference on the model parameters $\psi = (\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\sigma})$. Moreover, it is possible to obtain the posterior probability of membership of a single vector, $\Pr(c_i = \cdot\,|\,Y)$. In doing this, we first note that Kozumi and Kobayashi (2011) represent the density (2) as a location-scale mixture of Gaussian distributions, i.e.

$$y\_t = \mathbf{x}\_t^\prime \boldsymbol{\beta} + \theta w\_t + \omega \sqrt{\sigma w\_t}\, \nu\_t \tag{4}$$

where $\nu_t \sim N(0,1)$ and $w_t$ is an exponential random variable with $E(w) = \sigma$. Here $\nu$ and $w$ are mutually independent, $\theta = (1 - 2p)/\{p(1-p)\}$ and $\omega^2 = 2/\{p(1-p)\}$.

Equation (4) constitutes the first stage of a hierarchical Bayesian model, where the prior distributions on the cluster-specific parameters, as well as on the mixing proportions, are specified as conjugate priors, leading to closed-form conditional posterior densities which are easy to sample from in an MCMC algorithm.

A conjugate prior for the mixing proportions $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)$ is the Dirichlet distribution, $\boldsymbol{\alpha} \sim D(\zeta_1,\dots,\zeta_K)$. A straightforward prior for $\boldsymbol{\beta}_k$ is the multivariate Gaussian distribution $N(b_0, \Sigma_0)$, where setting $b_0 = 0$ and $\Sigma_0 = aI$, for $a \gg 0$, leads to an improper prior. Finally, we propose the inverse gamma distribution, IG($s_0$, $d_0$), as the prior for $\sigma_k$, where the shape and scale parameters, $s_0$ and $d_0$ respectively, are known.

Musau (2021) gives a complete account of how an MCMC algorithm for sampling from the posterior distribution of $\psi$ can be devised.

#### 3 Numerical results

We exemplify our proposal with a clustering problem for functional data. We consider the well-known Canadian temperature dataset available in the R package fda. The dataset consists of the daily temperatures measured at 35 Canadian weather stations across the country.

Under the functional data framework (Ramsay & Silverman, 2005), the daily temperature data $y_t$ can be described by a linear combination of $L = 65$ cubic spline basis functions, $y_t \approx \sum_{j=1}^{L} \beta_j B_j(t) = \mathbf{x}_t^\prime \boldsymbol{\beta}$, with knots equally distributed over the range of time.

The funHDDC clustering algorithm (Bouveyron & Jacques, 2011) applied to these data selects $K = 4$ as the optimal number of clusters. Figure 1 (left panel) summarizes the resulting clusters.

For each of the 35 stations we randomly introduce outliers ($y_t = 0$) at 10% of the total observation points. This distorts the general trend of the data, as shown in the right panel of Figure 1, making reconstruction of the clusters difficult.

We apply our mixture model setting $p = 0.5$, i.e. we consider a robust median regression, and we compare its performance in reconstructing the 4 clusters with the previous algorithm, obtaining a perfect agreement. These results generally indicate a good performance of our proposed algorithm when clustering data characterized by outlying observations.

Figure 1. *Clustering of the 35 temperature curves as obtained by funHDDC algorithm (left panel) and results with curves contaminated by outliers (right panel).*

#### References

BOUVEYRON, C., & JACQUES, J. 2011. Model-based clustering of time series in group-specific functional subspaces. *Advances in Data Analysis and Classification*, 5, 281–300.

FRALEY, C., & RAFTERY, A.E. 2002. Model-based clustering, discriminant analysis, and density estimation. *Journal of the American Statistical Association*, 97, 611–631.

KOENKER, R., & BASSETT, G. 1978. Regression quantiles. *Econometrica*, 46, 33–50.

KOENKER, R., & MACHADO, J.A.F. 1999. Goodness of fit and related inference processes for quantile regression. *Journal of the American Statistical Association*, 94, 1296–1310.

KOZUMI, H., & KOBAYASHI, G. 2011. Gibbs sampling methods for Bayesian quantile regression. *Journal of Statistical Computation and Simulation*, 81, 1565–1578.

MUSAU, V.M. 2021. *Model-based Clustering Using Quantile Regression*. Ph.D. thesis, University of Padua, Italy.

RAMSAY, J.O., & SILVERMAN, B.W. 2005. *Functional Data Analysis*. Springer, New York.



### SOCIOECONOMIC INEQUALITIES AND CANCER RISK: MYTH OR REALITY?


Carlotta Galeone <sup>1</sup>

<sup>1</sup> University of Milan, (e-mail: carlotta.galeone@statinfo.org)

ABSTRACT: Accurate quantification of the impact of low socioeconomic position (SEP) on selected non-communicable diseases, including diabetes and several cancers, is needed. There is increasing evidence that low SEP is a strong determinant of morbidity and premature mortality concerning cancer risk. This is mainly caused by a delay in screening uptake, with consequent late symptomatic presentation, and by lifestyle (including diet, smoking habits and physical activity). The accurate quantification of the relation between SEP and cancer risk is crucial to plan public health interventions for the reduction of cancer incidence and socioeconomic disparities. The recent advent of collaborative and interdisciplinary research, pooling a large amount of worldwide epidemiological data in multi-institutional data consortia, is the answer to this gap in knowledge. In fact, data analyses of epidemiological consortia allow investigators to define and quantify the associations of interest with a higher degree of accuracy, explore subgroups of the population, and investigate the interactions between environmental, genetic, and socioeconomic factors. The Stomach Cancer Pooling (StoP) Project and the International Head and Neck Cancer Epidemiology (INHANCE) consortium are two examples of large data consortia in which the University of Milan is proactively involved. Their large sample size allowed investigators to address the effects of education and household income on the onset and evolution of the disease. INHANCE findings suggested that low education and low income are risk factors for head and neck cancer, independent of tobacco smoking and alcohol consumption. The collaborative pooled analysis within the StoP consortium showed a strong inverse relation between SEP indicators and gastric cancer risk, with a 40% decreased risk among individuals with intermediate/high education status compared with less educated study subjects.
In conclusion, social epidemiology is crucial to understand the socio-structural factors related to health and disease. In an era of fast and diffuse communication and data sharing, large data consortia are among the most effective strategies to generate useful new social-epidemiological evidence. In these examples of data consortia, SEP is strongly related to a number of cancers. Health education campaigns targeting socioeconomically disadvantaged and vulnerable populations are probably the most efficacious strategy to reduce the cancer burden in the world.

KEYWORDS: socioeconomic inequalities, cancer risk, INHANCE, StoP.

## PARAMETER-WISE CO-CLUSTERING FOR HIGH DIMENSIONAL DATA

Michael Gallaugher<sup>1</sup>, Christophe Biernacki<sup>2</sup> and Paul McNicholas<sup>3</sup>

<sup>1</sup> Baylor University, (e-mail: Michael Gallaugher@baylor.edu)

<sup>2</sup> University Lille 1, (e-mail: christophe.biernacki@inria.fr)

<sup>3</sup> McMaster University, (e-mail: paulmc@mcmaster.ca)


ABSTRACT: Due to the "big data" phenomenon, data is becoming increasingly high-dimensional. As such, new techniques need to be developed to handle high-dimensional data, and this is especially true in clustering. One such clustering method for high-dimensional data is co-clustering, where the aim is to cluster both rows and columns, resulting in data blocks, or co-clusters, where observations within each block are independent and identically distributed. Although highly parsimonious, co-clustering can be quite inflexible. In this talk, a method that clusters columns according to both means and variances, while assuming normality, will be presented. The proposed model increases flexibility while maintaining a high degree of parsimony. Both simulated and real data will be used for illustration.

KEYWORDS: model-based clustering, mixture model, co-clustering, high-dimensional data

## QUANTIFYING THE IMPACT OF COVARIATES ON THE GENDER GAP MEASUREMENT: AN ANALYSIS BASED ON EU-SILC DATA FROM POLAND AND ITALY

European Social Report (TARKI, 2008), a study on intolerance to income in- ´ equality across countries confirmed a markedly lower level of acceptance of inequality in the post-socialist bloc than in the other European countries. The calculations were based on microdata from the European Union Statistics on

### QUANTIFYING THE IMPACT OF COVARIATES ON THE GENDER GAP MEASUREMENT: AN ANALYSIS BASED ON EU-SILC DATA FROM POLAND AND ITALY

Francesca Greselin<sup>1</sup> and Alina Jędrzejczak<sup>2</sup>

<sup>1</sup> Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Italy (francesca.greselin@unimib.it)

<sup>2</sup> Department of Statistical Methods, University of Łódź, Poland (alina.jedrzejczak@uni.lodz.pl)

ABSTRACT: High income inequality, accompanied by substantial regional differentiation, is still a great challenge for social policy makers in many European countries. One of the important elements of this phenomenon is the inequality between the income distributions of men and women. Using data from EU-SILC 2018, we compare the distributions of income for Italy and Poland, and analyze the gender gap in these countries. We aim to uncover the socioeconomic factors that could contribute to explaining the differences observed in the income distributions of men and women.

KEYWORDS: Income inequality, Gender gap, Gini index, new Zenga index, relative distribution method, Dagum, Italy, Poland.

#### 1 Introduction

Substantial regional disparities and income inequality are a great challenge for policymakers in many European countries nowadays. One of the critical elements of this phenomenon is the inequality between the income distributions of men and women. The gender pay gap can be a problem from a public policy perspective because it reduces economic output and means that women are more likely to be dependent upon welfare payments, especially in old age.

Many studies analyze income inequality across European Union (EU) countries and regions to inform social and economic policies. The focus of the present paper is on income distributions in Poland and Italy, chosen to compare countries with different economic backgrounds: Poland is still completing the transition from a centrally-planned to a market-based economy, while Italy is a long-established market economy. Moreover, according to the TÁRKI European Social Report (TARKI, 2008), a study on intolerance of income inequality across countries confirmed a markedly lower level of acceptance of inequality in the post-socialist bloc than in the other European countries. The calculations were based on microdata from the European Union Statistics on Income and Living Conditions (EU-SILC) (Eurostat, 2018).

Several methods can be applied to measuring the income discrepancy between men and women. Among them, summary measures remain an important tool for the comparison of distributional changes. However, to uncover the factors contributing to the gender discrepancy, it is useful to move beyond the typical focus on differences in average or median earnings, towards a view of how the entire distribution of women's earnings compares to men's. Indeed, inequality is a property of a distribution. A prominent feature of these methods is the use of the "relative distribution", a transformation of the data from two distributions into a single distribution that contains all the information needed for scale-invariant comparison (Handcock & Morris, 2006). In a previous paper (Greselin & Jędrzejczak, 2020) the authors highlighted remarkable differences between Poland and Italy, especially related to the discrepancy across regions between men and women. The next natural step is hence to search for the socioeconomic factors that could explain the differences observed in the income distributions of men and women.
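As a rough illustration (our own sketch, not taken from the paper), the relative distribution can be computed in a few lines: each comparison observation is graded on the reference empirical CDF, and a histogram of those grades estimates the relative density. The income-generating parameters below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes: men as the reference group, women as the comparison.
men = rng.lognormal(mean=10.0, sigma=0.5, size=5000)
women = rng.lognormal(mean=9.8, sigma=0.5, size=5000)

def relative_data(reference, comparison):
    """Grade each comparison value on the reference ECDF: r = F0(y).

    If the two distributions coincide, r is Uniform(0, 1); the comparison
    is scale-invariant because it only uses ranks."""
    reference = np.sort(reference)
    ranks = np.searchsorted(reference, comparison, side="right")
    return ranks / reference.size

r = relative_data(men, women)
# A histogram of r estimates the relative density g(r): values above 1 in
# the lower grades mean women are over-represented at low (men's) incomes.
g, _ = np.histogram(r, bins=10, range=(0.0, 1.0), density=True)
print(g.round(2))
```

A decile histogram is shown here because the paper's figures are read in decile groups; any bin count works.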

#### 2 Quantifying the covariates effects


Often there are covariates that vary systematically across the compared populations, and the impact of these covariates is of interest. We follow the approach introduced by Handcock & Morris (2006): the overall relative distribution is decomposed into a first component representing the effect of changes in the marginal distribution of a covariate, and a second component capturing the residual changes. The first term is the composition effect, which measures the shift in the covariates from one population to the other. The second term is obtained by adjusting the reference (men) population to have the same marginal covariate composition as the comparison (women) population. By holding the population composition constant across the gender groups, differences in the covariate-response relationships can be correctly identified.

Let $(Y_0, Z_0)$ and $(Y, Z)$ denote random vectors describing the reference and comparison populations, where $Y_0$ and $Y$ are the response variables, while $Z_0$ and $Z$ are the categorical covariates, with support $\{1, 2, \dots, K\}$. Let $\{\pi_k\}_{k=1}^{K}$ and $\{\pi_k^0\}_{k=1}^{K}$ be the probability mass functions of $Z$ and $Z_0$, respectively. These probability mass functions represent the population composition with respect to the covariate. The marginal density of $Y$ can be written as $f(y) := \sum_{k=1}^{K} \pi_k f_{Y|Z}(y \mid k)$, where $f_{Y|Z}(y \mid k)$ denotes the conditional density of $Y$ given $Z = k$, for $k = 1, \dots, K$. An analogous definition holds for $f_0(y) := \sum_{k=1}^{K} \pi_k^0 f_{Y_0|Z_0}(y \mid k)$. Now any differences between $f(y)$ and $f_0(y)$ result from differences in the conditional densities $f_{Y_0|Z_0}(y \mid k)$ and $f_{Y|Z}(y \mid k)$, for $k = 1, \dots, K$: these represent differences in the covariate-response relationship between the two populations.

We can construct a counterfactual distribution for the compositional difference using these ideas. We define the distribution of $Y_0$ *composition-adjusted* to $Y$ to be $Y_{0C}$, with density $f_{0C}(y) := \sum_{k=1}^{K} \pi_k f_{Y_0|Z_0}(y \mid k)$. It corresponds to a counterfactual population with the covariate composition of the comparison population and the covariate-response relationship of the reference population. Comparisons of $f_{0C}(y)$ to $f(y)$ hold the population composition constant, and therefore isolate differences in the covariate-response relationship. By contrast, $f_0(y)$ and $f_{0C}(y)$ have the same covariate-response relationship, and comparisons between them isolate the impact of the compositional shifts. Using the composition-adjusted response distribution, we can decompose the overall relative distribution into a component that represents the effect of changes in the marginal distribution of the covariate (the composition effect), and a component that represents the residual changes. In terms of density ratios, we have:

$$\frac{f(y_r)}{f_0(y_r)} = \frac{f_{0C}(y_r)}{f_0(y_r)} \times \frac{f(y_r)}{f_{0C}(y_r)} \tag{1}$$
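To make the decomposition concrete, here is a small numerical sketch (our own illustration, with invented parameters): a binary covariate with different compositions in the two populations and different conditional densities. The counterfactual density mixes the reference conditionals with the comparison composition, and identity (1) then holds pointwise by construction.

```python
import numpy as np

def npdf(y, mu, sd):
    """Gaussian density, a stand-in for the conditional densities f(y|k)."""
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Hypothetical K = 2 covariate (e.g. non-managerial / managerial position).
pi0 = np.array([0.80, 0.20])  # composition of the reference population (men)
pi = np.array([0.90, 0.10])   # composition of the comparison population (women)
mu0 = np.array([10.0, 10.8])  # reference conditional means by category
mu = np.array([9.7, 10.5])    # comparison conditional means by category
sd = 0.4

y = np.linspace(8.0, 12.5, 451)
f0 = pi0[0] * npdf(y, mu0[0], sd) + pi0[1] * npdf(y, mu0[1], sd)  # f_0(y)
f = pi[0] * npdf(y, mu[0], sd) + pi[1] * npdf(y, mu[1], sd)       # f(y)
# Counterfactual: comparison composition, reference covariate-response.
f0C = pi[0] * npdf(y, mu0[0], sd) + pi[1] * npdf(y, mu0[1], sd)   # f_0C(y)

# Identity (1): overall ratio = composition effect x residual effect.
overall = f / f0
composition_effect = f0C / f0
residual_effect = f / f0C
assert np.allclose(overall, composition_effect * residual_effect)
```

Plugging estimated conditional densities and category frequencies into the same three ratios yields the three panels discussed for the Polish and Italian data.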

Figure 1 graphically shows the decomposition of the relative income distribution of women in relation to men, taking the position held (managerial or not, variable PL150: Managerial position) as the explanatory variable. The first panel from the left shows the (uncorrected) relative density of income differences between men and women, the middle panel represents the effect of differences in the distributions of the explanatory variable, and the right panel represents the counterfactual distribution, i.e., the expected relative density for men's and women's income distributions assuming the same profiles of positions held in both groups. The comparison of the three relative densities provides a useful tool for assessing the relative magnitude and nature of the impact of the individual components. It can be noticed that the distribution presented in the middle panel is U-shaped and, in the central part, close to the uniform distribution. This means that the difference in the structure of managerial positions observed between the two cohorts in the central deciles has little effect on the observed income gap. On the other hand, greater differences were observed in the extreme decile groups, which suggests a certain polarization of income in these groups in relation to the position held. Women from the last decile occupy higher positions, which, however, does not translate into their earnings. As a result, the income gap in these groups, adjusted by the type of position held in the counterfactual distribution, widens (right panel). On the other hand, results on the Italian data are somewhat different.

Figure 1. *The three plots for Polish data, to assess the effect of managerial position* (x-axis: proportion of men; y-axis: relative density G)

Figure 2. *The three plots for Italian data, to assess the effect of managerial position* (x-axis: proportion of men; y-axis: relative density G)

#### 3 Conclusions and further research

We developed a first analysis of covariate effects for studying the gender gap. Beyond the univariate case, the adjustment for multivariate covariates is also worth considering and is the object of ongoing work.

#### References

GRESELIN, F., & JĘDRZEJCZAK, A. 2020. Analyzing the Gender Gap in Poland and Italy, and by Regions. *International Advances in Economic Research*, 26(4), 433–447.

HANDCOCK, M. S., & MORRIS, M. 2006. *Relative Distribution Methods in the Social Sciences*. Springer Science & Business Media.

TARKI. 2008. TÁRKI European Social Report. *Czech Sociological Review*.


### A TRANSDIMENSIONAL MCMC SAMPLER FOR SPATIALLY DEPENDENT MIXTURE MODELS


Alessandra Guglielmi<sup>1</sup>, Mario Beraha<sup>1</sup>, Matteo Gianella<sup>1</sup>, Matteo Pegoraro<sup>2</sup> and Riccardo Peli<sup>2</sup>

<sup>1</sup> Department of Mathematics, Politecnico di Milano, (e-mail: alessandra.guglielmi@polimi.it)

<sup>2</sup> MOX, Department of Mathematics, Politecnico di Milano

ABSTRACT: We consider the problem of spatially dependent areal data, where independent observations are available for each area, and propose to model the density of each area through a finite mixture of Gaussian distributions. The spatial dependence is introduced via a novel joint distribution for a collection of vectors in the simplex, which we term logisticMCAR. We also discuss a generalization of the mixture model with a random number of components, introducing a reversible jump algorithm to sample from the full posterior. Through simulated data examples we assess the performance of our algorithm. Moreover, we present an application to a real dataset of Airbnb listings in the city of Amsterdam, also showing how to easily incorporate additional covariate information in the model.

KEYWORDS: finite mixture models, spatial density estimation, logistic normal, multivariate CAR models, reversible jump.

#### 1 Introduction

Mixture models (Frühwirth-Schnatter *et al.*, 2019) provide a natural framework for density estimation. Though mixtures are often used under the assumption of exchangeable samples from a unique unknown distribution, such models may also be adopted for data that show spatial dependence. In this work we focus on areal data, considering the problem of modelling data from $I$ different groups, where each group corresponds to a specific areal location. More in detail, we assume that the spatial domain $\Omega$ is divided into $I$ areas and, for each area, there is a vector of observations $y_i = (y_{i1}, \dots, y_{iN_i})$ on the same variable, each value $y_{ij}$ corresponding to a different subject $j$ in area $i$. We further assume that, within each areal unit $i$, the data are independent and identically distributed (i.i.d.) from an area-specific density $f_i$; the problem we address is the joint estimation of the spatially dependent densities $f_1, \dots, f_I$. We take the Bayesian viewpoint and specify a prior for the dependent densities $(f_1, \dots, f_I)$ encouraging distributions associated to areas that are spatially close to be more similar than those associated to areas that are far away.


In this paper, we first consider the same framework as in Beraha *et al.*, 2020, where we assume a finite mixture with a fixed number of components $H$ in each area and introduce spatial dependence via a suitable prior on the weights of the mixtures, i.e., the *logistic multivariate CAR prior*. We will show that specific features of the proposed model include (i) a sparse mixture specification as meant in Malsiner-Walli *et al.*, 2016 and (ii) the fact that densities corresponding to areal units belonging to two different connected components of the proximity graph may behave differently.

As it happens with finite mixture models, the choice of an appropriate number $H$ of components is crucial. Under the Bayesian approach, it is straightforward to treat $H$ as random and compute the posterior distribution of all parameters, including $H$. In this case, Markov chain Monte Carlo (MCMC) algorithms for posterior inference are called *transdimensional*, and they are not easy to design. Examples of such transdimensional MCMC algorithms include the reversible jump MCMC sampler in Green, 1995 and the MCMC algorithm based on birth-and-death processes in Stephens, 2000. Hence, we extend the model above (see Beraha *et al.*, 2020 for details) by assuming a prior on the number $H$ of components, and we propose a transdimensional sampler via a reversible jump MCMC algorithm. The approach we follow to design the reversible jump move is based on Norets, 2021.
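As a toy aside (our own example, not the paper's sampler), the flavour of a transdimensional move can be shown on the simplest nested pair of models, where the proposal for the extra parameter is its exact conditional posterior; this is the idealized version of a Norets-style construction. All data and numbers below are invented.

```python
import math
import random

random.seed(1)

# Hypothetical data: n draws from N(0.3, 1).
n = 20
y = [random.gauss(0.3, 1.0) for _ in range(n)]
s = sum(y)

def log_norm_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Model 1: y_i ~ N(0, 1).  Model 2: y_i ~ N(mu, 1) with prior mu ~ N(0, 1).
log_lik1 = sum(log_norm_pdf(yi, 0.0, 1.0) for yi in y)

def log_lik2(mu):
    return sum(log_norm_pdf(yi, mu, 1.0) for yi in y)

# Conjugate conditional posterior of mu under model 2: N(s/(n+1), 1/(n+1)).
post_mean, post_var = s / (n + 1), 1.0 / (n + 1)

def log_q(mu):  # proposal density = exact conditional posterior of mu
    return log_norm_pdf(mu, post_mean, post_var)

# Reversible jump chain over (H, mu), H in {1, 2}, equal prior model odds.
H, mu, hits2, iters = 1, None, 0, 20000
for _ in range(iters):
    if H == 1:  # birth move: propose mu* from q, dimension grows by one
        mu_star = random.gauss(post_mean, math.sqrt(post_var))
        log_alpha = (log_lik2(mu_star) + log_norm_pdf(mu_star, 0.0, 1.0)
                     - log_lik1 - log_q(mu_star))
        if random.random() < math.exp(min(0.0, log_alpha)):
            H, mu = 2, mu_star
    else:       # death move: drop mu, with the reverse acceptance ratio
        log_alpha = (log_lik1 + log_q(mu)
                     - log_lik2(mu) - log_norm_pdf(mu, 0.0, 1.0))
        if random.random() < math.exp(min(0.0, log_alpha)):
            H, mu = 1, None
    hits2 += (H == 2)

# Exact P(model 2 | y) via the closed-form marginal likelihood of model 2,
# obtained from m2(y) = p(y|mu) p(mu) / p(mu|y) evaluated at mu = 0.
log_marg2 = log_lik2(0.0) + log_norm_pdf(0.0, 0.0, 1.0) - log_q(0.0)
p2_exact = 1.0 / (1.0 + math.exp(log_lik1 - log_marg2))
print(hits2 / iters, p2_exact)
```

With an exact-posterior proposal the acceptance ratio is constant, so the model indicator mixes as a two-state chain and the visit frequency of model 2 matches the exact posterior probability; with the crude proposals $q_d$ discussed above, mixing degrades, which is what motivates the approach in the next section.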

#### 2 Details on the model and the reversible jump algorithm

As an extension of the model in Beraha *et al.*, 2020 to a random number of components, we assume the following:

$$y_{ij} \mid \boldsymbol{w}_i, \boldsymbol{\tau}, H \stackrel{\text{iid}}{\sim} \sum_{h=1}^{H} w_{ih}\, \mathcal{N}(\boldsymbol{\tau}_h), \quad j = 1, \dots, N_i \tag{1}$$

$$\boldsymbol{\tau}_h \mid H \stackrel{\text{iid}}{\sim} P_0, \quad h = 1, \dots, H$$

$$(\boldsymbol{w}_1, \dots, \boldsymbol{w}_I) \mid \rho, \sigma^2, H \sim \text{logisticMCAR}(\boldsymbol{0}, \rho, \sigma^2 \mathbf{I}; G) \tag{2}$$

$$\sigma^2 \sim \text{inv-gamma}(\alpha, \beta), \quad \rho \sim \text{beta}(a, b), \quad H \sim \pi(H)$$

Here $\boldsymbol{\tau}_h$ represents the mean and variance of the Gaussian component in mixture (1), $\mathbf{I}$ is the $(H-1) \times (H-1)$ identity matrix, while $\boldsymbol{w}_i = (w_{i1}, \dots, w_{iH})^T$. The distribution $P_0$ is the normal–inverse-gamma density that is conjugate to the Gaussian distribution $\mathcal{N}(\boldsymbol{\tau}_h)$ in (1). The spatial prior logisticMCAR is defined through a logistic transformation of Gaussian multivariate CAR models for auxiliary parameters $\tilde{\boldsymbol{w}}_i$'s. Parameters in (2) include the proximity matrix $G$, fixed in this paper as $g_{ij} = 1$ if areas $i$ and $j$ are neighbours and $g_{ij} = 0$ otherwise; a positive parameter $\rho$ of the multivariate CAR specification, with $\rho = 0$ corresponding to independent transformed weights; and a positive parameter $\sigma^2$ representing the conditional variance of the multivariate CAR model. See Beraha *et al.*, 2020 for the definition of this prior.
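For intuition, the generative mechanism behind (1)–(2) can be sketched as follows. This is our own simplified stand-in (one proper-CAR draw per component coordinate, then an additive logistic map to the simplex), not the exact logisticMCAR prior, whose precise definition is in Beraha *et al.*, 2020; all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sketch: I = 9 areas on a 3x3 grid, H = 3 mixture components.
I, H, N_i = 9, 3, 25
coords = [(r, c) for r in range(3) for c in range(3)]
# Proximity matrix G: rook neighbours on the grid.
G = np.zeros((I, I))
for a in range(I):
    for b in range(I):
        if abs(coords[a][0] - coords[b][0]) + abs(coords[a][1] - coords[b][1]) == 1:
            G[a, b] = 1.0

# Draw transformed weights from a proper CAR model, one draw per component
# coordinate h = 1, ..., H-1, then map each area's vector to the simplex.
rho, sigma2 = 0.9, 1.0
D = np.diag(G.sum(axis=1))
precision = (D - rho * G) / sigma2          # proper CAR precision matrix
cov = np.linalg.inv(precision)
w_tilde = rng.multivariate_normal(np.zeros(I), cov, size=H - 1).T  # I x (H-1)
w_full = np.hstack([w_tilde, np.zeros((I, 1))])
weights = np.exp(w_full) / np.exp(w_full).sum(axis=1, keepdims=True)

# Sample N_i observations per area from the area-specific Gaussian mixture.
means = np.array([-5.0, 0.0, 5.0])
y = []
for i in range(I):
    comps = rng.choice(H, size=N_i, p=weights[i])
    y.append(rng.normal(means[comps], 1.0))
y = np.array(y)  # shape (I, N_i)
```

Because the CAR draw correlates $\tilde{\boldsymbol{w}}_i$ across neighbouring areas, nearby areas get similar weight vectors and hence similar mixture densities, which is exactly the behaviour the prior is meant to encourage.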

2.0

¯), *<sup>w</sup><sup>i</sup>*<sup>2</sup> <sup>=</sup> <sup>−</sup>3(*si* <sup>−</sup> *<sup>s</sup>*¯) <sup>−</sup> <sup>3</sup>(*ti* <sup>−</sup>*<sup>t</sup>*

3(*ti* −*t*

jump sampler.

References

center of area *i* and (*s*¯,*t*

*arXiv:2007.14961*.

*of Statistics*, 28, 40–74.

0 2500 5000 7500 10000 Iteration

Figure 1: Posterior distribution (traceplot) of *H*.

*wi*, *<sup>i</sup>* <sup>=</sup> <sup>1</sup>,...,*I*, are set to the inverse of the logistic transformation of *<sup>w</sup><sup>i</sup>* by definition, while the transformed weights *<sup>w</sup><sup>i</sup>* are fixed as *<sup>w</sup><sup>i</sup>*<sup>1</sup> <sup>=</sup> <sup>3</sup>(*si* <sup>−</sup> *<sup>s</sup>*¯) +

BERAHA, M., PEGORARO, M., PELI, R., & GUGLIELMI, A. 2020. Spatially dependent mixture models via the Logistic Multivariate CAR prior.

FRUHWIRTH ¨ -SCHNATTER, S., CELEUX, G., & ROBERT, C. P. 2019. *Hand-*

GREEN, PETER J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. *Biometrika*, 82, 711–732. MALSINER-WALLI, GERTRAUD, FRUHWIRTH ¨ -SCHNATTER, SYLVIA, & GRUN¨ , BETTINA. 2016. Model-based clustering based on sparse finite

NORETS, A. 2021. Optimal auxiliary priors and reversible jump proposals for a class of variable dimension models. *Econometric Theory*, 37, 49–81. STEPHENS, M. 2000. Bayesian analysis of mixture models with an unknown number of components–an alternative to reversible jump methods. *Annals*

Gaussian mixtures. *Statistics and Computing*, 26, 303–324.

*book of Mixture Analysis*. Boca Raton: CRC Press.

¯) the coordinates of the grid center. From Figure 1, which displays the posterior distribution of *H* (no burn-in and no thinning), it is clear that the true value is recovered by our reversible

¯), where (*si*,*ti*) are the coordinates of the

2.5

3.0

N° of Components

3.5

4.0

As mentioned above, when *H* has the prior distribution π(*H*) with support {1,2,...}, such a model requires a transdimensional sampling scheme for posterior inference. Reversible Jump MCMC samplers (Green, 1995) provide a general framework for transdimensional simulation schemes. Given the current state of the chain <sup>θ</sup> = (*H*,<sup>θ</sup> *<sup>H</sup>*), with <sup>θ</sup> *<sup>H</sup>* = (*<sup>w</sup>*<sup>1</sup>,...,*<sup>w</sup><sup>I</sup>*, <sup>τ</sup> <sup>1</sup>,..., <sup>τ</sup> *<sup>H</sup>*), with *<sup>w</sup><sup>i</sup>* <sup>∈</sup> <sup>R</sup>*H*, the next state <sup>θ</sup> = (*<sup>H</sup>* ,θ *<sup>H</sup>*) is (i) sampled from a proposal distribution *q*(θ ,θ ), and (ii) accepted with probability α(θ ,θ ). Usually, the proposal distribution is defined in two steps. If <sup>θ</sup> *<sup>H</sup>* <sup>∈</sup> <sup>R</sup>*nH* and <sup>θ</sup> <sup>∈</sup> <sup>R</sup>*nH* , with *nH* > *nH* and *<sup>d</sup>* <sup>=</sup> *nH* <sup>−</sup>*nH*, first a random vector *<sup>u</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* is sampled from a distribution *qd*(*u*) and then θ *<sup>H</sup>* is defined as *gH*→*<sup>H</sup>*(θ *<sup>H</sup>*,*u*) for a suitable mapping function *gH*→*<sup>H</sup>* . Since both *qd* and *gH*→*<sup>H</sup>* are arbitrary, the definition of a suitable reversible jump move is usually a difficult task.

The approach we follow to design a reversible jump move is based on Norets, 2021, who introduces auxiliary priors and proposals for generic nested models indexed by *H* in {1,2,...} and a prior for (*H*,θ *<sup>H</sup>*) the form π(θ *<sup>H</sup>* | *H*)π(*H*). Let θ <sup>∞</sup> denote the infinite vector of all parameters for the *largest* model, i.e., the mixture model with infinite components, and let [θ <sup>∞</sup>]*<sup>H</sup>* be the *H* -th entry of θ <sup>∞</sup>, with *H* > *H*. Since models are nested, the unknown parameters are nested as well, i.e., if *<sup>H</sup>* <sup>=</sup> *<sup>H</sup>*+1, [<sup>θ</sup> <sup>∞</sup>]*H*+<sup>1</sup> = (*<sup>w</sup>*<sup>1</sup>*H*+1,...,*<sup>w</sup>IH*+1, <sup>τ</sup> *<sup>H</sup>*+1). The key point is the approximation of the conditional posterior distribution of [θ <sup>∞</sup>]*H*+<sup>1</sup> with a multivariate Gaussian distribution centred at the mode of the conditional posterior of [θ <sup>∞</sup>]*H*+<sup>1</sup> given *y*,*H* +1,θ *<sup>H</sup>*. In this way, we sidestep the artificial construction of proposal distributions and mapping functions whilst ensuring quasi-optimal properties of the resulting sampler in terms of chain mixing and sampler efficiency.

To illustrate our algorithm, we consider the case of *I* = 9 areas in a square unit area domain and we simulate data for each area *i* from

$$w\_{ij} \stackrel{\text{iid}}{\sim} w\_{i1} \mathcal{N}(-\mathfrak{S}, 1) + w\_{i2} \mathcal{N}(0, 1) + w\_{i3} \mathcal{N}(\mathfrak{S}, 1) \quad j = 1, \ldots, 2\mathfrak{S}. \tag{3}$$

Note that the number of samples in each location is small, so that the sharing of information between neighbouring mixtures is a key point. The *true* weights

Figure 1: Posterior distribution (traceplot) of *H*.

*wi*, *<sup>i</sup>* <sup>=</sup> <sup>1</sup>,...,*I*, are set to the inverse of the logistic transformation of *<sup>w</sup><sup>i</sup>* by definition, while the transformed weights *<sup>w</sup><sup>i</sup>* are fixed as *<sup>w</sup><sup>i</sup>*<sup>1</sup> <sup>=</sup> <sup>3</sup>(*si* <sup>−</sup> *<sup>s</sup>*¯) + 3(*ti* −*t* ¯), *<sup>w</sup><sup>i</sup>*<sup>2</sup> <sup>=</sup> <sup>−</sup>3(*si* <sup>−</sup> *<sup>s</sup>*¯) <sup>−</sup> <sup>3</sup>(*ti* <sup>−</sup>*<sup>t</sup>* ¯), where (*si*,*ti*) are the coordinates of the center of area *i* and (*s*¯,*t* ¯) the coordinates of the grid center.

From Figure 1, which displays the posterior distribution of *H* (no burn-in and no thinning), it is clear that the true value is recovered by our reversible jump sampler.

Each component density in (1) is a Gaussian distribution N(τ<sup>*h*</sup>). The spatial prior logisticMCAR is defined through a logistic transformation of Gaussian multivariate CAR models for auxiliary parameters *w̃*<sub>*i*</sub>. Parameters in (2) include the proximity matrix *G*, here fixed as *g*<sub>*ij*</sub> = 1 if areas *i* and *j* are neighbours and *g*<sub>*ij*</sub> = 0 otherwise, a positive parameter ρ of the multivariate CAR specification (ρ = 0 corresponding to the transformed weights being independent), and a positive parameter σ<sup>2</sup> representing the conditional variance of the multivariate CAR model. See Beraha *et al.*, 2020 for the definition of such a prior.

As mentioned above, when *H* has the prior distribution π(*H*) with support {1, 2, ...}, such a model requires a transdimensional sampling scheme for posterior inference. Reversible jump MCMC samplers (Green, 1995) provide a general framework for transdimensional simulation schemes. Given the current state of the chain θ = (*H*, θ<sup>*H*</sup>), with θ<sup>*H*</sup> = (*w̃*<sub>1</sub>, ..., *w̃*<sub>*I*</sub>, τ<sup>1</sup>, ..., τ<sup>*H*</sup>) and *w̃*<sub>*i*</sub> ∈ R<sup>*H*</sup>, the next state θ′ = (*H*′, θ<sup>*H*′</sup>) is (i) sampled from a proposal distribution *q*(θ, θ′), and (ii) accepted with probability α(θ, θ′). Usually, the proposal distribution is defined in two steps. If θ<sup>*H*</sup> ∈ R<sup>*n*<sub>*H*</sub></sup> and θ<sup>*H*′</sup> ∈ R<sup>*n*<sub>*H*′</sub></sup>, with *n*<sub>*H*′</sub> > *n*<sub>*H*</sub> and *d* = *n*<sub>*H*′</sub> − *n*<sub>*H*</sub>, first a random vector *u* ∈ R<sup>*d*</sup> is sampled from a distribution *q*<sub>*d*</sub>(*u*), and then θ<sup>*H*′</sup> is defined as *g*<sub>*H*→*H*′</sub>(θ<sup>*H*</sup>, *u*) for a suitable mapping function *g*<sub>*H*→*H*′</sub>. Since both *q*<sub>*d*</sub> and *g*<sub>*H*→*H*′</sub> are arbitrary, the definition of a suitable reversible jump move is usually a difficult task.

### NON-PARAMETRIC CONSISTENCY FOR THE GAUSSIAN MIXTURE MAXIMUM LIKELIHOOD ESTIMATOR


Christian Hennig<sup>1</sup>, Pietro Coretto<sup>2</sup>

<sup>1</sup> Dipartimento di Scienze Statistiche "Paolo Fortunati", University of Bologna (e-mail: christian.hennig@unibo.it)

<sup>2</sup> Department of Economics and Statistics, University of Salerno (e-mail: pcoretto@unisa.it)

ABSTRACT: Fitting Gaussian mixtures by maximum likelihood is a major model-based approach to clustering. Under certain constraints, for a fixed and known number of mixture components, it is known to be consistent assuming that the data were indeed generated by a Gaussian mixture. Here we state a nonparametric consistency result, showing that under general conditions that allow for distributions that are not Gaussian mixtures the suitably constrained maximum likelihood estimator for Gaussian mixtures is consistent for the value of its own canonical functional (population version).

KEYWORDS: model-based clustering, consistency, separation, k-means

#### 1 Introduction

The Gaussian mixture model is probably the most popular approach to model-based cluster analysis, see, e.g., McLachlan & Peel, 2000. Given the number of mixture components, under suitable conditions (which obviously include that the model holds), the maximum likelihood (ML) estimator is consistent for estimating the parameters of the Gaussian mixture model, see Redner & Walker, 1984. Occasionally it is claimed that ML in Gaussian mixtures requires the mixture model to hold, whereas some other clustering methods are more universally applicable, because they are "nonparametric" and do not rely on model assumptions. Sometimes *k*-means is referred to as nonparametric (despite the fact that it can be derived as ML-estimator for a fixed partition model with spherical Gaussian clusters, see Bock, 1996), based on the nonparametric consistency theorem proved by Pollard, 1981, which shows that without assuming any parametric model, under fairly general conditions, *k*-means converges to its own canonical functional (population version).

Here we state that such a result can also be proved for the ML estimator for Gaussian mixtures, without requiring that the data are in fact generated from a Gaussian mixture. It is also of interest (and discussed in the conference presentation) to what extent it can be made sure that under certain (not necessarily Gaussian mixture) distributions with a clear clustering the value of the Gaussian mixture ML canonical functional can be interpreted appropriately as corresponding to the clusters in the population.

#### 2 ML-estimation of Gaussian mixtures


The Gaussian mixture model is probably the most popular approach to model-based cluster analysis, see, e.g., McLachlan & Peel, 2000. Data are modelled as *p* ≥ 1-dimensional Euclidean random variables *X*<sub>1</sub>, ..., *X*<sub>*n*</sub> i.i.d., where the distribution of *X*<sub>1</sub> has density

$$\psi(\mathbf{x}; \theta) = \sum\_{j=1}^{G} \pi\_j \phi(\mathbf{x}; \mu\_j, \Sigma\_j), \tag{1}$$

where *G* is the number of mixture components (considered fixed here), φ(·; *µ*, Σ) is the *p*-variate Gaussian density with mean *µ* and covariance matrix Σ, π<sub>*j*</sub> ∈ [0, 1] for *j* = 1, 2, ..., *G*, and ∑<sup>*G*</sup><sub>*j*=1</sub> π<sub>*j*</sub> = 1. The parameter vector θ contains all Gaussian parameters plus all proportion parameters.

The standard way of estimating θ is by maximum likelihood (ML). For *X̃*<sub>*n*</sub> = (*X*<sub>1</sub>, ..., *X*<sub>*n*</sub>), the log-likelihood is

$$l\_n(\tilde{X}\_n; \theta) = \frac{1}{n} \sum\_{i=1}^{n} \log \psi(X\_i; \theta). \tag{2}$$

The ML-estimator is then

$$\theta\_n(\tilde{X}\_n) = \underset{\theta \in \Theta\_G}{\arg\max} \, l\_n(\tilde{X}\_n; \theta). \tag{3}$$

The theory presented here will concern the global optimum θ*n*(*X*˜*n*), whereas algorithms used in practice such as the EM (McLachlan & Peel, 2000) cannot guarantee that this is indeed found.
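The objective in (2)-(3) can be evaluated directly. The sketch below, with a hypothetical helper `log_lik` and toy data, illustrates the standard practical remedy: compare candidate solutions, e.g. from several EM starts, by their value of (2) and keep the best.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(X, pis, mus, Sigmas):
    """Scaled log-likelihood l_n(X; theta) of the Gaussian mixture, eq. (2)."""
    dens = sum(pi * multivariate_normal(mu, Sig).pdf(X)
               for pi, mu, Sig in zip(pis, mus, Sigmas))
    return np.mean(np.log(dens))

# toy data: a well-separated two-component mixture in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])

theta_true = ([0.5, 0.5], [np.full(2, -3.0), np.full(2, 3.0)], [np.eye(2)] * 2)
theta_flat = ([0.5, 0.5], [np.zeros(2), np.zeros(2)], [np.eye(2)] * 2)

# the generating parameters attain a larger value of (2) than a poor candidate
assert log_lik(X, *theta_true) > log_lik(X, *theta_flat)
```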

The parameter space Θ*<sup>G</sup>* cannot simply be the space of all parameter vectors that are in principle possible in (1), because *ln* can degenerate if an eigenvalue of a component's covariance matrix converges to zero.

This can be dealt with by constraining the ratio of any two of the eigenvalues of the within-component covariance matrices; see García-Escudero *et al.*, 2018 for a discussion of eigenvalue constraints in Gaussian mixture modelling. Let λ<sub>*j*,*k*</sub> be the *k*th eigenvalue of Σ<sub>*j*</sub>, and define

$$\begin{aligned} \Lambda(\theta) &= \left\{ \lambda\_{j,k} : \ j = 1, 2, \dots, G; \ k = 1, 2, \dots, p \right\}, \\ \lambda\_{\min}(\theta) &= \min \{ \Lambda(\theta) \}, \quad \lambda\_{\max}(\theta) = \max \{ \Lambda(\theta) \}. \end{aligned}$$

Then, for given γ < ∞,

$$\Theta\_G = \left\{ \theta : \ \pi\_j \ge 0 \ \forall j, \ \sum\_{j=1}^{G} \pi\_j = 1; \ \frac{\lambda\_{\max}(\theta)}{\lambda\_{\min}(\theta)} \le \gamma \right\}. \tag{4}$$
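The constrained space (4) is straightforward to check numerically. A sketch, with illustrative function names of our own:

```python
import numpy as np

def eig_ratio(Sigmas):
    """lambda_max(theta) / lambda_min(theta) over all component covariances."""
    lam = np.concatenate([np.linalg.eigvalsh(S) for S in Sigmas])
    return lam.max() / lam.min()

def in_Theta_G(pis, Sigmas, gamma):
    """Membership in the constrained parameter space of eq. (4)."""
    pis = np.asarray(pis)
    return bool(np.all(pis >= 0) and np.isclose(pis.sum(), 1.0)
                and eig_ratio(Sigmas) <= gamma)

Sigmas = [np.diag([1.0, 2.0]), np.diag([0.5, 4.0])]
assert np.isclose(eig_ratio(Sigmas), 8.0)          # 4.0 / 0.5
assert in_Theta_G([0.4, 0.6], Sigmas, gamma=10.0)
assert not in_Theta_G([0.4, 0.6], Sigmas, gamma=5.0)
```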

#### 3 Consistency and the canonical functional

The canonical functional (population version) of an estimator is a functional on the space of distributions that extends the estimator in a canonical manner, so that it reproduces the estimator when applied to the empirical distribution of the dataset. Define

$$L(P; \theta) = \mathrm{E}\_P \log \psi(X; \theta), \qquad L\_G(P) = \sup\_{\theta \in \Theta\_G} L(P; \theta)$$

(the population version of the log-likelihood function and its supremum; E<sub>*P*</sub> denotes the expected value assuming *X* ∼ *P*). Then, the canonical functional corresponding to θ<sub>*n*</sub> is defined as

$$\theta^\*(P) = \underset{\theta \in \Theta\_G}{\arg\max} \, L(P; \theta). \tag{5}$$

This definition (as well as (3)) implies existence and uniqueness of the argmax. These are not trivial. Uniqueness is in fact violated, because for mixture models the order of the mixture components is not identified, so that for *G* > 1 (in case of existence) there are several maximisers of (3) and (5). In this case, define θ<sub>*n*</sub>(*X̃*<sub>*n*</sub>) and θ<sup>∗</sup>(*P*) as any maximiser, *S*(*P*; θ<sup>∗</sup>(*P*)) as the set of all maximisers θ̇ with *L*(*P*; θ̇) = *L*(*P*; θ<sup>∗</sup>(*P*)), and

$$\mathcal{K}(P, \varepsilon) = \left\{ \theta \in \Theta\_G : \inf\_{\dot{\theta} \in S(P; \theta^\*(P))} \|\theta - \dot{\theta}\| < \varepsilon \right\} \quad \text{for any} \quad \varepsilon > 0.$$

The following assumptions are required:

A1 For every *x*<sub>1</sub>, ..., *x*<sub>*G*</sub> ∈ R<sup>*p*</sup>: *P*{*x*<sub>1</sub>, ..., *x*<sub>*G*</sub>} < 1.

A2 *L*<sub>*G*</sub>(*P*) > *L*<sub>*G*−1</sub>(*P*) (implying *L*<sub>*G*</sub>(*P*) > −∞, which follows from existence of second moments).

A1 stops all covariance matrices from degenerating simultaneously. A2 guarantees the existence of the involved covariance matrices, and prevents a proportion parameter from being set to zero so that the corresponding mean and covariance matrix could take any value without changing the likelihood. From these it follows that

• θ<sub>*n*</sub>(*X̃*<sub>*n*</sub>) exists with probability arbitrarily close to 1 for large enough *n*,

• θ<sup>∗</sup>(*P*) exists,

and ultimately the nonparametric consistency result:

Theorem 1. *Assume A1 and A2. Then for every* ε > 0 *and every sequence of maximisers* θ<sub>*n*</sub>(*X̃*<sub>*n*</sub>) *of l<sub>n</sub>:*

$$\lim\_{n \to \infty} P\left\{ \theta\_n(\tilde{X}\_n) \in \mathcal{K}(P, \varepsilon) \right\} = 1.$$

This can be proved by adapting results in Coretto & Hennig, 2017 (where corresponding statements are shown for a version including an additional "noise component") to the Gaussian mixture case.

#### References

BOCK, H. H. 1996. Probabilistic models in cluster analysis. *Computational Statistics & Data Analysis*, 23, 5–28.

CORETTO, P., & HENNIG, C. 2017. Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. *Journal of Machine Learning Research*, 18, 1–39.

GARCÍA-ESCUDERO, L. A., GORDALIZA, A., GRESELIN, F., INGRASSIA, S., & MAYO-ISCAR, A. 2018. Eigenvalues and constraints in mixture modeling: geometric and computational issues. *Advances in Data Analysis and Classification*, 12, 203–233.

MCLACHLAN, G. J., & PEEL, D. 2000. *Finite Mixture Models*. New York: Wiley.

POLLARD, D. 1981. Strong Consistency of *K*-Means Clustering. *Annals of Statistics*, 9, 135–140.

REDNER, R. A., & WALKER, H. F. 1984. Mixture densities, maximum likelihood and the EM algorithm. *SIAM Review*, 26, 195–239.


## IMPROVING THE RELIABILITY OF A NONPROBABILITY WEB SURVEY


Yinxuan Huang<sup>1</sup> and Natalie Shlomo<sup>1</sup>

<sup>1</sup> Social Statistics, School of Social Sciences, University of Manchester (e-mail: Natalie.shlomo@manchester.ac.uk)

ABSTRACT: In this paper we present robust weighting adjustments and imputation methods to compensate for selection bias in a nonprobability online web survey taken from the WageIndicator (WI) programme (www.wageindicator.org). For the substantive study, we estimate the gender pay gap (GPG) using the 2016 WI survey data from the Netherlands. To calculate the adjustment weights, we use the 2016 EU-SILC data as a reference sample. Based on the study of the GPG, we show that the combination of predictive mean matching and robust weighting adjustment techniques is able to compensate for the selection bias in the nonprobability web survey and ameliorate outcomes of the Blinder-Oaxaca decomposition model in terms of the degree of similarity relative to patterns found in representative probability samples in the Netherlands.

KEYWORDS: Gender Pay Gap, Blinder-Oaxaca decomposition, propensity score, predictive mean matching, sample matching

#### 1. Introduction

One nonprobability web survey supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 730998 (InGRID-2 Integrating Research Infrastructure for European expertise on Inclusive Growth from data to policy) is the WageIndicator (WI) programme (www.wageindicator.org). It was initiated in The Netherlands in 2001 as a platform for employees and employers looking for information about income. The respondents of this multilingual web survey are volunteers recruited through national WI websites and a wide range of websites of WI partners. Apart from questions on real wage data, working conditions, and demographic characteristics, WI web surveys also cover a wide range of topics related to job and life satisfaction, work-life balance and health. However, one of the key issues is self-selection.

In this paper we present an application using the 2016 Netherlands WI data to measure the gender pay gap (GPG) using log hourly wage. We selected from this dataset those that are employed or self-employed; the minimum age in the data was 17. We also deleted outliers with implausibly small or large log hourly wage, since there is no interviewer screening of responses or edit checks in the web survey of the kind typically carried out in a probability-based sample. The final sample size was 22,643. To adjust for the selection bias, it is necessary to identify a probability reference sample, and for this purpose we used the 2016 Netherlands EU-SILC dataset. We selected only the employed and self-employed with a minimum age of 17 to be consistent with the WI data. The EU-SILC sample size was 12,939.

#### 2. Adjustment Weights and Imputation in the WageIndicator Web survey


We use quasi-randomisation approaches to account for the selection bias in the 2016 Netherlands WI dataset where the two main techniques are sample matching and post-hoc adjustments using propensity scores.

Sample Matching: we calculate a propensity score to estimate the probability of participation in the nonprobability WI dataset. The WI dataset is stacked onto the EU-SILC dataset, and we define a membership indicator equal to one if record *i* is in the WI dataset and zero otherwise. Using a logistic regression model, we estimate a propensity score of participation given a vector of covariates that are common to both datasets. The covariates are: age group (17-25, 26-35, 36-45, 46-55, 56-65, 66+), sex (Males, Females), employment (Employed, Self-employed), education (Elementary, Secondary, Tertiary, Missing), occupation (Manager, Professional, Technician, Clerical, Service sales, Agricultural, Craft/trade, Operators, Elementary, Missing). Then, within strata defined by sex and age group, we identify the record in the WI dataset and the record in the EU-SILC data having the closest propensity score and copy the WI log hourly wage to the EU-SILC record. We excluded those cases where the WI log hourly wage was missing and allowed for up to 10 multiple donors from the WI dataset. For the substantive analysis on the GPG, sample weights and all covariates used were those of the EU-SILC data, but the response variable of log hourly wage is from the WI dataset.
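The matching step can be sketched as follows under simplifying assumptions: a single stratum (no sex-by-age strata), no 10-donor cap, and synthetic continuous covariates. The logistic fit and the helper names are ours, not the authors' implementation.

```python
import numpy as np
from scipy import optimize
from scipy.special import expit

def fit_propensity(X, z):
    """Logistic regression of the membership indicator z on covariates X
    (z = 1 for WI records, 0 for EU-SILC records in the stacked file)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    negll = lambda b: np.sum(np.logaddexp(0.0, Xd @ b) - z * (Xd @ b))
    b = optimize.minimize(negll, np.zeros(Xd.shape[1]), method="BFGS").x
    return expit(Xd @ b)  # estimated participation propensities

def match_nearest(p_ref, p_donor, y_donor):
    """Give each reference record the donor outcome with the closest
    propensity score (single stratum, donors reusable)."""
    idx = np.abs(p_ref[:, None] - p_donor[None, :]).argmin(axis=1)
    return y_donor[idx]

# toy stacked file: 60 WI donors (z = 1) and 40 EU-SILC records (z = 0)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
z = np.r_[np.ones(60), np.zeros(40)]
p = fit_propensity(X, z)
wages = rng.normal(3.0, 0.2, size=60)           # donors' log hourly wages
imputed = match_nearest(p[60:], p[:60], wages)  # wages carried to EU-SILC side
```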

Propensity Score Adjustment: to calculate the propensity score, we use the method proposed in Chen et al. (2019), which utilizes the weights of the EU-SILC reference sample. The initial weight *d*<sub>*i*</sub> of the WI data is the inverse propensity score. The final weight of the WI data is obtained by benchmarking to the EU-SILC weighted data using post-stratification and raking on the 5 covariates mentioned above: sex\*age group\*education and employment\*occupation.
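The benchmarking step is essentially iterative proportional fitting. A sketch with two hypothetical one-way margins standing in for the paper's sex\*age group\*education and employment\*occupation cross-classifications:

```python
import numpy as np

def rake(w, groups, targets, iters=50):
    """Iterative proportional fitting: rescale weights w until the weighted
    category totals match the target margins of each grouping variable."""
    w = w.astype(float).copy()
    for _ in range(iters):
        for g, t in zip(groups, targets):
            totals = np.bincount(g, weights=w, minlength=len(t))
            w *= (t / totals)[g]
    return w

# toy WI file: initial weights are inverse propensity scores
rng = np.random.default_rng(7)
w0 = 1.0 / rng.uniform(0.2, 0.8, size=200)
sex = rng.integers(0, 2, size=200)   # first margin (e.g. sex)
edu = rng.integers(0, 3, size=200)   # second margin (e.g. education)
targets = [np.array([520.0, 480.0]), np.array([300.0, 400.0, 300.0])]

w = rake(w0, [sex, edu], targets)
# the last-adjusted margin is matched exactly, the other to high accuracy
assert np.allclose(np.bincount(edu, weights=w), targets[1])
```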

Missing Data: the calculation of adjustment weights for the WI dataset included item missing data, defined as separate categories for the variables education (16%) and occupation (21%). Besides these variables, there is missing data in log hourly wage (45%). Therefore, we carried out an imputation method whilst accounting for the adjustment weights, to ensure that the imputation was applied on representative data. For this purpose we ran the MICE procedure with predictive mean matching (van Buuren et al., 2011) in R (package function: mice.impute.pmm). Other variables in the imputation model with no missing data were sex, age group and urbanicity (Large cities, Small cities, Rural areas), and we also included the adjustment weight to account for the selection bias. We denote this approach by Weight/PMM. In addition, we carried out a different, single-imputation approach: we first imputed the WI dataset using a single iteration of predictive mean matching and then calculated the adjustment weights with no missing-data categories (any missing data in the EU-SILC were deleted). We denote this approach by PMM/Weight. A simulation study not shown here showed that both approaches provide similar point estimates of correlations and regression coefficients; however, the PMM/Weight approach had less variation compared to the Weight/PMM approach, as is expected from single imputation.
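The core of a single predictive-mean-matching pass can be sketched as below. This toy version (hypothetical `pmm_impute`) ignores the chained-equations iteration of MICE and uses the adjustment weights only in the regression step, which is a simplification of what mice.impute.pmm does.

```python
import numpy as np

def pmm_impute(X, y, w, k=5, rng=None):
    """One predictive-mean-matching pass: weighted linear regression of the
    observed y on X, then each missing y receives the observed value of a
    donor whose prediction is among the k nearest."""
    rng = rng or np.random.default_rng()
    obs = ~np.isnan(y)
    Xd = np.column_stack([np.ones(len(X)), X])
    sw = np.sqrt(w[obs])
    beta = np.linalg.lstsq(Xd[obs] * sw[:, None], y[obs] * sw, rcond=None)[0]
    pred = Xd @ beta
    out = y.copy()
    for i in np.where(~obs)[0]:
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        out[i] = y[obs][rng.choice(donors)]
    return out

# toy file: log hourly wage missing for about 45% of records
rng = np.random.default_rng(11)
X = rng.normal(size=(300, 2))
y = 2.5 + X @ np.array([0.3, -0.2]) + rng.normal(0.0, 0.1, 300)
y[rng.random(300) < 0.45] = np.nan
w = rng.uniform(0.5, 2.0, 300)        # adjustment weights, as in Weight/PMM
y_imp = pmm_impute(X, y, w, rng=rng)
```

Imputed values are always observed donor values, which is what makes PMM robust to model misspecification.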

#### 3. Application Measuring the Gender Pay Gap

The advantage of using the WI data to measure the GPG is that it has the variable log hourly wage. In contrast, the EU-SILC data has only annual income from wages and is therefore dependent on confounders such as part-time work. To measure the GPG, we use the Blinder-Oaxaca decomposition (Oaxaca, 1973; Blinder, 1973), which is available in the STATA package (Jann, 2008). The method explains the difference in the means of the log hourly wage by decomposing the gender gap into the part that is due to differences in the mean values of the independent variables in the model, and group differences in the effects (parameters) of the independent variables. The method calculates the size and significance of the overall pay gap between men and women, and also divides the gap into a part that is explained by differences in determinants of wages and a part that cannot be explained by such group differences. Moreover, since our analysis includes employees and self-employed as reported by the respondents to the WI web survey, the Blinder-Oaxaca decomposition model is integrated with Heckman's selection model to correct for self-selection in the labour market. All methods in the analyses used the weights described in Section 2. As a benchmark for the analysis, the 2016 GPG in the Netherlands was around 15.6% based on the Structure of Earnings survey.
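The decomposition of the mean gap can be sketched as a twofold Blinder-Oaxaca computation. This unweighted toy version, with names of our own choosing, omits the Heckman selection correction and the survey weights used in the paper, and takes the male coefficients as the reference wage structure.

```python
import numpy as np

def oaxaca_twofold(X_m, y_m, X_f, y_f):
    """Twofold Blinder-Oaxaca decomposition of the mean log-wage gap,
    with the male coefficients as the reference structure."""
    add1 = lambda X: np.column_stack([np.ones(len(X)), X])
    b_m = np.linalg.lstsq(add1(X_m), y_m, rcond=None)[0]
    b_f = np.linalg.lstsq(add1(X_f), y_f, rcond=None)[0]
    xbar_m, xbar_f = add1(X_m).mean(axis=0), add1(X_f).mean(axis=0)
    gap = y_m.mean() - y_f.mean()
    explained = (xbar_m - xbar_f) @ b_m    # endowment differences
    unexplained = xbar_f @ (b_m - b_f)     # coefficient differences
    return gap, explained, unexplained

# toy samples: same returns to covariates, different intercepts and endowments
rng = np.random.default_rng(5)
X_m, X_f = rng.normal(0.2, 1.0, (500, 2)), rng.normal(0.0, 1.0, (400, 2))
y_m = 3.0 + X_m @ np.array([0.3, 0.1]) + rng.normal(0.0, 0.1, 500)
y_f = 2.9 + X_f @ np.array([0.3, 0.1]) + rng.normal(0.0, 0.1, 400)

gap, expl, unexpl = oaxaca_twofold(X_m, y_m, X_f, y_f)
assert np.isclose(gap, expl + unexpl)   # the decomposition is exact
```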



Table 1 shows the results of the Blinder-Oaxaca decomposition of the difference between log hourly earnings of men and women. The upper section shows the overall pay gaps between men and women under the different approaches: original WI, Weight/PMM, PMM/Weight and sample matching. In addition, the overall explained part and the unexplained part are also expressed as a percentage of the difference between log hourly earnings of men and women. The subcomponents of the explained part are displayed in the lower section of Table 1. The explanatory variables included in the analysis are age, education, occupation, and urbanicity.


Table 1: Oaxaca-Blinder decomposition of GPG with adjusted selection bias for men and women

| | Original WI (unweighted, no missing data) | Weight/PMM | PMM/Weight | Sample Matching (EU-SILC weights) |
|---|---|---|---|---|
| **Overall** | | | | |
| Men | 2.67 | 3.16 | 3.12 | 2.72 |
| Women | 2.43 | 2.70 | 2.62 | 2.61 |
| Difference | 0.24\* | 0.46\*\*\* | 0.50\*\*\* | 0.11 |
| Total gap in logged hourly wage | 9% | 18% | 16% | 4% |
| Explained % | 7% | 27% | 34% | 3% |
| Unexplained % | 93% | 73% | 66% | 97% |
| **Detailed composition (%) of the explained gap** | | | | |
| Age Group | 1% | −2% | −2% | 18% |
| Education | 33% | 42% | 41% | 36% |
| Occupation | 62% | 64% | 65% | 40% |
| Urbanicity | 4% | −4% | −4% | 7% |
| n | 10,851 | 22,643 | 22,643 | 12,096 |

All approaches in Table 1 suggest a pay gap between men and women in favour of men. With regard to the size of the GPG (the difference between log hourly wage of men and women), the gaps detected in the original WI data and in the sample matching approach are smaller than those detected under the Weight/PMM and PMM/Weight approaches. The GPG is 9% in the original WI dataset and even smaller, 4%, under sample matching (where the difference was not significant). The Weight/PMM and PMM/Weight approaches, which use the adjustment weights and imputation explained in Section 2, yield a GPG of 18% and 16% respectively, both highly significant, which is approximately the expected level. We note that the results of this model depend on the explanatory variables that we have available.

#### 4. Conclusions


In this substantive study of estimating the 2016 GPG for the Netherlands based on the 2016 WI nonprobability web survey, we provide important lessons for others working with this type of data on how to improve the reliability of nonprobability online data collection for carrying out general inference. We demonstrate that choosing a probability-based reference sample, applying the robust estimation of propensity scores of Chen et al. (2019) with benchmarking on the inverse propensity scores to produce the final weight adjustments, and using predictive mean matching to impute missing data can overcome potential biases in a nonprobability sample. We also demonstrated that sample matching did not produce credible results for this application. We showed two approaches for carrying out imputations of item missing data: impute after the weighting adjustments and include the weight variable as a covariate in the imputation model (Weight/PMM); or impute missing data within the nonprobability sample to obtain a complete dataset and then carry out the weighting adjustments (PMM/Weight). The approaches provide similar results, albeit with smaller variation in the PMM/Weight approach, as it is typically based on a single imputation.

We note that none of the other studies using the online WI web-survey datasets attempts to adjust for the selection bias using a probability-based reference sample, as we have shown here with the EU-SILC for the study of the GPG in the Netherlands. We provide evidence that robust methods must be applied to improve the reliability of a web survey before carrying out statistical analyses; otherwise we can obtain severely biased results.

Acknowledgement: The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 730998 (InGRID-2 Integrating Research Infrastructure for European expertise on Inclusive Growth from data to policy).

### References

BLINDER, A. S. 1973. Wage Discrimination: Reduced Form and Structural Estimates. *Journal of Human Resources*, 8(4), 436–455.

CHEN, Y., LI, P. & WU, C. 2019. Doubly Robust Inference with Non-probability Survey Samples. *Journal of the American Statistical Association*, 115(532), 2011–2021.

JANN, B. 2008. The Blinder-Oaxaca Decomposition for Linear Regression Models. *The Stata Journal*, 8(4), 453–479.

OAXACA, R. 1973. Male-Female Wage Differentials in Urban Labour Markets. *International Economic Review*, 14(3), 693–709.

VAN BUUREN, S. & GROOTHUIS-OUDSHOORN, K. 2011. mice: Multivariate Imputation by Chained Equations in R. *Journal of Statistical Software*, 45(3), 1–67.


### A SEMI-BAYESIAN APPROACH FOR THE ANALYSIS OF SCALE EFFECTS IN ORDINAL REGRESSION MODELS


Maria Iannario<sup>1</sup> and Claudia Tarantola<sup>2</sup>

<sup>1</sup> Department of Political Sciences, University of Naples Federico II, (e-mail: maria.iannario@unina.it)

<sup>2</sup> Department of Economics and Management, University of Pavia, (e-mail: claudia.tarantola@unipv.it)

ABSTRACT: In this paper we propose a semi-Bayesian approach for the analysis of categorical data with an ordered outcome when a scaling component is considered. A recursive partitioning method yielding two trees (one for the location and one for the scaling) is used for selecting covariates; then a Bayesian approach for model estimation is implemented and an MCMC sampler is used to obtain posterior estimates. An analysis of risk perception concerning the Covid-19 pandemic is carried out to assess the performance of the method.

KEYWORDS: Heterogeneity of variances, ordinal responses, scale effects, tree structure, MCMC.

#### 1 Background and preliminaries

Ordinal regression models based on a rating procedure are common in different disciplines such as Economics, Marketing, Medicine and Psychology. If unobserved heterogeneity of variances is present, scale effects are needed in regression structures with ordinal responses. The modeling of scale effects in ordinal regression was already considered by McCullagh, 1980, who introduced the so-called location-scale model, extended in the Bayesian framework because of its flexibility in specifying models and its richness and accuracy in providing parameter estimates (see Bürkner, 2017; Liddell & Kruschke, 2018). Variable selection in this framework represents a challenge, since typically it is not known which variables contribute to the location and to the scaling component. Tree-based methods offer a nonparametric solution to investigate the interaction structure and automatically select variables (see Tutz & Berger, 2021). In our proposal we take into account covariates obtained for the two components by separate trees and implement an ordinal logit model with parameters estimated through a Bayesian approach.

#### 2 Model description


Let $Y = (Y_1, Y_2, \dots, Y_n)$ be a random sample generated by an ordinal random variable $Y \sim G(y)$ on the support $\{1,\dots,k\}$, where $k$ is a known integer. We interpret $Y_i$ as the rating expressed by the $i$-th subject about a definite item. For each subject, we collect information $I_i = (y_i, x_i)$, for $i = 1,2,\dots,n$, where $y_i$ is the observed value of the rating and $x_i$ is a row vector of the matrix $X$ which includes all the appropriate covariates. We indicate with $Y_i^*$ the underlying (continuous) latent variable such that

$$\alpha_{j-1} < Y_i^* \le \alpha_j \qquad \Longleftrightarrow \qquad Y_i = j, \qquad j = 1,2,\dots,k,$$

where $-\infty = \alpha_0 < \alpha_1 < \dots < \alpha_k = +\infty$ are the thresholds of $Y^*$.

Assume that $p \ge 1$ covariates are relevant for explaining $Y^*$ by the latent regression model

$$Y_i^* = x_i \beta + \sigma \varepsilon_i, \qquad i = 1,2,\dots,n,$$

where $\sigma$ is the standard deviation of the noise variable $\varepsilon \sim F_\varepsilon(\cdot)$. Then, the probability mass function of $Y_i$, for $j = 1,2,\dots,k$, is:

$$\Pr\left(Y_i = j \mid \theta, x_i\right) = \Pr\left(\alpha_{j-1} < Y_i^* \le \alpha_j\right) = F_\varepsilon\!\left(\frac{\alpha_j - x_i\beta}{\sigma}\right) - F_\varepsilon\!\left(\frac{\alpha_{j-1} - x_i\beta}{\sigma}\right).$$

Common choices for $F_\varepsilon(\cdot)$ are the Gaussian, the logistic, and the (complementary) log-log distribution, whose related models are named probit, logit, and extreme value model, respectively. Here we focus on the logit link function. The parameter vector $\theta = (\alpha, \beta, \sigma)$ is split into the intercept values $\alpha = (\alpha_1,\dots,\alpha_{k-1})$, the covariate coefficients $\beta = (\beta_1,\dots,\beta_p)$ and the scale parameter $\sigma$. The latter may depend on covariates, yielding $\sigma_i = z_i \gamma$, where $z_i$ is a row vector of the matrix $Z$ which includes all the $q \ge 1$ relevant covariates and $\gamma = (\gamma_1,\dots,\gamma_q)$ are the covariate coefficients. Since we do not have relevant prior information, we use non-informative priors on all parameters of interest, letting the data guide the behaviour of the posterior distributions. We rely on MCMC methods to obtain posterior samples.
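To make the model concrete, the probability mass function above can be evaluated directly. The sketch below is illustrative only (the authors fit the model with brms in R); the thresholds are those estimated in the application of Section 3, while the covariate values and the single location and scale coefficients are hypothetical.

```python
import numpy as np

def logistic_cdf(z):
    """CDF of the logistic distribution (logit link)."""
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_probs(x, z, alpha, beta, gamma):
    """Category probabilities Pr(Y = j), j = 1..k, of the location-scale
    ordinal logit model: F((a_j - x b)/sigma) - F((a_{j-1} - x b)/sigma),
    with sigma = z gamma depending on the scale covariates z."""
    a = np.concatenate([[-np.inf], np.asarray(alpha, float), [np.inf]])
    eta = float(np.dot(x, beta))      # location component x_i beta
    sigma = float(np.dot(z, gamma))   # scale component sigma_i = z_i gamma
    cum = logistic_cdf((a - eta) / sigma)
    return np.diff(cum)               # length-k vector summing to 1

# Thresholds as estimated in the Covid-19 application; the covariate
# values and coefficients below are hypothetical.
alpha = [-1.11, 1.00, 1.82, 3.39]
probs = ordinal_probs(x=[1.0], z=[1.0], alpha=alpha, beta=[0.60], gamma=[1.0])
```

A scale $\sigma_i > 1$ flattens the category probabilities toward uniformity, which is how the model captures groups that rate with higher variability.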

#### 3 Applicative section

We examine a set of data collected via a survey conducted in Italy during the 2020 COVID-19 lockdown (March 18 until May 3, 2020). The dataset consists of 2224 observations on 21 variables. Respondents were asked to express on a five-point scale how risky they evaluate Covid-19 infection for the society (*Risk*).

Table 1. *Bayesian estimates for the location-scale model*

| | Estimate | SE | L-95% CI | U-95% CI |
|---|---|---|---|---|
| *Approve Directives* | 0.28 | 0.03 | 0.22 | 0.35 |
| *Covid19 News* | 0.25 | 0.04 | 0.17 | 0.33 |
| *Age* | 0.60 | 0.24 | 0.14 | 1.08 |
| *log disc Sex* | −0.21 | 0.06 | −0.33 | −0.09 |
| *sd disc Sex* | 1.24 | 0.08 | 1.09 | 1.39 |


Figure 1. *Tree structures for the location and scale term of the Covid-19 data set. The parameter estimates are given in the terminal nodes.*


The relevant covariates to include in the model are the ones reported in the tree structures of Figure 1, obtained following the approach of Tutz & Berger, 2021. In particular we examine the following covariates: *Sex*, gender of the respondent (1=female, 0=male); *Approve Directives*, the respondents were asked to evaluate their agreement with the government directive on a scale from 1 (completely disagree) to 7 (completely agree); *Covid News*, the respondents were asked to evaluate frequency of Covid news access and consumption on a scale from 1 (seldom) to 7 (often); *Age*, a dichotomous variable (0 if *Age*≤ 54, 1 otherwise).

The Bayesian estimates of the location and scale parameters are reported in Table 1 (posterior mean, MCMC standard error and 95% credible intervals). These results are obtained via the R package brms (Bayesian regression models using "Stan"); see Bürkner, 2017. The estimated thresholds are $\hat\alpha_1 = -1.11\ (0.30)$, $\hat\alpha_2 = 1.00\ (0.27)$, $\hat\alpha_3 = 1.82\ (0.28)$, and $\hat\alpha_4 = 3.39\ (0.31)$. We ran in parallel 4 chains of 2000 iterations with a burn-in period of 1000 iterations each; as previously mentioned, default non-informative priors have been used. Standard convergence diagnostics have been considered. The Bayesian estimates of the latent variable standard deviations are obtained from the posterior samples of log-disc (log-discrimination), with disc corresponding to the inverse of the standard deviation.




Figure 2. *Marginal effects of Age on Risk evaluation. Points indicate the posterior mean estimates and error bars correspond to the 95% Credible Intervals.*

In Figure 2 we provide a visual representation of the estimated relationship between *Age* and *Risk*. This figure displays the estimated probabilities of the five response categories for the two age groups. We notice that older people present a higher risk perception. The latter is also stated by respondents who approve the directives expressed by the Italian Government and who usually read and discuss Covid-19 news. *Sex* instead affects the scale component; higher variability in expressing risk perception is reported for females.

#### References

BÜRKNER, P. 2017. brms: An R Package for Bayesian Multilevel Models Using Stan. *Journal of Statistical Software*, 80(1), 1–28.

LIDDELL, T. M., & KRUSCHKE, J. K. 2018. Analyzing ordinal data with metric models: What could possibly go wrong? *Journal of Experimental Social Psychology*, 79, 328–348.

MCCULLAGH, P. 1980. Regression Models for Ordinal Data. *Journal of the Royal Statistical Society. Series B*, 42, 109–142.

TUTZ, G., & BERGER, M. 2021. Tree-structured scale effects in binary and ordinal regression. *Statistics and Computing*, 31, 17.



### BEST APPROACH DIRECTION FOR SPHERICAL RANDOM VARIABLES


Jayant Jha<sup>1</sup>

<sup>1</sup> Institut de Neurosciences des Systèmes, Aix-Marseille University, Marseille, France (e-mail: jayantjha@gmail.com)

ABSTRACT: The quantiles of projections are discussed for spherical random variables. The concept of best approach direction is defined for any quantile, based on the ordering of projections in different directions. The usefulness of the concept is discussed when the preferred direction for a specified proportion of observations is of interest. The variation of best approach directions with quantiles is studied for different families of distributions on the sphere, which helps in gaining insights into the symmetry, uniformity, and multimodality of the distributions. Exact polynomial-time algorithms are provided for the computation of its estimate on the circle and the sphere. The connected highest sample density regions for spherical observations can be directly derived from these estimates. Inferential properties of the estimator are studied. Simulations and real data analyses are performed to illustrate the results.

KEYWORDS: depth, directional data, quantiles, von Mises Fisher distribution
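As an illustration of the definition only (not the paper's exact polynomial-time algorithm), on the circle one can grid-search for the direction whose $q$-th quantile of projections is largest; the sample below and the grid-search approach are assumptions for the sketch.

```python
import numpy as np

def best_approach_direction(points, q, n_grid=3600):
    """Grid-search sketch: among unit directions u(t) = (cos t, sin t),
    return the angle whose q-th quantile of projections <x_i, u> is largest."""
    pts = np.asarray(points, dtype=float)
    angles = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    U = np.column_stack([np.cos(angles), np.sin(angles)])  # candidate directions
    proj = pts @ U.T                                       # n x n_grid projections
    quant = np.quantile(proj, q, axis=0)                   # q-quantile per direction
    return angles[np.argmax(quant)]

# hypothetical sample concentrated near angle 0 on the unit circle
rng = np.random.default_rng(0)
t = rng.normal(0.0, 0.2, 500)
sample = np.column_stack([np.cos(t), np.sin(t)])
theta_hat = best_approach_direction(sample, q=0.5)
```

For a unimodal sample such as this one, the maximizing direction for central quantiles should sit close to the mean direction, in line with the abstract's interpretation as a preferred direction for a specified proportion of observations.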

### SIMPLE EFFECT MEASURES FOR INTERPRETING GENERALIZED BINARY REGRESSION MODELS

Maria Kateri <sup>1</sup>

<sup>1</sup> Institute for Statistics, RWTH Aachen University, Germany (e-mail: maria.kateri@rwth-aachen.de)


ABSTRACT: In a statistical information theoretical setup, the logistic regression model has been extended to a family of binary regression models that are scaled through the φ-divergence. This generalized model provides great flexibility and enables a precise fit, but at the cost of not easily interpretable parameters. Here, we propose some simple measures that facilitate a straightforward and sound interpretation of the effects of quantitative and qualitative explanatory variables on a binary response.

KEYWORDS: logistic regression, φ-divergence, ordinal data, odds ratio.

#### 1 Binary response models based on φ-divergence

In the context of regression modeling of the effects of $p$ explanatory variables, $x_1,\dots,x_p$, on a binary response $Y$, we consider a sample of size $n$ with $Y_i$ being the response of the $i$-th observation, $x_i = (x_{i1},\dots,x_{ip})$ the associated values of the explanatory variables, and we assume that $Y_1, Y_2, \dots, Y_n$ are independent. The most well-known model for modeling $p_i = P(Y_i = 1)$ in terms of the explanatory variables is the logistic regression model

$$p_i = \Pr(Y_i = 1 \mid x_i) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)}, \quad i = 1, \dots, n. \tag{1}$$

Kateri & Agresti, 2010 proved that, in the above specified framework and in the class of models with explanatory variables that have fixed values $s_j = \sum_{i=1}^{n} y_i x_{ij}$ for $\sum_{i=1}^{n} p_i x_{ij}$, $j = 1,\dots,p$, the logistic regression model (1) is the closest to the model of constant success probability $P(Y_i = 1 \mid x_i) = \exp(\beta_0)/[1 + \exp(\beta_0)] = p^{(0)}$, in terms of the Kullback-Leibler (KL) divergence.

Based on this property, and considering the general family of φ-divergences, which contains the KL divergence as a special case, Kateri & Agresti, 2010 introduced the generalized binary regression model

$$F\left(\frac{p_i}{p^{(0)}}\right) - F\left(\frac{1 - p_i}{1 - p^{(0)}}\right) = \sum_{j=1}^{p} \beta_j x_{ij}, \quad i = 1, \dots, n, \tag{2}$$

with $p_i = p(x_i)$ and $F = \phi'$, where $\phi$ is a twice differentiable, strictly convex real-valued function on $[0, +\infty)$, satisfying $\phi(1) = \phi'(1) = 0$, $0\,\phi(0/0) = 0$ and $0\,\phi(x/0) = x\,\phi_\infty$ with $\phi_\infty = \lim_{x \to \infty} [\phi(x)/x]$. In the class of models with fixed values $s_j = \sum_{i=1}^{n} y_i x_{ij}$ for $\sum_{i=1}^{n} p_i x_{ij}$, $j = 1,\dots,p$, model (2) under the constraints $0 < p_i < 1$ is the closest to the model of constant success probability, $p_i = p^{(0)}$ for all $i$, in terms of the $\phi$-divergence.

For $\varphi(x) = x\log(x) - x + 1$, $x > 0$, the φ-divergence simplifies to the KL divergence and model (2) reduces to (1). The Pearsonian divergence corresponds to $\varphi(x) = \frac{1}{2}(x-1)^2$, for which (2) simplifies to the linear probability model $p_i = p^{(0)}\big[1 + (1 - p^{(0)})\sum_{j=1}^p \beta_j x_{ij}\big]$, with $-1/(1-p^{(0)}) < \sum_{j=1}^p \beta_j x_{ij} < 1/p^{(0)}$, for all $i$. For $\varphi_\lambda(x) = \frac{1}{\lambda(\lambda+1)}[x^{\lambda+1} - x - \lambda(x-1)]$, $x > 0$, where λ is a real-valued parameter, the φ-divergence becomes the power divergence of Cressie and Read and (2) leads to

$$p_i = p^{(0)} \left[ 1 + \lambda\Big(\beta_{0i} + \sum_{j=1}^p \beta_j x_{ij}\Big) \right]^{1/\lambda}, \quad i = 1, \dots, n, \tag{3}$$

with parameters $\beta_{0i}$ satisfying suitable constraints to ensure that $p_i \in (0,1)$. When $\lambda = 0$, $\varphi_0(x) = \lim_{\lambda\to 0}[\varphi_\lambda(x)]$ and model (3) becomes (1). It reduces to the linear probability model for $\lambda = 1$. Model (3) can be expressed in the simpler equivalent form

$$p_i = \left[\tilde{\beta}_0 + \sum_{j=1}^p \tilde{\beta}_j x_{ij}\right]^{1/\lambda}, \quad i = 1, \dots, n. \tag{4}$$
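The role of λ in (3) and (4) is that of a power transform of the linear predictor: as $\lambda \to 0$, $[1+\lambda z]^{1/\lambda} \to \exp(z)$ (the logistic-type limit noted above), while $\lambda = 1$ leaves the predictor linear. A quick numerical sketch of this interpolation (the function name and values are illustrative, not from the paper):

```python
import math

def power_link(z, lam):
    """Evaluate the transform [1 + lam*z]**(1/lam) appearing in model (3)."""
    if lam == 0.0:                 # lambda -> 0 limit of the transform
        return math.exp(z)
    base = 1.0 + lam * z
    if base <= 0.0:                # outside the admissible region for p_i
        raise ValueError("1 + lambda*z must be positive")
    return base ** (1.0 / lam)

z = 0.4                            # an arbitrary linear-predictor value
print(power_link(z, 1.0))          # 1.4 = 1 + z, the linear-probability case
print(power_link(z, 1e-9))         # close to exp(0.4), the logistic-type limit
```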

#### 2 Parameter interpretation and induced effect measures

The effect of any explanatory variable $x_k$, quantitative or qualitative, is interpreted conditional on the values of all other covariates in terms of the corresponding parameter $\beta_k$ in the model, as usual in regression models. In particular, for quantitative $x_k$, the $F$-scaled odds ratio (OR)

$$\left[ F\left( \frac{p(\mathbf{x}\_i)}{p^{(0)}} \right) - F\left( \frac{1 - p(\mathbf{x}\_i)}{1 - p^{(0)}} \right) \right] - \left[ F\left( \frac{p(\mathbf{x}\_{i'})}{p^{(0)}} \right) - F\left( \frac{1 - p(\mathbf{x}\_{i'})}{1 - p^{(0)}} \right) \right] \tag{5}$$

opposing the $F$-scaled odds for any two covariate vectors $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ differing only in their $x_k$ component, equals $\beta_k(x_{ik} - x_{i'k})$, where $\beta_k$ is the parameter in model (2). If $x_k$ is categorical with $c$ levels, then the associated parameters $\beta_{kj}$, $j = 2,\dots,c$, equal the $F$-scaled ORs comparing level $j$ to the reference level 1.

The necessity and practical importance of effect measures that are easy to calculate and straightforward to interpret have been underlined, among others, by Agresti & Kateri, 2017 and Agresti *et al.*, 2021, for ordinal and binary responses, respectively. These sources discuss existing effect measures, review the related literature, and propose new ones. The need for simple effect measures is even more pressing for the generalized models of type (2), for which the $F$-scaled ORs are considerably harder to handle and interpret. Here, we extend the measures proposed in Agresti & Kateri, 2017 and Agresti *et al.*, 2021 to the parametric family of models (3). The adaptation of these measures to any other member of the family of φ-divergence based binary regression models is straightforward.

For a quantitative covariate $x_k$, a common choice of simple effect measure for the logistic regression model is the rate of change of the response probability $p = P(Y = 1 \mid \mathbf{x} = \mathbf{x}^*)$ in $x_k$ when all other covariates in $\mathbf{x}$ are kept fixed at the value $\mathbf{x}^*$, which is $\partial p/\partial x_k = \beta_k\, p(1-p)$ and is known as the partial effect. This rate depends on $\mathbf{x}$ and, for given $\mathbf{x} = \mathbf{x}^*$, achieves its maximum $\beta_k/4$ at $p = 0.5$. The average partial effect over all $\mathbf{x}_i$ in a sample, or the partial effect at the mean $\bar{\mathbf{x}}$, have been proposed as simple effect measures (see Agresti *et al.*, 2021). These measures can also be defined for the generalized binary regression models presented above. For the linear probability model this rate equals $\partial p/\partial x_k = \beta_k\, p^{(0)}(1-p^{(0)})$, independent of $\mathbf{x}$, while for model (3), using the alternative form (4) for $p$, we have $\partial p/\partial x_k = \tilde{\beta}_k\, p^{1-\lambda}/\lambda$. This rate is increasing in $p$ for $\lambda \in (0,1)$ and decreasing for $\lambda < 0$ or $\lambda > 1$. Similar measures can be defined for a binary covariate $x_k$ by replacing the rate of change with the difference between the values of $p$ for $x_k = 0$ and $x_k = 1$, estimating these differences over all $\mathbf{x}_i$ and averaging them. In the case of a categorical covariate, the same process can be followed for the differences between any pair of its levels.
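For the logistic case, the average partial effect is simple to compute from fitted probabilities. A minimal sketch with hypothetical coefficients and data (no model fitting is performed here):

```python
import math

def logistic_p(x, beta0, beta):
    """P(Y = 1 | x) under the logistic model (1)."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def average_partial_effect(X, beta0, beta, k):
    """Average over the sample of the rate beta_k * p * (1 - p)."""
    total = 0.0
    for x in X:
        p = logistic_p(x, beta0, beta)
        total += beta[k] * p * (1.0 - p)
    return total / len(X)

# hypothetical data: four observations on two covariates
X = [(0.0, 1.0), (0.5, 0.2), (1.0, -0.3), (1.5, 0.8)]
beta0, beta = -0.2, (0.9, 0.4)
ape = average_partial_effect(X, beta0, beta, k=0)
print(ape < beta[0] / 4)  # True: the rate never exceeds beta_k / 4
```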

#### 3 Comparison of two ordinal responses

For the problem of comparing two independent groups of items based on their response on an ordinal scale of $c$ levels, the data form a $2 \times c$ contingency table. Such cases can equivalently be analyzed by models treating the binary variable as response. The data in Table 1 are from an experiment on the use of drugs (sulfones and streptomycin) in the treatment of leprosy. The rows group the patients according to the degree of infiltration (a measure of a certain type of skin damage) present at the beginning of the experiment. The columns indicate the change in the overall clinical condition of the patient after 48 weeks of treatment. This data set has been analyzed by generalized binary regression models in Kateri & Agresti, 2010, considering equidistant scores for the response on clinical change. The corresponding logistic model fits well ($G^2 = 0.63$), as does the linear probability model ($G^2 = 0.26$), both with $df = 3$. Fitting the power divergence model with λ as a parameter gives the best fit for $\hat{\lambda} = 1.673$ ($G^2 = 0.05$, $df = 2$). Kateri & Agresti, 2010 commented that generalized models of this type, though very flexible, have a restricted scope for applications due to the lack of a simple interpretation.

Table 1. *Change in Clinical Condition (C1: Worse, C2: Stationary, C3: Slight Improvement, C4: Moderate Improvement, C5: Marked Improvement) by Degree of Infiltration in a study comparing two drugs against leprosy (Source: Cochran, 1954).*

| Degree of Infiltration | C1 | C2 | C3 | C4 | C5 |
|---|---|---|---|---|---|
| High | 1 | 13 | 16 | 15 | 7 |
| Low | 11 | 53 | 42 | 27 | 11 |


This drawback can be overcome by adopting for these models the measures for ordinal models introduced by Agresti & Kateri, 2017 (Section 5). These are the ordinal superiority measures ∆ and γ, which in our case are $\Delta = \sum_{j>k} \pi_{1j}\pi_{2k} - \sum_{k>j} \pi_{1j}\pi_{2k}$ and $\gamma = \sum_{j>k} \pi_{1j}\pi_{2k} + \sum_{j} \pi_{1j}\pi_{2j}/2$, ranging in $[-1,1]$ and $[0,1]$, respectively, where $\pi_{ij}$ is the $(i,j)$ cell probability. For the logistic, linear, and power divergence models fitted to Table 1, ∆ is estimated as $\hat{\Delta}_0 = 0.229$, $\hat{\Delta}_1 = 0.241$ and $\hat{\Delta}_{\hat{\lambda}} = 0.231$, respectively, while $\hat{\gamma}_0 = 0.614$, $\hat{\gamma}_1 = 0.620$, and $\hat{\gamma}_{\hat{\lambda}} = 0.616$. Thus, under all three models, it is estimated that there is about a 62% chance of a better clinical change in the high infiltration group than in the low one.
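The sample analogues of ∆ and γ can be computed directly from the counts of Table 1, using empirical cell proportions in place of model-based probabilities (so the values differ slightly from the fitted-model estimates):

```python
# Raw counts from Table 1
high = [1, 13, 16, 15, 7]     # High infiltration, C1..C5
low  = [11, 53, 42, 27, 11]   # Low infiltration,  C1..C5

n1, n2 = sum(high), sum(low)
p1 = [c / n1 for c in high]   # empirical proportions pi_{1j}
p2 = [c / n2 for c in low]    # empirical proportions pi_{2k}

greater = sum(p1[j] * p2[k] for j in range(5) for k in range(5) if j > k)
less    = sum(p1[j] * p2[k] for j in range(5) for k in range(5) if j < k)
ties    = sum(p1[j] * p2[j] for j in range(5))

delta = greater - less        # sample analogue of Delta
gamma = greater + ties / 2    # sample analogue of gamma
print(round(delta, 3), round(gamma, 3))  # 0.233 0.616
```

The sample values are close to the model-based estimates reported above, as one would expect for well-fitting models.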

The models discussed so far are based on local *F*-scaled ORs. Treating the ordinal variable as response, we consider models and measures of ordinal superiority that are based on cumulative *F*-scaled ORs, and compare them.

#### References


AGRESTI, A., & KATERI, M. 2017. Ordinal probability effect measures for group comparisons in multinomial cumulative link models. *Biometrics*, 73, 214–219.

AGRESTI, A., TARANTOLA, C., & VARRIALE, R. 2021. Simple ways to interpret effects in modeling binary data. *In:* KATERI, M., & MOUSTAKI, I. (eds), *Trends and Challenges in Categorical Data Analysis*. Springer (to appear).

KATERI, M., & AGRESTI, A. 2010. A generalized regression model for a binary response. *Statistics & Probability Letters*, 80, 89–95.

### MIXTURES OF KATO–JONES DISTRIBUTIONS ON THE CIRCLE, WITH AN APPLICATION TO TRAFFIC COUNT DATA

Shogo Kato<sup>1</sup>, Kota Nagasaki<sup>2</sup> and Wataru Nakanishi<sup>2</sup>

<sup>1</sup> Institute of Statistical Mathematics (e-mail: skato@ism.ac.jp)

<sup>2</sup> Department of Civil and Environmental Engineering, Tokyo Institute of Technology, (e-mail: k.nagasaki@plan.cv.titech.ac.jp, nakanishi@plan.cv.titech.ac.jp)

ABSTRACT: The Kato–Jones distribution is a probability distribution on the circle that is unimodal and affords a wide range of skewness and kurtosis. Motivated by a multimodal skewed data set arising in traffic engineering, we discuss some properties of mixtures of Kato–Jones distributions. A key reparametrization is carried out to achieve identifiability of the proposed mixtures. With this reparametrization, we consider two methods for parameter estimation, namely a modified method of moments and the maximum likelihood method. These methods prove useful for fitting the proposed mixtures to the traffic counter data set of interest.

KEYWORDS: directional statistics, EM algorithm, maximum likelihood estimation, method of moments, road network analysis.

#### 1 Introduction

Circular data are a set of observations which can be expressed as angles in $[0, 2\pi)$. For the modelling of circular data, a considerable number of probability distributions have been proposed in the literature. Among them, a flexible four-parameter family of distributions was proposed by Kato & Jones, 2015. It is given by the density

$$g_{\mathrm{KJ}}(\theta; \mu, \gamma, \lambda, \rho) = \frac{1}{2\pi} \left\{ 1 + 2\gamma \, \frac{\cos(\theta - \mu) - \rho\cos\lambda}{1 + \rho^2 - 2\rho\cos(\theta - \mu - \lambda)} \right\}, \quad 0 \le \theta < 2\pi,$$

where $0 \le \mu < 2\pi$, $0 \le \gamma < 1$, and $0 \le \rho < 1$ and $0 \le \lambda < 2\pi$ satisfy $(\rho\cos\lambda - \gamma)^2 + (\rho\sin\lambda)^2 \le (1-\gamma)^2$. This distribution, which will be called the Kato–Jones distribution, is unimodal, affords a very wide range of skewness and kurtosis,

has clear interpretation of the parameters, and allows straightforward parameter estimation by both method of moments and maximum likelihood.
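As a sanity check, the density above integrates to one over the circle whenever the parameter constraint holds. A short numerical verification, with arbitrary admissible parameter values of our own choosing:

```python
import math

def g_kj(theta, mu, gamma, lam, rho):
    """Kato-Jones density on [0, 2*pi)."""
    num = math.cos(theta - mu) - rho * math.cos(lam)
    den = 1.0 + rho ** 2 - 2.0 * rho * math.cos(theta - mu - lam)
    return (1.0 + 2.0 * gamma * num / den) / (2.0 * math.pi)

mu, gamma, lam, rho = 0.0, 0.3, 1.0, 0.2   # arbitrary admissible values
assert (rho * math.cos(lam) - gamma) ** 2 + (rho * math.sin(lam)) ** 2 <= (1.0 - gamma) ** 2

n = 4000                # midpoint rule; very accurate for smooth periodic integrands
h = 2.0 * math.pi / n
total = h * sum(g_kj((i + 0.5) * h, mu, gamma, lam, rho) for i in range(n))
print(abs(total - 1.0) < 1e-8)  # True
```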

Motivated by a multimodal skewed data set which appears in traffic engineering, we consider the following mixtures of Kato–Jones distributions with density

$$f(\theta) = \sum_{k=1}^{m} \pi_k \, g_{\mathrm{KJ}}(\theta; \mu_k, \gamma_k, \lambda_k, \rho_k) = \frac{1}{2\pi} \sum_{k=1}^{m} \pi_k \left\{ 1 + 2\gamma_k \, \frac{\cos(\theta - \mu_k) - \rho_k \cos\lambda_k}{1 + \rho_k^2 - 2\rho_k \cos(\theta - \mu_k - \lambda_k)} \right\}, \quad 0 \le \theta < 2\pi, \tag{1}$$

where $m \in \mathbb{N}$ is the number of components of the mixture and $0 < \pi_1, \dots, \pi_m < 1$ are the mixing proportions satisfying $\sum_{k=1}^m \pi_k = 1$.

Apart from our proposal (1), some mixtures of circular distributions have been proposed in the literature. Most attention has been paid to mixtures of von Mises distributions (e.g., Wallace & Dowe, 2000; Mooney *et al.*, 2003; Banerjee *et al.*, 2005; Mulder *et al.*, 2020). The components of these mixtures, the von Mises distributions, are symmetric distributions with two parameters controlling location and mean resultant length. Recently, mixtures of sine-skewed distributions have been discussed by Miyata *et al.*, 2020. The sine-skewed distribution is an extension of a circular distribution which can adopt a mildly asymmetric shape. However, these existing models do not seem to be appropriate for our traffic data because one of the clusters of our data is strongly skewed.

In this short paper, we discuss two methods for parameter estimation for the mixture (1), namely, a modified method of moments and the maximum likelihood method. Then, using the proposed methods, we apply the proposed mixture (1) to the traffic data which show bimodality and asymmetry.

#### 2 Parameter estimation

Let $\Theta_1, \dots, \Theta_n$ be independent and identically distributed from the mixture (1). Note that, as it stands, the parameters $\pi_k$ and $\gamma_k$ of the mixture (1) cannot be uniquely determined in parameter estimation, and therefore the mixture (1) is not identifiable. In order to circumvent this problem, we reparametrize the parameters of the mixture (1). With this reparametrization, we discuss two methods for parameter estimation.

The first method is a modified version of the method of moments based on trigonometric moments. Kato & Jones, 2015 proposed a method of moments based on trigonometric moments for the Kato–Jones distribution or, equivalently, the mixture (1) with $m = 1$. However, their method cannot be directly applied to our mixture (1) with general $m$ because the resulting estimates are not always within the range of $\lambda_k$ and $\rho_k$. In order to circumvent this problem, we propose a function that evaluates the error between the empirical and theoretical trigonometric moments. The estimates are then obtained as the minimizer of the proposed function. An advantage of this method is that the estimates always belong to the parameter space and are therefore well-defined. In particular, for a single-component mixture ($m = 1$), this estimator converges to the method of moments estimator of Kato & Jones, 2015 under certain conditions. Some asymptotic properties, such as consistency and asymptotic normality, also hold for the proposed estimator.

Second, we consider maximum likelihood estimation. As in the case $m = 1$, there does not seem to be a closed-form expression for the maximum likelihood estimator for general $m$. Therefore we consider a numerical algorithm to compute the maximum likelihood estimate for the mixture (1). We apply the EM algorithm to estimate the parameters of the mixture (1). This algorithm enables us to express the reparametrized mixing proportions of the mixture (1) in closed form at each step. The other parameters of the mixture need to be estimated numerically. However, the estimation of these parameters is equivalent to weighted maximum likelihood estimation for a single Kato–Jones distribution and can be done in a similar manner as in Kato & Jones, 2015.
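The closed-form M-step update of the mixing proportions mentioned above can be sketched as follows. The component parameter values here are purely illustrative, and the component parameters themselves are held fixed (in the full algorithm they are updated by weighted maximum likelihood):

```python
import math

def g_kj(theta, mu, gamma, lam, rho):
    """Kato-Jones density, as defined in Section 1."""
    num = math.cos(theta - mu) - rho * math.cos(lam)
    den = 1.0 + rho ** 2 - 2.0 * rho * math.cos(theta - mu - lam)
    return (1.0 + 2.0 * gamma * num / den) / (2.0 * math.pi)

def em_update_proportions(thetas, pis, comps):
    """E-step responsibilities followed by the closed-form M-step for the mixing proportions."""
    weights = [0.0] * len(pis)
    for t in thetas:
        dens = [pi * g_kj(t, *c) for pi, c in zip(pis, comps)]
        s = sum(dens)
        for k, d in enumerate(dens):
            weights[k] += d / s            # responsibility of component k for angle t
    return [w / len(thetas) for w in weights]

# two illustrative components (mu, gamma, lambda, rho) and a handful of angles
comps = [(0.5, 0.3, 1.0, 0.2), (4.0, 0.2, 0.5, 0.1)]
pis = [0.5, 0.5]
thetas = [0.3, 0.6, 3.9, 4.2, 4.4]
pis = em_update_proportions(thetas, pis, comps)
print(abs(sum(pis) - 1.0) < 1e-12)  # True: updated proportions sum to one
```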

Our experiments suggest the following: The modified method of moments estimation is faster than the maximum likelihood estimation. There is no great difficulty in implementing the maximum likelihood estimation using the EM algorithm. The modified method of moments estimate provides a useful initial value of the EM algorithm for maximum likelihood estimation.

#### 3 Application to traffic count data

Using the two proposed methods for parameter estimation, we fit the proposed mixture (1) to a traffic data set. The data of interest are the timestamps of all vehicles' passing recorded by a traffic counter at 20.4 kilopost of Kobe route, Hanshin Expressway, Japan. Kobe route is located in Osaka metropolitan area and connects two large cities of Japan, Osaka and Kobe. The data of the timestamps are converted from 24 hours to angles in [0,2π); for clarity, 0 corresponds to midnight, π to midday, etc. Our data show bimodality and one of the clusters of the data is strongly skewed.
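The timestamp-to-angle conversion is linear in clock time; a small helper (the function name is ours):

```python
import math

def time_to_angle(hours):
    """Map a timestamp in hours to an angle in [0, 2*pi); 0 is midnight, pi is midday."""
    return (hours % 24.0) * 2.0 * math.pi / 24.0

print(time_to_angle(0.0))                  # 0.0 (midnight)
print(abs(time_to_angle(12.0) - math.pi))  # ~0 (midday maps to pi)
```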


In parameter estimation, we first estimate the parameters by the modified method of moments. Maximum likelihood estimation is then carried out using the modified method-of-moments estimates as initial values for the EM algorithm. The model estimated by maximum likelihood is a two-component (*m* = 2) mixture of Kato–Jones distributions. The estimated model provides a reasonable fit to the data, including the strongly skewed cluster. Details of the data analysis will be given in the talk.
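The Kato–Jones density itself is more involved, but the moment-initialized EM structure of the fit can be sketched with a two-component von Mises mixture as a simpler stand-in; `mu_init` plays the role of the moment-based starting values, and the concentration update uses the closed-form approximation of Banerjee *et al.* (2005). All names and parameter values here are illustrative, not from the paper:

```python
import numpy as np
from scipy.special import i0

def vm_pdf(theta, mu, kappa):
    # von Mises density on the circle
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def em_vm_mixture(theta, mu_init, iters=200):
    """EM for an m-component von Mises mixture with supplied starting means."""
    m = len(mu_init)
    pi = np.full(m, 1.0 / m)
    mu = np.array(mu_init, dtype=float)
    kappa = np.ones(m)
    for _ in range(iters):
        # E-step: responsibilities
        dens = np.stack([pi[j] * vm_pdf(theta, mu[j], kappa[j]) for j in range(m)])
        r = dens / dens.sum(axis=0)
        # M-step: weighted circular moments per component
        for j in range(m):
            w = r[j]
            C, S = np.sum(w * np.cos(theta)), np.sum(w * np.sin(theta))
            mu[j] = np.arctan2(S, C) % (2.0 * np.pi)
            Rbar = min(np.hypot(C, S) / w.sum(), 0.99)
            # Banerjee et al. (2005) approximation to the concentration
            kappa[j] = Rbar * (2.0 - Rbar**2) / (1.0 - Rbar**2)
            pi[j] = w.mean()
    return pi, mu, kappa
```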

#### References

BANERJEE, A., DHILLON, I.S., GHOSH, J., & SRA, S. 2005. Clustering on the unit hypersphere using von Mises–Fisher distributions. *Journal of Machine Learning Research*, 6, 1345–1382.

KATO, S., & JONES, M.C. 2015. A tractable and interpretable four-parameter family of unimodal distributions on the circle. *Biometrika*, 102, 181–190.

MIYATA, Y., SHIOHAMA, T., & ABE, T. 2020. Estimation of finite mixture models of skew-symmetric circular distributions. *Metrika*, 83, 895–922.

MOONEY, J.A., HELMS, P.J., & JOLLIFFE, I.T. 2003. Fitting mixtures of von Mises distributions: a case study involving sudden infant death syndrome. *Computational Statistics & Data Analysis*, 41, 505–513.

MULDER, K., JONGSMA, P., & KLUGKIST, I. 2020. Bayesian inference for mixtures of von Mises distributions using reversible jump MCMC sampler. *Journal of Statistical Computation and Simulation*, 90, 1539–1556.

WALLACE, C.S., & DOWE, D.L. 2000. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. *Statistics and Computing*, 10, 73–83.
### HOW TO DESIGN A DIRECTIONAL DISTRIBUTION

John T. Kent <sup>1</sup>

<sup>1</sup> School of Mathematics, University of Leeds (e-mail: j.t.kent@leeds.ac.uk)

ABSTRACT: One way to specify a model in directional statistics is to look for an exponential family which mimics the multivariate normal distribution under high concentration. However, in some important examples this strategy leads to an overspecified model, with a spare parameter. This paper revisits two standard distributions, the Fisher-Bingham distribution on the sphere and the bivariate von Mises distribution on the torus, and takes a fresh look at guidelines to specify this parameter.

KEYWORDS: Fisher-Bingham distribution, bivariate von Mises distribution, directional statistics

#### 1 Introduction


Directional statistics is concerned with data on circles, spheres and related manifolds. For this paper, we focus on two particular cases: the unit sphere S<sup>p−1</sup> in R<sup>p</sup>, especially the case p = 3, and the torus (S<sup>1</sup>)<sup>d</sup>, especially d = 2. In each case it is possible to construct an exponential family which mimics the multivariate normal distribution under high concentration. But there is a problem: the models include one more parameter than necessary. Overparameterized models can lead to problems of interpretation and fitting. Hence, if possible, it is usually better to choose a parameterization with the same number of parameters as the corresponding asymptotic multivariate normal distribution. Various suggestions have been made in the literature to fix the spare parameter, but these suggestions sometimes have severe limitations.

#### 2 The Fisher-Bingham distribution

The 6-parameter Fisher-Bingham distribution (FB6) on the unit sphere S<sup>2</sup> in R<sup>3</sup>, after rotation to a standardized coordinate system, has the density

$$f(x) \propto \exp\{\kappa x_3 + \beta_1 x_1^2 + \beta_2 x_2^2\}, \quad x \in S^2 \tag{1}$$

with respect to the uniform distribution on S<sup>2</sup>, where κ > 0 and −∞ < β<sub>2</sub> ≤ β<sub>1</sub>. Here *x* = [x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>]<sup>T</sup> is a unit vector, x<sup>T</sup>x = 1, and the third coordinate axis can be viewed as the "north pole". The parameters of the model are κ, β<sub>1</sub>, β<sub>2</sub>, plus two parameters for location and one parameter for orientation about the north pole, making 6 parameters in all. Provided κ > 0, 2β<sub>1</sub> ≤ κ and 2β<sub>2</sub> ≤ κ, the density is unimodal at the north pole. Under high concentration the distribution is asymptotically bivariate normal, a 5-parameter family.
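The unimodality condition above is easy to verify numerically; a minimal sketch evaluating the unnormalized log-density (1) on a grid over the sphere (the parameter values are illustrative and satisfy the condition):

```python
import numpy as np

def fb6_logdensity(x, kappa, beta1, beta2):
    # unnormalized log-density of equation (1); last axis holds (x1, x2, x3)
    return kappa * x[..., 2] + beta1 * x[..., 0]**2 + beta2 * x[..., 1]**2

# spherical grid: colatitude theta measured from the north pole
theta = np.linspace(0.0, np.pi, 181)
phi = np.linspace(0.0, 2.0 * np.pi, 361)
T, P = np.meshgrid(theta, phi, indexing="ij")
X = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], axis=-1)

# kappa > 0, 2*beta1 <= kappa, 2*beta2 <= kappa: the mode is the north pole
kappa, beta1, beta2 = 4.0, 1.5, -1.5
logf = fb6_logdensity(X, kappa, beta1, beta2)
i, j = np.unravel_index(np.argmax(logf), logf.shape)
# at the maximizer, colatitude T[i, j] is 0 and logf equals kappa
```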

How should the parameters in FB6 be constrained to yield a 5-parameter family? Two choices are:


(a) The *balanced FB5 distribution (FB5b)*: set β<sub>1</sub> = −β<sub>2</sub> = β, say, with 0 ≤ β ≤ κ/2. It was introduced in Kent, 1982 (without the adjective "balanced") and is sometimes known as the Kent distribution.

(b) The *extreme FB5 distribution (FB5e)*: set β<sub>1</sub> = 0, β<sub>2</sub> = −δ, say, where δ ≥ 0. It was introduced in Kent *et al.*, 2016.

The parameters β<sub>1</sub> and β<sub>2</sub> determine the eccentricity of the distribution; in the limiting bivariate normal case, the eccentricity describes the ratio of the eigenvalues of the covariance matrix. Although both the balanced and extreme FB5 distributions can accommodate high eccentricity under high concentration, the balanced distribution is much less able to describe high eccentricity under low and moderate concentration. See Fig. 1, panels (a) and (b), for an example with moderate concentration, where the mode has been moved to the equator. Panel (a) gives the most eccentric choice possible with the FB5b distribution, and Panel (b) shows how FB5e can be much more eccentric.

#### 3 Bivariate von Mises distribution

Represent points on the torus S<sup>1</sup> × S<sup>1</sup> as a pair of angles θ<sub>1</sub> and θ<sub>2</sub>. The bivariate von Mises distribution, after a suitable rotation of each circle, has density

$$f(\theta_1, \theta_2) \propto \exp\{\kappa_1 c_1 + \kappa_2 c_2 + v_1^T B v_2\}, \tag{2}$$

where *B* is a 2 × 2 parameter matrix (not necessarily symmetric) (Mardia & Jupp, 1999; Mardia *et al.*, 2008; Kent *et al.*, 2008; Mardia *et al.*, 2012). Here the shorthand notation c<sub>j</sub> = cos θ<sub>j</sub>, s<sub>j</sub> = sin θ<sub>j</sub> and v<sub>j</sub> = [c<sub>j</sub>, s<sub>j</sub>]<sup>T</sup>, j = 1, 2, for the first-order trigonometric functions has been used. The density is unimodal with a mode at θ<sub>1</sub> = θ<sub>2</sub> = 0, provided

$$\kappa_1 + b_{11} > 0, \quad \kappa_2 + b_{11} > 0, \quad b_{12} = b_{21} = 0, \quad b_{22}^2 \le (\kappa_1 + b_{11})(\kappa_2 + b_{11}). \tag{3}$$

After adding 2 parameters for location, the bivariate von Mises distribution, subject to the constraints in (3), forms a 6-parameter family (BVM6, say). Under high concentration BVM6 is asymptotically bivariate normal, a 5-parameter family. Just as in the last section, BVM6 is over-parameterized.

Figure 1. *Illustration of directional simulations. Panels (a) and (b) illustrate the FB5b and FB5e distributions. Note the spread in longitude is similar for both distributions, but FB5e has a much smaller spread in latitude than can be modelled by FB5b. Panels (c) and (d) illustrate the BVMs and BVMc distributions. Note the marginal spreads in* θ<sub>1</sub> *and* θ<sub>2</sub> *are similar to one another and similar for the two distributions. However, BVMc shows a much higher correlation between the two angles than can be modelled by BVMs.*


In this case there are two well-established ways to constrain the spare degree of freedom:

(a) The *bivariate von Mises sine model (BVM5s)*, obtained by setting b<sub>11</sub> = 0.

(b) The *bivariate von Mises cosine model (BVM5c)*, obtained by setting b<sub>22</sub> = |b<sub>11</sub>|.

Under high concentration both the sine and cosine models can accommodate high correlation between θ<sub>1</sub> and θ<sub>2</sub>. However, under low and moderate concentration the cosine model accommodates high correlation more effectively. See Fig. 1, panels (c) and (d), for an example with moderate concentration. Panel (c) gives the most highly correlated choice possible with the BVMs distribution, and Panel (d) shows how BVMc can exhibit much higher correlation.
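Condition (3) can likewise be checked numerically against the unnormalized density (2); a small sketch (the parameter values are illustrative and satisfy (3)):

```python
import numpy as np

def bvm_logdensity(t1, t2, k1, k2, B):
    # unnormalized log-density of equation (2): k1*c1 + k2*c2 + v1^T B v2
    v1 = np.stack([np.cos(t1), np.sin(t1)], axis=-1)
    v2 = np.stack([np.cos(t2), np.sin(t2)], axis=-1)
    return k1 * np.cos(t1) + k2 * np.cos(t2) + np.einsum("...i,ij,...j->...", v1, B, v2)

def satisfies_condition_3(k1, k2, B):
    # the unimodality condition (3)
    b11, b22 = B[0, 0], B[1, 1]
    return (B[0, 1] == 0.0 and B[1, 0] == 0.0
            and k1 + b11 > 0.0 and k2 + b11 > 0.0
            and b22**2 <= (k1 + b11) * (k2 + b11))

k1, k2 = 3.0, 2.0
B = np.array([[0.5, 0.0], [0.0, 1.2]])
grid = np.linspace(-np.pi, np.pi, 401)
T1, T2 = np.meshgrid(grid, grid, indexing="ij")
logf = bvm_logdensity(T1, T2, k1, k2, B)
i, j = np.unravel_index(np.argmax(logf), logf.shape)
# the maximizer on the grid is (theta1, theta2) = (0, 0)
```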

#### References

KENT, J. T. 1982. The Fisher-Bingham distribution on the sphere. *Journal of the Royal Statistical Society, Series B*, 44, 71–80.

KENT, J. T., MARDIA, K. V., & TAYLOR, C. C. 2008. Modelling strategies for bivariate circular data. *Pages 70–73 of:* BARBER, S., BAXTER, P. D., GUSNANTO, A., & MARDIA, K. V. (eds), *The Art and Science of Statistical Bioinformatics*. Leeds University Press.

KENT, J. T., HUSSEIN, I., & JAH, M. K. 2016. Directional distributions in tracking of space debris. *Pages 2081–2086 of: Proceedings of the 19th International Conference on Information Fusion (FUSION), Heidelberg, Germany*. IEEE.

MARDIA, K. V., & JUPP, P. E. 1999. *Directional Statistics*. New York: Wiley.

MARDIA, K. V., HUGHES, G., TAYLOR, C. C., & SINGH, H. 2008. A multivariate von Mises distribution with applications to bioinformatics. *Canadian Journal of Statistics*, 36, 99–109.

MARDIA, K. V., KENT, J. T., ZHANG, Z., TAYLOR, C. C., & HAMELRYCK, T. 2012. Mixtures of concentrated multivariate sine distributions with applications to bioinformatics. *Journal of Applied Statistics*, 39, 2475–2492.


### IDENTIFYING MORTALITY PATTERNS OF MAIN CAUSES OF DEATH AMONG YOUNG EU POPULATION USING SDA APPROACHES


Simona Korenjak-Černe <sup>1</sup> and Nataša Kejžar <sup>2</sup>

<sup>1</sup> University of Ljubljana, School of Economics and Business, and Institute of Mathematics, Physics and Mechanics (e-mail: simona.cerne@ef.uni-lj.si)

<sup>2</sup> University of Ljubljana, Faculty of Medicine, Institute for Biostatistics and Medical Informatics (e-mail: natasa.kejzar@mf.uni-lj.si)

ABSTRACT: The young population is generally considered to be very healthy, so the most common causes of death in this population are often associated with risky behaviours. In fact, in the population aged 20-39, external causes of death account for more than half of all deaths in EU countries (as in the US), while by far the most common causes of death in the general population are circulatory diseases and various cancers. The next most common causes of death in the 20-39 age group in the US are suicides and homicides, both of which are strongly associated with stress; we therefore examine them for EU countries as well. Our application is based on the 2016 data, which at this point are the most recent complete data available; however, the topic is even more relevant nowadays in the pandemic and post-pandemic period, with many extraordinarily stressful situations.

In order to include as much information as possible from these data in our cluster analysis, we use symbolic data methods. By considering for each age-sex group not only the number of deaths but also their distribution among the main causes of death, we can include internal variability (in our case, variability by cause of death) in the analysis.

The main objective of the study is twofold: first, to identify groups of EU countries with similar mortality patterns, taking into account two-level information for each age and sex group, i.e. the number of deaths and their distribution among the main causes of death; and second, to describe clusters of mortality patterns and to investigate possible links between the mortality patterns in the obtained clusters and some other socio-demographic indicators. In our study, we use a symbolic table for a more informative data description and adaptations of compatible hierarchical and non-hierarchical clustering methods for group identification that allow us to consider this two-level information. To this end, we have extended our R package clamix.
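The two-level information can be illustrated with a toy dissimilarity between two such symbolic descriptions; this is only a sketch of the idea (the weighting scheme and the exact measure implemented in the clamix package differ):

```python
import numpy as np

def two_level_dissimilarity(counts_a, dists_a, counts_b, dists_b):
    """Toy dissimilarity between two countries described, per age-sex group,
    by a number of deaths (first level) and a discrete distribution over
    the main causes of death (second level)."""
    counts_a, counts_b = np.asarray(counts_a, float), np.asarray(counts_b, float)
    dists_a, dists_b = np.asarray(dists_a, float), np.asarray(dists_b, float)
    level1 = np.sum((np.log1p(counts_a) - np.log1p(counts_b)) ** 2)
    level2 = np.sum((dists_a - dists_b) ** 2)  # squared distance between distributions
    return level1 + level2
```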

KEYWORDS: symbolic data analysis, main death causes, young population.

### ROBUST SUPERVISED CLUSTERING: SOME PRACTICAL ISSUES

Fabrizio Laurini <sup>1</sup> and Gianluca Morelli <sup>1</sup>

<sup>1</sup> Department of Economics and Management and Ro.S.A., University of Parma, (email: fabrizio.laurini@unipr.it, gianluca.morelli@inicas.it)

ABSTRACT: A semi-automatic procedure for regression models, which identifies the optimal number of clusters in a large and complex data set, is discussed. Robust methods usually suffer from a high computational load, and we give practical clues for using TCLUST-REG with the FSDA toolbox in MATLAB.

KEYWORDS: FSDA, Outliers, TCLUST-REG.

#### 1 Introduction and motivation

The purpose of this paper is to provide the user with a set of semi-automatic tools in the context of regression clustering which can help to select the optimal number of groups (or, more generally, to find a set of relevant solutions), to give insights about the optimal restriction factor among the estimated residual variances, and finally to estimate the optimal trimming level, keeping in mind that it can depend on the chosen solution.

We made use of our Flexible Statistics for Data Analysis software package, the FSDA toolbox for MATLAB, which is available as an "Add-On" inside MATLAB or on GitHub.

#### 2 Technical machinery

Let the multivariate covariates *X* and the response variable *Y* be defined on Ω with values in X × Y ⊆ R<sup>p−1</sup> × R. Then {x<sub>i</sub>, y<sub>i</sub>}, i = 1, 2, ..., n, represents an i.i.d. random sample of size n drawn from (*X*, *Y*). Assume that Ω can be partitioned into k groups, say Ω<sub>1</sub>, Ω<sub>2</sub>, ..., Ω<sub>k</sub>. Then the general formulation of the regression clustering mixture model has a density which can be written as

$$p(x, y; \theta) = \sum_{g=1}^{k} p(y|x, \theta_{y,g})\, p(x, \theta_{x,g})\, \pi_g,$$

where p(y|x, θ<sub>y,g</sub>) is the conditional density of *Y* given *x* in Ω<sub>g</sub>, which depends on the vector of parameters θ<sub>y,g</sub>; p(x, θ<sub>x,g</sub>) is the marginal density of *X* in Ω<sub>g</sub>, which depends on the vector of parameters θ<sub>x,g</sub>; and π<sub>g</sub> reflects the importance of Ω<sub>g</sub> in the mixture, with the usual constraints π<sub>g</sub> > 0 and ∑<sub>g=1</sub><sup>k</sup> π<sub>g</sub> = 1. The vector θ = (θ<sub>y,g</sub><sup>T</sup>, θ<sub>x,g</sub><sup>T</sup>)<sup>T</sup> denotes the full set of parameters. It is customary to assume that in each group g the conditional relationship between *Y* and *x*, p(y|x, θ<sub>y,g</sub>), has the form Y = β<sub>0,g</sub> + x<sup>T</sup>β<sub>g</sub> + ε<sub>g</sub>, with proper parameters for all g components. Assuming normality and linearity implies the Gaussian Cluster Weighted Model (CWM) of Gershenfeld *et al.*, 1999, which can be written as

$$p(x, y; \theta) = \sum_{g=1}^{k} \phi(y; \beta_{0,g} + \beta_g^T x, \sigma_g^2)\, \phi_{p-1}(x; \mu_g, \Sigma_g)\, \pi_g.$$

This is linked to clustering around regression, which ignores the distribution of *X*. To avoid such an unrealistic assumption, in the so-called classification framework of model-based clustering, the classification log-likelihood is

$$L_{\mathrm{Cla}}(\theta) = \sum_{i=1}^{n} \sum_{g=1}^{k} z_{ig}(\theta) \log\left[ \phi(y_i; b_{0g} + x_i^T b_g, s_g^2)\, \phi_{p-1}(x_i; m_g, S_g)\, p_g \right]. \tag{1}$$

The target function (1) is unbounded when no constraints are imposed on the scatter parameters. It is therefore necessary to impose constraints, in the maximization, on the set of eigenvalues of the scatter matrices.
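A simplified version of the eigenvalue restriction can be sketched as follows: clip the eigenvalues of all group scatter matrices so that their ratio does not exceed the restriction factor c. This is only an illustration of the idea; the TCLUST/FSDA implementation searches for an optimal truncation point rather than clipping against the largest eigenvalue:

```python
import numpy as np

def restrict_scatters(scatters, c):
    """Clip the eigenvalues of all group scatter matrices so that
    max eigenvalue / min eigenvalue <= c across the groups."""
    eig = [np.linalg.eigh(np.asarray(S, dtype=float)) for S in scatters]
    top = max(w.max() for w, _ in eig)
    restricted = []
    for w, V in eig:
        w = np.clip(w, top / c, top)           # enforce the ratio constraint
        restricted.append(V @ np.diag(w) @ V.T)  # rebuild the matrix
    return restricted
```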

In the robust regression literature, the effect of both vertical outliers in *Y* and outliers in *X* is well known. Robustness can be achieved by discarding, in each step of the maximization procedure, a proportion α of units associated with the smallest contributions to the target likelihood. More precisely, in the mixture modelling context, the trimmed CWM parameter estimates are based on the maximization of the following trimmed likelihood function L<sub>Mixt</sub>(θ|α, c<sub>y</sub>, c<sub>X</sub>) (García-Escudero *et al.*, 2017):

$$L_{\mathrm{Mixt}}(\theta|\alpha, c_y, c_X) = \sum_{i=1}^{n} z^*(x_i, y_i) \log\left[ \sum_{g=1}^{k} \phi(y_i; b_{0,g} + b_g^T x_i, s_g^2)\, \phi_{p-1}(x_i; m_g, S_g)\, p_g \right], \tag{2}$$

where z*(·,·) is a 0-1 trimming indicator function. A fixed fraction α of observations can be left unassigned by setting ∑<sub>i=1</sub><sup>n</sup> z*(x<sub>i</sub>, y<sub>i</sub>) = [n(1−α)]. TCLUST-REG (García-Escudero *et al.*, 2010) can be considered as a particular case of the TCWRM in which the contribution to the likelihood of φ<sub>p−1</sub>(x<sub>i</sub>, m<sub>g</sub>, S<sub>g</sub>) is set equal to 1.
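One trimming step of this scheme can be sketched as follows. This is a simplified illustration, not the FSDA implementation: per-group OLS fits, each unit scored by its best Gaussian log-density contribution, and the α fraction with the smallest contributions flagged as trimmed (z* = 0):

```python
import numpy as np

def trimming_step(X, y, assign, alpha, k):
    """Keep the [n(1-alpha)] units with the largest best-group
    log-density contributions; the rest are trimmed."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    logdens = np.full((n, k), -np.inf)
    for g in range(k):
        idx = assign == g
        beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        resid = y - Xd @ beta
        s2 = np.mean(resid[idx] ** 2)
        logdens[:, g] = -0.5 * np.log(2 * np.pi * s2) - resid**2 / (2 * s2)
    best = logdens.max(axis=1)
    keep = int(np.floor(n * (1 - alpha)))
    z = np.zeros(n, dtype=bool)
    z[np.argsort(best)[-keep:]] = True  # z = 1 for retained units
    return z, logdens.argmax(axis=1)
```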

Figure 1. *Maximised likelihood for models Cla and Mixt (from left to right respectively) with associated choices of restriction coefficient and number of groups*

Often it is convenient to consider a further trimming step, which discards a proportion α<sub>X</sub> of the units according to their degree of remoteness in the *X* space, among the observations which have survived the first trimming operation. The observations surviving the two trimming steps are then used for updating the regression coefficients, weights and scatter matrices. This modification of the algorithm is usually referred to in the literature as *adaptive TCLUST-REG*. In the sequel we apply (adaptive) TCLUST-REG to a large data set, to provide guidelines for the case when complex big data are available.

#### 3 Data, results and further research

The data analysed, kept anonymous for confidentiality, are shopping records of 24 months of sales of non-food items. The number of customers is approximately 470000. The average sale of each customer in the time period is the response variable *Y*, on which we perform the regression clustering approach. The set of explanatory variables is given by the number of visits, the number of items bought per visit, the percentage of value bought with promotions/sales, and the age and gender of the customer. The optimal number of groups, and the optimal constraint factor, are displayed for likelihoods (1) and (2) in Figure 1 (left and right panels, respectively).

In all cases we obtained 3 groups, with approximately 8% of customers identified as outliers and left unallocated. The outliers in the data (roughly 40000 customers) are mostly characterized by occasional shopping and low revenue for the retailer. Broadly speaking, in cluster 2 the customers spend more and buy more articles compared to the average. Customers in cluster 1 tend to buy on sale rather than at full price, and buy more compared to the other clusters. In cluster 3, people buy less expensive articles, but often.

Figure 2. *Scatter plot of data and cluster membership for the optimal solution*

We want to remark that the identification of these outliers is fully automatic and not arbitrary, but comes as a by-product of an optimal model-based algorithm. The cluster membership is displayed in Figure 2; the overlap of units would create trouble for many "standard" clustering methods. Further details and comments will be provided during the Conference.

#### References


### **A NONPARAMETRIC APPROACH FOR STATISTICAL MATCHING UNDER INFORMATIVE SAMPLING AND NONRESPONSE**


Daniela Marella <sup>1</sup> and Danny Pfeffermann <sup>2</sup>

<sup>1</sup> Department of Education, University of Roma Tre, (e-mail: daniela.marella@uniroma3.it)

<sup>2</sup> Central Bureau of Statistics and Hebrew University of Jerusalem, Israel; University of Southampton, UK, (e-mail: D.Pfeffermann@soton.ac.uk)

**ABSTRACT**: Statistical matching attempts to combine the information obtained from different, non-overlapping samples, selected from the same target population, to form a matched sample containing the data in the different samples. The aim of this paper is to propose a nonparametric approach for handling statistical matching under informative sampling and not missing at random (NMAR) nonresponse, by use of empirical likelihood.

**KEYWORDS**: calibration, empirical likelihood, informative sampling, NMAR nonresponse.

#### **1 Introduction**

Statistical matching has become increasingly popular in recent years. Information on a set of variables is often obtained from different data sources related to the same target population, each containing only some of the variables, with no joint observations on all the variables.

Let *A* and *B* be two independent samples of size n<sub>A</sub> and n<sub>B</sub> respectively, selected from a population of *N* independent and identically distributed (*i.i.d.*) records, generated from some joint probability distribution function (*pdf*) f<sub>p</sub>(x, y, z) of variables (*X*, *Y*, *Z*). Only (*X*, *Y*) are observed for the units in sample *A*, and only (*X*, *Z*) are observed for the units in sample *B*. Because of the lack of joint information, the joint *pdf* f<sub>p</sub>(x, y, z) is not identifiable. Several alternative techniques have been proposed in the literature to overcome the identification problem. At first, techniques based on the conditional independence assumption (CIA) between *Y* and *Z* given *X* were considered. A second group of techniques uses external auxiliary information on the statistical relationship between *Y* and *Z*. Finally, a third approach consists of analysing the uncertainty regarding the joint distribution of (*X*, *Y*, *Z*); that is, several alternative models for the joint distribution of (*X*, *Y*, *Z*), compatible with the distributions of (*X*, *Y*) and (*X*, *Z*) in the samples *A* and *B*, are considered. See D'Orazio *et al.* (2006), Conti *et al.* (2016) and references therein.

**A NONPARAMETRIC APPROACH FOR STATISTICAL MATCHING UNDER INFORMATIVE SAMPLING AND NONRESPONSE**

Daniela Marella1 and Danny Pfeffermann2

**ABSTRACT**: Statistical matching attempts to combine the information obtained from different, non-overlapping samples, selected from the same target population, to form a matched sample containing the data in the different samples. The aim of this paper is to propose a nonparametric approach of handling statistical matching under informative sampling and not

**KEYWORDS**: calibration, empirical likelihood, informative sampling, NMAR nonresponse.

Statistical matching is becoming more and more popular in recent years. Information on a set of variables is often obtained from different data sources related to the same target population, each containing only some of the variables, with no joint

Let *A* and *B* be two independent samples of size *<sup>A</sup> n* and *<sup>B</sup> n* respectively, selected from a population of *N* independent and identically distributed (*i.i.d*.) records, generated from some joint probability distribution function (*pdf*), (, ,) *<sup>p</sup> f xyz* of variables ( ,,) *XYZ* . Only (,) *X Y* are observed for the units in sample *A*, and only (,) *X Z* are observed for the units in sample *B* . Because of the lack of joint information, the joint *pdf* (, ,) *<sup>p</sup> f xyz* is not identifiable. Several alternative techniques have been proposed in the literature to overcome the identification problem. At first, techniques based on the conditional independence assumption (CIA) between *Y* and *Z* given *X* were considered. A second group of techniques uses external auxiliary information on the statistical relationship between *Y* and *Z* . Finally, a third approach consists of analysing the uncertainty regarding the joint distribution of ( ,,) *XYZ* , that is several alternative models for the joint distribution of ( ,,) *XYZ* , compatible with the distributions of (,) *X Y* and (,) *X Z*

<sup>2</sup> Central Bureau of Statistics and Hebrew University of Jerusalem, Israel; University of

<sup>1</sup> Department of Education, University of Roma Tre, (e-mail: daniela.marella@uniroma3.it)

**1 Introduction**

observations on all the variables.

Southampton, UK, (e-mail: D.Pfeffermann@soton.ac.uk)

missing at random (NMAR) nonresponse, by use of empirical likelihood.

In practice, the sample selection in survey sampling involves complex sampling designs based on different levels of clustering and differential inclusion probabilities. When the inclusion probabilities are related to the value of the target outcome variable even after conditioning on the model covariates, the observed outcomes are no longer representative of the population outcomes, and the model holding for the sample data is then different from the model holding in the population. This quite common phenomenon is known as *informative sampling*, see Pfeffermann and Sverchkov (2009). The case of informative sampling designs in the statistical matching problem, in a parametric setting assuming complete response, is analysed in Marella and Pfeffermann (2019). However, in practice, not all the sampled units respond. When the response probabilities are correlated with the missing target outcomes, even after conditioning on the observed data (often, the model covariates), the missing data are not missing at random (NMAR). Valid inference under NMAR nonresponse therefore requires modelling the response mechanism. The problem in applying standard inferential procedures, which ignore the sampling process and nonresponse, is that the distribution holding for the data observed for the responding units can be very different from the distribution holding for the population data, which may result in large bias of estimators and affect other aspects of the inference process.
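To see the phenomenon numerically, consider the following small simulation (an illustration of informative sampling in general, not of the authors' matching procedure): units are sampled with inclusion probabilities proportional to the outcome, the unweighted sample mean is biased, and an inverse-probability-weighted (Hájek-type) estimator recovers the population mean.

```python
import numpy as np

# Illustration of informative sampling: the inclusion probability pi_i is
# proportional to the outcome y_i itself, so the distribution of y in the
# sample differs from the population one. Weighting by w_i = 1/pi_i
# (a Hajek-type estimator) removes the resulting bias.
rng = np.random.default_rng(0)

N, n = 100_000, 5_000
y = 1.0 + rng.exponential(scale=1.0, size=N)  # population values, mean ~ 2
pi = n * y / y.sum()                          # inclusion prob. proportional to y
sampled = rng.random(N) < pi                  # Poisson sampling
ys, ws = y[sampled], 1.0 / pi[sampled]        # observed outcomes and weights

naive = ys.mean()                             # biased: large y over-represented
weighted = (ws * ys).sum() / ws.sum()         # approximately unbiased
print(f"population {y.mean():.2f}  naive {naive:.2f}  weighted {weighted:.2f}")
```

Here the naive mean converges to the size-biased mean $E(Y^2)/E(Y) = 2.5$ rather than to the population mean $2$, while the weighted estimator stays close to $2$.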

The aim of this paper is to propose an approach for handling statistical matching under informative sampling and NMAR nonresponse, by use of empirical likelihood (EL). The main advantages of the EL approach are: (i) it does not require specifying the population model; (ii) it facilitates the use of calibration constraints.
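As a minimal illustration of advantage (ii) (a standard empirical-likelihood computation with one hypothetical scalar auxiliary variable, not the full matching algorithm of the paper), EL point masses can be calibrated so that the weighted mean of an auxiliary variable equals its known population mean, by solving for a single Lagrange multiplier:

```python
import numpy as np

# Empirical likelihood with one calibration constraint: maximize
# sum_i log(p_i) subject to sum_i p_i = 1 and sum_i p_i * x_i = mu,
# where mu is a known population mean. The solution has the form
# p_i = 1 / (n * (1 + lam * (x_i - mu))), with lam solving
# g(lam) = sum_i (x_i - mu) / (1 + lam * (x_i - mu)) = 0.
def el_weights(x, mu, tol=1e-12, max_iter=50):
    x = np.asarray(x, dtype=float)
    n, d = len(x), x - mu
    lam = 0.0
    for _ in range(max_iter):                    # Newton-Raphson on lam
        u = 1.0 + lam * d
        g = np.sum(d / u)                        # estimating equation
        dg = -np.sum((d / u) ** 2)               # its derivative
        step = g / dg
        lam -= step
        if abs(step) < tol:
            break
    return 1.0 / (n * (1.0 + lam * d))

x = np.array([1.0, 2.0, 3.0, 4.0, 6.0])          # sample mean is 3.2
p = el_weights(x, mu=3.0)                        # calibrate the mean to 3.0
print(p.round(4), float(p @ x))
```

Once $g(\lambda)=0$, the constraint $\sum_i p_i = 1$ holds automatically, because $\sum_i 1/(1+\lambda d_i) = n - \lambda\, g(\lambda)$.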

#### **2 Empirical likelihood approach for statistical matching**

The empirical likelihood is essentially the likelihood of the multinomial distribution used in Hartley and Rao (1968), where the parameters are the point masses assigned to the distinct sample values. We assume that the sampling designs used for selecting the two samples $A$ and $B$ are informative for the corresponding joint population *pdf*, in the sense that the sample selection probabilities $\{\pi_{i,A}, \pi_{i,B}\}$ are correlated with at least some of the variables $(X,Y,Z)$, implying that the joint sample *pdf* is different from the corresponding population *pdf*. In addition to informative sampling, we assume that $A$ and $B$ are subject to NMAR unit nonresponse. Let $I_i^A$ ($I_i^B$) be the sample indicator taking the value 1 if the $i$-th population unit is drawn into sample $A$ ($B$) and 0 otherwise. Let $R_i^A$ ($R_i^B$) define the response indicator, taking the value 1 if sample unit $i \in A$ ($i \in B$) responds and 0 otherwise. The response process is assumed to be independent between units. We assume that $X$ can take $K$ distinct values with probabilities $p_k^X = P(X = x_k)$, while $Y$ and $Z$ are continuous.

The basic idea of the EL approach is to approximate the population distribution by a multinomial model whose support is given by the empirical observations. Under the CIA, the probabilities $p_i^{XYZ} = P(x_i, y_i, z_i)$ can be factorized as,

$$p_i^{XYZ} = P(x_i, y_i, z_i) = P(x_i)\,P(y_i \mid x_i)\,P(z_i \mid x_i) = p_k^X\, p_i^{Y|X}\, p_i^{Z|X}. \tag{2.1}$$
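As a toy numerical sketch of the CIA factorization above (hypothetical categorical data for illustration only; in the paper $Y$ and $Z$ are continuous, and informative sampling and nonresponse must also be accounted for):

```python
import numpy as np

# Statistical matching under the CIA with everything categorical:
# p(x) is estimable from both samples, p(y|x) only from sample A,
# p(z|x) only from sample B; the CIA glues them into a joint law.
rng = np.random.default_rng(1)

K, n_A, n_B = 3, 400, 300
x_A, y_A = rng.integers(0, K, n_A), rng.integers(0, 2, n_A)   # sample A: (X, Y)
x_B, z_B = rng.integers(0, K, n_B), rng.integers(0, 2, n_B)   # sample B: (X, Z)

p_x = np.bincount(np.concatenate([x_A, x_B]), minlength=K) / (n_A + n_B)
p_y_x = np.array([np.bincount(y_A[x_A == k], minlength=2) / (x_A == k).sum()
                  for k in range(K)])                         # p(y | x = k)
p_z_x = np.array([np.bincount(z_B[x_B == k], minlength=2) / (x_B == k).sum()
                  for k in range(K)])                         # p(z | x = k)

# CIA joint: p(x, y, z) = p(x) p(y|x) p(z|x), stored as a (K, 2, 2) array
p_xyz = p_x[:, None, None] * p_y_x[:, :, None] * p_z_x[:, None, :]
print(p_xyz.shape, float(p_xyz.sum()))
```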

The parameters $\{p_k^X, p_i^{Y|X}, p_i^{Z|X}\}$ are unknown and need to be estimated from the samples $A$ and $B$. By Bayes' rule,

$$p_{i,R_A}^{Y|X} = P(y_i \mid x_k, I_i^A = 1, R_i^A = 1) = \frac{P(R_i^A = 1 \mid x_k, y_i, I_i^A = 1)}{P(R_i^A = 1 \mid x_k, I_i^A = 1)}\, p_{i,A}^{Y|X} \tag{2.2}$$

$$p_{k,R_A}^{X} = P(x_i = x_k \mid I_i^A = 1, R_i^A = 1) = \frac{P(R_i^A = 1 \mid x_k, I_i^A = 1)}{P(R_i^A = 1 \mid I_i^A = 1)}\, p_{k,A}^{X} \tag{2.3}$$

where the sample models $p_{i,A}^{Y|X}$, $p_{k,A}^{X}$ are defined as,

$$p_{i,A}^{Y|X} = P(y_i \mid x_k, I_i^A = 1) = \frac{E_A(w_{i,A} \mid x_k)}{E_A(w_{i,A} \mid x_k, y_i)}\, p_i^{Y|X}, \tag{2.4}$$

$$p_{k,A}^{X} = P(x_k \mid I_i^A = 1) = \frac{E_A(w_{i,A} \mid x_k)}{\sum_{j \in A} E_A(w_{j,A} \mid x_j)\, p_j^{X}}\; p_k^{X} \tag{2.5}$$

and $A_k = \{i \in A : x_i = x_k\}$. Then, the sample models and the model for the response probabilities $P(R_i^A = 1 \mid x_k, y_i, I_i^A = 1)$ define the model holding for the outcomes of the responding units. Notice that unless $P(R_i^A = 1 \mid x_k, y_i, I_i^A = 1) = P(R_i^A = 1 \mid x_k, I_i^A = 1)$ for all $(x_k, y_i)$, the model (2.2) is different from the sample model $p_{i,A}^{Y|X}$ in (2.4), which in turn is different from the population model $p_i^{Y|X}$ under informative sampling. Analogous expressions to (2.2)-(2.5) are obtained for the model holding for the responding units in $B$. Thus, assuming that the outcome, the sampling and the response are independent between units, the *empirical respondents likelihood* for the sample $A \cup B$ is given by,

$$ERL_{Obs}^{A,B} = \prod_{k=1}^{K} \left(p_{k,R_A}^{X}\right)^{r_{k,A}^{X}} \prod_{i \in R_{k,A}} p_{i,R_A}^{Y|X} \times \prod_{k=1}^{K} \left(p_{k,R_B}^{X}\right)^{r_{k,B}^{X}} \prod_{i \in R_{k,B}} p_{i,R_B}^{Z|X} \tag{2.6}$$

where $R_{k,A}$ ($R_{k,B}$) denotes the group of respondents with $X = x_k$ in sample $A$ ($B$), of size $r_{k,A}^{X}$ ($r_{k,B}^{X}$). The response probabilities in (2.6) are unknown and need to be modelled by a parametric model and estimated from the available data. Let $\gamma_A$, $\gamma_B$ be the unknown response model parameters postulated for the two samples; the likelihood (2.6) must then be maximized with respect to $[\{p_k^X, p_i^{Y|X}, p_i^{Z|X}\}, \gamma_A, \gamma_B]$ under the constraints,

$$p_k^{X} \geq 0,\quad p_i^{Y|X} \geq 0,\quad p_i^{Z|X} \geq 0,\quad \sum_{k=1}^{K} p_k^{X} = 1,\quad \sum_{j \in R_{k,A}} p_j^{Y|X} = 1,\quad \sum_{j \in R_{k,B}} p_j^{Z|X} = 1. \tag{2.7}$$

An important advantage of the proposed approach is that it facilitates the use of calibration constraints. That is, auxiliary information on known population means for some auxiliary variables can be incorporated by placing additional constraints on the maximization process.

#### **References**

CONTI, P.L., MARELLA, D., & SCANU, M. 2016. Statistical matching analysis for complex survey data with applications. *Journal of the American Statistical Association*, **111**(516), 1715-1725.

D'ORAZIO, M., DI ZIO, M., & SCANU, M. 2006. *Statistical Matching: Theory and Practice*. Chichester: Wiley.

HARTLEY, H.O. & RAO, J.N.K. 1968. A new estimation theory for sample surveys. *Biometrika*, **55**, 547-557.

MARELLA, D. & PFEFFERMANN, D. 2019. Matching information from two independent informative samples. *Journal of Statistical Planning and Inference*, **203**, 70-81.

PFEFFERMANN, D. & SVERCHKOV, M. 2009. Inference under informative sampling. In *Handbook of Statistics 29B; Sample Surveys: Inference and Analysis* (Eds. D. Pfeffermann and C.R. Rao). Amsterdam: North Holland.

### **INVESTIGATING MODEL FIT IN ITEM RESPONSE MODELS WITH THE HELLINGER DISTANCE**


Mariagiulia Matteucci<sup>1</sup> and Stefania Mignani<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, University of Bologna, (e-mail: m.matteucci@unibo.it, stefania.mignani@unibo.it)

**ABSTRACT**: Under the Bayesian approach, posterior predictive model checking has become a popular tool for fit assessment of item response theory models. In this study, we propose the use of the Hellinger distance to quantify the distance between the realized and the predictive distribution of the model-based covariance for item pairs. Specifically, the case of over-fitting is taken into account. The results of a simulation study show the effectiveness of the method.

**KEYWORDS**: posterior predictive model checking, Hellinger distance, MIRT models, goodness of fit.

#### **1 Introduction**

Bayesian estimation of item response theory (IRT) models via Markov chain Monte Carlo (MCMC) has been intensively applied due to its flexibility in handling complex situations. Through posterior predictive model checking (PPMC; Rubin, 1984), it is possible to define tools for evaluating the fit of the model. Considerable advantages of the method are that it does not rely on distributional assumptions, and it is relatively easy to implement, given that the entire posterior distribution of all parameters of interest is obtained through MCMC algorithms.

The use of PPMC for IRT models has received increasing interest for assessing multidimensionality (Sinharay *et al.*, 2006; Levy and Svetina, 2011). The PPMC method is based on the comparison between the observed and the replicated data through a given discrepancy measure *D*. PPMC is implemented first with graphical analyses and then with the estimation of posterior predictive *p*-values (PPP-values). However, the PPP-value simply counts the number of times the replicated *D* is equal to or higher than the realized *D*, without addressing the magnitude of the difference between the two distributions. To overcome these limitations, in a previous paper (Matteucci and Mignani, 2020) we proposed to measure the difference between the predictive and the realized distribution via the Hellinger distance, a measure suitable for improving the interpretation of results in applied settings and useful for model comparison purposes.
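The PPP-value computation just described can be sketched as follows (a generic toy model rather than an IRT model; `discrepancy` stands in for any choice of *D*):

```python
import numpy as np

# PPP-value: the proportion of posterior draws for which the discrepancy of
# model-replicated data is at least as large as that of the observed data.
rng = np.random.default_rng(0)

def ppp_value(y, posterior_mu, discrepancy):
    realized, predictive = [], []
    for mu in posterior_mu:                        # one replicate per MCMC draw
        y_rep = rng.normal(mu, 1.0, size=len(y))   # data replicated under model
        realized.append(discrepancy(y, mu))
        predictive.append(discrepancy(y_rep, mu))
    return float(np.mean(np.array(predictive) >= np.array(realized)))

# Toy check: data truly from N(0, 1), model N(mu, 1) with mu unknown.
y = rng.normal(0.0, 1.0, size=50)
posterior_mu = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)
D = lambda data, mu: abs(float(data.mean()) - mu)  # depends on data and parameter
p = ppp_value(y, posterior_mu, D)
print(round(p, 2))   # values near 0.5 suggest adequate fit for this D
```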

The main objective of this paper is to investigate the performance of the Hellinger distance in an over-fitting situation, namely to evaluate the potential misfit of a multidimensional IRT model when the data are generated by a unidimensional model. We explore our proposal by simulation to enrich the previous results, which covered an under-fitting scenario.

#### **2 The discrepancy measures**


PPMC techniques are based on the comparison of observed data with replicated data generated or predicted by the model, using a number of diagnostic measures that are sensitive to model misfit (Sinharay *et al.,* 2006). Substantial differences between the posterior distribution based on observed data and the posterior predictive distribution indicate poor model fit. Given the data *y*, let *p*(*y*|*ω*) and *p*(*ω*) be the likelihood for a model depending on the set of parameters *ω* and the prior distribution for the parameters, respectively.

From a practical point of view, one should define a suitable discrepancy measure *D*(·) and compare the posterior distribution of *D*(*y,ω*), based on observed data, to the posterior predictive distribution of *D*(*y*rep,*ω*). Discrepancy measures should be chosen to capture relevant features of the data and differences between the data and the model. It is possible to resort to the PPP-values, defined as "the probability that the replicated data could be more extreme than the observed data, as measured by the test quantity". The choice of a suitable discrepancy measure is crucial in PPMC. Effective diagnostic measures for checking unidimensionality or multidimensionality are based on the association or on the covariance/correlation among item pairs. In this paper we consider the model-based covariance (MBC; Reckase, 1997), which depends on both data and model parameters. The MBC is found to be effective as it measures the covariance among item pairs by explicitly conditioning on the latent variable. If the local independence assumption holds, the MBC is close to zero. If local independence does not hold, the MBC is greater than zero for items loading on the same latent variable (PPP-values are close to zero) and smaller for items loading on different latent variables (PPP-values are close to one).

Lastly, Levy and Svetina (2011) proposed an overall measure, namely the generalized dimensionality discrepancy measure (GDDM) that is a unidirectional measure of average conditional covariance defined as the mean of the absolute values of MBC over unique item pairs. When the GDDM is equal to zero, a "weak" local independence for all the item pairs is assumed. If the assumption of local independence is violated, the GDDM is greater than zero and the PPP-value will be close to zero.
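Read literally, these two measures can be sketched as follows (our own minimal implementation of the verbal definitions, with the model-implied success probability standing in for the conditional expectation of the response):

```python
import numpy as np
from math import erf, sqrt

# Model-based covariance (MBC) for every item pair: covariance of the
# residuals y_ij - E[y_ij | theta_i], i.e. conditioning on the latent trait.
# GDDM: mean absolute MBC over the unique item pairs.
def mbc_matrix(y, expected):
    resid = y - expected                  # (n_persons, n_items) residuals
    return resid.T @ resid / y.shape[0]   # (n_items, n_items)

def gddm(y, expected):
    m = mbc_matrix(y, expected)
    j, l = np.triu_indices_from(m, k=1)   # unique pairs j < l
    return float(np.abs(m[j, l]).mean())

# Toy 2PNO data where the fitted model is the true one, so the MBC should be
# near zero for all pairs (local independence holds given theta).
rng = np.random.default_rng(2)
n, J = 2000, 6
theta = rng.normal(size=(n, 1))           # latent traits
a = rng.uniform(0.5, 1.5, J)              # discriminations
b = rng.normal(0.0, 0.7, J)               # difficulties
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
prob = Phi(a * theta - b)                 # 2PNO response probabilities
y = (rng.random((n, J)) < prob).astype(float)
print(round(gddm(y, prob), 3))            # close to zero under local independence
```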

#### **3 The Hellinger distance**

To quantify the difference between the realized and the predictive distribution within PPMC, Matteucci and Mignani (2020) propose to use the Hellinger (H) distance, which is symmetric, obeys the triangle inequality, and ranges between 0 and 1. Its direct calculation is computationally demanding and, given the MCMC simulations, it is usually estimated via a normal kernel density. In order to check for local independence, we used the H distance with the MBC discrepancy measure (MBC-H), to obtain a fit measure for each item pair, and with the GDDM measure (GDDM-H), to evaluate the overall fit based on item pairs. It is proposed to investigate the assumption of local independence for 2PNO models by focusing on multidimensional data analyzed with the unidimensional model. The main strengths of the H distance, compared to traditional approaches, are the possibility of: a) directly quantifying the amount of misfit; b) being used for model comparison purposes; c) making more informative analyses of item pairs. Furthermore, it is demonstrated that, in practical applications, the MBC-H can be used to: a) discard models that show serious misfit, by using the threshold of 0.5; b) compare the amount of misfit of different competing models and choose the model which fits the data best; c) identify, also through graphical plots, critical items that may involve misfit, these being associated with high MBC-H in several pairs.
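The H distance between the two PPMC distributions can be estimated along these lines (a histogram-based stand-in for the kernel density estimate mentioned above; all data are synthetic):

```python
import numpy as np

# Hellinger distance H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2)
# between the realized and the predictive discrepancy distributions,
# with both densities approximated on a common histogram grid.
def hellinger(sample_p, sample_q, bins=50):
    lo = min(sample_p.min(), sample_q.min())
    hi = max(sample_p.max(), sample_q.max())
    p, _ = np.histogram(sample_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sample_q, bins=bins, range=(lo, hi))
    p = p / p.sum()                        # normalize counts to probabilities
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

rng = np.random.default_rng(3)
realized = rng.normal(0.0, 1.0, 5000)      # realized discrepancies
close = rng.normal(0.0, 1.0, 5000)         # predictive draws, well-fitting model
far = rng.normal(2.0, 1.0, 5000)           # predictive draws, misfitting model
print(round(hellinger(realized, close), 2), round(hellinger(realized, far), 2))
```

By construction the value stays in [0, 1]: 0 for identical distributions, 1 for distributions with disjoint supports.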

In this paper we confirm the strength of our proposal through a simulation study in an over-fitting setting, where unidimensional data are analyzed through different multidimensional models.

#### **4 The simulation**

A simulation study is conducted to examine the performance of the proposed MBC-H and GDDM-H at detecting misfit when the data follow a two-parameter normal ogive (2PNO) unidimensional model and we fit a multi-unidimensional model and an additive model with two latent dimensions. Response data for tests with *k*=10 or *k*=20 items and a sample size of *n*=1,000 or *n*=2,000 are simulated. Two subtests are assumed for the multidimensional models (*k*<sub>1</sub>=*k*<sub>2</sub>=5 or *k*<sub>1</sub>=*k*<sub>2</sub>=10). The case of unidimensional data analyzed with the same model is also considered. A total of 5,000 MCMC iterations are conducted, of which 1,000 are used for PPMC. Finally, 100 replications are done for each simulation condition. The parameters of the data-analysis model are estimated via the Gibbs sampler. The over-fitting scenario is particularly meaningful for its implications in real situations. Although unidimensionality is quite unrealistic, especially with a high number of items, there are situations addressing a different point of view. For example, in the educational context, a test could be arranged under the assumption that groups of items refer to different cognitive domains. In this situation, a multidimensional model should be estimated to investigate the different domains, but one predominant dimension should explain most of the variability.

The main results of the simulation study are reported in Table 1. We do not present PPP-values as they indicate no evidence of bad fit for all conditions. We found the most critical evidence for *k*=20, where the fitted models show, on average, MBC-H higher than 0.5, meaning bad fit. The additive model seems to be the most appropriate, even when data are unidimensional, as it also includes an overall latent trait. For *k*=10, the goodness of fit improves, but again the results of the H-distance favour the additive model. Classical Bayesian indicators such as the DIC confirm these conclusions. The Hellinger distance seems to be an effective tool in highlighting the presence of possible misfit and determining plausible thresholds for classifying the misfit levels.

Table 1 - Summary results for the 100 replications and for all item pairs (MBC-H summarized by Mean, Sd, Min, Max; row labels give the generating/fitted model).

| Model | *k* | *n* | DIC | Mean | Sd | Min | Max | GDDM-H |
|---|---|---|---|---|---|---|---|---|
| Uni/multi-uni | 10 | 1000 | 6600.27 | 0.427 | 0.086 | 0.275 | 0.605 | 0.352 |
| | 20 | 1000 | 15312.89 | 0.546 | 0.047 | 0.396 | 0.656 | 0.475 |
| | 10 | 2000 | 14671.31 | 0.446 | 0.068 | 0.312 | 0.618 | 0.376 |
| | 20 | 2000 | 25302.47 | 0.542 | 0.061 | 0.375 | 0.644 | 0.456 |
| Uni/additive | 10 | 1000 | 6433.11 | 0.383 | 0.081 | 0.246 | 0.551 | 0.292 |
| | 20 | 1000 | 15173.99 | 0.524 | 0.049 | 0.363 | 0.639 | 0.426 |
| | 10 | 2000 | 14410.26 | 0.373 | 0.060 | 0.276 | 0.518 | 0.279 |
| | 20 | 2000 | 25022.79 | 0.506 | 0.065 | 0.323 | 0.621 | 0.402 |
| Uni/uni | 10 | 1000 | 6614.53 | 0.450 | 0.082 | 0.295 | 0.619 | 0.376 |
| | 20 | 1000 | 12702.39 | 0.552 | 0.057 | 0.400 | 0.653 | 0.479 |
| | 10 | 2000 | 13180.96 | 0.447 | 0.083 | 0.283 | 0.643 | 0.339 |
| | 20 | 2000 | 25306.66 | 0.554 | 0.059 | 0.377 | 0.664 | 0.475 |

#### **References**

LEVY, R., & SVETINA, D. 2011. A generalized dimensionality discrepancy measure for dimensionality assessment in multidimensional item response theory. *British Journal of Mathematical and Statistical Psychology*, **65**, 208-232.

MATTEUCCI, M., & MIGNANI, S. 2020. The Hellinger distance within posterior predictive assessment for investigating multidimensionality in IRT models. *Multivariate Behavioral Research*, DOI: 10.1080/00273171.2020.1753497.

RECKASE, M. 2009. *Multidimensional Item Response Theory*. New York: Springer-Verlag.

RUBIN, D. B. 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. *Annals of Statistics*, **12**, 1151-1172.

SINHARAY, S., JOHNSON, M. S., & STERN, H. S. 2006. Posterior predictive assessment of item response theory models. *Applied Psychological Measurement*, **30**, 298-321.

The main results of the simulation study are reported in Table 1. We do not present PPP-values as they indicate lack of bad fit for all conditions. We found the more critical evidence for *k*=20 where the fitted models show, on average, MBC-H higher than 0.5 meaning bad fit. The additive model seems to be the more appropriate, even when data are unidimensional, as it also includes an overall latent trait. For *k*=10, the goodness of fit improves but again the results of the H-distance

are associated to high MBC-H in several pairs.

should explain the most part of the variability.

multidimensional models.

**4 The simulation**
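As noted above, the H distance is usually estimated from the MCMC output via normal kernel densities. A minimal sketch of that computation, where the two Gaussian samples merely stand in for draws of the realized and posterior-predictive discrepancies (they are not the study's data):

```python
# Sketch: Hellinger (H) distance between two samples via normal kernel
# density estimates; the Gaussian samples below are illustrative stand-ins.
import numpy as np
from scipy.stats import gaussian_kde

def hellinger(sample_a, sample_b, grid_size=2048):
    """H distance between two samples, estimated with Gaussian KDEs."""
    kde_a, kde_b = gaussian_kde(sample_a), gaussian_kde(sample_b)
    lo = min(sample_a.min(), sample_b.min())
    hi = max(sample_a.max(), sample_b.max())
    x = np.linspace(lo, hi, grid_size)
    fa, fb = kde_a(x), kde_b(x)
    # H^2 = 1 - integral of sqrt(f * g); hence H lies in [0, 1]
    bc = np.sum(np.sqrt(fa * fb)) * (x[1] - x[0])  # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

rng = np.random.default_rng(0)
same = hellinger(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
far = hellinger(rng.normal(0, 1, 5000), rng.normal(4, 1, 5000))
print(same, far)  # near 0 for overlapping samples, near 1 for separated ones
```

Values near 0 indicate closely overlapping distributions; the 0.5 threshold discussed above sits between these two extremes.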


### **PCA-BASED COMPOSITE INDICES AND MEASUREMENT MODEL**

Matteo Mazziotta<sup>1</sup>, Adriano Pareto<sup>1</sup>

<sup>1</sup> Italian National Institute of Statistics, Rome, (e-mail: mazziott@istat.it, pareto@istat.it)

**ABSTRACT**: The measurement of complex phenomena, such as well-being, socioeconomic development, and competitiveness, is very difficult because they are characterized by a multiplicity of aspects or dimensions. Principal Component Analysis (PCA) is probably the most popular multivariate statistical technique for reducing data with many dimensions. Thus, often, socio-economic indicators are reduced to a single index by using PCA. However, PCA is implicitly based on a reflective measurement model that is not suitable for all types of indicators. In this paper, we discuss the use and misuse of PCA for measuring complex phenomena.

**KEYWORDS**: PCA, data reduction, composite index, measurement model.

### **1 Introduction**

Socio-economic indicators are often analysed by multivariate statistical techniques, such as Principal Component Analysis (PCA), in order to summarize the data and to construct composite indices. However, a fundamental distinction must be made between reducing dimensionality and constructing composite indices.

Reducing dimensionality is a purely mathematical operation that consists in summarizing a set of individual indicators, so that most of the information in the data is preserved. Many techniques have been developed for this purpose, but PCA is one of the oldest and most widely used. Its idea is simple: reduce the dimensionality of a dataset, while preserving as much 'variability' as possible. This translates into finding new variables that are linear functions of the original ones, that successively maximize variance and that are uncorrelated with each other. Because the new variables are defined by the dataset at hand, and not a priori, PCA can be considered an adaptive data analysis tool.
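The variance-maximizing construction described above can be sketched with a plain SVD on standardized synthetic indicators (illustrative data, not a real composite-index dataset):

```python
# Minimal PCA on a small indicator matrix via SVD (illustrative data).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]             # induce correlation among indicators
Z = (X - X.mean(0)) / X.std(0)       # standardize

# SVD of the standardized data: the rows of Vt are the loading vectors
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)      # variance share of each component
scores = Z @ Vt.T                    # principal component scores

print(np.round(explained, 3))        # components sorted by variance explained
```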

Constructing a composite index (or composite indicator) is a conceptual, as well as mathematical, operation that consists in summarizing (or aggregating as it is termed) a set of individual indicators, on the basis of a well-defined measurement model: formative or reflective. Therefore, a composite indicator is formed when individual indicators are compiled into a single index, on the basis of an underlying model of the multi-dimensional concept that is being measured.

Obviously, a composite index can be obtained by reducing dimensionality (with an appropriate model of measurement), but not necessarily reducing dimensionality provides a composite index. In this paper, we discuss the use of PCA for studying socio-economic indicators and we explain how and why it can be improperly used as a method for constructing composite indices.

#### **2 The measurement model**

As is known, a model of measurement can be conceived through two different approaches: reflective or formative.

The most popular approach is the reflective model, according to which individual indicators denote effects (or manifestations) of an underlying latent variable. Therefore, causality is from the concept to the indicators and a change in the phenomenon causes variation in all its measures. In this model, the construct exists independently of awareness or interpretation by the researcher, even if it is not directly measurable. Specifically, the latent variable R represents the common cause shared by all indicators X*i* reflecting the construct, with each indicator corresponding to a linear function of the underlying variable plus a measurement error:

$$X_i = \lambda_i R + \varepsilon_i \tag{1}$$

where X*i* is the indicator *i*, λ*i* is a coefficient (loading) capturing the effect of R on X*<sup>i</sup>* and ε*i* is the measurement error for the indicator *i*. Measurement errors are assumed to be independent and unrelated to the latent variable. A typical example of reflective model is the measurement of the intelligence of a person. In this case, it is the 'intelligence level' that influences the answers to a questionnaire for measuring attitude, and not vice versa. Hence, if the intelligence of a person increased, this would be accompanied by an increase of correct answers to all questions.

The second approach is the formative model, according to which individual indicators are causes of an underlying latent variable, rather than its effects. Therefore, causality is from the indicators to the concept and a change in the phenomenon does not necessarily imply variations in all its measures. In this model, the construct is defined by, or is a function of, the observed variables. The specification of the formative model is:

$$R = \sum_{i} \lambda_i X_i + \zeta \tag{2}$$

where λ*i* is a coefficient capturing the effect of X*i* on R, and ζ is an error term. A typical example of formative model is the measurement of well-being of society. It depends on health, income, occupation, services, environment, etc., and not vice versa. So, if any one of these factors improved, well-being would increase (even if the other factors did not change). However, if well-being increased, this would not necessarily be accompanied by an improvement in all factors.

Note that (1) is a system of simple regression equations where each individual indicator is the dependent variable and the latent variable is the explanatory variable; whereas (2) represents a multiple regression equation where the latent variable is the dependent variable and the indicators are the explanatory variables.
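A small simulation makes the contrast between (1) and (2) concrete: reflective indicators inherit correlation from their common cause R, while formative indicators may be mutually uncorrelated. The loadings and noise levels below are arbitrary illustrative choices:

```python
# Simulating the two measurement models of Eqs. (1)-(2) with synthetic data:
# a latent R induces correlation among reflective indicators, while
# formative indicators can stay mutually uncorrelated.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Reflective, Eq. (1): X_i = lambda_i * R + eps_i
R = rng.normal(size=n)
lam = np.array([0.9, 0.8, 0.7])
X_refl = R[:, None] * lam + 0.4 * rng.normal(size=(n, 3))

# Formative, Eq. (2): R = sum_i lambda_i * X_i + zeta, with independent X_i
X_form = rng.normal(size=(n, 3))
R_form = X_form @ lam + 0.4 * rng.normal(size=n)

print(np.round(np.corrcoef(X_refl.T), 2))   # strong off-diagonal correlations
print(np.round(np.corrcoef(X_form.T), 2))   # near-zero off-diagonals
```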

Although the reflective approach dominates the psychological and management sciences, the formative view is common in economics and sociology.

### **3 How and when to use PCA**

According to the OECD "Handbook on Constructing Composite Indicators. Methodology and User Guide", PCA should be used to study the overall structure of the dataset, assess its suitability, and guide some methodological choices in constructing a composite index. Nevertheless, PCA can also be used for constructing composite indices. For this purpose, it is essential to define the model of measurement in order to describe the relationships between the phenomenon to be measured (latent variable) and its measures (individual indicators). But above all, it is necessary to establish whether PCA is formative or reflective. To answer this question it is important to distinguish between PCA and Factor Analysis (FA), since they are sometimes considered more or less interchangeable.

PCA is a pure data reduction technique that aggregates the observed variables (indicators) in order to reproduce the largest amount of variance with fewer variables (principal components or factors). PCA works without an explicit hypothesis on the latent structure of the variables, so that the observed variables are themselves of interest. This makes PCA similar to multiple regression in some ways, in that it seeks to create optimized weighted linear combinations of variables.

FA is an explanatory model in which the observed variables (indicators) are assumed to be (linear) functions of a smaller number of unobserved variables (latent factors). FA hypothesizes an underlying latent structure of the variables and estimates the latent factors influencing the observed variables.

On the basis of these features, PCA is often viewed as formative, whereas FA is viewed as reflective. However, the question whether PCA is formative or reflective is not trivial. Indeed, although the definition of a principal component as a weighted sum of individual indicators suggests a formative model, some important issues are involved.


In the light of the above, a composite index based on PCA looks more suited for a reflective approach than a formative one. In fact, PCA is commonly used for the evaluation of reflective measurement models and it is considered an appropriate method for examining the indicators' underlying factor structure in order to check the content validity.
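As an illustration of such a PCA-based composite, the sketch below scores units on the first principal component and compares the result with a naive equal-weight aggregate. The data are synthetic, and both the sign fix and the equal-weight benchmark are illustrative choices, not methods taken from the paper:

```python
# Sketch: first-principal-component scores as a composite index, compared
# with a simple equal-weight aggregate on illustrative standardized data.
import numpy as np

rng = np.random.default_rng(7)
Z = rng.normal(size=(200, 5))
Z[:, 1:] += 0.6 * Z[:, [0]]                  # correlated indicators
Z = (Z - Z.mean(0)) / Z.std(0)

cov = np.cov(Z, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
w = eigvec[:, -1]                            # loadings of the first component
w = w if w.sum() > 0 else -w                 # fix the arbitrary sign
pca_index = Z @ w                            # PCA-based composite index
eq_index = Z.mean(axis=1)                    # equal-weight aggregate

print(np.round(np.corrcoef(pca_index, eq_index)[0, 1], 2))
```

With strongly correlated indicators the two indices nearly coincide; the divergence grows as correlations weaken, which is exactly the situation the paper warns about.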

#### **4 Conclusions**

The construction of composite indices for measuring multidimensional phenomena is a central issue in data analysis. Researchers cannot solve this question simply by using PCA or related methods, such as Factor Analysis, since these are typically used for a reflective approach.

Reducing dimensionality and constructing composite indicators are two separate issues that are repeatedly confused. Both procedures aim to summarize a set of variables or individual indicators, but reducing dimensionality focuses on extracting the most important information from the data, whereas constructing composite indicators relies on a measurement model that can be reflective or formative. Extracting the most important information from the data translates into summarizing correlated indicators, but correlations can indicate causal, non-causal (spurious) and coincidental relationships, making the principal components meaningless or difficult to interpret. On the contrary, defining a measurement model means assuming a specific direction of causality between the measures (individual indicators) and the latent variable (phenomenon to be measured).

Measuring complex phenomena, such as development or well-being, requires a formative approach, where the index to be constructed does not exist as an independent entity, but it is a composite measure directly determined by a set of non-interchangeable individual indicators or pillars (e.g. the HDI by UNDP).

In such a context, PCA can be recommended for various reasons. Firstly, PCA is a powerful tool for reducing complexity and visualizing data, so that the researcher can identify clusters of units (regions, provinces or countries) that have the same characteristics. Secondly, it allows for comparing empirical dimensions (factors) with theoretical dimensions (pillars), in order to evaluate any differences and to detect possible dimensions that had not previously been taken into account. Lastly, PCA makes it easy to study correlations among many individual indicators in order to find redundant and non-redundant indicators and to assess linkages with other relevant measures, such as GDP. Nevertheless, the use of PCA for constructing formative composite indices is not recommended, since it can give very misleading information about the latent variable of interest, being based exclusively on the covariance structure between the individual indicators.

### **References**

MAZZIOTTA, M., & PARETO, A. 2019. Use and Misuse of PCA for Measuring Well-Being. *Social Indicators Research*, **142**, 451–476.

### **GENDER INEQUALITIES FROM AN INCOME PERSPECTIVE**


Marcella Mazzoleni<sup>1</sup>, Angiola Pollastri<sup>2</sup> and Vanda Tulli<sup>2</sup>

<sup>1</sup> Center on Economic, Social and Cooperation dynamics (CESC), University of Bergamo, (e-mail: marcella.mazzoleni@unibg.it)

<sup>2</sup> Department of Statistics and Quantitative Methods, University of Milano-Bicocca (e-mail: angiola.pollastri@unimib.it, vanda.tulli@unimib.it)

**ABSTRACT**: The difference between females' and males' income is one of the main topics in the analysis of gender gap, as it is known that, even with a higher educational level, females earn less than males do. To inspect this, we analyse and estimate the distribution of the ratio of females' income over males' income using the methodology based on the distribution of the ratio of two Dagum with three parameters. We applied this method to the Bank of Italy Survey on Household Income and Wealth (SHIW) data to evaluate the deciles, the density functions, and the cumulative distribution functions of the ratio of the females' income over males' income in different age classes, Italian areas, and years.

**KEYWORDS**: Dagum distribution, distribution of the ratio of two Dagum random variables, distribution of the ratio of female and male income.

#### **1 Introduction and method**

It is well known that, even with a higher educational level, women earn less than men do. The differences between men's and women's average incomes have been decreasing in recent years, but income parity has not yet been achieved.

The purpose of this paper is to estimate the distribution of the ratio of females' income over males' income. The methodology used to study the ratio is based on the distribution of the ratio of two Dagum random variables with three parameters (Pollastri and Zambruno 2010). The distribution of this ratio, estimated in two different situations, can reveal gender inequality concerning income in different groups or times; accordingly, the distribution of the ratio is analysed with applications to income in different age classes, areas, and times.

The literature offers many examples confirming that the model proposed in 1977 by Camilo Dagum fits many distributions of economic variables very well. Supposing that *X* is a type I Dagum random variable, then *X* ∼ *D*(*a*, *b*, *p*) with *a*, *b*, *p* > 0. The distribution function for *x* > 0 is defined as (Kleiber and Kotz 2003):

$$F_X(x) = \left[1 + \left(\frac{x}{b}\right)^{-a}\right]^{-p}$$

While the density function for *x* > 0 is:

$$f_X(x) = \frac{ap\,x^{ap-1}}{b^{ap}\left[1 + \left(\frac{x}{b}\right)^{a}\right]^{p+1}}$$

The Dagum distribution parameters are estimated using the function *dagum* implemented in the VGAM package in the software R. This function estimates the parameters using the maximum likelihood method described by Kleiber and Kotz (2003). Domanski and Jedrzejczak (1998) showed, through a simulation study, that the estimation performance is good for the shape parameters *a* and *p* when *n* > 2000 or 3000, while for the scale parameter *b* the bias tends to 0 when *n* > 4000.
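The distribution function above inverts in closed form, which gives a quick way to sample from a type I Dagum by inverse transform. The sketch below uses Python rather than the VGAM R function, and the parameter values are purely illustrative, not estimates from income data:

```python
# Sketch of the type I Dagum distribution D(a, b, p): cdf from the formula
# above, closed-form quantile, and inverse-transform sampling
# (parameter values are illustrative only).
import numpy as np

def dagum_cdf(x, a, b, p):
    return (1.0 + (x / b) ** (-a)) ** (-p)

def dagum_quantile(u, a, b, p):
    # invert F(x) = [1 + (x/b)^(-a)]^(-p)
    return b * (u ** (-1.0 / p) - 1.0) ** (-1.0 / a)

rng = np.random.default_rng(0)
a, b, p = 3.5, 20_000.0, 0.6           # shape, scale, shape
sample = dagum_quantile(rng.uniform(size=100_000), a, b, p)

median = dagum_quantile(0.5, a, b, p)
print(round(median, 1), round(float(np.median(sample)), 1))  # should nearly agree
```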

The purpose of this paper is to analyse the ratio:

$$U = \frac{X}{Y}$$

where *X* ∼ *D*(*a*1, *b*1, *p*1) and *Y* ∼ *D*(*a*2, *b*2, *p*2), with *X* and *Y* independent.

Following the definition of the density function of the ratio of two random variables in Mood, Graybill and Boes (1974), applying the independence of *X* and *Y* and the density function of a type I Dagum, it is possible to obtain the density function of the ratio *U*:

$$f_U(u) = \int_0^{+\infty} y \left\{ \frac{a_1 p_1 (uy)^{a_1 p_1 - 1}}{b_1^{a_1 p_1}\left[1 + \left(\frac{uy}{b_1}\right)^{a_1}\right]^{p_1 + 1}} \right\} \times \left\{ \frac{a_2 p_2\, y^{a_2 p_2 - 1}}{b_2^{a_2 p_2}\left[1 + \left(\frac{y}{b_2}\right)^{a_2}\right]^{p_2 + 1}} \right\} dy$$

Using the definition of the cumulative distribution function, it is possible to obtain:

$$F_U(u) = \frac{a_1 p_1 a_2 p_2}{b_1^{a_1 p_1} b_2^{a_2 p_2}} \int_0^u t^{a_1 p_1 - 1} \int_0^{\infty} y^{a_1 p_1 + a_2 p_2 - 1} \left[ 1 + \left( \frac{ty}{b_1} \right)^{a_1} \right]^{-p_1 - 1} \times \left[ 1 + \left( \frac{y}{b_2} \right)^{a_2} \right]^{-p_2 - 1} dy\, dt$$

In Pollastri and Zambruno (2010), a graphical analysis of the performance of this method is presented, comparing the empirical and the computed density functions, where the empirical one is created from the ratios of all possible couples.
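The integral form of f_U(u) above can also be checked numerically; the sketch below evaluates it by quadrature and verifies that it integrates to one (parameter values are illustrative, not estimates from the income data):

```python
# Numerical check of the ratio density above:
#   f_U(u) = integral over y of  y * f_X(u*y) * f_Y(y)
# for independent type I Dagum X and Y (illustrative parameter values).
import numpy as np
from scipy.integrate import quad

def dagum_pdf(x, a, b, p):
    return a * p * x ** (a * p - 1) / (b ** (a * p) * (1 + (x / b) ** a) ** (p + 1))

def ratio_pdf(u, pars_x, pars_y):
    integrand = lambda y: y * dagum_pdf(u * y, *pars_x) * dagum_pdf(y, *pars_y)
    val, _ = quad(integrand, 0, np.inf)
    return val

pars_x, pars_y = (3.0, 1.0, 0.8), (3.5, 1.2, 0.7)
total, _ = quad(lambda u: ratio_pdf(u, pars_x, pars_y), 0, np.inf)
print(round(total, 4))   # the ratio density integrates to about 1
```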

#### **2 Applications to Survey on Household Income and Wealth**

We apply this method to the individual net incomes in 2016 from the Bank of Italy Survey on Household Income and Wealth (SHIW). We compare the ratio of the females' and males' income in different groups:

• males and females divided in three age classes: *young* (age < 40), *adult* (40 ≤ age < 70) and *old* (age ≥ 70)

• males and females divided in three areas: *North, Centre,* and *South and Islands*

• males and females in two different years: 2016 and 1998

The dataset is composed of 11,844 subjects; of these, 50.98% are males and 49.02% are females. Concerning the division by age, 15.21% are aged less than 40 years, 54.41% between 40 and 70 years, and 30.39% 70 years or more. For the division by area, 43.90% of the subjects come from the North, 22.40% from the Centre, and 33.70% from the South and the Islands. The 1998 dataset is composed of 12,616 subjects; of these, 56.52% are males and 43.48% are females.

After estimating the Dagum parameters, we evaluate the cumulative distribution function and the deciles of the ratio of the females' income over the males' income, comparing the ratio distributions for different ages, areas, and times.

Comparing the ratio of the females' income over males' income in different age classes, we observe higher decile values for the young group, lower values for the adult group, and even lower values for the old group. This confirms that incomes at the beginning of the career are similar between the two genders, but the gap rises with age and with the position achieved.

For the deciles of the ratio of the females' income over males' income in different areas, we observe close and higher values for subjects living in the North and Centre of Italy, and lower values for subjects living in the South and Islands. This can be related to the different economic and social situation in the Islands and in the South of Italy.

We observe that the deciles of the ratio of the females' income over males' income are higher in 2016 than in 1998. This confirms that the differences between men's and women's incomes have been decreasing in recent years, but income parity has not yet been achieved.

#### **3 Conclusions**

In this paper we propose to use the ratio of two type I Dagum random variables to analyse the difference between the incomes of females and males. We observe that this method yields interesting conclusions and can be applied to different datasets, comparing also the ratio of females' over males' income in different countries, in order to highlight the differences concerning the gender gap.

This method is used to analyse the Italian situation and to compare the ratio of females' over males' income across ages, areas, and times. In the applications we observe less diversity between females' and males' income in the young group, but the diversity rises with the subjects' age, passing from the young to the adult, and from the adult to the old group. In the division by areas, the deciles of the ratio of females' income over males' income for the North and Centre are close, while for the South and Islands a wider difference between genders is observed. The difference between males' and females' income is decreasing over the years; in fact, the deciles of the ratio of females' income over males' income are higher for the 2016 dataset than for the 1998 dataset.

#### **References**

ALLEVA, G. 2017. Indagine conoscitiva sulle politiche in materia di parità tra uomini e donne. Istat, Rome.

BANK OF ITALY. 1998. I bilanci delle famiglie italiane nell'anno 1998. *Supplementi al Bollettino Statistico*, XVII, Centro Stampa Banca d'Italia, Roma.

BANK OF ITALY. 2016. I bilanci delle famiglie italiane nell'anno 2016. *Supplementi al Bollettino Statistico*, XVII, Centro Stampa Banca d'Italia, Roma.

DAGUM, C. 1977. A new model of personal income distribution: specification and estimation. *Economie Appliquée*, **30**(3), 413–437.

DAGUM, C. 1990. Generation and properties of income distribution functions. In *Studies in Contemporary Economics. Income and Wealth Distribution, Inequality and Poverty*, C. Dagum, M. Zenga (Eds.), Springer, Berlin.

DOMANSKI, C., & JEDRZEJCZAK, A. 1998. Maximum likelihood estimation of the Dagum model parameters. *International Advances in Economic Research*, **4**, 243–252.

KLEIBER, C., & KOTZ, S. 2003. *Statistical Size Distributions in Economics and Actuarial Sciences*. Hoboken, NJ: Wiley-Interscience.

MOOD, A., GRAYBILL, F., & BOES, D. 1974. *Introduction to the Theory of Statistics*. Wiley, New York.

POLLASTRI, A., & ZAMBRUNO, G. 2010. Distribution of the ratio of two independent Dagum random variables. *Operations Research and Decisions*, **3**(20), 95–102.

YEE, T. 2019. VGAM: Vector Generalized Linear and Additive Models. R package.


### TRANSFORMATION MIXTURE MODELING FOR SKEWED DATA GROUPS WITH HEAVY TAILS AND SCATTER


Yana Melnykov1, Xuwen Zhu1 and Volodymyr Melnykov1

<sup>1</sup> The University of Alabama, (e-mail: ymelnykov@cba.ua.edu, xzhu20@cba.ua.edu, vmelnykov@cba.ua.edu)

ABSTRACT: For decades, Gaussian mixture models have been the most popular mixtures in the literature. However, the adequacy of the fit provided by Gaussian components is often in question. Various distributions capable of modeling skewness or heavy tails have been considered in this context recently. In this paper, we propose a novel contaminated transformation mixture model that is constructed based on the idea of transformation to symmetry and can account for skewness, heavy tails, and automatically assign scatter to secondary components.

KEYWORDS: finite mixture model, cluster analysis, transformation to normality, symmetry
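The core idea of the abstract, transforming skewed data toward symmetry before fitting Gaussian components, can be illustrated with a simple Box-Cox power transform. This is only a sketch of the general principle, not the authors' contaminated transformation mixture; `boxcox`, `neg_profile_loglik`, and the grid search are hypothetical helper code.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox power transform, mapping positive skewed data toward symmetry."""
    if abs(lam) < 1e-8:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def neg_profile_loglik(y, lam):
    """Negative profile log-likelihood of the Box-Cox-to-Gaussian model
    (Gaussian variance term plus the log-Jacobian of the transformation)."""
    z = boxcox(y, lam)
    return 0.5 * len(y) * np.log(np.var(z)) - (lam - 1.0) * np.log(y).sum()

def skewness(x):
    x = x - x.mean()
    return (x ** 3).mean() / (x ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.7, size=2000)   # right-skewed sample

grid = np.linspace(-1.0, 2.0, 61)                   # crude grid search for lambda
lam_hat = grid[np.argmin([neg_profile_loglik(y, l) for l in grid])]
z = boxcox(y, lam_hat)
print(lam_hat, skewness(y), skewness(z))            # transformed data are near-symmetric
```

In a mixture setting, a transformation parameter of this kind would be estimated per component jointly with the Gaussian parameters; the sketch only shows the transformation-to-symmetry step on a single group.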

### UNCONDITIONAL M-QUANTILE REGRESSION

Luca Merlo1, Lea Petrella2 and Nikos Tzavidis3

<sup>1</sup> Department of Statistics, Sapienza University of Rome, (e-mail: luca.merlo@uniroma1.it)

<sup>2</sup> MEMOTEF Department, Sapienza University of Rome, (e-mail: lea.petrella@uniroma1.it)

<sup>3</sup> Department of Social Statistics and Demography and Southampton Statistical Sciences Research Institute, University of Southampton, (e-mail: N.TZAVIDIS@soton.ac.uk)

ABSTRACT: In this paper we develop the unconditional M-quantile regression for modeling unconditional M-quantiles in the presence of covariates. Extending the paper by Firpo *et al.* (2009), we assess the impact of small changes in the explanatory variables on the M-quantile of the unconditional distribution of the dependent variable by running a mean regression of the recentered influence function of the unconditional M-quantile on the covariates. The proposed methodology is applied to the Survey of Household Income and Wealth (SHIW) 2016 conducted by the Bank of Italy.

KEYWORDS: Influence function, M-estimation, RIF regression, Robust method

### 1 Introduction


Quantile Regression (QR), as proposed by Koenker & Bassett Jr (1978), has proven to be a powerful tool to explore conditional distributions in many empirical applications. However, if one is interested in how the whole unconditional distribution of the dependent variable responds to changes in the covariates, using the well-known QR would yield misleading inferences (see Firpo *et al.* 2009 and Borah & Basu 2013). Motivated by this interest, Firpo *et al.* (2009) proposed the Unconditional Quantile Regression (UQR) approach for modeling unconditional quantiles of a dependent variable as a function of the explanatory variables. This method builds upon the concept of the Recentered Influence Function (RIF), which originates from a widely used tool in robust statistics, namely the Influence Function (IF) discussed in Hampel *et al.* (2011). The RIF of a distributional statistic ν is obtained by adding the statistic back to the IF, and it can be thought of as the contribution of an individual observation to ν. In the regression framework where covariates are available, Firpo *et al.* (2009) proposed to replace the dependent variable with the RIF to model the

unconditional quantiles of the response and evaluate the effect of changes in the law of the covariates on unconditional quantiles. When the interest of the research is concentrated on the entire distribution of a response variable, in addition to the classical QR, a possible alternative is represented by the M-quantile regression (MQR) approach proposed by Breckling & Chambers (1988). This method provides a "quantile-like" generalization of the mean regression based on influence functions, combining in a common framework the robustness and efficiency properties of quantiles and expectiles (Newey & Powell 1987), respectively.
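The "recentering" property of the RIF is easy to verify numerically for an ordinary quantile: the RIF averages back to the statistic itself. The sketch below uses the standard quantile RIF, $RIF(y; q_\tau) = q_\tau + (\tau - \mathbf{1}\{y \le q_\tau\})/f_Y(q_\tau)$, with the density estimated by a Gaussian kernel; `rif_quantile` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def rif_quantile(y, tau):
    """RIF of the tau-th unconditional quantile (hypothetical helper):
    RIF(y; q_tau) = q_tau + (tau - 1{y <= q_tau}) / f_Y(q_tau),
    with f_Y(q_tau) estimated by a Gaussian kernel density."""
    q = np.quantile(y, tau)
    h = 1.06 * y.std() * len(y) ** (-1 / 5)   # Silverman's rule-of-thumb bandwidth
    f_q = np.exp(-0.5 * ((y - q) / h) ** 2).mean() / (h * np.sqrt(2 * np.pi))
    return q + (tau - (y <= q)) / f_q, q

rng = np.random.default_rng(1)
y = rng.normal(size=5000)
rif, q = rif_quantile(y, 0.9)
print(q, rif.mean())   # "recentered": the RIF averages back to the quantile itself
```

Regressing this RIF on covariates, rather than the outcome itself, is what turns a mean regression into an unconditional quantile regression.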

In this article, we extend the UQR of Firpo *et al.* (2009) to the M-quantile regression framework. We develop the Unconditional M-quantile Regression (UMQR) to model the M-quantiles of the unconditional distribution of the response variable. In order to analyze how the entire unconditional distribution of the outcome is affected by changes in the distribution of explanatory variables, we regress the RIF of the unconditional M-quantile on the covariates and denote such effect as Unconditional M-Quantile Partial Effect (UMQPE).

#### 2 Methodology

Let $Y$ denote a scalar random variable with absolutely continuous distribution function $F_Y$. The M-quantile of order $\tau \in (0,1)$ of $Y$ is defined as the solution, $\theta_{\tau} \in \mathbb{R}$, of the following estimating equation:

$$
\int \psi_{\tau}(y - \theta_{\tau})\, dF_{Y}(y) = 0, \tag{1}
$$


where $\psi_{\tau}(u) = |\tau - \mathbf{1}(u<0)|\,\psi(u/\sigma_{\tau})$, with $\psi$ being the first derivative of a convex loss function $\rho$ and $\sigma_{\tau}$ a suitable scale parameter. In this work, we consider the well-known Huber influence function (Huber, 1964):

$$\psi(u) = u\,\mathbf{1}_{(|u| \le c)} + c\,\operatorname{sign}(u)\,\mathbf{1}_{(|u| > c)},\tag{2}$$

where $c$ denotes a tuning constant bounded away from zero that can be used to trade robustness for efficiency in the model fit. In particular, M-quantiles nicely include quantiles when $c \to 0$, $\psi(u) = \operatorname{sign}(u)$, and expectiles when $c \to \infty$, $\psi(u) = u$.
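The two limiting cases of the Huber function in (2) can be checked numerically. Note that for $c \to 0$ the sign function is recovered only after rescaling $\psi$ by $c$; `huber_psi` is a hypothetical helper name.

```python
import numpy as np

def huber_psi(u, c):
    """Huber influence function of eq. (2): identity inside [-c, c], clipped outside."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, u, c * np.sign(u))

u = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

psi_expectile = huber_psi(u, 100.0)        # c large: psi(u) = u (expectile limit)
psi_quantile = huber_psi(u, 1e-6) / 1e-6   # c small: psi(u)/c = sign(u) (quantile limit)
print(psi_expectile, psi_quantile)
```

Intermediate values of $c$, such as the classical 1.345, interpolate between the robustness of quantiles and the efficiency of expectiles.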

To build the UMQR model, it follows from Firpo *et al.* (2009) and Hampel *et al.* (2011) that the RIF of the M-quantile θτ is defined as:

$$RIF(y;\theta_{\tau}) = \theta_{\tau} + IF(y;\theta_{\tau}) = \theta_{\tau} + \frac{\psi_{\tau}(y-\theta_{\tau})}{\int \psi_{\tau}'(y-\theta_{\tau})\, dF_{Y}(y)},\tag{3}$$

where $IF(y;\theta_{\tau})$ is the IF of $\theta_{\tau}$ and $\psi'(u) = \mathbf{1}_{(|u|<c)}$ is the derivative of $\psi$ in (2). In a regression framework when covariates $\mathbf{X} \subset \mathbb{R}^{k}$ are available, from (3) we define the UMQR model as follows:

$$\mathbb{E}[RIF(Y;\theta_{\tau}) \mid \mathbf{X} = \mathbf{x}] = \theta_{\tau} + \mathbb{E}\left[\frac{\psi_{\tau}(Y-\theta_{\tau})}{\int \psi_{\tau}'(y-\theta_{\tau})\, dF_{Y}(y)} \,\Big|\, \mathbf{X} = \mathbf{x}\right].\tag{4}$$

Our objective is to identify how small changes in the distribution of $\mathbf{X}$ affect the M-quantile of the unconditional distribution of $Y$. From (4) and Firpo *et al.* (2009), the unconditional effect on the τ-th M-quantile, which we denote as the Unconditional M-quantile Partial Effect (UMQPE), $\alpha_{\tau}$, is formally defined as:

$$\alpha_{\tau} = \int \frac{d\,\mathbb{E}[RIF(Y;\theta_{\tau}) \mid \mathbf{X} = \mathbf{x}]}{d\mathbf{x}}\, dF_{\mathbf{X}}(\mathbf{x}) = \frac{1}{s_{\tau}} \int \frac{d\,\mathbb{E}[\psi_{\tau}(Y-\theta_{\tau}) \mid \mathbf{X} = \mathbf{x}]}{d\mathbf{x}}\, dF_{\mathbf{X}}(\mathbf{x}),\tag{5}$$

where $F_{\mathbf{X}}$ is the distribution function of $\mathbf{X}$ and $s_{\tau} = \int \psi_{\tau}'(y - \theta_{\tau})\, dF_{Y}(y)$. As suggested by Firpo *et al.* (2009), we can estimate $\alpha_{\tau}$ in (5) via a mean regression of $RIF(Y;\theta_{\tau})$, as dependent variable, onto $\mathbf{X}$ by using a two-step procedure. Specifically, an estimate $\hat{\theta}_{\tau}$ of $\theta_{\tau}$ is obtained by solving (1) via Iteratively Reweighted Least Squares; $\hat{\theta}_{\tau}$ is then substituted into (3), and the resulting $RIF(Y;\hat{\theta}_{\tau})$ is regressed on $\mathbf{X}$.
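A minimal numerical sketch of this two-step procedure follows. It is not the authors' implementation: `mquantile` and `umqr_fit` are hypothetical helpers, the scale $\sigma_{\tau}$ is crudely fixed via the MAD, and the data are simulated from a location-shift model, for which the unconditional effect should match the conditional slope.

```python
import numpy as np

def mquantile(y, tau, c=1.345, tol=1e-8, max_iter=500):
    """Unconditional M-quantile of order tau solving eq. (1) by IRLS (Huber psi)."""
    theta = np.quantile(y, tau)
    s = np.median(np.abs(y - np.median(y))) / 0.6745        # MAD-based scale sigma_tau
    for _ in range(max_iter):
        u = (y - theta) / s
        w_huber = np.minimum(1.0, c / np.abs(u).clip(1e-12))  # Huber IRLS weights
        w = np.abs(tau - (u < 0)) * w_huber                   # asymmetric tilting
        theta_new = (w * y).sum() / w.sum()
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta, s

def umqr_fit(y, X, tau, c=1.345):
    """Two-step UMQR sketch: build RIF(y; theta_tau) as in eq. (3), then OLS on X."""
    theta, s = mquantile(y, tau, c)
    u = (y - theta) / s
    w_a = np.abs(tau - (u < 0))
    psi_tau = w_a * np.where(np.abs(u) <= c, u, c * np.sign(u))
    s_tau = (w_a * (np.abs(u) < c)).mean() / s               # empirical int psi'_tau dF_Y
    rif = theta + psi_tau / s_tau
    Z = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Z, rif, rcond=None)[0]            # [intercept, alpha_tau]

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)                       # location-shift model
coef = umqr_fit(y, x, tau=0.5)
print(coef)
```

In this design the UMQPE coincides with the true slope at every order, so the fitted coefficient should sit near 2 and the intercept near 1.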

#### 3 Application


We investigate the effect of economic and socio-demographic characteristics on Italian households' log-consumption using data from the SHIW 2016. We fit the UMQR at different points of the unconditional distribution of the response and compare the results with standard conditional M-quantile regressions. The tuning constant $c$ in (2) has been set to 1.345 and to 100; in the second case, we obtain the Unconditional Expectile Regression (UER). The results in Table 1 highlight that the impact of income, gender, age and education is very different on the conditional and unconditional distributions of consumption, especially in the tails. This demonstrates the ability of the UMQR to extend mean regression for estimating the effect of covariates, not only at the center, but also at different parts of the unconditional distribution of interest.

**Table 1.** *M-quantile and Expectile regression results at* τ = (0.1,0.5,0.9)*. Parameter estimates are displayed in boldface when significant at the 5% level; standard errors are reported in parentheses.*

| Variable | MQR 0.1 | MQR 0.5 | MQR 0.9 | UMQR 0.1 | UMQR 0.5 | UMQR 0.9 | ER 0.1 | ER 0.5 | ER 0.9 | UER 0.1 | UER 0.5 | UER 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Log-Income | 0.570 | 0.595 | 0.442 | 0.447 | 0.391 | 0.429 | 0.483 | 0.413 | 0.263 | 0.450 | 0.413 | 0.436 |
| | (0.011) | (0.007) | (0.010) | (0.038) | (0.032) | (0.038) | (0.011) | (0.008) | (0.011) | (0.038) | (0.033) | (0.038) |
| Gender | −0.019 | −0.011 | −0.043 | −0.011 | −0.024 | −0.038 | −0.023 | −0.026 | −0.046 | −0.010 | −0.026 | −0.035 |
| | (0.016) | (0.009) | (0.014) | (0.018) | (0.012) | (0.018) | (0.016) | (0.011) | (0.016) | (0.017) | (0.012) | (0.018) |
| Age | −0.002 | 0.001 | 0.004 | −0.013 | 0.006 | 0.013 | −0.001 | 0.004 | 0.008 | −0.011 | 0.004 | 0.011 |
| | (0.003) | (0.002) | (0.003) | (0.003) | (0.002) | (0.003) | (0.003) | (0.002) | (0.003) | (0.003) | (0.002) | (0.003) |
| *Marital status* | | | | | | | | | | | | |
| never married | −0.062 | −0.084 | −0.164 | −0.094 | −0.141 | −0.187 | −0.095 | −0.138 | −0.201 | −0.101 | −0.138 | −0.176 |
| | (0.020) | (0.012) | (0.018) | (0.025) | (0.017) | (0.022) | (0.020) | (0.014) | (0.020) | (0.024) | (0.017) | (0.022) |
| separated | −0.066 | −0.056 | −0.127 | −0.102 | −0.151 | −0.155 | −0.111 | −0.137 | −0.207 | −0.105 | −0.137 | −0.141 |
| | (0.025) | (0.015) | (0.022) | (0.034) | (0.024) | (0.030) | (0.025) | (0.017) | (0.026) | (0.033) | (0.024) | (0.030) |
| widowed | −0.040 | −0.063 | −0.119 | −0.116 | −0.136 | −0.111 | −0.074 | −0.123 | −0.193 | −0.110 | −0.123 | −0.107 |
| | (0.022) | (0.013) | (0.020) | (0.029) | (0.019) | (0.025) | (0.022) | (0.015) | (0.022) | (0.028) | (0.019) | (0.025) |
| *Education level* | | | | | | | | | | | | |
| elementary school | 0.175 | 0.120 | 0.151 | 0.488 | 0.125 | −0.037 | 0.188 | 0.161 | 0.187 | 0.446 | 0.161 | −0.000 |
| | (0.039) | (0.023) | (0.035) | (0.069) | (0.024) | (0.022) | (0.039) | (0.027) | (0.040) | (0.066) | (0.027) | (0.022) |
| middle school | 0.240 | 0.203 | 0.316 | 0.645 | 0.269 | 0.060 | 0.281 | 0.294 | 0.398 | 0.590 | 0.294 | 0.094 |
| | (0.041) | (0.024) | (0.037) | (0.070) | (0.028) | (0.029) | (0.041) | (0.028) | (0.042) | (0.067) | (0.030) | (0.028) |
| high school | 0.248 | 0.235 | 0.383 | 0.652 | 0.355 | 0.147 | 0.313 | 0.363 | 0.500 | 0.598 | 0.363 | 0.168 |
| | (0.042) | (0.025) | (0.038) | (0.072) | (0.033) | (0.037) | (0.042) | (0.029) | (0.043) | (0.069) | (0.034) | (0.036) |
| university | 0.298 | 0.297 | 0.521 | 0.631 | 0.440 | 0.506 | 0.391 | 0.484 | 0.705 | 0.608 | 0.484 | 0.515 |
| | (0.045) | (0.027) | (0.040) | (0.076) | (0.040) | (0.053) | (0.045) | (0.031) | (0.046) | (0.073) | (0.042) | (0.052) |
| *Employment status* | | | | | | | | | | | | |
| self-employed | −0.087 | 0.010 | 0.083 | −0.060 | 0.021 | 0.121 | −0.058 | 0.023 | 0.081 | −0.046 | 0.023 | 0.107 |
| | (0.024) | (0.014) | (0.022) | (0.021) | (0.019) | (0.038) | (0.024) | (0.017) | (0.025) | (0.020) | (0.018) | (0.037) |
| not-employed | 0.008 | 0.027 | 0.035 | −0.046 | 0.037 | 0.037 | −0.002 | 0.014 | 0.017 | −0.052 | 0.014 | 0.031 |
| | (0.021) | (0.013) | (0.019) | (0.025) | (0.016) | (0.025) | (0.021) | (0.015) | (0.022) | (0.024) | (0.015) | (0.024) |

#### References

BORAH, BIJAN J, & BASU, ANIRBAN. 2013. Highlighting differences between conditional and unconditional quantile regression approaches through an application to assess medication adherence. *Health Economics*, 22(9), 1052–1070.

BRECKLING, JENS, & CHAMBERS, RAY. 1988. M-quantiles. *Biometrika*, 75(4), 761–771.

FIRPO, SERGIO, FORTIN, NICOLE M, & LEMIEUX, THOMAS. 2009. Unconditional quantile regressions. *Econometrica: Journal of the Econometric Society*, 77(3), 953–973.

HAMPEL, FRANK R, RONCHETTI, ELVEZIO M, ROUSSEEUW, PETER J, & STAHEL, WERNER A. 2011. *Robust statistics: the approach based on influence functions*. Vol. 196. John Wiley & Sons.

HUBER, PETER J. 1964. Robust Estimation of a Location Parameter. *Annals of Mathematical Statistics*, 35(1), 73–101.

KOENKER, ROGER, & BASSETT JR, GILBERT. 1978. Regression quantiles. *Econometrica: Journal of the Econometric Society*, 33–50.

NEWEY, WHITNEY K, & POWELL, JAMES L. 1987. Asymmetric least squares estimation and testing. *Econometrica*, 819–847.

### MCMC COMPUTATIONS FOR BAYESIAN MIXTURE MODELS USING REPULSIVE POINT PROCESSES

Jesper Møller1, Mario Beraha2, Raffaele Argiento3 and Alessandra Guglielmi2

<sup>1</sup> University of Aalborg, Department of Mathematics, Aalborg (Denmark), (e-mail: jm@math.aau.dk)

<sup>2</sup> Department of Mathematics, Politecnico di Milano, Milano (Italy)

<sup>3</sup> Università Cattolica del Sacro Cuore, Department of Statistical Sciences, Milano (Italy)

ABSTRACT: Repulsive mixture models have recently gained popularity for Bayesian cluster detection. Compared to more traditional mixture models, repulsive mixture models produce a smaller number of well-separated clusters. The most commonly used methods for posterior inference either require fixing the number of components a priori or are based on reversible jump MCMC computation. We present a general framework for mixture models, when the prior of the 'cluster centres' is a finite repulsive point process depending on a hyperparameter, specified by a density which may depend on an intractable normalizing constant. By investigating the posterior characterization of this class of mixture models, we derive an MCMC algorithm which avoids the well-known difficulties associated with reversible jump MCMC computation. In particular, we use an ancillary variable method, which eliminates the problem of having intractable normalizing constants in the Hastings ratio. The ancillary variable method relies on a perfect simulation algorithm, and we demonstrate this is fast because the number of components is typically small. In several simulation studies and an application on sociological data, we illustrate the advantage of our new methodology over existing methods, and we compare the use of a determinantal or a repulsive Gibbs point process prior model.

KEYWORDS: birth-death Metropolis Hastings algorithm, cluster estimation, pairwise interaction point process, intractable normalizing constant, normalized infinitely divisible distribution, perfect simulation.

### INFINITE MIXTURES OF INFINITE FACTOR ANALYSERS

Keefe Murphy 1, Cinzia Viroli2 and I. Claire Gormley3

<sup>1</sup> Department of Mathematics and Statistics, Maynooth University (e-mail: keefe.murphy@mu.ie)

<sup>2</sup> Department of Statistical Sciences, University of Bologna (e-mail: cinzia.viroli@unibo.it)

<sup>3</sup> School of Mathematics and Statistics, University College Dublin (e-mail: claire.gormley@ucd.ie)

ABSTRACT: Factor-analytic Gaussian mixtures are often employed as a model-based approach to clustering high-dimensional data. Typically, the numbers of clusters and latent factors must be fixed in advance of model fitting. The pair which optimises some model selection criterion is then chosen. For computational reasons, having the number of factors differ across clusters is rarely considered.

Here the infinite mixture of infinite factor analysers (IMIFA) model is introduced. IMIFA employs a Pitman-Yor process prior to facilitate automatic inference of the number of clusters using the stick-breaking construction and a slice sampler. Automatic inference of the cluster-specific numbers of factors is achieved using multiplicative gamma process shrinkage priors and an adaptive Gibbs sampler. IMIFA is presented as the flagship of a family of factor-analytic mixtures.

Applications to benchmark data, metabolomic spectral data, and a handwritten digit example illustrate the IMIFA model's advantageous features. These include obviating the need for model selection criteria, reducing the computational burden associated with the search of the model space, improving clustering performance by allowing cluster-specific numbers of factors, and uncertainty quantification.

KEYWORDS: model-based clustering, factor analysis, Pitman-Yor process, multiplicative gamma process, adaptive Markov chain Monte Carlo.

#### References

MURPHY, K., VIROLI, C., & GORMLEY, I. C. 2020. Infinite mixtures of infinite factor analysers. *Bayesian Analysis*, 15(3), 937–963.
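The two prior ingredients that drive IMIFA's automatic inference, Pitman-Yor stick-breaking for the mixture weights and the multiplicative gamma process (MGP) for the loadings precisions, can each be sampled in a few lines. This is an illustrative sketch only (hypothetical helper names, arbitrary hyperparameter values), not the adaptive samplers of Murphy *et al.* (2020).

```python
import numpy as np

rng = np.random.default_rng(7)

def pitman_yor_weights(n_sticks, discount=0.25, strength=1.0):
    """Truncated stick-breaking construction of Pitman-Yor mixture weights:
    V_k ~ Beta(1 - discount, strength + k * discount), w_k = V_k * prod_{j<k}(1 - V_j)."""
    k = np.arange(1, n_sticks + 1)
    v = rng.beta(1.0 - discount, strength + k * discount)
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])

def mgp_precisions(n_factors, a1=2.0, a2=3.0):
    """Multiplicative gamma process: tau_h = prod_{l<=h} delta_l, so the precision of
    the h-th loadings column stochastically increases, shrinking later factors."""
    deltas = np.concatenate([rng.gamma(a1, 1.0, size=1),
                             rng.gamma(a2, 1.0, size=n_factors - 1)])
    return np.cumprod(deltas)

w = pitman_yor_weights(50)      # mixture weights over (at most) 50 clusters
tau = mgp_precisions(10)        # shrinkage precisions for 10 latent factors
print(w.sum(), tau)
```

In the full model a slice sampler decides how many sticks are actually needed, and an adaptive Gibbs step truncates the factor columns whose MGP precisions have grown large.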

### **ANGULAR HALFSPACE DEPTH: COMPUTATION**\*

Stanislav Nagy1, Petra Laketa1 and Rainer Dyckerhoff2

<sup>1</sup> Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic (e-mail: nagy@karlin.mff.cuni.cz, laketa@karlin.mff.cuni.cz)

<sup>2</sup> Institute of Econometrics and Statistics, University of Cologne, Köln, Germany (e-mail: rainer.dyckerhoff@statistik.uni-koeln.de)

**ABSTRACT**: The angular halfspace depth is a nonparametric tool for the analysis of directional data. That depth was proposed already in 1987, but its widespread use has been hampered in practice by significant computational issues. We address these problems by considering a simple projection scheme that allows reducing the computation of the angular depth to the task of evaluating a variant of the usual halfspace depth in a linear space. Efficient algorithms for exact computation and approximation of the angular halfspace depth are thus developed.

**KEYWORDS**: angular depth, computation, directional data analysis, projection.

#### **1 Angular halfspace depth**


Nonparametric analysis of data living in non-linear spaces is an exciting and largely unexplored field of statistics. *Statistical depths* generalize quantiles, ranks, and orderings to multivariate and non-Euclidean data, by evaluating "centrality", or the depth, of points with respect to a probability measure.

We consider directional data (Ley & Verdebout, 2017), that is, observations naturally residing on the unit sphere $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^{d} : \|x\| = 1\}$ of the Euclidean space $\mathbb{R}^{d}$. For directional data, the *angular halfspace depth* was first introduced by Small, 1987, and later substantially elaborated on by Liu & Singh, 1992. Just as many other depths, the angular halfspace depth is, however, difficult to compute, and no efficient algorithms for its computation are available in dimensions $d > 2$. We use the gnomonic projection of $\mathbb{S}^{d-1}$ to reduce this problem to the computation of the usual halfspace depth in linear spaces $\mathbb{R}^{d-1}$, with respect to signed measures. This connection opens new possibilities for the construction of efficient computational tools for directional data.

\*This work was supported by the grant 19-16097Y of the Czech Science Foundation, and by the PRIMUS/17/SCI/3 project of Charles University. P. Laketa was supported by the OP RDE project "International mobility of research, technical and administrative staff at the Charles University" CZ.02.2.69/0.0/0.0/18 053/0016976.

#### **2 The depth on spheres and gnomonic projection**

The angular halfspace depth is a mapping that to each point on a sphere assigns the smallest probability of a hemisphere that contains that point. More precisely, denote by $\mathcal{H}_{0}$ the collection of all closed halfspaces in $\mathbb{R}^{d}$ whose boundary passes through the origin in $\mathbb{R}^{d}$. For a Borel probability measure $P$ on $\mathbb{S}^{d-1}$, the angular halfspace depth of $x \in \mathbb{R}^{d}$ with respect to $P$ is defined as

$$
\mathrm{ahD}(x; P) = \inf\{P(H) : H \in \mathcal{H}_{0} \text{ and } x \in H\}.\tag{1}
$$

In this short paper we assume for simplicity that *P* is absolutely continuous with respect to the spherical Lebesgue measure.\* For *ed* = (0,...,0,1) we denote by

$$\mathbb{S}_{+}^{d-1} = \{x \in \mathbb{S}^{d-1} : \langle x, e_{d} \rangle > 0\}, \quad \mathbb{S}_{-}^{d-1} = \{x \in \mathbb{S}^{d-1} : \langle x, e_{d} \rangle < 0\},$$

the northern and the southern hemispheres of $\mathbb{S}^{d-1}$, respectively. We write

$$G = \{x \in \mathbb{R}^{d} : \langle x, e_{d} \rangle = 1\}$$

for the "horizontal" hyperplane that touches $\mathbb{S}^{d-1}$ at $e_{d}$.

We consider the *gnomonic projection* of $\mathbb{S}^{d-1}$ to $G$, that is, a mapping that to each $x \in \mathbb{S}^{d-1}_{+}$ assigns a point $\pi(x) = x/\langle x, e_{d} \rangle$ from the hyperplane $G$. For $x \in \mathbb{S}^{d-1}_{-}$ we define $\pi(x) = \pi(-x)$; the mapping remains undefined if $\langle x, e_{d} \rangle = 0$. In the left panel of Figure 1 we present $\pi$ in the plane $\mathbb{R}^{2}$: two points, one from the northern ($n$) and one from the southern ($s$) halfcircle of $\mathbb{S}^{1}$, are shown together with their gnomonic images. A closed halfplane $H \in \mathcal{H}_{0}$ contains both $n$ and $s$. The intersection $H \cap G$ is a closed halfline in $G$, displayed as a thick line. One observes that $\pi(n) \in G \cap H$, while $\pi(s) \notin G \cap H$. A similar illustration with the sphere $\mathbb{S}^{2}$, the plane $G$, and a halfspace from $\mathcal{H}_{0}$ in $\mathbb{R}^{3}$ is visualised in the right panel of Figure 1.

The gnomonic projection satisfies an important property: for any $H \in \mathcal{H}_{0}$ it holds true that

$$\pi\left(H\cap\mathbb{S}\_{+}^{d-1}\right) = H\cap G, \quad \pi\left(H\cap\mathbb{S}\_{-}^{d-1}\right) = G\backslash\text{int}(H),\tag{2}$$

where $\operatorname{int}(H)$ is the interior of $H$. We define a signed measure $P_{\pm}$ on $G$ by

$$P\_{\pm} \left( H \cap G \right) = P \left( H \cap \mathbb{S}\_{+}^{d-1} \right) - P \left( \mathbb{S}\_{-}^{d-1} \backslash H \right). \tag{3}$$
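Property (2), on which the construction of $P_{\pm}$ rests, can be checked numerically: membership in a halfspace through the origin is preserved by the projection on the northern hemisphere and reversed (up to the boundary) on the southern one. The following is an illustrative sketch with hypothetical helper names, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
e_d = np.zeros(d)
e_d[-1] = 1.0

def gnomonic(x):
    """Gnomonic projection onto G = {z : <z, e_d> = 1}; antipodal points identified."""
    t = x @ e_d
    if t < 0:                        # southern hemisphere: project -x instead
        x, t = -x, -t
    return x / t                     # undefined on the equator (t == 0)

# random directions on the sphere and one random halfspace from H_0
X = rng.normal(size=(2000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
h = rng.normal(size=d)               # inward normal of H = {z : <z, h> >= 0}

hits = checked = 0
for x in X:
    if abs(x @ e_d) < 1e-9:
        continue                     # skip the (measure-zero) equator
    checked += 1
    in_H = (x @ h) >= 0
    proj_in_H = (gnomonic(x) @ h) >= 0
    north = (x @ e_d) > 0
    # property (2): preserved in the north, reversed in the south
    hits += (in_H == proj_in_H) if north else (in_H != proj_in_H)
print(checked, hits)
```

The southern-hemisphere reversal is exactly why $P_{\pm}$ in (3) subtracts, rather than adds, the mass of $\mathbb{S}^{d-1}_{-} \backslash H$.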

\*This assumption is not made without loss of generality; the general theory is technical and much more delicate. It will be presented elsewhere.

**Figure 1.** *Left: a halfplane $H$ (coloured) that contains two points: $n$ from the northern and $s$ from the southern halfcircle. We see that $\pi(n) \in H \cap G$, while $\pi(s) \notin H \cap G$. Right: analogous illustration for $d = 3$, with a halfspace from $\mathcal{H}_{0}$ and the plane $G$.*

The Cramér-Wold theorem asserts that $P_{\pm}$ is well defined. Due to the assumption of $P$ being absolutely continuous, (2) and (3) imply that

$$P(H) = P\left(H \cap \mathbb{S}\_{+}^{d-1}\right) + P\left(H \cap \mathbb{S}\_{-}^{d-1}\right) = P\left(\mathbb{S}\_{-}^{d-1}\right) + P\_{\pm}\left(H \cap G\right). \tag{4}$$

Equation (4) relates the probability of a halfspace $H \in \mathcal{H}_0$ to the value of the signed measure of its projection $H \cap G$ in $G$. Note that $H \cap G$ is a closed halfspace of $G$, unless $H$ is orthogonal to $e_d$. We denote by $\mathcal{H}$ the collection of all closed halfspaces of $G$. From (4) it is straightforward to see that

$$ahD\left(\mathbf{x}; P\right) = P\left(\mathbb{S}\_{-}^{d-1}\right) + \inf\left\{P\_{\pm}\left(H\right) : H \in \mathcal{H} \text{ and } \mathbf{x} \in H\right\},\tag{5}$$

for any $x \in \mathbb{S}^{d-1}_{+}$. This formula connects the angular halfspace depth with the usual halfspace depth in linear spaces, $hD(x; Q)$, defined for a point $x \in \mathbb{R}^{d-1}$ with respect to a given probability measure $Q$ on $\mathbb{R}^{d-1}$ as the infimum of $Q(H)$ over all closed halfspaces of $\mathbb{R}^{d-1}$ that contain $x$.

The last term in (5) may be considered as the usual halfspace depth of the *signed measure* $P_\pm$ on $\mathbb{R}^{d-1}$. This connection is at the core of our approach: it opens the way to utilizing the highly developed algorithms for the usual halfspace depth and applying them to the analysis of directional data. The main difference is that, for a probability measure $Q$, one may restrict attention to those halfspaces that contain $x$ on their boundary when computing $hD(x; Q)$, which simplifies the computation substantially. The same is, however, not the case for signed measures, where all halfspaces containing $x$ must be considered. At a slight increase in computational complexity, it is nonetheless possible to adapt the existing algorithms to resolve this issue.


#### **3 Computation: An example**

The depth $hD(x; P)$ can be written as an infimum of one-dimensional halfspace depths $hD(\langle x, u\rangle; P_u)$ of $x \in \mathbb{R}^d$ with respect to the projections $P_u$ of $P$ onto the lines given by all directions $u \in \mathbb{S}^{d-1}$. A standard approximation of $hD$ then consists of computing the minimum of $hD(\langle x, u\rangle; P_u)$ over a collection $U \subset \mathbb{S}^{d-1}$ of randomly chosen directions $u \in U$. The halfspace depth of a signed measure $P_\pm$ has the same projection property, which may be used to compute (5). This rather naive approximate algorithm is extremely simple, but it allows us to consider $ahD$ also in dimensions $d > 3$.
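The random-projection scheme just described can be sketched as follows. This is our own minimal illustration of formula (5), not the authors' implementation; the function names, the assumption of no equatorial observations and no tied projections, and the plain scan over random directions are ours. The one subtlety, noted above, is that for a signed measure the minimising threshold need not sit on a data point containing $x$ on the boundary, so all halfspaces containing $x$ must be scanned:

```python
import numpy as np

def _min_signed_mass(p, w, p0):
    """Minimum signed mass over all closed 1-D halfspaces {t >= c} or
    {t <= c} containing p0. The mass is piecewise constant in c, so it
    suffices to scan thresholds at the (assumed untied) projected data
    points, plus the empty halfspaces beyond the data range."""
    order = np.argsort(p)
    ps, ws = p[order], w[order]
    # suffix[k] = mass of {t >= ps[k]}; the appended 0 is the empty halfspace
    suffix = np.append(np.cumsum(ws[::-1])[::-1], 0.0)
    # prefix[k] = mass of a lower halfspace holding the first k points
    prefix = np.insert(np.cumsum(ws), 0, 0.0)
    j = np.searchsorted(ps, p0, side="left")    # {t >= c} with c <= p0
    i = np.searchsorted(ps, p0, side="right")   # {t <= c} with c >= p0
    return min(suffix[: j + 1].min(), prefix[i:].min())

def ahd_approx(x, sample, n_dir=500, seed=0):
    """Approximate angular halfspace depth via formula (5): project the
    sample gnomonically to G, attach signed weights +1/n (northern) and
    -1/n (southern hemisphere), and minimise the signed mass of 1-D
    halfspaces containing pi(x) over random projection directions.
    Assumes no observation lies on the equator."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    t = sample[:, -1]
    z = (sample / t[:, None])[:, :-1]           # gnomonic images in G
    w = np.where(t > 0, 1.0, -1.0) / n          # signed empirical weights
    x = x if x[-1] > 0 else -x                  # identify antipodal points
    z0 = (x / x[-1])[:-1]
    best = np.inf
    for _ in range(n_dir):
        u = rng.standard_normal(z.shape[1])
        u /= np.linalg.norm(u)
        best = min(best, _min_signed_mass(z @ u, w, z0 @ u))
    return np.mean(t < 0) + best                # P(S_-) + infimum, as in (5)
```

On the circle ($d = 2$) the two projection directions $u = \pm 1$ are exhaustive, so the approximation is exact and can be checked against a direct count over halfplanes through the origin.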

For *d* = 3 we adapted the more sophisticated exact algorithm of Dyckerhoff & Mozharovskyi, 2016, generalized it from *hD* to *ahD*, and implemented the results in C++. As a benchmark, we use the implementation of *ahD* for *d* = 3 available as function sdepth from the R package depth. Detailed results of our comparison are omitted from the present note due to space restrictions, but will be discussed during the conference talk. Here we only remark that, compared to the currently available programs, our new algorithm computes *ahD* up to 10 000 times faster for standard datasets and deals with the exact depth for tens of thousands of observations in S<sup>2</sup> within seconds, while the approximate algorithms allow fast evaluation of *ahD* also for *d* > 3. All this illustrates the great potential of our projection method in the analysis of directional data.

#### **References**

DYCKERHOFF, R., & MOZHAROVSKYI, P. 2016. Exact computation of the halfspace depth. *Comput. Stat. Data Anal.*, 98, 19–30.


### NONLINEAR INTERCONNECTEDNESS OF CRUDE OIL AND FINANCIAL MARKETS

Yarema Okhrin1, Gazi Salah Uddin2 and Muhammad Yahya3

<sup>1</sup> Department of Statistics, University of Augsburg, Germany. (e-mail: yarema.okhrin@uni-a.de)

<sup>2</sup> Department of Management and Engineering, Linköping University, Sweden. (e-mail: gazi.salah.uddin@liu.se)

<sup>3</sup> Department of Safety, Economics and Planning, University of Stavanger, Norway. (e-mail: muhammad.yahya@uis.no)

ABSTRACT: This paper investigates the heterogeneous and asymmetric effect of COVID-19 on crude oil, the S&P 500 index, the EUR/USD exchange rate, and various uncertainty measures. These assets reflect the overall health of the global financial and economic system. For instance, the S&P 500 is the most liquid financial index and partly reflects the development of the global financial system. Crude oil plays a fundamental role in the developmental and economic activities of a country. Elevated prices of energy commodities lead to higher inflation and production costs, resulting in declining demand, output, and trade in the economy. The COVID-19 pandemic has contributed significantly to demand and supply shocks that have led to an unprecedented decline in the crude oil price. In addition, global geopolitics is triggering the volatility of the crude oil market. The stability of the crude oil market is important not only for oil-exporting countries but also for oil-importing and industrialized countries, in order to maintain the price stability of goods. The EUR/USD exchange rate is among the most liquid assets. This study adds to the literature by examining the heterogeneous and asymmetric impact of COVID-19 on these different asset classes, which enables us to understand how different asset classes react to such unique shocks.

The contribution of this paper is fourfold. First, we evaluate the impact of the COVID-19 crisis on the interconnectedness of the financial, forex, and commodity markets, with a specific focus on risk dynamics. Second, in contrast to previous studies, we consider high-frequency intraday data, which allows us to provide deeper insight into the dependencies at the daily level. Third, we quantify the dependence and its dynamics using paired vine copulas; this class of copulas is highly flexible and allows for a convenient visualization of the dependence. Fourth, we put a particular focus on crude oil returns as a function of several financial covariates using C- and D-vine regressions. This approach allows us to model the whole conditional distribution within a single day and to gain insight into the causal dependence in the tails or at particular quantiles.

### DETECTION OF INTERNET ATTACKS WITH HISTOGRAM PRINCIPAL COMPONENT ANALYSIS


M. Rosário Oliveira1, Ana Subtil1 and Lina Oliveira2

<sup>1</sup> CEMAT and Mathematics Department, Instituto Superior Técnico, Universidade de Lisboa, Portugal (e-mail: rosario.oliveira@tecnico.ulisboa.pt, anasubtil@tecnico.ulisboa.pt)

<sup>2</sup> CAMGSD and Mathematics Department, Instituto Superior Técnico, Universidade de Lisboa, Portugal (e-mail: lina.oliveira@tecnico.ulisboa.pt)

ABSTRACT: We propose symbolic unsupervised anomaly detection methods, based on histogram principal component analysis, to identify Internet traffic redirection attacks. We obtain histogram-valued scores by applying a histogram algebraic structure based on Moore's interval arithmetic. The symbolic means of the scores are the input for two unsupervised anomaly detection methods used to successfully signal Internet attacks.

KEYWORDS: histogram principal component analysis, symbolic data analysis, Internet data.

#### 1 Introduction

Internet security is a major concern for users and Internet Service Providers since successful attacks can produce substantial damage. These attacks may be aimed at gaining access to sensitive information from the victim, monitoring its online activity, causing network delay, among other motivations.

To identify traffic redirection attacks, we had access to measurements obtained from a worldwide probing platform designed to detect routing variations based on round-trip-time (RTT)\* measurements from multiple, geographically dispersed locations (Salvador & Nogueira, 2014). At each timestamp, various measurements are made and summarized by a histogram. We therefore propose an anomaly detection method based on histogram principal component analysis. To do so, we consider linear combinations of histogram-valued data (according to a histogram algebraic structure generalised from Moore's interval algebraic structure, vide Moore *et al.*, 2009) and use the

\*Round-trip-time (RTT) is the length of time from the moment a data packet is sent until an acknowledgement of the packet is received back at the origin.

projected data on the first histogram principal component (PC) to successfully detect traffic redirection attacks.

#### 2 Histogram Principal Component Analysis


Principal component analysis (PCA) is frequently used as a dimensionality reduction method. Given the importance of PCA, various generalisations have been proposed in the symbolic data analysis (SDA) framework. A common generalisation relies on the so-called symbolic-conventional-symbolic approach, where a symbolic covariance matrix is estimated, to which conventional PCA is applied, followed by rewriting the original symbolic data into the space spanned by the first eigenvectors of the covariance matrix. In the case of histogram PCA, Makosso-Kallyth & Diday, 2012 and Chen *et al.*, 2015 use the same definition of sample symbolic covariance, but differ in the way objects are represented in the reduced space. In Le-Rademacher & Billard, 2017, another definition of sample symbolic covariance matrix is used, and the original objects are represented in the reduced space relying on a geometric construction of polytopes. Other approaches to generalise PCA to histogram-valued data exist, but are not considered here.

To detect traffic redirection attacks, we project the original histogram-valued data in the direction of the first PC, whose loadings are determined by the first eigenvector of the chosen symbolic covariance matrix; here we use the same covariance matrix definition as Makosso-Kallyth & Diday, 2012 and Chen *et al.*, 2015. The loadings define a weighted sum of the original observations, leading to histogram-valued scores obtained by applying the Moore-based histogram algebraic structure. The resulting symbolic means (vide Le-Rademacher & Billard, 2017, eq. (2)) of the scores are the input for the (conventional) unsupervised anomaly detection methods.
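As a rough numerical illustration (our own sketch, not the authors' code), the score construction can be written for the two-subinterval histograms used below. The bound-wise addition of histograms and the symbolic mean under within-subinterval uniformity with equal subinterval masses are simplifying assumptions stated here, not taken from the paper:

```python
import numpy as np

def pc1_histogram_scores(X, v):
    """Histogram-valued first-PC scores and their symbolic means.

    X is an (n, p, 3) array: for each observation and histogram-valued
    variable, the (minimum, median, maximum) bounds of a two-subinterval
    histogram carrying mass 1/2 per subinterval. v holds the p loadings
    of the first PC. Scalar multiplication follows Moore's rule (a
    negative loading reverses the bound order); histograms are then
    added bound-wise across variables.
    """
    scaled = v[None, :, None] * X
    scaled = np.sort(scaled, axis=-1)       # restore ascending bounds
    scores = scaled.sum(axis=1)             # (n, 3) histogram-valued scores
    m0, m1, m2 = scores.T
    means = (m0 + 2.0 * m1 + m2) / 4.0      # subinterval midpoints, mass 1/2 each
    return scores, means
```

For one observation with bounds (0, 1, 2) and (0, 2, 4) and loadings (1, −1), the second histogram is reflected to (−4, −2, 0), the bound-wise sum is (−4, −1, 2), and the symbolic mean is −1.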

#### 3 Detection of Traffic Redirection Attacks

The data set under analysis was gathered on a monitoring network that comprised 12 geographically dispersed servers (probes) that measured, at 120-second intervals, the RTT to two hosts under surveillance (targets: Frankfurt1 and Hong Kong). When an attack was being perpetrated, traffic from the 12 probes to the target was diverted through an attacker (relay).

Each probe made 10 RTT measurements every 120 seconds, by sending 10 packets to the target, and the corresponding average, minimum, median, and maximum over the 10 RTT measurements were obtained. These summary

indicators can be taken as the features to be analysed, and conventional statistical methods may be applied (Salvador & Nogueira, 2014, Subtil *et al.*, 2018). Alternatively, SDA provides a framework to address this problem taking into account the intrinsic variability of the data. As such, we can consider, at every timestamp, for each probe monitoring a target, a histogram with two subintervals, whose bounds are the minimum, median, and maximum of the 10 RTT measurements. Therefore, for each target, we have a symbolic data set with *p* = 12 histogram-valued variables, with as many realisations as timestamps at which measurements were made.

We apply the described histogram PCA to each target data set. Given the first PC scores, we calculate their symbolic means and use them as input for two anomaly detection methods: the heuristic proposed by Salvador & Nogueira, 2014 and Tukey's method for outlier detection.

Salvador & Nogueira, 2014 proposed a heuristic to discriminate between the RTTs of regular and redirected traffic. At every timestamp, the conventional average RTT is compared with a decision threshold set at 1.2 times the average of the past 480 observations that were not classified as attacks. Additionally, the heuristic requires a minimum sequence of 10 observations exceeding the threshold to signal an attack (rule-of-10). We apply this heuristic, replacing the average RTT by the means of the first PC histogram-scores. Tukey's method defines boundaries based on the quartiles of the data and identifies as outliers the observations that lie outside these boundaries. Since the first PC is an overall mean of the traffic volume going through the probes, we merely compare the absolute values of the score means with the upper boundary *Q*3 + 3 × *IQR*, where *IQR* = *Q*3 − *Q*1 is the interquartile range and *Q*1 and *Q*3 are, respectively, the 1st and 3rd quartiles of the data. We also adopt the rule-of-10.

Figure 1. *Means of the first PC histogram-scores for target Hong Kong (black line) and thresholds for the heuristic (blue line) and Tukey's method (red line). Shaded background bands signal the attack periods, with the corresponding relays (LA1, MAD, MSW, SP1) indicated.*

For the target Frankfurt, both the heuristic and Tukey's method detect all the attacks and no false positives occur (recall = 1, false positive rate = 0, precision = 1). For Hong Kong, the heuristic is unable to detect the attacks perpetrated through the relays Los Angeles (LA1) and Madrid (MAD), as shown in Figure 1. The failure to detect two of the four attacks leads to a recall of 0.5, while the false positive rate is 0 and the precision is 1. Tukey's method yields a small false positive rate (0.08), a recall of 1, and a precision of 0.79.

#### 4 Conclusions

This paper introduces novel symbolic unsupervised anomaly detection methods to identify Internet traffic redirection attacks based on histogram PCA, using the histogram means of the first PC scores. The results point to the superiority of the symbolic Tukey's method over the symbolic heuristic in detecting the attacks. Overall, we show that PC histogram scores can be used as an interesting input for further statistical analysis (conventional or symbolic).

Acknowledgements: Work supported by FCT, Portugal, through projects UIDB/04621/2020, PTDC/EEI-TEL/32454/2017, and UID/MAT/04459/2020.

#### References

CHEN, M., WANG, H., & QIN, Z. 2015. Principal component analysis for probabilistic symbolic data: a more generic and accurate algorithm. *Adv. Data Anal. Classif.*, 9, 59–79.

LE-RADEMACHER, J., & BILLARD, L. 2017. Principal component analysis for histogram-valued data. *Adv. Data Anal. Classif.*, 11, 327–351.

MAKOSSO-KALLYTH, S., & DIDAY, E. 2012. Adaptation of interval PCA to symbolic histogram variables. *Adv. Data Anal. Classif.*, 6, 147–159.

MOORE, R. E., KEARFOTT, R. B., & CLOUD, M. J. 2009. *Introduction to interval analysis*. SIAM, USA.

SALVADOR, P., & NOGUEIRA, A. 2014. Customer-side detection of Internet-scale traffic redirection. *Pages 1–5 of: 2014 16th Int. Telecom. Network Strategy and Planning Symp. (Networks)*.

SUBTIL, A., OLIVEIRA, M. R., VALADAS, R., PACHECO, A., & SALVADOR, P. 2018. Detecting Internet-Scale Traffic Redirection Attacks Using Latent Class Models. *Pages 370–380 of: Int. Conf. Soft Comp. and Pattern Recognit.* Springer.
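The Tukey detection rule of Section 3 can be sketched as follows. This is our own illustration, not the authors' code: the function name and the exact run-length handling are assumptions, and the heuristic's adaptive 1.2× threshold is omitted:

```python
import numpy as np

def tukey_rule_of_10(score_means, k=3.0, run=10):
    """Attack detection on the means of the first-PC histogram-scores:
    a timestamp is an outlier when its absolute value exceeds the upper
    Tukey fence Q3 + k * IQR (with the conservative k = 3), and an attack
    is signalled only for runs of at least `run` consecutive outliers
    (the rule-of-10)."""
    s = np.abs(np.asarray(score_means, dtype=float))
    q1, q3 = np.percentile(s, [25, 75])
    outlier = s > q3 + k * (q3 - q1)
    attack = np.zeros(len(s), dtype=bool)
    start = 0
    for i in range(len(s) + 1):
        # close the current run at a non-outlier or at the end of the series
        if i == len(s) or not outlier[i]:
            if i - start >= run:
                attack[start:i] = True
            start = i + 1
    return attack
```

On a synthetic series where 12 consecutive timestamps jump well above the fence, all 12 are flagged; a jump lasting only 5 timestamps is suppressed by the rule-of-10.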


### SEMIPARAMETRIC IRT MODELS FOR NON-NORMAL LATENT TRAITS

because for any fixed η*<sup>j</sup>* the probability of a correct response to item *i* is decreasing in β*i*. When λ*<sup>i</sup>* = 1 for all *i* = 1,...,*I*, the model in 1 reduces to the one-parameter logistic (1PL) model. Often, conditional log-odds in 1 are reparametrized as λ*i*η*<sup>i</sup>* +γ*i*, with γ*<sup>i</sup>* = −λ*<sup>i</sup>* ×β*i*. Sometimes this is reffered to as slope-intercept parameterization as opposed to the IRT parameterization in

Traditional literature assumes that η*<sup>j</sup>* ∼ *N* (0,1) for *j* = 1,...,*N*, but there are situations in which such assumption can be too restrictive. We can extend the model in 1 to describe more flexible latent trait distributions using a

η*j*|*G* ∼ *G*, *G* ∼ *DP*(α,*G*0),

where α is the concentration parameter and *G*<sup>0</sup> the base measure. Alternative representations of the DP are known as the Chinese Restaurant Process (CRP) Blackwell *et al.*, 1973 or the truncated stick-breaking (SB) Sethuraman,

Estimation of the model parameters is carried out in the Bayesian framework via MCMC methods, using NIMBLE de Valpine *et al.*, 2017, a R software for hierarchical models. The NIMBLE system provides a suite of different sampling algorithms along with the possibility to code user-defined samplers. We compare results from the parametric and semiparametric 2PL model, using NIMBLE's default sampling configuration, that mixes conjugate samplers with

Typically parameters of the 2PL model are not identifiable, so constraints are either included in the model or one can post-process posterior samples to meet the constraints. This last approach is typical of parameter-expanded algorithms, which embed targeted models in a larger specification. We found this last option to be the most efficient in terms on both MCMC mixing and

In traditional literature on parametric 2PL model, identification is obtained constraining the discrimination parameters λ*i*, for *i* = 1,...,*I* to be positive, when the latent trait distribution is assumed to be a standard normal. Since we are relaxing the normal assumption on the latent traits, we considered sum-to-

zero constraints on the item parameters, i.e. ∑*<sup>i</sup>* β*<sup>i</sup>* = 0, ∑*<sup>i</sup>* log(λ*i*) = 0.

<sup>0</sup>)×InvGamma(ν1,ν2) (2)

considered traditionally for interpretation.

Dirichlet Process (DP) mixture of normal distributions

*<sup>G</sup>*<sup>0</sup> <sup>≡</sup> *N* (0,σ<sup>2</sup>

1994.

time.

3 Model estimation

adaptive Metropolis Hastings algorithm.

Sally Paganin 1Department of Biostatistics, Harvard School of Public Health, Harvard University (e-mail: spaganin@hsph.harvard.edu)

ABSTRACT: Item Response Theory models are widely used in many domains of applications to analyze questionnaires data, scaling categorical data into continuous construct. Interpretable inference is often obtained relying on a set of assumptions for the latent constructs, as for example normality for the unknown subject-specific latent traits. This assumption can often be unrealistic and lead to biased results, hence we consider more flexible models using Bayesian nonparametric mixtures for the individual latent traits. We study several identifiability constraints, and compare inferential results and different Markov chain Monte Carlo strategies for posterior sampling.

KEYWORDS: 2PL, Bayesian nonparametrics, Dirichlet Process, MCMC, NIMBLE.

#### 1 IRT models for binary responses

Let *yi j* denote the answer of an individual *j* to item *i* for *j* = 1,...,*N* and *i* = 1,...,*I*, with *yi j* = 1, when the answer is correct and 0 otherwise. Typically, different individuals are assumed to work independently, while responses from the same individuals are assumed independent conditional to the latent trait (local independence assumption). Hence each answer *yi j*, conditionally to the latent parameters, is assumed to be a realization of a Bernoulli distribution, and the probability of a correct response is typically modeled via logistic regression.

#### 2 Semiparametric 2PL models

In the two-parameter logistic (2PL) model, the conditional probability of a correct response is modeled as

$$\Pr(\mathbf{y}\_{ij} = 1 | \lambda\_i, \mathbf{\beta}\_i, \mathbf{\eta}\_j) = \frac{\exp\{\lambda\_i (\mathbf{\eta}\_j - \beta\_i)\}}{1 = \exp\{\lambda\_i (\mathbf{\eta}\_j - \beta\_i)\}}, i = 1, \dots, I, \quad j = 1, \dots, N. \tag{1}$$

### SEMIPARAMETRIC IRT MODELS FOR NON-NORMAL LATENT TRAITS

Sally Paganin <sup>1</sup>

<sup>1</sup> Department of Biostatistics, Harvard School of Public Health, Harvard University (e-mail: spaganin@hsph.harvard.edu)

ABSTRACT: Item Response Theory (IRT) models are widely used in many application domains to analyze questionnaire data, scaling categorical responses into a continuous construct. Interpretable inference often relies on a set of assumptions about the latent constructs, for example normality of the unknown subject-specific latent traits. This assumption can be unrealistic and lead to biased results, hence we consider more flexible models using Bayesian nonparametric mixtures for the individual latent traits. We study several identifiability constraints, and compare inferential results and different Markov chain Monte Carlo strategies for posterior sampling.

KEYWORDS: 2PL, Bayesian nonparametrics, Dirichlet Process, MCMC, NIMBLE.

#### 1 IRT models for binary responses

Let *yi j* denote the answer of individual *j* to item *i*, for *j* = 1,...,*N* and *i* = 1,...,*I*, with *yi j* = 1 when the answer is correct and 0 otherwise. Typically, different individuals are assumed to respond independently, while responses from the same individual are assumed independent conditionally on the latent trait (the local independence assumption). Hence each answer *yi j*, conditionally on the latent parameters, is assumed to be a realization of a Bernoulli distribution, and the probability of a correct response is typically modeled via logistic regression.

In the two-parameter logistic (2PL) model, the conditional probability of a correct response is modeled as

$$\Pr(y_{ij} = 1 \mid \lambda_i, \beta_i, \eta_j) = \frac{\exp\{\lambda_i(\eta_j - \beta_i)\}}{1 + \exp\{\lambda_i(\eta_j - \beta_i)\}}, \qquad i = 1,\dots,I,\; j = 1,\dots,N, \tag{1}$$

where η*j* represents the health status, or more generally the latent trait, of the *j*-th individual, while β*i* and λ*i* encode item characteristics. The parameter λ*i* > 0 is often referred to as *discrimination*, while β*i* is called *difficulty* because, for any fixed η*j*, the probability of a correct response to item *i* is decreasing in β*i*. When λ*i* = 1 for all *i* = 1,...,*I*, the model in (1) reduces to the one-parameter logistic (1PL) model. Often, the conditional log-odds in (1) are reparametrized as λ*i*η*j* + γ*i*, with γ*i* = −λ*i* × β*i*. This is sometimes referred to as the slope-intercept parameterization, as opposed to the IRT parameterization traditionally considered for interpretation.

#### 2 Semiparametric 2PL models

The traditional literature assumes that η*j* ∼ *N*(0,1) for *j* = 1,...,*N*, but there are situations in which such an assumption can be too restrictive. We can extend the model in (1) to describe more flexible latent trait distributions using a Dirichlet Process (DP) mixture of normal distributions

$$\eta_j \mid G \sim G, \qquad G \sim DP(\alpha, G_0), \qquad G_0 \equiv \mathcal{N}(0, \sigma_0^2) \times \text{InvGamma}(\nu_1, \nu_2), \tag{2}$$

where α is the concentration parameter and *G*0 the base measure. Alternative representations of the DP are known as the Chinese Restaurant Process (CRP) (Blackwell *et al.*, 1973) and the truncated stick-breaking (SB) representation (Sethuraman, 1994).

#### 3 Model estimation
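To make model (1) concrete, here is a minimal sketch (with sizes and parameter ranges mirroring the simulation study below; illustrative code, not the author's) that computes 2PL response probabilities and simulates a binary response matrix:

```python
import numpy as np

def p_correct(eta, lam, beta):
    """2PL probability of a correct response: logistic in lam * (eta - beta)."""
    return 1.0 / (1.0 + np.exp(-lam * (eta - beta)))

rng = np.random.default_rng(0)
N, I = 3000, 20
eta = rng.normal(0.0, 1.0, size=N)             # latent traits, unimodal scenario
lam = rng.uniform(0.5, 2.0, size=I)            # discrimination parameters
beta = rng.normal(0.0, np.sqrt(2.0), size=I)   # difficulty parameters (variance 2)

P = p_correct(eta[:, None], lam[None, :], beta[None, :])  # N x I probabilities
Y = rng.binomial(1, P)                                    # simulated responses y_ij
```

Note that, for fixed η, increasing the difficulty β*i* lowers the success probability, matching the interpretation of the parameters given above.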

Estimation of the model parameters is carried out in the Bayesian framework via MCMC methods, using NIMBLE (de Valpine *et al.*, 2017), an R package for hierarchical models. The NIMBLE system provides a suite of different sampling algorithms along with the possibility to code user-defined samplers. We compare results from the parametric and semiparametric 2PL models, using NIMBLE's default sampling configuration, which mixes conjugate samplers with adaptive Metropolis-Hastings algorithms.

Typically, the parameters of the 2PL model are not identifiable, so constraints are either included in the model or imposed by post-processing the posterior samples to meet them. The latter approach is typical of parameter-expanded algorithms, which embed the targeted model in a larger specification. We found this option to be the most efficient in terms of both MCMC mixing and computing time.

In the traditional literature on the parametric 2PL model, identification is obtained by constraining the discrimination parameters λ*i*, *i* = 1,...,*I*, to be positive when the latent trait distribution is assumed to be standard normal. Since we relax the normality assumption on the latent traits, we consider sum-to-zero constraints on the item parameters, i.e. ∑*i* β*i* = 0 and ∑*i* log(λ*i*) = 0.
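The sum-to-zero constraints can be imposed by post-processing, exploiting the invariance of λ*i*(η*j* − β*i*) under a joint rescaling and shift of the parameters. The helper below is a hypothetical sketch of that idea (it is not part of NIMBLE or the author's code):

```python
import numpy as np

def rescale_draw(lam, beta, eta):
    """Map one posterior draw onto the identified parameterization with
    sum(log(lam)) = 0 and sum(beta) = 0, leaving every product
    lam_i * (eta_j - beta_i) unchanged."""
    a = np.exp(-np.mean(np.log(lam)))   # center the log-discriminations at zero
    lam_new, beta_new, eta_new = lam * a, beta / a, eta / a
    c = -np.mean(beta_new)              # then shift the difficulties to sum to zero
    return lam_new, beta_new + c, eta_new + c

# one illustrative (made-up) draw
lam = np.array([0.5, 1.0, 2.0])
beta = np.array([1.0, -3.0, 0.5])
eta = np.array([0.3, -0.2])
lam2, beta2, eta2 = rescale_draw(lam, beta, eta)
```

Since λ*i*a × (η*j*/a − β*i*/a) = λ*i*(η*j* − β*i*) and the common shift *c* cancels, the likelihood is unchanged while both constraints hold exactly.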

#### 4 Inferential results

We compare inferential results via simulation, generating data from two scenarios that differ in the distribution of the latent traits. We simulate responses from *N* = 3,000 individuals to *I* = 20 binary items. Values for the discrimination parameters λ*i*, *i* = 1,...,20, are sampled from a Uniform distribution over the interval (0.5, 2), while values for the difficulty parameters β*i* are sampled from a Normal distribution with mean zero and variance 2.

In particular, we considered two different generating distributions for the latent traits: a unimodal scenario, where the η*j* are i.i.d. draws from a *N*(0,1), and a multimodal scenario where

$$\eta_j \sim 0.4 \times \mathcal{N}(-3, 1) + 0.2 \times \mathcal{N}(-2, 4) + 0.4 \times \mathcal{N}(2, 1). \tag{3}$$
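The multimodal scenario can be simulated componentwise; in this sketch the second arguments of *N*(·,·) are read as variances, consistently with the variance-2 difficulty distribution above:

```python
import numpy as np

def sample_multimodal(n, rng):
    """Draw latent traits from the three-component normal mixture in (3):
    0.4*N(-3,1) + 0.2*N(-2,4) + 0.4*N(2,1), second argument = variance."""
    comp = rng.choice(3, size=n, p=[0.4, 0.2, 0.4])   # mixture component labels
    means = np.array([-3.0, -2.0, 2.0])[comp]
    sds = np.sqrt(np.array([1.0, 4.0, 1.0]))[comp]    # sd = sqrt(variance)
    return rng.normal(means, sds)

rng = np.random.default_rng(1)
eta = sample_multimodal(3000, rng)   # one simulated set of N = 3,000 latent traits
```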

We chose moderately vague priors for the item parameters, β*i* ∼ *N*(0,3) and log(λ*i*) ∼ *N*(0.5,0.5). In the parametric model, the η*j* are assumed to follow a *N*(0,1) distribution, while for the DP we choose *G*0 ≡ *N*(0,3) × InvGamma(1.01, 2.01). We ran the MCMC for 50,000 iterations, discarding a 10% burn-in of 5,000 iterations, and checked traceplots for convergence.

Figure 1. *Comparison of the latent trait density estimates, using a parametric 2PL model (orange line) and a semiparametric 2PL model (green line). The dotted black lines indicate the true distribution in (3).*

Figure 1 compares density estimates of the latent trait distribution from the parametric and semiparametric models, computed taking the posterior means of the η*j*. The parametric model leads to a flat density because of the underlying normality assumption, while the semiparametric specification recovers the true density structure. Better estimation of the latent abilities helps to avoid bias in inference, for example when estimating item parameters or item characteristic curves (ICC).

#### References

BLACKWELL, DAVID, & MACQUEEN, JAMES B. 1973. Ferguson distributions via Pólya urn schemes. *The Annals of Statistics*, 1(2), 353–355.

DE VALPINE, PERRY, TUREK, DANIEL, PACIOREK, CHRISTOPHER J., ANDERSON-BERGMAN, CLIFFORD, LANG, DUNCAN TEMPLE, & BODIK, RASTISLAV. 2017. Programming with models: writing statistical algorithms for general model structures with NIMBLE. *Journal of Computational and Graphical Statistics*, 26(2), 403–413.

SETHURAMAN, J. 1994. A constructive definition of Dirichlet priors. *Statistica Sinica*, 4(2), 639–650.



### A GRAPHICAL DEPTH-BASED AID TO DETECT DEVIATION FROM UNIMODALITY ON HYPERSPHERES


Giuseppe Pandolfo <sup>1</sup>

<sup>1</sup> Department of Industrial Engineering, University of Naples Federico II (e-mail: giuseppe.pandolfo@unina.it)

ABSTRACT: A graphical tool for investigating unimodality of hyperspherical data is proposed. It is based on the notion of a statistical data depth function for directional data: the "standard" global depth is compared to its local version by means of a two-dimensional scatterplot. The proposal is illustrated on simulated data.

KEYWORDS: Data depth, distance measure, ranks, von Mises-Fisher.

#### 1 Setting

Testing unimodality of a sample *X*1,...,*Xn* of a random vector *X* supported on the hypersphere *S*<sup>*q*−1</sup> := {*x* ∈ R<sup>*q*</sup> : *x*<sup>⊤</sup>*x* = 1}, with *q* > 1, is an important step in multivariate data analysis when only the directions (and not the magnitudes) are of interest – the so-called directional data. This kind of data arises in many applied disciplines, such as astronomy and biology.

The inspiration for this contribution comes from the *center-outward ordering* provided by statistical depth functions, which can be viewed as a multivariate generalization of the standard univariate rank. Specifically, information about the unimodality of hyperspherical distributions is obtained through data depths, and it can be displayed in a simple two-dimensional plot. Such a graph is based on an analysis of the rankings derived from a data depth function and its local counterpart, so that it can offer an easily interpretable picture of the distribution.

Depth-induced rankings have already been used to investigate distributional features of data in R<sup>*q*</sup> by means of graphical tools. Liu *et al.*, 1999 proposed the "sunburst plot", a bivariate generalization of the box-plot, and the DD- (depth versus depth) plot. Rousseeuw *et al.*, 1999 proposed the bagplot, a bivariate generalization of the univariate boxplot exploiting the notion of halfspace location depth. Li *et al.*, 2012 used the DD-plot to perform classification of data in R<sup>*q*</sup>. A nonparametric classification procedure based on the DD-plot was introduced also by Lange *et al.*, 2014. The concept of data depth was also used to build control charts for monitoring multivariate quality measurement processes (Liu, 1995; Pandolfo *et al.*, 2021). However, despite the great and increasing interest in multivariate data analysis in R<sup>*q*</sup>, depth-based visualizations have so far been neglected for directional data, except for a recent work on the classification of data on the unit circle through the DD-plot (Pandolfo *et al.*, 2021).

#### 2 Data depth

A GRAPHICAL DEPTH-BASED AID TO DETECT DEVIATION FROM UNIMODALITY ON HYPERSPHERES Giuseppe Pandolfo <sup>1</sup>

<sup>1</sup> Department of Industrial Engineering, University of Naples Federico II, (e-mail:

ABSTRACT: A graphical tool for investigating unimodality of hyperspherical data is proposed. It is based on the notion of statistical data depth function for directional data. Then "standard" global depth is compared to its local version by means of a

Testing unimodality of a sample *X*1,...,*Xn* of a random vector *X* supported

step in multivariate data analysis for which only the directions (and not the magnitudes) are of interest – the so-called directional data. This kind of data

The inspiration for this contribution comes from the *center-outward ordering* provided by statistical depth functions, which can be intended as a multivariate generalization of standard univariate rank. Specifically, information about unimodality of hyperspherical distributions are obtained through data depths, and they can be displayed and visualized in a simple two-dimensional plot. Such graph is based on an analysis of the rankings derived from a data depth function and its local counterpart, so that they can offer an easy inter-

The use of depth-induced rankings to investigate distributional features has been already used for analyzing data in R*<sup>q</sup>* by means of graphical tools. Liu *et al.* , 1999 proposed the "sunburst plot" as a bivariate generalization of the box-plot and the DD-(depth versus depth) plots. Rousseeuw *et al.* , 1999 proposed the bagplot, a bivariate generalization of the univariate boxplot by exploiting the notion of halfspace location depth. Li *et al.* , 2012 used the DDplot to perform classification of data in R*q*. A nonparametric classification procedure based on the DD-plot was introduced also by Lange *et al.* , 2014.

*x* = 1}, with *q* > 1, is one important

two-dimensional scatterplot. The proposal is illustrated on simulated data. KEYWORDS: Data depth, distance measure, ranks, Von-Mises Fisher.

arise in many applied disciplines, such as astronomy, biology, etc.

giuseppe.pandolfo@unina.it

on the hypersphere *<sup>S</sup>q*−<sup>1</sup> :<sup>=</sup> {*<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>q</sup>* : *<sup>x</sup>*

pretable picture of the distributions.

1 Setting

A data depth function is an important nonparametric tool for the analysis of complex data such as functional and directional data. It provides a center-outward ordering of the data, leading to a ranking that can be exploited to describe different features of the data distribution. Formally, a data depth function is any function *D*(*x*,*F*) that measures the closeness or centrality of a point *x* ∈ *S*<sup>*q*−1</sup> with respect to a distribution function *F*. Thus, a depth function assigns to each *x* ∈ *S*<sup>*q*−1</sup> a nonnegative score as its center-outward depth with respect to *F*. Observations close to the center of *F* receive high ranks, whereas peripheral observations receive low ranks. Such a notion is limited to data modeled by a unimodal distribution. For this reason, *local* versions of depth functions were derived in order to deal with multimodal distributions (see Agostinelli & Romanazzi, 2011 and Paindaveine & Van Bever, 2013).

Here, the notion of distance-based depths for directional data introduced by Pandolfo *et al.*, 2018 is adopted along with its local version, which is obtained by considering a neighborhood of each point *x* whose radius is the parameter τ. Because of the boundedness of the space, τ cannot go to ∞ as it does for data in R<sup>*q*</sup>. The usual definition is recovered when τ approaches its maximum, so local depth includes ordinary depth as a particular case. Hence local depth is a class of center-outward ranking functions serving multiple purposes, according to the value of the tuning parameter: low values describe the centralness of points conditionally on a small neighborhood around them, while higher values lead to wider windows and therefore produce rankings that are more and more similar to the "standard" global depth.

#### 3 Plotting global and local depth rankings

The rankings produced by the global and local depths can be compared by means of a two-dimensional scatterplot, which can be exploited to investigate the unimodality of directional data. This is because, for symmetric unimodal distributions, the rankings provided by the global and local depths should be exactly the same; conversely, the more the distribution deviates from unimodality, the greater the difference between the two rankings. Such a difference is easily seen in a two-dimensional plot where the *x*-coordinate of each data point is its global depth and the *y*-coordinate is its local depth. If the distribution is unimodal, and thus the set of the deepest local points does not substantially differ from the corresponding set of the deepest global points, the plot will exhibit a concentration in the upper-right corner. In case of strong unimodality, the ranks induced by the two depth functions will coincide, and points on the plot will roughly form a straight diagonal line. Departures from unimodality will instead show different patterns, obviously depending on the kind of departure. Figure 1 reports an example of the proposed tool, where the arc distance depth, in its global and local versions, was adopted for a unimodal von Mises-Fisher distribution in 5 dimensions with concentration parameter equal to 20 (a), and for a bimodal distribution in 5 dimensions obtained through a weighted mixture of two von Mises-Fisher distributions with means 90° apart, with 80% of the weight on the first component and different concentrations, i.e. 5 and 2 (b). The sample size was set equal to 250. In the first case, points do not deviate much from the straight line, suggesting strong unimodality. For the bimodal data, points are more scattered, and the deepest sample points according to the global and local depth functions do not clearly lie in the upper-right quadrant, signaling a departure from unimodality.
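The mechanics of the plot can be sketched on the circle (*q* = 2) with a toy distance-based depth. The depth below is a simplified stand-in, not the exact definition of Pandolfo *et al.*, 2018; it only illustrates how global and local rankings are produced and compared:

```python
import numpy as np

def arc_dist(a, b):
    """Geodesic (arc) distance between angles on the unit circle."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def local_depth(theta, tau):
    """Toy distance-based depth restricted to a tau-neighborhood of each point;
    tau = pi (the maximum arc distance) recovers the global version."""
    D = arc_dist(theta[:, None], theta[None, :])
    near = D <= tau
    return tau - np.where(near, D, 0.0).sum(axis=1) / near.sum(axis=1)

rng = np.random.default_rng(2)
theta = rng.vonmises(0.0, 5.0, size=250)       # unimodal directional sample

g = local_depth(theta, np.pi)                  # global depth
l = local_depth(theta, np.pi / 2)              # local depth, tau = pi/2
ranks = lambda x: np.argsort(np.argsort(x))    # center-outward ranks
rank_corr = np.corrcoef(ranks(g), ranks(l))[0, 1]
```

Plotting `ranks(g)` against `ranks(l)` gives the diagnostic scatter: for this unimodal sample the points fall near the diagonal, while a mixture with well-separated modes scatters them away from it.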

Figure 1: *Plot of global vs local depth-induced rankings of a von Mises-Fisher distribution in 5 dimensions with concentration parameter κ = 5 (a), and of a weighted mixture of two von Mises-Fisher distributions with means 90° far away from each other (b). The normalized global and local arc depths were used with τ equal to π/2.*

#### References

AGOSTINELLI, CLAUDIO, & ROMANAZZI, MARIO. 2011. Local depth. *Journal of Statistical Planning and Inference*, 141(2), 817–830.

LANGE, TATJANA, MOSLER, KARL, & MOZHAROVSKYI, PAVLO. 2014. Fast nonparametric classification based on data depth. *Statistical Papers*, 55(1), 49–69.

LI, JUN, CUESTA-ALBERTOS, JUAN A., & LIU, REGINA Y. 2012. DD-classifier: Nonparametric classification procedure based on DD-plot. *Journal of the American Statistical Association*, 107(498), 737–753.

LIU, REGINA Y. 1995. Control charts for multivariate processes. *Journal of the American Statistical Association*, 90(432), 1380–1387.

LIU, REGINA Y., PARELIUS, JESSE M., SINGH, KESAR, *et al.* 1999. Multivariate analysis by data depth: descriptive statistics, graphics and inference. *The Annals of Statistics*, 27(3), 783–858.

PAINDAVEINE, DAVY, & VAN BEVER, GERMAIN. 2013. From depth to local depth: a focus on centrality. *Journal of the American Statistical Association*, 108(503), 1105–1119.

PANDOLFO, GIUSEPPE, PAINDAVEINE, DAVY, & PORZIO, GIOVANNI C. 2018. Distance-based depths for directional data. *Canadian Journal of Statistics*, 46(4), 593–609.

PANDOLFO, GIUSEPPE, IORIO, CARMELA, STAIANO, MICHELE, ARIA, MASSIMO, & SICILIANO, ROBERTA. 2021. Multivariate process control charts based on the Lp depth. *Applied Stochastic Models in Business and Industry*, 37(2), 229–250.

ROUSSEEUW, PETER J., RUTS, IDA, & TUKEY, JOHN W. 1999. The bagplot: a bivariate boxplot. *The American Statistician*, 53(4), 382–387.


### NETWORKS OF NETWORKS


Panos Pardalos <sup>1</sup>

<sup>1</sup> Department of Industrial and Systems Engineering, University of Florida (e-mail: pardalos@ufl.edu)

ABSTRACT: Many complex systems in nature (or man made) are represented not by single networks but by sets of interdependent networks. Such networks of networks (NoN) include the internet, airline alliances, biological networks, and smart city networks. There is no doubt that NoN will be the next frontier in network sciences. In my lecture I will address some recent developments (robustness, diversity) and discuss some challenging problems in NoN.

KEYWORDS: interdependent networks; network robustness; network diversity

### PAIRWISE LIKELIHOOD ESTIMATION OF LATENT AUTOREGRESSIVE COUNT MODELS


Xanthi Pedeli <sup>1</sup> and Cristiano Varin <sup>2</sup>

<sup>1</sup> Department of Statistics, Athens University of Economics and Business, Athens, Greece (e-mail: xpedeli@aueb.gr)

<sup>2</sup> Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University, Venice, Italy (e-mail: cristiano.varin@unive.it)

ABSTRACT: Latent autoregressive models are often employed for the analysis of infectious disease data. However, the likelihood function of latent autoregressive models is intractable and is usually approximated by simulation-based methods. Such approximations make the inferential problem feasible, but at the price of a high computational cost and of difficulties in assessing the quality of the numerical approximation. We consider instead a weighted pairwise likelihood approach and explore several computational and methodological aspects, including the estimation of robust standard errors and the role of numerical integration. The suggested approach is illustrated on monthly cases of invasive meningococcal disease in Italy.

KEYWORDS: latent autoregressive model, numerical integration, pairwise likelihood.

#### 1 Pairwise Likelihood Inference for Latent Autoregressive Models

Let *y*1,...,*yn* be an observed time series of length *n* and let *ut* = φ*ut*−1 + ε*t* be an unobserved Gaussian autoregressive process with ε*t* ∼ *N*(0,σ<sup>2</sup>) and |φ| < 1. Latent autoregressive models assume that, conditionally on the unobserved *ut*, the observed counts *yt* are independent Poisson random variables with conditional expectation E(*yt*|*ut*) = exp(*xt*<sup>T</sup>β + *ut*), where *xt* is a vector of regressors and β = (β0,...,β*p*)<sup>T</sup> is the corresponding vector of regression coefficients. The inclusion of the latent variable *ut* in the linear predictor induces both the serial correlation and the overdispersion frequently observed in time series of disease counts. Likelihood inference for the parameter vector θ = (β<sup>T</sup>, σ<sup>2</sup>, φ)<sup>T</sup> of latent autoregressive models requires approximating the *n*-fold integral

$$L(\theta) = \int_{\mathbb{R}^n} p(y_1 \mid u_1; \beta)\, p(u_1; \sigma^2, \phi) \prod_{t=2}^{n} p(y_t \mid u_t; \beta)\, p(u_t \mid u_{t-1}; \sigma^2, \phi)\, du_1 \cdots du_n. \tag{1}$$

Alternatively, the likelihood can be expressed as a series of *n* nested one-dimensional integrals using the filtering algorithm described in Cagnone & Bartolucci, 2017. The algorithm is based on the recursive evaluation of the nested integrals, a process that can propagate numerical error; this is the main drawback of the filtering approach. Various simulation strategies for approximating the likelihood (1) have been suggested in both the frequentist and Bayesian frameworks (Davis & Dunsmuir, 2016). In this paper, we consider a weighted pairwise likelihood approach (Varin & Vidoni, 2009) based on replacing the high-dimensional integral in (1) with a limited set of double integrals. Consequently, a significant reduction of the computational cost relative to the ordinary likelihood is achieved.
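Each such double integral involves a bivariate Gaussian for the latent pair with the stationary AR(1) covariance (variance σ²/(1−φ²), lag-*i* covariance σ²φ<sup>*i*</sup>/(1−φ²)). A sketch of its Gauss-Hermite evaluation follows; the function names and the fixed linear predictors are illustrative assumptions, not the authors' code:

```python
import math
import numpy as np

def pois_pmf(y, mu):
    """Poisson pmf for a scalar count y and an array of means mu."""
    return np.exp(-mu) * mu**y / math.factorial(y)

def pair_density(y_s, y_t, lag, eta_s, eta_t, sigma2, phi, n_nodes=10):
    """Gauss-Hermite approximation of the bivariate marginal p(y_{t-i}, y_t),
    where (u_{t-i}, u_t) is bivariate normal with stationary AR(1) covariance."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    v = sigma2 / (1.0 - phi**2)                     # stationary variance
    cov = np.array([[v, v * phi**lag], [v * phi**lag, v]])
    L = np.linalg.cholesky(cov)
    Z = np.array(np.meshgrid(x, x)).reshape(2, -1)  # all quadrature node pairs
    U = math.sqrt(2.0) * (L @ Z)                    # latent values at the nodes
    W = np.outer(w, w).ravel() / math.pi            # 2-d Gauss-Hermite weights
    return float(np.sum(W * pois_pmf(y_s, np.exp(eta_s + U[0]))
                          * pois_pmf(y_t, np.exp(eta_t + U[1]))))
```

A quick sanity check is that, summed over all count pairs, the approximation integrates to one (up to quadrature and truncation error).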

The pairwise log-likelihood of order $d$ is defined as a weighted sum of the form $\ell_d(\theta) = \sum_{t=m_d+1}^{n} \sum_{i=1}^{m_d} w_i \log p(y_{t-i}, y_t; \theta)$, where $m_d$ is a window length parameter and the $w_i$ are non-negative weights, normalized so that $\sum_{i=1}^{m_d} w_i = 1$ (see Section 2 and Pedeli & Varin, 2020 for more details), and

$$p(y_{t-i}, y_t; \theta) = \int_{\mathbb{R}^2} p(y_t \mid u_t; \beta)\, p(y_{t-i} \mid u_{t-i}; \beta)\, p(u_{t-i}, u_t; \sigma^2, \phi)\, du_{t-i}\, du_t.$$

The maximum pairwise likelihood estimator of order $d$ is denoted by $\hat{\theta}_d$ and is the solution of the pairwise score equations $\psi_d(\hat{\theta}_d) = \sum_{t=m_d+1}^{n} \psi_{d,t}(\hat{\theta}_d) = 0$, where $\psi_{d,t}(\theta) = \sum_{i=1}^{m_d} w_i \frac{\partial}{\partial \theta} \log p(y_{t-i}, y_t; \theta)$ are the averaged pairwise scores.

It can be shown (Davis & Yau, 2011) that the limiting distribution of $\hat{\theta}_d$ is normal with mean equal to the true value, $\theta^*$, and asymptotic variance equal to the inverse of the Godambe information $G_d(\theta^*) = H_d(\theta^*) J_d(\theta^*)^{-1} H_d(\theta^*)$, where $H_d = \mathrm{E}\{-\frac{\partial}{\partial \theta} \psi_{d,t}(\theta^*)\}$ and $J_d = \sum_{k=-\infty}^{\infty} \mathrm{E}\{\psi_{d,t-k}(\theta^*)\, \psi_{d,t}(\theta^*)^T\}$ are referred to as the sensitivity and variability matrices, respectively. For the estimation of $H_d$, one can work with either the observed pairwise likelihood information or an outer-product estimator derived from the second-order Bartlett identity, which holds for each specific pair of observations. Estimation of $J_d$ is more demanding. We consider a heteroskedasticity and autocorrelation consistent (HAC) estimator (Newey & West, 1994) of the form

$$\hat{J}_d = \sum_{k=-r}^{r} \left(1 - \frac{|k|}{r}\right) \sum_{t=m_d+1}^{n} \left\{ \frac{1}{n}\, \psi_{d,t-k}(\hat{\theta}_d)\, \psi_{d,t}(\hat{\theta}_d)^T \right\},$$

where the weights $(1 - |k|/r)$ correspond to the Bartlett kernel, although other types of kernels might also be used. Empirical evidence suggests that the default lag length used by the autocorrelation functions of standard statistical software is a reliable choice for the window semi-length $r$. We thus adopt the rule $r = 10 \log_{10} n$, corresponding to the number of lags used by the acf() function of the R software (R Core Team, 2020).

#### 2 Application

This section illustrates the proposed approach with an update of the application considered in Pedeli & Varin, 2020. Data on the monthly number of meningococcal disease cases in Italy for the years 1999–2018 were obtained from the Surveillance Atlas of the European Centre for Disease Prevention and Control (ECDC). A latent autoregressive Poisson model is fitted to the period 1999–2017 and then used to predict the disease cases in 2018. The time series plot of the data (left panel of Figure 1) shows that the main feature of the series is a level shift corresponding to a reduction in the monthly number of cases after March 2005. We therefore consider the latent autoregressive model $E(y_t \mid u_t) = \exp(\eta_t + u_t)$ with $\eta_t = \beta_0 + \beta_1 x_t + \beta_2 \sin(2\pi t/12) + \beta_3 \cos(2\pi t/12)$, where $x_t$ is a binary indicator for observations before ($x_t = 1$) and after ($x_t = 0$) March 2005. The Pearson residuals obtained from a standard Poisson regression model with linear predictor $\eta_t$ are non-spuriously autocorrelated at the first two lags. We thus fit the latent autoregressive model with the pairwise likelihood of order two and trapezoidal weights, as suggested by the simulation results discussed in Pedeli & Varin, 2020. The trapezoidal weights have a window length parameter $m_d = 2d$ and are defined as

$$w_i \propto \begin{cases} 1, & 1 \le i < d, \\ (2d - i)/d, & d \le i < 2d, \\ 0, & i \ge 2d. \end{cases}$$

Numerical integration for the computation of the pairwise likelihood is performed through Gauss-Hermite quadrature with 5, 10 and 20 nodes per dimension, giving the same estimates and standard errors up to two decimal digits. Maximum pairwise likelihood estimates are in close agreement with the integrated nested Laplace approximation (INLA) (Rue et al., 2009) and confirm the significant level shift in the invasive meningitis cases after March 2005. The maximum pairwise likelihood estimates and corresponding standard errors were obtained after 0.164, 0.379 and 1.439 CPU seconds with 5, 10 and 20 quadrature nodes per dimension, respectively, while INLA required 5.256 CPU seconds.

The observed and predicted cases of meningococcal infections in Italy and the corresponding 95% upper bounds are illustrated in the right panel of Figure 1. Predictions were computed with 10,000 simulations from the fitted model. The comparison of in-sample predictions with the observed disease counts indicates a realistic model fit and retrospectively identifies some periods of excess disease cases. Out-of-sample predictions are also close to the observed meningitis cases for the year 2018, indicating good predictive ability.
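Two of the ingredients above, the normalized trapezoidal weights and the lag-length rule $r = 10\log_{10} n$, can be sketched in a few lines. This is a hedged illustration in Python rather than the R used by the authors, and the helper names are ours:

```python
import math

def trapezoidal_weights(d):
    """Normalized trapezoidal weights w_i, i = 1..m_d, with m_d = 2d:
    proportional to 1 for 1 <= i < d, (2d - i)/d for d <= i < 2d, 0 otherwise."""
    m_d = 2 * d
    w = []
    for i in range(1, m_d + 1):
        if i < d:
            w.append(1.0)
        elif i < 2 * d:
            w.append((2 * d - i) / d)
        else:
            w.append(0.0)
    total = sum(w)
    return [wi / total for wi in w]  # normalized so the weights sum to one

def hac_lag(n):
    """Window semi-length r = 10 log10(n), mirroring the default lag
    length of R's acf()."""
    return int(10 * math.log10(n))
```

For the order-two fit of Section 2, `trapezoidal_weights(2)` gives weights proportional to (1, 1, 0.5, 0) over the first four lags, and with n = 228 monthly observations (1999–2017) the rule yields r = 23.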



Figure 1. *Left panel: Time series of the monthly number of invasive meningococcal disease (IMD) cases in Italy for the years 1999–2018. Right panel: observed (*◦*) and predicted (–) number of IMD cases. The vertical dotted line separates the data used for model fitting from the data used for the prediction exercise.*

#### References

CAGNONE, S. & BARTOLUCCI, F. 2017. Adaptive quadrature for maximum likelihood estimation of a class of dynamic latent variable models. *Computational Economics*, 49, 599–622.

DAVIS, R.A. & DUNSMUIR, W.T.M. 2016. State space models for count time series. In *Handbook of Discrete-Valued Time Series*. CRC Press.

DAVIS, R.A. & YAU, C.Y. 2011. Comments on pairwise likelihood in time series models. *Statistica Sinica*, 21, 255–277.

NEWEY, W.K. & WEST, K.D. 1994. Automatic lag selection in covariance matrix estimation. *Review of Economic Studies*, 61, 631–653.

PEDELI, X. & VARIN, C. 2020. Pairwise likelihood estimation of latent autoregressive count models. *Statistical Methods in Medical Research*, 29, 3278–3293.

R CORE TEAM, 2020. *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria.

RUE, H., MARTINO, S., & CHOPIN, N. 2009. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). *J. R. Statist. Soc. B*, 71, 319–392.

VARIN, C. & VIDONI, P. 2009. Pairwise likelihood inference for general state space models. *Econometric Reviews*, 28, 170–185.

## A STUDY OF LACK-OF-FIT DIAGNOSTICS FOR MODELS FIT TO CROSS-CLASSIFIED BINARY VARIABLES

Mark Reiser<sup>1</sup> and Maduranga Dassanayake<sup>2</sup>

<sup>1</sup> School of Mathematical and Statistical Sciences, Arizona State University, USA (e-mail: mark.reiser@asu.edu)

<sup>2</sup> Department of Statistics, University of Georgia, USA (e-mail: maduranga@uga.edu)

ABSTRACT: In this paper, a new version of the *GFfit* statistic is compared to other lack-of-fit diagnostics for models fit to cross-classified binary variables. The new *GFfit* statistic is obtained by decomposing the Pearson statistic from the full table into orthogonal components defined on marginal distributions. The new version of the *GFfit* statistic can be applied to a variety of models for cross-classified tables. Simulation results show that $GFfit^{\perp}_{(ij)}$ has good Type I error performance even if the joint frequencies are very sparse, and has higher power for detecting the source of lack of fit compared to other diagnostics on bivariate marginal tables.

KEYWORDS: Item response model, goodness of fit, orthogonal components

#### 1 Introduction

For a multi-way contingency table, the traditional Pearson chi-square statistic is obtained by comparing observed frequencies to the expected frequencies under the null hypothesis. A composite null hypothesis, where the null distribution depends on a vector of $g$ unknown parameters $\beta = (\beta_1, \ldots, \beta_g)^T$, requires the Pearson-Fisher statistic, $X^2_{PF} = \sum_s z_s^2$, where $z_s = \sqrt{n}\,(\pi_s(\hat{\beta}))^{-\frac{1}{2}}\,(\hat{p}_s - \pi_s(\hat{\beta}))$. Fisher (1924) gave the degrees of freedom, $T - g - 1$. Orthogonal components of $X^2_{PF}$ have been studied by many authors, including Lancaster (1969). Reiser, Cagnone, and Zhu (2021) propose a new *GFfit* statistic for the purpose of detecting lack of fit. The new statistic, $GFfit^{\perp}_{(ij)}$, is obtained by decomposing the Pearson statistic from the full table into orthogonal components defined on marginal distributions. The $GFfit^{\perp}_{(ij)}$ are the squared elements of $\hat{\gamma} = n^{\frac{1}{2}} F e$, where $e$ is a vector of residuals on marginal distributions, such as bivariate residuals, $F = C^{-1}$, where $C$ is the Cholesky factor of $\Omega_e$, and $\Omega_e$ is the covariance matrix of $\sqrt{n}\,e$.

In this paper, the performance of $GFfit^{\perp}_{(ij)}$ is compared to adjusted residuals (Reiser, 1996) and $\bar{\chi}^2_{ij}$ (Liu & Maydeu-Olivares, 2014) using simulations to assess Type I error rate and power for models fit to binary cross-classified variables. The adjusted residual $k$ for the second-order marginal is $z_{ij} = n^{\frac{1}{2}} e^{(k)} / \hat{\sigma}^{(k)}_e$, where $k = 1, 2, \ldots, \binom{q}{2}$ and corresponds to item pair $ij$, $e^{(k)}$ is an element of $e$, and $\hat{\sigma}^{(k)}_e$ is the square root of a diagonal element of $\Sigma_e$, where $\Sigma_e = n^{-1}\Omega_e$. Finally, $\bar{\chi}^2_{ij} = \frac{2\hat{\mu}_1}{\hat{\mu}_2}\chi^2_{ij}$, where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the first and second asymptotic moments of $\chi^2_{ij}$, and $\chi^2_{ij}$ is the Pearson chi-square statistic calculated on a bivariate table.
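The Pearson-Fisher computation above can be sketched directly from its definition. This is a minimal Python illustration with toy cell probabilities of our own choosing, not values from the paper, and the function names are ours:

```python
import math

def pearson_fisher_components(n, p_hat, pi):
    """Components z_s = sqrt(n) * pi_s^(-1/2) * (p_hat_s - pi_s),
    one per cell s of the table."""
    return [math.sqrt(n) * (p - q) / math.sqrt(q) for p, q in zip(p_hat, pi)]

def pearson_fisher(n, p_hat, pi):
    """X^2_PF = sum over cells of z_s^2, referred to chi-square with
    T - g - 1 degrees of freedom (T cells, g estimated parameters)."""
    return sum(z ** 2 for z in pearson_fisher_components(n, p_hat, pi))
```

Squaring each component recovers the familiar form $\sum_s n(\hat{p}_s - \pi_s)^2/\pi_s$, so with n = 100, observed proportions (0.5, 0.5) and model probabilities (0.4, 0.6) the statistic equals 25/6.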

#### 2 Type I Error Study

The first simulation included eight manifest variables. One thousand data sets were generated using Monte Carlo methods related to a one-factor model with slope vector $\beta_1 = (0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.2, 0.2)$. Three intercept settings were used; only results for a simulation with intercepts symmetric around zero are shown below. A 2PL item response model with one latent dimension was estimated for each of these data sets, and the empirical Type I error rates of the individual orthogonal components were calculated. Since each individual orthogonal component is distributed approximately as chi-square with one degree of freedom, the empirical Type I error rate for each component was obtained by dividing the number of cases exceeding the chi-square critical value with one degree of freedom (at the 5% significance level) by the number of data sets. A similar process was used to calculate the Type I error rates of the adjusted residuals and $\bar{\chi}^2_{ij}$. This simulation was repeated for sample sizes 300 and 500.

Table 1 below reports the empirical Type I error rates for $q = 8$ manifest variables under the symmetric intercept model. Type I error rates outside the Monte Carlo error interval $0.05 \pm 1.96\sqrt{0.05(0.95)/1000} = (0.0365, 0.0635)$ are bolded. When n=300, the Type I error rates of $GFfit^{\perp}_{(ij)}$ for pairs (4,5) and (5,6) were outside the Monte Carlo error interval. Given that there are twenty-eight individual $GFfit^{\perp}_{(ij)}$ statistics, it is possible that one or two components may randomly fall slightly outside the interval. However, five $\bar{\chi}^2_{ij}$ and four adjusted residuals were outside the Monte Carlo error interval. This suggests that, when n=300, the orthogonal components have better Type I error rates than $\bar{\chi}^2_{ij}$ and the adjusted residuals for $q = 8$ manifest variables under the symmetric intercept model. When n=500, all of the $GFfit^{\perp}_{(ij)}$ and most of the $\bar{\chi}^2_{ij}$ and adjusted residuals were inside the Monte Carlo error interval (0.0365, 0.0635).

Table 1. *Type I Error Study for Symmetric Intercept Model*

| Pair (i,j) | GFfit⊥ (300) | Std. Res. (300) | $\bar{\chi}^2_{ij}$ (300) | GFfit⊥ (500) | Std. Res. (500) | $\bar{\chi}^2_{ij}$ (500) |
|---|---|---|---|---|---|---|
| (1,2) | 0.046 | 0.055 | 0.052 | 0.052 | 0.059 | 0.056 |
| (1,3) | 0.048 | 0.048 | 0.047 | 0.044 | 0.046 | 0.046 |
| (1,4) | 0.044 | 0.057 | 0.054 | 0.051 | 0.051 | 0.047 |
| (1,5) | 0.044 | **0.034** | **0.030** | 0.042 | 0.043 | 0.043 |
| (1,6) | 0.049 | 0.048 | 0.044 | 0.053 | 0.049 | 0.047 |
| (1,7) | 0.051 | 0.063 | 0.060 | 0.043 | 0.045 | 0.045 |
| (1,8) | 0.057 | 0.053 | 0.052 | 0.051 | 0.053 | 0.053 |
| (2,3) | 0.051 | 0.055 | 0.054 | 0.041 | 0.042 | 0.043 |
| (2,4) | 0.039 | 0.046 | 0.049 | 0.038 | 0.047 | 0.046 |
| (2,5) | 0.043 | 0.054 | 0.052 | 0.049 | 0.050 | 0.050 |
| (2,6) | 0.052 | 0.063 | 0.059 | 0.048 | 0.042 | 0.042 |
| (2,7) | 0.043 | 0.059 | 0.060 | 0.048 | 0.049 | 0.047 |
| (2,8) | 0.047 | 0.048 | 0.048 | 0.057 | 0.053 | 0.054 |
| (3,4) | 0.050 | 0.058 | 0.058 | 0.050 | 0.052 | 0.051 |
| (3,5) | 0.042 | 0.038 | 0.038 | 0.043 | 0.044 | 0.042 |
| (3,6) | 0.049 | 0.060 | 0.056 | 0.051 | 0.046 | 0.046 |
| (3,7) | 0.043 | 0.048 | 0.049 | 0.056 | 0.050 | 0.048 |
| (3,8) | 0.041 | 0.043 | 0.043 | 0.047 | 0.039 | 0.040 |
| (4,5) | **0.074** | **0.080** | **0.079** | 0.064 | **0.070** | **0.069** |
| (4,6) | 0.062 | **0.079** | **0.077** | 0.057 | **0.068** | **0.067** |
| (4,7) | 0.037 | 0.054 | 0.052 | 0.037 | 0.051 | 0.049 |
| (4,8) | 0.050 | 0.042 | 0.042 | 0.048 | 0.042 | 0.042 |
| (5,6) | **0.070** | **0.074** | **0.073** | 0.062 | 0.063 | 0.064 |
| (5,7) | 0.039 | 0.044 | 0.043 | 0.039 | 0.037 | 0.038 |
| (5,8) | 0.052 | 0.050 | 0.052 | 0.037 | 0.037 | 0.037 |
| (6,7) | 0.045 | 0.045 | 0.048 | 0.054 | 0.048 | 0.050 |
| (6,8) | 0.037 | 0.044 | 0.044 | 0.049 | 0.037 | 0.038 |
| (7,8) | 0.052 | 0.040 | **0.036** | 0.041 | 0.037 | 0.040 |
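The Monte Carlo error interval used above follows directly from the binomial standard error of a rejection rate estimated from 1000 replications. A quick check (Python; the variable names are ours):

```python
import math

# 95% Monte Carlo error interval for an empirical rejection rate
# estimated from 1000 replications when the true rate is 0.05:
# 0.05 +/- 1.96 * sqrt(0.05 * 0.95 / 1000)
alpha, reps = 0.05, 1000
half_width = 1.96 * math.sqrt(alpha * (1 - alpha) / reps)
interval = (round(alpha - half_width, 4), round(alpha + half_width, 4))
print(interval)  # (0.0365, 0.0635)
```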



#### 3 Power Study for Eight Variables

Asymptotic and empirical power comparisons for the symmetric intercept models are given in Table 2. Higher slope values were allocated to items 4, 5, and 6 on a second latent dimension, so higher power is expected for components related to those item pairs. Examination of the highlighted values in Table 2 shows that the empirical power of the second-order marginal components (4,5), (4,6) and (5,6) is substantially higher than that of the other components. Thus, these second-order components were successful in detecting the source of a poorly fitting model. This process was repeated for n=300 and n=500. The results in these tables show that empirical power increases with the sample size, and the components were more successful in detecting the lack of fit for larger sample sizes. However, when n=300, the empirical power results were somewhat lower than the asymptotic power results, indicating that for smaller sample sizes the empirical distribution may not be close to the hypothesized theoretical distribution. When n=500, the empirical and asymptotic power results were fairly close, indicating that as the sample size increases the empirical distribution approaches the hypothesized theoretical distribution.
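Both the Type I error rates of Section 2 and the empirical power figures here are exceedance frequencies of the same chi-square critical value with one degree of freedom. A minimal sketch of that counting step (Python; the function name is ours, and the critical value is obtained from the squared normal quantile available in the standard library):

```python
from statistics import NormalDist

# chi-square(1) critical value at the 5% level, as the square of the
# standard normal 97.5% quantile (about 3.84)
CRIT_5PCT = NormalDist().inv_cdf(0.975) ** 2

def empirical_rate(stats, critical=CRIT_5PCT):
    """Fraction of simulated component statistics exceeding the critical
    value: the empirical Type I error rate under the null, and the
    empirical power under the alternative."""
    return sum(s > critical for s in stats) / len(stats)
```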


Table 2. *Asymptotic and Empirical Power Comparison for Symmetric Intercept Model*

| Pair (i,j) | GFfit⊥ (300) | Std. Res. (300) | $\bar{\chi}^2_{ij}$ (300) | Asym. power\* (300) | GFfit⊥ (500) | Std. Res. (500) | $\bar{\chi}^2_{ij}$ (500) | Asym. power\* (500) |
|---|---|---|---|---|---|---|---|---|
| (1,2) | 0.068 | 0.070 | 0.073 | 0.05077 | 0.065 | 0.063 | 0.064 | 0.05128 |
| (1,3) | 0.073 | 0.073 | 0.071 | 0.05233 | 0.082 | 0.076 | 0.076 | 0.05389 |
| (1,4) | 0.054 | 0.064 | 0.063 | 0.05796 | 0.071 | 0.080 | 0.081 | 0.06331 |
| (1,5) | 0.064 | 0.072 | 0.072 | 0.05861 | 0.062 | 0.064 | 0.066 | 0.06439 |
| (1,6) | 0.072 | 0.064 | 0.064 | 0.05866 | 0.077 | 0.062 | 0.065 | 0.06448 |
| (1,7) | 0.056 | 0.058 | 0.060 | 0.05000 | 0.051 | 0.055 | 0.055 | 0.05000 |
| (1,8) | 0.063 | 0.063 | 0.062 | 0.05001 | 0.059 | 0.059 | 0.057 | 0.05002 |
| (2,3) | 0.076 | 0.079 | 0.077 | 0.05699 | 0.085 | 0.063 | 0.062 | 0.06169 |
| (2,4) | 0.060 | 0.058 | 0.055 | 0.06535 | 0.088 | 0.080 | 0.081 | 0.07572 |
| (2,5) | 0.064 | 0.062 | 0.067 | 0.06642 | 0.079 | 0.073 | 0.072 | 0.07752 |
| (2,6) | 0.067 | 0.053 | 0.054 | 0.06648 | 0.085 | 0.065 | 0.065 | 0.07763 |
| (2,7) | 0.041 | 0.061 | 0.061 | 0.05000 | 0.030 | 0.052 | 0.050 | 0.05000 |
| (2,8) | 0.051 | 0.052 | 0.048 | 0.05014 | 0.055 | 0.049 | 0.049 | 0.05023 |
| (3,4) | 0.080 | 0.064 | 0.066 | 0.08986 | 0.109 | 0.074 | 0.075 | 0.11717 |
| (3,5) | 0.104 | 0.056 | 0.055 | 0.09223 | 0.116 | 0.075 | 0.074 | 0.12118 |
| (3,6) | 0.121 | 0.048 | 0.049 | 0.09236 | 0.150 | 0.075 | 0.075 | 0.12140 |
| (3,7) | 0.062 | 0.056 | 0.057 | 0.05157 | 0.068 | 0.062 | 0.062 | 0.05262 |
| (3,8) | 0.044 | 0.070 | 0.064 | 0.05004 | 0.052 | 0.061 | 0.060 | 0.05007 |
| (4,5) | **0.531** | **0.601** | **0.599** | **0.60186** | **0.757** | **0.826** | **0.824** | **0.81689** |
| (4,6) | **0.482** | **0.553** | **0.553** | **0.56285** | **0.717** | **0.781** | **0.781** | **0.78068** |
| (4,7) | 0.044 | 0.047 | 0.049 | 0.05011 | 0.046 | 0.068 | 0.069 | 0.05019 |
| (4,8) | 0.055 | 0.065 | 0.065 | 0.05005 | 0.053 | 0.060 | 0.060 | 0.05008 |
| (5,6) | **0.539** | **0.562** | **0.562** | **0.62046** | **0.803** | **0.803** | **0.803** | **0.83304** |
| (5,7) | 0.048 | 0.059 | 0.057 | 0.05000 | 0.035 | 0.054 | 0.055 | 0.05000 |
| (5,8) | 0.054 | 0.054 | 0.056 | 0.05009 | 0.043 | 0.061 | 0.060 | 0.05015 |
| (6,7) | 0.046 | 0.056 | 0.055 | 0.05001 | 0.064 | 0.064 | 0.064 | 0.05002 |
| (6,8) | 0.043 | 0.066 | 0.064 | 0.05009 | 0.048 | 0.065 | 0.066 | 0.05015 |
| (7,8) | 0.064 | 0.070 | 0.072 | 0.05001 | 0.054 | 0.057 | 0.058 | 0.05001 |

\* Asymptotic power was calculated only for the orthogonal components.

#### References

FISHER, R.A. 1924. The conditions under which chi square measures the discrepancy between observation and hypothesis. *Journal of the Royal Statistical Society*, 87, 19–43.

LANCASTER, H. 1969. *The Chi-Squared Distribution*. New York: Wiley.

LIU, Y., & MAYDEU-OLIVARES, A. 2014. Identifying the source of misfit in item response theory models. *Multivariate Behavioral Research*, 49, 354–371.

REISER, M. 1996. Analysis of residuals for the multinomial item response model. *Psychometrika*, 61, 509–528.

REISER, M., CAGNONE, S., & ZHU, J. 2021. An extended GFfit statistic defined on orthogonal components of Pearson's chi-square. *Paper under review*.

### ASSESSING FOOD SECURITY ISSUES IN ITALY: A QUANTILE COPULA APPROACH

Giorgia Rivieccio<sup>1</sup>, Jean-Paul Chavas<sup>2</sup>, Giovanni De Luca<sup>1</sup>, Salvatore Di Falco<sup>3</sup> and Fabian Capitanio<sup>4</sup>

<sup>1</sup> Department of Management Studies and Quantitative Methods, University of Naples Parthenope, Italy (e-mail: giorgia.rivieccio@uniparthenope.it, giovanni.deluca@uniparthenope.it)

<sup>2</sup> Taylor Hall, University of Wisconsin, Madison, USA (e-mail: jchavas@wisc.edu)

<sup>3</sup> Geneva School of Economics and Management, University of Geneva, Switzerland (e-mail: Salvatore.DiFalco@unige.ch)

<sup>4</sup> Department of Veterinary Medicine and Animal Production, University of Naples Federico II, Italy (e-mail: fabian.capitanio@gmail.com)

ABSTRACT: The study investigates the dependence structure of Italian crop yields to provide better insight into the role of climate change and of crop rotation effects on agricultural productivity. Modeling this dependence, in both a contemporaneous and a serial framework, attempts to explain climate change as a possible engine for co-dependency in the tails of the joint distribution, as well as for crop diversification. We use a quantile copula approach to estimate the multivariate distribution of yields across 7 Italian provinces per crop (wheat and corn) over the last 116 years. Findings show a possible dependence on climate for some provinces. Northern regions show higher dependence with lower crop diversification, and are thus more exposed to the risk of climate effects.

KEYWORDS: QAR models, copulas, tail dependence

### 1 Introduction

Food security issues focus our attention on the lower tail of the yield distributions. At the regional level, the issue is co-dependence across crop yields of the closest locations (Chavas *et al.* [2019]). At the national level, the issue is co-dependence across locations for each crop, where the role of climate could emerge. Therefore, we attempt to explore the possible effect of climate by modeling the joint tail behavior of yields among 7 Italian provinces for each crop (corn and wheat). To estimate the yield distribution we propose a two-step estimation method which involves the use of copulas in a multivariate framework. The first step estimates a Quantile AutoRegressive (QAR) model to shape the yield dynamics of each crop and location, taking into account the dependence structure in terms of lagged quantiles. The second step involves the parametric estimation of multivariate copulas among the conditional quantiles of the QAR model estimated in the first step, to measure the whole dependence structure, that is, the contemporaneous and serial dependence as well as the tail dependence of yields across locations per crop. Copulas can be considered a suitable tool to model both co-dependence and extreme dependence (see Nelsen [2006]). Findings reveal that tail dependence coefficients are high among locations per crop, a result suggesting climate as a possible common factor of the joint behavior of yields, specifically for both higher and lower co-movements in some areas. The paper is organized as follows. In Section 2, we provide a brief review of the QAR model and discuss the use of copulas. Finally, Section 3 develops the empirical analysis.


#### 2 Methodology

Let *Yt* be the random variable denoting a crop yield at time *t*, *yt* (*t* = 1,...,*T*) a sample of T observations and *qt*(θ) the corresponding quantiles at θ (with 0 < θ < 1). The Quantile AutoRegression (QAR) model describes the dynamics of the θ-th quantile as:

$$q\_t(\boldsymbol{\theta}) = c + \sum\_{k=1}^{K} a\_k\, q\_{t-k}(\boldsymbol{\theta}) + \sum\_{m=1}^{M} \mathbf{b}'\_m \mathbf{x}\_{t-m}\tag{1}$$

where *M* and *K* are the (possibly different) numbers of lags and *x<sub>t−m</sub>* is a vector of exogenous lagged values which affect *y<sub>t</sub>*. The parameters of the model are estimated by regression quantiles, as introduced by Koenker [2005]. The conditional quantiles of the QAR model for each crop yield are then the input margins of the joint distribution described by the copula function. Copulas describe the dependence structure among variables, and here among quantiles, providing a flexible and well-suited specification of the joint distribution (see Nelsen [2006]). According to Sklar's theorem (Sklar [1959]), the joint distribution function *H* of *q*<sub>1</sub>,...,*q<sub>p</sub>* can be expressed by a copula function *C* defined on the unit interval as

$$H(q\_1(\boldsymbol{\theta}), \dots, q\_p(\boldsymbol{\theta})) = C(F\_1(q\_1(\boldsymbol{\theta})), \dots, F\_p(q\_p(\boldsymbol{\theta})))$$

where *F<sub>i</sub>*(*q<sub>i</sub>*(θ)) (*i* = 1,..., *p*) is the distribution function of the conditional quantile in Eq. (1), and *C* is uniquely determined if the *F<sub>i</sub>* are continuous. An important feature of copulas is their ability to model general forms of dependence, including nonlinear dependence and dependence concentrated in the extreme values of the variables. The association between extreme values, known as tail dependence, is defined in the lower and upper tails, respectively, as the limit for a copula *C* of some *h* variables with respect to the remaining *p*−*h* (see De Luca & Rivieccio [2012] for details). The specific behavior in the tails of the joint distribution can suggest which copula to select among the parametric families. This feature makes it possible to capture nonlinear association among conditional quantiles.
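The first estimation step can be sketched in a few lines: the regression-quantile fit of Eq. (1) is a linear program minimizing the pinball loss (Koenker [2005]). The series, lag order, and quantile level below are illustrative placeholders, not the paper's crop-yield data:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, theta):
    """Regression quantiles via LP: minimize sum theta*u+ + (1-theta)*u-
    subject to y = X beta + u+ - u-, with beta split as beta+ - beta-."""
    n, p = X.shape
    cost = np.concatenate([np.zeros(2 * p),
                           theta * np.ones(n), (1 - theta) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(cost, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p + 2 * n), method="highs")
    z = res.x
    return z[:p] - z[p:2 * p]          # beta = beta+ - beta-

def qar_fit(y, K, theta):
    """QAR(K): regress y_t on an intercept and K lags of y, at quantile theta."""
    T = len(y)
    X = np.column_stack([np.ones(T - K)] +
                        [y[K - j:T - j] for j in range(1, K + 1)])
    return X, y[K:], quantile_regression(X, y[K:], theta)

# illustrative AR(1) series with uniform noise, not the actual yield data
rng = np.random.default_rng(0)
y = np.empty(300); y[0] = 4.0
for t in range(1, 300):
    y[t] = 2.0 + 0.5 * y[t - 1] + rng.uniform(-1, 1)
X, target, beta = qar_fit(y, K=1, theta=0.5)
```

At θ = 0.5 roughly half of the in-sample residuals fall below zero, the defining property of a conditional-median fit; sweeping θ over a grid traces the whole conditional quantile process.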

#### 3 Empirical Application


The data cover the period from 1901 to 2017 and concern yearly crop yields of 7 Italian provinces (see Table 1) and 2 crops (wheat and corn). The dataset also includes the variables *x<sub>t</sub>* = (*t*<sub>0</sub>, *t*<sub>1</sub>, *t*<sub>2</sub>), where *t*<sub>0</sub> is an overall time trend starting at 0 in 2000, *t*<sub>1</sub> is a time trend starting at 0 in 1940, and *t*<sub>2</sub> is a time trend starting at 0 in 1980. The time trends capture technological and structural changes taking place during the sample period. According to BIC, the best univariate QAR model has 3 lags for each province and for both crop varieties, wheat and corn. Elliptical, Archimedean and mixture copulas were applied to model the whole dependence structure of the selected seven provinces for each crop. The most suitable 7-variate copula across provinces, for each crop, is a mixture of Normal and Student-*t* copulas (see Hu [2006] for details), with mixture weights of 0.216 and 0.784 for wheat and 0.751 and 0.249 for corn, respectively. Accordingly, each tail dependence coefficient (λ) is a weighted average of the coefficients of the two copulas, estimated following Demarta & McNeil [2005] (Table 1). Findings reveal high tail dependence coefficients among locations of the same area, a result suggesting climate as a common factor of the joint behavior of yields. In particular, northern regions, generally characterized by lower crop diversification, exhibit higher tail dependence and are thus more exposed to the risk of climate effects.
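The weighted-average tail dependence described above can be sketched as follows: the Gaussian component of the mixture has zero tail dependence, so only the Student-*t* component contributes, via the formula of Demarta & McNeil [2005]. The correlation ρ and degrees of freedom ν below are hypothetical placeholders, not the fitted values:

```python
import numpy as np
from scipy.stats import t as student_t

def t_copula_lambda(rho, nu):
    """Tail dependence of a bivariate Student-t copula (Demarta & McNeil, 2005):
    lambda = 2 * t_{nu+1}( -sqrt((nu+1)(1-rho)/(1+rho)) )."""
    arg = -np.sqrt((nu + 1.0) * (1.0 - rho) / (1.0 + rho))
    return 2.0 * student_t.cdf(arg, df=nu + 1.0)

def mixture_lambda(w_normal, rho, nu):
    """Normal + Student-t mixture copula: the Gaussian part contributes
    zero tail dependence, so lambda is the t-copula value scaled by the
    Student-t mixture weight."""
    return (1.0 - w_normal) * t_copula_lambda(rho, nu)

# wheat-like Normal weight from the text; rho and nu are hypothetical
lam = mixture_lambda(w_normal=0.216, rho=0.6, nu=4.0)
```

The coefficient grows with ρ and shrinks with ν, which is why a heavy-tailed *t* component is needed to reproduce the strong joint extremes observed across nearby provinces.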

Table 1. *Tail dependence estimates from a Normal-Student-t copula mixture across the Italian provinces: Milan (1), Venice (2), Bologna (3), Florence (4), Rome (5), Naples (6), Palermo (7).*

| Coefficient | Wheat | Corn |
|---|---|---|
| λ<sub>12</sub> | 0.5693 | 0.5571 |
| λ<sub>13</sub> | 0.4828 | 0.5763 |
| λ<sub>14</sub> | 0.4801 | 0.5539 |
| λ<sub>15</sub> | 0.4798 | 0.5515 |
| λ<sub>16</sub> | 0.4346 | 0.4585 |
| λ<sub>17</sub> | 0.3789 | 0.3460 |
| λ<sub>23</sub> | 0.5002 | 0.5702 |
| λ<sub>24</sub> | 0.5124 | 0.6023 |
| λ<sub>25</sub> | 0.5245 | 0.5842 |
| λ<sub>26</sub> | 0.4626 | 0.4552 |
| λ<sub>27</sub> | 0.3578 | 0.3138 |
| λ<sub>34</sub> | 0.4776 | 0.5835 |
| λ<sub>35</sub> | 0.4754 | 0.5511 |
| λ<sub>36</sub> | 0.4969 | 0.4749 |
| λ<sub>37</sub> | 0.3724 | 0.3153 |
| λ<sub>45</sub> | 0.5023 | 0.6050 |
| λ<sub>46</sub> | 0.4403 | 0.4511 |
| λ<sub>47</sub> | 0.3496 | 0.3136 |
| λ<sub>56</sub> | 0.4751 | 0.4497 |
| λ<sub>57</sub> | 0.3658 | 0.3214 |
| λ<sub>67</sub> | 0.4422 | 0.3693 |

#### References

CHAVAS, J.P., DI FALCO, S., ADINOLFI, F., & CAPITANIO, F. 2019. Weather effects and their long-term impact on the distribution of agricultural yields: evidence from Italy. *European Review of Agricultural Economics*, 46, 29–51.

DE LUCA, G., & RIVIECCIO, G. 2012. Multivariate tail dependence coefficients for Archimedean copulae. In A. Di Ciaccio et al. (eds.), *Advanced Statistical Methods for the Analysis of Large Data-Sets*, Studies in Theoretical and Applied Statistics. New York: Springer.

DEMARTA, S., & MCNEIL, A.J. 2005. The t copula and related copulas. *International Statistical Review*, 73, 111–129.

HU, L. 2006. Dependence patterns across financial markets: a mixed copula approach. *Applied Financial Economics*, 16, 717–729.

KOENKER, R. 2005. *Quantile Regression*. Cambridge University Press.

NELSEN, R.B. 2006. *An Introduction to Copulas*. New York: Springer.

SKLAR, A. 1959. Fonctions de répartition à *n* dimensions et leurs marges. *Publ. Inst. Statist. Univ. Paris*, 8, 229–231.

## CO-CLUSTERING FOR HIGH DIMENSIONAL SPARSE DATA

Nicoleta Rogovschi<sup>1</sup>

<sup>1</sup> LIPADE, Université de Paris (e-mail: nicoleta.rogovschi@u-paris.fr)

ABSTRACT: Data anonymization is the process of de-identifying sensitive data while preserving its format and data type (Venkataramanan & Shriram, 2016; Raghunathan, 2013). Generally, this is achieved by masking one or more values in order to hide some aspects of the data. In this paper, we propose a co-clustering model for data anonymization based on topological co-clustering. Co-clustering, a simultaneous clustering of the rows and columns of a data matrix, consists in interlacing row clusterings with column clusterings at each iteration (Govaert, 1995); co-clustering exploits the duality between rows and columns, which allows it to deal effectively with high dimensional data.

KEYWORDS: anonymization, co-clustering, self-organizing maps, sparse data

#### 1 Introduction

To mine collected data without breaching security, some rules related to the privacy of the people in the dataset have to be respected. The process of preserving data privacy is called data anonymization, and it has been used for statistical purposes for quite a while.

*k*-anonymity is a general framework for evaluating the amount of privacy in a dataset; since the elimination of key identifiers alone was proven to be insufficient, microdata have also been disclosed using the microaggregation technique (Domingo-Ferrer & Torra, 2001).

Li *et al.*, 2006 introduced the first algorithm that combines clustering and anonymization. The algorithm forms equivalence classes from the database by finding an equivalence class with fewer than *k* records, measuring the distance between that class and the other equivalence classes, and merging it with the nearest one in order to form a cluster of at least *k* elements with minimum information distortion. This method gives good results but is very time consuming.
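The cluster-then-mask idea behind such methods can be illustrated with a minimal fixed-size microaggregation sketch on a single numeric attribute. This is a simplification for illustration, not Li et al.'s algorithm (which builds equivalence classes minimizing information distortion):

```python
import numpy as np

def microaggregate(values, k):
    """k-anonymize one numeric attribute: sort, cut into groups of at
    least k consecutive records, and replace each value by its group
    mean, so every released value is shared by at least k records."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    out = np.empty_like(values)
    start, n = 0, len(values)
    while start < n:
        # the last group absorbs the remainder so no group falls below k
        end = n if n - start < 2 * k else start + k
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

masked = microaggregate([1, 2, 3, 10, 11, 12, 13, 50], k=3)
```

Each released value is now shared by at least *k* = 3 records, while the column mean is preserved exactly; the distortion grows with within-group spread, which is what distortion-minimizing algorithms try to control.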

The topological co-clustering approach leads to a simultaneous clustering of the rows and columns of the data matrix, as well as a projection of the clusters on a two-dimensional grid that preserves the topological order of the initial data.

The co-clustering implicitly performs an adaptive dimensionality reduction at each iteration, leading to better document clustering accuracy compared to one-sided clustering methods (Dhillon, 2001). Co-clustering is also preferred when there is an association relationship between the data and the features (i.e., the columns and the rows) (Ding *et al.*, 2006).

In the text mining field, Dhillon, 2001 proposed a spectral block clustering method exploiting the duality between rows (documents) and columns (words). In the analysis of microarray data, where data are often presented as matrices of expression levels of genes under different conditions, block clustering of genes and conditions has been used to overcome the problem of choosing the similarity on the two sets found in conventional clustering methods (Cheng & Church, 2000). The aim of block clustering is to summarize this matrix by homogeneous blocks.

#### 2 The proposed algorithm

We propose to use topological co-clustering in order to *k*-anonymize a large sparse dataset. This way, the curse of dimensionality is implicitly dealt with, as the algorithm treats both parts simultaneously and the results prove to be more accurate.

The proposed *k*-coTCA approach takes as input the dataset *OT* to anonymize and returns as output an anonymized dataset *AT* of the same size. The algorithm is composed of two steps: co-clustering and anonymization.

#### Topological co-clustering step:

1. Form the affinity matrix *A*.

2. Define *D<sub>r</sub>* and *D<sub>c</sub>* to be the diagonal matrices

$$D\_r = \operatorname{diag}(A\mathbf{1}) \text{ and } D\_c = \operatorname{diag}(A^t\mathbf{1}).$$

3. Find *U*, *V*, the (*g*−1) largest left and right eigenvectors of

$$\tilde{A} = D\_r^{-\frac{1}{2}} A D\_c^{-\frac{1}{2}}.$$

4. From *U* and *V*, form the matrices *U*˜, *V*˜ and

$$\mathbf{D} = \left(\begin{array}{c} \tilde{U} \\ \tilde{V} \end{array}\right).$$

5. Cluster the rows of **D** into *g* clusters by using SOM and compute the prototypes *w*<sup>[*ii*]</sup>.

6. Assign object *i* to cluster *R<sub>k</sub>* if and only if the corresponding row **d**<sub>*i*</sub> of the matrix **D** was assigned to cluster *R<sub>k</sub>*, and assign attribute *j* to cluster *C<sub>k</sub>* if and only if the corresponding row **d**<sub>*j*</sub> of the matrix **D** was assigned to cluster *C<sub>k</sub>*.

#### Anonymization step

For each co-cluster *C<sub>k</sub>*:

• Find the BMU (best matching unit) of each object *j* in *R<sub>k</sub>* using the corresponding *w<sub>jc</sub>*, where *c* is the matching neuron:

$$\left(X\_i^{[ii]} - w\_{jc}^{[ii]}\right)^2$$

• Code each element *j* with its corresponding vector: *X<sub>j</sub>* ← [*w*<sup>[1]</sup><sub>*jc*(1)</sub>, *w*<sup>[2]</sup><sub>*jc*(2)</sub>, ..., *w*<sup>[*P*]</sup><sub>*jc*(*q*)</sub>], where *c*(*q*) is the index of the cell associated with element *j*.
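The co-clustering step can be sketched in a few lines, following the bipartite spectral recipe of Dhillon, 2001: normalize the matrix, take singular vectors, embed rows and columns jointly, and cluster the embedding. Here a plain k-means with deterministic initialization stands in for the SOM used in the text, and the toy document-term matrix is illustrative:

```python
import numpy as np

def spectral_cocluster(A, g):
    """Bipartite spectral co-clustering sketch (Dhillon, 2001):
    normalize A by row/column sums, keep the g-1 singular vectors after
    the trivial leading one, embed rows and columns jointly, cluster."""
    A = np.asarray(A, dtype=float)
    dr, dc = A.sum(axis=1), A.sum(axis=0)            # diag(A 1), diag(A^t 1)
    An = A / np.sqrt(dr)[:, None] / np.sqrt(dc)[None, :]
    U, s, Vt = np.linalg.svd(An)
    Z = np.vstack([U[:, 1:g] / np.sqrt(dr)[:, None],
                   Vt.T[:, 1:g] / np.sqrt(dc)[:, None]])
    # tiny Lloyd k-means; for g=2 initialize at the two embedding extremes
    if g == 2:
        centers = Z[[int(np.argmin(Z[:, 0])), int(np.argmax(Z[:, 0]))]]
    else:
        centers = Z[np.linspace(0, len(Z) - 1, g, dtype=int)]
    for _ in range(50):
        labels = ((Z[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(g):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(axis=0)
    return labels[:A.shape[0]], labels[A.shape[0]:]

# toy document-term matrix: two blocks plus a little cross-block noise
A = np.array([[6., 4., .1, 0.], [4., 6., 0., .1],
              [.1, 0., 5., 1.], [0., .1, 1., 5.]])
rows, cols = spectral_cocluster(A, g=2)
```

On the toy matrix the two document groups and the two word groups are recovered jointly from a single clustering of the stacked embedding, which is the duality the section exploits.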

To evaluate the co-clustering results, we use the Davies-Bouldin index, a clustering evaluation indicator that reflects the quality of the clustering, as a stopping criterion.

In order to compare the performance of our approach with other traditional unsupervised clustering algorithms, we use several text datasets representing the frequency of words in documents. We used eight datasets for document clustering. "Classic30", "Classic150", "Classic300" and "Classic400" are extracts of Classic3 (Dhillon, 2001), which contains three classes denoted Medline, Cisi and Cranfield after their original database sources.

The impact of co-clustering on the utility of anonymized data is quantified as the resulting accuracy of a machine learning model (Rodríguez-Hoyos *et al.*, 2018). To quantify the utility of the dataset for further study, and since all the datasets used are labelled, the best way to evaluate the proposed approaches is an external evaluation, i.e., classification. For this purpose, we designed a decision tree model and used it to see how the anonymized data were classified. We then compared the accuracy of the results of both approaches to understand how much data *quality* was traded for the sake of anonymization. The obtained results show that accuracy does not decrease after anonymization and that the initial structure of the data is maintained.

### 3 Conclusions

In this paper, we introduced a new data anonymization approach based on topological co-clustering, which allows the prototypes to be used as new values for the anonymized data. The experiments show that, using a classification model on the anonymized dataset, the accuracy does not decrease, which means that there is no loss of knowledge from the initial data.

#### References

CHENG, Y., & CHURCH, G. 2000. Biclustering of expression data. Pages 93–103 of: *ISMB2000, 8th International Conference on Intelligent Systems for Molecular Biology*.

DHILLON, I. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. *ACM SIGKDD International Conference*, 269–274.

DING, C., LI, T., PENG, W., & PARK, H. 2006. Orthogonal nonnegative matrix tri-factorizations for clustering. *ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*. Philadelphia, PA: KDD'06.

DOMINGO-FERRER, J., & TORRA, V. 2001. Disclosure control methods and information loss for microdata. *Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies*, 91–110.

GOVAERT, G. 1995. Simultaneous clustering of rows and columns. *Control and Cybernetics*, 24, 437–458.

LI, J., WONG, R.C.-W., FU, A.W.-C., & PEI, J. 2006. Achieving k-anonymity by clustering in attribute hierarchical structures. Pages 405–416 of: *International Conference on Data Warehousing and Knowledge Discovery*. Springer.

RAGHUNATHAN, B. 2013. *The Complete Book of Data Anonymization: From Planning to Implementation*. CRC Press.

RODRÍGUEZ-HOYOS, A., ESTRADA-JIMÉNEZ, J., REBOLLO-MONEDERO, D., PARRA-ARNAU, J., & FORNÉ, J. 2018. Does *k*-anonymous microaggregation affect machine-learned macrotrends? *IEEE Access*, 6, 28258–28277.

VENKATARAMANAN, N., & SHRIRAM, A. 2016. *Data Privacy: Principles and Practice*. Chapman & Hall/CRC.

### MALARIA RISK DETECTION VIA MIXED MEMBERSHIP MODELS

Massimiliano Russo<sup>1</sup>

<sup>1</sup> Harvard-MIT Center for Regulatory Science, Harvard Medical School, & Department of Data Science, Dana-Farber Cancer Institute (e-mail: m russo@hms.harvard.edu)

ABSTRACT: The diffusion of malaria is a complex phenomenon evolving over time and space, driven by several aspects that include economical, biological, behavioral and environmental factors, which act and interact together. We consider as a case study the Machadinho settlement project in Brazil, and provide a risk classification for the households in the area. To accomplish this goal we estimate survey-based environmental and behavioral risk profiles via a mixed membership model. We then validate the model by comparing the predictive ability of the estimated risk profiles for the crude malaria rate with the performance of standard machine learning (ML) tools.

KEYWORDS: Malaria risk, mixed membership models, multivariate categorical variables.

#### 1 Introduction

The risk of malaria infection is favored by multiple and interacting causes that are largely driven by human behaviors and their interaction with the surrounding environment. To evaluate the risk of malaria infection in a certain geographical area, biological and economical aspects juxtaposed with behavioral and environmental factors should be evaluated. We focus on these last two aspects, providing a risk classification for the Machadinho Settlement Project, located in the Rondônia state, Western Brazilian Amazon. The project was approved in 1982, with occupation starting in late 1984. The area was previously a forest sparsely inhabited by rubber tappers (Castro *et al.*, 2006).

Since the early phases of the settlement, malaria diffusion became a problem because of the proliferation of the *Anopheles darlingi* mosquito, the main malaria vector in the Amazon area. The spread of malaria in frontier settlements can profoundly impact the ecosystem at different levels, and its quantification is of primary importance for designing effective measures of mitigation and prevention.

Crude malaria risk measures (e.g., the number of malaria cases reported in a household) can be adopted to determine risk profiles through regression or classification models. However, for the Machadinho settlement these indicators can lead to misleading results because of the presence of transient individuals, responsible for high malaria rates in zones presenting low-risk conditions (see for example Castro *et al.*, 2006). Unsupervised analysis is not affected by this bias, being based only on household features, and can lead to reliable findings for targeted stable populations in the settlement project.


We consider the risk classification provided in Russo *et al.*, 2019 and provide a validation of their method. Specifically, we assess the ability of the estimated risk classification to predict the crude malaria rate and compare it with popular ML tools.

#### 2 Model specification

We follow the model proposed in Russo *et al.*, 2019. We observe categorical variables $X_{ij} \in \{1,\ldots,d_j\}$ for household $i = 1,\ldots,n$ and variable $j = 1,\ldots,p$. These variables are naturally partitioned into two groups: behavioral and environmental. We indicate with $g = (g_1,\ldots,g_p)$ the group of each of the $p$ variables, where $g_j \in \{1,2\}$; 1 codes behavioral variables and 2 environmental ones. Each household is endowed with two membership score vectors $(\lambda_i^{(1)}, \lambda_i^{(2)})$ such that $\sum_{h=1}^{H} \lambda_{ih}^{(g)} = 1$ for $g = 1,2$.

The proposed model can be expressed in the following hierarchical form:

$$\begin{array}{rcl} X_{ij} \mid Z_{ij} = h, \psi_h^{(j)} & \sim & \text{Cat}(\psi_{h1}^{(j)}, \dots, \psi_{hd_j}^{(j)}), \\ Z_{ij} \mid \lambda_i^{(g_j)} & \sim & \text{Cat}(\lambda_{i1}^{(g_j)}, \dots, \lambda_{iH}^{(g_j)}), \\ (\lambda_i^{(1)}, \lambda_i^{(2)}) & \sim & \text{MLND}(\mu, \Sigma). \end{array} \tag{1}$$

Here $Z_{ij} \in \{1,\ldots,H\}$ are discrete latent variables that identify $H$ latent groups. The notation $X \sim \text{Cat}(\pi_1,\ldots,\pi_d)$ indicates a $d$-dimensional categorical random variable, i.e. $\text{pr}(X = k) = \pi_k$ for $k = 1,\ldots,d$, while $\text{MLND}(\mu,\Sigma)$ is the multivariate logistic normal distribution introduced in Russo *et al.*, 2019.
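As a concrete illustration, the hierarchical form of model (1) can be simulated as follows. All dimensions and parameters below are hypothetical, and for simplicity the two domain-specific score vectors are drawn from independent logistic-normal distributions rather than from a joint MLND with a coupling covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper's data)
n, p, H = 500, 6, 2               # households, variables, latent groups
d = np.full(p, 3)                 # each categorical variable has 3 levels
g = np.array([1, 1, 1, 2, 2, 2])  # domain per variable: 1=behavioral, 2=environmental

# psi[j][h, :] = category probabilities of variable j within latent group h
psi = [rng.dirichlet(np.ones(d[j]), size=H) for j in range(p)]

def logistic_normal(mu, Sigma, size):
    """Map Gaussian draws to the simplex via the additive logistic transform."""
    z = rng.multivariate_normal(mu, Sigma, size=size)
    e = np.exp(np.column_stack([z, np.zeros(size)]))
    return e / e.sum(axis=1, keepdims=True)

# Membership scores: one simplex vector per domain (independent for simplicity)
lam = {dom: logistic_normal(np.zeros(H - 1), np.eye(H - 1), n) for dom in (1, 2)}

# Draw Z_ij from the domain-specific scores, then X_ij from the group's psi
X = np.empty((n, p), dtype=int)
for i in range(n):
    for j in range(p):
        h = rng.choice(H, p=lam[g[j]][i])
        X[i, j] = rng.choice(d[j], p=psi[j][h])
```

Each row of `X` is a simulated household whose behavioral and environmental answers are governed by its two membership vectors, mirroring the two-layer structure of (1).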

Model (1) provides a low-dimensional summary of the analyzed variables. In fact, for a given latent group $h$, the vectors $\{\psi_h^{(j)}, j = 1,\ldots,p\}$ describe the $h$-th group in terms of the observed variables. The scores $\lambda_i^{(1)}$ and $\lambda_i^{(2)}$ can be interpreted as degrees of similarity of the $i$-th individual to the latent groups (see Figure 1 for a representation). Hence, partitioning the variables into behavioral and environmental domains and choosing $H = 2$, the scores $\lambda_i^{(1)}$ and $\lambda_i^{(2)}$ can be interpreted as summaries of behavioral and environmental risk. We refer to Russo *et al.*, 2019 for additional details, prior specification, and a description of the algorithm used to approximate the posterior of model (1).

Figure 1. *Representation of the risk profiles for subject $i$ and the $p$ variables. In this example, variable 1 is associated with domain 1 (behavioral) and variable $p$ with domain 2 (environmental).*

#### 3 Model validation


We consider an external validation of model (1), checking its performance in out-of-sample prediction of the crude malaria rate. We use 5-fold cross-validation, comparing the results with random forest, lasso prediction, and PCA regression on the raw data. For the random forest and the lasso we used the R packages randomForest and glmnet, respectively. We refer to Castro *et al.*, 2006 for a detailed description of the data.

Note that crude malaria rates and the risk profiles estimated from model (1) measure different aspects of risk. Therefore, they are expected to be related, but they are not expressed on the same scale. To overcome this issue, we assume that the estimated risks are proportional to the malaria rate $r_i$ via:

$$r_i = c_1 \hat{\lambda}_i^{(1)} + c_2 \hat{\lambda}_i^{(2)},$$

where $\hat{\lambda}_i^{(1)}$ and $\hat{\lambda}_i^{(2)}$ are the posterior means of the behavioral and environmental risks estimated via (1). The coefficients $c_1$ and $c_2$ are estimated via least squares on a training sample. To avoid overfitting in the considered ML models, for each fold we select tuning parameters with an additional cross-validation.
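A minimal sketch of this validation step, using simulated stand-in values for the posterior-mean scores and the rates (the actual Machadinho data are not reproduced here): on each training fold $(c_1, c_2)$ are fit by least squares, and the held-out fold is scored by mean squared prediction error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in values: simplex scores and rates roughly proportional to them
n = 200
lam_hat = rng.dirichlet(np.ones(2), size=n)   # columns: behavioral, environmental
r = 0.3 * lam_hat[:, 0] + 0.7 * lam_hat[:, 1] + rng.normal(0.0, 0.02, n)

# 5-fold cross-validation of the least-squares fit of (c1, c2)
folds = np.array_split(rng.permutation(n), 5)
mse = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[m] for m in range(5) if m != k])
    c, *_ = np.linalg.lstsq(lam_hat[train], r[train], rcond=None)
    mse.append(np.mean((r[test] - lam_hat[test] @ c) ** 2))

print(f"mean CV MSE: {np.mean(mse):.4f}")
```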

Based on Table 1, multivariate mixed membership scores can predict malaria rates with out-of-sample predictive performance comparable to that of black-box machine learning algorithms, such as random forests. The ML algorithms and other supervised approaches are subject to the selection bias issue mentioned in Section 1. This result gives evidence that model (1) provides a reasonable low-dimensional representation in the considered context.

Table 1. *5-fold cross-validation mean square prediction error (standard deviation), divided by year.*

| Method | 1985 | 1986 | 1987 | 1995 |
|---|---|---|---|---|
| Random forest | 0.104 (0.026) | 0.131 (0.023) | 0.108 (0.011) | 0.021 (0.006) |
| Lasso regression | 0.104 (0.026) | 0.169 (0.017) | 0.123 (0.005) | 0.021 (0.006) |
| PCA regression | 0.101 (0.030) | 0.161 (0.021) | 0.121 (0.005) | 0.021 (0.006) |
| Russo *et al.*, 2019 | 0.099 (0.024) | 0.148 (0.015) | 0.105 (0.003) | 0.021 (0.006) |

#### References

CASTRO, MARCIA CALDAS, MONTE-MÓR, ROBERTO L., SAWYER, DIANA O., & SINGER, BURTON H. 2006. Malaria risk on the Amazon frontier. *Proceedings of the National Academy of Sciences*, 103(7), 2452–2457.

RUSSO, MASSIMILIANO, SINGER, BURTON H., & DUNSON, DAVID B. 2019. Multivariate mixed membership modeling: Inferring domain-specific risk profiles. *arXiv preprint arXiv:1901.05191*.


### NONPARAMETRIC ESTIMATION OF THE NUMBER OF CLUSTERS FOR DIRECTIONAL DATA

Paula Saavedra-Nieves<sup>1</sup> and Rosa M. Crujeiras<sup>1</sup>

<sup>1</sup> Department of Statistics, Mathematical Analysis and Optimization, Universidade de Santiago de Compostela (e-mails: paula.saavedra@usc.es, rosa.crujeiras@usc.es)

ABSTRACT: Set estimation is focused on the reconstruction of a set (or the estimation of any of its features, such as its volume or its boundary) from a random sample of points. Target sets to be estimated may appear in different contexts, but from a distribution-based perspective, level set estimation is a problem of interest. This theory is also linked to clustering methods: Hartigan (1975) defines the number of population clusters as the number of connected components of density level sets. This topic has received some attention in the literature, especially for densities supported on a Euclidean space. However, just as density level sets, this clustering approach can be easily extended to more general settings such as the circle or the sphere.

The rationale for the definition of cluster provided by Hartigan (1975) is closely related to the notion of mode. In fact, several clustering algorithms are based on the detection of modes, noting that the number of modes (local maxima of the density function) is rarely smaller than the number of clusters. Nevertheless, the concept of cluster is easier to handle, since it has a global and geometrical nature, whereas the local maxima depend on analytical properties.

In this work, we derive some methodology for estimating the number of directional clusters as the number of connected components of directional level sets. From an empirical perspective, directional level sets are estimated using a nonparametric plug-in reconstruction (see, for instance, Saavedra-Nieves and Crujeiras, 2020). An extensive simulation study shows the performance of this estimator for densities supported on the unit circle and the sphere. Additionally, this methodology is applied to analyse a real data set.
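The plug-in idea for circular data can be sketched as follows; the sample, the von Mises smoothing parameter, and the level below are hypothetical choices for illustration, not those studied in the contribution:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical circular sample: two von Mises modes, at 0 and at pi
x = np.concatenate([rng.vonmises(0.0, 8.0, 300), rng.vonmises(np.pi, 8.0, 300)])

# Plug-in step: von Mises kernel density estimate on a grid of directions
kappa = 20.0  # concentration (smoothing) parameter, fixed by hand here
grid = np.linspace(-np.pi, np.pi, 720, endpoint=False)
kern = np.exp(kappa * np.cos(grid[:, None] - x[None, :]))
dens = kern.mean(axis=1) / (2 * np.pi * np.i0(kappa))

# Estimated level set {dens >= t}; clusters = connected arcs, counted
# circularly so components crossing the -pi/pi cut are merged
t = 0.1
above = np.concatenate([dens >= t, dens[:1] >= t])
rises = int(np.sum(~above[:-1] & above[1:]))
n_clusters = rises if rises > 0 else int(above.all())
print(n_clusters)
```

Counting circular rising edges of the thresholded density yields one count per connected arc, which is the Hartigan-style estimate of the number of clusters at level `t`.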

KEYWORDS: Connected components, density level sets, directional data.

#### References


HARTIGAN, J. A. 1975. *Clustering algorithms*. John Wiley & Sons, Inc.

SAAVEDRA-NIEVES, P., & CRUJEIRAS, R. M. 2020. Nonparametric estimation of directional highest density regions. *arXiv preprint arXiv:2009.08915*.

### TENSOR-VARIATE FINITE MIXTURE MODEL FOR THE ANALYSIS OF UNIVERSITY PROFESSOR REMUNERATION


Shuchismita Sarkar<sup>1</sup>, Volodymyr Melnykov<sup>2</sup> and Xuwen Zhu<sup>2</sup>

<sup>1</sup> Bowling Green State University, (e-mail: ssarkar@bgsu.edu)

<sup>2</sup> University of Alabama, (e-mail: vmelnykov@cba.ua.edu, xzhu20@cba.ua.edu)

ABSTRACT: Finite mixture modeling and model-based clustering of matrix- and tensor-variate data has recently gained a lot of attention. In this paper, a novel tensor regression mixture model is proposed to analyze salary data collected by the American Association of University Professors over a span of thirteen years at several faculty rank and gender levels. Most studies involving faculty remuneration employ linear regression models intended for predicting individual salaries. Such models, however, are not suitable for developing strategies and policies at the institutional level. The tensor regression mixture framework adopted in this paper allows for a university-level analysis of faculty remuneration by considering the heterogeneous, skewed, multi-way, and temporal nature of the data. The developed model addresses several important issues related to gender equity and peer institution comparison.

KEYWORDS: finite mixture model, model-based clustering, EM algorithm, tensor regression mixture model

### SPECIFYING COMPOSITES IN STRUCTURAL EQUATION MODELING: THE HENSELER-OGASAWARA SPECIFICATION

Florian Schuberth<sup>1</sup>

<sup>1</sup> Department of Design, Production and Management, University of Twente, The Netherlands (e-mail: f.schuberth@utwente.nl)

ABSTRACT: Structural equation modeling (SEM) is a versatile statistical method that can deal in principle with latent variables and composites. In practice, however, researchers using SEM encounter problems incorporating composites into their models. To overcome this problem, I present a specification for SEM that was recently sketched by Henseler (2021) to incorporate composites in structural models. It draws from the same idea that was proposed by Ogasawara (2007) to conduct a canonical correlation analysis in SEM. Therefore, the specification is dubbed the Henseler-Ogasawara (H-O) specification. In the H-O specification, a set of observed variables forming a composite is expressed by a set of synthetic variables, which are labeled as emergent and excrescent variables. An emergent variable is a linear combination of variables that is related to other variables in the structural model, whereas an excrescent variable is a linear combination of variables that is unrelated to all other variables in the structural model. This approach is advantageous over existing approaches, as it offers the same flexibility in terms of model specification for modeling with composites as SEM provides for modeling with latent variables. As a consequence, the H-O specification makes all existing developments in SEM available for modeling with composites, such as testing parameter estimates, testing for overall model fit and dealing with missing values.

KEYWORDS: model specification, composite model, emergent variable, excrescent variable, components

#### References

HENSELER, J. 2021. *Composite-Based Structural Equation Modeling: Analyzing Latent and Emergent Variables.* Guilford Press.

OGASAWARA, H. 2007. Asymptotic expansions of the distributions of estimators in canonical correlation analysis under nonnormality. *Journal of Multivariate Analysis*, 98, 1726–1750.


### NETWORK ANALYSIS IMPLEMENTING A MIXTURE DISTRIBUTION FROM BAYESIAN VIEWPOINT

Jarod Smith<sup>1</sup>, Mohammad Arashi<sup>2</sup> and Andriette Bekker<sup>1</sup>

<sup>1</sup> Department of Statistics, University of Pretoria, South Africa (e-mail: jarodsmith706@gmail.com, andriette.bekker@up.ac.za)

<sup>2</sup> Department of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad, Iran (e-mail: arashi@um.ac.ir)

ABSTRACT: Differential networks (DN) are important tools for modeling the changes in conditional dependencies between multiple samples. A Bayesian approach for estimating DNs, from the classical viewpoint, is introduced with a computationally efficient threshold selection for graphical model determination. The algorithm separately estimates the precision matrices of the DN using the Bayesian adaptive graphical lasso procedure. Synthetic experiments illustrate that the Bayesian DN performs exceptionally well in numerical accuracy and graphical structure determination in comparison to state of the art methods. The proposed method is applied to South African COVID-19 data to investigate the change in DN structure between various phases of the pandemic.

KEYWORDS: Bayesian graphical lasso, differential network, double-exponential distribution, Gaussian graphical model, precision matrix

### **MEASUREMENT ERRORS IN MULTIPLE SYSTEMS ESTIMATION**

Paul A. Smith<sup>1</sup>, Peter G.M. van der Heijden<sup>1,2</sup> and Maarten Cruyff<sup>2</sup>

1 Department of Social Statistics & Demography, University of Southampton, UK, (e-mail: p.a.smith@soton.ac.uk)

2 Dept of Social Sciences, Methodology and Statistics, Utrecht University, Netherlands, (e-mail: P.G.M.vanderHeijden@uu.nl, m.cruyff@uu.nl)

**ABSTRACT**: Dual and multiple system estimation use the presence ('capture') of people in different data sources as the basis for estimation of the population size. Where further characteristics are also available, these can be used to provide estimates of the population size classified by these characteristics. We consider the situation that there are measurement errors in these classifying variables, but not in the linkage of people between data sources. We consider strategies to produce estimates of the population size and breakdown using a consistent, adjusted definition taking account of all the evidence in the collected data sources.

**KEYWORDS**: capture-recapture, latent class analysis, ethnicity, population size estimation.

#### **1 Introduction**


Dual and multiple system estimation have a long history of use to estimate the size of populations which cannot be completely observed, and in recent years there have been many applications to estimating the size of human populations. In the simplest cases this may result from observing people on two sources, and using an assumption of independence between the sources to obtain an estimated population size. When there are more sources, interactions between the sources can be fitted, and an appropriate model needs to be fitted to the (implied) contingency table formed from the presence or absence of people in each source. In general this procedure assumes no errors of observation, and that no errors are made in linking people on the different sources. If an independent estimate of the linkage errors can be obtained, it can be used in an adjusted estimator (Zult *et al*. 2021). However, in this paper we work with the usual framework that assumes that linkage is made perfectly.
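For the two-source case, the independence assumption yields the classical dual-system (Petersen-type) estimate. A toy numeric sketch with invented counts:

```python
# Hypothetical two-source counts: n11 seen in both sources,
# n10 only in source A, n01 only in source B
n11, n10, n01 = 60, 140, 90

# Independence between sources implies the unobserved cell is estimated
# as n10 * n01 / n11, which gives the classical N_hat = nA * nB / n11
n00_hat = n10 * n01 / n11
N_hat = n11 + n10 + n01 + n00_hat

nA, nB = n11 + n10, n11 + n01
assert N_hat == nA * nB / n11  # the two forms agree algebraically
print(N_hat)  # 500.0
```

With more sources, the same logic generalizes to fitting a loglinear model (possibly with source interactions) to the incomplete contingency table, as described below for the three- and four-source applications.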

Auxiliary information is often available on the different sources, in addition to the existence of a person (or record), and this information may be used in linkage where it corresponds to a stable characteristic. Other variables are of more substantive interest, and may be expected to vary between sources for a number of reasons: they may be characteristics which vary in time, or they may be measured in different ways in different data sources, leading to variations in the measurement. In this latter case we may consider that there is an underlying 'true' variable, and that one or more of the sources that we are using observe a version of this variable with some added measurement error. The process of linking datasets means that some variables will not be observed for some records, and that no variables are observed for records in none of the sources, the number of which will nevertheless be estimated during population size estimation.

In this paper we consider strategies for dealing with population size estimation, broken down using variables measured in one or more sources and subject to measurement error in this way. Section 2 deals with solutions based on explicit decisions about which measure is the best, and with simple combinations of variables, and Section 3 with the use of a latent class model to derive an underlying measure, which we consider to estimate the 'true' measure based on the available data.

#### **2 Population size estimation with a preferred covariate source**

First we consider that there are two sources, and both sources contain what is conceptually the same covariate, though we know or suspect that they are measured differently, or that their resulting quality is different because of the way they are collected. Van der Heijden *et al*. (2018) present an example where characteristics of accidents, specifically whether a motorised vehicle was involved, are recorded both by the police and by hospitals. It would be possible to treat these as the same variable, but investigation of the data where the 'motorised vehicle' variable is available from both sources shows that about 5% of cases have discrepancies. Instead we treat them as two different variables, and construct a four-way contingency table formed from presence/absence on the two sources and the motorised/non-motorised variables in the two sources. We then use loglinear modelling to choose a suitable model for this contingency table, and use this model together with the EM algorithm to produce a completed table (Table 1), which provides an estimate of the missing part of the population, and also estimates of the population sizes in each cell of the contingency table (where they are not observed). This allows us to add up in any way we want to achieve a set of consistent estimates.

In the accidents example, we have reasonable confidence that the police register is better at recording whether a motorised vehicle is involved, as gathering this information is part of the police function. So we consider the classification of the total according to the variable X1 in the police register (whether observed or estimated) to be the correct one. And since the full dataset, cross-classified by both police and hospital versions of the motorised vehicle variable, is available, we can make inferences about the measurement error in the hospital version.

Table 1: Completed road accidents table. A is the police register, B the hospital register, X1 is the police record of motor vehicle involvement, and X2 the hospital record.

In a situation where the relative merits of the measurements are less clear, we could pragmatically use the average of the population size estimates under the different versions of the auxiliary variable.

#### **3 Latent class models**

A further approach is to treat the different measurements separately in the population size estimation, but then to embed them in a latent class model (LCM), which postulates an underlying, unobserved parameter related to all the separate measurements, and which can be interpreted as the true parameter. This approach can be considered when there are at least three measurements. It is conceptually different from using LCMs to deal with heterogeneity in the capture probabilities (as in Stanghellini & van der Heijden 2004). Van der Heijden et al. (2021) apply this approach in analysing four linked data sources in New Zealand – the population census, the health register, the birth registration register, and an education register (covering largely, but not only, tertiary education). Each of these sources includes an ethnicity variable, which we consider in a simplified version recoded to Māori or all other ethnicities. We would like to estimate both the size of the population in New Zealand and the size of the Māori population based on these sources.

Two approaches are possible. In the first, we treat the four sources using multiple system estimation, fitting a loglinear model to the eight-way table formed by the inclusion or not in each source and Māori ethnicity or not. Some of the estimates from a saturated model go to infinity, so a reduced form of the model is needed to obtain parameter estimates with reasonable interpretation and stability. The estimates arising from this model (including the estimates of the size of the unobserved part of the population) are then used as the inputs to a latent class model with two latent classes. This gives a two-part procedure which has the advantage of being close to the original model for the 8-way contingency table. The model produces estimates of the size of the Māori population from one of the latent classes, which can be interpreted as the true Māori variable. It can also be used to give estimates of the errors in the four observed variables in measuring this underlying Māori concept.
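The loglinear modelling of the full four-source table is beyond a short sketch, but the core idea of multiple system estimation, predicting the unobserved cell of a capture table, can be illustrated with the simplest two-source (dual-system) case. The counts below are hypothetical, not the New Zealand data.

```python
def dual_system_estimate(n_a, n_b, n_ab):
    """Lincoln-Petersen dual-system estimator: the simplest multiple system
    estimation, equivalent to fitting the independence loglinear model [A][B]
    to a 2x2 capture table and predicting the unobserved (not-in-A,
    not-in-B) cell."""
    n_hat = n_a * n_b / n_ab        # estimated total population size
    observed = n_a + n_b - n_ab     # individuals appearing in at least one list
    missed = n_hat - observed       # estimated size of the unobserved cell
    return n_hat, missed

# Hypothetical counts: 100 records in list A, 80 in list B, 40 linked in both
n_hat, missed = dual_system_estimate(100, 80, 40)   # -> (200.0, 60.0)
```

With more than two sources, the same prediction of the missing cell is obtained from a loglinear model fitted to the observed cells, as in the eight-way table described above.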

The second approach aims to include the latent class model directly in the modelling of the 8-way contingency table. The assumption of the latent class model is that the (unobserved) interactions between the observed variables and the latent variable explain all the interactions among the observed variables in the original data. Therefore we replace [abcd] in the original model with [aX][bX][cX][dX], where X is the latent variable (and a, b, c and d label the ethnicity variables in the four data sources). This replaces all interactions of a, b, c, and d, so any terms containing two or more of these parameters are dropped from the model (which serves to make the loglinear model hierarchical with respect to interactions, and therefore more easily interpretable). This leaves a latent class model embedded in the multiple system estimation, and van der Heijden *et al*. (2021) call this the latent class multiple system estimation (LCMSE) model.

Table 2: Estimates of probabilities from latent class models with two latent classes. Class *r* = 1 is interpreted as non-Māori, and class 2 as Māori.

In the application to data from the New Zealand Integrated Data Infrastructure (IDI-ERP), the LCMSE has a slightly lower estimate of the number of Māori and a slightly higher overall population estimate than the two-step procedure based on latent class estimation using the multiple system estimation results. The LCMSE therefore takes a more conservative approach to the definition of Māori in this dataset.

The population census has generally been held to be the best measure of Māori ethnicity among the different sources available in New Zealand, and it has low values for measurement error for both Māori and non-Māori under both approaches (Table 2). The health register has a low error for non-Māori, but the largest measurement error for Māori. The births and education registers are similar to the census in the estimated measurement error for the Māori class, but have more error in estimating the non-Māori class. Overall, therefore, our results support the conclusion that the census is the best overall measure of ethnicity.

#### **References**


### ROBUST CLASSIFICATION IN HIGH DIMENSIONS USING REGULARIZED COVARIANCE ESTIMATES

Valentin Todorov1 and Peter Filzmoser2

<sup>1</sup> United Nations Industrial Development Organization, AUSTRIA, (e-mail: valentin@todorov.at)

<sup>2</sup> Vienna University of Technology, (e-mail: p.filzmoser@tuwien.ac.at)

ABSTRACT: High-dimensional, highly correlated data arise in many application domains and make classical classification methods like LDA and QDA practically useless, because they suffer from the singularity problem when the number of observed variables p exceeds the number of observations n. A number of regularization techniques have been developed to stabilize the classifier and to achieve improved classification performance, and several studies compare the various regularization techniques to facilitate the choice of a method. However, these methods are vulnerable to outlying observations (outliers) in the training data set, which can influence the obtained classification rules and make the results unreliable. On the other hand, the high breakdown point versions of discriminant analysis proposed in the literature do not work, or are not reliable, in high dimensions. We propose to utilize the recently introduced regularized versions of the minimum covariance determinant (MCD) estimator, RMCD and MRCD, thus combining high robustness to outliers, computability in high dimensions, and readily available software in R. Simulated and real data examples show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings.

KEYWORDS: robust classification, regularization, outliers, MCD
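As a rough illustration of the idea (not the RMCD/MRCD estimators themselves, which are implemented in R), the sketch below uses scikit-learn's classical MCD estimator, `MinCovDet`, to build a robust Mahalanobis-distance classifier; it therefore applies only when n > p, unlike the regularized versions proposed in the abstract. The data and settings are invented.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# Two Gaussian classes in 5 dimensions; the training set of class 0 is
# contaminated with a few gross outliers that would wreck plain QDA.
X0 = rng.normal(0.0, 1.0, size=(100, 5))
X1 = rng.normal(3.0, 1.0, size=(100, 5))
X0[:5] += 50.0

# Robust (MCD) location/scatter estimate per class
mcd = {0: MinCovDet(random_state=0).fit(X0),
       1: MinCovDet(random_state=0).fit(X1)}

def classify(X):
    """Assign each row to the class with the smallest robust squared
    Mahalanobis distance (equal priors and equal |Sigma| assumed)."""
    d = np.column_stack([mcd[g].mahalanobis(X) for g in (0, 1)])
    return d.argmin(axis=1)

# Clean test points
Xt = np.vstack([rng.normal(0.0, 1.0, size=(50, 5)),
                rng.normal(3.0, 1.0, size=(50, 5))])
yt = np.array([0] * 50 + [1] * 50)
acc = (classify(Xt) == yt).mean()
```

Because MCD selects the half-sample with the smallest covariance determinant, the five shifted points are effectively ignored and the classification rule stays close to the one estimated from clean data.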

### CLUSTERING VIA NEW PARSIMONIOUS MIXTURES OF HEAVY TAILED DISTRIBUTIONS


Salvatore D. Tomarchio1, Luca Bagnato2 and Antonio Punzo1

<sup>1</sup> Department of Economics and Business, University of Catania, (e-mail: daniele.tomarchio@unict.it, antonio.punzo@unict.it)

<sup>2</sup> Department of Economic and Social Sciences, Università Cattolica del Sacro Cuore, (e-mail: luca.bagnato@unicatt.it)

ABSTRACT: Two families of parsimonious mixture models are used for model-based clustering. They are based on the multivariate shifted exponential normal and the multivariate tail-inflated normal distributions, heavy-tailed generalizations of the multivariate normal. Parsimony is achieved via the eigen-decomposition of the component scale matrices, as well as by imposing a constraint on the tailedness parameter. Two variants of the expectation-maximization algorithm are used for parameter estimation. Identifiability conditions are illustrated, and the advantages of our models with respect to other existing parsimonious heavy-tailed mixture models are discussed. Our models are first tested via simulation studies, and then compared to some competing models in real data applications.

KEYWORDS: mixture models, model-based clustering, parsimony, heavy-tailed distributions
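The parsimony device mentioned in the abstract, the eigen-decomposition of a component scale matrix into volume, orientation, and shape, can be sketched as follows. This is a generic illustration of the decomposition Σ = λΓΔΓ^T with det(Δ) = 1, not the authors' implementation, and all the numbers are made up.

```python
import numpy as np

def scale_matrix(lam, theta, shape):
    """Parsimonious 2x2 scale matrix Sigma = lam * Gamma @ Delta @ Gamma.T:
    lam > 0 controls the volume, the rotation Gamma the orientation, and the
    diagonal Delta (normalised so that det(Delta) = 1) the shape."""
    gamma = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    delta = np.diag(shape / np.sqrt(np.prod(shape)))   # enforce det(Delta) = 1
    return lam * gamma @ delta @ gamma.T

# Constraining lam, Gamma or Delta across mixture components (equal, variable,
# or identity) yields the usual families of parsimonious models.
sigma = scale_matrix(lam=2.0, theta=np.pi / 6, shape=np.array([4.0, 1.0]))
# det(sigma) = lam**p = 4.0 in p = 2 dimensions, since det(Delta) = 1
```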

### A GENERAL BI-CLUSTERING TECHNIQUE FOR FUNCTIONAL DATA

Agostino Torti12 , Marta Galvani1, Alessandra Menafoglio1, Piercesare Secchi<sup>12</sup> and Simone Vantini1

<sup>1</sup> MOX - Department of Mathematics, Politecnico di Milano, Italy


<sup>2</sup> Center for Analysis Decisions and Society, Human Technopole, Milano, Italy

ABSTRACT: The problem of bi-clustering in the Functional Data Analysis framework is considered, with the aim of simultaneously clustering the rows and columns of a data matrix whose entries are functions, possibly taking values in a multidimensional space. A definition of bi-cluster for functional data is given and a novel bi-clustering method, called Functional Cheng and Church (FunCC), is developed. The FunCC method is a non-parametric and very flexible technique, able to discover bi-clusters based on a flexible modelling depending on the problem at hand.

KEYWORDS: Bi-clustering, Clustering, Functional Data

Nowadays, many systems collect information at high frequency, recording multiple phenomena at the same time in an almost continuous way. For this reason, researchers have put a lot of effort into the development of new statistical methods able to deal with this new type of data. In particular, functional data analysis (FDA) is the branch of statistics that deals with random variables taking values in an infinite-dimensional functional space; see Ramsay (2004) for more details.

In this contribution, we consider the problem of bi-clustering functional data, which has recently been addressed in the literature, and we describe the methods we proposed in Galvani *et al.* (2021) and Torti *et al.* (2021). Bi-clustering methods, widely known thanks to the work of Cheng & Church (2000), allow the discovery of subgroups of observations behaving in a similar way on a subset of features, or vice-versa. This is of particular interest when the data are intrinsically ordered in a matrix structure and the aim is to simultaneously group the rows and the columns of the data matrix, without constraining the rows (or the columns) to belong to only one group over all the features (or the observations), as in classical clustering methods. In the FDA framework, only a few works are dedicated to the problem of bi-clustering functional data framed in a matrix structure. Bouveyron *et al.* (2018) developed a parametric bi-clustering technique, based on the functional latent block model, to co-cluster different electricity consumption curves on different days. However, this approach relies on strong modelling assumptions on the data, which are hardly verified in the FDA framework, applies only to univariate functional data, and only detects exhaustive bi-clusters, i.e. a checkerboard structure that does not always fit real data. An alternative extension of bi-clustering to the functional realm is proposed by Di Iorio & Vantini (2019): given a set of functions, they propose an algorithm to identify sub-domains of the original functional domain where a subset of functions shows similar patterns. In our work, we proceed along the same line introduced by Bouveyron *et al.* (2018) and go a step further, developing a non-parametric algorithm able to discover non-exhaustive bi-clusters in a data matrix whose entries are functions, possibly taking values in a multidimensional space.

First, we introduce a novel methodology, called FunCC, based on an extension of the Cheng and Church algorithm: an iterative, non-parametric procedure which finds flexible and non-exclusive bi-clusters for univariate functional data. Then, the FunCC algorithm is extended to the general case of multivariate data, i.e. to the bi-clustering of data matrices whose entries are multivariate functional data. In this way, we are able to deal with bi-clustering problems where multiple aspects are observed at the same time for each observation. For more details about the developed methodology and the implemented algorithm, see Galvani *et al.* (2021) and Torti *et al.* (2021).

Given a dataset arranged in a matrix $A$ with $n$ rows and $m$ columns, the aim of a bi-clustering technique is to find a sub-matrix $B(I,J) \subseteq A$, corresponding to a subset of rows $I$ and a subset of columns $J$, with a *similar behavior*. In particular, in the Cheng and Church algorithm (Cheng & Church (2000)), an ideal bi-cluster is a set of rows $I$ and a set of columns $J$ such that each element in the bi-cluster can be represented as the average value in the bi-cluster plus a row component and a column component. A measure of goodness is evaluated for each sub-matrix $B(I,J)$ through a similarity score, the *Mean Squared Residual* between all the observed values and their representative values in the bi-cluster, and the sub-matrix $B(I,J)$ is selected as a bi-cluster if its similarity score is lower than a threshold value.

Extending these concepts to the FDA framework, each cell of the data matrix $A$ contains a function $f_{ij}(t)$ defined on a continuous domain $T$.

Definition 0.1 *Given a data matrix $A$, an ideal bi-cluster is a sub-matrix $B(I,J) \subseteq A$ such that each element $f_{ij}$, with $i \in I$ and $j \in J$, can be expressed as*

$$f_{ij}(t) = \mu(t) + a\alpha_i(t) + b\beta_j(t), \quad \forall i \in I,\ \forall j \in J,\ t \in T$$

*with $(a,b) \in \{0,1\}^2$ fixed by the analyst, $\mu$ defined for the bi-cluster $B(I,J)$ as $\mu(t) = \frac{1}{|I||J|}\sum_{i \in I}\sum_{j \in J} f_{ij}(t)$ for $t \in T$, and $\alpha_i$ and $\beta_j$ being the row and column components, respectively, such that $\sum_{i \in I} \alpha_i = 0$ and $\sum_{j \in J} \beta_j = 0$.*

Starting from Definition 0.1 (Galvani *et al.* (2021)), it is possible to obtain different kinds of ideal bi-clusters, associated with different application contexts, by choosing $a$ and $b$ differently. For example, setting $(a,b) = (0,0)$ in Definition 0.1, only the average value in the bi-cluster is considered, hence the ideal bi-cluster is composed of a group of functions $f_{ij}$ all equal to the average function $\mu$ of the bi-cluster. Moreover, while $\mu$ is evaluated as the average of the functions contained in $B(I,J)$, the computation of the row and column components $\alpha_i$ and $\beta_j$ depends on their functional form. If $\alpha_i$ and $\beta_j$ are assumed to be functional objects, they can be evaluated as the average functional residuals of rows and columns, respectively, with respect to the average function $\mu$, i.e. $\alpha_i(t) = \frac{1}{|J|}\sum_{j \in J} f_{ij}(t) - \mu(t)$ and $\beta_j(t) = \frac{1}{|I|}\sum_{i \in I} f_{ij}(t) - \mu(t)$. If instead $\alpha_i$ and $\beta_j$ are assumed to be constant, they can be consistently evaluated as the average values of the functional residuals of rows and columns, respectively, with respect to the average function $\mu$, i.e. $\alpha_i = \frac{1}{|T|}\int_T \big(\frac{1}{|J|}\sum_{j \in J} f_{ij}(t) - \mu(t)\big)\,dt$ and $\beta_j = \frac{1}{|T|}\int_T \big(\frac{1}{|I|}\sum_{i \in I} f_{ij}(t) - \mu(t)\big)\,dt$.

In practice, we want to find sub-matrices $B(I,J)$ as similar as possible to an ideal bi-cluster, i.e. sub-matrices $B(I,J)$ which minimize a specific objective function. The so-called *H*-score measures the deviation of the selected elements from an ideal bi-cluster (Cheng & Church (2000)). In our case, we define the *H*-score of a sub-matrix $B(I,J)$ as

$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} \left\| f_{ij} - f_{ij}^0 \right\|_{L^2}^2$$

with $f_{ij}^0(t) = \mu(t) + a\alpha_i(t) + b\beta_j(t)$ being the template function of the bi-cluster, where $(a,b)$, $\mu$, $\alpha_i$ and $\beta_j$ are defined as in Definition 0.1.

Note that the definition above can also be generalised to the multivariate case, i.e. to data matrices $A$ whose entries are multivariate functional data $f_{ij} = (f_{ij}^1, \ldots, f_{ij}^P)$ with a one-dimensional domain and a $P$-dimensional codomain, $P \geq 1$. In this case, the definition of an ideal bi-cluster is restated so that each element of the bi-cluster can be expressed on each $p$-th dimension, with $p \in \{1, \ldots, P\}$, as in Definition 0.1. Consistently, a measure of goodness of the bi-cluster can be evaluated by computing the $H$-score of a sub-matrix $B(I,J)$ as the average of the single $H$-scores over the $P$ dimensions. In both the univariate and multivariate functional cases, the implemented algorithm starts from the whole dataset and tries to find the largest bi-cluster with an $H$-score lower than a given threshold $\delta$, by adding/removing rows/columns. Each time a row/column is added or removed, the $H$-score is updated. For more details about the steps of the algorithm and the choice of the threshold parameter $\delta$, see Galvani *et al.* (2021) and Torti *et al.* (2021).
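A minimal numerical sketch of the *H*-score (for the a = b = 1 case, with functions discretized on a regular grid) is given below; it is illustrative only and not the released FunCC implementation. For an exactly additive bi-cluster the score is zero up to floating-point error.

```python
import numpy as np

def h_score(F, dt):
    """H-score of a functional bi-cluster.

    F has shape (|I|, |J|, T): functions f_ij sampled on a grid of T points
    with spacing dt. The template is f0_ij = mu + alpha_i + beta_j (the
    a = b = 1 case of Definition 0.1), and the score averages the squared
    L2 distances ||f_ij - f0_ij||^2 over the cells of the bi-cluster."""
    mu = F.mean(axis=(0, 1))                  # average function of the bi-cluster
    alpha = F.mean(axis=1) - mu               # row components, shape (|I|, T)
    beta = F.mean(axis=0) - mu                # column components, shape (|J|, T)
    f0 = mu[None, None, :] + alpha[:, None, :] + beta[None, :, :]
    sq_l2 = ((F - f0) ** 2).sum(axis=2) * dt  # ||f_ij - f0_ij||^2 per cell
    return sq_l2.mean()                       # average over the |I||J| cells

# An ideal additive bi-cluster: f_ij(t) = mu(t) + alpha_i(t) + beta_j(t)
t = np.linspace(0.0, 1.0, 101)
mu = np.sin(2 * np.pi * t)
alpha = np.array([0.5, -0.5])[:, None] * t            # zero-sum row effects
beta = np.array([1.0, 0.0, -1.0])[:, None] * t ** 2   # zero-sum column effects
F = mu[None, None, :] + alpha[:, None, :] + beta[None, :, :]
score = h_score(F, dt=t[1] - t[0])                    # essentially zero
```

The actual algorithm would evaluate this score repeatedly while rows and columns are added or removed, accepting the sub-matrix when the score drops below the threshold δ.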

DEVELOPING A MULTIDIMENSIONAL AND HIERARCHICAL INDEX FOLLOWING A COMPOSITE-BASED APPROACH Laura Trinchera1

<sup>1</sup> Department of Information Systems, Supply Chain Management and Decision-making,

ABSTRACT: The development of a measurement instrument involves establishing a link between the concepts (theoretical world) and the data (empirical world) (Zeller & Carmines, 1980) . When modelling multidimensional concepts on a higher level of abstraction is common practice to include higher-order constructs as a proxy of such concepts. Higher-order construct are defined as constructs whose indicators are not directly observable but again constructs (Henseler, 2021). Following Henseler's (2021) classification we can specify

Each of these four specifications needs to apply a different validation process of the measurement instrument. During this presentation I will discuss the more recent advances on higher-order construct validation (Schuberth *et al.*, 2020).

HENSELER, J. 2021. *Composite-Based Structural Equation Modeling: Analyz-*

MASSIERA, P, TRINCHERA, L, & RUSSOLILLO, G. 2018. Evaluating the presence of marketing capabilities: a multidimensional, hierarchical index.

SCHUBERTH, F, RADEMAKER, M E, & J, HENSELER. 2020. Estimating and assessing second-order constructs using PLS-PM: the case of composites

KEYWORDS: measurement model, PLS path modelling, formative index

*ing Latent and Emergent Variables.* Guilford Press.

*Recherche et Applications en Marketing*, 33, 30–52.

NEOMA Business School, (e-mail: laura.trinchera@neoma-bs.fr)

high-order constructs according to 4 different structures:

References


To bi-cluster a data matrix whose entries are functions possibly taking values in a multidimensional space, a bi-clustering technique - called Functional Cheng and Church (FunCC) - is developed. The presented approach is non parametric, thus no assumptions are made on the distribution generating the data, and very flexible, allowing to discover non-exhaustive and different biclusters depending on the problem at hand. During the presentation of this contribution, we will show the performance of the developed methodology both on simulated data and on real case studies stimulated by challenging research questions related to mobility infrastructures.

Acknowledgement: the authors gratefully acknowledge Trenord for providing the data that will be shown during the presentation of this contribution.

#### References


## DEVELOPING A MULTIDIMENSIONAL AND HIERARCHICAL INDEX FOLLOWING A COMPOSITE-BASED APPROACH

Laura Trinchera1

<sup>1</sup> Department of Information Systems, Supply Chain Management and Decision-making, NEOMA Business School, (e-mail: laura.trinchera@neoma-bs.fr)

ABSTRACT: The development of a measurement instrument involves establishing a link between the concepts (theoretical world) and the data (empirical world) (Zeller & Carmines, 1980) . When modelling multidimensional concepts on a higher level of abstraction is common practice to include higher-order constructs as a proxy of such concepts. Higher-order construct are defined as constructs whose indicators are not directly observable but again constructs (Henseler, 2021). Following Henseler's (2021) classification we can specify high-order constructs according to 4 different structures:


Each of these four specifications needs to apply a different validation process of the measurement instrument. During this presentation I will discuss the more recent advances on higher-order construct validation (Schuberth *et al.*, 2020).

KEYWORDS: measurement model, PLS path modelling, formative index

#### References

the implemented algorithm starts considering the whole dataset and try to find the biggest bi-cluster with a *H*-score value lower then a given threshold δ by adding/removing rows/columns. Each time a row/column is added/removed, the *H*-score is updated. For more details about the steps of the algorithm and the choice of the treshold parameter δ see Galvani *et al.* (2021) and Torti *et al.*

To bi-cluster a data matrix whose entries are functions possibly taking values in a multidimensional space, a bi-clustering technique - called Functional Cheng and Church (FunCC) - is developed. The presented approach is non parametric, thus no assumptions are made on the distribution generating the data, and very flexible, allowing to discover non-exhaustive and different biclusters depending on the problem at hand. During the presentation of this contribution, we will show the performance of the developed methodology both on simulated data and on real case studies stimulated by challenging re-

Acknowledgement: the authors gratefully acknowledge Trenord for providing the data that will be shown during the presentation of this contribution.

#### References

BOUVEYRON, CHARLES, BOZZI, LAURENT, JACQUES, JULIEN, & JOLLOIS, FRANÇOIS-XAVIER. 2018. The functional latent block model for the co-clustering of electricity consumption curves. *Journal of the Royal Statistical Society: Series C (Applied Statistics)*, 67(4), 897–915.

CHENG, YIZONG, & CHURCH, GEORGE M. 2000. Biclustering of expression data. *Pages 93–103 of: Ismb*, vol. 8.

DI IORIO, JACOPO, & VANTINI, SIMONE. 2019. funBI: a Biclustering Algorithm for Functional Data. *MOX-Report No. 46/2019*.

GALVANI, MARTA, TORTI, AGOSTINO, MENAFOGLIO, ALESSANDRA, & VANTINI, SIMONE. 2021. FunCC: a new bi-clustering algorithm for functional data with misalignment. *Computational Statistics & Data Analysis*.

HENSELER, J. 2021. *Composite-Based Structural Equation Modeling: Analyzing Latent and Emergent Variables.* Guilford Press.

RAMSAY, JAMES O. 2004. Functional data analysis. *Encyclopedia of Statistical Sciences*, 4.

SCHUBERTH, F., RADEMAKER, M. E., & HENSELER, J. 2020. Estimating and assessing second-order constructs using PLS-PM: the case of composites. *Industrial Management and Data Systems*, 120, 2211–2241.

TORTI, AGOSTINO, GALVANI, MARTA, MENAFOGLIO, ALESSANDRA, SECCHI, PIERCESARE, & VANTINI, SIMONE. 2021. A General Biclustering Algorithm for Hilbert Data: Analysis of the Lombardy Railway Service. *Mox-Report No. 21/2021*.

ZELLER, R. A., & CARMINES, E. G. 1980. *Measurement in the social sciences: The link between theory and data.* CUP Archive.

## A GENERALISED CLUSTERWISE REGRESSION FOR DISTRIBUTIONAL DATA

Rosanna Verde<sup>1</sup>, Francisco de A. T. de Carvalho<sup>2</sup> and Antonio Balzanella<sup>1</sup>

<sup>1</sup> Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", (e-mail: rosanna.verde@unicampania.it, antonio.balzanella@unicampania.it)

<sup>2</sup> CIN-UFPE, Av. Jornalista Anibal Fernandes, s/n - Cidade Universitária, 50.740-560, Recife, PE, Brasil, (e-mail: fatc@cin.ufpe.br)

ABSTRACT: This paper deals with a cluster-wise regression method for distributional data. The objects to cluster are observed on a dependent character and on a set of explanatory variables. A dependence relation is then assumed, which can be improved by considering local structures among the data. The proposed algorithm is based on the K-means clustering algorithm: the centroids of the clusters are linear regression models, and the objects are assigned to the clusters according to the minimum sum of squared errors. The generalised CR algorithm is based on a linear regression model for distributional variables and on a K-means algorithm developed for such data; both methods use the *L*<sup>2</sup> Wasserstein distance.

KEYWORDS: Distributional data, Clusterwise regression, K-means, Wasserstein distance.

#### 1 Introduction


In this paper we propose a cluster-wise regression strategy for distributional-valued data. Clusterwise Regression (CR) methods aim at identifying both the partition of a set of data into a fixed number of clusters and regression models as representative elements of the clusters. A pioneering paper for the search of local models for clustered data is the Typological Principal Component Analysis (Diday, 1974). It carries out *K* sub-spaces of maximal inertia, assigning elements to the clusters according to the minimum distances from the local factorial planes, until convergence to a stable partition and to *K* final subspaces. Späth (Späth, 1979; Späth, 1991) introduced a criterion for partitioning a set of objects into *K* classes, establishing a regression model within each class. Preda & Saporta, 2005 use PLS regression for solving an ill-posed problem in clusterwise regression. Moreover, DeSarbo & Cron, 1988 and Hennig, 2000 proposed mixture-model-based clusterwise regression. They assume that the response variable estimations, related to the clusters, are obtained as mixtures of *K* conditional density distributions.

In this framework, we propose a Cluster-wise Regression method for Distributional data (CRD). The latter are a particular kind of multi-valued Symbolic Data, like intervals, multi-categorical data, histograms or continuous distributions (Bock & Diday, 2000). Many exploratory statistical methods have been extended to such data, especially considering them as suitable aggregated data, which are assuming more and more relevance for the treatment of high-dimensional data. Among the methods proposed in the Symbolic Data Analysis context, a CR method for interval data was presented by De Carvalho *et al.*, 2010. It performs a double regression on the centers and on the radii of the intervals, recalling a suitable strategy for interval data analysis. Recently, De Carvalho *et al.*, 2021 have developed a nonlinear clusterwise regression which extends the previous proposal. A prediction model based on CR for data aggregated as empirical distributions was proposed by Suresh *et al.*, 2020.

Our method aims at clustering distributional-valued data into *K* clusters according to a local dependence structure between distributional variables. Consistently with the K-means algorithm, the centroids of the clusters are expressed as ordinary least squares (OLS) regression models, and the objects are assigned to the clusters using as criterion the minimum increase of the sum of squared errors. According to the type of variables, the generalised CR algorithm is based on a linear regression model (Irpino & Verde, 2015) and on a K-means algorithm (Irpino & Verde, 2007) for distributional data; both methods use the *L*<sup>2</sup> Wasserstein distance (Wasserstein, 1969) as a measure of distance between distributions. Moreover, we propose to determine the optimal number of clusters *K* according to a criterion of global best fitting of the cluster regression models. In the same way, a selection of the best explanatory variables for each cluster regression model is carried out in order to improve the prediction of the dependent variable in each cluster. The final cluster regression models are evaluated using the root mean square error (RMSE), the goodness-of-fit *R*<sup>2</sup> index and the Pseudo-*R*<sup>2</sup> index. For the sake of brevity, we have omitted some promising results obtained on real and synthetic distributional data sets.

#### 2 Clusterwise Regression for Distributional-valued data (CRD)

Let *W* = {*w*<sub>1</sub>,...,*w<sub>N</sub>*} be a set of *N* objects described by *p*+1 distributional-valued variables. We assume that one of the *p*+1 distributional-valued variables, denoted by *Y*, depends on the *p* explanatory variables *X<sub>j</sub>* (*j* = 1,..., *p*). Each object *w<sub>i</sub>* (1 ≤ *i* ≤ *N*) is represented by *p*+1 distributions (or distributional-valued data): *f<sub>i</sub><sup>y</sup>*, *f<sub>ij</sub><sup>x</sup>* (*j* = 1,..., *p*). The CRD method looks for a clustering of the data set *W* into *K* clusters according to the best-fitting regression model for each cluster. The regression model used to fit clustered distributional data was introduced by Irpino & Verde, 2015, as follows:


$$y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j x_{ij}^c(t) + e_i(t), \quad \forall t \in [0,1] \tag{1}$$

where: β<sub>0</sub> is the constant term; β<sub>j</sub> are the coefficients associated with the averages *x̄<sub>ij</sub>* of each distribution *f<sub>ij</sub>*; γ<sub>j</sub> are the coefficients of the centred quantile functions *x<sup>c</sup><sub>ij</sub>* (*j* = 1,..., *p*).

The Sum of Squared Errors (SSE) function, as in the least squares method, is computed using the *L*<sup>2</sup> Wasserstein distance.
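For univariate distributions, the *L*<sup>2</sup> Wasserstein distance between two distributions with quantile functions *Q<sub>i</sub>* and *Q<sub>i′</sub>* has a closed form that splits into a location term and a shape term. This standard decomposition (used in the Irpino & Verde line of work and recalled here for the reader's convenience) is what motivates the separate β (mean) and γ (centred quantile) coefficients in model (1):

$$d_W^2(f_i, f_{i'}) = \int_0^1 \big(Q_i(t) - Q_{i'}(t)\big)^2 dt = (\mu_i - \mu_{i'})^2 + \int_0^1 \big(Q_i^c(t) - Q_{i'}^c(t)\big)^2 dt,$$

where μ<sub>i</sub> is the mean and *Q<sup>c</sup><sub>i</sub>* = *Q<sub>i</sub>* − μ<sub>i</sub> the centred quantile function; the cross term vanishes because centred quantile functions integrate to zero.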

For a fixed number *K* of clusters, the CR algorithm seeks the best partition *P<sub>K</sub>* = {*C*<sub>1</sub>,...,*C<sub>K</sub>*} and the best-fitting models *ŷ<sup>k</sup>* for each cluster *C<sub>k</sub>* by minimising:

$$SSE(\beta_0^k, \beta_j^k, \gamma_j^k \mid P_K) = \sum_{k=1}^{K} \sum_{i \in C_k} \int_0^1 \Big[ y_i(t) - \Big( \beta_0^k + \sum_{j=1}^{p} \beta_j^k \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j^k x_{ij}^c(t) \Big) \Big]^2 dt \tag{2}$$

An element *w<sub>i</sub>* is assigned to a cluster *C<sub>k</sub>* according to the minimum squared distance *ê*<sup>2</sup><sub>ik</sub> from the estimated regression model *ŷ<sup>k</sup>*:

$$\min_k : \quad \hat{e}_{ik}^2 = \int_0^1 \Big[ y_i(t) - \Big( \hat{\beta}_0^k + \sum_{j=1}^{p} \hat{\beta}_j^k \bar{x}_{ij} + \sum_{j=1}^{p} \hat{\gamma}_j^k x_{ij}^c(t) \Big) \Big]^2 dt \tag{3}$$

The convergence of the algorithm is guaranteed by the decrease of the criterion at each step, as the fit of the cluster regression models improves.
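The alternation described above (fit one regression model per cluster, then reassign each object to the cluster whose model yields the smallest squared error) can be illustrated in the classical scalar setting. The sketch below is a simplified analogue, not the distributional method itself: it uses plain OLS and squared residuals in place of the *L*<sup>2</sup> Wasserstein machinery, and `clusterwise_regression` is an illustrative name.

```python
import numpy as np

def clusterwise_regression(X, y, K=2, n_iter=50):
    """K-means-style clusterwise linear regression (scalar sketch).
    Alternates (1) an OLS fit within each cluster and (2) the
    reassignment of each observation to the cluster whose model
    gives the smallest squared residual."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), np.asarray(X)])     # design matrix with intercept
    beta_g, *_ = np.linalg.lstsq(Xd, y, rcond=None)       # global fit, used only to initialise
    ranks = np.argsort(np.argsort(y - Xd @ beta_g))       # rank the global-fit residuals
    labels = np.minimum(ranks * K // n, K - 1)            # quantile bins as initial partition
    for _ in range(n_iter):
        betas = []
        for k in range(K):
            idx = labels == k
            if idx.sum() < Xd.shape[1]:                   # degenerate cluster: fall back to all data
                idx = np.ones(n, dtype=bool)
            beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
            betas.append(beta)
        B = np.column_stack(betas)                        # (p+1, K) coefficient matrix
        new_labels = ((y[:, None] - Xd @ B) ** 2).argmin(axis=1)
        if np.array_equal(new_labels, labels):            # partition stable: criterion cannot decrease
            break
        labels = new_labels
    return labels, betas
```

On data generated from two well-separated linear regimes, the iteration recovers one regression line per cluster; the stopping rule mirrors the criterion-decrease argument above.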

We consider two indexes to evaluate the goodness of fit of the clusterwise regressions: the Ω index proposed by Dias & Brito, 2015, and the *RMSE<sub>W</sub>* (Root Mean Square Error, according to the *L*<sup>2</sup> Wasserstein distance), computed for each cluster (denoted as Ω<sup>k</sup> and *RMSE<sub>W</sub>*(*C<sub>k</sub>*)), as well as the total *RMSE<sub>W</sub>*(*P<sub>K</sub>*) for the entire partition. To determine the best number *K* of clusters, we consider the total *RMSE<sub>W</sub>*(*P<sub>K</sub>*) as a measure of the total within-cluster variability. According to the elbow method, we choose the number of clusters such that adding another cluster does not lead to an important decrease of the total *RMSE<sub>W</sub>*(*P<sub>K</sub>*). Finally, a forward selection of the explanatory variables allows defining the best cluster regression models as well as the variables which most affect the prediction of the response variable. It is worth noticing that the regression models can differ in the importance of the predictors from one cluster to another: the more the estimated cluster regression models differ, the more the linear relations differ across the observed data subsets of the partition.

#### References

BOCK, H.-H., & DIDAY, E. 2000. *Analysis of symbolic data: exploratory methods for extracting statistical information from complex data*. Heidelberg: Springer Verlag.

DE CARVALHO, F. A. T., SAPORTA, G., & QUEIROZ, DANILO N. 2010. A Clusterwise Center and Range Regression Model for Interval-Valued Data. *In: Proc. of COMPSTAT'2010*.

DE CARVALHO, F. A. T., LIMA NETO, EUFRÁSIO DE A., & DA SILVA, KASSIO C. F. 2021. A clusterwise nonlinear regression algorithm for interval-valued data. *Information Sciences*, 555, 357–385.

DESARBO, W. S., & CRON, W. L. 1988. A maximum likelihood methodology for clusterwise linear regression. *Journal of Classification*, 5, 249–282.

DIAS, S., & BRITO, P. 2015. Linear regression model with histogram-valued variables. *Statistical Analysis and Data Mining*, 8(2), 75–113.

DIDAY, E. 1974. Introduction à l'analyse factorielle typologique. *Revue de Statistique Appliquée*, 22(4), 29–38.

HENNIG, C. 2000. Identifiability of models for clusterwise linear regression. *Journal of Classification*, 17, 273–296.

IRPINO, A., & VERDE, R. 2007. *Dynamic Clustering of Histogram Data: Using the Right Metric*. Berlin: Springer Verlag.

IRPINO, A., & VERDE, R. 2015. Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. *Advances in Data Analysis and Classification*, 9(1), 81–106.

PREDA, C., & SAPORTA, G. 2005. Clusterwise PLS regression on a stochastic process. *Computational Statistics & Data Analysis*, 49(1), 99–108.

SPÄTH, H. 1979. Clusterwise linear regression. *Computing*, 22, 367–373.

SPÄTH, H. 1991. Algorithm 48: A fast algorithm for clusterwise linear regression. *Computing*.

SURESH, N., BRITO, P., & DIAS, S. 2020. Prediction of pollution levels from atmospheric variables: A study using clusterwise symbolic regression. *In: Proc. RECPAD'20*.

WASSERSTEIN, L. 1969. Markov processes over denumerable products of spaces describing large systems of automata. *Prob. Inf. Transm.*, 5, 47–52.

## **A MACHINE LEARNING APPROACH FOR EVALUATING ANXIETY IN NEUROSURGICAL PATIENTS DURING THE COVID-19 PANDEMIC**

Vezzoli M.<sup>1</sup>, Doglietto F.<sup>2</sup>, Renzetti S.<sup>1</sup>, Fontanella M.M.<sup>2</sup>, Calza S.<sup>1</sup>

<sup>1</sup> Department of Molecular and Translational Medicine, University of Brescia, (e-mail: marika.vezzoli@unibs.it, stefano.renzetti@unibs.it, stefano.calza@unibs.it)

<sup>2</sup> Neurosurgery, Department of Medical and Surgical Specialties, Radiological Sciences and Public Health, University of Brescia, (e-mail: francesco.doglietto@unibs.it, marco.fontanella@unibs.it)

**ABSTRACT**: In 2020, the COVID-19 pandemic forced many countries into lockdown, postponing nonurgent neurosurgical procedures. After the lockdown, neurosurgical patients admitted to eastern Lombardy hospitals filled in pre- and post-operative questionnaires which measured anxiety (State Anxiety Inventory), worries related to COVID-19, and safety perception during hospital admission. These data were merged with information on age, sex, primary pathology, and time on the surgical waiting list. By means of Random Forest, Variable Importance Measures and Partial Dependence Plots, we identified which variables had the strongest impact on anxiety and safety perception. The results highlighted that worry about positivity to SARS-CoV-2 was associated with anxiety, while bed distance and hand sanitizer availability were associated with a feeling of safety.

**KEYWORDS**: COVID-19, random forest, variable importance measure, partial dependence plot

#### **1 Introduction**

In 2020, the COVID-19 pandemic forced Italy and many other countries over the world into lockdown. In that period, in Lombardy nonurgent neurosurgical procedures were rescheduled from the end of May 2020.

Although stress and anxiety during the COVID-19 pandemic are being investigated in the general population (Gasteiger, 2021), no studies have investigated anxiety in patients whose neurosurgery was postponed.

The aim of this study was to investigate anxiety in neurosurgical patients undergoing nonurgent surgical procedures in the post-lockdown phase of the COVID-19 pandemic. Moreover, we inspected the perception of safety from SARS-CoV-2 infection during hospitalization. Data of various natures (qualitative and quantitative), including state anxiety, were collected in hospitals mainly located in eastern Lombardy, an area of Italy extremely affected by COVID-19. Since, during the COVID-19 period, the percentage of anxious patients who must undergo surgery is 25%, the study required 100 patients for estimating the expected proportion with 8.5% accuracy (95% CI). The study was approved by the local ethics committee (Study n. 4290; COVID-SAFENSG).
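The stated sample size follows from the usual normal-approximation formula for estimating a proportion, n = z²p(1−p)/e²; a quick check (assuming z = 1.96 for the 95% confidence level, which the abstract does not state explicitly):

```python
import math

# Sample size for estimating a proportion p with margin of error e
# at 95% confidence: n = z^2 * p * (1 - p) / e^2
z = 1.96    # 95% normal quantile (assumed)
p = 0.25    # expected proportion of anxious patients
e = 0.085   # required accuracy (margin of error)
n = z ** 2 * p * (1 - p) / e ** 2
print(math.ceil(n))   # → 100
```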

### **2 Methods**

#### *2.1 Inclusion criteria, questionnaires and clinical data*

Inclusion criteria for the study were: adult patients (>18 years) undergoing nonurgent neurosurgical procedures who consented to study participation. Each patient filled in three questionnaires: two before surgery and one after. The first questionnaire collected demographic data (age, sex, and highest academic degree), days of postponement of the surgery, and fear related to the disease, COVID-19 and hospitalization (measured on a Likert scale from 1 (not at all) to 10 (very)).

The second questionnaire, widely used and validated in many languages, was the State Anxiety Inventory (STAI-State) (Spielberger, 2010), which contains 20 questions on a 4-point Likert scale. It measures the latent construct of state anxiety related to an event in a specific moment, such as a surgical procedure. Each item ranges from a minimum of 1 to a maximum of 4 points; hence the score ranges from a minimum of 20 to a maximum of 80.

The last questionnaire collected patients' impressions (Likert scales from 1 (not at all) to 10 (very)) on safety from SARS-CoV-2 infection during hospitalization. The first and third questionnaires were tested at the beginning of June 2020 on an external and independent sample of 30 subjects in order to improve the questions' semantics and their comprehension. Answers were collected with REDCap, a secure web application for building and managing online surveys and databases. Clinical data, provided by the neurosurgeon in charge of the patient, included, among others, prolongation of time on the waiting list and postponement of hospital admission.

#### *2.2 Machine learning approach*

Two different models were used to identify which covariates (*X*) have the greatest impact on the outcomes (*Y*<sub>1</sub> and *Y*<sub>2</sub>, which are ordinal variables). Since the variables were qualitative and quantitative, mostly asymmetrical, and related to *Y* by nonlinear relationships, Random Forest (RF; Breiman, 2001) was applied and, for each model, 10,000 regression trees were grown.

To highlight the relationships between covariates and outcomes, two additional methods were used: (*i*) the relative Variable Importance Measure (relVIM; Vezzoli, 2011), which identifies the covariates with the greatest impact on the prediction of *Y* (VIM > 50); (*ii*) partial dependence plots (PDPs; Friedman, 2001), which visualize the functional relationship between the covariates selected at point (*i*) and the RF predictions. Figure 1 provides a visual summary of this three-step procedure. Analyses were performed with R 4.0.1.

**Figure 1.** Three-step procedure based on Random Forest, VIM and PDP. Step 1: the dataset is repeatedly perturbed, obtaining *BOOT* = 10,000 bootstrap samples, and a tree is grown on each of them; the RF combines the predictions extracted by the trees using the average (regression) or majority vote (classification). Step 2: relative VIM (mean decrease in accuracy), rescaled so that the first covariate has VIM = 100; covariates with VIM > 50 are considered relevant. Step 3: partial dependence plots for the relevant covariates.

**Table 1.** Covariates used in the RF1 (left) and RF2 (right) models. RF1 (state anxiety) covariates include: worry about positivity to Coronavirus; age; worry about the pathology for which the patient was admitted; worry about the surgical procedure; anxiety about a possible worsening of the condition; stress during the waiting time to admission; the number of days the patient would have been willing to postpone admission; the increase of concern about hospitalization due to COVID-19; becoming positive to COVID-19 during hospitalization; the perception of time from neurosurgical evaluation to admission; and the perceived usefulness of the pre-operative COVID-19 screening. RF2 (feeling of safety) covariates include: the distance between beds; hand sanitizer gel available in the hospital; masks; health personnel following security protocols; the procedures to prevent infection from COVID-19; body temperature measurement at the hospital entrance; the reassuring behavior of health personnel; the sanitization of hospital environments; perceived safety in the neurosurgical ward and in the operating room; the preparedness of health personnel for the post-operative period; and the perceived safety of the pre-operative COVID-19 screening.
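The two model-agnostic tools of Steps 2 and 3 can be sketched independently of the fitted forest. The authors worked in R 4.0.1; the Python code below is purely illustrative (the function names are ours, and a simple stand-in prediction function replaces the trained RF): permutation-based importance rescaled so that the top covariate scores 100, and a one-dimensional partial dependence curve.

```python
import numpy as np

def relative_vim(predict, X, y, rng=None):
    """Permutation variable importance, rescaled so that the top
    covariate scores 100 (relVIM-style). `predict` stands in for a
    fitted model such as a random forest."""
    rng = rng or np.random.default_rng(0)
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between X_j and y
        scores.append(np.mean((y - predict(Xp)) ** 2) - base_mse)
    scores = np.array(scores)
    return 100.0 * scores / scores.max()

def partial_dependence(predict, X, j, grid):
    """One-dimensional PDP: average prediction when covariate j is
    forced to each grid value, the others kept at observed values."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v
        pdp.append(predict(Xv).mean())
    return np.array(pdp)
```

With a prediction function that depends strongly on the first covariate, weakly on the second and not at all on the third, the relVIM ranking and the monotone PDP of the first covariate come out as expected.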

### **3 Results and discussion**

After the exclusion of 11 patients due to significant missing data, 123 subjects (M/F: 64/59; mean age 60.28 (SD = 15.08) years) were included in the study, with 114 variables. When modeling state anxiety (STAI-State; RF1 in Fig. 2, left), the patients' condition was significantly associated with the worry of being positive for SARS-CoV-2. This was the first variable identified by the VIM, followed by intuitive ones such as the concern for the primary pathology, the surgery, and the worsening of their condition, as well as the waiting time. In fact, hospital admission to neurosurgery was postponed by a mean of 49.72 days, due to organizational issues (83%) or, rarely, to positivity to SARS-CoV-2 (1.6%). Our data confirm that psychological support should be enhanced during outbreaks, possibly using novel solutions to provide follow-up care remotely during waiting times.

This study also investigated the feeling of safety conveyed by different features that were activated in all Italian hospitals during the pandemic (RF2, in Fig. 2 on the right). Interestingly, the increased distance between surgical beds was the first factor associated with a feeling of safety from SARS-CoV-2, followed by the availability of hand sanitizers. These data might be interpreted as a result of the ongoing social media communication on the importance of social distancing; we believe they might be important for hospital managers and to optimize communication with patients during this pandemic.

**Figure 2:** Results from RF1 and RF2

#### **References**

BREIMAN, L. 2001. Random Forests. *Mach. Learn.* **45**, 5-32.


FRIEDMAN,J. H. 2001. Greedy function approximation: A gradient boosting machine. *Ann. Stat.* **29**, 1189-1232.




### PREDICTION OF LARGE OBSERVATIONS VIA BAYESIAN INFERENCE FOR EXTREME-VALUE THEORY

Isadora Antoniano Villalobos1, Simone Padoan2 and Boris Beranger3

<sup>1</sup> Ca' Foscari University of Venice, (e-mail: isadora.antoniano@unive.it)

<sup>2</sup> Bocconi University, (e-mail: simone.padoan@unibocconi.it)

<sup>3</sup> University of New South Wales, (e-mail: b.beranger@unsw.edu.au)

ABSTRACT: In many applications placing interest on large observations, usual inferential methods may fail to reproduce the heavy tail behaviour of the quantities involved. Recent literature has proposed the use of multivariate extreme value theory to predict an unobserved component of a random vector given large observed values of the rest. This is achieved through the estimation of the angular measure controlling the dependence structure in the tail of the distribution. The idea can be extended and used for effective data imputation and prediction of multiple components at adequately large levels, provided the model used for the angular measure is flexible enough to capture complex dependence structures. A Bayesian nonparametric model based on constrained Bernstein polynomials ensures such flexibility. Tractable inference for both the dependence structure and the marginal parameters of the model is achieved via a trans-dimensional MCMC algorithm for posterior simulation.

KEYWORDS: Bernstein polynomials, extremal dependence, multivariate regular variation, trans-dimensional MCMC
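To make the modelling idea concrete: in the bivariate case the angular measure is a distribution on $[0,1]$ whose density $h$ must satisfy the moment constraint $\int_0^1 w\,h(w)\,dw = 1/2$. The sketch below is not the authors' model; it only illustrates, with hypothetical mixture weights, how a Bernstein polynomial (a mixture of Beta densities) can represent such a constrained angular density:

```python
import numpy as np
from math import comb

def bernstein_density(w, weights):
    """h(w) = sum_j weights[j] * (k+1) * C(k,j) * w^j * (1-w)^(k-j),
    i.e. a mixture of Beta(j+1, k-j+1) densities in Bernstein form."""
    k = len(weights) - 1
    basis = np.array([(k + 1) * comb(k, j) * w**j * (1 - w)**(k - j)
                      for j in range(k + 1)])
    return float(np.dot(weights, basis))

# Hypothetical mixture weights: non-negative and summing to 1, so h is a
# density; the Beta(j+1, k-j+1) mean is (j+1)/(k+2), so the angular-measure
# constraint reads  sum_j weights[j] * (j+1)/(k+2) = 1/2.
k = 4
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # symmetric => mean 1/2
means = np.array([(j + 1) / (k + 2) for j in range(k + 1)])
print(round(float(weights @ means), 3))                 # 0.5: constraint holds

grid = np.linspace(0.0, 1.0, 201)
h = np.array([bernstein_density(w, weights) for w in grid])
integral = float(np.sum((h[:-1] + h[1:]) / 2) * (grid[1] - grid[0]))
print(round(integral, 2))                               # 1.0: h integrates to one
```

In the Bayesian nonparametric setting of the abstract, the polynomial degree and the constrained weights would be random and explored by the trans-dimensional MCMC; here they are fixed purely for illustration.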

### COMMUNITY DETECTION IN TRIPARTITE NETWORKS OF UNIVERSITY STUDENT MOBILITY FLOWS


Maria Prosperina Vitale1, Vincenzo Giuseppe Genova2, Giuseppe Giordano1 and Giancarlo Ragozini3

<sup>1</sup> Department of Political and Social Studies, University of Salerno, (e-mail: mvitale@unisa.it, ggiordano@unisa.it)

<sup>2</sup> Department of Economics, Business, and Statistics, University of Palermo, (e-mail: vincenzogiuseppe.genova@unipa.it)

<sup>3</sup> Department of Political Science, Federico II University of Naples, (e-mail: giragoz@unina.it)

ABSTRACT: The purpose of this study is to explore how the multimode network approach can be used to analyse network patterns derived from student mobility flows. We define a tripartite network based on a three-mode data structure, consisting of Italian provinces of residence, universities and fields of study, with student exchanges representing the links between them. A comparison of algorithms for detecting communities from tripartite networks based on modularity optimization is provided, revealing relevant information about the phenomenon under analysis over time. The findings are applied to a real dataset containing micro-level longitudinal information on Italian university students' careers.

KEYWORDS: student mobility, tripartite networks, modularity optimisation

#### 1 Introduction

The analysis of intra- and international student mobility has become a vibrant research field in migration literature and a key concern for national policymaking on tertiary education systems (Van Mol & Timmerman, 2014; Riaño *et al.*, 2018). Usually, European mobility in higher education is described by considering the dynamics of the Erasmus programme. From a national perspective, Italian student mobility from high school to bachelor and master degrees is analysed as a crucial step in determining future migration choices. Such analysis shows an unbalanced migration of students from the southern to the northern regions of the country (Genova *et al.*, 2019), which is influenced by the attractiveness of universities, related to the socio-economic characteristics and the job market opportunities of the geographic areas where they are located (Giambona *et al.*, 2017; Impicciatore & Panichella, 2019). Given the nature of the student mobility data (i.e. flows of students connecting provinces of residence and universities of destination), network analysis has been adopted as one of the most appropriate methodological approaches to interpret this phenomenon (Santelli *et al.*, 2019; Genova *et al.*, 2019; Columbu *et al.*, 2021). Based on this theoretical framework and the intrinsic complexity of student mobility flows, this study analyses the data at hand within the framework of multimode networks (Fararo & Doreian, 1984). More specifically, we define a tripartite network based on a three-mode data structure, consisting of Italian provinces of residence, universities and fields of study, with student exchanges representing the links between them.
A comparison of community detection algorithms for tripartite networks, or k-partite modularity approaches (Neubauer & Obermayer, 2009; Ikematsu & Murata, 2013; Melamed *et al.*, 2013; Ignatov *et al.*, 2017; Feng *et al.*, 2019), mainly based on modularity optimisation, is carried out to reveal relevant information about the phenomenon under analysis. The algorithms are applied to the MOBYSU.IT dataset, which contains micro-level longitudinal information on university students' careers from 2008 to 2017 in Italy.∗


#### 2 Community detection algorithms in tripartite networks

Many real-world networks have a natural multimode structure in which vertices of different types are linked together. Without loss of generality, in the case of tripartite networks, three types of vertices are defined and links can be present only between vertices of distinct types (Fararo & Doreian, 1984). Several approaches can be pursued to disentangle the inherent complexity of such data. Recently, Everett & Borgatti (2019) suggested that, in the case of multimode data, the collection of all bipartite networks should be examined.

In our case study, a tripartite network is considered in which $V_P$ is the set of provinces of residence of Italian students enrolled in the first academic year of any bachelor/master degree, $V_U$ is the set of public and private universities, and $V_F$ is the set of educational fields of study. The tripartite network $T$ can be defined as a pair $(V, E)$, where $V = \{V_P, V_U, V_F\}$ is the collection of three sets of vertices, one for each mode, and $E = \{E_{PUF}\}$, with $E_{PUF} \subseteq V_P \times V_U \times V_F$ and $E_{PP} = E_{UU} = E_{FF} = \emptyset$, is the collection of links among

<sup>∗</sup>This study was supported by the Italian Ministerial grant PRIN 2017 'From high school to job placement: micro-data life course analysis of university student mobility and its impact on the Italian North-South divide', n. 2017HBTK5P - CUP B78D19000180001.

the vertices belonging to the three modes. Given $T$, a unique supra-adjacency matrix can be defined by combining the sociomatrices $A_{PU}$, $A_{UF}$ and $A_{PF}$ into a block matrix, where the links are weighted by the numbers of students enrolled, so that the corresponding bipartite networks are weighted. Thus, the related supra-adjacency matrix is:

$$\mathbf{A} = \begin{bmatrix} \mathbf{0} & \mathbf{A}\_{PU} & \mathbf{A}\_{PF} \\ \mathbf{A}\_{PU}^T & \mathbf{0} & \mathbf{A}\_{UF} \\ \mathbf{A}\_{PF}^T & \mathbf{A}\_{UF}^T & \mathbf{0} \end{bmatrix}.$$

Over the past two decades, a growing number of studies have been devoted to algorithmic solutions for community detection in tripartite graphs. The first and simplest method consists of applying the usual community detection algorithms to the matrix $\mathbf{A}$, or to a version of it built after suitable matrix transformations (Melamed *et al.*, 2013; Everett & Borgatti, 2019). Other methods adopt a modularity optimisation tailored to tripartite networks (Neubauer & Obermayer, 2009; Ikematsu & Murata, 2013), extending the idea of bipartite modularity.

Given the nature of our data, the approaches which maximise the bipartite modularity seem more appropriate. A detailed comparison of the proposed algorithms could be of interest in understanding how tripartite community detection can be used to interpret the network patterns underlying the Italian student mobility phenomenon.
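The block construction of $\mathbf{A}$ is straightforward to reproduce numerically. The sketch below uses NumPy and entirely hypothetical student counts (3 provinces, 2 universities, 2 fields) to assemble the weighted supra-adjacency matrix:

```python
import numpy as np

# Hypothetical student counts (toy numbers, for illustration only):
A_PU = np.array([[10, 0], [3, 7], [0, 5]])   # provinces x universities
A_PF = np.array([[6, 4], [5, 5], [2, 3]])    # provinces x fields of study
A_UF = np.array([[8, 5], [5, 7]])            # universities x fields of study

def supra_adjacency(A_PU, A_PF, A_UF):
    """Combine the three bipartite sociomatrices into the block
    supra-adjacency matrix A of the weighted tripartite network."""
    nP, nU = A_PU.shape
    nF = A_UF.shape[1]
    zeros = lambda r, c: np.zeros((r, c), dtype=A_PU.dtype)
    return np.block([
        [zeros(nP, nP), A_PU,          A_PF],
        [A_PU.T,        zeros(nU, nU), A_UF],
        [A_PF.T,        A_UF.T,        zeros(nF, nF)],
    ])

A = supra_adjacency(A_PU, A_PF, A_UF)
print(A.shape)                      # (7, 7): one row per province, university, field
print(bool(np.allclose(A, A.T)))    # True: links exist only between distinct modes
```

Running a standard modularity-based community detection algorithm directly on this matrix corresponds to the "first and simplest" strategy discussed above.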

#### References


COLUMBU, SILVIA, PORCU, MARIANO, & SULIS, ISABELLA. 2021. University choice and the attractiveness of the study area: Insights on the differences amongst degree programmes in Italy based on generalised mixed-effect models. *Socio-Economic Planning Sciences*, 74, 100926.

EVERETT, MARTIN G., & BORGATTI, STEPHEN P. 2019. Partitioning multimode networks. *Pages 251–265 of: Advances in Network Clustering and Blockmodeling*. John Wiley and Sons.

FARARO, THOMAS J., & DOREIAN, PATRICK. 1984. Tripartite structural analysis: Generalizing the Breiger-Wilson formalism. *Social Networks*, 6(2), 141–175.

FENG, LIANG, ZHAO, QIANCHUAN, & ZHOU, CANGQI. 2019. An efficient method to find communities in K-partite networks. *Pages 534–535 of: 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)*. IEEE.

GENOVA, VINCENZO GIUSEPPE, TUMMINELLO, MICHELE, ENEA, MARCO, AIELLO, FABIO, & ATTANASIO, MASSIMO. 2019. Student mobility in higher education: Sicilian outflow network and chain migrations. *Electronic Journal of Applied Statistical Analysis*, 12(4), 774–800.

GIAMBONA, FRANCESCA, PORCU, MARIANO, & SULIS, ISABELLA. 2017. Students mobility: Assessing the determinants of attractiveness across competing territorial areas. *Social Indicators Research*, 133(3), 1105–1132.

IGNATOV, DMITRY I., SEMENOV, ALEXANDER, KOMISSAROVA, DARIA, & GNATYSHAK, DMITRY V. 2017. Multimodal clustering for community detection. *Pages 59–96 of: Formal Concept Analysis of Social Networks*. Springer.

IKEMATSU, KYOHEI, & MURATA, TSUYOSHI. 2013. A fast method for detecting communities from tripartite networks. *Pages 192–205 of: International Conference on Social Informatics*. Springer.

IMPICCIATORE, ROBERTO, & PANICHELLA, NAZARENO. 2019. Internal migration trajectories, occupational achievement and social mobility in contemporary Italy. A life course perspective. *Population, Space and Place*, 25(6), e2240.

MELAMED, DAVID, BREIGER, RONALD L., & WEST, A. JOSEPH. 2013. Community structure in multi-mode networks: Applying an eigenspectrum approach. *Connections*, 33(1), 18–23.

NEUBAUER, NICOLAS, & OBERMAYER, KLAUS. 2009. Towards community detection in k-partite k-uniform hypergraphs. *Pages 1–9 of: Proceedings of the NIPS 2009 Workshop on Analyzing Networks and Learning with Graphs*.

RIAÑO, YVONNE, VAN MOL, CHRISTOF, & RAGHURAM, PARVATI. 2018. New directions in studying policies of international student mobility and migration. *Globalisation, Societies and Education*, 16(3), 283–294.

SANTELLI, FRANCESCO, SCOLORATO, CONCETTA, & RAGOZINI, GIANCARLO. 2019. On the determinants of student mobility in an interregional perspective: a focus on Campania region. *Statistica Applicata - Italian Journal of Applied Statistics*, 31(1), 119–142.

VAN MOL, CHRISTOF, & TIMMERMAN, CHRISTIANE. 2014. Should I stay or should I go? An analysis of the determinants of intra-European student mobility. *Population, Space and Place*, 20(5), 465–479.


### CAUSAL REGULARIZATION

Ernst C. Wit1, Lucas Kania1

<sup>1</sup> Institute of Computing, Università della Svizzera italiana, (e-mail: wite@usi.ch, lucas.kania@usi.ch)


ABSTRACT: When predicting a response variable from a set of covariates, the ordinary least squares (OLS) estimator provides the best in-sample risk but with limited out-of-sample guarantees. Conversely, the causal parameters provide the best out-of-sample guarantees but the worst in-sample risk. Based on the causal Dantzig and Anchor Regression, we develop a *causal regularization* approach that interpolates between the OLS and the causal Dantzig solutions. As the regularization is increased, we prove that causal regularization provides a solution that has better out-of-sample risk guarantees at the cost of an increased in-sample risk. Moreover, we provide an efficient algorithm to recover the regularized solution for every tuning parameter.

KEYWORDS: causal regularization, causal Dantzig, anchor regression, out-of-sample risk.

#### 1 Introduction

We will consider a causal graphical model, for example expressed by Figure 1a (Pearl, 2009). We are interested in uncovering the causal structure involving a particular *target variable* $Y$; in particular, in identifying the causal parents of $Y$ and the associated causal parameters $\beta_{PA}$.

Besides having access to observational data on the system, we will also assume that we have data on some intervened version of the same system. We will refer to such an intervened system as an *environment*. Formally, consider a causal DAG $D$, such as in Figure 1a, together with a probability distribution $P$ over the random variables $(X, Y)$. The tuple $(D, P^e, X^e, Y^e, A^e)$ for $e \in \mathcal{E}$ is called an environment, where $A^e$ is the set of shift-intervention variables in $D^e$, the extended intervention graph of $D$ for environment $e$, as shown for example in Figure 1b.

Figure 1. *(a) Causal directed acyclic graph $D$ associated with a causal graphical model. (b) An extended intervention graph $D^e$ associated with this causal GM.*

For simplicity, we focus on a particular structure of the distribution $P$, described by means of a linear structural equation model (SEM), also known as a linear structural causal model. In particular, for $e \in \mathcal{E}$, let the distribution $P^e$ of $(X^e, Y^e, A^e)$ be determined by the solution of the system


$$
\begin{bmatrix} Y^e \\ X^e \end{bmatrix} = \underbrace{\mathbf{B}}\_{\begin{subarray}{c} \text{unknown} \\ \text{constant} \\ \text{structure} \end{subarray}} \cdot \begin{bmatrix} Y^e \\ X^e \end{bmatrix} + \underbrace{\varepsilon^e}\_{\text{noise}} + \underbrace{A^e}\_{\begin{subarray}{c} \text{shift} \\ \text{intervention} \end{subarray}} \tag{1}
$$

where $B \in \mathbb{R}^{(p+1)\times(p+1)}$ is a constant matrix, and $X^e \in \mathbb{R}^p$, $Y^e \in \mathbb{R}$, $\varepsilon^e \in \mathbb{R}^{p+1}$ and $A^e \in \mathbb{R}^{p+1}$ are random vectors. We require $A^e_Y \equiv 0$, i.e. $A^e = (0, A^e_X)$, so that the target variable is not intervened on. Interventions and noise variables must be uncorrelated, $\mathrm{Cor}[A^e, \varepsilon^e] = 0$, and have finite second moments, $E[A^e A^{eT}] < \infty$ and $\mathrm{Cor}[\varepsilon^e] < \infty$. Furthermore, $\varepsilon^e$ is assumed to have zero mean, i.e. $E[\varepsilon^e] = 0$. Additionally, the noise random variables are assumed to be identically distributed across environments, i.e. $\varepsilon^e \sim \zeta$. Moreover, for the distribution to be well defined, we require the existence of $(I - B)^{-1}$, so that $(Y^e, X^e)^T = (I - B)^{-1}(\varepsilon^e + A^e)$. This is guaranteed if the underlying graph $D$ is a directed acyclic graph.

Given that the structure $B$ is fixed across environments, we can talk about $X_S \subseteq X$ being a descendant or ancestor of $Y$ without referring to the environment variables $Y^e$ and $X^e$. Moreover, since we are interested in estimating the structural equation corresponding to $Y^e$, it is useful to split $B$ into

$$B = \begin{bmatrix} 0 & \beta\_{PA}^T \\ \beta\_{CH} & B\_X \end{bmatrix} \tag{2}$$

Consequently, the structural equation of $Y^e$ would be

$$Y^e = \beta\_{PA}^T X^e + \varepsilon\_Y^e \tag{3}$$

where $\beta_{PA}$ are called the *causal parameters*, since they are non-zero only for $X^e_{pa(Y)}$. In the SEM context, the components of $A^e$ are called shift-interventions or interventions. If $A^e_{X_i} \not\equiv 0$ for $i \in \{1, \ldots, p\}$, we say that $X_i$ is intervened on. Thus, note that assuming $A^e_Y \equiv 0$ means that no intervention is performed on the target, which is the equivalent of assuming that the environment is not a parent of $Y$. When $A^e \equiv 0$, the environment is called *observational*.
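A toy simulation can make the environment setup concrete. The structure $B$, the coefficients and the shift below are entirely hypothetical; the sketch draws from one observational and one shift-intervention environment via $(Y^e, X^e) = (I - B)^{-1}(\varepsilon^e + A^e)$ and checks that, because $A^e_Y \equiv 0$, the residual $Y^e - \beta_{PA}^T X^e$ keeps the distribution of the noise $\varepsilon_Y$ in both environments:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3                                   # covariates X1, X2, X3; Y is coordinate 0
beta_PA = np.array([1.5, 0.0, -0.5])    # hypothetical causal parameters
B = np.zeros((p + 1, p + 1))            # constant structure, split as in eq. (2)
B[0, 1:] = beta_PA                      # row 0: structural equation of Y
B[2, 1] = 0.8                           # X2 is a child of X1 (graph stays acyclic)

def sample_environment(n, shift):
    """Draw n samples of (Y^e, X^e) from eq. (1): the solution is
    (Y^e, X^e) = (I - B)^{-1} (eps^e + A^e), with A^e_Y = 0."""
    eps = rng.normal(size=(n, p + 1))            # zero-mean noise, iid across e
    A = np.concatenate(([0.0], shift))           # target never intervened on
    M = np.linalg.inv(np.eye(p + 1) - B)
    V = (eps + A) @ M.T                          # solve the linear system row-wise
    return V[:, 0], V[:, 1:]                     # Y^e, X^e

n = 100_000
Y_obs, X_obs = sample_environment(n, np.zeros(p))                # observational
Y_int, X_int = sample_environment(n, np.array([2.0, 0.0, 0.0]))  # shift on X1

# Since A^e_Y = 0, the residual Y^e - beta_PA^T X^e equals the noise eps_Y
# in *both* environments, even though the distribution of X has changed.
r_obs = Y_obs - X_obs @ beta_PA
r_int = Y_int - X_int @ beta_PA
print(round(float(r_obs.std()), 2), round(float(r_int.std()), 2))  # both ~ 1.0
```

This invariance of the residual distribution across environments is exactly the property exploited in Section 2.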

#### 2 Discovering causes from inner-product invariance

Under the SEM in equation (3), the following *distribution invariance* holds (Peters *et al.*, 2016): $\forall e \in \mathcal{E}: Y^e - \beta_{PA}^T X^e = \varepsilon^e_Y \sim \zeta$, which we call *residual invariance*. Furthermore, by left-multiplying by $X^e$ and taking the expectation, we obtain

$$\forall e \in \mathcal{E}: \; E[X^e(Y^e - \beta\_{PA}^T X^e)] = P\_X (I - B)^{-1}\big(E[\varepsilon^e \varepsilon\_Y^e] + E[A^e \varepsilon\_Y^e]\big) = P\_X (I - B)^{-1} \mathrm{Cor}[\varepsilon, \varepsilon\_Y], \quad \text{constant over } e,$$

which yields *inner-product invariance*. By taking the difference between the expected inner-products of an interventional environment $(X^e, Y^e)$ and an observational one $(X^o, Y^o)$, we obtain

$$E[Z - G\beta\_{PA}] = E[Z] - E[G]\beta\_{PA} = 0 \tag{4}$$

where $Z = X^e Y^e - X^o Y^o$ and $G = X^e X^{eT} - X^o X^{oT}$. Since $\|\alpha\|_\infty = 0 \iff \alpha = 0$, we get $\|E[Z] - E[G]\beta_{PA}\|_\infty = 0$. Thus, equation (4) gives a plausible method for identifying $\beta_{PA}$ without having to search over all possible subsets of $X$; that is, to solve the following linear regression problem,

$$\beta\_{CS} \in \arg\min\_{\beta \in \mathbb{R}^p} \|E[Z] - E[G]\beta\|\_\infty, \tag{5}$$

which is referred to as the unregularized *causal Dantzig problem* (Rothenhäusler *et al.*, 2019). Although $\beta_{PA}$ is a solution, depending on the rank of $E[G]$ the solution $\beta_{CS}$ may not be unique. We call $R_{inv}(\beta) = \|E[Z - G\beta]\|_\infty$ the invariance risk of $\beta$. Let $R_e(\beta) = E[(Y^e - \beta^T X^e)^2]$ be the risk in environment $e$ and $R_{pred}(\beta) = R_e(\beta) + R_o(\beta)$ the pooled risk of the in-sample environments; then we remind the reader that the OLS problem minimizes the in-sample risk, $\beta_{OLS} \in \arg\min_{\beta \in \mathbb{R}^p} R_{pred}(\beta)$.
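As a sketch of how the unregularized estimator can be computed when $E[G]$ has full rank (an assumption; otherwise the $\ell_\infty$ problem has multiple solutions), identity (4) reduces to the linear solve $E[G]\beta = E[Z]$, estimated below on a hypothetical SEM in which $X_2$ is a child of $Y$; the coefficients and shift variances are made up for illustration. Pooled OLS is shown for contrast, as it loads on the child $X_2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
beta_PA = np.array([1.5, 0.0, -0.5])        # X2 is a child of Y, so its entry is 0

def draw(intervene):
    """One environment: X1, X3 are parents of Y, X2 is a child of Y.
    Random shift interventions act on the covariates, never on Y."""
    e = rng.normal(size=(n, 4))             # noise for (Y, X1, X2, X3)
    if intervene:
        a = rng.normal(size=(n, 3)) * np.array([1.4, 1.0, 1.0])
    else:
        a = np.zeros((n, 3))
    X1 = e[:, 1] + a[:, 0]
    X3 = e[:, 3] + a[:, 2]
    Y = 1.5 * X1 - 0.5 * X3 + e[:, 0]       # target never intervened on
    X2 = 0.7 * Y + e[:, 2] + a[:, 1]
    return np.column_stack([X1, X2, X3]), Y

X_o, Y_o = draw(False)                      # observational environment
X_e, Y_e = draw(True)                       # shift-intervention environment

# Empirical versions of Z = X^e Y^e - X^o Y^o and G = X^e X^eT - X^o X^oT.
Z_hat = X_e.T @ Y_e / n - X_o.T @ Y_o / n
G_hat = X_e.T @ X_e / n - X_o.T @ X_o / n

beta_cd = np.linalg.solve(G_hat, Z_hat)     # causal Dantzig under full rank
X_pool = np.vstack([X_o, X_e])              # pooled OLS for comparison
beta_ols = np.linalg.lstsq(X_pool, np.concatenate([Y_o, Y_e]), rcond=None)[0]

print(np.round(beta_cd, 1))                 # close to beta_PA = (1.5, 0, -0.5)
print(bool(abs(beta_ols[1]) > 0.2))         # True: OLS loads on the child X2
```

Note that random (rather than constant) shifts are used so that the second moments, and hence $\hat{G}$, differ across environments in all coordinates.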

#### 3 Causal regularization

We define *causal regularization* as an estimator that provides the best possible in-sample risk for a certain out-of-sample risk guarantee, as follows:

$$\beta\_{CR}(t) = \arg\min\_{\beta \in \mathbb{R}^p} R\_{pred}(\beta) \quad \text{such that} \quad R\_{inv}(\beta) \le t \tag{6}$$

Note that for *t* → ∞ we recover the OLS solution β*OLS*, whereas for *t* → 0 we obtain the Causal Dantzig solution β*CS*.

Given the in-sample shift environment $(X^e, Y^e, A^e)$, we define a set of environments $C\_\gamma$ whose interventions differ only in magnitude from the ones contained in the in-sample environment $e$,

$$C\_{\gamma} = \{ f \in \mathcal{E} : E[A^f A^{fT}] \preceq \gamma\, E[A^e A^{eT}] \}.$$

The causal regularizer has strong out-of-sample risk guarantees within $C\_\gamma$.

Theorem (*Causal regularization out-of-sample risk guarantees*). For any CR estimator $\beta \in \beta\_{CR}(t)$, we have the following risk bound

$$\sup\_{f \in C\_{1+\tau}} R^f(\beta) \le R\_{pred}(\beta) + u\,\|\beta\_{PA} - \beta\|\_1, \tag{7}$$

in particular, $\forall \beta \in \beta\_{CR}(t): \sup\_{f \in C\_{1+1/t}} R^f(\beta) \le R\_{pred}(\beta) + \|\beta\_{PA} - \beta\|\_1 \cdot \text{Constant}$.

The theorem tells us that if we expect out-of-sample environments to have interventions that are $\tau$ times stronger than in the in-sample environment $e$, then setting $t = \tau^{-1}$ would provide an estimator that guarantees a bounded risk on such environments. In other words, $\beta \in \beta\_{CR}(t)$ guarantees a bounded out-of-sample risk for environments in $C\_{1+1/t}$. In particular, $\beta\_{CS}$ provides a bounded out-of-sample risk for the *biggest* set of environments, i.e., $C\_\infty$, while $\beta\_{OLS}$ guarantees a bounded out-of-sample risk for environments whose interventions are at most as strong as the intervention present in environment $e$, i.e., $C\_1$.
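The interpolation between $\beta\_{OLS}$ ($t \to \infty$) and $\beta\_{CS}$ ($t \to 0$) can be made concrete by solving a Lagrangian form of problem (6), $\min\_\beta R\_{pred}(\beta) + \lambda R\_{inv}(\beta)$, where $\lambda = 0$ recovers OLS and large $\lambda$ (small $t$) approaches the causal Dantzig solution. A hedged sketch on simulated data follows; the SEM, coefficients, and the use of a generic Nelder-Mead solver are our own illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 100_000
beta_true = np.array([1.5, -1.0])

def sample_env(shift_sd):
    """Toy linear SEM with hidden confounder H; shift_sd = 0 is observational."""
    H = rng.normal(size=n)
    A = rng.normal(scale=shift_sd, size=(n, 2))
    X = np.column_stack([0.8 * H, -0.5 * H]) + A + rng.normal(size=(n, 2))
    Y = X @ beta_true + 0.8 * H + rng.normal(size=n)
    return X, Y

Xo, Yo = sample_env(0.0)
Xe, Ye = sample_env(2.0)
Xp, Yp = np.vstack([Xo, Xe]), np.concatenate([Yo, Ye])

Z_hat = (Xe * Ye[:, None]).mean(axis=0) - (Xo * Yo[:, None]).mean(axis=0)
G_hat = (Xe.T @ Xe) / n - (Xo.T @ Xo) / n

def r_pred(b):                      # pooled in-sample risk R_pred
    r = Yp - Xp @ b
    return (r * r).mean()

def r_inv(b):                       # invariance risk ||E[Z - G b]||_inf
    return np.abs(Z_hat - G_hat @ b).max()

beta_ols = np.linalg.lstsq(Xp, Yp, rcond=None)[0]
beta_cd = np.linalg.solve(G_hat, Z_hat)

def beta_cr(lam):
    """Lagrangian form of problem (6): lam = 0 gives OLS, large lam moves
    the solution toward the causal Dantzig estimate."""
    obj = lambda b: r_pred(b) + lam * r_inv(b)
    return minimize(obj, beta_ols, method="Nelder-Mead",
                    options={"xatol": 1e-9, "fatol": 1e-12}).x
```

Sweeping `lam` over a grid traces the whole regularization path between the two extremes.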

#### References

PEARL, JUDEA. 2009. *Causality*. Cambridge university press.


### MINIMIZING CONFLICTS OF INTEREST: OPTIMIZING THE JSM PROGRAM

Qiuyi Wu1 and David Banks2

<sup>1</sup> University of Rochester, (e-mail: jqiuyi wu@urmc.rochester.edu)

<sup>2</sup> Duke University, (e-mail: dlbanks@duke.edu)

ABSTRACT: Sometimes the Joint Statistical Meetings (JSM) are frustrating to attend, because multiple sessions on the same topic are scheduled at the same time. This paper uses seeded Latent Dirichlet Allocation and a scheduling optimization algorithm to very significantly reduce overlapping content in the 2020 program. Of course, since the pandemic forced the 2020 JSM to be held virtually, our superior schedule was made moot. Nonetheless, this approach may assist in organizing future meetings, both for statistics and for other disciplines.

KEYWORDS: latent Dirichlet allocation, topic modeling, greedy algorithm, scheduling

# **Contributed Papers**


Giovanni C. Porzio, University of Cassino and Southern Lazio, Italy, porzio@unicas.it, 0000-0003-1208-6991 Carla Rampichini, University of Florence, Italy, carla.rampichini@unifi.it, 0000-0002-8519-083X Chiara Bocci, University of Florence, Italy, chiara.bocci@unifi.it, 0000-0001-8189-4445

FUP Best Practice in Scholarly Publishing (DOI 10.36253/fup\_best\_practice)

Giovanni C. Porzio, Carla Rampichini, Chiara Bocci (edited by), *CLADAG 2021 Book of abstracts and short papers. 13th Scientific Meeting of the Classification and Data Analysis Group Firenze, September 9-11, 2021*, © 2021 Author(s), content CC BY 4.0 International, metadata CC0 1.0 Universal, published by Firenze University Press (www.fupress.com), ISSN 2704-5846 (online), ISBN 978-88-5518-340-6 (PDF), DOI 10.36253/978-88-5518-340-6


## MODEL SELECTION PROCEDURE FOR MIXTURE HIDDEN MARKOV MODELS

A. Abbruzzo 1, M.F. Cracolici <sup>1</sup> and F. Urso1

<sup>1</sup> Department of Economics, Business and Statistics, University of Palermo, Palermo, Italy, (e-mail: antonino.abbruzzo@unipa.it, mariafrancesca.cracolici@unipa.it, furio.urso@unipa.it)

ABSTRACT: This paper proposes a model selection procedure to identify the number of clusters and hidden states in discrete Mixture Hidden Markov models (MHMMs). The model selection is based on a step-wise approach that uses, as score, information criteria and an entropy criterion. By means of a simulation study, we show that our procedure performs better than classical model selection methods in identifying the correct number of clusters and hidden states or an approximation of them.

KEYWORDS: model selection, clusters, hidden states, entropy-based scores, information criteria

#### 1 Introduction

In many research fields, we deal with data whose independent units present one or more categorical sequences that represent the evolution of a specific feature over time (longitudinal data). Thus, it is necessary to define suitable methods capable of modelling an evolving process by describing some unknown variables that influence the observed sequences. Latent class models such as MHMMs can be used to analyse longitudinal data under the assumptions that (i) the sequences follow a latent Markov process and that (ii) the population is heterogeneous (Vermunt *et al.*, 2008; Bartolucci & Pandolfi, 2015). These models present two latent levels: one related to the hidden states of the discrete-time Markov chain and one representing the population's subgroups. The identification of the number of clusters and hidden states can be achieved, according to the literature on Mixture and Hidden Markov models, by fitting different models to the data and then selecting the model by using the results of information criteria (IC) such as AIC and BIC or classification criteria based on entropy (Dias *et al.*, 2009; Crayen *et al.*, 2012). However, these criteria tend to underestimate or overestimate these numbers (Wang & Chan, 2011). Here, we define a model selection procedure that combines IC and an entropy criterion to balance their limitations. Through a simulation study, we show that the proposed procedure exhibits promising results compared to the classical techniques.


#### 2 Mixture Hidden Markov models

Let $Y\_i = (Y\_{i1}, Y\_{i2}, \ldots, Y\_{iT})$ be the generic $i$-th sequence of length $T$ with card$|Y\_i| = R$, let $U\_i = (U\_{i1}, U\_{i2}, \ldots, U\_{iT})$ be the $i$-th hidden random vector with card$|U\_i| = S$, and assume $n$ independent sequences. Let $M = \{M\_1, M\_2, \ldots, M\_K\}$ be a set of Hidden Markov Models, where $\Theta\_k = \{\pi\_k, A\_k, B\_k\}$ is the set of parameters of each sub-model $M\_k$, related to each sub-population $k = 1, \ldots, K$. For each sequence $Y\_i$, we define the prior cluster probability that the model parameters are the ones related to the $k$-th sub-model $M\_k$ as $P(M\_k) = w\_k$. Then, the log-likelihood is

$$\ell(\Theta; Y) = \sum\_{i=1}^{n} \log P(Y\_i|\Theta) = \sum\_{i=1}^{n} \log \left( \sum\_{k=1}^{K} w\_{ik} \sum\_{u} \pi\_{u\_1}^{k} b\_{u\_1}^{k} (\mathbf{y}\_{i1}) \prod\_{t=2}^{T} a\_{u\_{t-1}, u\_t}^{k} b\_{u\_t}^{k} (\mathbf{y}\_{it}) \right), \tag{1}$$

where the hidden state sequences $u = (u\_1, u\_2, \ldots, u\_T)$ take all possible combinations of values in the hidden state space $S$, and where $y\_{it}$ are the observations of subject $i$ at time $t$; $\pi^k\_{u\_1} = P(u\_1 = s|\Theta\_k)$, with $s \in \{1, \ldots, S\_k\}$, is the initial probability of the hidden state at time $t = 1$ in sequence $u$ for cluster $k$; $a^k\_{u\_{t-1}, u\_t} = P(u\_t = j|u\_{t-1} = i, \Theta\_k)$, with $i, j \in \{1, \ldots, S\}$, is the transition probability from the hidden state at time $t - 1$ to the hidden state at time $t$ in cluster $k$; and $b^k\_{u\_t}(y\_{it}) = P(y\_{it} = r|u\_t = s, \Theta\_k)$, with $s \in \{1, \ldots, S\}$ and $r \in \{1, \ldots, R\}$, is the probability that the hidden state of subject $i$ at time $t$ emits the observed state at time $t$ in cluster $k$. Parameters can be estimated by means of the Expectation-Maximization (EM) algorithm, and the log-likelihood is calculated by using the forward-backward algorithm.

#### 3 Proposed model selection procedure

Our proposed procedure combines IC and entropy to identify MHMMs on the basis of both goodness-of-fit and degree of class separation. Hence, the procedure consists of two stages. First, we estimate models with different numbers of clusters and states; for each model the IC value is calculated, and the models having values below a predetermined threshold (the mean of the IC) are selected. At the second stage, an entropy criterion is used to identify, among the models selected at the first stage, the one with the best degree of separation between classes (clusters and states). When dealing with MHMMs, it is necessary to define a criterion that takes into account two levels of entropy: the first, En1(*S*), relating to the classification of observations in latent states, and the second, En2(*K*), concerning the degree of separation between clusters.

$$\mathrm{E}\_{\mathrm{new}}(S, K) = 1 - \frac{1}{2n} \left[ \frac{\mathrm{En}\_1(S)}{T \log S} + \frac{\mathrm{En}\_2(K)}{\log K} \right] \tag{2}$$

where

the proposed procedure exhibits promising results compared to the classical

Let *Yi* = (*Yi*1,*Yi*2,...,*YiT* ) be the generic *i*-th sequence of length *T* with card|*Yi*| = *R*, *Ui* = (*Ui*1,*Ui*2,...,*UiT* ) the *i*-th hidden random vector with card|*Ui*| = *S* and assume *<sup>n</sup>* independent sequences. Let *<sup>M</sup>* <sup>=</sup> {*M*1,*M*2,...,*MK*} be a set of Hidden Markov Models, where <sup>Θ</sup>*<sup>k</sup>* <sup>=</sup> {π*k*,*Ak*,*Bk*} is the set of parameters for each sub-models *Mk*, related to each sub-population *k* = 1,...,*K*. For each sequence *Yi*, we define the prior cluster probabilities that the model parameters are the ones related to the *k*-th sub-model *M<sup>k</sup>* as *P*(*Mk*) = *wk*. Then, the log-likelihood

> *K* ∑ *k*=1

where the hidden state sequences *u* = (*u*1,*u*2,...,*uT* ) take all possible combinations of values in the hidden state space *S* and where *yit* are the obser-

initial probability of the hidden state at time *t* = 1 in sequence *u* for cluster

is the probability that the hidden state of subject *i* at time *t* emits the observed state at *t* in cluster *k*. Parameters can be estimated by means of the Expectation-Maximization; and the log-likelihood is calculated by using the

Our proposed procedure combines IC and entropy for identifying MHMMs models on the basis of both goodness-of-fit and degree of class separation. Hence, the procedure consists of two stages. Firstly, we estimate models with different number of clusters and states, for each model the IC value is calculated and the models having these values below a predetermined threshold (the mean of the IC) are selected. At the second stage, an entropy criterion is used to identify among the models selected at the first-stage the one with the best degree of separation between classes (clusters and states). At the second stage,

*ut*−1,*ut* <sup>=</sup> *<sup>P</sup>*(*ut* <sup>=</sup> *<sup>j</sup>*|*ut*−<sup>1</sup> <sup>=</sup> *<sup>i</sup>*,Θ*k*) with *<sup>i</sup>*, *<sup>j</sup>* ∈ {1,...,*S*} is the transition probability from the hidden state at time *t* − 1 to the hidden state at *t* in cluster

(*yit*) = *<sup>P</sup>*(*yit* <sup>=</sup> *<sup>r</sup>*|*ut* <sup>=</sup> *<sup>s</sup>*,Θ*k*) with *<sup>s</sup>* ∈ {1,...,*S*} and *<sup>r</sup>* ∈ {1,...,*R*}

*wik*∑*<sup>u</sup>* π*k <sup>u</sup>*<sup>1</sup> *<sup>b</sup><sup>k</sup> <sup>u</sup>*<sup>1</sup> (*yi*1)

*T* ∏*t*=2 *ak ut*−1,*ut bk ut* (*yit*) ,

*<sup>u</sup>*<sup>1</sup> <sup>=</sup> *<sup>P</sup>*(*u*<sup>1</sup> <sup>=</sup> *<sup>s</sup>*|Θ*k*) with *<sup>s</sup>* ∈ {1,...,*Sk*} is the

(1)

techniques.

is

*k*; *a<sup>k</sup>*

*k*; and *b<sup>k</sup>*

*ut*

(Θ;*Y*) =

*n* ∑ *i*=1

2 Mixture Hidden Markov models

log*P*(*Yi*|Θ) =

vations of subject *i* at time *t*, π*<sup>k</sup>*

forward-backward algorithm.

3 Proposed model selection procedure

*n* ∑ *i*=1 log

$$\begin{aligned} \operatorname{En}\_1(S) &= -\sum\_{i=1}^n \sum\_{t=1}^T \sum\_{k=1}^K \sum\_{s=1}^{S\_k} P(u\_{it} = s | Y\_i, M\_k) \log P(u\_{it} = s | Y\_i, M\_k), \\ \operatorname{En}\_2(K) &= -\sum\_{i=1}^n \sum\_{k=1}^K P(M\_k | Y\_i) \log P(M\_k | Y\_i). \end{aligned}$$

$P(M\_k|Y\_i)$ is the posterior probability that the given $i$-th observed sequence has been generated by the $k$-th model; $P(u\_{it} = s|Y\_i, M\_k)$ is the posterior probability that the $t$-th element of the $i$-th hidden sequence takes the $s$-th hidden state, given the observed sequence $Y\_i$ and given that the sequence has been generated by the model related to the $k$-th cluster. Here $S = \sum\_{k=1}^K S\_k$ is the total number of hidden states over all $K$ clusters. $\mathrm{E}\_{new}$ takes values from 0 to 1: values close to 1 correspond to low entropy and a good degree of class separation, while values close to 0 correspond to a high entropy level and unreliable classification.
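A small sketch of criterion (2) and of the two-stage rule of Section 3 follows. The array layouts, the dictionary keys, and the convention that the state posteriors are arranged as a single joint distribution over the $S = \sum\_k S\_k$ concatenated states are our own assumptions; the entropies carry the usual minus sign so that the criterion lands in $[0, 1]$:

```python
import numpy as np

def e_new(post_states, post_clusters):
    """Entropy criterion E_new(S, K) of equation (2). Assumed layout:
    post_states[i, t, s] = posterior over the S concatenated hidden states
    for sequence i at time t; post_clusters[i, k] = P(M_k | Y_i).
    Rows sum to one; 0 * log 0 is taken as 0."""
    n, T, S = post_states.shape
    K = post_clusters.shape[1]
    def neg_entropy_sum(p):
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(p > 0, p * np.log(p), 0.0)
        return -terms.sum()
    en1 = neg_entropy_sum(post_states)     # state-level entropy En1(S)
    en2 = neg_entropy_sum(post_clusters)   # cluster-level entropy En2(K)
    return 1.0 - (en1 / (T * np.log(S)) + en2 / np.log(K)) / (2.0 * n)

def two_stage_select(models):
    """Sketch of the two-stage rule: keep models with IC below the mean IC,
    then return the survivor with the largest E_new. `models` is a
    hypothetical list of dicts with keys 'name', 'ic', 'e_new'."""
    mean_ic = np.mean([m["ic"] for m in models])
    shortlist = [m for m in models if m["ic"] < mean_ic] or models
    return max(shortlist, key=lambda m: m["e_new"])
```

With one-hot posteriors the criterion returns 1 (perfect separation); with uniform posteriors it returns 0 (maximal entropy).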

#### 4 Simulation study

We compare our model selection procedure to other methods such as AIC, BIC, and the sample-size adjusted BIC (ssBIC) through a Monte Carlo simulation study. We define 24 scenarios considering 4 models having different numbers of clusters $K$ and latent states $(S\_1, S\_2, \ldots, S\_K)$, by varying the number of sequences $n \in \{200, 2000\}$ and the state-dependent conditional probabilities $b^k\_{u\_t}(y\_{it})$ to represent low, medium, and high levels of uncertainty in the hidden-state classification of observations. We generate 100 longitudinal datasets for each scenario, for a total of 2400 datasets; the analysis is carried out using the *R* package "seqHMM" (Helske & Helske, 2017). In Table 1 we report each method's success rate for $n = 2000$, where success means identifying a model having the correct number of clusters $K$ and a number of hidden states equal to the exact number or within one of it. The last column reports the results of our procedure with the AIC as the IC used at the first stage, as it showed better results than the other IC.



Table 1: Results of the Monte Carlo study for *n* = 2000, under low, medium and high levels of uncertainty in the hidden states classification (standard errors in parentheses).

| Uncertainty | $(S\_1, S\_2, \ldots, S\_K)$ | BIC | AIC | ssBIC | $E\_{new}$ | Our Procedure |
|---|---|---|---|---|---|---|
| LOW | (2,3) | 1.00 (-) | 0.91 (0.029) | 0.96 (0.020) | 0.85 (0.036) | 0.85 (0.036) |
| | (2,2,3) | 0.62 (0.048) | 0.40 (0.049) | 0.58 (0.049) | 0.62 (0.048) | 0.66 (0.047) |
| | (2,2,3,3) | 0.30 (0.046) | 0.37 (0.048) | 0.34 (0.047) | 0.21 (0.041) | 0.60 (0.049) |
| | (2,2,3,3,2) | 0.19 (0.039) | 0.38 (0.048) | 0.24 (0.043) | 0.19 (0.039) | 0.48 (0.050) |
| MEDIUM | (2,3) | 1.00 (-) | 0.86 (0.035) | 1.00 (-) | 0.49 (0.050) | 0.73 (0.044) |
| | (2,2,3) | 0.20 (0.040) | 0.40 (0.049) | 0.29 (0.045) | 0.41 (0.049) | 0.59 (0.049) |
| | (2,2,3,3) | 0.02 (0.014) | 0.35 (0.048) | 0.08 (0.027) | 0.10 (0.042) | 0.53 (0.049) |
| | (2,2,3,3,2) | 0.00 (0.000) | 0.12 (0.032) | 0.00 (0.000) | 0.29 (0.045) | 0.31 (0.046) |
| HIGH | (2,3) | 1.00 (-) | 0.58 (0.049) | 1.00 (-) | 0.46 (0.050) | 0.62 (0.048) |
| | (2,2,3) | 0.10 (0.030) | 0.40 (0.049) | 0.14 (0.035) | 0.42 (0.049) | 0.59 (0.049) |
| | (2,2,3,3) | 0.00 (0.000) | 0.00 (0.000) | 0.00 (0.000) | 0.10 (0.030) | 0.28 (0.045) |
| | (2,2,3,3,2) | 0.00 (0.000) | 0.00 (0.000) | 0.00 (0.000) | 0.19 (0.039) | 0.25 (0.043) |

As we can see, the proposed procedure has a better performance than the classic IC-based model selection methods when the number of clusters is $K > 2$. We also note that, unlike these methods, it is less affected by an increase in the uncertainty of the hidden states' classification.

#### References

BARTOLUCCI, F., & PANDOLFI, S. 2015. LMest: Latent Markov Models with and without Covariates. *R package version*, 2.

CRAYEN, C., EID, M., LISCHETZKE, T., COURVOISIER, D. S., & VERMUNT, J. K. 2012. Exploring dynamics in mood regulation—mixture latent Markov modeling of ambulatory assessment data. *Psychosomatic Medicine*, 74(4), 366–376.

DIAS, J. G., VERMUNT, J. K., & RAMOS, S. 2009. Mixture hidden Markov models in finance research. *Pages 451–459 of: Advances in data analysis, data handling and business intelligence*. Springer.

HELSKE, S., & HELSKE, J. 2017. Mixture hidden Markov models for sequence data: The seqHMM package in R.

VERMUNT, J. K., TRAN, B., & MAGIDSON, J. 2008. Latent class models in longitudinal research. *Handbook of longitudinal research: Design, measurement, and analysis*, 373–385.

WANG, M., & CHAN, D. 2011. Mixture latent Markov modeling: Identifying and predicting unobserved heterogeneity in longitudinal qualitative status change. *Organizational Research Methods*, 14(3), 411–431.


### A FULL MIXTURE OF EXPERTS MODEL TO CLASSIFY CONSTRAINED DATA

Ascari Roberto <sup>1</sup> and Migliorati Sonia1

<sup>1</sup> Department of Economics, Management and Statistics, University of Milano-Bicocca, (e-mail: roberto.ascari@unimib.it, sonia.migliorati@unimib.it)

ABSTRACT: This contribution proposes a model-based classifier developed for compositional data. A full mixture of experts model with Dirichlet components is used to incorporate information both on the composition and on a set of covariates. Estimation issues are dealt with by a Bayesian approach, allowing the researcher to use the posterior distribution of the parameters to measure the classification uncertainty.

KEYWORDS: Dirichlet, mixture model, Bayesian, simplex.

#### 1 Introduction


Many fields have witnessed the increasing popularity of compositional data (i.e., vectors representing parts of a whole), which are defined on the *D*-part simplex $S^D = \{\mathbf{y} = (y\_1, \ldots, y\_D) : y\_d > 0, \sum\_{d=1}^D y\_d = 1\}$ (Ongaro *et al.*, 2020). Due to the unit-sum constraint imposed by $S^D$, standard statistical methods are often unsuitable for compositional data. Several ad-hoc proposals have been prompted by mapping the simplex into a different (unconstrained) space, but leaving the simplex often results in interpretative difficulties, especially when the relationship among variables is of interest. This is particularly true in the classification context, where methods for compositional data still present many unsolved issues (Gu & Cui, 2021). In this work, we define a full mixture of experts model (fmem, Bouveyron *et al.*, 2019) and use it to implement a supervised classification algorithm for compositional data in the presence of covariates. Since we adopt a Bayesian approach to inference, we take advantage of posterior samples to measure the classification uncertainty.

#### 2 Full mixture of experts model

A fmem is a generalization of a finite mixture model with $G$ components, where the mixing weights $\mathbf{p}$ and (some of) the component-specific parameters can be linked to a set of covariates through proper link functions. Since the random vector $\mathbf{Y}$ belongs to the simplex $S^D$, a mixture of Dirichlet components displaying different means $\mu\_j \in S^D$ ($j = 1, \ldots, G$) and a common precision parameter $\phi > 0$ is a proper choice. Thus, we can define the fmem probability density function (pdf) as

$$f\_{\mathbf{Y}}(\mathbf{y}\_{i}|\mathbf{x}\_{i},\boldsymbol{\upmu}(\mathbf{x}\_{i}),\mathbf{p}(\mathbf{x}\_{i}),\boldsymbol{\upphi}) = \sum\_{j=1}^{G} p\_{j}(\mathbf{x}\_{i}) f^{D} \left(\mathbf{y}\_{i};\boldsymbol{\upmu}\_{j}(\mathbf{x}\_{i}),\boldsymbol{\upphi}\right), \quad i = 1,\ldots,n \qquad (1)$$


where *f<sup>D</sup>*(·; ·, ·) is the Dirichlet pdf, p(x*i*) = (*p*1(x*i*),..., *pG*(x*i*)) ∈ *S<sup>G</sup>*, *µj*(x*i*) ∈ *S<sup>D</sup>* for any fixed x*i*, x*<sup>i</sup>* is the (*K* + 1)-dimensional vector of covariates, and *n* is the sample size. Since both p and *µj* belong to the simplex, we suggest taking advantage of the multinomial logit link function, so that:

$$p\_j(\mathbf{x}\_i) = \frac{\exp\left(\mathbf{x}\_i^{\top}\boldsymbol{\gamma}\_j\right)}{1+\sum\_{r=1}^{G-1}\exp\left(\mathbf{x}\_i^{\top}\boldsymbol{\gamma}\_r\right)}, \quad \mu\_{d,j}(\mathbf{x}\_i) = \frac{\exp\left(\mathbf{x}\_i^{\top}\boldsymbol{\beta}\_{d,j}\right)}{1+\sum\_{r=1}^{D-1}\exp\left(\mathbf{x}\_i^{\top}\boldsymbol{\beta}\_{r,j}\right)},$$

(*j* = 1,...,*G*; *d* = 1,...,*D*), where γ*j* and β*d*,*j* are (*K* + 1)-dimensional vectors, with γ*G* = β*D*,*j* = 0 for identifiability. Although a common (and constant) φ keeps the model simple, one can further generalize the model by linking φ to some covariates through a proper link function. Note that the proposed approach avoids any transformation of the compositional data, so that the regression coefficients admit an easy and meaningful interpretation.
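As an illustration, the two multinomial-logit links above can be sketched numerically as follows. This is a minimal sketch, not the authors' code; the function names and the toy dimensions (*G* = 2, *D* = 3, *K* = 1) are our own choices.

```python
import numpy as np

def mixing_weights(x, gamma):
    """Multinomial-logit mixing weights p_j(x), j = 1,...,G.

    gamma: (G-1, K+1) coefficient matrix; gamma_G = 0 is the reference
    category, so the weights sum to one by construction."""
    eta = gamma @ x                              # linear predictors, length G-1
    num = np.concatenate([np.exp(eta), [1.0]])   # last entry: reference category
    return num / num.sum()

def component_mean(x, beta_j):
    """Dirichlet mean mu_j(x) on the simplex S^D for component j.

    beta_j: (D-1, K+1) coefficient matrix; beta_{D,j} = 0 is the reference."""
    eta = beta_j @ x
    num = np.concatenate([np.exp(eta), [1.0]])
    return num / num.sum()

# toy check: both the weights and the component mean lie on the simplex
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])                          # intercept + one covariate (K = 1)
p = mixing_weights(x, rng.normal(size=(1, 2)))    # G = 2
mu = component_mean(x, rng.normal(size=(2, 2)))   # D = 3
assert np.isclose(p.sum(), 1.0) and np.isclose(mu.sum(), 1.0)
```

Fixing the reference categories to zero is what makes the remaining coefficients identifiable, exactly as in a standard multinomial logit regression.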

#### 3 Estimation and classification issues

Let us consider a supervised classification problem, where we want to learn a classifier on a training set, so that we can assign a label to new observations. More specifically, suppose we have observed a compositional vector Y*<sup>i</sup>* ∈ *S<sup>D</sup>*, a vector of covariates x*i*, and a discrete variable *Si*, *i* = 1,...,*n*. *Si* can assume *G* different labels, denoted by 1,...,*G*, with *Si* = *j* if the *i*-th observation belongs to the *j*-th group/label. Therefore, *Si* is the target in the classification task. Our training set consists of a vector S = (*S*1,...,*Sn*) and two matrices Y and X, whose generic *i*-th rows are Y*<sup>i</sup>* and x*i*, respectively. Here, the mixture components represent the *G* groups encoded by S. This means that we know which mixture component generated each training data point, and thus we can resort to the complete-data likelihood, which can be written as

$$L\_C(\boldsymbol{\eta}; \mathbf{y}, \mathbf{x}, \mathbf{s}) = \left[ \prod\_{j=1}^{G} \prod\_{i: S\_i = j} f^{D} \left( \mathbf{y}\_i; \boldsymbol{\mu}\_j(\mathbf{x}\_i), \phi \right) \right] \cdot \left[ \prod\_{j=1}^{G} \prod\_{i: S\_i = j} p\_j(\mathbf{x}\_i) \right],\tag{2}$$

where η = (β<sup>∗</sup><i>1</i>,...,β<sup>∗</sup>*G*, γ<sup>∗</sup>, φ), and β<sup>∗</sup>*j* and γ<sup>∗</sup> are matrices obtained by concatenating by row the vectors β1,*j*,...,β*D*,*j* and γ1,..., γ*G*, respectively. Following a Bayesian approach to inference, we have to specify a joint prior distribution for η. As a non-informative prior for the regression parameters β*d*,*j* and γ*j*, for any proper choice of *d* and *j*, we select a multivariate normal with zero mean vector and a diagonal covariance matrix with "large" variances. For the precision parameter φ, we adopt a Gamma(*g*,*g*) prior distribution, with rate parameter *g* "small" enough to induce a large variability. We simulate samples from the posterior distribution through the Hamiltonian Monte Carlo algorithm in the Stan language. Note that we do not face label-switching problems because we know the true allocations of the training observations to the mixture components. Once we have drawn *B* samples from the posterior distribution of η (namely, η<sup>(1)</sup>,...,η<sup>(*B*)</sup>), we can use them to classify a new observation for which we observe only (y*u*, x*u*), *u* > *n*. Indeed, Bayes' theorem enables us to compute the posterior probability that unit *u* arises from group *j* given its observed values y*<sup>u</sup>* and x*u*, for *b* = 1,...,*B*, that is:

$$\hat{z}\_{u,j}^{(b)} = P\left(S\_{u} = j \mid \mathbf{Y}\_{u} = \mathbf{y}\_{u}, \mathbf{x}\_{u}; \boldsymbol{\eta}^{(b)}\right) = \frac{p\_{j}^{(b)}(\mathbf{x}\_{u}) \cdot f^{D}\left(\mathbf{y}\_{u}; \boldsymbol{\mu}\_{j}^{(b)}(\mathbf{x}\_{u}), \phi^{(b)}\right)}{\sum\_{l=1}^{G} p\_{l}^{(b)}(\mathbf{x}\_{u}) \cdot f^{D}\left(\mathbf{y}\_{u}; \boldsymbol{\mu}\_{l}^{(b)}(\mathbf{x}\_{u}), \phi^{(b)}\right)},\tag{3}$$

where *µ*<sup>(*b*)</sup>*j* and *p*<sup>(*b*)</sup>*j* are computed from η<sup>(*b*)</sup>. Although a Bayesian classification rule can be defined by allocating unit *u* to the group *j* whose mean simulated ˆ*z*<sup>(*b*)</sup>*u*,*j* is the highest (*j* = 1,...,*G*), the purpose of this contribution is to take advantage of the (simulated) posterior distribution of the probability of each category to measure the classification uncertainty, as we discuss in the next section.
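A minimal sketch of Eq. (3), assuming the posterior draws are available as plain arrays; the mean-precision parameterization α*j* = φ·µ*j* of the Dirichlet is implied by the model above, while the function and argument names are our own.

```python
import numpy as np
from scipy.stats import dirichlet

def posterior_class_probs(y_u, p_draws, mu_draws, phi_draws):
    """z_hat_{u,j}^{(b)} of Eq. (3) for one new unit u.

    p_draws:   (B, G) mixing weights p_j^{(b)}(x_u)
    mu_draws:  (B, G, D) component means mu_j^{(b)}(x_u)
    phi_draws: (B,) precision draws phi^{(b)}
    Returns a (B, G) array; each row sums to one."""
    B, G = p_draws.shape
    z = np.empty((B, G))
    for b in range(B):
        # mean-precision parameterization: alpha_j = phi * mu_j
        dens = np.array([dirichlet.pdf(y_u, phi_draws[b] * mu_draws[b, j])
                         for j in range(G)])
        w = p_draws[b] * dens
        z[b] = w / w.sum()          # normalize as in Eq. (3)
    return z
```

One can then allocate unit *u* to the group with the highest mean of `z[:, j]` over the draws, or inspect the whole distribution of `z[:, j]` to gauge the classification uncertainty, as the text suggests.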

#### 4 Application on plants data


We consider an application based on a compositional dataset regarding *n* = 500 plants (Douma & Weedon, 2019). The composition is defined by the proportions of biomass in roots (RMF), stems (SMF), and leaves (LMF). We aim to classify the species of a plant (*D. flexuosa* or *H. lanatus*, so that *G* = 2) based on the biomass composition, as well as on two covariates: the nitrate supply level (high or low) and a measure of the total amount of biomass (TDM). Since we have neither a validation nor a test set, we use *V*-fold cross-validation to assess the performance of the classification rule. Thus, we randomly divide the dataset into *V* = 4 parts and classify each fold using the remaining three parts as the training set. The estimated overall misclassification error rate (MER), defined as the average of the fold-specific MERs, is 0.238. Fig. 1 shows the simulated distribution of the posterior probability of being classified as *D. flexuosa* for eight randomly selected plants. Classifying as *D. flexuosa* every plant with a mean (or median) posterior probability greater than 0.5, we misclassify two plants (2/8 = 0.25, in line with the overall MER of 0.238). The range of each subject-specific posterior probability distribution helps in assessing the classification uncertainty. For example, the distribution of the posterior probability for plant 6 is very wide and centered close to 0.5, suggesting that its classification could be unreliable, while the reverse holds for the other plants.
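The cross-validation scheme just described can be sketched as follows; `fit` and `predict` stand for a generic classifier interface (hypothetical names, not the authors' implementation of the Bayesian fmem classifier).

```python
import numpy as np

def cv_misclassification(X, y, fit, predict, V=4, seed=0):
    """Overall MER as the average of fold-specific MERs (V-fold CV).

    `fit(X_train, y_train)` returns a fitted model; `predict(model, X_test)`
    returns predicted labels. Both are placeholders for any classifier."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, V)          # V roughly equal random parts
    mers = []
    for v in range(V):
        test = folds[v]
        train = np.concatenate([folds[w] for w in range(V) if w != v])
        model = fit(X[train], y[train])
        mers.append(np.mean(predict(model, X[test]) != y[test]))
    return float(np.mean(mers))             # average of fold-specific MERs
```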


Figure 1. *Boxplots of the posterior probability of being classified as D. flexuosa for 8 randomly selected plants. Colors represent the true label of each plant.*

#### References

BOUVEYRON, C., CELEUX, G., MURPHY, T.B., & RAFTERY, A.E. 2019. *Model-based Clustering and Classification for Data Science*.

DOUMA, J.C., & WEEDON, J.T. 2019. Analysing continuous proportions in ecology and evolution: A practical introduction to beta and Dirichlet regression. *Methods Ecol Evol*, 10, 1412–1430.

GU, J., & CUI, B. 2021. A classification framework for multivariate compositional data with Dirichlet feature embedding. *Knowl Based Syst*.

ONGARO, A., MIGLIORATI, S., & ASCARI, R. 2020. A new mixture model on the simplex. *Stat Comp*, 30, 749–770.


## SPARSE INFERENCE IN COVARIATE ADJUSTED CENSORED GAUSSIAN GRAPHICAL MODELS

Luigi Augugliaro1, Gianluca Sottile1 and Angelo M. Mineo<sup>1</sup>

<sup>1</sup> Dep. of Economics, Business and Statistics, University of Palermo, Italy, (e-mail: luigi.augugliaro@unipa.it, angelo.mineo@unipa.it, gianluca.sottile@unipa.it)

ABSTRACT: The covariate adjusted glasso is one of the most used estimators for inferring genetic networks. Despite its diffusion, there are several fields in applied research where the limits of detection of modern measurement technologies make the use of this estimator theoretically unfounded, even when the assumption of a multivariate Gaussian distribution is satisfied. In this paper we propose an extension to censored data.

KEYWORDS: censored data, censored glasso estimator, Gaussian graphical model, glasso estimator.

### 1 Introduction


An important aim in genomics is to understand interactions among genes, characterized by the regulation and synthesis of proteins under internal and external signals. These relationships can be represented by a genetic network, i.e., a graph where nodes represent genes and edges describe the interactions among them. Gaussian graphical models (GGM; Lauritzen, 1996) have been widely used for reconstructing a genetic network from expression data. The reason for such diffusion lies in the statistical properties of the multivariate Gaussian distribution, which allow the topological structure of a network to be related to the non-zero elements of the concentration matrix, i.e., the inverse of the covariance matrix. Thus, the problem of network inference can be recast as the problem of estimating a concentration matrix. The covariate adjusted glasso estimator (Yin & Li, 2011) is a popular method for estimating a sparse concentration matrix, based on the idea of adding an ℓ1-penalty function to the likelihood function of the multivariate Gaussian distribution. Despite the widespread literature on the covariate adjusted glasso estimator, there are many fields in applied research where the use of this graphical model is theoretically unfounded; for example, in some cases the data are left- or right-censored. In this paper we propose an extension of the covariate adjusted glasso estimator that takes the censoring mechanism of the data into account explicitly.

#### 2 The covariate adjusted censored Gaussian graphical model

Let *Y* = (*Y*1,...,*Yp*) be a *p*-dimensional random vector. Graphical models allow one to represent the set of conditional independencies among these random variables by a graph *G* = {*V*,*E*}, where *V* is the set of nodes associated with *Y* and *E* ⊆ *V* × *V*

is the set of ordered pairs, called edges, representing the conditional dependencies among the *p* random variables (Lauritzen, 1996). The covariate adjusted Gaussian graphical model (CGGM) is an extension of the classical GGM based on the assumption that the conditional distribution of *Y* given a *q*-dimensional vector of predictors, say *X* = (*X*1,...,*Xq*), is multivariate Gaussian with expected value *µ*(β) = β<sup>⊤</sup>*x*, where β = (β*hk*) is a *q* × *p* coefficient matrix, and with covariance matrix Σ = (σ*hk*). Denoting by Θ = (θ*hk*) the concentration matrix, i.e., the inverse of the covariance matrix, the conditional density function of *Y* can be written as follows:

$$\phi(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\Theta}) = (2\pi)^{-p/2} |\boldsymbol{\Theta}|^{1/2} \exp[-1/2 \{\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\beta})\}^{\top} \boldsymbol{\Theta} \{\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\beta})\}].\tag{1}$$



As shown in Lauritzen (1996), the off-diagonal elements of the concentration matrix are the parametric tools relating the pairwise Markov property to the factorization of the density (1), i.e., two random variables, say *Yh* and *Yk*, are conditionally independent given all the remaining variables if and only if θ*hk* is equal to zero.

As done in Augugliaro *et al.* (2020), we assume that *Y* is a (partially) latent random vector with density function (1). In order to include the censoring mechanism in our framework, let us denote by *l* = (*l*1,...,*lp*) and *u* = (*u*1,...,*up*), with *lh* < *uh* for *h* = 1,..., *p*, the vectors of known left and right censoring values. Thus, *Yh* is observed only if it lies inside the interval [*lh*,*uh*]; otherwise it is censored from below if *Yh* < *lh* or censored from above if *Yh* > *uh*. Following the approach for missing data with a nonignorable mechanism (Little & Rubin, 2002), we introduce the random vector *R*(*Y*;*l*,*u*) to encode the censoring patterns, whose *h*th element is defined as *R*(*Yh*;*lh*,*uh*) = *I*(*Yh* > *uh*) − *I*(*Yh* < *lh*), where *I*(·) denotes the indicator function. By construction, *R*(*Y*;*l*,*u*) is a discrete random vector with support {−1,0,1}<sup>*p*</sup> and probability function Pr{*R*(*Y*;*l*,*u*) = *r*} = ∫<sub>*Dr*</sub> φ(*y* | *x*;β,Θ)*dy*, where *Dr* = {*y* ∈ ℝ<sup>*p*</sup> : *R*(*y*;*l*,*u*) = *r*}. Given a censoring pattern, we can simplify our notation by partitioning the set *I* = {1,..., *p*} into *o* = {*h* ∈ *I* : *rh* = 0}, *c*<sup>−</sup> = {*h* ∈ *I* : *rh* = −1}, and *c*<sup>+</sup> = {*h* ∈ *I* : *rh* = +1}; in the following, we use the convention that a vector indexed by a set of indices denotes the corresponding subvector. As done in Augugliaro *et al.* (2020), the probability distribution of the observed data, denoted by ϕ({*yo*,*r*} | *x*;β,Θ), can be defined as follows:

$$\phi(\{\mathbf{y}\_o, \mathbf{r}\}\mid\mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\Theta}) = \int \phi(\{\mathbf{y}\_o, \mathbf{y}\_c\}\mid\mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\Theta}) \Pr\{R(\mathbf{Y}; \mathbf{l}, \mathbf{u}) = \mathbf{r}\mid\mathbf{Y} = \mathbf{y}\}\, d\mathbf{y}\_c,\tag{2}$$

where *<sup>c</sup>* <sup>=</sup> *<sup>c</sup>*<sup>−</sup> <sup>∪</sup>*c*+. Density (2) can be simplified by observing that Pr{*R*(*<sup>Y</sup>* ;*l*,*u*) = *r* | *Y* = *y*} is equal to one if the censoring pattern encoded in *r* is equal to the pattern observed in *y*, otherwise it is equal to zero, hence ϕ({*yo*,*r*} | *x*;β,Θ) can be rewritten as

$$\phi(\{\mathbf{y}\_o, \mathbf{r}\}\mid\mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\Theta}) = I(\mathbf{l}\_o \le \mathbf{y}\_o \le \mathbf{u}\_o)\int\_{D\_c} \phi(\{\mathbf{y}\_o, \mathbf{y}\_c\}\mid\mathbf{x}; \boldsymbol{\beta}, \boldsymbol{\Theta})\, d\mathbf{y}\_c,\tag{3}$$

where *Dc* = (−∞, *l*<sub>*c*−</sub>) × (*u*<sub>*c*+</sub>, +∞). Using density (3), the covariate adjusted censored Gaussian graphical model (CCGGM) is defined as the set {*Y*, *R*(*Y*;*l*,*u*), ϕ({*yo*,*r*} | *x*;β,Θ), *G*}, where ϕ({*yo*,*r*} | *x*;β,Θ) factorizes according to the undirected graph *G*.
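The censoring indicator *R*(*Y*;*l*,*u*) and the induced index sets *o*, *c*<sup>−</sup>, *c*<sup>+</sup> are straightforward to compute; the following is a minimal sketch (function names are ours, not from the paper).

```python
import numpy as np

def censoring_pattern(y, l, u):
    """R(Y; l, u): elementwise I(y_h > u_h) - I(y_h < l_h).

    Returns a vector in {-1, 0, +1}^p: -1 for left-censored, +1 for
    right-censored, 0 for values observed inside [l_h, u_h]."""
    return (y > u).astype(int) - (y < l).astype(int)

def index_sets(r):
    """Partition I = {1,...,p} into observed, left- and right-censored sets."""
    o = np.flatnonzero(r == 0)
    c_minus = np.flatnonzero(r == -1)
    c_plus = np.flatnonzero(r == 1)
    return o, c_minus, c_plus
```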

#### 3 The covariate adjusted censored glasso estimator

Suppose we have a sample of *n* independent observations drawn from a CCGGM. For ease of exposition, we assume that *l* and *u* are fixed across the *n* observations. To simplify the notation, the set of indices of the variables observed in the *i*th observation is denoted by *oi* = {*h* ∈ *I* : *rih* = 0}, while *ci*<sup>−</sup> = {*h* ∈ *I* : *rih* = −1} and *ci*<sup>+</sup> = {*h* ∈ *I* : *rih* = +1} denote the sets of indices associated with the left- and right-censored data, respectively. Denoting by *ri* the realization of the random vector *R*(*Yi*;*l*,*u*), the *i*th observed datum is the vector (*y*<sub>*ioi*</sub>, *xi*, *ri*). Using the density function (3), the observed log-likelihood function can be written as

$$\ell(\boldsymbol{\beta}, \boldsymbol{\Theta}) = \sum\_{i=1}^{n} \log \int\_{D\_{c\_i}} \phi(\{\mathbf{y}\_{io\_i}, \mathbf{y}\_{ic\_i}\} \mid \mathbf{x}\_i; \boldsymbol{\beta}, \boldsymbol{\Theta})\, d\mathbf{y}\_{ic\_i} = \sum\_{i=1}^{n} \log \phi(\{\mathbf{y}\_{io\_i}, \mathbf{r}\_i\} \mid \mathbf{x}\_i; \boldsymbol{\beta}, \boldsymbol{\Theta}), \quad (4)$$

where *Dci* = (−∞, *l*<sub>*ci*−</sub>) × (*u*<sub>*ci*+</sub>, +∞) and *ci* = *ci*<sup>−</sup> ∪ *ci*<sup>+</sup>. Although inference about the parameters of this model can be carried out via the maximum likelihood method, the application of this inferential procedure to real datasets is limited.

We propose to estimate the parameters of the CCGGM by generalizing the approach proposed in Yin & Li (2011), i.e., by maximizing a new objective function defined by adding two lasso-type penalty functions to the observed log-likelihood (4). The resulting estimator, called covariate adjusted censored glasso estimator, is formally defined as

$$\{\hat{\boldsymbol{\beta}}^{\lambda}, \hat{\boldsymbol{\Theta}}^{\rho}\} = \arg\max\_{\boldsymbol{\beta},\, \boldsymbol{\Theta} \succ 0}\ \frac{1}{n} \sum\_{i=1}^{n} \log \phi(\{\mathbf{y}\_{io\_i}, \mathbf{r}\_i\} \mid \mathbf{x}\_i; \boldsymbol{\beta}, \boldsymbol{\Theta}) - \lambda \sum\_{h,k} |\beta\_{hk}| - \rho \sum\_{h \neq k} |\theta\_{hk}|,\tag{5}$$

where λ and ρ are two non-negative tuning parameters. The lasso penalty on β introduces sparsity in the estimated coefficient matrix β̂<sup>λ</sup>, while the tuning parameter ρ controls the amount of sparsity in the estimated concentration matrix Θ̂<sup>ρ</sup> = (θ̂<sup>ρ</sup>*hk*).
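The penalty part of Eq. (5) is easy to make concrete; the sketch below (our own helper, not the authors' code) shows that the ℓ1-penalty on Θ acts on the off-diagonal entries only, so the diagonal is not shrunk.

```python
import numpy as np

def penalty(beta, theta, lam, rho):
    """Penalty part of Eq. (5): lasso on all beta_{hk}, plus an
    l1-penalty on the off-diagonal entries of Theta only."""
    off = theta - np.diag(np.diag(theta))    # zero out the diagonal
    return lam * np.abs(beta).sum() + rho * np.abs(off).sum()
```

Excluding the diagonal of Θ from the penalty is the same convention used by the standard glasso: shrinking the diagonal would bias the (inverse) conditional variances without affecting the graph structure.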

#### 4 Simulation study

is the set of ordered pairs, called edges, representing the conditional dependencies among the *p* random variables (Lauritzen (1996)). The covariate adjusted Gaussian graphical model (CGGM) is an extension of the classical GGM based on the assumption that the conditional distribution of *Y* given a *q*-dimensional vector of predictors, say *X* = (*X*1,...,*Xq*), follows a multivariate Gaussian distribution with expected value: *µ*(β) = β*x*, where β = (β*hk*) is a matrix *q* × *p* coefficient matrix, and covariance matrix denoted by Σ = (σ*hk*). Denoting with Θ = (θ*hk*) the concentration matrix, i.e., the inverse of the covariance matrix, the conditional density function of *Y* can be

As shown in Lauritzen (1996), the off-diagonal elements of the concentration matrix are the parametric tools relating the pairwise Markov property to the factorization of the density (1), i.e., two random variables, say *Yh* and *Yk*, are conditionally independent

As done in Augugliaro *et al.* (2020), we assume that *Y* is a (partially) latent random vector with density function

$$\varphi(y \mid x;\beta,\Theta) = (2\pi)^{-p/2}|\Theta|^{1/2}\exp\left[-\tfrac{1}{2}\{y-\mu(\beta)\}^{\top}\Theta\{y-\mu(\beta)\}\right],\tag{1}$$

so that, by standard theory of Gaussian graphical models (Lauritzen, 1996), *Yh* and *Yk* are conditionally independent given all the remaining variables if and only if θ*hk* is equal to zero.

In order to include the censoring mechanism inside our framework, let us denote by *l* = (*l*1,...,*lp*) and *u* = (*u*1,...,*up*), with *lh* < *uh* for *h* = 1,..., *p*, the vectors of known left and right censoring values. Thus, *Yh* is observed only if it falls inside the interval [*lh*,*uh*]; otherwise it is censored from below if *Yh* < *lh* or censored from above if *Yh* > *uh*. Using the approach for missing data with a nonignorable mechanism (Little & Rubin, 2002), we introduce the quantity *R*(*Y*;*l*,*u*) to encode the censoring patterns; its *h*th element is defined as *R*(*Yh*;*lh*,*uh*) = *I*(*Yh* > *uh*) − *I*(*Yh* < *lh*), where *I*(·) denotes the indicator function. By construction, *R*(*Y*;*l*,*u*) is a discrete random vector with support {−1,0,1}*<sup>p</sup>* and probability function

$$\Pr\{R(Y;l,u)=r\} = \int\_{D\_r}\varphi(y \mid x;\beta,\Theta)\,dy,$$

where *D<sub>r</sub>* = {*y* ∈ R*<sup>p</sup>* : *R*(*y*;*l*,*u*) = *r*}. Given a censoring pattern, we can simplify our notation by partitioning the set *I* = {1,..., *p*} into *o* = {*h* ∈ *I* : *rh* = 0}, *c*<sup>−</sup> = {*h* ∈ *I* : *rh* = −1} and *c*<sup>+</sup> = {*h* ∈ *I* : *rh* = +1}; in the following, we shall use the convention that a vector indexed by a set of indices denotes the corresponding subvector. As done in Augugliaro *et al.* (2020), the probability distribution of the observed data, denoted by ϕ({*yo*,*r*} | *x*;β,Θ), can be defined as follows:

$$\phi(\{y\_o,r\} \mid x;\beta,\Theta) = \int \varphi(\{y\_o,y\_c\} \mid x;\beta,\Theta)\,\Pr\{R(Y;l,u)=r \mid Y=y\}\,dy\_c,\tag{2}$$

where *c* = *c*<sup>−</sup> ∪ *c*<sup>+</sup>. Density (2) can be simplified by observing that Pr{*R*(*Y*;*l*,*u*) = *r* | *Y* = *y*} is equal to one if the censoring pattern encoded in *r* is equal to the pattern observed in *y*, and to zero otherwise; hence ϕ({*yo*,*r*} | *x*;β,Θ) can be rewritten as

$$\phi(\{y\_o,r\} \mid x;\beta,\Theta) = \int\_{D\_c} \varphi(\{y\_o,y\_c\} \mid x;\beta,\Theta)\,dy\_c\,I(l\_o \le y\_o \le u\_o),\tag{3}$$

where *D<sub>c</sub>* = (−∞, *l*<sub>*c*−</sub>) × (*u*<sub>*c*+</sub>, +∞). Using density (3), the covariate adjusted censored Gaussian graphical model (CCGGM) is defined as the set {*Y*, *R*(*Y*;*l*,*u*), ϕ({*yo*,*r*} | *x*;β,Θ), *G*}, where ϕ({*yo*,*r*} | *x*;β,Θ) factorizes according to the undirected graph *G*.
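As an illustration of the notation above, the censoring indicator *R*(*y*;*l*,*u*) and the index partition into *o*, *c*<sup>−</sup> and *c*<sup>+</sup> can be sketched in a few lines of Python. This is our own minimal sketch with hypothetical helper names, not the authors' implementation:

```python
def censor_pattern(y, l, u):
    """h-th element: I(y_h > u_h) - I(y_h < l_h), so 0 means observed,
    -1 left-censored and +1 right-censored (the vector R(y; l, u))."""
    return [int(yh > uh) - int(yh < lh) for yh, lh, uh in zip(y, l, u)]

def partition_indices(r):
    """Partition the index set {1,...,p} (0-based here) into the observed
    set o and the left/right censored sets c- and c+."""
    o      = [h for h, rh in enumerate(r) if rh == 0]
    c_min  = [h for h, rh in enumerate(r) if rh == -1]
    c_plus = [h for h, rh in enumerate(r) if rh == +1]
    return o, c_min, c_plus
```

With *l* = (−1,−1,−1) and *u* = (1,1,1), the observation (0.5, −2, 5) yields the pattern (0, −1, +1) and the partition *o* = {1}, *c*<sup>−</sup> = {2}, *c*<sup>+</sup> = {3}.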

In this section, we compare our proposed estimator with MissGlasso (Städler & Bühlmann, 2012), which performs ℓ1-penalized estimation under the assumption that the censored data are missing at random, and with the covariate adjusted glasso estimator (Yin & Li, 2011), in which the empirical covariance matrix is calculated by imputing the missing values with the censoring values. These estimators are evaluated in terms of their ability to recover both the structure of the true graph and the structure of the true coefficient matrix. We use the method implemented in the R package huge (Zhao *et al.*, 2020) to simulate a sparse concentration matrix with a random structure for *Y*. We set the probability of observing a link between two nodes to *k*/*p*, where *p* is the number of responses and *k* is used to control the amount of sparsity in Θ. Moreover, we set the right censoring value to 40 for every variable and the sample size *n* to 100. The predictor matrix *X* is sampled from a multivariate Gaussian distribution with zero expected value and a sparse covariance matrix simulated as done for *Y*. Each column of the true matrix of predictors β contains only two non-zero regression coefficients, sampled from a uniform distribution on the interval [0.3,0.7]. The values of the intercepts are chosen in such a way that *H* response variables are right censored with probability equal to 0.40. The quantities *k*, *p*, *q* and *H* are chosen according to the following cases:

• Scenario 1: *k* = 3, *p* = 50, *q* = 10 and *H* = 25. This setting is used to evaluate the effects of the number of censored variables on the behavior of the proposed estimators when *n* > *p*.

• Scenario 2: *k* = 3, *p* = 150, *q* = 10 and *H* = 75. This setting is used to evaluate the impact of the high dimensionality on the estimators (*p* > *n*).
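The intercept calibration described above (choosing intercepts so that a response is right censored at 40 with probability 0.40) can be sketched as follows. This is our own illustration, assuming a unit residual standard deviation; the function name is hypothetical:

```python
from statistics import NormalDist

def intercept_for_censoring(u=40.0, prob_censored=0.40, sd=1.0):
    """Choose the intercept mu so that P(Y > u) = prob_censored
    when Y ~ N(mu, sd**2); used to calibrate the simulation design."""
    z = NormalDist().inv_cdf(1.0 - prob_censored)  # standard normal 0.60-quantile
    return u - z * sd

mu = intercept_for_censoring()
# with this intercept, the right-censoring probability at u = 40 is 0.40
p_censored = 1.0 - NormalDist(mu, 1.0).cdf(40.0)
```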



For each scenario, we simulate 50 samples, and in each simulation we compute the coefficient path using cglasso, MissGlasso, and glasso. Each path is computed using an equally spaced sequence of ρ- and λ-values. Moreover, the precision-recall curves and the areas under the curves (AUCs) are computed for each scenario. Table 1 shows that cglasso gives a better estimate of the concentration and coefficient matrices in terms of AUC, for any given value of the tuning parameters. We report only five evenly spaced values of λ and ρ.

Table 1. *Mean area under the curve across the sequence of* ρ*- and* λ*-values under the two Scenarios (row blocks). The first column block refers to the concentration matrix (*Θ*) when* λ *is fixed; the second refers to the coefficient matrix (*β*) when* ρ *is fixed. In the Method column, (1), (2) and (3) refer to the cglasso, MissGlasso and glasso algorithms, respectively.*

| Scenario | Method | λ/λmax = 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ρ/ρmax = 0.00 | 0.25 | 0.50 | 0.75 | 1.00 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) | 0.546 | 0.429 | 0.139 | 0.103 | 0.101 | 0.844 | 0.877 | 0.883 | 0.882 | 0.885 |
| 1 | (2) | 0.239 | 0.199 | 0.086 | 0.073 | 0.073 | 0.745 | 0.764 | 0.766 | 0.767 | 0.768 |
| 1 | (3) | 0.414 | 0.218 | 0.097 | 0.092 | 0.091 | 0.813 | 0.847 | 0.864 | 0.866 | 0.866 |
| 2 | (1) | 0.418 | 0.094 | 0.037 | 0.035 | 0.035 | 0.794 | 0.930 | 0.931 | 0.929 | 0.933 |
| 2 | (2) | 0.329 | 0.098 | 0.033 | 0.031 | 0.030 | 0.753 | 0.830 | 0.831 | 0.830 | 0.831 |
| 2 | (3) | 0.321 | 0.040 | 0.033 | 0.032 | 0.031 | 0.751 | 0.902 | 0.906 | 0.907 | 0.907 |
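The precision-recall AUC used to score graph recovery can be sketched generically as below: candidate edges are ranked by score (e.g., |θ*hk*| along the path) and the precision-recall curve is integrated by the trapezoidal rule. This is our own illustration of the metric, not the code used for Table 1:

```python
def precision_recall_auc(scores, truth):
    """Trapezoidal area under the precision-recall curve obtained by
    sweeping a threshold down a ranking of candidate edges; truth is 0/1.
    A sketch for illustration only (ties are not handled)."""
    pairs = sorted(zip(scores, truth), key=lambda t: -t[0])
    pos = sum(truth)
    tp = fp = 0
    pts = [(0.0, 1.0)]                     # (recall, precision) at the start
    for _score, is_edge in pairs:
        tp += is_edge
        fp += 1 - is_edge
        pts.append((tp / pos, tp / (tp + fp)))
    return sum((r2 - r1) * (p1 + p2) / 2.0
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))
```

A perfect ranking (all true edges scored above all non-edges) yields an AUC of 1.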


#### References

AUGUGLIARO, L., ABBRUZZO, A., & VINCIOTTI, V. 2020. ℓ1-Penalized censored Gaussian graphical model. *Biostatistics*, 21(2), e1–e16.

LAURITZEN, S.L. 1996. *Graphical Models*. Oxford University Press, Oxford.

LITTLE, R.J.A., & RUBIN, D.B. 2002. *Statistical Analysis with Missing Data*. John Wiley & Sons, Inc., Hoboken.

STÄDLER, N., & BÜHLMANN, P. 2012. Missing values: sparse inverse covariance estimation and an extension to sparse regression. *Statistics and Computing*, 22(1), 219–235.

YIN, J., & LI, H. 2011. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. *Annals of Applied Statistics*, 5(4), 2630–2650.

ZHAO, T., LI, X., LIU, H., ROEDER, K., LAFFERTY, J., & WASSERMAN, L. 2020. *huge: High-Dimensional Undirected Graph Estimation*. R package version 1.3.4.1.

### SEMI-SUPERVISED LEARNING THROUGH DEPTH FUNCTIONS

Simona Balzano<sup>1</sup>, Mario R. Guarracino<sup>1</sup> and Giovanni C. Porzio<sup>1</sup>

<sup>1</sup> Department of Economics and Law, University of Cassino and Southern Lazio (e-mail: s.balzano@unicas.it, mario.guarracino@unicas.it, porzio@unicas.it)

ABSTRACT: Depth functions have been exploited in supervised learning for years. Since the depth of a point is essentially a distribution-free measure of its distance from the center of a distribution, their use in supervised learning arose naturally and has met with a certain degree of success. In particular, DD-classifiers and their extensions have been extensively studied and applied in many fields and statistical settings. What has not been investigated so far is their use within a semi-supervised learning framework, that is, when some labeled data are available along with unlabeled data within the same training set. This case arises in many applications, and it has been proved that combining information from labeled and unlabeled data can improve the overall performance of a classifier. For this reason, this work introduces semi-supervised learning techniques in association with DD-classifiers and investigates to what extent such techniques are able to improve DD-classifier performance. Performance is evaluated by means of an extensive simulation study and illustrated on some real data sets.

KEYWORDS: DD-classifiers, labeled and unlabeled data, supervised learning.
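To make the notion of depth concrete, a univariate halfspace (Tukey) depth can be sketched as follows. This minimal example is ours and is not part of the abstract's methodology:

```python
def halfspace_depth_1d(x, sample):
    """Univariate Tukey (halfspace) depth of x: the smaller of the two
    sample fractions lying on either side of x. Deepest points sit near
    the median; points outside the sample range get depth 0."""
    n = len(sample)
    left = sum(s <= x for s in sample) / n
    right = sum(s >= x for s in sample) / n
    return min(left, right)
```

For the sample {1,2,3,4,5}, the depth is maximal (0.6) at the median 3, drops to 0.2 at the extreme point 1, and is 0 outside the range, which is the distribution-free "distance from the center" exploited by DD-classifiers.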

### A COMBINED TEST OF THE BENFORD HYPOTHESIS WITH ANTI-FRAUD APPLICATIONS

Lucio Barabesi<sup>1</sup>, Andrea Cerasa<sup>2</sup>, Andrea Cerioli<sup>3</sup> and Domenico Perrotta<sup>2</sup>

<sup>1</sup> University of Siena, Department of Economics and Statistics, Siena, Italy, (e-mail: lucio.barabesi@unisi.it)

<sup>2</sup> European Commission, Joint Research Centre (JRC), Ispra, Italy, (e-mail: andrea.cerasa@ec.europa.eu, domenico.perrotta@ec.europa.eu)

<sup>3</sup> University of Parma, Department of Economics and Management, Parma, Italy, (e-mail: andrea.cerioli@unipr.it)

ABSTRACT: In this work we describe a combined test of the null hypothesis that the significant digits in a random sample of numbers follow Benford's law. We also show the potential of the method for the purpose of fraud detection in international trade.

KEYWORDS: Anomaly detection, Benford's law, sum-invariance, customs data.

#### 1 Motivating framework of data analysis

Most unsupervised fraud detection methods look for anomalies in the data. Therefore, all of these techniques assume that the available data have been generated by an appropriate contamination model. Any parameter of the distribution that models the "genuine" part of the data, say *F*0, must then be estimated in a robust way, in order to avoid the well-known masking and swamping effects due to the anomalies themselves (Cerioli *et al.*, 2019b). In the context of fraud detection in international trade, where the value of an individual import transaction *X* originates from the product of the traded amount *v* with the unit price β, the available anti-fraud tools are derived from the theory of outlier identification in robust regression; see, e.g., Perrotta *et al.*, 2020b. Under this approach it is then assumed that non-fraudulent transactions for a specific good are generated according to the distribution function

$$F\_0(x) = \Phi\left(\frac{x - \beta v}{b}\right),\tag{1}$$

where Φ is the distribution function of a standard Normal random variable. In model (1), the regression slope β corresponds to unit price and *b* > 0 defines the (usually unknown) model variability, which is taken to be constant. Robust and efficient estimation of β in model (1) may lead to the definition of a "fair" unit price for the good under consideration, against which individual or aggregate transaction prices can be contrasted. Transactions well below the "fair" price may correspond to revenue frauds leading to substantial undervaluation of goods imported into the European Union; see, e.g., European Anti-Fraud Office, 2018, p. 26. The normality assumption in model (1) has proven to be satisfactory in the case of monthly-aggregated trade data (Perrotta *et al.*, 2020b). However it may become less adequate when analyzing individual customs declarations, where multiple populations often occur and a skew distribution may seem more appropriate for the definition of *F*<sup>0</sup> (Perrotta *et al.*, 2020a). An alternative contamination model based on Benford's law then becomes very useful in such a framework: see Cerioli *et al.*, 2019a.

#### 2 Benford's law

Benford's law (BL, for short) is a fascinating phenomenon which rules the pattern of the leading digits in many types of data. Informally speaking, the law states that the digits follow a logarithmic-type distribution in which the leading digit 1 is more likely to occur than the leading digit 2, the leading digit 2 is more likely than the leading digit 3, and so on. Indeed, the first-digit form of BL gives the probability that the first leading digit equals *d*, for *d* = 1,...,9, as

$$
\log\_{10}\left(1+\frac{1}{d}\right).\tag{2}
$$
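The first-digit distribution (2) is easy to tabulate; the helper name below is ours, for illustration:

```python
import math

def benford_first_digit(d):
    """P(first significant digit = d) under the first-digit form (2)."""
    return math.log10(1 + 1 / d)

# the nine first-digit probabilities, decreasing from d = 1 to d = 9
probs = [benford_first_digit(d) for d in range(1, 10)]
```

The probabilities sum to one, and the leading digit 1 occurs with probability log10 2 ≈ 0.301, roughly 6.5 times more often than the digit 9.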

Another, perhaps even less intuitive, property of Benford's law concerns sum invariance. Given an absolutely-continuous random variable *X*, in the first digit setting of (2), this property states that, for *d* = 1,...,9,

$$\mathbb{E}[S(X)\,\mathrm{I}\_{[d,d+1[}(S(X))] = C,\tag{3}$$

where I*<sup>E</sup>* is the indicator function of the set *E*, while

$$S(x) = 10^{\langle \log\_{10} |x| \rangle} \tag{4}$$

is the *significand* of the non-null real number *x*, where ⟨*t*⟩ = *t* − ⌊*t*⌋ denotes the fractional part of *t*, and *C* = log10 e. First-digit sum invariance thus means that the expected value in (3) does not depend on *d* when *X* is a Benford random variable. Although (2) and (3) are not equivalent when only the first digit is concerned, they are both implied by the full form of BL, which states that

$$S(X) \stackrel{\mathcal{L}}{=} 10^U,\tag{5}$$

with *U* a Uniform random variable on [0,1[. We refer to Berger & Hill, 2020 for a recent survey of the mathematical properties of BL and to Barabesi *et al.*, 2021 for a thorough study of the relationship between (2) and (3).
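A small Monte Carlo sketch (ours, not the authors') ties together the significand (4), the full form (5) and the sum-invariance property (3): drawing *S*(*X*) as 10<sup>*U*</sup>, the truncated means E[*S*(*X*) I<sub>[d,d+1[</sub>(*S*(*X*))] are approximately equal to *C* = log10 e ≈ 0.434 for every digit *d*:

```python
import math
import random

def significand(x):
    """S(x) = 10**<log10 |x|> of eq. (4), with <t> = t - floor(t);
    defined for non-null x."""
    t = math.log10(abs(x))
    return 10 ** (t - math.floor(t))

# Full form (5): S(X) has the law of 10**U with U uniform on [0, 1[.
random.seed(0)
draws = [10 ** random.random() for _ in range(200_000)]

# Sum invariance (3): E[S(X) I_{[d,d+1[}(S(X))] = C = log10(e) for every d.
C = math.log10(math.e)
means = [sum(s for s in draws if d <= s < d + 1) / len(draws)
         for d in range(1, 10)]
```

For instance, significand(0.00321) = 3.21, and the nine simulated truncated means all fall within Monte Carlo error of *C*.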

#### 3 Tests of the Benford hypothesis

In the motivating framework sketched in §1, Cerioli *et al.*, 2019a investigate the conditions under which Benford's law may yield a reasonable approximation to the first-digit distribution of customs declarations. If Benford's law is expected to hold for genuine transactions, then deviations from the law can be taken as evidence of possible data manipulation. Several exact tests of the Benford hypothesis exist, according to which characterization is considered. Those that follow have proven to be useful under a variety of circumstances:

• The chi-square test of the first-digit distribution (2) considered by Barabesi *et al.*, 2018, say χ<sup>2</sup>;

• The Hotelling-type test of the sum-invariance property (3) proposed by Barabesi *et al.*, 2021, say *Q*;

• The Kolmogorov-Smirnov test of the Benford property (4) described in Barabesi *et al.*, 2021, say *KS*.

Barabesi *et al.*, 2021 show that the combination of χ<sup>2</sup> and *Q* provides a test which is consistently close to the best solution provided by either χ<sup>2</sup> or *Q*. We further develop this strategy in two directions. First, we derive the asymptotic joint distribution of χ<sup>2</sup> and *Q* under Benford's law. This result gives theoretical substance to the observed empirical behavior of the combined test. We then extend our combination strategy to include *KS*. The proposed extension is extremely relevant in view of the motivating framework of §1, since the performance of the individual tests may vary considerably according to the actual digit-generating process when Benford's law does not hold. Our combined test thus provides a powerful, yet robust, solution when the type of departure from Benford's law is unknown, as happens in anti-fraud applications. Some preliminary simulation results for a sample size of *n* = 100 observations are shown in Table 1, where *L*<sub>χ2,*Q*,*KS*</sub> denotes the newly developed combined test. The alternative data-generating models for *X* are a Lognormal random variable with scale parameter 1 and shape parameter 0.5, and a Generalized Benford random variable with parameter −0.6.

Table 1. *Estimated power of tests of the Benford hypothesis for sample size n* = 100*.*

| Alternative | χ<sup>2</sup> | *Q* | *KS* | *L*<sub>χ2,*Q*,*KS*</sub> |
|---|---|---|---|---|
| Lognormal | 0.903 | 0.926 | 0.899 | 0.940 |
| Generalized Benford | 0.446 | 0.466 | 0.853 | 0.785 |

#### References

BARABESI, L., CERASA, A., CERIOLI, A., & PERROTTA, D. 2018. Goodness-of-fit testing for the Newcomb-Benford law with application to the detection of customs fraud. *Journal of Business and Economic Statistics*, 36, 346–358.

BARABESI, L., CERASA, A., CERIOLI, A., & PERROTTA, D. 2021. On Characterizations and Tests of Benford's Law. *Journal of the American Statistical Association*. https://doi.org/10.1080/01621459.2021.1891927.

BERGER, A., & HILL, T. P. 2020. The mathematics of Benford's law: a primer. *Statistical Methods and Applications*. https://doi.org/10.1007/s10260-020-00532-8.

CERIOLI, A., BARABESI, L., CERASA, A., MENEGATTI, M., & PERROTTA, D. 2019a. Newcomb-Benford law and the detection of frauds in international trade. *PNAS*, 116, 106–115.

CERIOLI, A., FARCOMENI, A., & RIANI, M. 2019b. Wild adaptive trimming for robust estimation and cluster analysis. *Scandinavian Journal of Statistics*, 46, 235–256.

EUROPEAN ANTI-FRAUD OFFICE. 2018. *The OLAF report 2017*. Tech. rept. Publications Office of the European Union, Luxembourg. https://doi.org/10.2784/652365.

PERROTTA, D., CHECCHI, E., TORTI, F., CERASA, A., & NOVAU, X. A. 2020a. *Addressing Price and Weight Heterogeneity and Extreme Outliers in Surveillance Data: The Case of Face Masks*. Tech. rept. JRC121650, EUR 12345 EN. Publications Office of the European Union, Luxembourg. https://doi.org/10.2760/817681.

PERROTTA, D., CERASA, A., TORTI, F., & RIANI, M. 2020b. *The Robust Estimation of Monthly Prices of Goods Traded by the European Union*. Tech. rept. JRC120407, EUR 30188 EN. Publications Office of the European Union, Luxembourg. https://doi.org/10.2760/635844.


### UNBALANCED CLASSIFICATION OF ELECTRONIC INVOICING


Chiara Bardelli<sup>1</sup>

<sup>1</sup> Department of Mathematics, University of Pavia, (e-mail: chiara.bardelli01@universitadipavia.it)

ABSTRACT: Real-world classification problems may present a high number of classes to predict, which are often not equally distributed in the dataset. We propose a two-step approach to address this problem, analyzing data from the accounting world. Electronic invoices have a hierarchical structure, which is exploited in the first step of our model. A classifier is then trained on the lines of the invoices for each subset generated by the cluster analysis. The results obtained show higher recall values for the least frequent classes of the dataset with respect to the adoption of a single classification model.

KEYWORDS: unbalanced classification, text mining, prediction model

#### 1 Introduction

The issue of unbalanced classification is known to affect different domains of application when a machine learning classifier is trained on real-world data. If a dataset presents many classes to predict, these classes are often not equally distributed, leading to an unbalanced problem which needs ad hoc analyses. Different methods have been proposed in the literature to address this problem (Santos *et al.*, 2018; Ganganwar, 2012). In the case of large datasets with a high number of classes, one suggested methodology is to split the data into smaller subsets of similar observations and to develop a single classifier for each subset, so as to obtain simpler classification algorithms and a lower number of classes to predict within each cluster (Tsoumakas *et al.*, 2008).

The classification of electronic invoices in the accounting process, using accounts which are part of the Chart of Accounts, is a multiclass classification problem characterized by a high number of classes and unbalanced distributions. Nowadays, the interest in the automation of this task is very high (Bardelli *et al.*, 2020; Beļskis *et al.*, 2020). It is a repetitive and monotonous routine activity which can easily be replaced by a machine learning model, giving accountants the possibility to focus on more stimulating projects. This classification task presents different challenges, due to the nature of the problem and to the high number of accounts employed in the classification. In this work we address this issue by exploiting the hierarchical structure of the invoice documents to cluster them into smaller subsets that are easier to manage in terms of the classification task. A single classifier is then trained on each subset. The results of this two-step methodology are compared with the performance of a single classifier trained on the entire dataset.

#### 2 Data and method


The dataset analyzed in this work consists of 13,605 supplier and customer invoices, for a total of 121,946 invoice lines to classify. These data are part of the accounting database of a single business company. The total number of different classes to predict is 42, with the frequency of the majority class equal to 87,513.

Given the information about a single line of an invoice and the generic characteristics of the invoice, the aim is to predict the accounting code related to the line. For the sake of simplicity, we assume that lines inside an invoice are independent ignoring the grouping term which could influence the predictive output. In future works, this grouping information can be included in the features space too. In our classification task, we construct the prediction rule given the training sample {(*yi*,x*i*)}*<sup>N</sup> <sup>i</sup>*=<sup>1</sup> with:


Thanks to the hierarchical structure of the invoice, which is composed by an header and the description lines, we apply a two-step approach: first of all, we exploited the information of the header of the invoice to cluster data in smaller datasets, and secondly, we train a classifier on the lines of the invoices for each cluster of data. We compare this two-step approach with a direct approach that develops a single classification model on the entire original dataset combining predictors both from the header and lines of the invoice.

The classification algorithm used in this work is the xgboost model, known to be very efficient in case of large dataset. To process the textual description of lines of invoices we apply the Word2vec model. Clusters of invoices are computed using the k-means algorithm. The number of clusters suggested by the Elbow method and used in the analysis is 4.
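As an illustration of the cluster-number choice, the sketch below implements a minimal 1-D k-means and the inertia curve inspected by the Elbow method. It is a toy reconstruction, not the authors' code: the function names (`kmeans`, `elbow_inertias`) and the data are ours, and in the actual study the inputs are Word2vec representations of invoice headers rather than scalars.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D k-means; returns (centroids, total within-cluster inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[j].append(p)
        # recompute centroids (keep the old one if a cluster went empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

def elbow_inertias(points, k_max):
    """Inertia for k = 1..k_max; the 'elbow' is where the decrease flattens."""
    return [kmeans(points, k)[1] for k in range(1, k_max + 1)]
```

Plotting the returned inertias against k and picking the value where the curve bends is the usual Elbow inspection; here it suggested 4 clusters.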

#### 3 Results

We describe the results of the two approaches by providing accuracy indexes computed on the test set (the original dataset was split into an 80% training set and a 20% test set, preserving the class proportions in the split). Table 1 reports the values of macro and weighted recall. As we can observe, the two approaches show similar recall values, meaning that splitting the original dataset into small clusters of data does not affect the overall performance of the model.
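Macro recall averages the per-class recalls with equal weight, whereas weighted recall weights each class by its support, which is why a majority class of 87,513 lines dominates the weighted figure. A minimal sketch of both metrics (helper names are ours):

```python
from collections import Counter, defaultdict

def recalls(y_true, y_pred):
    """Per-class, macro and support-weighted recall from label sequences."""
    hits, support = defaultdict(int), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        if t == p:
            hits[t] += 1
    per_class = {c: hits[c] / n for c, n in support.items()}
    macro = sum(per_class.values()) / len(per_class)          # equal class weight
    weighted = sum(per_class[c] * n for c, n in support.items()) / len(y_true)
    return per_class, macro, weighted
```

With a dominant majority class, weighted recall stays high even when rare classes are missed entirely, which is why the macro figure is the interesting one here.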

Table 1. *Macro and weighted recall computed on the test set for the two approaches*

| Methodology       | Macro recall | Weighted recall |
|-------------------|--------------|-----------------|
| Direct approach   | 82.5%        | 98.2%           |
| Two-step approach | 83.3%        | 98.5%           |

The most interesting result concerns the accuracy of some classes with low frequencies in the original data. Figure 1 reports the recall values of the 10 least frequent classes of the dataset. Most of these classes improve their recall; in particular, the accounting codes 680203024 and 680203010 show recall values rising from 0 with the direct approach to 80% and 62%, respectively, with the two-step approach. The invoices related to these accounts all belong to the same cluster, which can be identified as the group of supplier invoices related to the purchase of materials. On the other hand, high-frequency classes preserve the same accuracy values under both approaches.

Figure 1. *Recall values for the 10 least frequent classes of the dataset.*

#### 4 Conclusions

The challenge posed by an unbalanced dataset with a high number of classes is to develop accurate classification models able to correctly predict the classes with low frequencies (minority classes). The problem we addressed allowed us to exploit the hierarchical structure of the invoice document to divide the original dataset into smaller clusters, based on the characteristics of the invoice headers. Thanks to this procedure, the original classification problem has been split into simpler classification tasks with a smaller number of classes in each cluster.

The results obtained in this analysis encourage deeper studies aimed at completely automating the classification of invoices into accounting codes, which is an expensive and demanding task in accountants' work.

#### References

BARDELLI, CHIARA, RONDINELLI, ALESSANDRO, VECCHIO, RUGGERO, & FIGINI, SILVIA. 2020. Automatic electronic invoice classification using machine learning models. *Machine Learning and Knowledge Extraction*, 2(4), 617–629.

BEĻSKIS, ZIGMUNDS, ZIRNE, MARITA, & PINNIS, MĀRCIS. 2020. Features and Methods for Automatic Posting Account Classification. *In: International Baltic Conference on Databases and Information Systems*. Springer.

GANGANWAR, VAISHALI. 2012. An overview of classification algorithms for imbalanced datasets. *International Journal of Emerging Technology and Advanced Engineering*, 2(4), 42–47.

SANTOS, MIRIAM SEOANE, SOARES, JASTIN POMPEU, ABREU, PEDRO HENRIQUES, ARAUJO, HELDER, & SANTOS, JOAO. 2018. Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]. *IEEE Computational Intelligence Magazine*, 13(4), 59–76.

TSOUMAKAS, GRIGORIOS, KATAKIS, IOANNIS, & VLAHAVAS, IOANNIS. 2008. Effective and efficient multilabel classification in domains with large number of labels. *In: Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08)*, vol. 21.


### PREDICTIVE POWER OF BAYESIAN CAR MODELS ON SCALE FREE NETWORKS: AN APPLICATION FOR CREDIT RISK


Claudia Berloco<sup>1,2</sup>, Raffaele Argiento<sup>3,4</sup> and Silvia Montagna<sup>1,4</sup>

<sup>1</sup> Dipartimento di Scienze Economico-sociali e Matematico-statistiche, Università degli Studi di Torino, Corso Unione Sovietica, 218/bis, 10134 Torino, Italy, (e-mail: claudia.berloco@unito.it, silvia.montagna@unito.it)

<sup>2</sup> Intesa Sanpaolo, Piazza San Carlo, 156, 10121 Torino, Italy

<sup>3</sup> Dipartimento di Scienze statistiche, Università Cattolica Sacro Cuore, Largo A. Gemelli, 1, 20123 Milano, Italy, (e-mail: raffaele.argiento@unicatt.it)

<sup>4</sup> Collegio Carlo Alberto, Piazza Vincenzo Arbarello, 8, 10122 Torino, Italy

ABSTRACT: The monitoring of the loan life-cycle has received increasing attention from the scientific community since the 2008 global financial crisis. A number of aspects of this broad topic have been addressed by means of several regulatory, statistical and economic tools. However, many issues still require further investigation. In this work, we are interested in the monitoring phase of granted loans, with the aims of anticipating possible defaults and of investigating whether there is evidence of a liquidity contagion effect within a trade network of firms. To this end, we apply a Bayesian spatial model to a proprietary dataset and assess its out-of-time predictive performance.

KEYWORDS: Bayesian modelling, spatial modelling, credit risk, CAR model.

#### 1 Introduction

The European Central Bank requires banks to adapt their organization, processes and IT infrastructure in order to give an integrated answer to the non-performing loans problem. Banks can mitigate their credit risk in several steps of the loan life-cycle, for example by foreseeing liquidity problems for those customers which already have a debt to the bank. A timely detection of the transition to financial distress is pivotal, and it will be addressed in this work by leveraging statistical models and bank data.

Recently, a number of contributions (see, e.g., Dolfin *et al.* , 2019) focused on introducing information on the supply chain connections in credit risk models based on the evidence of trade credit use in European markets. The main idea is that liquidity distress can flow along these connections, and a firm experiencing a period of liquidity distress can delay payments towards its commercial partners, that can consequently experience liquidity distress. The supply chain is seen as a complex network in these studies, but it can also be represented as an adjacency matrix with proper assumptions (Lamieri & Sangalli, 2019). In this work, we set up a predictive model leveraging Bayesian conditionally auto-regressive (CAR) models for areal data (Banerjee *et al.* , 2003). Specifically, inference is based on a sample of firms from a trade network in a given month, and the predictive performance of a CAR model is tested by estimating the probability of default for both a different sample of firms and for the same sample in the future. Although spatial CAR models have been widely used in ecology, environmental science, biology and medicine, to the best of our knowledge they have not yet been fully exploited in econometrics when dealing with hundreds of thousands of data points interacting in a dynamic complex network (e.g., firms or natural persons).

#### 2 Methodology


With some due simplifications, the monthly goal for a lending bank is to red flag those borrowing firms that have the greatest probability of default (delay in paying their debts to the bank) in the following 3 months. In this paper, we analyse a proprietary dataset of Intesa Sanpaolo collected in a given month, for a total of *n* = 944 firms. Our response variable is a binary indicator such that *Yk* = 1 if firm *k* switches to a liquidity distress state in the next 3 months.

The trade network can be represented as a link matrix $W \in \{0,1\}^{n \times n}$, with binary entries $w_{kj} = 1$ if $k \neq j$ and $k$ was a supplier of customer $j$ in the previous year. The link matrix *W* represents a complex network with a scale-free structure (Barabási & Albert, 1999). Further, the Bank database stores several credit and trend indicators on each specific customer firm, but for the sake of simplicity here we only consider two covariates $x_k$ for each firm $k$. The first covariate, $x_k^1$, represents the used amount of credit over the amount granted among all Italian financial institutions, while the second, $x_k^2$, represents the maximum number of days of payment delay recorded in the past 3 months.
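As a minimal illustration (helper name and toy data are ours, not from the paper), the link matrix can be built directly from a list of supplier→customer pairs observed in the previous year:

```python
def build_link_matrix(n, supplies):
    """W[k][j] = 1 if firm k supplied firm j in the previous year (k != j)."""
    W = [[0] * n for _ in range(n)]
    for k, j in supplies:
        if k != j:          # no self-links on the diagonal
            W[k][j] = 1
    return W
```

The row sums of `W` then give each firm's number of customers, the quantity that normalizes the CAR prior mean below.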

We fit a proper CAR specification (Banerjee *et al.* , 2003) to our data as follows:

$$\begin{aligned} Y_k &\sim \mathrm{Bernoulli}(\theta_k) \\ \mathrm{logit}(\theta_k) &= \boldsymbol{\beta}^{\top}\mathbf{x}_k + \phi_k \\ \phi_k \mid \phi_{-k}, \alpha, \tau, W &\sim N\!\left(\alpha\,\frac{\sum_{i=1}^{n} w_{ki}\,\phi_i}{\sum_{i=1}^{n} w_{ki}},\; \tau^{-1}\right), \end{aligned} \tag{1}$$

Here $\phi_k$ is a firm-specific spatial random effect incorporating the information contained in the network of relationships *W*. Conditionally on *W*, $\phi_k$ is modelled as a Markov random field, meaning that the value of $\phi_k$ only depends on the values of its neighbours. Indeed, we expect the probability of default of firm *k* to increase (decrease) if one or more firms connected with *k* are (not) in default. Parameters $\alpha$ and $\tau$ represent the strength and the precision of the autocorrelation, respectively. The CAR specification is chosen because the information arising from the network (incorporated through $\phi_k$) can help explain those default events that are not fully captured by the linear covariates. Standard priors are placed on $\alpha$, $\tau$, and $\beta_0, \beta_1, \beta_2$, and estimation of model parameters proceeds via MCMC (Banerjee *et al.*, 2003).
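The conditional prior mean of $\phi_k$ in (1) is just the $\alpha$-scaled average of the neighbours' effects. A small sketch of that quantity (pure Python, naming ours):

```python
def car_conditional_mean(k, phi, W, alpha):
    """alpha * (sum_i w_ki * phi_i) / (sum_i w_ki) for firm k, as in Eq. (1)."""
    num = sum(W[k][i] * phi[i] for i in range(len(phi)))
    den = sum(W[k])
    # isolated firms (no neighbours) get a zero-centred prior
    return alpha * num / den if den else 0.0
```

This is the mean of the full conditional used inside each Gibbs update of the spatial effects.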



#### 3 Results and conclusions

Testing model (1) on real data, we notice that the posterior distributions of the linear parameters obtained with the CAR model are coherent with those of a standard GLM which considers the covariates $\mathbf{x}_k$ only. The overlap between the credible intervals of the linear parameters from the two models implies that the spatial random effects estimated by the CAR model contribute to explaining a part of the default phenomenon not entirely captured by firm-specific information. Further, we record very good in-sample performance in terms of area under the curve (AUC): the GLM reaches an AUC of 0.79, while the CAR specification reaches 0.89. Furthermore, model (1) helps in identifying defaulted firms through the spatial random effects. Indeed, Figure 1 (left panel) shows that, for most truly defaulted firms (red dots), the estimated probability that the spatial effect is positive, computed as $\hat{P}(\phi_k > 0) = \frac{1}{T-B}\sum_{g=B+1}^{T} \mathbf{1}(\phi_k^{(g)} > 0)$, is strictly greater than 50%. Here $T$ is the total number of MCMC iterations and $B$ denotes the burn-in.
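The probability of a positive spatial effect is simply the post-burn-in fraction of MCMC draws of $\phi_k$ that fall above zero; a minimal sketch (naming ours):

```python
def prob_positive(draws, burn_in):
    """Estimate P(phi_k > 0) as the fraction of post-burn-in draws above zero."""
    kept = draws[burn_in:]          # discard the first B iterations
    return sum(1 for d in kept if d > 0) / len(kept)
```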

Further, we test the predictive power of the model on a disjoint sample drawn from the network seen at the same timestamp as the training sample (out-of-sample set, composed of unseen firms), and on the training sample seen six months later (out-of-time set, composed of future observations of the same firms used in training). In line with the original aim of spatial CAR models, which are intended to fit data referring to static maps, the model does not generalise in the out-of-sample case. This is an unfortunate result for our credit risk application, as one could instead expect the liquidity distress contagion dynamics to spread with similar strength ($\alpha$) and precision ($\tau$) in different areas of the trade network. In the out-of-time case, the CAR model shows slightly better predictive performance than the simple GLM, as shown in Figure 1 (right panel).


Figure 1. *Left: estimated probability of a strictly positive spatial effect, $P(\phi_k > 0)$, for each firm. Red dots are defaulted firms ($Y_k = 1$) with an estimated probability of a strictly positive spatial effect greater than 50%; black dots indicate all other firms. Right: ROC curves for a GLM considering only the covariates $\mathbf{x}_k$ (black, AUC = 0.779) and for the CAR model (blue, AUC = 0.801), for the prediction six months ahead with respect to training.*

To conclude, the application of disease-mapping methods to a scale-free network represents a novelty at present. The encouraging results on the out-of-time set suggest further investigation of spatial modelling of trade networks.

#### References

BANERJEE, SUDIPTO, CARLIN, BRADLEY P., & GELFAND, ALAN E. 2003. *Hierarchical Modeling and Analysis for Spatial Data*. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.

BARABÁSI, ALBERT-LÁSZLÓ, & ALBERT, RÉKA. 1999. Emergence of scaling in random networks. *Science*, 286(5439), 509–512.

DOLFIN, MARINA, KNOPOFF, DAMIAN, LIMOSANI, MICHELE, & XIBILIA, MARIA GABRIELLA. 2019. Credit risk contagion and systemic risk on networks. *Mathematics*, 7(8), 713.

LAMIERI, MARCO, & SANGALLI, ILARIA. 2019. The propagation of liquidity imbalances in manufacturing supply chains: evidence from a spatial auto-regressive approach. *The European Journal of Finance*, 25(15), 1377–1401.


### SEMIPARAMETRIC FINITE MIXTURE OF REGRESSION MODELS WITH BAYESIAN P-SPLINES


Marco Berrettini<sup>1</sup>, Giuliano Galimberti<sup>1</sup> and Saverio Ranciati<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, University of Bologna, (e-mail: marco.berrettini2@unibo.it, giuliano.galimberti@unibo.it, saverio.ranciati2@unibo.it)

ABSTRACT: A semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. The contribution of a concomitant variable is flexibly specified as a smooth function represented by cubic splines. A Bayesian estimation procedure is proposed and an empirical analysis of the baseball salaries dataset is illustrated.

KEYWORDS: mixture of experts models, Gibbs sampling, data augmentation

#### 1 Introduction

Mixture models provide a useful tool to account for unobserved heterogeneity. In order to gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates, leading to the Mixture of Experts (MoE) framework. In this paper, a semiparametric MoE regression model is proposed, where component weights and conditional means are smooth functions of a univariate covariate. Estimation is carried out within the Bayesian paradigm: a new Gibbs sampler algorithm is developed, exploiting data augmentation to express the effect of the covariate on the component weights, as in Frühwirth-Schnatter et al. (2012). Bayesian P-splines (Lang & Brezger, 2004) are used to achieve a parsimonious representation of the smooth functions.

#### 2 Model specification

Suppose that {*yi*},*i* = 1,...,*n* is a random sample from a population clustered into *G* components. It is assumed that the conditional distribution of *yi*, given a concomitant covariate *xi*, is represented by the following MoE model:

$$f(y_i \mid x_i) = \sum_{g=1}^{G} \pi_g(x_i)\, f_{\mathcal{N}}\!\left(\mu_g(x_i), \sigma_g^2\right). \tag{1}$$
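For concreteness, the mixture density in (1) can be evaluated as follows once the covariate-dependent weights and means have been computed at $x_i$ (hypothetical helpers, not the authors' code):

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) at y."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def moe_density(y, weights, mus, sigma2s):
    """f(y|x) = sum_g pi_g(x) * N(y; mu_g(x), sigma_g^2), as in Eq. (1),
    with the covariate-dependent quantities already evaluated at x."""
    return sum(w * normal_pdf(y, m, s2)
               for w, m, s2 in zip(weights, mus, sigma2s))
```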

Each component *g* = 1,...,*G* is modelled by a normal density function $f_{\mathcal{N}}(\cdot)$ and has weight $\pi_g(x_i) > 0$, such that $\sum_{g=1}^{G} \pi_g(x_i) = 1$, for *i* = 1,...,*n*. Conditions for the identifiability of Model (1) can be deduced from the ones Huang et al. (2013) provide for their nonparametric mixture of regression models. Jacobs et al. (1991) model the component weights $\pi_g(x_i)$ using a multinomial logistic regression model, thus expressing the log-odds of these probabilities, with respect to a reference one (e.g., the *G*-th), as linear functions of the covariate $x_i$. In this paper, each of these *G* − 1 linear predictors is replaced with an additive structure, defined as a linear combination of *m* cubic B-spline bases $B_\rho(\cdot)$ and coefficients $\gamma_{g\rho}$:

SEMIPARAMETRIC FINITE MIXTURE OF REGRESSION MODELS WITH BAYESIAN P-SPLINES Marco Berrettini 1, Giuliano Galimberti 1and Saverio Ranciati <sup>1</sup>

(e-mail: marco.berrettini2@unibo.it, giuliano.galimberti@unibo.it,

ABSTRACT: A semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. The contribution of a concomitant variable is flexibly specified as a smooth function represented by cubic splines. A Bayesian estimation procedure is proposed and an empirical analysis of the baseball salaries dataset is illustrated. KEYWORDS: mixture of experts models, Gibbs sampling, data augmentation

Mixture models provide a useful tool to account for unobserved heterogeneity. In order to gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates, introducing the Mixture of Experts (MoE) framework. In this Paper, a semiparametric MoE regression model is proposed, where component weights and conditional means are smooth functions of a univariate covariate. Estimation is carried out within the Bayesian paradigm: a new Gibbs sampler algorithm is developed, exploiting data augmentation to express the effect of the covariate on the component weights, as in Fruwirth-Schnatter et al. (2012). Bayesian P-splines (Lang & Brezger, 2004) ¨ are used to achieve a parsimonious representation of the smooth functions.

Suppose that {*yi*},*i* = 1,...,*n* is a random sample from a population clustered into *G* components. It is assumed that the conditional distribution of *yi*, given a concomitant covariate *xi*, is represented by the following MoE model:

π*g*(*xi*)*f<sup>N</sup>*

*µg*(*xi*),σ<sup>2</sup> *g* 

. (1)

*f*(*yi*|*xi*) =

*G* ∑ *g*=1

<sup>1</sup> Department of Statistical Sciences , University of Bologna,

saverio.ranciati2@unibo.it)

1 Introduction

2 Model specification

$$\log \frac{\pi\_g(x\_i)}{\pi\_G(x\_i)} = \eta\_g(x\_i) = \sum\_{\rho=1}^{m} B\_{\rho}(x\_i) \, \gamma\_{g\rho}, \quad \text{for } i = 1, \ldots, n. \tag{2}$$
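The additive log-odds predictor in (2) is straightforward to assemble numerically. The sketch below is illustrative rather than the authors' implementation: the basis size `m = 8`, the equally spaced clamped knots, and the random coefficients are all assumptions made here. It builds the cubic B-spline design matrix with SciPy and maps the *G* − 1 linear predictors to mixture weights, with the *G*-th component as reference:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, lo, hi, m=8, degree=3):
    """Design matrix of m cubic B-spline bases B_rho(x) on [lo, hi] (clamped knots)."""
    knots = np.concatenate([[lo] * degree,
                            np.linspace(lo, hi, m - degree + 1),
                            [hi] * degree])
    return np.column_stack([BSpline(knots, np.eye(m)[p], degree)(x) for p in range(m)])

def mixture_weights(B, gamma):
    """Weights pi_g(x_i) from (2); gamma is (G-1, m), component G is the reference (eta_G = 0)."""
    eta = np.column_stack([B @ gamma.T, np.zeros(B.shape[0])])  # (n, G) log-odds
    w = np.exp(eta - eta.max(axis=1, keepdims=True))            # overflow-safe softmax
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)
x = np.linspace(0.5, 9.5, 50)                      # covariate values strictly inside [0, 10]
B = bspline_basis(x, lo=0.0, hi=10.0)
pi = mixture_weights(B, rng.normal(size=(2, 8)))   # G = 3 components
```

Each row of `pi` is a valid probability vector, so the constraint that the weights sum to one holds by construction.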

In the Bayesian framework, Lang & Brezger (2004) suggest using a high number of knots to ensure enough flexibility, and defining priors for the regression parameters γ*g*1,..., γ*gm* in terms of a random walk:

$$
\gamma\_{g\rho} = \gamma\_{g,\rho-1} + \nu\_{g\rho}, \quad \nu\_{g\rho} \sim N(0, \delta\_g^2). \tag{3}
$$
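The effect of the variance parameter in (3) can be seen by simulation: small values tie neighbouring coefficients together, producing smooth curves, while a zero variance collapses the spline to a constant. A minimal sketch (the starting value is fixed at zero purely for illustration; the paper's prior for the first coefficient is not specified here):

```python
import numpy as np

def rw1_draw(m, delta, rng):
    """One draw of (gamma_1, ..., gamma_m) from the first-order random walk (3):
    gamma_p = gamma_{p-1} + nu_p with nu_p ~ N(0, delta^2); gamma_1 set to 0 here."""
    steps = np.concatenate([[0.0], rng.normal(0.0, delta, size=m - 1)])
    return steps.cumsum()

rng = np.random.default_rng(0)
smooth = rw1_draw(20, delta=0.1, rng=rng)   # small delta^2: slowly varying coefficients
rough = rw1_draw(20, delta=2.0, rng=rng)    # large delta^2: wiggly coefficients
flat = rw1_draw(20, delta=0.0, rng=rng)     # delta^2 = 0: constant (maximal smoothing)
```

Plugging such coefficient vectors into the B-spline expansion of Equation (2) yields correspondingly smooth or wiggly log-odds curves.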

The amount of smoothness is controlled by the additional variance parameters δ*g*². Their presence protects against possible overfitting when a large number of knots is chosen. The multinomial model in Equation (2) can conveniently be represented in a binary formulation through the partial difference random utility model (dRUM) representation proposed by Frühwirth-Schnatter et al. (2012), conditional on knowing each λ*g*(*xi*) = exp(η*g*(*xi*)):

$$z\_{gi} = \eta\_g(x\_i) - \log\left(\sum\_{l \neq g} \lambda\_l(x\_i)\right) + \varepsilon\_{gi}, \quad D\_{gi} = \mathbf{1}(z\_{gi} > 0); \tag{4}$$

where *zgi* is a latent variable, *Dgi* is the allocation indicator, and ε*gi*, *i* = 1,...,*n*, are i.i.d. errors following a logistic distribution. To avoid any Metropolis-Hastings (MH) step, Frühwirth-Schnatter et al. (2012) approximate the logistic distribution of the error terms ε*gi* by a finite scale mixture of normal distributions whose parameters are drawn with fixed probabilities.
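The dRUM step in (4) can be prototyped directly: drawing logistic errors and thresholding at zero reproduces the multinomial-logit allocation probabilities, since the probability that *zgi* exceeds zero equals λ*g*(*xi*)/∑*l* λ*l*(*xi*). The sketch below is only a conceptual check with toy log-weights chosen here; the actual sampler replaces the logistic errors with their normal scale-mixture approximation:

```python
import numpy as np

def drum_utilities(eta, rng):
    """Latent utilities z_gi of (4) for every component g, given log-weights eta (n, G)."""
    lam = np.exp(eta)                               # lambda_g(x_i) = exp(eta_g(x_i))
    z = np.empty_like(eta)
    for g in range(eta.shape[1]):
        rest = np.log(lam.sum(axis=1) - lam[:, g])  # log of sum over l != g of lambda_l(x_i)
        z[:, g] = eta[:, g] - rest + rng.logistic(size=eta.shape[0])
    return z

rng = np.random.default_rng(7)
# toy log-weights in which component 0 dominates by a huge margin
eta = np.column_stack([np.full(50, 100.0), np.zeros(50), np.zeros(50)])
z = drum_utilities(eta, rng)
D = (z > 0)                                         # allocation indicators D_gi = 1(z_gi > 0)
```

With such an extreme margin the indicator for component 0 is 1 for every observation, matching the fact that its allocation probability is essentially one.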

Regarding the components' normal densities, each mean *µg*(·) is assumed to be an unknown smooth function of covariate *x*, represented by Bayesian P-splines:

$$\mu\_g(x\_i) = \sum\_{\rho=1}^{m} B\_{\rho}(x\_i) \, \beta\_{g\rho}, \quad \beta\_{g\rho} = \beta\_{g,\rho-1} + u\_{g\rho}, \quad u\_{g\rho} \sim N(0, \tau\_g^2). \tag{5}$$

The proposed Gibbs sampler requires the number of components *G* to be fixed. The optimal number of components can be selected according to the Akaike Information Criterion for MCMC samples (AICM) proposed by Raftery et al. (2007). Finally, to obtain a hard clustering, observations can be allocated to the *G* components using the maximum-a-posteriori (MAP) rule once the algorithm has completed the prefixed number of iterations.

#### 3 Application: Baseball salaries

Watnik (1998) provides a dataset with information about players in the 1992 Major League Baseball season. The following analysis evaluates the effect of the number of runs (*x*), taken as a measure of a player's contribution to the team, on the log-salary (*y*). Numbers of components *G* ranging from 1 to 4 were considered; according to AICM, the optimal value for the proposed model turned out to be 2. The left plot of Figure 1 shows a lack of monotonicity in the effect of the number of runs on the log-odds of the mixture weight η1(*x*). However, the overall decreasing trend indicates a lower prior probability of belonging to Cluster 1, rather than to Cluster 2 (i.e., the reference one), for players providing better performances in terms of number of runs. Players' allocations, with respect to *x* and *y*, are depicted in the right plot of the same figure, where a nonlinear effect of the number of runs on the log-salary can be noticed for Cluster 2 (the upper one, in blue), while the bands do not exclude a linear effect for Cluster 1 (the lower one, in green). These two clusters appear quite well separated, apart from the region with low values of both *x* and *y*. Group 1 might be broadly interpreted as the cluster of "underrated" (or "underpaid") baseball players. In fact, while it is obvious that players with better performances get paid more, as is confirmed by the increasing trends of both means, there seems to be a group of players whose salary is substantially lower than that of players with similar performances (in terms of number of runs) belonging to the upper group (in blue). Indeed, the two estimated functions *µ*1(*x*) and *µ*2(*x*) in the right plot of Figure 1 appear almost parallel.

Figure 1. *Estimated posterior effects (and pointwise 95% posterior credible bands) of the number of runs on the log-odds* η1(*x*) *(left plot) and conditional means µ*1(*x*) *and µ*2(*x*) *(right plot), in green and blue respectively.*

#### References

FRÜHWIRTH-SCHNATTER, S., PAMMINGER, C., WEBER, A., & WINTER-EBMER, R. 2012. Labor market entry and earnings dynamics: Bayesian inference using mixtures-of-experts Markov chain clustering. *Journal of Applied Econometrics*, 27, 1116–1137.

HUANG, M., LI, R., & WANG, S. 2013. Nonparametric mixture of regression models. *Journal of the American Statistical Association*, 108(503), 929–941.

JACOBS, R.A., JORDAN, M.I., NOWLAN, S.J., & HINTON, G.E. 1991. Adaptive mixtures of local experts. *Neural Computation*, 3, 79–87.

LANG, S., & BREZGER, A. 2004. Bayesian P-splines. *Journal of Computational and Graphical Statistics*, 13, 183–212.

RAFTERY, A.E., NEWTON, M.A., SATAGOPAN, J.M., & KRIVITSKY, P.N. 2007. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. *Pages 1–45 of:* BERNARDO, J.M., BAYARRI, M.J., BERGER, J.O., DAWID, A.P., HECKERMAN, D., SMITH, A.F.M., & WEST, M. (eds), *Bayesian Statistics*, vol. 8. Oxford University Press.

WATNIK, M.R. 1998. Pay for play: Are baseball salaries based on performance? *Journal of Statistics Education*, 6(2).



### **A SUBJECT-SPECIFIC MEASURE OF INTERRATER AGREEMENT BASED ON THE HOMOGENEITY INDEX**


Giuseppe Bove<sup>1</sup>

<sup>1</sup> Dipartimento di Scienze della Formazione, Università degli Studi Roma Tre (e-mail: giuseppe.bove@uniroma3.it)

**ABSTRACT**: Interrater agreement for classifications on nominal scales is usually evaluated by overall measures across subjects, like Cohen's kappa index. In this paper, the homogeneity index for a qualitative variable is proposed to evaluate the agreement between raters for each single case (subject or object), and also to obtain a global measure of the interrater agreement for the whole group of cases evaluated. The subject-specific and the global measures proposed do not depend on a particular definition of agreement (simultaneously between two, three or more raters) and are not influenced by the marginal rater distributions of the scale, unlike most of the kappa-type indexes.

**KEYWORDS**: nominal classification scales, interrater agreement, homogeneity index.

#### **1 Introduction**

In behavioral and biomedical sciences, classifying subjects or objects into predefined classes or categories and analysing the agreement of such classifications are rather common activities. For instance, agreement between the clinical diagnoses provided by several physicians (raters) is considered when identifying the best treatment for a patient, and to the extent that the diagnoses coincide, the rating procedure (or scale) can be used with confidence. Hence, in this type of application it is important to analyse interrater absolute agreement, that is, the extent to which raters assign the same (or very similar) values on the rating scale.

Agreement between two raters who rate each of a sample of subjects (objects) on a nominal scale is usually assessed with Cohen's kappa (Cohen, 1960). Generalizations of kappa to the case of more than two raters, and to the case where the raters assessing each subject are not always the same, have been proposed by many authors (e.g., Fleiss, 1971; Conger, 1980). These indexes are used to analyse the agreement between multiple raters for a whole group of subjects. Moreover, methods to detect subsets of raters who demonstrate a high level of interobserver agreement were considered, for instance, by Landis & Koch (1977). Less frequently, agreement on a single subject has been considered (O'Connell & Dobson, 1984), in spite of the fact that evaluations of agreement on single cases are particularly useful, for example, when the rating scale is being tested and it is necessary to identify changes to improve it, or to ask the raters for a specific comparison on the single cases where agreement is poor.

In the next sections, an index to measure the interrater agreement on a single subject is proposed, based on a measure of dispersion for nominal variables. Furthermore, a global measure of agreement on the whole group of subjects, obtained as the arithmetic average of the subject-specific values of the index, will also be considered and applied to a data set concerning the cause of death of 35 hypertensive patients.

#### **2 Method**


O'Connell and Dobson (1984) proposed a chance-corrected measure of agreement for several raters using nominal (or ordinal) categories on a single subject *i* (*i* = 1, 2, …, *N*), given by

$$S\_i = 1 - D\_i / \Delta,$$

where *Di* is the overall disagreement on the whole response profile *i* and ∆ is the disagreement expected by chance (see O'Connell and Dobson (1984), equation (6)). The measure takes the value 1 when there is perfect agreement; it is positive when the agreement is better than chance, and negative otherwise. Besides, an overall measure of agreement across subjects, S*av*, can be obtained as the arithmetic average of the individual S*i* values. The index has some drawbacks: 1) it cannot be computed for only one observation, because in that case the disagreement expected by chance ∆ is not defined; 2) it is formulated in terms of agreement statistics based on all pairs of raters, whereas some authors have argued that simultaneous agreement among three or more raters can alternatively be considered (e.g., see Warrens, 2012); 3) the agreement expected by chance depends on the observed proportions of subjects allocated to the categories of the scale by each rater, which implies that the measure of agreement depends on the marginal distributions of the categories of the scale observed for each rater (on this aspect see, e.g., Marasini *et al.*, 2016).
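For concreteness, the sketch below implements one pairwise reading of S*i*: *Di* is taken as the proportion of discordant rater pairs for subject *i*, and ∆ as its chance expectation under independent raters with their observed marginal distributions. This is only an illustrative approximation made here — the exact expressions for *Di* and ∆ are those in O'Connell and Dobson (1984), equation (6), which are not reproduced in this paper:

```python
import numpy as np
from itertools import combinations

def s_index(ratings, K):
    """Per-subject chance-corrected agreement S_i = 1 - D_i / Delta (pairwise version).
    ratings: (N, R) integer array of category labels 0..K-1, N subjects, R raters."""
    N, R = ratings.shape
    pairs = list(combinations(range(R), 2))
    # D_i: observed proportion of rater pairs disagreeing on subject i
    D = np.array([np.mean([ratings[i, a] != ratings[i, b] for a, b in pairs])
                  for i in range(N)])
    # each rater's marginal category distribution across subjects
    P = np.stack([np.bincount(ratings[:, r], minlength=K) / N for r in range(R)])
    # Delta: chance-expected disagreement, averaged over rater pairs
    Delta = np.mean([1.0 - P[a] @ P[b] for a, b in pairs])
    return 1.0 - D / Delta

ratings = np.array([[0, 0, 0],    # subject with perfect agreement among 3 raters
                    [1, 2, 0]])   # subject with total disagreement
S = s_index(ratings, K=3)
```

As computed here, the first subject gets S = 1 and the second a negative value (agreement worse than chance).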

A different approach is proposed here, based on the well-known homogeneity index for measuring the dispersion of a qualitative variable (e.g., Leti, 1983). For a classification into *K* categories the index is given by

$$O = \sum\_{j=1}^{K} f\_j^2 \, ,$$

where *fj* is the proportion of ratings in category *j* (*j* = 1, 2, …, *K*). The index is equal to 1 in the case of maximum homogeneity (perfect agreement), and to 1/*K* in the case of maximum heterogeneity (total disagreement, with *fj* = 1/*K* for each category *j*). Since *O* depends on the number of categories, its normalization to the interval [0,1] is given by

$$O\_{rel} = (K \, O - 1) / (K - 1).$$

Thus, O*rel*: 1) is equal to zero for total disagreement and to one for perfect agreement; 2) can be computed even for a single observation; 3) does not depend on the definition of pairwise agreement; 4) does not depend on the observed proportions of subjects allocated to the categories of the scale.
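The index is a one-liner to compute. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def homogeneity(ratings_i, K):
    """O and O_rel for the ratings of a single subject (integer labels 0..K-1)."""
    f = np.bincount(ratings_i, minlength=K) / len(ratings_i)  # proportions f_j
    O = float(np.sum(f ** 2))
    return O, (K * O - 1) / (K - 1)

# perfect agreement: all 7 raters choose category 2 (out of K = 5)
O_max, Orel_max = homogeneity(np.array([2] * 7), K=5)
# total disagreement: one rating per category
O_min, Orel_min = homogeneity(np.array([0, 1, 2, 3, 4]), K=5)
# the global measure is just the average of O_rel over subjects
```

Here `Orel_max` is 1 and `Orel_min` is 0, matching properties 1) above.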

A global measure of agreement on the whole group of subjects (denoted by $\bar{O}\_{rel}$) can easily be obtained as the arithmetic average of the individual O*rel* values.

#### **3 Application**

The data come from a study in which seven nosologists assessed the cause of death of 35 hypertensive patients from their death certificates (Woolson, 1987). Scores were assigned using the following categories: 1 = arteriosclerotic disease, 2 = cerebrovascular disease, 3 = other heart disease, 4 = renal disease, 5 = other disease. The marginal proportions of ratings for the five categories were 0.21, 0.17, 0.19, 0.27 and 0.16, respectively. Some preliminary results are presented for the method based on the O*rel* index.

The subject-specific values of O*rel* allowed low levels of agreement to be detected for many evaluations (28.6% of the values below 0.4), which calls for a possible revision of the assessment procedure. It is also interesting to analyse the descriptive statistics provided in Table 1 for the comparison of S*i* and O*rel*. The mean values for the global agreement are S*av* = 0.48 and $\bar{O}\_{rel}$ = 0.56. The S*i* values show higher dispersion than the O*rel* values. The two measures are almost perfectly correlated (*r* = 0.99).

**Table 1:** Some descriptive statistics for S*i* and O*rel* values

| | *N* | Mean | Std. Dev. | CV |
|---|---|---|---|---|
| S*i* | 35 | 0.48 | 0.27 | 56.5 |
| O*rel* | 35 | 0.56 | 0.23 | 42.1 |

We also add that the value of the average Cohen's kappa coincides with S*av*, and the value of Fleiss' kappa (Fleiss, 1971) is also approximately equal to 0.48.

It is interesting to point out that if we increase the level of agreement between raters by collapsing the five categories into the two strongly unbalanced categories cerebrovascular disease (marginal proportion 0.17) and all other diseases (marginal proportion 0.83), the values of S*av*, average Cohen's kappa and Fleiss' kappa remain almost the same, while the new value of $\bar{O}\_{rel}$ increases to 0.75, in accordance with the new, higher level of agreement. Highly unbalanced categories are not uncommon in applications; this happens, for example, when a diagnostic category is rare or when for some reason the raters use almost exclusively very few levels of the scale.

#### **4 Conclusion**

A descriptive approach to the analysis of absolute interrater agreement has been proposed that presents some advantages with respect to the approach based on kappa-type measures relying on pairwise agreement between raters. The proposed index is mainly intended as a measure of the size of the interrater agreement; future developments may therefore concern the definition of reliable thresholds useful in applications. Finally, we note that a recently proposed measure of interrater agreement for ordinal data, applied in educational studies, follows an approach similar to the present proposal (Bove *et al.*, 2020), with a measure of dispersion for ordinal variables considered instead of the homogeneity index.

#### **References**

BOVE, G., CONTI, P.L., & MARELLA, D. 2020. A measure of interrater absolute agreement for ordinal categorical data. *Statistical Methods & Applications*, doi.org/10.1007/s10260-020-00551-5.

COHEN, J. 1960. A coefficient of agreement for nominal scales. *Educ. Psychol. Meas.*, **20**, 213–220.

CONGER, A.J. 1980. Integration and generalization of kappas for multiple raters. *Psychol. Bull.*, **88**, 322–328.

FLEISS, J.L. 1971. Measuring nominal scale agreement among many raters. *Psychol. Bull.*, **76**, 378–382.

LANDIS, J.R., & KOCH, G.G. 1977. An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. *Biometrics*, **33**, 363–374.

LETI, G. 1983. *Statistica descrittiva*. Bologna: Il Mulino.

MARASINI, D., QUATTO, P., & RIPAMONTI, E. 2016. Assessing the interrater agreement for ordinal data through weighted indexes. *Statistical Methods in Medical Research*, **25**, 2611–2633.

O'CONNELL, D.L., & DOBSON, A.J. 1984. General Observer-Agreement Measures on Individual Subjects and Groups of Subjects. *Biometrics*, **40**, 973–983.

WARRENS, M.J. 2012. Equivalences of weighted kappas for multiple raters. *Statistical Methodology*, **9**, 407–422.

WOOLSON, R.F. 1987. *Statistical methods for the analysis of biomedical data*. New York: John Wiley and Sons.


### ESTIMATING LATENT LINEAR CORRELATIONS FROM FUZZY CONTINGENCY TABLES

Antonio Calcagnì<sup>1</sup>

<sup>1</sup> DPSS, University of Padova, Italy (e-mail: antonio.calcagni@unipd.it)

ABSTRACT: In this contribution, we describe a method to estimate polychoric correlations when data are available in the form of fuzzy frequency tables. A simulation study is used to assess the characteristics of the proposed approach. Fuzzy polychoric correlations can be of particular utility, for instance, in studies involving covariance structural analysis (e.g., CFA) and dimensionality reduction techniques (e.g., EFA).

KEYWORDS: fuzzy frequencies, polychoric correlations, fuzzy classification

#### 1 Introduction

The latent linear correlation (LLC), also called polychoric correlation, is a measure of linear association which is usually adopted when dealing with categorical variables or statistics such as frequency or contingency tables. Given a set of *J* variables, the LLC is computed pairwise for each pair (*j*, *k*) of variables by considering their joint frequencies $\mathbf{N}^{(j,k)}\_{R \times C} = (n^{(j,k)}\_{11}, \ldots, n^{(j,k)}\_{rc}, \ldots, n^{(j,k)}\_{RC})$ over an $R\_{jk} \times C\_{jk}$ partition space of the variables' domain. The general idea is to adopt a bivariate Gaussian distribution with correlation $\rho\_{jk}$ as a latent statistical model underlying the observed frequency table $\mathbf{N}^{(j,k)}\_{R \times C}$, which maps the $R\_{jk} \times C\_{jk}$ space to the real domain of the bivariate density via a threshold-based approach. There are several contexts in which LLCs have been applied, including covariance structural analysis (e.g., CFA) and dimensionality reduction techniques (e.g., PCA, EFA). In this contribution, we generalize the problem of estimating polychoric correlations to fuzzy frequency tables, which are of particular utility when observed data are classified using fuzzy categories, as done, for example, in socio-economic studies, image/video classification, and content analysis. In all these cases, the $R\_{jk} \times C\_{jk}$ space of the variables' domain constitutes a fuzzy partition and the observed counts in $\mathbf{N}^{(j,k)}\_{R \times C}$ are no longer natural numbers. To deal with this issue, in this paper we describe a novel way to compute fuzzy frequency tables and provide a way to estimate $\rho\_{jk}$ when the observed frequencies are fuzzy. In what follows, we set *R* = *C* and *J* = 2 for the sake of simplicity.

#### 2 Fuzzy frequencies

A fuzzy subset $\tilde{A}$ of a universal set $\mathcal{A}$ is defined by means of its characteristic function $\xi\_A : \mathcal{A} \to [0,1]$. Let $\mathcal{A} \subset \mathbb{R}$ without loss of generality, and consider a pair of random variables $(X, Y)$ taking values on $\mathcal{A}$. Then $\mathcal{A}$ can conveniently be partitioned into collections of fuzzy subsets, namely $\mathcal{C}\_j = \{\tilde{C}\_1, \ldots, \tilde{C}\_r, \ldots, \tilde{C}\_R\}$ and $\mathcal{C}\_k = \{\tilde{C}\_1, \ldots, \tilde{C}\_c, \ldots, \tilde{C}\_C\}$. The random realizations $\mathbf{x} = (x\_1, \ldots, x\_I)$ and $\mathbf{y} = (y\_1, \ldots, y\_I)$ can partially or fully be classified into $\mathcal{C}\_j$ or $\mathcal{C}\_k$. The evaluation of the amount of sample realizations over $\tilde{C}\_j$ or $\tilde{C}\_k$ is called *cardinality*. This is a natural number or crisp count (i.e., $n\_{rc} \in \mathbb{N}\_0$) when the observations fully belong to subsets of $\tilde{C}\_j$ or $\tilde{C}\_k$. In the opposite case, it is a fuzzy natural number $\tilde{n}\_{rc} \in \mathcal{F}(\mathbb{N})$, with $\mathcal{F}(\mathbb{N})$ being the set of all *generalized natural numbers* (Bodjanova & Kalina, 2008). Let $\tilde{C}\_{rc}$ be an element of the fuzzy Cartesian product $\mathcal{C}\_j \tilde{\times} \mathcal{C}\_k$. Then a fuzzy count $\tilde{n}\_{rc}$ is a fuzzy set with membership function $\xi\_{n\_{rc}} : \mathbb{N}\_0 \to [0,1]$ computed as follows: $\xi\_{n\_{rc}}(n) = \min(\nu\_{rc}(n), \mu\_{rc}(n))$, with $\nu\_{rc}(n) = \mathrm{FGC}(\varepsilon\_{rc})$ and $\mu\_{rc}(n) = \mathrm{FLC}(\varepsilon\_{rc})$ for all $n \in \{0, 1, \ldots, I\} \subset \mathbb{N}\_0$. In this context, $\mathrm{FGC}(\cdot)$ and $\mathrm{FLC}(\cdot)$ are the fuzzy counting functions as defined by Zadeh (1983), whereas $\varepsilon\_{rc} = \min(\xi\_{C\_r}(\mathbf{x}\_j), \xi\_{C\_c}(\mathbf{y}\_k))$ contains the joint degrees of inclusion of the sample observations $\mathbf{x}$ and $\mathbf{y}$ w.r.t. the fuzzy categories. More details can be found in Bodjanova & Kalina (2008). Finally, the fuzzy frequency table $\mathbf{N}\_{R \times C}$ can be computed by applying the above calculus over $r = 1, \ldots, R$ and $c = 1, \ldots, C$.

#### 3 LLCs for fuzzy frequency tables

The latent statistical model underlying the sample realizations is bivariate Gaussian, $(X^*, Y^*) \sim \mathcal{N}(0, \rho)$, under the constraints that $(X \in \tilde{C}\_r) \wedge (Y \in \tilde{C}\_c)$ iff $(X^*, Y^*) \in (\tau^X\_{r-1}, \tau^X\_r] \times (\tau^Y\_{c-1}, \tau^Y\_c] \subset \mathbb{R}^2$ for all $r = 1, \ldots, R$ and $c = 1, \ldots, C$. The thresholds $\tau^X$ and $\tau^Y$ are defined so that $\tau\_0 = -\infty$ and $\tau\_R = \infty$ for both the $X$ and $Y$ variables. Note that $(X^*, Y^*)$ are unobserved pairs of latent variables. Following Olsson (1979), the parameters $\theta = \{\rho, \tau^X, \tau^Y\} \in [-1,1] \times \mathbb{R}^{R-1} \times \mathbb{R}^{C-1}$ can be estimated using a two-step approach. In particular, given the filtered counts at the current iteration, the thresholds are estimated using the cumulative marginals of $\mathbf{N}\_{R \times C}$ (first step). Then, $\rho$ is estimated by maximizing the log-likelihood implied by the model, conditioned on $\hat{\tau}^X$ and $\hat{\tau}^Y$ (second step):

$$\ln L(\theta; \mathbf{N}) \propto \sum\_{r=1}^{R} \sum\_{c=1}^{C} n\_{rc} \ln \int\_{\tau^X\_{r-1}}^{\tau^X\_{r}} \int\_{\tau^Y\_{c-1}}^{\tau^Y\_{c}} \phi(x, y; \rho) \, dx \, dy \tag{1}$$

domain constitutes a fuzzy partition and observed counts in N(*j*,*k*)

*<sup>R</sup>*×*<sup>C</sup>* = (*<sup>n</sup>*

(*j*,*k*) <sup>11</sup> ,...,*n*

(*j*,*k*) *rc* ,...,*n*

(*j*,*k*) *RC* ) over

*<sup>R</sup>*×*C*, which maps the

*<sup>R</sup>*×*<sup>C</sup>* are no longer

KEYWORDS: fuzzy frequencies, polychoric correlations, fuzzy classification

1 Introduction

by considering their joint frequencies N(*j*,*k*)

and *J* = 2 for the sake of simplicity.

A fuzzy subset *<sup>A</sup>*˜ of a universal set <sup>A</sup> is defined by means of its characteristic function <sup>ξ</sup>*<sup>A</sup>* : <sup>A</sup> <sup>→</sup> [0,1]. Let <sup>A</sup> <sup>⊂</sup> <sup>R</sup> without loss of generality and consider (*X*,*Y*) a pair of random variables taking values on A. Then A can conveniently be partitioned into a collection of fuzzy subsets, namely C*<sup>j</sup>* = {*C*˜ <sup>1</sup>,...,*C*˜ *<sup>r</sup>*,...,*C*˜ *<sup>R</sup>*} and <sup>C</sup>*<sup>k</sup>* <sup>=</sup> {*C*˜ <sup>1</sup>,...,*C*˜ *<sup>c</sup>*,...,*C*˜ *<sup>C</sup>*}. The random realizations x = (*x*1,..., *xI*) and y = (*y*1,...,*yI*) can partially or fully be classified into C*<sup>j</sup>* or <sup>C</sup>*k*. The evaluation of the amount of sample realizations over *<sup>C</sup>*˜*<sup>j</sup>* or *<sup>C</sup>*˜ *<sup>k</sup>* is called *cardinality*. This is a natural number or crisp count (i.e., *nrc* ∈ N0) when the observations fully belong to subsets of *C*˜*<sup>j</sup>* or *C*˜ *k*. On the opposite case, it is a fuzzy natural number ˜*nrc* ∈ F(N), with F(N) being the set of all *generalized natural numbers* (Bodjanova & Kalina, 2008). Let *C*˜ *rc* be an element of the fuzzy Cartesian product *<sup>C</sup>*˜*<sup>j</sup>* <sup>×</sup>˜ *<sup>C</sup>*˜ *<sup>k</sup>*. Then a fuzzy count *<sup>n</sup>*˜*rc* is a fuzzy set with membership function <sup>ξ</sup>*<sup>n</sup>rc* : <sup>N</sup><sup>0</sup> <sup>→</sup> [0,1] being computed as follows: <sup>ξ</sup>*<sup>n</sup>rc* (*n*) = min(ν*rc*(*n*),*µrc*(*n*)), with <sup>ν</sup>*rc*(*n*) = FGC(ε*rc*) and *µrc*(*n*) = FLC(ε*rc*) ∀ *n* ∈ {0,1,...,*I*} ⊂ N0. In this context, FGC(.) and FLC(.) are the fuzzy counting functions as defined by Zadeh (1983) whereas ε*rc* = min(ξ*<sup>C</sup><sup>r</sup>* (x*j*),ξ*<sup>C</sup><sup>c</sup>* (y*k*)) contains the joint degrees of inclusion of the sample observations x and y w.r.t. the fuzzy categories. 
More details can be found in Bodjanova & Kalina (2008). Finally, the fuzzy frequency table N*<sup>R</sup>*×*<sup>C</sup>* can be computed by applying the above calculus over *r* = 1,...,*R* and *c* = 1,...,*C*.
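As a concrete illustration (not part of the original paper), the membership function of a fuzzy count can be sketched in a few lines of Python. The sketch assumes that Zadeh's counting functions reduce to order statistics of the inclusion degrees, so that $\nu_{rc}(n)$ is the $n$-th largest degree ("at least $n$ observations in the cell") and $\mu_{rc}(n)$ is one minus the $(n+1)$-th largest ("at most $n$"); all function names are illustrative:

```python
def fgcount(memberships, n):
    """Degree to which AT LEAST n observations lie in the fuzzy cell:
    the n-th largest membership degree (1 by convention for n = 0)."""
    a = sorted(memberships, reverse=True)
    if n == 0:
        return 1.0
    return a[n - 1] if n <= len(a) else 0.0

def flcount(memberships, n):
    """Degree to which AT MOST n observations lie in the fuzzy cell:
    one minus the (n+1)-th largest membership degree."""
    a = sorted(memberships, reverse=True)
    return 1.0 - (a[n] if n < len(a) else 0.0)

def fuzzy_count(memberships, I):
    """Membership function of the fuzzy count over {0, ..., I}:
    xi(n) = min(FGC(n), FLC(n)), i.e. 'exactly n' observations."""
    return [min(fgcount(memberships, n), flcount(memberships, n))
            for n in range(I + 1)]

# Joint inclusion degrees eps_rc for I = 3 units: two full members, one partial
eps_rc = [1.0, 1.0, 0.5]
xi = fuzzy_count(eps_rc, 3)   # -> [0.0, 0.0, 0.5, 0.5]
```

With two observations fully inside the cell and one halfway in, the fuzzy count spreads its membership between $n=2$ and $n=3$, as one would expect; when all degrees are 0 or 1, the result collapses to a crisp count.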

#### 3 LLCs for fuzzy frequency tables

The latent statistical model underlying the sample realizations is bivariate Gaussian, $(X^*, Y^*) \sim \mathcal{N}(\mathbf{0}, \rho)$, under the constraints that $(X \in \tilde{C}_r) \wedge (Y \in \tilde{C}_c)$ iff $(X^*, Y^*) \in (\tau^X_{r-1}, \tau^X_r] \times (\tau^Y_{c-1}, \tau^Y_c] \subset \mathbb{R}^2$ for all $r = 1,\ldots,R$ and $c = 1,\ldots,C$. The thresholds $\boldsymbol{\tau}^X$ and $\boldsymbol{\tau}^Y$ are defined so that $\tau_0 = -\infty$ and $\tau_R = \infty$ for both the $X$ and $Y$ variables. Note that $(X^*, Y^*)$ are unobserved pairs of latent variables. Following Olsson (1979), the parameters $\theta = \{\rho, \boldsymbol{\tau}^X, \boldsymbol{\tau}^Y\} \in [-1,1] \times \mathbb{R}^{R-1} \times \mathbb{R}^{C-1}$ can be estimated using a two-step approach. In particular, given the filtered counts at the current iteration, the thresholds are estimated using the cumulative marginals of $\mathbf{N}_{R\times C}$ (first step). Then, $\rho$ is estimated by maximizing the log-likelihood implied by the model conditioned on $\hat{\boldsymbol{\tau}}^X$ and $\hat{\boldsymbol{\tau}}^Y$ (second step):

$$\ln \mathcal{L}(\theta; \mathbf{N}) \propto \sum_{r=1}^{R} \sum_{c=1}^{C} n_{rc} \ln \int_{\tau^X_{r-1}}^{\tau^X_{r}} \int_{\tau^Y_{c-1}}^{\tau^Y_{c}} \phi(x, y; \rho) \, dx\, dy \tag{1}$$

with $\phi(x, y; \rho)$ being the bivariate Gaussian density centered at zero. In what follows, we focus on estimating $\rho$, as the estimation of the thresholds follows straightforwardly from Olsson (1979). Since we observe fuzzy frequencies $\widetilde{\mathbf{N}}_{R\times C}$, we solve the maximization problem via the fuzzy EM algorithm proposed by Denoeux (2011), which in this case requires the computation of the following quantity:

$$\mathbb{E}_{\theta'}\!\left[\ln \mathcal{L}(\theta; \mathbf{N}) \mid \widetilde{\mathbf{N}}\right] \propto \sum_{r=1}^{R} \sum_{c=1}^{C} \mathbb{E}_{\theta'}\!\left[N_{rc} \mid \tilde{n}_{rc}\right] \ln \int_{\tau^X_{r-1}}^{\tau^X_{r}} \int_{\tau^Y_{c-1}}^{\tau^Y_{c}} \phi(x, y; \rho) \, dx\, dy \tag{2}$$

given a candidate estimate $\theta'$. The quantity $N_{rc} \mid \tilde{n}_{rc}$ is a random variable conditioned on a fuzzy event:

$$\mathbb{E}_{\theta'}\!\left[N_{rc} \mid \tilde{n}_{rc}\right] = \sum_{n \in \mathbb{N}_0} \frac{\xi_{\tilde{n}_{rc}}(n)\, f_{N_{rc}}(n; \pi_{rc}(\theta))}{\sum_{n' \in \mathbb{N}_0} \xi_{\tilde{n}_{rc}}(n')\, f_{N_{rc}}(n'; \pi_{rc}(\theta))}\, n \tag{3}$$


where $f_{N_{rc}}(n; \pi_{rc}(\theta)) = \mathcal{B}in(n; \pi_{rc}(\theta))$, with $\pi_{rc}(\theta) = \int_{\tau^X_{r-1}}^{\tau^X_{r}} \int_{\tau^Y_{c-1}}^{\tau^Y_{c}} \phi(x, y; \rho)\, dx\, dy$. Note that $\hat{n}_{rc} = \mathbb{E}_{\theta'}[N_{rc} \mid \tilde{n}_{rc}]$ denotes the reconstructed $rc$-th count. The fuzzy EM algorithm proceeds by alternating between the computation of Eq. (3) and the maximization of Eq. (1) once $\hat{n}_{rc}$ has been obtained.
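The expectation in Eq. (3) admits a direct numerical sketch (illustrative code, not the author's implementation), assuming $N_{rc} \sim \mathcal{B}in(I, \pi_{rc})$ for a given cell probability $\pi_{rc}$ and a fuzzy count represented by its membership values over $\{0,\ldots,I\}$:

```python
from math import comb

def binom_pmf(n, size, p):
    """Binomial probability mass function Bin(n; size, p)."""
    return comb(size, n) * p ** n * (1.0 - p) ** (size - n)

def reconstructed_count(xi, size, pi_rc):
    """E-step of Eq. (3): posterior expectation of the count N_rc given the
    fuzzy datum xi (membership values over {0, ..., size})."""
    w = [xi[n] * binom_pmf(n, size, pi_rc) for n in range(size + 1)]
    tot = sum(w)
    return sum(n * w[n] for n in range(size + 1)) / tot

# A crisp fuzzy count (all membership on n = 2) is reproduced exactly,
# while a genuinely fuzzy one is filtered through the binomial model.
n_hat = reconstructed_count([0.0, 0.0, 1.0, 0.0], 3, 0.4)   # -> 2.0
```

The reconstructed counts $\hat{n}_{rc}$ can then be plugged into Eq. (1) for the M-step.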

#### 4 Simulation study

The aim of this Monte Carlo study is twofold. First, we evaluate the performance of the fuzzy-EM estimator of $\rho_{jk}$ when fuzzy frequency data are available. Second, we assess whether the standard maximum likelihood estimator of polychoric correlations performs as well as the proposed method when applied to max-based and mean-based defuzzified data. The case $J = 2$ was considered for the sake of simplicity.

*Design*. The design involved two factors, namely (i) *I* ∈ {150,250,500}, and (ii) ρ ∈ {0.15,0.50,0.85}, which were varied in a complete factorial design. For each combination, *B* = 5000 samples were generated.

*Data generation*. For each condition of the simulation design, data were generated according to a two-step procedure. First, a crisp frequency table $\mathbf{N}_{R\times C}$ was computed using the approximation $n_{rc} = I \cdot \pi_{rc}$ ($r = 1,\ldots,R$; $c = 1,\ldots,C$), with $\boldsymbol{\tau}^X = \boldsymbol{\tau}^Y = (-2,-1,0,1,2)$. Second, each element of $\mathbf{N}_{R\times C}$ was fuzzified via the probability-possibility transformation
$$\xi_{\tilde{n}_{rc}}(n) = \frac{f_{\mathcal{G}_d}(n; \alpha_{rc}, \beta_{rc})}{\max_n f_{\mathcal{G}_d}(n; \alpha_{rc}, \beta_{rc})},$$
with $\alpha_{rc} = 1 + m_1\beta_{rc}$, $\beta_{rc} = \bigl(m_1 + (m_1^2 + 4s_1^2)^{1/2}\bigr)/(2s_1^2)$, $m_1 \sim \mathcal{G}amma_d(\alpha_{m_1}, \beta_{m_1})$ where $\alpha_{m_1} = 1 + n_{rc}\beta_{m_1}$, $\beta_{m_1} = \bigl(n_{rc} + (n_{rc}^2 + 4s_1^2)^{1/2}\bigr)/(2s_1^2)$, $s_1 \sim \mathcal{G}amma_d(\alpha_{s_1}, \beta_{s_1})$, $\alpha_{s_1} = 1 + m_0\beta_{s_1}$, $\beta_{s_1} = \bigl(m_0 + (m_0^2 + 4s_0^2)^{1/2}\bigr)/(2s_0^2)$, $m_0 = 1$, and $s_0 = 0.15$. Note that $f_{\mathcal{G}_d}$ is the density of the discrete Gamma random variable $\mathcal{G}amma_d$.
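The shape/rate expressions above follow the standard mode-spread parameterization of the Gamma distribution (a Gamma with $\alpha = 1 + m\beta$ and $\beta = (m + \sqrt{m^2+4s^2})/(2s^2)$ has mode $m$). A simplified sketch of the fuzzification step is given below; for brevity it fixes $m_1 = n_{rc}$ (skipping the extra $\mathcal{G}amma_d$ randomization of $m_1$ and $s_1$) and uses the continuous Gamma density evaluated at integers as a stand-in for $f_{\mathcal{G}_d}$, so it should be read as an illustration of the transformation, not as the exact simulation code:

```python
from math import exp, gamma as gamma_fn, sqrt

def gamma_pdf(x, alpha, beta):
    """Gamma density with shape alpha and rate beta."""
    return beta ** alpha * x ** (alpha - 1) * exp(-beta * x) / gamma_fn(alpha)

def mode_spread_params(m, s):
    """Shape/rate of a Gamma with mode m and spread s, matching the
    alpha = 1 + m*beta, beta = (m + sqrt(m^2 + 4 s^2)) / (2 s^2) pattern."""
    beta = (m + sqrt(m ** 2 + 4.0 * s ** 2)) / (2.0 * s ** 2)
    return 1.0 + m * beta, beta

def fuzzify(n_rc, s1, I):
    """Possibility distribution over {0, ..., I} for a crisp count n_rc:
    a discretized Gamma rescaled so that its maximum equals one."""
    alpha, beta = mode_spread_params(n_rc, s1)
    dens = [gamma_pdf(n, alpha, beta) if n > 0 else 0.0 for n in range(I + 1)]
    top = max(dens)
    return [d / top for d in dens]

xi = fuzzify(5, 1.0, 20)   # peaks (value 1.0) at n = 5
```

By construction the possibility distribution attains 1 at the crisp count and decays on either side, with the spread parameter controlling how fuzzy the resulting count is.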

*Outcome measures*. For each condition of the simulation design, sample results were evaluated using bias of estimates and root mean square error.

*Results*. Table 1 shows the results of the study. As expected, fEM outperformed standard ML applied to both max-based and mean-based defuzzified data in terms of bias and root mean square error. This is mainly due to the fact that the $\hat\rho_{\mathrm{fEM}}$ estimator weights the observed fuzzy data $\xi_{\tilde{n}_{rc}}$ with the probabilistic model for the unobserved $n_{rc}$.


Table 1. *Monte Carlo study: estimating* ρ *via fuzzy-EM (fEM) and standard ML (dML) on max-based and mean-based defuzzified frequency tables.*

| ρ | *I* | fEM bias | fEM rmse | dML (max) bias | dML (max) rmse | dML (mean) bias | dML (mean) rmse |
|------|-----|---------|--------|---------|--------|---------|--------|
| 0.15 | 150 | 0.0358 | 0.0881 | -0.0105 | 0.1142 | -0.0402 | 0.0846 |
| 0.15 | 250 | 0.0043 | 0.0514 | -0.0284 | 0.0817 | -0.0403 | 0.0683 |
| 0.15 | 500 | 0.0099 | 0.0297 | 0.0020 | 0.0416 | -0.0082 | 0.0335 |
| 0.50 | 150 | 0.0103 | 0.0747 | -0.0933 | 0.1545 | -0.1797 | 0.1956 |
| 0.50 | 250 | -0.0363 | 0.0626 | -0.1216 | 0.1488 | -0.1706 | 0.1800 |
| 0.50 | 500 | -0.0006 | 0.0264 | -0.0457 | 0.0689 | -0.0828 | 0.0903 |
| 0.85 | 150 | 0.0013 | 0.0441 | -0.2150 | 0.2525 | -0.3274 | 0.3354 |
| 0.85 | 250 | -0.0028 | 0.0269 | -0.1707 | 0.1967 | -0.2580 | 0.2642 |
| 0.85 | 500 | -0.0009 | 0.0145 | -0.1034 | 0.1211 | -0.1630 | 0.1672 |

#### References

BODJANOVA, SLAVKA, & KALINA, MARTIN. 2008. Cardinalities of Granules of Vague Data. *Pages 63–70 of:* MAGDALENA, L., OJEDA-ACIEGO, M., & VERDEGAY, J.L. (eds), *Proceedings of IPMU 2008, Torremolinos (Málaga), June 22–27 2008*.

DENOEUX, THIERRY. 2011. Maximum likelihood estimation from fuzzy data using the EM algorithm. *Fuzzy Sets and Systems*, 183(1), 72–91.

OLSSON, ULF. 1979. Maximum likelihood estimation of the polychoric correlation coefficient. *Psychometrika*, 44(4), 443–460.

ZADEH, LOTFI A. 1983. A computational approach to fuzzy quantifiers in natural languages. *Pages 149–184 of: Computational Linguistics*. Elsevier.


### MODEL-BASED CLUSTERING WITH SPARSE MATRIX MIXTURE MODELS


Andrea Cappozzo<sup>1</sup>, Alessandro Casa<sup>2</sup> and Michael Fop<sup>2</sup>

<sup>1</sup> Department of Mathematics, Politecnico di Milano (e-mail: andrea.cappozzo@polimi.it)

<sup>2</sup> School of Mathematics and Statistics, University College Dublin (e-mail: alessandro.casa@ucd.ie, michael.fop@ucd.ie)

ABSTRACT: In recent years, methods for clustering matrix-valued data have attracted increasing attention. In this framework, matrix Gaussian mixture models constitute a natural extension of model-based clustering strategies. Regrettably, the over-parameterization issues already affecting the vector-valued framework in high-dimensional scenarios are even more troublesome for matrix mixtures. In this work we introduce a sparse model-based clustering procedure conceived for the matrix-variate context. We propose a penalized estimation scheme which, by shrinking some of the parameters towards zero, produces parsimonious solutions as the dimensions increase. Moreover, it allows cluster-wise sparsity, possibly easing the interpretation and providing richer insights on the analyzed dataset.

KEYWORDS: model-based clustering, penalized likelihood, sparse matrix estimation, EM-algorithm

#### 1 Introduction

Model-based clustering represents a well-established framework for clustering multivariate data. When dealing with continuous data, the generative mechanism is routinely described by means of Gaussian Mixture Models (GMMs). Partitions are obtained by exploiting the one-to-one correspondence between the groups and the components of the mixture. This approach has been used in many different applications; nonetheless, GMMs tend to be over-parameterized in high-dimensional settings, where their usefulness might be jeopardized.

This problem is even more severe in three-way data scenarios, where multiple variables are measured on different occasions for the considered units. Here matrix-variate distributions have often been used and embedded in the mixture framework, thus providing a valid solution when partitions of matrices are required (Viroli, 2011). In spite of its strengths, this approach is dramatically over-parameterized even in moderate dimensions. Therefore, we propose a penalized model-based clustering strategy in the matrix-variate framework. Our approach reduces the number of parameters to be estimated, by shrinking some of them towards zero, and possibly leads to a gain in terms of interpretability. The rest of the paper is organized as follows. In Section 2 we introduce matrix Gaussian mixture models (MGMMs) and we outline our proposal. An application to real-world data is reported in Section 3, alongside some concluding remarks and possible future research directions.

#### 2 Penalized matrix-variate mixture model


Let $\mathbf{X} = \{\mathbf{X}_1,\ldots,\mathbf{X}_n\}$ be a set of $n$ matrices with $\mathbf{X}_i \in \mathbb{R}^{p\times q}$. MGMMs provide an extension of GMMs when a clustering of matrices is needed. The density of $\mathbf{X}_i$ is then expressed as follows

$$f(\mathbf{X}_i; \Theta) = \sum_{k=1}^{K} \tau_k \,\phi_{(p \times q)}(\mathbf{X}_i; M_k, \Omega_k, \Gamma_k) \tag{1}$$

where $\Theta = \{\tau_k, M_k, \Omega_k, \Gamma_k\}_{k=1}^{K}$ and the $\tau_k$'s are the mixing proportions, with $\tau_k > 0$ and $\sum_k \tau_k = 1$. On the other hand, $\phi_{(p\times q)}(\mathbf{X}_i; M_k, \Omega_k, \Gamma_k)$ denotes the density of a $p \times q$ matrix normal distribution, where $M_k \in \mathbb{R}^{p\times q}$ is the mean of the $k$-th component, while $\Omega_k \in \mathbb{R}^{p\times p}$ and $\Gamma_k \in \mathbb{R}^{q\times q}$ represent, respectively, the row and column component precision matrices.
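For reference, the matrix normal log-density in Eq. (1), parameterized by the precision matrices $\Omega_k$ and $\Gamma_k$, can be sketched as follows (a generic NumPy implementation of the standard matrix-normal density, not the authors' code):

```python
import numpy as np

def matrix_normal_logpdf(X, M, Omega, Gamma):
    """Log-density of a p x q matrix normal with mean M and PRECISION
    matrices Omega (rows, p x p) and Gamma (columns, q x q)."""
    p, q = X.shape
    R = X - M
    # Quadratic form tr(Omega (X - M) Gamma (X - M)^T)
    quad = np.trace(Omega @ R @ Gamma @ R.T)
    _, logdet_omega = np.linalg.slogdet(Omega)
    _, logdet_gamma = np.linalg.slogdet(Gamma)
    return (-0.5 * p * q * np.log(2.0 * np.pi)
            + 0.5 * q * logdet_omega + 0.5 * p * logdet_gamma
            - 0.5 * quad)

# Sanity check: for p = q = 1 this reduces to N(m, 1/(omega * gamma))
x, m = np.array([[0.7]]), np.array([[0.2]])
omega, gamma = np.array([[4.0]]), np.array([[1.0]])
ref = -0.5 * np.log(2.0 * np.pi) + 0.5 * np.log(4.0) - 0.5 * 4.0 * 0.5 ** 2
```

Working with precisions rather than covariances is what makes the $\ell_1$ penalization of Section 2 translate directly into cluster-wise conditional independence statements.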

In (1) the number of parameters to estimate scales quadratically with both $p$ and $q$, endangering the practical usefulness of the model. Some solutions have recently been proposed to overcome this issue (see Wang & Melnykov, 2020 and Sarkar *et al.*, 2020). These approaches present some drawbacks, as they are computationally intensive and implement a rigid way to induce parsimony. Therefore, in this work we take a different path, adopting a penalized estimation approach which implicitly assumes that $M_k, \Omega_k, \Gamma_k$, for $k = 1,\ldots,K$, possess some degree of sparsity.

To this aim, we introduce a penalized likelihood strategy to obtain $\hat{\Theta}$. The penalized log-likelihood function to be maximized is defined as

$$\ell(\Theta; \mathbf{X}) = \sum_{i=1}^{n} \log \left\{ \sum_{k=1}^{K} \tau_{k} \,\phi_{(p \times q)}(\mathbf{X}_{i}; M_{k}, \Omega_{k}, \Gamma_{k}) \right\} - p_{\lambda_{1}, \lambda_{2}, \lambda_{3}}(M_{k}, \Omega_{k}, \Gamma_{k}) \tag{2}$$

with the penalization term $p_{\lambda_1,\lambda_2,\lambda_3}(M_k, \Omega_k, \Gamma_k)$ equal to

$$p\_{\lambda\_1, \lambda\_2, \lambda\_3}(M\_k, \Omega\_k, \Gamma\_k) = \sum\_{k=1}^K \lambda\_1 ||P\_1 \* M\_k||\_1 + \sum\_{k=1}^K \lambda\_2 ||P\_2 \* \Omega\_k||\_1 + \sum\_{k=1}^K \lambda\_3 ||P\_3 \* \Gamma\_k||\_1$$

Table 1. *Adjusted Rand Index (ARI) and number of free estimated parameters for three clustering procedures.*

| | Sparsemixmat | Sarkar *et al.*, 2020 | GMM |
|---|---|---|---|
| ARI | 0.7883 | 0.7772 | 0.3841 |
| # of parameters | 218 | 275 | 850 |



$P_1, P_2, P_3$ are matrices with non-negative entries, $\|A\|_1 = \sum_{jh} |A_{jh}|$, $\lambda_1, \lambda_2, \lambda_3$ are the penalization parameters, while $*$ denotes the element-wise product. To estimate $\Theta$, we devise an ad-hoc EM algorithm which maximizes the *penalized complete-data log-likelihood* associated with (2). The E-step computes the a posteriori class membership probabilities via the standard updating formula. On the other hand, the M-step consists of three partial optimization cycles. An estimate of $M_k$ is obtained by means of a cell-wise coordinate ascent algorithm while, to estimate $\Omega_k$ and $\Gamma_k$, we propose a suitable modification of the graphical LASSO (Friedman *et al.*, 2008). The resulting model, inducing sparsity in the precision matrices, accounts for cluster-wise conditional independence patterns, which might ease the interpretation of the results and possibly provides indications about irrelevant variables. Moreover, the number of parameters is reduced without imposing rigid structures.
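The cell-wise $\ell_1$-penalized updates rest on the soft-thresholding operator, which is what drives entries of $M_k$ exactly to zero. A minimal sketch follows; `M_ls` and `lam1` are illustrative placeholders for an unpenalized update and a penalty level, not quantities from the paper (and a full implementation would use an entry-specific threshold $\lambda_1 P_{1,jh}$):

```python
import numpy as np

def soft_threshold(A, thresh):
    """Element-wise soft-thresholding, the building block of l1-penalized
    coordinate updates: entries with |A_jh| <= thresh are set exactly to 0."""
    return np.sign(A) * np.maximum(np.abs(A) - thresh, 0.0)

# Hypothetical unpenalized update for one mean matrix and a penalty level
M_ls = np.array([[1.5, -0.2],
                 [0.05, -2.0]])
lam1 = 0.3
M_k = soft_threshold(M_ls, lam1)   # small entries shrunk exactly to zero
```

This is the mechanism behind the zero entries marked with an × in Figure 1: shrinkage does not merely make coefficients small, it produces exact zeros, which is what yields the interpretable sparse patterns.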

#### 3 Application and concluding remarks

We employ the procedure outlined in Section 2 to obtain a partition of the Landsat satellite data, where $n = 845$ matrices of dimension $4 \times 9$, coming from three different classes, are available (see Viroli, 2011 for a detailed description). In Table 1 we report the results obtained with the proposed procedure (Sparsemixmat) and with two plausible competitors, namely the approach by Sarkar *et al.* (2020) and the standard GMM applied to the unfolded two-way representation of the data. Our model outperforms the competitors when recovering the true clustering structure is the aim. Furthermore, it provides the most parsimonious solution, displaying the lowest number of non-zero estimated parameters. The retrieved sparse matrix structures are graphically displayed, for the three classes, in Figure 1. While the clustering is mainly driven by the different patterns in the $M_k$'s, the $\Gamma_k$'s show the highest degree of sparsity, with different intensities across the three classes.

The promising results obtained in the application demonstrate how the penalized matrix-variate mixture model proposed in this work might alleviate the flaws of standard three-way data clustering in high-dimensional scenarios.

Figure 1. *Sparsely estimated* $M_k$ *(upper plots),* $\Omega_k$ *(middle plots) and* $\Gamma_k$ *(lower plots) for* $k = 1,2,3$ *(classes: Damp grey Soil, Grey Soil, Soil with vegetation stubble). Entries that are shrunk to* 0 *by the estimator are highlighted with an* ×*.*

Our proposal is able to effectively reduce the number of parameters to estimate while, at the same time, flexibly accounting for different relationships among the variables and for different levels of sparsity across the groups. Future research will focus on the derivation of an appropriate model selection procedure, jointly determining reasonable values for the penalty coefficients as well as for the number of mixture components.

#### References

Table 1. *Adjusted Rand Index (ARI) and number of free estimated parameters for three*

ARI 0.7883 0.7772 0.3841 # of parameters 218 275 850

*P*1,*P*2,*P*<sup>3</sup> are matrices with non-negative entries, ||*A*||<sup>1</sup> = ∑*jh* |*Ajh*|, λ1,λ2,λ<sup>3</sup> are the penalization parameters while ∗ denotes the element-wise product. To estimate Θ, we devise an ad-hoc EM-algorithm which maximizes the *penalized complete data log-likelihood* associated with (2). The E-step computes class membership a posteriori probabilities via the standard updating formula. On the other hand the M-step consists of three partial optimization cycles. An estimate for *Mk* is obtained by means of a cell-wise coordinate ascent algorithm while, to estimate Ω*<sup>k</sup>* and Γ*k*, we propose a suitable modification of the graphical LASSO (Friedman *et al.* , 2008). The resulting model, inducing sparsity in the precision matrices, accounts for cluster-wise conditional independence patterns, which might ease the interpretation of the results, and possibly provides indications about irrelevant variables. Moreover the number of parameters is

We employ the procedure outlined in Section 2 to obtain a partition of the Landsat satellite data, where *n* = 845 matrices, with dimensions 4 × 9, coming from three different classes are available (see Viroli, 2011 for a detailed description). In Table 1 we report the results obtained with the proposed procedure (Sparsemixmat) and with two plausible competitors being the approach by Sarkar *et al.* , 2020 and the standard GMM applied to the unfolded two-way representation of the data. Our model outperforms the competitors, when recovering the true clustering structure is the aim. Furthermore, we provide the most parsimonious solution, displaying the lowest number of non zero estimated parameters. The retrieved sparse matrix structures are graphically displayed, for the three classes, in Figure 1. While the clustering is mainly driven by the different patterns in *Mk*'s, the Γ*k*'s are the ones showing the highest

degree of sparsity, with different intensities for the three classes.

The promising results obtained in the application demonstrate how the penalized matrix-variate mixture model proposed in this work might alleviate the flaws of standard three-way data clustering in high-dimensional scenarios.

Sparsemixmat Sarkar *et al.* , 2020 GMM

*clustering procedures.*

reduced without imposing rigid structures.

3 Application and concluding remarks
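To make the cell-wise shrinkage step described above concrete, the following minimal Python sketch (not the authors' implementation; the data, posterior weights and penalty level are invented for illustration) applies the soft-thresholding operator, the building block of L1-penalized coordinate ascent, to a weighted cluster mean matrix, producing the exact zeros that sparse estimators are designed to yield:

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of the L1 penalty: shrinks entries toward 0
    # and sets small ones exactly to 0
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# illustrative data: 50 matrices of dimension 4 x 9, as in the Landsat example
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4, 9))
w = rng.uniform(size=50)  # stand-in for E-step posterior weights of one cluster

# weighted sample mean of the cluster, then its penalized (sparse) estimate;
# the penalty level 0.1 is arbitrary here
M_hat = np.einsum('i,ijk->jk', w, X) / w.sum()
M_sparse = soft_threshold(M_hat, 0.1)
print(int((M_sparse == 0).sum()), "entries shrunk to zero")
```

In the actual M-step the update is performed cell by cell and interacts with the estimated scatter matrices; the sketch only isolates the shrinkage mechanism.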


### EXPLORING SOLUTIONS VIA MONITORING FOR CLUSTER WEIGHTED ROBUST MODELS

Andrea Cappozzo<sup>1</sup>, Luis Ángel García-Escudero<sup>2</sup>, Francesca Greselin<sup>3</sup> and Agustín Mayo-Iscar<sup>2</sup>

<sup>1</sup> Department of Mathematics, Politecnico di Milano (e-mail: andrea.cappozzo@polimi.it)

<sup>2</sup> Departamento de Estadística e Investigación Operativa, Facultad de Ciencias, Universidad de Valladolid (e-mail: lagarcia@uva.es, agustin.mayo.iscar@uva.es)

<sup>3</sup> Department of Statistics and Quantitative Methods, University of Milano-Bicocca (e-mail: francesca.greselin@unimib.it)

ABSTRACT: Depending on the selected hyper-parameters, cluster weighted modeling may produce a set of diverse solutions. In particular, the user can manually specify the number of mixture components and the degree of heteroscedasticity of the clusters, both in the explanatory variables and in the errors around the regression lines. In addition, when performing robust inference, the level of impartial trimming enforced in the estimation needs to be selected. This flexibility gives rise to a variety of "legitimate" solutions. To mitigate the model selection problem, we propose a two-stage monitoring procedure to identify a set of "good models". An application to the benchmark tone perception data showcases the benefits of the approach.

KEYWORDS: Cluster-weighted modeling, Outliers, Trimmed BIC, Eigenvalue constraint, Monitoring, Constrained estimation, Model-based clustering.

#### 1 Introduction and model preliminaries

Assume we have observed a dataset $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$ of *n* i.i.d. samples, where the regression on *Y* varies across *G* groups, based on a vector **X** of explanatory variables with values in $\mathbb{R}^d$. Within this framework, the Gaussian Cluster Weighted Robust Model (García-Escudero *et al.*, 2017) is based on the constrained maximization of the *trimmed* log-likelihood:

$$\ell_{\text{trimmed}}(\Theta \mid \mathbf{X}, Y) = \sum_{i=1}^{n} z(\mathbf{x}_i, y_i) \log \left[ \sum_{g=1}^{G} \pi_g \, \phi(y_i; \mathbf{b}_g' \mathbf{x}_i + b_g^0, \sigma_g^2) \, \phi_d(\mathbf{x}_i; \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g) \right], \tag{1}$$

subject to $\lambda_{l_1}(\Sigma_{g_1}) \le c_X \, \lambda_{l_2}(\Sigma_{g_2})$ for every $1 \le l_1, l_2 \le d$, $1 \le g_1, g_2 \le G$, and $\sigma^2_{g_1} \le c_y \, \sigma^2_{g_2}$ for every $1 \le g_1, g_2 \le G$. The 0-1 trimming indicator function $z(\cdot,\cdot)$ tells us whether observation $(\mathbf{x}_i, y_i)$ is trimmed off, with a proportion $\alpha$ of observations left unassigned by setting $\sum_{i=1}^{n} z(\mathbf{x}_i, y_i) = n(1-\alpha)$. The set $\{\lambda_l(\Sigma_g)\}_{l=1,\dots,d}$ denotes the eigenvalues of the scatter matrix $\Sigma_g$, and the constants $c_X$ and $c_y$ are finite real numbers such that $c_X \ge 1$ and $c_y \ge 1$.
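A minimal Python sketch of evaluating the trimmed log-likelihood (1) for fixed parameter values may clarify the role of $z(\cdot,\cdot)$: the $n(1-\alpha)$ observations with the largest mixture contributions are kept, and the rest are trimmed off. The toy data and parameter values below are invented, and this evaluates only the objective, not the constrained maximization itself:

```python
import math

def dnorm(x, m, s2):
    # univariate Gaussian density with mean m and variance s2
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def trimmed_loglik(data, pi, b0, b1, s2, mu, sx2, alpha):
    # per-observation contribution under the cluster weighted mixture (d = 1)
    contrib = []
    for x, y in data:
        dens = sum(pi[g] * dnorm(y, b0[g] + b1[g] * x, s2[g]) * dnorm(x, mu[g], sx2[g])
                   for g in range(len(pi)))
        contrib.append(math.log(dens))
    # impartial trimming: keep only the ceil(n(1 - alpha)) largest contributions
    keep = math.ceil(len(data) * (1 - alpha))
    return sum(sorted(contrib, reverse=True)[:keep])

data = [(0.0, 0.1), (0.2, 0.3), (5.0, 9.0)]  # the last pair acts as an outlier
theta = dict(pi=[0.5, 0.5], b0=[0.0, 0.0], b1=[1.0, 1.0],
             s2=[1.0, 1.0], mu=[0.0, 0.0], sx2=[1.0, 1.0])
print(trimmed_loglik(data, alpha=0.0, **theta))  # outlier drags the value down
print(trimmed_loglik(data, alpha=1/3, **theta))  # outlier trimmed off
```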

#### 2 Tone perception data application


The tone perception dataset (De Veaux, 1989) is employed as a case study to illustrate the proposed two-step monitoring procedure. In the first step, dedicated graphical and exploratory tools are employed to determine one or more plausible values for the trimming level α. Specifically, the group proportions (with black bars denoting the trimmed units), the total sum of squares decomposition (Ingrassia & Punzo, 2020), the regression coefficients, standard deviations, cluster volumes and the Adjusted Rand Index (ARI) between consecutive cluster allocations are monitored over a grid of values of α, as reported in Figure 1. For each trimming level, the best model is selected according to a novel penalized likelihood criterion tailored to the CWRM framework, building upon the proposal developed in Cerioli *et al.*, 2018 for Gaussian mixtures. As is clearly visible from the plots in Figure 1, the model parameters stabilize as soon as α is set higher than 0.08, a value sufficient to trim off the level of contamination known to be present in this dataset (García-Escudero *et al.*, 2017).

In the second stage, conditioning on the α selected in the previous step, the stability and validity of the solutions are thoroughly investigated by varying the hyper-parameters in *E*<sub>0</sub> = {(*G*, *c<sub>X</sub>*, *c<sub>y</sub>*) : *G* = 1,...,4; *c<sub>X</sub>*, *c<sub>y</sub>* = 2<sup>1</sup>,...,2<sup>5</sup>}, as reported in Figure 2. Darker and lighter opacity cells respectively indicate the sets *B<sub>t</sub>* of best and *S<sub>t</sub>* of stable solutions, for each optimal solution *t*, *t* = 1,...,4, where optimality is in the sense of the penalized criterion. The former set includes solutions that are ARI-similar to the optimal one and not worse than the next optimal, while the latter encompasses all solutions ARI-similar to the optimal one, so that *B<sub>t</sub>* ⊆ *S<sub>t</sub>*. In this example, solutions are regarded as ARI-similar if the ARI between the estimated partitions is higher than 0.7. It is interesting to notice that the CWRM favors models with a higher number of clusters than the accepted truth of *G* = 2 (fourth optimal solution, stable over the entire grid of *c<sub>X</sub>* and *c<sub>y</sub>*). The reason is that, contrary to the standard mixture of regressions, the CWRM treats the covariate as random, thus allowing group-wise different distributions to be learned in the explanatory variable (Figure 3).
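Both monitoring stages rely on the ARI between estimated partitions. A self-contained Python sketch of the index (the partitions below are illustrative; the 0.7 similarity threshold is the one adopted in the text) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    # ARI between two partitions of the same n units, given as label sequences
    n = len(a)
    pairs = comb(n, 2)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / pairs   # expectation under random labelling
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

part_a = [0, 0, 0, 1, 1, 2]
part_b = [1, 1, 0, 0, 0, 2]  # a relabelled, slightly perturbed version of part_a
print(round(adjusted_rand_index(part_a, part_b), 3))
# two solutions would be declared ARI-similar when the index exceeds 0.7
```

The index equals 1 for identical partitions (up to relabelling) and is close to 0 for independent ones, which is what makes it suitable as a stability measure across hyper-parameter settings.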

We have demonstrated the adequacy of our monitoring procedure in aiding practitioners in the hyper-parameter selection when fitting CWRMs. Furthermore, by exploring the space of solutions, a deeper understanding of the data structure is achieved, uncovering sometimes unexpected yet valuable results.

Figure 1: Step 1, monitoring the choice of a plausible trimming level α, tone perception data.

Figure 2: Step 2, monitoring optimal solutions in terms of validity and stability. Trimming level α = 0.08, tone perception data.

Figure 3: Estimated density on the explanatory variable, first optimal solution. Trimming level α = 0.08, tone perception data.

#### References

CERIOLI, ANDREA, GARCÍA-ESCUDERO, LUIS ÁNGEL, MAYO-ISCAR, AGUSTÍN, & RIANI, MARCO. 2018. Finding the number of normal groups in model-based clustering via constrained likelihoods. *Journal of Computational and Graphical Statistics*, 27(2), 404–416.

DE VEAUX, RICHARD D. 1989. Mixtures of linear regressions. *Computational Statistics & Data Analysis*, 8(3), 227–245.

GARCÍA-ESCUDERO, L. A., GORDALIZA, A., GRESELIN, F., INGRASSIA, S., & MAYO-ISCAR, A. 2017. Robust estimation of mixtures of regressions with random covariates, via trimming and constraints. *Statistics and Computing*, 27(2), 377–402.

INGRASSIA, SALVATORE, & PUNZO, ANTONIO. 2020. Cluster Validation for Mixtures of Regressions via the Total Sum of Squares Decomposition. *Journal of Classification*, 37(2), 526–547.



### CATEGORICAL CLASSIFIERS IN MULTI-CLASS CLASSIFICATION PROBLEMS


Maurizio Carpita<sup>1</sup>, Silvia Golia<sup>1</sup>

<sup>1</sup> Department of Economics and Management, University of Brescia (e-mail: maurizio.carpita@unibs.it, silvia.golia@unibs.it)

ABSTRACT: This paper shows the preliminary results of a simulation study devoted to comparing, in a multi-class classification setting, three classifiers that transform the probabilities produced by a probabilistic classifier into a single class: the usual Bayes Classifier and the new Max Difference Classifier and Max Ratio Classifier. As is well known, the Bayes Classifier has some limits with rare classes, whereas the proposed Max Difference and Max Ratio Classifiers seem to represent better alternatives.

KEYWORDS: categorical classifier, polytomous variable, Bayes classifier

#### 1 The proposed categorical classifiers and preliminary results

In machine learning, when dealing with a classification problem, two aspects can be distinguished. The first concerns the identification of a so-called *probabilistic classifier*, i.e. a suitable method that assigns a probability to each of the categories that can be assumed by the target variable. The second regards the so-called *categorical classifier*, which transforms the probabilities produced by the probabilistic classifier into a single category. There is a large literature on how to find the best probabilistic classifier in both the dichotomous and the polytomous context, whereas less attention has been paid to the criterion used to pass from the probabilistic to the categorical classifier. The *Bayes Classifier* (BC), which assigns a unit to its most likely category on the basis of the probabilistic classifier, minimizes, on average, the test error rate (James *et al.*, 2013), so it is the optimal criterion if one is interested in the accuracy of the classification. Nevertheless, this classifier favors the prevalent category most and, in situations in which there is no single category of interest but all categories have the same relevance, the BC may not be the best choice. In previous papers (see, for example, Golia & Carpita, 2020) the authors investigated the performance of different categorical classifiers and found one of them promising. In this study this classifier, called the Maximum Difference Classifier, is considered jointly with a new proposal, denoted as the Maximum Ratio Classifier.

Both classifiers are based on the comparison between the predicted probabilities and the sample frequencies. Let *pr<sub>i</sub>* be the predicted probability of category *a<sub>i</sub>* (*i* = 1, 2,..., *k*) of the variable *A*, and let *fr<sub>i</sub>* be the corresponding frequency computed from the observed data. The *Maximum Difference Classifier* (MDC) computes the deviations of *pr<sub>i</sub>* from *fr<sub>i</sub>* and takes the category corresponding to the maximum difference, that is argmax<sub>*i*∈{1,...,*k*}</sub> (*pr<sub>i</sub>* − *fr<sub>i</sub>*). This classifier extends to the polytomous case what was proposed by Cramer (1999) for the dichotomous one. The *Maximum Ratio Classifier* (MRC) computes the relative deviations of *pr<sub>i</sub>* from *fr<sub>i</sub>* and takes the category corresponding to the maximum ratio, that is argmax<sub>*i*∈{1,...,*k*}</sub> (*pr<sub>i</sub>* / *fr<sub>i</sub>*).

To evaluate the predictive performance of a classifier, some indicators computed from the confusion matrix can be used. In this study they are the *Sensitivity* (Sen) and the *Specificity* (Spe) of each category, the *Maximum Distance Between Sensitivities* (MDBSen) and the *Maximum Distance Between Specificities* (MDBSpe), the *Overall Accuracy* (OvAc) and the *Macro Average F1 score* (MAF1) (Raschka & Mirjalili, 2019). *Sen<sub>i</sub>* (*Spe<sub>i</sub>*) expresses how well the classifier recognizes a unit belonging (not belonging) to category *a<sub>i</sub>*. MDBSen and MDBSpe highlight how balanced the ability of the classifier to assign a unit to the right category is: the lower the MDBSen and MDBSpe, the more balanced the classification. OvAc is the rate of correct classification and is the indicator maximized by the BC. MAF1 is a further accuracy indicator, obtained as the average of the class-by-class F1 scores; the choice of MAF1 instead of the weighted average F1 score reflects the aim of attributing the same relevance to all classes.
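The three categorical classifiers compared in this study can be sketched in a few lines of Python (the probabilities and frequencies below are invented for illustration; this is not the authors' code):

```python
import numpy as np

# rows of `probs`: predicted probabilities for one unit over the k categories;
# `freqs`: observed sample frequencies of the k categories
def bayes_classifier(probs):
    return np.argmax(probs, axis=1)           # BC: most likely category

def max_difference_classifier(probs, freqs):
    return np.argmax(probs - freqs, axis=1)   # MDC: largest pr_i - fr_i

def max_ratio_classifier(probs, freqs):
    return np.argmax(probs / freqs, axis=1)   # MRC: largest pr_i / fr_i

probs = np.array([[0.50, 0.30, 0.20],
                  [0.10, 0.55, 0.35]])
freqs = np.array([0.45, 0.45, 0.10])  # the third category is rare

print(bayes_classifier(probs))                  # [0 1]: the prevalent categories
print(max_difference_classifier(probs, freqs))  # [2 2]: the rare category
print(max_ratio_classifier(probs, freqs))       # [2 2]: the rare category
```

The toy example mirrors the finding of the simulation study: when a category is rare, BC never selects it, while MDC and MRC can, because they reward probabilities that exceed the category's baseline frequency.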


This study originates from a real classification problem related to the prediction of the result of a soccer match from the home-team side (Golia & Carpita, 2020), where the target variable admits three possible categories with a natural order. However, in order to consider a more general framework, the variable to be predicted in the present study is nominal. To simulate the probability distribution of this nominal variable, the trivariate Dirichlet random variable (r.v.) was used. This r.v. is determined by three parameters, α<sub>*A*</sub>, α<sub>*B*</sub> and α<sub>*C*</sub>. Table 1 reports the values chosen in the simulation study described below, together with the mean and the skewness of the marginals. The chosen sample size is 1200, which implies that in the balanced case each category gets around 400 units. This sample size is large because the aim is to investigate the performance of the analyzed categorical classifiers in a less problematic, big-sample framework. For each of these 1200 units, the probability distribution of the target variable is simulated from the trivariate Dirichlet r.v. This set of three probabilities was used in two ways: first, a realization of the random variable was drawn, representing the actual (observed) value of the target variable; second, the same set of probabilities was taken as the output of a probabilistic classifier for the target variable. Then BC, MDC and MRC were applied, the predicted classifications for the 1200 units were obtained, and the performance indicators described above were calculated. This scheme was repeated 1000 times; the mean values of the indicators, with standard deviations in parentheses, are reported in Table 2. When all three categories are equally represented in the population, as in condition C1, the three categorical classifiers perform in the same way. When one category is rare, as in conditions C2 and C4, BC is not able to recognize it, whereas MDC and MRC have a certain ability to do so and, in general, Sen and Spe are more balanced when MDC and MRC are used. Moreover, as expected, BC performs better in terms of OvAc, but MDC and MRC have higher MAF1. Concluding, this first simulation study reveals that, in a multi-class setting giving equal importance to all classes (i.e. when different types of misclassification do not involve different costs), both MDC and MRC are preferable to BC.

Table 1. *Parameters of the Dirichlet r.v. and mean and skewness of the marginals*

| Condition | α<sub>*A*</sub> | α<sub>*B*</sub> | α<sub>*C*</sub> | M. *X<sub>A</sub>* | M. *X<sub>B</sub>* | M. *X<sub>C</sub>* | Sk. *X<sub>A</sub>* | Sk. *X<sub>B</sub>* | Sk. *X<sub>C</sub>* |
|---|---|---|---|---|---|---|---|---|---|
| C1 | 10 | 10 | 10 | 0.333 | 0.333 | 0.333 | 0.246 | 0.246 | 0.246 |
| C2 | 2 | 5 | 10 | 0.118 | 0.294 | 0.588 | 1.060 | 0.404 | -0.160 |
| C3 | 5 | 5 | 10 | 0.250 | 0.250 | 0.500 | 0.481 | 0.481 | 0.000 |
| C4 | 2 | 10 | 10 | 0.091 | 0.455 | 0.455 | 1.137 | 0.073 | 0.073 |

Table 2. *Mean values of the indicators, with standard deviation in parenthesis*

| | | *Sen<sub>A</sub>* | *Sen<sub>B</sub>* | *Sen<sub>C</sub>* | MDBSen | *Spe<sub>A</sub>* | *Spe<sub>B</sub>* | *Spe<sub>C</sub>* | MDBSpe | OvAc | MAF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | BC | 0.425 (0.03) | 0.424 (0.03) | 0.425 (0.03) | 0.042 (0.02) | 0.712 (0.02) | 0.713 (0.02) | 0.712 (0.02) | 0.030 (0.02) | 0.425 (0.01) | 0.424 (0.01) |
| | MDC | 0.423 (0.06) | 0.424 (0.06) | 0.424 (0.06) | 0.120 (0.06) | 0.712 (0.05) | 0.711 (0.05) | 0.712 (0.05) | 0.103 (0.05) | 0.422 (0.01) | 0.421 (0.01) |
| | MRC | 0.423 (0.07) | 0.424 (0.06) | 0.424 (0.06) | 0.131 (0.07) | 0.712 (0.05) | 0.711 (0.05) | 0.712 (0.05) | 0.112 (0.06) | 0.421 (0.01) | 0.420 (0.02) |
| C2 | BC | 0.015 (0.01) | 0.150 (0.02) | 0.939 (0.01) | 0.924 (0.01) | 0.997 (0.00) | 0.936 (0.01) | 0.139 (0.02) | 0.858 (0.02) | 0.598 (0.01) | 0.331 (0.02) |
| | MDC | 0.444 (0.06) | 0.489 (0.05) | 0.483 (0.05) | 0.109 (0.06) | 0.789 (0.03) | 0.703 (0.04) | 0.699 (0.04) | 0.122 (0.05) | 0.479 (0.02) | 0.434 (0.02) |
| | MRC | 0.572 (0.07) | 0.458 (0.05) | 0.403 (0.05) | 0.189 (0.08) | 0.700 (0.04) | 0.720 (0.04) | 0.759 (0.04) | 0.100 (0.05) | 0.437 (0.03) | 0.413 (0.02) |
| C3 | BC | 0.135 (0.02) | 0.136 (0.02) | 0.895 (0.01) | 0.770 (0.02) | 0.942 (0.01) | 0.942 (0.01) | 0.205 (0.02) | 0.742 (0.02) | 0.514 (0.02) | 0.359 (0.02) |
| | MDC | 0.443 (0.05) | 0.445 (0.05) | 0.459 (0.05) | 0.107 (0.06) | 0.734 (0.04) | 0.732 (0.04) | 0.702 (0.05) | 0.090 (0.05) | 0.449 (0.02) | 0.436 (0.02) |
| | MRC | 0.470 (0.06) | 0.472 (0.06) | 0.405 (0.05) | 0.134 (0.07) | 0.711 (0.05) | 0.710 (0.05) | 0.744 (0.04) | 0.101 (0.05) | 0.436 (0.02) | 0.429 (0.02) |
| C4 | BC | 0.004 (0.01) | 0.586 (0.02) | 0.588 (0.02) | 0.595 (0.02) | 0.999 (0.00) | 0.574 (0.02) | 0.573 (0.02) | 0.438 (0.02) | 0.534 (0.01) | 0.320 (0.09) |
| | MDC | 0.409 (0.06) | 0.476 (0.05) | 0.482 (0.06) | 0.131 (0.06) | 0.806 (0.03) | 0.681 (0.05) | 0.675 (0.05) | 0.166 (0.05) | 0.471 (0.02) | 0.422 (0.02) |
| | MRC | 0.590 (0.07) | 0.401 (0.05) | 0.407 (0.05) | 0.224 (0.09) | 0.682 (0.05) | 0.736 (0.05) | 0.731 (0.04) | 0.107 (0.06) | 0.419 (0.03) | 0.393 (0.02) |

#### References

CRAMER, J. S. 1999. Predictive performance of the binary logit model in unbalanced samples. *The Statistician*, 48(1), 85–94.

GOLIA, S., & CARPITA, M. 2020. Comparing classifiers for ordinal variables. In A. Pollice, N. Salvati and F. Schirripa Spagnolo (Eds.), *Book of short papers SIS 2020*, 1160–1165.

JAMES, G., WITTEN, D., HASTIE, T., & TIBSHIRANI, R. 2013. *An introduction to statistical learning with applications in R*. New York: Springer.

RASCHKA, S., & MIRJALILI, V. 2019. *Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2*. Birmingham: Packt Publishing.

## MODEL-BASED CLUSTERING FOR ESTIMATING CETACEANS SITE-FIDELITY AND ABUNDANCE


Gianmarco Caruso<sup>1</sup>, Greta Panunzi<sup>1</sup>, Marco Mingione<sup>1</sup>, Pierfrancesco Alaimo di Loro<sup>1</sup>, Stefano Moro<sup>2</sup>, Edoardo Bompiani<sup>1</sup>, Caterina Lanfredi<sup>1</sup>, Daniela Silvia Pace<sup>2</sup>, Luca Tardella<sup>1</sup> and Giovanna Jona Lasinio<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, Sapienza University of Rome, Italy (e-mail: gianmarco.caruso@uniroma1.it, greta.panunzi@gmail.com, marco.mingione@uniroma1.it, pierfrancesco.alaimodiloro@uniroma1.it, edoardo.bompiani@uniroma1.it, lanfredicaterina@gmail.com, luca.tardella@uniroma1.it, giovanna.jonalasinio@uniroma1.it)

<sup>2</sup> Department of Environmental Biology, Sapienza University of Rome, Italy (e-mail: stefano.moro@uniroma1.it, danielasilvia.pace@uniroma1.it)

ABSTRACT: Estimating the size of animal populations in a given area is of particular interest in ecological studies on wildlife conservation, and this task is commonly handled via capture-recapture methods. A recent work (Pace *et al.*, 2021) adopts a two-step approach for identifying groups of animals with similar site-fidelity patterns according to specific metrics, and for estimating the abundance of bottlenose dolphins between 2017 and 2020 at the Tiber Estuary (Mediterranean Sea, Rome, Italy). In this work, we aim at simultaneously classifying individuals and estimating their abundance in the study area, by introducing finite mixtures within the *Open-Population Jolly-Seber* framework. In capture-recapture analyses, finite mixture models allow us to account for group heterogeneity and to reduce the bias in the final abundance estimates (Pledger, 2005).

KEYWORDS: Capture-recapture analysis, Wildlife population, Finite mixture models, Unsupervised classification, Applied statistics

### 1 Introduction

Capture-recapture methods are widely employed in estimating the size of wildlife populations, whose units are subject to multiple captures across several occasions. We will use the terms *capture* and *recapture* in accordance with the classical literature (Seber, 1986), but animals are not necessarily *captured*: nowadays, non-invasive ways of keeping track of a wild animal over time are successfully employed. In that spirit, for example, Pace *et al.* (2021) employ photo-identification to identify bottlenose dolphins from the natural markings present on their bodies. The same paper illustrates an interesting characteristic of these animals: marked individuals may show different levels of *site-fidelity*. This introduces the need to define a statistical protocol, or a specific model, accounting for the different *capture* probabilities among the categories. Here, we propose a method that allows us both to differentiate between *resident* and *non-resident* individuals and to estimate the population abundance in a common modelling framework. This improves on the original multistep protocols (see Pace *et al.*, 2021) by guaranteeing correct uncertainty propagation across the two estimation processes.

#### 2 The model


We consider the formalization of the Jolly-Seber model by Schwarz & Arnason (1996), which assumes the existence of a *super-population*, representing the set of individuals potentially available in the study area between the first and the last sampling period. In Jolly-Seber-type models, captures are assumed to be independent across individuals and over time. Moreover, the population is assumed to be *open*, meaning that individuals can either enter (e.g. by birth or immigration) or exit (e.g. by death or emigration) the population during the study. Notably, we assume that individuals leaving the population cannot re-enter it. Here, we adopt the Bayesian framework illustrated by Royle & Dorazio (2012), where the super-population size (*N*super) is given a discrete uniform distribution on {0,...,*M*}, with *M* sufficiently large. The hyperparameter *M* can be seen as an upper bound for the super-population size, and it implies the use of an augmented dataset of *M* individuals. Moreover, we consider a sampling scheme divided into *T* periods and, for each time *t* = 1,...,*T*, a number *Jt* of capture sessions. Thus, the augmented data matrix Y = [*yit*] has *M* rows and *T* columns and contains the capture frequency of each individual in each period. If *D* is the number of individuals that have been observed at least once, the matrix contains *M* − *D* rows of zeros: among them, *N*super − *D* rows correspond to individuals which belong to the super-population but have never been captured, while *M* − *N*super correspond to *pseudo*-individuals which do not belong to the super-population.

Recruitment and survival process Population dynamics, consisting of recruitment and survival, can be expressed through the following latent binary variables:

• *rit* which is equal to 1 iff individual *i* is recruitable at time *t*;

• *zit* which is equal to 1 iff individual *i* belongs to the population at time *t*.

Let φ*t* be the probability of remaining in the population at time *t*, given presence in the population at time *t* − 1, and let ρ*t* be the probability of belonging to the super-population *and* being recruited into the population at time *t*. Without loss of generality, in this context we assume these two parameters to be constant over time, i.e. φ*t* = φ and ρ*t* = ρ. Following Royle & Dorazio (2012), it can be proved that, for *i* = 1,...,*M*, *ri*1 = 1 and *zi*1 ∼ Bern(ρ), and



$$r_{it} = \min\{r_{i,t-1},\, 1 - z_{i,t-1}\}, \qquad t > 1$$

$$z_{it} \mid z_{i,t-1}, r_{it} \sim \mathrm{Bern}(\phi \cdot z_{i,t-1} + \rho \cdot r_{it}), \qquad t > 1.$$

Notice that when an individual becomes part of the population, it cannot be recruitable any more: for *t* > 1, *rit* and *zit* cannot simultaneously be equal to 1.
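The recursion above can be checked by direct simulation; the following sketch uses illustrative values of φ, ρ, *M* and *T* (not estimates from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 500, 4          # augmented individuals and sampling periods (illustrative)
phi, rho = 0.8, 0.3    # survival and recruitment probabilities (illustrative)

r = np.zeros((M, T), dtype=int)  # r[i, t] = 1 iff individual i is recruitable at t
z = np.zeros((M, T), dtype=int)  # z[i, t] = 1 iff individual i is in the population at t

r[:, 0] = 1                                   # r_i1 = 1
z[:, 0] = rng.binomial(1, rho, size=M)        # z_i1 ~ Bern(rho)
for t in range(1, T):
    # r_it = min(r_{i,t-1}, 1 - z_{i,t-1}): once in the population, never recruitable again
    r[:, t] = np.minimum(r[:, t - 1], 1 - z[:, t - 1])
    # z_it ~ Bern(phi * z_{i,t-1} + rho * r_it)
    z[:, t] = rng.binomial(1, phi * z[:, t - 1] + rho * r[:, t])

# An individual present at time t-1 is never recruitable at time t:
assert not ((z[:, :-1] == 1) & (r[:, 1:] == 1)).any()
```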

Detection process In this work, we consider a finite mixture model in order to capture the different detection propensities of different groups of individuals. The generic element of the augmented data matrix is such that

$$y_{it} \mid z_{it}, c_i = g \sim \mathrm{Binom}(J_t,\, p_g \cdot z_{it}), \qquad g = 1, \ldots, G,$$

with *pg* being the capture probability of individuals in group *g* and *P*(*ci* = *g*) = *wg* being the probability that the *i*-th individual belongs to the *g*-th mixture component. Notice that *yit* = 0 almost surely when *zit* = 0, so that the previous model corresponds to a finite mixture of zero-inflated binomial distributions.

Abundance estimation The population size at time *t* and the super-population size can be estimated through the latent variables *z*, namely

$$N_t = \sum_{i=1}^{M} z_{it}, \qquad N_{\mathrm{super}} = \sum_{i=1}^{M} \mathbf{1}\Big\{\sum_{t=1}^{T} z_{it} > 0\Big\}.$$
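Given any realization of the latent occupancy matrix *z*, both estimators are one-line reductions; the matrix below is a toy stand-in, not output of the actual model fit.

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 500, 4                       # illustrative sizes
# Toy latent-occupancy matrix standing in for the z's of the model:
z = rng.binomial(1, 0.3, size=(M, T))

N_t = z.sum(axis=0)                 # N_t: population size at each time t
N_super = int((z.sum(axis=1) > 0).sum())  # individuals present at least once
```

In an MCMC fit, these reductions would be applied to each posterior draw of *z*, yielding full posterior distributions for *N_t* and *N*super.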

#### 3 Illustration

A graphical visualization of the main components of the model is provided by the DAG in Figure 1. The model is used to estimate the abundance of bottlenose dolphins between 2017 and 2020 at the Tiber Estuary (Mediterranean Sea, Rome, Italy) and to identify groups of animals with different capture propensities: individuals with a low detection probability are considered *non-resident*, while the others are considered *resident*. The model is implemented in JAGS (Plummer, 2003), and the results will be shown in detail during the conference.

Figure 1. *Bayesian DAG with the main components of the model. White rhombi represent deterministic variables. White circles represent latent variables and parameters. Grey circles represent observable variables.*

Acknowledgements This work was partially supported by project "Joint Cetacean Database and Mapping (JCDM) in Italian waters: a tool for knowledge and conservation", Sapienza University of Rome (nr. RM1201729F23D51B).

#### References

PACE, DANIELA SILVIA, *et al.* 2021. Capitoline Dolphins: Residency Patterns and Abundance Estimate of Tursiops truncatus at the Tiber River Estuary (Mediterranean Sea). *Biology*, 10(4), 275.

PLEDGER, SHIRLEY. 2005. The performance of mixture models in heterogeneous closed population capture–recapture. *Biometrics*, 61(3), 868–873.

PLUMMER, MARTYN. 2003. *JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling*.

ROYLE, J. ANDREW, & DORAZIO, ROBERT M. 2012. Parameter-expanded data augmentation for Bayesian analysis of capture–recapture models. *Journal of Ornithology*, 152(2), 521–537.

SCHWARZ, CARL JAMES, & ARNASON, A. NEIL. 1996. A general methodology for the analysis of capture-recapture experiments in open populations. *Biometrics*, 860–873.

SEBER, GEORGE ARTHUR FREDERICK. 1986. A Review of Estimating Animal Abundance. *Biometrics*, 42(2), 267–292.


## MODEL-BASED CLUSTERING WITH PARSIMONIOUS COVARIANCE STRUCTURE


Carlo Cavicchia<sup>1</sup>, Maurizio Vichi<sup>2</sup> and Giorgia Zaccaria<sup>2</sup>

<sup>1</sup> Econometric Institute, Erasmus University Rotterdam, Rotterdam, The Netherlands, (e-mail: cavicchia@ese.eur.nl)

<sup>2</sup> Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy, (e-mail: maurizio.vichi@uniroma1.it, giorgia.zaccaria@uniroma1.it)

ABSTRACT: Complex multidimensional concepts are often explained by a tree-shaped structure obtained by considering nested partitions of variables, where each variable group is associated with a specific concept. Recalling that relations among variables can be detected through their covariance matrix, this paper introduces a covariance structure that reconstructs hierarchical relationships among variables by highlighting three features of the variable groups. We finally present an application of this covariance structure to model-based clustering.

KEYWORDS: Gaussian mixture model, hierarchical latent concepts, partition of variables

### 1 Introduction

The main goal of Factor Analysis (FA, Spearman, 1904) is to reconstruct the covariance matrix of variables by computing a reduced number of factors while preserving as much information as possible. However, since FA is unable to reconstruct hierarchical relations, a model with a hierarchical form is required. Among several models based on the sequential application of FA addressing this problem, Cavicchia *et al.* (2020) proposed a model to reconstruct a nonnegative correlation matrix via an ultrametric one. The model results in a simultaneous procedure which is able both to detect the best variable partition in a reduced number of groups and to build the hierarchy upon them. The latter model proves particularly suitable for complex hierarchical multidimensional concepts due to the one-to-one relation between a hierarchy of concepts and an ultrametric correlation matrix (Dellacherie *et al.*, 2014). Our paper overcomes the limitations of the model presented by Cavicchia *et al.* (2020) by extending the same idea to a general covariance matrix, and applies this special covariance structure in the Gaussian Mixture Models (GMMs) framework.

Since GMMs can easily fall into the so-called "curse of dimensionality" because of the large number of parameters dedicated to covariance structures, several different parametrizations are present in the specialized literature. One of the most used is the eigen-decomposition (Banfield & Raftery, 1993) of the form Σ = λDAD′, where λ is a scalar determining the cluster volume, A is a diagonal matrix controlling the cluster shape, and D is an orthogonal matrix which specifies the cluster orientation. Another parameterization is proper of the mixture of factor analyzers (Ghahramani & Hinton, 1997) and assumes a cluster covariance structure of the form Σ = ΛΛ′ + Ψ, where *p* is the number of variables, *Q* is the number of factors, Λ is the *p*×*Q* factor loading matrix and Ψ is the *p*-dimensional diagonal covariance matrix of the error. Our proposal aims to implement a new parameterization of the covariance matrix, via a hierarchical covariance matrix for each cluster, that can be extremely parsimonious.
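Both classical parameterizations are easy to materialize numerically; the sketch below uses arbitrary illustrative values for λ, A, D, Λ and Ψ, and only checks that each construction yields a valid (symmetric positive-definite) covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
p, Q = 4, 2

# Eigen-decomposition parameterization: Sigma = lambda * D A D'
lam = 2.0                                         # cluster volume
A = np.diag([2.0, 1.0, 0.5, 0.5])                 # cluster shape (diagonal)
D, _ = np.linalg.qr(rng.standard_normal((p, p)))  # random orthogonal orientation
Sigma_eig = lam * D @ A @ D.T

# Factor-analyzers parameterization: Sigma = Lambda Lambda' + Psi
Lambda = rng.standard_normal((p, Q))              # p x Q factor loadings
Psi = np.diag(rng.uniform(0.1, 0.5, p))           # diagonal error covariance
Sigma_fa = Lambda @ Lambda.T + Psi

# Both are symmetric positive-definite covariance matrices:
for S in (Sigma_eig, Sigma_fa):
    assert np.allclose(S, S.T)
    assert np.all(np.linalg.eigvalsh(S) > 0)
```

Note the parameter counts driving the "curse of dimensionality" argument: a full covariance needs *p*(*p*+1)/2 parameters per cluster, the factor-analyzer form *pQ* + *p*, and constrained eigen-decompositions as few as one.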

#### 2 Features of the covariance structure


Multidimensional phenomena are often composed of nested dimensions characterized by distinct levels of abstraction. Each dimension is uniquely connected to a group of variables and represents a specific concept. Merging two dimensions gives rise to a broader dimension, up to the general one, such that the hierarchical structure underlying a multidimensional phenomenon is detected. In order to model the hierarchical relationships among the dimensions, we introduce three main features of a variable group: the variance of the variable group; the covariance within the variable group, which measures the internal concordance among variables belonging to the same group; and the covariance between concepts associated with the variable groups. These features are constrained to be "ordered", such that the variance of the groups is greater (in absolute value) than the covariance within or between groups, whereas the covariance within groups must in turn be larger than the covariance between groups. These constraints allow us to define a hierarchical structure of concepts, from the most concordant to the most discordant. The last aggregations in the hierarchy may occur between: (i) concordant concepts defining a general one; (ii) discordant concepts with negative between-group covariance; (iii) uncorrelated concepts.

Given the number of specific dimensions *Q* which underlie the multidimensional phenomenon, each level *q* = *Q*,...,1 of the hierarchy is characterized by: (i) the *p*×*q* membership matrix V<sub>*q*</sub>, which pinpoints the membership of each variable to a group; (ii) the diagonal matrix S<sup>*V*</sup><sub>*q*</sub> of order *q*, whose main diagonal represents the variance of each group; (iii) the diagonal matrix S<sup>*W*</sup><sub>*q*</sub> of order *q*, whose main diagonal represents the covariance within each group; (iv) the ultrametric matrix S<sup>*B*</sup><sub>*q*</sub> of order *q*, whose diagonal entries are set to zero and whose off-diagonal ones represent the hierarchical relationships between pairs of concepts. Given V<sub>*q*</sub>, the estimates of the matrices S<sup>*V*</sup><sub>*q*</sub>, S<sup>*W*</sup><sub>*q*</sub> and S<sup>*B*</sup><sub>*q*</sub> are

$$
\widehat{\mathbf{S}}_q^V = (\widehat{\mathbf{V}}_q'\widehat{\mathbf{V}}_q)^{-1}\widehat{\mathbf{V}}_q'\,\mathrm{diag}(\mathbf{S})\,\widehat{\mathbf{V}}_q, \tag{1}
$$

$$
\widehat{\mathbf{S}}_q^W = \big[(\widehat{\mathbf{V}}_q'\widehat{\mathbf{V}}_q)^2 - \widehat{\mathbf{V}}_q'\widehat{\mathbf{V}}_q\big]^{-1}\mathrm{diag}\Big[\widehat{\mathbf{V}}_q'\big(\mathbf{S} - \mathrm{diag}(\widehat{\mathbf{V}}_q\widehat{\mathbf{S}}_q^V\widehat{\mathbf{V}}_q')\big)\widehat{\mathbf{V}}_q\Big], \tag{2}
$$

$$
\widehat{\mathbf{S}}_q^B = \widehat{\mathbf{V}}_q^{+}\,\mathbf{S}\,(\widehat{\mathbf{V}}_q')^{+}, \tag{3}
$$


respectively, where S represents the *p*× *p* observed covariance matrix, I*<sup>p</sup>* is the identity matrix of order *p* and diag(·) denotes the diagonal matrix whose diagonal elements are those of a parenthesized one.
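Estimators (1)–(3) can be checked on a toy example; here the covariance S and the membership matrix V (two groups of two variables each) are illustrative values, not data from the paper.

```python
import numpy as np

# Toy covariance with two variable groups, {1,2} and {3,4} (illustrative)
S = np.array([[1.0, 0.5, 0.1, 0.1],
              [0.5, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.4],
              [0.1, 0.1, 0.4, 1.0]])
V = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # p x q membership

VtV = V.T @ V
# Eq. (1): average variance of each group
S_V = np.linalg.inv(VtV) @ V.T @ np.diag(np.diag(S)) @ V
# Eq. (2): average within-group covariance
S_W = np.linalg.inv(VtV @ VtV - VtV) @ np.diag(np.diag(
        V.T @ (S - np.diag(np.diag(V @ S_V @ V.T))) @ V))
# Eq. (3): between-group covariances (off-diagonal entries)
S_B = np.linalg.pinv(V) @ S @ np.linalg.pinv(V.T)
```

On this toy example the group variances (1 and 1) exceed the within-group covariances (0.5 and 0.4), which in turn exceed the between-group covariance (0.1), matching the ordering constraints introduced above.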

We implement the parameterization of the covariance matrix based on the aforementioned quantities into the GMMs in order to simultaneously detect homogeneous clusters of units and a hierarchical definition of a multidimensional phenomenon.

#### 3 Application

Our proposal is applied to the "Human Development Index" dataset\*, which consists of 167 countries and 9 variables. The optimal model in terms of the Bayesian Information Criterion (BIC, Schwarz, 1978) considers 3 clusters of countries (Fig. 1) and 3 groups of variables. It is worth highlighting that the model requires 71 parameters to be estimated, of which only 14 for each covariance structure. The first cluster is characterized by countries with high income, GDP per capita and very low child mortality. The second cluster is constituted by the poorest countries, with low life expectancy and income, whereas the third is composed of countries with median performances. Each cluster is characterized by a different hierarchy of the latent concepts associated with the three groups of variables. The group made up of the economic variables (income, GDP, exports and imports) in Cluster 1 is the one with the highest internal variance, whereas the same group in Cluster 3 is merged with the group comprising child mortality and fertility and has the highest within-group covariance. Although the latent concepts and their hierarchical relationships are specific to each cluster, all the hierarchies end with a negative between-group covariance, highlighting the absence of a unique concordant general concept.

<sup>\*</sup>https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data

Figure 1: Clusters of countries: Cluster 1 (red), Cluster 2 (yellow) and Cluster 3 (blue)

### 4 Conclusions


This paper proposes a parsimonious GMM which aims at modeling multidimensional phenomena, usually defined by hierarchically nested latent concepts. The application of the method to real data shows its potential.

### References


GHAHRAMANI, Z., & HINTON, G.E. 1997. The EM algorithm for mixtures of factor analyzers. Technical report CRG-TR-96-1, University of Toronto, Toronto.


### CLUSTERING INCOME DATA BASED ON SHARE DENSITIES


Francesca Condino <sup>1</sup>

<sup>1</sup> Department of Economics, Statistics and Finance "Giovanni Anania", University of Calabria, Italy, (e-mail: francesca.condino@unical.it )

ABSTRACT: Different measures generally used to analyse income inequality refer to the context of information theory and the concept of entropy. In particular, Theil's *T* index can be interpreted in terms of entropy and related to the well-known Lorenz curve. Indeed, the Lorenz curve and its derivative, the so-called share density, provide different information regarding inequality. Starting from this evidence, the aim of this work is to compare the income inequality of different subgroups by using a proper dissimilarity measure between parametric share densities, and to use this information for clustering them. Preliminary results regarding data from the Survey on Households Income and Wealth (SHIW) by the Bank of Italy are shown.

KEYWORDS: tail inequality, dissimilarity measure, income concentration.

#### 1 Lorenz curve and share density

In the economic literature, the Lorenz curve is a well-known and widely used tool for analysing income inequality. Since its proposal in 1905 (Lorenz, 1905), a great deal of investigation has been carried out by statisticians and economists, generating a fertile field of study. Conversely, the Lorenz density is rarely explicitly mentioned. One of the few references to the Lorenz density can be found in Farris, 2010, where this curve is referred to as the share density. Afterwards, the concept of Lorenz density is resumed in Zizler, 2014, in Kämpke & Radermacher, 2015, and in Shao, 2021. Actually, it is known that each Lorenz curve *L*(*u*) (*u* ∈ [0,1]) can be viewed as a distribution function on the unit interval; therefore it is possible to consider its derivative with respect to *u*, *l*(*u*) = *L*′(*u*), as a density function. It is worth noting that this density function furnishes different information regarding income inequality, as suggested by Rohde, 2008, who has shown that the two well-known Theil inequality indexes, *L* and *T*, can be directly obtained from *l*(*u*). In particular, Theil's *T* index coincides with the Shannon entropy, changed in sign, of *l*(*u*). In this perspective, it is natural to compare different groups of income earners in terms of inequality by quantifying the dissimilarity between share densities through a proper measure.
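The identity *T* = −*H*(*l*) can be checked numerically on a distribution with a known Theil index. The sketch below (my illustration, not from the paper) uses a lognormal income model, for which *T* = σ²/2 in closed form and the share density is *l*(*u*) = *x*(*u*)/μ with *x*(*u*) the quantile function:

```python
import math
from statistics import NormalDist

# Share density of a lognormal income distribution with log-scale sigma:
# l(u) = quantile(u) / mean = exp(sigma * z_u - sigma**2 / 2), z_u = Phi^{-1}(u).
sigma = 0.5
phi_inv = NormalDist().inv_cdf

def share_density(u: float) -> float:
    return math.exp(sigma * phi_inv(u) - sigma**2 / 2)

# -H(l) = integral of l(u) * log l(u) du, midpoint rule on (0, 1).
n = 100_000
minus_H = 0.0
for i in range(n):
    l = share_density((i + 0.5) / n)
    minus_H += l * math.log(l)
minus_H /= n

theil_T = sigma**2 / 2  # known closed form for the lognormal
print(minus_H, theil_T)  # the two values agree to roughly 1e-3
```

The midpoint rule is crude near *u* = 1, where the share density is unbounded, but the integral converges and the truncation error is negligible at this grid size.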

#### 2 Jensen-Shannon divergence between share densities

The Jensen-Shannon divergence (JSD), also called total divergence to the average, is a well-known measure of dissimilarity among probability distributions. It can be obtained starting from the Kullback–Leibler divergence, considering *K* densities *f*<sub>1</sub>,..., *f<sub>K</sub>* and their mixture *m* = ∑<sub>*k*=1</sub><sup>*K*</sup> π<sub>*k*</sub> · *f<sub>k</sub>* with π<sub>*k*</sub> ∈ [0,1], as follows:

$$D\_{JS}(f\_1, \dots, f\_K) = \sum\_{k=1}^K \pi\_k \cdot D\_{KL}(f\_k || m) \tag{1}$$

where


$$D\_{KL}\left(f\_k||m\right) = \int\_X f\_k\left(\mathbf{x}\right) \log \frac{f\_k\left(\mathbf{x}\right)}{m\left(\mathbf{x}\right)} d\mathbf{x}.\tag{2}$$

Alternatively, expression (1) can be rewritten in terms of Shannon entropy *H*, as follows:

$$D\_{JS}(f\_1, \ldots, f\_K) = H(m) - \sum\_{k=1}^{K} \pi\_k H(f\_k) \tag{3}$$

where *H*(*f<sub>k</sub>*) = −∫<sub>*X*</sub> *f<sub>k</sub>*(*x*) log *f<sub>k</sub>*(*x*) *dx*. It is easy to prove that *D<sub>JS</sub>*(*f*<sub>1</sub>,..., *f<sub>K</sub>*) ≥ 0 and that equality holds when *f*<sub>1</sub> = *f*<sub>2</sub> = ... = *f<sub>K</sub>*. In addition, for two densities it is symmetric, i.e. *D<sub>JS</sub>*(*f*<sub>1</sub>|| *f*<sub>2</sub>) = *D<sub>JS</sub>*(*f*<sub>2</sub>|| *f*<sub>1</sub>), and hence it is a bona fide measure of dissimilarity between *f*<sub>1</sub>(·) and *f*<sub>2</sub>(·). Now, with the aim of analysing the existing differences among various groups of income earners, this dissimilarity measure will be considered in connection with the Lorenz density. Let *L*<sub>1</sub>,...,*L<sub>K</sub>* be the Lorenz curves corresponding to *K* different groups of income earners and *l*<sub>1</sub>,...,*l<sub>K</sub>* the corresponding derivatives with respect to *u*. Hence, the JSD among the *l<sub>k</sub>* densities (*k* = 1,...,*K*) is given by:

$$D\_{JS}(l\_1, \ldots, l\_K) = H(l\_m) - \sum\_{k=1}^{K} \pi\_k H(l\_k) \tag{4}$$

where *l<sub>m</sub>* = ∑<sub>*k*=1</sub><sup>*K*</sup> π<sub>*k*</sub>*l<sub>k</sub>*(*u*). To define the *l<sub>m</sub>* mixture density, the decomposition of the Lorenz curve proposed by Bishop *et al.*, 2003 is considered, so that π<sub>*k*</sub> represents the income share of the *k*-th group. From (4), it is evident that the JSD takes into account, for each share density, the whole function and its entropy, so that it will be influenced by the existing differences in tail inequality among groups, as well as in concentration around the center of the income distribution. Therefore, clustering procedures based on the JSD will exploit these discrepancies.
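The equivalence between definitions (1) and (3), together with the nonnegativity and degeneracy properties stated above, can be verified on a small discrete example (arbitrary toy distributions of my own, not the paper's income data; sums replace integrals):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q)."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def jsd(dists, weights):
    """Jensen-Shannon divergence of K distributions, definition (1)."""
    m = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(len(dists[0]))]
    return sum(w * kl(d, m) for w, d in zip(weights, dists))

f1, f2, f3 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]
pi = [0.5, 0.3, 0.2]

d_mix = jsd([f1, f2, f3], pi)
m = [sum(w * d[i] for w, d in zip(pi, [f1, f2, f3])) for i in range(3)]
d_ent = entropy(m) - sum(w * entropy(d) for w, d in zip(pi, [f1, f2, f3]))  # (3)

print(abs(d_mix - d_ent) < 1e-12)  # definitions (1) and (3) coincide
print(jsd([f1, f1], [0.5, 0.5]))   # 0.0 when all densities are equal
```

The same computation applied to discretized share densities *l<sub>k</sub>* yields the entries of the dissimilarity matrix used for clustering in the next section.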

#### 3 Clustering income data: an application

In this section, data from the Survey on Households Income and Wealth (SHIW), carried out by the Bank of Italy in 2016, are considered. To take into account the composition of households, equivalent incomes are obtained using the OECD-modified equivalence scale. The Dagum distribution (Dagum, 1977) is used to model income and to obtain the expressions of the Lorenz curves, *L<sub>k</sub>*(*u*), and share densities, *l<sub>k</sub>*(*u*), for each region (*k* = 1,...,20). For this model, a closed-form expression for *H*(*l<sub>k</sub>*) is obtained. This result (not reported for space reasons) agrees, up to the sign, with that reported in Chotikapanich *et al.*, 2018 for Theil's *T* index, confirming the relation between *T* and *H*(*l*). Table 1 shows,

for each Italian region, the estimates of the average income ( *µ*ˆ<sub>*k*</sub>, in tens of thousands of euros), the entropy (*H*ˆ(*l<sub>k</sub>*)) and the Gini index (*G*ˆ<sub>*k*</sub>). Furthermore, the cluster membership is reported. In order to obtain this partition, the elements *D<sub>JS</sub>*(*l<sub>i</sub>*,*l<sub>j</sub>*) (*i*, *j* = 1,...,20) of the dissimilarity matrix *D* are computed from expression (4). Here, a numerical integration method is used to compute *H*(*l<sub>m</sub>*). Then, a hierarchical clustering based on the matrix *D* has been conducted, using the complete-linkage agglomeration method and a final number of groups equal to 3. As we can see from the results, the clusters seem clearly characterized, with regions having a generally lower concentration of income belonging to clusters 1 and 3, and regions with higher concentration levels included in cluster 2. Furthermore, by analysing the obtained results in more depth, it appears that this method allows us to gather together regions with a similar behaviour in tail inequality (results not reported), as well as with similar values of the Theil and Gini indexes.

Table 1. *Fitted means, entropy, Gini index and membership cluster for Italian regions*

| Regions | *µ*ˆ<sub>*k*</sub> | −*H*ˆ(*l<sub>k</sub>*) | *G*ˆ<sub>*k*</sub> | Cluster |
|---|---|---|---|---|
| Piedmont | 2.0797 | 0.1277 | 0.2751 | 1 |
| Aosta Valley | 2.3134 | 0.1195 | 0.2663 | 1 |
| Veneto | 1.9739 | 0.1244 | 0.2706 | 1 |
| Friuli | 2.3140 | 0.1226 | 0.2693 | 1 |
| Emilia Romagna | 2.3971 | 0.1141 | 0.2602 | 1 |
| Tuscany | 2.3648 | 0.1131 | 0.2588 | 1 |
| Abruzzo | 1.9542 | 0.1237 | 0.2717 | 1 |
| Calabria | 1.3472 | 0.1425 | 0.2910 | 1 |
| Sardinia | 1.5772 | 0.1361 | 0.2843 | 1 |
| Lombardy | 2.4798 | 0.1632 | 0.3064 | 2 |
| Molise | 1.7789 | 0.1873 | 0.3294 | 2 |
| Campania | 1.3461 | 0.1815 | 0.3252 | 2 |
| Apulia | 1.4558 | 0.1557 | 0.3026 | 2 |
| Basilicata | 1.5191 | 0.1850 | 0.3287 | 2 |
| Sicily | 1.4610 | 0.1633 | 0.3071 | 2 |
| Trentino | 2.2247 | 0.1008 | 0.2408 | 3 |
| Liguria | 2.2482 | 0.1356 | 0.2782 | 3 |
| Umbria | 1.9897 | 0.1024 | 0.2456 | 3 |
| Marche | 2.1809 | 0.1053 | 0.2475 | 3 |
| Lazio | 1.9972 | 0.1437 | 0.2883 | 3 |

#### References

BISHOP, J.A., CHOW, K.V., & ZEAGER, L.A. 2003. Decomposing Lorenz and Concentration Curves. *International Economic Review*, 44, 965–978.

CHOTIKAPANICH, D., GRIFFITHS, W.E., HAJARGASHT, G., KARUNARATHNE, W., & RAO, D.S.P. 2018. Using the GB2 Income Distribution. *Econometrics*, 6, 21.

DAGUM, C. 1977. A new model of personal income distribution: specification and estimation. *Économie Appliquée*, 30, 413–437.

FARRIS, F.A. 2010. The Gini index and measures of inequality. *American Mathematical Monthly*, 117(10), 851–864.

KÄMPKE, T., & RADERMACHER, F. 2015. *Income Modeling and Balancing: A Rigorous Treatment of Distribution Patterns*. Switzerland: Springer.

LORENZ, M.O. 1905. Methods of Measuring the Concentration of Wealth. *Publications of the American Statistical Association*, 9(70), 209–219.

ROHDE, N. 2008. Lorenz Curves and Generalised Entropy Inequality Measures. In: Chotikapanich, D. (ed.), *Modeling Income Distributions and Lorenz Curves. Economic Studies in Equality, Social Exclusion and Well-Being, vol 5*. New York: Springer.

SHAO, B. 2021. Decomposition of the Gini index by income source for aggregated data and its applications. *Computational Statistics*, Epub ahead of print, 1–25.

THEIL, H. 1967. *Economics and Information Theory*. Amsterdam: North-Holland.

ZIZLER, P. 2014. Gini indices and the moments of the share density function. *Applications of Mathematics*, 59, 167–175.


### GROUP-DEPENDENT FINITE MIXTURE MODEL


Paula Costa Fontichiari<sup>1</sup>, Miriam Giuliani<sup>1</sup>, Raffaele Argiento<sup>1</sup> and Lucia Paci<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, Università Cattolica del Sacro Cuore, (e-mail: paula.costafontichiari01@icatt.it, miriam.giuliani01@icatt.it, raffaele.argiento@unicatt.it, lucia.paci@unicatt.it)

ABSTRACT: We present a Bayesian nonparametric group-dependent mixture model for clustering. This is achieved by building a hierarchical structure, where the discreteness of the shared base measure is exploited to cluster the data both between and within groups. We study the properties of the group-dependent clustering structure based on the latent parameters of the model. Furthermore, we obtain the joint distribution of the clustering induced by the hierarchical mixture model and define the complete posterior characterization of interest. We construct a Gibbs sampler to perform Bayesian inference and assess performance on simulated and real data.

KEYWORDS: Bayesian analysis, clustering, Gibbs sampling, EPPF.

#### 1 Introduction

In several statistical settings there is the need to model data organized in groups, allowing for the sharing of information across them. In the Bayesian framework, this is achieved by hierarchical modeling, where the joint distribution of group-specific parameters accounts for such dependence. For instance, in Bayesian nonparametrics, the seminal work of Teh *et al.*, 2006 considered a mixture model within each group *j*, where the group-specific parameter is the mixing measure *P<sub>j</sub>* and whose joint law is defined by an extra layer of hierarchy, yielding the hierarchical Dirichlet process. This approach has been extended to the class of NRMIs (Regazzini *et al.*, 2003) by Camerlenghi *et al.*, 2019 and Argiento *et al.*, 2020. In the cited works, the mixing measure is infinite dimensional.

In this work, we propose a hierarchical model where the group-specific mixing distribution belongs to the class of almost surely finite dimensional distributions introduced by Argiento & Iorio, 2019. We assign the joint law of the group-specific parameters such that the random measures share the same support across groups. In this framework, it is possible to define a

group-dependent clustering as follows. First, a latent parameter θ<sub>*j*,*i*</sub> ∼ *P<sub>j</sub>* for individual *i* and group *j* is introduced. Second, since *P<sub>j</sub>* is almost surely discrete, ties within each group are expected, leading to a group-specific clustering. Finally, since the *P<sub>j</sub>*'s share the same support, we also expect ties between groups, providing a global clustering. We are able to derive the joint law of the group-specific clusterings as well as that of the global clustering. Such results allow us to build a posterior sampling strategy based on the Gibbs sampler.

#### 2 Model developments


Let *y<sub>ji</sub>* be the observed variable for group *j*, *j* = 1,...,*d*, and individual *i*, *i* = 1,...,*n<sub>j</sub>*. We assume that the data in each group *j* come from a mixture of *M* components, that is

$$y\_{j1}, \ldots, y\_{jn\_j} \mid w\_{jl}, \tau\_l, M \sim \sum\_{l=1}^{M} w\_{jl} \, f(y\_{ji} \mid \tau\_l), \tag{1}$$

where *f*(*y<sub>ji</sub>* | τ<sub>*l*</sub>) is called the kernel and is a parametric density over the sampling space, the *w<sub>jl</sub>* are the group-specific mixing weights and the τ<sub>*l*</sub> are the kernel parameters, which are shared across groups. We assign a prior distribution to the mixing weights by normalization, namely we define *w<sub>jl</sub>* = *S<sub>jl</sub>*/*T<sub>j</sub>*, where *T<sub>j</sub>* = ∑<sub>*l*=1</sub><sup>*M*</sup> *S<sub>jl</sub>*. Also, we assume a prior distribution on the number of components, i.e., *M* ∼ *q*(*m*). Conditionally on *M*, the *S<sub>jl</sub>* are independent positive random variables with distribution *h<sub>j</sub>*(*s*), while τ<sub>*l*</sub> follows a prior distribution over Θ, the parameter space of the kernel, which we denote by *p*<sub>0</sub>(τ).

As in Argiento & Iorio, 2019, the model can be framed in a Bayesian nonparametric fashion. Indeed, *q*(*M*), *h<sub>j</sub>*(*s*) and *p*<sub>0</sub>(τ) define the joint distribution of a vector of almost surely discrete random measures *P*<sub>1</sub>,...,*P<sub>d</sub>* with support Θ, where

$$P\_j = \sum\_{l=1}^{M} \frac{S\_{jl}}{T\_j} \delta\_{\tau\_l}(\theta), \quad j = 1, \ldots, d \tag{2}$$

with θ ∈ Θ. We refer to the joint distribution of *P*<sub>1</sub>,...,*P<sub>d</sub>* as the Vector of Normalized Independent weights, i.e., *V-NIw*(*q*, *h<sub>j</sub>*, *p*<sub>0</sub>). Model (1) and the priors described above can be rewritten in hierarchical form as follows:

$$\begin{aligned} y\_{ji} \mid \theta\_{ji} &\stackrel{\text{ind}}{\sim} f(y\_{ji} \mid \theta\_{ji}) \\ \theta\_{j1}, \ldots, \theta\_{jn\_j} \mid P\_j &\stackrel{\text{iid}}{\sim} P\_j \\ P\_1, \ldots, P\_d \mid q, h, p\_0 &\sim V\text{-}NIw(q, h\_j, p\_0). \end{aligned} \tag{3}$$

In this work, the kernel *f*(*y* | θ) is the density of a univariate normal distribution with parameter θ = (*µ*, σ<sup>2</sup>)<sup>⊤</sup>. We assume *q*(*m*) to be the p.m.f. of a 1-shifted Poisson distribution with parameter Λ, and *h<sub>j</sub>*(*s*) is the density of a gamma distribution with shape parameter γ<sub>*j*</sub> and rate equal to 1. Finally, *p*<sub>0</sub>(τ) is the density of a conjugate normal inverse gamma prior with parameters *µ*<sub>0</sub>, κ<sub>0</sub>, ν<sub>0</sub> and σ<sup>2</sup><sub>0</sub>.
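A forward simulation from this prior makes the construction concrete. The sketch below uses arbitrary hyperparameter values of my own choosing; the normalized-gamma weights, 1-shifted Poisson prior on *M* and normal inverse gamma base measure follow the specification above, while the Poisson sampler is hand-rolled because the Python standard library does not provide one:

```python
import math
import random

random.seed(42)

def poisson(lam: float) -> int:
    """Knuth's multiplicative method for a Poisson draw (fine for moderate lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Hyperparameters (illustrative values, not the paper's).
LAM = 3.0                    # Lambda of the 1-shifted Poisson prior on M
gamma_j = [1.0, 2.0]         # gamma shapes, one per group (d = 2)
mu0, kappa0, nu0, sigma2_0 = 0.0, 1.0, 4.0, 1.0  # NIG hyperparameters

M = 1 + poisson(LAM)         # number of mixture components, M >= 1

# Group-specific unnormalized weights S_jl ~ Gamma(gamma_j, 1), normalized by T_j.
S = [[random.gammavariate(g, 1.0) for _ in range(M)] for g in gamma_j]
w = [[s / sum(row) for s in row] for row in S]

# Shared atoms tau_l = (mu_l, sigma2_l) from the normal inverse gamma base measure.
tau = []
for _ in range(M):
    sigma2 = (nu0 * sigma2_0 / 2) / random.gammavariate(nu0 / 2, 1.0)
    mu = random.gauss(mu0, math.sqrt(sigma2 / kappa0))
    tau.append((mu, sigma2))

# One observation for group j: pick a component, then draw from the normal kernel.
j = 0
l = random.choices(range(M), weights=w[j])[0]
y = random.gauss(tau[l][0], math.sqrt(tau[l][1]))
print(M, [round(sum(row), 6) for row in w])  # weights sum to 1 within each group
```

Note how the atoms `tau` are shared across groups while the weights `w` are group-specific, which is exactly what produces ties both within and between groups.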


#### 3 Group-dependent clustering

The hierarchical model in (3) allows us to define a group-dependent clustering based on the latent variables θ<sub>*ji*</sub>. First, we introduce latent allocation variables *c<sub>ji</sub>* such that *c<sub>ji</sub>* = *m* if θ<sub>*ji*</sub> = τ<sub>*m*</sub>. Then, we denote by *M*<sup>(*a*)</sup> the set of couples (*j*,*m*) such that there exists an *i* for which *c<sub>ji</sub>* = *m*, and we define the number of *allocated columns* as

$$M^{(a)} = \#\left\{ m : \text{there exists a couple } (j, m) \in \mathcal{M}^{(a)}, \; j = 1, \dots, d \right\}.$$

We denote by *M*<sup>(*na*)</sup> the complement of *M*<sup>(*a*)</sup>. Hence, for every pair (*j*,*m*), we define *n<sub>jm</sub>* = #{(*j*,*i*) : *c<sub>ji</sub>* = *m*}. Note that

$$(j,m)\in \mathcal{M}^{(na)} \Rightarrow n\_{jm} = 0$$

$$(j,m)\in \mathcal{M}^{(a)} \Rightarrow n\_{jm} \ge 0.$$

Finally, let *c*<sup>∗</sup><sub>1</sub>,..., *c*<sup>∗</sup><sub>*M*<sup>(*a*)</sup></sub> be the allocated columns, that is, the indexes within {1,...,*M*} such that (*j*, *c*<sup>∗</sup><sub>*k*</sub>) ∈ *M*<sup>(*a*)</sup> for some *j*.

We are now ready to define, for each group *j*, the clustering ρ<sub>*j*</sub> = {*A*<sub>*j*1</sub>,..., *A*<sub>*jM*<sup>(*a*)</sup></sub>}, where *A<sub>jk</sub>* = {(*j*,*i*) : *c<sub>ji</sub>* = *c*<sup>∗</sup><sub>*k*</sub>} and *k* = 1,...,*M*<sup>(*a*)</sup>. In other words, *A<sub>jk</sub>* is the set of data points of group *j* belonging to the *k*-th cluster. Note that a distinctive feature of our setting is that *A<sub>jk</sub>* can be an empty set. Nevertheless, if *A<sub>jk</sub>* = ∅ appears in ρ<sub>*j*</sub>, it means that there is at least one other group *j̃* such that *A<sub>j̃k</sub>* is not empty.
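The bookkeeping behind the allocated couples, the counts *n<sub>jm</sub>* and the (possibly empty) clusters *A<sub>jk</sub>* can be sketched directly. This is an illustration with a made-up allocation matrix, not output of the paper's sampler:

```python
# Toy allocation variables c[j][i] = component label of individual i in group j,
# for d = 2 groups and M = 4 components (labels 0..3); the values are made up.
c = [[0, 0, 2], [2, 2, 3, 3]]
M, d = 4, len(c)

# Allocated couples (j, m): component m is used by at least one individual in group j.
couples = {(j, m) for j, row in enumerate(c) for m in row}

# Allocated columns c*_1, ..., c*_{M^(a)}: components used by at least one group.
allocated = sorted({m for _, m in couples})
M_a = len(allocated)

# Occupation counts n_jm and group-specific partitions rho_j (empty sets allowed).
n = [[row.count(m) for m in range(M)] for row in c]
rho = [
    [{(j, i) for i, m in enumerate(c[j]) if m == col} for col in allocated]
    for j in range(d)
]

print(M_a, allocated)  # 3 allocated columns: components 0, 2 and 3
print(rho[0][2])       # empty: group 0 never uses component 3, but group 1 does
```

The last line shows the distinctive feature noted above: *A*<sub>0,3</sub> is empty in ρ<sub>0</sub> yet the column stays allocated because group 1 occupies it.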

We build upon the work of Argiento & Iorio, 2019 and James *et al.*, 2009 to derive the joint distribution of the clusterings ρ<sub>1</sub>,...,ρ<sub>*d*</sub> induced by the hierarchical mixture model (3). This turns out to be:

$$\begin{aligned} \pi(\rho\_1, \dots, \rho\_d, M^{(a)}) &= \int\_0^{\infty} \dots \int\_0^{\infty} \prod\_{j=1}^d \frac{1}{\Gamma(n\_j)} u\_j^{n\_j - 1} \prod\_{k=1}^{M^{(a)}} \kappa\_{\gamma\_j}(n\_{jk}, u\_j) \\ & \quad \exp\left[ -\Lambda \left( \prod\_{j=1}^d \psi\_{\gamma\_j}(u\_j) - 1 \right) \right] \\ & \quad \Lambda^{M^{(a)} - 1} \left[ \Lambda \prod\_{j=1}^d \psi\_{\gamma\_j}(u\_j) + M^{(a)} \right] du\_1 \dots du\_d, \end{aligned} \tag{4}$$

where ψ<sub>γ<sub>*j*</sub></sub>(*u<sub>j</sub>*) = 1/(*u<sub>j</sub>* + 1)<sup>γ<sub>*j*</sub></sup> is the Laplace transform of a gamma distribution with shape γ<sub>*j*</sub> and rate equal to 1, while κ<sub>γ<sub>*j*</sub></sub>(*n<sub>jk</sub>*, *u<sub>j</sub>*) = Γ(γ<sub>*j*</sub> + *n<sub>jk</sub>*)/Γ(γ<sub>*j*</sub>) · 1/(*u<sub>j</sub>* + 1)<sup>*n<sub>jk</sub>* + γ<sub>*j*</sub></sup> is the corresponding cumulant function. The joint distribution in (4) enables us to build a Gibbs sampler for sampling from the full posterior distribution. We omit the details for brevity. We will illustrate the performance of our model on a set of simulated and real data.
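As a sanity check on these two ingredients (my observation, not stated in the abstract): κ<sub>γ</sub>(*n*, *u*) equals the *n*-th derivative of ψ<sub>γ</sub>(*u*) up to the sign (−1)<sup>*n*</sup>, which a finite-difference approximation confirms numerically:

```python
import math

def psi(gamma: float, u: float) -> float:
    """Laplace transform of a Gamma(gamma, 1) random variable, evaluated at u."""
    return (u + 1.0) ** (-gamma)

def kappa(gamma: float, n: int, u: float) -> float:
    """Cumulant function from (4): Gamma(gamma+n)/Gamma(gamma) * (u+1)^-(n+gamma)."""
    return math.gamma(gamma + n) / math.gamma(gamma) * (u + 1.0) ** (-(n + gamma))

gamma, u, h = 1.5, 0.7, 1e-5

# n = 0: kappa reduces to psi itself.
print(kappa(gamma, 0, u) == psi(gamma, u))

# n = 1: kappa(gamma, 1, u) should equal -psi'(u); central finite difference.
dpsi = (psi(gamma, u + h) - psi(gamma, u - h)) / (2 * h)
print(kappa(gamma, 1, u), -dpsi)  # the two values agree to high precision
```

The hyperparameter values here are arbitrary; the check holds for any γ > 0 and *u* > 0.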

#### References

In this work, the kernel *f*(*y |* θ) represents the density of a univariate normal distribution with parameter θ = (*µ,*σ2)⊤. We assume *q*(*m*) to be the p.m.f. of a 1−shifted Poisson distribution with parameter Λ and *hj*(*s*) is the density of a gamma distribution with shape parameter γ*<sup>i</sup>* and rate equal to 1. Finally, *p*0(τ) is the density of a conjugate normal inverse gamma prior with parameters *µ*0,

The hierarchical model in (3) allows to define a group-dependent clustering based on the latent variables θ *ji*. First, we introduce latent allocation variables *c ji* such that *cji* = *m* if θ *ji* = τ*m*. Then, we denote *M* (*a*) the set of couples (*j,m*) such that ∃*i* for which *ci j* = *m* and we define the number of *allocated*

*<sup>m</sup>* : there exists one couple(*j,m*) <sup>∈</sup> *M* (*a*)

(*j,m*) <sup>∈</sup> *M* (*na*) <sup>⇒</sup> *njm* <sup>=</sup> <sup>0</sup>

(*j,m*) <sup>∈</sup> *M* (*a*) <sup>⇒</sup> *njm* <sup>≥</sup> <sup>0</sup>*.*

We are now ready to define, for each group *j*, the clustering ρ*<sup>j</sup>* = *{Aj*1*,...*

We build upon the work Argiento & Iorio, 2019 and James *et al.* , 2009 to derive the joint distribution of the clustering ρ1*,...,*ρ*d*, induced by the hierar-

words, *Ajk* is the set of data points of group *j* belonging to the *k*-th cluster. Note that, a distinctive feature of our setting, is that *Ajk* can be an empty set. Nevertheless, if *Ajk* = 0/ appears in ρ*j*, it means that there is at least another

*ki*) <sup>∈</sup> *M* (*a*)

.

*<sup>k</sup>* ) <sup>∈</sup> *M* (*a*)

*<sup>M</sup>*(*a*) be the allocated columns, that is, the indexes within

*, j* = 1*,...,d*

. Hence, for every pair (*j,m*), we

*}* and *<sup>k</sup>* <sup>=</sup> <sup>1</sup>*,...,M*(*a*)

" *.*

. In other

κ0, ν<sup>0</sup> and σ<sup>2</sup>

*columns* as

Finally, let *c*∗

*M*(*a*) = #

!

<sup>1</sup>*,..., c*<sup>∗</sup>

*...,AjM*(*a*)*}*, where *Ajk* = *{*(*j,i*) : (*j, c*<sup>∗</sup>

group ˜*j* such that *A*˜*jk* is not empty.

*{*1*,...,M}* such that (*j, c*<sup>∗</sup>

We denote *M* (*na*) the complement of *M* (*a*)

define *njm* = #*{*(*j,i*) : *cji* = *m}*. Note that

0.

3 Group-dependent clustering
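In code, the construction of the group-specific clusterings ρ_j from the allocation variables c_ji can be sketched as follows (a minimal Python sketch; the function name and data layout are ours, not the authors'). It illustrates the distinctive feature noted above: a cluster A_jk may be empty for one group as long as another group allocates data to that column.

```python
def group_clusterings(c):
    """c[j][i] = m means observation i of group j is allocated to column m.
    Returns the allocated columns and, per group, the list of clusters A_jk
    (possibly empty) over those columns."""
    # allocated columns: every column used by at least one group
    allocated = sorted({m for group in c for m in group})
    rho = []
    for group in c:
        rho.append([[i for i, m in enumerate(group) if m == col]
                    for col in allocated])
    return allocated, rho

# two groups; three columns are allocated overall
cols, rho = group_clusterings([[1, 1, 3], [2, 2, 1]])
# cols == [1, 2, 3]; group 0 has an empty cluster for column 2,
# group 1 has an empty cluster for column 3
```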


### A MACHINE LEARNING APPROACH IN STOCK RISK MANAGEMENT


Salvatore Cuomo<sup>1</sup>, Federico Gatta<sup>1</sup>, Fabio Giampaolo<sup>1</sup>, Carmela Iorio<sup>2</sup> and Francesco Piccialli<sup>1</sup>

<sup>1</sup> Department of Mathematics and Applications 'R. Caccioppoli', University of Naples Federico II, Italy, (e-mail: salvatore.cuomo@unina.it, federico.gatta@unina.it, fabio.giampaolo@unina.it, francesco.piccialli@unina.it)

<sup>2</sup> Department of Industrial Engineering, University of Naples Federico II, Italy, (e-mail: carmela.iorio@unina.it)

ABSTRACT: In this paper, we propose a novel approach to stock clustering aimed at constructing a portfolio optimization strategy. The idea is to exploit hierarchical Neural Network Principal Component Analysis and the Adaptive LASSO, in combination with the Arbitrage Pricing Theory, to group stocks whose returns are affected by the same risk factors, and then to eliminate such dependence through an appropriately constructed portfolio. We test our proposal on the Italian stock market.

KEYWORDS: neural network principal component analysis, stocks clustering, arbitrage pricing theory, pure alpha strategy

#### 1 Introduction

In this work, we propose a novel technique for stock risk management through the construction of an appropriate pure alpha strategy. To do this, we exploit the *Arbitrage Pricing Theory* (APT) (Ross, 1976) which, given a market made up of *M* stocks Ω = {1, ..., *M*}, explains the stock returns X^(j), j ∈ Ω, through a collection of standard random variables common to all stocks, called *risk factors* F_i, i = 1, ..., n. Letting α^(j) be the intercept and ε^(j) the error term:

$$X^{(j)} = \alpha^{(j)} + \beta_1^{(j)} F_1 + \dots + \beta_n^{(j)} F_n + \varepsilon^{(j)} \tag{1}$$

So, the task is to identify an appropriate set of risk factors. In the literature, there are mainly two approaches: the *macroeconomic* approach, which searches for factors outside the data, and the *statistical* approach, which extracts the risk factors from the data itself. We follow the statistical approach. A work of this type is that of Ladrón de Guevara Cortés *et al.*, 2019, which is the starting point for our model in that it uses the hierarchical Neural Network Principal Component Analysis (hNNPCA). As for time series clustering, we refer to the survey of Aghabozorgi *et al.*, 2015. In the second section, we present the data analysis techniques that we exploit. In the third section, we propose our methodology for clustering and investment. In the fourth section, we test our strategy on the Italian stock market.

#### 2 Data Analysis Tools


#### 2.1 Hierarchical Neural Network Principal Component Analysis

The hNNPCA is a dimensionality reduction technique based on a neural network (NN) with 5 layers, such that both input and output are X_t = [X_t^(j)]_{j∈Ω}. The central layer has dimension n, equal to the number of series to be extracted, known as *principal components* (PCs), and the values of its neurons give the PCs. The loss function is E = Σ_{k=1}^n E_k, where E_k is the Mean Square Error (MSE) calculated on the sub-NN obtained by considering only the first k PCs.
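To make the hierarchical loss concrete, the following sketch computes E = Σ_k E_k for a linear reconstruction, using SVD-based PCA as a stand-in for the trained 5-layer network (an assumption of ours; the paper trains a neural autoencoder, for which E_k would come from the sub-network truncated to k central neurons):

```python
import numpy as np

def hierarchical_pca_loss(X, n):
    """E = sum_{k=1..n} E_k, where E_k is the MSE of the reconstruction
    of the (centred) K x M panel X using only the first k components."""
    Xc = X - X.mean(axis=0)                 # centre the panel
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    E = 0.0
    for k in range(1, n + 1):
        Xk = (U[:, :k] * s[:k]) @ Vt[:k]    # rank-k reconstruction
        E += np.mean((Xc - Xk) ** 2)        # E_k
    return E
```

Summing the truncated-reconstruction errors is what forces the extracted components into a hierarchy: the first component alone must already reconstruct the panel as well as possible.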

#### 2.2 Adaptive Least Absolute Shrinkage and Selection Operator

The *Adaptive Least Absolute Shrinkage and Selection Operator* (A-LASSO) is a feature selection technique that adjusts the LASSO estimator by weighting the contribution of each coefficient, when computing the l₁ norm, with a weight obtained from an Ordinary Least Squares (OLS) regression. Namely, given a linear model with K observations and n inputs, X_t^(j) = Σ_{i=1}^n F_{i,t} β_i^(j) + ε_t^(j), the A-LASSO estimates the coefficients β^(j) as the argmin of:

$$\frac{1}{K}\sum_{t=1}^{K}\Big(X_t^{(j)} - \sum_{i=1}^{n} F_{i,t}\,\beta_i^{(j)}\Big)^{2} + \lambda \sum_{i=1}^{n} \big|\beta_i^{(j)} v_i\big|, \qquad v_i = \big|\hat{\beta}_{OLS,i}\big|^{-\tau},\ \lambda > 0,\ \tau > 0 \tag{2}$$

The A-LASSO is exploited only for feature selection, so its final result is A_j = {i ∈ {1, ..., n} s.t. β_i^(j) ≠ 0}. As for the regression, we exploit that of Fama and MacBeth (FMB), which is performed by dividing the training data into subsets and averaging the OLS coefficients obtained in the subsets.
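A minimal numerical sketch of the estimator in equation (2): with weights v_i = |β̂_OLS,i|^(−τ), rescaling the i-th column of the design by 1/v_i turns the weighted penalty into a plain LASSO in b_i = β_i v_i, which we solve here by coordinate descent (the reduction is standard; the function names and solver choice are ours, not the authors'):

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def adaptive_lasso(F, X, lam=0.1, tau=1.0, n_iter=200):
    """Minimise (1/K)*sum_t (X_t - sum_i F_{i,t} beta_i)^2
    + lam * sum_i |beta_i * v_i|, with v_i = |beta_OLS,i|^(-tau)."""
    K, n = F.shape
    beta_ols, *_ = np.linalg.lstsq(F, X, rcond=None)
    v = np.abs(beta_ols) ** (-tau)      # adaptive weights v_i
    Ft = F / v                          # rescaled design: plain LASSO in b = beta * v
    b = np.zeros(n)
    col_sq = (Ft ** 2).sum(axis=0) / K
    for _ in range(n_iter):
        for i in range(n):
            r = X - Ft @ b + Ft[:, i] * b[i]   # residual excluding feature i
            rho = Ft[:, i] @ r / K
            b[i] = soft_threshold(rho, lam / 2.0) / col_sq[i]
    return b / v                        # back to the original scale
```

Since β = b/v, features with small OLS coefficients face a heavier effective penalty and are driven exactly to zero, which is precisely the adaptive behaviour used for building A_j.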

#### 3 The Methodology

#### 3.1 Stocks Clustering

Firstly, we use the hNNPCA to obtain n PCs that, after standardization, are used as risk factors in equation 1. The underlying idea is that not all PCs affect the returns of all the stocks, so we apply the A-LASSO to perform feature selection. The hyperparameters λ and τ are set by grid search, minimizing the estimate of the MSE provided by 3-fold nested cross-validation. In this stage we discard the combinations of hyperparameters that save less than 2 or more than 4 PCs, to prevent overly strong regularization or overly complex models. So, for each stock j, we have a subset of PCs A_j that really affects the returns of j. After introducing the equivalence relation between stocks j ∼ l ⟺ A_j = A_l, the clusters are the equivalence classes of ∼. Two strengths of our strategy are that we do not need to know in advance the number of clusters to create, and we do not need to establish a similarity measure between the considered time series. These are, according to Aghabozorgi *et al.*, 2015, difficult points in traditional clustering algorithms. However, we have to set λ and τ, and not all the obtained clusters are usable in practice.
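The equivalence-class clustering described above amounts to grouping stocks by their selected factor sets A_j; a sketch with hypothetical tickers and factor indexes (the example data are ours, for illustration only):

```python
from collections import defaultdict

def cluster_by_factors(A):
    """A[j] is the set of PCs selected for stock j; stocks with the same
    A_j fall into the same equivalence class of the relation ~."""
    classes = defaultdict(list)
    for stock, factors in A.items():
        classes[frozenset(factors)].append(stock)
    return list(classes.values())

A = {"ENI": {1, 3}, "ENEL": {1, 3}, "FCA": {2, 4}, "UCG": {1, 3, 5}}
clusters = cluster_by_factors(A)
# three clusters: {ENI, ENEL}, {FCA}, {UCG}
```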


#### 3.2 Pure Alpha Strategy

Now, we use the clustering to obtain an investment strategy. Fix a class Ã, assume that |Ã| = n, and consider a portfolio (equation 3) made up of n + 1 stocks from the corresponding cluster (without loss of generality, stocks 0, ..., n). The coefficients are estimated with FMB, and the weights γ^(j) indicate the exposition on the stocks.

$$X^{(Port)} = \sum_{j=0}^{n} \gamma^{(j)} X^{(j)} = \sum_{j=0}^{n} \gamma^{(j)} \alpha^{(j)} + \sum_{i \in \tilde{A}} \Big( \sum_{j=0}^{n} \gamma^{(j)} \beta_i^{(j)} \Big) F_i + \sum_{j=0}^{n} \gamma^{(j)} \varepsilon^{(j)} \tag{3}$$

A *pure alpha strategy* is a portfolio (designed to reduce riskiness) such that the total exposition on the F_i is nil. Furthermore, by the law of large numbers, we can neglect the contribution of the ε^(j). So, to determine the weights, we impose that the l₁ norm of Γ = (γ^(0), ..., γ^(n)) equals 1, and we find that there are only two admissible vectors of weights. We choose the one with the higher expected return.

If we have more than n + 1 stocks in the cluster, we still consider portfolios made up of n + 1 stocks, and we choose the one that maximizes the expected return.
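Under the construction above, the n zero-exposure constraints on the n + 1 weights leave a one-dimensional null space, and the l₁ normalization leaves exactly two admissible sign choices. A sketch of the weight computation (assuming a full-rank beta matrix; the function name is ours):

```python
import numpy as np

def pure_alpha_weights(B, alpha):
    """B is the n x (n+1) matrix of betas beta_i^(j); the weights must
    satisfy B @ gamma = 0 (no factor exposure) and ||gamma||_1 = 1.
    Of the two admissible sign choices, pick the one with the higher
    expected return sum_j gamma_j * alpha_j."""
    _, _, Vt = np.linalg.svd(B)
    g = Vt[-1]                    # basis vector of the 1-dim null space
    g = g / np.abs(g).sum()       # impose the l1 norm constraint
    return g if alpha @ g >= alpha @ (-g) else -g
```

Because the factors are standard (zero-mean) random variables and the ε^(j) are neglected, the expected portfolio return reduces to Σ_j γ^(j) α^(j), which is what the sign choice maximizes.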

#### 4 A Real Application in Italian Stock Market

Now, we propose a real application in the Italian stock market. Ω is made up of 30 stocks, whose time series are supplied by *mercati.ilsole24ore.com*. The data are from 2010-10-26 to 2020-08-31 (train set) and from 2020-09-01 to 2020-12-31 (test set). We extract 6 PCs from the train set and we obtain 16 clusters (only 3 usable in the investment methodology). Then, the results of the pure alpha strategy in the test set are compared with those of the Italian Index FTSE MIB, see figure 1.

Figure 1. *Portfolio (blue) vs FTSE MIB (orange) in the period 01/09/20 - 31/12/20*

From the figure, we can see that the proposed investment methodology performs quite well. In fact, it achieves a profit (even if not as big as that of the FTSE MIB), and it seems to be safer than the index, with fairly constant growth and fewer downward peaks.

#### References

AGHABOZORGI, SAEED, SHIRKHORSHIDI, ALI SEYED, & WAH, TEH YING. 2015. Time-series clustering – a decade review. *Information Systems*, 53, 16–38.

LADRÓN DE GUEVARA CORTÉS, ROGELIO, TORRA PORRAS, SALVADOR, & MONTE MORENO, ENRIC. 2019. Neural Networks Principal Component Analysis for estimating the generative multifactor model of returns under a statistical approach to the Arbitrage Pricing Theory. Evidence from the Mexican Stock Exchange. *Computación y Sistemas*, 23(2), 281–298.

ROSS, STEPHEN A. 1976. The arbitrage theory of capital asset pricing. *Journal of Economic Theory*, 13(3), 341–360.


### PATHMOX SEGMENTATION TREES TO COMPARE LINEAR REGRESSION MODELS


Cristina Davino<sup>1</sup>, Giuseppe Lamberti<sup>2</sup>

<sup>1</sup> Department of Economics and Statistics, University of Naples Federico II, (e-mail: cristina.davino@unina.it)

<sup>2</sup> Department of Business, Universitat Autonoma de Barcelona, (e-mail: giuseppe.lamberti@uab.cat)

ABSTRACT: The estimation of a dependency model for a group as a whole does not take into account possible heterogeneity, i.e., the presence of possible partitions characterised by different dependency structures. We propose a procedure that exploits the potential of segmentation trees to identify partitions in an initial set of data characterised by different linear regression patterns.

KEYWORDS: pathmox approach, linear regression, heterogeneity, F-Fisher.

#### 1 Introduction

Segmentation trees have been attracting a great deal of attention as model comparison tools, with research mainly motivated by the fact that they allow identification of partitions of the data characterised by different dependency structures. Few algorithms that combine model estimation and segmentation trees have been proposed by the statistical community, apart from the MOdel-Based recursive partitioning (MOB) procedure proposed by Zeileis *et al.* (2008). In a new approach, we generalize the pathmox algorithm developed by Lamberti *et al.* (2016) to the context of linear regression models, using a model comparison test to identify the most significant partitions (i.e., sub-groups) in the data. Further developments of the proposed approach will involve extensions to other contexts such as quantile regression.

#### 2 State-of-the-art

Analysis of a dependency model can be furthered by assessing whether a model and/or the impact of the regressors on the dependent variable differs when heterogeneity is observed. In other words, it may be interesting to assess differences between a global model estimated on the whole set of observations and models based on sub-groups identified on the basis of known categorical variables external to the model. These variables may identify partitions characterised by heterogeneity in the dependency structure. The most popular approaches to comparing regression models rely on comparative statistical testing or on recursive methods. The comparison approach consists of comparing the coefficients of a model common to all the data (i.e., a restricted model representing a homogeneous situation) with those of a model that reflects the interactions between the categorical and predictor variables (i.e., an unrestricted model corresponding to a heterogeneous situation). The comparison approach, which allows for analysis of one categorical variable at a time, is reflected in the F-tests developed by Chow (1960) and Lebart *et al.* (1979), based on the assumption of normality of the residuals of the two models. The comparison is done by calculating the restricted deviance (SSR₀) and the unrestricted deviance (SSR₁). The latter will be lower if the interaction between categorical and predictor variables is significant. Under the null hypothesis, both deviances are equal, meaning that the categorical variables produce no differences in the model coefficients. This null hypothesis is tested by computing an F-statistic:

$$F = \frac{(SSR_0 - SSR_1)/p}{SSR_1/(n - 2p)} \tag{1}$$

where n is the number of observations and p the number of regression coefficients.
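In code, the statistic in (1) can be computed from the residual sums of squares of the pooled and per-group fits (a sketch for the two-group Chow setting; the function names are ours):

```python
import numpy as np

def chow_f(X, y, groups):
    """F statistic comparing a pooled regression (SSR0) with separate
    regressions per group (SSR1), for two groups and p coefficients."""
    def ssr(Xs, ys):
        coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        res = ys - Xs @ coef
        return res @ res
    n, p = X.shape
    ssr0 = ssr(X, y)
    ssr1 = sum(ssr(X[groups == g], y[groups == g]) for g in np.unique(groups))
    return ((ssr0 - ssr1) / p) / (ssr1 / (n - 2 * p))
```

A large F indicates that letting the coefficients differ across the two groups reduces the deviance far more than chance would allow, i.e. the categorical variable induces heterogeneity.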

The recursive approach, based on multiple model comparisons, ranks the variables that produce differences in the model coefficients. The outcome is a tree where each node represents a model. Partitions are obtained by comparing the effect of each categorical variable on the model coefficients and choosing the partitions that produce the biggest differences. This approach requires a criterion to quantify differences in the model coefficients. In the case of the MOB procedure, this criterion is based on a fluctuation test that measures the coefficient instability (Zeileis and Hornik, 2007) caused by a categorical variable. High instability points to a significant effect of the variable. Tree partitions are defined according to the variables that produce the highest instability.

#### 3 Pathmox in a nutshell


Pathmox (Lamberti *et al.*, 2016), developed to detect heterogeneity in models, is a recursive algorithm based on segmentation trees. While pathmox was introduced in the context of partial least squares structural equation modelling, it can be generalized to other contexts whenever a suitable test for comparing models is available. The algorithm applies binary segmentation principles to produce a tree with a different model in each node. It starts by fitting a global model to all the data (i.e., the tree root) and identifies the models with the most significant differences in the child nodes. The most different models are identified by minimizing the sum of the squared residuals of the models estimated in each child node. The available data are recursively partitioned according to the categorical variables (not included in the model) that yield the most significant differences in the child nodes. Partitions are identified using a test that determines the degree of difference between the two compared sub-models. Finally, pathmox avoids overfitting through stopping rules based on maximum depth, minimum node size and non-significance of the partitioning criterion. As the partitioning criterion, we adopt the hypothesis test proposed by Lebart *et al.* (1979) and Chow (1960) to compare two linear regression models.
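The recursion described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it tries only one-level-versus-rest binary splits and uses an F-value threshold in place of a p-value stopping rule; function names and the node layout are ours.

```python
import numpy as np

def _ssr(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return r @ r

def _split_f(X, y, mask):
    """Lebart/Chow-type F statistic for the binary split given by mask."""
    n, p = X.shape
    ssr0 = _ssr(X, y)
    ssr1 = _ssr(X[mask], y[mask]) + _ssr(X[~mask], y[~mask])
    return ((ssr0 - ssr1) / p) / (ssr1 / (n - 2 * p))

def pathmox(X, y, Z, depth=2, min_size=20, f_min=10.0):
    """Recursive binary segmentation: at each node, try every binary split
    induced by the columns of the categorical matrix Z and keep the one
    with the largest F statistic, if it exceeds f_min."""
    node = {"n": len(y), "coef": np.linalg.lstsq(X, y, rcond=None)[0]}
    if depth == 0:
        return node
    best = None
    for v in range(Z.shape[1]):
        for level in np.unique(Z[:, v]):
            mask = Z[:, v] == level
            if min(mask.sum(), (~mask).sum()) < min_size:
                continue            # stopping rule: minimum node size
            f = _split_f(X, y, mask)
            if best is None or f > best[0]:
                best = (f, v, level, mask)
    if best is None or best[0] < f_min:
        return node                 # stopping rule: non-significant split
    f, v, level, mask = best
    node.update(split=(v, level), F=f,
                left=pathmox(X[mask], y[mask], Z[mask],
                             depth - 1, min_size, f_min),
                right=pathmox(X[~mask], y[~mask], Z[~mask],
                              depth - 1, min_size, f_min))
    return node
```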

the sub-samples identified by the terminal nodes (Table 1), showing that, in terms of satisfaction, managers primarily valued empowerment followed by company reputation, senior employees valued empowerment, while junior em-

Our results suggest that pathmox can be used to compare regression models, opening up a future research line in other contexts such as quantile regression. While the algorithm allows partitions to be identified where differences between model coefficients are greatest, it has the limitation that no overall significance criterion is considered once each partition is identified. This important aspect needs to be considered in a future version of the algorithm. Note that pathmox aims to identify the most significantly different sub-groups, unlike a classic decision tree where the objective is to obtain the best prediction based on splitting observations into sub-groups. Therefore, the only similar method is the MOB proposed by Zelies *et al.* (2008), which, however, uses a different criterion to identify the best partitions. A comparison of both ap-

CHOW, G.C. 1960. Test of equality between sets of coefficients in two linear regressions.

LAMBERTI, G., ALUJA T., & SANCHEZ, G. 2016. The Pathmox approach for PLS path modeling. *Applied Stochastic Models in Business and Industry.*, 32, 453–468. LEBART, L., MORINEAU A., & FEENELON, J.P. 1979. *Traitement des donnees statistiques*.

ZEILEIS, A., & HORNIK, K. 2007. Generalized M-fluctuation tests for parameter instability.

ZEILEIS, A., HOTHORN T., & HORNIK, K. 2008. Model-Based Recursive Partitioning. *Jour-*

*nal of Computational and Graphical Statistics.*, 17, 492–514.

Table 1. *Coefficient comparison for global*

Global model 0.328 0.190 0.158 0.169 0.181 LM2: managers 0.267 0.209 0.116 0.118 0.191 LM4: senior 0.517 0.247 0.142 0.120 0.201 LM3: junior 0.271 0.052*NS* 0.333 0.342 0.121

LM β coefficients Empowerment Company Supervisor Pay Work

reputation leadership conditions

*and terminal nodes.*

*NS* indicates non-significance according to the t-test

ployees mainly valued pay and leadership.

Root node *Global model* **2000**

Split: *job level* p-value<0.001

Intermediate High

Intermediate LM1

**789**

Low

Split: *Antiquity* p-value<0.001

≤2014 >2014

Terminal LM4 **401**

proaches will be a natural next step in our research.

*Econometrica.*, 28, 591–605.

*Statistica Neerlandica.*, 61, 488–508.

Terminal LM3

388

References

Paris: Dunod.

Figure 1. *Pathmox tree*

Terminal LM2

1211

#### 4 Employee satisfaction: a pathmox application

Using data referring to an organizational study of 2,000 employees in a Spanish financial institution, we applied the pathmox approach in an empirical analysis of the impact of work climate satisfaction on overall employee satisfaction. Overall satisfaction and specific work climate aspects (empowerment, company reputation, supervisor leadership, pay and work conditions) were scored on a 5-point Likert scale. The following categorical variables, reflecting specific employee characteristics, were considered as potential sources of heterogeneity: *age* (<31, 31-45, >45 years), *gender*, *marital status* (married, not married), *education* (secondary, graduate, post-graduate), *job grade* (low, intermediate, high) and *antiquity* in the organization (<2004, 2005-2009, 2009- 2014, >2014).

Pathmox analysis results are reported in Figure 1 and Table 1. We set maximum depth to two levels, bounded the final number of segments to a maximum of four and set the minimum admissible node size to 10% of the total sample. The significance threshold for the partitioning algorithm was p=0.05. The pathmox algorithm identified *job grade* as the variable with the greatest power, distinguishing between low-intermediate grade and high grade employees (LM1 and LM2, respectively). LM1 (low-intermediate grade) was further differentiated according to *antiquity*. On the basis of job grade combined with antiquity, we could characterise partitions and assign labels to subgroups. Thus, LM2 can be defined as the group of managers, LM3 as senior employees and LM4 as junior employees. Finally, the global model coefficients were compared with the coefficients for the three models estimated for the sub-samples identified by the terminal nodes (Table 1), showing that, in terms of satisfaction, managers primarily valued empowerment followed by company reputation, senior employees valued empowerment, while junior employees mainly valued pay and leadership.

Table 1. *Coefficient comparison for global and terminal nodes.*


Figure 1. *Pathmox tree*

Pathmox builds a tree with different models in each node. It starts by fitting a global model to all the data (i.e., the tree root) and identifies the models with the most significant differences in the child nodes. The most different models are identified by minimizing the sum of the squares of the residuals of the models estimated in each child node. The available data are recursively partitioned according to categorical variables – not included in the model – that yield the most significant differences in the child nodes. Partitions are identified using a test that determines the degree of difference between two compared sub-models. Finally, pathmox avoids overfitting using stopping rules based on maximum depth, minimum node size and non-significance of the partitioning criterion. As the partitioning criterion, we adopt the hypothesis test proposed by Lebart *et al.* (1979) and Chow (1960) to compare two linear regression models.
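As an illustrative sketch of the partitioning criterion (a minimal Python implementation of our own, not the pathmox code itself), the Chow-type F-statistic comparing a pooled regression with two separate child-node regressions can be computed as:

```python
import numpy as np

def chow_test(X1, y1, X2, y2):
    """Chow (1960) F-statistic comparing a pooled linear regression with
    separate regressions fitted in two candidate child nodes.
    X1 and X2 must include an intercept column; k is the number of
    coefficients of the model."""
    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    k = X1.shape[1]
    n1, n2 = len(y1), len(y2)
    rss_pooled = rss(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    rss_split = rss(X1, y1) + rss(X2, y2)
    # Large F => the two node models differ significantly
    # (compare against an F(k, n1 + n2 - 2k) distribution).
    return ((rss_pooled - rss_split) / k) / (rss_split / (n1 + n2 - 2 * k))
```

A candidate split is retained only when the p-value associated with this statistic falls below the chosen significance threshold.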


Our results suggest that pathmox can be used to compare regression models, opening up a future research line in other contexts such as quantile regression. While the algorithm allows partitions to be identified where differences between model coefficients are greatest, it has the limitation that no overall significance criterion is considered once each partition is identified. This important aspect needs to be addressed in a future version of the algorithm. Note that pathmox aims to identify the most significantly different sub-groups, unlike a classic decision tree, where the objective is to obtain the best prediction based on splitting observations into sub-groups. Therefore, the only similar method is the MOB proposed by Zeileis *et al.* (2008), which, however, uses a different criterion to identify the best partitions. A comparison of both approaches will be a natural next step in our research.

#### References


### ANGULAR HALFSPACE DEPTH: CLASSIFICATION USING SPHERICAL BAGDISTANCES\*


Houyem Demni1, Davide Buttarazzi1, Stanislav Nagy2, and Giovanni C Porzio1

<sup>1</sup> Department of Economics and Law, University of Cassino and Southern Lazio (e-mail: houyem66@gmail.com, davidebuttarazzi@outlook.com, porzio@unicas.it)

<sup>2</sup> Department of Probability and Mathematical Statistics, Charles University (e-mail: nagy@karlin.mff.cuni.cz)

ABSTRACT: Directional data lies on the surface of the unit sphere. Exploiting new results on the computation and the properties of the angular halfspace depth, we introduce the spherical version of the bagdistance, applicable to directional data. A bagdistance-based classification method for directional data is considered. The proposed method will be compared with other directional classifiers by means of a simulation study.

KEYWORDS: angular depth, bagdistance, directional data, supervised learning.

#### 1 Introduction

Depth functions are nonparametric tools that assess how "centrally located", or "inner" is a point with respect to (w.r.t.) a given probability distribution. They have been successfully adopted in supervised classification analysis. However, many depths suffer when evaluating points that lie in the tails of the distribution. This is because the depth functions are typically not robust at their lowest values, and also because they can easily assign constant zero depth to many points when evaluated w.r.t. datasets (the so-called outsider issue). An example of an important depth sharing all these shortcomings is the standard *halfspace depth* defined in Euclidean spaces <sup>ℜ</sup>*q*, *<sup>q</sup>* <sup>≥</sup> 1.
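To make the outsider issue concrete, a rough Monte Carlo approximation of the halfspace depth in ℜ² (an illustrative sketch of ours, not an exact algorithm) shows that any point outside the convex hull of a dataset receives depth zero:

```python
import numpy as np

def halfspace_depth_2d(x, data, n_dir=1000, seed=0):
    """Monte Carlo approximation of the halfspace (Tukey) depth of x in R^2:
    the minimum, over random unit directions u, of the fraction of data
    points lying in the closed halfspace {y : u'y >= u'x}."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, n_dir)
    U = np.column_stack([np.cos(angles), np.sin(angles)])  # unit directions
    proj_x = U @ x                  # projection of x on each direction
    proj_data = data @ U.T          # projections of all data points
    return (proj_data >= proj_x).mean(axis=0).min()

grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
print(halfspace_depth_2d(np.array([2.0, 2.0]), grid))    # central point: deep
print(halfspace_depth_2d(np.array([10.0, 10.0]), grid))  # outsider: 0.0
```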

Contrary to the depths, distance functions are much more powerful when dealing with points at the extremes of the distribution. Nevertheless, they generally suffer from robustness issues as well (unless some robustified versions are adopted), and for a fruitful use of the distances in classification, certain assumptions on the data distribution typically need to be imposed (e.g., ellipticity of the underlying distribution in the case of the Mahalanobis distance).

\*The work of H. Demni and G.C. Porzio has been partially funded by the BiBiNet project (grant H35F21000430002) within the POR-Lazio FESR 2014-2020. The work of S. Nagy was supported by the grant 19-16097Y of the Czech Science Foundation, and by the PRIMUS/17/SCI/3 project of Charles University.

For these reasons, and to introduce a supervised classification rule for Euclidean data, Hubert *et al.*, 2017 proposed to combine the information from these two approaches to obtain the so-called *bagdistance*, a function which joins the depth and the distance to obtain a measure of how close/inner is a point w.r.t. a given distribution. Bagdistances are robust, nonparametric, and able to manage information in the tails of the distribution.

In this work, we introduce the bagdistance for directional data. To do so, we use the angular halfspace depth, being the directional analogue of the standard halfspace depth from ℜ*q*. We also evaluate the performance of the bagdistance within the setting of supervised classification for directional data.

Our short paper is organized as follows. Section 2 provides some background on the bagdistance in the Euclidean case, while in Section 3, the spherical bagdistance and a directional classifier based on it are introduced.

#### 2 The bagdistance for Euclidean data


Let *Y* be a random variable in ℜ*<sup>q</sup>* with distribution *PY* , and let θ be its halfspace median (the point that maximizes the halfspace depth w.r.t. *PY* , or the barycentre of the set of such points if not a singleton). Denote by *<sup>B</sup>*(*Y*) <sup>⊂</sup> <sup>ℜ</sup>*<sup>q</sup>* the smallest halfspace depth central region of *PY* (i.e., an upper level set of the halfspace depth of *PY* ) that contains at least 50 % of the *PY* -probability mass. The bagdistance of *x* to *Y* is given by the ratio of the Euclidean distances of *x* to θ, and *c*(*x*) to θ:

$$BD(\boldsymbol{x}, \boldsymbol{P}\_Y) := \begin{cases} 0 & \text{if } c(\boldsymbol{x}) = \boldsymbol{\theta}, \\ ||\boldsymbol{x} - \boldsymbol{\theta}|| / ||c(\boldsymbol{x}) - \boldsymbol{\theta}|| & \text{otherwise,} \end{cases}$$

where *c*(*x*) is the intersection of the boundary of the bag *B*(*Y*) and the ray from the halfspace median θ passing through *x*.
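In one dimension the ingredients above are easy to picture: the halfspace depth of a point reduces to min(F(x), 1 − F(x)), the halfspace median is the usual median, and the bag can be approximated by the interval between the empirical quartiles. A minimal sketch (ours, for illustration only; Hubert *et al.*, 2017 compute the exact bag):

```python
import numpy as np

def bagdistance_1d(x, sample):
    """Illustrative bagdistance on the real line. The bag B(Y) is
    approximated by the interquartile interval (the central region holding
    ~50% of the mass), theta is the median, and c(x) is the bag boundary
    lying on the same side of theta as x."""
    s = np.asarray(sample, dtype=float)
    theta = np.median(s)
    lo, hi = np.quantile(s, [0.25, 0.75])
    if x == theta:
        return 0.0
    c = hi if x > theta else lo   # c(x): bag boundary on x's side
    return abs(x - theta) / abs(c - theta)

sample = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(bagdistance_1d(5.0, sample))  # 0.5 -> inside the bag
print(bagdistance_1d(8.0, sample))  # 2.0 -> twice as far out as the bag edge
```

Points with bagdistance at most 1 lie inside the bag; larger values grade how far a point sits in the tails.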

#### 3 The spherical bagdistance and a classification rule

Directional data can be viewed as realizations of a random variable *X* whose support is the unit hyper-sphere *S*<sup>(*q*−1)</sup> := {*x* ∈ ℜ<sup>*q*</sup> : ‖*x*‖ = 1}. For directional data, the spherical bagdistance can be introduced in complete analogy with the bagdistance for Euclidean data.

We first define the directional variant of the halfspace depth. Let *X* be a directional random variable with distribution *P<sub>X</sub>*. The *angular halfspace depth ahD* of a point *x* ∈ *S*<sup>(*q*−1)</sup> w.r.t. *P<sub>X</sub>* can be defined by considering the collection H<sub>0</sub> of closed halfspaces in ℜ<sup>*q*</sup> whose boundary contains the origin:

$$ahD(\mathbf{x}, P\_X) := \inf \{ P\_X(H) \colon H \in \mathcal{H}\_0, \ \mathbf{x} \in H \} \in [0, 1].$$

Denote by *aB*(*X*) ⊂ *S*<sup>(*q*−1)</sup> the *angular bag* of *X*, defined as the smallest angular depth central region containing at least 50% of the *P<sub>X</sub>*-probability mass. Such a region always exists; its properties are detailed in the contribution of *P. Laketa* and *S. Nagy* in the present book of short papers. The *spherical bagdistance* from *x* ∈ *S*<sup>(*q*−1)</sup> to *X* is defined as the ratio of the arc distance between *x* and the angular halfspace median θ̃ (a maximizer of the angular halfspace depth of *X*), and the arc distance between *c<sub>aB</sub>*(*x*) and θ̃. Here, *c<sub>aB</sub>*(*x*) is the intersection between the boundary of the angular bag *aB*(*X*) and the geodesic from θ̃ to *x*. Altogether, we define

$$SBD(\mathbf{x}, P\_X) := \begin{cases} 0 & \text{if } c\_{aB}(\mathbf{x}) = \tilde{\boldsymbol{\theta}}, \\ \arccos(\mathbf{x}^{\mathsf{T}} \tilde{\boldsymbol{\theta}}) / \arccos(c\_{aB}(\mathbf{x})^{\mathsf{T}} \tilde{\boldsymbol{\theta}}) & \text{otherwise.} \end{cases}$$

As with the usual bagdistance in ℜ<sup>*q*</sup>, the spherical bagdistance can be exploited for supervised classification of directional objects. Formally, considering *K* directional distributions on *S*<sup>(*q*−1)</sup>, a directional classifier is defined as a function *class* : *S*<sup>(*q*−1)</sup> → {1,...,*K*}. Given a training set composed of *K* empirical distributions *P̂*<sub>*X<sub>i</sub>*</sub>, *i* = 1,...,*K*, the directional bagdistance classifier is then defined as the rule *class<sub>bag</sub>* such that:

$$class\_{bag}(\mathbf{x}) := u(SBD(\mathbf{x}; \hat{P}\_{X\_1}), \dots, SBD(\mathbf{x}; \hat{P}\_{X\_i}), \dots, SBD(\mathbf{x}; \hat{P}\_{X\_K})),$$

where *u* : ℜ<sup>*K*</sup> → {1,...,*K*} is some discriminating function. That is, the classifier is a rule defined on a Euclidean space, given by the bagdistances of the training set values w.r.t. the directional distributions defined on a Riemannian manifold. For the choice of the discriminating function, we refer to the literature available for depth-based classifiers, which includes the linear (LDA), quadratic (QDA) and *k*-NN classifiers (see, e.g., Demni *et al.*, 2021).

In line with such a strategy, a simulation study with data generated according to a Kent distribution for each group has been performed. First results are promising: the spherical bagdistance classifier reaches the same level of correct classification as achieved by the empirical Bayes, at least under some circumstances. To exemplify, boxplots of the misclassification rates of the proposed classifier and of the empirical Bayes classifier under Kent are reported in Figure 1. The two Kent distributions have equal locations and ovalness, and different concentrations (the simulation setting described in Setup 2 in Demni & Porzio, 2021 has been adopted). The training set size is 400 (200 from each group), while the size of the testing set is 200; the number of replications is 100. Misclassification errors are essentially equivalent, with some preference to be given to the LDA and QDA solution. Performances under other simulation settings and comparisons with other directional classifiers are under investigation.

Figure 1: *Misclassification rates of the empirical Bayes under Kent (EBk), and the spherical bagdistance classifier (BD) when associated with the LDA, QDA, and k-NN classification rule. Data generated according to Kent distributions.*

#### References

DEMNI, HOUYEM, & PORZIO, GIOVANNI C. 2021. Directional DD-classifiers under non-rotational symmetry. *IEEE Xplore, submitted*.

DEMNI, HOUYEM, MESSAOUD, AMOR, & PORZIO, GIOVANNI C. 2021. Distance-based directional depth classifiers: a robustness study. *Communications in Statistics – Simulation and Computation, in press*.

HUBERT, MIA, ROUSSEEUW, PETER, & SEGAERT, PIETER. 2017. Multivariate and functional classification using depth and distance. *Advances in Data Analysis and Classification*, 11(3), 445–466.



### **NEURAL NETWORKS FOR HIGH CARDINALITY CATEGORICAL DATA**


Agostino Di Ciaccio

Department of Statistics, University of Rome "La Sapienza", (e-mail: agostino.diciaccio@uniroma1.it)

**ABSTRACT**: If we want to apply neural networks to categorical data, we must necessarily adopt a coding strategy. This is a common problem for many multivariate techniques, and several approaches have been suggested. In this paper, a method is proposed to analyze categorical variables with high cardinality. An application to simulated data illustrates the value of the proposal.

**KEYWORDS**: encoding categorical data, neural networks, high cardinality attributes.

### **1 Introduction**

Several machine learning algorithms cannot directly handle categorical variables and, in any case, categorical data can pose a serious problem if they have too many categories. Postal code is a good example of a categorical variable with high cardinality. This paper starts with some considerations on the currently used approaches; then an efficient encoding method is proposed for supervised neural networks when categorical variables with high cardinality need to be analyzed.

#### **2 Approaches to quantify categorical features**

Several methods have been proposed to encode categorical variables (a recent review is Hancock et al. 2020). From our point of view, they can be classified as:


1- Methods that do not use the target variable. In this category we find rather crude methods, such as the *Label Encoder* or the *Hashing Encoder*. The quantifications obtained are essentially arbitrary.

2- Methods that use only the target variable. The *Target Encoder* (TE) replaces the categorical variable with the conditional means of the target variable. This method often produces data leakage; to limit this inconvenience, the *Leave one out Encoder* or the *Catboost Encoder* have been proposed.

3- Methods based on *One Hot Encoding* (OHE). In this approach a new binary variable is introduced for each category, indicating the presence or absence of that category. The eventual exclusion of one category is due to the multicollinearity problem (the dummy variable trap), but when applying machine learning models, such as neural networks, it is necessary to include all the categories, otherwise we would never consider the omitted category.

#### **3 Single and multiple quantifications by OHE**

One Hot Encoding is the most used method. The coding into dummies does not depend directly on the target. Despite its widespread use, some drawbacks of OHE are well known: the tendency of dummy variables to cause overfitting; the introduction of many new orthogonal variables, which can slow down or affect learning; memory problems.

The encoding of categorical variables has been extensively studied in the approach based on Optimal Scaling (OS, Gifi 1990), where the *embedding* of the categories in a *p*-dimensional space was proposed. Given a categorical variable *X* which can assume the values *a*<sub>1</sub>, *a*<sub>2</sub>, ..., *a*<sub>*k*</sub>, with *k* the number of categories and *n* the number of observations, let **G** = [**g**<sub>1</sub>, **g**<sub>2</sub>, ..., **g**<sub>*k*</sub>] be the indicator matrix with dimension *n* × *k*. Let **c** be a vector of *k* real values; the quantification of *X* is the vector:

$$\mathbf{x} = \mathbf{G}\mathbf{c} = \sum\_{h=1}^{k} \mathbf{c}\_{h}\mathbf{g}\_{h} \tag{1}$$

The values of **c** are the quantifications of the *k* categories and have to be estimated. The vector of the quantified data **x** is a linear combination of the indicator variables, which form an orthogonal basis of R<sup>*k*</sup>; hence **x** is defined in a subspace of R<sup>*k*</sup>. To obtain ordered quantifications in OS, ordered indicator matrices with non-negativity constraints on the coefficients can be used (Gifi 1990).
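As a small numerical illustration of (1) (numpy; the category labels and quantification values below are made up for the example):

```python
import numpy as np

# Categories of X observed on n = 6 units; k = 3 distinct values.
levels = ["a1", "a2", "a3"]
x_cat = ["a1", "a3", "a2", "a1", "a2", "a3"]

# Indicator (one-hot) matrix G of dimension n x k.
G = np.array([[1 if v == lev else 0 for lev in levels] for v in x_cat])

# A vector c of k real quantifications (estimated in practice;
# these values are purely illustrative).
c = np.array([0.5, -1.2, 2.0])

# Quantified variable x = G c, a linear combination of the indicators, eq. (1).
x = G @ c
print(x.tolist())  # [0.5, 2.0, -1.2, 0.5, -1.2, 2.0]
```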

In expression (1) we considered a single quantification for a categorical variable. There are several reasons that may lead to considering two or more quantifications of the same variable (Di Ciaccio 2020). Considering a regression problem, in OS (MORALS, Young et al. 1976) it is possible to obtain a multiple quantification by means of copies of the variables (Gifi 1990). After choosing the number *p* of quantifications, we can extend (1) as:

$$\mathbf{X} = \mathbf{G}\mathbf{C} = \sum\_{h=1}^{k} \mathbf{g}\_h \mathbf{c}\_h^{\mathsf{T}} \tag{2}$$

In neural network applications, fixing a low *p*, equal to 2 or 3, is usually enough for a good quantification of categorical variables even with high cardinality.
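The resulting network size can be verified directly; this small sketch (ours) computes the parameter counts for the sizes used in the example that follows (*m*=3 variables, *k*=100 categories each, *p*=2, *t*=512 hidden neurons), and for the equivalent network fed with plain OHE dummies:

```python
def embedding_params(m, k, p, t):
    """Weights in the embedding network of eq. (3): one k x p quantification
    matrix per categorical variable (no bias, linear activation), a dense
    hidden layer on the m*p concatenated outputs, and one linear output."""
    quantifications = m * k * p
    hidden = (m * p) * t + t      # weights + biases of the hidden layer
    output = t + 1                # weights + bias of the regression output
    return quantifications + hidden + output

def ohe_params(m, k, t):
    """Weights in the equivalent network of eq. (4) fed with m*k dummies."""
    hidden = (m * k) * t + t
    output = t + 1
    return hidden + output

print(embedding_params(m=3, k=100, p=2, t=512))  # 4697
print(ohe_params(m=3, k=100, t=512))             # 154625
```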

To introduce quantification (2) in a neural network, it is necessary to define, for each categorical variable, a distinct input and a dense layer with *p* neurons, without bias and with a linear activation function. In the next layer, the outputs coming from all the variables must be concatenated. For example, given 3 input categorical variables, each with 100 categories, and one hidden layer containing 512 neurons, using this approach we must estimate (considering a regression problem and *p*=2) 4,697 weights. Given *t*=512, *p*=2, *m*=3, *k<sub>j</sub>*=100 for each *j*, the neural network can be written:

$$\hat{\mathbf{y}} = \beta\_0 + \sum\_{s=1}^{t} \beta\_s \phi \left( \sum\_{j=1}^{m} \sum\_{r=1}^{p} \mathbf{G}\_j \mathbf{c}\_j^r \boldsymbol{\omega}\_{jrs} + \boldsymbol{\omega}\_{0s} \right) \tag{3}$$

where φ(·) is the activation function of the hidden layer and **c**<sub>*j*</sub><sup>*r*</sup> is the quantification of the *j*-th variable on the *r*-th dimension. Conversely, in the classical OHE encoding:

$$\hat{\mathbf{y}} = \beta\_0 + \sum\_{s=1}^{t} \beta\_s \phi \left( \sum\_{j=1}^{m} \sum\_{r=1}^{k\_j} \mathbf{g}\_{jr} w\_{jrs} + w\_{0s} \right) \tag{4}$$


obtaining 154,625 weights to estimate.

**G**<sub>*j*</sub> can be very big sparse matrices (with sparsity equal to 1 − 1/*k<sub>j</sub>*), but we can avoid building such an inefficient coding by estimating the dense matrix of quantifications **C**<sub>*j*</sub> of expression (3) directly, without building the sparse matrix **G**<sub>*j*</sub>.

In the first step, for a categorical variable *X*, the *k*-dimensional 'vocabulary' **V** of the categories has to be created and indexed. Then all the categories in the data are substituted by the corresponding numerical index in the vocabulary, similarly to what the Label Encoder does. Call *a<sub>i</sub>* the modality assumed by the categorical variable and **v**[*a<sub>i</sub>*] the index in the vocabulary corresponding to this modality. The *i*-th row of the (*n* × *p*) matrix of the quantified variable *X* can be expressed as:

$$\mathbf{x}\_i = \mathbf{C}[\mathbf{v}[a\_i]] \tag{5}$$
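Expression (5) amounts to a table lookup. A minimal sketch in plain Python of the vocabulary indexing and of the lookup (names and toy values are ours; in practice **C** is learned, not fixed):

```python
# Build the vocabulary V for one categorical variable and quantify it
# via a dense k x p matrix C, without ever materialising the sparse G.
categories = ["red", "green", "blue", "green", "red"]

vocab = {}                          # category -> integer index
for a in categories:
    vocab.setdefault(a, len(vocab))

# C: one p-dimensional row per category (here p = 2, toy values).
C = [[0.1, -0.3],                   # "red"
     [0.7,  0.2],                   # "green"
     [-0.5, 0.9]]                   # "blue"

# x_i = C[v[a_i]] -- the i-th row of the quantified (n x p) matrix
X = [C[vocab[a]] for a in categories]
print(X[3])                         # row for "green": [0.7, 0.2]
```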

Each row of the quantification matrix **C** can be seen as the *p*-dimensional representation of one category. Inspired by Natural Language Processing, the *entity embedding* technique of Guo & Berkhahn (2016) takes a similar approach. To estimate **C** in a supervised neural network, gradient descent and backpropagation can be used: the matrix **C** is initialized with random values drawn from a standardized normal and subsequently updated through an iterative procedure that minimizes the loss function, which in the case of regression is the classic Sum of Squared Errors. We call this technique LEE, Low Embedding Encoder. To illustrate the proposed approach, a small simulation for a regression problem was built. Given three qualitative variables *X*<sub>1</sub>, *X*<sub>2</sub>, *X*<sub>3</sub> with 200 categories each (coded as the integers between 1 and 200), for each variable 20,000 observations were drawn randomly from a uniform distribution; then *Y* was computed by the rules:

$$(X\_1 = X\_2 \text{ and } X\_3 > 100) \rightarrow Y \sim N(20, 1.5)$$
$$(X\_1 \neq X\_2 \text{ and } X\_3 > 100) \rightarrow Y \sim N(10, 1.5)$$
$$\text{else } Y \sim N(1, 1.5)$$

There are only 3 expected values *E*(*Y* | *x*<sub>1</sub>, *x*<sub>2</sub>, *x*<sub>3</sub>), i.e. (1, 10, 20), so an optimal regression model should predict these values. Note that the expected value of *Y* depends on the interaction of the three categorical variables and that the three conditional distributions of *Y* overlap in the tails. The dataset was then split into a training-set (50%) and a test-set (50%). Regression algorithms such as MORALS or a Regression Tree cannot make a satisfactory prediction on these data unless the interaction terms are explicitly introduced into the model, producing thousands of dummy variables. On the contrary, neural networks are able to detect the interactions autonomously, so a small neural network was chosen to predict the target *Y* in our simulation. The network includes an input layer, two hidden layers with 8 and 3 neurons (*elu* activation function), and 1 output neuron with linear activation function. With the LEE approach, each categorical variable is considered a separate input and one dense layer with 2 neurons (*p* = 2) and no bias, for each categorical variable, is added to the input. If we want to avoid sparse matrices, an *embedding* layer can be

added, for each original categorical variable, using (5). It was also checked that the results obtained on the test-set did not improve by changing the size of the network or the number of iterations. Even though the *Target Encoder* was also applied with a bigger neural network, with 32 neurons in each hidden layer, its result is very poor even on the training-set, as this encoding prevents the interactions from being identified.
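A stripped-down sketch of this training scheme for a single categorical variable (plain Python, our names and toy targets; the output weights are kept fixed for brevity, unlike in the full network described above):

```python
import random

# Minimal sketch of the LEE update: the rows of C are ordinary network
# parameters, initialised from a standard normal and moved by gradient
# descent on the squared error. One categorical variable with k = 5
# categories, p = 2; toy targets.
random.seed(0)
k, p = 5, 2
C = [[random.gauss(0, 1) for _ in range(p)] for _ in range(k)]
w = [0.5, -0.5]                       # fixed linear output weights
target = {c: float(c) for c in range(k)}

lr = 0.1
for _ in range(2000):
    c = random.randrange(k)           # one training observation
    pred = sum(C[c][r] * w[r] for r in range(p))
    err = pred - target[c]
    for r in range(p):
        C[c][r] -= lr * err * w[r]    # only row c receives a gradient

preds = [sum(C[c][r] * w[r] for r in range(p)) for c in range(k)]
print([round(v, 3) for v in preds])   # close to [0, 1, 2, 3, 4]
```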


**Table 1.** *Comparison between three approaches*

|  | *MSE - train* | *MSE - test* | *n. parameters* |
|---|---|---|---|
| OHE | 2.11 | 6.18 | 4839 |
| LEE | 2.55 | 4.82 | 1287 |
| Target Encoder | 61.47 | 61.48 | 1217 |

**Figure 1.** *OHE on the test-set* **Figure 2.** *LEE on the test-set*

### **4 Conclusions**


The proposed method LEE makes it possible to apply neural networks to categorical variables with high cardinality, reducing the number of parameters and the memory resources required. The results obtained show an increased predictive capacity of the neural network thanks to the more efficient architecture.

### **References**

DI CIACCIO, A. 2020. Categorical Encoding for Machine Learning. *Book of short papers SIS2020*, A. Pollice et al. eds., ISBN 9788891910776, Pearson Italia.

GIFI, A. 1990. *Nonlinear Multivariate Analysis*. John Wiley & Sons, New York.

GUO, C., & BERKHAHN, F. 2016. Entity embeddings of categorical variables. *arXiv*:1604.06737.

HANCOCK, J.T., & KHOSHGOFTAAR, T.M. 2020. Survey on categorical data for neural networks. *Journal of Big Data*, **7**, 28, https://doi.org/10.1186/s40537-020-00305-w.

YOUNG, F.W., DE LEEUW, J., & TAKANE, Y. 1976. Regression with qualitative and quantitative variables: an alternating least squares method with optimal scaling features. *Psychometrika*, **41**(4).
### ALI-MIKHAIL-HAQ COPULA TO DETECT LOW CORRELATIONS IN HIERARCHICAL CLUSTERING


F. Marta L. Di Lascio<sup>1</sup>, Andrea Menapace<sup>2</sup>, and Roberta Pappadà<sup>3</sup>

<sup>1</sup> Faculty of Economics and Management, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, (e-mail: marta.dilascio@unibz.it)

<sup>2</sup> Faculty of Science and Technology, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy, (e-mail: andrea.menapace@unibz.it)

<sup>3</sup> Department of Economics, Business, Mathematics and Statistics "B. de Finetti", University of Trieste, Italy, (e-mail: rpappada@units.it)

ABSTRACT: In this work we introduce a new dissimilarity measure based on the Ali-Mikhail-Haq copula, motivated by the empirical issue of detecting low correlations and discriminating variables with very similar rank correlation. This issue arises from the analysis of panel data concerning the district heating demand of the Italian city of Bozen-Bolzano. In the hierarchical clustering framework, we empirically investigate the features of the proposed measure and compare it with a classical dissimilarity measure based on Kendall's rank correlation.

KEYWORDS: Ali-Mikhail-Haq copula; cluster analysis; dissimilarity measure; low correlation.

#### 1 Introduction

Copula-based measures of association have been employed in clustering procedures in a variety of applied contexts, since they make it possible to describe complex multivariate dependence structures and to address specific features of the joint distribution of random variables, such as asymmetries and tail dependence (Durante & Sempi, 2015). For instance, the copula approach has made it possible to define pairwise dissimilarities in terms of concordance or tail dependence measures (see, e.g., Fuchs *et al.*, 2021, and the references therein).

While many contributions in this context have focused on detecting high association between extremely low/high values, in this paper we focus on modeling weak correlation and on the ability to discriminate objects with low and very similar degrees of dependence. This issue arises from the features of the district heating (DH hereafter) demand of residential users of the Italian city of Bozen-Bolzano. We thus propose a new dissimilarity measure based on the Ali-Mikhail-Haq (AMH hereafter) copula, and empirically compare it with a classical dissimilarity measure based on Kendall's τ coefficient.

The contribution is organized as follows. First, we introduce the copulabased dissimilarity measures (Sect. 2). Second, we present the cluster analysis performed to compare the proposed AMH-based dissimilarity with the one based on Kendall's τ (Sect. 3). Finally, Sect. 4 summarizes the main findings.

#### 2 Kendall's τ- and AMH-based dissimilarity


Here, we want to perform an agglomerative hierarchical clustering (AHC hereafter) of *m* continuous random variables (*X*<sub>1</sub>, ..., *X<sub>m</sub>*) defined on the same probability space by taking into account their stochastic dependence. A typical dissimilarity measure used in the AHC algorithm can be defined in terms of Kendall's τ coefficient as follows

$$d\_{jj'}^{\mathfrak{r}} = \sqrt{2(1 - \mathfrak{r}\_{jj'})} \in [0, 2] \tag{1}$$

where τ<sub>*jj*′</sub>, *j*, *j*′ ∈ {1, ..., *m*}, is computed from *n* observations of the pair (*X<sub>j</sub>*, *X*<sub>*j*′</sub>). From a different perspective, one can assume a specific copula function, motivated by its ability to capture some features of the joint behaviour observed in the data. Here we focus on the AMH copula function *C*(*u*<sub>1</sub>, *u*<sub>2</sub>) = (*u*<sub>1</sub>*u*<sub>2</sub>)/(1 − θ(1 − *u*<sub>1</sub>)(1 − *u*<sub>2</sub>)), where θ ∈ [−1, 1]. The AMH copula is very suitable for modeling low degrees of association, since the corresponding range for τ is [−0.1817, 0.3333]. Hence, we introduce a new dissimilarity measure

$$d\_{jj'}^{\mathbf{AMH}} = \sqrt{2(1 - \theta\_{jj'})} \in [0, 2] \tag{2}$$

where θ<sub>*jj*′</sub> is the dependence parameter of the AMH copula, which can be estimated via one of the methods in the literature (see, e.g., Gunky *et al.*, 2007).
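Under the stated range of θ, the dissimilarity (2) spans [0, 2]; a small Python sketch of the AMH copula, of its implied Kendall's τ, and of *d*<sup>AMH</sup> (function names are ours):

```python
import math

def amh_copula(u1, u2, theta):
    """Ali-Mikhail-Haq copula C(u1, u2), with theta in [-1, 1]."""
    return (u1 * u2) / (1 - theta * (1 - u1) * (1 - u2))

def kendall_tau_amh(theta):
    """Kendall's tau implied by the AMH copula (theta != 0)."""
    return 1 - 2 * (theta + (1 - theta) ** 2 * math.log(1 - theta)) / (3 * theta ** 2)

def d_amh(theta):
    """Dissimilarity (2): sqrt(2 * (1 - theta)), in [0, 2]."""
    return math.sqrt(2 * (1 - theta))

# The attainable tau range is roughly [-0.1817, 0.3333]:
print(round(kendall_tau_amh(-1.0), 4))     # -0.1817
print(round(kendall_tau_amh(0.9999), 4))   # 0.3333 (approaching 1/3)
print(d_amh(1.0), d_amh(-1.0))             # endpoints 0.0 2.0
```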

#### 3 Application to district heating demand

We analyse time series data concerning the heat demand (in kWh) of *m* = 41 residential users connected to the DH system of Bozen-Bolzano, a technology that has been identified as key for the development of sustainable cities. We consider *n* = 150 hourly observations in the period Jan 1–Jan 14, 2016. We first tackle serial dependence in the original time series by adopting a dynamic panel regression model (Wooldridge, 2002) that takes into account the relationships between DH demand and meteorological variables, such as temperature and solar radiation. Then, the residual time series are used to estimate the 41×41 dissimilarity matrices based on Eqs. (1) and (2) to be used in the AHC algorithm. The crucial point is that all pairs of users have a quite low Kendall's τ (the minimum is −0.2, the highest value is 0.39). Thus, in principle, *d*<sup>AMH</sup> should be able to better distinguish objects with low and very similar degrees of association. On the basis of both the informativeness of the final clusters and the separation index by Akhanli & Hennig (2020), we decided to adopt the complete linkage method and cut the dendrogram at *k* = 3 for both dissimilarities.
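A bare-bones sketch of complete-linkage AHC on a precomputed dissimilarity matrix (plain Python, our names; a stand-in for, not the authors' implementation of, the clustering step on the 41×41 matrices):

```python
# Complete-linkage agglomerative clustering on a dissimilarity matrix D,
# stopping when k clusters remain.
def ahc_complete(D, k):
    clusters = [[i] for i in range(len(D))]

    def dist(a, b):                    # complete linkage: max pairwise dissimilarity
        return max(D[i][j] for i in a for j in b)

    while len(clusters) > k:
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Toy 4 x 4 dissimilarity matrix: {0, 1} and {2, 3} are close pairs.
D = [[0.0, 0.2, 1.8, 1.9],
     [0.2, 0.0, 1.7, 1.6],
     [1.8, 1.7, 0.0, 0.3],
     [1.9, 1.6, 0.3, 0.0]]
print(ahc_complete(D, 2))   # [[0, 1], [2, 3]]
```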

Fig. 1 displays the mean daily pattern of each user (hourly heat demand over daily average heat demand; Menapace *et al.*, 2019) by cluster, according to *d*<sup>τ</sup> and *d*<sup>AMH</sup>. As can be seen, a certain degree of internal homogeneity is obtained in both cases, denoting an overall good quality of the results. However, by using static features of the buildings, such as heating surface (in *m*<sup>2</sup>), age class (in years), and energy class (in yearly *kWh*/*m*<sup>2</sup>), we can highlight the diversity between the obtained partitions. The clusters based on *d*<sup>τ</sup> are quite similar in terms of heating surface, with median values in the range (3656, 4076), and even though they are better separated in terms of age class and energy class, they also present a source of variability. Indeed, 75% of the buildings in cluster 1 were built between 1961 and 1990, in cluster 2 almost 70% of the buildings are dated after 1981, while cluster 3 has a larger variability and contains both recently-constructed and old energy-renovated buildings, with relatively low energy class (the third quartile is equal to 120). On the contrary, *d*<sup>AMH</sup> produces groups that differ in terms of heating surface (the medians are 3969, 5382, and 3102, respectively) and show within-cluster homogeneity with respect to the energy and age class (e.g. buildings in cluster 3 are old, i.e. mostly dated before 1990, and non-efficient, with first and third quartiles of energy class equal to 120 and 145, respectively).

Figure 1. *Mean daily pattern (hours in x-axis) of DH users according to AHC based on d*<sup>τ</sup> *and d*<sup>AMH</sup> *(panels by rows) in cluster 1, 2, and 3 (panels by columns).*


#### 4 Conclusions

We have introduced a new dissimilarity measure based on the Ali-Mikhail-Haq copula and empirically showed its ability to detect low correlations and discriminate among them. The application to district heating demand illustrates that the proposed measure seems to produce clusters with a clear interpretation in terms of the relevant features, thus providing a valuable tool to support the management and planning of a district heating system.

#### References

AKHANLI, S., & HENNIG, C. 2020. Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. *Stat. Comput.*, 30, 1523–1544.

DURANTE, F., & SEMPI, C. 2015. *Principles of Copula Theory*. CRC Press, Boca Raton.

FUCHS, S., DI LASCIO, F.M.L., & DURANTE, F. 2021. Dissimilarity functions for rank-invariant hierarchical clustering of continuous variables. *Comput. Stat. Data An.*, 159, 107201.

GUNKY, K., SILVAPULLE, M.J., & SILVAPULLE, P. 2007. Comparison of semiparametric and parametric methods for estimating copulas. *Comput. Stat. Data An.*, 51(6), 2836–2850.

MENAPACE, A., RIGHETTI, M., SANTOPIETRO, S., GARGANO, R., & DALVIT, G. 2019. Stochastic characterisation of the district heating load pattern of residential buildings. *Euroheat and Power*, 16(3–4), 14–19.

WOOLDRIDGE, J. 2002. *Econometric analysis of cross section and panel data*. Cambridge: MIT Press.


### **HIGHER EDUCATION AND EMPLOYABILITY: INSIGHTS FROM THE MANDATORY NOTICES OF THE MINISTRY OF LABOUR**


Maria Veronica Dorgali<sup>1</sup>, Silvia Bacci<sup>1</sup>, Bruno Bertaccini<sup>1</sup> and Alessandra Petrucci<sup>1</sup>

**ABSTRACT**: The Bologna Process has brought significant changes in the national education systems, increasing student mobility and expanding available options of education and training. Thus, an academic degree may no longer be sufficient to access the most prestigious and remunerative occupational positions. Relying on two sources of data, the Mandatory Notices of the Ministry of Labour datasets and the administrative database of University of Florence (UNIFI) students, this work aims to provide an overview of UNIFI graduates' employment and labour market participation. Preliminary results are provided.

**KEYWORDS**: bivariate random-effects probit model, higher education, logit model, occupational condition.

### **1 Introduction**

In the twenty-first century the system of Higher Education (HE) in Italy has undergone profound structural changes, with a substantial increase in the number of higher education institutions (HEIs). The Bologna Process has brought significant changes in the education system, increasing student mobility and expanding the available options of education and training. Thus, an academic degree may no longer be sufficient to access the most prestigious and remunerative occupational positions (Breen & Goldthorpe, 1997). According to Rostan and Stan (2017), Italian graduates' employment conditions can be explained by two main points. Firstly, even if Italy is one of the most industrialised countries in Europe, its production system is characterized by small and medium-sized firms, a poorer capacity for innovation, and private and public sectors less developed than in other advanced economies. Moreover, R&D investments are insufficient and, in the last two decades, the public sector has lost its capacity of being the major employer of Italian graduates (ANVUR, 2014). In addition, access to the liberal professions is limited by the high

<sup>1</sup> Department of Statistics, Computer Science, Applications "G. Parenti", University of Florence (e-mail: mariaveronica.dorgali@unifi.it, silvia.bacci@unifi.it, Bruno.bertaccini@unifi.it, alessandra.petrucci@unifi.it)

degree of entry regulation, and the proportion of graduates employed in professional and managerial jobs has declined since 1990 (Ballarino et al., 2016). In a few words, the national economy seems to lack the characteristics needed to valorise and reward qualified human capital (Rostan & Stan, 2017). Secondly, the expansion of HE in Italy is often not associated with the demand for skilled workers and can be explained by other factors, such as the increase of family income, the pressure of some social classes to obtain or maintain education advantages, and the role of the state and the academy (Rostan & Stan, 2017). In this perspective, the growth of the education system has led to an oversupply of graduates, especially in some fields, worsening the employment and working conditions of degree holders (Rostan & Stan, 2017). As underlined by Assirelli et al. (2018), the 2015 unemployment rate among individuals aged 25 to 34 was higher than the corresponding value for upper secondary graduates.

In this contribution, we aim at studying the topic at issue in depth, relying on two main sources of data: the Mandatory Notices (MN) of the Italian Ministry of Labour and the administrative database of the University of Florence (UNIFI). In particular, we focus on detecting the determinants of two main variables of interest: (i) the probability of being employed and (ii), conditionally on being employed, the probability of having a permanent job.

#### **2 Data**


The analysis is based on the integration of the MN database and the UNIFI administrative archive.

The MN database is provided by the Ministry of Labour and collects information on the job contracts signed by graduates in the years after graduation, such as type of contracts (open-ended, fixed term, short term, permanent, etc.), number of working days per contract, contract effective date, graduate age and gender, economic sector. Self-employment jobs are not included in the MN database.

The UNIFI administrative archive allows us to integrate the MN dataset with information about graduates, such as enrolment date, graduation date, graduation mark, type of high school, high school graduation mark, description of the degree course, level of degree course (i.e., bachelor vs. master degree), and field of study.

The two datasets were merged using a probabilistic record linkage approach. The archive contains data on about 262,250 contracts signed by 46,931 UNIFI graduates from 1 January 2008 to 31 December 2016. All the information refers to UNIFI students that obtained their degree between 2008 and 2016. Overall, more than 60% of contracts were signed after graduation, the 37.17% within 3 years from graduation and almost the 29% more than 3 years after graduation.

Focusing on the contract signed after graduation and on those signed while studying (or during university) the most common contract among UNIFI students (bachelor and master level graduates and five-years masters) was the temporary one (59.13%); only the 10.35% of contracts were permanents. The 19.81% of contracts belongs to the category "Others" that includes "atypical" or "non standard" contracts. More in detail, permanent contracts were, respectively, the third (8.77%) and the fourth (8.59%) most common type of contract among bachelor and master degree graduates, respectively.
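The probabilistic record linkage used to merge the two archives can be illustrated with a toy sketch. Everything below is hypothetical: the field names, weights, and threshold are invented for illustration, since the abstract does not describe the actual UNIFI-MN linkage keys; the score is a simplified Fellegi-Sunter-style agreement weight.

```python
from difflib import SequenceMatcher

# Hypothetical linkage fields; the real UNIFI-MN keys are not described here.
# Each record: (name, birth_date).
def field_similarity(a, b):
    """Approximate string agreement in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, w_name=0.7, w_birth=0.3):
    """Weighted agreement score: a toy stand-in for the match weight
    used in probabilistic record linkage."""
    name_s = field_similarity(rec_a[0], rec_b[0])
    birth_s = 1.0 if rec_a[1] == rec_b[1] else 0.0
    return w_name * name_s + w_birth * birth_s

unifi = [("Maria Rossi", "1990-04-12"), ("Luca Bianchi", "1992-07-01")]
mn = [("M. Rossi", "1990-04-12"), ("Luca Bianche", "1992-07-01")]

# Accept a link when the score exceeds a chosen threshold.
links = [(i, j) for i, a in enumerate(unifi) for j, b in enumerate(mn)
         if match_score(a, b) > 0.8]
```

A pair is linked even under small typos or abbreviations, which is the point of the probabilistic (rather than exact-key) approach.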

#### **3 Preliminary analyses**

As preliminary analyses, we estimated two logistic regression models to detect the determinants of the probability of getting the first job one year after graduation (Table 1) and of the probability, one year after graduation, of having a permanent job contract (Table 2).

**Table 1 Logistic regression results (Y = obtain the first job one year after graduation)**

| Variable | Estimate (Bachelor) | SE (Bachelor) | Estimate (Master) | SE (Master) |
|---|---|---|---|---|
| Intercept | -1.8040 | 0.0730\*\*\* | 0.3320 | 0.0879\*\*\* |
| *Gender (Ref: "Female")* | | | | |
| Male | -0.0498 | 0.0388 | -0.0193 | 0.0636 |
| *Age at first job (Ref: 23-26)* | | | | |
| 20-23 | 3.2130 | 0.0754\*\*\* | -2.4416 | 0.1428\*\*\* |
| 26-30 | 4.2089 | 0.0831\*\*\* | 0.8063 | 0.0646\*\*\* |
| 30+ | 3.9773 | 0.0990\*\*\* | 1.0651 | 0.0855\*\*\* |
| *Final grade (Ref: "106-110")* | | | | |
| 75-95 | -0.6303 | 0.0583\*\*\* | 0.0928 | 0.0944 |
| 96-100 | -0.2899 | 0.0571\*\*\* | 0.0788 | 0.0914 |
| 101-105 | -0.1794 | 0.0547\*\* | 0.0541 | 0.0830 |
| *Study area (Ref: Literature)* | | | | |
| Scientific | 0.3496 | 0.0471\*\*\* | 0.2049 | 0.0751\*\* |
| Social | 0.5700 | 0.0479\*\*\* | 0.3174 | 0.0724\*\*\* |
| *Outside of prescribed time (Ref: "No")* | | | | |
| Yes | -1.6370 | 0.0562\*\*\* | -0.7235 | 0.0728\*\*\* |
| *Honours (Ref: "No")* | | | | |
| Yes | 0.1543 | 0.0700\* | -0.0536 | 0.0854 |

**Table 2 Logistic regression results (Y = obtain a permanent job contract one year after graduation)**

| Variable | Estimate (Bachelor) | SE (Bachelor) | Estimate (Master) | SE (Master) |
|---|---|---|---|---|
| Intercept | 3.1912 | 0.1173\*\*\* | 3.6304 | 0.2020\*\*\* |
| *Gender (Ref: "Female")* | | | | |
| Male | -0.0163 | 0.07437 | -0.2781 | 0.1231\* |
| *Age at first job (Ref: 23-26)* | | | | |
| 20-23 | 0.0491 | 0.1095 | -0.1897 | 0.2109 |
| 26-30 | -0.2141 | 0.0800\*\* | -0.0822 | 0.1387 |
| 30+ | -0.6456 | 0.1140\*\*\* | -0.5268 | 0.1564\*\*\* |
| *Final grade (Ref: "106-110")* | | | | |
| 75-95 | -0.2175 | 0.1123 | -0.5843 | 0.1862\* |
| 96-100 | -0.1871 | 0.1102 | -0.2413 | 0.1942 |
| 101-105 | -0.0990 | 0.1070 | -0.4297 | 0.1725\*\* |
| *Study area (Ref: Literature)* | | | | |
| Scientific | -0.0363 | 0.0899 | -0.3276 | 0.1558\* |
| Social | 0.0968 | 0.0915 | -0.1453 | 0.1543 |
| *Outside prescribed time (Ref: "No")* | | | | |
| Yes | -0.1631 | 0.0916 | -0.1619 | 0.1507 |
| *Honours (Ref: "No")* | | | | |
| Yes | -0.0692 | 0.1315 | -0.0256 | 0.1934 |

Looking at the results of these preliminary analyses, it seems that age at first job, the final graduation mark, the study area, and being outside of the prescribed degree path play an important role in predicting the professional achievements of bachelor's and master's graduates.
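The maximum-likelihood machinery behind estimates and standard errors of the kind reported in Tables 1 and 2 can be reproduced in miniature. The sketch below fits a logit model by Newton-Raphson on synthetic data (not the UNIFI archive); the single dummy regressor and its coefficients are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: intercept plus one dummy regressor (think of an
# indicator such as "outside prescribed time: yes"); the true
# coefficients are chosen arbitrarily for the illustration.
n = 2000
X = np.column_stack([np.ones(n), rng.integers(0, 2, n)])
beta_true = np.array([0.5, -1.2])
p = 1 / (1 + np.exp(-X @ beta_true))
y = (rng.random(n) < p).astype(float)

def fit_logit(X, y, iters=25):
    """Newton-Raphson for the logit log-likelihood; returns the
    coefficients and their standard errors (inverse Fisher information)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))
        W = mu * (1 - mu)
        hessian = X.T @ (W[:, None] * X)
        grad = X.T @ (y - mu)
        beta = beta + np.linalg.solve(hessian, grad)
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, se

beta_hat, se_hat = fit_logit(X, y)
```

With n = 2000 the estimates recover the true coefficients up to sampling error of the order of the reported standard errors.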

#### **4 Further developments**

The preliminary analyses displayed above represent a first step of our study. These analyses are static, because they refer to the job condition of graduates one year after the degree. To allow a dynamic analysis that takes into account the longitudinal structure of the data, we intend to estimate a bivariate random-effects probit model. In particular, we will model, at each time occasion, the employment status (employed vs. unemployed) and, given the employment status, the type of job contract (permanent vs. temporary).
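To fix ideas on a bivariate random-effects probit of this kind, the sketch below simulates its data-generating process: a shared individual random effect links the employment equation to the contract-type equation, and the latter is observed only for the employed. All coefficients and variances are illustrative, not estimates from the data.

```python
import numpy as np

rng = np.random.default_rng(1)

n_grad, n_occasions = 500, 4

# Individual random effect shared by the two equations.
u = rng.normal(0.0, 0.8, size=n_grad)

x = rng.normal(size=(n_grad, n_occasions))   # a time-varying covariate

# Equation 1: employed vs. unemployed at each occasion.
employed = (0.3 + 0.6 * x + u[:, None] + rng.normal(size=x.shape)) > 0

# Equation 2: permanent vs. temporary, defined only when employed.
lin2 = -0.8 + 0.4 * x + u[:, None] + rng.normal(size=x.shape)
permanent = np.where(employed, lin2 > 0, np.nan)

emp_rate = employed.mean()
perm_rate = np.nanmean(permanent)
```

The shared effect u induces the dependence between the two outcomes that the bivariate specification is meant to capture; the NaN entries mark occasions at which the contract type is undefined because the graduate is unemployed.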

#### **References**

ANVUR. 2014. *Rapporto sullo stato del sistema universitario e della ricerca 2013*. Rome: ANVUR.

ASSIRELLI, G., BARONE, C., & RECCHI, E. 2018. You Better Move On: Determinants and Labour Market Outcomes of Graduate Migration from Italy. *International Migration Review*, **53**, 4–25.

BALLARINO, G., BARONE, C., & PANICHELLA, N. 2016. Origini sociali e occupazione in Italia. *Rassegna Italiana di Sociologia*, **57**, 103–134.

BREEN, R., & GOLDTHORPE, J. H. 1997. Explaining educational differentials: Towards a formal rational action theory. *Rationality and Society*, **9**, 275–305.

ROSTAN, M., & STAN, A. 2017. Italian graduates' employability in times of economic crisis: overview, problems, and possible solutions. *Sociologica*, Serie II. doi: 10.4000/sociologico.1818.


### AN ALTERNATIVE TO JOINT GRAPHICAL LASSO FOR LEARNING MULTIPLE GAUSSIAN GRAPHICAL MODELS


Lorenzo Focardi Olmi1, Anna Gottard1

<sup>1</sup> Dipartimento di Statistica, Informatica, Applicazioni 'G. Parenti' (DiSIA), University of Florence, (e-mail: lorenzo.focardiolmi@unifi.it, anna.gottard@unifi.it)

ABSTRACT: Gaussian graphical models are widely used to learn the conditional independence structure of a set of random variables, which is encoded by the nonzero elements of the precision matrix. In many practical situations, one needs to estimate multiple graphical models due to a group structure in the data. We propose a neighbourhood approach to jointly learn multiple Gaussian graphical models. Our method estimates the edge set of each graph through joint lasso regression; a constrained maximum likelihood method is then used to obtain the precision matrices. The estimation procedure can be refined with prior information about relations among groups.

KEYWORDS: Gaussian graphical models, graphical lasso, joint lasso, multiple graphs

#### 1 Introduction

Graphical models represent conditional independence relations among a set of random variables via a graph. Recovering the graph structure of a concentration graph model is equivalent to finding the zero elements of the precision matrix (Lauritzen, 1996).

Several recent proposals have focused on estimating Gaussian graphical models when data come from distinct subpopulations. In particular, Guo *et al.* (2011) suggested a hierarchical penalty that forces a similar sparsity pattern across classes without shrinking the nonzero elements. Danaher *et al.* (2014) proposed a direct extension of the graphical lasso (Friedman *et al.*, 2008) using two different convex penalties to force the precision matrices to be similar. Dondelinger & Mukherjee (2018) developed a lasso-type penalty to handle observations divided into groups in a regression setting.

In this work, we propose a nodewise regression approach to jointly estimate multiple Gaussian graphical models, using a penalty similar to the one proposed by Dondelinger & Mukherjee (2018) to induce similarities in the zero entries of the regression coefficients. A full estimate of the precision matrices is then obtained via a constrained maximum likelihood approach in each group.

#### 2 Nodewise multiple graphical models


Meinshausen & Bühlmann (2006) first proposed the idea of neighbourhood selection based on penalized linear regressions. Their proposal consists in performing $d$ lasso regressions, one for each variable as response given the other $d-1$ variables in the graph. To extend this procedure to group-structured data, consider $Y = (Y^{(1)}, \ldots, Y^{(K)})$ from $K$ different groups, where $Y^{(k)}$ is an $n_k \times d$ matrix. Within each group, we assume observations to be independent and identically distributed as $Y^{(k)} \sim N_d(0, \Sigma_k)$.

To extend neighbourhood selection to multiple graphs, we propose to adopt a penalty term similar to the one used in the joint lasso by Dondelinger & Mukherjee (2018). Estimation is achieved by minimizing

$$\widehat{\Theta}_{i} = \underset{(\theta_{i,1}, \dots, \theta_{i,K})}{\arg\min} \sum_{k=1}^{K} \left( \frac{1}{n_k} \|Y_i^{(k)} - Y_{-i}^{(k)} \theta_{i,k}\|_2^2 + \lambda \|\theta_{i,k}\|_1 + \gamma \sum_{k'>k} \tau_{k,k'} \|\theta_{i,k} - \theta_{i,k'}\|_1 \right), \tag{1}$$

where $\lambda$, $\gamma$ and $\tau = \{\tau_{k,k'} : k' > k\}$ are tuning parameters. The last term of Equation (1) allows exact equality between coefficients from different groups, with $\tau$ weighting each pair of groups differently. The vector $\tau$ makes it possible to apply a specific shrinkage only to some pairs of groups; it is set to 1 in the rest of the paper.

Similarly to Meinshausen & Bühlmann (2006), we use Equation (1) nodewise. The neighbourhood of node $i$ for the $k$th group is then $\mathrm{ne}^{(k)}_i = \{ j \in \{1,\dots,d\} : \widehat{\theta}_{i,k,j} \neq 0\}$, while the selected edge set is given by $\widehat{E}^{(k)} = \{(i,j) : i \in \mathrm{ne}^{(k)}_j \wedge j \in \mathrm{ne}^{(k)}_i\}$. To obtain an estimate of the precision matrix for each group, we adopt a two-step approach: we first learn the edge set, and then we use a constrained maximum likelihood method with the given zero elements. Let $S^{+}_{E^{(k)}} = \{\Omega : \Omega \succ 0 \wedge \omega_{ij} = 0\ \forall (i,j) \notin E^{(k)}\}$ be the set of positive definite matrices with support defined by $E^{(k)}$. The precision matrix estimate is

$$\widehat{\Omega}^{(k)} = \underset{\Omega \in S^{+}_{\widehat{E}^{(k)}}}{\arg\min} \left( \operatorname{tr}(S^{(k)} \Omega) - \log \det(\Omega) \right) \qquad \forall k \in \{1, \dots, K\},$$

where $S^{(k)}$ denotes the sample covariance matrix of group $k$.

This two-step procedure ensures a positive definite estimate. However, using the same data for model selection and parameter estimation is known to lead to invalid inference (Tian & Taylor, 2018). Thus, post-selection inference, such as data splitting or data carving procedures, needs to be used.
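As a concrete reading of Equation (1), the sketch below evaluates the nodewise joint-lasso criterion for one node across $K$ groups. This is a plain numpy evaluation of the objective on arbitrary data, not the optimizer used in the paper; data and tuning values are illustrative.

```python
import numpy as np

def joint_lasso_objective(Y_list, i, thetas, lam, gamma, tau=None):
    """Value of the nodewise criterion in Equation (1) for node i.

    Y_list : list of (n_k, d) data matrices, one per group
    thetas : list of length-(d-1) coefficient vectors, one per group
    """
    K = len(Y_list)
    if tau is None:                       # tau set to 1, as in the paper
        tau = np.ones((K, K))
    value = 0.0
    for k, (Y, th) in enumerate(zip(Y_list, thetas)):
        resid = Y[:, i] - np.delete(Y, i, axis=1) @ th
        value += resid @ resid / Y.shape[0]          # squared-error loss
        value += lam * np.abs(th).sum()              # lasso penalty
        for k2 in range(k + 1, K):                   # fusion penalty
            value += gamma * tau[k, k2] * np.abs(th - thetas[k2]).sum()
    return value

rng = np.random.default_rng(0)
Y_list = [rng.normal(size=(30, 5)) for _ in range(2)]
thetas = [np.zeros(4), np.zeros(4)]
base = joint_lasso_objective(Y_list, 0, thetas, lam=0.1, gamma=0.1)
```

When the coefficient vectors of two groups coincide, the fusion term vanishes, which is exactly how the penalty encourages shared sparsity patterns.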

Nodewise regression relies on the selection of the tuning parameters $\lambda$ and $\gamma$. We use a slightly modified version of the StARS (Stability Approach to Regularization Selection) algorithm proposed by Liu *et al.* (2010).

For a chosen $b$, $1 < b < n$, we draw $N$ random subsamples $X_1, \dots, X_N$ from $Y$, each of size $b$. Given a value of $\lambda$ and $\gamma$, we apply nodewise joint estimation in each subsample. Let $D(\lambda, \gamma)$ be the maximum over groups of the average edge instability across subsamples. We use a Bayesian optimization technique based on Gaussian processes (Snoek *et al.*, 2012) to obtain optimal values of the tuning parameters, minimizing the instability measure $|D(\lambda, \gamma) - \beta|$ with $\beta$ to be set. The performance of the proposed procedure is illustrated in the next section.
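The StARS-style instability measure can be illustrated with a toy stand-in graph estimator. Here a thresholded sample correlation replaces the joint lasso (a deliberate simplification) and a single group is used, so the max-over-groups step reduces to one term; the subsample size, threshold, and data are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_edge_estimator(Y, thresh):
    """Stand-in for nodewise joint estimation: selects edge (i, j)
    when the absolute sample correlation exceeds thresh."""
    C = np.corrcoef(Y, rowvar=False)
    E = np.abs(C) > thresh
    E[np.diag_indices(C.shape[0])] = False
    return E

def instability(Y, thresh, b, N):
    """StARS-style instability: for each edge, the selection frequency
    p over N subsamples of size b gives instability 2 p (1 - p);
    D is the average over all edges (one group, so no max needed)."""
    n, d = Y.shape
    freq = np.zeros((d, d))
    for _ in range(N):
        idx = rng.choice(n, size=b, replace=False)
        freq += toy_edge_estimator(Y[idx], thresh)
    p = freq / N
    xi = 2 * p * (1 - p)
    iu = np.triu_indices(d, 1)
    return xi[iu].mean()

Y = rng.normal(size=(200, 6))
D = instability(Y, thresh=0.2, b=120, N=20)
```

By construction $2p(1-p) \le 1/2$, so the instability lies in $[0, 0.5]$: edges selected always or never contribute nothing, while edges selected about half the time are maximally unstable.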

#### 3 Monte Carlo simulations

This Monte Carlo study considers a simple setting with only two groups ($K = 2$). We generate a random graph structure with $d = 15$ nodes and the corresponding precision matrix $\Omega^{(1)}$ as described in Danaher *et al.* (2014). To generate $\Omega^{(2)}$, we randomly change some entries of $\Omega^{(1)}$, adding edges, removing them, or varying partial correlation coefficients.

We simulate 50 datasets of dimension $n = 150$ from $Y = (Y^{(1)}, Y^{(2)})$, where $Y^{(i)} \sim N(0, \Sigma_i)$, $\Sigma_i$ being the inverse of $\Omega^{(i)}$, $i = 1, 2$. We then run tuning parameter selection with $\beta = 0.1$ and $N = 30$ to optimize the nodewise selection algorithm in three different situations: structure estimation only, precision matrix estimation using data carving ($p = 0.9$), and precision matrix estimation using data splitting ($p = 0.5$), where $p$ is the proportion of data used to estimate the structure. We evaluate the edge selection performance using the false negative rate (FNR) and false positive rate (FPR), while the precision matrix estimates are compared using the entropy loss (EL) and Frobenius loss (FL). Simulation results are summarized in Table 1.
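A data-generating scheme of this kind can be sketched as follows. A diagonally dominant construction is used here to keep both precision matrices positive definite, whereas the paper follows Danaher *et al.* (2014); the sparsity level and the perturbed entry are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 15

# Omega1: sparse symmetric off-diagonal pattern, made positive
# definite via diagonal dominance.
A = (rng.random((d, d)) < 0.1) * rng.uniform(-0.4, 0.4, (d, d))
A = np.triu(A, 1)
Omega1 = A + A.T
np.fill_diagonal(Omega1, np.abs(Omega1).sum(axis=1) + 1.0)

# Omega2: perturb an entry of Omega1 (adding or modifying an edge),
# then restore diagonal dominance.
Omega2 = Omega1.copy()
Omega2[0, 1] = Omega2[1, 0] = 0.3
np.fill_diagonal(Omega2, 0.0)
np.fill_diagonal(Omega2, np.abs(Omega2).sum(axis=1) + 1.0)

# Sample one dataset per group from the implied covariances.
Y1 = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega1), size=150)
Y2 = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega2), size=150)
```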

Table 1. *Monte Carlo summary of performance*

| Method | EL | FL | FNR | FPR |
|---|---|---|---|---|
| Structure learning | – | – | 0.2673 | 0.0823 |
| Data Carving | 3.0059 | 0.5516 | 0.3087 | 0.0767 |
| Data Splitting | 3.1110 | 0.5590 | 0.4080 | 0.0700 |

To summarize, if one is mainly interested in structure learning, our proposal shows a slightly better performance than the other existing procedures (not reported here), mostly because of fewer false negative errors. If the aim is to estimate the precision matrix through the two-step procedure, data carving seems to be a better option than data splitting.

#### References

DANAHER, PATRICK, WANG, PEI, & WITTEN, DANIELA. 2014. The joint graphical lasso for inverse covariance estimation across multiple classes. *Journal of the Royal Statistical Society Series B (Statistical Methodology)*, 76(2), 373–397.

DONDELINGER, FRANK, & MUKHERJEE, SACH. 2018. The joint lasso: high-dimensional regression for group structured data. *Biostatistics (Oxford, England)*, 21, 219–235.

FRIEDMAN, JEROME, HASTIE, TREVOR, & TIBSHIRANI, ROBERT. 2008. Sparse inverse covariance estimation with the graphical LASSO. *Biostatistics*, 9(3), 432–441.

GUO, JIAN, MICHAILIDIS, GEORGE, ZHU, JI, *et al.* 2011. Joint estimation of multiple graphical models. *Biometrika*, 98(1), 1–15.

LAURITZEN, STEFFEN L. 1996. *Graphical Models*. Oxford University Press.

LIU, HAN, ROEDER, KATHRYN, & WASSERMAN, LARRY. 2010. Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models. *Pages 1432–1440 of:* LAFFERTY, J. D., WILLIAMS, C. K. I., SHAWE-TAYLOR, J., ZEMEL, R. S., & CULOTTA, A. (eds), *Advances in Neural Information Processing Systems 23*. Curran Associates, Inc.

MA, J., & MICHAILIDIS, GEORGE. 2016. Joint structural estimation of multiple graphical models. *Journal of Machine Learning Research*, 17(09), 1–44.

MEINSHAUSEN, NICOLAI, & BÜHLMANN, PETER. 2006. High-dimensional graphs and variable selection with the Lasso. *The Annals of Statistics*, 34(3), 1436–1462.

SNOEK, JASPER, LAROCHELLE, HUGO, & ADAMS, RYAN P. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. *Pages 2951–2959 of: Advances in Neural Information Processing Systems 25 (NIPS 2012)*. Curran Associates, Inc.

TIAN, XIAOYING, & TAYLOR, JONATHAN. 2018. Selective inference with a randomized response. *The Annals of Statistics*, 46(2), 679–710.


### FUNCTIONAL CLUSTER ANALYSIS OF HDI EVOLUTION IN EUROPEAN COUNTRIES


Francesca Fortuna1, Alessia Naccarato1 and Silvia Terzi1

<sup>1</sup> Department of Economics, Roma Tre University, (e-mail: francesca.fortuna@uniroma3.it, alessia.naccarato@uniroma3.it, silvia.terzi@uniroma3.it)

ABSTRACT: The contribution aims to study the evolutionary aspects of a well-being indicator in European countries. To this end, an evolutionary indicator is proposed by considering the indicator as a function and integrating the information provided by the well-being curve with its temporal dynamic reflected by the first derivative. Then, functional cluster analysis is considered to derive groups of geographical areas that account not only for the indicator's level, but also for its evolution.

KEYWORDS: Human Development Index, FDA, functional clustering.

### 1 Introduction

Well-being indicators are commonly used to support decision making and to assess the performance of countries. However, well-being indicators are generally considered from a static point of view, disregarding their temporal dynamics. Our aim is to exploit the evolutionary aspect of a well-being indicator. To this end, temporal sequences of well-being indicators are analyzed from a functional point of view. Thus, indicators are considered as functions rather than scalar vectors. This is a novel perspective in well-being processing, which makes it possible to introduce new analytical tools, such as derivatives. Since the latter quantify a function's behavior in an evolutionary perspective, we suggest integrating the information provided by the well-being curve with the information concerning its first-order derivative. Specifically, we focus on the problem of clustering well-being curves using the functional k-means algorithm under different distances, in order to identify specific common patterns among the countries. The procedure is applied to a real data set regarding the annual time series of the Human Development Index (HDI) for 44 European countries. We compare the clusters obtained by the functional k-means algorithm with the clusters derived in a non-functional environment via a k-means algorithm applied to raw data of an evolutionary integrated HDI, say *EHDI*. The latter is defined as $EHDI = HDI \, [1 + f'(x)]$ and integrates HDI with the information provided by its first derivative, $f'(x)$, in order to discount for a decreasing or increasing evolution of the HDI.
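Under this definition, *EHDI* can be computed directly from a discretized HDI series; the sketch below approximates the first derivative by finite differences on the yearly grid. The series itself is illustrative, not the real HDI values.

```python
import numpy as np

years = np.arange(2000, 2020)
hdi = np.linspace(0.80, 0.90, years.size)   # illustrative rising HDI series

# First derivative approximated by finite differences on the yearly grid.
f_prime = np.gradient(hdi, years)

# Evolutionary integrated indicator: EHDI = HDI * (1 + f'(x)).
ehdi = hdi * (1.0 + f_prime)
```

For a rising series the derivative is positive, so *EHDI* exceeds HDI; a declining series would be discounted instead.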

#### 2 Functional distances

FUNCTIONAL CLUSTER ANALYSIS OF HDI EVOLUTION IN EUROPEAN COUNTRIES Francesca Fortuna1, Alessia Naccarato1 and Silvia Terzi1

<sup>1</sup> Department of Economics, Roma Tre University (e-mail: francesca.fortuna@uniroma3.it, alessia.naccarato@uniroma3.it, silvia.terzi@uniroma3.it)

ABSTRACT: The contribution aims to study the evolutionary aspects of a well-being indicator in European countries. To this end, an evolutionary indicator is proposed by considering the indicator as a function and integrating the information provided by the well-being curve with its temporal dynamic, reflected by the first derivative. Then, functional cluster analysis is used to derive groups of geographical areas that account not only for the indicator's level, but also for its evolution.

KEYWORDS: Human Development Index, FDA, functional clustering.

#### 1 Introduction

Well-being indicators are commonly used to support decision making and to assess the performance of countries. However, well-being indicators are generally considered from a static point of view, disregarding their temporal dynamics. Our aim is to exploit the evolutionary aspect of a well-being indicator. To this end, temporal sequences of well-being indicators are analyzed from a functional point of view. Thus, indicators are considered as functions rather than scalar vectors. This is a novel perspective in well-being analysis, which allows the introduction of new analytical tools, such as derivatives. Since the latter quantify a function's behavior from an evolutionary perspective, we suggest integrating the information provided by the well-being curve with the information concerning its first-order derivative. Specifically, we focus on the problem of clustering well-being curves using the functional k-means algorithm under different distances, in order to identify common patterns among the countries. The procedure is applied to a real data set of annual time series of the Human Development Index (HDI) for 44 European countries. We compare the clusters obtained by the functional k-means algorithm with the clusters derived in a non-functional environment via a k-means algorithm applied to the raw data of an evolutionary integrated HDI, denoted *EHDI*. The latter is defined as *EHDI* = *HDI* [1 + *f*′(*x*)] and integrates HDI with the information provided by the first derivative.
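As a minimal numerical illustration of the integrated indicator (the HDI series below and the finite-difference derivative are assumptions for the example; the paper derives *f*′ from the smoothed curve instead):

```python
import numpy as np

# Hypothetical annual HDI values for one country (illustrative only).
hdi = np.array([0.80, 0.81, 0.82, 0.84, 0.85, 0.86, 0.88, 0.89])

# Finite-difference approximation of the first derivative f'(x); the paper
# instead obtains the derivative from the smoothed (B-spline) curve.
f_prime = np.gradient(hdi)

# Evolutionary integrated indicator: EHDI = HDI * (1 + f'(x)).
ehdi = hdi * (1.0 + f_prime)
print(np.round(ehdi, 4))
```

Since annual HDI increments are small, *EHDI* stays close to the HDI level while slightly rewarding countries whose indicator is improving.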

To identify common patterns among the HDI curves, the functional k-means algorithm (Tarpey & Kinateder, 2003) is considered using the following distances:

$$d\_0\left(f\_i(t), f\_j(t)\right) = \int\_T \left(f\_i(t) - f\_j(t)\right)^2 dt, \quad \forall i \neq j; \ i, j = 1, 2, \ldots, n; \quad (1)$$

where $f_i(t) = \sum_{k=1}^{K} a_{ik}\varphi_k(t)$ is the expansion of each curve in terms of *K* cubic B-spline basis functions (Ramsay & Silverman, 2005);

$$d\_{0+1}\left(f\_i(t), f\_j(t)\right) = \sqrt{\int\_T \left(f\_i(t) - f\_j(t)\right)^2 dt + \int\_T \left(f\_i'(t) - f\_j'(t)\right)^2 dt},\quad(2)$$

where $f_i'(t)$ denotes the smoothing estimate of the first derivative of $f_i(t)$. The distances in (1) and (2) are the norm and the semi-norm in the Hilbert space, respectively. The semi-norm $d_{0+1}$ accounts both for the level of the well-being curve and for its evolutionary dynamic.
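A minimal numerical sketch of the smoothing and of the two distances (the synthetic series, evaluation grid and quadrature below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(0)
years = np.linspace(2000, 2019, 20)
grid = np.linspace(2000, 2019, 200)

def smooth(y, x=years, n_basis=5, k=3):
    """Least-squares fit with n_basis cubic B-spline basis functions (K = 5)."""
    interior = np.linspace(x[0], x[-1], n_basis - k + 1)[1:-1]
    t = np.r_[[x[0]] * (k + 1), interior, [x[-1]] * (k + 1)]
    return make_lsq_spline(x, y, t, k=k)

def quad(v):
    """Trapezoidal rule on the evaluation grid."""
    return float(np.sum((v[:-1] + v[1:]) * np.diff(grid)) / 2.0)

def d0(f, g):
    """Distance (1): integrated squared difference of the curves."""
    return quad((f(grid) - g(grid)) ** 2)

def d01(f, g):
    """Distance (2): combines curve and first-derivative differences."""
    der = quad((f.derivative()(grid) - g.derivative()(grid)) ** 2)
    return np.sqrt(d0(f, g) + der)

# Two synthetic "HDI-like" trajectories with different dynamics.
f = smooth(0.80 + 0.004 * (years - 2000) + 0.002 * rng.standard_normal(20))
g = smooth(0.88 - 0.004 * (years - 2000) + 0.002 * rng.standard_normal(20))
print(d0(f, g), d01(f, g))
```

Up to quadrature weights, clustering with $d_{0+1}$ is equivalent to running ordinary k-means on the concatenated grid evaluations of each smoothed curve and its first derivative.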

#### 3 Application

The proposed method is applied to the annual time series of the HDI from 2000 to 2019 for 44 European countries. Functional cluster analysis is applied to the HDI data, converted into a sample of smooth functions using $K = 5$ cubic B-spline basis functions, chosen by cross-validation (left-hand side of Figure 2). The distances in (1) and (2) are considered, choosing three clusters corresponding to high, medium and low human development countries. The clustering results are the same, except for France and Italy which, by means of $d_{0+1}$, are assigned to the high human development cluster rather than the medium one. The clustering algorithm is also applied to the smoothed version of *EHDI* using $d_0$ as a distance. The resulting configuration is the same as that obtained with $d_{0+1}$ on the functional HDI. The high development group is characterised by the countries of Western and Northern Europe: Austria, Belgium, Germany, Liechtenstein, Luxembourg, Netherlands, Switzerland, United Kingdom, Denmark, Finland, Iceland, Ireland, Norway and Sweden. This group also includes Slovenia, the only country in South-Eastern Europe. The medium development group includes mainly Southern and Eastern European countries, plus France and three Northern countries: Estonia, Latvia and Lithuania. The low human development group is characterised by the countries of Eastern and South-Eastern Europe: Armenia, Azerbaijan, Georgia, Ukraine, Albania, Bosnia and Herzegovina, Republic of Moldova, North Macedonia and Turkey. Table 1 displays the cluster sizes and the average silhouette value for each scenario.

Table 1. *Clustering results from the k-means algorithm with different distances.*

| Cluster | $d_0$ | $d_{0+1}$ | Raw *EHDI* |
|-----------|-------|-----------|------------|
| High | 15 | 17 | 16 |
| Medium | 20 | 18 | 15 |
| Low | 9 | 9 | 13 |
| Mean Sil. | 0.5 | 0.5 | 0.6 |

To provide an insight into the role of the first derivative of the curve, the results are compared with those obtained in a non-functional framework. Specifically, the k-means algorithm is applied to the raw *EHDI*. Figure 1 shows the centroids obtained with the k-means algorithm and the different distances. We remark that the right-hand side of Figure 1 shows the sequences of raw *EHDI* across the years, not smoothed functions.

Figure 1. *Cluster centroids: k-means with different distances.* (Panels: $d_0$, $d_{0+1}$ and raw *EHDI*; x-axis: year; clusters: High, Med, Low.)

Comparing the clusters obtained in the functional and the non-functional contexts, only six countries are assigned differently. Specifically, the non-functional algorithm downgrades Bulgaria, Romania, the Russian Federation and Serbia, classifying them as low development countries. Indeed, as we can see from the right-hand side of Figure 2, Bulgaria presents a first derivative with a decreasing trend, but with high values especially in the first part of the domain. Romania has a first derivative with a fluctuating trend: strongly increasing until 2005, decreasing until 2015, and increasing subsequently. The Russian Federation and Serbia have flat first derivatives, but with high values. Vice versa, the non-functional algorithm upgrades France and Italy, including them in the high development cluster. However, it is a partial upgrade, only with respect to the classification provided by $d_0$ (the $d_{0+1}$ distance assigned both these countries to the high development cluster).

Figure 2. *Functional HDI and first derivatives of the European countries.* (Panels: Functional HDI and First derivatives; x-axis: year; highlighted countries: Bulgaria, France, Italy, Romania, Russian Federation, Serbia.)

#### 4 Conclusions

FDA is a useful methodological framework for the analysis of well-being indicators, as it allows their evolution to be evaluated with additional tools. Specifically, the joint analysis of the level of well-being curves and of their first derivatives can provide useful insight into countries' well-being improvement or worsening. In our application, the range of the first derivatives is very limited, thus the additional information concerning the indicator's trend has little effect on countries' classification.

#### References

RAMSAY, J. O., & SILVERMAN, B. W. 2005. *Functional Data Analysis*. New York: Springer.

TARPEY, T., & KINATEDER, K. K. J. 2003. Clustering functional data. *Journal of Classification*, 20, 93–114.


### ESTIMATING BAYESIAN MIXTURES OF FINITE MIXTURES WITH TELESCOPING SAMPLING \*


Sylvia Frühwirth-Schnatter<sup>1</sup>, Bettina Grün<sup>1</sup> and Gertraud Malsiner-Walli<sup>1</sup>

<sup>1</sup> Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), (e-mail: Sylvia.Fruehwirth-Schnatter@wu.ac.at, Bettina.Gruen@wu.ac.at, Gertraud.Malsiner-Walli@wu.ac.at)

ABSTRACT: Finite mixtures result from convex combinations of arbitrary statistical models as components and thus allow any statistical model to be extended. Specifying a prior on the number of components is natural in a Bayesian framework and results in a mixture of finite mixtures (MFM) model. Several sampling schemes for Bayesian estimation have been proposed, with most being only applicable to a specific component distribution or requiring extensive tuning. The recently proposed telescoping sampler extends the Markov chain Monte Carlo sampling scheme with data augmentation of the finite mixture model by also sampling from the posterior of the number of components. We demonstrate the general applicability and performance of the telescoping sampler on mixture models with different component models.

KEYWORDS: Bayesian estimation, finite mixture model, Markov chain Monte Carlo sampling, transdimensional sampling.

#### 1 Bayesian MFMs & Telescoping Sampling

Mixture models are a versatile model class which can be used for model-based clustering as well as density estimation. A finite mixture model is given by a convex combination of several distributions or models, and hence any statistical model may be embedded within the mixture framework. In the following, only mixture models with fixed component weights are considered, i.e., where the component sizes do not depend on any covariates, while parametric distributions as well as regression models are covered for the components.

In practice, the application of finite mixture models usually requires fixing the number of components a priori for estimation and then performing model selection to decide on a suitable number of components. In particular in Bayesian analysis, such a model selection step is complicated by the fact that the number of components in a finite mixture model does not necessarily correspond to the number of filled components given the observed data. As an alternative, Escobar & West (1995) consider Dirichlet process mixtures (DPMs), where the number of components is infinite and only inference on the number of filled components is performed. Malsiner-Walli *et al.* (2016) suggest sparse finite mixtures (SFMs), where the parameter of the prior on the weights is selected so that the number of components will be higher than the number of filled components, in this way allowing for posterior inference on the number of filled components.

\*The authors gratefully acknowledge support from the *Austrian Science Fund (FWF)*: P28740; and through the *WU Projects* grant scheme: IA-27001574.
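The SFM mechanism can be illustrated with a quick prior simulation (a sketch with assumed values, not the cited authors' code): with a symmetric Dirichlet prior on the weights and a small concentration parameter, most prior draws put their mass on few components, so the number of filled components stays well below the number of components $K$:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, e0 = 20, 500, 0.05   # illustrative: K components, n observations, small e0

filled = []
for _ in range(200):
    w = rng.dirichlet(np.full(K, e0))   # sparse symmetric Dirichlet draw
    z = rng.choice(K, size=n, p=w)      # prior allocations of n observations
    filled.append(np.unique(z).size)    # number of filled components

print(np.mean(filled))
```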

Richardson & Green (1997) propose the specification of a mixture of finite mixtures (MFM) model, where a prior on the number of components is included, to obtain posterior estimates of the number of components, the number of filled components and the model parameters. Richardson & Green (1997) also show how to estimate this model class using a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm. Alternative approaches to Bayesian inference for the MFM model were considered by Stephens (2000), who suggests using a Markov birth-death process, and by Miller & Harrison (2018), who re-use Chinese restaurant process (CRP) methods proposed for DPMs to sample the partitions.

Frühwirth-Schnatter *et al.* (2020) develop the *telescoping sampler* to perform inference for any kind of MFM, where arbitrary component distributions or models as well as hierarchical priors may be included without complicating the sampling. They build on the data augmentation scheme suggested for finite mixtures by Diebolt & Robert (1994) and include a sampling step for the number of components. This implies that the telescoping sampler is straightforward to implement, given that an MCMC sampling scheme for the components is available.
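As a sketch of this building block, the following is a plain data-augmentation Gibbs sweep for a Poisson mixture with fixed $K$; it omits the telescoping step that samples the number of components, and the data and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic counts from a two-component Poisson mixture (illustrative data,
# not the eye tracking data analysed below).
y = np.concatenate([rng.poisson(1.0, 150), rng.poisson(10.0, 150)])
n, K = y.size, 2
e0, a0, b0 = 1.0, 0.1, 0.05   # assumed Dirichlet and Gamma hyperparameters

w = np.full(K, 1.0 / K)
lam = np.array([1.0, 5.0])
draws = []
for it in range(500):
    # 1) allocations z_i | w, lambda (Poisson log-likelihood up to a constant)
    logp = np.log(w) + y[:, None] * np.log(lam) - lam
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = (p.cumsum(axis=1) < rng.random((n, 1))).sum(axis=1)
    # 2) weights w | z ~ Dirichlet(e0 + n_k)
    nk = np.bincount(z, minlength=K)
    w = rng.dirichlet(e0 + nk)
    # 3) means lambda_k | z, y ~ Gamma(a0 + sum of y in k, rate b0 + n_k);
    #    numpy's gamma takes a scale, so pass 1/rate (clipped for safety)
    sums = np.bincount(z, weights=y, minlength=K)
    lam = np.maximum(rng.gamma(a0 + sums, 1.0 / (b0 + nk)), 1e-12)
    if it >= 100:
        draws.append(np.sort(lam))

post = np.mean(draws, axis=0)   # posterior means of the ordered rates
print(post)
```

The telescoping sampler adds, on top of such sweeps, an update of the number of components given the current partition.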

#### 2 Empirical Demonstrations


Frühwirth-Schnatter *et al.* (2020) already present the application of the telescoping sampler to mixtures of univariate Gaussian distributions, which allows them to benchmark their sampler against RJMCMC and CRP sampling, to mixtures of multivariate Gaussian distributions, and to latent class analysis models applied to multivariate categorical data. Following Frühwirth-Schnatter & Malsiner-Walli (2019), it is straightforward to investigate the use of the telescoping sampler also for mixtures of Poisson distributions, mixtures of generalized linear models and mixtures of skew-normal and skew-t distributions, and to compare the performance to DPMs and SFMs. Section 2.1 presents the results obtained with telescoping sampling for Poisson mixtures using the eye tracking data.

#### 2.1 Poisson Mixtures: Eye Tracking Data

Figure 1 visualizes the count data on eye tracking anomalies in 101 schizophrenic patients, studied among others by Frühwirth-Schnatter & Malsiner-Walli (2019). The overdispersion and the excess number of zeros present in the data set are clearly visible in the plot showing the frequency of counts. The MFM is fitted using the same hierarchical specification for the component means $\lambda_k$ as used in Frühwirth-Schnatter & Malsiner-Walli (2019) when fitting SFMs and DPMs: $\lambda_k \sim \mathcal{G}(a_0, b_0)$ and $b_0 \sim \mathcal{G}(g_0, G_0)$, where the parameters of the gamma distributions are given by $a_0 = 0.1$, $g_0 = 0.5$ and $G_0 = g_0\bar{y}/a_0$, with $\bar{y}$ the mean of the observations. In addition, the dynamic specification for the Dirichlet weights is used, i.e., the weights are a-priori drawn from a $K$-dimensional symmetric Dirichlet distribution $\text{Dir}_K(\alpha/K)$, with $\alpha$ having a hyperprior $F$-distribution $F(6,3)$. Four different priors on the number of components are considered: the discrete uniform prior on $\{1, 2, \ldots, 150\}$, the shifted beta-negative-binomial priors BNB(1,1,1) and BNB(1,4,3), and the geometric prior Geo(0.1). These priors vary in their prior mean, their regularization of additional components and the mass assigned to the tail.

Figure 1. *Eye tracking data. Histogram of the observations.* (x-axis: counts, 0–30; y-axis: frequency.)

The posterior distributions of the number of components $K$ and the number of filled components $K_+$ obtained with telescoping sampling are summarized in Table 1. The influence of the prior on $K$ is particularly noticeable for the posterior of $K$, while a much less pronounced influence on the posterior of $K_+$ is discernible. Clearly, the results for all priors on $K$ indicate that heterogeneity is present and that a mixture with several components is needed to approximate the distribution of counts.

Table 1. *Posterior inference for $K_+$ and $K$. The posteriors are summarized by their modes, followed by the 1st, 2nd and 3rd quartiles.*

| $p(K)$ | $p(K_+ \mid y)$ | $p(K \mid y)$ |
|------------|-----------------|-------------------|
| $U(1,150)$ | 13 [12, 16, 21] | 119 [50, 83, 118] |
| BNB(1,1,1) | 10 [9, 12, 16] | 11 [12, 21, 41] |
| Geo(0.1) | 9 [9, 11, 15] | 13 [12, 17, 25] |
| BNB(1,4,3) | 6 [6, 8, 10] | 7 [7, 9, 13] |

#### References

DIEBOLT, J., & ROBERT, C. P. 1994. Estimation of Finite Mixture Distributions Through Bayesian Sampling. *Journal of the Royal Statistical Society B*, 56(2), 363–375.

ESCOBAR, M. D., & WEST, M. 1995. Bayesian Density Estimation and Inference Using Mixtures. *Journal of the American Statistical Association*, 90(430), 577–588.

FRÜHWIRTH-SCHNATTER, S., & MALSINER-WALLI, G. 2019. From Here to Infinity: Sparse Finite Versus Dirichlet Process Mixtures in Model-Based Clustering. *Advances in Data Analysis and Classification*, 13(1), 33–64.

FRÜHWIRTH-SCHNATTER, S., MALSINER-WALLI, G., & GRÜN, B. 2020. *Generalized Mixtures of Finite Mixtures and Telescoping Sampling*. arXiv:2005.09918 [stat.ME].

MALSINER-WALLI, G., FRÜHWIRTH-SCHNATTER, S., & GRÜN, B. 2016. Model-Based Clustering Based on Sparse Finite Gaussian Mixtures. *Statistics and Computing*, 26(1), 303–324.

MILLER, J. W., & HARRISON, M. T. 2018. Mixture Models with a Prior on the Number of Components. *Journal of the American Statistical Association*, 113(521), 340–356.

RICHARDSON, S., & GREEN, P. J. 1997. On Bayesian Analysis of Mixtures with an Unknown Number of Components. *Journal of the Royal Statistical Society B*, 59(4), 731–792.

STEPHENS, M. 2000. Bayesian Analysis of Mixture Models with an Unknown Number of Components – An Alternative to Reversible Jump Methods. *The Annals of Statistics*, 28(1), 40–74.



### A BAYESIAN FRAMEWORK FOR STRUCTURAL LEARNING OF MIXED GRAPHICAL MODELS


Chiara Galimberti<sup>1</sup>, Federico Castelletti<sup>2</sup> and Stefano Peluso<sup>3</sup>

<sup>1</sup> Department of Economics, Management and Statistics, Università degli Studi di Milano-Bicocca (e-mail: c.galimberti19@campus.unimib.it)

<sup>2</sup> Department of Statistical Sciences, Università Cattolica del Sacro Cuore (e-mail: federico.castelletti@unicatt.it)

<sup>3</sup> Department of Statistics and Quantitative Methods, Università degli Studi di Milano-Bicocca (e-mail: stefano.peluso@unimib.it)

ABSTRACT: Graphical models provide an effective tool to represent conditional independences among variables. While this class of models has been extensively studied in the Gaussian and categorical settings separately, the literature combining the two types of variables is narrow. However, mixed data are extremely widespread in applications where both continuous and categorical measurements are available. In this paper we propose a Bayesian framework for the analysis of mixed data. Specifically, we specify a likelihood function for *n* observations following a conditional Gaussian distribution, and assign suitable priors to the model parameters. Our end result is a closed-form expression for the marginal data distribution. The latter provides a primary input for the computation of the marginal likelihood under graph (independence) constraints and for the development of an MCMC strategy for graph structural learning.

KEYWORDS: conditional Gaussian distribution, directed acyclic graph, graphical models, marginal likelihood, mixed variables.

#### 1 Introduction

Graphical models are particularly effective in representing conditional dependency structures in multivariate distributions (Lauritzen, 1996). In particular, the unknown graph underlying the generating model can be inferred from the data using structural learning methodologies. In this contribution we focus on directed acyclic graphs (DAGs), where conditional dependencies between variables are represented through parent-child relationships.

Several works for structural learning of graphical models given continuous (Gaussian) or discrete/categorical data (Ising model) are available in the literature.

However, the literature on DAG structural learning given mixed data is extremely narrow. In the Bayesian framework, a unified approach which jointly models categorical and continuous data is also still lacking. The scope of this study is to develop a Bayesian methodology for DAG learning in the presence of mixed observations. Our ultimate goal is the development of an MCMC algorithm, along the lines of Castelletti *et al.*, 2018 and Castelletti & Peluso, 2021 for, respectively, the Gaussian and the categorical case. In the next sections we illustrate some preliminary results on general Bayesian models for mixed variables, together with some possible extensions to DAG-constrained models.

#### 2 Model development


Our starting point is the notion of Conditional Gaussian (CG) distribution introduced by Lauritzen & Wermuth, 1989. Let *V* be a finite set of nodes indexing a collection of random variables *Z* = (*Z*<sub>1</sub>,...,*Z*<sub>|*V*|</sub>)<sup>*T*</sup>, which comprises both discrete and continuous quantities, indexed by ∆ and Γ respectively, with ∆ ∪ Γ = *V*. The authors defined a general class of probability distributions of the form

$$f(z) = f(s, \mathbf{y}) = \exp\left\{ g(s) + h(s)^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T K(s) \mathbf{y} \right\} \tag{1}$$

where *s* and *y* correspond to the levels assumed by the categorical variables and the values of the continuous variables, respectively. A probability distribution of the form (1) is a CG distribution if and only if *Z*<sub>Γ</sub> | *Z*<sub>∆</sub> = *s* ∼ *N<sub>q</sub>*(*K*(*s*)<sup>−1</sup>*h*(*s*), *K*(*s*)<sup>−1</sup>) and the marginal distribution of the discrete variables is

$$\Theta(s) = (2\pi)^{-\frac{q}{2}} |K(s)|^{-\frac{1}{2}} \exp\left\{ g(s) + \frac{1}{2} h(s)^T K(s)^{-1} h(s) \right\},\tag{2}$$

for each level *s* assumed by *Z*<sub>∆</sub>. Moreover, if *K*(*s*) = *K* the distribution is called *homogeneous*. An alternative representation of a CG distribution, hereinafter adopted, is given in terms of the moment parameters (θ, ξ, Σ).

Specifically, let (*X*<sub>1</sub>,...,*X<sub>p</sub>*) be *p* categorical variables and (*Y*<sub>1</sub>,...,*Y<sub>q</sub>*) be *q* continuous variables. Let also *I* be the space of all possible configurations of the *p* categorical variables and θ = {θ(*s*), *s* ∈ *I*}, where θ(*s*) = Pr(*X*<sub>1</sub> = *s*<sub>1</sub>,...,*X<sub>p</sub>* = *s<sub>p</sub>*) is the probability of observing configuration *s* = (*s*<sub>1</sub>,...,*s<sub>p</sub>*). Under the CG assumption we can write, for each *s* ∈ *I*,

$$(Y_1(s), \ldots, Y_q(s)) \mid \boldsymbol{\mu}(s), \boldsymbol{\Omega} \sim \mathcal{N}_q(\boldsymbol{\mu}(s), \boldsymbol{\Omega}^{-1}). \tag{3}$$
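To make the moment parametrisation concrete, the following sketch simulates mixed data from a homogeneous CG distribution. All dimensions and parameter values (two binary categorical variables, *q* = 3, the specific θ and Ω) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions): two binary categorical variables
# give |I| = 4 configurations; q = 3 continuous variables; n = 500 observations.
n_config, q, n = 4, 3, 500

theta = np.array([0.4, 0.3, 0.2, 0.1])   # configuration probabilities theta(s)
mu = rng.normal(size=(n_config, q))      # mean vector mu(s) for each configuration s
A = rng.normal(size=(q, q))
Omega = A @ A.T + q * np.eye(q)          # shared precision matrix (homogeneous case)
Sigma = np.linalg.inv(Omega)             # covariance Omega^{-1}

# Draw mixed observations: s ~ theta, then y | s ~ N_q(mu(s), Omega^{-1}), as in (3)
s = rng.choice(n_config, size=n, p=theta)
y = np.array([rng.multivariate_normal(mu[si], Sigma) for si in s])

# Contingency-table counts n(s), satisfying sum_s n(s) = n
counts = np.bincount(s, minlength=n_config)
```

The counts `counts` play the role of the contingency table *N* used in the likelihood below.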

We now consider a collection of *n* independent observations *x<sub>i</sub>* = (*x*<sub>*i*,1</sub>,...,*x*<sub>*i*,*p*</sub>)<sup>*T*</sup>, *y<sub>i</sub>* = (*y*<sub>*i*,1</sub>,...,*y*<sub>*i*,*q*</sub>)<sup>*T*</sup>, *i* = 1,...,*n*. The categorical data {*x<sub>i</sub>*, *i* = 1,...,*n*} can be equivalently represented as a contingency table of counts *N* with elements *n*(*s*) satisfying ∑<sub>*s*∈*I*</sub> *n*(*s*) = *n*. Following Frydenberg & Lauritzen, 1989, the likelihood function can be written as

$$f(N, \mathbf{y}_1, \ldots, \mathbf{y}_n \mid \boldsymbol{\theta}, \{\boldsymbol{\mu}(s)\}_{s \in I}, \boldsymbol{\Omega}) = \prod_{s \in I} \theta(s)^{n(s)} \prod_{s \in I} \prod_{i \in d(s)} \phi(\mathbf{y}_i \mid \boldsymbol{\mu}(s), \boldsymbol{\Omega}^{-1})$$

$$\propto \prod_{s \in I} \theta(s)^{n(s)} \prod_{s \in I} \prod_{i \in d(s)} |\boldsymbol{\Omega}|^{\frac{1}{2}} \exp\left\{ -\frac{1}{2} (\mathbf{y}_i - \boldsymbol{\mu}(s))^T \boldsymbol{\Omega} (\mathbf{y}_i - \boldsymbol{\mu}(s)) \right\}, \tag{4}$$

where *d*(*s*) is the set of observations among *i* = 1,...,*n* with observed configuration *s* and φ is the Gaussian density. We then proceed by assigning the following prior distributions:

$$\boldsymbol{\theta} \sim \text{Dirichlet}(\mathbf{A}), \quad \boldsymbol{\mu}(s) \mid \boldsymbol{\Omega} \sim \mathcal{N}_q(\mathbf{m}(s), (a_{\mu}\boldsymbol{\Omega})^{-1}), \quad \boldsymbol{\Omega} \sim \mathcal{W}_q(a_{\Omega}, \mathbf{U}), \tag{5}$$

where in particular *W<sub>q</sub>*(*a*<sub>Ω</sub>, *U*) denotes a Wishart distribution having expectation *a*<sub>Ω</sub>*U*<sup>−1</sup>, with *a*<sub>Ω</sub> > *q* − 1 and *U* a symmetric positive definite matrix. Under prior parameter independence, the posterior distribution can be written, after some calculations, as

$$p(\boldsymbol{\theta}, \{\boldsymbol{\mu}(s)\}_{s \in I}, \boldsymbol{\Omega} \mid N, \mathbf{y}_1, \ldots, \mathbf{y}_n) \propto \prod_{s \in I} \theta(s)^{a(s) + n(s) - 1}$$

$$\cdot \prod_{s \in I} |\boldsymbol{\Omega}|^{\frac{1}{2}} \exp\left\{ -\frac{1}{2}(n(s) + a_{\mu})(\boldsymbol{\mu}(s) - \bar{\mathbf{m}}(s))^T \boldsymbol{\Omega} (\boldsymbol{\mu}(s) - \bar{\mathbf{m}}(s)) \right\}$$

$$\cdot |\boldsymbol{\Omega}|^{\frac{a_{\Omega} + n - q - 1}{2}} \exp\left\{ -\frac{1}{2} \operatorname{tr}[(\mathbf{U} + \mathbf{S} + \mathbf{S}_0)\boldsymbol{\Omega}] \right\}, \tag{6}$$

with *S* = ∑<sub>*s*∈*I*</sub> SSD(*s*),

$$\bar{\mathbf{m}}(s) = \frac{a_{\mu}}{a_{\mu} + n(s)}\,\mathbf{m}(s) + \frac{n(s)}{a_{\mu} + n(s)}\,\bar{\mathbf{y}}(s), \qquad \mathbf{S}_0 = \sum_{s \in I} \frac{a_{\mu}\, n(s)}{a_{\mu} + n(s)} (\mathbf{m}(s) - \bar{\mathbf{y}}(s))(\mathbf{m}(s) - \bar{\mathbf{y}}(s))^T,$$

where SSD(*s*) = ∑<sub>*i*∈*d*(*s*)</sub> *e<sub>i</sub>e<sub>i</sub>*<sup>*T*</sup>, *e<sub>i</sub>* = *y<sub>i</sub>* − *ȳ*(*s*), and *ȳ*(*s*) is the (*q*,1) vector of sample means of (*Y*<sub>1</sub>,...,*Y<sub>q</sub>*) relative to observations *i* ∈ *d*(*s*). It follows that

$$\begin{aligned} \boldsymbol{\theta} \mid N &\sim \text{Dirichlet}(\mathbf{A} + \mathbf{N}) \\ \boldsymbol{\mu}(s) \mid N, \mathbf{Y}, \boldsymbol{\Omega} &\sim \mathcal{N}_q(\bar{\mathbf{m}}(s), [(a_{\mu} + n(s))\boldsymbol{\Omega}]^{-1}) \\ \boldsymbol{\Omega} \mid \mathbf{Y} &\sim \mathcal{W}_q(a_{\Omega} + n, \mathbf{U} + \mathbf{S} + \mathbf{S}_0), \end{aligned} \tag{7}$$

where *Y* denotes the (*n*,*q*) data matrix obtained by row-binding the *y<sub>i</sub>*'s. Because of conjugacy, the marginal data distribution

$$m(\mathbf{Y}, N) = \int f(N, \mathbf{Y} \mid \boldsymbol{\theta}, \{\boldsymbol{\mu}(s)\}_{s \in I}, \boldsymbol{\Omega})\, p(\boldsymbol{\theta}) \prod_{s \in I} p(\boldsymbol{\mu}(s))\, p(\boldsymbol{\Omega})\; d\boldsymbol{\theta} \prod_{s \in I} d\boldsymbol{\mu}(s)\; d\boldsymbol{\Omega}$$

can be computed as the ratio of prior and posterior normalizing constants. Particular care will be taken over score equivalence (i.e. equal marginal likelihood) for Markov equivalent DAGs; see also Peluso & Consonni, 2020.

#### 3 Conclusion and further steps

We obtained a closed-form expression for the marginal likelihood of a complete (unconstrained) Bayesian model given mixed data. The next step requires the computation of the marginal likelihood for a subset of (mixed) variables, e.g. {*X<sub>j</sub>* | *j* ∈ *C* ⊆ {1,..., *p*}} ∪ {*Y<sub>k</sub>* | *k* ∈ *D* ⊆ {1,...,*q*}}. To this end we will adopt the procedure for prior parameter elicitation introduced by Geiger & Heckerman, 2002. The computation of the marginal likelihood of a given DAG will be at the basis of an MCMC algorithm for DAG structural learning.

#### References

CASTELLETTI, F., CONSONNI, G., DELLA VEDOVA, M., & PELUSO, S. 2018. Learning Markov equivalence classes of directed acyclic graphs: an objective Bayes approach. *Bayesian Analysis*, 13(4), 1235–1260.

CASTELLETTI, F., & PELUSO, S. 2021. Equivalence class selection of categorical graphical models. *arXiv preprint arXiv:2102.06437*.

DEGROOT, M. 2004. *Optimal Statistical Decisions*. John Wiley and Sons.

FRYDENBERG, M., & LAURITZEN, S. 1989. Decomposition of maximum likelihood in mixed graphical interaction models. *Biometrika*, 76(3), 539–555.

GEIGER, D., & HECKERMAN, D. 2002. Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. *The Annals of Statistics*, 30(5), 1412–1440.

LAURITZEN, S. 1996. *Graphical Models*. Oxford University Press.

LAURITZEN, S., & WERMUTH, N. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. *The Annals of Statistics*, 17(1), 31–57.

PELUSO, S., & CONSONNI, G. 2020. Compatible priors for model selection of high-dimensional Gaussian DAGs. *Electronic Journal of Statistics*, 14(2), 4110–4132.



### MEASUREMENT ERROR MODELS ON SPATIAL NETWORK LATTICES: CAR CRASHES IN LEEDS

Andrea Gilardi<sup>1</sup>, Riccardo Borgoni<sup>1</sup>, Luca Presicce<sup>1</sup> and Jorge Mateu<sup>2</sup>

<sup>1</sup> Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy (e-mail: andrea.gilardi@unimib.it)

<sup>2</sup> Department of Mathematics, Universitat Jaume I, Castellón, Spain

ABSTRACT: Road casualties represent the leading cause of death among young people worldwide, especially in poor and developing countries. This paper introduces a Bayesian hierarchical model to analyse car accidents on a network lattice that takes into account measurement error in spatial covariates. We exemplify the proposed approach by analysing all car crashes that occurred in the road network of Leeds (UK) from 2011 to 2019. Our results show that omitting measurement error considerably worsens the fit of the model and attenuates the effects of spatial covariates.

KEYWORDS: CAR, Linear Networks, Network Lattices, Spatial Measurement Error

#### 1 Introduction

As reported by the World Health Organisation in 2018, car crashes are responsible for more than 1.35 million casualties each year, representing the leading cause of death among people aged 5-29 years, particularly those living in developing countries. In recent years, several authors have developed sophisticated statistical models to analyse the spatial distribution of car crashes at the areal level (e.g. cities or census wards) and help the local authorities define safety measures. Nevertheless, road casualties represent a classic example of events occurring on a linear network. This paper presents a Bayesian hierarchical model for car crashes developed on a network lattice that takes into account measurement error (ME) in spatial covariates. In particular, a Conditional Auto-Regressive (CAR) prior is introduced to adjust for ME in estimating road traffic volumes within the classical ME model paradigm. The Integrated Nested Laplace Approximation (INLA) framework is adopted for inference. This approach was found particularly convenient for large networks, such as the one considered in this paper, for which MCMC techniques may be challenging and time-consuming (Muff *et al.*, 2015).

#### 2 Road network and car crashes


The statistical analysis introduced in Section 3 requires a specific data structure that was obtained after several preprocessing steps, briefly described hereafter.

The *road network* was built using data extracted from Open Street Map (OSM), an online database that provides open-access geographic rich-attribute data worldwide. We downloaded the street segments that pertain to the most important\* roads of Leeds and created a matrix of segments representing the elementary units of the statistical model.

A street network can also be seen as a graph whose edges represent the road segments and whose vertices are placed at junctions, intersections, and boundary points (Barthélemy, 2011). We took advantage of the graph representation to contract the street network, removing redundant nodes, edge loops, duplicated roads, and several isolated clusters of segments that may create numerical problems (Gilardi *et al.*, 2020). Furthermore, we calculated the weighted edge betweenness centrality, a graph measure correlated with the spatial distribution of commercial activities, which is usually adopted to analyse congestion problems as a proxy for urban traffic (Barthélemy, 2011). Finally, we derived the edges' adjacency matrix, an essential ingredient for the CAR prior used below.

We analysed all car crashes involving personal injuries that occurred in the city of Leeds from 2011 to 2019 and became known to the Police Forces within thirty days from their occurrence. First, we downloaded the data from UK's official road traffic casualty database. Then, we excluded those car crashes that occurred farther than fifty metres from the closest road segment, and, finally, we projected the events to the nearest point of the network and counted the occurrences for each segment. The final sample included 15826 events distributed over 4253 segments covering approximately 1170 km.
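The snapping-and-counting step just described can be sketched as follows. The coordinates are synthetic, and using segment midpoints instead of a true point-to-segment projection is a simplifying assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sketch: snap crash locations to the nearest road segment and count per segment.
segments = rng.uniform(0, 1000, size=(50, 2))   # hypothetical segment midpoints (metres)
crashes = rng.uniform(0, 1000, size=(200, 2))   # hypothetical crash locations (metres)

# Pairwise distances, shape (n_crashes, n_segments)
d = np.linalg.norm(crashes[:, None, :] - segments[None, :, :], axis=2)
nearest = d.argmin(axis=1)

# Apply the fifty-metre exclusion rule described in the text
keep = d[np.arange(len(crashes)), nearest] <= 50.0

# Crash counts per segment: the response variable of the model in Section 3
counts = np.bincount(nearest[keep], minlength=len(segments))
```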

#### 3 Statistical methods

Let *y<sub>i</sub>*, *i* = 1,...,*n*, represent the number of car crashes that occurred on the *i*-th road segment. Following a classical hypothesis in the road safety literature, we assume that *y<sub>i</sub>* | λ<sub>*i*</sub> ∼ Poisson(*e<sub>i</sub>*λ<sub>*i*</sub>), where λ<sub>*i*</sub> represents the car-crash rate and *e<sub>i</sub>* is an exposure parameter equal to the geographical length of each segment.

\*More precisely, we selected only those segments whose classification ranges from *Autostrada* (i.e. *Motorway*) to *Strada Comunale* (i.e. *Tertiary Road*).
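The observation level just introduced, *y<sub>i</sub>* | λ<sub>*i*</sub> ∼ Poisson(*e<sub>i</sub>*λ<sub>*i*</sub>), can be sketched by forward simulation. The covariates, coefficient values, and measurement-error scale below are illustrative assumptions, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Forward simulation of y_i | lambda_i ~ Poisson(e_i * lambda_i) with a
# log-linear rate; all numeric values are illustrative assumptions.
n = 1000
e = rng.uniform(0.05, 2.0, size=n)      # exposure e_i: segment length (km)
z = rng.integers(0, 2, size=n)          # error-free covariate z_i (road-type dummy)
x = rng.normal(size=n)                  # latent error-prone covariate x_i (traffic)

b0, bz, bx = -1.0, 0.5, 0.8
lam = np.exp(b0 + bz * z + bx * x)      # crash rate lambda_i per unit length
y = rng.poisson(e * lam)                # observed crash counts with exposure offset

w = x + rng.normal(scale=0.5, size=n)   # observed proxy w_i = x_i + measurement error
```

The proxy `w` mirrors the role of the error-prone traffic covariate in the ME model of the hierarchy.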

In the first level of the hierarchy, we define a log-linear structure on λ*i*, i.e.

$$\log(\lambda_i) = \beta_0 + \beta_z z_i + \beta_x x_i + \theta_i + \phi_i; \; i = 1, \ldots, n, \tag{1}$$

where β<sub>0</sub> denotes the intercept, *z<sub>i</sub>* is an error-free covariate representing the road type of each segment, *x<sub>i</sub>* is an unobservable error-prone covariate representing the traffic volumes, and β<sub>*x*</sub> and β<sub>*z*</sub> are the corresponding coefficients. Finally, θ<sub>*i*</sub> and φ<sub>*i*</sub> denote spatially structured and unstructured random effects that are modelled using a reparametrisation and a network re-adaptation of the Besag-York-Mollié (BYM) prior (Riebler *et al.*, 2016, Gilardi *et al.*, 2020).

The classical spatial ME model assumes that *x<sub>i</sub>* can be observed only via a proxy, say *w<sub>i</sub>*, such that

$$w_i = x_i + u_i + \varphi_i; \; i = 1, \ldots, n.$$

The terms *u<sub>i</sub>* and ϕ<sub>*i*</sub> represent the ME and denote, respectively, spatially structured and unstructured random effects that are also modelled using the BYM prior. In particular, ϕ<sub>*i*</sub> adds a spatial smoothing effect to the unobserved covariate *x<sub>i</sub>*. In this paper, we assume that the edge betweenness centrality measure can approximate the unobservable traffic volumes.

At the second stage of the hierarchy, we specified an exposure model that relates *x<sub>i</sub>* to the error-free predictor:

$$x_i = \alpha_0 + \alpha_z z_i + \varepsilon_i; \; i = 1, \ldots, n. \tag{2}$$

The parameter α<sub>0</sub> denotes the intercept, α<sub>*z*</sub> is the coefficient of the error-free covariate, and ε<sub>*i*</sub> is a normally distributed error component. Furthermore, we assigned independent *N*(0, 10<sup>3</sup>) priors to β<sub>0</sub>, β<sub>*z*</sub>, α<sub>0</sub>, and α<sub>*z*</sub>, i.e. the intercepts and the coefficients of *z<sub>i</sub>* in equations (1) and (2).

The third level completes the specification of the hierarchical model, eliciting a *N*(0, 100) prior for β<sub>*x*</sub>, i.e. the coefficient of the error-prone covariate, a Gamma(1, 5e-05) prior on the precisions of *u<sub>i</sub>* and ϕ<sub>*i*</sub>, and Penalised Complexity priors for the parameters of the BYM re-adaptation (Simpson *et al.*, 2017).

#### 4 Results and conclusions

We estimated the statistical model described in Section 3 using the INLA methodology and compared the results with two simpler models: the first completely ignores ME, while the second adopts a classical ME model without spatial smoothing effects. We found that omitting ME greatly attenuates the importance of traffic volumes, and that excluding the spatial smoothing terms worsens the fit of the model. Motorways were found to be less prone to car crashes than the other road types, while the posterior distributions of the fixed effects and common hyperparameters were stable across the three models. Table 1 reports a short summary of the fixed effects' posterior means, and Figure 1 displays the posterior means of the predicted counts; the map highlights a few road segments close to the city centre that would require a more detailed statistical analysis.

Table 1. *Summary of DIC, posterior means of fixed effects, and error-prone covariate.*

| | No ME | ME | Spat. ME |
|---|---|---|---|
| β<sub>*x*</sub> | 0.01 | 1.064 | 2.95 |
| β<sub>0</sub> | -5.307 | -9.90 | -15.441 |
| β<sub>primary</sub> | 0.61 | 0.56 | 0.40 |
| β<sub>secondary</sub> | 0.57 | 0.68 | 1.05 |
| DIC | 33126 | 30466 | |

Figure 1. *Map displaying the posterior means of car crashes counts* (predicted counts binned from 0.0 to 168.6).

#### References

BARTHÉLEMY, MARC. 2011. Spatial networks. *Physics Reports*, 499(1-3), 1–101.

GILARDI, ANDREA, MATEU, JORGE, BORGONI, RICCARDO, & LOVELACE, ROBIN. 2020. Multivariate hierarchical analysis of car crashes data considering a spatial network lattice. *arXiv preprint arXiv:2011.12595*.

MUFF, STEFANIE, RIEBLER, ANDREA, HELD, LEONHARD, RUE, HÅVARD, & SANER, PHILIPPE. 2015. Bayesian analysis of measurement error models using integrated nested Laplace approximations. *Journal of the Royal Statistical Society: Series C (Applied Statistics)*, 231–252.

RIEBLER, ANDREA, SØRBYE, SIGRUNN H, SIMPSON, DANIEL, & RUE, HÅVARD. 2016. An intuitive Bayesian spatial model for disease mapping that accounts for scaling. *Statistical Methods in Medical Research*, 25(4), 1145–1165.

SIMPSON, DANIEL, RUE, HÅVARD, RIEBLER, ANDREA, MARTINS, THIAGO G, & SØRBYE, SIGRUNN H. 2017. Penalising model component complexity: A principled, practical approach to constructing priors. *Statistical Science*, 1–28.


## THE *L<sup>p</sup>* DATA DEPTH AND ITS APPLICATION TO MULTIVARIATE PROCESS CONTROL CHARTS

Carmela Iorio <sup>1</sup>, Giuseppe Pandolfo <sup>1</sup>, Michele Staiano <sup>1</sup>, Massimo Aria <sup>2</sup> and Roberta Siciliano <sup>1</sup>

<sup>1</sup> Department of Industrial Engineering, University of Naples Federico II, Italy (e-mail: carmela.iorio@unina, giuseppe.pandolfo@unina.it, michele.staiano@unina.it, roberta@unina.it)

<sup>2</sup> Department of Economics and Statistics, University of Naples Federico II, Italy (e-mail: massimo.aria@unina.it)

ABSTRACT: Control charts are used to identify non-random behaviours of a manufacturing process by monitoring changes in the distribution of the quality characteristics of the tested product. Monitoring several related variables is usually referred to as a multivariate quality control problem. In many applications there is not enough information to justify the assumption of a specific form for the underlying process distribution; a non-parametric approach is thus a valid tool in quality control. Among possible non-parametric statistical techniques, data depth functions are gaining increasing interest in multivariate quality control. The aim of this work is to investigate the behaviour of a non-parametric approach based on the notion of *L<sup>p</sup>* depth in statistical process control.

KEYWORDS: Non-parametric statistics, Q-charts, Data depth.

#### 1 Introduction

Nowadays, industries collect large amounts of data on more than one variable; hence, in a quality control process there is more than one quality variable to be monitored simultaneously. A traditional control chart monitoring a single variable is not sufficient for assessing the overall quality of a process, as this is determined by the interaction of several related variables (Liu, 1995, Idris *et al.*, 2019). For this reason, multivariate analysis is becoming increasingly important within statistical process control approaches (Woodall & Montgomery, 1999). Multivariate control charts are needed when dealing with more than one quality variable, as they overcome the drawback of obtaining incorrect control limits for related variables: the multivariate procedure takes into account the association between the components of a multivariate process.

Multivariate quality control studies were first conducted by Hotelling, 1947; for a more detailed description please refer to Montgomery, 2007. Woodall, 2000 distinguishes control-chart techniques into Phase I (also called the retrospective or preliminary phase) and Phase II. Phase I uses charts with the purpose of establishing whether a process is statistically in control when the first subgroups are processed. In Phase II, the charts are used to check whether the process remains in control as future subgroups are processed. In this last phase it is assumed that the distribution of the process is known, and most classical applications require the hypothesis that the process under consideration follows a multivariate normal distribution. However, in most industrial applications the distribution of a process with multiple quality characteristics is difficult to estimate parametrically. As such, all observations are considered as *d*-dimensional vectors and used to detect possible shifts in the *d*-dimensional distribution of the quality process. A statistical process control (SPC) procedure set up in a multivariate framework is more effective than a joint monitoring system consisting of a series of traditional univariate control charts (Crosier, 1988).

The most popular multivariate statistical process control charts are based on Hotelling's *T*<sup>2</sup> statistic, a multivariate extension of Shewhart's chart (or *X̄* control chart). Like their univariate counterparts, multivariate control charts can be distinguished into parametric and non-parametric types, according to whether the distributional assumptions underlying the chart (e.g. normality) are verified or not. When the assumption of normality is not verified, the use of conventional (multivariate) control charts for process monitoring is questionable. Non-parametric control charts do not require distributional assumptions on the process data and generally enjoy greater robustness, namely they are less sensitive to outliers than parametric control schemes. A survey of parametric multivariate SPC charts can be found in Bersimis *et al.*, 2007, while a review of non-parametric multivariate control charts can be found in Chakraborti & Graham, 2019.

Statistical depth functions are widely used in non-parametric statistics for the analysis of multivariate data. These are non-parametric functions that provide a dimension reduction for high-dimensional problems. In this work we focus on the *L<sup>p</sup>* data depth to build a control chart. The *L<sup>p</sup>* data depth has additional advantages over other depth-based control charts already introduced in multivariate SPC (i.e. it ensures ease of computation even in high dimensions).

Multivariate quality control studies was first conducted by Hotteling, 1947. For a more detailed description please refer to Montgomery, 2007.

THE *L<sup>p</sup>* DATA DEPTH AND ITS APPLICATION TO MULTIVARIATE PROCESS CONTROL CHARTS Carmela Iorio 1, Giuseppe Pandolfo1, Michele Staiano1 , Massimo Aria2 and Roberta Siciliano1

<sup>1</sup> Department of Industrial Engineering, University of Naples Federico II, Italy, (e-mail: carmela.iorio@unina, giuseppe.pandolfo@unina.it,

<sup>2</sup> Department of Economics and Statistics, University of Naples Federico II, Italy,

ABSTRACT: Control charts are used to identify non-random behaviours of a manufacturing process by monitoring changes in the distribution of the quality characteristics of the tested product. Process monitoring of related variables is usually referred to as a multivariate quality control problem. In many applications there is not enough information to justify the assumption of a specific form for the underlying process distribution. Thus, a non-parametric approach is a valid tool in a quality control process. Among possible non-parametric statistical techniques, data depth functions are gaining increasing interest in multivariate quality control. The aim of this work is to investigate the behaviour of a non-parametric approach based on the notion of the *L<sup>p</sup>*

Nowadays, industries collect a large amount of data on more than one variable. Hence in a quality control process there is more than one quality variable to be monitored simultaneously. A traditional control chart monitoring a single variable is not useful for detecting the overall quality of a process, as it is determined by the interaction of several related variables (Liu, 1995, Idris *et al.*, 2019). For this reason, multivariate analysis is becoming increasingly important within the statistical process control approaches (Woodall & Montgomery, 1999). Multivariate control charts are needed when dealing with more than one quality variable as overcome the drawback of obtaining incorrect control limits when dealing with related variables. As a matter of fact, the multivariate procedure takes into account the association between the components of a mul-

michele.staiano@unina.it, roberta@unina.it)

KEYWORDS: Non-parametric statistics, Q-charts, Data depth.

(e-mail:massimo.aria@unina.it)

depth in the statistical process control.

1 Introduction

tivariate process.

Woodall, 2000 distinguishes control charting techniques into Phase I (also called the retrospective or preliminary phase) and Phase II. In Phase I, charts are used to determine whether a process was statistically in control when the first subgroups were collected. In Phase II, charts are used to check whether the process remains in control as future subgroups are collected. In this last phase, the distribution of the process is assumed to be known, and most classical applications require the hypothesis that the process under consideration follows a multivariate normal distribution. However, in most industrial applications with multiple quality characteristics the process distribution is difficult to estimate, which undermines parametric multivariate control charts. In such settings, each observation is treated as a *d*-dimensional vector and used to detect possible shifts in the *d*-dimensional distribution of the quality process. A statistical process control (SPC) procedure set up in a multivariate framework is more effective than a joint monitoring system consisting of a series of traditional univariate control charts (Crosier, 1988).

The most popular multivariate statistical process control charts are based on Hotelling's *T*<sup>2</sup> statistic, a multivariate extension of the Shewhart (or *X*¯) control chart. Like their univariate counterparts, multivariate control charts can be classified as parametric or non-parametric according to whether the distributional assumptions underlying the chart (e.g. normality) are verified or not. When the assumption of normality is not verified, the use of conventional (multivariate) control charts for process monitoring is questionable. Non-parametric control charts do not require distributional assumptions on the process data and generally enjoy greater robustness, namely they are less sensitive to outliers than parametric control schemes. A survey of parametric multivariate SPC charts can be found in Bersimis *et al.*, 2007, while a review of non-parametric multivariate control charts can be found in Chakraborti & Graham, 2019.

Statistical depth functions are widely used in non-parametric statistics for the analysis of multivariate data. These are non-parametric functions that provide a dimension reduction for high-dimensional problems. In this work, we focus on the *L<sup>p</sup>* data depth to build a control chart. The *L<sup>p</sup>* data depth has additional advantages over other depth-based control charts already introduced in multivariate SPC (i.e., it remains easy to compute even in high dimensions).
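To make this center-outward ranking concrete, the following is a minimal empirical sketch of the *L<sup>p</sup>* depth used in this work (formally defined in Section 2): the expectation in 1/(1 + *E*∥**x** − *X*∥<sub>*p*</sub>) is replaced by a sample average over a reference sample. Function and variable names are illustrative, not part of the original proposal.

```python
import numpy as np

def lp_depth(x, sample, p=2):
    """Empirical L^p depth: 1 / (1 + average L^p distance from x to the sample)."""
    dists = np.linalg.norm(sample - x, ord=p, axis=1)  # ||x - X_i||_p for each row
    return 1.0 / (1.0 + dists.mean())

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))          # reference sample in d = 3 dimensions

center = np.zeros(3)                   # a point near the distribution's center
outlier = np.array([5.0, 5.0, 5.0])    # a point far from the center

# Deeper (more central) points receive higher depth values.
print(lp_depth(center, X) > lp_depth(outlier, X))  # True
```

Ranking all observations by this value gives the center-outward ordering on which depth-based charts are built.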

#### 2 Our proposal

Depth-based methodologies to construct control charts can be interpreted as multivariate generalizations of the standard univariate ones. A depth function provides the degree of centrality of a point *x* with respect to a distribution *F* on $\mathbb{R}^d$, denoted by *D*(*x*,*F*). Higher values of *D*(*x*,*F*) correspond to deeper (more central) points, while smaller values indicate less central points (i.e., points further away from the center with respect to *F*). Hence, a center-outward ranking of the data is provided. Several notions of data depth are available in the literature; the halfspace, simplicial, Mahalanobis and *L<sup>p</sup>* depths are some of the most popular ones. In this work, we adopt the notion of *L<sup>p</sup>* depth introduced by Zuo, 2004 because of its ease of computation and its (local and global) robustness properties. The depth is defined as follows:

$$L^p D\left(\mathbf{x}, F\right) = \frac{1}{1 + E\left(\left\|\mathbf{x} - X\right\|_p\right)},$$

where *X* ∼ *F*, ∥·∥<sub>*p*</sub> denotes the *L<sup>p</sup>*-norm (for *p* = 2 the Euclidean norm is obtained) and *E*(·) denotes the expected value.

We conducted a Monte Carlo simulation study to evaluate the performance of the *Q*-type control charts based on the *L<sup>p</sup>* data depth in comparison with the Mahalanobis depth-based *Q*-type charts. The *Q*-type control chart is the multivariate analogue of the univariate *X*¯ chart. We set *p* = 2 for the computation of the *L<sup>p</sup>* depth. The simulation study was designed to evaluate the chart performances under multiple settings, defined with regard to the number of variables to be monitored (i.e., the dimension), the size of the reference sample, the size of the sub-group, and different distributional settings (Normal, Skew-Normal and Cauchy). We considered both the in-control and out-of-control cases. Specifically, three out-of-control scenarios were evaluated: a shift in the mean vector, a change in the variance, and a combination of the two. We evaluated the performances in terms of the average run length (ARL) and its standard deviation. The ARL, one of the standard performance measures for comparing control charts, is defined as the expected number of samples required to obtain a first out-of-control signal; for an in-control process it equals the reciprocal of the false alarm probability. Results obtained from both the in-control and out-of-control cases indicate that *Q*-type charts based on *L*<sup>2</sup>*D* perform better than those based on the Mahalanobis depth regardless of the process distribution, the dimensionality, and the size of both the reference and sub-group samples.

#### 3 Conclusion

In a statistical process control framework, we proposed to use control charts based on the *L<sup>p</sup>* data depth. Our approach is fully non-parametric, meaning that the obtained charts are valid without parametric assumptions on the process distribution. In addition, these charts allow for the simultaneous detection of both a location change and a scale increase in a process. The performance of our proposal is investigated via a simulation study. The results show that the *L<sup>p</sup>* depth based control charts are a promising alternative to those based on the well-known Mahalanobis depth. Moreover, the *L<sup>p</sup>* depth is particularly appealing because of its computational ease even in high-dimensional spaces.

#### References

BERSIMIS, SOTIRIS, PSARAKIS, STELIOS, & PANARETOS, JOHN. 2007. Multivariate statistical process control charts: an overview. *Quality and Reliability Engineering International*, 23(5), 517–543.

CHAKRABORTI, S., & GRAHAM, M.A. 2019. Nonparametric (distribution-free) control charts: An updated overview and some results. *Quality Engineering*, 31(4), 523–544.

CROSIER, RONALD B. 1988. Multivariate generalizations of cumulative sum quality-control schemes. *Technometrics*, 30(3), 291–303.

HOTELLING, H. 1947. Multivariate quality control, illustrated by the air testing of sample bombsights. *Techniques of statistical analysis*, 111–184.

IDRIS, SUWANDA, WACHIDAH, LISNUR, SOFIYAYANTI, TETI, & HARAHAP, ERWIN. 2019. The Control Chart of Data Depth Based on Influence Function of Variance Vector. *In: Journal of Physics: Conference Series*, vol. 1366. IOP Publishing.

LIU, REGINA Y. 1995. Control charts for multivariate processes. *Journal of the American Statistical Association*, 90(432), 1380–1387.

MONTGOMERY, DOUGLAS C. 2007. *Introduction to statistical quality control*. John Wiley & Sons.

WOODALL, WILLIAM H. 2000. Controversies and contradictions in statistical process control. *Journal of Quality Technology*, 32(4), 341–350.

WOODALL, WILLIAM H., & MONTGOMERY, DOUGLAS C. 1999. Research issues and ideas in statistical process control. *Journal of Quality Technology*, 31(4), 376–386.

ZUO, YIJUN. 2004. Robustness of weighted L<sup>p</sup>-depth and L<sup>p</sup>-median. *Allgemeines Statistisches Archiv*, 88(2), 215–234.


### **ANGULAR HALFSPACE DEPTH: CENTRAL REGIONS**\*

Petra Laketa<sup>1</sup> and Stanislav Nagy<sup>1</sup>

<sup>1</sup> Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic (e-mail: laketa@karlin.mff.cuni.cz, nagy@karlin.mff.cuni.cz)

**ABSTRACT**: The angular halfspace depth is an extension of the classical halfspace depth that is applicable to directional data. The upper level sets of this function serve as an analogue of the inter-quantile regions, and allow the introduction of orderings and rank statistics for data living on the unit sphere. We explore the basic theoretical properties of these regions, and contrast them with the central regions defined in multivariate Euclidean spaces using the standard halfspace depth.

**KEYWORDS**: angular depth, central regions, directional data analysis, halfspace depth.

#### **1 Angular halfspace depth**

Statistical depth is a remarkable tool for ordering multivariate data. For a probability measure *P* in the Euclidean space $\mathbb{R}^d$, $d \ge 1$, the depth describes how "centrally located" a point $\mathbf{x} \in \mathbb{R}^d$ is with respect to *P*. Arguably the most important depth in $\mathbb{R}^d$ is the *halfspace depth* that to each $\mathbf{x} \in \mathbb{R}^d$ assigns

$$hD\left(\mathbf{x}; P\right) = \inf\left\{ P\left(H\right) : H \in \mathcal{H} \text{ and } \mathbf{x} \in H \right\} \in [0, 1],\tag{1}$$

for $\mathcal{H} = \left\{ H_{\mathbf{y},\mathbf{v}} : \mathbf{y} \in \mathbb{R}^d,\ \mathbf{v} \in \mathbb{R}^d \setminus \{\mathbf{0}\} \right\}$ the set of all closed halfspaces $H_{\mathbf{y},\mathbf{v}} = \left\{ \mathbf{z} \in \mathbb{R}^d : \langle \mathbf{z}-\mathbf{y}, \mathbf{v} \rangle \ge 0 \right\}$ in $\mathbb{R}^d$. Here we deal with directional data (Ley & Verdebout, 2017), meaning data generated from *P* whose support lies on the unit sphere $\mathbb{S}^{d-1} = \left\{ \mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| = 1 \right\}$. For most such *P*, the depth (1) is trivially zero on $\mathbb{S}^{d-1}$, and is therefore of no use. In that situation, it is more natural to consider an angular variant of the halfspace depth introduced by Small, 1987. Let $\mathcal{H}_0 \subset \mathcal{H}$ be the collection of those halfspaces $H_{\mathbf{0},\mathbf{v}} \in \mathcal{H}$ whose boundary contains the origin $\mathbf{0} \in \mathbb{R}^d$. The *angular halfspace depth* of $\mathbf{x} \in \mathbb{S}^{d-1}$ with respect to a probability measure *P* on $\mathbb{S}^{d-1}$ is defined as

$$ahD\left(\mathbf{x}; P\right) = \inf\left\{P\left(H\right) : H \in \mathcal{H}\_0 \text{ and } \mathbf{x} \in H\right\} \in [0, 1]. \tag{2}$$

\*This work was supported by the grant 19-16097Y of the Czech Science Foundation, and by the PRIMUS/17/SCI/3 project of Charles University. P. Laketa was supported by the OP RDE project "International mobility of research, technical and administrative staff at the Charles University" CZ.02.2.69/0.0/0.0/18 053/0016976.

The similarity of the depths (1) and (2) is obvious: one simply restricts to halfspaces from $\mathcal{H}_0$ when considering the angular depth. Many properties of the angular halfspace depth were explored by Liu & Singh, 1992. Here we first review some of those known results, and then derive an array of new properties of the upper level sets of the function (2) for general probability measures on $\mathbb{S}^{d-1}$.

#### **2 A hemisphere of constant depth**

A peculiar property of the angular halfspace depth is that it must be constant on a hemisphere of $\mathbb{S}^{d-1}$, for any *P* on $\mathbb{S}^{d-1}$. More precisely, Proposition 4.6 of Liu & Singh, 1992 says that there exists a hemisphere with a constant angular depth equal to $\alpha_0 = \inf_{\mathbf{x} \in \mathbb{S}^{d-1}} ahD(\mathbf{x};P)$. That result is given without a proof, and from the context it appears to be claimed for a closed hemisphere. In our first example we demonstrate that for general measures *P* one has to be cautious when formulating this statement.

Consider the probability measure *P* on the circle $\mathbb{S}^1$ in $\mathbb{R}^2$ (left panel of Figure 1) defined as a mixture of a uniform distribution on the upper halfcircle $\mathbb{S}^1_+ = \left\{ (x_1, x_2) \in \mathbb{S}^1 : x_2 > 0 \right\}$ and an atom at $\mathbf{a} = (1,0)$ with equal weights $1/2$. For each $n = 1,2,\dots$ consider a halfspace (beige region in Figure 1)

**Figure 1.** *Left: For P a mixture of a uniform distribution on* $\mathbb{S}^1_+$ *(grey arc) and an atom at the point* $\mathbf{a}$ *(black point), a closed hemisphere of constant depth does not exist. Right: A measure P with five atoms such that ahD*(·;*P*) *is constant on* $\mathbb{S}^1$ *and equal to* $\alpha_0 = 2/5$*.*

$$H_n = \left\{ (x_1, x_2) \in \mathbb{R}^2 : x_1 \cos(\pi/2 - 1/n) + x_2 \sin(\pi/2 - 1/n) \le 0 \right\} \in \mathcal{H}_0$$

not containing $\mathbf{a}$, at an angle $\theta_n = -1/n$ with the $x_1$-axis. Since $\lim_{n\to\infty} \theta_n = 0$, surely $\lim_{n\to\infty} P(H_n) = 0$, meaning that $ahD(\mathbf{x};P) = 0$ for any point $\mathbf{x}$ in the lower halfcircle $\mathbb{S}^1_- = \left\{ (x_1, x_2) \in \mathbb{S}^1 : x_2 < 0 \right\}$. We obtain $\alpha_0 = 0$. On the other hand, $P(H) \ge 1/2$ for every $H \in \mathcal{H}_0$ that contains $\mathbf{a}$, implying that $ahD(\mathbf{a};P) \ge 1/2 > \alpha_0$. Also, any $H \in \mathcal{H}_0$ that contains points from $\mathbb{S}^1_+$ is of positive *P*-mass. Overall we obtain that the angular depth is positive exactly in the set $\mathbb{S}^1_+ \cup \{\mathbf{a}\}$, and there is no closed halfcircle of $\mathbb{S}^1$ of depth $\alpha_0 = 0$.
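This example can be checked numerically. The sketch below approximates the angular halfspace depth (2) on the circle by scanning a finite grid of halfplanes through the origin; it is only a grid approximation (the infimum over the lower halfcircle is approached, not attained exactly), and all names are illustrative.

```python
import numpy as np

def ahd(x, points, weights, n_dir=720):
    """Grid approximation of the angular halfspace depth on the circle S^1.

    Scans closed halfplanes H_v = {z : <z, v> >= 0} through the origin over
    n_dir normal directions v, keeps those containing x, and returns the
    minimal P-mass among them (a discretisation of definition (2))."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    v = np.column_stack([np.cos(theta), np.sin(theta)])   # candidate normals
    contains_x = v @ x >= 0.0                             # halfplanes with x in H_v
    mass = ((points @ v.T >= 0.0) * weights[:, None]).sum(axis=0)  # P(H_v) per v
    return mass[contains_x].min()

# P: mixture of the uniform distribution on the upper halfcircle (here a
# Monte Carlo sample of size 2000) and an atom at a = (1, 0), weights 1/2 each.
rng = np.random.default_rng(0)
phi = rng.uniform(0.0, np.pi, 2000)
pts = np.vstack([np.column_stack([np.cos(phi), np.sin(phi)]), [[1.0, 0.0]]])
w = np.append(np.full(2000, 0.5 / 2000), 0.5)

print(ahd(np.array([0.0, -1.0]), pts, w))  # near 0: the lower halfcircle has depth 0
print(ahd(np.array([1.0, 0.0]), pts, w))   # near 0.5: the atom has depth >= 1/2
```

Refining the direction grid and the Monte Carlo sample drives the first value towards its infimum 0, in agreement with the argument above.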

In our example there exists an open hemisphere $\mathbb{S}^1_-$ with constant depth $\alpha_0$. That is not a coincidence: an analogous result for an open hemisphere of constant depth can be proved\* for any measure *P* on $\mathbb{S}^{d-1}$. It is however interesting to note that it may also happen that the depth (2) is constant on the whole sphere $\mathbb{S}^{d-1}$, and therefore equal to $\alpha_0$ everywhere; see Figure 1.

#### **3 Central regions of the angular halfspace depth**

While for any *P* on $\mathbb{S}^{d-1}$ there always exists an open hemisphere $S_{\min} \subset \mathbb{S}^{d-1}$ of minimal depth, its complement $\mathbb{S}^{d-1} \setminus S_{\min}$ typically contains points of higher depth (2). The upper level sets of the angular halfspace depth therefore form a basis for generalizations of quantiles and inter-quantile regions to $\mathbb{S}^{d-1}$, in the same way as the level sets of the halfspace depth (1) do in $\mathbb{R}^d$. The *central region* of *P* at level $\alpha \ge 0$ is given by

$$D_{\alpha} = \left\{ \mathbf{x} \in \mathbb{S}^{d-1} : ahD\left(\mathbf{x}; P\right) \ge \alpha \right\}.\tag{3}$$

The smallest non-empty region $D_\alpha$ provides a natural analogue of the median applicable to directional data. In analogy with the corresponding properties well established for the standard halfspace depth (1) in $\mathbb{R}^d$, it is possible to show that the regions (3) also possess several attractive traits: they are closed and spherically convex sets in $\mathbb{S}^{d-1}$ that can be represented as intersections of closed spherical halfspaces (sets of the form $H \cap \mathbb{S}^{d-1}$ with $H \in \mathcal{H}_0$). Formal proofs of the following statements will appear in our comprehensive treatment of the theory of the angular halfspace depth, currently in preparation. In each of the statements, *P* is a Borel probability measure on $\mathbb{S}^{d-1}$.

**Upper semi-continuity.** The mapping $\mathbb{S}^{d-1} \to [0,1] : \mathbf{x} \mapsto ahD(\mathbf{x};P)$ is upper semi-continuous, i.e. for any $\mathbf{x} \in \mathbb{S}^{d-1}$ and any sequence $\{\mathbf{x}_n\}_{n=1}^{\infty} \subset \mathbb{S}^{d-1}$ that converges to $\mathbf{x}$ it holds that $\limsup_{n\to\infty} ahD(\mathbf{x}_n;P) \ge ahD(\mathbf{x};P)$. As a consequence, all depth regions (3) are closed sets.

\*The proof of this claim is not difficult, but due to the limited available space we will present it elsewhere, together with all the other technical derivations outlined in the rest of this note.

**Intersection of halfspaces.** For any $\alpha > \alpha_0$ we can write

$$D\_{\alpha} = \bigcap \left\{ \text{int}(H) : H \in \mathcal{H}\_{0} \text{ and } P(\text{int}(H)) > 1 - \alpha \right\} \cap \left( \mathbb{S}^{d-1} \backslash S\_{\text{min}} \right),$$

where int(*H*) denotes the interior of *H*, which is an open halfspace. This result is weaker than the one for the usual halfspace depth (Proposition 6 of Rousseeuw & Ruts, 1999), where one can write a central region as an intersection of closed halfspaces from $\mathcal{H}$. As a consequence of our result we obtain that each $D_\alpha$ with $\alpha > \alpha_0$ is an intersection of a convex set in $\mathbb{R}^d$ and a hemisphere. The case $\alpha = \alpha_0$ trivially gives the whole sphere, $D_{\alpha_0} = \mathbb{S}^{d-1}$.

#### **4 Refinements for smooth measures**

We say that a probability measure *P* on $\mathbb{S}^{d-1}$ is smooth if $P(\partial H) = 0$ for the boundary hyperplane $\partial H$ of any $H \in \mathcal{H}_0$. This condition is satisfied by any *P* that has a density with respect to the spherical Lebesgue measure on $\mathbb{S}^{d-1}$. For smooth *P* one obtains stronger results about the central regions of $ahD(\cdot\,;P)$:


A useful application of these results is the construction of bagdistances for directional data presented by *H. Demni* in this book of short papers. In that contribution, bagdistances are successfully used in a comprehensive simulation study of nonparametric classification of points in $\mathbb{S}^{d-1}$.

#### **References**


### CLUSTERING PRODUCTION INDEXES FOR CONSTRUCTION WITH FORECAST DISTRIBUTIONS


Michele La Rocca 1, Francesco Giordano1 and Cira Perna1

<sup>1</sup> Department of Economics and Statistics, University of Salerno, (e-mail: larocca@unisa.it, giordano@unisa.it, perna@unisa.it)

ABSTRACT: In this paper we focus on a recent proposal for clustering nonlinear time series data in which dissimilarities are computed according to time series forecast distributions. The aim is to evaluate the impact of the COVID-19 pandemic on the construction sector for a set of 21 European countries.

KEYWORDS: Feedforward neural networks, bootstrap, nonlinear time series.

#### 1 Introduction

In the last decades there has been a growing interest in time series clustering. Some recent approaches rely on distance criteria which compare forecast densities estimated by using a resampling method combined with a nonparametric kernel estimator (see Alonso *et al.*, 2006 and Vilar *et al.*, 2010). More recently, La Rocca *et al.*, 2021 have proposed a novel approach for clustering nonlinear autoregressive time series based on the use of a class of neural network models to approximate the original nonlinear process, combined with the pair bootstrap as a resampling device. The aim of this paper is to discuss the novel approach and to evaluate the impact of the COVID-19 pandemic on the production index for construction, an important business cycle indicator, for a set of 21 European countries.

#### 2 The clustering procedure in a nutshell

Let $\{Y_t, t \in \mathbb{Z}\}$ be a real-valued stationary stochastic process modeled as a nonlinear autoregressive (NAR) model of the form $Y_t = g(\mathbf{x}_{t-1}) + \varepsilon_t$, where $g(\cdot)$ is an unknown (possibly) nonlinear regression function, $\mathbf{x}'_{t-1} = (Y_{t-1}, \dots, Y_{t-p})$ and $\{\varepsilon_t\}$ are *iid* error terms with $\mathbb{E}[\varepsilon_t] = 0$ and $\mathbb{E}[\varepsilon_t^2] > 0$. Let $\mathbf{y}^{(1)}, \dots, \mathbf{y}^{(S)}$ be *S* observed time series of length *T* generated from a DGP of the previous class, where $\mathbf{y}^{(i)} = (Y^{(i)}_1, \dots, Y^{(i)}_T)$. The aim is to cluster the time series based on their full forecast distribution at a specific future time $T + h$, with $h \ge 1$.
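As a toy instance of this model class (with an illustrative exponential-autoregressive choice of $g(\cdot)$, not the production-index data analysed in the paper):

```python
import numpy as np

def simulate_nar(T, g, burn=200, seed=0):
    """Simulate the NAR(1) process Y_t = g(Y_{t-1}) + eps_t with iid N(0,1) errors."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T + burn)
    for t in range(1, T + burn):
        y[t] = g(y[t - 1]) + rng.normal()
    return y[burn:]                      # drop the burn-in to approach stationarity

# An exponential AR(1): a smooth transition between two autoregressive regimes.
g = lambda u: 0.9 * u * np.exp(-0.5 * u * u)
y = simulate_nar(500, g)
print(y.shape)  # (500,)
```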

This approach accounts for the future dynamic behaviour of the time series by using the $L^r$-norm distance $D_{r,ij} = \int \left| F^i_{T+h|T}(y) - F^j_{T+h|T}(y) \right|^r \mathrm{d}y$, $r = 1,2$, where $F^i_{T+h|T}(\cdot)$, $i = 1, \dots, S$, is the forecast distribution function of the series $\mathbf{y}^{(i)}$ at a given future point $T + h$, conditioned on the information set available up to time *T*. Since the $L^r$-norm distance previously defined cannot be computed directly, La Rocca *et al.*, 2021 have proposed a strategy in which the unknown distributions are consistently estimated by using a feedforward neural network estimator and the pair bootstrap approach. In particular, given the forecast horizon *h*, the unknown function $g(\cdot)$ can be approximated by using the network

$$f\_{mh} \left( \mathbf{x}\_{t-h}; \Theta \right) = \sum\_{k=1}^{m} c\_k \Psi \left( \mathbf{w}\_k' \mathbf{x}\_{t-h} + \mathbf{w}\_{k0} \right) + c\_0 \tag{1}$$

with θ = (*c*0,*c*1,..., *cm*,w1,...,w*m*,*w*10,...,*wm*0), where *m* is the hidden layer size, w*<sup>k</sup>* are the vectors of weights for the connections between input layer and hidden layer, *ck*, *k* = 1,...,*m* are the weights of the link between the hidden layer and the output; *wk*<sup>0</sup> and *c*<sup>0</sup> are the bias terms; ψ(·) is a proper chosen activation function and x′ *<sup>t</sup>*−*<sup>h</sup>* = (*Yt*−*h*,...,*Yt*−*h*−*p*+1).

The general procedure is summarized in the following Algorithm.

#### Algorithm

CLUSTERING PRODUCTION INDEXES FOR CONSTRUCTION WITH FORECAST DISTRIBUTIONS Michele La Rocca 1, Francesco Giordano1 and Cira Perna1

<sup>1</sup> Department of Economics and Statistics, University of Salerno, (e-mail:

ABSTRACT: In this paper we focus on a recent proposal for clustering nonlinear time series data in which dissimilarities are computed according to time series forecast distributions. The aim is to evaluate the impact of COVID-19 pandemic on the con-

In the last decades there has been a growing interest in time series clustering. Some recent approaches rely on the use of distance criteria which compare the forecast densities estimated by using a resampling method combined with a nonparametric kernel estimator (see Alonso *et al.*, 2006 and Vilar *et al.*, 2010). More recently, La Rocca *et al.*, 2021, have proposed a novel approach for clustering nonlinear autoregressive time series based on the use of a class of neural network models to approximate the original nonlinear process, combined with the pair bootstrap as a resampling device. The aim of this paper is to discuss the novel approach and to evaluate the impact of COVID-19 pandemic on the production index for construction, an important business cycle indicator, for a

Let {*Yt*,*t* ∈ Z} be a real valued stationary stochastic process modeled as a nonlinear autoregressive (NAR) model of the form *Yt* = *g*(x*t*−1)+ε*t*, where *g*(·) is

be *S* observed time series of length *T* generated from a DGP of the previous

their full forecast distribution at a specific future time *T* +*h*, with *h* ≥ 1.

*<sup>t</sup>*−<sup>1</sup> = (*Yt*−1,...,*Yt*−*p*)

,...,y(*S*)

y(1)

*<sup>t</sup>* ] <sup>&</sup>gt; 0. Let

. The aim is to cluster time series based on

KEYWORDS: Feedforward neural networks, bootstrap, nonlinear time series.

larocca@unisa.it, giordano@unisa.it, perna@unisa.it)

struction sector for a set of 21 European countries.

1 Introduction

set of 21 European countries.

class, where y(*i*) =

2 The clustering procedure in a nutshell

an unknown (possibly) nonlinear regression function, x′

and {ε*t*} are *iid* error terms, with <sup>E</sup>[ε*t*] = 0 and <sup>E</sup>[ε<sup>2</sup>

<sup>1</sup> ,...,*Y*(*i*) *T* 

 *Y*(*i*)


$$\hat{\boldsymbol{\Theta}}\_{h}^{\*} = \underset{\bullet}{\arg\min} \boldsymbol{\upmu}\_{\boldsymbol{\Theta}} \frac{1}{T - p\_{\boldsymbol{\bot}}^{-}h + 1} \sum\_{t=p+h}^{T} \left( Y\_{t}^{\*} - f\_{mh} \left( \mathbf{x}\_{t-h}^{\*}; \boldsymbol{\Theta} \right) \right)^{2} \dots$$


Note that, as usual, the bootstrap distribution can be approximated by Monte Carlo simulations repeating *B* times steps 5-7, and then computing the empirical cumulative distribution function (ECDF) of *Y*ˆ *<sup>b</sup> <sup>T</sup>*+*h*, *b* = 1,2,...,*B*. As a resampling device, the pair bootstrap has been implemented, a suitable choice in the context of neural network models. Moreover, being the data generating process nonlinear, a direct multi-step forecasting approach is considered, where a separate neural network model is estimated for each forecasting horizon, and forecasts are computed only conditioning on the observed data.
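The two ingredients above, a pair-bootstrap sample of $\hat{Y}^{b}_{T+h}$ and the $L^1$ distance between two estimated ECDFs, can be sketched as follows. This is a minimal illustration, not the authors' implementation: for brevity, a linear AR($p$) fit via `np.linalg.lstsq` stands in for the neural network $f_{mh}$, and all function names are our own.

```python
import numpy as np

def pair_bootstrap_forecasts(y, p=2, h=1, B=200, seed=0):
    """Bootstrap sample of Y_{T+h} via the pair bootstrap.

    The direct h-step pairs (x_{t-h}, Y_t) are resampled jointly,
    the model is refit on each resample, and the h-step forecast is
    recomputed.  A linear AR(p) fit stands in for the network f_mh.
    """
    y = np.asarray(y, float)
    T = len(y)
    # regressors x_{t-h} = (Y_{t-h}, ..., Y_{t-h-p+1}), targets Y_t
    rows = [y[t - h - p + 1:t - h + 1][::-1] for t in range(p + h - 1, T)]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])  # T-p-h+1 pairs
    Y = y[p + h - 1:]
    x_last = np.concatenate([[1.0], y[T - p:][::-1]])  # regressors for Y_{T+h}
    rng = np.random.default_rng(seed)
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, len(Y), len(Y))            # resample the pairs
        coef = np.linalg.lstsq(X[idx], Y[idx], rcond=None)[0]
        boot[b] = x_last @ coef                          # bootstrap forecast
    return boot

def l1_ecdf_distance(a, b):
    """L1 distance between the ECDFs of two bootstrap forecast samples."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    # the ECDF difference is a step function on the pooled grid
    return np.sum(np.abs(Fa - Fb)[:-1] * np.diff(grid))
```

Applying `l1_ecdf_distance` to each pair of bootstrap forecast samples yields the dissimilarity matrix $D_{1,ij}$ fed to the clustering step.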

#### 3 An application to the European construction sector

The proposed procedure has been used to cluster the production index for construction (seasonally and calendar adjusted) for 21 European countries observed from January 2000 to December 2020 (base year 2015). The production index measures the activity of the building and construction industry, and it is considered a critical business cycle indicator. The dataset is available from the Eurostat website. The aim here is to identify the different group structures induced by the COVID-19 pandemic by using three forecast distributions: the one-step-ahead distribution for January 2020 (thus excluding any observations from the COVID-19 pandemic), the twelve-step-ahead distribution for January 2021, and the one-step-ahead distribution for January 2021 (with all models trained on data up to December 2020).
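The dendrograms in Figure 1 are obtained by hierarchical clustering of the pairwise distances between forecast distributions. A minimal average-linkage agglomeration over a precomputed distance matrix can be sketched as follows (illustrative code with our own function names; the paper does not state which linkage was used):

```python
import numpy as np

def average_linkage(D, k):
    """Agglomerate S series into k groups by average linkage.

    D is a symmetric (S, S) matrix of pairwise distances, e.g. the
    L1 distances between bootstrap forecast ECDFs.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average distance over all cross-cluster pairs
                d = np.mean([D[i][j] for i in clusters[a] for j in clusters[b]])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]   # merge the closest pair of clusters
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Cutting the merge sequence at different heights reproduces the nested partitions displayed by a dendrogram.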

Figure 1: Construction indexes clustering based on *h*-step ahead forecast distributions and *L*1-norm distance. (a) Training period January 2000 - December 2019, prediction *h* = 1; (b) Training period January 2000 - December 2019, prediction *h* = 12; (c) Training period January 2000 - December 2020, prediction *h* = 1. [Three cluster dendrograms (height vs. country) over Austria, Belgium, Bulgaria, Croatia, Czechia, Denmark, Finland, France, Germany, Hungary, Italy, Luxembourg, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, and the UK.]

The group structure would apparently have been almost identical without the impact of the COVID-19 pandemic, suggesting a rather stable economic evolution for all the countries considered in the application (see panels a and b). On the contrary, when the models include the year 2020 in the training period, in which all countries experienced severe contractions in their economic activity, the data show a markedly different group structure, indicating different routes and timelines for economic recovery (see panel c).

#### References

ALONSO, A.M., BERRENDERO, J.R., HERNÁNDEZ, A., & JUSTEL, A. 2006. Time series clustering based on forecast densities. *Computational Statistics & Data Analysis*, 51(2), 762–776.

LA ROCCA, M., GIORDANO, F., & PERNA, C. 2021. Clustering nonlinear time series with neural network bootstrap forecast distributions. *Submitted*.

VILAR, J.A., ALONSO, A.M., & VILAR, J.M. 2010. Non-linear time series clustering based on non-parametric forecast densities. *Computational Statistics & Data Analysis*, 54(11), 2850–2865.

## CLUSTERING LONGITUDINAL DATA WITH CATEGORY THEORY FOR DIABETIC KIDNEY DISEASE

Maria Mannone<sup>12</sup>, Veronica Distefano<sup>1</sup>, Claudio Silvestri<sup>13</sup> and Irene Poli<sup>1</sup>

<sup>1</sup> European Centre for Living Technology, Ca' Foscari University of Venice, Italy, (e-mail: maria.mannone@unive.it, veronica.distefano@unive.it, claudio.silvestri@unive.it, irenpoli@unive.it)

<sup>2</sup> Department of Mathematics and Computer Sciences, University of Palermo, Italy

<sup>3</sup> Dipartimento di Scienze Ambientali, Informatica e Statistica, Ca' Foscari University of Venice, Italy

ABSTRACT: In the framework of precision medicine, we investigate the similarity of diabetic kidney disease (DKD) patients through longitudinal data clusters. Starting with insights from category theory, we build patients' clusters according to the shapes of their trajectories, adopting the Fréchet distance. We group patients according to the behavior of their estimated glomerular filtration rate (eGFR), obtaining informative mean curves. Behavior pattern recognition can shed light on individualized treatments.

KEYWORDS: longitudinal data clustering, category theory, Fréchet distance, precision medicine, DKD disease progress

### 1 Introduction

Precision medicine aims to find individualized therapeutic treatments according to patients' specific characteristics. To make accurate predictions it is crucial to retrieve information on the long-term reactions of patients to given treatments, investigating time trajectories of the disease progress (Karpati *et al.*, 2018). Each patient is described by demographic and clinical variables at different time points. The similarity of patients' behavior across time can be captured by *clusters of trajectories*. The final aim of this research is to build clusters of longitudinal data to identify the optimal treatment rule. Therefore, we intend to build patient clusters according to distances between patients at each time point, and distances of the same patient between time points. We can consider a distance as a transformation, and a distance variation as a transformation between transformations. Because the concept of transformations between transformations is the starting point of mathematical category theory\* (Grandis, 2020), we can exploit its concepts for cluster analysis (Carlsson & Mémoli, 2013). We introduce comparisons between patient trajectories according to their shapes. Here, we compare patients' trajectories building clusters of shapes using the Fréchet distance (Genolini *et al.*, 2016), which takes shape variations into account. The novelty of our study is an integrative approach joining categories, clusters, and shape trajectories. For each patient we observe demographic and clinical variables, treatments, and the response to the different treatments. This approach is applied to a real dataset concerning patients with diabetic kidney disease (DKD) from the DC-ren project.† Clustering is based on the response to the treatment, which is evaluated from the estimated glomerular filtration rate (eGFR). The identification of different evolution patterns can shed light on the best individualized drug combination.

The paper is structured as follows: in Section 2 we introduce some theoretical concepts and the methodology we adopted, and in Section 3 we analyze the results of our study.

#### 2 Methodology

Let us consider a dataset composed of $n$ patients characterized by $p$ observable variables at four time points $t_0, t_1, t_2, t_3$. Each patient is characterized as a triplet $(X_i(t_k), D(t_k), Y_i(t_k))$, where $i$ is the individual (the patient); $t_k$ is the time point, $k = 0,1,2,3$; $X_i(t_k)$ is a set of variables characterizing the individual; $Y_i(t_k)$ is the value of the response variable $Y$ at $t_k$; and $D(t_k)$ stands for the given drug. We indicate the distance between patients $i, i'$ with respect to the variable $Y$ and time $k$ as $d^{Y}_{i,i'}(t_k)$, and the distance between the values observed at times $t_k, t_{k'}$ of the variable $Y$ for the same individual $i$ as $d^{Y}_{i}(t_k, t_{k'})$. We thus obtain an *enriched double category* with metrics in $\mathbb{R}$ (Grandis, 2020), having $x^{Y}_{i}(t_k)$, $k = 0,\ldots,3$, $i = 1,\ldots,n$, as objects, and $d^{Y}_{i,i'}(t_k)$, $d^{Y}_{i}(t_k, t_{k'})$ as morphisms. The comparison of trajectories of different patients involves both of these distances. The time trajectory of the $i$-th patient is indicated as $path^{Y}_{i}$. Clustering is a functor $F_a : path^{Y}_{i} \rightarrow \langle path^{a,Y}_{\iota} \rangle$, $\iota = 1,\ldots,n' < n$, where $n'$ is the number of significative curves, which groups similar trajectories in the same cluster.

\*A category is constituted by objects (points) and morphisms (arrows). A *functor* maps objects and morphisms of a category into objects and morphisms of another category. *Natural transformations* map functors to functors.

†https://dc-ren.eu/. The project focuses on type 2 diabetes.

The $\langle path^{a,Y}_{\iota} \rangle$, $\forall \iota$, is the representative curve of each cluster.‡ Most of the existing research uses the Euclidean distance to compare trajectories. However, because we aim to compare trajectory shapes, we compute the Fréchet distance and, according to this distance, build patient clusters. The Fréchet distance is based on the comparison between pairs of points following the profiles of the curves they belong to. We analyze the response to the treatment, measured through the eGFR variable, and build patient clusters deriving the mean eGFR trajectories. We then investigate the characteristics of the patients in relation to the demographic and clinical variables which characterize each of them. We evaluated the behavior of 241 DKD patients observed over a 4-year period within the DC-ren project. Computationally, we used an extension of longitudinal k-means, *kmlShape* (Genolini *et al.*, 2016), with time scale 0.5.
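The point-pair comparison underlying the Fréchet distance can be illustrated with its discrete version, computed by the standard Eiter-Mannila dynamic program. This is our own Python sketch for illustration; the study itself relies on the R package *kmlShape*:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two trajectories.

    P and Q are sequences of (time, value) points.  The dynamic
    program computes the minimal 'leash length' needed to traverse
    both curves monotonically, so the result reflects the shapes of
    the two curves rather than only pointwise gaps.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)  # pointwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # cheapest way to reach (i, j) while moving forward on both curves
            prev = min(ca[i - 1, j] if i else np.inf,
                       ca[i, j - 1] if j else np.inf,
                       ca[i - 1, j - 1] if i and j else np.inf)
            ca[i, j] = max(prev, d[i, j])
    return ca[-1, -1]
```

Evaluating `discrete_frechet` on every pair of eGFR trajectories yields the distance matrix on which the shape clusters are built.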

#### 3 Results

In this study, which builds clusters from the shapes of individual trajectories without assuming a particular shape, we obtained a grouping of patients according to the similarity of their eGFR behavior. Trajectories are compared through the Fréchet distance, computed on the continuous variable eGFR. We obtain 8 patient clusters with similar eGFR shapes of individual trajectories. In Figure 1a we represent the obtained clusters. In the Table (Figure 1b) we show the behavior of the clusters achieved with the Fréchet distance, with standard deviations at each time point. We find three main patient subgroups: patients with decreasing eGFR (clusters 1, 3, 8); slowly decreasing eGFR (cl. 2, 4); and stable/increasing eGFR (cl. 5, 6, 7). To explain these behaviors, we consider some relevant clinical characteristics of the patients, namely the urinary albumin-to-creatinine ratio (mean UACR) and the glycated hemoglobin (HbA1c), shown in the Table (Figure 1b). Mean UACR is decreasing in cluster 6, while it is increasing in cluster 8. Decreasing or stable values of HbA1c, which characterize DKD patients (Karpati *et al.*, 2018), are shown by patients in cl. 6, while increasing values of HbA1c are shown by patients in cl. 8. We notice that there is a relationship between non-decreasing eGFR and stable HbA1c. Patients can receive different treatments, such as $D_1, D_2, D_3, D_4$. In cluster 3 most of the patients that change drug ($D_1 \rightarrow D_1 + D_2$) show a positive response to the treatment. However, patients in cluster 2 who change the drug do not have a positive response. In cluster 7, all patients keep the same drug ($D_1$), and most of them show a positive response. Patients in clusters 6 and 7 display the best eGFR behavior; most of the patients in these clusters in fact keep a stable behavior and a positive response to the treatment. Patients in clusters 1 and 6 start from close eGFR values, but different UACR values; given $D_1 + D_2$, their response is different. Patients in clusters 1, 2, 5 change, receiving $D_1 + D_3$ and $D_1 + D_4$. These results will be considered in building a predictive system to envisage the best treatment for each individual.

‡A different clustering method $F_b$ gives us similar representative paths $\langle path^{b,Y}_{\iota} \rangle$. The natural transformation $\alpha_{a,b} : F_a \rightarrow F_b$ maps clustering methods.

Figure 1: Shape clusters (a) and Table of mean values (b). [Panel (a): the eight mean eGFR curves over times $t_0$ to $t_3$.]

| variables / clusters | cl. 1 (19%) | cl. 2 (16%) | cl. 3 (13%) | cl. 4 (12%) | cl. 5 (11%) | cl. 6 (11%) | cl. 7 (10%) | cl. 8 (7%) |
|---|---|---|---|---|---|---|---|---|
| eGFR (t0) | 82 (6) | 71 (5) | 67 (7) | 44 (8) | 49 (6) | 79 (9) | 57 (6) | 37 (4) |
| eGFR (t1) | 79 (11) | 68 (9) | 57 (11) | 43 (7) | 52 (6) | 89 (14) | 62 (9) | 31 (6) |
| eGFR (t2) | 76 (10) | 67 (9) | 52 (10) | 39 (6) | 51 (5) | 83 (10) | 64 (9) | 29 (7) |
| eGFR (t3) | 69 (7) | 63 (6) | 46 (7) | 39 (6) | 51 (5) | 89 (10) | 66 (6) | 26 (4) |
| mean\_UACR (t0) | 29.36 (48.07) | 58.27 (278.21) | 170.18 (466.31) | 162.48 (513.48) | 46.76 (143.67) | 40.41 (85.99) | 46.41 (144.17) | 142.71 (246.52) |
| mean\_UACR (t1) | 22.61 (25.89) | 51.99 (162.43) | 84.76 (296.70) | 85.46 (208.09) | 42.50 (136.90) | 57.50 (200.63) | 36.38 (108.26) | 126.52 (220.99) |
| mean\_UACR (t2) | 25.15 (13.29) | 56.10 (219.01) | 100.46 (254.98) | 112.26 (236.83) | 82.67 (227.14) | 31.24 (71.27) | 39.98 (81.98) | 131.71 (300.80) |
| mean\_UACR (t3) | 47.07 (120.65) | 40.55 (130.90) | 121.71 (337.43) | 107.78 (262.89) | 76.96 (177.44) | 29.66 (65.78) | 71.67 (205.72) | 219.31 (553.02) |
| HbA1c (t0) | 7.0 (1.3) | 7.2 (1.3) | 7.2 (1.1) | 7.1 (1.3) | 7.2 (1.1) | 7.4 (0.9) | 7.1 (1.3) | 6.8 (0.8) |
| HbA1c (t1) | 7.2 (1.2) | 7.5 (1.3) | 7.3 (1.2) | 7.2 (1.7) | 7.3 (1.2) | 7.3 (1.2) | 7.5 (1.3) | 7.0 (1.4) |
| HbA1c (t2) | 7.1 (1.2) | 7.4 (1.5) | 7.2 (1.1) | 7.4 (1.3) | 7.2 (1.0) | 7.2 (1.0) | 7.3 (1.2) | 7.3 (1.7) |
| HbA1c (t3) | 7.0 (1.0) | 7.4 (1.2) | 7.4 (1.1) | 7.8 (1.4) | 7.9 (1.7) | 7.9 (1.7) | 7.5 (1.2) | 7.9 (1.9) |

#### Acknowledgements

This research activity is part of the project DC-ren that has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 848011. We thank the researchers of the European Centre for Living Technology (ECLT) for helpful discussions and suggestions.

#### References

CARLSSON, G., & MÉMOLI, F. 2013. Classifying Clustering Schemes. *Foundations of Computational Mathematics*, 13, 221–252.

GENOLINI, C., ECOCHARD, R., BENGHEZAL, M., DRISS, T., ANDRIEU, S., & SUBTIL, F. 2016. kmlShape: An Efficient Method to Cluster Longitudinal Data (Time-Series) According to Their Shapes. *PLoS ONE*, 11(6).

GRANDIS, M. 2020. *Higher Category Theory*. Singapore: World Scientific.

KARPATI, T., LEVENTER-ROBERTS, M., FELDMAN, B., COHEN-STAVI, C., RAZ, I., & BALICER, R. 2018. Patient clusters based on HbA1c trajectories: A step toward individualized medicine in type 2 diabetes. *PLoS ONE*, 13(11).

## A REDUNDANCY ANALYSIS WITH MULTIVARIATE RANDOM-COEFFICIENTS LINEAR MODELS

### A REDUNDANCY ANALYSIS WITH MULTIVARIATE RANDOM-COEFFICIENTS LINEAR MODELS

Laura Marcis<sup>1</sup>, Maria Chiara Pagliarella<sup>2</sup> and Renato Salvatore<sup>1</sup>

<sup>1</sup> Department of Economics and Law, University of Cassino (e-mail: laura.marcis@unicas.it, rsalvatore@unicas.it)

<sup>2</sup> Italian National Institute for Public Policy Analysis (e-mail: mc.pagliarella@inapp.org)

ABSTRACT: Random-coefficients linear models can be considered a particular case of linear mixed models, in which the random effects depend on the model fixed-effects design matrix. A Redundancy Analysis of the estimates of the multivariate random effects can capture the leading contribution to this correlation. Starting from the standardized multivariate best linear predictors, we introduce the random-effects reduced space through a weighted least-squares closed-form solution. The application shows the effect of the linear dependence of the random effects in the space of the predictor variables.

KEYWORDS: Redundancy analysis, linear mixed model, empirical best linear unbiased predictor, restricted maximum likelihood estimator.

#### 1 Introduction

Random-coefficients linear regression models (*RCM*) represent a special case of linear mixed models (LMM, Demidenko, 2004), where the vector of regression coefficients for the subjects (e.g. repeated observations) is modeled in a second-stage linear regression equation. In order to specify this type of model, it is convenient to define a two-stage hierarchical linear model, with a first stage that models the within-subject observations and a second stage that specifies a linear model for the random regression coefficients. Although in basic linear mixed models the random effects are not correlated with the modeled response variables (unlike the fixed effects with the random-effects estimates), in the *RCM* this correlation depends on the fixed-effects design matrix of the regression model. One can therefore be interested in knowing in which components of a multivariate model the random effects are related to the subspace spanned by the model covariates and, in the same way, which components of the multivariate vector seem to be orthogonal to that subspace. *Redundancy Analysis* (*RDA*) was originally introduced to capture, in a reduced space, the effect of the linear dependence of a set of criterion variables on a set of predictors. A *RDA* of the criterion variables predicted by the best linear unbiased predictor may be quite representative (Marcis & Salvatore, 2020). This paper derives a *RDA* through a least-squares solution for an optimal fixed-effects estimate, from the data provided by the random-coefficients linear model predictors of the criterion variables. The application study performs the introduced method on the official data of the Italian Equitable and Sustainable Well-being indicators.


#### 2 Redundancy Analysis: model estimation and application study

Given a $q$-variate random vector $Y$, consider the case when $Y$ is partitioned into $m$ subjects (groups), each of them with $n_i$ individuals ($i = 1,\dots,m$; $j = 1,\dots,n_i$; $s = 1,\dots,q$). We assume that the population model for the subjects is
$$
y_{i\,(q\times 1)} = B'_{(q\times p)}\, x_{i\,(p\times 1)} + A'_{i\,(q\times r)}\, z_{i\,(r\times 1)},
$$
where $B$ is the matrix of fixed regression coefficients and $A_i$ is a matrix of $q$-variate $r$-dimensional vectors of random effects, with $a_i = \mathrm{vec}(A'_i) \sim N(0,\Sigma_a)$, $\Sigma_a = \mathrm{cov}(\mathrm{vec}(A'_i)) = \{\Sigma_{a,ss'}\}$, and $\Sigma_{a,ss'} = \mathrm{cov}(\mathrm{vec}(A'_{i,ss'}))$, where $s,s' = 1,\dots,q$ index the $r\times r$ blocks of $\Sigma_a$. When $r = p$, the population model is a multivariate *RCM*, with $z_i = x_i$. Given a sample of $N = \sum_i \sum_j n_{ij}$ units (e.g. repeated measurements), the model structure is
$$
Y_{N\times q} = X^{+}_{N\times p}\, B_{p\times q} + Z^{+}_{N\times pm}\, A_{pm\times q} + E_{N\times q},
$$
with $X^{+}$ the matrix of data

covariates, $Z^{+}$ the design matrix of random effects, and $E$ the matrix of within-subject regression errors, $\mathrm{cov}(\mathrm{vec}(E)) = R$. Assuming both $Y$ and $X^{+}$ columnwise centered and standardized, in the general RCM setup we get $\mathrm{cov}(a_{si}, y_{si}) = D Z^{+\prime}_{i} = D X^{+\prime}_{si}$. If $F = Y^{**}\,\mathrm{var}(y)^{-\frac{1}{2}} - X^{+}\hat{B}$, with $Y^{**} = Y - E(Y)$, $\hat{\beta} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y^{*}$, $\beta = \mathrm{vec}(B)$, and $\Sigma = \mathrm{var}(y) = E\left[(y^{*}_{s} - \bar{y}^{*}_{s})(y^{*}_{s} - \bar{y}^{*}_{s})'\right]$, the singular value decomposition of $\hat{Y}^{**} = \hat{Y} = X^{+}\hat{B}$ gives the common rescaled predictor coordinates $U_Y \Lambda_Y V'_Y$, further noticing that $U^{*}_{Y^{**}} = Y^{**} V_Y \Lambda_Y^{-1}$ contains the joint reduced row coordinates in the space of $Y^{**}$. In accordance with recent law reforms, the Equitable and Sustainable Well-being (BES) indicators, provided annually by the Italian National Institute of Statistics (ISTAT, 2017), are designed to inform the economic policies that act on some fundamental aspects of the quality of life. To highlight the results of the proposed *RDA*, we use 12 BES indicators relating to the years 2013-2016, collected at NUTS-2 level. In particular, the variables are S8 (Age-standardised mortality rate), IF3 (People with tertiary education), L12 (Satisfaction with

job), REL4 (Social participation), POL5 (Trust in institutions), SIC1 (Homicide rate), BS3 (Positive judgment of future perspectives), PATR9 (Presence of parks/gardens), AMB9 (Satisfaction for the environment), INN1 (Percentage of R&D expenditure), Q2 (Childhood services) and LBE1 (logarithm of per-capita adjusted disposable income). We use the latter as the predictor variable in the *RCM*, while the remaining 11 variables are the dependent variables. The application uses restricted maximum likelihood estimation, implemented in SAS/IML code. To simplify the estimation process, we assume equicorrelation between the multivariate components of the random effects. The linear mixed model with random coefficients shows its analytical capabilities in Figure 1, which plots the standardized best predictors (STDP) and the original criterion variables in the space of the latter. As an example, while there is no correlation between INN1 (R&D expenditure) and AMB9 (satisfaction for the environment), the corresponding best predictor variables register an inverse correlation. This evidence is supported by substantive arguments, such as the critical differences between Northern and Southern Italian regions. Figure 2 shows the constrained RDA, in which the major contribution is given by the variables IF3, Q2, REL4, and AMB9. Interpreting the correlation between the random effects and the LBE1 model covariate, this dependence is mainly explained by the share of the population that completed tertiary education (IF3). This means that most of the differences between Italian regions reflect the dependence of the IF3 variable on disposable income.
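The reduced-space construction described above can be sketched numerically. The snippet below is a minimal illustration, not the authors' SAS/IML implementation: it uses ordinary least squares in place of the GLS/REML fixed-effects estimate, and all function and variable names are illustrative.

```python
import numpy as np

def redundancy_analysis(Y, X, n_axes=2):
    """Basic RDA sketch: SVD of the least-squares fitted criterion variables.

    Y (N x q) and X (N x p) are assumed columnwise centered/standardized.
    """
    B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # p x q fixed-effects estimate
    Y_hat = X @ B_hat                              # predicted criterion variables
    U, lam, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    V = Vt.T[:, :n_axes]
    site_scores = Y_hat @ V                        # rescaled predictor coordinates
    row_scores = Y @ V / lam[:n_axes]              # reduced row coordinates of Y
    return B_hat, site_scores, row_scores
```

The leading RDA axes are then read off `site_scores`, while `row_scores` projects the observed criterion variables onto the same reduced space.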

#### References

DEMIDENKO, E. 2004. *Mixed Models: Theory and Applications*. Wiley and Sons.

MARCIS, L., & SALVATORE, R. 2020. Joint Redundancy Analysis by a Multivariate Linear Predictor. *Conference of the Italian Statistical Society*.



### THE USE OF MULTIPLE IMPUTATION TECHNIQUES IN SOCIAL MEDIA DATA


Paolo Mariani<sup>1</sup>, Andrea Marletta<sup>1</sup> and Matteo Locci<sup>1</sup>

<sup>1</sup> Department of Economics, Management and Statistics, University of Milano-Bicocca (e-mail: andrea.marletta@unimib.it, paolo.mariani@unimib.it, m.locci2@campus.unimib.it)

ABSTRACT: In the big data context, the analysis of missing values is a very frequent task, and in statistical analysis it represents a thorny issue. This study proposes a strategy for data enrichment in the presence of sparse matrices. The research objective is to evaluate a possible distinction of behaviour among observations in sparse matrices with missing data. After a selection among multiple imputation methods, an innovative technique is presented to impute missing observations as a negative position or a neutral opinion. The method is applied to a dataset measuring the interaction between users and social network pages for some Italian newspapers.

KEYWORDS: Social network data, Missing values, Multiple imputations

#### 1 Introduction

The treatment of missing values is still a neglected phase in the field of quantitative analysis. In non-statistical contexts, the most common solution is row elimination, that is, the deletion of the observations with missing values. This operation can be misleading, and the proper treatment of missing observations is a more complex procedure. Firstly, it is necessary to conduct some preliminary analysis of the nature of this lack of information and to recognize the mechanism of the missing data, i.e. to evaluate the link between the observed values and the missing ones. This leads to the well-known classification of Little and Rubin (Little, 1988) into MCAR (Missing Completely At Random), MAR (Missing At Random) and NMAR (Not Missing At Random) data. Only after the identification of this mechanism is it possible to find the best solution to the problem of missing values. If complete case analysis is not considered a valid alternative, it is necessary to proceed with the imputation of the missing observations.

In this study, an innovative two-step technique to distinguish a missing value from a deliberate behaviour of some individuals is proposed. In the first step, the substitution of missing values is implemented using a threshold based on the number of expressed "Likes". In particular, a missing value is considered a "Dislike" only when a user has expressed a percentage of "Likes" higher than a selected threshold; if the percentage of "Likes" is below the threshold, the missing "Like" is imputed as a "Nothing". The second step is pursued using a multiple imputation technique known as the MIMCA method (Multiple Imputation with Multiple Correspondence Analysis) (Audigier *et al.*, 2017). This procedure is applied to social media data from the official pages of 7 Italian newspapers.

#### 2 The MIMCA approach


Multiple Imputation with Multiple Correspondence Analysis is an available imputation technique for qualitative data, allowing the imputation of data sets with incomplete categorical variables. The principle of MI with MCA, as with all multiple imputation techniques, consists in creating *M* different data sets to reflect the uncertainty on the imputed values. In this context, each data set is obtained with an algorithm called *iterative MCA*, suited to imputing qualitative data. The iterative MCA algorithm consists in recoding the incomplete data set as an incomplete disjunctive table *Z*, randomly imputing the missing values, estimating the principal components and loadings from the completed matrix, and then using these estimates to impute the missing values according to the following reconstruction formula:

$$
\hat{Z} = \hat{U}\hat{\Lambda}\hat{V}^{T} + M,
$$

where $\hat{U}$, $\hat{\Lambda}$ and $\hat{V}$ are the left singular vectors, the diagonal matrix of singular values, and the right singular vectors, respectively. The final version of the matrix $Z$ is obtained as $Z = W \ast Z + (\mathbf{1} - W) \ast \hat{Z}$, where $\ast$ is the Hadamard product, $\mathbf{1}$ is the all-ones matrix, and $W$ is a matrix of weights with $w_{ij} = 0$ if $z_{ij}$ is missing and $w_{ij} = 1$ otherwise. In this context, MCA is configured as a singular value decomposition applied to the triplet of matrices $(Z - M,\ \tfrac{1}{K} D_{\Sigma}^{-1},\ R)$. The matrix $Z$ represents the disjunctive table, $M$ is a matrix whose rows are equal to the vector of the means of each column of $Z$, $D_{\Sigma}$ is a diagonal matrix with the proportions of individuals characterized by a specific category, and $R$ is the matrix of uniform weights assigned to individuals. After a first step of imputation, the iterative MCA procedure is repeated until a convergence criterion is reached. In many cases, due to overfitting problems, a regularized version of this algorithm is used (Josse *et al.*, 2012).

This approach belongs to the family of joint modelling MI methods, which makes it more computationally efficient than conditional models: since the technique is based on Multiple Correspondence Analysis, the number of estimated parameters is small. Another advantage of MI with MCA is the goodness of the estimation even when the number of individuals is small. Finally, MI with MCA represents less frequent categories well in the imputation step, another property inherited from MCA.
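As a rough illustration of the iterative reconstruction step, the sketch below runs a plain truncated SVD on the column-centred table instead of the fully weighted MCA triplet, and omits the regularization of Josse *et al.* (2012); the function and parameter names are illustrative, not from the MIMCA software.

```python
import numpy as np

def iterative_svd_impute(Z, n_components=2, n_iter=100, tol=1e-6):
    """Impute NaN cells of a disjunctive table Z by iterative low-rank SVD.

    Simplified stand-in for iterative MCA: column centring replaces the
    weighted triplet, and observed cells are restored at every iteration
    (Z = W * Z + (1 - W) * Z_hat).
    """
    W = ~np.isnan(Z)                                   # 1 where z_ij is observed
    Zc = np.where(W, Z, np.nanmean(Z, axis=0, keepdims=True))  # crude start
    for _ in range(n_iter):
        M = Zc.mean(axis=0, keepdims=True)             # row of column means
        U, s, Vt = np.linalg.svd(Zc - M, full_matrices=False)
        Zhat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + M
        Z_new = np.where(W, Z, Zhat)                   # keep observed cells
        if np.sum((Z_new - Zc) ** 2) < tol:            # convergence criterion
            Zc = Z_new
            break
        Zc = Z_new
    return Zc
```

In the full MIMCA procedure this completion is repeated over bootstrap replicates to generate the *M* imputed data sets.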


#### 3 Application on Italian newspaper social pages

The dataset used for this application consists of users who expressed at least one "Like" on social media pages, websites, and forums concerning drugs and health. The research was conducted on 2,795 Italian subjects, considering all interactions between people and brands and between products and services on Facebook. The selected category for Facebook pages is Italian newspapers. Each column of the dataset is a dummy variable representing the presence or absence of a "Like". The 7 Italian newspapers are: La Repubblica, Corriere della Sera, Il Fatto Quotidiano, Il Sole 24 Ore, La Gazzetta dello Sport, Il Messaggero, La Stampa.

Before performing the MIMCA approach, the entire dataset has been divided into a training and a validation set. In particular, a number of cells equivalent to 30% of the observed cells has been set to "missing value". In order to create a validation set similar to the original data set, the proportion of each category ("Like", "Dislike" and "Nothing") has been maintained. The number of multiple data sets generated is equal to 100. The category to be imputed is selected by the majority rule: among the 100 imputations for each cell of the validation set, the category imputed at least 34 times is selected.
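The majority rule over the 100 imputations can be sketched as below; the function name is illustrative, and with three categories and 100 draws the most frequent category always reaches the 34-vote threshold.

```python
from collections import Counter

def majority_category(imputations, threshold=34):
    """Select the category for one validation cell across M=100 imputations.

    Returns the most frequent category if it was imputed at least
    `threshold` times, otherwise None.
    """
    category, count = Counter(imputations).most_common(1)[0]
    return category if count >= threshold else None
```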

In order to evaluate the performance of MI with MCA, a confusion matrix has been created and summarized through an accuracy index. This approach correctly imputes more than 81% of the cells considered, so the performance of the technique is satisfactory.

Once the goodness of MIMCA has been proved, the process of data enrichment for cells without an observed or imputed category can be completed. To achieve this goal, *M* = 100 data sets have been imputed with MIMCA and, for each cell with a missing value, only those where a "Dislike" or a "Nothing" has been imputed are considered. Moreover, in order to minimize the simulation error due to the application of a bootstrap procedure, a threshold has been introduced. More specifically, considering only the data sets where a "Dislike" or a "Nothing" has been imputed, the imputation rule for each cell is the following:

• if the proportion of "Dislike" imputed is greater than or equal to 60%, then a "Dislike" is imputed;

• if the proportion of "Dislike" imputed is less than or equal to 40%, then a "Nothing" is imputed;

• if the proportion of "Dislike" is between 40% and 60%, then neither a "Dislike" nor a "Nothing" is imputed.
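This 40%/60% threshold rule can be sketched for a single cell as follows; the function name and signature are illustrative, not from the paper's code.

```python
from collections import Counter

def enrich_cell(imputations, low=0.40, high=0.60):
    """Apply the threshold rule to the M imputations of one missing cell.

    `imputations` holds the categories drawn for the cell across the M
    imputed data sets; only "Dislike"/"Nothing" draws enter the rule.
    Returns "Dislike", "Nothing", or None (the cell stays missing).
    """
    counts = Counter(v for v in imputations if v in ("Dislike", "Nothing"))
    total = counts["Dislike"] + counts["Nothing"]
    if total == 0:
        return None
    share_dislike = counts["Dislike"] / total
    if share_dislike >= high:
        return "Dislike"
    if share_dislike <= low:
        return "Nothing"
    return None  # between 40% and 60%: leave the cell missing
```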

Table 1. *Distribution of "Like", "Dislike" and "Nothing" after MI with MCA.*

| | La Repubblica | Corriere della Sera | Il Fatto Quotidiano | Il Sole 24 Ore | La Gazzetta dello Sport | Il Messaggero | La Stampa | Total |
|---|---|---|---|---|---|---|---|---|
| "Like" | 299 | 244 | 268 | 158 | 116 | 66 | 86 | 1237 |
| "Dislike" | 24 | 8 | 57 | 47 | 221 | 255 | 200 | 812 |
| "Nothing" | 149 | 235 | 139 | 197 | 155 | 169 | 170 | 1214 |
| Missing Values | 24 | 9 | 32 | 94 | 4 | 6 | 40 | 209 |

As can be noted from Table 1, a few missing values are still present. In fact, in some cases the number of "Dislike" imputed in a specific cell is very similar to the number of "Nothing"; in particular, this behaviour occurs when the proportion of "Dislike" (or "Nothing") is between 40% and 60%. Even if there are some cases where missing values are not imputed, MI with MCA works well: the proportion of missing values is now equal to 6%.

#### References

AUDIGIER, V., HUSSON, F., & JOSSE, J. 2017. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. *Statistics and Computing*, 27(2), 501–518.

JOSSE, J., CHAVENT, M., LIQUET, B., & HUSSON, F. 2012. Handling missing values with regularized iterative multiple correspondence analysis. *Journal of Classification*, 29(1), 91–116.

LITTLE, R. 1988. A test of missing completely at random for multivariate data with missing values. *Journal of the American Statistical Association*, 83(404), 1198–1202.


### PREDICTION OF GENE EXPRESSION FROM TRANSCRIPTION FACTORS AFFINITIES: AN APPLICATION OF BAYESIAN NON-LINEAR MODELLING

into numbers to be used in a regression model. Most existing methods rely on genotypes (discrete variables taking values 0, 1, or 2, encoding single-letter differences in the DNA of different people), which do not allow for easy interpretation (*e.g.*, "If the DNA has an 'A' instead of a 'T', the expression of the gene will be higher"). Our first goal is to develop a more interpretable model. Gene expression is mainly controlled by specialised proteins called transcription factors, which bind the DNA at particular locations (regulatory regions) by establishing weak chemical bonds. Different DNA sequences will therefore have different chemical *affinities* for the transcription factors. Since different individuals have different DNA sequences, it is possible to use the affinities for transcription factors as numerical (continuous) predictors in the predictive model of gene expression. Affinities have a far superior interpretation, exemplified by statements such as "If the affinity for this transcription factor is higher, the expression will be higher."

However, one needs to make an assumption about the relationship (*e.g.*, linear) between affinities and gene expression. de Boer *et al.*, 2020 model the logarithm of the expression as a linear function of the affinities. The model is developed for a type of yeast and achieves a good performance, but is still too simple for our application. Indeed, yeast has two important distinguishing features: 1) it is haploid, meaning that it has only one copy of DNA, whereas humans have two; and 2) its genes are regulated primarily by one regulatory region, whereas human genes typically have more than one.

In this paper, we set up a predictive model for the expression of the EGFR (Epidermal Growth Factor Receptor) gene, and explicitly address both limitations of de Boer *et al.*, 2020. Figure 1 provides a schematic of our application.

Figure 1. *Left: Humans have two copies of DNA in each cell; the expression of a gene is the amount of RNA it produces. Transcription factors bind the DNA at the regulatory regions, from where they activate or inhibit the expression of their target gene. Right: A plausible instance of our data set, with one row per person, the affinity of each transcription factor for each copy of each regulatory region as predictors (A), and the log expression of EGFR as response (log y).*

Federico Marotta<sup>1</sup>, Paolo Provero<sup>1</sup> and Silvia Montagna<sup>2,3</sup>

<sup>1</sup> Dipartimento di Neuroscienze "Rita Levi Montalcini", Università degli Studi di Torino, Via Cherasco, 15, 10126, Torino, Italy (e-mail: federico.marotta@edu.unito.it, paolo.provero@unito.it)

<sup>2</sup> Dipartimento di Scienze Economico-sociali e Matematico-statistiche, Università degli Studi di Torino, Corso Unione Sovietica, 218/bis, 10134 Torino, Italy (e-mail: silvia.montagna@unito.it)

<sup>3</sup> Collegio Carlo Alberto, Piazza Vincenzo Arbarello, 8, 10122 Torino, Italy

ABSTRACT: The prediction of gene expression from DNA sequences is a relevant problem in biology. While most existing methods for this task use genotypes as predictors, here we propose a method based on transcription factor affinities, which have a clearer biological interpretation. This novelty, however, introduces new modelling challenges, which we address by leveraging Bayesian non-linear modelling techniques.

KEYWORDS: Bayesian Methods, Gene Expressions, Non-linear Predictive Modelling.

#### 1 Introduction

Scientists are often interested in predicting differences in the expression of a gene in different individuals solely from the DNA sequence of the individuals. The predicted expression can then be used in place of the real one when measuring the latter is too expensive, and the learnt relationship between DNA and expression can lead to a better understanding of how genes are regulated (Manor & Segal, 2013). The expression of a gene is the amount of RNA molecules it produces. Humans have two independent sets of DNA molecules, one coming from the father and one from the mother, therefore there are two copies of each gene. When measuring the expression, one simply sums the molecules produced by each copy.

When associating DNA to gene expression, the first problem we face is how to encode the DNA (a 3-billion-letter string over the alphabet {*A*,*C*,*G*,*T*}) into numbers to be used in a regression model. Most existing methods rely on genotypes (discrete variables taking values 0, 1, or 2, encoding single-letter differences in the DNA of different people), which do not allow for easy interpretation (*e.g.*, "If the DNA has an 'A' instead of a 'T', the expression of the gene will be higher"). Our first goal is to develop a more interpretable model.

Figure 1. *Left: Humans have two copies of DNA in each cell; the expression of a gene is the amount of RNA it produces. Transcription factors bind the DNA at the regulatory regions from where they activate or inhibit the expression of their target gene. Right: A plausible instance of our data set.*

Gene expression is mainly controlled by specialised proteins called transcription factors, which bind the DNA at particular locations (regulatory regions) by establishing weak chemical bonds. Different DNA sequences will have, therefore, different chemical *affinities* for the transcription factors. Since different individuals have different DNA sequences, it is possible to use the affinities for transcription factors as numerical (continuous) predictors in the predictive model of gene expression. Affinities have a far superior interpretation, exemplified by statements such as "If the affinity for this transcription factor is higher, the expression will be higher."

However, one needs to make an assumption about the relationship (*e.g.*, linear) between affinities and gene expression. de Boer *et al.* (2020) model the logarithm of the expression as a linear function of the affinities. The model is developed for a type of yeast and achieves good performance, but is still too simple for our application. Indeed, yeast has two important distinguishing features: 1) it is haploid, meaning that it has only one copy of DNA, whereas humans have two; and 2) its genes are regulated primarily by one regulatory region, whereas human genes typically have more than one.
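As a concrete illustration, a linear affinity model of this kind can be sketched on synthetic data. All dimensions, distributions, and coefficient values below are illustrative assumptions, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, l = 200, 3, 4        # individuals, regulatory regions, transcription factors
p = r * l                  # one affinity predictor per (region, factor) pair

# Synthetic non-negative affinities and a ground-truth coefficient vector.
A = rng.gamma(shape=2.0, scale=1.5, size=(n, p))
beta_true = rng.normal(0.0, 0.1, size=p)

# Linear model: the log-expression is a linear function of the affinities.
log_y = 1.0 + A @ beta_true + rng.normal(0.0, 0.05, size=n)

# Ordinary least squares with an intercept recovers the coefficients.
X = np.column_stack([np.ones(n), A])
beta_hat, *_ = np.linalg.lstsq(X, log_y, rcond=None)

print(np.max(np.abs(beta_hat[1:] - beta_true)))  # small estimation error
```

Because the affinities are continuous, each fitted coefficient directly answers the question "what happens to log-expression when the affinity for this factor in this region increases?".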

In this paper, we set up a predictive model for the expression of the EGFR (Epidermal Growth Factor Receptor) gene and explicitly address both limitations of de Boer *et al.* (2020). Figure 1 provides a schematic of our application.
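A minimal, non-Bayesian sketch of the two-copy idea developed in Section 2, with plain gradient descent standing in for the full Bayesian inference; all sizes and distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 414, 8                      # individuals and affinity predictors (illustrative)

# One affinity matrix per DNA copy; the same coefficients apply to both copies.
A1 = rng.gamma(2.0, 1.0, size=(n, p))
A2 = rng.gamma(2.0, 1.0, size=(n, p))
beta_true = rng.normal(0.0, 0.2, size=p)

def mean_log_expr(b):
    # The two copies' effects are additive on the original scale,
    # then mapped back to the log scale.
    return np.log(np.exp(A1 @ b) + np.exp(A2 @ b))

log_y = mean_log_expr(beta_true) + rng.normal(0.0, 0.05, size=n)

# Least-squares fit by gradient descent (a simplification of the paper's approach).
b = np.zeros(p)
lr = 0.02
for _ in range(5000):
    e1, e2 = np.exp(A1 @ b), np.exp(A2 @ b)
    w1 = e1 / (e1 + e2)                       # per-individual weight of copy 1
    resid = np.log(e1 + e2) - log_y
    grad = ((w1[:, None] * A1 + (1 - w1)[:, None] * A2) * resid[:, None]).mean(axis=0)
    b -= lr * grad

print(np.max(np.abs(b - beta_true)))  # the shared coefficients are recovered
```

Note how the same coefficient vector `b` multiplies both affinity matrices, mirroring the constraint that a transcription factor acts identically on the two DNA copies.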

#### 2 Methodology and results

Our dataset consists of the expression values of the EGFR gene for *n* = 414 individuals (from The GTEx Consortium, 2020) and of the affinities of each regulatory region for all transcription factors, for a total of *p* = 358 predictors.

We can take multiple regulatory regions into account (goal 2 above) via a straightforward modification of the model in de Boer *et al.* (2020), which becomes

$$\log(y) = \beta\_0 + \sum\_{g=1}^{r} \sum\_{f=1}^{l} A\_{fg} \, \beta\_{fg},$$

where *y* denotes the gene expression, {*A<sub>fg</sub>*} are the affinities, and β = (β<sub>0</sub>,...,β<sub>*rl*</sub>) is the vector of model parameters. Similarly to de Boer *et al.* (2020), we sum over all transcription factors, indexed by *f*, but now also over the regulatory regions, *g*, of the gene.

Accommodating both copies of DNA (goal 1 above) is more challenging. Biologically, we know that the effects of the two copies should be additive on the original scale of the expression, not on the log-transformed expression. At the same time, working with the expression on the original scale can be troublesome, for it is often not normally distributed. Therefore, we propose the following model for the expression of a single gene:

$$\log(\mathbf{y}) \sim \text{mvnormal}\left(\log\left(e^{\mathbf{A}^{(1)}\boldsymbol{\beta}} + e^{\mathbf{A}^{(2)}\boldsymbol{\beta}}\right), \sigma^2 I\right). \tag{1}$$

Here **y** is an *n*-vector of expression values (one for each individual); **A**<sup>(*i*)</sup>, with *i* ∈ {1,2}, is the *n* × *rl* affinity matrix for copy *i*, where each column represents a transcription factor–regulatory region pair (*r* is the number of regions, *l* the number of transcription factors); and the (*lr* × 1) vector β encapsulates the coefficients of the affinities. By computing the exponential of **A**<sup>(*i*)</sup>β, with *i* ∈ {1,2}, we obtain the effect of copy *i* on the expression on the original scale. We then sum the two effects and take the log of the sum to go back to the log-scale response. Importantly, the coefficient of a given transcription factor in a given regulatory region is the same for the two copies of DNA. For this reason, our model does not fall in the class of generalised linear models (at least not obviously), as each coefficient β<sub>*j*</sub> appears twice, independently, for two different predictors.

Model (1) is embedded in a Bayesian framework by placing a normal prior (with mean zero and variance τ) on all coefficients β independently, and a non-informative Jeffreys prior on σ<sup>2</sup>. The Bayesian framework is chosen for two primary reasons: 1) Although in our specific application *p* < *n*, in genomics generally *p* ≫ *n* and regularisation is needed. Regularisation can be thought of as imposing a Bayesian prior on the underlying parameters; specifically, the ridge regression estimator can be viewed as the Bayesian posterior mean estimator of β obtained by imposing a Gaussian prior on β. 2) The ability to encode knowledge via prior distributions. For example, we can exploit existing biological data about which transcription factors are bound to a region of DNA by giving the corresponding variables a less stringent regularisation.

To carry out an unbiased evaluation of the performance, we implemented a 5-fold cross-validation strategy. Table 1 summarises the results. While the average *R*<sup>2</sup> may seem small, we emphasise that low values are common in the prediction of gene expression, and our model outperforms recently published genotype-based models (the *R*<sup>2</sup> achieved by Nagpal *et al.*, 2019 is only 0.005).

Table 1. *Results of the nested cross-validation. MSE is the mean squared error,* ρ *the correlation between true and predicted expression; averages and standard deviations of these quantities are computed across the 5 folds. Avg R*<sup>2</sup> *is the average of the squared correlations. Z is the Z-score computed via Stouffer's method, which combines the* ρ *of the five folds, and pval Z is the p-value of the Z-score.*

| Gene | Avg MSE | Sd MSE | Avg ρ | Sd ρ | Avg *R*<sup>2</sup> | *Z* | pval *Z* |
|------|---------|--------|-------|------|---------------------|-----|----------|
| EGFR | 0.011 | 0.003 | 0.199 | 0.065 | 0.043 | 4.030 | 2.8e-5 |

Thus, our method can model the underlying biological problem in a realistic way and provide meaningful results thanks to its interpretable predictors. In the future, it could be improved by considering interactions between transcription factors, which are also biologically important. For the time being, we hope that non-linear models will find their way into the field of gene expression prediction, which is currently dominated by genotype-based linear models.

#### References

DE BOER, CARL G., VAISHNAV, EESHIT D., SADEH, RONEN, *et al.* 2020. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. *Nat. Biotechnol.*, 38(1), 56–65.

MANOR, OHAD, & SEGAL, ERAN. 2013. Robust Prediction of Expression Differences among Human Individuals Using Only Genotype Information. *PLoS Genet.*, 9(3), e1003396.

NAGPAL, SINI, MENG, XIAORAN, EPSTEIN, MICHAEL P., *et al.* 2019. TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. *Am. J. Hum. Genet.*, 105(2), 258–266.

THE GTEX CONSORTIUM. 2020. The GTEx Consortium atlas of genetic regulatory effects across human tissues. *Science*, 369(6509), 1318–1330.


## HIGH DIMENSIONAL MODEL-BASED CLUSTERING OF EUROPEAN GEOREFERENCED VEGETATION PLOTS


Francesca Martella<sup>1</sup>, Fabio Attorre<sup>2</sup>, Michele De Sanctis<sup>2</sup> and Giuliano Fanelli<sup>2</sup>

<sup>1</sup> Department of Statistical Sciences, Sapienza University of Rome, Italy (e-mail: francesca.martella@uniroma1.it)

<sup>2</sup> Department of Environmental Biology, Sapienza University of Rome, Italy (e-mail: fabio.attorre@uniroma1.it, michele.desanctis@uniroma1.it, giuliano.fanelli@gmail.com)

ABSTRACT: An important challenge in complex vegetation systems is the classification of vegetation, since it represents a useful tool for summarizing our knowledge of vegetation patterns and, consequently, for nature conservation, landscape mapping and land-use planning. It typically requires standard clustering methods capable of identifying groups of plots characterized by dominant and diagnostic species. When the data are high-dimensional, however, efficient clustering methods have to be considered. In this paper, we consider a robust model-based clustering approach, called Gaussian mixture models for high-dimensional data (HD-GMM), which accounts for the specific subspace around which each cluster is located and, consequently, provides parsimonious modeling. Results are encouraging and deserve further discussion.

KEYWORDS: vegetation plots, high-dimensional data, finite mixture models

#### 1 Introduction

Improving actions for nature conservation, landscape mapping and land-use planning is a key point in vegetation science. The need for ecologists to develop appropriate management and conservation strategies has been widely recognized. The identification of homogeneous vegetation communities provides a useful way of summarizing our knowledge of vegetation in a certain area. Clustering represents an important tool to discover such communities and, in general, to draw insights from vegetation data. Summaries of vegetation clustering methods can be found in several proposals that focus on this discipline (Sun et al., 1997). Attorre et al. (2020) propose a finite mixture model (FMM) for classifying georeferenced vegetation plots of the Italian peninsula, including the two main islands (Sicily and Sardinia) but excluding the Alps and the Po plain, according to species composition and environmental variables. Previously, FMM had been applied to identify marine bioregions on the Western Australian continental margin (Woolley et al., 2013) and forest physiognomic types in Italy (Attorre et al., 2014). However, when we face high-dimensional vegetation data, FMM, or more specifically standard model-based clustering techniques, may show disappointing behavior. This is mainly because the number of parameters to be estimated usually depends on the dimension of the observed space, so such approaches may suffer from the so-called curse of dimensionality (Bellman, 1957). In this paper, we suggest the use of a robust model-based clustering approach, the Gaussian mixture models for high-dimensional data (HD-GMM) proposed by Bouveyron et al. (2007). We examine a database of 7955 georeferenced plots and 3181 plant species of evergreen forest vegetation, created in TURBOVEG by storing published and unpublished phytosociological plots collected over the last 30 years.

These plant communities are scattered along the Mediterranean coastal area, whose main and distinctive ecological feature is the prolonged summer aridity, with rainfall mainly concentrated during winter and spring. Making use of HD-GMM, we assume that high-dimensional vegetation data live in subspaces of dimensionality lower than that of the original plant species space, limiting the number of parameters to estimate and, consequently, the computational time. Finally, as in the FMM framework, the plots are classified based on their plant species composition through a posteriori plot-specific probabilities, and the clusters are defined to be homogeneous in that they include plots that show similar vegetation.
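The parsimonious covariance structure underlying HD-GMM (detailed in Section 2 below) can be illustrated numerically; the dimensions and eigenvalues used here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 10, 2                  # number of species and cluster-intrinsic dimension

# Random orthogonal matrix D_k whose columns play the role of the eigenvectors of Sigma_k.
D, _ = np.linalg.qr(rng.normal(size=(p, p)))

# A_k: q signal eigenvalues a_k1 > ... > a_kq, then p - q copies of the noise level b_k.
a = np.array([5.0, 3.0])
b_noise = 0.5
A = np.diag(np.concatenate([a, np.full(p - q, b_noise)]))

# Cluster-specific covariance matrix built from the decomposition.
Sigma = D @ A @ D.T

# Its spectrum is exactly {a_k1, a_k2, b_k (with multiplicity p - q)} ...
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(np.allclose(eigvals[:q], a), np.allclose(eigvals[q:], b_noise))

# ... so each cluster needs only q + 1 distinct eigenvalues (plus the subspace
# orientation) instead of the p*(p+1)/2 free entries of an unconstrained covariance.
```

With thousands of species, this reduction from quadratic to roughly linear growth in the number of covariance parameters is what makes the mixture estimable.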

#### 2 The model


Let **Y** = (**y**<sub>1</sub>,...,**y**<sub>*n*</sub>) be the abundance data matrix, where the generic element *y<sub>ij</sub>* represents the value of the measure of abundance for the *j*-th tree species, namely the percentage of biomass of a certain species with respect to the total biomass of vegetation, observed in the *i*-th plot of the study area (*i* = 1,...,*n*, *j* = 1,...,*p*). FMM assumes that each plot **y**<sub>*i*</sub> is drawn from a mixture of *K* components in some unknown mixing proportions π<sub>1</sub>,...,π<sub>*K*</sub>, with ∑<sub>*k*=1</sub><sup>*K*</sup> π<sub>*k*</sub> = 1. Each component identifies a cluster. When a (multivariate) Gaussian density is used to describe the component-specific distribution of observed plant species cover, the component is identified by a specific center, defined by the mean vector (as the observed values are on an abundance scale, we may hypothesize that similar plots will be characterized by similar values of abundance of the same species), and a specific shape, summarized by the covariance matrix, which allows for varying dependence between cover values corresponding to different plant species for plots in that component. In other words, **y**<sub>*i*</sub> has density function defined by:

$$f\left(\mathbf{y}\_{i} \mid \Psi\right) = \sum\_{k=1}^{K} \pi\_{k} \, \phi\left(\mathbf{y}\_{i} \mid \boldsymbol{\mu}\_{k}, \Sigma\_{k}\right),\tag{1}$$

3 Conclusion

References

Princeton.

461–464.

948.

Thanks to the significant reduction of the number of parameters to be estimated, HD-GMM seems to be a promising approach when dealing with the analysis of high-dimensional complex vegetation systems data. This modelling may effectively highlight specific subspaces in the geographical patterns

ATTORRE, F., FRANCESCONI, F., DE SANCTIS, M., ALF, M., MARTELLA, F., VALENTI, R., VITALE, M. 2014. Classifying and Mapping Potential Distribution of Forest Types Using a Finite Mixture Model. *Vegetation*

ATTORRE, F., CAMBRIA, V.E., AGRILLO, E., ALESSI, N., ALFO, M., DE SANCTIS, M., MALATESTA, L., SITZIA, T., GUARINO, R., MARCEN, C., MASSIMI, M., SPADA, F., FANELLI, G. 2020. Finite Mixture Model-based classification of a complex vegetation system. *Vegetation*

BELLMAN, R. 1957. Dynamic Programming. *Princeton University Press*.

BOUVEYRON, C., GIRARD, S., SCHMID, C. 2007. High-dimensional data clustering. *Computational Statistics and Data Analysis.*, 52(1), 502–519. CATTELL, R. 1966. The scree test for the number of factors. *Multivariate*

SCHWARZ, G. 1978. Estimating the dimension of a model. *Ann. Stat.*, 6,

SUN, D., HNATIUK, R.J., NELDNER, V.J. 1997. Review of vegetation classification and mapping systems undertaken by major forested land management agencies in australia.. *Australian Journal of Botany.*, 45(6), 929–

WOOLLEY, S.N.C., MCCALLUM, A.W., WILSON, R., OHARA, T.D., DUNSTAN, P.K., 2013. Fathom out: biogeographical subdivision across the Western Australian continental margin a multispecies modelling ap-

proach. *Diversity and Distributions.*, 19, 1506–1517.

helping the interpretation of the clustering results.

*Folia Geobotanica.*, 49, 313–335.

*Classification and Survey.*, 1, 77–86.

*Behavioral Research.*, 1(2), 145–276.

where φ(·) represents the cluster-specific *p*-variate Gaussian density with vector mean *µk* and covariance matrix Σ*k*, for *k* = 1,...,*K*, and Ψ = (π1,...,π*K*,*µ*1, ...,*µK*,Σ1,...,Σ*K*) denotes the overall parameter vector. Unfortunately, FMM requires the estimation of a very large number of parameters (proportional to *p*2) and therefore faces numerical problems in high-dimensional spaces. In this respect, HD-GMM assumed that high-dimensional data live around subspaces with a dimension lower than the considered species number, limiting to estimate the specific subspace and the cluster-specific intrinsic dimension. Formally, HD-GMM considers the following eigen-decomposition of the clusterspecific covariance matrix Σ*k*:

$$
\Sigma\_k = \mathbf{D}\_k^t \mathbf{A}\_k \mathbf{D}\_k \tag{2}
$$

where D*<sup>k</sup>* is a (*p* × *p*) orthogonal matrix having as columns the eigenvectors of Σ*<sup>k</sup>* and A*<sup>k</sup>* is a (*p*× *p*) diagonal matrix which contains the associated eigenvalues (sorted in decreasing order), *k* = 1,...,*K*. It follows that, A*<sup>k</sup>* represents the cluster-specific covariance matrix in the eigenspace of Σ*k*. Moreover, it is assumed that A*<sup>k</sup>* is reparametrized as a diagonal matrix having only *qk* +1 different eigenvalues:

$$\mathbf{A}\_{k} = \text{diag}(a\_{k1}, \dots, a\_{kq\_k}, b\_k, \dots, b\_k), \tag{3}$$

with *ak j* > *bk*, *j* = 1,...,*qk*, *qk* ∈ {1,..., *p* − 1}. In this way, the parameters *ak j* describe the cluster-specific variance of the original data, while the unique parameter *bk* models the variance of the noise which is isotropic and contained in a subspace, which is orthogonal to the subspace of the *k*-th cluster. The dimension *qk* is unknown and represents the dimension of the cluster-specific subspace E*<sup>k</sup>* which is spanned by the *qk* first columns of D*k*, i.e. by the *qk* first eigenvectors corresponding to the eigenvalues *ak j*, with *µk* ∈ E*k*. Notice that, if *qk* = *p*−1 for all *k* = 1,...,*K* then HD-GMM reduces to FMM. Following the classical parsimony strategy, a family of 28 parsimonious HD-GMMs is defined by constraining some (or all) parameters to vary within and between clusters. The more general HD-GMM is denoted by [*ak jbk*D*kqk*].

### 3 Conclusion


Thanks to the significant reduction in the number of parameters to be estimated, the HD-GMM seems a promising approach for the analysis of high-dimensional, complex vegetation systems data. This modelling may effectively highlight specific subspaces in the geographical patterns, helping the interpretation of the clustering results.



## MULTIVARIATE OUTLIER DETECTION FOR HISTOGRAM-VALUED VARIABLES

Ana Martins<sup>1</sup>, Paula Brito<sup>2</sup>, Sónia Dias<sup>3</sup> and Peter Filzmoser<sup>4</sup>

<sup>1</sup> Institute of Electronics and Informatics Engineering of Aveiro, Aveiro, Portugal (email: a.r.martins@ua.pt)

<sup>2</sup> Faculdade de Economia, Universidade do Porto & LIAAD-INESC TEC, Porto, Portugal (e-mail: mpbrito@fep.up.pt)

<sup>3</sup> Instituto Politécnico de Viana do Castelo & LIAAD-INESC TEC, Portugal (e-mail: sdias@estg.ipvc.pt)

<sup>4</sup> Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria (e-mail: peter.filzmoser@tuwien.ac.at)

ABSTRACT: A measure for outlier detection of multivariate histogram-valued variables based on the Mallows distance ($SDO^2_M$) is proposed. A case study with distributional data of repeated measurements of 10 patients' hematocrit and hemoglobin is presented. The $Q_3 + 3(Q_3 - Q_1)$ criterion and the $P_{95}$ and $P_{97.5}$ quantiles of a Chi-Square distribution with *p* degrees of freedom (*p* = number of variables) are used as cut-offs. Overall, $SDO^2_M$ along with the $P_{95}$ cut-off is able to detect outliers in most analysed situations.

KEYWORDS: histogram-valued data, Mallows distance, outlier detection.

#### 1 Introduction

Symbolic data were introduced to better describe and analyse data with intrinsic variability. Descriptive statistics (e.g., mean, median) and multivariate data analysis methods (e.g., linear regression) for histogram-valued data have been developed. Outlier analysis was first addressed by Verde *et al.*, 2014. Following a different approach, we introduce a method for multivariate outlier analysis based on the Mallows distance. We define an outlier as a data unit which is far apart from the center of the data cloud, here the barycenter. Results on a case study are presented.

### 2 Methods

Let *S* = {*s*1,...,*sn*} be the set of entities under analysis and *B* the set of probability or frequency distributions over a set of sub-intervals {*Ii*1,...,*IiKi*} of an underlying domain *O* ⊆ R. A histogram-valued variable is then defined by a mapping *Y* : *S* → *B*. Each realisation *i* of the histogram-valued variable, *Y*(*si*), may be represented by the histogram


$$
H_{Y(s_i)} = \left\{ \left[\, \underline{I}_{i1}, \overline{I}_{i1} \right[ \, , p_{i1}; \; \dots ; \; \left[\, \underline{I}_{iK_i}, \overline{I}_{iK_i} \right] , p_{iK_i} \right\}, \tag{1}
$$

where $p_{i1} + \dots + p_{iK_i} = 1$. It is also assumed that within each sub-interval $[\underline{I}_{ih}, \overline{I}_{ih}[$ the values of the variable *Y*(*si*) are uniformly distributed. Another representation of a histogram-valued variable is the quantile function,

$$
\phi_i(t) = \begin{cases}
\underline{I}_{i1} + \dfrac{t}{w_{i1}}\, r_{i1} & \text{if } 0 \le t \le w_{i1} \\[6pt]
\underline{I}_{i2} + \dfrac{t - w_{i1}}{w_{i2} - w_{i1}}\, r_{i2} & \text{if } w_{i1} \le t \le w_{i2} \\[2pt]
\qquad \vdots \\[2pt]
\underline{I}_{iK_i} + \dfrac{t - w_{iK_i - 1}}{1 - w_{iK_i - 1}}\, r_{iK_i} & \text{if } w_{iK_i - 1} \le t \le 1,
\end{cases} \tag{2}
$$

where $w_{ih} = \sum_{\ell=1}^{h} p_{i\ell}$, $h = 1,\dots,K_i$, and $r_{ih} = \overline{I}_{ih} - \underline{I}_{ih}$ for $h \in \{1,\dots,K_i\}$. The quantile functions are piecewise linear and, even though the space of quantile functions is only a semi-vector space, the arithmetic operations are simpler with this representation, which is therefore preferred to represent histogram-valued data.
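A minimal sketch (illustrative Python, not the authors' code; the example histograms are made up) of the quantile function (2), together with a midpoint-rule approximation of the squared Mallows distance between two such quantile functions, the integral over [0,1] of their squared difference:

```python
import bisect

def quantile_function(intervals, probs):
    """Piecewise-linear quantile function phi_i(t) of the histogram
    {[lo_h, hi_h[, p_h} as in (2), assuming p_h > 0 and uniformity
    within each sub-interval."""
    w, acc = [], 0.0
    for p in probs:                           # cumulative weights w_ih
        acc += p
        w.append(acc)

    def phi(t):
        # locate the sub-interval whose cumulative range contains t
        h = min(bisect.bisect_left(w, t), len(probs) - 1)
        lo, hi = intervals[h]
        w_prev = w[h - 1] if h > 0 else 0.0
        # linear interpolation inside the h-th sub-interval
        return lo + (t - w_prev) / (w[h] - w_prev) * (hi - lo)

    return phi

def mallows_sq(phi_i, phi_j, m=10_000):
    """Squared Mallows distance between two quantile functions:
    integral over [0,1] of (phi_i - phi_j)^2, midpoint rule on m cells."""
    return sum(
        (phi_i((k + 0.5) / m) - phi_j((k + 0.5) / m)) ** 2 for k in range(m)
    ) / m

# histogram {[0, 2[, 0.5 ; [2, 6], 0.5} and the uniform histogram on [0, 1]
phi_a = quantile_function([(0.0, 2.0), (2.0, 6.0)], [0.5, 0.5])
phi_b = quantile_function([(0.0, 1.0)], [1.0])
d2 = mallows_sq(phi_a, phi_b)
```

Evaluating `phi_a` at the cumulative weights recovers the interval bounds (0, 2 and 6 here), and `mallows_sq` of a quantile function with itself is exactly zero.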

To identify multivariate outliers we propose a measure based on the Mallows distances to the multivariate mean of the quantile functions. The Mallows distance $d_M$ is defined by $d_M^2\big(\phi_i(t), \phi_j(t)\big) = \int_0^1 \big(\phi_i(t) - \phi_j(t)\big)^2 \, dt$, and the multivariate mean of quantile functions is the barycenter $\bar{\phi}_b$, the solution of the minimisation problem $\min \sum_{i=1}^{n} \int_0^1 (\phi_i - \bar{\phi}_b)^2 \, dt$, leading to $\bar{\phi}(t) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(t)$, $t \in [0,1]$. To easily implement this method, the approach by Hubert *et al.*, 2015 is adopted and an outlyingness measure based on a one-dimensional projection of the observed data is computed. The Mallows outlyingness measure is

$$
SDO^2_{M_i} = \sup_{\|\mathbf{v}\| = 1} \frac{d_M^2\left( \phi_i(t)\mathbf{v}, \; \frac{1}{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \phi_j(t)\mathbf{v} \right)}{\frac{1}{n-1} \sum_{i=1}^{n} d_M^2\left( \phi_i(t)\mathbf{v}, \; \frac{1}{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \phi_j(t)\mathbf{v} \right)}, \tag{3}
$$

where *v* = (*a*1,...,*ap*,*b*1,...,*bp*) runs through a set of 2*p*-dimensional vectors (for *p* variables) that project the histogram-valued data into a one-dimensional space, using the definition of linear combination proposed by Dias & Brito, 2015, that solves the problem of the semi-linearity of the space:

$$
\phi_{\hat{W}_i}(t) = a_1 \phi_{X_{1i}}(t) - b_1 \phi_{X_{1i}}(1-t) + \dots + a_p \phi_{X_{pi}}(t) - b_p \phi_{X_{pi}}(1-t), \tag{4}
$$

where $\phi_{X_{1i}}(t), \dots, \phi_{X_{pi}}(t)$ are the quantile functions of the observed histograms and $\phi_{X_{1i}}(1-t), \dots, \phi_{X_{pi}}(1-t)$ are the quantile functions of the corresponding symmetric histograms, $a_u, b_u \ge 0$, $u \in \{1,\dots,p\}$, and $t \in [0,1]$.
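The linear combination (4) can be sketched directly on quantile functions represented as callables (an illustrative helper, not from the authors' software; the example variables are hypothetical):

```python
def linear_combination(quantile_fns, a, b):
    """phi(t) = sum_u a_u*phi_u(t) - b_u*phi_u(1 - t), with a_u, b_u >= 0,
    as in (4). Here -phi_u(1 - t) is the quantile function of the symmetric
    histogram, which keeps the combination a valid (non-decreasing)
    quantile function despite the semi-vector-space structure."""
    def phi(t):
        return sum(a_u * q(t) - b_u * q(1.0 - t)
                   for q, a_u, b_u in zip(quantile_fns, a, b))
    return phi

# one uniform-on-[0,1] variable, whose quantile function is t:
phi_id = linear_combination([lambda t: t], [1.0], [0.0])   # reproduces t
phi_sym = linear_combination([lambda t: t], [0.0], [1.0])  # t - 1: symmetric histogram on [-1, 0]
```

In the outlyingness measure (3), such combinations would be evaluated over many candidate unit vectors $\mathbf{v} = (a_1,\dots,a_p,b_1,\dots,b_p)$ when approximating the supremum.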

To flag observations as outliers, possible alternative cut-offs for $SDO^2_{M_i}$ are Tukey's boxplot criterion $Q_3 + 3(Q_3 - Q_1)$ and, by analogy with the classical case where the Mahalanobis distance to the mean is used (see Filzmoser *et al.*, 2005), the $P_{95}$ or the $P_{97.5}$ quantiles of a Chi-Square distribution with *p* degrees of freedom (*p* = number of variables).
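As a minimal illustration of the Tukey-type cut-off (with made-up outlyingness scores, not the case-study values), the sketch below flags units whose score exceeds $Q_3 + 3(Q_3 - Q_1)$:

```python
import statistics

def tukey_cutoff(scores, k=3.0):
    """Q3 + k*(Q3 - Q1) fence; k = 3 as in the paper's criterion."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

# made-up outlyingness scores for 10 units; unit 10 is inflated
scores = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1, 1.0, 9.5]
cut = tukey_cutoff(scores)
flagged = [unit for unit, s in enumerate(scores, start=1) if s > cut]
```

With these scores only unit 10 exceeds the fence. The chi-square cut-offs would instead compare each score against a fixed quantile of the $\chi^2_p$ distribution (e.g. via `scipy.stats.chi2.ppf`).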

#### 3 Case Study

An analysis of $SDO^2_M$ using distributional data of repeated measurements of the hematocrit (*Y*) and hemoglobin (*X*) values for 10 patients (Billard & Diday, 2006) was conducted. First, univariate outliers were considered by perturbing the distribution of variable *X* for the first unit in seven different ways (Out1 to Out7). $SDO^2_M$ was computed for the original hemoglobin distributions and for the seven situations where unit 1 is now an outlier. Then, bivariate outliers were considered by perturbing the distributions of both variables for the first unit. $SDO^2_M$ was computed for all seven cross-situations between outliers of variables *X* and *Y*.

#### 3.1 Results and Discussion

The ability of $SDO^2_M$ to detect outlier observations was studied using all three cut-offs mentioned above. Overall, $P_{95}$ is the cut-off that works best to identify outliers in both cases. In the univariate case this cut-off fails to identify Out5 only; in the bivariate case it fails to identify both Out5 and Out6 (Fig. 1). Note that in the bivariate case the outlier unit 1 for variable *Y* is fixed (only one perturbation considered). The $Q_3 + 3(Q_3 - Q_1)$ criterion seems to be the worst, but this may reflect the fact that *n* = 10, a sample size too small to compute the quantiles reliably. In fact, preliminary studies with larger *n* suggest that this is the best cut-off (data not shown). In conclusion, $SDO^2_M$ with the $P_{95}$ cut-off (and Tukey's criterion) seems a promising approach to detect outliers in histogram-valued data.

Figure 1. *$SDO^2_M$ measure for univariate and bivariate outliers and cut-offs $P_{95}$, $P_{97.5}$ and $Q_3 + 3(Q_3 - Q_1)$. Red dots represent the outlier observation (unit 1).*

#### References

BILLARD, L., & DIDAY, E. 2006. *Symbolic Data Analysis: Conceptual Statistics and Data Mining*. John Wiley.

BRITO, P. 2014. Symbolic data analysis: another look at the interaction of data mining and statistics. *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*, 4(4), 281–295.

DIAS, S., & BRITO, P. 2015. Linear regression model with histogram-valued variables. *Statistical Analysis and Data Mining*, 8(2), 75–113.

FILZMOSER, P., GARRETT, R.G., & REIMANN, C. 2005. Multivariate outlier detection in exploration geochemistry. *Computers & Geosciences*, 31(5), 579–587.

HUBERT, M., ROUSSEEUW, P.J., & SEGAERT, P. 2015. Multivariate functional outlier detection. *Statistical Methods & Applications*, 24(2), 177–202.

VERDE, R., IRPINO, A., & RIVOLI, L. 2014. A box-plot and outliers detection proposal for histogram data: new tools for data stream analysis. Pages 283–291 of: *Analysis and Modeling of Complex Data in Behavioral and Social Sciences*. Springer.



### A NONPARAMETRIC TEST FOR MODE SIGNIFICANCE

the tools provided by gradient ascent algorithms. This allows us to define a sequence of realisations shown to be asymptotically normal, to approximate

While intuitively clear, the problem of testing mode significance is firstly definitional. The concept of mode itself is, indeed, ambiguous, as for example the Uniform distribution can be regarded to as both unimodal or without modes. To overcome this problem and formalise our framework without any elusiveness, we shall restrict the analysis to smooth distributions, and exclude non-standard ones as, for example, functions with plateau. For our purpose, we resort to the framework provided by Morse Theory, a branch of differential topology which draws the relationship between the stationary points of a smooth real-valued

functions on a manifold, and its global topology (Matsumoto, 2002).

negative gradient −∇ *f* is the solution of the initial value problem

νx(0) = x ν

curve having them as starting points has destination θ.

significance of a mode recasts to defining the system of hypotheses

hence, under the null hypothesis, it holds: <sup>θ</sup><sup>0</sup> <sup>=</sup> argmin*x*∈*D*(θ0) <sup>−</sup>*f*(*x*).

Let (x1,...,x*n*) be a sample of realisations from a random variable *X* with unknown probability density *<sup>f</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>, which we shall assume to be a Morse function, i.e. a function whose critical points are non-degenerate. We can define the autonomous system identified by the the gradient ∇ *f* , to be *<sup>d</sup>*x(*t*)

<sup>∇</sup> *<sup>f</sup>*(x(*t*)). Given an initial value <sup>x</sup> <sup>∈</sup> <sup>R</sup>*d*, the integral curve <sup>ν</sup><sup>x</sup> : <sup>R</sup> <sup>→</sup> <sup>R</sup>*<sup>d</sup>* of the

namely, starting at a point x, its integral curve moves it according to the gradient of *f* , to eventually reach, except for a set of null measure, the destination lim*t*→<sup>∞</sup> <sup>ν</sup>x(*t*). By the Morse theory, the set of destinations <sup>Θ</sup> <sup>=</sup> {<sup>θ</sup> <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* : <sup>θ</sup> <sup>=</sup> lim*t*→<sup>∞</sup> <sup>ν</sup>x(*t*),<sup>x</sup> <sup>∈</sup> <sup>R</sup>*d*} is the set of distinct modes of *<sup>f</sup>* . Since integral curves never intersect except at critical points, Θ allows to identify a unique partition {*D*θ}θ∈<sup>Θ</sup> of <sup>R</sup>*<sup>d</sup>* in distinct regions *D*<sup>θ</sup> <sup>=</sup> {<sup>x</sup> : lim*t*→<sup>∞</sup> <sup>ν</sup>x(*t*) = <sup>θ</sup>} which represent the "basins of attraction" of each mode θ and include all points whose integral

In the lack of information about the true modal structure of *f* , testing the

for some <sup>θ</sup><sup>0</sup> <sup>∈</sup> <sup>R</sup>*d*. While apparently composite, the null hypothesis is fact a simple one, as the - yet unknown - partition of <sup>R</sup>*<sup>d</sup>* in the set {*D*(θ)}θ∈<sup>Θ</sup> allows us to intend *H*<sup>0</sup> as "θ<sup>0</sup> is the mode of the domain *D*(θ) where it belongs";

*H*<sup>0</sup> : θ<sup>0</sup> ∈ Θ vs *H*<sup>1</sup> : θ<sup>0</sup> ∈/ Θ, (2)

*dt* =

<sup>x</sup>(*t*) = −∇ *f*(νx(*t*)), (1)

the sample distribution of an estimated mode.

2 Modes as critical points of the density

Giovanna Menardi<sup>1</sup>, Federico Ferraccioli<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, University of Padova, (e-mail: menardi@stat.unipd.it, ferraccioli@stat.unipd.it)

ABSTRACT: We propose a nonparametric test for the significance of a mode, with the aim of evaluating whether a region of relatively high observed density reflects the actual presence of a mode in the true distribution underlying a set of data. The method leverages the correspondence between the mathematical framework of Morse theory and the tools provided by gradient ascent approximation. This allows for building a sequence of asymptotically Normal realisations of the sample distribution of an estimated mode and for the definition of a chi-squared test statistic.

KEYWORDS: Asymptotics, Gradient ascent, Hypothesis testing, Kernel estimator.

#### 1 Introduction

Although often overlooked with respect to location measures such as the mean and median, inference on the modes of a distribution plays a central role in data analysis. In fact, modes represent informative summaries of a distribution, especially when data exhibit non-Gaussian features such as multimodality, skewness, or heavy tails. One question of interest typically arises when somewhat clumped data are observed, often at the tails of the empirical distribution, leading one to wonder whether they are real or just a spurious effect of sample variability. Similarly, in many applications where clustering is the final aim, one wishes to evaluate the significance of detected groups. In astronomy, for example, a main goal is to establish whether clusters of photon emissions are evidence of the presence of celestial energy sources or just express a strong background contamination. This problem has often been neglected by the inherent literature, which mostly addresses related aims such as testing unimodality of a density function or the number of modes (Chacón, 2020). Few contributions in the direction of interest are Duong *et al.*, 2008 and Genovese *et al.*, 2016.

In this work we propose a test to evaluate whether a specific point is a true mode of the (unknown) probability density function underlying an observed set of data. We take advantage of formal definitions and theory underlying the modal concept of cluster (Chacón, 2015). The rationale we follow relies on the correspondence between the mathematical framework of Morse theory and the tools provided by gradient ascent algorithms. This allows us to define a sequence of realisations, shown to be asymptotically normal, to approximate the sample distribution of an estimated mode.

#### 2 Modes as critical points of the density


While intuitively clear, the problem of testing mode significance is firstly definitional. The concept of mode itself is, indeed, ambiguous: the Uniform distribution, for example, can be regarded as either unimodal or without modes. To overcome this problem and formalise our framework without any ambiguity, we restrict the analysis to smooth distributions and exclude non-standard ones such as densities with plateaus. For our purpose, we resort to the framework provided by Morse theory, a branch of differential topology which draws the relationship between the stationary points of a smooth real-valued function on a manifold and its global topology (Matsumoto, 2002).

Let $(\mathbf{x}_1,\dots,\mathbf{x}_n)$ be a sample of realisations from a random variable *X* with unknown probability density $f : \mathbb{R}^d \to \mathbb{R}$, which we shall assume to be a Morse function, i.e. a function whose critical points are non-degenerate. We can define the autonomous system identified by the gradient $\nabla f$ as $\frac{d\mathbf{x}(t)}{dt} = \nabla f(\mathbf{x}(t))$. Given an initial value $\mathbf{x} \in \mathbb{R}^d$, the integral curve $\nu_{\mathbf{x}} : \mathbb{R} \to \mathbb{R}^d$ of the negative gradient $-\nabla f$ is the solution of the initial value problem

$$\mathbf{v}\_{\mathbf{x}}(0) = \mathbf{x} \qquad \mathbf{v}\_{\mathbf{x}}'(t) = -\nabla f(\mathbf{v}\_{\mathbf{x}}(t)),\tag{1}$$

namely, starting at a point $\mathbf{x}$, its integral curve moves it according to the gradient of *f*, eventually reaching, except for a set of null measure, the destination $\lim_{t\to\infty} \nu_{\mathbf{x}}(t)$. By Morse theory, the set of destinations $\Theta = \{\theta \in \mathbb{R}^d : \theta = \lim_{t\to\infty} \nu_{\mathbf{x}}(t),\ \mathbf{x} \in \mathbb{R}^d\}$ is the set of distinct modes of *f*. Since integral curves never intersect except at critical points, $\Theta$ identifies a unique partition $\{D_\theta\}_{\theta\in\Theta}$ of $\mathbb{R}^d$ into distinct regions $D_\theta = \{\mathbf{x} : \lim_{t\to\infty} \nu_{\mathbf{x}}(t) = \theta\}$, which represent the "basins of attraction" of each mode $\theta$ and include all points whose integral curves have destination $\theta$.
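The destination map can be illustrated numerically. The sketch below is illustrative only (it uses an analytic two-component Gaussian mixture with modes near ±2, rather than an estimated density): a discretised integral curve of the gradient carries each starting point to the mode of its basin of attraction.

```python
import math

def grad_f(x):
    """Gradient of an illustrative two-mode Morse density:
    f(x) = 0.5*N(x; -2, 1) + 0.5*N(x; 2, 1)."""
    return sum(
        0.5 * (mu - x) * math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)
        for mu in (-2.0, 2.0)
    )

def destination(x, eta=0.5, steps=2000):
    """Discretised integral curve x <- x + eta * grad_f(x):
    gradient ascent carries x to the mode whose basin contains it."""
    for _ in range(steps):
        x += eta * grad_f(x)
    return x

# starting points on opposite sides of the valley at 0 reach different modes
left, right = destination(-3.0), destination(3.0)
```

Points starting left of the valley converge near the mode at about −2 and points starting right of it near +2, so the two starting points lie in different basins $D_\theta$.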

In the absence of information about the true modal structure of *f*, testing the significance of a mode amounts to defining the system of hypotheses

$$
H_0 : \theta_0 \in \Theta \quad \text{vs} \quad H_1 : \theta_0 \notin \Theta, \tag{2}
$$

for some $\theta_0 \in \mathbb{R}^d$. While apparently composite, the null hypothesis is in fact a simple one, as the (yet unknown) partition of $\mathbb{R}^d$ into the set $\{D(\theta)\}_{\theta\in\Theta}$ allows us to read $H_0$ as "$\theta_0$ is the mode of the domain $D(\theta_0)$ to which it belongs"; hence, under the null hypothesis, it holds that $\theta_0 = \operatorname{argmin}_{x \in D(\theta_0)} -f(x)$.

Gradient descent algorithms find iterative solutions to general optimisation problems with suitable smoothness properties. In the current framework, the problem can be faced via the discretisation of the integral curve (1)

$$\boldsymbol{\Theta}\_{(t+1)} = \boldsymbol{\Theta}\_{(t)} + \eta \nabla f(\boldsymbol{\Theta}\_{(t)}),\tag{3}$$



where η is the step size, usually selected to guarantee convergence. In our framework, the target function $f$ may be suitably replaced by a nonparametric kernel estimate $\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} K\!\big(\frac{x-x_i}{h}\big)$, with bandwidth $h>0$ and kernel $K$, which we take to be a symmetric probability density; hence, the estimated gradient $\nabla\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla K_h(x-x_i)$ is plugged into (3).
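As an illustration, the ascent (3) with a kernel-estimated gradient can be sketched as follows; the Gaussian kernel and the values of the step size and iteration budget are our own illustrative choices, not prescribed by the paper.

```python
import numpy as np

def kde_gradient(x, data, h):
    """Gradient of a Gaussian-kernel density estimate at point x.

    Here f_hat(x) = (1/n) * sum_i K_h(x - x_i), with K the standard
    d-variate Gaussian density and K_h(u) = h^(-d) * K(u / h).
    """
    n, d = data.shape
    u = (x - data) / h                                   # (n, d) scaled differences
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    # grad_x K_h(x - x_i) = -h^(-(d+1)) * u_i * K(u_i) for the Gaussian kernel
    return -(u * k[:, None]).sum(axis=0) / (n * h ** (d + 1))

def ascend_to_mode(theta0, data, h, eta=0.5, n_iter=2000):
    """Discretised gradient ascent (3): theta <- theta + eta * grad f_hat(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta + eta * kde_gradient(theta, data, h)
    return theta
```

Started inside the domain of attraction of a mode, the iterates converge to the corresponding sample mode of $\hat f$.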

Under regularity conditions, the asymptotic distribution of the kernel gradient estimator (Duong *et al.*, 2008) is $\nabla\hat f(x) \,\dot\sim\, N\big(\nabla f(x), \frac{1}{n}\Sigma_\nabla\big)$, with $\Sigma_\nabla = h^{-(d+2)} R(\nabla K)\, f(x)$ and $R(\nabla K)$ a constant depending on the kernel. In order to develop a test statistic for (2), we adapt to the current framework the rationale of Liang & Su (2019), who discuss moment-adjusted stochastic gradient descent for optimisation in statistical inference, so that (3) becomes:

$$\begin{aligned} \theta_{(t+1)} &= \theta_{(t)} + \eta\,\nabla\hat f(\theta_{(t)}) - \eta\big[\nabla\hat f(\theta_{(t)}) - \mathbb{E}\big(\nabla\hat f(\theta_{(t)})\big)\big] \\ &= \theta_{(t)} + \eta\, h^{\frac{d+2}{2}} R(\nabla K)^{-\frac{1}{2}} \frac{\nabla\hat f(\theta_{(t)})}{\sqrt{f(\theta_{(t)})}} - \frac{\eta}{\sqrt{n}}\,\varepsilon_{(t)}, \end{aligned}\tag{4}$$

where the last step comes from a classic standardisation idea and $\varepsilon_{(t)} \sim N(0, I_d)$. Starting from $\theta_{(0)} = \hat\theta$, which under $H_0$ we expect to lie in $D(\theta_0)$, we may then produce a random sequence $\theta_{(1)},\dots,\theta_{(T)}$ of sample modes by simply generating an artificial sample of $\varepsilon_{(t)} \sim N(0, I_d)$ and applying the update mechanism (4), where $f$ is replaced by $\hat f$. $H_0$ is afterwards rejected for large values of the asymptotically chi-squared distributed test statistic

$$(\bar\theta - \theta_0)^\top \hat\Sigma_{\bar\theta}^{-1} (\bar\theta - \theta_0) \,\dot\sim\, \chi^2_d,$$

where $\bar\theta$ and $\hat\Sigma_{\bar\theta}$ are respectively the mean and the covariance matrix of the sequence of sample modes $\theta_{(1)},\dots,\theta_{(T)}$.
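A compact sketch of the whole procedure — the moment-adjusted updates (4) followed by the chi-squared statistic — might look as follows; the Gaussian-kernel constant for $R(\nabla K)$, the burn-in, and the tuning values are our own assumptions, not choices stated in the paper.

```python
import numpy as np
from scipy import stats

def kde_and_grad(x, data, h):
    """Gaussian-kernel density estimate f_hat and its gradient at x."""
    n, d = data.shape
    u = (x - data) / h
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    f = k.sum() / (n * h**d)
    grad = -(u * k[:, None]).sum(axis=0) / (n * h ** (d + 1))
    return f, grad

def mode_test(data, theta0, h, eta=0.05, T=500, burn=200, seed=0):
    """Moment-adjusted stochastic updates (4) and the chi-squared test statistic."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # For a standard Gaussian kernel, int grad K grad K^T du = R * I_d with
    # R = 1 / (2^(d+1) * pi^(d/2)) -- our assumed scalar version of R(grad K).
    R = 1.0 / (2 ** (d + 1) * np.pi ** (d / 2))
    theta = np.asarray(theta0, dtype=float)
    modes = []
    for t in range(burn + T):
        f, g = kde_and_grad(theta, data, h)
        drift = eta * h ** ((d + 2) / 2) * R ** -0.5 * g / np.sqrt(max(f, 1e-12))
        theta = theta + drift - (eta / np.sqrt(n)) * rng.standard_normal(d)
        if t >= burn:                       # keep the sequence after a burn-in
            modes.append(theta.copy())
    modes = np.asarray(modes)
    mbar = modes.mean(axis=0)
    S = np.cov(modes, rowvar=False)
    z = mbar - np.asarray(theta0, dtype=float)
    stat = float(z @ np.linalg.solve(S, z))
    return stat, stats.chi2.sf(stat, df=d)  # statistic and asymptotic p-value
```

Under $H_0$ the statistic is compared with the $\chi^2_d$ quantile at the chosen level α.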

#### 3 Empirical study

A simulation study has been run to evaluate the behaviour of the proposed test with respect to the Type I error probability and the power. The simple rule of thumb of selecting *h* as asymptotically optimal for Normal data has been used, and *T* has been set to 500.
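The "asymptotically optimal for Normal data" rule can be sketched as a normal-scale bandwidth of the following form; the exact constant is our assumption (a normal-scale rule for Gaussian-kernel estimation of the *r*-th density derivative), so it should be checked against the authors' actual choice.

```python
import numpy as np

def normal_scale_bandwidth(data, deriv_order=0):
    """Normal-scale rule-of-thumb bandwidth for d-variate Gaussian-kernel
    estimation of the r-th density derivative (r = 0: density, r = 1: gradient).

    Assumed form: h = sigma * (4 / (d + 2r + 2))^e * n^(-e), e = 1/(d + 2r + 4).
    """
    n, d = data.shape
    r = deriv_order
    sigma = np.mean(np.std(data, axis=0, ddof=1))   # average marginal scale
    expo = 1.0 / (d + 2 * r + 4)
    return sigma * (4.0 / (d + 2 * r + 2)) ** expo * n ** -expo
```

For $d=1$ and $r=0$ this reduces to the familiar $(4/3)^{1/5}\,\sigma\, n^{-1/5}$ rule; gradient estimation ($r=1$) calls for a slightly larger bandwidth.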

Figure 1 displays the results, associated with the best value of η, obtained by drawing 500 samples of size *n* = 1000 from a bivariate skew distribution. The test shows overall good control of the Type I error, with a slight tendency to be anti-conservative. The power rightly increases as the true mode departs from $\theta_0$, with higher values associated with the steepest side of the density and lower ones with the gentlest side. This confirms that the testing problem is strictly related to the local curvature around the true mode, and hence to the eigenvalues of the Hessian. Further results, not reported here for brevity, broadly confirm the illustrated behaviour. More challenging settings, such as multimodal ones with overlapping modal regions, in general require a larger sample size to guarantee control of the Type I error probability.

Figure 1. *Contour plot of a density and associated estimated Type I error probability for a nominal* α = 0.05 *and power for increasing distance from* $H_0$. *The estimated rejection probabilities for the two sides of the density are:*

| $d(\theta,\theta_0)$ | P(reject $H_0$), left | P(reject $H_0$), right |
|---|---|---|
| 0 | 0.06 | 0.06 |
| 0.2 | 0.13 | 0.05 |
| 0.4 | 0.29 | 0.08 |
| 0.6 | 0.57 | 0.20 |
| 0.8 | 0.78 | 0.36 |
| 1.00 | 0.92 | 0.55 |

Further investigation is needed to provide insights into the test in higher-dimensional settings, along with its sensitivity to different choices of η and *h*.

#### References

CHACÓN, J. 2015. A population background for nonparametric density-based clustering. *Stat. Sc.*, 30, 518–532.

CHACÓN, J. 2020. The modal age of statistics. *Int. Stat. Rev.*, 88, 122–141.

DUONG, T., COWLING, A., KOCH, I., & WAND, M. 2008. Feature significance for multivariate kernel density estimation. *Comp. Stat. & Data An.*, 52, 4225–4242.

GENOVESE, C., PERONE-PACIFICO, M., VERDINELLI, I., & WASSERMAN, L. 2016. Non-parametric inference for density modes. *J. Roy. Stat. Soc. B*, 78, 99–126.

LIANG, T., & SU, W. 2019. Statistical inference for the population landscape via moment-adjusted stochastic gradients. *J. Roy. Stat. Soc. B*, 81, 431–456.

MATSUMOTO, Y. 2002. *An introduction to Morse theory*. Amer. Math. Soc.


### **VISUALIZING CLUSTER OF WORDS: A GRAPHICAL APPROACH TO GRAMMAR ACQUISITION**


Massimo Mucciardi<sup>1</sup>, Giovanni Pirrotta<sup>2</sup>, Andrea Briglia<sup>3</sup> and Arnaud Sallaberry<sup>4</sup>

<sup>1</sup> Department of Cognitive Science, Education and Cultural Studies, University of Messina (e-mail: massimo.mucciardi@unime.it)

<sup>2</sup> University of Messina (e-mail: gpirrotta@unime.it)

<sup>3</sup> Université Paul Valéry Montpellier 3 (e-mail: andrea.briglia@univ-montp3.fr)

<sup>4</sup> LIRMM, University of Montpellier, CNRS, & AMIS, Université Paul Valéry Montpellier 3 (e-mail: arnaud.sallaberry@lirmm.fr)

**ABSTRACT**: Language has traditionally been considered a qualitative phenomenon that mainly requires hermeneutical methodologies in order to be studied. Yet in recent decades, thanks to advances in data storage, processing and visualization, there has been a growing and fertile interest in analysing language through statistics and quantitative methods. In light of these motivations, we think it is worthwhile to explore databases made up of transcribed child spoken language, in order to verify whether and how underlying patterns and recurrent sequences of learning stages operate during acquisition. We therefore propose that the Expectation-Maximization clustering method, combined with an innovative graphical visualization, can be useful to evaluate the development of linguistic structures over time in a reliable way.

**KEYWORDS**: first language acquisition, EM clustering, graphical visualization, phonetic variation rate, POS Tags.

#### **1 General Framework**

First language acquisition can be studied and modelled by using statistical tools: experiments have shown how specific *innately biased statistical learning mechanisms* are activated in vitro, in settings where children easily learn to keep memory of the transitional probabilities between syllables in order to spot word boundaries [6]. Statistical and computational methods have contributed important advances to the understanding of language acquisition: corpus analysis is one of the most rigorous ways to account for patterns, regularities and learning stages in a sound and replicable procedure [2]. In a very abstract form, first language acquisition can be viewed as a mixture of deterministic and random processes. It is partly deterministic because the rules and constraints that apply to human cognition are partly known. It is partly random because the amount of variability between children, and within a single child, is widely acknowledged, and represents at the same time what is interesting and what is difficult in modelling child language. Romberg and Saffran [4] assert that in language acquisition the term 'statistical learning' is most closely associated with tracking sequential statistics in word segmentation or grammar learning tasks. Knowing these rules and constraints does not allow us to predict the outcome of a child beginning to be immersed in his/her native language. All we know is that around the age of 5-6, he/she will master his/her own language(s). We know approximately the learning stages, the date of the first word, and the rough order of consonant acquisition. Interesting theories have been developed about the patterns of errors (e.g. phonetic variation) that a child will most likely make, but it is to date very hard to model language acquisition. The types of patterns tracked by a statistical learning mechanism can be quite simple, such as a frequency count, or more complex, such as a conditional probability [4].

In other words, learning a language (here conceived as a statistical structure of the environment) is in some ways a process that brings a child to minimize long-term prediction error. Clustering text is an important phase in data analysis. The common task in text clustering is to handle text in a multidimensional space and to partition corpora into groups, where each group contains sentences that are similar to each other according to some grammatical indicators. Considering the above, in this paper we propose a new statistical strategy, based on clustering and visualization of words, to evaluate the development of child linguistic structures over time in a reliable way. The clusters are sufficiently explanatory for understanding first language acquisition and appear efficient in terms of clustering performance. The paper is organized as follows: section 2 describes the data structure and the model applied; section 3 briefly presents the analysis strategy, the principal elaborations and the visual interface for clustering.

#### **2 Data Structure and Model**


CoLaJE is a database composed of seven children who have been video-recorded in vivo for approximately one hour every month, from their first year of life until they were five (see https://www.ortolang.fr/). In this exploratory research, statistical treatments have been tested only on two children (Adrien and Madeleine), because the transcriptions obtained from these corpora are the most complete. Code for the Human Analysis of Transcripts (CHAT) provides a standardized format for producing computerized transcripts of conversational interactions. By analyzing, cleaning, filtering and normalizing all the available original CHAT transcripts, we aimed at producing two corpora composed of the overall amount of what the children said through the years. A total of 8214 and 7168 annotated sentences, containing more than 100 variables, were collected<sup>1</sup>. Some useful measures have been calculated, such as the child's age in years (Time) and the Sentence Phonetic Variation Rate (SPVR) [1]: the SPVR is obtained by comparing *mod* and *pho* in order to measure how the relation between varied and correct forms evolves over time. For a single sentence *i* (with *i* = 1, ..., *N*),

<sup>1</sup> Due to lack of space, in this paper we present the results for the Adrien dataset only. All other calculations are available on request.

$$\text{SPVR}_{i} = \left(TNPV_{i} / CTWT_{i}\right) \cdot 100\tag{1}$$



where *TNPV* is the Total Number of Phonetic Variations of the words - the total number of differences between what the child really says (called "pho") and what he should have said according to the adult norm (called "mod") - and *CTWT* is the Child Total Words Tokenized. Hence, SPVR takes the value 0% when the child makes no error and 100% when the child pronounces none of the words contained in the sentence correctly. Then, we applied a Part-Of-Speech tagger (POS tags), software that reads text in a given language and assigns a part of speech to each word, such as noun, verb or adjective. We used the Stanza Core NLP engine [5] to tag all CHI words, using Universal Dependencies as the standard of reference for part-of-speech classification [7]. Considering the nature of the variables (count data), we use finite multivariate Poisson mixtures in the EM procedure. We recall that EM clustering is an iterative method relying on the assumption that the data are generated by a mixture of underlying probability distributions, where each component represents a separate group, or cluster. The method provides the optimal number of clusters in any empirical situation, using a two-step iterative algorithm [3]. According to this approach, we estimate the mixture parameters by computing the maximum likelihood estimate (MLE) with the EM algorithm. The next section presents the principal results.
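As a rough sketch of the clustering step, EM for a mixture of independent Poisson components on sentence-level POS-count vectors could look as follows. This is a minimal illustration under the assumption of independent Poisson margins within each component; it is not the authors' implementation, and the selection of the number of clusters is omitted.

```python
import numpy as np
from scipy.special import gammaln

def poisson_mixture_em(X, k, n_iter=200, seed=0):
    """EM for a k-component mixture of independent Poissons on count data X (n, d).

    Returns mixing weights pi (k,), rate matrix lam (k, d) and
    responsibilities r (n, k).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lam = X[rng.choice(n, size=k, replace=False)] + 0.5   # rates from random rows
    pi = np.full(k, 1.0 / k)
    log_fact = gammaln(X + 1).sum(axis=1)                 # log(x!) terms (constant)
    for _ in range(n_iter):
        # E-step: responsibilities from the component log-likelihoods
        logp = X @ np.log(lam).T - lam.sum(axis=1) - log_fact[:, None] + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)           # stabilise exponentials
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means update the weights and rates
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        lam = (r.T @ X) / nk[:, None] + 1e-9
    return pi, lam, r
```

Each sentence is then assigned to the cluster with the highest responsibility, and the clusters can be summarized per stratum as in Table 1.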

#### **3 Principal Results**

To extend previous research [1], we divide our database into nine strata, considering 3 age classes of the child (L = 1.97-2.64; M = 2.71-3.39; H = 3.46-4.33, expressed in years) and 3 classes of SPVR (L ≤ 33; M > 33 and ≤ 66; H > 66, expressed in percent). In total we get 9 strata (from LL to HH). By framing the analysis in this way, we turn the EM clustering algorithm into a potentially interesting method that could provide a reliable way to observe the development of linguistic structures over time. Table 1 summarizes the main results obtained from clustering, through an overview of the most influential POS tags for each stratum and its related clusters for the dataset examined. In addition, the means of the POS tags are calculated in each stratum (data not shown). We can observe that VERB occupies an increasingly important role in development: it is almost absent in Adrien (dataset 1) in the earlier age strata, it develops sharply in the middle age strata, and it is present in almost every sentence in the upper age strata. It is clear that VERB causes an increase in the SPVR, as its values are higher in the higher error-rate strata (more than 33 percent). We can also observe that parts of speech such as PRON (pronoun), VERB and SCONJ (subordinating conjunction) - which can be considered markers of longer sentences - increase in importance. For the visualization of clusters, we propose an interactive visual interface to support this analysis. It has been designed considering a list of requirements defined with regard to the data structures and variables extracted by the clustering technique, and the tasks one should be able to perform on such data.

These are the main features: 1) visualize the clusters by age and SPVR; 2) visualize the distribution of POS tags in the clusters; 3) visualize the different values characterizing the clusters (age, SPVR, number of POS tags, number of sentences) and the POS tags (number of occurrences in a cluster, percentage, mean, Fisher coefficient, p-value); 4) visualize the list of sentences of a cluster; 5) visualize the relative and absolute evolution of the number of POS tags as the child grows up (see the following link for all the details: http://advanse.lirmm.fr/EMClustering/). In conclusion, we would suggest that these preliminary results represent a fair attempt to visualize child language development through clusters of words grouped by several criteria (age, grammatical properties, correct pronunciation). We can cautiously say that, at this first stage of research, the EM algorithm can provide us with some mild descriptions in the classification of POS tags.


**Table 1.** *EM clustering results by strata - Dataset 1 (Adrien). Number of clusters in brackets; POS ordered by ANOVA post-hoc F-test, in bold when p < 0.05 (first 10 POS).*

| POS | LL (3) | LM (2) | LH (4) | ML (5) | MM (3) | MH (3) | HL (4) | HM (5) | HH (5) |
|---|---|---|---|---|---|---|---|---|---|
| POS1 | **INTJ** | **VERB** | **PRON** | **CCONJ** | **ADP** | **PRON** | **PRON** | **NOUN** | **AUX** |
| POS2 | **DET** | **PROPN** | **ADV** | **PRON** | **ADV** | **AUX** | **DET** | **DET** | **NOUN** |
| POS3 | **ADP** | ADV | **DET** | **NOUN** | **DET** | **NOUN** | **VERB** | **PRON** | **VERB** |
| POS4 | **NOUN** | NOUN | **VERB** | **AUX** | **SCONJ** | **DET** | **NOUN** | **ADJ** | **DET** |
| POS5 | **SYM** | INTJ | **NOUN** | **VERB** | **CCONJ** | **ADP** | **SCONJ** | **AUX** | **PRON** |
| POS6 | **ADV** | PRON | **INTJ** | **NUM** | **INTJ** | **ADV** | **ADP** | **VERB** | **NUM** |
| POS7 | **PROPN** | DET | **PROPN** | **SYM** | **NOUN** | **PROPN** | **AUX** | **ADP** | **ADJ** |
| POS8 | **PRON** | AUX | **AUX** | **ADV** | **ADJ** | **SCONJ** | **ADV** | **ADV** | **ADP** |
| POS9 | **VERB** | NUM | **ADJ** | **DET** | **NUM** | **VERB** | **ADJ** | **SCONJ** | **ADV** |
| POS10 | **X** | CCONJ | **SCONJ** | **PROPN** | **PROPN** | **INTJ** | **CCONJ** | **X** | **X** |

#### **References**

[1] BRIGLIA, A., MUCCIARDI, M., SAUVAGE, J. 2020. Identify the speech code through statistics: a data driven approach. Proceedings SIS 2020 (Pearson Editions).

[2] CHATER, N., MANNING, C. D. 2006. Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10(7), 335-344.

[3] DEMPSTER, A.P., LAIRD, N.M., RUBIN, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B: Methodological, 39, 1-38.

[4] ROMBERG, A.R., SAFFRAN, J.R. 2010. Statistical learning and language acquisition. Wiley Interdiscip. Rev. Cogn. Sci., 1(6), 906-914.

[5] QI, P., ZHANG, Y., ZHANG, Y., BOLTON, J., MANNING, C. D. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Association for Computational Linguistics (ACL) System Demonstrations.

[6] SAFFRAN, J. R., ASLIN, R. N., NEWPORT, E. L. 1996. Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

[7] UNIVERSAL DEPENDENCIES. 2021. Retrieved from https://universaldependencies.org/fr/pos/index.html


### ROBUSTNESS METHODS FOR MODELLING COUNT DATA WITH GENERAL DEPENDENCE STRUCTURES


Marta Nai Ruscone<sup>1</sup> and Dimitris Karlis<sup>2</sup>

<sup>1</sup> DIMA - Department of Mathematics, University of Genoa, (e-mail: marta.nairuscone@unige.it)

<sup>2</sup> Department of Statistics, Athens University of Economics and Business, (e-mail: karlis@aueb.gr)

ABSTRACT: Bivariate Poisson models are appropriate for modelling paired count data. However, the bivariate Poisson model does not allow for a negative dependence structure. Therefore, it is necessary to consider alternatives. A natural way is to use copulas to generate various bivariate discrete distributions. While such models exist in the literature, the issue of choosing a suitable copula has been overlooked so far. Different copulas lead to different structures, and any copula misspecification can render the inference useless. In this work, we consider bivariate Poisson models generated with a copula and investigate their robustness under outlier contamination and model misspecification. Particular focus is on the robustness of copula-related parameters. English Premier League data are used to demonstrate the effectiveness of our approach.

KEYWORDS: copula, dependence, outliers, robustness.

### 1 Introduction

Bivariate Poisson models are appropriate for modelling paired count data exhibiting correlation. Paired count data arise in a wide range of contexts including, for example, sports (e.g. the number of goals scored by each of the two opponent teams in soccer). Several models are available that can incorporate different structures and marginal properties; see, for example, Karlis & Ntzoufras, 2003. See also the work in Nikoloulopoulos, 2013 for defining models with copulas. While several extensions and models have been proposed, to the best of our knowledge, issues of robustness have been overlooked. Following da Fonseca & Fieller, 2006, there are two kinds of robustness that one should consider. The first one refers to contamination from outlier observations or, better, from observations that are unexpected under a certain model. The second one refers to model deviation, i.e. a researcher would like to fit the model with a method such that, even if the model is not correct, the method protects against deriving inconsistent results.

In this work, we consider a copula-based bivariate Poisson distribution. We apply a minimum distance estimation methodology using the Hellinger distance. We investigate its robustness under outlier contamination and model misspecification. Particular focus is given to the robustness of copula-related parameters that measure the association exhibited by paired count data. The effectiveness of this methodology is examined on data from the English Premier League 2013-2014 season.

#### 2 Copulas


Copulas are functions that join multivariate distributions to their marginal distributions (Nelsen, 2007). They describe the dependence structure existing across marginal random variables. In this way we can consider bivariate distributions with dependency structures different from the linear one that characterizes the multivariate Gaussian distribution.

A bivariate copula $C : I^2 \to I$, with $I = [0,1]$, is the cumulative bivariate distribution function of the random variables $(U,V)$ with uniform marginal distributions on $[0,1]$. It is defined as:

$$C(u, v; \theta) = P(U \le u, V \le v; \theta), \qquad 0 \le u \le 1, \; 0 \le v \le 1 \tag{1}$$

where θ is a parameter measuring the dependence between *U* and *V*.

Let $(Y_1, Y_2)$ be a bivariate random vector with marginal cdfs $F_{Y_1}(y_1)$ and $F_{Y_2}(y_2)$ and joint cdf $F_{Y_1,Y_2}(y_1, y_2; \theta)$. There always exists a copula function $C(\cdot,\cdot;\theta)$ such that

$$F_{Y_1,Y_2}(y_1, y_2; \theta) = C\left(F_{Y_1}(y_1), F_{Y_2}(y_2); \theta\right), \qquad y_1, y_2 \in \mathbb{R}. \tag{2}$$

This result (Sklar's theorem) states that each joint distribution can be expressed in terms of two separate but related components: the marginal distributions and the dependence structure between them. The dependence structure is captured by the copula function $C(\cdot,\cdot;\theta)$.

When $Y_1$ and $Y_2$ are discrete random variables taking values on some lattice $\Omega$, the copula $C$ is unique on $\Omega$ but not elsewhere. Thus, in the discrete case the mapping from two marginals and a copula $\{F_1, F_2, C\}$ to a bivariate distribution $F(Y_1, Y_2)$ is not one-to-one. However, this non-uniqueness is of no consequence, as the region outside $\Omega$ is not of interest in the discrete case (Nelsen, 2007).
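The boundary behaviour implied by eq. (1) can be checked numerically. As a quick illustration (not part of the paper), the sketch below evaluates the Frank copula, used later in Section 3, and verifies the defining boundary conditions $C(u,1) = u$ and $C(1,v) = v$:

```python
import math

def frank_copula(u, v, gamma):
    """Frank copula C(u, v; gamma) for gamma != 0 (cf. eq. (4) below)."""
    num = math.expm1(-gamma * u) * math.expm1(-gamma * v)
    return -math.log1p(num / math.expm1(-gamma)) / gamma

# Any copula must satisfy the boundary conditions C(u, 1) = u and C(1, v) = v,
# i.e. it has uniform marginals on [0, 1].
for u in (0.1, 0.5, 0.9):
    assert abs(frank_copula(u, 1.0, 2.0) - u) < 1e-12
    assert abs(frank_copula(1.0, u, 2.0) - u) < 1e-12
```

Here `expm1` and `log1p` are used instead of `exp`/`log` for numerical stability when the dependence parameter is close to zero.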

#### 3 Bivariate count models based on copulas

For count data, a common starting point is to use the Poisson distribution for the marginals:

$$f(y; \mu_j) = \mu_j^{y}\, e^{-\mu_j} / y!, \qquad j = 1, 2, \quad y = 0, 1, \dots \tag{3}$$


where $\mu_j > 0$. For bivariate counts, models based on copulas offer the advantage of easy generalization to several different dependence structures, which is otherwise not easy to achieve. Take, for instance, the Frank copula:

$$C(u, v; \gamma) = -\gamma^{-1} \log \left[ 1 + \frac{(e^{-\gamma u} - 1)(e^{-\gamma v} - 1)}{e^{-\gamma} - 1} \right], \qquad \gamma \in \mathbb{R} \setminus \{0\}, \; u, v \in [0, 1]. \tag{4}$$

Then

$$F(y_1, y_2; \mu_1, \mu_2, \gamma) \equiv C\left(F(y_1; \mu_1), F(y_2; \mu_2); \gamma\right), \tag{5}$$

is a well defined distribution function with a dependence structure. Its probability mass function is

$$\begin{aligned} P(Y_1 = y_1, Y_2 = y_2; \mu_1, \mu_2, \gamma) = {} & F(y_1, y_2; \mu_1, \mu_2, \gamma) - F(y_1 - 1, y_2; \mu_1, \mu_2, \gamma) \\ & - F(y_1, y_2 - 1; \mu_1, \mu_2, \gamma) + F(y_1 - 1, y_2 - 1; \mu_1, \mu_2, \gamma) \end{aligned} \tag{6}$$

In the present work we focus on bivariate models. For a review of discrete valued models based on copulas see Nikoloulopoulos, 2013.
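To make eqs. (4)–(6) concrete, here is a minimal sketch (illustrative only, with arbitrary parameter values) that builds a bivariate Poisson pmf from Poisson marginals coupled by a Frank copula and checks that it sums to one:

```python
import math

def frank_copula(u, v, gamma):
    # Frank copula (eq. 4), gamma != 0; gamma < 0 gives negative dependence
    num = math.expm1(-gamma * u) * math.expm1(-gamma * v)
    return -math.log1p(num / math.expm1(-gamma)) / gamma

def poisson_cdf(y, mu):
    # F(y; mu) = P(Y <= y), with F(-1) = 0 so the differences below work
    return sum(math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
               for k in range(y + 1)) if y >= 0 else 0.0

def biv_poisson_pmf(y1, y2, mu1, mu2, gamma):
    # Rectangle rule (eq. 6): finite differences of the joint cdf (eq. 5)
    F = lambda a, b: frank_copula(poisson_cdf(a, mu1), poisson_cdf(b, mu2), gamma)
    return F(y1, y2) - F(y1 - 1, y2) - F(y1, y2 - 1) + F(y1 - 1, y2 - 1)

# Negative dependence (gamma < 0), impossible under the classical bivariate Poisson
total = sum(biv_poisson_pmf(i, j, 1.4, 1.1, -2.0)
            for i in range(30) for j in range(30))
assert abs(total - 1.0) < 1e-8
```

The sum over the grid telescopes to the joint cdf at the corner, so it approaches one as the grid grows, which is a cheap sanity check on the rectangle-rule implementation.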

#### 4 Minimum distance estimation

For discrete data, model robustness and efficiency can be achieved almost at the same time by defining distances that downweight some observations (Lindsay, 1994). The minimum distance estimators can be interpreted as weighted likelihood estimators, where the weights are determined by some kind of distance between observed and expected frequencies. For example, consider minimum Hellinger distance estimators, based on minimizing

$$\sum_{x} \left( d(x)^{1/2} - m_{\beta}(x)^{1/2} \right)^2$$

where $d(x)$ is the observed relative frequency and $m_{\beta}(x)$ is the probability mass at $x$ under the assumed model with parameters of interest $\beta$. It turns out that this quantity leads to estimating equations of the form

$$\sum_{x} \left( \frac{d(x)}{m_{\beta}(x)} \right)^{1/2} \frac{\partial m_{\beta}(x)}{\partial \beta} = 0$$

directly comparable to the ML estimating equations

$$\sum_{x} \frac{d(x)}{m_{\beta}(x)} \frac{\partial m_{\beta}(x)}{\partial \beta} = 0$$

which actually implies that we weight the observations differently (see Lindsay, 1994).

In this work we extend the approach to bivariate count models defined by copulas, aiming at deriving robust estimators for both the marginal and the copula parameters. Now $x$ denotes a pair of observations, and the parameters $\beta$ to estimate are those of the marginal distributions plus the copula parameter(s). We have also developed an iterative algorithm that facilitates the estimation. In the bivariate case the relative frequencies are still reasonable estimators of the underlying probabilities, but larger sample sizes are needed. As we move to higher dimensions, problems similar to those of the regression setting may occur.
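As a toy illustration of the idea (univariate, stdlib-only, and not the paper's bivariate algorithm), a grid search for the minimum Hellinger distance estimate of a Poisson mean shows the robustness to a gross outlier:

```python
import math
from collections import Counter

def poisson_pmf(y, mu):
    # log-space evaluation to stay numerically stable on a wide support
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def hellinger_objective(sample, mu, support=100):
    # sum_x (d(x)^(1/2) - m_beta(x)^(1/2))^2 over a truncated support
    n = len(sample)
    d = Counter(sample)
    return sum((math.sqrt(d.get(x, 0) / n) - math.sqrt(poisson_pmf(x, mu))) ** 2
               for x in range(support))

# A roughly Poisson(1.4) sample of size 100 contaminated by one outlier at 50
sample = [0] * 27 + [1] * 27 + [2] * 27 + [3] * 14 + [4] * 4 + [50]
grid = [m / 100 for m in range(50, 500)]       # candidate mu values in [0.5, 5)
mhd = min(grid, key=lambda m: hellinger_objective(sample, m))
mle = sum(sample) / len(sample)                # ML estimate = sample mean
# The outlier drags the mean well above 1.8, but barely moves the MHD estimate
assert mle > 1.8 and mhd < 1.6
```

The outlier cell contributes at most its own relative frequency to the Hellinger objective regardless of $\mu$, which is why the minimizer stays near the bulk of the data while the maximum likelihood estimate is pulled upward.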

#### 5 Application


Bivariate count models are widely used for modelling the outcome of a football game. The two counts refer to the number of goals scored by each team. It seems natural to assume some dependence between the goals to represent the competitive nature of soccer. Our data refer to all scores from the English Premier League 2013-2014 season, in which a series of unexpectedly large scores occurred. We apply a robust approach to estimate the parameters of the model in order to reduce the effect of the large scores.

#### References

DA FONSECA, V. GRUNERT, & FIELLER, N. R. J. 2006. Distortion in statistical inference: the distinction between data contamination and model deviation. *Metrika*, 63(2), 169–190.

KARLIS, D., & NTZOUFRAS, I. 2003. Analysis of sports data by using bivariate Poisson models. *J. R. Stat. Soc.*, 52(3), 381–393.

LINDSAY, B. G. 1994. Efficiency versus robustness: the case for minimum Hellinger distance and related methods. *Ann. Stat.*, 22(2), 1081–1114.

NELSEN, R. B. 2007. *An introduction to copulas*. New York: Springer.

NIKOLOULOPOULOS, A. K. 2013. Copula-based models for multivariate discrete response data. *Pages 231–249 of: Copulae in Mathematical and Quantitative Finance*.


### BAYESIAN ANALYSIS OF A WATER QUALITY HIGH-FREQUENCY TIME SERIES THROUGH MARKOV SWITCHING AUTOREGRESSIVE MODELS


Roberta Paroli<sup>1</sup>, Luigi Spezia<sup>2</sup>, Marc Stutter<sup>3</sup> and Andy Vinten<sup>3</sup>

<sup>1</sup> Dipartimento di Scienze Statistiche, Università Cattolica SC, Milano, (e-mail: roberta.paroli@unicatt.it)

<sup>2</sup> Biomathematics & Statistics Scotland, Aberdeen, (e-mail: luigi@bioss.ac.uk)

<sup>3</sup> The James Hutton Institute, Aberdeen, (e-mail: marc.stutter@hutton.ac.uk and andy.vinten@hutton.ac.uk)

ABSTRACT: In order to provide simulation inputs for investigations on diffuse water pollution and support rural land management policy on soil and water management, a turbidity time series recorded in a Scottish stream for more than a year, along with two covariates, is considered. Turbidity time series have complex dynamics because they are non-linear, non-Normal, non-stationary, with a long memory, and present missing values. Given these issues the turbidity process is analysed by Markov switching autoregressive models under the Bayesian paradigm using novel evolutionary Monte Carlo algorithms. Hence, it is possible to efficiently fit the actual data, reconstruct the sequence of hidden states, restore the missing values, and classify the observations into a few regimes, providing new insight on turbidity dynamics.

KEYWORDS: non-homogeneous hidden Markov chain; path sampling; population Markov chain Monte Carlo; water quality; Wemyss catchment.

#### 1 Introduction and Data

Evidence of the effectiveness of diffuse pollution control measures is needed to support rural land management policy on soil and water management. For key pollutants (e.g., suspended sediment or particulate phosphorus), such evidence is difficult to obtain, because of the cost of sampling and chemical analysis of storm-event-driven changes in concentrations and loads in streams and rural drainage features. Some works have investigated the use of continuous automated turbidity as a proxy to estimate particulate phosphorus, fine sediment or hydrophobic pollutant loads using site specific calibrations of turbidity versus the pollutant of interest, with some success. The turbidity trace, along with discharge and other data, may contain hidden temporal patterns (Birkel et al., 2012) that can be used to understand the sources and processes delivering them to surface waters.

In order to provide simulation inputs for further investigations on diffuse pollution, a time series of turbidity data recorded in the Wemyss catchment (Scotland) from 1st January 2011 (00:00) to 5th January 2012 (15:15) is analysed here. Measurements (in NTU) were taken every 15 minutes; thus, the length of the series is 35,486 points, with 470 missing values (1.38% of the total number of observations). Two time series of explanatory variables are also available, without missing values and recorded with the same time resolution of turbidity: stage height (in cm) and rainfall (in mm).

A few complex issues need to be taken into account when modelling turbidity time series: non-Normality, non-linearity, non-stationarity, and long memory. Non-Normality is observed when the data density is multimodal, asymmetric, or kurtic and the data cannot be considered as realizations from a Gaussian process. Non-linearity is assumed when the whole series does not show the same statistical peculiarities over all the observations, but they can be classified into a few homogeneous groups. Non-linearity can also be assumed when the series exhibits asymmetries. Weak non-stationarity is caused by generating processes having time-varying means and autocovariances. Finally, when the series shows high autocorrelations at the higher lags, with a slow decay, the observations are realizations from a long-memory process.

Because of these issues the turbidity time series considered here was analysed by Markov switching autoregressive models (MSARMs). This class of models is a popular tool within the econometrics community for modelling complex time series. Although they are extremely powerful, MSARMs have been considered quite rarely in other disciplines. Among the few applications in environmental sciences, see Spezia et al. (2004) and Paroli and Spezia (2008) for air pollutant concentrations; Birkel et al. (2012) for isotope signatures; Monbet and Ailliot (2017) for air temperatures; and Ailliot and Monbet (2012) for wind time series.

#### 2 Model and Inference


MSARMs are pairs of discrete-time stochastic processes, one observed and one latent, or hidden. The hidden process is a finite-state Markov chain, whereas the observed process, given the Markov chain, is conditionally autoregressive. The dynamics of the observed process is driven by the dynamics of the latent one, so that each observation depends on the contemporary state of the Markov chain. By this theoretical structure, MSARMs allow: *i*) modelling non-linear and non-Normal time series by assuming that different autoregressions, each one depending on a hidden state, alternate according to the Markovian regime switching; *ii*) modelling a long-memory process; *iii*) classifying the observations into a small number of homogeneous groups, labelled as the regimes of the Markov chain.
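The generative structure just described can be sketched with a toy two-state switching AR(1) (the actual model in the paper uses m = 3 states, autoregressive order p = 4, and covariates; all names and values here are illustrative):

```python
import random

def simulate_msar(T, P, phi, sigma, seed=1):
    """Simulate y_t = phi[s_t] * y_{t-1} + sigma[s_t] * eps_t, where the hidden
    state s_t follows a Markov chain with transition matrix P."""
    rng = random.Random(seed)
    s, y, states, ys = 0, 0.0, [], []
    for _ in range(T):
        # draw the next hidden state from row s of the transition matrix
        u, cum = rng.random(), 0.0
        for j, pj in enumerate(P[s]):
            cum += pj
            if u < cum:
                s = j
                break
        # the observation is conditionally autoregressive given the state
        y = phi[s] * y + sigma[s] * rng.gauss(0.0, 1.0)
        states.append(s)
        ys.append(y)
    return states, ys

P = [[0.95, 0.05], [0.10, 0.90]]   # persistent regimes -> apparent long memory
states, ys = simulate_msar(500, P, phi=[0.3, 0.8], sigma=[0.5, 2.0])
assert len(ys) == 500 and set(states) == {0, 1}
```

Persistent diagonal entries in `P` make the regimes alternate slowly, producing the structural changes that can mimic long memory in the observed series.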


Covariates, i.e. stage height and rainfall, were also incorporated into the model through both the hidden Markov chain (the transition probabilities are time-varying and dependent on the two dynamic exogenous variables) and the observed process (the two state-dependent exogenous variables are added to the past observations). Thus, we have time-varying means and autocovariances, and hence, a non-stationary model. Finally, the slow decay of the autocorrelation function is due to both the non-linearity of the series and the automatic recording of the data at a high temporal frequency. Non-linear time series with structural changes produce realizations that appear to have long memory. Given that structural changes can be efficiently described by stochastic regime switching models, we adopted MSARMs to highlight the changes in the turbidity dynamics, classify the observations into a few states, and fit the long memory process of the turbidity dynamics.
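One standard way to obtain such covariate-dependent (non-homogeneous) transition probabilities is through a logistic link, sketched here for a two-state chain (the paper's exact parameterisation may differ; all names and values are illustrative):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def transition_matrix(stage, rain, alpha, beta_stage, beta_rain):
    """Two-state transition matrix whose 'stay' (diagonal) probabilities depend
    on the covariates through a logistic link. Positive betas reproduce the
    sign pattern reported in Section 3 (diagonal probabilities increase with
    the covariates); this is an illustrative parameterisation, not the paper's."""
    p00 = logistic(alpha[0] + beta_stage[0] * stage + beta_rain[0] * rain)
    p11 = logistic(alpha[1] + beta_stage[1] * stage + beta_rain[1] * rain)
    return [[p00, 1.0 - p00], [1.0 - p11, p11]]

P_dry = transition_matrix(stage=20.0, rain=0.0,
                          alpha=[0.5, 0.2], beta_stage=[0.02, 0.03], beta_rain=[0.2, 0.3])
P_wet = transition_matrix(stage=40.0, rain=5.0,
                          alpha=[0.5, 0.2], beta_stage=[0.02, 0.03], beta_rain=[0.2, 0.3])
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_dry + P_wet)
# higher stage/rain -> larger diagonal (stay) probabilities, as in Section 3
assert P_wet[0][0] > P_dry[0][0] and P_wet[1][1] > P_dry[1][1]
```

Each row sums to one by construction, so the chain remains a valid (time-varying) Markov chain for any covariate values.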

Because of the multimodal posterior density, an efficient simulation-based evolutionary Monte Carlo (EMC) method is developed to better traverse the posterior surface and, thus, fit the actual data and classify temporally correlated observations into a few homogeneous groups. EMC is a Markov chain Monte Carlo method which processes a population of chains in parallel, exchanging information with one another. An advanced EMC algorithm is proposed here for Bayesian inference and model choice. This original EMC algorithm and its application to MSARMs represent a further methodological contribution of the paper. We introduce novel random walk crossover operators and make the EMC algorithm more efficient by flattening only the likelihood, rather than the posterior, as is common practice. Thus, the same algorithm can be run for both model choice and parameter estimation (including the fitted values, the hidden states, and the missing values).

#### 3 Results

The Bayesian analysis was developed in two steps: model choice and parameter estimation. The choice of the best model among the many available, which differed in the number of states of the hidden Markov chain (*m*) and the autoregressive order (*p*), was performed by computing the logarithms of the marginal likelihoods by EMC via the power posteriors, for *m* = 1,...,4 and *p* = 0,...,6. The best model was characterized by three hidden states (*m* = 3) and autoregressions of the fourth order (*p* = 4).

Given the dimensions of the model and the identifiability constraint, the whole set of parameters was estimated. The estimates show that the covariates in the observed process have positive coefficients, that is, the level of turbidity increases when stage height and/or rainfall increase. On the other hand, the covariates in the hidden process have negative coefficients, that is, the probabilities of state transitions decrease when stage height and/or rainfall increase, while the diagonal probabilities of the transition matrices increase.

The model fit was very satisfactory, as shown by the comparison of actual and fitted values. The model performance was assessed by the root mean square error and the mean absolute error, which are very low. They are 0.164 (2% of the range of the data) and 0.282 (3%), respectively. All observations were within the 99% credibility interval.

This methodology will be generalized and used in a further study based on turbidity observations recorded in several catchments. In fact, a hierarchical linear regression model will be developed in a longitudinal study by taking the different sequences of hidden states and state-dependent parameters from each model associated with any of the several catchments as explanatory variables for the analysis of particulate phosphorus.

#### References

AILLIOT, P., & MONBET, V. 2012. Markov-switching autoregressive models for wind time series. *Environ. Modell. Softw.*, 30, 92–101.

BIRKEL, C., SOULSBY, C., TETZLAFF, D., DUNN, S., & SPEZIA, L. 2012. High-frequency storm event isotope sampling reveals time-variant transit time distributions and influence of diurnal cycles. *Hydrological Processes*, 26, 308–316.

MONBET, V., & AILLIOT, P. 2017. Sparse vector Markov switching autoregressive models. Application to multivariate time series of temperature. *Comput. Statist. Data Anal.*, 108, 40–51.

PAROLI, R., & SPEZIA, L. 2008. Bayesian inference in non-homogeneous Markov mixture of periodic autoregressions with state-dependent exogenous variables. *Comput. Statist. Data Anal.*, 52, 2311–2330.

SPEZIA, L., PAROLI, R., & DELLAPORTAS, P. 2004. Periodic Markov switching autoregressive models for Bayesian analysis and forecasting of air pollution. *Statistical Modeling*, 4, 19–38.

elling non-linear and non-Normal time series by assuming that different autoregressions, each one depending on a hidden state, alternate according to the Markovian regime switching; *ii*) modelling a long-memory process; *iii*) classifying the observations into a small number of homogeneous groups, labelled

Covariates, i.e. stage height and rainfall, were also incorporated into the model through both the hidden Markov chain (the transition probabilities are time-varying and dependent on the two dynamic exogenous variables) and the observed process (the two state-dependent exogenous variables are added to the past observations). Thus, we have time-varying means and autocovariances, and hence, a non-stationary model. Finally, the slow decay of the autocorrelation function is due to both the non-linearity of the series and the automatic recording of the data at a high temporal frequency. Non-linear time series with structural changes produce realizations that appear to have long memory. Given that structural changes can be efficiently described by stochastic regime switching models, we adopted MSARMs to highlight the changes in the turbidity dynamics, classify the observations into a few states, and fit the

Because of the multimodal posterior density an efficient simulation-based evolutionary Monte Carlo (EMC) method is developed to better traverse the posterior surface and, so, fit the actual data and classify temporal correlated observations into a few homogeneous groups. EMC is a Markov chain Monte Carlo method which processes a population of chains in parallel, exchanging information one another. An advanced EMC algorithm is proposed here for Bayesian inference and model choice. This original EMC algorithm and its application to MSARMs represent a further methodological contribution of the paper. We introduce novel random walk crossover operators and made the EMC algorithm more efficient by flattening the likelihood only, and not the posterior, as in common practice. Thus, the same algorithm can be run for both model choice and parameter estimation (including the fitted values, the

The Bayesian analysis was developed in two steps: model choice and parameter estimation. The choice of the best model among the many available which differed for the number of states of the hidden Markov chain (*m*) and the autoregressive order (*p*), was performed computing the logarithms of the marginal likelihoods by EMC via the power posteriors, for any *m* = 1,...,4

as the regimes of the Markov chain.

long memory process of the turbidity dynamics.

hidden states, and the missing values).

3 Results
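The two fit metrics reported for the turbidity model can be reproduced from any vector of observed and fitted values. The sketch below (with invented numbers, not the paper's data) shows the standard definitions, including each error expressed as a percentage of the data range:

```python
import numpy as np

def fit_metrics(observed, fitted):
    """Root mean square error, mean absolute error, and both as % of the data range."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    resid = observed - fitted
    rmse = np.sqrt(np.mean(resid ** 2))
    mae = np.mean(np.abs(resid))
    data_range = observed.max() - observed.min()
    return rmse, mae, 100 * rmse / data_range, 100 * mae / data_range

# Illustrative data only (not the turbidity series used in the paper).
obs = np.array([1.2, 3.4, 5.6, 7.8, 9.1])
fit = np.array([1.0, 3.6, 5.5, 8.0, 9.0])
rmse, mae, rmse_pct, mae_pct = fit_metrics(obs, fit)
```

Note that, by construction, the RMSE is always at least as large as the MAE for the same set of residuals.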


### DETECTING THE EFFECT OF SECONDARY SCHOOL IN HIGHER EDUCATION UNIVERSITY CHOICES \*


Mariano Porcu<sup>1</sup>, Isabella Sulis<sup>1</sup> and Cristian Usala<sup>1</sup>

<sup>1</sup> Department of Political and Social Sciences, University of Cagliari (e-mail: mrporcu@unica.it, isulis@unica.it, cristian.usala@unica.it)

#### ABSTRACT:

The paper investigates the relationship between students' university choices and their secondary school background. The main aim is to assess the effect that secondary schools have in orienting university applications toward local or non local institutions, also in the light of the tertiary education supply in students' area of residence. To this end, four typologies of students have been identified and a multilevel model has been adopted to jointly consider the secondary school effect on the probability of belonging to one specific category, conditional upon students' subject of study and the characteristics of their local areas. Moreover, we provide a robust definition of local and non local universities by adopting multiple classification criteria and by taking into account the uncertainty in the definition of the catchment areas.

KEYWORDS: higher education, mobility choices, multilevel model, secondary school, uncertainty

### 1 Introduction

In the last decade there has been an increasing interest in Italian students' mobility choices for university studies, as a phenomenon which mirrors the inequalities in socioeconomic conditions between origin and destination areas and contributes to widening the already sharp disparities existing in the country (see Ciriaci, 2014; D'Agostino *et al.*, 2019; Attanasio & Enea, 2019). Despite the similar contexts, the literature is characterized by a high level of heterogeneity in the definition of local or non local universities, and consequently in the classification of students as *movers* or *stayers*, and in the methods used to account for distances between origin and destination places, with studies which mainly focus on the mobility between macro-geographic areas, with emphasis on the South-North trajectory, and others which also investigate the mobility patterns within macro-geographic areas. Starting from this evidence, our contribution is twofold. First, we investigate how secondary school background affects students' preferences towards local or non local universities. To our knowledge, this is the first attempt to use data on Italian students to shed light on the role that secondary schools have in students' location decision process. Our second contribution is to provide a robust definition of local and non local university choices by using multiple criteria based on students' travelled distance, the supply of education services in their local area and the uncertainty in the assignment of the local catchment area to each university.

\*This paper has been supported by the Italian Ministerial grant PRIN 2017 "From high school to job placement: micro-data life course analysis of university student mobility and its impact on the Italian North-South divide.", n. 2017HBTK5P - CUP B78D19000180001.

#### 2 Data and Methods


Our analysis relies on administrative data collected from the Italian National Student Archive (NSA)\* and on the open database of the Italian Ministry of University and Research (MUR). We consider all Italian high-school leavers enrolled in a bachelor's programme at an Italian university between a.y. 2016/2017 and a.y. 2018/2019. We define our dataset according to two rules. First, we do not consider students enrolled in programmes accessible through a national entry test, since their choices are likely to depend on their ranking position rather than on their preferences. Secondly, since we have information on high schools only from a.y. 2016/2017 onwards, we retain in our sample only the students that left their high school after 2015. Thus, starting from a population of 815,614 pupils, our data consist of 700,024 students, cross-classified in 5,887 secondary schools and 297 university-city pairs.
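The two selection rules amount to a simple filter on the student archive. A minimal pandas sketch, with hypothetical column names (`national_entry_test`, `diploma_year`) since the actual NSA/MUR schema is not described in the abstract:

```python
import pandas as pd

# Toy stand-in for the student archive; column names are illustrative only.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "national_entry_test": [False, True, False, False],  # e.g. medicine programmes
    "diploma_year": [2016, 2017, 2014, 2018],
})

# Rule 1: drop programmes with a national entry test (choice driven by ranking).
# Rule 2: keep only students who left high school after 2015.
sample = students[
    (~students["national_entry_test"]) & (students["diploma_year"] > 2015)
]
```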

Students' university choices are classified depending on: (i) the tertiary education supply in their local area, (ii) the chosen subject of study and (iii) the minimum travel time needed to reach the nearest university. To this end, we define travel time as the minimum distance by car between two cities, obtained by combining the ISTAT matrices on Italian cities with the data available on Google Maps. Then, we define two thresholds: *duniv*, given by the distance between the student's city and the nearest university, and *dfield*, defined considering only universities providing programmes in the student's field of study. To avoid arbitrary assumptions on these thresholds and to assess the results' sensitivity to the deterministic choice of the cut points, we apply Rubin's rule (Rubin, 1987) to combine the results obtained by using different thresholds. In particular, we generate multiple cut-point values by increasing *duniv* and *dfield* by a random amount of time δ ∈ [30; 90]. Thus, from the students' perspective, we have four categories of university choices: local, forced non local, free non local, and telematic. Universities are classified as 'local' when hosted in a city closer than *duniv* minutes of travel from the student's residence. Non local universities are considered as 'forced' if the chosen university is the nearest one providing a programme in the student's field of study (i.e. located closer than *dfield*), and as 'free' when students exceed both thresholds. The last category refers to students enrolled in distance-learning telematic universities.

\*Data drawn from the Italian "Anagrafe Nazionale della Formazione Superiore" has been processed according to the research project "From high school to the job market: analysis of the university careers and the university North-South mobility" carried out by the University of Palermo (head of the research program), the Italian "Ministero Università e Ricerca", and INVALSI.
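The four-way classification under one random draw of the threshold shift δ can be sketched as follows. Function names, base thresholds and travel times are invented, and the 'nearest university in the field' condition is collapsed into a single *dfield* comparison for illustration:

```python
import random

def classify_choice(travel_time, d_univ, d_field, telematic=False):
    """Assign a student's choice to one of the four categories described above."""
    if telematic:
        return "telematic"
    if travel_time <= d_univ:
        return "local"
    if travel_time <= d_field:
        return "forced non local"   # nearest university offering the field of study
    return "free non local"

# One random shift delta in [30; 90] minutes, as in the multiple-cut-points scheme.
rng = random.Random(0)
delta = rng.uniform(30, 90)
d_univ, d_field = 20 + delta, 45 + delta  # illustrative base thresholds, in minutes

choices = [classify_choice(t, d_univ, d_field) for t in (15, 115, 300)]
```

Repeating the classification over several draws of δ and pooling the model estimates with Rubin's rules yields the sensitivity assessment described above.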

The effect of secondary school background on students' university choices is estimated by specifying two cross-classified multinomial logit models which consider (a) the cross-classification of students in secondary schools and university-city pairs and (b) the cross-classification of students in secondary schools and disciplinary fields. To take into account the several curricula offered by secondary schools, we define the first level of clustering as the interaction between the high schools and the type of curricula offered. Moreover, we account for students' choice determinants by controlling for students' gender, macro area of residence, diploma grade, year of enrollment, years of delay in finishing high school and an indicator that takes value 1 if the student has attended a lyceum.

#### 3 Results and Discussion

Table 1 reports the results for model (b), which considers the clustering of students according to secondary school-curricula combinations and students' field of study, with the δ parameter set equal to 60. The results concerning the variance of the random terms suggest a clear effect of schools on students' choices to attend local or non local universities when accounting for differences in their field of study. Indeed, the variability of the high school-curricula effect is relevant in all the categories. The posterior predictions regarding the school random terms provide evidence of the role that schools have in orienting students' choices towards local, non local and telematic universities. Moreover, the results concerning the control variables in the fixed-effect component show that students' educational background and socioeconomic characteristics affect the probability of making different choices in terms of selection of local and non local universities.

Table 1. *Cross-Classified Multinomial Logit*

|                                | Forced Non Local   | Free Non Local    | Telematic         |
|--------------------------------|--------------------|-------------------|-------------------|
| Constant                       | -4.079             | -4.214            | -1.678            |
|                                | [-4.198; -3.935]   | [-4.359; -4.005]  | [-2.199; -1.185]  |
| Controls                       | Yes                | Yes               | Yes               |
| *Random effect parameters:*    |                    |                   |                   |
| High School × Curriculum       | 5.379              | 2.008             | 1.637             |
|                                | [5.157; 5.611]     | [1.928; 2.096]    | [1.522; 1.758]    |
| Field of study                 | 2.694              | 0.659             | 10.31             |
|                                | [1.320; 5.263]     | [0.319; 1.337]    | [4.070; 23.962]   |
| Observations                   | 700068             |                   |                   |

In conclusion, the approach proposed in this work allowed us to assess the effect that secondary schools have in orienting university applications toward local or non local institutions by accounting for the choice of the disciplinary field. Further analyses are still in progress to take into account the uncertainty related to a deterministic definition of the δ parameter. To this end, multiple values of δ have been generated to assess the results' sensitivity to the choice of the cut points. This uncertainty in the threshold definitions is taken into account by using Rubin's rules to combine the results.

#### References

ATTANASIO, M., & ENEA, M. 2019. La mobilità degli studenti universitari nell'ultimo decennio in Italia. *Pages 43–58 of:* DE SANTIS, G., PIRANI, E., & PORCU, M. (eds), *Rapporto sulla popolazione. L'istruzione in Italia*. Bologna: Il Mulino.

CIRIACI, D. 2014. Does University Quality Influence the Interregional Mobility of Students and Graduates? The Case of Italy. *Regional Studies*, 48(10), 1592–1608.

D'AGOSTINO, A., GHELLINI, G., & LONGOBARDI, S. 2019. Out-migration of university enrolment: the mobility behaviour of Italian students. *International Journal of Manpower*, 40, 56–72.

DATABASE MOBYSU.IT [MOBILITÀ DEGLI STUDI UNIVERSITARI IN ITALIA]. Research protocol MUR - Universities of Cagliari, Palermo, Siena, Torino, Sassari, Firenze, Cattolica and Napoli Federico II. Scientific Coordinator Massimo Attanasio (UNIPA). Data Source ANS-MUR/CINECA.

RUBIN, D. B. 1987. *Multiple Imputation for Nonresponse in Surveys*. Wiley, New York.


### SEMI-CONSTRAINED MODEL-BASED CLUSTERING OF MIXED-TYPE DATA USING A COMPOSITE LIKELIHOOD APPROACH


Roberto Rocci1 and Monia Ranalli2

<sup>1</sup> Department of Statistical Sciences, Sapienza University of Rome (roberto.rocci@uniroma1.it)

<sup>2</sup> Department of Statistical Sciences, Sapienza University of Rome (monia.ranalli@uniroma1.it)

ABSTRACT: We propose a class of semi-constrained models for clustering ordinal and continuous data. Ordinal variables are assumed to be a discretization of some latent continuous variables jointly distributed with the observed continuous variables as a finite mixture of Gaussians. Parsimonious modeling is obtained by reparameterizing the covariance matrices in terms of factor analysis models semi-constrained across the components. Parameter estimation is carried out using an EM-type algorithm to maximize a composite log-likelihood. The proposal is evaluated through a simulation study and an application to real data.

KEYWORDS: mixture models, factor analyzers, composite likelihood, EM algorithm, mixed-type data

#### 1 Introduction

Complex data structures are characterized by the presence of heterogeneity and a large number of features of mixed type, i.e. ordinal and continuous. To capture heterogeneity, clustering methods are used to find subgroups in the population. The literature has been mainly developed for continuous variables, with distance-based methods (e.g. *k*-means, Ward) or model-based ones. Among the latter, finite Gaussian mixture models are the most commonly used (Hennig *et al.*, 2015). In order to reduce the large number of parameters caused by the high dimensionality of the data, parsimonious modelling is needed, as in factor analysis. The challenge in modelling ordinal data is mainly due to the lack of metric properties. Ordinal variables can be modeled properly by adopting the underlying variable approach (Jöreskog, 1990), where the ordinal variables are assumed to be generated by thresholding some latent continuous variables. This allows us to model the dependence between ordinal and continuous variables by modeling the dependence between the latent and the observed continuous variables.

Taking these aspects into account, i.e. heterogeneity, high dimensionality and mixed-type data, we propose a Gaussian mixture model with a factor decomposition of the component-specific covariance matrices. Parameters may be constrained to be equal or unequal across mixture components (McNicholas & Murphy, 2010), obtaining different degrees of parsimony. The ordinal variables correspond to some variates of the mixture that are partially observed through a discretization (see e.g. Ranalli & Rocci, 2017).
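The factor decomposition with equality constraints can be illustrated numerically. The sketch below builds component covariance matrices Σ_g = Λ_g Λ_g' + Ψ_g under the fully constrained case (common loadings and uniquenesses) versus the fully unconstrained case; it is a schematic illustration of the parsimony idea, not the authors' code, and the dimensions are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
P, K, G = 6, 2, 3  # observed variables, latent factors, mixture components

def component_covariances(constrain_loadings, constrain_uniqueness):
    """Sigma_g = Lambda_g Lambda_g' + Psi_g, sharing parameters when constrained."""
    shared_lam = rng.normal(size=(P, K))
    shared_psi = np.diag(rng.uniform(0.5, 1.5, size=P))
    sigmas = []
    for g in range(G):
        lam = shared_lam if constrain_loadings else rng.normal(size=(P, K))
        psi = shared_psi if constrain_uniqueness else np.diag(rng.uniform(0.5, 1.5, size=P))
        sigmas.append(lam @ lam.T + psi)
    return sigmas

# Fully constrained: a single set of PK + P covariance parameters for all components.
# Unconstrained: each component carries its own PK + P covariance parameters.
constrained = component_covariances(True, True)
unconstrained = component_covariances(False, False)
```

Intermediate choices (constraining only the loadings, or only the uniquenesses) generate the other members of the parsimonious family.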

Inference could be carried out through the likelihood function. However, the presence of ordinal variables requires the computation of many high dimensional integrals. This makes the evaluation of the likelihood computationally demanding, or prohibitive, as the number of ordinal variables increases. To solve the problem, the likelihood is replaced with a surrogate function, that is the composite likelihood, defined as the product of *m*-dimensional marginals or conditional events (Lindsay, 1988). Under some regularity conditions the corresponding estimators are consistent, asymptotically unbiased and normally distributed (see Ranalli & Rocci, 2017, and references therein). In general they are less efficient than the maximum likelihood estimators, but much more efficient in terms of computational complexity. In the current work, the composite likelihood is based on the product of all possible sub-likelihoods composed of two ordinal and all continuous variables. The computation of parameter estimates is carried out through an EM-type algorithm.
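Each term of the composite likelihood involves, for a pair of ordinal variables, the probability that a bivariate Gaussian falls in the rectangle delimited by the thresholds. A hedged sketch of one such rectangle probability using SciPy (our own notation; the thresholds, mean and covariance are invented):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rectangle_prob(lo, hi, mean, cov):
    """P(lo1 < Y1 <= hi1, lo2 < Y2 <= hi2) for a bivariate normal, by
    inclusion-exclusion on the joint CDF. Infinite thresholds are clipped."""
    lo = np.clip(lo, -8.0, 8.0)  # +/- 8 sd is numerically +/- infinity here
    hi = np.clip(hi, -8.0, 8.0)
    F = lambda a, b: multivariate_normal.cdf([a, b], mean=mean, cov=cov)
    return F(hi[0], hi[1]) - F(lo[0], hi[1]) - F(hi[0], lo[1]) + F(lo[0], lo[1])

# Invented thresholds for two ordinal variables with 3 categories each.
g1 = np.array([-np.inf, -0.5, 0.7, np.inf])
g2 = np.array([-np.inf, 0.0, 1.2, np.inf])
mean, cov = [0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]]

# Pairwise likelihood contribution of observing categories (c1, c2), 1-based labels.
c1, c2 = 2, 1
p = rectangle_prob((g1[c1 - 1], g2[c2 - 1]), (g1[c1], g2[c2]), mean, cov)
loglik_term = np.log(p)
```

Summing such log-terms over all variable pairs (and integrating over the mixture components) gives the composite log-likelihood maximized by the EM-type algorithm.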

A simulation study as well as a real data analysis is presented in the extended version of the paper.

#### 2 Model


Let $\mathbf{y}_{\bar{O}} = [y_1, \dots, y_{\bar{O}}]$ and $\mathbf{x} = [x_{\bar{O}+1}, \dots, x_P]$ be $\bar{O}$ continuous variables and $O = P - \bar{O}$ ordinal variables, respectively. Each ordinal variable has associated categories $c_i = 1, \dots, C_i$ with $i = \bar{O}+1, \dots, P$. Following the underlying response variable approach, the observed ordinal variables $\mathbf{x}$ are considered as a discretization of some continuous latent variables $\mathbf{y}_O = [y_{\bar{O}+1}, \dots, y_P]$. The relationship between $\mathbf{x}$ and $\mathbf{y}_O$ is

$$
\gamma_{c_i-1}^{(i)} \le y_i < \gamma_{c_i}^{(i)} \Leftrightarrow x_i = c_i,
$$

where $-\infty = \gamma_0^{(i)} < \gamma_1^{(i)} < \dots < \gamma_{C_i-1}^{(i)} < \gamma_{C_i}^{(i)} = +\infty$ are the non-observable thresholds defining the $C_i$ categories. In our proposal $\mathbf{y} = [\mathbf{y}_{\bar{O}}, \mathbf{y}_O]$ follows a finite mixture of factor analyzers (McNicholas & Murphy, 2010)

$$
f(\mathbf{y}) = \sum_{g=1}^{G} p_g\, \phi(\mathbf{y};\, \boldsymbol{\mu}_g,\, \boldsymbol{\Lambda}_g \boldsymbol{\Lambda}_g' + \boldsymbol{\Psi}_g),
$$

where $\phi$ is the multivariate normal density, $\boldsymbol{\Lambda}_g$ is the $P \times K$ matrix of factor loadings, and $\boldsymbol{\Psi}_g$ is the diagonal matrix of uniquenesses, which may be assumed to be of the isotropic form $\psi_g \mathbf{I}$. Each term may be constrained to be equal or unequal across mixture components. The result of imposing, or not, such constraints generates the family of eight parsimonious Gaussian mixture models, described in Table 1 ($\boldsymbol{\Lambda}_g$ type C and U), introduced by McNicholas & Murphy (2010) in the context of continuous data. Each member of this family has a number of covariance parameters that grows linearly with the data dimensionality. By assuming a common covariance structure, even more parsimonious models are obtained. Some identifiability constraints are imposed on thresholds and factor loadings; they are not discussed here for the sake of space.

Table 1: The covariance structure of parsimonious Gaussian mixture models with a constrained (C), semi-constrained (S) or unconstrained (U) factor loadings matrix.

With respect to the proposal of McNicholas & Murphy (2010), we introduce four semi-constrained models to add some extra flexibility with a certain degree of parsimony (see Table 1, $\boldsymbol{\Lambda}_g$ type S). The flexibility is achieved by assuming that the matrix of factor loadings can be written in the form $\boldsymbol{\Lambda}_g = \boldsymbol{\Lambda}\mathbf{L}_g$, where $\mathbf{L}_g$ is a positive definite diagonal matrix of factor saliences. These models can be considered as intermediate cases between the first and the last four models of Table 1. The latent factors in each cluster are the same but with different variances, recorded by the matrices $\mathbf{L}_g^2$. This is a particular form of factorial invariance, first introduced by Cattell (1944) and then developed by several authors, e.g. in the context of three-way analysis (see Giordani *et al.*, 2020, and references therein). A nice feature of the semi-constrained models is that, under mild conditions, the factors are unique. In other words, it is not possible to rotate the factors as in the classical factor analysis model.

Our proposal has been tested by a simulation study and an application to real data, not shown here for the sake of space. In the first, the effectiveness of the composite likelihood approach has been investigated under various settings, such as different numbers of observations, groups and latent factors, in terms of estimate precision and ability to recover the true partition. In the second, the model has been used to find latent groups in a dataset taken from the survey on academic graduates' vocational integration carried out by ISTAT in 2015.

#### References

CATTELL, RAYMOND B. 1944. "Parallel proportional profiles" and other principles for determining the choice of factors by rotation. *Psychometrika*, 9(4), 267–283.

GIORDANI, PAOLO, ROCCI, ROBERTO, & BOVE, GIUSEPPE. 2020. Factor Uniqueness of the Structural Parafac Model. *Psychometrika*, 85(3), 555–574.

HENNIG, CHRISTIAN, MEILA, MARINA, MURTAGH, FIONN, & ROCCI, ROBERTO. 2015. *Handbook of cluster analysis*. CRC Press.

JÖRESKOG, KARL G. 1990. New developments in LISREL: analysis of ordinal variables using polychoric correlations and weighted least squares. *Quality and Quantity*, 24(4), 387–404.

LINDSAY, BRUCE. 1988. Composite likelihood methods. *Contemporary Mathematics*, 80, 221–239.

MCNICHOLAS, PAUL D., & MURPHY, THOMAS BRENDAN. 2010. Model-based clustering of microarray expression data via latent Gaussian mixture models. *Bioinformatics*, 26(21), 2705–2712.

RANALLI, MONIA, & ROCCI, ROBERTO. 2017. Mixture models for mixed-type data through a composite likelihood approach. *Computational Statistics & Data Analysis*, 110(C), 87–102.

#### References

finite mixture of factor analyzers (McNicholas & Murphy, 2010)

*G* ∑ *g*=1

*pg*φ(µ*g*,Λ*g*Λ

where φ is the multivariate normal density, Λ*<sup>g</sup>* is the *P* × *K* matrix of factor loadings, and Ψ*<sup>g</sup>* is the diagonal matrix of uniqueness that could be assumed of the isotropic form ψ*g*I. Each term may be constrained to be equal or un-

Table 1: The covariance structure of parsimonious Gaussian mixture models with a constrained (C), semiconstrained (S) or unconstrained (U) factor load-

> Model ID Λ*<sup>g</sup>* Ψ*<sup>g</sup>* Isotropic Σ*<sup>g</sup>* CCC C C C ΛΛ

CCU C C U ΛΛ

CUC C U C ΛΛ

CUU C U U ΛΛ

SCC S C C ΛL<sup>2</sup>

SCU S C U ΛL<sup>2</sup>

SUC S U C ΛL<sup>2</sup>

SUU S U U ΛL<sup>2</sup>

UCC U C C Λ*g*Λ

UCU U C U Λ*g*Λ

UUC U U C Λ*g*Λ

UUU U U U Λ*g*Λ

equal across mixture components. The result of imposing, or not, such constraints generates the family of eight parsimonious Gaussian mixture models, described in Table 1, Λ*<sup>g</sup>* type C and U, and introduced by McNicholas & Murphy (2010) in the context of continuous data. Each member of this family has a number of covariance parameters that grows linearly with the data dimensionality. By assuming a common covariance structure, even more parsimonious models are obtained. Some identifiability constraints are imposed on thresholds and factor loadings. They are not discussed here for sake of space.
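To make the covariance structures of Table 1 concrete, the following sketch (our own illustrative Python with hypothetical names, not the authors' code) assembles the component covariance Σ*g* from a three-letter model ID:

```python
import numpy as np

# Illustrative sketch (not the authors' code): build the component covariance
# Sigma_g of Table 1 from a three-letter model ID.
# 1st letter: loadings (C: common Lambda; S: semi-constrained Lambda @ L_g;
#             U: component-specific Lambda_g);
# 2nd letter: uniquenesses equal (C) or free (U) across components;
# 3rd letter: isotropic (C: psi * I_P) or general diagonal (U).
def sigma_g(model_id, P, Lambda, Lambda_g, L_g, psi, psi_g, Psi, Psi_g):
    lam_t, uniq_t, iso_t = model_id          # e.g. "SUC" -> ("S", "U", "C")
    if lam_t == "C":
        Lam = Lambda
    elif lam_t == "S":
        Lam = Lambda @ np.diag(L_g)          # Lambda_g = Lambda L_g
    else:
        Lam = Lambda_g
    if iso_t == "C":
        noise = (psi if uniq_t == "C" else psi_g) * np.eye(P)
    else:
        noise = np.diag(Psi if uniq_t == "C" else Psi_g)
    return Lam @ Lam.T + noise
```

For instance, `sigma_g("SUU", ...)` reproduces ΛL²*g*Λ′ + Ψ*g*; since only Λ (of size *P* × *K*) and diagonal terms are free, the covariance parameter count grows linearly in *P*.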

With respect to the proposal of McNicholas & Murphy (2010), we introduce four semi-constrained models that add extra flexibility while retaining a certain degree of parsimony (see Table 1, Λ*g* of type S). The flexibility is achieved by assuming that the matrix of factor loadings can be written in the form Λ*g* = ΛL*g*, where L*g* is a positive definite diagonal matrix of factor saliences. These models can be considered as intermediate cases between the first and the last four models of Table 1: the latent factors are the same in each cluster, but with different variances recorded by the matrices L²*g*. This is a particular form of factorial invariance, first introduced by Cattell (1944) and then developed by several authors, e.g. in the context of three-way analysis (see Giordani *et al.*, 2020, and references therein). A nice feature of the semi-constrained models is that, under mild conditions, the factors are unique; in other terms, it is not possible to rotate the factors as in the classical factor analysis model.

Our proposal has been tested through a simulation study and an application to real data, not shown here for the sake of space. In the first, the effectiveness of the composite likelihood approach was investigated under various settings (different numbers of observations, groups and latent factors) in terms of estimation precision and ability to recover the true partition. In the second, the model was used to find latent groups in a dataset taken from the survey on academic graduates' vocational integration carried out by ISTAT in 2015.

#### References

CATTELL, RAYMOND B. 1944. "Parallel proportional profiles" and other principles for determining the choice of factors by rotation. *Psychometrika*, 9(4), 267–283.

GIORDANI, PAOLO, ROCCI, ROBERTO, & BOVE, GIUSEPPE. 2020. Factor Uniqueness of the Structural Parafac Model. *Psychometrika*, 85(3), 555–574.

HENNIG, CHRISTIAN, MEILA, MARINA, MURTAGH, FIONN, & ROCCI, ROBERTO. 2015. *Handbook of Cluster Analysis*. CRC Press.

JÖRESKOG, KARL G. 1990. New developments in LISREL: analysis of ordinal variables using polychoric correlations and weighted least squares. *Quality and Quantity*, 24(4), 387–404.

LINDSAY, BRUCE. 1988. Composite likelihood methods. *Contemporary Mathematics*, 80, 221–239.

MCNICHOLAS, PAUL D., & MURPHY, THOMAS BRENDAN. 2010. Model-based clustering of microarray expression data via latent Gaussian mixture models. *Bioinformatics*, 26(21), 2705–2712.

RANALLI, MONIA, & ROCCI, ROBERTO. 2017. Mixture models for mixed-type data through a composite likelihood approach. *Computational Statistics & Data Analysis*, 110(C), 87–102.


## ANTIBODIES TO SARS-COV-2: AN EXPLORATORY ANALYSIS CARRIED OUT THROUGH THE BAYESIAN PROFILE REGRESSION


Annalina Sarra<sup>1</sup>, Adelia Evangelista<sup>1</sup>, Tonio Di Battista<sup>1</sup> and Damiana Pieragostino<sup>2</sup>

<sup>1</sup> Department of Philosophical, Pedagogical and Economic-Quantitative Sciences, University "G. d'Annunzio" of Chieti-Pescara, (e-mail: annalina.sarra@unich.it, adelia.evangelista@unich.it, tonio.dibattista@unich.it)

<sup>2</sup> Center for Advanced Studies and Technology (CAST), University "G. d'Annunzio" of Chieti-Pescara, (e-mail: damiana.pieragostino@unich.it)

ABSTRACT: In this paper we aim to characterize the immune response to SARS-CoV-2 in a cohort study of individuals who received the first dose of the Pfizer or AstraZeneca vaccine. To examine some covariate-related effects on anti-S1 spike IgG levels, we adopted a statistical technique known as Bayesian Profile Regression (BPR). BPR explores the link between a response variable and a set of associated covariate data through cluster membership and supervises the clustering assignment in a unified fashion. In our study, this methodology allowed us to identify three clusters, differentiated according to the antibody titer of respondents, and to draw the profile of subjects whose amount of antibodies produced is significantly higher.

KEYWORDS: Bayesian Profile Regression, SARS-CoV-2, anti-S1 spike IgG levels, cluster profile

#### 1 Introduction

To impede the progress of the COVID-19 pandemic, the scientific world has raced to identify and understand the immune response to SARS-CoV-2 infection. Many efforts have been directed towards the development of vaccines to curtail the novel coronavirus. Among the COVID-19 vaccines currently authorized in the EU, with greater than 90% efficacy in reducing the risk of symptomatic infection, are Pfizer/BioNTech and AstraZeneca. So far, post-infection immunity to SARS-CoV-2 is still unclear, and much work needs to be carried out to characterize the immune response to the virus. This knowledge is crucial to give insights into the disease pathogenesis and into the usefulness of bridge therapies. In this study, we analysed anti-S1 spike IgG levels in a cohort of 89 individuals: 39 received one dose of the Pfizer vaccine and 50 one dose of the AstraZeneca vaccine. In order to identify the main covariates associated with immunoglobulin antibodies, we followed an analytic strategy based on Bayesian Profile Regression (Molitor *et al.*, 2010), conceived as a nonparametric dimension reduction technique, set in a Bayesian framework, for clustering responses and covariates simultaneously. The remainder of this paper proceeds as follows. In Section 2, we provide details of the theoretical background of the Bayesian Profile Regression technique. Section 3 describes the available data, and the main results of the statistical analysis are presented in Section 4.

#### 2 Bayesian Profile Regression


Bayesian Profile Regression (BPR) is a Bayesian dimension reduction and clustering technique for jointly modeling an outcome variable and a number of potentially correlated predictors (Molitor *et al.*, 2010). The technique links a response variable nonparametrically to covariate data through cluster membership, so that the outcome and the clusters mutually inform each other (Hastie *et al.*, 2013). To deal with these joint effects, the BPR approach adopts as unit of inference a profile, formed from a sequence of covariate values. In what follows, for each unit *i*, *yi* denotes the outcome of interest, while **x***i* = (*xi*<sup>1</sup>, ..., *xiP*) represents the covariate profile, consisting of the *P* covariates we are interested in studying. Additionally, **w***i* are the fixed effects, which are constrained to have only a global (i.e., non-cluster-specific) effect on the response *yi*, and φ*<sup>p</sup>c*(*xip*) indicates the probability that the *p*-th variable in cluster *c* is equal to *xip*. The model of interest here can be described by two key components: a covariate model, which assigns individual profiles to clusters, and a response model, which links clusters of profiles to an outcome of interest via a regression model. The full data are then jointly modelled, leading to the following likelihood

$$f(\mathbf{x}_i, y_i \mid \boldsymbol{\theta}_{z_i}, \mathbf{w}_i, \boldsymbol{\psi}) = \sum_c \psi_c \, f(\mathbf{x}_i \mid z_i = c, \boldsymbol{\phi}_c) \, f(y_i \mid z_i = c, \boldsymbol{\theta}_c, \boldsymbol{\Lambda}, \mathbf{w}_i) \tag{1}$$

where *zi* = *c* is the allocation variable indicating the cluster to which unit *i* belongs, Λ is a vector of global (i.e., non-cluster-specific) parameters and, finally, the ψ*c* are the mixture weights. The mixture weights, corresponding to a maximum of *C* clusters and denoted ψ*c*, *c* = 1, ..., *C*, are modeled according to a "stick-breaking" representation of a Dirichlet process prior. Owing to the complexity of the model, inference is facilitated by Markov chain Monte Carlo (MCMC) methods. A detailed description of BPR can be found in Molitor *et al.* (2010).
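The truncated stick-breaking construction of the weights ψ*c* can be sketched as follows (an illustrative Python fragment under our own naming; the PReMiuM software performs this internally within its MCMC sampler):

```python
import numpy as np

# Truncated stick-breaking representation of a Dirichlet process prior:
# V_c ~ Beta(1, alpha); psi_c = V_c * prod_{l < c} (1 - V_l), c = 1, ..., C.
def stick_breaking_weights(alpha, C, seed=None):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=C)
    v[C - 1] = 1.0                  # truncation: last stick absorbs the rest
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick_left           # weights psi_1, ..., psi_C
```

Setting the last stick proportion to one makes the truncated weights sum exactly to one; larger concentration parameters `alpha` spread the mass over more clusters.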

### 3 Data

In this paper, we refer to a longitudinal study carried out by the Center for Advanced Studies and Technology (CAST) of the University "G. d'Annunzio" of Chieti-Pescara (Italy). The study examined the antibody response of 89 individuals who received the first dose of the Pfizer or AstraZeneca vaccine. IgG antibodies to SARS-CoV-2 were measured by a fully automated solid-phase DELFIA (time-resolved fluorescence) immunoassay in a few drops of blood collected by finger prick and left to dry on a filter paper card. Subjects involved in the analysis were recalled 7, 10 and 15 days after the first injection of the vaccine for re-determination of IgG levels, which were recoded according to their quartiles. All participants were also surveyed regarding post-vaccination symptoms, recording the presence (coded 1) or absence (coded 0) of distinct symptom types: fatigue, headache, chills, muscle pain, fever and joint pain. Age, recoded into two classes (20-40 and 40-65 years), and gender (0 = male, 1 = female) were also recorded.
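The recodings described above can be illustrated with a minimal sketch (our own hypothetical Python; the study does not publish code) that maps raw IgG levels to quartile classes and to a binary median split:

```python
import numpy as np

# Illustrative recoding (not the study's code): quartile classes 1-4 for the
# follow-up IgG measurements, plus a binary split at the median cut-off.
def recode_igg(levels):
    levels = np.asarray(levels, dtype=float)
    q1, q2, q3 = np.quantile(levels, [0.25, 0.50, 0.75])
    quartile_class = 1 + (levels > q1).astype(int) \
                       + (levels > q2).astype(int) \
                       + (levels > q3).astype(int)
    above_median = (levels > q2).astype(int)   # median as cut-off
    return quartile_class, above_median
```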

### 4 Main results

The BPR estimation, performed through the R package PReMiuM (Liverani *et al.*, 2015), produced a partition into 3 clusters of the anti-S1 spike IgG levels measured 21 days after injection, recoded using the median as cut-off, together with some potential explanatory variables (IgG levels at the previous time points, type of vaccine, side effects after vaccination, age, gender). Each group is characterized by similar covariate profiles, as well as by similar antibody levels. The posterior distributions of the cluster-specific parameters are represented in Fig. 1. The left panel displays the MCMC posterior draws of the anti-S1 spike IgG levels after 21 days for the identified clusters, while the right panel shows the posterior distributions of the probability that an explanatory variable takes each of its discrete categories across the identified groups. The typical profile of cluster 3 (red boxplot in Fig. 1), associated with the highest amount of antibodies produced, shows a prevalence of people aged 20-40 years who received the first dose of the Pfizer vaccine. Furthermore, the majority of individuals belonging to this group did not experience side effects, and for them we observe a greater amount of anti-S1 spike IgG 10 days after injection. The opposite pattern characterizes the first two clusters, which are associated with a lower immune response.

Figure 1. *Summary plot of the posterior distribution of parameter* φ*c*, *for c* = 1, 2, 3.

#### References

HASTIE, D.I., LIVERANI, S., AZIZI, L., RICHARDSON, S., & STÜCKER, I. 2013. A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer. *BMC Medical Research Methodology*, 13, 129.

LIVERANI, S., HASTIE, D.I., PAPATHOMAS, M., & RICHARDSON, S. 2015. PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. *Spatial and Spatio-temporal Epidemiology*.

MOLITOR, J., PAPATHOMAS, M., JERRETT, M., & RICHARDSON, S. 2010. Bayesian profile regression with an application to the National Survey of Children's Health. *Biostatistics*, 11, 484–498.


## MODELING THREE-WAY RNA SEQUENCING DATA WITH MIXTURES OF MULTIVARIATE POISSON-LOGNORMAL DISTRIBUTIONS


Theresa Scharl<sup>1</sup> and Bettina Grün<sup>2</sup>

<sup>1</sup> Institute of Statistics, University of Natural Resources and Life Sciences, Vienna, (e-mail: Theresa.Scharl@boku.ac.at)

<sup>2</sup> Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, (e-mail: Bettina.Gruen@wu.ac.at)

ABSTRACT: Mixtures of multivariate Poisson-lognormal distributions are used for modeling three-way RNA sequencing data. Taking the three-way structure into account, a range of specifications for the means and the variance-covariance matrices of the latent multivariate normal distribution emerge, leading to a more parsimonious and more interpretable clustering solution. We develop suitable specifications for an RNA sequencing dataset containing several biological units categorized by additional covariates. These include, for example, a regression setup for the means and a time-series structure for the variance-covariance matrices. The models are fitted by maximum likelihood estimation with the expectation-maximization algorithm involving a variational E-step, and their suitability is investigated for a specific RNA sequencing dataset.

KEYWORDS: clustering, EM algorithm, multivariate Poisson-lognormal distribution, RNA sequencing

#### 1 Modeling RNA Sequencing Data

RNA sequencing of time-course experiments leads to three-way count data where the dimensions are the genes, the time points and the biological units. Cluster analysis is used to group genes according to their expression levels, taking the development over time and across the biological units into account. Model-based clustering methods make it possible to embed the clustering problem within a statistical framework, and the mixture models used may be adapted in a flexible way to the data structure and clustering aims by specifying suitable models for the components of the mixture.

The Poisson distribution is an obvious choice for modeling count data. However, assuming independence between the time points and/or the biological units might be questionable. Silva *et al.* (2019) propose to use mixtures of multivariate Poisson-lognormal distributions to account for possible dependency structures via a latent multivariate normal distribution, after transforming the data to a two-way format where the genes form one dimension and the time points and biological units are crossed to form the second dimension. Subedi & Browne (2020) also consider mixtures of multivariate Poisson-lognormal distributions for two-way data but, following Fraley & Raftery (2002), they propose parsimonious specifications of the variance-covariance matrix resulting from its decomposition into volume, shape and orientation. Taking the three-way structure into account, Silva *et al.* (2018) also arrive at a more parsimonious parameterization of the variance-covariance matrix.


In all these contributions, maximum likelihood estimation of the finite mixture model for a fixed number of components is performed, and a suitable model is selected based on information criteria such as the BIC, AIC and ICL. The expectation-maximization (EM) algorithm is used for estimation, with the cluster memberships as well as the latent multivariate normal observations viewed as missing data. The EM algorithm is an iterative procedure in which each iteration consists of an E-step and an M-step. In the E-step, the expectation of the complete-data log-likelihood (which results from combining the observed with the missing data), conditional on the current parameter estimates and the observed data, is determined. In the M-step, the expected complete-data log-likelihood is maximized with respect to the parameters and new parameter estimates are obtained. Each iteration increases the log-likelihood, ensuring that the algorithm converges to a fixed point if the log-likelihood is bounded. For mixtures of multivariate Poisson-lognormal distributions, the M-step is straightforward but the E-step is complicated: Silva *et al.* (2019) and Silva *et al.* (2018) use Bayesian Markov chain Monte Carlo methods to obtain an estimate of the expectation, while Subedi & Browne (2020) propose to use a variational E-step.
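The E- and M-steps can be illustrated on a deliberately simple case, a univariate Poisson mixture, where the E-step is available in closed form (a didactic sketch of our own; the cited papers need MCMC or variational approximations precisely because the multivariate Poisson-lognormal E-step has no closed form):

```python
import numpy as np
from math import lgamma

# Didactic EM for a G-component univariate Poisson mixture (our own sketch,
# not the algorithms of the cited papers).
def log_pois(y, lam):
    return y * np.log(lam) - lam - np.array([lgamma(v + 1.0) for v in y])

def em_poisson_mixture(y, lambdas, weights, n_iter=100):
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    w = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: posterior component memberships given current parameters
        log_r = np.log(w)[None, :] + np.stack([log_pois(y, l) for l in lam], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        nk = r.sum(axis=0)
        w = nk / len(y)
        lam = (r * y[:, None]).sum(axis=0) / nk
    return lam, w
```

Each sweep computes the posterior memberships (E-step) and then the weighted-mean updates (M-step); by the usual EM argument, the observed-data log-likelihood cannot decrease across sweeps.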

#### 2 The Mixture Model for Three-Way Data

Following Silva *et al.* (2018), a finite mixture model of multivariate Poisson-lognormal distributions implies the following data generating process for the observations *yi,jt*, with *i* the gene index, *j* the biological unit index and *t* the time point index:

$$\begin{aligned} S_i &\sim \mathcal{M}(\boldsymbol{\eta}), \\ \Theta_{S_i} \mid S_i &\sim \mathcal{MN}(\mathbf{M}_{S_i}, \mathbf{U}_{S_i}, \mathbf{V}_{S_i}), \\ y_{i,jt} \mid S_i &\sim \mathcal{P}(b_j \exp(\Theta_{S_i, jt})), \end{aligned}$$

where *Si* is the component membership of gene *i*, *M* (η) is the multinomial distribution with success probabilities vector η, *M N* (*MSi* ,*USi* ,*V Si* ) is the matrix normal distribution which is equivalent to

the three biological replicates is used for analysis. Data pre-preprocessing reduces the number of genes by eliminating those which are not differentially expressed. The number of time points will also be reduced by using the first time point as baseline level. The biological units are characterized as wildtype

The mixture model can be fitted in R using the R package PLNmodels (Chiquet *et al.*, 2021a), available from the Comprehensive R Archive Network. The package implements the variant of the EM algorithm using a variational E-step, has been developed for modelling joint species abundances (Chiquet *et al.*, 2021b) and may be extended to cover the model specifications of interest


$$\operatorname{vec}(\boldsymbol{\Theta}_{S_i}) \mid S_i \sim \mathcal{N}\left(\operatorname{vec}(\mathbf{M}_{S_i}),\, \mathbf{U}_{S_i} \otimes \mathbf{V}_{S_i}\right),$$

with $\operatorname{vec}(\cdot)$ the vectorization operator and $\otimes$ the Kronecker product. A suitable constraint needs to be imposed on $\mathbf{U}_{S_i}$ and $\mathbf{V}_{S_i}$ to ensure identifiability. $\mathcal{P}(\lambda)$ is the univariate Poisson distribution with parameter $\lambda$ given by the exponentiated $(i,j)$th element of the latent normal variable $\boldsymbol{\Theta}_{S_i}$ multiplied by a biological-unit-specific offset $b_j$.
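To make the generative scheme concrete, the following numpy sketch simulates a single gene from the mixture. All dimensions, parameter values and the row-major vec convention are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

G = 2           # number of mixture components (illustrative)
r, p = 3, 4     # biological units x time points (illustrative)

eta = np.array([0.6, 0.4])                           # mixing proportions
M = [rng.normal(size=(r, p)) for _ in range(G)]      # component mean matrices
U = [np.eye(r) for _ in range(G)]                    # row (biological unit) covariances
V = [0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
     for _ in range(G)]                              # column (time) covariances
b = np.ones(p)                                       # offsets

# S_i ~ M(eta): draw the component membership of gene i
S_i = rng.choice(G, p=eta)

# Theta_{S_i} | S_i ~ MN(M, U, V), sampled via the vec/Kronecker equivalence:
# with row-major (C-order) vectorization, cov(vec(Theta)) = U ⊗ V
vec_theta = rng.multivariate_normal(M[S_i].ravel(), np.kron(U[S_i], V[S_i]))
theta = vec_theta.reshape(r, p)

# y_ij | S_i ~ Poisson(b_j * exp(theta_ij))
y = rng.poisson(b * np.exp(theta))
print(y.shape)  # (3, 4)
```

The matrix normal draw exploits exactly the vec representation given above, so the sketch doubles as a numerical check of that equivalence.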

Taking the specific three-way data structure into account, the following model specifications might be considered:

(a) The mean matrix $\mathbf{M}$ for each component has dimension number of biological units times number of time points. Assuming additive biological unit and time point effects, this mean matrix would be given by

$$\mathbf{M} = \boldsymbol{\alpha} \otimes \boldsymbol{\beta},$$

where $\boldsymbol{\alpha}$ are the mean biological unit effects and $\boldsymbol{\beta}$ are the mean time point effects. Additional interaction effects would indicate the need for a general $\mathbf{M}$.

(b) A more parsimonious specification of the mean vectors is possible if covariates are available to characterize the biological units using a regression model.

(c) The variance-covariance matrix $\mathbf{V}$ capturing time dependence could be specified in a more parsimonious way by assuming, for example, an underlying auto-regressive process, e.g., an AR(1) process.

(d) Assuming a correlation between the biological units might be questionable, and the identity matrix could be specified for $\mathbf{U}$.

(e) Inspired by Fraley & Raftery (2002), different sets of parameters might either be assumed to be group-specific or the same across groups, thus allowing for a more parsimonious specification and easier interpretation of the fitted model.
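As a minimal illustration of specification (c), an AR(1) working covariance for the time dimension is fully determined by two parameters instead of $p(p+1)/2$ free entries; the parameter values below are arbitrary:

```python
import numpy as np

def ar1_cov(p, rho, sigma2=1.0):
    """AR(1) covariance matrix: sigma2 * rho^|s-t| for time points s, t."""
    idx = np.arange(p)
    return sigma2 * rho ** np.abs(np.subtract.outer(idx, idx))

# time covariance V for p = 4 time points, driven by (sigma2, rho) only
V = ar1_cov(4, rho=0.7)
print(np.round(V, 3))
```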


#### 3 Data

The available RNA sequencing data contain 4523 genes, 17 biological units and 4 time points for three biological replicates. The median value across the three biological replicates is used for the analysis. Data pre-processing reduces the number of genes by eliminating those which are not differentially expressed. The number of time points is also reduced by using the first time point as the baseline level. The biological units are characterized as wildtype or recombinant.
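One possible reading of these two pre-processing steps (median over replicates, first time point kept aside as baseline) can be sketched in numpy; the array dimensions follow the data description, while the counts themselves are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative layout: genes x biological units x time points x replicates
counts = rng.poisson(10.0, size=(100, 17, 4, 3))

# median across the three biological replicates
med = np.median(counts, axis=-1)           # (100, 17, 4)

# one reading of the time point reduction: treat the first time point
# as the reference level and keep the remaining time points
baseline = med[..., :1]                    # (100, 17, 1)
reduced = med[..., 1:]                     # (100, 17, 3)

print(med.shape, reduced.shape)
```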

The mixture model can be fitted in R using the R package PLNmodels (Chiquet *et al.*, 2021a), available from the Comprehensive R Archive Network. The package implements a variant of the EM algorithm with a variational E-step; it was developed for modelling joint species abundances (Chiquet *et al.*, 2021b) and may be extended to cover the model specifications of interest for modeling three-way RNA sequencing data.

### References

CHIQUET, JULIEN, MARIADASSOU, MAHENDRA, & GINDRAUD, FRANÇOIS. 2021a. *PLNmodels: Poisson Lognormal Models*. R package version 0.11.4.

CHIQUET, JULIEN, MARIADASSOU, MAHENDRA, & ROBIN, STÉPHANE. 2021b. The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances. *Frontiers in Ecology and Evolution*, 9, 188.

FRALEY, CHRIS, & RAFTERY, ADRIAN E. 2002. Model-Based Clustering, Discriminant Analysis and Density Estimation. *Journal of the American Statistical Association*, 97(458), 611–631.

SILVA, ANJALI, ROTHSTEIN, STEVEN J., MCNICHOLAS, PAUL D., & SUBEDI, SANJEENA. 2018. *Finite Mixtures of Matrix Variate Poisson-Log Normal Distributions for Three-Way Count Data*. arXiv:1807.08380 [stat.ME].

SILVA, ANJALI, ROTHSTEIN, STEVEN J., MCNICHOLAS, PAUL D., & SUBEDI, SANJEENA. 2019. A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data. *BMC Bioinformatics*, 20(1), 394.

SILVA, H. ANJALI. 2018. *Bayesian Clustering Approaches for Discrete Data*. Ph.D. thesis, The University of Guelph.

SUBEDI, SANJEENA, & BROWNE, RYAN P. 2020. A Family of Parsimonious Mixtures of Multivariate Poisson-Lognormal Distributions for Clustering Multivariate Count Data. *Stat*, 9(1), e310.


## STACKING ENSEMBLE LEARNING WITH GAUSSIAN MIXTURES


Luca Scrucca <sup>1</sup>

<sup>1</sup> Dept. of Economics, University of Perugia (e-mail: luca.scrucca@unipg.it)

ABSTRACT: Stacking is an ensemble method which uses a meta-learning approach to learn how to best combine the predictions from two or more base statistical and machine learning algorithms. In this contribution we propose a stacking algorithm for classification using Gaussian mixtures as base learners.

KEYWORDS: Gaussian mixtures, classification, ensemble learning, stacking.

### 1 Introduction

Ensemble learning is a broad term referring to meta-learning methods that combine predictions provided by multiple learners or models to obtain a prediction that often is more accurate than any single prediction. Typically, ensemble learning is applied in supervised learning tasks, such as in regression and classification (Rokach, 2010). In this contribution we propose an algorithm for ensemble learning using Gaussian mixtures as base learners for classification tasks.

#### 2 EDDA Gaussian mixture models for classification

Consider a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ for which both the feature vector $x_i$ and the true class $y_i \in \{C_1, \dots, C_K\}$ are known. Mixture-based classification models assume that the density within each class follows a Gaussian mixture distribution:

$$f(x \mid C_k) = \sum_{g=1}^{G_k} \pi_{gk}\, \phi(x; \mu_{gk}, \Sigma_{gk}), \tag{1}$$

where $G_k$ is the number of components within class $k$, $\pi_{gk}$ are the mixing probabilities for class $k$, such that $\pi_{gk} > 0$ and $\sum_{g=1}^{G_k} \pi_{gk} = 1$, and $\mu_{gk}$ and $\Sigma_{gk}$ are, respectively, the mean vectors and the covariance matrices of component $g$ within class $k$. Since this model is highly flexible, (i) parameter estimates are subject to high uncertainty unless a very large dataset is available, and (ii) it may easily lead to overfitting. For these reasons a parsimonious mixture-based classification model, termed the *Eigenvalue Decomposition Discriminant Analysis* (EDDA) model, has been proposed (Bensmail & Celeux, 1996). It assumes that (i) the density for each class can be described by a single Gaussian component, i.e. $G_k = 1$ for all $k$ in equation (1), and (ii) the class covariance structure is factorised as $\Sigma_k = \lambda_k U_k \Delta_k U_k^{\top}$.
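The factorisation $\Sigma_k = \lambda_k U_k \Delta_k U_k^{\top}$ separates volume ($\lambda_k$), shape ($\Delta_k$, with unit determinant) and orientation ($U_k$). A small numpy sketch recovering the three ingredients from a covariance matrix (the matrix itself is invented for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# eigendecomposition Sigma = U diag(d) U'
d, U = np.linalg.eigh(Sigma)
d, U = d[::-1], U[:, ::-1]        # sort eigenvalues decreasingly

p = Sigma.shape[0]
lam = np.prod(d) ** (1.0 / p)     # volume: lambda = det(Sigma)^(1/p)
Delta = np.diag(d / lam)          # shape: diagonal matrix with det(Delta) = 1

# check: Sigma = lambda * U Delta U'
recon = lam * U @ Delta @ U.T
print(np.allclose(recon, Sigma))  # True
```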


The EDDA family contains 14 different models (see Scrucca *et al.*, 2016, Table 3), some of which are popular discriminant analysis models. For instance, if each class has the same covariance matrix, that is $\Sigma_k = \lambda U \Delta U^{\top}$ for all $k$, then EDDA is equivalent to the classical *Linear Discriminant Analysis* (LDA) model. If the class covariance matrices are unconstrained, that is $\Sigma_k = \lambda_k U_k \Delta_k U_k^{\top}$ for all $k$, then EDDA is equivalent to the *Quadratic Discriminant Analysis* (QDA) model. Finally, assuming the matrix of eigenvectors $U$ is the identity matrix, features are conditionally independent within each class and the so-called *Naïve-Bayes* models are obtained.

Classification of observation *x* can be obtained according to the MAP (*maximum a posteriori*) principle, that is by assigning an observation to the class with the largest posterior class probability computed via Bayes' theorem

$$\Pr(C_k \mid x) = \frac{\tau_k f(x \mid C_k)}{\sum_{g=1}^{K} \tau_g f(x \mid C_g)},$$

where $f(x \mid C_k)$ are the class-conditional densities, and $\tau_k = \Pr(C_k)$ are the prior class probabilities for each class $C_k$ ($k = 1, \dots, K$).

Estimation of the unknown parameters $(\tau_1, \dots, \tau_K, \mu_1, \dots, \mu_K, \Sigma_1, \dots, \Sigma_K)$ of EDDA models can be obtained with a single M-step of the EM algorithm for Gaussian mixtures, with the conditional probabilities $z_{ik}$ set to 1 if observation $i$ belongs to class $k$ and 0 otherwise.
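For the unconstrained (QDA-like) member of the family, this closed-form single M-step and the MAP rule can be sketched as follows; the two-class toy data are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy training data with two well-separated classes (invented)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(3, 1, size=(50, 2))])
y = np.repeat([0, 1], 50)
K = 2

# single M-step with z_ik = 1{y_i = k}: closed-form parameter estimates
tau = np.array([np.mean(y == k) for k in range(K)])
mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
Sigma = np.array([np.cov(X[y == k], rowvar=False) for k in range(K)])

def log_gauss(x, m, S):
    """Log density of a multivariate Gaussian at x."""
    d = x - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(m) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(S, d))

def map_classify(x):
    # MAP rule: argmax_k tau_k f(x | C_k)
    scores = [np.log(tau[k]) + log_gauss(x, mu[k], Sigma[k]) for k in range(K)]
    return int(np.argmax(scores))

print(map_classify(np.array([3.0, 3.0])))  # -> 1, the second class
```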

#### 3 Stacking EDDA for ensemble classification

In this section we propose a form of stacking, called the *Super Learner* algorithm (Wolpert, 1992; Van der Laan *et al.*, 2007), which uses EDDA models as *base learners*. Let $\mathcal{M} = \{1, \dots, M\}$ be the set of EDDA models. The conditional probability of classifying an observation $x_i$ to class $C_k$ according to model $m \in \mathcal{M}$, estimated using the training data $\mathcal{D}$ (the *level-zero* data), is denoted by $p_{ikm} = \Pr(C_k \mid x_i; m, \mathcal{D})$, for $i = 1, \dots, n$ observations, $k = 1, \dots, K$ classes, and $m = 1, \dots, M$. Base learners can be used to generate cross-validation predictions, typically using $V$-fold cross-validation with $V = 10$, to get

$$\widehat{p}_{ikm}^{\,\mathrm{CV}} = \widehat{\Pr}\left(C_k \mid x_i; m, \mathcal{D}^{(-v(i))}\right),$$

where $v(i)$ indicates the fold containing the $i$th observation, and $\mathcal{D}^{(-v(i))}$ is the training set given by all the observations except those in the $v$th fold. The cross-validated predicted probabilities, along with the vector of original classes, are referred to as the *level-one* data.
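The construction of the level-one data can be sketched as follows; a deliberately trivial base learner (predicting the training-fold class proportions) stands in for an EDDA model, so only the cross-validation bookkeeping is illustrated:

```python
import numpy as np

rng = np.random.default_rng(3)

n, K, V = 30, 3, 10
y = rng.integers(0, K, size=n)

folds = np.arange(n) % V          # v(i): fold containing observation i
p_cv = np.zeros((n, K))           # cross-validated probabilities p^CV_ik

for v in range(V):
    train = folds != v            # D^(-v): all observations except fold v
    # "fit" the toy base learner on D^(-v) and predict for fold v
    props = np.bincount(y[train], minlength=K) / train.sum()
    p_cv[folds == v] = props

# the level-one data pair p_cv with the original class vector y
print(p_cv.shape)  # (30, 3)
```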

The ensemble classifier or *metalearner* defines the predicted classification probabilities as a convex linear combination of the base learners' predictions:

$$\widehat{\Pr}\left(C_k \mid x_i; \alpha\right) = \widehat{p}_{ik} = \sum_{m=1}^{M} \alpha_m\, \widehat{p}_{ikm}^{\,\mathrm{CV}},$$

where $\alpha = (\alpha_1, \dots, \alpha_M)$ are the ensemble weights, such that $\alpha_m \ge 0$ and $\sum_{m=1}^{M} \alpha_m = 1$. To completely specify the ensemble classifier the optimal combination of base learners is required. This can be achieved by minimizing the cross-entropy loss function:

$$\mathcal{L}(\alpha) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\widehat{p}_{ik}), \tag{2}$$


where $y_{ik} = 1$ if the $i$th observation is from class $k$ and 0 otherwise, and $\widehat{p}_{ik}$ is the estimated probability that the $i$th observation belongs to class $k$.

The optimization of the loss function in (2) is a constrained minimization problem which can be solved in different ways. One efficient approach is to remove the constraints on the ensemble weights by reformulating the problem as an unconstrained optimization using a different parameterization.

Let $\alpha = (\alpha_1, \dots, \alpha_M) \in \mathcal{S}^M := \left\{\alpha \in [0,1]^M,\ \sum_{m=1}^{M} \alpha_m = 1\right\}$ be the $(M-1)$-dimensional unit simplex vector. Define the *Unit Simplex Transform* function, which maps $\mathcal{S}^M \to \theta \in \mathbb{R}^{M-1}$, as

$$\theta_m = \operatorname{logit}\left(\frac{\alpha_m}{1 - \sum_{m'=0}^{m-1} \alpha_{m'}}\right) + \log(M - m) \qquad \text{for } m = 1, \dots, M-1,$$

where $\operatorname{logit}(x) = \log(x/(1-x))$ and $\alpha_0 = 0$. The backward transformation can be obtained via the *Inverse Unit Simplex Transform* defined as

$$\begin{cases} \alpha_1 = z_1 \\ \alpha_m = \left(1 - \sum_{m'=1}^{m-1} \alpha_{m'}\right) z_m & \text{for } m = 2, \dots, M-1 \\ \alpha_M = 1 - \sum_{m=1}^{M-1} \alpha_m \end{cases}$$

where $z_m = \operatorname{logit}^{-1}\{\theta_m - \log(M - m)\}$, and $\operatorname{logit}^{-1}(x) = 1/(1 + \exp(-x))$.

Thus, the unconstrained minimization of the cross-entropy loss function in (2) can be pursued with respect to the parameters $\theta = (\theta_1, \dots, \theta_{M-1})$, and the optimal stacking weights $\alpha = (\alpha_1, \dots, \alpha_M)$ are obtained via the inverse unit simplex transformation of the solution of this optimization. Numerical algorithms, such as the BFGS quasi-Newton method, typically require initialization of the parameters. A natural choice is $\alpha_m = 1/M$ for all $m = 1, \dots, M$, which amounts to assigning the same weight to all the models in the ensemble and is equivalent to setting $\theta_m = 0$ for $m = 1, \dots, M-1$. To improve exploration of the search space and to avoid getting trapped in local minima, a multiple-restart strategy can be implemented by generating uniformly distributed values on the $M$-simplex, i.e. values randomly drawn from a Dirichlet$(1, \dots, 1)$ distribution.
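The reparameterization and the resulting unconstrained optimization can be sketched end-to-end; the level-one probabilities below are synthetic, and scipy's BFGS implementation stands in for the quasi-Newton step:

```python
import numpy as np
from scipy.optimize import minimize

def simplex_transform(alpha):
    """Unit simplex transform: alpha on the M-simplex -> theta in R^(M-1)."""
    M = len(alpha)
    theta = np.zeros(M - 1)
    cum = 0.0
    for m in range(M - 1):
        frac = alpha[m] / (1.0 - cum)
        theta[m] = np.log(frac / (1.0 - frac)) + np.log(M - (m + 1))
        cum += alpha[m]
    return theta

def inv_simplex(theta):
    """Inverse unit simplex transform: theta in R^(M-1) -> alpha on the M-simplex."""
    M = len(theta) + 1
    z = 1.0 / (1.0 + np.exp(-(theta - np.log(M - np.arange(1, M)))))
    alpha = np.zeros(M)
    rest = 1.0
    for m in range(M - 1):
        alpha[m] = rest * z[m]
        rest -= alpha[m]
    alpha[M - 1] = rest
    return alpha

def cross_entropy(theta, p_cv, y_onehot):
    """Loss (2) evaluated at the unconstrained parameters theta."""
    alpha = inv_simplex(theta)
    p_ik = p_cv @ alpha                      # sum_m alpha_m p^CV_ikm
    return -np.mean(np.sum(y_onehot * np.log(p_ik + 1e-12), axis=1))

# synthetic level-one data: n x K x M cross-validated probabilities,
# built so that base learner m = 0 is the most informative one
rng = np.random.default_rng(4)
n, K, M = 200, 2, 3
y_onehot = np.eye(K)[rng.integers(0, K, size=n)]
raw = rng.random((n, K, M)) + 2.0 * y_onehot[:, :, None] * (np.arange(M) == 0)
p_cv = raw / raw.sum(axis=1, keepdims=True)

theta0 = np.zeros(M - 1)                     # equal weights alpha_m = 1/M
res = minimize(cross_entropy, theta0, args=(p_cv, y_onehot), method="BFGS")
alpha_hat = inv_simplex(res.x)
print(np.round(alpha_hat, 3))
```

With $\theta = 0$ the recovered weights are exactly $1/M$, matching the equal-weight initialization described above.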

#### 4 Conclusion


In this contribution we have proposed an ensemble approach to classification based on stacking with Gaussian EDDA mixtures as base learners. The proposal has been applied to both simulated and real datasets (not included here due to space constraints), demonstrating that it is able to improve the overall classification accuracy compared to the best single model among the base learners.

#### References

BENSMAIL, H., & CELEUX, G. 1996. Regularized Gaussian Discriminant Analysis through Eigenvalue Decomposition. *Journal of the American Statistical Association*, 91, 1743–1748.

ROKACH, LIOR. 2010. *Pattern Classification Using Ensemble Methods*. World Scientific.

SCRUCCA, LUCA, FOP, MICHAEL, MURPHY, T. BRENDAN, & RAFTERY, ADRIAN E. 2016. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. *The R Journal*, 8(1), 205–233.

VAN DER LAAN, MARK J., POLLEY, ERIC C., & HUBBARD, ALAN E. 2007. Super Learner. *Statistical Applications in Genetics and Molecular Biology*, 6(1).

WOLPERT, DAVID H. 1992. Stacked Generalization. *Neural Networks*, 5(2), 241–259.


### A ROBUST QUANTILE APPROACH TO ORDINAL TREES


Rosaria Simone<sup>1</sup>, Cristina Davino<sup>2</sup>, Domenico Vistocco<sup>1</sup> and Gerhard Tutz<sup>3</sup>

<sup>1</sup> Department of Political Science, University of Naples Federico II, Italy, (e-mail: rosaria.simone@unina.it, domenico.vistocco@unina.it)

<sup>2</sup> Department of Economics and Statistics, University of Naples Federico II, Italy, (email: cristina.davino@unina.it)

<sup>3</sup> Ludwig-Maximilians-Universität München, Germany (e-mail: tutz@uni-muenchen.de)

ABSTRACT: We propose a quantile tree making use of the one-way quantile ANOVA to check whether two groups of observations of an ordinal response variable differ significantly in a group of quantiles. Specifically, at each node, a quantile ANOVA checks, for each of the available covariates, if the implied split induces significant differences in (at least one of) the selected quantiles. If several splits are significant, the final split will be that with the highest number of significant differences in quantiles, and among them, the one with the strongest overall effects. Since multiple testing is applied at each step, the selection of the split is based on p-values adjusted with the Hochberg correction. An application to the profiling of voting probabilities is used to show the potential of the quantile tree for ordinal responses.

KEYWORDS: non-parametric trees, ordinal responses, quantile ANOVA.

#### 1 Introduction and motivation

Decision trees (Breiman *et al.*, 1984) are supervised learning methods aiming at modeling and predicting the value of a response variable based on several explanatory variables. Since they mostly employ an ordinary least squares (OLS) loss function as the splitting criterion, the corresponding decision rule is sensitive to outliers and/or skewness in the distribution of the response variable. Moreover, in compliance with classical OLS interpretative rules, results are to be read in light of the effect the predictors exert on the conditional mean of the response. Breiman *et al.* (1984) extended the regression tree to the median tree through the use of least absolute deviations (LAD) as the splitting criterion. Quantile regression trees represent the natural evolution to inspect the conditional quantiles of the response. The proposals in the literature differ in the splitting criterion, the quantiles used (a fixed quantile for the whole tree vs different quantiles at the various splits) and the type of approach (descriptive vs inferential). In the case of a categorical response, the modal value of the terminal node is commonly used to assign the predicted value. However, modal values might not be unique and, in the case of an ordinal response, they perform poorly, since the modal value does not consider the ordering of the categories.

This paper addresses the case of ordinal dependent variables through a robust quantile tree. Given that quantiles can always be defined for ordinal rating data and do not need any scoring rule for the categories, we introduce a tree methodology to study the effects of covariates on an ordinal response, exploiting quantile ANOVA (Wilcox, 2017), an effective approach to analyze the quantiles of an ordinal distribution. We implement a recursive partitioning algorithm to detect significant differences in possibly many quantiles, given the splitting covariates, at each partitioning level. Our approach is based on inference, i.e. it assesses whether the subgroups are significantly different from each other.

#### 2 A quantile–based classification tree


This section briefly describes the proposed approach to grow a tree through a sequence of splits best discriminating the response variable in terms of a selected grid of quantiles. We refer to an ordinal response variable, even if the generalization to the continuous case is fairly straightforward. Several steps are needed when growing a tree, among them the splitting criterion, the classification rule, the stopping rule and the accuracy measure. Due to the limited space, we discuss here only the splitting criterion, as it constitutes the original part of the proposal.

Let $R$ be a rating response collected on a support with $m$ ordered categories, labelled using the first $m$ natural numbers, without loss of generality. The splitting criterion identifies subgroups (child nodes) that differ significantly with respect to the quantiles, i.e. the selected location measure. Moreover, the goodness of fit of the decision rule is assessed using a measure that takes into account solely the ordinal nature of the dependent variable. Let $S^{(q)} = \{q_{\tau_{r_1}}, \dots, q_{\tau_{r_h}}\}$ denote the set of quantiles of interest, where $q_\tau$ is the quantile of order $\tau$. At each node $k$, a quantile ANOVA (Wilcox, 2017) is carried out for each of the available covariates to check if the implied binary split induces significant differences in (at least one of) the selected quantiles $S^{(q)}$. At a given node $k$, and for each candidate binary splitting variable $D$, whose levels are coded as 0 and 1, let $q_\tau^{(l)}$ and $q_\tau^{(r)}$ be the two quantiles of order $\tau$ of the conditional response distributions $(R \mid D = 0)$ and $(R \mid D = 1)$ associated with the left and right descendants, respectively. The procedure tests the null hypothesis $H_0: q_\tau^{(l)} = q_\tau^{(r)}$ for all $q_\tau \in S^{(q)}$ against the alternative $H_1$: $q_\tau^{(l)} \neq q_\tau^{(r)}$ for at least one $q_\tau \in S^{(q)}$. The chosen split is one of those with the highest number of significant differences in quantiles and, among the competitor splitting variables with the same number of significant differences, the one with the lowest p-value. The p-values in this step are adjusted with the Benjamini-Hochberg correction for multiple testing (Benjamini & Hochberg, 1995).
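The split-selection bookkeeping (not the quantile ANOVA itself, for which see Wilcox, 2017) can be sketched as follows; the per-quantile p-values are invented placeholders for the test results of three candidate splits:

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

# invented per-quantile p-values for three candidate splitting variables
candidates = {
    "votelast":  [0.001, 0.020, 0.300],   # quantiles 0.25, 0.50, 0.75
    "supportpp": [0.004, 0.030, 0.040],
    "marital":   [0.200, 0.500, 0.900],
}

alpha = 0.05
best = None
for var, pvals in candidates.items():
    adj = bh_adjust(pvals)
    n_sig = int((adj < alpha).sum())
    # most significant quantiles first, then smallest adjusted p-value
    key = (n_sig, -adj.min())
    if best is None or key > best[0]:
        best = (key, var)
print(best[1])  # -> supportpp
```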

#### 3 An application to German vote data

The performance of the proposed quantile tree for ordinal rating data is illustrated through an application to response profiles for the probability to vote for competing German parties. Data are taken from the GESIS ALLBUS German General Social Survey (GESIS ALLBUS Leibniz Institute for the Social Sciences, 2012). On a rating scale ranging from 1 = "*very unlikely*" to 10 = "*very likely*", respondents were asked to rate "*How likely is it that you would ever vote for this German party?*". Here we refer to interviewees surveyed in 2008, and we focus on the probability to vote for the Social Democratic Party (SPD). In assessing the beliefs and behavior of the electorate, collecting ratings of the probability to vote for each candidate running in an upcoming election is a much more valuable source of information than classical voting intentions collected on nominal scales, as it allows targeted campaigns to be designed and the behavior of the electorate to be understood and predicted locally. The splitting variables record the presence or absence of a series of personal or political characteristics of the interviewees: participation in the previous federal election (*votelast*), support for a particular political party (*supportpp*), marital status (*marital*), having signed a petition (*petition*), gender, religion (*norel*), being catholic (*catholic*), having voted for a party as a form of protest (*demo*), and having refused to vote in some election as a form of protest (*refusevote*). We use the following settings for growing the tree: a maximum depth of 4, a nominal level α = 0.05 for the testing procedure in the splitting phase, a minimum sample size of 250 at a node for a split to be attempted, and a minimum sample size of 50 in each child node for a candidate split to be admissible.

The final tree obtained using the grid of quantiles *S*(q) = {*q*<sub>0.1</sub>, *q*<sub>0.25</sub>, *q*<sub>0.5</sub>, *q*<sub>0.75</sub>, *q*<sub>0.9</sub>} is shown in Figure 1: it includes nine terminal nodes and seven splitting variables out of the nine candidates (the splitting variables that determine the largest number of effects are *demo* and *petition*). It is worth noticing that the extreme quantiles 0.1 and 0.9 are never chosen as the best quantiles, and differences at the first decile are never significant. By following the different paths of the tree from the root to the terminal nodes, it is possible to identify different profiles of respondents. For reasons of space, results on the distribution of the dependent variable in the terminal nodes are not shown, but the analysis could be deepened further by exploring the homogeneity of each profile of respondents.

Figure 1. *Quantile tree for rating probabilities for SPD party*

#### References

BENJAMINI, Y., & HOCHBERG, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, 57, 289–300.

BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A., & STONE, C.J. 1984. *Classification and Regression Trees*. Boca Raton: Chapman & Hall/CRC.

GESIS ALLBUS LEIBNIZ INSTITUTE FOR THE SOCIAL SCIENCES. 2012. German General Social Survey (ALLBUS) - Cumulation 1980-2014. Data file version 1.0.0. *GESIS Data Archive*.

WILCOX, R.R. 2017. *Introduction to Robust Estimation and Hypothesis Testing*. 4th edition. Amsterdam, The Netherlands: Elsevier.


### THE DETECTION OF SPAM BEHAVIOUR IN REVIEW BOMB


Venera Tomaselli<sup>1</sup>, Giulio Giacomo Cantone<sup>2</sup> and Valeria Mazzeo<sup>2</sup>

<sup>1</sup> Department of Political and Social Sciences, University of Catania (e-mail: venera.tomaselli@unict.it)

<sup>2</sup> Department of Physics and Astronomy 'E. Majorana', University of Catania (e-mail: giulio.cantone@phd.unict.it; valeria.mazzeo@phd.unict.it)

ABSTRACT: In recent years, a new phenomenon called 'Review Bomb' has affected online rating systems. It occurs when a massive number of accounts review a single product, usually negatively, to make its reputation slump.

This study analyses the differences between legitimate users and 'review bombers', using common classifiers and techniques from spam detection to identify suspicious reviews by looking at both content and user features.

KEYWORDS: review bomb, online ratings, cold start, machine learning.

#### 1 Introduction

Often, before purchasing a product or service, consumers ask for the opinion of their peers who have already purchased it. This is commonly referred to as *word-of-mouth* (WOM). A positive opinion travelling across WOM networks is regarded by marketing experts as a valuable and powerful source of reputation for brands. Online rating platforms, or 'review aggregators', are a case of technological innovation for electronic word-of-mouth (eWOM): by browsing a review aggregator, a consumer can read the opinions of people who have already purchased items (i.e., *evaluands*, such as products, services, places to visit, etc.).

Aggregators take this name from the recommendation service (i.e., a recommender system) they offer. They ask their registered users to submit a numerical score on a constrained multipoint scale, and then summarise the scores into ratings and rankings (Tomaselli & Cantone, 2020). Scores collected in experimental settings respect methodological assumptions (e.g., normality, independence of observations), but scores collected on online (open) platforms are subject to two biases:

• Purchasing bias: people review what they purchase, but they purchase what is already reviewed or, at least, already popular (a case of the 'Matthew Effect');


• Under-reporting bias: people review when they are extremely satisfied or unsatisfied.

The consequence of these biases is a J-shaped distribution of scores in online ratings (Hu *et al.*, 2017; Smironva *et al.*, 2020). These biases make it easier to defraud the eWOM network through the injection of fake reviews submitted by so-called 'sock puppet' accounts. Experimental results confirm that positive fake reviews have an impact on the success of online businesses (van de Rijt *et al.*, 2014). A consensus on the impact of negative fake reviews has not been reached yet.

Some recommender systems know whether the reviewer purchased the item (e.g., Amazon), but recommender systems generally do not know how experienced the user is with the item (e.g., how much time they spent interacting with it). This issue is related to fake reviews: one could ask an uninterested friend with an account in the system to rig a review of an item. Should a case like this be considered fake? To overcome such issues, researchers have adopted the broader perspective of the 'spam reviews' attack (Hussain *et al.*, 2019). Spam is not necessarily fake, but it is an excess of information which is undesired or harmful for the purposes of the system. According to Aggarwal (2016), a good spam attack, hard to detect, is deployed slowly over time, so that the sock puppet mimics the behaviour of a regular user.

Recently, another type of review spam attack has emerged, known as the 'Review Bomb', occurring when a massive number of accounts review, usually negatively, a single product to make its reputation slump (Tomaselli *et al.*, 2021). During a 'Review Bomb', it is often unclear how many accounts are sock puppets and how many are people ideologically driven to review the specific item, but most of the accounts involved lack a history of previous reviews/ratings in the system (the *cold-start* problem).

#### 2 Dataset

The dataset includes *N* = 59*k* English reviews of the video game *The Last of Us Part II* (TLOU2). TLOU2 was 'review bombed' from its publication date (June 19th, 2020) for ideological reasons (Tomaselli *et al.*, 2021). These reviews were written by registered users of the online platform metacritic.com.

From each review, the following metadata are extracted: *i*) username; *ii*) the date the review was written; *iii*) the text of the review; *iv*) the score, on a scale [1:10]; *v*) the number of upvotes (i.e., likes) assigned to the review by users; *vi*) the number of downvotes (i.e., dislikes) assigned to the review by users; *vii*) the number of past ratings the user provided on Metacritic; *viii*) the number of past reviews the user wrote on metacritic.com. Once the data are collected, a labelling procedure is performed, assigning a binary class label indicating whether the review is legitimate (0) or related to the bombing phenomenon (1).
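As an illustration, the metadata above can be mapped to engineered user-level features of the kind used later in the classification step. The field and feature names here are illustrative assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass
class Review:
    # Metadata fields i)-viii) described above (names are illustrative)
    username: str
    date: str
    text: str
    score: int            # on the [1:10] scale
    upvotes: int
    downvotes: int
    past_ratings: int     # past ratings the user provided on the platform
    past_reviews: int     # past reviews the user wrote on the platform

def user_features(r: Review) -> dict:
    """User-level features of the kind used to flag suspicious accounts
    (the exact feature set is an assumption, not the paper's)."""
    return {
        "username_length": len(r.username),
        "username_starts_with_digit": r.username[:1].isdigit(),
        "username_contains_digit": any(c.isdigit() for c in r.username),
        "cold_start": r.past_ratings + r.past_reviews == 0,
        "vote_balance": r.upvotes - r.downvotes,
    }
```

The `cold_start` flag encodes the lack of review/rating history mentioned in the Introduction.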


#### 3 Methods

In the present paper, we propose a methodology for analysing data from a real dataset of TLOU2 reviews, focusing on the online review bomb phenomenon. The data pre-processing stage (data cleaning and handling of missing values) consists of reducing noise by removing all parts of the text which are not relevant for our scope, i.e., punctuation, symbols, and stopwords. A simple Bag-of-Words representation and weighted measures such as Term Frequency-Inverse Document Frequency (TF-IDF) are applied to determine the representativeness of terms. In terms of review content, some statistical features (e.g., number of punctuation marks, number of unique words, words per sentence) are also extracted.
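The TF-IDF weighting mentioned above can be sketched in a few lines. This is a bare-bones version without the smoothing and normalisation that library implementations add on top:

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF over tokenised documents:
    weight(t, d) = tf(t, d) * log(N / df(t)),
    where tf is the relative term frequency within the document and
    df is the number of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                    # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["great", "game"], ["bad", "game"], ["bad", "story", "bad", "game"]]
w = tfidf(docs)  # "game" occurs in every document, so its weight is zero
```

A term that appears in every review carries no discriminative weight, while rarer terms such as "story" above are up-weighted.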

Techniques for detecting spammer activities on online social networks (Abkenar *et al.*, 2020) and online review platforms (Liu *et al.*, 2017; Harris, 2018) make it possible to identify accounts involved in review bombing within this dataset. Extra engineered features, therefore, are created to better discriminate illegitimate reviews from legitimate ones by looking at user features, such as username length and whether the username starts with or contains numbers, among others.

To reduce the dimensionality of the data and improve the results of the analysis, the most relevant features are selected to enter the model. Popular statistical tests, such as Pearson's correlation test and the Chi-squared test, are used for this purpose, since they can handle numerical and categorical variables, respectively.
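A sketch of the filter-style selection for the numerical features, ranking them by absolute Pearson correlation with the binary label (the Chi-squared branch for categorical features would be analogous; the function name is illustrative):

```python
import numpy as np

def pearson_rank(X, y, k):
    """Rank the columns of X by |Pearson correlation| with the binary
    label y and return the indices of the top-k features."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r))[:k]
```

In practice a p-value threshold on the test, rather than a fixed k, could equally be used as the retention rule.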

Once the most important features have been selected, they are passed to the classification algorithms to produce a range of models to predict illegitimate reviews. A *k*-fold cross-validation technique is used to compare different machine learning algorithms (e.g., Logistic Regression, Naive Bayes, Random Forest, Support Vector Machine; Nematzadeh *et al.*, 2015), generally used in spam (Al-Zoubi *et al.*, 2021) and fake news/review detection. Finally, model performance is evaluated by scoring the outcomes on a test set, using the precision, accuracy, recall, and *F*<sub>1</sub> score metrics (Zheng *et al.*, 2015).
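The evaluation loop can be sketched as follows, with a plug-in `fit_predict` callable standing in for any of the classifiers above. This is a simplified sketch: it reports the mean F1 score only, with standard fold-splitting and metric definitions:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k folds."""
    return np.array_split(np.random.default_rng(seed).permutation(n), k)

def prf1(y_true, y_pred):
    """Precision, recall and F1 for the positive (bombing = 1) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cross_validate(fit_predict, X, y, k=5):
    """Mean F1 over k folds; fit_predict(X_train, y_train, X_test)
    stands in for any of the classifiers mentioned in the text."""
    scores = []
    for fold in kfold_indices(len(y), k):
        train = np.ones(len(y), dtype=bool)
        train[fold] = False                    # hold the fold out for testing
        y_pred = fit_predict(X[train], y[train], X[fold])
        scores.append(prf1(y[fold], y_pred)[2])
    return float(np.mean(scores))
```

Comparing algorithms then amounts to calling `cross_validate` once per candidate classifier on the same folds.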

#### References

ABKENAR, S. B., KASHANI, M. H., AKBARI, M., & MAHDIPOUR, E. 2020. Twitter Spam Detection: A Systematic Review. *ArXiv*, abs/2011.14754.

AGGARWAL, C. C. 2016. *Recommender Systems*. Springer-Verlag.

AL-ZOUBI, AM, ALQATAWNA, J., FARIS, H., & HASSONAH, MA. 2021. Spam profiles detection on social networks using computational intelligence methods: The effect of the lingual context. *Journal of Information Science*, 47(1), 58–81.

HARRIS, C. G. 2018. Decomposing TripAdvisor: Detecting Potentially Fraudulent Hotel Reviews in the Era of Big Data. *Pages 243–251 of: 2018 IEEE International Conference on Big Knowledge (ICBK)*.

HU, N., PAVLOU, P., & ZHANG, J. 2017. On Self-Selection Biases in Online Product Reviews. *Management Information Systems Quarterly*, 41(2), 449–471.

HUSSAIN, N., TURAB MIRZA, H., RASOOL, G., HUSSAIN, I., & KALEEM, M. 2019. Spam Review Detection Techniques: A Systematic Literature Review. *Applied Sciences*, 9(5), 987.

LIU, P., XU, Z., AI, J., & WANG, F. 2017. Identifying Indicators of Fake Reviews Based on Spammer's Behavior Features. *Pages 396–403 of: 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C)*.

NEMATZADEH, Z., IBRAHIM, R., & SELAMAT, A. 2015. Comparative studies on breast cancer classifications with k-fold cross validations using machine learning techniques. *2015 10th Asian Control Conference (ASCC)*, 1–6.

SMIRONVA, E., KIATKAWSIN, K., LEE, S. K., KIM, J., & LEE, C.-H. 2020. Self-selection and non-response biases in customers' hotel ratings – a comparison of online and offline ratings. *Current Issues in Tourism*, 23(10), 1191–1204.

TOMASELLI, V., & CANTONE, G. G. 2020. Evaluating Rank-Coherence of Crowd Rating in Customer Satisfaction. *Social Indicators Research*.

TOMASELLI, V., CANTONE, G. G., & MAZZEO, V. 2021. The polarising effect of Review Bomb. *ArXiv*, abs/2104.01140.

VAN DE RIJT, A., KANG, S. M., RESTIVO, M., & PATIL, A. 2014. Field experiments of success-breeds-success dynamics. *Proceedings of the National Academy of Sciences*, 111(19), 6934–6939.

ZHENG, X., ZENG, Z., CHEN, Z., YU, Y., & RONG, C. 2015. Detecting spammers on social networks. *Neurocomputing*, 42(02).


### **CLUSTERING MODELS FOR THREE-WAY DATA**


Donatella Vicari<sup>1</sup> and Paolo Giordani<sup>1</sup>

<sup>1</sup> Department of Statistical Sciences, Sapienza University of Rome (e-mail: donatella.vicari@uniroma1.it, paolo.giordani@uniroma1.it)

**ABSTRACT**: A novel clustering model for three-way data, concerning a set of objects on which variables are measured by different subjects, is proposed. The main aim of the model is to summarize the objects through a limited number of clusters. In order to exploit the three-way structure of the data, such clusters are assumed to be common to all subjects, and variables and subjects are summarized through the PARAFAC model.

**KEYWORDS**: *K*-Means, PARAFAC, Variable weighting.

#### **1 Introduction**

Nowadays, it is very common to analyze data corresponding to variables measured on some objects by a set of subjects. Such three-way data can be stored in a (three-way) array, or tensor. It is often of interest to discover clusters of objects that are homogeneous with respect to the variables measured by the subjects. However, classical (two-way) clustering techniques are usually inadequate to handle three-way data. For this purpose, several three-way extensions have been developed following either model-based (see, e.g., Viroli, 2011) or least-squares approaches. Here, we propose a new clustering model for three-way data according to the least-squares approach. It can be seen as a three-way extension of the well-known *K*-Means algorithm (MacQueen, 1967) where, in particular, the three-way nature of the model is exploited by considering the so-called PARAFAC model, independently developed by Carroll & Chang (1970) and Harshman (1970). In the PARAFAC model, data are summarized by a limited number of components. As such, the PARAFAC model represents a three-way extension of classical Principal Component Analysis.
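As a reminder of the structure PARAFAC assumes, a rank-*F* three-way array can be built (not fitted: the alternating least-squares estimation is omitted here) from its three factor matrices, one per mode:

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """Rebuild the three-way array from PARAFAC factor matrices:
    x[n, j, h] = sum over f of A[n, f] * B[j, f] * C[h, f]."""
    return np.einsum('nf,jf,hf->njh', A, B, C)

rng = np.random.default_rng(0)
N, J, H, F = 6, 4, 3, 2          # objects, variables, subjects, components
A = rng.normal(size=(N, F))      # object components
B = rng.normal(size=(J, F))      # variable components
C = rng.normal(size=(H, F))      # subject components
X = parafac_reconstruct(A, B, C) # a rank-2 array of order (N x J x H)
```

Unlike Tucker3, every mode shares the same number *F* of components, which is what yields the uniqueness (up to scaling and permutation) mentioned below.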

The paper is organized as follows. In the next section, we briefly review alternative clustering models for three-way data. Section 3 deals with the proposed model. Some final comments are made in Section 4.

#### **2 Related works**

The clustering problem of three-way data has received a great deal of attention over the last few years. We can roughly distinguish two main classes of techniques aiming at partitioning the entities referring to a single way or to more ways simultaneously. A common case is referred to as bi-clustering or co-clustering when two ways, usually objects and variables, are clustered (for a comprehensive review see Madeira & Oliveira, 2004). In this paper, we are going to focus on the first class of models, seeking, without loss of generality, a partition of objects.

Wilderjans & Ceulemans (2013) introduced the so-called Clusterwise PARAFAC, where objects are assigned to a limited number of clusters and, simultaneously, a *standard* PARAFAC model is applied within each cluster. In other words, within each cluster, objects, variables and subjects are summarized through a limited number of components. Therefore, the main idea of the Clusterwise PARAFAC model is that objects assigned to the same cluster share the same component structure, whereas objects belonging to different clusters have different underlying components.

A different approach is followed by Rocci & Vichi (2005). First of all, the PARAFAC model is replaced by the Tucker3 model (Tucker, 1966). Tucker3 is more general than PARAFAC: in fact, the *standard* Tucker3 model allows for different numbers of components for objects, variables and subjects. Unfortunately, the Tucker3 solution suffers from rotational indeterminacy. On the contrary, the PARAFAC solution is unique (up to scaling and permutation of the components) under mild conditions. Rocci & Vichi (2005) suggested summarizing only variables and subjects through components, whilst objects are partitioned into a reduced number of clusters. It follows that objects are analyzed asymmetrically with respect to variables and subjects. Specifically, objects are assigned to clusters following a *K*-Means-type procedure where the cluster prototypes lie in the low-dimensional space spanned by the components for the variables and the subjects. The partition of the objects and the dimensionality reduction of both variables and subjects are performed simultaneously, in such a way that the components explain the between-cluster variability. In this respect, the proposal of Rocci & Vichi (2005) is actually a generalization of the Reduced *K*-Means method for standard two-way data (De Soete & Carroll, 1994).

In the next section, we present a new clustering model for three-way data, which takes inspiration from the previously mentioned proposals. Namely, consistently with Rocci & Vichi (2005), objects play an asymmetric role with respect to variables and subjects and, consistently with Wilderjans & Ceulemans (2013), the PARAFAC model is used for its simplicity and its uniqueness property. As we shall see, the proposal can be interpreted as a *K*-Means-type clustering model for three-way data.

#### **3 The clustering model**

**CLUSTERING MODELS FOR THREE-WAY DATA**

Donatella Vicari1 and Paolo Giordani1

**ABSTRACT**: A novel clustering model for three-way data concerning a set of objects on which variables are measured by different subjects is proposed. The main aim of the model is to summarize the objects through a limited number of clusters. In order to exploit the three-way structure of the data, such clusters are assumed to be common to all subjects and variables and

Nowadays, it is very frequent to analyze data corresponding to variables measured on some objects by a set of subjects. Such three-way data can be stored in a (threeway) array or tensor. It can be interesting to discover clusters of homogeneous objects with respect to the variables measured by the subjects. However, classical (two-way) clustering techniques are usually inadequate to handle three-way data. To this purpose, several three-way extensions have been developed following the model-based (see, e.g., Viroli, 2011) or least-squares approaches. Here, we propose a new clustering model for three-way data according to the least-squares approach. It can be seen as a three-way extension of the well-known *k*-Means algorithm (MacQueen, 1967) where, in particular, the three-way nature of the model is exploited by considering the so-called PARAFAC model, independently developed by Carroll & Chang (1970) and Harshman (1970). In the PARAFAC model, data are summarized by a limited number of components. As such, the PARAFAC model

represents a three-way extension of classical Principal Component Analysis.

The paper is organized as follows. In the next section, we briefly review alternative clustering models for three-way data. Section 3 deals with the proposed

The clustering problem of three-way data has received a great deal of attention over the last few years. We can roughly distinguish two main classes of techniques aiming at partitioning the entities referring to a single way or to more ways simultaneously. A common case is referred to as bi-clustering or co-clustering when

(e-mail: donatella.vicari@uniroma1.it, paolo.giordani@uniroma1.it)

<sup>1</sup> Department of Statistical Sciences, Sapienza University of Rome

subjects are summarized through the PARAFAC model.

**KEYWORDS**: *K*-Means, PARAFAC, Variable weighting.

model. Some final comments are made in Section 4.

**1 Introduction**

**2 Related works**

Let us suppose *J* variables are measured on *N* objects by *H* subjects. Such data are stored in the three-way array **X** of order (*N* × *J* × *H*), whose generic element is *xnjh*, expressing the measurement of object *n* (*n* = 1, …, *N*) with respect to variable *j* (*j* = 1, …, *J*) made by subject *h* (*h* = 1, …, *H*). The array **X** can be seen as a collection of matrices, one for every subject. Therefore, matrix **X***<sup>h</sup>* (*h* = 1, …, *H*) of size (*N* × *J*), usually referred to as slice, contains all measurements from subject *h*.

The most general model can be fully specified as follows:

$$\mathbf{X}\_h = \mathbf{U}\_h \mathbf{Y}\_h + \mathbf{E}\_h, \ h = 1, \ldots, H,\tag{1}$$

where **E***<sup>h</sup>* is the error term for subject *h* and **U***<sup>h</sup>* is the membership matrix of order (*N* × *K*), with *K* the number of clusters. Matrix **U***<sup>h</sup>* is binary with exactly one entry equal to 1 per row and identifies a partition of the *N* objects into *K* disjoint clusters for subject *h* (*h* = 1, …, *H*). Matrix **Y***<sup>h</sup>* (*h* = 1, …, *H*) of order (*K* × *J*) is the subject-specific prototype matrix. Thus, the model assumes a *different* partition for each slice, i.e., separate partitions are sought by means of a *K*-Means-type model for every subject.

In order to exploit the three-way structure of the data, i.e., to properly take into account that the same variables are observed on the same objects by all subjects, constrained versions of model (1) can be derived. For instance, we may assume that **U***<sup>h</sup>* = **U**, *h* = 1, …, *H*. Then, we get

$$\mathbf{X}_h = \mathbf{U} \mathbf{Y}_h + \mathbf{E}_h, \quad h = 1, \ldots, H. \tag{2}$$

Matrix **U** is the allocation matrix, fulfilling the same constraints as **U***<sup>h</sup>*. Therefore, model (2) identifies a *common* partition across subjects. As in model (1), different prototype matrices **Y***<sup>h</sup>* are allowed, accounting for possible differences among subjects. Model (2) is therefore a *K*-Means-type model with a consensus partition specified by **U**.

Model (2) can be further extended by considering the PARAFAC model. Specifically, setting **Y***<sup>h</sup>* = **D***<sup>h</sup>***B**, model (2) can be rewritten as

$$\mathbf{X}_h = \mathbf{U} \mathbf{D}_h \mathbf{B} + \mathbf{E}_h, \quad h = 1, \ldots, H, \tag{3}$$

where **D***<sup>h</sup>* (*h* = 1, …, *H*) is a diagonal matrix of order (*K* × *K*) whose diagonal elements give subject-specific weights for the *K* clusters, and matrix **B** of order (*K* × *J*) measures the relevance of the variables for the *K* clusters. The three-way structure of the data is captured by the matrices **D***<sup>h</sup>* (*h* = 1, …, *H*) and **B**. In fact, since the same matrix **B** is assumed across subjects, the underlying idea of model (3) is that all slices are described by the same matrices **U** and **B**, but in different proportions, because **B** is weighted differently through the subject-specific matrices **D***<sup>h</sup>* (*h* = 1, …, *H*).

The proposed model is a PARAFAC model with binary constraints on **U**. It is a special case of the so-called NMFA/GENNCLUS model, mentioned by Carroll & Chaturvedi (1995). The solution is unique up to scaling and cluster relabeling, as for PARAFAC. Such a solution can be found according to the least-squares approach by minimizing the loss function

$$\sum_{h} \| \mathbf{E}_h \|^2, \tag{4}$$

with respect to **U**, **B** and **D***<sup>h</sup>* (*h* = 1, …, *H*), where ‖·‖ denotes the Frobenius norm. For this purpose, an Alternating Least-Squares algorithm has been implemented.

Model (3) can be extended along various directions. For instance, it might be fruitful to introduce subject-specific weights for the variables, tuning their importance in the clustering process. Such weights might be objectively estimated by minimizing loss function (4).

#### **4 Concluding remarks**

This paper introduced a novel *K*-Means-type clustering model for three-way data involving the PARAFAC decomposition. The effectiveness of the proposal will be illustrated with simulated and real applications, and possible extensions will be presented during the meeting.

#### **References**

CARROLL, J. D., & CHANG, J. J. 1970. Analysis of individual differences in multidimensional scaling via an *n*-way generalization of Eckart–Young decomposition. *Psychometrika*, **35**, 283-319.

CARROLL, J. D., & CHATURVEDI, A. 1995. A general approach to clustering and multidimensional scaling of two-way, three-way or higher-way data. *Geometrical Representations of Perceptual Phenomena*. Mahwah, NJ: Lawrence Erlbaum, 295-318.

DE SOETE, G., & CARROLL, J. D. 1994. *k*-means clustering in a low-dimensional Euclidean space. *New Approaches in Classification and Data Analysis*. Heidelberg: Springer Verlag, 212-219.

HARSHMAN, R. A. 1970. Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multi-modal factor analysis. *UCLA Working Papers in Phonetics*, **16**, 1-84.

MACQUEEN, J. B. 1967. Some methods for classification and analysis of multivariate observations. *Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability*. Berkeley: University of California Press, **1**, 281-297.

MADEIRA, S. C., & OLIVEIRA, A. L. 2004. Biclustering algorithms for biological data analysis: A survey. *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, **1**, 24-45.

ROCCI, R., & VICHI, M. 2005. Three-mode component analysis with crisp or fuzzy partition of units. *Psychometrika*, **70**, 715-736.

TUCKER, L. R. 1966. Some mathematical notes on three-mode factor analysis. *Psychometrika*, **31**, 279-311.

VIROLI, C. 2011. Model based clustering for three-way data structures. *Bayesian Analysis*, **6**, 573-602.

WILDERJANS, T. F., & CEULEMANS, E. 2013. Clusterwise Parafac to identify heterogeneity in three-way data. *Chemometrics and Intelligent Laboratory Systems*, **129**, 87-97.


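To make the estimation strategy concrete, the following NumPy sketch implements one plausible Alternating Least-Squares scheme for model (3). It is an illustration reconstructed from the formulas above under simplifying assumptions (random initialization, a naive empty-cluster guard); it is not the authors' implementation, and the function name is invented.

```python
import numpy as np

def clusterwise_parafac(X, K, n_iter=50, seed=0):
    """Alternating least-squares sketch for X_h ~ U D_h B (illustrative only).

    X : (H, N, J) array, one (N x J) slice per subject.
    Returns labels (N,), subject weights D (H, K), prototypes B (K, J).
    """
    H, N, J = X.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=N)
    labels[:K] = np.arange(K)            # make sure no cluster starts empty
    B = np.stack([X[:, labels == k].mean(axis=(0, 1)) for k in range(K)])
    for _ in range(n_iter):
        # cluster means per subject: M[h, k, :] = mean of rows of X_h in cluster k
        M = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
        # D-step: d_hk = <M[h, k], b_k> / ||b_k||^2 (least squares, U and B fixed)
        D = np.einsum('hkj,kj->hk', M, B) / (B ** 2).sum(axis=1)
        # B-step: b_k = sum_h d_hk M[h, k] / sum_h d_hk^2 (U and D fixed)
        B = np.einsum('hk,hkj->kj', D, M) / (D ** 2).sum(axis=0)[:, None]
        # U-step: assign each object to the cluster with the smallest residual,
        # cost[n, k] = sum_h ||X_h[n, :] - d_hk b_k||^2
        P = np.einsum('hk,kj->hkj', D, B)
        cost = ((X[:, :, None, :] - P[:, None, :, :]) ** 2).sum(axis=(0, 3))
        labels = cost.argmin(axis=1)
        for k in range(K):               # naive guard against empty clusters
            if not (labels == k).any():
                labels[rng.integers(N)] = k
    return labels, D, B

# Tiny noiseless example: two well-separated prototypes, three identical slices.
B_true = np.array([[2., 0., 0., 0.], [0., 0., 2., 2.]])
labels_true = np.repeat([0, 1], 6)
X = np.stack([B_true[labels_true] for _ in range(3)])
labels, D, B = clusterwise_parafac(X, K=2)
err = ((X - np.einsum('hk,kj->hkj', D, B)[:, labels, :]) ** 2).sum()
```

Each step solves its subproblem exactly with the other blocks fixed, so the loss in (4) is non-increasing along the iterations, mirroring the usual two-way *K*-Means argument.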

## **USING EYE-TRACKING DATA TO CREATE A WEIGHTED DICTIONARY FOR SENTIMENT ANALYSIS: THE EYE DICTIONARY**

Gianpaolo Zammarchi1, Jaromír Antoch2

<sup>1</sup> Department of Economics and Business Sciences, University of Cagliari (e-mail: gp.zammarchi@unica.it)

<sup>2</sup> Department of Mathematics and Physics, Charles University (e-mail: antoch@karlin.mff.cuni.cz)

**ABSTRACT**: Extracting information from written texts is of paramount importance to many entities (e.g. businesses, public organizations, individuals), but the exponential growth of available data has put this task beyond the reach of any single human being or business. Sentiment analysis is a tool to automatically transform the extracted information into knowledge. One of the main challenges is to assess whether a text is positive or negative, which can be tackled using a dictionary where each word has an associated positive or negative value, and then combining the single-word values into an overall text sentiment. In order to use such a lexicon-based approach, we need either an existing dictionary or to build a new one. In this work we present a new dictionary for sentiment analysis, developed using eye-tracking data to determine the relevance of words, and we assess its performance against existing dictionaries.

**KEYWORDS**: eye-tracking, sentiment analysis, lexicon, dictionary.

### **1 Introduction**

Sentiment analysis aims at classifying texts according to their polarity (positive or negative) using different approaches. The lexicon-based approach relies on a dictionary, i.e. a base tool where hundreds or thousands of words are associated with a polarity (negative/positive). In order to classify the polarity of a text, each word is searched in the dictionary. If the word is present, the value assigned to that word contributes to the overall text sentiment (along with the other words present both in the text and in the dictionary). To obtain a single value representative of the whole text, a summarizing function (e.g. average or sum) is applied. An important challenge in sentiment analysis is the definition of the weights to attribute to words, i.e. having instruments to decide which words should be assigned greater importance. In this sense, eye-tracking technology, which makes it possible to measure the exact position of the eyes during the visualization of texts, images or other visual stimuli, can help to understand which words are able to gain more attention from a reader and are thus potentially more relevant.

Aim of the present method is to develop a new dictionary for sentiment analysis using eye-tracking data as weights to attribute a different relevance to the words in a text, based on the attention they might receive.

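As a minimal sketch of the lexicon-based scoring just described, the snippet below looks each word up in a dictionary and sums the matched values; the sign of the sum gives the overall polarity. The toy lexicon and its values are invented for illustration and are far smaller than any real dictionary.

```python
# Toy lexicon: word -> signed value (invented, for illustration only).
toy_lexicon = {"good": 0.8, "great": 0.9, "bad": -0.7, "terrible": -0.9}

def score_text(text, lexicon):
    """Sum the lexicon values of the words in the text; unknown words count 0."""
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

def classify(text, lexicon):
    """Map the summed score to an overall polarity label."""
    s = score_text(text, lexicon)
    return "positive" if s > 0 else ("negative" if s < 0 else "neutral")
```

Here the summarizing function is the sum; replacing it with the average over matched words is an equally valid choice, as noted above.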
#### **2 Materials and methods**


#### **2.1 Development of the Eye-dictionary**

To develop a dictionary based on eye-tracking data, we focus on two main aspects: weights and polarities. Weights have been computed based on the Provo Corpus, a large corpus including eye-tracking data for 55 paragraphs taken from various sources (e.g. news articles, science magazines and public domain works of fiction). Each paragraph was read by an average of 40 participants. Across all texts, eye-tracking data in the form of dwell time for each word (i.e. total reading time, calculated as the sum of the durations of all fixations on a given word) are available for a total of 2,689 words (1,191 of which are unique). For each word *w* included in the corpus of eye-tracking data, the average dwell time over the total number of occurrences of the word in the corpus is calculated as in Eq. (1)

$$\frac{1}{n}\sum_{i=1}^{n} d_i^{w} \tag{1}$$

where *n* is the number of occurrences of the word *w* in the dataset and $d_i^w$ is the dwell time for the *i*-th occurrence of *w*. The average global dwell time over all word occurrences in the dataset is computed as in Eq. (2)

$$\frac{1}{m}\sum_{i=1}^{m} d_i \tag{2}$$

where *m* is the number of all occurrences of all words observed in the dataset and $d_i$ is the dwell time for occurrence *i*. The weight of each word *w* is then calculated as the ratio in Eq. (3)

$$v^{w} = \frac{\frac{1}{n}\sum_{i=1}^{n} d_i^{w}}{\frac{1}{m}\sum_{i=1}^{m} d_i} \tag{3}$$

and these values have been normalized using min-max normalization. Polarities are computed using a large dataset of 50,000 movie reviews, labeled as positive or negative (Maas et al., 2011). To assess whether a word has a positive or negative polarity, we compute a probability in the form of Eq. (4):

$$P(w_{pos}) = \frac{N_{w_{pos}}}{N_{w}} \qquad P(w_{neg}) = \frac{N_{w_{neg}}}{N_{w}} \tag{4}$$

where $P(w_{pos})$ is the probability that the word *w* is positive, $N_{w_{pos}}$ is the number of occurrences of the word *w* in positively labeled texts and $N_w$ is the total number of occurrences of the word *w*. The same computation is made for negatively labeled texts. Given the probabilities in Eq. (4), we assign a polarity to each word *w* as in Eq. (5)

$$p^{w} = \begin{cases} 1 & \text{if } P(w_{pos}) > P(w_{neg}) \\ 0 & \text{if } P(w_{pos}) = P(w_{neg}) \\ -1 & \text{otherwise} \end{cases} \tag{5}$$

Therefore, we assign the word *w* a positive (+1) or negative (-1) value when $P(w_{pos})$ is greater or lower than 0.5, respectively. If the probability is exactly 0.5, the word *w* is assigned 0 (neutral). For each word, a final value $s^w$ is then computed as the product of weight and polarity as in Eq. (6)

$$s^{w} = v^{w} \cdot p^{w} \tag{6}$$

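The dictionary construction in Eqs. (1)-(6) can be sketched on toy data as follows. All dwell times and label counts below are invented for illustration; this is not the actual Provo Corpus pipeline, and the min-max normalization step is omitted for brevity.

```python
from collections import defaultdict

# Invented occurrences: (word, dwell time in ms) pairs pooled over readers/texts.
occ = [("movie", 250), ("movie", 230), ("dull", 410), ("fun", 300), ("fun", 320)]

# Eq. (1): average dwell time of each word over its occurrences
times = defaultdict(list)
for w, d in occ:
    times[w].append(d)
avg_word = {w: sum(ds) / len(ds) for w, ds in times.items()}

# Eq. (2): global average dwell time over all occurrences of all words
avg_global = sum(d for _, d in occ) / len(occ)

# Eq. (3): weight of a word = its average dwell time / global average
# (the min-max normalization applied in the paper is omitted here)
weight = {w: a / avg_global for w, a in avg_word.items()}

# Eqs. (4)-(5): polarity from counts in positive- vs negative-labeled texts
counts = {"fun": (9, 1), "dull": (1, 7), "movie": (5, 5)}  # (N_w_pos, N_w_neg), invented
def polarity(n_pos, n_neg):
    p_pos = n_pos / (n_pos + n_neg)                           # Eq. (4): P(w_pos)
    return 1 if p_pos > 0.5 else (-1 if p_pos < 0.5 else 0)   # Eq. (5)

# Eq. (6): final dictionary entry s^w = v^w * p^w
eye_like = {w: weight[w] * polarity(*counts[w]) for w in weight}
```

Words read longer than average get |s| above 1 before normalization, so, after min-max scaling, attention-grabbing words dominate the text-level sum described in Section 2.2.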
#### **2.2 Assessment of the performance of the Eye dictionary and comparison with existing dictionaries**

The performance of the dictionary based on eye-tracking data in the classification of the sentiment polarity of texts has been assessed using two independent collections of labeled texts: 1,000 consumer reviews from Amazon (McAuley et al., 2013) and 1,000 consumer reviews from Yelp (Yelp dataset). On these texts, the Eye dictionary is compared with four existing dictionaries: Loughran-McDonald (2,702 words), SentiWordNet 3.0 (20,093 words), SO-CAL Google (3,290 words) and Hu Liu (6,874 words), extracted from the lexicon package in R (Rinker, 2018). For each text, a polarity value is calculated as the algebraic sum of the signed values assigned to each word by a dictionary. Finally, the number of texts correctly classified using the different dictionaries is compared.

#### **3 Results**

A total of 1,185 words for which weights and polarities were computed are included in the Eye dictionary (619 positive, 466 negative and 100 neutral). Table 1 shows the performance of the Eye dictionary and of the four other dictionaries in terms of precision, recall, F1-score and accuracy for the Yelp dataset (similar results were obtained using the Amazon dataset).

**Table 1. Comparison between the Eye dictionary and four other dictionaries (Yelp dataset)**

| | Eye dictionary | | Loughran-McDonald | | SentiWordNet | | SO-CAL Google | | Hu Liu | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Pos | Neg | Pos | Neg | Pos | Neg | Pos | Neg | Pos | Neg |
| Precision | 0.60 | 0.55 | 0.38 | 0.30 | 0.54 | 0.56 | 0.48 | 0.42 | 0.58 | 0.68 |
| Recall | 0.39 | 0.74 | 0.46 | 0.23 | 0.63 | 0.46 | 0.74 | 0.19 | 0.81 | 0.41 |
| F1-score | 0.47 | 0.63 | 0.41 | 0.26 | 0.58 | 0.51 | 0.58 | 0.27 | 0.67 | 0.51 |
| Accuracy | 0.56 | | 0.35 | | 0.55 | | 0.46 | | 0.61 | |

The Eye dictionary showed the best precision for positive texts, the best recall for negative texts and the second-best accuracy after the Hu Liu dictionary. The Eye dictionary correctly classified more texts than two of the four dictionaries (Loughran-McDonald and SO-CAL Google) in the Amazon dataset and three of the four (Loughran-McDonald, SentiWordNet and SO-CAL Google) in the Yelp dataset. Hu Liu was the only dictionary showing a better performance in both datasets.

Overall, all dictionaries showed only a modest performance in this preliminary analysis, which could be improved by applying rules for handling cases such as the presence of negations, amplifiers and downtoners. Notably, the Eye dictionary achieved a performance similar to or better than most of the other dictionaries, even though it includes a much smaller number of words.

### **4 Conclusions**

In this work we presented a new sentiment analysis dictionary built by leveraging eye-tracking data to assign weights to words based on their ability to attract a reader's attention. To this aim, dwell time is used as a measure of the relevance of a word. Future developments include expanding the number of words in the dictionary, as well as evaluating its performance with rules to handle cases in which classification is particularly challenging, such as sentences including negations, amplifiers and downtoners.

### **References**

KOTZIAS, D., DENIL, M., DE FREITAS, N., & SMYTH, P. 2015. From group to individual labels using deep features. *Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. ACM, USA, 597–606.

MAAS, A., DALY, R. E., PHAM, P. T., HUANG, D., NG, A. Y., & POTTS, C. 2011. Learning word vectors for sentiment analysis. *Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11)*. ACL, USA, 142–150.

MCAULEY, J. J., & LESKOVEC, J. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. *RecSys '13: Proceedings of the 7th ACM Conference on Recommender Systems*, 165–172.

RINKER, T. W. 2018. lexicon: Lexicon Data version 1.2.1. http://github.com/trinker/lexicon


The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research.

**Giovanni C. Porzio** PhD, is Professor of Statistics in the Department of Economics and Law at the University of Cassino and Southern Lazio. His research interests include directional statistics, statistical learning, nonparametric multivariate analysis and data depth, graphical methods and data visualization.

**Carla Rampichini** PhD, is full professor of Statistics and head of the Department of Statistics, Computer Science and Applications 'G. Parenti' of the University of Florence. Her research interests relate to random effects models for multilevel analysis, multivariate analysis and evaluation of educational systems.

**Chiara Bocci** PhD, is a Researcher in Statistics at the Department of Statistics, Computer Science and Applications "G. Parenti" of the University of Florence. Her current research interests include statistical analysis of spatially referenced data, small area estimation methods, and statistical models for skewed variables.

> ISSN 2704-601X (print) ISSN 2704-5846 (online) ISBN 978-88-5518-340-6 (PDF) ISBN 978-88-5518-341-3 (XML) DOI 10.36253/978-88-5518-340-6

www.fupress.com