**Artificial Intelligence: Foundations, Theory, and Algorithms**

## Qionghai Dai Yue Gao

# Hypergraph Computation

## **Artificial Intelligence: Foundations, Theory, and Algorithms**

#### **Series Editors**

Barry O'Sullivan, Dep. of Computer Science, University College Cork, Cork, Ireland

Michael Wooldridge, Department of Computer Science, University of Oxford, Oxford, UK

*Artificial Intelligence: Foundations, Theory and Algorithms* fosters the dissemination of knowledge, technologies and methodologies that advance developments in artificial intelligence (AI) and its broad applications. It brings together the latest developments in all areas of this multidisciplinary topic, ranging from theories and algorithms to various important applications. The intended readership includes research students and researchers in computer science, computer engineering, electrical engineering, data science, and related areas seeking a convenient way to track the latest findings on the foundations, methodologies, and key applications of artificial intelligence.

This series provides a publication and communication platform for all AI topics, including but not limited to:


This series includes monographs, introductory and advanced textbooks, state-of-the-art collections, and handbooks. Furthermore, it supports the Open Access publication mode.

Qionghai Dai, Department of Automation, Tsinghua University, Beijing, China

Yue Gao, School of Software, Tsinghua University, Beijing, China

This work was supported by Tsinghua University

ISSN 2365-3051, ISSN 2365-306X (electronic)

Artificial Intelligence: Foundations, Theory, and Algorithms

ISBN 978-981-99-0184-5, ISBN 978-981-99-0185-2 (eBook)

https://doi.org/10.1007/978-981-99-0185-2

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

## **Preface**

Artificial intelligence is now everywhere and fuels both industry and daily life all over the world. We are in the era of "big data," in which huge amounts of information can be obtained that are too cumbersome for people to process themselves. In many areas, such as computer vision and social media, these data also carry complex correlations. For example, the complex correlations among pixels in an image reveal its semantic information, and different types of correlations among social posts help infer the users' emotions. Therefore, developing effective AI methods to exploit such complex data correlations has become an urgent but challenging task.

Graphs have been widely used to formulate data correlations. A graph is a nonlinear data structure composed of a set of vertices and a set of edges, where the edges represent pairwise correlations among vertices. Graph learning and graph neural networks have attracted much attention in both research and industry and have become very active topics in recent years. However, the world is far more complex than just pairwise connections, and thus graph-based methods still have limitations in high-order correlation modeling.

The hypergraph, as a generalization of the graph, is able to formulate such high-order correlations among data and has been investigated over the last decades. Recent years have witnessed growing popularity of research on hypergraph-related AI methods, which have been used in computer vision, social media analysis, etc. We noticed that there was still no theoretical book that systematically introduces the recent achievements in this field, and so we started preparing this book. We summarize these attempts as a new computing paradigm, called hypergraph computation, which formulates the high-order correlations underlying the data using a hypergraph and then conducts semantic computing on the hypergraph for different applications.

In this book, we introduce recent progress in hypergraph computation, from hypergraph modeling to hypergraph neural networks. The applications of hypergraph computation are also discussed. We also summarize the recent achievements and useful tools in hypergraph computation. This book can be regarded as both a theoretical book and a manual on how to use hypergraph computation in practice.

#### **Book Organization**

This book comprises 13 chapters organized in 3 parts. The first part introduces the fundamental knowledge of hypergraph computation. In this part, Chap. 1 presents the basic knowledge, applications, and history of hypergraphs. The mathematical foundations of hypergraphs are introduced in Chap. 2. Three general paradigms of hypergraph computation are provided in Chap. 3.

The second part focuses on hypergraph modeling and learning techniques. The first step of hypergraph computation is to construct a hypergraph that formulates the high-order correlations among data, which is described in Chap. 4. Typical hypergraph computation tasks are then provided in Chap. 5, including label propagation, data clustering, cost-sensitive learning, and link prediction. We further introduce hypergraph structure evolution methods for hypergraph optimization in Chap. 6. Neural networks on hypergraphs are introduced in Chap. 7. Practical applications of hypergraph computation require the capability of handling large-scale data. Therefore, we give an extensive introduction to large-scale hypergraph computation in Chap. 8.

The third part introduces the applications of hypergraph computation in several fields, including social media analysis in Chap. 9, medical and biological applications in Chap. 10, and computer vision in Chap. 11. This part also introduces the DeepHypergraph library, a hypergraph computation library based on Python, in Chap. 12, and the future advancement of hypergraph computation research in Chap. 13.

#### **Prerequisites**

This book is designed for advanced undergraduate and graduate students, postdoctoral researchers, lecturers, researchers, and industrial engineers, as well as anyone interested in AI, especially hypergraph computation. Readers are expected to have basic knowledge of probability, linear algebra, and machine learning. Prior knowledge of graph theory is helpful but not mandatory. Besides the theoretical part from Chap. 3 to Chap. 8, we have also provided a series of applications from Chap. 9 to Chap. 11, which can be used as guidelines for the deployment of hypergraph computation in practice.

### **Contact Information**

We welcome any feedback, corrections, and suggestions on the book, which may be sent to *gaoyue@tsinghua.edu.cn*. The readers can also find updates about the book from the personal homepage at www.gaoyue.org.

Qionghai Dai and Yue Gao

Beijing, China, December 2022

## **Acknowledgments**

The authors would like to acknowledge the support and contributions of research collaborators, who have provided insightful comments and suggestions. For the whole book, we thank Yifan Feng, Shuyi Ji, Yutong Jiang, Qingmei Tang, Jielong Yan, Xinwei Zhang, Yubo Zhang, Zizhao Zhang, and Hao Zhong for the preparation of initial drafts, and thank Jialiang Cheng, Yue Dai, Lou Fang, Jiashu Han, Jiangang Huang, Kejie Huang, Tao Jin, Renjie Li, Zhi Li, Jun Ma, Bohua Wang, Yuehang Wang, Xu Wu, Chengwu Yang, Yifan Zhang, and Zhikuan Zhou for proofreading and corrections.

We started the plan for this book in 2018 and finished the preparation in 2022. We sincerely thank the Springer Senior Editor, Dr. Celine Lanlan Chang, for providing insightful comments and suggestions during the preparation of this book. We are also grateful to Springer's production editor, Jayesh Kalleri, for offering invaluable help during the preparation of this manuscript.

Special thanks also go to Dr. Shihui Ying for patiently discussing the book framework, reviewing early versions of this manuscript, and offering high-quality suggestions, which have significantly improved this book.

We also express our appreciation to our organizations, the Department of Automation at Tsinghua University, the School of Software at Tsinghua University, the Institute for Brain and Cognitive Sciences at Tsinghua University (THUIBCS), and the Broadband Network and Digital Multimedia Lab at Tsinghua University, which have provided outstanding support and facilities for preparing this book.

Finally, and most importantly, a very heartfelt thank you to our families, for their constant support, encouragement, patience, and understanding during the whole journey.

This book is supported by the Natural Science Foundation of China (62088102 and U1701262).


## **Chapter 1 Introduction**

**Abstract** High-order correlations among data exist widely in various practical applications. Compared with the simple graph, which can only model the pairwise relationship between two subjects, the hypergraph is a flexible and representative model for formulating high-order correlations. Based on the hypergraph model, there have been many efforts to design computation frameworks and analyze high-order correlations. In this chapter, we briefly introduce hypergraph computation, including its background, definition, history, recent challenges, and objectives.

#### **1.1 Background**

The basic elements of many natural and artificial systems depend on each other, which calls for correlation modeling and analytic methods to study them. Graphs are all around us, and in general, objects in the real world can be characterized by their connections with other objects. These connections can be described as a graph, which is a common data structure in many cases. For example, a graph can depict the roads in a city, where each road is represented by an edge showing the spatial connection between two locations. Graphs are also employed in airline route maps, in which each vertex is an airport and each edge is an air route.

Recently, the most challenging data processing problems have come from connected data rather than from isolated data points. How to exploit the underlying connections in the data has become an urgent and important task in many applications. Generally, graphs have been used to formulate such correlations among data. A graph is a nonlinear data structure composed of a set of vertices and a set of edges. Here, the vertices represent the subjects to be analyzed, and each edge is a line connecting two vertices. Figure 1.1 shows an example of a graph.

As a common way to model pairwise correlations among data, a graph represents the components of a system by its vertices and the associations between components by its edges. In this way, the association pattern is abstracted by the topological structure of the graph. In the past decades, it was not easy to apply graph theory in practice because of limited computing power. In recent years, with the advancement of information technology and computing power, graph theory has demonstrated its practical value. As the scale of data has grown, scientists have come up with the concept of network science. The study of network science can be applied in various fields. For example, by studying the connections between terminals on the Internet, the efficiency of data transmission in a network can be estimated. The study of interpersonal relationships can help understand the way people communicate with each other, disseminate information, and form communities. Studying the transmission chain of infectious diseases can help predict risks in time and thus interrupt transmission and prevent their spread. People have also found that many biological, social, information, and other real networks have nontrivial structural patterns in the connections among their elements. These patterns reflect meaningful features of the whole network. For example, the small-world phenomenon (the average path length in the network does not increase significantly with the increase of the network size) widely exists in social networks [1]. Another example is the scale-free network [2], in which the vertex degree distribution follows a power law, and this phenomenon is observed in some biological metabolic networks [3].

However, the world is far more complex than just pairwise connections. Typical examples include social networks, protein–protein interaction networks, and brain networks. In social networks, the individual characteristics of users are related to the interaction patterns among users. Users with similar characteristics are more likely to connect with each other to form a social group. The social relationships of users also affect their profiles. We notice that the correlations among these users are not just pairwise connections but also group-like connections, which are more complex than pairwise connections. Figure 1.2 shows an example of social connections, in which each user could have different types of connections with two or more other users or items.

**Fig. 1.2** An illustration of pairwise correlations and high-order correlations between/among users

In human brain networks, the cerebral cortex contains more than 10<sup>11</sup> neurons, and a cluster of neurons with similar functions and connections forms a nucleus. The nuclei can be further divided into different brain regions, resulting in a multi-level and multi-scale complex brain network. For example, the whole brain map includes the Insula and Cingulate Gyri, Frontal Lobe, Occipital Lobe, Parietal Lobe, and other regions, which can be further divided into the 90 brain regions provided in the AAL atlas [4], such as the Hippocampus and Parahippocampus. Each neuron can have more than 10,000 synapses, which can connect the neurons in the brain to other neurons in the rest of the body or connect the neurons to the muscles. The connections among the neurons are complex and hard to formulate in a graph, although the graph is a typical way to model such correlations in the brain.

Such complex correlations, i.e., high-order correlations rather than pairwise ones, are very common in real-world data. To study these complex systems, it is necessary to characterize and analyze the high-order relationships between their elements. Empirical studies have shown that the correlation patterns of a system often play an important role in the functions of the system. In recent years, more researchers have begun to pay attention to this field and to apply high-order correlation modeling and analytic methods.

At the beginning of the development of machine learning on graphs and network science, only graphs were used to model networks or correlations, and the associations between the elements of a system were generally described by the topological structure of the graph. As a result, the pairwise connections could be described in the graph, while a large amount of semantic information in the system could be lost, and descriptive features in the network could not be extracted. Some well-discussed network properties, such as degree centrality, semi-locality centrality, and closeness centrality, were all based on such a static single network model. The underlying high-order information in the data had to be reduced to pairwise relations for processing, which may lead to serious information loss. With the development of big data, the explosive growth of data has revealed their complexity and diversity, which calls for more expressive data modeling methods. Network modeling methods for complex data types, complex topological structures, and complex connection patterns have therefore emerged. For example, the social closeness between individuals in a social network can be strong or weak, and a system with weights on the associations between vertices can be modeled using a weighted network [5]. Also, the power network and the communication network are inter-dependent in infrastructure construction. The vertices of the communication network provide control signals to the vertices of the power network, whereas the vertices of the power network supply power to the vertices of the communication network. The interdependence between different networks can be modeled using an inter-dependent graph [6]. Another example is the air transportation network, where the routes between the vertices may belong to different airlines. For the heterogeneity of object types and association relationships, the concept of the multi-layer network or graph has been proposed [7]. The last example is the ecological food chain in a species network, which changes with seasonal environmental conditions. For such dynamic systems, the concept of the temporal network has been introduced [8] to formulate the correlations among the subjects.

Although graph-based methods have been developed for decades and great progress has been achieved, they still have limitations. These graph models can formulate the binary relationships between the elements in a system well, but they may ignore the high-order correlations among three or more elements. In recent years, many studies have shown that modeling and optimizing high-order correlations are even more important in most applications [9–11]. For example, in the biosphere system, the high-order interactions between species ensure stable diversity of species [10]. The high-order characteristics of different networks can effectively distinguish their fields [11]. With the rapid development of network science, the complexity of data and correlations increases rapidly. In the fields of biological information, social computing, and image processing, there are large amounts of multi-modal, heterogeneous, high-level data, and there is a need for effective high-order correlation modeling and optimization methods.
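To make the limitation of pairwise models concrete, the following minimal sketch reduces hyperedges to pairwise edges by connecting every pair of vertices inside each hyperedge. This reduction is commonly called clique expansion; that name, the function `clique_expansion`, and the toy data are assumptions chosen for illustration and are not taken from the text above. The sketch shows that one group of three and three separate pairs collapse to the same graph, so the group-level information cannot be recovered from the pairwise edges alone.

```python
from itertools import combinations


def clique_expansion(hyperedges):
    """Reduce hyperedges to pairwise edges by linking every vertex pair in each hyperedge."""
    pairs = set()
    for edge in hyperedges:
        for u, v in combinations(sorted(edge), 2):
            pairs.add(frozenset((u, v)))
    return pairs


# Hypothetical data: one three-person group versus three two-person conversations.
one_group_of_three = [{"a", "b", "c"}]
three_separate_pairs = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

graph_from_group = clique_expansion(one_group_of_three)
graph_from_pairs = clique_expansion(three_separate_pairs)

# Both inputs yield the same triangle, so the graph cannot tell them apart.
same_graph = graph_from_group == graph_from_pairs
```

Here `same_graph` evaluates to `True`, even though the two inputs describe different high-order structures.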

As the subject of interdisciplinary study in many different fields, including computer science, physics, and biology, high-order correlation modeling and optimization have attracted much attention in recent decades. There are a large number of high-order relationships in many real-world systems [12]. For example, in social networks, people form groups of three or more to communicate, and in academic networks, multiple authors cooperate to write an article. Protein interactions in biological networks may occur between multiple proteins, and gene expression is driven by high-order interactions between biomolecules [13]. High-order associations among elements are difficult to describe with the topology of simple graphs. Under such circumstances, corresponding mathematical expressions have been introduced, such as set systems [14], simplicial complexes, and hypergraphs [15]. However, how to deploy these mathematical expressions in a computation paradigm is still an open problem. The complexity of high-order correlations is much higher than that of pairwise correlations, which brings new challenges to computation paradigms.

The hypergraph, as a generalization of the graph that is able to formulate high-order correlations among data, has been investigated recently. In this book, we introduce recent progress on hypergraph computation, from hypergraph modeling to hypergraph neural networks. Below we first introduce the basic definitions of the hypergraph and then show the applications and research history of hypergraphs. Finally, we provide a summary of our works in hypergraph computation and the structure of this book.

#### **1.2 The Definition of Hypergraph**

The hypergraph is an important concept in discrete mathematics and is a generalization of the graph. Therefore, many concepts of hypergraphs can be defined in relation to the well-known definitions for graphs. A *hypergraph* is defined as a pair of a hypervertex set and a hyperedge set. The *hypervertex set*, also called the *vertex set*, is a finite set, whereas each *hyperedge* is a subset of the vertex set. As a hyperedge can connect any number of vertices, more general types of relationships can be modeled by hypergraphs than by graphs. The order and the size of the hypergraph can be defined based on the vertex set and hyperedge set, i.e., the *order of the hypergraph* is the cardinality of the vertex set, and the *size of the hypergraph* is the cardinality of the hyperedge set.
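As a minimal sketch of this definition, a hypergraph can be stored as a vertex set together with a set of hyperedges, each checked to be a subset of the vertex set. The class name `Hypergraph`, the helper `make_hypergraph`, and the toy data are hypothetical names chosen for illustration; they are not an API from this book or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypergraph:
    """A hypergraph as a pair (vertex set, hyperedge set)."""
    vertices: frozenset
    hyperedges: frozenset  # a set of frozensets, each a subset of `vertices`

    def __post_init__(self):
        # Every hyperedge must be a subset of the vertex set.
        for edge in self.hyperedges:
            if not edge <= self.vertices:
                raise ValueError(f"hyperedge {set(edge)} is not a subset of the vertex set")

    @property
    def order(self):
        """Order of the hypergraph: cardinality of the vertex set."""
        return len(self.vertices)

    @property
    def size(self):
        """Size of the hypergraph: cardinality of the hyperedge set."""
        return len(self.hyperedges)


def make_hypergraph(vertices, hyperedges):
    """Convenience constructor (hypothetical) accepting plain sets and lists."""
    return Hypergraph(frozenset(vertices), frozenset(frozenset(e) for e in hyperedges))


# Toy example: four vertices, one hyperedge of three vertices and one of two.
hg = make_hypergraph({"a", "b", "c", "d"}, [{"a", "b", "c"}, {"c", "d"}])
print(hg.order, hg.size)  # prints: 4 2
```

Storing the hyperedges as a set of frozensets keeps the size equal to the cardinality of the hyperedge set, at the cost of not supporting repeated hyperedges; a list would be the natural alternative if repeated hyperedges were needed.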

Similar to graphs, two specific types of hypergraphs can be defined, including the empty hypergraph and the trivial hypergraph.


Generally speaking, unless stated otherwise, hypergraphs have a nonempty vertex set and nonempty hyperedge set and do not contain empty hyperedges.

The *isolated vertex* denotes the vertex which is not contained in any of the hyperedges. Two vertices are *adjacent* if there exists a hyperedge containing both of these two vertices. Two hyperedges are *incident* if they have a nonempty intersection.
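The three notions above translate directly into set operations. The sketch below is illustrative only; the function names and the toy hypergraph are assumptions, not part of the text.

```python
def isolated_vertices(vertices, hyperedges):
    """Vertices that are not contained in any hyperedge."""
    covered = set().union(*hyperedges)
    return set(vertices) - covered


def are_adjacent(u, v, hyperedges):
    """Two vertices are adjacent if some hyperedge contains both of them."""
    return any(u in edge and v in edge for edge in hyperedges)


def are_incident(edge_a, edge_b):
    """Two hyperedges are incident if their intersection is nonempty."""
    return bool(set(edge_a) & set(edge_b))


# Toy hypergraph (hypothetical): five vertices and two hyperedges.
vertices = {1, 2, 3, 4, 5}
hyperedges = [{1, 2, 3}, {3, 4}]

print(sorted(isolated_vertices(vertices, hyperedges)))  # prints: [5]
print(are_adjacent(1, 3, hyperedges))                   # prints: True
print(are_adjacent(1, 4, hyperedges))                   # prints: False
print(are_incident(hyperedges[0], hyperedges[1]))       # prints: True
```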

The sub-hypergraph and partial hypergraph can be defined as follows:


Two special types of the hypergraph can be defined based on the degree:


The concept of connectivity is defined as follows. A *loop* is a hyperedge with only one element. A *path* is an alternating sequence of vertices and hyperedges, in which each vertex belongs to the hyperedges next to it in the sequence. A *cycle* is a path whose first vertex is the same as its last vertex. The *length* of a path is the number of vertices in the path. A path *connects* two vertices if these two vertices are in the path. A hypergraph is *connected* if every pair of vertices is connected; otherwise, it is *disconnected*. The *distance* between two vertices is the minimum length of a path connecting these two vertices. The *diameter* of the hypergraph is the maximum distance among all pairs of vertices.

Here, we provide an example of a hypergraph in Fig. 1.3. In this hypergraph, there are 11 vertices and 5 hyperedges. The hyperedge *e*<sub>1</sub> connects vertices *x*<sub>1</sub>, *x*<sub>2</sub>, *x*<sub>3</sub>, and *x*<sub>4</sub>. The hyperedge *e*<sub>2</sub> connects *x*<sub>4</sub>, *x*<sub>6</sub>, *x*<sub>7</sub>, and *x*<sub>8</sub>. The hyperedge *e*<sub>3</sub> connects *x*<sub>5</sub> and *x*<sub>6</sub>. The hyperedge *e*<sub>4</sub> connects *x*<sub>1</sub>, *x*<sub>5</sub>, and *x*<sub>8</sub>. The hyperedge *e*<sub>5</sub> is a loop, which only connects vertex *x*<sub>10</sub> to itself. Vertices *x*<sub>9</sub> and *x*<sub>11</sub> are two isolated vertices. The hypergraph is disconnected since *x*<sub>11</sub> is not connected with any other vertex. *x*<sub>3</sub> → *e*<sub>1</sub> → *x*<sub>1</sub> → *e*<sub>4</sub> → *x*<sub>8</sub> → *e*<sub>2</sub> → *x*<sub>7</sub> is a path from *x*<sub>3</sub> to *x*<sub>7</sub>, with length 4. The distance between *x*<sub>4</sub> and *x*<sub>5</sub> is 3 since a shortest path from *x*<sub>4</sub> to *x*<sub>5</sub> is *x*<sub>4</sub> → *e*<sub>2</sub> → *x*<sub>8</sub> → *e*<sub>4</sub> → *x*<sub>5</sub>.
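The example can be checked with a short breadth-first search. The sketch below encodes the hyperedge memberships listed above (vertex *x*<sub>i</sub> is written as the integer `i`) and follows the convention stated in the text that the length of a path counts its vertices. The function names are hypothetical, and breadth-first search is one straightforward choice for computing distances, not necessarily the method used elsewhere in this book.

```python
from collections import deque

# Hyperedge memberships as listed in the text for Fig. 1.3.
HYPEREDGES = {
    "e1": {1, 2, 3, 4},
    "e2": {4, 6, 7, 8},
    "e3": {5, 6},
    "e4": {1, 5, 8},
    "e5": {10},  # a loop
}
VERTICES = set(range(1, 12))


def neighbors(v, hyperedges):
    """All vertices that share at least one hyperedge with v."""
    result = set()
    for members in hyperedges.values():
        if v in members:
            result |= members
    result.discard(v)
    return result


def distance(src, dst, hyperedges):
    """Number of vertices on a shortest path from src to dst, or None if unreachable."""
    vertex_count = {src: 1}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return vertex_count[v]
        for w in neighbors(v, hyperedges):
            if w not in vertex_count:
                vertex_count[w] = vertex_count[v] + 1
                queue.append(w)
    return None


def is_connected(vertices, hyperedges):
    """True if every vertex can be reached from an arbitrary start vertex."""
    start = next(iter(vertices))
    return all(distance(start, v, hyperedges) is not None for v in vertices)


isolated = VERTICES - set().union(*HYPEREDGES.values())

print(sorted(isolated))                      # prints: [9, 11]
print(distance(4, 5, HYPEREDGES))            # prints: 3
print(is_connected(VERTICES, HYPEREDGES))    # prints: False
```

Note that the path from *x*<sub>3</sub> to *x*<sub>7</sub> given in the text is a valid path but not a shortest one: `distance(3, 7, HYPEREDGES)` returns 3, via *x*<sub>4</sub>.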

Besides Fig. 1.3, there are other typical illustrations of hypergraphs, which are shown in Fig. 1.4. In Fig. 1.4a, each circle represents a hyperedge. In Fig. 1.4b, all the lines with the same color represent a hyperedge and connect the vertices in that hyperedge. In Fig. 1.4c, each hollow circle indicates a hyperedge, and the lines with the same color link the vertices in the hyperedge.

**Fig. 1.4** Three typical hypergraph illustrations

It is noted that hypergraph-type structures may not be explicit in many applications; they are hidden behind the data that can be observed directly. In some cases, we may only capture some pairwise correlations among the data, while the high-order correlations need to be generated based on these observations. For example, some popular citation networks, such as Cora, Citeseer, and PubMed [16], are widely used for analysis, while all these datasets only contain graph-type data, which treat the articles as vertices and the citation relationships as links. Under such circumstances, to exploit the high-order correlations among these data, we need to transform these data into a hypergraph. As a typical method, a co-authorship hypergraph can be generated, which formulates the articles as vertices, and articles with the same author are connected by a hyperedge. In a similar way, a co-citation hypergraph can be generated, which treats the articles as vertices as well, and the articles with the same citation are treated as a hyperedge.
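The co-authorship construction described above can be sketched as follows, assuming that each author yields one hyperedge containing all articles written by that author. The function name `coauthorship_hyperedges` and the article and author identifiers are hypothetical; real datasets such as Cora would need their own loading step, which is not shown.

```python
from collections import defaultdict


def coauthorship_hyperedges(article_authors):
    """Map each author to the set of articles they wrote; each set is one hyperedge."""
    articles_by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            articles_by_author[author].add(article)
    return dict(articles_by_author)


# Hypothetical toy data: article id -> list of author names.
papers = {
    "p1": ["alice", "bob"],
    "p2": ["alice"],
    "p3": ["bob", "carol"],
    "p4": ["alice", "carol"],
}

edges = coauthorship_hyperedges(papers)
for author in sorted(edges):
    print(author, sorted(edges[author]))
# prints:
# alice ['p1', 'p2', 'p4']
# bob ['p1', 'p3']
# carol ['p3', 'p4']
```

A co-citation hypergraph could be built with the same grouping step by replacing the author lists with citation lists.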

#### **1.3 Applications of Hypergraph**

Hypergraphs have been applied across several disciplines, including biology, economics, and sociology, owing to their superiority in modeling complex correlations, which has promoted intelligent applications. In this part, we introduce several typical applications of hypergraphs to help understand this powerful tool.

One representative application is social computing. The social media data have been increasing rapidly over the past couple of decades, which can provide potential population-level insights. The hypergraph [17] is a useful tool for discovering the complex and hidden correlations from the data, in which the hypergraph structure can be used to formulate the high-order correlation in social networks.

In recommender systems, the hypergraph is used to model the user–item network, to profile the user, and to further predict preferences (future interactions). Given a raw user–item network with no information other than the historical interactions between users and items, the hypergraph [17] can be used to discriminatively formulate the high-order connectivities among users and among items separately and to conduct the collaborative filtering task. Sometimes the users and the items may carry different attributes or properties. For example, the user-side information may include gender, age, and personality, and the item-side information may contain the category, text description, and image. This attribute information can help capture the user's preferences. Therefore, another application of hypergraphs in recommender systems is attribute modeling and inference.

Another popular yet challenging social media computing application is sentiment analysis, which aims to recognize the real emotions and attitudes of people in social media contexts. The multi-modality and complexity of social media data make this task difficult. For example, text, images, and videos may coexist in one tweet. Additionally, there are intricate relationships between posts, such as in the dimensions of time, location, and user preferences. Therefore, how to find the complex relationships between tweets and analyze user sentiment has become an urgent issue. To this end, the hypergraph [18] can be used to formulate the correlations among samples and to conduct robust and accurate multi-modal sentiment prediction, taking into consideration that different moods have their own characteristics and that sentiment analysis should be based on the joint analysis of multiple sources of information.

As far as social event detection is concerned, exploring a set of highly related posts becomes more important, because a single post is often noisy and contains too little content to convey clear and comprehensive information. The hypergraph [19] can be used to characterize the relationships between heterogeneous data among different tweets, owing to its superiority in modeling high-order correlations between data of various posts, modalities, and times, therefore enabling real-time social event detection. Specifically, each microblog is connected with several textually related and visually related microblogs to form two hyperedges. Next, the microblog clique, a basic unit consisting of a set of highly related tweets, is produced by using the hypergraph cut method to group microblogs that are about the same subject.

Hypergraphs have also shown their advantages in medical and biological applications. In the past few decades, massive amounts of biological and medical data have been produced. The data are complex, heterogeneous, and multi-modal, with interwoven inter- and intra-data correlations. By concatenating hyperedge groups, the hypergraph [20–22] can naturally accommodate multi-modal or heterogeneous data. Moreover, in doing so, it can discriminatively use the complementary information among these data. The following pipeline describes how hypergraph computation is used in biological and medical tasks: (1) modeling the medical images, patches, or biological entities as vertices and connecting them with hyperedges based on their feature similarity or high-order topological links and (2) learning high-order correlations between data using a series of hypergraph computation methods. In this type of application, the hypergraph has been used for mild cognitive impairment (MCI) identification using magnetic resonance imaging (MRI) [23], COVID-19 identification using CT imaging [24], autism spectrum disorder (ASD) identification using brain functional networks [25], medical image retrieval [26], etc.
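Step (1) of the pipeline and the idea of concatenating hyperedge groups can be sketched as follows. The sketch assumes a k-nearest-neighbor rule in Euclidean feature space as the similarity criterion, which is one common choice and is not specified in the text above; the function name `knn_hyperedges` and the one-dimensional toy features for two modalities are likewise hypothetical.

```python
import math


def knn_hyperedges(features, k):
    """Build one hyperedge per vertex: the vertex plus its k nearest neighbors."""
    edges = []
    for i, feature_i in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        # Sort the other vertices by Euclidean distance to vertex i.
        others.sort(key=lambda j: math.dist(feature_i, features[j]))
        edges.append(frozenset([i, *others[:k]]))
    return edges


# Hypothetical features of the same four subjects in two modalities.
modality_a = [(0.0,), (0.1,), (1.0,), (1.1,)]
modality_b = [(0.0,), (1.0,), (0.1,), (1.1,)]

edges_a = knn_hyperedges(modality_a, k=1)
edges_b = knn_hyperedges(modality_b, k=1)

# Concatenating the two hyperedge groups gives one multi-modal hypergraph
# over the same vertex set.
combined = edges_a + edges_b
```

In this toy case, modality A groups subjects 0 and 1 together, while modality B groups subjects 0 and 2 together; the combined hyperedge list keeps both views instead of forcing a single similarity measure.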

The aforementioned examples are just a small part of hypergraph applications. Hypergraph computation techniques can be used in any case where high-order and complex correlations exist among data, such as computer vision, knowledge graphs, and so on.

#### **1.4 The History of Studies on Hypergraph**

#### *1.4.1 Topology and Coloring on Hypergraph*

Studies on hypergraphs and their use have a long history. In 1943, Prenowitz et al. [27] first illustrated several kinds of geometries (projective, descriptive, and spherical) as hypergroups or multigroups. Prenowitz et al. [28] created Geometries on Join Spaces, a unique hypergroup that has been proven to be a valuable tool in the study of a variety of topics, including graphs, hypergraphs, binary relations, fuzzy sets, and rough sets. In 1996, Rosenberg et al. [29] first addressed the relationships between hyperstructures (hypergraphs) and binary relations in the broadest sense. Later, these relationships were also studied by Corsini and Leoreanu [30]. Rosenberg et al. [29] also first developed join spaces related to fuzzy sets in 1996. Corsini, Leoreanu, and Tofan [31] have reexamined these structures. Zahedi et al. [32] also advanced the concepts of linking a hypergraph with a fuzzy set and examining algebraic structures equipped with a fuzzy structure.

Hypergraph coloring is a typical and important task, which has attracted much attention since the last century. It is fundamental to combinatorics and can be used to determine bounds for the chromatic number of some graphs, as described by Kierstead et al. [33]. Lu et al. [34] suggested hypergraph coloring algorithms for solving different optimization problems, such as divide-and-conquer and partition problems, in which hypergraph coloring can also be used to find monochromatic paths and cycles. Voloshin et al. [35, 36] described how to color mixed hypergraphs, in which the hyperedges are divided into hyperedge and anti-hyperedge families, and further applied this coloring to the energy supply problem.

The problem of finding large matchings is closely related to the problem of bounding the chromatic index of a hypergraph (notice that the color classes of a proper edge-coloring form a matching). As a classical subject in the study of graphs, matching theory is very well developed and goes back to the work [37] in the 1930s. Tutte's theorem [38] is a characterization of graphs that contain perfect matchings. Edmonds et al. [39] proposed the Blossom algorithm, which finds a maximum matching in a general graph in polynomial time. The above methods are early works on hypergraph-related research.

#### *1.4.2 Hypergraph Partitioning, Clustering, and Machine Learning*

Hypergraph partitioning is another important problem on hypergraph. The Encyclopedia of Parallel Computing<sup>1</sup> defines hypergraph partitioning as dividing a hypergraph into two or more roughly equal parts in such a way that the cost function of the hyperedges connecting vertices in different parts is minimized. In many cases, this definition is too restrictive, and more than two parts are required. Karypis et al. [40] proposed the hMetis algorithm, which is based on multilevel coarsening of hypergraphs. The method iteratively bisects coarsened hypergraphs, starting with the smallest. George et al. [41] further developed the hMeTiS-Kway algorithm, which directly constructs a K-way partitioning of a hypergraph with the coarse–uncoarse paradigm to solve the K-way hypergraph partitioning problem.

<sup>1</sup> https://link.springer.com/referenceworkentry/10.1007/978-0-387-09766-4\_1.

Besides, Papa et al. [42] provided several methods of partitioning hypergraphs and defined clustering as "the process of merging vertices into larger groups of vertices known as clusters to compute a coarser hypergraph from an input hypergraph." A number of applications of partitioning and clustering are also given, including VLSI design, numerical linear algebra, automated theorem proving, and formal verification. For more details, a survey of clustering ensemble techniques has been published in [43], which includes hypergraph partitioning techniques as well. Multilevel strategies are often required in clustering and partitioning and have been well studied in previous works. Hypergraph partitioning has been extensively used in VLSI design [40], parallel scientific computing [44–46], image categorization [47], and social networks [48, 49].

In this century, hypergraph has been used in machine learning. Transductive hypergraph learning [48] was introduced to give the basic mathematical formulation of the objective function for predicting labels of vertices on a hypergraph. Since the performance of hypergraph learning is related to the modeling quality of the hypergraph, there are some efforts to further assign weights to the components in the hypergraph, including hyperedges, vertices, and hyperedge-dependent vertex weights [50, 51]. To accelerate the label propagation process on hypergraph, the cross diffusion on multiple hypergraphs was further introduced to model the high-order correlations among multi-modal data and conduct multi-modal information fusion [52].

#### *1.4.3 Deep Learning on Hypergraph*

Research on high-order representations of hypergraph structures has also been inspired by deep learning's powerful learning and modeling abilities. Generally speaking, most deep learning methods on hypergraph can be divided into spectral-based methods and spatial-based methods.

As for the spectral-based methods, Feng et al. [53] proposed Hypergraph Neural Networks (HGNNs) to model non-pairwise relations based on the hypergraph Laplacian. Multi-modal data can be naturally modeled using the proposed methods. It is also possible to classify images using hypergraph neural networks [54]. Using tools from the spectral theory of hypergraphs, Yadati et al. [55] proposed HyperGCN, which trains graph convolutional networks (GCNs) for semi-supervised learning on hypergraphs. As for the spatial-based methods, by extending dynamic hypergraph learning, Jiang et al. [56] proposed a dynamic hypergraph neural network, which can adaptively change the hypergraph structure at each layer. As opposed to hypergraph convolution, where the underlying structure is defined beforehand, Bai et al. [57] proposed a hypergraph attention mechanism to learn a dynamic connection of hyperedges, which propagates and gathers information in the task-relevant parts of the graph, thereby generating more discriminative vertex embeddings. Moreover, Gao et al. [58] proposed a general hypergraph neural network framework, which can be applied to multiple types of hypergraphs, such as undirected hypergraphs, directed hypergraphs, probabilistic hypergraphs, and vertex/hyperedge weighted hypergraphs.

For homogeneous and heterogeneous hypergraphs, Zhang et al. [59] proposed a self-attention-based hypergraph neural network (Hyper-SAGNN). By mapping the hypergraph to a weighted attribute line graph, Bandyopadhyay et al. [60] achieved a bi-injective hypergraph structure. Huang et al. [61] proposed UniGNN, which can generalize general GNN models to hypergraphs by interpreting the message passing process in graph and hypergraph neural networks. These neural network methods on hypergraph enable representation learning that incorporates high-order correlations in the process.

#### **1.5 Hypergraph Computation: Challenges and Objectives**

Hypergraph has an advantage in high-order correlation modeling compared with graph and other structures. To exploit this advantage in practice, hypergraph can be used to formulate such correlations and then conduct computing tasks accordingly. In this part, we summarize the objective of hypergraph computation, especially the main challenges and the tasks inside.

Below we give the definition of hypergraph computation: **hypergraph computation** is to formulate the high-order correlations underneath the data using hypergraph and then conduct semantic computing on the hypergraph for different applications.

The main challenges and objectives in hypergraph computation are from three parts, including how to generate a hypergraph, how to deal with large scale data, and how to conduct learning on hypergraph.

1. **How to generate a hypergraph.** In most cases, the hypergraph structure does not explicitly exist. What can be observed could be unstructured data, such as images, videos, and discrete signals, or pairwise relationships between two subjects. To reveal the underlying high-order correlation as a hypergraph, it is necessary to define how to generate it. More importantly, the observed data could be noisy, incomplete, and multi-modal, and how to describe these data is also challenging. Under such circumstances, it is difficult to generate an accurate hypergraph structure based on these data. Therefore, how to generate a hypergraph, especially a good hypergraph structure for a specific task, is the first challenge in practice.


Hypergraph modeling can be briefly divided into two categories, i.e., the intra-correlation modeling and the inter-correlation modeling, as shown in Fig. 1.5. Here, the intra-correlation modeling regards the high-order correlations inside the subject. The components of the subject are represented as the vertices, and the correlations among these components are represented as hyperedges in the hypergraph. In these cases, the hypergraph, named intra-hypergraph, aims to represent the subject itself. The inter-correlation modeling concentrates on the high-order correlations among different subjects. A group of subjects is represented as the vertices, and the correlations among these subjects are represented as hyperedges in the hypergraph, named inter-hypergraph. The objective is to learn the representation or connections of the target subject with the help of its correlations to other subjects. Here we take image representation as an example. When an image is selected as the subject, the correlations among the pixels or the patches in the image are intra-correlations, and the corresponding intra-hypergraph can be generated for image representation. On the other side, we can also observe other images for processing. The correlations among the subject image and other images are inter-correlations, and the corresponding inter-hypergraph can be generated for image representation too. That is to say, the intra- and inter-correlations can be regarded as views from different scales. If we take the subject itself as the target system, the correlations of the subject and other subjects are inter-correlations of the subject, corresponding to an inter-hypergraph. If we take the group of subjects as the target system, the correlations of these subjects are intra-correlations, leading to an intra-hypergraph accordingly.

**Fig. 1.5** The intra-hypergraph and the inter-hypergraph based on the intra-correlations and the inter-correlations among components and subjects

#### **1.6 Structure of This Book**

This book is composed of 13 chapters, and the structure of the remaining chapters is introduced here.


and the spatial-based methods. The comparison between graph neural networks and hypergraph neural networks is also provided in this chapter.


#### **1.7 Summary**

In this chapter, we introduce the basic ideas and background of hypergraph computation. We also provide the applications and the related research history on hypergraph. The idea of hypergraph computation is introduced and discussed in detail in this chapter. We also summarize our studies on hypergraph computation and present the organization of this book.

#### **References**



## **Chapter 2 Mathematical Foundations of Hypergraph**

**Abstract** In this chapter, we introduce the mathematical foundations of hypergraph and present the mathematical notations that are used to facilitate deep understanding and analysis of hypergraph structure. A hypergraph is composed of a set of vertices and hyperedges, and it is a generalization of a graph, where a weighted hypergraph quantifies the relative importance of hyperedges or vertices. Hypergraph can also be divided into two main categories, i.e., the undirected hypergraph representation and the directed hypergraph representation. The latter one further divides the vertices in one hyperedge into the source vertex set and the target vertex set to model more complex correlations. Additionally, we discuss the relationship between hypergraph and graph from the perspective of structural transformation and expressive ability. The most intuitive difference between a simple graph and a hypergraph can be observed in the size of order and expression of adjacency. A hypergraph can be converted into a simple graph using clique expansion, star expansion, and line expansion. Moreover, the proof based on random walks and Markov chains establishes the relationship between hypergraphs with edge-independent vertex weights and weighted graphs.

#### **2.1 Introduction**

The importance of high-order complex network modeling has been discussed in Chap. 1. In this chapter, we introduce the basic knowledge of hypergraph. In a hypergraph, the degree of an edge is usually higher than in a simple graph, where it is always two. Different from a graph structure that can model pairwise connections with its 2-degree edges, a hypergraph can model correlations between practical data that are much more complex than pairwise relationships. As a result of its versatility and usefulness in modeling complex correlations of data, machine learning on hypergraph has attracted increasing attention.

Machine learning methods on hypergraph have been used in many real-world applications due to their advantages. A wide variety of tasks have been performed with hypergraph in computer vision, including image retrieval [1], 3D object classification [2], video segmentation [3], person re-identification [4], hyper-spectral image analysis [5], landmark retrieval [6], and visual tracking [7]. It is possible to embed a wide range of subjects into a hypergraph structure for these tasks. In different tasks, the hypergraph structure can be used to formulate the correlation among a variety of subjects. In image retrieval [3], the correlation among different images can be modeled in a hypergraph, where each vertex denotes an image and the hyperedges can be generated by finding similar image features. In 3D object classification [2], the correlation among different 3D objects can be modeled in a hypergraph, where each vertex denotes a 3D object and the hyperedges can be generated based on the similarity among these 3D objects. In person re-identification [4], a hypergraph structure can be constructed, where each vertex represents an image of a person and the hyperedges can be generated based on the similarities in the feature space. Similar modeling attempts have been deployed in medical image analysis and bio-informatics studies to identify genes [8, 9], predict diseases [10, 11], identify sub-types [12], and analyze functional networks [13].

Before detailed introduction of the hypergraph computation paradigm, hypergraph modeling, and other related methods and applications, in this chapter, we first present preliminary knowledge of hypergraph and multiple representations of hypergraph. We also compare the hypergraph structure with the graph structure from four aspects.

#### **2.2 Preliminary Knowledge of Hypergraph**

The basic concepts of hypergraph are hereby briefly discussed. Table 2.1 provides the main notations and definitions of hypergraphs throughout this chapter. We first introduce undirected hypergraph and directed hypergraph, respectively, and then introduce the K-uniform hypergraph, probabilistic hypergraph, the relationship between hypergraph and bipartite graph, and the weights on hypergraph.

#### *2.2.1 Undirected Hypergraph*

Let $\mathcal{G}$ denote an undirected hypergraph, which consists of a set of vertices $\mathcal{V}$ and a set of hyperedges $\mathcal{E}$. In a weighted hypergraph, each hyperedge $e \in \mathcal{E}$ is assigned a weight $w(e)$, which represents the importance of that connection within the whole hypergraph. Let $\mathbf{W}$ denote the diagonal matrix of the hyperedge weights, i.e., $\text{diag}(\mathbf{W}) = \left[w(e_1), w(e_2), \ldots, w(e_{|\mathcal{E}|})\right]$. Given a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$, the structure of the hypergraph is usually represented by an incidence matrix $\mathbf{H} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{E}|}$, with each entry $\mathbf{H}(v, e)$ indicating whether the vertex $v$ is in the hyperedge $e$:

$$\mathbf{H}(v,e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{if } v \notin e, \end{cases} \tag{2.1}$$


**Table 2.1** Notations and definitions of hypergraphs

where **H***(v, e)* indicates the possibility of vertex *v* assigned to hyperedge *e* or the importance of vertex *v* for hyperedge *e*. The degree of hyperedge *e* and the degree of vertex *v* are defined as follows:

$$\delta(e) = \sum_{v \in \mathcal{V}} \mathbf{H}(v, e), \tag{2.2}$$

and

$$d(v) = \sum_{e \in \mathcal{E}} w(e) \, \mathbf{H}(v, e). \tag{2.3}$$

The traditional hypergraph structure creates associations among vertices, with a single hyperedge connecting multiple vertices that have associations. All vertices on the same hyperedge are given a value of 1 in the incidence matrix **H**. The incidence matrix **H** is calculated as in Eq. (2.1), and its elements take the value 0 or 1. Each row corresponds to a vertex of the hypergraph, and each column corresponds to a hyperedge and marks the set of vertices on that hyperedge.

Figure 2.1 shows an undirected hypergraph, including the hypergraph itself, the incidence matrix **H**, the vertex set $\mathcal{V}$, the hyperedge set $\mathcal{E}$, and the weight matrix **W**. In the illustrated undirected hypergraph, there are 3 hyperedges $e_1$, $e_2$, and $e_3$ with 6 vertices. The hyperedge $e_3$ contains the vertices $\{v_3, v_4, v_6\}$, so its degree is 3. Vertex $v_3$ belongs to the hyperedges $e_2$ and $e_3$, so the degree of this vertex is 2. By the same token, the other elements of $\mathbf{D}_v$ can be inferred. The incidence matrix **H** of the hypergraph is readily obtained by the rules of construction and is shown on the right side of Fig. 2.1.
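The definitions in Eqs. (2.1)–(2.3) can be sketched in a few lines of Python. This is only an illustrative sketch: the hyperedges `e1` and `e2` below are hypothetical (only the content of `e3` and the membership of `v3` in `e2` and `e3` follow the description of Fig. 2.1), and the unit hyperedge weights and all function names are assumptions made for this example.

```python
# Hypothetical 6-vertex, 3-hyperedge example. Only e3 = {v3, v4, v6} and the
# fact that v3 lies in e2 and e3 follow the description of Fig. 2.1.
vertices = ["v1", "v2", "v3", "v4", "v5", "v6"]
hyperedges = {
    "e1": {"v1", "v2"},
    "e2": {"v2", "v3", "v5"},
    "e3": {"v3", "v4", "v6"},
}
weights = [1.0, 1.0, 1.0]  # assumed unit weights w(e1), w(e2), w(e3)


def incidence_matrix(vertices, hyperedges):
    """Eq. (2.1): H[v][e] = 1 if v is in e, else 0 (rows: vertices, columns: hyperedges)."""
    return [[1 if v in members else 0 for members in hyperedges.values()]
            for v in vertices]


def hyperedge_degrees(H):
    """Eq. (2.2): delta(e) is the sum of column e of H."""
    return [sum(row[j] for row in H) for j in range(len(H[0]))]


def vertex_degrees(H, w):
    """Eq. (2.3): d(v) is the sum of row v of H, weighted by w(e)."""
    return [sum(w_e * h for w_e, h in zip(w, row)) for row in H]


H = incidence_matrix(vertices, hyperedges)
delta = hyperedge_degrees(H)
d = vertex_degrees(H, weights)
print(delta)  # one hyperedge degree per column of H
print(d)      # one vertex degree per row of H
```

With unit weights, the vertex degree reduces to the number of hyperedges that contain the vertex, which matches the degree of 2 quoted for $v_3$.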

When the incidence matrix **H** is calculated as in Eq. (2.1), all elements take the value 0 or 1. It is noted, however, that the connection weights of different vertices on a hyperedge could be different. For example, some vertices are strongly connected to the hyperedge and have high weights, while others may have low weights. In that case, the values in each column of **H** represent the importance of the vertices on this hyperedge, and the column may be normalized to sum to 1 (or not, depending on the application and objective).

There are various rules that can be used to determine whether vertices are associated with one another. Hyperedge groups can be generated from the data with a graph structure by using pairwise edges and k-hops; for the data without a graph structure, they can be generated by using neighbors in feature space. A detailed description of these methods is provided in Chap. 4.

#### *2.2.2 Directed Hypergraph*

The traditional undirected hypergraph representation cannot fully describe the real world, in which hyperedges may be directional. Therefore, the representation of directed hypergraph structures is important. In each hyperedge, the vertices can be further divided into two sets: the source vertex set and the target vertex set. On a directed hypergraph, a trivial definition [14] of the incidence matrix is as follows:

$$\hat{\mathbf{H}}(v,e) = \begin{cases} -1 & \text{if } v \in T(e) \\ 1 & \text{if } v \in S(e) \\ 0 & \text{otherwise}, \end{cases} \tag{2.4}$$

**Fig. 2.2** An example of a directed hypergraph

where $T(e)$ and $S(e)$ are the target and source vertex sets of hyperedge $e$, respectively. The incidence matrix $\hat{\mathbf{H}}$ can be split into two matrices, $\mathbf{H}_s$ and $\mathbf{H}_t$, describing the source and target vertices of all hyperedges, respectively. Unlike in the undirected hypergraph, message passing in the directed hypergraph is guided by these two incidence matrices, so that the directional information is maintained. The average aggregation of messages is normalized by two matrices, $\mathbf{D}_s$ and $\mathbf{D}_t$, which can be formulated as follows:

$$\begin{cases} \mathbf{D}_s = \text{diag}(\text{col\_sum}(\mathbf{H}_s)) \\ \mathbf{D}_t = \text{diag}(\text{col\_sum}(\mathbf{H}_t)), \end{cases} \tag{2.5}$$

where diag*(v)* is a function that converts a vector *v* to a diagonal matrix. The col\_sum*(*·*)* is a column accumulation function.

Figure 2.2 shows an example of a directed hypergraph, including the directed hypergraph itself, the incidence matrix **H**, the source incidence matrix $\mathbf{H}_s$, and the target incidence matrix $\mathbf{H}_t$. The illustrated directed hypergraph contains six vertices and two hyperedges, $e_1$ and $e_2$. Hyperedge $e_1$ connects four vertices and $e_2$ connects three vertices. In hyperedge $e_1$, the source vertices are $v_1$ and $v_2$, and the target vertices are $v_4$ and $v_5$. As for the hyperedge $e_2$, the source vertices are $v_2$ and $v_3$, and the only target vertex is $v_6$.
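As a minimal sketch of Eqs. (2.4) and (2.5), the following Python code builds the signed incidence matrix, the source and target incidence matrices, and the two normalization matrices for the hypergraph described for Fig. 2.2. The function and variable names are assumptions made for this illustration, and the code assumes that the source and target sets of a hyperedge are disjoint.

```python
vertices = ["v1", "v2", "v3", "v4", "v5", "v6"]
# (source set S(e), target set T(e)) of each hyperedge, as described for Fig. 2.2
hyperedges = [
    ({"v1", "v2"}, {"v4", "v5"}),  # e1
    ({"v2", "v3"}, {"v6"}),        # e2
]


def directed_incidence(vertices, hyperedges):
    """Return (H_hat, H_s, H_t); H_hat follows Eq. (2.4) for disjoint S(e), T(e)."""
    H_hat, H_s, H_t = [], [], []
    for v in vertices:
        src = [1 if v in S else 0 for S, _ in hyperedges]
        tgt = [1 if v in T else 0 for _, T in hyperedges]
        H_s.append(src)
        H_t.append(tgt)
        # +1 for a source vertex, -1 for a target vertex, 0 otherwise
        H_hat.append([s - t for s, t in zip(src, tgt)])
    return H_hat, H_s, H_t


def col_sum(H):
    """Column accumulation: one sum per hyperedge."""
    return [sum(row[j] for row in H) for j in range(len(H[0]))]


def diag(vec):
    """Convert a vector into a diagonal matrix."""
    n = len(vec)
    return [[vec[i] if i == j else 0 for j in range(n)] for i in range(n)]


H_hat, H_s, H_t = directed_incidence(vertices, hyperedges)
D_s = diag(col_sum(H_s))  # Eq. (2.5), source side
D_t = diag(col_sum(H_t))  # Eq. (2.5), target side
print(H_hat)
print(D_s, D_t)
```

The diagonal of `D_s` holds the number of source vertices of each hyperedge and the diagonal of `D_t` the number of target vertices, which is what an average aggregation over either side divides by.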

#### *2.2.3 Probabilistic Hypergraph*

In real-world correlations, the intensity of a connection is not necessarily binary; it can also be a continuous value between zero and one. Consequently, the incidence matrix may be a continuous matrix with elements ranging from 0 to 1, which is adopted to denote a probabilistic hypergraph.

As shown in Fig. 2.3, the probabilistic hypergraph consists of six vertices and three hyperedges. The hyperedge $e_1$ connects three vertices $v_1$, $v_2$, and $v_5$. The intensity of the connection in this hyperedge is not the same for all vertices. As shown in the right side of the figure, $e_1$ connects $v_1$ with an intensity of 0.3, connects $v_2$ with an intensity of 0.8, and connects $v_5$ with an intensity of 0.5. The degree of a vertex or a hyperedge in this type of hypergraph is computed as the sum of the corresponding row or column of the hypergraph incidence matrix **H**, as shown in the bottom of Fig. 2.3.

**Fig. 2.3** An example of a probabilistic hypergraph
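The row and column sums described above can be sketched as follows. Only the intensities of hyperedge `e1` are taken from the description of Fig. 2.3; the memberships of `e2` and `e3` are hypothetical values chosen for this illustration, as are the variable names.

```python
vertices = ["v1", "v2", "v3", "v4", "v5", "v6"]
# e1 uses the intensities quoted for Fig. 2.3; e2 and e3 are hypothetical.
memberships = {
    "e1": {"v1": 0.3, "v2": 0.8, "v5": 0.5},
    "e2": {"v2": 0.6, "v3": 0.9, "v4": 0.4},
    "e3": {"v4": 0.7, "v6": 1.0},
}

# Continuous incidence matrix: rows are vertices, columns are hyperedges.
H = [[memberships[e].get(v, 0.0) for e in memberships] for v in vertices]

# Vertex degree: row sum. Hyperedge degree: column sum.
vertex_degree = [sum(row) for row in H]
hyperedge_degree = [sum(row[j] for row in H) for j in range(len(H[0]))]

for v, deg in zip(vertices, vertex_degree):
    print(v, round(deg, 2))
for e, deg in zip(memberships, hyperedge_degree):
    print(e, round(deg, 2))
```

The degrees are no longer integers: the degree of `e1`, for instance, is the sum of its three intensities rather than the count of its vertices.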

#### *2.2.4 K-Uniform Hypergraph*

In many applications, all hyperedges in a hypergraph connect the same number of vertices; such a hypergraph is known as a *k*-uniform hypergraph. In the *k*-uniform hypergraph, each hyperedge contains precisely *k* vertices, as shown in Fig. 2.4. Under this definition, a simple graph can be regarded as a special case of hypergraph, namely a 2-uniform hypergraph, where each hyperedge only connects two vertices.

Figure 2.4 illustrates an example of a 3-uniform hypergraph. The hypergraph consists of six vertices and three hyperedges, and each hyperedge contains precisely 3 vertices. Hyperedge $e_1$ connects vertices $v_1$, $v_2$, and $v_5$. Hyperedge $e_2$ connects vertices $v_1$, $v_2$, and $v_3$. The degree of every hyperedge in this type of hypergraph is equal to *k*.

#### *2.2.5 Hypergraph and Bipartite Graph*

The bipartite graph can be denoted by $\mathcal{G} = \{\mathcal{U}, \mathcal{V}, \mathcal{E}\}$. Unlike the simple graph, vertices in the bipartite graph can be divided into two disjoint and independent sets $\mathcal{U}$ and $\mathcal{V}$. Every edge connects one vertex in set $\mathcal{U}$ and another vertex in set $\mathcal{V}$. Obviously, an undirected hypergraph can be regarded as a bipartite graph if the hyperedges are treated as another vertex set, as shown in Fig. 2.5.

**Fig. 2.4** An example of a 3-uniform hypergraph

**Fig. 2.5** The relationship between hypergraph and bipartite graph

Figure 2.5 illustrates examples of converting a hypergraph to a bipartite graph. The bipartite graph can be generated by two strategies: the vertices and hyperedges are treated as vertices in $\mathcal{U}$ and vertices in $\mathcal{V}$, respectively (as illustrated in the left part), or the vertices and hyperedges are treated as vertices in $\mathcal{V}$ and vertices in $\mathcal{U}$, respectively (as illustrated in the right part). Similarly, a bipartite graph can also be transformed to an undirected hypergraph with set $\mathcal{U}$ or $\mathcal{V}$ as the hyperedges. This does not mean that the hypergraph is the same as, or can be replaced with, the bipartite graph. The transformation only exists for the undirected hypergraph and the probabilistic hypergraph. For more complex hypergraphs, such as the directed hypergraph, the transformation is no longer valid.
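The conversion in both directions can be sketched as follows for the first strategy, where the original vertices form $\mathcal{U}$ and the hyperedges form $\mathcal{V}$. The example hypergraph and the function names are hypothetical and serve only to illustrate the round trip.

```python
def hypergraph_to_bipartite(hyperedges):
    """Original vertices become U, hyperedges become V, memberships become edges."""
    U = sorted({v for members in hyperedges.values() for v in members})
    V = sorted(hyperedges)
    E = sorted((v, e) for e, members in hyperedges.items() for v in members)
    return U, V, E


def bipartite_to_hypergraph(V, E):
    """Each vertex of V becomes a hyperedge that collects its neighbors in U."""
    hyperedges = {e: set() for e in V}
    for v, e in E:
        hyperedges[e].add(v)
    return hyperedges


# Hypothetical undirected hypergraph with four vertices and two hyperedges.
hypergraph = {"e1": {"v1", "v2", "v3"}, "e2": {"v3", "v4"}}

U, V, E = hypergraph_to_bipartite(hypergraph)
print(U)  # vertex side of the bipartite graph
print(V)  # hyperedge side of the bipartite graph
print(E)  # one bipartite edge per (vertex, hyperedge) membership

restored = bipartite_to_hypergraph(V, E)
print(restored == hypergraph)
```

The number of bipartite edges equals the number of nonzero entries of the incidence matrix **H**, so the bipartite graph is simply another way of storing **H** for an undirected hypergraph.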

#### *2.2.6 The Weights on Hypergraph*

It is noted that there are different weights on a hypergraph, which provide additional information to assign values to a hypergraph structure. This is a more semantically preferred way of representing a hypergraph, as different components of a hypergraph, such as a vertex, a hyperedge or even a sub-hypergraph, should have different impact on the relationship modeling. For example, in a recommender system, the weights in the user profile influence the categorization of user attributes. If the attributes are not categorized accurately, the accuracy of the recommendations and marketing based on the profile could be questionable. The main types of weight information on a hypergraph are hyperedge weights and vertex weights, with the magnitude of the values indicating the relative importance of hyperedges and vertices, respectively.

First, let us show how the weights on vertices can be used. Different vertices may have varying importance in hypergraph modeling, and vertex weights are used in a hypergraph to quantify the importance of different vertices. If a vertex is strongly connected on the hypergraph (with high correlations), it should have a large vertex weight. Otherwise, it should have a small vertex weight. A vertex with a value of 0 in the incidence matrix can also be regarded as being connected by the corresponding hyperedge with a weight of 0. Here, the diagonal elements of **U** represent the weights of vertices, which are between 0 and 1 and reveal the relative importance of these vertices. Figure 2.6 shows an example hypergraph with vertex weights. In this figure, the weight of each vertex is denoted by the size of the vertex node. Vertex $v_6$ has a weight of 0.9, which is larger than that of all other vertices, and the weight of vertex $v_2$ is the smallest among the six vertices.

Then, let us focus on the weights on hyperedges. Hyperedge weights reflect the importance of different hyperedges in a hypergraph. As different hyperedges may have different importance in representing connections among vertices, it is crucial that hyperedges be weighted according to their representative capabilities. In some cases, some of the hyperedges are more reliable due to their generation method or the features employed in the task, and these hyperedges should be given a large weight during the learning process. Here, the diagonal elements of **W** can be used to represent the weights of hyperedges, which are between 0 and 1 and reveal the relative importance of these hyperedges. Figure 2.7 shows an example of the hyperedge weighted hypergraph. In the illustration, the three hyperedges have the weights 0.3, 0.9, and 0.5, respectively.

#### **2.3 Comparison Between Graph and Hypergraph**

Since hypergraph is a generalization of graph, the relationship between graph and hypergraph is a fundamental question. In this part, we introduce in detail the relationship between graph and hypergraph from four aspects, i.e., the order of correlations, the representation methods, the structure transformation, and random walks on both of them.

#### *2.3.1 Low-Order Versus High-Order Correlations*

First, we define the *interaction* as a set $I = [p_0, p_1, \cdots, p_{k-1}]$ containing $k$ basic elements of the system being studied, which can also be called vertices or nodes. Various real-world relationships can be described by such interactions, such as coauthors of a scientific paper, genes required to perform a specific function, neurons co-activating during a specific task, and more. We then denote the order (or dimension) of interactions among vertices as an order-0 interaction for a vertex interacting with itself only, an order-1 interaction for two vertices interacting with each other, an order-2 interaction for three vertices interacting with each other, and so on. Furthermore, high-order interactions are those of order 2 or higher, while low-order interactions are those of order at most 1.

**Fig. 2.8** The expressive ability comparison of graph and hypergraph

Figure 2.8 shows the comparison of hypergraph and graph on the modeling of different orders of correlations. We notice that a graph can only represent the order-1 interactions between two vertices. Different from graph, a hypergraph can represent any order-k interactions through its flexible hyperedges. From this direction, hypergraph is more effective on modeling high-order correlation among subjects compared with graph.

#### *2.3.2 Adjacency Matrix Versus Incidence Matrix*

A graph with $N$ vertices can be described by an adjacency matrix $\mathbf{A} \in \{0, 1\}^{N \times N}$, where $\mathbf{A}_{i,j} = 1$ denotes that there is an edge connecting vertex $v_i$ and vertex $v_j$. In most cases, the adjacency matrix $\mathbf{A}$ is a symmetric matrix.

A hypergraph with $N$ vertices and $M$ hyperedges can be described by an incidence matrix $\mathbf{H} \in \{0, 1\}^{N \times M}$, where $\mathbf{H}_{i,j} = 1$ denotes that the hyperedge $e_j$ connects vertex $v_i$.

Comparing the adjacency matrix with the incidence matrix, a graph can be regarded as a 2-uniform hypergraph, in which each hyperedge connects exactly two vertices. The possible order-1 hyperedges of the 2-uniform hypergraph, recorded in **H**, can be directly projected onto the $N \times N$ elements of the adjacency matrix **A**. With **D** denoting the diagonal matrix of vertex degrees, the hypergraph incidence matrix and the simple graph adjacency matrix are related as follows:

$$\mathbf{H} \mathbf{H}^{\top} = \mathbf{A} + \mathbf{D}.\tag{2.6}$$
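Equation (2.6) can be checked numerically on a small example. The sketch below uses a hypothetical simple graph with four vertices, treats each edge as a hyperedge of a 2-uniform hypergraph, and takes **D** to be the diagonal matrix of vertex degrees; the helper names are assumptions made for this illustration.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


# Hypothetical simple graph on 4 vertices (no self-loops, no repeated edges).
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Adjacency matrix A of the graph.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

# Incidence matrix H of the 2-uniform hypergraph: one column per edge.
H = [[1 if v in e else 0 for e in edges] for v in range(n)]

# Diagonal vertex degree matrix D.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]

HHt = matmul(H, transpose(H))
A_plus_D = [[A[i][j] + D[i][j] for j in range(n)] for i in range(n)]
print(HHt == A_plus_D)
```

The entry $(i, j)$ of $\mathbf{H}\mathbf{H}^{\top}$ counts the hyperedges that contain both $v_i$ and $v_j$: for $i \neq j$ this is the adjacency entry, and for $i = j$ it is the vertex degree.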

The adjacency matrix for graph and the incidence matrix for hypergraph are processed differently when confronting multi-modal data or multiple types of connections. Given $m$ adjacency matrices representing $m$ graphs $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_m$, there are two typical ways to combine these data for graph. The first way is to combine different graphs into one graph $\mathcal{G}$ and then conduct other tasks. The second way is to conduct the task in each graph individually and then combine all these results. Figures 2.9 and 2.10 show these two types of methods. In either method, it is required to perform fusion, either in the graph structure part or in the result part. In recent years, a series of graph fusion methods [15, 16] have been introduced, while it is still a challenging task to optimally combine different graphs. On the other side, multi-modal graph fusion also has high computational complexity, which may limit its applications on multi-modal data.

Different from the processing method in graph, hypergraph can handle such types of different connections in an easy and direct way, due to its flexible hyperedges. As shown in Fig. 2.11, when there are multiple types of connections available, it is possible to generate multiple hyperedge groups with $m$ incidence matrices $\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_m$, and these $m$ incidence matrices can be directly concatenated to generate the overall hypergraph structure $\mathbf{H}$. In this way, all these multi-modal data or multiple types of connections can be easily modeled in one hypergraph and all further processing can be directly deployed on this hypergraph structure. Under such circumstances, it is not required to conduct multi-modal fusion in an explicit way, while it could be jointly included in the hypergraph computation process.
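The concatenation of hyperedge groups amounts to placing the columns of the individual incidence matrices side by side. A minimal sketch is given below; the two small incidence matrices are hypothetical, the function name is an assumption, and the code assumes that all groups list the vertices in the same row order.

```python
def concat_hyperedge_groups(groups):
    """Concatenate incidence matrices column-wise into one hypergraph structure.

    Every matrix in `groups` must have one row per vertex, in the same order.
    """
    n = len(groups[0])
    if any(len(H_i) != n for H_i in groups):
        raise ValueError("all hyperedge groups must share the same vertex set")
    H = []
    for v in range(n):
        row = []
        for H_i in groups:
            row.extend(H_i[v])  # append the columns of this group for vertex v
        H.append(row)
    return H


# Hypothetical hyperedge groups over the same three vertices,
# e.g., generated from two different modalities.
H1 = [[1, 0],
      [1, 1],
      [0, 1]]
H2 = [[1],
      [0],
      [1]]

H = concat_hyperedge_groups([H1, H2])
print(H)
```

No weighting or averaging between modalities is needed at this stage: the combined matrix simply has as many columns as there are hyperedges in all groups together.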

#### *2.3.3 Structure Transformation from Hypergraph to Graph*

A hypergraph can encode high-order data correlation (beyond pairwise) using its degree-free hyperedges compared to a simple graph, where the degree for all edges has to be 2. In a sense, a simple graph can be viewed as a special case, where all hyperedges on a hypergraph are of degree 2. Therefore, hypergraph and graph are interconvertible. Currently, there are a number of methods for converting a hypergraph to a simple graph. The common ones are clique expansion, star expansion, and line expansion, which are shown in Figs. 2.12, 2.13 and 2.14, respectively.

#### **(1) Clique Expansion**

Figure 2.12 shows an example of transforming a hypergraph to a graph with clique expansion. The clique expansion algorithm constructs a graph $\mathcal{G}^x(\mathcal{V}, \mathcal{E}^x)$ from the original hypergraph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ by replacing each hyperedge $e$ with edges of degree 2, one for each pair $(u, v)$ of vertices in the hyperedge [17]: $\mathcal{E}^x = \{(u, v) : u, v \in e, e \in \mathcal{E}\}$.

**Fig. 2.10** An example of the results fusion for the multi-modal data

It is interesting to note that the vertices in hyperedge $e$ form a clique in the graph $\mathcal{G}^x$, which is where the name comes from. $\mathcal{G}^x$ preserves the vertex set of $\mathcal{G}$, so the information on its edges should stay as close as possible to the higher-order associations of the hyperedges. That is, the difference between the weight of the edge $(u, v)$ on $\mathcal{G}^x$ and the weight of each hyperedge that contains both $u$ and $v$ should be as small as possible. Thus we use the following formula when assigning weights $w^x(u, v)$ to edges on $\mathcal{G}^x$:

$$w^{x}(u, v) = \underset{w^{x}(u, v)}{\arg\min} \sum_{e \in \mathcal{E} : u, v \in e} \left( w^{x}(u, v) - w(e) \right)^{2}. \tag{2.7}$$

Hence, clique expansion uses the discriminative model, where every edge in the clique of $\mathcal{G}^x$ associated with hyperedge $e$ has weight $w(e)$. This criterion has the following minimizer:

$$w^{x}(u, v) = \mu \sum_{e \in \mathcal{E} : u, v \in e} w(e) = \mu \sum_{e} h(u, e)\, h(v, e)\, w(e), \tag{2.8}$$

where $\mu$ is a fixed scalar. Equivalently, from the point of view of edges, the weight between two vertices $u$ and $v$ is the scaled sum of the weights of all hyperedges that contain both of them.
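As a small illustration of Eq. (2.8), the clique-expansion weights can be computed in matrix form as $\mu\, \mathbf{H}\, \mathrm{diag}(w)\, \mathbf{H}^{\top}$. The sketch below is not a reference implementation: the function name `clique_expansion` and the toy hypergraph are hypothetical, and removing the diagonal (self-loops) is an assumption made here so that the result is a simple graph.

```python
import numpy as np

def clique_expansion(H, w, mu=1.0):
    """Edge weights w^x(u, v) = mu * sum_e h(u, e) h(v, e) w(e), as in Eq. (2.8)."""
    H = np.asarray(H, dtype=float)
    A = mu * H @ np.diag(w) @ H.T   # entry (u, v) sums w(e) over shared hyperedges
    np.fill_diagonal(A, 0.0)        # assumption: drop self-loops, keep pairs u != v
    return A

# Hypothetical hypergraph: 4 vertices, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])
w = np.array([0.5, 2.0])

A = clique_expansion(H, w)
print(A)
```

Vertices 0, 1, and 2 form a clique with weight 0.5 on each edge, vertices 2 and 3 are joined with weight 2.0, and vertices that share no hyperedge stay disconnected.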

#### **(2) Star Expansion**

Figure 2.13 shows an example of transforming a hypergraph to a graph with star expansion. By star expansion, a graph $\mathcal{G}^{\ast}(\mathcal{V}^{\ast}, \mathcal{E}^{\ast})$ can be constructed from hypergraph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ by regarding every hyperedge $e \in \mathcal{E}$ as a new vertex, thus $\mathcal{V}^{\ast} = \mathcal{V} \cup \mathcal{E}$ [17]. Each vertex in the hyperedge is connected to the new graph vertex $e$, i.e., $\mathcal{E}^{\ast} = \{(u, e) : u \in e,\ e \in \mathcal{E}\}$.

**Fig. 2.12** An example of transforming a hypergraph to a graph with clique expansion

**Fig. 2.13** An example of transforming a hypergraph to a graph with star expansion

**Fig. 2.14** An example of transforming a hypergraph to a graph with line expansion

The graph $\mathcal{G}^{\ast}$ contains two types of vertices (the original vertices and the vertices that represent hyperedges), and each hyperedge in $\mathcal{E}$ corresponds to a star in $\mathcal{G}^{\ast}$. With star expansion, the scaled hyperedge weight is assigned to each graph edge $(u, e)$ of the star that corresponds to hyperedge $e$ as follows:

$$w^{\ast}(u, e) = w(e) / \delta(e). \tag{2.9}$$

For each vertex representing a hyperedge, the edges connecting to it all carry the same weight, which amounts to dividing the hyperedge weight equally into $\delta(e)$ parts.
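The construction and Eq. (2.9) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `star_expansion` and the toy data are hypothetical, the new hyperedge vertices are indexed after the original vertices, and every hyperedge is assumed to be non-empty so that $\delta(e) > 0$.

```python
import numpy as np

def star_expansion(H, w):
    """Return the (|V|+|E|) x (|V|+|E|) weighted adjacency matrix of the star expansion."""
    H = np.asarray(H, dtype=float)
    n_v, n_e = H.shape
    delta = H.sum(axis=0)                 # hyperedge degrees delta(e)
    B = H * (np.asarray(w) / delta)       # w*(u, e) = w(e) / delta(e) where u is in e
    A = np.zeros((n_v + n_e, n_v + n_e))
    A[:n_v, n_v:] = B                     # original vertex <-> hyperedge vertex
    A[n_v:, :n_v] = B.T                   # symmetric, since the graph is undirected
    return A

# Hypothetical hypergraph: 4 vertices, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])
w = np.array([3.0, 1.0])

A = star_expansion(H, w)
print(A.shape)  # (6, 6): 4 original vertices plus 2 hyperedge vertices
```

The two diagonal blocks of `A` stay zero: original vertices are connected only to hyperedge vertices, never to each other.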

#### **(3) Line Expansion**

Figure 2.14 shows an example of transforming a hypergraph to a graph with line expansion. In the line expansion algorithm, the vertices of the graph $\mathcal{G}^l = (\mathcal{V}^l, \mathcal{E}^l)$ are constructed by restructuring the data stored in the vertices of the hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Each line vertex $(u, e)$ in $\mathcal{G}^l$ can be viewed as a vertex in the context of a hyperedge or a hyperedge in the context of a vertex [18]. For each vertex on each hyperedge, a line vertex is created to represent it, and this line vertex carries the property of the vertex within that hyperedge, i.e., $\mathcal{V}^l = \{(u, e) : u \in e,\ u \in \mathcal{V},\ e \in \mathcal{E}\}$. This means that $|\mathcal{V}^l| = \sum_e \delta(e)$.

Therefore, the vertices in $\mathcal{G}^l$ that contain the same vertex or the same hyperedge can be defined as neighbors. Both types of connections are considered equally important, so $W^l = \mathrm{diag}(1, \ldots, 1)$, where $W^l$ has size $|\mathcal{V}^l| \times |\mathcal{V}^l|$. The mapping between a hypergraph $\mathcal{G}$ and its line expansion $\mathcal{G}^l$ is bijective under this construction.
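The line expansion can be sketched as below, assuming unit weights for both kinds of connections as stated above. The function name `line_expansion` and the toy hypergraph are hypothetical, and the double loop favors readability over efficiency.

```python
import numpy as np

def line_expansion(H):
    """Return the line vertices (u, e) and the 0/1 adjacency matrix of the line expansion."""
    H = np.asarray(H)
    # One line vertex per incidence pair (u, e) with h(u, e) = 1.
    line_vertices = [(int(u), int(e)) for u, e in zip(*np.nonzero(H))]
    n = len(line_vertices)
    A = np.zeros((n, n), dtype=int)
    for i, (u1, e1) in enumerate(line_vertices):
        for j, (u2, e2) in enumerate(line_vertices):
            # Neighbors share the same vertex or the same hyperedge (unit weight).
            if i != j and (u1 == u2 or e1 == e2):
                A[i, j] = 1
    return line_vertices, A

# Hypothetical hypergraph: 4 vertices, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])

line_vertices, A = line_expansion(H)
print(line_vertices)  # [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)]
```

The number of line vertices equals the sum of the hyperedge degrees (3 + 2 = 5), and the original incidence matrix can be read back from the list of pairs, which is what makes the mapping invertible.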

#### *2.3.4 Random Walks on Graph and Hypergraph*

Random walks propagate the information stored in the vertices along the links among the vertices in the graph or hypergraph. These links constitute the paths between vertices. In the hypergraph, each vertex is updated by aggregating the messages of the vertices in its neighborhood, based on the "path" between the central vertex and each neighboring vertex. A path between vertices $v_1$ and $v_k$ in a hypergraph is defined as a sequence, called a hyperpath [19]:

$$P(v_1, v_k) = (v_1, e_1, v_2, e_2, \dots, v_{k-1}, e_{k-1}, v_k),\tag{2.10}$$

where $v_j$ and $v_{j+1}$ are both part of the same vertex subset described by a hyperedge $e_j$. On a hyperpath, two neighboring vertices are thus separated by a hyperedge. In a hypergraph, messages between vertices are propagated through hyperedges, which are higher-order relationships than those in graphs. It is first necessary to extend the Neighbor Relation definition among vertices to the Inter-Neighbor Relation $N$ over vertex set $\mathcal{V}$ and hyperedge set $\mathcal{E}$ for message propagation from vertex to hyperedge and from hyperedge to vertex on the hyperpath.

**Definition 1** The Inter-Neighbor Relation $N \subset \mathcal{V} \times \mathcal{E}$ on a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$ with incidence matrix $\mathbf{H} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{E}|}$ is defined as

$$N = \{(v, e) \mid \mathbf{H}(v, e) = 1, \ v \in \mathcal{V} \text{ and } e \in \mathcal{E}\}.\tag{2.11}$$

The hyperedge inter-neighbor set $\mathcal{N}_e(v)$ of vertex $v$ and the vertex inter-neighbor set $\mathcal{N}_v(e)$ of hyperedge $e$ are defined based on the Inter-Neighbor Relation.

**Definition 2** The hyperedge inter-neighbor set of vertex $v \in \mathcal{V}$ is defined as

$$\mathcal{N}_e(v) = \{ e \mid vNe, \ v \in \mathcal{V} \text{ and } e \in \mathcal{E} \}. \tag{2.12}$$

**Definition 3** The vertex inter-neighbor set of hyperedge $e \in \mathcal{E}$ is defined as

$$\mathcal{N}_v(e) = \{ v \mid vNe, \ v \in \mathcal{V} \text{ and } e \in \mathcal{E} \}. \tag{2.13}$$
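In terms of the incidence matrix, the two inter-neighbor sets are simply the non-zero entries of a row and of a column of $\mathbf{H}$. The helper names below are hypothetical and the hypergraph is a made-up toy example.

```python
import numpy as np

# Hypothetical hypergraph: 4 vertices, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])

def hyperedge_neighbors(H, v):
    """Hyperedge inter-neighbor set of vertex v: hyperedges e with H[v, e] = 1."""
    return set(np.flatnonzero(H[v]).tolist())

def vertex_neighbors(H, e):
    """Vertex inter-neighbor set of hyperedge e: vertices v with H[v, e] = 1."""
    return set(np.flatnonzero(H[:, e]).tolist())

print(hyperedge_neighbors(H, 2))  # {0, 1}
print(vertex_neighbors(H, 0))     # {0, 1, 2}
```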

Compared with graph learning, hypergraph learning models data correlations at a higher order, which often results in improved performance in practice. This is only the most visible difference between graphs and hypergraphs. Next, we delve deeper into the relationship between graphs and hypergraphs through mathematical derivations based on random walks [20] and Markov chains [21]. We then provide a mathematical comparison between hypergraph and graph. The proof concludes that, from the random walk perspective, a hypergraph with edge-independent vertex weights is equivalent to a weighted graph, whereas a hypergraph with edge-dependent vertex weights cannot be reduced to a weighted graph.

Two types of hypergraphs can be constructed to represent real-world correlations: hypergraphs with edge-independent vertex weights and hypergraphs with edge-dependent vertex weights. By using the binary hypergraph incidence matrix $\mathbf{H} \in \{0, 1\}^{|\mathcal{V}| \times |\mathcal{E}|}$, where the vertices in each hyperedge share the same weight, a hypergraph with edge-independent vertex weights ($\mathcal{G}_{in} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$) can model beyond-pairwise correlations. Alternatively, the weighted hypergraph incidence matrix $\mathbf{R} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{E}|}$ is used to model the variable correlation intensity in each hyperedge for a hypergraph with edge-dependent vertex weights ($\mathcal{G}_{de} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}, \gamma\}$). When hyperedge $e$ includes vertex $v$, $\gamma_e(v)$ denotes the connection intensity of $v$ in $e$, and $w(e)$ denotes the weight of hyperedge $e$.

In a hypergraph with edge-independent vertex weights, the definitions of the binary incidence matrix $\mathbf{H}$, vertex degree $d(v)$, and hyperedge degree $\delta(e)$ are the same as in Sect. 2.1. In a hypergraph with edge-dependent vertex weights, $d(v)$ and $\delta(e)$ are defined as follows:

$$\begin{cases} d(v) = \sum_{\beta \in \mathcal{N}_{e}(v)} w(\beta) \\ \delta(e) = \sum_{\alpha \in \mathcal{N}_{v}(e)} \gamma_{e}(\alpha), \end{cases} \tag{2.14}$$

where $\mathcal{N}_e(\cdot)$ and $\mathcal{N}_v(\cdot)$ are defined in Eqs. (2.12) and (2.13), respectively.

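Next, we introduce random walks and Markov chains on hypergraphs. First, we define the random walk on a hypergraph following [20–23]. At time $t$, a random walker at vertex $v_t$ does the following:

1. Selects a hyperedge $e$ containing $v_t$ with probability $p_{v_t \to e}$.
2. Selects a vertex $u$ from $e$ with probability $p_{e \to u}$.
3. Moves to vertex $v_{t+1} = u$ at time $t + 1$.
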
We then define the transition probability $p_{v,u}$ of the corresponding Markov chain on $\mathcal{V}$ as $p_{v,u} = \sum_{e \in \mathcal{N}_e(v,u)} p_{v \to e}\, p_{e \to u}$, where $\mathcal{N}_e(v, u) = \mathcal{N}_e(v) \cap \mathcal{N}_e(u)$ denotes the set of hyperedges that contain both $v$ and $u$. In a hypergraph with edge-independent vertex weights, we have $p_{v \to e} = w(e)/d(v)$ and $p_{e \to u} = 1/\delta(e)$, so the transition probability can be written as $p_{v,u} = \sum_{\beta \in \mathcal{N}_e(v,u)} \frac{w(\beta)}{d(v)} \cdot \frac{1}{\delta(\beta)}$. In a hypergraph with edge-dependent vertex weights, we have $p_{v \to e} = w(e)/d(v)$ and $p_{e \to u} = \gamma_e(u)/\delta(e)$, so the transition probability can be written as $p_{v,u} = \sum_{\beta \in \mathcal{N}_e(v,u)} \frac{w(\beta)}{d(v)} \cdot \frac{\gamma_\beta(u)}{\delta(\beta)}$.
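Both transition probabilities can be written as a product of two matrices, one for the step from vertex to hyperedge and one for the step from hyperedge to vertex. The sketch below is an illustration only: the function name `transition_matrix`, the argument `R` (with `R[u, e]` holding $\gamma_e(u)$), and the toy hypergraph are assumptions made for this example. Each vertex is assumed to lie in at least one hyperedge and each hyperedge to be non-empty.

```python
import numpy as np

def transition_matrix(H, w, R=None):
    """Random-walk transition matrix P with P[v, u] = p_{v,u}.

    H: binary incidence matrix (|V| x |E|); w: hyperedge weights w(e).
    R: optional edge-dependent vertex weights, R[u, e] = gamma_e(u).
       If omitted, R = H, which recovers the edge-independent case.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    R = H if R is None else np.asarray(R, dtype=float) * H  # zero outside hyperedges
    d = H @ w                      # d(v) = sum of w(e) over hyperedges containing v
    delta = R.sum(axis=0)          # delta(e) = sum of gamma_e(u) over u in e
    step_ve = H * w / d[:, None]   # p_{v -> e} = w(e) / d(v)
    step_eu = (R / delta).T        # p_{e -> u} = gamma_e(u) / delta(e)
    return step_ve @ step_eu       # p_{v,u} = sum_e p_{v->e} p_{e->u}

# Hypothetical hypergraph: hyperedges e0 = {0, 1}, e1 = {1, 2}, e2 = {0, 2}.
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
w = np.array([1.0, 1.0, 1.0])
R = np.array([[1, 0, 2],
              [2, 1, 0],
              [0, 2, 1]])

P_in = transition_matrix(H, w)       # edge-independent vertex weights
P_de = transition_matrix(H, w, R)    # edge-dependent vertex weights
print(P_in.sum(axis=1), P_de.sum(axis=1))  # every row sums to 1
```

Note that the walker may return to its starting vertex in one step, since $u = v$ is allowed in the sum.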

The following lemmas and definitions are used to compare the graph and the two types of hypergraphs [21].

**Definition 4** Let $M$ be a Markov chain with state space $S$ and transition probabilities $p_{x,y}$ for $x, y \in S$. $M$ is said to be reversible if there exists a probability distribution $\pi$ over $S$ such that $\pi_x p_{x,y} = \pi_y p_{y,x}$ for all $x, y \in S$.

**Lemma 5** *Let M be an irreducible Markov chain with finite state space S and transition probability px,y for x, y* ∈ *S. M is reversible if and only if there exists a weighted undirected graph* G *with vertex set S such that random walks on* G *and M are equivalent.* 

*Proof of Lemma 5* Note that *π* indicates the stationary distribution [21, 24] of a given edge-independent/edge-dependent hypergraph. The transition probability *pv,u* of vertices in hypergraph with edge-independent vertex weights is defined as

$$p_{v,u} = \sum_{\beta \in \mathcal{N}_e(v,u)} \left( \frac{w(\beta)}{d(v)} \right) \left( \frac{1}{\delta(\beta)} \right). \tag{2.15}$$

Moreover, the transition probability $p_{v,u}$ of vertices in a hypergraph with edge-dependent vertex weights is defined as

$$p_{v,u} = \sum_{\beta \in \mathcal{N}_e(v,u)} \left( \frac{w(\beta)}{d(v)} \right) \left( \frac{\gamma_\beta(u)}{\delta(\beta)} \right). \tag{2.16}$$

"⇒": Suppose $M$ is reversible with transition probability $p_{x,y}$. We then construct a graph $\mathcal{G}$ with vertex set $S$ and edge weights $w_{x,y} = \pi_x p_{x,y}$. Because $M$ is irreducible, $\pi_x \neq 0$ for all states $x$, so $w_{x,y} \neq 0$ whenever $p_{x,y} \neq 0$, and the graph $\mathcal{G}$ is connected. By the reversibility of $M$, $w_{x,y} = \pi_x p_{x,y} = \pi_y p_{y,x} = w_{y,x}$, so the constructed graph $\mathcal{G}$ is undirected. A random walk on $\mathcal{G}$ moves from $x$ to $y$ in one time step with probability

$$\frac{w_{x,y}}{\sum_{z \in S} w_{x,z}} = \frac{\pi_{x} p_{x,y}}{\sum_{z \in S} \pi_{x} p_{x,z}} = \frac{p_{x,y}}{\sum_{z \in S} p_{x,z}} = p_{x,y},\tag{2.17}$$

since $\sum_{z \in S} p_{x,z} = 1$. Thus, if $M$ is reversible, the stated claim holds.

"⇐": Random walks on an undirected graph are always reversible.
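The "⇒" construction can be checked numerically on a small example. The chain below is a hypothetical three-state birth-death chain, chosen only because such chains are reversible; it is not taken from this book. The stationary distribution is obtained with `numpy.linalg.eig` as the left eigenvector of the transition matrix for eigenvalue 1.

```python
import numpy as np

# A small reversible chain chosen for illustration (a birth-death chain on 3 states).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

# Edge weights of the constructed graph: w_{x,y} = pi_x * p_{x,y}.
W = pi[:, None] * P

# Random walk on the weighted graph: normalize each row of W by its total weight.
P_graph = W / W.sum(axis=1, keepdims=True)

print(np.allclose(W, W.T))      # the constructed graph is undirected
print(np.allclose(P_graph, P))  # the walk on the graph reproduces the chain
```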

**Definition 6** A Markov chain is reversible if and only if its transition probability satisfies

$$p\_{v\_1, v\_2} p\_{v\_2, v\_3} \cdots p\_{v\_n, v\_1} = p\_{v\_1, v\_n} p\_{v\_n, v\_{n-1}} \cdots p\_{v\_2, v\_1} \tag{2.18}$$

for any finite sequence of states *v*1*, v*2*,* ··· *vn* ∈ *S*. The definition is also known as Kolmogorov's criterion. For more detailed proofs, please refer to [25].

**Theorem 1** *Let* $\mathcal{G}_{in} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$ *be a hypergraph with edge-independent vertex weights. Then there exists a weighted undirected graph* $\mathcal{G}$ *such that a random walk on* $\mathcal{G}$ *is equivalent to a random walk on* $\mathcal{G}_{in}$*.*

*Proof of Theorem 1* The probability *pv,u* of Gin is defined in Eq. (2.15). By Definition 6, the following equation can be deduced:

$$\begin{split} & p_{v_1, v_2}\, p_{v_2, v_3} \cdots p_{v_n, v_1} \\ &= \sum_{\beta \in \mathcal{N}_e(v_1, v_2)} \left( \frac{w(\beta)}{d(v_1)} \cdot \frac{1}{\delta(\beta)} \right) \cdots \sum_{\beta \in \mathcal{N}_e(v_n, v_1)} \left( \frac{w(\beta)}{d(v_n)} \cdot \frac{1}{\delta(\beta)} \right) \\ &= \left( \frac{1}{d(v_1)} \sum_{\beta \in \mathcal{N}_e(v_1, v_2)} \frac{w(\beta)}{\delta(\beta)} \right) \cdots \left( \frac{1}{d(v_n)} \sum_{\beta \in \mathcal{N}_e(v_n, v_1)} \frac{w(\beta)}{\delta(\beta)} \right) \\ &= \frac{1}{d(v_2)} \sum_{\beta \in \mathcal{N}_e(v_1, v_2)} \frac{w(\beta)}{\delta(\beta)} \cdots \frac{1}{d(v_1)} \sum_{\beta \in \mathcal{N}_e(v_n, v_1)} \frac{w(\beta)}{\delta(\beta)}. \end{split} \tag{2.19}$$

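For any $v_i$ and $v_j$, $\sum_{\beta \in \mathcal{N}_e(v_i, v_j)} \frac{w(\beta)}{\delta(\beta)} = \sum_{\beta \in \mathcal{N}_e(v_j, v_i)} \frac{w(\beta)}{\delta(\beta)}$, because $\mathcal{N}_e(v_i, v_j) = \mathcal{N}_e(v_j, v_i)$. Thus, the reversibility can be proven by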

$$\begin{split} & p_{v_1, v_2}\, p_{v_2, v_3} \cdots p_{v_n, v_1} \\ &= \frac{1}{d(v_2)} \sum_{\beta \in \mathcal{N}_e(v_2, v_1)} \frac{w(\beta)}{\delta(\beta)} \cdots \frac{1}{d(v_1)} \sum_{\beta \in \mathcal{N}_e(v_1, v_n)} \frac{w(\beta)}{\delta(\beta)} \\ &= p_{v_2, v_1}\, p_{v_3, v_2} \cdots p_{v_1, v_n} \\ &= p_{v_1, v_n}\, p_{v_n, v_{n-1}} \cdots p_{v_2, v_1}. \end{split} \tag{2.20}$$

We say that a random walk on Gin is reversible. Furthermore, by Lemma 5, a random walk on Gin is equivalent to a random walk on a weighted undirected graph G .

The proof of Theorem 1 can be summarized in two steps:

1. A random walk on $\mathcal{G}_{in}$ is equivalent to a random walk on a reversible Markov chain (according to **Definition 6**).
2. A random walk on a reversible Markov chain is equivalent to a random walk on a weighted undirected graph $\mathcal{G}$ (according to **Lemma 5**).

**Fig. 2.15** An example of two types of random walks on the hypergraph with edge-independent vertex weights and the hypergraph with edge-dependent vertex weights. This figure is from [26]

**Theorem 2** *Let* $\mathcal{G}_{de} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}, \gamma\}$ *be a hypergraph with edge-dependent vertex weights. Then there does not exist a weighted undirected graph* $\mathcal{G}$ *such that a random walk on* $\mathcal{G}$ *is equivalent to a random walk on* $\mathcal{G}_{de}$*.*

*Proof of Theorem 2* Figure 2.15 provides an example in which a random walk on $\mathcal{G}_{de}$ is not equivalent to a random walk on a reversible Markov chain. According to the second step of **Theorem 1**'s proof (**Lemma 5**), only reversible Markov chains are equivalent to random walks on weighted undirected graphs, so **Theorem 2** holds.

A simple illustration is shown in Fig. 2.15 to make this easier to understand. There is no difference in the connection structure between the two hypergraphs, but there is a difference in the intensity of the connections. For both types of hypergraphs, the transition probabilities $p_{v,u}$ can be computed accordingly. Two random walks from vertex $v_0$ are then considered: "$v_0 \to v_1 \to v_2 \to v_0$" and "$v_0 \to v_2 \to v_1 \to v_0$." The cumulative transition probabilities of the two paths are $p_{v_0,v_1} \cdot p_{v_1,v_2} \cdot p_{v_2,v_0}$ and $p_{v_0,v_2} \cdot p_{v_2,v_1} \cdot p_{v_1,v_0}$, respectively. The random walk on the hypergraph with edge-independent vertex weights is reversible according to **Theorem 1** and **Lemma 5**, so the two paths, one being the reverse of the other, yield the same cumulative transition probability. In contrast, the two paths yield different cumulative transition probabilities in the hypergraph with edge-dependent vertex weights.
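The comparison of the two paths can be reproduced numerically with Kolmogorov's criterion (Eq. (2.18)). The example below does not use the values of Fig. 2.15; the three-vertex hypergraph and the vertex weights in `R_de` are hypothetical and were chosen so that the difference is easy to verify by hand.

```python
import numpy as np

def transition_matrix(H, w, R):
    """P[v, u] = sum over shared hyperedges e of (w(e) / d(v)) * (R[u, e] / delta(e))."""
    d = H @ w                    # vertex degrees d(v)
    delta = R.sum(axis=0)        # hyperedge degrees delta(e)
    return (H * w / d[:, None]) @ (R / delta).T

def cycle_product(P, cycle):
    """Product of transition probabilities along a closed walk, e.g. [0, 1, 2, 0]."""
    return float(np.prod([P[a, b] for a, b in zip(cycle[:-1], cycle[1:])]))

# Hyperedges e0 = {0, 1}, e1 = {1, 2}, e2 = {0, 2}, all with weight 1.
H = np.array([[1., 0., 1.], [1., 1., 0.], [0., 1., 1.]])
w = np.ones(3)
# Edge-dependent vertex weights gamma_e(u); zero where u is not in e.
R_de = np.array([[1., 0., 2.], [2., 1., 0.], [0., 2., 1.]])

P_in = transition_matrix(H, w, H)     # edge-independent: gamma_e(u) = 1 inside e
P_de = transition_matrix(H, w, R_de)  # edge-dependent

for name, P in [("edge-independent", P_in), ("edge-dependent", P_de)]:
    forward = cycle_product(P, [0, 1, 2, 0])
    backward = cycle_product(P, [0, 2, 1, 0])
    print(name, forward, backward)
```

In the edge-independent case both products equal $1/64$. In the edge-dependent case the forward product is $1/27$ while the backward product is $1/216$, so Kolmogorov's criterion fails and this particular random walk is not reversible.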

#### **2.4 Summary**

In this chapter, we present the mathematical definitions underlying hypergraphs and their interpretation. We also show the representation of the directed hypergraph, which, unlike the undirected hypergraph, represents the relationships between vertices within a hyperedge. Finally, we discuss the relationship between graph and hypergraph from the perspectives of conversion and expressive ability. The most intuitive differences between graph and hypergraph lie in low-order versus high-order representations and in the adjacency matrix versus the incidence matrix. Clique expansion, star expansion, and line expansion are methods for converting a hypergraph into a simple graph. We also show the relationship between graph and hypergraph from the random walk view: in terms of the information propagation process, a hypergraph with edge-independent vertex weights is equivalent to a weighted graph, whereas a hypergraph with edge-dependent vertex weights cannot be reduced to a weighted graph.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 3 Hypergraph Computation Paradigms**

**Abstract** This chapter introduces three hypergraph computation paradigms: intra-hypergraph computation, inter-hypergraph computation, and hypergraph structure computation. Intra-hypergraph computation aims to conduct representation learning of a hypergraph, where each subject is represented by a hypergraph of its components. Inter-hypergraph computation aims to conduct representation learning of vertices in the hypergraph, where each subject is a vertex in the hypergraph. Hypergraph structure computation aims to conduct hypergraph structure prediction, that is, to find the connections among vertices. This chapter gives a general introduction to hypergraph computation paradigms and shows how to formulate a task in the hypergraph computation framework.

#### **3.1 Introduction**

Hypergraph computation can be roughly divided into three types: representation learning of a hypergraph, where each subject is represented by a hypergraph of its components; representation learning of vertices in the hypergraph, where each subject is a vertex in the hypergraph; and hypergraph structure prediction, which aims to find the connections among vertices. The corresponding computation paradigms are named intra-hypergraph computation, inter-hypergraph computation, and hypergraph structure computation, respectively. In this chapter, we introduce the generalized computation paradigms corresponding to these three directions and show how to formulate practical tasks in these hypergraph computation frameworks. We note that specific implementations of the generalized functions in the paradigms are not introduced here, as they are specifically defined functions or modules in the hypergraph computation framework and will be introduced in subsequent chapters.

#### **3.2 Intra-hypergraph Computation**

Intra-hypergraph computation targets learning the representation of a single subject from its internal component information, in which the correlations among the components of this subject are formulated in a hypergraph. In this hypergraph, the components of the subject are regarded as the vertices, and their high-order correlations are modeled by hyperedges. In this way, the individual subject is transformed into a hypergraph. As this hypergraph is generated from the subject's own components, we name it the *intra-hypergraph* of this subject.

Image representation and understanding [1–3] are typical intra-hypergraph computation applications. For example, an image can be split into a group of patches, and each patch is denoted by a vertex in the hypergraph. The hypergraph can be generated according to the semantic and spatial information of these patches. The information of these patches and their high-order correlations can be then used simultaneously to learn the representation for the image.

The general paradigm of intra-hypergraph computation can be described as follows. Consider a target subject that contains $n$ components, represented by feature vectors $\mathbf{X} \in \mathbb{R}^{n \times d}$. An intra-hypergraph $\mathcal{G}$ can be generated to formulate the high-order correlations inside the subject, whose incidence matrix is denoted by $\mathbf{H}$. The representation of the individual subject can be learned by

$$\mathbf{Z}_{\mathcal{G}} = f_{\Theta}(\mathbf{H}, \mathbf{X}),\tag{3.1}$$

where *Θ* denotes the to-be-learned parameters. The function *fΘ(*·*)* can be the neural network layers or other computing operators that aggregate the information of vertices together based on the hypergraph structure. Intra-hypergraph computation integrates the complex correlations among components into the learned representation, which can extract more information than simple aggregation operations.
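To make the paradigm concrete, the sketch below shows one possible choice of $f_\Theta$. It is only an assumption for illustration, not the function used in later chapters: vertex features are averaged within hyperedges, averaged back to vertices, passed through a linear map with ReLU, and mean-pooled into one vector for the subject. The function name, the toy hypergraph, and the random `Theta` (which would be learned in practice) are all hypothetical.

```python
import numpy as np

def intra_hypergraph_representation(H, X, Theta):
    """One possible f_Theta(H, X): smooth vertex features over hyperedges, then pool."""
    H = np.asarray(H, dtype=float)
    d_v = H.sum(axis=1, keepdims=True)        # vertex degrees (unit hyperedge weights)
    d_e = H.sum(axis=0, keepdims=True)        # hyperedge degrees
    edge_feat = (H / d_e).T @ X               # hyperedge feature = mean of its vertices
    vertex_feat = (H / d_v) @ edge_feat       # vertex feature = mean of its hyperedges
    hidden = np.maximum(vertex_feat @ Theta, 0.0)  # linear map followed by ReLU
    return hidden.mean(axis=0)                # pool components into one subject vector

rng = np.random.default_rng(0)
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]])   # 4 components, 2 hyperedges
X = rng.normal(size=(4, 3))                      # d = 3 features per component
Theta = rng.normal(size=(3, 2))                  # would be learned in practice
Z = intra_hypergraph_representation(H, X, Theta)
print(Z.shape)  # (2,)
```

The output is a single vector for the whole subject, regardless of how many components it has.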

In this paradigm, the subject to be analyzed is regarded as a whole system, and the intra-hypergraph models the correlations inside the system. This process is shown in Fig. 3.1.

#### **3.3 Inter-hypergraph Computation**

Inter-hypergraph computation targets learning the representation of a subject by considering its correlations with other subjects. In this hypergraph, each subject, including the target one, is regarded as a vertex, and their high-order correlations are modeled by hyperedges. In this way, this group of subjects is transformed into a hypergraph. As this hypergraph is generated from the cross-subject correlations, we name it the *inter-hypergraph* of this subject.

Subject classification and retrieval [4–7] are typical inter-hypergraph computation applications. For example, we take an image as the target subject, together with a pool of other images for processing. Each image can be denoted by a vertex in the hypergraph. The hypergraph can be generated according to the semantic and spatial information of these images. The information of these images and their high-order correlations can then be used simultaneously to learn the representation of the target image.

The general paradigm of inter-hypergraph computation can be described as follows. Given a target subject and $n - 1$ other subjects, represented by feature vectors $\mathbf{X} \in \mathbb{R}^{n \times d}$, an inter-hypergraph $\mathcal{G}$ can be generated to formulate the high-order correlations among these subjects, whose incidence matrix is denoted by $\mathbf{H}$. The representation of the target subject can be learned by

$$\mathbf{Z}\_{\mathcal{V}} = f\_{\Theta}(\mathbf{H}, \mathbf{X}).\tag{3.2}$$

The vertex embedding can be further used for downstream tasks, such as vertex classification, where the vertices are associated with pre-defined labels $Y \in [K]^n$. This process is also shown in Fig. 3.1.
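A vertex-level counterpart can be sketched in the same spirit. Again, this is a hypothetical instance of $f_\Theta$ rather than the method of later chapters: features are averaged over hyperedges and mapped linearly to $K$ scores per vertex, and the label of each vertex is read off with an argmax. The names and data are made up, and `Theta` is random here instead of learned.

```python
import numpy as np

def inter_hypergraph_embedding(H, X, Theta):
    """One possible f_Theta(H, X): average over hyperedges, then apply a linear map."""
    H = np.asarray(H, dtype=float)
    edge_feat = (H / H.sum(axis=0, keepdims=True)).T @ X          # mean feature per hyperedge
    vertex_feat = (H / H.sum(axis=1, keepdims=True)) @ edge_feat  # mean over incident hyperedges
    return vertex_feat @ Theta                                    # one row per subject

rng = np.random.default_rng(0)
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]])   # 4 subjects, 2 hyperedges
X = rng.normal(size=(4, 3))                      # d = 3 features per subject
Theta = rng.normal(size=(3, 2))                  # K = 2 classes; learned in practice
Z_v = inter_hypergraph_embedding(H, X, Theta)
labels = Z_v.argmax(axis=1)                      # predicted class per vertex
print(Z_v.shape, labels.shape)  # (4, 2) (4,)
```

Vertices 0 and 1 belong to exactly the same hyperedge, so this simple averaging scheme gives them identical embeddings; richer choices of $f_\Theta$ would also retain each vertex's own features.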

It is noted that a hypergraph structure can be either homogeneous or heterogeneous, depending on the definition of vertices. Given multiple types of data, or multi-modal data, another way to formulate such correlations is to generate multiple hypergraphs accordingly. For example, supposing that there are $m$ types of features or modalities, denoted by $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_m$, we can construct one hypergraph for each modality. In this way, we obtain $m$ hypergraphs $\mathcal{G}_1 = (\mathcal{V}_1, \mathcal{E}_1, \mathbf{W}_1), \mathcal{G}_2 = (\mathcal{V}_2, \mathcal{E}_2, \mathbf{W}_2), \ldots, \mathcal{G}_m = (\mathcal{V}_m, \mathcal{E}_m, \mathbf{W}_m)$ for the data with $m$ modalities. The general paradigm for multi-modal inter-hypergraph computation can be described as

$$\mathbf{Z}_{\mathcal{V}} = f_{\Theta}(\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_m, \mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_m), \tag{3.3}$$

where $\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_m$ are the incidence matrices of the $m$ hypergraphs.

#### **3.4 Hypergraph Structure Computation**

Hypergraph structure computation aims to learn the high-order correlations among data in the presence of missing links and an inaccurate initial structure. There are two scenarios in which hypergraph structure computation is performed: either the set of hyperedges is incomplete or the affiliation relationships between vertices and hyperedges are incomplete. Recommender systems and drug discovery [8–10] are typical hypergraph structure computation applications. For example, in a recommender system, the hyperedges describe the connections between items and users with specific semantics. The number of hyperedges is fixed, and the features of both vertices and hyperedges can be obtained as the input. Here, the target of hypergraph structure computation is to predict whether a vertex belongs to a hyperedge or not. Each newly predicted affiliation yields a new link indicating the connection. In contrast, in a knowledge hypergraph, the hyperedges represent facts in the real world, which are usually highly incomplete. The missing links are expected to be inferred from existing links by hypergraph structure computation. Therefore, in this second case, the objective of hypergraph structure computation is not only to optimize existing links but also to infer the unobserved links.

In the following, we describe the computation paradigms of these two cases separately. The first scenario is that the set of hyperedges is complete and the affiliation relationships between vertices and hyperedges are incomplete. In this case, we can usually extract a feature vector to represent each hyperedge. Given the input of vertex features $\mathbf{X}_{\mathcal{V}}$ and hyperedge features $\mathbf{X}_{\mathcal{E}}$, we can calculate the incidence matrix by a function of the vertex and hyperedge features as

$$\mathbf{H}^{\ast} = f_{\Theta}(\mathbf{X}_{\mathcal{V}}, \mathbf{X}_{\mathcal{E}}). \tag{3.4}$$

For example, the attention score can be used as an instance of the function in practice.
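One simple way to instantiate such a score is sketched below. The bilinear form with a sigmoid and the 0.5 threshold are assumptions made for illustration; the text above only states that an attention score can serve as the function. All names and the random inputs are hypothetical, and `Theta` would be learned in practice.

```python
import numpy as np

def attention_incidence(X_v, X_e, Theta, threshold=0.5):
    """Score every (vertex, hyperedge) pair and threshold the scores into an incidence matrix."""
    scores = X_v @ Theta @ X_e.T                 # bilinear compatibility, shape |V| x |E|
    probs = 1.0 / (1.0 + np.exp(-scores))        # sigmoid: membership probability
    return (probs >= threshold).astype(int), probs

rng = np.random.default_rng(0)
X_v = rng.normal(size=(4, 3))      # features of 4 vertices
X_e = rng.normal(size=(2, 3))      # features of 2 hyperedges
Theta = rng.normal(size=(3, 3))    # would be learned in practice

H_star, probs = attention_incidence(X_v, X_e, Theta)
print(H_star.shape)  # (4, 2)
```

Keeping `probs` alongside the thresholded matrix allows a soft incidence matrix to be used instead when a differentiable structure is needed.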

In the second scenario, if there are missing hyperedges in the observed hypergraph and the semantics of hyperedges are ambiguous, it is difficult to directly describe the hyperedges by features. Consequently, only the initial incomplete hypergraph structure and the features of vertices are available as the input. We denote the incidence matrix of the initial hypergraph structure by $\mathbf{H}^{(0)}$. The computation paradigm can be written as

$$\mathbf{H}^\* = f\_{\boldsymbol{\Theta}}(\mathbf{X}\_{\mathcal{V}}, \mathbf{H}^{(0)}),\tag{3.5}$$

which indicates that the new hypergraph structure is updated based on the original hypergraph structure following specific prior information.
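As a toy instance of Eq. (3.5), the sketch below updates an initial incidence matrix using only vertex features. The prior used here is an assumption chosen for illustration: a vertex is added to a hyperedge if it lies no farther from the hyperedge centroid than the farthest existing member. The function name and the data are hypothetical, and every hyperedge of the initial structure is assumed to be non-empty.

```python
import numpy as np

def complete_incidence(X_v, H0):
    """Add vertices to a hyperedge when they fall within the radius of its current members."""
    H0 = np.asarray(H0)
    H_star = H0.copy()
    for e in range(H0.shape[1]):
        members = H0[:, e] == 1
        centroid = X_v[members].mean(axis=0)           # center of the existing members
        dist = np.linalg.norm(X_v - centroid, axis=1)  # distance of every vertex to it
        radius = dist[members].max()                   # farthest existing member
        H_star[dist <= radius, e] = 1                  # existing members are always kept
    return H_star

X_v = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.2], [10.0, 10.0], [11.0, 10.0]])
H0 = np.array([[1, 0],
               [1, 0],
               [0, 0],   # vertex 2 has no observed affiliation
               [0, 1],
               [0, 1]])

H_star = complete_incidence(X_v, H0)
print(H_star)
```

Vertex 2 lies between vertices 0 and 1 in feature space, so it is added to the first hyperedge, while the second hyperedge is left unchanged.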

To guide the evolution of the hypergraph structure toward more accurate modeling of data correlations, it is necessary to evaluate the quality of the hypergraph structure based on the training data and prior information. If partial ground truth about the hypergraph structure is available, the performance of correlation modeling can be evaluated directly. However, there is no gold standard for hypergraph structure in most cases. Therefore, we may need to perform downstream tasks using the new hypergraph and indirectly evaluate hypergraph computation performance through the downstream task results. As shown in Fig. 3.1, hypergraph structure computation can be conducted under both the intra- and inter-hypergraph computation frameworks.

#### **3.5 Summary**

In this chapter, we introduce three hypergraph computation paradigms for different scenarios: intra-hypergraph computation, inter-hypergraph computation, and hypergraph structure computation. They focus, respectively, on learning the representation of a single subject from its internal component information, learning the representation of a subject by considering its correlations with other subjects, and learning the high-order correlations among data in the presence of missing links and an inaccurate initial structure. This chapter provides an overview of how to use hypergraph computation; the detailed hypergraph computation theory, methods, and applications will be introduced in the following chapters.

#### **References**



## **Chapter 4 Hypergraph Modeling**

**Abstract** Hypergraph modeling is the fundamental task in hypergraph computation, which targets establishing a high-quality hypergraph structure to accurately formulate the high-order correlations among data. In this chapter, we introduce different hypergraph modeling methods to show how to build hypergraphs using various pieces of information, such as features, attributes, and/or graphs. These methods are organized into two broad categories, depending on whether the correlations are explicit or implicit, to distinguish their similarities and differences. We then further discuss different hypergraph structure optimization and generation methods, such as adaptive hypergraph modeling, generative hypergraph modeling, and knowledge hypergraph generation.

#### **4.1 Introduction**

Although there are complex correlations among data in many applications, it is difficult to discover such complex correlations in many cases due to the limitations of observation technologies. Taking social networks as an example, the group information is a kind of high-order correlation that connects a number of people based on certain criteria. However, it is intractable to investigate all the groups when there are millions or even billions of vertices in a social network. Another typical example is the human brain network. Apparently, some functions of the brain are implemented by the communications among multiple brain regions rather than just two regions, which means that there exist high-order correlations among brain regions. Nevertheless, much manpower and material resources would be required to directly record such high-order correlations by neuroscience experiments. Therefore, it is necessary to study how to model such high-order correlations based on existing information in practical applications.

Hypergraph has shown its superiority in high-order correlation modeling. Hypergraph structure generation has attracted much attention and is still an open problem due to the complex correlations among non-standard data. In this chapter, we systematically review the existing hypergraph modeling methods, including both the implicit hypergraph modeling strategy and the explicit hypergraph modeling strategy. The implicit hypergraph modeling strategy aims to generate the hypergraph structure using vertex representations based on their distances or similarities, in which the correlations are not directly provided. The explicit hypergraph modeling strategy targets data with explicit high-order correlation information, such as attributes and pairwise connections. For the implicit hypergraph modeling strategy, we mainly introduce the distance-based and representation-based hypergraph structure generation methods. For the explicit hypergraph modeling strategy, we focus on the attribute-based and the network-based hypergraph generation approaches. Figure 4.1 illustrates the hypergraph modeling methods.

**Fig. 4.1** Different categories of hypergraph modeling methods

We further give four examples of hypergraph modeling in this chapter, covering computer vision, recommender systems, computer-aided diagnosis, and brain networks. In the last part, we discuss topics for further research on hypergraph modeling, which have the potential to go beyond the limitations of current methods that have difficulty adapting to complex data. Part of the work introduced in this chapter has been published in [1–4].

#### **4.2 Implicit Hypergraph Modeling**

In implicit hypergraph modeling, the correlations among data are not directly provided. Under such circumstances, we need to explore different representations of the data to build the correlations. Two typical methods for implicit hypergraph modeling are distance-based methods and representation-based methods. In distance-based methods, we explore the neighborhood of each sample in a specific feature space, and the samples with high similarity/low distance are connected by a corresponding hyperedge. In representation-based methods, the reconstruction relationship among the feature vectors of the samples is used to measure the neighborhood information, which can then be used to generate hyperedges.

#### *4.2.1 Distance-Based Hypergraph Generation*

Distance-based hypergraph generation methods construct the hyperedges based on the distances in the feature space for all the vertices. In general, the construction of the hypergraph can be divided into two steps: the incidence matrix generation and the hyperedge weight generation. For the incidence matrix generation, the connectivity on the hypergraph, i.e., the hyperedge, is determined with the consideration of the neighborhood relationships, where the neighbors of the vertices in the feature space are connected by these hyperedges. For the hyperedge weight generation, the weights of these hyperedges are calculated based on the distance information.

The incidence matrix is generated based on the neighbors of the vertices. There are two major approaches to determining the neighbors [1], i.e., the nearest-neighbor-based hyperedge generation strategy (shown in Fig. 4.2) and the clustering-based hyperedge generation strategy (shown in Fig. 4.3). The nearest-neighbor-based hyperedge generation strategy searches for the nearest vertices of a given vertex, i.e., the centroid, and connects these vertices by a hyperedge. The clustering-based hyperedge generation strategy groups the vertices according to their features and constructs a hyperedge to connect all vertices falling into the same cluster.

The nearest-neighbor-based hyperedge generation strategy starts by calculating the distances between all pairs of vertices in the feature space. Subsequently, two commonly used criteria [5] are applied to determine the neighbors of the given centroid, i.e., the *k*-NN neighbors [6] and the *ε*-ball neighbors [2]. The given centroid and the selected neighbors are connected by a hyperedge.

**Fig. 4.2** An illustration of the nearest-neighbor-based hyperedge generation strategy. (**a**) shows the *k*-NN neighbors of the given vertex, and (**b**) shows the *ε*-ball neighbors

**Fig. 4.3** Illustration of the cluster-based hyperedge generation strategy. This figure is from [1]

Here we denote V as the vertex set, *u* ∈ V as the given centroid, *X*(*u*) as the feature vector of *u*, *d*(*x*<sub>1</sub>, *x*<sub>2</sub>) = ||*x*<sub>1</sub> − *x*<sub>2</sub>||<sub>2</sub> as the Euclidean distance between the vectors *x*<sub>1</sub> and *x*<sub>2</sub>, N<sub>*k*</sub>(*u*) as the *k*-NN neighbor set of *u*, and N<sub>*ε*</sub>(*u*) as the *ε*-ball neighbor set of *u*. N<sub>*k*</sub>(*u*) contains the *k* vertices with the smallest distances to *u*, while N<sub>*ε*</sub>(*u*) contains the vertices whose distance to *u* is no larger than *ε*, i.e.,

$$\mathcal{N}\_{\epsilon}(u) = \{ v \mid d(X(u), X(v)) \le \epsilon \}. \tag{4.1}$$

The vertex *u* and its neighbors N(*u*) (either N<sub>*k*</sub>(*u*) or N<sub>*ε*</sub>(*u*)) are grouped together to generate a hyperedge *e*(*u*):

$$e(u) = \mathcal{N}(u) \cup \{u\},\tag{4.2}$$

and the hyperedge set E is formulated as

$$\mathcal{E} = \{e(u) \mid u \in \mathcal{V}\}.\tag{4.3}$$
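As a minimal sketch of Eqs. (4.1)–(4.3), the following pure-Python code builds one hyperedge per centroid with either the *k*-NN or the *ε*-ball criterion. The function names, the toy feature vectors, and the brute-force neighbor search are illustrative assumptions, not part of the original method description.

```python
import math

def euclidean(x1, x2):
    """Euclidean distance d(x1, x2) between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_hyperedges(X, k):
    """One hyperedge per centroid u: u together with its k nearest neighbors."""
    edges = []
    for u, xu in enumerate(X):
        others = [v for v in range(len(X)) if v != u]
        others.sort(key=lambda v: euclidean(xu, X[v]))  # brute-force search
        edges.append(frozenset(others[:k]) | {u})       # e(u) = N_k(u) ∪ {u}
    return edges

def eps_ball_hyperedges(X, eps):
    """One hyperedge per centroid u: all vertices within distance eps of u."""
    edges = []
    for xu in X:
        # u itself is always included because d(X(u), X(u)) = 0 <= eps
        edges.append(frozenset(v for v, xv in enumerate(X)
                               if euclidean(xu, xv) <= eps))
    return edges

# Toy data (hypothetical): two well-separated groups in a 2D feature space.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
knn_edges = knn_hyperedges(X, k=2)
ball_edges = eps_ball_hyperedges(X, eps=0.5)
```

Note that with *k* = 2 the centroid at index 3 is forced to pull in a far-away vertex, whereas the *ε*-ball hyperedge around the same centroid only contains its true neighbor; this illustrates how the choice of criterion shapes the resulting structure.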

The clustering-based hyperedge generation strategy starts by grouping the vertices according to their features using a clustering algorithm, such as *k*-means. Subsequently, the vertices belonging to the same cluster are connected by a hyperedge. Here we assume that the *k*-means algorithm clusters the vertex set V into *K* groups V<sub>1</sub>, ..., V<sub>*K*</sub>. Then, *K* hyperedges can be constructed using these clustering results:

$$\forall 1 \le k \le K, e\_k = \mathcal{V}\_k = \{v\_{k\_1}, v\_{k\_2}, \dots\}, \tag{4.4}$$

and the hyperedge set E is formulated as

$$\mathcal{E} = \{e\_k | \forall 1 \le k \le K\}. \tag{4.5}$$
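The clustering-based construction in Eqs. (4.4) and (4.5) can be sketched as follows. This is a hedged illustration: the tiny Lloyd-style *k*-means, the fixed initial centers, and all function names are assumptions made to keep the example self-contained and deterministic; any clustering routine could be substituted.

```python
def sqdist(x1, x2):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))

def kmeans_labels(X, init_centers, n_iter=20):
    """Minimal Lloyd iteration; returns the cluster index of every vertex."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(X)
    for _ in range(n_iter):
        # Assignment step: each vertex goes to its closest center.
        labels = [min(range(len(centers)), key=lambda c: sqdist(x, centers[c]))
                  for x in X]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_hyperedges(labels, K):
    """e_k connects all vertices assigned to cluster k."""
    return [frozenset(v for v, lab in enumerate(labels) if lab == k)
            for k in range(K)]

# Toy data (hypothetical) with K = 2 and hand-picked initial centers.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
labels = kmeans_labels(X, init_centers=[X[0], X[3]])
edges = cluster_hyperedges(labels, K=2)
```

Unlike the nearest-neighbor strategy, this yields only *K* hyperedges, and each vertex belongs to exactly one of them.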

Besides the similarity/distance in the feature space, other types of information, which can be used to measure the correlation in some specific space, such as the spatial information, can also be applied for hyperedge generation. For example, the spatial information of pixels in an image can be used to select a group of neighbor pixels for one centroid, which can be connected by a hyperedge, as shown in Fig. 4.4.

**Fig. 4.4** An illustration of using spatial information of pixels to generate a hyperedge
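A hedged sketch of a spatially defined hyperedge is given below. The square window, the `radius` parameter, and the function name are assumptions chosen for illustration; the text does not prescribe a particular neighborhood shape.

```python
def spatial_hyperedge(center, height, width, radius=1):
    """Pixels inside a (2*radius+1)-sized square window around `center`,
    clipped to the image boundary. Pixels are (row, col) tuples."""
    r0, c0 = center
    return frozenset(
        (r, c)
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(width, c0 + radius + 1))
    )

# In a 4x4 image, an interior centroid gets 9 pixels, a corner centroid 4.
edge_inner = spatial_hyperedge((1, 1), height=4, width=4)
edge_corner = spatial_hyperedge((0, 0), height=4, width=4)
```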

Typically, an incidence matrix **H** is used to represent the structure of the hypergraph, i.e.,

$$\mathbf{H}\_{ue} = \begin{cases} 1 & \text{if } u \in e \\ 0 & \text{otherwise} \end{cases},\tag{4.6}$$

where *u* ∈ V and *e* ∈ E .
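Eq. (4.6) can be illustrated with a short sketch that converts a list of hyperedges into a dense incidence matrix. The nested-list representation and the example hyperedges are assumptions; a sparse matrix would normally be preferable for large data.

```python
def incidence_matrix(num_vertices, hyperedges):
    """H[u][j] = 1 if vertex u belongs to the j-th hyperedge, else 0."""
    H = [[0] * len(hyperedges) for _ in range(num_vertices)]
    for j, e in enumerate(hyperedges):
        for u in e:
            H[u][j] = 1
    return H

# Hypothetical hypergraph with 5 vertices and 3 hyperedges.
hyperedges = [{0, 1, 2}, {2, 3}, {0, 3, 4}]
H = incidence_matrix(5, hyperedges)

# Row sums count the hyperedges containing each vertex;
# column sums count the vertices in each hyperedge.
vertex_counts = [sum(row) for row in H]
edge_sizes = [sum(col) for col in zip(*H)]
```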

The weight matrix of the hypergraph represents the importance of each hyperedge. A commonly used method for the hyperedge weight measurement is based on the Gaussian kernel, where the scores of each pair of vertices belonging to the same hyperedge are calculated using the distance between the vertices in the pair and the average score can be used as the weight of the hyperedge, i.e.,

$$w(e) = \sum\_{u,v \in e} \exp\left(-\frac{d(X(u), X(v))}{\sigma^2}\right),\tag{4.7}$$

where *w(e)* denotes the weight of hyperedge *e*, and *σ* is the bandwidth of the Gaussian kernel.

In this way, if the vertices connected by a hyperedge are highly similar to each other, the corresponding hyperedge weight is larger, and vice versa. The hyperedge weights can thus indicate how trustworthy each hyperedge is for further processing.

In practice, *σ* can be set as the median value of the distances among all vertices by

$$\sigma = \text{median}\_{u, v \in \mathcal{V}} d\left(X\left(u\right), X\left(v\right)\right), \tag{4.8}$$

where median denotes the median value. It is noted that the hyperedge weight can be set in other ways following the purpose of evaluating the importance of each hyperedge.
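The weighting scheme of Eqs. (4.7) and (4.8) can be sketched as below. Two details are assumptions of this sketch rather than statements of the text: the sum in Eq. (4.7) is taken over unordered pairs of distinct vertices, and the median in Eq. (4.8) is likewise taken over distinct pairs.

```python
import math
from itertools import combinations
from statistics import median

def euclidean(x1, x2):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def median_sigma(X):
    """Bandwidth: median pairwise distance over all vertices, Eq. (4.8)."""
    return median(euclidean(X[u], X[v])
                  for u, v in combinations(range(len(X)), 2))

def hyperedge_weight(e, X, sigma):
    """Gaussian-kernel score summed over vertex pairs in e, Eq. (4.7)."""
    return sum(math.exp(-euclidean(X[u], X[v]) / sigma ** 2)
               for u, v in combinations(sorted(e), 2))

# Toy data (hypothetical): a compact hyperedge versus one with an outlier.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
sigma = median_sigma(X)
w_tight = hyperedge_weight({0, 1, 2}, X, sigma)
w_loose = hyperedge_weight({0, 1, 3}, X, sigma)
```

Both hyperedges contain three vertices, so their weights are directly comparable; the compact hyperedge receives the larger weight.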

The main limitation of the distance-based hypergraph generation method is that distances can be inaccurate due to noise and outliers in the data, which may in turn introduce noise into the hypergraph structure. In practice, feature representation of the data is still a challenging task, and it is not easy to conduct effective feature extraction in certain application scenarios. The metric for distance calculation also matters. Although the Euclidean distance is commonly used, other metrics exist, such as the *L*1-norm and the negative cosine distance, and choosing among them requires experimental evaluation. Therefore, the distance-based hypergraph generation method may suffer under such circumstances.

The nearest-neighbor-based hyperedge generation strategy is the simplest one to deploy in practice. Its limitations are as follows. First, the hyperparameter, i.e., *k* for the *k*-NN neighbors and *ε* for the *ε*-ball neighbors, may significantly affect the structure of the hypergraph and further influence the performance of hypergraph learning. Unfortunately, there are still no general principles for the selection of *k* and *ε*, and adaptively adjusting these hyperparameters is not trivial in practice. Second, computing the *k*-NN neighbors is expensive in both time and memory for large-scale data.

Regarding the clustering-based hyperedge generation strategy, there is no common way to determine how many clusters the vertex set should be divided into, and the scale of the clustering results also affects the structure of the hypergraph. A possible solution is to conduct clustering multiple times at different scales, which generates multiple hypergraphs with different *K* values, and then to combine these hypergraphs for multilevel representation.

#### *4.2.2 Representation-Based Hypergraph Generation*

As introduced above, distance-based hypergraph generation has some disadvantages. For the *k*-NN hypergraph, every hyperedge connects a centroid sample and its *k* nearest samples, so the hypergraph is uniform and its structure may not be sufficiently adaptive. Also, the distance-based hypergraph is sensitive to noise. To address these problems, the hypergraph can be generated by representation-based methods.

Different from the distance-based methods, which generate hyperedges through some metric in the feature space, the representation-based methods derive the relations among the vertices from feature reconstruction, as shown in Fig. 4.5. Different reconstruction strategies lead to different generation effects. Here we introduce three main branches of representation-based hypergraph construction, i.e., the *l*1-hypergraph [7], the *l*2-hypergraph [8], and the combination of both. The details of these methods are described as follows.

#### **(1)** *l*1**-Hypergraph Generation**

**Fig. 4.5** An illustration of the representation-based methods

For the *l*1-hypergraph construction, as introduced in [7], the sparse representation method can be used to formulate the relation between a hyperedge and its vertices. The sparse representation is embodied in the coefficients that linearly combine the basis vectors to reconstruct the input vector. In the hyperedge construction, the centroid vertex is reconstructed by the other vertices in the same hyperedge, and the coefficients are used to represent the incidence matrix of the hypergraph. Mathematically, we denote the centroid vertex in the *l*1-hypergraph by *v<sub>c</sub>*, and its reconstruction can be formulated as

$$\begin{aligned} \arg\min\_{\mathbf{z}} & \|\mathbf{B}\mathbf{z} - \mathbf{X}(v\_c)\|\_2^2 + \gamma \|\mathbf{z}\|\_1, \\\\ \text{s.t. } & \forall i, \mathbf{z}\_i \ge 0, \end{aligned} \tag{4.9}$$

where **X**(*v<sub>c</sub>*) denotes the feature vector of the centroid vertex, **B** denotes the features of its *k* nearest vertices, **z** is the reconstruction coefficient vector, and **z**<sub>*i*</sub> is its *i*th element. The first term in Eq. (4.9) is the reconstruction term, which encourages a good representation of the input vector **X**(*v<sub>c</sub>*) by the basis vectors in **B**. The second term is the *l*1-regularization, which forces the coefficient vector **z** to be sparse. *γ* is a hyperparameter that balances the influence of the two terms. The constraint **z**<sub>*i*</sub> ≥ 0 makes the reconstruction coefficients non-negative. Note that each sample may act as a centroid vertex to generate a hyperedge. For a dataset containing *n* samples, the optimization problem is therefore solved *n* times. The non-zero reconstruction coefficients can be seen as the connection weights of the neighborhood vertices in the hyperedge, and the neighborhood vertices with zero reconstruction coefficients are outside the hyperedge. The connection weights between the hyperedge and its neighborhood vertices can thus be set to the corresponding coefficients in **z**. The incidence matrix **H** of this hypergraph is defined as

$$\mathbf{H}(v\_j, e\_i) = \begin{cases} \mathbf{z}\_i^j & \text{if } v\_j \in e\_i \\ 0 & \text{otherwise} \end{cases},\tag{4.10}$$

where *e<sub>i</sub>* is the hyperedge generated with the centroid vertex *v<sub>i</sub>*, and **z**<sub>*i*</sub><sup>*j*</sup> is the *j*th element of the representation coefficient vector **z** obtained for *v<sub>i</sub>*.
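To make Eq. (4.9) concrete, the sketch below solves the non-negative *l*1-regularized reconstruction for a single centroid with a plain projected gradient iteration. The solver choice, the step-size rule, the function name, and the toy vectors are all assumptions of this sketch; the method in [7] may use a different optimizer.

```python
def l1_hyperedge_coefficients(neighbors, x_c, gamma, n_iter=500):
    """Approximately solve min_z ||Bz - x_c||_2^2 + gamma * ||z||_1, z >= 0.

    neighbors[j] is the feature vector of the j-th neighbor (column j of B).
    Because z >= 0, the l1 term equals gamma * sum(z), so one step is a
    gradient step followed by clipping at zero.
    """
    n, dim = len(neighbors), len(x_c)
    # 2 * ||B||_F^2 upper-bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(b * b for col in neighbors for b in col))
    z = [0.0] * n
    for _ in range(n_iter):
        residual = [sum(z[j] * neighbors[j][d] for j in range(n)) - x_c[d]
                    for d in range(dim)]
        for j in range(n):
            grad_j = 2.0 * sum(neighbors[j][d] * residual[d] for d in range(dim))
            z[j] = max(0.0, z[j] - step * (grad_j + gamma))
    return z

# Toy example (hypothetical): only the first neighbor points toward x_c.
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x_c = [1.0, 0.0]
z = l1_hyperedge_coefficients(neighbors, x_c, gamma=0.1)

# Neighbors with non-zero coefficients form the hyperedge with the centroid.
members = [j for j, zj in enumerate(z) if zj > 0.0]
```

In this toy case the coefficient of the first neighbor converges to 1 − *γ*/2 = 0.95, while the other two are clipped to zero and are therefore left outside the hyperedge.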

#### **(2)** *Elastic***-Hypergraph Generation**

The *l*1-regularization in the *l*1-hypergraph can generate sparse and effective hypergraphs, but it has difficulty revealing the grouping information of samples. To enhance the grouping effect, the elastic net [9] is introduced, which combines an *l*2-norm penalty with the *l*1-norm constraint. The objective function of the elastic net can be formulated as

$$\begin{aligned} \arg\min\_{\mathbf{z}} & \|\mathbf{B}\mathbf{z} - \mathbf{X}(v\_c)\|\_2^2 + \gamma \|\mathbf{z}\|\_1 + \beta \|\mathbf{z}\|\_2^2, \\\\ \text{s.t. } & \forall i, \mathbf{z}\_i \ge 0. \end{aligned} \tag{4.11}$$

The elastic net creates a hyperedge whose connection weights can be determined by the reconstruction coefficients, and it uses both the *l*2-norm and the *l*1-norm penalties to group more relevant and important neighbors.

#### **(3)** *l*2**-Hypergraph Generation**

Note that the above two representation-based approaches have two drawbacks: (1) They use an *l*2-norm-based metric to measure the reconstruction errors, which makes them sensitive to sparse reconstruction errors. (2) Since these methods create hyperedges by linear reconstruction, they are unable to handle nonlinear data. By eliminating the sparse noise component from the original data and integrating locality-preserving constraints into the linear regression framework, the *l*2-hypergraph [8] is created to address these issues as

$$\begin{aligned} &\arg\min\_{\mathbf{C}, \mathbf{E}} \|\mathbf{X} - \mathbf{X}\mathbf{C} - \mathbf{E}\|\_{F}^{2} + \frac{\gamma\_{1}}{2} \|\mathbf{C}\|\_{F}^{2} + \frac{\gamma\_{2}}{2} \|\mathbf{Q} \odot \mathbf{C}\|\_{F}^{2} + \beta \|\mathbf{E}\|\_{1}, \\ &\text{s.t. } \mathbf{C}^{T}\mathbf{1} = \mathbf{1}, \text{Diag}(\mathbf{C}) = \mathbf{0}, \end{aligned} \tag{4.12}$$

where ⊙ stands for element-wise multiplication, **C** is the coefficient matrix, **E** is the data error matrix, and **Q** is the locality adapter matrix used to retain the local manifold structures. Hyperedges can then be created using the coefficient matrix **C**.

Representation-based hyperedges evaluate how well each vertex can be reconstructed by other vertices in the feature space, so that the correlations between feature vectors can be computed and used to create connections among the vertices. Similar to the distance-based methods, these methods may suffer from data noise and outliers. Another drawback is that only a portion of the relevant samples is chosen for reconstruction during computation, so the resulting hyperedges may not accurately capture the data correlation over the complete data distribution.

#### **4.3 Explicit Hypergraph Modeling**

Different from implicit hypergraph modeling, in some cases, there are existing connections among data. Explicit hypergraph modeling focuses on these scenarios and generates hypergraph structure using attribute information or networks.

#### *4.3.1 Attribute-Based Hypergraph Generation*

Real-world data are associated with attributes in many cases. For example, the users in a social network could have profiles, such as gender, age, and interests. The visual objects in images could have different characteristics, such as color, shape, and texture. Given data annotated with different attributes, attribute-based hypergraph generation methods can be adopted to construct the hypergraph based on the attribute information, which provides an explicit way to encode semantic properties and diffuse knowledge [10]. As such a construction schema directly leverages the apparent correlations among objects, it can be categorized as an explicit hyperedge generation method.

To generate a hypergraph using attributes, two steps are needed: the hypergraph structure construction and the hyperedge weight assignment. The first step is to generate the vertex set V and the hyperedge set E based on the attribute information, and the second step is to assign different weights to the hyperedges and acquire the weight matrix **W**.

When constructing the hypergraph from attribute data, the samples to be explored are first modeled as vertices in a hypergraph, denoted as the vertex set V. An attribute shared by different vertices indicates that these samples share common characteristics, which may be an objective tag or a subjective evaluation. Therefore, each attribute can be regarded as the semantic information on a connection, i.e., a hyperedge. In attribute-based hypergraph generation methods, a group of hyperedges (called a hyperedge group) is generated by linking, for each attribute, all the vertices that share it. The number of hyperedges thus equals the number of attributes. Such a hyperedge group generated based on the attribute information is denoted by

$$\mathcal{E}\_{\text{attribute}} = \left\{ N\_{\text{att}}(a) \mid a \in \mathcal{A} \right\}, \tag{4.13}$$

where *N*<sub>att</sub>(*a*) is the subset of the vertex set V sharing the attribute *a*, and A is the set containing all defined attributes. Sometimes the attributes are hierarchical, e.g., cars within vehicles. In this case, A and E<sub>attribute</sub> can be extended to involve the subtypes of the attributes.

**Fig. 4.6** An illustration of the attribute-based hyperedge generation method

Here we give a simple example to show how to construct the hypergraph structure using the attribute information, as shown in Fig. 4.6. Given social network data with user profiles, the users in the social network are first modeled as vertices V. The user profiles contain objective facts such as gender and age as well as subjective characteristics such as interests and knowledge, both of which can be adopted to generate the hyperedge groups. For example, we can have a hyperedge *e*<sub>female</sub> connecting all female users and a hyperedge *e*<sub>sports</sub> linking the users who like sports. Additionally, as discussed above, sometimes the attributes are hierarchical. Under such circumstances, we can generate hyperedges at different levels to characterize multiple-scale attribute connections. For instance, suppose users A, B, C, and D all like sports, where users A and B like playing basketball, and users C and D like playing tennis. In this case, we first generate *e*<sub>sports</sub> connecting users A, B, C, and D, and then *e*<sub>basketball</sub> and *e*<sub>tennis</sub> are generated to link A, B and C, D, respectively. The hyperedge set in this example can be written as

$$\mathcal{E} = \{e\_{female}, e\_{sports}, e\_{basketball}, e\_{tennis}\}.$$
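The attribute-based construction of Eq. (4.13) applied to this example can be sketched as follows. The profile dictionary is hypothetical: in particular, which users are female and the extra user E are assumptions added only to make the example complete.

```python
def attribute_hyperedges(profiles):
    """Map each attribute a to N_att(a), the set of vertices sharing it."""
    edges = {}
    for user, attributes in profiles.items():
        for a in attributes:
            edges.setdefault(a, set()).add(user)
    return edges

# Hypothetical user profiles with hierarchical interests
# (basketball and tennis are subtypes of sports).
profiles = {
    "A": {"female", "sports", "basketball"},
    "B": {"sports", "basketball"},
    "C": {"female", "sports", "tennis"},
    "D": {"sports", "tennis"},
    "E": {"female"},
}
edges = attribute_hyperedges(profiles)
```

Each key of `edges` corresponds to one hyperedge, so the number of hyperedges equals the number of distinct attributes, including the subtype attributes.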

The hyperedge weights are also important here. For an attribute-based hypergraph, the number of attributes shared among the samples connected by a hyperedge can quantitatively reflect the relative correlation strength. Specifically, the more attributes the samples share, the stronger the connections among the corresponding vertices, and the larger the weight assigned to the hyperedges. Each hyperedge *e* can be seen as a clique, and the mean of the heat kernel weights of the pairwise edges in this clique is taken as the corresponding hyperedge weight *w(e)*:

$$w(e) = \frac{1}{\delta(e)(\delta(e) - 1)} \sum\_{u, v \in e} \exp\left(-\frac{\|\mathbf{X}(u) - \mathbf{X}(v)\|\_2^2}{\sigma^2}\right),\tag{4.14}$$

where *δ(e)* indicates the degree of hyperedge *e*, and **X***(u)* and **X***(v)* denote the feature vectors of vertices *u* and *v*, respectively.

The attribute-based hypergraph generation method can capture the semantic properties apart from the structural information conveyed by the hypergraph structures themselves. The attributes serve as a type of intermediate-level feature representation of vertices and can provide another description for vertices beyond the low-level representations. However, the attributes are not available all the time. When there is no natural attribute descriptor for vertices, some extra solutions need to be applied to conduct attribute-based hypergraph generation. One possible solution is to manually design attribute tags, which may be both cumbersome and time-consuming. The other alternative is extracting attribute information from the raw low-level features by machine learning models [11]. Such a schema is more time-saving than manual definition, whereas the results rely heavily on the accuracy of the machine learning model. We also note that the attributes can be nameable, which indicates the semantic information can be directly understood, while they can also be non-nameable, which means the semantic information is not explicit.

#### *4.3.2 Network-Based Hypergraph Generation*

There are many applications of network data, including social networks [12], reaction networks [13], cellular networks [13], and human brain networks [3]. For these data, subject correlations can be generated from the network information. In a typical work on social media analysis [14], the vertices of the hypergraph represent users and images. In addition to the visual–textual relationships among images, hyperedges can be used to capture the social links between users and images; the two kinds are called homogeneous and heterogeneous hyperedges, respectively. The nearest-neighbor-based and attribute-based hyperedge generation methods are used to construct homogeneous hyperedges representing the visual and textual relations among images. Users and images are connected through social link relations to construct heterogeneous hyperedges. As another example, both friendship and mobility information in location-based social networks can be used to generate hypergraphs [12]. As a result, friendship hyperedges are generated within the social domain, and check-in hyperedges are generated across the social, semantic, temporal, and spatial domains. A protein–protein interaction network is naturally represented by a hypergraph [15], whose subsets (hyperedges) can be represented by tandem affinity purification (TAP) data.

Aside from the first-order correlation, high-order correlations, e.g., the second- and third-order correlations, within the network can also be used to generate hyperedges. A center vertex can be connected through a hyperedge with its first-order neighbors and its high-order neighbors (i.e., vertices whose shortest-path distance to the centroid is greater than 1). Only a vertex's low-order neighbors need to be considered if attention is focused on its local connections in the network. As an example, users who have similar preferences on items can be connected within the recommendation network [4] according to first-order and second-order correlations, which are then used to generate a hypergraph and to perform collaborative filtering for recommendation. Alternatively, if the information of a vertex travels a long distance in the network, higher-order correlations are required to generate hyperedges.

**Fig. 4.7** An illustration of the network-based hyperedge generation method

We then introduce two typical approaches to constructing hyperedges from a network/graph structure, i.e., the pair-based and the *k*-hop-based approaches. Figure 4.7 illustrates these two approaches. In this example, G<sub>*s*</sub> = (V<sub>*s*</sub>, E<sub>*s*</sub>) represents the graph structure, with *v<sub>i</sub>* ∈ V<sub>*s*</sub> representing a vertex and *e<sub>sij</sub>* ∈ E<sub>*s*</sub> representing an edge connecting *v<sub>i</sub>* and *v<sub>j</sub>*. We let **A** denote the adjacency matrix of G<sub>*s*</sub>. From such a graph structure, two types of hyperedge groups can be generated (Fig. 4.7).

#### **(1) Pair-Based Hyperedge Generation Strategy**

E<sub>pair</sub> denotes the hyperedges constructed from pairwise correlations in the network/graph. E<sub>pair</sub> directly transforms the graph structure into a group of 2-uniform hyperedges, which can be formulated as follows:

$$\mathcal{E}\_{\text{pair}} = \left\{ \{v\_i, v\_j\} \mid (v\_i, v\_j) \in \mathcal{E}\_s \right\}. \tag{4.15}$$

As a result, E<sub>pair</sub> covers the low-order (pairwise) correlations in the graph structure, which is the basic information for high-order correlation modeling.

#### **(2)** *k***-Hop-Based Hyperedge Generation Strategy**

E<sub>hop</sub> denotes the hyperedges constructed from the *k*-hop neighbors in the network/graph. First, we define the *k*-hop neighborhood of a vertex *v* in graph G<sub>*s*</sub> as follows:

$$N\_{\text{hop}\_k}(v) = \{ u \mid \mathbf{A}\_{uv}^k \neq 0, \, u \in \mathcal{V}\_s \}.$$

Based on the positions reachable by *k* hops in the graph structure, E<sub>hop</sub> aims to find the related vertices for a central vertex. The value of *k* ranges over [2, *n<sub>v</sub>*], where *n<sub>v</sub>* refers to the number of vertices in G<sub>*s*</sub>. The hyperedge group E<sub>hop</sub> with *k*-hop can be written as

$$\mathcal{E}\_{\text{hop}\_k} = \left\{ N\_{\text{hop}\_k}(v) \mid v \in \mathcal{V}\_s \right\}. \tag{4.16}$$

By extending the search radius to more distant vertices, the hyperedges in E<sub>hop</sub> connect groups of vertices rather than only the two vertices joined by an edge in the graph structure. Compared with the pairwise correlations in E<sub>pair</sub>, they can therefore provide richer correlation information.
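Both network-based strategies can be sketched from an adjacency matrix, as shown below. This is an illustrative sketch under assumptions: the graph is undirected and unweighted, matrix powers are computed with Boolean products (which preserve the non-zero pattern of **A**<sup>*k*</sup> for non-negative **A**), and the function names and the toy path graph are hypothetical.

```python
def pair_hyperedges(A):
    """E_pair: one 2-vertex hyperedge per edge of an undirected graph."""
    n = len(A)
    return [frozenset({i, j})
            for i in range(n) for j in range(i + 1, n) if A[i][j]]

def bool_matmul(P, Q):
    """Boolean matrix product: entry is 1 if any intermediate vertex links i and j."""
    n = len(P)
    return [[int(any(P[i][t] and Q[t][j] for t in range(n)))
             for j in range(n)] for i in range(n)]

def khop_hyperedges(A, k):
    """E_hop_k: for each v, the set {u | (A^k)[u][v] != 0}."""
    Ak = A
    for _ in range(k - 1):
        Ak = bool_matmul(Ak, A)
    n = len(A)
    return [frozenset(u for u in range(n) if Ak[u][v]) for v in range(n)]

# Toy path graph 0 - 1 - 2 - 3 (hypothetical).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
pair_edges = pair_hyperedges(A)
hop2_edges = khop_hyperedges(A, k=2)
```

With this definition, a non-zero entry of **A**<sup>*k*</sup> indicates a walk of exactly *k* steps, so for *k* = 2 every non-isolated vertex appears in its own neighborhood (it can step to a neighbor and back), while its direct neighbors do not.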

Here, we discuss the advantages and limitations of the two types of hyperedges generated from network data. The pair-based construction can clearly model only low-order correlations, and thus cannot naturally explore high-order correlations. In contrast, hyperedges generated by the *k*-hop-based method have the high-order information of the original network built in. However, the high-order information in this type of hyperedge may be redundant and ambiguous. This is because the connection details may be lost in the *k*-hop-based hyperedges, which means that the original network/graph cannot be reconstructed from this type of hyperedge. Additionally, the *k*-hop-based hyperedges may lead to irreversible over-smoothing in each hyperedge, because the number of *k*-hop neighbors grows exponentially as *k* increases.

#### **4.4 Typical Examples of Hypergraph Modeling**

Here we give several examples of hypergraph modeling in real applications, including computer vision, recommender systems, computer-aided diagnosis, and brain networks, to demonstrate how to construct hypergraphs from data.

#### *4.4.1 Computer Vision*

Computer vision has attracted much attention in recent decades. In computer vision, there are multi-modal data, such as images and point clouds. Both low-level and high-level vision tasks have been deeply investigated. In these tasks, an important but challenging issue is the complex data correlation behind the vision data. For example, the pixels or patches are the elements of an image, while the semantic information of the image is represented by these pixels or patches. Terrence Joseph Sejnowski [16] mentioned that "*In a task such as face recognition, in which important information may be contained in the high-order relationships among pixels, it seems reasonable to expect that better basis images may be found by methods sensitive to these high-order statistics*." Similar situations occur for multi-modal 3D object representation. A 3D object can usually be represented in different ways, such as a single image, multi-view images, point clouds, voxels, and meshes. Under such circumstances, the correlation among these objects becomes even more complicated. A simple graph is not sufficient to model such high-order relationships among the pixels/patches in one image or among different 3D objects.

We first look into the high-order correlation modeling for an image. A 2D image is composed of a set of pixels, and each pixel owns a feature vector (channels). To generate a hypergraph that models the correlation behind this image, we can take each patch in the image as a vertex in the hypergraph, and the objective is to generate a group of hyperedges to connect these vertices (patches). Here we can employ the distance-based hypergraph generation method, in which each patch is selected as the centroid, and its nearest neighbors in the feature space are connected by a hyperedge. This process is shown in Fig. 4.8. Furthermore, we can also employ the spatial information to build connections among these patches. The patches with close spatial locations in the image can be connected by a hyperedge. Figure 4.9 shows an example of hypergraph modeling for image patches using spatial information.

For 3D visual objects, there are complex correlations among them. For example, different pieces of furniture, such as tables and chairs, have legs, and different vehicles, such as cars and bicycles, have wheels. Another challenging issue comes from the multi-modality aspect. Given different modal data of 3D objects, the correlations are composed of intra-modal correlations and cross-modal correlations, as shown in Fig. 4.10. Given a large number of 3D objects, it is difficult to describe all these correlations manually in an accurate and complete way.

**Fig. 4.8** An example of hypergraph modeling for image patches using feature information

**Fig. 4.9** An example of hypergraph modeling for image patches using spatial information

**Fig. 4.10** The complex correlations among multi-modal 3D objects

In order to build a hypergraph structure efficiently, we usually extract the features of 3D objects and then build implicit hypergraphs. 3D objects can be described by multiple modalities, including point clouds, views, grids, and voxels. We can extract the descriptors of each modality through the corresponding deep neural networks, such as dynamic graph CNN (DGCNN) [17] and PointNet [18] for point cloud data, and multi-view convolutional neural networks (MVCNN) [19] and group-view convolutional neural networks (GVCNN) [20] for multi-view data. Once the multi-modal features have been obtained, we can build a hypergraph structure for each kind of feature.

Here, each 3D object can be represented by a vertex in the hypergraph. Each time, one object is selected as the centroid in a feature space, and its nearest neighbors are connected by a corresponding hyperedge. This process is repeated until all objects have been selected as the centroid once in this feature space. Every feature and possible feature combination can be used in this process. In this way, we can obtain multiple hypergraphs, represented by incidence matrices **H**<sub>1</sub>, **H**<sub>2</sub>, ..., **H**<sub>*m*</sub>, to formulate the correlations under different modalities. The pipeline is demonstrated in Fig. 4.11. We can further concatenate these incidence matrices along the hyperedge axis to integrate these hypergraphs and obtain the complete hypergraph structure, as shown in Fig. 4.12.
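The concatenation step can be sketched as follows, assuming dense incidence matrices stored as nested lists with one row per vertex and one column per hyperedge. The function name and the two toy matrices are hypothetical.

```python
def concat_incidence(*matrices):
    """Concatenate incidence matrices along the hyperedge (column) axis.

    All matrices must be defined over the same, identically ordered vertices.
    """
    n = len(matrices[0])
    if any(len(H) != n for H in matrices):
        raise ValueError("all incidence matrices must share the same vertex set")
    combined = []
    for v in range(n):
        row = []
        for H in matrices:
            row.extend(H[v])  # append this modality's hyperedges for vertex v
        combined.append(row)
    return combined

# Hypothetical incidence matrices from two modalities over 3 objects.
H1 = [[1, 0], [1, 1], [0, 1]]   # e.g., built from point cloud features
H2 = [[1], [0], [1]]            # e.g., built from multi-view features
H = concat_incidence(H1, H2)
```

The combined matrix keeps the vertex set unchanged and simply pools the hyperedges of all modalities, so the number of columns is the sum of the hyperedge counts.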

#### *4.4.2 Recommender System*

**Fig. 4.11** An example of hypergraph modeling for multi-modal 3D objects

**Fig. 4.12** An illustration of multi-hypergraph combination. This figure is from [5]

In a recommender system, the relationship between users and items can be represented by a bipartite graph, that is, if an item is in a user's recommendation list, then we connect the user vertex with the item vertex. This bipartite graph can be simply transformed into a hypergraph, where the vertices on one side remain vertices and the vertices on the other side become hyperedges, as shown in Fig. 4.13. In this way, each user can be represented as a vertex in the hypergraph, and the users who share the same item are connected by a corresponding hyperedge. If the items are instead regarded as vertices, then the hyperedges are generated using shared users. This hypergraph generation procedure follows the attribute-based strategy.

**Fig. 4.13** An example of hypergraph modeling for a recommender system

Mathematically, the ranking matrix of the recommender system is equal to the incidence matrix of the corresponding hypergraph. With this transformation, we can solve problems in recommender systems via hypergraph learning methods. In fact, undirected bipartite graph modeling and hypergraph modeling are interchangeable in some cases. If the edges in the bipartite graph are weighted, we can use hyperedge-dependent vertex weights accordingly.
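As a small illustration of this equivalence, the sketch below treats a binary user-item matrix as an incidence matrix. The toy matrix `interactions` and all variable names are hypothetical; a real system would load its own interaction data.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items.
# An entry of 1 means the item is in the user's recommendation list.
interactions = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
])

# Users as vertices, items as hyperedges: the matrix itself is H.
H_users = interactions.astype(float)
# Items as vertices, users as hyperedges: simply the transpose.
H_items = H_users.T

# Users connected by the hyperedge of item 0.
users_sharing_item0 = np.flatnonzero(H_users[:, 0])
# Number of hyperedges per user and number of users per hyperedge.
vertex_degree = H_users.sum(axis=1)
edge_degree = H_users.sum(axis=0)
```

Switching between the user-centered and item-centered view only requires a transpose, which is why the two modeling choices are interchangeable in this unweighted case.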

#### *4.4.3 Computer-Aided Diagnosis*

In computer-aided diagnosis, the main objective is to determine whether an incoming patient has a specific disease, or how serious the disease is. For diagnosis, the experience and knowledge come from previous medical records. Case-based diagnosis has shown its importance in practice. For AI-based computer-aided diagnosis, it is important to exploit the existing labeled training data, which can be very scarce in some cases. These medical records may contain different examination files, MR images, CT images, and other types of data.

A conventional pipeline for computer-aided diagnosis first extracts features from clinical text or medical imaging data and then applies computer programs to automatically distinguish healthy people from patients. The commonly used techniques involve natural language processing, medical imaging analysis, machine learning, etc. It is worth noting that the existing methods mostly focus on individual subject classification. Under such circumstances, how to model the correlation among these subjects, including the training data and the incoming patient (the testing data), is an important but difficult task.

**Fig. 4.14** An example of hypergraph modeling for computer-aided diagnosis

Here, a hypergraph at the subject level, i.e., one in which each vertex stands for a subject, can be generated, where the hyperedges can be created using the distance-based method or the attribute-based method. Given the MR images or other medical data, the features can be used to measure the distance between each pair of subjects. Then, the *k*-NN scheme can be used to select the nearest neighbors of a centroid vertex and generate a corresponding hyperedge, as shown in Fig. 4.14.

Another type of application is to model the inter-correlation within one medical image, such as a gigapixel whole-slide histopathological image (WSI). Survival prediction is an important task in medical image analysis, which aims to model the life duration of a patient using WSIs. Different from traditional images, WSIs are very large and rich in detail. Therefore, traditional image representation methods do not work well in this task. To formulate the inter-correlation inside a WSI, a hypergraph can be generated to capture the correlations among patches. A group of patches, such as 2000 or 8000 patches, can be sampled from the original WSI. Then, these patches are represented as vertices in the corresponding hypergraph. The hyperedges can be generated based on the visual features of these patches, their spatial information, or both, using the distance-based hypergraph generation methods.

#### *4.4.4 Brain Network*

Recently, the development of neuroimaging techniques has provided a way to understand the brain network on a large scale. Studies have shown that the interaction relationships in the brain, from neuronal information flow to whole-brain functional networks, are the basis of its functionality. Therefore, formulating the brain as a complex network and decoding its signals may further deepen our understanding of the human cognitive processes. The conventional functional network is usually modeled and represented based on pairwise correlations between two brain regions. However, neurologically, a brain region predominantly interacts with one or more other brain regions.

**Fig. 4.15** An example of hypergraph modeling for brain network

When using hypergraphs to model a single brain network, the vertices denote brain regions, and the hyperedges represent the interactions among multiple regions. Each element in the incidence matrix corresponds to the contribution of the brain region to the specific function, as shown in Fig. 4.15. In this process, each region can be selected as the centroid, and its nearest neighbor regions in the feature space can be selected and connected by a corresponding hyperedge.

#### **4.5 Hypergraph Modeling in Next Stage**

In this part, we discuss future research topics in hypergraph modeling that may render it more accurate and flexible, including adaptive hypergraph modeling, generative hypergraph modeling, and knowledge hypergraph generation.

#### *4.5.1 Adaptive Hypergraph Modeling*

Once the hypergraph structure has been initialized, it is usually kept fixed during the learning process. However, the initial hypergraph structure constructed by existing hypergraph modeling methods contains many noisy connections that may be destructive for the learning process. Therefore, the original structure needs to be optimized according to the data and downstream tasks to reduce structural noise. Although there is some existing work on hypergraph structure optimization, these efforts are still far from reaching the goal of accurately modeling complex data correlations.

At this stage, the selection of hypergraph generation methods still depends on experience rather than on a theoretical strategy. A possible route to automate hypergraph generation is to create various hypergraphs via different approaches and then group them together to obtain a more complex but relatively complete hypergraph. The grouping weights can be learned in the training stage. Another way is to update the incidence matrix of the hypergraph structure, which can be either directly optimized as learnable parameters or indirectly optimized via metric learning.

#### *4.5.2 Generative Hypergraph Modeling*

The generative models are a set of models that learn the distribution from the observed data and generate new data instances based on probability. They have been widely used in different tasks such as generation, synthesis, translation, reconstruction, prediction, etc. In recent years, with the development of deep graph representation learning, deep graph generative models have attracted much attention. Given a series of training graph data (assumed to be taken from the same distribution), the neural network is trained as a graph generation model. Inspired by these generative models, building a hypergraph by estimating the distribution of latent structures from observed data may be a viable way. Given a set of training hypergraphs or sampling signals from every vertex, the distribution can be implicitly or explicitly derived by combining hypergraph embeddings and generative models.

However, there is still a long way to go for hypergraph generative models to become practical. Unlike simple graphs, whose distributions are the joint distributions of all pairwise correlations between data, the distribution of a hypergraph structure is the joint distribution of all high-order correlations among data. As a result, the joint distribution is high-dimensional, and the variables are dependent on each other. Estimating the density function is intractable because of the considerable complexity. Furthermore, due to the high dimensionality, a large amount of observed data is required to bring the density estimate close to the true distribution, and such data are difficult to obtain in practical applications. Despite the above obstacles, generative hypergraph modeling is an area worth exploring in the future and may become useful in many areas, such as simulations of complex physical systems, trajectory tracking system identification, and community detection.

#### *4.5.3 Knowledge Hypergraph Generation*

Knowledge hypergraphs have attracted much attention in recent years since they can store facts using high-arity relations. In a knowledge hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$, the vertices represent the entities, and the hyperedges represent the high-arity relations. The basic unit is a fact based on a high-arity relation. Unlike a knowledge graph, which only uses binary relations, the relations in a knowledge hypergraph can be defined on any number of entities.

Although there have been several pieces of work targeting knowledge hypergraph embedding and completion, such as Multi-fold TransH (m-TransH) [21], Hyper-relational Knowledge Graph Embedding (HINGE) [22], and N-ary Link Prediction (NaLP) [23], they are mostly based on the assumption that there exists an initial knowledge hypergraph or some hyper-relational links. Few efforts have been made on the generation of the initial knowledge hypergraph. In practice, manually mining the hyper-relations among entities requires much time and effort. Therefore, it is of great significance to study knowledge hypergraph generation methods for efficient and comprehensive knowledge inference.

#### **4.6 Summary**

In this chapter, we introduce the hypergraph modeling methods, which are categorized into the implicit type and the explicit type. Implicit hyperedges can be used in tasks in which we can represent each subject and develop metrics to evaluate sample similarity. By using the sparse representation, representation-based approaches might mitigate the impact of noisy vertices in comparison with distance-based ones. Explicit hyperedges are more appropriate when the input data already carry certain structural details. In general, choosing a suitable hyperedge generation method is important for a specific task. Finally, adaptive and generative hypergraph modeling are worth further exploration in order to adjust hypergraph structures based on the data and the ongoing tasks.

#### **References**



The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 5 Typical Hypergraph Computation Tasks**

**Abstract** After a hypergraph structure has been generated for the data, the next step is to conduct data analysis on the hypergraph. In this chapter, we introduce four typical hypergraph computation tasks, including label propagation, data clustering, cost-sensitive learning, and link prediction. The first typical task is label propagation, which predicts the labels of the vertices, i.e., assigns a label to each unlabeled vertex in the hypergraph, based on the labeled information. In general, label propagation propagates the label information from labeled vertices to unlabeled vertices through the structural information of the hyperedges. In this part, we discuss the hypergraph cut and the random walk interpretation of label propagation on hypergraphs. The second typical task is data clustering, which is formulated as dividing the vertices of a hypergraph into several parts. In this part, we introduce a hypergraph Laplacian smoothing filter and an embedded model for hypergraph clustering tasks. The third typical task is cost-sensitive learning, which targets learning with different mis-classification costs. The fourth typical task is link prediction, which aims to discover missing relations or predict newly arriving hyperedges based on the observed hypergraph.

#### **5.1 Introduction**

In previous chapters, we have introduced how to generate the hypergraph structure given observed data. After the hypergraph generation step, how to use this hypergraph for different applications becomes the key task. Hypergraphs have the potential to be used in different areas, such as social media analysis, medical and biological applications, and computer vision. We notice that most of the applications can be categorized into several typical tasks and follow similar application patterns. In this chapter, we introduce several typical hypergraph computation tasks, which can be used for different applications.

More specifically, four typical tasks, including label propagation, data clustering, cost-sensitive learning, and link prediction, are introduced in this chapter. The first typical task is label propagation, which is also one of the most widely used methods in machine learning. The objective of label propagation is to assign a label to each unlabeled sample. In general, label propagation on a hypergraph propagates the label information from labeled vertices to unlabeled vertices through the structural information of the hyperedges. The random walk is a basic process for information propagation and plays a fundamental role here. We then review the hypergraph cut and random-walk-based label propagation on hypergraphs. We introduce the label propagation process on a single hypergraph and on multi-hypergraphs [1, 2], respectively, in this part.

The second typical task is data clustering, which aims to group data into different clusters. We introduce how to conduct data clustering using hypergraph computation. The hypergraph structure can be used as guidance for the clustering criteria. Two types of hypergraph clustering methods are introduced, namely structural hypergraph clustering and attribute hypergraph clustering, which differ in the data information available in the hypergraph. In a structural hypergraph, the clustering tasks only use structural information, while in an attribute hypergraph, each vertex is usually accompanied by attribute information from the real world. We introduce a hypergraph Laplacian smoothing filter and an embedded model designed specifically for hypergraph clustering tasks, named adaptive hypergraph auto-encoder (AHGAE) [3].

The third typical task is cost-sensitive learning, which addresses learning under different mis-classification costs, for example when confronting imbalanced data distributions. Here, we introduce two hypergraph computation methods, i.e., cost-sensitive hypergraph computation [4] and cost interval optimization for hypergraph computation [5]. First, we introduce a cost-sensitive hypergraph modeling method, in which the cost for different objectives is fixed in advance. As the exact cost value may not be easy to determine, we then introduce a cost interval optimization method, which can utilize a cost chosen inside the interval while generating data with high-order relations.

The fourth typical task is link prediction, which predicts relationships among data and can be used in recommender systems and other applications. Here, hypergraph link prediction aims to mine the missing hyperedges or predict newly arriving hyperedges based on the observed hypergraph. We introduce a variational auto-encoder for heterogeneous hypergraph link prediction [6]. It aims to learn a low-dimensional heterogeneous hypergraph embedding based on the Bayesian deep generative strategy. The heterogeneous encoder generates the vertex embedding and the hyperedge embedding, and the hypergraph embedding is the combination of them. The hypergraph decoder reconstructs the incidence matrix based on the vertex embedding and the hyperedge embedding, and the heterogeneous hypergraph is generated based on the reconstructed incidence matrix.

Part of the work introduced in this chapter has been published in [1–6].

#### **5.2 Label Propagation on Hypergraph**

This section mainly introduces the label propagation task on hypergraphs. We first introduce the basic assumptions of the label propagation process. Given a set of vertices on a hypergraph, some vertices are labeled, while the others are unlabeled. The task is to predict the labels of the unlabeled vertices given the label information and the hypergraph structure. Figure 5.1 shows that the label propagation process propagates the label information from the labeled vertices to the unlabeled vertices.

When propagating label information, vertices within the same hyperedge are more likely to have the same label because they characterize themselves with similar attributes in some aspects, and therefore, they have a higher probability of sharing the same label. Under this assumption, the label propagation task can be transformed into a hypergraph cut. In a hypergraph cut, the goal is to make the cut edges as sparse as possible, with each vertex set after the cut as dense as possible. After cutting the hypergraph, different sets of vertices have different labels. This approach satisfies the goal based on the above assumption. The form of the hypergraph cut can be described below.

Suppose a vertex set $S \subset \mathcal{V}$ and its complement $\overline{S}$. A cut splits $\mathcal{V}$ into $S$ and $\overline{S}$. A hyperedge $e$ is cut if it is incident with vertices in both $S$ and $\overline{S}$. Define the hyperedge boundary $\partial S$ as the set of cut hyperedges, i.e., $\partial S = \{e \in \mathcal{E} \mid e \cap S \neq \emptyset,\ e \cap \overline{S} \neq \emptyset\}$, and let the volume of $S$, $vol(S)$, be the sum of the degrees of the vertices in $S$, i.e., $vol(S) = \sum_{v \in S} \mathbf{D}_v(v)$. The volume of the hyperedge boundary is defined as

$$vol(\partial S) = \sum_{e \in \partial S} w(e) \frac{|e \cap S|\,|e \cap \overline{S}|}{\mathbf{D}_e(e)}. \tag{5.1}$$

**Fig. 5.2** An illustration of the hypergraph label propagation based on random walks

The derivation is as follows, and the details can be found in [7]. Suppose that each hyperedge $e$ is expanded into a clique, i.e., a fully connected graph over its vertices. To avoid confusion, the edges in the clique are called subedges. Then, the weight $w(e)/\mathbf{D}_e(e)$ is assigned to each subedge. When the hyperedge $e$ is cut, $|e \cap S| \times |e \cap \overline{S}|$ subedges are cut. The volume of the cut is the sum of the weights over these subedges. Recall that our goal is to make the cut edges as sparse as possible, with each vertex set after the cut as dense as possible. Based on this goal, the objective partition formula is written as

$$\arg\min_{S \subset \mathcal{V}} c(S) = vol(\partial S) \left( \frac{1}{vol(S)} + \frac{1}{vol(\overline{S})} \right). \tag{5.2}$$
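To make Eqs. (5.1) and (5.2) concrete, the following sketch evaluates the boundary volume and the normalized cut for one candidate partition. The toy hypergraph, the weights, and the helper name `normalized_cut` are assumptions chosen purely for illustration.

```python
import numpy as np

def normalized_cut(H, w, in_S):
    """Return (vol(dS), c(S)) following Eqs. (5.1) and (5.2)."""
    in_S = np.asarray(in_S, dtype=bool)
    d_v = H @ w                    # vertex degrees: sum of incident weights
    d_e = H.sum(axis=0)            # hyperedge degrees: number of vertices
    n_in = H[in_S].sum(axis=0)     # |e intersect S| for every hyperedge
    n_out = H[~in_S].sum(axis=0)   # |e intersect S_bar| for every hyperedge
    # Uncut hyperedges contribute zero because one of the factors is zero.
    vol_boundary = np.sum(w * n_in * n_out / d_e)
    vol_S = d_v[in_S].sum()
    vol_S_bar = d_v[~in_S].sum()
    return vol_boundary, vol_boundary * (1.0 / vol_S + 1.0 / vol_S_bar)

# Hypothetical toy hypergraph with 5 vertices and 3 hyperedges:
# e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])

# Candidate partition S = {0, 1, 2}; only e1 is cut.
vol_boundary, cut_value = normalized_cut(H, w, [True, True, True, False, False])
```

For this partition, the single cut hyperedge `e1` contributes $2 \cdot 1 \cdot 1 / 2 = 1$ to the boundary volume, and with $vol(S) = 5$ and $vol(\overline{S}) = 4$ the objective is $1 \cdot (1/5 + 1/4) = 0.45$.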

There are many methods to propagate label information on a hypergraph, and propagation based on random walks is the most widely used. The following describes label propagation by random walk, which is illustrated in Fig. 5.2. Suppose that the current position is $u \in \mathcal{V}$. We first walk to a hyperedge $e$ among all hyperedges incident with $u$, chosen with probability proportional to $w(e)$, and then we sample a vertex $v \in e$ uniformly. By generalizing from typical random walks on graphs, we use $\mathbf{P}$ as the transition probability matrix of the random walk on a hypergraph, and the element $p(u, v)$ is defined as follows:

$$p(u, v) = \sum_{e \in \mathcal{E}} w(e) \frac{\mathbf{H}(u, e)}{\mathbf{D}_v(u)} \frac{\mathbf{H}(v, e)}{\mathbf{D}_e(e)}. \tag{5.3}$$

The formula can be organized into the matrix form $\mathbf{P} = \mathbf{D}_v^{-1} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^\top$. The stationary distribution $\pi$ of the random walk is given by

$$
\pi(v) = \frac{d(v)}{vol(\mathcal{V})},\tag{5.4}
$$

where $\mathbf{D}_v(v)$ is denoted by $d(v)$ for short and $vol(\cdot)$ is the volume of a vertex set, defined as $vol(S) = \sum_{v \in S} d(v)$. The formula can be derived from

$$\begin{aligned} \sum_{u \in \mathcal{V}} \pi(u)\, p(u, v) &= \sum_{u \in \mathcal{V}} \frac{d(u)}{vol(\mathcal{V})} \sum_{e \in \mathcal{E}} w(e) \frac{\mathbf{H}(u, e)}{\mathbf{D}_v(u)} \frac{\mathbf{H}(v, e)}{\mathbf{D}_e(e)} \\ &= \frac{1}{vol(\mathcal{V})} \sum_{e \in \mathcal{E}} w(e) \sum_{u \in \mathcal{V}} \mathbf{H}(u, e) \frac{\mathbf{H}(v, e)}{\mathbf{D}_e(e)} \\ &= \frac{1}{vol(\mathcal{V})} \sum_{e \in \mathcal{E}} w(e) \mathbf{H}(v, e) = \frac{d(v)}{vol(\mathcal{V})}. \end{aligned} \tag{5.5}$$
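The transition matrix of Eq. (5.3) and the stationary distribution of Eq. (5.4) can be checked numerically. The sketch below uses a small hypothetical hypergraph (the incidence matrix and weights are made up for illustration) and verifies that $\pi \mathbf{P} = \pi$.

```python
import numpy as np

# Hypothetical toy hypergraph with 5 vertices and 3 hyperedges:
# e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])

d_v = H @ w              # vertex degrees d(v)
d_e = H.sum(axis=0)      # hyperedge degrees

# Matrix form of Eq. (5.3): P = D_v^{-1} H W D_e^{-1} H^T.
P = np.diag(1.0 / d_v) @ H @ np.diag(w / d_e) @ H.T

# Stationary distribution from Eq. (5.4): pi(v) = d(v) / vol(V).
pi = d_v / d_v.sum()

# One step of the walk started from pi should return pi again.
pi_after_step = pi @ P
```

Every row of `P` sums to one, so `P` is a valid transition matrix, and `pi_after_step` coincides with `pi` up to floating-point error.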

The objective function Eq. (5.2) can be written as

$$c(S) = \frac{vol(\partial S)}{vol(\mathcal{V})} \left( \frac{1}{vol(S)/vol(\mathcal{V})} + \frac{1}{vol(\overline{S})/vol(\mathcal{V})} \right),\tag{5.6}$$

and then we arrive at

$$\frac{vol(S)}{vol(\mathcal{V})} = \sum_{v \in S} \frac{d(v)}{vol(\mathcal{V})} = \sum_{v \in S} \pi(v),\tag{5.7}$$

where $vol(S)/vol(\mathcal{V})$ is the probability that the random walk occupies a vertex in $S$ under the stationary distribution. It can then be shown that

$$\begin{split} \frac{vol(\partial S)}{vol(\mathcal{V})} &= \sum_{e \in \partial S} \frac{w(e)}{vol(\mathcal{V})} \frac{|e \cap S|\,|e \cap \overline{S}|}{\delta(e)} \\ &= \sum_{e \in \partial S} \sum_{u \in e \cap S} \sum_{v \in e \cap \overline{S}} \frac{w(e)}{vol(\mathcal{V})} \frac{\mathbf{H}(u, e)\mathbf{H}(v, e)}{\delta(e)} \\ &= \sum_{e \in \partial S} \sum_{u \in e \cap S} \sum_{v \in e \cap \overline{S}} w(e) \frac{d(u)}{vol(\mathcal{V})} \frac{\mathbf{H}(u, e)}{d(u)} \frac{\mathbf{H}(v, e)}{\delta(e)} \\ &= \sum_{u \in S} \sum_{v \in \overline{S}} \frac{d(u)}{vol(\mathcal{V})} \sum_{e \in \partial S} w(e) \frac{\mathbf{H}(u, e)}{d(u)} \frac{\mathbf{H}(v, e)}{\delta(e)} \\ &= \sum_{u \in S} \sum_{v \in \overline{S}} \pi(u)\, p(u, v), \end{split} \tag{5.8}$$

where $\delta(e) = \mathbf{D}_e(e)$ denotes the hyperedge degree, and the ratio $vol(\partial S)/vol(\mathcal{V})$ is the probability that the random walk moves from a vertex in $S$ to a vertex in $\overline{S}$ under the stationary distribution. It can be seen that the hypergraph normalized cut criterion searches for a cut such that the probability with which the random walk crosses different clusters is as small as possible, while the probability with which the random walk stays in the same cluster is as large as possible.
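The identity in Eq. (5.8) can also be verified numerically. The following sketch, again on a hypothetical toy hypergraph with made-up weights, compares the normalized boundary volume with the stationary probability of crossing the cut.

```python
import numpy as np

# Hypothetical toy hypergraph: e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])

d_v = H @ w
d_e = H.sum(axis=0)
P = np.diag(1.0 / d_v) @ H @ np.diag(w / d_e) @ H.T
pi = d_v / d_v.sum()

# Partition S = {0, 1, 2} and its complement.
in_S = np.array([True, True, True, False, False])

# Left-hand side of Eq. (5.8): vol(dS) / vol(V), with vol(dS) from Eq. (5.1).
n_in = H[in_S].sum(axis=0)
n_out = H[~in_S].sum(axis=0)
lhs = np.sum(w * n_in * n_out / d_e) / d_v.sum()

# Right-hand side: probability of stepping from S to its complement
# when the walk is started from the stationary distribution.
rhs = sum(
    pi[u] * P[u, v]
    for u in np.flatnonzero(in_S)
    for v in np.flatnonzero(~in_S)
)
```

Both sides evaluate to $1/9$ for this partition, since only the transition from vertex 2 to vertex 3 through `e1` crosses the cut.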

Let us review the objective function Eq. (5.2). Minimizing it is NP-complete, but it can be relaxed into the following real-valued optimization problem:

$$\begin{aligned} \arg\min_{\mathbf{f} \in \mathbb{R}^{|\mathcal{V}|}} \Omega(\mathbf{f}) &= \frac{1}{2} \sum_{e \in \mathcal{E}} \sum_{\{u,v\} \subseteq e} \frac{w(e)}{\delta(e)} \left( \frac{\mathbf{f}(u)}{\sqrt{d(u)}} - \frac{\mathbf{f}(v)}{\sqrt{d(v)}} \right)^2, \\ \text{s.t.} &\quad \sum_{v \in \mathcal{V}} \mathbf{f}^2(v) = 1,\ \sum_{v \in \mathcal{V}} \mathbf{f}(v)\sqrt{d(v)} = 0, \end{aligned} \tag{5.9}$$

where **f** is the to-be-learned score vector. Since the goal is label propagation, the labels of some vertices are already known and should be preserved in the solution. The optimization problem therefore becomes the following transductive inference problem:

$$\arg\min_{\mathbf{f} \in \mathbb{R}^{|\mathcal{V}|}} \{\Omega(\mathbf{f}) + \lambda R_{emp}(\mathbf{f})\},\tag{5.10}$$

where the regularizer term is $\Omega(\mathbf{f})$, the empirical loss term is $R_{emp}(\mathbf{f}) = \|\mathbf{f} - \mathbf{y}\|^2 = \sum_{v \in \mathcal{V}} (\mathbf{f}(v) - \mathbf{y}(v))^2$, $\mathbf{y} \in \mathbb{R}^{|\mathcal{V}|}$ is the label vector, and $\lambda$ is the balance parameter. Let us assume that the $i$-th vertex is labeled, and the elements of $\mathbf{y}$ are all 0 except the $i$-th value, which is 1. The regularizer $\Omega(\mathbf{f})$ can be turned into

$$\begin{split} \Omega(\mathbf{f}) &= \frac{1}{2} \sum_{e \in \mathcal{E}} \sum_{\{u,v\} \subseteq e} \frac{w(e)}{\delta(e)} \left( \frac{\mathbf{f}(u)}{\sqrt{d(u)}} - \frac{\mathbf{f}(v)}{\sqrt{d(v)}} \right)^2 \\ &= \sum_{e \in \mathcal{E}} \sum_{u,v \in \mathcal{V}} \frac{w(e)\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)} \left( \frac{\mathbf{f}^2(u)}{d(u)} - \frac{\mathbf{f}(u)\mathbf{f}(v)}{\sqrt{d(u)d(v)}} \right) \\ &= \sum_{u \in \mathcal{V}} \mathbf{f}^2(u) \sum_{e \in \mathcal{E}} \frac{w(e)\mathbf{H}(u,e)}{d(u)} \sum_{v \in \mathcal{V}} \frac{\mathbf{H}(v,e)}{\delta(e)} \\ &\quad - \sum_{e \in \mathcal{E}} \sum_{u,v \in \mathcal{V}} \frac{\mathbf{f}(u)\mathbf{H}(u,e)w(e)\mathbf{H}(v,e)\mathbf{f}(v)}{\sqrt{d(u)d(v)}\,\delta(e)} \\ &= \mathbf{f}^\top(\mathbf{I} - \Theta)\mathbf{f}, \end{split} \tag{5.11}$$

where $\Theta = \mathbf{D}_v^{-\frac{1}{2}} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^\top \mathbf{D}_v^{-\frac{1}{2}}$. The hypergraph Laplacian is denoted by $\Delta = \mathbf{I} - \Theta$. Therefore, the objective function can be rewritten as

$$
\Omega(\mathbf{f}) = \mathbf{f}^\top \Delta \mathbf{f}.\tag{5.12}
$$
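The equality between the pairwise form of the regularizer and the quadratic form with the hypergraph Laplacian $\mathbf{I} - \Theta$ can be checked on a small example. In the sketch below, the toy hypergraph and the random score vector are hypothetical, and the inner sum is taken over all ordered vertex pairs inside each hyperedge, which is the reading under which the factor $\frac{1}{2}$ leads exactly to the quadratic form (this reading is an assumption of the sketch).

```python
import numpy as np

# Hypothetical toy hypergraph: e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])

d_v = H @ w
d_e = H.sum(axis=0)

# Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, Laplacian = I - Theta.
inv_sqrt_dv = 1.0 / np.sqrt(d_v)
Theta = inv_sqrt_dv[:, None] * (H @ np.diag(w / d_e) @ H.T) * inv_sqrt_dv[None, :]
Delta = np.eye(H.shape[0]) - Theta

# Arbitrary score vector, used only for the comparison.
rng = np.random.default_rng(0)
f = rng.normal(size=H.shape[0])

# Pairwise form: sum over hyperedges and ordered vertex pairs inside them.
g = f / np.sqrt(d_v)
omega_pairwise = 0.0
for e in range(H.shape[1]):
    members = np.flatnonzero(H[:, e])
    for u in members:
        for v in members:
            omega_pairwise += 0.5 * w[e] / d_e[e] * (g[u] - g[v]) ** 2

# Quadratic form with the hypergraph Laplacian.
omega_quadratic = f @ Delta @ f

eigenvalues = np.linalg.eigvalsh(Theta)
```

A useful by-product is that $\sqrt{d}$ lies in the null space of the Laplacian, which is why the relaxed problem excludes it through the constraint $\sum_v \mathbf{f}(v)\sqrt{d(v)} = 0$.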

The optimization problem then becomes

$$\arg\min_{\mathbf{f} \in \mathbb{R}^{|\mathcal{V}|}} \{\mathbf{f}^{\top} \Delta \mathbf{f} + \lambda\|\mathbf{f} - \mathbf{y}\|^2\}. \tag{5.13}$$

There are two ways to solve the above problem. The first is to differentiate the objective function in Eq. (5.13) with respect to $\mathbf{f}$ and set the derivative to zero, which yields the closed-form solution

$$\mathbf{f} = \left(\mathbf{I} + \frac{1}{\lambda}\Delta\right)^{-1}\mathbf{y}.\tag{5.14}$$
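The closed-form solution in Eq. (5.14) amounts to solving one linear system. The sketch below does this for a hypothetical toy hypergraph with a single labeled vertex; the value `lam = 1.0` and all names are illustrative assumptions. A linear solver is used instead of an explicit matrix inverse, which is the usual numerical practice.

```python
import numpy as np

# Hypothetical toy hypergraph: e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])
n = H.shape[0]

d_v = H @ w
d_e = H.sum(axis=0)
inv_sqrt_dv = 1.0 / np.sqrt(d_v)
Theta = inv_sqrt_dv[:, None] * (H @ np.diag(w / d_e) @ H.T) * inv_sqrt_dv[None, :]
Delta = np.eye(n) - Theta          # hypergraph Laplacian

lam = 1.0                          # balance parameter (assumed value)
y = np.zeros(n)
y[0] = 1.0                         # only vertex 0 is labeled

# Eq. (5.14): f = (I + Delta / lam)^{-1} y, computed via a linear solve.
f = np.linalg.solve(np.eye(n) + Delta / lam, y)

# Stationarity check: the gradient of the objective in Eq. (5.13),
# up to a factor of 2, is Delta f + lam (f - y).
gradient = Delta @ f + lam * (f - y)
```

The gradient vanishes at the computed `f`, which confirms that it is the minimizer of the convex objective in Eq. (5.13).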

The second is an iterative method. Similar to the iterative approach in [8], Eq. (5.13) can be solved efficiently by an iterative process, which is illustrated in Fig. 5.3. The estimate $\mathbf{f}^{(t+1)}$ is obtained from the previous iterate $\mathbf{f}^{(t)}$ and $\mathbf{y}$, i.e., $\mathbf{f}^{(t+1)} = \frac{1}{1+\lambda} \Theta \mathbf{f}^{(t)} + \frac{\lambda}{1+\lambda} \mathbf{y}$, and the procedure is repeated until convergence.

**Fig. 5.3** The iterative solution of Eq. (5.13). This figure is from [1]

This process converges to the solution in Eq. (5.14). To prove it, we first show that the eigenvalues of $\Theta$ lie in $[-1, 1]$. Since $\Theta = \mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^\top \mathbf{D}_v^{-1/2} = \mathbf{D}_v^{1/2} \mathbf{P} \mathbf{D}_v^{-1/2}$ is similar to the row-stochastic transition matrix $\mathbf{P}$, its eigenvalues lie in $[-1, 1]$. Therefore, $\mathbf{I} \pm \Theta$ are positive semi-definite.

The convergence of the iterative process is proved in [1]. Without loss of generality, we assume $\mathbf{f}^{(0)} = \mathbf{y}$. From the iterative process, it can be obtained that

$$\begin{split} \mathbf{f}^{(t)} &= \left(\frac{\lambda}{1+\lambda}\right) \sum_{l=0}^{t-1} \left(\frac{1}{1+\lambda} \Theta\right)^{l} \mathbf{y} + \left(\frac{1}{1+\lambda} \Theta\right)^{t} \mathbf{y} \\ &= (1-\zeta) \sum_{l=0}^{t-1} (\zeta \Theta)^{l} \mathbf{y} + (\zeta \Theta)^{t} \mathbf{y}, \end{split} \tag{5.15}$$

where $\zeta = \frac{1}{1+\lambda}$. Since $0 < \zeta < 1$ and the eigenvalues of $\Theta$ are in $[-1, 1]$, it can be derived that

$$\lim\_{t \to \infty} \left( \zeta \Theta \right)^{t} = 0 \tag{5.16}$$

and

$$\lim\_{t \to \infty} \sum\_{l=0}^{t-1} (\zeta \Theta)^l = (\mathbf{I} - \zeta \Theta)^{-1}. \tag{5.17}$$

It then follows that

$$\mathbf{f} = \lim\_{t \to \infty} \mathbf{f}^{(t)} = (1 - \boldsymbol{\zeta})(\mathbf{I} - \boldsymbol{\zeta}\boldsymbol{\Theta})^{-1}\mathbf{y} = \left(\mathbf{I} + \frac{1}{\lambda}\boldsymbol{\Delta}\right)^{-1}\mathbf{y}.\tag{5.18}$$

Therefore, the iterative process converges to the closed-form solution in Eq. (5.14).
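The convergence argument can be illustrated numerically. The sketch below assumes the update $\mathbf{f}^{(t+1)} = \zeta \Theta \mathbf{f}^{(t)} + (1 - \zeta)\mathbf{y}$ with $\zeta = 1/(1+\lambda)$ and $\mathbf{f}^{(0)} = \mathbf{y}$, which is the recursion whose unrolled form is Eq. (5.15); the toy hypergraph, the value of `lam`, and the fixed iteration count are hypothetical choices.

```python
import numpy as np

# Hypothetical toy hypergraph: e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])
n = H.shape[0]

d_v = H @ w
d_e = H.sum(axis=0)
inv_sqrt_dv = 1.0 / np.sqrt(d_v)
Theta = inv_sqrt_dv[:, None] * (H @ np.diag(w / d_e) @ H.T) * inv_sqrt_dv[None, :]
Delta = np.eye(n) - Theta

lam = 1.0
zeta = 1.0 / (1.0 + lam)
y = np.zeros(n)
y[0] = 1.0

# Iterative solution: start from y and repeat the update.
f_iter = y.copy()
for _ in range(200):
    f_iter = zeta * (Theta @ f_iter) + (1.0 - zeta) * y

# Closed-form solution of Eq. (5.14) for comparison.
f_closed = np.linalg.solve(np.eye(n) + Delta / lam, y)
```

Because the spectral radius of $\zeta\Theta$ is at most $\zeta < 1$, the error shrinks geometrically, and a few hundred iterations are far more than needed for this small example. The iterative form avoids solving a dense linear system, which matters for large hypergraphs.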

The random-walk-based method is the most commonly used approach in label propagation on hypergraphs. It has the advantages of being simple to implement and theoretically verifiable.

In many cases, different hypergraphs may be generated based on different criteria. Under such circumstances, we need to conduct label propagation on a multi-hypergraph. Here, we briefly introduce the cross diffusion method on multi-hypergraph [2]. We assume that there are $T$ hypergraphs, and the $t$-th hypergraph is denoted as $\mathcal{G}^t = (\mathcal{V}^t, \mathcal{E}^t, \mathbf{W}^t)$, where $\mathcal{V}^t$ is the vertex set, $\mathcal{E}^t$ is the hyperedge set, and $\mathbf{W}^t$ is a diagonal matrix representing the weights of the hyperedges.

The transition matrix is first generated for each hypergraph. The label propagation process on a hypergraph is based on the assumption that local similarities can approximate long-range similarities, and therefore, the similarities between nearby vertices are more important than those between far-away vertices. The similarity matrix among the vertices of the $t$-th hypergraph is defined as follows:

$$\Lambda^{t}(u, v) = \sum_{e \in \mathcal{E}^{t}} \frac{\mathbf{W}^{t}(e)\mathbf{H}^{t}(u, e)\mathbf{H}^{t}(v, e)}{\delta(e)},\tag{5.19}$$

or in the matrix form:

$$\Lambda^{t} = \mathbf{H}^{t} \mathbf{W}^{t} (\mathbf{D}_e^{t})^{-1} (\mathbf{H}^{t})^\top. \tag{5.20}$$

The transition matrix $\mathbf{P}^t$ is the normalized similarity matrix:

$$\mathbf{P}^{t}(i, j) = \frac{\Lambda^{t}(i, j)}{\sum_{k \in \mathcal{V}^{t}} \Lambda^{t}(i, k)}\tag{5.21}$$

and

$$\mathbf{P}^{t} = (\mathbf{D}^{t})^{-1} \Lambda^{t},\tag{5.22}$$

where $\mathbf{D}^t$ is a diagonal matrix with the $i$-th diagonal element $\mathbf{D}^t(i, i) = \sum_{j=1}^{|\mathcal{V}^t|} \Lambda^t(i, j)$.
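Equations (5.19) to (5.22) translate directly into a few matrix operations. The sketch below computes the similarity matrix and its row-normalized transition matrix for one hypothetical hypergraph; the incidence matrix, the weights, and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy hypergraph: e0 = {0, 1, 2}, e1 = {2, 3}, e2 = {3, 4}.
H = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)
w = np.array([1.0, 2.0, 1.0])

d_e = H.sum(axis=0)                      # hyperedge degrees delta(e)

# Eq. (5.20): similarity matrix H W D_e^{-1} H^T.
# Scaling column e of H by w(e) / delta(e) is the same as H W D_e^{-1}.
Lam = (H * (w / d_e)) @ H.T

# Eqs. (5.21) and (5.22): normalize every row by its sum.
row_sums = Lam.sum(axis=1)
P = Lam / row_sums[:, None]

# For comparison: the transition matrix written as D_v^{-1} H W D_e^{-1} H^T.
d_v = H @ w
P_random_walk = np.diag(1.0 / d_v) @ H @ np.diag(w / d_e) @ H.T
```

The row sums of the similarity matrix coincide with the weighted vertex degrees, so the normalized matrix is the same transition matrix as in the matrix form of Eq. (5.3).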

The element $\mathbf{P}^t(i, j)$ of the transition matrix represents the probability of transition from vertex $i$ to vertex $j$, and $\mathbf{P}^t$ could be regarded as the Parzen window estimator on the hypergraph structure. After the generation of the transition matrix, the cross label propagation process is applied to the multi-hypergraph structure.

Denote $\mathbf{Y}_0$ as the initial label matrix. For labeled vertices, the $i$-th row of $\mathbf{Y}_0$ is the one-hot label of the $i$-th vertex, while for unlabeled vertices, all elements of the $i$-th row are 0.5, indicating that there is no prior knowledge of the label. We denote the labeled part of the initial label matrix as $\mathbf{Y}_0^L$.

For simplicity, we assume the number of hypergraphs *T* is 2. The label propagation process for multi-hypergraph uses the output of one hypergraph as the input of the other hypergraph, which repeats until the output converges. The process could be formulated as

$$\mathbf{Y}_{d+1}^{1} \leftarrow \mathbf{P}^{1} \mathbf{Y}_{d}^{2},\tag{5.23}$$

$$\mathbf{Y}_{d+1}^{1L} \leftarrow \mathbf{Y}_0^{L},\tag{5.24}$$

and

$$\mathbf{Y}_{d+1}^{2} \leftarrow \mathbf{P}^{2} \mathbf{Y}_{d}^{1},\tag{5.25}$$

$$\mathbf{Y}_{d+1}^{2L} \leftarrow \mathbf{Y}_0^{L},\tag{5.26}$$

where $\mathbf{Y}_d^k$ denotes the label matrix of the $k$-th hypergraph after $d$ rounds of label propagation, and $\mathbf{Y}_d^{kL}$ denotes its labeled part, which is reset to the initial labels after each round. This process is shown in Fig. 5.4.

**Fig. 5.4** An illustration of the diffusion process on multi-hypergraph. This figure is from [2]

The overall label matrix could be calculated according to the label matrix of each hypergraph after convergence:

$$\mathbf{Y}_{final} = \frac{1}{T} \sum_{k=1}^{T} \mathbf{Y}_{d}^{k}. \tag{5.27}$$

For more complicated scenarios, where more than two hypergraphs are available, the label propagation process can be repeated in the same way, with the output of one hypergraph used as the input of the other hypergraphs.
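The cross diffusion in Eqs. (5.23) to (5.27) can be sketched for two hypergraphs over the same vertex set. Everything concrete below is an assumption made for illustration: the two toy hypergraphs, unit hyperedge weights, the helper names `incidence` and `transition_matrix`, the stopping tolerance, and the simultaneous update of both label matrices in each round.

```python
import numpy as np

def incidence(n_vertices, hyperedges):
    """Build an incidence matrix from a list of vertex-index tuples."""
    H = np.zeros((n_vertices, len(hyperedges)))
    for j, members in enumerate(hyperedges):
        H[list(members), j] = 1.0
    return H

def transition_matrix(H, w):
    """Row-normalized similarity matrix, following Eqs. (5.20) and (5.22)."""
    d_e = H.sum(axis=0)
    Lam = (H * (w / d_e)) @ H.T
    return Lam / Lam.sum(axis=1, keepdims=True)

n = 6
H1 = incidence(n, [(0, 1, 2), (3, 4, 5), (2, 3)])
H2 = incidence(n, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
P1 = transition_matrix(H1, np.ones(H1.shape[1]))
P2 = transition_matrix(H2, np.ones(H2.shape[1]))

# Initial labels: vertex 0 has class 0, vertex 5 has class 1,
# and every unlabeled row is filled with 0.5.
labeled = np.array([0, 5])
Y0 = np.full((n, 2), 0.5)
Y0[0] = [1.0, 0.0]
Y0[5] = [0.0, 1.0]

Y1, Y2 = Y0.copy(), Y0.copy()
for _ in range(1000):
    Y1_next = P1 @ Y2                  # Eq. (5.23)
    Y2_next = P2 @ Y1                  # Eq. (5.25)
    Y1_next[labeled] = Y0[labeled]     # Eq. (5.24): reset labeled rows
    Y2_next[labeled] = Y0[labeled]     # Eq. (5.26): reset labeled rows
    change = max(np.abs(Y1_next - Y1).max(), np.abs(Y2_next - Y2).max())
    Y1, Y2 = Y1_next, Y2_next
    if change < 1e-10:
        break

Y_final = (Y1 + Y2) / 2.0              # Eq. (5.27) with T = 2
predictions = Y_final.argmax(axis=1)
```

In this symmetric toy example, the vertices closer to vertex 0 end up with class 0 and those closer to vertex 5 with class 1. Resetting the labeled rows after every round keeps the known labels from being washed out by the diffusion.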

This diffusion process can also be used for a single hypergraph, and the framework can be described in Fig. 5.5.

**Fig. 5.5** An illustration of the diffusion process on a single hypergraph

#### **5.3 Data Clustering on Hypergraph**

Data clustering is a typical machine learning task that aims to group data into clusters. In this section, we introduce hypergraph-based data clustering methods, which exploit the hypergraph structure to better uncover the correlations behind the data. According to the information available in the hypergraph, hypergraph clustering can be divided into two types: structural hypergraph clustering and attribute hypergraph clustering. In a structural hypergraph, the clustering task uses only structural information. For example, the hypergraph spectral clustering method [7] extends graph spectral clustering and uses the hypergraph Laplacian to learn complex relations between vertices in the hypergraph. Some auto-encoder-based techniques [9] have also been applied to structural clustering. In an attribute hypergraph, each vertex is usually accompanied by attribute information from the real world. There are two assumptions as follows:


How to balance graph structure information and node feature information is a research focus of attributed graph clustering [10]. In the same way, hypergraph-based methods can utilize the features, attributes, and structural information of vertices to conduct the data clustering task.

In this section, we introduce a hypergraph Laplacian smoothing filter and an embedded model called adaptive hypergraph auto-encoder (AHGAE) that is designed specifically for hypergraph clustering tasks [3]. First, we describe the hypergraph Laplacian smoothing filter and derive its low-pass filtering properties in the frequency domain. Then, we analyze the influence of each vertex on the attributes of its connected hyperedges and the feature of neighbor vertices. Finally, we introduce the detailed procedure and framework of the adaptive hypergraph autoencoder.

The hypergraph Laplacian smoothing filter, as shown in Fig. 5.6, first merges the vertex features into hyperedge features, and the feature of hyperedge *e<sub>k</sub>* is defined as

$$\mathbf{E}\_{k}^{(t)} = \frac{1}{|N\left(e\_{k}\right)|} \sum\_{v\_{j} \in N(e\_{k})} \mathbf{X}\_{j}^{(t)} = \sum\_{v\_{j} \in \mathcal{V}} \frac{h(j,k)}{d\_{e}(k)} \mathbf{X}\_{j}^{(t)},\tag{5.28}$$

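where *e<sub>k</sub>* denotes the *k*-th hyperedge in the hyperedge set $\mathcal{E}$, *v<sub>j</sub>* denotes the *j*-th vertex in the vertex set $\mathcal{V}$, *t* represents the order, *N*(*e<sub>k</sub>*) is the set of vertices in hyperedge *e<sub>k</sub>*, *h*(*j*, *k*) is the entry of the incidence matrix for vertex *v<sub>j</sub>* and hyperedge *e<sub>k</sub>*, *d<sub>e</sub>*(*k*) is the degree of hyperedge *e<sub>k</sub>*, **E**<sub>*k*</sub> is the feature of hyperedge *e<sub>k</sub>*, and **X**<sub>*j*</sub> is the feature of vertex *v<sub>j</sub>*.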

**Fig. 5.6** An illustration for hypergraph Laplacian smoothing filter. This figure is from [3]

After aggregating the vertex features to get the hyperedge features, we can further combine the vertex features according to the hyperedge weights:

$$\begin{split} \mathbf{X}\_{i}^{(t+1)} &= (1-\gamma)\mathbf{X}\_{i}^{(t)} + \gamma \sum\_{e\_{k}\in N(v\_{i})} \frac{h(i,k)w(k)}{d\_{v}(i)} \mathbf{E}\_{k}^{(t)} \\ &= (1-\gamma)\mathbf{X}\_{i}^{(t)} + \gamma \sum\_{v\_{j}\in \mathcal{V}} \sum\_{e\_{k}\in \mathcal{E}} \frac{h(i,k)w(k)h(j,k)}{d\_{v}(i)d\_{e}(k)} \mathbf{X}\_{j}^{(t)}, \\ \mathbf{X}^{(t+1)} &= (1-\gamma)\mathbf{X}^{(t)} + \gamma\mathbf{D}\_{v}^{-1}\mathbf{H}\mathbf{W}\mathbf{D}\_{e}^{-1}\mathbf{H}^{\top}\mathbf{X}^{(t)}, \end{split} \tag{5.29}$$

where *N*(*v*) represents the set of hyperedges connected to vertex *v*, and *γ* ∈ [0, 1] is the weight coefficient of the filter. **D**<sub>*v*</sub> denotes the diagonal matrix of the vertex degrees, **D**<sub>*e*</sub> denotes the diagonal matrix of the hyperedge degrees, **W** is the diagonal matrix of the hyperedge weights, and **H** is the incidence matrix of the hypergraph. In order to keep the spectral radius no greater than 1, we can replace $\mathbf{D}\_{v}^{-1}\mathbf{H}\mathbf{W}\mathbf{D}\_{e}^{-1}\mathbf{H}^{\top}$ with its symmetric normalized form:

$$\begin{split} \mathbf{X}^{(t+1)} &= (1 - \boldsymbol{\gamma})\mathbf{X}^{(t)} + \boldsymbol{\gamma}\mathbf{D}\_{v}^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_{e}^{-1}\mathbf{H}^{\top}\mathbf{D}\_{v}^{-1/2}\mathbf{X}^{(t)} \\ &= \mathbf{X}^{(t)} - \boldsymbol{\gamma}\left(\mathbf{I} - \mathbf{D}\_{v}^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_{e}^{-1}\mathbf{H}^{\top}\mathbf{D}\_{v}^{-1/2}\right)\mathbf{X}^{(t)}. \end{split} \tag{5.30}$$

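Then, denoting the hypergraph Laplacian as $\mathbf{L} = \mathbf{I} - \mathbf{D}\_{v}^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_{e}^{-1}\mathbf{H}^{\top}\mathbf{D}\_{v}^{-1/2}$, the multi-order hypergraph Laplacian smoothing filter can be written as

$$\mathbf{X}^{(t)} = (\mathbf{I} - \gamma\mathbf{L})^{t}\mathbf{X}.\tag{5.31}$$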

**Fig. 5.7** The framework of the adaptive hypergraph auto-encoder. This figure is from [3]

After the eigendecomposition of the hypergraph Laplacian operator **L** = **U***Λ***U**<sup>−1</sup>, the diagonal elements of the diagonal matrix *Λ* are the eigenvalues of **L**. The frequency response function is

$$p(\boldsymbol{\Lambda}) = \text{diag}\left(p\left(\lambda\_1\right), \dots, p\left(\lambda\_{|\mathcal{V}|}\right)\right),\tag{5.32}$$

$$p(\lambda) = 1 - \gamma \lambda, \; \gamma \in [0, 1]. \tag{5.33}$$

Since the eigenvalues of the hypergraph Laplacian satisfy *λ* ∈ [0, 1], *p*(*Λ*) is a positive semi-definite matrix, and the value of *p*(*λ*) decreases as *λ* increases. Therefore, the hypergraph Laplacian smoothing filter can effectively suppress high-frequency signals:

$$\mathbf{F} = \mathbf{U}p(\boldsymbol{\Lambda})\mathbf{U}^{-1} = \mathbf{U}(\mathbf{I} - \gamma\boldsymbol{\Lambda})\mathbf{U}^{-1} = \mathbf{I} - \gamma\mathbf{L}.\tag{5.34}$$
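The low-pass behavior of the filter can be checked numerically. The following is a minimal sketch, assuming a small hand-made incidence matrix, unit hyperedge weights, and the helper names `hypergraph_laplacian`, `smooth`, and `energy`, none of which come from the original text. It builds the symmetric normalized Laplacian, applies the filter **F** = **I** − *γ***L** for *t* steps, and compares the Dirichlet energy tr(**X**<sup>⊤</sup>**LX**) before and after smoothing.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    d_v = H @ w                    # vertex degrees: sum_k w(k) h(i, k)
    d_e = H.sum(axis=0)            # hyperedge degrees: sum_i h(i, k)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
    theta = Dv_inv_sqrt @ H @ np.diag(w / d_e) @ H.T @ Dv_inv_sqrt
    return np.eye(H.shape[0]) - theta

def smooth(X, L, gamma, t):
    """Apply the t-order filter (I - gamma L)^t to X."""
    F = np.eye(L.shape[0]) - gamma * L
    for _ in range(t):
        X = F @ X
    return X

def energy(X, L):
    """Dirichlet energy tr(X^T L X); larger means more high-frequency content."""
    return float(np.trace(X.T @ L @ X))

# Hypothetical hypergraph: 5 vertices, 3 hyperedges, unit weights.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
w = np.ones(3)
L = hypergraph_laplacian(H, w)
eigvals = np.linalg.eigvalsh(L)    # L is symmetric, so eigvalsh applies

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
X_smooth = smooth(X, L, gamma=0.8, t=3)
print(eigvals.min(), eigvals.max(), energy(X, L), energy(X_smooth, L))
```

Because every frequency component is scaled by (1 − *γλ*)<sup>*t*</sup> ∈ [0, 1], the energy after filtering cannot exceed the energy before filtering.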

Figure 5.7 illustrates how the relational reconstruction auto-encoder uses the smoothed feature matrix to learn low-dimensional vertex representations while preserving as much information as possible. First, the incidence matrix is used to generate the adjacency matrix:

$$\mathbf{A} = \varepsilon \left( \mathbf{H} \mathbf{H}^{\top} \right), \tag{5.35}$$

$$\varepsilon(x) = \begin{cases} 1, \ x > 0 \\ 0, \ x = 0 \end{cases} \tag{5.36}$$

A single fully connected layer is used to compress the filtered feature matrix:

$$\mathbf{Z} = \text{scale}\left(\mathbf{X}\_{\text{sm}}\boldsymbol{\Theta}\right),\tag{5.37}$$

$$\text{scale}(\mathbf{x}) = \frac{\mathbf{x} - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})},\tag{5.38}$$

where **Z** represents the vertex embedding matrix, which includes both structural and feature information, **X**<sub>sm</sub> is the smoothed feature matrix, and *Θ* is the learnable parameter used to extract features from the vertices. scale(·) is a normalization function that rescales the vertex features to the range [0, 1]. The similarity matrix of the vertex features is then

$$\mathbf{S} = \text{sigmoid}\left(\mathbf{Z}\mathbf{Z}^{\top}\right),\tag{5.39}$$

$$\text{sigmoid}(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}}}.\tag{5.40}$$

This is the inner product decoder used to reconstruct the relations between each vertex and its neighbors. The objective is to minimize the error between the adjacency matrix **A** and the similarity matrix **S**. However, using Eq. (5.35) to construct the adjacency matrix leads to a problem: the number of edges becomes too large when the hyperedge degree increases. To solve this problem, the elements in matrix **A** are weighted as

$$\mathbf{W}\_{ij} = \begin{cases} \frac{|\mathcal{V}|^2 - \sum\_{i}\sum\_{j} \mathbf{A}\_{ij}}{\sum\_{i}\sum\_{j} \mathbf{A}\_{ij}}, & \mathbf{A}\_{ij} = 1 \\ 1, & \mathbf{A}\_{ij} = 0 \end{cases}.\tag{5.41}$$

The reconstruction loss can be calculated by using the weighted binary cross-entropy function:

$$L\_{re} = \frac{1}{|\mathcal{V}|^2} \sum\_{i=1}^{|\mathcal{V}|} \sum\_{j=1}^{|\mathcal{V}|} - \mathbf{W}\_{ij} \left[ \mathbf{A}\_{ij} \log \mathbf{S}\_{ij} + \left( 1 - \mathbf{A}\_{ij} \right) \log \left( 1 - \mathbf{S}\_{ij} \right) \right]. \tag{5.42}$$
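The forward pass of Eqs. (5.35)–(5.42) can be sketched as follows. This is a hedged illustration rather than the reference implementation of [3]: the projection `Theta` is random instead of learned, `minmax_scale` is assumed to normalize over the whole matrix, and the small `eps` inside the logarithms is added only for numerical safety.

```python
import numpy as np

def minmax_scale(x):
    """Eq. (5.38), assumed here to use the global min and max of the matrix."""
    return (x - x.min()) / (x.max() - x.min())

def reconstruction_loss(H, X_sm, Theta, eps=1e-12):
    A = (H @ H.T > 0).astype(float)             # Eqs. (5.35)-(5.36)
    Z = minmax_scale(X_sm @ Theta)              # Eq. (5.37)
    S = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))        # Eqs. (5.39)-(5.40)
    n_pos = A.sum()
    pos_weight = (A.size - n_pos) / n_pos       # Eq. (5.41), case A_ij = 1
    W = np.where(A == 1, pos_weight, 1.0)       # Eq. (5.41), case A_ij = 0 gives 1
    bce = -(A * np.log(S + eps) + (1.0 - A) * np.log(1.0 - S + eps))
    return float((W * bce).mean()), A, S        # Eq. (5.42): mean over |V|^2 entries

# Hypothetical inputs: 5 vertices, 3 hyperedges, 4-d smoothed features, 2-d embedding.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
rng = np.random.default_rng(0)
X_sm = rng.normal(size=(5, 4))
Theta = rng.normal(size=(4, 2))

loss, A, S = reconstruction_loss(H, X_sm, Theta)
print(loss)
```

In training, `Theta` would be updated by gradient descent on this loss; an automatic differentiation framework is the natural tool for that step and is left out of the sketch.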

The relational reconstruction auto-encoder can be trained to produce the learned vertex embeddings, and the spectral clustering technique can be further used to obtain the final clustering results.

#### **5.4 Cost-Sensitive Learning on Hypergraph**

Many machine learning applications are cost-sensitive: different types of errors in real-world tasks may result in losses of varying severity. In medical diagnosis, for example, misdiagnosing a patient as a healthy person is significantly more costly than classifying a healthy individual as a patient, as shown in Fig. 5.8. Similar cases arise in software defect prediction, where misjudging a defective software module as a good one may break the software system and have disastrous repercussions. Cost-sensitive learning methods [11–13] have been developed to deal with these issues.

In many cases, the data from some categories may be sufficient, while the data from other categories may be very limited. These imbalanced data distributions lead to different costs for the classification performance of different categories. Under such circumstances, imbalanced learning [13, 14], which aims to learn an accurate predictor from imbalanced data, attracts much attention. In traditional methods, sampling methods [15, 16] over-sample the minority class and under-sample the majority class to solve the imbalanced sample problem. Another way is to conduct cost-sensitive learning, which can focus more on the minority class.

To confront the cost-sensitive issue in hypergraph computation, in this section, we introduce the cost-sensitive hypergraph computation framework [4] and cost interval optimization for hypergraph computation [5], respectively. First, we describe how to quantify cost in the hypergraph modeling procedure [4], in which a fixed cost value is provided for modeling, and then we illustrate how to use the cost-sensitive hypergraph computation approach to tackle imbalanced problems. As a precise cost value for mis-classification may not be available in practice, we then introduce the hypergraph computation method with cost interval optimization [5], which can utilize a cost chosen inside the interval while modeling data with high-order relations. Figure 5.9 shows the frameworks of hypergraph computation under cost-sensitive scenarios, from traditional hypergraph modeling, to hypergraph modeling with a cost matrix, to hypergraph modeling with a cost matrix using a cost interval.

**Fig. 5.8** A medical example of cost-sensitive classification scenario

**Fig. 5.9** The frameworks of hypergraph computation under cost-sensitive scenarios

#### *(1) Cost-Sensitive Hypergraph Computation*

In this part, we introduce a cost-sensitive hypergraph computation method [4], and Fig. 5.10 shows the framework of this method. This framework consists of two stages to handle the cost-sensitive issue: F-measure is used in the initial step to calculate candidate cost information for cost-sensitive learning, and then the hypergraph structure is utilized to model the high-order correlations among the data in the second stage.

First, we introduce the hypergraph modeling with cost matrix. In traditional hypergraph modeling, each vertex represents a subject, and the hyperedges connect related vertices. To introduce cost information in hypergraph modeling, a cost matrix is associated with each vertex, indicating different costs for misclassification, as shown in Fig. 5.11 for a binary classification task. The definition of cost matrix is as follows.

As shown in Fig. 5.11, the cost matrix is a 2×2 matrix, including the true positive cost *C<sub>TP</sub>*, the true negative cost *C<sub>TN</sub>*, the false positive cost *C<sub>FP</sub>*, and the false negative cost *C<sub>FN</sub>*. The true positive cost and the true negative cost are usually 0, since they correspond to correct predictions. The cost-sensitive hypergraph's preference for each class is achieved by assigning different values to the false positive cost and the false negative cost in the cost matrix. As a special case, if the false positive cost and the false negative cost are equal, the cost-sensitive hypergraph reduces to traditional hypergraph modeling.

We generate candidate cost information at first and then apply F-measure to reduce the expense for both binary and multi-class data. For a classifier *h*, we can define the error profile as

$$\Psi(h) = \begin{pmatrix} \text{FN}\_1(h), \text{FP}\_1(h), \dots, \text{FN}\_{N\_c}(h), \text{FP}\_{N\_c}(h) \end{pmatrix}, \tag{5.43}$$

where *N<sub>c</sub>* represents the number of classes, and FN and FP represent the false negative and the false positive probabilities. For simplicity, we let *ψ*<sub>2*k*−1</sub> represent the FN probability of the *k*-th class and *ψ*<sub>2*k*</sub> represent the FP probability of the *k*-th class. The F-measure for binary classification can be defined as

$$F\_{\beta}(\Psi) = \frac{\left(1 + \beta^2\right) \left(P\_1 - \psi\_1\right)}{\left(1 + \beta^2\right) P\_1 - \psi\_1(h) + \psi\_2(h)},\tag{5.44}$$

where *P<sub>k</sub>* represents the marginal probability of class *k*. Similarly, the micro-F-measure for multi-class classification can be defined as

$$mcF\_{\beta}(\Psi) = \frac{\left(1 + \beta^2\right)\left(1 - P\_1 - \sum\_{k=2}^{C} \psi\_{2k-1}\right)}{\left(1 + \beta^2\right)\left(1 - P\_1\right) - \sum\_{k=2}^{C} \psi\_{2k-1} + \psi\_1}.\tag{5.45}$$

We can further divide the F-measure values in the region [0, 1] into a collection of equally spaced values *F* = {*f<sub>i</sub>*} to calculate the cost of various mis-classifications. The cost function *Υ* is then used to construct the cost vector for every *f<sub>i</sub>*. For binary classification, we constrain the denominator of Eq. (5.44) to be positive and *F<sub>β</sub>*(*Ψ*) ≤ *f* for a value *f* of the F-measure:

$$\left(1+\beta^2-f\right)\psi\_1+f\psi\_2+\left(1+\beta^2\right)P\_1(f-1)\geq 0.\tag{5.46}$$
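As a quick check, Eq. (5.46) follows from Eq. (5.44) by multiplying the condition *F<sub>β</sub>*(*Ψ*) ≤ *f* by the positive denominator and collecting terms. The steps below only rearrange quantities already defined above:

$$\begin{aligned} \left(1+\beta^2\right)\left(P\_1-\psi\_1\right) &\le f\left[\left(1+\beta^2\right)P\_1-\psi\_1+\psi\_2\right] \\ \Longleftrightarrow\quad 0 &\le \left(1+\beta^2\right)\psi\_1 - f\psi\_1 + f\psi\_2 + \left(1+\beta^2\right)P\_1 f - \left(1+\beta^2\right)P\_1 \\ \Longleftrightarrow\quad 0 &\le \left(1+\beta^2-f\right)\psi\_1 + f\psi\_2 + \left(1+\beta^2\right)P\_1(f-1). \end{aligned}$$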

Therefore, the costs of *ψ*<sub>1</sub> and *ψ*<sub>2</sub> can be set to their coefficients in Eq. (5.46), i.e., 1 + *β*<sup>2</sup> − *f* and *f*, respectively, and the cost function can be written as follows:

$$\Upsilon\_i^{F\_\beta} = \begin{cases} 1 + \beta^2 - f, & \text{if sample from class 1} \\ f, & \text{if sample from class 2} \\ 0, & \text{otherwise} \end{cases} \tag{5.47}$$
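The binary F-measure of Eq. (5.44), the linear condition of Eq. (5.46), and the cost assignment of Eq. (5.47) can be illustrated with a short sketch. The function names, the example error profile, and the 11-point grid of candidate F-measure values are assumptions chosen for the illustration.

```python
import numpy as np

def f_beta(P1, psi1, psi2, beta=1.0):
    """Binary F-measure from the error profile (Eq. 5.44)."""
    b2 = 1.0 + beta ** 2
    return b2 * (P1 - psi1) / (b2 * P1 - psi1 + psi2)

def linear_condition(P1, psi1, psi2, f, beta=1.0):
    """Left-hand side of Eq. (5.46); non-negative exactly when F_beta <= f."""
    b2 = 1.0 + beta ** 2
    return (b2 - f) * psi1 + f * psi2 + b2 * P1 * (f - 1.0)

def cost_vector(y, f, beta=1.0):
    """Per-sample costs for labels y in {1, 2} (Eq. 5.47)."""
    y = np.asarray(y)
    return np.where(y == 1, 1.0 + beta ** 2 - f, np.where(y == 2, f, 0.0))

# Hypothetical error profile: class 1 has marginal probability 0.2.
P1, psi1, psi2 = 0.2, 0.05, 0.10
F = f_beta(P1, psi1, psi2)

f_grid = np.linspace(0.0, 1.0, 11)          # equally spaced candidate values f_i
lhs = linear_condition(P1, psi1, psi2, f_grid)
costs = cost_vector([1, 2, 2], f=0.5)
print(F, costs)
```

Each candidate value *f<sub>i</sub>* on the grid yields one cost vector, so the grid size controls how many cost-sensitive hypergraphs have to be evaluated later.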

Similarly, the cost function of multi-class classification can be written as follows:

$$\Upsilon\_{i}^{mcF\_{\beta}} = \begin{cases} 1 + \beta^2 - f, & \text{if sample from odd class and not from class 1} \\ f, & \text{if sample from class 1} \\ 0, & \text{otherwise} \end{cases} \tag{5.48}$$

The cost of F-measure optimization is added to the optimization function to increase the efficacy of the hypergraph computation method on imbalanced data. We first regard each sample as a vertex of the hypergraph and then apply the *k*-nearest neighbor algorithm to construct the hypergraph. The cost-sensitive hypergraph differs in that it includes the cost matrix information of each vertex in addition to the original hypergraph correlation structure. With training and testing samples represented by **O**, the cost-sensitive hypergraph computation objective can be expressed as

$$\begin{aligned} \arg\min\_{\boldsymbol{\omega}, \mathbf{W}} & \left[ \mu \Omega(\boldsymbol{\omega}) + \mathcal{R}\_{emp}(\boldsymbol{\omega}) + \lambda \Phi(\mathbf{W}) \right], \\ & \text{s.t. } \sum\_{j=1}^{N} \mathbf{W}\_{j,j} = 1, \forall \, \mathbf{W}\_{j,j} \ge 0, \end{aligned} \tag{5.49}$$

where $\Omega(\omega) = (\mathbf{O}\omega)^{\top}\Delta(\mathbf{O}\omega)$ represents the hypergraph Laplacian regularizer with hypergraph Laplacian *Δ*, $\mathcal{R}\_{emp}(\omega) = \|\Upsilon(\mathbf{O}\omega - \mathbf{y})\|\_2^2 = \sum\_{i=1}^{N} \left(\Upsilon\_{i,i}(\mathbf{o}\_i\omega - \mathbf{y}\_i)\right)^2$ is the empirical loss using cost information, in which *Υ* is a diagonal matrix and *Υ<sub>i,i</sub>* represents the cost of the *i*-th sample, $\Phi(\mathbf{W}) = \|\mathbf{W}\|\_F^2$ stands for the hypergraph regularization, *ω* represents the mapping vector to be learnt, **W** is a diagonal matrix representing hyperedge weights, and *μ* and *λ* are the trade-off hyperparameters. We first fix **W** to optimize *ω*, and then the optimization problem can be expressed as

$$\arg\min\_{\omega} \left\{ \|\Upsilon(\mathbf{O}\omega - \mathbf{y})\|\_{2}^{2} + \mu(\mathbf{O}\omega)^{\top}\Delta(\mathbf{O}\omega) \right\}.\tag{5.50}$$

Setting the gradient with respect to *ω* to zero, the optimal *ω* can be obtained as

$$
\omega = \left(\mathbf{O}^{\top}\Upsilon^2\mathbf{O} + \mu\mathbf{O}^{\top}\Delta\mathbf{O}\right)^{-1}\left(\mathbf{O}^{\top}\Upsilon^2\mathbf{y}\right).
\tag{5.51}
$$
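The closed-form step can be verified numerically. The sketch below assumes the objective ‖*Υ*(**O***ω* − **y**)‖<sup>2</sup> + *μ*(**O***ω*)<sup>⊤</sup>*Δ*(**O***ω*) with a diagonal cost matrix *Υ*; setting its gradient to zero gives the linear system (**O**<sup>⊤</sup>*Υ*<sup>2</sup>**O** + *μ***O**<sup>⊤</sup>*Δ***O**)*ω* = **O**<sup>⊤</sup>*Υ*<sup>2</sup>**y**, which is what the code solves. The toy hypergraph, the random features, the ±1 targets, and the function names are illustrative assumptions.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    d_v, d_e = H @ w, H.sum(axis=0)
    Dv_inv_sqrt = np.diag(d_v ** -0.5)
    return np.eye(len(d_v)) - Dv_inv_sqrt @ H @ np.diag(w / d_e) @ H.T @ Dv_inv_sqrt

def solve_omega(O, y, costs, Delta, mu):
    """Solve the stationarity condition of the cost-weighted objective."""
    U2 = np.diag(costs ** 2)                       # Upsilon^2 for a diagonal Upsilon
    lhs = O.T @ U2 @ O + mu * O.T @ Delta @ O
    rhs = O.T @ U2 @ y
    return np.linalg.solve(lhs, rhs)

# Hypothetical data: 5 samples with 2 features, 3 hyperedges with equal weights.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
rng = np.random.default_rng(0)
O = rng.normal(size=(5, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])         # assumed +1 / -1 targets
costs = np.where(y > 0, 1.5, 0.5)                  # e.g. Eq. (5.47) with f = 0.5, beta = 1
Delta = hypergraph_laplacian(H, np.ones(3) / 3.0)
mu = 0.1

omega = solve_omega(O, y, costs, Delta, mu)
# Gradient of the objective at the solution; it should vanish.
grad = 2 * O.T @ np.diag(costs ** 2) @ (O @ omega - y) + 2 * mu * O.T @ Delta @ O @ omega
print(omega, np.linalg.norm(grad))
```

Using `np.linalg.solve` instead of forming the explicit inverse is a numerical choice of the sketch; both give the same *ω* when the system matrix is well conditioned.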

Following that, we fix *ω* to optimize **W**:

$$\begin{aligned} \arg\min\_{\mathbf{W}} & \left\{ \mu (\mathbf{O}\omega)^{\top} \Delta (\mathbf{O}\omega) + \lambda \|\mathbf{W}\|\_{F}^{2} \right\}. \\ \text{s.t. } & \sum\_{j=1}^{N} \mathbf{W}\_{j,j} = 1, \forall \; \mathbf{W}\_{j,j} \ge 0. \end{aligned} \tag{5.52}$$

We can obtain **W** as

$$\mathbf{W} = \frac{\mu \Lambda^{\top} \Lambda \left(\mathbf{D}\_{e}\right)^{-1} - \eta \mathbf{I}}{2\lambda},\tag{5.53}$$

where *η* can be calculated as $\eta = \frac{\mu\Lambda(\mathbf{D}\_e)^{-1}\Lambda^{\top} - 2\lambda}{N}$, and *Λ* can be calculated as $\Lambda = (\mathbf{O}\omega)^{\top}(\mathbf{D}\_v)^{-1/2}\mathbf{H}$. The optimized mapping vector *ω* allows a sample *ζ<sub>i</sub>* in the test set to obtain the classification result *γ* = *ζ<sub>i</sub>ω*.
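The hyperedge weight update can be sketched as below. The sketch makes several assumptions that are not spelled out in the text: only the diagonal of the update is kept (since **W** is diagonal), *N* is taken to be the number of hyperedges, the vertex degrees are computed from the current weights and treated as fixed during the step, and the non-negativity constraint is not enforced by the closed form, so a projection may be needed in practice.

```python
import numpy as np

def update_hyperedge_weights(H, scores, w, mu, lam):
    """One closed-form update of the diagonal hyperedge weights.

    scores is the vector O @ omega; w holds the current hyperedge weights.
    """
    d_v = H @ w                                # vertex degrees under the current weights
    d_e = H.sum(axis=0)                        # hyperedge degrees
    Lam = (scores / np.sqrt(d_v)) @ H          # Lambda = (O omega)^T Dv^{-1/2} H
    a = Lam ** 2 / d_e                         # diagonal of Lambda^T Lambda De^{-1}
    eta = (mu * a.sum() - 2.0 * lam) / H.shape[1]
    return (mu * a - eta) / (2.0 * lam)

# Hypothetical inputs: 5 vertices, 3 hyperedges, equal initial weights.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
scores = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
w_old = np.ones(3) / 3.0

w_new = update_hyperedge_weights(H, scores, w_old, mu=0.1, lam=1.0)
print(w_new, w_new.sum())
```

The multiplier *η* is exactly the value that makes the updated weights sum to one, which is why the sum constraint holds for any choice of *μ* and *λ* in this sketch.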

Each piece of candidate cost information *c<sub>i</sub>* generates a cost matrix *Υ*, which is then used to build a cost-sensitive hypergraph structure $\mathcal{G}\_i$. The model then selects, from this collection, the cost-sensitive hypergraph with the greatest F-measure as the best choice.

#### *(2) Cost Interval Optimization for Hypergraph Computation*

As the cost value for cost-sensitive hypergraph modeling is not easy to determine in practice, in this part, we introduce a cost interval optimization method for hypergraph computation [5], in which the fixed cost value is replaced by a cost interval that is much easier to provide.

Given a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$, the regularization framework of the cost-sensitive hypergraph can be divided into three components, i.e., the empirical loss using cost information, the hypergraph Laplacian regularizer, and the hypergraph regularization. The overall cost is optimized by adding the mis-classification costs of the various categories to the hypergraph framework.

The empirical loss using cost information can be formulated as

$$\mathcal{R}\_{emp}(\omega) = \left\|\Phi(\mathbf{S}\omega - \mathbf{y})\right\|\_{2}^{2} = \sum\_{i=1}^{N\_{v}} \left(\Phi\_{i,i}\left(\mathbf{s}\_{i}\omega - \mathbf{y}\_{i}\right)\right)^{2},\tag{5.54}$$

where *ω* represents the mapping vector, and *Φ* is a diagonal matrix representing misclassification cost weights. The hypergraph Laplacian regularizer can be written as

$$\begin{split} \Omega(\omega) &= \frac{1}{2} \sum\_{e \in \mathcal{E}} \sum\_{v\_i, v\_j \in \mathcal{V}} \frac{\mathbf{W}(e) \mathbf{H}\left(v\_i, e\right) \mathbf{H}\left(v\_j, e\right)}{\delta(e)} \left( \frac{\mathbf{s}\_i\omega}{\sqrt{d\left(v\_i\right)}} - \frac{\mathbf{s}\_j\omega}{\sqrt{d\left(v\_j\right)}} \right)^2 \\ &= (\mathbf{S}\omega)^\top \Delta(\mathbf{S}\omega). \end{split} \tag{5.55}$$

To adjust the hyperedge weights and hence the hypergraph classification ability, the hypergraph regularization is written as $\Psi(\mathbf{W}) = \|\mathbf{W}\|\_F^2$. It is noted that this part can be removed in different applications, if not required.

Combining the above three, the whole optimization task for cost-sensitive hypergraph computation can be written as

$$\begin{aligned} \arg\min\_{\boldsymbol{\omega}, \mathbf{W}} & \left\{ \| \boldsymbol{\Phi} (\mathbf{S} \boldsymbol{\omega} - \mathbf{y}) \|\_{2}^{2} + \mu (\mathbf{S} \boldsymbol{\omega})^{\top} \boldsymbol{\Delta} (\mathbf{S} \boldsymbol{\omega}) + \lambda \|\mathbf{W} \|\_{F}^{2} \right\}, \\ \text{s.t. } & \sum\_{j=1}^{N\_{e}} \mathbf{W}\_{j,j} = 1, \forall \; \mathbf{W}\_{j,j} \ge 0, \end{aligned} \tag{5.56}$$

where *μ* and *λ* are the trade-off hyperparameters.

The precise cost of each category is required for cost-sensitive hypergraph computation, but it is frequently impossible to obtain, and it may only be known that the cost lies within a cost interval [*C<sub>min</sub>*, *C<sub>max</sub>*]. A simple idea is to try all values inside the cost interval and minimize the overall cost. However, this is inefficient given a possibly large cost interval. As the actual cost is difficult to establish, we need to find a surrogate cost *c*<sup>∗</sup> to guide the optimization procedure, and the surrogate classifier *h*<sup>∗</sup> is supposed to be as successful as the true cost classifier *h<sup>t</sup>*. In this way, the problem can be formulated as

$$\begin{aligned} \min\_{h, c^\*} L(h, c^\*),\\ \text{s.t. } p(L(h, c) < \theta) &> 1 - \varphi, \forall c \in [\mathcal{C}\_{\text{min}}, \mathcal{C}\_{\text{max}}], \mathcal{C}\_{\text{min}} \le c^\* \le \mathcal{C}\_{\text{max}}, \end{aligned} \tag{5.57}$$

where *L*(*h*, *c*) is the empirical risk, formulated as $L(h, c) = \sum\_{i=1}^{N\_v} \left[ c\, I(\rho\_i \neq y\_i \wedge y\_i = +) + I(\rho\_i \neq y\_i \wedge y\_i = -) \right]$, where *I*(·) is the indicator function, *ρ<sub>i</sub>* = **s**<sub>*i*</sub>*ω* is the predicted label of the *i*-th sample in the test set, and + and − represent the labels of the important class and the unimportant class, respectively.

The worst-case risk is considered first to guarantee that all limitations can be fulfilled. The worst-case classifier *h*∗ can be written as

$$h^\* = \arg\min\_h \sup\_c L(h, c) \tag{5.58}$$

and

$$p\left(\sup\_{c} L\left(h^{\*},c\right) < \theta\right) > 1 - \varphi. \tag{5.59}$$

We then have $p(L(h^{\*}, c) < \theta) > 1 - \varphi$ for any *c*. The worst-case risk is attained when the surrogate cost *c*<sup>∗</sup> equals *C<sub>max</sub>*. However, this only yields a solution that satisfies the constraints, and the surrogate cost cannot be guaranteed to be close to the true cost. As the mean cost yields the smallest maximum distortion of the true risk over the interval, it is another good choice, which can be calculated as *C<sub>mean</sub>* = 0.5(*C<sub>max</sub>* + *C<sub>min</sub>*).

With the two surrogate costs *C<sub>max</sub>* and *C<sub>mean</sub>*, we can conduct cost interval optimization. In the first stage, *C<sub>max</sub>* is used as the surrogate cost, and a collection of cost-sensitive hypergraph structures with varying parameter values is learned. In the second stage, *C<sub>mean</sub>* is used as the surrogate cost to find the hypergraph structure with the lowest overall cost on the validation set, which is chosen as the final solution.
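The second stage of this procedure can be sketched as follows. The training of the candidate models with *C<sub>max</sub>* is not shown; the candidates are represented only by their predictions on a validation set, and the function names, label encoding (+1 for the important class, −1 for the unimportant class), and numbers are assumptions made for the example.

```python
import numpy as np

def empirical_risk(pred, y, c):
    """L(h, c): cost c per misclassified +1 sample, cost 1 per misclassified -1 sample."""
    pred, y = np.asarray(pred), np.asarray(y)
    wrong = pred != y
    return float(c * np.sum(wrong & (y == 1)) + np.sum(wrong & (y == -1)))

def select_by_mean_cost(candidates, y_val, c_min, c_max):
    """Pick the candidate with the lowest validation risk under C_mean."""
    c_mean = 0.5 * (c_max + c_min)
    risks = [empirical_risk(p, y_val, c_mean) for p in candidates]
    return int(np.argmin(risks)), risks

# Hypothetical validation labels and predictions of two candidate models
# (both assumed to have been trained with the surrogate cost C_max).
y_val = np.array([1, 1, -1, -1, -1])
candidates = [
    np.array([1, -1, -1, -1, -1]),   # misses one important sample
    np.array([1, 1, 1, 1, -1]),      # two false alarms on the unimportant class
]

best, risks = select_by_mean_cost(candidates, y_val, c_min=2.0, c_max=6.0)
print(best, risks)
```

With *C<sub>mean</sub>* = 4, one missed important sample costs more than two false alarms, so the second candidate is preferred in this toy case.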

In this section, we have described cost-sensitive hypergraph computation methods. The imbalanced data issue is very common in many applications. The cost-sensitive hypergraph computation methods introduce a cost matrix into hypergraph modeling, and either a fixed cost value or a cost interval can be used in the learning process.

#### **5.5 Link Prediction on Hypergraph**

Link prediction is a fundamental task in network analysis. The objective of link prediction is to predict whether two vertices in a network may have a link. Link prediction has wide applications in different domains, such as social relation exploration [17, 18], protein interaction prediction [19, 20], and recommender system [21, 22], which has attracted much attention in the past decades.

Link prediction on hypergraph aims to discover missing relations or predict newly arriving hyperedges based on the observed hypergraph, where hypergraph computation can be used to deeply exploit the underlying high-order correlations among the data. Unlike the link prediction task on the graph structure [23, 24], the hypergraph models the high-order correlation among the data, which is heterogeneous in many applications, as the vertices are of different types. For example, in a bibliographic network, a vertex can represent a paper, an author, or a venue, while a hyperedge represents the relation in which a paper is written by multiple authors and published in a venue. These different types of vertices do not necessarily share the same representation space. From the view of the hypergraph event, the heterogeneous hypergraph consists of two kinds of vertices, i.e., identifier vertices and slave vertices. An identifier vertex is a vertex that determines a hyperedge uniquely, while the slave vertices are all the other vertices. In this section, we introduce the Heterogeneous Hypergraph Variational Auto-encoder (HeteHG-VAE) method [6] for the heterogeneous hypergraph link prediction task.

The overview of HeteHG-VAE can be found in Fig. 5.12. HeteHG-VAE aims to learn the low-dimensional heterogeneous hypergraph embedding based on the Bayesian deep generative strategy. The input hypergraph is represented by the incidence matrix **H**, whose sub-hypergraph represents the hypergraph generated by different types of slave vertices. The heterogeneous encoder can project the vertices and the hyperedges to the vertex embedding and hyperedge embedding, respectively. The hypergraph embedding is the combination of the vertex embedding and the hyperedge embedding, which can be used for reconstructing the incidence matrix by the hypergraph decoder.

In the following part of this section, we first introduce the variational evidence lower bound with the task specific derivation. Then, the inference model, including the heterogeneous vertex encoder and the heterogeneous hyperedge encoder, is presented. At last, the generative model and the link prediction method are introduced.

**Fig. 5.12** An illustration of the HeteHG-VAE method. This figure is from [6]

Denote $\{x\_k\}\_{k=1}^{K}$ as the observed data with the total number *K*, $\mathbf{Z}\_k^V$ as the latent vertex embedding, and $\mathbf{Z}^E$ as the latent hyperedge embedding. HeteHG-VAE assumes that $\mathbf{Z}\_k^V$ and $\mathbf{Z}^E$ are drawn *i.i.d.* from a Gaussian prior, i.e., $\mathbf{Z}\_k^V \sim p\_0(\mathbf{Z}\_k^V)$ and $\mathbf{Z}^E \sim p\_0(\mathbf{Z}^E)$, and *x<sub>k</sub>* are drawn from the conditional distribution, $x\_k \sim p(x\_k|\mathbf{Z}\_k^V, \mathbf{Z}^E; \lambda\_k)$, where *λ<sub>k</sub>* is the parameter of the distribution. The objective of HeteHG-VAE is to maximize the log-likelihood of the observed data by optimizing *λ<sub>k</sub>* as follows:

$$\begin{split} & \log p(\mathbf{x}\_1, \dots, \mathbf{x}\_K; \boldsymbol{\lambda}) \\ &= \log \int\_{\mathbf{Z}\_1^V} \cdots \int\_{\mathbf{Z}\_K^V} \int\_{\mathbf{Z}^E} p(\mathbf{x}\_1, \dots, \mathbf{x}\_K, \mathbf{Z}\_1^V, \dots, \mathbf{Z}\_K^V, \mathbf{Z}^E; \boldsymbol{\lambda}) d\mathbf{Z}\_1^V \cdots d\mathbf{Z}\_K^V d\mathbf{Z}^E \\ & \ge \mathbb{E}\_q \left( \log \frac{p(\mathbf{x}\_1, \dots, \mathbf{x}\_K, \mathbf{Z}\_1^V, \dots, \mathbf{Z}\_K^V, \mathbf{Z}^E; \boldsymbol{\lambda})}{q(\mathbf{Z}\_1^V, \dots, \mathbf{Z}\_K^V, \mathbf{Z}^E | \mathbf{x}\_1, \dots, \mathbf{x}\_K; \boldsymbol{\theta})} \right) \\ & \coloneqq \mathcal{L}^p(\mathbf{x}\_1, \dots, \mathbf{x}\_K; \boldsymbol{\theta}, \boldsymbol{\lambda}), \end{split} \tag{5.60}$$

where *q*(·) is the variational posterior used to estimate the true posterior $p(\mathbf{Z}\_1^V, \dots, \mathbf{Z}\_K^V, \mathbf{Z}^E | x\_1, \dots, x\_K)$, which is intractable, and *θ* is the parameter to be estimated. Then, $\mathcal{L}(x\_1, \dots, x\_K; \theta, \lambda)$ is the evidence lower bound of the log marginal likelihood. Based on the evidence lower bound, an inference encoder is presented to parameterize *q*, and a generative decoder is used to parameterize *p*.

The inference encoder of HeteHG-VAE consists of two main parts, i.e., the heterogeneous vertex encoder and the heterogeneous hyperedge encoder. The heterogeneous vertex encoder first maps the observed data *x<sub>k</sub>* to a latent space $\tilde{\mathbf{Z}}\_k^V$, which can be written as

$$\tilde{\mathbf{Z}}\_k^V = f^V(\mathbf{x}\_k \mathbf{W}\_k^V + b\_k^V),\tag{5.61}$$

where $\mathbf{W}\_k^V$ and $b\_k^V$ are the to-be-learned weights of the model, and $f^V$ is a nonlinear activation function. Two separate linear layers map the latent representation to the mean $\mu\_k^V$ and variance $\sigma\_k^V$ of *q*:

$$
\mu\_k^V = \tilde{\mathbf{Z}}\_k^V \mathbf{W}\_k^{V\mu} + b\_k^{V\mu},\tag{5.62}
$$

$$
\sigma\_k^V = \tilde{\mathbf{Z}}\_k^V \mathbf{W}\_k^{V\sigma} + b\_k^{V\sigma}, \tag{5.63}
$$

where $\mathbf{W}\_k^{V\mu}$, $b\_k^{V\mu}$, $\mathbf{W}\_k^{V\sigma}$, and $b\_k^{V\sigma}$ are learnable parameters. The vertex embedding is sampled from the Gaussian distribution $\mathcal{N}(\mu\_k^V, \sigma\_k^V)$.

The heterogeneous hyperedge encoder first maps the observed data *x<sub>k</sub>* to a latent space $\tilde{\mathbf{Z}}\_k^E$, which can be written as

$$\tilde{\mathbf{Z}}\_k^E = f^E(\mathbf{x}\_k^\top \mathbf{W}\_k^E + b\_k^E),\tag{5.64}$$

where $\mathbf{W}\_k^E$ and $b\_k^E$ are the to-be-learned weights of the model, and $f^E$ is a nonlinear activation function. Then, the importance of different types of vertices is learned by the hyperedge attention mechanism, which can be written as

$$
\tilde{\alpha}\_k = \tanh(\tilde{\mathbf{Z}}\_k^E \mathbf{W}\_k^{E\alpha} + b\_k^{E\alpha})\mathbf{P},\tag{5.65}
$$

where $\mathbf{W}\_k^{E\alpha}$, $b\_k^{E\alpha}$, and **P** are learnable parameters. The attention score *α<sub>k</sub>* is obtained by normalizing $\tilde{\alpha}\_k$, and the hyperedge embedding can be written as

$$
\tilde{\mathbf{Z}}^E = \sum\_{k=1}^K \alpha\_k \tilde{\mathbf{Z}}\_k^E \,. \tag{5.66}
$$

Similarly, two separate linear layers map the latent representation to the mean $\mu^E$ and variance $\sigma^E$ of the distribution *q*:

$$
\mu^E = \tilde{\mathbf{Z}}^E \mathbf{W}^{E\mu} + b^{E\mu},\tag{5.67}
$$

$$
\sigma^E = \tilde{\mathbf{Z}}^E \mathbf{W}^{E\sigma} + b^{E\sigma},\tag{5.68}
$$

where $\mathbf{W}^{E\mu}$, $b^{E\mu}$, $\mathbf{W}^{E\sigma}$, and $b^{E\sigma}$ are learnable parameters. The hyperedge embedding is sampled from the Gaussian distribution $\mathcal{N}(\mu^E, \sigma^E)$.

The incidence matrix is sampled from a Bernoulli distribution parameterized by $\mathcal{H}\_{ij}$:

$$p(\mathbf{H}\_{ij}|\mathbf{Z}\_{k,i}^{V}, \mathbf{Z}\_{j}^{E}; \lambda\_k) = Ber(\mathcal{H}\_{ij}),\tag{5.69}$$

where $\mathcal{H}\_{ij}$ is obtained by applying the sigmoid function to the dot product of the vertex embedding and the hyperedge embedding:

$$\mathcal{H}\_{ij} = \text{Sigmoid}(\mathbf{Z}\_{k,i}^{V}(\mathbf{Z}\_{j}^{E})^{\top}).\tag{5.70}$$
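The data flow through the encoders and the decoder (Eqs. 5.61–5.70) can be traced with a forward-pass sketch. It is not the implementation of [6]: the weights are random stand-ins for learned parameters, tanh is assumed for the activations *f<sup>V</sup>* and *f<sup>E</sup>*, the attention scores are normalized with a softmax over vertex types, and the output of the second linear layer is interpreted as a log standard deviation so that sampling is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, d_out):
    """Randomly initialised linear layer, standing in for learned weights and bias."""
    W = rng.normal(scale=0.1, size=(x.shape[1], d_out))
    return x @ W + np.zeros(d_out)

def sample(mu, log_sigma):
    """Reparameterised Gaussian sample (log-std parameterisation is an assumption)."""
    return mu + np.exp(log_sigma) * rng.normal(size=mu.shape)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical incidence sub-matrices x_k: (vertices of type k) x (6 hyperedges), K = 2.
x = [rng.integers(0, 2, size=(4, 6)).astype(float),
     rng.integers(0, 2, size=(3, 6)).astype(float)]
d_hid, d_lat = 8, 5
P = rng.normal(size=(d_hid, 1))

Z_V, Z_E_tilde, scores = [], [], []
for x_k in x:
    h_v = np.tanh(linear(x_k, d_hid))                            # Eq. (5.61)
    Z_V.append(sample(linear(h_v, d_lat), linear(h_v, d_lat)))   # Eqs. (5.62)-(5.63)
    h_e = np.tanh(linear(x_k.T, d_hid))                          # Eq. (5.64)
    Z_E_tilde.append(h_e)
    scores.append(np.tanh(linear(h_e, d_hid)) @ P)               # Eq. (5.65)

alpha = np.exp(np.stack(scores))                # shape (K, n_edges, 1)
alpha = alpha / alpha.sum(axis=0)               # assumed softmax over the K types
Z_E_hid = (alpha * np.stack(Z_E_tilde)).sum(axis=0)              # Eq. (5.66)
Z_E = sample(linear(Z_E_hid, d_lat), linear(Z_E_hid, d_lat))     # Eqs. (5.67)-(5.68)

H_rec = [sigmoid(Z_k @ Z_E.T) for Z_k in Z_V]   # Eq. (5.70): Bernoulli parameters
print([h.shape for h in H_rec])
```

Each entry of `H_rec[k]` is the predicted probability that a vertex of type *k* belongs to a hyperedge, so ranking these entries is one way to score candidate links.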

The likelihood of the connection among vertices could be obtained based on the vertex embedding and hyperedge embedding as follows:

$$p\_{conn}(\mathbf{Z}\_i^V, \mathbf{Z}\_j^E) = \|\mathbf{Z}\_i^V, \mathbf{Z}\_j^E\|\_2. \tag{5.71}$$

In this section, we have introduced the Heterogeneous Hypergraph Variational Auto-encoder method [6] for the task of link prediction on hypergraph, which captures the high-order correlations among the data while preserving the original low-order topology. Link prediction on hypergraph has shown superior performance in different experiments and can be further used in other applications.

#### **5.6 Summary**

In this chapter, we have introduced four typical hypergraph computation tasks, including label propagation, data clustering, imbalanced learning, and link prediction. Label propagation on hypergraph predicts the labels of the vertices on a hypergraph, i.e., it assigns a label to each unlabeled vertex based on the labeled information. Data clustering on hypergraph divides the vertices in a hypergraph into several groups. Imbalanced learning on hypergraph considers imbalanced data distributions and introduces cost-sensitive hypergraph computation methods. Link prediction on hypergraph discovers missing relations or predicts newly arriving hyperedges based on the observed hypergraph. We note that these four tasks are typical ways to use hypergraph computation in practice. Other tasks can also be deployed under the hypergraph computation framework, such as data regression, data completion, and data generation. These typical hypergraph computation tasks can be used in different applications, such as social media analysis and computer vision.




## **Chapter 6 Hypergraph Structure Evolution**

**Abstract** In practice, noise exists in the process of data collection and hypergraph construction. Therefore, missing, redundant, and noisy connections may be introduced into the generated hypergraph structure, which may lead to inaccurate inference on the hypergraph. Another issue comes from continuously growing data streams, which are common in many applications. It is therefore important to consider structure evolution methods that optimize the hypergraph structure accordingly. Early hypergraph computation methods mainly rely on a static hypergraph structure, which limits them when confronting random and growing data scenarios. In this chapter, we introduce dynamic hypergraph structure evolution methods, including both hypergraph component optimization and hypergraph structure optimization. Finally, we briefly introduce an incremental learning method for growing data.

#### **6.1 Introduction**

The hypergraph structure models the high-order and complex correlations among data, and thus the quality of the topology structure plays an important role in learning tasks on hypergraphs. As shown in the previous chapter, there are implicit and explicit methods of hypergraph generation from observed data. However, the generated hypergraph may contain redundant, missing, and noisy connections due to disturbances in the process of data collection and hypergraph construction. In other words, there may exist biases between the generated hypergraph and the ground truth structure. Under such circumstances, it is essential to optimize the hypergraph structure so that it fits the ground truth high-order correlation more accurately. The quality of a hypergraph can be directly quantified by comparing it with the ground truth structure, if available, or indirectly evaluated by the performance of downstream applications. Most existing hypergraph computation methods rely on a static hypergraph structure, such as the *k*-NN-based method [1], the cluster-based method [2], and the sparse representation-based method [3]. These methods may suffer from the inaccurate hypergraph structure that exists in practice. In this chapter, we introduce hypergraph structure evolution methods under the dynamic hypergraph structure learning mechanism. Hypergraph structure evolution can be divided into two main categories, i.e., hypergraph component optimization and hypergraph structure optimization. The problem of hypergraph structure evolution is usually integrated with the learning process and formulated as a bi-level optimization problem. Part of the work introduced in this chapter has been published in [4–7].

#### **6.2 Hypergraph Component Optimization**

Besides the main structure of a hypergraph, i.e., the incidence matrix, a hypergraph is also composed of a group of components, such as the weights of hyperedges, vertices, and even sub-hypergraphs, which play an important role in the hypergraph structure. Hypergraph component optimization aims to explore the optimal components of the hypergraph, i.e., hyperedge weights, vertex weights, and sub-hypergraph weights. The hyperedge weights represent the strength of each high-order correlation among data, while the vertex weights represent the importance of different samples in the structure. In many cases, we may construct multiple hypergraphs using multi-modal data or different criteria, which can be regarded as sub-hypergraphs. The sub-hypergraph weights measure the importance of different sub-hypergraphs in the overall structure. The optimization procedure adjusts the hyperedge weights, the vertex weights, and the sub-hypergraph weights during the training process in order to improve the performance on downstream applications.

#### *6.2.1 Hyperedge Weight Optimization*

The hyperedge is a basic component of the hypergraph, representing the high-order complex correlation among data. The initial hypergraph usually assigns an identical weight to all hyperedges. However, hyperedges actually have different effects for a given task. The hyperedge weights indicate the importance of different hyperedges contributing to the whole structure. In this section, we introduce the hyperedge weight learning methods [4], in which the weights of hyperedges are adaptively adjusted during the training process, and thus the importance of different hyperedges can be automatically modulated.


We assume that there are *n* hyperedges in the hypergraph, denoted by $\{e\_1, e\_2, \ldots, e\_n\}$. The weights of the hyperedges are defined by the *n* × 1 vector $w = [w\_1, w\_2, \ldots, w\_n]^{\top}$. There is usually a constraint on the hyperedge weights that their sum is equal to one, i.e., $\sum\_{i=1}^{n} w\_i = 1$. We use **F** to denote the output of hypergraph learning. The problem of learning hyperedge weights can be formulated mathematically in a dual-optimization form

$$\begin{aligned} \arg\min\_{\mathbf{F}, w} \Psi(\mathbf{F}) &:= \left\{ \Omega(\mathbf{F}) + \lambda R\_{\text{emp}}(\mathbf{F}) + \mu \Phi(w) \right\}, \\ &\text{s.t. } \sum\_{e \in \mathcal{E}} \mathbf{W}(e) = 1. \end{aligned} \tag{6.1}$$

Here, *Ω(***F***)* and *R*emp*(***F***)* are the regularizer and empirical loss of **F**, respectively. *Φ(w)* is the regularizer on *w*. *λ* and *μ* are the scalars controlling the relative importance of these three items.

The general formulation can be implemented by specifying the functions *Ω(*·*)*, *R*emp*(*·*)*, and *Φ(*·*)*. As before, **F** denotes the to-be-learned labels in the vertex classification task. The regularizer *Ω(***F***)* can be defined as $\mathbf{F}^{\top}\Delta\mathbf{F}$, where *Δ* is the hypergraph Laplacian matrix. The empirical loss *R*emp*(***F***)* can be instantiated as the difference between the learned **F** and the observed labels **Y** of the training data, i.e., the least-squares residual. The regularizer on *w* is the squared $\ell\_2$-norm. The formulation can then be written as

$$\begin{aligned} \arg\min\_{\mathbf{F}, w} \Psi(\mathbf{F}) &:= \left\{ \mathbf{F}^\top \Delta \mathbf{F} + \lambda \|\mathbf{F} - \mathbf{Y}\|^2 + \mu \sum\_{i=1}^n w\_i^2 \right\}, \\ &\text{s.t. } \sum\_{i=1}^n w\_i = 1. \end{aligned} \tag{6.2}$$

The aim of the learning process is to search the optimal solution of **F** and *w* to minimize the cost function in Eq. (6.2).

There are two variables to be optimized in Eq. (6.2), which can be handled by an alternating optimization algorithm: at each step, one of the two variables **F** and *w* is optimized while the other is kept fixed. The details of the alternating optimization strategy are as follows.

Given the initial hyperedge weights, the first step is to fix *w* and optimize **F**. The sub-problem is written as

$$\arg\min\_{\mathbf{F}} \Psi(\mathbf{F}) = \arg\min\_{\mathbf{F}} \left\{ \mathbf{F}^{\top} \Delta \mathbf{F} + \lambda \left\| \mathbf{F} - \mathbf{Y} \right\|^{2} \right\}. \tag{6.3}$$

A closed-form solution of Eq. (6.3) is known from traditional hypergraph learning. With $\Delta = \mathbf{I} - \Theta$, the solution is written as

$$\begin{aligned} \mathbf{F} &= \left(\mathbf{I} + \frac{1}{\lambda}\Delta\right)^{-1}\mathbf{Y} \\ &= \left(\mathbf{I} + \frac{1}{\lambda}(\mathbf{I} - \Theta)\right)^{-1}\mathbf{Y} \\ &= \frac{\lambda}{\lambda + 1} \left(\mathbf{I} - \frac{1}{\lambda + 1}\Theta\right)^{-1}\mathbf{Y}. \end{aligned} \tag{6.4}$$

Let $\zeta = \frac{1}{\lambda+1}$, so that $\frac{\lambda}{\lambda+1} = 1 - \zeta$, and Eq. (6.4) can be rewritten as

$$\mathbf{F} = (1 - \zeta) (\mathbf{I} - \zeta \Theta)^{-1} \mathbf{Y}. \tag{6.5}$$
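The **F**-step can be illustrated on a toy hypergraph. The sketch below is a minimal illustration in plain Python, not the authors' implementation: it avoids an explicit matrix inverse by iterating **F** ← (*Θ***F** + *λ***Y**)/(*λ* + 1), which is a rearrangement of the linear system (**I** + *Δ*/*λ*)**F** = **Y** from the first line of Eq. (6.4). The function names `theta_matrix` and `solve_labels`, the 4-vertex example, and the iteration count are assumptions made for illustration only.

```python
import math

def theta_matrix(H, w):
    """Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} for a small dense hypergraph."""
    n_v, n_e = len(H), len(H[0])
    # Vertex degree d(v) = sum_e w_e H(v, e); hyperedge degree delta(e) = sum_v H(v, e).
    d_v = [sum(w[e] * H[v][e] for e in range(n_e)) for v in range(n_v)]
    delta = [sum(H[v][e] for v in range(n_v)) for e in range(n_e)]
    return [[sum(w[e] * H[u][e] * H[v][e] / delta[e] for e in range(n_e))
             / math.sqrt(d_v[u] * d_v[v])
             for v in range(n_v)] for u in range(n_v)]

def solve_labels(theta, Y, lam, n_iter=200):
    """Iterate F <- (Theta F + lam * Y) / (lam + 1) until (I + Delta / lam) F = Y holds."""
    n_v, n_c = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(n_iter):
        F = [[(sum(theta[u][v] * F[v][k] for v in range(n_v)) + lam * Y[u][k]) / (lam + 1)
              for k in range(n_c)] for u in range(n_v)]
    return F

# Toy hypergraph: e0 = {v0, v1}, e1 = {v2, v3}, e2 = {v0, v2}.
H = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
w = [1 / 3, 1 / 3, 1 / 3]
# v0 is labeled with class 0, v2 with class 1; v1 and v3 are unlabeled.
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
lam = 1.0

theta = theta_matrix(H, w)
F = solve_labels(theta, Y, lam)
labels = [max(range(2), key=row.__getitem__) for row in F]
print(labels)
```

The iteration is a contraction because the eigenvalues of *Θ* lie in [0, 1], so each step shrinks the error by at least a factor of 1/(*λ* + 1). For larger problems a direct linear solver would normally be preferred.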

With the updated **F**, the next step is to fix **F** and optimize *w*. The sub-problem about *w* is

$$\begin{aligned} \arg\min\_{w} \Psi(\mathbf{F}) &= \arg\min\_{w} \left\{ \mathbf{F}^{\top} \Delta \mathbf{F} + \mu \sum\_{i=1}^{n} w\_{i}^{2} \right\}, \\ &\text{s.t. } \sum\_{i=1}^{n} w\_{i} = 1, \mu > 0. \end{aligned} \tag{6.6}$$

The method of Lagrange multipliers is employed here, and the sub-problem is replaced with

$$\begin{aligned} &\arg\min\_{w,\eta} \mathbf{F}^\top \Delta \mathbf{F} + \mu \sum\_{i=1}^n w\_i^2 + \eta \left(\sum\_{i=1}^n w\_i - 1\right) \\ &= \quad \arg\min\_{w,\eta} \mathbf{F}^\top \left(\mathbf{I} - \mathbf{D}\_v^{-\frac{1}{2}} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-\frac{1}{2}}\right) \mathbf{F} + \mu \sum\_{i=1}^n w\_i^2 + \eta \left(\sum\_{i=1}^n w\_i - 1\right) . \end{aligned} \tag{6.7}$$

Let $\Gamma = \mathbf{D}\_v^{-\frac{1}{2}} \mathbf{H}$, and it can be shown that

$$\eta = \frac{\mathbf{F}^\top \Gamma \mathbf{D}\_e^{-1} \Gamma^\top \mathbf{F} - 2\mu}{n} \tag{6.8}$$


and

$$w\_{i} = \frac{1}{n} - \frac{\mathbf{F}^{\top} \Gamma \mathbf{D}\_{e}^{-1} \Gamma^{\top} \mathbf{F}}{2n\mu} + \frac{\mathbf{F}^{\top} \Gamma\_{i} \mathbf{D}\_{e}^{-1} (i, i) \Gamma\_{i}^{\top} \mathbf{F}}{2\mu}. \tag{6.9}$$

Here, $\Gamma\_i$ denotes the *i*-th column of *Γ*.

In this way, **F** and *w* are alternately updated until convergence, which yields the final values of **F** and *w*. We note that the above method is a typical way to optimize the hyperedge weights using the squared $\ell\_2$-norm regularizer. Other regularizers or constraints can also be used to learn the hyperedge weights.
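The closed-form weight update of Eq. (6.9) can be sketched for a single label column. This is an illustrative sketch under stated assumptions rather than the reference implementation: **F** is taken to be a single column (one score per vertex), the vertex degrees are computed from the current weights and then held fixed, and the function name `hyperedge_weight_update` and the toy data are hypothetical.

```python
import math

def hyperedge_weight_update(H, F, w, mu):
    """One closed-form update of the hyperedge weights, following Eq. (6.9).

    H  : incidence matrix as a list of rows, H[v][e] in {0, 1}
    F  : current label scores, one float per vertex (a single column of F)
    w  : current hyperedge weights (used only to compute the vertex degrees)
    mu : weight of the regularizer on w (mu > 0)
    """
    n_v, n_e = len(H), len(H[0])
    d_v = [sum(w[e] * H[v][e] for e in range(n_e)) for v in range(n_v)]
    delta = [sum(H[v][e] for v in range(n_v)) for e in range(n_e)]
    # a[e] = F^T Gamma_e D_e^{-1}(e, e) Gamma_e^T F, with Gamma = D_v^{-1/2} H.
    a = []
    for e in range(n_e):
        proj = sum(F[v] * H[v][e] / math.sqrt(d_v[v]) for v in range(n_v))
        a.append(proj * proj / delta[e])
    total = sum(a)
    return [1.0 / n_e - total / (2 * n_e * mu) + a[e] / (2 * mu) for e in range(n_e)]

# Toy hypergraph: e0 = {v0, v1} and e1 = {v2, v3} are label-consistent, e2 = {v0, v2} is mixed.
H = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
F = [1.0, 1.0, -1.0, -1.0]
w = [1 / 3, 1 / 3, 1 / 3]
mu = 10.0

w_new = hyperedge_weight_update(H, F, w, mu)
print(w_new)
```

In this toy case the hyperedge that mixes the two label groups receives a smaller weight than the two consistent hyperedges, and the updated weights still sum to one. The closed form does not enforce non-negativity, so a small *μ* can produce negative weights; that case would need separate handling.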

#### *6.2.2 Vertex Weight Optimization*

Early hypergraph computation methods mainly focus on the weights of hyperedges and may not take the importance of vertices into account. However, the vertex set of a hypergraph may be heterogeneous, imbalanced, or contaminated by outliers, which degrades the performance of the learning process. Therefore, it is necessary to consider vertex weights, which define the impact of different subjects during the learning process. For example, for imbalanced data, vertices belonging to the minority class may require larger weights, and vice versa. In this part, we introduce the vertex-weighted hypergraph learning method [5], which can update the vertex weights during the learning process.

The aim of the vertex-weighted hypergraph learning algorithm is to emphasize the vertices with distinguishable information and disregard the redundant vertices that bring in bias and noise instead of useful information. On the basis of learning hyperedge weights, the vertex-weighted learning algorithm further considers the vertex weights. Here, let $\{v\_1, v\_2, \ldots, v\_n\}$ denote all *n* vertices in the hypergraph. The weight of vertex $v\_i$ is represented by $u\_i$. Let **U** denote the diagonal matrix of vertex weights. The overall cost function is similar to that for learning hyperedge weights, but with the impact of **U** simultaneously taken into consideration. The general formulation is written as

$$\begin{aligned} \arg\min\_{\mathbf{F}, w} \Psi\_{\mathbf{U}}(\mathbf{F}) &:= \left\{ \Omega\_{\mathbf{U}}(\mathbf{F}) + \lambda R\_{\text{emp}}(\mathbf{F}) + \mu \Phi(w) \right\}, \\ \text{s.t. } \mathbf{W}(e) &\ge 0, \sum\_{e \in \mathcal{E}} \mathbf{H}(v, e) \mathbf{W}(e) = \mathbf{D}\_{v}(v). \end{aligned} \tag{6.10}$$

The key point of vertex weight optimization is to design a reasonable vertex weighting scheme that scores the importance of each subject during the learning process. First, the pairwise distances between vertices are calculated based on their features. Let $d\_{ij}$ denote the distance between vertices $v\_i$ and $v\_j$, and let $\hat{d}\_i$ denote the mean distance between $v\_i$ and all other training vertices with the same label. The vertex weight is then defined as

$$
u\_i = \frac{\hat{d}\_i}{\sum\_{j=1}^{n\_{train}} \hat{d}\_j},
\tag{6.11}
$$

where *ntrain* denotes the number of training samples. It is noted that only the training data are labeled and further weighted. The unlabeled vertices are initialized with an identical weight. Normalization is then applied to the vertex weights. This weighting scheme can assign higher weights to vertices that are far from other intraclass vertices and vice versa. Therefore, the importance of repeated/close samples is relatively smaller than the outliers during the hypergraph learning process.
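The weighting scheme of Eq. (6.11) can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between feature vectors (the text does not fix the metric); the function name `intra_class_vertex_weights` and the toy data are hypothetical.

```python
import math

def intra_class_vertex_weights(features, labels):
    """Vertex weights u_i of Eq. (6.11) for the labeled training vertices.

    features : list of feature vectors (lists of floats), one per training vertex
    labels   : list of class labels, aligned with `features`
    """
    n = len(features)
    mean_dist = []
    for i in range(n):
        # Distances from v_i to every other training vertex with the same label.
        same = [math.dist(features[i], features[j])
                for j in range(n) if j != i and labels[j] == labels[i]]
        # Assumption: a vertex without any same-label peer gets mean distance 0.
        mean_dist.append(sum(same) / len(same) if same else 0.0)
    total = sum(mean_dist)
    return [d / total for d in mean_dist]

# Class "a" has two near-duplicate samples and one distant sample; class "b" has two samples.
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
labels = ["a", "a", "a", "b", "b"]

u = intra_class_vertex_weights(features, labels)
print(u)
```

Here the distant class-"a" sample receives the largest weight, while the two near-duplicates receive smaller ones, which matches the behavior described above.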

Since the hypergraph structure is updated with vertex weights, the hypergraph structure regularizer differs from the initial one. As stated already, the hypergraph regularizer is defined based on the cut cost. Here, the cut cost is related not only to the hyperedge weights but also to the vertex weights. In general, the higher the weights of two vertices, the higher the cut cost. Therefore, the regularizer of the hypergraph structure $\Omega\_{\mathbf{U}}(\mathbf{F})$ is rewritten as

$$\begin{aligned} \Omega\_{\mathbf{U}}(\mathbf{F}) &= \sum\_{k=1}^{C} \sum\_{e \in \mathcal{E}} \sum\_{u,v \in \mathcal{V}} \frac{\mathbf{W}(e)\mathbf{U}(u)\mathbf{H}(u,e)\mathbf{U}(v)\mathbf{H}(v,e)}{2\delta(e)} \left( \frac{\mathbf{F}(u,k)}{\sqrt{d(u)}} - \frac{\mathbf{F}(v,k)}{\sqrt{d(v)}} \right)^{2} \\ &= \sum\_{k=1}^{C} \sum\_{e \in \mathcal{E}} \sum\_{u,v \in \mathcal{V}} \frac{\mathbf{W}(e)\mathbf{U}(u)\mathbf{H}(u,e)\mathbf{U}(v)\mathbf{H}(v,e)}{\delta(e)} \left( \frac{\mathbf{F}(u,k)^{2}}{d(u)} - \frac{\mathbf{F}(u,k)\mathbf{F}(v,k)}{\sqrt{d(u)d(v)}} \right) \\ &= \sum\_{k=1}^{C} \left\{ \sum\_{u \in \mathcal{V}} \mathbf{U}(u)\mathbf{F}(u,k)^{2} \sum\_{e \in \mathcal{E}} \frac{\mathbf{W}(e)\mathbf{H}(u,e)}{d(u)} \sum\_{v \in \mathcal{V}} \frac{\mathbf{H}(v,e)\mathbf{U}(v)}{\delta(e)} \right. \\ &\qquad \left. - \sum\_{e \in \mathcal{E}} \sum\_{u,v \in \mathcal{V}} \frac{\mathbf{F}(u,k)\mathbf{U}(u)\mathbf{H}(u,e)\mathbf{W}(e)\mathbf{H}(v,e)\mathbf{U}(v)\mathbf{F}(v,k)}{\sqrt{d(u)d(v)}\,\delta(e)} \right\} \\ &= \sum\_{k=1}^{C} \mathbf{F}(:,k)^{\top} \Delta\_{\mathbf{U}} \mathbf{F}(:,k) = \mathbf{F}^{\top} \Delta\_{\mathbf{U}} \mathbf{F}. \end{aligned} \tag{6.12}$$

Here, **F***(*:*, k)* is the *k*-th column of **F** and *C* is the number of data categories. $\Delta\_{\mathbf{U}}$ is the vertex-weighted hypergraph Laplacian, which can be defined as

$$\Delta\_{\mathbf{U}} = \mathbf{U} - \boldsymbol{\Theta} = \mathbf{U} - \mathbf{D}\_{v}^{-1/2} \mathbf{U} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{U} \mathbf{D}\_{v}^{-1/2}. \tag{6.13}$$

Compared with the traditional hypergraph Laplacian $\Delta = \mathbf{I} - \mathbf{D}\_v^{-\frac{1}{2}} \mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^{\top} \mathbf{D}\_v^{-\frac{1}{2}}$, the hypergraph Laplacian with weighted vertices takes the different weights of vertices into consideration when evaluating the cost on the hypergraph structure. Therefore, the learning task can be further defined as

$$\begin{aligned} \arg\min\_{\mathbf{F}, \mathbf{W}} \Psi(\mathbf{F}) &:= \left\{ \mathbf{F}^\top \Delta\_\mathbf{U} \mathbf{F} + \lambda \|\mathbf{F} - \mathbf{Y}\|^2 + \mu \sum\_{e \in \mathcal{E}} \mathbf{W}(e)^2 \right\}, \\ \text{s.t. } \mathbf{W}(e) &\ge 0, \sum\_{e \in \mathcal{E}} \mathbf{H}(v, e) \mathbf{W}(e) = \mathbf{D}\_v(v). \end{aligned} \tag{6.14}$$

The above optimization problem can be solved by the alternating optimization algorithm. The sub-problem about **F** has a closed-form solution, as in traditional hypergraph learning. The sub-problem about **W** is written as

$$\begin{aligned} \arg\min\_{\mathbf{W}} \Psi(\mathbf{F}) &:= \left\{ \mathbf{F}^\top \Delta\_\mathbf{U} \mathbf{F} + \mu \sum\_{e \in \mathcal{E}} \mathbf{W}(e)^2 \right\}, \\ \text{s.t. } \mathbf{W}(e) &\ge 0, \sum\_{e \in \mathcal{E}} \mathbf{H}(v, e) \mathbf{W}(e) = \mathbf{D}\_v(v). \end{aligned} \tag{6.15}$$

The above optimization task can be solved via quadratic programming, since it is convex in **W**. Through vertex weight optimization, the vertex-weighted hypergraph structure takes the contribution of each vertex to the whole hypergraph structure into consideration, and thus it can model the high-order relevance among objects more accurately. During the learning process, the impact of low-quality training samples on the structure and on the subsequent classification task decreases continuously, while high-quality training data, which account for a minority, can be given greater importance. The additional vertex weights lead to a hypergraph Laplacian matrix that measures data correlation better than the traditional one and consequently improve the classification performance.

#### *6.2.3 Sub-hypergraph Weight Optimization*

Given multiple sub-hypergraphs that are used to jointly formulate the correlation among data, it is important to measure how these sub-hypergraphs contribute to the main task. Sub-hypergraph weight optimization adjusts the importance of the sub-hypergraphs that model the complex correlation among multi-modal data. In this part, we introduce inductive multi-hypergraph learning (iMHL) [7], which learns the projection matrices of the model and adjusts the weights of the sub-hypergraphs simultaneously during the training process. iMHL models the high-order correlation of the multi-modal data with multiple hypergraphs, uses the sub-hypergraph weights as modality weights, and learns the mapping from the data to the labels under the supervised setting. Given testing data, the learned projection can be used to predict the corresponding labels. The framework of iMHL is illustrated in Fig. 6.1, where both the offline training and online training are supported by the inductive learning process, which can handle newly arriving data efficiently.

**Fig. 6.1** The framework of inductive multi-hypergraph learning method. This figure is from [7]

Here, we denote by *m* the total number of sub-hypergraphs and by $\mathcal{G}\_i = (\mathcal{V}\_i, \mathcal{E}\_i, \mathbf{W}\_i)$ the *i*-th sub-hypergraph, built for the *i*-th modality. The projection matrices $\mathbf{M}\_i$ are combined according to the sub-hypergraph weights and are used to map the data to the labels for prediction. The combination weights $\omega = [\omega\_1, \cdots, \omega\_m]$, where $\omega\_i$ represents the weight of the *i*-th modality, are another variable to be optimized, subject to $\sum\_{i=1}^{m} \omega\_i = 1$ and $\omega \ge 0$.

The loss function $\bar{\Psi}$ for learning all $\mathbf{M}\_i$ can be formulated as

$$\bar{\Psi} = \sum\_{i=1}^{m} \omega\_i \{ \Omega(\mathbf{M}\_i) + \lambda R\_{emp}(\mathbf{M}\_i) + \mu \Phi(\mathbf{M}\_i) \} + \eta \Gamma(\omega), \tag{6.16}$$

which consists of two main parts, i.e., the summation of the cost of each sub-hypergraph and the regularization on the sub-hypergraph weights *ω*. *Φ(***M***)* is the regularizer on the projection matrix. We assume that the vertices with similar labels are connected strongly, and *Ω(***M***)* can then be written as

$$\begin{split} \Omega(\mathbf{M}) &= \frac{1}{2} \sum\_{k=1}^{c} \sum\_{e \in \mathcal{E}} \sum\_{u,v \in \mathcal{V}} \frac{\mathbf{W}(e)\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)} \left( \frac{\mathbf{X}^{\top}\mathbf{M}(u,k)}{\sqrt{d(u)}} - \frac{\mathbf{X}^{\top}\mathbf{M}(v,k)}{\sqrt{d(v)}} \right)^{2} \\ &= \text{tr}(\mathbf{M}^{\top}\mathbf{X}\boldsymbol{\Delta}\mathbf{X}^{\top}\mathbf{M}), \end{split} \tag{6.17}$$

where *Δ* denotes the normalized hypergraph Laplacian,

$$
\Delta = \mathbf{I} - \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2}. \tag{6.18}
$$

The empirical loss term *Remp(***M***)* can be written as

$$R\_{emp}(\mathbf{M}) = ||\mathbf{X}^\top \mathbf{M} - \mathbf{Y}||^2. \tag{6.19}$$

*Φ(***M***)* can be formulated as the $\ell\_{2,1}$-norm of **M**,

$$\Phi(\mathbf{M}) = ||\mathbf{M}||\_{2,1},\tag{6.20}$$

which produces row sparsity and thus selects the more informative features. *Γ(ω)* is the squared $\ell\_2$-norm of the sub-hypergraph weights,

$$\Gamma(\omega) = ||\omega||^2,\tag{6.21}$$

which aims to learn the optimal weights for each sub-hypergraph.
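To make the row-sparsity regularizer of Eq. (6.20) concrete, the following short sketch evaluates the ℓ2,1-norm as the sum of the Euclidean norms of the rows of **M**. The helper name `l21_norm` and the example matrix are illustrative assumptions.

```python
import math

def l21_norm(M):
    """l_{2,1}-norm: sum over the rows of M of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)

# Each row corresponds to one feature dimension; the all-zero row contributes nothing.
M = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]
print(l21_norm(M))
```

Because each row enters through its unsquared norm, minimizing this quantity tends to drive entire rows to zero, which amounts to discarding the corresponding feature dimensions.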

The inductive multi-hypergraph learning task can be formulated as

$$\begin{aligned} \arg\min\_{\mathbf{M}\_i,\omega\geq0} &\sum\_{i=1}^m \omega\_i \left(\Omega(\mathbf{M}\_i) + \lambda R\_{emp}(\mathbf{M}\_i) + \mu \Phi(\mathbf{M}\_i)\right) + \eta \Gamma(\omega), \\ &\text{s.t. } \sum\_{i=1}^m \omega\_i = 1. \end{aligned} \tag{6.22}$$

It is observed that Eq. (6.22) can be split into *m*+1 sub-problems: each $\mathbf{M}\_i$ is optimized individually, and the combination weights *ω* are optimized to fuse all sub-hypergraphs.

The optimization of $\mathbf{M}\_i$, shown below, can be solved by an iterative algorithm:

$$\arg\min\_{\mathbf{M}\_i} \Omega(\mathbf{M}\_i) + \lambda R\_{emp}(\mathbf{M}\_i) + \mu \Phi(\mathbf{M}\_i). \tag{6.23}$$

The optimization problem of *ω* can then be written as

$$\begin{aligned} \arg\min\_{\boldsymbol{\omega}\geq 0} & \sum\_{i=1}^{m} \omega\_{i} \left( \Omega(\mathbf{M}\_{i}) + \lambda R\_{emp}(\mathbf{M}\_{i}) + \mu \Phi(\mathbf{M}\_{i}) \right) + \eta ||\boldsymbol{\omega}||^{2}, \\ & \text{s.t. } \sum\_{i=1}^{m} \omega\_{i} = 1. \end{aligned} \tag{6.24}$$

We denote *Υi* = *Ω(***M***i)* + *λRemp(***M***i)* + *μΦ(***M***i)*, and Eq. (6.24) can be simplified to

$$\begin{aligned} \arg\min\_{\boldsymbol{\omega}\geq 0} & \sum\_{i=1}^{m} \omega\_{i} \Upsilon\_{i} + \eta ||\boldsymbol{\omega}||^{2}, \\ & \text{s.t. } \sum\_{i=1}^{m} \omega\_{i} = 1. \end{aligned} \tag{6.25}$$

The method of Lagrange multipliers can be applied to solve Eq. (6.25), which can be formulated as

$$\arg\min\_{\boldsymbol{\omega},\xi} \sum\_{i=1}^{m} \omega\_{i} \Upsilon\_{i} + \eta ||\boldsymbol{\omega}||^{2} + \xi \left(\sum\_{i=1}^{m} \omega\_{i} - 1\right). \tag{6.26}$$

Then, we can have

$$\xi = \frac{-\sum\_{i=1}^{m} \Upsilon\_i - 2\eta}{m} \tag{6.27}$$

and

$$
\omega\_i = \frac{1}{m} + \frac{\sum\_{j=1}^{m} \Upsilon\_j}{2m\eta} - \frac{\Upsilon\_i}{2\eta}.\tag{6.28}
$$
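The closed-form solution of Eq. (6.28) is easy to check numerically. The sketch below is illustrative only: the function name `subhypergraph_weights` and the example costs are assumptions, and the per-sub-hypergraph costs *Υ* are taken as given numbers rather than computed from data.

```python
def subhypergraph_weights(costs, eta):
    """Closed-form combination weights of Eq. (6.28).

    costs : list of per-sub-hypergraph costs (the values of Upsilon)
    eta   : weight of the squared l2 regularizer on omega (eta > 0)
    """
    m = len(costs)
    total = sum(costs)
    return [1.0 / m + total / (2 * m * eta) - c / (2 * eta) for c in costs]

costs = [2.0, 3.0, 4.0]
eta = 5.0
omega = subhypergraph_weights(costs, eta)
print(omega)
```

A sub-hypergraph with a lower cost receives a larger weight, and the weights sum to one by construction. The closed form by itself does not guarantee *ω* ≥ 0: when *η* is small relative to the spread of the costs, some entries can become negative, so the non-negativity constraint would need separate handling in that case.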

Given a testing sample $x^t = \{x\_1^t, \cdots, x\_m^t\}$ with features for each modality, the corresponding label can be predicted by

$$C(\mathbf{x}^{t}) = \arg\max\_{k} \sum\_{i=1}^{m} \omega\_{i} \, {\mathbf{x}\_{i}^{t}}^{\top} \mathbf{M}\_{i}, \tag{6.29}$$

where the maximum is taken over the entries of the resulting score vector, with one entry per class *k*.
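The prediction rule of Eq. (6.29) can be sketched as a weighted sum of per-modality class scores followed by an argmax. The function name `predict_label`, the two-modality example, and all numbers below are hypothetical and serve only to illustrate the computation.

```python
def predict_label(x_t, M, omega):
    """Fuse the per-modality class scores and return (predicted class, fused scores).

    x_t   : list of feature vectors, x_t[i] belongs to modality i
    M     : list of projection matrices, M[i] has shape (len(x_t[i]), n_classes)
    omega : list of sub-hypergraph (modality) weights
    """
    n_classes = len(M[0][0])
    scores = [0.0] * n_classes
    for x_i, M_i, w_i in zip(x_t, M, omega):
        for k in range(n_classes):
            # Entry k of x_i^T M_i, scaled by the modality weight.
            scores[k] += w_i * sum(x_i[d] * M_i[d][k] for d in range(len(x_i)))
    return max(range(n_classes), key=scores.__getitem__), scores

# Two modalities with 2-D and 3-D features, two classes.
x_t = [[1.0, 0.0], [0.0, 2.0, 1.0]]
M = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.6], [0.3, 0.2]]]

label_a, scores_a = predict_label(x_t, M, omega=[0.7, 0.3])
label_b, scores_b = predict_label(x_t, M, omega=[0.2, 0.8])
print(label_a, label_b)
```

The two modalities disagree on this sample, so the predicted class follows whichever modality carries the larger weight, which illustrates why the learned weights matter at test time.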

The overall algorithm is shown in Fig. 6.2. Optimizing the sub-hypergraph weights is effective because modeling multi-modal data with multiple sub-hypergraphs makes it possible to flexibly investigate the contributions of different data or information to the learning process.

#### **6.3 Hypergraph Structure Optimization**

Although the above component optimization methods can modify the weights of hyperedges, vertices, or sub-hypergraphs, they cannot precisely correct inappropriate or wrong connections, since the connections between vertices and hyperedges cannot be changed, i.e., the incidence matrix of the hypergraph is fixed. To address this challenge and further optimize the hypergraph structure, it is imperative to investigate how to finely optimize the hypergraph structure and dynamically learn the high-order relationships. This can be regarded as finding the optimal hypergraph structure in a hypergraph space, as shown in Fig. 6.3.

In this part, we introduce the dynamic hypergraph structure learning method [6], whose framework is shown in Fig. 6.4. Different from the above methods, it directly optimizes the incidence matrix **H**.

**Fig. 6.3** An illustration of hypergraph structure evolution

The output **F** and the incidence matrix **H** are jointly optimized by the dual-optimization method. The objective function of the joint learning can be formulated as

$$\arg\min\_{\mathbf{F},0\preceq\mathbf{H}\preceq1}\Psi(\mathbf{F}) := \left\{\Omega(\mathbf{F},\mathbf{H}) + \lambda R\_{\text{emp}}(\mathbf{F}) + \mu\Phi(\mathbf{H})\right\}.\tag{6.30}$$

There are three terms in the objective function, explained as follows:

• First, *Ω(***F***,* **H***)* is the regularizer related to **F** and **H**. The output **F** contains the to-be-learned label vectors of the vertices, which are expected to be smooth on the hypergraph structure. The commonly used regularizer of hypergraph smoothness can be written as

$$\Omega(\mathbf{F}, \mathbf{H}) = \text{tr}\left(\mathbf{F}^{\top} \left(\mathbf{I} - \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^{\top} \mathbf{D}\_v^{-1/2}\right) \mathbf{F}\right). \tag{6.31}$$

In the previous methods, the regularizer is a function of **F** only, while **H** is fixed. Here, the regularizer is a function of both **F** and **H**.

• Second, *R*emp*(***F***)* is the empirical loss, instantiated in Eq. (6.33) as $\|\mathbf{F} - \mathbf{Y}\|^2$.

• Third, *Φ(***H***)* is the regularizer on **H**, which has the same form as Eq. (6.31) with the feature matrix **X** in place of **F**:

$$\Phi(\mathbf{H}) = \text{tr}\left(\mathbf{X}^{\top} \left(\mathbf{I} - \mathbf{D}\_{v}^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2}\right) \mathbf{X}\right). \tag{6.32}$$

To summarize, the general objective function in Eq. (6.30) for dynamic hypergraph structure learning is instantiated as

$$\begin{aligned} \arg\min\_{\mathbf{F},0\preceq\mathbf{H}\preceq1}\Psi(\mathbf{F}) := {} & \text{tr}\left(\left(\mathbf{I} - \mathbf{D}\_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^\top\mathbf{D}\_v^{-1/2}\right)\left(\mathbf{F}\mathbf{F}^\top + \mu\mathbf{X}\mathbf{X}^\top\right)\right) \\ & +\lambda\|\mathbf{F} - \mathbf{Y}\|^2. \end{aligned} \tag{6.33}$$

Similar to the previous methods, the alternating optimization algorithm is applied to solve the dual-optimization problem. The sub-problem about **F** has the same closed-form solution as traditional hypergraph learning [8].

The most important difference from the previous methods is the sub-problem about **H**, which is written as

$$\begin{split} \arg\min\_{0 \preceq \mathbf{H} \preceq 1} \mathcal{Q}(\mathbf{H}) &= \Omega(\mathbf{F}, \mathbf{H}) + \mu \Phi(\mathbf{H}) \\ &= \text{tr}\left( \left( \mathbf{I} - \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2} \right) \mathbf{K} \right), \end{split} \tag{6.34}$$

where $\mathbf{K} = \mathbf{F}\mathbf{F}^{\top} + \mu\mathbf{X}\mathbf{X}^{\top}$. The projected gradient method is employed here because Eq. (6.34) is a complex function of **H** with a bound constraint. The gradient is derived as

$$\begin{split} \nabla \mathcal{Q}(\mathbf{H}) &= \mathbf{J} \left( \mathbf{I} \otimes \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} \mathbf{K} \mathbf{D}\_{v}^{-1/2} \mathbf{H} \right) \mathbf{W} \mathbf{D}\_{e}^{-2} \\ &+ \mathbf{D}\_{v}^{-3/2} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} \mathbf{K} \mathbf{J} \mathbf{W} - 2 \mathbf{D}\_{v}^{-1/2} \mathbf{K} \mathbf{D}\_{v}^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1}, \end{split} \tag{6.35}$$

where $\mathbf{J} = \mathbf{1}\mathbf{1}^{\top}$. The detailed derivation process can be found in [6]. The step size of learning **H** is set as *α*. Since the entries of **H** are required to be in the range [0, 1], the projection $\mathcal{P}$ onto the feasible set is applied after each update. Therefore, **H** is updated by

$$\mathbf{H}\_{k+1} = \mathcal{P} \left[ \mathbf{H}\_k - \alpha \nabla \mathcal{Q} \left( \mathbf{H}\_k \right) \right],\tag{6.36}$$

where

$$\mathcal{P}\left[h\_{ij}\right] = \begin{cases} h\_{ij} & \text{if } 0 \le h\_{ij} \le 1 \\ 0 & \text{if } h\_{ij} < 0 \\ 1 & \text{if } h\_{ij} > 1 \end{cases} \tag{6.37}$$

In this way, we can alternately optimize **F** and **H** until the objective function converges.
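The update rule of Eqs. (6.36) and (6.37) can be sketched as follows. This is a minimal illustration of the projected gradient step only: the gradient of Eq. (6.35) is not implemented here, and a toy quadratic objective, 0.5·‖**H** − **T**‖² with a made-up target **T**, stands in for it. The names `project`, `projected_gradient_step`, and `toy_grad` are assumptions for illustration.

```python
def project(H):
    """Entrywise projection onto [0, 1], as in Eq. (6.37)."""
    return [[min(1.0, max(0.0, h)) for h in row] for row in H]

def projected_gradient_step(H, grad, alpha):
    """One update H_{k+1} = P[H_k - alpha * grad(H_k)], as in Eq. (6.36)."""
    G = grad(H)
    return project([[h - alpha * g for h, g in zip(h_row, g_row)]
                    for h_row, g_row in zip(H, G)])

# Toy stand-in for the gradient of Q(H): the gradient of 0.5 * ||H - T||^2 is H - T.
# Two entries of T lie outside [0, 1], so the constraint is active there.
T = [[1.4, -0.3],
     [0.6, 0.2]]

def toy_grad(H):
    return [[h - t for h, t in zip(h_row, t_row)] for h_row, t_row in zip(H, T)]

H = [[0.5, 0.5],
     [0.5, 0.5]]
for _ in range(200):
    H = projected_gradient_step(H, toy_grad, alpha=0.1)
print(H)
```

For this toy objective the iterates converge to **T** clipped to [0, 1]: entries whose unconstrained optimum lies outside the feasible range end up on the boundary. In the actual method, `toy_grad` would be replaced by the gradient of Eq. (6.35), and the step would be alternated with the closed-form update of **F**.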

The dynamic hypergraph structure learning method can consistently outperform traditional hypergraph learning. This is because the dynamic hypergraph structure can fit the data better and formulate the high-order correlation more effectively. Furthermore, both the feature and the label information are used for the hypergraph structure optimization. Therefore, the learned hypergraph structure is smooth on both the feature space and the label space. In other words, the vertices with the same labels have stronger high-order connections, which benefits the downstream task. We also note that the above dynamic hypergraph structure optimization method has relatively high computational complexity, as it optimizes the whole incidence matrix **H**.

#### **6.4 Incremental Learning on Growing Data**

Most of the existing methods consider static structures with fixed sets of vertices and edges, while the data are generally dynamic in real-world applications. Under such circumstances, vertices and connections can be added or removed, and the vertex attributes and connection weights change during the dynamic procedure. Generally, there are two typical ways of dynamic structure learning, i.e., using recurrent architectures [9, 10] and capturing temporal patterns [11, 12]. However, the efficient learning of temporally growing structures, where the vertex and edge sets expand over time, has not been explored yet. Take the citation network as an example: new publications and citation links are continuously added to the network.

The incremental subgraph is the subgraph consisting of the newly appeared vertices and their related new edges in the given growing graph at each time step. The edges connecting vertices from the same incremental subgraph are denoted as intra-edges, while the edges connecting vertices from different incremental subgraphs are denoted as inter-edges. The incremental learning method aims to update the model based on the incremental subgraphs at each time step while performing consistently on the entire graph. The challenge of incremental graph learning is how to design an efficient strategy to update the model with incremental data and maintain the performance on the whole dataset.

The main differences between incremental graph learning and existing incremental learning methods are as follows:


There are two straightforward solutions to incremental graph learning. First, static graph learning methods can be applied directly to the whole graph at each time step, which suffers from a high computation cost. Second, the model can be learned only from the incremental subgraph, which biases it toward the newly arriving subgraphs and loses the information of the inter-edges.

In this section, we introduce incremental graph learning (IGL) on growing data. During training, a graph $\mathcal{G}\_t^L$ with smaller vertex and edge sets is generated from the growing graph $\{\mathcal{G}\_t\}$ for updating the current model, which can be implemented by existing GNN methods for the specified graph learning task and can perform on the entire observed graph at any time. A restricted number of vertices and edges from the old graph are selected and combined with the new subgraph into $\mathcal{G}\_t^L$. Therefore, $\mathcal{G}\_t^L$ is unbiased with respect to the entire graph, and enough inter-edges are preserved. The overview of IGL is shown in Fig. 6.5.

To address the issues of subgraph bias and missing inter-edges, the following conditions should be considered when generating the learning graph $\mathcal{G}\_t^L$.

*Unbiased Estimation of Neighboring Aggregation* To alleviate the bias of the subgraph, the aggregation results of vertices in $\mathcal{G}\_t^L$ should be unbiased estimates of those in the entire graph, i.e., $\forall \mathbf{v} \in \mathcal{V}\_t$,

$$\mathbb{E}\left(\text{agg}\left(\mathbf{v}, \mathcal{N}\_t(\mathbf{v}) \cap \mathcal{V}\_t^L\right) \mid \mathbf{v} \in \mathcal{V}\_t^L\right) = \text{agg}\left(\mathbf{v}, \mathcal{N}\_t(\mathbf{v})\right),\tag{6.38}$$

where $\text{agg}(\mathbf{v}, \mathcal{N})$ is the aggregator function of the GNN that aggregates vertex embeddings from $\mathcal{N}$ to $\mathbf{v}$, and $\mathcal{N}\_t(\mathbf{v}) = \{\mathbf{u} \in \mathcal{V}\_t \mid (\mathbf{u}, \mathbf{v}) \in \mathcal{E}\_t\}$ is the neighborhood set of vertex $\mathbf{v}$. Thus, $\mathcal{N}\_t(\mathbf{v}) \cap \mathcal{V}\_t^L$ represents the sampled neighboring vertices in $\mathcal{G}\_t^L$.

*Preservation of Inter-edges* Since missing inter-edges may seriously affect training, we aim to preserve as many edges of $\mathcal{E}\_t^{inter}$ as possible in $\Delta\mathcal{E}\_t^L$, which can be formulated as

$$\begin{array}{ll}\max\_{\Delta \mathcal{E}\_t^L} & |\Delta \mathcal{E}\_t^L \cap \mathcal{E}\_t^{inter}|\\ \text{s.t.} & |\Delta \mathcal{E}\_t^L| \le E\_{\max}.\end{array} \tag{6.39}$$

Edge preservation can be posed either as the explicit optimization problem in Eq. (6.39) or as a sampling problem that gives priority to vertices with higher degrees, so that $P(\mathbf{u} \in \mathcal{V}\_t^L) \propto |\{(\mathbf{u}, \mathbf{v}) \in \mathcal{E}\_t^{inter} \mid \mathbf{v} \in \mathcal{V}\_t^{new}\}|$.

IGL is based on the unbiased and edge-preserved conditions. In presenting the method, we follow the memory constraint $V\_{max}$ and set $E\_{max} = (|\mathcal{V}\_t^{new}| + V\_{max})^2 - |\mathcal{E}\_t^{intra}|$ by default. The generated edges can be uniformly sampled if a smaller $E\_{max}$ is required. The sample-based strategy selects a subgraph from the previous graph for learning. The cluster-based strategy that follows constructs a cluster graph that satisfies both the unbiased and edge-preserved conditions to an intermediate degree. The presented strategies are illustrated in Fig. 6.6.

#### **(1) Sample-Based Strategy**

The strategy of sampling a representative subgraph from previous data under the required conditions is studied first. We assume that a subset $\Delta\mathcal{V}\_t^L$ of size $V\_{max}$ is sampled from $\mathcal{V}\_{t-1}$, and all the related edges are preserved, i.e.,

$$\begin{aligned} \Delta \mathcal{V}\_t^L &= \text{Sample}\ (\mathcal{V}\_{t-1}, V\_{\max}),\\ \Delta \mathcal{E}\_t^L &= \{ (\mathbf{u}, \mathbf{v}) \in \mathcal{E}\_t^{inter} \mid \mathbf{u} \in \Delta \mathcal{V}\_t^L, \mathbf{v} \in \mathcal{V}\_t^{new} \}, \end{aligned} \tag{6.40}$$

**Fig. 6.6** An illustration of the sample-based and cluster-based strategies

where *Sample()* denotes the sampling function. Considering the required conditions, we explore the following pragmatic methods for appropriate sampling:


The above methods take into consideration only part of the required conditions. It can be proved that, ignoring the ideal case when all the vertices in $\mathcal{V}\_{t-1}$ connect with the same number of vertices in $\mathcal{V}\_t^{new}$, sampling in Eq. (6.40) satisfies the two required conditions when all the vertices have been sampled, i.e., joint training.
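The sampling step of Eq. (6.40) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sample_learning_subgraph`, the representation of inter-edges as `(old vertex, new vertex)` pairs, and the optional degree-weighted variant (with a `+1` smoothing so that vertices without inter-edges remain eligible) are hypothetical choices, not the authors' implementation.

```python
import random


def sample_learning_subgraph(old_vertices, new_vertices, inter_edges, v_max,
                             by_degree=False, seed=0):
    """Sketch of Eq. (6.40): pick at most v_max old vertices and keep every
    inter-edge that joins a picked old vertex to a new vertex."""
    rng = random.Random(seed)
    old = sorted(old_vertices)
    new = set(new_vertices)
    k = min(v_max, len(old))
    if by_degree:
        # Assumed variant: favor old vertices with more inter-edges.
        deg = {u: 0 for u in old}
        for u, v in inter_edges:
            if u in deg and v in new:
                deg[u] += 1
        # Weighted sampling without replacement via random keys r**(1/w);
        # the weight is deg + 1 so zero-degree vertices stay eligible.
        keys = {u: rng.random() ** (1.0 / (deg[u] + 1)) for u in old}
        picked = set(sorted(old, key=keys.get, reverse=True)[:k])
    else:
        # Plain uniform sampling without replacement.
        picked = set(rng.sample(old, k))
    kept = {(u, v) for u, v in inter_edges if u in picked and v in new}
    return picked, kept


old = {0, 1, 2, 3}
new = {10, 11}
inter = {(0, 10), (0, 11), (1, 10), (3, 11)}
picked, kept = sample_learning_subgraph(old, new, inter, v_max=2)
```

When `v_max` is at least the number of old vertices, every old vertex is picked and all inter-edges are kept, which corresponds to the joint-training case mentioned above.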

#### **(2) Cluster-Based Strategy**

The sample-based strategy selects a subgraph from the previous graph for learning. However, in such a process, $\mathcal{G}\_{t-1}$ is not completely covered, and some important vertices might be dropped, so the selected subgraph cannot fully communicate with the new subgraph. The cluster-based strategy therefore relaxes the sampling assumption that $\mathcal{G}\_t^L$ must be a subgraph of $\mathcal{G}\_t$ and constructs a cluster graph instead. Technically, we first arrange the vertices in $\mathcal{V}\_{t-1}$ into $K$ cluster sets $\{\mathcal{C}\_i^{t-1}\}\_{i=1}^K$ with centers $\{\mathbf{c}\_i^{t-1}\}\_{i=1}^K$ given by the average values of the clusters. We set the number of clusters $K = V\_{max}$. The cluster graph is therefore defined as

$$\begin{split} \Delta \mathcal{V}\_{t}^{L} &= \{ \mathbf{c}\_{1}^{t-1}, \dots, \mathbf{c}\_{K}^{t-1} \}, \\ \Delta \mathcal{E}\_{t}^{L} &= \{ (\mathbf{c}\_{i}^{t-1}, \mathbf{v}) \mid \mathbf{v} \in \mathcal{V}\_{t}^{new}, \exists \ \mathbf{u} \in \mathcal{C}\_{i}^{t-1}, (\mathbf{u}, \mathbf{v}) \in \mathcal{E}\_{t}^{inter} \} \cup \\ & \{ (\mathbf{c}\_{i}^{t-1}, \mathbf{c}\_{j}^{t-1}) \mid \exists \ \mathbf{u}\_{1} \in \mathcal{C}\_{i}^{t-1}, \mathbf{u}\_{2} \in \mathcal{C}\_{j}^{t-1}, (\mathbf{u}\_{1}, \mathbf{u}\_{2}) \in \mathcal{E}\_{t-1} \}, \end{split} \tag{6.41}$$

which suggests that the cluster centers be added as new cluster vertices, and the edges connecting to any vertex in $\mathcal{V}\_{t-1}$ be directly transferred to the corresponding cluster vertex. It is noted that the two additional edge sets in Eq. (6.41) are derived from $\mathcal{E}\_t^{inter}$ and $\mathcal{E}\_{t-1}$, respectively.
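The edge transfer in Eq. (6.41) can be illustrated with a short sketch. The names below (`build_cluster_graph`, the `("c", i)` labels for cluster vertices) are hypothetical, inter-edges are assumed to be stored as `(old vertex, new vertex)` pairs, and dropping edges inside a single cluster is an extra assumption made here for readability; the equation itself does not exclude them.

```python
def build_cluster_graph(clusters, new_vertices, inter_edges, old_edges):
    """Sketch of Eq. (6.41).

    clusters: list of sets of old vertices; cluster i becomes node ("c", i).
    Returns the cluster vertices and the transferred edge set.
    """
    owner = {u: i for i, members in enumerate(clusters) for u in members}
    new = set(new_vertices)
    nodes = [("c", i) for i in range(len(clusters))]
    edges = set()
    # First set: an inter-edge (u, v) becomes (cluster of u, v).
    for u, v in inter_edges:
        if u in owner and v in new:
            edges.add((("c", owner[u]), v))
    # Second set: an old edge (u1, u2) becomes (cluster of u1, cluster of u2).
    for u1, u2 in old_edges:
        if u1 in owner and u2 in owner:
            i, j = owner[u1], owner[u2]
            if i != j:  # assumption: ignore edges inside one cluster
                edges.add((("c", min(i, j)), ("c", max(i, j))))
    return nodes, edges


clusters = [{0, 1}, {2, 3}]
nodes, edges = build_cluster_graph(
    clusters,
    new_vertices={10},
    inter_edges={(0, 10), (1, 10), (2, 10)},
    old_edges={(0, 1), (1, 2), (2, 3)},
)
```

In this toy case the two inter-edges from vertices 0 and 1 collapse into a single edge from the first cluster vertex, which shows how the cluster graph stays small while every old vertex remains represented.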

Due to the continued growth of the graph, direct clustering on the entire graph is time-consuming. For an approximate but efficient clustering with a balanced size, we first cluster the new vertices $\mathcal{V}\_t^{new}$ into cluster sets $\{\Delta\mathcal{C}\_i^t\}\_{i=1}^K$ with centers $\{\hat{\mathbf{c}}\_i^t\}\_{i=1}^K$. A bipartite matching algorithm is then applied to optimize a bijective matching function $m(\cdot): \{1, \ldots, K\} \to \{1, \ldots, K\}$ for the objective $\min\_{m(\cdot)} \sum\_{k=1}^{K} \|\mathbf{c}\_k^{t-1} - \hat{\mathbf{c}}\_{m(k)}^t\|\_2^2$, which assigns each new cluster to a nearby old cluster. Then, we merge the clusters as $\mathcal{C}\_k^t = \mathcal{C}\_k^{t-1} \cup \Delta\mathcal{C}\_{m(k)}^t$ and update the values of the centers $\mathbf{c}\_k^t$.
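The matching and merging step can be sketched as follows. For clarity the bijection is found by brute force over all permutations, which is only feasible for a handful of clusters; this is a stand-in chosen here, and in practice an assignment solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would typically be used. All function names are hypothetical, and clusters are represented as lists of feature tuples.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Center of a cluster as the coordinate-wise average."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def match_clusters(old_centers, new_centers):
    """Bijection m minimizing sum_k ||c_k^{t-1} - c_hat_{m(k)}^t||^2.
    Brute force over permutations, for illustration only."""
    K = len(old_centers)
    return min(
        permutations(range(K)),
        key=lambda m: sum(sq_dist(old_centers[k], new_centers[m[k]])
                          for k in range(K)),
    )


def merge_clusters(old_clusters, new_clusters, m):
    """C_k^t = C_k^{t-1} united with the matched new cluster; centers updated."""
    merged = [old_clusters[k] + new_clusters[m[k]]
              for k in range(len(old_clusters))]
    return merged, [mean(c) for c in merged]


old_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0)]]
new_clusters = [[(9.0, 11.0)], [(1.0, 1.0)]]
m = match_clusters([mean(c) for c in old_clusters],
                   [mean(c) for c in new_clusters])
merged, centers = merge_clusters(old_clusters, new_clusters, m)
```

Here the first old cluster is matched with the second new cluster and vice versa, since that pairing gives the smaller total squared distance between centers.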

In summary, incremental graph learning (IGL) is a general framework for efficient learning on growing graphs in an incremental manner, and it has the following advantages. First, IGL is well suited to real-world applications, in which dynamic graphs commonly appear. Second, the sample-based and cluster-based strategies significantly improve efficiency as a large-scale graph grows. However, only the addition of nodes and edges is considered, while deletion is ignored, which limits the applicability of the method. More general dynamic patterns are worth studying in future work.

#### **6.5 Summary**

In this chapter, we introduce hypergraph structure evolution methods, i.e., hyperedge weight optimization, vertex weight optimization, sub-hypergraph weight optimization, dynamic hypergraph learning, and the techniques for incremental learning on growing graphs. The hyperedge weight optimization adjusts weights of each hyperedge for different contributions, while the vertex weight optimization considers the different importance of vertices on hypergraph. The sub-hypergraph weight optimization method further combines multiple hypergraphs for multi-modal data with learned weights. Dynamic hypergraph learning optimizes the hypergraph structure by modifying the inappropriate connections, which can partially solve the missing and incorrect connection issue. Finally, we introduce the incremental learning method on growing graphs, which can update the data structure under the incremental scenario.

It is noted that optimizing a hypergraph, whether its components or its structure, brings extra computational cost and may lead to high computational complexity in practice. How to adjust the hypergraph structure effectively and efficiently is still a challenging problem, which requires further investigation in the future.

#### **References**



## **Chapter 7 Neural Networks on Hypergraph**

**Abstract** With the development of deep learning on high-order correlations, hypergraph neural networks have received much attention in recent years. Generally, the neural networks on hypergraph can be divided into two categories, including the spectral-based methods and the spatial-based methods. For the spectral-based methods, the convolution operation is formulated in the spectral domain of graph, and we introduce the typical spectral-based methods, including hypergraph neural networks (HGNN), hypergraph convolution with attention (Hyper-Atten), and hyperbolic hypergraph neural network (HHGNN), which extend hypergraph computation to hyperbolic spaces beyond the Euclidean space. For the spatial-based methods, the convolution operation is defined in groups of spatially close vertices. We then present spatial-based hypergraph neural networks of the general hypergraph neural networks (HGNN+) and the dynamic hypergraph neural networks (DHGNN). Additionally, there are several convolution methods that attempt to reduce the hypergraph structure to the graph structure, so that the existing graph convolution methods can be directly deployed. Lastly, we analyze the association and comparison between hypergraph and graph in the two areas described above (spectral-based, spatial-based), further demonstrating the ability and advantages of hypergraph on constructing and computing higher-order correlations in the data.

#### **7.1 Introduction**

Hypergraph has demonstrated its ability to model and learn complex correlations in recent years. Zhou et al. [1] introduced the hypergraph learning, which conducts transductive learning and propagates information on the hypergraph structure. Transductive inference on the hypergraph aims to minimize the label difference between vertices with stronger connections. There has been extensive development and application of hypergraph learning in several fields over the past few years.

In addition, hypergraph has been investigated in deep learning applications. Based on the hypergraph Laplacian and the Chebyshev formula, Feng et al. [2] first introduced hypergraph neural networks (HGNN). The hypergraph Laplacian is synthesized using predictions in Yadati et al. [3], while Bai et al. [4] defined two neural hypergraph operators based on [5, 6]. However, because these methods introduce only vertex functions, they do not implement high-order learning algorithms; in effect, they construct a simple weighted graph and apply mature graph learning algorithms. A lack of powerful tools for expressing hyperstructure and a wealth of graph literature motivated the work of [7]. Additionally, recent successes with graph representation learning have been achieved by using neural operators (convolution, attention, spectral, etc.). Generally, the neural networks on hypergraph can be divided into two categories, including the spectral-based methods and the spatial-based methods.

For the spectral-based methods, Feng et al. [2] introduced the hypergraph neural networks (HGNN) for modeling and learning beyond pairwise complex correlations. Different from the traditional graph neural networks (GNN), HGNN learns its data representation by iteratively propagating information along the vertex–hyperedge–vertex pattern. Additionally, the hypergraph Laplacian is first approximated and introduced into the deep hypergraph learning method to speed up the learning process. Following [2], Bai et al. [4] developed an attention module based on hypergraph convolution patterns (Hyper-Atten). Hyper-Atten introduced a hyperedge–vertex attention learning module that adaptively identifies the importance of different vertices in a hyperedge, thus revealing the intrinsic correlations between vertices.

Among the spatial methods, Atwood et al. [8] made use of transition matrices to determine where vertices are located. The generalization of convolution in the spatial domain is achieved using Gaussian mixture models based on local path operators. An attention-based graph architecture was built in [9] for analyzing vertices on graphs using the attention mechanism. Dynamic changes in hypergraph structure were taken into consideration in [6], which introduced a framework that is more versatile than HGNN [2]. A unified hypergraph is then constructed by merging the correlations from different modalities/types using an adaptive hyperedge grouping strategy. To learn a general data representation for various tasks, a hypergraph convolution scheme [6] was performed in the spatial domain.

Hypergraph spectral graph theory [10] has been explored far less in other methods. The concept of hypergraph learning was first introduced by Zhou et al. [1], where it was presented as a propagation process. The Laplacian matrix, however, is equivalent to pairwise operations according to [11]. There have been several studies addressing non-pairwise relationships since then, including developing nonlinear Laplacian operators [12, 13], learning the optimal parameters of hyperedges [13, 14], as well as utilizing random walking techniques [10]. Hyperedges can be regarded as connectors in these algorithms, which explicitly break the bipartite property of hypergraph by focusing on vertices.

In this chapter, we systematically introduce the above three types of neural networks on hypergraph and show the comparison between graph neural networks and hypergraph neural networks from both spectral and spatial aspects. Part of the work introduced in this chapter has been published in [2, 15, 16].

#### **7.2 Spectral-Based Neural Networks on Hypergraph**

The spectral neural network methods have attracted much attention since Bruna et al. [17] and Kipf et al. [18] simplified them into a graph convolutional network pattern. The data are transformed from the common domain to the spectral domain, processed there according to graph theory and the convolution theorem, and then transformed back to the common domain. In other words, we first convert the signal from the common domain to the frequency domain (the Fourier transform) and multiply it by the filter there; we then convert the result of this multiplication back to the common domain (the inverse Fourier transform). We will present spectral-based hypergraph neural network methods, including hypergraph neural networks (HGNN) [2], hypergraph convolution with attention (Hyper-Atten) [4], and hyperbolic hypergraph neural networks (HHGNN) [19]. In particular, HHGNN extends hypergraph learning to the hyperbolic spaces beyond the Euclidean space.

#### *7.2.1 Hypergraph Neural Networks*

Given a hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \Delta)$ with $N$ vertices, the hypergraph Laplacian $\Delta$ is an $N \times N$ positive semi-definite matrix. The eigendecomposition $\Delta = \Phi \Lambda \Phi^{\top}$ yields the orthonormal eigenvectors $\Phi = [\phi\_1, \ldots, \phi\_N]$ and a diagonal matrix $\Lambda = \text{diag}(\lambda\_1, \ldots, \lambda\_N)$ containing the corresponding non-negative eigenvalues. The Fourier transform of a signal $x = (x\_1, \ldots, x\_N)$ on the hypergraph is defined as $\hat{x} = \Phi^{\top} x$, where the eigenvectors are regarded as the Fourier bases and the eigenvalues as the frequencies. The spectral convolution of signal $x$ and filter $g$ can be denoted as

$$\mathbf{g}\star\mathbf{x} = \Phi((\boldsymbol{\Phi}^{\top}\boldsymbol{g})\odot(\boldsymbol{\Phi}^{\top}\boldsymbol{x})) = \boldsymbol{\Phi}\mathbf{g}(\boldsymbol{\Lambda})\boldsymbol{\Phi}^{\top}\mathbf{x},\tag{7.1}$$

where $\odot$ denotes the element-wise Hadamard product and $g(\Lambda) = \text{diag}(g(\lambda\_1), \ldots, g(\lambda\_n))$ indicates a function of the Fourier coefficients. However, the forward and inverse Fourier transforms have a computational cost of $O(n^2)$, which is high. To solve this problem, Defferrard et al. [20] parameterize $g(\Lambda)$ with $K$-order polynomials, one choice being the truncated Chebyshev expansion. Chebyshev polynomials $T\_k(x)$ are computed recursively by $T\_k(x) = 2xT\_{k-1}(x) - T\_{k-2}(x)$, with $T\_0(x) = 1$ and $T\_1(x) = x$. The convolution can then be approximated by

$$\mathbf{g} \star \mathbf{x} \approx \sum\_{k=0}^{K} \theta\_k T\_k(\tilde{\Delta}) \mathbf{x},\tag{7.2}$$

where $T\_k(\tilde{\Delta})$ denotes the Chebyshev polynomial of order $k$ with the scaled Laplacian $\tilde{\Delta} = \frac{2}{\lambda\_{max}} \Delta - \mathbf{I}$. In Eq. (7.2), only matrix powers, additions, and multiplications are used instead of the expensive computation of the Laplacian eigenvectors, which further reduces the computational complexity. Since the Laplacian of a hypergraph can already represent the high-order correlations among nodes, the order of the convolution operation can be further limited to $K = 1$. Kipf et al. [18] suggest that $\lambda\_{max} \approx 2$ because of the scale adaptability of neural networks. The convolution operation can then be simplified to

$$\mathbf{g} \star \mathbf{x} \approx \theta\_0 \mathbf{x} - \theta\_1 \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2} \mathbf{x},\tag{7.3}$$

where $\theta\_0$ and $\theta\_1$ represent the parameters of all node filters. In addition, a single parameter $\theta$ is used to avoid the overfitting problem, which is defined as

$$\begin{cases} \theta\_{1} = -\frac{1}{2}\theta\\ \theta\_{0} = \frac{1}{2}\theta \mathbf{D}\_{v}^{-1/2} \mathbf{H} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} . \end{cases} \tag{7.4}$$

Thereafter, the convolution process can be simplified to the following function:

$$\begin{split} \mathbf{g} \star \mathbf{x} &\approx \frac{1}{2} \theta \mathbf{D}\_{v}^{-1/2} \mathbf{H} (\mathbf{W} + \mathbf{I}) \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} \mathbf{x} \\ &\approx \theta \mathbf{D}\_{v}^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} \mathbf{x}, \end{split} \tag{7.5}$$

where *(***W**+ **I***)* can be regarded as the weight of the hyperedges. In the initialization of **W**, the hyperedges can be all assigned with equal weights as an identity matrix.
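As a numerical sanity check of the step from Eq. (7.3) to the first line of Eq. (7.5), the sketch below substitutes the parameters of Eq. (7.4) on a toy hypergraph. The incidence matrix, hyperedge weights, signal, and value of $\theta$ are arbitrary illustrative assumptions, and `prop` is a hypothetical helper name.

```python
import numpy as np

# Toy hypergraph with 4 vertices and 3 hyperedges (illustrative values).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
w = np.array([0.5, 2.0, 1.0])            # hyperedge weights, diagonal of W
W = np.diag(w)
Dv_isqrt = np.diag(1.0 / np.sqrt(H @ w))  # D_v^{-1/2}, d(v) = sum_e w(e) h(v, e)
De_inv = np.diag(1.0 / H.sum(axis=0))     # D_e^{-1}


def prop(weights):
    """D_v^{-1/2} H (weights) D_e^{-1} H^T D_v^{-1/2}."""
    return Dv_isqrt @ H @ weights @ De_inv @ H.T @ Dv_isqrt


x = np.array([1.0, -2.0, 0.5, 3.0])
theta = 0.7

# Eq. (7.4): theta_1 is a scalar, theta_0 acts as a matrix on the signal.
theta1 = -0.5 * theta
theta0 = 0.5 * theta * prop(np.eye(3))

lhs = theta0 @ x - theta1 * (prop(W) @ x)        # Eq. (7.3) with Eq. (7.4)
rhs = 0.5 * theta * (prop(W + np.eye(3)) @ x)    # first line of Eq. (7.5)
print(np.allclose(lhs, rhs))
```

The two sides agree because the propagation operator is linear in the hyperedge weight matrix, so the two terms of Eq. (7.3) combine into a single operator with weights $\mathbf{W} + \mathbf{I}$.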

Given a hypergraph signal $\mathbf{X}^t$ at the $t$-th layer, the hyperedge convolution layer **HGNNConv** can be formulated by

$$\mathbf{X}^{t+1} = \sigma(\mathbf{D}\_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^\top\mathbf{D}\_v^{-1/2}\mathbf{X}^t\Theta),\tag{7.6}$$

where $\Theta$ is the parameter to be learned during the training process. The filter $\Theta$ is applied over the vertices of the hypergraph to extract features. After convolution, we obtain $\mathbf{X}^{t+1}$, which can be used for further processing.

The framework of the abovementioned HGNN model is shown in Fig. 7.1. HGNN addresses the challenge of learning representations for complex data by encoding such data structures into a hypergraph, which is more flexible and handles practical data more effectively.
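A minimal NumPy sketch of one layer of Eq. (7.6) is given below, written in the vertex-to-hyperedge-to-vertex order rather than as a single dense matrix product. The function name `hgnn_conv`, the choice of ReLU for $\sigma$, and the toy inputs are assumptions made for illustration; this is not the reference implementation of HGNN.

```python
import numpy as np


def hgnn_conv(X, H, w, Theta, activation=lambda z: np.maximum(z, 0.0)):
    """One layer of Eq. (7.6).

    X: (N, C_in) vertex features, H: (N, M) incidence matrix,
    w: (M,) hyperedge weights, Theta: (C_in, C_out) learnable filter.
    """
    dv = H @ w                     # vertex degrees (assumed positive)
    de = H.sum(axis=0)             # hyperedge degrees (assumed positive)
    Z = X @ Theta                  # 1) vertex feature transform
    Z = Z / np.sqrt(dv)[:, None]   #    left-multiply by D_v^{-1/2}
    E = (H.T @ Z) / de[:, None]    # 2) gather to hyperedges: D_e^{-1} H^T
    E = E * w[:, None]             #    apply hyperedge weights W
    out = (H @ E) / np.sqrt(dv)[:, None]  # 3) aggregate back to vertices
    return activation(out)


H = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Theta = np.array([[1.0, -1.0], [0.5, 2.0]])
out = hgnn_conv(X, H, w, Theta)
```

Because $\mathbf{W}$ and $\mathbf{D}\_e^{-1}$ are diagonal, the staged computation gives the same result as the dense product in Eq. (7.6) while avoiding the construction of an $N \times N$ matrix.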

The HGNN calculation stages are shown in Fig. 7.2, and the three processes are directly projected to the functions. We can observe that there are vertex feature transform, hyperedge feature gathering, and vertex feature aggregating steps in this framework.

#### *7.2.2 Hypergraph Convolution and Hypergraph Attention*

Based on the study of hypergraph neural networks [2], Bai et al. [4] introduced hypergraph convolution and hypergraph attention (Hyper-Atten) by introducing attention mechanism in the framework.

Hypergraph convolution already assigns an explicit magnitude of importance to the afferent and efferent information flow of a given vertex, since the transition probabilities between vertices take non-binary values. However, such an attention mechanism is determined once the graph structure (the incidence matrix $\mathbf{H}$) is given, rather than learned as a dynamic incidence matrix. It is easier to reveal the intrinsic relationship between vertices using a dynamic transition matrix than by using a fixed incidence matrix. An attention learning module can therefore be imposed on $\mathbf{H}$: instead of treating each vertex as either connected or not connected by a hyperedge, the module assigns non-binary, real values that measure the degree of connectivity. Following [6], when the vertex set and the hyperedge set are comparable, the attention score between a given vertex $x\_i$ and its associated hyperedge $x\_j$ can be written as

$$\mathbf{H}\_{ij} = \frac{\exp\left(\sigma\left(\text{sim}\left(x\_i \mathbf{P}, x\_j \mathbf{P}\right)\right)\right)}{\sum\_{k \in N\_i} \exp\left(\sigma\left(\text{sim}\left(x\_i \mathbf{P}, x\_k \mathbf{P}\right)\right)\right)},\tag{7.7}$$

where $\sigma(\cdot)$ is a nonlinear activation function, $\mathbf{P} \in \mathbb{R}^{F^{(l)} \times F^{(l+1)}}$ is the weight matrix between the $l$-th and $(l+1)$-th layers, and $N\_i$ is the neighborhood set of $x\_i$. The pairwise similarity is computed with the similarity function $\text{sim}(\cdot)$:

$$\text{sim}\left(\mathbf{x}\_{i}, \mathbf{x}\_{j}\right) = \mathbf{a}^{\top} \left[\mathbf{x}\_{i} \| \mathbf{x}\_{j}\right]. \tag{7.8}$$

The operation $[\cdot \| \cdot]$ indicates concatenation, and $\mathbf{a}$ is a weight vector used to output a scalar similarity value.

When following Eq. (7.6) to learn the intermediate embedding of vertices layer by layer, hypergraph attention also propagates gradients to $\mathbf{H}$ in addition to $\mathbf{X}^{(l)}$ and $\Theta$. Eq. (7.7) gives the share of hyperedge $x\_j$ among the neighbors of vertex $x\_i$, which indicates the relative importance of $x\_j$ to $x\_i$. With this probabilistic model, more discriminative embeddings can be learned, and the relationship between vertices can be described more accurately.
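The attention scores of Eqs. (7.7) and (7.8) can be sketched as a masked softmax. Several details below are assumptions for illustration: hyperedge features `Xe` are taken as given in the same feature space as the vertex features, LeakyReLU is used for $\sigma$, every vertex is assumed to belong to at least one hyperedge, and the function name `hypergraph_attention` is hypothetical.

```python
import numpy as np


def hypergraph_attention(Xv, Xe, mask, P, a, slope=0.2):
    """Attention-weighted incidence matrix, Eqs. (7.7)-(7.8).

    Xv: (N, F) vertex features, Xe: (M, F) hyperedge features,
    mask: (N, M) binary incidence (defines the neighborhood N_i),
    P: (F, F2) projection, a: (2 * F2,) similarity weight vector.
    """
    Zv, Ze = Xv @ P, Xe @ P
    f2 = P.shape[1]
    # a^T [z_i || z_j] splits into a vertex term plus a hyperedge term.
    scores = (Zv @ a[:f2])[:, None] + (Ze @ a[f2:])[None, :]
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU (assumed)
    scores = np.where(mask > 0, scores, -np.inf)           # keep only k in N_i
    scores = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
Xv = rng.normal(size=(4, 3))
Xe = rng.normal(size=(2, 3))
mask = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])
P = rng.normal(size=(3, 5))
a = rng.normal(size=10)
H_att = hypergraph_attention(Xv, Xe, mask, P, a)
```

Each row of `H_att` sums to one over the hyperedges incident to that vertex, and entries outside the original incidence pattern stay at zero, so the result can replace the binary $\mathbf{H}$ in Eq. (7.6).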

In order to further enhance the capability of representation learning, the method uses hypergraph attention mechanisms based on the basic formulation of performing convolutions.

#### *7.2.3 Hyperbolic Hypergraph Neural Networks*

The hyperbolic space is a manifold with constant negative Gaussian curvature everywhere, and it has several models. Similar to [21, 22], the work is based on the Poincaré ball model because it is well suited to gradient-based optimization. The Poincaré ball model with constant negative curvature $-1/k$ $(k > 0)$ corresponds to the Riemannian manifold $(\mathbb{P}^{n,k}, g\_{\mathbf{x}}^{\mathbb{P}})$. $\mathbb{P}^{n,k} = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| < 1\}$ is an open $n$-dimensional unit ball, where $\|\cdot\|$ denotes the Euclidean norm. Its metric tensor is $g\_{\mathbf{x}}^{\mathbb{P}} = \lambda\_{\mathbf{x}}^2 g^{\mathbb{E}}$, where $\lambda\_{\mathbf{x}} = \frac{2}{1 - k\|\mathbf{x}\|^2}$ is the conformal factor and $g^{\mathbb{E}} = \mathbf{I}\_n$ is the Euclidean metric tensor. Then, we define the Möbius addition of two points $\mathbf{x}, \mathbf{y} \in \mathbb{P}^{n,k}$ as follows:

$$\mathbf{x} \oplus\_{k} \mathbf{y} = \frac{\left(1 + 2k \langle \mathbf{x}, \mathbf{y} \rangle + k \|\mathbf{y}\|^2\right) \mathbf{x} + \left(1 - k \|\mathbf{x}\|^2\right) \mathbf{y}}{1 + 2k \langle \mathbf{x}, \mathbf{y} \rangle + k^2 \|\mathbf{x}\|^2 \|\mathbf{y}\|^2}. \tag{7.9}$$

The distance between two points $\mathbf{x}, \mathbf{y} \in \mathbb{P}^{n,k}$ is calculated by integration of the metric tensor, which is given as

$$d\_{\mathbb{P}}^{k}(\mathbf{x}, \mathbf{y}) = (2/\sqrt{k}) \tanh^{-1} \left(\sqrt{k} \| -\mathbf{x} \oplus\_{k} \mathbf{y} \| \right). \tag{7.10}$$

Here we denote by $\mathcal{T}\_{\mathbf{x}}\mathbb{P}^{n,k}$ the tangent (Euclidean) space centered at any point $\mathbf{x}$ in the hyperbolic space. For a tangent vector $\mathbf{z} \neq \mathbf{0}$ and a point $\mathbf{y} \neq \mathbf{0}$, the exponential map $\exp\_{\mathbf{x}} : \mathcal{T}\_{\mathbf{x}}\mathbb{P}^{n,k} \to \mathbb{P}^{n,k}$ and the logarithmic map $\log\_{\mathbf{x}} : \mathbb{P}^{n,k} \to \mathcal{T}\_{\mathbf{x}}\mathbb{P}^{n,k}$ are given for $\mathbf{y} \neq \mathbf{x}$ by

$$\exp\_\mathbf{x}^k(\mathbf{z}) = \mathbf{x} \oplus\_k \left( \tanh\left(\sqrt{k} \frac{\lambda\_\mathbf{x}^k \|\mathbf{z}\|}{2}\right) \frac{\mathbf{z}}{\sqrt{k} \|\mathbf{z}\|} \right) \tag{7.11}$$

and

$$\log\_{\mathbf{x}}^{k}(\mathbf{y}) = \frac{2}{\sqrt{k}\lambda\_{\mathbf{x}}^{k}} \tanh^{-1}\left(\sqrt{k}\parallel - \mathbf{x} \oplus\_{k} \mathbf{y} \parallel\right) \frac{-\mathbf{x} \oplus\_{k} \mathbf{y}}{\parallel - \mathbf{x} \oplus\_{k} \mathbf{y} \parallel}.\tag{7.12}$$

The transformation between the tangent space and the hyperbolic space is shown in Fig. 7.3. By leveraging the exp and log maps, we can use the tangent space $\mathcal{T}\_{\mathbf{x}}\mathbb{P}$ to perform Euclidean operations such as convolution and activation. In the convolution, vertex information is first gathered to the hyperedge for storage, and then each vertex aggregates the information of the connected hyperedge.
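The operations in Eqs. (7.9) to (7.12) can be transcribed directly into NumPy, which also makes it easy to check that the logarithmic map undoes the exponential map. The sketch below implements the formulas exactly as printed; the function names and the test points are illustrative assumptions, and inputs are assumed to satisfy $k\|\mathbf{x}\|^2 < 1$, $\mathbf{z} \neq \mathbf{0}$, and $\mathbf{y} \neq \mathbf{x}$.

```python
import numpy as np


def mobius_add(x, y, k):
    """Moebius addition, Eq. (7.9)."""
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2 * k * xy + k * y2) * x + (1 - k * x2) * y
    return num / (1 + 2 * k * xy + k * k * x2 * y2)


def conformal(x, k):
    """Conformal factor lambda_x = 2 / (1 - k ||x||^2)."""
    return 2.0 / (1.0 - k * (x @ x))


def dist(x, y, k):
    """Distance, Eq. (7.10)."""
    d = np.linalg.norm(mobius_add(-x, y, k))
    return 2.0 / np.sqrt(k) * np.arctanh(np.sqrt(k) * d)


def exp_map(x, z, k):
    """Exponential map at x, Eq. (7.11); z must be nonzero."""
    nz = np.linalg.norm(z)
    step = np.tanh(np.sqrt(k) * conformal(x, k) * nz / 2) * z / (np.sqrt(k) * nz)
    return mobius_add(x, step, k)


def log_map(x, y, k):
    """Logarithmic map at x, Eq. (7.12); y must differ from x."""
    d = mobius_add(-x, y, k)
    nd = np.linalg.norm(d)
    return 2.0 / (np.sqrt(k) * conformal(x, k)) * np.arctanh(np.sqrt(k) * nd) * d / nd


k = 0.5
x = np.array([0.1, 0.2])
z = np.array([0.3, -0.4])
p = exp_map(x, z, k)           # point on the ball reached from x along z
z_back = log_map(x, p, k)      # should recover z up to rounding
```

The round trip works because Möbius addition satisfies the left-cancellation law, so $-\mathbf{x} \oplus\_k (\mathbf{x} \oplus\_k \mathbf{s}) = \mathbf{s}$ for the step $\mathbf{s}$ taken inside the exponential map.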

**Fig. 7.3** The transformation between the tangent space and the hyperbolic space

It is noted that the initial data lie in the Euclidean space and need to be converted into embeddings in the hyperbolic space. Therefore, the previously obtained Euclidean data are first projected onto the hyperbolic manifold, so that the spectral-based hyperbolic hypergraph convolutional network can learn the information needed to update the node embeddings. Set $t := \{\sqrt{k}, 0, 0, \ldots, 0\} \in \mathbb{P}^{d,k}$ as a reference point to perform tangent space operations. The above condition makes $\langle (0, \mathbf{x}^{0,\mathbb{E}}), t \rangle = 0$ hold, so $(0, \mathbf{x}^{0,\mathbb{E}})$ can be regarded as the initial embedding representation of the hypergraph structure on the tangent plane $\mathcal{T}\_t\mathbb{P}^{d,k}$ of the hyperbolic manifold space. The initial hypergraph structure embedding is then mapped onto the hyperbolic manifold space $\mathbb{P}$ following [19]:

$$\begin{split} \mathbf{x}^{0,\mathbb{P}} &= \exp\_{t}^{k} \left( \left( 0, \mathbf{x}^{0,\mathbb{E}} \right) \right) \\ &= \left( \sqrt{k} \cosh \left( \frac{\|\mathbf{x}^{0,\mathbb{E}}\|\_{2}}{\sqrt{k}} \right), \sqrt{k} \sinh \left( \frac{\|\mathbf{x}^{0,\mathbb{E}}\|\_{2}}{\sqrt{k}} \right) \frac{\mathbf{x}^{0,\mathbb{E}}}{\|\mathbf{x}^{0,\mathbb{E}}\|\_{2}} \right). \end{split} \tag{7.13}$$
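The closed form in Eq. (7.13) is straightforward to evaluate, as the sketch below shows. One property that can be checked numerically is that the resulting point satisfies $-x\_0^2 + \sum\_{i \ge 1} x\_i^2 = -k$, because $\cosh^2 - \sinh^2 = 1$. Interpreting this through a Minkowski inner product is an assumption made here (the text does not name the inner product), and the function names are hypothetical.

```python
import math


def lift_to_manifold(x_e, k):
    """Eq. (7.13): map a Euclidean feature vector x^{0,E} to the
    (d + 1)-dimensional point (x_0, x_1, ..., x_d)."""
    norm = math.sqrt(sum(v * v for v in x_e))
    if norm == 0.0:
        # The zero tangent vector maps to the reference point t itself.
        return [math.sqrt(k)] + [0.0] * len(x_e)
    r = norm / math.sqrt(k)
    scale = math.sqrt(k) * math.sinh(r) / norm
    return [math.sqrt(k) * math.cosh(r)] + [scale * v for v in x_e]


def minkowski_inner(p, q):
    """Assumed inner product: -p_0 q_0 + sum_{i>=1} p_i q_i."""
    return -p[0] * q[0] + sum(a * b for a, b in zip(p[1:], q[1:]))


k = 2.0
x_e = [0.3, -0.4, 1.2]
t = [math.sqrt(k), 0.0, 0.0, 0.0]
point = lift_to_manifold(x_e, k)
orthogonality = minkowski_inner([0.0] + x_e, t)   # tangent condition at t
constraint = minkowski_inner(point, point)        # close to -k
```

The first check reproduces the orthogonality condition stated before Eq. (7.13), and the second shows that the mapped point stays on a fixed level set determined by $k$.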

Unlike the previous study [23] that simply generates the hyperedge structure for common domain convolution, combined with the inspiration provided by HGNN [2], hypergraph computation from the perspective of spectral convolution can be conducted.

Given hyperbolic curvatures $-1/k\_{\ell-1}$ and $-1/k\_{\ell}$ at layers $\ell-1$ and $\ell$, respectively, the hyperbolic hypergraph convolution of the hypergraph input signal $\mathbf{x}^{\mathbb{P}}$ with filter $\mathbf{g}$ can be defined as

$$\begin{split} \mathbf{x}^{\mathbb{P}} \ast \mathbf{g} &= \exp\_{\mathbf{x}}^{k\_{\ell}} \left( \Phi \left( \left( \Phi^{\top} \left( \log\_{\mathbf{x}}^{k\_{\ell-1}} \left( \mathbf{x}^{\mathbb{P}} \right) \right) \right) \odot \left( \Phi^{\top} \mathbf{g} \right) \right) \right) \\ &= \exp\_{\mathbf{x}}^{k\_{\ell}} \left( \Phi \mathbf{g}(\Lambda) \Phi^{\top} \left( \log\_{\mathbf{x}}^{k\_{\ell-1}} \left( \mathbf{x}^{\mathbb{P}} \right) \right) \right), \end{split} \tag{7.14}$$

where $\odot$ is the element-wise product, $g(\Lambda) = \text{diag}(\theta)$, and $\theta = [\theta\_1, \cdots, \theta\_n]$ are the parameters to be learned. By leveraging the exp and log maps, the tangent space $\mathcal{T}\_{\mathbf{0}}\mathbb{P}^{d,k}$ can be used to perform Euclidean transformations. The convolution operates in the tangent space of each center point $\mathbf{x}^{\mathbb{P}}$ because the Euclidean approximation is best there [19].

Considering the high computational complexity of the Fourier transform and inverse Fourier transform, this convolution method is very expensive to calculate. Convolutions can be computed more efficiently by truncating Chebyshev polynomials as in [2]. It can be simply expressed as

$$\mathbf{x}^{\mathbb{P}} \ast \mathbf{g} \approx \exp\_{\mathbf{x}}^{k\_{\ell}} \left( \theta \mathbf{D}\_{v}^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}\_{v}^{-1/2} \left( \log\_{\mathbf{x}}^{k\_{\ell-1}} \left( \mathbf{x}^{\mathbb{P}} \right) \right) \right), \tag{7.15}$$

where **W** is the initial weight of hyperedges. The above equation uses the hypergraph Laplacian matrix to calculate the total gain obtained after a small perturbation of a point. For a hypergraph with *n* vertices, the convolution layer can be formulated as follows:

$$\mathbf{X}^{\ell} = \exp\_{\mathbf{x}^{\ell, \mathbb{E}}}^{k\_{\ell}} \left( \sigma \left( \mathbf{A} \left( \log\_{\mathbf{x}^{\ell-1, \mathbb{P}}}^{k\_{\ell-1}} \left( \mathbf{X}^{\ell-1, \mathbb{P}} \right) \right) \Theta \right) \right), \tag{7.16}$$

where $\Theta \in \mathbb{R}^{c^{(\ell-1)} \times c^{(\ell)}}$ is the parameter to be learned during the training process, which is applied over the vertices in the hypergraph to extract features. $c$ indicates the size of the embedding dimension, $\sigma$ denotes the nonlinear activation function, and $\mathbf{A} = \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^{\top} \mathbf{D}\_v^{-1/2}$.

The hyperbolic operation is accomplished by conducting feature mapping between the Euclidean space and the hyperbolic space. The framework of the above spectral-based hyperbolic hypergraph convolution is shown in Fig. 7.4.

**Fig. 7.4** The framework of the spectral-based hyperbolic hypergraph convolution method

#### **7.3 Spatial-Based Neural Networks on Hypergraph**

To present the spatial-based neural networks on hypergraph, we first briefly review the definition of spatial-based graph convolution, taking image processing as an example. The pixels in an image can be represented as vertices in a grid graph, where each vertex connects only to its neighbor vertices in the spatially close region where it is located. A $C$-channel feature can accordingly be generated for each vertex (pixel) in the image. The process of filtering an image can be viewed as an average aggregation, at a central vertex, of its neighbors' features after they have been transformed. Similar to convolutional neural networks in image processing, spatial-based graph convolution combines the neighbors of the central vertex to produce a new representation. Spatial-based graph convolution runs from neighbor vertices to center vertices, which is similar to the definition of a path in a simple graph. A path in a graph is defined as $P(v\_1, v\_k) = (v\_1, v\_2, \ldots, v\_k)$, in which consecutive vertices are adjacent to each other; that is, every vertex pair $v\_i$ and $v\_{i+1}$ ($1 \le i \le k-1$) has the neighbor relation.

Similar to the spatial-based graph convolution, spatial-based hypergraph neural networks also consider the neighbor information when learning representation. In the following, we introduce two typical spatial-based hypergraph neural networks, including general hypergraph neural networks (HGNN+) [16] and dynamic hypergraph neural networks (DHGNN) [15].

#### *7.3.1 General Hypergraph Neural Networks*

In this part, the general framework [16] for representation learning using hypergraph neural networks on given raw data is introduced. Figure 7.5 demonstrates the framework of general hypergraph neural networks, which consists of two procedures, i.e., hypergraph modeling and hypergraph convolution. In the hypergraph modeling step, the data are used to generate the high-order correlations, which are represented as a hypergraph. Similar to previous tasks, hyperedge groups can be generated from pairwise edges, *k*-Hop, and neighbors in the feature space. As a result of this procedure, all types of hyperedge groups (if they are available) are generated and concatenated into a hypergraph for data correlation modeling. In the hypergraph convolution step, a set of hypergraph convolutions, i.e., the spectral-based convolution and the spatial-based hypergraph convolution, is applied to the given hypergraph for representation learning. With this information, the convolution procedures can generate more accurate representations of multi-modal data and high-order correlations.

#### **(1) Hypergraph Modeling**

The first step is to construct a flexible hypergraph from the raw data if no hypergraph structure is available, so that the data correlations can be modeled with a hypergraph. The ability to generate a suitable hypergraph structure is critical for exploiting the high-order correlations among the data. In most cases, the hypergraph structure is not explicit, and different strategies are needed to generate it. Hypergraph generation from scratch usually involves a combination of three scenarios, namely, data with graph structure, data without graph structure, and data with multi-modal/multi-type representations. Three hyperedge generation strategies are introduced here, which employ pairwise edges, *k*-Hop neighbors, and neighbors in the feature space, respectively. The pairwise-edge and *k*-Hop strategies generate hyperedge groups from data with a graph structure, while the feature-space-neighbor strategy generates hyperedge groups from data without a graph structure. Finally, all the hyperedge groups are concatenated to generate the overall hypergraph.

The above strategies can be used to generate a number of hyperedge groups. A final hypergraph is then generated by combining the generated or natural hyperedge groups. Suppose there are *K* hyperedge groups $\{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_K\}$ with corresponding incidence matrices $\mathbf{H}_k \in \{0, 1\}^{N \times M_k}$, $k = 1, \ldots, K$. For the hypergraph $\mathcal{G}$, the simplest way to construct the incidence matrix is to directly concatenate all the hyperedge groups as $\mathbf{H} = \mathbf{H}_1 \| \mathbf{H}_2 \| \cdots \| \mathbf{H}_K$, where $\cdot \| \cdot$ is the matrix concatenation operation. The hyperedge weights can all be set to 1 so that the hyperedges are treated equally. This simplest fusion strategy is called coequal fusion.
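Coequal fusion can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function name `coequal_fusion` and the two toy incidence matrices are assumptions made for illustration and do not come from [16].

```python
import numpy as np

def coequal_fusion(groups):
    """Concatenate incidence matrices H_k (N x M_k) and weight every hyperedge by 1."""
    n_vertices = groups[0].shape[0]
    if any(h.shape[0] != n_vertices for h in groups):
        raise ValueError("all hyperedge groups must share the same vertex set")
    H = np.concatenate(groups, axis=1)   # H = H_1 || H_2 || ... || H_K
    W = np.eye(H.shape[1])               # coequal fusion: every hyperedge weight is 1
    return H, W

# Toy data (hypothetical): N = 4 vertices and two hyperedge groups.
H_pair = np.array([[1, 0],
                   [1, 1],
                   [0, 1],
                   [0, 0]])              # two hyperedges built from pairwise edges
H_feat = np.array([[1],
                   [0],
                   [1],
                   [1]])                 # one hyperedge built from feature-space neighbors

H, W = coequal_fusion([H_pair, H_feat])
print(H.shape, W.shape)
```

The columns of `H` keep the order of the groups, so the hyperedges of group *k* occupy a contiguous block of columns.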

It is noted that other combination strategies can be also used according to different application scenarios. As the multi-modal hybrid high-order correlations cannot be fully exploited by a simple coequal fusion, due to differences in information richness between hyperedge groups, an adaptive strategy for the fusion of hyperedge groups, namely Adaptive Fusion, was introduced in [16]. Specifically, each hyperedge group is associated with a trainable parameter that can be used to adjust the effect of multiple hyperedge groups on the final vertex embedding in an adaptive manner, which can be defined as

$$\begin{cases} \mathbf{w}\_k = \text{copy}(\text{sigmoid}(w\_k), M\_k) \\ \mathbf{W} = \text{diag}(\mathbf{w}\_1^1, \dots, \mathbf{w}\_1^{M\_1}, \dots, \mathbf{w}\_K^1, \dots, \mathbf{w}\_K^{M\_K}) \\ \mathbf{H} \quad = \mathbf{H}\_1 || \mathbf{H}\_2 || \cdots || \mathbf{H}\_K \end{cases} \tag{7.17}$$

where $w_k \in \mathbb{R}$ is a trainable parameter shared by all hyperedges inside hyperedge group *k*, and sigmoid(·) is an element-wise normalization function. $\mathbf{w}_k = (\mathbf{w}_k^1, \ldots, \mathbf{w}_k^{M_k}) \in \mathbb{R}^{M_k}$ denotes the generated weight vector for hyperedge group *k*. The function copy(*a*, *b*) returns a vector of size *b* whose entries are all equal to *a*. Let $M = M_1 + M_2 + \cdots + M_K$ denote the total number of hyperedges in all hyperedge groups. $\mathbf{W} \in \mathbb{R}^{M \times M}$ is the diagonal weight matrix of the hypergraph, and each entry $\mathbf{W}_{ii}$ denotes the weight of the corresponding hyperedge $e_i$. By concatenating (·||·) the incidence matrices of the hyperedge groups, $\mathbf{H} \in \{0, 1\}^{N \times M}$ denotes the incidence matrix of the generated hypergraph.
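The construction in Eq. (7.17) can be sketched as follows. The raw parameters are fixed numbers here, whereas in a real model they would be updated by gradient descent; the names `adaptive_fusion` and `raw_weights` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_fusion(groups, raw_weights):
    """Build H and the diagonal W of Eq. (7.17) from K groups and K scalars w_k."""
    if len(groups) != len(raw_weights):
        raise ValueError("one raw weight is needed per hyperedge group")
    H = np.concatenate(groups, axis=1)                     # H = H_1 || ... || H_K
    # copy(sigmoid(w_k), M_k): repeat the normalized group weight once per hyperedge
    w = np.concatenate([np.full(h.shape[1], sigmoid(w_k))
                        for h, w_k in zip(groups, raw_weights)])
    return H, np.diag(w)

# Toy data (hypothetical): group 1 has M_1 = 2 hyperedges, group 2 has M_2 = 3.
H_1 = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
H_2 = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]])
H, W = adaptive_fusion([H_1, H_2], raw_weights=[0.0, 2.0])
print(np.diag(W))
```

All hyperedges of one group share a single weight, so only *K* scalars are learned regardless of the number of hyperedges *M*.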

Multi-modal/multi-type data can be analyzed to generate multiple hyperedge groups. From the constructed hyperedge groups, the hypergraph incidence matrix **H** and the hyperedge weight matrix **W** are generated, which are then fed into the hypergraph convolution layer for further processing.

#### **(2) Hypergraph Convolution**

Following **Definitions 1, 2, 3**, one spatial hypergraph convolution layer aggregates neighbor vertex messages via hyperpaths. Given a vertex $\alpha \in \mathcal{V}$ of hypergraph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$, the aim is to aggregate messages from its hyperedge inter-neighbor set $\mathcal{N}_e(\alpha)$. To obtain the message of each hyperedge $\beta$ in $\mathcal{N}_e(\alpha)$, messages are first aggregated from its vertex inter-neighbor set $\mathcal{N}_v(\beta)$. These two steps of hypergraph convolution form a closed loop from the vertex feature set $\mathbf{X}^t$ to $\mathbf{X}^{t+1}$. A general spatial hypergraph convolution in the *t*-th layer can be defined as

$$\begin{cases} m^{t}_{\beta} = \sum_{\alpha \in \mathcal{N}_{v}(\beta)} M^{t}_{v}(\mathbf{x}^{t}_{\alpha}) \\ \mathbf{y}^{t}_{\beta} = U^{t}_{e}(w_{\beta}, m^{t}_{\beta}) \\ m^{t+1}_{\alpha} = \sum_{\beta \in \mathcal{N}_{e}(\alpha)} M^{t}_{e}(\mathbf{x}^{t}_{\alpha}, \mathbf{y}^{t}_{\beta}) \\ \mathbf{x}^{t+1}_{\alpha} = U^{t}_{v}(\mathbf{x}^{t}_{\alpha}, m^{t+1}_{\alpha}) \end{cases},\tag{7.18}$$

where $\mathbf{x}_\alpha^t \in \mathbf{X}^t$ denotes the input feature vector of vertex $\alpha \in \mathcal{V}$ in layer $t = 1, 2, \ldots, T$, and $\mathbf{x}_\alpha^{t+1}$ denotes the updated feature of vertex $\alpha$. $m_\beta^t$ denotes the message of hyperedge $\beta \in \mathcal{E}$, and $w_\beta$ denotes the weight associated with hyperedge $\beta$. $m_\alpha^{t+1}$ denotes the message of vertex $\alpha$. $\mathbf{y}_\beta^t$ denotes the feature of hyperedge $\beta$, which is an element of the hyperedge feature set $\mathbf{Y}^t = \{\mathbf{y}_1^t, \mathbf{y}_2^t, \ldots, \mathbf{y}_M^t\}$, $\mathbf{y}_i^t \in \mathbb{R}^{C^t}$, in layer *t*. $M_v^t(\cdot)$, $U_e^t(\cdot)$, $M_e^t(\cdot)$, and $U_v^t(\cdot)$ are the vertex message function, hyperedge update function, hyperedge message function, and vertex update function in the *t*-th layer, respectively, which can be defined for specific applications.

Built on the high-order relationships in the hypergraph structure, the spatial hypergraph convolution layer is designed for high-level representation learning. In comparison with graph convolution, which consists of a single stage of message passing, the spatial hypergraph convolution is composed of four flexible operations with learned differentiable functions. As with neighbor relations in a graph, there is no natural ordering among the inter-neighbors of vertices and hyperedges. Therefore, a summation is used to aggregate the vertex and hyperedge messages produced by $M_v^t(\cdot)$ and $M_e^t(\cdot)$.

A simple spatial hypergraph convolution layer (named HGNNConv+) is obtained by specifying the message and update functions (vertex message function $M_v^t(\cdot)$, hyperedge update function $U_e^t(\cdot)$, hyperedge message function $M_e^t(\cdot)$, and vertex update function $U_v^t(\cdot)$) as

$$\begin{cases} M_v^t(\mathbf{x}_\alpha^t) &= \frac{\mathbf{x}_\alpha^t}{|\mathcal{N}_v(\beta)|} \\ U_e^t(w_\beta, m_\beta^t) &= w_\beta \cdot m_\beta^t \\ M_e^t(\mathbf{x}_\alpha^t, \mathbf{y}_\beta^t) &= \frac{\mathbf{y}_\beta^t}{|\mathcal{N}_e(\alpha)|} \\ U_v^t(\mathbf{x}_\alpha^t, m_\alpha^{t+1}) &= \sigma(m_\alpha^{t+1} \cdot \boldsymbol{\Theta}^t) \end{cases},\tag{7.19}$$

where $\boldsymbol{\Theta}^t \in \mathbb{R}^{C^t \times C^{t+1}}$ denotes a trainable parameter of layer *t* that is learned in the training phase, and $\sigma(\cdot)$ denotes an arbitrary nonlinear activation function such as ReLU(·). Note that in Eq. (7.19), $\mathbf{x}_\alpha^t / |\mathcal{N}_v(\beta)|$ and $\mathbf{y}_\beta^t / |\mathcal{N}_e(\alpha)|$ denote the normalized vertex and hyperedge features; this normalization is intended to speed up convergence and reduce jittering during training.
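To make the four operations concrete, the following sketch evaluates one HGNNConv+ layer as explicit message passing over $\mathcal{N}_v(\beta)$ and $\mathcal{N}_e(\alpha)$. It assumes ReLU as the activation, that every hyperedge is non-empty, and that every vertex belongs to at least one hyperedge; the function name and the toy data are hypothetical.

```python
import numpy as np

def hgnnconv_plus_messages(X, H, w, Theta):
    """One HGNNConv+ layer written vertex by vertex and hyperedge by hyperedge."""
    N, M = H.shape
    Y = np.zeros((M, X.shape[1]))
    for beta in range(M):
        members = np.flatnonzero(H[:, beta])             # N_v(beta)
        m_beta = X[members].sum(axis=0) / len(members)   # sum of x / |N_v(beta)|
        Y[beta] = w[beta] * m_beta                       # hyperedge update U_e
    X_next = np.zeros((N, Theta.shape[1]))
    for alpha in range(N):
        edges = np.flatnonzero(H[alpha])                 # N_e(alpha)
        m_alpha = Y[edges].sum(axis=0) / len(edges)      # sum of y / |N_e(alpha)|
        X_next[alpha] = np.maximum(m_alpha @ Theta, 0.0) # vertex update U_v with ReLU
    return X_next

# Toy hypergraph (hypothetical): e_0 = {v_0, v_1}, e_1 = {v_1, v_2, v_3}.
H = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
X_next = hgnnconv_plus_messages(X, H, w=np.ones(2), Theta=np.eye(2))
print(X_next)
```

With unit weights and an identity $\boldsymbol{\Theta}$, each hyperedge feature is the mean of its vertices, and each new vertex feature is the mean of its hyperedge features.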

For faster forward propagation of HGNNConv+ on GPU/CPU devices, we rewrite it in matrix form. Let $\mathbf{X}^t$ be the input vertex feature set of layer *t*. From **Definitions 1, 2**, $\mathbf{H}^\top \in \{0, 1\}^{M \times N}$ encodes the hyperedge inter-neighbors of each vertex feature in $\mathbf{X}^t$. Hence, it can be used to guide each vertex to aggregate and generate the hyperedge feature set $\mathbf{Y}^t$, which can be formulated as $\mathbf{Y}^t = \mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{X}^t$. In a similar way, updating the vertex feature set $\mathbf{X}^{t+1}$ from the hyperedge feature set $\mathbf{Y}^t$ can be formulated as $\mathbf{X}^{t+1} = \sigma(\mathbf{D}_v^{-1}\mathbf{H}\mathbf{Y}^t\boldsymbol{\Theta}^t)$. Thus, the matrix form of HGNNConv+ can be written as

$$\mathbf{X}^{t+1} = \sigma(\mathbf{D}\_v^{-1} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{X}^t \boldsymbol{\Theta}^t). \tag{7.20}$$

Similar to HGNN, **X***t*+<sup>1</sup> can be obtained after convolution, which can be used for further learning. As an extension of HGNN [2], this method employs a broad multi-modal/multi-type data correlation model to learn an adaptive weight for each modality/type representation using a single hypergraph model.
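The matrix form in Eq. (7.20) maps directly onto array operations. The sketch below assumes ReLU as $\sigma$ and, to match the normalization by $|\mathcal{N}_e(\alpha)|$ in Eq. (7.19), takes $\mathbf{D}_v$ as the number of hyperedges containing each vertex; both choices and all names are assumptions for illustration.

```python
import numpy as np

def hgnnconv_plus(X, H, w, Theta):
    """X^{t+1} = ReLU(D_v^{-1} H W D_e^{-1} H^T X^t Theta^t), using diagonal scalings."""
    d_v = H.sum(axis=1)                    # hyperedges per vertex
    d_e = H.sum(axis=0)                    # vertices per hyperedge
    Y = (w / d_e)[:, None] * (H.T @ X)     # Y^t = W D_e^{-1} H^T X^t
    Z = (H @ Y) / d_v[:, None]             # D_v^{-1} H Y^t
    return np.maximum(Z @ Theta, 0.0)

# Toy hypergraph (hypothetical): e_0 = {v_0, v_1}, e_1 = {v_1, v_2, v_3}.
H = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
X_next = hgnnconv_plus(X, H, w=np.ones(2), Theta=np.eye(2))
print(X_next)
```

Because $\mathbf{W}$, $\mathbf{D}_e$, and $\mathbf{D}_v$ are diagonal, they are applied as row scalings rather than stored as dense matrices.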

#### *7.3.2 Dynamic Hypergraph Neural Networks*

Dynamic hypergraph neural networks (DHGNN) [15] are neural networks that model dynamically evolving hypergraph structures. They are composed of stacked layers, each containing two modules: dynamic hypergraph construction and hypergraph convolution. The dynamic hypergraph construction module updates the hypergraph structure on each layer, because the initially constructed hypergraph may not be an appropriate representation of the data. After that, hypergraph convolution encodes the high-order correlations between data points within the hypergraph. The hypergraph convolution module has two phases, vertex convolution and hyperedge convolution, which aggregate features among vertices and among hyperedges, respectively.

#### **(1) Dynamic Hypergraph Construction**

The symbol Con(*e*) denotes the set of vertices that a hyperedge *e* contains, and the symbol Adj(*v*) denotes the set of all hyperedges that contain the vertex *v*:

$$\text{Con}(e) = \left\{v_1, v_2, \dots, v_{k_e}\right\},\tag{7.21}$$

$$\text{Adj}(v) = \left\{e_1, e_2, \dots, e_{k_v}\right\},\tag{7.22}$$

where $k_e$ and $k_v$ are the number of vertices in hyperedge *e* and the number of hyperedges containing vertex *v*, respectively. *v* is defined as the centroid vertex of the hyperedge set Adj(*v*). Traditional *k*-NN and *k*-means clustering methods can be combined for dynamic hypergraph construction to exploit local and global structures. On the one hand, the *k* − 1 nearest neighbors of each vertex *v* are computed; these neighboring vertices, along with *v* itself, form a hyperedge in Adj(*v*). On the other hand, the *k*-means algorithm is run on the whole feature map of each layer according to the Euclidean distance, and for each vertex, the nearest *S* − 1 clusters are assigned as adjacent hyperedges of this vertex. In the hyperedge convolution defined in Eqs. (7.25) and (7.26), |Adj(*v*)| denotes the size of the adjacent hyperedge set, $\mathbf{x}_e$ denotes the adjacent hyperedge features, $\mathbf{x}_v$ denotes the centroid vertex feature, and **W** and **b** are learnable parameters.

This procedure is performed on the feature embedding of each layer. In particular, the hypergraph structure is initialized from the input feature embedding. The hyperedge set is therefore dynamically adjusted as the feature embedding evolves with increasing network depth. In this way, better hypergraph structures can be obtained for high-order data correlation modeling with deep neural networks.
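A minimal NumPy sketch of this construction is given below. It is not the DHGNN implementation: the plain Lloyd iteration, the initialization from random data points, and all function names are assumptions for illustration, and the points are assumed to be distinct so that each vertex is its own nearest neighbor.

```python
import numpy as np

def pairwise_sq_dist(A, B):
    """Squared Euclidean distances between the rows of A and the rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)

def knn_hyperedges(X, k):
    """Row i lists the vertices of one hyperedge: vertex i plus its k-1 nearest neighbors."""
    return np.argsort(pairwise_sq_dist(X, X), axis=1)[:, :k]

def kmeans(X, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd iterations; each cluster is treated as one hyperedge."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = pairwise_sq_dist(X, centers).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

def nearest_clusters(X, centers, s):
    """For each vertex, the indices of its s nearest clusters (adjacent cluster hyperedges)."""
    return np.argsort(pairwise_sq_dist(X, centers), axis=1)[:, :s]

# Toy feature embedding (hypothetical): two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
knn = knn_hyperedges(X, k=3)
centers, labels = kmeans(X, n_clusters=2)
adjacent = nearest_clusters(X, centers, s=1)   # s plays the role of S - 1 in the text
print(knn[0], labels, adjacent[:, 0])
```

Re-running these functions on the feature embedding produced by each layer yields the layer-wise hypergraph update described above.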

#### **(2) Dynamic Hypergraph Convolution**

Hypergraph convolution is composed of two sub-modules: the vertex convolution sub-module and the hyperedge convolution sub-module. Vertex convolution aggregates vertex features to the hyperedge, and hyperedge convolution then aggregates adjacent hyperedge features to the center vertex.

Several pooling methods can be used for vertex aggregation, including maximum pooling and average pooling. In state-of-the-art algorithms, vertex aggregation involves a fixed, pre-computed transform matrix generated from the graph or hypergraph structure. Nevertheless, such methods cannot effectively model the discriminative information among vertex features. Here, the transform matrix **T** is instead learned from the vertex features and used for feature permutation and weighting, so that information can flow within and between channels. A multi-layer perceptron (MLP) is used to obtain the transform matrix **T**, and the transformed features are compressed by a convolution as follows:

$$\mathbf{T} = \text{MLP}\left(\mathbf{X}\_v\right) \tag{7.23}$$

and

$$\mathbf{x}_{e} = \text{conv}\left(\mathbf{T} \cdot \text{MLP}\left(\mathbf{X}_{v}\right)\right). \tag{7.24}$$
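The shapes involved in Eqs. (7.23) and (7.24) can be traced with a small sketch for a single hyperedge with *k* vertices. Each MLP is reduced to one tanh layer and the convolution to a length-*k* kernel that spans all vertices; these simplifications, the random weights, and the function name are assumptions rather than the configuration used in [15].

```python
import numpy as np

def vertex_convolution(X_v, W_t, W_f, kernel):
    """Compress the (k, d) vertex features of one hyperedge into one hyperedge feature.

    X_v: (k, d) vertex features, W_t: (d, k), W_f: (d, d_out), kernel: (k,)
    """
    T = np.tanh(X_v @ W_t)     # Eq. (7.23): learned (k, k) transform matrix
    F = np.tanh(X_v @ W_f)     # MLP applied to the vertex features, (k, d_out)
    mixed = T @ F              # permute and weight the k vertices
    return kernel @ mixed      # 1-D convolution over the k vertices -> (d_out,)

rng = np.random.default_rng(0)
k, d, d_out = 4, 3, 5
X_v = rng.normal(size=(k, d))
x_e = vertex_convolution(X_v,
                         W_t=rng.normal(size=(d, k)),
                         W_f=rng.normal(size=(d, d_out)),
                         kernel=rng.normal(size=k))
print(x_e.shape)  # (5,)
```

Because **T** is computed from the features themselves, the mixing of vertices differs from hyperedge to hyperedge, unlike a fixed pooling operator.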

#### **(3) Hyperedge Convolution**

Hyperedge convolution follows the spatial convolution strategy and aggregates hyperedge features to the center vertex feature. It employs a multi-layer perceptron to generate a weight score for each hyperedge, and the center vertex feature is then computed as the weighted sum of the input hyperedge features. This procedure can be formulated as follows:

$$w = \text{softmax}\left(\mathbf{x}\_e \mathbf{W} + \mathbf{b}\right) \tag{7.25}$$

and

$$\mathbf{x}_{v} = \sum_{i=0}^{|\text{Adj}(v)|} w^{i} \mathbf{x}_{e}^{i}. \tag{7.26}$$
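Equations (7.25) and (7.26) amount to attention pooling over the adjacent hyperedges. In the sketch below, **W** is taken to be a vector of length *d* and **b** a scalar so that each hyperedge receives one score; these shapes and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # subtract the maximum for numerical stability
    return e / e.sum()

def hyperedge_convolution(X_e, W, b):
    """Weighted sum of adjacent hyperedge features, Eqs. (7.25)-(7.26).

    X_e: (|Adj(v)|, d) hyperedge features, W: (d,), b: scalar
    """
    w = softmax(X_e @ W + b)   # one attention weight per adjacent hyperedge
    return w @ X_e, w

# Toy data (hypothetical): three adjacent hyperedges with 2-dimensional features.
X_e = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_v, w = hyperedge_convolution(X_e, W=np.zeros(2), b=0.0)
print(w, x_v)
```

With zero parameters the scores are equal, so the weights are uniform and the result reduces to average pooling; trained parameters shift the weights toward the more informative hyperedges.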

In existing deep learning techniques, the graph/hypergraph structure is taken as prior knowledge for training the model. There are, however, a number of hidden and important relationships that are not directly represented in the inherent structure. In DHGNN, a transform matrix is employed in vertex convolution to permute and weight the vertices within hyperedges, and an attention mechanism is employed in hyperedge convolution to aggregate adjacent hyperedge features. Figure 7.6 shows the architecture of DHGNN. The first part of the figure illustrates hyperedge construction; as an example, two hyperedges are generated from two clusters (dashed ellipses). In the second part, the vertices within a hyperedge are aggregated into a hyperedge feature through vertex convolution, and the adjacent hyperedge features are aggregated into a center vertex feature through hyperedge convolution. In the third part, after these operations have been performed on all vertices in the feature embedding of the current layer, the feature embedding of the new layer and the new hypergraph structure can be constructed.

#### **7.4 Comparison Between Graph and Hypergraph Neural Networks**

After the introduction to spectral-based and spatial-based hypergraph neural network methods, we have a basic understanding of how these methods are implemented. In this section, we compare hypergraph neural networks with simple graph neural networks from the spectral and spatial perspectives to discover the connections and differences between them. The most typical methods of the two families are chosen for this comparison. Among the convolution operators designed to operate on graphs, such as [6, 18, 24, 25], GNN [18] is the classical one. HGNN [2] and HGNN+ [16] are compared with GNN [18] from the spectral perspective and the spatial perspective, respectively. The comparison also highlights how the hypergraph extends the learning domain of the graph.

**Fig. 7.6** The DHGNN framework. This figure is from [15]

#### *7.4.1 Spectral Perspective*

It can be proved that the GNN can be mathematically viewed as a special case of HGNN. Assume that every hyperedge connects only two vertices and that all hyperedges have the same weight. Such a simple hypergraph (2-uniform hypergraph) can also be expressed as a graph with adjacency matrix **A** and vertex degree matrix **D**, a construction similar to the pairwise-edge hyperedge group $\mathcal{E}_{\text{pair}}$. As a hypergraph, it is described by the incidence matrix **H**, the vertex degree matrix $\mathbf{D}_v$, the hyperedge degree matrix $\mathbf{D}_e$, and the hyperedge weight matrix **W**. Under these assumptions, the following relations hold for the simple hypergraph:

$$\begin{cases} \mathbf{H} \mathbf{H}^{\top} = \mathbf{A} + \mathbf{D} \\\\ \mathbf{D}\_e^{-1} = \frac{1}{2} \mathbf{I} \\\\ \mathbf{W} \quad = \mathbf{I} \end{cases} \quad . \tag{7.27}$$

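Substituting these relations into the spectral hypergraph convolution, and noting that $\mathbf{D}_v = \mathbf{D}$ in this case, the convolution reduces as follows: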

$$\begin{split} \mathbf{X}^{t+1} &= \sigma(\mathbf{D}_{v}^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_{e}^{-1} \mathbf{H}^{\top} \mathbf{D}_{v}^{-1/2} \mathbf{X}^{t} \boldsymbol{\Theta}^{t}) \\ &= \sigma(\mathbf{D}_{v}^{-1/2} \mathbf{H} (\tfrac{1}{2} \mathbf{I}) \mathbf{H}^{\top} \mathbf{D}_{v}^{-1/2} \mathbf{X}^{t} \boldsymbol{\Theta}^{t}) \\ &= \sigma(\tfrac{1}{2} \mathbf{D}^{-1/2} (\mathbf{A} + \mathbf{D}) \mathbf{D}^{-1/2} \mathbf{X}^{t} \boldsymbol{\Theta}^{t}) \\ &= \sigma(\tfrac{1}{2} (\mathbf{I} + \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}) \mathbf{X}^{t} \boldsymbol{\Theta}^{t}) \\ &= \sigma(\hat{\mathbf{A}} \mathbf{X}^{t} \hat{\boldsymbol{\Theta}}^{t}) \end{split} \tag{7.28}$$

where $\hat{\mathbf{A}} = \mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ and $\hat{\boldsymbol{\Theta}}^t = \frac{1}{2}\boldsymbol{\Theta}^t$; that is, the extra factor $\frac{1}{2}$ can be absorbed by the learnable parameter $\boldsymbol{\Theta}$. Hence, when modeling a simple graph, the spectral-based hypergraph convolution in HGNN [2] takes the same form as the graph convolution in GCN [18]. Owing to its expressive capability, the hypergraph convolution not only models and learns the high-order correlations in a hypergraph but can also handle simple graphs.
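The reduction can also be checked numerically on a small graph. The sketch below builds the incidence matrix of a 2-uniform hypergraph from a hypothetical edge list and compares the HGNN propagation operator with $\frac{1}{2}(\mathbf{I} + \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2})$.

```python
import numpy as np

# Hypothetical simple graph: a triangle (0, 1, 2) with a pendant vertex 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
N = 4

H = np.zeros((N, len(edges)))       # each column is a hyperedge with exactly two vertices
A = np.zeros((N, N))
for j, (u, v) in enumerate(edges):
    H[u, j] = H[v, j] = 1.0
    A[u, v] = A[v, u] = 1.0

D = np.diag(A.sum(axis=1))          # graph vertex degree matrix
W = np.eye(len(edges))              # equal hyperedge weights
De_inv = np.diag(1.0 / H.sum(axis=0))                       # equals (1/2) I
Dv_inv_sqrt = np.diag(1.0 / np.sqrt((H @ W).sum(axis=1)))   # D_v = D here

hgnn_op = Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
graph_op = 0.5 * (np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt)

print(np.allclose(H @ H.T, A + D), np.allclose(hgnn_op, graph_op))
```

Both checks should print `True`, confirming the relations in Eq. (7.27) and the chain of equalities in Eq. (7.28) for this example.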

#### *7.4.2 Spatial Perspective*

A powerful GNN model can be viewed as learning to embed rooted subtrees into a low-dimensional space [26]. A rooted subtree [27] describes not only the connections of local vertices but also the message passing paths in a graph. The rooted subtree can therefore be used to compare HGNN+ [16] with GNN [18]. In a hypergraph, a node in the rooted subtree can be either a vertex or a hyperedge in order to satisfy the path definition (also known as the message passing path).

It is more straightforward to compare structures that are isomorphic. Therefore, the 2-uniform hypergraph (in which each hyperedge connects only two vertices) is used for the comparison. Figure 7.7 displays the rooted subtrees of HGNN+ [16] and GNN [18] for a specified vertex, which can also be read as the message path in the graph and the hyperpath in the hypergraph. In graph convolution, the vertex features of the neighbors are taken into account and then aggregated to update the central vertex feature. In contrast, HGNN+ [16] performs a two-stage transformation, i.e., vertex–hyperedge–vertex; this hierarchical structure provides more powerful expression and modeling capabilities. As formulated in Eq. (7.18), the first stage generates each hyperedge feature from the vertex inter-neighbor set of that hyperedge. In the second stage, the features of the hyperedge inter-neighbors are aggregated to obtain the updated vertex features. Additionally, multi-layer hypergraph convolution involves many more message interactions than graph convolution. The rooted vertex appears more frequently in the HGNN+ [16] subtree paths (like a latent extra self-loop), which helps account for its better performance. In comparison with graph convolution, hypergraph convolution can efficiently extract low- and high-order correlations on the hypergraph via the vertex–hyperedge–vertex transformation.

#### **7.5 Summary**

In this chapter, we introduce two types of hypergraph neural networks: spectral-based and spatial-based methods. In spectral-based methods, vertex features are transformed between the vertex domain and the spectral domain by means of the hypergraph Laplacian matrix. In spatial-based methods, each vertex is updated by aggregating information from its neighbors in the spatial domain. We also note that most learning methods in graph learning are still simple graph neural networks.

Finally, we compare hypergraph neural networks and graph neural networks from the spectral and spatial perspectives. According to the comparison of the convolution coefficients, the hypergraph convolution not only has an expressive ability comparable to that of GCN when handling a simple graph but is also capable of modeling and learning high-order correlations within the hypergraph. From the spatial comparison, we find that the hypergraph convolution layer can efficiently extract both low-order and high-order correlations on the hypergraph using the vertex–hyperedge–vertex transformation.

**Fig. 7.7** Comparison of rooted subtree of graph and 2-uniform hypergraph. This figure is from [16]

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 8 Large Scale Hypergraph Computation**

**Abstract** As introduced in the previous chapters, the complexity of hypergraph computation is relatively high. In practical applications, hypergraphs are often not small, and their size can be very large. Hypergraph computation therefore confronts complexity issues in many applications, and handling large scale data is an important task. In this chapter, we discuss the computation methods for large scale hypergraphs and their applications. Two types of hypergraph computation methods are provided to handle large scale data, namely the factorization-based hypergraph reduction method and the hierarchical hypergraph learning method. In the factorization-based hypergraph reduction method, the large scale hypergraph incidence matrix is reduced to two low-dimensional matrices, and the computing procedures are conducted with the reduced matrices. This method can support hypergraph computation with more than 10,000 vertices and hyperedges. The hierarchical hypergraph learning method, on the other hand, splits all samples into several sub-hypergraphs and merges the results obtained from each sub-hypergraph computation. This method can support hypergraph computation with millions of vertices and hyperedges.

#### **8.1 Introduction**

Hypergraph computation has been used in many areas such as image analysis [1–3] and recommendation [4–6]. In practical applications, the hypergraph is often not small and can be very large in many cases, where hypergraph computation confronts complexity issues [7–13]. For instance, in medical image analysis, hypergraphs can be used to model the relationships among patches within an image or across different images. Here we take gigapixel whole-slide histopathological images (WSIs) as an example. The large number of pixels in WSIs poses a great challenge for medical image analysis. If we generate a hypergraph over the pixels of WSIs, the number of vertices tends to be at the billion level. Even if we sample patches from WSIs, this number can still be in the tens of thousands or even at the million level. Conventional hypergraph modeling methods are highly unlikely to be able to analyze data at such a scale. Another example is the recommender system. In recommender systems, graphs and hypergraphs have been widely used because of their superior structural modeling capabilities. Meanwhile, the number of users and items on the Internet or in recommender systems can range from millions to billions and keeps increasing. Consequently, recommender systems are one of the typical playgrounds for large scale hypergraph applications. The large scale problem of hypergraphs is also encountered in many other areas, such as social network analysis, protein relation prediction, and so on.

Under such circumstances, hypergraph computation confronts the large scale issue, as modeling and computing on hypergraphs generally have high complexity. To help solve the large scale problem, we introduce two types of hypergraph computation methods for large scale data in this chapter, namely the factorization-based hypergraph reduction method and the hierarchical hypergraph computation method. We also introduce their applications in medical image analysis and recommender systems, respectively. The factorization-based hypergraph reduction method reduces the large scale hypergraph incidence matrix **H** to two low-dimensional matrices, which lowers the complexity. This method can support hypergraph computation with tens of thousands of vertices. The other method, i.e., hierarchical hypergraph computation, splits the vertices into several subsets and computes on each sub-hypergraph separately. The results from these sub-hypergraphs are then combined following a hierarchical strategy. This method can support hypergraph computation with millions of vertices and hyperedges. Part of the work introduced in this chapter has been published in [8].

#### **8.2 Factorization-Based Big-Hypergraph Modeling**

The size of the incidence matrix $\mathbf{H} \in \mathbb{R}^{N \times E}$ is $\mathcal{O}(NE)$, i.e., $\mathcal{O}(N^2)$ when the number of hyperedges is of the same order as the number of vertices, and it rises rapidly as the number of vertices ($|\mathcal{V}| = N$) and the number of hyperedges ($|\mathcal{E}| = E$) increase. Although hypergraphs can model high-order complex associations well, the incidence matrix cannot accommodate a sizable number of vertices in the traditional hypergraph modeling and transductive computation strategy. This is one typical bottleneck that limits the applications of hypergraph computation. To address this problem, the factorization-based hypergraph reduction method [8] is introduced to handle hypergraph modeling and computing with tens of thousands of vertices.

Decomposing a high-dimensional matrix into the product of low-dimensional matrices is an effective way to reduce dimensionality and has been applied in different areas such as spectral clustering [14] and recommendation algorithms [15]. For the large incidence matrix **H** of a hypergraph, matrix decomposition can likewise be used to find low-dimensional embeddings of each vertex and hyperedge and thus support large scale hypergraph computation.

**Fig. 8.1** The pipeline of the factorization-based hypergraph reduction method. This figure is from [8]

As illustrated in Fig. 8.1, the factorization-based hypergraph reduction incorporates a factor embedding component that encodes the relationships between hyperedges and vertices into two latent semantic spaces. Due to the low dimension of the latent semantic space, it can handle more vertices and hyperedges accordingly.

The purpose of factorization is to reduce the incidence matrix **H** to two semantic spaces: the vertex-belonging-hyperedge space represented by $\mathbf{H}_{v \in \mathcal{E}_v} \in \mathbb{R}^{N \times \varphi}$ and the hyperedge-containing-vertices space represented by $\mathbf{H}_{e \supset \mathcal{V}_e} \in \mathbb{R}^{E \times \varphi}$, where $\mathcal{E}_v$ and $\mathcal{V}_e$ represent the set of hyperedges containing vertex *v* and the set of vertices in hyperedge *e*, respectively, and $\varphi$ is a hyperparameter that represents the dimension of the latent semantic space. Figure 8.1 illustrates that the two latent semantic spaces aim to express all connections between vertices and hyperedges. This procedure is formulated as below:

$$\arg\min_{\mathbf{H}_{v\in\mathcal{E}_{v}},\,\mathbf{H}_{e\supset\mathcal{V}_{e}}} \left\{ ||\mathbf{H} - \mathbf{H}_{v\in\mathcal{E}_{v}}\mathbf{H}_{e\supset\mathcal{V}_{e}}^{\top}||_{2}^{2} \right\}.\tag{8.1}$$

Consequently, the corresponding loss generated by the hypergraph dimensionality reduction can be written as

$$\mathcal{L}_{\mathcal{H}} = ||\mathbf{H} - \mathbf{H}_{v \in \mathcal{E}_{v}} \mathbf{H}_{e \supset \mathcal{V}_{e}}^{\top}||_{2}^{2}. \tag{8.2}$$
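As a point of reference for Eqs. (8.1) and (8.2), the sketch below computes a rank-$\varphi$ factorization with a truncated SVD, which minimizes the reconstruction error when the norm is read as the Frobenius norm. This is a swapped-in technique chosen for illustration, not the procedure of [8], where the two factors would typically be learned jointly with the network; the function name and toy matrix are hypothetical.

```python
import numpy as np

def factorize_incidence(H, phi):
    """Return H_v (N x phi), H_e (E x phi) and the loss ||H - H_v H_e^T||_F^2."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    root = np.sqrt(s[:phi])
    H_v = U[:, :phi] * root          # vertex-side latent factors
    H_e = Vt[:phi].T * root          # hyperedge-side latent factors
    loss = np.linalg.norm(H - H_v @ H_e.T, "fro") ** 2
    return H_v, H_e, loss

# Toy incidence matrix (hypothetical) with N = 4 vertices, E = 3 hyperedges, rank 2.
H = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

H_v, H_e, loss_2 = factorize_incidence(H, phi=2)
_, _, loss_1 = factorize_incidence(H, phi=1)
print(H_v.shape, H_e.shape, loss_2, loss_1)
```

For this matrix, $\varphi = 2$ reconstructs **H** up to rounding error, while $\varphi = 1$ leaves a residual equal to the squared second singular value, which shows how the loss in Eq. (8.2) measures the information discarded by the reduction.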

The hypergraph Laplacian matrix **L** is another crucial component of hypergraph computation, whose ordinary form is $\mathbf{L} = \mathbf{I} - \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}$. Since the incidence matrix **H** is factorized into two low-dimensional latent semantic spaces, the low-dimensional factorization-based hypergraph Laplacian $\mathbf{L}_F$ is formulated as

$$\mathbf{L}_{F} = \mathbf{I} - \mathbf{D}_{v}^{-1/2} \mathbf{H}_{v \in \mathcal{E}_{v}} \underbrace{\mathbf{H}_{e \supset \mathcal{V}_{e}}^{\top} \mathbf{W} \mathbf{D}_{e}^{-1} \mathbf{H}_{e \supset \mathcal{V}_{e}}}_{\boldsymbol{\Sigma} \in \mathbb{R}^{\varphi \times \varphi}} \mathbf{H}_{v \in \mathcal{E}_{v}}^{\top} \mathbf{D}_{v}^{-1/2},\tag{8.3}$$

where $\boldsymbol{\Sigma} = \mathbf{H}_{e \supset \mathcal{V}_e}^{\top}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}_{e \supset \mathcal{V}_e}$ is an intermediate latent feature multiplication term of dimension $\varphi$. Because the latent semantic space dimension $\varphi$ is significantly smaller than the total number of vertices and hyperedges, the intermediate multiplication term $\boldsymbol{\Sigma}$ functions as an extended control coefficient matrix.
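The benefit of the small term $\boldsymbol{\Sigma}$ is that $(\mathbf{I} - \mathbf{L}_F)\mathbf{X}$ can be evaluated without forming any $N \times N$ or $N \times E$ matrix. The sketch below assumes that the degree vectors `d_v` and `d_e` are available and obtains the factors from an exact truncated SVD of a toy incidence matrix so that the result can be compared with the dense computation; all names are hypothetical.

```python
import numpy as np

def factorized_propagate(X, H_v, H_e, w, d_v, d_e):
    """Compute D_v^{-1/2} H_v Sigma H_v^T D_v^{-1/2} X with Sigma = H_e^T W D_e^{-1} H_e."""
    Sigma = H_e.T @ ((w / d_e)[:, None] * H_e)   # (phi, phi)
    scale = 1.0 / np.sqrt(d_v)[:, None]          # D_v^{-1/2} as a row scaling
    Z = H_v.T @ (scale * X)                      # (phi, C)
    return scale * (H_v @ (Sigma @ Z))           # (N, C)

# Toy incidence matrix (hypothetical) of rank 2, factorized exactly with phi = 2.
H = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
w = np.ones(H.shape[1])                          # hyperedge weights
d_v, d_e = H @ w, H.sum(axis=0)                  # vertex and hyperedge degrees
U, s, Vt = np.linalg.svd(H, full_matrices=False)
phi = 2
H_v, H_e = U[:, :phi] * np.sqrt(s[:phi]), Vt[:phi].T * np.sqrt(s[:phi])

X = np.arange(8, dtype=float).reshape(4, 2)
dense = np.diag(d_v ** -0.5) @ H @ np.diag(w / d_e) @ H.T @ np.diag(d_v ** -0.5) @ X
out = factorized_propagate(X, H_v, H_e, w, d_v, d_e)
print(np.allclose(out, dense))
```

The factorized path costs on the order of $(N + E)\varphi^2 + N\varphi C$ operations for *C* feature channels, compared with roughly $NEC$ for the dense incidence matrix.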

**Fig. 8.2** (**a**) The whole-slide image for survival prediction; (**b**) Local feature extraction with convolution networks; (**c**) Feature aggregation with pairwise relation; (**d**) Global feature representation with high-order relation and multiple spaces. This figure is from [1]

The factorization-based hypergraph reduction can be used in hypergraph neural networks to support large scale computation, which can be used for more than 10,000 vertices and hyperedges.

Here we illustrate an application of hypergraph computation to large scale medical image analysis, namely survival prediction from whole-slide histopathology images. The goal is to make predictions by extracting valid survival-specific features that reflect the survival status of a patient from a whole-slide histopathology image. Unlike conventional images, WSI data can be very large, i.e., a single image may have billions of pixels, and the correlations within the data are very complicated. Therefore, hypergraph computation in this application meets the large scale issue. Existing image analysis models are designed for natural images of a much smaller size, such as 256 × 256 pixels or somewhat larger. To let such models handle WSI data, a number of patches of moderate size (e.g., 256 × 256) are usually sampled from each WSI first, and these patches are then stacked and fed into a CNN-based feature extractor (e.g., VGG) to generate a global representation, as shown in Fig. 8.2. Subsequently, a regression model is applied to the global features to predict the survival score. These methods have an obvious drawback: the structure of the entire histopathological image is broken into pieces by patch sampling.

Because a single histopathological image contains a massive amount of pixel data, it may be unrealistic to extract all of the structural information at the cellular level from gigapixel images. Instead, a small number of image patches can be selected to build graph-based models, from which the global feature is extracted. However, the number of sampled patches limits how much of the original image's informative regions the sampling covers, so a substantial portion of the fields with pathological features is missed. The incidence matrix, which represents the connectivity between vertices and hyperedges, is an essential component of the hypergraph neural network, and the large number of vertices and hyperedges in the constructed hypergraph limits the application of HGNN [16].

Here, we introduce the Big-Hypergraph Factorization Neural Network (b-HGFN) [8], which uses factorization-based hypergraph reduction to address the above issue. It incorporates a factor embedding component that encodes the relationships between hyperedges and vertices into two latent semantic spaces, as illustrated in Fig. 8.3. Due to the low dimension of the latent semantic space, b-HGFN can handle more vertices and hyperedges. With the hypergraph reduction, b-HGFN can provide more accurate feature representations of histopathological images from more densely sampled patches. Consequently, the first loss generated by the hypergraph dimensionality reduction can be written as Eq. (8.2). The hypergraph Laplacian matrix **L** is another crucial component of b-HGFN, and the low-dimensional hypergraph factorization Laplacian $\mathbf{L}\_F$ is formulated as Eq. (8.3). A standard hypergraph neural network layer is represented as

$$\text{HGFConv}(\cdot) = D \Big[ \sigma \left( \boldsymbol{\Theta}^{(\cdot)} \mathbf{X}^{(\cdot)} (\mathbf{I} - \mathbf{L}\_F) \right) \Big], \tag{8.4}$$

where *σ* stands for the nonlinear activation function, and *D* represents the dropout layer. Convolution operations are embedded into the implicit latent semantic space by modifying the convolution network's specifics, which are denoted as

$$\begin{cases} \text{HGFConv}(0) = D \Big[ \sigma \left( \boldsymbol{\Theta}^{(0)} \mathbf{X}^{(0)} \mathbf{D}\_v^{-1/2} \mathbf{H}\_{v \in \mathcal{E}\_v} \boldsymbol{\Sigma} \right) \Big] \\ \text{HGFConv}(1) = D \Big[ \sigma \left( \boldsymbol{\Theta}^{(1)} \mathbf{X}^{(1)} \boldsymbol{\Sigma} \right) \Big] \\ \quad \vdots \\ \text{HGFConv}(L-1) = D \Big[ \sigma \left( \boldsymbol{\Theta}^{(L-1)} \mathbf{X}^{(L-1)} \boldsymbol{\Sigma} \right) \Big] \\ \text{HGFConv}(L) = D \Big[ \sigma \left( \boldsymbol{\Theta}^{(L)} \mathbf{X}^{(L)} \boldsymbol{\Sigma} \mathbf{H}\_{v \in \mathcal{E}\_v}^{\top} \mathbf{D}\_v^{-1/2} \right) \Big] \end{cases} \tag{8.5}$$

According to the HGFConv mentioned above, the hypergraph's high-dimensional connection relations can be embedded in the low-dimensional latent semantic spaces. To represent global features (i.e., $\mathbf{X} \in \mathbb{R}^{1 \times C\_{L+1}}$) at the histopathological image level, the output of the last layer of HGFConv (i.e., $\mathbf{X}^{(L+1)}$) is squeezed by a pooling layer after a complete b-HGFN.

After the feature representation of the histopathological image is obtained, the patient's survival duration is predicted by a fully connected neural network. The hierarchical loss, which incorporates list-wise loss, pairwise loss, and point-wise loss, has been experimentally demonstrated to be more effective for b-HGFN than using only the pairwise Bayesian Concordance Readjust (BCR) loss function. The point-wise loss function applies the negative Cox log partial likelihood loss as

$$\mathcal{L}\_{\alpha} = \sum\_{i} \delta\_i \left( -s\_i + \log \sum\_{j \in \{j: t\_j \le t\_i\}} \exp(s\_j) \right), \tag{8.6}$$

where $s\_i$ and $t\_i$ represent the predicted duration and the ground truth, and $\delta\_i$ is the event indicator of the *i*-th sample. The list-wise loss and the pairwise loss refer to NDCGLoss2 derived from LambdaLoss [17] and the BCR loss [2], respectively. Taking into account the loss function of the hypergraph dimension reduction, the combination of all loss functions can be expressed as

$$\begin{cases} \mathcal{L}\_{\lambda} = \lambda \mathcal{L}\_{\alpha} + (1 - \lambda) \mathcal{L}\_{\mathcal{P}} \\ \mathcal{L}\_{\mathcal{P}} = \{\text{NDCGLoss2}(\mathbb{S}, \mathbb{G}), \text{BCRLoss}(\mathbb{S}, \mathbb{G})\} \\ \mathcal{L}\_{\text{all}} = \mathcal{L}\_{\mathcal{Y}} + \mathcal{L}\_{\lambda} \end{cases} \tag{8.7}$$
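As a rough illustration of the point-wise term and of the weighted combination, the following NumPy sketch implements the textbook form of the negative Cox log partial likelihood. It is an assumption-laden example rather than the b-HGFN implementation: `scores` is treated as a predicted risk score, the risk set of sample *i* is taken to be all samples whose observed time is at least as large as that of sample *i*, and the index convention may differ from the one used in Eq. (8.6). The function names and the scalar `lam` are hypothetical.

```python
import numpy as np

def cox_neg_log_partial_likelihood(scores, times, events):
    """Textbook negative Cox log partial likelihood (assumed convention)."""
    scores = np.asarray(scores, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=float)
    # at_risk[i, j] is True when sample j is still at risk at time t_i.
    at_risk = times[None, :] >= times[:, None]
    # Log-sum-exp over each risk set, stabilised by subtracting the max score.
    m = scores.max()
    log_risk = m + np.log((np.exp(scores - m)[None, :] * at_risk).sum(axis=1))
    # Only samples with an observed event (events == 1) contribute.
    return float(np.sum(events * (log_risk - scores)))

def combine_losses(point_wise, rank_wise, lam):
    """Convex combination lam * point-wise + (1 - lam) * ranking loss."""
    return lam * point_wise + (1.0 - lam) * rank_wise

# Two samples with equal scores; both events observed.
loss = cox_neg_log_partial_likelihood([0.0, 0.0], [1.0, 2.0], [1, 1])
```

Here the first sample has a risk set of two and the second a risk set of one, so the loss reduces to log 2.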

The factorization-based hypergraph reduction incorporates a factor embedding component that encodes the relationships between hyperedges and vertices into two latent semantic spaces. Due to the low dimension of the latent semantic space, it can handle more vertices and hyperedges. The factorization-based hypergraph reduction can be used in HGNN [16] to solve the large scale problem. The method can effectively solve the hypergraph analysis problem with almost 10,000 vertices and hyperedges.

#### **8.3 Hierarchical Hypergraph Modeling**

The factorization-based hypergraph reduction can effectively analyze hypergraphs with almost 10,000 vertices and hyperedges, but it reaches its limit when the hypergraph grows to millions of vertices or hyperedges. Figure 8.4 shows a hierarchical hypergraph learning method for large scale hypergraphs with hierarchical labels, which allows hypergraph neural networks to handle millions of data points. It is introduced in detail below.

**Fig. 8.4** An illustration of the hierarchical hypergraph learning

For million-scale unstructured data, it is impractical to convert the whole dataset into a single large hypergraph to represent the correlations of samples, or to conduct the factorization-based reduction, since either would require an unrealistically large incidence matrix or a prohibitive amount of memory. If there are hierarchical labels in the dataset, hierarchical hypergraph learning can be adopted to solve the problem. The original dataset $\mathbf{X} \in \mathbb{R}^{N \times d}$ is randomly and uniformly divided into several subsets of smaller, more affordable scale, where *N* denotes the size of the dataset and *d* denotes the dimension of each sample. The samples in the dataset form the vertices and hyperedges. In each subset, we construct a sub-hypergraph using the *k* nearest neighbors (*k*NN) algorithm, which is based on the Euclidean distance between the representations of each pair of vertices. The incidence matrix $\mathbf{H}\_i \in \mathbb{R}^{|\mathcal{V}\_i| \times |\mathcal{E}\_i|}$, whose entries are 0 or 1, indicates the correlation between the vertices and the hyperedges.
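The data division and the *k*NN construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: every vertex acts as the centroid of one hyperedge that links it to its *k* nearest vertices (itself included), so each sub-hypergraph has as many hyperedges as vertices. The helper names `split_uniform` and `knn_incidence` are hypothetical, and the dense distance matrix is only practical because each subset is small.

```python
import numpy as np

def split_uniform(n_samples, n_subsets, seed=0):
    """Randomly split sample indices into subsets of (almost) equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_subsets)

def knn_incidence(X, k):
    """Build a 0/1 incidence matrix with one kNN hyperedge per vertex."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances; the diagonal is exactly zero.
    diff = X[:, None, :] - X[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Row e lists the k vertices closest to centroid e (the centroid included).
    nearest = np.argsort(dist2, axis=1)[:, :k]
    H = np.zeros((n, n))
    # H[v, e] = 1 when vertex v belongs to the hyperedge centred at vertex e.
    H[nearest.ravel(), np.repeat(np.arange(n), k)] = 1.0
    return H

# Four well-separated points forming two obvious pairs.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
H_demo = knn_incidence(X_demo, k=2)
subsets = split_uniform(10, 3)
```

In practice each index subset returned by `split_uniform` would select the rows of the feature matrix that are passed to `knn_incidence`.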

Given the initial feature matrix of vertices **X** and the corresponding incidence matrix **H**, we use $\mathcal{G}\_i = (\mathcal{V}\_i, \mathcal{E}\_i)$, $i = 1, 2, \ldots, m$, to represent the *i*-th hypergraph, which contains $|\mathcal{V}\_i|$ vertices and $|\mathcal{E}\_i|$ hyperedges. To mitigate feature over-smoothing in the convolution operations, a residual connection [4] can be adopted to generate the updated vertex representations for the next convolution layer, formulated as follows:

$$
\widehat{\mathbf{X}}\_i = \sigma(\mathbf{D}\_i^{-1/2} \mathbf{H}\_i \mathbf{W}\_i \mathcal{D}\_i^{-1} \mathbf{H}\_i^\top \mathbf{D}\_i^{-1/2} \mathbf{X}\_i \boldsymbol{\Theta}\_i + \mathbf{X}\_i), \tag{8.8}
$$

where $\mathbf{D}\_i \in \mathbb{R}^{|\mathcal{V}\_i| \times |\mathcal{V}\_i|}$ and $\mathcal{D}\_i \in \mathbb{R}^{|\mathcal{E}\_i| \times |\mathcal{E}\_i|}$ are the degree matrices of the vertices and hyperedges, respectively. $\mathbf{W}\_i = \text{diag}(w\_1, w\_2, \ldots, w\_{|\mathcal{E}\_i|})$ and $\boldsymbol{\Theta}\_i \in \mathbb{R}^{d \times d}$ denote the trainable weights of the hyperedges and the trainable weight matrix for feature transformation, respectively.
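A single layer of this residual hypergraph convolution can be written in a few lines of NumPy. The sketch below is illustrative only: it assumes ReLU as the activation *σ*, takes the hyperedge weights as a plain vector `w`, and requires every vertex to belong to at least one hyperedge so that all degrees are positive. The function name is hypothetical.

```python
import numpy as np

def hgnn_residual_layer(X, H, w, Theta):
    """One hypergraph convolution with a residual connection (ReLU assumed)."""
    d_v = H @ w                      # vertex degrees, shape (|V|,)
    d_e = H.sum(axis=0)              # hyperedge degrees, shape (|E|,)
    Hn = H / np.sqrt(d_v)[:, None]   # D_v^{-1/2} H
    # D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X, evaluated right to left.
    smoothed = Hn @ ((w / d_e)[:, None] * (Hn.T @ X))
    # Feature transform, residual connection, then the activation.
    return np.maximum(smoothed @ Theta + X, 0.0)

# Two vertices joined by a single hyperedge of weight 1.
H_demo = np.ones((2, 1))
X_demo = np.array([[1.0, 0.0], [3.0, 2.0]])
out = hgnn_residual_layer(X_demo, H_demo, np.ones(1), np.eye(2))
```

With one hyperedge covering both vertices and an identity `Theta`, the smoothing term is the mean feature `[2, 1]`, so each output row is that mean plus the vertex's own feature.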

Note that here we assume that each sample has two hierarchical labels, named the secondary label and the primary label, where the secondary label is a fine-grained category of the primary label. One special component in this first step is the "vertex belonging matrix," denoted as $\boldsymbol{\Gamma}\_i \in \mathbb{R}^{|\mathcal{V}\_i| \times \mathcal{N}\_2}$, where $\mathcal{N}\_2$ is the number of secondary labels. The matrix $\boldsymbol{\Gamma}\_i$ is generated from the labels in the training set and serves as the input for the transductive learning method.

The number of global labels shared by all the subsets is usually in the hundreds, which makes it feasible to combine the independently learned label features of different groups. Having obtained the local latent high-order representations of the subsets in the previous hypergraph learning step, two aggregating operations can be conducted for primary and secondary label classification, respectively. The aggregation of local secondary labels can be formulated as follows:

$$\mathbf{S}\_{i} = \boldsymbol{\Gamma}\_{i}^{\top} \widehat{\mathbf{X}}\_{i},\tag{8.9}$$

where $\mathbf{S}\_i \in \mathbb{R}^{\mathcal{N}\_2 \times d}$ denotes the aggregated local representation for the secondary labels. Each row of the matrix $\mathbf{S}\_i$ represents the latent feature of one specific secondary label category in the *i*-th subset.

We then concatenate all of the local high-order vertex features $\widehat{\mathbf{X}}\_i$ to generate the global high-order vertex features $\widehat{\mathbf{X}} \in \mathbb{R}^{|\mathbf{V}| \times d}$ as follows:

$$\widehat{\mathbf{X}} = \left[ \widehat{\mathbf{X}}\_1^\top \| \widehat{\mathbf{X}}\_2^\top \| \cdots \| \widehat{\mathbf{X}}\_m^\top \right]^\top,\tag{8.10}$$

where $\cdot \| \cdot$ denotes the concatenating operation between two matrices. The local aggregated secondary features $\mathbf{S}\_i \in \mathbb{R}^{\mathcal{N}\_2 \times d}$ can be further fused to form the global secondary features $\mathbf{S} \in \mathbb{R}^{\mathcal{N}\_2 \times d}$ by average pooling, formulated as follows:

$$\mathbf{S} = \mathbf{S}\_1 \oplus \mathbf{S}\_2 \oplus \cdots \oplus \mathbf{S}\_m,\tag{8.11}$$

where ⊕ denotes the average pooling operation, calculating the mean value of the corresponding latent features from the local secondary labels.

The global high-order representation of primary labels ($\mathbf{P} \in \mathbb{R}^{\mathcal{N}\_1 \times d}$) is obtained from the global features of secondary labels, formulated as

$$\mathbf{P} = \boldsymbol{\Phi} \mathbf{S},\tag{8.12}$$

where $\boldsymbol{\Phi} \in \mathbb{R}^{\mathcal{N}\_1 \times \mathcal{N}\_2}$ denotes the owning relations between secondary and primary labels.

Based on the results of the hypergraph convolution and global aggregation, the classifier consisting of the fully connected layers can be trained by concatenating the updated vertices' high-order representations and the global classification. The augmented representations of vertices are shown below:

$$\begin{cases} \widetilde{\mathbf{X}}\_{i}^{<1>} = \widehat{\mathbf{X}}\_{i} \parallel \frac{1}{\sqrt{\mathcal{N}\_{1}}} \sum\_{j=1}^{\mathcal{N}\_{1}} \mathbf{P}\_{j} \\ \widetilde{\mathbf{X}}\_{i}^{<2>} = \widehat{\mathbf{X}}\_{i} \parallel \frac{1}{\sqrt{\mathcal{N}\_{2}}} \sum\_{j=1}^{\mathcal{N}\_{2}} \mathbf{S}\_{j} \end{cases} \tag{8.13}$$
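The aggregation steps of Eqs. (8.9) to (8.13) amount to a handful of matrix products, as the following NumPy sketch suggests. It is a simplified reading rather than the authors' code: the function names are hypothetical, and the global vector in the augmentation step is assumed to be broadcast to every vertex of the subset before concatenation.

```python
import numpy as np

def aggregate_secondary(Gamma_i, X_hat_i):
    """Local secondary-label features S_i = Gamma_i^T X_hat_i."""
    return Gamma_i.T @ X_hat_i

def fuse_secondary(S_list):
    """Global secondary features: average pooling over the subsets."""
    return np.mean(np.stack(S_list), axis=0)

def primary_from_secondary(Phi, S):
    """Global primary features P = Phi S."""
    return Phi @ S

def augment(X_hat_i, G):
    """Append the scaled sum of the rows of G (P or S) to every vertex."""
    g = G.sum(axis=0) / np.sqrt(G.shape[0])
    return np.concatenate([X_hat_i, np.tile(g, (X_hat_i.shape[0], 1))], axis=1)

# Two subsets, two vertices each, two secondary labels, one primary label.
Gamma = np.eye(2)                       # vertex j carries secondary label j
X1 = np.array([[1.0, 0.0], [0.0, 1.0]])
X2 = np.array([[3.0, 0.0], [0.0, 3.0]])
S = fuse_secondary([aggregate_secondary(Gamma, X1),
                    aggregate_secondary(Gamma, X2)])
P = primary_from_secondary(np.array([[1.0, 1.0]]), S)
X1_primary = augment(X1, P)
X1_secondary = augment(X1, S)
```

Because the label matrices have only as many rows as there are labels, this step stays cheap even when the number of vertices is in the millions.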

The aggregated features can then be used for downstream tasks and trained with the hierarchical labels in the training set. In the following, we introduce hierarchical hypergraph learning for recommendation.

Here, we introduce an application of hierarchical hypergraph learning for large scale user retrieval intention detection. Figure 8.5 shows the layout, which mainly consists of three steps: data division and local hypergraph modeling, latent high-order feature aggregation, and user intention prediction.

First, we randomly and uniformly divide the original dataset into several subsets. In our work, the query logs and the relationships among them form the vertices and hyperedges, respectively. As shown in Fig. 8.5, the whole original dataset and the divided subsets are denoted as **V** and $\mathcal{V}\_i$, respectively, where $i \in \{1, 2, \ldots, m\}$. In each subset, a sub-hypergraph is constructed as introduced above. Note that the initial semantic embeddings of vertices ($\mathbf{X}\_i \in \mathbb{R}^{|\mathcal{V}\_i| \times d}$) are extracted by well-known pre-trained models, such as BERT [18], where *d* denotes the dimension of the embeddings.

The hierarchical hypergraph learning can then be used to conduct the user intention prediction. In our research, the user intentions are categorized into two levels, i.e., the primary label and the secondary label, which is the fine-grained category of the primary label. After applying the hierarchical model, the features $\widetilde{\mathbf{X}}\_i^{<1>}$ and $\widetilde{\mathbf{X}}\_i^{<2>}$ can be obtained.


In this application, the multi-class classification is converted into multiple binary classification problems to improve the effectiveness of the model. We use $\mathcal{C} = \{\mathcal{C}\_1, \mathcal{C}\_2, \ldots, \mathcal{C}\_{\mathcal{N}}\}$, $\mathcal{N} \in \{\mathcal{N}\_1, \mathcal{N}\_2\}$, to denote the collection of user intentions. The original multi-class labels are thus converted into two labels, 0 and 1: for the classifier of the *l*-th intention, all samples with that intention are labeled 1 and all others are labeled 0. Each classifier is trained using a multi-layer perceptron (MLP) with the sigmoid activation function to predict the newly allocated binary label, formulated as follows:

$$\begin{cases} \widehat{\mathcal{Y}}\_{1} = \sigma(\widetilde{\mathbf{X}}\_{i}^{<1>} \boldsymbol{\Theta}\_{f1} + b\_{1}) \\ \widehat{\mathcal{Y}}\_{2} = \sigma(\widetilde{\mathbf{X}}\_{i}^{<2>} \boldsymbol{\Theta}\_{f2} + b\_{2}), \end{cases} \tag{8.14}$$

where $\boldsymbol{\Theta}\_{f1}$ and $\boldsymbol{\Theta}\_{f2}$ are the trainable transformation matrices, $b\_1$ and $b\_2$ are the biases, and *σ* is the activation function. $\widehat{\mathcal{Y}}\_1$ and $\widehat{\mathcal{Y}}\_2$ denote the predictions of the primary and secondary user intentions, respectively.

To supervise and optimize the trainable parameters, we apply the cross-entropy loss function in the training procedure:

$$\mathcal{L} = \text{CE}(\mathcal{Y}\_1, \widehat{\mathcal{Y}}\_1) + \text{CE}(\mathcal{Y}\_2, \widehat{\mathcal{Y}}\_2), \tag{8.15}$$

where $\mathcal{Y}\_1$ and $\mathcal{Y}\_2$ denote the ground truth of the primary and secondary user intentions, respectively. When all of the classifiers have been trained, each test sample can be scored to obtain a list of predictions for both primary and secondary user intentions.
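The conversion to binary problems and the sigmoid prediction can be sketched as follows. The example is deliberately simplified and makes several assumptions: a single linear layer stands in for the MLP, the cross-entropy is the binary form averaged over all samples and classes, and all function names are hypothetical.

```python
import numpy as np

def one_vs_rest_labels(labels, num_classes):
    """Column l is 1 for samples with intention l and 0 for all others."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.arange(num_classes)[None, :]).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X_aug, Theta, b):
    """One linear layer followed by a sigmoid (stand-in for the MLP)."""
    return sigmoid(X_aug @ Theta + b)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped for stability."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

# Three samples, three intentions, four-dimensional augmented features.
Y = one_vs_rest_labels([0, 2, 1], num_classes=3)
X_aug = np.ones((3, 4))
Y_hat = predict(X_aug, np.zeros((4, 3)), 0.0)   # untrained: every score is 0.5
loss = binary_cross_entropy(Y, Y_hat)
```

The total training loss would add the corresponding term for the second label level, in the spirit of Eq. (8.15).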

To summarize, the hierarchical hypergraph learning method can handle large scale hypergraphs with hierarchical labels: it divides a dataset into multiple sub-hypergraphs and performs hierarchical aggregation based on the hierarchical labels. Integrated with hypergraph neural networks, it can handle millions of data points.

#### **8.4 Summary**

This chapter describes two kinds of large scale hypergraph computation methods, i.e., factorization-based hypergraph reduction and hierarchical hypergraph learning. The factorization-based hypergraph reduction is based on the strategy of factorization, which decomposes the large scale hypergraph into low-dimensional embeddings of vertices and hyperedges. It can support the processing of hypergraphs with nearly 10,000 vertices or hyperedges. The hierarchical hypergraph learning is used to analyze hypergraphs with hierarchical labels: it divides a dataset into multiple sub-hypergraphs and performs hierarchical aggregation based on the hierarchical labels. This method can support millions of data points. We also introduce two applications as examples, i.e., whole-slide image analysis and recommendation, to illustrate the usage of these two algorithms in practice. There are other large scale hypergraph application scenarios, such as community discovery [19] and spectral clustering [20].

#### **References**



## **Chapter 9 Hypergraph Computation for Social Media Analysis**

**Abstract** Social media, such as Twitter and Weibo, have grown rapidly over the past decade. Large numbers of active social media users produce a voluminous amount of data each day, from which important insights can be drawn. Several applications, such as recommender systems and sentiment analysis, have been developed to help study users' intentions and profiles. One common challenge faced by these social media applications is how to leverage the complex and multimodal data on social networks and model the higher-order associations hidden in the data. Hypergraph computation has great potential for such analysis. In this chapter, we introduce three typical applications of hypergraph computation, i.e., recommender system, sentiment analysis, and emotion recognition, in which hypergraph computation has shown great value for social media analysis.

#### **9.1 Introduction**

With the fast development of information technologies, social media data have increased rapidly. Social media platforms provide new ways to produce and receive content, especially user-generated content. Users can shop, watch movies, and instantly participate in the propagation, interaction, and sharing of news events on the Internet. Rich behavioral data are generated on social media platforms by great numbers of users every day; these data support different downstream applications and provide insights for a better understanding of users' intentions and profiles.

A typical social media analysis application is the so-called recommender system [1]. When listening to music, shopping, watching movies on the Internet, or looking for friends on social network services, users are likely to be drowned in an unprecedented amount of information. This is what we call "information overload." To address this issue, recommender systems have been developed for decades. The main goal of recommender systems is to forecast how users would react to a product by better understanding their preferences based on the user's historical interaction data, user profile, item attributes, context data, and other information. This could help predict whether the users like an item or not. For example, in a movie recommender system, the user profile may contain user ID, age, gender, income, marital status, and more. The movie (item) attributes may include movie ID, name, genre, director, release time, actors, and more. Interaction data contain the movies that users have watched and the comments they have provided. The goal of the movie recommender system is to integrate this information to recommend movies that users might like.

Another popular social media analysis application is sentiment analysis [2]. In recent years, users on social media platforms have generated a large amount of opinion data every moment, which helps to decode and mine users' attitudes on specific topics. Researchers have begun to look at sentiment analysis of users in social media contexts. In economics, stock price fluctuations can be forecast to some extent by analyzing the sentiment of social media users. In politics, social media posts can reflect public opinions. Users' sentiments may also affect their behaviors; for example, emotionally charged people are more likely to forward and repost tweets. Therefore, sentiment analysis plays an important role in social media analysis. However, sentiment analysis is challenging due to the multi-modality and complexity of social media data. For example, a tweet may include text, images, videos, and possibly more. Furthermore, there exist complex correlations among posts along various dimensions, such as time, location, and user preferences. The interaction among these users further increases the challenges in this task.

In addition to the posts on social media, physiological signals can also be used to analyze the emotion of people [3]. Compared with text, facial expressions, and other data, physiological signals are not easy to disguise and can better reflect people's real emotions. Therefore, emotion recognition based on physiological signals plays an important role in many applications such as clinical diagnosis, and it also plays a significant role in social media analysis when these data are available. Physiological signals of different modalities contain complementary information representations of human emotions. It is of great significance to discover and utilize the correlations among these representations to improve the accuracy of emotion recognition.

From the above examples, it can be readily seen that one important issue of social media analysis is how to model complex correlations among data and to make use of the complementary information among multi-modal representations to better understand data. Hypergraphs have been widely used in social media computing in recent years because of their usefulness in complex data modeling. In the following, we discuss three applications of hypergraph computation in social media analysis: recommender system, sentiment analysis, and emotion recognition. In the recommender system, we discuss hypergraph-based collaborative filtering [4] and attribute inference. We then present sentiment prediction [5] and social event detection using hypergraph computation [6] for sentiment analysis. In the third section, we introduce two different hypergraph computation methods of emotion recognition using multimodal physiological signals [7–9]. Part of the work introduced in this chapter has been published in [4–9].

#### **9.2 Recommender System**

In recent years, the Internet has become an integral part of people's daily life: shopping, watching the news, listening to music, etc., all take place on the Internet. However, with the explosion of information, people find it increasingly difficult to sift through the massive volumes of data on the Internet to access the needed information. For example, when a user wants to watch a movie online and visits a movie site, the user is likely to drown in thousands of movies and be unable to find the one they have in mind. This is called the "information overload" issue.

Recommender systems emerge under such circumstances. Recommender systems are a powerful tool for reducing the problem of information overload since they may assist users to find useful information and assist service providers to boost profits. Recommender systems have been used in many online systems, from general platforms including e-commerce, social media, and content sharing to vertical services such as movie, news, and music websites.

The core of the recommender system is to understand the users through their attribute information and historical interactions and then predict whether they would like an item. It is worth noticing that the user-side information, the item-side information, as well as the interaction data, play a vital role in this process. The user-side information, including gender, age, personality, etc., often reflects the users' preference. For example, male users may be more likely to read military and political news, while female users may prefer fashion and entertainment news. The item-side information, such as the category, text description, image, etc., can characterize the attributes of the item. Such attribute information may suggest potential consumer groups. For instance, health supplements may be bought more often by the elderly, while electronics are more likely to be purchased by younger people. Historical interactions also reveal users' potential preferences, as suggested by the assumption that "behavioral similar users may have similar preferences on items." Figure 9.1 shows an example of a recommender system based on similar patterns.

We can find from these examples that what recommender systems actually do is to distinguish similar users from different perspectives based on complex, multimodal given data. Therefore, one key problem is how to model and learn the complex relationship between users and items. Recently, hypergraph computation has attracted much attention and has been applied to recommender systems to help solve this problem. The hypergraph can naturally integrate the user-side information, item-side information, as well as the interaction data, thanks to flexible hyperedges and especially hyperedge groups. Therefore, similar users/items can be connected in different areas. In this section, we discuss two examples of applying hypergraph computation in recommender system, i.e., collaborative filtering and attribute inference.

**Fig. 9.1** An example of recommender system based on similar patterns

#### *9.2.1 Collaborative Filtering*

In the past decades, collaborative filtering (CF), a crucial, popular recommendation technique, has been extensively used in various recommender systems. The fundamental assumption of CF is that consumers who engage in similar behaviors, for example, reading the same kind of news frequently, are likely to have similar tastes for items, such as games, movies, and commodities. A common CF-based solution goes through the following two steps: first, it uses historical interactions to identify similar users and items; and second, it makes suggestions for users based on the information acquired in the last step.

Since users and items have topological links that a network can describe, graph-based CF approaches have attracted a lot of interest in recent years. Although graph-based CF approaches have been explored for a long time and have produced respectable performance, there are still certain restrictions. First, the high-order correlations in the user–item network are modeled and utilized insufficiently. For example, CF methods hope to find a group of behavior-similar users. Such associations between users are group-level (beyond pairwise) and cannot be well captured by the graph structure, since only pairwise correlations can be modeled in a graph. Second, when users and items are represented by a graph in graph-based approaches, there are no fundamental distinctions between them. When an item has many users connected to it, it is a popular item. In contrast, being connected to a variety of items does not necessarily mean that a user is well-liked.

Under these circumstances, more adaptable and appropriate user and item modeling is required. Thanks to its adaptable hyperedges, the hypergraph structure, as opposed to the graph structure, offers a more natural approach for representing such high-order and intricate relationships. In this subsection, we present a dual channel hypergraph collaborative filtering (DHCF) framework [4] to solve the aforementioned problems. In the following, we introduce how to model the user–item interactions and learn the high-order connectivity with dual hypergraphs.

**Fig. 9.2** An example of hypergraph modeling for user–item network

#### **(1) Hypergraph Modeling of High-Order Connectivity**

Given a user–item network, the high-order connectivity is captured by some self-defined association rules. Based on these rules, several hyperedge groups can be constructed, which can capture higher-order correlations rather than pairwise relationships, e.g., by linking users who behave similarly but have no direct connections. For example, we can connect the users who have purchased the same item with a hyperedge, as shown in Fig. 9.2. In addition to the interactions that are directly visible in the observed data, these rules may also be thought of as a high-order perspective for describing the otherwise raw data. Here we introduce a way to capture the high-order connectivity with hypergraphs for users and items, separately.

**User Hypergraph Construction** We first define the *k*-order neighbors for items. If there is a path between $item\_a$ and $item\_b$ that consists of a series of adjacent vertices and contains fewer than *k* users, then $item\_a$ ($item\_b$) is a *k*-order reachable neighbor of $item\_b$ ($item\_a$) in the user–item network.

We then define the *k*-order neighbor users for items. If there is a direct path between $user\_a$ and $item\_a$, and $item\_a$ is a *k*-order neighbor of $item\_b$, then $user\_a$ is a *k*-order neighbor user of $item\_b$.

The symbol $B\_u^k(i)$ represents the set of *k*-order neighbor users of item *i*. A hypergraph can be defined mathematically as a set family in which each set indicates a hyperedge. As a result, a hypergraph can be built from the *k*-order neighbor user sets of the items. Using the above definitions, the corresponding hyperedge group can be constructed as follows:

$$\mathcal{E}\_{B\_{u}^{k}} = \{ B\_{u}^{k}(i) \mid i \in I \}. \tag{9.1}$$

The *k*-order accessible matrix of items is denoted by $\mathbf{A}\_i^k \in \{0, 1\}^{M \times M}$, which can be written as follows:

$$\mathbf{A}\_{i}^{k} = \min(\mathbf{1}, \text{power}(\mathbf{H}^{\top} \cdot \mathbf{H}, k)), \tag{9.2}$$

where the function $\text{power}(\mathbf{M}, k)$ computes the *k*-th power of a given matrix **M**. The incidence matrix of the user–item network is represented by $\mathbf{H} \in \{0, 1\}^{N \times M}$, where *N* and *M* are the numbers of users and items, respectively. Then, the incidence matrix of the hyperedge group has the following form:

$$\mathbf{H}\_{B\_{u}^{k}} = \mathbf{H} \cdot \mathbf{A}\_{i}^{k-1}. \tag{9.3}$$

The hypergraph $\mathcal{G}\_u$ can capture the overall high-order correlations among users by fusing multiple hyperedge groups that are constructed via the *k*-order reachable rule. Therefore, $\mathbf{H}\_u$ can be written as

$$\mathbf{H}\_{u} = f\left(\mathcal{E}\_{B\_{u}^{k\_1}}, \mathcal{E}\_{B\_{u}^{k\_2}}, \dots, \mathcal{E}\_{B\_{u}^{k\_a}}\right) = \underbrace{\mathbf{H}\_{B\_{u}^{k\_1}} \| \mathbf{H}\_{B\_{u}^{k\_2}} \| \dots \| \mathbf{H}\_{B\_{u}^{k\_a}}}\_{a}, \tag{9.4}$$

where $\cdot \| \cdot$ is the concatenation operation, which is one example of the hyperedge group fusion function $f(\cdot)$.

**Item Hypergraph Construction** Here the high-order connectivity for items is defined in a similar way. The *k*-order accessible matrix of users, $\mathbf{A}\_u^k \in \{0, 1\}^{N \times N}$, is defined as

$$\mathbf{A}\_{u}^{k} = \min(\mathbf{1}, \text{power}(\mathbf{H} \cdot \mathbf{H}^{\top}, k)). \tag{9.5}$$

The incidence matrix $\mathbf{H}\_{B\_i^k} \in \{0, 1\}^{M \times N}$ can be written as

$$\mathbf{H}\_{B\_i^k} = \mathbf{H}^\top \cdot \mathbf{A}\_u^{k-1}.\tag{9.6}$$

Assuming that we have *b* hyperedge groups, the incidence matrix of the item hypergraph, $\mathbf{H}\_i$, is similarly formulated as follows:

$$\mathbf{H}\_{i} = f\left(\mathcal{E}\_{B\_{i}^{k\_1}}, \mathcal{E}\_{B\_{i}^{k\_2}}, \dots, \mathcal{E}\_{B\_{i}^{k\_b}}\right) = \underbrace{\mathbf{H}\_{B\_{i}^{k\_1}} \| \mathbf{H}\_{B\_{i}^{k\_2}} \| \dots \| \mathbf{H}\_{B\_{i}^{k\_b}}}\_{b}. \tag{9.7}$$
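The construction of the two hypergraphs can be sketched in NumPy as follows. The sketch follows the formulas above with two labeled assumptions: the products that define each hyperedge group are clipped back to 0/1 so that the result is a binary incidence matrix, and concatenation is used as the fusion function. The function names and the choice of orders `(1, 2)` are hypothetical.

```python
import numpy as np

def reachable(H_side, k):
    """min(1, power(H_side^T H_side, k)) as a 0/1 integer matrix."""
    gram = H_side.T @ H_side
    return np.minimum(1, np.linalg.matrix_power(gram, k))

def user_hyperedge_group(H, k):
    """k-order group for users: H A_i^{k-1}, clipped to 0/1 (assumption)."""
    return np.minimum(1, H @ reachable(H, k - 1))

def item_hyperedge_group(H, k):
    """k-order group for items: H^T A_u^{k-1}, clipped to 0/1 (assumption)."""
    return np.minimum(1, H.T @ reachable(H.T, k - 1))

def fuse(groups):
    """Fusion by concatenating the hyperedge groups along the edge axis."""
    return np.concatenate(groups, axis=1)

# Three users and two items; user 1 interacted with both items.
H = np.array([[1, 0],
              [1, 1],
              [0, 1]])
H_u = fuse([user_hyperedge_group(H, k) for k in (1, 2)])
H_i = fuse([item_hyperedge_group(H, k) for k in (1, 2)])
```

In this toy network the first-order user group is the interaction matrix itself, while the second-order group links all three users because the two items are connected through the shared user.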

In this way, the high-order connectivity for both users and items is captured with a hypergraph. Figure 9.3 gives one example of the defined high-order connectivity for users [4]. Subsequently, two embedding look-up tables ($\mathbf{E}\_u = [\mathbf{e}\_{u\_1}, \dots, \mathbf{e}\_{u\_N}]$ and $\mathbf{E}\_i = [\mathbf{e}\_{i\_1}, \dots, \mathbf{e}\_{i\_M}]$) are constructed to describe users and items, which, together with the hypergraph structure, are prepared for later learning.

**Fig. 9.3** The illustration of high-order connectivity for users

#### **(2) High-Order Information Passing**

When mixed high-order correlations have been obtained, the neighboring messages are aggregated using the high-order information passing technique, which can be expressed as

$$\begin{cases} M\_{u} = \text{HyConv}(E\_{u}, H\_{u}) \\ M\_{i} = \text{HyConv}(E\_{i}, H\_{i}) \end{cases}, \tag{9.8}$$

where HyConv$(\cdot, \cdot)$ can be any hypergraph convolution operation, such as the one specified in HGNN (HGNNConv for short). Through information passing from high-order neighbors, the complex correlations between vertices are encoded into the aggregated messages of users ($M\_u$) and items ($M\_i$), respectively. It should be noted that the high-order neighbor mentioned here is not restricted to the direct interactions in the user–item network; rather, it is an abstract description that can link similar users/items in the latent behavior–attribute space.

To provide an example of high-order information passing, we present the jump hypergraph convolution (JHyConv) in this part. Inspired by some previous work [10], the JHyConv operator creates the learned representations by combining a vertex's current representation with its aggregated neighborhood representation; in Eq. (9.9), this combination takes the form of an addition. The JHyConv is written as

$$\mathbf{X}^{(l+1)} = \sigma \left( \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2} \mathbf{X}^{(l)} \boldsymbol{\Theta}^{(l)} + \mathbf{X}^{(l)} \right), \tag{9.9}$$

where all symbols follow existing notations consistently.

In contrast to the conventional HGNNConv, the jump hypergraph convolution enables the model to take into account both a vertex's own representation and the aggregated high-order representations. The messages $M\_u$ and $M\_i$ are then used to jointly update $E\_u$ and $E\_i$.
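A minimal NumPy sketch of the jump hypergraph convolution in Eq. (9.9) is given below. The function names `hgnn_propagation` and `jump_hyconv` are hypothetical, the nonlinearity is assumed to be ReLU (the text does not fix a choice), and the parameter matrix is assumed to be square so that the skip term can be added.

```python
import numpy as np

def hgnn_propagation(H):
    """Return D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} for an incidence matrix H."""
    H = np.asarray(H, dtype=float)
    d_v = H.sum(axis=1)                                   # vertex degrees
    d_e = H.sum(axis=0)                                   # hyperedge degrees
    dv_inv_sqrt = np.zeros_like(d_v)
    de_inv = np.zeros_like(d_e)
    np.power(d_v, -0.5, out=dv_inv_sqrt, where=d_v > 0)   # isolated vertices stay 0
    np.power(d_e, -1.0, out=de_inv, where=d_e > 0)        # empty hyperedges stay 0
    left = dv_inv_sqrt[:, None] * H * de_inv[None, :]     # D_v^{-1/2} H D_e^{-1}
    return left @ (H.T * dv_inv_sqrt[None, :])            # ... H^T D_v^{-1/2}

def jump_hyconv(X, H, Theta):
    """One JHyConv layer: ReLU(P X Theta + X), with ReLU as an assumed sigma."""
    return np.maximum(hgnn_propagation(H) @ X @ Theta + X, 0.0)

# Two vertices joined by a single hyperedge, one feature per vertex.
H = np.array([[1.0], [1.0]])
X = np.array([[2.0], [4.0]])
Theta = np.eye(1)
X_next = jump_hyconv(X, H, Theta)
```

In this toy case the aggregated message is the mean of the two features (3), and the skip term adds each vertex's own value back, so the two vertices remain distinguishable after the layer.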

#### **(3) Joint Information Updating**

The goal of the joint information updating is to extract information that is discriminatory for users and items, which is formulated by

$$\begin{cases} E\_u' = \text{JMU}(M\_u, M\_i) \\ E\_i' = \text{JMU}(M\_i, M\_u) \end{cases},\tag{9.10}$$

where any learnable feed-forward neural network may be used for JMU$(\cdot, \cdot)$. The updated embeddings for users and items are denoted by $E\_u'$ and $E\_i'$, respectively. Here, a shared fully connected layer is applied.

#### **(4) Overall DHCF Layer**

The two stages of DHCF framework are illustrated in Figs. 9.4 and 9.5, respectively. The high-order information passing and joint information updating constitute an integrated DHCF layer, which, thanks to its powerful hypergraph structure, can directly model and encode the high-order connectivity.

With the specified HyConv and JMU, a DHCF configuration can be formulated as follows:

$$\begin{cases} f(\cdot, \cdot) = \cdot \,\|\, \cdot \\ \text{HyConv}(\cdot, \cdot) = \text{JHyConv}(\cdot, \cdot \mid \Theta) \\ \text{JMU}(\cdot, \cdot) = \text{MLP}\_1(\cdot) \end{cases}, \tag{9.11}$$

where MLP$\_1(\cdot)$ is a fully connected layer, $\Theta$ denotes the trainable parameters, and $\cdot \| \cdot$ is the concatenation operation.

**Fig. 9.4** The first stage of the DHCF framework

**Fig. 9.5** The second stage of the DHCF framework

The matrix form of the embedding propagation on hypergraph can be written as follows:

$$\begin{aligned} \mathbf{H}\_{u} &= \mathbf{H} \,\|\, \left( \mathbf{H} (\mathbf{H}^{\top} \mathbf{H}) \right) && \text{(construct hypergraphs)} \\ \mathbf{H}\_{i} &= \mathbf{H}^{\top} \,\|\, \left( \mathbf{H}^{\top} (\mathbf{H} \mathbf{H}^{\top}) \right) \\ \mathbf{M}\_{u}^{(l)} &= \mathbf{D}\_{u\_{v}}^{-1/2} \mathbf{H}\_{u} \mathbf{D}\_{u\_{e}}^{-1} \mathbf{H}\_{u}^{\top} \mathbf{D}\_{u\_{v}}^{-1/2} \mathbf{E}\_{u}^{(l)} + \mathbf{E}\_{u}^{(l)} && \text{(Phase 1)} \\ \mathbf{M}\_{i}^{(l)} &= \mathbf{D}\_{i\_{v}}^{-1/2} \mathbf{H}\_{i} \mathbf{D}\_{i\_{e}}^{-1} \mathbf{H}\_{i}^{\top} \mathbf{D}\_{i\_{v}}^{-1/2} \mathbf{E}\_{i}^{(l)} + \mathbf{E}\_{i}^{(l)} \\ \mathbf{E}\_{u}^{(l+1)} &= \sigma(\mathbf{M}\_{u}^{(l)} \Theta^{(l)}) && \text{(Phase 2)} \\ \mathbf{E}\_{i}^{(l+1)} &= \sigma(\mathbf{M}\_{i}^{(l)} \Theta^{(l)}) \end{aligned} \tag{9.12}$$

where $\mathbf{D}\_{u\_v}$, $\mathbf{D}\_{u\_e}$ and $\mathbf{D}\_{i\_v}$, $\mathbf{D}\_{i\_e}$ are the vertex degree and hyperedge degree matrices of the user hypergraph $\mathbf{H}\_u$ and the item hypergraph $\mathbf{H}\_i$, respectively. $\mathbf{E}\_u^{(l)}$ and $\mathbf{E}\_i^{(l)}$ are the inputs of layer $l$, while $\mathbf{E}\_u^{(l+1)}$ and $\mathbf{E}\_i^{(l+1)}$ are its outputs.

With the introduced framework, the collaborative signals in the user–item network are modeled and captured, thus achieving better representation.

#### *9.2.2 Attribute Inference*

A CF-based recommender system has the cold-start problem when there is a lack of historical behavior data of users, making it challenging to personalize recommendations to individual users. Making use of user and item attribute data is a potential answer to this issue. The attribute information of users usually includes gender, age, occupation, etc. The attribute information of an item can be the genre of a movie or music, or the classification of an item on an e-commerce website, etc. According to the principle of CF, similar users will choose similar items, and the attribute information can then be used to establish the similarity between users or items. The addition of attribute information can build up the association between users and items in the absence of user historical behaviors, which can well alleviate the cold-start problem. In other words, attribute information can assist in collaborative filtering.

However, attribute information is often insufficient, as many people are reluctant to provide true personal information. Therefore, attribute inference becomes an important task. It and the recommendation task are mutually reinforcing: high-quality attributes improve collaborative filtering, while more accurate user behavior data in turn help infer the attributes of users and items.

In this section, we discuss a multi-task learning framework that combines the attribute inference task with the recommendation task. The framework first utilizes multi-channel hypergraph CF for representation learning, then performs the two downstream tasks simultaneously, and lastly optimizes the model through the downstream tasks. The pipeline of the framework is presented in Fig. 9.6.

**Fig. 9.6** The pipeline of multi-channel hypergraph neural networks for recommendation and attribute inference

#### **(1) Multi-Channel Hypergraph Collaborative Filtering**

**Multi-Channel Hypergraph Construction** In order to model the higher-order interactions and attributes between users and items, two hypergraphs, named Interaction Hypergraph and Attribute Hypergraph, are constructed and denoted as *I* and *A* for simplicity.

The structure of *I* is generated from the interactions between users and items. The implicit interaction matrix is represented as $\mathbf{R} \in \mathbb{R}^{n\_u \times n\_v}$, where $n\_u$ and $n\_v$ denote the numbers of users and items, respectively. With the *k*-order reachable rule introduced in the previous subsection, we generate the hyperedges by connecting each user and each item with its 1-order reachable users and items. The incidence matrices can be expressed as

$$\begin{aligned} \mathbf{H}\_I^u(i,j) &= \begin{cases} 1, & \text{if user}\_i \text{ interacts with item}\_j \\ 0, & \text{otherwise} \end{cases}, \\ \mathbf{H}\_I^v(i,j) &= \begin{cases} 1, & \text{if item}\_i \text{ interacts with user}\_j \\ 0, & \text{otherwise} \end{cases}. \end{aligned} \tag{9.13}$$

It is obvious that $\mathbf{H}\_I^u = \mathbf{R}$ and $\mathbf{H}\_I^v = \mathbf{R}^\top$.

The structure of *A* is generated from the attribute information of users and items. The binary attribute matrices of users and items are denoted by $\mathbf{X} \in \mathbb{R}^{n\_u \times n\_p}$ and $\mathbf{Y} \in \mathbb{R}^{n\_v \times n\_q}$, where $n\_p$ and $n\_q$ denote the numbers of user and item attributes, respectively.

Attributes represent hyperedges, and vertices with the same attributes are connected by hyperedges. The incidence matrix can be formulated as

$$\begin{aligned} \mathbf{H}\_A^u(i,j) &= \begin{cases} 1, & \text{if user}\_i \text{ has attribute}\_j \\ 0, & \text{otherwise} \end{cases}, \\ \mathbf{H}\_A^v(i,j) &= \begin{cases} 1, & \text{if item}\_i \text{ has attribute}\_j \\ 0, & \text{otherwise} \end{cases}. \end{aligned} \tag{9.14}$$

Here we have $\mathbf{H}\_A^u = \mathbf{X}$ and $\mathbf{H}\_A^v = \mathbf{Y}$.

**Multi-Channel Hypergraph Learning** When the hypergraph structure has been generated, the multi-channel hypergraph convolution is performed separately. It can be written as

$$\mathbf{X}^{(k+1)} = \sigma(\mathbf{D}\_v^{-1/2}\mathbf{H}\mathbf{D}\_e^{-1}\mathbf{H}^\top\mathbf{D}\_v^{-1/2}\mathbf{X}^{(k)}),\tag{9.15}$$

where $\mathbf{X}^{(k)}$ denotes the vertex embeddings after *k* convolution layers; in our case, it is replaced by $\mathbf{U}\_c^{(k)}$ and $\mathbf{V}\_c^{(k)}$ for the user and item embeddings on channel $c \in \{A, I\}$. To alleviate the over-smoothing problem, the results obtained from the *K* propagation layers are averaged as below:

$$\mathbf{U}\_c^\* = \frac{1}{K+1} \sum\_{k=0}^K \mathbf{U}\_c^{(k)}, \quad \mathbf{V}\_c^\* = \frac{1}{K+1} \sum\_{k=0}^K \mathbf{V}\_c^{(k)}.\tag{9.16}$$
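The propagation and layer averaging of Eqs. (9.15) and (9.16) can be sketched as follows for a single channel. The names `propagation` and `averaged_embeddings` are hypothetical, and the nonlinearity is passed in as an argument that defaults to the identity, which is an assumption made only to keep the toy example easy to check by hand.

```python
import numpy as np

def propagation(H):
    """D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2}, assuming all degrees are positive."""
    H = np.asarray(H, dtype=float)
    dv_inv_sqrt = 1.0 / np.sqrt(H.sum(axis=1))
    de_inv = 1.0 / H.sum(axis=0)
    return (dv_inv_sqrt[:, None] * H * de_inv[None, :]) @ (H.T * dv_inv_sqrt[None, :])

def averaged_embeddings(H, X0, K, sigma=lambda Z: Z):
    """Run K propagation layers (Eq. 9.15) and average layers 0..K (Eq. 9.16)."""
    P = propagation(H)
    layers = [np.asarray(X0, dtype=float)]
    for _ in range(K):
        layers.append(sigma(P @ layers[-1]))
    return sum(layers) / (K + 1)

# Two vertices in one hyperedge: every layer after the input is fully smoothed.
H = np.array([[1.0], [1.0]])
X0 = np.array([[2.0], [4.0]])
X_star = averaged_embeddings(H, X0, K=2)
```

The toy example shows why the average helps: layers 1 and 2 are identical for both vertices, and only the layer-0 term keeps the averaged embeddings apart.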

Moreover, to aggregate information from different channels, a channel attention mechanism is leveraged to generate the comprehensive user and item embeddings. It is defined as

$$\alpha\_{u}^{c} = f\_{a}(\mathbf{U}\_{c}^{\*}) = \frac{\exp(\mathbf{a}\_{u}^{\top} \cdot \mathbf{W}\_{a}^{c,u} \mathbf{U}\_{c}^{\*})}{\sum\_{c'} \exp(\mathbf{a}\_{u}^{\top} \cdot \mathbf{W}\_{a}^{c',u} \mathbf{U}\_{c'}^{\*})},\tag{9.17}$$

$$\alpha\_v^c = f\_a(\mathbf{V}\_c^\*) = \frac{\exp(\mathbf{a}\_v^\top \cdot \mathbf{W}\_a^{c,v} \mathbf{V}\_c^\*)}{\sum\_{c'} \exp(\mathbf{a}\_v^\top \cdot \mathbf{W}\_a^{c',v} \mathbf{V}\_{c'}^\*)},\tag{9.18}$$

where $\mathbf{W}\_a \in \mathbb{R}^{d \times d}$ is the trainable parameter, and $d$ denotes the embedding dimension. The comprehensive representations can be formulated as

$$\mathbf{U}^\* = \sum\_c \alpha\_u^c \mathbf{U}\_c^\*, \mathbf{V}^\* = \sum\_c \alpha\_v^c \mathbf{V}\_c^\*,\tag{9.19}$$

where *c* ∈ {*Au, Iu, Av, Iv*}.
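One way to read Eqs. (9.17)–(9.19) is as a softmax over channels that is evaluated separately for every vertex. The sketch below follows that reading, which is an assumption since the text does not spell out the shapes; the function name `channel_attention` and the dictionary-based interface are likewise hypothetical.

```python
import numpy as np

def channel_attention(channels, W, a):
    """Fuse per-channel embeddings with per-vertex softmax attention.

    channels: dict mapping channel name -> (n, d) embedding matrix
    W:        dict mapping channel name -> (d, d) projection matrix
    a:        (d,) attention vector
    """
    names = list(channels)
    # scores[c, i] = a^T W_c e_i for vertex i on channel c
    scores = np.stack([channels[c] @ W[c].T @ a for c in names], axis=0)
    scores -= scores.max(axis=0, keepdims=True)        # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)          # softmax over channels
    fused = sum(alpha[k][:, None] * channels[c] for k, c in enumerate(names))
    return fused, alpha

rng = np.random.default_rng(0)
d = 4
channels = {"A": rng.normal(size=(3, d)), "I": rng.normal(size=(3, d))}
W = {c: rng.normal(size=(d, d)) for c in channels}
a = rng.normal(size=d)
U_star, alpha = channel_attention(channels, W, a)
```

Because the weights of each vertex sum to one across channels, the fused embedding is a convex combination of the channel embeddings of that vertex.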

The graph convolution is leveraged in order to further exploit the interaction data between users and items. It can be formulated as

$$
\begin{pmatrix} \mathbf{U}^{\*(j+1)} \\ \mathbf{V}^{\*(j+1)} \end{pmatrix} = \mathbf{D}^{-1/2} \begin{pmatrix} \mathbf{0} & \mathbf{R} \\ \mathbf{R}^{\top} & \mathbf{0} \end{pmatrix} \mathbf{D}^{-1/2} \begin{pmatrix} \mathbf{U}^{\*(j)} \\ \mathbf{V}^{\*(j)} \end{pmatrix}, \tag{9.20}
$$

$$
\hat{\mathbf{U}} = \frac{1}{J+1} \sum\_{j=0}^{J} \mathbf{U}^{\*(j)}, \quad \hat{\mathbf{V}} = \frac{1}{J+1} \sum\_{j=0}^{J} \mathbf{V}^{\*(j)},\tag{9.21}
$$

where *J* is the number of graph convolution layers.
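The graph convolution of Eqs. (9.20) and (9.21) can be sketched in NumPy as below. The function name `bipartite_propagation` is hypothetical, and nodes without any interaction are simply given a zero normalization factor, which is an assumption made to avoid division by zero.

```python
import numpy as np

def bipartite_propagation(R, U0, V0, J):
    """Propagate user/item embeddings over the user-item graph and average layers."""
    R = np.asarray(R, dtype=float)
    n_u, n_v = R.shape
    # Symmetric adjacency of the bipartite user-item graph.
    A = np.block([[np.zeros((n_u, n_u)), R],
                  [R.T, np.zeros((n_v, n_v))]])
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    np.power(deg, -0.5, out=d_inv_sqrt, where=deg > 0)
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    Z = np.vstack([U0, V0]).astype(float)
    acc = Z.copy()
    for _ in range(J):
        Z = A_hat @ Z                                        # Eq. (9.20)
        acc += Z
    out = acc / (J + 1)                                      # Eq. (9.21)
    return out[:n_u], out[n_u:]

# One user who interacted with one item.
R = np.array([[1.0]])
U_hat, V_hat = bipartite_propagation(R, np.array([[2.0]]), np.array([[4.0]]), J=1)
```

In the toy case one propagation step swaps the two embeddings, so the average over layers 0 and 1 is 3 for both the user and the item.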

#### **(2) Recommendation Task and Attribute Inference Task**

Following up the representation learning through multi-channel hypergraph collaborative filtering, the two downstream tasks can be performed simultaneously.

First, based on the idea of matrix factorization, the user and item interaction can be predicted as

$$
\hat{\mathbf{R}} = \hat{\mathbf{U}} \hat{\mathbf{V}}^{\top}.\tag{9.22}
$$

Next, we consider the nature of the relationship between attributes and vertices, and a subtle method of attribute inference is discussed. Also inspired by matrix factorization, the attribute matrix can be regarded as the product of two low-rank matrices. It can be formulated as

$$
\hat{\mathbf{X}} = \hat{\mathbf{U}} \mathbf{P}^{\top}, \hat{\mathbf{Y}} = \hat{\mathbf{V}} \mathbf{Q}^{\top}, \tag{9.23}
$$

where $\mathbf{P} \in \mathbb{R}^{n\_p \times d}$ and $\mathbf{Q} \in \mathbb{R}^{n\_q \times d}$ are the user and item attribute representations. Matrix factorization is a reasonable choice for attribute inference because an observed attribute value depends on both the properties of the vertex and the properties of the attribute itself; neither can be represented without the other. In conclusion, the benefit of processing the two distinct tasks concurrently with this method is that it permits information sharing while allowing a high degree of autonomy between the two training activities.

#### **(3) Joint Optimization**

The recommendation task is optimized with a pairwise loss called Bayesian Personalized Ranking (BPR), which encourages the predicted scores of observed interactions to be higher than those of unobserved ones. It can be written as

$$\mathcal{L}\_r = \sum\_{j \in \mathcal{J}'(i), k \notin \mathcal{J}'(i)} -\log \sigma(\hat{r}\_{i,j} - \hat{r}\_{i,k}) + \lambda \|\Phi\_r\|\_2^2,\tag{9.24}$$

where $\Phi\_r$ represents the model parameters, $\mathcal{J}'(i)$ is the set of items with observed interactions for user$\_i$, and $\hat{r}\_{i,j} = \mathbf{u}\_i^\top \mathbf{v}\_j$ is the predicted score that indicates how likely user$\_i$ is to be interested in item$\_j$. The sigmoid function is denoted as $\sigma(\cdot)$.

Next, the attribute inference task can be regarded as an attribute categories classification problem. The cross-entropy loss is then leveraged for optimizing the attribute inference task. It can be written as

$$\begin{split} \mathcal{L}\_{i}^{u} &= -\frac{1}{n\_{u}} \sum\_{i=1}^{n\_{u}} \sum\_{j=1}^{n\_{p}} \mathbf{x}\_{ij} \log(\hat{\mathbf{x}}\_{ij}), \\ \mathcal{L}\_{i}^{v} &= -\frac{1}{n\_{v}} \sum\_{i=1}^{n\_{v}} \sum\_{j=1}^{n\_{q}} \mathbf{y}\_{ij} \log(\hat{\mathbf{y}}\_{ij}), \\ \mathcal{L}\_{i} &= \mathcal{L}\_{i}^{u} + \mathcal{L}\_{i}^{v}, \end{split} \tag{9.25}$$

where $\hat{\mathbf{x}}\_{i,j} = \mathbf{u}\_i^\top \mathbf{p}\_j$ is the inference score of user$\_i$ on user attribute$\_j$, and $\hat{\mathbf{y}}\_{i,j} = \mathbf{v}\_i^\top \mathbf{q}\_j$ is the inference score of item$\_i$ on item attribute$\_j$.

Finally, the sum of the losses from the two tasks is the overall loss. It can be written as

$$
\mathcal{L} = \mathcal{L}\_r + \gamma \cdot \mathcal{L}\_i, \tag{9.26}
$$

where *γ* is the hyperparameter for balancing the two different losses.
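The joint objective can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the BPR sum runs over an explicit list of sampled (user, observed item, unobserved item) triplets, the regularizer is assumed to cover only the embedding matrices, and a sigmoid is applied to the inner products in the attribute loss (an assumption) so that the logarithm is always defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_loss(U, V, triplets, lam=0.0):
    """BPR loss over (user i, observed item j, unobserved item k) triplets."""
    i, j, k = np.asarray(triplets).T
    diff = np.sum(U[i] * V[j], axis=1) - np.sum(U[i] * V[k], axis=1)
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2))   # assumed parameter set
    return -np.sum(np.log(sigmoid(diff))) + reg

def attribute_loss(E, P, X, eps=1e-12):
    """Cross-entropy between binary attributes X and scores sigmoid(E P^T)."""
    X_hat = sigmoid(E @ P.T)
    return -np.sum(X * np.log(X_hat + eps)) / X.shape[0]

def joint_loss(U, V, P, Q, X, Y, triplets, gamma=1.0, lam=0.0):
    """Recommendation loss plus gamma times the attribute inference loss."""
    l_r = bpr_loss(U, V, triplets, lam)
    l_a = attribute_loss(U, P, X) + attribute_loss(V, Q, Y)
    return l_r + gamma * l_a

# Toy check with all-zero embeddings: every sigmoid evaluates to 0.5.
U = np.zeros((2, 3)); V = np.zeros((2, 3))
P = np.zeros((1, 3)); Q = np.zeros((1, 3))
X = np.array([[1.0], [0.0]]); Y = np.array([[0.0], [0.0]])
total = joint_loss(U, V, P, Q, X, Y, triplets=[(0, 0, 1)], gamma=2.0)
```

With zero embeddings the single triplet contributes log 2, the one positive user attribute contributes (log 2)/2, and the weight of 2 on the attribute term brings the total to 2 log 2.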

Although in this section we only discuss two instances, i.e., collaborative filtering and attribute inference, applications of hypergraph computation in recommender system do not end there. In collaborative filtering, only the historical interaction data are utilized, and the hypergraph is constructed upon the similarity of users/items in behavior space. In attribute inference, the attribute information of users and items is further utilized to solve the cold-start problem. In this case, the hypergraph is constructed based on both behaviors and attributes. In addition to the behavior and attribute data, the context data, such as the time, location, weather, etc., can also be integrated, and hypergraph can also be applied to model the complex correlations among these data. Also, the user–item network sometimes can be multiplex, that is, there may exist various kinds of interactions between users and items, e.g., a user may view, click, and purchase an item. How to adopt the hypergraph to model such multiplex connections also remains to be explored.

#### **9.3 Sentiment Analysis**

The emergence of Twitter and Sina Weibo has given social media users a place to share their thoughts and emotions about particular occurrences. At the same time, this information is rapidly and widely disseminated throughout social networks. Therefore, how to analyze the information in social media becomes an important issue.

First, in terms of the sentiment dimension, microblog sentiment research has numerous potential applications in event monitoring, social network analysis, and business advice. By analyzing the sentiment of massive data, we can obtain the emotional attitude of netizens toward relevant events. Second, in terms of the temporal dimension, real-time multimedia data may travel quickly and widely throughout the social network, having a significant impact on society. Therefore, efficient real-time detection can help government organizations with macroeconomic control and large corporations with marketing management.

Twitter data are multi-modal, including text, images, emojis, videos, etc. The higher-order associations between different modalities can be well modeled by hypergraphs to extract sentiment information. In the following subsections, we provide two examples [5, 6] that use hypergraphs to analyze microblog data along these two dimensions, respectively.

#### *9.3.1 Sentiment Prediction*

Predicting the multi-modal sentiment of tweets is not an easy task. Most sentiment analysis models focus on the textual or visual channel only. However, in human emotional perception, different moods have their own characteristics, so sentiment analysis should be based on multiple perspectives. Even with multi-channel data, it is uncertain whether the emotions of different channels are related. Moreover, there are cases where some channels are missing. To address these problems, a two-layer multi-modal hypergraph learning framework [5] is introduced to perform multi-modal sentiment prediction.

This framework's objective is to predict the sentiment of given multi-modal microblog data (e.g., a Weibo tweet) that include text, visuals, and emoticons. The bag-of-textual-words feature $F\_i^{botw} = \{w\_1^t, \dots, w\_{m\_t}^t\}$ is extracted for the textual modality. The visual modality feature $F\_i^{bovw} = \{w\_1^v, \dots, w\_{m\_v}^v\}$ is extracted from the *i*-th image. Furthermore, an emoticon dictionary is defined for the emoticon modality, which forms the bag-of-emoticon-words feature $F\_i^{boew} = \{w\_1^e, \dots, w\_{m\_e}^e\}$. The corresponding sentiment scores $s\_k^t$, $s\_k^v$, and $s\_k^e$ are assigned to $w\_k^t$, $w\_k^v$, and $w\_k^e$, respectively. Consequently, the tweet $x\_i$ can be denoted as $\{F\_i^{botw}, F\_i^{bovw}, F\_i^{boew}\}$. By investigating $F\_i^{botw}$, $F\_i^{bovw}$, and $F\_i^{boew}$ simultaneously, the sentiment of $x\_i$ can be predicted.

#### **(1) Multi-Modal Hypergraph Learning**

To create the incidence matrix of the hypergraph, the correlation between each tweet and the "centroid" tweets of various modalities is first computed. Each tweet is treated as a vertex, and each hyperedge connects a centroid tweet with its *k* nearest neighbors in one modality. It is important to note that each vertex can serve as a centroid. The incidence matrix can be defined as

$$\mathbf{H}(v\_i, e\_j) = \begin{cases} s(j, i), & \text{if } v\_i \in e\_j \\ 0, & \text{otherwise} \end{cases},\tag{9.27}$$

where $s(j, i) = \exp\left(-\frac{dist(i,j)^2}{\sigma \hat{d}^2}\right)$ is the correlation between $v\_i$ and $e\_j$, and $dist(i, j)$ is the Euclidean distance between $v\_i$ and the centroid vertex of $e\_j$. $\hat{d}$ is the average pairwise distance for the corresponding modality, and the parameter $\sigma$ is set empirically to adjust the normalization of the tweet relevance. The weight of each hyperedge is initialized to 1.
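For a single modality, the incidence matrix of Eq. (9.27) can be sketched as below. The function name `modality_incidence` is hypothetical, and the average pairwise distance is assumed to be taken over all distinct pairs of tweets. The multi-modal incidence matrix would then be obtained by concatenating the per-modality matrices along the hyperedge axis.

```python
import numpy as np

def modality_incidence(F, k, sigma=1.0):
    """Soft incidence matrix in which every vertex acts as a hyperedge centroid.

    F: (n, d) feature matrix of one modality; k: number of nearest neighbors.
    """
    F = np.asarray(F, dtype=float)
    n = F.shape[0]
    dist = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    d_hat = dist[np.triu_indices(n, 1)].mean()      # average pairwise distance
    H = np.zeros((n, n))
    for j in range(n):
        order = np.argsort(dist[:, j])
        neighbors = order[order != j][:k]           # k nearest neighbors of centroid j
        members = np.append(neighbors, j)           # the centroid belongs to its own hyperedge
        H[members, j] = np.exp(-dist[members, j] ** 2 / (sigma * d_hat ** 2))
    return H

# Three tweets described by a one-dimensional feature.
F = np.array([[0.0], [1.0], [10.0]])
H = modality_incidence(F, k=1)
```

Each column holds exactly k + 1 non-zero entries, with the centroid itself receiving the maximal weight of 1.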

In multi-modal hypergraph learning (MHG) [5], guided inference is used to perform hypergraph learning. It calculates the relevance scores of tweets with respect to different sentiments by iteratively updating the relevance score vector **f** and the hyperedge weights **W**. This is accomplished by optimizing the following objective:

$$\begin{aligned} \arg\min\_{\mathbf{f},\mathbf{W}} & \{ \Omega(\mathbf{f}) + \lambda \mathcal{R}\_{emp}(\mathbf{f}) + \mu \sum\_{i=1}^{n\_e} w\_i^2 \}, \\ & \text{s.t. } \sum\_{i=1}^{n\_e} w\_i = 1, \end{aligned} \tag{9.28}$$

where **f** is the learned relevance score, $\Omega(\mathbf{f})$ is a regularizer built on the normalized hypergraph Laplacian, $\mathcal{R}\_{emp}(\mathbf{f}) = \|\mathbf{f} - \mathbf{y}\|^2$ denotes the empirical loss, and $\sum\_{i=1}^{n\_e} w\_i^2$ is the regularizer on the hyperedge weights. $\Omega(\mathbf{f})$ can be formulated as

$$\Omega(\mathbf{f}) = \frac{1}{2} \sum\_{e \in \mathcal{E}} \sum\_{u,v \in \mathcal{V}} \frac{w(e)h(u,e)h(v,e)}{\delta(e)} \times \left(\frac{\mathbf{f}(u)}{\sqrt{d(u)}} - \frac{\mathbf{f}(v)}{\sqrt{d(v)}}\right)^2,\tag{9.29}$$

where $d(v) = \sum\_{e \in \mathcal{E}} w(e)h(v, e)$ denotes the vertex degree and $\delta(e) = \sum\_{v \in \mathcal{V}} h(v, e)$ denotes the hyperedge degree. The diagonal matrices of $d(v)$ and $\delta(e)$ are represented as $\mathbf{D}\_v$ and $\mathbf{D}\_e$, respectively. Let $\Theta = \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2}$, and let $\Delta = \mathbf{I} - \Theta$ be the hypergraph Laplacian. The normalized cost function can be expressed as

$$
\Omega(\mathbf{f}) = \mathbf{f}^{\top} \Delta \mathbf{f}.\tag{9.30}
$$

The two parameters **W** and **f** are optimized iteratively using the following two functions:

$$\arg\min\_{\mathbf{f}} \Phi(\mathbf{f}) = \arg\min\_{\mathbf{f}} \{ \mathbf{f}^\top \Delta \mathbf{f} + \lambda \left\| \mathbf{f} - \mathbf{y} \right\|^2 \},\tag{9.31}$$

$$\begin{aligned} \arg\min\_{\mathbf{W}} \Phi(\mathbf{W}) &= \arg\min\_{\mathbf{W}} \{ \mathbf{f}^\top \Delta \mathbf{f} + \mu \sum\_{i=1}^{n\_e} w\_i^2 \}, \\ &\text{s.t. } \sum\_{i=1}^{n\_e} w\_i = 1. \end{aligned} \tag{9.32}$$

As shown above, MHG models the sample–sample relation for the purpose of hypergraph construction. The properties of the modalities and their relevance to one another, however, are not fully utilized.
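For fixed hyperedge weights, the subproblem in Eq. (9.31) has a closed-form solution: because the Laplacian is symmetric, setting the gradient 2Δf + 2λ(f − y) to zero gives (Δ + λI) f = λy. The sketch below solves this linear system; the function names are hypothetical, all vertex and hyperedge degrees are assumed to be positive, and the weight update of Eq. (9.32) is not covered.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} (positive degrees assumed)."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w                          # vertex degrees d(v)
    d_e = H.sum(axis=0)                  # hyperedge degrees delta(e)
    S = H / np.sqrt(d_v)[:, None]        # D_v^{-1/2} H
    Theta = (S * (w / d_e)[None, :]) @ S.T
    return np.eye(H.shape[0]) - Theta

def update_f(Delta, y, lam):
    """Minimizer of f^T Delta f + lam * ||f - y||^2 for a fixed Laplacian."""
    n = Delta.shape[0]
    return np.linalg.solve(Delta + lam * np.eye(n), lam * np.asarray(y, dtype=float))

# Three tweets, two hyperedges, labels +1 / unknown / -1.
H = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
w = np.array([0.5, 0.5])
y = np.array([1.0, 0.0, -1.0])
lam = 1.0
Delta = hypergraph_laplacian(H, w)
f = update_f(Delta, y, lam)
grad = 2 * Delta @ f + 2 * lam * (f - y)   # should vanish at the minimizer
```

Solving the linear system directly avoids forming an explicit matrix inverse, which is usually preferable numerically.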

#### **(2) Dual-Layer Multi-Modal Hypergraph Learning**

Dual-layer multi-modal hypergraph learning is composed of two hypergraph layers, $\mathcal{G}\_1 = (\mathcal{V}\_1, \mathcal{E}\_1, \mathbf{W})$ for the tweet-level hypergraph and $\mathcal{G}\_2 = (\mathcal{V}\_2, \mathcal{E}\_2, \mathbf{M})$ for the feature-level hypergraph, respectively.

To allow multi-modal features to be adopted more explicitly and to directly construct multi-modal hypergraphs for modal correlation, each hypergraph layer of dual-layer multi-modal hypergraph learning uses relations between vertex and hyperedge to represent sample features or relations between features and samples, rather than relations between samples in MHG.

The sentiment label vector of tweets and the sentiment label vector of multi-modal sentiment words are denoted by **y** and **t**, respectively, in the two hypergraph layers. Accordingly, **f** and **g** are the vectors representing the relevance scores of tweets and of multi-modal features/words in the two layers, respectively. **M** can be regarded as the confidence ratings of the sentiment labels **y**, which correspond to **f** in the tweet-level hypergraph. The two hypergraph layers are connected, and the multi-modal relevance of features is transferred to the tweet-level hypergraph in order to help predict tweet sentiment.

The probabilistic incidence matrix of a hypergraph is written as

$$\mathbf{H}\_{\*}(v\_{i},e\_{j}) = \begin{cases} 1, & \text{if } v\_{i} \in e\_{j} \\ 0, & \text{otherwise} \end{cases},\tag{9.33}$$

where ∗ denotes either 1 or 2, and the same applies below.

The following loss function can be optimized to represent the learning process:

$$\begin{aligned} \arg\min\_{\mathbf{f}, \mathbf{g}, \mathbf{W}, \mathbf{M}} \{ \Omega\_1(\mathbf{f}) + \lambda\_1 \mathcal{R}\_{emp1}(\mathbf{f}) + \mu\_1 \sum\_{i=1}^{n\_{e1}} \mathbf{W}\_i^2 + \Omega\_2(\mathbf{g}) + \lambda\_2 \mathcal{R}\_{emp2}(\mathbf{g}) + \mu\_2 \sum\_{i=1}^{n\_{e2}} \mathbf{M}\_i^2 \}, \\ \text{s.t. } \begin{cases} \sum\_{i=1}^{n\_{e1}} \mathbf{W}\_i = 1 \\ \sum\_{i=1}^{n\_{e2}} \mathbf{M}\_i = 1 \end{cases}, \end{aligned} \tag{9.34}$$

where $\Omega\_1(\mathbf{f})$ and $\Omega\_2(\mathbf{g})$ are regularizers based on the normalized hypergraph Laplacian, $\mathcal{R}\_{emp1}(\mathbf{f}) = \|\mathbf{f} - \mathbf{y} \circ \mathbf{M}\|^2$ and $\mathcal{R}\_{emp2}(\mathbf{g}) = \|\mathbf{g} - \mathbf{t}\|^2$ are the empirical losses, and $\sum\_{i=1}^{n\_{e1}} \mathbf{W}\_i^2$ and $\sum\_{i=1}^{n\_{e2}} \mathbf{M}\_i^2$ are the $L\_2$ regularizers on the hyperedge weights. The regularizers $\Omega\_1(\mathbf{f})$ and $\Omega\_2(\mathbf{g})$ are further described as follows:

$$\begin{array}{l} \boldsymbol{\Omega}\_{1}(\mathbf{f}) = \mathbf{f}^{\top}(\mathbf{I} - \mathbf{D}\_{v1}^{-1/2}\mathbf{H}\_{1}\mathbf{W}\mathbf{D}\_{e1}^{-1}\mathbf{H}\_{1}^{\top}\mathbf{D}\_{v1}^{-1/2})\mathbf{f},\\ \boldsymbol{\Omega}\_{2}(\mathbf{g}) = \mathbf{g}^{\top}(\mathbf{I} - \mathbf{D}\_{v2}^{-1/2}\mathbf{H}\_{2}\mathbf{M}\mathbf{D}\_{e2}^{-1}\mathbf{H}\_{2}^{\top}\mathbf{D}\_{v2}^{-1/2})\mathbf{g}.\end{array} \tag{9.35}$$

The loss function then has the following form in terms of **f***,***W***,* **g**, and **M**:

$$\begin{split} \mathcal{L}(\mathbf{f}, \mathbf{W}, \mathbf{g}, \mathbf{M}) &= \Omega\_{1}(\mathbf{f}) + \lambda\_{1} \mathcal{R}\_{emp1}(\mathbf{f}) + \mu\_{1} \sum\_{i=1}^{n\_{e1}} \mathbf{W}\_{i}^{2} \\ &+ \Omega\_{2}(\mathbf{g}) + \lambda\_{2} \mathcal{R}\_{emp2}(\mathbf{g}) + \mu\_{2} \sum\_{i=1}^{n\_{e2}} \mathbf{M}\_{i}^{2} \\ &+ \eta\_{1} \left( \sum\_{i=1}^{n\_{e1}} \mathbf{W}\_{i} - 1 \right) + \eta\_{2} \left( \sum\_{i=1}^{n\_{e2}} \mathbf{M}\_{i} - 1 \right). \end{split} \tag{9.36}$$

To summarize, we introduced a two-layer multi-modal hypergraph learning framework that models the correlations among the visual, textual, and emoticon modalities while tolerating missing modalities, in order to predict the sentiment of multi-modal tweets.

#### *9.3.2 Social Event Detection*

Social event detection is a crucial social media analysis problem and has received much attention in recent years, yet existing methods have paid less attention to the expanding visual content of microblogs and the inter-connectedness of diverse data. Figure 9.7 presents an example of a real-time social event. Event detection on social media platforms is difficult because of the distinctive nature of social media data, for the following reasons. First, because social media posts are noisy and individually do not contain enough material to provide full information, it is necessary to explore a set of posts that are closely related to one another and discuss a common topic. Second, social media posts come in a variety of multimedia formats and include information such as images, timestamps, locations, user preferences, and social connections in addition to text. Finally, social posts arrive in real time, and the large scale of such real-time data makes social events difficult to detect. Thanks to its structural advantages, a hypergraph can establish higher-order correlations among data of different posts, different modalities, and different times, thus enabling real-time event detection. In this subsection, we introduce a hypergraph-based method for real-time social event detection. The overall framework is shown in Fig. 9.8.

**Fig. 9.7** An example of a real-time social event. (**a**) Conversational text. (**b**) Heterogeneous content. (**c**) Continuously growing real-time data. Parts of this figure are from [6]

**Fig. 9.8** Overall framework of the real-time social event detection. This figure is from [6]

#### **(1) Microblog Clique Generation**

A microblog clique (MC), which consists of a collection of closely connected microblogs, is used as the basic unit instead of a single microblog in order to make up for the lack of information. The microblogs in an MC cover the same subject within a short time.

A hypergraph is used to describe the relationship between the heterogeneous data of various microblogs. A set of microblogs is denoted as $M = \{m\_1, m\_2, \dots, m\_n\}$. The constructed hypergraph is denoted as $\mathcal{G}\_H = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$, where a vertex $v$ represents a microblog and a hyperedge $e$ represents a subset of microblogs. The hyperedge weight is denoted as $w(e)$, and the weights form the diagonal matrix $\mathbf{W}$. In order to generate hyperedges, the similarity between two microblogs $m\_i$ and $m\_j$ is first determined using the following heterogeneous features.

The cosine similarity function is used for computing textual and visual similarities. The Haversine formula is used for measuring the geographical similarity. The pairwise temporal similarity is calculated by $s\_{TI}(m\_i, m\_j) = 1 - \frac{|t\_i - t\_j|}{\tau}$, where $t\_i$ and $t\_j$ are the timestamps of $m\_i$ and $m\_j$, and $\tau$ denotes a normalization constant. The pairwise social similarity is measured as

$$s\_S(m\_i, m\_j) = \begin{cases} 1, & \text{if } u\_i = u\_j \\ 0.5, & \text{if } u\_i \text{ and } u\_j \text{ are linked through the social platform} \\ 0, & \text{otherwise} \end{cases}, \tag{9.37}$$

where *ui* is the owner of *mi*.
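The pairwise measures described above can be sketched as plain Python functions. All function names are hypothetical, the social links are represented as a set of unordered user pairs (an assumption about the data layout), and the Haversine function returns a great-circle distance in kilometers using a mean Earth radius of 6371 km; how that distance is turned into a similarity is not specified in the text and is therefore left out.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (textual or visual)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))

def temporal_similarity(t_i, t_j, tau):
    """1 - |t_i - t_j| / tau, with tau a normalization constant."""
    return 1.0 - abs(t_i - t_j) / tau

def social_similarity(u_i, u_j, links):
    """1 for the same owner, 0.5 for linked owners, 0 otherwise."""
    if u_i == u_j:
        return 1.0
    return 0.5 if frozenset((u_i, u_j)) in links else 0.0

links = {frozenset(("alice", "bob"))}
half_circumference = haversine_km(0.0, 0.0, 0.0, 180.0)
```

Storing each link as a `frozenset` makes the lookup independent of the order in which the two users are given.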

Two hyperedges are created by connecting each microblog *mi* with its neighbors as per geographic distance and middle position of location and time information. For each microblog *mi*, the top *N* nearest microblogs in terms of textual information and visual content are chosen. Finally, all microblogs of the same user are connected to generate a hyperedge. The incidence matrix, vertex degree, and edge degree of the hypergraph are defined in the same way as above.

Next, MCs are generated by dividing the microblogs into groups of the same topic through the hypergraph cut approach. Assume that $S$ and $\bar{S}$ are the two parts obtained from a two-way partition of $\mathcal{G}\_H$; the hypergraph cut can then be described as

$$\begin{aligned} \text{Cut}\_H(S, \bar{S}) &:= \sum\_{e \in \partial S} w(e) \frac{|e \cap S| |e \cap \bar{S}|}{\delta(e)}, \\ \partial S &:= \{ e \in \mathcal{E} | e \cap S \neq \emptyset, \, e \cap \bar{S} \neq \emptyset \}. \end{aligned} \tag{9.38}$$

The definition of the two-way normalized partition is

$$N\text{Cut}\_H(S,\bar{S}) := \text{Cut}\_H(S,\bar{S}) \left( \frac{1}{\text{vol}(S)} + \frac{1}{\text{vol}(\bar{S})} \right),\tag{9.39}$$

where the volume of $S$ is denoted by $\text{vol}(S) = \sum\_{v \in S} d(v)$. The normalized cut problem can be relaxed into a real-valued optimization problem, whose solution is given by the eigenvectors corresponding to the smallest non-zero eigenvalues of the hypergraph Laplacian $\Delta = \mathbf{I} - \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2}$. The input microblogs $M$ are first split into two groups, and the two-way normalized partitioning is then carried out recursively in each new set until the best partitioning result is attained. The best partitioning result is selected according to the representation capacity of the candidate partitions, as measured by the Bayesian Information Criterion (BIC).
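A single two-way split and the normalized cut of Eqs. (9.38) and (9.39) can be sketched as follows. The function names are hypothetical, all degrees are assumed to be positive, and the relaxed solution is discretized by the sign of the eigenvector belonging to the smallest non-zero eigenvalue, which is one common choice rather than necessarily the rule used in [6]. The recursive splitting and the BIC-based selection are not included.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} (positive degrees assumed)."""
    d_v = H @ w
    d_e = H.sum(axis=0)
    S = H / np.sqrt(d_v)[:, None]
    return np.eye(H.shape[0]) - (S * (w / d_e)[None, :]) @ S.T

def two_way_partition(H, w, tol=1e-9):
    """Boolean membership of S from the first non-trivial eigenvector."""
    Delta = hypergraph_laplacian(H, w)
    vals, vecs = np.linalg.eigh((Delta + Delta.T) / 2)   # eigenvalues in ascending order
    first_nonzero = np.flatnonzero(vals > tol)[0]
    return vecs[:, first_nonzero] >= 0

def normalized_cut(H, w, in_S):
    """NCut_H(S, S_bar) following Eqs. (9.38) and (9.39)."""
    d_v = H @ w
    d_e = H.sum(axis=0)
    size_S = H[in_S].sum(axis=0)        # |e ∩ S| for every hyperedge
    size_Sc = H[~in_S].sum(axis=0)      # |e ∩ S_bar|; zero for uncut hyperedges
    cut = np.sum(w * size_S * size_Sc / d_e)
    return cut * (1.0 / d_v[in_S].sum() + 1.0 / d_v[~in_S].sum())

# Six microblogs: two tight groups joined by one weak hyperedge {2, 3}.
H = np.array([[1, 0, 0],
              [1, 0, 0],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 0]], dtype=float)
w = np.array([1.0, 1.0, 0.1])
in_S = two_way_partition(H, w)
score = normalized_cut(H, w, in_S)
```

In this example only the weak hyperedge is cut, so the split separates microblogs 0–2 from microblogs 3–5.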

BIC is used to choose the optimal hypergraph partitioning results. For *M* = {*m*1*, m*2*,...,mn*}, with *P* = {*P*1*, P*2*,...,Pm*} as a set of partitions, the BIC score is determined by

$$\begin{split} \text{BIC} &= \text{llh}(M) - \frac{N\_p}{2} \log n, \\ \text{llh}(M) &= \sum\_{i} \left( \log \frac{1}{\sqrt{2\pi} \hat{\theta}^{N\_p}} - \frac{1}{2\hat{\theta}^2} \left\| d(m\_i, c\_{m\_i}) \right\|^2 + \log \frac{n\_i}{n} \right), \\ \hat{\theta}^2 &= \frac{1}{n-m} \sum\_{i} d(m\_i, c\_{m\_i})^2, \end{split} \tag{9.40}$$

where *Np* is the number of parameters, taken to be the dimension of the microblog features, *n* is the number of microblogs, and *ni* is the size of the partition that contains *mi*.
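As a rough illustration of how such a score ranks candidate partitions, the following sketch evaluates a BIC of this form for a hard partition of feature vectors. Several points are assumptions made for the example rather than details taken from the text: the first likelihood term is read as the logarithm of the Gaussian normalizer, the center of a partition is the mean of its members, the distance is Euclidean, and the function name `bic_score` is hypothetical.

```python
import math


def bic_score(points, labels):
    """BIC = llh(M) - (N_p / 2) * log(n) for a hard partition of `points`."""
    n = len(points)
    dim = len(points[0])                 # N_p: taken as the feature dimension
    parts = sorted(set(labels))
    m = len(parts)                       # number of partitions (requires n > m)
    # Center of every partition (assumed to be the mean of its members).
    centers = {}
    for p in parts:
        members = [x for x, l in zip(points, labels) if l == p]
        centers[p] = [sum(col) / len(members) for col in zip(*members)]
    sizes = {p: labels.count(p) for p in parts}
    # Squared Euclidean distance of each point to the center of its partition.
    sq = [sum((a - b) ** 2 for a, b in zip(x, centers[l]))
          for x, l in zip(points, labels)]
    theta2 = sum(sq) / (n - m)           # pooled variance estimate
    llh = 0.0
    for d2, l in zip(sq, labels):
        llh += (-math.log(math.sqrt(2 * math.pi))
                - 0.5 * dim * math.log(theta2)   # log(1 / theta^N_p)
                - d2 / (2 * theta2)
                + math.log(sizes[l] / n))
    return llh - dim / 2 * math.log(n)


# Two well-separated groups: the natural split should score higher.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
better_is_good = bic_score(points, good) > bic_score(points, bad)
```

The sketch does not guard against a zero variance estimate, which occurs when every point coincides with its partition center.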

Once the microblogs have been divided into a set of MCs, each MC groups strongly correlated microblogs and therefore carries more meaningful and pertinent information than any individual microblog, which benefits the subsequent event detection step.

#### **(2) Detection of Social Events in Real Time**

**Event Detection by Using MC** For MC = {MC1*,...,* MC*p*} and the corresponding microblogs M = {*m*1*,...,mn*}, there are two observations. First, the microblogs inside a single MC frequently refer to the same event (MC cues). Second, MCs with similar features tend to be associated with the same event (smoothness cues).

To impose the MC cues, each microblog is connected to the MC that contains it. To enforce the smoothness cues, pairs of MCs that are close to one another in feature space are connected. Formally, a bipartite graph GB = {*X, Y, B*} is used to express MC and M, with two vertex sets *X* := MC ∪ M and *Y* := MC that contain |*X*| = |MC| + |M| and |*Y*| = |MC| vertices, respectively. The across-affinity matrix *B* between *X* and *Y* is defined as follows:

$$B\_{ij} = \begin{cases} \eta, & \text{if } \mathbf{x}\_i \in \mathbf{M}, \mathbf{x}\_i \in \mathbf{y}\_j, \mathbf{y}\_j \in \mathbf{MC} \\ e^{-\gamma d\_{ij}}, & \text{if } \mathbf{x}\_i \in \mathbf{MC}, \mathbf{y}\_j \in \mathbf{MC} \\ 0, & \text{otherwise} \end{cases}, \tag{9.41}$$

where *dij* is the distance between two MCs, and *η* and *γ* are the two parameters that balance the inner-MC correlation and the between-MC smoothness.
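A minimal sketch of how the across-affinity matrix could be assembled is given below. It assumes that the rows of *B* list the microblogs first and the MCs afterwards, that each microblog belongs to exactly one MC, and that the MC-to-MC distances are already available; the function and argument names are hypothetical.

```python
import math


def across_affinity(membership, mc_dist, eta=1.0, gamma=1.0):
    """Build B with |M| + |MC| rows (vertex set X) and |MC| columns (vertex set Y).

    membership[i] is the index of the MC containing microblog i;
    mc_dist[a][b] is the distance between MC a and MC b.
    """
    n_m, n_mc = len(membership), len(mc_dist)
    B = [[0.0] * n_mc for _ in range(n_m + n_mc)]
    # MC cues: a microblog is linked to its own MC with weight eta.
    for i, j in enumerate(membership):
        B[i][j] = eta
    # Smoothness cues: MC-to-MC affinity decays exponentially with distance.
    for a in range(n_mc):
        for b in range(n_mc):
            B[n_m + a][b] = math.exp(-gamma * mc_dist[a][b])
    return B


# Three microblogs, two MCs at distance 2 from each other.
B = across_affinity([0, 0, 1], [[0.0, 2.0], [2.0, 0.0]], eta=0.5, gamma=1.0)
```

A larger *η* emphasizes the membership links, while a larger *γ* makes the affinity between MCs fall off more quickly with distance.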

The transfer cut method partitions the MCs on the basis of the bipartite graph $\mathcal{G}\_B$ and the required number of partitions *K*. First, let $\mathcal{G}\_{BY} = \{Y, \mathbf{W}\_Y\}$ be the graph that contains only the MC vertices. Its graph Laplacian is $\mathbf{L}\_Y = \mathbf{D}\_Y - \mathbf{W}\_Y$, where $\mathbf{D}\_Y = \text{diag}(\mathbf{B}^\top \mathbf{1})$, $\mathbf{W}\_Y = \mathbf{B}^\top \mathbf{D}\_X^{-1} \mathbf{B}$, and $\mathbf{D}\_X = \text{diag}(\mathbf{B}\mathbf{1})$. Assume that $\{\lambda\_i, \mathbf{v}\_i\}\_{i=1}^{K}$ are the *K* smallest eigenpairs of $\mathcal{G}\_{BY}$. The corresponding eigenpairs $\{\xi\_i, \mathbf{f}\_i\}\_{i=1}^{K}$ of $\mathcal{G}\_B$ can then be calculated as

$$\begin{array}{l} 0 \le \xi\_i \le 1, \xi\_i(2-\xi\_i) = \lambda\_i, \\ \mathbf{u}\_i = \frac{1}{1-\xi\_i} \mathbf{Q} \mathbf{v}\_i, \mathbf{f}\_i = (\mathbf{u}\_i^\top, \mathbf{v}\_i^\top)^\top, \end{array} \tag{9.42}$$

where $\mathbf{Q} = \mathbf{D}\_X^{-1} \mathbf{B}$ is the corresponding transition probability matrix from *X* to *Y*.
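The matrices involved in the transfer cut can be built with a few lines of plain Python. The sketch below is an illustration under stated assumptions, not the authors' implementation: the degrees are taken as the row sums of **B** for *X* and the column sums of **B** for *Y*, every row of **B** is assumed to have a positive sum, and the eigenpairs on *Y* are assumed to come from an external eigensolver. The helper names `transfer_cut_matrices` and `lift` are hypothetical.

```python
def transfer_cut_matrices(B):
    """Return Q = D_X^{-1} B, W_Y = B^T D_X^{-1} B and L_Y = D_Y - W_Y."""
    n_x, n_y = len(B), len(B[0])
    d_x = [sum(row) for row in B]                                   # row sums of B
    d_y = [sum(B[i][j] for i in range(n_x)) for j in range(n_y)]    # column sums of B
    Q = [[B[i][j] / d_x[i] for j in range(n_y)] for i in range(n_x)]
    W_y = [[sum(B[i][a] * Q[i][b] for i in range(n_x)) for b in range(n_y)]
           for a in range(n_y)]
    L_y = [[(d_y[a] if a == b else 0.0) - W_y[a][b] for b in range(n_y)]
           for a in range(n_y)]
    return Q, W_y, L_y


def lift(v, Q, lam):
    """Map an eigenpair (lam, v) on Y to (xi, f) on the bipartite graph.

    xi is the root of xi * (2 - xi) = lam that lies in [0, 1]; lam must be < 1.
    """
    xi = 1.0 - (1.0 - lam) ** 0.5
    u = [sum(q * vj for q, vj in zip(row, v)) / (1.0 - xi) for row in Q]
    return xi, u + list(v)               # f = (u^T, v^T)^T


# Toy affinity matrix: three microblog rows followed by two MC rows.
B = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.5], [0.5, 1.0]]
Q, W_y, L_y = transfer_cut_matrices(B)
xi, f = lift([1.0, 1.0], Q, 0.0)         # trivial eigenpair: lam = 0, v = all ones
```

Because the rows of **Q** sum to one, the trivial eigenpair on *Y* lifts to the all-ones vector on the bipartite graph, which is a convenient sanity check.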

Second, the vectors $\{\mathbf{f}\_1, \dots, \mathbf{f}\_K\}$ are clustered by *K*-way spectral clustering, and the best *K* is selected by BIC. Let $K\_0$ denote the number of existing events, initialized to 0. Suppose that the number of events after the new data arrive is at most $K\_0 + n\_{new}/t\_m$, where the threshold $t\_m$ sets the minimum number of microblogs. The bipartite graph is therefore segmented $n\_{new}/t\_m + 1$ times, and BIC selects the segmentation that is used as the event detection result. Suppose that $\{\Gamma\_1, \dots, \Gamma\_K\}$ are the *K* events detected in the last step. For each $\Gamma\_i$, the key MCs are found by MC selection, in which the importance of an MC is measured by its number of microblogs. Finally, the top $n\_{sMC}$ MCs are selected to describe each $\Gamma\_i$.

**Detection of Incremental Social Events** The real-time detection method is defined as follows. Assume that event detection has been run at time $t\_0$, yielding the generated MCs, i.e., MC = {MC1*,...,* MC*p*}, the detected events $\{\Gamma\_1, \dots, \Gamma\_q\}$, and noisy data. New data arrive continuously after $t\_0$ and can be processed at a short time interval $\Delta t$. In other words, event detection can be run at every $t\_0 + x \times \Delta t$, where $x = 1, 2, \dots$. Here, $t\_0 + \Delta t$ is used as an example, and $M\_{new}$ denotes the newly arrived microblogs. Event detection consists of two steps: MC generation and event partition.

To generate new MCs, the MCs of the previous time periods, $MC^{\ast} = \{MC^{\ast}\_1, MC^{\ast}\_2, \dots, MC^{\ast}\_{n\_e}\}$, are used as known samples. $MC^{\ast}$ and $M\_{new}$ are used to construct the incremental microblog hypergraph $\mathcal{G}\_H^{t\_0+\Delta t}$. However, this is challenging because there is no clear distinction between a microblog collection and a microblog. Therefore, only the three most representative microblogs of each MC are chosen, according to the number of retweets and comments, so that no more than $3n\_e$ representative microblogs are selected. These are merged with $M\_{new}$ to create the incremental microblog hypergraph $\mathcal{G}\_H^{t\_0+\Delta t}$. New MCs ($\text{MC}\_{new0}$) are then created from these data using the hypergraph partition. Based on the representative microblogs, $\text{MC}\_{new0}$ and $MC^{\ast}$ are combined. In this way, $n\_{MCnew}$ new MCs ($\text{MC}\_{new}$) are constructed and utilized for event detection.

For real-time detection, the past events $\Gamma = \{\Gamma\_1, \dots, \Gamma\_K\}$ are used as known data for the current time period. The representative MCs of each event in $\Gamma$ and the incremental MCs $\text{MC}\_{new}$ are used jointly to construct the new bipartite graph. The difference is that the distance between MCs that belong to the same identified event is set to 0:

$$d\_{ij} = \begin{cases} 0, & \text{if } \mathbf{x}\_i \in \varGamma\_k \text{ and } \mathbf{y}\_j \in \varGamma\_k \\ \min\limits\_{\substack{m\mathbf{x}\_k \in \mathbf{x}\_i \\ m\mathbf{y}\_l \in \mathbf{y}\_j}} d(m\mathbf{x}\_k, m\mathbf{y}\_l), & \text{otherwise} \end{cases},\tag{9.43}$$

where *k* = 1*,* 2*,...,K*. Therefore, according to the BIC, the bipartite graph can be partitioned into existing events and new events.
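The distance rule for incremental detection can be sketched as follows. The example assumes that each MC is a list of microblog feature vectors, that the microblog-to-microblog distance is Euclidean, and that the known event of an MC (if any) is passed in as an optional label; all names are hypothetical.

```python
import math


def mc_distance(mc_a, mc_b, event_a=None, event_b=None):
    """Distance between two MCs when some MCs already belong to known events.

    MCs of the same known event are at distance 0; otherwise the distance is
    that of the closest pair of microblogs, one taken from each MC.
    """
    if event_a is not None and event_a == event_b:
        return 0.0
    return min(math.dist(p, q) for p in mc_a for q in mc_b)


old_mc = [(0.0, 0.0), (1.0, 1.0)]
new_mc = [(4.0, 5.0), (1.0, 2.0)]
d_unknown = mc_distance(old_mc, new_mc)                            # closest pair
d_same_event = mc_distance(old_mc, new_mc, event_a=3, event_b=3)   # forced to 0
```

Setting the distance to 0 gives MCs of the same known event the largest possible affinity, which encourages the partition to keep previously detected events together.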

Several challenging problems in hypergraph computation for sentiment analysis remain open for further research. First, for the sentiment recognition task, the case of conflicting multi-modal information can be considered. Second, for real-time social event detection, further consideration can be given to the information that may be hidden in broken posts and users. These tasks involve positive or negative associations among multiple entities, and the hypergraph is well suited to modeling such correlations.

#### **9.4 Emotion Recognition**

Emotion recognition has gained wide attention in neuroscience and psychology research [11], and artificial intelligence offers more reliable and accurate computational models for the identification and study of emotions. It has also been extensively applied in real life [12], especially in human–computer interaction, motor vehicle driving assistance training, emotion classification in movies, and other related areas [13].

Emotion recognition has three main goals [14]: first, to enable intelligent systems to understand, infer, and recognize human emotions; second, to enable systems to express emotion in a human-like way in response to stimuli (e.g., conversational agents or robots); and third, to enable intelligent systems to actually perceive emotions. Over the past three decades, researchers from several disciplines have pursued these three goals in different ways, with the method of recognizing emotions as the central research issue. Although the problem has been studied for many years, progress is still being made. People convey their emotions in various ways, including language, gestures, facial expressions, and physiological signals [15], and finding a suitable method to identify and analyze human emotions may be a long-term problem. The first three modalities are governed by human volition and vary substantially across individuals [16], so approaches based on them are limited in accuracy and reliability. In contrast, physiological signals cannot be readily blocked or concealed; they are governed jointly by the body's nervous and hormonal systems and are often independent of human will. Therefore, physiological signals may offer more accurate information about emotions than visual or auditory cues [17]. Human emotion is a highly subjective phenomenon that can be affected by a multitude of environmental and psychological factors, including interests and personality.

Nonetheless, because of the following factors, recognizing emotions through physiological signals is still a work in progress:


In this case, the hypergraph structure can establish complex correlations that simultaneously take into account: (a) correlations between EEG, EOG, and EMG signals, which are signals from several modalities; (b) correlations between subjects; and (c) patterns of physiological signal changes in a single subject in response to various stimuli. Two methods for emotion prediction using hypergraph computation are presented: multi-modal vertex-weighted hypergraph learning (MVHL) [7, 8] and multi-hypergraph neural networks (MHGNN) [9].

#### **(1) Multi-Modal Vertex-Weighted Hypergraph Learning**

Hypergraphs have been used to depict the link between physiological data and personality [7]. MVHL introduces a multi-modal vertex-weighted hypergraph learning method for personalized emotion recognition (PER) that takes into account vertex weights, hyperedge weights, and modality weights. Each vertex in this method is a composite tuple (subject, stimulus). A hypergraph structure is used to model the personality correlations among subjects and the physiological correlations among the corresponding stimuli. The weights of each vertex and hyperedge, as well as the weights of the different hypergraphs, are learned automatically. Hyperedge weights are used to create the optimal representation, while vertex weights describe the impact of various samples and patterns in the learning process. The factors learned on the multi-modal vertex-weighted hypergraph, known as emotion relevance, are employed for emotion recognition. Because the vertices are composite and incorporate data from various subjects, MVHL can recognize the emotions of multiple subjects at once.

The framework of this model is as follows. First, composite vertices (subject, stimulus) are formed from the subjects and the stimuli used to elicit their emotions. Second, multi-modal hyperedges are constructed to model the personality associations among different subjects and the physiological associations among the corresponding stimuli. Finally, PER results are obtained after joint learning on the vertex-weighted multi-modal multi-task hypergraphs.

**Hypergraph Construction** This model constructs the hypergraph structure from the pairwise similarity between samples. The pairwise similarity between the personalities of *ui* and *uj* is measured by the cosine function:

$$s\_{PER}(u\_i, u\_j) = \frac{\langle \mathbf{p}\_i, \mathbf{p}\_j \rangle}{\|\mathbf{p}\_i\| \cdot \|\mathbf{p}\_j\|},\tag{9.44}$$

where **p***i* denotes the personality vector of *ui*. Each vertex is selected in turn as the centroid, and a hyperedge is built to link the centroid to its *K* nearest neighbors in the existing representation space. It should be noted that the personality hyperedges are built from both intra- and inter-subject viewpoints. A hyperedge links all the vertices from the same subject together. Additionally, based on personality similarity, the *K* closest subjects of each subject are chosen, and all of their vertices are connected by another hyperedge.
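The intra- and inter-subject personality hyperedges can be sketched as follows. The example assumes that the inter-subject hyperedge of a subject contains the vertices of that subject together with those of its *K* most similar subjects, and that no personality vector is all zeros; the function names and toy data are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two personality vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def personality_hyperedges(personality, vertices, k):
    """personality: subject -> vector; vertices: list of (subject, stimulus) tuples.

    Returns hyperedges as sorted lists of vertex indices, two per subject.
    """
    by_subject = {}
    for idx, (subj, _) in enumerate(vertices):
        by_subject.setdefault(subj, []).append(idx)
    edges = []
    for subj in personality:
        # Intra-subject hyperedge: all vertices of the same subject.
        edges.append(sorted(by_subject.get(subj, [])))
        # Inter-subject hyperedge: the subject plus its k most similar subjects.
        others = sorted((s for s in personality if s != subj),
                        key=lambda s: cosine(personality[subj], personality[s]),
                        reverse=True)[:k]
        group = [subj] + others
        edges.append(sorted(i for s in group for i in by_subject.get(s, [])))
    return edges


personality = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
vertices = [("a", "s1"), ("a", "s2"), ("b", "s1"), ("c", "s1")]
edges = personality_hyperedges(personality, vertices, k=1)
```

Subjects "a" and "b" have nearly parallel personality vectors, so their vertices end up in a shared hyperedge, while "c" is linked to "b" only because "b" is its closest remaining subject.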

Assume that the constructed hypergraphs are $\mathcal{G}\_m = (\mathcal{V}\_m, \mathcal{E}\_m, \mathbf{W}\_m)$, where $\mathcal{V}\_m$ and $\mathcal{E}\_m$ denote the vertex set and hyperedge set, respectively, and $\mathbf{W}\_m$ is the diagonal hyperedge weight matrix of the *m*-th hypergraph ($m = 1, 2, \dots, M$). The incidence matrix $\mathbf{H}\_m$ can be computed as

$$\mathbf{H}\_m(v, e) = \begin{cases} 1, \text{ if } v \in e \\ 0, \text{ if } v \notin e \end{cases}.\tag{9.45}$$

The weights of the vertices are learned to evaluate their value and contribution to the learning process. This differs from the classic hypergraph learning method, which treats all the vertices equally. Let $\mathbf{U}\_m$ be the diagonal matrix of vertex weights. The vertex degree and the hyperedge degree are defined as $d\_m(v) = \sum\_{e \in \mathcal{E}\_m} \mathbf{W}\_m(e) \mathbf{H}\_m(v, e)$ and $\delta\_m(e) = \sum\_{v \in \mathcal{V}\_m} \mathbf{U}\_m(v) \mathbf{H}\_m(v, e)$. Accordingly, the two diagonal degree matrices are defined as $\mathbf{D}\_m^v(i, i) = d\_m(v\_i)$ and $\mathbf{D}\_m^e(i, i) = \delta\_m(e\_i)$.

**Multi-Modal Vertex-Weighted Hypergraph Learning** The goal is to simultaneously study the correlations among the physiological signals involved and the personality relations across subjects. The framework of multi-modal vertex-weighted hypergraph learning is presented in Fig. 9.9. Given *N* subjects $u\_1, \dots, u\_N$ and the involved stimuli $s\_{ij}$ ($j = 1, \dots, n\_i$) for $u\_i$, we assume that the compound vertices and the associated labels of the *c*-th emotion category are $\{(u\_1, s\_{1j})\}\_{j=1}^{n\_1}, \dots, \{(u\_N, s\_{Nj})\}\_{j=1}^{n\_N}$ and $\mathbf{y}\_{1c} = [y\_{11}^c, \dots, y\_{1n\_1}^c]^\top, \dots, \mathbf{y}\_{Nc} = [y\_{N1}^c, \dots, y\_{Nn\_N}^c]^\top$, where $c = 1, \dots, n\_e$.

**Fig. 9.9** Overall framework of the multi-modal vertex-weighted hypergraph learning. This figure is from [7]

The number of emotion categories is denoted as $n\_e$. The estimated values of all stimuli associated with the specified users for the *c*-th emotion category, also known as emotion relevance, are given by $\mathbf{r}\_{1c} = [r\_{11}^c, \dots, r\_{1n\_1}^c]^\top, \dots, \mathbf{r}\_{Nc} = [r\_{N1}^c, \dots, r\_{Nn\_N}^c]^\top$. $\mathbf{y}\_c$ and $\mathbf{r}\_c$ are denoted by

$$\mathbf{y}\_c = [\mathbf{y}\_{1c}^\top, \dots, \mathbf{y}\_{Nc}^\top]^\top, \mathbf{r}\_c = [\mathbf{r}\_{1c}^\top, \dots, \mathbf{r}\_{Nc}^\top]^\top. \tag{9.46}$$

Let $\mathbf{Y} = [\mathbf{y}\_1, \dots, \mathbf{y}\_c, \dots, \mathbf{y}\_{n\_e}]$ and $\mathbf{R} = [\mathbf{r}\_1, \dots, \mathbf{r}\_c, \dots, \mathbf{r}\_{n\_e}]$, where the two trade-off parameters are $\lambda$ and $\eta$. The hypergraph structure's regularizer is defined as follows:

$$\Psi(\mathbf{R}, \mathbf{W}, \mathbf{U}, \boldsymbol{\alpha}) = \sum\_{c=1}^{n\_e} \mathbf{r}\_c^\top \sum\_{m=1}^M \alpha\_m (\mathbf{U}\_m - \Theta\_m) \mathbf{r}\_c,\tag{9.47}$$

where $\Theta\_m = (\mathbf{D}\_m^v)^{-1/2} \mathbf{U}\_m \mathbf{H}\_m \mathbf{W}\_m (\mathbf{D}\_m^e)^{-1} \mathbf{H}\_m^\top \mathbf{U}\_m (\mathbf{D}\_m^v)^{-1/2}$. Then, $\Delta = \sum\_{m=1}^{M} \alpha\_m (\mathbf{U}\_m - \Theta\_m)$ can be seen as the fused hypergraph Laplacian with vertex weighting.
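To make the vertex-weighted Laplacian concrete, the sketch below builds the matrix for one hypergraph and evaluates the regularizer for a single emotion category. It is an illustration only: the vertex degree is taken as the weighted count of incident hyperedges, the hyperedge degree as the sum of the vertex weights inside the hyperedge, and the weights and toy incidence matrix are hypothetical.

```python
def vertex_weighted_laplacian(H, w, u):
    """Return (U - Theta, d_v) for one hypergraph.

    H: |V| x |E| 0/1 incidence matrix (list of rows), w: hyperedge weights,
    u: vertex weights.
    """
    n, m = len(H), len(w)
    d_v = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]   # vertex degrees
    d_e = [sum(u[v] * H[v][e] for v in range(n)) for e in range(m)]   # hyperedge degrees
    L = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            theta = sum(H[a][e] * w[e] / d_e[e] * H[b][e] for e in range(m))
            theta *= u[a] * u[b] / (d_v[a] * d_v[b]) ** 0.5
            L[a][b] = (u[a] if a == b else 0.0) - theta
    return L, d_v


def regularizer(laplacians, alphas, r):
    """r^T (sum_m alpha_m (U_m - Theta_m)) r for one emotion category."""
    n = len(r)
    return sum(alpha * r[a] * L[a][b] * r[b]
               for alpha, L in zip(alphas, laplacians)
               for a in range(n) for b in range(n))


H = [[1, 0], [1, 1], [0, 1]]          # three vertices, two hyperedges
w = [1.0, 2.0]                        # hyperedge weights
u = [0.5, 1.0, 2.0]                   # vertex weights
L, d_v = vertex_weighted_laplacian(H, w, u)
smooth = regularizer([L], [1.0], [d ** 0.5 for d in d_v])
rough = regularizer([L], [1.0], [1.0, 0.0, 0.0])
```

With these definitions, the relevance vector proportional to the square roots of the vertex degrees incurs no penalty, whereas a vector concentrated on one vertex is penalized. This is the smoothness behavior expected of a Laplacian regularizer.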

#### **(2) Multi-Hypergraph Neural Networks**

The multi-hypergraph neural network (MHGNN) uses hypergraphs to model complex correlations and to recognize emotions from physiological signals. It can take into account: (a) correlations between signals of various modalities, i.e., EEG, EOG, and EMG; (b) relationships between subjects; and (c) patterns of physiological signal changes in a single person in response to various stimuli. This model groups each subject and stimulus into a composite tuple and treats it as a vertex in the hypergraph. A hypergraph is generated for each modality from the corresponding physiological signals, with hyperedges expressing the correlations among the physiological signals in response to various stimuli. The vertices are then classified within the MHGNN framework in accordance with the intricate relationships in the data, so that emotion recognition is cast as vertex classification on multiple hypergraphs. The different hypergraph neural networks are combined using a fully connected network, which also takes into account the relative relevance of the various physiological modalities when classifying emotions. The primary benefit of this framework is its ability to combine multi-modal data and to represent the three kinds of intricate relationships in the data. Figure 9.10 shows the pipeline of the MHGNN framework.

**Fig. 9.10** The pipeline of multi-hypergraph neural networks

**Modeling of Multi-Hypergraph** Given a number of features from various physiological inputs, the subject correlation is formulated using a multi-hypergraph structure. Each modality is represented by a separate hypergraph. Each vertex of a hypergraph represents a subject to be learned together with a description of its corresponding stimuli, and the connections between the vertices are constructed using hyperedges. The *k*-NN method is used to generate the hypergraphs, where *k* is a hyperparameter that controls the connectivity. Each vertex is chosen as a centroid once, and the set of hyperedges is complete after all vertices have acted as the centroid. We assume that $S = \{S\_1, S\_2, \dots, S\_n\}$ is a training set with the features $\mathbf{X}^{(i)} = \{\mathbf{x}\_1^{(i)}, \mathbf{x}\_2^{(i)}, \dots, \mathbf{x}\_n^{(i)}\}$ of modality *i*, where the vector $\mathbf{x}\_j^{(i)}$ is the feature of the *j*-th training sample from modality *i* and $S\_j$ denotes the *j*-th training sample. According to the *k*-NN approach, the hyperedge $e\_p$ is centered on the vertex $v\_p$ and connects $v\_p$ with its *k* nearest vertices. The distance between two vertices is the Euclidean distance between the corresponding feature vectors. The correlation between vertex *p* and vertex *q* is represented by the matrix element $h\_{p,q}$. As an exponential function of the Euclidean distance, the correlation can be described as

$$h\_{p,q}^{(i)} = \begin{cases} \exp\left(-\frac{d\left(\mathbf{x}\_p^{(i)}, \mathbf{x}\_q^{(i)}\right)^2}{d^2}\right), & q \in \boldsymbol{\mu}\_p \\ 0, & q \notin \boldsymbol{\mu}\_p \end{cases},\tag{9.48}$$

where $d(\mathbf{x}\_p^{(i)}, \mathbf{x}\_q^{(i)})$ stands for the Euclidean distance in feature space between samples *p* and *q*. The weight matrix $\mathbf{W}^{(i)}$ is set to the identity matrix in this model because there is no prior knowledge regarding the significance of the hyperedges. As a result, the incidence matrix $\mathbf{H}^{(i)}$ contains all the information about the hypergraph.

An incidence matrix $\mathbf{H}^{(i)}$ is generated for each modality, so that *m* incidence matrices are generated for *m* modalities.
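The construction of one such incidence matrix can be sketched as follows. The example makes several assumptions that the text does not fix: rows index vertices and columns index hyperedges, each hyperedge contains its centroid together with the centroid's *k* nearest neighbors, and the normalizer in the exponent is the mean pairwise distance. The function name `knn_incidence` is hypothetical.

```python
import math


def knn_incidence(X, k):
    """Incidence matrix H in which column p is the hyperedge centered on vertex p."""
    n = len(X)
    dist = [[math.dist(X[p], X[q]) for q in range(n)] for p in range(n)]
    # Assumption: the normalizer is the mean distance over all distinct pairs.
    d_bar = sum(map(sum, dist)) / (n * (n - 1))
    H = [[0.0] * n for _ in range(n)]
    for p in range(n):
        # The centroid itself (distance 0) plus its k nearest neighbors.
        members = sorted(range(n), key=lambda q: dist[p][q])[:k + 1]
        for q in members:
            H[q][p] = math.exp(-dist[p][q] ** 2 / d_bar ** 2)
    return H


# Four samples of one modality forming two tight pairs.
X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
H = knn_incidence(X, k=1)
```

Running the same function once per modality yields the *m* incidence matrices; nearby samples receive entries close to 1, and samples outside the neighborhood receive 0.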

**Multi-Hypergraph Convolutional Networks** Generating the subject representation and then classifying emotions are the crucial steps in emotion recognition. Deep neural networks have made significant progress in data representation in the last few years. However, given the intricacy of the data correlations, this is still a work in progress. In order to represent data and recognize emotions, a multi-hypergraph convolutional network framework is developed that can simultaneously take into account several physiological inputs from different people.

In a hypergraph convolutional network, the spatial convolution is viewed from the perspective of graph spectral theory as a spectral matrix product, and the hypergraph Laplacian $\Delta$ is leveraged to convert it from the spatial domain to the spectral domain. $\Delta$ can be formulated as $\Delta = \mathbf{I} - \mathbf{D}\_v^{-1/2} \mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2}$, where $\mathbf{D}\_e$ and $\mathbf{D}\_v$ are the diagonal matrices of hyperedge degrees and vertex degrees, respectively. A hypergraph convolutional layer for each modality can then be formulated as

$$\mathbf{X}\_{(l+1)}^{(i)} = \sigma \left( \mathbf{D}\_v^{(i)-1/2} \mathbf{H}^{(i)} \mathbf{W}^{(i)} \mathbf{D}\_e^{(i)-1} \mathbf{H}^{(i)\top} \mathbf{D}\_v^{(i)-1/2} \mathbf{X}\_{(l)}^{(i)} \boldsymbol{\Theta}\_{(l)}^{(i)} \right), \tag{9.49}$$

where $\boldsymbol{\Theta}\_{(l)}^{(i)}$ is the learnable parameter of the *l*-th layer in the *i*-th hypergraph neural network (HGNN) and $\sigma$ is the activation function. During hypergraph convolution, the parameters $\boldsymbol{\Theta}^{(i)}$ are updated by backpropagation. The quantities that depend only on the hypergraph structure, i.e., $\mathbf{D}\_v^{(i)-1/2} \mathbf{H}^{(i)} \mathbf{W}^{(i)} \mathbf{D}\_e^{(i)-1} \mathbf{H}^{(i)\top} \mathbf{D}\_v^{(i)-1/2}$, are pre-computed and are not trainable in this procedure. For simplicity, this product is denoted by $\mathbf{A}\_h^{(i)}$, and the hypergraph convolutional layer can be rewritten as

$$\mathbf{X}\_{(l+1)}^{(i)} = \sigma \left( \mathbf{A}\_{h}^{(i)} \mathbf{X}\_{(l)}^{(i)} \boldsymbol{\Theta}\_{(l)}^{(i)} \right). \tag{9.50}$$
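A forward pass through one such layer can be written in a few lines of plain Python. The sketch is illustrative rather than the authors' implementation: it uses a 0/1 incidence matrix, takes ReLU as the activation function (an assumption, since the activation is not specified here), and fixes the parameter matrix by hand instead of learning it. All names are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hgnn_propagation(H, w):
    """Pre-compute A_h = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} (not trainable)."""
    n, m = len(H), len(w)
    d_v = [sum(w[e] * H[v][e] for e in range(m)) for v in range(n)]
    d_e = [sum(H[v][e] for v in range(n)) for e in range(m)]
    return [[sum(H[a][e] * w[e] / d_e[e] * H[b][e] for e in range(m))
             / (d_v[a] * d_v[b]) ** 0.5 for b in range(n)] for a in range(n)]


def hgnn_layer(A_h, X, Theta):
    """One layer: X_next = sigma(A_h X Theta), with sigma = ReLU (assumed)."""
    Z = matmul(matmul(A_h, X), Theta)
    return [[max(0.0, z) for z in row] for row in Z]


H = [[1, 0], [1, 1], [0, 1]]          # three vertices, two hyperedges
A_h = hgnn_propagation(H, [1.0, 1.0])
X = [[1.0], [2.0], [3.0]]             # one input feature per vertex
Theta = [[1.0, -1.0]]                 # 1 x 2 parameter matrix, fixed for the demo
out = hgnn_layer(A_h, X, Theta)
```

Since the propagation matrix depends only on the hypergraph structure, it can be computed once and reused in every layer and training step.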

It is important to note that the formulations of graph convolution and hypergraph convolution are similar. The graph convolution is shown as follows:

$$\mathbf{X}\_{(l+1)}^{(i)} = \sigma \left( \mathbf{D}^{(i)-1/2} \mathbf{A}^{(i)} \mathbf{D}^{(i)-1/2} \mathbf{X}\_{(l)}^{(i)} \boldsymbol{\Theta}\_{(l)}^{(i)} \right). \tag{9.51}$$

In traditional single-hypergraph neural network models, the hyperedges built from the features of several modalities are concatenated. However, because the features have distinct scales and dimensions, the resulting hyperedges can be inconsistent. Additionally, the modalities may contribute to the task from different perspectives: some may be crucial, while others are less important. A single hypergraph model with identical weights cannot reflect such differences, and simply concatenating the hyperedges makes it difficult to weight them individually. To address this issue, a multi-hypergraph neural network structure is introduced to integrate multiple hypergraph structures.

To calculate intermediate representations for each modality, *m* hypergraph neural network models are built using *m* hypergraphs for *m* modalities. The *K*-layer *i*-th hypergraph neural network may be expressed as follows:

$$HGNN(\mathbf{H}^{(i)}, \mathbf{X}^{(i)}) = \sigma\_K^{(i)} \left( \mathbf{A}\_h^{(i)} (\cdots \sigma\_1^{(i)} (\mathbf{A}\_h^{(i)} \mathbf{X}^{(i)} \boldsymbol{\Theta}\_1^{(i)}) \cdots) \boldsymbol{\Theta}\_K^{(i)} \right). \tag{9.52}$$

The final output is then generated from the *m* intermediate representations by a fully connected layer. Acting as a fusion layer, it dynamically combines the outcomes of the hypergraph convolutions and weights them according to their contributions. A softmax layer serves as the classifier. In this way, modality features of various scales and dimensions are learned in networks with different hypergraph structures and are then weighted automatically and merged in the fusion layer.

$\mathbf{W}\_f$ and $\mathbf{b}\_f$ stand for the weights and bias of the fusion layer, respectively. The model can be expressed as follows:

$$\begin{aligned} MHGNN(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \dots, \mathbf{X}^{(m)}) = \text{softmax}\big(&\mathbf{W}\_f \mathbf{W}\_m [HGNN(\mathbf{H}^{(1)}, \mathbf{X}^{(1)}), \\ &HGNN(\mathbf{H}^{(2)}, \mathbf{X}^{(2)}), \dots, \\ &HGNN(\mathbf{H}^{(m)}, \mathbf{X}^{(m)})] + \mathbf{b}\_f\big), \end{aligned} \tag{9.53}$$

where the matrix of modality weights is denoted by $\mathbf{W}\_m = \text{Diag}(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \dots, \mathbf{w}^{(m)})$.
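The fusion step can be illustrated for a single vertex as follows. The shapes are assumptions made for the example: each modality weight is treated as a scalar that scales the whole output block of its HGNN, the scaled blocks are concatenated, and the fusion layer maps the concatenation to class logits. The function names and numbers are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    mx = max(z)
    ex = [math.exp(v - mx) for v in z]
    total = sum(ex)
    return [v / total for v in ex]


def fuse(modality_outputs, modality_weights, W_f, b_f):
    """Fuse the per-modality embeddings of a single vertex.

    modality_outputs: m embedding vectors, one per HGNN;
    modality_weights: m scalars (assumed diagonal entries of the modality weights);
    W_f, b_f: fusion-layer weights (n_classes x total_dim) and bias.
    """
    # Scale each modality block by its weight, then concatenate the blocks.
    z = [w * v for w, emb in zip(modality_weights, modality_outputs) for v in emb]
    logits = [sum(a * b for a, b in zip(row, z)) + bias
              for row, bias in zip(W_f, b_f)]
    return softmax(logits)


W_f = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
b_f = [0.0, 0.0]
probs = fuse([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.0], W_f, b_f)
# With a zero weight, the second modality cannot influence the prediction.
probs_alt = fuse([[1.0, 2.0], [9.0, 9.0]], [1.0, 0.0], W_f, b_f)
```

In a trained model, the modality weights and the fusion parameters would be learned jointly with the HGNN parameters rather than fixed by hand.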

By examining the experimental results with the network structure of the hypergraph, the patterns were found to reflect a pair of interconnected and mutually reinforcing interdisciplinary concerns. Another notable observation in the experiments was the variation in each subject's physiological characteristics. Therefore, it is advisable to: (a) collect data according to the requirements of real application scenarios; (b) pay attention to individual differences; (c) analyze the correlations between the subjects of the training and test samples; and (d) add more information, such as action recognition information. Hypergraphs are a good tool for discovering the biological patterns among them.

#### **9.5 Summary**

In this chapter, to illustrate the paradigm of using hypergraph computation in social media analysis, we give an overview of three applications, i.e., recommender systems, sentiment analysis, and emotion recognition. For recommender systems, we discuss two specific applications: collaborative filtering and attribute inference. Collaborative filtering only considers the raw user–item network, and the hypergraph is used to model the inter- and intra-domain (user or item) correlations in behavior space. Attribute inference takes the attribute information into consideration in addition to the historical interactions. Context information such as time and location can also be integrated, which is left for future exploration. For sentiment analysis, sentiment prediction and social event detection are covered. The former task mainly concerns the sentiment conveyed by each multi-modal tweet, while the latter focuses on exploring a group of posts that are closely connected and cover the same subjects.

Furthermore, recognizing the emotions of people through multi-modal physiological signals is also presented. There are still many social media analysis applications worth exploring with hypergraph computation. For example, heterogeneous correlations widely exist in the social media context, and how to utilize the complementary information among these heterogeneous associations with hypergraph computation has become a key issue. In addition, social media data are dynamic rather than static, and newly arriving data may have different distributions from the existing data. Under such circumstances, static hypergraph computation methods cannot be applied directly, and the dynamic hypergraph computation paradigm deserves investigation as a way to address this complex issue.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 10 Hypergraph Computation for Medical and Biological Applications**

**Abstract** Hypergraph computation, with its superior capability in complex data modeling, is a powerful tool for many medical and biological applications. In this chapter, we introduce four typical examples of the use of hypergraph computation in medical and biological applications, i.e., computer-aided diagnosis, survival prediction with histopathological images, drug discovery, and medical image segmentation. In each application, we present how to construct the hypergraph structure with different kinds of medical and biological data and different hypergraph computation strategies for these tasks respectively. We can notice that hypergraph computation has shown advantages in these applications.

#### **10.1 Introduction**

In the past few decades, massive biological and medical data were generated owing to the rapid development of big data techniques. These data can be used for tasks of disease gene analysis, disease risk assessment, targeted drug discovery, etc. The data further contribute to disease prevention and early diagnosis and treatment of diseases. The biological and medical data are complex, heterogeneous, and multi-modal, with widespread inter- and intra-data correlations. For example, in early disease diagnosis, patients with similar medical image appearance may also share similar disease conditions; different modalities of the medical image of the same patient, such as MRI and CT, may also exhibit disease characteristics from different perspectives; the patches within gigapixel histopathological images may have implicit collaborative associations that reveal patients' potential health risks. Therefore, how to model such correlation behind these data is very important for medical and biological applications.

Hypergraphs, with their flexible hyperedges, provide a possible solution for modeling such complex correlations within medical and biological data. Given the observed data, the hypergraph structure can be generated using the previously mentioned methods. It naturally incorporates multi-modal or heterogeneous data by concatenating hyperedge groups and can thus exploit the complementary information of these data discriminatively. The applications of hypergraph computation in medical and biological tasks can typically be summarized as follows: (1) modeling the medical images, the patches, or the biological entities as vertices, and connecting them with hyperedges according to their feature similarity or high-order topological links; (2) exploring the high-order correlations between data using hypergraph label propagation or hypergraph neural networks so as to enhance the vertex representations; and (3) deploying these representations in medical and biological tasks, such as medical image retrieval, disease identification, cancer tissue classification, survival prediction, and medical image segmentation.

In this chapter, we discuss four typical applications of hypergraph computation in medical and biological scenarios, i.e., computer-aided diagnosis, survival prediction with histopathological images, drug discovery, and medical image segmentation. In computer-aided diagnosis, three specific applications are included, i.e., the identification [1] and medical image retrieval of MCI [2], autism spectrum disorder identification using brain functional networks [3], as well as the identification of COVID-19 by CT imaging [4]. For survival prediction with histopathological images, two techniques targeting different cases are presented, including ranking-based survival prediction [5] and multi-hypergraph modeling for survival prediction [6]. In drug discovery, a heterogeneous hypergraph-based drug–target interaction prediction technique [7] is presented. For medical image segmentation, we introduce the hierarchical hypergraph patch labeling method. Part of the work introduced in this chapter has been published in [1–8].

#### **10.2 Computer-Aided Diagnosis**

Owing to the advancement of artificial intelligence and the widespread use of medical imaging data, including MRI, CT scans, and histopathological images, computer-aided diagnosis has made clinical diagnosis far more convenient. Its main goal is to provide clinicians with a preliminary examination of patients in order to increase diagnostic accuracy, avoid missed illnesses, and improve work efficiency. Despite great advances in machine learning and deep learning research, many challenges remain in the field of computer-aided diagnosis. These include the improper use of information shared among patients and among different forms of medical images, the persistence of noisy data (such as variations across CT manufacturers and patients' movement during imaging), and the confusion of cases in the early stages of illness.

Traditional approaches frequently consider one patient at a time and ignore the relationships among patients. The illness information of patients with similar medical images can improve the reliability of computer-aided diagnosis, since it is reasonable to expect that patients whose MRI or CT features are related also have similar disease conditions. Because hyperedges in hypergraphs, unlike edges in graphs, can connect two or more vertices, hypergraphs offer a potential solution to the first challenge by representing high-order illness connections among multiple individuals.

Computer-aided diagnosis with medical images typically consists of three main steps. The first step is image pre-processing, which mostly consists of enhancing visual information, filtering out the background, and separating the region of interest from blank areas to lessen the interference of irrelevant regions. The next step is to extract features from the region of interest. Imaging features, including infection lesion count, mean lesion area, lesion density, and morphological aspects, must be extracted from the images since they are informative and contain task-independent information. The final step is to use machine learning, deep learning, or other statistical approaches to diagnose patients and to identify disease types and lesion types with the features gathered in the previous steps.

The subsections that follow introduce the use of hypergraph computation techniques in computer-aided diagnosis. Four specific applications are covered, namely MCI identification using MRI [1], medical image retrieval [2], COVID-19 identification using CT imaging [4], and ASD identification using brain functional networks [3]. First, we present a strategy that creates a hypergraph for each MRI sequence and models the correlation among patients using the information shared by several MRI sequences. Second, we explain how to learn multi-graph combination weights to discover the association between query subjects and the existing subject classes, which improves the precision of medical image retrieval. Third, we describe the uncertainty vertex-weighted hypergraph learning approach that distinguishes COVID-19 from other types of pneumonia. Finally, we show how dynamic hypergraph learning can be applied to diagnose autism in children using multi-modal functional connectivity.

#### *10.2.1 MCI Identification Using MRI*

Alzheimer's disease (AD) is a relatively common form of dementia in seniors, and identifying its initial phase, i.e., mild cognitive impairment (MCI), to support diagnosis is an important but challenging task. Since research has demonstrated that combining data from various modalities can improve the accuracy of diagnosing AD/MCI, the hypergraph computation approach described below uses clinically routine scans that capture multiple MR sequences, each reflecting different aspects of brain structure or function, and attempts to combine them optimally.

The centralized hypergraph learning method (CHL) [1] integrates multiple types of imaging data in a semi-supervised manner to estimate the correlations among subjects, which indicate how likely it is that subjects belong to the same class. This improves the utilization of multi-modal data; an overview is shown in Fig. 10.1. In contrast to ordinary graphs, hypergraphs propagate information through hyperedges that connect two or more vertices concurrently. They can also capture higher-order relationships among subjects by selecting the nearest neighbors in the feature space, i.e., whether a set of subjects in this task shares common information, thereby allowing each subject to make the most of the knowledge from the MR sequences by jointly optimizing the correlations and the hyperedge weights among subjects. The process is presented in two stages, namely the construction of centralized hypergraphs from the processed data and centralized hypergraph learning.

**Fig. 10.1** A pipeline to classify MCI or NC from multi-modal imaging data using centralized hypergraph learning. This figure is from [1]

Different types of imaging data from patients with MCI and normal controls (NC) need to be pre-processed into features before they are used to construct the hypergraphs. Thereafter, a hypergraph $\mathcal{G}_i = \langle \mathcal{V}_i, \mathcal{E}_i, \mathbf{W}_i \rangle$ is constructed for each type of imaging data, where each subject is considered as a vertex and the star expansion procedure is used to generate hyperedges. In particular, every vertex in each feature space is taken in turn as the central vertex of a hyperedge, which consists of the vertices located within distance $\phi\bar{d}$ of the center vertex, where $\phi$ is a hyperparameter and $\bar{d}$ is the mean distance between vertices in the feature space. The hypergraph incidence matrix $\mathbf{H}_i$ produced by the star expansion procedure is formalized as

$$\mathbf{H}_{i}(v,e) = \begin{cases} \exp\left(-\frac{d_i(v,v_{c})}{0.1\bar{d}_{i}}\right) & \text{if } v \in e\\ 0 & \text{otherwise} \end{cases},\tag{10.1}$$

where $d_i(v, v_c)$ represents the distance from the vertex $v$ to the corresponding center vertex $v_c$, and $\bar{d}_i$ is the mean distance between vertices in the feature space of the $i$-th type of imaging data. It should be noted that the hyperedge weights $\mathbf{W}_i$ are initialized to the same value, e.g., 1, when the hypergraph is generated.
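The construction in Eq. (10.1) can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used in [1]: the function name `build_incidence`, the default `phi=1.0`, and the choice to compute the mean distance over all vertex pairs are assumptions made for the example.

```python
import numpy as np

def build_incidence(X, phi=1.0, scale=0.1):
    """Distance-based incidence matrix in the spirit of Eq. (10.1).

    X is an (n, d) feature matrix. Column c of the result is the hyperedge
    centred on vertex c. `phi` is the radius hyperparameter and `scale`
    is the 0.1 factor that appears in the equation.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances
    d_bar = dist[np.triu_indices(n, k=1)].mean()   # assumed: mean over all vertex pairs
    member = dist <= phi * d_bar                   # v in e_c iff d(v, v_c) <= phi * d_bar
    H = np.where(member, np.exp(-dist / (scale * d_bar)), 0.0)
    return H, d_bar

# Three subjects: the first two are close, the third is far away.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
H, d_bar = build_incidence(X, phi=1.0)
print(np.round(H, 3))
```

Every center vertex belongs to its own hyperedge with value 1, so no vertex is left isolated, which keeps the vertex degrees used later strictly positive.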

The MCI diagnostic task is regarded as a binary classification problem, and the centralized hypergraph learning method uses the various imaging data to model the correlations among subjects. Each step selects one hypergraph as the core hypergraph out of the four that were created from the four types of data, with the others offering additional input for updating the hypergraphs. If hypergraph $\mathbf{H}_j$ is the core, we obtain the $j$-th centralized hypergraph, and the relationship among the vertices is learned by solving

$$\begin{aligned} \arg\min_{\mathbf{F}_j, \mathbf{W}_i} & \left\{ \Omega_j^c(\mathbf{F}_j) + \lambda \mathcal{R}_{emp}(\mathbf{F}_j) + \mu \sum_i \sum_{e \in \mathcal{E}_i} \mathbf{W}_i(e)^2 \right\} \\ \text{s.t. } & \mathbf{H}_i \operatorname{diag}(\mathbf{W}_i) = \operatorname{diag}(\mathbf{D}_i^v),\ \operatorname{diag}(\mathbf{W}_i) \ge 0, \end{aligned} \tag{10.2}$$

where $\Omega_j^c(\mathbf{F}_j)$ is the regularizer that smooths the correlations among vertices, $\mathcal{R}_{emp}$ represents the empirical loss, $\sum_i \sum_{e \in \mathcal{E}_i} \mathbf{W}_i(e)^2$ represents an $\ell_2$-norm regularizer on the hyperedge weights, and $\mathbf{D}_i^v$ represents the vertex degree matrix. By assigning different weights $\alpha_1, \alpha_2$ to the core hypergraph and the others, respectively, the regularizer term can be formulated as

$$\Omega_j^c(\mathbf{F}_j) = \alpha_1 \Omega_j(\mathbf{F}_j) + \alpha_2 \sum_{i \neq j} \Omega_i(\mathbf{F}_j),\tag{10.3}$$

where $\Omega_i(\mathbf{F}_j) = \mathbf{F}_j^\top (\mathbf{I} - \Theta_i)\mathbf{F}_j$ with $\Theta_i = \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}$ computed from the $i$-th hypergraph. Consequently, the regularizer is rewritten as $\Omega_j^c(\mathbf{F}_j) = \mathbf{F}_j^\top \Delta_j^c \mathbf{F}_j$ with $\Delta_j^c = \mathbf{I} - \left(\alpha_1\Theta_j + \alpha_2 \sum_{i \neq j} \Theta_i\right)$.

The optimization of Eq. (10.2) consists of two steps. First, we optimize the relevance matrix $\mathbf{F}_j$ with $\mathbf{W}_i$ fixed as

$$\arg\min_{\mathbf{F}_{j}} \left\{ \Omega_{j}^{c}(\mathbf{F}_{j}) + \lambda \mathcal{R}_{emp}(\mathbf{F}_{j}) \right\},\tag{10.4}$$

which has the closed-form solution $\mathbf{F}_j = \frac{\lambda}{1+\lambda}\left(\mathbf{I} - \frac{1}{1+\lambda}\left(\alpha_1\Theta_j + \alpha_2 \sum_{i \neq j} \Theta_i\right)\right)^{-1}\mathbf{Y}$, where $\mathbf{Y}$ denotes the label matrix. Second, we optimize the hyperedge weights $\mathbf{W}_i$ with $\mathbf{F}_j$ fixed as

$$\begin{aligned} \arg\min_{\mathbf{W}_i} & \left\{ \Omega_j^c(\mathbf{F}_j) + \mu \sum_i \sum_{e \in \mathcal{E}_i} \mathbf{W}_i(e)^2 \right\} \\ \text{s.t. } & \mathbf{H}_i \operatorname{diag}(\mathbf{W}_i) = \operatorname{diag}(\mathbf{D}_i^v),\ \operatorname{diag}(\mathbf{W}_i) \ge 0, \end{aligned} \tag{10.5}$$

which can be optimized by quadratic programming.
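The first of the two steps, the closed-form update of the relevance matrix, can be illustrated with a short NumPy sketch. It assumes that the empirical loss is the squared error $\|\mathbf{F}_j - \mathbf{Y}\|^2$ and that every vertex and hyperedge has a positive degree; the helper names `theta` and `centralized_relevance` are hypothetical and are not taken from [1]. The quadratic program for the hyperedge weights is not covered here.

```python
import numpy as np

def theta(H, w=None):
    """Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for an (n, m) incidence matrix H."""
    H = np.asarray(H, dtype=float)
    m = H.shape[1]
    w = np.ones(m) if w is None else np.asarray(w, dtype=float)
    dv = H @ w                        # vertex degrees
    de = H.sum(axis=0)                # hyperedge degrees
    s = 1.0 / np.sqrt(dv)
    return (s[:, None] * H) @ np.diag(w / de) @ (H.T * s[None, :])

def centralized_relevance(thetas, j, Y, alpha1, alpha2, lam):
    """Closed-form F_j when hypergraph j is the core and the hyperedge weights are fixed."""
    n = Y.shape[0]
    others = sum(t for i, t in enumerate(thetas) if i != j)
    M = alpha1 * thetas[j] + alpha2 * others
    # F_j = lam/(1+lam) * (I - M/(1+lam))^{-1} Y, computed with a linear solve
    return lam / (1.0 + lam) * np.linalg.solve(np.eye(n) - M / (1.0 + lam), Y)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a numerical design choice of this sketch; both give the same result in exact arithmetic.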

To best integrate the data from the various MRI sequences, we learn a weight for every centralized hypergraph by minimizing the total hypergraph Laplacian, which is expressed as

$$\begin{aligned} \arg\min_{\rho_i} & \left\{ \sum_i \rho_i \, \Omega_i^c(\mathbf{F}_i) + \eta \sum_i \rho_i^2 \right\} \\ \text{s.t. } & \sum_i \rho_i = 1, \end{aligned} \tag{10.6}$$

where $\rho_i$ represents the weight of the $i$-th centralized hypergraph, and $\eta$ represents the trade-off parameter between the Laplacian and the $\ell_2$-norm regularizer. Given the centralized hypergraph weights, the overall relevance matrix is $\mathbf{F} = \sum_i \rho_i \mathbf{F}_i$, whose entries are used to categorize each subject.
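Because Eq. (10.6) is a quadratic objective with a single equality constraint, the Lagrange multiplier method gives the weights in closed form: $\rho_i = \frac{1}{m} + \frac{\bar{\Omega} - \Omega_i}{2\eta}$, where $m$ is the number of centralized hypergraphs and $\bar{\Omega}$ is the mean of the regularizer values. The sketch below evaluates this expression; the function names are hypothetical, and since Eq. (10.6) as stated has no non-negativity constraint, a small $\eta$ can produce negative weights.

```python
import numpy as np

def combination_weights(omegas, eta):
    """Closed-form minimizer of sum(rho * omega) + eta * sum(rho**2) with sum(rho) = 1."""
    omegas = np.asarray(omegas, dtype=float)
    return 1.0 / omegas.size + (omegas.mean() - omegas) / (2.0 * eta)

def fuse(rhos, Fs):
    """Overall relevance matrix F = sum_i rho_i * F_i."""
    return sum(r * F for r, F in zip(rhos, Fs))

# Hypergraphs with a smaller regularizer value (smoother labels) get a larger weight.
rho = combination_weights([0.2, 0.4, 0.6], eta=1.0)
print(rho)
```

A larger $\eta$ pushes the weights toward the uniform value $1/m$, while a smaller $\eta$ concentrates the weight on the hypergraph with the smallest regularizer value.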

In this subsection, we have introduced a centralized hypergraph learning method that models patient relationships for MCI identification. A hypergraph is constructed for each type of data. During hypergraph learning, one hypergraph is chosen as the core hypergraph each time, and the remaining hypergraphs help the core hypergraph optimize the relevance matrix for prediction. The method not only takes into account the links among subjects but also makes use of different types of data to improve identification performance.

#### *10.2.2 Medical Image Retrieval*

Medical image retrieval is another crucial application of computer-aided diagnosis in Alzheimer's disease, alongside the classification of patients as MCI or normal control introduced above. Its main goal is to provide clinicians with relevant MCI examples whose imaging data are visually comparable. Such data can also support doctors in medical practice through case-based reasoning or evidence-based medicine.

The MCI diagnosis-aided medical image retrieval technique [2] comprises two primary stages, i.e., query category prediction for choosing candidates, and ranking. The first stage finds the subjects in the database that are most relevant to the query subject. This knowledge is then used to predict, in a supervised manner, the category of the query subject, i.e., MCI or NC in this case. The graphs built from the pairwise subject distances in the various data modalities are combined into a multi-graph to predict the category of the query, after which every subject falling under the same category as the query is regarded as a candidate. In the second stage, the query subject and all of the candidate subjects are represented together in a new multi-graph. The learning process on this multi-graph reveals how closely each candidate is related to the query subject, allowing the candidates to be ranked by their degree of similarity. The details of the two stages are shown in Fig. 10.2 [2] and explained below.

**Fig. 10.2** The pipeline for medical image retrieval method. This figure is from [2]

Given the query imaging data, the query category is first predicted using the subjects in the database so that candidates can then be chosen based on the result. To analyze the similarity between the query subject and the training subjects chosen from the database, a graph $\mathcal{G}_i = \langle \mathcal{V}_i, \mathcal{E}_i, \mathbf{W}_i \rangle$ with $N + 1$ vertices is generated for the imaging data of the $i$-th modality out of $N_{mod}$ modalities. The weight $\mathbf{W}_i(v_s, v_t)$ of edge $\mathcal{E}_i(v_s, v_t)$, which connects the $s$-th and $t$-th vertices of the graph $\mathcal{G}_i$, is given by

$$\mathbf{W}_{i}(v_{s}, v_{t}) = \exp\left(-\frac{d^{2}(v_{s}, v_{t})}{\sigma_{i}^{2}}\right),\tag{10.7}$$

where $d(v_s, v_t)$ represents the Euclidean distance between vertices $v_s$ and $v_t$ in the feature space. Similar to the procedure for identifying MCI, the optimization problem for the multi-graph learning task of query category prediction can be written as

$$\begin{aligned} \arg\min_{\mathbf{F}, \omega} & \left\{ \sum_{i=1}^{N_{mod}} \omega_i \Omega_i(\mathbf{F}) + \mu \mathcal{R}(\mathbf{F}) + \eta \|\omega\|_2^2 \right\} \\ \text{s.t. } & \sum_{i=1}^{N_{mod}} \omega_i = 1, \end{aligned} \tag{10.8}$$

where $\omega$ and $\mathbf{F}$ represent the weighting parameters and the relevance matrix, respectively, $\mu$ and $\eta$ represent the trade-off hyperparameters, $\mathcal{R}$ represents the empirical loss, and $\Omega_i$ represents the regularizer term defined as

$$\Omega_{i}(\mathbf{F}) = \frac{1}{2} \sum_{v_{s}, v_{t}} \mathbf{W}_{i}(v_{s}, v_{t}) \left\| \frac{\mathbf{F}(v_{s}, \cdot)}{\sqrt{\mathbf{D}_{i}(v_{s}, v_{s})}} - \frac{\mathbf{F}(v_{t}, \cdot)}{\sqrt{\mathbf{D}_{i}(v_{t}, v_{t})}} \right\|^{2}. \tag{10.9}$$

To solve this optimization problem, $\mathbf{F}$ and $\omega$ are optimized alternately. When $\omega$ is fixed, the optimization problem for $\mathbf{F}$ is written as

$$\arg\min_{\mathbf{F}} \left\{ \sum_{i=1}^{N_{mod}} \omega_i \Omega_i(\mathbf{F}) + \mu \mathcal{R}(\mathbf{F}) \right\},\tag{10.10}$$

which can be solved using the iterative process [9] formulated as

$$\mathbf{F}(t+1) = \frac{1}{\mu+1} \sum_{i=1}^{N_{mod}} \omega_i \Theta_i \mathbf{F}(t) + \frac{\mu}{\mu+1} \mathbf{Y},\tag{10.11}$$

where $\mathbf{F}(t)$ is the result of the $t$-th iteration, initialized with $\mathbf{F}(0) = \mathbf{Y}$. When $\mathbf{F}$ is fixed, the optimization problem for $\omega$ can be formulated as

$$\begin{aligned} \arg\min_{\omega} & \left\{ \sum_{i=1}^{N_{mod}} \omega_i \Omega_i(\mathbf{F}) + \eta \|\omega\|_2^2 \right\} \\ \text{s.t. } & \sum_{i=1}^{N_{mod}} \omega_i = 1, \end{aligned} \tag{10.12}$$

which can be solved with the Lagrange multiplier method. Based on the predicted category of the query subject, all database subjects belonging to that category are used as candidate retrieval results.
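The label propagation step for a fixed $\omega$ can be sketched as follows. The sketch takes $\Theta_i = \mathbf{D}_i^{-1/2}\mathbf{W}_i\mathbf{D}_i^{-1/2}$, which is the matrix form implied by the normalized regularizer in Eq. (10.9), uses a Gaussian kernel with a negative exponent so that the weight decays with distance, and assumes a squared-error empirical loss. The function names, the removal of self-loops, and the fixed iteration count are assumptions of this example rather than details from [2].

```python
import numpy as np

def gaussian_graph(X, sigma):
    """Edge weights exp(-d^2 / sigma^2) from an (n, d) feature matrix."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)          # assumed: no self-loops
    return W

def normalized_adjacency(W):
    """Theta = D^{-1/2} W D^{-1/2}, with D the diagonal degree matrix."""
    s = 1.0 / np.sqrt(W.sum(axis=1))
    return s[:, None] * W * s[None, :]

def propagate(thetas, omega, Y, mu, n_iter=200):
    """Iterate F(t+1) = S F(t) / (mu + 1) + mu / (mu + 1) * Y with S = sum_i omega_i Theta_i."""
    S = sum(w * T for w, T in zip(omega, thetas))
    Y = np.asarray(Y, dtype=float)
    F = Y.copy()                      # F(0) = Y
    for _ in range(n_iter):
        F = S @ F / (mu + 1.0) + mu / (mu + 1.0) * Y
    return F
```

Since the eigenvalues of each normalized adjacency matrix lie in $[-1, 1]$ and the weights sum to one, the iteration contracts by a factor of at most $1/(\mu+1)$ per step and converges to $\frac{\mu}{\mu+1}\left(\mathbf{I} - \frac{1}{\mu+1}\sum_i \omega_i\Theta_i\right)^{-1}\mathbf{Y}$.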

The candidates are then ranked to retrieve the most relevant subjects. Even though they belong to the same category as the query subject, they may still differ from each other in imaging appearance. In a manner similar to the previous classification step, a graph is constructed over the candidate subjects and the query subject for each of the $N_{mod}$ modalities, where the $i$-th graph is denoted by $\hat{\mathcal{G}}_i$. Since the graph weights $\omega$ have already been learned, the optimization problem can be written as

$$\arg\min_{\hat{\mathbf{f}}} \left\{ \sum_{i=1}^{N_{mod}} \omega_{i} \hat{\Omega}_{i}(\hat{\mathbf{f}}) + \hat{\lambda} \hat{\mathcal{R}}(\hat{\mathbf{f}}) \right\},\tag{10.13}$$

where $\hat{\mathbf{f}}$ and $\hat{\Omega}$ represent the relevance vector and the graph regularizer, respectively, and $\hat{\mathcal{R}}$ is the empirical loss. As with Eq. (10.10), the optimization task is handled using an iterative procedure, represented by

$$\hat{\mathbf{f}}(t+1) = \frac{1}{\hat{\lambda}+1} \sum_{i=1}^{N_{mod}} \omega_i \hat{\Theta}_i \hat{\mathbf{f}}(t) + \frac{\hat{\lambda}}{\hat{\lambda}+1} \hat{\mathbf{y}}.\tag{10.14}$$

The ranking of all candidates is obtained by sorting them according to the relevance scores in $\hat{\mathbf{f}}$.

This subsection introduces the process of retrieving data relevant to the query subject from medical imaging datasets to support the diagnosis of MCI. The first stage selects the candidate set from the database, and the second stage computes the correlation between the query subject and all of the subjects in the candidate set and then ranks the retrieval based on the correlation. Both stages employ multi-graphs to describe the relationship between subjects, so as to facilitate retrieval tasks.

#### *10.2.3 COVID-19 Identification Using CT Imaging*

The COVID-19 pandemic, the most widespread public health crisis since late 2019, is caused by an extremely infectious virus that can induce multiple organ failure and severe respiratory distress. It is therefore crucial to correctly distinguish COVID-19 from other forms of pneumonia so that appropriate treatment programs can be designed. Nevertheless, the task is complex because of two main difficulties, namely noisy data resulting from the highly varied data gathered during a crisis, and confusing cases resulting from the similarity between COVID-19 and other types of pneumonia in the initial phases of symptoms.

Numerous investigations have demonstrated the usefulness of differentiating between COVID-19 and other types of pneumonia using CT, leading to the introduction of an uncertainty vertex-weighted hypergraph learning strategy to identify COVID-19 from other types of pneumonia using CT images [4]. It formulates data correlations among various instances to limit the interference by noisy data and confusing examples by employing an uncertainty rating quantification module and a vertex-weighted hypergraph structure. The framework introduction that follows is divided into three parts, namely pre-processing, measuring data uncertainty, and hypergraph construction and learning. Figure 10.3 depicts the overall illustration.

**Fig. 10.3** An illustration of the uncertainty vertex-weighted hypergraph learning method for identifying COVID-19 among other types of pneumonia. This figure is from [4]

During the pre-processing stage, regional features and radiomics features are extracted from the CT image of every patient, which is segmented using VB-Net [10]. Regional features include the number of infected lesions and the surface area of the lesions, whereas radiomics features include textural features such as the gray-level co-occurrence matrix. The feature representation **X** of a patient's CT image is constructed by combining the two categories of features with information on age and gender.

Since noise can affect data quality, measuring data uncertainty is crucial for determining the reliability of the data throughout the learning process. Two types of uncertainty are measured, namely aleatoric and epistemic uncertainty. The former results from data abnormalities, noise, or other issues that lower the data quality, and the latter arises when the features of a case lie near the decision boundary. The goal of parameter estimation under aleatoric uncertainty is to minimize the KL divergence between the true and predicted distributions, which can be represented by

$$\hat{\Theta} = \underset{\Theta}{\arg\min}\ \frac{1}{N} \sum_{i=1}^{N} D_{KL}(P_D(\mathbf{X}_i)\,\|\,P_{\Theta}(\mathbf{X}_i)),\tag{10.15}$$

where $P_D(\mathbf{X}_i)$ and $P_\Theta(\mathbf{X}_i)$ represent the true distribution and the predicted distribution, respectively. The resulting loss function is expressed as

$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{2} \exp(-\alpha_{\Theta}(\mathbf{X}_{i}))\, \mathrm{CE}\left( \mathbf{y}_{i}, f_{\Theta}(\mathbf{X}_{i}) \right) + \frac{1}{2} \alpha_{\Theta}(\mathbf{X}_{i}) \right), \tag{10.16}$$

where $\alpha_\Theta(\mathbf{X}_i)$ represents the logarithm of the estimated variance, $\mathrm{CE}$ denotes the cross-entropy loss, and the aleatoric uncertainty is defined as $A_\Theta(\mathbf{X}_i) = \exp(\alpha_\Theta(\mathbf{X}_i))$. The epistemic uncertainty, which reflects the model's inability to generate accurate predictions, can be estimated by applying dropout at inference time and is written as

$$\mathcal{E}(f_{\hat{\Theta}}(\mathbf{X}_{i})) \approx \frac{1}{K} \sum_{k=1}^{K} f_{\hat{\Theta}(\omega^{k})}(\mathbf{X}_{i})^{\top} f_{\hat{\Theta}(\omega^{k})}(\mathbf{X}_{i}) - \mathbf{E}(f_{\hat{\Theta}(\omega^{k})}(\mathbf{X}_{i}))^{\top} \mathbf{E}(f_{\hat{\Theta}(\omega^{k})}(\mathbf{X}_{i})),\tag{10.17}$$

where $\omega$ represents the set of random variables and $k$ indexes the $k$-th test with dropout. The overall uncertainty is $\mathcal{U}_{\hat{\Theta}}(\mathbf{X}_i) = A_{\hat{\Theta}}(\mathbf{X}_i) + \mathcal{E}(f_{\hat{\Theta}}(\mathbf{X}_i))$. After normalization, the final uncertainty score is formulated as

$$U_i = \sigma \left(\lambda \frac{\mathcal{U}_{\hat{\Theta}}(\mathbf{X}_i) - \mu_e}{s_e}\right),\tag{10.18}$$

where $\mu_e$ and $s_e$ represent the mean and the standard deviation of $\mathcal{U}$, and $\sigma$ stands for the sigmoid function, which maps the output to a value between 0 and 1.
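The uncertainty computations in Eqs. (10.16)–(10.18) can be sketched with NumPy, assuming that the network outputs have already been collected into arrays. The function names, the array layout `(K, N, C)` for the $K$ dropout passes, and the use of the sample mean over passes as the estimate of $\mathbf{E}(\cdot)$ are assumptions of this example, not details taken from [4].

```python
import numpy as np

def aleatoric_loss(log_var, ce):
    """Eq. (10.16): per-case cross-entropy `ce` attenuated by the predicted log-variance."""
    log_var = np.asarray(log_var, dtype=float)
    ce = np.asarray(ce, dtype=float)
    return np.mean(0.5 * np.exp(-log_var) * ce + 0.5 * log_var)

def epistemic_uncertainty(preds):
    """Eq. (10.17) for predictions of shape (K, N, C) from K dropout passes."""
    preds = np.asarray(preds, dtype=float)
    second_moment = (preds ** 2).sum(axis=-1).mean(axis=0)   # (1/K) sum_k f_k^T f_k
    mean_pred = preds.mean(axis=0)                           # Monte Carlo estimate of E[f]
    return second_moment - (mean_pred ** 2).sum(axis=-1)

def normalized_uncertainty(aleatoric, epistemic, lam=1.0):
    """Eq. (10.18): standardize the total uncertainty and squash it with a sigmoid."""
    total = np.asarray(aleatoric, dtype=float) + np.asarray(epistemic, dtype=float)
    z = (total - total.mean()) / total.std()
    return 1.0 / (1.0 + np.exp(-lam * z))
```

The epistemic term equals the sum of the per-class variances across the dropout passes, so it is zero when all passes agree and grows as the stochastic predictions spread out.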

Each case is viewed as a vertex in a hypergraph that is constructed to mine the high-order correlations among related patients for more precise prediction. Regional and radiomics features are used separately to construct hyperedges. In the regional feature space, every vertex is regarded in turn as a center vertex, and the nearest neighbor algorithm is used to link it to its $K$ nearest vertices to build a hyperedge. The same method is applied to generate hyperedges using the radiomics features. In contrast to the usual hypergraph, the uncertainty hypergraph must take into account both the connection relationship and the uncertainty score of each vertex, leading to a more general definition of the incidence matrix of the uncertainty vertex-weighted hypergraph $\mathcal{G} = \langle \mathcal{V}, \mathcal{E}, \mathbf{W}, \mathbf{U} \rangle$ as

$$\mathbf{H}(v_j, e_l) = \begin{cases} U_j & \text{if } v_j \in e_l \\ 0 & \text{otherwise} \end{cases}.\tag{10.19}$$

In contrast to conventional hypergraph learning strategies, this structure accounts for data uncertainty, and its optimization objective can be expressed as

$$\begin{cases} \mathcal{Q}_{\mathbf{U}}(\mathbf{F}) = \arg\min_{\mathbf{F}} \{ \Omega(\mathbf{F}) + \lambda \mathcal{R}_{emp}(\mathbf{F}) \} \\ \Omega(\mathbf{F}, \mathcal{V}, \mathbf{U}, \mathcal{E}, \mathbf{W}) = \operatorname{tr}(\mathbf{F}^{\top}(\mathbf{U}^{\top} - \mathbf{U}^{\top}\Theta_{\mathbf{U}}\mathbf{U})\mathbf{F}) \\ \mathcal{R}_{emp}(\mathbf{F}, \mathbf{U}) = \sum_{k=1}^{K} \|\mathbf{F}(:,k) - \mathbf{Y}(:,k)\|^{2} \end{cases} , \tag{10.20}$$

where $\Omega(\cdot)$ and $\mathcal{R}_{emp}(\cdot)$ represent the regularizer and the empirical loss, respectively, and $\Theta_{\mathbf{U}} = \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}$. The empirical loss can be rewritten with the uncertainty weights as

$$\mathcal{R}_{emp}(\mathbf{F}, \mathbf{U}) = \operatorname{tr}(\mathbf{F}^\top \mathbf{U}^\top \mathbf{U} \mathbf{F} + \mathbf{Y}^\top \mathbf{U}^\top \mathbf{U} \mathbf{Y} - 2\mathbf{F}^\top \mathbf{U}^\top \mathbf{U} \mathbf{Y}).\tag{10.21}$$

The output matrix $\mathbf{F} \in \mathbb{R}^{n \times K}$ ($K$ representing the number of classes, i.e., $K = 2$ in this case) is thus given by

$$\mathbf{F} = \lambda(\mathbf{U}^{\top} - \mathbf{U}^{\top}\Theta_{\mathbf{U}}\mathbf{U} + \lambda\mathbf{U}^{\top}\mathbf{U})^{-1}\mathbf{U}^{\top}\mathbf{U}\mathbf{Y}.\tag{10.22}$$

New test cases can be classified as COVID-19 or other types of pneumonia using the output label matrix obtained above.
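A minimal NumPy sketch of the closed-form solution in Eq. (10.22) is given below. It assumes that $\mathbf{U}$ is the diagonal matrix of the per-vertex scores $U_j$ and that the incidence matrix has already been filled according to Eq. (10.19); both function names are hypothetical, and the sketch is an illustration rather than the implementation of [4].

```python
import numpy as np

def hypergraph_theta(H, w):
    """Theta_U = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for a (possibly weighted) incidence matrix."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    dv = H @ w                        # vertex degrees
    de = H.sum(axis=0)                # hyperedge degrees
    s = 1.0 / np.sqrt(dv)
    return (s[:, None] * H) @ np.diag(w / de) @ (H.T * s[None, :])

def uncertainty_hypergraph_labels(H, w, u, Y, lam):
    """F = lam * (U^T - U^T Theta U + lam * U^T U)^{-1} U^T U Y, as in Eq. (10.22)."""
    U = np.diag(np.asarray(u, dtype=float))   # assumed: U is diagonal
    Theta = hypergraph_theta(H, w)
    A = U.T - U.T @ Theta @ U
    G = U.T @ U
    return lam * np.linalg.solve(A + lam * G, G @ np.asarray(Y, dtype=float))

# Three cases, two hyperedges; unlabeled cases have an all-zero row in Y.
member = np.array([[1, 0], [1, 1], [0, 1.0]])
u = np.array([0.9, 0.5, 0.8])
H = member * u[:, None]               # Eq. (10.19): entries are U_j inside a hyperedge
Y = np.array([[1, 0], [0, 0], [0, 1.0]])
F = uncertainty_hypergraph_labels(H, np.ones(2), u, Y, lam=1.0)
print(F.argmax(axis=1))
```

Each row of the output is read as class scores, and the predicted class of a case is the column with the largest score.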

#### *10.2.4 ASD Identification Using Brain Functional Networks*

Autism spectrum disorder (ASD) is a widespread developmental disorder that mostly affects children and has negative effects such as impaired social communication. Because the number of cases is rising, early identification and treatment of ASD are crucial so that patients can acquire new skills under clinical supervision. The diagnosis of ASD depends mostly on skilled specialists, and it is difficult to identify ASD quickly due to the shortage of experts. The correlation of various functional connectivity (FC) pattern features in ASD patients can be used for rapid diagnosis.

**Fig. 10.4** A pipeline to classify ASD or healthy controls from brain functional networks data using dynamic hypergraph learning. This figure is from [3]

The ASD identification method using brain functional networks [3] is divided into three stages, namely the selection of pre-processed features, hypergraph construction, and identification using dynamic hypergraph learning. The overall process is shown in Fig. 10.4. In the first stage, static FC (sFC) and dynamic FC (dFC) are produced using a sliding window algorithm on the original functional magnetic resonance imaging time series, and Lasso regression is then employed for feature selection. The hypergraph construction stage creates a hypergraph based on the comparison of image features that represent data similarity in multiple modalities. Finally, a multi-modal dynamic hypergraph learning technique identifies ASD and simultaneously improves the hypergraph structure.

The feature selection stage aims to discover valuable features in the dFC and sFC sequences. The $i$-th subject's sFC sequence of $\tau$ time points is first separated into $n$ sub-sequences, with the $j$-th sub-sequence consisting of the time points $\{j, n + j, 2n + j, \ldots\}$. Defining $\bar{\mathbf{z}}_i^j$ as the dynamic FC feature of the $j$-th sub-sequence of subject $i$, the Lasso regression model, used as the selection operator, can be expressed as

$$\arg\min_{\beta_0, \beta} \left( \frac{1}{2\tau' |\mathcal{J}^{\mathcal{P}}|} \sum_{i \in \mathcal{J}^{\mathcal{P}}} \sum_{j=1}^{\tau'} \left( y_i - \beta_0 - \beta^\top \bar{\mathbf{z}}_i^{j} \right)^2 + \mu \|\beta\|_1 \right),\tag{10.23}$$

where $\tau' = \tau/n$ is the length of the sub-sequences, $y_i$ represents the label of the subject, $\beta$ is the regression coefficient, and $\mu$ stands for the trade-off hyperparameter. Features with zero coefficients are discarded, and the remaining features are denoted by $\mathbf{z}_i^j$. Defining $\bar{\mathbf{x}}_i$ as the static FC feature of the $i$-th subject, the Lasso regression model is expressed as

$$\arg\min_{\gamma_0, \gamma} \left( \frac{1}{2|\mathcal{J}^{\mathcal{P}}|} \sum_{i \in \mathcal{J}^{\mathcal{P}}} \left( y_{i} - \gamma_{0} - \gamma^{\top} \bar{\mathbf{x}}_{i} \right)^{2} + \eta \|\gamma\|_{1} \right),\tag{10.24}$$

where $y_i$ represents the label of the subject, $\gamma$ is the regression coefficient, and $\eta$ stands for the trade-off hyperparameter. Features with non-zero coefficients in the sFC selection operator are selected in the same way as for dFC and are denoted by $\mathbf{x}_i$.
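The two ingredients of this stage, interleaved sub-sequence splitting and Lasso-based feature selection, can be sketched as follows. The Lasso objective is the standard one, $\frac{1}{2N}\sum(y - \beta_0 - \beta^\top \mathbf{z})^2 + \mu\|\beta\|_1$, but the solver shown here is a plain proximal gradient method (ISTA) chosen only to keep the example self-contained; it is not claimed to be the solver used in [3], and the function names are hypothetical. Indices are 0-based in the code, whereas the text counts sub-sequences from 1.

```python
import numpy as np

def split_subsequences(series, n):
    """The j-th sub-sequence takes time points {j, n + j, 2n + j, ...} (0-based here)."""
    return [series[j::n] for j in range(n)]

def lasso_ista(Z, y, mu, n_iter=2000):
    """Minimize 1/(2N) * ||y - b0 - Z b||^2 + mu * ||b||_1 by proximal gradient descent."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc, yc = Z - z_mean, y - y_mean               # centring absorbs the intercept b0
    N = Zc.shape[0]
    step = N / np.linalg.norm(Zc, 2) ** 2         # 1 / Lipschitz constant of the gradient
    beta = np.zeros(Zc.shape[1])
    for _ in range(n_iter):
        grad = Zc.T @ (Zc @ beta - yc) / N
        b = beta - step * grad
        beta = np.sign(b) * np.maximum(np.abs(b) - step * mu, 0.0)   # soft-thresholding
    beta0 = y_mean - z_mean @ beta
    return beta0, beta

def selected_features(beta, tol=1e-8):
    """Indices of the features whose coefficients are non-zero and are therefore kept."""
    return np.flatnonzero(np.abs(beta) > tol)
```

Increasing the trade-off hyperparameter drives more coefficients to exactly zero, so fewer features survive the selection.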

The dFC sub-hypergraph $\mathcal{G}_1 = (\mathcal{V}, \mathcal{E}_1)$ and the sFC sub-hypergraph $\mathcal{G}_2 = (\mathcal{V}, \mathcal{E}_2)$, in which every vertex stands for a sub-sequence of a subject, are combined to construct the hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, i.e., $\mathcal{E} = \mathcal{E}_1 \cup \mathcal{E}_2$. Since sFC features are defined at the subject level, the sFC features of the sub-sequences inherit the static features of their subject, i.e., $\mathbf{x}_i^j = \mathbf{x}_i$. Each vertex in each sub-hypergraph is regarded as a central vertex, and the nearest neighbor algorithm is employed to connect it to $k$ neighbors ($k = 2n, 3n, \ldots, k_{max}n$) to create $k_{max}$ hyperedges. Once the two sub-hypergraphs are generated, the hypergraph is formed, and its incidence matrix is expressed as

$$\mathbf{H}(v,e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{otherwise} \end{cases}.\tag{10.25}$$

To enhance the structure of the hypergraph and to help predict ASD, the potential function of a hyperedge is defined as

$$f(e) = \sum_{u,v \in \mathcal{V}} \frac{\mathbf{H}(u,e)\mathbf{H}(v,e)g(u,v)}{(1+\alpha_1+\alpha_2)\delta(e)},\tag{10.26}$$

where

$$\begin{split} g(u,v) &= \left\| \frac{\hat{\mathbf{y}}_{u}}{\sqrt{d(u)}} - \frac{\hat{\mathbf{y}}_{v}}{\sqrt{d(v)}} \right\|_{2}^{2} + \alpha_{1} \left\| \frac{\mathbf{x}_{u}}{\sqrt{d(u)}} - \frac{\mathbf{x}_{v}}{\sqrt{d(v)}} \right\|_{2}^{2} \\ &\quad + \alpha_{2} \left\| \frac{\mathbf{z}_{u}}{\sqrt{d(u)}} - \frac{\mathbf{z}_{v}}{\sqrt{d(v)}} \right\|_{2}^{2}. \end{split} \tag{10.27}$$

Here $\delta(e)$ represents the degree of hyperedge $e$, $d(u)$ and $d(v)$ represent the degrees of vertices $u$ and $v$, $\hat{\mathbf{y}}_u$ and $\hat{\mathbf{y}}_v$ stand for the to-be-learned labels of $u$ and $v$, respectively, and $\alpha_1, \alpha_2$ are the trade-off hyperparameters. Note that the potential function measures the data distribution on the hyperedge jointly in the sFC, dFC, and label spaces. The dynamic hypergraph learning cost function is formulated as

$$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{H}) = \sum_{e \in \mathcal{E}} \omega(e) f(e) + \theta \|\mathbf{y} - \hat{\mathbf{y}}\|_2^2 + \lambda \|\mathbf{H} - \mathbf{H}_0\|_2^2,\tag{10.28}$$

where $\omega(e)$ stands for the weight of hyperedge $e$, $\mathbf{H}_0$ represents the initial hypergraph structure, and $\theta$ and $\lambda$ are trade-off hyperparameters. The objective function consists of three terms: the first term is the loss based on the hypergraph structure, and the following two terms are the empirical losses of $\hat{\mathbf{y}}$ and $\mathbf{H}$. The optimization of Eq. (10.28) consists of two stages. First, we optimize the to-be-learned labels $\hat{\mathbf{y}}$ with $\mathbf{H}$ fixed, which yields the closed-form solution

$$\hat{\mathbf{y}} = \left(\mathbf{I} + \frac{1}{\theta(1 + \alpha_1 + \alpha_2)}\Delta\right)^{-1} \mathbf{y},\tag{10.29}$$

where $\Delta = \mathbf{I} - \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}$, and $\mathbf{I}$, $\mathbf{D}_v$, and $\mathbf{D}_e$ represent the identity matrix, the diagonal vertex degree matrix, and the diagonal hyperedge degree matrix, respectively. Second, we optimize $\mathbf{H}$ with $\hat{\mathbf{y}}$ fixed by minimizing

$$\mathcal{L}(\mathbf{H}) = \operatorname{tr}\Big((\mathbf{I} - \mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2})\mathbf{K}\Big) + \lambda\|\mathbf{H} - \mathbf{H}_0\|_2^2,\tag{10.30}$$

where $\mathbf{K} = (\hat{\mathbf{y}}\hat{\mathbf{y}}^\top + \alpha_1\mathbf{X}\mathbf{X}^\top + \alpha_2\mathbf{Z}\mathbf{Z}^\top)/(1 + \alpha_1 + \alpha_2)$. This objective is optimized using the projected gradient method, whose iterative procedure is formulated as

$$\begin{aligned} \mathbf{H}_{k+1} &= \mathbf{P} [\mathbf{H}_k - h_k \nabla \mathcal{L}(\mathbf{H}_k)] \\ \nabla \mathcal{L}(\mathbf{H}) &= 2\lambda (\mathbf{H} - \mathbf{H}_0) + \mathbf{J} (\mathbf{I} \otimes \mathbf{H}^\top \mathbf{D}_v^{-1/2} \mathbf{K} \mathbf{D}_v^{-1/2} \mathbf{H}) \mathbf{W} \mathbf{D}_e^{-2} \\ &\quad + \mathbf{D}_v^{-3/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^\top \mathbf{D}_v^{-1/2} \mathbf{K} \mathbf{J} \mathbf{W} \\ &\quad - 2 \mathbf{D}_v^{-1/2} \mathbf{K} \mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1}, \end{aligned} \tag{10.31}$$

where $\mathbf{J} = \mathbf{1}\mathbf{1}^\top$, $h_k$ represents the optimization step size of the $k$-th iteration, and $\mathbf{P}$ stands for the projection onto the set $\{\mathbf{H} \mid \mathbf{0} \le \mathbf{H} \le \mathbf{1}\}$. When the iterative process converges, the labels of the sub-sequences of each subject are aggregated, and the predicted category is the one with the highest score after aggregation.
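The label update, the projection, and the final aggregation can be sketched in NumPy as follows. The sketch reads Eq. (10.29) as $\hat{\mathbf{y}} = (\mathbf{I} + c\,\Delta)^{-1}\mathbf{y}$ with $c = 1/(\theta(1+\alpha_1+\alpha_2))$, implements the projection as element-wise clipping to $[0, 1]$, and aggregates sub-sequence scores by summation; the summation rule and all function names are assumptions of this example, and the gradient in Eq. (10.31) is not reproduced.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    dv = H @ w                        # vertex degrees
    de = H.sum(axis=0)                # hyperedge degrees
    s = 1.0 / np.sqrt(dv)
    Theta = (s[:, None] * H) @ np.diag(w / de) @ (H.T * s[None, :])
    return np.eye(H.shape[0]) - Theta

def update_labels(H, w, y, theta, alpha1, alpha2):
    """Closed-form label update with the incidence matrix H held fixed."""
    c = 1.0 / (theta * (1.0 + alpha1 + alpha2))
    Delta = hypergraph_laplacian(H, w)
    return np.linalg.solve(np.eye(len(y)) + c * Delta, np.asarray(y, dtype=float))

def project(H):
    """Projection onto {H | 0 <= H <= 1}, applied element-wise."""
    return np.clip(H, 0.0, 1.0)

def aggregate(scores, subject_ids):
    """Sum sub-sequence scores per subject (assumed rule) and pick the best class."""
    subjects = np.unique(subject_ids)
    totals = np.stack([scores[subject_ids == s].sum(axis=0) for s in subjects])
    return subjects, totals.argmax(axis=1)
```

In a full projected gradient loop, `project` would be applied after every gradient step on the incidence matrix, and `update_labels` would be re-run with the updated structure until both stop changing.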

In this section, we have demonstrated the use of hypergraph-based approaches in four computer-aided diagnosis applications, namely MCI identification, medical image retrieval for MCI diagnostic assistance, COVID-19 identification, and ASD identification. In these applications, hypergraphs are employed to represent the high-order connections among subjects, so that the complicated links among patients can be mined to gather more knowledge than their images alone provide. In the future, it could be valuable to use hypergraphs to investigate few-shot learning and transfer learning strategies in medical applications such as MCI, COVID-19, and ASD.

#### **10.3 Survival Prediction with Histopathological Image**

Survival prediction models survival duration, i.e., the period over which a patient is followed up until a certain event, e.g., cancer recurrence or death. Survival prediction based on histopathological images aims to predict the survival duration or survival risk to a satisfactory degree using only the patient's images, to estimate the severity, or to classify high and low risks, which guides the pathologist in evaluating the case. Since histopathological images typically contain gigapixels and are far more detailed than regular natural images, e.g., those in ImageNet [11] or MNIST [12], the main challenge of this task is how to reliably obtain the patient's feature representation for regression analysis. Moreover, the relevant information about cells and tissues may not be readily extracted, because histopathological images contain complex relationships and rich morphological structure.

To cope with the large number of pixels, one technique [13] randomly chooses patches that contain a variety of cells and no blank regions from the histopathological images. It extracts patch features using a pre-trained CNN and calculates survival risk using Lasso-Cox [14] regression. To enhance the patient representation, another method [15] refines the low-level patch features produced by a pre-trained CNN-based feature extractor with a graph convolutional neural network that models the intricate relationships between patches. The representation learning capability of the non-graph-based method is limited by how well randomly selected patches cover the details of the original histopathological image and by the lack of mutual information between patches, whereas the graph-based method applies pairwise correlation modeling to make up for the loss of structural information among cells with similar roles. Nevertheless, reducing complex high-order connections to pairwise relationships inevitably results in inaccurate modeling and loses data correlations among cells and tissues that are necessary to predict a patient's survival. Hence, a better solution is to model high-order data associations with hypergraph computation.

The following subsections explain how to use hypergraph computation in survival prediction based on histopathological images in two parts, namely ranking-based survival prediction [5] and phenotypic and topological hypergraph-based survival prediction [6]. In the first part, a nearest-neighbor-based hypergraph modeling methodology is introduced, and optimization is achieved using a ranking-based method. In the second part, hypergraphs are created in both the latent feature space and the image space and merged for prediction.

#### *10.3.1 Ranking-Based Survival Prediction*

This part describes the three stages required for executing the ranking-based survival prediction task via hypergraph representation [5], namely pre-processing before generating the hypergraph, learning the hypergraph representation, and survival ranking prediction, as illustrated in Fig. 10.5. It is worth noting that these three stages apply to graph-based survival prediction frameworks in general, not just to the ranking-based hypergraph framework.

In the pre-processing stage, $N$ patches are randomly chosen from each histopathological image, and each patch has the same size as a typical natural image (e.g., 224 px × 224 px). Directly choosing patches at random from the original image, however, is likely to pick up noisy regions as well (e.g., erosion and blank areas). Therefore, before random sampling, the Otsu algorithm [16] is applied to segment the cell tissue regions with rich information. Next, the initial patch-level visual features $\mathbf{X}^{(0)} \in \mathbb{R}^{N \times F}$ are extracted by a deep neural network pre-trained on ImageNet [11], where $F$ represents the dimension of each patch feature. The raw features retrieved from the pre-trained model reflect the cells and tissues that are present in each patch and are appropriate for characterizing complex tissue patterns.

**Fig. 10.5** A pipeline of ranking-based survival prediction utilizing hypergraph representation, including pre-processing, hazard prediction via hypergraph representation, and ranking-based survival risk prediction. This figure is from [5]

Following pre-processing to extract feature information at the patch level, the hypergraph computation approach is used to produce features at the level of the whole histopathological image for the subsequent prediction of the survival risk score. Hypergraphs are created using the distance-based hypergraph generation method since, intuitively, cells and tissues with similar morphologies have comparable functions. Each patch is regarded as a vertex, and each vertex in turn serves as the center vertex of one hyperedge. This results in a total of $N$ vertices and $N$ hyperedges in the hypergraph reflecting the structural information of the histopathological image. We build hyperedges using the $k$ nearest neighbor approach, which connects each center vertex with the $k$ vertices whose raw features are closest to it in Euclidean distance. The hypergraph incidence matrix $\mathbf{H}$ is thereby obtained. Beyond pairwise graph structures, hierarchical grouping patterns can be discovered using the hyperedge structure, which creates a channel for the transfer and integration of information from the $k$ nearest morphological patches. The information fusion among patch vertices is then accomplished using hypergraph convolutional layers, as shown below:

$$\mathbf{X}^{(l+1)} = \sigma \left( \mathbf{D}\_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}\_e^{-1} \mathbf{H}^\top \mathbf{D}\_v^{-1/2} \mathbf{X}^{(l)} \boldsymbol{\Theta}^{(l)} \right), \tag{10.32}$$

where $\mathbf{X}^{(l)} \in \mathbb{R}^{N \times C\_l}$ is the input feature of the $l$-th convolution layer with $N$ vertices and $C\_l$ dimensions, $\mathbf{X}^{(l+1)}$ is the output feature of the $l$-th convolution layer, $\sigma$ stands for a nonlinear activation function, and the learnable parameters of the $l$-th layer are represented by $\boldsymbol{\Theta}^{(l)}$. After $L$ layers of convolution, the output $\mathbf{X}^{(L+1)}$ of the last layer is used to forecast survival duration, where the $N$ hyperedges might reflect $N$ patterns of causal variables. A pooling layer squeezes $\mathbf{X}^{(L+1)}$ into the patient's representation $\mathbf{X} \in \mathbb{R}^{1 \times C\_{L+1}}$, from which a fully connected neural network regresses the predicted survival risk score. The patient's actual survival time $t$ can be used to supervise the backpropagation process of the regression.
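The hypergraph construction and one convolution layer of Eq. (10.32) can be sketched as follows. This is a minimal illustration under stated assumptions: each hyperedge is taken to contain its center vertex plus the $k$ nearest neighbors, hyperedge weights default to one, ReLU is used for $\sigma$, and mean pooling stands in for the pooling layer. The function names `knn_incidence` and `hypergraph_conv` are hypothetical.

```python
import numpy as np

def knn_incidence(X, k):
    """N x N incidence matrix: hyperedge j joins vertex j and its k nearest neighbors."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between patch features.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    H = np.zeros((n, n))
    for center in range(n):
        neighbors = [i for i in np.argsort(dist[center]) if i != center][:k]
        H[center, center] = 1.0
        H[neighbors, center] = 1.0
    return H

def hypergraph_conv(X, H, Theta, w=None):
    """One layer of Eq. (10.32), with ReLU assumed as the activation."""
    w = np.ones(H.shape[1]) if w is None else w
    dv = H @ w                      # vertex degrees d(v) = sum_e w(e) h(v, e)
    de = H.sum(axis=0)              # hyperedge degrees delta(e) = sum_v h(v, e)
    Dv_inv_sqrt = np.diag(dv ** -0.5)
    A = Dv_inv_sqrt @ H @ np.diag(w / de) @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)

rng = np.random.default_rng(0)
X0 = rng.normal(size=(6, 4))                    # 6 patches with 4-dim toy features
H = knn_incidence(X0, k=2)
X1 = hypergraph_conv(X0, H, rng.normal(size=(4, 3)))
patient = X1.mean(axis=0, keepdims=True)        # pooled 1 x C representation
```

Dense matrices keep the sketch short; for the patch counts used on real slides a sparse representation of $\mathbf{H}$ would be the more practical choice.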

In addition to the specific survival duration of every single patient, ranking information, which can be used to infer the conditions of similar patients, is also significant in regression tasks. Moreover, ranking data accurately portray the ordering of patients by high and low risk. Survival ranking prediction is therefore introduced as the final and most significant stage. Because models are otherwise trained on one image at a time, and the inability to distinguish the relative risks of two similar instances is the most frequent cause of inaccurate patient risk comparisons, pairs of histopathological images (i.e., pairs of patients) should be taken into consideration. To fine-tune the model parameters and enhance the accuracy of the predicted ranking, a Bayesian-based method known as Bayesian Concordance Readjust (BCR) is presented. The BCR loss function, which is employed in pairwise training on histopathological images, can be formulated as follows:

$$\mathcal{L}\_{BCR} = -\log\left(\delta(\mathbb{W} \cdot (\mathbf{X}\_i - \mathbf{X}\_j))\right),\tag{10.33}$$

where $\mathbf{X}\_i$ and $\mathbf{X}\_j$ stand for the feature representations of patients $i$ and $j$, respectively, and $\mathbb{W}$ represents the learnable parameters of the regression.
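A pairwise loss of the form in Eq. (10.33) can be sketched as below. The sketch assumes that $\delta$ is the logistic sigmoid and that patient $i$ is the one who should receive the higher score; neither assumption is stated explicitly in the text, and the function name `pairwise_ranking_loss` is hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(x_i, x_j, w):
    """-log(sigmoid(w . (x_i - x_j))), evaluated stably as log(1 + exp(-margin))."""
    margin = float(np.dot(w, x_i - x_j))
    return float(np.logaddexp(0.0, -margin))

w = np.array([1.0, -1.0])          # toy regression parameters
x_i = np.array([2.0, 0.0])         # toy patient representations
x_j = np.array([0.0, 1.0])

loss_correct_order = pairwise_ranking_loss(x_i, x_j, w)   # margin = +3
loss_wrong_order = pairwise_ranking_loss(x_j, x_i, w)     # margin = -3
loss_tie = pairwise_ranking_loss(x_i, x_i, w)             # margin = 0, loss = log 2
```

The loss is small when the pair is scored in the expected order and grows roughly linearly with the margin when the order is reversed, which is what pushes the model toward concordant rankings.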

In this subsection, we provide a ranking-based survival prediction method for predicting a patient's survival hazard score from a single whole-slide image (WSI). The method first extracts informative patches from the WSI and then applies a hypergraph to describe the correlations among patches and create overall features of the WSI. Finally, the method considers relative ranking information among patients and achieves better prediction results.

#### *10.3.2 Phenotypic and Topological Hypergraph Modeling*

Hypergraphs that mine high-order correlations in the data are essential for accurately generating feature representations of histopathological images. The ranking-based survival prediction method presented above employs only the nearest neighbor generation method when constructing a hypergraph. This method only fine-tunes image features among patches with similar features and mines high-order relationships from a single perspective, which tends to leave other informative high-order relationships out. Therefore, here we describe a multi-hypergraph-based learning method for survival prediction [6], which efficiently achieves a high-order global representation of the histopathological image by modeling several kinds of hyperedge correlations in different spaces and using a basic hypergraph convolutional network.

**Fig. 10.6** Patch sampling and low-level feature extraction. This figure is from [6]

The goal of multi-hypergraph modeling is to uncover topological linkages among patches in image space and high-order connections among patches in latent feature space. Because the topological connections in image space must be analyzed, the random sampling approach employed previously cannot be used; instead, sampling is carried out according to the position of each patch in the original image. The sampling process therefore uses a boundary-to-center strategy (shown in Fig. 10.6) after the Otsu algorithm [16] filters out noisy regions to produce informative regions of interest. In addition to selecting the border $\mathcal{B}\_1$ and the center $\mathcal{C}$ of regions of interest, patches are chosen at distance ratios of $\frac{3}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$, i.e., $\mathcal{B}\_{3/4}$, $\mathcal{B}\_{1/2}$, and $\mathcal{B}\_{1/4}$ in Fig. 10.6, from the boundary to the center. Patches at the same relative distance from the border within the same region of interest, as well as the centers of different regions, are regarded as correlated in the image space.

A multi-hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is constructed by joining two sub-hypergraphs, namely a phenotypic sub-hypergraph $\mathcal{G}\_{phe} = (\mathcal{V}, \mathcal{E}\_{phe})$ created from the latent feature space and a topological sub-hypergraph $\mathcal{G}\_{top} = (\mathcal{V}, \mathcal{E}\_{top})$ generated from image space, i.e., $\mathcal{E} = \mathcal{E}\_{phe} \cup \mathcal{E}\_{top}$, as shown in Fig. 10.7. As explained in the previous method, the incidence matrix of the phenotypic sub-hypergraph $\mathbf{H}\_{phe}$ is built using the $k$ nearest neighbor method based on the Euclidean distances between extracted patch visual features. In the incidence matrix of the topological sub-hypergraph $\mathbf{H}\_{top}$, each vertex is linked to its neighbors in the topological space, i.e., the centers of all regions of interest, and $\mathcal{B}\_{1/4}$, $\mathcal{B}\_{1/2}$, $\mathcal{B}\_{3/4}$, and the boundaries of each region of interest.

**Fig. 10.7** Construction of multi-hypergraph, which contains a phenotypic sub-hypergraph and a topological sub-hypergraph. This figure is from [6]
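One possible reading of the topological sub-hypergraph, and of joining it with the phenotypic one, is sketched below. It assumes that every patch carries a region identifier and a ring level, that each (region, level) pair forms one hyperedge, and that the centers of all regions share a single hyperedge. The function name `topological_incidence`, the level labels, and the placeholder `H_phe` are hypothetical.

```python
import numpy as np

def topological_incidence(region_ids, levels, center_level="C"):
    """One hyperedge per (region, ring level), plus one joining all region centers."""
    groups = {}
    for v, (region, level) in enumerate(zip(region_ids, levels)):
        key = "centers" if level == center_level else (region, level)
        groups.setdefault(key, []).append(v)
    H = np.zeros((len(region_ids), len(groups)))
    for e, members in enumerate(groups.values()):
        H[members, e] = 1.0
    return H

# Six toy patches sampled from two regions of interest.
region_ids = [0, 0, 0, 1, 1, 1]
levels = ["B1", "B1", "C", "B1", "B1/2", "C"]
H_top = topological_incidence(region_ids, levels)

# Placeholder for the phenotypic incidence matrix; in the method described
# above it would come from k nearest neighbors in feature space.
H_phe = np.eye(6)

# E = E_phe ∪ E_top: the hyperedge sets are joined by stacking columns.
H_multi = np.hstack([H_phe, H_top])
```

Because both sub-hypergraphs share the vertex set, the union of hyperedges amounts to a column-wise concatenation of the two incidence matrices.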

The standard hypergraph neural network is modified into a hypergraph max-mask convolution with an increased number of hyperedges, which can address the overfitting issue caused by a lack of training data. Each layer's convolutional process consists of four steps, namely hyperedge feature gathering, max-mask operation, vertex feature aggregating, and vertex feature re-weighting.

In the first step, the features $\mathcal{F}\_e^{(l)}$ of each hyperedge are gathered from the vertices that are directly linked to it, which can be written as a product of $\mathbf{H}$ and $\mathbf{X}^{(l)}$. The hyperedge features $\mathcal{F}\_e^{(l+1)}$ of the convolutional layer are then produced by performing a max-mask operation on the features, excluding the $\lambda$ dominating hyperedges. In the final two steps, the output vertex features $\mathcal{F}\_v^{(l+1)}$ are obtained by aggregating the hyperedge features through multiplication with the matrix $\mathbf{H}$ and re-weighting them using a learnable parameter $\Theta^{(l)}$, respectively. The steps of each layer of the hypergraph neural network in the framework are therefore formulated as

$$\begin{cases} \mathbf{X}^{(l+1)} &= \sigma \left[ \left( (\mathbf{I} - \mathbf{L})\mathbf{X}^{(l)} + \mathbf{H}^{-1}(\mathbf{I} - \mathbf{L})\mathbf{X}^{(\lambda)} \right) \Theta^{(l)} \right] \\ \mathcal{F}\_{e}^{(l+1)} &= \mathbf{H}^{-1}(\mathbf{I} - \mathbf{L})\mathbf{X}^{(l)} + \mathbf{X}^{(\lambda)} \end{cases}, \tag{10.34}$$

where $\mathbf{X}^{(\lambda)}$ stands for an offset matrix containing only the data from the dominant $\lambda$ hyperedges, and the term $\mathbf{H}^{-1}(\mathbf{I} - \mathbf{L})\mathbf{X}^{(\lambda)}$ ensures that computing gradients and adjusting vertex features have no impact on the top $\lambda$ hyperedges.

With two learnable weight vectors, the vertex feature matrix $\mathbf{X}^{(L+1)}$ and the hyperedge feature matrix $\mathcal{F}\_e^{(L+1)}$ of the final layer are squeezed into feature vectors. The feature fusion module then merges the two vectors to establish a global feature representation that represents the entire hypergraph, i.e., the histopathological image for the regression task.

In this subsection, we introduce a general framework and a ranking-based optimization method for survival prediction using histopathological images. The challenges of survival prediction are then addressed by replacing the single nearest-neighbor modeling algorithm with the multi-hypergraph modeling method. The transformer network is a commonly used model for long sequential data, and histopathological images also include a significant quantity of sequential topological information, which makes it conceivable to incorporate transformers into the survival prediction task. In future work, transformers could therefore be included in the framework's feature extraction or hypergraph construction component.

#### **10.4 Drug Discovery**

Predicting drug–target interactions (DTIs) is a critical step in discovering new drugs to treat diseases. However, the biochemical experimental methods commonly used in wet laboratories are costly and tedious. The growing need for low-cost, effective, and efficient DTI prediction has prompted the development of computational drug discovery methods, among which machine-learning-based methods are some of the most promising. The core idea of these methods is that similar targets may be linked with similar drugs, and vice versa. This assumption *de facto* implies potential high-order associations between drugs and targets, especially when considering complex heterogeneous biological networks that contain different biological entities such as proteins.

In the DTI network, one single drug may interact with a group of targets, which can be generalized as a "one-to-many" pattern. When it comes to the aforementioned heterogeneous biological networks, the interactions between these biological entities become more complex, emerging as the "many-to-many" pattern. The hypergraph structure, which can naturally model high-order correlations owing to its flexible hyperedge, is suitable for modeling such a complex heterogeneous biological network. It can conveniently incorporate multiple complex interactions between different biological entities and further utilize the hypergraph computing technique to learn the correlations.

In this section, we present a heterogeneous hypergraph learning method for the DTI prediction (HHDTI) task [7]. The overall pipeline of the framework is illustrated in Fig. 10.8. It takes into consideration different types of interactions between biological entities (e.g., drug–target, drug–disease, and target–disease interactions) to facilitate DTI predictions.

#### **(1) Heterogeneous Hypergraph Modeling**

The overall procedure for modeling biological networks into a heterogeneous hypergraph is illustrated in Fig. 10.9. Given a heterogeneous biological network with different kinds of biological entities and interactions among these entities, the goal of hypergraph modeling is to characterize the heterogeneous biological network as a heterogeneous hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Here $\mathcal{V} = \{\mathcal{V}\_1 \cup \mathcal{V}\_2 \cup \dots \cup \mathcal{V}\_o\}$ indicates the vertex set, and $\mathcal{E} = \{\mathcal{E}\_1 \cup \mathcal{E}\_2 \cup \dots \cup \mathcal{E}\_r\}$ is the hyperedge set, where $o$ and $r$ are the numbers of entity types and interaction types, respectively. Specifically, we have $\mathcal{V}\_o = \{v\_1, v\_2, \dots, v\_{M\_o}\}$ with $M\_o$ vertices and $\mathcal{E}\_r = \{e\_1, e\_2, \dots, e\_{N\_r}\}$ with $N\_r$ hyperedges.

In the heterogeneous biological network discussed here, the set of entity types *O* contains drug, target, and disease. The set of interaction types *R* includes dr–ta, ta–dr, dr–di, and ta–di interactions.<sup>1</sup> Therefore, *o* is equal to 3 and *r* is equal to 4.

Moreover, on the basis of the overall heterogeneous hypergraph, multiple sub-hypergraphs can be constructed, with one sub-hypergraph corresponding to one type of correlation. Four sub-hypergraphs, i.e., four incidence matrices, are therefore acquired in all. They are denoted as $\mathbf{H} \in \mathbb{R}^{M \times N\_j}$, $j \in [1, r]$, where $M$ is the number of vertices of the two types corresponding to the correlation. Specifically, the four incidence matrices generated based on $R$ are defined as $(\mathbf{H}\_{dr-ta}, \mathbf{H}\_{ta-dr}, \mathbf{H}\_{dr-di}, \mathbf{H}\_{ta-di})$. Figure 10.10 shows an example of a drug hypergraph.
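As a small illustration of how such incidence matrices might be assembled, the sketch below follows the drug hypergraph of Fig. 10.10, where vertices are drugs and each hyperedge joins all drugs that share a target. Treating the target hypergraph as the mirror image (vertices are targets, one hyperedge per drug) is an assumption, and the helper `incidence_from_pairs` and the toy interaction list are hypothetical.

```python
import numpy as np

def incidence_from_pairs(pairs, n_vertices, n_hyperedges):
    """H[v, e] = 1 when vertex v takes part in the interaction defining hyperedge e."""
    H = np.zeros((n_vertices, n_hyperedges))
    for v, e in pairs:
        H[v, e] = 1.0
    return H

# Toy drug-target interactions as (drug index, target index) pairs.
dti = [(0, 0), (1, 0), (1, 1), (2, 1)]

# Drug hypergraph: 3 drug vertices, one hyperedge per target.
H_dr_ta = incidence_from_pairs(dti, n_vertices=3, n_hyperedges=2)

# Target hypergraph (assumed symmetric construction): 2 target vertices,
# one hyperedge per drug.
H_ta_dr = incidence_from_pairs([(t, d) for d, t in dti], n_vertices=2, n_hyperedges=3)

# Drugs connected by the hyperedge of target 0.
drugs_sharing_target_0 = np.flatnonzero(H_dr_ta[:, 0]).tolist()
```

The drug–disease and target–disease matrices could be built the same way from their respective association lists.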

#### **(2) Drug and Target Embedding Learning**

The same framework is used to create the overall embeddings for both drugs and targets. We now briefly introduce how this framework learns drug and target embeddings.

The overall embeddings are acquired by combining the main embeddings and the assisted embeddings. The main embeddings, which are learned from direct DTIs, provide the primary vectorized representations for all drugs and targets. In contrast, the assisted embeddings offer supplementary information discovered through disease-relevant data, such as *dr*–*di* and *ta*–*di* connections.

<sup>1</sup> dr, ta, di are abbreviations of drug, target, and disease, respectively.

**Fig. 10.9** The overall procedure for modeling biological networks into a heterogeneous hypergraph. This figure is from [7]

**Fig. 10.10** An example of a drug hypergraph. Each vertex on the hypergraph represents a drug, and each hyperedge connects all the drugs that share the same target

We first take a drug as an example to demonstrate the learning framework. The drug's main embeddings $\Phi\_d^k$ are learned from $\mathbf{H}\_{dr-ta}$ using an unsupervised Bayesian deep generative model, i.e., a hypergraph variational auto-encoder, while the drug assisted embeddings are generated from $\mathbf{H}\_{dr-di}$ by leveraging hypergraph neural networks (HGNN) [17]. For learning the main embeddings, given the DTI sub-hypergraph structure $\mathbf{H}\_{dr-ta}$, the Bayesian deep generative model serves as a vertex encoder [18] to explore the potential associations between drugs linked with one target. This method conducts a nonlinear mapping to transform the hypergraph structure $\mathbf{H}\_{dr-ta}$ from the observed space into the shared space $\Phi'\_{dr-ta}$ as

$$\boldsymbol{\Phi}'\_{dr-ta} = f\left(\mathbf{H}\_{dr-ta}\mathbf{W}\_{dr-ta} + \mathbf{b}\_{dr-ta}\right),\tag{10.35}$$

where the activation function *f (*·*)* is nonlinear.

The hyperbolic tangent $\tanh(x) = (\exp(x) - \exp(-x))/(\exp(x) + \exp(-x))$ is used here because of its analytic form and efficiency. The learnable weight and bias are represented by $\mathbf{W}\_{dr-ta} \in \mathbb{R}^{D\_{\text{in}} \times D\_{\text{out}}}$ and $\mathbf{b}\_{dr-ta} \in \mathbb{R}^{D\_{\text{out}}}$, where $D\_{\text{in}}$ and $D\_{\text{out}}$ are the corresponding dimensions of $\mathbf{H}\_{dr-ta}$ and $\Phi'\_{dr-ta}$, respectively. Following the acquisition of $\Phi'\_{dr-ta}$, two fully connected layers are used to estimate the mean and variance:

$$\mu\_{dr-ta} = f\left(\Phi\_{dr-ta}' \mathbf{W}\_{dr-ta}^{\mu} + \mathbf{b}\_{dr-ta}^{\mu}\right) \tag{10.36}$$

and

$$
\sigma\_{dr-ta} = f\left(\Phi\_{dr-ta}' \mathbf{W}\_{dr-ta}^{\sigma} + \mathbf{b}\_{dr-ta}^{\sigma}\right), \tag{10.37}
$$

where $\mathbf{W}^{\mu}\_{dr-ta}, \mathbf{W}^{\sigma}\_{dr-ta} \in \mathbb{R}^{D\_{\text{out}} \times D}$ and $\mathbf{b}^{\mu}\_{dr-ta}, \mathbf{b}^{\sigma}\_{dr-ta} \in \mathbb{R}^{D}$ are learnable weights and biases, as before. The main embeddings $\Phi\_d^k$ are then sampled by

$$
\Phi\_d^k = \mu\_{dr-ta} + \sigma\_{dr-ta} \odot \varepsilon,\tag{10.38}
$$

where $\odot$ is the Hadamard product and $\varepsilon \sim \mathcal{N}(0, \mathbf{I})$.
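The vertex encoder of Eqs. (10.35) to (10.38) can be sketched in a few lines. This is a minimal illustration with randomly initialized, untrained parameters; it follows the equations literally, so $\sigma$ is the tanh output of the second head and is used directly as the scale in the sampling step. The function name `encode_main_embedding`, the parameter dictionary keys, and the toy dimensions are hypothetical.

```python
import numpy as np

def encode_main_embedding(H, p, rng):
    """Shared layer (10.35), mean and scale heads (10.36, 10.37), sampling (10.38)."""
    f = np.tanh
    shared = f(H @ p["W"] + p["b"])
    mu = f(shared @ p["W_mu"] + p["b_mu"])
    sigma = f(shared @ p["W_sigma"] + p["b_sigma"])
    eps = rng.standard_normal(mu.shape)     # eps ~ N(0, I)
    return mu + sigma * eps, mu, sigma      # Hadamard product via elementwise *

rng = np.random.default_rng(0)
H_dr_ta = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # 3 drugs, 2 targets
d_in, d_out, d = H_dr_ta.shape[1], 4, 3                     # toy dimensions
params = {
    "W": rng.normal(size=(d_in, d_out)), "b": np.zeros(d_out),
    "W_mu": rng.normal(size=(d_out, d)), "b_mu": np.zeros(d),
    "W_sigma": rng.normal(size=(d_out, d)), "b_sigma": np.zeros(d),
}
phi_k, mu, sigma = encode_main_embedding(H_dr_ta, params, rng)
```

Sampling through $\mu + \sigma \odot \varepsilon$ keeps the randomness outside the parameters, so gradients can flow to the weights during training.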

In this way, the high-order structural correlations from the direct DTIs can be captured by the main embeddings. In addition to such direct interactions, other types of interactions can also contribute to DTI prediction, which has been validated by recent studies [19]. For instance, the similarity of the phenotypic side effects of two drugs can indicate whether they share a target [20, 21]. It has also been reported that targets can serve as a connection between drugs and diseases [22]. Inspired by these findings, auxiliary data are integrated into HHDTI to provide complementary information, which improves prediction accuracy and helps in extreme cases such as the cold-start problem (where only a few DTIs are available).

Specifically, the *dr*–*di* and *ta*–*di* correlations are considered in HHDTI. The embeddings learned from the dr–di incidence matrix $\mathbf{H}\_{dr-di}$ are called drug assisted embeddings and serve as the auxiliary representation for the drug's main embeddings. The drug assisted embeddings are learned by the HGNN model [17], with which the high-order correlations are encoded as

$$\text{Convh}(\mathbf{H}, \mathbf{X} \mid \mathbf{W}) = f\left(\left(\mathbf{D}^{v}\right)^{-1/2}\mathbf{H}\left(\mathbf{D}^{e}\right)^{-1}\mathbf{H}^{\top}\left(\mathbf{D}^{v}\right)^{-1/2}\mathbf{X}\mathbf{W}\right),\qquad(10.39)$$

where $\mathbf{D}^v$ and $\mathbf{D}^e$ are the degree matrices of vertices and hyperedges, respectively. The corresponding vertex and hyperedge degrees are $(\mathbf{D}^v)\_{k,k} = \sum\_{j=1}^{L} \mathbf{H}\_{k,j}$ and $(\mathbf{D}^e)\_{j,j} = \sum\_{k=1}^{N} \mathbf{H}\_{k,j}$, respectively. The matrix $\mathbf{W}$ is the learnable weight parameter, and $(\cdot)^\top$ is the transposition operator. Specifically, the convolutional layer used to learn the drug assisted embedding $\Phi\_d^s$ can be formulated as

$$\boldsymbol{\Phi}\_d^{s(l)} = \text{Convh}\left(\mathbf{H}\_{dr-di}, \boldsymbol{\Phi}\_d^{s(l-1)} \mid \mathbf{W}^{(l-1)}\right),\tag{10.40}$$

where $\Phi\_d^{s(l-1)}$, $\Phi\_d^{s(l)}$, and $\mathbf{W}^{(l-1)}$ represent the $(l-1)$-th layer's input, output, and trainable weight matrix, respectively. Here, the identity matrix is set as the initial value for $\mathbf{X}$, that is, $\Phi\_d^{s(0)} = \mathbf{X} = \mathbf{I}$. To create the overall embeddings, an attention module is used to combine the main embeddings and assisted embeddings into a single shared space. By determining the coefficients $\omega^i$, the bi-embedding attention fusion process assigns different weights to the main embeddings and assisted embeddings:

$$\omega^{i} = \frac{\exp\left(f\left(\boldsymbol{\Phi}^{i}\mathbf{W}^{i} + \mathbf{b}^{i}\right)\cdot\mathbf{P}^{i}\right)}{\sum\_{j \in \{k, s\}} \exp\left(f\left(\boldsymbol{\Phi}^{j}\mathbf{W}^{j} + \mathbf{b}^{j}\right)\cdot\mathbf{P}^{j}\right)},\tag{10.41}$$

where $\mathbf{W}^i \in \mathbb{R}^{D \times D'}$, $\mathbf{b}^i \in \mathbb{R}^{D'}$, and $\mathbf{P}^i \in \mathbb{R}^{D' \times 1}$ are trainable parameters, and $D$ and $D'$ are the corresponding dimensions. The overall drug embeddings $\Phi\_d^S$ can then be obtained by

$$
\Phi\_d^S = \omega^k \Phi\_d^k + \omega^s \Phi\_d^s. \tag{10.42}
$$

The overall embeddings of targets $\Phi\_t^S$ are generated similarly. The main difference is that $\mathbf{H}\_{ta-dr}$ and $\mathbf{H}\_{ta-di}$ are used as inputs. The target main embeddings $\Phi\_t^k$ are learned using the same vertex encoder as that of drugs. The HGNN model is also adopted to yield the target assisted embeddings $\Phi\_t^s$ from the target–disease association hypergraph. Finally, the embedding attention fusion is run to obtain the overall target embeddings $\Phi\_t^S$.
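The attention fusion of Eqs. (10.41) and (10.42) can be sketched as follows, assuming that $f$ is again the hyperbolic tangent and that the coefficients are computed per vertex, so each row of the two embeddings receives its own pair of weights. The function name `attention_fuse` and the random toy parameters are hypothetical.

```python
import numpy as np

def attention_fuse(phi_k, phi_s, params_k, params_s):
    """Softmax over the two embedding types, then a weighted sum of the embeddings."""
    def score(phi, params):
        W, b, P = params
        return np.tanh(phi @ W + b) @ P              # shape (M, 1)
    scores = np.concatenate([score(phi_k, params_k), score(phi_s, params_s)], axis=1)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability only
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)    # rows sum to one
    w_k, w_s = weights[:, :1], weights[:, 1:]
    return w_k * phi_k + w_s * phi_s, weights

rng = np.random.default_rng(0)
M, D, D_prime = 3, 4, 5                              # toy sizes
phi_k = rng.normal(size=(M, D))                      # main embeddings
phi_s = rng.normal(size=(M, D))                      # assisted embeddings

def random_params():
    return (rng.normal(size=(D, D_prime)), np.zeros(D_prime), rng.normal(size=(D_prime, 1)))

params_k, params_s = random_params(), random_params()
phi_S, weights = attention_fuse(phi_k, phi_s, params_k, params_s)
```

Because the two weights of each vertex sum to one, the fused embedding is a convex combination of the main and assisted embeddings.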

#### **(3) Drug–Target Interactions Prediction**

The likelihood of the drug and the target embeddings is calculated to create the reconstruction space $\mathbf{A}$, from which the DTI predictions are generated. That is, we have

$$\mathbf{A} = \text{Sigmoid}\left(\boldsymbol{\Phi}\_d^S \left(\boldsymbol{\Phi}\_t^S\right)^\top\right),\tag{10.43}$$

where $\text{Sigmoid}(\cdot)$ is the sigmoid function. The model is then trained by optimizing the variational lower bound $\mathcal{L}$, given by

$$\mathcal{L} = \mathbb{E}\_q \left[ \log p \left( \mathbf{A} \mid \boldsymbol{\Phi}\_d^{S}, \boldsymbol{\Phi}\_t^{S} \right) \right] - \beta \left( \text{KL} \left[ q \left( \boldsymbol{\Phi}\_d^{k} \mid \mathbf{A} \right) \| p \left( \boldsymbol{\Phi}\_d^{k} \right) \right] + \text{KL} \left[ q \left( \boldsymbol{\Phi}\_t^{k} \mid \mathbf{A} \right) \| p \left( \boldsymbol{\Phi}\_t^{k} \right) \right] \right), \tag{10.44}$$

where $\text{KL}[q(\cdot) \| p(\cdot)]$ is the Kullback–Leibler divergence between the distributions $q(\cdot)$ and $p(\cdot)$. Varying $\beta$ changes the amount of learning pressure applied during training and thus yields different learned representations. Inspired by the variational auto-encoder, Gaussian priors $p(\Phi\_d^k) = \prod\_i p(\phi\_i^d) = \prod\_i \mathcal{N}(\phi\_i^d \mid 0, \mathbf{I})$ and $p(\Phi\_t^k) = \prod\_j p(\phi\_j^t) = \prod\_j \mathcal{N}(\phi\_j^t \mid 0, \mathbf{I})$ can be adopted. Here, $\mathbb{E}\_q[\log p(\cdot \mid \cdot)]$ is the likelihood of the reconstruction space $\mathbf{A}$.
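A rough sketch of the decoder in Eq. (10.43) and of a training loss shaped like the negative of Eq. (10.44) is given below. It assumes a Bernoulli likelihood for the entries of the observed interaction matrix and diagonal Gaussian posteriors, for which the KL term against a standard normal prior has a closed form; the function names and the toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl_to_standard_normal(mu, sigma):
    """KL[N(mu, diag(sigma^2)) || N(0, I)], summed over all entries (sigma != 0)."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var))

def negative_elbo(A_obs, phi_d, phi_t, stats_d, stats_t, beta=1.0, eps=1e-9):
    """Negative lower bound: reconstruction term minus beta-weighted KL terms."""
    A_hat = sigmoid(phi_d @ phi_t.T)                           # Eq. (10.43)
    log_lik = np.sum(A_obs * np.log(A_hat + eps)
                     + (1.0 - A_obs) * np.log(1.0 - A_hat + eps))
    kl = kl_to_standard_normal(*stats_d) + kl_to_standard_normal(*stats_t)
    return -(log_lik - beta * kl), A_hat

rng = np.random.default_rng(0)
A_obs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])         # 3 drugs x 2 targets
phi_d, phi_t = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))
stats_d = (np.zeros((3, 4)), np.ones((3, 4)))                  # (mu, sigma) of drugs
stats_t = (np.zeros((2, 4)), np.ones((2, 4)))                  # (mu, sigma) of targets
loss, A_hat = negative_elbo(A_obs, phi_d, phi_t, stats_d, stats_t, beta=0.5)
```

Entries of `A_hat` close to one are the predicted interactions; minimizing `loss` corresponds to maximizing the lower bound.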

In this part, we introduce a general hypergraph-based framework for DTI prediction. Note that the framework is restricted neither to these types of complex interactions nor to the DTI prediction task; other types of interactions that may contribute to DTI prediction, or even other tasks involving complex correlations, are also conceivable.

In real-world applications, annotations for such biomedical data are expensive and time-consuming to obtain. Therefore, self-supervised learning has received a lot of attention recently, since it can mine useful information from the data without labels. Under such circumstances, it is of great significance to further devise self-supervised hypergraph computation for DTI prediction.

#### **10.5 Medical Image Segmentation**

In the field of medical imaging, hypergraph-based image segmentation methods also play a crucial role. Traditional multi-atlas segmentation (MAS) methods are limited in segmenting anatomical structures with poor image contrast. A hypergraph can address this limitation by modeling complex within-subject and subject-to-atlas image voxel relationships and propagating the labels on the atlas images to the target subject images.

This method, named hierarchical hypergraph patch labeling (HHPL) [8], characterizes higher-order associations between context features by constructing a hypergraph and transforms hypergraph learning into a hierarchical model. At the same time, a dynamic label propagation strategy is used to augment reliably identified labels from subject images to help predict labels.

As shown in Fig. 10.11, the pairwise relations used in conventional MAS methods are compared with the complex higher-order associations in hyperedges, where $p\_i$ is a subject image voxel and $R\_i(l)$ is defined as a 3-D cube of side length $l$ centered on $p\_i$. Image patches are extracted from the target subject image at voxel $p\_i$ and from the registered atlas images within the corresponding local neighborhood $R\_{n,i}(l)$. Hyperedges can be constructed similarly between the atlas image voxels and the target subject image voxels using the high-level context features from the label probability map.

**Fig. 10.11** Comparison of a simple pairwise relationship in the conventional MAS methods and the complex groupwise relationship in hyperedges (with much richer information). This figure is from [8]

In particular, the label on a target subject vertex is affected by the subject vertices that have already been labeled and by the related atlas vertices with known labels. The label propagation process follows two principles: (1) vertices grouped in the same hyperedge should have the same anatomical label, and (2) the difference between the labels of vertices with known labels before and after label propagation should be as small as possible. Therefore, the objective function of hypergraph learning is defined as follows:

$$\arg\min\_{\mathbf{f}} \left\{ \|\mathbf{y} - \mathbf{f}\|\_{2}^{2} + \lambda \cdot \Phi\left(\mathbf{f}, \mathbf{H}, \mathbf{W}, \mathbf{D}\_{e}, \mathbf{D}\_{v}\right) \right\}. \tag{10.45}$$

The first term penalizes the difference between the initial label vector $\mathbf{y}$ and the prediction vector $\mathbf{f}$. The second term is the hypergraph balance term defined as

$$\begin{split} & \Phi \left( \mathbf{f}, \mathbf{H}, \mathbf{W}, \mathbf{D}\_{e}, \mathbf{D}\_{v} \right) \\ &= \frac{1}{2} \sum\_{e \in \mathcal{E}} \sum\_{v, v' \in e} \frac{w(e) h(v, e) h(v', e)}{\delta(e)} \left( \frac{f(v)}{\sqrt{d(v)}} - \frac{f(v')}{\sqrt{d(v')}} \right)^2 . \end{split} \tag{10.46}$$

We can determine the optimal $\hat{\mathbf{f}}$ by differentiating the objective function with respect to $\mathbf{f}$ and setting the derivative to zero:

$$\hat{\mathbf{f}} = (\mathbf{I} + \lambda(\mathbf{I} - \Theta))^{-1}\mathbf{y},\tag{10.47}$$

where $\Theta = \mathbf{D}\_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^\top\mathbf{D}\_v^{-1/2}$. Having obtained the optimized $\hat{\mathbf{f}}$, the anatomical labels on the subject image are given by the sign of the corresponding values:

$$\begin{cases} \text{foreground} & \hat{f}\_i > 0\\ \text{background} & \hat{f}\_i < 0 \end{cases}, \quad i = 1, 2, \dots, |P| \,. \tag{10.48}$$
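The closed-form propagation of Eq. (10.47) and the sign rule of Eq. (10.48) can be checked on a toy hypergraph. The sketch assumes the standard normalized form of $\Theta$ (vertex-degree-normalized $\mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^\top$) and unit hyperedge weights; the function name `propagate_labels` and the four-vertex example are hypothetical.

```python
import numpy as np

def propagate_labels(H, y, lam=1.0, w=None):
    """Solve (I + lam (I - Theta)) f = y, i.e., the closed form of Eq. (10.47)."""
    n, m = H.shape
    w = np.ones(m) if w is None else w
    dv = H @ w                          # vertex degrees
    de = H.sum(axis=0)                  # hyperedge degrees
    Dv_inv_sqrt = np.diag(dv ** -0.5)
    Theta = Dv_inv_sqrt @ H @ np.diag(w / de) @ H.T @ Dv_inv_sqrt
    return np.linalg.solve(np.eye(n) + lam * (np.eye(n) - Theta), y)

# Vertices 0 and 3 are atlas voxels with known labels (+1 foreground,
# -1 background); vertices 1 and 2 are unlabeled subject voxels (0).
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, -1.0])

f_hat = propagate_labels(H, y, lam=1.0)
labels = np.where(f_hat > 0, "foreground", "background").tolist()
```

Each unlabeled subject voxel inherits the sign of the atlas voxel it shares a hyperedge with, and the magnitude of its score shrinks as the label is propagated.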

The segmentation can be computed iteratively to improve performance by repeating (1) hypergraph construction with high-level context features, (2) label propagation on the hypergraph, and (3) refinement of the context features. The segmentation results can be found in Fig. 10.12.

#### **10.6 Summary**

In this chapter, we introduce three typical applications of hypergraph computation in medical and biological tasks. In computer-aided diagnosis, three specific applications are covered, i.e., the identification and medical image retrieval of MCI and the identification of COVID-19 by CT imaging. These examples show how to adopt hypergraph computation for the tasks of classification and retrieval in medical and biological fields. For survival prediction with histopathological images, the demonstrated hypergraph computation techniques can also be extended to similar regression tasks. The introduced paradigm may also be applied to other cases with complicated connections. In summary, these examples demonstrate that the high-order correlations in medical and biological data can be modeled and learned by hypergraph computation, which can contribute to the corresponding studies. In addition to the aforementioned examples, there are many medical and biological applications that have the potential to be explored with hypergraph computation, such as medical image enhancement and multi-modal fusion.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 11 Hypergraph Computation for Computer Vision**

**Abstract** In this chapter, the applications of hypergraph computation in computer vision are introduced. Computer vision is one of the areas in which hypergraph computation is most widely used. Hypergraphs can be constructed by modeling the high-order relationships among visual samples or among the elements within a sample, and computer vision tasks can then be solved by hypergraph computation procedures. More specifically, three typical applications, namely visual classification, 3D object retrieval, and tag-based social image retrieval, are presented, in which hypergraphs are used to model the high-order relationships among samples and visual problems are solved by hypergraph computation. For example, in social image retrieval, hypergraphs model the high-order relationships among social images based on both visual and textual information.

#### **11.1 Introduction**

Hypergraphs have demonstrated excellent performance in modeling high-order relationships in data and have been applied in several fields. In computer vision, this property of hypergraphs is also promising for a wide range of tasks, and many studies focus on how to use hypergraph modeling to solve visual problems. On one hand, hypergraphs can be used to model the high-order relationships among images within a class or across different classes, and hypergraph-based label propagation can then be conducted, which is useful for visual classification and retrieval. On the other hand, the relations among the elements within a visual object can be modeled to exploit structural information.

In this chapter, we discuss three typical applications of hypergraph computation in computer vision, i.e., visual classification [1–6], 3D object retrieval [2, 7–12], and tag-based social image retrieval [13–17]. In these applications, the vertices represent the visual objects, and a hypergraph is constructed to formulate the high-order correlations among all the samples by some metric. In this hypergraph, some vertices are labeled, and the predictions for the other vertices can be obtained by the label propagation procedure. Visual classification and retrieval problems can be solved by this method. The elements within one sample, such as the pixels in an image, can also be used to construct hypergraphs. The properties of each element, including its semantic information, can then be learned by conducting hypergraph computation. Part of the work introduced in this chapter has been published in [1, 2, 13].

#### **11.2 Visual Classification**

Visual classification is the area of computer vision in which hypergraphs are most widely used. Visual data have a strong clustering characteristic, i.e., visual objects under one label show a clustered distribution in the feature space. This property is fully consistent with the hypothesis of hypergraph-based semi-supervised learning, and therefore, hypergraph-based semi-supervised learning is theoretically well-suited for image classification. Many studies have demonstrated its good performance [1, 2]. While there are many applications of hypergraph computation for image classification, almost all of them follow the same process. It starts with hypergraph modeling of visual data. After features are extracted by some feature extractors, the hypergraph is modeled based on the nearest neighbor relationship of visual features in the Euclidean space, and then label propagation on the hypergraph is adopted to achieve classification. We use the example of multi-view 3D object classification to introduce the process in detail.

First, view-based 3D object classification needs to be introduced. Each 3D object can be represented by a set of views. Compared with the model representation method, the multi-view representation method is more flexible, with less computational overhead. It also has good representation capability. Classification of 3D objects is illustrated in Fig. 11.1. After obtaining the multi-view 3D object data, the first step is to extract the features. There are many feature extraction methods for multi-view 3D objects, such as MVCNN [18], Zernike moments, etc. After obtaining the features of each group of views and each image in them, hyperedges can be constructed by $k$-NN with Euclidean distance as the metric. In fact, if several different features are used, multiple hypergraphs can be constructed, i.e., each hypergraph is constructed based on one feature. If $m$ features are used, $m$ hypergraphs can be generated, denoted by $\mathcal{G}\_1 = (\mathcal{V}\_1, \mathcal{E}\_1, \mathbf{W}\_1), \mathcal{G}\_2 = (\mathcal{V}\_2, \mathcal{E}\_2, \mathbf{W}\_2), \ldots, \mathcal{G}\_m = (\mathcal{V}\_m, \mathcal{E}\_m, \mathbf{W}\_m)$. After obtaining multiple hypergraphs, a weight $\omega\_i$, $i = 1, \ldots, m$, is assigned to each hypergraph $\mathcal{G}\_i$, which constitutes a weight vector $\boldsymbol{\omega}$. Up to this point, we obtain $m$ hypergraphs with weights from the multi-view 3D dataset.

**Fig. 11.1** An illustration of the view-based 3D object classification framework. This figure is from [1]
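The $k$-NN construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: each sample acts as the centroid of one hyperedge that contains the sample and its $k$ nearest neighbours, the initial hypergraph weights are uniform, and the function name and toy features are hypothetical.

```python
import numpy as np

def knn_incidence(features, k):
    """Incidence matrix with one hyperedge per sample: the sample and its k nearest neighbours."""
    X = np.asarray(features, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    H = np.zeros((n, n))
    for centroid in range(n):
        order = [j for j in np.argsort(dist[centroid], kind="stable") if j != centroid]
        H[centroid, centroid] = 1.0       # the centroid belongs to its own hyperedge
        H[order[:k], centroid] = 1.0      # together with its k nearest neighbours
    return H

# Two hypothetical feature types for the same six objects (m = 2).
feat_a = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
feat_b = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
                   [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]])

# One hypergraph per feature type, with uniform initial weights omega.
hypergraphs = [knn_incidence(F, k=2) for F in (feat_a, feat_b)]
omega = np.full(len(hypergraphs), 1.0 / len(hypergraphs))
```

Each column of an incidence matrix is one hyperedge, so every hyperedge here has degree $k + 1$.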

#### **Transductive Hypergraph Computation**

After obtaining multiple hypergraphs, we can get the label of each vertex by the formula of hypergraph-based semi-supervised learning. The pipeline is shown in Fig. 11.2a. Note that since we are using multi-modal data, the contributions of different modalities to the classification may differ, so we also have to take the weights of the different modalities into account when calculating the classification results and update these weights during the computing process. The method of weight updating is described in the next section, and the focus here is to establish the idea of hypergraph processing of multi-modal features.

**Fig. 11.2** The general frameworks of transductive and inductive multi-hypergraph computation algorithms. (**a**) tMHL: transductive multi-hypergraph computation. (**b**) iMHL: inductive multi-hypergraph computation. This figure is from [1]

#### **Inductive Hypergraph Computation**

In real-world visual classification tasks, transductive hypergraph computation can only be updated globally, and its high time complexity can hardly meet the efficiency requirements of visual classification. To help solve this problem, inductive hypergraph computation is introduced, which learns both the projection from data to labels and the weight vector of the multiple hypergraphs. It can also achieve real-time inference for newly added data, as shown in Fig. 11.2b. The method is described below.

In inductive hypergraph computation, a projection matrix **M** is learned, and the prediction for the unlabeled data is computed by **M**.

The objective function for learning **M** is illustrated as

$$\arg\min\_{\mathbf{M}} \left\{ \Omega \left( \mathbf{M} \right) + \lambda \mathcal{R}\_{emp} \left( \mathbf{M} \right) + \mu \Phi \left( \mathbf{M} \right) \right\}. \tag{11.1}$$

Under the assumption that vertices connected by one or more hyperedges are more likely to share the same label, the hypergraph Laplacian regularizer $\Omega(\mathbf{M})$ is defined as

$$\begin{split} \Omega \left( \mathbf{M} \right) &= \frac{1}{2} \sum\_{k=1}^{c} \sum\_{e \in \mathcal{E}} \sum\_{u, v \in \mathcal{V}} \frac{\mathbf{W} \left( e \right) \mathbf{H} \left( u, e \right) \mathbf{H} \left( v, e \right)}{\delta \left( e \right)} \vartheta \\ &= \text{tr} \left( \mathbf{M}^{\top} \mathbf{X} \Delta \mathbf{X}^{\top} \mathbf{M} \right), \end{split} \tag{11.2}$$

where $\vartheta = \left( \frac{(\mathbf{X}^{\top}\mathbf{M})(u,k)}{\sqrt{d(u)}} - \frac{(\mathbf{X}^{\top}\mathbf{M})(v,k)}{\sqrt{d(v)}} \right)^2$. It can be noted that $\Omega(\mathbf{M})$ is a quadratic form in **M**. The empirical loss term $\mathcal{R}\_{emp}(\mathbf{M})$ is defined as

$$\mathcal{R}\_{emp}(\mathbf{M}) = ||\mathbf{X}^\top \mathbf{M} - \mathbf{Y}||^2. \tag{11.3}$$

$\Phi(\mathbf{M})$ is an $\ell\_{2,1}$-norm regularizer. It is used to avoid overfitting of **M**. Meanwhile, it encourages row sparsity in **M**, so that only the informative feature dimensions are retained. It is defined as

$$\Phi(\mathbf{M}) = ||\mathbf{M}||\_{2,1}.\tag{11.4}$$

The objective function of inductive hypergraph computation task can be written as

$$\arg\min\_{\mathbf{M}} \quad \left\{ \text{tr}\left(\mathbf{M}^{\top}\mathbf{X}\Delta\mathbf{X}^{\top}\mathbf{M}\right) + \lambda||\mathbf{X}^{\top}\mathbf{M} - \mathbf{Y}||^{2} + \mu||\mathbf{M}||\_{2,1} \right\}.\tag{11.5}$$

Note that the regularizer $\Phi(\mathbf{M})$ is convex but non-smooth. Therefore, the objective function can be relaxed to the following:

$$\arg\min\_{\mathbf{M},\mathbf{U}} \left\{ \text{tr}\left(\mathbf{M}^{\top}\mathbf{X}\Delta\mathbf{X}^{\top}\mathbf{M}\right) + \lambda||\mathbf{X}^{\top}\mathbf{M} - \mathbf{Y}||^{2} + \mu\text{tr}\left(\mathbf{M}^{\top}\mathbf{U}\mathbf{M}\right) \right\},\tag{11.6}$$

where **U** is a diagonal matrix, and its elements are defined as

$$\mathbf{U}\_{i,i} = \frac{1}{2||\mathbf{M}\left(i, : \right)||\_2}, \quad i = 1, \ldots, d. \tag{11.7}$$

To solve this optimization problem, **U** is first set to the identity matrix, and the iteratively reweighted least squares method is adopted. More specifically, each variable is updated alternately with the other fixed until convergence is achieved. First, **U** is fixed, and we differentiate the objective function with respect to **M** and set the derivative to zero. The closed-form solution is

$$\mathbf{M} = \lambda \left(\mathbf{X}\Delta\mathbf{X}^{\top} + \lambda\mathbf{X}\mathbf{X}^{\top} + \mu\mathbf{U}\right)^{-1}\mathbf{X}\mathbf{Y}.\tag{11.8}$$

Then **M** is fixed, while **U** is updated by Eq. (11.7). The procedure is repeated until both **U** and **M** converge.

Given a testing sample $\boldsymbol{x}^t$, the prediction of $\boldsymbol{x}^t$ can be obtained by

$$C(\boldsymbol{x}^t) = \arg\max\_k \, (\boldsymbol{x}^{t})^{\top} \mathbf{M}.\tag{11.9}$$

Hypergraph computation can achieve good results in visual classification problems, where inductive hypergraph computation can achieve real-time online classification while maintaining good classification performance.
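The alternating procedure of Eqs. (11.6)–(11.9) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the reference implementation: a single hypergraph is used (the hypergraph weight learning is omitted), the reweighting uses the standard $\ell\_{2,1}$ form $\mathbf{U}\_{i,i} = 1/(2||\mathbf{M}(i,:)||\_2)$ with a small constant to guard against zero rows, and all function names and toy data are hypothetical.

```python
import numpy as np

def hypergraph_laplacian(H, w=None):
    """Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    H = np.asarray(H, dtype=float)
    n, m = H.shape
    w = np.ones(m) if w is None else np.asarray(w, dtype=float)
    G = H / np.sqrt(H @ w)[:, None]
    return np.eye(n) - (G * (w / H.sum(axis=0))) @ G.T

def fit_projection(X, Y, L, lam=1.0, mu=0.1, n_iter=100, tol=1e-8, eps=1e-12):
    """Iteratively reweighted least squares for M. X is d x n, Y is n x c."""
    d = X.shape[0]
    U = np.eye(d)                              # U starts as the identity matrix
    A = X @ L @ X.T + lam * (X @ X.T)
    XY = X @ Y
    M = np.zeros_like(XY)
    for _ in range(n_iter):
        M_new = lam * np.linalg.solve(A + mu * U, XY)            # Eq. (11.8)
        row_norms = np.linalg.norm(M_new, axis=1)
        U = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))    # reweighting, cf. Eq. (11.7)
        converged = np.linalg.norm(M_new - M) < tol
        M = M_new
        if converged:
            break
    return M

def predict(M, x_t):
    """Class index with the largest projected score, Eq. (11.9)."""
    return int(np.argmax(np.asarray(x_t, dtype=float) @ M))

# Toy data (hypothetical): six training samples with 2-D features, two classes.
X = np.array([[1.0, 0.9, 1.1, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0, 0.9, 1.1]])
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
H = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

M = fit_projection(X, Y, hypergraph_laplacian(H))
```

Once **M** has been learned, a new sample is classified with a single matrix product, e.g., `predict(M, [0.9, 0.1])`, which is what makes the inductive setting suitable for online inference.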

#### **11.3 3D Object Retrieval**

3D object retrieval aims to find similar 3D objects in a database, given a 3D query. Usually, each 3D object can be described by several different types of data, such as multiple views, point clouds, meshes, or voxels. The main task of 3D object retrieval is to define an appropriate measure to calculate the similarity between each pair of 3D objects. Therefore, how to define such measures is the key for 3D object retrieval. Traditional methods mainly focus on either representation learning for each type of data or the distance metric for specific features. It is noted that the correlations among 3D objects are very complex: both pairwise and beyond-pairwise correlations exist. To achieve better 3D object retrieval performance, it is important to take such high-order correlations among 3D objects into consideration. In this retrieval task, each vertex denotes a 3D object in the database, and thus the number of vertices is equal to the number of objects in the database.

**Fig. 11.3** An illustration of the hypergraph computation method for 3D object retrieval using multiple views. This figure is from [2]

Hypergraphs can be used for such correlation modeling in 3D object retrieval. We introduce the hypergraph computation method [2] for 3D object retrieval here, and the framework is shown in Fig. 11.3. First, a group of hypergraphs is generated, and then learning is conducted on them to measure similarity.

We take the multi-view representation as an example. All views of these 3D objects are first grouped into clusters. Objects with views in one cluster are then connected by a hyperedge (note that a hyperedge can connect multiple vertices in a hypergraph). As a result, a hypergraph can be generated, in which vertices represent the objects in a database. A hyperedge's weight is determined by the visual similarity between any two views in a cluster. Multiple hypergraphs can be generated by varying the number of clusters. These hypergraphs encode the relationships between objects at various granularities. The more and the stronger the hyperedges connecting two 3D objects, the higher their similarity. This information can then be used for 3D object retrieval.

To generate a 3D object hypergraph, each object is treated as a vertex in the hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{W})$. The generated hypergraph has $n$ vertices if there are $n$ objects in a database. Each view of these 3D objects can be represented by pre-defined features, which can differ from task to task. Given these features, the K-means clustering method can be used to group the views into clusters. For each cluster, a hyperedge connects all objects that have views in that cluster.

**Fig. 11.4** An illustration of the hypergraph construction for 3D object hypergraph. (**a**) Views of different visual objects. (**b**) Hyperedges construction by view clusters. This figure is from [2]

Two diagonal matrices $\mathbf{D}\_v$ and $\mathbf{D}\_e$ represent the vertex and hyperedge degrees, respectively, and an incidence matrix **H** is generated. The weight of a hyperedge $e$ can be measured by

$$w(e) = \sum\_{x\_a, x\_b \in e} \exp\left(-\frac{d(x\_a, x\_b)^2}{\sigma^2}\right),\tag{11.10}$$

where $d(x\_a, x\_b)$ is the distance between $x\_a$ and $x\_b$, which are two views in the same view cluster. $d(x\_a, x\_b)$ can be calculated using the Euclidean distance. The parameter $\sigma$ is empirically set to the median distance between all pairs of these views. The hypergraph generation procedure is shown in Fig. 11.4.
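The construction above can be sketched as follows, assuming that the view clusters have already been computed (e.g., by K-means) and are passed in as an index array. This is an illustrative sketch only: the sum in Eq. (11.10) is taken here over unordered pairs of distinct views, which is one possible reading of the formula, and the function name and toy data are hypothetical.

```python
import numpy as np

def build_object_hypergraph(view_feats, view_to_object, view_cluster, n_objects):
    """Return the object-level incidence matrix H and hyperedge weights w (Eq. 11.10)."""
    X = np.asarray(view_feats, dtype=float)
    view_to_object = np.asarray(view_to_object)
    view_cluster = np.asarray(view_cluster)

    # Pairwise Euclidean distances between views; sigma is the median over all view pairs.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.median(dist[np.triu_indices(len(X), k=1)])

    clusters = np.unique(view_cluster)
    H = np.zeros((n_objects, len(clusters)))
    w = np.zeros(len(clusters))
    for j, c in enumerate(clusters):
        members = np.flatnonzero(view_cluster == c)
        H[view_to_object[members], j] = 1.0               # objects owning a view in cluster c
        a, b = np.triu_indices(len(members), k=1)         # unordered view pairs (assumption)
        w[j] = np.exp(-dist[members[a], members[b]] ** 2 / sigma ** 2).sum()
    return H, w

# Toy example (hypothetical): 4 views of 3 objects, grouped into 2 view clusters.
view_feats = [[0.0], [1.0], [10.0], [11.0]]
H, w = build_object_hypergraph(view_feats,
                               view_to_object=[0, 1, 1, 2],
                               view_cluster=[0, 0, 1, 1],
                               n_objects=3)
```

Running the function several times with different numbers of clusters yields the group of hypergraphs at different granularities described above.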

Let $\mathcal{G}\_1 = (\mathcal{V}\_1, \mathcal{E}\_1, \mathbf{W}\_1)$, $\mathcal{G}\_2 = (\mathcal{V}\_2, \mathcal{E}\_2, \mathbf{W}\_2)$, $\ldots$, and $\mathcal{G}\_{n\_g} = (\mathcal{V}\_{n\_g}, \mathcal{E}\_{n\_g}, \mathbf{W}\_{n\_g})$ denote $n\_g$ hypergraphs, and let $\{\mathbf{D}\_{v\_1}, \mathbf{D}\_{v\_2}, \ldots, \mathbf{D}\_{v\_{n\_g}}\}$, $\{\mathbf{D}\_{e\_1}, \mathbf{D}\_{e\_2}, \ldots, \mathbf{D}\_{e\_{n\_g}}\}$, and $\{\mathbf{H}\_1, \mathbf{H}\_2, \ldots, \mathbf{H}\_{n\_g}\}$ be the vertex degree matrices, hyperedge degree matrices, and incidence matrices, respectively. The retrieval results are based on the fusion of these hypergraphs. The weight of the $i$-th hypergraph is denoted by $\alpha\_i$, where $\sum\_{i=1}^{n\_g} \alpha\_i = 1$ and $\alpha\_i \geq 0$.

It is possible to consider retrieval as a one-class classification problem [19]. As a result, we formulate the transductive inference as a regularization problem, $\arg\min\_{\mathbf{f}} \{ \lambda \mathcal{R}\_{emp}(\mathbf{f}) + \Omega(\mathbf{f}) \}$, where the regularizer term $\Omega(\mathbf{f})$ is defined by

$$\frac{1}{2} \sum\_{i=1}^{n\_g} \alpha\_{i} \sum\_{e \in \mathcal{E}\_{i}} \sum\_{u, v \in \mathcal{V}\_{i}} \frac{w\_{i}(e) \mathbf{H}\_{i}(u, e) \mathbf{H}\_{i}(v, e)}{\delta\_{i}(e)} \left( \frac{\mathbf{f}(u)}{\sqrt{d\_{i}(u)}} - \frac{\mathbf{f}(v)}{\sqrt{d\_{i}(v)}} \right)^{2},\tag{11.11}$$

where vector **f** represents the relevance score to be learned.

In this way, the similarity between each object and the query can be calculated based on the relevance score. It is noted that the feature used in this method can be selected based on the task itself, and multiple types of representations can also be used here. Given multiple features for the same data, or different features for multi-modal data, we can generate the hypergraph(s) using the method introduced in Chap. 4.
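The fusion of several hypergraphs and the resulting relevance ranking can be sketched as follows. This is a minimal illustration under stated assumptions: the empirical loss is taken to be the squared error $||\mathbf{f} - \mathbf{y}||^2$, the weights $\alpha\_i$ are fixed rather than learned, and the function names and toy hypergraphs are hypothetical. With $\sum\_i \alpha\_i = 1$, the regularizer in Eq. (11.11) equals $\mathbf{f}^{\top}(\mathbf{I} - \sum\_i \alpha\_i \Theta\_i)\mathbf{f}$, so setting the gradient to zero gives $\mathbf{f} = (\mathbf{I} + \frac{1}{\lambda}\Delta)^{-1}\mathbf{y}$ with $\Delta = \mathbf{I} - \sum\_i \alpha\_i \Theta\_i$.

```python
import numpy as np

def theta(H, w):
    """Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for one hypergraph."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    G = H / np.sqrt(H @ w)[:, None]
    return (G * (w / H.sum(axis=0))) @ G.T

def rank_by_relevance(hypergraphs, alphas, query, lam=10.0):
    """Rank all objects by relevance to the query object (index `query`)."""
    n = np.asarray(hypergraphs[0][0]).shape[0]
    fused_theta = sum(a * theta(H, w) for (H, w), a in zip(hypergraphs, alphas))
    laplacian = np.eye(n) - fused_theta          # fused hypergraph Laplacian
    y = np.zeros(n)
    y[query] = 1.0                               # only the query is labeled as relevant
    f = np.linalg.solve(np.eye(n) + laplacian / lam, y)
    return np.argsort(-f, kind="stable"), f

# Two toy hypergraphs (hypothetical) over the same five objects, e.g. two cluster granularities.
H1 = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
H2 = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
hypergraphs = [(H1, np.ones(2)), (H2, np.ones(2))]

ranking, f = rank_by_relevance(hypergraphs, alphas=[0.5, 0.5], query=0)
```

In this toy case, object 1 shares a hyperedge with the query in both hypergraphs and object 2 in only one, so object 1 is ranked ahead of object 2, which in turn is ranked ahead of objects 3 and 4.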

#### **11.4 Tag-Based Social Image Retrieval**

Social images are widely associated with user-generated tags, which describe the content of the images. Thanks to this rich content, the tags are useful for social image retrieval tasks. Figure 11.5 shows some examples of social images associated with tags.

The main challenge of applying such tags to social image retrieval is that the tags are noisy, which makes it hard to mine the true relations among the tags and images, and using the tags and the images separately leads to sub-optimal retrieval results. In this section, we introduce a visual–textual joint relevance learning approach using hypergraph computation [13]. Figure 11.6 illustrates the visual–textual joint relevance learning method on hypergraph for tag-based social image retrieval. In this method, the features for both the images and the tags are first extracted, and the hypergraph is constructed based on these features. Then, the hypergraph learning method is performed, and the learned semantic similarity can be used for tag-based social image retrieval.

In this example, the bag-of-visual-words feature is selected for image representation. For the $i$-th image, the visual content is represented by the bag-of-visual-words feature $f\_i^{bow}$, while the corresponding tags are represented by the bag-of-textual-words feature $f\_i^{tag}$. Then, the visual-content-based hyperedges and the tag-based hyperedges are constructed, respectively. The visual-content-based hyperedges connect the images that have the same visual word, and the tag-based hyperedges connect the images that have the same tag word. Figure 11.7 provides examples of the hyperedge generation process using textual information and visual information, respectively. Therefore, the overall hypergraph has $n\_e = n\_c + n\_t$ hyperedges, where $n\_c$ denotes the number of visual words, and $n\_t$ denotes the number of tag words. After the construction of the hypergraph, the images sharing more visual words or tags are connected by more hyperedges, which can be used for further processing. Figure 11.8 further shows the connections between two social images, based on the textual and the visual information, respectively.

**Fig. 11.5** Some examples of social images associated with tags. This figure is from [13]

**Fig. 11.7** Two examples of hyperedge generation. (**a**) shows hyperedges based on the textual information, in which the social images with the same textual words are connected by a hyperedge. (**b**) shows hyperedges based on the visual information, in which the social images with the same visual words are connected by a hyperedge. This figure is from [13]

**Fig. 11.8** An example of connections between two images from textual and visual directions. This figure is from [13]
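The hyperedge generation described above can be sketched in plain Python. This is an illustrative sketch only: each image is assumed to come with a set of visual-word indices and a set of tag strings, and the function names and toy data are hypothetical.

```python
def build_social_hypergraph(visual_words, tags):
    """Map each visual word and each tag to the set of images (vertices) that contain it."""
    hyperedges = {}
    for image, (words, image_tags) in enumerate(zip(visual_words, tags)):
        for word in words:
            hyperedges.setdefault(("visual", word), set()).add(image)
        for tag in image_tags:
            hyperedges.setdefault(("tag", tag), set()).add(image)
    return hyperedges

def shared_hyperedges(hyperedges, i, j):
    """Hyperedges that connect images i and j, from both the visual and the textual side."""
    return sorted(key for key, members in hyperedges.items()
                  if i in members and j in members)

# Toy data (hypothetical): three images with visual-word indices and tags.
visual_words = [{3, 7}, {3}, {9}]
tags = [{"beach", "sunset"}, {"beach"}, {"city"}]

hyperedges = build_social_hypergraph(visual_words, tags)
n_c = sum(1 for kind, _ in hyperedges if kind == "visual")   # number of visual words
n_t = sum(1 for kind, _ in hyperedges if kind == "tag")      # number of tag words
n_e = len(hyperedges)                                        # n_e = n_c + n_t
```

Images 0 and 1 share one visual word and one tag, so they are connected by two hyperedges, whereas image 2 shares none with either of them.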

Denoting **f** as the relevance score vector, **y** as the ground truth relevance, and **w** as the weight vector of hyperedges, the hypergraph computation can be formulated as

$$\begin{aligned} \arg\min\_{\mathbf{f}, \mathbf{w}} \Phi(\mathbf{f}) &= \arg\min\_{\mathbf{f}, \mathbf{w}} \left\{ \mathbf{f}^\top \Delta \mathbf{f} + \lambda ||\mathbf{f} - \mathbf{y}||^2 + \mu \sum\_{i=1}^{n\_e} \mathbf{w}(i)^2 \right\}, \\ &\text{s.t. } \sum\_{i=1}^{n\_e} \mathbf{w}(i) = 1, \end{aligned} \tag{11.12}$$

where $\lambda$ and $\mu$ are trade-off parameters. The first term in Eq. (11.12) is the regularizer on the hypergraph structure, which enforces smoothness over the hypergraph. The second term is the empirical loss between the relevance score vector and the ground truth. The last term is the squared $\ell\_2$ norm of the hyperedge weights, which is used to learn a better combination of different hyperedges. This optimization task can be solved using alternating optimization. First, **w** is fixed, and **f** is optimized by

$$\arg\min\_{\mathbf{f}} \Phi(\mathbf{f}) = \arg\min\_{\mathbf{f}} \left\{ \mathbf{f}^\top \Delta \mathbf{f} + \lambda ||\mathbf{f} - \mathbf{y}||^2 \right\},\tag{11.13}$$

from which we can have

$$\mathbf{f} = (1 - \xi) (\mathbf{I} - \xi \Theta)^{-1} \mathbf{y},\tag{11.14}$$

where $\xi = \frac{1}{1+\lambda}$ and $\Theta = \mathbf{I} - \Delta$. Then, **f** is fixed, and **w** is optimized by

$$\begin{aligned} \arg\min\_{\mathbf{w}} \Phi(\mathbf{f}) &= \arg\min\_{\mathbf{w}} \left\{ \mathbf{f}^{\top} \Delta \mathbf{f} + \mu \sum\_{i=1}^{n\_e} \mathbf{w}(i)^{2} \right\}, \\ &\text{s.t. } \sum\_{i=1}^{n\_e} \mathbf{w}(i) = 1, \ \mu > 0. \end{aligned} \tag{11.15}$$

The Lagrangian can be applied here, and we have

$$\mathbf{w}(i) = \frac{1}{n\_e} - \frac{\mathbf{f}^\top \boldsymbol{\Gamma} \mathbf{D}\_e^{-1} \boldsymbol{\Gamma}^\top \mathbf{f}}{2n\_e \mu} + \frac{\mathbf{f}^\top \boldsymbol{\Gamma}\_i \mathbf{D}\_e^{-1} (i, i) \boldsymbol{\Gamma}\_i^\top \mathbf{f}}{2\mu},\tag{11.16}$$

where $\boldsymbol{\Gamma} = \mathbf{D}\_v^{-\frac{1}{2}} \mathbf{H}$ and $\boldsymbol{\Gamma}\_i$ represents the $i$-th column of $\boldsymbol{\Gamma}$.
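The alternating optimization above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of [13]: the vertex degrees are recomputed from the current weights in every round and then treated as fixed in the weight update, $\mu$ is assumed to be large enough that all weights stay positive (the closed form in Eq. (11.16) does not enforce this by itself), and the function name and toy data are hypothetical.

```python
import numpy as np

def joint_relevance_learning(H, y, lam=1.0, mu=10.0, n_iter=10):
    """Alternately update the relevance scores f and the hyperedge weights w."""
    H = np.asarray(H, dtype=float)
    y = np.asarray(y, dtype=float)
    n, n_e = H.shape
    w = np.full(n_e, 1.0 / n_e)                 # start from uniform hyperedge weights
    d_e_inv = 1.0 / H.sum(axis=0)               # diagonal of De^{-1}
    xi = 1.0 / (1.0 + lam)
    for _ in range(n_iter):
        # Step 1: fix w and solve for f.
        Gamma = H / np.sqrt(H @ w)[:, None]     # Gamma = Dv^{-1/2} H
        Theta = (Gamma * (w * d_e_inv)) @ Gamma.T
        # (1 - xi) is the scale obtained by setting the gradient of Eq. (11.13) to zero.
        f = (1.0 - xi) * np.linalg.solve(np.eye(n) - xi * Theta, y)
        # Step 2: fix f and update w with the closed form of Eq. (11.16).
        c = d_e_inv * (Gamma.T @ f) ** 2        # c[i] = f^T Gamma_i De^{-1}(i, i) Gamma_i^T f
        w = 1.0 / n_e - c.sum() / (2.0 * n_e * mu) + c / (2.0 * mu)
    return f, w

# Toy example (hypothetical): 5 images, 3 hyperedges, image 0 initially marked as relevant.
H = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
f, w = joint_relevance_learning(H, y)
```

By construction the updated weights always sum to one, and hyperedges whose vertices carry high relevance scores receive larger weights.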

The semantic relevance between an image $x\_i$ and the query tag $t\_q$ is estimated by

$$s(x\_i, t\_q) = \frac{1}{n\_i} \sum\_{t \in T\_i} s\_{tag}(t\_q, t),\tag{11.17}$$

which denotes the average similarity between $t\_q$ and all $n\_i$ tags of $x\_i$, collected in the tag set $T\_i$, and $s\_{tag}$ can be calculated as

$$s\_{tag}(t\_1, t\_2) = e^{-FD(t\_1, t\_2)},\tag{11.18}$$

where $FD$ represents the Flickr distance [20].
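Equations (11.17) and (11.18) can be sketched as follows. The Flickr distance itself is not implemented here: the sketch assumes that a tag-to-tag distance function is supplied by the caller, and the small lookup table used in the example is purely hypothetical.

```python
import math

def tag_similarity(t1, t2, tag_distance):
    """s_tag(t1, t2) = exp(-FD(t1, t2)), Eq. (11.18), for a caller-supplied distance."""
    return math.exp(-tag_distance(t1, t2))

def semantic_relevance(image_tags, query_tag, tag_distance):
    """Average similarity between the query tag and all tags of one image, Eq. (11.17)."""
    if not image_tags:
        return 0.0                                   # an image without tags gets no relevance
    total = sum(tag_similarity(query_tag, t, tag_distance) for t in image_tags)
    return total / len(image_tags)

# Hypothetical stand-in for the Flickr distance: a tiny symmetric lookup table.
_TOY_DISTANCES = {("beach", "beach"): 0.0, ("beach", "sea"): 0.5, ("beach", "car"): 3.0}

def toy_distance(t1, t2):
    return _TOY_DISTANCES.get((t1, t2), _TOY_DISTANCES.get((t2, t1)))

score = semantic_relevance(["sea", "car"], "beach", toy_distance)
```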

Given these similarities between each image and the query tag, we can have the retrieval results accordingly. We also note that the features used in this application can be changed with respect to the requirement of different tasks.

#### **11.5 Summary**

In this chapter, we have introduced the applications of hypergraph computation in computer vision, including visual classification, 3D object retrieval, and tag-based social image retrieval. For classification and retrieval tasks, hypergraphs can be used to model the high-order relationships among samples in the feature space, and the problem can then be solved by hypergraph-based label propagation methods. The success of hypergraphs in computer vision is due to the fact that the feature correlations of visual data are complex and hard to explore with pairwise correlation methods. Hypergraph computation can be further used in other computer vision tasks, such as visual registration, visual segmentation, gaze estimation, etc.

#### **References**



## **Chapter 12 The DeepHypergraph Library**

**Abstract** This chapter introduces the DeepHypergraph library, which bridges hypergraph theory and hypergraph applications. This library provides the generation of multiple low-order structures (such as graph and directed graph), high-order structures (such as hypergraph and directed hypergraph), datasets, operations, learning methods, visualizations, etc. We first introduce the design motivation and the overall architecture of the library. Then, we introduce the "correlation structure" and "function library" of the DeepHypergraph library, respectively.

#### **12.1 Introduction**

We have designed DeepHypergraph (DHG),<sup>1</sup> a deep learning library built upon PyTorch<sup>2</sup> for hypergraph computation. It is a general framework that supports both low-order and high-order message passing, such as from vertex to vertex, from vertex in one domain to vertex in another domain, from vertex to hyperedge, from hyperedge to vertex, and from vertex set to vertex set. It supports the generation of a wide variety of structures such as low-order structures (graph, directed graph, bipartite graph, etc.) and high-order structures (hypergraph, etc.). Various spectral-based operations (such as Laplacian-based smoothing) and spatial-based operations (such as message passing from domain to domain) are integrated inside different structures. It also provides multiple common metrics for performance evaluation on different tasks. A group of state-of-the-art models has also been implemented and can be easily used for research. We also provide several visualization tools for demonstration of both low-order structures and high-order structures. Besides, the dhg.experiments module (which implements Auto-ML upon Optuna<sup>3</sup>) can automatically tune the hyperparameters of the models in training and return the model with the best performance. In this chapter, we first introduce the correlation structures in DHG and then introduce the function library in DHG.

<sup>1</sup> deephypergraph.org.

<sup>2</sup> http://pytorch.org/.

<sup>3</sup> https://optuna.org/.

© The Author(s) 2023

Q. Dai, Y. Gao, *Hypergraph Computation*, Artificial Intelligence: Foundations, Theory, and Algorithms, https://doi.org/10.1007/978-981-99-0185-2\_12

#### **12.2 The Correlation Structures in DHG**

The core motivation for designing the DHG library is to attach the spectral-based and spatial-based operations to each specified structure. Once a structure has been created, the related Laplacian matrices and message passing operations with different aggregation functions can be called and combined to manipulate different input features. Figure 12.1 illustrates the architecture of the "correlation structure" in DHG. Currently, the implemented correlation structures of DHG include graph, directed graph, bipartite graph, and hypergraph. For each correlation structure, DHG has developed the corresponding basic operations, such as construction and structure modification functions, related structure transformation functions, and learning functions.

Most computation on these correlation structures (graph, hypergraph, etc.) can be divided into two categories: spectral-based convolution and spatial-based message passing. The spectral-based convolution methods, such as the typical GCN [1] and HGNN [2], construct a Laplacian matrix for a given structure and perform vertex feature smoothing with the generated Laplacian matrix to embed low-order and high-order structures into vertex features. The spatial-based message passing methods, such as the typical GraphSAGE [3], GAT [4], and HGNN+ [5], perform vertex to vertex, vertex to hyperedge, hyperedge to vertex, and vertex set to vertex set message passing to embed the low-order and high-order structures into vertex features. The learned vertex features can also be pooled to generate a unified structure feature. Finally, the learned vertex features or structure features can be fed into many downstream tasks, such as classification, retrieval, regression, and link prediction, and into applications including paper classification, movie recommendation, and drug discovery.

**Fig. 12.1** The architecture of the "correlation structures" in DHG
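The two categories of operations can be contrasted with a short NumPy sketch. It deliberately does not use the DHG API: the function names below are hypothetical, the spectral operation is the HGNN-style smoothing $\mathbf{D}\_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}\_e^{-1}\mathbf{H}^{\top}\mathbf{D}\_v^{-1/2}\mathbf{X}$ without learnable parameters, and the spatial operation is a two-stage mean aggregation (vertex to hyperedge, then hyperedge to vertex). Every vertex is assumed to belong to at least one hyperedge.

```python
import numpy as np

def spectral_smoothing(X, H, w=None):
    """Smooth vertex features with Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    G = H / np.sqrt(H @ w)[:, None]
    return (G * (w / H.sum(axis=0))) @ (G.T @ X)

def v2e_mean(X, H):
    """Vertex-to-hyperedge message passing: average the features of each hyperedge's vertices."""
    H = np.asarray(H, dtype=float)
    return (H.T @ X) / H.sum(axis=0)[:, None]

def e2v_mean(E, H):
    """Hyperedge-to-vertex message passing: average the features of each vertex's hyperedges."""
    H = np.asarray(H, dtype=float)
    return (H @ E) / H.sum(axis=1)[:, None]

# Toy hypergraph (hypothetical): 3 vertices, 2 hyperedges sharing vertex 1.
H = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])

edge_feats = v2e_mean(X, H)               # hyperedge features: [[1.5], [2.5]]
vertex_feats = e2v_mean(edge_feats, H)    # updated vertex features: [[1.5], [2.0], [2.5]]
smoothed = spectral_smoothing(X, H)
```

Both routes mix information among vertices that share a hyperedge; the spectral route does so through a fixed normalized matrix, while the spatial route exposes the intermediate hyperedge features and allows other aggregation functions to be plugged in.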

#### **12.3 The Function Library in DHG**

To reduce the complex and repetitive code required for learning on correlation structures, DHG further provides a function library. As shown in Fig. 12.2, the function library includes five parts: data module, metric module, visualization module, auto-ML module, and structure generators module.

In the data module, DHG integrates more than 20 public graph/bipartite graph/hypergraph datasets and some commonly used preprocessing functions, such as File Loader and Normalization. By default, DHG can automatically download the integrated datasets and check the integrity of the downloaded files. You can also manually construct your own DHG-style dataset with the existing Datapipe functions in DHG.

**Fig. 12.2** The architecture of the "function library" in DHG

In the metric module, DHG provides many widely used metrics, such as *Accuracy*, *Recall*, and *mAP*, for different tasks. Encapsulated evaluators for different tasks, such as classification, retrieval, and recommendation, have also been implemented. Besides, DHG provides structure and feature visualization functions, automatic hyperparameter search functions, and random structure generation functions for different applications.

#### **12.4 Summary**

In this chapter, we introduce the DHG library for hypergraph computation. It supports both the generation of and the learning on low-order and high-order structures. Besides, many commonly used functions have also been integrated in the library.

#### **References**



## **Chapter 13 Conclusions and Future Work**

#### **13.1 Summary of This Book**

Hypergraph computation has attracted much attention and shown apparent advantages in many application fields, such as computer vision, social networks, and biomedicine. In this book, we systematically introduce the basic knowledge, algorithms, and applications of hypergraph computation in three parts and discuss some recent progress in this direction.

In the first part, we mainly introduce the basic knowledge and main concepts of hypergraphs, including the definitions and symbols of common terms and the classification of hypergraphs. More importantly, we discuss the differences between hypergraphs and graphs from several aspects. Next, we introduce three hypergraph computation paradigms, namely, intra-hypergraph computation, inter-hypergraph computation, and hypergraph structure computation. This part gives a general view of the different objectives in hypergraph computation.

In the second part, we specifically introduce a series of algorithms from hypergraph modeling to hypergraph neural networks. In the hypergraph modeling sections, we show how to build a hypergraph structure from the collected data. As a typical and fundamental learning framework, label propagation on hypergraph describes how to derive the labels for unknown data from the labels for known data on the structure of a hypergraph. Other typical hypergraph computation tasks, including data clustering, cost-sensitive learning, and link prediction, are also introduced. To address potentially inaccurate hypergraph structures, we present the hypergraph structure evolution methods, which optimize the hypergraph structure on the basis of the initial structure. We further introduce the hypergraph neural network, which integrates the neural network framework into the hypergraph computation framework. The large scale hypergraph chapter discusses how to deal with large scale data for classification and clustering applications.

In the third part, we introduce practical examples of hypergraph computation in social media analysis, medical and biological applications, and computer vision, including specific tasks such as recommender systems, sentiment analysis, computer-aided diagnosis, and image classification. In these examples, we show how to use hypergraphs for high-order correlation modeling and how to select computation paradigms for different objectives. We further introduce the DeepHypergraph library for hypergraph computation.

#### **13.2 Future Work**

Although there have been many efforts to promote the development of hypergraph computation, many open issues still require further exploration, for instance, the mathematical foundations of hypergraph computation, interpretability, and temporal hypergraph modeling:


the masks as disturbances to cover the original structure information to study the effects of different disturbances on the original structure. In addition, for hypergraphs in biochemistry, neurobiology, ecology, and engineering, whose structure is highly correlated with their function, how to incorporate domain knowledge to improve model interpretability is also an important issue. Finally, humans can easily understand the semantic information in text or image data, but it is difficult to understand the information in a hypergraph structure intuitively. How to visualize high-order complex correlations for intuitive understanding remains a challenge.

	- Time sequence data with static structure. This scenario is common in traffic forecasting, action recognition, and anomaly detection.
	- Time sequence data with evolving structure. This scenario mostly appears in stock prediction and video relation detection.

In the above application scenarios, temporal hypergraph modeling is worth studying, and the tasks mentioned above pose multiple challenges. For sensor data, different types of sensors produce different types of data, whereas typical hypergraph neural networks treat the data of all vertices equally. Both temporal and spatial high-order relationships vary over time, which complicates the message passing procedure. New vertices/hyperedges emerge and old ones dissolve as the structure changes, which makes it difficult to model the varying correlations continuously and to aggregate the messages. The vertices/hyperedges may even be completely different at different time steps, which calls representation-based methods into question. To model the temporal information, the vertex representations should be dynamic, and therefore they should be learned in a function space rather than in the usual vector space. The temporal information in both the vertex representations and the structure topology is difficult to extract. Given these challenges, temporal hypergraph modeling still has a long way to go and needs further exploration.
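For the simpler of the two scenarios, time sequence data with static structure, a baseline can be sketched by applying one fixed hypergraph smoothing operator to the vertex features at every time step. The snippet below is only an illustrative sketch under that assumption: it uses the normalized operator $D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2}$ without learnable parameters, and the function names and array shapes are hypothetical. It does not address the evolving-structure challenges discussed above.

```python
import numpy as np

def smoothing_operator(H, w=None):
    """Return Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for a static hypergraph.

    Assumes every vertex belongs to at least one hyperedge
    and no hyperedge is empty.
    """
    H = np.asarray(H, dtype=float)
    w = np.ones(H.shape[1]) if w is None else np.asarray(w, dtype=float)
    d_v = H @ w              # weighted vertex degrees
    d_e = H.sum(axis=0)      # hyperedge degrees
    left = H / np.sqrt(d_v)[:, None]
    return (left * (w / d_e)) @ left.T

def smooth_sequence(X, H, w=None):
    """Apply the same smoothing operator to each time step.

    X: (n_steps, n_vertices, n_channels) feature tensor.
    H: (n_vertices, n_hyperedges) incidence matrix shared by all steps.
    """
    theta = smoothing_operator(H, w)
    # out[t] = theta @ X[t] for every time step t
    return np.einsum("ij,tjc->tic", theta, np.asarray(X, dtype=float))

# Toy example: 4 vertices, hyperedges {0,1,2} and {2,3}, 3 time steps.
H = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
])
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4, 2))
smoothed = smooth_sequence(X, H)
print(smoothed.shape)  # → (3, 4, 2)
```

Because the structure is shared across time steps, the operator is computed once and reused. With an evolving structure, one operator per time step would be needed, and vertices could not be assumed to align across steps.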

Besides the above research directions, there are several other interesting topics, such as big hypergraph models, hypergraph databases, and distributed hypergraphs, which have not been introduced in detail in this book and deserve further study.
