**Big Data Management**

Xiang Zhao Weixin Zeng Jiuyang Tang

# Entity Alignment Concepts, Recent Advances and Novel Approaches

## **Big Data Management**

#### **Editor-in-Chief**

Xiaofeng Meng, School of Information, Renmin University of China, Beijing, Beijing, China

#### **Editorial Board Members**

Daniel Dajun Zeng, University of Arizona, Tucson, AZ, USA

Hai Jin, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China

Haixun Wang, Facebook Research, USA

Huan Liu, Arizona State University, Tempe, AZ, USA

X. Sean Wang, Fudan University, Shanghai, Shanghai, China

Weiyi Meng, Binghamton University, Binghamton, NY, USA

#### **Advisory Editors**

Jiawei Han, Dept Comp Sci, University Illinois at Urbana-Champaign, Urbana, IL, USA

Masaru Kitsuregawa, National Institute of Informatics, University of Tokyo, Chiyoda, Tokyo, Japan

Philip S. Yu, University of Illinois at Chicago, Chicago, IL, USA

Tieniu Tan, Chinese Academy of Sciences, Beijing, Beijing, China

Wen Gao, Room 2615, Science Buildings, Peking University, Beijing, Beijing, China

The big data paradigm presents a number of challenges for university curricula on big data or data science related topics. On the one hand, new research, tools and technologies are currently being developed to harness the increasingly large quantities of data being generated within our society. On the other, big data curricula at universities are still based on the computer science knowledge systems established in the 1960s and 70s. The gap between the theories and applications is becoming larger, as a result of which current education programs cannot meet the industry's demands for big data talents.

This series aims to refresh and complement the theory and knowledge framework for data management and analytics, reflect the latest research and applications in big data, and highlight key computational tools and techniques currently in development. Its goal is to publish a broad range of textbooks, research monographs, and edited volumes.


The scope of the series includes, but is not limited to, titles in the areas of database management, data mining, data analytics, search engines, data integration, NLP, knowledge graphs, information retrieval, social networks, etc. Other relevant topics will also be considered.


Xiang Zhao Laboratory for Big Data and Decision National University of Defense Technology Changsha, Hunan, China

Jiuyang Tang Laboratory for Big Data and Decision National University of Defense Technology Changsha, Hunan, China

Weixin Zeng Laboratory for Big Data and Decision National University of Defense Technology Changsha, Hunan, China

ISSN 2522-0179 ISSN 2522-0187 (electronic) Big Data Management ISBN 978-981-99-4249-7 ISBN 978-981-99-4250-3 (eBook) https://doi.org/10.1007/978-981-99-4250-3

This work was partially supported by NSFC under grant Nos. 61872446, 62272469 and 71971212.

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Paper in this product is recyclable.

## **Preface**

#### **Background**

Knowledge fusion is an important stage of knowledge management, which connects, combines and updates the knowledge from different resources. With the advent of the era of big data, the knowledge graph (KG), as an effective means of extracting structured knowledge from massive unstructured data and stored structured data, has become an essential part of knowledge management. KGs that are constructed by data-driven techniques usually come from different sources and have low coverage. This calls for establishing connections among these individually constructed KGs using knowledge fusion techniques, so that the KGs can be augmented and updated. In this process, entity alignment (EA) plays a crucial role. It aims to detect the equivalent entities in different KGs and connect heterogeneous KGs using these entities as anchors, which lays the foundation for the subsequent knowledge unification and update process. Currently, with the advancement of deep learning techniques, representation learning-based EA methods have become the mainstream approach.

Recent years have witnessed a rapid increase in the number of entity alignment frameworks, while the relationships among them remain unclear. This book aims to fill that gap by elaborating the concept and categorization of entity alignment, reviewing recent advances in entity alignment approaches, and introducing novel scenarios and corresponding solutions. Specifically, the book includes comprehensive evaluations and detailed analyses of state-of-the-art entity alignment approaches and strives to provide a clear picture of the strengths and weaknesses of the currently available solutions, so as to inspire follow-up research. In addition, it identifies novel entity alignment scenarios and explores the issues of large-scale data, long-tail knowledge, scarce supervision signals, lack of labeled data, and multimodal knowledge, offering potential directions for future research. The book offers a valuable reference guide for junior researchers, covering the latest advances in entity alignment, and a valuable asset for senior researchers, sharing novel entity alignment scenarios and their solutions. Accordingly, it will appeal to a broad audience in the fields of knowledge bases, database management, artificial intelligence, and big data.

#### **Content Organization**

The book consists of nine chapters, which can be divided into three parts.

Part I presents the background and overview of entity alignment and then discusses the state-of-the-art entity alignment solutions. Specifically, Chap. 1 presents a brief introduction to the entity alignment task. It also introduces works that are closely related to entity alignment, as well as frequently used datasets. Chapter 2 conducts a comprehensive evaluation and detailed analysis of state-of-the-art EA approaches.

Part II introduces recent advances in entity alignment approaches, including progress in representation learning (cf. Chap. 3) and alignment inference (cf. Chap. 4).

Part III introduces novel scenarios of entity alignment and corresponding solutions, including large-scale data, long-tail knowledge, scarce supervision signals, lack of labeled data, and multimodal knowledge, offering potential directions for future research.

Chapter 5 targets entity alignment at scale and puts forward a novel solution that can handle large-scale KG pairs while achieving promising alignment performance.

Chapter 6 identifies the deficiency of existing EA methods in aligning long-tail entities and addresses it by introducing a complementary signal from entity names, in the form of concatenated power mean word embeddings, and by devising a degree-aware co-attention mechanism that dynamically fuses name and structural signals.

Chapter 7 tackles EA with scarce supervision. It puts forward a reinforced active entity alignment framework to select the entities to be manually labeled with the aim of enhancing alignment performance with minimal labeling efforts.

Chapter 8 identifies the deficiencies of existing EA methods, i.e., requiring labeled data and working under the closed-domain setting, and introduces an unsupervised EA framework to deal with unmatchable entities.

Chapter 9 introduces a novel multi-modal entity alignment strategy, i.e., hyperbolic multi-modal entity alignment, which extends the Euclidean representation to the hyperboloid manifold.

Changsha, China, May 2023

Xiang Zhao, Weixin Zeng, Jiuyang Tang

## **Contents**

#### **Part I Concept and Categorization**





#### **Part III Novel Approaches**





# **Part I Concept and Categorization**

## **Chapter 1 Introduction to Entity Alignment**

**Abstract** In this chapter, we provide a concise overview of the entity alignment task and discuss related tasks that are closely connected to entity alignment.

#### **1.1 Background**

In the past few years, there has been a significant increase in the use and development of KGs and their various applications. These KGs are designed to store world knowledge, represented as triples of the form ⟨entity, relation, entity⟩, with each entity referring to a distinct real-world object and each relation representing a connection between those objects. Since these entities serve as the foundation for the triples in a KG, the triples are inherently interconnected, creating a large and complex graph of knowledge. Currently, we have a large number of *general* KGs (e.g., DBpedia [1], YAGO [52], Google's Knowledge Vault [14]) and *domain*-specific KGs (e.g., medical [48] and scientific KGs [56]). KGs have been utilized to improve a wide range of downstream applications, including but not limited to keyword search [64], fact-checking [30], and question answering [12, 28].

A knowledge graph, denoted as $G = (E, R, T)$, is a graph that consists of three main components: a set of entities $E$, a set of relations $R$, and a set of triples $T$, where $T \subseteq E \times R \times E$ represents the directed edges in the graph. In the set of triples $T$, a single triple $(h, r, t)$ represents a relationship between a head entity $h$ and a tail entity $t$ through a specific relation $r$. Each entity in the graph is identified by a unique identifier, such as http://dbpedia.org/resource/Spain in the case of DBpedia.
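As a minimal sketch of this definition, the snippet below stores a KG as a set of triples and derives the entity and relation sets from it. The class names, the relation names `director_of` and `born_in`, and the entity identifiers are illustrative assumptions, not taken from any real KG.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A single directed edge (h, r, t)."""
    head: str
    relation: str
    tail: str


class KnowledgeGraph:
    """G = (E, R, T), where E and R are derived from the triple set T."""

    def __init__(self, triples):
        self.triples = {Triple(*t) for t in triples}
        self.entities = {t.head for t in self.triples} | {t.tail for t in self.triples}
        self.relations = {t.relation for t in self.triples}


# Toy triples; identifiers and relation names are made up for illustration.
kg = KnowledgeGraph([
    ("Alfonso_Cuarón", "director_of", "Roma(film)"),
    ("Alfonso_Cuarón", "born_in", "Mexico"),
])
```

Deriving the entity and relation sets from the triples keeps the three components consistent by construction, at the cost of not representing isolated entities that appear in no triple.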

In practice, KGs are typically constructed from a single data source, making it difficult to achieve comprehensive coverage of a given domain [46]. To improve the completeness of a KG, one popular strategy is to integrate information from other KGs that may contain supplementary or complementary data. For instance, a general KG may only include basic information about a scientist, while scientific domain-specific KGs may have additional details like biographies and lists of publications.

**Fig. 1.1** An example of EA. The entity identifiers are placed in the square brackets. The prefixes of entity identifiers and the full relation identifiers are omitted for clarity; seed entity pairs are connected by dashed lines

To combine knowledge across multiple KGs, a crucial step is to align equivalent entities in different KGs, which is known as *entity alignment* (EA) [7, 25].<sup>1</sup>

Given a source KG $G_1 = (E_1, R_1, T_1)$, a target KG $G_2 = (E_2, R_2, T_2)$, and seed entity pairs (training set), i.e., $S = \{(u, v) \mid u \in E_1, v \in E_2, u \leftrightarrow v\}$, where $\leftrightarrow$ represents equivalence (i.e., $u$ and $v$ refer to the same real-world object), the task of EA can be defined as discovering the equivalent entity pairs in the test set.

*Example* Figure 1.1 shows a partial English KG (KG<sub>EN</sub>) and a partial Spanish KG (KG<sub>ES</sub>) concerning the director *Alfonso Cuarón*. Note that each entity in the KG has a unique identifier. For example, the movie "Roma" in the source KG is uniquely identified by Roma(film).<sup>2</sup> Given the seed entity pair, i.e., Mexico from KG<sub>EN</sub> and Mexico from KG<sub>ES</sub>, EA aims to find the equivalent entity pairs in the test set, e.g., returning Roma(ciudad) in KG<sub>ES</sub> as the corresponding target entity to the source entity Roma(city) in KG<sub>EN</sub>.

Broadly speaking, current entity alignment (EA) methods typically address the problem by assuming that equivalent entities in different KGs share similar local structures and applying representation learning techniques to embed entities as data points in a low-dimensional feature space. With effective entity embedding, the pairwise dissimilarity of entities can be calculated as the distance between data points, allowing us to evaluate whether two entities are a match or not.<sup>3</sup>

<sup>1</sup> From our standpoint, EA can be regarded as a special case of entity resolution (ER), which has an extensive literature (to be discussed in Sect. 1.2). Thus, some ER methods (with minor adaptation to handle EA) are also covered in this book.

<sup>2</sup> The identifiers in some KGs are human-readable, e.g., those in Fig. 1.1, while some are incomprehensible, e.g., Freebase MIDs like /m/012rkqx.
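To make the distance-based matching idea concrete, here is a minimal sketch that picks, for a source entity, the target entity whose embedding is closest. The two-dimensional vectors are invented toy values, Euclidean distance is just one possible choice of dissimilarity, and the identifier `Roma(película)` is a hypothetical distractor; real EA methods learn the embeddings and may use other distance measures.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_target(source_vec, target_embeddings):
    """Return the target entity whose embedding is closest to source_vec."""
    return min(
        target_embeddings,
        key=lambda name: euclidean(source_vec, target_embeddings[name]),
    )


# Toy embeddings, assumed to already live in one shared vector space.
source_embeddings = {"Roma(city)": [0.9, 0.1]}
target_embeddings = {
    "Roma(ciudad)": [0.8, 0.2],
    "Roma(película)": [0.1, 0.9],  # hypothetical identifier
}

match = nearest_target(source_embeddings["Roma(city)"], target_embeddings)
```

With these toy values, `match` is `"Roma(ciudad)"`, since its vector lies much closer to the source vector than the other candidate.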

#### **1.2 Related Works**

While the problem of EA was introduced a few years ago, the more generic version of the problem (identifying entity records that refer to the same real-world entity in different data sources) has been investigated from various angles by different communities, under the names of entity resolution (ER) [15, 18, 45], entity matching [13, 42], record linkage [8, 34], deduplication [16], instance/ontology matching [20, 35, 49–51], link discovery [43, 44], and entity linking/entity disambiguation [11, 29]. Next, we describe the related work and the scope of this book.

#### *1.2.1 Entity Linking*

Entity linking (EL), also known as entity disambiguation, is the task of recognizing entity mentions in natural language text and linking them to the corresponding entities in a given reference catalog, which is usually a knowledge graph. For example, given the word "Rome," the task is to determine whether it refers to the city in Italy, a movie, or another entity, and then link it to the right entity in the reference catalog. Prior studies in EL [21, 22, 29, 36, 68] have used various sources of information to disambiguate entity mentions, including surrounding words, prior probabilities of certain target entities, already disambiguated entity mentions, and background knowledge from sources such as Wikipedia. However, much of this information, such as entity embeddings or the prior distribution of entity linking given a mention, is not available in scenarios where KGs need to be aligned. Moreover, EL is concerned with mapping natural language text to a KG, while this book investigates the mapping of entities between two KGs.

#### *1.2.2 Entity Resolution*

Entity resolution, which is also referred to as entity matching, deduplication, or record linkage, assumes that the input is *relational data*, and each data object usually has a large amount of textual information described in multiple attributes. Therefore, various similarity or distance functions are used in entity resolution to measure the similarity between two objects. These functions include Jaro-Winkler distance for comparing names and numerical distance for comparing dates. Based on the similarity measure, both rule-based and machine learning-based methods can be employed to classify two objects as either matching or non-matching [9].

<sup>3</sup> Throughout the rest of this book, we may use the terms "align" and "match" interchangeably with the same meaning.

To clarify further, in ER tasks, the attributes of data objects are first aligned, which can be done manually or automatically. Then, the similarity or distance functions are used to calculate the similarities between corresponding attribute values of the two objects. Finally, the similarity scores between the aligned attributes are combined or aggregated to determine the overall similarity between the two objects. This process allows rule-based or machine learning-based methods to classify pairs of objects as either matching or non-matching, based on the computed similarity scores [32, 45].
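The three steps above (align attributes, compare attribute values, aggregate the scores) can be sketched as follows. This is an illustration under stated assumptions rather than a method from the literature: the standard library's `difflib.SequenceMatcher` ratio stands in for Jaro-Winkler, the date similarity decays linearly over one year, and the records, weights, and threshold are all made up.

```python
from datetime import date
from difflib import SequenceMatcher


def name_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def date_similarity(a, b, max_days=365):
    """1.0 for identical dates, decaying linearly to 0.0 at max_days apart."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / max_days)


def record_similarity(r1, r2, weights):
    """Weighted average of per-attribute similarities (attributes pre-aligned)."""
    scores = {
        "name": name_similarity(r1["name"], r2["name"]),
        "born": date_similarity(r1["born"], r2["born"]),
    }
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total


def is_match(r1, r2, weights, threshold=0.8):
    """Rule-based decision: match if the aggregated score reaches the threshold."""
    return record_similarity(r1, r2, weights) >= threshold


# Invented records and weights, for illustration only.
weights = {"name": 0.7, "born": 0.3}
a = {"name": "María González", "born": date(1980, 5, 17)}
b = {"name": "Maria Gonzalez", "born": date(1980, 5, 17)}
c = {"name": "John Smith", "born": date(1975, 1, 2)}
```

Here `is_match(a, b, weights)` holds because the names differ only in accents and the dates coincide, while `is_match(a, c, weights)` fails. A learned classifier over the per-attribute scores could replace the fixed threshold.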

#### *1.2.3 Entity Resolution on KGs*

Certain ER methods are designed specifically for KGs and handle only binary relations, i.e., graph-shaped data. These methods are sometimes called instance/ontology matching approaches [49, 50]. Graph-shaped data comes with its own challenges: (1) Entities in graph-shaped data often lack detailed textual descriptions and may only be represented by their name, with a minimal amount of accompanying information. (2) Unlike classical databases, which assume that all fields of a record are present, KGs are built on the Open World Assumption, where the absence of certain attributes of an entity in the KG does not necessarily mean that they do not exist in reality. This fundamental difference sets KGs apart from traditional databases. (3) KGs have their own set of predefined semantics. At a basic level, these can take the form of a taxonomy of classes. In more complex cases, KGs can be endowed with an ontology of logical axioms.

In the past 20 years, various techniques have been developed to address the specific challenges of KGs, particularly in the context of the Semantic Web and the Linked Open Data cloud [26]. These techniques can be categorized along several different dimensions:

• **Scope.** Some approaches aim to align the entities in two different KGs, while others focus on aligning the relationship names, or schema, between KGs. Additionally, some methods aim to align the class taxonomies of two KGs, and a few techniques achieve all three tasks at once. This book focuses on the first task, i.e., aligning entities in KGs.



Most of the supervised or semi-supervised approaches for entity alignment utilize recent advances in deep learning [23]. These approaches primarily rely on graph representation learning techniques to model the structure of knowledge graphs and generate entity embeddings for alignment. To refer to the supervised or semi-supervised approaches, we use the term "entity alignment (EA) approaches," which is also the main focus of this study. However, in the next chapter, we include PARIS [51] for comparison as a representative of the unsupervised approaches. We also include AgreementMakerLight (AML) [17] as a representative of unsupervised systems that use background knowledge. For the other systems, we refer the reader to other surveys [9, 33, 41, 43].

In addition, since EA pursues the same goal as ER, it can be deemed a special but nontrivial case of ER. In this light, general ER approaches can be adapted to the problem of EA, and we include representative ER methods for comparison (to be detailed in Chap. 2).

**Existing Benchmarks** Several synthetic datasets, such as DBP15K and DWY100K, were created using the inter-language and reference links already present in DBpedia to assess the effectiveness of EA methods. Chapter 2 contains more extensive statistical information about these datasets.

Notably, the Ontology Alignment Evaluation Initiative (OAEI) promoted the knowledge graph track.<sup>5</sup> Existing benchmarks for EA only provide instance-level information, while the KGs in the OAEI datasets include both schema and instance information. Evaluating current EA approaches on them could therefore be unfair, since these approaches do not consider the availability of ontology information. Hence, the OAEI datasets are not presented in this book.

<sup>4</sup> http://oaei.ontologymatching.org/.

<sup>5</sup> http://oaei.ontologymatching.org/2019/knowledgegraph.

#### **1.3 Evaluation Settings**

This section provides an introduction to the evaluation settings that are commonly used for the EA task.

**Datasets** Three representative datasets are commonly used: DBP15K, DWY100K, and SRPRS.


Table 1.1 provides a summary of the datasets used in this study. Each KG pair includes relational triples, cross-KG entity pairs (30% of which are seed entity pairs and used for training), and attribute triples. The cross-KG entity pairs serve as gold standards.

**Degree Distribution** Figure 1.2 presents the degree distributions of entities in the datasets, which provides insights into the characteristics of these datasets. The *degree* of an entity is defined as the number of triples in which the entity is involved. Entities with higher degrees tend to have richer neighboring structures. The degree distributions of the different KG pairs in each dataset are very similar. Thus, for brevity, we present only one KG pair's degree distribution in Fig. 1.2.
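The degree definition above, together with the two quantities plotted in Fig. 1.2 (the number of entities per degree and the share of entities below a given degree), can be computed along the following lines. This is a minimal sketch on made-up triples; the function names are illustrative and not part of any benchmark tooling.

```python
from collections import Counter


def entity_degrees(triples):
    """Degree of an entity = number of triples in which it is involved."""
    degrees = Counter()
    for head, _relation, tail in triples:
        # Use a set so a self-loop triple counts once for its entity.
        for entity in {head, tail}:
            degrees[entity] += 1
    return degrees


def degree_histogram(degrees):
    """Map each degree value to the number of entities having it (the bars)."""
    return Counter(degrees.values())


def coverage_below(degrees, x):
    """Share of entities with a degree lower than x (the line)."""
    return sum(1 for d in degrees.values() if d < x) / len(degrees)


# Toy triples for illustration.
triples = [
    ("a", "r1", "b"),
    ("a", "r2", "c"),
    ("b", "r1", "c"),
    ("a", "r1", "d"),
]
degrees = entity_degrees(triples)
histogram = degree_histogram(degrees)
```

In this toy graph, entity `a` has degree 3, `b` and `c` have degree 2, and `d` has degree 1, so three of the four entities have a degree lower than 3.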

The sub-figures in series (a) correspond to the DBP15K dataset. As shown, entities with a degree of 1 comprise the largest proportion, while the number of entities generally decreases with increasing degree values, with some fluctuations.



**Table 1.1** Statistics of EA benchmarks and our constructed dataset

It is worth noting that the coverage curve approximates a straight line, as the number of entities changes only slightly when the degree increases from 2 to 10.

The (b) set of figures is related to DWY100K. This dataset has a distinct structure from (a), as there are no entities with a degree of 1 or 2. Additionally, the number of entities reaches its highest point at degree 4 and then decreases as the entity degree increases.

The (c) set of figures is related to SRPRS. It is clear that the degree distribution of entities in this dataset is more realistic, with entities of lower degrees making up a larger proportion. This is due to its well-thought-out sampling approach. Additionally, the (d) set of figures corresponds to the dataset we created, which will be discussed in Chap. 2.

**Evaluation Metrics** Most existing EA solutions use Hits@*k* (*k* = 1*,* 10) and mean reciprocal rank (MRR) as their evaluation metrics. The target entities are arranged in order of increasing distance scores from the source entity when making a prediction. The Hits@*k* metric shows the proportion of correctly aligned entities among the *k* nearest target entities. Hits@1 is the most significant measure of the accuracy of the alignment results.

MRR denotes the average of the reciprocal ranks of the ground truths. Note that higher Hits@*k* and MRR indicate better performance. Unless otherwise specified, the results of Hits@*k* are represented in percentages.
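As a minimal sketch of how these metrics can be computed, the snippet below derives the rank of the ground-truth target from a distance dictionary and then aggregates ranks into Hits@*k* (as a percentage) and MRR. The function names and the toy ranks are assumptions for illustration, and ties in distance are broken arbitrarily by sort order.

```python
def rank_of_truth(distances, truth):
    """1-based rank of the ground-truth target, targets sorted by increasing distance."""
    ranked = sorted(distances, key=distances.get)
    return ranked.index(truth) + 1


def hits_at_k(ranks, k):
    """Percentage of source entities whose ground truth is among the k nearest targets."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of the reciprocal ranks of the ground truths."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: one distance dictionary and a list of ranks for four source entities.
example_rank = rank_of_truth({"x": 0.3, "y": 0.1, "z": 0.7}, truth="x")
ranks = [1, 2, 4, 1]
hits1 = hits_at_k(ranks, 1)
hits10 = hits_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
```

For the toy ranks, two of the four ground truths are ranked first, so Hits@1 is 50.0, Hits@10 is 100.0, and MRR is (1 + 1/2 + 1/4 + 1) / 4 = 0.6875.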

**Fig. 1.2** Degree distributions on different datasets. The X-axis denotes entity degree. The left Y-axis represents the number of entities (corresponding to bars), while the right Y-axis represents the percentage of entities with a degree lower than a given *x* value (corresponding to lines)

#### **References**



## **Chapter 2 State-of-the-Art Approaches**

**Abstract** This chapter performs a thorough assessment and meticulous examination of the most advanced EA techniques. Initially, we introduce a broad EA framework that covers all current methods and classify these methods into three main groups. Then, we carefully appraise these solutions on various scenarios, taking into account their efficacy, efficiency, and scalability. Lastly, we create a novel EA dataset that reflects the actual difficulties encountered in alignment, which prior literature mostly ignored. This chapter aims to offer a comprehensive understanding of the advantages and drawbacks of current EA methods, in order to encourage further high-quality research.

#### **2.1 Introduction**

In this chapter, we conduct an empirical evaluation of state-of-the-art EA approaches, which possesses the following characteristics:

**Fair Comparison Within and Across Categories** Most recent studies have limited themselves to comparing only a subset of methods [4, 11, 15, 23, 27–30, 33]. Moreover, different approaches follow different protocols: some use only the KG structure for alignment, while others incorporate additional information; some perform one-pass alignment of KGs, while others use an iterative (re-)training strategy. While the literature presents a direct comparison of these methods, which highlights their overall effectiveness, a more desirable and equitable approach would be to classify these methods into categories and then compare the outcomes within and across categories.

In this chapter, we incorporate most of the state-of-the-art methods to facilitate a comprehensive comparison, including the very recent approaches that have not been evaluated against other methods previously. We divide them into three groups and conduct a thorough analysis of both intra- and inter-group evaluations, enabling us to better position these methods and evaluate their effectiveness.

**Comprehensive Evaluation on Representative Datasets** To assess the performance of EA systems, various datasets have been developed, which can be broadly classified into two categories: *cross-lingual* benchmarks, exemplified by DBP15K [21], and *mono-lingual* benchmarks, exemplified by DWY100K [22]. A recent study [11] highlights that the KGs in prior datasets are much *denser* than those in real-world scenarios, which led them to create the SRPRS dataset with entity degrees that follow a *normal* distribution. Despite the availability of multiple datasets, previous studies only report their results on one or two specific datasets, making it challenging to evaluate their efficacy across a wide range of potential scenarios, such as cross-lingual/mono-lingual, dense/normal, and large-scale/medium-scale KGs.

In light of this observation, this chapter performs a thorough experimental evaluation on all the prominent datasets, namely, DBP15K, DWY100K, and SRPRS, which together consist of nine pairs of knowledge graphs. The evaluation is conducted across various dimensions, including effectiveness, efficiency, and robustness.

**New Dataset for Real-Life Challenges** It has been noted that current EA datasets assume that each entity in the source KG has exactly one corresponding entity in the target KG, which is an unrealistic assumption. In reality, there are entities in one KG that may not have a corresponding entity in the other KG. For example, when aligning YAGO 4 and IMDB, only a small percentage (1%) of entities in YAGO 4 are related to movies, while the remaining 99% of entities in YAGO 4 do not have any corresponding entities in IMDB. These unmatchable entities would make the EA task more challenging.

Furthermore, we notice that the mono-lingual datasets currently available for EA evaluation assume that the entities in the different KGs share the same naming convention. Therefore, the baseline method that relies on comparing the string similarity between entity names can achieve perfect accuracy. However, this assumption is often not valid in real-life scenarios, where equivalent entities in different KGs may have dissimilar names, such as "America" and "USA" for the same entity. In addition, another challenge that is often overlooked in EA is that different entities in a KG might have the same name. This can make it difficult to determine whether an entity with the name "Paris" in the source KG refers to the same entity as one with the same name in the target KG, as they could potentially refer to different entities, such as the city in France and the city in Texas.

For these reasons, we believe that the current EA datasets do not fully capture the realistic challenges posed by unmatchable entities and ambiguous entity names. To address this issue, we introduce a new dataset that more closely mirrors these practical difficulties.

The main contributions of this chapter are the following:

• This chapter provides a comprehensive evaluation of state-of-the-art EA approaches. The evaluation includes: (1) identifying the main components of existing EA approaches and proposing a general EA framework; (2) categorizing state-of-the-art approaches into three groups and conducting detailed intra- and inter-group evaluations to better understand their strengths and weaknesses; and (3) examining these approaches in various scenarios, including cross-/mono-lingual alignment and alignment on dense/normal, large-/medium-scale data, to evaluate their *effectiveness*, *efficiency*, and *robustness*. The empirical results provide insights into the performance of each approach. This evaluation aims to provide a more systematic and comprehensive understanding of the current state of EA research.

• Through our study, we gained valuable experience and insights that allow us to identify the shortcomings of current EA datasets. To address these issues, we have created a new mono-lingual dataset that accurately reflects the real-life challenges of unmatchable entities and ambiguous entity names. We anticipate that this new dataset will provide a more effective benchmark for evaluating EA systems.

#### **2.2 A General EA Framework**

This section presents a general EA framework that is designed to include state-of-the-art EA approaches. Through a thorough analysis of current EA approaches, we identify four primary components, as shown in Fig. 2.1:


**Fig. 2.1** A general EA framework

embeddings to calculate the similarity between entities. The target entity with the highest similarity (or lowest distance) is then selected as the counterpart.

• **Extra information module.** In addition to the basic modules, some EA approaches use additional information to improve their performance. One approach is bootstrapping, where confident alignment results are used as training data for subsequent alignment iterations. Another approach is to use multi-type literal information such as attributes, entity descriptions, and entity names to complement the KG structure. These additional sources of information are shown in Fig. 2.1 as blue dashed lines.

*Example* Further to the example in Chap. 1, we explain these modules. The *embedding learning module* generates embeddings for entities in KG<sub>EN</sub> and KG<sub>ES</sub>, respectively. Then the *alignment module* projects the entity embeddings into the same vector space, where the entity embeddings in KG<sub>EN</sub> and KG<sub>ES</sub> are directly comparable. Finally, using the unified embeddings, the *prediction module* aims to predict the equivalent target entity in KG<sub>ES</sub> for each source entity in KG<sub>EN</sub>. The *extra information module* leverages several techniques to improve the EA performance. Concretely, the bootstrapping strategy aims to include the confident EA pairs detected from a previous round, e.g., (Spain, España), into the training set for learning in the next round. Another approach is to use additional textual information to complement the entity embeddings for alignment.

We organize the state-of-the-art approaches based on each module of the EA framework and present them in Table 2.1. For a more detailed view of the approaches, readers can refer to the Appendix. Now, we will explain how each of these modules is implemented in various state-of-the-art approaches.

#### *2.2.1 Embedding Learning Module*

In this section, we will explain the techniques used in the embedding learning module, which utilize the KG structure to create embeddings for each entity.

Table 2.1 shows that the most commonly used models for this module are TransE [3] and GCN [13]. We will provide a brief overview of these fundamental models.

**Table 2.1** A summary of the EA approaches involved in this study

<sup>a</sup> **C-L** stands for **cross-lingual evaluation** and **M-L** stands for **mono-lingual evaluation**

<sup>b</sup> TransE*-* represents variants of the TransE model

**TransE** The TransE model views relationships as translations that act on the lower-dimensional representations of entities. To clarify, when presented with a relational triple *(h, r, t)*, TransE proposes that the embedded representation of the tail entity *t* should be similar to the embedded representation of the head entity *h* plus the embedded representation of the relationship *r*, or **h** + **r** ≈ **t**. By doing so, the model is able to maintain the structural information of the entities and produce close representations for entities that share similar neighbors in the embedding space.
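The translation idea **h** + **r** ≈ **t** can be illustrated with a minimal NumPy sketch. The function names `transe_score` and `margin_loss`, the L2 norm, the margin value, and the toy vectors are all assumptions made for illustration; they are not taken from any of the evaluated implementations.

```python
import numpy as np

def transe_score(h, r, t):
    """Distance ||h + r - t||_2; a lower score means a more plausible triple."""
    return np.linalg.norm(h + r - t, axis=-1)

def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: positive triples should score at least `margin` below negatives."""
    return np.maximum(0.0, margin + pos_score - neg_score).mean()

# Toy 2-dimensional embeddings (hypothetical values).
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_good = np.array([1.0, 1.0])    # exactly h + r
t_bad = np.array([-1.0, -1.0])   # a corrupted tail, far from h + r

pos = transe_score(h, r, t_good)
neg = transe_score(h, r, t_bad)
loss = margin_loss(pos, neg)
```

In this toy case the corrupted triple is already farther away than the margin, so the hinge term contributes no loss.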

**GCN** A type of convolutional network that processes graph-based data directly is known as the graph convolutional network (GCN). It creates embeddings for individual nodes by encoding information about the neighborhoods of those nodes. GCN takes as input feature vectors for each node in the KG, as well as a representative graph structure description in matrix form, such as an adjacency matrix. The output of the GCN is a new feature matrix. A typical GCN model consists of multiple stacked GCN layers, which allows it to capture a partial KG structure that extends several hops away from the entity being processed.
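As a rough illustration of how stacked layers widen the neighborhood that each node sees, the sketch below implements one propagation step with the commonly used symmetric normalization and self-loops. This is an assumption about the layer form: individual EA methods may use different normalizations, weights, or activations, and the function name `gcn_layer` is hypothetical.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse square root of degrees
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ X @ W)

# Toy path graph 0 - 1 - 2 with one-hot input features and identity weights.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
W = np.eye(3)

H1 = gcn_layer(A, X, W)    # after one layer, node 0 sees only nodes 0 and 1
H2 = gcn_layer(A, H1, W)   # after two layers, node 0 also receives signal from node 2
```

With one-hot inputs, entry `H[i, j]` shows how much of node `j`'s feature reaches node `i`, so the zero at `H1[0, 2]` turning positive in `H2` reflects the two-hop reach of two stacked layers.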

On top of these basic models, some methods make modifications. Regarding the TransE-based models, MTransE removes the negative triples during training, BootEA and NAEA replace the original margin-based loss function with a limit-based objective function, MuGNN uses the logistic loss to substitute for the margin-based loss, and JAPE designs a new loss function.

Concerning the GCN-based models, it has been observed that the GCN does not take into account the relations present in KGs. Therefore, as a solution, RDGCN employs the dual-primal graph convolutional neural network (DPGCNN) [17]. In contrast, MuGNN leverages an attention-based GNN model to assign varying weights to neighboring nodes. Additionally, KECG merges graph attention network (GAT) [25] and TransE to capture both the inner-graph structure and the inter-graph alignment information.

Several approaches have introduced new embedding models. For example, in RSNs, the authors contend that triple-level learning is inadequate for capturing long-term relational dependencies between entities and is insufficient for propagating semantic information among entities. Therefore, they propose using recurrent neural networks (RNNs) with residual learning to learn the long-term relational paths between entities.

Similarly, TransEdge devises a new energy function to measure the error of edge translation between entity embeddings for KG embedding learning. This method models edge embeddings using context compression and projection.

#### *2.2.2 Alignment Module*

In this subsection, we introduce the methods used for the alignment module, which aims to unify separated KG embeddings.

The prevailing approach in KG embedding learning is to use a margin-based loss function on top of the embedding learning module. This loss function requires that the distance between entities in *positive pairs* should be small, while the distance between entities in *negative pairs* should be large, with a margin between the distances of positive and negative pairs. The *positive pairs* refer to seed entity pairs, while *negative pairs* are generated by corrupting the positive pairs. This approach helps to merge the two separate KG embedding spaces into one vector space. Table 2.1 indicates that the majority of methods that use GNNs rely on a margin-based alignment model to merge the two KG embedding spaces. In contrast, in GM-Align, a matching framework is employed to maximize the matching probabilities of seed entity pairs, which achieves the alignment process.
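The margin-based alignment objective described above can be sketched as follows. The Manhattan distance, the margin value, the one-negative-per-positive pairing, and the name `alignment_loss` are assumptions for illustration only; the evaluated methods differ in these details.

```python
import numpy as np

def alignment_loss(src, tgt, pos_pairs, neg_pairs, margin=1.0):
    """Sum of hinge terms max(0, d(positive) - d(negative) + margin).

    The k-th negative pair is compared against the k-th positive pair.
    """
    def dist(pairs):
        i, j = np.asarray(pairs).T
        return np.abs(src[i] - tgt[j]).sum(axis=1)   # Manhattan distance
    return np.maximum(0.0, dist(pos_pairs) - dist(neg_pairs) + margin).sum()

# Toy embeddings (hypothetical): source i and target i are already close.
src = np.array([[0.0, 0.0], [1.0, 1.0]])
tgt = np.array([[0.0, 0.0], [1.0, 1.0]])
seed_pairs = [(0, 0), (1, 1)]        # positive pairs
corrupted_pairs = [(0, 1), (1, 0)]   # negatives obtained by swapping the targets

loss_small_margin = alignment_loss(src, tgt, seed_pairs, corrupted_pairs, margin=1.0)
loss_large_margin = alignment_loss(src, tgt, seed_pairs, corrupted_pairs, margin=3.0)
```

The loss is zero once every negative pair is farther apart than its positive pair by at least the margin, and grows as the margin requirement is tightened.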

Corpus fusion is another common approach, which involves using the seed entity pairs to connect the training corpora of two KGs. Some methods, such as BootEA and NAEA, generate new triples by swapping the entities in the seed entity pairs to align the embeddings in a unified space. Concretely, given an entity pair $(u, v)$, the newly generated triples for $G_1$ are $T_1^{new} = \{(v, r, t) \mid (u, r, t) \in T_1\} \cup \{(h, r, v) \mid (h, r, u) \in T_1\}$ and for $G_2$ are $T_2^{new} = \{(u, r, t) \mid (v, r, t) \in T_2\} \cup \{(h, r, u) \mid (h, r, v) \in T_2\}$. To clarify, the overlay graph is built by connecting the entities in seed entity pairs with edges, and the rest of the entities are connected with edges based on their similarity or co-occurrence in the training corpus. Entity embeddings are then learned using the adjacency matrix of the overlay graph and the training corpus.
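The triple-swapping step can be written directly from the set definitions above. The helper name `swap_triples` and the toy triples are hypothetical and serve only to illustrate the construction.

```python
def swap_triples(triples, old, new):
    """Copy every triple that mentions `old`, replacing it by `new` as head or as tail."""
    as_head = {(new, r, t) for (h, r, t) in triples if h == old}
    as_tail = {(h, r, new) for (h, r, t) in triples if t == old}
    return as_head | as_tail

# Toy triples for the two KGs (hypothetical relation and entity names).
T1 = {("Spain", "capital", "Madrid"), ("Madrid", "locatedIn", "Spain")}
T2 = {("España", "capital", "Madrid_es")}

u, v = "Spain", "España"           # a seed entity pair (u, v)
T1_new = swap_triples(T1, u, v)    # triples of the first KG rewritten with v
T2_new = swap_triples(T2, v, u)    # triples of the second KG rewritten with u
```

Training on the union of the original and swapped triples ties the embeddings of `u` and `v` to the same neighbors, which is how the two spaces are pulled together.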

Some earlier works proposed transition functions to map the embedding vectors from one KG to another, while others utilized additional information such as entity attributes to align the entity embeddings into a unified space.

#### *2.2.3 Prediction Module*

This module typically involves computing similarity scores between source and target entity embeddings and selecting the target entity with the highest score as the alignment.

To align entities, the most common method is to generate a ranked list of target entities for each source entity based on a specific distance measure between their embeddings. The distance measures commonly used include Euclidean distance, Manhattan distance, and cosine similarity. The top-ranked entity in the list is then considered a match for the source entity. It is worth noting that the similarity score can be converted into the distance score by subtracting it from 1 and vice versa.<sup>1</sup> In contrast, in GM-Align, the entity with the highest matching probability is aligned with the source entity.
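A minimal sketch of this ranking-based prediction is given below for the three distance measures named above. The function names and toy embeddings are assumptions; the cosine branch converts similarity into distance by subtracting it from 1, as described in the text.

```python
import numpy as np

def pairwise_distance(src, tgt, metric="manhattan"):
    """Distance matrix D[i, j] between source i and target j."""
    diff = src[:, None, :] - tgt[None, :, :]
    if metric == "manhattan":
        return np.abs(diff).sum(axis=-1)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "cosine":
        s = src / np.linalg.norm(src, axis=1, keepdims=True)
        t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
        return 1.0 - s @ t.T          # distance = 1 - cosine similarity
    raise ValueError(f"unknown metric: {metric}")

def predict(src, tgt, metric="manhattan"):
    """Return the top-ranked target per source and the full ranked lists."""
    D = pairwise_distance(src, tgt, metric)
    ranking = np.argsort(D, axis=1)   # targets sorted from closest to farthest
    return ranking[:, 0], ranking

# Toy unified embeddings (hypothetical values).
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.0, 2.0], [3.0, 0.1]])
best, ranking = predict(src, tgt, metric="cosine")
```

Each source entity is matched independently here, so two sources may select the same target.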

Additionally, a recent method called CEA observes that there is a correlation between different entity alignment decisions, meaning that if a target entity is already matched to a source entity with high confidence, it is less likely to be matched to another source entity. To capture this correlation, CEA models it as a stable matching problem, and addresses the problem based on the distance measure, which decreases the number of mismatches and improves the accuracy of entity alignment.
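One standard way to solve a stable matching problem is the deferred acceptance (Gale-Shapley) procedure, sketched below on a similarity matrix. Whether CEA uses exactly this procedure is not stated in this section, so the sketch should be read as an illustration of stable matching in general; the function name and toy scores are assumptions.

```python
import numpy as np

def stable_matching(sim):
    """Deferred acceptance on a square similarity matrix sim[source, target]."""
    n = sim.shape[0]
    prefs = np.argsort(-sim, axis=1)   # each source's targets, most similar first
    next_choice = [0] * n              # position of the next target to propose to
    holder = {}                        # target -> source it currently holds
    free = list(range(n))
    while free:
        s = free.pop()
        t = int(prefs[s, next_choice[s]])
        next_choice[s] += 1
        if t not in holder:
            holder[t] = s                        # target was unmatched
        elif sim[s, t] > sim[holder[t], t]:
            free.append(holder[t])               # target prefers the new source
            holder[t] = s
        else:
            free.append(s)                       # proposal rejected, try next target
    return {s: t for t, s in holder.items()}

# Both sources prefer target 0, so independent top-1 selection produces a conflict.
sim = np.array([[0.90, 0.80],
                [0.85, 0.10]])
greedy = sim.argmax(axis=1).tolist()
matching = stable_matching(sim)
```

In the toy example the independent selection assigns target 0 to both sources, while the stable matching keeps the stronger pair and redirects the other source to target 1.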

<sup>1</sup> In this work, we use the distance between entity embeddings and the similarity between entity embeddings interchangeably.

#### *2.2.4 Extra Information Module*

In this subsection, we discuss the methods used in the extra information module.

One approach to improve the EA framework is through the bootstrapping strategy, also known as the iterative training or self-learning strategy. This approach involves iteratively labeling highly probable EA pairs as the training set for the next round, leading to the gradual enhancement of alignment results. There are several methods based on this approach, with variations in the selection of confident EA pairs. ITransE identifies the most similar *nonaligned* target entity for each *nonaligned* source entity, and if the similarity score between them exceeds a certain threshold, they are regarded as a confident pair. BootEA, NAEA, and TransEdge follow a similar approach where they calculate the probability of each source entity being aligned with every target entity. They only consider pairs with probability scores above a certain threshold and use a maximum likelihood matching algorithm with a 1-to-1 mapping constraint to generate a set of confident EA pairs.
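The threshold-based selection of confident pairs can be sketched as follows. This follows the verbal description of the simpler strategy above; the function name, the threshold value, and the toy similarity matrix are assumptions, and the 1-to-1 matching step used by BootEA, NAEA, and TransEdge is not implemented here.

```python
import numpy as np

def confident_pairs(sim, aligned_src, aligned_tgt, threshold):
    """For each non-aligned source, take its most similar non-aligned target
    and keep the pair only if the similarity exceeds `threshold`."""
    free_tgt = [j for j in range(sim.shape[1]) if j not in aligned_tgt]
    pairs = []
    if not free_tgt:
        return pairs
    for i in range(sim.shape[0]):
        if i in aligned_src:
            continue
        j = max(free_tgt, key=lambda c: sim[i, c])
        if sim[i, j] > threshold:
            pairs.append((i, j))
    return pairs

# Toy similarity matrix (hypothetical); source 0 and target 0 form a seed pair.
sim = np.array([[0.90, 0.20, 0.10],
                [0.30, 0.95, 0.40],
                [0.20, 0.50, 0.60]])
new_pairs = confident_pairs(sim, aligned_src={0}, aligned_tgt={0}, threshold=0.7)
```

The returned pairs would be added to the seed set before the next training round. Source 2 is left unlabeled in this toy case because its best candidate does not pass the threshold.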

Several methods utilize multi-type literal information to improve alignment by providing a more comprehensive view. Commonly used types of information are the attributes associated with entities. Some methods, such as JAPE, GCN-Align, and HMAN, only consider the statistical characteristics of the attribute names. Other methods, such as AttrE and M-Greedy, generate attribute embeddings by encoding the characters of attribute values. AttrE uses attribute embeddings to unify entity embeddings into the same space, while M-Greedy uses them to complement the entity embeddings.

There is a growing tendency toward the use of "entity names".<sup>2</sup> Several methods use "entity names" as input features to learn entity embeddings or exploit the semantic and string-level aspects of entity names as individual features. Specifically, GM-Align, RDGCN, and HGCN utilize entity names as input features to learn entity embeddings. On the other hand, CEA leverages both semantic and string-level aspects of entity names as individual features for alignment. Furthermore, KDCoE and the description-enhanced version of HMAN encode entity descriptions into vector representations and treat them as new features for alignment.

The availability of multi-type information is not always guaranteed in knowledge graph alignment. Some types of information like entity names are commonly available in most scenarios, while others like entity descriptions are often missing in many knowledge graphs. Additionally, due to the graph-based nature of knowledge graph alignment, most existing alignment datasets have limited textual information, which makes some approaches like KDCoE, M-Greedy, and AttrE less applicable.

<sup>2</sup> To obtain the names of entities, for DBpedia and YAGO, current approaches directly adopt the names in the identifiers, while for Wikidata, they use the entity identifier to retrieve the name of the corresponding Wikipedia page. Notably, these names from different KGs share the same naming convention.

#### **2.3 Experiments and Analysis**

This section presents an in-depth empirical study.<sup>3</sup>

#### *2.3.1 Categorization*

According to the main components, we can broadly categorize current methods into three groups: Group I, which merely utilizes the KG structure for alignment, Group II, which harnesses the iterative training strategy to improve alignment results, and Group III, which utilizes information in addition to the KG structure. We introduce and compare these three categories using the example in Chap. 1.

**Group I** This category of methods solely relies on the structure of the knowledge graph to align entities. Consider again the example in Chap. 1. In KG<sub>EN</sub>, the entity Alfonso Cuarón is connected to the entity Mexico and three other entities, while Spain is connected to Mexico and one more entity. The same structural information can be observed in KG<sub>ES</sub>. Since we already know that Mexico in KG<sub>EN</sub> is aligned to Mexico in KG<sub>ES</sub>, by using the KG structure, it is easy to conclude that the equivalent target entity for Spain is España, and the equivalent target entity for Alfonso Cuarón is Alfonso Cuarón.

**Group II** This category of approaches is known as iterative or self-learning strategies, where likely entity alignment pairs are labeled iteratively as the training set for the next round, leading to a progressive improvement in the alignment results. They can also be categorized into Group I or III, depending on whether they merely use the KG structure or not. Nevertheless, they are all characterized by the use of the bootstrapping strategy.

We still use the example in Chap. 1 to illustrate the bootstrapping mechanism. As shown in Fig. 1.1, by utilizing the KG structure, it is straightforward to identify that the source entity Spain is aligned with the target entity España, and the source entity Alfonso Cuarón is aligned with the target entity Alfonso Cuarón. The source entity Madrid does not have a clear target entity, as both Roma(ciudad) and Madrid in the target KG have the same structural information as the source entity. This is because they are both two hops away from the seed entity and have a degree of 1. To address this problem, bootstrapping-based approaches perform multiple rounds of alignment, using the confident entity pairs from the previous round as seed pairs for the next round. More specifically, they consider the entity pairs detected from the first round, i.e., (Spain, España) and (Alfonso Cuarón, Alfonso Cuarón), as the seed pairs in the following rounds. Consequently, in the second round, for the source entity Madrid, only the target entity Madrid shares the same structural information with it: two hops away from the seed entity pair (Mexico, Mexico) and one hop away from the seed entity pair (Spain, España).

<sup>3</sup> The relevant materials are available at https://github.com/DexterZeng/EAE.

**Group III** Utilizing the KG structure for alignment when presented with graph-formatted input data sources is a natural choice; however, KGs also contain a wealth of semantic information that can be used to supplement structural data. These methods stand out by taking advantage of additional information beyond the KG structure.

As seen in Chap. 1, even with the KG structure and bootstrapping strategy, it is still difficult to identify the target entity for the source entity Gravity(film), since its structural information (connected to the entity Alfonso Cuarón and with degree 2) is shared by two target entities Gravity(película) and Roma(película). However, by combining the KG structure with the names in the identifiers, it is easy to differentiate between the two entities and correctly identify Gravity(película) as the target entity for Gravity(film).

#### *2.3.2 Experimental Settings*

The datasets and metrics utilized for assessment were previously introduced in Chap. 1. In the following section, we will elaborate on the techniques and parameter configurations used for comparison.

**Methods to Compare** We will compare the previously mentioned methods, with the exception of KDCoE and MultiKE, due to the absence of entity descriptions in the evaluation benchmarks. Additionally, we will exclude AttrE since it is only functional in the mono-lingual context. Furthermore, we will provide the outcomes of the structure-only versions of JAPE and GCN-Align, specifically JAPE-Stru and GCN-Align(SE).

As previously stated in Chap. 1, to showcase the ability of ER methods in addressing EA, we will also compare with various name-based heuristics. These approaches are commonly used in related tasks [8, 18, 19], as they heavily depend on the resemblance between object names to identify equivalences. Concretely, we use the following:


**Implementation Details** The experiments were performed using a personal computer equipped with an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU, and 128 GB of memory. All programs were implemented in Python.

To ensure reproducibility, we employ the source codes provided by the authors and utilize the parameter settings specified in their original papers to execute the models.<sup>4</sup> For datasets not included in the original papers, we use the same parameter settings as those employed in the original experiments to ensure consistency.

All of the evaluated methods provide results on the DBP15K dataset in their original papers, with the exception of MTransE and ITransE. We compare our implemented results with the reported results from the original papers. If the difference between our results and the reported results falls outside of a reasonable range, which we define as ±5% of the original results, we mark the methods with an asterisk ∗. It is worth noting that there should not be a significant difference theoretically since we use the same source codes and parameter settings for implementation. For the SRPRS dataset, only RSNs reports results in its original paper [11]. We conduct experiments on all methods for SRPRS and present the results in Table 2.3. For the DWY100K dataset, we run all approaches and compare the performance of BootEA, MuGNN, NAEA, KECG, and TransEdge with the results provided in their original papers. We mark methods with notable differences with an asterisk ∗.
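The ±5% criterion is relative to the reported value, which the following small sketch makes explicit. The function name and the numbers in the usage lines are hypothetical and are not results from this study.

```python
def outside_reported_range(ours, reported, tolerance=0.05):
    """True if `ours` deviates from `reported` by more than tolerance * reported."""
    return abs(ours - reported) > tolerance * abs(reported)

# Hypothetical Hits@1 values (in percent), for illustration only.
flag_small_gap = outside_reported_range(ours=60.0, reported=62.9)
flag_large_gap = outside_reported_range(ours=50.0, reported=65.0)
```

A method whose reproduced value triggers the check would be the kind of case marked with an asterisk.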

On each dataset, we highlight the best results within each group by denoting them in **bold**. We also mark the best Hits@1 performance among all approaches with since this metric is the most crucial and can best reflect the effectiveness of EA methods.
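For reference, the ranking metrics reported in the result tables can be computed from a distance matrix as sketched below. The sketch assumes the usual definitions of Hits@k (share of source entities whose correct target is ranked within the top k) and MRR (mean reciprocal rank); the function name and toy distances are hypothetical.

```python
import numpy as np

def hits_and_mrr(D, gold, ks=(1, 10)):
    """D[i, j] is the distance from source i to target j; gold[i] is the correct target."""
    gold = np.asarray(gold)
    gold_dist = D[np.arange(len(gold)), gold]
    # Rank of the correct target = 1 + number of targets that are strictly closer.
    ranks = 1 + (D < gold_dist[:, None]).sum(axis=1)
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr

# Toy distances: the correct targets are ranked 1st, 2nd, and 3rd, respectively.
D = np.array([[0.1, 0.5, 0.9],
              [0.4, 0.3, 0.2],
              [0.7, 0.6, 0.8]])
hits, mrr = hits_and_mrr(D, gold=[0, 1, 2], ks=(1, 2))
```

A method that outputs only aligned pairs rather than ranked lists yields a Hits@1 value but no Hits@10 or MRR.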

#### *2.3.3 Results and Analyses on* **DBP15K**

The experiment results on the cross-lingual dataset DBP15K can be found in Table 2.2. Note that the Hits@10 and MRR results of CEA are missing in this table since it directly generates aligned entity pairs instead of returning a list of ranked entities.<sup>5</sup> We then compare the performance both within each category and across categories.

**Group I** Out of the methods that only utilize the KG structure, RSNs consistently obtains superior outcomes in Hits@1 and MRR metrics. This success can be attributed to its ability to capture long-term relational paths, which offer more structural indications for alignment. The performance of MuGNN and KECG is equivalent, which can be partly attributed to their shared goal of completing KGs and reconciling structural disparities. While MuGNN utilizes AMIE+ [10] to induce rules for completion, KECG harnesses TransE to implicitly achieve this aim.

<sup>4</sup> In the interest of space, we put the detailed parameter settings in Appendix B.

<sup>5</sup> The Hits@10 and MRR results of CEA are also missing in Table 2.3 and Table 2.4 for the same reason.

The remaining three techniques achieve comparatively lower outcomes. MTransE and JAPE-Stru leverage TransE to capture the KG structure, but JAPE-Stru outperforms MTransE because the latter models KG structures in different vector spaces, resulting in information loss when translating between them [21]. On the other hand, GCN-Align(SE) attains better results than MTransE and JAPE-Stru.

**Group II** Among these methods, ITransE obtains notably poorer outcomes, which can be attributed to the information loss during embedding space translation and its simpler bootstrapping strategy as described in Sect. 2.2.4. BootEA, NAEA, and TransEdge all utilize the same bootstrapping strategy. BootEA achieves slightly inferior performance compared to reported outcomes, while NAEA performs significantly worse. In theory, NAEA should outperform BootEA as it employs an attention mechanism to capture neighbor-level information. On the other hand, TransEdge employs an edge-centric embedding model to capture structural information, resulting in more accurate entity embeddings and hence better alignment outcomes.

**Group III** Both JAPE and GCN-Align utilize attributes to enhance entity embeddings, and their outcomes surpass those of their structure-only counterparts, demonstrating the utility of attribute information. Additionally, HMAN, which takes relation types as input in addition to attributes, outperforms JAPE and GCN-Align.

The remaining four methods utilize entity names instead of attributes for alignment and achieve superior outcomes. Among them, RDGCN and HGCN attain similar results, surpassing GM-Align. This can be attributed to their use of relations to optimize entity embedding learning, which was mostly overlooked in prior GNN-based EA models. However, CEA achieves the best performance in this group by effectively utilizing and merging available features.

**Name-Based Heuristics** Regarding KG pairs with closely related languages, Lev achieves encouraging results, but it is ineffective on distantly related language pairs such as DBP15K<sub>ZH-EN</sub> and DBP15K<sub>JA-EN</sub>. On the other hand, Embed attains consistent performance on all KG pairs.
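To make the string-level heuristic concrete, the sketch below assumes that Lev denotes matching by Levenshtein (edit) distance between entity names; this reading is an assumption, since the definition of the heuristic is not reproduced in this section. The function names and toy name lists are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def match_by_name(source_names, target_names):
    """Align each source name to the target name with the smallest edit distance."""
    return {s: min(target_names, key=lambda t: levenshtein(s, t))
            for s in source_names}

# Toy names (hypothetical): identical or near-identical strings are easy to match.
alignment = match_by_name(["Spain", "Madrid"], ["España", "Madrid", "Roma"])
```

Names written in different scripts share few or no characters, which is consistent with the weak results of such a heuristic on distantly related language pairs.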

**Inter-Category Comparison** Across all datasets, CEA obtains the best Hits@1 performance, while TransEdge, RDGCN, and HGCN achieve the top results for other metrics. This confirms the effectiveness of incorporating additional information such as the bootstrapping strategy and textual information.

The performance of name-based heuristics, such as Embed, is highly competitive, surpassing most methods that do not utilize entity name information in terms of Hits@1. This indicates that conventional ER solutions can still be effective for the EA task. However, Embed still lags behind most EA methods that integrate entity name information, such as RDGCN, HGCN, and CEA.

We can also observe that methods from the first two groups, such as TransEdge, achieve consistent results across all three KG pairs. In contrast, methods that utilize entity name information, such as HGCN, achieve much better results on KG pairs with closely related languages (DBP15K<sub>FR-EN</sub>) than those with distantly related languages (DBP15K<sub>ZH-EN</sub>). This indicates that language barriers can hinder the use of textual information, which can, in turn, undermine the overall effectiveness of the method.

#### *2.3.4 Results and Analyses on* **SRPRS**

The results on SRPRS are presented in Table 2.3. Similar observations can be made as in the case of DBP15K, which we will not elaborate on. However, we can focus on the differences from DBP15K as well as the patterns specific to this dataset.

**Group I** The results show that the performance of the methods on the relatively sparse KGs in SRPRS is lower compared to DBP15K. However, RSNs outperforms the other methods, closely followed by KECG. It is important to note that while MuGNN achieves decent results on DBP15K, it performs much worse on SRPRS because there are no aligned relations on SRPRS, which results in the failure of rule transferring. Additionally, the sparser KG structure leads to a smaller number of detected rules.

**Group II** Among these solutions, TransEdge still yields consistently superior results.

**Group III** Compared with their structure-only counterparts GCN-Align(SE) and JAPE-Stru, incorporating attributes improves the results of GCN-Align but does not contribute to the performance of JAPE. This is likely because the dataset has a relatively smaller number of attributes. On the other hand, using entity names significantly improves the results. It is worth noting that CEA achieves ground-truth performance on SRPRS<sub>DBP-WD</sub> and SRPRS<sub>DBP-YG</sub>.

**Name-Based Heuristics** For mono-lingual EA datasets built from KGs like DBpedia, Wikidata, and YAGO, Lev and Embed are able to achieve ground-truth performance since the equivalent entities in different KGs have identical names based on their entity identifiers, making it easy to achieve accurate results through a simple comparison of these names. Additionally, Lev shows promising results on cross-lingual KG pairs with closely related languages.

**Inter-Category Comparison** In contrast to DBP15K, methods that incorporate entity names (Group III) perform much better on SRPRS. This is likely due to two reasons: (1) the KG structure is less effective on this dataset, which is much sparser compared to DBP15K, and (2) the entity name information plays a significant role on both mono-lingual and cross-lingual datasets with closely related language pairs, where the names of equivalent entities are very similar.




**Table 2.4** Experimental results on DWY100K and DBP-FB

#### *2.3.5 Results and Analyses on* **DWY100K**

Table 2.4 shows the results on the large-scale mono-lingual dataset DWY100K. However, we were unable to obtain the results of RDGCN and NAEA due to their requirement for an extremely large amount of memory space in our experimental environment.

The methods in the first group perform significantly better on this dataset, which can be attributed to the relatively richer KG structure (as shown in Fig. 1.2 in Chap. 1). Among them, MuGNN and KECG achieve over 60% Hits@1 on DWY100K<sub>DBP-WD</sub> and over 70% on DWY100K<sub>DBP-YG</sub>, due to the rich structure that facilitates the process of KG completion, ultimately leading to improved EA performance.

The approaches in the second group achieve further improvement in results with the aid of the iterative training strategy. However, the reported results of BootEA and TransEdge are slightly higher than the values we obtained. Among the methods in Group III, CEA achieves ground-truth performance. Similar to SRPRS, the name-based heuristics Lev and Embed also achieve ground-truth results.



**Table 2.5** Averaged time cost on each dataset (in seconds)

#### *2.3.6 Efficiency Analysis*

In order to provide a comprehensive evaluation, we report the average running time of each method on each dataset in Table 2.5, which allows us to compare the efficiency of different state-of-the-art solutions and provides insights into their *scalability*. We acknowledge that different parameter settings, such as the learning rate and number of epochs, may influence the final time cost. However, we aim to provide a general understanding of the efficiency of these methods by adopting the parameters reported in their original papers. As previously mentioned, we were unable to obtain the results of RDGCN and NAEA on DWY100K due to their requirement for an extremely large amount of memory space in our experimental environment.

On DBP15K and SRPRS, GCN-Align(SE) is the most efficient method with consistent alignment performance, followed closely by JAPE-Stru and ITransE. Most of the other methods have similar time costs (ranging from 1,000 to 10,000 seconds), except for NAEA and GM-Align, which require significantly longer running times.

The larger size of the DWY100K dataset leads to a significant increase in the time costs of all methods. MuGNN, KECG, and HMAN cannot run on GPUs due to memory limitations, and the authors of the original papers suggest running them on CPUs, which results in longer running times. Only three methods can complete the alignment process within 10,000 seconds, while most of the other approaches take between 10,000 and 100,000 seconds. In particular, GM-Align requires 5 days to generate the results, indicating that current state-of-the-art EA methods still have low efficiency when dealing with very large-scale data. Some methods, such as NAEA, RDGCN, and GM-Align, have poor scalability.

#### *2.3.7 Comparison with Unsupervised Approaches*

There exist some unsupervised methods aimed at aligning KGs that do not employ representation learning methodologies. To ensure the study's comprehensiveness, we compare with a typical system, namely, PARIS [20]. PARIS relies on the comparison of similarities between literals and employs a probabilistic algorithm to align entities jointly in an unsupervised manner. Additionally, we also evaluate PARIS alongside AgreementMakerLight (AML) [9], an unsupervised system for ontology alignment that leverages KGs' background knowledge.<sup>6</sup>

The F1 score is employed as the evaluation metric since PARIS and AML do not produce a target entity for every source entity, thereby addressing cases where certain entities do not have a corresponding match in the other KG. The F1 score is calculated as the harmonic mean between precision (i.e., the number of correctly aligned entity pairs divided by the number of source entities for which an approach returns a target entity) and recall (i.e., the number of source entities for which an approach returns a target entity divided by the total number of source entities).
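The computation can be sketched as follows, using the precision and recall definitions exactly as worded in the paragraph above. The function name and the toy predictions are hypothetical.

```python
def f1_score(predictions, gold, num_source):
    """predictions: source -> target, only for sources the system answers;
    gold: source -> correct target; num_source: total number of source entities."""
    if not predictions:
        return 0.0
    correct = sum(1 for s, t in predictions.items() if gold.get(s) == t)
    precision = correct / len(predictions)     # correct pairs / answered sources
    recall = len(predictions) / num_source     # answered sources / all sources
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy case: 4 source entities, 3 of them answered, 2 of the answers correct.
gold = {"a": "x", "b": "y", "c": "z", "d": "w"}
predictions = {"a": "x", "b": "q", "c": "z"}
score = f1_score(predictions, gold, num_source=len(gold))
```

Here precision is 2/3 and recall is 3/4, so a system that declines to answer for some source entities is penalized through the recall term.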

Figure 2.2 illustrates that the overall performance of PARIS and AML is marginally lower than that of CEA. Despite CEA exhibiting more robust performance, it depends on training data (seed entity pairs) that may not be present in actual KGs. In contrast, unsupervised systems do not necessitate any training data and can still produce highly favorable outcomes. Furthermore, the results from PARIS and AML demonstrate that ontology information does, in fact, enhance the alignment outcomes.

#### *2.3.8 Module-Level Evaluation*

To obtain a better understanding of the techniques employed in various modules, we conduct an evaluation at the module level and present the associated experimental outcomes. More specifically, we select the representative methods from each module and create feasible combinations. By comparing the performance of different combinations, we can obtain a more precise assessment of the efficacy of various methods in these modules.

<sup>6</sup> AML requires ontology information, which does not exist in current EA datasets. Therefore, we mine the ontology information for these KGs. However, we can only successfully run AML on SRPRS<sub>EN-FR</sub> and SRPRS<sub>EN-DE</sub>.

**Fig. 2.2** F1 scores of PARIS, AML, and CEA on EA datasets

Regarding the embedding learning module, we use GCN and TransE. As for the alignment module, we adopt the margin-based loss function (Mgn) and the corpus fusion strategy (Cps). Following current approaches, we combine GCN with Mgn, and TransE with Cps, where the parameters are tuned in accordance with GCN-Align and JAPE, respectively. In the prediction module, we use the Euclidean distance (Euc), the Manhattan distance (Manh), and the cosine similarity (Cos). With regard to the extra information module, we denote the use of the bootstrapping strategy as B by implementing the iterative method in [32]. The use of multi-type information is represented as Mul, and we adopt the semantic and string-level features of entity names as in CEA.

The Hits@1 results of 24 combinations are shown in Table 2.6.<sup>7</sup> It is evident that the addition of the bootstrapping strategy and/or textual information does, in fact, improve the overall performance. Regarding the embedding model, the GCN+Mgn model appears to have more robust and superior performance than TransE+Cps. Furthermore, the selection of distance measures also has an impact on the outcomes. Compared with Manh and Euc, Cos leads to better performance on TransE-based models, while it brings worse results on GCN-based models. Despite this, the integration of entity name embeddings results in consistently superior performance when using the Cos distance measure.

Significantly, GCN+Mgn+Cos+Mul+B (referred to as Comb.) attains the most exceptional performance, indicating that a basic amalgamation of techniques from existing modules can lead to highly favorable alignment outcomes.
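The count of 24 combinations is consistent with crossing two embedding/alignment pairings, three distance measures, and four extra-information settings (none, B, Mul, and Mul together with B). The enumeration below is a sketch under that assumption; the exact labels used in Table 2.6 may differ.

```python
from itertools import product

backbones = ["GCN+Mgn", "TransE+Cps"]      # embedding learning + alignment module
distances = ["Euc", "Manh", "Cos"]         # prediction module
extras = ["", "+B", "+Mul", "+Mul+B"]      # extra information module (assumed settings)

combos = [f"{b}+{d}{e}" for b, d, e in product(backbones, distances, extras)]
```

Under this enumeration, the best-performing configuration named in the text appears as `GCN+Mgn+Cos+Mul+B`.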

<sup>7</sup> The results on other datasets exhibit similar trends and hence are omitted in the interest of space.


**Table 2.6** Hits@1 results of module-level evaluation

#### *2.3.9 Summary*

We summarize the major findings from the experimental results.

**EA vs. ER** EA is distinctive from other related tasks since it operates on *graph-structured* data. As a result, all current EA solutions utilize the KG structure to create entity embeddings for aligning entities, which can produce favorable results on DBP15K and DWY100K. Nonetheless, depending solely on the KG structure has certain limitations, as there are long-tail entities with minimal structural information or entities that have similar neighboring entities but do not refer to the same real-world object. To address this issue, recent studies propose incorporating textual information, leading to better performance. However, this prompts a question regarding whether ER approaches can handle the EA task, given that the texts linked to entities are often used by conventional ER solutions.

**Fig. 2.3** The box plot of Hits@1 of all methods on different datasets

We answer this question by including for comparison the name-based heuristics that have been used in most typical ER methods. The experimental results reveal that: (1) ER solutions can indeed function on EA, but their performance is heavily reliant on the textual similarity between entities; (2) while ER solutions can surpass the majority of structure-based EA methods, *they are still surpassed by EA techniques* that use name information to supplement entity embeddings; and (3) incorporating the primary concept of ER, namely utilizing literal similarity to identify the equivalence between entities, into EA methods is a promising direction that is worth exploring (as demonstrated by CEA).

**Influence of Datasets** Figure 2.3 illustrates that the performance of EA methods varies significantly across different datasets. In general, dense datasets such as DBP15K and DWY100K tend to yield relatively better results than sparse ones. Moreover, methods perform better on mono-lingual KGs than on cross-lingual ones (DWY100K vs. DBP15K). Notably, on all mono-lingual datasets, the most performant method CEA, as well as the name-based heuristics Lev and Embed, achieve 100% accuracy. This is because these datasets are sourced from DBpedia, Wikidata, and YAGO, where equivalent entities in different KGs have identical names based on their entity identifiers, making it possible to obtain ground-truth results through a simple comparison of these names. However, these datasets do not reflect the real-life challenge of ambiguous entity names. To address this, we introduce a new mono-lingual benchmark, which will be discussed in the following section.

#### *2.3.10 Guidelines and Suggestions*

In this subsection, we provide guidelines and suggestions for potential users of EA approaches.

**Guidelines for Practitioners** There are several considerations that may impact the selection of EA models. We have identified four of the most prevalent factors and provide the following recommendations:


**Suggestions for Future Research** We also discuss some open problems that are worthy of exploration in the future:


• **EA in the open world**. Most existing EA methods [12] operate under a closed-domain assumption, meaning that every entity in the source KG has a corresponding entity in the target KG. However, in real-world scenarios, there are always entities that cannot be matched. Moreover, labeled data, which is often necessary for state-of-the-art approaches, may not be accessible. Therefore, it is important to investigate EA in open-world settings, where unmatchable entities and limited labeled data are taken into account.

#### **2.4 New Dataset and Further Experiments**

As mentioned earlier, in current mono-lingual datasets, equivalent entities in different knowledge graphs have the same names according to their entity identifiers, so a simple name comparison already produces highly accurate results (with 100% precision on SRPRS<sub>DBP-YG</sub>). In real-life KGs, however, entity identifiers are often not human-readable; instead, they are linked to one or more human-readable names. For instance, Freebase identifies the capital of France as /m/05qtj, which is linked to names like "Paris" or "The City of Light." Retrieving these names and matching entities that share the same name can still yield a precision of 100% on datasets such as DWY100K<sub>DBP-WD</sub> and SRPRS<sub>DBP-WD</sub>. In actual knowledge graphs, however, different entities with different identifiers can share the same name. For instance, the Freebase entities /m/05qtj (the capital of France) and /m/0h0\_x (the prince of Troy) share the name "Paris," as do 20 cities in the USA. Matching entities by name alone therefore does not work in real-life knowledge graphs: an entity named "Paris" in the source knowledge graph is not necessarily the same as an entity with that name in the target knowledge graph, since one might refer to the city in France while the other refers to the Trojan prince. The problem is widespread: in YAGO 3, about 34% of entities share a name with one or more other entities. Yet it is not fully reflected in the commonly used mono-lingual datasets for EA.

A second issue with EA datasets is that they assume that for each entity in the source KG, there is exactly one corresponding entity in the target KG. This means that an EA approach can map each source entity to the most similar target entity. However, this is not a realistic scenario since KGs in real life may contain entities that are not present in other KGs. For instance, when aligning YAGO 3 and DBpedia, some entities may appear in YAGO 3 but not in DBpedia and vice versa. This problem is even more severe for KGs that draw data from various sources, such as YAGO 4 and IMDB. In YAGO 4, only 1% of entities are related to movies, while the remaining 99% are unrelated to IMDB entities, such as universities and smartphone brands. As a result, these entities have no matches in IMDB, and this problem is not addressed in current EA datasets.

We thus observe that the existing datasets for EA are an oversimplification of the real-life problem. Our solution is to create a fresh dataset that mimics these challenges. We anticipate that this dataset will result in improved EA models that can handle even more demanding problem scenarios and provide a clearer research direction for the community. In this section, we describe the development of the new dataset and present our experimental findings on it.

#### *2.4.1 Dataset Construction*

To reflect the difficulty of using entity names, we choose Freebase [2] as our target knowledge graph because it represents entities using indecipherable identifiers (i.e., Freebase MIDs), and different entities may have the same name. As to the source knowledge graph, we utilize DBpedia, which contains external links to Freebase that can be regarded as gold standards. The detailed process of constructing the new dataset is explained below:

**Determining the Source Entity Set** We utilize the disambiguation information available in DBpedia to gather entities that have the same disambiguation term and create the entity set for the source knowledge graph. For example, for the ambiguous term *Apple*, the disambiguation records consist of entities such as Apple Inc. and Apple(fruit), both of which are included in the source entity set.

**Determining Links and the Target Entity Set** Next, we utilize the external links between DBpedia and Freebase to obtain the entities in Freebase that correspond to the source entities and create the entity set for the target knowledge graph. These external links are considered as the gold standards. It should be noted that the entities in the target knowledge graph are identified using Freebase MIDs and multiple entities may have the same name, such as Apple. To retrieve the name for each entity, we use the label triples.

**Retrieving Triples** Once the entity sets for the source and target knowledge graphs are determined, we extract the relational and attributive triples involving these entities from their respective knowledge graphs.

**Refining Links and Entity Sets** Following the approach in previous work [21, 22], we retain only the links whose source and target entities are involved in at least one triple in their respective knowledge graphs, resulting in a total of 25,542 links. The entity sets are adjusted accordingly, including entities that participate in triples but not in links. Ultimately, there are 29,861 entities in the source knowledge graph, of which 4,319 cannot be matched, and 25,542 matchable entities in the target knowledge graph. Consistent with existing datasets, 30% of the links and unmatchable entities are utilized as the training set. For additional statistics on the dataset, please refer to Chap. 1.

#### *2.4.2 Experimental Results on* **DBP-FB**

In accordance with the current evaluation paradigm, we first analyze the performance of EA methods without considering unmatchable entities. As shown in Table 2.4, the overall performance of the methods in the first two groups is lower than that on SRPRS. This can be attributed to the greater structural heterogeneity of DBP-FB, which can be observed from sub-figure (d) in Fig. 1.2. In contrast to the KG pairs in sub-figures (a), (b), or (c), the entity distributions in these KGs are highly dissimilar, which makes it challenging to effectively leverage the structural information.

Methods that utilize entity names continue to produce the best results, although their performance is lower than that on previous mono-lingual datasets. Furthermore, on DBP-FB, Embed and Lev achieve Hits@1 values of only 58.3% and 57.8%, respectively, whereas they attain 100% on SRPRS<sub>DBP-YG</sub>, SRPRS<sub>DBP-WD</sub>, DWY100K<sub>DBP-YG</sub>, and DWY100K<sub>DBP-WD</sub>. This confirms that DBP-FB reflects the challenge of entity name ambiguity better than existing datasets and is therefore a preferable mono-lingual dataset.

#### *2.4.3 Unmatchable Entities*

In addition, DBP-FB also contains unmatchable entities, which presents another real-life challenge for EA. We therefore evaluate the performance of Comb. (from Sect. 2.3.8) on DBP-FB, taking into account these unmatchable entities. Consistent with Sect. 2.3.7, we utilize the *precision*, *recall*, and *F1* score as evaluation metrics, with the exception that we define *recall* as the number of matchable source entities for which an approach returns a target entity, divided by the total number of matchable source entities.
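The metrics can be computed as in the following sketch. Recall follows the definition given above. Precision is assumed here to be the fraction of returned pairs that are correct, since Sect. 2.3.7 is not reproduced in this section, and F1 is the usual harmonic mean. The function name and the toy data are hypothetical.

```python
def evaluate(predictions, gold):
    """Compute (precision, recall, F1) in the presence of unmatchable entities.

    predictions: dict mapping each source entity for which the approach
                 returns a result to the predicted target entity.
    gold:        dict mapping each *matchable* source entity to its true
                 target; unmatchable source entities do not appear in it.
    """
    returned = len(predictions)
    correct = sum(1 for s, t in predictions.items() if gold.get(s) == t)
    # Recall as defined in the text: matchable sources that receive any target.
    covered = sum(1 for s in gold if s in predictions)

    precision = correct / returned if returned else 0.0  # assumed definition
    recall = covered / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: "c" is unmatchable, yet the approach still returns a target for it.
gold = {"a": "x", "b": "y"}
predictions = {"a": "x", "b": "z", "c": "w"}
precision, recall, f1 = evaluate(predictions, gold)
```

In the toy example, every matchable source entity receives a target, so recall is 1.0, while only one of the three returned pairs is correct, so precision is 1/3.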

Table 2.7 shows that Comb. exhibits a high level of recall, but its precision is relatively low. This is because it returns a target entity for each source entity, including those that cannot be matched. This pattern reflects the current performance of entity alignment solutions when dealing with unmatchable source entities. Nonetheless, this problem is not addressed in the current entity alignment datasets.

In order to address this issue, we suggest a straightforward approach to handle unmatchable entities in DBP-FB, in addition to the current entity alignment solutions. Specifically, we propose setting a NIL threshold, denoted as *θ*, to predict unmatchable entities. As discussed in Sect. 2.2.3, entity alignment solutions typically employ a distance measure to find the corresponding target entity. If the distance value between a source entity and its nearest target entity is greater than *θ*, we consider the source entity to be unmatchable and exclude it from the alignment results. The value of the threshold *θ* can be determined from the training data.
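The thresholding step can be sketched as follows. The text does not specify how *θ* is derived from the training data, so `choose_theta` below is only one plausible option: it scans a hypothetical list of candidate values and keeps the one that handles the most training entities correctly, either by matching them to the right target or by correctly rejecting them. All names and numbers are illustrative.

```python
def align_with_threshold(distances, theta):
    """Map each source entity to its nearest target unless it is predicted NIL.

    distances: dict source -> dict target -> distance.
    A source whose nearest target is farther away than theta is treated as
    unmatchable and is excluded from the result.
    """
    result = {}
    for source, row in distances.items():
        nearest = min(row, key=row.get)
        if row[nearest] <= theta:
            result[source] = nearest
    return result

def choose_theta(distances, gold, candidates):
    """Pick the candidate threshold with the most correct training decisions.

    gold: dict source -> true target, or None for unmatchable sources.
    This selection rule is an assumption of the sketch, not the book's method.
    """
    def score(theta):
        predicted = align_with_threshold(distances, theta)
        # predicted.get(s) is None when s was rejected as unmatchable.
        return sum(1 for s in gold if predicted.get(s) == gold[s])
    return max(candidates, key=score)

# Toy training data: s3 has no counterpart in the target KG.
train_distances = {
    "s1": {"t1": 0.2, "t2": 0.9},
    "s2": {"t1": 0.8, "t2": 0.3},
    "s3": {"t1": 0.7, "t2": 0.75},
}
train_gold = {"s1": "t1", "s2": "t2", "s3": None}
theta = choose_theta(train_distances, train_gold, [0.1, 0.5, 1.0])
alignment = align_with_threshold(train_distances, theta)
```

With these toy values the middle candidate is selected, since it keeps both true matches and rejects `s3`. A smaller threshold rejects true matches, and a larger one accepts the unmatchable entity.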


As shown in Table 2.7, the threshold-enhanced solution Comb. +TH achieves a better F1 score. We hope this preliminary study can inspire follow-up research on this issue.

#### **2.5 Conclusion**

Entity alignment plays a crucial role in integrating KGs to enhance knowledge coverage and quality. Despite the numerous proposed solutions, there has been limited comprehensive evaluation and detailed analysis of their performance. To address this gap, this chapter presents an empirical assessment of state-of-the-art approaches in terms of effectiveness and efficiency on representative datasets. We also conduct a thorough analysis of their performance and provide evidence-based discussions. Furthermore, we introduce a new dataset that more accurately reflects real-world challenges, which can serve as a benchmark for future research in this field.

#### **Appendix A**

#### *Methods in Group I of Table 2.1*

**MTransE** The MTransE model [6] is a translation-based approach for learning multilingual KG embeddings to support EA. Initially, it utilizes TransE (without negative triples) to project each KG into separate embedding spaces. Next, MTransE applies three distinct transition strategies: distance-based axis calibration, translation vectors, and linear transformation, to map the embedding vectors to their cross-lingual counterparts. During the prediction stage, a KNN search is conducted on the cross-lingual transition point of a target entity to obtain its corresponding counterpart.

**RSNs** In this study [11], RNNs are combined with residual learning to effectively capture the long-term relational dependencies among KGs and generate more comprehensive KG structural embeddings for EA.

The paper [11] argues that triple-level learning is inadequate for capturing the long-term relational dependencies of entities and for propagating semantic information among entities. To address this limitation, recurrent skipping networks (RSNs) are proposed to learn the long-term relational paths between entities. To obtain the desired paths, biased random walks are used to efficiently sample paths from the KGs, with elements in two KGs connected by seed alignments. During the prediction phase, cosine similarity is utilized to predict the results.

**MuGNN** The paper [4] proposes a multichannel GNN for learning KG embeddings that are oriented toward entity alignment.

The MuGNN approach first conducts relation weighting to generate a weight matrix for each KG using KG self-attention and cross-KG attention schemes, which correspond to different GNN channels. Next, it applies the GNN encoder (and the corresponding weight matrix) for each channel to model the KG structure. The outputs of the different channels are combined using the pooling operation. Finally, a margin-based alignment model is utilized to embed the two KGs into a unified embedding space.

In addition, MuGNN also proposes a method to address structural differences between KGs by completing missing relations. This is accomplished by using AMIE+ to induce rules and transferring rules between KGs through aligned relations. However, it is important to note that not all datasets have aligned relations, which may cause the rule transferring approach to fail.

**KECG** The paper [15] proposes a method for jointly learning a knowledge embedding model that encodes inner-graph relationships and a cross-graph model that enhances entity embeddings with their neighbors' information.

The main concept behind KECG involves employing a cross-graph model, which is an enhanced version of a graph attention network (GAT), to convert entities into a single vector space by incorporating both intra-graph and inter-graph alignment information. The resulting embeddings are then utilized as input for TransE, which models intra-graph connections and enforces relational constraints between entities to promote consistency across different KGs. During inference, equivalent entities are identified based on the L2 distance between entities in terms of the unified embeddings.

#### *Methods in Group II of Table 2.1*

**ITransE** This study [34] extends the use of TransE to learn the structure of knowledge graphs. It develops three models, including translation-based, linear transformation-based, and parameter sharing-based models, to generate joint embeddings for various knowledge graphs. The study proceeds to iteratively align entities and update the joint knowledge embeddings, progressively considering highly confident aligned entities identified by the model. During the prediction stage, the model retrieves the closest entity from the target knowledge graph as the corresponding entity for each source entity.
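The iterative step, in which confidently aligned entities are added to the training seeds, can be illustrated generically. The sketch below is not the ITransE implementation: it assumes that confidence is judged by a distance threshold, enforces a simple one-to-one constraint, and omits the retraining of embeddings between rounds. The function name and the data are hypothetical.

```python
def expand_seeds(distances, seeds, threshold):
    """Add confident source-target pairs to the seed alignment.

    distances: dict source -> dict target -> embedding distance.
    seeds:     dict source -> target, the currently labeled pairs.
    A pair is accepted when the nearest target is closer than `threshold`
    and that target is not already used by another pair.
    """
    used_targets = set(seeds.values())
    new_pairs = {}
    for source, row in distances.items():
        if source in seeds:
            continue
        nearest = min(row, key=row.get)
        if row[nearest] < threshold and nearest not in used_targets:
            new_pairs[source] = nearest
            used_targets.add(nearest)
    return {**seeds, **new_pairs}

# One round on toy data; in practice embeddings would be retrained afterwards
# and the procedure repeated until no new pairs are found.
distances = {
    "s1": {"t1": 0.1, "t2": 0.9},
    "s2": {"t1": 0.2, "t2": 0.8},
    "s3": {"t1": 0.9, "t2": 0.15},
}
seeds = expand_seeds(distances, {"s1": "t1"}, threshold=0.3)
```

Here `s3` is added because its nearest target is close and unused, whereas `s2` is skipped because its nearest target is already taken by a seed pair.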

**BootEA** This work [22] suggests a technique called bootstrapping for EA that involves the iterative labeling of probable EA pairs as training data to teach alignment-oriented KG embeddings.

In terms of the KG structure encoder, BootEA employs TransE, but substitutes the margin-based loss function with a limit-based objective function. The approach involves learning alignment-oriented KG embeddings by swapping aligned entities between triples from different KGs. Additionally, the authors develop a bootstrapping strategy to refine alignment-oriented embeddings, which involves iteratively labeling probable alignments and adding them to the training data. BootEA further models EA as a classification problem and aims to maximize alignment likelihood across all labeled and unlabeled entities based on KG embeddings. During the prediction stage, cosine similarity is used to identify latent aligned entities.

**NAEA** This paper [35] introduces a technique called neighborhood-aware attentional representation to enhance the effectiveness of EA, which is built on the fundamental framework of BootEA.

NAEA comprises two components: a knowledge embedding (KE) component and an entity alignment (EA) component. KE employs an attention mechanism to obtain neighbor-level representations of entities by combining their neighbors with weighted attention, and subsequently utilizes TransE to model both neighbor-level and relation-level representations. In contrast, BootEA only encodes relation-level representations.

Like BootEA, NAEA also treats the alignment task as a classification problem in its EA component. However, in NAEA, alignment probability calculation also incorporates neighbor-level knowledge information. During the prediction stage, the approach employs cosine similarity to identify aligned entity pairs based on integrated representations of entities.

**TransEdge** This work [23] introduces a new edge-centric embedding model for EA, which contextualizes relation representations with respect to particular head-tail entity pairs.

The proposed method, TransEdge, defines a novel energy function to evaluate the accuracy of edge translation between entity embeddings for KG embedding learning. To model edge embeddings, two methods are employed: context compression and context projection. The limit-based loss function of TransEdge is used to optimize entity embeddings for EA, and the distance between seed entities is minimized to reconcile two KGs. During the prediction phase, the model ranks entities in another KG based on the cosine similarity of their entity embeddings in descending order for a given entity to be aligned. The intended match is expected to have the highest rank.

#### *Methods in Group III of Table 2.1*

**JAPE** This work [21] presents a joint attribute-preserving embedding model for EA, which generates embeddings that incorporate both KG relations and attributes.

The proposed JAPE approach first employs TransE to encode the structure of each KG, but adapts the loss function. In addition to the large margin between scores of positive and negative triples, JAPE aims to assign lower scores to positive triples and higher scores to negative triples. The seed EA pairs are used to construct an overlay relationship graph in the corpus, which can align separate KG embeddings into a unified one.

Additionally, JAPE observes that latent aligned entities tend to have similar attribute values, and therefore abstracts attribute values to their range types and generates attribute embeddings to capture attribute correlations. Finally, attribute similarity constraints are combined with structural embeddings to refine entity representations by clustering entities with high attribute correlations. During the search for latent aligned entities, the model uses cosine similarity between entity embeddings.

**GCN-Align** This work [26] utilizes GCN as the KG structure encoder for aligning entities.

To elaborate further, GCN-Align leverages GCN to capture the structure information of KGs, which generates neighborhood-aware embeddings of entities. Additionally, it embeds the attribute names of entities to provide a complementary view. The model uses a margin-based ranking loss function to unify embeddings from different KGs. The structural and attributive embeddings are then combined, and latent entity alignments are predicted based on the Manhattan distance between entities from the two KGs.

**AttrE** This work [24] proposes to learn attribute embeddings of entities, which shift the entity embeddings of two KGs into the same vector space.

First, AttrE creates a module for matching the predicates of two KGs, renaming them into a shared naming system to make sure the relation embeddings are compatible. Subsequently, TransE is used to learn the structural embeddings, and attributes are encoded as attribute character embeddings. A transitivity rule is used to enrich the attribute triples. Finally, the attribute character embeddings are used to project the structural embeddings of entities into the same vector space, and cosine similarity is used to make the prediction.

**KDCoE** This work [5] develops a semi-supervised cross-lingual method to align multilingual KGs with minimal supervision.

The KDCoE approach uses TransE as the structure encoder and combines it with a linear transformation-based network to bring together different knowledge graph embeddings. Additionally, it uses an attentive gated recurrent unit encoder (AGRU) to create representations of entity descriptions. The KDCoE approach trains both modules simultaneously, with both models suggesting a set of the most confident entity alignment pairs during each iteration to improve cross-lingual learning accuracy over time. Similarly to MTransE, the prediction is made through a KNN search from the cross-lingual conversion point of a target entity.

**HMAN** The HMAN [30] method utilizes GCN to merge multiple types of information in order to generate entity embeddings. Additionally, it proposes a modified model that incorporates the textual descriptions of entities, which are encoded using a pretrained multilingual BERT model.

In detail, the HMAN approach employs GCN to model the structural connections and uses feedforward neural networks to generate embeddings for attributes and relations. GCN is not used for the latter because learning attribute and relation embeddings with GCN inherently takes in the attributes and relations of neighboring entities, which could introduce noise. The approach then concatenates these representations to form a hybrid multi-aspect entity embedding. Finally, the method utilizes a margin-based ranking loss function to align the entities.

Furthermore, the HMAN method introduces two additional techniques, pointwise-BERT and pairwise-BERT, for utilizing multilingual BERT on entity descriptions to aid in the entity alignment process. To integrate entity descriptions with the hybrid multi-aspect entity embeddings, two strategies, reranking and weighted concatenation, are proposed. For prediction, the method leverages the L1 distance between entity embeddings.

**GM-Align** The work described in reference [29] approaches the entity alignment problem as a graph matching challenge that can be addressed through both entity-level and graph-level matching techniques.

The GM-Align method initially generates a topic entity graph to depict the connections between a given entity (the "topic entity") and its neighboring entities. This graph is then used to apply GCN to encode the structural information and generate matching scores. The method employs a word-based LSTM to embed the entity names as an initial feature matrix for GCN. The matching framework learns alignment information between the two knowledge graphs. During prediction, the method ranks all entities in the other knowledge graph in descending order of their matching probabilities, with the top-ranked entity considered as the result.

**RDGCN** The authors of reference [27] propose a relation-aware dual-graph convolutional network to include relation information by employing attentive interactions between a knowledge graph and its dual relation counterpart, so as to achieve an effective entity alignment process.

The RDGCN method acknowledges that GCN-based models often disregard the relation information present in knowledge graphs. To address this, the authors employ the dual-primal graph CNN (DPGCNN) method to incorporate relation information. To adapt DPGCNN to the entity alignment task, the RDGCN method proposes a weighted model and explores the head/tail representations, which are initialized with entity names, as a way to capture the relation information.

The RDGCN method permits multiple rounds of interactions between the primal entity graph and its dual relation graph, thus allowing the model to integrate more complex relation information into entity representations effectively. The method employs GCN with highway gates to incorporate neighboring structural information. The authors devise a margin-based scoring function to align embeddings from different knowledge graphs. During prediction, the method uses the Manhattan distance between entity embeddings.

**HGCN** The authors of reference [28] suggest jointly learning entity and relation representations for the entity alignment task.

The HGCN method first uses highway-GCNs that employ highway gates to control noise propagation in GCN to embed entities from various knowledge graphs. Next, the entity embeddings are utilized to approximate relation representations, which are then used to align relations across knowledge graphs. Finally, HGCN incorporates the relation representations into the entity embeddings to obtain joint entity representations and continues to use GCN to iteratively integrate neighboring structural information to improve the entity and relation representations further. Similar to RDGCN, a margin-based scoring function is used to align embeddings from different knowledge graphs, and the entity name is used as the initial feature matrix for GCN. During prediction, the method employs the Manhattan distance between entity embeddings.

**MultiKE** The MultiKE method proposes a new framework that integrates entity names, relations, and attributes to learn embeddings for alignment [33].

The MultiKE method defines three different perspectives for EA, namely, entity name, relation, and attribute, and employs specific models to learn embeddings for each perspective. The TransE model is used to encode KG structure, with logistic loss replacing the margin-based loss. Two cross-KG identity inference strategies are proposed to capture and propagate alignment information between KGs. The view-specific entity embeddings are then combined, which are used for prediction through nearest-neighbor search. It should be noted that this method is currently only applicable to mono-lingual EA.

**CEA** The authors of [31] create a unified EA framework that takes into account how different EA decisions are interconnected.

CEA uses three types of features (structural, semantic, and string signals) to capture different aspects of entities in heterogeneous knowledge graphs. The authors then model the problem of making collective EA decisions by framing it as a stable matching problem, which is solved using the deferred acceptance algorithm.
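The deferred acceptance algorithm itself can be sketched as below. The sketch assumes a square similarity matrix in which source entities propose and target entities keep the most similar proposer. How CEA fuses its three feature types into this matrix and which side proposes are not specified here, so those choices are illustrative, as is the function name.

```python
def deferred_acceptance(sim):
    """Stable matching via deferred acceptance on a square similarity matrix.

    sim[i][j] is the similarity between source i and target j.
    Sources propose in decreasing order of similarity; each target keeps
    the proposer it is most similar to and rejects the rest.
    Returns a list `match` where match[i] is the target assigned to source i.
    """
    n = len(sim)
    # Each source ranks all targets from most to least similar.
    prefs = [sorted(range(n), key=lambda j: -sim[i][j]) for i in range(n)]
    next_choice = [0] * n   # position in prefs[i] of the next target to try
    holder = {}             # target -> source it currently holds
    free = list(range(n))   # sources without a tentative match

    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i                  # target was free: accept tentatively
        elif sim[i][j] > sim[holder[j]][j]:
            free.append(holder[j])         # target prefers i: release old source
            holder[j] = i
        else:
            free.append(i)                 # rejected: i will try its next choice

    match = [None] * n
    for j, i in holder.items():
        match[i] = j
    return match

# Both sources prefer target 0, but target 0 is more similar to source 1,
# so source 0 is assigned its second choice.
match = deferred_acceptance([[0.9, 0.8], [0.95, 0.1]])
```

An independent nearest-neighbor decision would assign target 0 to both sources in this example, whereas the collective formulation yields a one-to-one result.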

#### **Appendix B**

#### *Parameter Setting*

The definitions of the parameters can be found in their original papers.


#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part II Recent Advances**

## **Chapter 3 Recent Advance of Representation Learning Stage**

**Abstract** Over the last few years, a large body of research has been devoted to learning better KG representations to facilitate entity alignment. In this chapter, we summarize recent progress in the representation learning stage of EA and provide a detailed empirical evaluation to reveal the strengths and weaknesses of current solutions.

## **3.1 Overview**

To better understand current advanced representation learning methods, we propose a general framework to describe these methods, which includes six modules, i.e., pre-processing, messaging, attention, aggregation, post-processing, and loss function. In pre-processing, the initial entity and relation representations are generated. Then, KG representations are obtained via a representation learning network, which usually consists of three steps, i.e., messaging, attention, and aggregation. Among them, messaging aims to extract the features of the neighboring elements, attention aims to estimate the weight of each neighbor, and aggregation integrates the neighboring information with attention weights. Through the post-processing operation, the final representations are obtained. The whole model is then optimized by the loss function in the training stage.

More specifically, we summarize ten representative methods in terms of these modules in Table 3.1.




vector to calculate attention weights. Besides, some use inner product of entity representations to compute similarity.


#### **3.2 Models**

We use Eq. (3.1) to characterize the core procedure of representation learning:

$$\boldsymbol{e}_i^l = \text{Aggregation}_{\forall j \in \mathcal{N}(i)}\left(\text{Attention}(i, j) \cdot \text{Message}(i, j)\right), \tag{3.1}$$

where **Messaging** aims to extract the features of neighboring elements, **Attention** aims to estimate the weight of each neighbor, and **Aggregation** integrates the neighborhood information with attention weights.
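To make the interplay of the three steps in Eq. (3.1) concrete, the following sketch implements a single layer with deliberately simple choices that are assumptions of this example, not of any particular method: the message is the neighbor's current embedding, the attention weight is a softmax over inner products, and the aggregation is a weighted sum.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer(embeddings, neighbors):
    """One message-attention-aggregation step.

    embeddings: dict entity -> list of floats (current representation).
    neighbors:  dict entity -> list of neighboring entities N(i).
    """
    updated = {}
    for i, nbrs in neighbors.items():
        # Messaging: extract features of each neighbor (identity here).
        messages = [embeddings[j] for j in nbrs]
        # Attention: inner product with the central entity, then softmax.
        scores = [sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
                  for j in nbrs]
        weights = softmax(scores)
        # Aggregation: attention-weighted sum of the messages.
        dim = len(embeddings[i])
        updated[i] = [sum(w * m[d] for w, m in zip(weights, messages))
                      for d in range(dim)]
    return updated

embeddings = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
updated = layer(embeddings, neighbors)
```

Entity 0 attends more strongly to neighbor 1, whose embedding is identical to its own, than to neighbor 2, so its updated representation stays closer to the first axis.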

Next, we briefly introduce recent advances in representation learning for EA in terms of the modules mentioned in Table 3.1.

#### *3.2.1 ALiNet*

ALiNet aims to aggregate multi-hop structural information for learning entity representations [12].

**Aggregation** This work devises a multi-hop aggregation strategy. For 2-hop aggregation, **Aggregation** is defined as:

$$\boldsymbol{h}_{i,2}^{l} = \sigma\left(\sum_{j \in \mathcal{N}_2(i) \cup \{i\}} \text{Attention}(i, j) \cdot \text{Message}(i, j)\right), \tag{3.2}$$

where $\mathcal{N}_2(i)$ denotes the 2-hop neighbors of entity $i$.

Then, it aggregates the multi-hop aggregation results to generate the entity representation. Aggregating 1-hop and 2-hop information is denoted as:

$$\boldsymbol{h}_{i}^{l} = g\left(\boldsymbol{h}_{i,2}^{l}\right) \cdot \boldsymbol{h}_{i,1}^{l} + \left(1 - g\left(\boldsymbol{h}_{i,2}^{l}\right)\right) \cdot \boldsymbol{h}_{i,2}^{l}, \tag{3.3}$$

where $g(\boldsymbol{h}_{i,2}^{l}) = \sigma(\boldsymbol{M}\boldsymbol{h}_{i,2}^{l} + \boldsymbol{b})$ is the gate that controls the influence of the different hops, and $\boldsymbol{M}$ and $\boldsymbol{b}$ are learnable parameters.
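The gated combination of Eq. (3.3) can be sketched as follows, under the assumptions that *σ* is the logistic sigmoid and that the gate is applied element-wise. The parameter values are hand-set for illustration, whereas in a trained model they would be learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(h2, M, b):
    """g(h2) = sigmoid(M h2 + b), computed row by row."""
    return [sigmoid(sum(m * x for m, x in zip(row, h2)) + bias)
            for row, bias in zip(M, b)]

def combine_hops(h1, h2, M, b):
    """Blend 1-hop and 2-hop aggregation results with the gate (Eq. 3.3)."""
    g = gate(h2, M, b)
    return [g_d * a + (1.0 - g_d) * c for g_d, a, c in zip(g, h1, h2)]

# With M = 0 and b = 0 the gate is 0.5 everywhere, so the result is the
# plain average of the two hop representations.
h1 = [1.0, 0.0]
h2 = [0.0, 1.0]
M = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
combined = combine_hops(h1, h2, M, b)
```

Large positive gate pre-activations push the output toward the 1-hop result, and large negative ones toward the 2-hop result.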

**Attention** Regarding the attention weight, it assumes that not all distant entities contribute positively to the characterization of the target entity representation, and the softmax function is used to produce the attention weights:

$$\text{Attention}(i, j) = \alpha_{ij}^{l} = \text{softmax}\left(c_{ij}^{l}\right) = \frac{\exp\left(c_{ij}^{l}\right)}{\sum_{n \in \mathcal{N}_{2}(i) \cup \{i\}} \exp\left(c_{in}^{l}\right)}, \tag{3.4}$$

where $c_{ij}^{l} = \text{LeakyReLU}\left((\boldsymbol{M}_1^{l}\boldsymbol{h}_i^{l})^{T} \boldsymbol{M}_2^{l}\boldsymbol{h}_j^{l}\right)$, and $\boldsymbol{M}_1^{l}$ and $\boldsymbol{M}_2^{l}$ are two learnable matrices.
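A small sketch of the attention weights in Eq. (3.4) follows. The matrices are set to the identity for illustration, and the negative slope of LeakyReLU (0.2 here) is an assumed value, since the text does not state it.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption of this sketch.
    return x if x > 0 else slope * x

def attention_weights(i, candidates, h, M1, M2):
    """Softmax-normalized weights of entity i over `candidates`.

    candidates plays the role of N_2(i) together with i itself;
    h maps each entity to its current representation.
    """
    query = matvec(M1, h[i])
    scores = [leaky_relu(sum(a * b for a, b in zip(query, matvec(M2, h[j]))))
              for j in candidates]
    top = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(candidates, exps)}

identity = [[1.0, 0.0], [0.0, 1.0]]
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [-1.0, 0.0]}
weights = attention_weights(0, [0, 1, 2], h, identity, identity)
```

Entity 2 points away from entity 0 and therefore receives a smaller weight than entities 0 and 1, which illustrates how distant neighbors that do not help characterize the target entity are down-weighted.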

**Messaging** The features of neighboring entities are extracted with a simple linear transformation: $\text{Message}(i, j) = \boldsymbol{W}_q^{l}\boldsymbol{h}_j^{l-1}$, where $\boldsymbol{W}_q^{l}$ denotes the weight matrix for the $q$-hop aggregation.

**Post-processing** The representations of all layers are concatenated to produce the final entity representation:

$$\mathbf{h}_i = \oplus_{l=1}^{L} \text{norm}\left(\mathbf{h}_i^l\right). \tag{3.5}$$

**Loss Function** The loss function is formulated as:

$$\mathcal{L} = \sum_{(i,j)\in\mathcal{A}^+} \|\mathbf{h}_i - \mathbf{h}_j\| + \sum_{(i',j')\in\mathcal{A}^-} \alpha_1 \left[\gamma - \|\mathbf{h}_{i'} - \mathbf{h}_{j'}\|\right]_+,\tag{3.6}$$

where $\mathcal{A}^+$ is the set of pre-aligned entity pairs, $\mathcal{A}^-$ is the set of negative samples obtained through random sampling, $\|\cdot\|$ denotes the L2 norm, and $[\cdot]_+ = \max(0, \cdot)$.

#### *3.2.2 MRAEA*

It proposes to utilize the relation information to facilitate the entity representation learning process [8].

**Pre-processing** Specifically, it first creates an inverse relation for each relation, resulting in the extended relation set R. Then, it generates the initial features for entities by averaging and concatenating the embeddings of neighboring entities and relations:

$$\mathbf{h}_{e_i}^{in} = \left[ \frac{1}{|\mathcal{N}_i^e| + 1} \sum_{e_j \in \mathcal{N}_i^e \cup \{e_i\}} \mathbf{h}_{e_j} \,\Big\|\, \frac{1}{|\mathcal{N}_i^r|} \sum_{r_k \in \mathcal{N}_i^r} \mathbf{h}_{r_k} \right],\tag{3.7}$$

where $\mathcal{N}_i^e$ and $\mathcal{N}_i^r$ denote the sets of neighboring entities and relations of $e_i$, and the embeddings of entities and relations are randomly initialized.

**Aggregation** The aggregation is a simple combination of the extracted features and the weights:

$$\mathbf{h}_{e_i}^{out} = \sigma \left( \sum_{e_j \in \mathcal{N}_i^e} \text{Attention}(i, j) \cdot \text{Message}(i, j) \right), \tag{3.8}$$

where *σ* is implemented as *ReLU*.

**Attention** It augments the common self-attention mechanism to include relation features:

$$\mathbf{Attention}(i,j) = \text{softmax}\left( \text{LeakyReLU}\left(\mathbf{v}^T \left[\mathbf{h}_{e_i}^{in} \,\Big\|\, \mathbf{h}_{e_j}^{in} \,\Big\|\, \frac{1}{|\mathcal{M}_{i,j}|} \sum_{r_k \in \mathcal{M}_{i,j}} \mathbf{h}_{r_k} \right] \right) \right), \tag{3.9}$$

where $\mathcal{M}_{i,j}$ represents the set of relations that connect $e_i$ to $e_j$. Notably, it also adopts the multi-head attention mechanism to obtain the representation.

**Messaging** The features of neighboring entities are the corresponding features from the pre-processing stage.

**Post-processing** Finally, the outputs from different layers are concatenated to produce the final entity representations:

$$\hat{\mathbf{h}}_{e_i}^{out} = \left[ \mathbf{h}_{e_i}^{out(0)} \| \dots \| \mathbf{h}_{e_i}^{out(l)} \right]. \tag{3.10}$$

**Loss Function** The loss function is formulated as:

$$\mathcal{L} = \sum_{(e_i, e_j) \in \mathcal{P}} \text{ReLU}\left(dis(e_i, e_j) - dis(e'_i, e_j) + \lambda\right) + \text{ReLU}\left(dis(e_i, e_j) - dis(e_i, e'_j) + \lambda\right), \tag{3.11}$$

where $dis(\cdot, \cdot)$ is the Manhattan distance between two entity representations, and $e'_i$ and $e'_j$ represent the negative instances.
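The two-sided margin loss of Eq. (3.11) can be sketched in NumPy as follows. This is an illustration under simplifying assumptions: one negative source and one negative target entity per positive pair, and hypothetical function names `manhattan` and `margin_loss`.

```python
import numpy as np

def manhattan(a, b):
    """Row-wise Manhattan (L1) distance between two batches of embeddings."""
    return np.abs(a - b).sum(axis=-1)

def margin_loss(src, tgt, neg_src, neg_tgt, margin):
    """Eq. (3.11) with one corrupted source and one corrupted target per pair."""
    pos = manhattan(src, tgt)
    left = np.maximum(pos - manhattan(neg_src, tgt) + margin, 0.0)   # e_i replaced by e'_i
    right = np.maximum(pos - manhattan(src, neg_tgt) + margin, 0.0)  # e_j replaced by e'_j
    return float((left + right).sum())

src = np.array([[0.0, 0.0]])        # toy embedding of e_i
tgt = np.array([[0.0, 1.0]])        # toy embedding of e_j, distance 1 from e_i
neg_src = np.array([[5.0, 5.0]])    # far negative: contributes no loss
neg_tgt = np.array([[0.0, 2.0]])    # near negative: violates the margin
loss = margin_loss(src, tgt, neg_src, neg_tgt, margin=3.0)
```

In this toy case only the near negative contributes, since its distance to $e_i$ (2) exceeds the positive distance (1) by less than the margin.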

#### *3.2.3 RREA*

It proposes to use relational reflection transformation to aggregate features for learning entity representations [9].

**Aggregation** The entity representations are denoted as:

$$\mathbf{h}_{e_i}^{l+1} = \text{ReLU}\left(\sum_{e_j \in \mathcal{N}_{e_i}^{e}} \sum_{r_k \in \mathcal{R}_{ij}} \mathbf{Attention}(i, j, k) \cdot \mathbf{Message}(i, j, k)\right), \tag{3.12}$$

where $\mathcal{N}_{e_i}^{e}$ represents the set of neighboring entities of $e_i$ and $\mathcal{R}_{ij}$ represents the set of relations between $e_i$ and $e_j$.

**Attention** **Attention**$(i, j, k)$ denotes the weight coefficient computed by:

$$\text{Attention}(i, j, k) = \frac{\exp\left(\beta_{ijk}^{l}\right)}{\sum_{e_{j} \in \mathcal{N}_{e_{i}}^{e}} \sum_{r_k \in \mathcal{R}_{ij}} \exp\left(\beta_{ijk}^{l}\right)},\tag{3.13}$$

where $\beta_{ijk}^{l} = \mathbf{v}^{T}\left[\mathbf{h}_{e_i}^{l} \| \mathbf{M}_{r_k}\mathbf{h}_{e_j}^{l} \| \mathbf{h}_{r_k}\right]$, $\mathbf{v}$ is a trainable vector, and $\mathbf{M}_{r_k}$ is the relational reflection matrix of $r_k$. In the interest of space, we omit the details of the relational reflection matrix, which can be found in the original paper.

**Messaging** The features of neighboring entities are transformed by the relational reflection matrix:

$$\mathbf{Messaging}(i, j, k) = \mathbf{M}_{r_k} \mathbf{h}_{e_j}^{l},\tag{3.14}$$

where $\mathbf{M}_{r_k}$ is the relational reflection matrix of $r_k$.
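To make the messaging step more concrete, the sketch below builds a reflection matrix and applies it to a neighbor embedding. The text above does not define the matrix, so the Householder form $\mathbf{M}_{r} = \mathbf{I} - 2\mathbf{h}_{r}\mathbf{h}_{r}^{T}$ with a unit-norm relation embedding is an assumption here and should be checked against the original paper. The function name `reflection_matrix` is hypothetical.

```python
import numpy as np

def reflection_matrix(h_r):
    """Assumed Householder form: M_r = I - 2 h_r h_r^T with ||h_r|| = 1."""
    h_r = h_r / np.linalg.norm(h_r)              # normalize the relation embedding
    return np.eye(h_r.shape[0]) - 2.0 * np.outer(h_r, h_r)

rng = np.random.default_rng(0)
h_r = rng.normal(size=6)                         # toy relation embedding
h_ej = rng.normal(size=6)                        # toy neighbor entity embedding
M_r = reflection_matrix(h_r)
message = M_r @ h_ej                             # Messaging(i, j, k) as in Eq. (3.14)
```

Under this assumption the matrix is orthogonal, so the transformation preserves the norm of the neighbor embedding and the pairwise distances between embeddings.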

**Post-processing** Then, the outputs from different layers are concatenated to produce the output vector:

$$\mathbf{h}_{e_i}^{out} = \left[ \mathbf{h}_{e_i}^0 \| \dots \| \mathbf{h}_{e_i}^l \right]. \tag{3.15}$$

Finally, it concatenates the entity representation with the average of its neighboring relation embeddings to obtain the final entity representation:

$$\mathbf{h}_{e_i}^{Mul} = \left[ \mathbf{h}_{e_i}^{out} \,\Big\|\, \frac{1}{|\mathcal{N}_{e_i}^{r}|} \sum_{r_j \in \mathcal{N}_{e_i}^{r}} \mathbf{h}_{r_j} \right], \tag{3.16}$$

where $\mathcal{N}_{e_i}^{r}$ denotes the set of relations around $e_i$.

**Loss Function** The loss function is formulated as:

$$\mathcal{L} = \sum\_{(e\_i, e\_j) \in \mathcal{P}} \max(\text{dis}(e\_i, e\_j) - \text{dis}\left(e'\_i, e'\_j\right) + \lambda, 0), \tag{3.17}$$

where $dis(\cdot, \cdot)$ is the Manhattan distance between two entity representations, and $e'_i$ and $e'_j$ represent the negative instances generated by nearest neighbor sampling.

#### *3.2.4 RPR-RHGT*

This work introduces a meta path-based similarity framework for EA [2]. It considers the paths that frequently appear in the neighborhoods of pre-aligned entities to be reliable. We omit the generation of these reliable paths in the interest of space, which can be found in Sect. 3.3 of the original paper.

**Pre-processing** Specifically, it first generates relation embeddings by aggregating the representations of neighboring entities:

$$R^{l}(r) = \sigma \left[ \frac{1}{|\mathcal{H}_{r}|} \sum_{e_{i} \in \mathcal{H}_{r}} \mathbf{b}_{h} \mathbf{e}_{i}^{l-1} \,\Big\|\, \frac{1}{|\mathcal{T}_{r}|} \sum_{e_{j} \in \mathcal{T}_{r}} \mathbf{b}_{t} \mathbf{e}_{j}^{l-1} \right],\tag{3.18}$$

where $\mathcal{H}_r$ and $\mathcal{T}_r$ denote the sets of head entities and tail entities that are connected with relation $r$.

**Aggregation** The entity representation is obtained by averaging the messages from neighborhood entities with the attention weights:

$$
\tilde{\mathbf{e}}_h^l = \oplus_{\forall (r,t) \in RN(h)} HAttention(h, r, t) \cdot HMessage(h, r, t), \tag{3.19}
$$

where ⊕ denotes the overlay operation.

**Attention** The multi-head attention is computed as:

$$\begin{aligned} HAttention(h, r, t) &= \|_{i \in [1, h_n]} \, \text{softmax}_{\forall (r,t) \in RN(h)} \left(HATT_{head}^{i}(h, r, t)\right), \\ HATT_{head}^{i}(h, r, t) &= \mathbf{a}^{T} \left(\left[K^{i}(h) \| Q^{i}(t)\right] R^{l}(r)\right) / \sqrt{d/h_n}, \end{aligned} \tag{3.20}$$

where $K^{i}(h) = K\_Linear^{i}(\mathbf{e}_h^{l-1})$, $Q^{i}(t) = Q\_Linear^{i}(\mathbf{e}_t^{l-1})$, $RN(h)$ represents the neighborhood entities of $h$, $\mathbf{a}$ denotes the learnable attention vector, $h_n$ is the number of attention heads, and $d/h_n$ is the dimension per head.

**Messaging** The multi-head message passing is computed as:

$$\begin{aligned} HMessage(h, r, t) &= \|_{i \in [1, h_n]} \left(HMSG_{head}^{i}(h, r, t)\right), \\ HMSG_{head}^{i}(h, r, t) &= \left[V\_Linear^{i}(\mathbf{e}_t^{l-1}) \| R^{l}(r)\right], \end{aligned} \tag{3.21}$$

where $V\_Linear^{i}$ is a linear projection of the tail entity, which is then concatenated with the relation representation.

**Post-processing** This work also combines the structural representations with name features using the residual connection:

$$\mathbf{e}_h^l = \omega_\beta \, A\_Linear \left(\tilde{\mathbf{e}}_h^l\right) + (1 - \omega_\beta) \, N\_Linear \left(\mathbf{e}_h^{l-1}\right),\tag{3.22}$$

where $A\_Linear$ and $N\_Linear$ are linear projections. Correspondingly, based on the relation structure $\mathcal{T}_{rel}$ and the path structure $\mathcal{T}_{path}$, it generates the relation-based embeddings $E_{rel}$ and the path-based embeddings $E_{path}$.

**Loss Function** Finally, the margin-based ranking loss function is used to formulate the overall loss function:

$$\begin{split} \mathcal{L} &= \sum\_{(p,q)\in\mathcal{L}, (p',q')\in\mathcal{L}'\_{rel}} [d\_{rel}(p,q) - d\_{rel}(p',q') + \lambda\_1]\_+ \\ &+ \theta \left( \sum\_{(p,q)\in\mathcal{L}, (p',q')\in\mathcal{L}'\_{path}} [d\_{path}(p,q) - d\_{path}(p',q') + \lambda\_2]\_+ \right), \end{split} \tag{3.23}$$

where the distance is measured by the Manhattan distance and *θ* is the hyperparameter that controls the weights of relation loss and path loss.

#### *3.2.5 RAGA*

It proposes to adopt the self-attention mechanism to spread entity information to the relations and then aggregate relation information back to entities, which can further enhance the quality of entity representations [17].

**Pre-processing** In the pre-processing module, the pre-trained vectors are used as input and then forwarded to a two-layer GCN with highway network to encode structure information. We leave out the implementation details in the interest of space, which can be found in Sect. 4.2 in the original paper.

**Aggregation** RAGA contains three main GNNs. Denote the initial representation of entity $i$, which is generated in the pre-processing module, as $\mathbf{h}_i$. The first GNN obtains relation representations by aggregating all connected head entities and tail entities. For relation $r_k$, the aggregation of its connected head entities is computed as follows:

$$\mathbf{r}_k^h = \sigma \left( \sum_{e_i \in \mathcal{H}_{r_k}} \sum_{e_j \in \mathcal{T}_{e_i r_k}} \mathbf{Attention}_1(i, j, k) \cdot \mathbf{Message}_1(i) \right), \tag{3.24}$$

where $\sigma$ is the ReLU activation function, $\mathcal{H}_{r_k}$ is the set of head entities for relation $r_k$, and $\mathcal{T}_{e_i r_k}$ is the set of tail entities for head entity $e_i$ and relation $r_k$. The aggregation of all tail entities, $\mathbf{r}_k^t$, is computed through a similar process, and the relation representation is obtained as $\mathbf{r}_k = \mathbf{r}_k^h + \mathbf{r}_k^t$.

Then, the second GNN generates relation-aware entity representations by aggregating relation information back to entities. For entity $i$, the aggregation of all its outward relation embeddings is computed as follows:

$$\mathbf{h}_{i}^{h} = \sigma \left( \sum_{e_{j} \in \mathcal{T}_{e_{i}}} \sum_{r_{k} \in \mathcal{R}_{e_{i}e_{j}}} \mathbf{Attention}_{2}(i,k) \cdot \mathbf{r}_{k} \right), \tag{3.25}$$

where $\mathcal{T}_{e_i}$ is the set of tail entities for head entity $e_i$ and $\mathcal{R}_{e_i e_j}$ is the set of relations between head entity $e_i$ and tail entity $e_j$. The aggregation of inward relation embeddings, $\mathbf{h}_i^t$, is computed through a similar process. Then the relation-aware entity representation $\mathbf{h}_i^{rel}$ is obtained by concatenation: $\mathbf{h}_i^{rel} = \left[\mathbf{h}_i \| \mathbf{h}_i^h \| \mathbf{h}_i^t\right]$.

Finally, the third GNN takes the relation-aware entity representations as input and aggregates them to produce the final entity representations:

$$\mathbf{h}_{i}^{out} = \sigma \left( \sum_{j \in \mathcal{N}_i} \mathbf{Attention}_3(i, j) \cdot \mathbf{h}_j^{rel} \right). \tag{3.26}$$

**Attention** Corresponding to the three GNNs, there are three attention computations in RAGA. In the first GNN, to compute the attention weights, the representations of the head entity and the tail entity are linearly transformed, respectively, and then concatenated:

$$\text{Attention}_{1}(i,j,k) = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}_{1}^{T}\left[\mathbf{W}^{h}\mathbf{h}_{i}\|\mathbf{W}^{t}\mathbf{h}_{j}\right]\right)\right)}{\sum_{e_{i'} \in \mathcal{H}_{r_k}} \sum_{e_{j'} \in \mathcal{T}_{e_{i'}r_{k}}} \exp\left(\text{LeakyReLU}\left(\mathbf{a}_{1}^{T}\left[\mathbf{W}^{h}\mathbf{h}_{i'}\|\mathbf{W}^{t}\mathbf{h}_{j'}\right]\right)\right)},\tag{3.27}$$

where $\mathbf{a}_1$ is the learnable attention vector.

In the second GNN, representations of entity and its neighboring relations are directly concatenated:

$$\text{Attention}_{2}(i,k) = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}_{2}^{T}\left[\mathbf{h}_{i}\|\mathbf{r}_{k}\right]\right)\right)}{\sum_{e_{j}\in\mathcal{T}_{e_{i}}}\sum_{r_{k'}\in\mathcal{R}_{e_{i}e_{j}}}\exp\left(\text{LeakyReLU}\left(\mathbf{a}_{2}^{T}\left[\mathbf{h}_{i}\|\mathbf{r}_{k'}\right]\right)\right)},\tag{3.28}$$

where $\mathbf{a}_2$ is the learnable attention vector.

The attention in the third GNN, $\text{Attention}_3$, is computed similarly to Eq. (3.28), except that it concatenates the representations of an entity and its neighboring entity instead of a relation.

**Messaging** Only the first GNN utilizes linear transformation as the messaging approach:

$$\mathbf{Messaging}_{1}(i) = \mathbf{W}\mathbf{h}_{i},\tag{3.29}$$

where $\mathbf{W}$ refers to $\mathbf{W}^h$ or $\mathbf{W}^t$, depending on whether head or tail entities are aggregated.

**Post-processing** The final enhanced entity representation is the concatenation of outputs of the second and the third GNNs:

$$
\mathbf{h}_{i}^{final} = \left[ \mathbf{h}_{i}^{rel} \| \mathbf{h}_{i}^{out} \right]. \tag{3.30}
$$

**Loss Function** The loss function is formulated as:

$$\mathcal{L} = \sum_{(e_i, e_j) \in T} \sum_{(e'_i, e'_j) \in T'_{e_i, e_j}} \max\left(dis(e_i, e_j) - dis(e'_i, e'_j) + \lambda, 0\right), \tag{3.31}$$

where $T'_{e_i,e_j}$ is the set of negative samples for $e_i$ and $e_j$, $\lambda$ is the margin, and $dis(\cdot, \cdot)$ is defined as the Manhattan distance.

#### *3.2.6 Dual-AMN*

Dual-AMN proposes to utilize both intra-graph and cross-graph information for learning entity representations [7]. It constructs a set of virtual nodes, i.e., proxy vectors, through which the messaging and aggregation between graphs are conducted.

**Aggregation** Dual-AMN uses two GNNs to learn intra-graph and cross-graph information, respectively. Firstly, it utilizes the relation projection operation of RREA to obtain intra-graph embeddings:

$$\mathbf{h}_{e_i}^l = \sigma \left( \sum_{e_j \in \mathcal{N}_{e_i}} \sum_{r_k \in \mathcal{R}_{ij}} \mathbf{Attention}_{1}(i, j, k) \cdot \mathbf{Message}_{1}(j, k) \right), \tag{3.32}$$

where $\sigma$ is the tanh activation function and $\mathbf{h}_{e_i}^l$ represents the output of the $l$-th layer. Then the multi-hop embeddings are obtained by concatenation:

$$\mathbf{h}_{e_i}^{multi} = \left[ \mathbf{h}_{e_i}^{0} \| \mathbf{h}_{e_i}^{1} \| \dots \| \mathbf{h}_{e_i}^{l} \right]. \tag{3.33}$$

Secondly, it constructs a set of virtual nodes $\mathcal{S}_p = \{\mathbf{q}_1, \mathbf{q}_2, \dots, \mathbf{q}_n\}$, namely, the proxy vectors, which are randomly initialized. The cross-graph aggregation is computed as:

$$h\_{e\_i}^p = \sum\_{j \in \mathcal{S}\_p} \mathbf{Attention}\_2(i, j) \cdot \mathbf{Message}\_2(i, j) \,. \tag{3.34}$$

**Attention** For intra-graph information learning, the attention weights are calculated as:

$$\mathbf{Attention}_{1}(i,j,k) = \frac{\exp(\mathbf{v}^{T}\mathbf{h}_{r_{k}})}{\sum_{e_{j'} \in \mathcal{N}_{e_{i}}} \sum_{r_{k'} \in \mathcal{R}_{ij'}} \exp(\mathbf{v}^{T}\mathbf{h}_{r_{k'}})},\tag{3.35}$$

where $\mathbf{v}$ is a learnable attention vector and $\mathbf{h}_{r_k}$ is the representation of relation $r_k$, which is randomly initialized with the He initializer [4].

For cross-graph information learning, the attention weights are computed by the similarity between entity and proxy vectors:

$$\text{Attention}_2(i, j) = \frac{\exp(\cos(\mathbf{h}_{e_i}^{multi}, \mathbf{q}_j))}{\sum_{k \in \mathcal{S}_p} \exp(\cos(\mathbf{h}_{e_i}^{multi}, \mathbf{q}_k))}. \tag{3.36}$$

**Messaging** For the first GNN, the messaging is the same as RREA, which utilizes a relational reflection matrix to transform neighbor embeddings.

For the second GNN, the features of neighboring entities are represented as the difference between entity and proxy vectors:

$$\mathbf{Messaging}_2(i,j) = \mathbf{h}_{e_i}^{multi} - \mathbf{q}_j. \tag{3.37}$$

**Post-processing** For the final entity embeddings, the gate mechanism is used to combine intra-graph and cross-graph representations:

$$\begin{aligned} \boldsymbol{\eta}_{e_i} &= \sigma(\mathbf{M} \mathbf{h}_{e_i}^p + \mathbf{b}), \\ \mathbf{h}_{e_i}^{final} &= \boldsymbol{\eta}_{e_i} \cdot \mathbf{h}_{e_i}^p + (1 - \boldsymbol{\eta}_{e_i}) \cdot \mathbf{h}_{e_i}^{multi}, \end{aligned} \tag{3.38}$$

where *M* and *b* are the gate weight matrix and gate bias vector.
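The cross-graph branch, i.e., Eqs. (3.34), (3.36), (3.37), and (3.38), can be sketched end to end in NumPy. This is a minimal illustration rather than the Dual-AMN implementation: the function name `proxy_matching`, the toy sizes, and the element-wise reading of the gate product are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def proxy_matching(h_multi, proxies, M, b):
    """Proxy attention, difference messages, and gated fusion."""
    h_unit = h_multi / np.linalg.norm(h_multi, axis=1, keepdims=True)
    q_unit = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    att = softmax(h_unit @ q_unit.T, axis=1)       # Eq. (3.36): softmax over cosine similarities
    # Eq. (3.34) with messages h_i - q_j (Eq. 3.37); since each attention row
    # sums to 1, sum_j att_ij * (h_i - q_j) simplifies to h_i - att @ proxies.
    h_p = h_multi - att @ proxies
    eta = sigmoid(h_p @ M.T + b)                   # gate of Eq. (3.38)
    h_final = eta * h_p + (1.0 - eta) * h_multi    # element-wise fusion (assumed)
    return h_final, h_p, att

rng = np.random.default_rng(0)
h_multi = rng.normal(size=(5, 6))                  # 5 toy entities, dimension 6
proxies = rng.normal(size=(3, 6))                  # 3 randomly initialized proxy vectors
M = 0.1 * rng.normal(size=(6, 6))
b = np.zeros(6)
h_final, h_p, att = proxy_matching(h_multi, proxies, M, b)
```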

**Loss Function** Firstly, it calculates the original margin loss as follows:

$$l_o(e_i, e_j, e'_j) = \gamma + \|\mathbf{h}_{e_i}^{final} - \mathbf{h}_{e_j}^{final}\|_2^2 - \|\mathbf{h}_{e_i}^{final} - \mathbf{h}_{e'_j}^{final}\|_2^2. \tag{3.39}$$

Inspired by batch normalization [5], which reduces the internal covariate shift, it applies a normalization step that fixes the mean and variance of the sample losses, mapping $l_o(e_i, e_j, e'_j)$ to $l_n(e_i, e_j, e'_j)$, and thereby reduces the dependence on the scale of the hyper-parameter. Finally, the overall loss function is defined as follows:

$$\begin{split} \mathcal{L} = &\sum_{(e_i, e_j) \in P} \log \left[ 1 + \sum_{e'_j \in E_2} \exp(l_n(e_i, e_j, e'_j)) \right] \\ &+ \sum_{(e_i, e_j) \in P} \log \left[ 1 + \sum_{e'_j \in E_1} \exp(l_n(e_j, e_i, e'_j)) \right], \end{split} \tag{3.40}$$

where $P$ is the set of positive samples and $E_1$ and $E_2$ are the sets of entities in the two knowledge graphs, respectively.

#### *3.2.7 ERMC*

This work proposes to jointly model and align entities and relations and meanwhile retain their semantic independence [14].

**Pre-processing** For pre-processing, it feeds the names or descriptions of entities and relations into BERT [6] and adds an MLP layer to construct the initial representations, which are denoted as $\mathbf{x}^{e(0)}$ and $\mathbf{x}^{r(0)}$ for each entity and relation, respectively.

**Aggregation** Given an entity *e*, the model first aggregates the embeddings of entities that point to *e*:

$$\mathbf{h}_{\mathcal{N}_i^{e}}^{e(l+1)} = \sigma \left( \frac{1}{|\mathcal{N}_i^{e}(e)|} \sum_{e_i \in \mathcal{N}_i^{e}(e)} \mathbf{Messaging}(i) \right), \tag{3.41}$$

where $\sigma(\cdot)$ contains normalization, dropout, and activation operations. Similarly, the model aggregates the embeddings of relations that point to $e$, the embeddings of entities that $e$ points to, and the embeddings of relations that $e$ points to, producing $\mathbf{h}_{\mathcal{N}_i^{r}}^{e(l+1)}$, $\mathbf{h}_{\mathcal{N}_o^{e}}^{e(l+1)}$, and $\mathbf{h}_{\mathcal{N}_o^{r}}^{e(l+1)}$, respectively. The model also aggregates the embeddings of entities that point to a relation $r$ and of entities that $r$ points to, so as to produce the relation embeddings $\mathbf{h}_{\mathcal{N}_i^{e}}^{r(l+1)}$ and $\mathbf{h}_{\mathcal{N}_o^{e}}^{r(l+1)}$, respectively.

**Messaging** Given an entity $e$, the messaging process of the entities that point to $e$ is implemented as a simple linear transformation: **Messaging**$(i) = \mathbf{W}_{e_i}^{e(l)} \mathbf{x}^{e_i(l)}$, where $\mathbf{x}^{e_i(l)}$ is the node representation from the previous layer and $\mathbf{W}_{e_i}^{e(l)}$ is a learnable weight matrix for the inward entity features. The messaging process of the other operations is implemented similarly.

**Post-processing** The final representation of entity *e* is formulated as follows:

$$\begin{split} \mathbf{h}^{e(l+1)} &= \left[ \mathbf{h}^{e(l+1)}_{\mathcal{N}_i^e} \| \mathbf{h}^{e(l+1)}_{\mathcal{N}_i^r} \| \mathbf{h}^{e(l+1)}_{\mathcal{N}_o^e} \| \mathbf{h}^{e(l+1)}_{\mathcal{N}_o^r} \right], \\ \mathbf{x}^{e(l+1)} &= \text{MLP}\left( \left[ \mathbf{h}^{e(l+1)} \| \mathbf{x}^{e(l)} \right] \right). \end{split} \tag{3.42}$$

And the final representation of relation $r$ is formulated similarly:

$$\begin{aligned} \mathbf{h}^{r(l+1)} &= \left[ \mathbf{h}_{\mathcal{N}_i^e}^{r(l+1)} \| \mathbf{h}_{\mathcal{N}_o^e}^{r(l+1)} \right], \\ \mathbf{x}^{r(l+1)} &= \text{MLP}\left( \left[ \mathbf{h}^{r(l+1)} \| \mathbf{x}^{r(l)} \right] \right). \end{aligned} \tag{3.43}$$

The graph embedding $\mathbf{H} \in \mathbb{R}^{(|E|+|R|) \times d}$ is the concatenation of the representations of all entities and relations.

**Loss Function** Denote *H<sup>s</sup>* and *H<sup>t</sup>* as the representations of two graphs, respectively. The similarity matrix is computed as:

$$\mathbf{S} = \text{sinkhorn}(\mathbf{H}_s, \mathbf{H}_t^T), \tag{3.44}$$

where $s_{i,j} \in \mathbf{S}$ is a real number that denotes the correlation between entity $e_s^i$ (from the source graph) and $e_t^j$ (from the target graph), or the correlation between relation $r_s^i$ (from the source graph) and $r_t^j$ (from the target graph). The other elements are set to $-\infty$ to mask the correlation between entities and relations across different graphs. The final loss function is formulated as follows:

$$\mathcal{L} = -\sum_{(e_s^i, e_t^j) \in \mathcal{Q}^e} \log \left( s_{i,j} \right) - \lambda \sum_{(r_s^i, r_t^j) \in \mathcal{Q}^r} \log \left( s_{i,j} \right), \tag{3.45}$$

where $(e_s^i, e_t^j)$ and $(r_s^i, r_t^j)$ are pre-aligned entity and relation pairs and $\lambda \in [0, 1]$ is a hyper-parameter.
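The Sinkhorn step and the log-likelihood loss can be sketched as follows. The text does not spell out the operator, so this sketch assumes the standard Sinkhorn normalization (alternating row and column normalization, done in log space) applied to the score matrix $\mathbf{H}_s\mathbf{H}_t^T$. The iteration count, the toy data, and the omission of the entity-relation mask are all simplifying assumptions.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=100):
    """Alternately normalize rows and columns of exp(scores) in log space."""
    log_s = np.array(scores, dtype=float)
    for _ in range(n_iters):
        log_s = log_s - logsumexp(log_s, axis=1)   # rows sum to 1
        log_s = log_s - logsumexp(log_s, axis=0)   # columns sum to 1
    return np.exp(log_s)

rng = np.random.default_rng(0)
H_s = 0.3 * rng.normal(size=(4, 8))                # toy source embeddings
H_t = H_s + 0.01 * rng.normal(size=(4, 8))         # toy target: noisy copy, so (i, i) is aligned
S = sinkhorn(H_s @ H_t.T)

aligned = [(i, i) for i in range(4)]               # hypothetical pre-aligned pairs
loss = -sum(np.log(S[i, j]) for i, j in aligned)   # entity term of Eq. (3.45)
```

The resulting matrix is approximately doubly stochastic, so each entry can be read as a soft matching probability.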

#### *3.2.8 KE-GCN*

It combines GCNs and advanced KGE methods to learn the representations, where a novel framework is put forward to realize the messaging and aggregation modules in representation learning [15].

**Aggregation** Denoting $\mathbf{h}_v^l$ as the embedding of entity $v$ at layer $l$, the entity updating rules are:

$$\begin{aligned} \mathbf{m}_v^{l+1} &= \sum_{(u, r) \in \mathcal{N}_{\text{in}}(v)} \text{Message}(u, r, v) + \sum_{(u, r) \in \mathcal{N}_{\text{out}}(v)} \text{Message}(u, r, v), \\ \mathbf{h}_v^{l+1} &= \sigma(\mathbf{m}_v^{l+1} + \mathbf{W}_0^l \mathbf{h}_v^l), \end{aligned} \tag{3.46}$$

where $\mathcal{N}_{\text{in}}(v) = \{(u, r) \mid u \xrightarrow{r} v\}$ is the set of inward entity-relation neighbors of entity $v$, while $\mathcal{N}_{\text{out}}(v) = \{(u, r) \mid u \xleftarrow{r} v\}$ is the set of outward neighbors of $v$. $\mathbf{W}_0^l$ is a linear transformation matrix, and $\sigma(\cdot)$ denotes the activation function for the update. The embeddings of relations are updated through a similar process.

**Messaging** It considers GCN as an optimization process, where the messaging process is implemented as a partial derivative:

$$\mathbf{Messaging}(u, r, v) = \mathbf{W}_r^l \frac{\partial f(\mathbf{h}_u^l, \mathbf{h}_r^l, \mathbf{h}_v^l)}{\partial \mathbf{h}_v^l},\tag{3.47}$$

where $\mathbf{h}_r^l$ represents the embedding of relation $r$ at layer $l$ and $\mathbf{W}_r^l$ is a relation-specific linear transformation matrix. $f(\mathbf{h}_u^l, \mathbf{h}_r^l, \mathbf{h}_v^l)$ is the scoring function that measures the plausibility of the triple $(u, r, v)$. Thus, $\mathbf{m}_v^{l+1} + \mathbf{W}_0^l\mathbf{h}_v^l$ in Eq. (3.46) can be regarded as a gradient ascent step that maximizes the sum of the scoring functions. For example, if $f(\mathbf{h}_u^l, \mathbf{h}_r^l, \mathbf{h}_v^l) = (\mathbf{h}_u^l)^T \mathbf{h}_v^l$, the partial derivative with respect to $\mathbf{h}_v^l$ is $\mathbf{h}_u^l$, so Eq. (3.47) becomes equivalent to the common linear transformation $\mathbf{W}_r^l\mathbf{h}_u^l$.
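The claim that the inner-product scoring function reduces the message to a plain linear transformation can be checked numerically. The sketch below differentiates the score with central finite differences and compares the result with $\mathbf{W}_r\mathbf{h}_u$. The helper names `score` and `grad_wrt_tail` and the toy dimension are hypothetical choices for illustration.

```python
import numpy as np

def score(h_u, h_r, h_v):
    # f(h_u, h_r, h_v) = h_u^T h_v; h_r is unused by this particular scoring function
    return float(h_u @ h_v)

def grad_wrt_tail(f, h_u, h_r, h_v, eps=1e-6):
    """Central finite-difference estimate of the gradient of f with respect to h_v."""
    grad = np.zeros_like(h_v)
    for k in range(h_v.size):
        step = np.zeros_like(h_v)
        step[k] = eps
        grad[k] = (f(h_u, h_r, h_v + step) - f(h_u, h_r, h_v - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
d = 5                                                 # toy embedding dimension
h_u, h_r, h_v = rng.normal(size=(3, d))
W_r = rng.normal(size=(d, d))                         # relation-specific matrix

message = W_r @ grad_wrt_tail(score, h_u, h_r, h_v)   # Eq. (3.47)
linear = W_r @ h_u                                    # common linear transformation
```

Swapping `score` for another knowledge graph embedding scoring function yields the corresponding message without changing the rest of the code.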

**Loss Function** Denote the training set as $S = \{(u, v)\}$; this model utilizes the margin-based ranking loss for optimization:

$$\mathcal{L} = \sum\_{(u,v)\in\mathcal{S}} \sum\_{(u',v')\in S'\_{(u,v)}} \max\left( \|h\_u - h\_v\|\_1 + \gamma - \|h\_{u'} - h\_{v'}\|\_1, 0\right),\tag{3.48}$$

where $S'_{(u,v)}$ denotes the set of negative entity alignments constructed by corrupting $(u, v)$, i.e., replacing $u$ or $v$ with a randomly chosen entity in the graph. $\gamma$ represents the margin hyper-parameter separating positive and negative entity alignments.

#### *3.2.9 RePS*

It encodes position and relation information for aligning entities [13].

**Aggregation** Firstly, to encode position information, $k$ subsets of nodes (referred to as anchor sets) are randomly sampled. The $i$-th anchor set is a collection of $l_i$ nodes (anchors). Then for entity $v$, the aggregation process is formulated as:

$$\mathbf{h}_{v_p}^{l} = g \left( \frac{1}{k+1} \left( \sum_{i=1}^{k} \mathbf{Message}_{1}(v, \psi_{i}) + \mathbf{h}_{v}^{l-1} \right) \right), \tag{3.49}$$

where $\mathbf{h}_v^l$ represents the embedding of entity $v$ from layer $l$, $\psi_i$ is the $i$-th anchor set, and $g(X) = \sigma(\mathbf{W}_1 X + \mathbf{b}_1)$, where $\mathbf{W}_1$ and $\mathbf{b}_1$ are trainable parameters and $\sigma$ is the activation function.

To encode relation information, a simple relation-specific GNN is used:

$$\mathbf{h}_{v_r}^{l} = f\left((1 + c_{v}) \cdot \mathbf{h}_{v}^{l-1} + \sum_{i \in \mathcal{N}_{v}} \mathbf{Message}_{2}(i)\right),\tag{3.50}$$

where $c_v$ is the learnable coefficient for entity $v$ and $\mathcal{N}_v$ is the set of neighboring entities of $v$. $f(X) = \mathbf{W}_2 X + \mathbf{b}_2$, where $\mathbf{W}_2$ and $\mathbf{b}_2$ are learnable parameters.

**Messaging** To ensure similar entities in two graphs have similar representations, the relation-enriched distance function is defined as follows:

$$pd(u,v) = \min_{q} \left( \sum_{r \in P_{q}(u,v)} f(r, \mathcal{KG}_i) \right),\tag{3.51}$$

where $f(r, \mathcal{KG}_i)$ is the frequency of relation $r$ in $\mathcal{KG}_i$ and $P_q(u, v)$ is the list of relations in the $q$-th path between $u$ and $v$. Thus, $pd(u, v)$ is the cost of the shortest path between $u$ and $v$ when each edge is weighted by the frequency of its relation, which favors paths whose relations appear less frequently. Then the messaging function is formulated as follows:

$$\mathbf{Message}_{1}(v,\psi_{i}) = \min\left( \left\{ pd(v,\psi_{i,j}) \cdot \mathbf{h}_{\psi_{i,j}}^{l-1} \right\}_{j=1}^{l_{i}} \right),\tag{3.52}$$

where $\psi_{i,j}$ is the $j$-th entity in the $i$-th anchor set.
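The relation-enriched distance of Eq. (3.51) amounts to a shortest-path search with relation frequencies as edge weights, which can be sketched with Dijkstra's algorithm. The sketch assumes that the frequency is the raw triple count of a relation and that edges can be traversed in both directions. The function name and the toy triples are hypothetical.

```python
import heapq
from collections import Counter

def relation_enriched_distance(triples, source, target):
    """pd(source, target): minimum over paths of the summed relation frequencies."""
    freq = Counter(r for _, r, _ in triples)         # f(r, KG) as a raw count (assumed)
    adj = {}
    for h, r, t in triples:                          # undirected traversal (assumed)
        adj.setdefault(h, []).append((t, freq[r]))
        adj.setdefault(t, []).append((h, freq[r]))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue                                 # stale heap entry
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")                              # target is unreachable

triples = [
    ("a", "common", "b"), ("b", "common", "c"), ("x", "common", "y"),
    ("a", "rare1", "d"), ("d", "rare2", "e"), ("e", "rare3", "c"),
]
pd_ac = relation_enriched_distance(triples, "a", "c")
```

In this toy graph the two-hop path through `b` costs 6 because `common` occurs three times, whereas the three-hop path through rare relations costs 3, so the longer but more distinctive path determines the distance.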

For relation-aware embedding, it sums up the neighboring representations with relation-specific weights:

$$\mathbf{Messaging}_2(i) = \frac{\mathbf{h}_i^{l-1}}{1 + c_{r_{v,i}}}, \tag{3.53}$$

where $c_{r_{v,i}}$ is the learnable coefficient for the relation $r$ connecting $v$ and $i$.

**Post-processing** The final representation of *v* is computed as:

$$\mathbf{h}_{v}^{l} = g\left( \mathbf{h}_{v_p}^{l} \right) \cdot \mathbf{h}_{v_p}^{l} + \left( 1 - g\left( \mathbf{h}_{v_p}^{l} \right) \right) \cdot \mathbf{h}_{v_r}^{l}, \tag{3.54}$$

where $g(\mathbf{h}_{v_p}^l) = \sigma(\mathbf{W}_3\mathbf{h}_{v_p}^l + \mathbf{b}_3)$ learns the relative importance of the two representations, $\mathbf{W}_3$ and $\mathbf{b}_3$ are trainable parameters, and $\sigma$ is the activation function.

**Loss Function** It introduces a novel knowledge-aware negative sampling (KANS) technique to generate hard negative samples. For each tuple $(v, v')$ in $S$, the negative instances for $v$ are sampled from the set of entities that share at least one (relation, tail) pair or (relation, head) pair with $v'$. The model is trained by minimizing the following loss:

$$\mathcal{L} = \sum_{(p, p') \in S} \|\mathbf{p} - \mathbf{p'}\| + \beta \sum_{(p, q) \in S'} \left[\gamma - \|\mathbf{p} - \mathbf{q}\|\right]_+,\tag{3.55}$$

where $\beta$ is a weighting parameter and $\gamma$ is the margin.

#### *3.2.10 SDEA*

SDEA utilizes BiGRU to capture correlations among neighbors and generate entity representations [16].

**Pre-processing** It devises an attribute embedding module to capture entity associations via entity attributes. Specifically, given an entity $e_i$, it concatenates the names and descriptions of its attributes, denoted as $S(e_i)$. Then $S(e_i)$ is fed into the BERT model to generate the attribute embedding $H_a(e_i)$. The implementation details are omitted in the interest of space and can be found in Section III of the original paper.

**Aggregation** It aggregates the neighboring information utilizing attention mechanism:

$$H_r(e_i) = \sum_{t=1}^n \mathbf{Attention}(t) \cdot \mathbf{Message}(t). \tag{3.56}$$

Since SDEA treats the neighborhood as a sequence, $t$ indexes the $t$-th neighboring entity of $e_i$, and **Messaging**$(\cdot)$ is computed through a BiGRU.

**Attention** SDEA computes attention via simple inner product:

$$\text{Attention}(t) = \frac{\exp\left(\mathbf{h}_t^T \cdot \hat{\mathbf{h}}\right)}{\sum_{t'=1}^n \exp\left(\mathbf{h}_{t'}^T \cdot \hat{\mathbf{h}}\right)},\tag{3.57}$$

where $\hat{\mathbf{h}}$ is the global attention representation, which is obtained by feeding the output of the last unit of the BiGRU, denoted as $\mathbf{h}_n$, into an MLP layer.

**Messaging** Different from other models, SDEA captures the correlation between neighbors in the messaging module, and all neighbors of entity $e_i$ are regarded as an input sequence of the BiGRU model. Given entity $e_i$, let $\mathbf{x}_t$ denote the $t$-th input embedding (i.e., the attribute embedding of $e_i$'s $t$-th neighbor, as described in the pre-processing module) and $\mathbf{h}_t$ denote the output of the $t$-th hidden unit. The process of BiGRU is formulated as follows:

$$\begin{aligned} \mathbf{r}_{t} &= \sigma \left( W_{r} \mathbf{x}_{t} + U_{r} \mathbf{h}_{t-1} + \mathbf{b}_{r} \right), \\ \tilde{\mathbf{h}}_{t} &= \phi \left( W \mathbf{x}_{t} + U \left( \mathbf{r}_{t} \odot \mathbf{h}_{t-1} \right) + \mathbf{b}_{h} \right), \\ \mathbf{z}_{t} &= \sigma \left( W_{z} \mathbf{x}_{t} + U_{z} \mathbf{h}_{t-1} + \mathbf{b}_{z} \right), \\ \mathbf{h}_{t} &= \left( \mathbf{1} - \mathbf{z}_{t} \right) \odot \mathbf{h}_{t-1} + \mathbf{z}_{t} \odot \tilde{\mathbf{h}}_{t}, \end{aligned} \tag{3.58}$$

where $\mathbf{r}_t$ is the reset gate that drops the unimportant information and $\mathbf{z}_t$ is the update gate that combines the important information. $W$, $U$, and $\mathbf{b}$ (with their subscripted variants) are learnable parameters, $\tilde{\mathbf{h}}_t$ is the candidate hidden state, $\sigma$ is the sigmoid function, $\phi$ is the hyperbolic tangent, and $\odot$ is the Hadamard product.

For BiGRU, there are outputs of two directions, $\overleftarrow{\mathbf{h}_t}$ and $\overrightarrow{\mathbf{h}_t}$, and the final output of the BiGRU, namely, the output of the messaging module, is the sum of the two directions: **Messaging**$(t) = \overleftarrow{\mathbf{h}_t} + \overrightarrow{\mathbf{h}_t}$.
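The bidirectional messaging step can be sketched with a hand-written GRU cell in NumPy. The sketch assumes the standard GRU update, in which the candidate state applies the hyperbolic tangent to the sum of the input term, the reset-gated recurrent term, and the bias. The parameter initialization, the toy sizes, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, d_in, d_hid):
    """Randomly initialized GRU parameters (toy scale)."""
    p = {}
    for name in ("", "_r", "_z"):
        p["W" + name] = 0.1 * rng.normal(size=(d_hid, d_in))
        p["U" + name] = 0.1 * rng.normal(size=(d_hid, d_hid))
    for name in ("b_r", "b_z", "b_h"):
        p[name] = np.zeros(d_hid)
    return p

def gru_step(x_t, h_prev, p):
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])        # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])        # update gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b_h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def run_gru(xs, p):
    h = np.zeros(p["U"].shape[0])
    outputs = []
    for x_t in xs:
        h = gru_step(x_t, h, p)
        outputs.append(h)
    return np.stack(outputs)

def bigru_messages(xs, p_fwd, p_bwd):
    """Messaging(t): sum of forward and backward hidden states at position t."""
    fwd = run_gru(xs, p_fwd)
    bwd = run_gru(xs[::-1], p_bwd)[::-1]      # run on the reversed sequence, then re-align
    return fwd + bwd

rng = np.random.default_rng(0)
neighbors = rng.normal(size=(5, 4))           # 5 toy neighbor attribute embeddings
p_fwd = init_params(rng, d_in=4, d_hid=3)
p_bwd = init_params(rng, d_in=4, d_hid=3)
messages = bigru_messages(neighbors, p_fwd, p_bwd)
```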

**Post-processing** After obtaining the attribute embedding $H_a(e_i)$ and the relational embedding $H_r(e_i)$, they are concatenated and forwarded to another MLP layer, resulting in $H_m(e_i) = \text{MLP}([H_a(e_i) \| H_r(e_i)])$. Finally, $H_a(e_i)$, $H_r(e_i)$, and $H_m(e_i)$ are concatenated to produce $H_{ent}(e_i) = [H_r(e_i) \| H_a(e_i) \| H_m(e_i)]$, which is used in the alignment stage.

**Loss Function** The model uses the following margin-based ranking loss as the loss function to train attribute embedding module:

$$\mathcal{L} = \sum\_{e\_l, e'\_l, e''\_l \in D} \max\left\{ 0, \|\mathbf{H}\_a(e\_l) - \mathbf{H}'\_a(e'\_l)\|\_2 - \|\mathbf{H}\_a(e\_l) - \mathbf{H}'\_a(e''\_l)\|\_2 + \beta \right\},\tag{3.59}$$

where *D* is the training set; *H<sup>a</sup>* and *H <sup>a</sup>* are attribute embeddings of source graph and target graph, respectively; and *β >* 0 is the margin hyper-parameter used for separating positive and negative pairs.

The training of relation embedding module uses a margin-based ranking loss similar to Eq. (3.59), where the embedding *Ha(ei)* is replaced by [*Hr(ei)Hm(ei)*].
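The margin-based ranking loss of Eq. (3.59) can be sketched in a few lines of NumPy. The function name `margin_ranking_loss` and the batch layout (one row per training triple) are assumptions made for illustration; the default margin is arbitrary.

```python
import numpy as np

def margin_ranking_loss(anchor, positive, negative, beta=1.0):
    """Sum over triples of max(0, d(anchor, positive) - d(anchor, negative) + beta).

    anchor, positive, negative: arrays of shape (batch, dim) holding the
    embeddings of e_i, its aligned counterpart, and a negative sample.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    pos_dist = np.linalg.norm(anchor - positive, axis=1)  # L2 distance to the positive
    neg_dist = np.linalg.norm(anchor - negative, axis=1)  # L2 distance to the negative
    return float(np.sum(np.maximum(0.0, pos_dist - neg_dist + beta)))
```

The loss is zero for a triple once the negative sample is at least $\beta$ farther from the anchor than the positive one, so only violating triples contribute gradients during training.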

#### **3.3 Experiments**

In this section, we first conduct an overall comparison experiment to assess the effectiveness of state-of-the-art representation learning methods. Then we conduct further experiments on each of the six modules of representation learning, so as to examine the effectiveness of various strategies.

#### *3.3.1 Experimental Setting*

**Dataset** We use the most frequently used DBP15K dataset [11] for evaluation.

**Baselines** For overall comparison, we select seven models, including AliNet [12], MRAEA [8], RREA [9], RAGA [17], SDEA [16], Dual-AMN [7], and RPR-RHGT [2]. We collect their source code and reproduce the results in the same setting. Specifically, to make a fair comparison, we modify and unify the alignment part of these models, forcing them to utilize the L1 distance and the greedy algorithm for alignment inference. We omit the comparison with the remaining models, as they do not provide source code and our implementations cannot reproduce their results. For ablation and further experiments, we choose RAGA as the base model.

**Parameters and Metrics** Since different models have various kinds of hyper-parameters, we only unify the common parameters, such as the margin *λ* = 3 in the margin loss function and the number of negative samples *k* = 5. For other parameters, we keep the default settings in the original papers.

Following existing studies, we use Hits@k (*k* = 1, 10) and mean reciprocal rank (MRR) as the evaluation metrics. The higher the Hits@k and MRR, the better the performance. In experiments, we report the average performance of three independent runs as the final result.
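As a reference for how these metrics are typically computed, the following sketch derives the rank of the ground-truth target for each source entity from a score matrix and then computes Hits@k and MRR. The function names and the convention that higher scores are better are assumptions for illustration; with a distance such as L1, the comparison would be reversed.

```python
def gold_ranks(scores, gold):
    """Return the 1-based rank of the true target for each source entity.

    scores[i][j]: similarity between source entity i and target entity j.
    gold[i]: index of the true target of source entity i.
    """
    ranks = []
    for i, row in enumerate(scores):
        gold_score = row[gold[i]]
        # Rank = 1 + number of targets scored strictly higher than the true one.
        ranks.append(1 + sum(1 for s in row if s > gold_score))
    return ranks

def hits_at_k(ranks, k):
    """Fraction of source entities whose true target is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all source entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```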

#### *3.3.2 Overall Results and Analysis*

Firstly, we compare the overall performance of seven advanced models in Table 3.2, where the best results are highlighted in bold, and the second best results are underlined.


**Table 3.2** Comparison of representation learning models on DBP15K

From the results, it can be observed that:


#### *3.3.3 Further Experiments*

To compare various strategies in each module of representation learning, we conduct further experiments using the RAGA model.

#### **3.3.3.1 Pre-processing Module**

RAGA takes pre-trained embeddings as input, which are forwarded to a two-layer GCN with highway network to generate initial representations. To examine the effectiveness of pre-trained embeddings and structural embeddings, we remove them, respectively, and then make a comparison. Table 3.3 shows the results, where "w/o Pre-trained" represents removing pre-trained embeddings, "w/o GNN" represents removing GCN, and "w/o Both" represents removing the whole pre-processing module.

**Table 3.3** Analysis of the pre-processing module using RAGA

The results show that removing the structural features and the pre-trained embeddings significantly degrades the performance, and the model that completely removes the pre-processing module achieves the worst result. Hence, it is important to extract useful features to initialize the embeddings. Additionally, we can also observe that the semantic features in the pre-trained model are more useful than the structural vectors, which verifies the effectiveness of the prior knowledge contained in the pre-trained embeddings. Using structural embeddings for initialization is less effective, as the subsequent steps in representation learning also aim to extract the structural features to produce meaningful representations.

#### **3.3.3.2 Messaging Module**

For the messaging module, linear transformation is the most widely used approach. RAGA only utilizes linear transformation in its first GNN and does not use transformation in the other two GNNs. Thus, we design two variants: one that eliminates the linear transformation in the first GNN ("-Linear Transform"), resulting in a model without linear transformation at all, and the other one that adds linear transformation in the other two GNNs ("+Linear Transform"), resulting in a model that is fully equipped with linear transformation.

The results are presented in Table 3.4. Besides, we also report their convergence rates in Fig. 3.1.

It is evident that adding linear transformation in the rest of the GNNs improves the performance of RAGA, especially on the JA-EN and FR-EN datasets, where Hits@1 improves by 1.1% and 1.2%, respectively. Additionally, when linear transformation is removed, the performance drops significantly. Furthermore, Fig. 3.1 shows that linear transformation can also speed up the convergence of the model, possibly due to the introduction of extra parameters.


**Table 3.4** Analysis of the messaging module using RAGA

**Fig. 3.1** Comparison of convergences

#### **3.3.3.3 Attention Module**

For the attention module, there are two popular implementations, i.e., inner product and concatenation. To compare the two approaches, we replace the concatenation computation of RAGA with inner product computation (i.e., "-Inner product," by changing $v^T[e_i \| e_j]$ to $(M_1 e_i)^T (M_2 e_j)$, where $M_1$ and $M_2$ are learnable transformation matrices), and remove the attention mechanism (i.e., "w/o Attention," where we do not compute attention coefficients and simply take the average), respectively, and then report the results.

As shown in Table 3.5, the two variant models perform almost the same as the original model. Considering the influence of the initial representation generated in the pre-processing module, we remove the pre-trained vectors of the pre-processing module and then conduct the same comparison. As shown in Table 3.6, removing the attention mechanism degrades the performance, so we may draw a preliminary conclusion that the attention mechanism plays a greater role in the absence of prior knowledge. As for the two strategies of attention computation, inner product performs better than concatenation on the ZH-EN dataset but worse on the JA-EN and FR-EN datasets, which indicates that these two approaches contribute differently on different datasets.

**Table 3.5** Analysis of the attention module using RAGA

**Table 3.7** Analysis of the aggregation module using RAGA
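The three attention variants compared in this subsection (concatenation, inner product, and plain averaging) differ only in how the unnormalised neighbor scores are produced. The sketch below is a simplified illustration under several assumptions: scores are normalised with a softmax over the neighbors, no nonlinearity is applied before the softmax, and the function names are hypothetical. It is not RAGA's actual code.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # subtract the maximum for numerical stability
    e = np.exp(x)
    return e / e.sum()

def concat_scores(e_i, neighbors, v):
    """Concatenation attention: score_j = v^T [e_i || e_j]."""
    return np.array([v @ np.concatenate([e_i, e_j]) for e_j in neighbors])

def inner_product_scores(e_i, neighbors, M1, M2):
    """Inner-product attention: score_j = (M1 e_i)^T (M2 e_j)."""
    q = M1 @ e_i
    return np.array([q @ (M2 @ e_j) for e_j in neighbors])

def aggregate(neighbors, scores=None):
    """Weighted sum of neighbor embeddings; a plain average if no scores are given."""
    neighbors = np.asarray(neighbors, dtype=float)
    if scores is None:  # the "w/o Attention" variant
        weights = np.full(len(neighbors), 1.0 / len(neighbors))
    else:
        weights = softmax(np.asarray(scores, dtype=float))
    return weights @ neighbors
```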

#### **3.3.3.4 Aggregation Module**

For the aggregation module, as RAGA incorporates both 1-hop neighbors and relation information to update entity representations, we examine two variants, i.e., adding 2-hop neighboring information ("-2hop") and removing the relation representation ("w/o rel."). The results are shown in Table 3.7.

We can observe that the performance of the model decreases significantly after removing the relation representation learning. This shows that the integration of relation representations can indeed enhance the learning ability of the model. Besides, the performance of the model decreases slightly after adding the information of 2-hop neighboring entities, which might indicate that the 2-hop neighboring information introduces some noise, as not all neighboring entities are useful for aligning the target entity.


**Table 3.8** Analysis of the post-processing module using RAGA



#### **3.3.3.5 Post-processing Module**

RAGA concatenates the relation-aware entity representation and the 1-hop aggregation results to produce the final representation. We examine two variants, i.e., "-highway" that replaces concatenation with highway network [10], and "w/o post-processing" that removes the relation-aware entity representation (Table 3.8).

From the experimental results, it can be seen that removing post-processing module decreases the performance, which indicates that the relation-aware representations can indeed enhance the final representations and improve the alignment performance. After replacing the concatenation operation with highway network, the performance decreases on JA-EN dataset and increases on FR-EN dataset, which indicates that the two post-processing strategies do not have absolute advantages and disadvantages.

#### **3.3.3.6 Loss Function Module**

For the loss function, RAGA employs margin-based loss in training. We consider two other popular choices, i.e., TransE-based loss and margin-based + TransE loss. Specifically, TransE-based loss is formulated as $l_E = \frac{1}{k}\sum_{k} \|h_k + r_k - t_k\|_1$, where $(h_k, r_k, t_k)$ is randomly sampled.

From the results in Table 3.9, it can be seen that the model performance decreases after using or adding the TransE loss. This is mainly because the TransE assumption is not universal. For example, in the RAGA model used in this experiment, the representation of the relation is actually obtained by adding the head entity and the tail entity, which is in conflict with the TransE assumption.
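For concreteness, the TransE-based loss above can be sketched as the mean L1 residual of $h + r - t$ over a batch of sampled triples. The function name and the batch layout (one row per triple) are assumptions for illustration.

```python
import numpy as np

def transe_loss(heads, rels, tails):
    """Mean over sampled triples of the L1 norm of (h + r - t).

    heads, rels, tails: arrays of shape (num_triples, dim).
    """
    heads = np.asarray(heads, dtype=float)
    rels = np.asarray(rels, dtype=float)
    tails = np.asarray(tails, dtype=float)
    residual = heads + rels - tails  # zero when the TransE assumption h + r = t holds
    return float(np.mean(np.sum(np.abs(residual), axis=1)))
```

The loss vanishes only when $t = h + r$ for every sampled triple, which makes the conflict noted above visible: a relation representation built as $h + t$ generally leaves a nonzero residual.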

#### **3.4 Conclusion**

In this chapter, we survey recent advances in the representation learning stage of EA. We propose a general framework of GNN-based representation learning models, which consists of six modules, and summarize ten recent works in terms of these modules. Extensive experiments are conducted to show the overall performance of each method and also reveal the effectiveness of the strategies in each module.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 Recent Advance of Alignment Inference Stage**

**Abstract** In this chapter, we introduce recent progress of the alignment inference stage.

#### **4.1 Introduction**

Matching data instances that refer to the same real-world entity is a long-standing problem. It establishes the connections among multiple data sources and is critical to data integration and cleaning [33]. Therefore, the task has been actively studied; for instance, in the database community, various entity matching (EM) (and entity resolution (ER)) strategies are proposed to train a (supervised) classifier to predict whether a pair of data records match [9, 33].

Recently, due to the emergence and proliferation of knowledge graphs (KGs), matching entities in KGs has drawn much attention from both academia and industry. Distinct from traditional data matching, it brings its own challenges. In particular, it underlines the use of KGs' structures for matching and manifests unique characteristics of data, e.g., imbalanced class distribution and scarce attributive textual information. As a consequence, although following the traditional EM pipeline is viable, it is difficult to train an effective classifier that can infer the equivalence between KGs' entities. Hence, much effort has been dedicated to specifically addressing the matching of entities in KGs, which is also referred to as *entity alignment* (EA).

Early solutions to EA are mainly unsupervised [22, 42], i.e., no labeled data is assumed. They utilize discriminative features of entities (e.g., entity descriptions and relational structures) to infer the equivalent entity pairs, but they are hampered by the heterogeneity of independently constructed KGs [44].

To mitigate this issue, recent solutions to EA employ a few labeled pairs as seeds to guide the learning and prediction [8, 14, 27, 37, 47]. In short, they embed the symbolic representations of KGs as low-dimensional vectors in a way such that the semantic relatedness of entities is captured by the geometrical structures of embedding spaces [4], where the seed pairs are leveraged to produce unified entity representations. In the testing stage, they match entities based on the unified entity embeddings. They are coined as *embedding-based* EA methods, which have exhibited state-of-the-art performance on existing benchmarks.

To be more specific, the embedding-based EA<sup>1</sup> pipeline can be roughly divided into two major stages, i.e., *representation learning* and *matching KGs in entity embedding spaces* (or *embedding matching* for short). While the former encodes the KG structures into low-dimensional vectors and establishes connections between independent KGs via the calibration or transformation of (seed) *entity embeddings* [44], the latter computes pairwise scores between source and target entities based on such embeddings and then makes alignment decisions according to the pairwise scores. Although this field has been actively explored, existing efforts are mainly devoted to the *representation learning* stage [24, 26, 61], while *embedding matching* did not attract much attention until very recently [29, 53]. The majority of existing EA solutions adopt a simple algorithm to realize this stage, i.e., DInf, which first leverages common similarity metrics such as cosine similarity to calculate the pairwise similarity scores between entity embeddings and then matches a source entity to its most similar target entity according to the pairwise scores [47]. Nevertheless, such an intuitive strategy can merely reach local optima for individual entities and completely overlooks the (global) interdependence among the matching decisions for different entities [55].

To address the shortcomings of DInf, advanced strategies have been devised [12, 44, 49, 53, 55, 56]. While some of them inject the modeling of global interdependence into the computation of pairwise scores [12, 44, 53], others directly improve the alignment decision-making process by imposing collective matching constraints [49, 55, 56]. These efforts demonstrate the significance of *matching KGs in entity embedding spaces* from at least three major aspects: (1) It is an indispensable step of EA, which takes as input the entity embeddings (generated by the representation learning stage) and outputs matched entity pairs. (2) Its performance is crucial to the overall EA results, e.g., an effective algorithm can improve the alignment results by up to 88% [53]. (3) It empowers EA with explainability, as it unveils the decision-making process of alignment. We use the following example to further illustrate the significance of the embedding matching process.

*Example* Figure 4.1 presents three representative cases of EA. The KG pairs to be aligned are first encoded into embeddings via the representation learning models. Next, the embedding matching algorithms produce the matched entity pairs based on the embeddings. In the most ideal case where two KGs are identical, e.g., case (a), with an ideal representation learning model, equivalent entities would be embedded into exactly the same place in the low-dimensional space, and using the simple DInf algorithm would attain perfect results. Nevertheless, in the majority of practical scenarios, e.g., cases (b) and (c), the two KGs have high structural heterogeneity. Thus, even an ideal representation learning model might generate different embeddings for equivalent entities. In this case, adopting the simple DInf strategy is likely to produce false entity pairs, such as $(u_5, v_3)$ in case (b).

<sup>1</sup> In the rest of the paper, we use *EA* to refer to embedding-based EA solutions and *conventional EA* for the early solutions.

Worse still, as pointed out in previous works [44, 59], existing representation learning methods for EA cannot fully capture the structural information (possibly due to their inner design mechanisms, or their incapability of dealing with scarce supervision signals). Under these settings, e.g., case (c), the distribution of entity embeddings in the low-dimensional space would become irregular, and the simple embedding matching algorithm DInf would fall short, i.e., producing incorrect entity pairs $(u_3, v_1)$ and $(u_5, v_1)$. Thus, in these practical cases, an effective embedding matching algorithm is crucial to inferring the correct matches. For instance, by exploiting the collective embedding matching algorithm that imposes the 1-to-1 alignment constraint, the correct matches, i.e., $(u_3, v_3)$ and $(u_5, v_5)$, are likely to be restored.
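The effect described in this example can be reproduced on a toy score matrix. In the sketch below, the similarity values are made up for illustration (they are not taken from Fig. 4.1), rows stand for $u_1, u_3, u_5$, and columns for $v_1, v_3, v_5$. The 1-to-1 matching is found by brute force over all permutations, which is only feasible for tiny inputs; the Hungarian algorithm solves the same assignment problem in polynomial time.

```python
from itertools import permutations

def greedy_match(scores):
    """Match each source entity to its highest-scoring target, independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def one_to_one_match(scores):
    """Brute-force the 1-to-1 assignment that maximises the total score."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)

# Hypothetical similarity scores: rows are u1, u3, u5; columns are v1, v3, v5.
toy_scores = [
    [0.90, 0.20, 0.10],
    [0.60, 0.50, 0.10],
    [0.55, 0.20, 0.50],
]
greedy_result = greedy_match(toy_scores)          # every source picks v1
constrained_result = one_to_one_match(toy_scores)  # the diagonal is recovered
```

Under the greedy strategy all three source entities collapse onto the hub $v_1$, whereas the 1-to-1 constraint forces $u_3$ and $u_5$ onto their second-best, and here correct, targets.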

While the study on matching KGs in entity embedding spaces is rapidly progressing, there is no systematic survey or comparison of these solutions [44]. We do notice that there are several survey papers covering embedding-based EA frameworks [44, 52, 57–59], but they all introduce the embedding matching module only briefly (mostly only mentioning the DInf algorithm). In this chapter, we aim to fill in this gap by surveying current solutions for matching KGs in entity embedding spaces and providing a comprehensive empirical evaluation of these methods with the following features:

**Fig. 4.1** Three cases of EA. Dashed lines between KGs denote the seed entity pairs. Entities with the same subscripts are equivalent. In the embedding space, the circles with two colors represent that the corresponding entities in the two KGs have the same embeddings

**(1) Systematic Survey and Fair Comparison** Albeit essential to the alignment performance, existing embedding matching strategies have not yet been compared directly. Instead, they are integrated with representation learning models and then evaluated and compared with each other (as a whole). This, however, cannot provide a fair comparison of the embedding matching strategies themselves, since the difference among them can be offset by other influential factors, such as the choices of representation learning models or input features. Therefore, in this chapter, we exclude irrelevant factors and provide a fair comparison of current matching algorithms for KGs in entity embedding spaces at both theoretical and empirical levels.

**(2) Comprehensive Evaluation and Detailed Discussion** To fully appreciate the effectiveness of embedding matching strategies, we conduct extensive experiments on a wide range of EA settings, i.e., with different representation learning models, with various input features, and on datasets at different scales. We also analyze the complexity of these algorithms and evaluate their efficiency/scalability under each experimental setting. Based on the empirical results, we discuss the strengths and weaknesses of these algorithms.

**(3) New Experimental Settings and Insights** Through empirical evaluation and analysis, we discover that the current mainstream evaluation setting, i.e., 1-to-1 constrained EA, oversimplifies the real-life alignment scenarios. Thus, we identify two experimental settings that better reflect the challenges in practice, i.e., alignment with unmatchable entities, as well as a new setting of *non-1-to-1 alignment*. We compare the embedding matching algorithms under these challenging settings to provide further insights.

**Contributions** We make the following contributions:


• Based on our evaluation and analysis, we provide useful insights into the design trade-offs of existing works and suggest promising directions for the future development of matching KGs in entity embedding spaces (Sect. 4.6).

#### **4.2 Preliminaries**

In this section, we first present the task formulation of EA and its general framework. Next, we introduce the studies related to alignment inference and clarify the scope of this chapter. Finally, we present the key assumptions of embedding-based EA.

#### *4.2.1 Task Formulation and Framework*

**Task Formulation** A KG $\mathcal{G}$ is composed of triples $\{(s, p, o)\}$, where $s, o \in \mathcal{E}$ represent entities and $p \in \mathcal{P}$ denotes the predicate (relation). Given a source KG $\mathcal{G}_s$ and a target KG $\mathcal{G}_t$, the task of EA is formulated as discovering new (equivalent) entity pairs $\mathcal{M} = \{(u, v) \mid u \in \mathcal{E}_s, v \in \mathcal{E}_t, u \Leftrightarrow v\}$ by using pre-annotated (seed) entity pairs $\mathcal{S}$ as anchors, where $\Leftrightarrow$ represents the equivalence between entities and $\mathcal{E}_s$ and $\mathcal{E}_t$ denote the entity sets in $\mathcal{G}_s$ and $\mathcal{G}_t$, respectively.

**General Framework** The pipeline of state-of-the-art embedding-based EA solutions can be divided into two stages, i.e., *representation learning* and *embedding matching*, as shown in Fig. 4.2. The general algorithm can be found in Algorithm 1.

The majority of studies on EA are devoted to the *representation learning* stage. They first utilize KG embedding techniques such as TransE [4] and GCN [20] to capture the KG structure information and generate entity structural representations. Next, based on the assumption that equivalent entities from different KGs possess similar neighboring KG structures (and in turn similar embeddings), they leverage the seed entity pairs as anchors and progressively project individual KG embeddings into a unified space through training, resulting in the unified entity representations *E*.<sup>2</sup> There have already been several survey papers concentrating on representation learning approaches for EA, and we refer the interested readers to these works [2, 44, 57, 59].

**Fig. 4.2** The pipeline of embedding-based EA. Dashed lines denote the pre-annotated alignment links

Next, we introduce the *embedding matching* process, the focus of this chapter, as well as its related works.

#### *4.2.2 Related Work and Scope*

**Matching KGs in Entity Embedding Spaces** After obtaining the unified entity representations *E* where equivalent entities from different KGs are assumed to have similar embeddings, the *embedding matching* stage (also frequently referred to as the *alignment inference* stage [44]) produces alignment results by comparing the embeddings of entities from different KGs. Concretely, it first calculates the pairwise scores between source and target entity embeddings according to a specific metric.<sup>3</sup> The pairwise scores are then organized into matrix form as *S*. Next, according to the pairwise scores, various matching algorithms are put forward to align entities. The most common algorithm is Greedy, described in Algorithm 2. It directly matches a source entity to the target entity that possesses the highest pairwise score according to *S*. Over the last few years, advanced solutions [12, 15, 28, 29, 44, 49, 53, 55, 56, 60] have been devised to improve the embedding matching performance, and in this work, we focus on surveying and comparing these algorithms for matching KGs in entity embedding spaces.

**Matching KGs in Symbolic Spaces** Before the emergence of embedding-based EA, there have already been many conventional frameworks that match KGs in symbolic spaces [17, 41, 42]. While some are based on equivalence reasoning mandated by OWL semantics [17], some leverage similarity computation to compare the symbolic features of entities [42]. However, these solutions are not comparable to algorithms for matching KGs in entity embedding spaces, as (1) they cover *both* the representation learning and embedding matching stages in embedding-based EA and (2) the required inputs are different from those of embedding matching algorithms. Thus, we do not include them in our experimental evaluation, while they have already been compared in the survey papers covering the overall embedding-based EA frameworks [44, 59].

<sup>2</sup> Indeed there are a few exceptions, which instead learn a mapping function between individual embedding spaces [44]. However, the subsequent steps still require mapping between spaces and operate on a "unified" one, e.g., target entity embeddings.

<sup>3</sup> Under certain metrics such as cosine similarity (resp., Euclidean distance), the larger (resp., smaller) the pairwise scores, the higher the probability that two entities are equivalent. In this work, w.l.o.g., we adopt the former expression and consider that higher pairwise scores are preferred.

The matching of relations (or ontology) between KGs has also been studied by prior symbolic works [41, 42]. Nevertheless, compared with entities, they are usually in smaller amounts, of various granularities [36], and under-explored in embedding-based approaches [51]. Hence, in this work, we exclude relevant studies on this topic and focus on the matching of *entities*.

The task of entity resolution (ER) [9, 16, 35], also known as entity matching, deduplication, or record linkage, can be regarded as the general case of EA [59]. It assumes that the input is relational data, and each data object usually has a large amount of textual information described in multiple attributes. Nevertheless, in this article, we focus on EA approaches, which strive to align *KGs* and mainly rely on *graph representation learning* techniques to model the KG structure and generate entity structural *embeddings* for alignment. Therefore, the discussion and comparison with ER solutions is beyond the scope of this work.

**Matching Data Instances via Deep Learning** Entity matching (EM) between databases has also been greatly advanced by utilizing pre-trained language models for expressive contextualization of database records [10, 33]. These deep learning (DL)-based EM solutions devise end-to-end neural models to learn to classify an entity pair into matching or non-matching and then feed the test entity pairs into the trained models to obtain classification results [5, 25, 33]. Nevertheless, this procedure is different from the focus of our study, as both its training and testing stages involve representation learning and matching. Besides, these solutions are not suitable for matching KGs in entity embedding spaces, since (1) they require adequate labeled data to train the neural classification models, but the training data in EA is much smaller than the testing data, which could result in overfitting; (2) they would suffer from severe class imbalance in EA, where an entity and all of its nonequivalent entities in another KG would constitute many negative samples, while there is usually one positive sample for this entity; and (3) they depend on the attributive text information between data records for training, while EA underlines the use of KG structure, which could provide far fewer useful features for model training. In the experiment, we adapt DL-based EM models to tackle EA, and the results are not promising. This will be further discussed in Sect. 4.4.3.

**Existing Surveys on EA** There are several survey papers covering EA frameworks [44, 52, 57–59]. Some articles provide high-level discussion of embedding-based EA frameworks, experimentally evaluate and compare these works, and offer guidelines for potential practitioners [44, 58, 59]. Specifically, Zhao et al. propose a general EA framework to encompass existing works and then evaluate them under a wide range of settings. Nevertheless, they only briefly mention DInf and SMat in the embedding matching stage [59]. Sun et al. survey EA approaches and develop an open-source library to evaluate existing works. However, they merely introduce DInf, SMat, and Hun., and overlook the comparison among these algorithms. Besides, they point out that current approaches put their main efforts into learning expressive embeddings to capture entity features but ignore the alignment inference (i.e., embedding matching) stage [44]. Zhang et al. empirically evaluate state-of-the-art embedding-based EA methods in an industrial context and particularly investigate the influence of the sizes and biases in seed mappings. They evaluate each method as a whole and do not mention the embedding matching process [58].

Two recent survey papers include the latest efforts on embedding-based EA and give more self-contained explanation on each technique. Zhang et al. provide a tutorial-type survey, while for embedding matching, they merely introduce the nearest neighbor search strategy, i.e., DInf [57]. Zeng et al. mainly introduce representation learning methods and their applications on EA, but they neglect the embedding matching stage [52].

In all, existing EA survey articles focus on the **representation learning** process and briefly introduce the embedding matching module (mostly only mentioning the DInf algorithm), while in this work, we systematically survey and empirically evaluate the algorithms designed for the **embedding matching** process in KG alignment and present comprehensive results and insightful discussions.

**Scope of This Work** This study aims to survey and empirically compare the algorithms for matching KGs in *entity embedding* spaces, i.e., various implementations of Embedding\_Matching( ) in Algorithm 1, on a wide range of EA experimental settings.

#### *4.2.3 Key Assumptions*

Notably, existing embedding-based EA solutions have a **fundamental assumption**; that is, the equivalent entities in different KGs possess similar (ideally, isomorphic) neighboring structures. Under such an assumption, effective representation learning models would transform the structures of equivalent entities into similar entity embeddings. Thus, based on the entity embeddings, the embedding matching stage would assign higher (resp., lower) pairwise similarity scores to the equivalent (resp., nonequivalent) entity pairs and finally make accurate alignment decisions via the coordination according to pairwise scores.

Besides, current EA evaluation settings assume that the entities in different KGs conform to the 1-to-1 constraint. That is, each $u \in \mathcal{E}_s$ has *one and only one* equivalent entity $v \in \mathcal{E}_t$ and vice versa. However, we contend that this assumption is in fact impractical and provide detailed experiments and discussions in Sect. 4.5.2.

#### **4.3 Alignment Inference Algorithms**

In this section, we introduce the algorithms for alignment inference, i.e., Embedding\_Matching( ) in Algorithm 1.

#### *4.3.1 Overview*

We first provide the overview and comparison of matching algorithms for KGs in entity embedding spaces in Table 4.1. As mentioned in Sect. 4.2, embedding matching comprises two stages—**pairwise score** computation and **matching**. The baseline approach DInf adopts existing similarity metrics to calculate the similarity between entity embeddings and generate the pairwise scores in the first stage, and then it leverages Greedy for matching. In pursuit of better alignment performance, more advanced embedding matching strategies are put forward. While some (i.e., CSLS, RInf, and Sink.) optimize the **pairwise score** computation process and produce more *accurate* pairwise scores, some (i.e., Hun., SMat, and RL) take into account the global alignment dynamics, rather than greedily pursue the local optimum for each entity, during the **matching** process, where more correct matches could be generated according to the coordination under the global constraint.

We further identify two notable characteristics of matching KGs in entity embedding spaces, i.e., whether the matching leverages the **1-to-1** constraint, and the **direction** of the matching. Regarding the former, Hun. and SMat explicitly exert the 1-to-1 constraint on the matching process. RL relaxes the strict 1-to-1 constraint by allowing non-1-to-1 matches. The greedy strategies, however, normally do not take this constraint into consideration, except for Sink., which implicitly implements the 1-to-1 constraint in a progressive manner when calculating the pairwise scores. As for the **direction** of matching, Greedy only considers a single direction at a time and overlooks the influence from the reverse direction. Thus, the resultant source-to-target alignment results are not necessarily equal to the target-to-source ones. By improving the pairwise score computation, CSLS, RInf, and Sink. are actually modeling and integrating the bidirectional alignments, although they still adopt Greedy to produce final results. For non-greedy methods, Hun. and SMat fully consider the bidirectional alignments and produce a matching agreed by both directions, while RL is unidirectional.

Next, we describe these methods in detail.<sup>4</sup>

#### *4.3.2 Simple Embedding Matching*

DInf is the most common implementation of Embedding\_Matching( ), described in Algorithm 3. Assume both KGs contain *n* entities. The time and space complexity of DInf is $O(n^2)$.
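A minimal sketch of DInf can make the two stages concrete. It assumes cosine similarity as the metric (the choice adopted in Sect. 4.4.2) and NumPy arrays as embeddings; the helper names `cosine_similarity_matrix`, `greedy_match`, and `dinf` are hypothetical and are not taken from EntMatcher.

```python
import numpy as np

def cosine_similarity_matrix(src_emb, tgt_emb):
    """Stage 1: pairwise scores between every source row and every target row."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src @ tgt.T

def greedy_match(scores):
    """Stage 2 (Greedy): each source entity takes its highest-scoring target."""
    return [(u, int(v)) for u, v in enumerate(scores.argmax(axis=1))]

def dinf(src_emb, tgt_emb):
    # Hypothetical wrapper: similarity computation followed by Greedy.
    return greedy_match(cosine_similarity_matrix(src_emb, tgt_emb))

# Toy embeddings (illustrative values only).
src_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.7]])
dinf_pairs = dinf(src_emb, tgt_emb)
print(dinf_pairs)  # [(0, 0), (1, 1), (2, 2)]
```

The dense score matrix is where the quadratic space cost comes from, and nothing in `greedy_match` prevents two source entities from choosing the same target.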


#### *4.3.3 CSLS Algorithm*

The cross-domain similarity local scaling (CSLS) algorithm [23] is introduced to mitigate the hubness and isolation issues of entity embeddings in EA [44]. The hubness issue refers to the phenomenon where some entities (known as hubs) frequently appear as the top-1 most similar entity of other entities in the vector space, while the isolation issue means that there exist some outliers isolated from any point clusters. To counter both, CSLS increases the similarity associated with isolated entity embeddings and conversely decreases the similarity of vectors lying in dense areas [23]. Formally, the CSLS pairwise score between source entity *u* and target entity *v* is:

$$\mathrm{CSLS}(u, v) = 2S(u, v) - \phi(u) - \phi(v), \tag{4.1}$$

where *S* is the similarity matrix derived from *E* using similarity metrics, $\phi(u) = \frac{1}{k}\sum_{v' \in \mathcal{N}_u} S(u, v')$ is the mean similarity score between the source entity *u* and its top-*k* most similar entities $\mathcal{N}_u$ in the target KG, and $\phi(v)$ is defined similarly. The mean similarity scores of all source and target entities are denoted in vector form as $\phi_s$ and $\phi_t$, respectively. To generate the matched entity pairs, it further applies Greedy on the CSLS matrix (i.e., $S_{\mathrm{CSLS}}$). Algorithm 4 describes the detailed procedure of CSLS.

<sup>4</sup> We omit the algorithmic description of the classical algorithms (e.g., Hungarian [21] and Gale-Shapley [40]) and the neural model (i.e., RL [32]) in the interest of space.


**Complexity** The time and space complexity is $O(n^2)$. Practically, it requires more time and space than DInf, as it needs to generate the additional CSLS matrix.
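The rescaling in Eq. (4.1) can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the reference implementation: the function name `csls_scores` and the toy matrix are made up for the example, and the top-*k* neighborhoods are obtained by sorting each row and column.

```python
import numpy as np

def csls_scores(S, k=1):
    """CSLS(u, v) = 2 S(u, v) - phi(u) - phi(v), following Eq. (4.1)."""
    # phi_s[u]: mean of the k largest scores in row u (top-k targets of u).
    phi_s = np.sort(S, axis=1)[:, -k:].mean(axis=1)
    # phi_t[v]: mean of the k largest scores in column v (top-k sources of v).
    phi_t = np.sort(S, axis=0)[-k:, :].mean(axis=0)
    return 2 * S - phi_s[:, None] - phi_t[None, :]

# Toy similarity matrix in which target 0 acts as a hub (illustrative values).
S = np.array([[0.80, 0.78, 0.10],
              [0.90, 0.30, 0.25],
              [0.70, 0.20, 0.60]])

greedy_raw = S.argmax(axis=1)                    # Greedy on raw scores
greedy_csls = csls_scores(S, k=1).argmax(axis=1)  # Greedy on CSLS scores
print(greedy_raw.tolist())   # [0, 0, 0]
print(greedy_csls.tolist())  # [1, 0, 2]
```

On the raw scores every source entity selects the hub target 0. After rescaling, the hub is penalized by its large $\phi(v)$, and Greedy returns three distinct targets.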

#### *4.3.4 Reciprocal Embedding Matching*

Zeng et al. [53] formulate the EA task as a reciprocal recommendation process [38] and offer a reciprocal embedding matching strategy, RInf, to model and integrate the bidirectional preferences of entities when inferring the matching results. Formally, it defines the pairwise score of source entity *u* toward target entity *v* as:

$$p_{u,v} = S(u, v) - \max_{u' \in \mathcal{E}_s} S(v, u') + 1, \tag{4.2}$$

where *S* is the similarity matrix derived from *E*, $0 \le p_{u,v} \le 1$, and a larger $p_{u,v}$ denotes a higher degree of preference. As such, the matrix forms of the source-to-target and target-to-source preference scores are denoted as $P_{s,t}$ and $P_{t,s}$, respectively. Next, it converts each preference matrix *P* into a ranking matrix *R* and then averages the two ranking matrices, resulting in the reciprocal preference matrix $P_{s \leftrightarrow t}$ that encodes the bidirectional alignment information. Finally, it adopts Greedy to generate the matched entity pairs.

**Complexity** Algorithm 5 describes the detailed procedure of RInf. The time complexity is $O(n^2 \lg n)$ [53]. The space complexity is $O(n^2)$. Practically, it requires more space than DInf and CSLS, due to the computation of similarity, preference, and ranking matrices. Notably, two variant methods, i.e., RInf-wr and RInf-pb, are proposed to reduce the memory and time consumption brought by the reciprocal modeling. More details can be found in [53].

**Algorithm 5:** RInf($\mathcal{E}_s$, $\mathcal{E}_t$, *E*)

**Input:** Source and target entity sets: $\mathcal{E}_s$, $\mathcal{E}_t$; Unified entity embeddings: *E*

**Output:** Matched entity pairs: $\mathcal{M}$

**1** Derive similarity matrix *S* based on *E*;

**2** **for** $u \in \mathcal{E}_s$ **do**

**3** &emsp;**for** $v \in \mathcal{E}_t$ **do**

**4** &emsp;&emsp;Calculate $p_{u,v}$ and $p_{v,u}$ (cf. Eq. (4.2));

**5** Collect the preference scores, resulting in $P_{s,t}$ and $P_{t,s}$;

**6** Convert $P_{s,t}$ and $P_{t,s}$ into $R_{s,t}$ and $R_{t,s}$, respectively;

**7** $P_{s \leftrightarrow t} = (R_{s,t} + R_{t,s}^{\top})/2$;

**8** $\mathcal{M} \leftarrow$ Greedy($\mathcal{E}_s$, $\mathcal{E}_t$, $-P_{s \leftrightarrow t}$);

**9** **return** $\mathcal{M}$;
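The steps of Algorithm 5 can be sketched with NumPy as follows. Several details are assumptions made for illustration and are not specified above: ties receive the same (best) rank, rank 1 denotes the most preferred entity, and the names `rank_descending` and `rinf` are hypothetical.

```python
import numpy as np

def rank_descending(P):
    """Row-wise ranks: 1 for the largest entry; tied entries share a rank."""
    ranks = np.empty(P.shape, dtype=int)
    for i, row in enumerate(P):
        ascending = np.sort(row)
        # Rank = 1 + number of entries in the row that are strictly larger.
        ranks[i] = row.size - np.searchsorted(ascending, row, side="right") + 1
    return ranks

def rinf(S):
    # Preference of source u toward target v (Eq. (4.2)): subtract the best
    # score that v receives from any source entity, then shift by 1.
    P_st = S - S.max(axis=0, keepdims=True) + 1
    # Preference of target v toward source u, defined symmetrically.
    P_ts = S.T - S.max(axis=1, keepdims=True).T + 1
    R_st = rank_descending(P_st)
    R_ts = rank_descending(P_ts)
    reciprocal = (R_st + R_ts.T) / 2
    # Greedy on the negated ranks is the same as taking the smallest rank.
    return reciprocal.argmin(axis=1)

S = np.array([[0.80, 0.78, 0.10],
              [0.90, 0.30, 0.25],
              [0.70, 0.20, 0.60]])
rinf_targets = rinf(S)
print(rinf_targets.tolist())  # [1, 0, 2]
```

Sorting every row of both preference matrices is the step that accounts for the extra $\lg n$ factor in the time complexity, and the two ranking matrices are the additional memory cost relative to CSLS.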

#### *4.3.5 Embedding Matching as Assignment*

Some very recent studies [29, 49] propose to model the embedding matching process as the linear assignment problem. They first use similarity metrics to calculate pairwise similarity scores based on *E*. Then they adopt the Hungarian algorithm [21] to solve the task of assigning source entities to target entities according to the pairwise scores. The objective is to maximize the sum of the pairwise similarity scores of the final matched entity pairs while observing the 1-to-1 assignment constraint. In this work, we use the Hungarian algorithm implemented by Jonker and Volgenant [18] and denote it as Hun.*(*E*s,* E*t, E)*.
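As a hedged illustration of the assignment formulation, the sketch below uses SciPy's `linear_sum_assignment` as a stand-in solver. It is not necessarily the implementation used in EntMatcher, and the toy similarity matrix is invented for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix in which target 0 is the top choice of every source.
S = np.array([[0.80, 0.78, 0.10],
              [0.90, 0.30, 0.25],
              [0.70, 0.20, 0.60]])

# Greedy: every source entity independently takes its best target.
greedy_targets = S.argmax(axis=1)

# Linear assignment: maximize the total similarity under the 1-to-1 constraint.
rows, cols = linear_sum_assignment(S, maximize=True)
assigned_targets = cols[np.argsort(rows)]
total_similarity = S[rows, cols].sum()

print(greedy_targets.tolist())    # [0, 0, 0]
print(assigned_targets.tolist())  # [1, 0, 2]
```

Source 0 gives up its locally best target (0.80) for a slightly worse one (0.78) so that source 1 can take target 0, which raises the overall sum to 2.28.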

Besides, the Sinkhorn operation [31] (or Sink. for short) is also adopted to solve the assignment problem [12, 15, 29], which converts the similarity matrix *S* into a doubly stochastic matrix *Ssinkhorn* that encodes the entity correspondence information. Specifically,

$$\begin{aligned} \operatorname{Sinkhorn}^l(S) &= \Gamma_c(\Gamma_r(\operatorname{Sinkhorn}^{l-1}(S)));\\ S_{\mathrm{sinkhorn}} &= \lim_{l \to \infty} \operatorname{Sinkhorn}^l(S), \end{aligned} \tag{4.3}$$

where $\operatorname{Sinkhorn}^0(S) = \exp(S)$, and $\Gamma_c$ and $\Gamma_r$ refer to the column-wise and row-wise normalization operators of a matrix. Since the number of iterations *l* is limited, the Sinkhorn operation can only obtain an approximate 1-to-1 assignment solution in practice [29]. Then $S_{\mathrm{sinkhorn}}$ is forwarded to Greedy to obtain the alignment results.

**Complexity** For Hun., the time complexity is $O(n^3)$, and the space complexity is $O(n^2)$. Algorithm 6 describes the procedure of Sink. The time complexity of Sink. is $O(ln^2)$ [29], and the space complexity is $O(n^2)$. In practice, both algorithms require more space than DInf, since they need to store the intermediate results.

**Algorithm 6:** Sink.($\mathcal{E}_s$, $\mathcal{E}_t$, *E*, *l*)

**Input:** Source and target entity sets: $\mathcal{E}_s$, $\mathcal{E}_t$; Unified entity embeddings: *E*; Hyper-parameter: *l*

**Output:** Matched entity pairs: $\mathcal{M}$

**1** Derive similarity matrix *S* based on *E*;

**2** $S_{\mathrm{sinkhorn}} = \operatorname{Sinkhorn}^l(S)$ (cf. Eq. (4.3));

**3** $\mathcal{M} \leftarrow$ Greedy($\mathcal{E}_s$, $\mathcal{E}_t$, $S_{\mathrm{sinkhorn}}$);

**4** **return** $\mathcal{M}$;
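The iteration in Eq. (4.3) can be written as a short loop. The sketch below is a plain NumPy illustration under assumptions: the function name `sinkhorn` and the toy matrix are hypothetical, and no temperature scaling or log-domain stabilization is applied.

```python
import numpy as np

def sinkhorn(S, num_iters=100):
    """Apply l = num_iters rounds of row-then-column normalization to exp(S)."""
    M = np.exp(S)  # Sinkhorn^0(S)
    for _ in range(num_iters):
        M = M / M.sum(axis=1, keepdims=True)  # Gamma_r: each row sums to 1
        M = M / M.sum(axis=0, keepdims=True)  # Gamma_c: each column sums to 1
    return M

S = np.array([[0.80, 0.78, 0.10],
              [0.90, 0.30, 0.25],
              [0.70, 0.20, 0.60]])
S_sinkhorn = sinkhorn(S, num_iters=100)

# The result is (approximately) doubly stochastic.
print(S_sinkhorn.sum(axis=0))
print(S_sinkhorn.sum(axis=1))

# Final step of Algorithm 6: Greedy on the normalized scores.
sink_targets = S_sinkhorn.argmax(axis=1)
```

Each iteration touches every entry of the matrix twice, which is consistent with the $O(ln^2)$ time cost stated above.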

#### *4.3.6 Stable Embedding Matching*

In order to consider the interdependence among alignment decisions, the embedding matching process is formulated as the stable matching problem [13] by Zeng et al. [55] and Zhu et al. [60]. It is proved that for any two sets of members with the same size, each of whom provides a ranking of the members in the opposing set, there exists a bijection of the two sets such that no pair of two members from the opposite side would prefer to be matched to each other rather than their assigned partners [11]. Specifically, these works first produce the similarity matrix *S* based on *E* using similarity metrics. Next, they generate the rankings of members in the opposing set according to the pairwise similarity scores. Finally, they use the Gale-Shapley algorithm [40] to solve the stable matching problem. This procedure is denoted as SMat*(*E*s,* E*t, E)*.

**Complexity** SMat has time complexity of $O(n^2 \lg n)$ (since for each entity, the ranking of entities in the opposite side needs to be computed) and space complexity of $O(n^2)$.
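A compact sketch of the procedure is given below. It assumes a square similarity matrix, source entities as the proposing side, and target preferences read directly from the columns of *S*; the name `stable_match` is hypothetical.

```python
import numpy as np

def stable_match(S):
    """Source-proposing Gale-Shapley on a square similarity matrix S."""
    n = S.shape[0]
    # Each source entity ranks all targets by descending similarity.
    src_prefs = np.argsort(-S, axis=1, kind="stable")
    next_choice = [0] * n   # position in src_prefs that each source tries next
    holder = [None] * n     # holder[v]: source currently engaged to target v
    free = list(range(n))
    while free:
        u = free.pop()
        v = int(src_prefs[u, next_choice[u]])
        next_choice[u] += 1
        current = holder[v]
        if current is None:
            holder[v] = u             # v was unmatched and accepts u
        elif S[u, v] > S[current, v]:
            holder[v] = u             # v prefers u and releases its partner
            free.append(current)
        else:
            free.append(u)            # v rejects u, who proposes again later
    return sorted((u, v) for v, u in enumerate(holder))

S = np.array([[0.80, 0.78, 0.10],
              [0.90, 0.30, 0.25],
              [0.70, 0.20, 0.60]])
stable_pairs = stable_match(S)
print(stable_pairs)  # [(0, 1), (1, 0), (2, 2)]
```

The result is stable in the sense defined above: no source and target both score each other higher than their assigned partners. On this toy matrix it coincides with the assignment that maximizes the total similarity, but a stable matching does not maximize that sum in general.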

#### *4.3.7 RL-Based Embedding Matching*

In [56], the embedding matching process is cast as a classic sequence decision problem. Given a sequence of source entities (and their embeddings), the goal of the sequence decision problem is to decide to which target entity each source entity aligns. A reinforcement learning (RL)-based framework is devised to learn to optimize the decision-making for all entities, rather than optimize every single decision separately. More details can be found in the original paper [56], and we use RL($\mathcal{E}_s$, $\mathcal{E}_t$, *E*) to denote it.

**Complexity** Note that it is difficult to deduce the time complexity for this neural RL model. Instead, we provide the empirical time costs in the experiments. The space complexity is $O(n^2)$.

#### **4.4 Main Experiments**

In this section, we compare the algorithms for matching KGs in entity embedding spaces on the mainstream EA evaluation setting (1-to-1 alignment).

#### *4.4.1 EntMatcher: An Open-Source Library*

To ensure comparability, we re-implemented all compared algorithms using Python under a unified framework and established an open-source library, EntMatcher.<sup>5</sup> The architecture of the EntMatcher library is presented in the blue block of Fig. 4.3, which takes as input unified entity embeddings *E* and produces the matched entity pairs. It has the following three major features:

**Loosely Coupled Design** There are three independent modules in EntMatcher, and we have implemented the representative methods in each module. Users are free to combine the techniques in each module to develop new approaches, or to implement their new designs by following the templates in modules.

**Reproduction of Existing Approaches** To support our experimental study, we tried our best to re-implement all existing algorithms by using EntMatcher. For instance, the combination of cosine similarity, CSLS, and Greedy reproduces the CSLS algorithm in Sect. 4.3.3; and the combination of cosine similarity, None, and Hun. reproduces the Hun. algorithm in Sect. 4.3.5. The specific hyper-parameter settings are elaborated in Sect. 4.4.2.

**Flexible Integration with Other Modules in EA** EntMatcher is highly flexible and can be directly called during the development of standalone EA approaches. Besides, users may also use EntMatcher as the backbone and call other modules. For instance, to conduct the experimental evaluations in this work, we implemented the representation learning and auxiliary information modules to generate the unified entity embeddings *E*, as shown in the white blocks of Fig. 4.3. More details are elaborated in the next subsection. Finally, EntMatcher is also compatible with existing open-source EA libraries (that mainly focus on representation learning) such as OpenEA<sup>6</sup> and EAkit.<sup>7</sup>

<sup>5</sup> The codes are publicly available at https://github.com/DexterZeng/EntMatcher.

<sup>6</sup> https://github.com/nju-websoft/OpenEA.

<sup>7</sup> https://github.com/THU-KEG/EAkit.

**Fig. 4.3** Architecture of the EntMatcher library and additional modules required by the experimental evaluation

#### *4.4.2 Experimental Settings*

The current EA evaluation setting assumes that the entities in source and target KGs are 1-to-1 matched (cf. Sect. 4.2.3). Although this assumption simplifies the real-world scenarios where some entities are unmatchable or some might be aligned to multiple entities on the other side, it indeed reflects the core challenge of EA. Therefore, following existing literature, we mainly compare the embedding matching algorithms under this setting and postpone the evaluation on the challenging real-life scenarios to Sect. 4.5.

**Datasets** We used popular EA benchmarks for evaluation: **(1) DBP15K**, which comprises three multilingual KG pairs extracted from DBpedia [1], English to Chinese (DBP15K<sub>ZH-EN</sub>), English to Japanese (DBP15K<sub>JA-EN</sub>), and English to French (DBP15K<sub>FR-EN</sub>); **(2) SRPRS**, which is a sparser dataset that follows real-life entity distribution, including two multilingual KG pairs extracted from DBpedia, English to French (SRPRS<sub>EN-FR</sub>) and English to German (SRPRS<sub>EN-DE</sub>), and two mono-lingual KG pairs, DBpedia to Wikidata [46] (SRPRS<sub>DBP-WD</sub>) and DBpedia to YAGO [43] (SRPRS<sub>DBP-YG</sub>); and **(3) DWY100K**, which is a larger dataset consisting of two mono-lingual KG pairs: DBpedia to Wikidata (D-W) and DBpedia to YAGO (D-Y). The detailed statistics can be found in Table 4.2, where the numbers of entities, relations, triples, gold links, and the average entity degree are reported. Regarding the gold alignment links, we adopted 70% as the test set, 20% for training, and 10% for validation.

**Hardware Configuration and Hyper-Parameter Setting** Our experiments were performed on a Linux server that is equipped with an Intel Core i7-4790 CPU running at 3.6GHz, NVIDIA GeForce GTX TITAN X GPU, and 128 GB RAM. We followed the configurations presented in the original papers of these algorithms and tuned the hyper-parameters on the validation set. Specifically, for CSLS, we set *k* to 1, except on the non-1-to-1 setting where we set it to 5. Similarly, regarding RInf, we changed the maximum operation in Eq. (4.2) to top-*k* average operation on the non-1-to-1 setting, where *k* is set to 5. As to Sink., we set *l* to 100. For RL, we found the hyper-parameters in the original paper could already produce the best results and directly adopted them. The rest of the approaches, i.e., DInf, Hun., and SMat, do not contain hyper-parameters.




**Evaluation Metric** We utilized *F1 score* as the evaluation metric, which is the harmonic mean between *precision* and *recall*, where the *precision* value is computed as the number of correct matches divided by the number of matches found by a method, and the *recall* value is computed as the number of correct matches found by a method divided by the number of gold matches. Note that *recall* is equivalent to the Hits@1 metric used in some previous works.
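The metric definitions above translate directly into code. The following sketch assumes matches are represented as (source, target) index pairs; the function name `precision_recall_f1` and the toy pairs are illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = [(0, 0), (1, 1), (2, 2), (3, 3)]

# One prediction per source entity: precision, recall, and F1 coincide.
full = precision_recall_f1([(0, 0), (1, 1), (2, 3), (3, 2)], gold)
print(full)  # (0.5, 0.5, 0.5)

# Fewer predictions than gold matches: precision and recall diverge.
partial = precision_recall_f1([(0, 0), (1, 1)], gold)
```

The second call returns a precision of 1.0 and a recall of 0.5, a situation that only arises when a method may decline to match some source entities.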

**Representation Learning Models** Since representation learning is not the focus of this work, we adopted two frequently used models, i.e., RREA [30] and GCN [47]. Concretely, GCN is one of the simplest models, which uses graph convolutional networks to learn the structural embeddings, while RREA is one of the best-performing solutions, which leverages relational reflection transformation to obtain relation-specific entity embeddings.

**Auxiliary Information for Alignment** Some works leverage the auxiliary information in KGs (e.g., entity attributes, descriptions, and pictures) to complement the KG structure. Specifically, this auxiliary information is first encoded into low-dimensional vectors and then fused with structural embeddings to provide more accurate entity representations for the subsequent embedding matching stage [44]. Although EA underlines the use of graph structure for alignment [59], for a more comprehensive evaluation, we examined the influence of auxiliary information on the matching results by following previous works and using entity name embeddings to facilitate alignment [34, 59]. We also combined these two channels of information with equal weights to generate the fused similarity matrix for matching.<sup>8</sup>

**Similarity Metric** After obtaining the unified entity representations *E*, a similarity metric is required to produce pairwise scores and generate the similarity matrix *S*. Frequent choices include the cosine similarity [6, 30, 45], the Euclidean distance [7, 24], and the Manhattan distance [48, 50]. In this work, we followed mainstream works and adopted the cosine similarity.
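For reference, the three metrics and the equal-weight fusion of the structural and name channels can be sketched as follows. The function names are hypothetical, and the fusion is assumed to be a simple weighted sum of two similarity matrices of the same shape.

```python
import numpy as np

def cosine_sim(A, B):
    """Cosine similarity: larger values mean more similar."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def euclidean_dist(A, B):
    """Euclidean distance: smaller values mean more similar."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def manhattan_dist(A, B):
    """Manhattan distance: smaller values mean more similar."""
    return np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

def fuse(S_struct, S_name, weight=0.5):
    """Weighted sum of two similarity matrices (equal weights by default)."""
    return weight * S_struct + (1 - weight) * S_name

A = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [1.0, 1.0]])
cos = cosine_sim(A, B)
euc = euclidean_dist(A, B)
man = manhattan_dist(A, B)
```

Because the two distances decrease as entities become more similar, they would need to be negated (or otherwise inverted) before being passed to a matching routine that selects the largest score.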

#### *4.4.3 Main Results and Comparison*

We first evaluate with only structural information and report the results in Table 4.3, where R- and G- refer to using RREA and GCN to generate the structural embeddings, respectively, and DBP and SRP denote DBP15K and SRPRS, respectively. Next, we supplement with name embeddings and report the results in Table 4.4, where N- and NR- refer to only using the name embeddings and fusing name embeddings with RREA structural representations, respectively. Note that, on existing datasets, all the entities in the test set can be matched, and all the algorithms are devised to find a target entity for each test source entity. Hence, the number of matches found by a method equals the number of gold matches, and consequently the precision value is equal to the recall value and the F1 score [56].

<sup>8</sup> Note that we only reported the results of fusing entity name embeddings with RREA structural embeddings. The results of combining entity name embeddings with GCN embeddings exhibited similar patterns and were omitted in the interest of space.

**Overall Performance** First, we do not delve into the embedding matching algorithms and directly analyze the general results. Specifically, using RREA to learn structural representations brings better performance than using GCN, showcasing that representation learning strategies are crucial to the overall alignment performance. When introducing the entity name information, we observe that this auxiliary signal alone can already provide very accurate signals for alignment. This is because the equivalent entities in different KGs of current datasets share very similar or even identical names. After fusing the semantic and structural information, the alignment performance is further lifted, with most of the approaches hitting over 0.9 in terms of the F1 score.

**Effectiveness Comparison of Embedding Matching Algorithms** From the tables, it is evident that: (1) *Overall, Hun. and Sink. attain much better results than the other strategies.* Specifically, Hun. takes full account of the global matching constraints and strives to reach a globally optimal matching given the objective of maximizing the sum of pairwise similarity scores. Moreover, the 1-to-1 constraint it exerts aligns with the present evaluation setting where the source and target entities are 1-to-1 matched. Sink., on the other hand, implicitly implements the 1-to-1 constraint during pairwise score computation and still adopts Greedy to produce final results, where there might exist non-1-to-1 matches. (2) *DInf attains the worst performance.* This is because it directly adopts the similarity scores that suffer from the hubness and isolation issues [44]. Besides, it leverages Greedy, which merely reaches the local optimum for each entity. (3) *The performance of RInf, CSLS, SMat, and RL is comparable.* RInf and CSLS improve upon DInf by mitigating the hubness issue and enhancing the quality of pairwise scores. SMat and RL, on the other hand, improve upon DInf by modeling the interactions among matching decisions for different entities.

Furthermore, we conduct a deeper analysis of these approaches and identify the following patterns:

**Pattern 1.** *If for source entities, their highest pairwise similarity scores are close, RInf and CSLS (resp., SMat and RL) would attain relatively better (resp., worse) performance.* Specifically, in Table 4.3 where RInf consistently (and CSLS sometimes) attains better results than SMat and RL, the average standard deviation (STD) values of the top five pairwise similarity scores of source entities (cf. Fig. 4.4) are very small, unveiling that the top scores are close and difficult to differentiate. In contrast, in Table 4.4 where SMat and RL outperform RInf and CSLS, the corresponding STD values are relatively large. This is because RInf and CSLS aim to make the scores more distinguishable, and hence they are more effective in cases where the top similarity scores are very close (i.e., low STD values). On the contrary, when the top similarity scores are already discriminating (e.g., Table 4.4), RInf and CSLS become less useful, while SMat and RL can still make improvements by using the global constraints to enforce the deviation from local optimums.
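The statistic used in Pattern 1 can be computed as sketched below. The function name `mean_topk_std` is hypothetical, and the use of the population standard deviation is an assumption, since the exact estimator is not stated.

```python
import numpy as np

def mean_topk_std(S, k=5):
    """Average, over source entities, of the STD of their top-k scores."""
    topk = np.sort(S, axis=1)[:, -k:]        # k largest scores in each row
    return float(topk.std(axis=1).mean())    # population STD (ddof=0)

# Toy rows: top scores that are close together vs. clearly separated.
S_close = np.array([[0.90, 0.89, 0.88, 0.10]])
S_spread = np.array([[0.90, 0.50, 0.10, 0.00]])

std_close = mean_topk_std(S_close, k=3)
std_spread = mean_topk_std(S_spread, k=3)
print(std_close < std_spread)  # True
```

A small value indicates that the leading candidates of a typical source entity are hard to tell apart, which is the regime where rescaling the pairwise scores is reported to help most.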

**Pattern 2.** *On sparser datasets, the superiority of Sink. and Hun. over the rest of the methods becomes less significant.* This is based on the observation that on SRPRS, other matching algorithms (RInf in particular) attain much closer performance to Sink. and Hun. Such a pattern could be attributed to the fact that, on sparser datasets, entities normally have fewer connections with others, i.e., lower average entity degree (in Table 4.2), where representation learning strategies might fail to fully capture the structural signals for alignment and the resultant pairwise scores become less accurate. These inaccurate scores could mislead the matching process and hence limit the effectiveness of the top-performing methods, i.e., Sink. and Hun. In other words, sparser KG structures are more likely to (partially) break the fundamental assumption on KG structure similarity (cf. Sect. 4.2.3).

**Efficiency Analysis** We compare the time and space efficiency of these methods on the medium-sized datasets in Fig. 4.5. Since the costs on KG pairs from the same dataset are very similar, we report the *average* time and space costs under each setting in the interest of space.

Specifically, we observe that: (1) The simple algorithm DInf is the most efficient approach. (2) *Among the advanced approaches, CSLS is the most efficient one*, closely following DInf. (3) *The efficiency of RInf and Hun. is comparable*. While Hun. consumes relatively less memory space than RInf, its time efficiency is less stable and tends to run slower on datasets with less accurate pairwise scores. (4) The space efficiency of Sink. is close to RInf and Hun., whereas it has much higher time costs, which largely depend on the value of *l*. (5) *RL is the least time-efficient approach, while SMat is the least space-efficient algorithm*. RL requires more time on datasets with less accurate pairwise scores where its pre-processing module fails to produce promising results [56]. The memory space consumption of SMat is high, as it needs to store a large amount of intermediate matching results. In all, we can conclude that *generally, advanced embedding matching algorithms require more time and memory space, among which the methods incorporating global matching constraints tend to be less efficient.*

**Fig. 4.5** Efficiency comparison. Shapes in blue denote methods that improve pairwise scores, while shapes in black denote those exerting global constraints (except for DInf). (**a**) Time cost (in seconds). (**b**) Memory space cost (in GB)

**Comparison with DL-Based EM Approaches** We utilize the deepmatcher Python package [33], which provides built-in neural networks and utilities that can train and apply state-of-the-art deep learning models for entity matching, to address EA. Specifically, we use the structural and name embeddings to replace the attributive text inputs in deepmatcher, respectively, and then train the neural model with labeled data. For each positive entity pair, we randomly sample ten negative ones. In the testing stage, for each source entity, we feed the entity pairs constituting it and all the target entities into the trained classifier and regard the entity pair with the highest predicted score as the result.

In the final results, only a few entities are correctly aligned, showing that DL-based EM approaches cannot handle EA, which can be ascribed to the insufficient labeled data, imbalanced class distribution, and the lack of attributive text information, as discussed in Sect. 4.2.2.

#### *4.4.4 Results on Large-Scale Datasets*

Next, we provide the results on the relatively larger dataset, i.e., DWY100K, which can also reflect the scalability of these algorithms. The results are presented in Table 4.5.<sup>9</sup> The general pattern is similar to that on G-DBP (i.e., using GCN on DBP15K), where Sink. and Hun. obtain the best results, followed by RInf. The performance of CSLS and RL is close, outperforming DInf by over 20%.

We compare the efficiency of these algorithms in Table 4.5, where $\bar{T}$ refers to the average time cost and Mem. denotes whether the memory space required by the model can be covered by our experimental environment.<sup>10</sup> We observe that, given larger datasets, *most of the performant algorithms have poor efficiency and scalability* (e.g., RInf, Sink., and Hun.). Note that in [53], two variants of RInf, i.e., RInf-wr and RInf-pb, are proposed to improve its scalability at the cost of a small performance drop, which is empirically validated in Table 4.5. This also reveals that *more scalable matching algorithms for KGs in entity embedding spaces should be devised*.

#### *4.4.5 Analysis and Insights*

We provide further experiments and discussions in this subsection.

**On Efficiency and Scalability** The simple algorithm DInf is the most efficient and scalable one, as it merely involves the most basic computation and matching operations. CSLS is slightly less efficient than DInf due to the update of pairwise similarity scores. It also has good scalability. Although RInf adopts a similar idea to CSLS, it involves an additional ranking process, which brings much more time and memory consumption, making it less scalable. Sink. repeatedly conducts the normalization operation, and thus its time efficiency depends mainly on the *l* value. Its scalability is also limited by the memory space consumption since it needs to store intermediate results, as revealed in Table 4.5.

<sup>9</sup> We cannot provide the results of SMat, as it requires extremely large memory space and cannot work under our experimental environment.

<sup>10</sup> Note that for algorithms with memory space costs exceeding our experimental environment (except for SMat), there is additional swap area in the hard drive for them to finish the program (which usually takes much longer time).

Regarding the methods that exert global constraints, Hun. is efficient on medium-sized datasets, while it is not scalable due to the high time complexity and memory space consumption. SMat is space-inefficient even on the medium-sized datasets, making it not scalable. In comparison, RL has more stable time and space costs and can scale to large datasets, and the main influencing factor is the accuracy of pairwise scores. This is because RL has a pre-processing step that filters out confident matched entity pairs and excludes them from the time-consuming RL learning process [56]. More confident matched entity pairs would be filtered out if the pairwise scores are more accurate.

**On Effectiveness of Improving Pairwise Score Computation** We compare and discuss the strategies for improving the pairwise score computation, i.e., CSLS, RInf, and Sink.

Both CSLS and RInf aim to mitigate the hubness and isolation issues in the raw pairwise scores (from different starting points). Particularly, we observe that, by setting *k* (in Eq. (4.1)) of CSLS to 1, *the difference between RInf and CSLS is reduced to the extra* ranking *process of RInf*, and the results in Tables 4.3 and 4.4 validate that *this ranking process can consistently bring better performance*. This is because the ranking operation can amplify the difference among the scores and prevent such information from being lost after the bidirectional aggregation [53]. However, it is noteworthy that *the ranking process brings much more time and memory consumption*, as can be observed from the empirical results.

Then we analyze the influence of the *k* value in CSLS. As shown in Fig. 4.6, a larger *k* leads to worse performance. This is because a larger *k* implies a smaller *φ* value in Eq. (4.1) (where the top-*k* highest scores are considered and averaged), and the resultant pairwise scores become less distinctive. This also validates the effectiveness of the design in RInf (cf. Eq. (4.2)), where only the *maximum* value is considered to compute the preference score. Nevertheless, in Sect. 4.5.2, we reveal that setting *k* to 1 is only useful in the 1-to-1 alignment setting.

As for Sink., it adopts an extreme approach to optimize the pairwise scores, which encourages each source (resp., target) entity to *have only one positive pairwise score* with a target (resp., source) entity and 0's with the rest of the target (resp., source) entities. Thus, *it is in fact progressively and implicitly implementing the 1-to-1 alignment constraint during the pairwise score computation process with the increase of l* and is particularly useful in present 1-to-1 evaluation settings of EA. In Fig. 4.7, we further examine the influence of *l* in Eq. (4.3) on the alignment results of Sink., which meets our expectation that the larger the *l* value, the better the distribution of the resultant pairwise scores fits the 1-to-1 constraint, and thus the higher the alignment performance. Nevertheless, a larger *l* also implies longer processing time. Therefore, by tuning on the validation set, we set *l* to 100 to reach the balance between effectiveness and efficiency.

**Fig. 4.6** F1 scores of CSLS with varying *k* value. (**a**) On R-DBP. (**b**) On G-SRP

**Fig. 4.7** F1 scores of Sink. with varying *l* value. (**a**) On R-DBP. (**b**) On G-SRP

**On Effectiveness of Exerting Global Constraints** Next, we compare and discuss the methods that exert global constraints on the embedding matching process, i.e., Hun., SMat, and RL.

It is evident that *Hun. is the most performant approach, as it fits well with the present EA setting* and can secure an optimal solution toward maximizing the sum of pairwise scores. Specifically, the current EA setting has two notable assumptions (cf. Sect. 4.2.3). With these two assumptions, EA can be transformed into the linear assignment problem, which aims to maximize the sum of pairwise scores under the 1-to-1 constraint [29]. Thus, the algorithms for solving the linear assignment problem, e.g., Hun., can attain remarkably high performance on EA. However, these two assumptions do not necessarily hold on all occasions, which could influence the effectiveness of Hun. For instance, as revealed in Pattern 2, on sparse datasets (e.g., SRPRS), the neighboring structures of some equivalent entities are likely to be different, where the effectiveness of Hun. is limited. In addition, the 1-to-1 alignment constraint is not necessarily true in practice, which will be discussed in Sect. 4.5.

In comparison, SMat merely aims to attain a stable matching, where the resultant entity pairing could be sub-optimal under present evaluation setting. RL, on the other hand, relaxes the 1-to-1 constraint and only deviates slightly from the greedy matching, and hence the results are not very promising.

**Overall Comparison and Conclusion** Finally, we compare the algorithms all together and draw the following conclusions *under the 1-to-1 alignment setting*: (1) The best performing methods are Hun. and Sink. Nevertheless, they have low scalability. (2) CSLS and RInf achieve the best balance between effectiveness and efficiency. While CSLS is more efficient, RInf is more effective. (3) SMat and RL tend to attain better results when the accuracy of the pairwise scores is high. Nevertheless, they require relatively more time.



**Combining Embedding Matching Algorithms** As described above, CSLS, RInf, and Sink. mainly improve the computation of pairwise scores, while Hun., SMat, and RL exert the global constraints during the matching process. Thus, by using the EntMatcher library, we aim to investigate whether the combination of these strategies would lead to better matching performance.

The results are reported in Table 4.6, where we can observe that: (1) Hun. is already effective given the raw pairwise scores, and using CSLS or RInf to improve the pairwise scores would not change or would even bring down the performance. (2) For SMat and RL, using CSLS or RInf to improve the raw pairwise scores would consistently lead to better results. (3) Looking from the other side, while applying Hun., SMat, and RL upon CSLS improves its performance, such additional operations bring down the results of RInf. This is because, by modeling entity preference and converting to rankings, RInf has already lost the information contained in the original pairwise scores, and exerting global constraints upon the reciprocal preference scores is no longer beneficial.

Hence, we can conclude that, generally speaking, *combining the algorithms (designed for different embedding matching stages) would lead to better alignment results*, except for the combinations involving RInf and Hun.

**Alignment Results Analysis** We further analyze the sets of matched entity pairs produced by the compared algorithms. Specifically, we examine the difference of the correct results found by different methods and report the pairwise *difference ratios* in the heat map of Fig. 4.8. The *difference ratio* is defined as $|C - R| / |C|$, where $C$ and $R$ denote the correct aligned entity pairs produced by the corresponding algorithms in the column and row, respectively.
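The difference ratio can be computed directly with set operations. The sketch below is illustrative only: the function name, the toy entity pairs, and the choice to return 0 for an empty column set are assumptions rather than part of the evaluation code.

```python
def difference_ratio(correct_col, correct_row):
    """Share of the column method's correct pairs that the row method misses.

    Implements |C - R| / |C|, where C and R are sets of correctly aligned
    (source, target) pairs. Returns 0.0 when C is empty (an assumption).
    """
    c, r = set(correct_col), set(correct_row)
    if not c:
        return 0.0
    return len(c - r) / len(c)


# Hypothetical correct matches found by two algorithms.
C = {("a", 1), ("b", 2), ("c", 3), ("d", 4)}
R = {("a", 1), ("b", 2), ("e", 5)}

ratio_c_vs_r = difference_ratio(C, R)  # two of the four pairs in C are missing from R
ratio_r_vs_c = difference_ratio(R, C)  # one of the three pairs in R is missing from C
print(ratio_c_vs_r, ratio_r_vs_c)
```

Note that the measure is asymmetric, which is why the heat map is not symmetric about its diagonal.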

From Fig. 4.8a, we can observe that: (1) The elements in the matrix (except those on the secondary diagonal) are above 0, showing that these matching algorithms produce complementary correct matches. (2) The results of Hun. and Sink., CSLS and RInf, and SMat and RL are similar; that is, the algorithms in each pair produce very similar correct matches (i.e., with low difference ratios and light colors). (3) The columns of Hun., Sink., and RInf have relatively darker colors, revealing that they tend to discover the matches that other methods fail to detect.

We further select three representative methods, i.e., RInf, Hun., and SMat, and provide a more detailed analysis in Fig. 4.8b. It is obvious that these approaches do produce complementary results, and *it calls for an ensemble framework that integrates the alignment results produced by different matching algorithms.* 

#### **4.5 New Evaluation Settings**

In this section, we conduct experiments on settings that can better reflect real-life challenges.

#### *4.5.1 Unmatchable Entities*

Current EA literature largely overlooks the unmatchable issue, where a KG contains entities that the other KG does not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are film-related and possibly have equivalent entities in IMDB, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB [59]. Hence, we aim to evaluate the embedding matching algorithms in terms of dealing with unmatchable entities.

**Datasets and Evaluation Settings** Following [54], we adapt the KG pairs in DBP15K to include unmatchable entities, resulting in DBP15K+. The detailed construction procedure can be found in [54].

As for the evaluation metric, we follow the main experimental setting and adopt the *F1 score*. Unlike 1-to-1 alignment, there exist unmatchable entities in this adapted dataset, and the precision and recall values are not necessarily equivalent, since some methods would also align unmatchable entities.

Notably, the original setting of SMat and Hun. requires that the numbers of entities on the two sides are equal. Thus, we add dummy nodes on the side with fewer entities to restore such a setting and then apply SMat and Hun. The corresponding results are reported in Table 4.7.
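A minimal sketch of the dummy-node idea is given below. It pads a rectangular score matrix to a square one so that a 1-to-1 solver can be applied, and afterwards discards pairs that involve a dummy node. The score given to dummy nodes (`dummy_score=0.0`) and the helper names are assumptions made for illustration; the text above does not specify them.

```python
def pad_with_dummies(scores, dummy_score=0.0):
    """Pad a (n_src x n_tgt) score matrix with dummy rows or columns until it is square."""
    n_src = len(scores)
    n_tgt = len(scores[0]) if scores else 0
    size = max(n_src, n_tgt)
    # Extend every real row with dummy target columns.
    padded = [list(row) + [dummy_score] * (size - n_tgt) for row in scores]
    # Append dummy source rows if the source side is the smaller one.
    padded += [[dummy_score] * size for _ in range(size - n_src)]
    return padded


def drop_dummy_pairs(pairs, n_src, n_tgt):
    """Keep only the matched pairs in which both entities are real."""
    return [(i, j) for i, j in pairs if i < n_src and j < n_tgt]


# Three source entities, two target entities (hypothetical scores).
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.4],
]
padded = pad_with_dummies(scores)

# Suppose a 1-to-1 solver returned this assignment on the padded matrix.
solver_output = [(0, 0), (1, 1), (2, 2)]
kept = drop_dummy_pairs(solver_output, n_src=3, n_tgt=2)
print(padded, kept)
```

Under this reading, a source entity assigned to a dummy target is simply left unaligned, which is one way the padding can absorb unmatchable entities.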

**Alignment Results** Table 4.7 shows that Hun. attains the best results, followed by SMat. The superior results are partially due to the addition of dummy nodes, which could mitigate the unmatchable issue to a certain degree. The results of RInf and Sink. are close, outperforming CSLS and RL. DInf still achieves the worst performance.

Besides, by comparing the results on DBP15K+ with those on the original dataset DBP15K (cf. Table 4.3), we observe that: (1) After including the unmatchable entities, the F1 scores drop for all methods. This is because most current embedding matching algorithms are greedy, i.e., they retrieve a target entity for each source entity (including the unmatchable ones), which leads to a very low precision. For the rest of the methods, e.g., Hun. and SMat, the unmatchable entities also mislead the matching process and thus affect the final results. (2) Unlike on DBP15K, where the performance of Sink. and Hun. is close, on DBP15K+, Hun. largely outperforms Sink., as Hun. does not necessarily align a target entity to each source entity and has a higher precision. (3) *Overall, existing algorithms for matching KGs in entity embedding spaces lack the capability of dealing with unmatchable entities.*

#### *4.5.2 Non-1-to-1 Alignment*

Next, we study the setting where the source and target entities do not strictly conform to the 1-to-1 constraint, so as to better appreciate these matching algorithms for KGs in entity embedding spaces. Non-1-to-1 alignment is common in practice, especially when two KGs contain entities of different granularity, or one KG is noisy and involves duplicate entities. *To the best of our knowledge, we are among the first to identify and investigate this issue.*

**Table 4.8** The results on non-1-to-1 alignment dataset

**Dataset Construction** Present EA benchmarks are constructed according to the 1-to-1 constraint. Thus, in this work, we establish a new dataset that involves non-1-to-1 alignment relationships. Specifically, we obtain the pre-annotated links<sup>11</sup> between Freebase [3] and DBpedia [1] and preserve the entities that are involved in 1-to-many, many-to-1, and many-to-many alignment relationships. Then, we retrieve the relational triples that contain these entities from respective KGs, which also introduces new entities.

Next, we detect the links among the newly added entities and add them into the alignment links. Finally, the resultant dataset, FB\_DBP\_MUL, contains 44,716 entities, 164,882 triples, and 22,117 gold links, among which 20,353 are non-1-to-1 links and 1,764 are 1-to-1 links.<sup>12</sup> The specific statistics are also presented in Table 4.2.

**Evaluation Settings** To keep the integrity of the links among entities, we sample the training, validation, and test sets from the gold links according to the principle that the links involving the same entity should not be distributed among different sets. The size of the final training, validation, and test sets is approximately 7:1:2. We compare the entity pairs produced by embedding matching algorithms against the gold test links and report the precision (P), recall (R), and F1 values.
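As a reference for how the reported metrics relate, the sketch below computes precision, recall, and F1 from a set of predicted entity pairs and a set of gold links. The entity identifiers are hypothetical, and returning 0 for empty inputs is an assumption.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted (source, target) pairs against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical non-1-to-1 gold links: source s1 has two equivalent targets.
gold = {("s1", "t1"), ("s1", "t2"), ("s2", "t3"), ("s3", "t4")}
# A matcher that returns exactly one target per source entity.
predicted = {("s1", "t1"), ("s2", "t3"), ("s3", "t9")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

The toy case also shows why recall suffers in this setting: a matcher that outputs one target per source entity cannot recover the second link of `s1`, regardless of how accurate its scores are.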

**Alignment Results** It is evident from Table 4.8 that, compared with 1-to-1 alignment, the results change significantly on the new dataset. Specifically: (1) RInf and CSLS attain the best F1 scores, although the results are not very promising (e.g., with F1 scores lower than 0.1 when using GCN). (2) Sink. and Hun. achieve much worse results compared with their performance on 1-to-1 alignment datasets. (3) The results of SMat and RL are even inferior to those of the simple baseline DInf. The main reason accounting for these changes is that *the non-1-to-1 alignment links pose great challenges to existing embedding matching algorithms.* Specifically, DInf, CSLS, RInf, Sink., and RL only align *one* target entity (the one that possesses the highest score) to a given source entity and fail to discover other alignment links that also involve this source entity. SMat and Hun. impose the 1-to-1 constraint during the matching process, which falls short in the non-1-to-1 setting and thus leads to inferior results. Therefore, this calls for the study of embedding matching algorithms targeted at non-1-to-1 alignment.

<sup>11</sup> https://www.dbpedia.org/blog/dbpedia-is-now-interlinked-with-freebase-links-to-opencycupdated/.

<sup>12</sup> FB\_DBP\_MUL is publicly available at https://github.com/DexterZeng/EntMatcher.

**Fig. 4.9** F1 scores with varying *k* value on FB\_DBP\_MUL. (**a**) With GCN. (**b**) With RREA

**Discussion of the** *k* **Value in CSLS and RInf** In Fig. 4.9, we report the performance of CSLS and RInf given varying *k* values on FB\_DBP\_MUL. It shows that, generally, a larger *k* leads to better results. This is because, on the non-1-to-1 setting, an entity is likely to be matched to several entities on the other side, where it is more appropriate to consider the top-*k* values, rather than the sole maximum value, when refining the pairwise scores.
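To illustrate where *k* enters, the following sketch applies the commonly used CSLS rescaling to a small score matrix: each score is doubled and then penalized by the mean of the top-*k* scores of its row and of its column. This is a plain-Python illustration under that standard formulation, with hypothetical scores; it is not the EntMatcher implementation.

```python
def top_k_mean(values, k):
    """Mean of the k largest values (all values if fewer than k are available)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls(scores, k):
    """Rescale scores as 2 * s(i, j) - r_src(i) - r_tgt(j).

    r_src(i) is the mean of the top-k scores in row i, and r_tgt(j) is the
    mean of the top-k scores in column j.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    r_src = [top_k_mean(row, k) for row in scores]
    r_tgt = [top_k_mean([scores[i][j] for i in range(n_src)], k) for j in range(n_tgt)]
    return [
        [2 * scores[i][j] - r_src[i] - r_tgt[j] for j in range(n_tgt)]
        for i in range(n_src)
    ]


def argmax_rows(matrix):
    """Index of the best-scoring column for every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


scores = [
    [0.90, 0.50],
    [0.80, 0.75],
]
raw_choice = argmax_rows(scores)
csls_choice_k1 = argmax_rows(csls(scores, k=1))
csls_choice_k2 = argmax_rows(csls(scores, k=2))
print(raw_choice, csls_choice_k1, csls_choice_k2)
```

With `k=1` only the single best neighbor of each entity contributes to the penalty, whereas a larger `k` averages over several strong candidates, which is the behavior the discussion above associates with the non-1-to-1 setting.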

#### **4.6 Summary and Future Direction**

In this section, we summarize the observations and insights made from our evaluation and provide possible future research directions.

**(1) The Investigation into Matching KGs in Embedding Spaces Has Not Yet Made Substantial Progress** Although there are a few algorithms tailored for matching KGs in embedding spaces, e.g., CSLS, RInf, and RL, under the most popular EA evaluation setting (with 1-to-1 alignment constraint), they are outperformed by the classic general matching algorithms, i.e., Hun.. Hence, there is still much room for improving matching KGs in embedding spaces.

**(2) No Existing Embedding Matching Algorithm Prevails Under All Experimental Settings** The strategies designed to solve the linear assignment problem attain the best performance under the 1-to-1 setting, while they fall short on more practical and challenging scenarios since the new settings (e.g., non-1-to-1 alignment) no longer align with the conditions of these optimization algorithms. Similarly, although the methods for improving the computation of pairwise scores achieve superior results in the non-1-to-1 alignment scenario, they are outperformed by other solutions under the unmatchable setting. Therefore, each evaluation setting poses its own challenge to the embedding matching process, and currently there is no consistent winner.

**(3) The Adaptation from General Matching Algorithms Requires Careful Design** Among the embedding matching algorithms, Hun. and SMat are general matching algorithms that have been applied to many other related tasks. Although directly adopting these general strategies to tackle EA is simple and effective, they might well fall short in some scenarios, as the alignment on KGs possesses its own challenges, e.g., the matching is not necessarily 1-to-1 constrained, or the pairwise scores are inaccurate. Thus, it is suggested to take full account of the characteristics of the alignment settings when adapting other general matching algorithms to cope with matching KGs in entity embedding spaces.

**(4) The Scalability and Efficiency Should Be Brought to Attention** Existing advanced embedding matching algorithms have poor scalability, due to the additional resource-consuming operations that contribute to the alignment performance, such as the ranking process in RInf and the 1-to-1 constraint exerted by Hun. and SMat. Besides, space efficiency is also a critical issue. As shown in Sect. 4.4.4, most of the approaches have rather high memory costs given large-scale datasets. Therefore, considering that in practice there are many more entities, the scalability and efficiency issues should be considered during the algorithm design.

**(5) The Practical Evaluation Settings Are Worth Further Investigation** Under the unmatchable and non-1-to-1 alignment settings, the performance of existing algorithms is not promising. A possible future direction is to introduce the notion of probability and leverage the probabilistic reasoning frameworks [19, 39], which have higher flexibility, to produce the alignment results.

## **4.7 Conclusion**

This chapter conducts a comprehensive survey and evaluation of matching algorithms for KGs in entity embedding spaces. We evaluate seven state-of-the-art strategies in terms of effectiveness and efficiency on a wide range of datasets, including two experimental settings that better mirror real-life challenges. We identify the strengths and weaknesses of these algorithms under different settings. We hope the experimental results will help researchers and practitioners put forward more effective and scalable embedding matching algorithms.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part III Novel Approaches**

## **Chapter 5 Large-Scale Entity Alignment**

**Abstract** In this chapter, we focus on the concept of entity alignment at scale and present a new method for addressing this task. The proposed solution is capable of handling vast amounts of knowledge graph pairs and delivering high-quality alignment outcomes. First, to manage large-scale KG pairs, we develop a set of seed-oriented graph partition strategies that divide them into smaller subgraph pairs. Next, within each subgraph pair, we employ existing methods to learn unified entity representations and introduce a novel reciprocal alignment inference strategy to model bidirectional alignment interactions, which can lead to more accurate outcomes. To further enhance the scalability of reciprocal alignment inference, we propose two variant strategies that can significantly reduce memory and time costs, albeit at the expense of slightly reduced effectiveness. Our solution is versatile and can be applied to existing representation learning-based EA models to enhance their ability to handle large-scale KG pairs. We also create a new EA dataset that comprises millions of entities and conduct comprehensive experiments to verify the efficiency of our proposed model. Furthermore, we compare our proposed model against state-of-the-art baselines on popular EA datasets, and our extensive experiments demonstrate its effectiveness and superiority.

## **5.1 Introduction**

Figure 5.1 describes a toy example of EA. Typically, state-of-the-art EA solutions follow a two-stage pipeline: *representation learning* and *alignment inference*. Most current works [4, 6, 33, 35] are dedicated to the former and leverage various KG embedding models, e.g., TransE [2] and graph convolutional network (GCN) [15], to learn the representations of entities. By using seed entity pairs as reference points, the entity embeddings of different KGs are projected onto a common embedding space. This allows for the measurement of similarity or distance<sup>1</sup> between entities from different KGs by assessing the similarity or distance between data points in the unified embedding space.

<sup>1</sup> In the rest of the chapter, we may use "distance" and "similarity" interchangeably, with obvious adaptation.

<sup>©</sup> The Author(s) 2023

X. Zhao et al., *Entity Alignment*, Big Data Management, https://doi.org/10.1007/978-981-99-4250-3\_5

**Fig. 5.1** An example of EA. There is an English and a Spanish KG concerning the band *The national* in the figures. The aim of EA is to find equivalent entities in these KGs using the KG structure, e.g., [A. Dessner]en and [A. Dessner]es. The left denotes the alignment results generated by current EA solutions that perform direct alignment inference based on structural similarity, where both [A. Dessner]en and [B. Dessner]en are aligned to [A. Dessner]es. In comparison, the right denotes the results generated by our proposed reciprocal alignment inference, where [A. Dessner]en is aligned to [A. Dessner]es, while [B. Dessner]en is matched with [B. Dessner]es. (**a**) Results of direct alignment inference using structural similarity. (**b**) Results of reciprocal alignment inference using entity preference

Once the entities have been projected onto the unified embedding space, the *alignment inference* stage involves predicting the alignment results using these embeddings. When given an entity from the source KG, most state-of-the-art solutions adopt the *direct* alignment inference strategy. This strategy involves ranking the entities in the target KG based on a specific similarity measure between entity embeddings. The top-ranked target entity is then considered a match for the source entity.
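The direct alignment inference strategy can be sketched in a few lines. The example below assumes cosine similarity as the measure and uses made-up two-dimensional embeddings whose names loosely follow Fig. 5.1; real embeddings would come from a representation learning model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direct_inference(source_emb, target_emb):
    """Align every source entity to its most similar target entity."""
    alignment = {}
    for src, src_vec in source_emb.items():
        alignment[src] = max(target_emb, key=lambda tgt: cosine(src_vec, target_emb[tgt]))
    return alignment


# Hypothetical embeddings in the unified space.
source_emb = {"A_en": [1.0, 0.1], "B_en": [0.9, 0.3]}
target_emb = {"A_es": [1.0, 0.2], "B_es": [0.6, 0.8]}

alignment = direct_inference(source_emb, target_emb)
print(alignment)
```

Because each source entity is handled independently, nothing prevents two source entities from being aligned to the same target entity, as happens with these toy vectors.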

Despite the improvements made by current techniques in boosting the precision of EA, these sophisticated models typically involve a substantial number of parameters and demand significant computational resources. Therefore, scalability is compromised in achieving the improvement, and these approaches are not suitable for handling large KGs in practice. For instance, it is reported in [50] that, on the DWY100K dataset with 200,000 entities [33], the time cost for most state-of-the-art solutions is over 20,000 seconds, and some approaches [39, 52] cannot even produce the alignment results. Thus, the existence of real-life KGs that consist of tens of millions of entities creates a significant obstacle for current EA solutions, necessitating research on *large-scale entity alignment*. The investigation of large-scale EA aligns with the current trend of responsible design, development, use, and oversight of automated decision systems in the data management community [29].

Drawing inspiration from traditional graph partitioning strategies [12, 14], a feasible technique is to divide large KG pairs into several smaller subgraph pairs and then perform entity alignment on them. However, partitioning KG pairs for alignment is a challenging task that must achieve two objectives: (1) preserving the original structure of the KG as much as possible and (2) ensuring that the partition results of the source and target KGs match, meaning that equivalent entities in the source and target KGs are placed in the same subgraph pair. Although the first objective can be accomplished by modifying classical graph partitioning techniques such as METIS [13], the second objective is specific to the alignment task.

To achieve the second objective, we can use the seed entity pairs to guide the partition process. Seed entity pairs are pre-labeled entity pairs in which the entities are equivalent and are used to link two individual KGs. Ideally, if we can preserve the seed entity pairs during the partition and distribute them among the smaller subgraph pairs, the remaining (unknown) equivalent entities would have a greater likelihood of being placed in the same subgraph pair using these seed entity pairs as references, as equivalent entities usually have similar neighboring structures. Following this idea, there is a preliminary approach METIS-CPS (shortened as CPS) proposed by a concurrent work [10]. The proposed approach first partitions one KG into subgraphs. Then, based on the distribution of seed entities, it assigns appropriate weights to the edges in the other KG, and the partition is performed on this KG. However, it can be challenging for methods of this type (referred to as *unidirectional*  partition strategies) to achieve the first objective because the partitioning of the second KG is limited by the requirement to maintain the seed links, which may compromise the structure of the KG to some extent.

To address this issue, this chapter proposes the Seed-oriented Bidirectional graph Partition framework, SBP, which aims to satisfy both objectives by conducting bidirectional partitions and aggregating the partition results from the source-to-target and target-to-source directions. The motivation behind this approach is that the subgraphs generated from partitioning the first KG tend to have more complete structures, while the subgraphs generated from partitioning the second KG mainly retain alignment signals. By performing bidirectional partitions and combining the subgraphs, the resulting subgraphs in each KG can have both complete structures and larger numbers of seed entities pointing to the subgraphs in the opposite KG, which can lead to more precise alignment results. Note that SBP can be used with various *unidirectional* partitioning strategies. Additionally, an iterative variant of SBP, I-SBP, is proposed to improve partition performance by incorporating confident alignment results from previous rounds into the seed entity pairs.

During the partition process, the accuracy of alignment results may be compromised because equivalent entities could be placed in different subgraph pairs, and the original KG structure information may also be lost to some extent. To improve alignment performance, we propose to enhance the alignment inference stage, which has received little attention in previous work. Specifically, we introduce a reciprocal alignment inference strategy. The idea of reciprocal modeling of the alignment process is motivated by the fact that the commonly used direct alignment inference approach (1) considers an entity's preference toward entities on the other side via a similarity score, but neglects other influential factors, and (2) fails to integrate bidirectional preference scores or capture the mutual preferences of entities when making alignment decisions. Such an alignment inference strategy tends to produce many inaccurate results, as illustrated in the following example.

*Example* As shown in Fig. 5.1a, using the structural information, the direct alignment inference strategy would align both [A. Dessner]en and [B. Dessner]en to [A. Dessner]es, since [A. Dessner]es is the entity that has the most similar structural information with them (connected to three entities, including [The national]en/es).

However, this direct inference approach overlooks the fact that entities' preferences are not solely determined by the similarity score but also by the impact of alignment in the reverse direction. For instance, it is evident that [A. Dessner]es has higher similarity with [A. Dessner]en than [B. Dessner]en since it shares more neighboring information with [A. Dessner]en. Under this circumstance, [B. Dessner]en will lower its preference toward [A. Dessner]es, since in its view, although [A. Dessner]es is its most preferred candidate in terms of similarity, they are less likely to form a match because [A. Dessner]es has a higher similarity with [A. Dessner]en.

Therefore, by modeling and aggregating the bidirectional preferences as depicted in Fig. 5.1b, we could avoid matching [B. Dessner]en with [A. Dessner]es and possibly help identify its correct equivalent entity [B. Dessner]es.

Specifically, we propose to model the entity alignment task as a reciprocal recommendation process [18, 27], which takes effect at two levels: (1) *Entity preference modeling.* It first incorporates the influence of the alignment in the reverse direction into an entity's preference, so as to generate more accurate preference scores. (2) *Bidirectional preference integration.* It integrates bidirectional preferences to generate a reciprocal preference matrix that encodes the mutual preferences of entities on both sides. Experimental results have shown that the two-level reciprocal modeling approach achieves superior results compared to direct inference (to be detailed in Sect. 5.7).
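The two levels can be illustrated with a deliberately simplified sketch. The preference formula (a similarity discounted by the best rival score, shifted by 1) and the plain averaging of the two directions are assumptions chosen for brevity; the formulation in Sect. 5.4 may differ, and this sketch performs no ranking of preference scores. The similarity values are hypothetical.

```python
def reciprocal_scores(sim):
    """Build a reciprocal preference matrix from a source-by-target similarity matrix."""
    n_src, n_tgt = len(sim), len(sim[0])
    row_max = [max(row) for row in sim]  # best score of each source entity
    col_max = [max(sim[i][j] for i in range(n_src)) for j in range(n_tgt)]  # best score of each target
    reciprocal = []
    for i in range(n_src):
        row = []
        for j in range(n_tgt):
            # Level 1: source i lowers its preference for target j if j has a better suitor.
            src_pref = sim[i][j] - col_max[j] + 1.0
            # Level 1 (reverse direction): target j lowers its preference if i has a better option.
            tgt_pref = sim[i][j] - row_max[i] + 1.0
            # Level 2: integrate both directions (simple average, an assumption).
            row.append((src_pref + tgt_pref) / 2)
        reciprocal.append(row)
    return reciprocal


def argmax_rows(matrix):
    """Index of the best-scoring column for every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Rows: [A. Dessner]en, [B. Dessner]en; columns: [A. Dessner]es, [B. Dessner]es.
sim = [
    [0.90, 0.50],
    [0.80, 0.75],
]
direct = argmax_rows(sim)
reciprocal = argmax_rows(reciprocal_scores(sim))
print(direct, reciprocal)
```

On these values, direct inference aligns both source entities to the first target, while the reciprocal scores steer the second source entity to the second target, mirroring the contrast between Fig. 5.1a and Fig. 5.1b.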

We further notice that while the reciprocal inference approach achieves superior alignment performance, it also consumes more memory space and time compared to direct alignment inference. Therefore, to improve the efficiency, we propose two variants: no-ranking aggregation and progressive blocking, which approximate the reciprocal alignment inference. While the former removes the time- and resource-consuming ranking process during the preference aggregation process, the latter divides the entities into multiple blocks and performs alignment within each block. These variant strategies can significantly reduce the memory and time costs associated with the reciprocal alignment inference, albeit at the cost of a slight decrease in effectiveness.

The proposed techniques form a novel and scalable solution for Large-scale entIty alignMEnt, namely, LIME. Notably, LIME is model-agnostic and can be used with any entity representation learning model. In this work, we use the commonly used GCN model [15] and the state-of-the-art RREA model [22] for empirical evaluation. To validate the effectiveness of LIME, we create a large EA dataset, FB\_DBP\_2M, with millions of entities and tens of millions of facts. Experimental results demonstrate that LIME can effectively handle EA at scale while remaining reasonably effective and efficient. We also compare LIME against state-of-the-art solutions on three mainstream datasets, showing that LIME can achieve promising results even on small-scale datasets.

**Contributions** The main contributions of this chapter are the following:


**Organization** In Sect. 5.2, we present the outline of LIME. In Sect. 5.3, we introduce the partition strategies. In Sect. 5.4, we introduce the reciprocal alignment inference strategy. In Sect. 5.5, we introduce the variants of reciprocal alignment inference. In Sects. 5.6 and 5.7, we introduce the experimental settings and results, respectively. In Sect. 5.8, we introduce related work, followed by conclusion in Sect. 5.9.

## **5.2 Framework**

We present the overall framework of our proposal, LIME, in Fig. 5.2.



**Fig. 5.2** The framework of our proposal. The entities in gray represent the seed entities. The corresponding seed entities are connected by dotted lines in the left of the figure

<sup>2</sup> Notice that LIME is agnostic to the choice of structural learning models.

#### **5.3 Partition Strategies for Entity Alignment**

To handle large-scale input KGs, a common approach is to partition the KGs and parallelize the computation across a distributed cluster of machines [3]. In this work, we adopt this approach and propose to partition KGs into smaller subgraphs, align entities in each subgraph pair, and aggregate the alignment results in each partition to produce the final aligned entity pairs.

We leverage the commonly used graph partition tool, METIS [13], as the basic partition strategy. The algorithms in METIS are based on multilevel graph partitioning [12, 14], which reduces the graph size by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. The aim is to create a balanced vertex partition that equitably divides the set of vertices into multiple partitions while minimizing the number of edges spanning the partitions. In the case of EA, however, there are two separate large-scale graphs and a small number of seed entity pairs connecting them. One option is to regard the two graphs, interlinked by the seed entity pairs, as a single graph and forward it to METIS for partitioning. Nevertheless, this approach is likely to generate subgraphs that contain only source or only target entities, which is contrary to the goal of EA, namely, identifying equivalent entities between KGs. Therefore, we use seed-oriented graph partition strategies in this work.

In this section, we first introduce the seed-oriented *unidirectional* graph partition strategy as the baseline model. Then, we describe our proposed bidirectional partition framework and its iterative variants.

#### *5.3.1 Seed-Oriented Unidirectional Graph Partition*

Unidirectional graph partition strategies for EA conduct only a one-way partition (e.g., source-to-target) of KG pairs using the seed entity pairs. Formally, they partition the source KG $\mathrm{KG}_s$ and the target KG $\mathrm{KG}_t$ into $k$ subgraph pairs $\Phi = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k\}$, where each subgraph pair $\mathcal{C}_i = \{\mathrm{KG}_s^i, \mathrm{KG}_t^i, \mathcal{S}^i\}$ contains a source subgraph $\mathrm{KG}_s^i$ and a target subgraph $\mathrm{KG}_t^i$, as well as a number of seed entity pairs $\mathcal{S}^i$ connecting the subgraphs. Specifically, in this work, we adopt a state-of-the-art unidirectional partition strategy, CPS [10], as the baseline model.

CPS first directly partitions the source KG into $k$ subgraphs $\Phi_s = \{\mathrm{KG}_s^1, \ldots, \mathrm{KG}_s^k\}$ using METIS. Each source subgraph $\mathrm{KG}_s^i$ contains $\varepsilon_i$ source entities $\mathcal{S}_s^i = \{u_1^i, \ldots, u_{\varepsilon_i}^i\}$ from the seed entity pairs $\mathcal{S}$. To partition the target KG $\mathrm{KG}_t$, CPS still uses METIS, but with two modifications: (1) it assigns higher weights to edges among seed target entities whose corresponding source entities are in the same subgraph, which encourages METIS to place these seed target entities in the same subgraph while retaining the overall KG structure; and (2) it assigns weight 0 to edges among seed target entities whose corresponding source entities are from different subgraphs, say $\mathcal{S}_s^i$ and $\mathcal{S}_s^j$, which discourages placing these seed target entities in the same subgraph, as their corresponding seed source entities are not in the same subgraph. Partitioning the target KG also results in $k$ subgraphs $\Phi_t = \{\mathrm{KG}_t^1, \ldots, \mathrm{KG}_t^k\}$. Then, for each source subgraph $\mathrm{KG}_s^i$, CPS retrieves the target subgraph $\mathrm{KG}_t^{\ast}$ that possesses the largest number of target entities $\mathcal{S}_t^{\ast}$ corresponding to the seed source entities $\mathcal{S}_s^i$ in $\mathrm{KG}_s^i$ and considers them as a subgraph pair $\mathcal{C}_i = \{\mathrm{KG}_s^i, \mathrm{KG}_t^{\ast}, \mathcal{S}^{\ast}\}$, where $\mathcal{S}^{\ast}$ refers to the links connecting $\mathcal{S}_t^{\ast}$ and $\mathcal{S}_s^i$. We illustrate this process using the following example.

**Fig. 5.3** Illustration of the partition process. In each box, the solid line separates different subgraph pairs, while the dotted line differentiates the source subgraphs from the target ones

*Example* As shown in Fig. 5.3, there are two KGs to be aligned (i.e., $\mathrm{KG}_s$ and $\mathrm{KG}_t$), where the colored lines denote the links in seed entity pairs, and the seed entities are represented in gray. The entities with the same subscripts are equivalent.

CPS conducts a one-off source-to-target partition. It first partitions $\mathrm{KG}_s$, resulting in two source subgraphs shown in the left part of the box, which consist of $\{u_1, u_2, u_3, u_4\}$ and $\{u_5, u_6, u_7, u_8, u_9\}$, respectively. Next, when partitioning $\mathrm{KG}_t$, it increases the weight of the edge between $v_1$ and $v_4$ (resp., $v_6$ and $v_7$) since the seed source entities $u_1$ and $u_4$ (resp., $u_6$ and $u_7$) are in the same subgraph. Additionally, it sets the weight of the edge between $v_4$ and $v_7$ to 0. $\mathrm{KG}_t$ is thus partitioned into two subgraphs shown in the right part of the box, which consist of $\{v_1, v_2, v_3, v_4, v_5\}$ and $\{v_6, v_7, v_8, v_9\}$, respectively. Finally, using the seed entity pairs as anchors, it generates two subgraph pairs, i.e., $\mathcal{C}_1$ and $\mathcal{C}_2$.
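The two seed-oriented steps of CPS, i.e., re-weighting the target edges and pairing subgraphs by seed counts, can be sketched without calling METIS itself. In the sketch below, the partition results are supplied by hand, the seed set (including the pair $u_9$, $v_9$), the edge list, and the weight values `high_weight=10` and `default_weight=1` are all assumptions for illustration; the weighted edges would normally be handed to a graph partitioner.

```python
from collections import Counter


def cps_edge_weights(target_edges, seed_pairs, source_part, high_weight=10, default_weight=1):
    """Weight target edges according to where the corresponding source seeds were placed.

    seed_pairs maps a source seed to its target seed; source_part maps a
    source entity to the id of its source subgraph.
    """
    target_to_part = {t: source_part[s] for s, t in seed_pairs.items()}
    weights = {}
    for a, b in target_edges:
        if a in target_to_part and b in target_to_part:
            same = target_to_part[a] == target_to_part[b]
            # Same source subgraph: pull together; different subgraphs: weight 0.
            weights[(a, b)] = high_weight if same else 0
        else:
            weights[(a, b)] = default_weight
    return weights


def pair_subgraphs(source_part, target_part, seed_pairs):
    """Pair each source subgraph with the target subgraph holding most of its seed counterparts.

    Source subgraphs without any seed entity receive no partner in this sketch.
    """
    votes = {}
    for s, t in seed_pairs.items():
        votes.setdefault(source_part[s], Counter())[target_part[t]] += 1
    return {src_id: counter.most_common(1)[0][0] for src_id, counter in votes.items()}


# Assumed seed pairs and partitions, loosely following the example above.
seed_pairs = {"u1": "v1", "u4": "v4", "u6": "v6", "u7": "v7", "u9": "v9"}
source_part = {f"u{i}": 0 for i in (1, 2, 3, 4)}
source_part.update({f"u{i}": 1 for i in (5, 6, 7, 8, 9)})
target_edges = [("v1", "v4"), ("v6", "v7"), ("v4", "v7"), ("v2", "v3")]

weights = cps_edge_weights(target_edges, seed_pairs, source_part)

target_part = {f"v{i}": 0 for i in (1, 2, 3, 4, 5)}
target_part.update({f"v{i}": 1 for i in (6, 7, 8, 9)})
pairs = pair_subgraphs(source_part, target_part, seed_pairs)
print(weights, pairs)
```

The edge between `v4` and `v7` receives weight 0 because their source counterparts lie in different source subgraphs, which matches the behavior described in the example.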

#### *5.3.2 Bidirectional Graph Partition*

It is observed that in unidirectional partition strategies like CPS, the partition of the source KG can preserve its original structure well. However, the partition of the target KG is limited by the goal of retaining the seed entity pairs, which may lead to the destruction of the KG structure to some extent. As a solution, we propose a seed-oriented bidirectional graph partition framework, called SBP. The SBP framework first conducts the source-to-target partition using any unidirectional strategy, resulting in a set of source subgraphs ($\Phi_s^0$) and a set of target subgraphs ($\Phi_t^0$). Then, it conducts the partition process in the reverse direction, obtaining another set of source subgraphs ($\Phi_s^1$) and target subgraphs ($\Phi_t^1$). Next, it identifies and combines corresponding source subgraphs in $\Phi_s^0$ and $\Phi_s^1$, resulting in the aggregated set of source subgraphs ($\Phi_s$). Similarly, it generates the aggregated set of target subgraphs ($\Phi_t$). Finally, for each source subgraph ($\mathrm{KG}_s^i \in \Phi_s$), it retrieves the target subgraph ($\mathrm{KG}_t^{\ast} \in \Phi_t$) that possesses the largest number of seed target entities ($\mathcal{S}_t^{\ast}$) corresponding to seed source entities in $\mathrm{KG}_s^i$. It considers them as a subgraph pair ($\mathcal{C}_i = \{\mathrm{KG}_s^i, \mathrm{KG}_t^{\ast}, \mathcal{S}^{\ast}\}$) for alignment. The detailed process is presented in Algorithm 1 and the following example.

#### **Algorithm 1:** Bidirectional graph partition (SBP)

**Input:** $\mathrm{KG}_s$: source KG; $\mathrm{KG}_t$: target KG; $\mathcal{S}$: seed pairs.

**Output:** $\Psi = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$: $k$ subgraph pairs.

**1** Conduct source-to-target partition using any unidirectional partition strategy (e.g., CPS). Obtain $\Psi_s^0$ and $\Psi_t^0$;

**2** Conduct target-to-source partition using any unidirectional partition strategy (e.g., CPS). Obtain $\Psi_s^1$ and $\Psi_t^1$;

**3** $\Psi_s \leftarrow \emptyset$;

**4** **foreach** $\mathrm{KG}_s^i \in \Psi_s^0$ **do**

**5** &emsp;Identify the source subgraph $\mathrm{KG}_s^* \in \Psi_s^1$ that has the largest number of overlapping seed entities with $\mathrm{KG}_s^i$;

**6** &emsp;$\Psi_s \leftarrow \Psi_s \cup \{\mathrm{KG}_s^i \cup \mathrm{KG}_s^*\}$;

**7** $\Psi_t \leftarrow \emptyset$;

**8** **foreach** $\mathrm{KG}_t^i \in \Psi_t^0$ **do**

**9** &emsp;Identify the target subgraph $\mathrm{KG}_t^* \in \Psi_t^1$ that has the largest number of overlapping seed entities with $\mathrm{KG}_t^i$;

**10** &emsp;$\Psi_t \leftarrow \Psi_t \cup \{\mathrm{KG}_t^i \cup \mathrm{KG}_t^*\}$;

**11** $\Psi \leftarrow \emptyset$;

**12** **foreach** $\mathrm{KG}_s^i \in \Psi_s$ **do**

**13** &emsp;Retrieve the target subgraph $\mathrm{KG}_t^* \in \Psi_t$ that possesses the largest number of target seed entities $\mathcal{S}_t^*$ corresponding to the seed source entities $\mathcal{S}_s^i \subset \mathrm{KG}_s^i$;

**14** &emsp;$\mathcal{C}_i \leftarrow \{\mathrm{KG}_s^i, \mathrm{KG}_t^*, \mathcal{S}^*\}$;

**15** &emsp;$\Psi \leftarrow \Psi \cup \{\mathcal{C}_i\}$;

**16** **return** $\Psi$.

*Example* Continuing with the previous example, the SBP framework conducts the target-to-source partition, resulting in two target subgraphs comprising $\{v_1, v_2, v_3, v_4, v_7\}$ and $\{v_5, v_6, v_8, v_9\}$ and two source subgraphs comprising $\{u_1, u_2, u_3, u_4, u_5, u_7\}$ and $\{u_6, u_8, u_9\}$. Next, it identifies and combines corresponding source and target subgraphs generated by the source-to-target and target-to-source partition. For instance, based on the number of overlapping source seed entities, it identifies that the source subgraph comprising $\{u_1, u_2, u_3, u_4\}$ (resp., $\{u_5, u_6, u_7, u_8, u_9\}$) generated by the source-to-target partition and the source subgraph comprising $\{u_1, u_2, u_3, u_4, u_5, u_7\}$ (resp., $\{u_6, u_8, u_9\}$) generated by the target-to-source partition are corresponding. It combines them to generate the aggregated subgraph $\{u_1, u_2, u_3, u_4, u_5, u_7\}$ (resp., $\{u_5, u_6, u_7, u_8, u_9\}$). The target subgraphs are aggregated in the same way. Finally, using the seed entity pairs as anchors, it generates two subgraph pairs, as shown in the rightmost box.
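The aggregation and pairing steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: subgraphs are represented simply as sets of entity identifiers, the function names (`aggregate`, `pair_subgraphs`) and the toy entities `a1`…`a6`, `b1`…`b6` are hypothetical, and ties in the overlap count are broken by taking the first candidate, which the text does not specify.

```python
def seed_overlap(a, b, seeds):
    """Number of seed entities shared by two subgraphs (sets of entity ids)."""
    return len(a & b & seeds)


def aggregate(first, second, seeds):
    """Union each subgraph in `first` with its best-overlapping subgraph in `second`."""
    merged = []
    for sub in first:
        # Ties are resolved by list order (an assumption of this sketch).
        best = max(second, key=lambda cand: seed_overlap(sub, cand, seeds))
        merged.append(sub | best)
    return merged


def pair_subgraphs(src_subs, tgt_subs, seed_pairs):
    """Pair each source subgraph with the target subgraph holding most of its seed counterparts."""
    pairs = []
    for src in src_subs:
        wanted = {v for u, v in seed_pairs if u in src}
        tgt = max(tgt_subs, key=lambda cand: len(cand & wanted))
        anchors = {(u, v) for u, v in seed_pairs if u in src and v in tgt}
        pairs.append((src, tgt, anchors))
    return pairs


# Toy input: two partitions per side, as produced by the two partition directions.
seed_pairs = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}
src_seeds = {u for u, _ in seed_pairs}
tgt_seeds = {v for _, v in seed_pairs}

src0 = [{"a1", "a2", "a3"}, {"a4", "a5", "a6"}]   # source-to-target run
tgt0 = [{"b1", "b2"}, {"b3", "b4", "b5", "b6"}]
src1 = [{"a1", "a2"}, {"a3", "a4", "a5", "a6"}]   # target-to-source run
tgt1 = [{"b1", "b2", "b3"}, {"b4", "b5", "b6"}]

src_agg = aggregate(src0, src1, src_seeds)
tgt_agg = aggregate(tgt0, tgt1, tgt_seeds)
subgraph_pairs = pair_subgraphs(src_agg, tgt_agg, seed_pairs)
```

In this toy input, `a3` and `b3` end up in both aggregated subgraphs, which mirrors the redundancy that bidirectional partitioning introduces.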

As shown in Fig. 5.3, in the partition results of CPS, equivalent entities may be placed in different subgraph pairs, such as *u*<sup>5</sup> and *v*5. The SBP framework can effectively mitigate this issue by conducting bidirectional partitions and aggregating the results. Hence, while the partition results of the SBP framework may include redundant entities that exist in multiple subgraph pairs, it can still effectively decrease the instances where equivalent entities are allocated to different subgraph pairs.

**Merits of Bidirectional Partitioning** Notably, $\Psi_s^0$ (resp., $\Psi_t^1$) is generated with the aim of preserving the original KG structure, while $\Psi_s^1$ (resp., $\Psi_t^0$) is generated with the aim of both retaining the links and preserving the original KG structure. Consequently, the integration of subgraphs in $\Psi_s^0$ (and $\Psi_t^1$) with $\Psi_s^1$ (and $\Psi_t^0$) results in aggregated subgraphs in $\Psi_s$ and $\Psi_t$ that have a more comprehensive structure and a greater number of seed entities pointing to the subgraphs on the opposite side. This can ultimately lead to more precise alignment outcomes.

Moreover, unlike $\Psi_s^0$, $\Psi_s^1$, $\Psi_t^0$, or $\Psi_t^1$, where the subgraphs do not have common entities, the subgraphs in $\Psi_s$ and $\Psi_t$ overlap. This is comparable to redundancy-based methods in traditional entity resolution (ER) blocking techniques, where an entity can be assigned to multiple blocks [26]. Such redundancy is useful because the partitioning process may unavoidably assign equivalent entities to different subgraph pairs, which limits the upper bound of the alignment performance (as the alignment is only performed within each subgraph pair). Bidirectional partitioning raises this upper bound by assigning an entity to multiple subgraph pairs, as empirically validated in Sect. 5.7.3.

**Integration of Subgraph-Wise Alignment Results** As previously mentioned, the partition results produced by unidirectional strategies do not have redundancies, and therefore, the alignment outcomes can be obtained by directly merging subgraph-wise alignment results. However, since the subgraph pairs generated by SBP may contain overlapping entities, an additional result aggregation module is necessary to resolve any potential conflicts in the alignment outcomes. To address this, we adopt a straightforward *voting* strategy. Specifically, for a source entity that is aligned to multiple target entities by different subgraph pairs, we choose the target entity with the highest number of "votes" from the subgraph pairs as the final alignment outcome. If multiple target entities receive the same highest number of votes, we select the one with the lowest mutual preference rank (explained in Sect. 5.4.3) as the match.
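The voting step can be illustrated with a short sketch. The data layout is an assumption made for illustration: each subgraph pair contributes a dictionary mapping source entities to predicted target entities, and the mutual preference ranks are supplied as a dictionary keyed by entity pair (the name `mutual_rank` is hypothetical).

```python
from collections import Counter


def vote(predictions, mutual_rank):
    """Merge per-subgraph-pair alignments by majority vote.

    predictions: list of dicts {source: target}, one per subgraph pair.
    mutual_rank: dict {(source, target): mutual preference rank}; lower is better.
    """
    ballots = {}
    for result in predictions:
        for u, v in result.items():
            ballots.setdefault(u, Counter())[v] += 1

    final = {}
    for u, counts in ballots.items():
        top = max(counts.values())
        tied = [v for v, c in counts.items() if c == top]
        # Break ties with the lowest mutual preference rank.
        final[u] = min(tied, key=lambda v: mutual_rank.get((u, v), float("inf")))
    return final


predictions = [
    {"u1": "v1", "u2": "v2"},
    {"u1": "v1", "u2": "v3"},
    {"u1": "v4"},
]
mutual_rank = {("u2", "v2"): 2.0, ("u2", "v3"): 1.5}
merged = vote(predictions, mutual_rank)
```

Here `u1` is resolved by the vote count alone, while `u2` receives one vote for each of two candidates and falls back to the rank-based tie-break.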

#### *5.3.3 Iterative Bidirectional Graph Partition*

It is clear that the one-off partitioning approach tends to generate inaccurate partition results, where equivalent entities may be placed in different subgraph pairs, and the original KG structure information could also be partially lost. To address this issue, we propose an iterative framework called I-SBP, which performs the partitioning process for *γ* rounds based on the signals provided by the previous round. Specifically, in each iteration, we partition the KG into *k* subgraph pairs using SBP and perform entity alignment within each subgraph pair (detailed in the next section). We then aggregate the subgraph-wise alignment results to generate the final aligned entity pairs. Since the final alignment results include confident entity pairs, which can be considered as pseudo seeds according to previous studies [33, 45], we select these entity pairs using the bidirectional nearest neighbor search in [45] and add them to the seed entity pairs S to aid the partition in the next round. The process is detailed in Algorithm 2.

**Algorithm 2:** Iterative SBP (I-SBP)

**Input:** $\mathrm{KG}_s$: source KG; $\mathrm{KG}_t$: target KG; $\mathcal{S}$: seed pairs.

**Output:** $\Psi = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k\}$: $k$ subgraph pairs.

**1** $r \leftarrow 0$;

**2** **while** $r < \gamma$ **do**

**3** &emsp;Obtain $\Psi = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k\}$ via Algorithm 1;

**4** &emsp;Perform alignment in each subgraph pair and generate results;

**5** &emsp;Select confident alignment results and add them to $\mathcal{S}$;

**6** &emsp;$r \leftarrow r + 1$;

**7** **return** $\Psi$.
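The selection of confident pairs in Line 5 of Algorithm 2 relies on a bidirectional nearest neighbor search. The sketch below shows the general idea under simple assumptions: similarities are given as a nested list `sim[i][j]` between source entity `i` and target entity `j`, and a pair is kept only if each entity is the other's nearest neighbor. It is not necessarily the exact procedure of [45], and the function name is hypothetical.

```python
def mutual_nearest_pairs(sim):
    """Return (source, target) index pairs that are each other's nearest neighbor."""
    n_src, n_tgt = len(sim), len(sim[0])
    # Nearest target for every source entity (row-wise argmax).
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_src)]
    # Nearest source for every target entity (column-wise argmax).
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


sim = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
pseudo_seeds = mutual_nearest_pairs(sim)
```

Source entity 1 is left out because its nearest target (index 0) prefers source entity 0, so only the pairs that agree in both directions would be added to the seed set for the next round.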

#### *5.3.4 Complexity Analysis*

The time complexity of SBP is roughly double that of the unidirectional partition strategy it employs, while the time complexity of I-SBP is approximately $\gamma$ times that of SBP. We use CPS as the unidirectional partition strategy in this study, and its time complexity is $O\left(|\mathcal{S}| + (2k-1)\frac{|\mathcal{S}|^2}{k^2} + |E_s| + |E_t| + |T_s| + |T_t| + k\log k\right)$ [10], where $|E_s|$ (and $|E_t|$) and $|T_s|$ (and $|T_t|$) represent the number of entities and triples in the source (and target) KG, respectively, $|\mathcal{S}|$ refers to the number of seed entity pairs, and $k$ denotes the number of subgraph pairs.

Regarding space complexity, most unidirectional partition strategies need to store two knowledge graphs (KGs) simultaneously. However, for the SBP algorithm, bidirectional partitions are required, which necessitates the storage of four KGs. The space complexity of I-SBP is similar to that of SBP. In general, the space complexity of the partition process is determined by the size of the knowledge graphs involved.

#### *5.3.5 Discussion*

It is important to note that partition strategies are used to divide large-scale knowledge graph pairs into smaller ones so that state-of-the-art deep learning-based methods can be used to identify equivalent entities. However, the partition process can reduce the alignment performance as equivalent entities can be placed into different subgraph pairs. While this issue can be mitigated by improving partition strategies, it cannot be entirely avoided. Therefore, when dealing with small- or medium-sized datasets such as current entity alignment benchmarks, it may not be worthwhile to use partition strategies since partitioning would not significantly reduce computational costs while compromising alignment accuracy. This is also supported by empirical evidence from experiments conducted on the DWY100K dataset. Whether to use partition strategies ultimately depends on the alignment goal, i.e., efficiency or effectiveness. In this work, we follow previous works and do not employ partition strategies when dealing with small- or medium-sized EA datasets, except for the analysis of partition strategies.

#### **5.4 Reciprocal Alignment Inference**

After partitioning large-scale knowledge graph pairs into smaller ones, we perform alignment on each subgraph pair and combine the alignment results. In this section, we provide a brief overview of the representation learning process. Additionally, we propose a reciprocal inference strategy (illustrated in Fig. 5.4) that takes into account the mutual interactions between bidirectional alignments to enhance the alignment inference process.

#### *5.4.1 Entity Structural Representation Learning*

The entity structural representation learning phase aims to model the structural characteristics of entities and project them from different knowledge graphs into a unified embedding space. In this space, the similarity between entities can be directly inferred by comparing their structural embeddings. Most state-of-the-art EA solutions focus on improving this phase by designing advanced structural representation learning models. However, our focus in this work is to enhance the alignment inference process and the capability of EA models to handle large-scale datasets. As such, our proposed model, LIME, is agnostic to the choice of structural learning models. We adopt a state-of-the-art embedding learning model for EA, RREA [22], which reflects entity representations along different relational hyperplanes to construct relation-specific entity embeddings for alignment. More model and implementation details can be found in the original paper. Besides, to demonstrate that LIME is generic and can be applied to existing representation learning models, we also adopt the most commonly used model in the EA literature, GCN [15, 38], as the baseline model. Relevant experimental evaluations can be found in Sect. 5.7.

**Fig. 5.4** An example of the preference modeling and aggregation. (**a**) Similarity matrix, (**b**) preference matrix, (**c**) ranking matrix, (**d**) reciprocal matrix

#### *5.4.2 Preference Modeling*

Once we have obtained the unified entity representations, we can infer the alignment results based on entity preferences. Specifically, for each source entity, we predict its most preferred target entity as its equivalent entity.

**Direct Alignment Inference** Previous studies only considered the similarity between entity representations to model entity preferences. We refer to this as *direct alignment inference*. Given an entity pair $(u, v)$, $u \in \mathcal{E}_s$, $v \in \mathcal{E}_t$, their similarity score is denoted as $\operatorname{sim}(\mathbf{u}, \mathbf{v})$,<sup>3</sup> where $\mathbf{u}$ and $\mathbf{v}$ are the entity embeddings of $u$ and $v$, respectively. The corresponding similarity matrix is denoted as $S$. For direct alignment inference, the preference score of $u$ toward $v$ is defined as:

$$p_{u,v} = \operatorname{sim}(\mathbf{u}, \mathbf{v}).\tag{5.1}$$

According to this definition, the preference score of *u* toward *v* is the same as the preference score of *v* toward *u*, i.e., $p_{u,v} = p_{v,u}$, since similarity measures are usually symmetric and do not differentiate between the two input elements.

**Reciprocal Preference Modeling** We believe that to accurately model entity preferences, an entity's preference score toward another entity should also consider the likelihood of a match between them. For instance, as can be observed from Fig. 5.1, for [B. Dessner]en, despite the high similarity score, it might have a low preference toward [A. Dessner]es, since in its view, they are less likely to form a match (considering that [A. Dessner]es has a higher similarity with [A. Dessner]en). Theoretically, a source (target) entity would prefer the target (source) entities that have high similarities with it *and* meanwhile low similarities with other source (target) entities. In this connection, we define the preference score of *u* toward *v* as:

$$p_{u,v} = \operatorname{sim}(\mathbf{u}, \mathbf{v}) - \max\{\operatorname{sim}(\mathbf{v}, \mathbf{u}'), u' \in \mathcal{E}_s\} + 1,\tag{5.2}$$

<sup>3</sup> The similarity measure *sim* is usually chosen from cosine similarity [32–34], Euclidean distance [6, 44, 51], or Manhattan distance [38–40].

where 0 ≤ *pu,v* ≤ 1, and a larger *pu,v* denotes a higher degree of preference. The preference score of *v* toward *u* is defined similarly.

Our definition of the preference score for an entity toward another entity is composed of three elements. The first element represents the similarity score between the two entities, while the second element represents the highest similarity score that the target entity has with all the source entities. Intuitively, *u* would prefer *v* more if their similarity score *sim(u, v)* is close (ideally, equal) to the highest similarity score that *v* has, i.e., $\max\{\operatorname{sim}(v, u'), u' \in \mathcal{E}_s\}$. Hence, we subtract the second element from the first element. If the difference is close to 0, it shows that *u* is satisfied with *v*. To make the preference value positive, we add the third element, i.e., 1.

Our definition of the preference score takes into account the alignment in the reverse direction (i.e., the preference of the target entity toward the source entity), which is naturally incorporated into an entity's preference modeling. Moreover, $p_{u,v}$ is not necessarily equal to $p_{v,u}$, since the preference score encodes the alignment information at the entity level (rather than the pairwise level as in Eq. (5.1)). We denote the matrix forms of the source-to-target and target-to-source preference scores as $P_{s,t}$ and $P_{t,s}$, respectively, and in general $P_{s,t} \neq P_{t,s}^{\top}$, where $P_{t,s}^{\top}$ is the transpose of $P_{t,s}$.
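As a small worked illustration of Eq. (5.2), the sketch below computes both preference matrices from a toy similarity matrix. The values and the function name are assumptions chosen for illustration; the matrix is stored as a nested list with rows for source entities and columns for target entities.

```python
def preference_matrices(sim):
    """Compute source-to-target and target-to-source preference scores (Eq. 5.2).

    sim[i][j] is the similarity between source entity i and target entity j.
    Returns (p_st, p_ts), where p_st[i][j] is the preference of source i toward
    target j and p_ts[j][i] is the preference of target j toward source i.
    """
    n_src, n_tgt = len(sim), len(sim[0])
    # Highest similarity each target has with any source (column maxima).
    col_max = [max(sim[i][j] for i in range(n_src)) for j in range(n_tgt)]
    # Highest similarity each source has with any target (row maxima).
    row_max = [max(row) for row in sim]
    p_st = [[sim[i][j] - col_max[j] + 1 for j in range(n_tgt)] for i in range(n_src)]
    p_ts = [[sim[i][j] - row_max[i] + 1 for i in range(n_src)] for j in range(n_tgt)]
    return p_st, p_ts


# Direct inference would align both sources to target 0 (0.9 and 0.8 are row maxima).
sim = [
    [0.9, 0.6],
    [0.8, 0.7],
]
p_st, p_ts = preference_matrices(sim)
```

With these numbers, source entity 1 has a higher similarity with target 0 but a higher preference for target 1, because target 0 is even more similar to source entity 0. The two matrices are also not transposes of each other, which reflects the entity-level nature of the score.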

#### *5.4.3 Preference Aggregation*

The preference scores only reflect the preferences in one direction, and an optimal alignment result should consider the preference scores in both directions. Hence, we propose to aggregate the unidirectional preferences. More specifically, we first convert the preference matrix $P$ into the ranking matrix $R$. The elements in each row of $P_{s,t}$ and $P_{t,s}$ are ranked in descending order according to their values, resulting in $R_{s,t}$ and $R_{t,s}$, respectively.<sup>4</sup> Each element in the ranking matrix $R$ represents the rank of the corresponding preference score, where a lower rank indicates a higher preference value. Thus, the ranking matrices also encode the preference information.

The primary objective of transforming scores into ranks is to magnify the disparities between the scores. As we combine the source-to-target and target-to-source matrices to capture shared preferences, the small differences in scores on one side may be easily overlooked after the aggregation with information on the other side. Transforming scores into ranks allows us to preserve and integrate such differences into the ultimate mutual preference.

<sup>4</sup> For tied elements, we follow the common practice and denote their ranks as the average of the ranks that would have been assigned to them.

Afterward, we combine the two ranking matrices to capture the mutual preferences of entities and create the corresponding preference matrix:

$$P_{s \leftrightarrow t} = \phi\left(R_{s,t}, R_{t,s}^{\top}\right),\tag{5.3}$$

where *φ* is an aggregation function, which can be chosen from any mean operators, cross-ratio uniform [43], or other viable methods [25]. For this study, we use the arithmetic mean, which remains impartial and presents both entities' preferences toward each other precisely, without showing any inclination toward a higher or lower rank [23, 25]. The reciprocal matrix contains elements denoted by $p_{u \leftrightarrow v}$, which indicate the degree of mutual preference between a pair of entities (*u* and *v*). The lower the value of $p_{u \leftrightarrow v}$, the higher the level of preference between the two entities.
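The ranking and aggregation steps can be sketched as follows. The rank conversion assigns tied scores the average of their ranks, and the aggregation uses the arithmetic mean, both as described in the text. The final matching step, which picks for each source entity the target with the lowest mutual rank, is an assumption of this sketch, and the toy preference values are illustrative.

```python
def rank_row(scores):
    """Rank scores in descending order (rank 1 = largest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def reciprocal_matrix(p_st, p_ts):
    """Arithmetic mean of the source-to-target ranks and the transposed target-to-source ranks."""
    r_st = [rank_row(row) for row in p_st]
    r_ts = [rank_row(row) for row in p_ts]
    return [[(r_st[i][j] + r_ts[j][i]) / 2 for j in range(len(p_ts))]
            for i in range(len(p_st))]


def align(recip):
    """For each source entity, pick the target with the lowest mutual rank (assumed matching rule)."""
    return [min(range(len(row)), key=lambda j: row[j]) for row in recip]


p_st = [[1.0, 0.9], [0.9, 1.0]]   # preferences of sources toward targets
p_ts = [[1.0, 1.0], [0.7, 0.9]]   # preferences of targets toward sources
recip = reciprocal_matrix(p_st, p_ts)
matches = align(recip)
```

In this toy case, target 0 is indifferent between the two sources (tied rank 1.5), yet the aggregated ranks still separate the candidates and yield the diagonal alignment.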


Algorithm 3 provides the details of the reciprocal alignment inference process. We also use Example 1 to further illustrate the process.

*Example 1* As shown in Fig. 5.4, there are a total of four source entities (*u*1*, u*2*, u*3*, u*4) and four target entities (*v*1*, v*2*, v*3*, v*4). In *S*, *Ps,t* , *Rs,t* , and *Ps*↔*<sup>t</sup>* , the rows correspond to the source entities and the columns correspond to the target entities, while in *Pt,s* and *Rt,s*, the rows correspond to the target entities and the columns correspond to the source entities. The entities with the same subscripts are equivalent.

(a): The similarity scores in matrix *S* are computed by using cosine similarity with entity embeddings. If we consider these similarity scores as the entity preferences and align each source entity to its most preferred target entity, the results, such as *(u*1*, v*2*)*, *(u*2*, v*2*)*, *(u*3*, v*2*)*, and *(u*4*, v*2*)*, will only contain one correct match in this example.


**Discussion** Some might argue that Eq. (5.2) is similar to the definition of cross-domain similarity local scaling (CSLS) [16], a metric that is proposed to mitigate the hubness issue during the nearest neighbor search. However, CSLS subtracts the average of the top-*n* highest similarity scores of *both source and target* entities from the pairwise similarity, resulting in a score that is still at the pairwise level and cannot fully characterize the preference of each entity. In contrast, our proposed entity-level preference measure can better reflect the individual preferences of entities, and the integrated reciprocal preference matrix leads to more accurate alignment results, as demonstrated in Sect. 5.7.4.

#### *5.4.4 Correctness Analysis*

The optimal solution for alignment inference is to correctly identify all entity pairs. For example, in Fig. 5.4, where there are four source entities (*u*1*, u*2*, u*3*, u*4), four target entities (*v*1*, v*2*, v*3*, v*4), and the similarity matrix *S*, the optimal solution is M = {*(u*1*, v*1*), (u*2*, v*2*), (u*3*, v*3*), (u*4*, v*4*)*}.

Nevertheless, the ability of Algorithm 3 to attain a correct or optimal solution depends on the input similarity matrix *S*, which is generated by the deep learning-based representation learning process that captures the relatedness among entities. In the worst-case scenario, where the representation learning process fails to learn anything useful and the similarity matrix is composed of 0s, Algorithm 3 or any alignment inference strategy, such as direct alignment inference, would produce results that are full of wrongly aligned entity pairs. However, if the similarity matrix is accurate (i.e., for each ground-truth entity pair *(u, v)*, *u* has a higher similarity score with *v* than the rest of the source entities, and likewise, *v* has a higher similarity score with *u* than the rest of the target entities), Algorithm 3 can find the correct solution, as proven below.

*Proof* We prove that, given an accurate similarity matrix where for each ground-truth entity pair *(u, v)*:

$$u = \arg\max\{\operatorname{sim}(\mathbf{v}, \mathbf{u}'), u' \in \mathcal{E}_s\};$$

$$v = \arg\max\{\operatorname{sim}(\mathbf{u}, \mathbf{v}'), v' \in \mathcal{E}_t\},$$

the reciprocal inference algorithm could accurately identify these entity pairs.

Without loss of generality, we consider the ground-truth entity pair *(u, v)*. The following proof also applies to the rest of the ground-truth entity pairs.

First, we can derive that *pu,v* = 1 and *pv,u* = 1 according to Eq. (5.2). Further, we can derive that:

$$p_{u,v} = \max\{p_{u,v'}, v' \in \mathcal{E}_t\}; \quad p_{v,u} = \max\{p_{v,u'}, u' \in \mathcal{E}_s\}.$$

After converting the scores into ranks, we have:

$$r_{u,v} = \min\{r_{u,v'}, v' \in \mathcal{E}_t\}; \quad r_{v,u} = \min\{r_{v,u'}, u' \in \mathcal{E}_s\},$$

where *r* denotes the rank value. Next, after aggregating with arithmetic mean using Eq. (5.3), we can derive that:

$$p_{u\leftrightarrow v} = \min\{p_{u\leftrightarrow v'}, v' \in \mathcal{E}_t\}; \quad p_{v\leftrightarrow u} = \min\{p_{v\leftrightarrow u'}, u' \in \mathcal{E}_s\}.$$

Finally, according to Line 10 to Line 12 in Algorithm 3, *u* and *v* would be aligned by reciprocal alignment inference.

Therefore, the main challenge lies in obtaining an accurate similarity matrix. However, in most cases, the similarity matrix is likely to be inaccurate, as the representation learning process cannot guarantee to learn high-quality entity representations for generating an accurate similarity matrix. Thus, we categorize the similarity scores of the ground-truth entity pairs into four cases and discuss the performance of our proposed reciprocal alignment inference and the direct inference (baseline model) under these circumstances in the appendix. Empirically, reciprocal alignment inference achieves much better results than direct inference.

#### *5.4.5 Complexity Analysis*

Regarding the worst-case time complexity of Algorithm 3, the preference modeling process (Lines 1–3) requires $O(n^2) + O(2n^2) + O(2n^2)$, as we can calculate the highest similarity scores outside of the loops; the ranking process (Lines 5–8) requires $O(2n \times n \lg n)$; the aggregation process (Line 9) requires $O(n^2)$; and the matching process (Lines 10–12) requires $O(n^2)$, where $n$ denotes the number of entities in a KG. Overall, the time complexity of Algorithm 3 is $O(n^2 \lg n)$. Notably, the time complexity of the direct alignment inference strategy is $O(n^2)$. Our proposed reciprocal alignment inference has a higher time complexity than direct inference as it includes an additional ranking process that converts the preference scores into ranks. However, the ranking process is crucial for enhancing the alignment performance, as will be confirmed by the experiments.

The primary factor contributing to the space complexity of LIME is the reciprocal alignment inference stage, specifically the computation of the similarity, preference, and ranking matrices. In contrast, the direct alignment inference approach only requires computing the similarity matrix with a size of *n*×*n*, where *n* represents the number of entities. In our reciprocal modeling strategy, we remove the matrices once they are no longer necessary to decrease memory usage, and only up to three matrices are present at any given time. Thus, our model's maximum memory consumption is three times that of the direct alignment inference.

#### **5.5 Variants of Reciprocal Alignment Inference**

As discussed in Sect. 5.4.5, incorporating the reciprocal preferences of entities into the model requires a greater amount of memory and time compared to the direct alignment strategy, as a result of computing the preference and ranking matrices. Consequently, in this section, we propose two alternative methods to minimize the memory and time usage associated with the reciprocal modeling.

#### *5.5.1 No-Ranking Aggregation*

The complexity analysis in Sect. 5.4.5 has identified that the increased time complexity is primarily due to the calculation of preference score rankings. Therefore, we propose a no-ranking aggregation strategy in order to approximate reciprocal alignment inference, which eliminates the ranking process and instead directly aggregates *Ps,t* and *Pt,s* to produce the reciprocal preference matrix *Ps*↔*<sup>t</sup>* .

#### *5.5.2 Progressive Blocking*

To further reduce the time and space requirements of reciprocal alignment inference, we propose a method to decrease the value of *n*. We introduce a progressive blocking method that partitions the entities into smaller blocks and infers the alignment results at the block level. The algorithm for this method is presented in Algorithm 4 and the process is illustrated in Fig. 5.5.

**Fig. 5.5** An example of the progressive blocking process. The shape in gray denotes a block. For instance, there are five blocks in (**b**), i.e., {*u*2*, v*2*, u*4*, v*4}, {*u*1}, {*v*1}, {*u*3}, and {*v*3}. (**a**) The unified graph and the reciprocal matrix, (**b**) first round of blocking, (**c**) second round of blocking, (**d**) third round of blocking

**Difference from the Graph Partition Strategies** It is important to note that the progressive blocking process and the graph partition strategies presented in Sect. 5.3 are distinct, despite both being methods for dividing large graphs into smaller ones. The input to the graph partition strategies is a KG, and the goal is to partition it into smaller subgraphs while preserving the original KG structure. In contrast, the input to the progressive blocking method is a bipartite graph with nodes representing source and target entities to be aligned and edges representing pairwise connections between them. The aim is to divide the bipartite graph into smaller blocks, where alignment can be inferred within a smaller search space. Consequently, when aligning large KG pairs, we first conduct graph partitioning to divide the KGs into smaller subgraphs. For each small KG pair, we learn entity structural representations and reciprocally infer the alignment results, where the progressive blocking method can be used to reduce the time and memory costs of reciprocal inference.

**One-Off Blocking** The inputs to the progressive blocking process include the unified graph G, which contains entities from both source and target KGs and their pairwise connections; the similarity matrix *S*, which encodes the pairwise similarities between source and target entities; and *Θ*, which is the set of given thresholds (hyper-parameters). The blocking process begins by removing the connection between a source entity *u* and a target entity *v* if their similarity score *sim(u, v)* is lower than a predefined threshold *θ* ∈ *Θ*. This division creates different blocks of source and target entities, as illustrated in Fig. 5.5b. After obtaining the blocks, we perform reciprocal entity alignment on the entities within each block and aggregate the results from different blocks to obtain the overall alignment results.

It is important to note that setting a small value for *θ* will result in most connections remaining, and most entities remaining in the same block. Therefore, the threshold is typically set to a relatively large value to ensure that the entities are effectively divided into appropriate blocks. However, this blocking process may still produce isolated blocks containing only a single entity, as depicted in Fig. 5.5b. We have found that these isolated blocks can represent a significant portion of the overall entities. One intuitive approach to handling these isolated entities is to gather them together and place them in the same block. However, this block would likely be large in size, and reciprocally aligning the entities within it would still require a significant amount of memory (as empirically validated in Sect. 5.7.5).


**Progressive Blocking** To address this issue, we propose a progressive blocking strategy. The strategy begins by removing connections between source and target entities in G with similarity scores lower than the threshold *θ* and computing the connected components of G. Each connected component is considered as a block (Lines 2–3 in Algorithm 4). For each block in the block sets, if it contains more than one entity, it is added to the final set of blocks (Lines 6–7). We gather up the isolated entities (blocks), place them into one block, and restore the connections among the entities in this block, which forms the new unified graph G (Lines 9–10). Then, we choose the next *θ* (smaller than the previous one) from *Θ* and block G using the same strategy, i.e., removing connections with similarity scores lower than *θ*. As the threshold is lower than the previous one, some of the connections among these entities would remain, and these entities would be placed into different blocks. Similarly, there may still be isolated entities, and we can repeat the progressive blocking strategy to generate more non-isolated blocks by gathering up the isolated entities and adjusting the threshold. Finally, we obtain the final set of blocks (Line 11). We perform reciprocal entity alignment within each block and aggregate the individual results to attain the final alignment results. A running example is given at the end of this section.
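The rounds described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Algorithm 4 itself: the unified graph is taken to be the complete bipartite graph over the current entities, connected components are found with a small union-find, nodes are encoded as hypothetical `("s", i)` and `("t", j)` tuples, and any entities still isolated after the last threshold are placed together in one final block (the text does not specify how such leftovers are handled).

```python
def components(src, tgt, sim, theta):
    """Connected components of the bipartite graph keeping edges with sim >= theta."""
    parent = {("s", i): ("s", i) for i in src}
    parent.update({("t", j): ("t", j) for j in tgt})

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in src:
        for j in tgt:
            if sim[i][j] >= theta:
                parent[find(("s", i))] = find(("t", j))

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def progressive_blocking(sim, thetas):
    """Block entities over several rounds with decreasing thresholds."""
    src, tgt = list(range(len(sim))), list(range(len(sim[0])))
    blocks = []
    for theta in thetas:
        isolated_src, isolated_tgt = [], []
        for comp in components(src, tgt, sim, theta):
            if len(comp) > 1:
                blocks.append(comp)          # non-isolated block is final
            else:
                kind, idx = next(iter(comp))
                (isolated_src if kind == "s" else isolated_tgt).append(idx)
        # Isolated entities are gathered and re-blocked in the next round.
        src, tgt = isolated_src, isolated_tgt
        if not src or not tgt:
            break
    if src or tgt:
        # Assumption: leftovers form one last block.
        blocks.append({("s", i) for i in src} | {("t", j) for j in tgt})
    return blocks


sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.6, 0.1],
    [0.1, 0.2, 0.3],
]
blocks = progressive_blocking(sim, thetas=[0.8, 0.5, 0.25])
```

With three decreasing thresholds, each round peels off one matched pair in this toy matrix. With the single threshold 0.8, the remaining four entities would instead be gathered into one larger block, which is the situation the progressive strategy is meant to avoid.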

**The Benefits and Limitations of Progressive Blocking** By applying the progressive blocking method, the memory and time costs of reciprocal alignment inference are significantly reduced, as the number of entities in each block is much smaller than in the original graph. This reduction is empirically validated in Table 5.2 in Sect. 5.7.1. However, the blocking process may partition equivalent entities into different blocks, which can negatively impact the alignment accuracy. This issue is further discussed and analyzed in Sect. 5.7.

*Example* Continuing with the previous example, we now explain the progressive blocking process, which is also illustrated in Fig. 5.5.


#### **5.6 Experimental Settings**

In this section, we introduce the experimental settings.

#### *5.6.1 Dataset*

Following previous works, we adopt three popular EA datasets for evaluation:


Table 5.1 presents a summary of the statistics of these datasets. We use 30% of the aligned pairs for training and 10% for validation.


**Table 5.1** Statistics of the datasets used for evaluation

#### *5.6.2 Construction of a Large-Scale Dataset*

To evaluate the scalability of EA, we construct a new dataset with millions of entities by using DBpedia and Freebase as the source and target KGs, respectively. We obtain the gold standards, i.e., aligned entity pairs, from the external links between DBpedia and Freebase.<sup>5</sup> We then extract the relational triples involving the entities in these links from their respective KGs. To ensure the quality of the extracted triples, we follow the method proposed in a previous work [50]. We keep only the links whose source and target entities are involved in at least one triple in their respective KGs, and the entity sets are adjusted accordingly. As a result of this process, each KG contains over two million entities and tens of millions of triples. Table 5.1 presents the statistics of the newly constructed dataset.

#### *5.6.3 Implementation Details*

For the graph partition, we set the number of subgraph pairs *k* to 75 for FB\_DBP\_2M and 5 for DWY100K. For CPS, we adopt the same settings as the original paper. The number of rounds *γ* of I-SBP is set to 3. For the representation learning models RREA and GCN, we adopt the same settings as the original papers [22, 38]. We use cosine similarity to measure the similarity between entity embeddings. The reciprocal alignment inference stage does not require any additional parameters. Regarding the progressive blocking process, we conduct three rounds and set the thresholds (hyper-parameters) to the 50th percentile (median), 25th percentile (the first quartile), and 1st percentile of the set of the largest similarity score of each source entity, respectively, which could be directly obtained given the similarity matrix. The main intuition behind this is that such settings can guarantee the thresholds are decreasing, and meanwhile the threshold values would not be too small as they are obtained from the set of the largest similarity scores of all source entities.
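The percentile-based threshold schedule described above can be sketched as follows. This is a minimal illustration rather than the LIME implementation: the helper names `percentile` and `threshold_schedule`, the linear-interpolation rule for percentiles, and the toy similarity values are all assumptions made for the example.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) of values, using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def threshold_schedule(sim, quantiles=(50, 25, 1)):
    """Derive one blocking threshold per round from a similarity matrix.

    sim[i][j] is the similarity between source entity i and target entity j.
    """
    # Largest similarity score of each source entity.
    row_max = [max(row) for row in sim]
    return [percentile(row_max, q) for q in quantiles]


# Toy similarity matrix (hypothetical values): 4 source x 3 target entities.
sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.70, 0.40],
    [0.10, 0.50, 0.45],
    [0.20, 0.25, 0.30],
]
thresholds = threshold_schedule(sim)
print(thresholds)
```

Because the percentiles are taken in decreasing order over the same set of row maxima, the resulting thresholds are non-increasing by construction, and none of them falls below the smallest row maximum.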

To compare with the approaches that leverage extra information, we incorporate entity names into our proposal. We directly adopt the strategies proposed in [46] to generate useful features from entity names for alignment. We acknowledge that some methods [36, 44] use entity descriptions to improve the alignment performance significantly. However, we leave the integration of such information and the comparison with these methods to future work, as it is outside the scope of this study. The source code of LIME is publicly available at https://github.com/DexterZeng/LIME.

<sup>5</sup> https://www.dbpedia.org/blog/dbpedia-is-now-interlinked-with-freebase-links-to-opencycupdated/.

#### *5.6.4 Evaluation Metrics*

As per convention [6], we adopt Hits@1 as the performance measure, which indicates the proportion of correct alignments. Unless otherwise specified, Hits@1 is represented in percentage. We omit the frequently used Hits@10 and mean reciprocal rank (MRR) metrics since (1) they are less important indicators as pointed out in previous works [6, 50] and (2) they show similar trends to Hits@1.
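For concreteness, the metrics mentioned above can be computed from a similarity matrix as in the following sketch. The function names, the pessimistic handling of ties, and the toy values are assumptions made for illustration and are not taken from the LIME code base.

```python
def gold_ranks(sim, gold):
    """1-based rank of the gold target for every source entity.

    sim[i][j] is the similarity of source i and target j; gold[i] is the
    index of the target equivalent to source i. Ties count against the
    gold target (a pessimistic convention assumed here).
    """
    ranks = []
    for i, row in enumerate(sim):
        gold_score = row[gold[i]]
        better = sum(1 for j, s in enumerate(row) if j != gold[i] and s >= gold_score)
        ranks.append(1 + better)
    return ranks


def hits_at_k(sim, gold, k=1):
    """Percentage of source entities whose gold target is ranked in the top k."""
    ranks = gold_ranks(sim, gold)
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(sim, gold):
    """Average of 1/rank of the gold target over all source entities."""
    ranks = gold_ranks(sim, gold)
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example (hypothetical values): source i is equivalent to target i.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]
gold = [0, 1, 2]
print(hits_at_k(sim, gold, k=1), mean_reciprocal_rank(sim, gold))
```

In this toy case two of the three source entities rank their gold target first, and the remaining one ranks it second.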

In addition, we assess the alignment methods based on their memory usage (in GB) and time consumption (in seconds).

#### *5.6.5 Competing Methods*

Our model is compared against 24 methods, which are categorized into two groups. The first category consists of methods that employ various embedding learning models to acquire valuable entity representations for alignment, such as the following:


The second group of techniques makes use of data beyond the KG structure. This comprises the following:


We choose these baselines since they are the most recent and best-performing approaches. The majority of the baselines are embedding-based methods, since most EA approaches focus merely on the embedding learning stage. Only a limited number of methods focus on the alignment inference stage, such as CEA, GM-EHD-JEA, and CEAFF. To ensure a fair comparison, we executed the source codes of the baseline methods in our experimental setup and present the results obtained in comparison with the corresponding results reported in the original papers, although the two may differ. The top-performing results in each table are marked in bold.

#### **5.7 Results**

We aim to answer the following research questions by conducting relevant experiments:


#### *5.7.1 Evaluation on Large-Scale Dataset*

**Settings** To address RQ1, we experimented on FB\_DBP\_2M. None of the state-of-the-art approaches can be directly run on this dataset due to the huge computation cost. Hence, we utilized the SBP algorithm to partition KGs and used CPS and I-SBP for comparison. We used GCN and RREA as the entity representation learning models. Regarding the alignment inference stage, we compared our proposed reciprocal alignment inference strategy RInf and its variants RInf-wr and RInf-pb with the direct alignment strategy DInf. For the comprehensiveness of the evaluation, we also conducted the experiments on the medium-sized dataset DWY100K. The results are presented in Table 5.2.

**Overall Results** According to Table 5.2, the best alignment performance on FB\_DBP\_2M and DWY100K<sub>DBP-WD</sub> is achieved by the combination of I-SBP, RREA, and RInf-pb. However, replacing RInf-pb with RInf in this combination leads to the highest Hits@1 on DWY100K<sub>DBP-WD</sub>. In terms of efficiency, the combination of CPS, GCN, and DInf is the fastest across all three KG pairs. Additionally, the alignment results on DWY100K are much higher, while the memory and time costs are lower than those on FB\_DBP\_2M, demonstrating that our newly constructed large-scale EA dataset presents a significant challenge to EA solutions.

**Table 5.2** Evaluation results of variants of LIME on large-scale datasets

**Partition Strategies** In terms of partition strategies, it is clear that I-SBP consistently achieves the best alignment results, regardless of the choice of embedding and inference models. Moreover, using SBP results in better alignment performance than using CPS, highlighting the effectiveness of leveraging bidirectional information for KG partitioning. However, SBP is more time- and memory-intensive than CPS as it requires bidirectional partitions. I-SBP further increases the time cost to a significantly higher level, at least three times that of SBP. This excessive time cost is due to the iterative re-partitioning process. Additionally, I-SBP consumes more memory space than other partition strategies.

**Alignment Inference Strategies** First, we compare our proposed alignment inference strategies with the direct alignment inference approach. The results presented in Table 5.2 indicate that our proposed reciprocal inference strategy RInf outperforms the commonly used direct alignment inference DInf by a significant margin on DWY100K. On the FB\_DBP\_2M dataset, although RInf cannot work due to the high memory cost, its approximation strategies still attain better results than DInf. In particular, compared to DInf, RInf-pb requires only moderately more time and memory space while consistently achieving superior alignment results across all datasets under different combinations of partition and embedding learning models. It is especially effective in the iterative partition setting. On the other hand, RInf-wr incurs slightly higher time and memory costs than DInf; it achieves better results than DInf when using CPS but performs worse than DInf under bidirectional partitioning on DWY100K<sub>DBP-YG</sub> and FB\_DBP\_2M. This can be attributed to the fact that directly aggregating the preference scores can result in the loss of information, as the preference scores are typically very close. This is also discussed in Sect. 5.4.3.

In the next step, we compare RInf with its variants. On DWY100K, applying the blocking strategy reduces the Hits@1 performance of RInf by 2–5%, with the exception of cases where I-SBP is used. This is because the blocking process cannot guarantee that equivalent entities are placed in the same block. Nonetheless, RInf-pb reduces the memory cost by over 90% and the time cost by over 70%. This validates that our progressive blocking strategy can significantly increase the efficiency of the reciprocal modeling process at the cost of a slight performance drop. Although applying the blocking strategy reduces the Hits@1 performance of LIME, the results are still significantly higher than DInf. When using the iterative partition strategy, it can be observed that using RInf-pb achieves comparable or even better Hits@1 performance than using RInf. This is because the progressive blocking process reduces the search space and can generate more confident pairs, which could lead to increasingly better partition and alignment results.

Regarding the no-ranking variant RInf-wr, even though its time and memory costs are small (close to DInf), its alignment performance is significantly lower than RInf across all settings. This confirms that the ranking process is crucial in the preference aggregation process, as discussed in Sect. 5.4.3.

**Representation Learning Models** Regarding the entity structural embedding learning methods, the more advanced model RREA achieves better results than the baseline model GCN with various partition and inference strategies, demonstrating the importance of modeling KG structure information for overall alignment performance. This also confirms that our proposal is independent of the embedding learning model and can consistently improve alignment results.

For further details on the design of each component and more experiments and discussions, please refer to the following subsections.

#### *5.7.2 Comparison with State-of-the-Art Methods*

In this subsection, we answer RQ2.

**Settings** In the previous section, we demonstrated that LIME can effectively handle large-scale EA datasets. However, since state-of-the-art methods cannot handle the FB\_DBP\_2M dataset, we conducted further experiments on popular medium-sized and small datasets to validate the effectiveness of our proposal. Given that these datasets are relatively small, we did not use our proposed partition strategies in LIME, as discussed in Sect. 5.3.5. Instead, we evaluated the effectiveness of our proposed reciprocal alignment inference strategy and its variants, using the RREA model as the representation learning module in LIME. We denote the no-ranking and progressive blocking variants as LIME-wr and LIME-pb, respectively.

We present the results of methods that only utilize KG structure to learn entity embeddings for alignment in Table 5.3 and the results of methods that use additional information in Table 5.4. Additionally, we demonstrate that LIME can be applied to other representation learning models, and the results are reported in Table 5.5. We also provide a comparison of efficiency in Fig. 5.6.

**Comparison of Alignment Performance** We can observe from Tables 5.3 and 5.4 that LIME achieves the best alignment performance in both categories, and the performance of LIME-wr and LIME-pb also surpasses that of the baseline models, validating the effectiveness of the reciprocal inference strategy and its variant strategies. Notably, LIME adopts RREA as the representation learning component, which has already attained the highest Hits@1 among existing methods. In order to further validate that LIME is a generic framework that can be used to improve the alignment performance of any representation learning-based EA method, we removed the RREA model and applied LIME to other models. We reported the corresponding results in Table 5.5. Specifically, we selected a representative approach from each group, namely, RSNs and RDGCN, and reported the results on DBP15K and SRPRS in Table 5.5. Results on other datasets are omitted due to space limitations. The results in Table 5.5 verify that applying LIME leads to much better alignment performance than the direct alignment inference strategy, regardless of the approaches or datasets. This further demonstrates the effectiveness and generality of the LIME framework.

For some methods, we do not have access to their source codes and our implemented results are much worse than those reported in the original papers. In this case, we directly adopt the results from the papers (which might be missing on some datasets) and mark these methods with ∗ in the tables.

Additionally, we can observe several trends from the tables: (1) the results on DWY100K are higher than those on DBP15K and SRPRS, since the KGs in DWY100K are denser, which can provide more structural information for alignment. In comparison, the results on SRPRS are the worst among the three as its KG structure is the sparsest. This reveals that the density of the KG structure is crucial to the alignment of entities; and (2) overall, compared with methods that only use structural information, the methods that incorporate additional features achieve much better alignment performance. On the mono-lingual datasets, some solutions even achieve ground-truth results, showcasing the benefits of incorporating other useful features.

**Usage of Partition Strategies** In Sect. 5.3.5, we discussed that when dealing with small- or medium-sized datasets, it may not be worth using the partition strategy, since partitioning may not significantly reduce computational costs, while it may decrease the alignment accuracy. We empirically validate this point by comparing the results of LIME in Table 5.3 with the results in Table 5.2. Specifically, we can see from Table 5.2 that the Hits@1 of SBP + RREA + RInf<sup>6</sup> on DWY100K<sub>DBP-WD</sub> is 76.9%, while this figure is 81.6% for LIME (equivalent to RREA + RInf) in Table 5.3, demonstrating that the partition process indeed harms the alignment accuracy. Furthermore, although using the partition strategy is faster, the time costs of the two settings are of the same order of magnitude (thousands of seconds).

**Comparison of the Efficiency** We compare LIME with state-of-the-art approaches in terms of efficiency and show the results in Fig. 5.6.<sup>7</sup> The results demonstrate that LIME is efficient on all datasets, primarily because the representation learning model RREA is highly efficient. However, LIME does require slightly more running time in order to significantly enhance the alignment performance of RREA. It is also worth noting that the time cost is generally higher on larger datasets (such as DWY100K compared to DBP15K) and on denser datasets (such as DBP15K compared to SRPRS).

<sup>6</sup> Note that we do not compare with I-SBP, since it selects confident EA pairs to augment training data, which improves both the partition and the representation learning process. It corresponds to the semi-supervised setting in previous EA works [22, 34], which is usually not compared with the methods without the semi-supervised setting (e.g., LIME in this work) for fairness.

<sup>7</sup> We do not include an evaluation of the efficiency of methods that use extra information because processing this information is complex and it is challenging to provide an unbiased evaluation.

**Fig. 5.6** Running time comparison of methods merely using structural information. (**a**) On DBP15K. (**b**) On DWY100K. (**c**) On SRPRS

#### *5.7.3 Experiments and Analyses on Partitioning*

In this section, we seek to answer RQ3. By examining Table 5.2, we can conclude that I-SBP and SBP outperform CPS in generating precise alignment outcomes, albeit at the expense of greater time and memory usage. This section presents additional experiments aimed at assessing the efficacy of these partition methods.

**Influence of Partition Strategies on Alignment Links** Our initial goal is to evaluate the percentage of preserved alignment links following the partitioning process. This is a critical aspect, as the optimal partition strategy should place equivalent entity pairs in the same subgraph pair, thereby enabling accurate alignment in subsequent stages. The ability of the partition strategy to group equivalent entity pairs together determines the maximum achievable alignment accuracy, as discussed in Sect. 5.3.2. As a result, we present the percentage of preserved gold alignment links following partitioning in Table 5.6.
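The measurement itself is straightforward. The sketch below assumes that every entity is assigned to exactly one subgraph and that the *i*-th source subgraph is paired with the *i*-th target subgraph; the function name and the toy assignments are hypothetical.

```python
def preserved_link_percentage(gold_links, source_part, target_part):
    """Percentage of gold links whose two entities land in the same subgraph pair.

    gold_links: list of (source_entity, target_entity) pairs.
    source_part / target_part: dicts mapping an entity to its subgraph index.
    """
    kept = sum(
        1
        for u, v in gold_links
        if u in source_part and v in target_part and source_part[u] == target_part[v]
    )
    return 100.0 * kept / len(gold_links)


# Toy example (hypothetical entities and partition with k = 2 subgraph pairs).
gold_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
source_part = {"a1": 0, "a2": 0, "a3": 1, "a4": 1}
target_part = {"b1": 0, "b2": 1, "b3": 1, "b4": 1}
print(preserved_link_percentage(gold_links, source_part, target_part))
```

Here the link between `a2` and `b2` is broken by the partition, so three of the four links survive. This percentage is an upper bound on the Hits@1 that any inference strategy can reach within the subgraph pairs.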

Table 5.6 shows that CPS destroys over 10% of the links on DWY100K and more than half of the links on FB\_DBP\_2M. This indicates that the partition process itself significantly reduces the maximum achievable alignment accuracy, which is undesirable. In comparison, adopting SBP retains 67.5% of the links on FB\_DBP\_2M, which improves on the result of CPS by over 50%. Moreover, I-SBP produces a remarkable improvement, preserving 80% of the links in FB\_DBP\_2M and almost all links in DWY100K. This demonstrates that iterative partitioning can effectively optimize the partition process and prevent equivalent entities from being placed into different subgraph pairs. Nevertheless, as shown in Table 5.2, the time cost of SBP is almost double that of CPS, while I-SBP requires significantly more time depending on the number of iterations.

**Table 5.6** The percentage of gold alignment links preserved after partitioning

**Influence of the Number of Subgraph Pairs** *k* Our next step is to analyze the impact of the number of subgraph pairs *k* on the partition process. To be specific, Table 5.7 presents the percentage of preserved links, time cost, and the number of entities in the largest subgraph pair for CPS and SBP, with *k* set to 50, 75, and 100. Indeed, Table 5.7 indicates that as the number of subgraph pairs increases, the percentage of preserved links decreases, and the partition time cost increases for both CPS and SBP. However, increasing the number of subgraph pairs results in smaller subgraphs, which can be beneficial for structural representation learning strategies due to their scalability limitations.

#### *5.7.4 Experiments and Analyses on Reciprocal Inference*

In this subsection, we address RQ4.

**Comparison with the CSLS Metric** In Sect. 5.4, we mentioned the CSLS metric, which was introduced to address the hubness problem in nearest neighbor search and may have a similar effect as the reciprocal alignment inference strategy. We thus replaced the reciprocal inference approach in LIME with the CSLS metric (with a hyper-parameter *n* set to 1, 5, or 10) and evaluated the corresponding Hits@1 results, which are presented in Fig. 5.7. It is worth noting that all other settings were kept the same.

The results presented in Fig. 5.7 demonstrate that LIME consistently outperforms the CSLS metric on all datasets. This confirms that our reciprocal alignment inference strategy can more effectively model and integrate entity preferences, leading to more accurate alignment results compared to the CSLS metric (as discussed in Sect. 5.4). Additionally, we observe that the performance of the CSLS metric deteriorates as the hyper-parameter *n* increases.
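To make the comparison concrete, the following sketch implements CSLS in its commonly used form, where the score of a pair is twice its similarity minus the mean similarity of each entity to its *n* nearest neighbors on the other side. The helper names and the toy matrix are assumptions for illustration; the exact configuration used in our experiments is not reproduced here.

```python
def mean_top_n(values, n):
    """Mean of the n largest values (the neighborhood density of one entity)."""
    top = sorted(values, reverse=True)[:n]
    return sum(top) / len(top)


def csls_matrix(sim, n=1):
    """Rescale a similarity matrix with CSLS to penalize hub entities."""
    rows, cols = len(sim), len(sim[0])
    r_src = [mean_top_n(sim[i], n) for i in range(rows)]
    r_tgt = [mean_top_n([sim[i][j] for i in range(rows)], n) for j in range(cols)]
    return [
        [2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(cols)]
        for i in range(rows)
    ]


def argmax_rows(matrix):
    """Index of the best-scoring target for every source entity."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Toy example (hypothetical values): target 0 is a hub that attracts both sources.
sim = [
    [0.90, 0.30],
    [0.80, 0.75],
]
print(argmax_rows(sim))                    # plain nearest neighbor
print(argmax_rows(csls_matrix(sim, n=1)))  # after CSLS rescaling
```

In this toy case, plain nearest-neighbor search maps both source entities to the hub target, whereas CSLS lowers the hub's scores enough for the second source entity to select the other target. CSLS adjusts similarity scores by local density only, while reciprocal inference ranks the preferences of both sides explicitly.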

**Deeper Insights into the Preference Modeling and Aggregation** It is worth noting that in cases where the entity representation learning model has poor performance on EA (i.e., the model outputs a homogeneous probability distribution of entity embeddings), the preference matrix can have many ties, which may impede the effectiveness of the reciprocal modeling approach. Therefore, we aim to (1) analyze the likelihood of ties occurring in the preference matrix and (2) empirically demonstrate that our proposed inference strategy can still improve the performance of a low-performing entity representation learning model even in the presence of ties.

**Fig. 5.7** Comparison of the reciprocal inference in LIME and the CSLS metric. DBP-WD∗ and DBP-YG∗ refer to the KG pairs in DWY100K. The results on FB\_DBP\_2M are omitted due to the excessive time and memory costs required by reciprocal inference in LIME and the CSLS metric

Take SRPRS<sub>EN-FR</sub> as an example. We analyzed the preference matrices of RREA and of a low-performing model, RSNs, each of which has dimensions of 10,500 × 10,500. On average, ties occur 8.72 times in each row or column of the RREA preference matrix, which is not a frequent occurrence. For the low-performing RSNs model, this figure increases to 12.82. This suggests that the quality of entity representations can influence the frequency of ties during preference aggregation, but the effect is not significant. Furthermore, despite the presence of ties in the ranking matrices, applying our reciprocal inference strategy improves the Hits@1 of RSNs by 10.9%, as shown in Table 5.5. This demonstrates that the reciprocal modeling approach can still benefit a low-performing entity representation learning method.

#### *5.7.5 Experiments and Analyses on Progressive Blocking*

In this subsection, we proceed to answer RQ5. First, we analyze the impact of the hyper-parameter *θ* on the alignment performance and efficiency. Next, we discuss the parameter settings of the progressive blocking process.

**Analysis of** *θ* As mentioned in Sect. 5.5.2, setting *θ* to a small value will retain the majority of connections, resulting in most entities being placed in the same block. On the other hand, setting *θ* to a large value will remove many connections and separate entities into different isolated blocks. To empirically verify this claim, we conducted an experiment on the DBP15K<sub>ZH-EN</sub> dataset, varying the value of *θ*. We reported the total number of blocks (#Total), the size of the largest block (#MaxSize), the number of blocks that only contain one entity (which we refer to as isolated blocks, #Iso), the percentage of isolated blocks (Perc.), the aggregated Hits@1 results of performing the alignment within each block (H@1), and the aggregated Hits@1 results of performing the alignment within each block and the aggregated isolated blocks (H@1\*), in Table 5.8.

**Table 5.8** Analysis of the hyper-parameter *θ* in progressive blocking on DBP15K<sub>ZH-EN</sub>

The results in Table 5.8 show that setting *θ* to a large value, specifically 0.75, results in the removal of most pairwise connections, leading to over 10,000 blocks, of which 71.9% were isolated blocks. Also, the Hits@1 result is very low (at 48.1%). Aggregating the 8,225 isolated blocks and considering the alignment performance within this aggregated block, the Hits@1 result increased to 67.9%. However, this aggregated block contains over 8,000 entities and still requires significant memory space. In contrast, setting *θ* to a small value, specifically 0.4, results in the majority of entities (over 20,000) being placed in the same block, which does not achieve the objective of reducing memory space.

Therefore, in a progressive blocking setting, the value of *θ* in the first round is typically set to a larger value. Although this may result in a larger size of the aggregated isolated block, the subsequent rounds with lower *θ* values further process the aggregated block.
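The effect of *θ* on the block structure can be reproduced on a toy example. The sketch below is an assumed reading of a single blocking round: source-target connections with similarity of at least *θ* are kept, and the connected components of the resulting bipartite graph form the blocks. The function names and similarity values are hypothetical.

```python
def build_blocks(sim, theta):
    """Return the blocks (connected components) induced by edges with sim >= theta.

    Source entity i is node i; target entity j is node len(sim) + j.
    """
    n_src, n_tgt = len(sim), len(sim[0])
    parent = list(range(n_src + n_tgt))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n_src):
        for j in range(n_tgt):
            if sim[i][j] >= theta:
                parent[find(i)] = find(n_src + j)

    groups = {}
    for node in range(n_src + n_tgt):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def block_stats(blocks):
    """Summarize blocks in the spirit of the #Total, #MaxSize, and #Iso columns."""
    return {
        "total": len(blocks),
        "max_size": max(len(b) for b in blocks),
        "isolated": sum(1 for b in blocks if len(b) == 1),
    }


# Toy similarity matrix (hypothetical values): 3 source x 3 target entities.
sim = [
    [0.90, 0.50, 0.10],
    [0.45, 0.80, 0.20],
    [0.10, 0.30, 0.60],
]
print(block_stats(build_blocks(sim, theta=0.75)))
print(block_stats(build_blocks(sim, theta=0.40)))
```

With the larger threshold, two of the six entities end up in isolated blocks; with the smaller threshold, four entities merge into a single block. This mirrors, at a much smaller scale, the trend reported in Table 5.8.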

**Analysis of the Progressive Blocking** In this work, we conduct three rounds of progressive blocking and directly set the thresholds to the 50th percentile (median), 25th percentile (the first quartile), and 1st percentile of the set of the largest similarity scores of all source entities, respectively. Here, we investigate the impact of the values of *θ* and the number of rounds of progressive blocking on the alignment performance and memory consumption. To be more specific, we keep two threshold values constant and vary the value of the other threshold, and we report the Hits@1 and memory size in Fig. 5.8a, b, and c. Moreover, we perform progressive blocking for 0 to 4 rounds and present the Hits@1 and memory size in Fig. 5.8d.

As shown in Fig. 5.8a, the value of the initial threshold has an impact on the final Hits@1 result and memory cost. Setting the initial threshold to a relatively small value may produce more accurate alignment results, but it also comes with a high memory cost since most entities are still connected and placed in the same block. On the other hand, a larger threshold can reduce the memory cost, but it also leads to a lower alignment performance.

**Fig. 5.8** Analysis of the progressive blocking. (**a**) Threshold in the first round. On DBP15K<sub>ZH-EN</sub>. (**b**) Threshold in the second round. On DBP15K<sub>ZH-EN</sub>. (**c**) Threshold in the third round. On DBP15K<sub>ZH-EN</sub>. (**d**) Rounds of progressive blocking. On DBP15K<sub>ZH-EN</sub>

Figures 5.8b and c demonstrate that the values of the thresholds in the second and third round do not have a significant impact on the memory cost, while they only have a small influence on the alignment performance. Furthermore, Fig. 5.8d indicates that the Hits@1 performance and memory cost drop in the first few rounds and remain relatively stable with more rounds of blocking. Therefore, conducting progressive blocking for a few rounds is sufficient.

**Threshold Setting in Practice** Based on the analysis, we can identify two crucial factors when setting the threshold schedule: (1) the threshold values should be gradually decreased; and (2) the initial threshold value should be selected carefully, possibly with the guidance of statistical information regarding the similarity scores. Therefore, our proposed strategy for scheduling the threshold is a feasible option in practice, and it can be adjusted based on the statistical information available.
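A decreasing threshold schedule can be combined with the aggregation of isolated blocks roughly as follows. This is a simplified sketch under stated assumptions, not the LIME implementation: in each round, blocks with more than one entity are treated as final, isolated entities are pooled and re-blocked with the next (lower) threshold, and whatever remains after the last round is placed in one aggregated block. All function names and values are hypothetical.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def components(src_ids, tgt_ids, sim, theta):
    """Connected components over edges (u, v) with sim[u][v] >= theta."""
    nodes = [("s", u) for u in src_ids] + [("t", v) for v in tgt_ids]
    parent = {n: n for n in nodes}
    for u in src_ids:
        for v in tgt_ids:
            if sim[u][v] >= theta:
                parent[find(parent, ("s", u))] = find(parent, ("t", v))
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), []).append(n)
    return list(groups.values())


def progressive_blocking(sim, thresholds):
    """Block entities over several rounds with decreasing thresholds."""
    src_pool = list(range(len(sim)))
    tgt_pool = list(range(len(sim[0])))
    final_blocks = []
    for theta in thresholds:
        leftovers = []
        for block in components(src_pool, tgt_pool, sim, theta):
            if len(block) > 1:
                final_blocks.append(block)   # keep non-trivial blocks
            else:
                leftovers.extend(block)      # pool isolated entities
        src_pool = [i for kind, i in leftovers if kind == "s"]
        tgt_pool = [i for kind, i in leftovers if kind == "t"]
    if src_pool or tgt_pool:
        # Aggregate the entities that stayed isolated after the last round.
        final_blocks.append([("s", u) for u in src_pool] + [("t", v) for v in tgt_pool])
    return final_blocks


# Toy similarity matrix (hypothetical values): 3 source x 3 target entities.
sim = [
    [0.90, 0.50, 0.10],
    [0.45, 0.80, 0.20],
    [0.10, 0.30, 0.60],
]
for block in progressive_blocking(sim, thresholds=[0.75, 0.50]):
    print(block)
```

In this toy run, the first round with the higher threshold forms two confident blocks, and the second round with the lower threshold pairs up the two entities that were left isolated. If the first threshold is set too high for any connection to survive, all entities fall into the aggregated block and no memory is saved, which is why the initial threshold deserves the most care.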

#### **5.8 Related Work**

We provide a brief overview of the studies that have addressed the scalability problem in EA. The experimental paper on EA [50] indicates that even the state-of-the-art EA methods still suffer from poor scalability. While simpler models such as GCN-Align [38] and ITransE [51] are faster, they tend to have poorer effectiveness [50]. In contrast, more effective models typically have complex architectures and are inefficient.

There have been several studies on relevant tasks that propose strategies for handling large-scale data. For instance, Flamino et al. [9] approach the alignment of entities in large-scale networks by clustering nodes using network-specific features. However, these features are not present in KGs, and the structure of KGs is more intricate than networks. Zhuang et al. [54] suggest partitioning entities from various knowledge bases into smaller blocks using predicates in the triples. Nonetheless, aligning predicates in different knowledge bases is already a challenging task, and the source codes of these methods are not available. Therefore, their implemented programs cannot be applied to the EA task. Zhang et al. [49] address the problem of linking large-scale heterogeneous entity graphs. However, the entity graph only includes entities in a few types, such as paper, author, and venue, and the relation types are also limited, which is very different from KGs. Thus, their proposed method, which depends on the characteristics of entity graphs, cannot be used for EA.

Several recent works have focused on addressing the efficiency issue in EA. Mao et al. [20] identify over-complex graph encoders and inefficient negative sampling strategies as the primary causes of poor efficiency in EA. They propose a novel KG encoder, Dual Attention Matching Network, to reduce computational complexity. However, their work focuses only on the representation learning stage and is evaluated on a medium-sized dataset, DWY100K. GM-EHD-JEA [42] formulates EA as a task assignment problem and proposes to solve it using the Hungarian algorithm. However, the Hungarian algorithm cannot be directly applied to EA due to its extra-large computation time. Therefore, they propose a space separation strategy to reduce the search space so that the Hungarian algorithm can work properly. This method is similar to our blocking strategy without the progressive procedure. However, we improve the performance by aggregating isolated blocks, and our progressive blocking process can further enhance efficiency.

Another recent work proposes a unidirectional strategy, CPS, to partition large-scale KGs and uses name information to improve alignment performance [10]. However, in general, the scalability issue in EA remains a critical and underexplored problem. It is worth noting that ER can be regarded as the general version of the EA task [50]. There have been several studies on improving the efficiency and scalability of ER, and we refer readers to the survey paper [7]. Our blocking strategy is inspired by these relevant works on ER.

#### **5.9 Conclusion**

In this chapter, we have highlighted the scalability issue in state-of-the-art EA approaches and proposed an effective solution, LIME, to address EA at scale. The LIME approach initially uses graph partition strategies that focus on seeds to divide large-scale KGs into smaller pairs. Then, LIME employs a novel reciprocal alignment inference strategy within each subgraph pair to generate alignment results based on the entity representations learned by existing embedding learning models. To enhance the scalability of reciprocal alignment inference, LIME suggests two variant strategies that can reduce computational costs, albeit with a slight decrease in performance. The experimental evaluations conducted on a novel large-scale EA dataset reveal that LIME can successfully address EA on a large scale. Besides, the empirical results on the popular EA datasets also validate the superiority of LIME and show that it can be applied to existing methods to improve their performance.

#### **Appendix**

#### **Correctness Analysis**

In Sect. 5.4.4, we examine the performance of reciprocal alignment inference and the baseline model (direct inference) under various conditions.

**Case 1** For the ground-truth entity pair *(u, v)*:

$$u = \arg\max\{\operatorname{sim}(v, u'),\ u' \in \mathcal{E}_{S}\};$$

$$v = \arg\max\{\operatorname{sim}(u, v'),\ v' \in \mathcal{E}_{T}\}.$$

As previously discussed in Sect. 5.4.4, when an accurate similarity signal is provided by the representation learning process, both reciprocal alignment inference and direct alignment inference can produce the correct alignment.

**Case 2** For the ground-truth entity pair *(u, v)*:

$$\begin{aligned} u &= \arg\max\{\operatorname{sim}(v, u'),\ u' \in \mathcal{E}_{S}\}; \\ v^* &= \arg\max\{\operatorname{sim}(u, v'),\ v' \in \mathcal{E}_{T}\}, \quad v^* \neq v. \end{aligned}$$

In this case, the direct alignment inference *cannot* generate the correct answer, since it only considers $u$'s preference and would generate $(u, v^*)$ as the answer. In comparison, reciprocal alignment inference does not necessarily generate the correct answer. This is because we only know $p_{u,v} = 1$, $p_{v,u} < 1$. Given any target entity $v'$, we can derive $p_{u,v'} \le 1$, $p_{v',u} \le 1$. Thus, $r_{u,v} \le r_{u,v'}$, while we cannot compare $r_{v,u}$ and $r_{v',u}$. Therefore, we also cannot compare $p_{u \leftrightarrow v} = (r_{u,v} + r_{v,u})/2$ with $p_{u \leftrightarrow v'} = (r_{u,v'} + r_{v',u})/2$, as the exact values are unknown.

**Case 3** For the ground-truth entity pair *(u, v)*:

$$\begin{aligned} u^* &= \arg\max\{\operatorname{sim}(v, u'),\ u' \in \mathcal{E}_{S}\}, \quad u^* \neq u; \\ v &= \arg\max\{\operatorname{sim}(u, v'),\ v' \in \mathcal{E}_{T}\}. \end{aligned}$$

In this case, the direct alignment inference can generate the correct answer since it only considers $u$'s preference. In comparison, reciprocal alignment inference does not necessarily generate the correct answer. This is because we only know $p_{u,v} < 1$, $p_{v,u} = 1$. Given any target entity $v'$, we can derive $p_{u,v'} \le 1$, $p_{v',u} < 1$. Thus, $r_{v,u} < r_{v',u}$, while we cannot compare $r_{u,v}$ and $r_{u,v'}$. Therefore, we also cannot compare $p_{u \leftrightarrow v} = (r_{u,v} + r_{v,u})/2$ with $p_{u \leftrightarrow v'} = (r_{u,v'} + r_{v',u})/2$, as the exact values are unknown.

**Case 4** For the ground-truth entity pair *(u, v)*:

$$u^* = \arg\max\{\operatorname{sim}(v, u'),\ u' \in \mathcal{E}_{S}\}, \quad u^* \neq u;$$

$$v^* = \arg\max\{\operatorname{sim}(u, v'),\ v' \in \mathcal{E}_{T}\}, \quad v^* \neq v.$$

In this case, the direct alignment inference *cannot* generate the correct answer, since it only considers $u$'s preference and would generate $(u, v^*)$ as the answer. In comparison, reciprocal alignment inference does not necessarily generate the correct answer. This is because we only know $p_{u,v} < 1$, $p_{v,u} < 1$. Given any target entity $v'$, we can derive $p_{u,v'} \le 1$, $p_{v',u} \le 1$. However, we cannot compare $r_{v,u}$ and $r_{v',u}$, or $r_{u,v}$ and $r_{u,v'}$. Therefore, we cannot compare $p_{u \leftrightarrow v} = (r_{u,v} + r_{v,u})/2$ with $p_{u \leftrightarrow v'} = (r_{u,v'} + r_{v',u})/2$, as the exact values are unknown.

To summarize, the direct alignment inference method can only provide correct results in Case 1 and Case 3, while our proposed reciprocal alignment inference strategy can generate correct answers in Case 1 and has the potential to produce correct results in other cases as well. Since the input similarity matrix is often not very accurate, as representation learning models may not fully capture the relatedness between entities, our proposed method is expected to perform better than direct alignment inference, as empirically demonstrated in our experiments.
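The contrast between the two inference strategies can be illustrated with a small sketch. The definitions of the preference score and the ranking are not restated in this part of the chapter, so the code *assumes* $p_{u,v} = \operatorname{sim}(u,v) - \max_{u'} \operatorname{sim}(v,u') + 1$ (which equals 1 exactly when $u$ is $v$'s best candidate, in line with the case analysis), takes $r_{u,v}$ to be the rank of $v$ among $u$'s preference scores (1 is best), and selects the target with the lowest average rank. The function names and the toy matrix are hypothetical.

```python
def direct_inference(sim):
    """For each source row, pick the target with the highest similarity."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def ranks_desc(scores):
    """Rank of each score (1 = highest); ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for pos, j in enumerate(order, start=1):
        ranks[j] = pos
    return ranks


def reciprocal_inference(sim):
    """Pick, for each source, the target with the lowest average mutual rank."""
    n_s, n_t = len(sim), len(sim[0])
    row_max = [max(row) for row in sim]
    col_max = [max(sim[u][v] for u in range(n_s)) for v in range(n_t)]
    # Assumed preference scores: p_uv is 1 exactly when u is v's best candidate.
    p_uv = [[sim[u][v] - col_max[v] + 1 for v in range(n_t)] for u in range(n_s)]
    p_vu = [[sim[u][v] - row_max[u] + 1 for u in range(n_s)] for v in range(n_t)]
    r_uv = [ranks_desc(p_uv[u]) for u in range(n_s)]
    r_vu = [ranks_desc(p_vu[v]) for v in range(n_t)]
    result = []
    for u in range(n_s):
        avg = [(r_uv[u][v] + r_vu[v][u]) / 2 for v in range(n_t)]
        result.append(min(range(n_t), key=avg.__getitem__))
    return result


# Toy Case 2 situation: the ground truth is (u0, v0) and (u1, v1).
sim = [[0.8, 0.9],
       [0.3, 0.95]]
print(direct_inference(sim))      # [1, 1]
print(reciprocal_inference(sim))  # [0, 1]
```

In this toy matrix, $u_0$ prefers $v_1$, so direct inference misses the pair $(u_0, v_0)$, whereas the averaged ranks recover it. This is only one favorable instance; as argued above, reciprocal inference is not guaranteed to succeed in Cases 2 to 4.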

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 6 Long-Tail Entity Alignment**

**Abstract** Most entity alignment solutions currently rely on structural information, specifically KG embedding, to align entities. However, in real-life KGs, the majority of entities have a sparse neighborhood structure, while only a few entities are densely connected to others. These less-connected entities are referred to as *long-tail entities*, and this phenomenon limits the effectiveness of using structural information for entity alignment.

To address this issue, we propose an approach that incorporates *entity name*  information, which is often overlooked but readily available. We amplify the weak structural information of long-tail entities with concatenated power mean word embeddings of their names during pre-alignment. To align entities, we introduce a novel complementary framework that combines both structural and name signals. It uses the entity's *degree* as a guide to fuse the two sources of information effectively and proposes a degree-aware *co-attention* network that dynamically adjusts the significance of features in a degree-aware manner. Finally, we propose using confident entity alignment results as anchors to complement original KGs with facts from their counterparts via *iterative* training during post-alignment. Experimental evaluations show the effectiveness of the proposed techniques.

#### **6.1 Introduction**

Many current approaches to entity alignment (EA) in knowledge graphs (KGs) heavily rely on the graph structure of KGs [4, 7, 10, 15, 18]. These approaches assume that equivalent entities have similar neighborhood structures. While these methods have achieved state-of-the-art performance on synthetic datasets extracted from large-scale KGs, as mentioned in [2, 15, 24], recent studies have shown that these *synthetic* datasets are much denser than real-life KGs. Furthermore, existing EA methods are *not* capable of yielding satisfactory results on datasets with real-life distributions, as discussed in [7].

A recent study [7] has shown that nearly half of the entities in real-life KGs are connected to fewer than three other entities; these are called *long-tail entities*. As a result, the KG is a relatively sparse graph. This matches our perception that only a few entities in real-life KGs are frequently accessed and have rich connections and detailed attributes, while the majority remain under-explored and provide little structural information. Consequently, existing EA methods that rely solely on structural information struggle to align these entities accurately, as demonstrated in the following example.

**Fig. 6.1** An example of EA. Nodes in gray (resp. white) are long-tail (resp. popular) entities (relation names and other entities are omitted in the interest of space)

*Example* Figure 6.1 shows a partial English KG (KG<sub>EN</sub>) and a partial Spanish KG (KG<sub>ES</sub>) concerning the film Summer 1993. Note that the entities The Bookshop and La Librería in gray describe the original novel, while those in white depict the film.

When aligning entities of high degree, e.g., Spain and España, structural information is of great help; however, for long-tail entities, e.g., Carla Simón in KG<sub>EN</sub>, structural information may suggest Laia Artigas in KG<sub>ES</sub> as its match, since they each have a single link to Summer 1993 and Verano 1993, respectively.

The example unveils the shortcoming of solely relying on structural information for EA, which renders existing EA methods sub-optimal, or even infeasible, for long-tail entities. Hence, we are motivated to revisit the key phases of the EA pipeline and address the challenge of EA when structural information is insufficient.

In the *pre-alignment* phase, we search for extra signals that can improve EA, and we find that entity names are a valuable source of information. This type of information is commonly present in real-life entities, but previous research has not given it sufficient attention. For example, for the long-tail entity Carla Simón in KG<sub>EN</sub>, incorporating entity name information would help find the correct mapping, which is Carla Simón in KG<sub>ES</sub>. This shows that entity name information can provide a supplementary perspective to the commonly used structural information in EA.

Previous studies [19–21] have already used name embeddings, specifically averaged word embeddings, to populate initial feature matrices for learning *structural representation*. However, our approach is different in that we use entity names as an additional source of signal, in addition to structural information. We achieve this by encoding the names through concatenated power mean word embeddings [22].

During the *alignment* phase, we carefully merge the two signals mentioned earlier by considering the fact that the significance of structural and name information differs for entities with varying degrees. In the example above, aligning the long-tail entity Carla Simón in KG<sub>EN</sub> relies more on entity name information than on its limited neighboring structure. Conversely, for mapping popular entities such as the film La Librería, where ambiguous entity names are present (i.e., both the film La Librería and the novel La Librería share the same name), structure plays a more significant role. Generally speaking, we can assume that the importance of the entity name signal is higher (resp. lower) for entities with lower (resp. higher) degrees, while the opposite is true for the signal from neighboring structure. In order to accurately represent the nonlinear dynamics between the two signals, we develop a co-attention network that uses entity degrees as a guide to determine the weights of various signals. It is important to note that [10] introduced degrees as a way to address the bias in *structural embedding* methods, which tend to place entities with similar degrees in close proximity. However, our motivation is different in that we use degrees to calculate pairwise similarities instead of individual embeddings.

During the *post-alignment* phase, our proposal is to significantly improve the structural information of the KGs by letting them recursively examine and cross-reference each other. While long-tail entities may lack structural information in their original knowledge graph (referred to as the "source KG"), the knowledge graph being aligned with (the "target KG") may have this information in a complementary manner. As an illustration, consider the entity Carla Simón. In KG<sub>EN</sub>, there may be missing information, such as the fact that Carla Simón is from España, which is present in KG<sub>ES</sub>. By pairing the surrounding entities and leveraging information from the target KG, the source KG can potentially acquire this missing information and improve the alignment. Inspired by the beneficial impact of using rules to complete knowledge graphs [2], we propose an *iterative* training procedure that includes knowledge graph completion. In each round, we use confident entity alignment results as anchors to identify and add any missing relations, thereby enhancing the current knowledge graphs. As a result, these knowledge graphs become enriched, which in turn allows for the learning of better structural embeddings. Additionally, the matching signal can propagate to long-tail entities, which were previously difficult to align in a single shot but may now become easier to align as a result of this iterative process.

**Contribution** In short, the contribution of this chapter can be summarized as follows:


**Organization** Section 6.2 overviews related work. In Sect. 6.3, we analyze the long-tail phenomenon in EA. DAT and its components are elaborated in Sect. 6.4. Section 6.5 introduces experimental settings, evaluation results, and detailed analysis, followed by conclusion in Sect. 6.6.

#### **6.2 Related Work**

**Conventional EA Framework** The advancements made by state-of-the-art methods can be analyzed based on a phased pipeline. Firstly, for the pre-alignment phase, KG representation methods such as TransE [3, 4, 23] and GCN [18] are utilized to encode structural information and embed KGs into low-dimensional spaces individually. Subsequently, for the alignment phase, the embedding spaces are evaluated and compared to derive alignment results under the supervision of seed entity pairs. Certain techniques [7, 14, 15] employ a method of combining training data to create a unified embedding space. This allows for the direct projection of entities from various KGs into the same space. Equivalence across KGs can then be identified by measuring the distance between entities in the unified embedding space during alignment. In order to enhance supervision signals by utilizing the outcomes of the alignment stage, post-alignment iterative techniques are utilized as described in [15, 23]. This approach involves updating structural embeddings and performing alignment recursively until a stopping condition is met. These techniques can be roughly summarized into a framework, depicted by Fig. 6.2.

**Recent Advancement on EA** Recent endeavors have been directed toward addressing structural heterogeneity by developing sophisticated structural learning models such as topic graph matching [21] and multichannel graph neural network [2]. A recent work enhances structural embedding through adversarial training that takes into account degree difference [10]. However, this approach may not be effective when the entities to be aligned are both in a low-frequency range. Furthermore, in that study, degree information is used to improve the learning of structural embeddings, whereas in our approach, degree information is used to combine two different alignment signals: structural and name information.

**Fig. 6.2** Conventional framework of EA

While iterative strategies can be effective in improving entity alignment (EA), previous research has shown that they can also have drawbacks. For example, they can be biased toward one knowledge graph (KG) and time-consuming [15], or they may introduce many false-positive instances [23], which is not ideal for real-life applications. In order to balance precision and computational efficiency, we propose a novel iterative training approach that incorporates a KG completion module. This module updates the structure of the KG in each round based on confident anchoring entity pairs. Our strategy is lightweight and limits the inclusion of incorrect pairs, reducing the likelihood of introducing false positives.

It is apparent that the majority of the aforementioned embeddings rely on structural information for learning, which can be inadequate for long-tail entities in some cases. To address this issue, some researchers have suggested incorporating *attributes* into embeddings in order to potentially compensate for the shortcomings of relying solely on structural information [14, 17, 18, 22]. However, a significant percentage (between 69 and 99%) of instances in popular KGs are lacking at least one attribute that other entities in the same class possess [6]. The use of *entity descriptions* [3] has been proposed as a way to provide additional information that is often missing in many KGs. While these efforts can improve overall performance, they may not effectively align entities in the long tail. Previous approaches have explored using entity names either as initial features for learning *structural representation* [19–21] or in combination with other information for representation learning [22]. In contrast, our proposed approach consolidates features from separate similarity matrices learned from structure and name information, with different strategies evaluated in Sect. 6.5.2.

#### **6.3 Impact of Long-Tail Phenomenon**

**Task Definition** Given a source KG $G_1 = (E_1, R_1, T_1)$ and a target KG $G_2 = (E_2, R_2, T_2)$, where $E_1$ (resp. $E_2$) represents source (resp. target) entities, $R$ denotes relations, and $T \subseteq E \times R \times E$ represents triples. Denote the seed entity pairs as $S = \{(e_1^i, e_2^i) \mid e_1^i = e_2^i,\ e_1^i \in E_1,\ e_2^i \in E_2\}$, $i \in [1, |S|]$, where $|\cdot|$ denotes the cardinality of a set. The EA task is to find new EA pairs based on $S$ and return the eventual results $S' = \{(e_1^i, e_2^i) \mid e_1^i = e_2^i,\ e_1^i \in E_1,\ e_2^i \in E_2\}$, $i \in [1, \min\{|E_1|, |E_2|\}]$, where $=$ expresses that two entities are the same physical one.

A recent study [7] identified that previous EA datasets contained KGs that were too densely connected, with degree distributions that differed significantly from those of real-life KGs. To address this issue, the authors created a new EA benchmark that better reflects real-life distributions. The benchmark includes both cross-lingual datasets, such as SRPRS<sub>EN-FR</sub> and SRPRS<sub>EN-DE</sub>, and mono-lingual datasets, such as SRPRS<sub>DBP-WD</sub> and SRPRS<sub>DBP-YG</sub>. The degree of an entity is defined as the number of relational triples it participates in. The degree distributions of entities in the test sets are reported in Table 6.1, together with the performance of RSNs, which was found to be the best solution in [7], measured as the number of correctly aligned entities at each degree.

The results presented in Table 6.1 indicate that in the SRPRS<sub>EN-FR</sub> and SRPRS<sub>DBP-YG</sub> datasets, over 50% of the entities have a degree of less than three, and in the SRPRS<sub>EN-DE</sub> and SRPRS<sub>DBP-WD</sub> datasets, almost half of the entities have a degree of only 1 or 2. This confirms that the majority of entities in the KGs have very few connections to others and are considered long-tail entities. The results also demonstrate that the accuracy on long-tail entities is much lower than that on higher-degree entities, even though RSNs is the leading method in the benchmark. This suggests that current methods are not effective in handling long-tail entities, which limits overall performance. Therefore, it is crucial to re-evaluate the entity alignment pipeline, with a particular focus on addressing the challenges posed by long-tail entities.
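To make the degree definition concrete, the following minimal sketch counts, for a list of triples, how many relational triples each entity participates in and reports the share of entities whose degree is at most 2. The function names, the toy triples, and the relation names are hypothetical and serve only as an illustration; they are not taken from the benchmark.

```python
from collections import Counter


def entity_degrees(triples):
    """Degree = number of relational triples an entity participates in."""
    deg = Counter()
    for head, _relation, tail in triples:
        deg[head] += 1
        deg[tail] += 1
    return deg


def long_tail_share(triples, max_degree=2):
    """Fraction of entities whose degree is at most max_degree."""
    deg = entity_degrees(triples)
    tail = sum(1 for d in deg.values() if d <= max_degree)
    return tail / len(deg)


# Toy KG with hypothetical relation names.
triples = [
    ("Summer 1993", "director", "Carla Simón"),
    ("Summer 1993", "starring", "Laia Artigas"),
    ("Summer 1993", "country", "Spain"),
    ("The Bookshop", "country", "Spain"),
]
print(entity_degrees(triples)["Summer 1993"])  # 3
print(long_tail_share(triples))                # 0.8
```

Note that a triple whose head and tail are the same entity would be counted twice in this sketch; whether the benchmark treats such cases differently is not stated here.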

#### **6.4 Methodology**

To provide an overview, we summarize the main components of the DAT (degree-aware entity alignment in tail) framework in Fig. 6.3, with the new designs highlighted in purple-blue. In pre-alignment, the *structural representation learning module* and the *name representation learning module* are put forward to learn useful features of entities, i.e., the structural representation and the name representation; in alignment, these features are forwarded to the *degree-aware fusion module* for effective fusion and alignment under the guidance of degree information. In post-alignment, the *KG completion module* aims to complete the KGs with confident EA pairs from the results, and the augmented KGs are then used again in the next round iteratively.

**Fig. 6.3** The framework of DAT

Since the *structural representation learning module* has been extensively studied, we adopt the state-of-the-art model RSNs [7] for this purpose. Given a structural embedding matrix $\mathbf{Z} \in \mathbb{R}^{n \times d_s}$ and two entities $e_1 \in G_1$ and $e_2 \in G_2$, their structural similarity $Sim_s(e_1, e_2)$ is the cosine similarity between $\mathbf{Z}(e_1)$ and $\mathbf{Z}(e_2)$, where $n$ denotes the number of all entities in the two KGs, $d_s$ is the dimension of structural embeddings, and $\mathbf{Z}(e)$ denotes the embedding vector for entity $e$ (i.e., $\mathbf{Z}(e) = \mathbf{Z}\mathbf{e}$, where $\mathbf{e}$ is the one-hot encoding of entity $e$). From the perspective of structure, the target entity with the highest similarity to a source entity is returned as its alignment result.
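The similarity-then-argmax step described above can be sketched in a few lines. The sketch assumes the embeddings are already available as a plain dictionary from entity identifiers to vectors (the embedding model itself, RSNs, is not reproduced here); the function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_by_similarity(emb, sources, targets):
    """Map every source entity to its most similar target entity."""
    return {s: max(targets, key=lambda t: cosine(emb[s], emb[t])) for s in sources}


# Toy embeddings: "a" and "b" are source entities, "x" and "y" are targets.
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "x": [0.9, 0.1], "y": [0.2, 0.8]}
print(align_by_similarity(emb, ["a", "b"], ["x", "y"]))  # {'a': 'x', 'b': 'y'}
```

The same routine applies unchanged to the name embeddings introduced in the next subsection, since both similarities are defined as cosine similarity.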

#### *6.4.1 Name Representation Learning*

Remembering that using structural information to align long-tail entities has limited effectiveness, we are taking a different approach from previous attempts that focus on utilizing structures. Instead, we are searching for a signal that is generally accessible to long-tail entities and can provide benefits for alignment.

In order to achieve this goal, we suggest including the textual names of entities, which has largely been ignored by current embedding-based EA methods. This approach is particularly attractive for several reasons, including: (1) the name of an entity is typically sufficient to identify it, and when given two entities, comparing their names is often the most straightforward way to determine if they are equivalent and (2) the majority of real-life entities have a name, and the proportion of entities with names is much greater than the proportion with other textual information, such as descriptions and attributes. This is particularly relevant for long-tail entities, which tend to lack such additional information.

Although there are many classic approaches for measuring the *string similarity* between entity names, we opt for *semantic similarity*, since it still works when the vocabularies of the KGs differ, especially in the *cross-lingual* scenario. Specifically, we choose a general form of power mean embeddings [11], which encompasses many well-known means such as the arithmetic mean, the geometric mean, and the harmonic mean. Given a sequence of word embeddings $\mathbf{w}_1, \dots, \mathbf{w}_l \in \mathbb{R}^d$, the power mean operation is formalized as:

$$\left(\frac{w_{1i}^p + \cdots + w_{li}^p}{l}\right)^{1/p}, \quad \forall i = 1, \dots, d, \quad p \in \mathbb{R} \cup \{\pm \infty\},\tag{6.1}$$

where *l* is the number of words and *d* denotes the dimension of embeddings. It can be seen that setting *p* to 1 results in the arithmetic mean, to 0 the geometric mean, to −1 the harmonic mean, to +∞ the maximum operation, and to −∞ the minimum operation [12].

Given a word embedding space $\mathbb{E}^i$, the embeddings of the words in the name of entity $s$ can be represented as $\mathbf{W}^i = [\mathbf{w}_1^i, \dots, \mathbf{w}_l^i] \in \mathbb{R}^{l \times d^i}$. Correspondingly, $H_p(\mathbf{W}^i) \in \mathbb{R}^{d^i}$ denotes the power mean embedding vector after feeding $\mathbf{w}_1^i, \dots, \mathbf{w}_l^i$ to Eq. (6.1). To obtain summary statistics of entity $s$, we compute $K$ power means of $s$ and concatenate them to get the entity name representation $\mathbf{s}^i \in \mathbb{R}^{d^i \cdot K}$, i.e.,

$$\mathbf{s}^{i} = H_{p_1}(\mathbf{W}^{i}) \oplus \cdots \oplus H_{p_K}(\mathbf{W}^{i}),\tag{6.2}$$

where $\oplus$ represents concatenation along rows and $p_1, \dots, p_K$ are $K$ different power mean values [12].

To get further representational power from different word embeddings, we generate the final entity name representation $\mathbf{n}_s$ by concatenating the $\mathbf{s}^i$ obtained from different embedding spaces $\mathbb{E}^i$:

$$\mathbf{n}_s = \bigoplus_i \mathbf{s}^i. \tag{6.3}$$

Note that the dimensionality of this representation is $d_n = \sum_i d^i \cdot K$. The name embeddings of all entities can be denoted in matrix form as $\mathbf{N} \in \mathbb{R}^{n \times d_n}$.
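As a minimal sketch of Eqs. (6.1) to (6.3), the code below computes concatenated power means for the three powers that avoid fractional powers of negative components, namely $p = 1$, $p = -\infty$, and $p = +\infty$ (mean, minimum, maximum). Restricting to these three values is a simplification made for this illustration; the function names and the toy vectors are hypothetical, and word vectors are assumed to be given as lists of floats, one list of word vectors per embedding space.

```python
NEG_INF = float("-inf")
POS_INF = float("inf")


def power_mean(vectors, p):
    """Element-wise power mean of word vectors for p in {1, -inf, +inf}."""
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if p == 1:
        return [sum(c) / len(vectors) for c in columns]
    if p == NEG_INF:
        return [min(c) for c in columns]
    if p == POS_INF:
        return [max(c) for c in columns]
    raise ValueError("this sketch only supports p = 1, -inf, +inf")


def name_embedding(spaces, powers=(1, NEG_INF, POS_INF)):
    """Concatenate K power means per space, then concatenate across spaces."""
    representation = []
    for word_vectors in spaces:          # one entry per embedding space
        for p in powers:                 # K power means within a space
            representation.extend(power_mean(word_vectors, p))
    return representation


# A two-word name embedded in a single 2-dimensional space (toy values).
W = [[1.0, -2.0], [3.0, 4.0]]
print(name_embedding([W]))  # [2.0, 1.0, 1.0, -2.0, 3.0, 4.0]
```

The output has length $d^i \cdot K = 2 \cdot 3 = 6$, matching the dimensionality stated above for a single embedding space.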

The representation space groups together entity names that are semantically related, similar to how word embeddings work. Given two entities $e_1 \in G_1$ and $e_2 \in G_2$, their name similarity $Sim_t(e_1, e_2)$ is calculated as the cosine similarity between their name representations $\mathbf{N}(e_1)$ and $\mathbf{N}(e_2)$. From the perspective of names, the alignment result for a source entity is the target entity with the highest similarity score.

**Discussion** The combined power mean word embedding, as presented in the article by Rücklé et al. [12], provides a superior alternative to averaged word embedding when it comes to representing entity names. This is because it is better equipped to capture and synthesize the relevant information conveyed by an entity name.<sup>1</sup> Averaging word embeddings results in a significant loss of information because it fails to account for the semantic variation that can exist within different names. On the other hand, using concatenated power means produces a more accurate summary by reducing ambiguity and uncertainty in the representation of an entity name. This is supported by the empirical evidence presented in Sect. 6.5.3.

It should be noted that in the context of cross-lingual entity alignment, we rely on pre-trained multilingual word embeddings, as described in [5]. These embeddings have already aligned words from different languages into a shared semantic space. As a result, entity names from multiple languages can exist within the same semantic space, obviating the need to design a separate mapping function for aligning multilingual embeddings.

The method described above can be extended to accommodate other textual information, such as attributes, without sacrificing its generality. One simple approach is to concatenate the attributes and entity name to form a "sentence" that provides a more comprehensive description of the entity. This combined sentence can then be encoded using concatenated power mean word embeddings. However, the integration of additional information and more complex adaptations is not within the scope of this chapter.

#### *6.4.2 Degree-Aware Co-attention Feature Fusion*

Entity identities can be characterized by various types of features from different perspectives. Therefore, it is important to have a feature fusion module that effectively combines these different signals. Some researchers have proposed to integrate different embeddings into a unified representation space [22], but this approach necessitates additional training to align irrelevant features. A more desirable strategy involves first computing the similarity matrix within each feature-specific space and then combining the similarity scores across the feature-specific spaces [9, 18]. However, the contributions of different features vary for entities with different degrees. For long-tail entities that lack structural information, entity name representation should be given more weight, whereas for popular entities, the structural representation is relatively more informative than the entity name information. To address this dynamic shift, we draw inspiration from the bi-attention mechanism proposed in [13] and design a degree-aware co-attention network, depicted in Fig. 6.4.

Formally, we are given the structural embedding matrix $\mathbf{Z}$ and the name embedding matrix $\mathbf{N}$. For each entity pair $(e_1, e_2)$, where $e_1 \in G_1$ and $e_2 \in G_2$, we calculate a similarity score between $e_1$ and $e_2$. This similarity score is then used to determine the alignment result. To compute the overall similarity between entity pairs, we first calculate the feature-specific similarity scores, $Sim_s(e_1, e_2)$ and $Sim_t(e_1, e_2)$, between $e_1$ and $e_2$, as explained in the previous subsections. Our degree-aware co-attention network is designed to determine the weights for $Sim_s(e_1, e_2)$ and $Sim_t(e_1, e_2)$ by incorporating degree information. This network consists of three stages: feature matrix construction, co-attention similarity matrix calculation, and weight assignment.

**Fig. 6.4** Degree-aware co-attention feature fusion

<sup>1</sup> For possible out-of-vocabulary (OOV) words, we skip them and use the embeddings of the rest to produce entity name embeddings.

**Feature Matrix Construction** Apart from entity name and structural information, we also include entity degree information to construct a feature matrix for each entity. To be precise, we represent entity degrees as one-hot vectors over all possible degree values and pass them through a fully connected layer to obtain a continuous degree vector. As an example, the degree vector of $e_1$ can be represented as $\mathbf{g}_{e_1} = \mathbf{M} \cdot \mathbf{h}_{e_1} \in \mathbb{R}^{d_g}$, where $\mathbf{h}_{e_1}$ is the one-hot representation of its degree, $\mathbf{M}$ is the weight matrix of the fully connected layer, and $d_g$ denotes the dimension of the degree vector. This continuous degree vector, along with the structural and entity name representations, is stacked to form an entity's feature matrix. For entity $e_1$:

$$\mathbf{F}_{e_1} = [\mathbf{N}(e_1); \mathbf{Z}(e_1); \mathbf{g}_{e_1}] \in \mathbb{R}^{3 \times d_m},\tag{6.4}$$

where $;$ denotes the concatenation along *columns*, $d_m = \max\{d_n, d_s, d_g\}$, and we pad the missing values with 0s.

**Co-attention Similarity Matrix Calculation** To model the interaction between $\mathbf{F}_{e_1}$ and $\mathbf{F}_{e_2}$, as well as highlight important features, we build a co-attention matrix $\mathbf{S} \in \mathbb{R}^{3 \times 3}$, where the similarity between the $i$-th feature of $e_1$ and the $j$-th feature of $e_2$ is computed by:

$$\mathbf{S}_{ij} = \alpha(\mathbf{F}_{e_1}^{i:}, \mathbf{F}_{e_2}^{j:}) \in \mathbb{R},\tag{6.5}$$

where $\mathbf{F}_{e_1}^{i:}$ is the $i$-th row vector and $\mathbf{F}_{e_2}^{j:}$ is the $j$-th row vector, $i = 1, 2, 3$; $j = 1, 2, 3$. $\alpha(\mathbf{u}, \mathbf{v}) = \mathbf{w}^\top(\mathbf{u} \oplus \mathbf{v} \oplus (\mathbf{u} \circ \mathbf{v}))$ is a trainable scalar function that encodes the similarity, where $\mathbf{w} \in \mathbb{R}^{3d_m}$ is a trainable weight vector and $\circ$ is the element-wise multiplication. Note that the implicit multiplication is a matrix multiplication.

**Weight Assignment** The co-attention similarity matrix $\mathbf{S}$ is used to generate attention vectors, $\mathbf{att_1}$ and $\mathbf{att_2}$, in both directions. The attention vector $\mathbf{att_1}$ indicates the feature vectors in $e_1$ that are most important or relevant to the feature vectors in $e_2$. Similarly, $\mathbf{att_2}$ indicates the feature vectors in $e_2$ that are most important or relevant to the feature vectors in $e_1$. To achieve this, we pass the co-attention similarity matrix $\mathbf{S}$ through a softmax layer. Next, the resulting matrix from the softmax layer is compressed using an average layer to create the attention vectors. It is worth noting that when performing column-wise operations in the softmax layer and row-wise operations in the average layer, we get $\mathbf{att_1}$. Conversely, when conducting row-wise operations in the softmax layer and column-wise operations in the average layer, we obtain $\mathbf{att_2}$.

Eventually, we multiply the feature-specific similarity scores with the attention values to obtain the final similarity score:

$$Sim(e_1, e_2) = Sim_s(e_1, e_2) \cdot \mathbf{att_1}^s + Sim_t(e_1, e_2) \cdot \mathbf{att_1}^t,\tag{6.6}$$

where $\mathbf{att_1}^s$ and $\mathbf{att_1}^t$ are the corresponding weight values for the structural and name similarity scores, respectively. Note that $Sim(e_1, e_2) \neq Sim(e_2, e_1)$ in general, as the two directions may have different attention weight vectors.

The model that combines co-attention and feature fusion has a relatively simple structure with only two parameters, **M** and **w**. Furthermore, it is straightforward to modify this model to include additional features.
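The three stages can be traced end to end with a small sketch in plain Python. It follows one reading of the description above: the column-wise softmax normalizes each column of $\mathbf{S}$ over the features of $e_1$, and the row-wise average then yields one weight per feature of $e_1$. The function names are hypothetical, the parameters $\mathbf{M}$ and $\mathbf{w}$ are passed in as fixed lists rather than trained, and the toy values are for illustration only.

```python
import math


def degree_vector(degree, M):
    """g = M · h with h the one-hot degree, i.e. column `degree` of M."""
    return [row[degree] for row in M]


def feature_matrix(name_vec, struct_vec, deg_vec):
    """Stack the three vectors as rows, zero-padded to width d_m (Eq. 6.4)."""
    d_m = max(len(name_vec), len(struct_vec), len(deg_vec))
    return [list(v) + [0.0] * (d_m - len(v)) for v in (name_vec, struct_vec, deg_vec)]


def alpha(u, v, w):
    """Scalar similarity w^T (u ⊕ v ⊕ (u ∘ v)) used in Eq. (6.5)."""
    features = list(u) + list(v) + [a * b for a, b in zip(u, v)]
    return sum(wi * fi for wi, fi in zip(w, features))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(F1, F2, w):
    """att_1: one weight per row (name, structure, degree) of F1."""
    S = [[alpha(F1[i], F2[j], w) for j in range(3)] for i in range(3)]
    # Column-wise softmax (over i), then row-wise average (over j).
    cols = [softmax([S[i][j] for i in range(3)]) for j in range(3)]
    return [sum(cols[j][i] for j in range(3)) / 3 for i in range(3)]


def fused_similarity(sim_s, sim_t, att1):
    """Eq. (6.6); att1 is ordered (name, structure, degree) like F's rows."""
    att_name, att_struct, _att_degree = att1
    return sim_s * att_struct + sim_t * att_name


# Toy example with untrained (all-zero) w, so all attention weights are equal.
F = feature_matrix([0.1, 0.2, 0.3], [0.5, 0.5], [1.0])
att1 = attention_weights(F, F, [0.0] * 9)
print(round(fused_similarity(0.6, 0.9, att1), 6))  # 0.5
```

Because each softmax column sums to 1, the three entries of `att1` also sum to 1. The weight assigned to the degree row is not used in Eq. (6.6); the degree vector influences the result only through its interaction with the other rows in $\mathbf{S}$.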

**Training** The training objective is to maximize the similarity scores of the training entity pairs, which can be converted to minimizing the following loss function:

$$L = \sum\_{(e\_1, e\_2) \in S} [-Sim(e\_1, e\_2) + \gamma]\_+ + [-Sim(e\_2, e\_1) + \gamma]\_+,\tag{6.7}$$

where $[x]_+ = \max\{0, x\}$ and $\gamma$ is a constant number.
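For illustration, Eq. (6.7) can be evaluated as follows. The sketch assumes the fused similarity is available as a callable `sim(a, b)` (a hypothetical interface) and uses $\gamma = 0.8$, the value reported in the parameter settings of this chapter; no gradient computation is shown.

```python
def hinge(x):
    """[x]_+ = max{0, x}."""
    return max(0.0, x)


def fusion_loss(seed_pairs, sim, gamma=0.8):
    """Sum of hinge terms in both directions over the seed pairs (Eq. 6.7)."""
    return sum(
        hinge(gamma - sim(e1, e2)) + hinge(gamma - sim(e2, e1))
        for e1, e2 in seed_pairs
    )


# Toy directed similarities: Sim(a, x) already exceeds gamma, Sim(x, a) does not.
scores = {("a", "x"): 0.9, ("x", "a"): 0.5}
loss = fusion_loss([("a", "x")], lambda p, q: scores[(p, q)])
print(round(loss, 6))  # 0.3
```

Each seed pair contributes to the loss only in the directions where its similarity is still below $\gamma$.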

**Discussion** Alternative methods of implementing degree-aware weighting are possible, such as applying $\operatorname{sigmoid}(\mathbf{W} \cdot [\mathbf{N}(e), \mathbf{Z}(e), \mathbf{g}_e])$, where $\mathbf{W}$ represents the parameter. In this study, we utilize a co-attention mechanism to combine various signal channels with degree-aware weights, which highlights the benefits of incorporating degrees for effective EA in the tail. However, a more comprehensive comparison with other implementations is a subject for future research.

#### *6.4.3 Iterative KG Completion*

The concept of iterative self-training has been shown to be effective and warrants further investigation, as demonstrated in previous studies [15, 23]. However, current research has failed to consider the potential for enriching structural information during the iterative process. Our findings suggest that, while long-tail entities in the source KG may lack structural information, this information can be found in the target KG in a complementary manner. By mining confident EA results and using them as pseudo matching pairs to anchor subgraphs, we can replenish the original KG with facts from its counterpart, thereby mitigating the KGs' structural sparsity. This can significantly improve KG coverage and reduce the number of long-tail entities. As the structural learning model generates increasingly better structural embeddings from the amplified KGs, the accuracy of EA results in subsequent rounds also improves naturally in an iterative fashion.

To start, we describe how we incorporate EA pairs that have a high level of confidence. Our focus is on preventing the inclusion of any incorrect pairs that could potentially harm the model. To achieve this, we devise a strict criterion for choosing EA pairs. For every entity $e_1 \in E_1 - S_1$ (in $G_1$ but not in the training set), suppose its most similar entity in $G_2$ is $e_2$, its second most similar entity is $e_2'$, and the difference between the similarity scores is $\Delta_1 \triangleq Sim(e_1, e_2) - Sim(e_1, e_2')$. If, for $e_2$, its most similar entity in $G_1$ is exactly $e_1$, its second most similar entity is $e_1'$, the difference between the similarity scores is $\Delta_2 \triangleq Sim(e_2, e_1) - Sim(e_2, e_1')$, and $\Delta_1$ and $\Delta_2$ are both above a given threshold $\theta$, then $(e_1, e_2)$ is considered a correct pair. This is a relatively strong constraint, as it requires that (1) the similarity between the two entities is the highest from both sides, respectively, and (2) there is a margin between the top two candidates.
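The selection rule can be sketched as follows, assuming the two directed similarity tables are available as nested dictionaries (`sim_12[e1][e2]` for $Sim(e_1, e_2)$ and `sim_21[e2][e1]` for $Sim(e_2, e_1)$). These names and the toy scores are hypothetical; the default $\theta = 0.05$ is the value reported in the parameter settings of this chapter.

```python
def top_two(scores):
    """Return the best candidate and its margin over the runner-up."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    # With a single candidate there is no runner-up; treat the margin as infinite.
    margin = scores[best] - scores[ranked[1]] if len(ranked) > 1 else float("inf")
    return best, margin


def confident_pairs(sim_12, sim_21, theta=0.05):
    """Keep (e1, e2) if each is the other's top match with margin above theta."""
    pairs = []
    for e1, row in sim_12.items():
        e2, delta_1 = top_two(row)
        back, delta_2 = top_two(sim_21[e2])
        if back == e1 and delta_1 > theta and delta_2 > theta:
            pairs.append((e1, e2))
    return pairs


# "a" and "x" clearly prefer each other; "b" has two near-tied candidates.
sim_12 = {"a": {"x": 0.9, "y": 0.5}, "b": {"x": 0.62, "y": 0.6}}
sim_21 = {"x": {"a": 0.9, "b": 0.62}, "y": {"a": 0.5, "b": 0.6}}
print(confident_pairs(sim_12, sim_21))  # [('a', 'x')]
```

In the toy data, `b` is rejected because the margin between its two candidates (about 0.02) is below $\theta$, even though it has a top-ranked candidate.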

Once we have integrated the high-confidence EA results into the initial set of entity pairs, we use these entities (*S*<sub>*a*</sub>) to connect the two KGs and supplement them with new facts from each other. For example, if a triple *t*<sub>1</sub> ∈ *T*<sub>1</sub> has both its head and tail entities matching entries in *S*<sub>*a*</sub>, we replace the entities in *t*<sub>1</sub> with the corresponding entities in *E*<sub>2</sub> and add the new triple to *T*<sub>2</sub>. While this may seem like a simple and straightforward approach, it effectively increases the overall coverage of the KGs. Finally, we leverage the augmented KGs to improve the quality of the structural representations, which in turn contributes to enhancing the EA performance. This iterative completion process is repeated for *ζ* rounds.
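The triple transfer step can be illustrated as follows. This is a sketch under stated assumptions: triples are plain `(head, relation, tail)` tuples, the anchor set is a dictionary from entities of the first KG to entities of the second, and the relation label is carried over unchanged. The function name and toy identifiers are hypothetical.

```python
def complete_target_kg(triples_1, triples_2, aligned):
    """Copy triples from KG1 into KG2 when both head and tail are anchored.

    aligned maps an entity of KG1 to its counterpart in KG2 (the anchor set).
    The relation is kept as-is, which is an assumption of this sketch.
    """
    transferred = set()
    for head, relation, tail in triples_1:
        if head in aligned and tail in aligned:
            transferred.add((aligned[head], relation, aligned[tail]))
    return set(triples_2) | transferred


triples_1 = {("a1", "r1", "b1"), ("a1", "r2", "c1")}
triples_2 = {("a2", "r3", "b2")}
aligned = {"a1": "a2", "b1": "b2"}

completed = complete_target_kg(triples_1, triples_2, aligned)
print(sorted(completed))
```

The opposite direction (enriching the first KG from the second) follows by inverting the mapping and swapping the two triple sets.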

**Discussion** Certain EA methods also use bootstrapping or iterative training techniques, but their primary goal is to expand the training signals for updating the embeddings, without modifying the underlying structure of the KGs. In comparison to other approaches of selecting EA pairs which can be slow and may generate inaccurate results [15, 23], we improve this process by prioritizing two entities if they give each other priority. This is empirically validated in Sect. 6.5.5.

#### **6.5 Experiments**

This section reports the experiments with in-depth analysis.<sup>2</sup>

#### *6.5.1 Experimental Setting*

**Dataset** We use SRPRS [7] due to the KG pairs having a distribution similar to the real world. It was created with inter-language links and references in DBpedia, and each entity has an equivalent counterpart in the other KG. The relevant details are listed in Table 6.2, and 30% of entity pairs are utilized for training.

**Parameter Settings** For the *structural representation learning module*, we follow the settings in [7], except for setting *d*<sub>*s*</sub> to 300. Regarding the *name representation learning module*, we set **p** = [*p*<sub>1</sub>, ..., *p*<sub>*K*</sub>] to [1, min, max]. For mono-lingual datasets, we merely use the fastText embeddings [1] as the word embedding (i.e., only one embedding space in Eq. (6.3)). For cross-lingual datasets, the multilingual word embeddings are obtained from MUSE,<sup>3</sup> and two word embedding spaces (from two languages) are used in Eq. (6.3). As for the *degree-aware fusion module*, we set *d*<sub>*g*</sub> to 300, *γ* to 0.8, and the batch size to 32. Stochastic gradient descent is used to minimize the loss function, with the learning rate set to 0.1, and we use early stopping to prevent over-fitting. In the *KG completion module*, *θ* is set to 0.05 and *ζ* is set to 3.

<sup>2</sup> The source code is available at https://github.com/DexterZeng/DAT.

<sup>3</sup> https://github.com/facebookresearch/MUSE.

**Evaluation Metric** We use Hits@*k* (*k* = 1, 10) and the mean reciprocal rank (MRR) as evaluation metrics. For each source entity, entities in the other KG are ranked according to their similarity scores *Sim* with the source entity in descending order. Hits@*k* measures the proportion of source entities whose correct counterpart is ranked among the top-*k* most similar entities. In particular, Hits@1 indicates the accuracy of the alignment results. MRR, on the other hand, is the average of the reciprocal ranks of the ground-truth results. Higher Hits@*k* and MRR indicate better performance. Unless stated otherwise, the results of Hits@*k* are represented as percentages. The best results are displayed in **bold** in the tables.
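As a concrete illustration of these metrics, the following sketch computes Hits@*k* and MRR from per-entity similarity rows. It assumes that ties are broken in favor of the ground truth (only strictly larger scores push it down the ranking); the function names and toy scores are illustrative only.

```python
def rank_of_truth(scores, truth_index):
    """1-based rank of the ground-truth candidate under descending similarity."""
    truth_score = scores[truth_index]
    return 1 + sum(1 for s in scores if s > truth_score)


def hits_at_k(ranks, k):
    """Fraction of source entities whose ground truth is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr(ranks):
    """Mean of the reciprocal ranks of the ground-truth candidates."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Each row holds the similarities of one source entity to all target entities.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
truth = [0, 1, 0]  # index of the correct target entity for each row

ranks = [rank_of_truth(row, t) for row, t in zip(sim, truth)]
print(ranks, hits_at_k(ranks, 1), hits_at_k(ranks, 2), mrr(ranks))
```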

**Competitors** Overall, 13 state-of-the-art methods are involved in the comparison. The group that solely utilizes structural features includes (1) MTransE [4], which proposes to utilize TransE for EA; (2) IPTransE [23], which uses an iterative training process to improve the alignment results; (3) BootEA [15], which devises an alignment-oriented KG embedding framework and a bootstrapping strategy; (4) RSNs [7], which integrates recurrent neural networks with residual learning; (5) MuGNN [2], which puts forward a multichannel graph neural network to learn alignment-oriented KG embeddings; (6) KECG [8], which proposes to jointly learn knowledge embeddings that encode inner-graph relationships, and a cross-graph model that enhances entity embeddings with their neighbors' information; and (7) TransEdge [16], which presents a novel edge-centric embedding model that contextualizes relation representations in terms of specific head-tail entity pairs.

Various methods have been proposed to incorporate other types of information in EA. JAPE [14] utilizes attributes of entities to refine structural information. GCN [18] generates entity embeddings and attribute embeddings to align entities in different KGs. GM-Align [21] builds a local subgraph of an entity to represent it and utilizes entity name information to initialize the framework. MultiKE [22] offers a novel framework that unifies the views of entity names, relations, and attributes at *representation-level for mono-lingual* EA. RDGCN [19] proposes a relation-aware dual-graph convolutional network to incorporate relation information via attentive interactions between KG and its dual relation counterpart. HGCN [20] is a learning framework that jointly learns entity and relation representations for EA.

#### *6.5.2 Results*

Table 6.3 presents the results. The first group of approaches uses only structural information for alignment. BootEA and KECG outperform MTransE and IPTransE because of their alignment-oriented KG embedding framework and attention-based graph embedding model, respectively. RSNs further improves the results by taking into account long-term relational dependencies between entities, which can capture more structural signals for alignment. TransEdge achieves the best performance due to its edge-centric KG embedding and bootstrapping strategy. MuGNN fails to produce effective results as there are no aligned relations on SRPRS, which prevents the rule transferring from taking place and limits the number of detected rules. It is noteworthy that Hits@1 values on most datasets are below 50%, demonstrating the inadequacy of solely relying on KG structure, especially when long-tail entities make up the majority.

<sup>a</sup> When running GM-Align, it is noted that entities without valid name embeddings are excluded from evaluation, and hence we consider that GM-Align fails to align these entities without specifying rankings, which leads to the lack of Hits@10 and MRR values.

Regarding the second group, both GCN and JAPE exploit attribute information to complement structural signals. However, they fail to outperform the leading method in the first group, which can be attributed to the limited effect of attributive information. The other four methods make use of the publicly available entity name data. The substantial improvement in results compared to those of the first group confirms the value of this feature. Our framework, DAT, demonstrates its superiority over GM-Align, RDGCN, and HGCN with a 10% improvement in Hits@1 over all datasets, validating the effectiveness of exploiting entity name information. The fundamental explanation for this is that the fusion of features on the representation level by GM-Align, RDGCN, and HGCN may lead to information loss since the resulting merged feature representation may not retain the distinguishing features of the original ones. On the other hand, DAT adopts a co-attention network to compute feature weights and fuse features at the output level, which is based on feature-specific similarity scores.

**Evaluation by Degree** We present the outcomes of DAT in terms of degree to illustrate its ability to align long-tail entities, as shown in Table 6.4. It is worth noting that the degree pertains to the original degree distribution since the entity degree may be changed by the completion process.

Table 6.4 indicates that for entities with a degree of 1, the Hits@1 scores of DAT are two or three times higher than those of RSNs, confirming the capability of DAT in handling the long-tail problem. While there is also an improvement in the performance of DAT for popular entities, the gap between DAT and RSNs is much smaller than that observed in the case of long-tail entities. Furthermore, DAT outperforms RDGCN in all degree categories across four datasets, despite both using entity name information as an external signal for EA.

**Comparison with MultiKE on Dense Datasets** The results of MultiKE on SRPRS are not reported because MultiKE can only handle mono-lingual datasets and requires prior knowledge of the relations' semantics. Nevertheless, to better understand DAT, we present its experimental results on the *dense* datasets on which MultiKE was previously evaluated. Specifically, the dense datasets, DWY100K<sub>DBP-WD</sub> and DWY100K<sub>DBP-YG</sub>, are similar to SRPRS<sub>DBP-WD</sub> and SRPRS<sub>DBP-YG</sub>, but have a larger scale (100K entities on each side) and higher density [15].

When evaluated on dense datasets, DAT produces superior results with Hits values exceeding 90% and MRR surpassing 0.95, as presented in Table 6.5. This indicates that DAT effectively utilizes name information, which can be credited to the degree-aware feature fusion module and the approach of first computing scores within each view rather than learning a merged representation that may result in the loss of information.




**Table 6.5** Experimental results on dense datasets

## *6.5.3 Ablation Study*

We report an ablation study on the SRPRS<sub>EN-FR</sub> dataset in Table 6.6.

**Iterative KG Completion** If we remove the entire module, the performance of EA drops by 3.7% on Hits@1 (comparing DAT with DAT w/o IKGC). However, if we eliminate only the KG completion module while keeping the iterative process (similar to [23]), Hits@1 decreases by 1.9% (DAT vs. DAT w/o KGC). This validates the significance of KG completion. We also present the dynamic change of the degree distribution after each round (original, R1, R2, R3) in Fig. 6.5, which suggests that the embedded KG completion improves KG coverage and reduces the number of long-tail entities.

**Degree-Aware Co-attention Feature Fusion** In Table 6.6, it can be observed that if fixed equal weights are used instead of the *degree-aware fusion module*, Hits@1 decreases by 2.7% (DAT vs. DAT w/o ATT). This result confirms that adjusting the weights of features dynamically based on entity degree leads to better integration of features and, as a result, more accurate alignment results. In Fig. 6.6, we present the weight of the structural representation generated by our degree-aware fusion model across different degrees (in the first round). This figure demonstrates that, in general, the importance of structural information increases with the degree of entities, which is in line with our expectations.

**Table 6.6** Ablation study on SRPRS<sub>EN-FR</sub>

| Methods | Hits@1 | Hits@10 | MRR |
|---|---|---|---|
| DAT | **75.8** | **89.9** | **0.81** |
| DAT w/o IKGC | 72.1 | 85.4 | 0.77 |
| DAT w/o KGC | 73.9 | 88.6 | 0.79 |
| DAT w/o ATT | 73.1 | 88.5 | 0.79 |
| DAT w/o CPM | 75.3 | 89.7 | 0.80 |

**Fig. 6.6** Weight distribution of structural representation

**Concatenated Power Mean Word Embeddings** We compare concatenated power mean word embeddings with averaged word embeddings in terms of aligning entities, denoted as DAT and DAT w/o CPM, respectively. The concatenated variant yields a modest gain (Hits@1 of 75.8 versus 75.3 in Table 6.6), which indicates that combining multiple power mean embeddings captures more alignment features.

#### *6.5.4 Error Analysis*

We conduct an error analysis on the SRPRS<sub>EN-FR</sub> dataset to investigate the contribution of each module and the cases where DAT falls short. Using only structural information leads to a high error rate of 65.5% on Hits@1. The dataset contains 67.0% long-tail entities (i.e., with degree ≤ 3), with a majority (65.1%) being misaligned. However, incorporating entity name information and dynamically fusing it with structural information significantly reduces the overall Hits@1 error rate to 27.9%, with a corresponding reduction in the long-tail entity error rate to 33.2%. Furthermore, we employ iterative KG completion to replenish structure and propagate signals, which further decreases the overall Hits@1 error rate to 24.2%. This approach also reduces the percentage of long-tail entities to 49.7%, with only 8.3% being misaligned. Overall, our results indicate that long-tail entities initially account for the most errors, but employing the proposed techniques reduces not only the error rate but also the contribution of long-tail entities to the overall error.

For cases that DAT cannot solve, we provide an analysis that focuses on the information related to entity names. Out of the incorrect cases (24.2% in SRPRS<sub>EN-FR</sub>), 41% do not have an appropriate entity name embedding because all the words in the name are out-of-vocabulary (OOV), and 31% have partially OOV names. Additionally, 15% could have been correct by solely utilizing the name information, but they were misled by structural signals, while 13% fail to align because of either the inadequacy of the entity name representation method or the fact that the entities with the same name refer to different physical objects.

#### *6.5.5 Further Experiment*

We substantiate the efficacy of our iterative training approach by performing the following experiments.

Our iterative approach differs from current methods not only in the embedded KG completion procedure but also in the choice of confident pairs. To showcase its advantage, we remove the KG completion module from DAT and obtain DAT-I to compare its selection methods with those of [15, 23]. In [23], the authors use a threshold-based method (TH) to find pairs. For each *nonaligned* source entity, it identifies the most comparable *nonaligned* target entity, and if the similarity between the two entities exceeds a specified threshold, they are deemed confident pairs. In [15], the authors use a maximum weight graph matching (MWGM) method to find confident entity alignment pairs. For each source entity, it calculates the alignment likelihood to every target entity, and only those with likelihood above a given threshold are considered in a maximum likelihood matching process under a 1-to-1 mapping constraint, which generates a solution that contains confident EA pairs. We implement the methods within our framework and adjust the parameters based on the original papers. To evaluate the effectiveness of various iterative training techniques, we use the number of chosen confident EA pairs, the accuracy of these pairs, and the duration of each round as primary metrics.

To ensure fairness in the comparison, we present the outcomes of the initial three rounds in Fig. 6.7. The findings indicate that DAT-I outperforms the other two methods regarding the quantity and quality of chosen pairs in a relatively shorter time. As MWGM necessitates solving a global optimization problem, it takes considerably more time. Nonetheless, compared to TH, it performs better in terms of the accuracy of selected pairs.

**Fig. 6.7** Panel (c): running time consumption (s)

#### **6.6 Conclusion**

In this chapter, we present an improved framework called DAT for entity alignment, which specifically focuses on handling long-tail entities. Recognizing the limitations of relying solely on structural information, we propose to incorporate entity name information in the pre-alignment phase through concatenated power mean embedding. For alignment, we introduce a co-attention feature fusion network that dynamically adjusts the weights of different features guided by degree to consolidate various signals. In the post-alignment phase, we enhance the performance by iteratively completing the KG with confident EA results as anchors, thereby amplifying the structural information. We evaluate DAT on cross-lingual and monolingual EA benchmarks and achieve superior results.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 7 Weakly Supervised Entity Alignment**

**Abstract** The majority of state-of-the-art entity alignment solutions heavily rely on the labeled data, which are difficult to obtain in practice. Therefore, it calls for the study of EA with *scarce supervision*. To resolve this issue, we put forward a *reinforced active* entity alignment framework to select the entities to be manually labeled with the aim of enhancing alignment performance with minimal labeling efforts. Under this framework, we further devise an unsupervised *contrastive*  loss to contrast different views of entity representations and augment the limited supervision signals by exploiting the vast unlabeled data. We empirically evaluate our proposal on eight popular KG pairs, and the results demonstrate that our proposed model and its components consistently boost the alignment performance under scarce supervision.

#### **7.1 Introduction**

The entity alignment performance heavily relies on the amount of labeled data (i.e., aligned entity pairs). It has been empirically verified that the alignment accuracy drops sharply when decreasing the number of seed entity pairs [23]. This is also illustrated in Fig. 7.1, where we summarize the alignment performance of the most performant EA solutions given varying sizes of training data.<sup>1</sup> Although this problem is prominent, it has been largely neglected by the existing literature, since existing works directly extract the supervision signals from the inter-language links in DBpedia [1] or reference links among DBpedia, YAGO [22], Freebase [3], and Wikidata [28]. In practice though, these prior alignments might not exist among KGs constructed from different sources. In this case, manual annotation is required to produce such labeled data, which is a nontrivial task since the annotator needs to retrieve the entity equivalent to a given entity from a vast pool of candidates. Thus, to reduce the manual labeling cost and also the reliance on labeled data, it is of great significance to study EA with scarce supervision.

<sup>1</sup> Note that in this chapter, we confine the main discussion to EA solutions that only use the KG structure and exclude those using auxiliary information such as entity descriptions, since some KGs might have little or even no auxiliary information [35], and the former can be regarded as a more general case of EA. Nevertheless, we will show that our proposed method can also be applied to the latter in the experiment.

In this chapter, we propose to approach EA with limited supervision by addressing two key research questions: **(Q1)** Given a fixed labeling budget, how to select the entities for manual annotation so that the labeled data can provide more useful guidance for the alignment? This can also be interpreted as, to reach a certain target of alignment performance, how to optimize the selection of entities for labeling so that we could label as few entities as possible? **(Q2)** Given a limited number of labeled data, how can we leverage the rich unlabeled data to facilitate the alignment?

In response to **Q1**, we exploit active learning (AL) to overcome the labeling bottleneck by asking queries in the form of unlabeled entities to be labeled by an oracle (e.g., a human annotator) [9, 21]. Through designing effective query strategies, the active learner can achieve satisfying performance using as few labeled instances as possible, thereby reducing the cost of obtaining labeled data [21]. In this chapter, we develop several query strategies to characterize the informativeness of entities from different angles and offer a reinforced AL framework to blend these query strategies adaptively with the aim of selecting the most valuable entities to be labeled.

To answer **Q2**, inspired by the recent advance in contrastive learning (CL) [12, 27], we devise an unsupervised contrastive loss to exploit the abundant unlabeled data for augmenting supervision signals. CL aims to generate data representations by learning to encode the similarities or dissimilarities among a set of unlabeled examples. The underlying intuition is that the rich unlabeled data themselves can be used as supervision to help guide the model training [29]. In this chapter, we employ two graph encoders to model different views of the structural information of entities and design a contrastive objective to distinguish the embeddings of the same entity in these two views from the embeddings of other entities. By incorporating the unsupervised contrastive loss into the semi-supervised alignment objective, the scarce supervision signals are amplified, which can further ameliorate the alignment performance.

The reinforced AL and the contrastive representation learning constitute RAC, an EA framework developed specifically to deal with scarce supervision. We empirically evaluate RAC on eight popular KG pairs against active and non-active baseline models. The results demonstrate that RAC achieves superior performance under scarce supervision and can be applied on existing EA models.

**Contribution** The main contribution of this chapter can be summarized as follows: (1) we put forward RAC, an EA framework that aims to solve the scarce supervision issue, which can be employed on top of existing EA solutions to improve their capability of tackling limited labeled data; (2) we devise a reinforced active learning approach to blend query heuristics adaptively and select valuable entities for labeling, which benefits the subsequent alignment process; and (3) we are among the first attempts to exploit contrastive learning for EA, where the underlying supervision signals in the abundant unlabeled entities are leveraged to facilitate alignment.

#### **7.2 Preliminaries**

#### *7.2.1 Problem Formulation*

A KG is denoted as G = {(*s*, *r*, *o*)} ⊂ E × R × E, where E is the entity set, R is the relation set, and a triple (*s*, *r*, *o*) states that a subject entity *s* ∈ E and an object entity *o* ∈ E are connected by a relation *r* ∈ R. The inputs to entity alignment include two KGs to be aligned (i.e., the source KG G<sub>*s*</sub> and the target KG G<sub>*t*</sub>) and a set of labeled entity pairs S = {(*u*<sup>∗</sup>, *v*<sup>∗</sup>) | *u*<sup>∗</sup> ∈ E<sub>*s*</sub>, *v*<sup>∗</sup> ∈ E<sub>*t*</sub>, *u*<sup>∗</sup> ⇔ *v*<sup>∗</sup>}, where ⇔ represents equivalence, and E<sub>*s*</sub> and E<sub>*t*</sub> refer to the entity sets in the source and target KGs, respectively. The objective is to detect equivalent entity pairs among the remaining entities.

The focus of this chapter is to study entity alignment with *scarce supervision*, which is decomposed into two problems, i.e., selecting entities for labeling and entity alignment under such limited supervision signals. The former is defined as, given a pool of unlabeled entities U, a labeling budget *B* and an oracle to label the entities, selecting *B* entities from U for annotation so that the labeled data could provide more useful guidance for the subsequent alignment.

#### *7.2.2 Model Overview*

We provide the overview of our proposed model RAC in Fig. 7.2. RAC can be decomposed into multiple iterations. In each iteration, we first conduct **reinforced active learning**, where we use query strategies to measure the informativeness of entities and exploit the multi-armed bandit (MAB) mechanism to blend these strategies adaptively, so as to produce the entities to be labeled by the oracle. Next, the labeled entity pairs are added to the training set and forwarded to the **contrastive entity representation learning** module. In this module, using the labeled entity pairs as connections, we project individual KGs to a unified embedding space, where the entities from different KGs become comparable and the equivalence between entities can thus be inferred. Specifically, we design a semi-supervised alignment loss function to enforce the embeddings of the entities in each labeled entity pair to be close, such that the supervision signals can be propagated to unlabeled entities and their embeddings are updated to be comparable across KGs. Then we supplement with the unsupervised contrastive loss to contrast the structural entity representations learned by different graph encoders, which can leverage the rich unlabeled information for learning more expressive entity representations. Finally, the learned unified entity representations are used to conduct **alignment inference** to produce the results and also to help improve query strategies.

**Fig. 7.2** The framework of our proposal RAC

#### **7.3 Reinforced Active Learning**

To cope with EA with scarce supervision, we first address the entity selection (for annotation) problem. Concretely, we adopt active learning (AL) to select the entities to be manually labeled with the aim of maximizing model performance with minimal effort. The AL process normally consists of multiple iterations. Given the labeling budget *B*, in each iteration, guided by the query strategies, we select *b* (*b* < *B*) entities with the highest *informativeness* for labeling and add the annotated entity pairs into the labeled data for training the EA framework. The iteration continues until the labeling budget is exhausted. Next, we introduce the query strategies in detail and then elaborate on the reinforced active entity selection framework.

#### *7.3.1 Query Strategies*

We leverage three query strategies, i.e., degree centrality, PageRank centrality, and information density, to characterize the *informativeness* (more specifically, the representativeness) of entities.<sup>2</sup>

**Degree and PageRank Centrality** Since the entities in KGs are not i.i.d., we consider that nodes with higher centrality contain more useful information and are of greater value. Hence, we adopt a commonly used centrality metric, i.e., the degree centrality *φ*<sub>*deg*</sub>(*e*), which is defined as the number of edges directly connected with entity *e*. Besides, we also leverage the PageRank centrality [18] *φ*<sub>*pr*</sub>(*e*<sub>*i*</sub>; Θ<sub>*p*</sub>) to characterize the representativeness of entities:

$$\phi\_{pr}(e\_i; \Theta\_p) = \rho \sum\_j A\_{ij} \frac{\phi\_{pr}(e\_j; \Theta\_p)}{\sum\_k A\_{jk}} + \frac{1-\rho}{n},\tag{7.1}$$

where *A* is the adjacency matrix, *n* is the number of entities in the KG, *ρ* is the damping parameter, and Θ<sub>*p*</sub> denotes the parameter set.
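Equation (7.1) is a fixed-point equation and can be solved by simple iteration. The sketch below is one such way, not necessarily the procedure used by the authors. The damping value 0.85, the uniform initialization, the stopping tolerance, and the choice to skip nodes with an empty row in *A* are all assumptions of this example.

```python
def pagerank(adj, rho=0.85, max_iters=200, tol=1e-12):
    """Iterate Eq. (7.1) on a dense adjacency matrix given as a list of lists."""
    n = len(adj)
    row_sum = [sum(row) for row in adj]          # sum_k A_jk for every node j
    scores = [1.0 / n] * n                       # uniform start (assumption)
    for _ in range(max_iters):
        updated = []
        for i in range(n):
            inflow = sum(
                adj[i][j] * scores[j] / row_sum[j]
                for j in range(n)
                if row_sum[j] > 0                # skip nodes without edges (assumption)
            )
            updated.append(rho * inflow + (1.0 - rho) / n)
        converged = max(abs(a - b) for a, b in zip(updated, scores)) < tol
        scores = updated
        if converged:
            break
    return scores


# Undirected star graph: node 0 is connected to nodes 1, 2, and 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = pagerank(adj)
print(scores)
```

On this toy graph the hub receives the highest score and the three leaves receive equal scores, which matches the intuition that central entities are more representative.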

**Information Density** In addition to the topological structure, the representativeness of an entity can also be measured at the embedding level. Concretely, we apply K-means on the embeddings of unlabeled entities. We consider that the entities placed at or close to the centers of clusters are of greater value. Thus, we calculate the Euclidean distance *d*(*e*, *c*<sub>*e*</sub>) between each entity *e* and the center entity *c*<sub>*e*</sub> of the cluster it belongs to and define the *information density* of entity *e* as *φ*<sub>*i*</sub>(*e*; Θ<sub>*i*</sub>) = 1/(1 + *d*(*e*, *c*<sub>*e*</sub>)), where Θ<sub>*i*</sub> denotes the parameter set. A larger *φ*<sub>*i*</sub>(*e*; Θ<sub>*i*</sub>) indicates that entity *e* is located in a denser area of the embedding space and is more representative.

Considering that the query strategy scores are on incomparable scales, we convert them into percentiles as in [34]. Denote P<sub>*φ*</sub>(*e*, U) as the percentile of the score of *e* among the unlabeled data U in terms of query strategy *φ*. Accordingly, the converted percentile scores of degree centrality, PageRank centrality, and information density are denoted as P<sub>*deg*</sub>, P<sub>*pr*</sub>, and P<sub>*i*</sub>, respectively.
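The two computations above can be sketched as follows. The example assumes that the cluster center of each entity is already known (the K-means step itself is omitted) and defines the percentile of a score as the fraction of pool scores that do not exceed it; the exact percentile definition used in [34] may differ. All names are illustrative.

```python
import math


def information_density(embedding, center):
    """1 / (1 + Euclidean distance to the center of the entity's cluster)."""
    return 1.0 / (1.0 + math.dist(embedding, center))


def percentile_scores(scores):
    """Map each raw score to the fraction of pool scores that are <= it (assumed definition)."""
    n = len(scores)
    return {
        entity: sum(1 for other in scores.values() if other <= value) / n
        for entity, value in scores.items()
    }


# Toy pool with a single cluster centered at the origin.
embeddings = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 1.0)}
center = (0.0, 0.0)

density = {e: information_density(vec, center) for e, vec in embeddings.items()}
pct = percentile_scores(density)
print(density, pct)
```

After this conversion, scores from different strategies all lie in (0, 1] and can be combined directly.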

<sup>2</sup> We do not employ the uncertainty-based query strategies that are frequently used in classification problems [7, 9], since it is complicated to characterize the uncertainty in the ranking-based problem setting [20]. We will investigate it in the future.

#### *7.3.2 Reinforced Active Entity Selection via MAB*

We leverage the aforementioned query strategies to select the most informative entities for labeling. Considering that the significance of query strategies might vary in different iterations and no single query strategy can satisfy the need of all datasets, we propose to adaptively blend these strategies by adopting the multi-armed bandit (MAB) mechanism [26]. MAB problems are some of the simplest reinforcement learning (RL) problems to solve, where we are given a slot machine with *n* arms (bandits) and each arm has its own probability distribution of success. Pulling any one of the arms yields a stochastic reward, and the objective is to pull the arms such that the total reward collected in the long run is maximized. Based on MAB, we treat each query strategy as an arm and approximate the importance of each strategy by estimating the expected reward (i.e., utility) of the corresponding arm. In this chapter, we adopt an extended framework of MAB, i.e., combinatorial MAB (CMAB) [6], which allows multiple arms to be played in each iteration. Next, we elaborate on the implementation details with regard to the alignment task.

Let Φ be the set of arms. In each iteration *t*, based on the percentile scores, each arm *φ* ∈ Φ suggests its own set of entities to be labeled, Q<sub>*t*</sub>(*φ*), while the actual set of queried entities Q<sub>*t*</sub> is chosen based on the utility score *ε* assigned to each unlabeled entity *e* ∈ U, which is defined as:

$$\varepsilon\_t(e) = \sum\_{\phi}^{\Phi} \varepsilon\_t(\phi) \mathcal{P}\_{\phi}(e),\tag{7.2}$$

where *ε*<sub>*t*</sub>(*φ*) is the utility score of arm *φ* in iteration *t* and P<sub>*φ*</sub>(*e*) is the percentile score of entity *e* in terms of query strategy *φ*. Then, the top *b* entities from the unlabeled entity set with the highest *ε*<sub>*t*</sub>(*e*) are selected as Q<sub>*t*</sub> for querying the oracle in iteration *t*.
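The blending in Eq. (7.2) amounts to a weighted sum of percentile scores followed by a top-*b* selection. A minimal sketch, assuming the percentile scores are stored per arm in nested dictionaries and every arm scores the same pool of entities (the arm names and values are made up for illustration):

```python
def select_queries(percentiles, arm_utility, b):
    """Blend per-arm percentile scores by arm utility and return the top-b entities.

    percentiles: {arm: {entity: percentile score}}
    arm_utility: {arm: utility of the arm in the current iteration}
    """
    entities = next(iter(percentiles.values())).keys()
    utility = {
        e: sum(arm_utility[arm] * percentiles[arm][e] for arm in arm_utility)
        for e in entities
    }
    ranked = sorted(utility, key=lambda e: utility[e], reverse=True)
    return ranked[:b], utility


percentiles = {
    "deg": {"e1": 1.0, "e2": 0.5, "e3": 0.25},
    "pr": {"e1": 0.5, "e2": 1.0, "e3": 0.25},
    "density": {"e1": 0.25, "e2": 0.5, "e3": 1.0},
}
arm_utility = {"deg": 1.0, "pr": 1.0, "density": 1.0}  # all arms equal, as in the first iteration

queried, utility = select_queries(percentiles, arm_utility, b=2)
print(queried, utility)
```

With equal arm utilities the selection reduces to ranking entities by the sum of their percentile scores; as the utilities diverge in later iterations, better-performing strategies dominate the ranking.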

We estimate the utility of arm *ε*(*φ*) by taking into account the exploitation-exploration trade-off dilemma in MAB [6]. That is, we consider both the *exploitation* of the arm that has the highest expected payoff and the *exploration* to get more information about the expected payoffs of the other arms. Regarding the former, we define the expected reward of choosing arm *φ* in iteration *t* as the averaged reward it received from previous rounds:

$$
\bar{\varepsilon}\_t(\phi) = \frac{1}{t-1} \sum\_{i=1}^{t-1} \hat{\varepsilon}\_i(\phi),
\tag{7.3}
$$

where *ε̂*<sub>*i*</sub>(*φ*) is the reward received by arm *φ* in round *i*, which is defined as the change of the alignment result on the validation set:

$$\hat{\varepsilon}\_{i}(\phi) = (F(\mathcal{L}\_{i} \cup \mathcal{L}\_{i}(\phi)) - F(\mathcal{L}\_{i})) + \frac{|\mathcal{L}\_{i}(\phi)|}{b} (F(\mathcal{L}\_{i} \cup Q\_{i}) - F(\mathcal{L}\_{i})),\tag{7.4}$$

where *F*(·) denotes the value of a specific alignment measure (e.g., Hits@1, to be detailed in Sect. 7.5.1) on the validation set, which is generated by using the labeled data in the bracket. L<sub>*i*</sub> represents the entities already labeled in iteration *i*, and L<sub>*i*</sub>(*φ*) = Q<sub>*i*</sub>(*φ*) ∩ Q<sub>*i*</sub> denotes the set of *labeled* entities suggested by query strategy *φ* in iteration *i*. The difference between *F*(L<sub>*i*</sub> ∪ L<sub>*i*</sub>(*φ*)) and *F*(L<sub>*i*</sub>) represents the *direct* change of alignment performance brought by arm *φ*. Besides, we reckon that the performance change caused by adding all the labeled entities Q<sub>*i*</sub> can also be used to measure the utility of each arm. Hence, we use |L<sub>*i*</sub>(*φ*)|/*b* to denote the contribution of arm *φ* and multiply it with the overall performance change to produce the *implicit* change of alignment performance brought by arm *φ*.
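The reward of Eq. (7.4) can be written down directly once an evaluation function is available. In the sketch below, `F` is a stand-in callable that maps a set of labeled entities to a validation score; in practice, evaluating it would involve training the EA model on that labeled set, which is outside the scope of this example. The toy `F` and the entity names are purely hypothetical.

```python
def arm_reward(F, labeled, suggested_by_arm, queried, b):
    """Direct plus implicit performance change attributed to one arm (Eq. 7.4)."""
    labeled_by_arm = suggested_by_arm & queried          # entities of this arm that were labeled
    base = F(labeled)
    direct = F(labeled | labeled_by_arm) - base
    implicit = len(labeled_by_arm) / b * (F(labeled | queried) - base)
    return direct + implicit


def toy_F(labeled_set):
    """Hypothetical stand-in: pretend the validation score grows with the labeled set size."""
    return len(labeled_set) / 10.0


labeled = {"e1", "e2"}                    # already labeled before this round
suggested_by_arm = {"e3", "e4", "e5"}     # what this arm proposed
queried = {"e3", "e4", "e6"}              # what was actually sent to the oracle (b = 3)

reward = arm_reward(toy_F, labeled, suggested_by_arm, queried, b=3)
print(reward)
```

Here the arm contributed two of the three queried entities, so it receives the full direct gain of those two plus two thirds of the overall gain.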

Next, we move on to *exploration*. Following [6], to encourage leveraging the under-explored arms, we obtain the utility score of arm *φ* in iteration *t* by adjusting *ε̄<sub>t</sub>(φ)*:

$$
\varepsilon\_t(\phi) = \bar{\varepsilon}\_t(\phi) + \sqrt{\frac{3\ln t}{2n\_\phi}},
\tag{7.5}
$$

where *n<sub>φ</sub>* represents the total number of *labeled* entities suggested by arm *φ* until iteration *t*. Thus, the utility *ε(φ)* of arm *φ* is estimated by considering both *exploitation* and *exploration*, which can provide more accurate signals for suggesting the entities to be labeled. Note that *t* starts from 1, and when *t* = 1, we omit the calculations of Eqs. (7.3)–(7.5) and set *ε*<sub>1</sub>(*φ*) to 1 for all arms.
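To make the interplay of Eqs. (7.3)–(7.5) concrete, the following minimal Python sketch computes the reward and the utility of a single arm. The function names, the argument layout, and the treatment of an arm that has not yet suggested any labeled entity (`n_phi == 0`, handled here as maximally under-explored) are illustrative assumptions and not taken from the released implementation.

```python
import math


def reward(f_labeled, f_with_arm, f_with_all, n_arm_labeled, b):
    """Eq. (7.4): reward of one arm in one round.

    f_labeled:     F(L_i), validation score with the current labels
    f_with_arm:    F(L_i ∪ L_i(phi)), score after adding the arm's labeled entities
    f_with_all:    F(L_i ∪ Q_i), score after adding all newly labeled entities
    n_arm_labeled: |L_i(phi)|, number of labeled entities suggested by the arm
    b:             number of entities queried per iteration
    """
    direct = f_with_arm - f_labeled
    implicit = (n_arm_labeled / b) * (f_with_all - f_labeled)
    return direct + implicit


def utility(past_rewards, t, n_phi):
    """Eqs. (7.3) and (7.5): utility of one arm in iteration t.

    past_rewards holds the t - 1 rewards the arm received in earlier rounds.
    """
    if t == 1:
        return 1.0  # no history yet: every arm starts with utility 1
    if n_phi == 0:
        return math.inf  # assumption: an unused arm is treated as maximally under-explored
    mean_reward = sum(past_rewards) / (t - 1)  # Eq. (7.3)
    bonus = math.sqrt(3 * math.log(t) / (2 * n_phi))  # exploration term of Eq. (7.5)
    return mean_reward + bonus
```

For a fixed average reward, the exploration term shrinks as *n<sub>φ</sub>* grows, so an arm that has already contributed many labeled entities needs a higher average reward to keep its utility.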

#### **7.4 Contrastive Embedding Learning**

Given the scarce labeled data generated by reinforced AL, in this section, we introduce contrastive entity representation learning that further mitigates the scarce supervision issue by mining supervision signals from the unlabeled data. We first introduce the semi-supervised alignment loss, the core of EA models. Then we introduce the graph encoders. Finally, we elaborate the unsupervised contrastive loss, as well as the training and inference processes.

#### *7.4.1 Semi-supervised Alignment Loss*

Since the entities (nodes) from different KGs cannot be compared directly, following current EA solutions, we first learn the entity structural embeddings of the source and target KGs independently, i.e., *O<sup>s</sup>* and *O<sup>t</sup>*, and then devise a semi-supervised loss function that enforces the distance between the embeddings of the entities in each labeled entity pair to be small and, meanwhile, the distance between the embeddings in each negative sample (i.e., nonequivalent entity pair) to be large. Formally:

$$\mathcal{L}\_s = \sum\_{(u, v) \in \mathcal{S}} \sum\_{(u', v') \in \mathcal{S}'\_{(u, v)}} [dis(\boldsymbol{u}, \boldsymbol{v}) + \gamma - dis(\boldsymbol{u}', \boldsymbol{v}')]\_+,\tag{7.6}$$

where [·]<sub>+</sub> = max{0, ·}, (*u*, *v*) is a labeled entity pair from the training data, and S′<sub>(*u*,*v*)</sub> represents the set of negative entity pairs obtained by corrupting (*u*, *v*) using nearest neighbor sampling [15]. ***u*** and ***v*** represent the embeddings of source and target entities retrieved from *O<sup>s</sup>* and *O<sup>t</sup>*, respectively. *dis*(·, ·) is the distance function that measures the distance between two embeddings. *γ* is a margin hyperparameter separating positive samples from negative ones.
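The margin-based loss in Eq. (7.6) can be illustrated with a short Python sketch. It assumes that the negative pairs have already been sampled and are passed in explicitly (the nearest neighbor sampling of [15] is not reproduced), and it uses the Euclidean distance; the function names and data layout are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def alignment_loss(pos_pairs, neg_pairs, gamma=1.0, dis=euclidean):
    """Eq. (7.6): hinge loss over labeled pairs and their negative samples.

    pos_pairs: list of (u, v) embedding pairs for the labeled entity pairs
    neg_pairs: neg_pairs[i] is the list of (u', v') negatives for pos_pairs[i]
    gamma:     margin separating positive from negative pairs
    """
    total = 0.0
    for (u, v), negatives in zip(pos_pairs, neg_pairs):
        d_pos = dis(u, v)
        for u_neg, v_neg in negatives:
            # A negative pair contributes only if it is not at least gamma
            # farther apart than the positive pair.
            total += max(0.0, d_pos + gamma - dis(u_neg, v_neg))
    return total
```

A negative pair that is already at least *γ* farther apart than its positive pair contributes nothing to the loss, so training effort concentrates on the hard negatives.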

In this chapter, the entity representation is obtained by aggregating the embeddings generated by two graph encoders: *O<sup>ω</sup>* = *agg*(*Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>*, *Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>*), where *agg* is the aggregation function, which can be implemented as a weighted average, concatenation, etc. *Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>* and *Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>* represent the embeddings generated from two different views. *ω* ∈ {*s*, *t*} denotes the source and target KG, respectively.

Note that we devise two graph encoders since (1) they can capture different views of the structural information and the integrated embeddings could be more expressive and (2) by devising a contrastive objective to enforce the embeddings of each entity in the two different views to agree with each other and meanwhile to be distinguished from the embeddings of other entities, the rich unlabeled data can be leveraged as supervision signals to learn discriminative entity representations and benefit the alignment.

#### *7.4.2 Graph Encoders*

In this chapter, we use two basic models, graph convolutional network (GCN) [13] and approximate personalized propagation of neural predictions [14], to capture the close and distant structural information of entities and generate different views of KG embeddings.<sup>3</sup>

The GCN model has been leveraged to generate entity embeddings by many previous works [30, 33]. It is a simple message passing algorithm. The inputs include the feature matrix of nodes *X* and the adjacency matrix of graph *A*. In the case of two message passing layers, the equation of GCN can be formulated as:

$$\mathbf{Z}^{\psi\_1} = \text{ReLU}\left(\hat{\mathbf{A}}\,\text{ReLU}\left(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}\_0\right)\mathbf{W}\_1\right),\tag{7.7}$$

where *Z<sup>ψ<sub>1</sub></sup>* is the output entity embedding matrix. *Â* is the symmetrically normalized adjacency matrix with self-loops. ReLU is the activation function, and *W*<sub>0</sub> and *W*<sub>1</sub> are the weight matrices.

<sup>3</sup> Note that it is feasible to use other graph encoders here, e.g., the RREA embedding learning model (to be detailed in Sect. 7.5.2). We use two simple models to give prominence to the effects of reinforced AL and unsupervised CL strategies on EA.
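As an illustration of Eq. (7.7), the sketch below runs the forward pass of a two-layer GCN on plain Python lists. It is a toy under stated assumptions: the input adjacency matrix is assumed to contain no self-loops (they are added during normalization), the weight matrices are supplied rather than trained, and all helper names are hypothetical. A practical implementation would rely on a tensor library instead.

```python
import math


def normalize_adj(adj):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(p, q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, x) for x in row] for row in m]


def gcn_two_layer(adj, x, w0, w1):
    """Eq. (7.7): Z = ReLU(A_hat ReLU(A_hat X W0) W1)."""
    a_hat = normalize_adj(adj)
    hidden = relu(matmul(matmul(a_hat, x), w0))
    return relu(matmul(matmul(a_hat, hidden), w1))
```

With two layers, each entity embedding mixes information from neighbors at most two hops away, which is the "close" structural view referred to above.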

While many approaches adopt GCN to learn entity representations, it is pointed out in [17] that when increasing the number of GCN layers, the alignment performance actually drops due to the oversmoothing issue. Therefore, we exploit the approximate personalized propagation [14] to generate the entity embeddings:

$$\mathbf{Z}^{(i)} = (1 - \alpha)\tilde{\mathbf{A}}\mathbf{Z}^{(i-1)} + \alpha \mathbf{X}, \quad i = 1, 2, \dots, k,\tag{7.8}$$

where *α* is the teleport probability and *k* denotes the number of propagation iterations. *Z*<sup>(0)</sup> = *X*, and the initial feature matrix *X* acts as both the starting vector and the teleport set. *Z<sup>ψ<sub>2</sub></sup>* is the output entity embedding matrix. Note that we remove the neural prediction network *f<sub>θ</sub>* in the original model since it is not required in EA. We denote the resultant model as APP. By removing the weight matrices and nonlinearity in GCN, APP can capture distant structural information while retaining the quality of entity embeddings [14].
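The propagation rule in Eq. (7.8) is a short loop, sketched below in Python. The sketch assumes that the propagation matrix passed in is an already normalized adjacency matrix and that the default values of `alpha` and `k` follow the settings reported in Sect. 7.5.1; the function names are hypothetical.

```python
def matmul(p, q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def app_propagate(a_norm, x, alpha=0.2, k=5):
    """Eq. (7.8): Z(i) = (1 - alpha) * A * Z(i-1) + alpha * X, starting from Z(0) = X.

    a_norm: normalized adjacency matrix (assumed to be precomputed)
    x:      initial feature matrix, used as start vector and teleport set
    alpha:  teleport probability
    k:      number of propagation iterations
    """
    z = [row[:] for row in x]
    for _ in range(k):
        az = matmul(a_norm, z)
        z = [
            [(1 - alpha) * az[i][j] + alpha * x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return z
```

Because no weight matrix or nonlinearity is applied between iterations, increasing `k` widens the neighborhood that influences each entity without adding parameters, while the `alpha * X` term keeps every embedding anchored to its initial features.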

#### *7.4.3 Unsupervised Contrastive Loss*

Inspired by the successful application of contrastive learning (CL) to unsupervised graph representation learning [27, 36], in this chapter, we also devise a contrastive objective to distinguish the embeddings of the same entity under the two views from the embeddings of other entities, so as to leverage the supervision signals in the unlabeled data. Given an entity *x<sub>i</sub>*, we denote its embedding generated by the first view as *Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>*(*i*), and the embedding generated by the second view as *Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>*(*i*), where *ω* ∈ {*s*, *t*} refers to the source and target KGs. These two embeddings form a positive sample. We consider the pairs of embeddings that contain *Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>*(*i*) (or *Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>*(*i*)) and the embedding of another entity as the negative samples. Then, the contrastive objective of the entity in the first view is defined as:

$$\ell\_{\omega}^{\psi\_1}(x\_i) = -\log \frac{e^{\theta \left( \mathbf{Z}\_{\omega}^{\psi\_1}(i), \mathbf{Z}\_{\omega}^{\psi\_2}(i) \right)}}{e^{\theta \left( \mathbf{Z}\_{\omega}^{\psi\_1}(i), \mathbf{Z}\_{\omega}^{\psi\_2}(i) \right)} + \mathcal{N}\_{cross} + \mathcal{N}\_{intra}},\tag{7.9}$$

$$\mathcal{N}\_{cross} = \sum\_{k=1}^{n\_{\omega}} \mathbf{1}\_{[k \neq i]} e^{\theta \left( \mathbf{Z}\_{\omega}^{\psi\_1}(i), \mathbf{Z}\_{\omega}^{\psi\_2}(k) \right)},\tag{7.10}$$

$$\mathcal{N}\_{intra} = \sum\_{k=1}^{n\_{\omega}} \mathbf{1}\_{[k \neq i]} e^{\theta \left( \mathbf{Z}\_{\omega}^{\psi\_1}(i), \mathbf{Z}\_{\omega}^{\psi\_1}(k) \right)},\tag{7.11}$$

**Fig. 7.3** Illustration of the losses

where *θ*(·, ·) is a score function that calculates the similarity between two embeddings, which is implemented as *θ*(·, ·) = *f*(*g*(·), *g*(·)), where *g*(·) is a multilayer perceptron (MLP) with nonlinear activation functions for transforming the embeddings, and *f*(·, ·) is a similarity metric capturing the similarity between embeddings. **1**<sub>[·]</sub> is an indicator function that equals 1 if the argument inside the bracket holds, and 0 otherwise. *n<sub>ω</sub>* is the number of entities in the KG. In the denominator, the first term is the positive sample, the second term N<sub>*cross*</sub> corresponds to the cross-view negative samples, and the third term N<sub>*intra*</sub> corresponds to the intra-view negative samples. Detailed illustrations can be found in Fig. 7.3. The contrastive objective of the second view, *ℓ<sub>ω</sub><sup>ψ<sub>2</sub></sup>*(*x<sub>i</sub>*), is defined similarly. Thus, the overall unsupervised loss is defined as:

$$\mathcal{L}\_{u} = \frac{1}{2n\_s} \sum\_{i=1}^{n\_s} \left[ \ell\_s^{\psi\_1}(e\_i) + \ell\_s^{\psi\_2}(e\_i) \right] + \frac{1}{2n\_t} \sum\_{i=1}^{n\_t} \left[ \ell\_t^{\psi\_1}(e\_i) + \ell\_t^{\psi\_2}(e\_i) \right],\tag{7.12}$$

where *ns* and *nt* denote the number of entities in the source and target KGs, respectively.
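The contrastive objective of Eqs. (7.9)–(7.12) can be sketched in a few lines of Python. To keep the example small, it assumes that the transformation *g*(·) is the identity (the chapter uses an MLP) and that *θ* is the cosine similarity; the second-view objective is obtained by swapping the roles of the two views, which is one natural reading of "defined similarly". All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def view_loss(z_a, z_b, i, theta=cosine):
    """Eq. (7.9) for entity i, with view z_a as the anchor and z_b as the other view."""
    n = len(z_a)
    pos = math.exp(theta(z_a[i], z_b[i]))
    cross = sum(math.exp(theta(z_a[i], z_b[k])) for k in range(n) if k != i)  # Eq. (7.10)
    intra = sum(math.exp(theta(z_a[i], z_a[k])) for k in range(n) if k != i)  # Eq. (7.11)
    return -math.log(pos / (pos + cross + intra))


def kg_loss(z1, z2):
    """One summand of Eq. (7.12): both views, averaged over the entities of one KG."""
    n = len(z1)
    return sum(view_loss(z1, z2, i) + view_loss(z2, z1, i) for i in range(n)) / (2 * n)


def unsupervised_loss(zs1, zs2, zt1, zt2):
    """Eq. (7.12): contrastive loss over the source KG plus that over the target KG."""
    return kg_loss(zs1, zs2) + kg_loss(zt1, zt2)
```

The loss decreases when the two views of the same entity agree and when an entity's embedding is dissimilar to those of all other entities in either view.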

**Model Training** Finally, we combine the semi-supervised alignment loss and the unsupervised contrastive loss, resulting in the loss function of our proposed model:

$$
\mathcal{L} = \mathcal{L}\_s + \lambda\_u \mathcal{L}\_u,\tag{7.13}
$$

where *λu >* 0 is the hyper-parameter balancing the two objectives.

#### *7.4.4 Alignment Inference*

After obtaining the learned unified embeddings, the alignment results can thus be inferred. For each source entity, we calculate its distance with all target entities according to a specific distance metric and consider the entity with the smallest distance as the match. We describe the overall procedure of RAC in Algorithm 1.


**Input**: G*<sup>s</sup>* and G*<sup>t</sup>*: source and target KGs; S: labeled data; *B*: labeling budget; an oracle to label entities.

**Output**: A: the set of aligned entity pairs.

**1** S′ ← S;

**2 while** |S′| − |S| < *B* **do**


**5** S′ ← S′ ∪ S<sub>*b*</sub>;


**11 return** A;
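The inference step of Sect. 7.4.4 amounts to a nearest neighbor lookup, sketched below in Python. The dictionary-based data layout and the function names are illustrative assumptions; the Euclidean distance is used as the distance metric.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def infer_alignment(source_emb, target_emb, dis=euclidean):
    """Match each source entity to the target entity with the smallest distance.

    source_emb, target_emb: dicts mapping an entity id to its unified embedding.
    Returns a dict mapping each source id to its predicted target id.
    """
    alignment = {}
    for s, u in source_emb.items():
        alignment[s] = min(target_emb, key=lambda t: dis(u, target_emb[t]))
    return alignment
```

Note that each source entity is matched independently, so several source entities may be assigned the same target entity.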

#### **7.5 Experiment**

In this section, we empirically evaluate our proposed model<sup>4</sup> by answering the following questions:


<sup>4</sup> The source code is available at https://github.com/DexterZeng/RAC.


**Table 7.1** Statistics of the datasets used for evaluation

EN, ZH, JA, FR, and DE refer to the English, Chinese, Japanese, French, and German versions of DBpedia, respectively. #Triples, #Ents, #Rels, and #Aligns denote the number of triples, entities, relations, and gold alignment data in each dataset, respectively

#### *7.5.1 Experimental Settings*

**Datasets** Following previous works, we adopt three popular EA datasets for evaluation: (1) DBP15K [23], which includes three cross-lingual KG pairs extracted from DBpedia; (2) SRPRS [11], which comprises two cross-lingual and two mono-lingual KG pairs extracted from DBpedia, Wikidata, and YAGO; and (3) DBP-FB [35], which is a mono-lingual KG pair extracted from DBpedia and Freebase. In each KG pair, 70%, 10%, and 20% of the gold standards are used for testing, validation, and training, respectively. Since we study EA with limited supervision, we only keep 500 seed entity pairs as the initial training set. Then, according to the labeling budget, we select the entities from the rest of the training data for annotation and add the labeled entity pairs into the initial training set. The details of datasets can be found in Table 7.1.

**Implementation Details** Regarding the *query strategies*, we set the damping parameter in Eq. (7.1) to the default value 0.85. We set *b*, the number of entities selected in each iteration, to 50. As to the *semi-supervised alignment loss* in Eq. (7.6), we adopt the Euclidean distance as *dis*(·, ·) and select *γ* among [1, 3, 5, 10]. We implement the embedding aggregation function as *agg*(*Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>*, *Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>*) = *λ<sub>e</sub>Z<sub>ω</sub><sup>ψ<sub>1</sub></sup>* + (1 − *λ<sub>e</sub>*)*Z<sub>ω</sub><sup>ψ<sub>2</sub></sup>*, where *λ<sub>e</sub>* ∈ (0, 1) is the hyperparameter that balances the weights of the two views, and we select it among [0.2, 0.4, 0.6, 0.8]. As for the *graph encoders*, we follow previous works [30, 33] by adopting two two-layer GCNs. We follow the original work of APP [14] and directly set the teleport probability *α* in Eq. (7.8) to 0.2 and the number of propagation rounds *k* to 5. The dimensionality of entity embeddings is set to 100. Concerning the *unsupervised contrastive loss* in Eq. (7.9), we implement *g*(·) as a two-layer MLP with the nonlinear activation ELU and adopt the cosine similarity as *f*(·, ·). We select *λ<sub>u</sub>* in Eq. (7.13) among [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35] and adopt the Adam optimizer to minimize the training objective. The distance function in the alignment inference process is set to the Euclidean distance.

By tuning the hyper-parameters on the validation set, we set *γ* to 1, *λ<sub>e</sub>* to 0.2, and *λ<sub>u</sub>* to 0.2. The experiments are conducted on a personal computer with the Ubuntu system, an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU, and 32 GB of memory. We conduct the experiments for five independent runs and report the averaged performance (and the standard deviation) on each dataset.

**Evaluation Metrics** Following the convention [5], for each source entity in the test set, we rank the target entities ascendingly according to the embedding distance as in Sect. 7.4.4 and adopt Hits@1 as the evaluation metric, which is defined as the percentage of source entities whose ground-truth target entity is ranked first. Note that the Hits@1 results are represented in percentages, and the bold figures in the tables represent the best results.
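The Hits@1 computation described above can be sketched as follows. The sketch assumes that embeddings are stored in dictionaries keyed by entity id and that the gold test pairs are given as a source-to-target mapping; these data structures and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_targets(u, target_emb):
    """Target entity ids sorted by ascending distance to the source embedding u."""
    return sorted(target_emb, key=lambda t: euclidean(u, target_emb[t]))


def hits_at_1(source_emb, target_emb, gold):
    """Percentage of test source entities whose gold target is ranked first.

    gold: dict mapping each test source id to its ground-truth target id.
    """
    hits = sum(
        rank_targets(source_emb[s], target_emb)[0] == t for s, t in gold.items()
    )
    return 100.0 * hits / len(gold)
```

Sorting the full candidate list is more work than Hits@1 strictly needs, but it mirrors the ranking formulation in the text.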

**Competing Methods** The majority of state-of-the-art EA methods focus on designing advanced representation learning models to capture more useful structural information for alignment. In comparison, our proposed framework RAC aims to improve the alignment performance under limited supervision by using reinforced AL and CL, which are agnostic to the choices of these embedding learning models. RAC can be applied on these methods to improve their capability of dealing with scarce supervision signals. Hence, the main goal of this chapter is not to compare with these state-of-the-art models, but with the methods that improve EA performance under scarce supervision. In this light, we compare RAC with a very recent work [2], ALEA, which harnessed AL for EA. Specifically, we adopt the most performant variants of ALEA, i.e., ALEA-D and ALEA-B, as the baseline models, which leverage the degree and betweenness centrality as the query strategies, respectively.

Notably, to demonstrate the wide applicability of RAC, we employ it on the most performant embedding learning model RREA [17], as well as a state-of-the-art EA model that leverages auxiliary information, CEA [33], in Sect. 7.5.2.

#### *7.5.2 Main Results (RQ1)*

We report the alignment results in Table 7.2 by setting the labeling budget *B* to 500 and 1500, respectively. It can be observed that RAC significantly outperforms the embedding learning-based baseline models GCN and APP across all datasets (over 40% on DBP-FB), showcasing the effectiveness of our proposal when the supervision signals are limited. Particularly, RAC (*B* = 500) even achieves comparable results to GCN (*B* = 1500) on SRPRS<sub>EN-FR</sub> and DBP-FB, which validates that, to reach a certain performance target, adopting RAC significantly reduces manual labeling effort. Besides, it is notable that the improvement is more prominent when there are fewer labeled data. For instance, RAC outperforms APP by over 15% on most datasets when *B* = 500, while the improvement is less than 15% on most datasets when *B* = 1500.



**Fig. 7.4** Hits@1 results of ablation study. The shaded area denotes the standard deviation

Then, by comparing the results of RAC with ALEA-D and ALEA-B, it is obvious that our proposed model is more effective and robust than existing AL-based EA models given scarce labeled data. The superior performance can be attributed to the reinforced AL strategy and the contrastive representation learning, which we will analyze in detail in the following.

**Ablation Results** To examine the usefulness of the two key components in RAC, namely the unsupervised contrastive loss and the reinforced AL strategy, we conduct an ablation study. As shown in Fig. 7.4, we select the labeling budget *B* among [250, 500, 750, 1000, 1250, 1500, 1750, 2000] and obtain the corresponding alignment results of RAC -Active, RAC -Rand., RAC w/o CL -Active, and RAC w/o CL -Rand., where -Active denotes using our proposed reinforced AL strategy, -Rand. denotes selecting the entities randomly, and w/o CL denotes removing the unsupervised contrastive loss. Note that, in the interest of space, we only select representative KG pairs from each dataset and report their results, among which DBP15K<sub>ZH-EN</sub> and SRPRS<sub>EN-FR</sub> are cross-lingual datasets, while SRPRS<sub>DBP-WD</sub> and DBP-FB are mono-lingual ones.

Fig. 7.4 shows that the reinforced AL and CL strategies both contribute positively to the overall performance. More concretely, as the labeling budget increases, the effectiveness of the AL strategy becomes less significant, while the unsupervised contrastive learning loss begins to play a more important role. This could be ascribed to the fact that (1) the quality of the entities selected by AL drops with the increase of budget since the valuable entities have already been chosen in the early stages and (2) the effectiveness of CL relies on the quality of entity representations, which is improved when there are more labeled data.

**Applying RAC on Embedding Learning-Based EA Model** We apply RAC on RREA, the most performant EA method so far, to see whether RAC would improve its capability of dealing with limited supervision. Specifically, we follow the implementation details in the original paper [17] and contrast the entity embeddings learned by it with the embeddings generated by GCN and then conduct the reinforced AL. The results are provided in Table 7.3, which validate that RAC can be applied on existing EA models to improve their performance under limited supervision, and the improvement is more notable when there are fewer labeled data (*B* = 500 vs. *B* = 1500).


**Table 7.3** Hits@1 results of applying RAC on RREA and CEA

**Applying RAC on EA Model that Uses Auxiliary Information** We apply RAC on CEA [33], an EA model leveraging the entity name information to complement KG structural information for alignment. The results are provided in Table 7.3, which demonstrate that our proposal is also effective on EA models harnessing auxiliary information. We notice that the improvements on SRPRS<sub>EN-FR</sub> and SRPRS<sub>EN-DE</sub> are not significant. This is because the entity name information in SRPRS can already provide very accurate alignment signals, e.g., solely comparing the entity names can lead to ground-truth performance on SRPRS<sub>DBP-YG</sub> and SRPRS<sub>DBP-WD</sub> [35] (and thus we omit their results in Table 7.3). This unveils that it is more beneficial to study EA with scarce supervision when the auxiliary information is not available or of low quality (as is often the case) [35].

#### *7.5.3 Experiments on Contrastive Learning (RQ2)*

In this subsection, we carefully examine the effectiveness of unsupervised CL. We first empirically validate that the main performance enhancement brought by CL comes from the unsupervised contrastive loss itself rather than the combination of embeddings. Then we conduct parameter analysis to show its robustness.

**Comparison with** *mere* **Combination of Embeddings** Since different representation learning models capture different structural information in KGs, one might wonder whether the effectiveness of unsupervised CL mainly comes from the combination of embeddings. To investigate this issue, we remove the effect of AL and report the results of GCN, APP, the combination of these two embeddings (denoted as Comb.), and the combination of these two embeddings with unsupervised CL loss (denoted as Comb.+CL) in Table 7.4. It shows that, compared with utilizing the representation learning models separately, Comb. only slightly improves the alignment performance in some cases and even brings down the results under a few settings, e.g., *B* = 500. After adding the unsupervised contrastive loss, Comb.+CL achieves better results than Comb. and APP across all settings. This demonstrates the significance of using CL to mine supervision signals from the abundant unlabeled data. Furthermore, by comparing RAC with RAC w/o CL (which combines the two representations) in Fig. 7.4, we can conclude that the contrastive loss is effective with or without AL.

**Table 7.4** Hits@1 results of variants of CL on DBP15K<sub>ZH-EN</sub> after removing the influence of AL

Note that Comb.+CL is equivalent to *λ<sub>u</sub>* = 0.2 in the bottom half of the table

**Sensitivity Analysis** We conduct sensitivity analysis on a critical hyper-parameter in the contrastive entity representation learning, *λu*, which determines the contributions of semi-supervised alignment loss L*<sup>s</sup>* and unsupervised contrastive loss L*<sup>u</sup>* to the overall training objective, to show the stability of the model under perturbation of the hyper-parameter. Since it is intuitive that L*<sup>s</sup>* can provide more accurate signals for alignment compared with L*u*, we vary *λu* from 0.05 to 0.35 and report the results in Table 7.4. From the table, it can be observed that the alignment performance is relatively stable when *λu* is not too large. We thus conclude that, overall, our model is robust to the perturbation of *λu*.

#### *7.5.4 Experiments on Reinforced AL (RQ3)*

In this subsection, we aim to examine the usefulness of the reinforced AL component. We first demonstrate that the reinforced AL strategy can be applied on the baseline models to improve their performance given scarce labeled data. Next, we empirically verify that using our proposed reinforced AL to blend query strategies can lead to better results than using these strategies individually or combining the query strategies with equal weights.

**Effectiveness of Reinforced AL on Baseline Models** We apply our proposed reinforced AL on the baseline models and report the results in Fig. 7.5. It shows that the performance of both APP and GCN is enhanced after applying the reinforced AL strategy, and the improvement is more prominent when the budget is smaller.

**Fig. 7.5** Hits@1 results of applying AL on baseline models. The shaded area denotes the standard deviation

**Comparison with Using Query Strategies Individually** To verify that blending the query strategies with MAB is more effective than using these strategies individually, we replace reinforced AL in RAC with degree centrality, PageRank centrality, and information density, resulting in RAC-Deg, RAC-Pr, and RAC-Emb, respectively, and report the results in Table 7.5. It shows that, overall, the reinforced active entity selection strategy can lead to better alignment results than using query strategies individually.

**Comparison with Combination with Equal Weights** To demonstrate that reinforced AL can adaptively integrate query strategies and lead to better alignment performance, we compare it with blending query strategies with equal weights (RAC-Avg) and provide the results in Table 7.5. It can be observed that RAC is more effective than RAC-Avg, especially when the budget value is small, showcasing the importance of combining query strategies adaptively.

#### **7.6 Related Work**

**Entity Alignment** The task of EA has been intensively studied over the last few years [35]. The majority of existing EA literature [4, 5, 10, 23, 30] is devoted to learning better entity representations using KG embedding techniques such as TransE and GCN. Specifically, some propose to capture the neighboring information [11, 25] for learning expressive entity representations, while some propose to model the relations to help guide the alignment of entities [24, 31]. All of these approaches require seed entity pairs to project entity embeddings from different KGs into a unified space, where the entities can be directly compared across KGs. Nevertheless, such labeled data are hard to obtain in real-life settings. To reduce the reliance on labeled data, some efforts are devoted to aligning entities in unsupervised settings [32]. They leverage the auxiliary (side) information of KGs, such as attributes and entity names, to produce the pseudo-labeled data, which are then used to learn the unified structural embeddings. Nevertheless, the effectiveness of these approaches is largely restrained by the quality of side information. In practice, the auxiliary information could be unavailable or unevenly distributed [35].

**EA with Limited Supervision** The most similar work to ours is [2], which examines the effectiveness of various heuristics from AL in terms of improving EA performance under limited supervision. Our work differs from [2] in that (1) we devise a reinforced AL framework to adaptively blend query strategy heuristics and (2) we exploit the idea of CL to help further improve the EA performance. We also empirically validate the superiority of our proposal over [2].

**Reinforced Active Learning** Reinforced AL approaches have also been developed for other related problems, where RL is used to take the role of traditional query strategy heuristics [7–9]. To tackle the cross-lingual named entity recognition task, Fang et al. design a deep Q-network to select data for annotation in a streaming setting [8]. In [7, 9], different multi-armed bandit models [6] are used to learn active discriminative network representations for the node classification task. Note that the MAB mechanism implemented in RAC differs from theirs and is developed specifically for the alignment task.

**Contrastive Learning on Graphs** Recently, contrastive learning (CL) has emerged as a successful method for unsupervised graph representation learning [27, 36]. CL is an active field of self-supervised learning, which can generate data representations by learning to encode the similarities or dissimilarities among a set of unlabeled examples [12]. The intuition is that the rich unlabeled data themselves can be used as supervision signals to help guide model training. In this chapter, we also exploit this idea and leverage the abundant unlabeled entities to facilitate the alignment.

#### **7.7 Conclusion**

State-of-the-art EA approaches are overly dependent on labeled data, which are difficult to obtain in practical settings. In response, we propose a reinforced active framework RAC to tackle EA with scarce supervision. In each labeling iteration, RAC selects the valuable entities to be labeled according to the multi-armed bandit mechanism that blends different query strategies. Then, given the limited labeled data, it mines useful supervision signals from the rich unlabeled data to help generate more accurate entity representations (and the alignment results). We evaluate RAC on popular EA benchmarks, and the empirical results validate that RAC is effective at coping with limited labeled data. Besides, we also demonstrate that RAC is a general framework to tackle EA with scarce supervision and can be employed on top of existing EA solutions.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 8 Unsupervised Entity Alignment**

**Abstract** State-of-the-art entity alignment solutions tend to rely on labeled data for model training. Additionally, they work under the closed-domain setting and cannot deal with entities that are unmatchable. To address these deficiencies, we offer an unsupervised framework that performs entity alignment in the open world. Specifically, we first mine useful features from the side information of KGs. Then, we devise an unmatchable entity prediction module to filter out unmatchable entities and produce preliminary alignment results. These preliminary results are regarded as the pseudo-labeled data and forwarded to the progressive learning framework to generate structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. Finally, the progressive learning framework gradually improves the quality of structural embeddings and enhances the alignment performance by enriching the pseudo-labeled data with alignment results from the previous round. Our solution does not require labeled data and can effectively filter out unmatchable entities. Comprehensive experimental evaluations validate its superiority.

#### **8.1 Introduction**

State-of-the-art EA solutions [2–5] assume that equivalent entities usually possess similar neighboring information. Consequently, they utilize KG embedding models, e.g., TransE [6], or graph neural network (GNN) models, e.g., GCN [7], to generate structural embeddings of entities in individual KGs. Then, these separated embeddings are projected into a unified embedding space by using the seed entity pairs as connections, so that the entities from different KGs are directly comparable. Finally, to determine the alignment results, the majority of current works [1, 8–10] formalize the alignment process as a ranking problem; that is, for each entity in the source KG, they rank all the entities in the target KG according to some distance metric, and the closest entity is considered as the equivalent target entity.

**Fig. 8.1** An example of EA

*Example* Fig. 8.1 shows a partial English KG and a partial Spanish KG concerning the director Hirokazu Koreeda, where the dashed lines indicate known alignments (i.e., seeds). The task of EA aims to identify equivalent entity pairs between two KGs, e.g., (Shoplifters, Manbiki Kazoku).

Nevertheless, we still observe several issues from current EA works:


In response to these issues, we put forward an unsupervised EA solution UEA that is capable of addressing the unmatchable problem. Specifically, to mitigate the reliance on labeled data, we mine useful features from the KG side information and use them to produce preliminary pseudo-labeled data. These preliminary seeds are forwarded to our devised **progressive learning framework** to generate unified KG structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. This framework also progressively augments the training data and improves the alignment results in a self-training fashion. Besides, to tackle the unmatchable issue, we design an **unmatchable entity prediction** module, which leverages thresholded bidirectional nearest neighbor search (TBNNS) to filter out the unmatchable entities and excludes them from the alignment results. We embed the unmatchable entity prediction module into the progressive learning framework to control the pace of progressive learning by dynamically adjusting the thresholds in TBNNS.
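As a rough preview of the filtering idea (the module itself is elaborated with the framework in Sect. 8.3), a thresholded bidirectional nearest neighbor search might be sketched in Python as follows. The sketch assumes that a pair is kept only if the two entities are each other's nearest neighbors and their distance is below the threshold; the function name, the distance-matrix input, and the toy values in the usage note are illustrative assumptions rather than the chapter's exact procedure.

```python
def tbnns(dist, threshold):
    """Keep (source, target) index pairs that are mutual nearest neighbors
    and whose distance is below the threshold.

    dist: matrix where dist[i][j] is the distance between source entity i
          and target entity j.
    """
    pairs = []
    for i, row in enumerate(dist):
        # Nearest target of source i.
        j = min(range(len(row)), key=lambda c: row[c])
        # Nearest source of that target.
        i_back = min(range(len(dist)), key=lambda r: dist[r][j])
        if i_back == i and row[j] < threshold:
            pairs.append((i, j))
    return pairs
```

With `dist = [[0.1, 0.9], [0.8, 0.7], [0.2, 0.95]]` and a threshold of 0.5, only the pair (0, 0) survives: source 1 and target 1 are mutual nearest neighbors but too far apart, and source 2 prefers a target whose own nearest source is a different entity. Source entities left without a pair are the candidates for being unmatchable, and relaxing the threshold admits more pairs.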

Furthermore, considering that the pseudo-labeled data generated during the progressive learning process might be of different qualities, we introduce the concept of **confidence** to measure the probability of an entity pair being correct. We further incorporate such confidence scores into KG representation learning with the aim of producing more accurate structural embeddings. Through empirical studies, we demonstrate that the confidence-based framework, CUEA, has a more stable performance than UEA regardless of the quality of input side information and is particularly useful when the side information is low-grade.

**Contribution** The main contributions of the chapter can be summarized as follows:


**Organization** In Sect. 8.2, we formally define the task of EA and introduce related work. Section 8.3 elaborates the framework. In Sect. 8.4, we introduce experimental results and conduct detailed analysis. Section 8.5 concludes this chapter.

#### **8.2 Task Definition and Related Work**

In this section, we formally define the task of EA and then introduce the related work.

**Task Definition** The inputs to EA are a source KG *G*<sub>1</sub> and a target KG *G*<sub>2</sub>. The task of EA is defined as finding the equivalent entities between the KGs, i.e., *Ψ* = {(*u*, *v*) | *u* ∈ *E*<sub>1</sub>, *v* ∈ *E*<sub>2</sub>, *u* ↔ *v*}, where *E*<sub>1</sub> and *E*<sub>2</sub> refer to the entity sets in *G*<sub>1</sub> and *G*<sub>2</sub>, respectively, and *u* ↔ *v* represents that the source entity *u* and the target entity *v* are *equivalent*, i.e., *u* and *v* refer to the same real-world object.

Most current EA solutions assume that there exists a set of seed entity pairs *Ψ*<sub>s</sub> = {(*u*<sub>s</sub>, *v*<sub>s</sub>) | *u*<sub>s</sub> ∈ *E*<sub>1</sub>, *v*<sub>s</sub> ∈ *E*<sub>2</sub>, *u*<sub>s</sub> ↔ *v*<sub>s</sub>}. Nevertheless, in this chapter, we focus on unsupervised EA and do not assume the availability of such labeled data.

**Unsupervised Entity Alignment** A few methods have investigated the alignment without labeled data. Qu et al. [20] propose an unsupervised approach toward knowledge graph alignment with the adversarial training framework. Nevertheless, the experimental results are extremely poor. He et al. [21] utilize the shared attributes between heterogeneous KGs to generate aligned entity pairs, which are used to detect more equivalent attributes. They perform entity alignment and attribute alignment alternately, leading to more high-quality aligned entity pairs, which are used to train a relation embedding model. Finally, they combine the alignment results generated by attribute and relation triples using a bivariate regression model. The overall procedure of this work might seem similar to our proposed model. However, there are many notable differences; for instance, the KG embeddings in our work are updated progressively, which can lead to more accurate alignment results, and our model can deal with unmatchable entities. We empirically demonstrate the superiority of our model in Sect. 8.4.

We notice that there are some entity resolution (ER) approaches established in a setting similar to EA, represented by PARIS [22]. They adopt collective alignment algorithms such as similarity propagation so as to model the relations among entities. We include them in the experimental study for the comprehensiveness of the chapter.

#### **8.3 Methodology**

In this section, we first introduce the outline of our proposal. Then, we elaborate the processing of side information to produce preliminary alignment seeds.

#### *8.3.1 Model Outline*

As shown in Fig. 8.2, given two KGs, CUEA first mines useful features from the *side information*. These features are forwarded to the *unmatchable entity prediction* module to generate initial alignment results with confidence scores, which are regarded as pseudo-labeled data. Then, the *progressive learning framework* uses these pseudo seeds, along with the probability scores, to connect two KGs and learn unified entity structural embeddings. It further combines the alignment signals from the side information and *structural information* to provide a more comprehensive view for alignment. Finally, it progressively improves the quality of structural embeddings and augments the alignment results by iteratively updating the pseudo-labeled data with results from the previous round, which also leads to increasingly better alignment. Note that by assigning the confidence score of 1 to all entity pairs, CUEA turns into the UEA model.

**Fig. 8.2** Outline of CUEA. Arrows in blue represent the progressive learning process. By setting the confidence to 1, the UEA model can be restored

#### *8.3.2 Side Information*

There is abundant side information in KGs, such as the attributes, descriptions, and classes. In this chapter, we use a particular form of the attributes—the entity name, as it exists in the majority of KGs. To make the most of the entity name information, inspired by Zeng et al. [5], we exploit it from the semantic level and string level and generate the textual distance matrix between entities in two KGs.

More specifically, we use the averaged word embeddings to represent the semantic meanings of entity names. Given the semantic embeddings of a source and a target entity, we obtain the semantic distance score by subtracting their cosine similarity score from 1. We denote the semantic distance matrix between the entities in two KGs as **M**<sup>n</sup>, where rows represent source entities, columns denote target entities, and each element in the matrix denotes the distance score between a pair of source and target entities. As for the string-level feature, we adopt the Levenshtein distance [23] to measure the difference between two sequences. We denote the string distance matrix as **M**<sup>l</sup>.

To obtain a more comprehensive view for alignment, we combine these two distance matrices and generate the textual distance matrix as **M**<sup>t</sup> = *α***M**<sup>n</sup> + (1 − *α*)**M**<sup>l</sup>, where *α* is a hyper-parameter that balances the weights. Then, we forward the textual distance matrix **M**<sup>t</sup> into the unmatchable entity prediction module to produce alignment results, which are considered as the pseudo-labeled data for training KG structural embeddings. The details are introduced in the next subsection.
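The construction of the textual distance matrix can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the name embeddings are assumed to be precomputed averaged word vectors, and the Levenshtein distance is divided by the longer string length so that both components lie on a comparable scale (this normalization is an assumption, since the text does not specify one).

```python
import math


def cosine_distance(a, b):
    """Semantic distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an all-zero embedding is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(s, t):
    # Assumption: scale by the longer length so the score lies in [0, 1].
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def textual_distance_matrix(src_names, tgt_names, src_vecs, tgt_vecs, alpha=0.5):
    """M^t = alpha * M^n + (1 - alpha) * M^l; rows are source, columns are target."""
    return [
        [
            alpha * cosine_distance(src_vecs[i], tgt_vecs[j])
            + (1 - alpha) * normalized_levenshtein(src_names[i], tgt_names[j])
            for j in range(len(tgt_names))
        ]
        for i in range(len(src_names))
    ]
```

The nested loops are quadratic in the number of entities, which is acceptable for a sketch; a practical implementation would likely vectorize the cosine part.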

**Remark** The goal of this step is to exploit available side information to generate useful features for alignment. Other types of side information, e.g., attributes and entity descriptions, can also be leveraged. Besides, more advanced textual encoders, such as misspelling oblivious word embeddings [24] and convolutional embedding for edit distance [25], can be utilized. We will investigate them in the future.

#### *8.3.3 Unmatchable Entity Prediction*

State-of-the-art EA solutions generate for each source entity a corresponding target entity and fail to consider the potential unmatchable issue. Nevertheless, as mentioned in [12], in real-life settings, KGs contain entities that other KGs do not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are related to movies, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB. These unmatchable entities would increase the difficulty of EA. Therefore, in this chapter, we devise an unmatchable entity prediction module to predict the unmatchable entities and filter them out from the alignment results.

#### **8.3.3.1 Thresholded Bidirectional Nearest Neighbor Search**

More specifically, we put forward a novel strategy, i.e., thresholded bidirectional nearest neighbor search (TBNNS), to generate the alignment results, and the resulting unaligned entities are predicted to be unmatchable. As can be observed from Algorithm 1, given a source entity *u* and a target entity *v*, if *u* and *v* are the nearest neighbor of each other, and the distance between them is below a given threshold *θ*, we consider *(u, v)* as an aligned entity pair. Note that **M***(u, v)* represents the element in the *u*-th row and *v*-th column of the distance matrix **M**.

**Algorithm 1:** TBNNS in the unmatchable entity prediction module

**Input:** *G*<sub>1</sub> and *G*<sub>2</sub>: the two KGs to be aligned; *E*<sub>1</sub> and *E*<sub>2</sub>: the entity sets in *G*<sub>1</sub> and *G*<sub>2</sub>; *θ*: a given threshold; **M**: a distance matrix.

**Output:** *S*: alignment results.

1. **foreach** *u* ∈ *E*<sub>1</sub> **do**
2. &emsp;*v* ← arg min<sub>*v̂* ∈ *E*<sub>2</sub></sub> **M**(*u*, *v̂*);
3. &emsp;**if** arg min<sub>*û* ∈ *E*<sub>1</sub></sub> **M**(*v*, *û*) = *u* **and** **M**(*u*, *v*) < *θ* **then**
4. &emsp;&emsp;*S* ← *S* + {(*u*, *v*)}
5. **return** *S*.

The TBNNS strategy exerts strong constraints on alignment, since it requires that the matched entities should both prefer each other the most, and the distance between their embeddings should be below a certain value. Therefore, it can effectively predict unmatchable entities and prevent them from being aligned. Notably, the threshold *θ* plays a significant role in this strategy. A larger threshold would lead to more matches, whereas it would also increase the risk of including erroneous matches or unmatchable entities. In contrast, a small threshold would only lead to a few aligned entity pairs, and almost all of them would be correct. This is further discussed and verified in Sect. 8.4.4. Therefore, our progressive learning framework dynamically adjusts the threshold value to produce more accurate alignment results (to be discussed in the next subsection).
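A minimal sketch of TBNNS over a plain list-of-lists distance matrix is given below. The function name and the `rows`/`cols` arguments (indices of the source and target entities still under consideration) are illustrative choices, and the toy matrix in the usage note is made up.

```python
def tbnns(M, rows, cols, theta):
    """Thresholded bidirectional nearest neighbor search.

    M[u][v] is the distance between source entity u and target entity v.
    Returns the pairs (u, v) that are each other's nearest neighbor and
    whose distance is below theta; every other entity stays unaligned.
    """
    if not rows or not cols:
        return []
    # Nearest source entity for each target entity (column-wise arg min).
    nearest_source = {v: min(rows, key=lambda u: M[u][v]) for v in cols}
    pairs = []
    for u in rows:
        v = min(cols, key=lambda c: M[u][c])  # nearest target entity for u
        if nearest_source[v] == u and M[u][v] < theta:
            pairs.append((u, v))
    return pairs
```

For example, with `M = [[0.1, 0.9], [0.8, 0.6], [0.2, 0.7]]`, a threshold of 0.5 keeps only the pair (0, 0): source 1 and target 1 prefer each other but are too far apart, and source 2 prefers target 0, which in turn prefers source 0. Raising the threshold to 0.7 additionally admits (1, 1), while source 2 remains unaligned and is thus predicted to be unmatchable.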

#### **8.3.3.2 Confidence-Based TBNNS**

Considering that the aligned entity pairs generated by TBNNS are of different qualities (i.e., some are true, while some are not), we further put forward confidence-based TBNNS (C-TBNNS) to measure the confidence of an entity pair (of being true). Specifically, we define the confidence score of an entity pair (*u*, *v*) as:

$$\Theta(u,v) = \mathbf{M}(u,v') - \mathbf{M}(u,v) + \mathbf{M}(v,u') - \mathbf{M}(v,u),\tag{8.1}$$

where the first difference, **M**(*u*, *v*′) − **M**(*u*, *v*), denotes the gap between the distance scores of the top two closest entities (i.e., *v* and *v*′) to entity *u*, while the second difference, **M**(*v*, *u*′) − **M**(*v*, *u*), denotes the gap between the distance scores of the top two closest entities (i.e., *u* and *u*′) to entity *v*. This is based on the intuition that, for an entity pair (*u*, *v*), if the distance between them is the smallest from both sides and there are larger margins between the distances of the top two candidates, we can be more confident that they form a correct entity pair. We further restrict the confidence scores to a certain range:

$$\Theta(\mathcal{S}) = (1 - \lambda) \frac{\Theta(\mathcal{S}) - \min\{\Theta(\mathcal{S})\}}{\max\{\Theta(\mathcal{S})\} - \min\{\Theta(\mathcal{S})\}} + \lambda \tag{8.2}$$

where Θ(*S*) represents the confidence scores of the entity pairs in *S*. The core of Eq. (8.2) is the min-max normalization, which converts the confidence scores to [0, 1]. We add a hyper-parameter *λ* ∈ [0, 1] to further restrict the range of the confidence scores to [*λ*, 1]. Thus, by setting *λ* to 1, all entity pairs have the same confidence score of 1, and C-TBNNS reduces to TBNNS. Hence, C-TBNNS can be regarded as a generalization of TBNNS that introduces the concept of confidence (probability) into the alignment result generation process.
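Equations (8.1) and (8.2) can be sketched as follows. The function names are hypothetical, and two edge cases that the text does not cover are handled by assumption: a missing second candidate contributes a gap of 0, and a set of identical raw scores is mapped to a confidence of 1.

```python
def raw_confidence(M, rows, cols, u, v):
    """Eq. (8.1): gap to the runner-up candidate, summed over both directions.

    Assumes (u, v) was produced by TBNNS, so v is the closest target to u
    and u is the closest source to v among the given rows and cols.
    """
    other_targets = [M[u][c] for c in cols if c != v]
    other_sources = [M[r][v] for r in rows if r != u]
    gap_u = min(other_targets) - M[u][v] if other_targets else 0.0
    gap_v = min(other_sources) - M[u][v] if other_sources else 0.0
    return gap_u + gap_v


def rescale_confidence(scores, lam):
    """Eq. (8.2): min-max normalize the scores, then squeeze them into [lam, 1]."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [1.0 for _ in scores]  # assumption: no spread means full confidence
    return [(1 - lam) * (s - lowest) / (highest - lowest) + lam for s in scores]
```

With `lam = 1` every pair receives a confidence of 1, which corresponds to plain TBNNS; with a smaller `lam`, the pair with the narrowest margins receives exactly `lam`.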

#### *8.3.4 The Progressive Learning Framework*

To exploit the rich structural patterns in KGs that could provide useful signals for alignment, we design a progressive learning framework to combine structural and textual features for alignment and improve the quality of both structural embeddings and alignment results in a self-training fashion.

#### **8.3.4.1 Knowledge Graph Representation Learning**

As mentioned above, we forward the textual distance matrix **M**<sup>t</sup> generated by using the side information to the unmatchable entity prediction module to produce the preliminary alignment results, which are considered as pseudo-labeled data for learning unified KG embeddings. Concretely, following [18], we adopt GCN<sup>1</sup> to capture the neighboring information of entities. We omit the implementation details, which can be found in [18], since they are not the focus of this chapter.

**Alignment Objective** Since the representations of the source and target KGs are learned individually, they need to be projected into a unified embedding space, where entities across KGs can be compared directly. To this end, we use the semi-supervised loss function to enforce that the distance between the embeddings of the entities in each labeled entity pair is small, while the distance between the embeddings in each negative sample (i.e., a nonequivalent entity pair) is large. Formally:

$$\mathcal{L} = \sum_{(u, v) \in \mathcal{S}} \sum_{(u', v') \in \mathcal{S}'_{(u, v)}} [d(\mathbf{u}, \mathbf{v}) + \gamma - d(\mathbf{u}', \mathbf{v}')]_{+},\tag{8.3}$$

where [·]<sub>+</sub> = max{0, ·}, (*u*, *v*) is a labeled entity pair from the training data, and *S*′<sub>(*u*, *v*)</sub> represents the set of negative entity pairs obtained by corrupting (*u*, *v*) using nearest neighbor sampling [1]. **u** and **v** represent the embeddings of source and target entities learned by GCN, respectively. *d*(·, ·) is the distance function that measures the distance between two embeddings. *γ* is a hyper-parameter separating positive samples from negative ones.

<sup>1</sup> More advanced structural learning models, such as recurrent skipping networks [13], could also be used here. We will explore these alternative options in the future.

**Confidence-Based Objective** Considering that the pseudo-labeled entity pairs have different confidences of being true, we incorporate such probabilities into the alignment objective to learn more accurate structural embeddings:

$$\mathcal{L}_{\mathbf{c}} = \sum_{(u, v) \in \mathcal{S}} \sum_{(u', v') \in \mathcal{S}'_{(u, v)}} \Theta(u, v) \cdot [d(\mathbf{u}, \mathbf{v}) + \gamma - d(\mathbf{u}', \mathbf{v}')]_{+}, \tag{8.4}$$

where Θ(*u*, *v*) is the confidence score attached to each entity pair. Thus, the more confident entity pairs play a more important role during the training process, while the less confident pseudo entity pairs have a smaller effect on the training, such that the impact of false positives is mitigated.
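The confidence-weighted objective of Eq. (8.4) can be evaluated on toy data as sketched below. Embeddings are plain lists, the negative pairs are supplied by the caller (the nearest neighbor sampling step is not reproduced), and all names are illustrative. The Manhattan distance and the margin of 3 follow the parameter settings reported in Sect. 8.4.1.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two embeddings."""
    return sum(abs(x - y) for x, y in zip(a, b))


def confidence_weighted_loss(src_emb, tgt_emb, pairs, negatives, confidence, gamma=3.0):
    """Eq. (8.4): hinge loss per (positive, negative) pair, scaled by confidence.

    pairs      : list of pseudo-labeled pairs (u, v)
    negatives  : dict mapping (u, v) to its list of corrupted pairs (u', v')
    confidence : dict mapping (u, v) to its score in [lambda, 1]
    """
    total = 0.0
    for u, v in pairs:
        positive = manhattan(src_emb[u], tgt_emb[v])
        for neg_u, neg_v in negatives[(u, v)]:
            negative = manhattan(src_emb[neg_u], tgt_emb[neg_v])
            total += confidence[(u, v)] * max(0.0, positive + gamma - negative)
    return total
```

Passing a confidence of 1 for every pair yields the unweighted objective of Eq. (8.3).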

**Feature Fusion** Given the learned structural embedding matrix **Z**, we calculate the structural distance score between a source and a target entity by subtracting the cosine similarity score between their embeddings from 1. We denote the resultant structural distance matrix as **M**<sup>s</sup>. Then, we combine the textual and structural information to generate more accurate signals for alignment: **M** = *β***M**<sup>t</sup> + (1 − *β*)**M**<sup>s</sup>, where *β* is a hyper-parameter that balances the weights. The fused distance matrix **M** can be used to generate more accurate matches.
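The fusion step is a weighted element-wise sum of two distance matrices of the same shape. The following sketch uses a hypothetical function name and made-up toy values to show how structural evidence can break a tie left by the textual evidence.

```python
def fuse(M_t, M_s, beta=0.5):
    """M = beta * M^t + (1 - beta) * M^s for two same-shaped distance matrices."""
    if len(M_t) != len(M_s) or any(len(a) != len(b) for a, b in zip(M_t, M_s)):
        raise ValueError("distance matrices must have the same shape")
    return [
        [beta * t + (1 - beta) * s for t, s in zip(row_t, row_s)]
        for row_t, row_s in zip(M_t, M_s)
    ]


# Toy case: both targets look equally close by name, structure prefers target 0.
M_t = [[0.2, 0.2]]
M_s = [[0.1, 0.9]]
M = fuse(M_t, M_s, beta=0.5)
```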

#### **8.3.4.2 The Progressive Learning Algorithm**

The amount of training data has an impact on the quality of the unified KG embeddings, which in turn affects the alignment performance [3, 26]. Thus, we devise an algorithm (Algorithm 2) to progressively augment the pseudo training data, so as to improve the quality of KG embeddings and enhance the alignment performance. The algorithm starts with learning unified structural embeddings and generating the fused distance matrix **M** by using the preliminary pseudo-labeled data *S*<sub>0</sub> (Lines 1–2). Then, the fused distance matrix is used to produce the new alignment results Δ*S* using C-TBNNS (Line 4). These newly generated entity pairs Δ*S* are added to the alignment results *S*, which are used for generating the fused distance matrix in the next round (Lines 6–7). The entities in *S* are removed from the entity sets (Lines 9–10). In order to progressively improve the quality of KG embeddings and detect more alignment results, we repeat the aforementioned process until the number of newly generated entity pairs is below a given threshold *μ*. Finally, we consider the entity pairs in *S* as the final alignment results *Ψ*.

Notably, in the learning process, once a pair of entities is considered as a match, the entities are removed from the entity sets (Lines 9–10). This gradually reduces the alignment search space and lowers the difficulty of aligning the rest of the entities. Obviously, this strategy suffers from the error propagation issue, which, however, can be effectively mitigated by the progressive learning process that dynamically adjusts the threshold. We will verify the effectiveness of this setting in Sect. 8.4.3.

#### **Algorithm 2:** Progressive learning

```
Input : G1 and G2: KGs to be aligned; E1 and E2: the entity sets; Mt: textual distance
        matrix; S0: preliminary labeled data; θ0: the initial threshold.
Output: Ψ: Alignment results.
 1 S ← S0;
 2 Use S to learn structural embeddings and generate M;
 3 θ ← θ0;
 4 ΔS, U ← C-TBNNS (G1, G2, E1, E2, θ, M);
 5 while |ΔS| ≥ μ do
 6     S ← S + ΔS;
 7     Use S to learn structural embeddings and generate M;
 8     θ ← θ + η;
 9     E1 ← {e | e ∈ E1, e ∉ S};
10     E2 ← {e | e ∈ E2, e ∉ S};
11     ΔS, U ← C-TBNNS (G1, G2, E1, E2, θ, M);
12 Ψ ← S;
13 return Ψ.
```
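The control flow of Algorithm 2 can be sketched in Python as follows. Several simplifications are assumptions of this sketch, not part of the original method: the embedding learning and feature fusion are hidden behind a hypothetical `fused_matrix(S)` callback, confidence scores are omitted (which corresponds to the UEA setting), seed entities are excluded before the first search so that no entity is matched twice, and the threshold is capped at a maximum value as described in the parameter settings.

```python
def mutual_nearest_pairs(M, rows, cols, theta):
    """TBNNS restricted to the still-unmatched sources (rows) and targets (cols)."""
    if not rows or not cols:
        return []
    nearest_source = {v: min(rows, key=lambda u: M[u][v]) for v in cols}
    pairs = []
    for u in rows:
        v = min(cols, key=lambda c: M[u][c])
        if nearest_source[v] == u and M[u][v] < theta:
            pairs.append((u, v))
    return pairs


def progressive_learning(fused_matrix, n_src, n_tgt, seeds,
                         theta0=0.05, eta=0.1, theta_max=0.45, mu=30):
    """Iteratively grow the alignment set S while relaxing the threshold.

    fused_matrix(S) is a hypothetical callback that retrains the structural
    embeddings on the current pairs S and returns the fused distance matrix.
    """
    S = list(seeds)                                       # Line 1
    M = fused_matrix(S)                                   # Line 2
    theta = theta0                                        # Line 3
    seed_src = {u for u, _ in S}
    seed_tgt = {v for _, v in S}
    rows = [u for u in range(n_src) if u not in seed_src]
    cols = [v for v in range(n_tgt) if v not in seed_tgt]
    delta = mutual_nearest_pairs(M, rows, cols, theta)    # Line 4
    while len(delta) >= mu:                               # Line 5
        S.extend(delta)                                   # Line 6
        M = fused_matrix(S)                               # Line 7
        theta = min(theta + eta, theta_max)               # Line 8, with a cap
        matched_src = {u for u, _ in delta}
        matched_tgt = {v for _, v in delta}
        rows = [u for u in rows if u not in matched_src]  # Line 9
        cols = [v for v in cols if v not in matched_tgt]  # Line 10
        delta = mutual_nearest_pairs(M, rows, cols, theta)  # Line 11
    # As in Algorithm 2, a final batch smaller than mu is not added to S.
    return S                                              # Lines 12-13
```

Source entities that never appear in the returned pairs are the ones predicted to be unmatchable.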
#### **8.3.4.3 Dynamic Threshold Adjustment**

It can be observed from Algorithm 2 that the matches generated by the unmatchable entity prediction module are part of not only the eventual alignment results but also the pseudo training data for learning subsequent structural embeddings. Therefore, to enhance the overall alignment performance, the alignment results generated in each round should, ideally, have both large *quantity* and high *quality*. Unfortunately, these two goals cannot be achieved at the same time. This is because, as stated in Sect. 8.3.3, a larger threshold in TBNNS can generate more alignment results (large quantity), whereas some of them might be erroneous (low quality). These wrongly aligned entity pairs can cause the error propagation problem and result in more erroneous matches in the following rounds. In contrast, a smaller threshold leads to fewer alignment results (small quantity), while almost all of them are correct (high quality).

To address this issue, we aim to balance between the quantity and the quality of the matches generated in each round. An intuitive idea is to set the threshold to a moderate value. However, this fails to take into account the characteristics of the progressive learning process. That is, in the beginning, the quality of the matches should be prioritized, as these alignment results will have a long-term impact on the subsequent rounds. In comparison, in the later stages where most of the entities have been aligned, the quantity is more important, as we need to include more possible matches that might not have a small distance score. Accordingly, we set the initial threshold *θ*<sub>0</sub> to a very small value so as to reduce potential errors. Then, in the following rounds, we gradually increase the threshold by *η*, so that more possible matches can be detected. We will empirically validate the superiority of this strategy over a fixed threshold in Sect. 8.4.3.

Notably, our proposed confidence-based framework CUEA can further help mitigate the low-quality issue, as we calculate and assign a confidence score to each entity pair, where the wrongly aligned entity pairs would presumably have lower confidence scores and thus exert smaller influence on the subsequent alignment process.

**Remark** As mentioned in the related work, there are some existing EA approaches that exploit the iterative learning (bootstrapping) strategy to improve EA performance. Particularly, BootEA calculates for each source entity the alignment likelihood to every target entity and includes those with likelihood above a given threshold in a maximum likelihood matching process under the 1-to-1 mapping constraint, producing a solution containing confident EA pairs [15]. This strategy is also adopted by [8, 16]. Zhu et al. use a threshold to select the entity pairs with very close distances as the pseudo-labeled data [14]. DAT employs a bidirectional margin-based constraint to select the confident EA pairs as labels [17]. Our progressive learning strategy differs from these existing solutions in four aspects: (1) we exclude the entities in the confident EA pairs from the test sets; (2) we use the dynamic threshold adjustment strategy to control the pace of the learning process; (3) our strategy can deal with unmatchable entities; and (4) we attach a confidence score to each selected entity pair, which can mitigate the negative influence of the false positives on the KG representation learning process as well as the alignment results. The superiority of our strategy is validated in Sect. 8.4.3.

#### **8.4 Experiment**

This section reports the experimental results with in-depth analysis. The source code is available at https://github.com/DexterZeng/UEA.

#### *8.4.1 Experimental Settings*

**Datasets** Following existing works, we adopt the DBP15K dataset [3] for evaluation. This dataset consists of three multilingual KG pairs extracted from DBpedia. Each KG pair contains 15,000 inter-language links as gold standards. The statistics can be found in Table 8.1. We note that state-of-the-art studies merely consider the labeled entities and divide them into training and testing sets. Nevertheless, as can be observed from Table 8.1, there exist unlabeled entities, e.g., 4,388 and 4,572 entities in the Chinese and English KG of DBP15K<sub>ZH-EN</sub>, respectively. Therefore, we adapt the dataset by including the unmatchable entities. Specifically, for each KG pair, we keep 30% of the labeled entity pairs as the training set (for training the supervised or semi-supervised methods). Then, to construct the test set, we include the rest of the entities in the first KG and the rest of the labeled entities in the second KG, so that the unlabeled entities in the first KG become unmatchable. The statistics of the test sets can be found in the *test set* column in Table 8.1.

**Table 8.1** The statistics of the evaluation benchmarks

**Parameter Settings** For the *side information* module, we utilize the fastText embeddings [27] as word embeddings. To deal with cross-lingual KG pairs, following [19], we use Google Translate to translate the entity names from one language to another, i.e., translating Chinese, Japanese, and French to English. *α* is set to 0.5. For the *structural information learning*, we set *β* to 0.5. Following [18], we set *γ* in the alignment objectives to 3 and adopt Manhattan distance as *d*(·, ·). Regarding C-TBNNS, we set *λ* to 0.4. For *progressive learning*, we set the initial threshold *θ*<sub>0</sub> to 0.05, the incremental parameter *η* to 0.1, and the termination threshold *μ* to 30. Note that if the threshold *θ* exceeds 0.45, we reset it to 0.45. These hyper-parameters are default values since there is no extra validation set for hyper-parameter tuning.

**Evaluation Metrics** We use *precision* (P), *recall* (R), and *F1 score* as evaluation metrics. The *precision* is computed as the number of correct matches divided by the number of matches found by a method. The *recall* is computed as the number of correct matches found by a method divided by the number of gold matches. The *F1 score* is the harmonic mean between *precision* and *recall*. The bold figures in the tables represent the best results.
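The three metrics can be computed from the set of predicted pairs and the set of gold pairs as sketched below. The function name and the toy pairs are illustrative only.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy case: source 2 is unmatchable, so the predicted pair (2, 5) is an error.
predicted = [(0, 0), (1, 1), (2, 5)]
gold = [(0, 0), (1, 1), (3, 3), (4, 4)]
p, r, f1 = precision_recall_f1(predicted, gold)
```

In this toy case, two of the three predicted pairs are correct and two of the four gold pairs are found, so the precision is 2/3 and the recall is 1/2. A method that outputs a match for an unmatchable source entity pays for it in precision without any gain in recall.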

**Competitors** We select the most performant state-of-the-art solutions for comparison. Within the group that solely utilizes structural information, we compare with BootEA [15], TransEdge [8], MRAEA [26], and SSP [28]. Among the methods incorporating other sources of information, we compare with GCN-Align [18], HMAN [9], HGCN [4], RE-GCN [29], DAT [17], and RREA [30]. We also include the unsupervised approaches, i.e., IMUSE [21] and PARIS [22]. To make a fair comparison, we only use entity name labels as the side information.


On SRPRS<sub>EN-FR</sub> and SRPRS<sub>EN-DE</sub>, all of the entities are matchable, and the number of gold matches equals the number of entities in a KG. Besides, most methods generate matches for all the entities in a KG. Therefore, the number of matches produced by these methods is equal to the number of gold matches, and the values of precision, recall, and F1 score are equal.

#### *8.4.2 Results*

Table 8.2 reports the alignment results, which show that state-of-the-art supervised or semi-supervised methods have rather low precision values. This is because these approaches cannot predict the unmatchable source entities and generate a target entity for each source entity (including the unmatchable ones). In particular, methods incorporating additional information attain relatively better performance than the methods in the first group, demonstrating the benefit of leveraging such additional information.

Regarding the unsupervised methods, although IMUSE cannot deal with the unmatchable entities and achieves a low precision score, it outperforms most of the supervised or semi-supervised methods in terms of recall and F1 score. This indicates that, for the EA task, the KG side information is useful for mitigating the reliance on labeled data. In contrast to the abovementioned methods, PARIS attains very high precision, since it only generates matches that it believes to be highly possible, which can effectively filter out the unmatchable entities. It also achieves the second best F1 score among all approaches, showcasing its effectiveness when the unmatchable entities are involved. Our proposals, UEA and CUEA, attain the best balance between precision and recall and obtain the best F1 scores, outperforming the second best by a large margin, validating their effectiveness. Notably, although our proposed models do not require labeled data, they achieve even better performance than the most performant supervised methods HMAN and DAT.

Furthermore, by integrating the notion of confidence into UEA, CUEA achieves results comparable to those of UEA. At first sight, it seems that assigning confidence scores to entity pairs does not have a large influence on the representation learning and the alignment results. This, however, can be ascribed to the fact that the side information is highly effective on these datasets (solely using the string information achieves an F1 score of 0.814, as shown later in Table 8.4), which renders the structural information (largely affected by the confidence scores) less contributive to the overall results. Next, we show that the confidence-based framework is much more useful on datasets with low-quality side information.

#### **8.4.2.1 Results Using Low-Quality Side Information**

We compare the unsupervised approaches under a practical scenario where the side information is of low quality. Specifically, we assume that the pre-trained word embeddings as well as the machine translation tools are not available. Under this circumstance, to use the entity name information, a viable solution is to compare the name strings directly. However, the direct string comparison would be ineffective for cross-lingual datasets such as DBP15K<sub>ZH-EN</sub> and DBP15K<sub>JA-EN</sub>, where the languages in the source and target KGs are disparate. Hence, we aim to examine the effectiveness of these unsupervised approaches when the side information is of low quality and cannot provide many useful signals for alignment.

We report the results on DBP15K<sub>ZH-EN</sub> and DBP15K<sub>JA-EN</sub> in Table 8.3, where the direct comparison between entity name strings serves as the side information. It can be observed that the F1 scores of all methods are very low (compared with those in Table 8.2), revealing that the quality of side information does affect the overall alignment results. Besides, given the low-quality side information, our proposed models UEA and CUEA still outperform the baselines IMUSE and PARIS in terms of the F1 score, demonstrating the effectiveness of the progressive learning framework and the unmatchable entity prediction module. Moreover, it is notable that CUEA achieves better results than UEA in terms of all metrics. This could be attributed to the confidence-based alignment result generation process, which enables the entity pairs of higher confidence (presumably, higher probability of being correct) to have a larger impact on the representation learning and alignment process.


**Table 8.4** Ablation results

#### *8.4.3 Ablation Study*

In this subsection, we examine the usefulness of the proposed modules by conducting an ablation study. More specifically, in Table 8.4, we report the results of UEA w/o Unm, which excludes the unmatchable entity prediction module, and UEA w/o Prg, which excludes the progressive learning process. It shows that removing the unmatchable entity prediction module (UEA w/o Unm) brings down the performance on all metrics and datasets, validating its effectiveness in detecting the unmatchable entities and enhancing the overall alignment performance. Besides, without the progressive learning (UEA w/o Prg), the precision increases, while the recall and F1 score values drop significantly. This shows that the progressive learning framework can discover more correct aligned entity pairs and is crucial to the alignment process.

To provide insights into the progressive learning framework, we report the results of UEA w/o Adj, which does not adjust the threshold, and UEA w/o Excl, which does not exclude the entities in the alignment results from the entity sets during the progressive learning. Table 8.4 shows that setting the threshold to a fixed value (UEA w/o Adj) leads to worse F1 results, verifying that the progressive learning process depends on the choice of the threshold and the quality of the alignment results. We will further discuss the setting of the threshold in the next subsection. Besides, the performance also decreases if we do not exclude the matched entities from the entity sets (UEA w/o Excl), validating that this strategy indeed can reduce the difficulty of aligning entities.

Moreover, we replace our progressive learning framework with other state-of-the-art iterative learning strategies (i.e., MWGM [15], TH [14], and DAT-I [17]) and report the results in Table 8.4. It shows that using our progressive learning framework (UEA) can attain the best F1 score, verifying its superiority.

#### *8.4.4 Quantitative Analysis*

In this subsection, we perform quantitative analysis of the modules in UEA and CUEA.

**The Threshold** *θ* **in TBNNS** We discuss the setting of *θ* to reveal the trade-off between the risk and gain from generating the alignment results in the progressive learning. Identifying a match leads to the integration of additional structural information, which benefits the subsequent learning. However, for the same reason, the identification of a false positive, i.e., an incorrect match, potentially leads to mistakenly modifying the connections between KGs, with the risk of amplifying the error in successive rounds. As shown in Fig. 8.3, a smaller *θ* (e.g., 0.05) brings low risk and low gain; that is, it merely generates a small number of matches, among which almost all are correct. In contrast, a higher *θ* (e.g., 0.45) increases the risk and brings relatively higher gain; that is, it results in many more aligned entity pairs, while a certain portion of them are erroneous. Additionally, using a higher threshold leads to increasingly more alignment results, while for a lower threshold, the progressive learning process barely increases the number of matches. This is consistent with our theoretical analysis in Sect. 8.3.3.

**Fig. 8.3** Alignment results given different threshold values. Correct-*θ* refers to the number of correct matches generated by the progressive learning framework at each round given the threshold value *θ*. Wrong refers to the number of erroneous matches generated in each round

**Unmatchable Entity Prediction** Zhao et al. [12] propose an intuitive strategy (U-TH) to predict the unmatchable entities. They set an NIL threshold, and if the distance value between a source entity and its closest target entity is above this threshold, they consider the source entity to be unmatchable. We compare our unmatchable entity prediction strategy with it in terms of the percentage of unmatchable entities that are included in the final alignment results and the F1 score. On DBP15K<sub>ZH-EN</sub>, replacing our unmatchable entity prediction strategy with U-TH yields an F1 score of 0.837, which is 8.4% lower than that of UEA. Besides, in the alignment results generated by using U-TH, 18.9% are unmatchable entities, while this figure for UEA is merely 3.9%. This demonstrates the superiority of our unmatchable entity prediction strategy.
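The difference between the two strategies can be illustrated on a toy distance matrix. The sketch below is a simplified reading of U-TH based only on the description above, not the implementation of [12]; the function names and matrix values are made up for illustration.

```python
def u_th(M, rows, cols, nil_threshold):
    """One-directional baseline: keep u's nearest target unless it is too far."""
    pairs = []
    for u in rows:
        v = min(cols, key=lambda c: M[u][c])
        if M[u][v] <= nil_threshold:  # above the NIL threshold means unmatchable
            pairs.append((u, v))
    return pairs


def tbnns(M, rows, cols, theta):
    """Bidirectional variant: u and v must also be each other's nearest neighbor."""
    nearest_source = {v: min(rows, key=lambda u: M[u][v]) for v in cols}
    pairs = []
    for u in rows:
        v = min(cols, key=lambda c: M[u][c])
        if nearest_source[v] == u and M[u][v] < theta:
            pairs.append((u, v))
    return pairs


# Source 2 has no true counterpart but happens to lie fairly close to target 0.
M = [[0.1, 0.9], [0.8, 0.6], [0.2, 0.7]]
baseline_pairs = u_th(M, [0, 1, 2], [0, 1], 0.5)
tbnns_pairs = tbnns(M, [0, 1, 2], [0, 1], 0.5)
```

Here the one-directional check accepts both (0, 0) and (2, 0), assigning target 0 to two sources, whereas the bidirectional check keeps only (0, 0) because target 0 prefers source 0.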

**Influence of Parameters** As mentioned in Sect. 8.4.1, we set *α* and *β* to 0.5 since there are no training/validation data. Here, we aim to show that different values of these parameters do not have a large influence on the final results. More specifically, we keep *α* at 0.5 and choose *β* from [0.3, 0.4, 0.5, 0.6, 0.7]; then we keep *β* at 0.5 and choose *α* from [0.3, 0.4, 0.5, 0.6, 0.7]. It can be observed from Fig. 8.4 that, although smaller *α* and *β* lead to better results, the performance does not change significantly.

**The Hyper-Parameter** *λ* **in CUEA** We then analyze the influence of *λ* in Eq. (8.2), which determines the range of the confidence scores, on the final alignment results. To highlight its influence on the structural representation learning, we follow the settings in Sect. 8.4.2.1 and report the results in Table 8.5.

Table 8.5 shows that the alignment performance is relatively stable when *λ* is not too large. Nevertheless, when *λ* is set to a large value (e.g., 1, which restores UEA), the results drop sharply. This reveals that assigning probability scores to the entity pairs according to their confidence of being true can facilitate the alignment. In general, CUEA is robust to the perturbation of *λ*, as long as it is not too large.

**Fig. 8.4** The F1 scores by setting *α* and *β* to different values

**Influence of Input Side Information** We adopt different side information as input to examine the performance of UEA. More specifically, we report the results of UEA-**Ml**, which merely uses the string-level feature of entity names as input, and UEA-**Mn**, which only uses the semantic embeddings of entity names as input. We also provide the results of **Ml** and **Mn**, which directly use the string-level and semantic information, respectively, to generate alignment results (without progressive learning).

As shown in Table 8.3, the performance of solely using the input side information is not very promising (**Ml** and **Mn**). Nevertheless, by forwarding the side information into our model, the results of UEA-**Ml** and UEA-**Mn** become much better. This unveils that UEA can work with different types of side information and consistently improve the alignment results. Additionally, by comparing UEA-**Ml** with UEA-**Mn**, it is evident that the input side information does affect the final results, and the quality of the side information is of significance to the overall alignment performance.

**Pseudo-Labeled Data** We further examine the usefulness of the preliminary alignment results generated by the side information, i.e., the pseudo-labeled data. Concretely, we replace the training data in HGCN with these pseudo-labeled data, resulting in HGCN-U, and then compare its alignment results with the original performance. Regarding the F1 score, HGCN-U is 4% lower than HGCN on DBP15K<sub>ZH-EN</sub>, 2.9% lower on DBP15K<sub>JA-EN</sub>, and 2.8% lower on DBP15K<sub>FR-EN</sub>. The minor difference validates the effectiveness of the pseudo-labeled data generated by the side information. It also demonstrates that this strategy can be applied to other supervised or semi-supervised frameworks to reduce their reliance on labeled data.

#### **8.5 Conclusion**

In this chapter, we propose an unsupervised EA solution that is capable of dealing with unmatchable entities. We first exploit the side information of KGs to generate preliminary alignment results, which are considered as pseudo-labeled data and forwarded to the progressive learning framework to produce better KG embeddings and alignment results in a self-training fashion. We also devise an unmatchable entity prediction module to detect the unmatchable entities. The experimental results validate the usefulness of our proposed model and its superiority over state-of-the-art approaches.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 9 Multimodal Entity Alignment**

**Abstract** In various tasks related to artificial intelligence, data is often present in multiple forms or modalities. Recently, it has become popular to combine these different forms of information into a knowledge graph, creating a multi-modal knowledge graph (MMKG). However, MMKGs often suffer from insufficient data coverage and incompleteness. A possible strategy to address this issue is to incorporate supplemental information from other MMKGs. To this end, current entity alignment methods could be utilized; however, these approaches work within the Euclidean space, and the resulting entity representations can distort the hierarchical structure of the knowledge graph. Additionally, the potential benefits of visual information have not been fully exploited.

To address these concerns, we present a new approach for aligning entities across multiple modalities, which we call hyperbolic multi-modal entity alignment (HMEA). This method extends the conventional Euclidean representation to a hyperboloid manifold. First, we utilize hyperbolic graph convolutional networks (HGCN) to learn structural representations of entities. For the visual data, we create image embeddings using the DenseNet model and subsequently map them into the hyperbolic space using HGCN. Lastly, we merge the structural and visual representations within the hyperbolic space and use the combined embeddings to predict potential entity alignment results. Through a series of thorough experiments and ablation studies, we validate the efficacy of our proposed model and its individual components.

#### **9.1 Introduction**

In recent times, there has been a noticeable trend of integrating multimedia data into knowledge graphs (KGs) to facilitate cross-modal tasks that involve the interplay of information across multiple modalities, e.g., image and video retrieval [27], video summaries [19], visual entity disambiguation [17], and visual question answering [32]. To this end, several multi-modal KGs (MMKGs) [16, 28] have been constructed very recently. An example of MMKG can be found in Fig. 9.1. For this study, we focus on MMKGs that consist of two modalities, namely, the KG structural details and visual information, while retaining a generalizable approach.

**Fig. 9.1** An example of MMKG

*Example* Figure 9.1 shows a partial MMKG, which consists of entities, image sets, and the links between them. To elaborate, the KG structural data entails the relationships between the different entities, whereas the visual data is sourced from the sets of images. For the entity The Prestige, its image set may contain scenes, actors, posters, etc.

However, many of the current MMKGs have been sourced from restricted data sources, causing them to have inadequate domain coverage [22]. To broaden the scope of these MMKGs, one potential solution is to incorporate valuable knowledge from other MMKGs. An essential step in consolidating knowledge across MMKGs is to identify matching entities in different KGs, given that entities serve as the links that connect the diverse KGs. This technique is also referred to as multi-modal entity alignment (MMEA).

MMEA is a complex undertaking that necessitates the modeling and amalgamation of information from multiple modalities. For the *KG structural information*, existing entity alignment (EA) approaches [3, 9, 25, 33] can be directly adopted to generate entity structural embeddings for MMEA. These methods usually utilize TransE-based or graph convolutional network (GCN)-based models [1, 12] to learn entity representations of individual KGs, which are then unified using the seed entity pairs. However, all of these techniques generate entity representations in the Euclidean space, which can result in significant distortion when embedding real-world graphs that possess scale-free or hierarchical structures [4, 23]. Concerning the *visual information*, the VGG16 model has been utilized to create embeddings for images linked to entities, which are subsequently employed for alignment. However, the VGG16 model is not adept at extracting valuable features from images, which limits the efficacy of the alignment process. Lastly, the integration of information from both modalities must be executed carefully to enhance overall effectiveness.

To tackle the problems mentioned above, we introduce a multi-modal entity alignment technique that works in hyperbolic space (HMEA). More specifically, we extend the Euclidean representation to the hyperboloid manifold and utilize hyperbolic graph convolutional networks (HGCN) to learn structural representations of entities. With regard to visual data, we create image embeddings using the DenseNet model and also map them into the hyperbolic space with HGCN. Ultimately, we combine the structural embeddings and image embeddings in the hyperbolic space to predict potential alignments.

To sum up, the key contributions of our technique can be outlined as follows:


**Organization** Section 9.2 overviews related work, and the preliminaries are introduced in Sect. 9.3. Section 9.4 describes our proposed approach. Section 9.5 presents experimental results, followed by conclusion in Sect. 9.6.

#### **9.2 Related Work**

In this section, we introduce some efforts that are relevant to this work.

#### *9.2.1 Multi-Modal Knowledge Graph*

Many knowledge graph construction studies concentrate on organizing and discovering textual data in a structured format, neglecting other resources available on the Web [28]. Nevertheless, real-world applications require cross-modal data, such as image and video retrieval, visual question answering, video summaries, and visual commonsense reasoning. Consequently, multi-modal knowledge graphs (MMKGs) have been introduced, which comprise diverse information (e.g., image, text, KG) and cross-modal relationships. However, building MMKGs poses several challenges. Collecting substantial multi-modal data from search engines is a time-consuming and laborious task. Additionally, MMKGs often have low domain coverage and are incomplete. Integrating multi-modal knowledge from other MMKGs is an effective way to enhance their completeness. Currently, there are few studies on merging different MMKGs. Liu et al. [16] built two pairs of MMKGs and extracted relational, latent, numerical, and visual features for predicting the *SameAs* link between entities. In addition, some multi-modal knowledge representation approaches incorporate visual features from entity images into knowledge representation learning; for example, IKRL [31] integrates image representations into an aggregated image-based representation via an attention-based method.

#### *9.2.2 Representation Learning in Hyperbolic Space*

Essentially, most of the existing GCN models are designed for graphs in Euclidean spaces [2]. However, research has found that graph data exhibits a non-Euclidean structure [18], and embedding real-world graphs with a scale-free or hierarchical structure in Euclidean space results in significant distortion [4, 23]. Moreover, recent studies in network science have shown that hyperbolic geometry is well suited for modeling complex networks, as the hyperbolic space can naturally reflect some graph properties [14]. One of the key features of hyperbolic spaces is that they expand more rapidly than Euclidean spaces: they expand exponentially rather than polynomially. Due to the advantages of hyperbolic space in representing graph-structured data, there has been growing interest in representation learning in hyperbolic spaces, particularly in learning the hierarchical representation of a graph [20]. Furthermore, Nickel et al. [21] have demonstrated that the Lorentz model of hyperbolic geometry has favorable properties for stochastic optimization and leads to substantially enhanced embeddings, particularly in low dimensions. Additionally, some researchers have begun to extend deep learning methods to hyperbolic space, achieving state-of-the-art performance on link prediction and node classification tasks [7, 8, 26].

#### **9.3 Preliminaries**

In this section, we start by providing a formal definition of the MMEA task. Then, we provide a brief overview of the GCN model. Lastly, we introduce the fundamental principles of hyperbolic geometry, which serve as the foundation for our proposed model.

**Fig. 9.2** An example of MMEA. Seed entity pairs are connected by dashed lines. For clarity, we only choose an image to represent the set of images of an entity

#### *9.3.1 Task Formulation*

The goal of MMEA is to align entities in two MMKGs. An MMKG typically encompasses information in several modalities. In this study, we concentrate on the KG structural information and visual information, without any loss of generality. Formally, we represent MMKGs as $MG = (E, R, T, I)$, where $E$, $R$, $T$, and $I$ denote the sets of entities, relations, triples, and images, respectively. A relational triple $t \in T$ can be represented as $(e_1, r, e_2)$, where $e_1, e_2 \in E$ and $r \in R$. An entity $e$ is associated with multiple images $I_e = \{i_e^0, i_e^1, \ldots, i_e^n\}$.

Given two MMKGs, $MG_1 = (E_1, R_1, T_1, I_1)$, $MG_2 = (E_2, R_2, T_2, I_2)$, and seed entity pairs (pre-aligned entity pairs for training) $S = \{(e_s^1, e_s^2) \mid e_s^1 \leftrightarrow e_s^2, e_s^1 \in E_1, e_s^2 \in E_2\}$, where $\leftrightarrow$ represents equivalence, the task of MMEA can be defined as discovering more aligned entity pairs $\{(e_1, e_2) \mid e_1 \in E_1, e_2 \in E_2\}$. We use the following example to further illustrate this task.

*Example* Figure 9.2 shows two partial MMKGs. The equivalence between The Dark Knight in $MG_1$ and The Dark Knight in $MG_2$ is known in advance. EA aims to detect potential equivalent entity pairs, e.g., Nolan in $MG_1$ and Nolan in $MG_2$, using the known alignments.

#### *9.3.2 Graph Convolutional Neural Networks*

GCNs [10, 13] are a neural network type that works directly with graph data. A GCN model comprises several stacked GCN layers. The inputs to the $l$-th layer of the GCN model are node feature vectors and the graph's structure. $H^{(l)} \in \mathbb{R}^{n \times d^{l}}$ is a vertex feature representation, where $n$ is the number of vertices and $d^{l}$ is the dimensionality of the feature matrix. $\hat{A} = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}$ represents the symmetric normalized adjacency matrix. The identity matrix $I$ is added to the adjacency matrix $A$ to obtain self-loops for each node, and the degree matrix $D$ is diagonal with entries $D_{ii} = \sum_{j} (A_{ij} + I_{ij})$. The output of the $l$-th layer is a new feature matrix $H^{(l+1)}$ by the following convolutional computation:

$$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)}).\tag{9.1}$$
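To make Eq. (9.1) concrete, the following minimal sketch computes one GCN layer on a toy graph. It is written with plain Python lists so that it runs without third-party packages; a practical implementation would use a tensor library. The helper names, the toy graph, the weights, and the choice of ReLU for the activation are assumptions made for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj, h, w):
    """One layer of Eq. (9.1), with ReLU assumed as the activation sigma."""
    pre_activation = matmul(matmul(normalized_adjacency(adj), h), w)
    return [[max(0.0, x) for x in row] for row in pre_activation]


# Toy path graph 0 - 1 - 2 with 2-dimensional input features (hypothetical).
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_w = [[1.0, -1.0], [0.5, 2.0]]
toy_out = gcn_layer(toy_adj, toy_h, toy_w)
```

Each output row mixes a node's own features with those of its neighbors, weighted by the normalized adjacency matrix, before the linear map and the activation are applied.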

#### *9.3.3 Hyperboloid Manifold*

We provide a brief overview of the critical concepts in hyperbolic geometry. For a more comprehensive description, please refer to [6]. Hyperbolic geometry refers to a non-Euclidean geometry that features a constant negative curvature, where the curvature measures how a geometric object differs from a flat plane. In this work, we use the $d$-dimensional Poincaré ball model with negative curvature $-c$ $(c > 0)$: $P^{(d,c)} = \{\mathbf{x} \in \mathbb{R}^{d} : \|\mathbf{x}\|^{2} < \frac{1}{c}\}$, where $\|\cdot\|$ is the $L_2$ norm. For each point $x \in P^{(d,c)}$, the tangent space $T_{x}^{c}$ is a $d$-dimensional vector space at point $x$, which contains all possible directions of paths in $P^{(d,c)}$ leaving from $x$. Next, we present several fundamental operations in the hyperbolic space, which play a critical role in our proposed model.

**Exponential and Logarithmic Maps** Specifically, let $v$ be the feature vector in the tangent space $T_{\mathbf{o}}^{c}$; $\mathbf{o}$ is a point in the hyperbolic space $P^{(d,c)}$, which is also used as a reference point. Let $\mathbf{o}$ be the origin, $\mathbf{o} = 0$. The tangent space $T_{\mathbf{o}}^{c}$ can be mapped to $P^{(d,c)}$ via the exponential map:

$$\exp_{\mathbf{o}}^{c}(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}. \tag{9.2}$$

Conversely, the logarithmic map, which maps $P^{(d,c)}$ to $T_{\mathbf{o}}^{c}$, is defined via the inverse hyperbolic tangent as:

$$\log_{\mathbf{o}}^{c}(\mathbf{y}) = \operatorname{artanh}(\sqrt{c}\,\|\mathbf{y}\|)\,\frac{\mathbf{y}}{\sqrt{c}\,\|\mathbf{y}\|}. \tag{9.3}$$
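The two maps at the origin can be sketched in a few lines of plain Python. The logarithmic map uses the inverse hyperbolic tangent (`math.atanh`), which is what makes it the inverse of Eq. (9.2). The function names and the toy vector are hypothetical, and the zero-vector guard is an implementation detail added here to avoid division by zero.

```python
import math


def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v, c):
    """Eq. (9.2): map a tangent vector at the origin into the Poincare ball."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in v]


def log_map_origin(y, c):
    """Eq. (9.3): map a point of the ball back to the tangent space at the origin."""
    n = norm(y)
    if n == 0.0:
        return list(y)
    scale = math.atanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * x for x in y]


# Hypothetical tangent vector whose Euclidean norm is well above 1.
toy_tangent = [0.3, -1.2, 2.0]
toy_point = exp_map_origin(toy_tangent, c=1.0)      # lies strictly inside the ball
toy_recovered = log_map_origin(toy_point, c=1.0)    # round trip back to the tangent space
```

Whatever the length of the tangent vector, the hyperbolic tangent squashes its image to a norm below $1/\sqrt{c}$, so the result always stays inside the ball.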

**Möbius Addition** Vector addition does not have a well-defined meaning in the hyperbolic space. Directly adding the vectors of two points in the Poincaré ball, as in Euclidean space, could yield a point outside the ball. In this case, the Möbius addition [7] provides an analogue to the Euclidean addition in the hyperbolic space. Here, $\oplus_c$ represents the Möbius addition as:

$$\boldsymbol{h}_{i}\oplus_{c}\boldsymbol{h}_{j}=\frac{\left(1+2c\left\langle\boldsymbol{h}_{i},\boldsymbol{h}_{j}\right\rangle+c\left\|\boldsymbol{h}_{j}\right\|^{2}\right)\boldsymbol{h}_{i}+\left(1-c\left\|\boldsymbol{h}_{i}\right\|^{2}\right)\boldsymbol{h}_{j}}{1+2c\left\langle\boldsymbol{h}_{i},\boldsymbol{h}_{j}\right\rangle+c^{2}\left\|\boldsymbol{h}_{i}\right\|^{2}\left\|\boldsymbol{h}_{j}\right\|^{2}}.\tag{9.4}$$
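A small numerical sketch illustrates why Eq. (9.4) is needed: the Euclidean sum of two points of the unit ball can leave the ball, whereas their Möbius sum does not. The function names and the two toy points are hypothetical.

```python
import math


def dot(a, b):
    """Euclidean inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def mobius_add(x, y, c):
    """Moebius addition of Eq. (9.4) in the Poincare ball of curvature -c."""
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


# Two hypothetical points inside the unit ball (c = 1).
toy_x = [0.6, 0.5]
toy_y = [0.5, 0.6]
toy_euclidean_sum = [a + b for a, b in zip(toy_x, toy_y)]  # norm > 1: outside the ball
toy_mobius_sum = mobius_add(toy_x, toy_y, c=1.0)           # norm < 1: inside the ball
toy_sum_norm = math.sqrt(dot(toy_mobius_sum, toy_mobius_sum))
```

Two further properties follow directly from the formula and are handy as sanity checks: adding the origin leaves a point unchanged, and adding a point to its negation gives the origin.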

**Fig. 9.3** The framework of our proposed method

#### **9.4 Methodology**

In this section, we present our proposed approach HMEA, which operates in the hyperbolic space. The framework is shown in Fig. 9.3. We first adopt HGCN to obtain the structural embeddings of entities. Subsequently, we transform the corresponding entity images into visual embeddings with the DenseNet model, which are further projected into the hyperbolic space. In the end, we combine these embeddings in the hyperbolic space and predict the alignment results using a pre-defined hyperbolic distance. We use the following example to illustrate our proposed model.

*Example* Further to the previous example, by using structural information, it is easy to detect that Nolan in $MG_1$ is equivalent to Nolan in $MG_2$. However, solely relying on structural data is insufficient and might result in an incorrect alignment of Michael Caine in $MG_1$ with Christian Bale in $MG_2$. In this scenario, the utilization of visual information would be highly beneficial, as the images of Michael Caine in $MG_1$ and Christian Bale in $MG_2$ are significantly dissimilar. Consequently, we consider both structural and visual information for alignment.

In the following, we elaborate on the various components of our proposal.

#### *9.4.1 Structural Representation Learning*

We acquire the structural representation of MMKGs by employing hyperbolic graph convolutional neural networks, which extend convolutional computation to manifold space and leverage the effectiveness of both graph neural networks and hyperbolic embeddings. Initially, we transform the input Euclidean features to the hyperboloid manifold. Then, through *feature transformation*, *message passing*, and *nonlinear activation* in the hyperbolic space, we obtain the hyperbolic structural representations.

**Mapping Input Features to Hyperboloid Manifold** In general, the input node features are produced by pre-trained Euclidean neural networks, and hence, they exist in the Euclidean space. We begin by establishing a conversion from Euclidean features to the hyperbolic space.

Here, we assume that the input Euclidean features $x^{E} \in T_{\mathbf{o}}H_{c}$, where $T_{\mathbf{o}}H_{c}$ represents the tangent space referring to $\mathbf{o}$, and $\mathbf{o} \in H_{c}$ denotes the north pole (origin) in hyperbolic space. We obtain the hyperbolic feature matrix $x^{H}$ via $x^{H} = \exp_{\mathbf{o}}^{c}(x^{E})$, where $\exp_{\mathbf{o}}^{c}(\cdot)$ is defined in Eq. (9.2).

**Feature Transformation and Propagation** The core operations in hyperbolic structural learning, similar to GCN, are feature transformation and message passing. While these operations are well-established in the Euclidean space, they are considerably more complex in the hyperboloid manifold. One possible solution is to perform these functions with trainable parameters in the *tangent space* of a point within the hyperboloid manifold, as the tangent space is Euclidean. To this end, we utilize the $\exp(\cdot)$ map and $\log(\cdot)$ map to convert between the hyperboloid manifold and the tangent space. This enables us to make use of the tangent space $T_{\mathbf{o}}H_{c}^{d}$ for executing Euclidean operations.

The initial step involves using the logarithmic map to map the hyperbolic representation $x_{v}^{H} \in \mathbb{R}^{1 \times d}$ of node $v$ to the tangent space $T_{\mathbf{o}}H_{c}^{d}$. Next, in $T_{\mathbf{o}}H_{c}^{d}$, we compute the feature transformation and propagation rule for node $v$ as:

$$x_{v}^{T} = \hat{A}\log_{\mathbf{o}}^{c}\left(x_{v}^{H}\right)W,\tag{9.5}$$

where $x_{v}^{T} \in \mathbb{R}^{1 \times d}$ denotes the feature representation in the tangent space and $\hat{A}$ represents the symmetric normalized adjacency matrix; $W$ is a $d \times d$ trainable weight matrix.

**Nonlinear Activation with Different Curvatures** Once the features have been transformed in the tangent space, a nonlinear activation function $\sigma^{\otimes^{c_l,c_{l+1}}}$ is applied to learn nonlinear transformations. Specifically, in the tangent space $T_{\mathbf{o}}H_{c_l}^{d}$ of layer $l$, Euclidean nonlinear activation is performed before mapping the features to the manifold of the next layer:

$$\sigma^{\otimes^{c_l,c_{l+1}}}\left(x_{v}^{T}\right) = \exp_{\mathbf{o}}^{c_{l+1}}\left(\sigma\left(\log_{\mathbf{o}}^{c_l}\left(x_{v}^{T}\right)\right)\right),\tag{9.6}$$

where the hyperbolic curvatures at layers $l$ and $l+1$ are $-c_l$ and $-c_{l+1}$, respectively, following the convention of Sect. 9.3.3. The activation function $\sigma$ used is the ReLU$(\cdot)$ function. This step is critical in enabling us to vary the curvature smoothly at each layer, which is necessary for achieving good performance due to limitations in machine precision and normalization.

Based on the hyperboloid feature transformation and nonlinear activation, the convolutional computation in the hyperbolic space is redefined as:

$$H^{l+1} = \exp_{\mathbf{o}}^{c_{l+1}}\left(\sigma\left(\hat{A}\log_{\mathbf{o}}^{c_l}\left(H^{l}\right)W\right)\right),\tag{9.7}$$

where $H^{l+1} \in \mathbb{R}^{n \times d^{l+1}}$ and $H^{l} \in \mathbb{R}^{n \times d^{l}}$ are the learned node embeddings in the hyperbolic space at layer $l+1$ and layer $l$, respectively. The initial embeddings are $H^{0} = x^{H}$. The symmetric normalized adjacency matrix is represented by $\hat{A}$, and the trainable weight matrix is represented by $W$, which has dimensions $d^{l} \times d^{l+1}$.

#### *9.4.2 Visual Representation Learning*

The DenseNet model [11], pretrained on the ImageNet dataset [5], is used to learn image embeddings. The softmax layer in DenseNet is removed, and 1920-dimensional embeddings are obtained for all images in the MMKGs. These embeddings are then projected into the hyperbolic space using HGCN to enhance their expressive power.

#### *9.4.3 Multi-Modal Information Fusion*

Both visual and structural information can impact the alignment results. To combine these two types of information, we propose a novel method that merges the *structural information* and *visual information* of MMKGs. Specifically, we obtain the merged representation of entity $e_i$ in the hyperbolic space using the following approach:

$$\boldsymbol{h}_{i} = \left(\beta \cdot H_{s}^{i}\right) \oplus_{c} \left((1-\beta)\cdot H_{v}^{i}\right),\tag{9.8}$$

where $H_{s}$ and $H_{v}$ are the structural and visual embeddings learned from the HGCN model, respectively; the hyper-parameter $\beta$ adjusts the relative weight of the structural and visual features in the final merged representation. The Möbius addition operator $\oplus_c$ is used to combine the structural and visual embeddings. Note that this requires the dimensions of the structural and visual representations to be identical.
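The fusion step of Eq. (9.8) can be sketched as below. The scaling by $\beta$ is implemented as ordinary scalar multiplication, following the "$\cdot$" in the equation; this reading, the function names, the toy embeddings, and the value $\beta = 0.4$ are assumptions made for illustration.

```python
def mobius_add(x, y, c):
    """Moebius addition of Eq. (9.4) in the Poincare ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def fuse(h_struct, h_visual, beta, c):
    """Eq. (9.8): weight both embeddings, then combine them by Moebius addition."""
    if len(h_struct) != len(h_visual):
        raise ValueError("structural and visual embeddings must have equal dimensions")
    weighted_struct = [beta * x for x in h_struct]
    weighted_visual = [(1.0 - beta) * x for x in h_visual]
    return mobius_add(weighted_struct, weighted_visual, c)


# Hypothetical 3-dimensional embeddings of one entity, both inside the unit ball.
toy_struct = [0.3, -0.2, 0.1]
toy_visual = [0.1, 0.4, -0.3]
toy_merged = fuse(toy_struct, toy_visual, beta=0.4, c=1.0)
```

Setting `beta` to 1 recovers the structural embedding alone, and setting it to 0 recovers the visual embedding alone.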

#### *9.4.4 Alignment Prediction*

To predict the alignment results, we compute the distance between the entity representations from two MMKGs. The Euclidean distance and Manhattan distance are popular distance measures used in the Euclidean space [15, 30]. However, in the hyperbolic space, we must use the hyperbolic distance between nodes as the distance measure. For entities $e_i$ in $MG_1$ and $e_j$ in $MG_2$, the distance is defined as:

$$d_{c}\left(\boldsymbol{h}_{i}, \boldsymbol{h}_{j}\right) = \left\|(-\boldsymbol{h}_{i}) \oplus_{c} \boldsymbol{h}_{j}\right\|,\tag{9.9}$$

where $\boldsymbol{h}_{i}$ and $\boldsymbol{h}_{j}$ denote the merged embeddings of $e_i$ and $e_j$ in the hyperbolic space, respectively; $\|\cdot\|$ is the $L_1$ norm; the operator $\oplus_c$ is the Möbius addition.

We expect the distance to be small for equivalent entities and large for nonequivalent ones. To align a specific entity $e_i$ in $MG_1$, our approach calculates the distances between $e_i$ and all entities in $MG_2$ and presents a ranked list of entities as candidate alignments.
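The distance of Eq. (9.9) and the resulting candidate ranking can be sketched as follows, using the $L_1$ norm as stated above. The function names and the toy embeddings are hypothetical.

```python
def mobius_add(x, y, c):
    """Moebius addition of Eq. (9.4) in the Poincare ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1.0 + 2.0 * c * xy + c * y2
    coef_y = 1.0 - c * x2
    denom = 1.0 + 2.0 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def hyperbolic_distance(h_i, h_j, c):
    """Eq. (9.9): L1 norm of (-h_i) Moebius-added to h_j."""
    difference = mobius_add([-x for x in h_i], h_j, c)
    return sum(abs(x) for x in difference)


def rank_candidates(h_i, target_embeddings, c):
    """Return target indices sorted from the closest to the farthest."""
    return sorted(
        range(len(target_embeddings)),
        key=lambda j: hyperbolic_distance(h_i, target_embeddings[j], c),
    )


# Hypothetical merged embeddings: one source entity and three target entities.
toy_source = [0.10, 0.20]
toy_targets = [[0.50, -0.40], [0.10, 0.21], [-0.30, 0.30]]
toy_ranking = rank_candidates(toy_source, toy_targets, c=1.0)
```

The first entry of the ranking is the predicted counterpart; the full list is what the *Hits*@*k* metric is later computed on.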

#### *9.4.5 Model Training*

To embed equivalent entities as closely as possible in the vector space, we utilize a set of established entity alignments (known as seed entities) *S* as training data to train the model. Specifically, we minimize the margin-based ranking loss function during model training:

$$L = \sum_{(e, v) \in S} \sum_{(e', v') \in S'_{(e, v)}} \left[d_{c}\left(\boldsymbol{h}_{e}, \boldsymbol{h}_{v}\right) + \gamma - d_{c}\left(\boldsymbol{h}_{e'}, \boldsymbol{h}_{v'}\right)\right]_{+} \tag{9.10}$$

where $[x]_{+} = \max\{0, x\}$; $(e, v)$ represents a seed entity pair and $S$ is the set of entity pairs; $S'_{(e,v)}$ represents the set of negative instances created by altering $(e, v)$, i.e., by substituting $e$ or $v$ with a randomly selected entity from either $MG_1$ or $MG_2$; $\gamma > 0$ denotes the margin hyper-parameter that separates positive and negative instances. The margin-based loss function stipulates that the distance between entities in positive pairs should be small, and the distance between entities in negative pairs should be large.
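Given precomputed distances, the loss of Eq. (9.10) reduces to a sum of hinge terms. The sketch below assumes that the distances of the seed pairs and of their negative instances have already been computed; the function name, the toy distances, and the margin value are hypothetical.

```python
def margin_ranking_loss(positive_distances, negative_distances, margin):
    """Eq. (9.10): sum of [d(pos) + margin - d(neg)]_+ over all seed pairs
    and their negative instances.

    positive_distances[k] is the distance of the k-th seed pair, and
    negative_distances[k] lists the distances of its negative instances.
    """
    total = 0.0
    for d_pos, negatives in zip(positive_distances, negative_distances):
        for d_neg in negatives:
            total += max(0.0, d_pos + margin - d_neg)
    return total


# Two hypothetical seed pairs, each with two negative instances.
toy_positive = [0.2, 0.9]
toy_negative = [[1.0, 0.5], [2.5, 1.0]]
toy_loss = margin_ranking_loss(toy_positive, toy_negative, margin=0.5)
```

Only the negatives that are not at least `margin` farther away than the corresponding positive pair contribute; here these are the second negative of each seed pair, with hinge terms 0.2 and 0.4.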

#### **9.5 Experiment**

#### *9.5.1 Dataset and Evaluation Metric*

In this study, we utilized datasets sourced from FreeBase, DBpedia, and YAGO, which were created by Liu et al. [16]. These datasets were developed by starting with FB15K to establish multi-modal knowledge graphs, which were then aligned with entities from other knowledge graphs such as DB15K and YAGO15K through reference links. Our experiments focused on two pairs of multi-modal knowledge graphs: FB15K-DB15K and FB15K-YAGO15K.

Due to the absence of original images in the datasets, we acquired the corresponding images for each entity using the URIs provided in [17]. To achieve this, we developed a Web crawler that can extract query results from image search engines, i.e., Google Images,<sup>1</sup> Bing Images,<sup>2</sup> and Yahoo Image Search.<sup>3</sup> Following this, we allocated the images obtained from various search engines to different MMKGs, thereby reflecting the dissimilarity among different MMKGs.

The detailed information on the datasets is provided in Table 9.1. Each dataset comprises approximately 15,000 entities and over 11,000 sets of entity images. The *Images* column represents the number of entities that possess image sets. The alignments between the KGs are given by the *SameAs* predicates that have been previously found. In the experiments, the known equivalent entity pairs are used for model training and testing.

**Evaluation Metric** We utilize *Hits*@*k* as the evaluation metric to gauge the efficacy of all the approaches. This metric determines the percentage of correctly aligned entities that are ranked among the top-*k* candidates.
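As a minimal illustration of the metric, the sketch below computes *Hits*@*k* from ranked candidate lists. The function name and the toy rankings are hypothetical, and the result is returned as a fraction rather than a percentage.

```python
def hits_at_k(ranked_candidates, gold_targets, k):
    """Fraction of test entities whose correct counterpart appears among
    the top-k entries of their ranked candidate list."""
    hits = sum(
        1
        for ranking, gold in zip(ranked_candidates, gold_targets)
        if gold in ranking[:k]
    )
    return hits / len(gold_targets)


# Four hypothetical test entities; each ranking lists target indices,
# closest candidate first.
toy_rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [1, 0, 2]]
toy_gold = [2, 2, 1, 2]
toy_hits_at_1 = hits_at_k(toy_rankings, toy_gold, k=1)
toy_hits_at_2 = hits_at_k(toy_rankings, toy_gold, k=2)
```

In this toy case one of the four entities is ranked correctly at the top and three of the four within the top two, so the metric can only grow or stay equal as *k* increases.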

#### *9.5.2 Experimental Setting and Competing Approaches*

**Experimental Setting** To analyze the effectiveness of the methods across various percentages of the provided alignments *P* (%), we evaluate the methods with low (20%), medium (50%), and high (80%) percentages of the given seed entity pairs. The remaining *sameAs* triples are used for testing. To ensure fairness, we maintain the same number of dimensions (i.e., 400) for both GCN-Align and HMEA. The other parameters of GCN-Align follow [29]. For our approach HMEA, we create six negative samples for each positive sample. The margin hyper-parameters used in the loss function are $\gamma_{\text{HMEA-s}} = 0.5$ and $\gamma_{\text{HMEA-v}} = 1.5$, respectively. We optimize HMEA using the Adam optimizer.

**Table 9.1** Statistics of the MMKG datasets

<sup>1</sup> https://www.google.com/imghp?hl=EN.

<sup>2</sup> https://www.bing.com/image.

<sup>3</sup> https://images.search.yahoo.com/.

**Competing Approaches** To showcase the effectiveness of our proposed model, we have selected three state-of-the-art approaches as competitors:


In order to showcase the advantages of hyperbolic geometry, particularly in the learning of structural features, we have conducted preliminary experiments which solely utilize the *structural information* for EA, resulting in HMEA-s, GCN-Align-s, and PoE-s. In addition, to evaluate the contribution of *visual information*, we compare PoE, GCN-Align, and HMEA with just *visual information*, namely, PoE-v, GCN-Align-v, and HMEA-v.

#### *9.5.3 Results*

Table 9.2 displays the results, indicating that HMEA achieves the best performance in all scenarios. Notably, on FB15K-YAGO15K with 80% seed entity pairs, HMEA outperforms PoE and GCN-Align by almost 15% in terms of *Hits*@1. With 20% seed entity pairs, our approach also shows better results, with improvements of around 2% in *Hits*@1 and up to 20% in *Hits*@10. For PoE, the improvement from *Hits*@1 to *Hits*@10 is only slight, ranging between 4 and 9%. In contrast, the improvement from *Hits*@1 to *Hits*@10 for HMEA is at least 20% across all scenarios. Moreover, HMEA achieves significantly better results than IKRL.

**Table 9.2** Alignment prediction on both datasets for different percentages of P

The bold figures in the tables represent the best result

Table 9.3 demonstrates that even when using only *structural information*, HMEA-s still achieves better results than the other two methods. Specifically, our proposed approach outperforms GCN-Align-s by almost 5% in terms of *Hits*@1 on FB15K-DB15K and by 3% on FB15K-YAGO15K with 20% seed alignments. With 50 and 80% seed entity pairs, HMEA-s shows significant improvements, ranging from 10 to 18% in *Hits*@1 and from 20 to 30% in *Hits*@10. These results suggest that our approach excels in capturing precise hierarchical structural representations.

**Table 9.3** Results of three methods with *structural information*

The bold figures in the tables represent the best result

Table 9.4 presents the results of the three variants that use *visual information* only: PoE-v, GCN-Align-v, and HMEA-v. The results indicate that GCN-Align-v does not produce valuable visual representations for MMEA. In contrast, even when using only visual information, HMEA-v still achieves better results than PoE-v. With 20% seed alignments, our proposed approach outperforms PoE-v slightly on both datasets, by less than 1% in *Hits*@1. On the FB15K-DB15K dataset with 80% seeds, HMEA-v demonstrates significant improvements, of around 7% in *Hits*@1 and 18% in *Hits*@10. These results indicate that our proposed method is effective in learning visual features and incorporating them into the model to improve the overall performance.

**Table 9.4** Comparison of three methods with *visual information*

The bold figures in the tables represent the best result

#### *9.5.4 Ablation Experiment*

In this work, we consider multiple modalities of information in MMKGs, specifically the structural and visual aspects. To further confirm the usefulness of multi-modal knowledge for MMEA, we carry out an ablation experiment. Comparing HMEA and HMEA-s in Tables 9.2 and 9.3, we observe that incorporating visual information results in slightly better performance, with improvements of approximately 1% in terms of *Hits*@1. Comparing HMEA and HMEA-v in Tables 9.2 and 9.4, we find that the structural information plays a significant role. The ablation study therefore indicates that MMEA relies primarily on the *structural information*, that the *visual information* still plays a useful role, and that combining the two leads to the best results.

#### *9.5.5 Case Study*

A key property of hyperbolic spaces is that they expand exponentially: the volume of a ball grows exponentially with its radius, whereas in Euclidean spaces it grows only polynomially. This property can be advantageous for distinguishing between similar entities, since the neighbor nodes of a central node can be distributed in a larger space, resulting in greater distances between them.
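The difference in growth rates can be made concrete with a small numerical sketch. It compares the circumference of a circle of radius r in the Euclidean plane (2πr) with that in the hyperbolic plane (2π sinh r), assuming curvature −1; the function names are illustrative only.

```python
import math


def euclidean_circumference(r):
    """Circumference of a circle of radius r in the Euclidean plane."""
    return 2.0 * math.pi * r


def hyperbolic_circumference(r):
    """Circumference of a circle of radius r in the hyperbolic plane (curvature -1)."""
    return 2.0 * math.pi * math.sinh(r)


for r in (1, 2, 5, 10):
    ratio = hyperbolic_circumference(r) / euclidean_circumference(r)
    print(f"r={r:>2}  hyperbolic/Euclidean circumference ratio = {ratio:.1f}")
```

For small radii the two values nearly coincide, while for larger radii the hyperbolic circle offers far more room along its boundary, which is the room that neighbor nodes can occupy.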

To demonstrate the effectiveness of hyperbolic embeddings, we conducted a case study using Michael Caine as the root node. We visualized the embeddings of 1-hop film-related entities learned by GCN-Align and HMEA separately, in the PCA-projected spaces shown in Fig. 9.4. We observed that entities of the same type or with similar structural information, such as Alfie and B-o-B, have Euclidean embeddings (generated via GCN-Align) that are placed closely together. In contrast, such entities are placed farther apart in hyperbolic space, with only a few exceptions. This validates that the hyperbolic structural representation can help distinguish between similar entities. Furthermore, by placing similar entities (in the same KG) far apart, the hyperbolic representation can facilitate the alignment process across KGs.
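To make the notion of distance in hyperbolic space concrete, the following sketch computes the geodesic distance on the hyperboloid model. It assumes curvature −1 and the standard formula d(x, y) = arcosh(−⟨x, y⟩_L) with the Minkowski inner product; the helper names and the toy points are hypothetical and do not come from the HMEA code.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v) on the unit hyperboloid."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def hyperboloid_distance(x, y):
    """Geodesic distance arcosh(-<x, y>_L) for curvature -1, where
    <x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn is the Minkowski inner product."""
    minkowski = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    # Clamp to the valid domain of acosh to absorb rounding error.
    return math.acosh(max(1.0, -minkowski))


# Toy points (made up): the origin of the hyperboloid and a second point.
origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
print(hyperboloid_distance(origin, point))
```

Nearest-neighbor comparisons such as the one in the case study can then be carried out by ranking candidate entities by this distance instead of the Euclidean one.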

An example can be seen in Fig. 9.4a, where entity Alfie in FB15K is closest to entity B-o-B, which is incorrect. However, in Fig. 9.4b, entity B-o-B is placed far away from Alfie, and the closest entity to Alfie is its equivalent entity in DB15K. By using hyperbolic projections, similar entities in the same KG are well distinguished and placed far apart, reducing the likelihood of alignment mistakes.

**Fig. 9.4** The embeddings of 1-hop film-related neighbor entities of Michael Caine generated from GCN-Align and HMEA separately in the PCA-projected space. The green points represent entities in FB15K; red points represent entities in DB15K. For simplicity, we annotate only some of the entities. B-o-B is the abbreviation of Battle of Britain. (**a**) Embedding generated from GCN-Align. (**b**) Embedding generated from HMEA

**Table 9.5** Details of the cross-lingual datasets

#### *9.5.6 Additional Experiment*

The cross-lingual EA datasets are the most commonly used datasets for evaluating EA methods. We include experiments on these datasets to demonstrate that our proposed approach is also effective on popular benchmarks, including the cross-lingual EA task. Note that different languages are not treated as multiple modalities; cross-lingual EA is in essence single-modal EA. We use the DBP15K datasets in the experiments, which were built by Sun et al. [24]. As shown in Table 9.5, the datasets were generated from DBpedia, which contains rich inter-language links between different language versions of Wikipedia. Each dataset contains data in different languages and 15,000 known inter-language links connecting equivalent entities in two KGs, which are used for model training and testing. Following the setting in [29], we use 30% of inter-language links for training and 70% of them for testing. *Hits*@*k* is used as the evaluation measure.
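For reference, *Hits*@*k* can be computed as in the minimal sketch below: it is the fraction of test entities whose true counterpart appears among the top-*k* ranked candidates. The function name, the data layout, and the toy entity identifiers are assumptions made for illustration.

```python
def hits_at_k(ranked_candidates, gold, k):
    """Fraction of source entities whose gold counterpart is in the top-k candidates.

    ranked_candidates: dict mapping a source entity to its candidate list,
                       sorted from most to least similar
    gold:              dict mapping a source entity to its true counterpart
    """
    hits = sum(
        1 for src, target in gold.items()
        if target in ranked_candidates[src][:k]
    )
    return hits / len(gold)


# Toy ranking for two test entities (identifiers are made up).
ranked = {
    "e1": ["t1", "t2", "t3"],
    "e2": ["t3", "t1", "t2"],
}
gold = {"e1": "t1", "e2": "t2"}
print(hits_at_k(ranked, gold, k=1), hits_at_k(ranked, gold, k=3))
```

Because the top-10 list contains the top-1 candidate, *Hits*@10 is never lower than *Hits*@1 for the same ranking; the gap between the two indicates how often the correct entity is ranked near, but not at, the top.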

The dimensions of both structural and attribute embeddings were set to 300 for GCN-Align. GCN-Align-s and HMEA-s use only structural information; GCN-Align-a and HMEA-a use only attribute information; and GCN-Align and HMEA combine both structural and attribute information.

Table 9.6 shows that on all datasets, HMEA-s outperforms GCN-Align-s, with improvements of around 7% in terms of *Hits*@1 and more than 10% in terms of *Hits*@10. These results demonstrate that HMEA benefits from hyperbolic geometry and is able to capture better structural features. When structural and attribute information are combined, our proposed approach also achieves better results than GCN-Align, with an approximately 10% increase in *Hits*@1. Regarding attribute information alone, HMEA-a outperforms GCN-Align-a by a significant margin, with an approximately 15% improvement in *Hits*@1 across all datasets.


**Table 9.6** Result in cross-lingual datasets

#### **9.6 Conclusion**

This chapter introduces our proposed approach, HMEA, a multi-modal EA approach designed to efficiently integrate multi-modal information for EA in MMKGs. To achieve this, our approach extends the Euclidean representation to a hyperboloid manifold and employs HGCN to learn structural embeddings of entities. Additionally, we leverage a more advanced model, DenseNet, to learn more accurate visual embeddings. These structural and visual embeddings are then aggregated in the hyperbolic space to predict potential alignments. We validate the effectiveness of our proposed approach through comprehensive experimental evaluations. We also conduct further experiments that confirm the superior performance of HGCN in learning structural features of knowledge graphs in the hyperbolic space.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.