Dirk Burghardt Elena Demidova Daniel A. Keim  *Editors*

# Volunteered Geographic Information

Interpretation, Visualization and Social Context

Volunteered Geographic Information

Dirk Burghardt • Elena Demidova . Daniel A. Keim Editors

# Volunteered Geographic Information

Interpretation, Visualization and Social Context

*Editors*  Dirk Burghardt Cartographic Communication TU Dresden Dresden, Germany

Daniel A. Keim Data Analysis and Visualization University of Konstanz Konstanz, Germany

Elena Demidova Data Science and Intelligent Systems, Computer Science Institute University of Bonn Bonn, Germany

Lamarr Institute for Machine Learning and Artificial Intelligence Bonn, Germany

ISBN 978-3-031-35373-4 ISBN 978-3-031-35374-1 (eBook) https://doi.org/10.1007/978-3-031-35374-1

The publication of this book was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within Priority Research Program 1894 Volunteered Geographic Information: Interpretation, Visualization and Social Computing.

© The Editor(s) (if applicable) and The Author(s) 2024. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.

## **Preface**

Volunteered geographic information (VGI) has emerged as a novel form of usergenerated content in recent years. VGI involves the active generation of geodata, for example, in citizen science projects, and the passive data collection via locationenabled mobile devices. In addition, an increasing number of sensors perceive our environment with ever-greater detail and dynamics. The resulting VGI data can support various applications addressing critical societal challenges, e.g., environment and disaster management, health, transport, and citizen participation.

The interpretation and visualization of VGI are challenging due to the considerable heterogeneity of the underlying multi-source data and the social context. In particular, the utilization of VGI is influenced by the specific characteristics of the underlying data, such as real-time availability, event-driven generation, and subjectivity, all with an implicit or explicit spatial reference. The DFG-funded priority program "VGI: Interpretation, Visualization, and Social Computing"<sup>1</sup> (2016–2022) aims to address these challenges. The results of the research work in this program form the basis for this book publication.

The book includes three parts, "Representation and Analysis of VGI," "Geovisualization and User Interactions Related to VGI," and "Active Participation, Social Context, and Privacy Awareness," representing the principal research pillars within the priority program. The intersection of these three domains offers new research potential. Of particular interest is the link between social behavior and technology during the collective collection, processing, and usage of VGI. This includes, e.g., the consideration of different acquisition and usage contexts, evaluating subjective information, and ensuring privacy-aware data processing.

A total of 30 projects were financed as part of the twice 3-year funding period. Research groups from the fields of geoinformation science, computer science, psychology, and mechanical and safety engineering were involved in the interdisciplinary collaboration. Results of the collaborative research within the priority program have been made available through the VGI repository, the sustainable

<sup>1</sup> https://www.vgiscience.org/.

public platform with access to benchmark data, and VGI-related software tools, documentation, and publications. A focus was placed on cross-project collaboration. The flagship was the Young Research Groups, who used their complementary methodological knowledge to identify and answer cross-project research questions in self-organized projects. The results were published in 25 dissertations and are also included in the publications within this book.

Dresden, Germany Dirk Burghardt Bonn, Germany Elena Demidova Konstanz, Germany Daniel A. Keim

# **Contents**

### **Part I Representation and Analysis of VGI**



# **Part I Representation and Analysis of VGI**

VGI is a critical data source for various real-world applications, including mobility services in smart cities, remote sensing, tracing animal behavior, environmental protection, and disaster management. Popular examples of VGI data sources include crowdsourced maps such as OpenStreetMap (OSM) and GPS trajectory data. However, VGI is highly challenging to represent, access, and analyze.

The challenges associated with VGI representation, access, and analysis can be attributed to the 4 Vs of big data, such as volume, velocity, variety, and veracity. The volume and velocity of community-created geographic data are rapidly expanding due to the growth of volunteer communities, increased availability of open, editable community-created data sources, and easy access to real-time tracking devices. For example, the number of OSM contributors has grown from approximately 5.6 million in August 2019 to 9.6 million in December 2022.1 Increased heterogeneity of data representations and a range of quality issues accompany this growth.

Community-created geographic data comes in various formats with heterogeneous annotations, often lacking clear semantics. Whereas standardized formats and representations for geographic information and semantic annotation exist, the openness of VGI sources leads to increased heterogeneity. Furthermore, the representation and access requirements of different downstream applications vary. In particular, data-driven machine learning applications, such as traffic inference and accident prediction, depend on the availability of large-scale, high-quality data, and machine-readable annotations to build predictive models effectively and efficiently.

In this part, we discuss recent approaches to enhance the representation and analysis of VGI. In particular, this includes semantic representation of VGI data in knowledge graphs, machine-learning approaches to VGI mining, completion, and enrichment, as well as to the improvement of data quality and fitness for purpose. Furthermore, we discuss new approaches for more efficient analytics of VGI images.

<sup>1</sup> https://planet.openstreetmap.org/statistics/data\_stats.html.

# **Chapter 1 WorldKG: World-Scale Completion of Geographic Information**

**Alishiba Dsouza, Nicolas Tempelmeier, Simon Gottschalk, Ran Yu, and Elena Demidova** 

**Abstract** Knowledge graphs provide standardized machine-readable representations of real-world entities and their relations. However, the coverage of geographic entities in popular general-purpose knowledge graphs, such as Wikidata and DBpedia, is limited. An essential source of the openly available information regarding geographic entities is OpenStreetMap (OSM). In contrast to knowledge graphs, OSM lacks a clear semantic representation of the rich geographic information it contains. The generation of semantic representations of OSM entities and their interlinking with knowledge graphs are inherently challenging due to OSM's large, heterogeneous, ambiguous, and flat schema and annotation sparsity. This chapter discusses recent knowledge graph completion methods for geographic data, comprising entity linking and schema inference for geographic entities, to provide semantic geographic information in knowledge graphs. Furthermore, we present the WorldKG knowledge graph, lifting OSM entities into a semantic representation.

**Keywords** Geographic knowledge graphs · WorldKG

A. Dsouza (O) · R. Yu

N. Tempelmeier · S. Gottschalk L3S Research Center, University of Hannover, Hannover, Germany

e-mail: tempelmeier@L3S.de; gottschalk@L3S.de

E. Demidova Data Science & Intelligent Systems Group (DSIS), University of Bonn, Bonn, Germany

Data Science & Intelligent Systems Group (DSIS), University of Bonn, Bonn, Germany e-mail: dsouza@cs.uni-bonn.de; ran.yu@uni-bonn.de

Lamarr Institute for Machine Learning and Artificial Intelligence, Bonn, Germany e-mail: elena.demidova@cs.uni-bonn.de www.lamarr-institute.org

### **1.1 Introduction**

Geographic information is of crucial importance for a variety of real-world applications, including accident prediction (Dadwal et al. 2021), detection of topological dependencies in road networks (Tempelmeier et al. 2021a), and positioning charging stations (von Wahl et al. 2022). Such applications can substantially profit from standardized machine-readable representations of geographic entities, including monuments, roads, and charging stations. Particularly, such semantic representations should comprise detailed descriptions of geographic entities, including their types, properties, context, relations, and interlinking across sources.

OpenStreetMap (OSM)1 is a critical source of volunteered and openly available geographic information. OSM provides rich but highly heterogeneous data regarding geographic entities, including fine-grained coordinates of real-world locations and user-defined tags comprising entity types, properties, and relations. At the time of writing, OSM contains over 6.8 billion entities from 188 countries.2 However, the adoption of OSM data in real-world applications is limited, mainly due to the large, heterogeneous, ambiguous, and flat schema adopted for the OSM tags.

Knowledge graphs (Hogan et al. 2021)—graph-based representations of realworld entities and their relations—provide detailed machine-readable descriptions of real-world entities through ontologies and facilitate interlinking across sources. Information representation in knowledge graphs is based on W3C standards such as the Resource Description Framework (RDF)3 and established ontologies. This representation facilitates structured semantic access via standardized query languages, such as SPARQL.4 Although popular general-purpose knowledge graphs such as Wikidata and DBpedia (Auer et al. 2007) contain a number of geographic entities, only a tiny fraction of them include precise location information. Furthermore, whereas some community-defined links between OSM entities and knowledge graphs exist at the instance level, these links are sparse and cover only selected entity types. For example, as of September 2022, only 0*.*52% of OSM nodes provided links to the Wikidata knowledge graph. In this setting, knowledge graph completion, such as interlinking knowledge graphs and geographic information sources at the entity and schema levels, is inherently challenging due to the representation heterogeneity in OSM and the sparsity of geographic information in popular knowledge graphs.

Table 1.1 illustrates a geographic entity, Cairo, the capital of Egypt, and its representations in OSM and the Wikidata knowledge graph,5 OSM provides

<sup>1</sup> OpenStreetMap, OSM, and the OpenStreetMap magnifying glass logo are trademarks of the OpenStreetMap Foundation and are used with their permission. We are not endorsed by or affiliated with the OpenStreetMap Foundation.

<sup>2</sup> OSMstats: https://osmstats.neis-one.org.

<sup>3</sup> Resource Description Framework: https://www.w3.org/RDF/.

<sup>4</sup> SPARQL 1.1 Query Language: https://www.w3.org/TR/sparql11-query/.

<sup>5</sup>wd and wtd are the prefixes of http://www.wikidata.org/entity/ and http://www.wikidata.org/ prop/direct/, respectively.


**Table 1.1** Representation of Cairo in OpenStreetMap and Wikidata

information as heterogeneous key-value pairs called "tags." In this example, OSM encodes the entity type information as <place*,* city>, whereas the precise semantics of the tags often remain unclear. In contrast, entities in Wikidata are represented via well-defined statements, also known as RDF triples. A triple has the form <Subject, Predicate, Object> and enables the representation of entity types, properties, and relations. In Wikidata, an entity type is expressed using the instance of property, denoted as wdt:P31. 6 In this example, this property connects a unique entity identifier wd:Q85, representing Cairo, to the entity type city, denoted as wd: Q515. 7

In this chapter, we discuss recent methods aiming to bridge the gap between OSM and knowledge graphs through semantically enriching geospatial information in OSM and making this information available in WORLDKG—a novel geographic knowledge graph. In particular, we develop methods for knowledge graph completion, to establish links between OSM and knowledge graphs at the entity and schema levels. *Geographic entity linking* discussed in this chapter aims at interlinking the representations of entities in OSM and Wikidata (Cairo in this example). *Geographic class alignment* aims to link the OSM tags that provide entity type information to the corresponding knowledge graph classes. In this example, the OSM tag <place*,* city> should be aligned to the Wikidata class wd:Q515 (city).

Existing schema alignment and entity linking methods are not directly applicable to geographic data sources such as OSM due to structural differences between OSM and knowledge graphs (Otero-Cerdeira et al. 2015). Generic schema matching methods typically rely on name and schema structure similarities (Madhavan et al. 2001). Other approaches, such as LIMES (Ngomo and Auer 2011), have strict heuristics and consider fixed schemas. Geographic entity linking approaches such as LinkedGeoData (Auer et al. 2009) rely on manually aligned schemas and create links using type information, spatial distance, and name similarity. These approaches often fail due to representation differences, toponym ambiguities, OSM schema flatness, and geographic coordinate variations across sources. Thus, new approaches are required to lift OSM's flat and heterogeneous geographic information into a precise, machine-readable semantic representation.

<sup>6</sup> Definition of the instance of Wikidata property: https://www.wikidata.org/wiki/Property: P31.

<sup>7</sup> Definition of the city (Q515) in Wikidata: https://www.wikidata.org/wiki/Q515.

**Fig. 1.1** Overview of the pipeline for creating WORLDKG from OSM, consisting of three main steps: (i) geographic entity linking with OSM2KG, (ii) geographic class alignment with NCA, and (iii) WORLDKG geographic knowledge graph creation

In the remainder of this chapter, we first formally define the problem of geographic entity linking and class alignment in Sect. 1.2. Then, we discuss approaches we recently proposed to interlink OSM and knowledge graphs at the entity and schema levels. These approaches are illustrated in Fig. 1.1. In Sect. 1.3, we present OSM2KG—a geographic entity linking approach (Tempelmeier and Demidova 2021) depicted in the upper part of Fig. 1.1. Following that, in Sect. 1.4, we discuss NCA—a neural approach for geographic class alignment between OSM and knowledge graphs that utilizes entity links (Dsouza et al. 2021a). NCA is illustrated in the lower part of Fig. 1.1. Then, in Sect. 1.5, we describe WORLDKG a novel geographic knowledge graph (Dsouza et al. 2021b) that adopts OSM2KG and NCA to provide semantic representations of OSM entities. Finally, in Sect. 1.6, we discuss open research directions and provide a conclusion.

### **1.2 Problem Definition**

Linking geographic data sources and knowledge graphs at the entity and the schema level can help create a comprehensive source of geographic information, i.e., a geographic knowledge graph. First, we define geographic data sources and knowledge graphs based on the definitions by Tempelmeier and Demidova (2021).

A geographic data source represents geographic entities. Each geographic entity is annotated with an identifier, a location, and a set of key-value pairs called tags. More formally:

**Definition 1.1** A *geographic data source* G = *(N , T )* consists of a set of geographic entities *N* and a set of tags *T* . Each tag *t* ∈ *T* is represented as a keyvalue pair *t* = <*k, v*>. Each node *n* ∈ *N* represents a real-world geographic entity with a geolocation and a set of tags *Tn* ⊂ *T* .

A typical example of a geographic data source is OpenStreetMap. Examples of the tags assigned to the node representing Cairo are illustrated in Table 1.1.A geolocation can be represented as a coordinate pair (i.e., latitude and longitude) or a sequence of coordinate pairs (e.g., forming a polygon).

A knowledge graph is a semantic information source containing data regarding real-world entities, their classes, and their properties. Typical examples of popular general-purpose knowledge graphs are Wikidata and DBpedia, which cover an extensive set of real-world entities and their relations in various application domains. Table 1.1 illustrates selected properties of the Wikidata entity representing Cairo.

**Definition 1.2** A *knowledge graph* KG = *(E, C, P , L, Lgeo,F)* consists of a set of entities *E*, a set of classes *C* ⊂ *E*, a set of properties *P*, a set of literals *L*, a set of geolocations *Lgeo*, and a set of relations *F* ⊆ *E* × *P* × *(E* ∪ *L* ∪ *Lgeo)*.

An entity in a knowledge graph can represent a real-world entity or a class of entities. Literals represent values including strings, numbers, and dates. An entity *e* ∈ *E* can be related to one or multiple geolocations representing different geometries, such as a point or a polygon.

Geographic entity linking aims to align entities from a geographic data source and a knowledge graph representing the same real-world entity.

**Definition 1.3** Given a geographic data source G = *(N , T )* and a knowledge graph KG = *(E, C, P , L, Lgeo,F)*, the problem of *geographic entity linking* is to identify a node *n* ∈ *N* and an entity *e* ∈ *E* representing the same real-world object.

In the example illustrated in Table 1.1, Cairo from OSM and Cairo from Wikidata represent the same real-world entity. In RDF, this link is typically denoted using the owl:sameAs property.

Schema alignment refers to the interlinking of equivalent schema elements across sources. In the context of this work, we focus on geographic class alignment, which refers to the interlinking of tags of a geographic data source and classes of a knowledge graph representing the same semantic concept. The key or the value of the tag alone is not sufficient to describe the class. For example, tag <natural*,* peak> describes the concept *Mountain*. Considering only the key or the value here may not align to the correct class *Mountain* of the knowledge graphs. Hence, we align the tag, i.e., key=value, to the knowledge graph classes.

**Definition 1.4** Given a geographic data source G = *(N , T )* and a knowledge graph KG = *(E, C, P , L, Lgeo,F)*, the problem of *geographic class alignment* is to identify a tag *t* ∈ *T* and a class *c* ∈ *C* representing the same semantic concept.

For example, the OSM tag <place*,* city> and the Wikidata class wd:Q515 illustrated in Table 1.1 represent the same semantic concept.

In the context of this work, a *geographic knowledge graph* refers to a knowledge graph that provides geolocation information for a substantial fraction of the entities it contains. This work aims to create a geographic knowledge graph through geographic entity linking and class alignment.

### **1.3 Geographic Entity Linking with OSM2KG**

Geographic entity linking refers to the task of interlinking geographic entities representing the same real-world entity across data sources (Definition 1.3). Typically, entity linking approaches utilize semantic and syntactic similarity of the different entity representations. In OSM, geographic entity linking is particularly challenging due to the large scale and the ambiguities of location names. For example, "Berlin" can denote the name of the capital of Germany and a restaurant name. Also, the names of geographic entities, such as "Church Road," are often non-distinctive.

### *1.3.1 Related Work*

State-of-the-art entity linking approaches such as LIMES (Ngomo and Auer 2011) and WOMBAT (Sherif et al. 2017) assume that entities are represented through the same number of properties with a 1:1 property mapping. In the case of linking OSM nodes with the knowledge graph entities, these assumptions do not hold. Entity linking methods such as DBpedia Spotlight (Daiber et al. 2013) detect links between textual data and knowledge graphs. LinkedGeoData (Stadler et al. 2012) performs geographic entity linking by creating links between OSM and knowledge graphs such as DBpedia and GeoNames. However, the linking with LinkedGeoData relies on manually aligned schema, syntactic similarity, and spatial distance. To overcome the shortcomings of current approaches, we proposed OSM2KG (Tempelmeier and Demidova 2021), a machine learning algorithm for geographic entity linking based on representation learning of OSM tags.

### *1.3.2 The* **OSM2KG** *Approach*

The overall geographic entity linking process of OSM2KG is illustrated in the upper part of Fig. 1.1. First, OSM2KG adopts geographic blocking to reduce the number of potential candidate entities given an OSM node (*candidate set generation*). Then, OSM2KG creates latent representations of OSM tags (*tag embeddings*) and extracts features of the candidate entities (*feature extraction*). Finally, OSM2KG predicts if a node-entity pair represents the same real-world entity (*link classification*). In the following, we present these steps in more detail.

**Candidate Set Generation** Given a node *n* ∈ *N* of the geographic data source G = *(N , T )*, the goal of the candidate set generation step is to identify potentially matching entities in the knowledge graph KG. The geographic coordinates in OSM and knowledge graphs represent the points of community consensus, rather than an objective metric (Auer et al. 2009). Consequently, the coordinates of geographic entities represented in these sources can deviate. The candidate generation step is based on the intuition that entities and nodes representing the same real-world entities should be located in geographic proximity. Thus, for a given input node *n*, OSM2KG creates the candidate set by considering all entities within an experimentally determined distance threshold. A spatial index such as R-Tree (Guttman 1984) can be utilized to enable efficient geographic blocking.

**Tag Embeddings** The set of tags *Tn* assigned to an OSM node *n* plays an essential role in detecting the correct matching candidate. As OSM tags are highly heterogeneous, OSM2KG aims at learning their unsupervised latent representations. OSM2KG utilizes a skip-gram-based neural network model (Mikolov et al. 2013). This representation learns the co-occurrences of OSM tags to capture the semantic similarity between OSM nodes. The embedding model is trained in an unsupervised manner based on the tag similarity of geographic entities, meaning geographic entities with similar tags are represented in a closer space. The resulting embeddings can be used to estimate the semantic similarity of the OSM nodes. Geographic coordinates are not considered in this step, such that the embedding reflects semantic similarity independent of the geolocation.

**Feature Extraction** For each candidate entity from KG in the candidate set, we extract additional features, namely, the entity type and its popularity, as reflected by the number of incoming edges in the knowledge graph.

In addition to the tag embeddings and the entity features, the Jaro-Winkler distance (Winkler 1999) is calculated between the names of the OSM node and the candidate to measure their name similarity. Furthermore, the logistic distance proposed by Stadler et al. (2012) is used to compute the geographic distance between the OSM node and the candidate entity.

**Link Classification** Finally, a random forest classification model is utilized to classify whether the input node in G represents the same real-world entity as the candidate entity in KG. To train the model, as positive examples, OSM2KG takes node-entity pairs from the existing links between G and KG.

### *1.3.3 Evaluation Results of the* **OSM2KG** *Approach*

We evaluated OSM2KG (Tempelmeier and Demidova 2021) regarding its interlinking performance. In particular, we considered the interlinking of OSM entities with Wikidata and DBpedia knowledge graphs in Germany, France, and Italy. This evaluation demonstrated a substantial F1-score improvement achieved by OSM2KG compared to eight different baseline approaches. OSM2KG performed best on all Wikidata datasets, achieving an F1-score of 92*.*05% on average and outperforming the best-performing baseline by 21*.*82 percentage points. OSM2KG also achieved the best recall performance and high precision on all datasets.

As a result of OSM2KG, we can infer new links between the OSM nodes and geographic entities in knowledge graphs. Such links can be beneficial for creating and enriching semantic sources, as they can provide complementary information regarding the linked geographic entities. These linked entities can also serve as additional training data to develop supervised methods for geographic schema alignment.

### **1.4 Geographic Class Alignment with NCA**

Geographic class alignment between a geographic data source G = *(*N*,*T*)* and a knowledge graph KG = *(E, C, P , L, Lgeo,F)* aims to align the tags and the classes representing the same real-world concepts (Definition 1.4).

The heterogeneous tag-based OSM structure created by volunteers makes it challenging to identify the tags that can be linked to knowledge graph ontologies. For example, the OSM tag <*natural, peak*> corresponds to the "mountain" class in the Wikidata knowledge graph. This match cannot be easily identified using the existing approaches based on syntactic and structural similarity.

### *1.4.1 Related Work*

Ontology alignment methods typically rely on structural and element-level similarity to align schema elements (Otero-Cerdeira et al. 2015). As the OSM schema is flat, approaches that depend on the structural hierarchy (Melnik et al. 2002) do not perform well. Schema alignment methods that depend on the element-level syntactic similarity (Madhavan et al. 2001) do not work well either, due to the essential differences in the syntactic representation of OSM tags and knowledge

graph classes. Instance-based alignment approaches (Ngo et al. 2013) rely on the structural similarity of neighboring instances to align schema elements. Machine learning (Doan et al. 2004) and deep learning-based approaches (Bento et al. 2020; Xiang et al. 2015) also rely on the structure. Furthermore, tabular data alignment methods (Cappuzzo et al. 2020) cannot appropriately handle sparse OSM tag annotations. Overall, the lack of a well-defined OSM ontology and the essential differences in the structural as well as syntactic representation of OSM tags and knowledge graph ontologies, along with the sparsity of OSM annotations, hinder the application of state-of-the-art ontology and schema alignment approaches. To overcome these limitations, we proposed NCA, a neural class alignment approach that utilizes existing entity links between geographic data sources and knowledge graphs in a novel shared latent space.

### *1.4.2 The* **NCA** *Approach*

At the bottom of Fig. 1.1, we briefly illustrate the building blocks of the class alignment NCA approach. In the first step, NCA aims to create a shared latent space that aligns the feature spaces of G and KG. To this extent, NCA creates an auxiliary neural classification model. This model captures the semantic relations between the OSM tags and the semantic classes in the shared latent space. In the second step, NCA probes the auxiliary model to obtain the tag-to-class alignments between the OSM tags and the knowledge graph classes.

**Auxiliary Classification** The goal of the first NCA step is to create a shared latent space containing similar latent representations of geographic entities in G and KG. To achieve this aim, NCA creates an auxiliary classification model. This model is trained to classify linked entities from G and KG into the corresponding semantic classes *c* ∈ *C* of KG. During supervised training, the auxiliary classification model adopts known pairs of linked geographic entities from G and KG as a training set. For G, tags and keys having more than 50 occurrences8 in OSM are selected as features. For KG, top-25 properties of each class are used as features. These features are passed through the fully connected layers to form the shared latent space of the model that aligns the representations of OSM and KG entities. The intuition behind the shared latent space is that linked entities from OSM and KG that belong to the same semantic class will be represented similarly. To create the shared latent space, NCA adopts an adversarial classifier that exploits linked entities of G and KG. This classifier aims to distinguish between G and KG entities in the latent space. NCA aims to make their representations similar by inverting the gradient of the adversarial loss. In this way, as a result of the training, the feature spaces of G and KG are aligned.

<sup>8</sup> https://taginfo.openstreetmap.org/tags.

**Tag-to-Class Alignment** The training of the auxiliary classification model results in a shared latent space. NCA then probes the model with one OSM tag at a time and computes the complete forward pass. NCA selects the results of the classification layer to obtain the tag-to-class alignment. As one OSM tag can be matched to multiple classes in the knowledge graph, NCA selects all classes whose confidences exceed an experimentally determined threshold value.

### *1.4.3 Evaluation Results of the* **NCA** *Approach*

We evaluated the NCA approach on OSM as the geographic data source and Wikidata and DBpedia as knowledge graphs (Dsouza et al. 2021a). The NCA performance was compared to six state-of-the-art ontology and tabular data alignment methods. The evaluation was conducted on a dataset with seven countries having the most data available in OSM, namely, Germany, France, Great Britain, Spain, Russia, the USA, and Australia. In terms of tag-to-class alignment, NCA obtained up to 13 and up to 37 percentage point improvement of the F1-score on Wikidata and DBpedia, respectively. On average, we observed 10 (21) percentage point F1-score improvement on Wikidata (DBpedia). As a result, the NCA approach increased the number of OSM entities with semantic class annotations from Wikidata and DBpedia knowledge graphs by over 400%. The resulting tag-to-class annotations are available as part of the WORLDKG knowledge graph presented in the next section.

### **1.5 The WORLDKG Knowledge Graph**

WORLDKG is a geographic knowledge graph that provides semantic information on geographic entities extracted from OSM. While OSM contains rich data regarding such geographic entities, this data is not directly accessible to semantic applications. With WORLDKG, we tackle this problem and provide a geographic knowledge graph. WORLDKG follows the ontology illustrated in Fig. 1.2 and is available online.<sup>9</sup>

### *1.5.1 Related Work*

Geographic knowledge graphs such as LinkedGeoData (Auer et al. 2009) and YAGO2geo (Karalis et al. 2019) either contain only a few geographic classes

<sup>9</sup> WORLDKG: https://www.worldkg.org/.

**Fig. 1.2** The WORLDKG ontology

or represent data of a restricted geographic area. Specialized geographic knowledge graphs such as the KnowWhereGraph (Janowicz et al. 2022) and EventKG (Gottschalk and Demidova 2019) concentrate on past events and have a limited location coverage. In contrast, WORLDKG is based on OSM and contains over 100 million geographic entities on a world scale typed with over 1,000 semantic classes.

### *1.5.2* **WORLDKG** *Creation Approach*

WORLDKG captures geographic entities in OSM and contains links to Wikidata and DBpedia at the entity and class levels. The creation procedure of WORLDKG includes two main tasks depicted in Fig. 1.1, namely, ontology creation and triple creation.

**Ontology Creation** To infer a class hierarchy from OSM tags, we utilize OSM *map features*10 —a list of established key-value pairs. We extract classes (keys) and their subclasses (values) from the map features. For example, from the map feature <*place, city*>, we infer the class "Place" and its subclass "City." All remaining keys not covered by the map features are considered properties.

We convert the names of the extracted properties and classes according to the OWL naming conventions.11 We also incorporate the tag-to-class alignment inferred using the NCA approach (Sect. 1.4) into the WORLDKG ontology.

<sup>10</sup> https://wiki.openstreetmap.org/wiki/Map\_features.

<sup>11</sup> https://www.w3.org/TR/owl-ref/.

**Fig. 1.3** An example of a city and its classes in WORLDKG

The WORLDKG ontology is depicted in Fig. 1.2. Any object in WORLDKG can be connected to a geolocation (geo:SpatialObject—either a point, a line string, or a polygon) via wkgs:spatialObject. Other relations are represented using properties typed as wkgs:WKGProperty. Information regarding the original OSM tags is provided by dcterms:source.

An example of a WORLDKG entity of the class "City" is illustrated in Fig. 1.3. Via the property wkgs:spatialObject, the entity is connected to its geolocation, which provides a coordinate pair. The "City" class is connected to its equivalents in DBpedia and Wikidata.

**Triple Creation** We add all OSM nodes that have at least one tag and belong to at least one class of the WorldKG ontology to the WORLDKG knowledge graph. To this extent, we create triples that represent the nodes and their properties and adhere to the WorldKG ontology.

### *1.5.3* **WORLDKG** *Access, Statistics, Evaluation, and Examples*

**Access** WORLDKG offers a GeoSPARQL endpoint.12 This endpoint supports queries in GeoSPARQL13 —a geographic query language for RDF data—and visualizes geolocations of the query results on a map.

<sup>12</sup> WORLDKG GeoSPARQL endpoint: https://www.worldkg.org/sparql.

<sup>13</sup> GeoSPARQL: https://www.ogc.org/standards/geosparql.

**Statistics** As of September 2022, WORLDKG contains over 800 million triples describing approximately a 100 million entities that belong to over 1*,* 000 distinct classes. The number of unique properties (wgks:WKGProperty) in WORLDKG is over 1*,* 800. As a result of the NCA approach presented in Sect. 1.4, WORLDKG provides links to 40 Wikidata and 21 DBpedia classes.

**Evaluation** We evaluated the quality of WORLDKG by assessing the type assertions of the geographic entities (Dsouza et al. 2021b). From Wikidata and DBpedia, we randomly selected five classes each, aligned with the WORLDKG ontology. Per each of these classes, we randomly selected 100 example geographic entities in WORLDKG and manually checked if they belong to the assigned knowledge graph class. We observed that WORLDKG achieved over 97% accuracy on average.

**Examples** Listing 1.1 illustrates the representation of a WORLDKG entity of type wkgs:Restaurant in the Turtle format. Listing 1.2 is an example query that makes use of the GeoSPARQL function bif:st\_distance to extract three restaurants closest to the Dresden Central Station.14 The query results are shown in Table 1.2 and Fig. 1.4. This example illustrates the potential of using WORLDKG in downstream applications such as POI recommendation.

```
wkg:4182951095 a wkgs:Restaurant ; 
    rdfs:label "Restaurant Quarre" ; 
    wkgs:addrHousenumber "77" ; 
    wkgs:addrPostcode "10117" ; 
    wkgs:capacity "200" ; 
    wkgs:cuisine "regional" ; 
    wkgs:delivery "no" ; 
    wkgs:openingHours "12:00-15:00,18:00-22:30" ; 
    wkgs:osmLink osmn:4182951095 ; 
    wkgs:phone "+493022611959" ; 
    wkgs:takeaway "no" ; 
    wkgs:website "http://restaurant-quarre.de" ; 
    wkgs:spatialObject wkg:geo4182951095 . 
wkg:geo4182951095 a sf:Point; 
    geo:asWKT "POINT(13.3798808 52.5160887)"^^geo:wktLiteral .
```
**Listing 1.1** RDF triples in the Turtle format for an example geographic entity of type wkgs: Restaurant in WORLDKG

<sup>14</sup> The geographic location of the Dresden Central Station is taken from OSM.

```
SELECT ?restaurant 
       (bif:st_distance(?cWKT, ?fWKT) AS ?Distance) 
WHERE { 
   ?poi rdfs:label "Dresden Hbf". 
   ?poi rdf:type wkgs:RailwayStation . 
   ?poi wkgs:spatialObject [ 
      geo:asWKT ?cWKT 
   ]. 
   ?closeObject rdf:type wkgs:Restaurant . 
   ?closeObject rdfs:label ?restaurant . 
   ?closeObject wkgs:spatialObject ?fGeom . 
   ?fGeom geo:asWKT ?fWKT . 
} 
ORDER BY ASC(bif:st_distance(?cWKT, ?fWKT)) 
LIMIT 3
```
**Listing 1.2** Example GeoSPARQL query to retrieve three restaurants closest to the Dresden Central Station (Dresden Hbf)


**Fig. 1.4** Visualization of three restaurants closest to the Dresden Central Station returned for the query in Listing 1.2

### **1.6 Discussion and Open Research Directions**

In this chapter, we presented WORLDKG—a geographic knowledge graph that we developed to provide a semantic representation of geographic entities in OSM. Furthermore, we described OSM2KG and NCA, novel methods for geographic entity linking and class alignment. These methods enable interlinking geographic entities in OpenStreetMap with other semantic sources of geographic information at the entity and schema levels. Our proposed approaches outperformed state-of-theart methods when applied to OSM and popular general-purpose knowledge graphs, Wikidata and DBpedia. We made WORLDKG publicly available.

WORLDKG is a comprehensive source of semantic geographic information in its current form; it also opens many directions for future research.

A critical aspect of the knowledge graph creation from volunteered geographic information is data quality. As OSM data builds a basis for the knowledge graph creation, data quality issues of OSM can be propagated into WORLDKG. In WORLDKG, we rely on the existing links between OSM nodes and knowledge graphs as a quality signal. Moreover, to enhance the quality of OSM data, we developed OVID (Tempelmeier and Demidova 2022)—a novel method to detect vandalism in OpenStreetMap automatically. Quality aspects of OSM are also considered in Chap. 2. In future work, we would like to investigate further methods to enhance data quality in OSM and WORLDKG. WORLDKG can also potentially be used for visual reporting solutions discussed in Chap. 7.

To make OSM data more easily accessible to machine learning algorithms, we developed GeoVectors—a reusable openly available dataset of OSM embeddings (Tempelmeier et al. 2021b). GeoVectors approach extends the OSM node embedding algorithms presented in Sect. 1.3 and encodes semantic and geographic similarity of OSM nodes. In future work, we would like to leverage WORLDKG and GeoVectors to provide semantic geographic information for machine learning applications.

**Acknowledgments** This research was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—WORLDKG, 424985896.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Analyzing and Improving the Quality and Fitness for Purpose of OpenStreetMap as Labels in Remote Sensing Applications**

**Moritz Schott, Adina Zell, Sven Lautenbach, Gencer Sumbul, Michael Schultz, Alexander Zipf, and Begüm Demir** 

**Abstract** OpenStreetMap (OSM) is a well-known example of volunteered geographic information. It has evolved to one of the most used geographic databases. As data quality of OSM is heterogeneous both in space and across different thematic domains, data quality assessment is of high importance for potential users of OSM data. As use cases differ with respect to their requirements, it is not data quality per se that is of interest for the user but fitness for purpose. We investigate the fitness for purpose of OSM to derive land-use and land-cover labels for remote sensing-based classification models. Therefore, we evaluated OSM land-use and land-cover information by two approaches: (1) assessment of OSM fitness for purpose for samples in relation to intrinsic data quality indicators at the scale of individual OSM objects and (2) assessment of OSM-derived multi-labels at the scale of remote sensing patches (1*.*22×1*.*22 km) in combination with deep learning approaches. The first approach was applied to 1000 randomly selected relevant OSM objects. The quality score for each OSM object in the samples was combined with a large set of intrinsic quality indicators (such as the experience of the mapper, the number of mappers in a region, and the number of edits made to the object) and auxiliary information about the location of the OSM object (such as the continent or the ecozone). Intrinsic indicators were derived by a newly developed tool

M. Schott (-) · A. Zipf

A. Zell · G. Sumbul · B. Demir

e-mail: adina.zell@campus.tu-berlin.de; gencer.suembuel@tu-berlin.de; demir@tu-berlin.de

S. Lautenbach

### M. Schultz Institute of Geography, University of Tübingen, Tübingen, Germany e-mail: michael.schultz@uni-tuebingen.de

Institute of Geography, GIScience, Heidelberg University, Heidelberg, Germany e-mail: moritz.schott@uni-heidelberg.de; zipf@uni-heidelberg.de

Faculty of Electrical Engineering and Computer Science, Remote Sensing Image Analysis Group, Technische Universitát Berlin, Berlin, Germany

HeiGIT gGmbH at Heidelberg University, Heidelberg, Germany e-mail: sven.lautenbach@heigit.org

based on the OSHDB (OpenStreetMap History DataBase). Afterward, supervised and unsupervised shallow learning approaches were used to identify relationships between the indicators and the quality score. Overall, investigated OSM land-use objects were of high quality: both geometry and attribute information were mostly accurate. However, areas without any land-use information in OSM existed even in well-mapped areas such as Germany. The regression analysis at the level of the individual OSM objects revealed associations between intrinsic indicators, but also a strong variability. Even if more experienced mappers tend to produce higher quality and objects which underwent multiple edits tend to be of higher quality, an inexperienced mapper might map a perfect land-use polygon. This result indicates that it is hard to predict data quality of individual land-use objects purely on intrinsic data quality indicators. The second approach employed a label-noise robust deep learning method on remote sensing data with OSM labels. As the quality of the OSM labels was manually assessed beforehand, it was possible to control the amount of noise in the dataset during the experiment. The addition of artificial noise allowed for an even more fine-grained analysis on the effect of noise on prediction quality. The noise-tolerant deep learning method was capable to identify correct multi-labels even for situations with significant levels of noise added. The method was also used to identify areas where input labels were likely wrong. Thereby, it is possible to provide feedback to the OSM community as areas of concern can be flagged.

**Keywords** Volunteered geographic information · Data quality · Data analysis · Remote sensing · OpenStreetMap · Machine learning

### **2.1 Introduction**

OpenStreetMap (OSM) has evolved to one of the most used geographic databases and is a prototype for volunteered geographic information (VGI). It is a major knowledge source for researchers, professionals, and the general public to answer geographically related questions. As a free and open community project, the OSM database can not only be edited but also used by any person with very limited restrictions such as internet access or usage citation. This open nature of the project enabled the establishment of a vibrant community that curates and maintains the projects' data and infrastructure, but also a growing ecosystem of tools that use or analyze the data (OpenStreetMap Contributors 2022a,b).

Recently, OSM has become a popular source of labeled data for the remote sensing community. However, spatial heterogeneous data quality provides challenges for the training of machine learning models. Frequently, OSM land-use and landcover (LULC) data has thereby been taken at face value without critical reflection. And, while the quality and fitness for purpose of OSM data have been proven in many cases (e.g., Jokar Arsanjani et al. 2015; Fonte et al. 2015), these analyses have also unveiled quality variations, e.g., between rural and urban regions. The quality of OSM can thus be assumed to be generally high, but remains unknown for a specific use case. It is therefore of importance to develop both tools that are capable of quantifying data quality of LULC information in OSM and approaches that are capable of dealing with the noise potentially present in OSM.

The IDEAL-VGI project investigated the fitness for purpose of OSM to derive LULC labels for remote sensing-based classification models by two approaches: (1) assessment of OSM fitness for purpose for samples in relation to intrinsic data quality indicators at the scale of individual OSM objects and (2) assessment of OSM-derived multi-labels at the scale of remote sensing patches (1*.*22 × 1*.*22 km) in combination with deep learning methods.

### **2.2 Intrinsic Data Quality Analysis for OSM LULC Objects**

One of the most prominent analysis topics in OSM-related research is data quality that has been covered in theory (see, e.g., Barron et al. 2014; Senaratne et al. 2017) as well as in many practical studies (e.g., Jokar Arsanjani et al. 2015; Brückner et al. 2021). The topic of data quality is of concern for many studies working with volunteered geographic information—Chap. 1, for example, deals with data quality in OSM and Wikidata. Senaratne et al. (2017) characterize analyses into extrinsic metrics, where OSM is compared to another dataset, and intrinsic indicators, where metrics are calculated from the data itself. Semi-intrinsic (or semi-extrinsic) metrics use auxiliary information to assess the quality of OSM population density can, for example, be used to assess the completeness of buildings in OSM, as population density and number of buildings are related. The quality gold standard has frequently been defined for extrinsic metrics through an external dataset of higher or known quality and standards. However, external datasets of high quality—including high up-to-dateness—are frequently not available. Therefore, intrinsic data quality indicators have frequently been used (Barron et al. 2014). These try to capture data quality aspects based on the history of OSM data itself, such as the number of edits to an object. Although OSM objects can be viewed individually, they are always embedded in a larger context of surrounding OSM objects, communities of contributors, and other classification systems, such as biomes or socioeconomic factors. Comparing contributions and communities for selected cities, (Neis et al. 2013), e.g., found a positive correlation between contributor density and gross national product per capita and showed that community sizes vary between Europe and other regions. In 2021, Schott et al. (2021) described "digital" and "physical locations" in which an OSM object is located. These "locations" consist of, intrinsic, OSM-specific measures such as density and diversity of elements, but also include—semi-intrinsic—aspects of economic status, culture, and population density to describe the surrounding of an object. Such information provides potentially relevant information to help characterize and predict data quality of OSM objects.

LULC information in OSM is a challenging topic. On the one hand, this information provides the background for all other data rendered on the central map. It can highly benefit from local input, survey mapping, and (live) updates. On the other hand, this information has a difficult position within the OSM ecosystem. While the routing and the building of communities are prominent, LULC is not so frequently mentioned in the ecosystems' communication platforms. LULC information can also be quite cumbersome or even difficult to map, e.g., due to natural ambiguity. The growing tagging scheme provides a collection of sometimes ambiguous or overlapping tag definitions that are not fully compatible with any official LULC legend definition (Fonte et al. 2016). Furthermore, the data is highly shaped by national preferences and imports.

### *2.2.1 OSM Element Vectorization: Intrinsic and Semi-intrinsic Data Quality Indicators*

The OSM element vectorization tool (OEV, Schott et al. 2022) has been developed to ease access to intrinsic and semi-intrinsic indicators, with a specific focus on LULC feature classes. The tool1 provides access to currently 31 indicators at the level of single OSM objects (c.f. Table 2.1), which cover aspects concerning the element itself, surrounding objects, and the editors of the object.

The usability of the tool was proven on the use case of LULC polygons. One thousand out of the globally existing 62.9 million LULC elements were randomly sampled on 2022-01-01. Only polygonal objects with at least one of the LULC defining tags were considered. These elements' IDs were then fed to the tool to extract the data and calculate the described metrics from Table 2.1. These metrics were used in a cluster analysis to identify structures in the OSM LULC objects. Furthermore, we tested three hypotheses on the triangular relation between the size of OSM objects, their age, and their location in terms of population density. We hypothesized that a general mapping order exists where the OSM community first concentrates on or arises from urban areas before moving to rural areas. This was tested by the hypotheses 1 (H1): *There is a positive correlation between the object age and the population density*. Second, we tested the hypotheses that areas with higher population density are more fragmented and therefore exhibit smaller elements, while areas with low population density, such as forest, are often larger objects: (H2) *there is a negative correlation between the object size and the population density*. Third, we tested the effect between the OSM LULC objects' age and population density, assuming a non-significant correlation. This was based on two opposing assumptions: Large geographical entities may be mapped first, and regions may be first coarsely drafted before adding details. This would lead to old objects being of larger size. Yet, hypotheses 1 and 2 contradict this tendency: according to H1 and H2, younger objects would be in areas with less population density and therefore tend to be larger. All three hypotheses were tested separately

<sup>1</sup> https://oev.geog.uni-heidelberg.de/.




using Kendall's *τ* (Hollander and Wolfe 1973), with the *p*-values adjusted for multiple testing (Benjamini and Hochberg 1995).

Results of this first exploratory analysis provided interesting insights into the complex and multifaceted structure of OSM land-use objects and its relation to OSM mapping communities. The main hypotheses regarding the mapping order (H1) could not be confirmed. In fact, the estimated correlation was slightly negative, meaning that for the used sample, objects in urban areas were younger than in rural areas. Yet, this does not imply that this mapping order does not exist in certain subregions. In addition, the age of an object is a fragile metric that highly depends on the mapping style of local mappers. Mappers frequently decide to delete and redraw elements instead of changing the original object, especially if the object was only a coarse approximation. This "resets" the object age, meaning that urban areas may have a high share of young objects because they are still actively mapped and maintained, even though they started their map appearance relatively early. H3 was equally confirmed, but only after *p*-value correction (*p*-value = 0.14). Regional specialities may exist in this aspect and need further investigation.

The negative correlation between the object size and the population density (H2) was confirmed with a *p*-value*<*0.01 though the *τ* was only −0*.*096 implying a small effect. At the global scale, many influencing factors may overlap or intervene with each other, hindering the extraction of single detailed effects. For the example at hand, we can assume that there are multiple regional communities or active mappers with individual mapping styles. The mapping detail in urban or rural regions will therefore be linked to these and other factors as well, not only the population density. Population density itself may not be generalizable on a global scale. The same level of fragmentation, meaning object size distribution, may be reached at different population density values, depending, e.g., on the continent.

The cluster analysis revealed interesting aspects, as some clusters could be associated with imports. Especially, a large import of North American lakes could be separated. This element group made up a considerable share of the global data and must therefore be taken into account when analyzing or describing the global dataset.

One thousand LULC objects were manually checked against high-resolution imagery. A combined quality score was assigned based on the thematic and the geometric correctness of the object. A quantile random forest was used to identify relationships between the data quality score and the 31 indicators calculated by the OEV tool.

While the overall quality of the model was intermediate, we were able to identify a series of interesting relationships between the indicators and the quality of the land-use objects based on the visual inspection of partial dependency plots. The most important features in the model (c.f. Fig. 2.1) were the size of the OSM object, contributor characteristics (such as experience and remoteness), rare OSM tags, and regional OSM mapping aspects (e.g., number of OSM objects in the surrounding of the object).

Element size had by far the highest feature importance. However, the effect on data quality of the OSM object had no clear direction. The indicator had to

### 2 Fitness for Purpose of OSM for Remote Sensing 29

**Fig. 2.1** Feature importance in relation to data quality. The importance was derived based on a quantile random forest for 1000 randomly selected OSM objects. The features are sorted by the percentage increase in squared mean error if the feature would be dropped. In addition, node purity is provided as a second feature importance indicator. To ease interpretation, the second indicator is displayed together with its position relative to the median node purity value across all selected features

be interpreted in combination with other indicators such as the primary tag (e.g., *landuse* or *natural*) as some land-use tags were characterized by big objects (e.g., forest) and others by small objects (e.g., urban grass). Objects with rare primary tags indicated, in general, a lower object quality—presumably as these tags are less well established and possibly poorly defined and thereby harder to map consistently. The effect of contributor experience showed a multimodal distribution: contributors with very little experience (newcomers) were associated with objects of medium quality and contributors with medium experience (stable mappers) with high quality of the land-use objects. Interestingly, highly experienced mappers were associated with poor quality of the land-use object. One possible explanation is that these extraordinary active users might represent bots or importers. One has, however, to keep in mind that only a few contributors fell into this category, which led to a low statistical power.

With respect to the number of elements surrounding an OSM LULC object, areas with more elements were—expectedly—better mapped. The larger the share of the surrounding of an OSM object mapped by LULC, the higher the quality of the OSM object. However, shares above 100% expectedly were associated with a lower quality of the OSM object. These regions are characterized by areas mapped using highly overlapping polygons, which typically indicate mapping errors. OSM edits are contributed as changesets. Larger changesets (more elements, often from different regions) were associated with lower quality of the OSM land-use objects. This is in line with the expectation that local, concise, and coherent edits are better. With respect to the primary tag (the LULC class), the model indicated differences in the quality of some classes: while forest objects were of higher quality, grassdominated LULC classes were of lower quality. This could be explained by the clear distinction of forest from other LULC classes and the diversity of tags used to characterize different grass-dominated LULC classes. LULC objects in North America had a tendency for lower quality, presumably due to the large imports in this region. Besides that, LULC objects mapped in regions with higher Human Development Index or higher population density were associated with better data quality. This presumably reflects the larger OSM community in the Global North, especially in countries such as Austria, Germany, or France, as well as the higher number of potential mappers available in urban areas compared to the countryside. With respect to out-of-dateness, complex interactions were identified; however, generally recently changed objects were associated with higher quality. Except for newcomers, lower user diversity—i.e., users focusing on one aspect of OSM—was associated with higher data quality.

### **2.3 Label Noise Robust Deep Learning for Remote Sensing Data with OSM Tags**

### *2.3.1 OSM as the Source of Training RS Image Labels in ML*

Supervised ML methods have attracted great attention for Earth observation applications on ever-growing RS image archives. Due to their capability to automatically model higher-level RS image semantics in large scale, they are applied to many problems in RS such as multi-label image classification and land-cover map generation. These methods generally require the availability of a high quantity of annotated training RS images. However, the manual collection of RS image annotations by domain experts for a large amount of data can be time-consuming, complex, and costly. Accordingly, the use of volunteered geographic information as crowdsourced data such as OSM to automatically derive annotated training data has been drawing significant attention in RS. As an example, in (Kaiser et al. 2017; Wan et al. 2017; Comandur and Kak 2021), it is shown that the direct use of OSM tags as pixel-level land-use class labels is useful for the automatic map generation of RS images through support vector machines and convolutional neural networks (CNNs). In (Li et al. 2020), OSM tags are utilized as scene-level class labels of training RS images to automatically predict land-use/land-cover classes of RS image scenes through CNNs. In (Audebert et al. 2017), OSM data is also utilized as an auxiliary information source by fusing it with optical data from very-high-resolution satellite imagery through dual-stream CNNs. In (Lin et al. 2022), an active learning strategy is introduced to partially annotate RS images with salient multi-labels based on OSM tags. In this study, an adaptive temperatureassociated model is also proposed to apply multi-label RS image classification by utilizing partially annotated training data and automatically assigning missing labels to training images during training.

Thanks to the publicly available OSM database, collection of RS image annotations for a high quantity of training data to be utilized for ML methods can be achieved at lower costs. However, OSM tags can be outdated regarding RS images due to possible changes on the ground; or there can be annotation errors. Accordingly, using OSM tags as the source of training image annotations may increase the chance of including noisy labels in training data of ML methods. As an example, for multi-label image annotations of RS images, two types of noise can exist. Noise can be associated with missing labels or wrong labels. A missing label means that although a land-use/land-cover class exists in an RS image, the corresponding class label is not assigned. A wrong label means that although a class label is assigned to RS image, the corresponding class is not present in the image.

### *2.3.2 Label Noise Robust ML Methods*

When a ML model is trained on noisy training data, there is a risk of overfitting of the model parameters to noisy labels and thus suboptimal inference performance. To this end, a few methods are presented in RS to improve the robustness of ML models toward noisy labels in training data. As an example, in (Zhang et al. 2020a), a noisy label knowledge distillation method is introduced for single-label RS image classification problems to leverage the knowledge learned through a teacher model on images with noisy labels for a student model. In this method, two CNNs are employed as a teacher-student framework, while a clean and trustworthy subset of a training set is assumed to be available for the student CNN. In (Aksoy et al. 2022), a collaborative learning framework is proposed to identify and exclude images with noisy multi-labels during training. To this end, it employs two CNNs operating collaboratively, while they are forced to characterize distinct image representations and to produce similar predictions. In (Burgert et al. 2022), the effects of the abovementioned label noise types in multi-label RS image classification problems are investigated, while different single-label noise robust methods are integrated to multi-label classification problems in RS. In (Dong et al. 2022), for land-cover map generation through semantic segmentation, an online noise correction approach is introduced to detect and correct pixel-level noisy labels via information entropy at the early stage of model training and thus to continue training with corrected labels. Although all these methods are potentially effective, the development of label noise robust ML methods when the OSM tags are utilized as the source of image annotations has not yet been investigated in RS literature.

It is worth mentioning that learning the parameters of ML models under noisy labels in training data has been studied more extensively in computer vision (CV) literature than RS. Recent research directions in CV can be grouped into the development of (1) deep neural network (DNN) architectures, (2) ML loss functions, (3) regularization strategies for training ML models, and (4) training sample selection and label adjustment techniques for single-label image classification problems. The first set of methods are concentrated on the development of DNN architectures designed for training data with noisy labels. For example, a contrastive-additive noise network is introduced in (Yao et al. 2019) to model trustworthiness of noisy training labels. This network consists of a probabilistic latent variable model as a contrastive layer in order to measure the quality of annotations and an additive layer to aggregate the class predictions and noisy labels. The second set of methods is mostly focused on the development of ML loss functions, which embody robust characteristics toward noisy labels. As an example, in (Ridnik et al. 2021), an asymmetric loss function is proposed to dynamically decrease the weights of negative classes in multi-labels. This allows to decrease the effect of images with missing labels on ML model parameter updates during training. The third set of methods are concentrated on regularizing the whole ML model training to prevent overfitting of model parameters to noisy labels. For instance, a regularization term is integrated into the cross-entropy loss function in (Liu et al. 2020) to utilize the class predictions from an early stage of ML model training to prevent the memorization of noisy labels. The fourth set of methods aim to first select images with correct labels or adjust noisy labels and then to learn through samples with correct labels. As an example, a joint training with co-regularization approach is introduced in (Wei et al. 2020) to employ collaborative learning of two CNNs for the selection of correct labels by an agreement strategy.

### *2.3.3 Proposed Methods*

Due to the public availability of OSM, RS images can be automatically associated with multiple land-use/land-cover classes (i.e., multi-labels) by using OSM tags. This allows to create large training sets for deep learning (DL)-based multi-label RS image classification methods at lower costs. Let X={*x*1*,..., xM*} be an RS image archive that includes *M* images, where *x<sup>k</sup>* is the *k*th image in the archive. We assume that a training set <sup>T</sup> = {*(xi, <sup>y</sup>i)*}*<sup>D</sup> <sup>i</sup>*=<sup>1</sup> is available. Each training image *<sup>x</sup><sup>i</sup>* is associated with a set of class labels *y<sup>i</sup>* ∈ {0*,* <sup>1</sup>}*<sup>S</sup>* based on the corresponding OSM tags, where *S* is the total number of classes. Let *φ* : *θ ,* X → Yˆ be any type of convolutional neural network (CNN) that generates the multi-label *y*ˆ *<sup>k</sup>* of an image *x<sup>k</sup>* ∈ X. Training *φ* on T, which may include noisy labels due to noisy OSM tags, can lead to learning suboptimal model parameters *θ* and inaccurate inference performance, as discussed in the previous sections.

To address this issue, we aim to first automatically detect noisy OSM tags based on the CNN *φ* trained on T and then adjust training labels associated with noisy OSM tags for label noise robust learning of the CNN model parameters *θ*.

### **2.3.3.1 Noisy OSM Tag Detection**

Region-based RS image representations combining both local information and the related spatial organization of land-use/land-cover classes are important for the accurate detection of noisy OSM tags. However, multi-label RS image predictions Yˆ of the considered CNN *φ* do not provide spatial information regarding the class location.

Accordingly, we employed class activation maps (CAM) introduced in (Zhou et al. 2016) since they are capable of deriving the regions most relevant for a given class with respect to the DL model trained for image classification. Let <sup>F</sup>*<sup>i</sup>* <sup>∈</sup> <sup>R</sup>*CxWxH* be a set of feature maps for an image *x<sup>i</sup>* obtained from the last convolutional layer of the CNN backbone where C, H, and W represent the number of channels, height, and width of the feature maps, respectively. CAMs associated with *x<sup>i</sup>* can be obtained by applying a 1x1 convolutional layer, which takes the feature maps F*<sup>i</sup>* of *x<sup>i</sup>* as input and produces a set of feature maps A*<sup>i</sup>* <sup>∈</sup> <sup>R</sup>*SxWxH* . The *s*th feature map A*<sup>s</sup> <sup>i</sup>* <sup>∈</sup> <sup>R</sup>*WxH* is the localization map associated with a class *s*, which can be obtained as follows:

$$\mathcal{H}\_l^s = \sum\_{c=1}^c w\_s^c \mathcal{F}\_l^c \tag{2.1}$$

where *w<sup>c</sup> <sup>s</sup>* is the weight of importance for the *c*th feature map F*<sup>c</sup> <sup>i</sup>* regarding the *s*th class. The obtained CAMs are forwarded through a global average pooling (GAP) layer to obtain multi-label class predictions. However, multi-label classification models from which CAMs can be derived are trained only to identify the presence of a given class within the image. Thus, CAMs tend to focus only on the most discriminative features within the image, leading to the incomplete coverage of the target class within the image (Zhang et al. 2020b).

Self-enhancement maps (SEMs) introduced in (Zhang et al. 2020b) address this issue and improve the localization maps derived from CAMs by including the similarity of feature maps in the localization map calculation. This is achieved by first defining seed coordinates, which are the image regions with the largest activation values on CAMs for a given class. Then, a similarity map is created for each seed point based on the cosine similarity between the seed point feature vector and all other feature vectors. For a given class *s* of an image *xi*, the final class activation map <sup>E</sup>*<sup>s</sup> <sup>i</sup>* is obtained by taking the maximum value at each pixel across the similarity maps. After obtaining all the class activation maps, we generate a class prototype *P<sup>s</sup>* for the class *s* by following (Lee et al. 2018). This is achieved by averaging all the feature maps, which are extracted from T and associated with the class *s*. Class prototypes allow us to obtain more accurate class predictions based on spatial information regarding the location of image classes and the corresponding region-based image representations. Accordingly, to define whether the class *s* is present in the image *xi*, the extracted features of the image for the class F*<sup>s</sup> <sup>i</sup>* are compared with the corresponding class prototype based on their cosine similarity as follows:

$$\hat{\mathbf{y}}\_{l}^{s} = \begin{cases} 1, & \cos(P^{s}, \mathcal{F}\_{l}^{s}) > 0.5\\ 0, & \text{otherwise} \end{cases} \tag{2.2}$$

To detect if *xi* is associated with noisy labels, we compare the class predictions *y*ˆ*<sup>i</sup>* with the associated OSM tags. If the CNN model predicts a class which is not in the list of class labels derived from OSM tags, it is assumed to be a missing class. This missing class can be localized through SEMs. If the CNN model does not predict a class, but it is in the list of class labels derived from OSM tags, it is assumed to be a wrong class label. It is noted that automatically defining noisy OSM tags and the localization of missing classes via SEMs allow providing feedback to the OSM community. Such feedback together with further investigations in the OSM community can lead to correcting noisy OSM tags by human mappers.

### **2.3.3.2 Label Noise Robust Multi-label RS Image Classification**

It is worth noting that the abovementioned method for the detection of noisy OSM tags relies on the model parameters *θ* of the considered CNN, which is obtained through training on T. If a small trustworthy subset C ∈ T of the training set is available, this method can also be used to automatically find training images associated with noisy labels.

To this end, we divide the whole learning procedure into two stages. In the first stage, *θ* is learned by training *φ* only on C. After this stage is finalized, we first automatically divide the rest of the training set T\C into training images with noisy labels N and training images with correct labels L (i.e., N∪L = T\C). Then, class labels associated with each image in N are automatically corrected based on (2.2) leading to training images with corrected labels N∗. This leads to automatically correcting noisy labels in T\C derived from OSM tags. Then, the training set of the second stage T<sup>∗</sup> is formed by combining L, C, and N∗.

In the second stage, all the model parameters of *φ* are fine-tuned on T∗. Thanks to the first stage, noisy labels included in the training set of this stage are significantly reduced. This allows to overcome overfitting on noisy labels of the whole training set in the second stage. Due to this two-stage learning of the model parameters, abundant training RS images annotated with OSM tags can be facilitated for label noise robust learning of multi-label RS image classification through CNNs.

### *2.3.4 Results and Discussion*

In this subsection, we first describe the considered dataset and the experimental setup and then provide our analysis of the experimental results.

### **2.3.4.1 Dataset Description and Experimental Setup**

To conduct experiments, we selected a Sentinel-2 tile acquired over South-West Germany including parts of France on 2021-06-13. This region spans from the Palatinate Forest in the west to the Odenwald in the east and includes large forested areas as well as areas dominated by agriculture or by built-up areas. This tile was divided into 81001*.*22 × 1*.*22-km-sized image patches. Each image patch is annotated with multi-labels based on the presence or absence of four major land-use classes that are defined with OSM tags (c.f. Table 2.2). While assigning the labels, small OSM objects were filtered (see Table 2.2 for thresholds used). The resulting labels of 910 image patches were manually validated against Sentinel-2 imagery.


**Table 2.2** OSM land-use classes used for the multi-label image classification


The OSM quality and completeness in the region were high (c.f. Table 2.3). OSMbased multi-label assignments had a mean average precision of 94.2%. The patches were clustered with respect to the correct assignment of the multi-labels: Clusters of correct OSM data were often due to monotonous landscapes, e.g., the Palatinate Forest. Clusters of flawed data were often due to missing data, e.g., in the region around Kaiserslautern. To perform experiments, 200 manually labeled patches were used as the test set, while the rest of the image patches were utilized as the training set.

In the experiments, we utilized the DeepLabv3+ (Chen et al. 2018) CNN architecture as the DL model. It is worth noting that DeepLabv3+ is originally designed for semantic segmentation problems. We replaced its semantic segmentation head with a fully connected layer followed by a GAP layer that forms the multi-label classification head with four output classes. We trained DeepLabv3+ for 20 epochs with Adam optimizer and the initial learning rate of 0.001. For the proposed label noise robust learning method, the same number of training epochs is used for each of the first and second stages. Experimental results are provided in terms of micro mean average precision (mAP) scores and noise detection accuracies.

We conducted experiments to (i) compare the considered DL model with the direct use of OSM tags for multi-label RS image classification, (ii) analyze the effectiveness of the proposed label noise detection method, and (iii) assess the effectiveness of the proposed label noise robust learning method. For the proposed label noise robust learning method, which requires the availability of a small trustworthy subset of the training set, we included the manually labeled image patches to the training set. However, for the comparison between the DL model and the OSM tags, only non-verified training data was utilized. To assess the robustness of the CNN model toward label noise and to detect noisy samples, we injected synthetic label noise to the training and test sets at different percentages (which were 20%, 40%, 60%, and 80% for the training set and 10%, 20%, 30%, and 40% for the test set) by following (Burgert et al. 2022).

### **2.3.4.2 Comparison Between Direct Use of OSM Tags and DL-Based Multi-label Image Classification**

In this subsection, we assess the effectiveness of the considered DL model (DeepLabV3+) compared to the direct use of OSM tags for multi-label RS image classification. To this end, the model parameters of the DL model were learned on abundant non-verified training data without considering label noise robust learning. Table 2.3 shows the corresponding results in terms of mAP values, when the different values of synthetic label noise rate (SLNR) were applied to the training set of DeepLabV3+ and the OSM tags. One can observe from the table that when SLNR equals to 0% for training and test sets, DeepLabV3+ achieves 5% higher mAP values compared to directly using OSM tags. Even when 20% label noise is synthetically added to the training set of the CNN model, it is still capable of achieving higher results compared to OSM when SLNR value equals to 0%. It is worth mentioning that directly using OSM of such low quality leads to missing or wrong classes. It can be seen from the table that when synthetic noise is added to OSM tags, its multi-label image classification performance is significantly reduced. As an example, when SLNR value is increased to 20% from 0%, multi-label image classification performance of OSM is reduced by more than 10%. These results show the effectiveness of using OSM as a training source of CNN models compared to directly using OSM tags for multi-label image classification. This is relevant because preliminary OSM data analyses may not be able to confidently identify such malicious areas of bad quality.

It is worth noting that further increasing the SLNR value of the training set of DeepLabV3+ significantly reduces multi-label image classification performance. This is due to the fact that when a training set of a DL model includes a higher rate of noisy labels, the model parameters are overfitted on noisy labels that lead to suboptimal learning of multi-label image classification. Figure 2.2 shows the selfenhancement maps (SEMs) of an RS image obtained on DeepLabV3+ trained under different values of SLNR. One can see from the figure that as the SLNR value of the training set increases, the capability of CNN model to characterize the semantic content of the image reduces due to noisy labels.

### **2.3.4.3 Label Noise Detection**

In this subsection, we assess the effectiveness of the proposed label noise detection method when different rates of synthetic label noise are applied to the test set. We also analyze the effect of the level of label noise (which is present in the training data) on our method. Table 2.4 shows the corresponding label noise detection accuracies obtained on DeepLabV3+ trained with abundant non-verified training data at different SLNR values and a small data, which is verified in terms of label noise. One can observe from the table that when SLNR equals to 0%, our label noise detection method, which is applied to DeepLabV3+ and trained on abundant data, achieves the highest label noise detection accuracies. For example, when synthetic

**Fig. 2.2** (**a**) An example of RS image and its self-enhancement maps obtained on DeepLabV3+ trained under synthetic label noise rates (**b**) 0%, (**c**) 10%, (**d**) 20%, (**e**) 30%, and (**f**) 40%

**Table 2.4** Label noise detection results in terms of accuracy (%) obtained by the proposed label noise detection method applied to the DeepLabV3+ trained with abundant non-verified data at different values of synthetic label noise rate (SLNR) and small verified data


label noise is applied to the test set with 40%, our label noise detection method is capable of achieving 94% accuracy. As the SLNR value increases on abundant training data, label noise detection accuracy of our method decreases. This is in line with our conclusion from the previous subsection about the effect of label noise level in training data. In greater details, when SLNR value reaches to 40% for abundant training data, noise detection accuracy of our method decreases by more than 50% compared to SLNR = 0%. When a small verified training data with the size of 10% of the whole training set is used for DeepLabV3+, our noise detection method achieves higher accuracies compared to abundant training data with SLNR ≥ 40%. These results show that our label noise detection method is capable of effectively detecting noisy labels without requiring a small verified data when the amount of label noise in the training data is small. However, if the level of label noise in training data is greater than a certain extent, our method requires the availability of a small trustworthy subset of the training set for accurate label noise detection.

### **2.3.4.4 Label Noise Robust Multi-label Image Classification**

In this subsection, we compare the proposed label noise robust learning method with the standard learning procedure, in which label noise of a training set is not considered during training. Table 2.5 shows the corresponding multi-label RS image classification scores when synthetic label noise is injected to the training set at different values of SLNR. It can be seen from the table that when the label noise level in the training set is small (SLNR ≤ 20%), standard learning of CNN model parameters achieves higher mAP values compared to label noise robust learning. As an example, when there is no synthetic label noise added to the training set, standard learning leads to more than 3% higher mAP score compared to label noise robust learning. However, as the SLNR value of the training set is higher than a particular value (20%), the considered CNN model with label noise robust learning provides higher multi-label RS image classification accuracies compared to standard learning. For example, when SLNR equals to 80% for the training set, label noise robust learning leads to almost 27% higher mAP value compared to standard learning. These results show that our learning method provides more robust learning of the model parameters for the considered CNN model toward label noise in the training set. Due to the two-stage learning procedure in our method, a


small trustworthy subset of the training set is effectively utilized in its first stage to automatically define noisy labels in the whole training set, which are accurately corrected. Then, employing corrected labels for fine-tuning CNN model parameters on the whole training set leads to leveraging abundant training data without being significantly affected by the label noise.

### **2.4 Conclusion and Outlook**

While OSM provides ample opportunities for use as labels in machine learningbased remote sensing applications, it is necessary to be aware of the challenges the dataset provides. Intrinsic and semi-intrinsic data quality indicators provide insights into the complexity of the OSM mapping process. Meaningful relationships between the indicators and data quality for a test set were derived. The complexity of the interactions did, however, not allow for a reliable prediction of data quality at the level of individual OSM objects. This might change if bigger sample sizes are used. And, while object-level quality prediction requires further research, the developed quality indicators referencing the data region can already support regional quality predictions which are successfully in use in production today.

The proposed deep learning method showed its potential to perform label noise robust multi-label image classification if at least a small set of high-quality labels is available. This shows the potential of the method (i) to overcome the challenges of OSM land-use labels in remote sensing applications and (ii) to provide qualityrelated feedback for the OSM community. As the OSM community is skeptical toward imports, especially based on automatic labeling, areas flagged as potentially problematic will when presumable be investigated by human mappers and potentially corrected in OSM. Furthermore, these areas can further be analyzed in combination with the intrinsic data quality indicators developed during the project. Approaches described in Chap. 7 might become helpful for this communication. The remote sensing community, on the other hand, can profit from this work through the automated creation of regionalized high-quality image classification models.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization, and Social Computing* (VGIscience). In this chapter, results from the project *Information Discovery from Big Earth Observation Data Archives by Learning from Volunteered Geographic Information* (IDEAL-VGI, Project number 424966858) are reviewed.

### **References**

Aksoy AK, Ravanbakhsh M, Demir B (2022) Multi-label noise robust collaborative learning for remote sensing image classification. IEEE Trans Neural Netw Learn Syst 1–14. https://doi.org/ 10.1109/TNNLS.2022.3209992


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Efficient Mining of Volunteered Trajectory Datasets**

### **Axel Forsch, Stefan Funke, Jan-Henrik Haunert, and Sabine Storandt**

**Abstract** With the ubiquity of mobile devices that are capable of tracking positions (be it via GPS or Wi-Fi/mobile network localization), there is a continuous stream of location data being generated every second. These location measurements are typically not considered individually but rather as sequences, each of which reflects the movement of one person or vehicle, which we call trajectory. This chapter presents new algorithmic approaches to process and visualize trajectories both in the network-constrained and the unconstrained case.

**Keywords** Trajectories · Data mining · Indexing · Driving preferences · Map matching · Anonymization · Isochrones · Processing pipeline

### **3.1 Introduction**

An abundance of volunteered trajectory data was made openly available in the last decades, fueled by the development of cheap sensor technology and widespread access to tracking devices. The OpenStreetMap project alone collected and published some 2.43 million GPS trajectories from around the world in the past 17 years. Such datasets enable a wealth of applications, as, e.g., movement pattern extraction, map generation, social routing, or traffic flow analysis and prediction.

Dealing with huge trajectory datasets poses many challenges, though. This is especially true for volunteered trajectory datasets which are often heterogeneous in terms of geolocation accuracy, duration of movement, or underlying transportation mode. A raw trajectory is typically represented as a sequence of time-stamped geolocations, potentially enriched with semantic information. To enable trajectorybased applications, the raw trajectories need to be processed appropriately. In the

A. Forsch (O) · S. Funke · J.-H. Haunert · S. Storandt

Universität Bonn, Bonn, Germany

e-mail: forsch@igg.uni-bonn.de; funke@fmi.uni-stuttgart.de; haunert@igg.uni-bonn.de; storandt@informatik.uni-konstanz.de

following, we describe the core steps of a general processing pipeline. Of course, this pipeline may be refined or adapted depending on the target application.


This chapter is a review of the achievements made inside the projects "Dynamic and Customizable Exploitation of Trajectory Data" and "Inferring personalized multi-criteria routing models from sparse sets of voluntarily contributed trajectories" inside the DFG priority program "Volunteered Geographic Information: Interpretation, Visualization, and Social Computing" (SPP 1894). We will discuss achievements for each step of the pipeline.

First, we describe a method to protect the privacy of the user who provided their trajectories. In specific, an algorithm is presented that anonymizes sensitive locations, such as the user's home location, by truncating the trajectory. For this, a formal attack model is introduced. To maximize the utility of the anonymized data, the algorithm truncates as little information as possible while still guaranteeing that the user's privacy cannot be breached by the defined attacks. This section is based on Brauer et al. (2022).

Then, we present a new map-matching method that is able to deal with socalled semi-restricted trajectories. Those are trajectories in which the movement is only partially bound to an underlying network, for example, stemming from a pedestrian who walked along some hiking path but crossed a meadow in between. The challenge is to identify the parts of the trajectory that happened within the network and to find a concise representation of the unrestricted parts. A method based on the careful tessellation of open spaces and multi-criteria path selection that copes with this challenge is presented. This section is based on Behr et al. (2021).

Afterward, we present the PATHFINDER storage and indexing system. It allows dealing with huge quantities of map-matched trajectories with the help of a novel data structure that augments the underlying network with so-called shortcut edges. Trajectories are then represented as sequences of such shortcut edges, which automatically compresses and indexes them. Integrating spatial and temporal search structures allows for answering space-time range queries within a few microseconds per trajectory in the result set. This section is based on Funke et al. (2019).

Then, we review new algorithms for driving preference mining from trajectory sets. Based on a linear preference model, these algorithms identify the driving preferences from the trajectories and use this information to compress and cluster the trajectories. This section is based on Forsch et al. (2022) and Barth et al. (2021).

Finally, we present a method to visualize the results of preference mining to improve their interpretability. For this, an isochrone visualization is used. The isochrones show which areas are easy to access for a user with a specific routing profile and which areas are more difficult to access. This information is especially useful for infrastructure planning. This section is based on Forsch et al. (2021).

The overarching vision is to build an integrated system that enables uploading, preprocessing, indexing, and mining of trajectory sets in a flexible fashion and which scales to the steadily growing OpenStreetMap trajectory dataset. We conclude the chapter by discussing some open problems on the way to achieving this goal.

### **3.2 Protection of Sensitive Locations**

Trajectories are personal data, and, as such, they come within the ambit of the General Data Protection Regulation (GDPR). Therefore, the user's privacy must always be considered when collecting or analyzing users' trajectories. For this, location privacy-preserving mechanisms (LPPMs) are developed. In this section, we review the LPPM presented in Brauer et al. (2022), which focuses on protecting sensitive locations along the trajectory.

Publishing trajectories anonymously, i.e., without giving the name of the user recording the trajectory, is not sufficient to protect the user's privacy. An adversary can still extract personal information from this data, such as the user's home and workplace. This extracted information can link the published trajectories back to the user's identity by using so-called re-identification attacks.

A widely used concept to prevent re-identification attacks is *k-anonymity*  (Sweeney 2002). A *k*-anonymized trajectory cannot be distinguished from at least *k* − 1 other trajectories in the same dataset. LPPMs that *k*-anonymize trajectory datasets (e.g., Abul et al. 2008; Yarovoy et al. 2009; Monreale et al. 2010; Dong and Pi 2018) make use of generalization, suppression, and distortion. While *k*anonymized trajectory datasets still retain the characteristics of the trajectories when analyzing the dataset as a whole, the utility of single trajectories in the dataset is greatly diminished. To counteract this, anonymization can be restricted to protecting sensitive locations along the trajectories. Sensitive locations are either user-specific, e.g., the user's home or workplace, or they are universally sensitive, such as hospitals, banks, or casinos. LPPMs that focus on the protection of sensitive locations (e.g., Huo et al. 2012; Dai et al. 2018; Wang and Kankanhalli 2020) are more utility preserving and thus allow for better analysis in cases where no strict *k*-anonymity is needed. In this section, a truncation-based algorithm for protecting sensitive locations is reviewed that transfers the concept of *k*-anonymity to sensitive locations.

### *3.2.1 Privacy Concept*

In this section, the problem of protecting the users' privacy is formalized. At first, a formal attacker model is introduced that defines the kind of adversary considered. Then, the privacy model to prevent the attacks defined in the attacker model is presented.

**Attacker Model** The attacker has access to a set of trajectories *O* = {*T*1*,...,Tl*} that have a common destination *s*˜. Furthermore, the attacker knows a set of sites *S* that is guaranteed to contain *s*˜. In this context, sites can be any collection of points of interest, such as addresses or buildings, and *S* could contain all buildings for a given region. The attacker's objective is to identify *s*˜ based on *O* and *S*. For this, the attacker utilizes several attack functions *f*1*,...,fa*. For each trajectory *T* , an attack function *f* yields a set of candidate sites: *f (S, T )* ⊆ *S*. By applying all attack functions to all trajectories in *O*, the attacker collects a joint set of candidates, which enables him to infer *s*˜. For example, the attacker could assume that *s*˜ is the site that occurs most often in the joint set.

**Privacy Model** Brauer et al. (2022) introduced a privacy concept called *k*-siteunidentifiability which transfers the concept of *k*-anonymity to sites. Given a set *S* of sites and a destination site *s*˜ ∈ *S*, *k*-site-unidentifiability requires that *s*˜ is indistinguishable from at least *k* − 1 other sites in *S*. Put differently, if *k*-siteunidentifiability is satisfied, *s*˜ is hidden in a set *C(s)*˜ of *k* sites, which is termed *protection set*.

Recall that the attacker's attack functions return sets of candidate sites. The sites in the protection set *C(s)*˜ are indistinguishable with respect to a given attack function *f* if either all sites or none of the sites in *C(s)*˜ are returned as part of the candidate set for *f* . In order to preserve *k*-site-unidentifiability, this property must be guaranteed for all attack functions. The sites in *S* are available to the attacker and thus cannot be changed. Therefore, this can only be done by altering the trajectories in *O*. In conclusion, the problem is defined as follows:

Given a set of sites *S*, a set of trajectories *O* with a common destination *s*˜ ∈ *S*, and a set of attack functions *F*, transform *O* into a set of trajectories *O*' that fulfills *k*-siteunidentifiability for all attack functions in *F*, i.e., either all or none of the sites in *C(s)*˜ are part of the attack's result for each trajectory in *O*.

In the following, an algorithm that truncates trajectories such that they fulfill *k*site-unidentifiability is presented.

### *3.2.2 The S-TT Algorithm*

In this section, the **S**ite-dependent **T**rajectory **T**runcation algorithm (S-TT) is explained. This algorithm truncates a trajectory *T* such that *k*-site-unidentifiability with respect to a set *F* of given attack functions is guaranteed. The truncated trajectory *T* ' is obtained by iteratively suppressing the endpoint of *T* until each attack function either contains all sites of *C(s)*˜ in its candidate set or none of them. The S-TT algorithm is simple to implement, yet it guarantees that none of the attack functions can be used to single out *s*˜ from the other sites in *C(s)*˜ . For using the algorithm, two further considerations need to be made. Firstly, the protection set *C(s)*˜ needs to be selected. Secondly, assumptions on the used attack functions in *F* need to be made. In the following, both of these aspects are discussed.

**Obtaining the Protection Sets** The choice of the protection set *C(s)*˜ greatly influences the quality of the anonymized data. There are two requirements for the protection sets to guarantee good anonymization results. Firstly, the sites in the protection set should be spatially close to each other. This maximizes the utility of the anonymized data, as the truncated part of the trajectory gets minimized. Secondly, the choice of the protection set should not depend on *s*˜. Otherwise, an attacker can infer *s*˜ by reverse engineering based on its protection set. Both requirements can be fulfilled by computing a partition of *S* into pairwise disjoint subsets, where each subset contains at least *k* sites. Each of these subsets becomes the protection set for all sites included in it. The partition of the sites is a clustering problem with minimum cluster size *k* and spatial compactness as the optimization criterion. Possible approaches can be purely geometric-based, e.g., by computing a minimum-weight spanning forest of the sites such that each of its trees spans at least *k* sites (e.g., Imielinska et al. ´ 1993), or they can take additional information into account, such as the road network (e.g., Haunert et al. 2021). A polygonal representation of the protection sets, called *protection cell*, is obtained by unioning the Voronoi cells of the sites in the protection set.

**Geometric Attack Functions** The S-TT algorithm truncates a trajectory based on the attack functions in *F*. In the following, a geometric-based S-TT algorithm is presented by defining attack functions for a geometric inference of the trajectories' destination site *s*˜. Two important characteristics of a trajectory that can be used to identify its destination site are the proximity to and the direction toward the destination site. In the following, attacks using these characteristics are formalized into attack functions.

A trajectory *T* most likely ends very close to its destination site. Therefore, it is reasonable to assume that the trajectory's endpoint end*(T )* is closer to *s*˜ than to all other sites in *S*. Thus, the proximity-based attack function *f*<sup>p</sup> returns the site closest to end*(T )* as the candidate site:

$$f\_{\mathbb{P}}(\mathcal{S}, T) := \operatorname\*{arg\,min}\_{s \in \mathcal{S}} \mathrm{d}(\mathrm{end}(T), s), \tag{3.1}$$

where the function d is the Euclidean distance. Note that this attack function can easily be extended to return the *n*-nearest sites to end*(T )* to introduce some tolerance.

A second aspect that can be used to infer the destination is the direction in which the trajectory is headed. Specifically, the trajectory's last segment *e* is of interest. However, the segment *e*, most likely, does not point exactly toward *s*˜. Therefore, a V-shaped region *R(t)* anchored in end*(T )* that is mirror-symmetric with respect to *e* and has an infinite radius is considered (Fig. 3.1). The opening angle of the V-shape is fixed and denoted with *α*.

The direction-based attack function *f*<sup>d</sup> returns the sites that are inside *R(T )* as its candidate set:

$$f\_{\mathbf{d}}(\mathcal{S}, T) := \mathbf{s} \cap \mathcal{R}(T). \tag{3.2}$$

The attacks *f*<sup>p</sup> and *f*<sup>d</sup> are simple, yet common sense suggests that they may be fairly successful. Figure 3.2 displays the geometric S-TT algorithm on an example trajectory. In the starting situation (Fig. 3.2a), the closest site to end*(T )* is *s*˜. Thus, the algorithm suppresses the endpoint of *T* until the site closest to its endpoint is not part of the protection set *C(s)*˜ of *s*˜ anymore (Fig. 3.2b). At this point, some, but not all, of the sites in *C(s)*˜ are part of the candidate set for the direction-based attack (Fig. 3.2b), violating *k*-site-unidentifiability for this attack function. Thus, truncation continues until either all the sites or none of the sites in *C(s)*˜ are part of

**Fig. 3.1** Schematic representation of a direction-based attack. The sites (round dots) inside *R(T )* (colored black) are candidate sites this attack function returns. *R(T )* is based on the terminating segment *e* of trajectory *T*

**Fig. 3.2** Schematic illustration of the geometric S-TT algorithm. (**a**) The original trajectory *T* leads to a destination site *s*˜. This destination has to be hidden among the sites in the protection set *C(s)*˜ , depicted as squares. All other sites are shown as gray points. The S-TT algorithm suppresses the trajectory's endpoint until neither (**b**) the site closest to the endpoint is part of the protection set *C(s)*˜ nor (**c**) a V-shaped region that is aligned with the last segment of the truncated trajectory *T* ' contains some, but not all, of the sites in *C(s)*˜

the candidate set of *f*<sup>d</sup> anymore (Fig. 3.2c). Note that at this point, neither *f*<sup>p</sup> nor *f*<sup>d</sup> allow for a distinction between the sites in *C(s)*˜ . Thus, *k*-site-unidentifiability is preserved, and the algorithm is done.

### *3.2.3 Experimental Results*

In this section, experimental results carried out with an implementation of the geometric-based S-TT algorithm are presented. The evaluation focuses on the algorithm's utility and the amount of data lost during anonymization. In this experimental study, it is assumed that the origin and destination of the trajectories were sensitive by default and should be protected under *k*-site-unidentifiability. For this, the S-TT algorithm is applied to both ends of the trajectories.

**Input Data** A dataset consisting of 10,927 synthetically generated trajectories in Greater Helsinki, Finland, is used for the evaluation. The trajectories are generated in a three-step process. First, the trajectory endpoints are randomly sampled using a weighted random selection algorithm with the population density as weight. This means that locations in densely populated areas are more likely to be selected as endpoints than locations in sparsely populated areas. In the second step, the endpoints are connected to the road network using a grid-based version of Dijkstra (1959)'s shortest path algorithm. Finally, using generic Dijkstra's algorithm, the shortest path between the two points on the road network is computed and used to connect the two endpoints to a full trajectory.

**Selection of the Protection Sets** In most scenarios, the sites of origin and destination of the trajectories are not explicitly given. For truncation, the S-TT algorithm does not require the destination site *s*˜ but only its protection set *C(s)*˜ . A naive approach to selecting *C(s)*˜ would be to select the site cluster belonging to the site closest to the trajectory's endpoint. However, trajectory data is prone to positioning uncertainties. Using the naive approach, these uncertainties could lead to selecting a protection set that does not contain *s*˜, severely diminishing the anonymization quality. A circular buffer of radius *b* around the endpoint is used to account for these uncertainties. The protection sets of all protection cells intersecting this buffer are unioned into one large protection set. This unioned protection set is then used as the protection set for the trajectory.

**Evaluation** The geometric S-TT algorithm, as outlined in Sect. 3.2.2, has three major configuration parameters: the minimum size of the protection sets *k*, the buffer radius *b*, and the opening angle *α* for the direction-based attack function. In the following, the effect of these three parameters is analyzed.

Figure 3.3 displays the mean length *δ*supp of the truncated trajectory parts for different parameterizations. The parameters *k* and *b* influence the size of the protection cell. Raising these parameters has a significant impact on the results, with the suppressed length of the trajectories increasing almost at a linear rate (Fig. 3.3a and c). While these results show that small values for *k* and *b* preserve more of the data, the choice of these parameters mainly depends on the application and the needed level of anonymity. For example, to eliminate the effect of GNSS accuracies, values for *b* of approximately 50 meters should be sufficient, while trajectory datasets with a higher inaccuracy need a larger buffer radius.

Regarding the parameterization of the V-shaped region for the direction-based attack, the results indicate that the highest data loss lies around *α* = 40◦ to 60◦ (Fig. 3.3b). The reason for this non-monotone behavior is that the direction-based attack function *f*<sup>d</sup> can be satisfied by two conditions: either all or none of the sites in the protection set must be covered by the V-shaped region. In the case that *α* is

**Fig. 3.3** Mean length of suppressed trajectory parts over different parametrization (**a**) of *k*, *α* = 60◦, *b* = 50m; (**b**) of *α*, *k* = 4, *b* = 50m; and (**c**) of *b*, *k* = 4, *α* = 60◦

very small, and therefore the V-shaped region is very narrow, it is very likely that the V-shape covers none of the sites in the protection set. Analogously, if *α* is very large, the V-shape is very broad, and all the sites are likely to be covered. Comparing the influence of *α* to the protection set parameters *k* and *b*, *α* has a significantly smaller impact on *δ*supp.

For each suppressed trajectory point, the attack function that triggered the suppression is stored to evaluate the importance of the two attack functions against each other. Given the parameters *k* = 4, *α* = 60◦, and *b* = 0 m, the directionbased attack function did not trigger the suppression of any trajectory points in 24% of the cases. In other words, the algorithm stopped truncating the trajectories at the first trajectory point located outside their protection cell. Likewise, the share of trajectory points suppressed due to the direction-based criterion was 38%. This means that 62% of the suppressed trajectory points were located in the protection cells.

### **3.3 Map Matching for Semi-restricted Trajectories**

Map matching is the process of pinpointing a trajectory to the path in an underlying network that explains the observed measurements best. Map matching is often the first step of trajectory data processing, as there are several benefits when dealing with paths in a known network instead of raw location measurement data:


However, suppose the assumption that a given trajectory was derived from restricted movement in a certain network is incorrect. In that case, map matching might heavily distort the trajectory and erase important semantic characteristics. For example, there could be two trajectories of pedestrians who met in the middle of a market square, arriving from different directions. After map matching, not only the aspect that the two trajectories got very close at one point would be lost, but the visit to the market square would be removed completely (if there are no paths across it in the given network). Not applying map matching might result in misleading results as well, as the parts of the movement that actually happened in a restricted fashion might not be discovered and—as outlined above—having to store and query huge sets of raw trajectories is undesirable.

Hence, the goal is to design an approach that allows for sensible map matching of trajectories that possibly contain on- and off-road sections, also referred to as semi-restricted trajectories. In the following, an algorithm introduced by Behr et al. (2021) is reviewed that deals with this problem.

### *3.3.1 Methodology*

The algorithm is based on a state transition model where each point of a given trajectory is represented by a set of matching candidates—the possible system states, i.e., possible positions of a moving subject. This is similar to hidden Markov model (HMM)-based map-matching algorithms (Haunert and Budig 2012; Koller et al. 2015; Newson and Krumm 2009), which aim to find a sequence of positions maximizing a product of probabilities. In contrast, the presented algorithm minimizes the sum of energy terms of a carefully crafted model.

**Input** The algorithm works on the spatial components of a given trajectory *T* = <sup>&</sup>lt;*p*1*,...,pk*<sup>&</sup>gt; of *<sup>k</sup>* points in <sup>R</sup>2, referred to as GPS points in the following.

As base input, we are given a directed, edge-weighted graph *G(V , E)* that models the underlying transport network. Every directed edge *uv* ∈ *E* corresponds to a directed straight-line segment representing a road segment with an allowed direction of travel. For an edge *e*, let *w(e)* be its *weight*.

Additionally, we assume to be given a set of open spaces such as public squares, parks, or parking lots represented as polygonal areas. The given transport network is extended by triangulating all open spaces and adding tessellation edges as arcs into the graph (Fig. 3.4).

**Fig. 3.4** Left: Movement data (blue crosses) on the open space (green) cannot be matched appropriately (dashed orange). An unmatched segment (dashed red) remains. Right: Extended network where open spaces are tessellated to add appropriate off-road candidates. Note that the green polygon has a hole, which remains untessellated since it is not open space

To accurately represent polygons of varying degrees of detail and to provide an appropriate number of off-road candidates, CGAL's meshing algorithm (Boissonnat et al. 2000) is used with an upper bound on the length of the resulting edges and a lower bound on the angles inside triangles.

**System States** For every *pi*, a set of matching *candidates* in the given network is computed by considering a disk *Di* of prescribed radius *r* around *pi* and extracting the set of *nodes Vi* ⊆ *V* within the disk. Furthermore, tessellation nodes (i.e., triangle corners) inside the disk are considered possible matches and included in *Vi*. Finally, the GPS point itself is added as a candidate to have a fallback. The optimization model ensures that input points are only chosen as output points if all network and tessellation candidates are too far away to be a sensible explanation for this measurement.

**Energy Model** To ensure that the output path *P* matches well to the trajectory, an energy function is set up that aggregates a state-based and transition-based energy:


Note that we have three different sets of edges in the graph: the original edges of the road network *E*r, the edges incident to unmatched candidate nodes *E*u, and the offroad edges on open spaces *E*t. On-road matches are preferred over off-road matches, while unmatched candidates are used as a fallback solution and thus should only be selected if no suitable path over on- and off-road edges can be found. To model this, the energy function is adapted by changing the edge weighting *w* of the graph. Two weighting terms *α*<sup>t</sup> and *α*<sup>u</sup> are introduced that scale the weight *w(e)* of each edge *e* in *E*<sup>t</sup> or *E*u, respectively.

The state-based and transition-based energies are subsequently aggregated using a weighted sum, parametrized with a parameter *α*c. This yields the overall objective function quantifying the fit between a trajectory <*p*1*,...,pk*> and an output path defined with the selected sequence <*v*1*,...,vk*> of nodes:

$$\text{Minimize } \quad \mathcal{E}(\langle p\_1, \dots, p\_k \rangle, \langle v\_1, \dots, v\_k \rangle) = \alpha\_\mathbb{C} \cdot \sum\_{l=1}^k \left\| p\_l - v\_l \right\|^2 + \sum\_{l=1}^{k-1} w(P\_{l,l+1}) \tag{3.3}$$

To favor matches in the original road network and keep unmatched candidates as a fallback solution, 1 *< α*<sup>t</sup> *< α*<sup>u</sup> should hold. Together with the weighting factor *α*<sup>c</sup> for the state-based energy, the final energy function thus comprises three different weighting factors.

**Matching Algorithm** An optimal path, minimizing the introduced energy term, is computed using *k* runs of Dijkstra's algorithm on a graph that results from augmenting *G* with a few auxiliary nodes and arcs. More precisely, an incremental algorithm that proceeds in *k* iterations is used. In the *i*th iteration, it computes for the subtrajectory <*p*1*,...,pi*> of *T* and each matching candidate *v* ∈ *Vi* the objective value E*<sup>v</sup> <sup>i</sup>* of a solution <*v*1*,...,vi*> that minimizes E*(*<*p*1*,...,pi*>*,*<*v*1*,...,vi*>*)* under the restriction that *vi* = *v*. This computation is done as follows. For *i* = 1 and any node *<sup>v</sup>* <sup>∈</sup> *<sup>V</sup>*1, <sup>E</sup>*<sup>v</sup>* <sup>1</sup> is simply the state-based energy for *v*, i.e., <sup>E</sup>*<sup>v</sup>* <sup>1</sup> <sup>=</sup> *<sup>α</sup>*<sup>c</sup> ·|*p*<sup>1</sup> <sup>−</sup> *<sup>v</sup>*|2. For *i >* 1, a dummy node *si* and a directed edge *siu* for each *u* ∈ *Vi*−<sup>1</sup> whose weight we set as <sup>E</sup>*<sup>u</sup> <sup>i</sup>*−<sup>1</sup> are introduced. With this, for any node *<sup>v</sup>* <sup>∈</sup> *Vi*, <sup>E</sup>*<sup>v</sup> <sup>i</sup>* corresponds to the weight of a minimum-weight *si*-*v*-path in the augmented graph, plus the state-based energy for *v*. Thus, for *si* and every node in *Vi*, a minimum-weight path needs to be found. All these paths can be found with one single-source shortest path query with source *si* and, thus, with a single execution of Dijkstra's algorithm.

### *3.3.2 Experimental Results*

The experimental region is the area around Lake Constance, Germany. For this region, data is extracted from OSM to build a transport network feasible for cyclists and pedestrians. In total, a graph with 931,698 nodes and 2,013,590 directed edges is extracted. The open spaces used for tessellation are identified by extracting polygons with special tags. Spaces with unrestricted movement (e.g., parks) and *obstacles* (e.g., buildings) are identified by lists of tags representing these categories, respectively. Following these tags, 6827 polygons representing open spaces are extracted. Including tessellation edges, the final graph for matching consists of 1,148,213 nodes and 3,345,426 directed edges.

The quality of the approach is analyzed using 58 cycling or walking trajectories. Off-road sections were annotated manually to get a ground truth. The length of the trajectories varies from 300 meters to 41.3 kilometers, and overall, they contain 66 annotated off-road segments.

The energy model parameters were set to *au* = 10*, at* = 1*.*1, and *ac* = 0*.*07. Using these values, the precision and recall for off-road section identification were both close to 1.0. A sensitivity study revealed that similar results are achieved for quite large parameter ranges. Furthermore, movement shapes are far better preserved with the presented approach compared to using map matching only on the transport network. This leads to the conclusion that the combined graph consisting of the transport network plus tessellation nodes and edges is only about 50% larger than the road network, but enables a faithful representation of restricted and unrestricted movement at the same time. This then allows applying storage and indexing methods that demand the movement to happen in an underlying network to be also applicable to semi-restricted trajectories.

### **3.4 Indexing and Querying of Massive Trajectory Sets**

Storing and indexing huge trajectory sets in an efficient manner is the very basis for large-scale analysis and mining tasks. An important goal is to compress the input to deal with millions or even billions of trajectories (which individually might consist of hundreds or even thousands of time-stamped positions). But at the same time, we want to be able to retrieve specific subsets of the data (e.g., all trajectories intersecting a given spatiotemporal range) without the necessity to decompress huge parts of the data.

In this section, we present a novel index structure called PATHFINDER that allows to answer range queries on map-matched trajectory sets on an unprecedented scale. Current approaches for the efficient retrieval of trajectory data make use of different dedicated data structures for the two main tasks, compression and indexing. In contrast, our approach elegantly uses an augmented version of the so-called contraction hierarchy (CH) data structure (Geisberger et al. 2012) for both of these tasks. CH is typically used to accelerate route planning queries, but has also proved successful in other settings like continuous map simplification (Funke et al. 2017). This saves space and makes our algorithms relatively simple without the need of too many auxiliary data structures. Only this slenderness allows for scalability to continent-sized road networks and huge trajectory sets. Other existing indexing schemes only work on small network sizes (e.g., PRESS (Song et al. 2014), network of Singapore; PARINET (Sandu Popa et al. 2011), cities of Stockton and Oldenburg; TED (Yang et al. 2018), cities of Singapore and Beijing) or moderate sizes (e.g., SPNET (Krogh et al. 2016): network of Denmark with 800k vertices). Our approach efficiently deals with networks containing millions of nodes and edges.

### *3.4.1 Methodology*

Throughout this section, we assume to be given a trajectory collection T that already was map matched to an underlying directed graph *G(V , E)* with *V* embedded in R2. We also assume the edges in the graphs to have non-negative costs *c* : *E* → *R*+. Accordingly, each element in *t* ∈ T is a path *π* = *v*0*v*<sup>1</sup> *...vk* in *G* annotated with timestamps *τ*0*, τ*1*,...,τk*.

The goal is to construct an index for T which allows to efficiently answer queries of the form [*xl, xu*]×[*yl, yu*]×[*τl, τu*] that aim to identify all trajectories which in the time interval [*τl, τu*] traverse the rectangular region [*xl, xu*]×[*yl, yu*]. In the literature, this kind of query is often named *window* query (Yang et al. 2018) or *range* query (Krogh et al. 2016), where the formal definitions may differ in detail; also see Zheng (2015).

**Contraction Hierarchies** Our algorithms heavily rely on the *contraction hierarchy*  (CH) (Geisberger et al. 2012) data structure, which was originally developed to speed up shortest path queries. A nice property of CH is that, as a by-product, it also constructs compressed representations of shortest paths.

The CH augments a given weighted graph *G(V , E, c)* with *shortcuts* and *node levels*. The elementary operation to construct *shortcuts* is the so-called node contraction, which removes a node *v* and all of its adjacent edges from the graph. To maintain the shortest path distances in the graph, a shortcut *s* = *(u, w)* is created between two adjacent nodes *u, w* of *v* if the only shortest path from *u* to *w* is the path *uvw*. We define the cost of the shortcut to simply be the sum of the costs of the replaced edges, i.e., *c(s)* = *c(uv)* + *c(vw)*. The construction of the CH is the successive contraction of all *v* ∈ *V* in some order; this order defines the *level l(v)* of a node *v*. The order in which nodes are contracted strongly influences the resulting speed-up for shortest path queries, and hence, many ordering heuristics exist. In our work, we choose the probably most popular heuristic: nodes with low *edge difference*, which is the difference between the number of added shortcut edges and the number of removed edges when contracting a node, are contracted first. We also allow the simultaneous contraction of non-adjacent nodes. As a result, the maximum level of even a continent-sized road network like the one of Europe never exceeds a few hundred in practice. The final CH data structure is defined as *G(V , E*+*, c, l)* where *E*+ is the union of *E* and all shortcuts created.

We also define the nesting depth *nd(e)* of an edge *e* = *(v, w)*. If *e* is an original edge, then *nd(e)* = 0. Otherwise, *e* is a shortcut replacing edges *e*1*, e*2, and we define its nesting depth *nd(e)* := max{*nd(e*1*), nd(e*2*)*} + 1. Clearly, the nesting depth is upper bounded by the maximum level of a node in the network.

**Compression** Given a pre-computed CH graph, we construct a CH representation for each trajectory *t* ∈ T, that is, we transform the path *π* = *e*0*e*<sup>1</sup> *...ek*−<sup>1</sup> with *ei* ∈ *E* in the original graph into a path *π*' = *e*' 0*e*' 1*e*' <sup>2</sup> *...e*' *k*' <sup>−</sup><sup>1</sup> with *e*' *<sup>i</sup>* ∈ *E*<sup>+</sup> in the CH graph.

Our algorithm to compute a CH representation is quite simple: We repeatedly check if there is a shortcut bridging two neighboring edges *ei* and *ei*+1. If so, we substitute them with the shortcut. We do this until there are no more such shortcuts. See Fig. 3.5 for an example.

Note that uniqueness of the CH representation can be proven, and therefore, it does not matter in which order neighboring edges are replaced by shortcuts. The running time of that algorithm is linear in the number of edges. Note that by switching to the CH representation, we can achieve a considerable compression rate in case the trajectory is composed of few shortest paths.

**Spatial Indexing and Retrieval** The general idea is to associate a trajectory with all edges of its *compressed representation* in *E*+. Only due to that compression, it becomes feasible to store a huge number of trajectories within the index. Answering a spatial query then boils down to finding all edges of the CH for which a corresponding path in the original graph intersects the query rectangle. Typically, an additional query data structure would be used for that purpose. Yet, we show how to utilize the CH itself as a geometric query data structure. This requires two central definitions:

**Fig. 3.5** Original path (black, 13 edges) and derivation of its CH representation (bold blue, 3 edges) via repeated shortcut substitution (in order according to the numbers). *y*-coordinate corresponds to CH level


Both *P B(e)* and *DB(v)* can be computed for all nodes and edges in linear time via a bottom-up traversal of the CH in a preprocessing step and *independent* of the trajectory set to be indexed.

For a spatial-only window query with query rectangle *Q*, we start traversing the CH level by level in a top-down fashion, first inspecting all nodes which do not have a higher-level neighbor (note that there can be several of them in case the graph is not a single connected component). We can check in constant time for the intersection of the query rectangle and the downgraph box of a node, only continuing with children of nodes with non-empty intersection. We call the set of nodes with non-empty intersection *VI* . The set of candidate edges *EO* are then all edges adjacent to a node in *VI* .

Having retrieved the candidate edges, we have to filter out edges *e* for which only the path box *P B(e)* intersects but not the path represented by *e*. For this case, we recursively unpack *e* to decide whether *e* (and the trajectories associated with *e*) has to be reported. As soon as one child edge reports a non-empty intersection in the recursion, the search can stop, and *e* must be reported. We call the set of edges that results from this step *Er*. Our final query result is all the trajectories which are referenced at least once by the retrieved edges, i.e., those *t* ∈ U *<sup>e</sup>*∈*Er* <sup>T</sup>*e*.

**Temporal Indexing** Timestamps of a trajectory *t* are annotations to its nodes. In the CH representation of *t*, we omit nodes via shortcuts, hence losing some temporal information. Yet, PATHFINDER will always answer queries conservatively, i.e., returning a superset of the exact result set.

Like the spatial bounding boxes *P B(e)*, we store time intervals to keep track of the earliest and latest trajectory passing over an edge. Similar to *DB(v)*, we compute minimal time intervals containing all time intervals associated with edges on a down-path from *v*. This allows us to efficiently answer queries which specify a time interval [*τl, τu*]. Like the spatial bounding boxes, we use these time intervals to prune tree branches when they do not intersect the time interval of the query.

An edge is associated with a set of trajectories, each of which we could check for the time when the respective trajectory traverses the edge. A more efficient approach is to store for all trajectories traversing an edge their time intervals in a so-called interval tree (Berg et al. 2008). This allows us to efficiently retrieve the associated trajectories matching the time interval constraint of the query for a given edge. An interval tree storing *l* intervals has space complexity *O(l)*, can be constructed in *O(l* log *l)*, and can retrieve all intervals intersecting a given query interval in time *O(*log *l* + *o)* where *o* is the output size.

### *3.4.2 Experimental Results*

As input graph data, we extracted the road and path network of Germany from OpenStreetMap. The network has about 57.4 million nodes and 121.7 million edges. CH preprocessing took only a few minutes and roughly doubled the number of edges.

For evaluation, we considered all trajectories within Germany from the Open-StreetMap collection, only dropping low-quality ones (e.g., due to extreme outliers, non-monotonous timestamps, etc.). As a result, we obtained 350 million GPS measurements which were matched to the Germany network with the map matcher from Seybold (2017) to get a dataset with 372,534 trajectories which we call Tger*,*real.

Our experiments reveal that on average, a trajectory from our dataset can be represented by 11 shortest paths in the CH graph. While the original edge representation of Tger*,*real consists of 121.8 million edges (992 MB on disk), the CH representation only requires 13.8 million edges (112 MB on disk). The actual compression for the 372k trajectories took 42 seconds, that is, around 0.1ms per trajectory.

To test window query efficiency, we used rectangles of different sizes and different scales of time intervals. Across all specified queries, PATHFINDER was orders of magnitude faster than its competitors, including the naive linear scan baseline but also an inverted index that also benefits from the computed CH graph. Indeed, queries only take a few microseconds per trajectory in the output set.

**Fig. 3.6** Visualization of trajectory data as developed in Rupp et al. (2022)

Furthermore, the CH-based representation also allows reporting aggregated results easily, which is especially useful for the efficient visualization of large result sets; see, for example, Fig. 3.6, which is a screenshot of our visualization tool presented in Rupp et al. (2022).

### **3.5 Preference-Based Trajectory Clustering**

It is well observable in practice that drivers' preferences are not homogeneous. If we have two alternative paths *π*1*, π*<sup>2</sup> between a given source-target pair, characterized by 3 costs/metrics (travel time, distance, and ascent along the route) each, e.g., *c(π*1*)* = *(*27 min*,* 12 km*,* 150 m*)* and *c(π*2*)* = *(*19 min*,* 18 km*,* 50 m*)*, there are most likely people who prefer *π*<sup>1</sup> over *π*<sup>2</sup> and vice versa. The most common model to formalize these preferences assumes a linear dependency on the metrics. This allows for *personalized route planning*, where a routing query consists of not only source and destination but also a weighting of the metrics in the network.

The larger a set of paths is, though, the less likely it is that a single preference/weighting exists which explains all paths, i.e., for which all paths are optimal. One might, for example, think of different driving styles/preferences when commuting versus leisure trips through the countryside. So, a natural question to ask is as follows: what is the minimum number of preferences necessary to explain a set of given paths in a road network with multiple metrics on the edges? This can also be interpreted as a trajectory clustering task, where routes are to be classified according to their purpose. In our example, one might be able to differentiate between commute and leisure. Or in another setting, where routes of different drivers are analyzed, one might be able to cluster them into speeders and cruisers depending on the routes they prefer.

A use case from recent research of our methods is presented in Chap. 11. In a field study, cyclists were equipped with sensors tracking their exposure to environmental stressors. The study's aim is to evaluate to what degree cyclists are willing to change their routing behavior to decrease their exposure to these stressors. The algorithms presented in the following section can be used to aid this evaluation.

### *3.5.1 A Linear Preference Model*

The following section is based on a *linear* preference model, i.e., for a given directed graph *G(V , E)*, for every edge *<sup>e</sup>* <sup>∈</sup> *<sup>E</sup>*,a *d*-dimensional cost vector *c(e)* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* is given, where *c*1*(e), c*2*(e), . . . , cd (e)* ≥ 0 correspond to non-negative quantities like travel time, distance, non-negative ascent, etc., which are to be minimized. A E path *π* = *e*1*e*<sup>2</sup> *...ek* in the network then has an associated cost vector *c(π )* := *k <sup>i</sup>*=<sup>1</sup> *c(ei)*.

A preference to distinguish between different alternative paths is specified by a vector *α* ∈ [0*,* 1] *<sup>d</sup>* , E*αi* <sup>=</sup> 1. For example, *α<sup>T</sup>* <sup>=</sup> *(*0*.*4*,* <sup>0</sup>*.*5*,* <sup>0</sup>*.*1*)* might express that the respective driver does not care much about ascents along the route, but considers travel time and distance similarly important. Alternative paths *π*<sup>1</sup> and *π*<sup>2</sup> are compared by evaluating the respective scalar products of the cost vectors of the path and the preference, i.e., *c(π*1*)<sup>T</sup>* · *<sup>α</sup>* and *c(π*2*)<sup>T</sup>* · *<sup>α</sup>*. Smaller scalar values in the linear model correspond to a preferred alternative. An *st*-path *π* (which is a path with source *s* and target *t*) is optimal for a fixed preference *α* if no other *st*-path *π*' exists with *c(π*' *)<sup>T</sup>* · *α < c(π )<sup>T</sup>* · *<sup>α</sup>*. Such a path is referred to as *<sup>α</sup>*-*optimal*.

From a practical point of view, it is very unintuitive (or rather almost impossible) for a user to actually express their driving preferences as such a vector *α*, even if they are aware of the units of the cost vectors on the edges. Hence, it would be very desirable to be able to *infer* their preferences from paths they like or which they have traveled before. In Funke et al. (2016), a technique for preference inference is proposed, which essentially instruments linear programming to determine an *α* for which a given path *π* is *α*-optimal or certify that none exists. Given an *st*-path *π* in a road network, in principle, their proposed LP has non-negative variables *α*1*,...,αd* and one constraint for each *st*-path *π*' which states that for the *α* we are after, *π*' should not be preferred over *π*". So the LP looks as follows:

$$\begin{aligned} \max \qquad & \alpha\_l\\ \forall st. \text{-paths } \pi': \alpha^T \left( c(\pi) - c(\pi') \right) \le 0 \qquad & \text{optimality constraints} \\ & \alpha\_l \qquad \ge 0 \qquad & non-negativity constraints \\ & \sum \alpha\_l = 1 \qquad & \text{scaling constraint} \end{aligned}$$

As such, this linear program is of little use, since typically, there is an exponential number of *st*-paths, so just writing down the complete LP seems infeasible. Fortunately, due to the equivalence of optimization and separation, it suffices to have an algorithm at hand which—for a given *α*—decides in polynomial time whether all constraints are fulfilled or, if not, provides a violated constraint (such an algorithm is called a *separation oracle*). In our case, this is very straightforward: we simply compute (using, e.g., Dijkstra's algorithm) the optimum *st*-path for a given *α*. If the respective path has better cost (w.r.t. *α*) than *π*, we add the respective constraint and resolve the augmented LP for a new *α*; otherwise, we have found the desired *α*. Via the ellipsoid method, this approach has polynomial running time; in practice, the dual simplex algorithm has proven to be very efficient.

### *3.5.2 Driving Preferences and Route Compression*

The method to infer the preference vector *α* from a trajectory presented in Sect. 3.5.1 requires that the given trajectory is optimal for at least one preference value *α*. In practice, for many trajectories, this is not the case. One prominent example is round trips, which are often recorded by recreational cyclists to share with their community. The corresponding trajectories end at the same location as they started. As such, the optimal route for any preference value with respect to the linear model is to not leave the starting point at all, thus having a cost of zero. Note, however, that this problem of not being optimal for any preference value also occurs for many oneway trips. In this section, we review a method by Forsch et al. (2022) that allows us to infer the routing preferences from these trajectories as well. The approach is loosely based on the minimum description length principle (Rissanen 1978), which states that the less information is needed to describe a dataset, the better it is represented. Applied to trajectories, the input trajectory, represented as a path in the road network, is segmented into as few as possible subpaths, such that each of these subpaths is *α*-optimal. In the following, a compression-based algorithm for the bicriteria case, i.e., when considering two cost functions, is presented. The algorithm is evaluated on cycling trajectories by inferring the importance of signposted cycling routes. Additionally, the trajectories are clustered using the algorithm's results.

### **3.5.2.1 Problem Formulation**

In the bicriterial case, the trade-off between two cost functions *c*<sup>0</sup> and *c*<sup>1</sup> for routing is considered. The linear preference model as described in Sect. 3.5.1 simplifies to *cα* = *α*<sup>0</sup> · *c*<sup>0</sup> + *α*<sup>1</sup> · *c*1, which can be rewritten to:

$$c\_{\alpha} = (1 - \alpha\_{\text{l}}) \cdot c\_{0} + \alpha\_{\text{l}} \cdot c\_{\text{l}} \dots$$

Thus, only a single preference value *α*<sup>1</sup> is present, which is simply denoted with *α* in the remainder of this section. For each path, the *α* interval for which the path is *α*-optimal is of interest. This interval is called the *optimality range* of the given path. The search for the optimality range is formalized in the following problem:

Given a graph *G* with two edge cost functions *c*<sup>0</sup> and *c*1, and a path *π*, find Iopt = {*α* ∈ [0*,* 1] | *π* is *α*-optimal}.

Many paths are not optimal for any *α*, and as such, their optimality range will be empty. For these paths, a minimal milestone segmentation and the corresponding preference value are searched. A *milestone segmentation* is a decomposition of a path *π* into the minimum number of subpaths {*π*1*,...,πh*}, such that every subpath *πi* with *i* ∈ {1*,...,h*} is *α*-optimal for the given *α*. Thus, the following optimization problem is solved:

Given a graph *G* with two edge cost functions *c*<sup>0</sup> and *c*<sup>1</sup> and a path *π*, find *α* ∈ [0*,* 1] and a milestone segmentation of *π* with respect to *α* that is as small as possible. That is, there is no *α*˜ that yields a milestone segmentation of *π* with a smaller number of subpaths.

The splitting points between the subpaths are called *milestones*. A path *π* in *G* can be fully reconstructed given its starting point and endpoint, *α*, and the corresponding milestones. By minimizing the number of milestones, we achieve the largest compression of the input data over all *α* values and thus, according to the minimum description length principle, found the preference value that best describes the input path.

### **3.5.2.2 Solving Milestone Segmentation**

Using a separation oracle as described in Sect. 3.5.1, the optimality range of every (sub)path of *π* is computed in O*(*SPQ · log*(Mn)* time, where SPQ is the running time of a shortest path query in *G*, *n* is the number of vertices in *G*, and *M* is the maximum edge cost among all edges regarding *c*<sup>0</sup> and *c*1. For a more detailed description of the algorithm, we refer to Forsch et al. (2022).

Retrieving the milestone segmentation with the smallest number of subpaths is done by first computing a set that contains a milestone segmentation of *π* for every *α* and then selecting the optimal milestone segmentation. Existing works in the field of trajectory segmentation use *start-stop matrices* (Alewijnse et al. 2014; Aronov et al. 2016) to evaluate the segmentation criterion. In our case, *α*-optimality is the segmentation criterion, and the segmentation is not solved for a single *α*, but the *α* value minimizing the number of subpaths over all *α* ∈ [0*,* 1] is searched. Therefore, the whole optimality interval for the corresponding subpath is stored in the start-stop matrix M. This allows us to use the same matrix for all *α* values. Figure 3.7 shows the (Boolean) start-stop matrix for a single *α*, retrieved by applying the expression *α* ∈ M[*i, j* ] to all elements of the matrix. For a path *π* consisting of *k* vertices, a (*k* × *k*)-matrix M of subintervals of [0*,* 1] is considered. The entry M[*i, j* ] in row *i* and column *j* corresponds to the optimality range of the subpath of *π* starting at its

**Fig. 3.7** Depiction of a start-stop matrix of a path consisting of *h* = 7 vertices for a fixed *α* ∈ [0*,* 1]. The black line represents a minimum segmentation of the path into two subpaths *(v*1*, v*5*, v*7*)*

*i*th vertex and ending at its *j* th vertex. Since we are only interested in subpaths with the same direction as *π*, we focus on the upper triangle matrix with *i* ≤ *j* .

Due to the optimal substructure of shortest paths (Cormen et al. 2009), for *i < k<j* , both M[*i, j* ] ⊆ M[*i, k*] and M[*i, j* ] ⊆ M[*k, j* ] hold. This results in the structure visible in Fig. 3.7, where no red cell is below or left of a green cell. The start-stop matrix is therefore computed starting in the lower-left corner, and each row is filled starting from its main diagonal. That way, the search space for M[*i, j* ] is limited to the intersection of the already computed values of M[*i* + 1*, j* ] and M[*i, j* − 1]. This is especially useful if one of these intervals is empty, meaning the current interval is empty as well and computation in this row can be stopped.

As a consequence of the substructure optimality explained above, once the startstop matrix is set up, it is easy to find a solution to the segmentation problem for a fixed *α*. According to Buchin et al. (2011), for example, an exact solution to this problem can be found with a greedy approach in O*(h)* time; see Fig. 3.7.

Since only a finite set of intervals is considered, we know that if a minimal solution exists for an *α* ∈ [0*,* 1], it also exists for one of the values bounding the intervals in M. Consequently, we can discretize our search space without loss of exactness, and each of the O*(h*2*)* optimality ranges yields at most two candidates for the solution. For each of these candidates, a minimum segmentation is computed in <sup>O</sup>*(h)* time. Thus, we end up with a total running time of <sup>O</sup>*(h*<sup>2</sup> ·*(h*+SPQ log*(Mn))* where *n* denotes the number of vertices in the graph and *h* denotes the number of vertices in the considered path. Thus, the algorithm is efficient and yields an exact solution. The solution consists of the intervals of the balance factor producing the best personalized costs with respect to the input criteria.

### **3.5.2.3 Experimental Results**

The experiments are set up for an examination of the extent to which cyclists stick to official bicycle routes. For this, the following edge costs are defined:

$$c\_0(e) = \begin{cases} 0, & \text{if } e \text{ is part of an artificial bicycle route} \\ d(e), & \text{else.} \end{cases}$$

$$c\_1(e) = \begin{cases} d(e), & \text{if } e \text{ is part of an artificial bicycle route} \\ 0, & \text{else.} \end{cases}$$

Hence, in the cost function, edge costs *c*<sup>0</sup> and *c*<sup>1</sup> are used that, apart from an edge's geodesic length *d*, depend only on whether the edge in question is part of an official bicycle route. Using this definition, a minimal milestone segmentation for *α* = 0*.*5 suggests that minimizing geodesic length outweighs the interest in official cycle paths. A value of *α <* 0*.*5 indicates avoidance of official cycle paths, whereas *α >* 0*.*5 indicates the opposite, i.e., preference.

For the analysis, a total of 1016 cycling trajectories from the user-driven platform GPSies1 are used. The map-matching algorithm described in Sect. 3.3 (without the extended road network) is used to generate paths in the road network, resulting in 2750 paths.2

First, the results of the approach on a single trajectory are presented. Figure 3.8 shows the path resulting from map matching the trajectory, as well as a minimal milestone decomposition for *α* = 0*.*55. In this example, a minimal milestone segmentation is found for *α* ∈ [0*.*510*,* 0*.*645], resulting in the milestones *a*, *b*, and *c*. For *α* = 0*.*5, i.e., the geometric shortest path, a milestone segmentation takes an additional milestone between *a* and *b*. For this specific trajectory, the minimal milestone segmentation is found for *α* values greater than 0*.*5. This relates to a preference toward officially signposted cycle routes. The increase in milestones for *α >* 0*.*645 indicates the upper boundary on the detour the cyclist is willing to take to stick to cycle ways.

Based on the compression algorithm, the set of trajectories is divided into three groups PRO, CON, and INDIF. The group PRO comprises trajectories for which a milestone segmentation of minimal size is only found for *α >* 0*.*5. Such a result is interpreted that way that the trajectory was planned in favor of official cycle routes. Conversely, it is considered as avoidance of official cycle routes if milestone segmentations of minimal size exist only for *α <* 0*.*5. The group CON represents these trajectories. Considering trajectories lacking this clarity, no strict categorization into one of the two groups is done. Instead, these trajectories form a third group, INDIF. The results of this classification are displayed in Fig. 3.9. The

<sup>1</sup> www.gpsies.com, [no longer available], downloaded May 2018.

<sup>2</sup> Trajectories are split by the algorithm at unmatched segments.

**Fig. 3.8** The user's path has an optimal milestone segmentation into four optimal subpaths *(s, a, b, c, t)* presented by white nodes. Officially signposted paths are highlighted in green. Disregarding minor shiftings of *a* and/or *b*, this is a valid optimal segmentation for each *α* ∈ [0*.*510*,* 0*.*645]. For *α* = 0*.*5, i.e., the geometric shortest path, a milestone segmentation takes an additional milestone between *a* and *b*

**Fig. 3.9** Size of the categories PRO (green, 998 paths), CON (red, 230 paths), and INDIF (gray, 1522 paths)

group PRO being over four times larger than the group CON is a first indicator that cyclists prefer official cycle routes over other roads and paths. The group CON is the smallest group with a share of 8% of all paths. One assumption for this result is that this group mainly consists of road cyclists who prefer using roads over cycle ways. In future research, this could be verified by analyzing a dataset of annotated trajectories.

Overall, more than 50% of the paths are segmented into five *α*-optimal subpaths or less, resulting in significant compression of the data.

### *3.5.3 Minimum Geometric Hitting Set*

The approach described in Sect. 3.5.1 can easily be extended to decide for a *set* of paths (with different source-target pairs) whether there exists a single preference *α* for which they are optimal (i.e., which explains this route choice). It does not work, though, if different routes for the same source-target pair are part of the input or simply no single preference can explain all chosen routes. The latter seems quite plausible when considering that one would probably prefer other road types on a leisure trip on the weekend versus the regular commute trip during the week. So, the following optimization problem is quite natural to consider:

Given a set of trajectories *T* in a multiweighted graph, determine a set *A* of preferences of minimal cardinality, such that each *π* ∈ *T* is optimal with respect to at least one *α* ∈ *A*.

We call this problem *preference-based trajectory clustering* (PTC).

For a concrete problem instance from the real world, one might hope that each preference in the set *A* then corresponds to a driving style, like speeder or cruiser. Note that while a real-world trajectory often is not optimal for a single *α*, studies like in Barth et al. (2020) show that it can typically be decomposed into very few optimal subtrajectories if multiple metrics are available.

In Funke et al. (2016), a sweep algorithm is introduced that computes an approximate solution of PTC. It is, however, relatively easy to come up with examples where the result of this sweep algorithm is by a factor of *o(*|*T* |*)* worse than the optimal solution. In Barth et al. (2021), we aim at improving this result by finding practical ways to solve PTC optimally. Our (surprisingly efficient) strategy is to explicitly compute for each trajectory *π* in *T* the polyhedron of preferences for which *π* is optimal and to translate PTC into a geometric hitting set problem.

Fortunately, the formulation as a linear program as described in 3.5.1 already provides a way to compute these polyhedra. The constraints in the above LP exactly characterize the possible values of *α* for which one path *π* is optimal. These values are the intersection of half-spaces described by the optimality constraints and the non-negativity constraints of the LP. We call this (convex) intersection *preference polyhedron*.

Using the preference polyhedra, we are armed to rephrase our original problem as a *geometric hitting set (GHS)* problem. In an instance of GHS, we typically have geometric objects (possibly overlapping) in space, and the goal is to find a set of points (a *hitting set*) of minimal cardinality, such that each of the objects contains at least one point of the hitting set. Figure 3.10 shows an example of how preference polyhedra of different optimal paths could look like in the case of three metrics. In terms of GHS, our PTC problem is equivalent to finding a hitting set for the preference polyhedra of minimum cardinality, and the "hitters" correspond to respective preferences. In Fig. 3.10, we have depicted two feasible hitting sets (white squares and black circles) for this instance. Both solutions are minimal in that no hitter can be removed without breaking feasibility. However, the white squares

(in contrast to the black circles) do not describe a minimum solution, as one can hit all polyhedra with fewer points.

While the GHS problem allows picking arbitrary points as hitters, it is not hard to see that it suffices to restrict to vertices of the polyhedra and intersection points between the polyhedra boundaries or more precisely vertices in the arrangement of feasibility polyhedra.

The GHS instance is then formed in a straightforward manner by having all the hitting set candidates as ground set and subsets according to containment in respective preference polyhedra. For an exact solution, we can formulate the problem as an integer linear program (ILP). Let *α(*1*) , α(*2*) , ..., α(l)* be the hitting set candidates and U := {*P*1*, P*2*, ..., Pk*} be the set of preference polyhedra. We create a variable *Xi* <sup>∈</sup> <sup>0</sup>*,* <sup>1</sup> indicating whether *α(i)* is picked as a hitter and use the following ILP formulation:

$$\min \sum\_{i} X\_{i}$$

$$\forall \ P \in \mathcal{U} : \sum\_{\alpha^{(i)} \in P} X\_{i} \ge 1$$

$$\forall \ i : X\_{i} \in \{0, 1\}$$

While solving ILPs is known to be NP-hard, it is often feasible to solve ILPs derived from real-world problem instances, even of non-homeopathic size.

### **3.5.3.1 Theoretical and Practical Challenges**

In theory, the complexity of a single preference polyhedron could be almost arbitrarily high. In this case, computing hitting sets of such complex polyhedra is extremely expensive. To guarantee bounded complexity, we propose to "sandwich" the preference polyhedron between approximating inner and outer polyhedra of bounded complexity.

**Fig. 3.11** Inner (yellow) and outer approximation (gray) of the preference polyhedron (black)

For *d* metrics, our preference polyhedron lives in *d* − 1 dimensions, so we uniformly *e*-sample the unit *(d* <sup>−</sup> <sup>2</sup>*)*-sphere using *O((*1*/e)d*−2*)* samples. Each of the samples gives rise to an objective function vector for our linear program; we solve each such LP instance to optimality. This determines *O((*1*/e)d*−2*)* extreme points of the polyhedron in equally distributed directions. Obviously, the convex hull of these extreme points is *contained within* and with decreasing *e* converges toward the preference polyhedron. Guarantees for the convergence in terms of *e* have been proven before, but are not necessary for our (practical) purposes. We call the convex hull of these extreme points the *inner approximation* of the preference polyhedron.

What is interesting in our context is the fact that each extreme point is defined by *d* − 1 half-spaces. So we can also consider the set of half-spaces that define the computed extreme points and compute their intersection. Clearly, this halfspace intersection *contains* the preference polyhedron. We call this the *outer approximation* of the preference polyhedron.

Let us illustrate our approach for a graph with *d* = 3 metrics, so the preference polyhedron lives in the *two*-dimensional plane; see the black polygon/polyhedron in Fig. 3.11. Note that we do not have an explicit representation of this polyhedron, but can only probe it via LP optimization calls. To obtain inner and outer approximation, we determine the extreme points of this implicitly (via the LP) given polyhedron, by using objective functions max *α*1*,* max *α*2*,* min *α*1*,* min *α*2. We obtain the four solid red extreme points. Their convex hull (in yellow) constitutes the inner approximation of the preference polyhedron. Each of the extreme points is defined by *two* constraints (half-planes supporting the two adjacent edges of the extreme points of the preference polyhedron). In Fig. 3.11, these are the light green, blue, dark green, and cyan pairs of constraints. The half-space intersection of these constraints form the *outer approximation* in gray.

### **3.5.3.2 Experimental Results**

For our experiments, we extracted a weighted graph from OpenStreetMap of the German state of Baden-Württemberg with the cost types *distance, travel time for cars*, and *travel time for trucks* containing about 4M nodes and 9M edges. A set of 50 preferences were chosen u.a.r. per instance and created a random source target


optimal for those preferences. Our implementation was evaluated on 24-core Xeon E5-2630 running Ubuntu Linux 20.04.

In Table 3.1, we see the results for a set of 1000 paths. Constructing all the preference polyhedra exactly costs 6.5 seconds and setting up the hitting set instance 5.2 seconds. Solving the hitting set ILP took 0.7 seconds and resulted in a hitting set of size 36. While in theory, the complexity of the preference polyhedra can be almost arbitrarily high, this is not the case in practice. Hence, exact preference polyhedra construction is also more efficient than our proposed approximation approaches (in the table the rows denoted by Inner-XX and Outer-XX), which also only can provide an approximation to the actual hitting set size.

### **3.6 Visualizing Routing Profiles**

In the previous sections, we reviewed algorithms to infer the routing preferences of a user or a group of users from user-generated trajectories. When applying these methods to large trajectory datasets, the trajectories, and thus the users, can be clustered according to their routing behavior. Each cluster has a characteristic routing behavior, expressed as a vector *α* ∈ [0*,* 1] *<sup>d</sup>* of weighting factors of the *d* different influence factors. On the one hand, these weighting factors are very important for improving route recommendation because they can directly be used to compute optimized future routes. On the other hand, the information on the weighting can be used for planning purposes, e.g., for planning new cycling infrastructure. For this use case, the mathematical representation of the routing preference is not useful, as it is hard to interpret. In this section, we review an alternative representation of routing preferences based on isochrone visualizations geared specifically at a fast and comprehensive understanding of the cyclist's needs. Isochrones are polygons that display the area that is reachable from a given starting point in a specific amount of time and are often used for network analysis, e.g., for public transportation (O'Sullivan et al. 2000) or the movement of electric cars (Baum et al. 2016). For most routing profiles, time is not the only influencing factor. Therefore, the definition of isochrones is slightly altered, and polygons that display areas reachable within a certain amount of *effort* a user is willing to spend instead of a fixed time are used. This effort is dependent on the specific routing preferences. Even though these polygons do not display times anymore, they will be called isochrones in the following. The visual complexity of the isochrones is reduced by using schematized polygons where the polygon's outline is limited to horizontal, vertical, and diagonal segments. This property is called *octilinearity*. The isochrones created with the presented approach guarantee a formally correct classification of reachable and unreachable parts of the network.

### *3.6.1 Methodology*

The methodology for computing schematic isochrones for specific routing profiles can be split up into two major components: the computation of reachability and the generation of the isochrones.

**Computing Reachable Parts of the Road Network** The first step for displaying the routing profiles is computing the reachability in the road network, modeled as a directed graph R, for the given profile. Recall the routing model from Sect. 3.5.1, where the personalized cost is the linear combination of the inherent costs in the road network and the user's preference, expressed as *α*. Thus, given a specific preference, e.g., computed by the methods presented in Sect. 3.5.2 or Sect. 3.5.3, we can compute the personalized costs in our graph. The sum of the personalized costs of the edges along an optimal route from *s* to *t* is called the *effort* needed to get from *s* to *t*. Shortest path algorithms such as Dijkstra's algorithm (Dijkstra 1959) can be used to compute optimal routes from a given starting location to all other locations in the road network. Such an algorithm is used to compute all nodes in R that are reachable within a given maximum effort. This information is stored by coloring the reachable nodes blue and the unreachable nodes red. Further, the road network graph R is planarized to obtain a graph R whose embedding is planar, that is, there are no crossings between two edges. This is done by replacing each edge crossing with an additional node and splitting the corresponding edges at these crossings.

**Generating the Isochrones** Given the node-colored road network, we proceed by generating the polygon that encloses all reachable nodes, the isochrones. For this, we first extend the node coloring of the graph to an edge coloring by performing the following steps. For each edge *e* that is incident to both a blue and a red node, we determine the location on *e* that is still reachable from *s*; see Fig. 3.12a. At this location, we subdivide *e* by a blue *dummy node*. A special case occurs when the sum of the remaining distances at two adjacent, reachable nodes *u* and *v* is smaller than the length of the edge *e* = {*u, v*}; see Fig. 3.12b. In this case, *e* is subdivided by two additional dummy nodes, with the middle part being colored red. Further, the coloring and the newly inserted nodes are transferred into the planarized graph R, additionally coloring the nodes that have been introduced to planarize R; see Fig. 3.12c. Each such node becomes blue if it subdivides an edge *e* in R that is only incident to blue nodes; otherwise, it becomes red. Altogether, we obtain a coloring

**Fig. 3.12** The reachability is modeled by subdividing edges. (**a**) The node *x* (resp. *y*) subdivides the edge *(u, v)* (resp. *(u, w)*) at the last reachable position. (**b**) Special case: The nodes *u* and *v* are reachable from different directions. (**c**) The coloring of R is transferred to R, and the nodes that are introduced for the planarization (squares) are colored

**Fig. 3.13** Creating an octilinear polygon enclosing the component *C*<sup>1</sup> of all reachable nodes and edges. (**a**) The faces *f*1*,...,f*<sup>5</sup> surround the component *C*1. (**b**) Each face is subdivided by an octilinear grid (Step 1). Furthermore, these grids are connected to one large grid *G* that is split by the port between *f*<sup>1</sup> and *f*<sup>5</sup> (Step 2). (**c**) An octilinear polygon is constructed by computing a bend-minimal path through *G* (Step 3)

of R, defining all nodes either blue or red. Due to the insertion of the dummy nodes, this node coloring induces an edge coloring: edges that are incident to two reachable nodes are also reachable, and all other edges are unreachable.

Given the colored planar graph R, we first observe that R has faces that have both red and blue edges. An edge that is incident to both a blue and a red node is called a *gate*. Further, the reachable node of a gate is denoted its *port*. Removing the gates from R decomposes the graph into a set of components *C*1*,...,Cl* such that component *C*<sup>1</sup> is blue and all other components are red. These components are called the *colored components* of R. Figure 3.13a shows the gates and the colored components for an example graph.

Given the colored components, we are looking for a single octilinear polygon such that *C*<sup>1</sup> lies inside and *C*2*,...,Cl* lie outside of the polygon. Our method consists of three steps, displayed in Fig. 3.13. In the first step, we create for each pair *vi*, *vi*+<sup>1</sup> of consecutive ports an octilinear grid *Gi* contained in *fi*. In the second step, we fuse these grids to one large grid *G*. In the final step, we use *G* to determine an octilinear polygon by finding an optimal path through *G*.

**Fig. 3.14** The face *f*<sup>2</sup> of the example shown in Fig. 3.13 is subdivided by octilinear rays based on vertices on the bounding box. (**a**) Shooting octilinear rays from *vi* and *vi*+<sup>1</sup> does not yield a connected line arrangement (see yellow highlight). (**b**–**c**) The bounding box of *f*<sup>1</sup> is successively refined by vertices shooting octilinear rays until they connect *vi* and *vi*+<sup>1</sup>

*Step 1* For each pair *vi*, *vi*+<sup>1</sup> of consecutive ports with 1 ≤ *i* ≤ *k*, we first compute an octilinear grid *Gi* that is contained in *fi* and connects *vi* and *vi*+1; see Fig. 3.14. To that end, we shoot from both ports octilinear rays and compute the line segment arrangement L of these rays restricted to the face *fi*; see Fig. 3.14a. If L forms one component, we use it as grid *Gi*. Otherwise, we refine L as follows. We uniformly subdivide the bounding box of *fi* by further nodes, from which we shoot additional octilinear rays; see Fig. 3.14b. We insert them into L restricting them to *fi*. We call the number of nodes on the bounding box the *degree of refinement d*. We double the degree of refinement until L is connected or a threshold *d*max is exceeded; see Fig. 3.14c. In the latter case, we also insert the boundary of *fi* into L to guarantee that L is connected. Later on, when creating the octilinear polygon, we only use the boundary edges of *fi* if necessary.

*Step 2* In the following, each grid *Gi* is interpreted as a geometric graph, such that the grid points are the nodes of the graph and the segments connecting the grid points are the edges of the graph. These graphs are unioned into one large graph *G*. More precisely, *G* is the graph that contains all nodes and edges of the grids *G*1*,...,Gk*. In particular, each port *vi* is represented by two nodes *xi* and *yi* in *G* such that *xi* stems from *Gi*−<sup>1</sup> and *yi* stems from *Gi*; for *i* = 1, we define *x*<sup>1</sup> = *vk*. Two grids *Gi*−<sup>1</sup> and *Gi* in *G* are connected by introducing the directed edge *(xi, yi)* in *G* for 2 ≤ *i* ≤ *k*.

*Step 3* In the following, let *s* := *y*<sup>1</sup> and *t* := *x*1. A path *P* from *s* to *t* in *G* is computed such that *P* has a minimum number of bends, i.e., there is no other path from *s* to *t* that has fewer bends. To that end, Dijkstra's algorithm on the linear dual graph of *G* is used, allowing us to penalize bends in the cost function. In case the choice of *P* is not unique because there are multiple paths with the same number of bends, the geometric length of the path is used as a tie-breaker, preferring paths that are shorter. As *s* and *t* have the same geometric location, the path *P* forms an octilinear polygon.

In some cases, *C*<sup>1</sup> contains one or multiple holes, i.e., one or more of the components *C*2*,...,Cl* are enclosed by *C*1. We deal with this in the following way. For each enclosed component *C*∗, the road network R is recolored, such that *C*∗ is blue and all other parts of the network are red. We then proceed by computing the enclosing polygon for *C*<sup>∗</sup> as outlined for *C*<sup>1</sup> above and subtracting the resulting polygon from the isochrone of *C*1. After this step, the isochrone is guaranteed to contain all parts of the road network that are reachable and none of the parts that are unreachable. We call this the *separation property*.

### *3.6.2 Application*

Recall the classification of cyclists in Sect. 3.5.2. The cyclists are clustered according to their usage of officially signposted cycle routes. Two different classes are introduced, PRO and CON, with preference values of *α*PRO *>* 0*.*5 and *α*CON *<* 0*.*5. In Fig. 3.15, these two classes are visualized by isochrones with *α* = 0*.*65 for group PRO and *α* = 0*.*45 for group CON. In Fig. 3.15a, the isochrones are stretched out very far in the east-west direction along the course of an official cycling route, reflecting the preference of a cyclist in group PRO for officially signposted cycle ways. Moreover, the isochrones are clinched in the north-south direction, revealing a deficit in the road network in this direction for this group of cyclists: while it is possible to get to these areas, the perceived distance will be much longer than in the east-west direction. Figure 3.15b highlights the routing profile for cyclists of the group CON. The isochrones for this routing profile are almost circular, stretching a similar distance in all directions. This is because in most cases, there is a nonsignposted road very close to a signposted road, such that avoiding these signposted

**Fig. 3.15** Two routing profiles visualized as isochrones. Roads (black) highlighted in blue are part of signposted cycling routes. Displayed are the areas that are reachable for a cyclist in group PRO (**a**) and CON (**b**) within four distinct values of maximum effort (encoded by different brightness values)

roads does not involve large detours and is often as easy as just switching to the road from the existing cycling lane. This routing profile is thus probably much less relevant for planning purposes.

### **3.7 Conclusion and Future Work**

We have presented efficient algorithmic approaches to process and mine huge collections of trajectory datasets and demonstrated their usefulness on volunteered data. The next step is integrating our methods as well as other existing ones into an openly available trajectory processing pipeline, which will be flexibly adapted and augmented to cater for a specific application scenario by the user. This would allow others to easily access and benefit from our developed tools. For example, our anonymization as well as indexing and storing methods could be very useful for the VGI challenges discussed in Part III of this book. In the very next chapter, the focus will be on animal trajectories. Here, anonymization is less of an issue, but efficient mining and visualization tools are crucial. With the rich set of applications for trajectory mining occurring across all VGI domains, we see a clear demand for further development and improvements of algorithms to explore and learn from the information hidden in volunteered trajectory collections in the best possible way.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization, and Social Computing* (VGIscience). In this chapter, results from the projects VGI-Routing (424960421) and Trajectory 2 (314699172) are reviewed.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Uncertainty-Aware Enrichment of Animal Movement Trajectories by VGI**

**Yannick Metz and Daniel A. Keim** 

**Abstract** Combining data from different sources and modalities can unlock novel insights that are not available by analyzing single data sources in isolation. We investigate how multimodal user-generated data, consisting of images, videos, or text descriptions, can be used to enrich trajectories of migratory birds, e.g., for research on biodiversity or climate change. Firstly, we present our work on advanced visual analysis of GPS trajectory data. We developed an interactive application that lets domain experts from ornithology naturally explore spatiotemporal data and effectively use their knowledge. Secondly, we discuss work on the integration of general-purpose image data into citizen science platforms. As part of inter-project cooperation, we contribute to the development of a classifier pipeline to semiautomatically extract images that can be integrated with different data sources to vastly increase the number of available records in citizen science platforms. These works are an important foundation for a dynamic matching approach to jointly integrate geospatial trajectory data and user-generated geo-referenced content. Building on this work, we explore the joint visualization of trajectory data and VGI data while considering the uncertainty of observations. *BirdTrace*, a visual analytics approach to enable a multi-scale analysis of trajectory and multimodal user-generated data, is highlighted. Finally, we comment on the possibility to enhance prediction models for trajectories by integrating additional data and domain knowledge.

**Keywords** Visualization · Trajectory data · User-generated con tent · Animal tracking

Y. Metz (O) · D. A. Keim Universität Konstanz, Konstanz, Germany e-mail: yannick.metz@uni-konstanz.de; keim@uni-konstanz.de

### **4.1 Introduction**

Our goal is to integrate and match the two heterogeneous large data sources to enrich the spatial databases with contextual information. Furthermore, we want to investigate in detail the uncertainty that is found in different VGI data sources. In this chapter, we describe our research toward this goal, by focusing on three core contributions:


Additionally, we present work on using visual analytics for the training of deep learning models for movement prediction, which can support future research in domains of geographic information science and forecasting for biologging data.

In the following, we first present the relevant background that motivates our research. Subsequently, we describe pursued research contributing to the aforementioned points, done in conjunction with domain experts or in close collaboration with other partners of the priority program. Finally, we give a brief outlook on possible future research directions.

### **4.2 Related Work**

VGI contains insight that has the potential to solve fundamental and unsolved social and environmental challenges. The big scientific challenge consists of how to investigate and extract value from noisy VGI data sources. Coxen et al. (2017) stressed that there is a general concern about spatial biases in citizen science datasets. Geldmann et al. (2016) also underline that untrained volunteers might be introducing biases, leading to spatial biases toward densely populated areas and easy-to-watch observations. Integrating and assimilation of VGI into scientific models requires a change of paradigm that embraces uncertainty and bias. Some popular citizen science initiatives have already tried to assimilate VGI to improve the results of biodiversity models. Since 2002, the eBird project has been gathering bird observation records from volunteers around the world. The participation of volunteers has shown a rapid increase in recent years with millions of observations submitted each year. Since then, more than 500,000 users have visited the eBird website (Sullivan et al. 2009). Kelling et al. (2012) proposed a Human/Computer Learning Network for Biodiversity Conservation incorporating VGI coming from eBird using an active learning feedback loop to improve the results of the AI algorithms. Fink et al. (2010, 2013) introduced a spatiotemporal exploratory model,

STEM, and AdaSTEM, to study species distribution models. They used massively crowdsourced citizen science data to construct the corresponding models. The STEM model was afterward integrated into a visual analytics system, BirdVis (Ferreira et al. 2011), that allows the ornithologist to analyze abundance models to understand better bird populations. Coxen et al. (2017) compare two species distribution models using satellite tracking data vs. citizen science datasets coming from eBird datasets. Their results showed the effectiveness of citizen science datasets for this particular use. In other disciplines, such as atmospheric sciences, the shift to the user-generated information paradigm is even harder. Chapman et al. (2017) stated that in atmospheric science, high-quality and precise observation is deeply rooted in the essence of the discipline and other data sources with lowquality, bias, and imprecision are hard to accept. Other antecedents to understanding bird migration behavior and patterns are the work of Jain and Dilkina, who constructed a migration network using K-means clustering and Markov chain model (Jain and Dilkina 2015), and the tool of Wood et al., where they studied the seasonal behavior of bird species within a specific location (Wood et al. 2011). There is also previous work to understand the quality of the observations and uncertainty of the models, by characterizing bird watchers. Cole and Scott in 1999 used Texas Conservation Passport holders and members of the American Birding Association to categorize differences between two different groups of wildlife watching as casual wildlife watchers and serious birders (Cole and Scott 1999). These two groups were defined by their skill level at identifying birds, the frequency of participation, expenditures, and bird-watching behavior. Afterward, Scott conducted another study with Thigpen (Scott and Thigpen 2003) to understand bird watchers' behavior. Data were collected from the bird-watching festival in September 1995 at the Seventh Annual HummerBird Celebration in Rockport/Fulton, Texas. In Scott and Shafer (2001), specialization was measured with regard to birdwatcher behavior, level of skill, and commitment. Based on the this, they categorized birders into casual, interested, active, and skilled birders. The outcomes of the study by Scott and Thigpen have supported McFarlane's investigation of birders in Alberta (McFarlane 1994). She revealed that 80% of the general population in her example were casual or novice birders. This information could be very valuable to quantify the confidence of VGI observations. Previous work shows a glimpse of the unprecedented opportunity to materialize a change in the way scientific models treat data, changing the VGI data paradigm to embrace the uncertainty and bias of data provided by humans as part of the scientific investigation process. From now on, a fascinating endeavor comprises for us the development of new techniques that can bridge the gap between social, exact, and natural sciences by using VGI as an interlinkage among biodiversity and human behavior to provide effective and timely answers to societal calls such as climate change and nature preservation.

Previous work has shown how geo-tagged social media data reflects the spatiotemporal distribution of social groups (MacEachren et al. 2011). For example, scatterplots combine event detection and classification in investigating the geotagged social media, to enable situation awareness (Thom et al. 2012; Cao et al. 2012) for geospatial information diffusion. Our preliminary work in this area covers

the dynamics of social groups and their expressions on social media. Our work on social media bubbles (Diehl et al. 2018) is the first step toward the structuring of complex social relationships on social media. The main connection point between the social group structure as the scaffolding of society and VGI is the uncertainty that humans introduce into the data and the trustworthiness of systems consumed by humans. This work addresses different aspects of uncertainty from a practical point of view. Our work on the visual assessment for visual abstractions (Sacha et al. 2017) addresses the trustworthiness of the users in the systems for the particular case of soccer data. The works above addressed the uncertainty from the perspective of the producer of VGI and the trustworthiness of the users on the systems. During the last few years, the study of uncertainty and its propagation through the visual analytics workflow have gained popularity. Early, in 2015, MacEachren proposed to consider the propagation of uncertainty through the whole VA workflow rather than just the visualization of the uncertainty at the end of the pipeline (MacEachren et al. 2011). He illustrated current challenges and possible approaches to tackle uncertainty using definitions from decision sciences. Kinkeldey et al. analyzed the impact of visually represented geodata uncertainty on decision-making and addressed possible approaches for the evaluation of uncertainty in visualizations. Previously, we tackled the uncertainty aspects of VGI using a theoretical framework (Diehl et al. 2018), which, for the first time, shapes the human factors of uncertainty in VGI and defines a new term "user uncertainty" to enclose them.

### **4.3 Analysis of GPS Trajectory Data**

We start by looking at an analysis that is possible for stand-alone trajectory data in the biologging context. Analysis tools, both visually and algorithmically, build the foundation for our goal of enabling a joint approach for trajectory data and voluntary geographic information.

### *4.3.1 Motivation and Research Gap*

Segmenting biologging time series of animals on multiple temporal scales is an essential step that requires complex techniques with careful parameterization and possibly cross-domain expertise. Yet, there is a lack of visual-interactive tools that strongly support such multi-scale segmentation. To close this gap, we present our MultiSegVA platform for interactively defining segmentation techniques and parameters on multiple temporal scales in our paper *MultiSegVA: Using Visual Analytics to Segment Biologging Time Series on Multiple Scales* (Meschenmoser et al. 2020). MultiSegVA primarily contributes tailored, visual-interactive means and visual analytics paradigms for segmenting unlabeled time series on multiple scales. Further, to flexibly compose the multi-scale segmentation, the platform

contributes a new visual query language that links a variety of segmentation techniques. To illustrate our approach, we present a domain-oriented set of segmentation techniques derived in collaboration with movement ecologists. In the paper, the applicability and usefulness of MultiSegVA are demonstrated in two real-world use cases from movement ecology, related to behavior analysis after environment-aware segmentation and after progressive clustering. Expert feedback from movement ecologists shows the effectiveness of tailored visual-interactive means and visual analytics paradigms in segmenting multi-scale data, enabling them to perform semantically meaningful analyses. Here, we want to highlight two key aspects of the work, the characteristics of biologging time series data and the respective analysis, as well as how we can support this process within the visual analytics framework. For further details, we refer to the paper of Meschenmoser et al. (2020). For our work, we focus on biologging time series of moving animals: these time series have prototypical multi-scale characters and include widely unexplored behaviors, which are hidden in high resolutions and cardinalities. Additionally, biologging-driven movement ecology is an emerging field (Brown et al. 2013; Shepard et al. 2008), triggered by technical advances that enable academia to address open questions in innovative ways. The biologging time series stems from miniaturized tags and gives high-resolution information about, e.g., an animal's location, tri-axial acceleration, and heart rate. Here, semantics are typically distributed on diverse temporal scales, including life stages, seasons, days, day times, and (micro)movement frames. These temporal scales are complemented by spatial scales concerning, e.g., the overall migration range, migration stops, and foraging ranges. There are complex scaleand context-specific conditions (Benhamou 2014; Levin 1992), implying different energy expenditures, driving factors, and decisions for behavior. Hence, segmenting such time series on a single scale with global parameters does not sufficiently address their multi-scale character. The relevance of multi-scale segmentation can be further motivated by three reasons. First, analysts can deepen their understanding of how scales relate to each other: e.g., in terms of nesting relations, next to relative scale sizes and types. A multi-scale perspective can even enable one "to gain an insight on an entire knowledge domain or a relevant sub-part" (Nazemi et al. 2015). Second, even without labeled data or thoroughly parameterized single-scale techniques, it is possible to identify fine-grained patterns that are wrapped by lowerscale, context-yielding patterns. Such fine-grained and context-aware patterns are crucial to enriching existing classification and prediction models. Third, demands for more multi-scale analyses originate from domain literature. Such demands can be found in movement ecology and analysis (Andrienko and Andrienko 2013; Demšar et al. 2015), but also in, e.g., medical sciences (Alber et al. 2019) and social sciences (Cash et al. 2006). However, in practice, segmenting time series on multiple scales is often impeded by several factors. First, multi-scale techniques rely on more in-depth, theoretical foundations and inherent parameters that need to be carefully adapted. Therefore, analysts (e.g., movement ecologists) might require cross-domain expertise in statistical multi-scale time series analysis. Second, even with such expertise, it is difficult to decide on scale properties (e.g., size, dimension, number of scales) and further parameters. Third, we observe a lack of suitable

visual-interactive approaches in related works (Sect. 3.2) that can strongly support and promote segmenting time series on multiple scales.

For MultiSegVa, we defined four requirements in cooperation with domain experts:


### *4.3.2 Approach*

To close the research gap of enabling multi-scale analysis of biologging time series, we present our web-based MultiSegVA platform that allows analysts to visually explore and refine a multi-scale segmentation, which results from a simple way of setting segmentation techniques on multiple scales. In the context of multi-scale segmentation, MultiSegVA primarily contributes to the use of tailored visual-interactive features and established VA paradigms. To flexibly configure segmentation techniques and parameters, MultiSegVA includes a new visual query language (VQL, C2) that links a variety of segmentation techniques across multiple scales. These techniques stem in the present case from a set that was derived together with movement ecologists and covers typical domain use cases. Figure 4.1 shows the main window of the *MultiSegVA* system. Here, the analyst can build visual queries and analyze the existing segmentation results in a hierarchy visualization,

**Fig. 4.1** A screenshot from **MultiSegVA**

**Fig. 4.2** The visual query language interface: In the left column, multiple time series segmentation methods can be selected. The hierarchy of applied methods can be changed via an interactive interface, shown in the second column. Finally, detailed settings for each method can be changed in the third column

which in turn is closely linked to one- and multidimensional time series plots. It is also possible to access additional details of segments via a temporal detail window or inspect the underlying trajectories on a map. MultiSegVA implements a feedback loop for iterative analysis and refinement of the segmentation. After importing a time series, e.g., GPS trajectory data of tracked animals via Movebank (Wikelski M 2023), the analyst can start to analyze the time series. After an initial visual inspection of the time series dimensions, an analyst can steer hierarchical time series segmentation using the visual query language (VQL). The interface is shown in Fig. 4.2. The VQL serves three purposes here: (1) Different types of segmentation techniques can be easily arranged across different time scales, (2) the hierarchical application order can be defined by manipulating with building blocks, and (3) the chosen techniques can be interactively parameterized. The query interface first provides a list of available techniques organized by category and can recommend appropriate techniques. To modify the hierarchical application of techniques, selected techniques are arranged as visual building blocks which can be modified by drag-and-drop interactions. This can alleviate some issues that arise with text-based queries. In particular, it avoids changing the ordering of nested queries, which can be tedious and error-prone in text-based queries. The query language also provides several selectors and operations to chain and link different techniques at the same or different scales. The finalized query is then processed in the backend of the application, and the results are visualized via the icicle hierarchy view; see the top of Fig. 4.1. The analyst can choose to adapt the query based on the achieved results to iteratively improve the segmentation results and get a detailed understanding of the time series data.

### *4.3.3 Results*

MultiSegVA enables the comprehensive exploration and refining of multi-scale segmentation by tailored visual-interactive features and VA paradigms. MultiSegVA includes segment tree encoding, subtree highlighting, guidance, density-dependent features, adapted navigation, multi-window support, and a feedback-based workflow. The VQL facilitates exploring and parameterizing different multi-scale structures. Still, a few aspects remain for further reflection. The icicle visualization meets expert requests and has several benefits. Yet, guiding the user by color to interesting parts of the segment tree is a challenging task. We tested global, level-based, and sibling-based guidance variants according to color fills. We chose sibling-based guidance (i.e., all siblings of one hovered segment are colored) that optimally captures local similarities while requiring more navigation effort across levels and nodes. Upcoming works will include an even more effective variant, i.e., guidance to local similarities with little interaction and one fixed color scale. Our VQL makes it trivial to build a multi-scale segmentation. Query building is a play with building blocks that benefits from strong abstraction and simple interactions. Rather, it is difficult to decide which multi-scale structure and building blocks are most appropriate: a decision that depends on data, analysis, and tasks. MultiSegVA facilitates this decision through extensive documentation, technique categorization, few technique parameters, and short processing times in a compact workflow. For further support, we plan predefined queries, and instant responses at query building, next to the parameter and technique suggestions. For suggesting parameters, we will apply estimators (Catarci et al. 1997; Yao 1988) for the number of change points as well as the elbow method for knn-searches. While motif length and HDBSCAN's minPts (Campello et al. 2013) optimally benefit from domain expertise, suggesting other parameters will simplify the interaction and can address another limitation. Now, a technique processes each segment of one scale with the same parameters; thus, slight data-dependent parameter modifications will be examined. For technique suggestions, we envision for each technique a scale-wise relevance score that reflects data properties and is part of a rule-based prioritization, shaped by domain expertise and meaningful hierarchies. It is essential to depict the semantics into which MultiSegVA can provide insights. First of all, MultiSegVA illuminates diverse multi-scale structures and gives insights into how scales relate to each other. Coarse behaviors can be distinguished by relatively simple techniques, motifs show repetitive behaviors, and knn-searches allow the matching with already explored segments. Segment lengths and similarities can be explored, next to local anomalies and spatial contexts. However, with the current techniques, it is difficult to broadly capture deeper, behavioral semantics (e.g., chew, scratch). Hereto more complex or learning techniques (e.g., HMMs, SVMs) will be needed that neither overfill the interface nor delimit generalizability due to the lack of learned patterns. The latter point goes hand in hand with our major limitation and the corresponding implication for upcoming work: integrating even more intelligent methods and automatism. These plans all relate to aspects from above, i.e., better guidance,

technique, and more parameter suggestions, as well as techniques for deeper behavioral semantics. MultiSegVA relies on requirements from movement ecology experts and stands for an iterative, extensively collaborative, and interdisciplinary process. We can gather domain feedback on several stages, derive a domain-oriented set of techniques, and even link MultiSegVA to Movebank with *>*2*.*2 billion animal locations. With this application domain-focused, MultiSegVA underpins the value of multi-scale analyses and is certainly another step forward "to empower the animal tracking community and to foster new insight into the ecology and movement of tracked animals" (Spretke et al. 2011). Meanwhile, our third use case shows that MultiSegVA variants for other domains are conceivable, especially with tailored domain-oriented technique sets. This generalizability is promoted by the platform's I/O features and its ability to handle heterogeneous time series, with *>*1*.*2 million records.

### **4.4 Analysis of VGI Contributor Data**

In joint collaborative work with partners of the priority program, we investigate the utility of using a novel pipeline based on a deep learning-based image classifier to integrate images from the social media platform *Flickr* with data from citizen science platforms: *A text and image analysis workflow using citizen science data to extract relevant social media records: combining red kite observations from Flickr, eBird and iNaturalist* (Hartmann et al. 2022). In our research agenda, this work serves a dual role: (1) We explore the characteristics of VGI image data in our chosen domain of migratory birds, as well as automated integration techniques, and (2) we integrate contributions from non-verified data sources, which directly connects to the research topic of uncertainty in data sources. Specifically, the confidence of the developed classification pipeline might be directly used as an uncertainty measure for the matching process we introduce in the following chapter.

### *4.4.1 Motivation and Research Gap*

There is an urgent need to develop new methods to monitor the state of the environment. One potential approach is to use new data sources, such as user-generated content, to augment existing approaches. Despite a wide range of works discussing and demonstrating the potential of new data forms in the creation of indicators, we could not identify previous research which explicitly created a workflow designed to integrate data from different sources and of different modalities. Furthermore, although the properties of different forms of UGC are relatively well understood, they have not been effectively used to develop reproducible workflows. Finally, most studies evaluate the quality of extracted information in isolation through metrics such as precision and recall, but do not explore the added value of integrating data.

In the paper, we propose, implement, and evaluate a workflow taking advantage of citizen science data documenting and recording sightings of birds and more specifically red kites (*Milvus milvus*). Analyzing social media data until recently has often used simple keyword-based methods to perform an initial filtering or search step, meaning that content tagged in other ways was not found. However, improvements in content-based classification now mean that it is also possible to use off-the-shelf, pre-trained algorithms to reliably identify predefined classes such as the presence of buildings, people, or birds in image data with reasonable accuracy.

### *4.4.2 Approach*

We take a new approach, using citizen science projects recording sightings of red kites (*Milvus milvus*) to train and validate a convolutional neural network (CNN) capable of identifying images containing red kites. This CNN is integrated into a sequential workflow that also uses an off-the-shelf bird classifier and text metadata to retrieve observations of red kites in the Chilterns, England. Our workflow reduces an initial set of more than 600,000 images to just 3065 candidate images. Manual inspection of these images shows that our approach has a precision of 0.658. A workflow using only text identifies 14% fewer images than that including image content analysis, and by combining image and text classifiers, we achieve an almost perfect precision of 0.992. Images retrieved from social media records complement those recorded by citizen scientists spatially and temporally, and our workflow is sufficiently generic that it can easily be transferred to other species.

Flickr is a social media site, where individuals can upload photographs and metadata, including tags and locations in the form of coordinates. Flickr's usage has declined in recent years, but it remains very popular in research, mostly because of its well-documented and easy-to-use API, which allows querying using search terms and bounding boxes. Our citizen scientist data came from two platforms: iNaturalist and eBird. iNaturalist allows participants to upload images of organisms such as plants and insects to the platform and use its community to crowdsource taxonomic identification. Currently, according to their website (https://inaturalist. org), iNaturalist hosts nearly 100 million observations of over 375000 species and is, therefore, one of the largest and most successful citizen science projects to date (Unger et al. 2020). eBird has similar features to iNaturalist but as a platform is exclusively specialized in bird observations. Their website states (https://ebird.org) that "eBird is among the world's largest biodiversity-related science projects, with more than 100 million bird sightings contributed annually." It predominantly hosts observation location data, but also corresponding bird images, as well as bird sounds (Sullivan et al. 2009; Wood et al. 2011).

Since our workflow is designed to be generic, take advantage of the text and image data, and combine records from citizen science reports with social media data, it uses a combination of a simple rule-based approach, existing pre-trained models, and a model trained specifically for our target species. Our approach is designed

to take advantage of what we assume to be high-quality data collected by citizen scientists with an interest in ornithology, use off-the-shelf models where possible, and reduce the initial number of social media posts in a given region to a manageable size for manual verification. In the following, we want to give a summary of the proposed workflow:


The workflow creates a high-confidence dataset of images that can be integrated with existing citizen science platform data. As part of the paper, we ran a detailed study of the characteristics of the different data sources, namely, *Flickr*, *eBird*, and *iNaturalist*. In the chosen target area, we compare (1) spatial coverage, (2) temporal distribution, (3) contributor patterns, as well as (4) image data quality.

### *4.4.3 Results*

Our workflow aimed to extract relevant images of red kites from Flickr data and to use these to complement citizen science records from eBird and iNaturalist. In the following, we, therefore, explore the following aspects of the results we obtained:


The workflow returned 3065 candidate images, downsampling the original dataset by 99.5%. These images were then individually inspected to identify true


**Table 4.1** Precision using different combinations of the components in the workflow

and false positives and allow us to calculate the precision. Images were marked as true positives if a red kite was identifiable in an image. This meant that images had to be sufficiently clear, such that distinctive features of red kites (e.g., their forked tails or red-brown coloring) were visible. Images where a bird was visible, but not unambiguously identifiable, images showing feathers or pellets, and images that were obviously irrelevant were all marked as false positives. A total of 2017 records were thus identified as true positives, with 1048 false positives and a resulting precision for the complete workflow of 0.658.

To understand the benefits of text and image analysis, we ran the components of the workflow individually and annotated any additional images extracted (Table 4.1).


We found 1419 Flickr posts that were included by both settings, of which 1407 were true positives. This means that by only retaining candidate records identified by both textual and image-based information, we can achieve an almost perfect precision of 0.992. We then checked for records that were exclusively identified by either text or image analysis. Five hundred and thirty-nine posts were only detected by the textual analysis (point 1 in the list above), and 316 were only detected by the visual analysis (point 2 in the list above). Combining these results leads to a total of 3559 records, of which 2262 are true positives and 1297 are false positives, and a precision of 0.636. Looking back at the performance of our initial integrated workflow, we note that 245 (12%) additional true positive red kite posts were extracted by merging the results of separately performed textual and visual analysis. This increase in recall is at the cost of a very slight reduction in the precision of 0.02. Summarizing these findings, 62% of true positives were found using either text or image analysis. Twenty-four percent are only correctly classified by textual data, and 14% are missed if no visual analysis is performed.

We find that the workflow functions as a data filter, reducing the data volume by 99.5%. By reducing the data volume, it becomes realistic to analyze the remaining data manually to select true positives. The workflow thus addresses the research gap identified by Burke et al. (2022), using generalizable methods to extract target data from various unverified sources to enrich data.

We found that while keyword matching delivered high precision with little evidence of ambiguity, image analysis returned more potential candidates than textual analysis, but with lower precision. By only retaining posts identified by both textual and visual analysis, they were able to achieve almost perfect precision (0.992), at the cost of a lower recall. By combining the two approaches, they increased the extracted data volume by almost 14% while still downsampling the original dataset by around 99.5% and with a precision 0.636.

The visual distribution of points on the map in the article shows how different sources can complement one another when trying to determine patterns. The locations of Flickr posts tend to cluster around urban areas and points of interest along existing road networks, which suggests that the Flickr observations are often taken opportunistically. eBird and iNaturalist observations, on the other hand, are more heterogeneously allocated and show less obvious relationships to known spatial features, suggesting that birdwatchers go out with a clear intent to observe birds and seek a variety of locations for that purpose.

The study found that the temporal coverage of red kite observations in the Chilterns was different on a yearly and monthly scale. Aggregating data over years showed that the pattern shown by Flickr was different from the ones of eBird and iNaturalist. Year-on-year changes appeared to be more driven by underlying platform dynamics, such as user base and popularity changes. The study found that the rapid drop in Flickr observations from 2012 onward represented a decrease in Flickr popularity rather than a decline in red kites in the Chilterns. On the other hand, the study found that there was a strong increase for eBird and iNaturalist from the year 2016 onward. This could be the result of increased popularity, increased interest in red kites, or increased visits to the study region. Looking at monthly temporal scales showed a trend toward the warmer spring and summer months between March and June. These results may suggest higher visitation rates to the Chilterns in warmer periods, but could also be influenced by specific red kite behavioral patterns. Investigating the number of unique users per data source revealed that representativeness varies between platforms. eBird data was contributed by the fewest individuals, whereas Flickr and iNaturalist offered a more diverse user base. This observation could be attributed to higher platform popularity and an overall larger user base of the latter. Knowing the share of the population represented by a UGC-based analysis is crucial for policymakers to make adequate decisions that reflect the people's opinion (Wang et al. 2019b).

The image quality analysis revealed clear differences between social media data on Flickr and citizen science data in eBird and iNaturalist. Flickr's users are interested in capturing scenic and visually pleasing images, while eBird and iNaturalist users are more concerned about capturing the target species itself as proof of observation and less about the image quality. This discovery may point

to the potential usefulness of social media data for the identification and tracking of individuals.

### **4.5 BirdTrace: A Visual Analytics Application to Jointly Analyze Trajectory Data and VGI**

### *4.5.1 Motivation and Research Gap*

BirdTrace makes use of two primary data sources: GPS data from tagged birds and user-generated content from birders. GPS data typically is of high quality but is only available on a small scale, while user-generated data is more abundant but of variable quality. By combining these two data sources, BirdTrace can provide a more complete picture of bird populations. The system uses a dynamic matching approach to semantically enrich trajectory data with geo-referenced data like images or textual descriptions.

### *4.5.2 Automated Matching*

A key step was the development of semi-automatic methods to extract, integrate, and match data from VGI and tracked spatiotemporal datasets. This has allowed for further knowledge to be gained about individuals and populations of animals, including information about local animal habitats, animal migrations across continents, landuse change, biodiversity loss, invasive species, the spread of diseases, and climate change. There are yet no existing methods and systems that integrate and fuse mixed VGI data from birdwatchers and tracked trajectories of wildlife animals from the ICARUS (Movebank), so we developed them as part of the project.

We have already described the analysis of the trajectories, but now want to find relevant VGI contributions (e.g., images, video, audio, or text descriptions) for trajectories. Here, *relevance* refers to "how well a domain expert can use the found VGI contributions to answer specific questions." As this implies, the criteria for relevance might therefore depend on the problem. We tackle this problem by giving users the possibility to choose between different matching criteria, e.g., based on spatial or temporal distance, and potential classified behaviors like breeding. By including additional data sources, one might increase the number of possible matching criteria in the future. Let's look at the way we enable automated matching; see Fig. 4.3:


**Fig. 4.3** The pipeline and workflow of the BirdTrace system. We combine animal movement trajectory data with VGI data from different citizen science portals. An automated matching approach is used to filter relevant VGI contributions to respective movement trajectories. Both trajectories and VGI data points are then jointly visualized in a shared visual interface. The interface enables aggregating, filtering, or annotating the given data


To facilitate the matching process, we implemented a data processing pipeline, which applies appropriate preprocessing to both the GPS trajectory data and the VGI contributions. We apply steps like line simplification, motif discovery, and outlier detection to the trajectory data to reduce the size of the data and to simplify the matching computations. VGI contributions from VGI portals like eBird, iNaturalist, and GBIF are collected and processed. To simplify analysis, we use precomputing and caching of data. This enables the efficient clustering of VGI contributions and fast matching of VGI contributions with trajectory data.

### *4.5.3 A Joint Visual Analytics Workspace*

We developed **BirdTrace**, 1 a novel visual analytics method and interfaces to support the semantic annotation of the integrated database consisting of the VGI and the tracked trajectory data. The goal of the application is to add context information semantically from a domain expert (ornithologist) to increase the quality and enrich the integrated data sources previously discussed. The primary challenge is to reduce the uncertainty of the combined data sources and to raise awareness of the remaining uncertainty in our resulting database. Specifically, the tool allows annotating spatiotemporal databases and VGI in their semantic context. We will further support the annotation process with a semi-automatic process to enable the fast and reliable annotation of large datasets. We have to add semi-automatically

<sup>1</sup> Available at https://birdtrace.dbvis.de.

**Fig. 4.4** A screenshot from **BirdTrace** 

annotation to the dataset, as the domain experts do not have the time to review a large number of matchings.

Using the semantic annotations, we will be able to enrich our joined database with more knowledge from domain experts to increase the data quality of our integrated database. For instance, an ornithologist can verify or oppose the VGI information. The semantic annotation will also assist to clean and prune possibly incorrect merged data records and increase the awareness of uncertainty. Figure 4.4 shows the user interface of BirdTrace, showing a spatial map view on top, and a temporal "timeline" view on the bottom.

### **4.6 Data-Driven Modeling of Tracked and Observed Animal Behavior**

Finally, we explored the challenging tasks of training prediction models, applicable, e.g., to animal trajectory forecasting. To improve uncertainty-aware prediction models for animal trajectories, we explored techniques from deep imitation and reinforcement learning. Specifically, we explored how to use data-driven deep learning methods to predict the movement of fish swarms. A complete presentation of results on predictive models would be beyond the scope of this chapter. We, therefore, want to focus on the workflow and, specifically, how concepts from visual analytics can be used here. We leave a discussion on model implementations for future work. In general, although deep learning-based approaches for related tasks are very promising, we still observe low adoption. Multiple challenges hinder the application of reinforcement learning algorithms in experimental and real-world use cases. Such challenges occur at different stages of the development and deployment of such models. While reinforcement learning workflows share similarities with

machine learning approaches, we argue that distinct challenges can be tackled and overcome using visual analytic concepts. Thus, we propose a comprehensive workflow for reinforcement learning and present an implementation of our workflow incorporating visual analytic concepts integrating tailored views and visualizations for different stages and tasks of the workflow (Metz et al. 2022). In this final section, we would like to shine a light on how our workflow supports experimentation in this space and encourage future research in the application of novel RL-based methods for a wide range of problems in the context of geoinformatics and VGI, e.g., trajectory forecasting.

### *4.6.1 Motivation and Research Gap*

Recently, there have been notable examples of the capabilities of reinforcement learning (RL) in diverse fields like robotics (Nguyen and La 2019), physics (Martín-Guerrero and Lamata 2021), or even video compression (Mandhane et al. 2022). Despite these successes, the application and evaluation of recent deep reinforcement and imitation learning techniques in real-world scenarios are still limited. Existing research almost exclusively focuses on synthetic benchmarks and use cases (Bellemare et al. 2013). We argue that the usage and evaluation in realistic scenarios is a mandatory step in assessing the capabilities of current approaches and identifying existing weaknesses and possibilities for further development. In this chapter, we present a visual analytics workflow and an instantiation of the approach that facilitates the application of state-of-the-art algorithms to various scenarios. Our presented approach is designed specifically to support domain experts, with basic knowledge of core concepts in reinforcement learning, who are interested in applying RL algorithms to domain-specific sequential decision-making tasks. The goal is to enable the effective application of their knowledge to (1) design agents and simulation environments including reward functions and (2) a detailed assessment of trained agents' capabilities in terms of performance, robustness, and traceability. A structured and well-defined approach can also help to critically investigate and combat some fundamental difficulties of reinforcement learning like *brittleness*, *generalization* to new tasks and environments, and issues of *reproducibility* (Dulac-Arnold et al. 2020; Henderson et al. 2017).

Outside of reinforcement and imitation learning, there exists a wide range of workflows and interactive visual analytics (VA) tools for the training and evaluation of ML models (Amershi et al. 2015; Endert et al. 2018; Spinner et al. 2020). Compared to other fields of machine learning, there has been less work on applying visual analytics in the space of reinforcement and especially imitation learning. A large number of necessary decisions and the existence of interconnected tasks make the application of interactive machine learning, with a close coupling of model and human, especially valuable for reinforcement learning.

Existing work such as *DQNViz* by Wang et al. (2019a) enables the analysis of spatial behavior patterns of agents in Atari environments like breakout (see arcade learning environment (Bellemare et al. 2013)) using visual analytics. He et al.

present *DynamicsExplorer* (He et al. 2020) to evaluate and diagnose a trained policy in a robotics use case, which incorporates views to track the trajectories of a ball in the maze during episodes. The application enables the inspection of the effect of real-world conditions for trained agents. Saldanha et al. (2019) showcase an application that supports data scientists during experimentation by increasing situational awareness. Key elements are thumbnails summarizing agent performance during episodes and specialized views to understand the connection between particular hyperparameter settings and training performance.

Compared to the existing approaches, we (1) extend the existing frameworks to encompass a holistic view of the relevant stages of the reinforcement learning process instead of just sub-tasks; (2) present a generic, easily adaptable application, which can be instantiated to specific use cases; (3) explicitly consider imitation learning, due to the frequent use in conjunction with reinforcement learning; and (4) apply our framework in a novel, custom real-world use case instead of an existing benchmark environment.

### *4.6.2 Approach*

There has not been a comprehensive workflow for the experimentation and application of reinforcement learning tightly incorporating users. This leaves both researchers and practitioners to loosely defined best practices. In the following chapter, we outline a conceptual workflow for developers and researchers, which we base on guides, projects, and popular open-source libraries. We follow the terminology used, e.g., in the *Gym* package (Brockman et al. 2016). As a starting point, we consider the fundamental workflow from Sacha et al. (2019) that is aimed at generic ML tasks:


We are interested in highlighting steps and tasks that are specific and critical to reinforcement and imitation learning and which have not been captured previously by more generic workflows. Figure 4.5 summarizes our proposed workflow described in this chapter. In the paper, we discuss the specific stages for imitation and reinforcement learning mirroring the workflow. Specifically, we highlight user tasks during (1) setup and design of the environment which corresponds to mapping a domain-specific problem to a setup applicable to RL algorithms, (2) model training and supervision, and finally (3) evaluation and understanding of trained models. For each of these steps, we present further detailed user tasks and highlight how

**Fig. 4.5** Overview of the **RIVA** (**R**einforcement and **I**mitation Learning with **V**isual **A**nalytics) workflow. **RIVA** is an integrated experimentation workflow and application that provides a range of tools to support all major critical steps: (**a**) inspecting observations, actions, and rewards and ensuring matching values between simulation, expert demonstrations, and architectures, (**b**) provenance tracking of interesting states to enable targeted case-based evaluation, (**c**) tracking of parameters and settings to ensure reproducibility and understand the effect of design decision, (**d**) interactively monitor training and final performance beyond reward, (**e**) enable effective evaluation by integrating multiple evaluation tools, and (**f**) explain behavior by natively integrating XAI methods like input attribution techniques

visual analytics concepts are applicable. We apply the proposed framework and developed an application in the use case of imitation and reinforcement learning for collective behavior: data-driven learning of the behavior of fish schools (collective movement of fish swarms). We cooperated with a domain expert throughout the entire process, from designing custom environments and agents, training, to final evaluation. Modeling the behavior of individual actors in swarm systems has been a long-standing problem in biology (Reynolds 1987; Sumpter 2006; Calovi et al. 2013). Learning individual policies that lead to coordinated collective behavior via both reinforcement learning and imitation learning from recorded trajectories is an exciting application that promises to overcome existing simplifications in handcrafted models.

### *4.6.3 Results and Discussion*

The use case can be well integrated into our workflow and application with minimal modifications. Noticeably, a custom interactive rendering of the environment was added. We utilize the modularity of the software to integrate additional components like custom visualizations. During the design phase, the inspection views were used to ensure consistency between environment, agent, and dataset, e.g., to spot premature episode termination. Our workflow was highly effective in maintaining

a high level of productivity and consistency through an iterative design process, in which we experimented with different observation space designs, reward functions, network types, and hyperparameter configurations. The set of evaluation tools is used both for internal evaluation and external presentation.

### **4.7 Discussion and Conclusion**

In this chapter, we have given an overview of research contributing to the overall goal of enriching high-quality sparse and curated data with VGI contributions to enable different applications like analysis of species distribution or prediction modeling of movement. In particular, we highlighted the potential of visual analytics solutions for different stages of curation, analysis, and model building using VGI data. Visualizations can be especially suited to present data of varying quality and express uncertainty. Both our joint collaborative work on integrating Flickr images with citizen science data and the BirdTrace platform highlight the potential of integrating different data sources, ranking from citizen science platforms, social media, to professional data collection efforts.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization, and Social Computing* (VGIscience). In this chapter, results from the project *Uncertainty-Aware Enrichment of Animal Movement Trajectories by VGI* (BirdTrace, Project number 314671965) are reviewed.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Two Worlds in One Network: Fusing Deep Learning and Random Forests for Classification and Object Detection**

### **Christoph Reinders, Michael Ying Yang, and Bodo Rosenhahn**

**Abstract** Neural networks have demonstrated great success; however, large amounts of labeled data are usually required for training the networks. In this work, a framework for analyzing the road and traffic situations for cyclists and pedestrians is presented, which only requires very few labeled examples. We address this problem by combining convolutional neural networks and random forests, transforming the random forest into a neural network, and generating a fully convolutional network for detecting objects. Because existing methods for transforming random forests into neural networks propose a direct mapping and produce inefficient architectures, we present neural random forest imitation—an imitation learning approach by generating training data from a random forest and learning a neural network that imitates its behavior. This implicit transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is differentiable, can be used as a warm start for fine-tuning, and enables end-to-end optimization. Experiments on several realworld benchmark datasets demonstrate superior performance, especially when training with very few training examples. Compared to state-of-the-art methods, we significantly reduce the number of network parameters while achieving the same or even improved accuracy due to better generalization.

**Keywords** Classification · Neural networks · Random forests · Object detection · Localization · Imitation learning

M. Y. Yang

C. Reinders (O) · B. Rosenhahn

Institute for Information Processing, L3S/Leibniz University Hannover, Hannover, Germany e-mail: reinders@tnt.uni-hannover.de; rosenhahn@tnt.uni-hannover.de

Scene Understanding Group, University of Twente, Enschede, Netherlands e-mail: michael.yang@utwente.nl

### **5.1 Introduction**

During the last few years, the availability of spatial data has rapidly developed. An essential aspect of this development is the involvement of a large number of users, who often use smartphones and mobile devices, to generate and make freely available volunteered geographic information (VGI). For example, apps like *Waze* combine the local velocities of smartphones (in cars) to predict the flow velocities (and time delay) of traffic jams. Users can recommend and comment on specific traffic situations. Although GPS and gyroscope data (e.g., in fitness straps) are common, images allow a comprehensive scene understanding. The collection of large amounts of unlabeled images is easy; however, the development of machine learning methods for scene analysis with limited amounts of labeled data is challenging.

Neural networks have become very popular in many areas, such as computer vision (Krizhevsky et al. 2012; Reinders et al. 2022; Ren et al. 2015; Simonyan and Zisserman 2015; Zhao et al. 2017; Qiao et al. 2021; Rudolph et al. 2022; Sun et al. 2021a), speech recognition (Graves et al. 2013; Park et al. 2019; Sun et al. 2021a), automated game-playing (Mnih et al. 2015; Dockhorn et al. 2017), or natural language processing (Collobert et al. 2011; Sutskever et al. 2014; Otter et al. 2021). Researchers have published many datasets for training neural networks and put enormous effort into providing labels for each data sample. For real-world applications, the dependency on large amounts of labeled data represents a significant limitation (Breiman et al. 1984; Hekler et al. 2019; Barz and Denzler 2020; Qi and Luo 2020; Phoo and Hariharan 2021; Wang et al. 2021). Frequently, there is little or even no labeled data for a particular task, and hundreds or thousands of examples have to be collected and annotated. This particularly affects new applications and rare labels (e.g., detecting rare diseases or defects in manufacturing). Transfer learning and regularization methods are usually applied to reduce overfitting. However, for training with little data, the networks still have a considerable number of parameters that have to be fine-tuned—even if just the last layers are trained.

In contrast to neural networks, random forests are very robust to overfitting due to their ensemble of multiple decision trees. Each decision tree is trained on randomly selected features and samples. Random forests have demonstrated remarkable performance in many domains (Fernández-Delgado et al. 2014). While the generated decision rules are simple and interpretable, the orthogonal separation of the feature space can also be disadvantageous on other datasets, especially with correlated features (Menze et al. 2011). Additionally, random forests are not differentiable and cannot be fine-tuned with gradient-based optimization.

In this research project *Comprehensive Conjoint GPS and Video Data Analysis for Smart Maps* (COVMAP), we are interested in combining GPS, gyroscope, and image data to analyze road and traffic situations for cyclists and pedestrians. Our standard setting is a smartphone attached to a bicycle, which records the GPS coordinates, images, motion information, local weather information, and time. We present a framework for detecting traffic signs that are of interest for cyclists and pedestrians. Related to this work, Chap. 3 introduces methods for anonymizing and map-matching trajectories, and Chap. 1 presents a geographic knowledge graph for a semantic representation of geographic entities in OSM. The goal of this work is to minimize the costs of annotating a dataset and enable the detection of objects with only a handful of examples per class. For that, we combine neural networks and random forests and bring both worlds together. After generating a classifier for image patches, the random forest is mapped to a neural network to combine all modules in a single pipeline, and a fully convolutional network is created for object detection.

Mapping random forests into neural networks is already used in many applications such as network initialization (Humbird et al. 2019), camera localization (Massiceti et al. 2017), object detection (Reinders et al. 2018), or semantic segmentation (Richmond et al. 2016). State-of-the-art methods (Massiceti et al. 2017; Sethi 1990; Welbl 2014) create a two-hidden-layer neural network by adding a neuron for each split node and each leaf node of the decision trees. The number of parameters of the networks becomes enormous as the number of nodes grows exponentially with the increasing depth of the decision trees. Additionally, many weights are set to zero so that an inefficient representation is created. Due to both reasons, the mappings do not scale and are only applicable to simple random forests.

In this work, we present an imitation learning approach to generate neural networks from random forests, which results in very efficient models. We introduce a method for generating training data from a random forest that creates any amount of input-target pairs. With this data, a neural network is trained to imitate the random forest. Experiments demonstrate that the accuracy of the imitating neural network is equal to the original accuracy or even slightly better than the random forest due to better generalization while being significantly smaller. To summarize, our **contributions** are:


### **5.2 Related Work**

Many deep learning-based methods have been presented for object detection in recent years. Two-stage methods like R-CNN (Girshick et al. 2014), Fast R-CNN (Girshick 2015), and Faster R-CNN (Ren et al. 2015) include a region proposal mechanism and predict the object scores and boundaries based on the pooled features. Cascade R-CNN (Cai and Vasconcelos 2018) consists of multiple R-CNN stages that progressively refine the predicted bounding boxes. Sparse R-CNN (Sun et al. 2021b) learns a fixed set of bounding box candidates. One-stage methods achieve great performance by regressing and classifying candidate bounding boxes of a predefined set of anchor boxes. Well-known methods are SSD (Liu et al. 2016), YOLO (Redmon and Farhadi 2016), and RetinaNet (Lin et al. 2017). CenterNet (Duan et al. 2019) introduces a triplet representation, including one center keypoint and two corners. FCOS (Tian et al. 2019) presents a center-ness branch for anchorfree detection. YOLOF (Chen et al. 2021) uses a single-scale feature map without feature pyramid network. DETR (Carion et al. 2020) models object detection as a set prediction problem and introduces a vision transformer architecture. R(Det)2 (Li and Wang 2022) presents a combination of soft decision trees and neural networks for randomized decision routing. All the presented methods have a huge number of trainable parameters and require large amounts of labeled data for training.

Random forests and neural networks share some similar characteristics, such as the ability to learn arbitrary decision boundaries; however, both methods have different advantages. Random forests are based on decision trees. Various tree models have been presented—the most well known are C4.5 (Quinlan 1993) and CART (Breiman et al. 1984). Decision trees learn rules by splitting the data. The rules are easy to interpret and additionally provide an importance score of the features. Random forests (Breiman 2001) are an ensemble method consisting of multiple decision trees, with each decision tree being trained using a random subset of samples and features. Fernández-Delgado et al. (2014) conduct extensive experiments comparing 179 classifiers on 121 UCI datasets (Dua and Graff 2017). The authors show that random forests perform best, followed by support vector machines with a radial basis function kernel. Therefore, random forests are often considered as a reference for new classifiers.

Neural networks are universal function approximators. The generalization performance has been widely studied. Zhang et al. (2017) demonstrate that deep neural networks are capable of fitting random labels and memorizing the training data. Bornschein et al. (2020) analyze the performance across different dataset sizes. Olson et al. (2018) evaluate the performance of modern neural networks using the same test strategy as Fernández-Delgado et al. (2014) and find that neural networks achieve good results but are not as strong as random forests.

Sethi (1990) presents a mapping of decision trees to two-hidden-layer neural networks. In the first hidden layer, the number of neurons equals the number of split nodes in the decision tree. Each of these neurons implements the decision function of the split nodes and determines the routing to the left or right child node. The second hidden layer has a neuron per leaf node in the decision tree. Each of the neurons is connected to all split nodes on the path from the root node to the leaf node to evaluate if the data is routed to the respective leaf node. Finally, the output layer is connected to all leaf neurons and aggregates the results by implementing the leaf votes. By using hyperbolic tangent and sigmoid functions, respectively, as activation functions between the layers, the generated network is differentiable and, thus, trainable with gradient-based optimization algorithms. The method can be easily extended to random forests by mapping all trees.

Welbl (2014) and Biau et al. (2019) follow a similar strategy. The authors propose a method that maps random forests into neural networks as a smart initialization and then fine-tunes the networks by backpropagation. Two training modes are introduced: *independent* and *joint*. Independent training fits all networks one after the other and creates an ensemble of networks as a final classifier. Joint training concatenates all tree networks into one single network so that the output layer is connected to all leaf neurons in the second hidden layer from all decision trees and all parameters are optimized together. Additionally, the authors evaluate sparse and full connectivity. Sparse connectivity maintains the tree structures and has fewer weights to train. In practice, sparse weights require a special differentiable implementation, which can drastically decrease performance, especially when training on a GPU. Full connectivity optimizes all parameters of the fully connected network. Massiceti et al. (2017) extend this approach and introduce a network splitting strategy by dividing each decision tree into multiple subtrees. The subtrees are mapped individually and share common neurons for evaluating the split decision.

These techniques, however, are only applicable to trees of limited depth. As the number of nodes grows exponentially with the increasing depth of the trees, inefficient representations are created, causing extremely high memory consumption. In this work, we address this issue by proposing an imitation learning-based method that results in much more efficient models.

### **5.3 Traffic Sign Recognition**

In the first part of this chapter, we present a framework for object detection and localization that is able to recognize traffic signs for cyclists and pedestrians with very few labeled examples. While there are a lot of datasets for cars, the amount of labeled data for cyclists and pedestrians is very limited. Therefore, the advantages of convolutional neural networks and random forests are combined to build a robust object detector. After the detection of the objects, the image, GPS, and motion data are fused to localize the traffic signs on the map. We introduce an app for collecting and synchronizing data with a customary smartphone and present the captured dataset. Finally, experiments are performed to analyze the recognition performance. All details and further evaluations can be found in Reinders et al. (2018) and Reinders et al. (2019).

### *5.3.1 Framework*

The framework consists of three modules. First, a system for object detection based on convolutional neural networks and random forests is presented. Afterward, the detected traffic signs are localized on the map by integrating GPS and motion information. Lastly, multiple observations are clustered to improve the precision.

### **5.3.1.1 Object Detection**

In the first step, we train a convolutional neural network for representation learning on a related task where large amounts of data are available. In this application, the GTSRB (Stallkamp et al. 2012) dataset is selected, which consists of images of traffic signs for cars. The architecture of the network is a standard backbone (Springenberg et al. 2015) with multiple convolutional layers and a global average pooling. For generating the feature representations, the output of the last layer before the final classification layer is calculated.

On the downstream task, we start with a classifier for image patches. The feature representations of all patches and a fixed number of background patches are extracted. Because only a few number of labeled examples are available, we train a random forest to classify the image features and predict one of the *C* classes or background. The ensemble of multiple decision trees trained on different subsets of features and samples is very robust to overfitting (Breiman 2001).

Afterward, the convolutional neural network for feature generation and random forest for classification are combined in one pipeline. For that, we transform the random forest into a neural network using a method presented by Sethi (1990) and Welbl (2014). The method creates a two-hidden-layer neural network by mapping each decision tree of the random forest. An example of mapping a decision tree into a neural network is visualized in Fig. 5.1. For each split node in the decision tree, a neuron is created in the first hidden layer. The neurons are connected to the respective split features (all other weights are set to zero if no sparse architecture is used) and evaluate the split decisions, i.e., the routing to the left or right child node. In the second hidden layer, a neuron is created for each leaf node in the decision tree. The neurons combine the split decisions from the previous layer and determine whether the sample is routed to the respective leaf. In the output layer, the number of neurons corresponds to the number of classes. Each neuron stores the class votes from the leafs. Mapping a random forest, i.e., multiple decision trees, is done by mapping each decision tree and combining the neural networks. Now, we are able to create a fully convolutional network (Shelhamer et al. 2017) by replacing the fully connected layers with convolutional layers that perform the identical operation. Due to the shared features, the processing of the images is significantly accelerated. The images are analyzed by the fully convolutional network at multiple scales, and the output predicts the probability of each traffic sign class at each spatial position. In a post-processing, all detections with a probability larger than a defined threshold are extracted, and a non-maximum suppression is performed.

**Fig. 5.1** A decision tree (left) can be mapped to a two-hidden-layer neural network (right). For each *split node* (green circle) in the decision tree, a neuron in the first hidden layer is created which evaluates the split rule. For each *leaf node* (blue rectangle), a neuron in the second hidden layer is created which determines the leaf membership. A routing to leaf node 11, for example, involves the split nodes 0, 8, and 9. The relevant connections for the corresponding calculation in the neural network are highlighted

### **5.3.1.2 Localization**

The detected 2D bounding boxes are localized on the map by integrating GPS and heading information. Each image is associated with a GPS position and a heading. The heading points in the direction in which the device is oriented. For each bounding box, the depth is estimated by assuming a simple pinhole camera model, and the relative heading is determined based on the horizontal position in the image. Afterward, the information can be combined with the GPS position and heading of the image to generate the latitude, longitude, and heading of the traffic sign.

### **5.3.1.3 Clustering**

After localizing the traffic signs, we merge multiple observations of the same traffic sign. Clustering algorithms (MacQueen et al. 1967; Fukunaga and Hostetler 1975; Dockhorn et al. 2015, 2016; Schier et al. 2022) automatically discover natural groupings in the data. If multiple detections exist in an image, we can generate additional constraints because we know that multiple traffic signs exist and the respective traffic signs should not be grouped. The additional information is represented as cannot-link constraints. For weakly supervised clustering with an unknown number of clusters, constrained mean shift (CMS) (Schier et al. 2022) clustering is performed to merge the detections. CMS is a density-based clustering algorithm that extends mean shift clustering (Fukunaga and Hostetler 1975) by enabling sparse supervision using cannot-link constraints. The clustering of the detections improves the localization accuracy and makes the position estimation more robust.

### *5.3.2 Dataset*

To analyze the road and traffic situations for cyclists and pedestrians, we collected a real-world dataset. For that, we developed an app for capturing and synchronizing images and data from other sensors, like GPS and motion information. The smartphone is attached to the handlebar of the bicycle so that the camera is pointed in the direction of travel. Because monotonous routes, e.g., in rural areas, produce many similar images, we therefore introduce an adaptive filtering of the images to automatically detect points of interest. For that, we integrate motion information and apply a twofold filtering strategy based on decreases in speed and acceleration: (i) Decreases in speed indicate situations where the cyclist has to slow down because of potential traffic obstructions such as traffic jams, construction works, or other road users. (ii) Acceleration is used to analyze the road conditions and to detect, for example, potholes.

The collected dataset consists of 500 tours with a total riding time of 6 days in different cities. A visualization of the collected tours in Hanover is shown in Fig. 5.2. After filtering, the dataset has 56000 images with a size of 1080 × 1920 pixels. For the detection of traffic signs, we selected ten traffic signs that are of interest for

**Fig. 5.2** Example tracks collected around Hanover. In total, we collected in Hanover, 450 tours, *>*47K images, 5.4 days of riding; Enschede, 40 tours, *>*8K images, 18 hours of riding; and Heidelberg, 11 tours, 1000 images, several hours of riding

**Fig. 5.3** Precision-recall curve for each class to analyze the recognition performance on the test set. (**a**) Standard traffic signs. (**b**) Info signs

cyclists and pedestrians and manually annotated the ground truth for a set of images to have data for training and testing. Overall, 524 bounding boxes are annotated in the images and split 50/50 in training and testing. The splitting is repeated multiple times with different seeds.

### *5.3.3 Experiments*

The framework is evaluated on the presented dataset to analyze the recognition performance. For that, all bounding boxes are predicted at multiple scales and assigned to the ground truth bounding box with the highest overlap if the IoU is greater or equal than 0*.*5. The resulting precision-recall curve for each class is presented in Fig. 5.3. While the performance of the standard traffic signs is good, the more inconspicuous traffic signs are detected worse. The recognition performance of the latter correlates with the number of examples that are available for training. Qualitative examples are shown in Fig. 5.4. For more details and further analyses, please see Reinders et al. (2018) and Reinders et al. (2019).

### **5.4 Neural Random Forest Imitation**

We propose a novel method, called *neural random forest imitation* (NRFI), for implicitly transforming random forests into neural networks that learns the decision boundaries and generates efficient representations. The advantages of our approach for mapping random forests into neural networks are threefold: (1) We enable the generation of neural networks with very few training examples. (2) The resulting network can be used as a warm start, is fully differentiable, and allows further end-to-end fine-tuning. (3) The generated network can be easily integrated into

**Fig. 5.4** Qualitative results of randomly selected examples on the test set. True positives, false positives, and false negatives are shown for each class. Some classes have less than two false positives or false negatives, respectively

any trainable pipeline (e.g., jointly with feature extraction), and existing highperformance deep learning frameworks can be used directly. This accelerates the process and enables parallelization via GPUs. In the following, we evaluate on standard benchmark datasets to present a general approach for various domains. While we focus on classification tasks in this work, NRFI can be simply adapted for regression tasks.

### *5.4.1 Background and Notation*

In this section, we briefly describe decision trees (Breiman et al. 1984), random forests (Breiman 2001), and the notation used throughout this work. Decision trees consist of *split nodes <sup>N</sup>*split and *leaf nodes <sup>N</sup>*leaf. Each split node *<sup>s</sup>* <sup>∈</sup> *<sup>N</sup>*split performs a *split decision* and routes a data sample *x* to the left or right child node, denoted as cleft*(s)* and cright*(s)*, respectively. When using binary, axis-aligned split decisions, a single feature *f (s)* ∈ {1*,...,N*} and a threshold *θ (s)* <sup>∈</sup> <sup>R</sup> are the basis for the split, where *N* is the number of features. If the value of feature *f (s)* is smaller than *θ (s)*, the data sample is routed to the left child node and otherwise to the right child node, denoted as

$$\mathbf{x} \in \mathbf{c}\_{\text{left}}(\mathbf{s}) \iff \mathbf{x}\_{\text{f}(\mathbf{s})} < \theta(\mathbf{s}) \tag{5.1}$$

$$\mathbf{x} \in \mathsf{c}\_{\mathrm{right}}(\mathsf{s}) \iff \mathbf{x}\_{\mathrm{f}(\mathsf{s})} \ge \theta(\mathsf{s}).\tag{5.2}$$

Data samples are routed through a decision tree until a leaf node *<sup>l</sup>* <sup>∈</sup> *<sup>N</sup>*leaf is reached which stores the target value. For the classification task, these are the estimated class probabilities *P*leaf*(l)* <sup>=</sup> *(p<sup>l</sup>* 1*,...,p<sup>l</sup> <sup>C</sup>)*, where *C* is the number of classes. Decision trees are trained by creating a root node and consecutively finding the best split of the data based on a criterion. The resulting subsets are assigned to the left and right child node, and the subtrees are processed further. Commonly used criteria are the *Gini impurity* or *entropy*.

A single decision tree is very fast and operates on high-dimensional data. However, it tends to overfit the training data by constructing a deep tree that separates perfectly all training examples. While having a very small training error, this easily results in a large test error. Random forests address this problem by learning an ensemble of *nT* decision trees. Each tree is trained with a random subset of training examples and features. The prediction RF*(x)* of a random forest is calculated by averaging the predictions of all decision trees.

### *5.4.2 Methodology*

Our proposed neural random forest imitation approach implicitly transforms random forests into neural networks. The main concept includes (1) generating training data from decision trees and random forests, (2) adding strategies for reducing conflicts and increasing the variety of the generated examples, and (3) training a neural network that imitates the random forest by learning the decision boundaries. As a result, NRFI enables the transformation of random forests into efficient neural networks. An overview of the proposed method is shown in Fig. 5.5.

### **5.4.2.1 Data Generation**

First, we propose a method for generating data from a given random forest. A data sample *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* is an *N*-dimensional vector, where *<sup>N</sup>* is the number of features. We select a target class *t* ∈ [1*,...,C*] from *C* classes and generate a data sample for the selected class.

**Fig. 5.5** Neural random forest imitation enables an implicit transformation of random forests into neural networks. Usually, data samples are propagated through the individual decision trees, and the split decisions are evaluated during inference. We propose a method for generating input-target pairs by reversing this process and training a neural network that imitates the random forest. The resulting network is much smaller compared to current state-of-the-art methods, which directly map the random forest

### Data Initialization

A data sample *x* is initialized randomly. In the following, the feature-wise minimum and maximum of the training samples will be denoted as *f*min*, f*max <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* . To initialize *x*, we sample *x* ∼ *U (f*min*, f*max*)*. In the next step, we will present a method for adapting the data sample to obtain characteristics of the target class.

Data Generation from Decision Trees

A decision tree processes an input vector *x* by routing the data through the tree until a leaf is reached. At each node, a split decision is evaluated, and the input is passed to the left child node or the right child node. Finally, a leaf *l* is reached which stores the estimated probabilities *<sup>P</sup>*leaf*(l)* <sup>=</sup> *(p<sup>l</sup>* 1*,...,p<sup>l</sup> <sup>C</sup>)* for each class.

We reverse this process and present a method for generating training data from a decision tree. An overview of the proposed data generation process is shown in Fig. 5.6. First, the class distribution information is propagated bottom-up from the leaf nodes to the split nodes (see Fig. 5.6a), and we define the class weights *W (n)* = *(w<sup>n</sup>* <sup>1</sup> *,...,w<sup>n</sup> <sup>C</sup>)* for every node *n* as follows:

$$W(n) = \begin{cases} P\_{\text{left}}(n) & \text{if } n \in N^{\text{leaf}} \\\\ W(\text{c}\_{\text{left}}(n)) + W(\text{c}\_{\text{right}}(n)) & \text{if } n \in N^{\text{split}} \end{cases} \tag{5.3}$$

For every leaf node, the class weights are equal to the stored probabilities in the leaf. For every split node, the class weights in the child nodes are summed up.

After preparation, data samples for a target class *t* are generated (see Fig. 5.6b). For that, characteristics of the target class are successively added to the data sample. Starting at the root node, we modify the input data so that it is routed through

**Fig. 5.6** Overview of the data generation process from a decision tree. First, the class distribution information is propagated from the leaf nodes to the split nodes (**a**). Afterward, data samples are generated by guided routing (Sect. 5.4.2.1) and modifying the data based on the split decisions (**b**). The weights for sampling the left or right child node are highlighted in orange

selected split nodes until a leaf node is reached. The pseudocode is presented in Algorithm 1.

The routing is guided based on the weights for the target class in the left child node *w*left <sup>=</sup> *<sup>w</sup>*cleft*(n) <sup>t</sup>* and right child node *w*right <sup>=</sup> *<sup>w</sup>*cright*(n) <sup>t</sup>* . The weights are normalized by their L2-norm, denoted as *w*ˆleft and *w*ˆright. Afterward, the left or right child node is randomly selected as next child node *n*next depending on *w*ˆleft and *w*ˆright.

In the next step, the data sample is updated. We verify that the data sample is routed to the selected child node by evaluating the split decision. A split node *s* routes the data to the left or right child node based on a split feature f*(s)* and a threshold *θ (s)*. If the value of the split feature *x*f*(s)* is smaller than *θ (s)*, the data sample is routed to the left child node and otherwise to the right child node. The data sample is modified if it is not already routed to the selected child node by assigning a new value. If the selected child node is the left child node, the value has to be smaller than the threshold *θ (s)*, and a new value within the minimum feature value *f*min*,*f*(s)* and *θ (s)* is randomly sampled:

$$\text{x}\_{\text{f}(s)} \sim U(f\_{\text{min},\text{f}(s)}, \theta(s)). \tag{5.4}$$

If the data sample is supposed to be routed to the right child node, the new value is randomly sampled between *θ (s)* and the maximum feature value *f*max*,*f*(s)*:

$$\mathbf{x}\_{\mathbf{f}(s)} \sim U(\theta(\mathbf{s}), f\_{\text{max}, \mathbf{f}(s)}).\tag{5.5}$$

### **Algorithm 1** DATAGENERATIONFROMTREE Generate data samples from a decision tree

**Input:** Decision tree split features f*(n)* and thresholds *θ (n)*, target class *t*, feature minimum *f*min and maximum *f*max, class weights *W (n)* <sup>=</sup> *(w<sup>n</sup>* <sup>1</sup> *,...,w<sup>n</sup> <sup>C</sup>)* for all nodes *<sup>n</sup>* <sup>∈</sup> *<sup>N</sup>*split <sup>∪</sup> *<sup>N</sup>*leaf **Output:** Data sample for target class *t*

1: Sample *<sup>x</sup>* <sup>∼</sup> *U (f*min*, f*max*)* <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* 2: *n* ← root node 3: **while** *<sup>n</sup>* /∈ *<sup>N</sup>*leaf **do**  4: *<sup>w</sup>*left <sup>←</sup> *<sup>w</sup>*cleft*(n) <sup>t</sup>* 5: *<sup>w</sup>*right <sup>←</sup> *<sup>w</sup>*cright*(n) t* 6: **if** feature f*(n)* is already used **then**  7: weight current route with *w*path 8: *w*current ← *w*current · *w*path 9: **end if**  10: *w*ˆleft*, w*ˆright ← normalize *w*left and *w*right 11: *n*next ← randomly select left or right child node with probability of *w*ˆleft and *w*ˆright, respectively 12: **if** *n*next = cleft*(n)* **then**  13: **if** *x*f*(n)* ≥ *θ (n)* **then**  14: *x*f*(n)* ∼ *U (f*min*,*f*(n), θ (n))* 15: **end if**  16: **else**  17: **if** *x*f*(n) < θ (n)* **then**  18: *x*f*(n)* ∼ *U (θ (n), f*max*,*f*(n))* 19: **end if**  20: **end if**  21: mark feature f*(n)* as used 22: *n* ← *n*next 23: **end while**  24: **return** *x*

This process is repeated until a leaf node is reached. In each node, characteristics are added that classify the data sample as the target class.

During this process, modifications can conflict with previous decisions because features are used multiple times within a decision tree or across multiple decision trees. Therefore, the current routing is weighted with a factor *w*path ≥ 1 to prioritize the path and not change the data sample if possible. Overall, the presented method enables the generation of data samples and corresponding labels from a decision tree without adding any further data.

Data Generation from Random Forests

In the next step, we extend the method to generate data from random forests. Random forests consist of *nT* decision trees *RF* = {*T*1*,...,TnT* }. For generating a data sample *x*, the presented method for a single decision tree is applied to multiple decision trees consecutively. The initialization is performed only once, and the visited features are shared. In each decision tree, the data sample is modified and routed to selected nodes based on the target class *t*. When using all decision trees, data samples are created where all trees agree with a high probability. For generating examples with varying confidence, i.e., the predictions of the individual decision trees diverge, we select a subset of *n*sub decision trees *RF*sub ⊆ *RF*.

All decision trees in *RF*sub are processed in random order to generate a data sample. For each decision tree, the presented method modifies the data sample based on the target class. Finally, the output of the random forest *y* = RF*(x)* is predicted. In most cases, the prediction matches the target class. Due to factors such as the stochastic process, a small subset size, or varying predictions of the decision trees, it can be different occasionally. Thus, an input-target pair *(x, y)* has been created, showing similar characteristics as the target class and any amount of data can be generated by repeating this process.

### Automatic Confidence Distribution

The number of decision trees *n*sub can be set to a fixed value or sampled uniformly. Alternatively, we will present an automatic process for determining an optimal distribution of the confidences for generating a wide variety of different examples. The strategy is motivated by *importance weighting* (Fang et al. 2020). We generate *n* data samples (*n* is empirically set to 1000) for each number of decision trees *j* ∈ [1*, nT* ]. The respective generated datasets will be denoted as *Dj* .

An optimal sampling process generates highly diverse data samples with different confidences. To achieve that, an automated balancing of the distributions is determined. A histogram with *<sup>H</sup>* bins is calculated for each *Dj* , where *h<sup>j</sup> <sup>i</sup>* denotes the number of generated examples in the *i*th interval (equally distributed) from the distribution with *j* decision trees. In the next step, a weight *w<sup>D</sup> <sup>j</sup>* is defined for each distribution, and we optimize *w<sup>D</sup>* as follows:

$$\min\_{w^D} \left\| \left[ \sum\_{j=1}^{nT} w\_j^D h\_1^j \; \dots \; \sum\_{j=1}^{nT} w\_j^D h\_H^j \right]^T - \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} \right\|^2 \quad \text{s.t.} \quad \forall\_j \; 0 \le w\_j^D,\tag{5.6}$$

where *w<sup>D</sup>* <sup>∈</sup> <sup>R</sup>*nT* . This optimization finds a weighting of the number of decision trees so that the generated confidences cover the full range equally. For that, the number of samples per bin *h<sup>j</sup> <sup>i</sup>* is summed up, weighted over all numbers of decision trees. After determining *wD*, the number of decision trees can be sampled depending on *w<sup>D</sup> <sup>j</sup>* . An analysis of different sampling methods will be presented in Sect. 5.4.3.4. Automatically balancing the number of decision trees generates data samples with low and high confidence very equally distributed. The process does not require training data and provides a universal solution.

### **5.4.2.2 Imitation Learning**

Finally, a neural network that imitates the random forest is trained. The network learns the decision boundaries from the generated data and approximates the same function as the random forest. The network architecture is based on a fully connected network with one or multiple hidden layers. The data dimensions are the same as those of the random forest, i.e., an *N*-dimensional input and *C*-dimensional output. Each hidden layer is followed by a ReLU activation (Nair and Hinton 2010). The last fully connected layer is using a softmax activation.

For training, we generate input-target pairs *(x, y)* as described in the last section. These training examples are fed into the training process to teach the network to predict the same results as the random forest. To avoid overfitting, the data is generated on-the-fly so that each training example is unique. In this way, we learn an efficient representation of the decision boundaries and are able to transform random forests into neural networks implicitly. In addition to that, the training is performed end to end on the generated data, and we can easily integrate the original training data.

### *5.4.3 Experiments*

In this section, we perform several experiments to analyze the performance of neural random forest imitation and compare our method to state-of-the-art methods.

### **5.4.3.1 Datasets**

The experiments are evaluated on nine classification datasets from the UCI Machine Learning Repository (Dua and Graff 2017) (*Car*, *Connect-4*, *Covertype*, *German Credit*, *Haberman*, *Image Segmentation*, *Iris*, *Soybean*, and *Wisconsin Breast Cancer (Original)*). The datasets cover many real-world problems in different areas, such as finance, computer vision, games, or medicine.

Following Fernández-Delgado et al. (2014), each dataset is split into a training and a test set using a 50/50 split while maintaining the label distribution. Afterward, the number of training examples is limited to *n*limit examples per class. We evaluate the training with 5, 10, 20, and 50 examples per class. In contrast to Fernández-Delgado et al. (2014), we extract validation sets from the training set (e.g., for hyperparameter tuning). This ensures that the training and validation data are not mixed with the test data. For some datasets which provide a separate test set, the test accuracy is evaluated on the respective set. Missing values are set to the mean value of the feature. All experiments are repeated ten times with randomly sampled splits. The methods are repeated additionally four times with different seeds on each split.

### **5.4.3.2 Implementation Details**

In all our experiments, stochastic gradient descent with Nesterov momentum as optimizer and cross-entropy loss are used. The initial learning rate is set to 0*.*1, momentum to 0*.*9, and weight decay to 0*.*0005. The batch size is set to 128 and 512, respectively, for generated data. The input data is normalized to [−1*,* 1]. For generating a wide variety of data, the prioritization of the current path *w*path ∼ 1 + |N*(*0*,* 5*)*| is sampled for each data sample individually. A new random forest is trained every 100 epochs to average the influence of the stochastic process, and the generated data samples are mixed. In the following, training on generated data will be denoted as *NRFI (gen)* and training on generated and original data as *NRFI (gen+ori)*. The fraction of NRFI data is set to 0*.*9. Random forests are trained with 500 decision trees, which are commonly used in practice (Fernández-Delgado et al. 2014; Olson et al. 2018). The decision trees are constructed up to a maximum depth of 10. For splitting, the Gini impurity is used, and <sup>√</sup>*<sup>N</sup>* features are randomly selected, where *N* is the number of features.

### **5.4.3.3 Results**

The proposed method generates data from a random forest and trains a neural network that imitates the random forest. The goal is that the neural network approximates the same function as the random forest. This also implies that the network reaches the same accuracy if successful.

We analyze the performance by training random forests for each dataset and evaluating neural random forest imitation with different network architectures. A variety of network architectures with different depths, widths, and additional layers such as dropout have been studied. In this work, we focus on two-hidden-layer networks with an equal number of neurons in both layers for clarity. The results are shown in Fig. 5.7 exemplarily for the *Car*, *Covertype*, and *Wisconsin Breast Cancer (Original)* dataset. The other datasets show similar characteristics. The overall evaluation on all datasets is presented in the next section. The number of training examples per class is shown in parentheses and increases in each row from left to right. For each setting, the test accuracy of the random forest is indicated by a red dashed line. The average test accuracy and standard deviation depending on the network architecture, i.e., the number of neurons in the first and second hidden layer, are plotted for different architectures. NRFI (gen), which is trained with generated data only, is shown in orange, and NRFI (gen+ori), which is trained with generated and original data, is shown in blue.

The analysis shows that the accuracy of the neural networks trained by NRFI reaches the accuracy of the random forest for all datasets. Only very small networks do not have the required capacity. The proposed method for generating labeled data from random forests by analyzing the decision boundaries enables training neural networks that imitate the random forests. For instance, in the case of 5 training examples per class, a two-hidden-layer network with 16 neurons in both

**Fig. 5.7** Test accuracy depending on the network architecture (i.e., number of neurons in both hidden layers). Different datasets are shown per row, with an increasing number of training examples per class from left to right (indicated in parentheses). The red dashed line shows the accuracy of the random forest. NRFI with generated data is shown in orange and NRFI with generated and original data in blue. With increasing network capacity, NRFI is capable of imitating and even outperforming the random forest

layers already achieves the same accuracy as the random forest across all 3 datasets in Fig. 5.7. Additionally, the experiment shows that the training is very robust to overfitting even when the number of parameters in the network increases. When combining the generated data and original data, the accuracy on *Car* and *Covertype*  improves with an increasing number of training examples.

Overall, the experiment shows that the accuracy increases with an increasing number of neurons in both layers and NRFI is robust to different network architectures. NRFI is capable of generating a large variety of unique examples from random forests which have been initially trained on a limited amount of data.

### Comparison to State of the Art

We now compare the proposed method to state-of-the-art methods for mapping random forests into neural networks and classical machine learning classifiers such as random forests and support vector machines with a radial basis function kernel that have shown to be the best two classifiers across all UCI datasets (Fernández-Delgado et al. 2014). In detail, we will evaluate the following methods:


classes. As evaluated by Fernández-Delgado et al. (2014), the best performance is achieved with a radial basis function kernel.


First, we analyze the performance of state-of-the-art methods for mapping random forests into neural networks and neural random forest imitation. The results are shown in Fig. 5.8 for different numbers of training examples per class. For each method, the average number of parameters of the generated networks across all datasets is plotted depending on the test error. That means that the methods aim for the lower-left corner (smaller number of network parameters and higher accuracy). Please note that the y-axis is shown on a logarithmic scale. The average performance of the random forests is indicated by a red dashed line.

The analysis shows that Sethi, Welbl (ind-full), and Welbl (joint-full) generate the largest networks. Network splitting (Massiceti et al. 2017) slightly improves the number of parameters of the networks. Using a sparse network architecture reduces the number of parameters. However, it should be noted that this requires special operations. NRFI with and without the original data is shown for different network architectures. The smallest architecture has 2 neurons in both hidden layers and the largest 128. For NRFI (gen-ori), we can see that a network with 16 neurons in both hidden layers (NN-16-16) is already sufficient to learn the decision boundaries of the random forest and achieve the same accuracy. When fewer training samples are available, NN-8-8 already has the required capacity. In the following, we will further analyze the accuracy and number of network parameters.

**Fig. 5.8** Comparison of the state-of-the-art and our proposed method for transforming random forests into neural networks. The closer a method is to the lower-left corner, the better it is (fewer number of network parameters and lower test error). For neural random forest imitation, different network architectures are shown. Note that the number of network parameters is shown on a logarithmic scale

### Accuracy

The average test accuracy and standard deviation for all methods are shown in Table 5.1. Here, we additionally include decision trees, support vector machines, random forests, and neural networks in the comparison. The evaluation is performed on all nine datasets, and results for different numbers of training examples are shown (increasing from left to right). The overall performance of each method is summarized in the last column. For neural random forest imitation, a network architecture with 128 neurons in both hidden layers is used. From the analysis, we can make the following observations: (1) When training neural random forest imitation with generated data only, the method achieves 99*.*18% of the random forest accuracy (71*.*44% compared to 72*.*03%). This shows that NRFI is capable of learning the decision boundaries. (2) Overall, NRFI trained with generated and original data reaches state-of-the-art performance (50 samples per class) or outperforms the other methods (5, 10, and 20 samples per class).


**Table 5.1** Average test accuracy [%] and standard deviation on all nine datasets for different numbers of training examples per class. The overall performance of each method is summarized in the last column. The best methods are highlighted in bold

**Table 5.2** Comparison to state-of-the-art methods. For each method, the average number of parameters of the generated neural networks is shown. While achieving the same or even slightly better accuracy, neural random forest imitation generates much smaller models, enabling the mapping of complex random forests


### Network Parameters

Finally, we will analyze the number of parameters of the generated networks in detail. The results are shown in Table 5.2. Current state-of-the-art methods directly map random forests into neural networks. The number of parameters of the resulting network is evaluated on all datasets with different numbers of training examples. The overall performance is shown in the last column. Due to the stochastic process when training the random forests, the results can vary marginally.

Sethi, Welbl (ind-full), and Welbl (joint-full) generate networks with around 980 000 parameters on average. Of the four variants proposed by Welbl, joint training has a slightly smaller number of parameters compared to independent training because of shared neurons in the output layer. Network splitting proposed by Massiceti et al. (2017) maps multiple subtrees while sharing common split nodes and reduces the average number of network parameters to 748 000. Using sparse network architectures additionally reduces the number of network parameters to about 142 000; however, this requires a special implementation for sparse matrix multiplication. All of the methods show a drastic increase with the growing complexity of the classifiers. Sethi, for example, generates networks with 374 000 parameters when training with 5 examples per class. The average number of network parameters increases to 1*.*9 million when training with 50 examples per class.

NRFI introduces imitation instead of direct mapping. In the following, a network architecture with 32 neurons in both hidden layers is selected. The previous analysis has shown that this architecture is capable of imitating the random forests (see Fig. 5.8 for details) across all datasets and different numbers of training examples. Our method significantly reduces the number of parameters of the generated networks while reaching the same or even slightly better accuracy. The current best-performing methods generate networks with an average number of parameters of either 142 000, if sparse processing is available, or 748 000 when using usual fully connected neural networks. In comparison, neural random forest imitation requires only 2676 parameters. Another advantage is that the proposed method does not create a predefined architecture but enables arbitrary network architectures. As a result, NRFI enables the transformation of very complex classifiers into neural networks.

### **5.4.3.4 Analysis of the Generated Data**

To study the sampling process, we analyze the variability of the generated data as well as different sampling modes in the next experiment. Subsequently, we investigate the impact of combining original and generated data.

### Confidence Distribution

The data generation process aims to produce a wide variety of data samples. This includes data samples that are classified with a high confidence and data samples that are classified with a low confidence to cover the full range of prediction uncertainties. The following analyses are shown exemplarily on the *Soybean* dataset. This dataset has 35 features and 19 classes. First, we analyze the generated data with a fixed number of decision trees, i.e., the number of sampled decision trees in *RF*sub. The resulting confidence distributions for different numbers of decision trees are shown in the first column of Fig. 5.9. When adopting the data sample to only a few decision trees, the confidence of the generated samples is lower (around 0*.*2 for 5 samples per class). Using more decision trees for generating data samples increases the confidence on average.

**Fig. 5.9** Probability distribution of the predicted confidences for different data generation settings on *Soybean* with 5 (top) and 50 samples per class (bottom). Generating data with different numbers of decision trees is visualized in the left column. Additionally, a comparison between random sampling (red), NRFI uniform (orange), and NRFI dynamic (green) is shown in the right column. By optimizing the decision tree sampling, NRFI dynamic automatically balances the confidences and generates the most diverse and evenly distributed data

NRFI uniform and NRFI dynamic sample the number of decision trees for each data point uniformly, respectively, optimized via automatic confidence distribution (see Sect. 5.4.2.1). The confidence distributions for both sampling modes are visualized in the second column of Fig. 5.9. Additionally, sampling random data points without generating data from the random forest is included as a baseline. The analysis shows that random data samples and uniform sampling have a bias to generate data samples that are classified with high confidence. NRFI dynamic automatically balances the number of decision trees and archives an evenly distributed data distribution, i.e., generates the most diverse data samples.

In the next step, the imitation learning performance of the sampling modes is evaluated. The results are shown in Table 5.3. Random data generation reaches a mean accuracy of 63*.*80%, while NRFI uniform and NRFI dynamic achieve 87*.*46% and 88*.*14%, respectively. This shows that neural random forest imitation is able to generate significantly better data samples based on the knowledge in the random


**Table 5.3** Imitation learning performance (in accuracy [%]) of different data sampling modes on *Soybean*. NRFI achieves better results than random data generation. When optimizing the selection of the decision trees, the performance is improved due to more diverse sampling

forest. NRFI dynamic improves the performance by automatically optimizing the decision tree sampling and generating the largest variation in the data.

### Original and Generated Data

In the next experiment, we study the effects of training with original data, NRFI data, and combinations of both. For that, the fraction of NRFI data *w*gen is varied, which weights the loss of the generated data. Accordingly, the weight for the original data is set to *w*ori = 1 − *w*gen. The average accuracy over all datasets for different number of samples per class is shown in Fig. 5.10. When the fraction of NRFI data

**Fig. 5.10** Analyzing the influence of training with original data, NRFI data, and combinations of both for different number of samples per class. Using only NRFI data (*w*gen = 100%) achieves better results than using only the original data (*w*gen = 0%) for less than 50 samples per class. Combining the original data and generated data improves the performance

is set to 0%, the network is trained with only the original data. When the fraction is set to 100%, the network is trained completely with the generated data. The study shows that training with NRFI data performs better than training with original data except for 50 samples per class where training with original data is slightly better. Combining original and NRFI data improves the performance. The best result is achieved when using mainly NRFI data with a small fraction of original data.

### **5.5 Conclusion**

In this work, we brought two worlds together by combining neural networks and random forests. First, we presented an object detection framework for analyzing the road and traffic situations for cyclists and pedestrians. The combination of convolutional neural networks and random forests enables the training with very few labeled examples. Both methods are combined in an end-to-end pipeline by transforming the random forest into a neural network and generating a fully convolutional network.

Because existing approaches for mapping random forests into neural networks generate inefficient networks, we presented a novel method for transforming random forests into neural networks. Instead of a direct mapping, we introduced a process for generating data from random forests by analyzing the decision boundaries and guided routing of data samples to selected leaf nodes. Based on the generated data and corresponding labels, a network is trained that imitates the random forest. Experiments on several real-world benchmark datasets demonstrate that NRFI is capable of learning the decision boundaries very efficiently. Compared to stateof-the-art methods, the presented implicit transformation significantly reduces the number of parameters of the networks while achieving the same or even slightly improved accuracy due to better generalization. Our approach has shown that it scales very well and is able to imitate highly complex classifiers.

**Acknowledgments** This research was supported by the German Research Foundation DFG (COVMAP—RO 2497/12-2) within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization, and Social Computing*; the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003); the Center for Digital Innovation (ZDIN); and the German Research Foundation (DFG) under Germany's Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).

### **References**

Barz B, Denzler J (2020) Deep learning on small datasets without pre-training using cosine loss. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp 1360–1369 Biau G, Scornet E, Welbl J (2019) Neural Random Forests. Sankhya A 81:347–386


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part II Geovisualization and User Interactions Related to VGI**

VGI has been employed to support a variety of applications that benefit from the georeferencing of data, from navigation to exploratory data analysis. VGI data is large in quantity, dynamic, and spread across many different types of modalities, such as text, images, or positional information. To make full use of this data and to make it understandable for non-experts, the effective use of visualizations is a primary tool. For expert users, interactive visualizations enable varied user interactions to explore and interpret large, dynamic, and heterogeneous datasets and enable extraction of knowledge and patterns. For non-expert users, geovisualizations can help to communicate relevant patterns and properties like uncertainty of locations.

Part II of the book explores geovisualizations and user interactions supporting the analysis and presentation of VGI data. When designing these visualizations and user interactions, we need to consider the specific properties of VGI data, the knowledge and abilities of different target users, as well as the technical viability of solutions. It is crucial to find the right trade-off between information density and complexity, on the one hand, and interpretability or ease of use, on the other hand. Algorithmic advances for the optimization of visualizations can support the process of revealing relevant information in visualizations; however, we should be aware to keep faithful representations of data. User interactions need to be efficient and powerful for analysts while not overwhelming possible users. The chapters in this part tackle these challenges in the space of VGI.

We start by discussing strategies for visually analyzing dynamic social messages and news articles containing geo-referenced information based on visual analytics. We then discuss the use of interactive visual reporting solutions for leveraging natural language and visualizations for geo-related data. Following this, we examine the effects of landmark position and design in VGI-based maps on visual attention and cognitive processing. Related to this, we discuss landmark uncertainty in VGIbased maps and its effects on orientation and navigation performance. Finally, we present a study to analyze the user behavior while solving interpretation tasks with VGI data and use the findings to improve task-oriented visual interpretation of VGI point data.

We hope that this research strengthens the extraction of usable knowledge from VGI data for both experts and non-experts, integrating different data types such as text, and guiding further research into visual saliency for maps and task-specific representations of VGI data.

# **Chapter 6 Toward Visually Analyzing Dynamic Social Messages and News Articles Containing Geo-Referenced Information**

### **Johannes Knittel, Franziska Huth, Steffen Koch, and Thomas Ertl**

**Abstract** The number of social media posts and news articles that are being published every day is high. This makes them an attractive source of humangenerated information for different domain experts such as journalists and business analysts but also emergency responders, particularly if posts contain references to geolocations. Visual analytics approaches can help to gain insights into such datasets and inform decision-makers. However, the high volume and the veracity of the data, as well as the velocity in the case of streaming data, pose challenges when supporting explorative analysis with interactive visualization. Based on four exemplary approaches, we outline recently proposed strategies to tackle these challenges. We describe how geo-aware filtering and anomaly detection methods can help to inform stakeholders based on geolocated tweets. We show that data-aware tag maps can provide analysts with an overview-first, details-on-demand visual summary of large amounts of text content over time. With space-filling curves, we can visualize the temporal evolution of geolocations in a two-dimensional plot without relying on animations that would impede comparative analyses. Additionally, we discuss the use of an efficient dynamic clustering algorithm for enabling large-scale visual analyses of streaming posts.

**Keywords** Social media · Documents · Streaming data · Visualization · Visual analytics

### **6.1 Introduction**

Unstructured or semi-structured data such as news articles and social media posts contain a significant amount of human knowledge. Analyzing such vast amounts of data enables several stakeholders from different domains to obtain insights and

University of Stuttgart, Stuttgart, Germany

J. Knittel (-) · F. Huth · S. Koch · T. Ertl

e-mail: johannes.knittel@vis.uni-stuttgart.de; franziska.huth@vis.uni-stuttgart.de; steffen.koch@vis.uni-stuttgart.de; thomas.ertl@vis.uni-stuttgart.de

D. Burghardt et al. (eds.), *Volunteered Geographic Information*, https://doi.org/10.1007/978-3-031-35374-1\_6

inform their decision-making, for instance, business traders that need up-to-date information about new developments and first responders that benefit from timely witness reports on social media. In some cases, we can leverage the structured metadata associated with some of these documents such as the geolocation of posts or the timestamp of news articles, but it is generally challenging to gain insights into the actual content comprising text data.

The field of visual analytics (Thomas and Cook 2005; Keim et al. 2010) particularly aims at solving such complex problems of analyzing and exploring large amounts of data with often open and ill-defined goals. It combines the domain knowledge and intelligence of human experts with interactive visualizations that are sourced from advanced automated data analytics and machine learning models. If we want to harvest news reports and social media posts for timely insights using visual analytics, we need efficient algorithms to deal with the volume of the data, we need to extract information from unstructured data such as text, we need to integrate and combine this information with additional metadata such as geolocations into interactive visualizations, and we need to develop adaptive visualizations and streaming-aware algorithms that can deal with dynamic data sources. This chapter outlines recently proposed visual analytics approaches to tackle the said challenges.

### **6.2 Analyzing the Temporal Evolution of Text Data with PyramidTags**

Making sense of large document corpora is a challenging endeavor, since it is inherently difficult for machines to grasp the meaning of natural language. Visual analytics approaches that combine methods of automatic information retrieval and data analytics with interactive visualizations help to tackle such challenges by incorporating human expertise and human interpretability. However, it remains challenging to provide a comprehensive overview of large amounts of text data such as news articles or social media posts due to the unstructured nature of the data, the variety of how people express similar things, and the inherent ambiguity of natural languages.

PyramidTags (Knittel et al. 2021c) proposes a novel tag layout for exploring large document collections such as tweets that aims at providing analysts with an overview of the content at hand and the temporal evolution of its themes without introducing hard clusters or topics. The approach utilizes an optimization process to place extracted relevant keywords and keyphrases from articles or posts onto a two-dimensional plot such that related tags ideally appear close to each other, while it is also possible to infer in which date ranges tags mostly appear in the dataset based on their position on the map (Fig. 6.1).

**Fig. 6.1** PyramidTags aims at providing an interactive overview of large document collections. This is an example of the main visualization based on 70,000 news articles published in late January 2020

### *6.2.1 Processing and Objectives*

PyramidTags first extracts the top *k* relevant keywords and keyphrases from the document collection using ELSKE (Knittel et al. 2021b), a fast keyphrase extraction library specialized in summarizing text collections. These tags serve as a summary of the content, and the way how they are placed on the two-dimensional tag map should support analysts making sense of the data with a date-aware, context-aware, and word-order-aware layout. In a second step, we process the dataset again to infer which tags are related based on how often and how close they appear in the same paragraph, whether there seems to be a preferred reading order of tag pairs (e.g., *John | Doe* vs. *Doe | John*), as well as in which date ranges tags and tag pairs mostly appear.

The resulting data structure informs the subsequent layouting process, which optimizes an objective function using particle swarm optimization (Kennedy and Eberhart 1995). Minimizing this function therefore corresponds to finding a balanced trade-off between different objectives such as that (1) related tags are placed nearby, (2) the position of the tag conveys the associated date range, (3) the preferred reading order is preserved for important pairs (if applicable), and (4) tags should not overlap.

### *6.2.2 Triangular Layout*

One of the defining aspects of PyramidTags is its triangular layout, which aims to convey the temporal evolution of the extracted tags. Each tag is associated with a specific date range in which it mainly appears in the data (we may have several tags with the same text in case they appear in distinct clusters of date ranges). The vertical position on the map corresponds to the duration of this range, and the horizontal position to the mid-point of the said date range. At the bottom of the visualization, we place a timeline that depicts the entire date range of the dataset. For instance, if a word mainly appears on a specific date in the data, it is placed at the bottom of the map, right above the corresponding date in the timeline. On the other hand, if words or phrases appear in most of the articles, they are placed at the center-top of the visualization. Analysts should be able to infer this date range by spanning a right triangle from the tag to the timeline at the bottom. With this layout, we can visualize associated time spans of data points without relying on animations, which helps analysts to hypothesize about relevant events since tags that are mentioned during similar date ranges are also placed in the same neighborhood.

For instance, in Fig. 6.1, the tags *diamond princess* and *cruise ship* appear at the top of the map, indicating that the discussion about the Covid-19 cases on the said cruise ship was in the news during most of the depicted date range, whereas *storm dennis* and *flooding* are placed at the bottom-left around the first day of the 2-week date range.

### *6.2.3 Interactions and Document Retrieval*

If users hover over certain tags, a lightly colored trapezoid visualizes the associated date range of the respective tag. While the optimization process tries to place related tags nearby, due to the inherent information loss of projecting data to twodimensional spaces, not all tags that are close to each other are necessarily related, and there might be tags placed further away that are nevertheless related. The system therefore shades all other tags on the map depending on how related they are to the currently hovered tag (the more opaque, the less related). Users can also select one or several keywords by clicking on them. For instance, Fig. 6.2 shows an example in which the analyst has selected three tags (A). PyramidTags will then list the most related documents that contain the chosen selection of words or phrases, ranking the results based on the number of occurrences and the relative position of keywords to each other in the document (B). Users may also retrieve individual documents (C).

**Fig. 6.2** Once one or several tags are selected (A), the PyramidTags system shades terms based on how related they are to the selected keywords. For each tag, a colored trapezoid visualizes the associated date range in which the respective term mostly appears in. The system also provides a list of related documents (B) from which individual documents can be retrieved (C)

### **6.3 Leveraging Geodata to Scale the Visual Analysis of Posts**

When dealing with large amounts of streaming text data, we can leverage geographical references to scale the analysis (e.g., the current position of the person that has just posted the respective tweet). Such geo-tags not only provide additional context to the textual content; we can also use them to cluster items and their content in a geospatial way, providing several important benefits. We can visualize content on top of a geographical map to help analysts focusing on specific regions of instance, drastically reducing the actual amount of data analysts have to cope with. Grouping content by geographic region may also help with providing thematic aggregations, as people in the same region within a certain time span may also have a higher chance of posting content with similar topics (e.g., football match in a city). Another advantage is that we can compare metadata and extracted aggregated information of documents with a spatial-aware baseline (e.g., typical occurrences of tags within a region). This also helps to develop anomaly detection algorithms, for instance, to notify first responders in a very timely manner about evolving situations.

ScatterBlogs (Thom et al. 2012, 2015; Bosch et al. 2013) is a visual analytics approach that proposed to leverage the geographical annotations of tweets in these ways to scale the visual analysis of streaming posts. Case studies with domain experts from crisis management groups and critical infrastructure companies underlined the need for such systems so that analysts can obtain important additional information in real time regarding critical situations, despite the apparent learning curve and the need for specialized human labor to monitor this channel (Thom et al. 2015). However, they also showed that the velocity of newly published posts, even regarding specific events, and the dynamically evolving nature of the content itself (e.g., novel hashtags) still pose significant challenges for analysts and the development of such interactive monitoring systems (Fathi et al. 2020).

### *6.3.1 Geospatial Clustering of Terms*

One of the core ideas in ScatterBlogs is to continuously extract terms that appear unusually frequently in certain geographic regions and visualize them on top of a map such that analysts can get an overview of interesting developments that take place at specific locations. In the beginning, every term (except for stop words) defines a cluster comprising all geo-tagged posts containing the respective term; new received posts are added to these clusters based on the terms they contain. Once the distortion of any such cluster is too high (i.e., the geographic positions of related messages are too widely scattered), we split the cluster using the *k*-means algorithm (with *k* = 2). The system visualizes such dynamic *term clusters* by placing the respective tags (or representative dots) on a map based on the average geographic position of corresponding posts. The decision which terms to display in which size also incorporates how anomalous the usage of this term is. This score depends on the number of unique users and the geographic density of the corresponding posts, that is, the importance is high if many different users post messages at a specific location. Figure 6.3 depicts the main user interface of the system, visualizing anomalous terms in green on top of a map.

### *6.3.2 Keyword Lens and Topic Modeling*

ScatterBlogs provides additional views to support explorative tasks. Analysts can move a lens across the map, which will highlight the most important keywords of posts that were sent in the corresponding geographic regions under the lens. When selecting specific term clusters, a histogram depicts the number of posts over time. Text-based and date/time-based filters can be applied to select a subset of tweets. For such a selection of posts, the system can provide a thematic overview of the content using LDA topic modeling (Blei et al. 2003).

### *6.3.3 Interactive Classifier*

In addition to keyword-based filtering, ScatterBlogs also offers means to train and apply SVM-based classifiers interactively, which can be mapped to a color and icon

**Fig. 6.3** Main interface of the ScatterBlogs system (Thom 2015) for visually analyzing geolocated tweets in real time or retrospectively. Terms which exhibit anomalous usage in specific geographic regions are placed on top of a dynamic map visualization to guide analysts to potentially developing situations (B). Analysts can select a subset of tweets (D) based on date ranges (A, C, E), which will also trigger the extraction of topics using LDA (G). The right side allows defining and combines classifiers interactively for filtering messages (F, H)

to support the visual indication of classified posts. An initial training set can be labeled greedily based on keyword searches and geographic filters. The system then provides visual feedback of the classifier in its current state (e.g., visualization of affected posts on the map) so that analysts can refine them iteratively. It is also possible to combine several such trained classifiers with a visual graph structure (right side of Fig. 6.3) that helps to define Boolean chains.

### **6.4 Space-Filling Curves for Visualizing the Spatiotemporal Evolution of Data**

In addition to the publishing date, a subset of posts and articles also contain a geographic reference (e.g., location of the tweeter). As outlined in the previous section, such geo-references enable analysts to filter data based on relevant regions, but evolving geographic patterns and anomalies can also hint at interesting developments and inform the decision-making process. The ScatterBlogs system focuses on the real-time analysis of streaming posts, and thus, the temporal aspect is mostly implicitly encoded by the dynamic nature of the visualizations. However, for certain analytical tasks, it might be important to analyze larger time ranges in retrospect. The first section introduced PyramidTags, which applies the triangular layout to visually encode date ranges without animations for exploring vast amounts of social media posts and news articles. However, it is challenging to visualize the temporal evolution of *geolocated* data without animations, since the two main dimensions are typically already reserved to visually encode the geographic location on a map. Animations, though, need to capture the attention of the analyst over a longer time span and impede comparative analyses. Franke et al. (2021) proposed the use of space-filling curves for visually encoding spatial data into just one dimension so that we can depict the evolution alongside the y-axis.

### *6.4.1 Neighborhood-Preserving 1D Projections*

The main idea of the approach is to project geographic positions into onedimensional positions using space-filling curves, such as Hilbert and Morton, while still preserving local neighborhoods to a certain degree. We can then plot a representative scalar value of geo-referenced data points across time in a twodimensional plot so that analysts can better assess the temporal evolution of geographic neighborhoods, as well as spot spreading patterns, geographic hotspots, similar patterns across different regions, and trendsetters. In a preprocessing step, the system clusters the data points hierarchically based on their spatial position (if the spatial hierarchy is not already given). This clustering allows us to aggregate larger datasets at different levels of spatial granularity and enables analysts to focus their analysis on specific geographic regions, which also aids in the interpretation of the resulting geo-projections.

### *6.4.2 Main Interface*

For each aggregation level in the spatial hierarchy, the timeline view (Fig. 6.4 underneath the map) visualizes the temporal evolution of each *entity* (e.g., aggregated cluster or single data point) alongside the y-axis. The bars correspond to geographic entities in the clustering and are ordered based on their calculated position in the respective space-filling curve. Analysts can select a specific entity to focus on (highlighted by a red border), which will trigger the system to re-order close entities in the detail view at the bottom. The map at the top serves as an aggregated overview of the different geospatial entities based on a specific point in time that users can specify with the slider at the top-right of the interface.

The system supports several methods for computing space-filling curves (topleft panel). Upon hovering over a specific method, the respective curve is plotted on top of the map, and the differences in the ordering of the elements to the currently selected curve are visualized. Several computed metrics help analysts better assess the quality of the projections.

**Fig. 6.4** Space-filling curves help to visualize the temporal evolution of geospatial patterns in local neighborhoods (Franke et al. 2021). The map at the top shows an aggregated view at a specific time, whereas the timeline view underneath depicts the value of each geographic entity in the respective hierarchy layer across time. Analysts can select one of the bars, representing entities in the hierarchical clustering, which will highlight close entities in dark orange

### **6.5 Clustering Posts Dynamically to Analyze Posts in Real Time**

Leveraging geo-annotations helps to scale the visual analysis of streaming data and enable geo-specific baselines as well as anomaly detection methods. However, while people still post textual geo-references (e.g., names of cities), the percentage of geolocated social media posts has steadily decreased in recent years. Thus, we need different strategies to enable the real-time analysis of social media posts, and we need to facilitate more context-rich analyses of the actual textual content.

To achieve this, Knittel et al. (2022) have proposed a visual analytics system that employs an efficient dynamic clustering algorithm, providing analysts a continuous overview of what people talk about on Twitter. A dynamic visualization of frequently used phrases and a stream of representative posts help analysts to monitor topics they are interested in, and they can also dive deeper into such topics while increasing the resolution of the analysis. Figure 6.5 depicts the main interface of the approach.

**Fig. 6.5** The approach of Knittel et al. (2022) enables large-scale real-time analyses of streaming posts using an efficient dynamic clustering algorithm. The interface provides a thematic overview of what people are currently posting on Twitter (left side). Each row corresponds to one such topic, conveying its size with the length of the bar and the number of posts over time in a small line chart. Once analysts have selected one or several such topics, frequent phrases are continuously extracted and visualized on the right side at the top. In addition to a manageable stream of representative posts underneath, they help to provide a more context-rich visual summary of the streamed content. It is also possible to create filters based on such a topic selection so that the clustering processes and visualizations operate on this filtered stream to enable more fine-grained analyses

### *6.5.1 Dynamic Clustering*

The system stores each new post in a sliding window of configurable size (e.g., the last 20 minutes) and computes corresponding bag-of-words vector representations (Salton and Buckley 1988). For the dynamic clustering, the approach adapts the *k*means clustering algorithm (Lloyd 1982) based on a more efficient implementation for sparse vectors (Knittel et al. 2021a). The idea is to regularly cluster the documents in the sliding window with different cluster sizes while using the centroids from the previous clustering run (if available) to obtain more coherent clusters between runs. The Davies-Bouldin Index (DBI) (Davies and Bouldin 1979) determines which of the different clusterings is deemed best. The algorithm then tries to map the final cluster centroids to the ones from the previous run to determine which clusters are actually new, are deprecated, or have just been updated. The system runs two independent clustering processes with different levels of granularity. The more coarse-grained clustering provides a topical overview of the tweets in the window; the more fine-grained clustering facilitates a stream of representative posts.

### *6.5.2 Topical Overview*

The left side of the main user interface (Fig. 6.5) visualizes the topics as computed by the coarse-grained dynamic clustering process. Each row represents one topic, conveying its size with the length of the bar and the number of posts over time in a small line chart, as well as providing a short topic summary with the most important words that define that cluster. Once the clustering has been updated with new posts, the visualization changes dynamically to represent the new state, but in a staggered way so that the mental map is preserved. Each row gets updated step by step (e.g., the *lebron* topic in Fig. 6.5), visualizing the number of new posts in dark green, the number of removed posts in red, and the number of posts that were moved to a different topic in magenta (while also depicting this flow with curves to the left of the bars). New terms are highlighted in dark green. The speed of this dynamic rollout is adjustable.

### *6.5.3 Frequent Phrases and Stream of Representative Posts*

Users can select one or several of the main topics on the left, which will update the right side of the interface with more detailed views for summarizing the content of these topics. The system continuously extracts unusually frequent words and phrases in the selection with ELSKE (Knittel et al. 2021b) and lists them on the right side, highlighting new entries in dark green. Below each such phrase, a small barcode-like visualization allows analysts to infer which of these text parts co-occur by mentally overlapping their respective barcodes. The system can also emphasize this overlap in orange if analysts select several such phrases.

Below this view, a stream of posts appears that resembles a typical feed users would also see on Twitter. One of the challenges the case study of Fathi et al. (2020) with emergency managers identified is the dynamic nature of quickly evolving situations, which can easily lead to situations in which the sheer number of published posts related to a specific event or topic can overwhelm analysts. Hence, the idea of the feed below the frequent phrases is that the appearing posts should cover the thematic variety of the selected topics while keeping the number of new posts in a given time span low, ensuring that analysts can focus on a digestible set of posts.

The system selects for each cluster in the fine-grained dynamic clustering process one post that should represent this fine-grained theme by calculating distances between the vector representations with the respective cluster centroid. Due to their similarity with the respective centroids, these *representative posts* ideally cover a large proportion of tweets in the same cluster, but they should also provide a diverse summary as they were sourced from different clusters in the fine-grained clustering process. If such a representative post has not been added yet, it will be queued up, and a small badge in light blue appears that notifies the user about new posts. For each such post, a small bar chart depicts the number of similar posts, which can also be retrieved if analysts click on the bar.

### *6.5.4 Diving into Topics*

While the frequent phrases and representative posts already provide more contextrich summaries of the selected topics, the general thematic variety on social media platforms is typically high, so it is challenging to group all posts into just 10 to 20 overview clusters. To alleviate this, the proposed system allows users to gradually dive into topics. After selecting one or several such coarse topics on the left side, analysts can click on the fork button at the top of the interface. This will define a new filter layer, in which only posts that fit the selection will be processed and visualized (it is also possible to create keyword-based filters). As a result, the topical overview on the left and the fine-grained clustering process now only operate on this filtered stream of data, increasing the resolution of the topics and aggregations and, thus, increasing the resolution and specificity of the analysis.

### **6.6 Conclusion**

Geo-referenced social media posts and news articles are a rich source for harvesting information and knowledge, but the unstructured nature of the main content and the volume, veracity, and velocity of the data pose significant challenges for developing such visual analysis systems. This chapter outlined several recent approaches for tackling these challenges. The ScatterBlogs system scales the visual analysis of streaming geo-referenced posts by continuously extracting terms that exhibit spatiotemporal anomalies. Our proposed dynamic clustering algorithm enables the continuous monitoring of posts even if they are not geo-tagged. PyramidTags is a novel tag map layout for exploring large time-stamped text collections. We further outlined how we can utilize one-dimensional projection methods to visualize georeferenced time series data such that we can still observe important spatial trends and patterns.

There are several benefits if we can incorporate geographic locations into our analysis, since this allows us to better detect interesting events and helps to filter the content that has to be processed. However, due to the decrease of geolocated documents, we need to develop new strategies and techniques for leveraging geographic references in social media posts and news articles.

**Acknowledgments** This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—VA4VGI, 314647693.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Visually Reporting Geographic Data Insights as Integrated Visual and Textual Representations**

**Fabian Beck and Shahid Latif** 

**Abstract** Geographic information volunteered by the public is usually also of public interest. However, just publishing the data is not enough to make the data accessible and usable for the public. The raw data might need to be abstracted and interpreted, as well as visually presented to be understandable to non-experts. To address this, we propose interactive visual reporting solutions that leverage natural language and visualizations for geo-related data. We present these reports as interactive documents, but also in other media such as virtual reality environments. First, we have studied the interplay of textual and visual content in such reports. To ease the creation of content, we have developed solutions for authoring interactive documents with a close linking of textual contents and visually presented data. Moreover, we propose automatic report generation approaches that specifically support the exploration of the geo-related data starting from an explanatory summary.

**Keywords** Geographic visualization · Data-driven storytelling · Interactive documents · Authoring interfaces · Text generation

### **7.1 Introduction**

Data-driven storytelling embeds data into a narration and usually combines a textual representation with visualizations (Segel and Heer 2010). While magazine-style stories might be most common, there exist other presentation genres like animations and slide shows, comics, or annotated charts (Segel and Heer 2010). The power of data-driven storytelling lies in making data accessible to a wider audience, by guiding through the analysis insights while inviting users to engage through simple interactions. Textual and visual data descriptions complement each other and, together, form an integrated representation that is both easy to follow and rich

University of Bamberg, Bamberg, Germany e-mail: fabian.beck@uni-bamberg.de

© The Author(s) 2024 D. Burghardt et al. (eds.), *Volunteered Geographic Information*, https://doi.org/10.1007/978-3-031-35374-1\_7

F. Beck (-) · S. Latif

**Fig. 7.1** Conceptual diagram of the applied analysis cycle; steps **I**–**II** in gray mark previous research and results; steps **III**–**IV** as black arrows represent the focus of our research, which is intended to facilitate a joint dialog between stakeholders (**V**)

of information. For instance, a story on flight data might link a globe or map that shows flight trajectories with explanations on busy routes and airports but also could provide insights into specific examples like the longest flights.

As an expressive and interactive medium, data-driven stories fit volunteered geographic information. For data that is volunteered by the public, it is a natural choice to make that data also available and understandable for a broad audience. Figure 7.1 illustrates this concept as *closing a circle*. Public groups or individuals volunteer geographic information that is then made available on open data platforms (**I**). While experts have already analyzed such data in various application scenarios (**II**), we contribute methods for bringing the derived insights back to the general public and decision-makers. From the open data itself (**III.a**) and insights of the experts (**III.b**), we support authoring and automatically generating reports that can be understood by this broad group of users (**IV**). Computer support and automation are necessary to efficiently create summaries of varying data and to allow personalized reporting. With these reporting solutions, we intend to facilitate non-experts users and foster a dialog between the public, decision-makers, and experts (**V**). This cycle aligns with endeavors of others to make volunteered geographic data directly usable to the public, for instance, within project *IDEAL-VGI* (Chap. 2).

The challenges of this research are in the identification and selection of relevant insights, as well as in their reporting as integrated visual and textual representations. The produced reports should furthermore invite the readers and users to explore the data. We have approached these challenges by first studying the interplay of text and visualization in existing examples of geographic data-driven stories (Sect. 7.2). Then, we investigated solutions that help author reports with close linking between the two representations (Sect. 7.3). Finally, designing automatically generated reports allowed us to provide, aside to a data-driven story, solutions with certain support for exploration (Sect. 7.4). While geodata plays a role in all presented research, we also consider the visualization of additional, non-geographic data.

This chapter describes the results of the project *vgiReports* and summarizes as well as connects an excerpt of project-related publications (Latif et al. 2021a,b, 2022a,b) and a preliminary work (Latif and Beck 2019a). Two of these works (Latif et al. 2021a,b) report results from collaborations with other projects of the priority program.

### **7.2 The Interplay of Text and Visualization**

The way textual and visual descriptions are combined is crucial in data-driven stories. If integrated well, it could avoid a *split attention effect* between the two media (Ayres and Sweller 2005) and might even help identify misaligning information (Zheng and Ma 2022). Interactive linking of text and visualization can increase user engagement (Zhi et al. 2019) and guide user attention while supporting specifically less experienced users in correctly mapping the text and data (Barral et al. 2021).

Journalistic outlets provide many high-quality, manually crafted examples of data-driven stories. For instance, *The New York Times* has published in 2021 more than a hundred carefully designed visual stories and interactive graphics (The New York Times 2021). As some of such stories cover geographic aspects, we can leverage them to study how geographic data is successfully reported to a wide audience. Previous research has already studied in existing stories structure and sequence (Hullman et al. 2013), patterns of visual narrative flow (McKenna et al. 2017), and narrative order in time-oriented stories (Lan et al. 2021). Text in such stories can have different roles, ranging from introductory texts to detailed annotations of the visualization (Segel and Heer 2010). We have focused on a finegrained analysis of such categories and the explicit and implicit interplay of text and visualization in stories, with a certain focus on geographic aspects. In a first study, we analyzed 22 full stories from a variety of news media (Latif et al. 2021b). A second study looked at a set of 110 paragraph-chart pairs stemming from 77 articles of different news media (Latif et al. 2022b). Using a qualitative methodology in both studies, we investigated the text on sentence and word level and classified the cases into different categories. Specifically, we have addressed the following research questions.

*What Are the Reported Analysis Insights, and How Is the Related Data Visually Communicated?* **(Latif et al. 2021b)** We observed two categories of textual narrative: data-driven text and contextual embedding text. The former directly relates to the data and describes analysis insights. These insights link to the analysis tasks, namely, *identify*, *summarize*, and *compare*. In stories with geographic focus, location and time are generally key concepts. In particular, locations associated with extreme values as well as those depicting very dissimilar behavior (outliers) are identified and explained in the narrative. Likewise, clusters of locations are discussed together to highlight their similarities. Other reported insights include geographic and temporal variations of variable values across a geographic region or time span. The narrative either uses measures of central tendency like mean, median, or mode to summarize them or describes these variations in plain words. Lastly, the narrative compares locations or other data items using part-to-whole contrasts, correlation, or statistical ranking. Apart from these insights, as contextual embedding, a comparable proportion of the narrative blends in the background of the story and data, necessary domain knowledge, and quotes from external sources or people to make these stories self-sufficient units of information. Moreover, authors of the stories interpret analysis insights, relate to other data and information sources, and attach judgment. It is important to note that the data-driven text and the contextual embedding text are often intermingled and cannot be always unambiguously separated. As the textual narrative explains the analysis insights, visualizations act as a complement to show the relevant data. Visualizations can serve a specific purpose, for instance, to provide an overview of the data, to support comparisons, or to highlight details. The use of simple visualizations like maps, visually enriched tables, bar charts, and line plots is more common compared to slightly more advanced ones like distribution plots and scatter plots.

*How Do Textual Narration and Visualization Interplay?* **(Latif et al. 2021b)** We discovered different kinds of linking strategies to convert visualizations and an associated textual narrative into a single engaging story. First, the visualizations are almost always placed close to the text that describes them. Likewise, the sequence of visualizations in a story is important: The overview visualizations often appear first and are followed by detailed visualizations. Second, to strengthen the linking further, textual elements like captions, annotations, or tooltips are employed inside or next to visualizations. These textual elements often explain the key insight of a visual or help users in better interpreting the visualization. We observed that the use of descriptive annotations even enabled authors to include comparatively complex and non-standard visualizations into their stories. Third, visualizations as a whole or parts of them are explicitly referenced from the textual narrative. Authors sometimes also use the same colors in text and visualization to show connections between the two media.

*What Implicit References Exist Between Text and Visualization, and How Do They Relate to the Data?* **(Latif et al. 2022b)** Implicit references can be defined as connections between a textual narrative and a visualization if both refer to the same data items. For instance, the mentions of countries aside showing a world map make country names implicit references. However, such connections are not limited to just single entities or values but also include group references (referring to many data points, e.g., *EU*) and interval references (referring to numerical ranges). Furthermore, individual references can be grouped together to form higher-order references. We found that these implicit references can correspond to analysis tasks such as *identification*, *summarization*, and *comparison*. Almost half of the implicit references directly matched a chart feature (e.g., axis label, legend, annotation, caption). However, the other half contained linguistic variations (e.g., inferences, synonyms, abbreviations, stems, or lemmas) or numerical variations (e.g., rounded off numbers, approximations, computed measures) and are harder to map to the visualized data.

### **7.3 Authoring Interactive Reports**

Creating a data-driven story requires effort: Aside from writing the text, data needs to be analyzed and visually presented. Web technologies provide a good basis for making the content available. However, whereas many content management systems allow placing textual and visual content side by side, they do not support integrating both representations closer and, through this, creating interactive documents. Filling this gap, various authoring tools and supporting approaches have already been suggested for data-driven storytelling (Tong et al. 2018, Section 3). For instance, Chen et al. (2020) developed a framework to synthesize stories from insights identified using a visual analytics systems. It allows an author to arrange insights in different simplified visualizations, annotating and connecting them to tell a story.

Whereas most of these approaches support efficiently generating stories of different kinds, they do not directly address creating explicit and interactive links between the text and visualization. In contrast, *VizFlow* (Sultanum et al. 2021) focuses on links between text and visualization for authoring; while their links are limited to manually created links to image-based features of the visualization, they investigate in more detail how to leverage such links for document layout. *Ellipsis*  (Satyanarayan and Heer 2014) allows authoring staged slide show stories with annotations that can be bound to data values and adapt with it. *Elastic Documents*  (Badam et al. 2019) does not allow creating links directly but extracts text and tables that relate and connects them using new visualizations. Related are, furthermore, general approaches for annotating charts by, among other visual marks, textual content (Ren et al. 2017).

Focusing on an easy and efficient creation of valuable links between the text and visualizations, we have developed *Kori* (Latif et al. 2022b). The system, as demonstrated in Fig. 7.2, supports both manual creation of links and automatic suggestions for links. The computed links are based on processing the text and consider the hierarchical structure of references discussed in Sect. 7.2. While an author is composing an interactive story, the system offers unobtrusive suggestions, which can then be inspected and accepted or discarded. For the reader, the links finally act as interactive references and, when triggered, take the users' attention to the respective portion of the visualization. Not only do they reduce a split attention effect but could be starting points from where to explore the data further. For the

**Fig. 7.2** The user interface of *Kori*. It consists of a chart gallery (1) and an editing interface (2). It supports manual creation of links through simple interactions (3). Users can choose highlighting options and their properties (4)

manual creation and adaption of links, the system offers an interface that requires only a few interactions to define the references. There are two modes of manual construction: First, the authors can directly select visual marks in the chart using a direct manipulation mode (e.g., rectangular brush selection). Alternatively, authors can apply a series of filters to select visualized data points. Through these means, authors can effortlessly create references and focus on composing their story.

In a study, we asked 11 participants having diverse background and experience to create references to link text and visualizations in three examples. In the first and second task, they reproduced given links of various kinds with the tool. The third task was more open-ended; the participants were only given a set of visualizations and had to also textually summarize some findings as a short story while linking the text to the visualizations. The results indicated that participants did not have difficulties using the interface and were able to construct meaningful references in all three tasks. Among 64 automatic suggestions that occurred in the sessions, 48 were correct and 16 incorrect. Participants also used the manual construction mode and rated it comparable to the automatic suggestion feature, both with a median of 4 on a scale from 1 (worst) to 5 (best). The feedback for the automatic suggestions was mostly positive and confirmed that many recommendations were helpful and did not disturb the users' workflow. "Smarter" reference detection methods, however, could still improve the experience.

### **7.4 Explorative Reporting**

Data-driven stories and visual reports of data might be presented as interactive documents but often remain rather static. Users can interactively navigate through the story and retrieve some details on demand, but the documents mostly lack support for starting an explorative analysis that goes beyond the original story. Moreover, stories do not adapt to personal interests or current data. The explanation for these restrictions is simple: the stories are manually written as static texts. However, if partly automizing the generation of the natural-language content, we could provide extended options for explorative data analysis and personalization.

Various techniques exist for natural-language generation (Gatt and Krahmer 2018). Whereas the use of advanced generation methods for automatic reporting of data is not yet common in journalistic and industrial practices, some research prototypes have already investigated its potential for more adaptive reporting. For instance, such generation techniques have been used to provide guidance in the data exploration process by reporting automatically derived data facts (Srinivasan et al. 2019). But also whole stories can be generated. Unlike approaches that rather target at fully automatic generation (Shi et al. 2021), we are interested in still humanauthored reports, however, which can adapt to different data automatically. Used in interactive documents, the generated text blends with visualizations in a datadriven story. In earlier works, we have explored such representations, for instance, to generate profiles of scientific authors (Latif and Beck 2019b) or, in software engineering use cases, to summarize program executions (Beck et al. 2017) and code quality (Mumtaz et al. 2019). More and more, the generated reports allow greater flexibility regarding the interactive exploration of the data, which complements explanatory texts and guided data analysis. The general idea can be described as *exploranation*, mixing *exploration* with *explanation* (Ynnerman et al. 2018).

Now, we have investigated such approaches to apply them to geographic data in the context of different media and usage modalities. These cover novel aspects such as comparative descriptions of selected entities, novel forms of presentation such as adaptive audio guides, and novel blends of interaction forms and presentations such as chatbots. This set of diverse examples shows early prototypes that demonstrate promising directions of visual reporting; we have not evaluated them yet in detail or connected them into a more comprehensive framework.

### *7.4.1 Maps with Data-Driven Explanations*

Maps that show statistical information are widely used in data-driven storytelling. *Choropleth maps* visualize variation in one variable for a set of regions (e.g., countries). However, oftentimes, it is desirable to describe the relationship between two variables, which would require simultaneous visualization of two values per region. For instance, *per capita spending on education* could be compared to *per*

**Fig. 7.3** A report describing casualties due to storms in the US as a bivariate map and textual summary (top). Details on a selected region and a comparison of two selected regions are available on demand (cutouts at the bottom)

*capita spending on defense* to understand different geopolitical roles of countries. An established way of visualizing such bivariate data is to employ graduated symbols overlaid on a choropleth map (Elmer 2012). However, by construction, these bivariate map visualizations are more complex to interpret, and it gets harder to spot visual patterns. Additional textual explanations might counterbalance and could hint at interaction effects of the variables that would go unnoticed otherwise. Encoding more variables per region in a more complex visual glyph is doable but would render the communication to a wider audience even more challenging.

To visually and textually report at least bivariate geographic data in a more accessible way, we developed *Interactive Map Reports* (Latif and Beck 2019a). It employs well-established statistical methods to detect notable relationships, geographic patterns, and outliers in given bivariate data. These insights are automatically transformed into a natural-language narrative that is, then, presented alongside a bivariate map visualization as shown in Fig. 7.3. The given example relates, for states of the USA, the *number of fatalities caused by storms* to the *number* *of storms* to reflect if the quantity of storms is directly related to the death toll. The textual narrative serves as a guide and explains findings. Small graphics in the text help establish linking between the two representations. Users can explore the map visualization as they read through the narrative by activating interactive links (printed in boldface). Likewise, while exploring the map, users can either get additional details on a selected geographic region or a comparative text for two selected regions.

The system is capable of generating interactive reports for different bivariate geographic datasets. Through a small set of parameters that the user provides about variables, geographic region and granularity, and general terminology, it adapts the generated narrative and visualization.

### *7.4.2 Interactive Audio Guides in Virtual Reality*

Virtual reality is emerging to be an engaging medium for interactive data visualization, and it has just been started to be explored for data-driven storytelling (Isenberg et al. 2018). The idea of *exploranation* is also applicable to virtual reality visualizations. However, as used previously in documents, longer textual narrative will not be suitable as reading would counteract immersion. As an alternative, audio can be used in virtual reality applications such as games, movies, or virtual museums. Prerecorded audio narrative might be played at various stages of the story (e.g., in a game) or activated by a user interaction (e.g., in a virtual museum). The prerecording aspect limits the flexibility, and such approaches cannot adapt to changes in the data as a result of interactions.

To support *exploranation*, our approach *Talking Realities* (Latif et al. 2022a) combines a data-driven audio narrative with an immersive virtual-reality visualization. The audio narrative is based on automatic identification of interesting analysis insights. Using speech synthesis services, it is rendered on the fly from generated text and, therefore, adapts to data selections and user interactions. To provide a smooth exploration experience, the narrative should be synchronized with visual animations. To cater to the needs of a larger target user group, *Talking Realities*  advocates three modes with varying levels of guidance. On the one hand, fully *guided tours* walk users through a pre-defined sequence of findings with the least freedom to explore. *Free exploration*, on the other hand, lets users investigate the data visualization without any intervention. In the middle lies the *guided exploration*  that provides hints at potential perspectives that are worthy of exploration. We have tested the approach with different immersive visualizations, ranging from multivariate statistical data to astronomic data. Figure 7.4 shows an example of intercontinental air traffic data projected onto a globe.

**Fig. 7.4** Scenes and audio explanations (here, transcribed) from our prototype implementing the *Talking Realities* approach for air traffic data. (Top) A description of the aggregated intercontinental flights for one day. (Bottom) Scenes reporting the longest flight from an airport and most flights to any other airport

### *7.4.3 A Chatbot Interface Providing Visual and Textual Answers*

Using natural language can make interactions with a machine effortless. Chatbots that reply to textual messages are an example of that. Instead of going through context menus and then choosing the relevant option, chatbots let us verbalize our requests as we would to another human being. However, research on supporting chatbot interfaces for data analysis and visualization is still in its infancy. Although some systems are already powerful, the use of chatbots can still lead to false expectations, misunderstood questions, and unexpected replies (Tory and Setlur

**Fig. 7.5** *VisKonnect* answers user questions with a mix of textual reply (left) and explorable visualizations (right). The cutout of the visualization shows a timeline for the two identified scientists; annotations are placed manually for highlighting events that the users might further explore

2019). We believe that a chatbot interface could be a good starting point to make the first contact with the data. In response to the user query, a resulting *exploranative*  representation of data should then enable users to verify and validate presented facts and to explore related ones.

For a specialized use case of exploring relationships among historical public figures, we have developed *VisKonnect* (Latif et al. 2021a) together with project *WorldKG* (Chap. 1). The approach offers a chatbot interface to ask questions about said historical figures. Given a question, it uses a rule-based approach to understand the intent of the question and extract meaningful entities (e.g., people, places). Based on this information, it formulates a SPARQL query to pull the relevant data from an event knowledge graph (Gottschalk and Demidova 2019). This data is then visualized in multiple linked visualizations, highlighting the timelines of individual and shared events, as well as where these events take place. These visualizations are augmented with a textual explanation that aims at answering the user question (either through simple text templates or the GPT-3 language model). Additionally, related events are listed and serve as interactive links to explore them in the associated visualization. Figure 7.5 demonstrates a query about two wellknown scientists and the response generated by *VisKonnect*.

### **7.5 Conclusion and Future Work**

Within the presented research, we have empirically investigated in-depth how geographic data and related information can be jointly described and linked in textual and visual representations. For creating data-driven stories as integrated reports, we provide authoring support of such links that better connect the two representations. While links can be manually added in a flexible and easy-to-use way, our solution also automatically recommends specific linking through analyzing the data-driven text. In different reporting solutions, we were able to demonstrate the flexibility and broad applicability of our reporting solutions as automatically generated descriptions of statistical maps, as audio guides in virtual reality for immersive visualizations, and as a natural-language interface to a knowledge graph that responds with textual and visual data representations.

With these solutions, we not only guide users through insights of a data analysis but at the same time invite them to explore the data in depth. Following our overarching goal to loop back the volunteered data to the public, we are now specifically interested in transferring these empirical results, methods-oriented general solutions, and early research prototypes to specific application examples and inviting a broader audience to use them. Ongoing work targets at this already, for instance, investigating a visual reporting solution for personalized, comparative summarizations of hotel reviews.

Our research generally emphasizes that citizen participation in research is not one directional. Reflecting back results and providing options to explore the data support an even higher level of participation and should be considered in all citizen science projects and data volunteering platforms. Our ideas can be brought together with analysis solutions for volunteered geographic information, and we invite researchers developing such solutions to also investigate this perspective. Still, empirical studies are necessary that explore the effect of providing visual reporting solutions on the engagement of volunteers and the influence on decision processes.

**Acknowledgments** The work is funded by the *Deutsche Forschungsgemeinschaft* (DFG, German Research Foundation)—424960846.

### **References**


on Visual Analytics Science and Technology. IEEE, pp 93–103. https://doi.org/10.1109/ vast47406.2019.8986918


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Effects of Landmark Position and Design in VGI-Based Maps on Visual Attention and Cognitive Processing**

### **Julian Keil, Frank Dickmann, and Lars Kuchinke**

**Abstract** Landmarks play a crucial role in map reading and in the formation of mental spatial models. Especially when following a route to get to a fixed destination, landmarks are crucial orientation aids. Which objects from the multitude of spatial objects in an environment are suitable as landmarks and, for example, can be automatically displayed in navigation systems has hardly been clarified. The analysis of Volunteered Geographic Information (VGI) offers the possibility of no longer having to separate methodologically between active and passive salience of landmarks in order to gain insights into the effect of landmarks on orientation ability or memory performance. Since the users (groups) involved are map producers and map users at the same time, an analysis of the user behavior of user-generated maps provides in-depth insights into cognitive processes and enables the direct derivation of basic methodological principles for map design. The landmarks determined on the basis of the VGI and entered as signs in maps can provide indications of the required choice, number, and position of landmarks that users need in order to orientate themselves in space with the help of maps. The results of several empirical studies show which landmark pictograms from OpenStreetMap (OSM) maps are cognitively processed quickly by users and which spatial position they must have in order to be able to increase memory performance, for example, during route learning.

**Keywords** Landmarks · Salience · Pictograms · Spatial memory

Ruhr-Universität Bochum, Bochum, Germany e-mail: julian.keil@rub.de; frank.dickmann@rub.de

J. Keil (-) · F. Dickmann

L. Kuchinke International Psychoanalytic University Berlin, Berlin, Germany e-mail: lars.kuchinke@ipu-berlin.de

### **8.1 Introduction**

The increasing relevance of mobility is linked to an increasing extent of using mobile navigation technology. Modern navigation and visualization technology depend on the processing of extensive spatial data, which must be provided and maintained in real time. A sub-project of the SPP 1894 "Volunteered Geographic Information (VGI): Interpretation, Visualization and Social Computing" of the Ruhr University Bochum and the International Psychoanalytical University Berlin investigates how wayfinding and navigation can be supported by user-generated geodata. The project addresses the growing interest in reliable spatial data, which extends far beyond the capacities of official geodata sets. The aim is to obtain quantitative data on the use of landmarks in maps in the formation of mental spatial models by means of empirical studies and to transfer these to cartographic applications. The focus of this research is thus on the representation, use, and interpretation of landmarks as they appear in map-based Web 2.0 services such as OpenStreetMap (OSM). The results are expected to further contribute to the development of automatic extraction processes in mobile map design (cf. Klippel and Winter 2005; Elias and Paelke 2008).

### **8.2 The Role of Landmarks in Self-Localization, Navigation, and the Formation of Mental Spatial Models**

Effective navigation in space depends on continuous self-localization (Loomis et al. 1999). By using topographic objects in the environment as spatial reference points, self-localization can be achieved (Meilinger et al. 2006). In this context, landmarks, salient spatial objects (Sorrows and Hirtle 1999; Bestgen et al. 2017), play an important role in real-world and map-based navigation. From the perspective of their perceiver, landmarks pop out of their surrounding objects (Bestgen et al. 2017; Röser 2017). This increases their likelihood to be perceived and processed. Therefore, landmarks are more likely used as the spatial reference points required for self-localization (Sorrows and Hirtle 1999; Bestgen et al. 2017; Elias and Paelke 2008; Millonig and Schechtner 2007).

Due to their relevance in the context of self-localization, landmarks are also known to play an important role for orienting. Turn-off points of a route can be identified by recognizing landmarks located close to these turn-off points (Millonig and Schechtner 2007). Furthermore, landmarks along the route can help ascertain that one is still following a specific route correctly, even if these landmarks are not located close to a location where the travel direction needs to be adjusted (Anacta et al. 2017). The relevance of landmarks for navigation is also reflected in findings demonstrating that the availability of landmarks during navigation increases the accuracy of route finding (Ruddle et al. 1997) and user confidence (Ross et al. 2004).

When people perceive space, either directly in an environment or as a map representation, they gradually integrate the perceived spatial information into a mental spatial model (Millonig and Schechtner 2007). According to Siegel and White (1975), landmark-based spatial knowledge is the first building block for the development of a mental spatial model. By focusing on salient landmarks, people make sense of otherwise (too) complex environments. Thus, landmarks provide an abstraction layer that is easier to memorize than the unfiltered environment (Millonig and Schechtner 2007; Presson and Montello 1988). In this abstraction layer, spatial elements can be memorized based on their relative location of the reference points provided by landmarks (Golledge 1999; Richter and Winter 2014). In other words, landmarks form a framework with the help of which other topographic objects in space are remembered. However, memorizing only landmarks as spatial reference points does not result in a usable mental spatial model in the context of spatial navigation. People also need to identify and memorize route segments connecting the spatial reference points. Werner et al. (2000) describe such a network as consisting of "nodes" and "edges" (see Fig. 8.1).

If acquired within real-world space, spatial knowledge in the form of landmarks and routes is incipiently egocentric (Millonig and Schechtner 2007). However, with the integration of different viewing perspectives of space, an allocentric survey model is likely also being developed (Werner et al. 1997). The map-like structure of such an allocentric mental spatial model allows more complex and flexible spatial evaluations. Even landmark pairs, for example, that are not directly connected with a route segment can be set into spatial relation to each other within such an allocentric model. Additionally, each added route segment increases the number of potential routes (see Fig. 8.1). This allows, for example, estimating distances or planning alternative routes between specific landmark pairs.

**Fig. 8.1** Landmark-based mental spatial model. Memorized landmark locations are represented by red dots (nodes). The dotted line connections (edges) between the landmarks each represent a memorized route between two nodes of the mental spatial model. The blue and green lines demonstrate that different routes between a start and target location can be selected if enough edges between node pairs have been memorized. Figure adapted from Keil (2021)

### **8.3 Identifying Landmarks**

As demonstrated above, landmarks play an important role in spatial cognition, spatial behavior, and spatial memory. But how can we distinguish landmarks as relevant spatial reference points from less relevant topographic objects? So far, several attempts have been made to define landmarks from different perspectives. (Lynch 1960) focused on the inherent physical characteristics of urban objects and their visual contrast to features in the environment. Caduff and Timpf (2008) suggested a trilateral approach to identifying landmarks. The trilateral approach assumes that landmarks are created through an interaction between the observer, the spatial object, and its environment. According to Anacta et al. (2017), landmarks are objects with a fixed geographic location that are easy to perceive and recognize.

All of the aforementioned approaches to defining landmarks, either implicitly or explicitly, suggest a common characteristic of landmarks: they need to be salient. The salience of a topographic object describes and comprises characteristics that increase the likelihood of this object to attract the attention of (potential) observers in the environment (Itti 2005, 2007). Conversely, topographic objects that attract more attention than surrounding objects can be interpreted as being more salient (Caduff and Timpf 2008) and thus more likely to be used as a landmark (Röser 2017).

The trilateral approach of Caduff and Timpf (2008) already suggests that salience is not based on one single object characteristic that can be easily measured. Instead, to identify potential landmarks in the environment based on the salience of topographic objects, numerous characteristics need to be considered. To address this complexity of landmark salience, Sorrows and Hirtle (1999) take an approach that does not only refer to the appearance and position of a spatial element in an environment: They propose three categories of landmarks, visual (visual contrast), structural (prominent location), and cognitive (use, meaning) landmarks. Thus, it is proposed that the identification of landmarks also takes the semantics of the spatial object into account. To better distinguish the semantics from cognitive effects that could also affect the visual salience, the term semantic salience has also been established (Klippel and Winter 2005; Quesnot and Roche 2015). Due to the individual experiences and previous knowledge of a viewer, an exact determination of the meaning that a landmark (e.g., a building) has for an individual viewer can only hardly be defined and measured (Golledge 1991; Nuhn and Timpf 2017). One way out is to fall back on generally accepted classifications, e.g., road classifications (federal road, rural road, etc.) or monument classifications in the case of historic buildings, etc. (e.g. Raubal and Winter 2002). However, a study by Nuhn and Timpf (2019) indicates that personal attributions of meaning might be less important for the selection of landmarks than previously assumed. A landmark selection model containing personal information did not perform better than a model without this information. Although these findings do not solve the fundamental problem of the individual assignment of meanings to landmarks, they provide a perspective on the problem, at least in connection with visual and structural salience characteristics.

### **8.4 Landmark Representations in VGI-Based Maps**

The main focus of previous research on landmark salience was directed at landmarks within real-world space (e.g., Anacta et al. 2017; Röser 2017; Sorrows and Hirtle 1999). However, psychological studies demonstrate that salience also affects visual attention in the perception of 2D information, e.g., on computer interfaces (Buscher et al. 2009) or in maps (Fine and Minnery 2009). This raises the question of how salience effects influence the perception and use of landmark representations in maps.

In most maps, real-world landmarks are represented by pictograms. However, the selection of landmarks to be represented in the maps occurs a priori. Thus, the selection of landmarks to be represented in maps is usually not based on direct interaction with the environment and user choices but rather based on established cartographic principles. As stated above, user characteristics (semantic salience) affect the selection of landmarks in real-world space. Additionally, visual salience characteristics (e.g., visibility from a specific viewpoint) can affect the selection of landmarks. In this context, an additional categorization of visual, semantic, and structural salience characteristics into active and passive salience can be used to illustrate the issue of landmark selection in maps. Passive salience depends on physical attributes of landmarks, such as size, color, or their function (Bestgen et al. 2017). It describes the potential of an object to attract attention based on bottom-up processes independent of individual characteristics of a potential perceiver (Caduff and Timpf 2008). The methods for identifying the passive salience of landmarks are explained in detail by Duckham et al. (2010), Klippel and Winter (2005), Nothegger et al. (2004), or Elias (2003). These approaches use, among other things, the results of statistical and data mining methods (Peters et al. 2010; Sadeghian and Kantardzic 2008). The concept of active salience, on the other hand, is based on the perceived, cognitive, and contextual appearance of a landmark, i.e., viewed from the perspective of the traveler (user), including his or her experiences, age, prior (cultural) knowledge, and the way he or she moves (Caduff and Timpf 2008; Millonig and Schechtner 2007; Zhu 2012). Taken together, exogenous (passive) salience patterns are contrasted with endogenous (active) ones. Both concepts influence the processing of map information and need to be taken into account for the identification and representation of landmarks. If the active user- and contextdependent salience characteristics are not considered, the set of landmarks selected to be represented in a map will not match the topographic objects selected as landmarks by the map users.

A special case in the context of landmark selection are maps based on Volunteered Geographic Information (VGI), such as OpenStreetMap (OSM). Producers of VGI map data (influenced by endogenous salience characteristics) simultaneously assume the role of the consumer (influenced by exogenous salience characteristics), i.e., two perspectives on landmark saliency are merged. There is no external decision-making authority, as is usually the case in map production. The resulting landmarks come from the actual spatial experiences of the volunteers (producers = consumers) themselves. Therefore, the study of map elements that are considered landmarks by such volunteers can provide valuable information on how landmarks are identified in maps in terms of their density and spatial arrangement and how user-generated landmark representations influence the formation and accuracy of cognitive representations (mental models) of the mapped space.

The VGI-based selection of spatial objects represented as landmarks in VGIbased maps is assumed to reflect the specific distribution of landmarks to each other (i.e., patterns) but also the relations to other map elements (non-landmarks) in maps that are required to solve spatial tasks. Since the mapping process of volunteers is affected by direct interaction with the mapped environments as well as their mental spatial representation of these environments, the landmarks represented in VGI-based maps are expected to reflect the patterns of landmarks required to represent space in a mental spatial model. This raises the question of how landmarks distributed in maps on the basis of empirically obtained rules (i.e., on the evaluation of landmark patterns created by volunteers) are used to create mental spatial models.

In real-world space, it is not only a single landmark in the landscape that is important for building survey knowledge of an environment. Rather, networks of several landmarks and the relations (e.g., distances or routes) between them must be considered (Herman et al. 1979; Werner et al. 2000). In a map, an interaction or pattern of several landmarks could therefore support the function of structuring (map) spaces and thus possesses a function for the construction of more accurate mental models in the same way (cf. Golledge 1993). Again, if the active user- and context-dependent salience characteristics are not considered, the set of landmarks selected to be represented in a map will not match the topographic objects selected as landmarks by the map users. In addition, it can be shown that landmarks are used in spatial learning to relate spatial objects to them (Ferguson and Hegarty 1994; Golledge 1999). Thus, visual attention directed toward specific landmarks due to their salience may not only affect spatial memory of the landmarks themselves. The availability of salient landmarks may also affect spatial memory of surrounding spatial objects or routes.

However, it needs to be considered that the salience and perception of landmarks in real-world space differs from the perception of landmark representations in maps. In real-world space, landmarks are only potentially salient if they are within the line of sight of the observer. This excludes many proximate landmarks that are not visible, for example, because they are hidden behind other spatial objects (cf. Lynch 1960). In maps on the other hand, each represented landmark is potentially visible. In this case, which landmark representations are perceived depends on the salience characteristics and/or task requirements. Therefore, how salience or spatial tasks affect the use of landmarks in maps (opposed to real-world landmarks) could affect performance in spatial tasks such as self-localization, orientation, and navigation, as well as the formation of mental spatial models based on map perception.

Not all graphic elements of a map share the same relevance for the formation of cognitive map representations as the selected (higher-level) landmark structures. The analysis of task-dependent salience of VGI-based map elements (here OSM maps) is therefore expected to support the identification of a working definition of the required spatial reference points (specifically landmark representations) that need to be visualized to ensure effective spatial information transfer. This might lead to new content structures that make it easier for map users to associate objects to be learned in a map with a higher-level frame of reference. The results of several studies contained in our project on the semantic, visual, and structural salience of VGIbased landmarks provide insights into the selection of spatial reference points in maps during the acquisition of spatial memory. Thus, these findings contribute to the development of guidelines for task-oriented map design that support the formation of mental spatial models.

### *8.4.1 Semantic Salience*

Semantic salience refers to semantic properties that affect the likelihood of an object to draw attention. As demonstrated by Pilarczyk and Kuniecki (2014), semantically salient stimuli attract visual attention, and the semantic features of a stimulus can in some cases even have a stronger effect on the direction of visual attention than the visual features. Different characteristics that affect the semantic salience of topographic objects have been suggested over the years. Some suggestions focus primarily on generalizable characteristics like cultural and historical significance or the purpose or function of an object (Claramunt and Winter 2007; Nothegger et al. 2004; Raubal and Winter 2002; Röser et al. 2011). These characteristics suggest that semantic salience is an intrinsic bottom-up property of an object that is not affected by the observer. Based on these suggestions, a church or a police station should be semantically more salient than a common residential building. Other approaches to assess semantic salience argue that top-down processes based on knowledge and preferences of an observer affect the semantic salience of topographic objects (e.g. Golledge 1991; Nuhn and Timpf 2017; Quesnot and Roche 2015). For example, a residential building can—in some cases—even be semantically more salient than a church or a police station if the observer lives in it.

In the context of map elements like landmark representations, assessing semantic salience faces issues, which are unique to the medium. First, landmarks are usually represented in maps as pictograms. In most cases, these pictograms do not reflect the visual characteristics of the represented landmarks but are designed to reflect the purpose or function of the landmark. This abstract representation of landmarks can override the semantic associations with a landmark. For example, your favorite restaurant located in a beautiful old building could be represented by the same pictogram as each other restaurant. On the other hand, a map pictogram can also communicate semantic information that is not effectively communicated by the visual appearance of the real-world object represented by the pictogram. It is also important to consider how the pictogram design affects to what extent map users are able to interpret the purpose or function of the represented landmark. For example, some pictograms might only be used in specific cultures or might have different meanings in different cultures (Spinillo 2012). Thus, pictogram designs need to be selected based on the intended user group of a map.

In a first study of our project, we investigated to what extent people understand the meaning of landmark pictograms (meaningfulness) and how the ability of a pictogram to communicate its semantics affects the attraction of visual attention (semantic salience) and the ability to memorize the pictogram (see Keil et al. 2019). We chose to investigate a set of 153 pictograms obtained from OSM. The map content and design of OSM are provided and influenced by a large worldwide community of volunteers. This is assumed to be reflected in the pictogram design and the accessibility of pictogram semantics within different cultural groups.

In a recognition design, sets of 12 pictograms (see example in Fig. 8.2) were shown to the participants. After a distractor task, participants had to identify the previously shown pictograms within a set of 24 successively shown pictograms. During the encoding phase, fixations on the pictograms were recorded with an eye tracker. In the second half of the experiment, participants successively saw the 153

**Fig. 8.2** Fixation heat map. In the study, participants saw and had to memorize sets of 12 landmark pictograms. Semantic salience of the pictograms was assessed based on the visual attention directed toward each pictogram, as reflected in the measured fixations. In the example above, the helipad pictogram (down right) was fixated less often than the postbox pictogram (top left) and was therefore scored as less semantically salient. Potential order effects in the stimuli (e.g., based on reading direction) were addressed by varying pictogram locations between trials

pictograms. Each pictogram was shown together with a continuous scale that was used to assess to what extent participants were certain to understand the meaning of the pictogram (meaningfulness).

The findings demonstrate that pictograms with a very low meaningfulness rating attracted more visual attention and were recognized more often. An explanation of this unexpected finding needs to consider the experimental design and the general source of salience, which is a contrast with surrounding stimuli (Claramunt and Winter 2007; Sorrows and Hirtle 1999). Most of the 153 pictograms were rated as having a relatively high meaningfulness. Therefore, the few pictograms with a low meaningfulness were the ones with the highest semantic contrast. Consequently, as the semantic contrast is assumed to direct visual attention, the few pictograms with a low meaningfulness should also be the ones with the highest semantic salience. As selective visual attention has been associated with improved object learning (Walther et al. 2005), this also explains the better memory performance of landmark pictograms with a low meaningfulness.

However, it is important to also consider that the perception of landmark pictograms in this experiment does not match the perception of landmark pictograms in their "natural" environment. In OSM and other maps, semantic pictograms are surrounded by unified representations with a low meaningfulness, for example, buildings, roads, or green spaces. Thus, pictograms with a high meaningfulness should have a higher semantic contrast within a map and therefore a higher semantic salience. Furthermore, due to the higher selective attention, landmark representations in maps with a higher meaningfulness are assumed to be more likely to be stored in a mental spatial model (Walther et al. 2005). Taken together, the study provides an approach for assessing the semantic salience of landmark pictograms and demonstrates that semantic salience affects the attraction of visual attention and the memory of map elements.

### *8.4.2 Visual Salience*

Visual salience is probably the most investigated salience characteristic. It describes the visual contrast of an object to its surrounding objects and depends on parameters as illumination, size, color, texture, or shape (Clarke et al. 2013; Davoudian 2011; Duckham et al. 2010; Röser et al. 2011) and has been demonstrated to direct visual attention to stimuli (Wenczel et al. 2017). Commonly used examples of visually salient landmarks used for orientation and navigation are large or tall buildings, unique objects like statues, or buildings with an eccentric architecture or uncommon visual features (Klippel and Winter 2005). According to von Stülpnagel and Frankenstein (2015), visual salience affects the selection of spatial objects as landmarks for orientation. In other words, people look for visually salient objects that can be used as the spatial reference points required for making sense out of space and building mental spatial models (cf. Clarke et al. 2013).

Opposed to semantic and structural salience, the allocation of attention based on visual salience has been argued to be a stimulus-driven bottom-up process (Itti 2005; Ouerhani et al. 2004). This is supported by neurological findings demonstrating that the ability to perceive feature contrasts as the foundation for visual salience is located in the V1 area of the visual cortex (Li 2002). This means that the visual salience of visual stimuli is evaluated in an early and automatic stage likely before top-down processes affect the direction of visual attention. Its dependence on feature contrast means that visual salience is a context-dependent characteristic. An object is visually salient relative to its surrounding objects (Claramunt and Winter 2007; Klippel and Winter 2005). Thus, a tall building might not be visually salient if it is part of the skyline in a large city. On the contrary, a small building can be visually salient because it is surrounded by tall buildings.

Visual salience has been intensively investigated both for real-world objects like buildings or facades (e.g., Davoudian 2011; Franke 2021; Dong et al. 2020; Röser 2017; von Stülpnagel and Frankenstein 2015; Wenczel et al. 2017), but also for the perception of 2D stimuli, for example, computer interfaces or images (e.g., Clarke et al. 2013; Ouerhani et al. 2004; Sutherland et al. 2017). However, the effects of visual salience on the perception and processing of maps and landmark representations have received little attention in the literature so far.

The design of map elements is often not based on the visual features of the represented real-world objects. Streets, buildings, green spaces, and water bodies are represented by geometric shapes, and the colors of these shapes are determined by the map design guidelines (Dickmann 2018). Landmarks, on the other hand, as mentioned earlier, are usually represented as specific semantic pictograms based on the semantic categories they are assigned to (e.g., restaurants, shops, statues, etc.). In both cases, the individual visual characteristics are lost due to the type of representation. Thus, the visual salience of a real-world object does not match the visual salience of its map representation. Furthermore, opposed to real-world environments, it is easily possible to adjust the visual characteristics of object representations in a map and, consequently, the visual salience of specific map elements.

In a second study of our project, we explored how adjustments of map design can be used to systematically direct visual attention toward specific map areas. Furthermore, we investigated to what extent different map designs and the resulting visual salience differences of specific map regions affect spatial memory. The full study is described in detail in Keil et al. (2018). As study materials, we obtained maps from OSM and added routes to the maps. As a second stimulus condition, map areas offside the route (more than 10 pixels from the route) were displayed transparently (see Fig. 8.3). This was meant to reduce the visual salience of the areas offside the route and expected to direct visual attention toward the map areas close to the route. Participants saw the maps for 30 s and were asked to memorize the route (encoding phase). During this phase, fixations on the map areas were recorded with an eye tracker. Two different map areas of interest (AOIs) were defined. The first AOI contained the route and the area 10 pixels around the route, thus the area which was not transparent in the second stimulus condition (route AOI). The second

**Fig. 8.3** Stimulus conditions. Maps were obtained from OSM (© OpenStreetMap contributors), and a route was added. For the second stimulus condition, areas offside the route (more than 10 pixels) were displayed transparently. Eye tracking was used to record fixations on the area close to the route and the area offside the route

AOI contained the rest of the map (offside route AOI). Each fixation was recorded according to the AOI it targeted at. After the encoding phase, participants were shown four versions of the previously presented map, either containing the correct route or a slightly manipulated route. For each map, they had to decide whether the displayed route matched the previously learned route.

The results show that significantly fewer fixations were directed at the map areas offside the route (offside route AOI) when these areas were transparent. Fixations on the area around the route (route AOI) did not differ significantly between the standard map and the transparent map. Thus, we were able to demonstrate that changing map design, and consequently, the visual salience of specific map elements affects the distribution of visual attention across the map. Interestingly, the fact that landmark representations offside the route were not displayed transparently did not significantly undermine the shift of visual attention toward the route. We argue that visual attention was in both conditions distinctively affected by the task requirement of memorizing the route. Thus, most fixations in the non-transparent map were already relatively close to the route and not on landmark representations far offside the route. Applying transparency only narrowed the area of fixations around the route. This interpretation is supported by the fact that route memory performance did not differ significantly between the original map and the transparent map. Participant route memory performance seems to have relied primarily on reference points close to the route. Therefore, making reference points offside the route less visible and less likely to receive attention did not affect memory performance. However, performance differences could potentially occur if the task is carried out with more time pressure. If a task requires map readers to make quick decisions or to capture map information quickly, directing visual attention toward relevant map areas could reduce distraction from less task-relevant map areas. The effects of task requirements on the direction of visual attention are further investigated in the following chapter.

Taken together, the study demonstrates that visual salience affects the distribution of visual attention in predictable patterns and that these patterns can be manipulated by adjusting the map design. This makes it possible to direct visual attention toward specific task-relevant map elements. In future studies, it needs to be addressed how different ways of guided visual attention based on landmark pictogram and map design affect spatial tasks such as orientation and navigation, as well as the formation of mental spatial models.

### *8.4.3 Structural Salience*

Structural salience is a task- or context-dependent salience (Peebles et al. 2007). Previous research on structural salience focused primarily on the relative location of landmarks during navigation and the conceptualization of routes (Klippel and Winter 2005; Röser et al. 2011). Based on their relative location to the observer or a route, structurally salient landmarks can be divided into global and local landmarks (Elias and Paelke 2008).

Global landmarks in real-world settings are located far enough to only marginally change their relative location based on movements of the observer (Keil 2021). Thereby, they can act as beacons for assessing the general travel direction (Lynch 1960; Steck and Mallot 2000; Wenig et al. 2017). An important characteristic of a global landmark is its size, as it determines its remote visibility (von Stülpnagel and Frankenstein 2015). The distance from the observer can range between a few hundred meters (e.g., a tall building) and hundreds of kilometers (e.g., a mountain) or even light years (e.g., the north star). Global landmarks are not suitable to identify or memorize specific routes. Instead, they can be used to choose paths that lead to the general direction of the travel destination. This is reflected in a large flexibility of selected routes when people navigate based on global landmarks (Hurlebaus et al. 2008).

Local landmarks, on the other hand, are located close to the observer and/or to a specified route. Compared to local landmarks, global landmarks provide more precise information about the observer's location and support encoding of and navigation along a specific route (Hurlebaus et al. 2008; Ruddle et al. 2011; Steck and Mallot 2000). Furthermore, they are frequently used for communicating routes (Anacta et al. 2017). Local landmarks can be subdivided based on their relative location to specific fragments of a route. They can be located close to decision points, potential decision points, or along the route (see Fig. 8.4). Decision points are intersections where the travel direction needs to be adjusted. Landmarks located at decision points can be used as a reference marker for identifying, memorizing, or communicating a required turn (Millonig and Schechtner 2007). Potential decision points are locations where the travel direction could be adjusted but should not be adjusted, for example, an intersection where the route follows a straight direction.

**Fig. 8.4** Navigation-oriented landmark locations. Local landmarks close to a route can be located at locations where the travel direction needs to be adjusted (decision point), at locations where the travel direction could be adjusted (potential decision point), or where the travel direction cannot be adjusted (along the route). Global landmarks act as beacons and are located offside the route (figure adapted from Bauer 2018)

Landmarks along the route are located next to the route where the travel direction cannot be adjusted (Anacta et al. 2017; Elias and Paelke 2008). Landmarks at potential decision points or along the route are not necessarily required for conceptualizing a route. However, they can be used during navigation to ensure that the route is still followed correctly (Millonig and Schechtner 2007). At decision points, landmarks can further be subdivided based on their location relative to the turn direction of a route. Röser et al. (2012) found that landmarks are more likely to be perceived and used in way finding and thus are more structurally salient, if they are located in the direction of the turn. Furthermore, turning decisions have been found to be more likely to be correct if a local landmark is available in the direction of the turn (Albrecht and Stuelpnagel 2018).

Similar to semantic and visual salience, the question of how structural salience affects the distribution of visual attention in maps has received little attention yet. As it has been argued that structural salience is task-dependent (Peebles et al. 2007), we carried out three studies on the effects of different map-based spatial tasks on the distribution of visual attention across the map. The first two studies investigated the structural salience of landmark representations in map-based route memory tasks (for the complete study details, see Keil et al. 2020a). In both studies, participants had to memorize routes displayed in maps obtained from OSM. Eye tracking was used to assess the distribution of visual attention across the maps. Landmark pictograms were available in the maps according to the locations of landmark representations added by volunteers to these OSM maps. However, to control semantic salience of pictograms, the original pictograms were randomly replaced by a set of OSM pictograms with similar levels of meaningfulness as measured in the first study of our project (see Keil et al. 2019). As other map elements than landmark pictograms could also be used as spatial reference points for memorizing the route, we also addressed how the visual complexity of a map (number of spatial elements) affects the structural salience of landmark pictograms. Therefore, the first of the two route memory studies compared rural maps with a moderate number of map elements and urban maps with a high number of map elements. In the second study, only urban maps were used, but in one condition, fractions of the original map were used, and these fractions of the map were stretched to the original map size to create a second map condition with less map elements per map display. The second study was meant to address the limitation of the first study that landmark pictograms and other map elements are usually less evenly distributed across rural maps compared to urban maps. This was argued to affect the distribution of visual attention across the maps.

Both studies found distinctive effects of the task on visual attention. Most fixations were directed at map areas around the to-be-learned routes and toward decision points of the route. Map areas and landmarks offside the route were only rarely fixated (see Fig. 8.5). Thus, we found clear evidence for the effects a specific task has on the distribution of visual attention (as assessed using an eye tracker) across a map. Furthermore, the controlled manipulation of visual map complexity in the second experiment provided additional insights in the relevance of spatial reference points for the formation of mental spatial models. In the stretched maps with reduced spatial reference points, landmarks farther offside the displayed route were fixated. This indicates that not enough spatial reference points close to the route were available for the map users to memorize the route. Thus, participants seem to have expanded their search area for suitable spatial reference points.

The third study on the structural salience of landmark representations in maps addressed location memory tasks (for the complete study details, see Keil et al.

**Fig. 8.5** Fixations during a route-learning task. The fixation heat map demonstrates that visual attention was almost exclusively directed at the map areas around the to-be-learned route. Especially high fixation counts can be seen around landmark pictograms close to the route. These pictograms can be suggested to have a high structural salience. Maps were obtained from OSM (© OpenStreetMap contributors)

2020b). Participants were presented maps taken from the OSM project. The maps contained several landmark pictograms and a to-be-learned object location highlighted by a red pictogram. After a short distractor task, participants were presented the previously shown map again, but this time without the red pictogram. They were asked to click on the recalled location of the red pictogram using a computer mouse. During the encoding phase, fixations on landmark pictograms were recorded with an eye tracker (cf. Kuchinke et al. 2016; Dickmann et al. 2015).

The results of the third study on structural salience demonstrate how the task requirements of an object location memory task affect the structural salience of landmark pictograms in a map. According to these results, most visual attention was directed toward the landmark pictograms closest to the to-be-learned object location. Additionally, landmark pictograms were fixated more often if they were located closer to the (imaginary) horizontal and vertical cardinal axes of the to-be-learned object location (see Fig. 8.6). In agreement with assumptions of Rock (1997) and Tversky (1981), people appear to apply an imaginary coordinate system to perceived maps, either based on the viewing angle or the map borders. Landmark pictograms seem to receive privileged access to visual attention as spatial reference points if they are easy to conceptualize based on such an imaginary coordinate system, e.g., being directly above, below, left, or right to a to-be-learned object location.

Also, object location memory in this task was more accurate if the closest landmark pictogram was closer to the to-be-learned object location. Fixation rates steeply dropped toward the second- and third-closest landmark pictogram to the to-be-learned object location. Of interest is that there was no significant relation between the distance of these pictograms to the to-be-learned object location and the memory performance found in this study. This seems to indicate that a single landmark pictogram already can act as a spatial reference point for object location memory and that this task requirement directly affects (or better implies) the structural salience of this reference point and the low structural salience of other (potential) spatial reference points.

Taken together, the three studies provide new insights into the structural salience of landmark pictograms applying different map-based memory tasks. All three studies clearly indicate that task requirements affect how visual attention is distributed across maps. People appear to search for suitable reference points for memorizing locations or route nodes. Whether landmark representations are chosen as reference points depends on their distance and orientation relative to memorized locations and routes.

**Fig. 8.6** Structural salience in the context of object location memory. Landmark representations in maps were fixated more often when they were located close to a to-be-learned object location and its (imaginary) horizontal and vertical cardinal axes

### **8.5 Conclusion**

The reported studies of the sub-project "The Effects of Landmarks on Navigation Performance in VGI-based Maps" of the SPP 1894 provide new insights in how landmark representations in maps are perceived and how they affect the formation of mental spatial models. Previous research on landmark salience primarily focused on the parameters that affect the direction of visual attention toward real-world landmarks (e.g., Golledge 1991; Klippel and Winter 2005; Millonig and Schechtner 2007; Röser et al. 2012). The studies reported in this paper extend these findings by investigating the salience of landmark representations in maps and depict similarities and differences between real-world landmarks and landmark representations in maps.

Semantic salience has been argued to affect the likelihood of real-world landmarks to attract visual attention both based on general attributes, as well as on individual characteristics of the observer. General attributes include cultural and historical significance, as well as the purpose or function of a landmark (Claramunt and Winter 2007; Nothegger et al. 2004; Raubal and Winter 2002; Röser et al. 2011). Individual characteristics consider emotional or knowledge associations of an observer with a perceived landmark (Golledge 1991; Nuhn and Timpf 2017; Quesnot and Roche 2015).

In the context of landmark representations in maps, semantic salience seems only partially based on the general attributes and the individual associations with the represented landmarks. Rather, the type of representation also needs to be considered. As landmarks are often represented in maps as pictograms, specific associations with a unique landmark can be overridden. For example, a famous church with a unique architecture could be represented by the same pictogram as every other church. Thus, the unique associations related to this church could not be evoked by only seeing its map representation. Furthermore, intercultural and individual differences can affect the ability to understand pictograms used to represent landmarks and, consequently, their semantic salience (Spinillo 2012).

Our first reported study demonstrated that, similar to real-world landmarks, visual attention toward landmark representations in the form of pictograms is affected by their semantics respectively their meaningfulness. However, opposed to approaches of defining semantic salience in advance (based on obtained knowledge), we found that landmark representations can also attract visual attention if the conveyed semantic information is more difficult to understand than that of surrounding objects. In other words, visual attention is not directed based on the intrinsic semantic characteristics of a landmark representation (where more meaningfulness equals more semantic salience and visual attention) but is in some circumstances also based on the semantic contrast to other objects.

However, how semantic landmark pictograms in interaction with different map designs affect the distribution of visual attention remains unclear and needs to be addressed in future studies. Interestingly, although semantic salience has been argued to be affected by individual characteristics (Golledge 1991; Nuhn and Timpf 2017; Quesnot and Roche 2015), our study found no pronounced individual differences in the perceived meaningfulness of landmark pictograms or the direction of visual attention toward specific pictograms. This demonstrates that, as mentioned above, individual associations with specific landmarks get lost when the landmarks are represented as abstract (generalized) pictograms. Still, as the study was carried out with a homogeneous cultural group, future research needs to address to what extent cultural background affects the semantic associations with specific landmark pictograms and, consequently, the direction of visual attention toward these map elements.

The role of visual salience for the selection of spatial objects as landmarks has been emphasized repetitively (Klippel and Winter 2005; Röser 2017; von Stülpnagel and Frankenstein 2015). A commonly mentioned criterion of visual salience is the visual contrast of an object to its surroundings based on its visual characteristics, for example, its size, color, or illumination (Clarke et al. 2013; Davoudian 2011; Duckham et al. 2010; Röser et al. 2011). We argued that the form of landmark representation in maps (usually as pictograms) overwrites the visual characteristics of a real-world landmark and thereby also its visual salience. However, due to their accentuated design relative to other map elements, landmark pictograms in maps are also visually salient and attract visual attention (see example in Fig. 8.5). This is reflected in their common use as spatial reference points, despite the availability of other potential spatial reference points in the maps (see Keil et al. 2020a,b). Furthermore, opposed to real-world landmarks, the visual contrast of landmark representations and other reference points in maps to surrounding map elements and, consequently, their visual salience can easily be modified. As demonstrated in our second study, adding transparency to specific map areas directs visual attention away from these areas. Thus, by manipulating map design according to task requirements, visual attention can be systematically directed toward relevant map areas and spatial reference points as landmark representations (though the effects on memory are in need of future research).

However, it needs to be considered that not only the inherent visual characteristics of landmarks and landmark representations in contrast to their surroundings affect their visual salience and, consequently, their likelihood to attract visual attention. The general ability to perceive landmarks and their representations in maps also needs to be addressed. Concerning the selection of objects as landmarks, Röser et al. (2012) and Klippel and Winter (2005) stress the relevance of visibility, thus the ability to perceive the object from a specific viewpoint. Spatial objects might be hidden behind other spatial objects. If this is the case, they cannot be used as landmarks, even if they have a high visual contrast to their surroundings. The visibility of landmark representations in maps, on the other hand, depends not on viewpoints in real-world space but on the displayed map region and the selection of landmarks to be represented in a map. Consequently, during map-based navigation, visual attention could be attracted by landmark representations in the map that are not visible from the user's viewpoint. This could impair navigation performance, as only visible landmarks can be used for self-localization. Therefore, future studies should address how visibility of landmarks can be assessed based on real-time tracking of map users and how this visibility information can be used to dynamically adjust which landmarks are represented in a map.

The third type of salience according to the approach of Sorrows and Hirtle (1999), structural salience, has been argued to be task- or context-dependent (Peebles et al. 2007). Landmarks have previously been argued to be structurally salient if they are located along a specific route and at locations that can be used to conceptualize a route (Klippel and Winter 2005; Millonig and Schechtner 2007; Röser et al. 2011). In three consecutive studies (see Keil et al. 2018, 2020a), we were able to demonstrate that, similar to real-world landmarks, landmark representations in maps are structurally salient based on their relative location to specific routes. The visualization of a task-relevant route directs visual attention to the map elements close to the route and its decision points. Furthermore, an additional study (Keil et al. 2020b) extended previous findings by demonstrating that landmark representations are also structurally salient and attract visual attention if they are located close to a to-be-learned object location and its (imaginary) cardinal axes. The effects of the cardinal axes of the map on the distribution of visual attention demonstrate a unique map-related characteristic of structural salience that cannot be applied to real-world space. Either based on the head orientation of the observer relative to a map or based on the (usually rectangular) shape of a map, up/down and left/right dimensions can induce such cardinal axes that are not induced by the perception of real-world space.

The reported findings emphasize the relevance of landmarks and landmark representations for location memory, as theorized by the mental spatial network of nodes and edges proposed by Werner et al. (2000). However, in contradiction to the allocentric mental spatial representation structure proposed by Werner et al. (1997, 2000), we found no evidence for a landmark configuration or framework used to conceptualize and memorize map space. In both the route memory and location memory tasks, visual attention was directed mainly toward the task-relevant landmarks close to the routes and object locations. Landmarks offside the routes and object locations received almost no visual attention. In the location memory task, already the second- and third-closest landmarks received significantly less visual attention, even if they were close to the to-be-learned object location. This demonstrates that participants did not explore the general spatial configuration of landmark representations. However, the lack of evidence in our experiments cannot be used to reject the assumption that landmark representations in maps are used to form a mental representation of maps based on nodes, as proposed by Werner et al. (2000). According to Millonig and Schechtner (2007), such allocentric mental spatial models are formed gradually during repeated interaction with spatial information. The short and clearly task-oriented perception of map information in our experiments might not have had sufficient power to detect or sufficient exploration time for the formation of an allocentric mental spatial model based on landmark configurations. In order to further explore the relevance of landmark representation configurations (landmark patterns) in maps on the formation of mental spatial models, future studies need to provide repeated interaction with maps and should be designed to support the formation of less task-specific mental spatial models, for example, by asking participants to draw sketch maps of the perceived maps. If, as assumed, landmark representation configurations are used as the first building block for the formation of mental spatial models (cf. Bestgen et al. 2017), identifying the ideal configuration characteristics and highlighting landmark representations in maps according to these characteristics could help to effectively and efficiently communicate spatial information to map users and support the formation of mental spatial models.

The reported studies provide first insight into salience characteristics of landmark representations in maps and their effects on the formation of mental spatial models. Based on these findings, better predictions of the distribution of visual attention across maps are possible. By systematically manipulating the salience of landmark representations, visual attention can be actively directed toward task-relevant map elements. The insights into the effects of directed visual attention on spatial tasks such as orientation or navigating along a selected route could be used to optimize automatic map generalization processes based on task requirements. In order to further exploit the potential benefits of the reported outcomes, future research needs to investigate how the systematic direction of visual attention toward specific map elements can improve performance on spatial tasks such as self-localization, orientation, and navigation in real-world environments. As a final step, the insights into the effects of directed visual attention on spatial tasks such as orientation or navigating along a selected route could be used to optimize automatic map generalization processes based on task requirements.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization and Social Computing* (VGIscience, Landmarks VGI, DI 771/11-1, KU 2872/6-1).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 9 Addressing Landmark Uncertainty in VGI-Based Maps: Approaches to Improve Orientation and Navigation Performance**

### **Julian Keil, Frank Dickmann, and Lars Kuchinke**

**Abstract** Landmarks, salient spatial objects, play an important role in orientation and navigation. They provide a spatial reference frame that helps to make sense of complex environments. Landmark representations in maps support map matching and orientation, because matching landmarks to their map representations provides information about spatial directions and distances. However, effective landmarkbased map matching demands sufficiently accurate georeferencing of the landmarks represented in a map, because spatial inaccuracies of landmark representations cause distortions of the spatial reference frame and derived directions and distances. The requirement of accurate landmark georeferencing imposes difficulties on the use of maps based on Volunteered Geographic Information (VGI) for map matching. Differences of the motivation, competence, and available apparatus of volunteers can cause great variations of the data quality in VGI-based maps, including spatial accuracy of landmark representations. In a series of experiments, we investigated and quantified to what extent spatial inaccuracies of landmark representations in VGI-based maps affect map matching. Based on the findings, we were able to identify critical thresholds for spatial landmark inaccuracies. Furthermore, we explored potential ways to sustain successful map matching at higher degrees of spatial landmark inaccuracies. Through visual communication of spatial uncertainties, we were able to make map users more resilient to potential inaccuracies and sustain successful map matching.

**Keywords** Landmarks · Spatial inaccuracy · Uncertainty · Visualization · Map matching · Orientation

Ruhr-Universität Bochum, Bochum, Germany e-mail: julian.keil@rub.de; frank.dickmann@rub.de

Lars Kuchinke International Psychoanalytic University Berlin, Berlin, Germany e-mail: lars.kuchinke@ipu-berlin.de

J. Keil (O) · F. Dickmann

### **9.1 Introduction**

In unfamiliar environments, people tend to use maps for orientation and navigation (Roskos-Ewoldsen et al. 1998). By matching spatial representations in maps to realworld objects, people identify their own location and obtain spatial information about orientation and route directions that are necessary for effective navigation (Kiefer et al. 2014). In the context of such a map matching process, landmarks, salient spatial objects with a fixed geographic location (Anacta et al. 2017; Bestgen et al. 2017; Claramunt and Winter 2007), have been discussed to play an important role (see Chap. 8). In maps, landmarks are often represented as pictograms that communicate semantic information about the nature or purpose of the represented landmark (Keil 2021). Furthermore, Peebles et al. (2007) found that people tend to use single landmarks and their map representations to match 2D maps to 3D spaces.

In recent years, due to the widespread use of mobile Internet and smartphones, paper maps have been increasingly replaced by online map services. The advantages of these online maps are their convenient availability and their ability to record and display one's own position in real time and to calculate and display routes in real time. Furthermore, they are—in comparison to printed maps—(often) more up to date, since these digital maps do not have to be reprinted after each modification. A particular phenomenon that has arisen in connection with online maps are maps based on Volunteered Geographic Information (VGI), with maps based on the OpenStreetMap (OSM) project as the most prominent representatives of this map category. Opposed to "traditional" commercial or official maps, VGI-based maps are created in a process that allows and even encourages the participation of map users (Goodchild 2007). Volunteers can add or modify map content and thereby contribute to the map creation process.

The use of VGI data can provide substantial benefits, in particular in regard to the quality of a map. In areas with numerous volunteers, these volunteers are able to make corrections to the map at very short notice if local conditions change, for example, when a road is closed or a new building is built (Barrington-Leigh and Millard-Ball 2017; Olteanu-Raimond et al. 2017). Thus, the availability of VGI data has improved geographic information (Flanagin and Metzger 2008) and the way such information is spread and processed. Volunteers also share the role of map users of this geographic information. Hence, the data provided clearly relies on individual experiences and thus shares a natural, implicit advantage over commercial products. As a result, the involvement of volunteers in map creation can lead to the mapping of spatial elements that are less relevant from the point of view of a public or commercial authority but are very relevant for certain groups of map users. For example, OSM contains many hiking trails that are not mapped in official maps (See et al. 2017). However, the source of these advantages of VGI-based maps is directly linked to disadvantages in terms of data quality (cf. Bégin et al. 2013).

Overall, there is an ongoing and thorough discussion on quality issues of such spatial data, mainly in comparison to commercial products (e.g., Degrossi et al. 2018; Flanagin and Metzger 2008; Senaratne et al. 2017; Zhang and Malczewski 2017). In regions with only a few active volunteers who provide VGI data, the maps are usually much less detailed (Rousell and Zipf 2017), and map errors tend to get fixed late, if at all. It is also known that interindividual differences between volunteers affect the quality of VGI-based maps. These differences include personal motivations, skills and mapping expertise, as well as the technical equipment used, for example, devices to record GPS data (Van Exel et al. 2010). These betweenregion and interindividual differences result in a pronounced heterogeneity of data quality (c.f. Chap. 3) and completeness available in VGI-based maps (Girres and Touya 2010). Thus, data quality and data characteristics are of a strongly heterogeneous nature in the case of VGI. Most map readers, however, do not question data quality when, for example, using OSM or are even not aware of the fact that maps are based on OSM. They are not aware of the very different characteristics of VGI as opposed to traditional or commercial datasets (Skopeliti et al. 2017). In contrast, Schiewe and Schweer (2013) report "a rather high degree of awareness of uncertainty problems" in OSM users. But this awareness circulates around completeness and up-to-dateness of the data, while localization errors and thematic inaccuracies remain unaware (Schiewe and Schweer 2013). As a result, it can be assumed that map readers treat every available landmark in the same way, independent of its representational quality.

In the context of successful map matching, two potential problems arise from the described disadvantages of VGI-based maps. First, in some cases, there may not be a sufficient number of spatial reference points represented in certain map areas that would be necessary for successful map matching. And second, localization errors of important spatial reference points to be used as landmarks in navigation and orientation (see Fig. 9.1) could potentially lead to unsuccessful map matching, i.e., elements in real space not being recognized in the map or landmarks represented in the map are not identified in real space.

As part of the SPP 1894 (Volunteered Geographic Information (VGI): Interpretation, Visualization and Social Computing) of the DFG, the sub-project on "The Effects of Landmark Uncertainty in VGI-based Maps: Approaches to Improve Wayfinding and Navigation Performance" carried out by the Ruhr-Universität Bochum (RUB) and the International Psychoanalytical University Berlin (IPU) addresses these presumed effects of spatial landmark inaccuracies on map matching, orientation, and navigation performance. The first aim was to assess and quantify to what extent spatial inaccuracies of landmark representations affect map matching and, consequentially, orientation and navigation (see Sect. 9.2). In a second step, approaches for reducing the assumed negative effects of landmark inaccuracies in maps on map matching are being developed and evaluated (see Sect. 9.3).

**Fig. 9.1** Inaccuracies of landmark representations in maps. Due to data quality issues intrinsic to VGI data, individual landmark representations in VGI-based maps can be spatially more or less inaccurate. For example, the gas station represented in the map above may also be located on the other side of the road or on a different location along the road. If spatial inaccuracies are too high, map users may experience difficulties when trying to match map representations to the represented real-world environment (© OpenStreetMap contributors)

### **9.2 Effects of Landmark Inaccuracies on Map Matching**

In a first experiment of the SPP 1894 sub-project, we aimed to investigate and quantify how spatial inaccuracies of landmark representations in maps affect the ability of map users to match the map to the represented 3D environment. For this purpose, we created a virtual 3D environment and a digital map that allowed us to fully control the locations of a landmark building and its pictogram representation in the map (see Fig. 9.2).

The locations of both the 3D landmark and the landmark representation were fully randomized, independent of each other along the road. Consequentially, the spatial inaccuracy of the landmark representations was different in each trial. After each trial, participants used a continuous scale to respond to what extent they perceived (as pedestrians) the map as matching the 3D environment. For full details on the study design, see Keil et al. (2022a).

The results demonstrated a pronounced and significant nonlinear relation between the spatial inaccuracy of the landmark representation in the 2D map and the perceived match between the 3D environment and the map (see Fig. 9.3). A tipping point was observed at approximately 10 meters of spatial inaccuracy, i.e., 10-meter walking distance in a virtual 3D environment. Maps with less spatial

**Fig. 9.2** Stimulus design. Participants saw a 3D environment containing a landmark building and a corresponding map representation. Random spatial inaccuracies were applied to the landmark representation in the map (here, landmark building matching the map representation)

**Fig. 9.3** Relation between inaccuracies of the landmark representation in the map and the perceived match between the 3D environment and the map. Values of one represent a certain match, values of zero represent a certain mismatch, and values between zero and one represent uncertainty concerning a match or mismatch. If spatial inaccuracies of the landmark representation were too high, maps were perceived as not matching the represented 3D environment

inaccuracy were mainly rated as matching the 3D environment. Maps with more than 10 meters of spatial inaccuracy were mainly rated as not matching the 3D environment.

These findings suggest that inaccuracies of a landmark's map representation of more than 10 meters for a pedestrian observer and map user are recognized. Map users then seem to start to mismatch map and 3D space representations, meaning that they are getting unable to match a corresponding map to the represented 3D space, although in the present experiment, only one map element (the landmark pictogram) is spatially inaccurate. However, some issues concerning the generalizability of these findings still need to be considered. First, in this experiment, only one fixed map scale was applied. However, most modern digital maps support dynamic adjustments of the visualized map scale. Selecting smaller map scales results in smaller misplacements of spatially inaccurate landmark representations in terms of pixels or millimeters. Consequentially, spatial landmark inaccuracies could be more difficult to recognize if a smaller map scale is used. Hence, how specific spatial inaccuracies of landmark representations affect map matching also needs to be quantified with different map scales in future experiments (see Keil et al. 2022a).

A second limitation concerning the generalizability of the findings is that the experiment was carried out with controlled virtual 3D environments. Although full experimental control is important to isolate the effect of the spatial inaccuracy of the landmark pictogram on map matching, it needs to be considered that people who carry out a map matching task are usually confronted with complex realworld environments containing numerous spatial elements that can act as helpful spatial reference points or as distractors. For example, an unusual route shape could support the map matching process or visually highly salient spatial objects that attract visual attention but are not represented in the map could disturb the map matching process. How the perception of complex real-world environments affects the map matching process still needs to be investigated. Still, one could assume that the effects observed in this experiment could be less pronounced in real-world environments because the salience of a single landmark and its map representation is less pronounced.

Finally, maps often represent more than one landmark in a map section. Due to the heterogeneity of VGI-based maps (Girres and Touya 2010), only some of these landmark representations may be significantly spatially inaccurate. If this is the case, other landmark representations could still be used to maintain successful map matching. Thus, how exactly one or some spatially inaccurate landmark representations affect map matching if other spatially more accurate landmark representations are available needs to be addressed in future studies.

Despite these mentioned limitations, the findings demonstrate that spatial inaccuracy of landmark representations in maps can jeopardize successful map matching and most likely also orientation and navigation performance. Therefore, finding ways for reducing the impact of spatial landmark pictogram inaccuracies on the map matching process seems to be a relevant topic for further research (cf. next paragraph).

### **9.3 Visualizing Spatial Uncertainty**

In a second experiment, we aimed to assess ways to reduce the negative effect of spatial inaccuracies of landmark representations on map matching. According to Padilla et al. (2021), uncertainty concerning data quality is an issue of most data and can affect all kinds of decision-making. However, people may not be aware of these data quality issues. Thus, if map users do not consider during navigation that some landmark representations may be more or less spatially accurate, they could interpret a map as representing another spatial environment and not the currently perceived one. Pang et al. (1997) argue that by visualizing data uncertainty, people are provided with important information of data quality and can therefore make more informed decisions. However, according to Mason et al. (2016), there is a "lack of comprehensive and generalizable empirical studies across the entire domain of uncertainty visualization." Therefore, we investigated the effects of visualizing uncertainty concerning the correct location of map-based landmark representations on map matching. For this purpose, the uncertainty visualization variables transparency and size were selected and manipulated based on suggestions of MacEachren (1992) and MacEachren et al. (2005). Furthermore, an uncertainty area visualization already used by the commercial map provider Google Maps to visualize uncertainty of GPS locations was investigated and compared to the other visualizations as well (see Fig. 9.4).

The stimulus design was the same as in the first experiment, with one exception. In addition to the control condition with the unmodified landmark pictogram, three different visualizations for spatial landmark uncertainty were compared in a withinsubject design. In four sets of twelve trials, participants either saw a map with an unmodified landmark pictogram (control condition), a transparent pictogram, a pictogram with modified size, or a pictogram with a circular transparent uncertainty area (see Fig. 9.4). The level of transparency, the size of the landmark pictogram, or the uncertainty area was linked to the spatial inaccuracy of the landmark pictogram in the map. Again, participants used a continuous scale ranging from 0 to 1 after

**Fig. 9.4** Visualizations for spatial uncertainty. Pictogram transparency (left), size (middle), and circular (transparent) uncertainty areas (right) were used to visualize uncertainty concerning the correctness of the landmark pictogram location

**Fig. 9.5** Perceived match per landmark pictogram condition. Participants significantly less often perceived a mismatch between the 3D environment and the map representation (mismatches represented by low perceived match scores) when potential spatial inaccuracies of landmark representations were visualized by increasing the landmark pictogram size or by adding a circular uncertainty area

each trial to indicate to what extent they perceived the map as matching the 3D environment. For more study details, see Keil et al. (2022b).

The results demonstrate that participants are less likely to perceive a mismatch between the 3D environment and the map (mismatches represented by low perceived match scores), if the size of the landmark pictogram was modified or if an uncertainty area was added around the landmark pictogram to illustrate uncertainty (see Fig. 9.5). However, adding transparency to the landmark pictogram did not lead to fewer perceived mismatches between the 3D environments and the maps compared to the control condition. Figure 9.6 shows that if pictogram size or uncertainty areas are used to communicate uncertainty concerning the correct location of the landmark representation, a larger landmark inaccuracy was accepted by the participants to still be perceived as a match between the 3D environment and the map. Based on this result, it seems that in map matching (i.e., comparison of map and 3D space), the size of the pictogram or the size of the uncertainty area placed around the pictogram are particularly suitable for depicting spatial uncertainty. These two pictogram variants seem to support the processing of uncertainty by

**Fig. 9.6** Relation between inaccuracies of the landmark representation in the map and the perceived match between the 3D environment and the map per landmark pictogram condition. When uncertainty concerning the correct location of the landmark representation in the map was visualized by increasing the pictogram size or adding a circular uncertainty area, higher values of spatial landmark pictogram inaccuracy were required for participants to perceive a certain mismatch between the map representation and the represented 3D environment

map users significantly better than, for example, the also tested transparency representations of pictograms.

Based on these initial findings, we conclude that uncertainty visualization can be used to reduce the negative effects of spatial inaccuracies of landmark representations in maps on map matching. Especially, modifying the size of pictograms as suggested by MacEachren et al. (2005) and adding transparent uncertainty areas around the landmark pictogram proved to communicate spatial uncertainty effectively and to improve successful map matching. It seems likely that an explanation for the fact that adding transparency to landmark pictograms did not prove to be effective in communicating spatial uncertainty needs to consider the difficulties to registering different transparency levels of our map users in the experiment, as only one landmark pictogram was visible simultaneously.

Concerning the generalizability of these findings, two issues need to be addressed. First, Fig. 9.6 reveals a data artifact for extremely high landmark inaccuracy values in the size condition (green line). Opposed to the other visualizations, uncertainty increased when spatial inaccuracy values were extremely high. This particular artifact might be explained by the fact that pictogram size in this condition was directly linked to the spatial inaccuracy of the landmark pictogram. Thus, if the spatial inaccuracy was extremely high, the pictogram was extremely large and covered large areas of the map. However, if the pictogram gets too large, it will make the map unreadable, and map matching will become impossible. This artifact demonstrates that, as also stressed by Kinkeldey et al. (2014), new ways to visualize uncertainty call for studies to investigate their usability. Potential usability issues of the addressed uncertainty visualizations need to be identified in further extensive series of usability tests to ensure that the general map reading ability is not impaired. For example, as a consequence of the identified data artifact, maximum values for the pictogram size could be defined.

The second issue is related to the values used to control the intensity of the uncertainty visualization. In this experiment, the spatial inaccuracy of the landmark representations was used to control the transparency or size of landmark pictograms or the uncertainty area around the pictogram. In a real application scenario, however, these values would not be available. If precise information concerning the correct landmark location would be available, then in most cases, the errors could easily be corrected, and an uncertainty visualization would not be required. Instead, measures for uncertainty based on the available metadata related to VGI contributions need to be developed, or average inaccuracy values need to be estimated and applied to the uncertainty visualization. For example, an uncertainty score could be calculated based on the number and spread of VGI contributions linked to a specific landmark pictogram, the number of contributions of a contributor, or the number of corrections linked to the contributions of a contributor. The development of such uncertainty value calculations and their evaluation in applied user studies are subject to future research.

### **9.4 Conclusion and Outlook**

The studies reported above provide first insights into how spatial inaccuracies of landmark representations in maps affect map matching. Our first study demonstrates that spatially inaccurate landmark representations reduce the likelihood of successful map matching. This could create problems for real-world orientation and navigation, especially in VGI-based maps, due to their great heterogeneity in terms of data quality (Girres and Touya 2010; Rousell and Zipf 2017). In a second study, we were able to show that by visualizing uncertainty concerning the spatial accuracy of landmark representations, the negative effects on map matching can be reduced. Especially modifying the size of landmark pictograms as suggested by MacEachren et al. (2005) or adding uncertainty areas as used in Google Maps proved to maintain successful map matching at higher levels of spatial landmark inaccuracies. We argue that it is this visualization of spatial uncertainty that provides the information required for more informed decision-making (Padilla et al. 2021; Pang et al. 1997).

**Fig. 9.7** Map matching with 360-degree photos. Due to the higher number of real-world distractors and spatial reference points in the maps obtained from OSM (© OpenStreetMap contributors), effects of spatial landmark inaccuracies in the maps are assumed to be less pronounced compared to experimentally fully controlled virtual environments

In future studies, the generalizability of the identified relation between spatial inaccuracies of landmark representations and map matching to real-world environments needs to be investigated. As previously argued, it is assumed that the negative effect on map matching will be less pronounced in real-world environments compared to the virtual 3D environment used in the first experiment of our SPP 1894 sub-project. The higher complexity of real-world environments should provide both more spatial reference points that are represented in maps and further distractors (like persons, building facades, cars, etc.) that are not represented in maps. Both the additional spatial reference points and the distractors are expected to compete with the landmark representations for visual attention. In consequence, this distribution of attention across more objects is assumed to reduce the relevance of the landmark representations for map matching. To test this assumption, it is necessary to conceptually extend the experiment design of the two studies reported above (see Keil et al. 2022a,b). Instead of using virtual 3D scenes in the map matching task, 360-degree images of real-world scenes with landmarks that are also included in the corresponding OSM map section may be tested (see Fig. 9.7).

In addition, to investigate the effect of map scale on the relationship between spatial inaccuracy of landmark representations and map matching, in future studies, different map scales should be used. It is expected that the results of such studies will not only provide a deeper understanding of the role of individual landmark representations for map matching in the context of spatial inaccuracies in realworld environments. They will also provide new findings on the effects of different map scales and potential usability issues of the used uncertainty visualizations in widespread maps as OSM.

Furthermore, potential ways to quantify spatial uncertainty based on available OSM metadata need to be explored, developed, and tested. This will allow visualizing different levels of uncertainty based on the suggested data quality levels of different OSM map regions. Finally, the effects of the uncertainty visualizations on orientation ability and navigation performance (i.e., real-world map matching) also need to be investigated with regard to different target groups. The results can not only provide detailed information on how spatial inaccuracies of landmark representations in maps affect map matching. They will also show how successful map matching can be maintained by providing map users with information about the uncertainty of map content to support more informed decision-making.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization and Social Computing* (VGIscience, Landmark Uncertainty, DI 771/11-2, KU 2872/6-2).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 10 Improvement of Task-Oriented Visual Interpretation of VGI Point Data**

**Martin Knura and Jochen Schiewe** 

**Abstract** VGI is often generated as point data representing points of interest (POIs) and semantic qualities (such as accident locations) or quantities (such as noise levels), which can lead to geometric and thematic clutter in visual presentations of regions with numerous VGI contributions. As a solution, cartography provides several point generalization operations that reduce the total number of points and therefore increase the readability of a map. However, these operations are applied rather general and could remove specific spatial pattern, possibly leading to false interpretations in tasks where these spatial patterns are of interest. In this chapter, we want to tackle this problem by defining task-oriented sets of map generalization constraints that help to maintain spatial pattern characteristics during the generalization process. Therefore, we conduct a study to analyze the user behavior while solving interpretation tasks and use the findings as constraints in the following point generalization process, which is implemented through agent-based modeling.

**Keywords** Point generalization · Constraints · Agent-based modeling

### **10.1 Introduction**

As shown by the variety of different aspects and applications which are observed in this book, the volume and relevance of Volunteered Geographic Information (VGI) have immensely increased in recent years. In many cases, this VGI data is generated and visualized as point data, e.g., representing the location of a point of interest (POI), an event, or a data source. However, utilizing VGI data needs to take some specific characteristics into account in comparison to geospatial data acquired and processed in the "traditional" way. In particular, VGI "is produced by heteroge-

HafenCity University Hamburg, Hamburg, Germany

M. Knura (O) · J. Schiewe

e-mail: martin.knura@hcu-hamburg.de; jochen.schiewe@hcu-hamburg.de

**Fig. 10.1** Examples for VGI point data with point clutter. (**a**) Parked bikes in the city of Dresden, detected on Flickr images between 2004 and 2014 according to Knura et al. (2021). (**b**) Sightings of selected antelope species in Kruger National Park uploaded on the platform iNaturalist

neous contributors, using various technologies and tools, having different levels of details and precision, serving heterogeneous purposes, and a lack of gatekeepers" (Senaratne et al. 2017), leading to an enormous volume and heterogeneity within the data. All of these characteristics could harm the usability of the data, especially when it comes to the visual presentation and exploration of very dense and even overlapping point markers or symbols (see Fig. 10.1a and b), commonly known as geometric point clutter (Moacdieh and Sarter 2015).

As a solution to this clutter problem, cartography provides several point generalization operations such as selection, aggregation, or displacement, which rearrange or reduce the total number of points and therefore increase the readability of a map. However, these operations are applied rather general and could remove a specific spatial pattern, possibly leading to false interpretations in tasks where these spatial patterns are of interest.

The aim of the TOVIP project is to tackle this problem by defining a set of cartographic constraints—i.e., conditions a generalized map should satisfy—that preserve these spatial patterns throughout the whole generalization process. The first research question we want to answer in this chapter is therefore:

• What is the minimum set of constraints and constraint measures that should be used to evaluate interpretation tasks based on VGI point visualizations, such as pattern identification, pattern comparison, or relation seeking?

Different cartographic constraints often describe contradicting aspects with no optimal solution, as it is not possible during map generalization to maintain all information—i.e., fulfill all information preservation constraints—while keeping the map readable, i.e., fulfill all legibility constraints. Constraint-based generalization is therefore an optimization task, which tries to find a solution that satisfies as many constraints as good as possible, and has been implemented in recent years through multi-agent systems (Duchêne et al. 2018). We want to contribute to this research and define our second research question as:

• Is it possible to optimize the task-oriented generalization using an agent-based modeling approach?

The following chapter describes the workflow to answer the research questions as follows: in Sect. 10.2, we introduce the cartographic concept of constraintbased generalization, on which the TOVIP project is based upon. Section 10.3 summarizes the results of a user study, which analyzed the user behavior while working with spatial patterns in point data sets. In Sect. 10.4, we translate the findings of the previous section into measurable constraints that could be utilized in map generalization practice. In Sect. 10.5, we apply these constraints in an agentbased generalization model. That followed, we discuss our findings in Sect. 10.6 before concluding in Sect. 10.7.

### **10.2 Constraint-Based Map Generalization**

Cartography provides a variety of different point generalization operations—and various combinations between them—to solve the aforementioned clutter problem. As an example, a *simplification* describes a straight reduction of source points based on geometric criteria (e.g., only points which have a minimum distance to their neighbors are preserved; (Slocum et al. 2009)). When semantic criteria are used, a *selection* operation could take place. For example, points can be selected based on respective information filtering methods (Huang and Gartner 2012) or scaledependent (Gröbe and Burghardt 2021). *Aggregation* takes place when multiple points are replaced by a single aggregator marker. Most frequently, points are grouped through clustering with a respective initialization method (e.g., random, kmeans, Voronoi-based; (Yan and Weibel 2008)), while alternatives, for example, use heat maps (Meier 2016), or geometry objects (Zahtila and Knura 2022) to aggregate point data. A different solution to overcome clutter, but with the possibility to preserve the cardinality (i.e., the overall number of points) of the dataset, is to *displace* the points. During the displacement operation, an iterative workflow of overlap detection, relocation, and re-evaluation is executed (Mackaness and Purves 2001). Furthermore, if the preservation of cardinality and the original topology is of interest, a *spatial distortion* based on pixels (Keim et al. 2004) or point density (Bak et al. 2009) could help.

An operational system for point generalization must implement a workflow to trigger and orchestrate the individual point generalization operations described above. First implementations used a rule-based approach where a predefined set of well-defined and unambiguous rules guided the generalization process (Beard 1991). Each rule thereby states what has to be done in a process at a certain condition, so each condition was connected to a specific action (Harrie and Weibel 2007). The problem that occurred with this approach was that the enormous variety of spatial and non-spatial characteristics that exist in the world and therefore in maps led to a number of rules which were not possible to handle anymore. This leads to the constraint-based approach, which focused on the requirements that the final map should fulfil instead of providing a set of isolated generalization operations, leaving more flexibility within the generalization process on how to reach these results. According to Beard (1991), these constraints can be classified into aspects related to position, topology, shape, structural, functional, and legibility. Furthermore, it is necessary to introduce respective measures for these constraints, which are grouped by Mackaness and Ruas (2007) into either internal or external and either micro, meso, or macro.

If the constraints are defined in a complete and measurable way, there are different techniques available for implementation. Looking at optimizing single generalization methods, there is considerable work done, for example, regarding the displacement operation by applying least squares adjustment (Sester 2000), simulated annealing (Ware and Jones 1998), or snakes (Burghardt 2005). For more complex processes, agent-based modeling has shown great success in terms of applicability (Duchêne et al. 2018). In this approach, agents represent autonomous map objects trying to minimize a given cost function, which is based on the fulfillment of the constraint measures. As a result, the whole complexity of the generalization workflow is distributed to a set of relatively simple interacting agents.

The agent-based modeling approach is also used in the TOVIP project. Regarding the aim of TOVIP—defining a set of constraints that optimizes the generalization workflow designed for visual interpretation tasks where specific spatial patterns are of interest—it is necessary to consider two potentially contradictory aspects: On the final map, the aforementioned spatial patterns have to be visible (*preservation constraints*), while the map must still be readable by the users (*legibility constraints*). Describing constraints that preserve the relevant information during the generalization process is often done with object-specific measures, e.g., preserving the area of a polygon before and after generalization (Harrie and Weibel 2007). On the other hand, legibility constraints ensure the readability of the map, for example, by avoiding any spatial conflict—i.e., display clutter—and showing objects in a suitable degree of detail according to the scale of the map. A respective list of analytical legibility measures, such as the number of vertices or the object line length, was developed by Stigmar and Harrie (2011). For the aim of the TOVIP project, it is now of interest to find the minimum set of preservation and legibility constraints that allow the interpretation of specific spatial pattern even after generalization.

### **10.3 User Behavior When Interpreting VGI Point Data**

Developing constraints that support users while interpreting specific tasks implies profound knowledge of their behavior while doing so. Before defining constraints that support interpretation tasks, it is therefore necessary to analyze the behavior of users working with VGI point data sets. We conduct a user study where participants have to perform different interpretation tasks—like finding clusters within a dataset, comparing point densities, or finding areas with a specific point distribution—using a novel method that combines postal questionnaires, think-aloud interviews, and techniques from visual analytics. A more detailed overview on the technical aspects and the execution of the user study, including a detailed description of the analysis of the think-aloud interviews, is given by Knura and Schiewe (2021). In this chapter, we want to summarize the results of the study (see also Knura and Schiewe 2022), focusing on the impact of the user behavior on the definition of a minimum set of constraints as described above.

### *10.3.1 Task-Solving Strategies*

We analyzed the strategies of the participants by dividing the overall task-solving process into three sequential actions: (1) finding a start position, (2) obtaining information, and (3) decision-making. Apart from a task where the participants have to find a similar pattern compared to a given reference, the point density of a cluster—as a combination of proximity and cardinality of points—was the most important factor when selecting a starting position on the map, followed by the point color. For the process of obtaining information, point density was again the most important factor, as more dense clusters were described and analyzed earlier and more often. Moreover, density was the main evaluation measure in comparison tasks and during decision-making. Although we had different categories of synoptic interpretation tasks, which—in contrast to elementary tasks—include pattern identification, pattern comparison, and relation seeking (see Andrienko and Andrienko 2006), the task-solving strategies did not differ significantly between different kinds of tasks. As a first result of the study, we state that point density has the biggest impact on the task-solving behavior of the participants and has to be addressed in the first place when defining constraints.

### *10.3.2 Influence of Point Data Cardinality and Background Map*

A key factor that could have an impact on the user behavior during visual interpretation is the map complexity. There are numerous definitions and concepts of map complexity in cartography (see Touya et al. 2016). As most of them distinct between the intellectual complexity, which relates to the cognitive process of map reading, and the graphical complexity, which relates to the visual perception of individual map objects, we vary the maps for some of the tasks with respect to these two categories. To learn more about the influence of the intellectual complexity on user behavior, we varied the data cardinality—i.e., the number of points—for two of the tasks. Although we recognized some minor differences in the behavior between the user groups, the overall task execution strategy remains unchanged with a higher data cardinality. For analyzing the impact of the graphical complexity, we varied the background map source between Google Maps, Bing Maps, and Stamen Terrain. This time, we identified both an implicit and explicit influence from the background map. Implicitly, because participants frequently identified clusters which were visually supported by the background map and explicitly because they refer to the characteristics of the background map when explaining their strategies. But again, and despite the influence of the background map on the reasoning, the overall task-solving strategies described in the section above remain unchanged between different levels of graphical complexity.

### *10.3.3 Implications for Constraints Supporting Interpretation Tasks*

Following the results of our study, there are two main aspects that have to be considered while defining constraints for map generalization. First, it is of major importance to preserve the original pattern proportions during the generalization process. Our study revealed that the point density had the biggest impact on the task-solving process, and participants discussed both interrelations between clusters with different density, as well as between different classes of points within the same cluster. Information preservation constraints regarding the point density should therefore:


The second aspect to consider is the use of cartographic techniques to guide the interpretation of points. The use of specific colors to draw attention is common in cartography, and this can be applied to other map objects with the aim of lowering the graphical complexity. Respective constraints could ensure to:


These constraints could be categorized as both preserving information and legibility, and they address not only the location and visibility of the point symbols but also their style and the surrounding map areas.

### **10.4 Defining Constraints and Measures for Spatial Pattern Interpretation**

The previous section revealed the importance of preserving point densities during generalization. In this section, we collect a list of different approaches—both from the literature and own experiments—to define constraints and respective measures, which can help to preserve point densities and spatial pattern and test them on exemplary point distributions. The aim is thereby to find a minimum set of constraints that fit best to the list of requirements described above. We thereby focus on information preservation constraints regarding the point distributions. Constraints related to cartographic techniques are a key aspect of our future work.

### *10.4.1 Measures Describing Spatial Pattern and Densities*

When defining measures for spatial pattern and densities, we follow the categorization of Mackaness and Ruas (2007), who distinguish between macro-measures that deal with all point objects of interest, micro-measures that deal with individual characteristics of objects (i.e., points), and meso-measures that deal with the specific properties of different groups of objects (i.e., point clusters). The authors also distinguish between internal and external measures, which states if a measure can be calculated based on a single dataset (internal) or is a relation between two datasets (e.g., before and after a generalization operation; external).

### **10.4.1.1 Macro-Measures**

Macro-measures are able to describe the entirety of information and respective characteristics in a single value. One of the most basic macro-measures is the radical law (Töpfer and Pillewizer 1966), which estimates how many features should be maintained at a smaller scale in the generalization process. It is defined as:

$$n\_D = n\_S \sqrt{\frac{m\_S}{m\_D}},\tag{10.1}$$

where *n* is the number of objects of the derived (*D*) resp. source (*S*) map, and *m* is the scale denominator. In the context of this project, it is worth noting that the calculation should be based on a readable map, i.e. without any point clutter. If this measure is calculated from a map with point clutters, the calculated value should be interpreted as the *maximum number of objects* on the derived map. Even more basic is the measure that describes the *amount of information Ni* as the number of all map objects (Harrie and Stigmar 2010), calculated as:

$$N\_l = \sum\_{i=1}^{n} \sum\_{j=1}^{m\_l} O\_{ij}. \tag{10.2}$$

For objects other than points, this measure can be expanded with the number of object points, calculating the overall measure as the sum of all object points of all map objects.

Beside measures that deal with the amount of information in general, global measures can also describe a specific characteristic of the dataset or the map. In the same work, Harrie and Stigmar (2010) defined an index to characterize the *spatial distribution of points ISDP* based on Voronoi regions. The index is calculated as:

$$I\_{SDP} = \frac{\sum\_{l=1}^{NP} P\_{SDP,l} \log\_{P\_{SDP,l}}}{\log\_{\frac{1}{NP}}},\tag{10.3}$$

where *PSDP ,i* is the relative size of the Voronoi region for a point *i* and *NP* is the number of points. *ISDP* converges to 1 the more even the sizes of the Voronoi regions are. Zhang et al. (2009) use the *Voronoi region size as the variable of interest in Moran's I* to discern if point distributions are clustered, dispersed, or random:

$$I = \frac{N}{W} \frac{\sum\_{l=1}^{N} \sum\_{j=1}^{N} w\_{lj} (\chi\_l - \bar{\chi})(\chi\_j - \bar{\chi})}{\sum\_{l=1}^{N} (\chi\_l - \bar{\chi})^2},\tag{10.4}$$

where *N* is the number of spatial units indexed by *i* and *j* , *x* is the size of the Voronoi region *AV* , *x*¯ is the mean of *x*, *wij* is a matrix of spatial weights with zeroes on the diagonal (i.e., *wii* = 0), and *W* is the sum of all *wij* .

### **10.4.1.2 Micro-Measures**

Micro-measures describe characteristics of individual objects and therefore can take the local neighborhood into account. Analog to the macro-measures before, calculating the size of the Voronoi region of a point *AV* is used as a fundamental metric to describe local density. Based on this, Zhang et al. (2008) calculate the *object-oriented density OD* as:

$$OD = \frac{1}{A\_V}.\tag{10.5}$$

A higher object-oriented density implies a smaller Voronoi region and therefore a higher point density in the local neighborhood. Vice versa, a small object-oriented density indicates a bigger Voronoi area and a more dispersed distribution around that point.

Besides density measures, qualitative and quantitative information about the points in close proximity are also of interest when point generalization operations like selection are used. Therefore, Delauney triangulations are often used to identify "natural" neighbors in point distributions (Sadahiro 1997). Applying this tessellation to a point data set provides a *list of neighbors* for each point, and micromeasures like the *number of natural neighbors*, the *mean neighbor distance*, and the existence of *local extreme values* can be calculated (see Fig. 10.2). Delauney triangulation also helps to define clusters, so the *cluster affiliation* can also be defined in this way.

All the measures introduced in this section are internal because they can be calculated solely based on one dataset. However, it is possible to compare the measures of an individual point to measures of the same point during the

**Fig. 10.2** Example of micro-measures for point P. Natural neighbors N1 to N5 in red, other points in gray. Voronoi region *AV* for point P in light blue

generalization process, and note the amount of change as an additional external measure. In the same way, the *distance to the origin location* of a point is of interest when displacement operations take place during the generalization.

### **10.4.1.3 Meso-Measures**

Compared to macro- and micro-measures, meso-measures are not bound to a predefined number or list of points they describe. The first step to calculating meso-measures is therefore to define which points are members of the group of interest. In the TOVIP project, we focus on spatial pattern, and so the definition of clusters is relevant for further processing. Clustering can be made—among many other techniques—by cluster algorithms such as k-means and HDBscan, based on Delauney triangulation (Sadahiro 1997), or by using grids (Yan et al. 2021). These clusters can then be described by meso-measures such as the *number of group members*, the *existence of different point categories*, and the *mean distance*  between members or between members and the group centroid. Comparing the measures of the respective clusters, it is also possible to define *cluster rankings*. Furthermore, according to the findings of the user study presented in Sect. 10.3, measures regarding the shape and the orientation of the clusters can be of interest. Common methods to represent *the shapes of point clusters* are convex hulls or alpha shapes (Edelsbrunner et al. 1983). The *orientation of a cluster* can be described by the minimum rotated rectangle, a technique which is usually used for building orientation (Duchêne et al. 2003). Furthermore, all macro-measures defined in Sect. 10.4.1.1 can also be applied on clusters with a defined border.

### *10.4.2 Deriving a Minimum Set of Constraints*

Based on the list of different measures, we test the suitability of the measures to control the different aspects which help to fulfil the information preservation constraints we developed in Sect. 10.3. We thereby subdivide the constraints and respective measures into three groups:


Furthermore, we compare the measures and their performance on different point distributions to identify redundancies, and we examine the robustness on point cardinality, which is essential when applied in map generalization operations. We create a series of experimental point distributions with 100, 200, 500, and 1000 points and different characteristics: a regular and a random distribution, distributions

**Fig. 10.3** Point distributions with different characteristics, using the example of 200 points. Gray borders in the background of the gridded and pattern distributions indicate the areas in which the same number of points were randomly distributed

where we predefined regular (gridded distribution) and irregular areas (pattern distribution) to control the density, and distributions with loose and clear clusters (see Fig. 10.3). Furthermore, we tested the behavior of micro-measures on points in different VGI datasets to evaluate their utility.

### **10.4.2.1 Preserve the Overall Distribution of Points and the Density Ranking Between Areas**

The first subgroup of measures combines the first two constraints on information preservation in Sect. 10.3.3 and can be controlled through a combination of macroand meso-measures. We tested both the Voronoi-based Moran's I and the spatial distribution of points with our series of different artificial point distributions. Table 10.1 shows the calculated measures and the standard deviation over the different point cardinalities. We can see that the spatial point distribution measure is to a certain degree stable toward the point cardinality and is smaller when the points



are more clustered. Moran's I on Voronoi regions is more sensitive to variations of point cardinality, as it uses the area size as the variable of interest—whose value decreases with more points. As a result, we decided to use the point distribution measure to control the overall distribution and the distribution within clusters. That also includes the fact that the measures for amount of information and cluster size are variables within the calculation of the spatial point distribution on the macroand meso-level. In contrast, the cluster ranking measure has no overlap with other measures and is therefore inevitable.

### **10.4.2.2 Preserve Pattern-Specific Characteristics**

Pattern-specific measures are a crucial part of the goals of the TOVIP project. If local extreme values and the existence of different point categories are of interest in an interpretation task, it is mandatory to preserve these points and therefore control them with related measures. As this measure requires a Delauney triangulation to define the neighborship, respective measures that are based on this can be performed with low additional effort, even if not compulsory. As an example, the mean distance to neighbors can be calculated this way. As an alternative, the distance to all points within the predefined cluster can be used to decide which points are overlapping and thus should be a controlling measure. If a displacement operation is implemented, the distance to the origin location of a point can be of interest. For the other patternspecific measures, we did not find a unique behavior in which we see an additional utility for our model.

### **10.4.2.3 Preserve Gestalt Law Rules**

The maximum number of points can be utilized as a target value for the generalization process, although it is not mandatory if all legibility measures are satisfied. Measures related to the shape and orientation of clusters are utilizing common techniques from the field of geospatial analysis, such as calculating the minimum bounding rectangle, the convex hull, or the alpha shape of a point set. We compared the different approaches on different data sets and decided to use the convex hull to describe the shape, as it needs no additional parameter compared to the alpha shape and is more detailed than the rectangular bounding box. If the orientation is of interest, the longer side of the minimum bounding rectangle can be utilized.

Table 10.2 shows the selected measures which we initially implemented in the agent-based model, together with additional measures that could be relevant for certain tasks and were also recognized. Nevertheless, because most of the measures are defined in code blocks outside the actual agent-based model, it is possible to adopt measures from other scale levels during model optimization.


**Table 10.2** Subgroups of measures and selected measures in column "Set". (X) indicates the measure is not mandatory in certain applications

### **10.5 Application Using Agent-Based Modeling**

The set of constraints and respective measures developed in the previous section is a key component for the implementation of a map generalization process, which preserves spatial patterns. We apply the constraint-based approach using agent-based modeling, which is a powerful method for controlling complex processes (Harrie and Weibel 2007). As the model is currently in the final phase of development, this section will focus on the architecture and parametrization we implemented: First, we introduce the software framework we use and explain the different components within the model. In the second part, we describe the integration of global map specifications and the translation of measures to a satisfaction scale, which helps the agents to better evaluate their fulfillment of constraints. Evaluation of the model results will be part of our future work.

### *10.5.1 Software and Components*

We implement our agent-based model1 using the Mesa framework (Kazil et al. 2020). Mesa is an open-source framework for creating agent-based models written in Python. It includes four core components (Model, Agent, Schedule and Space), along with additional components for analysis and visualization. Thereby, the *Model* class is the core class for creating the environment of the model using the *Space* class, initializing the agents which are objects of the *Agent* class, and orchestrating the running model through the *Scheduler* class. Applied to the process of map generalization, our model has a map area which is implemented through a continuous space—providing a high flexibility for different map scales and contains map agents which represent objects that generalize themselves by performing generalization operations, according to the perception of their current state and their fulfillment of given constraint measures. Besides micro-agents, which represent the individual points, our model also contains meso-agents, which are generated within the model initialization and control the pattern preservation.

(Map) agents and the implementation of their decision-making process are the most complex part of an agent-based model. Duchêne et al. (2018) decomposed the "brain" of map agent into three main components: capacities, mental representation, and procedural knowledge. We followed this approach and used these components in our model (see Fig. 10.4). The capacities of our agents include the ability to perceive their surrounding space, to evaluate themselves, and to perform generalization operations. The updating process of the first two capacities is thereby provided by the *Model* class, which performs several spatial analysis operations on the totality of map objects after each simulation step and transmits the calculated measures back to the individual micro- and meso-agents. The mental representation of the agents compares their current state with the goals they are aiming at—i.e., the fulfillment of map constraints—and calculates their satisfaction. It also memorizes all previous actions the agents took and the respective outcome of it. Finally, the procedural knowledge component is the decision-making unit of the agents. Based on the agent's constraint satisfaction and the knowledge of the past steps, it decides which operation the agent should execute in the next step.

Besides the core functions for agent-based modeling, Mesa offers functionalities for data analysis and model visualization. The *DataCollector* class of Mesa is able to record, store, and export all relevant data of the agents for further analysis. It allows us to control the mechanisms of the model, as well as tuning the decisionmaking process of the agents. Via the visualization components, Mesa also provides a browser-based visualization of the running model, but until now, we haven't implemented a respective function in our model yet. Instead, we set up and run the model via Jupyter Notebook and present the generalized map in an interactive browser map.

<sup>1</sup> Our source code is online: https://gitlab.com/g2lab/tovip.

**Fig. 10.4** Components of our model for agent-based map generalization

### *10.5.2 Global Map Specifications and Measure Satisfaction*

Besides the specific constraints we defined in order to support interpretation tasks, there are also global map specifications and characteristics which are important for the process of generalization in general and for the point generalization in particular and which have to be defined in advance. For example, it is necessary to know the scale of the source map and if this map satisfies all legibility constraints regarding the point symbols (i.e., the source map has no point clutter and is readable). It is also required to define the target scale of the map and the (pixel) size of the point symbols. Moreover, it is of interest if the point data set contains different classes and, if it does, the respective scale of measurement. While these global map specifications are determined in most of the use cases for point generalization (e.g., the target map scale via predefined zoom levels), they can also be changed in the model setup.

Furthermore, the "brain" of the map agent requires determining a predefined behavior regarding the task of translating a list of measures into a value representing the satisfaction of an agent at its current status. The common workflow for this task consists of two steps (Touya 2012): First, the measures get translated into a Likertlike satisfaction scale, which ranges from 1 ("unacceptable") to 8 ("perfect"). Each measure thereby has its own method for translation, which has to be defined in advance. In the second step, a global satisfaction value is derived from the individual values, utilizing principles from Social Welfare Orderings (SWO). Again, the specific SWO method has to be chosen in advance and triggers different strategies the agents will follow. For example, using a SWO that emphasizes low values with a higher weight, the agents will try to minimize the number of low values and focus less on maximizing other values, while a utilitarian SWO fosters strategies where agents maximize the sum of all values.

Taillandier and Gaffuri (2012) proposed an approach to help the user with parameterization using a human-machine dialogue. We utilize this approach for our model and offer a guided user interface where parameter adjustments get visualized via samples on a map. It allows the user to adjust the satisfaction scales by modifying the class dividers, which are predefined as a function of respective global map attributes such as scale and point cardinality.

### **10.6 Discussion**

The previous sections described the workflow of defining a set of constraints and respective measures based on the findings from a user study and the implementation in an agent-based model. This already answers our research questions to a certain degree. In this section, we want to further discuss implications that occur with the results of the previous sections.

In Sect. 10.4.2, we define a set of constraints containing measures that control the spatial distribution of points, the ranking between clusters, the shapes of clusters, and the distances between the points of a cluster. Furthermore, the preservation of local extreme values and all point categories should be added if their existence is of interest in the interpretation task. Taking the different measure scales of two of the constraints into account, there are only six to eight measures, which can be used to control the generalization process. Still, this requires at least six different predefined parameter adjustments to translate measures into satisfaction values. The complexity of the parametrization process has been identified as one of the main drawbacks of the agent-based approach (Duchêne et al. 2018), and this is also the case in our model. As an example, defining a function to evaluate the macro-measure of point distribution is little intuitive, as it needs to define class dividers for narrowvalue ranges, which differences are hard to visualize. However, six parameters and a user-friendly way to adjust them are still feasible in our opinion.

Manual parameter adjustment is one reason which makes it difficult to transfer our approach of point generalization to other applications. The second reason is the time-consuming calculation of measures that rely on rather complex geospatial operations. On-the-fly point generalization (Jabeur et al. 2006) is therefore not possible with our approach, and the computing time depends heavily on the number of points to generalize. A solution to this problem could be the integration of novel learning techniques (Touya et al. 2019). If a model can learn how to generalize points while preserving the right information, it could predict a generalized point set on-the-fly.

### **10.7 Summary and Outlook**

The generation of VGI data in general, and of points in particular, has shown an immense increase in recent years. As one of the main properties of VGI is its enormous volume and heterogeneity within the data, it leads to dense clutters when it is presented on maps. The cartographic solution to this problem is point generalization: rearranging or reducing the number of points. If this is applied rather general, specific spatial pattern could be eliminated—which is a major problem when these patterns were of high importance and subject of interest to the user. This chapter presents a workflow to resolve this problem by defining a set of constraint that can be used to control the generalization workflow. We developed the list of constraints based on a user study and applied them by implementing an agent-based model for point generalization.

VGI point data is often produced in multiple scale levels and over longer periods of time. In our future work, we want to factor this and develop our model further by adding functionalities for *multi-scale views*, which requires consideration of scale transitions, and *multi-temporal representations*, where the cognitive workload related to animations must also be considered. Furthermore, we plan to improve the usability of our agent-based model by developing a more intuitive user interface, which would allow more users to apply the findings of our model to their objectives. A third task we plan to work on in the near future is the integration of other cartographic techniques such as point color, point symbolization, and others in our generalization model. The overall plan is thereby to stepwise add functionalities for broader applications of map generalization into our system.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization and Social Computing* (VGIscience, TOVIP, SCHI 1008/11-1).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Part III Active Participation, Social Context, and Privacy Awareness**

Part III of the book addresses the human impact associated with VGI. The influence of humans is manifold during acquisition, processing, and interpretation of VGI. People can take both active and passive roles in the generation of VGI, for example, simple unintentional actions such as georeferencing pictures or messages in social media or intentional involvement of volunteers in citizen science and crowdsourced projects. Potentials for active participation can be released event-driven in the short term, e.g., in the case of catastrophic events, but also result in continuous participation over a long period of time, e.g., while monitoring animals, plants, or recording weather and climate phenomena. Specialist and local knowledge is thereby used to build up free knowledge archives and geodata collections.

Humans not only play a central role in the generation of VGI but can also become the subject of research themselves when human behavior is studied based on the published content. This includes studies of who is involved in the creation of VGI, e.g., differentiated by age, gender, or previous education. Furthermore, a distinction has been made between the large majority, which makes contributions available independently often at short notice and a rather small user group, which contributes content very persistently and thus also determines future directions of VGI projects. Motives of all groups are the subject of research and are just as interesting as approaches on how participation can be promoted.

The first chapter in this section describes the use of wearable sensors worn by volunteers to record their exposure to environmental stressors on their daily journeys. With that the behavior of individuals and the effect of customized recommendations are examined. The following chapter investigates the collective behavior of people using location-based social media and movement data from football matches. In addition to a conceptual behavioral model, various generic methods and application-related workflows for visual analysis are presented. The third chapter in this section deals with the motivation of volunteers who provide important support in information gathering, filtering, and analysis of social media in disaster situations. Since humans play a central role in the generation of VGI, special precautions are also required with regard to the protection of privacy. In the case of active participation, the consent of the user can be obtained. This is not possible in the case of passive, unintentional participation. The last chapter in this section describes a set of concepts that take privacy by design into account to process social media and make it impossible to identify individual persons.

Risks and negative side effects must be examined more closely in future research. This includes the invasion of masses when promoting sensitive places via social media, the emergence of filter bubbles when information is sorted based on user preferences, or the manipulative use of information about users who contribute content on social media. Selected, simple approaches for considering bias and for normalizing the social media data are already taken into account in the various chapters.

# **Chapter 11 Environmental Tracking for Healthy Mobility**

**Anna Maria Becker, Carolin Helbig, Abdelrhman Mohamdeen, Torsten Masson, and Uwe Schlink** 

**Abstract** Environmental stressors in city traffic are a relevant health threat to urban cyclists and pedestrians. These stressors are multifaceted and include noise pollution, heat, and air pollution such as particulate matter. In the present chapter, we describe the use of wearable sensors carried by volunteers to capture their exposure to environmental stressors on their everyday routes. These wearable sensors are becoming increasingly important to capture the spatial and temporal distribution of environmental factors in the city. They also offer the unique opportunity to provide individualized feedback to the person wearing the sensor as well as possibilities to visualize different stressors in their temporal and spatial distribution in a virtual reality environment. We used the option of providing individualized feedback on personal exposure levels in two randomized controlled field studies. In these experiments, we studied the psychological health-related outcomes of carrying a wearable sensor and receiving feedback on one's individual exposure levels.

**Keywords** Wearable sensors · Air pollution · Noise · Heat · Mobility · Health

### **11.1 Introduction**

Volunteered geographic information (VGI) that is collected by laypeople can be utilized by other citizens and enables scientific investigations that can be seen as a special form of citizen science (Connors et al. 2012; Goodchild 2007). User-generated content such as VGI has proliferated in recent years and is a form of crowdsourcing, where information is drawn together from several non-

A. M. Becker (O) · T. Masson

Department of Social Psychology, Wilhelm-Wundt-Institute for Psychology, Leipzig, Germany e-mail: anna.becker@uni-leipzig.de; torsten.masson@uni-leipzig.de

C. Helbig · A. Mohamdeen · U. Schlink

Department of Urban and Environmental Sociology, Helmholtz Centre for Environmental Research, UFZ, Leipzig, Germany

e-mail: carolin.helbig@ufz.de; mahmoud.mohamdeen@ufz.de; uwe.schlink@ufz.de

expert individuals (Elwood et al. 2012). Besides many other applications, VGI has been used in environmental monitoring (Connors et al. 2012) and public transport planning (Giuffrida et al. 2019). Multi-level data mashups—the integration of different forms of information—are becoming increasingly important to layer information in respect to specific locations (Elwood et al. 2012). While some VGI is socially embedded and subjective (Elwood et al. 2012), sensor data can be objectively geotagged.

The aim of this chapter is to demonstrate how VGI can take advantage of smart sensors for the assessment of environmental data that is relevant for human health. In the approach described here, we develop a specific design of VGI techniques that can support city planners and urban authorities as well as individuals who are vulnerable to elevated levels of environmental pollutants and heat. We hypothesize that VGI can improve health protection, which is particularly important for susceptible and vulnerable individuals. The following sections outline the methods and the design of our studies and summarize the results.

### **11.2 Measuring Environmental Stressors**

Epidemiological studies have consistently demonstrated links between exposure to particulate matter and adverse health effects (World Health Organization 2016; Stafoggia et al. 2022). To protect human health, air quality monitoring is regulated by European Directives (European Parliament 2008). Personal exposure to environmental stressors is multifactorial and includes exogenous factors contributing to human health risks (Schlink and Ueberham 2020), such as air temperature, air humidity, air pollutants (gasses, particulate matter), and noise. In Leipzig, traffic is a major source of intra-city PM10 and NOx emissions, while small combustion plants are a source of other emissions (Stadt Leipzig 2018).

Due to the individual differences in daily activities, each person has very individual exposure patterns, which obviously cannot be adequately captured by measurements at a few monitoring stations in the city (improved approach; see Steininger et al. 2020) but require person-specific measurements (Dias and Tchepel 2018; Hinwood et al. 2007). However, air pollution data from a terrestrial monitoring station can only be considered representative of the surrounding district. Such measurements are strongly influenced by the location of the stations, and the dispersion and dilution of air pollutants are affected by meteorological and local conditions (e.g., urban structure; built-up areas, street canyons, traffic networks, and green areas). Therefore, it does not adequately capture the spatial variability of locally sourced pollutants and cannot be an indicator of human exposure (Adams and Kanaroglou 2016; Dionisio et al. 2013). The activities of individuals and air pollution levels exhibit a large degree of spatiotemporal dynamics. This emphasizes the necessity to investigate near-real-time measurements and to develop methods for data assessment (Yang et al. 2022).

With advances in technology in the fields of electrical engineering and wireless networks, "low-cost" air quality sensors have been developed to make air quality monitoring more accessible and portable (Castell et al. 2017; Snyder et al. 2013). In the past several years, evidence has begun to emerge about the usefulness of low-cost sensors. Studies have shown various applications, for example, the classification of emissions sources. Low-cost sensors have a limited ability to detect nanoparticle real-time emissions from residential sources (Wang et al. 2020). However, low-cost sensors were found to perform well in both ambient (field) and controlled (laboratory) conditions (Connolly et al. 2022).

### **11.3 Citizen Science and VGI in the Context of Environmental Tracking**

Including the general public in the production of scientific knowledge is becoming increasingly important (Eitzel et al. 2017). From collecting data (e.g., about local wildlife sightings), processing large amounts of data on private computers, or contributing tedious work by cataloging pictures of animal colonies or galaxies to a deeper involvement in identifying research questions and procedures, laypeople can contribute to scientific progress (Strasser et al. 2019). While citizen science can arise from grassroots movements with an activist agenda (Ottinger 2010), it is most often initiated by scientific and educational institutions (Eitzel et al. 2017). Citizen science can not only shift workload to interested citizens and create unique monitoring opportunities (Aceves-Bueno et al. 2015; Verplanke et al. 2016). It can also serve in educating laypeople about the scientific process and increase public interest and trust in science (Strasser et al. 2019). The use of wearable sensors to collect data on environmental stressors around the city can be understood in the framework of citizen science, as laypersons collect data and investigate their own exposure to pollution as they explore their surroundings. This immersion in the scientific process can not only aid in data collection but also help citizens to identify the least polluted areas in their city. Platforms that integrate data from citizen scientists have been developed to share and visualize environmental information from different sources (Lautenschlager et al. 2018). A plethora of studies indicates that volunteers have a high interest in their personal exposure to environmental stressors, though behavior change to avoid these stressors is often hard to implement (see Becker et al. 2021 for a review).

### **11.4 A Health-Psychological Perspective**

Tracking environmental stressors with a wearable sensor makes it possible to give individualized feedback to the users. Thereby, they receive information on potentially harmful exposure levels to air pollution, heat, and noise pollution. As these stressors (e.g., particulate matter) cannot be perceived directly, this is new information that may elicit fear in the user. Fear appeals have been studied by psychologists in the past, showing that effects of fear appeals on behavior change (e.g., smoking cessation) are strongest when the negative consequences of a harmful behavior are seen as personally relevant and likely to occur (Rosenstock 1974; Rogers 1975). Choosing highly polluted routes in the city can be framed as a harmful behavior to one's health, and messaging should motivate people to choose less polluted routes, e.g., smaller streets with less traffic. Two influential theories in health psychology, namely, the health belief model (Rosenstock 1974) and protection motivation theory (Rogers 1975, 1983; Maddux and Rogers 1983), have studied health-related behavior change such as smoking cessation or physical exercise. Both theories have in common that they predict a healthy behavior change when negative health outcomes of a current behavior are seen as severe and a person feels susceptible to these negative outcomes. In our case, this may be that people perceive the health effects of particulate matter to be serious and that it is likely that they will be affected by these, if they do not lower their exposure. Additionally, persons must see a behavior change as useful to mitigate the risk of illness and feasible to implement (Rogers 1983; Maddux and Rogers 1983). Perceived psychological costs or barriers will reduce the likelihood of behavior change (Rogers 1983). For example, if a person is not willing to take a longer route to work or to use a side street with a less comfortable surface to cycle on, this will reduce the likelihood of them changing their route. In more technical terms, protection motivation theory posits that both threat appraisal (harmful outcomes of the current situation are severe and likely to occur) and coping appraisal (alternative behaviors will be effective in mitigating the threat and are feasible to implement) contribute to health-related behavior change (Rogers 1983; Maddux and Rogers 1983). Research on protection motivation theory has found that both threat appraisal and coping appraisal must interact to elicit a problem-focused coping response (Babcicky and Seebauer 2019; Rippetoe and Rogers 1987). If a person only perceives high threat but has a low coping appraisal, they might choose emotion-focused coping strategies rather than problem-focused coping. These emotion-focused strategies, such as denial, fatalism, or wishful thinking, will reduce the emotional burden of the threat but not result in healthy behavior change (Babcicky and Seebauer 2019; Rippetoe and Rogers 1987). Protection motivation is most commonly applied to health behaviors (Plotnikoff and Trinh 2010; Prentice-Dunn et al. 2009). However, it can also be applicable to other behavioral safety measures, for example, in relation to flood events (Babcicky and Seebauer 2019), climate change (Bagagnan et al. 2019), the adoption of electric vehicles (Bockarjova and Steg 2014), household water management (Bryan et al. 2019), earthquake preparedness (Mulilis and Lippa 1990), and wildfire protection (Dupéy and Smith 2019).

### **11.5 Visualizing Environmental Stressors**

The amount and variety of data are increasing in all research fields, which means that analyzing these large, complex datasets has become a challenging task (Avazpour et al. 2019). That includes integrating data from multiple heterogeneous sources, normalizing these data, and providing a unified view of these data sets to users (Tian and Li 2019). Collecting, integrating, aligning, and efficiently extracting information from heterogeneous and autonomous data sources are considered a major challenge (Fusco and Aversano 2020). In this context, visualization is an important tool that makes it possible to analyze complex data. Scientific visualization assists scientists in analyzing data by transforming data into geometric representations, thus supporting the analysis of complex data (Gershon and Eick 1995; Defanti and Brown 1991). In addition, virtual reality (VR) is recognized as a powerful human-computer interface (Burdea and Coiffet Philippe 2003). Users can immerse themselves in a virtual world and manipulate it by changing their viewpoint and interacting (Brooks 1999). VR environments are a promising tool for scientists to visualize their large and complex data sets and to control the behavior of virtual objects using interaction functions (Simpson et al. 2000). The combination of the power of a VR system and the human ability to detect interesting patterns and inconsistencies in the data makes VR a suitable tool to solve future scientific visualization tasks (van Dam et al. 2002). VR is also used in the context of urban planning in some prototypical projects, for example, in knowledge transfer on the value of urban greenery (Mokas et al. 2021), the investigation of the perceived safety level of cyclists (Nazemi et al. 2021), different planning phases of high-rise building construction (Lu et al. 2021), green landscape planning and design (Pei 2021), and the evaluation of urban spaces (Luigi et al. 2015; Zhang and Zhang 2021). One way to combine methods of visualization and VR, as well as to implement analysis methods and make them available via a graphical user interface (GUI), is to use the Unity Game Engine (Helbig et al. 2015). Its use in scientific projects has increased in recent years, e.g., for integration of 3D building model data (Keil et al. 2021), visualizing 3D point clouds captured by drones (Weißmann et al. 2022), implementing a walkable virtual city model (Schmohl et al. 2020), and simulating secure hazardous transportation (Yang et al. 2020). Another argument for the use of game engines is the possibility to integrate audio data in a straightforward way, which increases the degree of immersion considerably (Hruby 2019; Berger and Bill 2019; Rafiee et al. 2017) and is not available in conventional visualization tools for geodata. In the presented project, we implemented a visualization and analysis application for mobile sensing data and questionnaire data in combination with other urban data such as 3D city model, traffic data and noise maps from the city of Leipzig, and weather station data to explore the data with scientists as well as decision makers and citizens.

### **11.6 Implementation: Two Field Experiments Using Wearable Sensors**

We conducted two field experiments with wearable sensors in order to explore the spatiotemporal distribution of environmental stressors in the city of Leipzig. Another aim of these studies was to test the psychological effects of carrying wearable sensors and receiving feedback about one's personal exposure levels. Lastly, one of our aims was to implement innovative ways to integrate and visualize sensor data and subjective experiences of participants with the sensors.

### *11.6.1 Field Study 1*

### **11.6.1.1 Design and Procedure of Study 1**

To test the psychological effects of carrying the sensors, we conducted experiments as randomized controlled trials, including a measurement group that carried the wearable sensors and a control group that did not receive the sensor kit. The first field experiment was run from July to September 2020. After registering on our website, participants were allocated to a week of participation. Within each study week, participants were randomly allocated to either the measurement group or the control group. Persons in the measurement group carried the measurement kit for three days and received individualized feedback about their exposure to heat, particulate matter, and noise after the measurement phase. Participants in the control group did not carry the measurement kits but received questionnaires including the same questions as in the measurement group. This allowed us to compare the answers of those in the control group to those in the measurement group and identify the effects of the intervention. On Fridays before their allocated study week, all participants received a first questionnaire. Persons in the measurement group then picked up the measurement device on Monday and used it for three consecutive days to measure their exposure on everyday routes by bike or as pedestrians. They returned the measurement kit on Thursdays. The next day, all participants received a link to the second questionnaire (both those who carried the measurement kit and those in the control group). We then retrieved the data from the measurement kits and compiled individual feedback reports for those who wore the sensors. One week after the measurement ended, each person from the measurement group received their feedback report and a third questionnaire. The feedback report included general information about particulate matter, heat, and noise as environmental stressors. The report also showed graphs indicating the accumulated stressor levels throughout the measurement phase. A color-coded legend indicated the magnitude of exposure with reference to noise levels (silent room–pain threshold) or temperature ranges (no temperature stress–extreme temperature stress). Approximately 2 to 4 months after their participation, both the measurement and the control group were sent the link

**Fig. 11.1** Experimental procedure for the measurement group in study 1

to a follow-up questionnaire (with some outliers submitting after 1.6 to 4.7 months) (Fig. 11.1).

### **11.6.1.2 Sensors Used for Measuring Environmental Stressors in Study 1**

The measurement system consisted of three different devices: (1) smartphone Motorola, Moto G3, launched in July 2015 and operated with android system; (2) Little Environmental Observatories (LEO) Electrochemical sensors widely used for the detection of toxic gasses at the parts per million (ppm) level and for oxygen in levels of percent of volume (% vol)—toxic gas sensors are available for a wide range of gasses, including NO2, NO, and O3 (Robinson et al. 2018), providing the current temperature and relative humidity (Ateknea Solutions Catalonia S.A.; https://ateknea.com); and (3) Dylos-DC1100, an air quality monitor that measures particulate matter (PM) number concentration to provide a continuous assessment of ambient suspended particles. The unit counts particles in two size ranges: large and small. According to the manufacturer, large particles have diameters between 2.5 and 10 micrometers, μm (i.e., PM10); small particles have diameters from 0*.*5 μm up to 2*.*5 μm (i.e., PM1 and PM2.5).

### *11.6.2 Field Study 2*

### **11.6.2.1 Design and Procedure of Study 2**

The second field experiment was conducted from September 2021 to August 2022. Similar to study 1, there was a control group and a measurement group. However, we extended the design of the field experiment by providing participants with daily feedback about their personal exposure as well as suggestions for alternative routes as a strategy to strengthen participants' coping capacity. After signing up through our website and being allocated to a study week, participants were randomly assigned to their group. Both groups received a first questionnaire, before the measurement group was given the wearable sensor and smartphone kit. Participants in the measurement group used the sensor and smartphone for 3 days on their everyday routes. They received feedback about their exposure in the smartphone app every evening. The app allowed them to (a) see the measured levels of heat, noise, and particulate matter along the routes they had used during the day and to (b) receive an alternative route suggestion for each route they had recorded that day. These alternative route suggestions were based on a set of routing preferences optimized for low pollution and developed in collaboration with the VGI-Routing team (see Chap. 3). Participants in the measurement group were asked to fill in a short daily questionnaire after looking at their feedback and the alternative route suggestions. After their study week, both the measurement group and the control group received the link to another questionnaire. Approximately 3 to 6 months after their study week, all participants received a follow-up questionnaire (with some outliers submitting after 2 to 8.7 months).

### **11.6.2.2 Sensors Used for Measuring Environmental Stressors in Study 2**

The system included a smartphone and the PAM AS520 wearable air quality monitor (www.atmosphericsensors.com), which can assess air quality in the wearer's immediate environment. The PAM uses electrochemical cells ("Alphasense" sensors with 2 electrodes), a laser-optical particle monitor, temperature and humidity sensors, and a high-sensitivity GPS receiver and 3-axis accelerometer. A built-in microphone records ambient noise. At night, the PAM is stored in a docking station where the battery is charged. The docking station is equipped with a modem and SIM card that transmits the data stored on the device to the project's central database for further processing (Fig. 11.2).

### *11.6.3 The Questionnaires*

Multiple questionnaires were used to assess psychological variables in our field experiments. Questionnaires of each participant were linked through an identifier code that was indicated at each measurement point. This code was also used to identify each participant's sensor measurements and connect this information to the questionnaire data. In the first questionnaire, participants were asked for their preferred mode of transport (public transport, car, bike, or by foot) and which aspects were particularly important to them when choosing a route as a cyclist or pedestrian (e.g., speed, green space, etc.) (adapted from Ueberham et al. 2019). Participants were also asked about the stage of action they were in regarding healthy routing choices with low pollution levels (adapted from Olsson et al. 2018). To

**Fig. 11.2** The sensor and smartphone kit used in Study 2 (Foto by André Künzelmann, UFZ)

capture the main predictors of protection motivation theory (Rogers 1975, 1983), we measured health risk perception for particulate matter, noise, and heat (e.g., *particulate matter, noise, and heat on my daily routes have very negative effects for my health*), as well as coping appraisals regarding these three environmental stressors (e.g., *I can reduce my exposure to environmental stressors in street traffic*) (partly adapted from Ueberham et al. 2019). As our main dependent variable, we measured individual protection intentions to change one's everyday routes in a way that would avoid environmental stressors (e.g., avoiding streets with a lot of car traffic). We further measured collective action intentions to fight for lower levels of environmental stressors in city traffic (e.g., by signing petitions), as well as willingness to pay for a service providing daily information on current low-pollution routes. We measured a range of other constructs in the questionnaire, including emotion-focused coping strategies that may come into play when facing a threat while coping appraisals are low (Rippetoe and Rogers 1987) and routing behavior habits when traveling to school/university, when going shopping, and during leisure time (Verplanken and Orbell 2003). Further measures included, among others, social norms (e.g., whether participants thought that others in their city were avoiding environmental stressors on everyday routes); identification with the city of Leipzig, cyclists, pedestrians, and car drivers; derogation of car drivers; general health concerns (Fahrenberg et al. 2003); and preference for technology use.

### **11.7 Results of the Field Experiments**

### *11.7.1 Results Regarding the Sensor Measurements*

The two field experiments (FE) were designed to monitor individual's experiences on a trajectory based real-time (FE1; *At* = 5 sec. and FE2; *At* = 10 sec.) for a given population (i.e., volunteer cyclists and pedestrians) during their daily life activities. These specific conditions of personal exposure are used to develop a new perception of the spatial categories of the road network in Leipzig, where our population activities took place. The framework of the developed system includes (1) mobile sampling using low-cost air quality sensors along road segments of individual's routes traveled (first field experiment), (2) empirically identified classes of microenvironments for road segments in Leipzig, and (3) recognition of classes developed by applying supervised machine learning (ML) models (e.g., Random Forest, RF; logistic regression, LR; support vector machines, SVM; and Knearest neighbors, KNN). The outdoor microenvironment is categorized into three categories, each of which is virtually differentiated by buildings, road structure, traffic rate, vehicle fleet number, and driving mode with OSM-based field categories (such as highway, bicycle, and pedestrian lane), helping to reduce the multiple factors affecting personal exposure in urban data areas such as weather conditions, land use, and traffic. The three categories are (1) main, characterized by detached blocks and broad road structure, e.g., avenues, semi-continuous pollutant line source from fleet flow of vehicles with large number and high rate of dissipation, which also depends on weather conditions, i.e., wind factors; (2) secondary, spaced by linear semi-tall buildings, secondary streets (e.g., residential streets and parking areas, i.e., idle driving mode), discrete emissions from vehicle critical dual periods (i.e., daily back-and-forth movements of the workforce), and large eddy currents which counteract or reduce aerodynamic dispersal for pollutants in street canyons; and (3) green, vegetation cover, e.g., forests and parks. Several non-parametric ML algorithms were also tested with the same selected urban stressors (10 features, PM2.5, PM10, NO, NO2, O3, temperature, specific humidity, light intensity, speed, and noise) to determine which ones are suitable for our data set. The results showed that the performance difference between the individual models (RF, LG, SVM, and KNN) was significant, while RF models performed best over others, achieving >90% accuracy. Model confusion from multiple experimental samples is less than 10%, and this can be explained in terms of the following: (1) the high degree of overlap between classes is one of the main challenges affecting the accuracy of the classifier (Deberneh and Kim 2021), and (2) the developed categories are strongly coherent in a heterogeneous urban environment, so sensor response time and recording interval must also be considered when interpreting the results. In addition, the time to transfer a cyclist or pedestrian from one category to another is always much less than the response time of the sensors (Ueberham and Schlink 2018). This poses a great challenge in the crossings and adjacent parts between the three classes.

### *11.7.2 Results Regarding the Questionnaires*

### **11.7.2.1 Statistical Analysis**

To identify effects of our intervention, we computed linear mixed-effect models with random intercepts to estimate within-participant changes from pretest to follow-up for our outcome measures while also assessing differences between the intervention group and the control group. The mixed models included time (pretest, posttest, test after receiving exposure feedback, follow-up) as well as the group (intervention vs. control) and their interaction term to predict each outcome variable. When using an additional moderator, we entered time, group, the moderator variable, as well as the two-way and three-way interaction terms into the mixed model.

### **11.7.2.2 Results of Study 1: Descriptive Analysis**

One hundred and eighty-two participants completed the pretest questionnaire, 167 participants completed the posttest questionnaire, and 121 participants completed the follow-up questionnaire. The datasets were merged based on an identifier code, generated by each participant at the start of each questionnaire, resulting in a final sample of 109 participants (N*intervent ion* = 56, N*control* = 53; 59.89% of the pretest sample). Sixty-one participants identified as female and 48 identified as male. Ages ranged from 19 to 67 years (*M* = 36.33, *SD* = 9.68). Most participants (72.5%) had a university degree. 6.4% reported having a respiratory health condition such as asthma, and 29.4% reported having allergies. Overall, participants rated their health as good (*Mdn* = 6 on a seven-point scale ranging from 1, very bad, to 7, very good) and reported medium levels of health concerns (*M* = 4.04, *SD* = 1.00). All scales were 7-point Likert scales. Participants rated their previous knowledge about particulate matter, heat, and noise pollution as medium (for PM, *Mdn* = 3; heat, *Mdn* = 3; noise, *Mdn* = 3). Satisfaction with the measurement kit was medium (*Mdn* = 4.5). 57.1% of the participants rated the handling of the measurement kit as at least somewhat easy, indicating that the ease of use was only medium (*Mdn*  = 3). More than 70% reported a high or very high tracking frequency throughout the measurement phase (*Mdn* = 6). Participants reported that they used the tracker on typical everyday routes (*Mdn* = 7). More than 74% of the respondents at least somewhat agreed that wearable sensors can help people reduce their personal exposure to environmental health risks (*Mdn* = 5), and 63% at least somewhat agreed that wearable sensors may support behavior change aimed at reducing personal exposure (*Mdn* = 5). Following the individualized feedback report, participants rated their exposure to particulate matter (*Mdn* = 5) and noise (*Mdn* = 5) as medium to high while rating the heat exposure as lower (*Mdn* = 3.00). The feedback did not greatly differ from participants' expectations, as participants indicated it being neither much higher nor lower than expected (PM, *Mdn* = 4.00; noise, *Mdn* = 4.00;

heat, *Mdn* = 4.00). Participants perceived the feedback to be mostly representative for their everyday exposure (PM, *Mdn* = 6.00; noise, *Mdn* = 6.00; heat, *Mdn* = 5.00).

### **11.7.2.3 Results of Study 1: Mixed Model Results**

For individual action intentions, results showed no significant main effects of time and group and, more importantly, no significant interaction effect of time and group. In other words, our results did not indicate that participation in the measurement group (vs. control group) would increase respondents' action intentions to protect themselves against environmental health risks. However, results of exploratory analysis, including routing behavior habits as an additional moderator in the analysis, revealed an interesting pattern of results. First, participation in the measurement group (vs. control group) increased action intentions from pretest to posttest for respondents with weak (but not strong) routing behavior habits. This initial increase, however, was not stable throughout the intervention period. At the follow-up measurement point, we found no differences in action intentions between respondents with weak and strong routing behavior habits. For perceptions of environmental health risks, our findings indicate that participation in the measurement group (vs. control group) led to a significant increase in the perception of particulate matter health risks from pretest to posttest. Importantly, increased health risk perceptions for particulate matter were retained throughout the followup period, indicating a robust intervention effect. There were no intervention effects on health risk perceptions for noise and heat.

### **11.7.2.4 Results of Study 2: Descriptive Analysis**

The pretest questionnaire was completed by 267 eligible participants, 225 completed the posttest questionnaire, and lastly 151 eligible participants completed the follow-up questionnaire. After matching participants' codes, the final sample with complete data sets consisted of 136 participants (N*intervent ion* = 67, N*control* = 69; 50.9% of pretest sample). Eighty-eight participants identified as female, 45 identified as male, and three participants identified as diverse. Ages ranged from 18 to 70 years (*M* = 29.76, *SD* = 10.43). Approximately half of the respondents had a university degree. Regarding health condition, 7.4% reported a respiratory health condition such as asthma, and 27.9% reported having allergies. Overall, participants rated their health as good to very good (*Mdn* = 6) and reported medium levels of health concerns (*M* = 4.15, *SD* = 1.08). Satisfaction with the measurement kit was high (*Mdn* = 6). Approximately 85% of the participants rated the handling of the measurement kit as somewhat easy, easy, or very easy (*Mdn* = 2), and more than 90% reported a high or very high tracking frequency throughout the measurement phase (*Mdn* = 6). More than 77% of the respondents at least somewhat agreed that wearable sensors can help people reduce their personal exposure to environmental health risks (*Mdn* = 5), and more than 62% at least somewhat agreed that wearable sensors may support behavior change aimed at reducing personal exposure (*Mdn* = 5).

### **11.7.2.5 Results of Study 2: Mixed Model Results**

For individual action intentions, results showed no significant interaction effect of time and group. While inspection of simple effects showed a trend indicating that participation in the measurement group (vs. control group) increased action intentions from pretest to posttest, the intervention effect was not significant. In contrast to Study 1, we found no effects of routing behavior habits on changes in action intentions. Regarding perceptions of environmental health risks, our findings indicate a robust intervention effect on perceived particulate matter risk. Specifically, results revealed that participation in the measurement group (vs. control group) led to a significant increase in perceived PM health risks from pretest to posttest. Importantly, increased levels of perceived PM health risk were retained throughout the follow-up period. For perceived heat and noise health risks, results indicate no significant treatment effects.

### *11.7.3 The Visualization and Analysis Application and Achieved Results*

As part of the project, a visualization and analysis application was implemented that allows the data from the measurement campaign to be evaluated in combination with data on building structure, weather, GI, and traffic (Helbig et al. 2022). The application was implemented with the Unity Game Engine, which provides a development environment for computer games and other interactive 3D graphic applications and enables us to meet the following requirements: (1) Provide a user interface and interaction functionality, (2) performance (avoid long loading times), (3) Integrate various data sources/data types, (4) 3D visualization, (5) integrate analysis methods, and (6) presentation on PC and VR environment. By using mobile sensors within the measurement campaign of our project, we get data from two different perspectives: (1) the individual exposure to environmental stressors for a participant and (2) the distribution of environmental stressors within the urban area. In the application, methods must be provided that enable the analysis to be carried out from both perspectives. This is possible by displaying individual routes or the routes of individual participants in the 3D city model, as well as in a 2D representation. The distribution of all measurement points over the urban space can also be displayed. In addition, a hot and cold spot analysis of the distribution of the values can be carried out. The application also offers a range of functions that allow a comprehensive analysis, such as different perspectives (perspective and orthographic), showing/hiding of data layers, display of individual routes (incl.

**Fig. 11.3** Analysis and visualization of the trajectories of mobile sensor data

extracting GPS outliers), different coding of values (by color and/or size), chained filtering by parameters, and space-time cube visualization (Fig. 11.3).

The analysis of the data with the help of the application showed that paths, especially on which many cyclists move and which also allow faster speeds due to their condition, are polluted with high nitrogen oxide levels due to their location directly on main roads. The evaluation so far also shows the temperature differences that occur between parks and heavily sealed surfaces, especially on hot days. There, the surroundings heat up strongly, especially in the second half of the day, and can be poorly ventilated with dense development at the same time.

### **11.8 Conclusions**

Our study was conducted in the frame of the ExpoAware project1 and supported by the German Research Foundation DFG within the Priority Research Program 1894 Volunteered Geographic Information: Interpretation, Visualization and Social Computing. The research developed an integrated design that demonstrated (1) the feasibility of individuals applying advanced VGI techniques to explore the urban environmental conditions and provide useful information for city planners and officials; (2) the potential of VGI for the assessment of personal exposure that is highly relevant for adverse health effects, especially in vulnerable persons; and (3) the ability to study protective behavior of individuals and to develop actions for behavior changes and individual adaptation strategies. From our results of the urban measurements, we conclude that the proposed novel microenvironment classifications successfully reduced the complexity of urban data, and the results of the random forest models emphasized the validity of the hypothesized classification method. Nonparametric methods such as random forest are promising approaches

<sup>1</sup> https://www.vgiscience.org/projects/expoaware.html.

when the model is complex and can benefit from a large number of training examples. The study also highlights the benefits and potential of low-cost sensors in categorizing street-level personal exposure in urban areas. The field experiments are an important contribution to improving the framework conditions for cycling and walking and make it more attractive. For a sustainable and resilient city, the mobility transition and thus the reduction of motorized individual transport and at the same time the promotion of active mobility are a decisive component. By integrating the results of mobile measurements into long-term planning, urban planners can reduce the level of environmental stressors in cities and better react to extreme events (e.g., heat waves) in the short term. Very short-term reactions to elevated exposure can be made by the individuals themselves. In our study, the application of personal sensors increased participants' health risk perceptions (with regard to particulate matter) and was also able to temporarily elevate their action intentions to protect themselves against environmental health risks, albeit only for participants with weak routing habits in the first study. With the help of Unity, we were able to combine methods from 2D and 3D visualization, as well as implement analysis methods and make them available via GUI. Unity supports various platforms and enables us to use it in different contexts, ranging from PC for individual analysis to VR environments for collective analysis and presentation. We claim that our approach should be used for implementing analysis and visualization tools in future projects because (1) it has the potential to become a modular system for applications by reusing and further developing methods, (2) it enables the combination of modern 3D visualization with various analysis methods, and (3) it supports presentation in VR environments and thereby facilitates multidisciplinary, research in collaborative projects, and projects with high interdisciplinary (Helbig et al. 2022).

**Acknowledgments** This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—ExpoAware, 424979005.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 12 Extraction and Visually Driven Analysis of VGI for Understanding People's Behavior in Relation to Multifaceted Context**

**Dirk Burghardt, Alexander Dunkel, Eva Hauthal, Gota Shirato, Natalia Andrienko, Gennady Andrienko, Maximilian Hartmann, and Ross Purves** 

**Abstract** Volunteered Geographic Information in the form of actively and passively generated spatial content offers great potential to study people's activities, emotional perceptions, and mobility behavior. Realizing this potential requires methods which take into account the specific properties of such data, for example, its heterogeneity, subjectivity, and spatial resolution but also temporal relevance and bias.

The aim of the chapter is to show how insights into human behavior can be gained from location-based social media and movement data using visual analysis methods. A conceptual behavioral model is introduced that summarizes people's reactions under the influence of one or more events. In addition, influencing factors are described using a context model, which makes it possible to analyze visitation and mobility patterns with regard to spatial, temporal, and thematic-attribute changes. Selected generic methods are presented, such as extended time curves and the cobridge metaphor to perform comparative analysis along time axes. Furthermore, it is shown that emojis can be used as contextual indicants to analyze sentiment and emotions in relation to events and locations.

D. Burghardt (O) · A. Dunkel · E. Hauthal

Institute of Cartography, Faculty of Environmental Sciences, TU Dresden, Dresden, Germany e-mail: dirk.burghardt@tu-dresden.de; alexander.dunkel@tu-dresden.de; eva.hauthal@tu-dresden.de

G. Shirato · N. Andrienko · G. Andrienko

Fraunhofer Institute IAIS, Sankt Augustin, Germany e-mail: gota.shirato@iais.fraunhofer.de; natalia.andrienko@iais.fraunhofer.de; gennady.andrienko@iais.fraunhofer.de

M. Hartmann · R. Purves

University Zurich, Zurich, Switzerland e-mail: maximilianchristoph.hartmann@geo.uzh.ch; ross.purves@geo.uzh.ch

Application-oriented workflows are presented for activity analysis in the field of urban and landscape planning. It is shown how location-based social media can be used to obtain information about landscape objects that are collectively perceived as valuable and worth preserving. The mobility behavior of people is analyzed using the example of multivariate time series from football data. Therefore, topic modeling and pattern analyzes were utilized to identify average positions and area of movements of the football teams.

**Keywords** Location-based social media · Football analytics · Visual analytics · Reactions · Behavior · Context · Emoji · Bias

### **12.1 Introduction**

The rise of so-called Volunteered Geographic Information (VGI) has brought with it fundamental changes not only in the nature of geographic data but also its production and accessibility. These changes have implications across the board for those carrying out research requiring geographic data and most profoundly for research exploring how humans interact with and are affected by changes in their environment. The creation of user-generated spatial content is diverse, whether through the usage of a wide variety of sensors during activities in the real world or the use of social media platforms for information exchange in the digital world. Realizing the potential of VGI requires methods which take account of the specific properties of such data, for example, its heterogeneity, quality, subjectivity, spatial resolution, and temporal relevance. Of particular interest in this research are two types of information—first, information about people's reactions and behavior and, second, modeling different types of contexts, both derived from location-based social media (LBSM) and football match data.

Location-based social media respective geosocial media are extensively used for expressing and exchanging thoughts, opinions, ideas, and feelings (publicly or within a particular group of people)—thus reactions to an event or related to a theme. A definition of reactions is given by Dunkel et al. (2019) consisting of an identifier to a referent event and four facets (spatial, temporal, thematic, and social) describing the reaction. A series of reactions from an actor (where an actor might be an individual, a group, or an organization) can be seen as a manifestation of this actor's behavior (Luckmann 2013). Behaviors convey actions and provide information about doing and refraining. Activities in geosocial media are intentional, purposive, and subjectively meaningful, and they are usually targeted at others, i.e., they are social actions. The creation and production of such information is an expression of human behavior and as such influenced strongly by events and context. Key to any framework seeking to analyze behavior is a definition of dimensions through which space, time, thematic attributes, and events can be described. Moreover, modelling behavior and context requires methods to aggregate and generalize data across spatial and temporal scales.

The chapter is structured as follows. Section 12.2 describes the state-of-the art research related to modeling of reactions and behavior, context modelling, use of emojis, and consideration of privacy. Section 12.3 introduces conceptual models for describing behavior and context. Section 12.4 presents generic methods for accounting for representativeness and bias in LBSM, for comparative visual analysis, and using emojis as contextual indicators. Section 12.4 presents application-related workflows for activity analyzes in the area of landscape and urban planning, as well as the exploration of people's mobility behavior in football games.

### **12.2 Related Work**

Geosocial media is a special case of Volunteered Geographic Information (VGI) (Goodchild 2007), which enables users to communicate and connect with each other by sharing location-based information. Geovisual analysis of this information enables the study of events, geographic phenomena, and related human reactions for a variety of application cases, e.g., in the areas of urban and landscape planning, tourism, health, transport, or disaster management. Human responses manifest themselves within and outside of social network communication in a variety of ways as emotional expressions, opinions, and thoughts expressed or actions taken (Hauthal et al. 2019). Thus, Volunteered Geographic Information has added a new dimension to traditional geospatial data acquisition for human activity research (Li et al. 2020). Activities in Geosocial Media are intentional, purposive, and subjectively meaningful, and they are usually targeted at others, i.e., they are social actions. From a technical point of view, they can be subdivided accordingly Davis (2016) into origination (creating own, original content, e.g., tweeting, posting), acknowledgment (reactions to content, e.g., liking, favorite), associating (interaction with content, e.g., replying, commenting, mentioning, following), amplification (spreading content, e.g., retweeting, sharing), and action (moving beyond content, e.g., signing up for a newsletter, buying something, or going to a demonstration).

The research presented within this paper aims at analyzing reactions as a component of behavior, incorporating external context to better allow events and activities to be related and compared. Key to any framework seeking to analyze reactions to events is a definition of dimensions through which both reactions and events can be described. As pointed out by Teitler et al. (2008), these dimensions form the core of a description of an event and include not only ways of describing (what, who, where, when) but also explaining (how and why). Answering these questions can be seen as a way of characterizing the context of an event and associated behaviors (Dunkel et al. 2019).

As the project has been focusing on the concept of individual and collective behaviors (particularly, behaviors of social media users), it was appropriate to review the existing literature on the topic of spatial behavior analysis. The most intensive studies have been conducted in the area of sports analytics, where the researchers focus on the behaviors of players and teams in sport games. Duarte

et al. (2012) propose to consider sport teams as superorganisms. "Superorganism" is defined in sociobiology as a group of individuals self-organized by division of labor and united by a system of communication. The division of labor can be characterized by the "areas of responsibility" of the players on the pitch. The communications can be represented by graphs of interactions, e.g., passing networks. Fonseca et al. (2013) use Voronoi diagrams to study the spatial interaction behavior in terms of continuously changing players' arrangement on the pitch. Gudmundsson and Horton (2017) published a survey of approaches that use spatiotemporal data from team sports. This includes approaches from social network analysis, which are applied to passing networks and transition networks. Specifically for football (soccer), Memmert et al. (2017) present approaches to analysis of collective organization (distribution on the pitch, maintenance of distances between players); inter-player coordination; inter-team and inter-line coordination; correspondences between team formations as an aspect of inter-team interaction; formations of tactical groups, e.g., offense and defense; etc. Many of the existing approaches are specific for sports games and can hardly be applied to other types of data. In our project, we strive to develop more general concepts and techniques, and we also strive to consider behavior dynamics rather than derive summary characteristics of particular individual or group behaviors.

Emojis play a special role in the analysis of reactions, especially feelings and opinions. Their use in geosocial media is just as popular as the use of hashtags (Bai et al. 2019; Highfield and Leaver 2016), and as a language-independent characters, they have the advantage of avoiding error-prone language processing (Hu et al. 2013; Kralj Novak et al. 2015). The use of emojis is influenced by regional and cultural context (Kejriwal et al. 2021), so emojis have great potential to characterize spatial context but also analyze activities related to places and events (Hauthal et al. 2021).

Volunteered Geographic Information is based on people's willingness to collect and share content with the public community. It should be noted that this data is always related to individuals and could therefore be sensitive in terms of privacy and ethical issues. Olteanu et al. (2019) demands that an ethical approach must respect individual autonomy. In the case of purposeful active participation in VGI projects, this can be ensured by confirming active consent (by a declaration of informed consent). This is more critical in terms of analyzing millions of social media posts, even when users post the content publicly and agree to the terms of use for third parties to use it. Williams et al. (2017) therefore calls for user expectations to be taken into account, to perceive changed contexts, for example, in the form of "context collapse" (Crawford and Finn 2015), and to be respectful when combining potentially sensitive personal data. Through an "aggregation effect" (Solove 2013), privacy-relevant insights can be gained from different data sources without the contributing user being aware of it. An ethical research approach requires a balance between social and individual interests—the boundary of privacy is not rigid but depends on the topic, place, time, and user characteristics. Cartographers and geoscientists have a responsibility here to develop flexible methods that protect user

privacy while taking these contextual factors into account (Burghardt and Dunkel 2022).

### **12.3 Theoretical and Conceptual Foundations**

In the first phase of the project, we developed a conceptual model underpinning the extraction, analysis, and visualization of *events* and *reactions* to events in locationbased social media (Dunkel et al. 2019). For instance, a message or post published on a social media platform could be considered as a reaction, related to either simple events (e.g., a tweet that is observable or a single rumble of thunder) or preceding and ongoing chains of events (e.g., the Brexit). The key feature of our conceptual model and its implementation is the integration of spatial, temporal, thematic, and social dimensions combined with an explicit link between events and reactions.

The second phase of the project focused on the concept of individual and collective *behavior*. Among multiple existing definitions of the concept of behavior (Henriques and Michalski 2020), a simple definition suitable for our purposes can be "an organism's activities in response to external or internal stimuli" (American Psychological Association n.d.). From our perspective, it is important that a behavior unfolds over time, i.e., it covers a certain time interval, unlike a singular reaction, which can be viewed as an instant event. Like a reaction, a behavior belongs to a certain *actor*, who may be an individual, a group, an organization, or even the population of a country. However, a behavior may include multiple reactions to the same or different events. A behavior may also include *movements*  of the actor in space and changes of various characteristics, which can be expressed by thematic attributes. Hence, like the concept of reaction, the concept of behavior involves social (who?), temporal (when?), spatial (where?), and thematic (how?) facets. However, the temporal facet extends to a time interval, and the spatial and thematic aspects can no longer be expressed by a singular location in space and a combination of values of thematic attributes but need to be represented by a timereferenced sequence of locations (i.e., a trajectory in space) and time series of values of the thematic attributes.

Besides, as already mentioned, a behavior may include multiple reactions to one or more events, and the reactions themselves are conceptualized as events. Even more generally, a behavior may also include not only reactions but also other kinds of events in which the actor participates, for example, actions or decisions.

Hence, we conceptualize a behavior as a tuple *B* = *(p, T* = [*t*1*, t*2]*, S(t), A(t), <sup>E</sup><sup>T</sup>* = {*e*1*,...,eN* }*)*, where *<sup>p</sup>* is an actor, *<sup>T</sup>* is a time interval, *S(t)* is a function representing changes of the spatial position of *p* over time, *A(t)* is a function representing changes of the thematic attributes of *p* over time, and *E* is a set of events involving *p*, including but not limited to *p*'s reactions to other events. In *S(t)* and *A(t)*, the variable *t* takes values from *T* . The times of all events in *E<sup>T</sup>* are contained in *T* or at least overlap with *T* .

Since a behavior is a complex dynamic (i.e., changing over time) entity, study and understanding of behaviors require abstraction by which elementary locations, characteristics, and events that took place at different time moments are united and treated together as units, which are called *patterns*. Collins et al. (2018) proposed the following definition of a pattern: "a representation of a collection of items of any kind as an integrated whole with specific properties that are not mere compositions of properties of the constituent items." A more formal definition of a data pattern has been later proposed by Andrienko et al. (2021b). In brief, a pattern is a combination of relationships between data items. For time-related data, the pattern-forming relationships include temporal ordering, temporal distances, and relationships between temporally arranged spatial locations (direction and distance), attribute values (equality or difference, sign and amount of difference), and sets of events (equality, inclusion, overlap, disjointedness). Such relationships can also exist between patterns, which enable joining two or more patterns into more complex patterns. For example, a pattern of increase of values of a numeric variable followed by a pattern of decrease can be joined in a single pattern of a peak in the numeric value variation.

A behavior can thus be represented as a combination of patterns reflecting changes of the spatial location (i.e., movement patterns, such as quick movement straight to the north), changes of thematic attributes (e.g., increase in frequency and duration of sport activities), and/or changes in the set of relevant events (e.g., end of exams and beginning of a holiday).

A behavior unfolding during a time interval of substantial length may need to be described in terms of complex patterns composed of temporally arranged simpler patterns, i.e., sequences of simpler patterns. Such complex structures are hard to analyze and compare. Therefore, it is practically useful to divide behaviors into parts, which may be called *episodes*. An episode is an excerpt from a behavior B having the same structure as *B*, i.e., *EP* = *(p, T* ' = [*t*' 1*, t*' 2]*, S(t), A(t), E<sup>T</sup>* ' = {*e*1*,...,eN* }*)*, but *T* ' ⊆ *T* , the variable *t* in *S(t)* and *A(t)* takes values from *T* ' , and *ET* ' ⊆ *ET* consists of those events that existed during the interval *T* ' . The idea behind dividing a behavior into episodes or extracting episodes from a behavior is that an episode is relatively short and can be described in terms of simpler patterns. Then, analysis of a single behavior consists of comparison of the constituent episodes, i.e., revealing similarities and differences between the patterns in the episodes, which can be followed by detecting re-occurrences of similar patterns and investigating temporal relationships between the re-occurrences. Comparative analysis of two or more different behaviors involves comparisons of their episodes, which may include, in particular, comparison of co-occurring episodes from different behaviors as well as detection of similar asynchronous episodes in different behaviors.

Any behavior and, consequently, any episode of a behavior occurs in a certain *context*, i.e., a combination of circumstances that change over time. A context can be conceptually modelled as a tuple *<sup>C</sup>* <sup>=</sup> *(T , S, AC(t, s), AC(t), ECT , BCT )*, where *T* is a time interval, *S* is a part of space, *AC(t, s)* represents various dynamic attributes whose values are associated with different spatial locations (e.g., weather,

fuel prices), *AC(t)* are dynamic attributes considered as global (i.e., their values refer to *S* as a whole), *ECT* is a set of events that occurred during the interval *T* (e.g., political events or epidemic outbreaks), and *BCT* is a set of behaviors of various actors other than the actor(s) whose behaviors are analyzed. Contextaware analysis of behaviors and their constituent episodes includes determining relationships between patterns in the behaviors/episodes in focus and patterns occurring in the context attributes, context events, and context behaviors. The analysis can be directed from the context to the behaviors (e.g., what behavior patterns occurred when the air temperature was high or after pandemic lockdowns) or from the behaviors to the context (e.g., what events or attribute development patterns preceded the occurrence of a given pattern in one or more behaviors).

Let us illustrate the concepts using an example of a football game. Each player of a team is an actor having his/her behavior happening in the course of the game, i.e., *T* is the time of the game. The player's behavior includes movements of the player *S(t)*, his/her physical condition *A(t)*, and game events *E* involving the player, such as passes, attacks, shots on goal, goals, etc. Besides the individual behaviors of the players, there is a collective behavior of a whole team, including team movements, changes of team width, depth and relative arrangements of the players, and game events involving one or more players. The context of the team's behavior includes the behavior of the opponent team, the game events that have already happened, the weather and lighting conditions, the team's ranks, the goals set by the coach, etc. The context of the individual players' behaviors includes, additionally to what was listed above, the individual behaviors and skills of their teammates and of the opponents.

Since behaviors and their contexts are inherently complex (dynamic and multifaceted), analysis of behaviors and their relationships to contexts is a very challenging problem that cannot be tackled without simplification. Possible operations for achieving include *decomposition* (e.g., behaviors are divided into episodes), *abstraction* (e.g., unification of multiple items into patterns), *selection* (focusing on particular facets and their components, particular combinations of contextual circumstances, or particular types of patterns), and *aggregation* (representation of sets of items by summaries).

The concepts and ideas presented above are illustrated in Figs. 12.1, 12.2, 12.3, and 12.4 by example of football data. These data are more suitable for illustration purposes than data from location-based social media due to their high quality and absence of legal and ethical issues.

**Fig. 12.1** An illustration of behavior simplification by means of abstraction and aggregation. The image on the left shows the trajectories of two players during a football game. To increase the level of abstraction, the points of the trajectories were grouped and abstracted to areas on the pitch. the segments of the trajectories were transformed into transitions between the areas which, in turn, were aggregated into flows between the areas. The resulting spatial network is a simplified representation of the spatial facet of the behavior

**Fig. 12.2** An illustration of the decomposition of a player's behavior during a football game into several behaviors occurring in different contexts. The behaviors are represented in the form of spatial networks resulting from abstraction and aggregation of original data, as shown in Fig. 12.1

### **12.4 Generic Methods That Support Studies of Reactions and Behaviors**

### *12.4.1 Representativity and Bias in Location-Based Social Media*

In the second part of this chapter, we discuss a range of analyzes based on locationbased social media. Before we describe these analyzes in more detail, it is important to consider issues related to representativity and bias with respect to such data sources. It is important to firstly remember that all analyzes are subject to these

**Fig. 12.3** Abstracted and aggregated representation of collective behavioral patterns of two teams in a selected subset of episodes. The left image corresponds to a team in possession of the ball and the right image to a defending team

**Fig. 12.4** Abstracted movement behaviors of players of two teams in two sets of episodes: first 12 seconds after gaining the ball by the yellow team (left) and after gaining the ball by the red team (right). The movements are abstracted by taking the average players' positions from all episodes at the time moments *t* + 0*, t* + 1*,...,t* + 12 seconds, where *t* is the time moment of the ball possession change

issues—for example, if we generalize patterns found in the football data described above, it is important to consider which teams are more likely to be monitored and generate such data and to what extent these represent footballers more generally.

However, in location-based social media, these issues are more obvious and immediate. For example, these data are exclusively produced by people with location-enabled devices. These are not equally available within and across society, with, for example, the very old and very young being less likely to be captured by such data, and considerable variations exist in the willingness of individuals to share location-based information in different cultures and countries (Li et al. 2013; Krasnova and Veltri 2011). The first question we must ask therefore with respect to representativeness is who can, and is willing to, contribute data to platforms such as Twitter, Flickr, Instagram, and Google's location-based services. Some uses of space—for example, for play by small children—are likely to be underrepresented, especially if children are encouraged to spend time outside without their parents.

The second question that we can ask with respect to representativity relates to the ways in which these data are linked to space. Typically, we use other geographic datasets, for example, produced by national mapping agencies and/or volunteers (e.g., OpenStreetMap and GeoNames), and the content of these datasets will profoundly influence any analysis. For example, OpenStreetMap is known to have biases related to gender in terms of the categories of objects mapped (Gardner et al. 2020), while place-name density in GeoNames reflects geopolitical events (Acheson et al. 2017). A third question that we can ask with respect to these issues relates to culture and language. The underlying models used to capture, for example, emotions are often based on Western (and more specifically Anglo-Saxon) notions and assume universal emotions shared across cultures. Furthermore, many methods to analyze text focus on English as a starting point, despite clear evidence that this has limitations with respect to the ways in which we understand the world (Blasi et al. 2022).

### *12.4.2 Methods for Comparative Analyses*

As stated earlier (Sect. 12.3), behavior is a complex, dynamic entity that can be studied and understood only with the help of simplifying abstractions and decomposition. Decomposition includes interchangeable selection of aspects and facets to put in focus and division of behaviors into episodes. Depending on the analysis goals, there can be two strategies for decomposing behaviors into episodes: partitioning of the entire behavior by dividing its time span into intervals, e.g., of a chosen fixed length and extraction of episodes with particular properties. The latter strategy is achieved by means of temporal queries, which select multiple disjoint time intervals such that the query conditions fulfil during these intervals. The pieces of behaviors contained within the selected time intervals are extracted as episodes for analysis. A set of primitives for making temporal queries has been proposed by Andrienko et al. (2021a). The query primitives enable selection of sets of time intervals containing situations with specified characteristics and, moreover, further selection of sets of intervals having certain temporal relationships to the previously selected intervals. This can be used, in particular, for considering selected episodes stepwise or for studying what happened before or after them.

Comparative visual analysis of selected aspects of behaviors can be enabled by juxtaposed representation of two or more behaviors along a time axis. This approach was taken by Chen et al. (2021) for comparison of streams of text messages published on social media by different politicians. The authors created an imaginative visual design where the flow of time is represented as a river and bridges across the river are built from significant keywords extracted from the texts. This technique has been called "co-bridges." The visualization supports both qualitative (common and distinct keywords) and quantitative (stream volume, keyword frequencies) comparisons. Moreover, it is possible to compare two or

**Fig. 12.5** Comparison of population mobility behaviors in three countries (Germany in blue, Italy in red, and the UK in yellow) using multiple time graphs showing dynamics of different attributes

more co-bridges by juxtaposing them. These may be, for example, co-bridges corresponding to different themes of the politicians' agendas.

While alignment along a common time axis is essential for comparing behaviors as dynamic entities, the representation of behaviors depends on the type of data under analysis. One of our recent work directions was comparative analysis of behavior characteristics expressed by multiple time-variant numeric attributes, i.e., by multivariate time series. An approach is being developed using, among others, the publicly available data of COVID-19 Google Mobility Trends.1

The company Google summarizes anonymized data provided by apps such as Google Maps into statistics showing how collective mobility behaviors of people in different countries were changing throughout the COVID-19 pandemic. The data consist of daily counts of visitors to specific categories of places (e.g., grocery stores, parks, train stations, etc.) relative to baseline days before the pandemic outbreak. Hence, the population mobility behaviors in different countries are expressed by time series of six attributes.

In comparing the behaviors in different countries, it is insufficient to compare the dynamics of each attribute separately from others, although this can be done relatively easily by means of usual time graphs or line charts, as shown in Fig. 12.5. This visualization does not support holistic perception of patterns of joint development of the attributes.

Consideration of temporal evolution of complex data, in particular, combinations of values of multiple attributes, is supported by the visualization technique called time curve (Bach et al. 2016); a similar method was simultaneously proposed by van den Elzen et al. (2016). The approach relies on embedding of data corresponding to different time units in a low-dimensional (typically 2D) space based on similarities between the data in terms of an appropriate similarity metric.

<sup>1</sup> https://ourworldindata.org/covid-google-mobility-trends.

The class of computational methods for data embedding is known as dimensionality reduction methods. Well-known examples are multidimensional scaling (MDS), Sammon's mapping, t-SNE, etc. The result of embedding (also called projection) is visualized as a scatterplot where time units are represented by points. Consecutive time units are represented by lines. Spatial arrangements of the points and lines are treated as different development patterns, such as gradual or rapid changes, stable states, oscillation, stagnation, etc. (Bach et al. 2016; van den Elzen et al. 2016).

The time curve technique is by itself poorly suited for the task of comparative analysis of two or more behaviors which, as we stated previously, calls for an opportunity to see the behaviors aligned along a common time axis. We extend the time curve technique in the following way. We paint the background of a projection plot (i.e., a plot representing a result of behavior embedding) using a continuous 2D color scale. Thereby, each point on the plot receives a specific color. The colors of the points can be transmitted to other displays, in particular, to a timeline display suitable for aligned representation of several behaviors. This generic approach is demonstrated in Fig. 12.6 using the COVID-19 mobility data for Germany, Italy, and the UK shown in Fig. 12.5.

**Fig. 12.6** Upper part: the population mobility behaviors in Germany, Italy, and the UK are represented by means of data embedding (projection) according to the time curve technique. To enable aligned representation of the behaviors along a common time axis, continuous coloring is applied to the backgrounds of the projection plots. Lower part: a joint representation of the three behaviors in a timeline display, where the horizontal dimension represents the time flow. The behaviors are represented by horizontal bars with segments painted in the colors of the corresponding points in the projection plots. In this example, the Saturdays and Sundays are filtered out (shown in the timeline view as gray segments), which enables disregarding the weekly fluctuations and focusing on long-term patterns of change

The character of the color variation in the timeline view is representative of the character of the joint development of the multiple attributes. Abrupt changes of the color tone along the timeline signify sudden or rapid changes in a behavior, bar pieces painted in very similar color shades correspond to stable states in the behavior development, and smooth changes of shades correspond to gradual transitions between states.

### *12.4.3 Emojis to Study Sentiment, Emotion, and Context of Events*

Since geosocial media are used to state opinions, express emotions, or document experiences, they contain a lot of subjective information. The recognition of such subjective phenomena is usually done via natural language processing, which is by now quite sophisticated but can hardly recognize irony or sarcasm, for example, and is often applied limited to one or a few languages. Promising solutions have been achieved in this context with emojis, which have become extremely popular in geosocial media and are available in steadily growing numbers.

Hauthal et al. (2019) used a Twitter dataset to investigate reactions to the political event Brexit in terms of opinions and emotions, using emojis in two different approaches. In the first approach, emojis and hashtags were combined. Hashtags, established in political campaigns before the referendum, indicate which sub-topic of the overall Brexit debate is addressed in a tweet, i.e., leave or remain. A sentiment toward these topics in terms of positive or negative was detected by emojis appearing in the same tweet. For this, emojis showing a positive or negative facial expression were considered, based on the official categorization of emojis by Unicode. The combination of a hashtag and an emoji results in the rejection or support of the UK leaving the European Union. A spatial comparison of these analysis results with the actual referendum results on NUTS1 level (the highest level in the hierarchical classification used to clearly identify and classify the spatial reference units of official statistics in the Member States of the European Union) showed a higher consistency than a pure hashtag-based consideration without including emojis.

In the second approach, emojis showing faces or persons with a countenance or gesture were not only considered on a positive-negative scale but were assigned to emotional categories based on a classification according to Shaver et al. (1987), which includes love, joy, surprise, fear, sadness, and anger (see Fig. 12.7). Each category of this classification is allocated emotional terms that are most likely to be mentioned when people are asked to name those emotions. The official Unicode names of all previously used emojis were matched with these terms and assigned to the corresponding emotional categories. Emotions were then examined comparatively before and after the announcement of the Brexit referendum results, with only sadness showing a significant increase overall and fear decreasing slightly. Spatially, the increase in sadness is evident in two out of three NUTS1 regions,

**Fig. 12.7** Classification of Unicode Emoji (List v11.0) by authors into positive, negative, and according to emotional categories based on Shaver et al. (1987)

where votes did not match the overall referendum result, and an increase in joy in five out of nine NUTS1 regions where the hopes were fulfilled.

Another use of emojis to investigate subjectivity, in this case perception, was implemented in a study by Hauthal et al. (2021). In a global Instagram dataset about sunrise and sunset, the measure typicality was applied. Typicality is a relative measure specifically tailored for geosocial media that determines how typical a particular object of interest (e.g., emoji or hashtag) is within a sub-dataset compared to the total dataset. Sub-datasets may be formed spatially, temporally, thematically, etc. Typicality is calculated by the normalized difference of two relative frequencies and returns a positive (= typical) or negative (= atypical) value. Typicality was used to identify emojis in the previously mentioned global Instagram dataset that provide information about the context of the user while observing the event. On the one hand, these emojis deliver information about activities performed and, on the other hand, also about perceived landscape features in the immediate surroundings. It was found that emojis provide more detailed information in this regard than the hashtags contained in the same dataset. Moreover, location-specific emojis were identified, which are chosen depending on the location, and match the features of the physical environment, as shown by matching them with geographic attributes. This proves that emojis are not randomly chosen but provide insights not only into the user's situational context but also into their perception and thus appreciation of certain aspects of the environment.

These studies demonstrate the potential of emojis to provide insights into the subjectivity of geosocial media users in a relatively straightforward way. A further increase in the use of emojis as well as an increasing variety of them can still be expected, which will open up further possibilities and applications.

The conceptual framework presented in Sect. 12.3 strongly influenced both of the presented studies. In the study on Brexit, the framework led the way in allowing us to look at the reactions contained in the underlying dataset from numerous different perspectives and thus obtain in-depth results. Furthermore, the framework significantly influenced the development of the typicality measure in the second study described, as the sub-datasets required for the calculation can be formed following the four facets, which is the particularity and novelty of this measure.

### **12.5 Application-Oriented Workflows**

### *12.5.1 Activity Analysis for Landscape and Urban Planning*

VGI is also increasingly recognized as an important resource in the fields of landscape and urban planning, for example, to support the analysis of visitation patterns, assessing collective values, or improving human well-being through fair and equitable design of public green spaces (Ghermandi and Sinclair 2019). To this end, landscape and urban planners must first assess "what" is collectively valued, "where," by "whom," and "when" to understand the how and why of human behavior, as introduced in Sect. 12.3. However, the reproducibility of human behavior research is often impaired because samples, populations, and the phenomena being observed change between studies (Gruebner et al. 2017). This is particularly true for VGI and geosocial media, which are noisy, biased, difficult to fully sample, and often shared through incompletely documented and opaque application programming interfaces (APIs). In addition to these core challenges, protecting user privacy is becoming increasingly important when working with user-generated content (danah boyd and Crawford 2012). For this reason, Dunkel et al. (2023a) sought to develop a robust and transferable "workflow template," for assessing human activities and subjective landscape values through geosocial media worldwide—without compromising user privacy.

For demonstration purposes, an event type with a strong temporal and spatial consistency was chosen that allowed for a significant reduction in the number of "incidental variables" in the study while at the same time maintaining sample volume. Sunset and sunrise were among the few events that met these criteria. In addition, improving results reproducibility ideally requires an experiment with two maximally separated datasets and "finding relationships in the same direction and of similar strength" (Laraway et al. 2019, p. 38) in both. This was difficult to implement with more "newsworthy" topics. The consistent global and long-term footprint of the sunset and sunrise offered an opportunity to maximize the sample size while also providing a basis for reproducing the results using two datasets, albeit not universally representative but independent, collected from Instagram and Flickr. Despite the narrow topic of this study, the shared expected frequencies (e.g., Flickr 300 million post counts dataset for a 100×100 km grid, Dunkel et al. 2023b)

can also be very useful for calculating chi in studies of other phenomena at global scales, for differently sampled data, e.g., based on a different set of search terms.

Importantly, sunset and sunrise events are entirely ephemeral yet have a profound, measurable impact on human perception and interaction with the environment. Unlike many other events, the ability to perceive sunset and sunrise is tightly bound to time, but almost completely decoupled from space. Photographs of these events function as evidence of presence in place and time. While the immediate reaction to taking a photograph of a sunset or sunrise is trivial, people will take into account all previous experiences, learned behaviors, expectations, etc. when reacting. Individual photographs therefore reflect different memorable experiences that function as artifacts of different preference contexts. The narrow thematic filter of sunset and sunrise allowed for a focused description and evaluation of these preference contexts. Based on the four-facet context model (Dunkel et al. 2019), reactions to sunset and sunrise were examined in terms of where, who, what, and when.

The study first asked whether it was possible to compare the relative importance of sunset and sunrise reactions, independent of overall visitation frequency, for different locations worldwide. Visualizing relative user frequency was critical because geosocial media tends to be skewed toward highly populated locations and cities. The goal was to provide a balanced evaluation assessment of sunset and sunrise reactions across different rural and urban regions. Several visualization methods were tested, such as based on a relative ranking method for individual locations (Fig. 12.8) and different metrics, such as user counts, post counts, or user days (Wood et al. 2013). The final workflow uses the signed chi equation, proposed by Clarke et al. (2007), to visualize over- and under-frequentation with respect to these two events and aggregated using HyperLogLog for a global grid with a resolution of 100 × 100 km.

Globally, sunrise events are often associated with east coasts (e.g., Italy, Sardinia in Fig. 12.8) or mountainous regions (e.g., the Alps), while sunset events are photographed on west coasts. The study also observed a strong ranking order between Flickr and Instagram reactions, despite the fact that both platforms have different user groups. In other words, we actually expected a much stronger effect of the platform on the results, and our work shows, at least for Instagram and Flickr and the selected events, that results can be reliable and reproducible across platforms. Still, for some locations, the incentives of the social media platforms themselves can have a significant impact on what gets shared and by whom. On Instagram, for instance, the Burning Man festival in Nevada ranks second worldwide for sunrise reactions. Out of a total of about 70,000 total visitors (Wikipedia, Burning Man,<sup>2</sup> 2022), 1295 (±30) users shared sunrise images on Instagram during the short period of the 2017 Burning Man festival, compared to only 54 (±2) Flickr users for the same location over a 10-year period—a pattern that can be explained by the different user groups of these platforms. Finally, the use of abstracted, estimated,

<sup>2</sup> https://en.wikipedia.org/wiki/Burning\_Man#2013\_to\_2019.

**Fig. 12.8** Relative ranking of places based on reactions to the sunset and sunrise in Central Europe. This figure shows a preview visualization of the collected raw data that was later further aggregated to a 100 × 100 km grid in Dunkel et al. (2023a). Some repeated patterns are already observable here, such as the larger cities (1) commonly featuring a balanced representation of sunset and sunrise reactions, whereas sunrise reactions dominate along eastern-facing coasts and in mountainous regions (2) and sunset reactions being predominantly shared from along westernfacing coasts (3)

non-personal data based on HyperLogLog, as suggested by a scoping study (Dunkel et al. 2020), was a practically feasible solution that supports a shift toward privacypreserving and ethically aware data analysis in human preference research. The analysis process and anonymized data are made available in a repository, allowing transparent verification, replication, and transfer to other events or datasets (data repository; Dunkel et al. 2023b).

Even though individual people perceive landscapes and their attributed values differently, there are landscapes which the majority of people perceive as scenic and beautiful (Bell 2012). These prolific landscapes (e.g., Preikestolen in Norway or Wildkirchli in Switzerland) are often depicted by characteristic motif images, which are clusters of images all taken from a similar viewpoint and angle. Which landscapes become popular is driven by propagation of landscape or nature appreciation through travel guides or art from the romantic area, popularizing a selective subset of landscapes, thus not a new phenomenon. Today, tourism agencies and other influencers (e.g., celebrities, companies, movies, songs) can shape landscapes through social media promotion by planting seed images that people will try to recreate and, by doing so, form new motifs. By reaching millions of people and potentially influencing their future visiting plans, this social media-induced tourism can have drastic physical consequences on the local environment, infrastructure, and

people (Simmonds et al. 2018). Being able to monitor the spatiotemporal emergence of motifs as a proxy for induced tourism is crucial to support local decision-makers to tackle the potential increase in visitation rates to a respective landscape. In the paper by Hartmann et al. (2022), we created an operationalizable conceptual model of motifs that is able to identify, extract, and monitor prone landscapes based on geotagged social media data. More specifically, the proposed pipeline leverages creative-commons Flickr images from the YFCC100M dataset (Thomee et al. 2016) within the European Nature 2000-protected areas, which represent a network of breeding and resting sites within important landscapes for rare and threatened species. The core methodological process to identify motifs within a corpus of 2.1 million images involved two steps. Firstly, images were downsampled through spatial clustering by using Hierarchical Density-Based Clustering (HDBSCAN) (McInnes et al. 2017) since images belonging to the same motif were by definition in close proximity to one another. Secondly, with the help of the computer vision algorithm Scale-Invariant Feature Transform (SIFT) (Lowe 2004), we calculated image similarities between each image pair within a spatial HDBSAN cluster and clustered them again based on that outcome. The results were our motifs, of which we found a total of 119 in our study sites across Europe. Analysis of the motifs revealed that 65% depict cultural elements such as castles and bridges, whereas the remaining 35% contain natural features that were biased toward coastal elements like cliffs. Ultimately, the early detection of emerging motifs and their monitoring allows the identification of locations subject to increased pressure, which enables managers to explore why sites are being visited and to take timely and appropriate actions (e.g., allocation of infrastructure such as toilets and rubbish disposals or visitor routing).

Not only descriptive textual information and emojis can be used for the analysis of geosocial media data, but it is also possible to use the image information directly. As an application for urban bicycle infrastructure planning, an object recognition algorithm based on convolutional neural networks was used to identify bicycles and potential parking spaces. The research and development work was carried out as a cooperation of a Young Research Group within the framework of the priority program VGIscience (Knura et al. 2021; Zahtila and Knura 2022). The research on object recognition was carried out in the COVMAP project (see Chap. 5); the processing of social media data and the development of methods for visual analysis were realized by the projects EVA-VGI (this chapter) and TOVIP (see Chap. 10).

### *12.5.2 Exploring People's Mobility Behavior*

In search for a workable approach to analysis of multiple behavior episodes characterized by multivariate time series, Shirato et al. (2021) made an attempt to apply topic modelling. For this purpose, the patterns of variation of different attributes, or features, are represented by symbolic codes, which can be treated as words. The expected role of topic modelling is to reveal co-occurrences of such

**Fig. 12.9** Episodes from a football match are represented by dots in a 2D projection space obtained by applying t-SNE to the vectors of the topic weights. The dot colors represent the dominant topics

patterns. To transform a time series of numeric values into a symbolic code, the values are compared to the median of the time series and encoded by symbols '−', '=', or '+' depending on their relative position with respect to the median. Hence, each episode can be represented by a combination of the variation codes of the multiple variables. A method for probabilistic topic modelling, such as Latent Dirichlet Allocation (LDA), is applied to these combinations, which are treated as "texts" where the variation codes are "words." The resulting "topics" show which variation patterns of different variables tend to occur together in the same episodes (see Fig. 12.9). The topic modelling method also assigns vectors of topic probabilities to the episodes. Using these vectors, the episodes can be clustered and/or arranged in a projection space according to similarities of the variation patterns and further explored by means of various existing methods.

The approach was tested using football data as an example. The features that were involved in the analysis reflected widths of empty spaces between team players on different levels of separation from the goal that is under attack. The topics corresponded to combinations of dynamic patterns of the changes of the widths on the different layers. To support interpretation of the results of topic modelling, the representation of the topics in the form of a table was combined with a map of the football pitch, where the behavior patterns were summarized as the average positions and areas of movement of the team centers (see Fig. 12.10).

The experiment showed that application of topic modelling to episodes characterized by multivariate (and, possibly, multifaceted) data has potential for behavior analysis. The research in this direction is worth being continued.

**Fig. 12.10** Distributions of centroids after transitions for topic 4 and topic 6. Colors represent the objects (blue, defense; red, attack; green, ball). The cross (x) means the average position of centroids, and the ellipse means the standard deviation of centroids for each topic

### **12.6 Conclusion**

The research approaches presented in the chapter were designed for the visual analysis of social media posts and trajectories of football players to study people's reactions and behavior. As a starting point, a conceptual behavioral model is introduced to describe an actor in a certain period of time with regard to its spatial, thematic-attributive changes under the influence of events and external context factors. For the analysis, behavior is broken down into episodes, which refer to short periods of time and support the description with simple patterns. In order to model the external influences on behavior, a context model is also proposed, which, in addition to spatial, temporal, and attributive influences, also takes into account events that have taken place over a period of time and the behavior of other people.

Based on the conceptual model, generic methods are presented, which allow analyzing the behavior. The temporal query approach enables the investigation within in individual episodes. Time curves were used for a holistic analysis of several attributes with regard to their development over time. An extension of the time curve is proposed by using continuous coloring derived from a dimensional reduced attribute space. A method developed in the project for the visual analysis of text messages in social media data uses a "co-bridge" metaphor. Significant keywords are connected over time to enable both a quantitative comparison with regard to common and different keywords and a qualitative analysis with regard to stream volume and keyword frequencies. Furthermore, the use of emojis as contextual indicators was examined. It has been found that emojis are suitable for describing the spatial context of people in terms of perceived objects and activities taking place at these locations. Various normalization methods such as Chi-Value, "typicality," and metrics such as "user count per day" were used to account for heterogeneity and bias in social media data.

Two main areas of application were considered, on the one hand, the perception of environmental phenomena as input for questions of urban and landscape planning and, on the other hand, the analysis of the movement behavior of people on the example of football games. The study of a global phenomenon such as sunrise and sunset based on Instagram and Flickr allows conclusions to be drawn about consistency and reproducibility, as well as motives why the events were documented. Based on the four-facet context model, sunset and sunrise responses are examined for the where, who, what, and when. In addition, responses from different groups are compared with the aim of quantifying differences in behavior and information spread. As a second field of application, the movement behavior of people was analyzed using visitor statistics for selected locations during the COVID-19 pandemic and movement data from football games. Therefore, time-dependent abstracted and aggregated representations are created to compare collective behavior patterns during the pandemic in different countries in the first case and different football teams or teams in different stages of the game in the second case.

**Acknowledgments** This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization and Social Computing* (VGIscience, EVA-VGI, BU 2605/8-2) and the Swiss National Science Foundation (Project No 200021E-166788).

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 13 Digital Volunteers in Disaster Management**

**Ramian Fathi and Frank Fiedrich** 

**Abstract** During disaster situations, social media is used extensively by the affected population for communication and collaboration, but there is also increased public sharing of important disaster-related information about the current situation. With the goal of utilizing this data and Volunteered Geographic Information (VGI) for disaster management, digital volunteers organized themselves into so-called Volunteer and Technical Communities (V&TC). In addition, professionalized digital volunteers have institutionalized Virtual Operations Support Teams (VOST) in established Emergency Management Agencies (EMA). While technical issues have dominated research in this area in recent years, questions about the motivation, organization, and impact of the analytical work of these volunteers have remained unanswered. In this chapter, we present five studies that address questions about the motivation of digital volunteers, organization, and collaboration requirements, the analytical impact of VOST, data biases in Crisis Information Management (CIM), and privacy-related topics. Overall, it could be shown that digital volunteers make a significant contribution during disaster management, in which they effectively process their analytical results and VGI for the management of disaster situations. However, human limitations and privacy-related methods need to receive greater attention in the future, both in research and in practice.

**Keywords** Digital volunteers · Disaster management · Virtual operations support teams · Social media · Social media analytics · Emergency operation center · Data bias · Information management

R. Fathi (O) · F. Fiedrich

Chair for Public Safety and Emergency Management, School of Mechanical Engineering and Safety Engineering, University of Wuppertal, Wuppertal, Germany e-mail: fathi@uni-wuppertal.de; fiedrich@uni-wuppertal.de

### **13.1 Introduction**

The earthquake in Haiti 2010 can be seen as a starting point of digital volunteering in the context of disaster management and humanitarian assistance. The first digital volunteers were committed to helping affected populations by processing and providing Volunteered Geographic Information (VGI) for disaster response. Over time, they organized themselves and formed virtual groups. These emerging so-called Volunteer and Technical Communities (V&TC) opened the view for multiple new fields of research. Many research approaches paid attention to technical topics like data mining and creating better analytical tools for the volunteers. In contrast, less research has been conducted to determine which motivational factors drive people to volunteer digitally in disaster management and which organizational requirements exist to collaborate with established Emergency Management Agencies (EMA). Furthermore, questions about the impact of analytical results in this time-critical context arise. The underlying research focused on the volunteers themselves and organizational requirements for collaboration, more specifically their motivational, participative, and analytical factors. This chapter is based on research carried out in the project "Active Participation and Motivation of Professionalized Digital Volunteer Communities: Distributed Decision-Making and its Impact on Disaster Management Organizations."

After this brief introduction, the second section of this chapter presents a comprehensive study of the motivational factors of operationally active digital volunteers (Fathi and Fiedrich 2020). In a cross-organizational online survey, possible motives, individual organizational commitment, and potential incentive options were analyzed. In addition, two experienced digital team leaders of V&TC were interviewed in guided expert interviews about methods and measures for increasing motivational and organizational commitment. Based on the findings generated in this way, explanatory patterns for the motivation factors of digital volunteers can be derived on the one hand; on the other hand, beneficial and identification-generating measures can be identified.

Contrary to the V&TC, digital volunteers institutionalized Virtual Operations Support Teams (VOST), which are closely linked to established EMA. This development led to more in-depth research questions concerning the professionalized digital volunteers and their integration in decision-making processes using VGI in Emergency Operations Centers (EOC), which are discussed in Sects. 13.3 and 13.4. The organizational structure and technical requirements for succeeding and their decision-making processes in a time-critical environment were of interest. A research gap between the VGI created by VOST and decision-makers needs in the established EMA was acknowledged, as the digital volunteers started to collaborate with EMA. Therefore, the topic of voluntary digital participation for collaborative emergency management was explored in collaboration with the University of Stuttgart, whose research efforts mainly focused on visual analysis of VGI. In Sect. 13.2, we present a case study which was conducted with the project "VA4VGI-2" (Chap. 6), where structural, procedural, and technical requirements of integrating VOST in EOC structures were investigated, applying a mixed-method approach (Fathi et al. 2020).

Overall, little attention was paid to VOST, who act as groups of data analysts with direct integration to EOC. Specific tasks of VOST include filtering, verifying, and analyzing social media data from various platforms and creating information products for decision-makers in EOC. These information products can contribute to the situational awareness of EOC members and to the decision-making processes by integrating actionable information. In a case study following the 2021 flooding in Germany, the aspects of analyzing social media by digital volunteers in VOST and the impact of the information products on situational awareness and decisionmaking were examined and are presented in Section 4 (Fathi and Fiedrich 2022).

Analytical and decision-making processes in the time-critical environment of EOC in disaster management are challenging due to the numerous disaster-related conditions like time-pressure and uncertainty. To examine the interplay of the conditions in disaster management, VGI, and biases in Crisis Information Management (CIM), a workshop experiment was conducted with digital volunteers and decisionmakers (Paulus et al. 2022). A three-stage experiment on epidemic response was developed as the underlying scenario to analyze how biases can be mitigated by observing digital volunteers and decision-makers in the analytical and decisionmaking processes and is presented in Sect. 13.5. The findings of this case study suggest that debiasing efforts are strongly undervalued, and external analysts fail to debias data successfully in favor of rapid results. The biased data was then passed on to decision-makers in the form of information products, who make decisions based on biased data.

Section 13.6 addresses the challenge of privacy-aware data analytics in disaster management and discusses a collaborative work between this underlying project and "Privacy Aspects" (Chap. 14).

### **13.2 Motivational Factors of Digital Volunteers in Disaster Management**

With the emergence of social media and VGI, various new research areas and new opportunities to use this open-access data increased, also in disaster management. Anyhow, limited resources characterize disaster, and EMA do not have enough staff to analyze big amounts of data to integrate VGI in their situational awareness and decision-making processes. Digital volunteers analyze social media data from, e.g., Twitter and proceed disaster-related VGI. The volunteers can work dislocated from the actual operational site and thus can be deployed almost instantly. Over time, the volunteers organized themselves and formed V&TC. These communities fostered research interest, especially in the fields of organization and technology. An existing research gap in the context of volunteering on a digital basis in disaster management are the motivational factors of V&TC members. Questions regarding the motivation of the digital volunteer, the barriers to participation, and the commitment to the V&TC have not yet been extensively answered. In the paper "Digital Volunteers in Disaster Management—Motivational Factors and Barriers of Participation" (Fathi and Fiedrich 2020), we aimed to understand what motivates the digital volunteers and which incentive options there are to further motivate them. Additionally, we looked into measures and methods that can be implemented by digital team leaders to motivate their team members. Lastly, the differences and correlations between the needs of the digital volunteers and the motives of the digital team leaders were examined.

Therefore, we used a mixed-methods approach of quantitative and qualitative social science methods. It was of special interest to capture and query the digital volunteers in their social contexts as well as their individuality. An online survey was conducted among different V&TC, to explore motivational factors and incentive options. The survey was designed under the use of the Volunteer Functions Inventory (VFI) introduced by Clary and Snyder (1999), which is a widely used questionnaire on volunteer motivation. In order to understand measures and methods to foster motivation of digital volunteers by leaders of V&TC, guideline-based expert interviews were carried out. The guidelines were designed based on the online survey, but the content was transferred to the perspective of team leaders.

It was found that digital volunteers are mostly motivated by their values, but also by the experience, they are gaining paired with fun-based intrinsic motivation. In contrast, having the prospect of a "career" within V&TC was less motivating. The participants strongly agreed to statements of organizational commitment and identification with their organization. More than 70% stated that they fully agreed to be proud to be part of their V&TC. The main barriers of digital volunteering were named as time, trust in one's own abilities, and Internet access. Especially during crises, time allocation becomes a challenge. Collaborating with other V&TC or a pool of digital volunteers, who can be acquired ad hoc, seems to be an appropriate method to allocate work of digital volunteers. The queried digital volunteers see potential for motivational enhancement rather in non-crisis times, for example, more feedback and additional online and offline community activities without a disaster context. Accordingly, the volunteers see incentive options in digital or analog exercises or events and appreciative measures. It became clear that feedback is very important to the digital volunteers, who, due to their dislocation, can only guess what impact and use the resulting information products and VGI have. Negative impact on the motivation of digital volunteers were identified as a lack of feedback and a lack of identification with the work or the V&TC. Feedback cannot only be given by the tasking EMA but also by the public or the digital team leaders (Fathi and Fiedrich 2020).

### **13.3 Virtual Operations Support Teams in Disaster Management**

Virtual Operations Support Teams (VOST) are groups of professionalized digital volunteers, who are closely linked to EMA. During a VOST operation, the common tasks include monitoring and analyzing social media, verifying and geolocating information and developing crisis maps, recognizing and analyzing trends and sentiment in social media, and ad hoc tasks as assigned by the EOC. The actionable information identified and verified can then be provided to the EOC decisionmaker to expand situational awareness and support decision-making processes. Furthermore, VOST integrate a liaison officer in the EOC structures. This enables to ensure effective communication and distribution of tasks between VOST and EOC. VOST pursue the goal of effectively integrating information products and VGI into decision-making processes through close organizational integration in the time-critical context of disaster management. This in turn led to multiple research questions concerning structural, procedural, and technical requirements for an effective collaboration between a virtual team of analysts and decisionmaker in an EOC. These questions were to be explored in collaboration with the project "VA4VGI-2" (Chap. 6). As described in the paper "VOST: A case study in voluntary digital participation for collaborative emergency management" (Fathi and Fiedrich 2020), the main goal was to understand the decision-making processes, which emerged by integrating a VOST into the structures of an EOC. An exploratory case study was conducted as field research during the start (Grand Départ) of the Tour de France in Düsseldorf, Germany in 2017. Especially of interest were the requirements of structure and procedure for a successful collaboration, technical requirements and the evaluation of existing technical tools for social media analytics, the identification of the actual tasks that needed performing during the operation, and structural, organizational, and technical implications for future decision-making systems in EOC.

The VOST operation at the Grand Départ in Düsseldorf was in the scope of a pilot project with the German Federal Agency for Technical Relief ("Technisches Hilfswerk" – THW). The THW VOST consists of 20 digital volunteers which were appointed as THW members and thus act as team members in a governmental EMA rather than a loosely coupled group of digital volunteers in V&TC. To make use of the insights provided by the case study, multiple methods for data collection and analysis were conducted. These methods comprised participant observation during the two-day operation, focus group discussions and informal interviews with decision-makers and VOST members at different stages of the operation, analysis of the tasks performed by the VOST during the operation, analysis of the organizational setup, and technology use and decision-making processes of the VOST. For this particular operation, the following VOST working priorities were identified as a result of the focus group discussions: identification of critical crowd densities and flows; detection of unusual events; image analysis of social media; developing of a crisis map for spatial analysis; identification of false information, rumors, and fake news; and scenario-dependent tasks.

The structural and procedural requirements for a successful collaboration were identified as a division in small VOST working groups to simplify the distribution of tasks. This allows specialized subgroups to be formed to respond to dynamic operational situations, e.g., verification groups, and to ensure information exchange on the level of team and group leaders, individual group briefings to implement adjustments quicker, and, most importantly, the implementation of a liaison officer. The technical requirements are especially a reliable user experience and customtailored tools for the use during an operation to alleviate the high mental workload. Putting new tools to the test in real-world operations seems to be a beneficial way to ensure advanced algorithmic tools. The most time-consuming and mentally challenging work at the same time poses collection, filtering, and documentation of user-generated content from social media platforms. Advanced mining tools are crucial to verify social media data in a time-critical environment. Situation monitoring is a highly repetitive and demanding task, which can only be carried out by a digital volunteer for a certain amount of time. Nonetheless, the biggest challenge seems to be the velocity and the volume of user-generated social media data. Additional disaster-related data becomes available all the time during the analysis, which presents a challenge for real-time social media analytics. To address the questions of what disaster-related information is processed by a VOST during a disaster management and what impact it has on members of an EOC, another case study was conducted.

### **13.4 Social Media Analytics by Virtual Operations Support Teams in Disaster Management**

Climate change poses numerous challenges and risks, including a significant increase in extreme weather events such as flooding (IPCC 2021). With this, the need for crisis communication and social media analytics in times of disaster rises accordingly. VOST address this need, integrated into EOC structures in times of crises, and their goals are to increase the situational awareness of decision-makers and to provide actionable information to improve decision-making in a time-critical environment. To examine these efforts, a case study was carried out, using the data collected by 22 VOST analysts during the 2021 flood in Wuppertal, Germany (Fathi and Fiedrich 2022). The city was severely flooded in July 2021; parts of the city had to be evacuated, and warning sirens were set off (Zander 2021). The EOC operated in cooperation with the VOST of the German Federal Agency for Technical Relief (THW VOST), which was deployed virtually but was directly connected to the EOC through a liaison officer, who was physically present in the EOC. This operation thus raised the research question, how VOST can support situational awareness and generate actionable information for EOC decision-making processes by integrating social media analytics practices. The research question was explored by analyzing the data generated during the THW VOST operation and by a survey among EOC decision-makers on the impact of the information provided by VOST on their decisions and situational awareness.

Unlike other research, which focuses, e.g., mainly on social media big data analysis, decision-making processes, or developing machine learning approaches, this study aims for examining closely the real-world VOST integration. Case studies, such as the underlying, can provide valuable insights of virtual teams in the disaster management context. In order to analyze the data generated by the digital volunteers during the operation, the following tasks were performed: data cleaning, summarizing categories, visualization of the data, and subsequently a comparative quantitative analysis and contextualization of the data. Furthermore, three different parameters, the format, the source, and the mean value of the prioritization, were used for an in-depth analysis. The survey of decision-makers was conducted among the EOC members, which collaborated with the THW VOST and used their information products during the flood response in Wuppertal. The prerequisite was that the interviewee had worked with VOST information products during the operation. Nine decision-makers from the EOC met these criteria, and all of them participated in the survey.

To classify information categories in the VOST dataset, which was identified by VOST volunteers during the flood response, 536 social media posts from eight different social media platforms were analyzed. Additionally, 42 posts from websites (e.g., traditional media) were collected. The dataset was classified in 23 different categories. The largest category was found to have emerged after the flood, namely, spontaneous community engagement (see Fig. 13.1). The earlier phases of the flood were dominated by categories like level of the river, warning, or flooded roads.

Posts from categories that could have had a direct impact were forwarded by VOST during the operation as actionable information to the EOC decision-maker. These social media posts were prioritized as "highly-relevant." The information in the format of videos was found to have a higher priority than information in the form of texts or images. In a category analysis over time, it was found that realtime disaster events, such as the activation of the warning siren, are simultaneously apparent in social media data.

VOST impact on situational awareness was explored by querying EOC directors and executives who collaborated with the THW VOST during the flood. All statements in the survey were rated with a strong overall agreement. The statement with the highest degree of agreement was: "Information from VOST contributes to expanded situational awareness." The necessity of a liaison officer in the EOC was also strongly agreed to. The lowest level of agreement was given the statement that VOST information can forecast developments of future situations. The second category of statements concerned the VOST impact on decision-making, e.g., to ensure people-centered risk and crisis communications and contributed to confidence in decision-making.

**Fig. 13.1** Percentage distribution of social media information during the flood response 2021. (Source: Fathi and Fiedrich 2022)

The results underline that situational awareness and decision-making is supported by VOST information and VGI on three different levels: perception, comprehension, and projection. Based on this distinction of situational awareness from Endsley (1988), it can be concluded that statements that can be assigned to the first two levels (perception and comprehension) receive high agreement. However, VOST information during the dynamic flood situation supports less to project the future flood-situation in a more precise way. Statements in this category received less agreement. This condition can be explained by the observation that VOST information and VGI provide new and complementary information that must be processed, comprehended, and projected by EOC decision-makers onto future scenarios in addition to that provided by other sources (feedback from responders, emergency calls, etc.). Due to the accompanying conditions in disaster management (e.g., time pressure, uncertainty), the cognitive load on decision-makers is very high so that the use of VGI can create biases in data analysis and decision-making.

### **13.5 Data Bias in Crisis Information Management**

Due to various conditions, humanitarian crises and disaster management especially challenge digital volunteers' data analysis. Crisis Information Management (CIM) is characterized, e.g., by time criticality and uncertainty. Additionally, resources are limited, and the cognitive load is high. This makes analysts and decision-makers prone to inducing biases in the data and cognitive processes. When undetected, biases remain untreated and lead to decisions based on biased information, which in turn can lead to an inefficient response. To find out more about the interplay of data and cognitive biases, an exploratory three-stage experiment on epidemic response was conducted (Paulus et al. 2022). A scenario-based workshop was held in The Hague in 2020 with experienced crisis decision-makers and digital volunteers from various V&TC and VOST, which entailed stage 1 and 2. For stage 3, the same participants were additionally addressed in an online survey. The experimental scenario was an epidemic outbreak in three countries. The first stage included an observation of digital volunteers, who were provided with different datasets with biased data, e.g., in the infection spread. The observation was set out to be in a fictional but realistic setting to avoid interference in a real epidemic response. In stage 2 of the experiment, the decision-makers were provided with the VGI and information products (e.g., maps) created in stage 1 and had to make decisions on treatment center placements. Stage 3 of the experiment was an online survey onsite of the workshop. It aimed to explore whether confirmation bias leads to path dependencies of former decisions based on biased information. In the survey, they were able to select the information they viewed as most important for future decision-making from a list of datasets.

The results of the experiment show that in the first stage, the participants failed to debias data, even though biases were detected. Debiasing efforts were undervalued in favor of immediate results. The information products created based on biased data in stage 1 were then forwarded to the decision-makers in stage 2, who made their decisions based on biased information. Even though the decision-makers in all three groups put enormous pressure on the digital volunteers to find out on which datasets the information products were generated, they did not succeed in ensuring that the decisions were not based on biased information. Confirmation bias was detected in stage 3; the reliance on conclusion reached with biased data was reinforced by it. Thus, biased assumptions remained undetected. The main causes for biased data remaining untreated are the described conditions of data analysis and decisionmaking in the context of disaster management. The realistic scenario design made it possible to recreate the general conditions, e.g., by simulating time pressure and uncertainty. Mindfulness debiasing efforts have been found to be effective to counteract these conditions and therefore pose a promising strategy to mitigate data and cognitive biases in future disaster management (Paulus et al. 2022).

### **13.6 Privacy-Aware Social Media Data Processing in Disaster Management**

Social media analytics by Emergency Management Agencies for relief purposes has become a common practice in the last few years. In the time-critical environment of disaster and life-threatening situations, privacy is often perceived as a secondary problem. Nonetheless, avoiding unnecessary data retention is important to protect the social media users' privacy by, e.g., preventing subsequent abuse. In times of crisis, social media users are especially vulnerable, e.g., by sharing names, addresses, or other personal information, for example, searching for missing relatives or friends.

In a cross-project effort, expertise from the present project and "Privacy Aspects" (Chap. 14) were combined for a joint case study (Löchner et al. 2020). The study examined the extent to which VOST can integrate privacy-aware methods and algorithms, in particular HyperLogLog (HLL), in their operational work in disaster management. To investigate the practicality of privacy-aware methods and HLL, a case study with digital volunteers from two VOST has been conducted. For this, a focus group discussion addressing opportunities, challenges, and implementation barriers was held. The focus group discussion with participants from THW VOST and VOST Baden-Württemberg aimed to document the expertise of the participants, who are experienced in the field of social media analytics.

The most important finding was that the focus group discussion revealed no disadvantage against using privacy-friendly methods and HLL for VOST. The algorithm does not distract the data analysis process, since the VOST work starts after the data processing via HLL. Overall, HLL was found to be an appropriate technology to ensure privacy-aware social media data processing. Opposing to the initial assumptions that HLL use might be conflicting with gathering data for creating information products, several benefits of the algorithm were come up upon. These benefits include improved working with big datasets, which might lead to a more widespread use of HLL and thus improved privacy-awareness among digital volunteers.

### **13.7 Conclusion and Outlook**

In this chapter, various facets of digital volunteering using and processing VGI in the disaster management sector were highlighted. In particular, social, organizational and analytical factors were examined and discussed in five different papers. It was shown that the motivation of V&TC is particularly value-based and that the commitment to the virtual team is pronounced. However, measures can be derived, especially for future developments, in order to sustainably establish digital volunteering. Professionalized teams in the structures of established EMA have been institutionalized worldwide, but detailed research on the requirements for collaboration between a virtual team and an EOC has been lacking. In Sect. 13.3, a joint study with the project VA4VGI-2 (Chap. 6) was presented. From the results, it can be concluded that an effective integration of digital volunteers and their analytical skills into established disaster management structures is achievable, although specific requirements have to be considered.

The analytical value for situational awareness and decision-making of EOC members during a specific disaster management situation could be highlighted in Sect. 13.4. In conclusion, the section shows that VOST analysts were able to identify, verify, and categorize a large amount of disaster-related information from eight different social media platforms in various formats. During the 2021 flood, this information contributed to an expanded situational awareness of the decisionmakers in an Emergency Operation Center and made it possible, for example, to conduct crisis communication in a more people-centered manner. Thus, it could be shown that during dynamic disaster situations, the involvement of digital volunteers was of major relevance for the operation management. Furthermore, it could be shown that the virtual structures of a VOST are able to effectively support an EOC even in an acute disaster situation. Nevertheless, analysts and decision-makers in such situations are accompanied by conditions that can bias results and decisions. In order to investigate data and cognitive biases in the work of digital volunteers and decision-makers with VGI, a paper was presented in Chap. 5 in which this aspect was investigated in a three-level experiment. It could be shown that debiasing efforts were not pronounced enough, and thus biased information was considered in decision-making. It can be deduced that in future exercises and trainings, debiasing efforts need to receive more attention in order to ensure the integration of digital volunteers and VGI in the future. This is also accompanied by privacy-aware analytics of social media in disaster situations, where the affected population is particularly vulnerable. To investigate possible methods and an algorithm developed in the project "Privacy Aspects" (Chap. 14) in the use of VOST, a case study was conducted with two VOST (Sect. 13.6). Thereby, it could be examined that the use of privacy-aware methods and algorithms in operational use is reasonable and possible.

Overall, it could be shown that digital volunteers make a significant contribution during disaster management, in which they effectively process their analytical results and VGI for the management of disaster situations. However, human limitations and privacy-aware methods need to receive greater attention in the future, both in research and in practice. In addition, questions remain about how situational pictures will need to be designed by digital volunteers in the future. In a project funded by the "German Federal Office of Civil Protection and Disaster Assistance" called "#sosmap" (Fiedrich 2022), it will be investigated from August 2022 to what extent psychosocial situation pictures can be created for EOC by analyzing social media.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 Volunteered Geographic Information: Interpretation, Visualization and Social Computing (VGIscience, projectnumber: 314672086).

### **References**


Fiedrich F (2022) https://www.buk.uni-wuppertal.de/de/forschung/laufende-projekte/sosmap/


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 14 Protecting Privacy in Volunteered Geographic Information Processing**

**Marc Löchner, Alexander Dunkel, and Dirk Burghardt** 

**Abstract** Social media data is used for analytics, e.g., in science, authorities, or the industry. Privacy is often considered a secondary problem. However, protecting the privacy of social media users is demanded by laws and ethics. In order to prevent subsequent abuse, theft, or public exposure of collected datasets, privacy-aware data processing is crucial. In this chapter, we show a set of concepts to process social media data with social media user's privacy in mind. We present a data storage concept based on the cardinality estimator HyperLogLog to store social media data, so that it is not possible to extract individual items from it, but only to estimate the cardinality of items within a certain set, plus running set operations over multiple sets to extend analytical ranges. Applying this method requires to define the scope of the result before even gathering the data. This prevents the data from being misused for other purposes at a later point in time and thus follows the privacy by design principles. We further show methods to increase privacy through the implementation of abstraction layers. As another additional instrument, we introduce a method to implement filter lists on the incoming data stream. A conclusive case study demonstrates our methods to be protected against adversarial actors.

**Keywords** Privacy · Social media · Data retention · HyperLogLog

### **14.1 Introduction**

Social media services like Twitter or Instagram are used to communicate and share information worldwide, which generates a rich set of data. Since a large part of this data is publicly available, it can be used beyond the features of the social media services itself, especially by third parties.

Technische Universität Dresden, Dresden, Germany

e-mail: marc.loechner@tu-dresden.de; alexander.dunkel@tu-dresden.de; dirk.burghardt@tu-dresden.de

M. Löchner (O) · A. Dunkel · D. Burghardt

D. Burghardt et al. (eds.), *Volunteered Geographic Information*, https://doi.org/10.1007/978-3-031-35374-1\_14

The main problem with social media data utilization for applications other than their dedicated use case is that explicit consent from the social media user is usually missing. While most users are aware that their content is publicly available on the Internet, they do not assume that data is frequently recycled for other purposes such as scientific, commercial, or administrative use (Boyd and Crawford 2012). Accordingly, this demands an exceptional strong focus on their privacy.

In contrast to other environments, the data to be protected is already public (Williams et al. 2017). In the view of third parties, that data can be utilized for any purpose, including those that oppose the user's interest (Zhou et al. 2008). But data can also be used with good intentions (Daly et al. 2019), whereas "good" could be defined by "in the user's Interest." For example, social media has shown a valuable source of information in crisis mapping, emergency response, or public planning (Fiedrich and Fathi 2021; Dunkel 2021).

In order to support ongoing development of positive use cases, scientists need to respect and actively protect social media users' privacy. Scientists need to take explicit control over data that they expose and prevent accidental disclosures.

An approach to support the adoption of accidental disclosure prevention techniques is to *prevent* the gathering of privacy-relevant data in the first place. We specifically aim at providing methods for the use of social media data following the *privacy-by-design* principles (Cavoukian et al. 2009).

In this chapter, we show a set of concepts that enable to process social media data with social media user's privacy in mind. We present a data storage concept that implements an algorithm called HyperLogLog (HLL) (Flajolet et al. 2007) to not store raw social media data but only statistics about their occurrence. We further show that while losing precision of the data, privacy can even be increased by applying multiple layers of abstraction on the data. For a contextdependent treatment of privacy and to cover edge cases, we further introduce a model to implement filter lists on the incoming data stream. A conclusive case study demonstrates our methods to be protected against adversarial actors.

### **14.2 Fundamentals**

### *14.2.1 Related Work*

Issues and challenges related to privacy arise everywhere, where social media data is involved. Following up, we link to research projects within this book, which are primarily based on processing social media data and therefore our research is relevant for.

The EVA-VGI project (see Chap. 12) studies the heterogeneity, quality, subjectivity, spatial resolution, and temporal relevance of geo-referenced social media data. Focusing on the integration of spatial, temporal, topical, and social dimensions combined with an explicit link between events and reactions, they present conceptual approaches and methods that enable a privacy-aware visual analysis of VGI in general and geo-social media data in particular. The project has taken advantage of the results of our research by implementing HLL on datasets related to their publications (Dunkel et al. 2020).

Similarly, the VA4VGI project (see Chap. 6) describes how geo-aware filtering and anomaly detection on geo-referenced social media data can be a significant information source for stakeholders in journalism, urban planning, or disaster management. They present tag maps that provide overview-first, details-on-demand, visual summaries of large amounts of social media data over time and thus visualize their temporal evolution.

Closely related to the former is the DVCHA project (see Chap. 13). The overall objective of their research is to study the implications of social media data for the efficiency of disaster management. Focusing on so-called Virtual Operations Support Teams (VOST), their research addresses motivation, success factors, and improvement of distributed decision making processes based on disaster-related real-time social media data.

In a collaboration with the DVCHA project, we carried out a case study, in which we explored the deployment of HLL into disaster management processes (see Sect. 13.6). We developed and conducted a focus group discussion with VOST members, where we identified challenges and opportunities of working with HLL and compared the process with conventional techniques (Löchner et al. 2020). Findings showed that deploying HLL in the data acquisition process of VOST operations will not distract their data analysis process. Instead, several benefits, such as improved working with huge datasets, may contribute to a more widespread use and adoption of the presented technique, which provides a basis for a better integration of privacy considerations in disaster management.

### *14.2.2 On Privacy Aspects*

From a generic point of view, *privacy* is the freedom to fully or partially retreat oneself in a self-controlled manner. There are always multiple forms of definitions of the term *privacy*, stretching from personal to a cultural point of views (Solove 2008). It is important to distinguish between the *right to privacy* and the *concept of privacy*  (Hildebrandt 2006). The *right* is clearly formed by laws, whereas the *concept* is rather vaguely determined based on subjectively perceived personal values. Privacy is often sacrificed voluntarily in exchange for perceived benefits and sometimes violated by others, either intentionally or accidentally (Reyman 2013).

*Privacy by design* as a set of principles is a relevant objective in the conception of applications in general. As Cavoukian et al. (2009) state, privacy must be approached from a design-thinking perspective. It must be incorporated in technologies not as an optional on-top feature but as a fundamental characteristic of organizational priorities, project objectives, design processes, and planning

operations. Concepts built upon these principles are hard to break in terms of privacy violations.

A contemporary method to protect data has been presented as *differential privacy*  (DP) (Dwork 2008) and adopted frequently (Desfontaines and Pejó 2020). DP adds certain amounts of random data to a set of real data set, in order to make real data indistinguishable from the random data and thus protect it from being identified as such. However, DP still requires the original data to be available to process. Furthermore, DP requires developing new concepts and models for each data set, which is very inefficient when dealing with really large sets of data.

In the geo-community, there is a wide range of concepts known to protect privacy in terms of location data. Some techniques are based on anonymity, e.g., *mix zones* (Beresford and Stajano 2003) or *k-anonymity* (Ciriani et al. 2007). Others are based on obfuscation, e.g., *imprecision* (Duckham and Kulik 2005), or policy like *restriction* (Hauser and Kabatnik 2001). All of these approaches require the possession of original raw data. Processed data sets are unable to be updated with subsequent data, which requires reprocessing of the entire data set upon updates. This is very inefficient when dealing with large amounts of social media data.

In the context of social media data, the consideration of privacy, ethics, and legal issues should play an important role. The statement "Privacy of user data and information should be considered in the initial design of VGI systems" (Mooney et al. 2017) can be extended to platforms and methods for the analysis and further processing of social media data in general.

Kounadi et al. (2018) discuss privacy threats related to inference attacks on *geosocial network data*. They provide protection recommendations for sharing these sorts of data and publishing resulting visualizations. Keßler and McKenzie (2018) proposed in a total of 21 theses to reflect on the current state of *geoprivacy* from a technological, ethical, legal, and educational perspective. They provide various examples of how common it has become to share location and how it can be used and misused.

### *14.2.3 Data Retention*

Processing social media data is to a relevant extent based on operating analytics software, which provides automatic analysis on gathered social media data stored in local databases. Their user interfaces take input to be crawled for in the stored data and return, for example, statistics of post occurrences in any context. Depending on the situation, only parts of that information may be relevant (see Sect. 14.3.1). Still, the entirety of every post has been and remains stored in local databases.

This means that if a data item is being deleted on the site of the corresponding social media service, it still resides at the place where it has been downloaded to. Technically, that practice meets the requirements to be termed *data retention*. We define this term as such: preserving data for an indefinite time period with no specific

purpose for any individual data item but with the assumption to make use of the information in entirety at a later point in time.

The term is being discussed in the public mostly in conjunction with telecommunication analysis and surveillance. European Digital Rights public interest group states that "data retention practices interfere with the right to privacy at two levels: at the level of retention of data, and at the level of subsequent access to that data by law enforcement" (Rucz and Kloosterboer 2020).

We introduce the term in a broader and more technical environment to emphasize the explosive nature of recklessly dealing with personal data, which social media data is (European Commission 2018). According to the above definition, the term is valid for any case of storing and retending personal data in stocks. Wright et al. (2020) use it even to describe any storage of data underlying scientific studies.

Owning a set of data requires great responsibility in terms of data security. It opens up risks of possible abuse, theft, or accidental public exposure (Miller 2020). Breaking it down to a simple rule, it can be stated that "the more data you have, the more data you can lose" (Guillou and Portner 2020).

Beyond governmental agencies and law enforcement, also commercial players, journalists, researchers, or nonprofit organizations face challenges when storing individual-related data like those from social media. Stieglitz et al. (2018) discovered that the volume of data was most often cited as a challenge by researchers. Wang and Ye (2018) summarize common techniques for social media analytics in natural disaster management and coin the term *mining* for that matter.

Furthermore, the social impact of misusing large sets of data is well-known. The Cambridge Analytica scandal is one of the examples that show how massive data sets can be alienated (Berghel 2018). The company used personal information from millions of Facebook users without their consent to derive information about their political points of view and then microtarget personally tailored political advertisements to them. They claimed to have a major impact on the 2016 US presidential election, which can be regarded as a threat to democratic legitimacy (Dowling 2022).

Users of social media services start to realize that all of their data is not only publicly available but made use of by third parties. Data retention drives forgetfulness as a social concept at risk (Blanchette and Johnson 2002). The *chilling effect*, people slowly increasing self-discipline and restriction of their communication behavior due to becoming aware of digital surveillance, and panopticism (Manokha 2018; Büchi et al. 2022) are described consequences.

Nevertheless, the huge amount of data raised by social media services being a tremendous privacy thread is only one side of the coin. Large sets of social media data can also be beneficial for the public. The work of humanitarian organizations depends on publicly available data that is authentic and relevant. Especially, VOSTs rely on the availability of public social media data (Kuner and Marelli 2020); therefore, its prosperity must be preserved. A gradual retreat of users from social media services in favor of closed, "antisocial" messaging groups (Leetaru 2019; Wilson 2020) must be prevented.

### *14.2.4 HyperLogLog*

One of our contributions to this issue presented in this chapter is based on storing data using an algorithm called *HyperLogLog* (HLL). This algorithm is a cardinality estimator first introduced by Flajolet et al. (2007).

Its fundamental strength is the ability to *estimate* the distinct count of a multiset (cardinality) and store it in a data structure, which does not allow the extraction of individual elements. This is done by storing only hashes of data items instead of the original raw data and identifying them by counting leading zeros of the binary representation of their hashes. The algorithm is able to predict how many distinct items have been added to the HLL set, based on the maximum number of leading zeros observed. This makes processing data using HLL very efficient in terms of processing time and storage space. It is not possible to search for prior unknown information in an HLL set, for example, the usernames of all the posts that have been gathered. This makes implementing HLL follow the *privacy by design* principle.

### **14.3 Concepts**

### *14.3.1 Privacy-Aware Storage*

The key aspect for our approach is to make it impossible to relate to the original social media data from a given processed data set (*privacy by design*). Therefore, we propose to utilize the cardinality estimation algorithm HyperLogLog (HLL) described in Sect. 14.2.4 to gathered store social media data.

To provide a minimal example of the process, we introduce a scenario, in which the difference in spatial occurrences of social media posts including a certain hashtag should be visualized. The result should be a choropleth map of areas according to the amount of post occurrences within that area (see Fig. 14.1). Areas are defined by a *GeoHash*, a hierarchical grid-like geocode identification concept (Niemeyer 2008; Morton 1966).

To store the occurrence of posts in an area, it is only necessary to *count* the number of distinct occurring posts, their *cardinality*. Reflecting, this unveils that storing the entirety of a social media post is unnecessary. It is sufficient to memorize its unique identifier (ID), which has been assigned by the social media service it originates from.

However, storing the ID in clear text in the database will allow identifying the post and thus the author of a post later on. The characteristics of HLL in turn enable to store data like the ID of a post in a set without the ability to regain it without prior knowledge about its existence in the set. Storing post IDs in an HLL set related to their geohash will only reveal their cardinality. Posts that occur later in the stream and match the same geohash will be added to this HLL set, which increases its cardinality by one for each new post. The geohash itself representing the post's

**Fig. 14.1** Example of a map showing areas with different occurrences of posts containing #omicron hashtag on Twitter from January through March 2022. Map data: OpenStreetMap contributors Color distribution: Head/Tail Breaks (Jiang 2013)

**Table 14.1** Exemplary database table structure showing four records (each stands for one area represented by the geohash) and the corresponding HLL set containing the post IDs


originating area is stored as the index of the database record (see Table 14.1). The resulting HLL data structure represents all posts matching a certain term from a certain area, while it is impossible to derive the post IDs back from it.

Using HLL, we do *not* store the post IDs itself but calculate hashes from them and store them in an array of counters that represent the set of post IDs (see

Sect. 14.2.4). Table 14.1 shows an example database table structure with geohash values representing an area and the corresponding HLL set representing the IDs of posts that occurred in that area.

Having a database with geohashes and their corresponding HLL set as shown exemplarily in Table 14.1, it is possible to compute the cardinality of the HLL set and thus determine the number of posts in each area. The result of such a computation could as well be achieved by just incrementing an integer per seen post ID and storing the sum instead of an HLL set. The significance of using the HLL algorithm instead is that it provides the opportunity to perform the set operations *union* and *intersection* on the HLL sets.

This can be useful for combinations of individual data sets. Different sets of gathered posts, each relating to certain terms, can be combined to monitor a more specific scenario.

A social media post as a data item can be broken down into its spatial, temporal, topical, and social components, each of which can be stored as separate HLL sets. As shown in Fig. 14.2, this can lead to a number of different HLL sets, each containing the post IDs of posts matching different criteria: involving a certain topic, originating in a certain area or in a certain time period, or authored by a user of a certain group.

Using the topical facet exemplarily in a disaster management scenario, an intersection of a set containing posts with the terms fire and one containing forest posts could lead more precisely to disaster incidents than both terms on their own. It still makes sense to monitor the terms individually in the first place because a combination of fire and accident can lead to other and different disaster incidents, as well as forest and accident does.

Furthermore, different terms could have the same meaning, for example, flood, high tide, wave, and tsunami could all refer to the same situation. So, a union of HLL sets on posts over these terms can provide more comprehensive information about disasters. Likewise, terms in different languages could also be

**Fig. 14.2** Examples of HLL sets derived from the four facets of a social media post

monitored in combination. This, for example, enables VOSTs (see Sect. 14.2.1 and Chap. 13) to monitor larger, multiple languages involving areas like border triangles or including smaller countries like Benelux or the Baltics.

This concept provides privacy by design because it does not store the post IDs in a readable way. It only stores a statistical derivative resulting from the characteristics of the HLL algorithm (see Sect. 14.2.4) and complies to the *privacy by design*  principles. The following subsections cover how it can be extended even further by applying extended concepts to adjust the level of privacy protection.

### *14.3.2 Abstraction Layers*

The concept of *abstraction* has been widely used in the geo-community to visualize spatial information scale dependent on different degrees of detail (Burghardt et al. 2016). We re-dedicate these generalization methods from geovisualization to privacy protection.

Herein, we present a model to improve privacy for social media users, in particular in the context of data collection. It aims at withdrawing precision from the data by deriving multiple abstraction layers of it. Applying these layers, we are able to quantitatively describe different levels of privacy. By deploying methods of *generalization* and thus decreasing precision of the data, we can increase privacy, and vice versa.

Figure 14.3 shows a visual representation of this model, following the fourfacet representation to characterize a social media post, introduced by Dunkel et al. (2019). The bottom layer in each facet is formed by the original data. Each following layer represents an increase in privacy protection for the user. This way, we have the ability to adjust the level of detail of the data in a fine-grained and context-dependent way. Each of the layers is described in detail in the following subsections.

### **14.3.2.1 Spatial Facet**

In the spatial facet, the original data is usually represented by a *coordinate* in latitude and longitude or a tiny area surrounding that coordinate. A first abstraction from it

**Fig. 14.3** Abstraction layers for each facet

can be an arbitrarily named *place* that includes the coordinate, e.g., a market square or a park. The next abstraction could be an administrative or functional *region* or other territories enclosing the place, e.g., a city or metropolitan area, county, or state. Cities can also be regarded as intermediate layers below regions. The next abstraction layer could be a *country* or another even broader defined, e.g., natural, political, administrative, or religious region.

When applying this model to the HLL-based storage concept presented in Sect. 14.3.1, it is crucial to note that even the lowest layer needs to be an area to be able to count multiple posts within it. If using a point coordinate instead, chances tend toward zero for multiple posts hitting that exact coordinate. An alternative approach with clustering techniques would be necessary there.

In an application implementing this model, the database index would be the geohash, place, city, or country, depending on the layer of abstraction. The corresponding HLL sets include the hashed post IDs of posts that originate in that respective area. An appropriate visualization of that data would be a map showing visually differentiable areas (see, e.g., Fig. 14.1).

### **14.3.2.2 Temporal Facet**

Abstractions in the temporal facet are clearly defined by common time units. The basic layer is the timestamp of the publication of a post, abstracted as the day of the publication or a month or even a year.

In analogy to the spatial facet, when implementing this model, the database index must be a time range rather than a point in time, to be able to associate multiple matching posts with it. To visualize just the temporal facet, a timeline is the preferred graphical representation. Usually, this facet requires to also filter for a certain topic first, to prevent just visualizing *all* social media posts occurring within a certain time period.

### **14.3.2.3 Topical Facet**

The topical facet is characterized by applying topic modeling techniques (Kherwa and Bansal 2019) to find abstractions of terms. The basic layer can be defined by the *terms*, the original content in the post, e.g., The River Elbe has burst its banks in Dresden today. A more general layer can be rendered in the overall *subject* of the post, e.g., Dresden flood. Another abstraction layer can be the *domain* of the Natural Disaster.

In an implementation, the database index will represent these terms, subjects, or domains, and the corresponding HLL sets hold the associated post IDs. It is not trivial to generate more generic terms for specific posts, but topic modeling techniques can help with that task. A word cloud can visualize these terms in different sizes (Hearst et al. 2019) depending on the cardinality of posts associated with them.

### **14.3.2.4 Social Facet**

The social facet relates to users of social media. When running analyses on this object, it is crucial to note that we are switching the focus. While we are trying to avoid storing personal data around an object in the other facets, here we want to achieve the opposite: counting appearances of data that relates to a single person or a group. An exemplary analysis would be to count the number of posts per user or group. In this scenario, it is especially useful to apply abstraction layers in order to gain privacy for a single user. In analogy to the temporal facet, it is also useful to filter posts for a certain topic in beforehand.

In the basic layer, the creator of a social media post, the *user*, is targeted. The database index would be the username or id, and the corresponding HLL set consists of the post IDs. Combining, e.g., all of the user account's *followers* to a group and regarding posts of all of them could apply as the first abstracted layer. Through network analysis (Maireder et al. 2014), we can define *clusters* to be objects in a second step of abstraction. Another layer of abstraction could be the consideration of different social network *platforms* (Cosenza 2022), which have a distinct user base, which might originate from different cultural backgrounds, e.g., Twitter, Instagram, WeChat, and VKontakte.

Implementing groups of users is more challenging than in other layers. The group of followers in the second layers consists of a list of user IDs eventually, which could again be stored in an HLL set and get an ID assigned to. IDs of multiple groups are then stored in HLL sets and can be combined or contrasted with other group IDs, defining clusters accordingly.

All the described layers are only examples and can be replaced by other structures. Also, the number of abstraction layers can be chosen arbitrarily, as the granularity of the data can change.

It should be noted that abstraction layers do not only gain privacy for the social media users, but they also diminish the precision of the data. This makes applying abstraction to social media data be a compromise between privacy and precision.

### *14.3.3 Filter Lists*

Storing social media data using HLL to be processed in analytics software forms the basement of privacy protection. Applying generalization methods as described in Sect. 14.3.2 provides further opportunities to adjust data precision. However, there are edge cases that require special handling. For instance, even the existence of a single specific term, a specific time, location, etc. may provide hints that can be repurposed or combined with other (e.g., external) information to compromise user privacy in certain situations. Following the principle that different data must be treated differently (Almås et al. 2018), we seek to contribute to a systematic approach to fine-tuning privacy preservation and analytical flexibility.

There are two main approaches to adjusting privacy—utility trade-offs with HLL and abstraction layers. First,*stop and allow lists* can be used during the generation of the HLL set to enable context-dependent data protection through filtering. Second, *threshold values* can be defined flexible to influence the granularity of the HLL set indexes and, based on that, the degree of anonymity. Table 14.2 lists examples for each context in the framework, where accuracy (utility) may be traded in favor of a higher degree of privacy, similar to the broader data sensitivity spectrum proposed by Rumbold and Pierscionek (2018).

Whether stop lists or allow lists are preferable depends on the context of application. Allow lists are more restrictive and require less effort from the analysts, by automatically excluding all terms, times, locations, etc. that are not explicitly considered beforehand. For the spatial context, for instance, unless worldwide data is required, allow lists are frequently used, to limit data collection to a specific area, region, place, etc. Conversely, stop lists can be added selectively on top, to exclude places that are known to be related to vulnerable groups or sensitive contexts (e.g., hospitals, party locations). Similarly, filter lists for specific terms, hashtags, or emoji can be defined for the topical context.

For topical contexts, the openness of possible references complicates defining holistic stop lists ahead of time. As an example, Fig. 14.4 shows a map generated from terms, hashtags, and emoji used on the social media services Twitter, Flickr, and Instagram at a public vantage point and park. The syringe emoji could indicate drug use, which may lead to further onsite investigation by, e.g., authorities, with potential unexpected consequences of the user perspective. Obviously, this is an edge case for social-individual privacy because both positive (society) and negative (user) consequences are imaginable. One solution would be to assign the specific emoji to a thematic broader *emoji class*, e.g., the umbrella group of "medical emoji"1 (see Sect. 14.3.2). As another solution, the syringe emoji could be classified ahead of time, for increased sensitivity, leading to, e.g., a greater spatial granularity reduction on data ingestion, or exclusion, preventing having to deal with this ambiguous ethical edge case in advance.

Lastly, as the second approach to enable systematic user privacy with HLL, threshold values may be defined, similar to what is known from other disciplines, such as the HIPAA Privacy Rules for health data publications (Malin et al. 2011) or census statistics (Szibalski 2007, p.142). Allshouse et al. (2010), for instance, use geomasking in combination with k-anonymity, to define a lower threshold of *k* = 5 (people), which is a rule of thumb size in geoprivacy (Kamp et al. 2013). Comparable best-practice threshold values could be defined for HLL sets of different sizes, e.g., suggestions by Desfontaines et al. (2019), with smaller sets indicating lesser privacy protection due to a scarce context collapse. In the spatial context, this could be implemented by using quadtrees, for example, to split and aggregated social data into sub-sections (quads), based on pre-defined thresholds, where the resolution is automatically decreased for areas of lesser data density.

<sup>1</sup> Unicode Consortium, unicode.org/emoji/charts-13.0/full-emoji-list.html#medical.


**Fig. 14.4** A thematically sensitive emoji on drug use at selected locations

### **14.4 Case Study**

Even though different implementations of HLL exist, all share a number of basic steps. At the core, the binary representation of any given character string is divided into *buckets*, for which the number of leading zeroes is counted (see Sect. 14.2.4). Because any given character string is first randomized, it is possible to predict how many distinct items must have been added to a given HLL set, based on the maximum number of leading zeroes observed. In other words, if multiple items are added to an HLL set, only the highest number of leading zeroes per bucket needs to be memorized. As a result, the cardinality estimation will only approximate counts.

As a side effect, there is a limited ability to check whether a specific user or ID has been added to a HLL set. In an adversarial situation, Desfontaines et al. (2019) refer to such a check as an *intersection attack*. Intersection attacks first require obtaining the hash of a targeted person or ID and then adding this hash to an HLL set. If the HLL set changes, an adversarial may be able to increase their initial suspicion by a certain degree. To better illustrate intersection attacks and how and under which circumstances the privacy of a user could become compromised in the presented two-component research setup, we briefly introduce two examples.

Alex is included in the YFCC100M dataset (Thomee et al. 2016) because he published 289 photos under Creative Commons Licenses between 2013 and 2014 on Flickr; 120 of these photos are geotagged. Given this information, it will be relatively easy to re-identify Alex. Sandy is an internal adversary. She could be someone working at an analytics service with full access to the database. Robert, on the other hand, is someone representing an external adversary, with access only to the published dataset. In the first example, the privacy of Alex is compromised if Sandy could increase or confirm her suspicion that Alex was not at his workplace in Berlin on 9 May 2012. In the second example, the privacy of Alex is compromised if Robert could increase or confirm his suspicion that Alex was indeed at least once at a specific location, e.g., contrary to what Alex claims. Finally, Alex could be

someone who voluntarily contributed his pictures to the conceived analytics service or altruistically published Creative Commons photos on Flickr.

Consider that, at the moment of contribution, Alex may not have thought of the consequences for his privacy but later realized his mistake. With the use of raw data, even removing any compromising data from Flickr, this change would need to be reflected in any subsequent data collection, such as in the analytics service or the YFCC100M dataset. This is either impractical or impossible. The question is, therefore, whether it is possible to replace raw data workflows with a privacy-aware visualization pipeline, without significantly reducing utility.

Several factors must coincide for intersection attacks to be successful. Firstly, an adversarial must have access to HLL sets. In our system model, this can either be an internal adversary (Sandy), having direct access to the database, or an external adversary (Robert), having access only to published data. Furthermore, an adversary must be able to either compute hashes for a given target user or somehow gain access to a computed HLL set for the given user. The former is only possible if the secret key is compromised. The latter appears conceivable, in our example, if the adversary has some prior knowledge about other locations visited by a target user, and if the HLL sets of these locations ideally contain only the target user or a few other users. In the following, we explore this worst-case scenario, where both Sandy and Robert somehow got hold of an HLL set that only contains Alex's computed hashes.

For Sandy, this means in order to test whether Alex was not in Berlin on 9 May 2012, she either needs Alex's original user ID and the secret key to construct the hash or find another location that has only been visited by Alex on this date. In this unlikely scenario, the result of an intersection attack for all grid cells is shown in Fig. 14.5. Visible in the figure is that a large number of other grid cells show false positives for the intersection test, that is, these HLL sets did not change, even when updated with the particular user day-hash for Alex.

**Fig. 14.5** Evaluation of scenario "Sandy" (Dunkel et al. 2020, CC-BY 4.0)

Since HLL prevents the occurrence of false negatives, and San Francisco is indeed among these locations, the result does include Alex's actual location on 9 May 2012. Depending on the size of the targeted HLL set, Sandy may then increase her suspicion by some degree. In the case of the grid cell for San Francisco, with 209,581 user days, this increase in posterior knowledge may be found to be negligibly small. In other words, even if there was no post from Alex on 9 May 2012, the intersection attack may have produced the same result. In conclusion, even in the worst scenario, having direct access to the database and a compromised secret key, Sandy could not gain any further affirmation.

Similarly, and rather incidentally, the positive grid cell for Berlin does indeed falsely suggest that Alex was in Berlin. This is not surprising given that larger HLL sets have a higher likeliness of showing false positives and Berlin is a highly frequented location. In other words, Alex benefits from the privacy-preserving effect of HLL.

In the second scenario, consider a situation in which Robert may have an a priori suspicion that Alex went to Cabo Verde. Alex, on the other hand, does not want Robert to know that he went surfing without him. Robert knows that Alex is participating in the conceived analytics service and, somehow, gains access to an HLL set containing only one hashed user ID from Alex. The results of the intersection attack for all grid cells are shown in Fig. 14.6. Since only 56 users have been to Cabo Verde in the YFCC100M dataset, the particular bin is not included in the published benchmark data, which is limited by a minimum threshold of 100 users. However, with direct access to the database, Robert could observe that Cabo Verde is among the locations revealed. In this case, Robert may gain some affirmation for his suspicion that Alex was in Cabo Verde. At the same time, a definite answer will not be possible, given the irreversible approximation of the HLL structure. For example, for the same intersection attack, for set sizes below 56

**Fig. 14.6** Evaluation of scenario "Robert" (Dunkel et al. 2020, CC-BY 4.0)

users, there are 14 other grid cells that show false positives, down to 8 users. In other words, even though these HLL sets do not change when tested, Alex has never been to these locations.

While these two scenarios provide a base to understand how intersection attacks may be executed in a spatial setting, a valid question is how likely successful intersection attacks are overall. To some degree, this depends on questions of security, such as protecting the secret key or managing database access.

Another part is directly related to the distribution of collected data and the number of outliers that are present at each stage of data processing. If data is more clustered, users will generally receive more benefits from the privacy-preserving effects of HLL. This can be quantitatively substantiated with the given dataset (Dunkel et al. 2020).

### **14.5 Conclusion**

The research presented in this chapter introduced a number of approaches to deal with privacy aspects in the process of social media data processing. Social media data is being used as a source of data for wide-ranging projects within and beyond the scope of this book (see Sect. 14.2.1). The relevance of privacy aspects in processing this kind of data and the range of related work are pointed out in Sect. 14.2.2. Furthermore, in Sect. 14.2.3, we discussed our focus on data retention as a potential threat for analysts. We defined the term and explained that we make use of that specific term to emphasize the explosiveness of dealing with personal data.

We showed that it is possible to preserve the privacy of social media users with the major concepts. As a basis for our first concept, we first introduced the cardinality estimation algorithm HyperLogLog in Sect. 14.2.4. In Sect. 14.3.1, the main part of this chapter, we introduced a concept to store social media data in a way that it is not possible to extract individual items from it but only to estimate the cardinality of social media data items within a certain set, plus running set operations over multiple sets to extend analytical ranges. Applying this method requires defining the scope of the result before even gathering the data and thus prevents the data from being misused for other purposes at a later point in time. This follows the *privacy-by-design* principle.

As an extension to the first concept, we proceeded by introducing a concept that is well known in the geographic community, generalization, in Sect. 14.3.2. By defining a number of abstraction layers, it is possible to even more reduce the data to be stored, depending on the required precision. The less precise data is needed, the fewer data needs to be stored. Finally, in Sect. 14.3.3, we explain the conceptual exclusion of edge cases by applying filter lists to the data set.

A closing case study in Sect. 14.4 explains the concept of intersection attacks and shows that under rare circumstances the HyperLogLog technology is vulnerable against them. The case study unveils that the larger the dataset, the less likely are

intersection attacks. Since social media data is usually very large, implementing the HyperLogLog technology is an excellent approach to protect the data from being abused, thieving, or publicly exposed and thus preserves the privacy of social media users.

**Acknowledgments** This research was supported by the German Research Foundation DFG within Priority Research Program 1894 *Volunteered Geographic Information: Interpretation, Visualization and Social Computing* (VGIscience, EVA-VGI, BU 2605/8-2).

### **References**


comparative study of the us and south korea. ISPRS Int J Geo-Inf, 25. https://doi.org/10.3390/ ijgi10010025


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.